Fork me on GitHub

home | news | discuss | issues | license

LURE: num

Incrementally watch a stream of numbers

tim@menzies.us
August'17

Use this code to incrementally monitor the mean and standard deviation of a stream of numbers. Also, use this code for fast statistical tests (to check if two distributions are difference).

As shown below in the update function, this code uses Welford's incremental sd algorithm (thus avoiding the "catastrophic cancellation of precision" seen with other methods).

Simple usage:

local i  = NUM.create()
for _,x in pairs{9,2,5,4,12,7,8,11,9,3,7,
                 4,12,5,4,10,9,6,9,4} do
   NUM.update(i,x) end
print(x.mu, x,sd)

A shortcut:

local i,addi = NUM.watch()
for _,x in pairs{9,2,5,4,12,7,8,11,9,3,7,
                 4,12,5,4,10,9,6,9,4} do
   addi(x) end
print(i.mu, i.sd)

Another shortcut:

local  i = NUM.updates{9,2,5,4,12,7,8,11,9,3,7,
                       4,12,5,4,10,9,6,9,4} 
print(i.mu, i.sd)

Num also implements parametric tests for:

  • statistically significant difference (ttest) and
  • effect size (hedges),
  • as well as a combination of the two (same) that returns true of two distributions are statistically the same.
    • Here, same means that we cannot find significant differences and non-small effect size changes.

Of course, such parametric tests assume Gaussian distributions (bell-shaped, symmetrical, single peak). For stats tests for non-Gaussian's, see cliffsDelta and bootstrap in sample.lua as well as the Scott-Knot test in sk.lua. Note that those non-parametric tests are (slightly) slower so if all you seek is a quick heuristic for finding same and difference distributions, use Nums.

local the=require "config"

Creation

Create a new watcher.

local function create()
    return {n=0,mu=0,m2=0,sd=0,hi=-1e32,lo=1e32,w=1} end

Update

Update a watcher i with one value x.

local function update(i,x)
  if x ~= the.ignore then 
    i.n = i.n + 1
    if x < i.lo then i.lo = x end
    if x > i.hi then i.hi = x end
    local delta = x - i.mu
    i.mu = i.mu + delta / i.n
    i.m2 = i.m2 + delta * (x - i.mu) 
    if i.n > 1 then 
      i.sd = (i.m2 / (i.n - 1))^0.5 end end 
  return i end

local function about(i)
  return {{pos=i.pos},{txt=i.txt},{n=i.n},{mu=i.mu},
          {sd=i.sd},{lo=i.lo},{hi=i.hi}} end

Handy short cut

local function watch()
  local i = create()
  return i, function (x) return update(i,x) end end

Updates

Update a watcher i with many values from t. Optionally:

  • filter every value in t through some function f.
  • If an filter every value in t through some function f.
  • If an all value is supplied, the update all. Else return a new watcher.
local function updates(t,f,all)
  all = all or create()
  f = f or function (z) return z end
  for _,one in pairs(t) do
    update(all, f(one)) end 
  return all end

Normalization

local function norm(i,x)
  if x==the.ignore then return x end
  return (x - i.lo) / (i.hi - i.lo + 1e-32) end

Like

local function like(i,x,_)
  local var   = i.sd^2
  local denom = (2*math.pi*var)^.5
  local num   =  2.7182818^(-1*(x-i.mu)^2/(2*var))
  return num/(denom + 10^-64)  end
 

Distance

Using a watcher i, work of the distance between two values j and k. If either is unknown, the return the max possible distance. If both are unknown, just return nothing.

Returns two numbers x,y where x is the distance and y is 0,1 depending on whether or not we are returning nothing.

local function distance(i,j,k) 
  if j == the.ignore and k == the.ignore then
    return 0,0 
  elseif  j == the.ignore then
    k = norm(i,k)
    j =  k < 0.5 and 1 or 0
  elseif k == the.ignore then
    j = norm(i,j)
    k = j < 0.5 and 1 or 0
  else
    j,k = norm(i,j), norm(i,k)
  end
  return math.abs(j-k)^2,1 end

Map numbers to a range.

i is a list of pairs {x.label, x.most}. Find x's label with this list.

local function discretize(i,x) 
  if x==the.ignore then return x end
  if not i.bins    then return x end
  for _,b in pairs(i.bins) do
    r = b.label
    if x<=b.most then break end end
  return r end

Parametric significance tests

ttest accepts two nums called i,j.

local function ttest1(df,first,last,crit) 
  if df <= first  then
    return crit[first] 
  elseif  df >= last then 
    return crit[last]
  else
    local n1 = first
    while n1 < last do
      local n2=n1*2
      if df >= n1 and df <= n2 then
        local old,new = crit[n1],crit[n2]
        return old + (new-old) * (df-n1)/(n2-n1) end 
      n1=n1*2 end end end

local  function ttest(i,j) -- Debugged using https://goo.gl/CRl1Bz
  local t  = (i.mu - j.mu) / 
              math.sqrt(math.max(10^-64, i.sd^2/i.n + j.sd^2/j.n ))
  local a  = i.sd^2/i.n
  local b  = j.sd^2/j.n
  local df = (a + b)^2 / (10^-64 + a^2/(i.n-1) + b^2/(j.n - 1))
  local c  = ttest1(math.floor( df + 0.5 ), 
                    the.num.first,
                    the.num.last,
                    the.num.criticals[the.num.conf])
  return math.abs(t) > c end

Parametric Effect Size Test

For an explanation of this code, see equations 2,3,4 and Table 9 of V.B. Kampenes et al., A systematic review of effect size in software engineering experiments, Inform. Softw. Technol. (2007), doi:10.1016/j.infsof.2007.02.015.

For a discussion on why effect size is so important, see E. Kocaguneli, T. Zimmermann, C. Bird, N. Nagappan and T. Menzies, Distributed development considered harmful?, 2013 35th International Conference on Software Engineering (ICSE), San Francisco, CA, 2013, pp. 882-890.

local function hedges(i,j) -- from https://goo.gl/w62iIL
  local nom   = (i.n - 1)*i.sd^2 + (j.n - 1)*j.sd^2
  local denom = (i.n - 1)        + (j.n - 1)
  local sp    = math.sqrt( nom / denom )
  local g     = math.abs(i.mu - j.mu) / sp  
  local c     = 1 - 3.0 / (4*(i.n + j.n - 2) - 1) -- handle small samples
  return g * c > the.num.small end -- Table9, https://goo.gl/jNNCHH says small,medium=0.38,1.0

Statistical difference

Two populations are statistically similar if they differ by less than a trivially small amount; and if they are statistically significantly different.

local function same(i,j)
  return not (hedges(i,j) and ttest(i,j)) end

External interface

return {create=create, watch=watch,
        update=update, updates=updates,
        norm=norm, about=about,
        like=like,
        distance=distance,
        same=same, ttest=ttest, hedges=hedges}

Legal

LURE, Copyright (c) 2017, Tim Menzies All rights reserved, BSD 3-Clause License

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  • Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.