LURE: GUIDE

tim@menzies.us
August'17



Learning Lua

Some great on-line resources:

  • Quick start http://tylerneylon.com/a/learn-lua/
  • Read the book.
    • The 4th edition is on Amazon.
    • The 2nd edition (which is still pretty good) is available on-line.

Before Reading This...

Read notes on the LURE LUA coding style. Note, in particular, that some of the X.lua files have Xok.lua demo/test files.


Overview

For a first pass high-level view of the code:

  • contrasts.lua reports deltas in the nodes of the tree built by...
  • sdtree.lua recursively divides the ranges found by...
  • superrange.lua uses the goal variable to combine spurious ranges generated by...
  • range.lua divides numeric ranges into bins of approximately size sqrt(N) found in...
  • tbl.lua stores, in num.lua and sym.lua, summaries of the rows found by...
  • csv.lua converts strings into rows of symbols and numbers.
  • config.lua contains settings that control all the above.

File Groupings

For a more detailed view of this code: as of August 8 2017, the files of LURE divide into the following groups:

base    support  stats   table  learners
------  -------  ------  -----  -----------
config  csv      num     row    contrasts
show    id       range   tbl    sdtree    
tests   lists    sample         superrange
        random   spy            trees
        str      sym
                 sk
                 tiles                     

Base code

The following base code should be assumed to be global across all the rest.

  • config.lua: stores global options in the global variable `the`. Other code can change these options, then reset them to their default values by calling defaults().
  • show.lua: changes LUA's default printer so that printing a table also prints its contents (recursively). To avoid printing very long items, give them a key name starting with `_`.
  • tests.lua: a simple unit test framework.
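
As a rough illustration of the first two bullets, here is a minimal sketch. The option names inside `the` are hypothetical (they are not taken from config.lua), and `show` here only returns a string rather than replacing LUA's printer.

```lua
-- Hypothetical option names, for illustration only.
function defaults()
  the = {
    sample = { max = 256 },  -- assumed: most values a sample may keep
    ignore = "?",            -- assumed: marker for skipped columns
  }
  return the
end

defaults()            -- install the global `the`
the.sample.max = 32   -- other code may change a setting...
defaults()            -- ...and later restore every default

-- Render a table recursively, skipping keys that start with "_".
local function show(x, seen)
  if type(x) ~= "table" then return tostring(x) end
  seen = seen or {}
  if seen[x] then return "..." end  -- tables already rendered (e.g. cycles) are elided
  seen[x] = true
  local keys = {}
  for k in pairs(x) do
    if not (type(k) == "string" and k:sub(1, 1) == "_") then
      keys[#keys + 1] = k
    end
  end
  -- sort keys so the output order is stable
  table.sort(keys, function(a, b) return tostring(a) < tostring(b) end)
  local parts = {}
  for _, k in ipairs(keys) do
    parts[#parts + 1] = tostring(k) .. "=" .. show(x[k], seen)
  end
  return "{" .. table.concat(parts, " ") .. "}"
end
```

With this sketch, `show({a = 1, _big = {1, 2, 3}})` renders only the `a` entry, since `_big` starts with an underscore.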

Support code

Simple stand alone utilities.

  • csv.lua: reads comma-separated values from strings or files, passing each row found to a function.
  • id.lua: generates unique ids.
  • lists.lua: basic list utilities.
  • random.lua: random number generation that is stable across different platforms. This is a nice place to see how a basic LUA module is formed.
  • str.lua: basic string routines: print lists of items, replace characters, etc.
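
The "pass each row to a function" idea in csv.lua can be sketched as follows. This is an assumption-laden toy, not the real csv.lua: the names `cells` and `csv` are hypothetical, it reads from a string only, and it does not handle quoted fields.

```lua
-- Split one line on commas; coerce cells to numbers where possible.
local function cells(line)
  local row = {}
  for cell in (line .. ","):gmatch("([^,]*),") do
    cell = cell:match("^%s*(.-)%s*$")       -- trim surrounding whitespace
    row[#row + 1] = tonumber(cell) or cell  -- number if it parses, else string
  end
  return row
end

-- Call fn once per non-empty line of text.
local function csv(text, fn)
  for line in text:gmatch("[^\r\n]+") do
    fn(cells(line))
  end
end

-- Usage: collect every row into a list.
local rows = {}
csv("name,$age\nann,30\nbob, 41\n", function(row) rows[#rows + 1] = row end)
```

Passing rows to a callback means the caller never needs the whole file in memory, which suits incremental readers such as the tables described later.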

Stats code

Code for studying single distributions (and for studying multiple distributions, see table, below).

  • num.lua: watches a stream of numbers, summarized as Gaussians
  • range.lua: divides a list of numbers into a set of breaks. Note that this uses a dumb unsupervised approach. For a smarter approach, which reflects over the class variable, see superrange.lua.
  • sample.lua: watches a stream of numbers, keeps a random sample, never keeps more than (say) a few hundred values.
  • spy.lua: watches numbers and, every so often, prints out the current stats.
  • sym.lua: watches a stream of symbols
  • sk.lua: ranks a list of samples, using a recursive top-down bi-clustering algorithm;
  • tiles.lua: divides a table of numbers into percentiles. Not very smart (for a smarter approach, see range).
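
To make the unsupervised idea behind range.lua concrete, here is a minimal sketch that breaks numbers into bins of about sqrt(N) items (the bin size mentioned in the overview). The function name `ranges`, the bin fields, and the rule of never splitting equal values across bins are assumptions for illustration.

```lua
-- Split numbers into bins holding about sqrt(N) items each.
local function ranges(nums)
  local sorted = {}
  for i, x in ipairs(nums) do sorted[i] = x end  -- copy, so the input is not reordered
  table.sort(sorted)
  local size = math.max(1, math.floor(math.sqrt(#sorted)))
  local out, bin = {}, nil
  for _, x in ipairs(sorted) do
    -- start a new bin once the current one is full,
    -- but keep equal values together in one bin
    if bin == nil or (bin.n >= size and x > bin.hi) then
      bin = { lo = x, hi = x, n = 0 }
      out[#out + 1] = bin
    end
    bin.hi = x
    bin.n = bin.n + 1
  end
  return out
end
```

Note that the last bin may end up smaller than the others; a fuller version would probably merge it into its neighbour.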

Note also that the "watcher" modules (num, sym, spy and sample) all have a very similar protocol:

  • create: make new watcher
  • update: add an item to a watcher
  • updates: add many items to a watcher, optionally filtered through some function f. Returns a new watcher unless an optional third argument is supplied (in which case, the items are added to this third arg).
  • watch: returns a new watcher and a convenience function for adding values to this watcher.
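
The four-function protocol above might look like this for a num-style watcher. This is a sketch only: the field names (`n`, `mu`, `m2`, `sd`) and the use of Welford's incremental method are assumptions, not necessarily what num.lua does, and `f` is treated here as a function that maps each item before it is added.

```lua
local Num = {}

-- create: make a new watcher
function Num.create()
  return { n = 0, mu = 0, m2 = 0, sd = 0 }
end

-- update: add one item (Welford's incremental mean and standard deviation)
function Num.update(i, x)
  i.n = i.n + 1
  local d = x - i.mu
  i.mu = i.mu + d / i.n
  i.m2 = i.m2 + d * (x - i.mu)
  if i.n > 1 then i.sd = math.sqrt(i.m2 / (i.n - 1)) end
  return x
end

-- updates: add many items, passed through f; reuse watcher i if supplied
function Num.updates(items, f, i)
  i = i or Num.create()
  f = f or function(x) return x end
  for _, x in ipairs(items) do Num.update(i, f(x)) end
  return i
end

-- watch: return a new watcher plus a function that adds values to it
function Num.watch()
  local i = Num.create()
  return i, function(x) return Num.update(i, x) end
end
```

For example, `local w, add = Num.watch()` followed by `add(10)` and `add(20)` leaves `w.n` at 2 and `w.mu` at 15.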

Another shared protocol is between num and sym:

  • distance: the distance between two items (if one or both are the unknown symbol, this reacts appropriately).
  • norm (called by distance): reduces numbers to the range 0..1, using min..max (for syms, this function just returns the value).
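
One way the distance/norm pair could work is sketched below. The text does not say how unknowns are handled, so the rule used here (assume the worst case, i.e. the largest possible distance) is an assumption, as are the function names, the `lo`/`hi` fields, and the use of `"?"` as the unknown symbol.

```lua
local UNKNOWN = "?"  -- assumed marker for missing values

-- norm for numbers: map x into 0..1 using the column's lo..hi
local function numNorm(num, x)
  if x == UNKNOWN then return x end
  return (x - num.lo) / (num.hi - num.lo + 1e-32)  -- tiny term avoids divide-by-zero
end

-- distance for numbers: normalise, and assume the worst for unknowns
local function numDistance(num, x, y)
  if x == UNKNOWN and y == UNKNOWN then return 1 end
  if x == UNKNOWN then
    y = numNorm(num, y); x = (y < 0.5) and 1 or 0
  elseif y == UNKNOWN then
    x = numNorm(num, x); y = (x < 0.5) and 1 or 0
  else
    x, y = numNorm(num, x), numNorm(num, y)
  end
  return math.abs(x - y)
end

-- norm for symbols: just return the value
local function symNorm(sym, x) return x end

-- distance for symbols: 0 if same, 1 if different or unknown
local function symDistance(sym, x, y)
  if x == UNKNOWN or y == UNKNOWN then return 1 end
  return (symNorm(sym, x) == symNorm(sym, y)) and 0 or 1
end
```

Because both kinds of column answer the same two calls, a row-level distance can loop over columns without caring which are numeric and which are symbolic.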

Also, we can test if two num or two sample distributions are statistically the same.

  • For nums, we use parametric Gaussian effect size and significance tests;
  • For samples, we use non-parametric effect size and significance tests (Scott-Knott in the sk.lua file, bootstrap, and Cliff's delta).
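
Of the non-parametric tests named above, Cliff's delta is the simplest to sketch. The version below is the textbook definition, not a copy of LURE's code; the function name is hypothetical and no "small effect" threshold is applied (that choice is left to the caller).

```lua
-- Cliff's delta: over all pairs (x, y), the fraction where x > y
-- minus the fraction where x < y. Ranges from -1 to 1; 0 means no difference.
local function cliffsDelta(xs, ys)
  local gt, lt = 0, 0
  for _, x in ipairs(xs) do
    for _, y in ipairs(ys) do
      if x > y then
        gt = gt + 1
      elseif x < y then
        lt = lt + 1
      end
    end
  end
  return (gt - lt) / (#xs * #ys)
end
```

This brute-force form costs one comparison per pair, which is affordable when samples are capped at a few hundred values.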

Table code

One of my core data structures is tbl (table). Such tables are aggregations of rows, nums, and syms.

Tbls are a place to store rows of data.

  • When data comes in from disk, I store it as a tbl.
  • When data in one tbl is divided, the divisions are tbls.
  • When we cluster, each cluster is its own tbl.
  • When we build a dendrogram (a recursive division of data into sub-data, then sub-sub-data, then sub-sub-sub-data, etc.) then each node in that tree is a tbl.

Each Row in a tbl is its own struct. Such Rows handle comparisons between rows (e.g. domination scores, KNN distance measures, Naive Bayes likelihood calculations, etc.).

Each column in a tbl has a header that is a Num or a Sym, and that header maintains a summary of what was seen in that column.

Tbls are incremental readers of rows of data. As rows are found, we can throw them at a table:

  • If this is other than the first row then Tbl assumes it is a row to be stored in the table. As a side-effect of storage, all the column headers are updated.
  • If this is the first row then Tbl assumes it is a header that lists the names and types of each column.
    • If the name contains ?, then Tbl should ignore this column;
    • If the name contains < or >, then the column is categorised as a numeric goal to be minimized or maximized;
    • If the name contains !, then the column is categorised as a symbolic goal, to be used in classification;
    • If the name contains $, then the column is categorised as a numeric independent variable;
    • Otherwise, the column is categorised as a symbolic independent variable.

Note that there is nothing hard-wired in this code about ?<>!$. These can be easily changed in the categories function.
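
The header rules above amount to a small lookup from marker characters to column descriptions. A minimal sketch follows; the table `MARKS`, the function name `categorise`, and the descriptor fields are all hypothetical, and reading < as "less" (minimize) and > as "more" (maximize) is an assumption.

```lua
-- Marker -> column description. Edit this table to change the markers.
local MARKS = {
  { "<", { kind = "num", role = "y", goal = "less" } },   -- assumed: minimize
  { ">", { kind = "num", role = "y", goal = "more" } },   -- assumed: maximize
  { "!", { kind = "sym", role = "y", goal = "klass" } },  -- symbolic goal
  { "$", { kind = "num", role = "x" } },                  -- numeric independent
}

-- Returns nil for ignored columns, else a description of the column.
local function categorise(name)
  if name:find("?", 1, true) then return nil end  -- plain find, no patterns
  for _, pair in ipairs(MARKS) do
    if name:find(pair[1], 1, true) then return pair[2] end
  end
  return { kind = "sym", role = "x" }  -- default: symbolic independent
end
```

Keeping the markers in one table is what makes them easy to change: nothing else needs to know which characters are in use.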

What Tbl does assume is that columns of data can be categorised as:

  • x : the independent columns;
  • y : the dependent columns;
  • all : all columns.

Within all,x,y the columns are further categorised as:

  • nums: the numeric columns;
  • syms: the symbolic columns;
  • cols: all columns.

Note that a column can have multiple categories (see categories). This is done since sometimes we have to (e.g.) process all the numerics together, or process all the independent symbolics together, etc. For an example of a column in multiple categories, a < column is stored in:

  • .all.cols
  • .y.cols
  • .all.nums
  • .goals
  • .less
  • .y.nums

Note that each column gets one, and only one num or sym header structure and that header structure may be stored in multiple categories.
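
The "one header, many categories" point can be sketched as a single table that is filed under several lists. The constructor and helper names below are hypothetical, and the header is a plain stand-in for a real Num.

```lua
-- An empty set of column categories.
local function newCols()
  local function group() return { nums = {}, syms = {}, cols = {} } end
  return { all = group(), x = group(), y = group(), goals = {}, less = {}, more = {} }
end

-- File one header under every list given.
local function file(header, lists)
  for _, list in ipairs(lists) do list[#list + 1] = header end
end

-- A "<" column: one header, stored in the six places listed above.
local function addLessColumn(t, name, pos)
  local header = { name = name, pos = pos, n = 0 }  -- stand-in for a Num
  file(header, { t.all.cols, t.y.cols, t.all.nums, t.goals, t.less, t.y.nums })
  return header
end
```

Since every list holds a reference to the same table, updating the header once (say, when a new row arrives) is visible through every category.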


Learner code

  • Superrange reflects over the structures generated by range to combine adjacent ranges where the distribution of the dependent variable does not change.
  • Sdtree recursively reflects over the superranges to find nested splits to the data that most reduce the variance in the dependent variable.
  • Contrasts comments on the delta between nodes in the tree found by sdtree. No such contrast sets are generated for nodes where the distributions are not statistically different.
  • Trees is a convenience package that batches up contrasts(sdtree(superrange(range(tbl(csv(config))))))
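
To illustrate what "splits that most reduce the variance" can mean, here is a minimal sketch of one level of such a choice. It is an assumption about the general technique, not sdtree.lua itself: each candidate column is given as a list of ranges, each range holding the dependent values of its rows, and the winner is the column with the lowest size-weighted standard deviation.

```lua
-- Sample standard deviation of a list of numbers (Welford's method).
local function sd(ys)
  local n, mu, m2 = 0, 0, 0
  for _, y in ipairs(ys) do
    n = n + 1
    local d = y - mu
    mu = mu + d / n
    m2 = m2 + d * (y - mu)
  end
  return n > 1 and math.sqrt(m2 / (n - 1)) or 0
end

-- Expected sd after a split: the size-weighted mean sd of its parts.
local function expectedSd(parts)
  local total, sum = 0, 0
  for _, ys in ipairs(parts) do
    total = total + #ys
    sum = sum + #ys * sd(ys)
  end
  return sum / total
end

-- splits: { columnName = { {ys in range 1}, {ys in range 2}, ... }, ... }
-- Returns the column whose ranges leave the least expected sd.
local function bestSplit(splits)
  local best, bestName
  for name, parts in pairs(splits) do
    local e = expectedSd(parts)
    if best == nil or e < best then best, bestName = e, name end
  end
  return bestName, best
end
```

A tree builder would apply `bestSplit` to the rows at a node, divide the rows by the winning column's ranges, and then recurse into each division.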


Legal

LURE, Copyright (c) 2017, Tim Menzies All rights reserved, BSD 3-Clause License

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  • Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.