tim@menzies.us
August'17
Learning Lua
Some great on-line resources:
- Quick start http://tylerneylon.com/a/learn-lua/
- Read the book.
- The 4th edition is on Amazon.
- The 2nd edition (which is still pretty good) is available on-line.
Before Reading This...
Read notes on the LURE LUA coding style. Note, in particular, that some of the
X.lua files have Xok.lua demo/test files.
Overview
For a first pass high-level view of the code:
- contrasts.lua reports deltas in the nodes of the tree built by...
- sdtree.lua recursively divides the ranges found by...
- superrange.lua uses the goal variable to combine spurious ranges generated by...
- range.lua divides numeric ranges into bins of approximately size sqrt(N) found in ...
- tbl.lua stores summaries (kept by num.lua and sym.lua) of the rows found by...
- csv.lua converts strings into rows of symbols and numbers.
- config.lua contains settings that control all the above.
File Groupings
For a more detailed view of this code: as of August 8 2017, the files of LURE divide into the following groups:
base    support  stats   table  learners
------  -------  ------  -----  -----------
config  csv      num     row    contrasts
show    id       range   tbl    sdtree
tests   lists    sample         superrange
        random   spy            trees
        str      sym
                 sk
                 tiles
Base code
The following base code should be assumed to be global across all the rest.
- config.lua: stores global options in the global `the`. These can be changed by the other code, then reset
to the default values by `defaults()`.
- show.lua: changes LUA's default printer such that printing a table also prints
its contents (recursively). To avoid printing very long items,
give them a keyname starting with `_`.
- tests.lua: a simple unit test framework.
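To make the show.lua idea concrete, here is a minimal sketch of a recursive table printer that hides keys starting with `_`. It is not the LURE code: the function name `show` and the output layout are assumptions made for illustration, and it returns a string rather than replacing Lua's default printer.

```lua
-- Hypothetical sketch: render a table (recursively) as a string,
-- skipping any string key whose name starts with "_".
local function show(x, seen)
  if type(x) ~= "table" then return tostring(x) end
  seen = seen or {}
  if seen[x] then return "..." end   -- guard against cycles (also abbreviates repeats)
  seen[x] = true
  local keys = {}
  for k in pairs(x) do
    if not (type(k) == "string" and k:sub(1, 1) == "_") then
      keys[#keys + 1] = k
    end
  end
  -- sort keys by their string form so the output order is stable
  table.sort(keys, function(a, b) return tostring(a) < tostring(b) end)
  local out = {}
  for _, k in ipairs(keys) do
    out[#out + 1] = tostring(k) .. "=" .. show(x[k], seen)
  end
  return "{" .. table.concat(out, " ") .. "}"
end

-- The long list stored under _cache is not printed.
print(show({name = "age", n = 3, _cache = {1, 2, 3}, pos = {x = 1, y = 2}}))
```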
Support code
Simple stand alone utilities.
- csv.lua: reads comma-separated values from strings or files and passes each row found to a function.
- id.lua: generates unique ids.
- lists.lua: basic list utilities.
- random.lua: random number generation that is stable across different platforms.
This is a nice place to see how a basic LUA module is formed.
- str.lua: basic string routines: print lists of items, replace characters, etc.
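The "pass each row to a function" style used by csv.lua might look something like the sketch below. The function name `csv` and its signature are assumptions, not the real LURE API. The sketch also takes shortcuts: it strips all whitespace, skips empty cells, and does not handle quoted fields.

```lua
-- Hypothetical sketch: split each line of a string on commas, coerce
-- numeric cells to numbers, and pass each row to a caller-supplied function.
local function csv(text, fn)
  for line in text:gmatch("[^\r\n]+") do
    line = line:gsub("%s+", "")            -- drop all whitespace
    if line ~= "" then
      local row = {}
      for cell in line:gmatch("[^,]+") do
        row[#row + 1] = tonumber(cell) or cell
      end
      fn(row)
    end
  end
end

-- Usage: collect the rows into a list.
local rows = {}
csv("name,$age\ntim,23\njane,41\n", function(row) rows[#rows + 1] = row end)
print(#rows, rows[2][1], rows[2][2] + 1)
```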
Stats code
Code for studying single distributions (and for studying multiple distributions, see table, below).
- num.lua: watches a stream of numbers, summarized as Gaussians
- range.lua: divides a list of numbers into a set of breaks. Note that this uses a dumb
unsupervised approach. For a smarter approach, that reflects over the class variable, see superrange.lua.
- sample.lua: watches a stream of numbers, keeps a random sample, never keeps more than (say)
a few hundred values.
- spy.lua: watches numbers and, every so often, prints out the current stats.
- sym.lua: watches a stream of symbols
- sk.lua: ranks a list of samples, using a recursive top-down bi-clustering algorithm.
- tiles.lua: divides a table of numbers into percentiles. Not very smart (for a smarter approach, see range).
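As a rough illustration of the unsupervised binning that range.lua performs, the sketch below sorts the numbers and cuts them into bins of about sqrt(N) items. The function name `ranges` and the `lo`, `hi`, `n` fields are made up for this example; the real range.lua may apply further rules that are not shown here.

```lua
-- Hypothetical sketch of unsupervised binning: sort the numbers, then
-- cut them into bins that each hold roughly sqrt(N) values.
local function ranges(nums)
  local sorted = {}
  for i, x in ipairs(nums) do sorted[i] = x end   -- copy, so the input is untouched
  table.sort(sorted)
  local size = math.max(1, math.floor(math.sqrt(#sorted)))
  local out, bin = {}, nil
  for _, x in ipairs(sorted) do
    -- start a new bin once the current one is full, but never split
    -- a run of identical values across two bins
    if bin == nil or (bin.n >= size and x ~= bin.hi) then
      bin = {lo = x, hi = x, n = 0}
      out[#out + 1] = bin
    end
    bin.hi, bin.n = x, bin.n + 1
  end
  return out
end

-- Usage: 16 numbers give bins of about 4 items each.
local nums = {}
for i = 1, 16 do nums[i] = 17 - i end
for _, b in ipairs(ranges(nums)) do print(b.lo, b.hi, b.n) end
```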
Note also that the "watcher" modules (num, sym, spy and sample) all have a very similar protocol:
- create: make a new watcher.
- update: add an item to a watcher.
- updates: add many items to a watcher, optionally filtered through some function f. Returns a new
watcher unless an optional third argument is supplied (in which case, the items are added to this third arg).
- watch: returns a new watcher and a convenience function for adding values to this watcher.
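The four-function protocol above can be sketched for a Gaussian summary as follows. This is a guess at the shape of the protocol, not the num.lua source: the field names (`n`, `mu`, `m2`, `sd`) are assumptions, "filtered through f" is read here as "each item is mapped through f", and the incremental mean and standard deviation use Welford's method.

```lua
-- Hypothetical sketch of the watcher protocol for a Gaussian summary.
local function create()
  return {n = 0, mu = 0, m2 = 0, sd = 0}
end

-- Welford's incremental update of the mean and sample standard deviation.
local function update(w, x)
  w.n = w.n + 1
  local d = x - w.mu
  w.mu = w.mu + d / w.n
  w.m2 = w.m2 + d * (x - w.mu)
  w.sd = w.n > 1 and math.sqrt(w.m2 / (w.n - 1)) or 0
  return w
end

-- Add many items, each passed through f (identity by default).
-- Reuses the watcher `all` if supplied, else makes a new one.
local function updates(items, f, all)
  all = all or create()
  f = f or function(x) return x end
  for _, x in ipairs(items) do update(all, f(x)) end
  return all
end

-- Return a new watcher plus a closure that feeds values to it.
local function watch()
  local w = create()
  return w, function(x) return update(w, x) end
end

-- Usage
local w = updates({2, 4, 4, 4, 5, 5, 7, 9})
print(w.n, w.mu, w.sd)
local w2, add = watch()
add(10); add(20)
print(w2.mu)
```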
Another shared protocol is between num and sym:
- distance: the distance between two items (reacting appropriately if one or both are the unknown symbol).
- norm (called by distance): reduces numbers from the range min..max to the range 0..1 (for syms, this function just
returns the value).
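One plausible reading of this shared protocol is sketched below. All names here (`numNorm`, `numDistance`, `symNorm`, `symDistance`, the `lo` and `hi` fields) are hypothetical, and so are two choices: using "?" as the unknown symbol, and assuming the worst case (maximal distance) when a value is unknown. The source only says the code should "react appropriately".

```lua
-- Hypothetical sketch; "?" is assumed to mark an unknown value.
local UNKNOWN = "?"

-- Num: map x from the range lo..hi onto 0..1.
local function numNorm(num, x)
  if x == UNKNOWN then return x end
  return (x - num.lo) / (num.hi - num.lo + 1e-32)  -- tiny constant avoids 0/0
end

-- Num: distance in 0..1. If one value is unknown, assume the worst case
-- (place it at whichever end is further from the known value).
local function numDistance(num, x, y)
  if x == UNKNOWN and y == UNKNOWN then return 1 end
  x, y = numNorm(num, x), numNorm(num, y)
  if x == UNKNOWN then x = (y < 0.5) and 1 or 0 end
  if y == UNKNOWN then y = (x < 0.5) and 1 or 0 end
  return math.abs(x - y)
end

-- Sym: norm just returns the value; distance is 0 if equal, else 1.
local function symNorm(sym, x) return x end
local function symDistance(sym, x, y)
  if x == UNKNOWN or y == UNKNOWN then return 1 end
  return symNorm(sym, x) == symNorm(sym, y) and 0 or 1
end

-- Usage
local age = {lo = 0, hi = 10}
print(numDistance(age, 2, 7), numDistance(age, "?", 2), symDistance({}, "a", "b"))
```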
Also, we can test if two num or two sample distributions are statistically the same.
- For nums, we use parametric Gaussian effect size and significance tests;
- For samples, we use non-parametric effect size and significance tests (Scott-Knott in the sk.lua file,
bootstrap, and Cliff's delta; to be checked).
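For readers who have not met Cliff's delta, a minimal sketch follows. It covers only the effect size part (not the significance test) and is not taken from the LURE source. The function names and the 0.147 "negligible effect" threshold are assumptions (0.147 is a commonly quoted cut-off, but LURE may use another value). Both lists are assumed to be non-empty.

```lua
-- Cliff's delta: (count of x > y  minus  count of x < y) / (|xs| * |ys|),
-- taken over all pairs (x, y). Ranges from -1 to 1; 0 means total overlap.
local function cliffsDelta(xs, ys)
  local gt, lt = 0, 0
  for _, x in ipairs(xs) do
    for _, y in ipairs(ys) do
      if x > y then gt = gt + 1 elseif x < y then lt = lt + 1 end
    end
  end
  return (gt - lt) / (#xs * #ys)
end

-- Hypothetical helper: treat two samples as "the same" when the absolute
-- effect is below a small threshold (assumed default: 0.147).
local function same(xs, ys, small)
  return math.abs(cliffsDelta(xs, ys)) < (small or 0.147)
end

-- Usage
print(cliffsDelta({1, 2, 3}, {4, 5, 6}), same({1, 2, 3}, {1, 2, 3}))
```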
Table code
One of my core data structures is tbl (table). Such tables are aggregations of
rows, nums, and syms.
Tbls are a place to store rows of data.
- When data comes in from disk, I store it as a tbl;
- When data in one tbl is divided, the divisions are tbls;
- When we cluster, each cluster is its own tbl;
- When we build a dendrogram (a recursive division of data into sub-data, then sub-sub-data, then sub-sub-sub-data, etc.)
then each node in that tree is a tbl.
Each Row in tbl is its own struct. Such Rows handle comparisons between rows
(e.g. domination scores, KNN distance measures, Naive Bayes likelihood calculations, etc).
Each column in tbl has a header that is a Num or a Sym,
and that header maintains a summary of what was seen in that column.
Tbls are incremental readers of rows of data. As rows are found, we can throw them at a table:
- If this is other than the first row then Tbl assumes it is a row to be stored in the table.
  As a side-effect of storage, all the column headers are updated.
- If this is the first row then Tbl assumes it is a header that lists the names and types of each column.
  - If the name contains ?, then Tbl should ignore this column;
  - If the name contains < or >, then the column can be categorised as a numeric goal to be minimized or maximized;
  - If the name contains !, then the column can be categorised as a symbolic goal, to be used in classification;
  - If the name contains $, then the column is categorised as a numeric independent variable;
  - Otherwise, the column can be categorised as a symbolic independent variable.
Note that there is nothing hard-wired in this code about ?<>!$. These can be easily changed in
the categories function.
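The incremental reading described above (first row is the header, later rows are data that update the column summaries) could be sketched like this. Everything here is illustrative: `newTbl`, `header`, `see` and `add` are made-up names, the summaries are cut down to lo/hi for numbers and counts for symbols, and treating "?" cells as unknown values is an assumption.

```lua
-- Hypothetical sketch of an incremental table reader.
local function newTbl() return {rows = {}, cols = {}, named = false} end

-- Make one column header; numeric if the name contains $, < or >.
local function header(name, pos)
  return {name = name, pos = pos, numeric = name:find("[$<>]") ~= nil,
          n = 0, lo = math.huge, hi = -math.huge, counts = {}}
end

-- Update one header's summary with one cell.
local function see(col, x)
  if x == "?" then return end            -- assumption: "?" cells are unknown
  col.n = col.n + 1
  if col.numeric then
    col.lo, col.hi = math.min(col.lo, x), math.max(col.hi, x)
  else
    col.counts[x] = (col.counts[x] or 0) + 1
  end
end

-- First row: build headers, skipping names that contain "?".
-- Later rows: store the row and, as a side-effect, update every header.
local function add(tbl, row)
  if not tbl.named then
    tbl.named = true
    for pos, name in ipairs(row) do
      if not name:find("?", 1, true) then
        tbl.cols[#tbl.cols + 1] = header(name, pos)
      end
    end
  else
    tbl.rows[#tbl.rows + 1] = row
    for _, col in ipairs(tbl.cols) do see(col, row[col.pos]) end
  end
  return tbl
end

-- Usage
local t = newTbl()
add(t, {"name", "$age", "?notes", "<cost"})
add(t, {"tim", 23, "x", 10})
add(t, {"jane", 41, "y", 5})
print(#t.rows, #t.cols, t.cols[2].lo, t.cols[2].hi)
```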
What Tbl does assume is that columns of data can be categorised as:
- x: the independent columns;
- y: the dependent columns;
- all: all columns.
Within all, x, y the columns are further categorised as:
- nums: the numeric columns;
- syms: the symbolic columns;
- cols: all columns.
Note that a column can have multiple categories (see categories). This is done
since sometimes we have to (e.g.) process all the numerics together or process all
the independent symbolics together etc. For an example of a column in multiple categories, a < column is stored in:
- .all.cols
- .y.cols
- .all.nums
- .goals
- .less
- .y.nums
Note that each column gets one, and only one num or sym header structure and that header structure
may be stored in multiple categories.
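The "one header, many categories" idea can be sketched as below. The `categories` function here is a guess at what such a function could look like, not the LURE source. The categories for a < column follow the list above; the `more` category for > columns and the decision to place ! columns in `goals` are assumptions. The helper `file` is also hypothetical.

```lua
-- Hypothetical sketch: list every category that a column name belongs to.
local function categories(name)
  local goal    = name:find("[<>!]") ~= nil
  local numeric = name:find("[<>$]") ~= nil
  local what = {"all.cols"}
  what[#what + 1] = goal and "y.cols" or "x.cols"
  what[#what + 1] = numeric and "all.nums" or "all.syms"
  what[#what + 1] = (goal and "y." or "x.") .. (numeric and "nums" or "syms")
  if name:find("<", 1, true) then what[#what + 1] = "less" end
  if name:find(">", 1, true) then what[#what + 1] = "more" end  -- assumed name
  if goal then what[#what + 1] = "goals" end
  return what
end

-- File the SAME header struct (not a copy) under each of its categories.
local function file(tbl, head)
  for _, path in ipairs(categories(head.name)) do
    local group, kind = path:match("^(%w+)%.(%w+)$")
    local list
    if group then                       -- e.g. "y.nums" -> tbl.y.nums
      tbl[group] = tbl[group] or {}
      tbl[group][kind] = tbl[group][kind] or {}
      list = tbl[group][kind]
    else                                -- e.g. "less" -> tbl.less
      tbl[path] = tbl[path] or {}
      list = tbl[path]
    end
    list[#list + 1] = head
  end
  return head
end

-- Usage: one header, reachable from six places.
local t = {}
local cost = file(t, {name = "<cost"})
print(t.all.cols[1] == cost, t.y.nums[1] == cost, t.less[1] == cost)
```

Because every list holds a reference to the same struct, an update made through one category is visible through all the others.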
Learner code
- Superrange reflects over the structures generated by range to combine adjacent ranges
where the distribution of the dependent variable does not change.
- Sdtree recursively reflects over the superranges to find nested
splits to the data that most reduce the variance in the dependent variable.
- Contrasts comments on the delta between nodes in the tree found by sdtree.
No such contrast sets are generated for nodes where the distribution is not statistically different.
- Trees is a convenience package that batches up
contrasts(sdtree(superrange(range(tbl(csv(config))))))
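The core step inside a tree learner like sdtree (pick the split that most reduces the spread of the dependent variable) can be sketched as follows. This is a simplified illustration under stated assumptions, not the sdtree.lua code: candidate splits are given directly as groups of dependent values, spread is measured as a size-weighted standard deviation, and the names `sd`, `expectedSd` and `bestSplit` are made up.

```lua
-- Sample standard deviation of a list of numbers (Welford's method).
local function sd(ys)
  local n, mu, m2 = 0, 0, 0
  for _, y in ipairs(ys) do
    n = n + 1
    local d = y - mu
    mu = mu + d / n
    m2 = m2 + d * (y - mu)
  end
  return n > 1 and math.sqrt(m2 / (n - 1)) or 0
end

-- Expected spread after a split: each group's sd, weighted by group size.
local function expectedSd(groups)
  local total, sum = 0, 0
  for _, ys in ipairs(groups) do total = total + #ys end
  for _, ys in ipairs(groups) do sum = sum + #ys / total * sd(ys) end
  return sum
end

-- candidates maps a column name to the groups of dependent values that
-- result from splitting on that column's ranges. Returns the best column.
local function bestSplit(candidates)
  local best, bestName = math.huge, nil
  for name, groups in pairs(candidates) do
    local e = expectedSd(groups)
    if e < best then best, bestName = e, name end
  end
  return bestName, best
end

-- Usage: splitting on "age" separates low from high values, so it wins.
print(bestSplit({
  age  = {{1, 2, 1, 2}, {9, 10, 9, 10}},
  name = {{1, 10, 2, 9}, {2, 9, 1, 10}},
}))
```

A recursive learner would then repeat this choice inside each resulting group until the groups are too small or the spread stops shrinking.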
Legal
LURE, Copyright (c) 2017, Tim Menzies
All rights reserved, BSD 3-Clause License
Redistribution and use in source and binary forms, with
or without modification, are permitted provided that
the following conditions are met:
- Redistributions of source code must retain the above
copyright notice, this list of conditions and the
following disclaimer.
- Redistributions in binary form must reproduce the
above copyright notice, this list of conditions and the
following disclaimer in the documentation and/or other
materials provided with the distribution.
- Neither the name of the copyright holder nor the names
of its contributors may be used to endorse or promote
products derived from this software without specific
prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.