tim@menzies.us
August'18
What is this code all about?
I claim data science is about science; i.e. it is about
a community carefully curating and improving a collection of ideas. So, in my view, it is not
enough to merely make conclusions. Rather, those conclusions need to be monitored and updated (when appropriate).
So LURE implements the following set of operators that I say need to be part of any data mining toolkit that supports
science.
There is no claim here of completeness of these tools.
You should be very critical of the technical
choices I made in that implementation. What simplifications did I
make? What better technologies should I use? What did I overlook?
If you think you can handle the above in (e.g.)
TensorFlow
or Torch or using 100 other methods, I would
lean forward and say "yes? really? show me how".
Comprehension:
- Something we can read, argue with
- Essential for communities critiquing ideas. If the only person reading a model is a carburetor, then we can expect little push back. But if your models are about policies that humans have to implement, then I take it as axiomatic that humans will want to read and critique the models.
Fast:
- Not a CPU hog
- Reproducing and improving an old idea means that you can reproduce that old result. Also, certifying new ideas often means multiple runs over many sub-samples of the data. Such reproduction and certification are impractical when each run is slow.
Light:
- Small memory footprint
- Again, reproducing an old data mining experiment or certifying a new result is only practical if the resources required are not exorbitant.
Goal-aware:
- Different goals mean different models. And multiple goals = no problem!
- This is important since most data miners build models that optimize for a single goal (e.g. minimize error or least-square error) yet business users often want their data miners to achieve many goals.
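One way to make "multiple goals" concrete is to keep every candidate model that no other candidate beats on all goals at once. The sketch below uses plain binary (Pareto) domination; that choice, the function names `dominates` and `pareto_front`, and the three example goals are all assumptions made for illustration, not a description of what LURE itself does.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float],
              minimize: Sequence[bool]) -> bool:
    """True if `a` is no worse than `b` on every goal and better on at least one."""
    better_somewhere = False
    for x, y, less_is_better in zip(a, b, minimize):
        if not less_is_better:
            x, y = -x, -y          # flip the sign so that smaller is always better
        if x > y:
            return False           # `a` loses on this goal, so it cannot dominate
        if x < y:
            better_somewhere = True
    return better_somewhere


def pareto_front(rows: List[Tuple[float, ...]],
                 minimize: Sequence[bool]) -> List[Tuple[float, ...]]:
    """Keep the rows that no other row dominates."""
    return [r for r in rows
            if not any(dominates(other, r, minimize) for other in rows)]


# Hypothetical candidates scored on (error, runtime in seconds, recall).
candidates = [(0.10, 30.0, 0.80), (0.12, 5.0, 0.78), (0.15, 40.0, 0.70)]
front = pareto_front(candidates, minimize=(True, True, False))
print(front)
```

Here the third candidate is dropped because the first is better on all three goals, while the first two both survive since each wins on a different goal. A single-goal learner would have returned only one of them.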
Humble:
- Can publish a succinct certification envelope (so we know when not to trust)
- Delivered data mined models should be able to recognize when new data is out-of-scope of anything they've seen before. This means, at runtime, having access to the data used to build that model. Note the word succinct here: certification envelopes cannot include all the data relating to a model, otherwise every hard drive in the world will soon fill up.
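As a minimal sketch of what a succinct envelope could look like, the code below summarizes the training data as one (min, max) pair per column and flags new rows that fall outside those ranges. This is an assumed, deliberately crude summary; the names `build_envelope` and `out_of_scope`, the `slack` parameter, and the example columns are hypothetical and not taken from LURE.

```python
from typing import Dict, List, Tuple

Envelope = Dict[str, Tuple[float, float]]


def build_envelope(rows: List[Dict[str, float]]) -> Envelope:
    """Summarize training rows as one (min, max) pair per column."""
    env: Envelope = {}
    for row in rows:
        for name, value in row.items():
            lo, hi = env.get(name, (value, value))
            env[name] = (min(lo, value), max(hi, value))
    return env


def out_of_scope(env: Envelope, row: Dict[str, float],
                 slack: float = 0.0) -> List[str]:
    """Return the columns of `row` that are unknown or outside the trained range.

    `slack` widens each range by that fraction of its width.
    """
    odd = []
    for name, value in row.items():
        if name not in env:
            odd.append(name)       # a column never seen during training
            continue
        lo, hi = env[name]
        pad = slack * (hi - lo)
        if value < lo - pad or value > hi + pad:
            odd.append(name)
    return odd


# Hypothetical training data and two new rows.
training = [{"age": 25, "kloc": 10}, {"age": 40, "kloc": 120}]
envelope = build_envelope(training)
print(out_of_scope(envelope, {"age": 30, "kloc": 50}))
print(out_of_scope(envelope, {"age": 30, "kloc": 500}))
```

The envelope costs two numbers per column no matter how many rows were used for training, which is the sense of "succinct" intended here. Richer summaries (e.g. cluster centroids) would catch more cases at the cost of a larger envelope.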
Privacy-aware:
- Can hide an individual's data
- This is essential when sharing a certification envelope
Shareable:
- Knows how to transfer models and data between contexts.
- Such transfer usually requires some transformation of the source data to the target data.
Context-aware:
- Knows that local parts of data generate different models.
- While general principles are good, so too is knowing how to handle particular contexts. For example, in general, exercise is good for maintaining health. However, in the particular context of patients who have just had cardiac surgery, that general principle has to be carefully tailored to particular patients.
Self-tuning:
- Can tune its own control parameters, and can do it quickly
- Many experiments show that we can't just use data miners off-the-shelf. Rather, if their control parameters are tuned, then we can get much better data mining results.
Anomaly-aware:
- Can detect when new inputs differ from old training data
- This is the trigger for when old ideas need to be updated.
Incremental:
- Can update old models with new data
- Anomaly detectors tell us something has to change. Incremental learners tell us what to change.
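The pairing of anomaly detection and incremental update can be sketched for a single numeric column. The class below keeps a running mean and standard deviation using Welford's online algorithm, so the old data never has to be re-read, and flags values that sit far from what it has seen so far. The class name `Num`, the method names, and the three-standard-deviation threshold are assumptions for illustration only.

```python
import math


class Num:
    """Running summary of one numeric column (Welford's online algorithm)."""

    def __init__(self) -> None:
        self.n = 0        # how many values seen so far
        self.mu = 0.0     # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def add(self, x: float) -> None:
        """Incremental update: fold one new value into the summary."""
        self.n += 1
        delta = x - self.mu
        self.mu += delta / self.n
        self.m2 += delta * (x - self.mu)

    def sd(self) -> float:
        """Sample standard deviation of the values seen so far."""
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_anomalous(self, x: float, z: float = 3.0) -> bool:
        """Flag `x` if it lies more than `z` standard deviations from the mean."""
        s = self.sd()
        return s > 0 and abs(x - self.mu) > z * s


# Hypothetical stream of old training values.
num = Num()
for value in [10, 12, 11, 9, 13]:
    num.add(value)

print(num.is_anomalous(12))   # close to the old data
print(num.is_anomalous(30))   # far from the old data

num.add(30)                   # the incremental learner absorbs the new value
print(num.mu)
```

The anomaly test says that something has changed; the call to `add` is the change itself, made at constant cost per new value.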
Legal
LURE, Copyright (c) 2017, Tim Menzies
All rights reserved, BSD 3-Clause License
Redistribution and use in source and binary forms, with
or without modification, are permitted provided that
the following conditions are met:
- Redistributions of source code must retain the above
copyright notice, this list of conditions and the
following disclaimer.
- Redistributions in binary form must reproduce the
above copyright notice, this list of conditions and the
following disclaimer in the documentation and/or other
materials provided with the distribution.
- Neither the name of the copyright holder nor the names
of its contributors may be used to endorse or promote
products derived from this software without specific
prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.