LURE: WHAT

tim@menzies.us
August'18

What is this code all about?

I claim data science is about science; i.e. it is about a community carefully curating and improving a collection of ideas. So, in my view, it is not enough to merely make conclusions. Rather, those conclusions need to be monitored and updated (when appropriate).

So LURE implements the following set of operators that I say need to be part of any data mining toolkit that supports science. There is no claim here that these tools are complete. You should be very critical of the technical choices I made in that implementation. What simplifications did I make? What better technologies should I use? What did I overlook? If you think you can handle the above in (e.g.) TensorFlow or Torch or using 100 other methods, I would lean forward and say "yes? really? show me how".

Comprehension:

Something we can read, argue with
Essential for communities critiquing ideas. If the only person reading a model is a carburetor, then we can expect little pushback. But if your models are about policies that humans have to implement, then I take it as axiomatic that humans will want to read and critique the models.

Fast:

Not a CPU hog
Reproducing and improving an old idea means that you can reproduce that old result. Also, certifying new ideas often means multiple runs over many sub-samples of the data. Such reproduction and certification are impractical when each run is impractically slow.

Light:

Small memory footprint
Again, reproducing an old data mining experiment or certifying a new result is only practical if the resources required for reproduction are not exorbitant.

Goal-aware:

Different goals mean different models. AND multiple goals = no problem!
This is important since most data miners build models that optimize for a single goal (e.g. minimize error or least-squares error) yet business users often want their data miners to achieve many goals.
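
One common way to handle many goals at once is domination: one candidate beats another if it is no worse on every goal and better on at least one. The following is a minimal sketch of that idea, not LURE's own code. The names `dominates`, `frontier` and `GOALS`, the (column, weight) encoding of goals, and the toy rows are all assumptions made for this example.

```python
# Hypothetical sketch: binary domination over several goals.
# Each goal is (column index, weight); weight -1 means minimize, +1 means maximize.

def dominates(a, b, goals):
    """True if row `a` is no worse than `b` on every goal and better on at least one."""
    better = False
    for i, w in goals:
        x, y = a[i] * w, b[i] * w   # after weighting, larger is always better
        if x < y:
            return False
        if x > y:
            better = True
    return better

def frontier(rows, goals):
    """Rows that are not dominated by any other row."""
    return [r for r in rows
            if not any(dominates(other, r, goals) for other in rows)]

# Toy data. Columns: cost (minimize), accuracy (maximize).
GOALS = [(0, -1), (1, +1)]
rows = [(10, 0.80), (12, 0.90), (15, 0.85), (10, 0.70)]
best = frontier(rows, GOALS)   # the trade-offs worth showing to a business user
```

Here `best` keeps the cheap-but-less-accurate row and the dearer-but-more-accurate row, and drops the two rows that some other row beats outright. Changing `GOALS` changes which rows survive, which is the sense in which different goals mean different models.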

Humble:

Can publish succinct certification envelope (so we know when not to trust)
Delivered data mined models should be able to recognize when new data is out-of-scope of anything they've seen before. This means, at runtime, having access to the data used to build that model. Note the word succinct here: certification envelopes cannot include all the data relating to a model, otherwise every hard drive in the world will soon fill up.
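
To make the idea concrete, here is a deliberately crude sketch of an envelope, assuming all columns are numeric and that a per-column (min, max) range is an acceptable summary. That is an assumption for illustration only; it is not how LURE builds its envelopes, and the names `envelope`, `out_of_scope` and `slack` are invented here.

```python
# Hypothetical sketch: a "certification envelope" as per-column (lo, hi) ranges.
# Its size depends on the number of columns, not the number of training rows.

def envelope(rows):
    """Summarize numeric training rows as one (min, max) pair per column."""
    cols = list(zip(*rows))
    return [(min(c), max(c)) for c in cols]

def out_of_scope(row, env, slack=0.0):
    """Return indexes of columns where `row` falls outside the envelope.
    `slack` widens each range by that fraction of its width."""
    bad = []
    for i, (x, (lo, hi)) in enumerate(zip(row, env)):
        pad = (hi - lo) * slack
        if x < lo - pad or x > hi + pad:
            bad.append(i)
    return bad

train = [(1.0, 20.0), (2.0, 35.0), (3.0, 30.0)]
env = envelope(train)                      # shipped alongside the model
suspect = out_of_scope((2.5, 50.0), env)   # column 1 is beyond anything seen
```

A model shipped with `env` can refuse (or flag) any row for which `out_of_scope` returns a non-empty list, without carrying the training data around.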

Privacy-aware:

Can hide an individual's data
This is essential when sharing a certification envelope.

Shareable:

Knows how to transfer models and data between contexts.
Such transfer usually requires some transformation of the source data to the target data.

Context-aware:

Knows that local parts of data generate different models.
While general principles are good, so too is knowing how to handle particular contexts. For example, in general, exercise is good for maintaining health. However, in the particular context of patients who have just had cardiac surgery, that general principle has to be carefully tailored to particular patients.

Self-tuning:

And can do it quickly
Many experiments show that we can't just use data miners off-the-shelf. Rather, if their control parameters are tuned, then we can get much better data mining results.
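
As one illustration of cheap tuning, the sketch below uses plain random search under a small budget. Random search is a stand-in chosen for brevity, not a claim about which tuner LURE uses. The function `tune`, the parameter names `k` and `m`, and the toy scoring function are all hypothetical.

```python
import random

def tune(score, space, budget=50, seed=1):
    """Random search: try `budget` random settings, keep the best-scoring one.
    `space` maps each parameter name to a (lo, hi) range of floats."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(budget):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
        s = score(params)
        if s > best_score:
            best, best_score = params, s
    return best, best_score

# Toy stand-in for "run the learner with these settings, return its performance".
def toy_score(p):
    return -((p["k"] - 3) ** 2) - ((p["m"] - 0.5) ** 2)

best, best_score = tune(toy_score, {"k": (1, 10), "m": (0, 1)})
```

In practice `score` would train and evaluate a data miner, so `budget` is what keeps tuning quick: each extra trial costs one more run of the learner.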

Anomaly-aware:

Can detect when new inputs differ from old training data
This is the trigger for when old ideas need to be updated.

Incremental:

Can update old models with new data
Anomaly detectors tell us something has to change. Incremental learners tell us what to change.
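
The pairing of those two operators can be sketched for a single numeric column. The class below keeps a running mean and standard deviation using Welford's method, flags values that sit far from what it has seen, and then folds them in. This is an assumption-laden illustration rather than LURE's implementation: the class name `Num`, the method `odd`, and the three-standard-deviation threshold are all choices made for this example.

```python
import math

class Num:
    """Running mean and standard deviation (Welford's method)."""
    def __init__(self):
        self.n, self.mu, self.m2 = 0, 0.0, 0.0

    def add(self, x):
        """Incremental update: no need to revisit old data."""
        self.n += 1
        d = x - self.mu
        self.mu += d / self.n
        self.m2 += d * (x - self.mu)

    def sd(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def odd(self, x, z=3.0):
        """Anomaly check: is x more than z standard deviations from the mean?"""
        s = self.sd()
        return s > 0 and abs(x - self.mu) > z * s

num = Num()
for x in [10, 11, 9, 10, 12, 8]:
    num.add(x)
flag = num.odd(30)   # anomaly detector: something has to change
num.add(30)          # incremental learner: fold the new value into the summary
```

The summary is three numbers per column, so memory stays constant no matter how much data streams through.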

Legal

LURE, Copyright (c) 2017, Tim Menzies. All rights reserved. BSD 3-Clause License.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.