Project Plan

From Machine Learning 4 Apache

Project Plan

Mission

Create a commercially friendly, stable, scalable suite of machine learning tools under the Apache License, hosted by the ASF as a Top Level Project (TLP).

The main focus is on scalability. In terms of applications, we concentrate on text mining, at least at first, integrating closely with projects like Nutch and Hadoop. The framework should be designed for high throughput and be capable of handling massive datasets both during training and during application, where this distinction exists. To achieve this, we intend to parallelize algorithms on top of Hadoop. In this respect at least there is still room for research, although first publications are already available (http://nips.cc/Conferences/2006/Program/event.php?ID=408).
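As a rough illustration of what parallelizing an algorithm in a map/reduce style could look like, here is a minimal sketch in plain Python. It does not use the Hadoop API. The function names (map_shard, reduce_counts, train_counts) and the toy data are assumptions made for this example only. The idea shown is that the per-class word counts needed by a simple text classifier can be computed independently on each data shard and then merged by addition.

```python
from collections import Counter
from functools import reduce


def map_shard(shard):
    """Map step: count labels and (label, word) pairs within one shard."""
    word_counts = Counter()
    label_counts = Counter()
    for label, text in shard:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[(label, word)] += 1
    return word_counts, label_counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results by adding their counts."""
    return left[0] + right[0], left[1] + right[1]


def train_counts(shards):
    """Run the map step on every shard, then fold the partial results."""
    partials = map(map_shard, shards)
    return reduce(reduce_counts, partials, (Counter(), Counter()))


# Hypothetical toy data: each shard stands in for one split of a large corpus.
shards = [
    [("spam", "cheap pills now"), ("ham", "meeting at noon")],
    [("spam", "cheap offer"), ("ham", "lunch at noon")],
]
word_counts, label_counts = train_counts(shards)
```

Because the merge is a plain sum, the shards could be processed on different machines in any order; whether a given learning algorithm can be expressed this way is exactly the open question raised above.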

Overall Process:

  • Fill out this document
  • Create an Incubator proposal
  • Find a mentor or two
  • Write some code (can be done throughout)
  • Build community

People

Plan

Position

Please sort these ExampleTasks into our goals, non-goals and nice-to-have categories.

Goals

  • Algorithms for text data
  • soft criteria (can anyone please fill in the correct term here?)
    • extensible (should we support plugins or simple export APIs? At which modules?)
    • stable (against false data, what else?)
    • fast (how fast?)
    • can cope with a lot of data (how much - numbers)
  • Applications - sample applications
    • classify mail spam
    • cluster web search results
    • tag web pages and add additional information as fields to nutch/lucene
    • classify web pages into categories for enhanced retrieval
    • n-gram-based language recognizer
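
To make the last sample application more concrete, here is a minimal sketch of a language recognizer, assuming it is meant to work on character n-grams. The function names, the choice of trigrams, the scoring rule and the tiny training samples are all assumptions for illustration, not a proposed design.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a text, padded with one space on each side."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_profile(samples, n=3):
    """Build a relative-frequency profile of n-grams from sample texts."""
    counts = Counter()
    for sample in samples:
        counts.update(char_ngrams(sample, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def recognize(text, profiles, n=3):
    """Pick the language whose profile gives the text's n-grams the highest total weight."""
    grams = char_ngrams(text, n)

    def score(profile):
        return sum(profile.get(gram, 0.0) for gram in grams)

    return max(profiles, key=lambda lang: score(profiles[lang]))


# Hypothetical, very small training samples; a real profile needs far more text.
profiles = {
    "en": build_profile(["the quick brown fox", "this is the house", "where is the street"]),
    "de": build_profile(["der schnelle braune fuchs", "das ist das haus", "wo ist die strasse"]),
}
```

For example, recognize("das haus ist braun", profiles) selects the "de" profile with these samples. Building the profiles is again a counting task, so it would fit the same map/reduce pattern as other text statistics.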

Nice to have

Forget it - we never want to support that

  • Integrate each and every learning algorithm available.

Background knowledge

Motivating examples

Copy arguments from our first mails here

  • Use Case 1 - the ambitious researcher

He wants a tool that is easy to extend and supports all the mathematical transformations necessary for his problem. He also needs a suite that contains all the learning algorithms he needs to compare against. Ideally, after an experiment he has a configurable graphical presentation of the performance of his algorithm against all competitors.

  • Use Case 2 - the lost engineer

He sees machine learning as a tool for understanding his problems better. He is used to being able to understand each and every processing step; nothing depends on chance. In the end his problem won't be solved by some ML solution but by an implementation based on the knowledge gained from his experiments. This guy only wants some statistics, maybe some easy-to-understand rules, but not decisions based on statistics alone.

  • Use Case 3 - the ambitious engineer

He knows the benefits and limitations of machine learning algorithms. Usually he does not know how to tune the parameters and has no time to learn it in depth. He wants to use ML to solve his problems and wants to plug some algorithm into his project to solve a well-defined task. This guy needs fast and reliable implementations.

Existing solutions - why don't they solve our problem?

Existing_Learning_Tools

Links to related publications

  • Pegasos: Primal Estimated sub-GrAdient SOlver for SVM (ICML 2007)

Books

  • Data Mining: Practical Machine Learning Tools and Techniques (Second Edition): Ian H. Witten, Eibe Frank
  • Programming Collective Intelligence: Toby Segaran
  • Pattern Classification: Duda, Hart, Stork
  • Data Mining: Jiawei Han, Micheline Kamber
  • Foundations of Statistical Natural Language Processing: C. Manning & H. Schütze
  • Introduction to Machine Learning: Ethem Alpaydin -- Pretty good first-year grad textbook. Lots of math but also has pseudocode, etc. To quote the preface: "The aim is to have all learning algorithms sufficiently explained so it will be a small step from the equations given in the book to a computer program"

Webpages/Blogs

Problems we might encounter

  • Which kinds of problems should we look out for? Any patent stuff? Any licensing issues? (e.g.)

Implementations

Design

  • Data structures
  • Modules/ Components - What goes in, what happens to the data, what comes out?
  • Interfaces within the system
  • Interfaces visible to users of the system
  • Algorithms we use - put at least one paper for each learning algorithm we use

Monitoring

  • Learning can take quite a while. Should users be able to monitor progress? If so, how?
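
One possible answer, sketched here only as a starting point for discussion, is to let the caller pass a progress callback into the training routine. The train function, its parameters and the callback signature below are hypothetical; the toy model (fitting y = w * x by gradient descent) merely stands in for a long-running learner.

```python
def train(data, epochs, learning_rate=0.1, on_progress=None):
    """Fit y = w * x by gradient descent, reporting progress after every epoch."""
    w = 0.0
    for epoch in range(1, epochs + 1):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad
        loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
        if on_progress is not None:
            # The callback receives the current epoch, the total, and the loss.
            on_progress(epoch, epochs, loss)
    return w


# Usage: collect progress reports in a list; a UI or logger could be used instead.
history = []
w = train(
    [(1.0, 2.0), (2.0, 4.0)],
    epochs=20,
    on_progress=lambda epoch, total, loss: history.append((epoch, total, loss)),
)
```

A callback keeps the learner independent of how progress is displayed. For jobs running on a cluster, the same information would more likely be exposed through the cluster's own status reporting.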

Logging

  • We already noticed that this will be very important. Given some model, how do we enable finding out, in a principled way, the exact training data used, the parameters set and the preprocessing steps taken?
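
As one way to approach this, a small provenance record could be written next to every trained model. The sketch below is an assumption about what such a record might contain; the field names, the fingerprint helper and the example values are hypothetical.

```python
import hashlib
import json


def fingerprint(lines):
    """Return a SHA-256 digest that identifies the exact training data used."""
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def training_record(training_lines, parameters, preprocessing_steps):
    """Collect everything needed to trace a model back to how it was trained."""
    return {
        "training_data_sha256": fingerprint(training_lines),
        "num_examples": len(training_lines),
        "parameters": parameters,
        "preprocessing": list(preprocessing_steps),
    }


# Hypothetical example values.
record = training_record(
    ["spam\tcheap pills", "ham\tmeeting at noon"],
    {"algorithm": "naive_bayes", "smoothing": 1.0},
    ["lowercase", "tokenize_whitespace"],
)
serialized = json.dumps(record, sort_keys=True, indent=2)
```

Storing the serialized record alongside the model file would let anyone check later whether a given data set and parameter setting produced that model.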

Threading

  • Learning algorithms can take pretty long to finish. Do we support threading? We should integrate with Hadoop. How exactly?

Formats

Community Building

Where do we find people interested in us?

  • Open Source ML workshop @ NIPS
  • Chaos Communication Camp @ Finowfurt/Germany - anyone there: Isabel
  • Froscon @ Bonn/Germany - anyone there?
  • 24C3 @ Berlin/Germany - anyone there?
  • Fosdem @ Brussels/Belgium - anyone there?
  • Nutch/Hadoop community
  • company formerly known as MediaStyle in Halle/Germany
  • Google/Y!/Microsoft
  • PrudSys (ML company in Chemnitz/Germany)
  • TU Chemnitz
  • Could the SpamAssassin people be interested in us?

Documentation

  • Which kind for which audience? Who is responsible? Something for new developers? For users?

Time estimates

  • How long will it take until we accomplish the individual steps? When should we launch? What should work before doing so?