Software Requirements for Machine Learning

Software Requirement Specification for Machine Learning:

1. Weka Toolkit: Weka (Waikato Environment for Knowledge Analysis) is a machine learning
and data mining toolkit written in Java and developed at the University of Waikato in New Zealand. It provides a suite of tools for learning and visualization via the supplied workbench program or the command line. Weka can also
retrieve data from any existing data source that has a JDBC driver. With Weka you can do the following:

    i. Data preprocessing
    ii. Clustering
    iii. Classification
    iv. Regression
    v. Association rules

The Weka toolkit is widely used, and it now supports Big Data workloads by interfacing with Hadoop for clustered data mining.
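
As a minimal sketch of how Weka is driven from Java (assuming the Weka jar is on the classpath and a hypothetical ARFF file named weather.arff), the following loads a dataset, trains a J48 decision tree, and estimates its accuracy:

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class WekaExample {
        public static void main(String[] args) throws Exception {
            // Load a dataset in Weka's ARFF format ("weather.arff" is a
            // placeholder path; any ARFF file with a nominal class works)
            Instances data = new DataSource("weather.arff").getDataSet();
            // Tell Weka which attribute is the class label (here, the last one)
            data.setClassIndex(data.numAttributes() - 1);

            // Train a J48 decision tree (Weka's C4.5 implementation)
            J48 tree = new J48();
            tree.buildClassifier(data);

            // Estimate accuracy with 10-fold cross-validation
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }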

2. Mahout: Mahout is an open-source machine learning library and an Apache project. Its key feature is scalability: it works on a single node or on a cluster of machines, and it integrates tightly with the Hadoop MapReduce paradigm to enable large-scale processing.

Mahout supports several algorithms, including:
    i. Naive Bayes Classifier
    ii. K-Means Clustering
    iii. Recommendation Engines
    iv. Random Forest Decision Trees
    v. Logistic Regression Classifier
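
To illustrate, here is a minimal sketch of a user-based recommendation engine built with Mahout's Taste API; the file name ratings.csv and the neighborhood size of 10 are placeholder assumptions:

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderExample {
        public static void main(String[] args) throws Exception {
            // ratings.csv is a placeholder file of "userID,itemID,preference" lines
            DataModel model = new FileDataModel(new File("ratings.csv"));

            // Compare users by the correlation of their ratings
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            // Consider only the 10 most similar users (an arbitrary choice)
            UserNeighborhood neighborhood =
                    new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender =
                    new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Top three item recommendations for user 1
            List<RecommendedItem> items = recommender.recommend(1, 3);
            for (RecommendedItem item : items) {
                System.out.println(item.getItemID() + " : " + item.getValue());
            }
        }
    }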

3. Spring XD: Whereas Weka and Mahout concentrate on algorithms and on producing the knowledge you need, you must also think about acquiring and processing the data in the first place. Spring XD is a “data ingestion engine” that reads in, processes, and stores raw data. It is highly customizable, with the ability to create your own processing units, and it integrates with the other tools described here.

Spring XD is relatively new, but it is certainly useful. It is not limited to Internet-based data; it can also ingest network and system messages across a cluster of machines.
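
As a rough sketch of what a custom processing unit might look like: Spring XD processor modules can be built from plain Spring Integration components, so a simple transformer such as the hypothetical one below could sit between a source and a sink in a stream. The class name, the transformation, and the stream wiring are illustrative assumptions, and the module packaging and registration steps are omitted:

    import org.springframework.integration.annotation.Transformer;

    // A hypothetical Spring XD processing unit: receives each message
    // payload from the upstream source, transforms it, and passes it
    // downstream (conceptually, something like "http | my-module | file")
    public class UpperCaseProcessor {

        @Transformer
        public String transform(String payload) {
            // Normalize each incoming message payload before it is stored
            return payload.toUpperCase();
        }
    }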

4. Hadoop: Unless you’ve been living on some secluded island without power and an Internet connection, you will have heard about the saviour of Big Data: Hadoop. Hadoop is very good at processing Big Data, but it is not a required tool for machine learning.

Hadoop is a framework for processing data in parallel. It does this using the MapReduce pattern, in which work is divided into blocks that are distributed across a cluster of machines; the map phase processes each block independently, and the reduce phase aggregates the intermediate results.
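
To make the pattern concrete, here is a sketch of the canonical word-count example written against the Hadoop Java API; the job driver that wires the two classes together and points them at input and output paths is omitted for brevity:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: each node emits (word, 1) for every word in its block of input
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the framework gathers all counts for each word,
    // and the reducer sums them into the final tally
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }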