Ready Steady Go: Machine Learning in practice


Before starting with Machine Learning, a few questions are worth clarifying: How do I apply Machine Learning in practice? Which tools fit my problem? And where can I find further information? This post from the everyday work of researchers at ML2R (now the Lamarr Institute) is part of our “ML Basics” series. It introduces the practical side and points to further resources. We explain how to proceed in three steps.

Step 1: Mindset

Machine Learning is often associated first with advanced mathematics and programming expertise, but it has become more accessible, even without much programming experience or a mathematics degree. A vegetable farmer from Japan (link in German) is one example: Makoto Koike automated the sorting of cucumbers with the help of an ML platform. His starting point: a strong interest and 7,000 cucumber images. But what does it take to apply Machine Learning independently?

Realistically, some understanding of calculus, linear algebra, and programming is needed to leave “base camp” and apply more than logistic regression to a basic data set. We believe that this basic understanding is achievable with the appropriate resources and time, and we have included a small selection of accessible resources at the end of this post. Here, however, we want to emphasize first and foremost that mindset plays a significant role, and we highlight two points:

Machine Learning is not magic.
It is wishful thinking to expect the desired results at the push of a button. Roughly 80% of the work lies in the application, especially in data preparation. Being prepared for this helps to avoid false expectations.
Machine Learning is a tool.
Behind the application, there is always a problem that needs to be solved. Sometimes the best solution is a simple rule-based or operational approach. Each approach comes with its own tools, which should be chosen to fit the problem.
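
To illustrate the second point, a rule-based approach can be only a few lines long. The following is a minimal sketch, not a recipe for real cucumber sorting: the function name `grade_cucumber`, the two features, and all thresholds are hypothetical and chosen purely for illustration.

```python
def grade_cucumber(length_cm: float, curvature: float) -> str:
    """Assign a grade using fixed, hand-written rules (hypothetical thresholds)."""
    if length_cm >= 25 and curvature <= 0.1:
        return "A"
    if length_cm >= 18 and curvature <= 0.3:
        return "B"
    return "C"


# Three made-up cucumbers as (length in cm, curvature) pairs.
samples = [(27.0, 0.05), (20.0, 0.2), (12.0, 0.5)]
grades = [grade_cucumber(length, curve) for length, curve in samples]
print(grades)  # ['A', 'B', 'C']
```

If rules like these already meet the requirements, a trained model may not be needed. If they do not, they still provide a baseline against which a model can be compared.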

Step 1 in one sentence: Mastering Machine Learning is achievable, and it is important to stay realistic about the goals.

Step 2: Structure

A systematic process is needed for robust and application-oriented solutions. Even for small projects, it pays to follow a process to avoid running into a dead end. A process can be developed by trial and error, or one can follow an existing, standardized process. The CRISP-DM cycle (Cross Industry Standard Process for Data Mining) provides a good basic framework, and we use it in almost all of our projects.

Figure: The CRISP-DM cycle. © Lamarr Institute

CRISP-DM consists of the following six phases:

Business Understanding – Identify the problem.
Data Understanding – Gather and examine data.
Data Preparation – Preprocess the data.
Modeling – Select methods, train models, and test them.
Evaluation – Check the results against the task.
Deployment – Report, model integration, etc.

The process is flexible in its execution. For example, the Evaluation phase may show that the results do not yet meet the objectives. In this case, one may jump back to the Business Understanding phase to explore further options, such as whether additional data is available that can be used to improve the models. The process serves to identify obstacles at an early stage: Is data available, or does it need to be obtained or even generated first? Is the data quality sufficient? Does the model performance meet the requirements? With a clear process, these questions can be addressed early.
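
The phases can be mirrored in the structure of a small script. The following is a minimal sketch using scikit-learn and its bundled Iris data set; the data set, the choice of logistic regression, and the target value `TARGET_ACCURACY` are assumptions made for illustration and are not part of CRISP-DM itself.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Business Understanding: state the objective up front (assumed target value).
TARGET_ACCURACY = 0.9

# Data Understanding: load the data and look at its size and number of classes.
X, y = load_iris(return_X_y=True)
print(X.shape, len(set(y.tolist())))

# Data Preparation: hold out a test set; scaling happens inside the pipeline.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Modeling: select a method and train it.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluation: check the result against the objective.
accuracy = accuracy_score(y_test, model.predict(X_test))
meets_objective = accuracy >= TARGET_ACCURACY
print(f"accuracy={accuracy:.2f}, meets objective: {meets_objective}")

# Deployment follows only if the objective is met; otherwise the cycle
# returns to an earlier phase.
next_step = "Deployment" if meets_objective else "Business Understanding"
print("next step:", next_step)
```

Writing the objective down before modeling makes the Evaluation step an explicit check rather than a judgment made after the fact.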

Step 2 in one sentence: A process facilitates the application.

Step 3: Tool

To leave the “base camp”, you need the right tool in addition to the right approach and a process. ML tools range from editors to programming languages to development environments and business intelligence tools. In principle, any programming language can be used for ML applications. Some programming languages offer suitable libraries and thus facilitate the implementation. If one has little programming experience, the following tools, among others, are suitable for getting started: WEKA (open source), RapidMiner (all-rounder), and KNIME (especially for data mining).

Programming languages: According to GitHub, Python is the most used programming language in ML. In second place is C++, and third and fourth place are occupied by JavaScript and Java:

1 Python
2 C++
3 JavaScript
4 Java
5 C#
6 Julia
7 Shell
8 R
9 TypeScript
10 Scala

Development environments: For anything beyond simple trial and error, a development environment is recommended. According to a report by Kaggle, the most popular development environment is JupyterLab, followed by Visual Studio Code and PyCharm.

Figure: Results of the Kaggle survey on which development environments respondents use. © Lamarr Institute

The Library Foundation: Once you have chosen a development environment, the libraries available for your programming language form the foundation for the actual work. In the following, we go over popular Python libraries that we also use at ML2R (now the Lamarr Institute).

Pandas – for data processing; handy for working with “DataFrames” and for reading and writing CSV files.
NumPy – for arrays and matrices.
SciPy – for scientific calculations such as linear algebra routines and optimization functions.
Scikit-learn – for classical algorithms (classification, regression, clustering, dimensionality reduction).
TensorFlow and PyTorch – for deep neural networks.
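
How the first three libraries fit together can be shown in a few lines. This is a minimal sketch with made-up data: the CSV content, the column names `x` and `y`, and the helper function `line` are hypothetical. An in-memory string stands in for a CSV file.

```python
import io

import numpy as np
import pandas as pd
from scipy import optimize

# pandas: read CSV data into a DataFrame.
csv_text = "x,y\n0,1.0\n1,3.0\n2,5.0\n3,7.0\n"
df = pd.read_csv(io.StringIO(csv_text))

# NumPy: work with the columns as arrays.
x = df["x"].to_numpy(dtype=float)
y = df["y"].to_numpy(dtype=float)
print("mean of y:", np.mean(y))


# SciPy: fit a straight line a*t + b to the data by least squares.
def line(t, a, b):
    return a * t + b


(a, b), _ = optimize.curve_fit(line, x, y)
print(f"a={a:.2f}, b={b:.2f}")
```

Because the made-up data lie exactly on the line y = 2x + 1, the fitted parameters should come out very close to a = 2 and b = 1.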

Application-specific tools: Depending on the use case and complexity, additional tools are necessary. Visualization is an essential task for a data scientist, serving to (a) gain clarity on the properties of the input data and (b) make the results of an ML algorithm more tangible. For visualization, Matplotlib and Plotly are recommended. In addition, Streamlit, for example, makes it possible to create an app quickly. For the analysis of image data, OpenCV is essential, while for the analysis of text data, spaCy is recommended. If it is already foreseeable that many model versions will be created, MLflow is worth considering. And if you need to handle large amounts of data, Apache Spark, a framework for cluster computing and parallel data processing, is a good place to start. These libraries offer an entry point and represent only a small part of what is available. For further requirements, you will quickly find what you are looking for in the wealth of available libraries.
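
As an example of purpose (a), a histogram is often the first look at an input feature. The following is a minimal Matplotlib sketch; the data are synthetic, and the axis label and output file name are hypothetical.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic values standing in for one input feature.
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=20.0, scale=3.0, size=200)

fig, ax = plt.subplots()
ax.hist(values, bins=10)
ax.set_xlabel("length (cm)")
ax.set_ylabel("count")
ax.set_title("Distribution of a synthetic input feature")

# Save the figure to the system's temporary directory.
output_path = Path(tempfile.gettempdir()) / "feature_histogram.png"
fig.savefig(output_path)
```

In a notebook environment such as JupyterLab, the figure can be displayed inline instead of being saved to a file.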

Step 3 in one sentence: The right tool depends on the goal.

The path to the ML application

After reading this post, don’t forget the following three steps:

Step 1: Mastering Machine Learning is achievable, and it’s important to stay realistic with your goals.
Step 2: A process facilitates the application.
Step 3: The right tool depends on the goal.

With these three steps, that is, with the mindset, a process, and the appropriate tool, you are equipped to enter the ML playing field. Then, there’s only one thing left to do: just get started! There are numerous resources available online for further learning and experimentation.

Here is a brief collection of helpful further resources:

How to get started (the basis for the presented top-down approach): Step-by-Step Guides
Books: among others, Best free Data-Science Books and Best Machine Learning Textbooks for Data Scientists
Courses and training: MOOCs, e.g., Coursera and Udemy data scientist courses

Katharina Beckh, 13 January 2021
