Machine Learning Workflow System

This subject is proposed as part of the ROCKFlows project.

Context

For many years, Machine Learning research has focused on designing new algorithms for solving similar kinds of problem instances (Kotthoff, 2016). However, researchers recognized long ago that no single algorithm gives the best performance across all problem instances; the No-Free-Lunch theorem (Wolpert, 1996), for example, states that the best classifier will not be the same on every dataset. Consequently, the “winner-take-all” approach should not lead to neglecting algorithms that, while uncompetitive on average, may offer excellent performance on particular problem instances. In 1976, Rice characterized this as the “algorithm selection problem” (Rice, 1976).

To support automatic selection of algorithms, portfolio approaches aim at performing per-instance algorithm selection (Leyton-Brown et al., 2003). When a portfolio contains products more complex than single algorithms (i.e. not only a set of software components but compositions of consistent software components), Software Product Lines (SPL) are a successful approach to increasing a product portfolio by up to an order of magnitude while providing a consistent user experience across it (Bosch, 2009). Software Product Line engineering is concerned with systematically reusing development assets in an application domain (Clements and Northrop, 2001)(Pohl et al., 2005).
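The intuition behind per-instance algorithm selection can be sketched simply: given the performance of several algorithms on previously characterized problem instances, a portfolio picks, for each new instance, the algorithm expected to perform best. The following is a minimal illustrative sketch; the meta-features, algorithm names, and performance records are invented for illustration, not taken from any real portfolio:

```python
# Crude 1-nearest-neighbour portfolio selector (illustrative data).
# Each known dataset is described by meta-features; for each we recorded
# which algorithm performed best. A new dataset inherits the algorithm
# of its nearest known neighbour in meta-feature space.
import math

# (n_instances, n_features, class_balance) -> best-performing algorithm
known = {
    (150, 4, 0.33): "decision_tree",
    (70000, 784, 0.10): "svm",
    (1000, 20, 0.50): "naive_bayes",
}

def select_algorithm(meta):
    """Return the algorithm recorded for the nearest known dataset."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = min(known, key=lambda k: dist(k, meta))
    return known[nearest]

print(select_algorithm((200, 5, 0.40)))  # prints "decision_tree"
```

Real portfolio systems replace the nearest-neighbour lookup with learned performance models, but the per-instance principle is the same.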

A Machine Learning (ML) Workflow can be defined as a tuple (h, p, c) where h represents a hyper-parameter tuning strategy, p a set of preprocessing techniques applied to the dataset, and c an ML algorithm used to learn a model from the processed data and then to predict over new data. The construction of a Machine Learning Workflow depends on two main aspects:

  • The structural characteristics (size, quality, and nature) of the collected data
  • How the results will be used.
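The tuple view of a workflow can be made concrete in code. The following is a minimal Python sketch; the class, strategy names, and pipeline steps are placeholders invented for illustration, not ROCKFlows APIs:

```python
# A workflow as a tuple (h, p, c): a hyper-parameter tuning strategy,
# a preprocessing pipeline, and a learning algorithm. All names here
# are illustrative placeholders.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MLWorkflow:
    h: str              # hyper-parameter tuning strategy
    p: List[Callable]   # preprocessing steps, applied in order
    c: str              # learning algorithm

    def run(self, data):
        """Apply each preprocessing step, then (conceptually) train c."""
        for step in self.p:
            data = step(data)
        return f"model({self.c}, tuned by {self.h}) on {data}"

def normalize(d):
    """Toy preprocessing step: scale values to the [0, 1] range."""
    return [x / max(d) for x in d]

wf = MLWorkflow(h="grid_search", p=[normalize], c="random_forest")
print(wf.run([1, 2, 4]))
```

Composing a workflow then amounts to choosing consistent values for each slot of the tuple, which is exactly where the two aspects above (data characteristics and intended use of the results) come into play.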

This task is highly complex because of the increasing number of available algorithms, the difficulty of choosing the correct preprocessing techniques together with the right algorithms, and the correct tuning of their parameters (Serban et al., 2013). When deciding which algorithm to choose, data scientists often consider families of algorithms in which they are experts, and may leave aside algorithms that are more “exotic” to them but could perform better for the problem they are trying to solve.

ROCKFlows is a project that aims to help users create their own Machine Learning Workflows by simply describing their dataset and objectives.

The approach is thus positioned differently from platforms that help select workflow components, such as Weka (Hall et al., 2009), Orange (Demsar et al., 2004), KNIME (Berthold et al., 2007), RapidMiner (Mierswa et al., 2006) or ClowdFlows (Kranjc et al., 2012). Such platforms offer many components that can be selected to create the desired workflows, but while these systems are useful for data scientists, they can be too complex and overwhelming for non-expert users. For such users, it may be more helpful to use a system that either provides the workflow to use directly, or at least suggests the workflow components suited to their specific problem.

Such a system is valuable enough that major companies have proposed their own solutions. The IBM Watson platform offers an interface to analyze unstructured data (for example, text files) and takes questions in plain English as input. Amazon's product is a black box similar to IBM's platform, but it focuses on supervised Machine Learning, in particular on classification and regression tasks; the workflow is built automatically by the platform by analyzing the input data. Microsoft Azure's solution, on the other hand, instead of choosing the solution automatically for the user, provides advice on which components to use to form the workflow. This advice is based both on Machine Learning best practices and on the algorithms available on the platform; that is, it covers a limited part of the possible Machine Learning workflows. For example, to solve the clustering problem, only the K-means algorithm is proposed, while the limitations of this technique are well known and several families of clustering algorithms have been proposed to overcome them.

Objectives

The main objective of this thesis is to explore the alliance between a portfolio and an SPL to automatically propose ML workflows according to end-user problems. The SPL is thus the link between the portfolio and the end-user. It manages the identification of the end-user's problem. It proposes solutions among which the end-user chooses according to her own criteria. It generates the corresponding code and can launch the experiments. It must be able to collect the results of the experiments to obtain feedback and eventually enrich the platform.

The thesis must address the following challenges: the relevance and quality of predictions, and scalability to manage the huge mass of ML workflows. To meet these challenges, attention should be paid to the following aspects:

  • Handling Variability: variability of compositions (e.g. identifying dominated workflows, managing requirements between workflow components); variability of performance metrics (e.g. dependencies among metrics); variability of data sets (e.g. images, text) and consequently of how we represent them (meta-features); variability of platforms; variability of algorithms and preprocessing algorithms (i.e. characterization to distinguish them and automate compositions (Salvador et al., 2016)); variability of hyper-parameter tuning strategies (i.e. dependency on workflows); etc.
  • Architecture of the portfolio: automatically managing (1) experiment execution, (2) collection of experiment results, (3) analysis of results, and (4) evolution of the algorithm base. It must support handling execution errors, incremental analyses, and identifying the context of experiments.
  • Handling Scalability of the Portfolio: selecting discriminating data sets; detecting “deprecated” algorithms and workflows from experiments and literature reviews; dealing with information from the scientific literature without deteriorating the portfolio's computed knowledge.
  • Ensuring global consistency of the Portfolio and the Software Product Line. Such a system is enriched by additions to the portfolio and by experiment feedback. As “knowledge” evolves (e.g., new data types, new metrics), the entire system needs to be updated. It is therefore necessary to find abstractions not only to manage these changes but also to optimize them (Bischl et al., 2016).
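One of the variability concerns above, identifying dominated workflows, can be illustrated with a simple Pareto filter over performance metrics. The workflow names and scores below are invented for illustration; a real portfolio would compare many more workflows over many more metrics:

```python
# Identifying dominated workflows: keep only workflows that are not
# Pareto-dominated on (accuracy, speed) -- both higher-is-better here.
# Names and scores are illustrative, not ROCKFlows data.

workflows = {
    "wf_tree":  (0.90, 0.95),   # (accuracy, speed)
    "wf_svm":   (0.93, 0.40),
    "wf_bayes": (0.85, 0.90),   # dominated by wf_tree on both metrics
}

def dominates(a, b):
    """a dominates b: at least as good everywhere, strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(wfs):
    """Keep only the workflows no other workflow dominates."""
    return {
        name: score for name, score in wfs.items()
        if not any(dominates(other, score)
                   for oname, other in wfs.items() if oname != name)
    }

print(sorted(pareto_front(workflows)))  # prints ['wf_svm', 'wf_tree']
```

Pruning dominated workflows early is one way to keep the portfolio tractable, since only workflows on the front need to be proposed to the end-user; dependencies among metrics, as noted above, complicate this picture.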

We have two years of experience on this subject, which has enabled us to (i) eliminate some approaches (e.g. modeling knowledge as a system of constraints, because on our current base it generates more than 6 billion constraints), (ii) lay the foundations of a platform for collecting experiments and presenting them to the user (Camillieri et al., 2016) (see http://rockflows.i3s.unice.fr/), (iii) study ML workflows in order to predict workflows (Master internships of Luca Parisi, Miguel Fabian Romero Rondon and Melissa Sanabria Rosas), (iv) address platform evolution by introducing deep learning workflows (see Melissa's report).

The thesis must investigate the research around algorithm selection, considering the automatic composition of workflows and supporting dynamic evolution. It is therefore a thesis in software engineering research, but one that addresses one of the most central current problems in machine learning.

Bibliography

Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2007) KNIME: The Konstanz Information Miner. Studies in Classification, Data Analysis, and Knowledge Organization, pp 319–326

Bischl B, Kerschke P, Kotthoff L, Lindauer MT, Malitsky Y, Fréchette A, Hoos HH, Hutter F, Leyton-Brown K, Tierney K, Vanschoren J (2016) ASlib: A benchmark library for algorithm selection. Artif Intell 237:41–58. doi: 10.1016/j.artint.2016.04.003

Bosch J (2009) From software product lines to software ecosystems. Proc 13th Int Softw Prod Line Conf 111–119. doi: 10.1016/j.jss.2012.03.039

Bourque P, Fairley RE (2014) SWEBOK: Guide to the Software Engineering Body of Knowledge, Version 3. IEEE Computer Society, Los Alamitos, CA

Camillieri C, Parisi L, Blay-Fornarino M, Precioso F, Riveill M, Cancela Vaz J (2016) Towards a Software Product Line for Machine Learning Workflows: Focus on Supporting Evolution. In: Proc. 10th Work. Model. Evol. co-located with ACM/IEEE 19th Int. Conf. Model Driven Eng. Lang. Syst. (MODELS 2016), Saint-Malo, France, pp 65–70

Clements P, Northrop LM (2001) Software Product Lines: Practices and Patterns. Addison-Wesley Professional

Demsar J, Zupan B, Leban G, Curk T (2004) Orange: From experimental machine learning to interactive data mining. PKDD, Lecture Notes in Computer Science 3202:537–539

Gomes CP, Selman B (2001) Algorithm portfolios. Artif Intell 126:43–62. doi: 10.1016/S0004-3702(00)00081-3

Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: An update. SIGKDD Explorations 11(1), URL http://www.cs.waikato.ac.nz/ml/weka/

Kranjc J, Podpečan V, Lavrač N (2012) ClowdFlows: A cloud based scientific workflow platform. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), pp 816–819

Kotthoff L (2016) Algorithm Selection for Combinatorial Search Problems: A Survey. In: Bessiere C, Raedt L De, Kotthoff L, Nijssen S, O’Sullivan B, Pedreschi D (eds) Data Min. Constraint Program. - Found. a Cross-Disciplinary Approach. Springer, pp 149–190

Leyton-Brown K, Nudelman E, Andrew G, McFadden J, Shoham Y (2003) A Portfolio Approach to Algorithm Selection. Int. Jt. Conf. Artif. Intell.

Mierswa I, Wurst M, Klinkenberg R, Scholz M, Euler T (2006) Yale: Rapid prototyping for complex data mining tasks. In: ACM SIGKDD international conference on Knowledge discovery and data mining, pp 935–940

Pohl K, Böckle G, van der Linden FJ (2005) Software Product Line Engineering: Foundations, Principles and Techniques. Springer-Verlag

Rice JR (1976) The Algorithm Selection Problem. Adv Comput 15:65–118.

Martin Salvador M, Budka M, Gabrys B (2016) Towards automatic composition of multicomponent predictive systems. Lect Notes Comput Sci (including Subser Lect Notes Artif Intell Lect Notes Bioinformatics). doi: 10.1007/978-3-319-32034-2_3

Serban F, Vanschoren J, Kietz J-U, Bernstein A (2013) A survey of intelligent assistants for data analysis. ACM Comput Surv. doi: 10.1145/2480741.2480748

Wolpert D (1996) The lack of a priori distinctions between learning algorithms. Neural Computation 8(7):1341–1390