Differences

This shows you the differences between two versions of the page.

students:phd_mlws [2017/05/20 22:47]
blay [Context]
students:phd_mlws [2017/05/28 20:03] (current)
blay [Context]
Line 1: Line 1:
 ====== Machine Learning Workflow System ======
  
-This subject is proposed as part of the [[http://rockflows.i3s.unice.fr/|ROCKFlows]] project involving the following researchers: [[http://mireilleblayfornarino.i3s.unice.fr|Mireille Blay-Fornarino]], [[http://www.i3s.unice.fr/~mosser/start|Sébastien Mosser]] and [[http://www.i3s.unice.fr/~precioso/|Frédéric Precioso]].
+This subject is proposed as part of the [[http://rockflows.i3s.unice.fr/|ROCKFlows]] project.
 ===== Context =====
 For many years, Machine Learning research has been focusing on designing new algorithms for solving similar kinds of problem instances (Kotthoff, 2016). However, Researchers have long ago recognized that a single algorithm will not give the best performance across all problem instances, e.g. the No-Free-Lunch-Theorem (Wolpert, 1996) states that the best classifier will not be the same on every dataset. Consequently, the “winner-take-all” approach should not lead to neglect some algorithms that, while uncompetitive on average, may offer excellent performances on particular problem instances. In 1976, Rice characterized this as the "algorithm selection problem" (Rice, 1976).
Line 11: Line 11:
          * The structural characteristics (size, quality, and nature) of the collected data
          * How the results will be used.
-This task is highly complex because of the increasing number of available algorithms, the difficulty in choosing the correct preprocessing techniques together with the right algorithms as well as the correct tuning of their parameters. To decide which algorithm to choose, data scientists often consider families of algorithms in which they are experts, and can leave aside algorithms that are more “exotic” to them, but could perform better for the problem they are trying to solve.
+This task is highly complex because of the increasing number of available algorithms, the difficulty in choosing the correct preprocessing techniques together with the right algorithms as well as the correct tuning of their parameters (Serban at al, 2013). To decide which algorithm to choose, data scientists often consider families of algorithms in which they are experts, and can leave aside algorithms that are more “exotic” to them, but could perform better for the problem they are trying to solve.
  
 ROCKFlows is a project aiming at helping users to create their own Machine Learning Workflows by simply describing their dataset and objectives.
Line 21: Line 21:
 ===== Objectives =====
  
-The main objective of this thesis is to explore the alliance between a portfolio and a SPL to automatically propose ML workflows according to end-user problems. So the SPL is the link between the portfolio and the end-user. It manages the identification of the end-user problem. It proposes solutions among which end-user chooses according to her own criteria. It generates the corresponding codes and, it could launch the experiment. It must be able to collect the results of the experiments to get feedbacks and eventually to enrich the platform.
+The main objective of this thesis is to explore the alliance between a portfolio and a SPL to automatically propose ML workflows according to end-user problems. So the SPL is the link between the portfolio and the end-user. It manages the identification of the end-user problem. It proposes solutions among which the end-user chooses according to her own criteria. It generates the corresponding codes and, it could launch the experiment. It must be able to collect the results of the experiments to get feedbacks and eventually to enrich the platform.
  
 The thesis must address the following challenges: Relevance and quality of predictions and Scalability to manage the huge mass of ML workflows.
 To meet these challenges, attention should be paid to the following aspects:
-        * //Handling Variabilities: // Variability of compositions (e.g. identifying dominated workflows, managing requirements between WF components); Variability of performance metrics (e.g. dependencies among metrics); Variability of Data Sets (e.g. images, text) and consequently meta features; Variability of platforms; Variability of algorithms and preprocessing algorithms (i.e. characterization to distinguish and automate the compositions); Variability of hyper-parameter tuning strategies (i.e. dependency with workflows); etc.
+        * //Handling Variabilities: // Variability of compositions (e.g. identifying dominated workflows, managing requirements between WF components); Variability of performance metrics (e.g. dependencies among metrics); Variability of Data Sets (e.g. images, text) and consequently how we represent them (meta features); Variability of platforms; Variability of algorithms and preprocessing algorithms (i.e. characterization to distinguish and automate the compositions (Salvador et al, 2016)); Variability of hyper-parameter tuning strategies (i.e. dependency with workflows); etc.
-        *// Architecture of portfolio// to automatically manage (1) experiment running, (2) collect of experiment results, (3) analyze of results, (4) evolution of algorithm base. It must support the management of execution errors, incremental analyzes, identifying context of experiments.
+        *// Architecture of the portfolio // automatically manage (1) experiment running, (2) collecting of experiment results, (3) analyzis of results, (4) evolution of algorithm base. It must support the management of execution errors, incremental analyzes, identifying context of experiments.
-        * //Handling Scalability of Portfolio: S//electing discriminating data sets; Detecting “deprecated” algorithms and WF from experiments and literature revues; Dealing with information from scientific literature without deteriorating portfolio computed knowledge.
+        * //Handling Scalability of the Portfolio: //Selecting discriminating data sets; Detecting “deprecated” algorithms and WF from experiments and literature revues; Dealing with information from scientific literature without deteriorating portfolio computed knowledge.
-        * //Ensuring global consistency// of Portfolio and Software Product Line. Such a system is enriched by additions to the portfolio and experiment feedbacks. As "knowledge" evolves (e.g., new data types, new metrics), the entire system needs to be updated. It is therefore to find abstractions not only to manage these changes but also to optimize them (Bischl et al. 2016).
+        * //Ensuring global consistency// of the Portfolio and Software Product Line. Such a system is enriched by additions to the portfolio and experiment feedbacks. As "knowledge" evolves (e.g., new data types, new metrics), the entire system needs to be updated. It is therefore to find abstractions not only to manage these changes but also to optimize them (Bischl et al. 2016).
  
-We have a two-year experience on this subject which has enabled us to (I) eliminate some approaches (e.g. modeling knowledge as a system of constraints because it generates on our current basis more than 6 billion constraints), (ii) lay the foundations for a platform for collecting experiences and presenting to the user (Camillieri et al., 2016) (see [[http:// http://rockflows.i3s.unice.fr/]]), (iii) study the ML workflows to predict workflows (Master internships Luca Parisi, Miguel Fabian Romero Rondon and Melissa Sanabria Rosas), (iv) address platform evolution introducing deep learning workflows (see Melissa’s Report).
+We have a two-year experience on this subject which has enabled us to (I) eliminate some approaches (e.g. modeling knowledge as a system of constraints because it generates on our current basis more than 6 billion constraints), (ii) lay the foundations for a platform for collecting experiences and presenting to the user (Camillieri et al., 2016) (see [[http:// http://rockflows.i3s.unice.fr/]]), (iii) study the ML workflows to predict workflows (Master internships Luca Parisi, Miguel Fabian Romero Rondon and Melissa Sanabria Rosas), (iv) address platform evolution by introducing deep learning workflows (see Melissa’s Report).
  
 The thesis must investigate the research around the selection of algorithms, considering the automatic composition of workflows and supporting dynamic evolutions. It is therefore a thesis in software engineering research but to address one of the current most central problems in machine learning.
Line 66: Line 66:
  
 Rice JR (1976) The Algorithm Selection Problem. Adv Comput 15:65–118.
+
+Martin Salvador M, Budka M, Gabrys B (2016) Towards automatic composition of multicomponent predictive systems. Lect Notes Comput Sci (including Subser Lect Notes Artif Intell Lect Notes Bioinformatics). doi: 10.1007/978-3-319-32034-2_3
+
+Serban F, Vanschoren J, Kietz J-U, Bernstein A (2013) A survey of intelligent assistants for data analysis. ACM Comput Surv. doi: 10.1145/2480741.2480748
  
 Wolpert D (1996) The lack of a priori distinctions between learning algorithms. Neural Computation 8(7):1341–1390