students:phd_mlws

Differences

This shows you the differences between two versions of the page.

students:phd_mlws [2017/05/20 09:10]
blay [Context]
students:phd_mlws [2017/05/20 22:47]
blay [Context]
Line 1: Line 1:
 ====== Machine Learning Workflow System ======
  
+This subject is proposed as part of the [[http://rockflows.i3s.unice.fr/|ROCKFlows]] project involving the following researchers: [[http://mireilleblayfornarino.i3s.unice.fr|Mireille Blay-Fornarino]], [[http://www.i3s.unice.fr/~mosser/start|Sébastien Mosser]] and [[http://www.i3s.unice.fr/~precioso/|Frédéric Precioso]].
 ===== Context =====
 For many years, Machine Learning research has focused on designing new algorithms for solving similar kinds of problem instances (Kotthoff, 2016). However, researchers recognized long ago that no single algorithm gives the best performance across all problem instances; for example, the No-Free-Lunch theorem (Wolpert, 1996) states that the best classifier will not be the same on every dataset. Consequently, a "winner-take-all" approach should not lead us to neglect algorithms that, while uncompetitive on average, may offer excellent performance on particular problem instances. Rice characterized this as the "algorithm selection problem" (Rice, 1976).
  
+To support automatic algorithm selection, portfolio approaches perform per-instance algorithm selection (Leyton et al., 2003). When a portfolio covers products more complex than single algorithms (i.e. not just a set of software components but a consistent composition of such components), Software Product Line (SPL) engineering is a successful approach that can increase the product portfolio by up to an order of magnitude and provide a consistent user experience across the portfolio (Bosch J, 2009). Software Product Line engineering is concerned with systematically reusing development assets in an application domain (Clements et al., 2001) (Pohl et al., 2005).
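To make the contrast between a "winner-take-all" choice and per-instance selection concrete, here is a minimal sketch. It is not the method of any of the cited works: the meta-features, the algorithm names, and every accuracy value in `HISTORY` are invented for illustration, and nearest-neighbour lookup is just one simple way a portfolio could pick an algorithm.

```python
from math import dist
from typing import Dict, Sequence, Tuple

MetaFeatures = Tuple[float, float]  # assumed: (log10 of row count, fraction of missing values)

# Hypothetical past experiments (all numbers invented for illustration):
# meta-features of a dataset -> accuracy observed for each algorithm.
HISTORY: Sequence[Tuple[MetaFeatures, Dict[str, float]]] = (
    ((2.0, 0.00), {"knn": 0.91, "decision_tree": 0.85, "naive_bayes": 0.80}),
    ((5.0, 0.00), {"knn": 0.78, "decision_tree": 0.88, "naive_bayes": 0.83}),
    ((3.0, 0.30), {"knn": 0.60, "decision_tree": 0.74, "naive_bayes": 0.79}),
)


def best_on_average() -> str:
    """Winner-take-all: the single algorithm with the best mean accuracy."""
    names = HISTORY[0][1].keys()
    return max(names, key=lambda n: sum(scores[n] for _, scores in HISTORY))


def select_algorithm(meta: MetaFeatures) -> str:
    """Per-instance selection: reuse the winner of the most similar past dataset."""
    _, scores = min(HISTORY, key=lambda entry: dist(entry[0], meta))
    return max(scores, key=scores.get)


small_clean = select_algorithm((2.2, 0.0))
noisy = select_algorithm((3.1, 0.25))
print(best_on_average(), small_clean, noisy)
```

In this toy history, the algorithm that wins on average is not the one selected for the small clean dataset or for the noisy one, which is the situation the portfolio approach is meant to exploit.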
  
 A Machine Learning (ML) Workflow can be defined as a tuple (h,p,c) where h represents the hyper-parameter tuning strategy, p represents a set of preprocessing techniques applied to the dataset, and c is an ML algorithm used to learn a model from the processed data and then to make predictions on new data.
 The construction of a Machine Learning Workflow depends on two main aspects:
+  * The structural characteristics (size, quality, and nature) of the collected data.
+  * How the results will be used.
 This task is highly complex because of the growing number of available algorithms and the difficulty of choosing the correct preprocessing techniques together with the right algorithms, as well as correctly tuning their parameters. To decide which algorithm to choose, data scientists often consider only the families of algorithms in which they are experts, and may leave aside algorithms that are more "exotic" to them but could perform better on the problem they are trying to solve.
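The (h,p,c) tuple and the size of the resulting search space can be sketched as follows. This is only an illustration under stated assumptions: the component names, the catalogue contents, and the compatibility rule in `is_consistent` are hypothetical and are not taken from the ROCKFlows project.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterator, Tuple


@dataclass(frozen=True)
class Workflow:
    """A workflow (h, p, c); representing components by name is an assumption."""
    h: str              # hyper-parameter tuning strategy
    p: Tuple[str, ...]  # preprocessing techniques, applied in order
    c: str              # learning algorithm


# Hypothetical component catalogue.
TUNING_STRATEGIES = ("default", "grid_search", "random_search")
PREPROCESSINGS = ((), ("scale",), ("impute",), ("impute", "scale"))
ALGORITHMS = ("decision_tree", "svm", "knn", "naive_bayes")


def is_consistent(w: Workflow) -> bool:
    """Hypothetical constraint: distance-based algorithms require scaling."""
    if w.c in ("svm", "knn"):
        return "scale" in w.p
    return True


def candidate_workflows() -> Iterator[Workflow]:
    """Enumerate every combination and keep only the consistent ones."""
    for h, p, c in product(TUNING_STRATEGIES, PREPROCESSINGS, ALGORITHMS):
        w = Workflow(h=h, p=p, c=c)
        if is_consistent(w):
            yield w


all_combinations = len(TUNING_STRATEGIES) * len(PREPROCESSINGS) * len(ALGORITHMS)
consistent = list(candidate_workflows())
print(all_combinations, len(consistent))
```

Even this tiny catalogue yields dozens of candidate workflows, and each added component multiplies the count, which is why exploring the space by hand quickly becomes impractical.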
  
Line 16: Line 17:
 The approach is thus positioned differently from platforms that help select workflow components, such as Weka (Hall et al, 2009), Orange (Demsar et al, 2004), KNIME (Berthold et al, 2007), RapidMiner (Mierswa et al, 2006) or ClowdFlows (Kranjc et al, 2012). Such platforms offer many components that can be selected to create the desired workflows; while these systems are useful for data scientists, they can be too complex and overwhelming for non-expert users. For such users, it may be more helpful to use a system that either directly provides the workflow to use, or at least suggests workflow components for their specific problem.
  
+Such a system is valuable enough that large companies have proposed their own knowledge flows. The IBM Watson platform offers an interface to analyze unstructured data (for example text files) and takes questions in plain English as input. Amazon's product is a black box similar to IBM's platform, but it focuses on supervised Machine Learning, in particular on classification and regression tasks. The workflow is built automatically by the platform by analyzing the input data. By contrast, the solution proposed by Microsoft Azure does not choose the solution automatically for the user; instead, it advises users on which components to use to form the workflow.
+This advice is based both on best practices in Machine Learning and on the algorithms available on the platform; that is, it covers only a limited part of the possible Machine Learning workflows. For example, to solve the clustering problem, only the K-means algorithm is proposed, although the limitations of this technique are well known and several families of clustering algorithms have been proposed to overcome them.
  
 ===== Objectives =====
students/phd_mlws.txt · Last modified: 2017/05/28 20:03 by blay