Data scientists perform services ranging from data infrastructure design and management to data analysis of various kinds. In my consulting I focus on the data analysis side, and in particular on so-called machine learning tasks, which is what I will discuss here.

Generally, machine learning refers to the software implementation of mathematical models that extract from historical data the information needed to respond, usually automatically, to new data instances.

For example, spam filters predict whether to designate a new email as “junk”, based on how the user designated a large number of previous messages. A property-valuation site suggests a sale price for a new home, given its location and other attributes, based on a database of previous sales.
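
To make the spam-filter idea concrete, here is a minimal sketch using scikit-learn; the messages and labels are toy data invented for illustration, not drawn from any real system:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Historical data: messages the user has already labelled.
messages = [
    "win a free prize now",
    "meeting rescheduled to 3pm",
    "cheap pills limited offer",
    "quarterly report attached",
]
labels = ["junk", "ham", "junk", "ham"]

# Extract word counts and fit a simple Naive Bayes classifier.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
model = MultinomialNB().fit(X, labels)

# Respond automatically to a new data instance.
new_message = ["free offer, click now"]
print(model.predict(vectorizer.transform(new_message)))  # e.g. ['junk']
```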

Machine learning models are usually adaptive in the sense that new data is continually assimilated to improve their performance over time.
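
As one illustration of this adaptive behaviour, scikit-learn's SGDClassifier can assimilate new batches of labelled data via its partial_fit method, rather than being retrained from scratch each time; the data below is synthetic:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()

# Initial training batch (the full set of classes must be declared
# on the first call to partial_fit).
X0, y0 = rng.normal(size=(100, 5)), rng.integers(0, 2, 100)
model.partial_fit(X0, y0, classes=[0, 1])

# Later, as new labelled data arrives, update the same model in place.
X1, y1 = rng.normal(size=(20, 5)), rng.integers(0, 2, 20)
model.partial_fit(X1, y1)
```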

Typical machine learning tasks

Here is a non-exhaustive list of tasks a data scientist can help you carry out:

  • Sales forecasting for inventory management

  • Recommender systems, e.g., for movies

  • Credit scoring and next-best offers

  • Pricing of goods and services, e.g., property and insurance policies

  • Real-time selection of web-page advertising

  • Text-based sentiment analysis

  • Content relevance tasks, such as web search results

  • Fraud detection

  • Risk assessment

  • Prediction of equipment failures

  • Network intrusion detection

  • Automated message filtering/prioritisation

  • Market segmentation

The data scientist’s edge

It is difficult to implement machine learning tasks without detailed knowledge of learning algorithms (and their limitations!), guided by a proper grounding in statistical theory.

Any experienced data scientist will tell you that no single model or algorithm is best for all tasks. While an out-of-the-box solution may sometimes be appropriate, most models benefit significantly from careful tuning, and a data scientist may try several models before settling on a particular one. In an advanced technique known as stacking, a data scientist may even combine several models to gain that extra edge.
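
As a rough sketch of stacking, scikit-learn provides a StackingClassifier that feeds the out-of-fold predictions of several base models into a final combining model; the particular models and dataset below are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # combines the base predictions
    cv=5,  # base-model predictions are generated out-of-fold to avoid leakage
)
stack.fit(X, y)
```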

A crucial role played by the data scientist is in the interpretation of a model’s output. For a start, a poorly chosen or improperly trained model can easily generate garbage responses, and troubleshooting is difficult without an understanding of the model’s inner workings. For a subtler example, consider the problem of selecting a home-valuation model for a bank. One might be tempted to choose the model with the lowest error, as estimated on some test data. However, from a risk-management perspective, the bank may actually prefer a model with a larger average error if greater certainty can be associated with the error estimate. Addressing issues such as these requires some mathematical and statistical sophistication.
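
One way to sketch this trade-off is to compare candidate models not only by their mean cross-validated error but also by the spread of that estimate; a slightly higher mean error with a much tighter spread may be the safer choice. The dataset and models here are placeholders for a real valuation problem:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, noise=10.0, random_state=0)

candidates = [("ridge", Ridge()),
              ("boosting", GradientBoostingRegressor(random_state=0))]

for name, model in candidates:
    # Negate because scikit-learn reports errors as negative scores.
    errors = -cross_val_score(model, X, y, cv=10,
                              scoring="neg_mean_absolute_error")
    # Report both the mean error and its variability across folds.
    print(f"{name}: mean error {errors.mean():.1f} ± {errors.std():.1f}")
```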

To get an idea of the complexity of model selection, check out this (somewhat outdated) cheat-sheet from scikit-learn (a Python machine learning library), originally created by Andreas Mueller:

[Figure: scikit-learn algorithm cheat-sheet]

Why the hype now?

Data science is a relatively new field because:

  • The collection of the large amounts of data needed to effectively train models was previously limited to a small number of large enterprises. Nowadays, any operation with a website routinely collects data.

  • The computational resources needed to train (or even develop) many models were not historically available. This meant that data analysis tasks (such as credit scoring or insurance pricing, for example) relied heavily on specialised domain knowledge, ad hoc approaches, and simple (less accurate) models that were computationally inexpensive.