Stories by Ahmed El Deeb on Medium

So what is Machine Learning good for anyway?

Ahmed El Deeb — Mon, 20 Mar 2017 00:03:22 GMT

A few weeks ago, a friend and colleague (Alex G.) asked me this question. He was aware of recent breakthroughs in the world of Machine Learning, esp. where it came to computer vision. Popular media picked up these relatively recent successes and brought them to the center of attention and then started projecting a future where machines are going to render humans obsolete. So he asked me: “Other than the recent breakthroughs in computer vision, what are the other success stories for machine learning?”

At first I was quite surprised by the question. I have been using ML techniques for several years now and was thoroughly convinced of their usefulness and feasibility. I thought: surely, everyone must see it the same way! But as I started formulating my answer, I was struck by how hard I found it to come up with a satisfying one. I thought it must be one of those brain-freeze moments where you can’t seem to capture the simplest of words or ideas, but at that moment, my own answer sounded unconvincing to my own ears. Being a generous man, he didn’t call me out on my terrible answer, but went on to ask: “… and what is it that makes a particular problem more or less amenable to ML techniques?”

The State of Machine Learning

So what is machine learning good for? What problems has it solved for us? In what fields has it proven itself worth the investment and the hype? And should we expect it to succeed in the future at solving more problems in other domains, or is this just naive optimism? Where are we in the hype cycle of “Deep Learning”? Are we heading towards a new AI winter?

I’ll try to make the case that machine learning has already paid off, and that optimism in its future is well warranted (even if it is currently somewhat overhyped).

Spam or ham?

One of the early successes of machine learning was in the domain of text “understanding”, or more precisely, document classification. Take the simple task of filtering out spam email, without which email would not really be viable and the world as we know it today would not exist. Most of the techniques used for document classification don’t actually rely on any kind of language understanding, but rely instead on the occurrence of various words and phrases.

Are we feeling lucky?

Robust and adaptive spam filters allowed us to take email back from the spammers, but an arguably more important success was the use of Machine Learning in search engines. Early search engine ranking systems relied on hand-crafted formulas, but very quickly machine learning and statistical models became the most important building blocks of search engines, from the filtering of spammy pages, to spell-correcting user queries and augmenting them with extra keywords, to the actual ranking of web results, to extracting relevant parts of page contents for snippets. Without machine learning, Google would not be the search engine we know today, and reaching information about anything would be at least twice as hard. It’s hard to overstate how big an impact this had on our modern world.

The damn ads!

We all hate ads, but there had to be a way for companies like Google to make money. And make money they do! Once the internet became more efficient than old-fashioned media at targeting specific user groups, say, based on interest in some specific type of content, it started to really pay off. And that was just the beginning: with Machine Learning based ad targeting and selection systems, and platforms like Facebook, Google, and Amazon that had information-rich user profiles, ad targeting became more and more efficient. In fact, in most major internet companies, you’ll find a significant portion of the Machine Learning talent siphoned into the teams building the ad targeting and optimization platforms. This can be seen as unfortunate, but ads are less annoying when they are actually relevant, so let’s hope for better, privacy-preserving ad-targeting systems that enable a better balance of content to ads and make the web a better place for everyone.

A tad too much?

Now you see me …

More recently and maybe more famously, computers became good at seeing things and “understanding” them. Some applications of turning written numbers or words into digital form have been around for decades and are quite useful in their own right, but with recent advances in Machine Learning algorithms, computers started becoming good at detecting objects in images: detecting faces in an image and where they are, detecting smiles and frowns, detecting cats and dogs and cars, detecting several objects in a single image and characterizing the spatial relationships between them. These algorithms had immediate applications for social networks and camera software, but the more momentous consequence was enabling advances in autonomous vehicles. Autonomous vehicles require more than just seeing and understanding images, and are not yet quite a success. But it’s only a matter of time, and the impact they will have on our world is going to be massive, even by conservative estimates. The effect on transportation is going to be significant, but the effect on jobs and employment is going to be more disruptive and more dangerous. It’s hard to estimate how quickly our societies are going to cope with such a big and quick change.

In a parallel vein, similar algorithms and techniques were used to make computers hear the spoken word and convert speech into readable, digitized text. Converting speech to text is not quite as good as actually understanding the content of the text though.

What does the future hold?

General AI is still out of our reach. There are maybe three to five major steps that AI needs to make before general AI comes within reach. The first such step or building block is language understanding. By language understanding I mean the ability to read a paragraph of text and build a representation of its meaning in an actionable way that is comparable to how humans understand language. It is interesting to note that we have been relatively more successful in understanding images than in understanding natural language. I attribute that to the fact that images as a representation usually have more redundancy, while language is usually constructed with just enough redundancy to make it discernible to an intelligent human.

Another step or building block is long-term strategic planning. We have a form of that for playing games like chess and go. But achieving the same for general situations where the rules of the game are more ambiguous is still out of reach. Recent breakthroughs have reinforcement learning algorithms play video games without knowing the rules a priori, just by looking at the screen’s visual information, the way humans learn to play such games. True, these games are usually tactical and short-term, but I can’t help but feel that we’re getting close to achieving this one.

Mass coordination and hive intelligence are not something we often think about when we think about AI. But the potential for machines coordinating on a much greater scale than humans is one of machines’ few edges over humans, and may lead to fundamentally different ways of solving problems than traditional human intelligence would.

One of the final steps is perhaps developing a unified system of rewards and incentives. Or even a system of laws and bounds. This aspect has been in the realm of science fiction for decades and may be the aspect we should be focusing on the most. It’s not a requirement of AI as much as a requirement for avoiding that apocalyptic future where machines take over the world.

So what is Machine Learning good for anyway? was originally published in Rants on Machine Learning on Medium, where people are continuing the conversation by highlighting and responding to this story.

What to do with “small” data?

Ahmed El Deeb — Tue, 06 Oct 2015 05:18:39 GMT

By Ahmed El Deeb

Many technology companies now have teams of smart data-scientists, versed in big-data infrastructure tools and machine learning algorithms, but every now and then, a data set with very few data points turns up and none of these algorithms seem to be working properly anymore. What the hell is happening? What can you do about it?

Where do small data come from?

Most data science, relevance, and machine learning activities in technology companies have been focused around “Big Data” and scenarios with huge data sets. Sets where the rows represent documents, users, files, queries, songs, images, etc. Things that are in the thousands, hundreds of thousands, millions or even billions. The infrastructure, tools, and algorithms to deal with these kinds of data sets have been evolving very quickly and improving continuously during the last decade or so. And most data scientists and machine learning practitioners have gained experience in such situations, have grown accustomed to the appropriate algorithms, and gained good intuitions about the usual trade-offs (bias-variance, flexibility-stability, hand-crafted features vs. feature learning, etc.). But small data sets still arise in the wild every now and then, and often, they are trickier to handle, and require a different set of algorithms and a different set of skills. Small data sets arise in several situations:

Enterprise Solutions: when you try to build a solution for an enterprise with a relatively limited number of members instead of a single solution for thousands of users, or when you are building a solution in which companies, rather than individuals, are the focus of the experience.
Time Series: Time is in short supply! Esp. in comparison with users, queries, sessions, documents, etc. This obviously depends on the time unit or sampling rate, but it’s not always easy to increase the sampling rate effectively, and if your ground truth is a daily number, then you have one data point for each day.
Aggregate modeling of states, countries, sports teams, or any situation where the population itself is limited (or sampling is really expensive).
Modeling of rare phenomena of any kind: Earthquakes, floods, etc.

Small Data problems

Problems of small-data are numerous, but mainly revolve around high variance:

Over-fitting becomes much harder to avoid
You don’t only over-fit to your training data, but sometimes you over-fit to your validation set as well.
Outliers become much more dangerous.
Noise in general becomes a real issue, be it in your target variable or in some of the features.

So what to do in these situations?

1- Hire a statistician

I’m not kidding! Statisticians are the original data scientists. The field of statistics was developed when data was much harder to come by, and as such was very aware of small-sample problems. Statistical tests, parametric models, bootstrapping, and other useful mathematical tools are the domain of classical statistics, not modern machine learning. Lacking a good general-purpose statistician, get a marine-biologist, a zoologist, a psychologist, or anyone who was trained in a domain that deals with small sample experiments. The closer to your domain the better. If you don’t want to hire a statistician full time on your team, make it a temporary consultation. But hiring a classically trained statistician could be a very good investment.

2- Stick to simple models

More precisely: stick to a limited set of hypotheses. One way to look at predictive modeling is as a search problem. From an initial set of possible models, which is the most appropriate model to fit our data? In a way, each data point we use for fitting down-votes all models that make it unlikely, or up-votes models that agree with it. When you have heaps of data, you can afford to explore huge sets of models/hypotheses effectively and end up with one that is suitable. When you don’t have so many data points to begin with, you need to start from a fairly small set of possible hypotheses (e.g. the set of all linear models with 3 non-zero weights, the set of decision trees with depth <= 4, the set of histograms with 10 equally-spaced bins). This means that you rule out complex hypotheses like those that deal with non-linearity or feature interactions. This also means that you can’t afford to fit models with too many degrees of freedom (too many weights or parameters). Whenever appropriate, use strong assumptions (e.g. no negative weights, no interaction between features, specific distributions, etc.) to restrict the space of possible hypotheses.

Any crazy model can fit this single data point (drawn from distribution around yellow curve)

As we have more points, fewer and fewer models can reasonably explain them together.

The figures were taken from Chris Bishop’s book Pattern Recognition and Machine Learning
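To make the "limited set of hypotheses" point concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the true curve (a sine wave), the noise level, and the choice of six training points per trial. It compares a 2-parameter straight line against a polynomial with as many parameters as data points.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial of degree n-1 that passes through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict

def truth(x):
    # Assumed ground truth, chosen only for illustration.
    return math.sin(2 * math.pi * x)

def average_test_mse(fit, n_points=6, noise=0.3, trials=200, seed=0):
    """Average error against the true curve over many small training sets."""
    rng = random.Random(seed)
    grid = [i / 100 for i in range(101)]
    total = 0.0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n_points)]
        ys = [truth(x) + rng.gauss(0, noise) for x in xs]
        model = fit(xs, ys)
        total += sum((model(x) - truth(x)) ** 2 for x in grid) / len(grid)
    return total / trials

linear_mse = average_test_mse(fit_line)
flexible_mse = average_test_mse(fit_interpolating_polynomial)
print(f"2-parameter line:       {linear_mse:.2f}")
print(f"6-parameter polynomial: {flexible_mse:.2f}")
```

The line is the wrong shape for a sine wave, yet on so few noisy points it should still come out ahead on average, because the flexible model spends its freedom fitting the noise.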

3- Pool data when possible

Are you building a personalized spam filter? Try building it on top of a universal model trained for all users. Are you modeling GDP for a specific country? Try fitting your models on GDP for all countries for which you can get data, maybe using importance sampling to emphasize the country you’re interested in. Are you trying to predict the eruptions of a specific volcano? … you get the idea.

4- Limit Experimentation

Don’t over-use your validation set. If you try too many different techniques, and use a hold-out set to compare between them, be aware of the statistical power of the results you are getting, and be aware that the performance you are getting on this set is not a good estimator for out of sample performance.
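A quick way to convince yourself of this is to compare models that are pure coin flips. The sketch below is a toy simulation with made-up sizes (200 candidate models, 30 validation points): every candidate has a true accuracy of exactly 50%, yet the best validation score among them looks much better than that.

```python
import random

def best_of_many_random_models(n_models=200, n_validation=30, seed=0):
    """Best validation accuracy among models that guess labels at random."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_validation)]
    best = 0.0
    for _ in range(n_models):
        guesses = [rng.randint(0, 1) for _ in range(n_validation)]
        accuracy = sum(g == y for g, y in zip(guesses, labels)) / n_validation
        best = max(best, accuracy)
    return best

best_accuracy = best_of_many_random_models()
print(f"best validation accuracy of a coin flip: {best_accuracy:.2f}")
```

The winner's score is an artifact of picking the maximum of many noisy estimates, which is exactly what happens to a small hold-out set that gets used for too many comparisons.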

5- Do clean up your data

With small data sets, noise and outliers are especially troublesome. Cleaning up your data could be crucial here to get sensible models. Alternatively you can restrict your modeling to techniques especially designed to be robust to outliers (e.g. Quantile Regression).

6- Do perform feature selection

I am not a big fan of explicit feature selection. I typically go for regularization and model averaging (next two points) to avoid over-fitting. But if the data is truly limiting, sometimes explicit feature selection is essential. Wherever possible, use domain expertise to do feature selection or elimination, as brute force approaches (e.g. all subsets or greedy forward selection) are as likely to cause over-fitting as including all features.

7- Do use Regularization

Regularization is an almost-magical solution that constrains model fitting and reduces the effective degrees of freedom without reducing the actual number of parameters in the model. L1 regularization produces models with fewer non-zero parameters, effectively performing implicit feature selection, which could be desirable for explainability or performance in production, while L2 regularization produces models with more conservative (closer to zero) parameters and is effectively similar to having strong zero-centered priors for the parameters (in the Bayesian world). L2 is usually better for prediction accuracy than L1.

L1 Regularization can push most model parameters to zero
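The difference between the two penalties is easiest to see in the special case of orthonormal features, where both have closed-form solutions in terms of the unregularized least-squares weights. The sketch below assumes that special case, and the weights and penalty strength are made-up numbers: L2 scales every weight down, while L1 subtracts a fixed amount and clips at zero.

```python
def l2_shrink(w_ols, lam):
    """Minimizer of 0.5*(w - w_ols)^2 + 0.5*lam*w^2, applied per weight."""
    return [w / (1.0 + lam) for w in w_ols]

def soft_threshold(w, lam):
    """Minimizer of 0.5*(w - w_ols)^2 + lam*|w| for a single weight."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

def l1_shrink(w_ols, lam):
    return [soft_threshold(w, lam) for w in w_ols]

# Hypothetical unregularized weights: two strong features, three weak ones.
w_ols = [2.0, -1.5, 0.3, -0.2, 0.05]
l2_weights = l2_shrink(w_ols, lam=0.5)
l1_weights = l1_shrink(w_ols, lam=0.5)
print("L2:", [round(w, 3) for w in l2_weights])  # all smaller, none exactly zero
print("L1:", l1_weights)                         # [1.5, -1.0, 0.0, 0.0, 0.0]
```

With correlated features there is no such per-weight formula, but the qualitative behavior (shrink everything vs. zero out the weak ones) is the same.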

8- Do use Model Averaging

Model averaging has similar effects to regularization in that it reduces variance and enhances generalization, but it is a generic technique that can be used with any type of model or even with heterogeneous sets of models. The downside here is that you end up with huge collections of models, which could be slow to evaluate or awkward to deploy to a production system. Two very reasonable forms of model averaging are Bagging and Bayesian model averaging.

Each of the red curves is a model fitted on a few data points

But averaging all these high variance models gets us a smooth output that is remarkably close to the original distribution (figure from Pattern Recognition and Machine Learning)
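Here is a small bagging sketch in the same spirit. The setup is assumed for illustration only: a sine curve as ground truth, 30 noisy points, and a 1-nearest-neighbour regressor standing in for the high-variance base model. Each ensemble member is fitted on a bootstrap resample and the predictions are averaged.

```python
import math
import random

def nearest_neighbour_model(points):
    """1-nearest-neighbour regressor: a flexible, high-variance model."""
    def predict(x):
        return min(points, key=lambda p: abs(p[0] - x))[1]
    return predict

def bagged_model(points, n_models, rng):
    """Average of models, each fitted on a bootstrap resample of the data."""
    models = []
    for _ in range(n_models):
        resample = [rng.choice(points) for _ in points]
        models.append(nearest_neighbour_model(resample))
    return lambda x: sum(m(x) for m in models) / len(models)

def truth(x):
    # Assumed ground truth, chosen only for illustration.
    return math.sin(2 * math.pi * x)

def compare(n_points=30, noise=0.3, n_models=50, trials=10, seed=0):
    """Mean squared error against the true curve: single model vs. bagged."""
    rng = random.Random(seed)
    grid = [i / 50 for i in range(51)]
    single_total = bagged_total = 0.0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n_points)]
        points = [(x, truth(x) + rng.gauss(0, noise)) for x in xs]
        single = nearest_neighbour_model(points)
        bagged = bagged_model(points, n_models, rng)
        single_total += sum((single(x) - truth(x)) ** 2 for x in grid) / len(grid)
        bagged_total += sum((bagged(x) - truth(x)) ** 2 for x in grid) / len(grid)
    return single_total / trials, bagged_total / trials

single_mse, bagged_mse = compare()
print(f"single model: {single_mse:.3f}   bagged: {bagged_mse:.3f}")
```

Note the cost mentioned above: the bagged predictor here carries 50 models around instead of one.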

9- Try Bayesian Modeling and Model Averaging

Again, not a favorite technique of mine, but Bayesian inference may be well suited for dealing with smaller data sets, especially if you can use domain expertise to construct sensible priors.

10- Prefer Confidence Intervals to Point Estimates

It is usually a good idea to get an estimate of confidence in your prediction in addition to producing the prediction itself. For regression analysis this usually takes the form of predicting a range of values that is calibrated to cover the true value 95% of the time or in the case of classification it could be just a matter of producing class probabilities. This becomes more crucial with small data sets as it becomes more likely that certain regions in your feature space are less represented than others. Model averaging as referred to in the previous two points allows us to do that pretty easily in a generic way for regression, classification and density estimation. It is also useful to do that when evaluating your models. Producing confidence intervals on the metrics you are using to compare model performance is likely to save you from jumping to many wrong conclusions.

Parts of the feature space are likely to be less covered by your data and prediction confidence within these regions should reflect that

Bootstrapped performance charts from ROCR
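For the evaluation side, a percentile bootstrap is probably the simplest way to put an interval around a metric. The sketch below is one possible implementation, not the method used to produce the charts above, and the validation results in it are made up (40 examples, 32 of them classified correctly).

```python
import random

def bootstrap_ci(values, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for statistic(values)."""
    rng = random.Random(seed)
    n = len(values)
    stats = sorted(
        statistic([rng.choice(values) for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical per-example results on a small validation set: 1 = correct.
correct = [1] * 32 + [0] * 8
low, high = bootstrap_ci(correct, mean)
print(f"accuracy = {mean(correct):.2f}, 95% CI roughly [{low:.2f}, {high:.2f}]")
```

On 40 examples the interval is wide, which is the point: two models whose intervals overlap heavily have not really been told apart.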

Summary

This could be a somewhat long list of things to do or try, but they all revolve around three main themes: constrained modeling, smoothing and quantification of uncertainty.

Most figures used in this post were taken from the book “Pattern Recognition and Machine Learning” by Christopher Bishop.

What to do with “small” data? was originally published in Rants on Machine Learning on Medium, where people are continuing the conversation by highlighting and responding to this story.

Interview Questions for Data Scientist Positions (Part II)

Ahmed El Deeb — Tue, 30 Jun 2015 19:50:39 GMT

In a previous post, I argued for the need for a different kind of interview question for data science and machine learning engineers. I then listed some questions I thought good for gauging data science knowledge and cleverness. I extend the list here with some more questions:

On some online stores, you notice reviews for multiple-installment novels follow a peculiar trend that goes slightly up for each installment, even though the number of reviews goes down. What do you think is happening here?

Selection bias. Plain and simple. Mostly those who liked the first installment (or who are hard-core fans of the author) go on to read subsequent installments, and hence are more likely to have a favorable opinion of the book and the writer in general.

OK, that makes sense. Now what can we do to de-bias these ratings and get a score on which we can compare novels on an equal footing?

This could be solved by stratified re-sampling, or by creating a new scoring system that combines average rating with number of reviews into a single score.
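One possible form of such a combined score (an assumption on my part, there are many others) is a shrunk average: pull each novel's mean rating toward a global prior, with the pull weakening as the number of reviews grows. The prior mean and prior weight below are hypothetical values.

```python
def shrunk_rating(ratings_sum, n_reviews, prior_mean=4.0, prior_weight=20):
    """Average rating pulled toward prior_mean; prior_weight acts like phantom reviews."""
    return (prior_weight * prior_mean + ratings_sum) / (prior_weight + n_reviews)

# A late installment with 5 perfect reviews vs. a first installment with 500 reviews at 4.6.
late_installment = shrunk_rating(ratings_sum=5 * 5.0, n_reviews=5)
first_installment = shrunk_rating(ratings_sum=500 * 4.6, n_reviews=500)
print(round(late_installment, 2), round(first_installment, 2))
```

This does not remove the selection bias by itself, but it stops a handful of enthusiastic reviews from outranking a large, broad sample.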

You are chief scientist for the air forces in WW II and you are tasked with making air strikes safer for fighter pilots (i.e. you want more of them to come back). You personally inspect damaged planes after coming back from battle (say 70% of the planes make it back on average, and 20% are damaged). You find that bullet damage is distributed in a highly non-uniform way (e.g. way more bullets in the wings region than is merited by their area). What could be the reason for this? What would you do to make planes less prone to being shot down?

This actually happened during WW II and the protagonist was Abraham Wald:

Abraham Wald - Wikipedia, the free encyclopedia

The United States Chess Federation (USCF) invites you to devise their new ranking system that will replace Elo. You are free to devise enhancements to the current system or propose a completely new ranking algorithm.

There are many possible ways to do this, which makes for a good discussion question. Also, whether the candidate decides to extend Elo or to start from scratch tells something about her/his character.

Elo rating system - Wikipedia
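As a baseline for that discussion, here is the textbook Elo update. The K-factor of 32 is just a commonly quoted illustrative value and an assumption here; I am not claiming it is what the USCF uses.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a, rating_b, score_a, k=32):
    """Return new ratings; score_a is 1 for a win by A, 0.5 for a draw, 0 for a loss."""
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

print(update_elo(1500, 1500, 1.0))  # (1516.0, 1484.0)
```

A good candidate will quickly spot what this leaves out: no notion of uncertainty in a rating, no handling of inactivity, and a single K for everyone.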

How would you go about building an ensemble of hundreds of highly diverse models? (resulting from different algorithms and different parameters)

This opens the room for the candidate to show off knowledge about bagging and boosting and the benefits of each, but for that scale of diverse models to be beneficial, stacking is a natural choice. Essential here is the awareness that stacking requires an extra hold-out set, and to show resourcefulness in adapting the stacking scheme if that holdout set is fairly small.

How would you sample uniformly from a continuous stream of data?

Reservoir Sampling.
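For reference, the classic version (often called Algorithm R) fits in a dozen lines. This is a minimal sketch; the function and variable names are mine.

```python
import random

def reservoir_sample(stream, k, rng):
    """Uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = item     # keep the new item with probability k/(i+1)
    return reservoir

sample = reservoir_sample(range(1000), 5, random.Random(0))
print(sample)
```

It makes one pass, uses O(k) memory, and never needs to know how long the stream is.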

Assume you already have a classification model with great ROC curve, but the model produces arbitrary scores that do not map to probability estimates, how would you go about calibrating the scores into probabilities?

There are a few methods for doing that. It is interesting to see if the candidate understands that the calibration process biases some metrics on the calibration data set.

Isotonic regression - Wikipedia, the free encyclopedia
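Isotonic regression can be fitted with the pool-adjacent-violators algorithm. The sketch below is a bare-bones, unweighted version with made-up labels: sort the calibration examples by model score, then fit a non-decreasing sequence to their 0/1 labels. The fitted value at each position is the calibrated probability for that score.

```python
def isotonic_fit(values):
    """Pool adjacent violators: least-squares non-decreasing fit to values."""
    blocks = []  # each block is [mean, count]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean2, count2 = blocks.pop()
            mean1, count1 = blocks.pop()
            total = count1 + count2
            blocks.append([(mean1 * count1 + mean2 * count2) / total, total])
    fitted = []
    for mean, count in blocks:
        fitted.extend([mean] * count)
    return fitted

# Hypothetical labels of ten calibration examples, ordered by increasing model score.
labels_by_score = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
calibrated = isotonic_fit(labels_by_score)
print([round(p, 2) for p in calibrated])
```

Because the mapping is fitted on these very labels, metrics computed on the same calibration set will look better than they should, which is the bias mentioned above.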

Explain the bootstrap sampling method and when it can be useful.

Interview Questions for Data Scientist Positions (Part II) was originally published in Rants on Machine Learning on Medium, where people are continuing the conversation by highlighting and responding to this story.

Smarter Parameter Sweeps (or Why Grid Search Is Plain Stupid)

Ahmed El Deeb — Mon, 22 Jun 2015 08:32:10 GMT

Anyone who has ever had to train a machine learning model has had to go through some parameter sweeping (a.k.a. hyper-parameter optimization) to find a sweet spot for algorithm parameters. For random forests the parameters in need of optimization could be the number of trees in the model and the number of features considered at each split; for a neural network, there are the learning rate, the number of hidden layers, the number of hidden units in each layer, and several other parameters.

Hyper-parameter optimization requires the use (and maybe the abuse) of a validation set on which you can’t trust your performance metrics anymore. In this sense it is like a second phase of learning, or an extension to the learning algorithm itself. The performance metric (or the objective function) can be visualized as a heat-map in the n-dimensional parameter-space or as a surface in an n+1-dimensional space (the dimension n+1 being the value of that objective function). The bumpier this surface is (the more local minima and saddle points it has), the harder it becomes to optimize these parameters. Here are a couple of illustrations for two such surfaces defined by two parameters, the first one is mostly well behaved:

While the second one is more bumpy and riddled with several local minima:

The most common method for selecting algorithm parameters is by far the ubiquitous grid-search. In fact, the term “parameter sweep” actually refers to performing a grid search but has also become synonymous with performing parameter optimization. Grid-search is performed by simply picking a list of values for each parameter, and trying out all possible combinations of these values. This might look methodical and exhaustive. But in truth even a random search of the parameter space can be MUCH more effective than a grid search!

This amazing paper by Bergstra et al. argues that a random search of the parameter space is more efficient than grid search (and quite competitive in comparison with more sophisticated techniques).

Surprising, huh? Why should random search be better than the much more robust-looking grid-search? Here is why:

The idea is that in most cases the bumpy surface of the objective function is not equally bumpy in all dimensions. Some parameters have much less effect on the cost function than others. If the importance of each parameter is known, this can be encoded in the number of values picked for each parameter in the grid-search. But that’s not typically the case, and anyway, just using random search allows the exploration of more values for each parameter, given the same number of trials:

(The beautiful illustration is taken from the same paper referenced above)
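The same argument in code, as a toy sketch. The objective function is hypothetical: it is very sensitive to `x`, barely sensitive to `y`, and its optimum is deliberately placed off the grid, which is the usual situation when you don't know where to look. Both searches get nine trials.

```python
import itertools
import random

def objective(x, y):
    # Hypothetical validation loss: x matters a lot, y hardly at all.
    return (x - 0.37) ** 2 + 0.01 * (y - 0.5) ** 2

# Grid search: 3 values per parameter, 9 trials, only 3 distinct values of x.
grid_values = [0.0, 0.5, 1.0]
grid_trials = list(itertools.product(grid_values, grid_values))
grid_best = min(objective(x, y) for x, y in grid_trials)

def random_search_best(n_trials, rng):
    """Best loss found by sampling both parameters uniformly from [0, 1]."""
    return min(objective(rng.random(), rng.random()) for _ in range(n_trials))

# Random search: 9 trials means 9 distinct values of x. Average over many runs.
rng = random.Random(0)
random_bests = [random_search_best(len(grid_trials), rng) for _ in range(500)]
mean_random_best = sum(random_bests) / len(random_bests)

print(f"grid search best loss:          {grid_best:.4f}")
print(f"random search best loss (mean): {mean_random_best:.4f}")
```

Any single random run can still be unlucky; the advantage shows up on average, and it grows with the number of unimportant parameters.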

More elaborate ways of optimizing algorithm hyper-parameters exist. In fact, whole start-ups have been built around the idea (one of them recently acquired by Twitter). A couple of libraries and several research papers tackle the problem, but for me, random sweeps are good enough for now.

Smarter Parameter Sweeps (or Why Grid Search Is Plain Stupid) was originally published in Rants on Machine Learning on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Unreasonable Effectiveness of Random Forests

Ahmed El Deeb — Thu, 18 Jun 2015 01:12:08 GMT

It’s very common for machine learning practitioners to have favorite algorithms. It’s a bit irrational, since no algorithm strictly dominates in all applications; the performance of ML algorithms varies wildly depending on the application and the dimensionality of the dataset. And even for a given problem and a given dataset, any single model will likely be beaten by an ensemble of diverse models trained by diverse algorithms anyway. But people have favorites nevertheless. Some like SVMs for the elegance of their formulation or the quality of the available implementations, some like decision rules for their simplicity and interpretability, and some are crazy about neural networks for their flexibility.

My favorite out-of-the-box algorithm is (as you might have guessed) the Random Forest, and it’s the second modeling technique I typically try on any given data set (after a linear model).

Here’s why:

Random Forests require almost no input preparation. They can handle binary features, categorical features, numerical features without any need for scaling.
Random Forests perform implicit feature selection and provide a pretty good indicator of feature importance.
Random Forests are very quick to train. It’s a stroke of brilliance when a performance optimization happens to enhance model precision, or vice versa. The random feature sub-setting that aims at diversifying individual trees, is at the same time a great performance optimization! Tuning down the fraction of features that is considered at any given node can let you easily work on datasets with thousands of features. (The same is applicable for row sampling if your dataset has lots of rows)
Random Forests are pretty tough to beat. Although you can typically find a model that beats RFs for any given dataset (typically a neural net or some boosting algorithm), it’s never by much, and it usually takes much longer to build and tune said model than it took to build the Random Forest. This is why they make for excellent benchmark models.
It’s really hard to build a bad Random Forest! Since random forests are not very sensitive to the specific hyper-parameters used, they don’t require a lot of tweaking and fiddling to get a decent model, just use a large number of trees and things won’t go terribly awry. Most Random Forest implementations have sensible defaults for the rest of the parameters.
Versatility. Random Forests are applicable to a wide variety of modeling tasks: they work well for regression tasks, work very well for classification tasks (and even produce decently calibrated probability scores), and even though I’ve never tried it myself, they can be used for cluster analysis.
Simplicity. If not of the resulting model, then of the learning algorithm itself. The basic RF learning algorithm can be written in a few lines of code. There’s a certain irony about that. But a sense of elegance as well.
Lots of excellent, free, and open-source implementations. You can find a good implementation in almost all major ML libraries and toolkits. R, scikit-learn and Weka jump to mind for having exceptionally good implementations.
As if all of that is not enough, Random Forests can be easily grown in parallel. The same cannot be said about boosted models or large neural networks.
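To back up the simplicity claim, here is a toy classification forest in plain Python. It is a sketch for illustration, not a reference implementation: it assumes numeric features and integer class labels, uses Gini impurity, and its toy dataset (two features, a diagonal class boundary) is made up.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, n_features, rng, depth=0, max_depth=5):
    """Grow one tree, looking at a random subset of features at each node."""
    if depth == max_depth or len(set(labels)) == 1:
        return majority(labels)                      # leaf node
    best = None
    for f in rng.sample(range(len(rows[0])), n_features):
        for t in set(r[f] for r in rows):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:
        return majority(labels)                      # no usable split
    _, f, t = best
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left],
                       n_features, rng, depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right],
                       n_features, rng, depth + 1, max_depth))

def predict_tree(node, row):
    while isinstance(node, tuple):                   # internal nodes are tuples
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def build_forest(rows, labels, n_trees, n_features, rng):
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # bootstrap sample of rows
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], n_features, rng))
    return forest

def predict_forest(forest, row):
    return majority([predict_tree(tree, row) for tree in forest])

# Toy data: class 1 when the two features sum to more than 1.
rng = random.Random(0)
def make_data(n):
    rows = [(rng.random(), rng.random()) for _ in range(n)]
    return rows, [int(a + b > 1.0) for a, b in rows]

train_rows, train_labels = make_data(100)
test_rows, test_labels = make_data(300)
forest = build_forest(train_rows, train_labels, n_trees=25, n_features=1, rng=rng)
accuracy = sum(predict_forest(forest, r) == y
               for r, y in zip(test_rows, test_labels)) / len(test_rows)
print(f"test accuracy: {accuracy:.2f}")
```

The trees are independent of each other, so the loop in `build_forest` is the part that parallelizes trivially.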

This beautiful visualization from scikit-learn illustrates the modelling capacity of a decision forest:

Visualization from scikit-learn.org illustrating decision boundaries and modeling capacity of a single decision tree, a random forest and some other techniques.

Drawbacks?

The main drawback of Random Forests is the model size. You could easily end up with a forest that takes hundreds of megabytes of memory and is slow to evaluate.
Another point that some might find a concern is that random forest models are black boxes that are very hard to interpret.

Some References:

Here’s a paper by Leo Breiman, the inventor of the algorithm, describing random forests.

Here’s another amazing paper by Rich Caruana et al. evaluating several supervised learning algorithms on many different datasets.

The Unreasonable Effectiveness of Random Forests was originally published in Rants on Machine Learning on Medium, where people are continuing the conversation by highlighting and responding to this story.

Interview Questions for Data Scientist Positions

Ahmed El Deeb — Mon, 08 Jun 2015 20:02:09 GMT

There are loads of books on “cracking” the programming interview, and every computer scientist or software engineer has spent some time hunting down and trying to solve interesting interview problems. But the typical interview problems are not any good for assessing the aptitude of a data scientist. I've personally seen brilliant programmers and software engineers struggle for years with wrapping their minds around machine learning concepts and statistical analysis techniques. It is clear then that the job interview for a data scientist needs to have questions and problems specifically designed to gauge these abilities.

These are some questions I came up with when I was asked to conduct interviews for “Research Engineer” positions, please feel free to give feedback and send your own questions to augment this list.

Modeling:

What is the simplest possible classification model you can learn from data?

I've seen time and again that some ML practitioners are used to using sophisticated algorithms (e.g. SVMs, Gradient Boosted Trees, etc.) and have a very tenuous grasp of simpler modeling techniques. I believe this is a critical blind spot. Simple modeling techniques serve as good, solid baselines, are less prone to overfitting and are easier to implement on a large to huge scale in online environments. The simplest classification model that can be learned from data is a simple threshold on a single feature. The next step in complication is a linear model linking the target variable to multiple predictors or a single decision tree. A candidate should be able to write the algorithm to tune any of these models in 10 minutes or so.
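For the record, here is roughly what I would expect for the single-threshold model. It is a deliberately naive sketch: it only considers the rule "predict 1 when the value is above the threshold", and the data is made up.

```python
def fit_threshold(values, labels):
    """Return (threshold, training accuracy) for the rule: predict 1 when value > threshold."""
    best_threshold, best_correct = None, -1
    # Candidate thresholds: every observed value, plus -inf for "always predict 1".
    for t in [float("-inf")] + sorted(set(values)):
        correct = sum((v > t) == bool(y) for v, y in zip(values, labels))
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold, best_correct / len(values)

values = [1.0, 2.0, 2.5, 3.0, 4.0, 5.0]
labels = [0, 0, 1, 0, 1, 1]
threshold, accuracy = fit_threshold(values, labels)
print(threshold, round(accuracy, 3))  # 2.0 0.833
```

Good follow-ups: how to handle the opposite direction, how to do it in one pass over sorted data, and what to optimize instead of accuracy.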

What are your favorite Machine Learning algorithms and why?

This is an inherently biased question, since every machine learning practitioner has his own set of algorithms and if the candidate’s picks match those of the interviewer, he’ll definitely get his sympathy. But the goal of the question is really the “why” part. Whatever the candidates’ favorite algorithms are, they should be able to justify their choices convincingly. This question can also allow the candidate to show actual passion and enthusiasm about the field, something I believe crucial for the successful data scientist.

Why is feature selection an important step in modeling and what’s your favorite method of doing it?

This is kind of a trick question (at least coming from me) since I don’t really believe that feature selection is all that important. Not in most cases anyway. But it’s treated heavily in literature, and I would love to see that the candidate is not just doing things in a certain way because it’s how other people usually do it. Anyway, even if the candidate does believe in the importance of feature selection, the way he would go about it and whether he understands its costs would tell a lot about his caliber.

How do you go about tuning algorithm specific hyper-parameters?

What I’m looking for here is basically any method smarter than the mindless grid-search.

How do you know that your model is over-fitting and what do you do about it?

Simple. Straight-forward. Still an essential question.

Metrics and experimentation:

You inherited a patch of land from your uncle. The first year under your management, land yield goes down to half what it was the prior year. You investigate and find out that your uncle had a secret recipe that he didn’t pass on. There are three possible types of seeds, four types of fertilizers, and two types of pesticide. How would you go about re-discovering your late uncle’s formula?

Well, … randomized experiments with small land patches assigned randomly to treatments is a good start, including treatments that lack pesticide and fertilizer, assessing main effects and interactions, getting confidence intervals and possibly comparing finalist treatments in a subsequent round (depending on the statistical significance of the results), … something along these lines.
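
The first step of that answer, randomly assigning plots to treatments, can be sketched as follows. This is only an illustration under my own assumptions: the treatment labels, the number of replicates and the function name `assign_plots` are all made up, and `None` stands for "no fertilizer" or "no pesticide".

```python
import itertools
import random
from collections import Counter

seeds = ["seed_A", "seed_B", "seed_C"]
fertilizers = [None, "fert_1", "fert_2", "fert_3", "fert_4"]  # None = no fertilizer
pesticides = [None, "pest_1", "pest_2"]                       # None = no pesticide


def assign_plots(n_replicates=3, rng_seed=0):
    """Return {plot_id: (seed, fertilizer, pesticide)} for a full factorial design.

    Every combination appears n_replicates times, and the order is shuffled so
    that treatment is not confounded with plot location.
    """
    treatments = list(itertools.product(seeds, fertilizers, pesticides))
    plan = treatments * n_replicates
    random.Random(rng_seed).shuffle(plan)
    return dict(enumerate(plan))


plan = assign_plots()
print(len(plan))                         # 3 * 5 * 3 treatments * 3 replicates = 135 plots
print(set(Counter(plan.values()).values()))  # {3}: every treatment is replicated equally
```

Replicating each treatment is what makes confidence intervals possible afterwards. With 45 combinations, a candidate might also reasonably suggest a fractional design if land is scarce.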

What kind of metrics would you track for your music streaming website?

No single good answer to this question, of course, but I’d be looking to assess the candidate’s grasp of metrics and their importance: the fact that most metrics have blind spots, how to combine several metrics into one “success” metric and the drawbacks of doing that, why it might be a good idea to change that metric every now and then, and so on and so forth.

If you were training a classifier, which metrics would you use for model selection and why?

How many times have I seen slides filled with precision/recall numbers that were completely useless for comparing models?! For this question I expect either a metric that compares classifier efficacy along the whole score range, like area under the ROC curve, or at least a comparison of recall at a preset precision point, or something equally sensible.
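
Both metrics mentioned above are short enough to write from scratch. The sketch below is a plain-Python illustration with function names of my own choosing. It uses the pairwise-ranking definition of AUC, which is quadratic in the number of examples, so it is meant for explanation and not for large data sets.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_at_precision(labels, scores, min_precision):
    """Best recall over all score thresholds whose precision is at least min_precision."""
    total_pos = sum(labels)
    best = 0.0
    for t in sorted(set(scores)):
        predicted = [y for y, s in zip(labels, scores) if s >= t]  # labels of flagged items
        true_pos = sum(predicted)
        if true_pos / len(predicted) >= min_precision:
            best = max(best, true_pos / total_pos)
    return best


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(roc_auc(labels, scores))                    # 8 of 9 positive/negative pairs are ordered correctly
print(recall_at_precision(labels, scores, 0.9))   # only the top two items reach 90% precision
```

Unlike a single precision/recall pair, both numbers are comparable across models because they do not depend on where each model happened to put its decision threshold.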

You get a weekly spam message predicting the outcome of one football game each week. The spammer claims he has insider information and will let you in on it for a significant fee. You ignore it of course, but you keep getting the weekly message and it keeps guessing the game outcome correctly for 10 weeks in a row. Should you pay him? What’s going on here?

…

This list is by no means exhaustive. In fact, I left whole areas and skills totally uncovered (esp. where I believe the typical programming interview covers them). So I’d love to hear some suggestions to expand this list and make it more rounded.

(read the second part of this article)

Interview Questions for Data Scientist Positions was originally published in Rants on Machine Learning on Medium, where people are continuing the conversation by highlighting and responding to this story.

7 Ways to Improve your Predictive Models

Ahmed El Deeb — Thu, 28 May 2015 18:09:59 GMT

This is a figure I dug up from an old slide deck I prepared years ago for a workshop on predictive modeling. It illustrates what I think of as the “war horse” of model tuning (‘cause you know, it kind of looks like a horse, with an extra spear). It is also a kind of map for navigating the Bias-Variance space.

Bias and variance are the two components of imprecision in predictive models, and in general there is a trade-off between them, so reducing one tends to increase the other. Bias in predictive models is a measure of model rigidity and inflexibility, and means that your model is not capturing all the signal it could from the data. High bias is also known as under-fitting. Variance, on the other hand, is a measure of model inconsistency: high-variance models tend to perform very well on some data points and really badly on others. High variance is also known as over-fitting, and means that your model is too flexible for the amount of training data you have and ends up picking up noise in addition to the signal, learning random patterns that happen by chance and do not generalize beyond your training data.

The simplest way to determine if your model is suffering more from bias or from variance is the following rule of thumb:

If your model is performing really well on the training set, but much worse on the hold-out set, then it’s suffering from high variance. On the other hand, if your model is performing poorly on both the training set and the hold-out set, it is suffering from high bias.
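
Written as code, the rule of thumb looks something like the sketch below. The function name and both thresholds (`acceptable_error`, `gap_tolerance`) are assumptions of mine for illustration. What counts as "poor" or "much worse" depends entirely on the problem.

```python
def diagnose(train_error, holdout_error, acceptable_error=0.10, gap_tolerance=0.05):
    """Classify a model as high bias, high variance, or neither.

    The two thresholds are illustrative defaults, not recommended values.
    """
    if train_error > acceptable_error:
        # Poor even on the data it was trained on: the model is too rigid.
        return "high bias"
    if holdout_error - train_error > gap_tolerance:
        # Good on training data but much worse on unseen data: over-fitting.
        return "high variance"
    return "no obvious problem"


print(diagnose(train_error=0.02, holdout_error=0.25))  # high variance
print(diagnose(train_error=0.30, holdout_error=0.32))  # high bias
```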

Depending on the performance of your current model and whether it is suffering more from high bias or high variance, you can resort to one or more of these seven techniques to bring your model where you want it to be:

1. Add More Data! Of course! This is almost always a good idea if you can afford it. It drives variance down (without a trade-off in bias) and allows you to use more flexible models.
2. Add More Features! This is almost always a good idea too. Again, if you can afford it. Adding new features increases model flexibility and decreases bias (at the expense of variance). The only time when it’s not a good idea to add new features is when your data set is small in terms of data points and you can’t invest in #1 above.
3. Do Feature Selection. Well, … only do it if you have a lot of features and not enough data points. Feature selection is almost the inverse of #2 above, and pulls your model in the opposite direction (decreasing variance at the expense of some bias), but the trade-off can be good if you do the feature selection methodically and only remove noisy and uninformative features. If you have enough data, most models can automatically handle noisy and uninformative features and you don’t need to do explicit feature selection. In this day and age of “Big Data” the need for explicit feature selection rarely arises. It is also worth noting that proper feature selection is non-trivial and computationally intensive.
4. Use Regularization. This is the neater version of #3 and amounts to implicit feature selection. The specifics are beyond the scope of this post, but regularization tells your algorithm to try to use as few features as possible, or to not trust any single feature too much. Regularization relies on smart implementations of training algorithms and is usually the much preferred version of feature selection.
5. Bagging is short for Bootstrap Aggregation. It uses several versions of the same model trained on slightly different samples of the training data to reduce variance without any noticeable effect on bias. Bagging can be computationally intensive, esp. in terms of memory.
6. Boosting is a slightly more complicated concept and relies on training several models successively, each trying to learn from the errors of the models preceding it. Boosting decreases bias and hardly affects variance (unless you are very sloppy). Again, the price is computation time and memory size.
7. Use a different class of models! Of course you don’t have to do all the above if there is another type of model that is more suitable to your data set out-of-the-box. Changing the model class (e.g. from a linear model to a neural network) moves you to a different point in the space above. Some algorithms are just better suited to some data sets than others. Identifying the right type of model can be really tricky though!
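
Of the techniques in this list, bagging is the easiest to show in a few lines. The sketch below is a minimal illustration only: `bagging_fit`, `bagging_predict` and the 1-nearest-neighbour base learner are hypothetical names and choices of mine, picked because 1-NN is a simple example of a flexible, high-variance model.

```python
import random
from statistics import mean


def bagging_fit(fit, xs, ys, n_models=25, seed=0):
    """Train n_models copies of the same learner, each on a bootstrap resample.

    fit: function taking (xs, ys) and returning a prediction function.
    """
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        models.append(fit([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Average the individual predictions (use a majority vote for class labels)."""
    return mean(model(x) for model in models)


def fit_1nn(xs, ys):
    """Hypothetical high-variance base learner: 1-nearest-neighbour regression."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]
models = bagging_fit(fit_1nn, xs, ys)
print(bagging_predict(models, 2.4))
```

Note that all the resampled models have to be kept around for prediction, which is where the memory cost mentioned in the list comes from.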

It should be noted though that model accuracy (being as far to the bottom left as possible) is not the only objective. Some highly accurate models can be very hard to deploy in production environments and are usually black boxes that are very hard to interpret or debug, so many production systems opt for simpler, less accurate models that are less resource-intensive and easier to deploy and debug.

7 Ways to Improve your Predictive Models was originally published in Rants on Machine Learning on Medium, where people are continuing the conversation by highlighting and responding to this story.