<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

    <title>
        Chris Stucchio
    </title>

    <link href="https://www.chrisstucchio.com/blog/atom.xml" rel="self" />
    <link href="https://www.chrisstucchio.com"/>

    <updated>2023-04-02T05:57:55.076445Z</updated>

    <id>https://www.chrisstucchio.com/</id>

        <entry>
            <title type="html">Boosting as a scheme for transfer learning</title>
            <author><name>Chris Stucchio</name></author>
            <link href="https://www.chrisstucchio.com/blog/2022/boosting_for_knowledge_transfer.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss"/>
            <updated>2022-12-29T08:30:00+01:00</updated>
            <published>2022-12-29T08:30:00+01:00</published>
            <id>https://www.chrisstucchio.com/blog/2022/boosting_for_knowledge_transfer.html</id>

            <content type="html">
              <![CDATA[
                (If mathematical formulas are not showing up here, <a href="https://www.chrisstucchio.com/blog/2022/boosting_for_knowledge_transfer.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss">please read this on the web</a>.)
                <p>Here's a scenario that I believe to be common. I've got a dataset I've been collecting over time, with features <span class="math">\(x_1, \ldots, x_m\)</span>. This dataset
will generally represent decisions I want to make at a certain time. This data is not a timeseries; it's just data I happen to have collected over time. I will call this the <em>old data</em> and <em>old features</em>.</p>
<p>The problem I'm interested in right now is the situation where some of the data was not collected since the beginning of time. I will assume that on a much smaller dataset - e.g. the most recent 10-20% of the data - I have collected additional features <span class="math">\(x_{m+1}, \ldots, x_n\)</span> which I have reason to believe are useful for predictions.
I will call this the <em>new data</em> and <em>new features</em>.</p>
<p>The challenge here is that if we combine the training data into a single table, it looks something like this:</p>
<table>
<thead>
<tr>
<th>x1</th>
<th>x2</th>
<th>xm+1</th>
<th>xm+2</th>
<th>y</th>
</tr>
</thead>
<tbody>
<tr>
<td>5</td>
<td>-3</td>
<td>null</td>
<td>null</td>
<td>False</td>
</tr>
<tr>
<td>7</td>
<td>2</td>
<td>null</td>
<td>null</td>
<td>True</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>3</td>
<td>5</td>
<td>0.2</td>
<td>-0.5</td>
<td>False</td>
</tr>
<tr>
<td>4</td>
<td>-6</td>
<td>0.9</td>
<td>1.2</td>
<td>True</td>
</tr>
</tbody>
</table>
<p>This dataset is not representative of future data; future data will not have nulls in the columns representing the new features.</p>
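<p>(To make the table above concrete, here's a small pandas sketch - illustrative only, with made-up column names and values, not from the original post - of how such a combined dataset arises:)</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Old rows: only the old features x1, x2 were collected.
old = pd.DataFrame({
    "x1": rng.integers(-5, 8, size=8),
    "x2": rng.integers(-5, 8, size=8),
    "y": rng.integers(0, 2, size=8).astype(bool),
})

# New rows: the new features x3, x4 are also present.
new = pd.DataFrame({
    "x1": rng.integers(-5, 8, size=2),
    "x2": rng.integers(-5, 8, size=2),
    "x3": rng.normal(size=2),
    "x4": rng.normal(size=2),
    "y": rng.integers(0, 2, size=2).astype(bool),
})

# Concatenating yields NaN in x3/x4 for every old row -- the "null" cells.
combined = pd.concat([old, new], ignore_index=True)
print(combined[["x3", "x4"]].isna().sum())
```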
<p>It is my desire to build a scheme for transfer learning using non-neural network methods such as <a href="https://xgboost.readthedocs.io/en/stable/">XGB</a>. Specifically,
I want to train a model on the large dataset using only the old features, then transfer knowledge gained in this way to a model trained on the small dataset and
including the new features. In this blog post I provide a proof of concept showing that <a href="https://en.wikipedia.org/wiki/AdaBoost">classical boosting</a> provides
a mechanism for accomplishing this knowledge transfer.</p>
<p><strong>Concrete example:</strong> standard stock market data, e.g. end of day prices, along with data pulled out of assorted EDGAR documents. This data is generally available
as far back as I'd like, provided I'm willing to parse old SEC files or pay a data provider for it. I also have some alternative data which only goes back a
couple of years. I believe it is customary to pretend this is satellite photos of cars in walmart parking lots rather than badly anonymized data about consumers
purchased from VISA or T-Mobile which consumers would generally expect to be kept confidential.</p>
<p>Standard options include:</p>
<ol>
<li>Training a model on all rows, including rows for which the new data is null.</li>
<li>Training a model only on the new data.</li>
</ol>
<p>A significant problem with (1) is that the model can learn spurious correlations between missingness and outcomes - e.g. if one starts collecting new data
during a bear market, the model may learn "data present = stonks go down". The key problem with (2) is that the size of the new dataset might just be too
small.</p>
<h2>Alternate setup: Trunk and Branch Models</h2>
<p>Another way to think about this setup for the dataset is via the concept of "trunk and branch" models. One has a large dataset - in the neural
network context this is typically a large corpus of language or images. One also has a smaller set of data, either significantly more specialized or
with additional data (such as annotations), and which is representative of the problem one actually wishes to solve.</p>
<p>Concrete example: the large dataset might be a corpus of all kinds of scraped data - journalism, blog posts, tumblr, etc. The smaller dataset might be
language from a very specific domain with more limited language and a narrow goal - SEC reports, restaurant reviews, etc. The trunk and branch model
involves training a <em>trunk model</em> on the full dataset and a <em>branch model</em> on the narrow dataset.</p>
<p>The purpose of the trunk model is to learn things of general applicability - how the English language works. The purpose of the branch model is to actually
predict an outcome of interest - e.g. XYZ company has reported impairment of goodwill, material control weaknesses, or perhaps behaviors indicating
<a href="https://www.investopedia.com/terms/e/earnings-management.asp">earnings management</a>. Additional features present in the narrow dataset may be things like
various annotations and structured data which SEC reports tend to contain (SEC-flavored <a href="https://en.wikipedia.org/wiki/XBRL">XBRL</a> FTW!).</p>
<p>In mathematical terms this is identical to the setup above. The "new" data is the narrow dataset on which we actually want to make useful predictions,
whereas the "old" data is the larger corpus of language in general. Although I am not making any use of this neural network mathematical framework here,
this general idea did inspire my approach a bit.</p>
<h1>My proposed method: boosting across datasets</h1>
<p>I propose a method I haven't seen described elsewhere, but which I've been having some success with: <em>boosting</em> a model trained on old data to the new data. Code used in this blog post is <a href="https://github.com/stucchio/databoost">available on github</a>. The idea is as follows.</p>
<p><em>First,</em> train a model on old data, which by necessity doesn't use new features. Call this the "old model".</p>
<p><em>Second,</em> compute model scores on the new data. Then compute sample weights which are large where the old model got things wrong and low otherwise. Adaboost weighting provides a great template for this.</p>
<p><em>Finally,</em> train a model on the new data using the weights from the previous step. Final scores are merely the sum of the scores of the individual models.</p>
<p>This procedure probably seems familiar because it is - it's almost exactly the same as ordinary boosting. But with classical boosting we're taking a weak
learner and repeatedly training new models on the same data to address weaknesses in the learner. In contrast, I'm taking a (hopefully!) stronger learner and
training a single new model on different data to address weaknesses in the dataset itself.</p>
<p><strong>Avoiding confusion:</strong> For clarity of exposition I am being intentionally vague about which specific model I'm using. That's because the model I happen to use
is gradient boosting, and I don't want to conflate boosting across datasets with boosting on a fixed dataset. Additionally, in the example code I provide, that's all
handled within libraries, and there's no compelling reason it couldn't be swapped out for a different method.</p>
<h2>Concrete details</h2>
<p>I will now include some (slightly oversimplified) pseudocode to illustrate the details. Real code is in the github repo.</p>
<p>The weight computation is a simplified version of adaboost weights taken directly from the <a href="https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b6/sklearn/ensemble/_weight_boosting.py#L573">sklearn implementation</a>, and simplified slightly to the case of binary classifiers:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">scipy.special</span> <span class="kn">import</span> <span class="n">xlogy</span>

<span class="k">def</span> <span class="nf">adaboost_weights</span><span class="p">(</span><span class="n">estimator</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">sample_weight</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;Implement weights for a single boost using the SAMME.R real algorithm.&quot;&quot;&quot;</span>

    <span class="n">y_predict_proba</span> <span class="o">=</span> <span class="n">estimator</span><span class="o">.</span><span class="n">predict_proba</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">sample_weight</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
        <span class="n">sample_weight</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="n">y</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>

    <span class="n">n_classes</span> <span class="o">=</span> <span class="mi">2</span>
    <span class="n">classes</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">y_codes</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="o">-</span><span class="mf">1.0</span> <span class="o">/</span> <span class="p">(</span><span class="n">n_classes</span> <span class="o">-</span> <span class="mi">1</span><span class="p">),</span> <span class="mf">1.0</span><span class="p">])</span>
    <span class="n">y_coding</span> <span class="o">=</span> <span class="n">y_codes</span><span class="o">.</span><span class="n">take</span><span class="p">(</span><span class="n">classes</span> <span class="o">==</span> <span class="n">y</span><span class="p">[:,</span> <span class="n">np</span><span class="o">.</span><span class="n">newaxis</span><span class="p">])</span>

    <span class="c1"># Displace zero probabilities so the log is defined.</span>
    <span class="c1"># Also fix negative elements which may occur with</span>
    <span class="c1"># negative sample weights.</span>
    <span class="n">proba</span> <span class="o">=</span> <span class="n">y_predict_proba</span>  <span class="c1"># alias for readability</span>
    <span class="n">np</span><span class="o">.</span><span class="n">clip</span><span class="p">(</span><span class="n">proba</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">finfo</span><span class="p">(</span><span class="n">proba</span><span class="o">.</span><span class="n">dtype</span><span class="p">)</span><span class="o">.</span><span class="n">eps</span><span class="p">,</span> <span class="kc">None</span><span class="p">,</span> <span class="n">out</span><span class="o">=</span><span class="n">proba</span><span class="p">)</span>

    <span class="c1"># Boost weight using multi-class AdaBoost SAMME.R alg</span>
    <span class="n">estimator_weight</span> <span class="o">=</span> <span class="p">(</span>
        <span class="o">-</span><span class="mf">1.0</span>
        <span class="o">*</span> <span class="n">learning_rate</span>
        <span class="o">*</span> <span class="p">((</span><span class="n">n_classes</span> <span class="o">-</span> <span class="mf">1.0</span><span class="p">)</span> <span class="o">/</span> <span class="n">n_classes</span><span class="p">)</span>
        <span class="o">*</span> <span class="n">xlogy</span><span class="p">(</span><span class="n">y_coding</span><span class="p">,</span> <span class="n">y_predict_proba</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="p">)</span>

    <span class="c1"># Only boost positive weights</span>
    <span class="n">sample_weight</span> <span class="o">*=</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span>
        <span class="n">estimator_weight</span> <span class="o">*</span> <span class="p">((</span><span class="n">sample_weight</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">estimator_weight</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">))</span>
    <span class="p">)</span>

    <span class="k">return</span> <span class="n">sample_weight</span>
</code></pre></div>

<p>Note that the <code>learning_rate</code> parameter is just something I picked arbitrarily after fiddling around in a notebook.
It could almost certainly be chosen more robustly.</p>
<p>The model computation is then pretty straightforward, exactly as described above:</p>
<div class="highlight"><pre><span></span><code><span class="n">old_idx</span> <span class="o">=</span> <span class="p">(</span><span class="n">train_X</span><span class="p">[</span><span class="s1">&#39;old_data&#39;</span><span class="p">]</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span>

<span class="n">pipe1</span> <span class="o">=</span> <span class="n">build_pipeline</span><span class="p">(</span><span class="n">train_X</span><span class="p">[</span><span class="n">old_idx</span><span class="p">])</span>
<span class="n">pipe1</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_X</span><span class="p">[</span><span class="n">old_idx</span><span class="p">],</span> <span class="n">train_y</span><span class="p">[</span><span class="n">old_idx</span><span class="p">])</span>

<span class="n">sample_weight</span> <span class="o">=</span> <span class="n">adaboost_weights</span><span class="p">(</span>
    <span class="n">pipe1</span><span class="p">,</span> <span class="n">train_X</span><span class="p">[</span><span class="o">~</span><span class="n">old_idx</span><span class="p">]</span><span class="o">.</span><span class="n">copy</span><span class="p">(),</span>
    <span class="n">train_y</span><span class="p">[</span><span class="o">~</span><span class="n">old_idx</span><span class="p">]</span><span class="o">.</span><span class="n">copy</span><span class="p">(),</span>
    <span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.25</span><span class="p">,</span>
    <span class="n">sample_weight</span><span class="o">=</span><span class="n">base_weights</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">pipe2</span> <span class="o">=</span> <span class="n">build_pipeline</span><span class="p">(</span><span class="n">train_X</span><span class="p">[</span><span class="o">~</span><span class="n">old_idx</span><span class="p">])</span>
<span class="n">pipe2</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span>
    <span class="n">train_X</span><span class="p">[</span><span class="o">~</span><span class="n">old_idx</span><span class="p">],</span>
    <span class="n">train_y</span><span class="p">[</span><span class="o">~</span><span class="n">old_idx</span><span class="p">],</span>
    <span class="n">final_estimator__sample_weight</span><span class="o">=</span><span class="n">sample_weight</span>
<span class="p">)</span>

<span class="n">boosted_pipe</span> <span class="o">=</span> <span class="n">SummedPredictors</span><span class="p">([</span><span class="n">pipe1</span><span class="p">,</span> <span class="n">pipe2</span><span class="p">],</span> <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">])</span>
</code></pre></div>

<p>Similarly, the weight on the second predictor is chosen arbitrarily here to be <code>1.0</code>. This is again a parameter which should be tweaked
to see if improvements can be gained.</p>
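<p>(For readers who want the shape of the combiner: below is a minimal sketch of what a <code>SummedPredictors</code>-style wrapper might look like. This is my guess at the interface - the real implementation is in the databoost repo - combining binary classifiers via a weighted sum of their log-odds scores.)</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

class SummedPredictors:
    """Sketch of the wrapper used above (assumed interface): final scores
    are a weighted sum of the individual models' scores."""

    def __init__(self, estimators, weights):
        self.estimators = estimators
        self.weights = weights

    def decision_function(self, X):
        # Sum weighted log-odds so each model contributes an additive
        # score, as in classical boosting.
        total = np.zeros(len(X))
        for est, w in zip(self.estimators, self.weights):
            p = np.clip(est.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
            total += w * np.log(p / (1 - p))
        return total

    def predict(self, X):
        return (self.decision_function(X) > 0).astype(int)

# Toy usage: two models trained on different halves of a dataset.
X, y = make_classification(n_samples=500, random_state=0)
m1 = LogisticRegression(max_iter=1000).fit(X[:250], y[:250])
m2 = LogisticRegression(max_iter=1000).fit(X[250:], y[250:])
combo = SummedPredictors([m1, m2], [1.0, 1.0])
print(combo.predict(X[:5]))
```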
<p>The exact code can be found <a href="https://github.com/stucchio/databoost/blob/master/multi_scenario.ipynb">in the notebook</a>; this version removes some irrelevant
details for brevity.</p>
<p>The function <code>build_pipeline</code> in the code sample above can be found <a href="https://github.com/stucchio/databoost/blob/master/databoost/scenario.py#L99">on github</a>
and is basically just the minimal sklearn pipeline needed to pipe the dataset into <code>sklearn.ensemble.HistGradientBoostingClassifier</code>.</p>
<p>If you are a reader considering using this method for your own dataset, I encourage you to simply use your own pipeline in place of my <code>build_pipeline</code> method -
it will almost certainly work better for your data. There's one modification I would suggest making - if your current pipeline uses some variant of
<code>sklearn.impute.MissingIndicator</code> to handle the nulls in the new features, I would suggest removing it for obvious reasons.</p>
<h1>How I'll test this method</h1>
<p>I've constructed several datasets, some synthetic, some taken from kaggle competitions or standard sklearn test datasets.
All are binary classification problems, i.e. the target is in <code>{0,1}</code>.</p>
<table>
<thead>
<tr>
<th style="text-align: left;">name</th>
<th style="text-align: left;">data_source</th>
<th style="text-align: left;">data drift</th>
<th style="text-align: right;">num_rows</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">santander</td>
<td style="text-align: left;">https://www.kaggle.com/competitions/santander-customer-satisfaction</td>
<td style="text-align: left;">False</td>
<td style="text-align: right;">38010</td>
</tr>
<tr>
<td style="text-align: left;">car_insurance</td>
<td style="text-align: left;">https://www.kaggle.com/datasets/ifteshanajnin/carinsuranceclaimprediction-classification</td>
<td style="text-align: left;">False</td>
<td style="text-align: right;">58592</td>
</tr>
<tr>
<td style="text-align: left;">tabular_playground</td>
<td style="text-align: left;">https://www.kaggle.com/competitions/tabular-playground-series-aug-2022/data?select=train.csv</td>
<td style="text-align: left;">False</td>
<td style="text-align: right;">26570</td>
</tr>
<tr>
<td style="text-align: left;">synthetic_1</td>
<td style="text-align: left;">sklearn.datasets.make_classification</td>
<td style="text-align: left;">False</td>
<td style="text-align: right;">100000</td>
</tr>
<tr>
<td style="text-align: left;">synthetic_2_dataset_shift</td>
<td style="text-align: left;">sklearn.datasets.make_classification</td>
<td style="text-align: left;">True</td>
<td style="text-align: right;">100000</td>
</tr>
<tr>
<td style="text-align: left;">cover_type_dataset_shift</td>
<td style="text-align: left;">sklearn.datasets.fetch_covtype, modified</td>
<td style="text-align: left;">True</td>
<td style="text-align: right;">581012</td>
</tr>
</tbody>
</table>
<p>In all cases I'm then modifying the data along the following lines (with minor changes made for a few datasets):</p>
<div class="highlight"><pre><span></span><code><span class="n">X</span><span class="p">[</span><span class="s1">&#39;old_data&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">bernoulli</span><span class="p">(</span><span class="n">old_data_size</span><span class="p">)</span><span class="o">.</span><span class="n">rvs</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">X</span><span class="p">))</span>
<span class="o">...</span>
<span class="n">train_X</span><span class="p">,</span> <span class="n">test_X</span><span class="p">,</span> <span class="n">train_y</span><span class="p">,</span> <span class="n">test_y</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">train_size</span><span class="o">=</span><span class="mf">0.75</span><span class="p">)</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">train_X</span><span class="o">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">bernoulli</span><span class="p">(</span><span class="n">old_data_drop_frac</span><span class="p">)</span><span class="o">.</span><span class="n">rvs</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="ow">and</span> <span class="p">(</span><span class="n">c</span> <span class="o">!=</span> <span class="s1">&#39;old_data&#39;</span><span class="p">):</span>
        <span class="n">train_X</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">train_X</span><span class="p">[</span><span class="s1">&#39;old_data&#39;</span><span class="p">]</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span> <span class="n">c</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span>
</code></pre></div>

<p>Specifically I'm declaring a significant fraction of the data to be "old". Then in the old data I'm nullifying some columns.</p>
<p>In all cases I've adjusted the datasets to have 2 classes, even though some have more (e.g. <code>fetch_covtype</code> has 7). In these
cases I'm arbitrarily mapping some of the outcomes to <code>0</code> and some to <code>1</code>. This is purely for convenience; there's nothing fundamental
to the method which requires this.</p>
<p>In the performance simulations I am then varying several parameters:</p>
<ol>
<li>I vary the fraction of the dataset corresponding to the old data, from 10% to 95%.</li>
<li>I vary which columns are dropped at random.</li>
</ol>
<p>This means that for each parameter choice, I retrain multiple models on different datasets.</p>
<h2>Dataset drift</h2>
<p>In some cases I've introduced data drift as well. Data drift is modeled by introducing a third class in the data generation.
Rows corresponding to this class are present <em>only</em> in the new data (and test data). This is intended to model the scenario
when in addition to collecting new data, the dataset also changes with time and the training set is not perfectly representative
of the test data.</p>
<p>For example, in the <code>sklearn.datasets.fetch_covtype, modified</code> data source, I mapped cover types <code>{0,1} -&gt; 0</code> and
<code>{2,3,4,5,6}-&gt;1</code>. Cover type <code>6</code> was used for dataset drift and was excluded from old data.</p>
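<p>(The class mapping and drift exclusion can be sketched as follows - illustrative only, with random stand-in labels rather than the actual <code>fetch_covtype</code> download:)</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the 7 cover types, indexed 0..6 as in the text; in the real
# experiment these come from sklearn.datasets.fetch_covtype.
cover_type = rng.integers(0, 7, size=10_000)

# Binary target: {0,1} -> 0, {2,3,4,5,6} -> 1.
y = (cover_type >= 2).astype(int)

# Dataset drift: cover type 6 exists only in the new (and test) data,
# so rows with cover type 6 are never marked as old.
is_old = (rng.random(10_000) < 0.85) & (cover_type != 6)
print(y.mean(), is_old.mean())
```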
<h2>Computing "regret"</h2>
<p>In order to get a baseline on how accurate a model <em>could be</em> absent all this dataset fuckery, I also trained a model on the same datasets but without
dropping any data. I am defining the "regret" as the difference in roc_auc between this "best possible" model and the results of models
trained on data with columns partially dropped.</p>
<p>Note that this differs a bit from the concept of <a href="https://en.wikipedia.org/wiki/Regret_(decision_theory)">decision theoretic regret</a>. I am sure there
might be a better name for it, I just don't know what it is.</p>
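<p>(In code, this notion of regret is just a difference of two <code>roc_auc_score</code> values - a trivial helper, shown here with toy numbers of my own invention:)</p>

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def regret(y_true, scores_best, scores_censored):
    """Difference in roc_auc between the "best possible" model (trained on
    full, uncensored data) and a model trained on the censored data."""
    return roc_auc_score(y_true, scores_best) - roc_auc_score(y_true, scores_censored)

y = np.array([0, 0, 1, 1, 1, 0])
best = np.array([0.1, 0.2, 0.9, 0.8, 0.7, 0.3])   # ranks perfectly: auc = 1.0
worse = np.array([0.4, 0.2, 0.6, 0.3, 0.7, 0.5])  # auc = 7/9
print(regret(y, best, worse))
```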
<h1>Results</h1>
<p>Here's a graph of simulations illustrating what one might expect from this method, across various scenarios.</p>
<p><img alt="Graph of performance" src="https://www.chrisstucchio.com/blog_media/2022/boosting_for_dataset_changes/graph_of_performance.png"></p>
<p>The x-axis of the graph consists of different sizes of the old dataset, ranging from 10% to 95%. The Y-axis represents the delta in model
performance (roc_auc) between the boosted model and the two baselines - training on the full dataset (blue) and training on new data only (orange).</p>
<p>Different dots at the same x-value correspond to different (randomized) choices of which columns should be deleted.</p>
<p>There is of course a wide range of performance, as one might expect. If randomly nullifying columns deletes important model features then we expect
performance to go down significantly, giving dataset boosting a greater opportunity to improve things. However if the randomly deleted columns are
simply noise we expect it to be harmless to baseline model performance and dataset boosting will just add further noise.</p>
<p>The pattern that can be seen in most of the graphs is an asymmetric benefit to using this new boosting scheme. When boosting improves performance,
it does so by a large margin. When it hinders performance it's mostly by a much smaller margin.</p>
<p>The only exception here is <code>tabular_playground</code>, where it underperforms in all cases. It is interesting to note that on this dataset - unlike all the
others I'm using - logistic regression performs as well as gradient boosting. It is fairly <a href="https://stats.stackexchange.com/questions/186966/gradient-boosting-for-linear-regression-why-does-it-not-work/186974">easy to prove</a> that when a linear model works there is no advantage to boosting, so this should not
be surprising.</p>
<table>
<thead>
<tr>
<th style="text-align: left;">scenario</th>
<th style="text-align: right;">win_base</th>
<th style="text-align: right;">win_new_only</th>
<th style="text-align: right;">delta_base</th>
<th style="text-align: right;">delta_new_only</th>
<th style="text-align: right;">boosted_regret</th>
<th style="text-align: right;">base_regret</th>
<th style="text-align: right;">new_only_regret</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">car_insurance</td>
<td style="text-align: right;">0.600</td>
<td style="text-align: right;">0.633</td>
<td style="text-align: right;">0.016</td>
<td style="text-align: right;">0.006</td>
<td style="text-align: right;">0.011</td>
<td style="text-align: right;">0.028</td>
<td style="text-align: right;">0.017</td>
</tr>
<tr>
<td style="text-align: left;">cover_type_dataset_shift</td>
<td style="text-align: right;">0.989</td>
<td style="text-align: right;">0.378</td>
<td style="text-align: right;">0.014</td>
<td style="text-align: right;">-0.000</td>
<td style="text-align: right;">0.003</td>
<td style="text-align: right;">0.016</td>
<td style="text-align: right;">0.003</td>
</tr>
<tr>
<td style="text-align: left;">santander</td>
<td style="text-align: right;">0.500</td>
<td style="text-align: right;">0.744</td>
<td style="text-align: right;">0.011</td>
<td style="text-align: right;">0.010</td>
<td style="text-align: right;">0.008</td>
<td style="text-align: right;">0.019</td>
<td style="text-align: right;">0.018</td>
</tr>
<tr>
<td style="text-align: left;">synthetic_1</td>
<td style="text-align: right;">0.789</td>
<td style="text-align: right;">0.856</td>
<td style="text-align: right;">0.008</td>
<td style="text-align: right;">0.018</td>
<td style="text-align: right;">0.007</td>
<td style="text-align: right;">0.014</td>
<td style="text-align: right;">0.024</td>
</tr>
<tr>
<td style="text-align: left;">synthetic_2_dataset_shift</td>
<td style="text-align: right;">0.911</td>
<td style="text-align: right;">0.844</td>
<td style="text-align: right;">0.015</td>
<td style="text-align: right;">0.015</td>
<td style="text-align: right;">0.018</td>
<td style="text-align: right;">0.033</td>
<td style="text-align: right;">0.033</td>
</tr>
<tr>
<td style="text-align: left;">tabular_playground</td>
<td style="text-align: right;">0.056</td>
<td style="text-align: right;">0.522</td>
<td style="text-align: right;">-0.015</td>
<td style="text-align: right;">0.000</td>
<td style="text-align: right;">0.017</td>
<td style="text-align: right;">0.001</td>
<td style="text-align: right;">0.017</td>
</tr>
</tbody>
</table>
<p>The columns <code>win_base</code> and <code>win_new_only</code> represent the fraction of times when my dataset boosting scheme outperforms training the model on either
the full dataset (<code>win_base</code>) or the new rows only (<code>win_new_only</code>). The <code>delta_base</code>/<code>delta_new_only</code> columns represent the average lift.
Finally, the <code>?_regret</code> columns represent the difference in performance between a model trained on the full dataset (i.e. without dropping any data)
and the model trained on the censored dataset. This is averaged across all simulations.</p>
<h2>Relevance of the censored features</h2>
<p>A natural question arises - how does the benefit of dataset boosting relate to the importance of the features that are missing in the old data?</p>
<p><strong>The meaning of importance:</strong> It is important to clarify that by "importance", I mean the importance of the features to a model trained
on <em>full uncensored data</em>. I have evaluated this by training a model on the full data (the same ones used to compute regret) and then using
<code>sklearn.inspection.permutation_importance</code> to compute the importance of each feature.</p>
<p>Some reasoning:</p>
<ol>
<li>As the importance of the censored features increases we would expect the benefit of boosting relative to training on the full dataset to increase.
This is because in the full model (trained on all data), the fraction of the dataset which contains the high importance features is very low. Whereas
in the boosted model, we have constructed a model so that these features are treated as if they are fully available, which they are.</li>
<li>As the importance of uncensored features decreases, we would expect the base model (trained on censored data only) to provide little/no useful information.
Thus, in this situation, the model trained on new data only becomes the "best possible" model and the dataset boosting scheme will simply be adding noise.</li>
</ol>
<p>To test this theory I ran a similar simulation to what is described above. However this time I kept the fraction of the dataset which is old constant
(at 85%) while varying the number of features which were dropped.</p>
<p><img alt="Graph of performance" src="https://www.chrisstucchio.com/blog_media/2022/boosting_for_dataset_changes/deleted_importance.png"></p>
<p>As can be seen, on most datasets for which boosting is helpful, the pattern I speculated about above does seem to hold up.</p>
<h1>Conclusion</h1>
<p>In many of the simple examples I've tested it with here, this boosting scheme seems to be a significant improvement over the most obvious alternative approaches.</p>
<p>Dataset boosting is generally beneficial when:</p>
<ol>
<li>The old data comprises a large fraction (80% or more) of the dataset.</li>
<li>The new data is not large enough to train an accurate model by restricting solely to the new data.</li>
<li>The features present in the old data and the features missing from the old data both have significant importance.</li>
</ol>
<p>It is not beneficial in other cases and mostly seems to add noise, diminishing accuracy.</p>
<p>This shouldn't be very surprising, since the theoretical justification and practical applicability of boosting have been known for a long time. Nevertheless,
I've not seen this approach used to address issues related to <em>dataset completeness</em>, and I'm hoping a kind reader might point me to a
body of literature addressing this or persuade someone to research it in greater generality.</p>
<h2>Related</h2>
<ul>
<li><a href="https://www.chrisstucchio.com/blog_media/2022/boosting_for_dataset_changes/2002.11982.pdf">Adapted Tree Boosting for Transfer Learning</a>, some Alipay guys using a similar idea for a somewhat different purpose.</li>
<li><a href="https://www.chrisstucchio.com/blog_media/2022/boosting_for_dataset_changes/samme.pdf">Multiclass adaboost</a>, the original paper on the topic.</li>
</ul>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
    var align = "center",
        indent = "0em",
        linebreak = "false";

    if (false) {
        align = (screen.width < 768) ? "left" : align;
        indent = (screen.width < 768) ? "0em" : indent;
        linebreak = (screen.width < 768) ? 'true' : linebreak;
    }

    var mathjaxscript = document.createElement('script');
    mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
    mathjaxscript.type = 'text/javascript';
    mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';

    var configscript = document.createElement('script');
    configscript.type = 'text/x-mathjax-config';
    configscript[(window.opera ? "innerHTML" : "text")] =
        "MathJax.Hub.Config({" +
        "    config: ['MMLorHTML.js']," +
        "    TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
        "    jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
        "    extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
        "    displayAlign: '"+ align +"'," +
        "    displayIndent: '"+ indent +"'," +
        "    showMathMenu: true," +
        "    messageStyle: 'normal'," +
        "    tex2jax: { " +
        "        inlineMath: [ ['\\\\(','\\\\)'] ], " +
        "        displayMath: [ ['$$','$$'] ]," +
        "        processEscapes: true," +
        "        preview: 'TeX'," +
        "    }, " +
        "    'HTML-CSS': { " +
        "        availableFonts: ['STIX', 'TeX']," +
        "        preferredFont: 'STIX'," +
        "        styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
        "        linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
        "    }, " +
        "}); " +
        "if ('default' !== 'default') {" +
            "MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
            "MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
        "}";

    (document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
    (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
              ]]>
            </content>
        </entry>
        <entry>
            <title type="html">Calibrating a classifier when the base rate changes</title>
            <author><name>Chris Stucchio</name></author>
            <link href="https://www.chrisstucchio.com/blog/2020/calibrated_classifier_base_rates.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss"/>
            <updated>2020-07-07T08:30:00+02:00</updated>
            <published>2020-07-07T08:30:00+02:00</published>
            <id>https://www.chrisstucchio.com/blog/2020/calibrated_classifier_base_rates.html</id>

            <content type="html">
              <![CDATA[
                (If mathematical formulas are not showing up here, <a href="https://www.chrisstucchio.com/blog/2020/calibrated_classifier_base_rates.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss">please read this on the web.</a>.)
<p>In a previous job, I built a machine learning system to detect financial fraud. Fraud was a big problem at the time - for simplicity of having nice round numbers, suppose 10% of attempted transactions were fraudulent. My machine learning system worked great - as a further set of made-up round numbers, let's describe it as having a precision and recall of 50% each. All this resulted in a fantastic bite taken out of the fraud problem.</p>
<p>It worked so well that fraud dropped by well over 50% - because of the effort involved in getting past the system, fraudsters just gave up and stopped trying to scam us.</p>
<p>Suddenly the system's performance tanked - recall stayed at 50% but precision dropped to 8%! After some diagnosis, I discovered the cause was the following - all the fraudsters had gone away. For every fraud attempt, the system had a 50% chance of flagging it. For every non-fraudulent transaction, the system had a 5.5% chance of flagging it.</p>
<p>Early on, fraud attempts made up 10% of our transactions. Thus, for every 1000 transactions, we would flag 50 of the 100 fraudulent transactions and 50 of the 900 good transactions. This means that for every 10 flags, 5 are correct - hence a precision of 50%.</p>
<p>Once the fraudsters fucked off, fraud attempts dropped to perhaps 1% of our transactions. For every 1000 transactions, only 10 were fraudulent. We would flag 5 of them, along with 5.5% x 990 legitimate transactions = 54 transactions. The net result is that only 5 of the 59 transactions we flagged as fraudulent actually were, for a precision of 8%.</p>
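<p>The arithmetic above is easy to reproduce; a quick sketch:</p>

```python
# Precision as a function of the base rate, holding the classifier fixed
# (recall 50%, false positive rate 5.5% - the round numbers used above).
def precision(base_rate, recall=0.5, fpr=0.055):
    true_positives = base_rate * recall          # frauds we flag
    false_positives = (1 - base_rate) * fpr      # good transactions we flag
    return true_positives / (true_positives + false_positives)

print(round(precision(0.10), 3))  # fraudsters at 10%: precision ~0.50
print(round(precision(0.01), 3))  # fraudsters at 1%:  precision ~0.08
```

The classifier itself is unchanged; only the base rate moves, and precision collapses.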
<p>This phenomenon is called <strong>label shift</strong>. The problem with label shift is that the base rate for the target class changes with time and this significantly affects the precision of the classifier.</p>
<div class="section" id="typical-characteristics-of-the-problem">
<h2>Typical characteristics of the problem</h2>
<p>In general, the problems I'm interested in share the following characteristics:</p>
<ol class="arabic simple">
<li><span class="math">\(N\)</span>, the number of data points, not too large - potentially under 100k.</li>
<li><span class="math">\(\alpha\)</span>, the base rate of the positive class, in the ballpark of 0.1% to 5%.</li>
</ol>
<p>These kinds of problems are typical in security, fraud prevention, medicine, and other situations of attempting to detect harmful anomalous behavior.</p>
<div class="section" id="precision-risk-thresholds-and-loss-functions">
<h3>Precision, Risk Thresholds and Loss Functions</h3>
<p>For most classifiers the ultimate goal is to make a decision. The decision is taken in order to minimize some loss function which represents the real world cost of making a mistake.</p>
<p>Consider as an example a classifier <span class="math">\(f: \mathbb{R}^K \rightarrow [0,1]\)</span> used to predict a disease. Let us define <span class="math">\(\vec{x} \in \mathbb{R}^K\)</span> to be our feature vector, <span class="math">\(z \in [0,1]\)</span> to be our risk score and <span class="math">\(y \in \{0,1\}\)</span> whether or not the patient actually has the disease.</p>
<p>A loss function might represent the loss in <a class="reference external" href="https://en.wikipedia.org/wiki/Quality-adjusted_life_year">QALYs</a> from making an error. Concretely, suppose that a failure to diagnose a disease results in the immediate death of the patient - this is a loss of <code>78 - patient's age</code> QALYs. On the flip side, treatment is also risky - perhaps 5% of patients are allergic and also die instantly. This is a loss of <code>5% x (78 - patient's age)</code> <a class="footnote-reference" href="#lifeexpectancycalc" id="footnote-reference-1">[1]</a>. Represented mathematically, our loss function is:</p>
<div class="math">
\begin{equation*}
L(\textrm{treat}; 0) = 0.05
\end{equation*}
</div>
<div class="math">
\begin{equation*}
L(\textrm{treat}; 1) = 0
\end{equation*}
</div>
<div class="math">
\begin{equation*}
L(\textrm{no~treat}; 0) = 0
\end{equation*}
</div>
<div class="math">
\begin{equation*}
L(\textrm{no~treat}; 1) = 1
\end{equation*}
</div>
<p>Let us also suppose that we have a calibrated risk score, i.e. a monotonically increasing function <span class="math">\(c: [0,1] \rightarrow [0,1]\)</span> with the property that <span class="math">\(c(z)=P(y=1|z)\)</span>. For a given patient, the expected loss from treatment is therefore:</p>
<div class="math">
\begin{equation*}
E[L(\textrm{treat}; y)] = (1-c(z)) \cdot 0.05
\end{equation*}
</div>
<p>while the loss from non-treatment is:</p>
<div class="math">
\begin{equation*}
E[L(\textrm{no~treat}; y)] = c(z) \cdot 1 = c(z)
\end{equation*}
</div>
<p>The expected loss from non-treatment exceeds the expected loss from treatment when <span class="math">\(c(z) &gt; 0.05/1.05 \approx 0.0476\)</span>, so the optimal decision rule is to treat every patient with a (calibrated) risk score larger than 0.0476 while letting the others go untreated.</p>
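<p>As a quick sketch, the expected losses and the resulting threshold:</p>

```python
# Expected losses from the disease example above (normalized units).
def loss_treat(c):
    # c = calibrated probability of disease; 5% chance of fatal allergy
    return (1 - c) * 0.05

def loss_no_treat(c):
    # untreated disease is fatal with probability c
    return c

# Treating is the cheaper decision exactly when c > 0.05 / 1.05.
threshold = 0.05 / 1.05
print(round(threshold, 4))  # 0.0476
```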
</div>
</div>
<div class="section" id="the-effect-of-label-shift-on-calibration">
<h2>The effect of label shift on calibration</h2>
<p>Let's study this from the perspective of score distributions. Suppose that <span class="math">\(f_0(z)\)</span> is the pdf of the distribution <span class="math">\(z | y=0\)</span> and <span class="math">\(f_1(z)\)</span> is the pdf of the distribution <span class="math">\(z | y=1\)</span>. For simplicity, assume these distributions are monotonic.</p>
<p>Suppose now that the base rate is <span class="math">\(P(y=1)=\alpha\)</span>. In this framework, a label shift can be represented simply as a change in <span class="math">\(\alpha\)</span>.</p>
<p>It is straightforward to calculate the calibration curve (as a function of <span class="math">\(\alpha\)</span>) as:</p>
<div class="math">
\begin{equation*}
P(y=1|z,\alpha) = c(z;\alpha) = \frac{ \alpha f_1(z) }{ (1-\alpha)f_0(z) + \alpha f_1(z) }
\end{equation*}
</div>
<p>As is apparent from this formula, a change in <span class="math">\(\alpha\)</span> will result in a change in calibration. The following graph provides an example:</p>
<div class="figure" id="fig-diff-calibration-curves" style="width: 600px">
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/calibrated_classifier_base_rates/calibration_shift.png" />
<p class="caption">Illustration of calibration curves changing with base rate</p>
</div>
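<p>The formula above translates directly into code. A sketch, with made-up beta-distributed score densities standing in for <span class="math">\(f_0\)</span> and <span class="math">\(f_1\)</span>:</p>

```python
from scipy.stats import beta

# Hypothetical class-conditional score densities (not fitted to real data):
f0 = beta(2, 6).pdf   # negative-class scores pile up near 0
f1 = beta(6, 2).pdf   # positive-class scores pile up near 1

def calibration_curve(z, alpha):
    # P(y=1 | z) at base rate alpha, per the formula above
    return alpha * f1(z) / ((1 - alpha) * f0(z) + alpha * f1(z))

# The same raw score implies a very different risk at different base rates:
print(calibration_curve(0.65, 0.01), calibration_curve(0.65, 0.10))
```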
<p>Let's consider the effect of this on decisionmaking. Going back to our disease example above, suppose that at model training/calibration time, <span class="math">\(\alpha=0.01\)</span>. Then a disease outbreak occurs and <span class="math">\(\alpha=0.1\)</span>. The decision rule being used based on the training data (with <span class="math">\(\alpha=0.01\)</span>) says to treat any patient with raw <span class="math">\(z\)</span> score of 0.65 or greater.</p>
<p>But once <span class="math">\(\alpha=0.1\)</span>, the actual infection probability of a person with <span class="math">\(z=0.65\)</span> is nearly 40%. As per the loss function calculation earlier, we want to treat any patient with a 4.76% or greater chance of being sick!</p>
<p>In the literature, when making batch predictions, there's a <a class="reference external" href="https://www.chrisstucchio.com/blog_media/2020/calibrated_classifier_base_rates/Adjusting_the_Outputs_of_a_Classifier_to_New_a_Priori_Probabilities_A_Simple_Procedure__Saerens2002a.pdf">known technique for solving this</a> (see discussion <a class="footnote-reference" href="#saerens" id="footnote-reference-2">[2]</a>). The basic idea is the following. For a set of raw risk scores <span class="math">\(z_i, i=1\ldots N\)</span>, we know they are drawn from the distribution:</p>
<div class="math">
\begin{equation*}
z_i \sim \left[(1-\alpha)f_0(z) + \alpha f_1(z)\right] dz
\end{equation*}
</div>
<p>Thus, one can estimate <span class="math">\(\alpha\)</span> via the maximum likelihood principle (although the literature describes a slightly different approach <a class="footnote-reference" href="#serensdiff" id="footnote-reference-3">[3]</a>):</p>
<div class="math">
\begin{equation*}
\alpha = \textrm{arg}\max_{\alpha} \prod_{i=1}^N \left[(1-\alpha)f_0(z_i) + \alpha f_1(z_i)\right]
\end{equation*}
</div>
<p>Maximizing this is straightforward - take the log, negate it, compute <span class="math">\(\frac{\partial L}{\partial \alpha}\)</span>, and hand the result to <a class="reference external" href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html">scipy.optimize.minimize</a>.</p>
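<p>A sketch of that estimation on simulated data (densities and sample sizes are made up for illustration; since the problem is one-dimensional, <code>minimize_scalar</code> suffices):</p>

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import beta

# Assumed-known class-conditional score densities (hypothetical choices):
f0, f1 = beta(2, 6).pdf, beta(6, 2).pdf

# Simulate a batch of raw scores whose true base rate is 10%.
rng = np.random.default_rng(0)
true_alpha = 0.10
y = rng.random(20000) < true_alpha
z = np.where(y,
             beta(6, 2).rvs(len(y), random_state=rng),
             beta(2, 6).rvs(len(y), random_state=rng))

def negative_log_likelihood(alpha):
    return -np.log((1 - alpha) * f0(z) + alpha * f1(z)).sum()

fit = minimize_scalar(negative_log_likelihood,
                      bounds=(1e-4, 1 - 1e-4), method='bounded')
print(fit.x)  # should land close to the true base rate of 0.10
```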
<div class="section" id="what-happens-when-the-distribution-changes">
<h3>What happens when the distribution changes?</h3>
<p>The method described above is strongly sensitive to the assumption that the <em>shape</em> of the distribution of the positive class <span class="math">\(f_1(z)\)</span> does not change, only its amplitude <span class="math">\(\alpha\)</span>.</p>
<p>However in practice, we often discover that <span class="math">\(f_1(z)\)</span> changes with time as well. For example, consider again the example of disease prediction - a new strain of a known disease may have a somewhat different symptom distribution in the future than in the past. However it is a reasonable assumption that the shape of <span class="math">\(f_0(z)\)</span> remains the same; healthy people do not change their health profile until they become infected.</p>
<p>Thus, the more general situation I'm considering is a mix of label shift/base rate changes, together with small to moderate changes in the distribution of the <em>exceptional class only</em>. By &quot;exceptional class&quot;, I mean &quot;sick&quot; (in disease prediction), &quot;fraud&quot; (in fraud prevention), essentially the uncommon label which corresponds to something anomalous.</p>
<p>In general, it is impossible to solve this problem <a class="footnote-reference" href="#impossibility" id="footnote-reference-4">[5]</a>. However, if we stay away from this degenerate case (see footnote <a class="footnote-reference" href="#impossibility" id="footnote-reference-5">[5]</a>), it's actually quite possible to solve this problem and estimate both the new shape of <span class="math">\(f_1(z)\)</span> and <span class="math">\(\alpha\)</span>. The main restriction is that <span class="math">\(f_1(z)\)</span> is not too different from the old value, but right now I don't have a good characterization of what &quot;not too different&quot; actually means.</p>
</div>
</div>
<div class="section" id="formal-statement-of-the-setup">
<h2>Formal statement of the setup</h2>
<p>In the training phase, we have a labelled data set <span class="math">\((\vec{x}_i, y_i), i=1\ldots N\)</span> on which we can train any sort of model that generates risk scores <span class="math">\(z_i, i=1 \ldots N\)</span>. We will assume that in this data set, the risk scores <span class="math">\(z_i\)</span> are drawn from <span class="math">\(f_0(z)\)</span> if <span class="math">\(y_i=0\)</span> and <span class="math">\(f_1(z)\)</span> if <span class="math">\(y_i=1\)</span>.</p>
<p>In the prediction phase we will consider batch predictions. We receive a new set of <span class="math">\(\vec{x}\)</span> and we can of course use the already trained classifier to generate risk scores <span class="math">\(z_i\)</span>. Our goal is for each data point <span class="math">\(z_i\)</span> to generate a calibrated risk score <span class="math">\(c(z_i) \approx P(y_i=1|z_i)\)</span>.</p>
<p>Without label shift there is a standard approach to this that is implemented in sklearn as <a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html">sklearn.calibration.CalibratedClassifierCV</a>. Typically this involves running isotonic regression on a subset of the training data and the mapping <span class="math">\(c(z)\)</span> is the result of this.</p>
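<p>A minimal usage sketch of that standard approach (synthetic, imbalanced data; isotonic calibration):</p>

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced synthetic data standing in for a rare positive class.
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)

# Cross-validated isotonic regression maps raw scores to calibrated ones.
clf = CalibratedClassifierCV(LogisticRegression(max_iter=1000),
                             method="isotonic", cv=3)
clf.fit(X, y)
calibrated_scores = clf.predict_proba(X)[:, 1]
```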
<p>That does not work in this case because <span class="math">\(c(z)\)</span> computed in the training phase will be for the <em>wrong</em> distribution. The figure <a class="reference external" href="#fig-diff-calibration-curves">Illustration of calibration curves changing with base rate</a> illustrates this - isotonic calibration may correctly fit the curve <span class="math">\(c(z; \alpha=0.01)\)</span> in the training phase. But if the right curve in the prediction phase is <span class="math">\(c(z; \alpha=0.1)\)</span>, that fit is not actually correct. This blog post aims to address that problem.</p>
</div>
<div class="section" id="my-method">
<h2>My method</h2>
<p>The approach I'm taking is upgrading the maximum likelihood estimation to a max-aposteriori estimation.</p>
<p>I first parameterize the shape of the exceptional label <span class="math">\(f_1(z;\vec{q})\)</span> with <span class="math">\(\vec{q} \in \mathbb{R}^m\)</span>. I then construct a Bayesian prior on it which is clustered near <span class="math">\(f_1(z)\)</span>. It follows straightforwardly from Bayes rule that:</p>
<div class="math">
\begin{equation*}
P\left(\alpha, \vec{q} | \{z_i\}_{1}^{N} \right) \propto P\left(\{z_i\}_{1}^{N}|\alpha, \vec{q}\right) P(\alpha, \vec{q}) = P(\alpha, \vec{q}) \prod_{i=1}^N \left[(1-\alpha)f_0(z_i) + \alpha f_1(z_i; \vec{q})\right]
\end{equation*}
</div>
<p>For simplicity I'm taking <span class="math">\(P(\alpha, \vec{q}) = P(\vec{q})\)</span>, i.e. a uniform prior on <span class="math">\(\alpha\)</span> independent of <span class="math">\(\vec{q}\)</span>.</p>
<p>Once the posterior is computed, we can replace <em>maximum likelihood</em> with <em>max-aposteriori</em> estimation. This provides a plausible point estimate for <span class="math">\((\alpha, \vec{q})\)</span> which we can then use for calibration.</p>
</div>
<div class="section" id="kernel-density-estimation-on-0-1">
<h2>Kernel Density Estimation on [0,1]</h2>
<p>The first step is doing kernel density estimation in 1-dimension in a manner that respects the domain of the function. Gaussian KDE does NOT fit the bill here because the support of a gaussian kernel is <span class="math">\(\mathbb{R}\)</span>, not <span class="math">\([0,1]\)</span>. One approach (which is somewhat technical and I couldn't make performant) is using beta-function KDE instead <a class="footnote-reference" href="#betakde" id="footnote-reference-6">[4]</a>. An additional technical challenge with using traditional KDE approaches on this problem is that whatever approach is taken, it also needs to be fit into a max-likelihood/max-aposteriori type method.</p>
<p>I took a simpler approach and simply used linear splines in a manner that's easy to work with in scipy. Suppose we have node points <span class="math">\(\zeta_0=0, \zeta_1, \ldots, \zeta_m=1\)</span>. Then let us define the distribution <span class="math">\(f_1(z; \vec{q})\)</span> as a normalized piecewise linear function:</p>
<div class="math">
\begin{equation*}
f_1(z; \vec{q}) = \frac{\zeta_{k+1}-z}{\zeta_{k+1}-\zeta_k} h_{k} +  \frac{z-\zeta_k}{\zeta_{k+1}-\zeta_k} h_{k+1}
\end{equation*}
</div>
<p>for <span class="math">\(z \in [\zeta_k,\zeta_{k+1}]\)</span> with <span class="math">\(h_i\)</span> defined as</p>
<div class="math">
\begin{equation*}
h_i=e^{q_i} / M(\vec{q}), \quad i=1\ldots m,
\end{equation*}
</div>
<div class="math">
\begin{equation*}
h_0=1/M(\vec{q}),
\end{equation*}
</div>
<p>and <span class="math">\(M(\vec{q})\)</span> the normalization constant (taking <span class="math">\(e^{q_0} \equiv 1\)</span>) which makes <span class="math">\(f_1(\cdot\,; \vec{q})\)</span> integrate to one:</p>
<div class="math">
\begin{equation*}
M(\vec{q}) = \sum_{i=1}^m \frac{1}{2} (e^{q_i}+e^{q_{i-1}})(\zeta_i-\zeta_{i-1}).
\end{equation*}
</div>
<p>I chose this parameterization because <code>scipy.optimize.minimize</code> doesn't do constrained optimization very well. With this parameterization, all values <span class="math">\(\vec{q} \in \mathbb{R}^m\)</span> yield a valid probability distribution on <span class="math">\([0,1]\)</span>.</p>
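<p>A stripped-down, hypothetical version of this parameterization (the real <code>PiecewiseLinearKDE</code> in the notebook also implements the gradients used below):</p>

```python
import numpy as np

class PiecewiseLinearDensity:
    """Piecewise linear pdf on [0,1], parameterized by unconstrained q."""
    def __init__(self, nodes, q):
        self.nodes = np.asarray(nodes)            # zeta_0=0, ..., zeta_m=1
        raw = np.concatenate([[1.0], np.exp(q)])  # h_0 pinned before normalizing
        # Total area under the unnormalized spline (trapezoid rule is exact here)
        M = np.sum(0.5 * (raw[1:] + raw[:-1]) * np.diff(self.nodes))
        self.h = raw / M                          # now integrates to exactly 1

    def __call__(self, z):
        return np.interp(z, self.nodes, self.h)

# q = 0 gives the uniform density on [0,1]:
pdf = PiecewiseLinearDensity(np.linspace(0, 1, 6), np.zeros(5))
print(pdf(0.37))  # 1.0
```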
<p>Python code implementing this is available in the linked notebook, implemented as <code>PiecewiseLinearKDE</code>. Calculations of <span class="math">\(\nabla_{\vec{q}} h_i(\vec{q})\)</span> - used in numerical optimization - can also be found in that notebook. Most of it is straightforward.</p>
<p>Fitting a piecewise linear distribution to data is only a few lines of code:</p>
<pre class="code literal-block">
from scipy.optimize import minimize

# z: the raw risk scores; zeta: the spline node points; reg: a small
# regularizer (e.g. 1e-6) keeping log() away from zero.
def objective(q):
    p = PiecewiseLinearKDE(zeta, q)
    return -1*np.log(p(z)+reg).sum() / len(z)

def deriv(q):
    p = PiecewiseLinearKDE(zeta, q)
    return -1*p.grad_q(z) &#64; (1.0/(p(z)+reg)) / len(z)

result = minimize(objective, jac=deriv, x0=np.zeros(shape=(len(zeta)-1,)), method='tnc', tol=1e-6, options={'maxiter': 10000})
result = PiecewiseLinearKDE(zeta, result.x)
</pre>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/calibrated_classifier_base_rates/mle_fit_one_dist.png" />
<p>The result is approximately what one might expect.</p>
<p>One useful coding trick to take away from this is our use of <code>np.interp</code> inside a number of methods of <code>PiecewiseLinearKDE</code>. Since the curve itself is computed as <code>np.interp(x, self.nodes, self.h())</code>, gradients of this w.r.t. <code>q</code> can then be computed by applying <code>np.interp(x, self.nodes, grad_h)</code> where <code>grad_h</code> is the gradient of <span class="math">\(\vec{h}\)</span> w.r.t. <span class="math">\(\vec{q}\)</span>. This allows the efficient calculation of gradients of likelihood functions as seen in <code>deriv</code> above, simplifying what might otherwise be index-heavy code.</p>
</div>
<div class="section" id="computing-a-posterior-and-max-aposteriori-estimation">
<h2>Computing a posterior and max-aposteriori estimation</h2>
<p>Defining a prior on a function space - e.g. the space of all probability distributions on [0,1] - is not a simple matter. However, once we've chosen a parameterization for <span class="math">\(f_1(z; \vec{q})\)</span>, it becomes straightforward. Since <span class="math">\(\vec{q} \in \mathbb{R}^m\)</span>, the restriction of any reasonable prior onto this space is absolutely continuous w.r.t. Lebesgue measure, thereby eliminating any theoretical concerns.</p>
<p>The situation we are attempting to model is a small to moderate <em>change</em> in the distribution of <span class="math">\(f_1(z)\)</span>, particularly in regions where <span class="math">\(f_0(z)\)</span> is small. So we will define the (unnormalized) prior to be:</p>
<div class="math">
\begin{equation*}
\textrm{prior}(\vec{q}) \sim \textrm{exp} \left( -\beta \int_0^1 \left[g(f_1(z; \vec{q}) - f_1(z))\right]^p f_0(z) dz \right)
\end{equation*}
</div>
<p>where <span class="math">\(g(x) = \sqrt{1+x^2}-1\)</span> is basically just a smoothed-out (differentiable) version of <span class="math">\(|x|\)</span>. We need a smooth version of <span class="math">\(|x|\)</span> simply because when we do max-aposteriori later, a smooth curve makes numerical minimization easier.</p>
<p>This prior should not be thought of as a principled Bayesian prior, but merely one chosen for convenience and because it regularizes the method. If we ignore the smoothing, this is analogous to a prior that penalizes deviation from <span class="math">\(f_1(z)\)</span> in the <span class="math">\(L^p(f_0(z) dz)\)</span> metric. The measure <span class="math">\(f_0(z) dz\)</span> is used to penalize deviation more in areas where <span class="math">\(f_0(z)\)</span> is large. The parameter <span class="math">\(\beta\)</span> represents the strength of the prior - larger <span class="math">\(\beta\)</span> means that <span class="math">\(f_1(z; \vec{q})\)</span> will remain closer to <span class="math">\(f_1(z)\)</span>.</p>
<p>One important note about the power <span class="math">\(p\)</span>. Because <span class="math">\(g(x) = O(x^2)\)</span> as <span class="math">\(x \rightarrow 0\)</span>, choosing <span class="math">\(p=1\)</span> does NOT actually generate any kind of sparsity penalty, in contrast to using <span class="math">\(|x|^1\)</span>.</p>
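<p>Both behaviors of <span class="math">\(g\)</span> are easy to check numerically:</p>

```python
import numpy as np

g = lambda x: np.sqrt(1 + x**2) - 1   # smoothed |x|

# Near zero, g(x) ~ x^2/2: quadratic, no kink at the origin,
# hence no sparsity-inducing behavior even with p = 1.
print(g(1e-3))   # ~5e-7, i.e. ~x^2/2

# For large |x|, g grows like |x|, so it penalizes like absolute deviation.
print(g(100.0))  # ~99.005
```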
<p>The likelihood is (as per the above):</p>
<div class="math">
\begin{equation*}
\textrm{likelihood}(\alpha, \vec{q}) \equiv P(\{z_i\}_{1}^{N}|\alpha, \vec{q}) = \prod_{i=1}^N \left[(1-\alpha)f_0(z_i) + \alpha f_1(z_i; \vec{q})\right]
\end{equation*}
</div>
<p>Computing the log of likelihood times prior (neglecting the normalization term from Bayes rule), we obtain:</p>
<div class="math">
\begin{equation*}
\log[\textrm{prior}(\vec{q})\textrm{likelihood}(\alpha, \vec{q})] = \left( \sum_{i=1}^N \log \left[(1-\alpha)f_0(z_i) + \alpha f_1(z_i; \vec{q})\right] \right)
\end{equation*}
</div>
<div class="math">
\begin{equation*}
- \beta \int_0^1 \left[g(f_1(z; \vec{q}) - f_1(z))\right]^p f_0(z) dz
\end{equation*}
</div>
<p>The gradient of this with respect to <span class="math">\((\alpha, \vec{q})\)</span> is:</p>
<div class="math">
\begin{equation*}
\frac{ \partial \ln P(\{z_i\}_{1}^{N}|\vec{q},\alpha)}{\partial \alpha} = \sum_{i=1}^N \frac{ f_1(z_i; \vec{q}) - f_0(z_i) }{(1-\alpha)f_0(z_i) + \alpha f_1(z_i; \vec{q}) }
\end{equation*}
</div>
<div class="math">
\begin{equation*}
\nabla_{\vec{q}} \ln P(\{z_i\}_{1}^{N}|\vec{q},\alpha) = \sum_{i=1}^N \frac{\alpha}{(1-\alpha)f_0(z_i) + \alpha f_1(z_i; \vec{q}) } \nabla_{\vec{q}} f_1(z_i; \vec{q})
\end{equation*}
</div>
<div class="math">
\begin{equation*}
- \beta \int_0^1 p g( f_1(\zeta; \vec{q}) - f_1(\zeta) )^{p-1} \frac{[f_1(\zeta; \vec{q}) - f_1(\zeta)] }{\sqrt{1+[f_1(\zeta; \vec{q}) - f_1(\zeta)]^2}}\left[ \nabla_{\vec{q}} f_1(\zeta;\vec{q}) \right] f_0(\zeta) d\zeta
\end{equation*}
</div>
<p>Using this objective function and gradient, it is straightforward to use <a class="reference external" href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html">scipy.optimize.minimize</a> to simultaneously find both <span class="math">\(\vec{q}\)</span> and <span class="math">\(\alpha\)</span>.</p>
<div class="section" id="examples">
<h3>Examples</h3>
<p><strong>Note:</strong> All of the examples here are computed in this <a class="reference external" href="https://www.chrisstucchio.com/blog_media/2020/calibrated_classifier_base_rates/dynamic_calibration.ipynb">Jupyter notebook</a>. For more specific details on how they were performed, the notebook is the place to look.</p>
<p>Here's an example. I took a distribution of 97.7% negative samples, with a relatively simple prior distribution. I simulated a significant change of shape in the distribution of <span class="math">\(z\)</span> scores of the positive class, which is illustrated in red in the graph below. As can be seen, the approximation (the orange line) is reasonably good. Moreover, we recover <span class="math">\(\alpha\)</span> with reasonable accuracy - the measured <span class="math">\(\alpha\)</span> was 0.0225 while the true <span class="math">\(\alpha\)</span> was 0.0234.</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/calibrated_classifier_base_rates/fitting_the_curve.png" />
<p>(The histograms in the graph illustrate the actual samples drawn.)</p>
<p>Using the fitted curve to compute calibration seems to work reasonably well, although simple isotonic regression is another way to do it.</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/calibrated_classifier_base_rates/calibration_in_sample.png" />
<p>The advantage of using this method is on out of sample data with a significantly different distribution of positive cases. I repeated this experiment, but with <span class="math">\(\alpha=0.011\)</span> and a marginally different distribution of positive cases.</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/calibrated_classifier_base_rates/calibration_out_of_sample.png" />
<p>The dynamically calculated calibration curve (the green line) still behaves well, while the isotonic fit calculated <em>for a different</em> <span class="math">\(\alpha\)</span> (unsurprisingly) does not provide good calibration.</p>
<p>Note that recalculating the isotonic fit is not possible, since that requires outcome data which is not yet available.</p>
</div>
</div>
<div class="section" id="estimating-bayes-loss">
<h2>Estimating Bayes loss</h2>
<p>The major use case for this method of calibration is reducing the loss of a decision rule due to model miscalibration. Consider a loss function which penalizes false positives and false negatives. Without loss of generality <a class="footnote-reference" href="#loss-function-form" id="footnote-reference-7">[6]</a>, such a loss function takes this form:</p>
<div class="math">
\begin{equation*}
L(1; 0) = 1/(1-T)
\end{equation*}
</div>
<div class="math">
\begin{equation*}
L(1; 1) = 0
\end{equation*}
</div>
<div class="math">
\begin{equation*}
L(0; 0) = 0
\end{equation*}
</div>
<div class="math">
\begin{equation*}
L(0; 1) = 1/T
\end{equation*}
</div>
<p>With this loss function, the optimal decision rule is to choose 1 (positive) whenever <span class="math">\(c(z) &gt;= T\)</span>, otherwise choose 0 (negative).</p>
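<p>A sketch of that decision rule and its expected losses:</p>

```python
def expected_loss(decision, c, T):
    # c: calibrated P(y=1); T: the threshold baked into the loss function above
    if decision == 1:
        return (1 - c) / (1 - T)   # expected false-positive cost
    return c / T                   # expected false-negative cost

T = 0.2
# Above the threshold, predicting positive is the cheaper decision...
print(expected_loss(1, c=0.3, T=T), expected_loss(0, c=0.3, T=T))
# ...below it, predicting negative is.
print(expected_loss(1, c=0.1, T=T), expected_loss(0, c=0.1, T=T))
```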
<p>Using the same example as above, we can compute the result of applying this decision rule using either isotonic calibration (static) or our dynamic rule to the test set. For almost every choice of threshold <span class="math">\(T\)</span>, the loss is significantly lower when using the dynamic calibration.</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/calibrated_classifier_base_rates/bayes_loss.png" />
</div>
<div class="section" id="other-metrics">
<h2>Other metrics</h2>
<p>A method such as this should NOT be expected to improve <a class="reference external" href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC_AUC</a>, and in fact in empirical tests this method does not. This is because ROC_AUC is based primarily on ordering of risk scores, and our calibration rule does not change the ordering.</p>
<p>The <a class="reference external" href="https://en.wikipedia.org/wiki/Brier_score">Brier Score</a> - an explicit metric of calibration, where lower is better - does tend to be somewhat worse with this method. This is of course completely expected: in my experiments, this method is less effective at generating a low Brier score than isotonic calibration, at least until either <span class="math">\(\alpha\)</span> or <span class="math">\(\vec{q}\)</span> actually changes.</p>
<p>The <a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html">average precision score</a>, on the other hand, does tend to improve over <em>multiple batches</em> with different <span class="math">\(\alpha, \vec{q}\)</span>.</p>
</div>
<div class="section" id="comparison-to-more-standard-label-shift-methods">
<h2>Comparison to more standard label shift methods</h2>
<p>Another approach (the approach of the papers linked in footnote <a class="footnote-reference" href="#saerens" id="footnote-reference-8">[2]</a>) is to simply fit <span class="math">\(\alpha\)</span> without allowing <span class="math">\(f_1(z)\)</span> to change.</p>
<p>In experiments, I've noticed that fitting <span class="math">\(\alpha\)</span> without allowing <span class="math">\(f_1(z)\)</span> to change generally produces a more accurate estimate of <span class="math">\(\alpha\)</span>, even in situations where the true distribution differs significantly from <span class="math">\(f_1(z)\)</span>.</p>
<p>However, in spite of a more accurate estimate of <span class="math">\(\alpha\)</span>, the resulting calibration curves from fitting only <span class="math">\(\alpha\)</span> do not tend to be as accurate. The curve that comes from fitting <span class="math">\(\alpha, \vec{q}\)</span> is more accurate than the fit of <span class="math">\(\alpha\)</span> alone:</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/calibrated_classifier_base_rates/compare_to_mlls.png" />
<div class="section" id="future-work">
<h3>Future work</h3>
<p>At this stage I do not consider this method in any sense &quot;production ready&quot;. I do not have a great grasp of the conditions under which this method works or fails. I've also observed that <code>scipy.optimize.minimize</code> very frequently fails to converge, yet returns a useful result anyway. Most likely I'm demanding too tight a tolerance.</p>
<p>I've also tried a couple of other ways to parameterize the probability distributions, and the method seems quite sensitive to them. For example, I included an unnecessary parameter in an earlier variation - <span class="math">\(h_0=e^{q_0}/M(\vec{q})\)</span> - and this caused the method to fail to converge entirely. I'm not entirely sure why.</p>
<p>There is a corresponding <a class="reference external" href="https://www.chrisstucchio.com/blog_media/2020/calibrated_classifier_base_rates/dynamic_calibration.ipynb">Jupyter notebook</a> with the code to do this. If anyone finds this useful and is able to move it forward, please let me know! As a warning, playing around with the code in the notebook will make the warts of the method fairly visible - e.g. once in a while, a cell will fail to converge, or converge to something a bit weird.</p>
<p>However, overall I am encouraged by this. I believe it's a promising approach to dynamically adjusting calibration curves and better using prediction models in a context when the distribution of the positive class is highly variable.</p>
</div>
</div>
<div class="section" id="active-learning">
<h2>Active learning</h2>
<p>As one additional note, I'll mention that I have some work (which I'll write about soon) suggesting that if we can request labels for a subset of the data points, we can do reasonably efficient active learning of calibration curves. This appears to significantly improve accuracy and reduce the number of samples needed.</p>
<p class="rubric"><strong>Notes</strong></p>
<table class="docutils footnote" frame="void" id="lifeexpectancycalc" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#footnote-reference-1">[1]</a></td><td>In reality 78 should be replaced with life expectancy <em>at the time of diagnosis</em>, which is typically larger than the mean population life expectancy. This is a technical detail irrelevant for this post.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="saerens" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label">[2]</td><td><em>(<a class="fn-backref" href="#footnote-reference-2">1</a>, <a class="fn-backref" href="#footnote-reference-8">2</a>, <a class="fn-backref" href="#footnote-reference-9">3</a>)</em> <a class="reference external" href="https://www.chrisstucchio.com/blog_media/2020/calibrated_classifier_base_rates/Adjusting_the_Outputs_of_a_Classifier_to_New_a_Priori_Probabilities_A_Simple_Procedure__Saerens2002a.pdf">Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure</a>, by Marco Saerens, Patrice Latinne &amp; Christine Decaestecker. Another useful paper is <a class="reference external" href="https://www.chrisstucchio.com/blog_media/2020/calibrated_classifier_base_rates/EM_with_Bias_Corrected_Calibration_is_Hard_To_Beat_at_Label_Shift_Adaptation__1901.06852v4.pdf">EM with Bias-Corrected Calibration is Hard-To-Beat at Label Shift Adaptation</a> which compares the maximum likelihood method with other more complex methods and finds it's generally competitive. This <a class="reference external" href="https://www.chrisstucchio.com/blog_media/2020/calibrated_classifier_base_ratesA_Unified_View_of_Label_Shift_Estimation__2003.07554v1.pdf">paper</a> also suggests max likelihood type methods are usually the best.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="serensdiff" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#footnote-reference-3">[3]</a></td><td>The approach taken in the papers cited in <a class="footnote-reference" href="#saerens" id="footnote-reference-9">[2]</a> is a bit different - they actually generate parameters representing outcome variables, which requires the use of expectation maximization. The approach I'm describing just represents likelihoods of z-scores and ignores outcomes. In principle these approaches are quite similar, but in testing the version I use is a bit simpler and still works.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="betakde" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#footnote-reference-6">[4]</a></td><td><a class="reference external" href="https://www.chrisstucchio.com/blog_media/2020/calibrated_classifier_base_rates/Adaptive_Estimation_of_a_Density_Function_Using_Beta_Kernels__BK-2014.pdf">Adaptive Estimation of a Density Function Using Beta Kernels</a> by Karine Bertin and Nicolas Klutchnikoff.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="impossibility" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label">[5]</td><td><em>(<a class="fn-backref" href="#footnote-reference-4">1</a>, <a class="fn-backref" href="#footnote-reference-5">2</a>)</em> Suppose that the distribution <span class="math">\(f_1(z)\)</span> changes so that <span class="math">\(f_1(z)=f_0(z)\)</span>. Then for all <span class="math">\(\alpha_0, \alpha_1 \in [0,1]\)</span>, <span class="math">\([(1-\alpha_0)f_0(Z) + \alpha_0 f_1(Z)] \equiv [(1-\alpha_1)f_0(Z) + \alpha_1 f_1(Z)]\)</span> and therefore it is impossible to distinguish between different values of <span class="math">\(\alpha\)</span> from the distribution of <span class="math">\(z\)</span> alone.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="loss-function-form" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#footnote-reference-7">[6]</a></td><td>Suppose we had an arbitrary loss function with a false positive cost of <span class="math">\(A\)</span> and a false negative cost of <span class="math">\(B\)</span>. Then define <span class="math">\(T=(A/B)/(1+A/B)\)</span> and <span class="math">\(C=BT\)</span>. This is equivalent to a loss function with penalties <span class="math">\(C/(1-T)\)</span> for false positives and <span class="math">\(C/T\)</span> for false negatives, which differs from our choice of loss function only by a multiplicative constant <span class="math">\(C\)</span>.</td></tr>
</tbody>
</table>
</div>
<script type='text/javascript'>if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
    var align = "center",
        indent = "0em",
        linebreak = "false";

    if (false) {
        align = (screen.width < 768) ? "left" : align;
        indent = (screen.width < 768) ? "0em" : indent;
        linebreak = (screen.width < 768) ? 'true' : linebreak;
    }

    var mathjaxscript = document.createElement('script');
    mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
    mathjaxscript.type = 'text/javascript';
    mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';

    var configscript = document.createElement('script');
    configscript.type = 'text/x-mathjax-config';
    configscript[(window.opera ? "innerHTML" : "text")] =
        "MathJax.Hub.Config({" +
        "    config: ['MMLorHTML.js']," +
        "    TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
        "    jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
        "    extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
        "    displayAlign: '"+ align +"'," +
        "    displayIndent: '"+ indent +"'," +
        "    showMathMenu: true," +
        "    messageStyle: 'normal'," +
        "    tex2jax: { " +
        "        inlineMath: [ ['\\\\(','\\\\)'] ], " +
        "        displayMath: [ ['$$','$$'] ]," +
        "        processEscapes: true," +
        "        preview: 'TeX'," +
        "    }, " +
        "    'HTML-CSS': { " +
        "        availableFonts: ['STIX', 'TeX']," +
        "        preferredFont: 'STIX'," +
        "        styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
        "        linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
        "    }, " +
        "}); " +
        "if ('default' !== 'default') {" +
            "MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
            "MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
        "}";

    (document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
    (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
              ]]>
            </content>
        </entry>
        <entry>
            <title type="html">Shareholder Short-Termism Theory has Died of COVID-19</title>
            <author><name>Chris Stucchio</name></author>
            <link href="https://www.chrisstucchio.com/blog/2020/coronavirus_killed_shareholder_short_termism.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss"/>
            <updated>2020-06-17T08:30:00+02:00</updated>
            <published>2020-06-17T08:30:00+02:00</published>
            <id>https://www.chrisstucchio.com/blog/2020/coronavirus_killed_shareholder_short_termism.html</id>

            <content type="html">
              <![CDATA[
                (If mathematical formulas are not showing up here, <a href="https://www.chrisstucchio.com/blog/2020/coronavirus_killed_shareholder_short_termism.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss">please read this on the web.</a>.)
                <p>It's become a popular meme that &quot;shareholders only care about the next quarter&quot;. Lots of people make arguments like this - for example, <a class="reference external" href="https://www.wsj.com/articles/short-termism-is-harming-the-economy-1528336801">Jamie Dimon and Warren Buffet</a>. As the meme goes, shareholders only care about the next quarter of earnings, and CEOs make decisions accordingly - sacrificing long term profitability to meet quarterly expectations.</p>
<p>But is this meme true?</p>
<p>Coronavirus gives us a great empirical test of this theory.</p>
<div class="section" id="formalizing-the-theory">
<h2>Formalizing the theory</h2>
<p>The first step in answering this question is to formalize the theory. The most straightforward way I can think of to do this is through the lens of <a class="reference external" href="https://www.investopedia.com/terms/n/npv.asp">net present value</a>, albeit with a modified discount rate.</p>
<p>This framework says that the value of any cash generating asset is given by:</p>
<div class="math">
\begin{equation*}
V = \sum_{t=1}^\infty R_t d_t
\end{equation*}
</div>
<p>In this sum, <span class="math">\(R_t\)</span> is the cash flow in time period <span class="math">\(t\)</span> and <span class="math">\(d_t\)</span> is the <em>discount factor</em> of time <span class="math">\(t\)</span>.</p>
<p>Here's a fairly simple example - a US treasury bill guaranteed to pay a $100 coupon for 3 periods and then a final $10,000 in the 4th. In tabular form:</p>
<table border="1" class="docutils">
<colgroup>
<col width="67%" />
<col width="33%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">date</th>
<th class="head">R</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>2020-06-30</td>
<td>100</td>
</tr>
<tr><td>2020-09-30</td>
<td>100</td>
</tr>
<tr><td>2020-12-31</td>
<td>100</td>
</tr>
<tr><td>2021-03-31</td>
<td>10000</td>
</tr>
</tbody>
</table>
<p>To complete the calculation, we need to time discount each cash payment. This is typically done by taking the risk free interest rate - say 5% - and applying that to each time period. For example:</p>
<table border="1" class="docutils">
<colgroup>
<col width="38%" />
<col width="19%" />
<col width="15%" />
<col width="27%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">date</th>
<th class="head">R</th>
<th class="head">d</th>
<th class="head">R*d</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>2020-06-30</td>
<td>100</td>
<td>1.00</td>
<td>100.00</td>
</tr>
<tr><td>2020-09-30</td>
<td>100</td>
<td>0.99</td>
<td>98.75</td>
</tr>
<tr><td>2020-12-31</td>
<td>100</td>
<td>0.98</td>
<td>97.51</td>
</tr>
<tr><td>2021-03-31</td>
<td>10000</td>
<td>0.96</td>
<td>9631.85</td>
</tr>
</tbody>
</table>
<p>Finally, the value of the bond is the sum of the <code>R*d</code> column, which is $9928.12 in this example.</p>
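The table arithmetic is easy to reproduce in code. Below is a quick sketch of mine (not from the post) that discounts quarterly cash flows at a quarterly-compounded annual rate; the exact cents differ slightly from the table, whose discount factors reflect a particular rounding and day-count convention:

```python
def npv(cashflows, annual_rate, periods_per_year=4):
    """Net present value of evenly spaced cash flows, one per period."""
    q = (1 + annual_rate) ** (-1 / periods_per_year)  # per-period discount factor
    return sum(r * q ** t for t, r in enumerate(cashflows))

bond = [100, 100, 100, 10000]
value = npv(bond, 0.05)  # within a fraction of a percent of the table's $9928.12
```

Swapping in a punitive rate (e.g. <code>annual_rate=0.25</code>) yields a sharply lower valuation, mirroring the short-termist table that follows.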
<div class="section" id="short-termism-in-this-framework">
<h3>Short-Termism in this framework</h3>
<p>In this framework, short-termism can be straightforwardly represented by the <code>d</code> column - specifically, <code>d</code> will rapidly decrease over time. For instance, a very short term valuation of the same bond (a 25% discount rate) might be described as:</p>
<table border="1" class="docutils">
<colgroup>
<col width="38%" />
<col width="19%" />
<col width="15%" />
<col width="27%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">date</th>
<th class="head">R</th>
<th class="head">d</th>
<th class="head">R*d</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>2020-06-30</td>
<td>100</td>
<td>1.00</td>
<td>100.00</td>
</tr>
<tr><td>2020-09-30</td>
<td>100</td>
<td>0.94</td>
<td>93.90</td>
</tr>
<tr><td>2020-12-31</td>
<td>100</td>
<td>0.88</td>
<td>88.17</td>
</tr>
<tr><td>2021-03-31</td>
<td>10000</td>
<td>0.83</td>
<td>8289.90</td>
</tr>
</tbody>
</table>
<p>which yields a valuation of $8571.97.</p>
<p>Given that Treasury valuations do not look anything like this, we can certainly see that <em>bond</em> investors are not vulnerable to the short-termism that <em>stock</em> investors purportedly suffer from.</p>
<p>The straw man version of &quot;shareholders only care about the next quarter&quot; would mean that <code>d=0</code> for all quarters past the next one.</p>
<p>I will examine this model for mathematical understanding, though I don't think it's a particularly fair thing to do.</p>
</div>
<div class="section" id="modeling-an-uncertain-future">
<h3>Modeling an uncertain future</h3>
<p>Now let us consider a stock rather than a bond - specifically, a pharma company with a single drug in the final phase of clinical trials which end in 1 year.</p>
<p>The cashflow is quite certain for the next year - <code>R[0:4] == 0</code>, i.e. the company loses money running the clinical trial and pays nothing to shareholders. After a year, there are two possible outcomes:</p>
<ol class="arabic simple">
<li>The good outcome. <code>R_good = +1000</code>, the drug works, everyone buys it for 17 years, company is valuable.</li>
<li>The bad outcome. <code>R_bad = 0</code>, the drug does not work, company is worthless.</li>
</ol>
<table border="1" class="docutils">
<colgroup>
<col width="36%" />
<col width="21%" />
<col width="18%" />
<col width="25%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">date</th>
<th class="head">R_good</th>
<th class="head">R_bad</th>
<th class="head">d</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>2020-06-30</td>
<td>0</td>
<td>0</td>
<td>1.0000</td>
</tr>
<tr><td>2020-09-30</td>
<td>0</td>
<td>0</td>
<td>0.9987</td>
</tr>
<tr><td>2020-12-31</td>
<td>0</td>
<td>0</td>
<td>0.9975</td>
</tr>
<tr><td>2021-03-31</td>
<td>0</td>
<td>0</td>
<td>0.9963</td>
</tr>
<tr><td>2021-06-30</td>
<td>1000</td>
<td>0</td>
<td>0.9950</td>
</tr>
<tr><td>2021-09-30</td>
<td>1000</td>
<td>0</td>
<td>0.9938</td>
</tr>
<tr><td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr><td>2036-06-30</td>
<td>1000</td>
<td>0</td>
<td>0.9231</td>
</tr>
<tr><td>2036-09-30</td>
<td>1000</td>
<td>0</td>
<td>0.9220</td>
</tr>
<tr><td>2036-12-31</td>
<td>1000</td>
<td>0</td>
<td>0.9208</td>
</tr>
<tr><td>2037-03-31</td>
<td>1000</td>
<td>0</td>
<td>0.9197</td>
</tr>
</tbody>
</table>
<p>The company has two eventual valuations (at a long-termist 0.5% discount rate), depending on whether we believe the <code>R_good</code> or <code>R_bad</code> column represents the future - $61,238 in the first case and $0 in the second.</p>
<p>If we assume a 60% chance of the drug getting through clinical trials, then the value of the company would be <code>0.6 * $61238 + 0.4 * 0 = 36742.80</code>.</p>
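As a one-line sketch, the probability-weighted valuation works out as:

```python
p_approval = 0.60              # assumed chance of clearing clinical trials
v_good, v_bad = 61238, 0       # discounted valuation in each scenario
expected_value = p_approval * v_good + (1 - p_approval) * v_bad
# expected_value is roughly $36,743
```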
<p>Note that in the straw man case of <em>literally only the next quarter matters</em>, this company is worth $0 in all possible scenarios - its first actual profit is a year out.</p>
</div>
</div>
<div class="section" id="long-term-investors-appear-short-term">
<h2>Long term investors <em>appear</em> short term</h2>
<p>Let's now consider a long term investor who is evaluating a blue chip, highly stable stock. This stock regularly has earnings of $100 per quarter. Then one quarter, it misses earnings and reports only $75!</p>
<p>An investor infected by short-termism will significantly cut their valuation of the company - since <code>d=0</code> for all future periods, the value drops from $100 to $75, a 25% decrease.</p>
<p>Let us now consider a long term investor.</p>
<table border="1" class="docutils">
<colgroup>
<col width="43%" />
<col width="26%" />
<col width="30%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">date</th>
<th class="head">R</th>
<th class="head">d</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>2020-06-30</td>
<td>75</td>
<td>1.0000</td>
</tr>
<tr><td>2020-09-30</td>
<td>100</td>
<td>0.9987</td>
</tr>
<tr><td>2020-12-31</td>
<td>100</td>
<td>0.9975</td>
</tr>
<tr><td>2021-03-31</td>
<td>100</td>
<td>0.9963</td>
</tr>
<tr><td>2021-06-30</td>
<td>100</td>
<td>0.9950</td>
</tr>
<tr><td>.</td>
<td>.</td>
<td>.</td>
</tr>
</tbody>
</table>
<p>Summed over the period shown (through 2037), the value of this revenue stream works out to $6498. In contrast, had one quarter's earnings not been missed, it would be $6523, a difference of about 0.4%. Thus, if there is a drop in share price of significantly more than 0.4%, one might hypothesize that this is due to the market taking a short termist view.</p>
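These figures can be reproduced, up to rounding of the tabulated discount factors, by summing the 68 quarterly payments from 2020-06-30 through 2037-03-31 at the long-termist 0.5% annual rate. A sketch:

```python
def stream_value(payment, quarters, annual_rate=0.005):
    """Present value of a constant quarterly payment stream."""
    q = (1 + annual_rate) ** (-1 / 4)  # quarterly discount factor
    return sum(payment * q ** t for t in range(quarters))

full = stream_value(100, 68)   # no missed quarter: about $6523
missed = full - 25             # first-quarter payment of 75 instead of 100
```

The miss costs exactly $25 in undiscounted terms (the first quarter is not discounted), which is the 0.4% difference.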
<p>Let us now consider a long term investor who actively tries to think through cause and effect. Earnings decreased, and there must be some reason for it! The question to ask is therefore whether this reduction in a single quarter's earnings will continue into the future. We encounter a situation similar to the pharma stock discussed earlier:</p>
<table border="1" class="docutils">
<colgroup>
<col width="36%" />
<col width="21%" />
<col width="18%" />
<col width="25%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">date</th>
<th class="head">R_good</th>
<th class="head">R_bad</th>
<th class="head">d</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>2020-06-30</td>
<td>75</td>
<td>75</td>
<td>1.0000</td>
</tr>
<tr><td>2020-09-30</td>
<td>100</td>
<td>75</td>
<td>0.9987</td>
</tr>
<tr><td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr><td>2036-09-30</td>
<td>100</td>
<td>75</td>
<td>0.9220</td>
</tr>
<tr><td>2036-12-31</td>
<td>100</td>
<td>75</td>
<td>0.9208</td>
</tr>
<tr><td>2037-03-31</td>
<td>100</td>
<td>75</td>
<td>0.9197</td>
</tr>
</tbody>
</table>
<p>In the <code>R_bad</code> scenario, the company will only be worth $4892 (a 25% decrease from its previous value).</p>
<p>If the long term investor believes that there is a 40% chance of this occurring, then the value of the stock decreases to $5855.75, a 10% drop!</p>
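Again a small sketch, using the rounded per-scenario valuations from the tables above (the post's $5855.75 comes from unrounded inputs):

```python
p_bad = 0.40                  # investor's probability of the bad scenario
v_good, v_bad = 6498, 4892    # per-scenario valuations from the tables
expected = (1 - p_bad) * v_good + p_bad * v_bad
baseline = 6523               # valuation had earnings not been missed at all
drop = (baseline - expected) / baseline   # close to a 10% drop
```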
<p>Even though the long term investor doesn't care much about a single quarter's earnings, he cares a lot about whether this predicts many more quarters of reduced earnings. This means that even long term investors behave in the manner that others describe as &quot;short-termist&quot;.</p>
<p>As a result, both the short-termism theory and the long-termism theory <em>make very similar predictions</em>. The fact that stock prices move significantly in response to missed earnings estimates is insufficient to distinguish between these two theories.</p>
</div>
<div class="section" id="coronavirus-as-a-natural-experiment">
<h2>Coronavirus as a natural experiment</h2>
<p>Coronavirus provides a great natural experiment with which to test this theory. A large number of stocks are going to have several quarters of significantly reduced earnings, but there is very little uncertainty as to the reason.</p>
<p>We know that right now people don't go to $SBUX, $MCD or $CMG. This will be a big hit to their earnings. But on the flip side we know exactly why - people haven't suddenly decided they dislike Big Macs, they just don't want to catch coronavirus.</p>
<p>Meanwhile, we're being treated to articles like <a class="reference external" href="https://www.zerohedge.com/markets/great-divide-between-stocks-economy">The Great Divide Between Stocks &amp; The Economy</a> which discuss a significant deviation between stock market performance and <em>short term</em> economic indicators.</p>
<p>One thing is pretty clear: the next couple of quarters of earnings are going to suck. But unlike many earnings misses, we know exactly why. And for most blue chip stocks in the S&amp;P 500, we have great reason to believe that this is strictly temporary, and once coronavirus is dealt with things will go back to normal.</p>
<div class="figure" style="width: 460px">
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/coronavirus_killed_shareholder_short_termism/spy.png" />
<p class="caption">SPY over time.</p>
</div>
<p>Although SPY is nowadays driven predominantly by technology companies, even solid brick and mortar companies that are likely to survive the crisis - such as McDonald's or Starbucks - show a similar share price trajectory:</p>
<div class="figure" style="width: 460px">
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/coronavirus_killed_shareholder_short_termism/mcd.png" />
<p class="caption">MCD over time.</p>
</div>
<div class="figure" style="width: 460px">
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/coronavirus_killed_shareholder_short_termism/sbux.png" />
<p class="caption">SBUX over time.</p>
</div>
<p>If shareholders only care about the next quarter, there is no case to make that $SBUX is worth holding.</p>
<p>However, if shareholders are looking to the long term, then $SBUX and $MCD are pretty solid stocks to hold. If investors have a low discount rate, the impact of the next couple of quarters on LTV is not that big. As long as investors are convinced that there <em>actually is</em> a long term for the company, $SBUX and $MCD remain solid investments. The calculus of a long term investor is exactly as described before:</p>
<table border="1" class="docutils">
<colgroup>
<col width="43%" />
<col width="26%" />
<col width="30%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">date</th>
<th class="head">R_SBUX</th>
<th class="head">d</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>2020-06-30</td>
<td>50</td>
<td>1.0000</td>
</tr>
<tr><td>2020-09-30</td>
<td>50</td>
<td>0.9987</td>
</tr>
<tr><td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr><td>2036-09-30</td>
<td>90</td>
<td>0.9220</td>
</tr>
<tr><td>2036-12-31</td>
<td>90</td>
<td>0.9208</td>
</tr>
<tr><td>2037-03-31</td>
<td>90</td>
<td>0.9197</td>
</tr>
</tbody>
</table>
<p>Revenues may not be quite as large after the crisis as before - some locations will certainly close - but overall the revenue stream is likely to continue.</p>
<p>In contrast, other stocks such as $CCL (Carnival Cruise Lines) do not have such a rosy outlook. Unlike $SBUX, $CCL may simply go bankrupt - in that case, shareholders get nothing.</p>
<div class="figure" style="width: 460px">
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/coronavirus_killed_shareholder_short_termism/ccl.png" />
<p class="caption">CCL over time.</p>
</div>
<p>$CCL has a very uncertain future. Its business will likely remain solid if it can survive the crisis - its August cruises are already fully booked. However, it is not clear that $CCL can survive without first going through bankruptcy: ships are expensive to store, CCL has financing costs, and it is not obvious it can cover these without incoming revenue.</p>
<p>This makes its valuation look a lot more uncertain, with both a good and a bad outcome:</p>
<table border="1" class="docutils">
<colgroup>
<col width="36%" />
<col width="21%" />
<col width="18%" />
<col width="25%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">date</th>
<th class="head">R_good</th>
<th class="head">R_bad</th>
<th class="head">d</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>2020-06-30</td>
<td>0</td>
<td>0</td>
<td>1.0000</td>
</tr>
<tr><td>2020-09-30</td>
<td>0</td>
<td>0</td>
<td>0.9987</td>
</tr>
<tr><td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr><td>2036-09-30</td>
<td>100</td>
<td>0</td>
<td>0.9220</td>
</tr>
<tr><td>2036-12-31</td>
<td>100</td>
<td>0</td>
<td>0.9208</td>
</tr>
<tr><td>2037-03-31</td>
<td>100</td>
<td>0</td>
<td>0.9197</td>
</tr>
</tbody>
</table>
<p>Thus, the stock price of $CCL remains low because it is unclear that it will ever come back. A high probability is assigned to the bad outcome, and thus the LTV is low.</p>
<div class="section" id="these-results-are-completely-inconsistent-with-short-termism">
<h3>These results are completely inconsistent with short-termism</h3>
<p>These results are completely inconsistent with a short term &quot;only the next quarter matters&quot; view of the market.</p>
<p>Consider a short term view of $CCL. In the short term view the only quarters which matter are the next few ones, and <em>these quarters are all zero</em> in both the good and bad scenarios. Thus, $CCL is completely worthless for anything besides option value in all possible worlds.</p>
<p>But <em>the same thing is also true for $SBUX and $MCD</em>. For an investor with only short term horizons, $SBUX and $MCD should also be worthless. Yet they have almost completely recovered in value, along with a wide variety of other companies that have a great long term value prop but a terrible short term.</p>
</div>
<div class="section" id="all-about-liquidity-injections">
<h3>All about liquidity injections</h3>
<p>If I had a comment section, I would fully expect someone to talk about how this is all invalid because of the Fed.</p>
<div class="figure" style="width: 500px">
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/coronavirus_killed_shareholder_short_termism/brrr.jpg" />
</div>
<p>But this betrays a fundamental misunderstanding of accounting. When &quot;money printer go brrr&quot;, the actual mechanism by which this happens is that new debt is issued at favorable terms. Taking out a loan is not earnings, so the short term forecast for a company impacted by Coronavirus is unchanged. A short term investor will be completely unmoved by this!</p>
<p>The money printer does have one very important effect - it significantly increases the odds that a company will survive the crisis and earn good <em>long term</em> returns. The next couple of quarters will suck, but the long term value of the company remains.</p>
</div>
</div>
<div class="section" id="the-short-termism-theory-has-died-of-covid">
<h2>The Short-Termism theory has died of COVID</h2>
<p>Coronavirus is a great natural experiment for a lot of things.</p>
<p>One of the most important things we can take away from it is the conclusion that equity markets are fundamentally focused on the long term value of the companies being traded. There are fast responses to problems with next quarter earnings, but these are primarily driven by the fact that problems in the short run tend to be indicative of more fundamental issues.</p>
<p>Now that we have a systematic example where we know that short run problems are strictly short run, we can safely disambiguate between short termism and long termism. The result is very clear: the market is predominantly focused on the long term.</p>
<p><strong>Disclosure:</strong> Long $SBUX, $CCL.</p>
</div>
<script type='text/javascript'>if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
    var align = "center",
        indent = "0em",
        linebreak = "false";

    if (false) {
        align = (screen.width < 768) ? "left" : align;
        indent = (screen.width < 768) ? "0em" : indent;
        linebreak = (screen.width < 768) ? 'true' : linebreak;
    }

    var mathjaxscript = document.createElement('script');
    mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
    mathjaxscript.type = 'text/javascript';
    mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';

    var configscript = document.createElement('script');
    configscript.type = 'text/x-mathjax-config';
    configscript[(window.opera ? "innerHTML" : "text")] =
        "MathJax.Hub.Config({" +
        "    config: ['MMLorHTML.js']," +
        "    TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
        "    jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
        "    extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
        "    displayAlign: '"+ align +"'," +
        "    displayIndent: '"+ indent +"'," +
        "    showMathMenu: true," +
        "    messageStyle: 'normal'," +
        "    tex2jax: { " +
        "        inlineMath: [ ['\\\\(','\\\\)'] ], " +
        "        displayMath: [ ['$$','$$'] ]," +
        "        processEscapes: true," +
        "        preview: 'TeX'," +
        "    }, " +
        "    'HTML-CSS': { " +
        "        availableFonts: ['STIX', 'TeX']," +
        "        preferredFont: 'STIX'," +
        "        styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
        "        linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
        "    }, " +
        "}); " +
        "if ('default' !== 'default') {" +
            "MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
            "MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
        "}";

    (document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
    (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
              ]]>
            </content>
        </entry>
        <entry>
            <title type="html">Scalably Detecting Odd-looking Histograms</title>
            <author><name>Chris Stucchio</name></author>
            <link href="https://www.chrisstucchio.com/blog/2020/detecting_funny_histograms.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss"/>
            <updated>2020-03-24T08:30:00+01:00</updated>
            <published>2020-03-24T08:30:00+01:00</published>
            <id>https://www.chrisstucchio.com/blog/2020/detecting_funny_histograms.html</id>

            <content type="html">
              <![CDATA[
                (If mathematical formulas are not showing up here, <a href="https://www.chrisstucchio.com/blog/2020/detecting_funny_histograms.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss">please read this on the web</a>.)
                <p>A lot of suspicious behavior can be detected simply by looking at a histogram. Here's a nice example. There's a paper <a class="reference external" href="https://www.chrisstucchio.com/blog_media/2020/detecting_funny_histograms/bbbec3c0722a5f0eedd09f5f23043a47b6a6.pdf">Distributions of p-values smaller than .05 in Psychology: What is going on?</a> which attempts to characterize the level of data manipulation performed in academic psychology. Now under normal circumstances, one would expect a nice smooth distribution of p-values resulting from honest statistical analysis.</p>
<p>What actually shows up when they measure it is something else entirely:</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/detecting_funny_histograms/p_values.png" />
<p>Another example happened to me when I was doing credit underwriting. A front-line team came to me with concerns that some of our customers might not be genuine, and in fact some of them might be committing fraud! Curious, I started digging into the data and made a histogram to get an idea of spending per customer. The graph looked something like this:</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/detecting_funny_histograms/example_odd_histogram.png" />
<p>The value <code>x=10</code> corresponded to the credit limit we were giving out to many of our customers. For some reason, a certain cohort of users were spending as much as possible on the credit lines we gave them. Further investigation determined that most of those customers were not repaying the money we lent them.</p>
<p>In contrast, under normal circumstances, a graph of the same quantity would typically look like this:</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/detecting_funny_histograms/business_as_usual.png" />
<p>A third example - with graphs very similar to the previous example - happened to me when debugging some DB performance issues. We had a database in US-East which was replicated to US-West. Read performance in US-West was weirdly slow, and when we made a histogram of request times, it turned out that the slowness was driven primarily by a spike at around 90ms. Coincidentally, 90ms was the ping time between our US-East and US-West servers. It turned out that a misconfiguration resulted in the US-West servers occasionally querying the US-East read replica instead of the US-West one, adding 90ms to the latency.</p>
<p>A fourth example comes from the paper <a class="reference external" href="https://www.chrisstucchio.com/blog_media/2020/detecting_funny_histograms/EEAESEM2019-1123.pdf">Under Pressure?  Performance Evaluation of Police Officers as an Incentive to Cheat: Evidence from Drug Crimes in Russia</a> which discovers odd spikes in the amount of drugs found in police searches.</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/detecting_funny_histograms/drugs.png" />
<p>It sure is very strange that so many criminals all choose to carry the exact amount of heroin needed to trigger harsher sentencing thresholds, and never a few grams less.</p>
<p>In short, many histograms should be relatively smooth and decreasing. When such histograms display a spike, that spike is a warning sign that something is wrong and we should give it further attention.</p>
<p>In all the cases above, I made these histograms as part of a post-hoc analysis. Once the existence of a problem was suspected, further evidence was gathered and the spike in the histogram was one piece of evidence. I've always been interested in the question - can we instead automatically scan histograms for spikes like the above and alert humans to a possible problem when they arise?</p>
<p>This blog post answers the question in the affirmative, at least theoretically.</p>
<div class="section" id="mathematically-modeling-a-single-test">
<h2>Mathematically modeling a single test</h2>
<p>To model this problem in the frequentist hypothesis testing framework, let us assume we have a continuous probability distribution supported on <span class="math">\([0,\infty)\)</span>. As our null hypothesis - i.e. nothing unusual to report - we'll assume this distribution is absolutely continuous with respect to Lebesgue measure and that its pdf <span class="math">\(f(x)\)</span> is monotonically decreasing, i.e. <span class="math">\(f(y) \leq f(x)\)</span> for <span class="math">\(y \geq x\)</span> (almost everywhere).</p>
<p>In contrast, for the alternative hypothesis - something worth flagging as potentially bad - I'll assume that the distribution is a mixture with pdf <span class="math">\((1-\beta) f(x) + \beta s(x)\)</span>. Here <span class="math">\(s(x)\)</span> is monotonically increasing, or more typically a point mass <span class="math">\(s(x) = \delta(x-x_0)\)</span>.</p>
<p><strong>Observation:</strong> Consider a probability distribution whose pdf <span class="math">\(f(x)\)</span> is monotonically decreasing. Then the cumulative distribution function <span class="math">\(F(x)=\int_0^x f(t) dt\)</span> is <a class="reference external" href="https://en.wikipedia.org/wiki/Concave_function">concave</a>. This can be proven by noting that its derivative, <span class="math">\(F'(x) = f(x)\)</span>, is monotonically decreasing.</p>
<p>Our hypothesis test for distinguishing between the null and alternative hypothesis will be based on concavity. Specifically, if the pdf of a distribution has a spike, then its CDF may cease to be concave at the location of the spike. Here's an illustration. First, consider the empirical CDF of a distribution whose pdf is monotonically decreasing:</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/detecting_funny_histograms/concave_cdf.png" />
<p>This graph is clearly concave. The red line illustrates a chord which must, by concavity, remain below the actual curve.</p>
<p>In contrast, a pdf with a spike in it will fail to be concave near the spike. Here's an illustration:</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/detecting_funny_histograms/non_concave_cdf.png" />
<p>At <span class="math">\(x=10\)</span> the chord (the red line) is above the graph of the CDF (the green line).</p>
<p>In mathematical terms, <a class="reference external" href="https://en.wikipedia.org/wiki/Concave_function">concavity</a> of the true CDF can be expressed as the relation:</p>
<div class="math">
\begin{equation*}
F(x + \alpha (y-x)) \geq (1-\alpha)F(x) + \alpha F(y)
\end{equation*}
</div>
<p>or equivalently:</p>
<div class="math">
\begin{equation*}
F(x + \alpha (y-x)) - (1-\alpha)F(x) - \alpha F(y) \geq 0
\end{equation*}
</div>
<p>Since we do not know <span class="math">\(F(x)\)</span> exactly, we of course cannot measure this directly. But given a sample, we can construct the empirical CDF which is nearly as good:</p>
<div class="math">
\begin{equation*}
F_n(x) = \frac{1}{n} \sum_{i=1}^n 1_{x \geq x_i}.
\end{equation*}
</div>
<p>Using the empirical CDF and the definition of concavity suggests a test statistic which we can use:</p>
<div class="math">
\begin{equation*}
q = \min_{0 \leq \alpha \leq 1} \min_{x} \min_{y \geq x} \left[ F_n(x + \alpha (y-x)) - (1-\alpha)F_n(x) - \alpha F_n(y) \right]
\end{equation*}
</div>
<p>Our goal is to show that if this test statistic is sufficiently negative, then a spike must exist.</p>
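<p>This minimization can be sketched directly in code. Below is a minimal brute-force version (the function name is my own, and the coarse quantile grid is an approximation for tractability; the exact statistic minimizes over all <span class="math">\(x \leq z \leq y\)</span>):</p>

```python
import numpy as np

def concavity_stat(sample, grid_size=50):
    """Approximate the statistic q: the most negative violation of
    concavity of the empirical CDF, scanned over a coarse grid of
    sample quantiles and a grid of alpha values."""
    sample = np.sort(np.asarray(sample, dtype=float))
    n = len(sample)
    # Empirical CDF: F_n(t) = (1/n) * #{ x_i <= t }
    ecdf = lambda t: np.searchsorted(sample, t, side="right") / n
    grid = np.quantile(sample, np.linspace(0.0, 1.0, grid_size))
    q = 0.0
    for i in range(grid_size):
        for j in range(i + 1, grid_size):
            x, y = grid[i], grid[j]
            Fx, Fy = ecdf(x), ecdf(y)
            for alpha in np.linspace(0.0, 1.0, 21):
                z = x + alpha * (y - x)
                gap = ecdf(z) - (1 - alpha) * Fx - alpha * Fy
                q = min(q, gap)
    return q
```

<p>On a sample from a monotone decreasing density, <code>q</code> hovers near zero; a spike of weight <span class="math">\(\beta\)</span> drags it down toward <span class="math">\(-\beta\)</span>, matching the analysis in the following sections.</p>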
<p>When <span class="math">\(q\)</span> becomes negative, this shows that <span class="math">\(F_n(x)\)</span> is non-concave. However, the empirical distribution function is by definition non-concave, as can be seen clearly when we zoom in:</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/detecting_funny_histograms/concave_zoomin.png" />
<p>Mathematically we can also see this simply by noting that <span class="math">\(1_{x \geq x_i}\)</span> is not concave. However, this non-concavity has order of magnitude <span class="math">\(O(n^{-1})\)</span>, so to deal with this we can simply demand that <span class="math">\(q &lt; -1/n\)</span>.</p>
<p>There is a larger problem, caused by the potential deviation between the empirical distribution <span class="math">\(F_n(x)\)</span> and the true, continuous and concave cdf <span class="math">\(F(x)\)</span>. This, however, can also be controlled, as we show in the next section.</p>
<div class="section" id="controlling-false-positives">
<h3>Controlling false positives</h3>
<p>To control false positives, there is a useful mathematical tool: the <a class="reference external" href="https://en.wikipedia.org/wiki/Dvoretzky%E2%80%93Kiefer%E2%80%93Wolfowitz_inequality">DKW inequality</a> (abbreviating Dvoretzky–Kiefer–Wolfowitz). This is a quantitative sharpening of the <a class="reference external" href="https://en.wikipedia.org/wiki/Glivenko%E2%80%93Cantelli_theorem">Glivenko-Cantelli Theorem</a>, giving an explicit bound on the uniform deviation between the empirical and true CDFs.</p>
<p>We use it as follows.</p>
<p>Recall that <span class="math">\(q\)</span> is defined as the minimum of <span class="math">\(\left[ F_n(x+\alpha(y-x)) - (1-\alpha)F_n(x) - \alpha F_n(y)\right]\)</span>. Let us now choose <span class="math">\((x,y,\alpha)\)</span> to be the values at which that minimum is achieved. Note that this requires that <span class="math">\(x &lt; y\)</span> are two points in the domain of <span class="math">\(F(x)\)</span> and <span class="math">\(\alpha \in [0,1]\)</span>. Let us also define <span class="math">\(z=x + \alpha(y-x)\)</span> in order to simplify the calculation.</p>
<p>Now lets do some arithmetic, starting from the definition of concavity of the CDF:</p>
<div class="math">
\begin{equation*}
F(z) - (1-\alpha)F(x) - \alpha F(y) =
\end{equation*}
</div>
<div class="math">
\begin{equation*}
= F(z) - (1-\alpha)F(x) - \alpha F(y) - q + q
\end{equation*}
</div>
<div class="math">
\begin{equation*}
= F(z) - (1-\alpha)F(x) -  \alpha F(y) - \left[ F_n(z) - (1-\alpha)F_n(x) - \alpha F_n(y)\right] + q
\end{equation*}
</div>
<p>(This line follows since <span class="math">\(\left[ F_n(x+\alpha(y-x)) - (1-\alpha)F_n(x) - \alpha F_n(y)\right] = q\)</span> due to our choice of <span class="math">\((x,y,\alpha)\)</span> above.)</p>
<div class="math">
\begin{equation*}
= \left(F(z) - F_n(z) \right) + (1-\alpha) \left(F_n(x) - F(x) \right) + \alpha \left( F_n(y)-F(y) \right) + q
\end{equation*}
</div>
<p>The DKW inequality tells us that for any <span class="math">\(\epsilon &gt; 0\)</span>,</p>
<div class="math">
\begin{equation*}
P\left[\sup_x \left| F_n(x) - F(x) \right| &gt; \epsilon \right] \leq 2e^{-2n\epsilon^2}
\end{equation*}
</div>
<p>Substituting this into the above, we can therefore say that with probability at least <span class="math">\(1 - 2e^{-2n\epsilon^2}\)</span>,</p>
<div class="math">
\begin{equation*}
F(z) - (1-\alpha)F(x) - \alpha F(y) \leq q + 2\epsilon
\end{equation*}
</div>
<p>If <span class="math">\(q + 2\epsilon &lt; 0\)</span>, this lets us reject the null hypothesis that <span class="math">\(F(x)\)</span> is concave, or equivalently, that <span class="math">\(f(x)\)</span> is monotonically decreasing. Conversely, given a value of <span class="math">\(q\)</span>, we can invert to gain a p-value. We summarize this as a theorem:</p>
<p><strong>Theorem 1:</strong> Assume the null hypothesis of concavity is true. Let <span class="math">\(q\)</span> be defined as above. Then if <span class="math">\(q &lt; 0\)</span>, we can reject the null hypothesis (that <span class="math">\(f(x)\)</span> is decreasing monotonically) with p-value <span class="math">\(p=e^{-n q^2/2}\)</span>.</p>
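<p>Theorem 1 gives a direct recipe for turning the observed statistic into a p-value. A minimal sketch, using the formula exactly as stated in the theorem (the function name is my own):</p>

```python
import math

def concavity_p_value(q, n):
    """Convert the concavity statistic q into a p-value via Theorem 1:
    p = exp(-n q^2 / 2), valid only when q < 0."""
    if q >= 0:
        return 1.0  # no concavity violation observed: no evidence against the null
    return math.exp(-n * q * q / 2.0)
```

<p>For example, a violation of <span class="math">\(q=-0.05\)</span> at <span class="math">\(n=10000\)</span> is already overwhelming evidence of a spike.</p>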
<p>This convergence is exponential, but at a slow rate. Much like the <a class="reference external" href="https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test">Kolmogorov-Smirnov</a> test, the statistical power is relatively low compared to parametric tests (such as <a class="reference external" href="https://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test">Anderson-Darling</a>) that are not based on the <a class="reference external" href="https://en.wikipedia.org/wiki/Dvoretzky%E2%80%93Kiefer%E2%80%93Wolfowitz_inequality">DKW inequality</a>.</p>
</div>
<div class="section" id="controlling-true-positives">
<h3>Controlling true positives</h3>
<p>Let us now examine the true positive rate and attempt to compute statistical power. As a simple alternative hypothesis, let us take a mixture model with density:</p>
<div class="math">
\begin{equation*}
(1-\beta) f(x) + \beta \delta(x-x_0)
\end{equation*}
</div>
<p>Here <span class="math">\(f(x)\)</span> is monotone decreasing and <span class="math">\(\delta(x-x_0)\)</span> is the point mass at <span class="math">\(x_0\)</span>. Let us attempt to compute</p>
<div class="math">
\begin{equation*}
\min_{0 \leq \alpha \leq 1} \min_{x} \min_{y \geq x} \left[ F(x + \alpha (y-x)) - (1-\alpha)F(x) - \alpha F(y) \right]
\end{equation*}
</div>
<p>Let <span class="math">\(x=x_0-\epsilon\)</span>, <span class="math">\(y=x_0+\epsilon^2\)</span> and <span class="math">\(\alpha=\frac{1-\epsilon}{1+\epsilon}\)</span>. Then:</p>
<div class="math">
\begin{equation*}
x + \alpha(y-x) = (x_0-\epsilon) + \frac{1-\epsilon}{1+\epsilon}\left[x_0+\epsilon^2 - (x_0-\epsilon)\right] = (x_0-\epsilon) + \epsilon(1-\epsilon) = x_0-\epsilon^2
\end{equation*}
</div>
<p>Now substituting this in, we discover:</p>
<div class="math">
\begin{equation*}
F(x + \alpha (y-x)) - (1-\alpha)F(x) - \alpha F(y)
\end{equation*}
</div>
<div class="math">
\begin{equation*}
= F(x_0-\epsilon^2) - \frac{2\epsilon}{1+\epsilon} F(x_0-\epsilon) - \frac{1-\epsilon}{1+\epsilon} F(x_0+\epsilon^2)
\end{equation*}
</div>
<p>Letting <span class="math">\(\bar{F}(x) = \int_0^x f(t) dt\)</span>, we observe that <span class="math">\(F(x) = (1-\beta)\bar{F}(x) + \beta 1_{x \geq x_0}\)</span>. Since <span class="math">\(f(x)\)</span> is absolutely continuous, <span class="math">\(\bar{F}(x)\)</span> is of course a continuous function.</p>
<p>Let us now take the limit as <span class="math">\(\epsilon \rightarrow 0\)</span>:</p>
<div class="math">
\begin{equation*}
\lim_{\epsilon \rightarrow 0} F(x_0-\epsilon^2) - \frac{2\epsilon}{1+\epsilon} F(x_0-\epsilon) - \frac{1-\epsilon}{1+\epsilon} F(x_0+\epsilon^2)
\end{equation*}
</div>
<div class="math">
\begin{equation*}
= (1-\beta)\bar{F}(x_0 - 0) - \frac{2\cdot0}{1+0} (1-\beta)\bar{F}(x_0 - 0) - \frac{1-0}{1+0} \left( (1-\beta) \bar{F}(x_0 + 0) + \beta \right)
\end{equation*}
</div>
<div class="math">
\begin{equation*}
= (1-\beta)\bar{F}(x_0) - 0 - (1-\beta) \bar{F}(x_0) - \beta
\end{equation*}
</div>
<div class="math">
\begin{equation*}
= -\beta
\end{equation*}
</div>
<p>This implies that</p>
<div class="math">
\begin{equation*}
\min_{0 \leq \alpha \leq 1} \min_{x} \min_{y \geq x} \left[ F(x + \alpha (y-x)) - (1-\alpha)F(x) - \alpha F(y) \right] \leq - \beta,
\end{equation*}
</div>
<p>since the minimum is of course no larger than any limiting value.</p>
<p>By the same argument as in the previous section - using the DKW inequality to relate <span class="math">\(F(x)\)</span> to <span class="math">\(F_n(x)\)</span> - we can therefore conclude that:</p>
<div class="math">
\begin{equation*}
q \leq - \beta + 2\epsilon
\end{equation*}
</div>
<p>with probability <span class="math">\(1-e^{-2n\epsilon^2}\)</span>.</p>
</div>
<div class="section" id="distinguishing-the-null-and-alternative-hypothesis">
<h3>Distinguishing the null and alternative hypothesis</h3>
<p>We can combine these results into a hypothesis test which is capable of distinguishing between the null and alternative hypothesis with any desired statistical power.</p>
<p><strong>Theorem 2:</strong> Let <span class="math">\(p\)</span> be a specified p-value threshold and let <span class="math">\(r\)</span> be a desired statistical power. Let us reject the null hypothesis whenever</p>
<div class="math">
\begin{equation*}
q \leq -2 \sqrt{\frac{-\ln(p)}{2n}}.
\end{equation*}
</div>
<p>Suppose now that</p>
<div class="math">
\begin{equation*}
\beta \geq 2 \left(\sqrt{\frac{-\ln(p)}{2n}} + \sqrt{\frac{-\ln(1-r)}{2n}} \right).
\end{equation*}
</div>
<p>Then with probability at least <span class="math">\(r\)</span>, we will reject the null hypothesis.</p>
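<p>Theorem 2 can be turned into a small calculator for the minimum detectable spike size. The numbers in the table in the following section appear to correspond to a 5% p-value threshold and 95% power (an inference on my part; the function name is my own):</p>

```python
import math

def detectable_spike(n, p=0.05, power=0.95):
    """Smallest spike weight beta satisfying the detectability
    condition of Theorem 2 at the given threshold and power."""
    eps_p = math.sqrt(-math.log(p) / (2 * n))          # p-value threshold term
    eps_r = math.sqrt(-math.log(1 - power) / (2 * n))  # statistical power term
    return 2 * (eps_p + eps_r)
```

<p>For instance, <code>detectable_spike(10000)</code> is about <span class="math">\(0.049\)</span>, shrinking as <span class="math">\(O(n^{-1/2})\)</span>.</p>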
<div class="section" id="example-numbers-and-slow-convergence">
<h4>Example numbers and slow convergence</h4>
<p>Due to the slowness of the convergence implied by the DKW inequality, we unfortunately need fairly large <span class="math">\(n\)</span> (or large <span class="math">\(\beta\)</span>) for this test to be useful.</p>
<table border="1" class="docutils">
<colgroup>
<col width="32%" />
<col width="68%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">n</th>
<th class="head"><span class="math">\(\beta\)</span></th>
</tr>
</thead>
<tbody valign="top">
<tr><td>1000</td>
<td>0.155</td>
</tr>
<tr><td>2000</td>
<td>0.109</td>
</tr>
<tr><td>5000</td>
<td>0.0692</td>
</tr>
<tr><td>10000</td>
<td>0.0490</td>
</tr>
<tr><td>25000</td>
<td>0.0310</td>
</tr>
<tr><td>100000</td>
<td>0.0155</td>
</tr>
</tbody>
</table>
<p>Thus, this method is really only suitable for detecting large anomalies, or for use in situations with large sample sizes.</p>
<p>Somewhat importantly, this method is not particularly sensitive to the p-value cutoff. For example, with a 1% cutoff rather than 5%, we can detect spikes of size <span class="math">\(\beta=0.055\)</span> at <span class="math">\(n=10000\)</span>.</p>
<p>This makes the method reasonably suitable for surveillance purposes. By setting the p-value cutoff reasonably low (e.g. 1% or 0.1%), we sacrifice very little measurement power on a per-test basis. This allows us to run many versions of this test in parallel, and then use either the <a class="reference external" href="https://en.wikipedia.org/wiki/%C5%A0id%C3%A1k_correction">Sidak correction</a> to control the family-wise false positive rate or <a class="reference external" href="https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini%E2%80%93Hochberg_procedure">Benjamini-Hochberg</a> to control the false discovery rate.</p>
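<p>As a sketch of that surveillance setup, the Sidak correction itself is a one-liner (the function name is my own):</p>

```python
def sidak_cutoff(family_alpha, m):
    """Per-test p-value cutoff such that m independent tests jointly
    have false positive rate family_alpha (Sidak correction)."""
    return 1.0 - (1.0 - family_alpha) ** (1.0 / m)
```

<p>With 100 histograms monitored in parallel at a family-wise rate of 5%, each individual test runs at a cutoff of roughly 0.05%.</p>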
</div>
</div>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>At the moment this test is not all I was hoping for. It's quite versatile, in the sense of being fully nonparametric and assuming little beyond the underlying distribution being monotone decreasing. But while theoretically the convergence is what one would expect, in practice the constants involved are large. I can only detect spikes in histograms after they've become significantly larger than I'd otherwise like.</p>
<p>However, it's still certainly better than nothing. This method would have worked in several of the practical examples I described at the beginning, and would have flagged issues earlier than I detected them via manual processes. I do believe this method is worth adding to suites of automated anomaly detection. But if anyone can think of ways to improve this method, I'd love to hear about them.</p>
<p>I've searched, but haven't found a lot of papers on this. One of the closest related ones is <a class="reference external" href="https://www.chrisstucchio.com/blog_media/2020/detecting_funny_histograms/euclid.aos.996986504.pdf">Multiscale Testing of Qualitative Hypotheses</a>.</p>
</div>
<script type='text/javascript'>if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
    var align = "center",
        indent = "0em",
        linebreak = "false";

    if (false) {
        align = (screen.width < 768) ? "left" : align;
        indent = (screen.width < 768) ? "0em" : indent;
        linebreak = (screen.width < 768) ? 'true' : linebreak;
    }

    var mathjaxscript = document.createElement('script');
    mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
    mathjaxscript.type = 'text/javascript';
    mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';

    var configscript = document.createElement('script');
    configscript.type = 'text/x-mathjax-config';
    configscript[(window.opera ? "innerHTML" : "text")] =
        "MathJax.Hub.Config({" +
        "    config: ['MMLorHTML.js']," +
        "    TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
        "    jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
        "    extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
        "    displayAlign: '"+ align +"'," +
        "    displayIndent: '"+ indent +"'," +
        "    showMathMenu: true," +
        "    messageStyle: 'normal'," +
        "    tex2jax: { " +
        "        inlineMath: [ ['\\\\(','\\\\)'] ], " +
        "        displayMath: [ ['$$','$$'] ]," +
        "        processEscapes: true," +
        "        preview: 'TeX'," +
        "    }, " +
        "    'HTML-CSS': { " +
        "        availableFonts: ['STIX', 'TeX']," +
        "        preferredFont: 'STIX'," +
        "        styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
        "        linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
        "    }, " +
        "}); " +
        "if ('default' !== 'default') {" +
            "MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
            "MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
        "}";

    (document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
    (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
              ]]>
            </content>
        </entry>
        <entry>
            <title type="html">Isotonic: A Python package for doing fancier versions of isotonic regression</title>
            <author><name>Chris Stucchio</name></author>
            <link href="https://www.chrisstucchio.com/blog/2020/isotonic_python_package.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss"/>
            <updated>2020-03-10T08:30:00+01:00</updated>
            <published>2020-03-10T08:30:00+01:00</published>
            <id>https://www.chrisstucchio.com/blog/2020/isotonic_python_package.html</id>

            <content type="html">
              <![CDATA[
                (If mathematical formulas are not showing up here, <a href="https://www.chrisstucchio.com/blog/2020/isotonic_python_package.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss">please read this on the web</a>.)
                <p>Frequently in data science, we have a relationship between <code>X</code> and <code>y</code> where (probabilistically) <code>y</code> increases as <code>X</code> does. The relationship is often not linear, but rather reflects something more complex. Here's an example of a relationship like this:</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/isotonic_python_package/monotonic_relation.png" />
<p>In this plot of synthetic data we have a non-linear but increasing relationship between <code>X</code> and <code>y</code>. The orange line represents the true mean of this data. Note the large amount of noise present.</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/isotonic_python_package/relationship_zoomin.png" />
<p>There is a classical algorithm for solving this problem nonparametrically, namely <a class="reference external" href="https://en.wikipedia.org/wiki/Isotonic_regression">Isotonic regression</a>. This simple algorithm is also implemented in <a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.isotonic.IsotonicRegression.html">sklearn.isotonic</a>. The classic algorithm fits a piecewise constant approximation - with nodes at every data point - by minimizing (possibly weighted) <cite>l^2</cite> error.</p>
<p>The standard isotonic package works reasonably well, but there are a number of things I don't like about it. My data is often noisy with fatter-than-normal tails, which means that minimizing <cite>l^2</cite> error overweights outliers. Additionally, at the endpoints, sklearn's isotonic regression tends to be quite noisy.</p>
<p>The curves output by sklearn's isotonic model are piecewise constant with a large number of discontinuities (<cite>O(N)</cite> of them).</p>
<p>The size of the isotonic model can also be very large - <cite>O(N)</cite>, in fact (with <cite>N</cite> the size of the training data). This is because in principle, the classical version of isotonic regression allows every single value of <code>x</code> to be a node.</p>
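<p>For concreteness, the classical <cite>l^2</cite> algorithm - pool adjacent violators - is short enough to sketch from scratch. This is a minimal illustration of the technique, not the sklearn implementation:</p>

```python
import numpy as np

def pava(y, w=None):
    """Pool Adjacent Violators: classical l2 isotonic regression.
    Returns the non-decreasing fit minimizing weighted squared error,
    assuming y is already ordered by its x values."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    levels, weights, counts = [], [], []
    for yi, wi in zip(y, w):
        levels.append(yi); weights.append(wi); counts.append(1)
        # Merge blocks backwards while monotonicity is violated;
        # a merged block takes the weighted mean of its members.
        while len(levels) > 1 and levels[-2] > levels[-1]:
            wtot = weights[-2] + weights[-1]
            merged = (weights[-2] * levels[-2] + weights[-1] * levels[-1]) / wtot
            levels[-2:] = [merged]
            weights[-2:] = [wtot]
            counts[-2:] = [counts[-2] + counts[-1]]
    return np.repeat(levels, counts)
```

<p>Every merged block becomes one constant piece of the fitted curve, which is why the node count - and hence the model size - can grow linearly with the data.</p>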
<p>The <cite>isotonic</cite> package I've written provides some modest improvements on this. It uses piecewise linear curves with a bounded (controllable) number of nodes - in this example, 30:</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/isotonic_python_package/package_outputs.png" />
<p>It also allows for non-<code>l^2</code> penalties in order to handle noise better.</p>
<div class="section" id="isotonic-regression-for-binary-data">
<h2>Isotonic regression for binary data</h2>
<p>Another issue facing the standard isotonic regression model is binary data - where <code>y in {0,1}</code>. Using RMS error on binary data sometimes works (when there's lots of data and its mean is far from <code>0</code> and <code>1</code>), but it's far from optimal.</p>
<p>For this reason I wrote a class <code>isotonic.BinomialIsotonicRegression</code> which handles isotonic regression for the case of a binomial loss function.</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2020/isotonic_python_package/binomial_isotonic.png" />
<p>As is apparent from the figure, this generates more plausible results for binary isotonic regression (in a case with relatively few samples) than the standard sklearn package. The result is most pronounced at the endpoints where data is scarcest.</p>
</div>
<div class="section" id="code-is-available">
<h2>Code is available</h2>
<p>You can find the <a class="reference external" href="https://github.com/stucchio/isotonic">code on my github</a>. It's pretty alpha at this time, so don't expect it to be perfect. Nevertheless, I'm currently using it in production code, in particular a trading strategy where the noise sensitivity of <code>sklearn.isotonic.IsotonicRegression</code> was causing me problems. So while I don't guarantee it as being fit for any particular purpose, I'm gambling <cite>O($25,000)</cite> on it every week or two.</p>
</div>
<div class="section" id="appendix-mathematical-details">
<h2>Appendix: Mathematical Details</h2>
<p>This appendix explains the mathematical details of the methods, as well as technical details of the parameterization. It is mainly intended to be used as a reference when understanding the code.</p>
<p>The package uses maximum likelihood for curve estimation, and uses the <a class="reference external" href="https://en.wikipedia.org/wiki/Conjugate_gradient_method">Conjugate Gradient</a> method (as implemented in <code>scipy.optimize.minimize</code>) to actually compute this maximum.</p>
<div class="section" id="parameterizing-the-isotonic-curves">
<h3>Parameterizing the isotonic curves</h3>
<p>The first part of this is parameterizing the curves. The curves are parameterized by a set of <span class="math">\(\vec{x}_i, i=0 \ldots N-1\)</span> and a corresponding set of <span class="math">\(\vec{y}_i\)</span>, with <span class="math">\(\vec{y}_i \leq \vec{y}_{i+1}\)</span> for all <span class="math">\(i\)</span>. (I'm using zero-indexing to match the code.)</p>
<p>Since conjugate gradient doesn't deal with constraints, we must come up with a parameterization <span class="math">\(\alpha: \mathbb{R}^M \rightarrow \mathbb{R}^N\)</span> where the domain is unconstrained and the range satisfies the monotonicity constraint.</p>
<p>There are two cases to consider.</p>
<div class="section" id="real-valued-curves">
<h4>Real valued curves</h4>
<p>For real-valued isotonic regression, there are no constraints on <span class="math">\(\vec{y}_i\)</span> beyond the monotonicity constraint. Thus, we can use the parameterization:</p>
<div class="math">
\begin{equation*}
\vec{y}_i = \vec{\alpha}_0 + \sum_{j=1}^i e^{\alpha_j}
\end{equation*}
</div>
<p>Since <span class="math">\(\vec{y}_{i+1} - \vec{y}_{i} = e^{\alpha_{i+1}} &gt; 0\)</span>, this trivially satisfies the monotonicity constraint.</p>
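<p>As a sketch (the function name is my own), this parameterization is a one-liner with cumulative sums:</p>

```python
import numpy as np

def real_nodes(alpha):
    """Map unconstrained parameters to monotone node values:
    y_0 = alpha_0, y_i = alpha_0 + sum_{j=1}^{i} exp(alpha_j)."""
    a = np.asarray(alpha, dtype=float)
    return a[0] + np.concatenate([[0.0], np.cumsum(np.exp(a[1:]))])
```

<p>For example, <code>real_nodes([2.0, 0.0, 0.0])</code> yields the strictly increasing nodes <code>[2.0, 3.0, 4.0]</code>.</p>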
<p>In this case, the Jacobian can be computed to be:</p>
<div class="math">
\begin{equation*}
\frac{\partial y_i}{\partial \alpha_0} = 1
\end{equation*}
</div>
<div class="math">
\begin{equation*}
\frac{\partial y_i}{\partial \alpha_j} = 1(j \leq i) e^{\alpha_j}, j \geq 1
\end{equation*}
</div>
<p>Here the function <span class="math">\(1(x)\)</span> is equal to <span class="math">\(1\)</span> if its argument is true and <span class="math">\(0\)</span> otherwise.</p>
<p>This parameterization is implemented <a class="reference external" href="https://github.com/stucchio/isotonic/blob/master/isotonic/_base.py#L158">here</a>.</p>
</div>
<div class="section" id="probabilistic-curves">
<h4>Probabilistic curves</h4>
<p>In the case of binomial isotonic regression, we have the additional constraint that <span class="math">\(0 &lt; \vec{y}_{0}\)</span> and <span class="math">\(\vec{y}_{N-1} &lt; 1\)</span> (since the curve represents a probability). We can parameterize this via:</p>
<div class="math">
\begin{equation*}
\vec{y}_i = \frac{ \sum_{j=0}^i e^{\vec{\alpha}_{j}} }{ \sum_{j=0}^{N} e^{\vec{\alpha}_{j} } }
\end{equation*}
</div>
<p>It is trivially easy to verify that this satisfies both the monotonicity constraint as well as the constraint that <span class="math">\(0 &lt; \vec{y}_i &lt; 1\)</span>. Note that in this case, there are <span class="math">\(N+1\)</span> parameters for an <span class="math">\(N\)</span> -dimensional vector <span class="math">\(\vec{y}\)</span>.</p>
<p>The Jacobian can be calculated to be:</p>
<div class="math">
\begin{equation*}
\frac{\partial y_i}{\partial \alpha_j} = \frac{e^{\alpha_j} \left(1(j \leq i) \sum_{k=0}^{N} e^{\alpha_k} - \sum_{k=0}^i e^{\alpha_k} \right) }{ \left(\sum_{k=0}^N e^{\alpha_k} \right)^2 }
\end{equation*}
</div>
<p>This parameterization is <a class="reference external" href="https://github.com/stucchio/isotonic/blob/master/isotonic/_base.py#L99">implemented here</a>.</p>
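<p>A minimal sketch of this parameterization (again my illustration, not the library's implementation):</p>

```python
import numpy as np

def prob_curve_values(alpha):
    """y_i = (sum_{j<=i} exp(alpha_j)) / (sum_{j=0..N} exp(alpha_j)), i = 0..N-1.
    N+1 unconstrained parameters produce N values in (0, 1), monotone increasing.
    Subtracting alpha.max() before exponentiating avoids overflow; the shift
    cancels in the ratio."""
    e = np.exp(alpha - alpha.max())
    return np.cumsum(e)[:-1] / e.sum()
```

<p>For example, <code>prob_curve_values(np.zeros(4))</code> yields the evenly spaced values <code>[0.25, 0.5, 0.75]</code>.</p>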
</div>
<div class="section" id="different-parameterizations">
<h4>Different parameterizations</h4>
<p>One parameterization for <span class="math">\(c(z; \vec{x}, \vec{y})\)</span> is piecewise constant, i.e.:</p>
<div class="math">
\begin{equation*}
c(z; \vec{x}, \vec{y}) = \vec{y}_j
\end{equation*}
</div>
<div class="math">
\begin{equation*}
j(z) =  \max \left\{ i | \vec{x}_i \leq z \right\}
\end{equation*}
</div>
<p>In this case, simple calculus shows that</p>
<div class="math">
\begin{equation*}
\frac{\partial}{ \partial y_k } c( z ; \vec{x}, \vec{y}) = \delta_{k,j(z)}
\end{equation*}
</div>
<p>with <span class="math">\(j\)</span> as above.</p>
<p>This is implemented as the <a class="reference external" href="https://github.com/stucchio/isotonic/blob/master/isotonic/curves.py#L41">PiecewiseConstantIsotonicCurve</a> in the library.</p>
<p>Another parameterization is piecewise linear:</p>
<div class="math">
\begin{equation*}
c(z; \vec{x}, \vec{y}) = (1-\beta) \vec{y}_{j(z)} + \beta \vec{y}_{j(z)+1}
\end{equation*}
</div>
<div class="math">
\begin{equation*}
\beta = \frac{z - \vec{x}_{j(z)}}{\vec{x}_{j(z)+1} - \vec{x}_{j(z)}}
\end{equation*}
</div>
<p>This has derivative:</p>
<div class="math">
\begin{equation*}
\frac{\partial}{ \partial y_k } c( z ; \vec{x}, \vec{y}) = \beta \delta_{k,j+1} + (1-\beta)\delta_{k,j}
\end{equation*}
</div>
<p>This is implemented as the <a class="reference external" href="https://github.com/stucchio/isotonic/blob/master/isotonic/curves.py#L60">PiecewiseLinearIsotonicCurve</a>.</p>
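<p>Both curve types can be evaluated directly with NumPy; this is a sketch of the evaluation logic, not the library's classes:</p>

```python
import numpy as np

def piecewise_constant(z, x, y):
    """c(z) = y_{j(z)} where j(z) = max { i | x_i <= z }, clamped at the ends."""
    j = np.clip(np.searchsorted(x, z, side='right') - 1, 0, len(x) - 1)
    return y[j]

def piecewise_linear(z, x, y):
    """Linear interpolation between the nodes; np.interp clamps to
    y[0] / y[-1] outside [x_0, x_{N-1}]."""
    return np.interp(z, x, y)
```
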
</div>
</div>
<div class="section" id="objective-functions">
<h3>Objective functions</h3>
<p>Some notation first. Let us consider a data set <span class="math">\(\vec{X}, \vec{Y}\)</span>. We will define a curve <span class="math">\(c(z;\vec{x}, \vec{y})\)</span>, taking values <span class="math">\(\vec{y}_i\)</span> at the points <span class="math">\(\vec{x}_i\)</span>, i.e. <span class="math">\(c(z=\vec{x}_i; \vec{x}, \vec{y}) = \vec{y}_i\)</span> and being parametrically related to <span class="math">\(\vec{x}, \vec{y}\)</span> elsewhere. Current implementations include piecewise linear and piecewise constant.</p>
<p>Supposing now that the nodes <span class="math">\(\vec{x}_i\)</span> are given, it remains to find the values <span class="math">\(\vec{y}\)</span> that minimize a loss function.</p>
<div class="section" id="real-valued-data">
<h4>Real valued data</h4>
<p>In this case, our goal is to minimize the <span class="math">\(l^p\)</span> error:</p>
<div class="math">
\begin{equation*}
\sum_{k} \left| \vec{Y}_k - c(\vec{X}_k ; \vec{x}, \vec{y}) \right|^p
\end{equation*}
</div>
<p>Note that this corresponds to maximum likelihood under the model:</p>
<div class="math">
\begin{equation*}
\vec{Y}_k = c(\vec{X}_k ; \vec{x}, \vec{y}) + \epsilon_k
\end{equation*}
</div>
<p>with <span class="math">\(\epsilon_k\)</span> drawn from the distribution having pdf <span class="math">\(C e^{-|Z|^p} dZ\)</span>.</p>
<p>Computing the gradient w.r.t. <span class="math">\(\vec{y}\)</span> yields:</p>
<div class="math">
\begin{equation*}
\nabla_{\vec{y}} \sum_{k} \left| \vec{Y}_k - c(\vec{X}_k ; \vec{x}, \vec{y}) \right|^p = -\sum_{k} p \left| \vec{Y}_k - c(\vec{X}_k ; \vec{x}, \vec{y}) \right|^{p-1} \textrm{sgn}\left( \vec{Y}_k - c(\vec{X}_k ; \vec{x}, \vec{y}) \right) \nabla_y c(\vec{X}_k ;\vec{x}, \vec{y})
\end{equation*}
</div>
<p>This is implemented in the library as <a class="reference external" href="https://github.com/stucchio/isotonic/blob/master/isotonic/lp_isotonic_regression.py#L11">LpIsotonicRegression</a>.</p>
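<p>To make the chain rule concrete, here is a minimal NumPy sketch of the loss and its gradient for the piecewise linear curve (my illustration, not the code of <code>LpIsotonicRegression</code>; note the sign factor):</p>

```python
import numpy as np

def lp_loss_and_grad(y_nodes, x_nodes, X, Y, p=2):
    """l^p loss of the piecewise linear curve, and its gradient w.r.t. the
    node values y_nodes. Uses the fact that dc(X_k)/dy_j is just the
    interpolation weight of node j at X_k."""
    j = np.clip(np.searchsorted(x_nodes, X, side='right') - 1, 0, len(x_nodes) - 2)
    beta = (X - x_nodes[j]) / (x_nodes[j + 1] - x_nodes[j])
    c = (1 - beta) * y_nodes[j] + beta * y_nodes[j + 1]
    r = Y - c
    loss = np.sum(np.abs(r) ** p)
    # chain rule: d|r|^p / dc = -p |r|^{p-1} sgn(r)
    dldc = -p * np.abs(r) ** (p - 1) * np.sign(r)
    grad = np.zeros_like(y_nodes)
    np.add.at(grad, j, dldc * (1 - beta))      # accumulate over data points
    np.add.at(grad, j + 1, dldc * beta)
    return loss, grad
```
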
</div>
<div class="section" id="binomial-data">
<h4>Binomial data</h4>
<p>Given the data set, we can do maximum likelihood. The likelihood is:</p>
<div class="math">
\begin{equation*}
P(\vec{X}, \vec{Y} | c(z ; \vec{x}, \vec{y}) ) = \left[ \prod_{k|\vec{Y}_k = 1} c(\vec{X}_k ; \vec{x}, \vec{y}) \right] \left[ \prod_{k|\vec{Y}_k = 0} (1 - c(\vec{X}_k ; \vec{x}, \vec{y})) \right]
\end{equation*}
</div>
<p>Taking logs and computing the gradient yields:</p>
<div class="math">
\begin{equation*}
\nabla_y \ln P(\vec{X}, \vec{Y} | c(z ; \vec{x}, \vec{y}) ) = \left[ \sum_{k|\vec{Y}_k = 1} \frac{\nabla_y c(\vec{X}_k ;\vec{x}, \vec{y})}{ c(\vec{X}_k ; \vec{x}, \vec{y}) } - \sum_{k|\vec{Y}_k = 0} \frac{\nabla_y c(\vec{X}_k ;\vec{x}, \vec{y})}{1 - c( \vec{X}_k ; \vec{x}, \vec{y})}  \right]
\end{equation*}
</div>
<p>Combining this with <span class="math">\(\nabla_\alpha \vec{y}\)</span> computed above, we can now compute <span class="math">\(\nabla_\alpha P(\vec{X}, \vec{Y} | c(z ; \vec{x}, \vec{y}) )\)</span>. This is sufficient to run conjugate gradient and other optimization algorithms.</p>
<p>This is implemented in the library as <a class="reference external" href="https://github.com/stucchio/isotonic/blob/master/isotonic/binomial_isotonic_regression.py#L11">BinomialIsotonicRegression</a>.</p>
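<p>Given the curve values at the data points and the Jacobian of the curve with respect to the node values, the log-likelihood gradient above is one line of NumPy (a sketch, not the library's code):</p>

```python
import numpy as np

def binomial_loglik_grad(c_vals, dc_dy, Y):
    """Gradient of the log likelihood w.r.t. the node values.
    c_vals[k] = c(X_k); dc_dy[k, j] = dc(X_k)/dy_j; Y[k] in {0, 1}."""
    w = np.where(Y == 1, 1.0 / c_vals, -1.0 / (1.0 - c_vals))
    return w @ dc_dy
```
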
</div>
</div>
<div class="section" id="putting-it-all-together">
<h3>Putting it all together</h3>
<div class="section" id="choosing-the-nodes">
<h4>Choosing the nodes</h4>
<p>All the pieces are put together in a pretty straightforward way. For an <code>M</code>-point interpolation, the x-node points are chosen as the <code>(2i+1)/(2M)</code>-th quantiles of the data, for <code>i=0..M-1</code>.</p>
<p>We do this for the following reason. Consider standard isotonic regression where every single point is a node. Suppose that the value <span class="math">\(\vec{y}_0\)</span> is an outlier, and is dramatically smaller than would be expected. Then for all <span class="math">\(z &lt; \vec{x}_0\)</span>, the isotonic estimator will be <span class="math">\(\vec{y}_0\)</span>. This is the characteristic of a very unstable estimator, and in my use cases this poses a significant problem.</p>
<p>In contrast, with the <code>M</code> - point interpolation I'm using, the value of the isotonic estimator will be approximately <span class="math">\(\frac{1}{N_q} \sum_{i | \vec{x}_i &lt; q} \vec{y}_{i}\)</span> where <span class="math">\(q\)</span> is the <span class="math">\(1/2M\)</span> -th quantile of the x-values and <span class="math">\(N_q\)</span> is the number of points with <span class="math">\(x_i &lt; q\)</span>. This is a considerably more stable estimator.</p>
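<p>The node choice is a one-liner (a sketch of the rule described above):</p>

```python
import numpy as np

def choose_nodes(x_data, M):
    """Node i sits at the (2i+1)/(2M)-th quantile of the observed x values,
    so each node 'owns' roughly an equal share of the data."""
    qs = (2 * np.arange(M) + 1) / (2 * M)
    return np.quantile(x_data, qs)
```
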
</div>
<div class="section" id="estimating-the-curve">
<h4>Estimating the curve</h4>
<p>Once the nodes are given, estimation of the curve is pretty straightforward. We parameterize the curve as described above and use the conjugate gradient method to minimize the error. This can be generally expected to converge, due to the convexity of the error w.r.t. the curve. I have not encountered any cases where it doesn't.</p>
<p>(In the binomial case, convexity is technically broken due to the normalization.)</p>
<p>That's basically all there is to this.</p>
</div>
</div>
</div>
              ]]>
            </content>
        </entry>
        <entry>
            <title type="html">Cost Matters: Why Lambda School should have a lower success rate than college</title>
            <author><name>Chris Stucchio</name></author>
            <link href="https://www.chrisstucchio.com/blog/2020/why_lambda_school_should_have_a_50_percent_success_rate.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss"/>
            <updated>2020-03-03T08:30:00+01:00Z</updated>
            <published>2020-03-03T08:30:00+01:00Z</published>
            <id>https://www.chrisstucchio.com/blog/2020/why_lambda_school_should_have_a_50_percent_success_rate.html</id>

            <content type="html">
              <![CDATA[
                (If mathematical formulas are not showing up here, <a href="https://www.chrisstucchio.com/blog/2020/why_lambda_school_should_have_a_50_percent_success_rate.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss">please read this on the web</a>.)
                <p>Lambda School has recently come under fire by the mainstream media for having success rates smaller than 100%, as well as for having a founder who is a nerd. The articles imply that Lambda School is somehow ripping off its students - possibly by, um, tricking hedge funds into paying for their education - though no real mechanism for this is provided.</p>
<p>(I'm not actually joking about the nerd part - the <a class="reference external" href="http://archive.is/HU2vj">MSM article</a> that I found literally talks about how he doesn't have enough friends.) But this critique is the height of innumeracy; there is no intrinsic reason Lambda School or any other institution <em>should</em> have high success rates.</p>
<p>In this blog post I'm going to do some really simple probability and expose an important tradeoff that I've observed people ignoring in many different contexts - Lambda School and MOOCs, in business, and elsewhere. This article is more about decision theory than Lambda School - but the recent controversy over Lambda School having a lower than 100% success rate makes a good jumping off point.</p>
<p>Let's do the math.</p>
<div class="section" id="the-basic-math-of-a-single-gamble">
<h2>The basic math of a single gamble</h2>
<p>Suppose we have the opportunity to make a gamble. If we make the gamble, we have a probability <tt class="docutils literal">p</tt> of winning. If we win the gamble we earn a payoff of <tt class="docutils literal">w</tt>, whereas if we lose we pay a cost of <tt class="docutils literal">l</tt>.</p>
<p>For example, the gamble might be &quot;attend Lambda School&quot;, &quot;go to college&quot; or &quot;buy clicks via google adwords which may or may not buy a product from our landing page&quot;. The payoff of a win (<tt class="docutils literal">w</tt>) would be a higher paying job or a sale on a website. The cost of losing (<tt class="docutils literal">l</tt>) would be the time and money spent on tuition or the money paid to Google for the click.</p>
<p>The expected value of this gamble is easy to compute:</p>
<pre class="literal-block">
E[gamble] = p*w - (1-p)*l
</pre>
<p>Assuming no risk aversion - fairly typical when <tt class="docutils literal">w</tt> and <tt class="docutils literal">l</tt> are not that large - we should take this gamble whenever <tt class="docutils literal">E[gamble] &gt; 0</tt>. Doing some simple arithmetic, we discover that this gamble has positive expected value whenever:</p>
<pre class="literal-block">
w/l &gt; (1-p)/p
</pre>
<p>(If this is a repeated game we should use the <a class="reference external" href="https://en.wikipedia.org/wiki/Kelly_criterion">Kelly Criterion</a> or some similar rule. I'm going to focus on the non-repeated case in this post.)</p>
<p>Another useful rearrangement: we break even when <tt class="docutils literal">p = <span class="pre">l/(w+l)</span></tt>.</p>
<p>Sounds simple, right?</p>
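<p>In code, with the convention that <tt class="docutils literal">l</tt> is the (positive) cost of losing:</p>

```python
def expected_value(p, w, l):
    """Win w with probability p; pay cost l with probability 1 - p."""
    return p * w - (1 - p) * l

def break_even_probability(w, l):
    """The success probability at which the gamble is exactly fair."""
    return l / (w + l)
```
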
<p>It's important to note that there are 3 different ways we can alter the parameters to make a gamble worthwhile:</p>
<ul class="simple">
<li>Increase <tt class="docutils literal">w</tt> - make the payoff more attractive.</li>
<li>Increase <tt class="docutils literal">p</tt> - make a winning bet more likely.</li>
<li>Decrease <tt class="docutils literal">l</tt> - make the cost of losing lower.</li>
</ul>
<p>The last case is actually pretty important; consider the limiting case of <tt class="docutils literal">l == 0</tt>. In this case, there is a nonzero probability <tt class="docutils literal">p</tt> of winning something, and even if you don't win you lose nothing. That's a good bet to take regardless of how low <tt class="docutils literal">p</tt> is!</p>
<p>So far, nothing groundbreaking, right?</p>
<div class="section" id="concrete-examples-lambda-school-and-rutgers-university">
<h3>Concrete examples: Lambda School and Rutgers University</h3>
<p>Let's now consider Lambda School and cook up an oversimplified example.</p>
<p>I'll assume the applicant currently earns $30k/year. I'll assume that <em>if</em> the student is successful via Lambda School, their salary will increase to $70k/year. Furthermore, I'll assume the value of their time (before graduation) is $30k/year. I'll also use a 5 year time horizon.</p>
<p>Lambda School is a 9 month program. Valuing time at $30k/year, this means the cost of an unsuccessful attempt at Lambda School is <tt class="docutils literal">l = $22.5k</tt>.</p>
<p>If Lambda School succeeds, the winning payoff is a job earning $70k/year. The student will have to pay Lambda School 17% of this for 2 years (yielding a net pay of $58.1k/year) and after this they earn the full $70k. The excess over their original pay is then $28.1k for 2 years and $40k for 3 years - in total, <tt class="docutils literal">w = $176k</tt>.</p>
<p>On net, <tt class="docutils literal">w/l = 7.8</tt>, implying that even if the probability of success is only 15%, this is a reasonably good deal for the student.</p>
<p>Now let's examine Rutgers. I'm choosing Rutgers because it's my alma mater. Following the numbers <a class="reference external" href="https://twitter.com/stucchio/status/1230510530973978627">from this tweet thread</a>, we learn the cost of Rutgers (a 4 year program) is $22.5k x 4 = $90k in lost income. Additionally tuition is $12k in state (assume a generous $6k after financial aid) adding another $24k for a total of $114k. This is paid whether the gamble pays off or not.</p>
<p>This means that <tt class="docutils literal">w = 5 * $40k - 114k = $86k</tt> and <tt class="docutils literal">l = $114k</tt>. This yields <tt class="docutils literal">w/l=0.75</tt>, implying that even at 50% success probability, going to Rutgers is a bad deal for the student! Rutgers needs approximately a 60% success probability for the student to break even.</p>
<p>(Note: The actual success rate of Rutgers, <a class="reference external" href="https://twitter.com/stucchio/status/1230510532051947520">measured according to the criteria of Lambda School</a>, is about 30%.)</p>
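<p>The two worked examples, recomputed (same salary and tuition assumptions as above; break-even probability is <tt class="docutils literal">l/(w+l)</tt>):</p>

```python
def break_even_probability(w, l):
    return l / (w + l)

# Lambda School: l = 9 months of foregone $30k/year income
lambda_l = 30 * 0.75                                   # $22.5k
# w = 2 years of (net ISA pay - $30k) plus 3 years of the full $40k raise
lambda_w = (70 * (1 - 0.17) - 30) * 2 + (70 - 30) * 3  # ~$176k

# Rutgers: 4 years of foregone income ($22.5k/year per the tweet thread)
# plus $6k/year tuition after aid; paid win or lose
rutgers_l = 22.5 * 4 + 6 * 4                           # $114k
rutgers_w = 5 * 40 - rutgers_l                         # $86k over a 5 year horizon
```
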
<p>You can play with these numbers a lot - take a 10 year time horizon, play with outcomes beyond &quot;fail - 30k, success - 70k&quot;, etc.</p>
<p>But it's really hard to escape from the extreme cost advantage that ISAs provide: Lambda School doesn't cost very much for people who don't succeed. There's a reason that <a class="reference external" href="https://twitter.com/Austen">Austen Allred</a> constantly talks about alignment of incentives!</p>
</div>
</div>
<div class="section" id="the-basic-math-of-a-sorting-rule">
<h2>The basic math of a sorting rule</h2>
<p>Let's suppose now that we have multiple gambles we can choose to enter or not - say <tt class="docutils literal">N</tt> gambles, with <tt class="docutils literal">i = <span class="pre">1..N</span></tt> representing a particular gamble. Moreover, for each gamble, we know the success probability <tt class="docutils literal">p[i]</tt> of that particular gamble.</p>
<p>For example, we might be choosing which students to admit to Lambda School or Rutgers.</p>
<p>As another example, we might be choosing which adwords (and landing pages matched to the adwords) we wish to bid on for CPC.</p>
<p>We are now faced with the task of choosing which gambles we want to take. Armed with the decision rule we came up with above we have a simple way to decide this: if <tt class="docutils literal">w/l &gt; <span class="pre">(1-p[i])/p[i]</span></tt> we will accept gamble <tt class="docutils literal">i</tt>.</p>
<div class="section" id="lowering-the-cost-of-failure-allows-one-to-accept-a-lower-success-rate">
<h3>Lowering the cost of failure allows one to accept a lower success rate</h3>
<p>Let us now consider a set of gambles; let's assume that <tt class="docutils literal">p[i] = i/10</tt> for <tt class="docutils literal"><span class="pre">i=0..10</span></tt>. Suppose for simplicity that <tt class="docutils literal">w=2, l=8</tt>. This implies that the positive sum gambles are the ones with <tt class="docutils literal">p[i] &gt;= 0.8</tt> or <tt class="docutils literal">i &gt;= 8</tt>.</p>
<p>If we choose gambles according to this rule, we will only choose to gamble at <tt class="docutils literal">i=8,9,10</tt>; the overall success probability is <tt class="docutils literal">0.9</tt>. This set of gambles should be thought of as a regular university; the cost of failure is high.</p>
<p>Suppose now we reduced the cost of failure; say instead of <tt class="docutils literal">l=8</tt> we had <tt class="docutils literal">l=2</tt>. In this case, the positive sum gambles are the ones with <tt class="docutils literal">p[i] &gt;= 0.5</tt> or <tt class="docutils literal"><span class="pre">i=5...10</span></tt>. The average success rate over all of these gambles is <tt class="docutils literal">0.75</tt>, which is lower than the <tt class="docutils literal">0.9</tt> we had at a cost of <tt class="docutils literal">l=8</tt>.</p>
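<p>The selection rule and both scenarios fit in a few lines:</p>

```python
from statistics import mean

def accepted(w, l, probs):
    """Accept gamble i iff w/l >= (1-p_i)/p_i, i.e. p_i >= l/(w+l)."""
    cutoff = l / (w + l)
    return [p for p in probs if p >= cutoff]

probs = [i / 10 for i in range(11)]
high_cost = accepted(2, 8, probs)   # cutoff 0.8: only the three best gambles
low_cost = accepted(2, 2, probs)    # cutoff 0.5: six gambles qualify
```

<p>Lowering <code>l</code> from 8 to 2 doubles the number of accepted gambles while dropping the average success rate from 0.9 to 0.75.</p>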
<p>Here's the core economics of this. If you lower the cost of something, it makes sense to lower quality standards of that something.</p>
</div>
<div class="section" id="intuitive-example-wikipedia-lambda-school-and-college">
<h3>Intuitive example: Wikipedia, Lambda School and College</h3>
<p>On the topic of educational institutions, consider my favorites: Wikipedia and Duckduckgo. These institutions have very high failure rates for me; quite often, I search for something and do not find what I'm looking for. Obviously, Wikipedia is not in any sense bad for consumers simply because I search for lots of things I don't find.</p>
<p>And the reason for this is obvious; Wikipedia is free. If it's useful 20% of the time and takes me 5 seconds to search for something, that's an average of 25 seconds of search per useful result. That's a win.</p>
<p>In contrast, a college with a 20-30% success rate (and <a class="reference external" href="https://moneyinc.com/worst-colleges-in-america/">they do exist</a>) is taking tens of thousands of dollars from students and providing very little value in return. This is much worse even though the success rate is the same.</p>
<p>Lambda school operates somewhere in between the extremes of Wikipedia and College. It's a lot cheaper than college - particularly for the people who don't succeed. This means that the socially optimal result (from the perspective of a student) is that Lambda School should admit more students until its success rate drops significantly below college.</p>
<p>I understand the sentiment that high failure rates are a problem, but hopefully the example of Wikipedia provides intuition about why a low cost method of education can be valuable even if its success rate is low.</p>
</div>
</div>
<div class="section" id="you-can-never-raise-your-conversion-rate">
<h2>You can never raise your conversion rate</h2>
<p>Here's an example that someone at a major travel website told me about. The customer acquisition funnel consists of paying Google for adwords on a cost-per-click basis. Once people reach the site, a certain fraction of them make a purchase.</p>
<p>Suppose we have several different ad channels, i.e. keywords that can be bid on. When a customer converts the profit is $50.</p>
<p>The channels available are:</p>
<ul class="simple">
<li>A, with a CPC of $2 and a conversion rate of 10%. The cost per conversion here is $2/0.1 = $20, meaning the profit is $30 for people coming from this channel.</li>
<li>B, with a CPC of $1 and a conversion rate of 3%. The cost per conversion here is about $33, meaning the profit is about $17 for people coming from this channel.</li>
<li>C, with a CPC of $1 and a conversion rate of 1.5%. The cost per conversion here is about $67, making this channel money losing.</li>
</ul>
<p>Assuming these channels provide equal traffic, bids will be placed on channels A and B only. C is unprofitable so no bids are placed.</p>
<p>Now suppose the price of adwords falls by 30%; channel A now costs $1.4/click while B and C cost $0.7/click. At this point the cost per conversion of channel C has dropped to <tt class="docutils literal">$0.7/0.015 = $46.67 &lt; $50</tt>. At this time channel C gets switched on.</p>
<p>However, the net result of this is that the conversion rate has dropped from <tt class="docutils literal">(10% + <span class="pre">3%)/2=6.5%</span></tt> to <tt class="docutils literal"><span class="pre">(10%+3%+1.5%)/3</span> = 4.8%</tt>. Profit margins have dropped too. The net result here is that traffic has increased by 50%, but the new 50% are low margin, barely profitable users with low conversion rates. These new users drag the average down even as they add profits to the bottom line.</p>
<p>On the flip side of this phenomenon, it's very difficult to <em>raise</em> conversion rates. Imagine a sitewide improvement that results in the conversion rate on every channel increasing by 40% (with prices back at their original levels). This raises the conversion rate of channel C to 2.1%, lowering the cost per conversion to $47.62 and making it profitable.</p>
<p>The average conversion rate goes to <tt class="docutils literal"><span class="pre">(14%+4.2%+2.1%)/3</span> = 6.76%</tt>, a mere 4% improvement over the old conversion rate of <tt class="docutils literal"><span class="pre">(10%+3%)/2=6.5%</span></tt>. If Channel C were twice the size of channel A, then the conversion rate would actually go down to <tt class="docutils literal"><span class="pre">(14%+4.2%+2x2.1%)/4</span> = 5.6%</tt>!</p>
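<p>The channel economics above can be sketched directly (my toy model of the example, with equal traffic per channel):</p>

```python
channels = {"A": (2.0, 0.10), "B": (1.0, 0.03), "C": (1.0, 0.015)}
SALE_PROFIT = 50.0

def active(channels, price_mult=1.0, conv_mult=1.0):
    """Bid on a channel only while its cost per conversion is below the
    profit per sale."""
    return {name for name, (cpc, conv) in channels.items()
            if cpc * price_mult / (conv * conv_mult) < SALE_PROFIT}

def blended_conversion(channels, names, conv_mult=1.0):
    """Average conversion rate across the active channels (equal traffic)."""
    return sum(channels[n][1] * conv_mult for n in names) / len(names)
```

<p>Both a 30% price drop and a 40% sitewide conversion lift switch channel C on, and in both cases the newly active low-margin channel drags the blended conversion rate down.</p>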
<p>On the other hand it's important to look at aggregate numbers. In these scenarios, improving conversion rates has increased sales by 50-100%. The additional sales have significantly lower margin, but it's still an increase in profits.</p>
</div>
<div class="section" id="why-indian-software-developers-are-lower-quality-than-western-ones">
<h2>Why Indian software developers are lower quality than Western ones</h2>
<p>I've made this argument <a class="reference external" href="https://www.chrisstucchio.com/blog/2017/cobbs_douglas.html">elsewhere in more detail</a>, but I'll give a short summary here.</p>
<p>The cost of an American developer is approximately 3-10x the cost of a similar Indian developer. In my experience (I lived in India for many years), there are plenty of Indian developers and data scientists just as good as any in the US. (If you are seeking employees, I know of a good company that everyone is leaving from. I can connect you to people.)</p>
<p>Yet the reputation of Indian developers among Americans is quite low. People commonly talk about outsourced projects run by teams of idiots, and have generally strong criticisms of Indian developers in general.</p>
<p>I never had this experience, but I also subject anyone I hire in India to the same level of hiring rigor that I subject Americans to.</p>
<p>We can reconcile these two views simply by observing that as you lower the cost of a failure (e.g. making a bad hire), it is economically optimal to allow a higher rate of failure. This means that Indian companies competing with American companies should have a lower hiring bar, by the same logic as above. This will drag the <em>average</em> quality of Indian developers down; similarly low quality American developers would simply not get hired as developers.</p>
<p>Thus, the Indian software engineering market has lower quality simply because low costs have allowed a larger pool of people to enter.</p>
<p>(Note: I'm singling out India only because I lived and worked there for many years.)</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>The interplay between individual success probabilities, cost of failure, and aggregate success probabilities is complex. It is simply innumerate to say, as <a class="reference external" href="https://twitter.com/Austen/status/1231972469495324672">Austen Allred did</a>, that <em>&quot;We will never be ok with 50% [success rate].&quot;</em> As the cost of failure drops, as it does with Lambda School, the socially optimal success rate also drops.</p>
<p>(To be clear, I believe <a class="reference external" href="https://twitter.com/Austen">&#64;austen</a> probably knows this and is merely attempting to placate innumerate journalists.)</p>
<p>Simple economics tells us that the lower the cost of failure, the more failed attempts should be made. When the cost of failure goes down, any critique of reduced success rates is hopelessly innumerate.</p>
</div>

              ]]>
            </content>
        </entry>
        <entry>
            <title type="html">Notes on setting up a Data Science app on Azure</title>
            <author><name>Chris Stucchio</name></author>
            <link href="https://www.chrisstucchio.com/blog/2020/data_science_app_on_azure.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss"/>
            <updated>2020-01-09T08:30:00+01:00Z</updated>
            <published>2020-01-09T08:30:00+01:00Z</published>
            <id>https://www.chrisstucchio.com/blog/2020/data_science_app_on_azure.html</id>

            <content type="html">
              <![CDATA[
                (If mathematical formulas are not showing up here, <a href="https://www.chrisstucchio.com/blog/2020/data_science_app_on_azure.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss">please read this on the web</a>.)
                <p>I have recently been working on setting up a trading strategy and running it in the cloud. Although I haven't used Azure before, I wanted to try it out - some of the data science features that Microsoft advertises look pretty nice. This post is not of general interest, and most readers should stop reading now. This is merely my working notes - placed here to help people who are googling for it - and only useful to you if you want to run an app on <a class="reference external" href="https://azure.microsoft.com/en-us/services/functions/">Azure Functions</a> and don't yet know how.</p>
<div class="section" id="structure-of-the-strategy">
<h2>Structure of the strategy</h2>
<p>The trading strategy I'm using is pretty straightforward. To begin with, I periodically query the REST API of a certain marketplace. The API returns a list of trading opportunities - securities <span class="math">\(S_i, i=1 \ldots N\)</span> and their prices <span class="math">\(p_i\)</span>, along with characteristics <span class="math">\(\vec{x}^i\)</span> of the security.</p>
<p>As a concrete example of systems like this, think of trading cryptocurrencies on an exchange (e.g. <a class="reference external" href="https://developers.coinbase.com/docs/exchange/">Coinbase Exchange API</a>).</p>
<p>The price in question is an implicit market-based assessment of risk - i.e. there is a function <span class="math">\(r(p_i)\)</span> which implicitly assigns each price to a corresponding risk level. Higher prices imply more risk.</p>
<p>My mental model for the market is as follows. The market is a machine learning model <span class="math">\(L(\vec{x}_i)\)</span> which predicts risk, and then chooses a corresponding price for that risk. I do not know this model, however a very simple application of isotonic regression has enabled me to determine that the market prices are highly predictive (<a class="reference external" href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC_AUC</a> is in the 70-80% ballpark).</p>
<p>I have additional data <span class="math">\(\vec{y}^i\)</span> that I do not believe the market incorporates. So the way I'm attacking the problem is the following:</p>
<ol class="arabic simple">
<li>Query the REST API to get the standard market data, <span class="math">\((p_i, \vec{x}^i)\)</span>.</li>
<li>Compute an inner join with my own data set to enrich the market data, <span class="math">\((p_i, \vec{x}^i) \mapsto (p_i, \vec{x}^i, \vec{y}^i)\)</span>.</li>
<li>Run a machine learning model on the enriched data set and generate improved risk scores <span class="math">\(q(p_i, \vec{x}^i, \vec{y}^i)\)</span>. In backtesting, these improved risk scores are more accurate than the risk scores <span class="math">\(r(p_i)\)</span> generated from prices alone.</li>
<li>Purchase securities which have <span class="math">\(q(p_i, \vec{x}^i, \vec{y}^i) - r(p_i) &lt; -T\)</span> for some threshold <span class="math">\(T\)</span>. In simple terms, I'm purchasing securities when my assessment of their risk is significantly lower than the market's assessment of their risk.</li>
</ol>
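<p>Steps 2-4 above can be sketched in a few lines of Python. Everything here - the field names, the stand-in risk models, the function name - is a hypothetical stand-in, since the real model and data are not shown:</p>

```python
def pick_purchases(market_rows, my_data, r, q, T):
    """market_rows: dicts with 'id', 'p' (price) and 'x' (characteristics),
    as returned by the REST API.  my_data: id -> extra features y.
    r, q: the market's and my own risk models.  Buys when my risk
    estimate undercuts the market's by more than the threshold T."""
    enriched = [(row, my_data[row["id"]])      # step 2: inner join - keep
                for row in market_rows         # only ids present in both
                if row["id"] in my_data]
    return [row["id"]                          # steps 3-4: score and apply
            for row, y in enriched             # the purchase rule
            if q(row["p"], row["x"], y) - r(row["p"]) < -T]
```

<p>For instance, with a market model <code>r = lambda p: p</code> and an enriched model <code>q = lambda p, x, y: p - y</code>, only securities whose extra features lower the risk estimate by more than <code>T</code> get purchased.</p>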
<p>The ML model used to predict prices is a relatively standard <a class="reference external" href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">sklearn.pipeline.Pipeline</a> - it's trained offline, saved as a pickle, and loaded periodically to execute step (4) above.</p>
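<p>The train-offline / pickle / load-in-production cycle looks roughly like this; a plain object stands in for the real sklearn Pipeline so the sketch stays self-contained:</p>

```python
import pickle

class StandInModel:
    """Stand-in for the trained sklearn Pipeline (not the real model)."""
    def __init__(self, threshold):
        self.threshold = threshold
    def predict(self, price):
        return price > self.threshold

# Offline, on the training machine: fit (here, just construct) and serialize.
blob = pickle.dumps(StandInModel(threshold=0.5))

# Later, inside the Azure function: deserialize and score incoming prices.
model = pickle.loads(blob)
print([model.predict(p) for p in (0.2, 0.9)])  # [False, True]
```

<p>In the real setup the blob round-trips through Azure Blob storage rather than staying in memory; the unpickling side has a subtlety covered below under "Handling pickled models".</p>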
<p>Azure Functions seemed like a simple and low cost way to run this strategy - no server deployment to worry about.</p>
</div>
<div class="section" id="setting-up-azure-functions">
<h2>Setting up Azure Functions</h2>
<p>The general structure of my application is as follows:</p>
<pre class="literal-block">
requirements.txt
trading_library/data/... # Data access methods
trading_library/models/... # Code to actually train/run the ML models
trading_library/config/__init__.py
trading_library/config/_fileconfig.py  # Config when running things locally. This has secrets!
trading_library/jobs/...  # Functions that do things

az_func/host.json
az_func/local.settings.json
az_func/get_data/function.json  # Each subfolder corresponds to a single Azure function.
az_func/get_data/__init__.py
az_func/...other_job.../{function.json, __init__.py}

deploy_az_functions.ps1
</pre>
<p>The first directory, <code>trading_library</code>, is just ordinary Python code. It's a library with assorted functions that are helpful for running my trading strategy. Some are intended to run in production; others I use locally when doing analysis and development. This directory is at the top level because I frequently run a Jupyter notebook here for development.</p>
<p>The directory <code>az_func</code> corresponds to the Azure Functions application. It was created as follows:</p>
<pre class="literal-block">
PS ..&gt; func init az_func --python
</pre>
<p>The notation <code>PS ..&gt;</code> means that this line represents a PowerShell command. Tangentially, as a long-time Linux command line user, I must say that PowerShell is the most exciting innovation in command line UI that I've seen since zsh. If you haven't tried it yet, go check it out.</p>
<p>This is the command to initialize an Azure Functions project as described in <a class="reference external" href="https://docs.microsoft.com/en-us/azure/azure-functions/functions-create-first-function-python">Microsoft's docs</a>. Each subfolder in <code>az_func</code> corresponds to a single function - think of each of these as a cron job. (In other setups, they can correspond to an HTTP endpoint or something similar.)</p>
<p>The contents are quite simple. Here is <code>function.json</code>:</p>
<pre class="literal-block">
{
  &quot;scriptFile&quot;: &quot;__init__.py&quot;,
  &quot;bindings&quot;: [
      {
          &quot;name&quot;: &quot;my_function_argument&quot;,
          &quot;type&quot;: &quot;timerTrigger&quot;,
          &quot;direction&quot;: &quot;in&quot;,
          &quot;schedule&quot;: &quot;3 25 15 * * *&quot;,
          &quot;runOnStartup&quot;: false
      }
  ]
}
</pre>
<p>The <code>schedule</code> parameter is in the standard six-field CRON format - the example above runs on the 3rd second of the 25th minute of the 15th hour of every day.</p>
<p>The python code in <code>__init__.py</code> is also quite simple:</p>
<pre class="literal-block">
import logging
import azure.functions as azfunc
from __app__.trading_library.jobs import get_the_data

def main(my_function_argument):
    logging.info('Running get_data at time %s.', my_function_argument)
    get_the_data()  # Real work happens here
    logging.info(&quot;Finished get_data.&quot;)
</pre>
<p>Note how the parameter <code>&quot;name&quot;: &quot;my_function_argument&quot;</code> in <code>function.json</code> corresponds to the argument <code>main(my_function_argument)</code> in the python code itself. The function won't work if you don't get this right.</p>
<p>As far as local code structure goes, that's basically everything you need to create an Azure function.</p>
<div class="section" id="deploying-it">
<h3>Deploying it</h3>
<p>To deploy, one must first create the Azure Functions app on Azure. Microsoft's <a class="reference external" href="https://docs.microsoft.com/en-us/azure/azure-functions/functions-create-first-function-python">instructions</a> are quite good so I will not repeat them here.</p>
<p>Here, I'll focus on the practicalities of this - specifically, the contents of my <code>deploy_az_functions.ps1</code> PowerShell script. To start with, Azure Functions requires us to make the <code>trading_library</code> path accessible inside the <code>az_func</code> folder. The script first removes any stale copy left over from a previous deploy:</p>
<pre class="literal-block">
$shared_code_path = &quot;az_func\trading_library&quot;

if (Test-Path $shared_code_path){
    Remove-Item -Recurse -Force $shared_code_path
}
</pre>
<p>It is also useful to save space by not uploading assorted crud files:</p>
<pre class="literal-block">
# Cleanup crud
Get-ChildItem .\trading_library\ -recurse -include __pycache__ | Remove-Item -recurse
Get-ChildItem .\trading_library\ -recurse -include *.pyc | Remove-Item
Get-ChildItem .\trading_library\ -recurse -include *.py~ | Remove-Item
</pre>
<p>Next, we copy the library folder into <code>az_func</code>:</p>
<pre class="literal-block">
# Copy shared code
Copy-Item -Recurse .\trading_library\ $shared_code_path\
</pre>
<p>Then we remove our secret-filled, local-use-only config file:</p>
<pre class="literal-block">
Remove-Item .\az_func\trading_library\config\_fileconfig.py
</pre>
<p>Finally, we deploy the Azure Functions app:</p>
<pre class="literal-block">
# Publish the function
cd az_func
func azure functionapp publish trading_library
cd ..

if (Test-Path $shared_code_path){
    Remove-Item -Recurse -Force $shared_code_path
}
</pre>
<p>The job will now be running daily on a timer.</p>
</div>
<div class="section" id="handling-secrets">
<h3>Handling Secrets</h3>
<p>The more difficult piece for me was the handling of secrets. Azure has a service called <a class="reference external" href="https://azure.microsoft.com/en-us/services/key-vault/">Key Vault</a> which provides encrypted storage of application secrets. Key Vault has two different modes of operation:</p>
<ol class="arabic simple">
<li>Exposing values from keyvault to the application as an environment variable. This is very easy to do, but it requires explicitly enumerating every secret needed.</li>
<li>Programmatically accessing keyvault with the python library. This requires python having access to keyvault credentials, which must somehow be safely transmitted to python itself.</li>
</ol>
<p>I chose a hybrid approach - I store the keyvault credentials <em>in keyvault itself</em> and expose them via the method in (1). Then in Python I use them to access the other secrets programmatically.</p>
<div class="section" id="setting-up-keyvault-and-a-client">
<h4>Setting up keyvault and a client</h4>
<p>Create a keyvault. This step is done only once, and should not be part of the powershell script:</p>
<pre class="literal-block">
PS..&gt; az keyvault create -n tradinglibrarykeyvault -g $resourceGroup
</pre>
<p>Next I created a <em>service principal</em>:</p>
<pre class="literal-block">
$service_principal = az ad sp create-for-rbac -n &quot;http://mySP&quot; --sdk-auth | ConvertFrom-Json
</pre>
<p>The <code>$service_principal</code> variable will have the fields <code>clientId</code> and <code>clientSecret</code> - we must put these into keyvault:</p>
<pre class="literal-block">
PS..&gt; az keyvault secret set -n &quot;keyvault-client-id&quot; --vault-name $keyvaultname --value $($service_principal.clientId)
PS..&gt; az keyvault secret set -n &quot;keyvault-client-secret&quot; --vault-name $keyvaultname --value $($service_principal.clientSecret)
</pre>
<p>This is all done <em>once</em>, at the command line.</p>
</div>
<div class="section" id="giving-the-application-access-to-keyvault-credentials">
<h4>Giving the application access to keyvault credentials</h4>
<p>We first create a managed identity. This operation is idempotent and goes in the same PowerShell script I use to deploy:</p>
<pre class="literal-block">
# Grant keyvault access
$kv_identity = az functionapp identity assign -n trading_library -g trading_library_resource_group | ConvertFrom-Json # First create identity
</pre>
<p>After the identity has been created, repeated calls to create it will simply return the existing one. We must also put these variables into keyvault:</p>
<pre class="literal-block">
az keyvault secret set -n &quot;keyvault-tenant-id&quot; --vault-name $keyvaultname --value $kv_identity.tenantId
az keyvault secret set -n &quot;keyvault-name&quot; --vault-name $keyvaultname --value $keyvaultname
</pre>
<p>Next we must grant that identity permission to access keyvault:</p>
<pre class="literal-block">
az keyvault set-policy -n $keyvaultname -g $resourceGroup --object-id &quot;$($kv_identity.principalId)&quot; --secret-permissions get  # Assign the policy
</pre>
<p>Finally, I put the keyvault access parameters into the keyvault itself:</p>
<pre class="literal-block">
foreach ($keyname in &quot;keyvault-tenant-id&quot;,&quot;keyvault-name&quot;,&quot;keyvault-client-id&quot;,&quot;keyvault-client-secret&quot;) {  # Then provide the keyvault variables.
    $secretId = az keyvault secret show -n $keyname --vault-name $keyvaultname --query &quot;id&quot;
    $keyname_var = $keyname.replace('-', '_')
    az functionapp config appsettings set -n trading_library -g trading_library_resource_group --settings &quot;$keyname_var=&#64;Microsoft.KeyVault(SecretUri=$secretId^^)&quot;
}
</pre>
<p>The last line <code>az functionapp config appsettings...</code> is what provides access to the keyvault variables as environment variables.</p>
<p>Inside python, you can now access the <code>clientId</code> via <code>os.environ['keyvault_client_id']</code>.</p>
</div>
<div class="section" id="programmatic-access">
<h4>Programmatic access</h4>
<p>I manage my configuration via the <a class="reference external" href="https://github.com/tr11/python-configuration">tr11/python-configuration</a> library. When doing development on my local machine, my configuration is stored in a simple python dictionary.</p>
<p>The <code>configuration</code> library did not natively have an Azure Key Vault backend. For this reason, I wrote <a class="reference external" href="https://gist.github.com/stucchio/8a0c6c57cea7452eed8e7001877ae2fd">my own backend</a> for it. <strong>Update: The author of python-configuration was kind enough to merge my backend into master. Use that one.</strong></p>
<p>All my code accesses configuration as follows:</p>
<pre class="literal-block">
from trading_library.config import cfg

do_a_thing(cfg['trading_api_key'])
</pre>
<p>The file <code>trading_library/config/__init__.py</code> looks like the following:</p>
<pre class="literal-block">
import config
import os

__all__=['cfg']

try:
    from ._fileconfig import _fileconfig
    filecfg = config.config_from_dict(_fileconfig)
except ImportError:
    filecfg = config.config_from_dict({})

if 'keyvault_name' in os.environ:  # On azure, we will have access to keyvault.
    from config.contrib.azure import AzureKeyVaultConfiguration

    azure_cfg = AzureKeyVaultConfiguration(
        az_client_id=os.environ['keyvault_client_id'],
        az_client_secret=os.environ['keyvault_client_secret'],
        az_tenant_id=os.environ['keyvault_tenant_id'],
        az_vault_name=os.environ['keyvault_name']
        )
    cfg = config.ConfigurationSet(filecfg, azure_cfg)  # Lookups check filecfg first, then fall back to Key Vault
else:
    cfg = filecfg
</pre>
<p>Thus, in local development, secrets are read from the <code>_fileconfig</code> variable. In production they come from Key Vault.</p>
</div>
</div>
</div>
<div class="section" id="other-problems-i-ran-into">
<h2>Other problems I ran into</h2>
<div class="section" id="handling-pickled-models-no-module-named-trading-library-models-alpha-model">
<h3>Handling pickled models - <code>No module named 'trading_library.models.alpha_model'</code></h3>
<p>My workflow is the following. I do model training on my laptop - a beefy 64GB machine. It takes over an hour to run there. Then I save a pickled version of the model to Azure Blobstore.</p>
<p>However, model prediction runs in Azure Functions. Inside Azure I download the blob containing the model and unpickle it. Unfortunately, my first attempt at doing this didn't work - I ran into the error <code>ModuleNotFoundError: No module named 'trading_library.models.alpha_model'</code>.</p>
<p>The reason for this is that inside Azure Functions, the module name isn't <code>trading_library</code>, it's <code>__app__.trading_library</code>. This breaks pickle.</p>
<p>To resolve this, we need to hack into Python's module system. In Python, a module is an ordinary object. The first time a module is imported it gets <em>loaded</em>; after this it is placed in the <code>sys.modules</code> hash table. If you import it again, it will simply return the object from <code>sys.modules</code>. Simple code example:</p>
<pre class="literal-block">
import sys
import logging
sys.modules['my_copy_of_logging'] = logging
import my_copy_of_logging

assert my_copy_of_logging is logging  # passes: both names refer to the same module object
</pre>
<p>We have essentially taken an existing module and tricked Python into thinking it is also available under a different module name.</p>
<p>We can use the same hack to resolve the issue with pickles. We put this at the top of our Azure function's <code>__init__.py</code>:</p>
<pre class="literal-block">
import __app__.trading_library
import sys
sys.modules['trading_library'] = __app__.trading_library
</pre>
<p>After this is done the pickle can be loaded normally.</p>
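<p>The whole failure and fix can be reproduced in a few lines with a synthetic module (all names here, such as <code>fake_trading_library</code> and <code>AlphaModel</code>, are invented for the demo):</p>

```python
import pickle
import sys
import types

# A synthetic module and a class that claims to live in it, standing in
# for trading_library and one of its model classes.
mod = types.ModuleType("fake_trading_library")

class AlphaModel:
    pass

AlphaModel.__module__ = "fake_trading_library"
mod.AlphaModel = AlphaModel
sys.modules["fake_trading_library"] = mod

blob = pickle.dumps(AlphaModel())

# Unpickling looks the class up by the module name recorded in the blob...
del sys.modules["fake_trading_library"]
try:
    pickle.loads(blob)
    failed = False
except ModuleNotFoundError:
    failed = True
assert failed  # the same shape of error as 'No module named trading_library...'

# ...so registering the module back under that recorded name repairs it,
# exactly like the sys.modules['trading_library'] = __app__.trading_library fix.
sys.modules["fake_trading_library"] = mod
assert isinstance(pickle.loads(blob), AlphaModel)
```
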
</div>
<div class="section" id="resetting-api-keys">
<h3>Resetting API keys</h3>
<p>Each Function App creates a corresponding Azure Storage instance. I am also using this storage instance to store data used by the app, as opposed to merely configuration of the app.</p>
<p>However, at some point I decided to reset the storage keys. When I did this my function app stopped working. I couldn't even deploy a new version of the app, and ran into this error:</p>
<pre class="literal-block">
Uploading built content /home/site/deployments/functionappartifact.squashfs -&gt; https://trading_strategy.blob.core.windows.net/scm-releases/scm-latest-trading_strategy.zip?...
Remote build failed!
</pre>
<p>The app didn't run either. The culprit is that Azure Functions was unable to access storage.</p>
<p>This can be fixed by copying the new <code>Connection String</code> (from the Storage instance) into the <code>AzureWebJobsStorage</code> field in the &quot;Application Settings&quot; part of Azure Functions.</p>
</div>
</div>
<script type='text/javascript'>if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
    var align = "center",
        indent = "0em",
        linebreak = "false";

    if (false) {
        align = (screen.width < 768) ? "left" : align;
        indent = (screen.width < 768) ? "0em" : indent;
        linebreak = (screen.width < 768) ? 'true' : linebreak;
    }

    var mathjaxscript = document.createElement('script');
    mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
    mathjaxscript.type = 'text/javascript';
    mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';

    var configscript = document.createElement('script');
    configscript.type = 'text/x-mathjax-config';
    configscript[(window.opera ? "innerHTML" : "text")] =
        "MathJax.Hub.Config({" +
        "    config: ['MMLorHTML.js']," +
        "    TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
        "    jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
        "    extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
        "    displayAlign: '"+ align +"'," +
        "    displayIndent: '"+ indent +"'," +
        "    showMathMenu: true," +
        "    messageStyle: 'normal'," +
        "    tex2jax: { " +
        "        inlineMath: [ ['\\\\(','\\\\)'] ], " +
        "        displayMath: [ ['$$','$$'] ]," +
        "        processEscapes: true," +
        "        preview: 'TeX'," +
        "    }, " +
        "    'HTML-CSS': { " +
        "        availableFonts: ['STIX', 'TeX']," +
        "        preferredFont: 'STIX'," +
        "        styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
        "        linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
        "    }, " +
        "}); " +
        "if ('default' !== 'default') {" +
            "MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
            "MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
        "}";

    (document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
    (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
              ]]>
            </content>
        </entry>
        <entry>
            <title type="html">Backtest your SQL queries - they are models too</title>
            <author><name>Chris Stucchio</name></author>
            <link href="https://www.chrisstucchio.com/blog/2019/sql_queries_are_data_science_models.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss"/>
            <updated>2019-12-09T08:30:00+01:00</updated>
            <published>2019-12-09T08:30:00+01:00</published>
            <id>https://www.chrisstucchio.com/blog/2019/sql_queries_are_data_science_models.html</id>

            <content type="html">
              <![CDATA[
                (If mathematical formulas are not showing up here, <a href="https://www.chrisstucchio.com/blog/2019/sql_queries_are_data_science_models.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss">please read this on the web</a>.)
                <p>I was recently discussing a project with a younger data scientist and I noticed a curious mismatch in our language. We had an API that we wanted to impose rate limits on. We want to ensure that 99% of our good customers have a good experience and never hit the rate limit, while locking down the heaviest users to prevent overload. We also want to limit the potential damage caused by malicious usage.</p>
<p>Luckily all usage was nicely logged in Redshift. For simplicity, let's assume we have a table <tt class="docutils literal">endpoint_usage</tt> with columns <tt class="docutils literal">endpoint_url</tt>, <tt class="docutils literal">user_id</tt>, <tt class="docutils literal">daily_requests</tt>, <tt class="docutils literal">date</tt> and <tt class="docutils literal">malicious</tt>. The <tt class="docutils literal">malicious</tt> flag is an indication of malicious usage of the API, and is unfortunately not something which can be computed in realtime. (If we could compute it in realtime then we could simply use this flag instead of a rate limit.)</p>
<p>Our analysis was quite simple - we'll just measure the historical usage of our bottom 99% of non-malicious customers - grouped by <tt class="docutils literal">endpoint_url</tt> - and use that. The code was not much more sophisticated than this:</p>
<pre class="literal-block">
SELECT DISTINCT endpoint_url,
       PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY daily_requests) OVER (PARTITION BY endpoint_url) AS num_requests_99pct
  FROM endpoint_usage
    WHERE (NOT malicious)
</pre>
<p>We are excluding malicious users from this query because we do not care about keeping them happy.</p>
<p>The SQL in the second line of the query is complex, but all it's doing is computing the 99th percentile of <tt class="docutils literal">daily_requests</tt> over each <tt class="docutils literal">endpoint_url</tt>. Unfortunately Redshift exposes <tt class="docutils literal">PERCENTILE_CONT</tt> here only as a window function rather than a simple aggregate, and the syntax for window functions is a bit more complex than aggregates. If we were taking the 100th percentile (i.e. <tt class="docutils literal">MAX</tt>) instead of the 99th, it would simply be:</p>
<pre class="literal-block">
SELECT endpoint_url, MAX(daily_requests)
 FROM endpoint_usage
   WHERE (NOT malicious)
 GROUP BY endpoint_url
</pre>
<p>(Those of you who know what I do know I'm actually not rate limiting an API, but instead allocating capital and choosing bet sizes for real money gambles. But this example is safely removed from trade secrets so I'm sticking with it.)</p>
<p>Me: &quot;Is the rate limit model ready to go?&quot;</p>
<p>Her: &quot;You mean running the rate limit SQL query?&quot;</p>
<p>Me: &quot;Yes, but also are we backtesting the model?&quot;</p>
<p>Her: &quot;What do you mean 'model'? It's just a SQL query.&quot;</p>
<p>Me: ...gave explanation that makes up the rest of this blog post...</p>
<p>To be very clear, I am not attempting to dress up a SQL query as something more sophisticated than it is. I am not a big fan of the hype train that turns simple linear regression into &quot;AI&quot;. However, I believe that concepts made popular in data science are valuable even for simple programming tasks such as this.</p>
<div class="section" id="what-s-a-model">
<h2>What's a &quot;model&quot;?</h2>
<p>Fundamentally, a model is a piece of code that attempts to predict the future based on past data. In many data science applications this means that the model will be something complex; a deep neural network, a gradient boosting machine, or even a linear regressor. In this case our model is something simpler - simply a hash map <tt class="docutils literal">endpoint_url <span class="pre">-&gt;</span> rate_limit</tt>. Just as the model is simple, the process computing it is also quite easy. The most obvious way to do it would be to simply open up Jupyter, run the analysis once, and then hardcode the result into the production system.</p>
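<p>To make the "it's just a hash map" point concrete, here is a pure-Python version of what the query computes. It uses a simple nearest-rank percentile, so its output can differ slightly from <tt class="docutils literal">PERCENTILE_CONT</tt>, which interpolates:</p>

```python
from collections import defaultdict

def rate_limit_model(rows, pct=0.99):
    """rows: (endpoint_url, daily_requests) pairs for non-malicious users.
    Returns the 'model': a plain dict endpoint_url -> rate_limit, using a
    nearest-rank percentile of each endpoint's daily request counts."""
    by_endpoint = defaultdict(list)
    for url, daily_requests in rows:
        by_endpoint[url].append(daily_requests)
    return {url: sorted(reqs)[int(pct * (len(reqs) - 1))]
            for url, reqs in by_endpoint.items()}

# 100 user-days of usage, 1..100 requests: the 99th percentile lands at 99.
print(rate_limit_model([("/search", r) for r in range(1, 101)]))  # {'/search': 99}
```
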
<p>The danger in simply doing this is that the future may not simply be a repeat of the past. Our goal is to rate limit the API in such a way as to minimize the impact on <em>future</em> non-malicious users. We are attempting to predict the future, specifically future usage of the API.</p>
<p>Just like any other data science model, we should therefore run a <a class="reference external" href="https://en.wikipedia.org/wiki/Backtesting">backtest</a> to determine if our predictions would have worked well in the past had we used them.</p>
<div class="section" id="how-does-backtesting-work">
<h3>How does backtesting work?</h3>
<p>To backtest a model is to simulate what <em>would have</em> happened had we used the model in the past.</p>
<p>Let's be concrete. Suppose we plan to train this model monthly. It's currently December so we have a complete month of data from November. Therefore, we'll calculate thresholds <em>for November</em> as follows:</p>
<pre class="literal-block">
SELECT DISTINCT endpoint_url,
       PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY daily_requests) OVER (PARTITION BY endpoint_url) AS num_requests_99pct
  FROM endpoint_usage
    WHERE (NOT malicious)
          AND (date &lt; '2019-11-01')  -- This part is new
</pre>
<p>This tells us what the rate limits <em>would have been</em> during the month of November if we ran this code on Nov 1.</p>
<p>Then we'll measure what fraction of non-malicious users would have gotten rate limited during the month of Nov. In this query, the <tt class="docutils literal">rate_limits</tt> table is a temporary table generated from the output of the above (i.e. <tt class="docutils literal">(date &lt; <span class="pre">'2019-11-01')</span></tt>) query:</p>
<pre class="literal-block">
SELECT endpoint_usage.endpoint_url, COUNT(*) AS total_usage,
       SUM(CASE WHEN (daily_requests &gt; num_requests_99pct) THEN 1 ELSE 0 END) AS num_rate_limited
  FROM endpoint_usage
    INNER JOIN rate_limits ON (endpoint_usage.endpoint_url = rate_limits.endpoint_url)
  WHERE (NOT malicious)
        AND (date &gt;= '2019-11-01') AND (date &lt; '2019-12-01')
  GROUP BY endpoint_usage.endpoint_url
</pre>
<p>This query tells us the total usage as well as the total fraction of usage that gets rate limited. Concretely, a single user who uses the API for 5 days and got rate limited once would contribute 5 to <tt class="docutils literal">total_usage</tt> and 1 to <tt class="docutils literal">num_rate_limited</tt>.</p>
<p>If <tt class="docutils literal">num_rate_limited</tt> is 1% of <tt class="docutils literal">total_usage</tt>, we're in business! Our modeling assumption appears to be true and we can safely put this model into production.</p>
<p>If it's not, then we might need to do more work.</p>
</div>
<div class="section" id="coding-note">
<h3>Coding note</h3>
<p>One way to implement this would be as a single big query:</p>
<pre class="literal-block">
WITH rate_limits AS (SELECT DISTINCT endpoint_url,
                            PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY daily_requests) OVER (PARTITION BY endpoint_url) AS num_requests_99pct
                       FROM endpoint_usage
                         WHERE (NOT malicious)
                               AND (date &lt; '2019-11-01'))


SELECT endpoint_usage.endpoint_url, COUNT(*) AS total_usage,
      SUM(CASE WHEN (daily_requests &gt; num_requests_99pct) THEN 1 ELSE 0 END) AS num_rate_limited
  FROM endpoint_usage
    INNER JOIN rate_limits ON (endpoint_usage.endpoint_url = rate_limits.endpoint_url)
  WHERE (NOT malicious)
        AND (date &gt;= '2019-11-01') AND (date &lt; '2019-12-01')
  GROUP BY endpoint_usage.endpoint_url
</pre>
<p>I would, however, recommend an alternate approach: write a python script which takes an <tt class="docutils literal">end_date</tt> parameter and runs a query that excludes all data which only became available after <tt class="docutils literal">end_date</tt>.</p>
<p>This function would be used <em>in both training and backtesting</em>, so that there is only a single chunk of code to change if the algorithm changes. The danger we are protecting against here is forgetting to change the backtesting query when we change the model query.</p>
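<p>A sketch of what that shared function might look like - the table and column names follow the post, while the function itself, its parameter style, and the DB-API cursor usage are my own assumptions:</p>

```python
RATE_LIMIT_QUERY = """
SELECT DISTINCT endpoint_url,
       PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY daily_requests)
           OVER (PARTITION BY endpoint_url) AS num_requests_99pct
  FROM endpoint_usage
 WHERE (NOT malicious) AND (date < %(end_date)s)
"""

def train_rate_limits(conn, end_date):
    """The single code path for both jobs: monthly training passes today's
    date, while the backtest passes the start of the month being simulated."""
    with conn.cursor() as cur:
        cur.execute(RATE_LIMIT_QUERY, {"end_date": end_date})
        return dict(cur.fetchall())  # endpoint_url -> rate limit
```

<p>Because both training and backtesting call <code>train_rate_limits</code>, a change to the percentile or the filters automatically applies to both.</p>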
</div>
<div class="section" id="does-the-model-change-over-time">
<h3>Does the model change over time?</h3>
<p>Most ML models do not retain their accuracy indefinitely. In this case, it is worth considering the possibility that legitimate API usage might change over time. So we might wish to train our model over a shorter, more recent time period, and then periodically retrain to update our rate limits.</p>
<p>Of course, if we do this, we also need to test the results and see whether it affects the accuracy of the model.</p>
</div>
</div>
<div class="section" id="putting-it-into-production">
<h2>Putting it into production</h2>
<p>Instead of simply running this SQL query once and hard coding the result, we can have a much more robust system with a moderate amount of effort. We'll follow the exact same steps as putting a data science model into production:</p>
<ol class="arabic simple">
<li>Set up a cron job (or better, an <a class="reference external" href="https://airflow.apache.org/">Airflow DAG</a>) to run the query (i.e. train the model) every month.</li>
<li>Set up a second cron job to run the backtest every month. This means that on Jan 1, the code should generate rate limits using data available up to Dec 1. Then it should check how many users exceeded those rate limits during Dec 1-Dec 31. The results of this should be posted to a slack channel or monthly email.</li>
<li>Ensure proper alerting. If the API usage in redshift drops below some expected level, raise an alert - the model might be broken due to insufficient data (or data collection might simply be broken). Similarly, if the number of non-malicious users getting rate limited exceeds an expected threshold (e.g. 3-5%) then an alert should be raised. Models are finicky things and sanity checks are needed.</li>
</ol>
<p>(Obviously adjust &quot;monthly&quot; to whatever makes sense for your use case.)</p>
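<p>The sanity checks in step 3 are easy to make mechanical. A minimal sketch - the thresholds and message wording here are placeholders, not from any real system:</p>

```python
def sanity_check(total_usage, num_rate_limited,
                 min_usage=1000, max_limited_fraction=0.03):
    """Return a list of alert messages; an empty list means all checks passed."""
    alerts = []
    if total_usage < min_usage:
        alerts.append(f"usage suspiciously low ({total_usage} rows) - "
                      "data collection may be broken")
        return alerts  # too little data to trust the fraction below
    limited = num_rate_limited / total_usage
    if limited > max_limited_fraction:
        alerts.append(f"{limited:.1%} of non-malicious usage was rate limited, "
                      f"above the {max_limited_fraction:.0%} threshold")
    return alerts
```

<p>The output of the monthly backtest feeds straight into this, and any non-empty result goes to the Slack channel or email from step 2.</p>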
<p>Ultimately, the idea of running a train/test split on historical data and auto-updating your parameters is a very powerful paradigm. It is normally applied to complex data science models, but even simple tasks like this one can benefit from the same process.</p>
</div>

              ]]>
            </content>
        </entry>
        <entry>
            <title type="html">The Final Stage of Grief (about bad data) is Acceptance</title>
            <author><name>Chris Stucchio</name></author>
            <link href="/pubs/slides/fifth_elephant_2019/bad_data.pdf?utm_medium=rss&amp;utm_source=chris_blog&amp;utm_campaign=merged_rss"/>
            <updated>2019-07-25T09:00:00+02:00</updated>
            <published>2019-07-25T09:00:00+02:00</published>
            <id>https://www.chrisstucchio.com/blog/2019/fifth_elephant_2019_talk.html</id>

            <content type="html">
              <![CDATA[
                (If mathematical formulas are not showing up here, <a href="https://www.chrisstucchio.com/blog/2019/fifth_elephant_2019_talk.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss">please read this on the web</a>.)
                <p>I recently gave a talk at the <a href="https://fifthelephant.in/2019/">Fifth Elephant 2019</a>. The talk was a discussion about how to use math to handle unfixably bad data. The <a href="/pubs/slides/fifth_elephant_2019/bad_data.pdf">slides are available here</a>. Go check it out.</p>
              ]]>
            </content>
        </entry>
        <entry>
            <title type="html">Don't believe the hype: Basic Income reduces labor supply by 10%, which is a lot</title>
            <author><name>Chris Stucchio</name></author>
            <link href="https://www.chrisstucchio.com/blog/2019/basic_income_reduces_employment.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss"/>
            <updated>2019-05-06T08:30:00+02:00</updated>
            <published>2019-05-06T08:30:00+02:00</published>
            <id>https://www.chrisstucchio.com/blog/2019/basic_income_reduces_employment.html</id>

            <content type="html">
              <![CDATA[
                (If mathematical formulas are not showing up here, <a href="https://www.chrisstucchio.com/blog/2019/basic_income_reduces_employment.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss">please read this on the web</a>.)
                <p>With Andrew Yang's presidential candidacy moving forward, people are discussing basic income again. One common meme about a Basic Income is that by removing the implicit high marginal tax rates that arise from the withdrawal of welfare benefits, it would reduce disincentives to work, and therefore would not reduce labor supply. In this blog post I provide both empirical and theoretical evidence that this conclusion is false.</p>
<p>In particular, I review the experimental literature, which suggests a Basic Income will result in approximately a 10% drop in labor supply. I also review standard economic theory regarding diminishing marginal utility, which provides a clear theoretical reason why a Basic Income would reduce labor supply.</p>
<p>Finally, I extrapolate the data from the 1970s Basic Income experiments to the contemporary era. In particular, I consider a counterfactual history, taking the employment effects from past experiments and applying them to contemporary employment rates.</p>
<div class="section" id="empirical-evidence-5-north-american-experiments">
<h2>Empirical evidence: 5 North American Experiments</h2>
<p>Let me begin with the empirical evidence. There have been 5 experiments on either Basic Income or Negative Income Tax in North America. I'll discuss each of them in turn, focusing on the labor force effects.</p>
<div class="section" id="seattle-denver-income-maintenance-experiment">
<h3>Seattle/Denver Income Maintenance Experiment</h3>
<p>This <a class="reference external" href="https://www.chrisstucchio.com/blog_media/2019/basic_income_reduces_employment/OVERVIEW_OF_THE_FINAL_REPORT_OF_THE_SEATTLE_DENVER_INCOME_MAINTENANCE_EXPERIMENT__denver_basic_income_report.pdf">experiment</a> measured the effect of a Negative Income Tax as compared to the welfare programs (AFDC, Food Stamps) available at the time. The experiment ran from 1970-1975 in Seattle and 1972-1977 in Denver. The benefit of the plan was an unconditional cash transfer, assigned randomly at various levels, with the maximum transfer being equal to 115% of the poverty line.</p>
<p>The sample was focused on lower income Americans, and aimed to collect representative data on Whites, Blacks and Chicanos (in Denver only) - this resulted in the minority ethnic groups being significantly overrepresented. There were also two variations - a 3 year treatment group and a 5 year treatment group.</p>
<p>The net result on labor supply is the following:</p>
<ul class="simple">
<li>Husbands reduced their labor supply by about 7% in the 3-year treatment group and by 12-13% in the 5-year treatment group.</li>
<li>Wives reduced their labor supply by about 15% in the 3-year treatment group and by 21-27% in the 5-year treatment group.</li>
<li>Single mothers reduced their labor supply between 15 and 30%.</li>
<li>The labor supply reduction typically took 1 year to kick in, suggesting a shorter experiment might have missed it.</li>
<li>The fact that the reduction is larger in the 5-year group than the 3-year group suggests anticipation effects - people know that they will have guaranteed income for several years so they plan for long term work reductions.</li>
</ul>
<p>The experiment tracked both treatment groups for 5 years, and in the final 2 years the 3-year treatment group recovered most of their labor market losses.</p>
<p>An additional interesting effect observed is a higher rate of divorce in the treatment groups.</p>
</div>
<div class="section" id="mincome-manitoba-basic-annual-income-experiment">
<h3>Mincome: Manitoba Basic Annual Income Experiment</h3>
<p>The <a class="reference external" href="https://www.chrisstucchio.com/blog_media/2019/basic_income_reduces_employment/Basic_Income_in_a_Small_Town_Understanding_the_Elusive_Effects_on_Work__Calnitsky_Latner_Social_Problems_BI_in_a_small_town.pdf">Mincome experiment</a> is the one most commonly cited by fawning journalists. For example, here's how Vice describes it:</p>
<p>&gt; The feared labor market fallout—that people would stop working-didn't materialize... &quot;If you work another hour, you get to keep 50 percent of the benefit you would have gotten anyway, so you are better off working than not.&quot;</p>
<p>This framing is <a class="reference external" href="https://thecorrespondent.com/541/why-we-should-give-free-money-to-everyone/31639050894-e44e2c00">typical in mainstream journalism</a>:</p>
<p>&gt; 'Politicians feared that people would stop working, and that they would have lots of children to increase their income,' professor Forget says. Yet the opposite happened: the average marital age went up while the birth rate went down. The Mincome cohort had better school completion records. The total amount of work hours decreased by only 13%.</p>
<p>This framing is weird, because a 13% decrease in labor supply is actually pretty big. A <a class="reference external" href="https://www.chrisstucchio.com/blog_media/2019/basic_income_reduces_employment/Basic_Income_in_a_Small_Town_Understanding_the_Elusive_Effects_on_Work__Calnitsky_Latner_Social_Problems_BI_in_a_small_town.pdf">differences in differences</a> analysis of Mincome suggests a treatment effect of about 11.3%. There are also subgroup analyses suggesting that this effect might be driven more by women than men, but these are prone to small sample effects as well as the <a class="reference external" href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">Garden of Forking Paths</a>.</p>
<p>One interesting result from Mincome is that about 30% of the effect of Mincome is likely to be socially driven rather than pure economics. The claim made is that if a Basic Income is given to everyone, reducing labor market participation may become socially normalized and this will drive further reductions in labor supply. This effect would likely not be measurable in a randomized control trial.</p>
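<p>A differences-in-differences estimate is simple: the change in the treated group minus the change over the same period in the control group. A minimal sketch, using made-up participation rates rather than the actual Mincome figures:</p>

```python
# Differences-in-differences: the treatment effect is the change in the
# treated group minus the change in the untreated group over the same period.
def diff_in_diff(treat_before, treat_after, control_before, control_after):
    return (treat_after - treat_before) - (control_after - control_before)

# Hypothetical labor force participation rates, NOT the actual Mincome data.
effect = diff_in_diff(treat_before=0.80, treat_after=0.70,
                      control_before=0.80, control_after=0.81)
# effect is about -0.11: an 11 point drop attributable to the treatment.
```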
</div>
<div class="section" id="the-rural-income-experiment">
<h3>The Rural Income Experiment</h3>
<p>Unlike the other experiments, this one was designed to measure the effect of Basic Income on rural people, in particular self-employed farmers. Rural communities in Iowa and North Carolina were chosen, and basic incomes equivalent to 50%-100% of the poverty level were given to people in the treatment group.</p>
<p>This experiment was on the small side; only 809 families entered and 729 remained in the program for all three years.</p>
<p>After controlling statistically for differences in observable characteristics of participants, the Rural Income Experiment showed an overall labor supply reduction of 13%. In this experiment the labor market effect was somewhat smaller for husbands (-8% to +3%) while being quite large for wives (-22% to -31%) and dependents (-16% to -66%).</p>
<p>The widely disparate results reported across subgroups also suggest that the subgroup analysis is noisy and suffering from insufficiently large samples - not surprising given 809 sample families split across 3 subgroups (NC Blacks, NC Whites and Iowa) plus 3 subgroup analyses per group (Husbands, Wives and Dependents).</p>
</div>
<div class="section" id="the-gary-experiment">
<h3>The Gary Experiment</h3>
<p>The <a class="reference external" href="https://www.chrisstucchio.com/blog_media/2019/basic_income_reduces_employment/gary_income_maintenance_experiment.pdf">Gary Experiment</a> was focused on mitigating urban Black poverty. It was run from 1971-1974 and had a sample of 1800 families (43% of which were the control group). 60% of participating families were female headed. Dual earning families were generally excluded from the experiment because their income was too high.</p>
<p>The size of the BI was pretty similar to those of the other experiments - 70% and 100% of the poverty level.</p>
<p>In this experiment, the work reduction was 7% for husbands, 17% for wives, and 5% for female heads of household. The reason the drop in female heads of household is low may simply be due to the fact that prior to the experiment, female heads of household only worked an average of 6 hours/week.</p>
<p>An additional effect of the BI was an increase in wealth inequality - higher earning married couples tended to save money and pay down debt while much poorer single mothers merely increased consumption.</p>
</div>
<div class="section" id="the-new-jersey-experiment">
<h3>The New Jersey Experiment</h3>
<p>The <a class="reference external" href="https://www.chrisstucchio.com/blog_media/2019/basic_income_reduces_employment/New_Jersey_Graduated_Work_Incentive_Experiment__ED099531.pdf">New Jersey Experiment</a> ran from 1967-1974. In this experiment the income levels ranged from 50% to 125% of the poverty line, and the experiment included 1350 randomly selected low income families in NJ and PA. Each family in the experiment received 3 years of Basic Income. As with the Rural Income Experiment, many subgroup analyses were performed (on a relatively low number of families per subgroup) and inconsistent results were obtained across subgroups.</p>
<p>The overall results were a reduction in labor supply (hours worked) by 13.9% for white families, 6.1% for black families and 1.5% for Spanish speaking families. The labor force participation rate reduction was 9.8%, 5% and +/- 6.7% for White, Black and Spanish speaking families respectively. (Due to the poor quality of the scan, I can't make out the digits after the decimal for black families or whether the effect is positive or negative in table 3.)</p>
<p>I do not endorse the level of excessive subgroup analysis they performed. In such a small sample they should have just done an overall analysis. But the experiment was designed in 1967 so I'll be forgiving of the authors - my viewpoint of their methodology is, of course, heavily informed by living through the modern replication crisis.</p>
</div>
<div class="section" id="empirical-conclusions">
<h3>Empirical conclusions</h3>
<p>The studies I've surveyed were all social experiments performed in the 1970s. As such, the treatment effects compare a Basic Income providing roughly a 1970s poverty-line income against the welfare programs of that era. These experiments were also performed in an era with significantly lower female workforce participation and higher marriage rates.</p>
<p>The experiments were also all pre-Replication Crisis, and as a result they feature excessive subgroup analysis and experimenter degrees of freedom; for this reason I don't fully believe most of the fine grained effects these studies purport to measure.</p>
<p>However, there is one very clear and significant top line effect that is consistent across every experiment: a roughly 10% reduction in labor supply.</p>
</div>
</div>
<div class="section" id="theory-why-would-this-be-true">
<h2>Theory: Why would this be true?</h2>
<p>The common justification for why a Basic Income would not reduce labor supply is the following. Because a BI is given regardless of work, a person receiving a BI gains the same amount of money from working as they would gain if they did not work. This is often contrasted to means-tested welfare, which often has high implicit marginal tax rates due to the withdrawal of welfare benefits.</p>
<p>However, this verbal analysis ignores something very important: diminishing marginal utility.</p>
<p>In economics, people are modeled as making decisions based on <em>utility</em> - roughly speaking, the happiness you get from something - not on <em>cash</em>. And an important stylized fact, accepted by pretty much everyone, is that utility as a function of income is strictly concave down. In mathematical terms, that means that for any <span class="math">\(0 &lt; \alpha &lt; 1\)</span> and any two distinct incomes <span class="math">\(I_1 \neq I_2\)</span>:</p>
<div class="math">
\begin{equation*}
U(\alpha I_1 + (1-\alpha) I_2) &gt; \alpha U(I_1) + (1-\alpha) U(I_2)
\end{equation*}
</div>
<p>Since more income is always better, we can also assume that <span class="math">\(U(I)\)</span> is a strictly increasing function of income.</p>
<p>In pictures, this means that a person's utility function looks like this:</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2019/basic_income_reduces_employment/utility_function.png" />
<p>Now the choice to work is made by balancing the utility gained from income against the disutility from working:</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2019/basic_income_reduces_employment/choice_to_work.png" />
<p>Since the net utility is positive, this person will choose to work.</p>
<p>However, because the utility function is concave, if we start from a point further out (namely <span class="math">\(I_b\)</span>), the utility gain from labor decreases. This can be illustrated in the following graph:</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2019/basic_income_reduces_employment/basic_income_reduces_work_marginal_utility.png" />
<p>In the Basic Income regime, a person's utility gain from working is only <span class="math">\(U(I_b+I_w) - U(I_b)\)</span>, which is lower than <span class="math">\(U(I_w)-U(0)\)</span>.</p>
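<p>This claim is easy to check numerically. A minimal sketch, using <span class="math">\(U(I) = \sqrt{I}\)</span> as a stand-in for a concave, increasing utility function (the specific function and income levels are illustrative, not part of the argument):</p>

```python
import math

def U(income):
    # sqrt is a stand-in for any strictly concave, increasing utility function.
    return math.sqrt(income)

I_w = 30_000.0  # income from work
I_b = 15_000.0  # basic income

gain_without_bi = U(I_w) - U(0)        # utility gained by working, starting from 0
gain_with_bi = U(I_b + I_w) - U(I_b)   # utility gained by working, on top of the BI

# Concavity means the same wage buys less additional utility at a higher baseline.
assert gain_with_bi < gain_without_bi
```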
<p>In some cases, this decrease will result in the net utility gain from work being negative:</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2019/basic_income_reduces_employment/basic_income_reduces_work_marginal_utility2.png" />
<p>These are the people who are deterred from working.</p>
<p>Now the graphs I've given above are just an example. The clever reader might ask if it is true for every graph. I will prove in the appendix that a Basic Income always reduces the marginal utility from work if diminishing marginal utility is true.</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>How big is this effect? Journalists favorable to a Basic Income tend to talk about &quot;only&quot; a 10% drop in labor supply. Let me make an invidious comparison.</p>
<p>In 2008, the United States (and the world) suffered the Great Recession. To make a comparison, I've plotted the male employment to population ratio (approximated by taking the <a class="reference external" href="https://www.quandl.com/data/FRED/USAEMPMALQDSMEI-Civilian-Employment-Males-in-the-United-States">male civilian employment rate</a> and dividing it by half of the <a class="reference external" href="https://www.quandl.com/data/FRED/POPTTLUSA173NUPN-Population-for-United-States">US population</a>) at the time of the great recession.</p>
<p>What would have happened if the Great Recession didn't occur, but we instead instituted a Basic Income in 2008?</p>
<p>To speculate about this, I assumed a baseline employment to population rate of 52% for men (the peak employment rate just before the recession). I then plotted for comparison the results of several Basic Income experiments focusing on the effects on men (though in a couple of cases that was not well disambiguated).</p>
<p>In the case of Seattle/Denver, I plotted the effect observed in each year. In the other cases, where yearly effects were not reported, I merely assumed a drop equal to the average reported drop.</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2019/basic_income_reduces_employment/comparison_to_recession.png" />
<p>The result can be seen above. The typical effects of a Basic Income are in the same ballpark as those of the Great Recession.</p>
<p>The conclusion we can draw from this is that all the available evidence suggests that a Basic Income will have a very large and negative effect on the economy.</p>
<p>We can also anticipate that the effect will be worse if people believe that a Basic Income is likely to be permanent. As can be seen by comparing the 3 and 5 year groups in the Seattle/Denver experiment, people assigned to a longer term BI reduced their work effort significantly more than those assigned to the short term BI.</p>
</div>
<div class="section" id="updates-confirming-the-theory-with-out-of-time-data">
<h2>Updates: Confirming the theory with out-of-time data!</h2>
<div class="section" id="update-2020-3-5-southern-ontario-experiment">
<h3>Update 2020/3/5: Southern Ontario Experiment</h3>
<p>(This section added to the blog post on 2020/3/5.)</p>
<p>As of 2019, I predicted that a Basic Income in North America would generally reduce labor supply by double digit amounts, approximately 10%. I now have out-of-time data that vindicates this prediction!</p>
<p>From 2017-2018, <a class="reference external" href="https://www.chrisstucchio.com/blog_media/2019/basic_income_reduces_employment/southern-ontarios-basic-income-experience.pdf">Southern Ontario</a> ran a new Basic Income experiment. For what I believe to be political reasons, the pilot study was cancelled and the assessment was not performed using official government records. To make up for this lack of good data collection, some academics ran a poll on some of the people who were receiving Basic Income during the pilot. Due to this methodology - entirely self reported based on people's recollection of things from years ago - this study is far more speculative than the ones above.</p>
<p>However, the results are generally consistent with all the other studies. Of the BI recipients who were employed 6 months prior to receiving Basic Income, 23.9% report being unemployed during the Basic Income pilot. Of the people unemployed prior to BI, 81.8% remain unemployed. Changes in working hours were not measured.</p>
<p>This more recent study offers results in the same ballpark as all the older studies; Basic Income appears to reduce labor supply by double digit percentages. My prediction came true in out-of-time data!</p>
</div>
<div class="section" id="update-barcelona">
<h3>Update: Barcelona</h3>
<p>Barcelona <a class="reference external" href="https://www.chrisstucchio.com/blog_media/2019/basic_income_reduces_employment/basic_income_spain.pdf">ran a BI experiment</a> in 2017 in which they gave a basic income to a set of low income residents.
Labor market participation was measured via social security records, and includes point-in-time information at 10 day intervals on whether a participant was employed. A survey was performed
at much less regular intervals to measure self-employment not captured by social security records, as well as measuring job search activity and civic engagement.</p>
<p>The net result is a 20% reduction in probability of being employed (from 47% to 37.5%) as per administrative records, and survey records suggest this effect is larger for full time employment.</p>
<p>Survey results also suggest no statistically significant increase in civic engagement, education or job search.</p>
</div>
<div class="section" id="update-finland">
<h3>Update: Finland</h3>
<p>Finland <a class="reference external" href="https://www.chrisstucchio.com/blog_media/2019/basic_income_reduces_employment/finland_basic_income.pdf">also ran a BI experiment</a>. This is being widely touted by BI proponents as proving
that BI doesn't reduce employment in wealthy countries, but the proponents seem to be ignoring the fact that it was given to unemployed people only. What the study does conclusively
prove is that BI doesn't turn unemployed people into &quot;double-unemployed&quot; people who work a negative number of jobs. (Of course you don't need a study to prove that.)</p>
</div>
</div>
<div class="section" id="update-on-keynesian-economics">
<h2>Update on Keynesian Economics</h2>
<p>A few commenters on reddit suggest that unemployment due to &quot;low aggregate demand&quot; is somehow different from reduced labor force participation due to Basic Income. However, this idea is based on either MMT or some weird newspaper columnist pop-Keynesianism; it is not in any way based on the economic mainstream.</p>
<p>The mainstream economists' version of Keynesian theory says that in a recession, people do not work because they have sticky nominal wages but a shock has resulted in their real output dropping. Concretely, a worker has a nominal wage demand of <span class="math">\(W\)</span> dollars. He used to produce <span class="math">\(K\)</span> widgets at a cost of <span class="math">\(W/K\)</span> each, but for whatever reason he can now only produce <span class="math">\(K_2 &lt; K\)</span> widgets. His real output is now <span class="math">\(K_2\)</span> which has nominal value <span class="math">\(K_2 (W/K) = (K_2/K) W &lt; W\)</span>.</p>
<p>In order to productively employ him the nominal wage must be reduced by a factor of <span class="math">\(K_2/K\)</span>. However, the worker refuses to work unless he is paid <span class="math">\(W\)</span> dollars.</p>
<p>The Keynesian prescription of stimulating aggregate demand solves this problem by inflation; if the price of a single widget can be increased by a factor of <span class="math">\(K/K_2\)</span>, then the worker can again be paid a wage of <span class="math">\(W\)</span> dollars.</p>
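<p>A quick arithmetic sketch of the argument above (the specific numbers are mine, chosen purely for illustration):</p>

```python
W = 100.0   # the worker's sticky nominal wage demand, in dollars
K = 10      # widgets produced per period before the shock
K2 = 8      # widgets produced per period after the shock

price = W / K                  # pre-shock price per widget: 10 dollars
value_of_output = K2 * price   # 80 dollars, less than W: unemployable at wage W

# Inflating the widget price by K/K2 restores the nominal value of his output
# to W, so he can again be paid the wage he demands.
inflated_price = price * (K / K2)
assert K2 * inflated_price == W
```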
<p>In essence, stimulating aggregate demand is about tricking prideful workers into reducing their real wage demands so that they stop being lazy and go back to work.</p>
<p>In spite of the difference in mood affiliation, Keynesian economics claims that recessions reduce labor supply for the exact same reason a Basic Income does: workers refusing to work.</p>
</div>
<div class="section" id="appendix-proof-that-diminishing-marginal-utility-implies-a-work-disincentive-from-basic-income">
<h2>Appendix: Proof that Diminishing Marginal Utility implies a work disincentive from Basic Income</h2>
<p>The choice to work can be framed as a question of utility maximization. Assuming one receives an income of <span class="math">\(I_w\)</span> from work and an income of <span class="math">\(I_b\)</span> from Basic Income, the utility of working is:</p>
<div class="math">
\begin{equation*}
U(I_b + I_w) - U_w
\end{equation*}
</div>
<p>While the utility of not working is</p>
<div class="math">
\begin{equation*}
U(I_b)
\end{equation*}
</div>
<p>Here <span class="math">\(U_w\)</span> is the utility penalty that describes the unpleasantness of work. Let us define <span class="math">\(\Delta U_{bi}\)</span> as the marginal utility gained by making the choice to work in a Basic Income regime:</p>
<div class="math">
\begin{equation*}
\Delta U_{bi} = [U(I_b + I_w) - U_w] - U(I_b)
\end{equation*}
</div>
<p>In contrast, the marginal utility gained or lost from work in a non-Basic Income regime is:</p>
<div class="math">
\begin{equation*}
\Delta U = [U(I_w) - U_w] - U(0)
\end{equation*}
</div>
<p>The concavity relation (setting <span class="math">\(\alpha=I/(I_b+I_w)\)</span>) tells us that for any <span class="math">\(0 &lt; I &lt; I_b+I_w\)</span>, we have:</p>
<div class="math">
\begin{equation*}
U(I) &gt; \frac{I}{I_b+I_w} U(I_b+I_w) + \frac{I_b+I_w-I}{I_b+I_w}U(0)
\end{equation*}
</div>
<p>Now if we compute the difference <span class="math">\(\Delta U_{bi} - \Delta U\)</span>, we discover:</p>
<div class="math">
\begin{equation*}
\Delta U_{bi} - \Delta U = U(I_b + I_w) - U(I_b) - U(I_w) + U(0)
\end{equation*}
</div>
<p>If we substitute <span class="math">\(I=I_b\)</span> and <span class="math">\(I=I_w\)</span> into the concavity relation above, we discover:</p>
<div class="math">
\begin{equation*}
\Delta U_{bi} - \Delta U &lt; \left[U(I_b + I_w) + U(0)\right] -
\end{equation*}
</div>
<div class="math">
\begin{equation*}
\left( \frac{I_b}{I_b+I_w} U(I_b+I_w) + \frac{I_w}{I_b+I_w}U(0) + \frac{I_w}{I_b+I_w} U(I_b+I_w) + \frac{I_b}{I_b+I_w}U(0) \right)
\end{equation*}
</div>
<div class="math">
\begin{equation*}
= \left[U(I_b + I_w) + U(0)\right] - \left( U(I_b + I_w) + U(0) \right)
\end{equation*}
</div>
<div class="math">
\begin{equation*}
= 0
\end{equation*}
</div>
<p>Therefore:</p>
<div class="math">
\begin{equation*}
\Delta U_{bi} &lt; \Delta U
\end{equation*}
</div>
<p>This completes the proof that if Diminishing Marginal Utility is true, a Basic Income reduces the incentive to work.</p>
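<p>The inequality is also easy to sanity-check numerically. A minimal sketch, using <span class="math">\(U(I) = \sqrt{I}\)</span> as an example concave utility function (the incomes and work disutility <span class="math">\(U_w\)</span> are arbitrary illustrative values):</p>

```python
import math

def U(income):
    # An example strictly concave, increasing utility function.
    return math.sqrt(income)

I_w = 30_000.0  # income from work
I_b = 15_000.0  # basic income
U_w = 50.0      # disutility of working

delta_u = (U(I_w) - U_w) - U(0.0)           # net utility of working, no BI
delta_u_bi = (U(I_b + I_w) - U_w) - U(I_b)  # net utility of working, under BI

# The Basic Income regime always offers a smaller incentive to work.
assert delta_u_bi < delta_u
```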
</div>
<script type='text/javascript'>if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
    var align = "center",
        indent = "0em",
        linebreak = "false";

    if (false) {
        align = (screen.width < 768) ? "left" : align;
        indent = (screen.width < 768) ? "0em" : indent;
        linebreak = (screen.width < 768) ? 'true' : linebreak;
    }

    var mathjaxscript = document.createElement('script');
    mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
    mathjaxscript.type = 'text/javascript';
    mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';

    var configscript = document.createElement('script');
    configscript.type = 'text/x-mathjax-config';
    configscript[(window.opera ? "innerHTML" : "text")] =
        "MathJax.Hub.Config({" +
        "    config: ['MMLorHTML.js']," +
        "    TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
        "    jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
        "    extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
        "    displayAlign: '"+ align +"'," +
        "    displayIndent: '"+ indent +"'," +
        "    showMathMenu: true," +
        "    messageStyle: 'normal'," +
        "    tex2jax: { " +
        "        inlineMath: [ ['\\\\(','\\\\)'] ], " +
        "        displayMath: [ ['$$','$$'] ]," +
        "        processEscapes: true," +
        "        preview: 'TeX'," +
        "    }, " +
        "    'HTML-CSS': { " +
        "        availableFonts: ['STIX', 'TeX']," +
        "        preferredFont: 'STIX'," +
        "        styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
        "        linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
        "    }, " +
        "}); " +
        "if ('default' !== 'default') {" +
            "MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
            "MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
        "}";

    (document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
    (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
              ]]>
            </content>
        </entry>
        <entry>
            <title type="html">Is white nationalism a serious problem? Extracting wikipedia data with Python.</title>
            <author><name>Chris Stucchio</name></author>
            <link href="https://www.chrisstucchio.com/blog/2019/python_and_terrorism.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss"/>
            <updated>2019-03-21T08:30:00+01:00</updated>
            <published>2019-03-21T08:30:00+01:00</published>
            <id>https://www.chrisstucchio.com/blog/2019/python_and_terrorism.html</id>

            <content type="html">
              <![CDATA[
                (If mathematical formulas are not showing up here, <a href="https://www.chrisstucchio.com/blog/2019/python_and_terrorism.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss">please read this on the web</a>.)
                <p>I recently saw a <a class="reference external" href="https://twitter.com/AOC/status/1107757871477985280">silly twitter exchange</a> between two of the lyingest politicians in American politics. Given that they have both explicitly expressed the viewpoint that morals matter more than numbers and being &quot;technically correct&quot;, I figured that I should just check for myself. On twitter, Trump says white nationalism is a small group of people with serious problems while Alexandria O. Cortez claims &quot;White supremacists committed the largest # of extremist killings in 2017&quot;. This question is easily answerable... <em>with Python</em>.</p>
<p>So actually no, this blog isn't about politics. But I recently discovered <code>pandas.read_html</code>, and two idiot politicians tweeting at each other is as good a reason as any to write a blog post about it. The real audience for this post is python developers who want to see a couple of cool pydata tricks I've learned recently.</p>
<div class="section" id="cool-python-trick-1-pandas-read-html">
<h2>Cool python trick #1: <code>pandas.read_html</code></h2>
<p>This is one of the coolest tricks I've learned in 2019. The <code>pandas</code> library has a method, <a class="reference external" href="http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.read_html.html">read_html</a> which takes a webpage as input and returns a list of dataframes containing the tables on that webpage.</p>
<p>So to answer the question about terrorism in 2017, I'm going to browse Wikipedia's <a class="reference external" href="https://en.wikipedia.org/wiki/List_of_terrorist_incidents_in_2017">List of Terrorist Incidents in 2017</a>.</p>
<p>Sadly, there's a lot of terror attacks, so they have separate pages for each month. Each page looks like this: <a class="reference external" href="https://en.wikipedia.org/wiki/List_of_terrorist_incidents_in_January_2017">https://en.wikipedia.org/wiki/List_of_terrorist_incidents_in_January_2017</a>.</p>
<p>Therefore, to extract the data I'll do this:</p>
<pre class="literal-block">
import pandas
from datetime import datetime

def load_month(m):
    # Wikipedia keeps a separate page per month; grab every table on it.
    results = pandas.read_html('https://en.wikipedia.org/wiki/List_of_terrorist_incidents_in_' + m + '_2017')
    df = results[0]
    # The first row of the table holds the column names; promote it to the header.
    df.columns = df[0:1].values[0]
    return df[1:].copy()

data = []
# strftime('%B') yields the English month name; the year chosen here is irrelevant.
for month in [datetime(2017, i, 1).strftime('%B') for i in range(1, 13)]:
    data.append(load_month(month))
data = pandas.concat(data)
</pre>
<p>The function <code>read_html</code> is doing all the heavy lifting here.</p>
<p>The result of this is a dataframe listing a location, a perpetrator, a number of deaths/injuries, and a few more columns. It's not super clean, but at least it's pretty structured.</p>
<p>This <code>read_html</code> function is awesome because I needed to do literally no work parsing.</p>
</div>
<div class="section" id="cool-python-trick-2-the-wikipedia-module">
<h2>Cool python trick #2: the <code>wikipedia</code> module</h2>
<p>In this data, there were 230 separate perpetrators listed <em>after</em> cleaning up some of the obvious data issues (e.g. some rows containing <tt class="docutils literal">&quot;Al Shabaab&quot;</tt> and others containing <tt class="docutils literal"><span class="pre">&quot;Al-Shabaab&quot;</span></tt>). That's far too much for me to manually classify everything.</p>
<p>So instead I used the <code>wikipedia</code> module:</p>
<pre class="literal-block">
import wikipedia

def get_summary(x):
    if x == 'Unknown':
        return None
    try:
        return wikipedia.page(x).summary
    except Exception:
        return None

# count is the per-perpetrator dataframe built from the cleaned data above.
count['perp_summary'] = count['perpetrator_cleaned'].apply(get_summary)
</pre>
<p>This gets me a summary of each terrorist group, assuming wikipedia can easily find it. For example, here's the result of <code>get_summary('Al-shabaab')</code>:</p>
<blockquote>
'Harakat al-Shabaab al-Mujahideen, more commonly known as al-Shabaab (lit. '&quot;The Youth&quot; or &quot;The Youngsters&quot;, but can be translated as &quot;The Guys&quot;'), is a jihadist fundamentalist group based in East Africa. In 2012, it pledged allegiance to the militant Islamist organization Al-Qaeda.[...a bunch more...]</blockquote>
<p>With a little bit of string matching (e.g. if the summary contains &quot;Communist&quot; or &quot;Marxist&quot;, classify as &quot;Communist&quot;), I was able to classify assorted terrorist attacks into a few broad causes:</p>
<pre class="literal-block">
cause                       dead
Islam                       8170.0
Central African Republic    432.0
Communism                   310.0
Myanmar                     105.0
Congo                       85.0
Anarchy                     3.0
Far-right                   3.0
Far-left                    1.0
</pre>
<p>Some of these are broad catch-all terms that simply reflect my ignorance. For example, one group is called <a class="reference external" href="https://en.wikipedia.org/wiki/Anti-balaka">Anti-Balaka</a>, which wikipedia explains is a predominantly Christian rebel group: &quot;anti-balakas are therefore the bearers of grigris meant to stop Kalashnikov bullets&quot;. I lumped a bunch of similar groups into &quot;Central African Republic&quot;, and did likewise for &quot;Congo&quot; and &quot;Myanmar&quot;.</p>
<p>This classification scheme let me classify 92% of the 9933 deaths due to terrorism. Note that Islam alone accounted for at least 82%, and eyeballing the groups I didn't match, the true share is probably higher.</p>
<p>There were also a number of attacks that I found very hard to classify, e.g. <a class="reference external" href="https://en.wikipedia.org/wiki/Barisan_Revolusi_Nasional">Patani independence</a> or <a class="reference external" href="https://buzznigeria.com/fulani-herdsmen-attack/">Fulani pastoralism</a>. Key summary of the Fulani Pastoralism conflict: the Fulani people of Nigeria are mostly nomadic cow herders and they are getting into violent land disputes with non-Fulani farmers who don't want Fulani cows eating/trampling their crops. The world is a big place and it's full of all sorts of bad shit most folks have never heard of.</p>
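<p>The string matching that produced the table above can be sketched roughly like this (the keyword lists here are illustrative guesses, not the exact rules from my notebook):</p>

```python
# Illustrative keyword -> cause rules; the real lists were hand-tuned against
# the actual wikipedia summaries.
RULES = [
    ("Communism", ["Communist", "Marxist", "Maoist", "Naxalite"]),
    ("Islam", ["jihadist", "Islamist", "Islamic State"]),
    ("Far-right", ["white supremacist", "neo-Nazi"]),
]

def classify(summary):
    """Map a wikipedia summary to a broad cause via substring matching."""
    if summary is None:
        return "Unknown"
    for cause, keywords in RULES:
        if any(keyword in summary for keyword in keywords):
            return cause
    return "Unclassified"
```

<p>Note that rule order matters: a summary mentioning both &quot;Marxist&quot; and &quot;Islamist&quot; is assigned whichever cause appears first in the list.</p>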
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>It looks like Donald Trump is right and AOC is wrong. Even if we take high end estimates of the number of people killed by white supremacists in 2017 (34 in the US according to the SPLC), it seems like a small problem compared to things like Anti-Balaka, Communism or <a class="reference external" href="https://en.wikipedia.org/wiki/Insurgency_in_Balochistan">Balochistan independence</a>.</p>
<p>There are many individual terrorist groups that I imagine most readers have never heard of, such as Indian Naxalites (communists), which kill far more people than white supremacists.</p>
<p>Also, far more importantly for most of my readers, you can easily extract data from Wikipedia into a dataframe using <code>pandas.read_html</code> and the <code>wikipedia</code> module.</p>
</div>
<div class="section" id="methodology">
<h2>Methodology</h2>
<p>You can find my python notebook <a class="reference external" href="/blog_media/2019/python_and_terrorism/Untitled1.ipynb">here</a>.</p>
<p><strong>Correction:</strong> A previous version of this post described an &quot;Independent Nasserite Movement (a Socialist pan-Arab nationalist movement)&quot;, which was a reference to <a class="reference external" href="https://en.wikipedia.org/wiki/Al-Mourabitoun">Al Mourabitoun</a>. However that might have been me getting confused by wikipedia results - I think the actual attack in 2017 was done by a <a class="reference external" href="https://en.wikipedia.org/wiki/Al-Mourabitoun_(militant_group)">different Al Mourabitoun</a> which is just ordinary boring Islamist violence. So we probably need to add another 77 or so to the Islam row.</p>
<p><strong>Also</strong>, at least one commenter noted that the SPLC counts 34 dead due to white nationalists, which is higher than I get from Wikipedia. I don't particularly trust the SPLC, but I do reference it above. It still doesn't really change the results. Fulani Pastoralism killed more people.</p>
</div>

              ]]>
            </content>
        </entry>
        <entry>
            <title type="html">Barometric Pressure, a statistical myth: Why one common argument against IQ is flawed</title>
            <author><name>Chris Stucchio</name></author>
            <link href="https://www.chrisstucchio.com/blog/2019/p_a_statistical_myth.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss"/>
            <updated>2019-01-21T08:30:00+01:00</updated>
            <published>2019-01-21T08:30:00+01:00</published>
            <id>https://www.chrisstucchio.com/blog/2019/p_a_statistical_myth.html</id>

            <content type="html">
              <![CDATA[
                (If mathematical formulas are not showing up here, <a href="https://www.chrisstucchio.com/blog/2019/p_a_statistical_myth.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss">please read this on the web</a>.)
                <p>Anyone following Nassim Taleb's dissembling on IQ lately has likely stumbled across an argument, originally created by <a class="reference external" href="http://bactra.org/weblog/523.html">Cosma Shalizi</a>, which purports to show that the psychometric <em>g</em> is a statistical myth. But I realized that based on this argument, not only is psychometrics a deeply flawed science, but so is thermodynamics!</p>
<p>Let us examine <a class="reference external" href="https://en.wikipedia.org/wiki/Pressure">pressure</a>.</p>
<p>In particular, we will study a particular experiment in mechanical engineering. Consider a steel vessel impervious to air. This steel vessel has one or more pistons attached, each piston of a different area. For those having difficulty visualizing, a piston works like this:</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2019/p_a_statistical_myth/piston.gif" />
<p>The pipe at the top of the diagram is connected to the steel vessel full of gas, and the blue is a visualization of the gas expanding into the piston. The force can be determined by measuring the compression of the spring (red in the diagram) - more compression means more force.</p>
<p>If we measure the force on the different pistons, we might make a curious observation - the force on each piston is equal to a constant P times the area of the piston. If we make a graph of these measurements, it looks something like this:</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2019/p_a_statistical_myth/area_vs_force.png" />
<p>We can repeat this experiment for different steel vessels, containing different quantities of gas, or perhaps using the same one and increasing the temperature. If we do so (and I did so in freshman physics class), we will discover that for each vessel we can make a similar graph. However, the graph of each vessel will have a different slope.</p>
<p>We can call the slope of these lines P, the pressure, which has units of force divided by area (newtons/meter^2).</p>
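<p>Recovering P from the measurements is nothing more exotic than a least-squares slope. A toy version with synthetic, noiseless data (the pressure value is assumed for illustration):</p>

```python
# Synthetic force-vs-area measurements for a single vessel, F = P * A.
P_true = 101325.0                      # Pa -- roughly atmospheric, assumed
areas = [0.01, 0.02, 0.05, 0.10]       # piston areas in m^2
forces = [P_true * a for a in areas]   # spring-compression readings in N

# Least-squares slope of a line through the origin: P = sum(F*A) / sum(A^2).
P_est = sum(f * a for f, a in zip(forces, areas)) / sum(a * a for a in areas)
```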
<p>To summarize, the case for P rests on a statistical technique, making a plot of force vs area and finding the slope of the line, which works solely on correlations between measurements. This technique can't tell us where the correlations came from; it always says that there is a general factor whenever there are only positive correlations. The appearance of P is a trivial reflection of that correlation structure. A clear example, known since 1871, shows that making a plot of force vs area and finding the slope of the line can give the appearance of a general factor when there are actually more than <span class="math">\(10^{23}\)</span> completely independent and equally strong causes at work.</p>
<p>These purely methodological points don't, themselves, give reason to doubt the reality and importance of pressure, but do show that a certain line of argument is invalid and some supposed evidence is irrelevant. Since that's about the only case which anyone does advance for P, however, it is very hard for me to find any reason to believe in the importance of P, and many to reject it. These are all pretty elementary points, and the persistence of the debates, and in particular the fossilized invocation of ancient statistical methods, is really pretty damn depressing.</p>
<div class="section" id="particles-explain-pressure">
<h2>Particles explain pressure</h2>
<p>If I take an arbitrary set of particles obeying Newtonian mechanics, and choose a sufficiently large number of them, then the apparent factor &quot;Pressure&quot; will typically explain the behavior of pistons. To support that statement, I want to show you some evidence from what happens with random, artificial patterns of particles, where we know where the data came from (my copy of Landau-Lifshitz). So that you don't have to just take my word for this, I describe my procedure and link to <a class="reference external" href="https://amzn.to/2Fy2PVE">a textbook on statistical mechanics</a> where you can explore these arguments in detail.</p>
<p>Suppose that the gas inside the vessel is not a gas in the continuous sense having some intrinsic quantity pressure, but is actually a collection of a huge number of non-interacting particles obeying Newton's laws of motion. It can be shown that the vast majority of the time, provided the vessel has been at rest for a while, that the distribution of particle velocities is approximately the same in any particular cube of volume. Furthermore, the density of particle positions will be uniformly distributed throughout the vessel.</p>
<p>For simplicity, let us suppose the vessel is a cube of side length L, and that one face of the cube is the piston. Consider now a single particle in the cube, moving with an x-velocity <span class="math">\(v_x\)</span>. This particle makes a round trip between the piston and the opposite face once every <span class="math">\(t=2L/v_x\)</span> units of time, and each time it hits the piston it transfers momentum <span class="math">\(2mv_x\)</span> to it. Thus, on average the force it exerts will be <span class="math">\(F=2mv_x/t=m v_x^2/L\)</span>.</p>
<p>The total force on the piston will be the sum of this quantity over all the particles in the vessel, namely <span class="math">\(F=mN \bar{v}_x^2/L\)</span>. Here <span class="math">\(\bar{v}_x^2\)</span> denotes the mean squared velocity (in the x-direction) of a particle. If we divide this by area, we obtain <span class="math">\(F/A=mN \bar{v}_x^2/L^3 = m \rho \bar{v}_x^2\)</span>. Here <span class="math">\(\rho = N/L^3\)</span> is the density of particles per unit volume. I.e., we have derived that <span class="math">\(P=m \rho \bar{v}_x^2\)</span>!</p>
<p>Thus, we have determined that under these simple assumptions, pressure is nothing fundamental at all! Rather, pressure is merely a property derived from the number density and velocity of the individual atoms comprising the gas.</p>
<p>But - and I can hear people preparing this answer already - doesn't the fact that there are these correlations in forces on pistons mean that there must be a single common factor somewhere? To which question a definite and unambiguous answer can be given: No. It looks like there's one factor, but in reality all the real causes are about equal in importance and completely independent of one another.</p>
<p>(As a tangential note, several folks I've spoken to about this article vaguely recollect that temperature is when the particles in a gas move faster. This is true; increasing temperature makes <span class="math">\(\bar{v}_x^2\)</span> increase. If we note that <span class="math">\(\bar{v}_x^2 = C T\)</span> - ignoring what the constant C is - and multiply our equation for pressure above by the volume of the vessel V, we obtain <span class="math">\(PV = \rho V m C T\)</span>. Since <span class="math">\(\rho V = N\)</span>, this becomes <span class="math">\(PV = N R T\)</span>, with <span class="math">\(R=mC\)</span>. We have re-derived the fundamental gas law from high school chemistry from the kinetic theory of gases.)</p>
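<p>This kinetic picture is easy to sanity-check numerically. A sketch under the same idealized assumptions (non-interacting particles, elastic collisions, Gaussian velocities):</p>

```python
import random

# Count actual piston hits over a time window T and compare the time-averaged
# force against the kinetic-theory value m * N * <v_x^2> / L.
random.seed(0)
m, L, N, T = 1.0, 1.0, 2000, 500.0
vxs = [random.gauss(0.0, 1.0) for _ in range(N)]

impulse = 0.0
for vx in vxs:
    speed = abs(vx)
    hits = int(T / (2 * L / speed))   # one piston hit per round trip 2L/v_x
    impulse += hits * 2 * m * speed   # each hit transfers momentum 2m|v_x|

force_direct = impulse / T
force_kinetic = m * sum(v * v for v in vxs) / L
# The two agree to within a fraction of a percent for large N and T.
```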
</div>
<div class="section" id="doing-without-p">
<h2>Doing without P</h2>
<p>The end result of the self-confirming circle of test construction is a peculiar beast. If we want to understand the mechanisms of how gases in a vessel work, or how we can use them to power a locomotive, I cannot see how this helps at all.</p>
<p>Of course, if P was the only way of accounting for the phenomena observed in physical tests, then, despite all these problems, it would have some claim on us. But of course it isn't. My playing around with Boltzmann's kinetic theory of gases has taken, all told, about a day, and gotten me at least into back-of-the-envelope, Fermi-problem range.</p>
<p>All of this, of course, is completely compatible with P having some ability, when plugged into a linear regression, to predict things like the force on a piston or whether a boiler is likely to explode.  I could even extend my model, allowing the particles in the gas to interact with one another, or allowing them to have shape (such as the cylindrical shape of a nitrogen molecule) and angular momentum which can also contain energy. By that point, however, I'd be doing something so obviously dumb that I'd be accused of unfair parody and arguing against caricatures and straw-men.</p>
</div>
<div class="section" id="shalizi-is-right-but-only-about-a-trivial-philosophical-point">
<h2>Shalizi is right, but only about a trivial philosophical point</h2>
<p>I'll now stop paraphrasing Shalizi's article, and get to the point.</p>
<p>In physics, we call quantities like pressure and temperature <a class="reference external" href="https://en.wikipedia.org/wiki/Mean_field_theory">mean field models</a>, <a class="reference external" href="https://en.wikipedia.org/wiki/Thermodynamic_limit">thermodynamic limits</a>, and similar things. A large amount of the work in theoretical physics consists of deriving simple macroscale equations such as thermodynamics from microscopic fundamentals such as Newton's laws of motion.</p>
<p>The argument made by Shalizi (and repeated by Taleb) is fundamentally the following. If a macroscopic quantity (like pressure) is actually generated by a statistical ensemble of microscopic quantities (like particle momenta), then it is a &quot;statistical myth&quot;. Let's understand what &quot;statistical myth&quot; means.</p>
<p>The most important fact to note is that &quot;statistical myth&quot; does <em>not</em> mean that the quantity cannot be used for practical purposes. The vast majority of mechanical engineers, chemists, meteorologists and others can safely use the theory of pressure without ever worrying about the fact that air is actually made up of individual particles. (One major exception is mechanical engineers doing microfluidics, where the volumes are small enough that individual atoms become important.) If the theory of pressure says that your boiler may explode, your best bet is to move away from it.</p>
<p>Rather, &quot;statistical myth&quot; merely means that the macroscale quantity is not some intrinsic property of the gas but can instead be explained in terms of microscopic quantities. This is important to scientists and others doing fundamental research. Understanding how the macroscale is derived from the microscale is useful in predicting behaviors when the standard micro-to-macro assumptions fail (e.g., in our pressure example above, what happens when N is small).</p>
<p>As this applies to IQ, Shalizi and Taleb are mostly just saying, &quot;the theory of <em>g</em> is wrong because the brain is made out of neurons, and neurons are made of atoms!&quot; The latter claim is absolutely true. A neuron is made out of atoms and its behavior can potentially be understood purely by modeling the individual atoms it's made out of. Similarly, the brain is made out of neurons, and its behavior can potentially be predicted simply by modeling the neurons that comprise it.</p>
<p>It would surprise me greatly if any proponent of psychometrics disagrees.</p>
<p>One important prediction made by Shalizi's argument is that in fact, the psychometric <em>g</em> could very likely be an ensemble of a large number of independent factors; a high IQ person is a person who has lots of these factors and a low IQ person is one with few. Insofar as psychometric <em>g</em> has a genetic basis, it may very well be <a class="reference external" href="https://www.nature.com/articles/ng.3869">highly polygenic</a> (i.e. the result of many independent genetic loci).</p>
<p>However, none of this eliminates the fact that the macroscale exists and the macroscale quantities are highly effective for making macroscale predictions. A high IQ population is more likely to graduate college and less likely to engage in crime. Shalizi's argument proves nothing at all about any of the highly contentious claims about IQ.</p>
</div>
<script type='text/javascript'>if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
    var align = "center",
        indent = "0em",
        linebreak = "false";

    if (false) {
        align = (screen.width < 768) ? "left" : align;
        indent = (screen.width < 768) ? "0em" : indent;
        linebreak = (screen.width < 768) ? 'true' : linebreak;
    }

    var mathjaxscript = document.createElement('script');
    mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
    mathjaxscript.type = 'text/javascript';
    mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';

    var configscript = document.createElement('script');
    configscript.type = 'text/x-mathjax-config';
    configscript[(window.opera ? "innerHTML" : "text")] =
        "MathJax.Hub.Config({" +
        "    config: ['MMLorHTML.js']," +
        "    TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
        "    jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
        "    extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
        "    displayAlign: '"+ align +"'," +
        "    displayIndent: '"+ indent +"'," +
        "    showMathMenu: true," +
        "    messageStyle: 'normal'," +
        "    tex2jax: { " +
        "        inlineMath: [ ['\\\\(','\\\\)'] ], " +
        "        displayMath: [ ['$$','$$'] ]," +
        "        processEscapes: true," +
        "        preview: 'TeX'," +
        "    }, " +
        "    'HTML-CSS': { " +
        "        availableFonts: ['STIX', 'TeX']," +
        "        preferredFont: 'STIX'," +
        "        styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
        "        linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
        "    }, " +
        "}); " +
        "if ('default' !== 'default') {" +
            "MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
            "MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
        "}";

    (document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
    (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
              ]]>
            </content>
        </entry>
        <entry>
            <title type="html">AI Ethics, Impossibility Theorems and Tradeoffs</title>
            <author><name>Chris Stucchio</name></author>
            <link href="https://www.chrisstucchio.com/pubs/slides/crunchconf_2018/slides.pdf?utm_medium=rss&amp;utm_source=chris_blog&amp;utm_campaign=merged_rss"/>
            <updated>2018-10-30T09:00:00+01:00</updated>
            <published>2018-10-30T09:00:00+01:00</published>
            <id>https://www.chrisstucchio.com/blog/2018/crunchconf_2018.html</id>

            <content type="html">
              <![CDATA[
                (If mathematical formulas are not showing up here, <a href="https://www.chrisstucchio.com/blog/2018/crunchconf_2018.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss">please read this on the web</a>.)
                <p>I recently gave a talk at <a href="https://www.crunchconf.com/">CrunchConf 2018</a>. The talk was about the various impossibility theorems that anyone concerned with AI ethics must contend with. The <a href="/pubs/slides/crunchconf_2018/slides.pdf">slides are available here</a>. Go check it out.</p>
              ]]>
            </content>
        </entry>
        <entry>
            <title type="html">Why you can't have privacy on the internet</title>
            <author><name>Chris Stucchio</name></author>
            <link href="https://www.chrisstucchio.com/blog/2018/the_price_of_privacy.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss"/>
            <updated>2018-08-22T08:30:00+02:00</updated>
            <published>2018-08-22T08:30:00+02:00</published>
            <id>https://www.chrisstucchio.com/blog/2018/the_price_of_privacy.html</id>

            <content type="html">
              <![CDATA[
                (If mathematical formulas are not showing up here, <a href="https://www.chrisstucchio.com/blog/2018/the_price_of_privacy.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss">please read this on the web</a>.)
                <p>I recently attended a discussion at <a class="reference external" href="https://fifthelephant.in/">Fifth Elephant</a> on privacy. During the panel, one of the panelists asked the audience: &quot;how many of you are concerned about your privacy online, and take steps to protect it?&quot;</p>
<p>At this point, most of the hands in the panel shot up.</p>
<p>After that, I decided to ask the naughty question: &quot;how many of you pay at least 500rs/month for services that give you privacy?&quot;</p>
<p>Very few hands shot up.</p>
<p>Let me emphasize that this was a self selected group, a set of people at a technology conference who were so interested in privacy that they chose to attend a panel discussion on it (instead of concurrent talks on object detection and explainable algorithms). Besides me and perhaps 2 or 3 others, no one was willing to pay for privacy.</p>
<p>Instead of paying for it, many of the people at the panel wanted the government to mandate it. Moreover, many people seemed to think it would somehow be free to provide.</p>
<div class="section" id="people-won-t-pay-for-privacy">
<h2>People won't pay for privacy</h2>
<div class="section" id="online-services-aren-t-free">
<h3>Online Services Aren't Free</h3>
<blockquote>
If you are not paying for it, you're not the customer; you're the product being sold.</blockquote>
<p>Every online service costs money to provide. To get an idea of the metrics, here are some <a class="reference external" href="https://techcrunch.com/2013/08/18/inside-patchs-leaked-revenue-numbers-and-its-hunt-for-profitability/">leaked revenues at a company I worked for</a>. Content isn't free. Engineers aren't free. Ad revenues aren't very high. If the site is storing lots of personal data (e.g. email, pictures/videos, etc), even the cost of computing infrastructure can become significant.</p>
<p>Since most people are unwilling to pay for online services, the way to cover these costs is by advertising to the users.</p>
<p><strong>Ad revenue per user varies by several orders of magnitude depending on how well targeted it is.</strong></p>
<p>Here's a calculation, which was originally done by <a class="reference external" href="https://twitter.com/patio11/status/875629380105416705">Patrick McKenzie</a> to answer the question</p>
<blockquote>
I just bought a refrigerator yesterday. Why, why, why do you show me refrigerator ads?</blockquote>
<ol class="arabic simple">
<li>Assume a typical person buys a refrigerator once every 10 years.</li>
<li>Assume 2% of refrigerator purchases go wrong (e.g. your wife hates it, it breaks), and you need to buy a new refrigerator within a week.</li>
</ol>
<p>Subject to these assumptions, a person who's bought a refrigerator is 10x more likely to buy another refrigerator in the next week than someone who hasn't.</p>
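<p>The arithmetic behind that 10x figure, using the assumptions above:</p>

```python
# Back-of-the-envelope lift calculation (the inputs are the assumptions from
# the text, not measured data).
weeks_per_purchase = 10 * 52              # one refrigerator every 10 years
p_random_person = 1 / weeks_per_purchase  # chance a random person buys this week
p_recent_buyer = 0.02                     # 2% of purchases go wrong within a week

lift = p_recent_buyer / p_random_person   # roughly 10x
```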
<p>The fundamental problem of advertising is sparsity - the fact that most advertisements are worthless to most people. An ad for &quot;faster than FFTW&quot; might be useful to me, but it's pointless for most people who've never heard of FFTW. If you haven't spied on me well enough to know that I do fast fourier transforms, your odds of making money by advertising to me are essentially zero.</p>
<p>Advertising generates negligible revenue without personalization.</p>
<p>Without advertising, people will need to pay for their online services. Email services tend to cost around $5-10/month. The NY Times costs about $10/month, and the Wall St. Journal costs 2-4x that. It's hard to guesstimate the cost of social networks, but my best guesstimates for Facebook is several dollars per user per month.</p>
<p><strong>Will you pay $20-50 a month to replace your free online services with privacy preserving ones?</strong></p>
<p>Another major fact is that service providers use data to improve their service. User tracking enables product managers/UI designers to figure out exactly what customers want, and give it to them. Google cannot index your email and make it searchable without also reading it. <strong>Would you use a free email product with a much worse UI than Gmail?</strong></p>
</div>
<div class="section" id="fraud-is-real-and-pervasive">
<h3>Fraud is real and pervasive</h3>
<p>Consider your payment provider - PayPal, PayTM, Simpl (disclaimer: I work there), etc. One of the most invisible and pervasive concerns at a company like this is preventing fraud.</p>
<p>The economics of a payment provider are as follows:</p>
<ol class="arabic simple">
<li>A customer books a 100rs movie ticket on BookMyShow.</li>
<li>The customer pays 100rs to the payment provider.</li>
<li>The payment provider transfers 97-99.5rs to BookMyShow and pays for their expenses with the remaining 0.5-3rs.</li>
</ol>
<p>That's a pretty tight margin. For concreteness and simplicity of exposition, let's suppose the <a class="reference external" href="https://www.investopedia.com/terms/m/merchant-discount-rate.asp">Merchant Discount Rate</a> is 1%.</p>
<p>Now let's consider the impact of fraud. If fraud levels ever get as high as 1 transaction out of every 100, the payment provider will have zero revenue and will go broke. If fraud is not carefully controlled, it can reach levels far higher than this.</p>
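<p>To make the break-even point concrete, here's the arithmetic at the assumed 1% Merchant Discount Rate:</p>

```python
# Payment provider economics at a 1% MDR (illustrative numbers from the text).
mdr = 0.01
ticket_price = 100.0  # rs
transactions = 100

revenue = transactions * ticket_price * mdr             # 1 rs earned per ticket
fraud_rate = 1 / 100                                    # 1 bad transaction in 100
chargebacks = transactions * fraud_rate * ticket_price  # provider eats the ticket

profit = revenue - chargebacks  # fraud at 1-in-100 wipes out all the revenue
```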
<blockquote>
In mid-2000, we had survived the dot-com crash and we were growing fast, but we faced one huge problem: we were losing upwards of $10 million to credit card fraud every month.</blockquote>
<p>-- <a class="reference external" href="https://amzn.to/2vCKssB">Peter Thiel, Zero to One</a></p>
<p>Peter Thiel notes that reducing fraud was the difference between loss and profitability.</p>
<p>In the long run, the cost of fraud must be passed on to the consumer. Either the payment provider or the merchant will eat the cost of fraud, and will in turn raise prices on consumers to compensate.</p>
<p><strong>Will you pay 120rs for a 100rs movie ticket in order to protect your privacy from your payment provider?</strong> It's important to note that while the extra 20rs may seem to go to the payment network, in reality it will go to the smartest scammers.</p>
<p>There is plenty of fraud that occurs beyond payment networks. On Uber, there are fake drivers that take fake passengers on trips and demand to be paid even though the fake passengers have paid with stolen credit cards. Many fraud rings attempt to misuse incentive systems (e.g. &quot;refer a friend, get 100rs off your next order&quot;) in order to generate credits with which they can purchase saleable goods. A merchant aggregator is also at risk from its submerchants; movie theatres/restaurants/etc will attempt to exploit BookMyShow/Seamless/etc. In general, submerchants will attempt to make fraudulent transactions on the aggregator and demand payment for them.</p>
<p>A special case of fraud which also relates to the problem of paying for services with advertising is display network fraud. Here's how it works. I run &quot;My Cool Awesome Website About Celebrities&quot;, and engage in all the trappings of a legitimate website - creating content, hiring editors, etc. Then I pay some kids in Ukraine to build bots that browse the site and click the ads. Instant money, at the expense of the advertisers. To prevent this, the ad network demands the ability to spy on users in order to distinguish between bots and humans.</p>
</div>
</div>
<div class="section" id="even-if-you-were-willing-to-pay-for-it-privacy-is-illegal">
<h2>Even if you were willing to pay for it, privacy is illegal</h2>
<p><strong>Question</strong>: What does the government call a payment platform that provides privacy to its users?</p>
<p><strong>Answer</strong>: Money laundering.</p>
<p>Here in India, the bulk of the privacy intrusions I run into are coming from the government. It is government regulations which require me to submit passport photocopies/personal references/etc in order to get a SIM card. Tracking my wifi use by connecting my Starbucks WiFi to a phone number via OTP is another government regulation. Prohibitions against the use of encryption are generally pushed by national governments. Things were pretty similar in the US.</p>
<p>It is, of course, impossible for a service provider to satisfy the government's desire to spy on users without doing so itself.</p>
<p>The desire for the government to spy on users extends far beyond preventing money laundering. In the United States, Congress has demanded information and action from technology companies in order to prevent Russians from posting Pepe memes on Twitter or attempting to organize &quot;Blacktivism&quot; on Facebook. The Kingdom bans most encrypted communication, and many democratic nations (the US, India, UK, France) have politicians pushing in the same direction.</p>
<p>In the intermediary stages, there is a large amount of information that the government requires service providers to keep. This typically includes accounting details (required by tax departments), both purchase history as well as KYC information used by tax authorities to track down tax evaders (e.g., Amazon is required to keep and provide to the IRS tax related information about vendors using Amazon as a platform).</p>
<p>In many cases, censorship authorities require social networks and others to track and notify them about people posting illegal content (Nazi imagery, child pornography, Savita Bhabhi, anti-Islamic content).</p>
<p>Fundamentally, it is government regulations that <a class="reference external" href="https://qz.com/india/1322393/rbis-bitcoin-ban-how-indian-cryptocurrency-exchanges-are-trying-to-survive/">shut down cryptocurrency exchanges in India</a>. It is government regulations that <a class="reference external" href="http://gulfbusiness.com/saudi-lifts-skype-whatsapp-ban-will-censor-calls/">ban encrypted communication in the Kingdom</a> (at least partially), and it was politicians in <a class="reference external" href="https://en.wikipedia.org/wiki/Crypto_war">the US and UK</a> and <a class="reference external" href="https://scroll.in/article/810568/meet-the-man-whos-addicted-to-whatsapp-but-moved-the-supreme-court-to-have-it-banned">India</a> who want to move in the same direction.</p>
<p>Insofar as privacy preserving platforms might exist, it is far from clear whether governments will allow them to continue existing should they become popular.</p>
</div>
<div class="section" id="the-privacy-preserving-service-has-approximately-three-crypto-geeks-and-seven-zillion-ukrainians-hackers-laundering-money-on-it">
<h2>The privacy preserving service has approximately three crypto geeks and seven zillion Ukrainian hackers laundering money on it</h2>
<blockquote>
. . .if you're against witch-hunts, and you promise to found your own little utopian community where witch-hunts will never happen, your new society will end up consisting of approximately three principled civil libertarians and seven zillion witches. It will be a terrible place to live even if witch-hunts are genuinely wrong.</blockquote>
<p>-- <a class="reference external" href="http://slatestarcodex.com/2017/05/01/neutral-vs-conservative-the-eternal-struggle/">Scott Alexander</a></p>
<p>Unfortunately, this Scott Alexander quote explains very nicely what will happen when someone builds a moderately successful privacy preserving network.</p>
<p>If we built a privacy preserving payment network, it would be used for money laundering, <a class="reference external" href="https://en.wikipedia.org/wiki/Silk_Road_(marketplace)">drug sales</a> and <a class="reference external" href="https://www.wsj.com/articles/in-the-bitcoin-era-ransomware-attacks-surge-1471616632">ransomware</a>. If the <a class="reference external" href="https://brave.com/publishers/">Brave private browser/micropayment system</a> ever approaches viability, it will be overrun by criminals laundering money through blogs about Ukrainian food.</p>
<p>If an ad network vowed to protect privacy, fraud would shoot up and good advertisers would leave. The few remaining customers would be selling penis enlargement pills, accepting the click fraud as the cost of doing business because no one else will work with them.</p>
<p>There are privacy preserving/censorship resistant <a class="reference external" href="https://voat.co/">social</a>  <a class="reference external" href="https://gab.ai/">networks</a>. They're full of Nazis.</p>
<p>This is a fundamental collective action problem, and no player in the game seems to have the ability to change things. There are bad actors out there - fraudsters/scammers, terrorists laundering money, legal gun manufacturers moving money around, child pornographers, people who believe in evolution (even <a class="reference external" href="https://hbdchick.wordpress.com/category/what-is-human-biodiversity/">among humans</a>), people <a class="reference external" href="https://exmuslims.org/">advocating abandoning Islam</a>, Russians posting Pepe memes, and journalists/revenge pornographers revealing truthful information that people want kept hidden. Any privacy preserving network, at its core, allows these people to engage in these actions without interference.</p>
<p>And as any network approaches viability, its early adopters will be these sorts of undesirables.</p>
<p>Make no mistake; I want this privacy preserving network to exist. I have no problem with teaching evolution and exploring its consequences, advocating atheism over Islam, laundering drug money, or teaching people how to manufacture firearms. But I'm very much in a minority on this.</p>
<p>And if, like me, you want this privacy preserving network, the first step in making that happen is recognizing and acknowledging the very real barriers to making it happen.</p>
</div>

              ]]>
            </content>
        </entry>
        <entry>
            <title type="html">Bayesian Linear Regression and Generalized Linear Models</title>
            <author><name>Chris Stucchio</name></author>
            <link href="/pubs/slides/fifth_elephant_2018/slides2.html?utm_medium=rss&amp;utm_source=chris_blog&amp;utm_campaign=merged_rss"/>
            <updated>2018-07-30T09:00:00+02:00</updated>
            <published>2018-07-30T09:00:00+02:00</published>
            <id>https://www.chrisstucchio.com/blog/2018/fifth_elephant_2018_talk.html</id>

            <content type="html">
              <![CDATA[
                (If mathematical formulas are not showing up here, <a href="https://www.chrisstucchio.com/blog/2018/fifth_elephant_2018_talk.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss">please read this on the web</a>.)
                <p>I recently gave a talk at the <a href="https://fifthelephant.in/2018/">Fifth Elephant 2018</a>. The talk was an introduction to linear regression and generalized linear models from the Bayesian perspective. The <a href="/pubs/slides/fifth_elephant_2018/slides2.html">slides are available here</a>. Go check it out.</p>
              ]]>
            </content>
        </entry>
        <entry>
            <title type="html">Deterministic A/B tests via the hashing trick</title>
            <author><name>Chris Stucchio</name></author>
            <link href="https://www.chrisstucchio.com/blog/2018/deterministic_ab_test.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss"/>
            <updated>2018-03-20T08:30:00+01:00</updated>
            <published>2018-03-20T08:30:00+01:00</published>
            <id>https://www.chrisstucchio.com/blog/2018/deterministic_ab_test.html</id>

            <content type="html">
              <![CDATA[
                (If mathematical formulas are not showing up here, <a href="https://www.chrisstucchio.com/blog/2018/deterministic_ab_test.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss">please read this on the web</a>.)
                <p>In principle A/B testing is really simple. To do it you need to define two separate user experiences, and then randomly allocate users between them:</p>
<pre class="literal-block">
def final_experience(user):
    if random.choice([0,1]) == 0:
        return user_experience_A(user)
    else:
        return user_experience_B(user)
</pre>
<p>So far this seems pretty simple. But then you think about edge cases:</p>
<ul class="simple">
<li>Shouldn't the same user get the same experience if they do this twice?</li>
<li>After the test is complete, how can I compare groups A and B?</li>
</ul>
<p>It's not hard to track this data, but it certainly makes your code a bit uglier:</p>
<pre class="literal-block">
def final_experience(user):
    user_variation = db.run_query(&quot;SELECT user_variation FROM users WHERE user_id = ?&quot;, user.id)
    if user_variation == 0: # If the user already saw a variation, show them the same one
        return user_experience_A(user)
    elif user_variation == 1:
        return user_experience_B(user)
    else: #No record in the DB
        user_variation = random.choice([0,1])
        db.run_query(&quot;INSERT INTO user_variation (user_id, variation) VALUES (?, ?)&quot;, user.id, user_variation)
        if user_variation == 0:
            return user_experience_A(user)
        else:
            return user_experience_B(user)
</pre>
<p>This is doable, but the code is a lot longer and more annoying. Are there race conditions? Should everything live in a single transaction, potentially skewing things?</p>
<p>Fortunately there's a better way: the hashing trick:</p>
<pre class="literal-block">
def deterministic_random_choice(user_id, test_name, num_variations):
    &quot;&quot;&quot;Returns a 'random'-ish number between 0 (inclusive) and
       num_variations (exclusive), based on the user_id and the test name.

       The number will not change if the user_id and test name
       remain the same.
       &quot;&quot;&quot;
    return (hash(user_id + test_name) % num_variations)

def final_experience(user):
    if deterministic_random_choice(user.id, &quot;experience_test&quot;, 2) == 0:
        return user_experience_A(user)
    else:
        return user_experience_B(user)
</pre>
<p>Using deterministic_random_choice instead of random.choice will ensure that the same user is always assigned to the same variation. This is done without any database access.</p>
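<p>One caveat worth knowing (not discussed in the original post): since Python 3.3, the built-in hash() salts strings per process, so hash(user_id + test_name) will generally change between interpreter runs, which defeats the trick. A minimal sketch of a cross-process-stable variant, using an explicit hashlib digest instead of the built-in hash:</p>

```python
import hashlib

def deterministic_random_choice(user_id, test_name, num_variations):
    """Stable variant of the hashing trick: md5 is a fixed function,
    unlike the per-process-salted built-in hash() on strings."""
    digest = hashlib.md5((str(user_id) + test_name).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_variations

print(deterministic_random_choice("user_42", "experience_test", 2))
```

<p>md5 is fine here because this is bucketing, not security; any fixed digest function would do.</p>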
<p>It also makes it very easy to run analytics and compare the two groups, even though we never stored group membership in any database table:</p>
<pre class="literal-block">
SELECT SUM(user.revenue), COUNT(user.id), deterministic_random_choice(user.id, &quot;experience_test&quot;, 2)
        FROM users
        WHERE user.signup_date &gt; test_start_date
      GROUP BY deterministic_random_choice(user.id, &quot;experience_test&quot;, 2)
</pre>
<p>(This is not SQL that any real DB will actually run, but it's illustrative.)</p>
<p>Whatever you currently do for analytics, you can take the exact same queries and either GROUP BY the deterministic_random_choice or else run the query once for each variation and put deterministic_random_choice(user.id, &quot;experience_test&quot;, 2) = 0,1 into the WHERE clause.</p>
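<p>As an illustration, here is the same breakdown done in pandas rather than SQL, against a made-up users table and an md5-based stand-in for deterministic_random_choice (both assumed for the example):</p>

```python
import hashlib
import pandas as pd

def bucket(user_id, test_name="experience_test", num_variations=2):
    # md5-based stand-in for deterministic_random_choice: stable across
    # processes, so it can be recomputed at analysis time.
    digest = hashlib.md5((str(user_id) + test_name).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_variations

# Hypothetical users table; in real life this would come from the DB.
users = pd.DataFrame({"user_id": [1, 2, 3, 4, 5, 6],
                      "revenue": [10.0, 0.0, 5.0, 2.5, 0.0, 7.0]})
users["variation"] = users["user_id"].map(bucket)
summary = users.groupby("variation")["revenue"].agg(["sum", "count"])
print(summary)
```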
<p>It's just a nice simple trick that makes it easy to start A/B testing today. No database migration in sight!</p>
<p>This post was first published on the <a class="reference external" href="https://medium.com/simpl-under-the-hood/deterministic-a-b-tests-via-the-hashing-trick-d1ea49483202">Simpl company blog</a>.</p>

              ]]>
            </content>
        </entry>
        <entry>
            <title type="html">Low Rate Loans for Ladies, Stags Pay Extra - The Role of Ethics in AI/ML</title>
            <author><name>Chris Stucchio</name></author>
            <link href="https://chrisstucchio.com/pubs/slides/loans_for_ladies.pdf?utm_medium=rss&amp;utm_source=chris_blog&amp;utm_campaign=merged_rss"/>
            <updated>2018-02-09T12:00:00+01:00</updated>
            <published>2018-02-09T12:00:00+01:00</published>
            <id>https://www.chrisstucchio.com/blog/2018/50p_fairness_talk.html</id>

            <content type="html">
              <![CDATA[
                (If mathematical formulas are not showing up here, <a href="https://www.chrisstucchio.com/blog/2018/50p_fairness_talk.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss">please read this on the web</a>.)
                <p>Today I spoke about AI ethics at <a href="https://50p.in/2018/">50p 2018</a>. <a href="/pubs/slides/loans_for_ladies.pdf">Here are the slides</a> from my talk.</p>
<p>The general topic was multiple ethical principles, and how it's mathematically impossible to satisfy all of them.</p>
              ]]>
            </content>
        </entry>
        <entry>
            <title type="html">Why Indian companies should take on different projects than competing Valley companies - an application of Cobb-Douglas</title>
            <author><name>Chris Stucchio</name></author>
            <link href="https://www.chrisstucchio.com/blog/2017/cobbs_douglas.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss"/>
            <updated>2017-11-07T08:30:00+01:00</updated>
            <published>2017-11-07T08:30:00+01:00</published>
            <id>https://www.chrisstucchio.com/blog/2017/cobbs_douglas.html</id>

            <content type="html">
              <![CDATA[
                (If mathematical formulas are not showing up here, <a href="https://www.chrisstucchio.com/blog/2017/cobbs_douglas.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss">please read this on the web</a>.)
                <p>Some time back, I was involved in a discussion with folks at an India-based software company. An important question was asked - why isn't this company as productive (defined as revenue/employee) as its western competitors, and what can be done to change this situation? In this discussion, I put forward an unexpected thesis: if this company were profit maximizing, then its productivity should <em>always</em> be far lower than any western company. During the ensuing conversation, I came to realize that very few people were aware of the <a class="reference external" href="https://en.wikipedia.org/wiki/Cobb%E2%80%93Douglas_production_function">Cobb-Douglas model of production</a>, on which I was basing my counterintuitive conclusions.</p>
<p>I've observed that the topic of Cobb-Douglas has come up quite a few times, and several folks have asked me to write up a description of it. Hence, this blog post. In my opinion, Cobb-Douglas is a very useful model to have in one's cognitive toolbox.</p>
<div class="section" id="setting-up-the-problem">
<h2>Setting up the problem</h2>
<p>To lay out the basics of the problem, consider two competing companies - Bangalore Pvt. Ltd. and Cupertino Inc. For concreteness, let us say that these two companies are both software companies catering to the global market and they are direct competitors.</p>
<p>The question now arises; how should Bangalore and Cupertino allocate their capital?</p>
<p>For a software company, there are two primary uses toward which capital can be directed:</p>
<ul class="simple">
<li>Marketing. Both Bangalore and Cupertino can direct an extra $1 of spending towards adwords, facebook ads, attendance at conferences, and similar things. Both companies will receive the same amount of <em>exposure</em> on their marginal dollar.</li>
<li>Employees. Bangalore and Cupertino can both spend money on employees, but in this case they receive <em>different</em> returns on investment. In Bangalore, a typical employee might cost 100,000 rupees/month, whereas in Cupertino an employee might cost $100,000/year. This is approximately a 5x cost difference if we round up 1 lac rupees/month to $20,000/year.</li>
</ul>
<p>Let us now model what the effect of each resource is on revenue.</p>
<p>It's a simple arithmetic identity that revenue <span class="math">\(R\)</span> is equal to:</p>
<div class="math">
\begin{equation*}
R = L \cdot M
\end{equation*}
</div>
<p>The value</p>
<div class="math">
\begin{equation*}
L = L(\textrm{features, marketing strategy, etc})
\end{equation*}
</div>
<p>is the probability of any individual prospect making a purchase multiplied by the value of that purchase, and</p>
<div class="math">
\begin{equation*}
M = M(\textrm{marketing spend})
\end{equation*}
</div>
<p>is the number of prospects who can be reached by marketing as a function of money spent on it.</p>
<p>We choose this decomposition because it helps us understand the impact of two separate resources:</p>
<ul class="simple">
<li>The value <span class="math">\(L\)</span> is mainly increased by spending money on additional <em>labor</em>. Engineers can build features, which increase value for customers and allow the product to be sold for more money. Marketers may improve the brand value, increasing the probability of a sale.</li>
<li>The value <span class="math">\(M\)</span> is increased by spending money on additional <em>marketing</em>. It's a simple machine - money is spent on facebook ads, conferences, TV commercials, and more people become exposed to the product.</li>
</ul>
<p>We also choose this decomposition since it helps us avoid the <a class="reference external" href="https://en.wikipedia.org/wiki/Cambridge_capital_controversy">Cambridge Controversy</a>, which can under other circumstances make the model less well founded.</p>
<div class="section" id="diminishing-returns-to-labor-and-marketing">
<h3>Diminishing returns to labor and marketing</h3>
<p>To understand the relationship between resources and production, let us take the following exercise. Suppose we have a large set of projects, each with a certain cost and benefit. To begin with, lets discuss labor projects:</p>
<ul class="simple">
<li>Integrate the software with Salesforce, cost 100 hours, benefit $50/prospect.</li>
<li>Come up with a more enterprisey-sounding brand, cost 40 hours, benefit $10/prospect.</li>
<li>Slap some AI on top of the software, cost 2000 hours, benefit $60/prospect.</li>
<li>etc...</li>
</ul>
<p>Fundamentally, I'm making two important assumptions here:</p>
<ol class="arabic simple">
<li>The projects have no interdependencies.</li>
<li>The amount of labor required for each project is small compared to the overall amount of labor.</li>
</ol>
<p>Let us assume the corporate strategy is to spend whatever amount of labor we have on this collection of projects in order of decreasing ROI. This means that if we sort the list of projects by ROI = benefit / cost, then the corporate strategy will be to take on the highest ROI projects first.</p>
<p>Here's a typical result. As noted above, the units on the y-axis are dollars per prospect.</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2017/cobbs_douglas/diminishing_returns.png" />
<p>Note that the <cite>xkcd</cite> plotting style is used <a class="reference external" href="https://www.chrisstucchio.com/blog/2014/why_xkcd_style_graphs_are_important.html">to illustrate this is a schematic drawing</a>, and should not be taken too literally.</p>
<p>The graph was made as follows:</p>
<pre class="code python literal-block">
<span class="kn">import</span> <span class="nn">pandas</span>
<span class="kn">from</span> <span class="nn">numpy</span> <span class="kn">import</span> <span class="n">cumsum</span>
<span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">uniform</span>
<span class="kn">from</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">import</span> <span class="n">step</span>

<span class="n">data</span> <span class="o">=</span> <span class="n">pandas</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span> <span class="s1">'cost'</span> <span class="p">:</span> <span class="n">uniform</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">100</span><span class="p">)</span><span class="o">.</span><span class="n">rvs</span><span class="p">(</span><span class="mi">50</span><span class="p">),</span> <span class="s1">'benefit'</span> <span class="p">:</span> <span class="n">uniform</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">100</span><span class="p">)</span><span class="o">.</span><span class="n">rvs</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span> <span class="p">})</span>
<span class="n">data</span><span class="p">[</span><span class="s1">'roi'</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s1">'benefit'</span><span class="p">]</span> <span class="o">/</span> <span class="n">data</span><span class="p">[</span><span class="s1">'cost'</span><span class="p">]</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s1">'roi'</span><span class="p">],</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>

<span class="n">step</span><span class="p">(</span><span class="n">cumsum</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s1">'cost'</span><span class="p">]),</span> <span class="n">cumsum</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s1">'benefit'</span><span class="p">]))</span> <span class="c1">#Like `plot(...)`, except that it shows steps at each data point.</span>
</pre>
<p>As can be seen, no particular correlation between cost and benefit was assumed in order to get diminishing returns. Diminishing returns follows solely from the sorting operation, i.e. the choice to take on the highest ROI projects first.</p>
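<p>This claim is easy to verify directly. The following self-contained snippet (synthetic data with a fixed seed, not from the original analysis) checks that once projects are ordered by ROI, the marginal return per unit of cost never increases - which is exactly what produces the concave cumulative-benefit curve:</p>

```python
# Greedy-by-ROI ordering alone produces diminishing returns:
# the marginal benefit/cost ratio is non-increasing along the curve.
import numpy as np

rng = np.random.default_rng(0)
cost = rng.uniform(10, 100, size=50)     # project costs (hours)
benefit = rng.uniform(1, 100, size=50)   # project benefits ($/prospect)
order = np.argsort(-(benefit / cost))    # highest-ROI projects first
roi_sorted = (benefit / cost)[order]
print(bool(np.all(np.diff(roi_sorted) <= 0)))  # True
```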
<p>One can similarly construct a diminishing returns curve on marketing spend. Note also that on the marketing side, many marketing channels (for example <a class="reference external" href="http://blog.ezliu.com/how-to-bid-for-cpc-campaigns-aka-stop-doing-adwords-wrong/">adwords</a>) have their own diminishing returns curves built in. However, there's one very important distinction between labor and marketing. For the labor graph the X-axis is <em>hours of labor</em>, while for marketing the X-axis is <em>amount of money spent</em>.</p>
</div>
</div>
<div class="section" id="the-cobb-douglas-model">
<h2>The Cobb-Douglas Model</h2>
<p>After observing the diminishing returns curve above, I thought it looked kind of like <span class="math">\(t^{\alpha}\)</span> for some <span class="math">\(0 &lt; \alpha &lt; 1\)</span>. So I decided to do a least squares fit using the model <span class="math">\(y(t) = a t^{\alpha}\)</span>. This can be accomplished in a fairly straightforward way using the <a class="reference external" href="https://docs.scipy.org/doc/scipy/reference/optimize.html#module-scipy.optimize">minimize</a> function from scipy:</p>
<pre class="code python literal-block">
<span class="kn">from</span> <span class="nn">scipy.optimize</span> <span class="kn">import</span> <span class="n">minimize</span>

<span class="k">def</span> <span class="nf">err</span><span class="p">(</span><span class="n">a</span><span class="p">):</span>
    <span class="n">t</span> <span class="o">=</span> <span class="n">cumsum</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s1">'cost'</span><span class="p">])</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">cumsum</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s1">'benefit'</span><span class="p">])</span>
    <span class="k">return</span> <span class="nb">sum</span><span class="p">(</span><span class="nb">pow</span><span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="n">a</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="nb">pow</span><span class="p">(</span><span class="n">t</span><span class="p">,</span><span class="n">a</span><span class="p">[</span><span class="mi">1</span><span class="p">]),</span> <span class="mi">2</span><span class="p">))</span>

<span class="n">x</span> <span class="o">=</span> <span class="n">minimize</span><span class="p">(</span><span class="n">err</span><span class="p">,</span> <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">])</span>
</pre>
<p>The result of this optimization yields <span class="math">\(\alpha = 0.518\)</span>, as well as a reasonably accurate best fit curve:</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2017/cobbs_douglas/best_fit.png" />
<p>This kind of graph shape is not an accident. I repeated this experiment, but this time generating a different data set:</p>
<pre class="code python literal-block">
<span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">expon</span>

<span class="n">data</span> <span class="o">=</span> <span class="n">pandas</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span> <span class="s1">'cost'</span> <span class="p">:</span> <span class="n">expon</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">rvs</span><span class="p">(</span><span class="mi">50</span><span class="p">),</span> <span class="s1">'benefit'</span> <span class="p">:</span> <span class="n">expon</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">rvs</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span> <span class="p">})</span>
</pre>
<p>The result was pretty similar, albeit with <span class="math">\(\alpha = 0.497\)</span>:</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2017/cobbs_douglas/best_fit2.png" />
<p>I suspect that there is some more interesting law of probability which is causing this result to occur, but I'm not entirely sure what.</p>
<p>If we substitute this back into the equation <span class="math">\(R = L \cdot M\)</span>, we arrive at the <a class="reference external" href="https://en.wikipedia.org/wiki/Cobb%E2%80%93Douglas_production_function">Cobb-Douglas model</a>:</p>
<div class="math">
\begin{equation*}
R = A (\textrm{labor hours})^{\alpha} \cdot (\textrm{marketing spend})^{\beta}
\end{equation*}
</div>
<p>In the <a class="reference external" href="https://en.wikipedia.org/wiki/Cobb%E2%80%93Douglas_production_function">Cobb-Douglas model</a>, the term <span class="math">\(A\)</span> represents <a class="reference external" href="https://en.wikipedia.org/wiki/Total_factor_productivity">Total Factor Productivity</a>.</p>
<p><strong>Note:</strong> Normally, the use of the Cobb-Douglas model is somewhat problematic due to the <a class="reference external" href="https://en.wikipedia.org/wiki/Cambridge_capital_controversy">Cambridge Controversy</a> which points out the difficulties in assigning a single value to capital. However in this case capital is literally dollars which can be spent on marketing, so we can avoid the issue.</p>
</div>
<div class="section" id="what-should-the-optimal-firm-do">
<h2>What should the optimal firm do?</h2>
<p>Let us now suppose that both Bangalore Pvt. Ltd. and Cupertino Inc. have a fixed amount of capital <span class="math">\(C\)</span> available for spending in the current period. These firms can convert capital into labor at the rates:</p>
<ul class="simple">
<li>Bangalore Pvt. Ltd.: 1 unit of capital converts to 1 unit of labor.</li>
<li>Cupertino Inc.: 5 units of capital convert to 1 unit of labor.</li>
</ul>
<p>Now let <span class="math">\(m\)</span> represent the fraction of capital spent on marketing. Then we can rewrite our output (in Bangalore) as:</p>
<div class="math">
\begin{equation*}
R = A \left(C(1-m)\right)^{\alpha} (Cm)^{\beta} = A C^{\alpha+\beta} (1-m)^{\alpha} m^{\beta}
\end{equation*}
</div>
<p>Whereas in Cupertino our output is:</p>
<div class="math">
\begin{equation*}
R = A \left(C(1-m)/5\right)^{\alpha} (Cm)^{\beta} = \frac{ A C^{\alpha+\beta}}{5^{\alpha}} (1-m)^{\alpha} m^{\beta}
\end{equation*}
</div>
<p>Note that these outputs differ <em>solely</em> due to the presence of the <span class="math">\(5^{-\alpha}\)</span> sitting in front. The dependence on <span class="math">\(m\)</span> is unchanged. We can maximize this with simple calculus:</p>
<div class="math">
\begin{equation*}
\frac{dR}{dm} = A C^{\alpha+\beta} \left[ -\alpha(1-m)^{\alpha-1} m^{\beta} + \beta (1-m)^{\alpha} m^{\beta-1}\right] = 0
\end{equation*}
</div>
<p>Solving this for <span class="math">\(m\)</span> yields <span class="math">\(m = \beta / (\alpha + \beta)\)</span>.</p>
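<p>A quick numerical sanity check of that optimum, using a grid search over m with illustrative values of alpha and beta (the specific exponents are assumptions for the example, not values from the post):</p>

```python
import numpy as np

alpha, beta = 0.5, 0.4  # illustrative exponents, 0 < alpha, beta < 1
m = np.linspace(1e-4, 1 - 1e-4, 200_000)
production = (1 - m) ** alpha * m ** beta  # the m-dependent factor of R
m_best = float(m[np.argmax(production)])
print(m_best, beta / (alpha + beta))  # both approximately 0.4444
```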
<p>In pictures, the following is what is happening:</p>
<img alt="" src="https://www.chrisstucchio.com/blog_media/2017/cobbs_douglas/optimization.png" />
<p>As can be seen from the graph, the production function for both firms is the same, as is the capital allocation that maximizes production. All that differs is the <em>level</em> of production.</p>
<p>It's important to recognize what this means in business terms: the sole difference between Cupertino and Bangalore is that Bangalore has a higher total factor productivity. In terms of capital allocation, both firms should behave in the same way.</p>
<p>Secondly, this means that revenue in Bangalore Pvt. Ltd. will be higher than at Cupertino Inc. by a factor of <span class="math">\(5^{\alpha}\)</span>.</p>
<p>The third conclusion is that <em>revenue per employee</em> will be significantly lower at Bangalore Pvt. Ltd. Bangalore Pvt. Ltd. is devoting the same amount of capital to labor as Cupertino Inc., but it has 5x lower cost per employee. As a result, it will have 5x as many employees as Cupertino Inc. Its revenue is higher by a factor of <span class="math">\(5^{\alpha}\)</span>, but the number of employees is higher by a factor of <span class="math">\(5^1\)</span>. As a result, revenue per employee is <em>lower</em> by a factor of <span class="math">\(5^{1-\alpha}\)</span> (recall that <span class="math">\(0 &lt; \alpha &lt; 1\)</span>).</p>
<p>For example, assuming <span class="math">\(\alpha = 0.5\)</span> (as it appeared to be in the synthetic examples I concocted above), this means Bangalore Pvt. Ltd. will have <span class="math">\(\sqrt{5} \approx 2.236\)</span> times as much revenue as Cupertino Inc., but its revenue per employee will be only 0.447 times as large as that of Cupertino Inc.</p>
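<p>That arithmetic can be checked in a couple of lines (taking alpha = 0.5 as above):</p>

```python
alpha = 0.5  # the exponent suggested by the synthetic fits above
revenue_ratio = 5 ** alpha     # Bangalore revenue relative to Cupertino
per_employee = 5 ** alpha / 5  # same labor spend buys 5x the headcount
print(round(revenue_ratio, 3), round(per_employee, 3))  # 2.236 0.447
```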
</div>
<div class="section" id="what-does-this-mean-for-the-business">
<h2>What does this mean for the business?</h2>
<p>It's often a bit difficult to translate abstract economic results into practical business advice. In this case, what the economic result implies is the following.</p>
<p>Because <span class="math">\(m\)</span> is the same for both Bangalore Pvt. Ltd. and Cupertino Inc., both firms should spend approximately the same fraction of their capital on labor. This will result in Bangalore Pvt. Ltd. consuming more labor (i.e. having more employees, and having more labor hours), and moving further along the diminishing returns curve.</p>
<p>For example, if these competing firms are in the adtech business, then integrating with more ad networks might be a valuable way to increase their customer value. In this case, while Cupertino Inc. might integrate only with Adwords, Facebook and AppNexus, Bangalore Pvt. Ltd. might integrate with those networks as well as YouTube, Pornhub and other more niche sites. If these firms are in the business of selling an ecommerce widget, then Bangalore Pvt. Ltd. might provide a larger number of specialized themes than Cupertino Inc. In most software businesses there is value to be generated by repeating the same process for more data providers, more platforms, etc. Generally speaking, an Indian firm should make their product significantly broader than any corresponding western firm.</p>
<p>Similarly, on the marketing side, one might expect Bangalore Pvt. Ltd. to create a broader advertising surface. This might involve creating a larger number of landing pages, which would target smaller niches of customers. Similarly, one would expect more organic marketing as a fraction of total marketing.</p>
<p>At the micro level, the fundamental calculus is the following. For Cupertino Inc. to take on a project requiring 1 man-year of labor, the project must generate $100k in revenue to break even. In contrast, Bangalore Pvt. Ltd. can take on any project generating $20k in revenue or more. As a result, Bangalore Pvt. Ltd. should take on all the same projects as Cupertino Inc., in addition to projects generating between $20k-100k revenue.</p>
<p>Projects in this revenue range form a natural core competency for the Indian firm; simple economics forms a moat that few western firms can cross.</p>
<p>So in terms of practical business advice, the takeaway (for Indian firms) is the following: hire more people, and have them work on more marginal projects. It will lower your revenue/employee, but it will increase profits and help you capture business that western competitors are economically incapable of capturing.</p>
</div>
              ]]>
            </content>
        </entry>
        <entry>
            <title type="html">Inferring data loss (and correcting for it) from fundamental relationships</title>
            <author><name>Chris Stucchio</name></author>
            <link href="https://www.chrisstucchio.com/blog/2017/sequential_conversion_rates.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss"/>
            <updated>2017-09-01T09:45:00+02:00</updated>
            <published>2017-09-01T09:45:00+02:00</published>
            <id>https://www.chrisstucchio.com/blog/2017/sequential_conversion_rates.html</id>

            <content type="html">
              <![CDATA[
                (If mathematical formulas are not showing up here, <a href="https://www.chrisstucchio.com/blog/2017/sequential_conversion_rates.html?utm_medium=rss&amp;utm_source=rss&amp;utm_campaign=rss">please read this on the web</a>.)
                <p>I recently analyzed a somewhat puzzling data set. I was sending HTTP POST requests to a system. The system would acknowledge receipt of these requests (returning a 200 status code) and, some time later (it was a slow asynchronous process), send a web hook to a specified URL <em>if the request was successful</em>. However, success was far from certain; most requests actually failed. My job was to measure the success rate.</p>
<p>Concretely, event <code>A</code> would trigger at some time <code>t0</code>. If <code>A</code> was successful, then event <code>B</code> might occur at time <code>t1</code>. <code>B</code> can only occur if <code>A</code> occurred.</p>
<p>Systems like this appear in a variety of contexts:</p>
<ul>
<li>Ad delivery. The ad must first be displayed (event <code>A</code>), and only after it's displayed can the viewer click a link (event <code>B</code>).</li>
<li>Email. The email must first be opened (event <code>A</code>), and only after it's opened can the reader click a link (event <code>B</code>).</li>
<li>Web forms. A user must first enter their credit card (event <code>A</code>), and only after that can they click submit (event <code>B</code>).</li>
</ul>
<p>What I wanted to compute was <code>alpha = P(A)</code> and <code>beta = P(B | A)</code>.</p>
<p>When analyzing the data I had, I noticed a curious pattern.</p>
<div class="highlight"><pre><span></span><code>request ID| time of A | time of B
----------+-----------+----------
abc       | 12:00     | 1:00
def       | 12:01     | null
ghi       | null      | null
jkl       | null      | 1:03  &lt;--- WTF is this?
</code></pre></div>

<p>That last row (for request ID <code>jkl</code>) indicates something really weird happening. It suggests that event <code>B</code> has occurred even though event <code>A</code> has not!</p>
<p>According to my model, in which I have a high degree of confidence, this isn't possible. Yet there it is in the data; the responding system could not post a web hook with ID <code>jkl</code> if it hadn't received the request, since it couldn't possibly know this ID.</p>
<p>The conclusion I drew is that our measurements of <code>A</code> and <code>B</code> are unreliable. <code>A</code> and <code>B</code> may actually occur without being observed. So the real problem at hand is to infer the true rates at which <code>A</code> and <code>B</code> occur from the complete data set.</p>
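<p>Rows like <code>jkl</code> can be detected mechanically. A minimal sketch, using hypothetical in-memory records with <code>None</code> standing in for null:</p>

```python
records = [
    ("abc", "12:00", "1:00"),
    ("def", "12:01", None),
    ("ghi", None, None),
    ("jkl", None, "1:03"),
]

# Under the model, B cannot occur without A. So any row with a time for B
# but no time for A is direct evidence that observations of A are being lost.
impossible = [rid for rid, t_a, t_b in records if t_a is None and t_b is not None]
print(impossible)  # ['jkl']
```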
<h2>Some simplified calculations</h2>
<p>I'll begin with some simple calculations - using nothing but arithmetic - to give the flavor of this analysis. To make things concrete, suppose we have the following counts:</p>
<ul>
<li>100k requests were made</li>
<li>In 40k cases, event <code>A</code> was reported and <code>B</code> was not reported</li>
<li>In 10k cases, event <code>A</code> was reported and then <code>B</code> was reported</li>
<li>In 5k cases, event <code>B</code> was reported but <code>A</code> was never reported</li>
</ul>
<p>The most naive possible approach is to treat the cases where <code>B</code> was reported without <code>A</code> as <em>bad data</em> and discard them. Then we can estimate:</p>
<div class="highlight"><pre><span></span><code>alpha = 50k / 95k = 0.526
beta = 10k / 50k = 0.200
</code></pre></div>

<p>But we can do better than this. Since <code>B</code> can only occur if <code>A</code> occurred, every report of <code>B</code> tells us that <code>A</code> occurred, whether or not <code>A</code> was reported. So we actually know that <code>A</code> occurred at least 55k times, and <code>A</code> then <code>B</code> occurred at least 15k times. We can then estimate:</p>
<div class="highlight"><pre><span></span><code>alpha = 55k / 100k = 0.550
beta = 15k / 55k = 0.273
</code></pre></div>

<p>Finally, there's a third approach we can take. Let's define the parameters <code>gamma_A = P(A reported | A occurred)</code> and <code>gamma_B = P(B reported | B occurred)</code>. Let's assume that <code>gamma_A = gamma_B = gamma</code>; this is reasonable in the event that events <code>A</code> and <code>B</code> are measured by the same mechanism (e.g., a tracking pixel).</p>
<p>Then we can infer, based on the fact that <code>B</code> was reported 5k times without <code>A</code> being reported, that approximately 10% of the time (5k unreported <code>A</code> occurrences / 50k <code>A</code> reports) the data is lost. This suggests <code>gamma ~= 0.9</code>.</p>
<p>We can then estimate that there were 50k / 0.9 = 55.56k occurrences of <code>A</code> and 15k / 0.9 = 16.67k occurrences of <code>B</code>, yielding:</p>
<div class="highlight"><pre><span></span><code>alpha = 55.56k / 100k = 0.556
beta = 16.67k / 55.56k = 0.300
</code></pre></div>
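<p>All three back-of-the-envelope estimates are easy to reproduce in plain Python, using the counts from the example above:</p>

```python
N = 100_000
a_only = 40_000   # A reported, B not reported
both   = 10_000   # both A and B reported
b_only = 5_000    # B reported, A never reported

# Approach 1: discard the "impossible" B-only rows.
alpha_naive = (a_only + both) / (N - b_only)              # ~0.526
beta_naive  = both / (a_only + both)                      # 0.200

# Approach 2: B implies A, so B-only rows count as A occurrences.
alpha_logic = (a_only + both + b_only) / N                # 0.550
beta_logic  = (both + b_only) / (a_only + both + b_only)  # ~0.273

# Approach 3: treat B-only rows as evidence of ~10% reporting loss.
gamma      = 1 - b_only / (a_only + both)                 # 0.900
alpha_corr = (a_only + both) / gamma / N                  # ~0.556
beta_corr  = (both + b_only) / (a_only + both)            # 0.300 (the gammas cancel)

# The cascade: naive end-to-end success rate vs the corrected one.
naive_rate = both / (N - b_only)                          # ~0.105
corr_rate  = alpha_corr * beta_corr                       # ~0.167
```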

<h3>Small errors result in big differences!</h3>
<p>Based on the data we have, we've guesstimated that approximately 10% of the events which occur are not reported. However, this effect cascades: the overall success rate <code>alpha * beta</code> is reported as 10.5% (= 10k / 95k) rather than the true 16.7% (= 16.67k / 100k). That's a huge difference!</p>
<h2>Statistical method</h2>
<p>These calculations are all great, but we also need to deal with uncertainty. It's possible that actually <code>gamma=0.95</code> but we simply got unlucky, or <code>gamma=0.85</code> and we got very lucky. How can we quantify this?</p>
<p>This can be done relatively straightforwardly with <a href="https://pymc-devs.github.io/pymc3/notebooks/getting_started.html">pymc3</a>.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pylab</span> <span class="k">as</span> <span class="nn">pl</span>
<span class="kn">import</span> <span class="nn">pymc3</span> <span class="k">as</span> <span class="nn">pymc</span>
<span class="kn">import</span> <span class="nn">numpy</span>

<span class="n">N</span> <span class="o">=</span> <span class="mi">100000</span>
<span class="n">ao</span> <span class="o">=</span> <span class="mi">40000</span>
<span class="n">bo_and_ao</span> <span class="o">=</span> <span class="mi">10000</span>
<span class="n">bo_no_ao</span> <span class="o">=</span> <span class="mi">5000</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">pymc</span><span class="o">.</span><span class="n">Model</span><span class="p">()</span>

<span class="k">with</span> <span class="n">model</span><span class="p">:</span>
    <span class="n">alpha</span> <span class="o">=</span> <span class="n">pymc</span><span class="o">.</span><span class="n">Uniform</span><span class="p">(</span><span class="s1">&#39;alpha&#39;</span><span class="p">,</span> <span class="n">lower</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">upper</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">beta</span>  <span class="o">=</span> <span class="n">pymc</span><span class="o">.</span><span class="n">Uniform</span><span class="p">(</span><span class="s1">&#39;beta&#39;</span><span class="p">,</span> <span class="n">lower</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">upper</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">gamma</span> <span class="o">=</span> <span class="n">pymc</span><span class="o">.</span><span class="n">Uniform</span><span class="p">(</span><span class="s1">&#39;gamma&#39;</span><span class="p">,</span> <span class="n">lower</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">upper</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

    <span class="n">a_occurred</span> <span class="o">=</span> <span class="n">pymc</span><span class="o">.</span><span class="n">Binomial</span><span class="p">(</span><span class="s1">&#39;a_occurred&#39;</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="n">N</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">alpha</span><span class="p">)</span>
    <span class="n">a_observed</span> <span class="o">=</span> <span class="n">pymc</span><span class="o">.</span><span class="n">Binomial</span><span class="p">(</span><span class="s1">&#39;a_observed&#39;</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="n">a_occurred</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">gamma</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="n">ao</span><span class="o">+</span><span class="n">bo_and_ao</span><span class="p">)</span>

    <span class="n">b_occurred</span> <span class="o">=</span> <span class="n">pymc</span><span class="o">.</span><span class="n">Binomial</span><span class="p">(</span><span class="s1">&#39;b_occurred&#39;</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="n">a_occurred</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">beta</span><span class="p">)</span>
    <span class="n">b_observed</span> <span class="o">=</span> <span class="n">pymc</span><span class="o">.</span><span class="n">Binomial</span><span class="p">(</span><span class="s1">&#39;b_observed&#39;</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="n">b_occurred</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">gamma</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="n">bo_and_ao</span><span class="o">+</span><span class="n">bo_no_ao</span><span class="p">)</span>

    <span class="n">trace</span> <span class="o">=</span> <span class="n">pymc</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">10000</span><span class="p">)</span>
</code></pre></div>

<p>The results can then be plotted:</p>
<p><img alt="Graph of uncertainty in number of conversions" src="/blog_media/2017/sequential_conversion_rates/observation_probability.png"></p>
<p>As is expected, we have sharp lower bounds; the true number of events could not be lower than our observed number of events.</p>
<p>These numbers are in rough accordance with our heuristic calculations above.</p>
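<p>As a sanity check, we can forward-simulate the same generative chain with the point estimates from the arithmetic above, and confirm that the reported totals land near the observed 50k <code>A</code> reports and 15k <code>B</code> reports. This is an independent numpy simulation, not the pymc3 model itself:</p>

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000
alpha, beta, gamma = 0.556, 0.30, 0.90  # point estimates from the arithmetic above

a_occurred = rng.binomial(N, alpha)           # ~55.6k true A events
a_observed = rng.binomial(a_occurred, gamma)  # ~50k reported A events
b_occurred = rng.binomial(a_occurred, beta)   # ~16.7k true B events
b_observed = rng.binomial(b_occurred, gamma)  # ~15k reported B events
```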
<h2>Extracting fundamental parameters</h2>
<p>In the above data, we've done two important things.</p>
<p><em>First</em>, we've built a <a href="https://en.wikipedia.org/wiki/Nowcasting_(economics)">nowcast</a> of our underlying data. That is to say, while the number of times events <code>A</code> and <code>B</code> are <em>reported</em> is directly observable, the number of times they actually <em>occur</em> is not. So we can construct better estimates (as well as credible intervals) of the true event occurrence counts.</p>
<p><em>Second</em>, we've built a direct probabilistic way of computing the fundamental parameters of the problem, namely <code>alpha</code> and <code>beta</code>. In our pymc code, just as we can plot a histogram of <code>a_occurred</code> (via <code>pl.hist(trace['a_occurred'][::20], bins=50)</code>), we can similarly plot a histogram of <code>alpha</code> itself. In many instances - e.g. <a href="https://cdn2.hubspot.net/hubfs/310840/VWO_SmartStats_technical_whitepaper.pdf">A/B testing</a> or <a href="https://www.chrisstucchio.com/blog/2013/bayesian_bandit.html">bandit algorithms</a> - the underlying probabilities are the parameters of direct interest. The actual counts are only incidental.</p>
<p>The conclusion here is that missing data is not a fundamentally limiting factor in running many analyses. Provided you have a more complete generative model of data collection, and adequate data to fit the model, you can actually correct for missing data when running such analyses.</p>
              ]]>
            </content>
        </entry>

</feed>