
Welcome to the OpenCancer wiki!

From June 6, 2017 to December 5, 2017, La Paillasse, the JADE club and Roche made EPIDEMIUM possible: a massive, challenge-oriented and community-based open science program on cancer data. Our poster is available at https://drive.google.com/file/d/1-sbHmIuZsnu6PAvy6ZriQU0GjEEABXQI/view?usp=sharing.

This project follows up on the 2016 "Predictive Approaches to Cancer Risk" (http://wiki.epidemium.cc/wiki/Approches_prédictives_et_risque_de_cancer) groundwork, which received the Health Impact Prize from the Epidemium committees.

Here we aim at building a predictive tool to help health authorities make public policy decisions on cancer prevention. The tool lets anyone observe, country by country, how changes in the values of relevant features (set in a sidebar panel) affect any cancer incidence rate. Results are delivered through a GUI taking the shape of an interactive world map. All the Angular, C++ and R code we have developed, and will continue to develop throughout the project, is available on GitHub at https://github.com/EpidemiumOpenCancer/OpenCancer.


Contributors and contribution fields

Here is the OpenCancer team:

  • Benjamin Schannes (OpenCancer coordinator): Machine Learning, Statistics
  • Lino Galiana: Machine Learning, High Performance Computing
  • Martin Micalef: Machine Learning, Economic Science
  • Raphaële Adjerad: Machine Learning, Statistics
  • Benoît Rolland: Programming, Web Development

We received valuable guidance from other experts we would like to thank: in particular Jordan Scheer, public health doctor, for his medical insight throughout the project, and Sukran Dal, geography and statistics student, for her constant interest and research in the project. We also thank all our followers on the platform, as well as La Paillasse for hosting us and the Epidemium team members for their constant support, in particular Mehdi, Olivier, Gaspard, Ozanne and Alicia. Finally, we thank the jury in advance for taking the time to read the following study.

Clinical and Epidemiological insights about Risk Factors

According to the weekly newspaper The Economist, cancer claimed the lives of 8.8 million people in 2015; only heart disease caused more deaths. Furthermore, around 40% of Americans will be told they have cancer during their lifetimes. These figures are alarming.

It is usually not possible to know exactly why one person develops cancer and another does not. But research has shown that certain risk factors may increase a person's chances of developing cancer. There are also factors that are linked to a lower risk of cancer, sometimes called protective risk factors, or just protective factors. Most cancers are related to environmental, lifestyle, or behavioral exposures. Environmental here means everything outside the body that interacts with the human body. In this sense, the environment is not limited to the biophysical environment (e.g. exposure to factors such as air pollution or sunlight, encountered outdoors or indoors, at home or in the workplace), but also includes lifestyle, economic and behavioral factors. Common environmental factors that contribute to cancer death include tobacco (accounting for 25-30% of deaths), obesity (30-35%), infections (15-20%), radiation (both ionizing and non-ionizing, up to 10%), lack of physical activity, and environmental pollutants.

Assessing the carcinogenicity of a substance in our environment is a delicate task. First, defining the environment is very complex. Then cancer results from successive or simultaneous exposures to several factors and it can take many years between exposure and the onset of the disease. Finally it is still difficult to estimate the risk of cancers associated with low but chronic exposure levels to these substances. The fact that cancer may occur many years after exposure justifies the interest of a prospective open data study.

We decided to run our predictive models with colorectal cancer (CRC) as a case study. It is currently the third most common malignancy and the fourth leading cause of cancer-related deaths in the world, and its burden is expected to increase by 60%, to more than 2.2 million new cases and 1.1 million cancer deaths, by 2030 (Arnold et al., 2016 [1]).

First, CRC is considered one of the clearest markers of the cancer transition. Indeed, the CRC incidence rate is still rising sharply in many low-income and middle-income countries, such as those of Eastern Europe, Latin America and Asia, and tends to stabilize or decrease in highly developed countries, where it nonetheless remains among the highest in the world. Risk factors associated with the Western lifestyle, such as obesity, tobacco consumption, unhealthy diet and alcohol consumption (Favoriti et al., 2016 [2]), are of primary importance when trying to explain the rising incidence rates. Concerning diet, several epidemiological studies have shown that eating fruits and vegetables reduces the risk of developing colorectal cancer, as these foods are rich in fiber. A fiber-rich diet can reduce the risk of CRC by lowering the pH of the sigmoid colon, reducing colon transit time, and increasing the uptake of folic acid and micronutrients (such as antioxidants) from vegetables (Kritchevsky, 1995 [3]). By contrast, the consumption of red meat and cholesterol-rich foods is associated with an increased risk of CRC (Chao et al., 2005 [4]).

The risk of developing colorectal cancer also increases with age: around 90% of diagnosed patients are over 50 years old and the average age at diagnosis is 64 (Amersi et al., 2015 [5]). The incidence rate of CRC rises significantly from the 40-50 age group onwards (Boyle and Langman, 2000 [6]). Moreover, men are diagnosed more often than women, and black people more often than white people.

Numerous studies have aimed at identifying risk factors for colorectal cancer. We can mention diabetes (Hu et al., 1999 [7]), long-term inflammatory bowel disease (Jess et al., 2006 [8]), lack of physical activity, and use of aspirin and other nonsteroidal anti-inflammatory drugs (Chan et al., 2005 [9]). A 2010 explorative study (Won Jin Lee et al. [10]) suggests an association between the incidence of colorectal cancer and the use of certain pesticides, in particular chlorpyrifos and aldicarb. The influence of pollution has also been explored in the literature (El-Tawil AM, 2010 [11]): it may increase colorectal cancer risk through chemical contamination of food and drink, e.g. through the chlorination of drinking water. We also note that 20% of CRC cases are hereditary (Butterworth et al., 2006 [12]), linked to polyps (Dove-Edwin et al., 2005 [13]).

Causality assessment

Knowing that cancer incidence is correlated with tobacco consumption is not enough. We need to ascertain that tobacco consumption increases the rate of lung cancer incidence, ceteris paribus. In the medical field, the causality question is therefore critical. Unfortunately, unlike linear correlation, there is no universal measure of causality. However, there are well-established practices and more or less relevant approximations. Formally, a probabilist would say that we need to evaluate the difference between two conditional expectations: the conditional expectation of incidence given a sequence of events including the one whose causality we want to ascertain (e.g. a pattern related to tobacco consumption), and the conditional expectation of incidence given the same sequence of events but excluding that one. The main statistical and machine learning methods for estimating causal effects are built upon this conditional expectation framework.

RCT in Epidemiology

There is a vast and robust econometric theory of treatment effects, e.g. the average treatment effect (ATE), a measure used to compare treatments (or interventions) in randomized experiments, evaluations of policy interventions, and medical trials. The ATE measures the difference in mean (average) outcomes between units assigned to the treatment and units assigned to the control. Randomized controlled trials (RCTs) can be seen as an epidemiological counterpart of this theory.

Epidemiological studies enable causality assessment through randomized controlled trials (RCTs), which aim to reduce bias when testing a new treatment. The people participating in the trial are randomly allocated either to the group receiving the treatment under investigation or to a group receiving standard treatment (or a placebo) as the control. Randomization minimises selection bias, and the different comparison groups allow the researchers to determine any effects of the treatment compared with the no-treatment (control) group, while other variables are kept constant. The RCT is often considered the gold standard for a clinical trial. RCTs are often used to test the efficacy or effectiveness of various types of medical intervention and may provide information about adverse effects, such as drug reactions.

Random allocation in real trials is complex, but conceptually the process is like tossing a coin. After randomization, the two (or more) groups of subjects are followed in exactly the same way, and the only differences between them (in terms of procedures, tests, outpatient visits, and follow-up calls) should be those intrinsic to the treatments being compared. The most important advantage of proper randomization is that it minimizes allocation bias, balancing both known and unknown prognostic factors in the assignment of treatments. The major categories of RCT study designs are:

  • Parallel-group – each participant is randomly assigned to a group, and all the participants in the group receive (or do not receive) an intervention.
  • Crossover – over time, each participant receives (or does not receive) an intervention in a random sequence.
  • Cluster – pre-existing groups of participants (e.g., villages, schools) are randomly selected to receive (or not receive) an intervention.
  • Factorial – each participant is randomly assigned to a group that receives a particular combination of interventions or non-interventions (e.g., group 1 receives vitamin X and vitamin Y, group 2 receives vitamin X and placebo Y, group 3 receives placebo X and vitamin Y, and group 4 receives placebo X and placebo Y).
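The random allocation step described above (conceptually a coin toss, balanced in practice) can be sketched as permuted-block randomization, one common allocation scheme. This is an illustrative sketch; the block size and seed are arbitrary choices, not a description of any specific trial:

```python
import random

def block_randomize(n_subjects, block_size=4, arms=("treatment", "control"), seed=42):
    """Permuted-block randomization: within each block, an equal number of
    assignments goes to each arm, in random order, keeping group sizes balanced."""
    assert block_size % len(arms) == 0
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n_subjects:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)  # random order within the block
        assignments.extend(block)
    return assignments[:n_subjects]

allocation = block_randomize(12)
print(allocation.count("treatment"), allocation.count("control"))  # 6 6
```

Blocking guarantees that group sizes never drift far apart during enrollment, which simple coin-toss randomization cannot.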

An analysis of the 616 RCTs indexed in PubMed during December 2006 found that 78% were parallel-group trials, 16% were crossover, 2% were split-body, 2% were cluster, and 2% were factorial.

The advantages of proper randomization in RCTs include:[14]

  • "It eliminates bias in treatment assignment," specifically selection bias and confounding.
  • "It facilitates blinding (masking) of the identity of treatments from investigators, participants, and assessors."
  • "It permits the use of probability theory to express the likelihood that any difference in outcome between treatment groups merely indicates chance."

An ideal randomization procedure would achieve the following goals:

  • Maximize statistical power, especially in subgroup analyses. Generally, equal group sizes maximize statistical power; however, unequal group sizes may be more powerful for some analyses (e.g., multiple comparisons of placebo versus several doses using Dunnett's procedure[15]), and are sometimes desired for non-analytic reasons (e.g., patients may be more motivated to enroll if there is a higher chance of getting the test treatment, or regulatory agencies may require a minimum number of patients exposed to treatment).
  • Minimize selection bias. This may occur if investigators can consciously or unconsciously preferentially enroll patients between treatment arms. A good randomization procedure will be unpredictable so that investigators cannot guess the next subject's group assignment based on prior treatment assignments. The risk of selection bias is highest when previous treatment assignments are known (as in unblinded studies) or can be guessed (perhaps if a drug has distinctive side effects).
  • Minimize allocation bias (or confounding). This may occur when covariates that affect the outcome are not equally distributed between treatment groups, and the treatment effect is confounded with the effect of the covariates (i.e., an "accidental bias"). If the randomization procedure causes an imbalance in covariates related to the outcome across groups, estimates of effect may be biased if not adjusted for the covariates (which may be unmeasured and therefore impossible to adjust for).

The number of treatment units (subjects or groups of subjects) assigned to the control and treatment groups affects an RCT's reliability. If the effect of the treatment is small, the number of treatment units in either group may be insufficient for rejecting the null hypothesis in the respective statistical test. The failure to reject the null hypothesis would imply that the treatment shows no statistically significant effect on the treated in a given test. But as the sample size increases, the same RCT may be able to demonstrate a significant effect of the treatment, even if this effect is small.

By construction, such approaches are well suited to revealing causal patterns.

Granger Tests in Time Series

As we will see in the next sections, our learning dataset is made of panel data. For each 5-tuple (cancer type, year, country or zone, gender, age group), we have a time series of cancer incidence rates and regressors between 1970 and 2016. Hence, it seems appropriate to test the assumption of Granger causality.

The Granger causality test is a statistical hypothesis test for determining whether one time series is useful in forecasting another, first proposed in 1969.[16] Ordinarily, regressions reflect "mere" correlations, but Clive Granger argued that causality in economics could be tested for by measuring the ability to predict the future values of a time series using prior values of another time series. Since the question of "true causality" is deeply philosophical, and because of the post hoc ergo propter hoc fallacy of assuming that one thing preceding another can be used as a proof of causation, econometricians assert that the Granger test finds only "predictive causality".

A time series X is said to Granger-cause Y if it can be shown, usually through a series of t-tests and F-tests on lagged values of X (and with lagged values of Y also included), that those X values provide statistically significant information about future values of Y. We say that a variable X that evolves over time Granger-causes another evolving variable Y if predictions of the value of Y based on its own past values and on the past values of X are better than predictions of Y based only on its own past values.

Granger defined the causality relationship based on two principles:

  1. The cause happens prior to its effect.
  2. The cause has unique information about the future values of its effect.

Given these two assumptions about causality, Granger proposed to test the following hypothesis for identification of a causal effect of <math>X</math> on <math>Y</math>:


<math>\mathbb{P}[Y(t+1) \in A\mid \mathcal{I}(t)] \neq \mathbb{P}[Y(t+1) \in A\mid \mathcal{I}_{-X}(t)],</math> where <math>\mathbb{P}</math> refers to probability, <math>A</math> is an arbitrary non-empty set, and <math>\mathcal{I}(t)</math> and <math>\mathcal{I}_{-X}(t)</math> respectively denote the information available as of time <math>t</math> in the entire universe, and that in the modified universe in which <math>X</math> is excluded. If the above hypothesis is accepted, we say that <math>X</math> Granger-causes <math>Y</math>.

If a time series is a stationary process, the test is performed using the level values of two (or more) variables. If the variables are non-stationary, then the test is done using first (or higher) differences. The number of lags to be included is usually chosen using an information criterion, such as the Akaike information criterion or the Schwarz information criterion. Any particular lagged value of one of the variables is retained in the regression if (1) it is significant according to a t-test, and (2) it and the other lagged values of the variable jointly add explanatory power to the model according to an F-test. Then the null hypothesis of no Granger causality is not rejected if and only if no lagged values of an explanatory variable have been retained in the regression.
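The bivariate procedure above can be illustrated with a minimal pure-Python sketch (no external libraries): we fit a restricted regression of Y on its own lags and an unrestricted one that adds lags of X, then compare residual sums of squares with an F statistic. The simulated series, lag order and OLS-by-normal-equations solver are all illustrative, not the project's actual code:

```python
import random

def ols_rss(X, y):
    """Residual sum of squares of an OLS fit, solving the normal
    equations X'X b = X'y by Gaussian elimination with partial pivoting."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        b[c], b[p] = b[p], b[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            for j in range(c, k):
                A[r][j] -= f * A[c][j]
            b[r] -= f * b[c]
    beta = [0.0] * k
    for c in range(k - 1, -1, -1):
        beta[c] = (b[c] - sum(A[c][j] * beta[j] for j in range(c + 1, k))) / A[c][c]
    return sum((yi - sum(bi * xi for bi, xi in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

def granger_f_stat(x, y, lags=2):
    """F statistic comparing the restricted model (y on its own lags)
    with the unrestricted model (y on its own lags plus lags of x)."""
    n = len(y)
    rows_r, rows_u, target = [], [], []
    for t in range(lags, n):
        ylags = [y[t - l] for l in range(1, lags + 1)]
        xlags = [x[t - l] for l in range(1, lags + 1)]
        rows_r.append([1.0] + ylags)
        rows_u.append([1.0] + ylags + xlags)
        target.append(y[t])
    rss_r, rss_u = ols_rss(rows_r, target), ols_rss(rows_u, target)
    df2 = len(target) - (1 + 2 * lags)
    return ((rss_r - rss_u) / lags) / (rss_u / df2)

# Simulated example: x leads y by one period, so x should Granger-cause y.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(200)]
y = [0.0, 0.0]
for t in range(2, 200):
    y.append(0.5 * y[t - 1] + 0.8 * x[t - 1] + random.gauss(0, 0.1))
f = granger_f_stat(x, y, lags=2)
print(f > 10)  # a large F statistic: lags of x clearly help predict y
```

In practice one would compare the F statistic against the appropriate F distribution quantile and select the lag order with an information criterion, as described above.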

Multivariate Granger causality analysis is usually performed by fitting a vector autoregressive model (VAR) to the time series. In particular, let <math>X(t) \in \mathbb{R}^{d\times 1}</math> for <math>t=1, \ldots, T</math> be a <math>d</math>-dimensional multivariate time series. Granger causality is performed by fitting a VAR model with <math>L</math> time lags as follows:


<math>X(t) = \sum_{\tau=1}^L A_{\tau}X(t-\tau) + \varepsilon(t),</math> where <math>\varepsilon(t)</math> is a white Gaussian random vector, and <math>A_\tau</math> is a matrix for every <math>\tau</math>. A time series <math>X_i</math> is called a Granger cause of another time series <math>X_j</math>, if at least one of the elements <math>A_{\tau}(j, i)</math> for <math>\tau = 1, \ldots, L</math> is significantly larger than zero (in absolute value).

The above linear methods are appropriate for testing Granger causality in the mean. However, they are not able to detect Granger causality in higher moments, e.g., in the variance. Non-parametric tests for Granger causality are designed to address this problem. The definition of Granger causality in these tests is general and does not involve any modelling assumptions, such as a linear autoregressive model. The non-parametric tests can be used as diagnostic tools to build better parametric models including higher-order moments and/or non-linearity.

As its name implies, Granger causality is not necessarily true causality. In fact, Granger-causality tests fulfill only the Humean definition of causality that identifies cause-effect relations with constant conjunctions. If both X and Y are driven by a common third process with different lags, one might still reject the null hypothesis of no Granger causality; yet, manipulating one of the variables would not change the other. Indeed, Granger-causality tests are designed to handle pairs of variables, and may produce misleading results when the true relationship involves three or more variables. Other possible sources of misleading test results are: (1) sampling that is too infrequent or too frequent, (2) a nonlinear causal relationship, (3) time series nonstationarity and nonlinearity, and (4) the existence of rational expectations.

Data sources and merging challenge

We have considered data originating from 4 sources:

  • WHO (for cancer incidence, i.e. the variable we need to explain and predict)
  • World Bank (872 features)
  • Food and Agriculture Organization -FAO- of the United Nations (4623 features)
  • International Labour Organization -ILO- (1011 features)

The WHO dataset provides incidence rates for 250 cancer types between 1958 and 2012, for 17 age groups split by gender. Each row corresponds to a 5-tuple (cancer type, year, country or zone, gender, age group).

For each features dataset (WB, FAO and ILO), each row corresponds to a 2-tuple (year, country or zone), and not a single row has complete observations. The FAO dataset provides information from 1961 until 2014, whereas the World Bank and ILO data repositories start in 1970 and end in 2016 and 2017 respectively. Overall, we thus have 56 years available, each providing only partial information.

Zones and countries differ from one dataset to another. Before merging the different sources, we had to build, through a geographical study, a key matching the different names in a convenient manner. Another difficulty arises for countries whose borders or status have evolved during the period, e.g. Serbia. That is why we have painstakingly built a nomenclature for handling countries and zones.
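Such a matching key can be sketched as a simple lookup table from dataset-specific labels to one canonical name. All entries below are illustrative examples, not the project's actual nomenclature:

```python
# Hypothetical excerpt of a country-name matching key: each dataset-specific
# label maps to one canonical name used across all merged sources.
COUNTRY_KEY = {
    "Russian Federation": "Russia",
    "Viet Nam": "Vietnam",
    "Serbia and Montenegro": "Serbia",   # pre-2006 borders folded into Serbia
    "United States of America": "United States",
}

def canonical_country(name):
    """Return the canonical name; already-canonical names pass through unchanged."""
    return COUNTRY_KEY.get(name, name)

print(canonical_country("Viet Nam"))  # Vietnam
print(canonical_country("France"))    # France
```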

An ideal scenario would be to have access to 56 years of observations for each country. Unfortunately, for some countries, some of these years are not provided in any dataset.

Moreover, even when a year is present, it does not mean that we have valuable information: many features exhibit a missing data pattern (Not Available, or 'NA') on the same row.

The starting point was to merge these 4 datasets through SQL-style left joins so as to obtain the same structure, i.e. the same years and same features per 4-tuple (cancer type, country, gender, age group). By construction, features do not change across cancer types, genders or age groups; the only sources of variation in the WB, ILO and FAO datasets are time (year) and space (country or zone). The final merged raw dataset is ordered by cancer type, age group, gender, country and year.
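The left-join step can be sketched as follows, on in-memory rows keyed by (year, country). Column names and values are illustrative placeholders:

```python
def left_join(left_rows, right_rows, key=("year", "country")):
    """SQL-style left join on a composite key: every left row is kept,
    and matching right-hand columns are appended (None when no match)."""
    index = {tuple(r[k] for k in key): r for r in right_rows}
    right_cols = {c for r in right_rows for c in r} - set(key)
    joined = []
    for row in left_rows:
        match = index.get(tuple(row[k] for k in key), {})
        merged = dict(row)
        for c in right_cols:
            merged[c] = match.get(c)
        joined.append(merged)
    return joined

incidence = [{"year": 2000, "country": "France", "rate": 42.0},
             {"year": 2001, "country": "France", "rate": 43.5}]
features = [{"year": 2000, "country": "France", "gdp": 1.3e12}]
out = left_join(incidence, features)
print(out[0]["gdp"], out[1]["gdp"])  # first row matched; second has no match
```

As in the real merge, the left (incidence) side dictates which rows survive, and unmatched feature cells simply become missing values.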

Raw data are far from immediately usable for learning machine learning models. Cleaning is of primary importance, and the fine-tuned methods designed to tackle this main issue are detailed below.

From raw data to consistent data

Below we report the different steps we have encoded to yield a consistent learning dataset.

Encoding issues: type recognition, type conversion and coercion

After reading the dataset, the first thing to do is to ensure that variable types have been correctly recognized. For some variables we had to apply conversion operations or, more simply, type coercion. A challenge was to automate these conformity checks and transformations when needed, thus avoiding any manual handling.

False zeros detection and correction

Each dataset exhibits a vast proportion of missing (NA) data. Furthermore, each numeric variable may have a number of observations recorded as 0 instead of NA. If we do nothing, we will underestimate the proportion of missing data and falsely impute 0 to potentially non-zero data; these 0 recordings must therefore be converted into NA. The opposite problem occurs if a true zero is replaced by NA. We thus have to design criteria enabling us to automatically replace as many false zeros as possible by NA while preserving as many real zeros as possible.
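One possible criterion (an illustrative sketch, not necessarily the rule used in the project) is to treat zeros as disguised NAs only in columns where zeros are rare, on the grounds that a variable which is almost never zero is unlikely to suddenly be exactly zero:

```python
def zeros_to_na(column, max_zero_share=0.05):
    """If zeros are rare in a column (below max_zero_share of the non-missing
    values), treat them as disguised missing values. The threshold is an
    illustrative tuning parameter."""
    observed = [v for v in column if v is not None]
    if not observed:
        return column
    zero_share = sum(v == 0 for v in observed) / len(observed)
    if zero_share < max_zero_share:
        return [None if v == 0 else v for v in column]
    return column

col = [3.1, 0, 2.8, 4.0, None] + [5.0] * 40
cleaned = zeros_to_na(col)
print(cleaned[1])  # None: the lone zero is flagged as missing
```

A column where zeros are frequent (e.g. a count variable) is left untouched, preserving its real zeros.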

Deductive corrections and deterministic imputations

An obvious inconsistency occurs when a record contains a value or combination of values that cannot correspond to a real-world situation: a man cannot be pregnant, and an average height cannot be negative. Such knowledge can be expressed as rules or constraints; in the data editing literature these are referred to as edit rules, or edits in short. The fact that we handle aggregated variables mitigates this problem, which frequently arises with individual data. However, as the number of variables increases, the number of rules may increase rapidly, and with the datasets under consideration this is definitely a challenge we are facing. Moreover, since multivariate rules may be interconnected by common variables, deciding which variable or variables in a record cause an inconsistency may not be straightforward. Since rules may pertain to multiple variables, and variables may occur in several rules, there is indeed a dependency between rules and variables. If we have no information available other than the edit violations, it makes sense to minimize the number of fields being altered. This principle, commonly referred to as the principle of Fellegi and Holt, is based on the idea that errors occur relatively rarely and, when they do, occur randomly across variables. Over the years, several algorithms have been developed to solve this minimization problem.

Correction methods aim to fix inconsistent observations by altering invalid values in a record based on information from valid values. Depending on the method this is either a single-step procedure or a two-step procedure where first, an error localization method is used to empty certain fields, followed by an imputation step. In some cases, the cause of errors in data can be determined with enough certainty so that the solution is almost automatically known.

Since the data we are analyzing are generated by people rather than machines or measurement devices, certain typical human-generated errors are likely to occur. Given that the data have to obey certain edit rules, the occurrence of such errors can sometimes be detected from raw data with (almost) certainty. Examples of detectable errors are typing errors in numbers (under linear restrictions), rounding errors, sign errors and variable swaps.

In some cases a missing value can be determined because the observed values, combined with their constraints, force a unique solution. For instance, the World Bank dataset provides variables expressed as ratios of other available variables over the GNI. If you know the GNI and the variable in the numerator, it is straightforward to determine the ratio and replace a corresponding potential NA recording by the real value, or to correct a potential error if the registered value differs from the recomputed one. In the latter case, the error can come from any of the three interconnected variables, and it may be cumbersome to correct it without adopting some handling conventions. These interconnections also prevent us from making naive imputations when any of the interconnected features is missing: imputations should respect relations between variables. Straightforward application of stochastic methods, such as the Gibbs sampler, is thus likely to create impossible combinations or destroy deterministic relations in the data. This will be discussed later. Strategies such as passive imputation can help us under these circumstances.
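The deterministic imputation of a ratio from its numerator and denominator can be sketched as follows. The field names are hypothetical placeholders for any World Bank variable expressed as a share of GNI:

```python
def impute_ratio(record, num="health_spending", den="gni", ratio="health_over_gni"):
    """Deterministic imputation: if the ratio is missing but its numerator
    and denominator are observed, recompute it (field names illustrative)."""
    r = dict(record)
    if r.get(ratio) is None and r.get(num) is not None and r.get(den):
        r[ratio] = r[num] / r[den]
    return r

rec = {"health_spending": 50.0, "gni": 1000.0, "health_over_gni": None}
print(impute_ratio(rec)["health_over_gni"])  # 0.05
```

Because the result is forced by a deterministic relation, no statistical model is involved and the imputed value is exact.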

These approaches have enabled us to slightly reduce the number of missing data.

Redundancy controls

These controls aim at avoiding repeated patterns in the features. If a variable appears multiple times under different names, we keep the column having the lowest proportion of missing data. If a numeric feature is replicated over columns corresponding to different units, we keep in practice a single unit, the one with the lowest proportion of missing data, and drop the other columns. However, it is not always easy to decide whether a variable should be excluded for redundancy reasons. For example, in the World Bank dataset, most of the variables are presented from different perspectives, and the corresponding columns are not collinear:

  • current US$
  • constant 2010 US$
  • annual % growth
  • current LCU
  • constant LCU
  • % of GNI

Strictly speaking, we cannot say that these are simply versions of the same variable expressed in different units: a current level, a ratio over a macro aggregate and a growth rate are different by nature. Whether to keep both current cost accounting and price-level (inflation-adjusted) accounting features is also a tricky question. Because the latter is designed to provide units of constant purchasing power and correct for inflation effects, it may be relevant to keep only this variable if we decide to drop one of the two metrics. Missing data proportions also provide guidance here. In practice, if the current US$ column has significantly fewer missing values than the constant 2010 US$ column (for an arbitrary variable having both metrics in the dataset), it may seem reasonable to keep the current US$ column and drop the constant one. We thus drop the constant-value metric and keep the current one if the difference between missing data ratios exceeds a threshold which can be parameterized by the user; otherwise, the current cost accounting metric is dropped and the inflation-adjusted one is kept by default.
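The decision rule above can be sketched directly; the 10% threshold and the column contents are illustrative:

```python
def choose_metric(current_col, constant_col, threshold=0.10):
    """Keep the current-US$ column only if its missing-data ratio is lower
    than the constant-US$ column's by more than `threshold`; otherwise keep
    the inflation-adjusted column by default."""
    def na_ratio(col):
        return sum(v is None for v in col) / len(col)
    if na_ratio(constant_col) - na_ratio(current_col) > threshold:
        return "current"
    return "constant"

current = [1.0, 2.0, 3.0, 4.0, None]       # 20% missing
constant = [1.0, None, None, None, None]   # 80% missing
print(choose_metric(current, constant))    # current
```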

Feature engineering

Feature engineering is the process of creating, from the existing features, new ones that better represent the underlying problem, in order to improve the model's consistency and performance.

External data

Most machine learning problems can benefit from bringing in new external data likely to shed light on new patterns. In our case, we have added open data originating from the World Health Organization and open meteo data.

Boolean-valued transforms : NA and outlier patterns

The simplest type of feature engineering involves using indicator variables to isolate key information.

Handling Missing data

This step is much more a data cleaning step than a feature engineering one; however, it results in appending new features to the initial set. The process we designed attaches an indicator column to each feature in the dataset: whenever a feature's row has a missing value, the corresponding indicator's row takes the value 1, and 0 otherwise. This is a one-hot encoding strategy applied to the missing/non-missing categorical features arising from the original variables. Note that this operation doubles the number of columns in the resulting dataset.
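The indicator-appending step can be sketched as follows (column names are illustrative):

```python
def add_na_indicators(table):
    """Append a 0/1 missingness indicator for every feature column,
    doubling the number of columns."""
    out = []
    for row in table:
        new_row = dict(row)
        for name, value in row.items():
            new_row[name + "_isna"] = 1 if value is None else 0
        out.append(new_row)
    return out

rows = [{"fat_intake": 80.2, "gdp": None}]
print(add_na_indicators(rows)[0])
# {'fat_intake': 80.2, 'gdp': None, 'fat_intake_isna': 0, 'gdp_isna': 1}
```

Downstream models can then use the indicators themselves as predictors, since missingness patterns may carry information.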

Handling Outliers

Outliers are not errors. They should be detected, but not necessarily removed; their inclusion in the analysis is a statistical decision. Detection may be done through feature binarization, the process of thresholding numerical features to get boolean values. Formally, we attach an outlier indicator column to each feature: this indicator takes the value 1 at each row for which the observed value lies below a floor or above a ceiling. Floor and ceiling are specified feature by feature and determine the set of values (those lower than the floor or greater than the ceiling) which will be considered outliers.

Standard methods flag observations based on measures such as the interquartile range. For example, if Q1 and Q3 are the lower and upper quartiles respectively, one can define an outlier as any observation outside the range [Q1 - k(Q3 - Q1), Q3 + k(Q3 - Q1)] for some nonnegative constant k; John Tukey proposed k = 1.5. Accordingly, we have written a function that flags outliers and performs the resulting feature binarization in accordance with the parameter k fixed by the user.
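A minimal sketch of such a function, using the standard library's quantile estimator (the sample values are illustrative):

```python
import statistics

def tukey_outlier_flags(values, k=1.5):
    """Return a 0/1 indicator per value: 1 when the value lies outside
    [Q1 - k*IQR, Q3 + k*IQR], with k settable by the user (Tukey's 1.5
    by default)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [0 if lo <= v <= hi else 1 for v in values]

sample = [10, 11, 12, 11, 10, 12, 11, 95]
print(tukey_outlier_flags(sample))  # only the 95 is flagged
```

Note that `statistics.quantiles` uses an exclusive interpolation method by default; other quartile conventions would shift the fences slightly without changing the idea.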

Indicator variables for "carcinogenic events"

We also include indicator variables to take into account major "carcinogenic events", a standard example being industrial disasters, in particular energy disasters. Determining the exposure zone and the period during which potential carcinogenic substances remain active after an accident is a tricky question. For example, following the steam explosion at the Chernobyl nuclear power plant on April 26, 1986, health authorities estimated several hundred thousand additional cancer deaths over time, and fallout could be detected as far away as Canada. To be fully rigorous, one should take into account the exact volume of emissions and fallout over different countries and over time. Due to data limitations, we have made simplistic assumptions: for Chernobyl, the indicator variable we created takes the value 1 from 1986 to 1990 in Ukraine, Russia and other neighboring countries.

Interaction Features

The next type of feature engineering involves highlighting interactions between two or more features, which combined can provide more information than they would individually. In practice this amounts to taking the sum, difference, product or quotient of multiple variables. Using an automated loop to create interactions across all features leads to a "feature explosion" and is not a relevant computational strategy; instead, one has to choose, based on domain reasoning, which interactions to include in the dataset.

Feature Representation

Sometimes we can gain information by representing the same feature in a different way. For example, given open weather data providing the average daily temperature for each country, we can aggregate observations to create features such as the average annual temperature by country, or the number of days per year on which the average temperature falls below or above a threshold. Such features can enable us to assess a potential link between heat waves and different cancers.

Another aggregation strategy likely to increase model performance is to attach to each categorical feature a new feature counting how many observations fall into each category. For a row where a categorical variable is "A", its counting counterpart displays the number of cases with category "A" over all observations.
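This count-encoding step can be sketched in a few lines (a Python illustration; the data layout is assumed to be a simple list of category labels):

```python
from collections import Counter

def count_encode(categories):
    """For each row's category, return the number of observations
    sharing that category (count encoding of a categorical feature)."""
    counts = Counter(categories)
    return [counts[c] for c in categories]
```

For example, `count_encode(["A", "B", "A", "A", "C"])` returns `[3, 1, 3, 3, 1]`.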

Data Normalization


Feature scaling is a method used to standardize the range of features. Standardizing the features so that they are centered around 0 with a standard deviation of 1 is not only important if we are comparing measurements that have different units, but it is also a general requirement for many machine learning algorithms. Indeed, since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization. For example, the majority of classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance. Another reason why feature scaling is applied is that gradient descent (often used in logistic regression, SVMs, perceptrons, neural networks ...) converges much faster with feature scaling than without it: with features being on different scales, certain weights may update faster than others since the feature values play a role in the weight updates.



Feature standardization makes the values of each feature in the data have zero-mean (when subtracting the mean in the numerator) and unit-variance. This is typically done by calculating standard scores or Z-scores.

x^{\prime}= \dfrac{x-\bar{x}}{\sigma}

where x is the original feature vector, \bar{x} is the mean of that feature vector, and \sigma is its standard deviation.
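A minimal sketch of this transformation (Python here for illustration; the project's code base is in R):

```python
import statistics

def standardize(x):
    """Z-score a feature vector: subtract the mean, divide by the
    (population) standard deviation, yielding zero mean and unit variance."""
    mu = statistics.fmean(x)
    sigma = statistics.pstdev(x)
    return [(v - mu) / sigma for v in x]
```

After the transformation the feature has mean 0 and standard deviation 1, so all standardized features contribute on a comparable scale.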

Min-Max rescaling

An alternative approach to Z-score normalization is the so-called Min-Max scaling (often also simply called “normalization” - a common cause for ambiguities). In this approach, the data is scaled to a fixed range - usually 0 to 1. The cost of having this bounded range - in contrast to standardization - is that we will end up with smaller standard deviations, which can suppress the effect of outliers.

The simplest method is rescaling the range of features to scale the range in [0, 1] or [−1, 1]. Selecting the target range depends on the nature of the data. The general formula is given as:

x^{\prime}= \dfrac{x-\min(x)}{\max(x) - \min(x)}

where x is the original value and x^{\prime} is the normalized value.
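The rescaling to an arbitrary target range [lo, hi] can be sketched as (a Python illustration of the formula above):

```python
def min_max_scale(x, lo=0.0, hi=1.0):
    """Rescale x linearly so that min(x) maps to lo and max(x) maps to hi
    (Min-Max scaling; the default range is [0, 1])."""
    xmin, xmax = min(x), max(x)
    return [lo + (hi - lo) * (v - xmin) / (xmax - xmin) for v in x]
```

For example, `min_max_scale([10.0, 20.0, 30.0])` returns `[0.0, 0.5, 1.0]`.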

NA driven space and time subsampling

We enable the user to play with two parameters: a zone (e.g. a subset of countries) and a threshold of missing data that triggers a feature's extraction if the proportion of missing data pertaining to this variable is below the threshold. These parameters feed a general function -we will refer to it as the subsampling function for convenience- which extracts the desired subdataset. In practice the output will be a smaller dataset containing:

  1. only the rows for which the area variable is in the zone set by the user
  2. only the columns whose missing data proportion over these rows is below the threshold
  3. if the optional argument is activated: in the subset arising from these subsampling operations we finally keep only the rows whose missing data proportion over the remaining columns is below a threshold -we will refer to as a time threshold- whose value can differ from the threshold used for features retrieval at step 2.

Leaving rows out is activated through an optional argument by the user in the above algorithm because it may be problematic in our time series context. For each cancer under consideration and each country, we have indeed 56 rows corresponding to the 56 years observed. To avoid skipping years, we must recover every row at the end of the subsampling procedure. That is why, by default, we can only drop features within the subsampling algorithm described above.
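The three steps above can be sketched as follows. This is a Python illustration under a hypothetical data layout (each row a dict with an 'area' key, None marking a missing cell); the actual subsampling function is part of the project's R code:

```python
def subsample(rows, zone, na_threshold, time_threshold=None):
    """Sketch of the subsampling function:
    1. keep rows whose area is in `zone`;
    2. keep columns whose missing proportion over those rows is <= na_threshold;
    3. optionally (time_threshold set) drop rows whose missing proportion
       over the kept columns exceeds time_threshold."""
    kept = [r for r in rows if r["area"] in zone]
    cols = [c for c in rows[0] if c != "area"]

    def na_rate(col):
        return sum(r[col] is None for r in kept) / len(kept)

    keep_cols = ["area"] + [c for c in cols if na_rate(c) <= na_threshold]
    out = [{c: r[c] for c in keep_cols} for r in kept]
    if time_threshold is not None:
        data_cols = [c for c in keep_cols if c != "area"]
        out = [r for r in out
               if sum(r[c] is None for c in data_cols) / len(data_cols)
               <= time_threshold]
    return out
```

By default `time_threshold` is left unset, so only columns are dropped and every year is preserved, as required by the time series context.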

This function will be helpful to deploy the three following space-based modelling strategies :

  • country-level: the zone we specify in the aforementioned function is a single country and we train and test models on the resulting restricted dataset -the subset corresponding to this country- whose columns satisfy the NA threshold constraint we have retained.
  • continent-level: the zone we specify in the subsampling function is a continent. Features retrieval is then processed in a manner consistent with the subset pertaining to the specified continent and the selected missing data tolerance.
  • homogeneous cluster: here the zone we specify is a pool of countries having similar patterns in terms of, e.g., income per capita (or any other criterion we would like to consider as a pooling variable) and we work on the dataset arising from the subsampling function applied to this cluster. If we have built clusters forming a partition of the world, we can cover the world by repeating this procedure for each cluster.

Imputation strategies, localization and sampling size

Following Gelman there are three types of missing data:

  • MCAR: Missing Completely At Random. The events leading to any particular data-item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random. This is the desirable scenario in case of missing data. If data are missing completely at random, then throwing out cases with missing data does not bias your inferences.
  • MAR: Missing At Random. A better name would be Missing (Conditionally) At Random. Given the observed data, the missingness mechanism does not depend on the unobserved data. The missing data mechanism can be expressed solely in terms of observations that are observed. MCAR is a special case of MAR. It is very common to use the terms MAR and ignorability interchangeably. When an outcome variable is missing at random, it is acceptable to exclude the missing cases (that is, to treat them as NA’s), as long as the regression controls for all the variables that affect the probability of missingness.
  • MNAR: Missing Not At Random. Missingness is no longer “at random” if it depends on information that has not been recorded and this information also predicts the missing values. For example, suppose -an example given by Gelman- that “surly” people are less likely to respond to the earnings question, surliness is predictive of earnings, and “surliness” is unobserved. Valid estimation requires that the missing-data mechanism be modeled as part of the estimation process, or else you must accept some bias in your inferences. Another case where missingness is no longer "at random" is when the probability of missingness depends on the (potentially missing) variable itself. For example, suppose that people with higher earnings are less likely to reveal them. In the extreme case this is called censoring.

Missingness at random is relatively easy to handle -simply include as regression inputs all variables that affect the probability of missingness. Unfortunately, we generally cannot be sure whether data really are missing at random, or whether the missingness depends on unobserved predictors or the missing data themselves. The fundamental difficulty is that these potential “lurking variables” are unobserved -by definition- and so we can never rule them out. In practice, we typically try to include as many predictors as possible in a model so that the “missing at random” assumption is reasonable. In our open data framework, we will make the simplistic MAR assumption.

Even when assuming data is MCAR, too much missing data can be a problem too. Usually a safe maximum threshold is 5% of the total for large datasets. If missing data for a certain feature or sample exceeds 5%, then we should leave that feature or sample out. This rule of thumb can provide guidance for the subsampling algorithm described above.

If the amount of missing data is very small relative to the size of the dataset, then leaving out the few samples with missing features may be the best strategy in order not to bias the analysis. However, leaving out available datapoints deprives the data of some amount of information and, depending on the situation you face, you may want to look for other fixes before wiping out potentially useful datapoints from your dataset. A different approach to missing data is to try to fill in the missing values, and then analyze the resulting dataset as a complete dataset. Little and Rubin (2002) provide an accessible treatment of imputation and multiple imputation methods. Imputing missing values is not always valid, of course. Most imputation methods depend on a missing at random (MAR) assumption.

For many reasons missing data imputation is a crucial preprocessing step. From a mathematical viewpoint one may think this data cleaning issue is of little interest; that would be to ignore a rich statistical literature on these questions. Above all, neglecting this key stage would have dire learning and modelling consequences: without clean and robust data, no interesting processing can be hoped for. Clean data is a sine qua non condition. In the raw, extended dataset we obtain from merging cancer data (WHO), World Bank data, FAO data and ILO data, the number of missing cells is very large and no sample is free of missing features. It is fundamental to remain aware of the data limitations and to establish appropriate safeguards against potentially fallacious and misleading conclusions. That is why this particular situation requires a great deal of precision and discipline, and demands guidance before establishing any statistical diagnosis and before claiming to have reached some form of confidence and robustness.

A good method for handling missing data should

  • Minimize bias
  • Maximize the use of available information
  • Yield accurate estimates of uncertainty, i.e. standard errors, confidence intervals and p-values
  • accomplish all of the above without making unnecessarily restrictive assumptions about the missing data mechanism

Maximum likelihood (ML) and multiple imputation (MI) do very well at satisfying these criteria; so-called conventional methods are deficient in one or more of these goals.

To automate these cleaning operations as fully as possible, we have developed a modular procedure enabling the user to select one -or a mix- of the different imputation strategies we describe below.

Missing-data methods that discard data

Complete case analysis

A direct approach to missing data is to exclude them. In the regression context, this usually means complete-case analysis. Two problems arise with complete-case analysis:

  1. If the units with missing values differ systematically from the completely observed cases, this could bias the complete-case analysis.
  2. If many variables are included in a model, there may be very few complete cases, so that most of the data would be discarded for the sake of a simple analysis.

In our context, complete-case analysis would lead us to remove all the data!

Simple missing-data methods that retain all the data

A good imputation model should:

  • Account for the process that created the missing data.
  • Preserve the relations in the data.
  • Preserve the uncertainty about these relations.

The hope is that adherence to these principles will yield imputations that are statistically correct.

Mean imputation

Here we consider a numeric variable and replace missing values using the mean among non missing values. To be consistent with the subsampling function run by the user, i.e. zone and missing data threshold retained, we need this function to depend on those parameters. In other words, for each feature satisfying the NA threshold constraint over the rows corresponding to the set of countries under consideration, the mean which replaces the missing values after imputation is computed only over these rows. We have also included another binary parameter in the function we will refer to as a localization parameter. If this argument is true, the mean we impute is country-specific. For example, suppose the user retains a zone gathering Germany and Spain. Then for each variable, the mean which replaces the Germany missing values -for this variable- after imputation is computed over the available rows in the Germany subset, and the same approach is applied to Spain. Thus, our mean imputation algorithm is at least zone-specific and at most country-specific.

Imputing the mean preserves the mean of the observed data. So if the data are missing at random, the estimate of the mean remains unbiased. Unfortunately, this strategy can severely distort the distribution for this variable, leading to complications with summary measures including, notably, underestimates of the standard deviation. Moreover, mean imputation distorts relationships between variables by “pulling” estimates of the correlation toward zero.
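The zone-specific versus country-specific behavior described above can be sketched as follows. This is a Python illustration under a hypothetical dict-based layout (each row has a 'country' key, None marks a missing cell, and each group is assumed to contain at least one observed value per column); the actual implementation is in R:

```python
def mean_impute(rows, localize=False):
    """Replace None cells of each numeric column by the column mean,
    computed either over the whole zone (localize=False) or country
    by country (localize=True)."""
    cols = [c for c in rows[0] if c != "country"]

    def fill(group):
        for c in cols:
            obs = [r[c] for r in group if r[c] is not None]
            mean = sum(obs) / len(obs)  # assumes at least one observed value
            for r in group:
                if r[c] is None:
                    r[c] = mean

    if localize:
        for country in {r["country"] for r in rows}:
            fill([r for r in rows if r["country"] == country])
    else:
        fill(rows)
    return rows
```

With a zone gathering Germany and Spain, `localize=True` fills German gaps with the German mean and Spanish gaps with the Spanish mean, while `localize=False` uses the pooled zone mean.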

Median imputation

Here again, we consider a numeric variable and replace missing values using the median among available data. Again, to be consistent with the subsampling function run by the user, i.e. zone and missing data threshold retained, we need our median imputation algorithm to be zone-specific or country-specific according to the user's needs.

Imputing the median preserves the median of the observed data. Even if the data are missing at random, the estimate of the mean becomes biased whenever the median differs from the mean. Once again, median imputation decreases the variable's variance and attenuates any correlations involving the imputed variables.

LOCF imputation

One form of hot-deck imputation is called "last observation carried forward" (or LOCF for short), which involves sorting a dataset according to any of a number of variables, thus creating an ordered dataset. In our case the dataset is ordered by country and year. The technique then finds the first missing value and uses the cell value immediately prior to the data that are missing to impute the missing value. The process is repeated for the next cell with a missing value until all missing values have been imputed. This represents the belief that if a measurement is missing, the best guess is that it hasn't changed from the last time it was measured. This method is known to increase bias and to lead to potential false conclusions.

In our context -47 rows by country in the raw dataset- we need the LOCF algorithm to be implemented country by country -even if the user selects several countries in the zone under consideration. A problem arises if the first row -first year- is missing. Indeed, we have no data prior to it. In this case each missing value is imputed with the value from the first following record that has an observed value.
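The per-country LOCF pass, including the back-fill of a missing first year described above, can be sketched as (a Python illustration over one country's year-ordered series; the actual implementation is in R):

```python
def locf_impute(series):
    """Last observation carried forward over a year-ordered series for a
    single country. A leading gap (missing first years) is back-filled
    from the first observed value, as no prior observation exists."""
    out = list(series)
    last = None
    for i, v in enumerate(out):
        if v is None:
            out[i] = last  # carry the previous observation forward
        else:
            last = v
    # back-fill any leading gap with the first observed value
    first = next((v for v in out if v is not None), None)
    for i, v in enumerate(out):
        if v is None:
            out[i] = first
        else:
            break
    return out
```

For example, `locf_impute([None, None, 2, None, 5])` returns `[2, 2, 2, 2, 5]`.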

Mode imputation

As far as categorical variables are concerned, imputation is usually delicate. A common practice is to replace missing categorical values with the mode of the observed ones, what we call Concept Most Common Attribute Value Fitting (CMCF). However, it is questionable whether this is a good choice.

Indicator variables for missingness of categorical predictors

For unordered categorical predictors, a simple and often useful approach to imputation is to add an extra category for the variable indicating missingness.

Indicator variables for missingness of continuous predictors

A popular approach is to include for each continuous predictor variable with missingness an extra indicator identifying which observations on that variable have missing data. Then the missing values in the partially observed predictor are replaced by zeroes or by the mean (this choice is essentially irrelevant). This strategy is prone to yield biased coefficient estimates for the other predictors included in the model because it forces the slope to be the same across both missing-data groups. Adding interactions between an indicator for response and these predictors can help to alleviate this bias (this leads to estimates similar to complete-case estimates).

Using (generalized) linear regression predictions to perform deterministic imputation

In such models, missing values are imputed by regression on the other features to attempt to capture the relationships between various features. In a linear regression framework, the cell corresponding to the ith observation and the jth feature -in case there is a missing value- is imputed from the other features as follows:

\hat{x}_{i,j} = \hat{\beta}_{0} + \sum_{k \neq j, k \text{included in the regressors}} \hat{\beta}_{k} x_{i,k},

where \hat{\beta}_{k} is the estimated linear regression coefficient for the variable x_{k}. The choice of the regressors included in the imputation model is non-neutral insofar as we can only estimate (generalized) linear models from complete cases, i.e. observations having missing features get dropped.
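As a minimal, one-regressor instance of this scheme (a Python sketch; the function name is illustrative and the project's imputation code is in R), ordinary least squares is fitted on complete cases and then used to predict the missing cells:

```python
def regression_impute(x_obs, y_obs, x_missing):
    """Deterministic imputation of y from a single regressor x: fit OLS
    on the complete cases (x_obs, y_obs), then predict y at the x values
    where y is missing."""
    n = len(x_obs)
    mx = sum(x_obs) / n
    my = sum(y_obs) / n
    # OLS slope and intercept from the complete cases
    beta1 = (sum((x - mx) * (y - my) for x, y in zip(x_obs, y_obs))
             / sum((x - mx) ** 2 for x in x_obs))
    beta0 = my - beta1 * mx
    return [beta0 + beta1 * x for x in x_missing]
```

With x_obs = [1, 2, 3] and y_obs = [2, 4, 6], the fitted line is y = 2x, so a row with x = 4 and y missing is imputed as 8.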

In our context -large number of missing data and no complete cases in the raw dataset- choosing too many regressors would lead us to estimate the model from a very small number of observations. Thus we have to select rigorously our imputation regressors in case we adopt this approach. Benjamin Schannes designed these two algorithms -unsupervised and supervised- to address those needs.

  • Supervised features extractor: You start by fixing the feature -we will refer to it as the target- on which you need to perform imputations. Then you run every possible simple regression of this target on any other variable in the dataset arising from the subsampling function and order the variables by decreasing R2. The user has to set four parameters: a first parameter specifying the minimum absolute correlation you want between each predictor and the response indicator of the variable to be imputed; a second parameter specifying the maximum pairwise absolute correlation you wish to allow for any pair of predictors in the list; a third parameter specifying the minimum proportion of usable cases; and a fourth, optional parameter specifying the minimum proportion of complete cases you want to achieve throughout your imputation procedure. Generally speaking, the proportion of usable cases measures how many cases with missing data on the target variable actually have observed values on the predictor. The proportion will be low if both target and predictor are missing on the same cases. If so, the predictor contains little information to impute the target variable, and could be dropped from the model, especially if the bivariate relation is not of primary scientific interest. Now take the first variable -in the list of variables ordered by decreasing R2- satisfying the first constraint as the first element of the list of predictors we will retain for the imputation model. Then pick the next variable in the list ordered by R2. If it satisfies the constraints, you append it to the list of regressors. Otherwise you do not include it. In both cases you repeat this step for the following variable in terms of R2: if it satisfies the set of constraints with the variable(s) in the list of regressors, you include it; otherwise you do not append it to the list. 
You repeat the same process for each variable by decreasing order of R2 until all the features have been inspected, and extract the final list of regressors. An optional refinement enables the user to add a constraint on the minimum complete-cases proportion to ensure along the procedure. Fixing this proportion at 1 makes it possible to keep the same multivariate linear regression imputation model to impute each missing value of the target variable. Otherwise you have to take a different, suitable subset of regressors at each imputation step. In practice the final imputation procedure is sequential and we use the following algorithm: for each missing value to impute, perform the imputation using the model providing the largest R2. If no regressor is missing, we have a complete case and the full multivariate model is run. Otherwise we impute the missing value using the model containing every variable that is not missing.
  • Unsupervised features extractor: for any subset produced by the subsampling function parameterized by the user, the function extracts a list whose first element is the variable having the highest variance. The user also has to set four parameters: a first parameter specifying the minimum absolute correlation you need between each variable and each response indicator -the indicator for missingness- attached to any other variable in the list; a second parameter specifying the maximum pairwise absolute correlation you wish to allow for any pair of predictors in the list; a third parameter specifying the minimum proportion of pairwise usable cases; and a fourth, optional parameter specifying the minimum proportion of complete usable cases. This fourth parameter has the following meaning: for each predictor in the list considered as a target, you need a minimum ratio of cases where missing data on the target variable correspond to completely observed values on the other predictors. The denominator of this target-specific ratio is the number of cases with missing data on the target variable. Having set these four parameters -minimum response correlation, maximum pairwise correlation and minimum proportion of (pairwise and complete) usable cases- you consider the variable having the second highest variance. If it satisfies those constraints you include it in the list. Otherwise you do not include it. In both cases, you then inspect the variable having the third highest variance. If it satisfies the set of constraints with the variable(s) in the list you append it to the list. Otherwise you do not add it. You repeat the same process for each variable by decreasing order of variance until all the features have been inspected. This function is relevant to select a 'self-sufficient' set of variables within which deterministic imputations by regression on the other variables of the set can easily be performed. 
If the zone retained in the subsampling function gathers at least two countries, you may wish to avoid any spurious ranking based on the variance (when the source of a feature's dispersion amounts to variation between countries rather than within). To do so, you can activate an argument specifying whether country-specific variances should be computed and then averaged to provide the variance estimate used to rank the variables by decreasing order of variance.

Regression methods have several drawbacks: they are difficult to apply unless (a) we observe the same features for many datapoints or (b) the regression method used can handle missing data. Also, (c) this imputation scheme is deterministic and only accounts for the part of the feature that can be predicted by an LM or GLM from the other features.

To deal with (a) the supervised features extractor is an interesting development. We can alternatively perform iterated regression where we first impute all the missing values using some baseline technique (like imputing the means) and then iteratively refine imputations by regressing each feature with missing values on the remaining features (making use of the imputations from the prior round for the regression covariates).

With regard to (b), several supervised learning procedures can cope with missing data out of the box. Some tree based methods like CART treat `missing' as a special value. We can also modify methods to deal with missing values. For example, in k-nearest neighbors and other methods based on pairwise dissimilarity, we can modify a pairwise dissimilarity measure (like mean squared distance) to be computed only over those features which are observed for both datapoints.

To address point (c), we could perform multiple imputation which better accounts for the uncertainty inherent in imputation by forming a posterior distribution over unobserved values and sampling many possible imputations from this distribution. Each imputed dataset can then be used for follow-up analysis.

k-NN imputation

In k nearest neighbor imputation one defines a distance function $d(i,j)$ that computes a measure of dissimilarity between records. A missing value is then imputed by finding first the $k$ records nearest to the record with one or more missing values. Next, a value is chosen from or computed out of the $k$ nearest neighbors. In the case where a value is picked from the $k$ nearest neighbors, kNN-imputation is a form of hot-deck imputation.

It is very common to use Gower's distance to determine the $k$ nearest neighbors. Gower's distance between two records labeled i and j is defined as

d_{g}(i,j) = \dfrac { \sum_{k} w_{ijk} d_{k}(i,j)}{ \sum_{k} w_{ijk} }

where the sum runs over all variables in the record and d_{k}(i,j) is the distance between the values of variable k in records i and j. For categorical variables, d_{k}(i,j)=0 when the value of the kth variable is the same in records i and j, and 1 otherwise. For numerical variables the distance is given by |x_{i}-x_{j}|/(\max(x) - \min(x)). The weight w_{ijk} = 0 when the kth variable is missing in record i or record j, and 1 otherwise.

For numerical variables the median of the $k$ nearest neighbors is used as imputation value by default, for categorical variables the category that most often occurs in the $k$ nearest neighbors is used by default. Both these functions may be replaced by the user.
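The distance itself can be sketched as follows (a Python illustration assuming dict records and a precomputed max-min range per numerical variable, and using the normalized absolute difference |x_i - x_j|/(max - min) as the numerical contribution):

```python
def gower_distance(ri, rj, ranges):
    """Gower distance between two records (dicts). `ranges` maps each
    numerical variable to its max - min; remaining keys are treated as
    categorical. Variables missing (None) in either record get weight 0."""
    num, den = 0.0, 0.0
    for k in ri:
        if ri[k] is None or rj[k] is None:
            continue  # w_ijk = 0 when the variable is missing in i or j
        if k in ranges:  # numerical: normalized absolute difference
            num += abs(ri[k] - rj[k]) / ranges[k]
        else:            # categorical: 0 if equal, 1 otherwise
            num += 0.0 if ri[k] == rj[k] else 1.0
        den += 1.0
    return num / den
```

The k records minimizing this distance to an incomplete record are then its nearest neighbors for imputation.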

A localization argument can be activated if the user wants the $k$ nearest neighbors to be found within the subset of rows corresponding to each of the different countries in the zone under consideration through the subsampling procedure. Again suppose the user retains a zone gathering Germany and Spain. If a record from the Germany subset has missing values, the $k$ nearest neighbors used for the imputation are taken from the Germany subset and not from the whole zone -Germany and Spain- subdataset. Curse-of-dimensionality considerations encourage us not to activate this argument; on the contrary, we need to use as much data as possible -by default, the $k$ nearest neighbors are computed over all the rows and columns satisfying the NA threshold constraints set in the subsampling function, not only the rows corresponding to the zone set by the user. Given the large number of variables in the dataset -even after subsampling and dropping columns whose missing value proportion exceeds a small threshold- and the small number of records, this imputation method is not robust in our curse-of-dimensionality context. It would become relevant if we decided to keep a very small number of variables, perhaps at most 10.

Mixing k-NN and polynomial smoothing considerations, we also let the user have access to imputation by local regression (LOESS).

When using KNN imputation the choice of the tuning parameter k can have a large effect on the performance of the imputation. However, this parameter is not known beforehand. We can obtain a suitable k via cross-validation techniques.


Another imputation strategy that arises in the context of time series or other sequential data, where a strong relationship exists among contiguous values of a numeric feature, is to impute by interpolating between neighboring values.

The interpolation algorithm must be country-specific and shall take into account missing years whenever the subsampling function set by the user implies dropping rows that do not satisfy the selected NA threshold constraint. Suppose a user wants to focus on France and suppose that the row corresponding to the year 1992 has been left out when we keep only the rows whose missing data proportion over the retained columns is below the chosen threshold. Take a numeric feature for which the 1991 and 1994 values are present but the 1993 value is missing. To obtain the 1993 value by linear interpolation you cannot take the arithmetic mean of the 1991 and 1994 values and apply it to 1993. In the weighted mean, the weights applied to the 1991 and 1994 values must be respectively 1/3 and 2/3.

We let the user choose the interpolation method -and the order for (piecewise) polynomial methods- he wants to implement:

  • linear
  • polynomial
  • spline, if the user wants to mitigate the Runge's phenomenon
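The 1991/1994 weighting example above can be checked with a small year-aware linear interpolation sketch (Python for illustration; the options listed above are implemented in the project's R code):

```python
def interpolate_linear(years, values, target_year):
    """Linear interpolation of a missing value at `target_year` from the
    nearest observed years, weighting by distance in years so that
    skipped years are handled correctly."""
    obs = [(y, v) for y, v in zip(years, values) if v is not None]
    y0, v0 = max((y, v) for y, v in obs if y < target_year)
    y1, v1 = min((y, v) for y, v in obs if y > target_year)
    w = (target_year - y0) / (y1 - y0)  # e.g. (1993-1991)/(1994-1991) = 2/3
    return (1 - w) * v0 + w * v1
```

With observed values in 1991 and 1994 and a target year of 1993, the weights are indeed 1/3 on the 1991 value and 2/3 on the 1994 value.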
Imputation with Random Forests

This method of imputation can handle any type of input data and makes as few assumptions as possible about structural aspects of the data. Random forest -RF, Breiman (2001)- is able to deal with mixed-type data and, as a non-parametric method, it allows for interactive and non-linear (regression) effects. Here we have to use an iterative imputation scheme. First we train a RF on observed values. Then we predict the missing values and proceed iteratively. RF is known to perform very well under challenging conditions such as high dimensionality, complex interactions and non-linear data structures. Due to its accuracy and robustness, RF is well suited for our needs. Furthermore, the RF algorithm allows us to estimate out-of-bag (OOB) error rates without the need for a test set. The MissForest algorithm we use was proposed by Peter Bühlmann and Daniel J. Stekhoven in 2012, see Stekhoven, D.J. and Bühlmann, P.[17].

Other methods

There are many other deterministic imputation methods for matrices using, for instance, more refined nearest neighbours techniques or least squares methods, SVD decompositions, matrix completion techniques, low-rank approximations, nuclear norm minimization or gradient descent on Grassmann manifolds. An interesting overview is given by Achiya Dax[18]. We have developed R scripts to enable the user to select some of these methods.

Multiple Imputation

Best-fit imputation conveys a false sense of accuracy if the imputed values are interpreted as ordinary observations. Rubin (1978) proposed drawing multiple random imputations of the missing data, rather than a single best-fit imputation. Variability of results between the randomly imputed datasets can then be used to assess the accuracy of the parameter of interest. This assessment is often carried out by means of a Bayesian updating scheme, quite different in concept from the confidence intervals obtained by combining nonparametric bootstrap techniques with best-fit imputation values, see Efron[19].

Multiple imputation (MI) is currently the most popular method to deal with missing data. Based on assumptions about the data distribution (and the mechanism which gives rise to the missing data) missing values can be imputed by means of draws from the posterior predictive distribution of the unobserved data given the observed data.

Multiple imputation involves imputing M values for each missing cell in the data matrix and creating M “completed” data sets. Across these completed data sets, the observed values are the same, but the missing values are filled in with a distribution of imputations that reflect the uncertainty about the missing data.
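A deliberately crude sketch of this idea (a Python illustration only: each None is filled with a draw from a normal distribution fitted to the observed values, standing in for the posterior predictive distribution, and at least two observed values are assumed):

```python
import random
import statistics

def multiple_impute(values, M=5, seed=0):
    """Produce M completed copies of `values`, filling each None with a
    random draw from a normal distribution fitted to the observed values.
    Observed cells are identical across copies; imputed cells vary,
    reflecting imputation uncertainty."""
    rng = random.Random(seed)
    obs = [v for v in values if v is not None]
    mu, sd = statistics.fmean(obs), statistics.stdev(obs)
    return [[v if v is not None else rng.gauss(mu, sd) for v in values]
            for _ in range(M)]
```

Each of the M completed datasets is then analyzed, and the between-copy variability of the estimates quantifies the uncertainty due to the missing data.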

Typical problems that may surface while imputing multivariate missing data are:

  • For a given variable for which we need to impute missing cells, predictors used in the imputation model may themselves be incomplete.
  • Circular dependence can occur, where a variable $x_{j}$ depends on $x_{l}$ and $x_{l}$ depends on $x_{j}$ because in general $x_{j}$ and $x_{l}$ are correlated, even given other variables.
  • Especially with large p (number of variables) and small n (data size), collinearity and empty cells may occur.
  • Rows or columns can be ordered, e.g., as with longitudinal data.
  • Variables can be of different types (e.g., binary, unordered, ordered, continuous), thereby making the application of theoretically convenient models, such as the multivariate normal, theoretically inappropriate.
  • The relation between $x_{j}$ and the other predictors could be complex, e.g., nonlinear, or subject to censoring processes.
  • Imputation can create impossible combinations (e.g., pregnant fathers), or destroy deterministic relations in the data (e.g., sum scores).
  • Imputations can be nonsensical.

This list is by no means exhaustive, and other complexities may appear for particular data.


Let $\mathcal{D}$ be a $n \times (p+1)$ data matrix consisting of an outcome variable $y=(y_{1},\cdots,y_{n})$ and covariates $X_{j}=(X_{1j},\cdots,X_{nj})^{\prime}, j=1,\cdots,p.$ The $1 \times p$ vector $x_{i}=(x_{i1},\cdots,x_{ip})$ contains the $i$th observation of each of the $p$ covariates and $X=(x_{1}^{\prime},\cdots,x_{n}^{\prime})^{\prime}$ is the matrix of all covariates. We are interested in estimating $\theta=(\theta_{1},\cdots,\theta_{k})^{\prime}, k \geq 1,$ which may be, e.g., a vector of regression coefficients. The data matrix consists of both observed and missing values, i.e. $\mathcal{D}=(\mathcal{D}^{\text{obs}},\mathcal{D}^{\text{mis}}).$

For the purpose of inference we are interested in the observed data posterior distribution of $\theta | \mathcal{D}^{\text{obs}}$ which is


\begin{align*} P \left( \theta | \mathcal{D}^{\text{obs}} \right) &= \int P \left( \theta | \mathcal{D}^{\text{obs}}, \mathcal{D}^{\text{mis}} \right) P \left( \mathcal{D}^{\text{mis}} | \mathcal{D}^{\text{obs}} \right) d \mathcal{D}^{\text{mis}} \\
&= \int P \left( \theta | \mathcal{D}^{\text{obs}}, \mathcal{D}^{\text{mis}} \right) \left\{ \int P \left( \mathcal{D}^{\text{mis}} | \mathcal{D}^{\text{obs}}, \vartheta \right) P \left( \vartheta | \mathcal{D}^{\text{obs}} \right) d \vartheta \right\} d \mathcal{D}^{\text{mis}} \end{align*}


$\vartheta$ refers to the parameters of the imputation model whereas $\theta$ is the quantity of interest. These relations are data augmentation identities. With multiple imputation we approximate the latter integral by using the average


P \left( \theta | \mathcal{D}^{\text{obs}}   \right) \approx \dfrac{1}{M} \ \sum_{m=1}^{M} P \left( \theta | {\mathcal{D}_{(m)}}^{\text{mis}} , \mathcal{D}^{\text{obs}}   \right)


where ${\mathcal{D}_{(m)}}^{\text{mis}}$ refers to draws (imputations) from the posterior predictive distribution $P \left( {\mathcal{D}_{(m)}}^{\text{mis}} | \mathcal{D}^{\text{obs}} \right).$ Tanner and Wong (1987) investigated an iterative algorithm, related to Gibbs sampling, for carrying out this method. Each iteration consists of an imputation step and a posterior step.

The role of the bootstrap in a missing data context

The use of the bootstrap in the context of missing data has often been viewed as a frequentist alternative to multiple imputation (Efron), or as an option to obtain confidence intervals after single imputation. Some imputation algorithms use bootstrapping to create multiple imputations; this must not be confused with the bootstrapping step we can combine with mainstream multiple imputation techniques to better control uncertainty in the estimation.

Little and Rubin (2002)[20] recommend a three step procedure for using multiple imputation with bootstrap standard errors:

  1. Generate bootstrap samples from the unimputed data;
  2. Impute missing values in each bootstrap sample;
  3. Run MI analyses in each of the bootstrap samples.
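
A minimal Python sketch of this ordering, for estimating a mean with missing values; a single mean imputation stands in for a full MI step and the bootstrap percentile interval stands in for a proper pooling rule (both are simplifying assumptions of the sketch):

```python
import numpy as np

def bootstrap_then_impute(y, n_boot=500, seed=0):
    """Little-Rubin ordering: (1) resample the *unimputed* data,
    (2) impute within each bootstrap sample (here: mean imputation
    as a stand-in for proper multiple imputation), (3) run the
    analysis in each completed sample and pool the results."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_boot):
        sample = rng.choice(y, size=len(y), replace=True)                 # step 1
        filled = np.where(np.isnan(sample), np.nanmean(sample), sample)  # step 2
        estimates.append(filled.mean())                                  # step 3
    est = np.asarray(estimates)
    return est.mean(), np.percentile(est, [2.5, 97.5])
```

The key point the sketch preserves is the order of operations: resampling happens before imputation, so the interval reflects both sampling and imputation uncertainty.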

Through both simulation and theoretical considerations, Schomaker & Heumann[21] also demonstrate that it is in general preferable to bootstrap the original data and then apply multiple imputation to each bootstrapped dataset, rather than to bootstrap datasets that have first been multiply imputed.

It is important to understand that proper imputations are created by means of random draws from the posterior predictive distribution of the missing data given the observed data (or an approximation thereof).

  1. Following Schafer (1997), we can specify a multivariate distribution for the data (joint modelling, JM) and simulate the posterior predictive distribution with a suitable algorithm. Schafer (1997) developed various JM techniques for imputation under the multivariate normal or the log-linear model;
  2. We can specify individual conditional distributions for each variable $X_{j}$ given the other variables (fully conditional specification FCS) and iteratively draw and update imputed values from these distributions which will then (ideally) converge to draws of the theoretical joint distribution (when it exists);
  3. Alternative algorithms.
Joint Modelling

Here one considers the framework put forward by Efron and assumes that the complete data (that is, both observed and unobserved) are multivariate normal. Categorical variables are recoded into dummy variables, based on the knowledge that for binary variables the multivariate normal assumption can yield good results (see Schafer, J., & Graham, J.[22]). We also make the usual assumption in multiple imputation that the data are missing at random (MAR). Because we need to estimate a parameter of interest -e.g. linear regression coefficients- in a missing data situation, we bring bootstrap approaches to bear on the question of assigning confidence intervals to this parameter. The different algorithms combining bootstrap and multiple imputation methods often require computing maximum likelihood estimates (MLE), for which the Expectation-Maximization (EM) algorithm plays a crucial role from a computational point of view.

JM involves specifying a multivariate distribution for the missing data and drawing imputations from their conditional distributions by Markov chain Monte Carlo (MCMC) techniques, following the Tanner and Wong approach described above. This methodology is attractive if the multivariate distribution is a reasonable description of the data. We can also combine JM with the bootstrap. Recall that the data are assumed to be $\mathcal{N}\left( \mu,\Sigma \right).$ We can draw B bootstrap samples from the data (including missing values) and apply the EM algorithm in each bootstrap sample to obtain estimates of $\mu$ and $\Sigma$, which can then be used to generate proper multiple imputations by means of the sweep operator.

Fully Conditional Specification

FCS specifies the multivariate imputation model on a variable-by-variable basis by a set of conditional densities, one for each incomplete variable. Starting from an initial imputation, FCS draws imputations by iterating over the conditional densities. A low number of iterations (say 10-20) is often sufficient. FCS is attractive as an alternative to JM in cases where no suitable multivariate distribution can be found. The basic idea of FCS is already quite old, and has been proposed under a variety of names: stochastic relaxation (Kennickell 1991[23]), variable-by-variable imputation (Brand 1999[24]), regression switching (van Buuren et al. 1999), sequential regressions (Raghunathan et al. 2001[25]), ordered pseudo-Gibbs sampler (Heckerman et al. 2001[26]), partially incompatible MCMC (Rubin 2003[27]), iterated univariate imputation (Gelman 2004[28]), MICE (van Buuren and Oudshoorn 2000[29]; van Buuren and Groothuis-Oudshoorn 2011) and FCS (van Buuren 2007[30]).

To impute by chained equations (ICE), one first specifies individual conditional distributions (i.e. regression models) $p\left(Y_{j} | Y_{-j}, \theta_{j} \right)$ for each variable $Y_{j}, j=1,\cdots,p,$ where $Y_{-j}$ stands for the random vector of all the variables except $Y_{j}$. Then, (i) one iteratively fits all regression models and generates random draws of the coefficients, e.g. $\tilde{\beta} \sim \mathcal{N}\left(\hat{\beta}, \widehat{\text{Cov}\left(\hat{\beta}\right)}\right)$, and (ii) values are imputed as random draws from the distribution of the regression predictions. Steps (i) and (ii) are repeated k times until convergence. The process of iteratively drawing and updating the imputed values from the conditional distributions can be viewed as a Gibbs sampler that converges to draws from the (theoretical) joint distribution. However, theoretical concerns remain, as a joint distribution may not always exist for a given specification of the conditional distributions.

Indeed, in order to address the issues posed by the real-life complexities of the data, it is convenient to specify the imputation model separately for each column in the data. A big advantage of conditional (rather than joint) modeling is that it splits a k-dimensional problem into k one-dimensional problems, each of which can be attacked flexibly. However, in general, reasonable-seeming conditional models will not be compatible with any single joint distribution. From a practical imputation perspective, A. Gelman and T.E. Raghunathan[31] note that it makes sense to go with the inconsistent, but flexible, conditional models. In order to bypass the limitations of joint models, Gelman (2004, p. 541) concludes: "Thus we are suggesting the use of a new class of model -inconsistent conditional distributions- that were initially motivated by computational and analytical convenience." We can thus implement a straightforward inconsistent Gibbs sampler.

A variation of (ii) is a fully Bayesian approach where the posterior predictive distribution is used to draw imputations. Here, let the hypothetically complete data X be a partially observed random sample from the multivariate distribution $P(X | \theta)$. We assume that the multivariate distribution of X is completely specified by $\theta$, a vector of unknown parameters. The problem is then how to obtain the multivariate distribution of $\theta$, either explicitly or implicitly. The MICE algorithm obtains the posterior distribution of $\theta$ by sampling iteratively from conditional distributions of the form


\begin{align*}
& P \left( X_{1} ~ | ~ X_{2},\cdots,X_{p} \right) \\
&\vdots \\
& P \left( X_{p} ~ | ~ X_{1},\cdots,X_{p-1} \right)
\end{align*}


The parameters $\theta_{1},...,\theta_{p}$ are specific to the respective conditional densities and are not necessarily the product of a factorization of the `true' joint distribution $P(X | \theta)$. Starting from a simple draw from the observed marginal distributions, the $t$th iteration of chained equations is a Gibbs sampler that successively draws

\begin{align*} \theta^{*(t)}_{1} &\sim P \left( \theta_{1} | X_{1}^{\text{obs}},X_{2}^{(t-1)},...,X_{p}^{(t-1)} \right) \\
X^{*(t)}_{1} &\sim P \left( X_{1} | X_{1}^{\text{obs}},X_{2}^{(t-1)},...,X_{p}^{(t-1)},\theta^{*(t)}_{1} \right) \\
&\vdots \\
\theta^{*(t)}_{p} &\sim P \left( \theta_{p} | X_{p}^{\text{obs}}, X_{1}^{(t)},X_{2}^{(t)},...,X_{p-1}^{(t)} \right) \\
X^{*(t)}_{p} &\sim P \left( X_{p} | X_{p}^{\text{obs}}, X_{1}^{(t)},X_{2}^{(t)},...,X_{p-1}^{(t)}, \theta^{*(t)}_{p} \right) \end{align*}

where $X_{j}^{*(t)}$ is the vector of imputed values for the missing part of the $j$th variable at iteration t and $X_{j}^{(t)}=\left( X_{j}^{\text{obs}}, X_{j}^{*(t)} \right)$ is the $j$th imputed variable at iteration t. For every variable, the observed part is never changed during the process. Note that the previous imputations $X^{*(t-1)}_{j}$ enter $X^{*(t)}_{j}$ only through their relations with the other variables, and not directly. It is important to monitor convergence, but the number of iterations can often be small, around 10 or 20.
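
The sampler can be sketched in Python as follows; ordinary least-squares fits plus Gaussian residual noise stand in for genuine posterior draws of $(\theta_{j}, X_{j})$, which keeps the sketch dependency-free apart from NumPy:

```python
import numpy as np

def mice_sketch(X, n_iter=15, seed=0):
    """One chain of chained equations: iterate over the incomplete
    columns, regress each on all the others, and replace its missing
    cells by the regression prediction plus residual noise (a crude
    stand-in for a draw from the posterior predictive)."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    miss = np.isnan(X)
    # initialise by random draws from each column's observed values
    for j in range(X.shape[1]):
        obs_vals = X[~miss[:, j], j]
        X[miss[:, j], j] = rng.choice(obs_vals, size=int(miss[:, j].sum()))
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            obs = ~miss[:, j]
            others = np.delete(np.arange(X.shape[1]), j)
            A = np.column_stack([X[:, others], np.ones(len(X))])
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            sigma = (X[obs, j] - A[obs] @ beta).std()
            X[miss[:, j], j] = (A[miss[:, j]] @ beta
                                + rng.normal(scale=sigma,
                                             size=int(miss[:, j].sum())))
    return X
```

As in the formal scheme, the observed cells never change, and each column's imputations depend on the others only through the current completed matrix.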

Here again, the bootstrap can be used to model the imputation uncertainty and to draw the M imputations needed for the M imputed data sets.

Full Bootstrap

An example is the Approximate Bayesian Bootstrap (Rubin and Schenker). Here, the (cross-sectional) data are stratified into several strata, possibly by means of the covariates of the analysis model. Then, within each stratum, (a) one draws a bootstrap sample among the complete cases (with respect to the variable to be imputed). Then, (b) one takes the original data set (with missing values) and imputes the missing data using units from the data set created in (a), drawn with equal selection probability and with replacement. The multiply imputed data are obtained by repeating (a) and (b) M times.
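
A pure-standard-library Python sketch of steps (a) and (b) for a single stratum; the function name and the use of `None` to mark missing cells are our conventions:

```python
import random

def abb_impute(values, M=5, seed=0):
    """Approximate Bayesian Bootstrap within one stratum.
    For each of the M imputations: (a) draw a bootstrap sample of
    the complete cases, then (b) fill every missing cell by a draw,
    with replacement, from that bootstrap sample."""
    rng = random.Random(seed)
    complete = [v for v in values if v is not None]
    imputations = []
    for _ in range(M):
        donors = [rng.choice(complete) for _ in complete]        # step (a)
        filled = [v if v is not None else rng.choice(donors)     # step (b)
                  for v in values]
        imputations.append(filled)
    return imputations
```

The extra bootstrap layer in (a) is what makes the imputations "proper": donor uncertainty is propagated into the M completed datasets.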

OpenCancer module

We let the user run the main algorithms detailed above:

  • JM: with or without Bootstrap
  • FCS: the user can choose to run ICE alone or with Bootstrap. We also enable the user to select and adjust the final imputation method (e.g. predictive mean matching or log-linear regression) within this ICE context.
  • Multiple imputation by bootstrap only

The problem is that the huge number of variables -a high-dimensional pattern- and of missing data makes the implementation of Bayesian multiple imputation very challenging. Gibbs sampling may be unstable or even fail. That is why, in practice, we have to limit the number of predictors on which we perform multiple imputation. Keeping only the features with the greatest variance, under the correlation and usable-cases constraints described in the subsection above, seems to be a necessary and reasonable prerequisite. Recall also that a straightforward application of stochastic methods, such as the Gibbs sampler, is likely to create impossible combinations or destroy deterministic relations in the data. Thus it can be desirable to add a layer ensuring consistent imputations in the case of deterministic relations, e.g. the passive imputation methods we can run within the ICE context.

Towards a history-dependent policy: past-dependent features and distributed lag modelling

The raw dataset arising from the merging procedures described above enables us to link cancer incidence for each year and each country to the feature values pertaining to that year and that country. At this stage we do not take the past into account. For several variables, e.g. any variable quantifying exposure to pesticides, not only the current value is important but also the past values. At the individual scale, the full exposure history is required to quantify the risk. At an aggregate level we have to replicate this history-dependent, or semi-Markov, pattern. Thus, for each feature whose full history is relevant to current cancer incidence, we need to construct a feature summarizing the full history of its past values.

The crudest strategy is probably to include all the available lags for each feature and impute NA when a lag is missing. With this approach, the number of lags to include depends on the age group we consider: recall that incidence data are grouped by gender and age group, so for the "20-30" age group it seems useless to include lags of order higher than 31 years. We can also impose a priori restrictions on the weights. For example, one can assume that the coefficients of the lagged values -in a linear regression context- follow a pattern of exponential decay, i.e. the impact of successive lags declines exponentially. While such an approach reduces the number of parameters to be estimated, there is still a large number of regressors; this can be addressed by the Koyck transformation.
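
The crude all-lags strategy can be sketched as follows; representing a country's yearly feature series as a `dict` from year to value is our convention, and `None` marks a lag that would be imputed later:

```python
def add_lags(series, max_lag):
    """Build lagged copies of a yearly feature series.
    Returns, for each year, a row mapping 'lag_k' to the value
    observed k years earlier (None when that year is absent)."""
    return {year: {f"lag_{k}": series.get(year - k)
                   for k in range(max_lag + 1)}
            for year in series}
```

In practice `max_lag` would be capped per age group, as noted above.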

Another, less naive, approach is to use the cross-correlation to determine the lags to include. For each feature, we include only the lags for which the associated cross-correlation with the cancer incidence series is above a user-defined threshold. This strategy reduces the final dataset dimension compared with the integration of all the lags for every feature.

A relevant and complementary approach is to attach to each -non-lagged- regressor a single feature summarizing the full past exposure. In practice this new predictor will depend on the past values of the underlying variable. Again, it will be relative to the considered age group, and we can assign decreasing weights to successive lags to construct a weighted moving average or sum. To determine relevant values for these weights, the age pyramid is of primary importance.
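
Such a summary feature can be sketched as a weighted moving average with exponentially decreasing weights; the decay rate and horizon below are illustrative parameters that would in practice be tuned per age group using the age pyramid:

```python
def exposure_summary(series, year, horizon, decay=0.9):
    """Summarize past exposure by a weighted moving average of the
    last `horizon` lags, with weight decay**k for lag k; weights
    are renormalised over the years actually available."""
    total, weights = 0.0, 0.0
    for k in range(horizon + 1):
        v = series.get(year - k)
        if v is not None:
            w = decay ** k
            total += w * v
            weights += w
    return total / weights if weights else None
```

Renormalising over the available years avoids penalising countries with shorter recorded histories, at the cost of treating missing years as missing at random.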

To summarize, we enable the user to deploy both lags integration and weighted moving average strategies.

From consistent data to learning algorithms

We aim at bringing out local relations that help understand individual cancers under certain conditions. Prior localization is one of the most convenient and intuitive strategies to tackle the unavoidable trade-off between accuracy and interpretability that we face when we run learning algorithms.

For medical reasons it is crucial to ensure interpretability. Accuracy is also required to provide safeguards in order to use the models' results to take public policy decisions.

Methodological remarks

For every 4-tuple (cancer type, country, gender, age group), the merged dataset provides panel data: we observe the cancer incidence rate and the regressors for each year between 1961 and 2017, after having imputed missing data.

To estimate multivariate linear regressions, we can use Ordinary Least Squares techniques, which require particular assumptions: weak exogeneity, linearity, homoscedasticity, independence of errors, lack of perfect multicollinearity.

With panel data, we have to be more stringent in terms of assumptions, in particular with the strict exogeneity requirement. We also add to the standard error term a term controlling for the unobserved time-invariant individual effect. Without this new set of assumptions, a standard OLS regression would suffer from omitted-variable bias. Two important models are the fixed effects model and the random effects model.
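
As an illustration, the fixed effects model can be estimated by the within transformation: demean every variable inside each group (here, country) before running OLS, which removes the time-invariant individual effect. A NumPy sketch (the function name is ours):

```python
import numpy as np

def within_ols(y, X, groups):
    """Fixed-effects (within) estimator: subtract each group's mean
    from y and X, then run OLS on the demeaned data; the unobserved
    time-invariant group effect drops out."""
    y = np.asarray(y, dtype=float).copy()
    X = np.asarray(X, dtype=float).copy()
    groups = np.asarray(groups)
    for g in np.unique(groups):
        m = groups == g
        y[m] -= y[m].mean()
        X[m] -= X[m].mean(axis=0)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta
```

The demeaning makes the slope estimate immune to any country-specific intercept, which is exactly the omitted-variable bias the paragraph above warns about.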

We will take these constraints into account every time we have to run a linear regression after having selected a relevant subset of covariates.

Features selection automation algorithms

Feature selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for four reasons:

  • simplification of models to make them easier to interpret by researchers/users
  • shorter training times
  • to avoid the curse of dimensionality
  • enhanced generalization by reducing overfitting (formally, reduction of variance)

The central premise when using a feature selection technique is that the data contain many features that are either redundant or irrelevant, and can thus be removed without incurring much loss of information. Redundancy and irrelevance are two distinct notions, since one relevant feature may be redundant in the presence of another relevant feature with which it is strongly correlated.

In our context, we can run models on observations pertaining to a single country -in which case we have at most 56 observations per 3-tuple (cancer type, gender, age group)- or to a continent or a cluster of homogeneous countries within the different partitions we have built. With a total number of variables exceeding 6000, it is necessary to select a subset of relevant features. A standard rule of thumb is that for n observations, it is good to keep around $\ln n$ variables. For example, at the country modelling level, it would be desirable to keep only around 4 variables per country and 3-tuple (cancer type, gender, age group).

Probabilistic feature relevance

A feature $X_{i}$ can be:

  • Strongly relevant: $X_{i}$ brings information that no other feature contains.
  • Weakly relevant: $X_{i}$ brings information that also exists in other features and brings information in conjunction with other features.
  • Irrelevant: neither strongly relevant nor weakly relevant.
  • Relevant: not irrelevant.

Main Methods

A feature selection algorithm can be seen as the combination of a search technique proposing new subsets of features with an evaluation measure scoring the different subsets. The simplest algorithm is to test each possible subset of features and retain the one which minimizes the error rate; this is an exhaustive search of the space and is computationally intractable. The choice of evaluation metric heavily influences the algorithm, and these evaluation metrics define the three main categories of feature selection algorithms: wrappers, filters and embedded methods:

  • Wrapper methods use a predictive model and error rates to score subsets of features. They are very computationally intensive. In traditional statistics, the most popular form of feature selection is stepwise regression, which is a very greedy wrapper technique. Other wrapper techniques are forward selection and backward elimination. Forward selection starts with an empty set and incrementally adds features in order to find all strongly relevant features; it may miss some weakly relevant features. Backward elimination starts with the full set of features and incrementally removes features, keeping at the end all strongly relevant features; it may eliminate some weakly relevant features. Wrapper methods also carry the risk of overfitting the validation set.
  • Filter methods use a proxy measure instead of the error rate to score a subset: e.g. mutual information, pointwise mutual information, the Pearson product-moment correlation coefficient, inter/intra class distance or the scores of significance tests for each feature combination. Filters are usually less computationally intensive than wrappers but usually give lower prediction performance. Many filters provide a feature ranking rather than an explicit best feature subset, and the cut-off point in the ranking is chosen via cross-validation. Filter methods can also be used as a preprocessing step for wrapper methods.
  • Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process. The exemplar of this approach is the LASSO method for constructing a linear model, which penalizes the regression coefficients with an L1 penalty, shrinking many of them to zero. Any features which have non-zero regression coefficients are 'selected' by the LASSO algorithm. Improvements to the LASSO include Bolasso which bootstraps samples, and FeaLect which scores all the features based on combinatorial analysis of regression coefficients. One other popular approach is the Recursive Feature Elimination algorithm, commonly used with Support Vector Machines to repeatedly construct a model and remove features with low weights. These approaches tend to be between filters and wrappers in terms of computational complexity.

L1 regularization can weed out an exponential number of irrelevant features. SVMs do not have magic properties for filtering out irrelevant features; they perform best when dealing with many relevant features. $L_{1/2}$ regularization is non-convex and therefore hard to optimize, but works better than L1 in practice. Other greedy algorithms that incorporate features one by one are:

  • Decision trees: Each decision can be seen as a feature. Pruning the decision tree prunes the features.
  • Ensembles: e.g. Random Forests or Boosting methods.

We can mention other methods to select features:

  • we can rank features by decreasing order of variance, adding correlation constraints: a new feature is appended to the list if and only if its absolute pairwise correlation with the covariates already in the list is below a threshold;
  • we can rank the variables by decreasing order of R2 when we run a simple regression of the incidence rate on each feature.
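
The first method -variance ranking under a correlation constraint- can be sketched in Python as follows (the threshold and list size are illustrative defaults):

```python
import numpy as np

def variance_rank_select(X, max_corr=0.8, k=10):
    """Rank features by decreasing variance and greedily keep a new
    feature only if its absolute pairwise correlation with every
    feature already kept stays below max_corr."""
    order = np.argsort(X.var(axis=0))[::-1]
    kept = []
    for j in order:
        if all(abs(np.corrcoef(X[:, j], X[:, i])[0, 1]) < max_corr
               for i in kept):
            kept.append(int(j))
        if len(kept) == k:
            break
    return kept
```

When two high-variance features are near duplicates, only the first one in the ranking survives, which is exactly the redundancy filter described above.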

OpenCancer module

We let the user choose one of the following methods:

  • Variance ranking
  • R2 ranking arising from simple regressions of the incidence on each feature
  • Forward selection: the criterion used to retain the variables can be correlation, mutual information (in the latter cases we consider the variables by decreasing order of variance), AIC or validation error
  • LASSO or L1 regularization
  • Feature Importance using Random Forest
  • Compressed sensing techniques

Models estimation and prediction

Depending on the method, we have to maximize a likelihood or minimize an empirical loss. For some methods, we have to specify hyperparameters, e.g. the learning rate in a gradient descent context.

The user can interact with the models by changing the values of hyperparameters and fine-tune them depending on the metrics of interest.

To comply with the medical imperative of interpretability, the models we implement at a local level are mainly:

  • the linear regression -with panel data-
  • the linear regression with L0, LASSO or elastic net penalties, and the ridge regression
  • the generalized linear regression (e.g. with a beta specification for the incidence rate) with L0, L1 or L2 penalties
  • the MissGLasso algorithm, which provides more consistent estimates of the inverse covariance matrix in the high-dimensional multivariate normal model in the presence of missing data under the MAR assumption, see Städler, N. and Bühlmann, P.[32]

We also enable association rules computation.

Predictive rules

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. Based on the concept of strong rules, Rakesh Agrawal[33], Tomasz Imieliński and Arun Swami introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule <math>\{\mathrm{onions, potatoes}\} \Rightarrow \{\mathrm{burger}\}</math> found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat.


Example database with 5 transactions and 5 items

transaction ID | milk | bread | butter | beer | diapers
             1 |    1 |     1 |      0 |    0 |       0
             2 |    0 |     0 |      1 |    0 |       0
             3 |    0 |     0 |      0 |    1 |       1
             4 |    1 |     1 |      1 |    0 |       0
             5 |    0 |     1 |      0 |    0 |       0

Following the original definition by Agrawal, Imieliński, Swami the problem of association rule mining is defined as:

Let <math>I=\{i_1, i_2,\ldots,i_n\}</math> be a set of <math>n</math> binary attributes called items.

Let <math>D = \{t_1, t_2, \ldots, t_m\}</math> be a set of transactions called the database.

Each transaction in <math>D</math> has a unique transaction ID and contains a subset of the items in <math>I</math>.

A rule is defined as an implication of the form:

<math>X \Rightarrow Y</math>, where <math>X, Y \subseteq I</math>.

In Agrawal, Imieliński, Swami a rule is defined only between a set and a single item, <math>X \Rightarrow i_j</math> for <math>i_j \in I</math>.

Every rule is composed by two different sets of items, also known as itemsets, <math>X</math> and <math>Y</math>, where <math>X</math> is called antecedent or left-hand-side (LHS) and <math>Y</math> consequent or right-hand-side (RHS).

To illustrate the concepts, we use a small example from the supermarket domain. The set of items is <math>I= \{\mathrm{milk, bread, butter, beer, diapers}\}</math> and the table above shows a small database containing the items, where, in each entry, the value 1 means the presence of the item in the corresponding transaction, and the value 0 represents its absence.

An example rule for the supermarket could be <math>\{\mathrm{butter, bread}\} \Rightarrow \{\mathrm{milk}\}</math> meaning that if butter and bread are bought, customers also buy milk.

Note: this example is extremely small. In practical applications, a rule needs a support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.

Useful Concepts

In order to select interesting rules from the set of all possible rules, constraints on various measures of significance and interest are used. The best-known constraints are minimum thresholds on support and confidence.

Let <math>X</math> be an itemset, <math>X \Rightarrow Y</math> an association rule and <math>T</math> a set of transactions of a given database.


Support is an indication of how frequently the itemset appears in the dataset.

The support of <math>X</math> with respect to <math>T</math> is defined as the proportion of transactions <math>t</math> in the dataset which contains the itemset <math>X</math>.

<math>\mathrm{supp}(X) = \frac{|\{t \in T; X \subseteq t\}|}{|T|}</math>

In the example dataset, the itemset <math>X=\{\mathrm{beer, diapers}\}</math> has a support of <math>1/5=0.2</math> since it occurs in 20% of all transactions (1 out of 5 transactions). The argument of <math>\mathrm{supp}()</math> is a set of preconditions, and thus becomes more restrictive as it grows (instead of more inclusive).


Confidence is an indication of how often the rule has been found to be true.

The confidence value of a rule, <math>X \Rightarrow Y</math> , with respect to a set of transactions <math>T</math>, is the proportion of the transactions that contains <math>X</math> which also contains <math>Y</math>.

Confidence is defined as:

<math>\mathrm{conf}(X \Rightarrow Y) = \mathrm{supp}(X \cup Y) / \mathrm{supp}(X)</math>.

For example, the rule <math>\{\mathrm{butter, bread}\} \Rightarrow \{\mathrm{milk}\}</math> has a confidence of <math>0.2/0.2=1.0</math> in the database, which means that for 100% of the transactions containing butter and bread the rule is correct (100% of the times a customer buys butter and bread, milk is bought as well).

Note that <math>\mathrm{supp}(X \cup Y)</math> means the support of the union of the items in X and Y. This is somewhat confusing since we normally think in terms of probabilities of events and not sets of items. We can rewrite <math>\mathrm{supp}(X \cup Y)</math> as the probability <math>P(E_X \wedge E_Y)</math>, where <math>E_X</math> and <math>E_Y</math> are the events that a transaction contains itemset <math>X</math> and <math>Y</math>, respectively.

Thus confidence can be interpreted as an estimate of the conditional probability <math>P(E_Y | E_X)</math>, the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.


The lift of a rule is defined as:

<math> \mathrm{lift}(X\Rightarrow Y) = \frac{ \mathrm{supp}(X \cup Y)}{ \mathrm{supp}(X) \times \mathrm{supp}(Y) } </math>

or the ratio of the observed support to that expected if X and Y were independent.

For example, the rule <math>\{\mathrm{milk, bread}\} \Rightarrow \{\mathrm{butter}\}</math> has a lift of <math>\frac{0.2}{0.4 \times 0.4} = 1.25 </math>.

If the rule had a lift of 1, it would imply that the probability of occurrence of the antecedent and that of the consequent are independent of each other. When two events are independent of each other, no rule can be drawn involving those two events.

If the lift is > 1, that lets us know the degree to which those two occurrences are dependent on one another, and makes those rules potentially useful for predicting the consequent in future data sets.

The value of lift is that it considers both the confidence of the rule and the overall data set.


The conviction of a rule is defined as <math> \mathrm{conv}(X\Rightarrow Y) =\frac{ 1 - \mathrm{supp}(Y) }{ 1 - \mathrm{conf}(X\Rightarrow Y)}</math>.

For example, the rule <math>\{\mathrm{milk, bread}\} \Rightarrow \{\mathrm{butter}\}</math> has a conviction of <math>\frac{1 - 0.4}{1 - 0.5} = 1.2 </math>, and can be interpreted as the ratio of the expected frequency that X occurs without Y (that is to say, the frequency that the rule makes an incorrect prediction) if X and Y were independent divided by the observed frequency of incorrect predictions. In this example, the conviction value of 1.2 shows that the rule <math>\{\mathrm{milk, bread}\} \Rightarrow \{\mathrm{butter}\}</math> would be correct 20% more often (1.2 times as often) if the association between X and Y was purely random chance.
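
All four measures can be checked directly on the example database above with a few lines of Python (transactions encoded as sets):

```python
# the five transactions of the example database
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def supp(itemset):
    """Proportion of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def conf(lhs, rhs):
    return supp(lhs | rhs) / supp(lhs)

def lift(lhs, rhs):
    return supp(lhs | rhs) / (supp(lhs) * supp(rhs))

def conviction(lhs, rhs):
    return (1 - supp(rhs)) / (1 - conf(lhs, rhs))

# supp({beer, diapers}) = 0.2, conf({butter, bread} => {milk}) = 1.0,
# lift({milk, bread} => {butter}) ≈ 1.25, conviction ≈ 1.2, matching
# the worked values above.
```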


Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Association rule generation is usually split up into two separate steps:

  1. A minimum support threshold is applied to find all frequent itemsets in a database.
  2. A minimum confidence constraint is applied to these frequent itemsets in order to form rules.

While the second step is straightforward, the first step needs more attention.

Finding all frequent itemsets in a database is difficult since it involves searching all possible itemsets (item combinations). The set of possible itemsets is the power set over <math>I</math> and has size <math>2^n-1</math> (excluding the empty set which is not a valid itemset). Although the size of the power-set grows exponentially in the number of items <math>n</math> in <math>I</math>, efficient search is possible using the downward-closure property of support, which guarantees that for a frequent itemset, all its subsets are also frequent and thus no infrequent itemset can be a subset of a frequent itemset. Exploiting this property, efficient algorithms (e.g., Apriori and Eclat) can find all frequent itemsets.
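As an illustration of this level-wise search, here is a minimal Apriori sketch in Python (a toy implementation, not the project's code): size-k candidates are joined from frequent (k-1)-itemsets, and any candidate with an infrequent subset is pruned thanks to downward closure before its support is ever counted.

```python
from itertools import combinations

def apriori(transactions, min_supp):
    """Toy level-wise Apriori: find all itemsets with support >= min_supp."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= min_supp}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Join step: merge frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step (downward closure): drop candidates with an infrequent subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if support(c) >= min_supp}
        all_frequent |= frequent
        k += 1
    return all_frequent
```

The prune step is where the exponential search space collapses: a size-k candidate is only counted against the database if all of its (k-1)-subsets survived the previous level.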

Rules provide relevant tools to address our trade-off between accuracy and interpretability. Their purely statistical and probabilistic foundations constitute an interesting advantage: since there is no parametric model, the set of requirements is lighter than for the vast majority of other statistical and machine learning techniques. By construction, the association rule concept is also a relatively interesting way to evaluate causality, and not only correlation as in the linear regression case, which is of primary importance in the medical field. To apply it, we have to rearrange our dataset: numeric variables must be categorized, e.g. using quantiles, so that we can measure the support and confidence of rules linking binary events on the LHS to a binarized incidence on the RHS. Again, binarization requires a threshold above which we consider that incidence is 'high'.
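This rearrangement can be sketched as follows (Python standard library only, with hypothetical toy values and a hypothetical 'high incidence' threshold): the numeric feature is cut into quantile bins, the incidence is binarized, and support and confidence are then measured as for any binary rule.

```python
from statistics import quantiles

# Hypothetical toy series: a numeric feature and an incidence rate.
fruit_supply = [10.0, 25.0, 40.0, 55.0, 70.0, 85.0]
incidence = [30.0, 28.0, 22.0, 20.0, 15.0, 12.0]

# Categorize the numeric feature into terciles (quantile binning).
cuts = quantiles(fruit_supply, n=3)  # two cut points

def tercile(x):
    return "low" if x <= cuts[0] else ("mid" if x <= cuts[1] else "high")

# Binarize incidence with an (assumed) 'high incidence' threshold.
threshold = 21.0
lhs = [tercile(x) == "low" for x in fruit_supply]  # binary LHS event
rhs = [y > threshold for y in incidence]           # binarized RHS incidence

n = len(fruit_supply)
support = sum(l and r for l, r in zip(lhs, rhs)) / n
confidence = support / (sum(lhs) / n)
```

On this toy sample the rule "low fruit supply => high incidence" has support 1/3 and confidence 1; on real data the choice of bins and threshold of course drives which rules become visible.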

Predictive Platform

To appreciate the results, we are developing a Graphical User Interface with Angular which lets the user select a spatial granularity level (country, continent, clusters of similar countries as explained above) and displays the predicted incidence on an interactive world map with an appropriate coloring scheme.

The user also has to choose the 3-uple (cancer type, gender, age group) under consideration through an appropriate sidebar panel. We also allow the platform's user to focus on a particular field of interest, which amounts to keeping only the subset of variables pertaining to that field. For instance, if you select the class 'Food', the predictive models will be run on a restricted dataset containing only food-related features. At this stage the available domains are:

  • Food
  • Education
  • Economic factors
  • Demographic factors
  • Social protection and health factors
  • Environmental factors

Then the user adjusts the different preprocessing specifications detailed above:

  • NA threshold constraint used for subsampling
  • Binarization for outliers, missing data, 'carcinogenic events'
  • Interaction features protocol
  • Lags and/or Weighted Moving Average inclusion
  • Mix of imputation strategies and weights for each one
  • Parameters for each imputation method detailed above
  • First feature selection protocol, if the user wants to implement a preselection before the learning and modelling steps. For instance, the user may rank the variables by decreasing order of variance and keep the variables with the largest variance whose pairwise correlation, within the set of selected variables, does not exceed a certain threshold; in that case the user has to specify this threshold.
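The variance/correlation preselection described in the last item can be sketched as follows (a Python/NumPy illustration with hypothetical parameter values, not the platform's code): features are ranked by decreasing variance and greedily kept only if their absolute correlation with every already-selected feature stays below the user-specified threshold.

```python
import numpy as np

def preselect(X, names, max_corr=0.9, k=2):
    """Greedy variance/correlation preselection on a (samples x features) matrix."""
    order = np.argsort(X.var(axis=0))[::-1]  # feature indices by decreasing variance
    corr = np.corrcoef(X, rowvar=False)      # pairwise feature correlations
    selected = []
    for j in order:
        if all(abs(corr[j, s]) < max_corr for s in selected):
            selected.append(j)
        if len(selected) == k:
            break
    return [names[j] for j in selected]
```

A feature that is nearly a rescaled copy of an already-selected one is rejected, however large its variance, which is exactly the redundancy-removal behaviour we want before imputation or modelling.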

These specifications provide a fine-tuned learning dataset and the user can now specify the processing protocol:

  • Model under consideration
  • Learning rate if any
  • Number of iterations
  • Last layer of features selection in case of backward elimination or forward selection

Then the user can click on a particular country and the final variables pertaining to this country will appear on a specific widget. By default, the variables are at their current values. The user can change the values and observe the impact on cancer incidence that is predicted by the model.

Time Series Platform

We also provide a Time Series platform enabling the user to visualize the series for each 3-uple (cancer type, country or zone, gender) the user specifies. The user can obtain the main time series statistics: autocorrelation, partial autocorrelation and the Ljung-Box statistic. Cross-correlations can be displayed for every feature. ARIMA models can be estimated for every 3-uple (cancer type, country or zone, gender), and forecasts arising from the ARIMA modelling are also available.

In terms of causality assessment, we complement the cross-correlation measures by implementing Granger tests for hypotheses of the form: "For a given gender and country, a feature f causes a specific cancer's incidence in Granger's sense".
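The Granger test amounts to comparing a restricted autoregression of the target on its own lags with an unrestricted one that also includes lags of the candidate feature; the F statistic measures how much the feature's lags reduce the residual sum of squares. A minimal NumPy sketch (a toy p-lag version, not the platform's implementation):

```python
import numpy as np

def granger_f(y, x, p=1):
    """F statistic for 'x Granger-causes y' with p lags (toy implementation)."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    n = len(y)
    target = y[p:]
    lags_y = np.column_stack([y[p - k:n - k] for k in range(1, p + 1)])
    lags_x = np.column_stack([x[p - k:n - k] for k in range(1, p + 1)])
    ones = np.ones((n - p, 1))
    X_r = np.hstack([ones, lags_y])          # restricted model: own lags only
    X_u = np.hstack([ones, lags_y, lags_x])  # unrestricted model: plus lags of x

    def rss(A):
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        return np.sum((target - A @ beta) ** 2)

    rss_r, rss_u = rss(X_r), rss(X_u)
    df = (n - p) - X_u.shape[1]
    return ((rss_r - rss_u) / p) / (rss_u / df)
```

A large F, compared with the F(p, df) critical value, supports the Granger-causality hypothesis; running the test in both directions helps flag the asymmetries the cross-correlations alone cannot show.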

Numerical application: some results pertaining to colon cancer

We now present some results obtained with colon cancer as a case study.

Computational framework

To benefit from a vast number of preprocessing, statistical and machine learning libraries, we have chosen to develop in the R environment. We have also worked with C++ pointers rather than data frames imported into R memory (hence into RAM), a trick made possible by the R bigmemory package. The memory gain has a cost in terms of flexibility, since working with pointers requires C++ functions. However, a series of packages (mostly biglasso and biganalytics) make it possible to apply statistical functions to pointers. For instance, the big.simplelasso function we created performs feature selection on an OpenCancer dataframe imported as a pointer. By default, we perform cross-validation on 5 folds.

For details, we encourage the reader to consult the vignettes available via the url https://github.com/EpidemiumOpenCancer/OpenCancer/tree/master/vignettes.


We now focus on colon cancer, by gender and age group. We start by cleaning the data. We apply deductive corrections in case of deterministic relations between variables. We identify the false zeros and replace them with "NA". We include outlier and NA dummies. We include lags whose absolute cross-correlation with colon cancer incidence exceeds 0.3. We include indicator variables for two carcinogenic events, the Chernobyl and Fukushima nuclear power plant explosions. Then we only retain the features having at least 60% of non-missing values, via the subsampling function.
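The lag-selection rule can be sketched as follows (Python/NumPy, with a hypothetical feature and the 0.3 threshold used above): for each candidate lag k, we correlate incidence at time t with the feature at time t - k and keep the lags whose absolute correlation exceeds the threshold.

```python
import numpy as np

def lagged_corr(incidence, feature, lag):
    """Correlation between incidence[t] and feature[t - lag]."""
    return np.corrcoef(incidence[lag:], feature[:len(feature) - lag])[0, 1]

def select_lags(incidence, feature, max_lag=5, threshold=0.3):
    """Keep the lags whose absolute cross-correlation exceeds the threshold."""
    return [k for k in range(1, max_lag + 1)
            if abs(lagged_corr(incidence, feature, k)) > threshold]
```

The selected lags are then added to the design matrix as extra columns, one per retained (feature, lag) pair.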

To impute missing data, we start by applying local, i.e. country-specific, linear interpolation strategies. A more realistic approach would be to implement the stochastic multiple imputations described above, but the Gibbs sampler is relatively unstable due to the large number of variables and the structure of the missing data. A preselection of variables based on criteria taking into account variance, pairwise correlation and the proportion of usable cases (as in the best-fit imputation algorithm described above and referred to as the "supervised features extractor") is a sound preliminary step to stabilize MCMC procedures.
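The country-specific linear interpolation can be sketched as follows (a Python/NumPy toy with hypothetical data, where each country's yearly series is interpolated independently; boundary gaps are clamped to the nearest observed value):

```python
import numpy as np

def interpolate_series(series):
    """Linearly interpolate the NaN entries of one country's yearly series."""
    series = np.asarray(series, dtype=float)
    t = np.arange(len(series))
    obs = ~np.isnan(series)
    # np.interp fills interior gaps linearly and clamps outside the observed range.
    return np.interp(t, t[obs], series[obs])

# Applied country by country, e.g. on a {country: series} mapping (hypothetical data):
raw = {"FR": [1.0, np.nan, 3.0, np.nan, np.nan, 9.0]}
imputed = {c: interpolate_series(s) for c, s in raw.items()}
```

Keeping the interpolation local to each country avoids blending heterogeneous national trends into a single fitted line.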

Having imputed missing data, we run a LASSO by 3-uple (country or zone, gender, age group) to induce sparsity and retain the most important variables.

We run a final linear regression on the most important variables, adjusting the number of variables to the number of observations in the stratum under consideration.
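The LASSO-then-refit pipeline of these two steps can be sketched as follows (a NumPy toy using proximal gradient descent, not the biglasso-based project code): the l1 penalty zeroes out unimportant coefficients, and an ordinary least squares regression is then refit on the surviving variables only.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=5000):
    """Toy LASSO via proximal gradient (ISTA) on the objective
    (1/2n)||y - Xb||^2 + lam * ||b||_1 : induces sparsity in b."""
    n, p = X.shape
    beta = np.zeros(p)
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        beta = beta - step * (X.T @ (X @ beta - y) / n)                    # gradient step
        beta = np.sign(beta) * np.maximum(np.abs(beta) - step * lam, 0.0)  # soft threshold
    return beta

def refit_ols(X, y, beta_lasso, tol=1e-8):
    """Refit ordinary least squares on the variables LASSO kept nonzero."""
    kept = np.flatnonzero(np.abs(beta_lasso) > tol)
    coef, *_ = np.linalg.lstsq(X[:, kept], y, rcond=None)
    return kept, coef
```

The refit removes the shrinkage bias the l1 penalty leaves on the retained coefficients, so the final linear regression stays interpretable while the number of variables is adjusted to the stratum size.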


Here we provide temporary results based on a crude linear regression of colon cancer's incidence rate on four features selected by LASSO. As an illustration, the four "most important" features are in our case (a (+), resp. a (-), indicating that the incidence rate increases, resp. decreases, with the factor):

  1. PM2.5 air pollution, mean annual exposure (micrograms per cubic meter) (+)
  2. Smoking prevalence, males (% of adults) (+)
  3. Domestic supply quantity of fruits - Excluding wine (-)
  4. Unemployment rate (%) (+)


Linking and merging macro open data is a question of primary importance in order to take advantage of the massive data provided by international organizations. Such operations then enable analytical studies of public interest.

In our case we aim at building a tool helping health authorities to take public policy decisions in terms of cancer prevention. We have developed automatic methods to clean the data as rigorously as possible, with a systematic missing data assessment. We have developed tools to extract the most important variables and build interpretable and robust models, which is crucial from a medical point of view. We have deployed a Time Series platform to take advantage of both the temporal statistics we can exhibit and the diagnoses we can provide in terms of causality assessment through Granger's test. We are building an interactive GUI to visualize how incidence predictions react to changes in the features' values, for a range of models satisfying interpretability constraints. Every step is under the user's control through appropriate and intuitive parameters. We are adding causality diagnoses enabled by ATE estimation in a linear regression framework and by association rules in a non-parametric and probabilistic framework.

Clinical trials are probably the most efficient way to understand and quantify the impact of a carcinogenic substance on a particular cancer incidence rate. But real-life constraints are far more complex: everyone is exposed to a variable and uncountable mix of carcinogenic substances, habits and behaviours. In a clinical laboratory, it would be prohibitive to study the impact of every combination of substances we observe in real life: the universe is too big, and the price of a single clinical analysis is far from negligible. Of course we cannot reach the same diagnostic accuracy with the simple aggregate open data we can grasp on public websites. But it is now feasible, for nothing but computational time, to assess the impact on cancer incidence of every feature interaction and the joint impact of any subset of features available in our datasets. With open data we can therefore mimic clinical trial conditions for arbitrary subsets of features at a much lower cost. In spite of the massive missing data issue, we argue that such approaches are appropriate to provide parametric and robust tools, which become highly relevant in a public policy context.

  1. Arnold, M., Sierra, M.S., Laversanne, M., Soerjomataram, I., Jemal, A., & Bray, F. (2016): Global patterns and trends in colorectal cancer incidence and mortality.
  2. Favoriti, P., Carbone, G., Greco, M., Pirozzi, F., Pirozzi, R. E. M., & Corcione, F. (2016). Worldwide burden of colorectal cancer : A review
  3. Kritchevsky, D. (1995) : Epidemiology of fibre, resistant starch and colorectal cancer.
  4. Chao, A., Thun, M. J., Connell, C. J., McCullough, M. L., Jacobs, E. J., Flanders, W. D., Calle, E. E. (2005) : Meat consumption and risk of colorectal cancer.
  5. Amersi, F., Agustin, M., & Ko, C. Y. (2005). Colorectal cancer: Epidemiology, risk factors, and health services. Clinics in Colon and Rectal Surgery,
  6. Boyle, P., & Langman, J. S. (2000). ABC of colorectal cancer: Epidemiology.
  7. Hu, F. B., Manson, J. E., Liu, S., Hunter, D., Colditz, G. A., Michels, K. B., Giovannucci, E. (1999). Prospective study of adult onset diabetes mellitu
  8. Jess, T., Loftus, E. V., Jr., Velayos, F. S., Harmsen, W. S., Zinsmeister, A. R., Smyrk, T. C., Sandborn, W. J. (2006). Risk of intestinal cancer in inflammatory bowel disease: A population-based study from olmsted county,
  9. Chan, A. T., Giovannucci, E. L., Meyerhardt, J. A., Schernhammer, E. S., Curhan, G. C., & Fuchs, C. S. (2005). Long-term use of aspirin and nonsteroidal anti-inflammatory drugs and risk of colorectal cancer.
  10. Won Jin Lee, Dale P. Sandler, Aaron Blair, Claudine Samanic, Amanda J. Cross, Michael C. R. Alavanja, Pesticide use and colorectal cancer risk in the Agricultural Health Study,Int J Cancer. Author manuscript; available in PMC 2010 Aug 27. Published in final edited form as: Int J Cancer. 2007 Jul 15; 121(2): 339–346. doi: 10.1002/ijc.22635
  11. El-Tawil, AM, Colorectal cancer and pollution, World J Gastroenterol. 2010 Jul 28; 16(28): 3475–3477. Published online 2010 Jul 28. doi: 10.3748/wjg.v16.i28.3475
  12. Butterworth, A. S., Higgins, J. P., & Pharoah, P. (2006). Relative and absolute risk of colorectal cancer for individuals with a family history: A meta-analysis.
  13. Dove-Edwin, I., Sasieni, P., Adams, J., & Thomas, H. J. (2005). Prevention of colorectal cancer by colonoscopic surveillance in individuals with a family history of colorectal cancer: 16 year, prospective, follow-up study.
  14. Schulz, K.F., & Grimes, D.A. (2002): Generation of allocation sequences in randomised trials: chance, not choice. Lancet, 359(9305), 515-519. doi:10.1016/S0140-6736(02)07683-3. http://www.who.int/entity/rhl/LANCET_515-519.pdf
  15. Rosenberger, James: Design of Experiments. Pennsylvania State University. https://onlinecourses.science.psu.edu/stat503/node/16 (accessed 24 September 2012)
  16. Granger, C. W. J. (1969): Investigating Causal Relations by Econometric Models and Cross-spectral Methods. Econometrica, 37(3). doi:10.2307/1912791
  17. Stekhoven, D.J. and Bühlmann, P., MissForest—non-parametric missing value imputation for mixed-type data, Bioinformatics, Volume 28, Issue 1, 1 January 2012, Pages 112–118, https://doi.org/10.1093/bioinformatics/btr597
  18. Dax, A., Imputing Missing Entries of a Data Matrix: A review, Columbia International Publishing Journal of Advanced Computing, Vol. 3 No. 3 pp. 98-222, 2014, doi:10.7726/jac.2014.100
  19. Efron., B., Missing data, imputation, and the bootstrap. Journal of the American Statistical Association, 89(426), 1994.
  20. Little, R., & Rubin., D., Statistical analysis with missing data. Wiley, New York, 2002.
  21. Schomaker, M., & Heumann, C., Bootstrap Inference when using Multiple Imputation, 2017
  22. Schafer, J., & Graham., J., Missing data: our view of the state of the art. Psychological Methods, 7:147-177, 2002.
  23. Kennickell, A.B. (1991): Imputation of the 1989 Survey of Consumer Finances: Stochastic Relaxation and Multiple Imputation. ASA 1991 Proceedings of the Section on Survey Research Methods, pp. 1-10.
  24. Brand, J.P.L. (1999): Development, Implementation and Evaluation of Multiple Imputation Strategies for the Statistical Analysis of Incomplete Data Sets. Ph.D. thesis, Erasmus University, Rotterdam.
  25. Raghunathan, T.E., Lepkowski, J.M., van Hoewyk, J., & Solenberger, P. (2001): A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models. Survey Methodology, 27, 85-95.
  26. Heckerman, D., Chickering, D.M., Meek, C., Rounthwaite, R., & Kadie, C. (2001): Dependency Networks for Inference, Collaborative Filtering, and Data Visualisation. Journal of Machine Learning Research, 1, 49-75.
  27. Rubin, D.B. (2003): Nested Multiple Imputation of NMES via Partially Incompatible MCMC. Statistica Neerlandica, 57(1), 3-18.
  28. Gelman, A. (2004): Parameterization and Bayesian Modeling. Journal of the American Statistical Association, 99(466), 537-545.
  29. van Buuren, S., & Oudshoorn, K. (2000): Multivariate Imputation by Chained Equations: MICE V1.0 User's Manual, volume Pg/Vgz/00.038. TNO Prevention and Health, Leiden. URL http://www.stefvanbuuren.nl/publications/mice%20v1.0%20manual%20tno00038%202000.pdf
  30. van Buuren, S. (2007): Multiple Imputation of Discrete and Continuous Data by Fully Conditional Specification. Statistical Methods in Medical Research, 16(3), 219-242.
  31. Gelman, A., Raghunathan, T.E., Using conditional distributions for missing-data imputations, 2001
  32. Städler, N. and Bühlmann, P., Missing values: sparse inverse covariance estimation and an extension to sparse regression, 2012
  33. Agrawal, R.; Imieliński, T.; Swami, A. (1993). "Mining association rules between sets of items in large databases". Proceedings of the 1993 ACM SIGMOD international conference on Management of data - SIGMOD '93. p. 207. doi:10.1145/170035.170072. ISBN 0897915925.