Cancerfood

From Epidemium

•Project team

•Project tools

•Project objectives & scope

•Data set

•Part 1: Defining human diets

oMethodology

oFirst results

oNext steps

•Part 2: Mortality prediction

oMethodology

oFirst results

oNext steps

1. Project team

•Tsvetina Bacheva

Data Analyst & Visualization expert

•Stéphane Héraud

Machine learning expert & Project Manager

•Nicolas Jeymond

Data Analyst expert

2. Project tools

•Code

Python (Keras, TF)

•Data visualization

Tableau

Matplotlib

•Project management

Trello

3. Project objectives & scope

•Study the relationship between cancer mortality and human diet across the world over the last 50 years.

•Use algorithms to extract information from the high volume of data provided by countries.

•Use a scientific methodology.

•Share the results for discussion.

4. Data set

•Food and Agriculture Organization (FAO)

Food production, usage and consumption per year and country between 1955 and 2012.

Data provided by 300+ countries

•World Health Organization (WHO)

Causes of death linked to cancer codes, per year and country, between 1950 and 2012.

Data provided by 80+ countries

5. Part 1: Defining human diets

The aim here is to start from a macro-level view and gain insight into the relationship between cancer and human diet. To do so, we group countries by food consumption, then analyze and describe the resulting diets and check whether they correspond to commonly accepted diet definitions.

Questions to answer:

•What are the main indicators characterizing a diet?

–Method: Characterize diets by identifying the foods with the highest consumption variance across all countries, using Principal Component Analysis (PCA).

–Deliverables:

•PCA algorithm description: assumptions and limitations.

•Run PCA on the 1970 and 2010 data and describe the first principal component through the top 20 foods characterizing a diet in each case.

–Open discussion: Does this diet definition seem relevant? How does it differ from other diet definitions?

•How many different diets can we find across the world?

–Method: take the reduced-dimensionality, uncorrelated data set produced by PCA and run the K-means clustering algorithm to determine K diets around the world and classify the countries in 1970 and 2010.

–Deliverable:

•Algorithm choice justification, assumptions and limitations.

•Data viz: world map showing the diet classification of countries in 1970, 1990 and 2010.

•Results discussion: how many diets exist in the world, evolution through time, clustering quality.

–Open discussion: Is this consistent with our existing knowledge?

•How do the diet maps match the cancer maps?

–Deliverable: Select a few cancers and compare the diet world map with the cancer mortality map.

Results discussion: can we see a correlation between certain cancers and the diet per country?

Diet Study: Principal Component Analysis

Defining main indicators for a diet using Principal Components Analysis (PCA)

-Idea:

Usually, human diets are described through food consumption in categories such as meat, vegetables and cereals. We will use an algorithm to determine which food components characterize diets around the world, with no input other than food consumption per country in 2010.

To achieve this, we assume that different diets are characterized by maximal differences in food consumption. In the new coordinate system determined by the algorithm, any two countries are separated by a certain distance in terms of food consumption. The greater the distance, the greater the difference in diet between the two countries.

-A second algorithm will then classify the countries and group them by diet.

-PCA Principle:

Converts the original variables (food categories, slide 7) into a set of uncorrelated variables with the highest possible variance. The new variables, or principal components, are linear combinations of the original variables. The first principal component has the largest possible variance. PCA is sensitive to the scaling of the original variables, so all variables need to be normalized.

-PCA Usage:

Identify patterns in a data set with a high number of variables. Reduce dimensionality. Express the data in a way that highlights differences.

Step by step:

-Step 1: data prep and standardization (normalization): DONE

-to get all variables with mean = 0 and standard deviation = 1.

-Step 2: run PCA on all FAO variables: DONE

-Next:

-Describe the first principal component through its top 20 contributors.

-Check clustering on the top 3 components.

-Select a subset of uncorrelated variables to reduce dimensionality while keeping most of the variance of the data set.

-Dataviz: plot the PCA with the top components and check whether clusters exist.
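
The PCA steps above (standardize, project, list the top contributors, measure the distance between two countries) can be sketched as follows. This is a minimal illustration using NumPy and random toy data, not the project's actual code: the food names, the 6 x 4 matrix and the helper names `standardize`, `pca` and `top_contributors` are all assumptions made for the example.

```python
import numpy as np

def standardize(X):
    """Scale each column (food variable) to mean 0 and standard deviation 1."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns become 0 instead of dividing by 0
    return (X - mean) / std

def pca(X_std, n_components):
    """PCA via SVD of a standardized matrix (rows = countries)."""
    U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
    components = Vt[:n_components]          # loadings, one row per component
    scores = X_std @ components.T           # country coordinates in the new space
    explained = (S ** 2) / np.sum(S ** 2)   # share of variance per component
    return components, scores, explained[:n_components]

def top_contributors(component, names, k):
    """Return the k variables with the largest absolute loading."""
    order = np.argsort(-np.abs(component))[:k]
    return [(names[i], float(component[i])) for i in order]

# Toy data (hypothetical): 6 countries x 4 food variables on different scales
rng = np.random.default_rng(0)
foods = ["cereals", "meat", "vegetables", "sugar"]
X = rng.normal(size=(6, 4)) * [100, 30, 50, 10] + [300, 80, 120, 40]

X_std = standardize(X)
components, scores, explained = pca(X_std, n_components=2)

print(top_contributors(components[0], foods, k=3))
# Distance between the first two countries in the reduced space
print(np.linalg.norm(scores[0] - scores[1]))
```

On the real FAO data, `k` would be 20 and the number of retained components would be chosen from the explained variance.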

Diet Study: K-means clustering algorithm

How many different diets can we find across the world? Using the K-means clustering algorithm

-Principle:

Unsupervised learning algorithm that aims to find groups in the data, with the number of groups given by K. K is not discovered by the algorithm; it must be chosen beforehand. Each country, described by its food consumption variables, is assigned to the group whose cluster mean it is closest to. Each cluster then represents a diet.

-Usage:

cluster analysis.

Step by step:

-Step 1: get data from PCA with reduced dimensionality compared to the original data: DONE

-Step 2: run K-means for a set of K values: DONE

-Next:

-Determine the optimum K

-Dataviz: world map with the diets for the optimum K at different times: 1970, 1990, 2010.

-Discuss results
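
As an illustration of the clustering step, here is a minimal K-means sketch (Lloyd's algorithm) written with NumPy and run for several values of K. The toy `scores` array stands in for the PCA output and is an assumption, as is the use of the within-cluster sum of squares ("inertia") as a quantity to compare across K; the project may use a different criterion to pick the optimum K.

```python
import numpy as np

def assign(X, centers):
    """Index of the closest center for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: returns labels, cluster means and inertia."""
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centers)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = assign(X, centers)  # final assignment for the returned centers
    inertia = float(((X - centers[labels]) ** 2).sum())
    return labels, centers, inertia

# Toy stand-in for PCA scores (hypothetical): 3 groups of 10 "countries"
rng = np.random.default_rng(1)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
scores = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in blob_centers])

# Run K-means for a set of K and compare the inertia of each run
inertias = {k: kmeans(scores, k)[2] for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

Because the result depends on the random initialization, running each K with several seeds and keeping the lowest inertia is a common precaution.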

6. Part 2: Mortality Prediction

•How many years of data are needed to predict mortality using an LSTM?

-Principle:

Long Short-Term Memory (LSTM): a recurrent neural network architecture designed to learn dependencies in sequential data.

-Usage:

Time series prediction algorithm

Step by step:

-Step 1: data prep

-Analyze sparsity: DONE

-Dataviz for sparsity

-Data selection: select countries and FAO parameters based on the sparsity information.

-Step 2: data batch generator to feed the neural network: DONE

-From the FAO and mortality data sets, the generator builds batches to train the neural networks.

-The number of years of history data and the validation and test set periods can be configured.

-Step 3: train the neural network and vary the lookback time: STARTED
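
To make the batch generator of Step 2 concrete, here is a minimal sketch under stated assumptions: one country, yearly rows, a `lookback` window of past FAO features used to predict the mortality value of the following year, and train/validation periods given as index ranges. The function name `window_generator`, its parameters and the toy arrays are hypothetical and do not come from the project code.

```python
import numpy as np

def window_generator(features, target, lookback, start, end, batch_size):
    """Yield (inputs, targets) batches for one period.

    features: array of shape (n_years, n_features)
    target:   array of shape (n_years,)
    Each sample holds the `lookback` years before year t and the target of
    year t, for every t with start <= t < end that has enough history.
    """
    first = max(start, lookback)
    years = np.arange(first, end)
    for i in range(0, len(years), batch_size):
        batch_years = years[i:i + batch_size]
        x = np.stack([features[t - lookback:t] for t in batch_years])
        y = target[batch_years]
        yield x, y

# Toy data (hypothetical): 20 years, 3 FAO variables, 1 mortality series
features = np.arange(60, dtype=float).reshape(20, 3)
target = np.arange(20, dtype=float)

# Years 0-13 for training, 14-16 for validation (17-19 left for testing)
train_batches = list(window_generator(features, target, lookback=5,
                                      start=0, end=14, batch_size=4))
val_batches = list(window_generator(features, target, lookback=5,
                                    start=14, end=17, batch_size=4))

for x, y in train_batches:
    print(x.shape, y.shape)
```

Each input batch has shape (batch, lookback, n_features), which is the layout recurrent layers such as an LSTM usually expect; varying `lookback` here corresponds to varying the number of years of history in Step 3.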

-Next:

-Calculate the baseline performance of a naive approach, using linear extrapolation of mortality for a given cancer.

-Train models, analyze results and report.

-Dataviz
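
The naive baseline mentioned above could look like the following sketch: fit a straight line to the last `lookback` years of a mortality series, extrapolate one year ahead, and average the absolute error over the series. The mean absolute error as metric, the function names and the toy series are assumptions made for illustration.

```python
import numpy as np

def linear_extrapolation(years, values, target_year):
    """Fit a line to (years, values) and evaluate it at target_year."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    x0 = years[0]  # shift years so the fit is numerically well behaved
    slope, intercept = np.polyfit(years - x0, values, deg=1)
    return float(slope * (target_year - x0) + intercept)

def baseline_mae(years, values, lookback):
    """Mean absolute error of one-year-ahead linear extrapolation.

    lookback must be at least 2 so that a line can be fitted.
    """
    errors = []
    for t in range(lookback, len(years)):
        pred = linear_extrapolation(years[t - lookback:t],
                                    values[t - lookback:t], years[t])
        errors.append(abs(pred - values[t]))
    return float(np.mean(errors))

# Toy mortality series (hypothetical): linear trend plus a small wobble
years = np.arange(2000, 2012)
trend = 50.0 + 0.8 * (years - 2000)
wobble = np.array([0.3, -0.2, 0.1, 0.0, -0.4, 0.2, 0.1, -0.1, 0.3, -0.3, 0.0, 0.2])

print(baseline_mae(years, trend, lookback=5))           # close to 0 for a pure line
print(baseline_mae(years, trend + wobble, lookback=5))  # larger once noise is added
```

A trained model is only worth reporting if its error on the same test years is lower than this baseline.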