Exploratory Data Analysis Project: Obesity Levels Based on Eating Habits and Physical Condition
I chose to analyze a dataset that estimates obesity levels based on the eating habits and physical condition of people from Mexico, Peru, and Colombia. I was interested in the relationship between body fat levels and lifestyle habits, as the findings can be applicable to the health and wellbeing of many, if not most, of us. It is especially relevant because BMI has been in the news a lot recently: those with a high BMI are eligible to get the COVID-19 vaccine earlier, as obese individuals are three times more likely to be hospitalized from the virus. I used a dataset from UC Irvine’s Machine Learning Repository, which includes data on 2,111 individuals aged 14 to 61.
The dataset has 17 attributes, many of which have acronyms for ease of coding, so I will briefly describe all of the attributes:
- Gender: female or male
- Age: numeric
- Height: numeric, in meters
- Weight: numeric, in kilograms
- family_history (family history of obesity): yes or no
- FCHCF (frequent consumption of high caloric food): yes or no
- FCV (frequency of consumption of vegetables): 1, 2, or 3; 1 = never, 2 = sometimes, 3 = always
- NMM (number of main meals): 1, 2, 3 or 4
- CFBM (consumption of food between meals): 1, 2, 3, or 4; 1 = no, 2 = sometimes, 3 = frequently, 4 = always
- Smoke: yes or no
- CW (consumption of water): 1, 2, or 3; 1 = less than a liter, 2 = 1–2 liters, 3 = more than 2 liters
- CCM (calorie consumption monitoring): yes or no
- PAF (physical activity frequency per week): 0, 1, 2, or 3; 0 = none, 1 = 1 to 2 days, 2 = 2 to 4 days, 3 = 4 to 5 days
- TUT (time using technology devices a day): 0, 1, or 2; 0 = 0–2 hours, 1 = 3–5 hours, 2 = more than 5 hours
- CA (consumption of alcohol): 1, 2, 3, or 4; 1 = never, 2 = sometimes, 3 = frequently, 4 = always
- Transportation: automobile, motorbike, bike, public transportation, or walking
- Obesity**: insufficient weight, normal weight, level I overweight, level II overweight, type I obesity, type II obesity, type III obesity; these categories are listed from lowest to highest body fat
**Note: I will refer to these categories as “weight classifications” and “body fat levels”, but they are not determined solely by one’s weight or body fat level.
Given these attributes, I approached this project with the goal of trying to find the answers to the following questions:
- As the makeup of respondents is important to the conclusions derived from this dataset, what kinds of characteristics do the people in this dataset have?
- Can BMI be used as a quantitative substitute for the qualitative weight classification category?
- Which eating habit and physical condition variables are most related to obesity levels? This question has many subquestions related to individual variables and groups of variables.
Examine and Prepare the Data
To analyze the dataset, I needed to first load the dataset into Colaboratory and evaluate it for errors and quality before creating visualizations.
1. Import libraries and read the dataset
First, I imported the libraries I would need to understand my data and create visualizations, such as pandas and seaborn. Afterward, I used google.colab to upload the CSV file, and then loaded the data into the data frame “df” using pandas. The first five rows are shown through .head(), and the data frame is converted into tabular form for good measure in “tab”.
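As a rough sketch of this loading step, assuming only pandas: the CSV text below is an invented two-row placeholder with a handful of the columns described above, since the actual file name is not given here. In Colaboratory, the in-memory buffer would be replaced by the uploaded file.

```python
import io

import pandas as pd

# Stand-in for the uploaded CSV file: two invented rows with a few of the
# columns listed above (the real file has 17 columns and 2111 rows).
csv_text = """Gender,Age,Height,Weight,Smoke
Female,21,1.62,64.0,no
Male,23,1.80,77.0,no
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Show the first rows to confirm the columns were parsed as expected
print(df.head())
```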
2. Calculate BMI
While all of the attributes given in the dataset may be related to an individual's body fat classification, these classifications are usually formed with respect to body mass index (BMI), which gives a better look into the relationships between these attributes and obesity. Because of this, I thought it would be helpful to include BMI as a column, so I calculated it from height and weight (weight in kilograms divided by the square of height in meters) and put the values into a separate column. I then ran .head() again to check that it was calculated and placed correctly. Lastly, I ran .shape to check the data frame’s dimensions. This came out to be (2111, 18), which confirms that there are 2111 rows and that the BMI column brought the total to 18 columns, up from 17.
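The BMI column can be sketched as follows, using the standard formula of weight in kilograms divided by height in meters squared. The three rows are invented for illustration, and the column names "Height", "Weight", and "BMI" are assumed to match the attribute list above.

```python
import pandas as pd

# Invented example rows; heights in meters, weights in kilograms
df = pd.DataFrame({
    "Height": [1.60, 1.80, 1.75],
    "Weight": [64.0, 81.0, 98.0],
})

# BMI = weight (kg) / height (m) squared, computed for every row at once
df["BMI"] = df["Weight"] / df["Height"] ** 2

print(df.head())
print(df.shape)  # one more column than before the BMI column was added
```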
3. Drop duplicates
Now that the columns have been looked over, I wanted to look at the dataset’s rows and check whether there were any duplicates. Checking the shape of the duplicated rows returned (24, 18), indicating that there are 24 duplicated rows. I didn’t want to count these rows twice in my analysis, so I called .drop_duplicates(), keeping the last occurrence of each duplicated row. I called .shape to check that this worked, and it gave me (2087, 18), which is what I expected after removing 24 rows from the original 2,111.
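One way this check might look in pandas is sketched below on an invented five-row frame. Using .duplicated() to select the repeated rows is an assumption about how the (24, 18) shape was obtained; keep="last" matches the choice described above.

```python
import pandas as pd

# Invented rows: row 1 repeats row 0, and row 4 repeats row 2
df = pd.DataFrame({
    "Age": [21, 21, 30, 45, 30],
    "Weight": [64.0, 64.0, 80.0, 90.0, 80.0],
})

# duplicated() flags each repeat of an earlier row (first occurrences are not flagged)
dupes = df[df.duplicated()]
print(dupes.shape)  # (number of duplicated rows, number of columns)

# keep="last" retains the final occurrence of each duplicated row
deduped = df.drop_duplicates(keep="last")
print(deduped.shape)
```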
4. Finish preparing the dataset
I also used .heatmap to get a quick look at whether there are any missing values in my dataset, as those can also skew results. As you can see below, the entire dataset is blue, indicating that there are no missing values; a missing value would have been yellow.
Finally, I ran .info to get a summary of my dataset before moving forward. This is helpful because it lets me check that all my columns are there and in the correct order, as well as see the different data types and the number of columns and rows.
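A numeric counterpart to the heatmap is to count missing values per column. This is a minimal sketch on an invented frame that deliberately contains one missing value (unlike the real dataset), so the count has something to report.

```python
import numpy as np
import pandas as pd

# Invented rows with one missing height, for illustration only
df = pd.DataFrame({
    "Height": [1.60, np.nan, 1.75],
    "Weight": [64.0, 81.0, 98.0],
})

# isnull() returns a True/False frame, the kind of frame a heatmap would color
missing_mask = df.isnull()

# Summing the mask counts the missing values in each column
missing_per_column = missing_mask.sum()
print(missing_per_column)
```

A column total of zero everywhere corresponds to the uniformly colored heatmap described above.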
Graph 1: How are the respondents broken down by weight classification?
This gives us a good introductory look into the data we’re working with and who we’ll be evaluating.
The visualization shows the distribution of respondents among the weight categories, ordered from most common (type I obesity) to least common (insufficient weight). The counts change only gradually from one category to the next, so no single category dominates. We can see that the data points are skewed towards respondents who fall into the three obese categories, as they make up the top three categories by number of respondents.
This graph shows us that the dataset is not very representative of the populations of Mexico, Peru, and Colombia, as all three countries have many more overweight people than obese people, yet this dataset is 40% obese and 25% overweight. Similarly, only 1.4% of Mexicans are underweight, yet people of insufficient weight account for more than 10% of this dataset. The sample’s deviation from the population should be kept in mind when further analyzing the dataset, but its more even distribution of people of different weight categories, as opposed to the actual population’s distribution, may help us in drawing conclusions.
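The category shares quoted above can be computed with value_counts. The sketch below uses five invented labels rather than the real column, so the printed shares are not the dataset's actual percentages.

```python
import pandas as pd

# Invented labels for illustration; the real column has 2087 rows after deduplication
obesity = pd.Series(
    [
        "normal weight",
        "type I obesity",
        "type I obesity",
        "level I overweight",
        "insufficient weight",
    ],
    name="Obesity",
)

# normalize=True returns each category's share of the total instead of raw counts
shares = obesity.value_counts(normalize=True)
print(shares)
```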
Graphs 2–3: How are the heights and weights of the respondents distributed?
Weight and height are integral to the estimation of a person’s weight classification, as they are the two variables in one’s BMI calculation. Therefore, looking at how both factors are distributed across the respondents sheds more light on the sample we are working with. The weight data is almost bimodal and has an average around the 80 kg mark, while the height data has more of a symmetric, normal curve and has an average around the 1.7 m mark. Neither variable is skewed toward a side.
Graphs 4–7: How are respondents responding to yes/no questions?
I decided to make four subplots for the four yes/no questions in the dataset, as we know these questions exist but do not know the actual breakdown, which is important to consider for data analysis. For example, smokers only account for 1% of the overall respondents, so it is helpful to keep the extremely small sample size of smokers in mind when evaluating the effect of smoking on obesity, since conclusions extrapolated from such a small group may not be representative. The same goes for evaluating the effect of calorie consumption monitoring, as less than 5% of respondents do it.
On the other hand, a large majority of respondents have a family history of weight issues and consume high-calorie food often, so conclusions drawn from this data may be more accurate, but it is also important to consider that this data may also be skewed towards a population that is more obese than average.
Graph 8: What is the average age by weight classification?
I wondered if age had anything to do with whether an individual is overweight or obese, and decided to illustrate this with a seaborn barplot. I included a breakdown between genders because gender may also have an impact on body fat. As I ordered the categories from the lowest weight category to the highest, it can be seen that the mean age rises as the weight category increases, suggesting that people may become more susceptible to being overweight as they age. The graph also shows that, in this sample, females have a higher average age than males by a couple of years in every category besides type II obesity. The error bars for all of the categories and genders are short, which suggests that the mean age of each group is estimated with little uncertainty.
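The numbers behind such a bar plot are group means. The sketch below computes them with an ordered categorical so the categories sort from lowest to highest weight class; the rows, the label spellings, and the column names are invented for illustration.

```python
import pandas as pd

# Category order from lowest to highest weight class (labels are illustrative)
order = [
    "insufficient weight", "normal weight",
    "level I overweight", "level II overweight",
    "type I obesity", "type II obesity", "type III obesity",
]

# Invented rows, not taken from the dataset
df = pd.DataFrame({
    "Obesity": ["normal weight", "normal weight", "type I obesity",
                "type I obesity", "normal weight", "type I obesity"],
    "Gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "Age": [20, 24, 30, 26, 22, 28],
})

# An ordered categorical makes groupby sort by weight class, not alphabetically
df["Obesity"] = pd.Categorical(df["Obesity"], categories=order, ordered=True)

# Mean age per weight class and gender; observed=True skips unused categories
mean_age = df.groupby(["Obesity", "Gender"], observed=True)["Age"].mean()
print(mean_age)
```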
Graph 9: What is the relationship between weight and height?
As weight and height are known to closely correspond with an individual’s weight category, I wanted to visualize the linear relationship between the two variables as a function of a third variable, gender, through a regression. This graph shows an upward-trending relationship between weight and height for both genders, with the regression line for females slightly steeper than that for males, meaning that the same increase in weight corresponds to a slightly larger increase in height for females. We can also see that the data points for males are more clustered than those for females, especially along the weight axis, which means that females in this sample span a wider range of weights. However, given how far many data points fall from the regression lines, it is clear that the lines do not fit the data perfectly.
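To compare the slopes numerically rather than by eye, one option is an ordinary least-squares line per gender with numpy.polyfit. This is a stand-in for the seaborn regression plot, not the method used above, and the data points are invented so that each group lies exactly on a line. Weight is treated as the x variable and height as the y variable, which is an assumption about the plot's axes.

```python
import numpy as np
import pandas as pd

# Invented, perfectly linear points for each gender (illustration only)
df = pd.DataFrame({
    "Gender": ["Female"] * 3 + ["Male"] * 3,
    "Weight": [50.0, 60.0, 70.0, 60.0, 80.0, 100.0],
    "Height": [1.55, 1.60, 1.65, 1.70, 1.75, 1.80],
})

slopes = {}
for gender, group in df.groupby("Gender"):
    # Degree-1 fit returns (slope, intercept) of height as a function of weight
    slope, intercept = np.polyfit(group["Weight"], group["Height"], deg=1)
    slopes[gender] = slope

# Slope units: meters of height per kilogram of weight
print(slopes)
```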
Graph 10: How is BMI distributed by weight category?
This graph shows that there is a clear relationship between BMI and the weight categories, which helps confirm that BMI was used to characterize the weight categories. The medians of adjacent categories are separated by similar intervals of around 5 kg/m², although some intervals are smaller than others. There are also not too many outliers, indicating that there is a strong correlation, so BMI can be used as a quantitative variable when evaluating the effect of different variables on weight category.
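The gaps between category medians can be read off directly with a grouped median followed by .diff(). The sketch below uses invented BMI values and only three of the seven categories, so the gaps it prints are not the dataset's.

```python
import pandas as pd

# Category order from lowest to highest weight class (labels are illustrative)
order = [
    "insufficient weight", "normal weight",
    "level I overweight", "level II overweight",
    "type I obesity", "type II obesity", "type III obesity",
]

# Invented BMI values in kg/m^2, three per category
df = pd.DataFrame({
    "Obesity": ["normal weight"] * 3 + ["level I overweight"] * 3
               + ["type I obesity"] * 3,
    "BMI": [21.0, 22.0, 23.0, 25.0, 26.0, 27.0, 31.0, 32.0, 33.0],
})
df["Obesity"] = pd.Categorical(df["Obesity"], categories=order, ordered=True)

# Median BMI per category, sorted by the category order defined above
medians = df.groupby("Obesity", observed=True)["BMI"].median()

# Difference between each median and the one before it
gaps = medians.diff()
print(medians)
print(gaps)
```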
Graphs 11–14: What is the relationship between BMI and various factors?
As BMI is a measure of obesity, I found it helpful to chart how different categorical variables may vary based on it and see which factors were impactful.
The first subplot illustrates how those that frequently consume high-calorie food have a median BMI higher than the median BMI of those who don’t (around 6 kg/m² higher). This indicates that calorie count is likely a contributing factor to increased body fat.
The second subplot shows that there is little relationship between alcohol consumption and BMI, as those who frequently drink alcohol had the same median BMI as those who do not drink alcohol at all. Additionally, only one person responded “always”, so I would be interested in getting more data on that answer.
The third subplot suggests that a family history of obesity is also a contributing factor, as the median BMI of those with a family history of obesity is around 11 kg/m² higher.
Lastly, the fourth subplot does not show any relationship between gender and BMI as females and males had the same median BMI; however, based on the quartiles, the BMI of females is more spread out, which echoes the earlier observation that the weight of females had a wider range than that of males.
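The comparisons in these subplots come down to group medians and quartiles. A minimal sketch for one yes/no variable follows; the six rows are invented, and the column name "family_history" follows the attribute list above.

```python
import pandas as pd

# Invented BMI values in kg/m^2 (illustration only)
df = pd.DataFrame({
    "family_history": ["yes", "yes", "yes", "no", "no", "no"],
    "BMI": [30.0, 32.0, 34.0, 20.0, 22.0, 24.0],
})

grouped = df.groupby("family_history")["BMI"]

# Difference between the two group medians
medians = grouped.median()
difference = medians["yes"] - medians["no"]

# Interquartile range per group, a simple measure of how spread out BMI is
iqr = grouped.quantile(0.75) - grouped.quantile(0.25)

print(difference)
print(iqr)
```

The same pattern applies to the other categorical variables, such as gender, by changing the grouping column.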
Graphs 15–16: How do an individual’s movement habits relate to their BMI?
This figure shows two variables associated with movement and how they relate to BMI. The first subplot’s regression line shows a negative trend between a person’s frequency of physical activity and their BMI; however, many points are very distant from the regression line, suggesting that frequency of physical activity may not actually be strongly correlated with BMI.
The second subplot shows the relationship between different modes of transportation and BMI. The modes are presented from most physically strenuous to least physically strenuous. Sorting them in this order shows that more strenuous modes, such as walking or biking, are associated with a lower BMI.
While the data analysis here has been done through visualizations, it is also helpful to call .corr() to get a correlation matrix on the numeric variables.
However, it may be even more helpful to quickly turn this matrix into a heatmap, as it lets the user efficiently recognize variables that are highly correlated or uncorrelated. The drawback is that .corr() only takes quantitative data into account, so we are not able to compare the numbers from these variables to those of the qualitative, categorical variables, which is a large downside because qualitative variables account for half of the variables.
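One possible workaround for the yes/no variables is to map them to 0 and 1 before calling .corr(). This encoding is an assumption and was not part of the analysis above. The resulting Pearson coefficient for a binary column is a point-biserial correlation. The rows below are invented, and multi-level variables such as transportation would need a different treatment.

```python
import pandas as pd

# Invented rows for illustration
df = pd.DataFrame({
    "Age": [20, 25, 30, 35, 40, 45],
    "BMI": [21.0, 24.0, 26.0, 29.0, 33.0, 35.0],
    "family_history": ["no", "no", "yes", "no", "yes", "yes"],
})

# Correlation matrix restricted to the columns that are already numeric
numeric_corr = df.select_dtypes(include="number").corr()

# Encode the yes/no column as 0/1 so it can enter the matrix as well
encoded = df.copy()
encoded["family_history"] = encoded["family_history"].map({"no": 0, "yes": 1})
full_corr = encoded.corr()

print(numeric_corr)
print(full_corr)
```

Either matrix can then be passed to a heatmap in the same way as before.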
This analysis suggests that factors such as a family history of obesity and frequent consumption of high-calorie food are strongly associated with weight classification, while other factors such as age showed a weaker relationship, and gender showed no apparent relationship with BMI.