Picture Pandas - Little Intro guide (Olympics data)

Last updated on May 13, 2020 6 min read Data Science

Pandas is a python library for data analysis and manipulation.

This post is not a complete guide to pandas. For a more comprehensive guide, you can take this kaggle course
This is also not a complete Exploratory data analysis of the Olympic dataset.
Rather it is a little intro guide covering some basic operations in pandas, with an approach of asking the questions as you explore the data.

Frame the question approach

So the idea is you are given a dataset, but more often than not it is not the ideal dataset that you can use directly in the machine learning pipeline. You need to get it as close to the ideal state as possible. And the way to do this is to ask questions to your dataset. Ask questions so that you get answers from the dataset and transform that dataset into something that can be used to get results. And the questions depend upon what results you want, or what your end goal is.

Also, note that the questions don't necessarily have to come first. More often than not exploring the data, and combining different operations can give rise to a solution that can answer a question. So it is hard to tell whether the chicken came first or the egg.

Eventually, the end goal is to make a dataset that is much more useful in the context of that is set out to be solved.

Dataset

The dataset considered here is of 120 years of Olympic history athlete and results. Download this dataset if you want to follow along.

What will be covered?

Read and select data
Summary statistics
Group and sort data
create DataFrame
Combine data (merge, concat)

Pre- requesite

Basic python knowledge
Jupyter notebook setup
Install pandas (pip install pandas)

#dependencies
import pandas as pd

1. Read

To read the data, we can use either read_csv or read_excel depending on the data file format.

#load dataset
df = pd.read_csv('./data/athlete_events.csv')
#head() displays the top rows
df.head()

Fig: Olympic athlete events dataset

This dataset has the athlete performance for all the Olympic events held to date. Note that the loaded dataset is of type DataFrame.

2. Select

In most cases, we do not require the complete dataset. We only need a portion of the dataset for analysis, hence we “select” the data.

Fig: Select data

There are two methods to select - iloc and loc

iloc is to select rows and columns using numerical constraints.

format is dataframe.iloc[x,y]

# select
df.iloc[1:3,0:2]
df.iloc[1:3,2:5]

Fig: A portion of dataset is selected using iloc

loc can be used to specify rows/column names, very handy while specifying certain columns

df.loc[1:5,['Age','Team']]

Fig: selecting data using loc

Conversely, We can also use pandas filter() method If we want to filter certain columns

Fig: filter specific columns

df.filter(['Age','Medal','Team'])

Fig: filter age, medal, team

There is another way to select wherein we can also give conditions for the selection.

suppose we are interested in number of Gold,silver and bronze medals by old people in Italy - question

df_Italy_old = df.loc[(df.Team=='Italy')&(df.Age>=40)] #and - both conditions have to be true
df_Italy_old['Medal'].value_counts()

Fig: value count of medals by people over 40 from Italy

As we can see people above 40 from Italy have won 16 Gold medals.

3. Summary statistics

Summary stat gives us a quantifiable description of the data.

You might have encountered it in your high school statistics class.They include count,mean,median,mode,standard deviation, etc.

We can obtain a summary stat of a dataframe using the “describe” method.

df.describe()

Fig: Summary statistics

As we can observe only the columns with int/float as their types are displayed in summary stat.
notice that count for fields height, weight, and age are less than the id. Indicating missing values in those fields.

One thing to remember is that the median is a better representation of the center than mean if the data contains too many extreme values/outliers. For example in the data, the max age of a person is 97 years old, this might affect the mean value but does not affect the median value.

Also, an individual summary stat can also be found by using mean(), count(), and other functions -

df.Age.mean()

25.556898357297374

df.Height.median()

175.0

4. Group and sort values

Fig: Grouping the data

Sticking with the medals brought in by aged people in the olympics, If we were to extract the information of countries with the highest medals by aged people, how would we go about it?

first, select people above 50
then, filter only the columns ‘Team’ and ‘Medal’
then use groupby() function to group the data based on ‘Team’
Since we need the count to the medals, use count() function(). Other criteria include mean(), median(), sum(),etc
sort the values using the key ‘Medal’ in descending order. Sorting can also be done using multiple keys.
and Finally, display the top ten countries

We can join all these methods together in the following manner -

df_old_people = df[df.Age>=50]
df_old_people.filter(['Team','Medal']).groupby('Team').count().sort_values('Medal',ascending=False).head(10)

Fig: Teams with most medals from aged athletes(above 50 years)

We notice that France tops the list of countries when it comes to medals by aged people!!

Now let us try to explore France, and see which athlete above 50 has the highest medals

df[(df.Team=='France') & (df.Age>=50)].groupby('Name').count().sort_values('Medal',ascending=False)

Fig: Aged athletes(above 50 years) with most medals for France

It is a shooter named Lon Ernest Moreaux! Here is more information about this particular athlete's performance

df[(df.Age>=50)&(df.Name=='Lon Ernest Moreaux')]

Fig: Details about Lon Ernest Moreaux

Create dataframe

We have learned how to read an existing dataset, but what if we wanted to create one from scratch?

So we'll leave the Olympics dataset behind and begin by creating dataframe from scratch, further using new dataframes for Combining operations.

#we need numpy to create an array
import numpy as np

df2 = pd.DataFrame(np.array([[1,2,3],[3,3,5]]),columns=['a','b','c'])
df2   # original dataframe

Fig: df2

Combine

There are two methods to combine the dataframes - merge and concat.

Merge

Merge is used to combine two dataframe with a common column.

Fig: Merge the data

Let us create a dataframe df3.

df3 = pd.DataFrame(np.array([[2],[3]]),columns=['e'])
df3 # dataframe to be merged

Fig: df3

Notice column values for ‘b’ and ‘e’ are same, hence we can merge them in the following way -

df4 = df2.merge(df3,left_on='b',right_on='e',how='left')
df4 # merged dataframe

Fig: df4

‘left’ indicates left outer join

Concat

Fig: Concat the data

Concat is used to add additional rows to the existing dataframe, notice that column names should be the same to concat.

Let us create a dataframe df5.

df5 = pd.DataFrame(np.array([[5,7,8]]),columns=['a','b','c'])
df5 # dataframe to be concatincated

Fig: df5

Now we can concat df2 and df5 based on same columnn names.

df6 = pd.concat([df2,df5])
df6 # concatinated dataframe

Fig: df6

Conclusion

We have learned how to use key pandas operations such as read, select, group, select, merge, concat.
We learned how operations can be chained together to answer complex questions.
We figured out which countries have the most medals brought in by aged people.

Where to go from here?

Explore how you can handle missing values(Nan)
learn about map and apply functions.
Take a look at the official user guide to explore further.

This is a part of my #100daysofcode challenge, check out the github repo here .

pandas python Data Science guide