Data Management in Python

Setting Up Environment

Clearing Environment

In [1]:
%reset -f
%who
Interactive namespace is empty.

Importing Packages

In [2]:
import numpy as np
import pandas as pd

Importing and Inspecting Data

In [3]:
covid_data = pd.read_csv('https://data.humdata.org/hxlproxy/data/download/time_series_covid19_confirmed_global_narrow.csv?dest=data_edit&filter01=merge&merge-url01=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2Fe%2F2PACX-1vTglKQRXpkKSErDiWG6ycqEth32MY0reMuVGhaslImLjfuLU0EUgyyu2e-3vKDArjqGX7dXEBV8FJ4f%2Fpub%3Fgid%3D1326629740%26single%3Dtrue%26output%3Dcsv&merge-keys01=%23country%2Bname&merge-tags01=%23country%2Bcode%2C%23region%2Bmain%2Bcode%2C%23region%2Bsub%2Bcode%2C%23region%2Bintermediate%2Bcode&filter02=merge&merge-url02=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2Fe%2F2PACX-1vTglKQRXpkKSErDiWG6ycqEth32MY0reMuVGhaslImLjfuLU0EUgyyu2e-3vKDArjqGX7dXEBV8FJ4f%2Fpub%3Fgid%3D398158223%26single%3Dtrue%26output%3Dcsv&merge-keys02=%23adm1%2Bname&merge-tags02=%23country%2Bcode%2C%23region%2Bmain%2Bcode%2C%23region%2Bsub%2Bcode%2C%23region%2Bintermediate%2Bcode&merge-replace02=on&merge-overwrite02=on&filter03=explode&explode-header-att03=date&explode-value-att03=value&filter04=rename&rename-oldtag04=%23affected%2Bdate&rename-newtag04=%23date&rename-header04=Date&filter05=rename&rename-oldtag05=%23affected%2Bvalue&rename-newtag05=%23affected%2Binfected%2Bvalue%2Bnum&rename-header05=Value&filter06=clean&clean-date-tags06=%23date&filter07=sort&sort-tags07=%23date&sort-reverse07=on&filter08=sort&sort-tags08=%23country%2Bname%2C%23adm1%2Bname&tagger-match-all=on&tagger-default-tag=%23affected%2Blabel&tagger-01-header=province%2Fstate&tagger-01-tag=%23adm1%2Bname&tagger-02-header=country%2Fregion&tagger-02-tag=%23country%2Bname&tagger-03-header=lat&tagger-03-tag=%23geo%2Blat&tagger-04-header=long&tagger-04-tag=%23geo%2Blon&header-row=1&url=https%3A%2F%2Fraw.githubusercontent.com%2FCSSEGISandData%2FCOVID-19%2Fmaster%2Fcsse_covid_19_data%2Fcsse_covid_19_time_series%2Ftime_series_covid19_confirmed_global.csv')
In [4]:
covid_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25873 entries, 0 to 25872
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Province/State            8037 non-null   object
 1   Country/Region            25873 non-null  object
 2   Lat                       25873 non-null  object
 3   Long                      25873 non-null  object
 4   Date                      25873 non-null  object
 5   Value                     25873 non-null  object
 6   ISO 3166-1 Alpha 3-Codes  25285 non-null  object
 7   Region Code               25383 non-null  object
 8   Sub-region Code           25383 non-null  object
 9   Intermediate Region Code  9409 non-null   object
dtypes: object(10)
memory usage: 2.0+ MB
In [5]:
covid_data
Out[5]:
Province/State Country/Region Lat Long Date Value ISO 3166-1 Alpha 3-Codes Region Code Sub-region Code Intermediate Region Code
0 #adm1+name #country+name #geo+lat #geo+lon #date #affected+infected+value+num #country+code #region+main+code #region+sub+code #region+intermediate+code
1 NaN Afghanistan 33.0 65.0 2020-04-28 1828 AFG 142 34 NaN
2 NaN Afghanistan 33.0 65.0 2020-04-27 1703 AFG 142 34 NaN
3 NaN Afghanistan 33.0 65.0 2020-04-26 1531 AFG 142 34 NaN
4 NaN Afghanistan 33.0 65.0 2020-04-25 1463 AFG 142 34 NaN
... ... ... ... ... ... ... ... ... ... ...
25868 NaN Zimbabwe -20.0 30.0 2020-01-26 0 ZWE 2 202 14
25869 NaN Zimbabwe -20.0 30.0 2020-01-25 0 ZWE 2 202 14
25870 NaN Zimbabwe -20.0 30.0 2020-01-24 0 ZWE 2 202 14
25871 NaN Zimbabwe -20.0 30.0 2020-01-23 0 ZWE 2 202 14
25872 NaN Zimbabwe -20.0 30.0 2020-01-22 0 ZWE 2 202 14

25873 rows × 10 columns

Removing First Row

We are also reseting the row index

In [6]:
covid_data.drop(index = 0, inplace = True)
covid_data.reset_index(drop = True, inplace = True)
In [7]:
covid_data
Out[7]:
Province/State Country/Region Lat Long Date Value ISO 3166-1 Alpha 3-Codes Region Code Sub-region Code Intermediate Region Code
0 NaN Afghanistan 33.0 65.0 2020-04-28 1828 AFG 142 34 NaN
1 NaN Afghanistan 33.0 65.0 2020-04-27 1703 AFG 142 34 NaN
2 NaN Afghanistan 33.0 65.0 2020-04-26 1531 AFG 142 34 NaN
3 NaN Afghanistan 33.0 65.0 2020-04-25 1463 AFG 142 34 NaN
4 NaN Afghanistan 33.0 65.0 2020-04-24 1351 AFG 142 34 NaN
... ... ... ... ... ... ... ... ... ... ...
25867 NaN Zimbabwe -20.0 30.0 2020-01-26 0 ZWE 2 202 14
25868 NaN Zimbabwe -20.0 30.0 2020-01-25 0 ZWE 2 202 14
25869 NaN Zimbabwe -20.0 30.0 2020-01-24 0 ZWE 2 202 14
25870 NaN Zimbabwe -20.0 30.0 2020-01-23 0 ZWE 2 202 14
25871 NaN Zimbabwe -20.0 30.0 2020-01-22 0 ZWE 2 202 14

25872 rows × 10 columns

Changing Column Names

In [8]:
covid_data.columns = ['state', 'country', 'latitude', 'longitude', 'date', 'total_cases', 
                      'iso_country_code', 'region_code', 'subregion_code', 'intermediate_region_code']
In [9]:
covid_data
Out[9]:
state country latitude longitude date total_cases iso_country_code region_code subregion_code intermediate_region_code
0 NaN Afghanistan 33.0 65.0 2020-04-28 1828 AFG 142 34 NaN
1 NaN Afghanistan 33.0 65.0 2020-04-27 1703 AFG 142 34 NaN
2 NaN Afghanistan 33.0 65.0 2020-04-26 1531 AFG 142 34 NaN
3 NaN Afghanistan 33.0 65.0 2020-04-25 1463 AFG 142 34 NaN
4 NaN Afghanistan 33.0 65.0 2020-04-24 1351 AFG 142 34 NaN
... ... ... ... ... ... ... ... ... ... ...
25867 NaN Zimbabwe -20.0 30.0 2020-01-26 0 ZWE 2 202 14
25868 NaN Zimbabwe -20.0 30.0 2020-01-25 0 ZWE 2 202 14
25869 NaN Zimbabwe -20.0 30.0 2020-01-24 0 ZWE 2 202 14
25870 NaN Zimbabwe -20.0 30.0 2020-01-23 0 ZWE 2 202 14
25871 NaN Zimbabwe -20.0 30.0 2020-01-22 0 ZWE 2 202 14

25872 rows × 10 columns

Removing Latitude and Longitude Columns

In [10]:
covid_data.drop(columns = ['latitude', 'longitude'], inplace = True)
covid_data
Out[10]:
state country date total_cases iso_country_code region_code subregion_code intermediate_region_code
0 NaN Afghanistan 2020-04-28 1828 AFG 142 34 NaN
1 NaN Afghanistan 2020-04-27 1703 AFG 142 34 NaN
2 NaN Afghanistan 2020-04-26 1531 AFG 142 34 NaN
3 NaN Afghanistan 2020-04-25 1463 AFG 142 34 NaN
4 NaN Afghanistan 2020-04-24 1351 AFG 142 34 NaN
... ... ... ... ... ... ... ... ...
25867 NaN Zimbabwe 2020-01-26 0 ZWE 2 202 14
25868 NaN Zimbabwe 2020-01-25 0 ZWE 2 202 14
25869 NaN Zimbabwe 2020-01-24 0 ZWE 2 202 14
25870 NaN Zimbabwe 2020-01-23 0 ZWE 2 202 14
25871 NaN Zimbabwe 2020-01-22 0 ZWE 2 202 14

25872 rows × 8 columns

Changing Column Types

In [11]:
covid_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25872 entries, 0 to 25871
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   state                     8036 non-null   object
 1   country                   25872 non-null  object
 2   date                      25872 non-null  object
 3   total_cases               25872 non-null  object
 4   iso_country_code          25284 non-null  object
 5   region_code               25382 non-null  object
 6   subregion_code            25382 non-null  object
 7   intermediate_region_code  9408 non-null   object
dtypes: object(8)
memory usage: 1.6+ MB

We will convert numeric fields to floats then ints (because we have to for some reason)

In [12]:
covid_data = covid_data.astype({'date' : 'datetime64', 
                               'total_cases' : 'float64'})
covid_data = covid_data.astype({'total_cases' : 'int64'})
covid_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25872 entries, 0 to 25871
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   state                     8036 non-null   object        
 1   country                   25872 non-null  object        
 2   date                      25872 non-null  datetime64[ns]
 3   total_cases               25872 non-null  int64         
 4   iso_country_code          25284 non-null  object        
 5   region_code               25382 non-null  object        
 6   subregion_code            25382 non-null  object        
 7   intermediate_region_code  9408 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(6)
memory usage: 1.6+ MB

Converting Some States To Countries

In [13]:
covid_data['state'].unique()
Out[13]:
array([nan, 'Australian Capital Territory', 'New South Wales',
       'Northern Territory', 'Queensland', 'South Australia', 'Tasmania',
       'Victoria', 'Western Australia', 'Alberta', 'British Columbia',
       'Diamond Princess', 'Grand Princess', 'Manitoba', 'New Brunswick',
       'Newfoundland and Labrador', 'Northwest Territories',
       'Nova Scotia', 'Ontario', 'Prince Edward Island', 'Quebec',
       'Recovered', 'Saskatchewan', 'Yukon', 'Anhui', 'Beijing',
       'Chongqing', 'Fujian', 'Gansu', 'Guangdong', 'Guangxi', 'Guizhou',
       'Hainan', 'Hebei', 'Heilongjiang', 'Henan', 'Hong Kong', 'Hubei',
       'Hunan', 'Inner Mongolia', 'Jiangsu', 'Jiangxi', 'Jilin',
       'Liaoning', 'Macau', 'Ningxia', 'Qinghai', 'Shaanxi', 'Shandong',
       'Shanghai', 'Shanxi', 'Sichuan', 'Tianjin', 'Tibet', 'Xinjiang',
       'Yunnan', 'Zhejiang', 'Faroe Islands', 'Greenland',
       'French Guiana', 'French Polynesia', 'Guadeloupe', 'Martinique',
       'Mayotte', 'New Caledonia', 'Reunion', 'Saint Barthelemy',
       'Saint Pierre and Miquelon', 'St Martin', 'Aruba',
       'Bonaire, Sint Eustatius and Saba', 'Curacao', 'Sint Maarten',
       'Anguilla', 'Bermuda', 'British Virgin Islands', 'Cayman Islands',
       'Channel Islands', 'Falkland Islands (Malvinas)', 'Gibraltar',
       'Isle of Man', 'Montserrat', 'Turks and Caicos Islands'],
      dtype=object)

Making a list of states that we want to consider countries

In [14]:
country_states = ['Diamond Princess', 'Grand Princess', 'Hong Kong', 
                  'Faroe Islands', 'Greenland', 'French Guiana', 
                  'French Polynesia', 'Guadeloupe', 'Martinique',
                  'Mayotte', 'New Caledonia', 'Reunion', 'Saint Barthelemy',
                  'Saint Pierre and Miquelon', 'St Martin', 'Aruba',
                  'Bonaire, Sint Eustatius and Saba', 'Curacao', 'Sint Maarten',
                  'Anguilla', 'Bermuda', 'British Virgin Islands', 'Cayman Islands',
                  'Channel Islands', 'Falkland Islands (Malvinas)', 'Gibraltar',
                  'Isle of Man', 'Montserrat', 'Turks and Caicos Islands']

Converting states to countries, when they are in country_states

In [15]:
covid_data.loc[covid_data['state'].isin(country_states), 'country'] = covid_data.loc[covid_data['state'].isin(country_states), 'state'].copy()

Checking our work

In [16]:
covid_data['country'][covid_data['country'].isin(country_states)].unique()
Out[16]:
array(['Diamond Princess', 'Grand Princess', 'Hong Kong', 'Faroe Islands',
       'Greenland', 'French Guiana', 'French Polynesia', 'Guadeloupe',
       'Martinique', 'Mayotte', 'New Caledonia', 'Reunion',
       'Saint Barthelemy', 'Saint Pierre and Miquelon', 'St Martin',
       'Aruba', 'Bonaire, Sint Eustatius and Saba', 'Curacao',
       'Sint Maarten', 'Anguilla', 'Bermuda', 'British Virgin Islands',
       'Cayman Islands', 'Channel Islands', 'Falkland Islands (Malvinas)',
       'Gibraltar', 'Isle of Man', 'Montserrat',
       'Turks and Caicos Islands'], dtype=object)
In [17]:
pd.Series(country_states).to_numpy()
Out[17]:
array(['Diamond Princess', 'Grand Princess', 'Hong Kong', 'Faroe Islands',
       'Greenland', 'French Guiana', 'French Polynesia', 'Guadeloupe',
       'Martinique', 'Mayotte', 'New Caledonia', 'Reunion',
       'Saint Barthelemy', 'Saint Pierre and Miquelon', 'St Martin',
       'Aruba', 'Bonaire, Sint Eustatius and Saba', 'Curacao',
       'Sint Maarten', 'Anguilla', 'Bermuda', 'British Virgin Islands',
       'Cayman Islands', 'Channel Islands', 'Falkland Islands (Malvinas)',
       'Gibraltar', 'Isle of Man', 'Montserrat',
       'Turks and Caicos Islands'], dtype=object)
In [18]:
del(country_states)

Removing state column

In [19]:
covid_data.drop(columns = 'state', inplace = True)
covid_data
Out[19]:
country date total_cases iso_country_code region_code subregion_code intermediate_region_code
0 Afghanistan 2020-04-28 1828 AFG 142 34 NaN
1 Afghanistan 2020-04-27 1703 AFG 142 34 NaN
2 Afghanistan 2020-04-26 1531 AFG 142 34 NaN
3 Afghanistan 2020-04-25 1463 AFG 142 34 NaN
4 Afghanistan 2020-04-24 1351 AFG 142 34 NaN
... ... ... ... ... ... ... ...
25867 Zimbabwe 2020-01-26 0 ZWE 2 202 14
25868 Zimbabwe 2020-01-25 0 ZWE 2 202 14
25869 Zimbabwe 2020-01-24 0 ZWE 2 202 14
25870 Zimbabwe 2020-01-23 0 ZWE 2 202 14
25871 Zimbabwe 2020-01-22 0 ZWE 2 202 14

25872 rows × 7 columns

Extracting Country-Level Data

And removing duplicates to get country-level data

In [20]:
country_data = covid_data[['country', 'iso_country_code', 'region_code', 'subregion_code', 'intermediate_region_code']].copy()
country_data.drop_duplicates(subset = 'country', inplace = True)
country_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 213 entries, 0 to 25774
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   country                   213 non-null    object
 1   iso_country_code          209 non-null    object
 2   region_code               210 non-null    object
 3   subregion_code            210 non-null    object
 4   intermediate_region_code  96 non-null     object
dtypes: object(5)
memory usage: 10.0+ KB
In [21]:
country_data
Out[21]:
country iso_country_code region_code subregion_code intermediate_region_code
0 Afghanistan AFG 142 34 NaN
98 Albania ALB 150 39 NaN
196 Algeria DZA 2 15 NaN
294 Andorra AND 150 39 NaN
392 Angola AGO 2 202 17
... ... ... ... ... ...
25382 West Bank and Gaza PSE 142 145 NaN
25480 Western Sahara ESH 2 15 NaN
25578 Yemen YEM 142 145 NaN
25676 Zambia ZMB 2 202 14
25774 Zimbabwe ZWE 2 202 14

213 rows × 5 columns

Removing Country Level Data From covid_data

In [22]:
covid_data.drop(columns = ['iso_country_code', 'region_code',
                           'subregion_code', 'intermediate_region_code'], 
                inplace = True)
covid_data
Out[22]:
country date total_cases
0 Afghanistan 2020-04-28 1828
1 Afghanistan 2020-04-27 1703
2 Afghanistan 2020-04-26 1531
3 Afghanistan 2020-04-25 1463
4 Afghanistan 2020-04-24 1351
... ... ... ...
25867 Zimbabwe 2020-01-26 0
25868 Zimbabwe 2020-01-25 0
25869 Zimbabwe 2020-01-24 0
25870 Zimbabwe 2020-01-23 0
25871 Zimbabwe 2020-01-22 0

25872 rows × 3 columns

Aggregating Data Over states

Combining (summing) rows that are from the same countries on the same day (summing over states)

Then sorting data by country then date

In [23]:
covid_data = covid_data.groupby(['country','date'])['total_cases'].sum().reset_index()
covid_data.sort_values(by = ['country','date'], inplace = True)
covid_data
Out[23]:
country date total_cases
0 Afghanistan 2020-01-22 0
1 Afghanistan 2020-01-23 0
2 Afghanistan 2020-01-24 0
3 Afghanistan 2020-01-25 0
4 Afghanistan 2020-01-26 0
... ... ... ...
20869 Zimbabwe 2020-04-24 29
20870 Zimbabwe 2020-04-25 31
20871 Zimbabwe 2020-04-26 31
20872 Zimbabwe 2020-04-27 32
20873 Zimbabwe 2020-04-28 32

20874 rows × 3 columns

Calculating daily_cases Column

In [24]:
covid_data['yesterday_total_cases'] = covid_data['total_cases'].shift(1).copy()
covid_data['daily_cases'] = covid_data['total_cases'] - covid_data['yesterday_total_cases']
covid_data.drop(columns = 'yesterday_total_cases', inplace = True)
covid_data.loc[covid_data['date'] == '2020-01-22T00:00:00.000000000', 'daily_cases'] = covid_data.loc[
    covid_data['date'] == '2020-01-22T00:00:00.000000000', 'total_cases']
covid_data = covid_data.astype({'daily_cases' : 'int64'})
In [25]:
covid_data
Out[25]:
country date total_cases daily_cases
0 Afghanistan 2020-01-22 0 0
1 Afghanistan 2020-01-23 0 0
2 Afghanistan 2020-01-24 0 0
3 Afghanistan 2020-01-25 0 0
4 Afghanistan 2020-01-26 0 0
... ... ... ... ...
20869 Zimbabwe 2020-04-24 29 1
20870 Zimbabwe 2020-04-25 31 2
20871 Zimbabwe 2020-04-26 31 0
20872 Zimbabwe 2020-04-27 32 1
20873 Zimbabwe 2020-04-28 32 0

20874 rows × 4 columns

Merging Region-Level Information to covid_data

In [26]:
country_data['region'] = np.NaN
country_data.loc[country_data['subregion_code'] == '143', 'region'] = 'Central Asia'
country_data.loc[country_data['subregion_code'] == '145', 'region'] = 'Western Asia'
country_data.loc[country_data['subregion_code'] == '15', 'region'] = 'North Africa'
country_data.loc[country_data['subregion_code'] == '151', 'region'] = 'Eastern Europe'
country_data.loc[country_data['subregion_code'] == '154', 'region'] = 'Northern Europe'
country_data.loc[country_data['subregion_code'] == '155', 'region'] = 'Western Europe'
country_data.loc[country_data['subregion_code'] == '202', 'region'] = 'Non-Northern Africa'
country_data.loc[country_data['subregion_code'] == '21', 'region'] = 'Northern America'
country_data.loc[country_data['subregion_code'] == '30', 'region'] = 'Eastern Asia'
country_data.loc[country_data['subregion_code'] == '34', 'region'] = 'Southern Asia'
country_data.loc[country_data['subregion_code'] == '35', 'region'] = 'Southeastern Asia'
country_data.loc[country_data['subregion_code'] == '39', 'region'] = 'Southern Europe'
country_data.loc[country_data['subregion_code'] == '419', 'region'] = 'Central and South America'
country_data.loc[country_data['subregion_code'].isin(['53', '54', '61']), 'region'] = 'Oceania'
In [27]:
covid_data = covid_data.merge(country_data[['country', 'region']], how = 'left')
covid_data
Out[27]:
country date total_cases daily_cases region
0 Afghanistan 2020-01-22 0 0 Southern Asia
1 Afghanistan 2020-01-23 0 0 Southern Asia
2 Afghanistan 2020-01-24 0 0 Southern Asia
3 Afghanistan 2020-01-25 0 0 Southern Asia
4 Afghanistan 2020-01-26 0 0 Southern Asia
... ... ... ... ... ...
20869 Zimbabwe 2020-04-24 29 1 Non-Northern Africa
20870 Zimbabwe 2020-04-25 31 2 Non-Northern Africa
20871 Zimbabwe 2020-04-26 31 0 Non-Northern Africa
20872 Zimbabwe 2020-04-27 32 1 Non-Northern Africa
20873 Zimbabwe 2020-04-28 32 0 Non-Northern Africa

20874 rows × 5 columns

Extracting Continent Data

These continents divide countries slightly differently than the actual continents do

In [28]:
covid_data['continent'] = covid_data['region'].copy()
covid_data.loc[covid_data['continent'].str.contains(r'Africa$').fillna(False), 'continent'] = 'Africa'
covid_data.loc[covid_data['continent'].str.contains(r'Asia$').fillna(False), 'continent'] = 'Asia'
covid_data.loc[covid_data['continent'].str.contains(r'Europe$').fillna(False), 'continent'] = 'Europe'
covid_data = covid_data[['continent', 'region', 'country', 'date', 'total_cases', 'daily_cases']]
In [29]:
print(covid_data['continent'].unique())
covid_data
['Asia' 'Europe' 'Africa' 'Central and South America' 'Oceania'
 'Northern America' nan]
Out[29]:
continent region country date total_cases daily_cases
0 Asia Southern Asia Afghanistan 2020-01-22 0 0
1 Asia Southern Asia Afghanistan 2020-01-23 0 0
2 Asia Southern Asia Afghanistan 2020-01-24 0 0
3 Asia Southern Asia Afghanistan 2020-01-25 0 0
4 Asia Southern Asia Afghanistan 2020-01-26 0 0
... ... ... ... ... ... ...
20869 Africa Non-Northern Africa Zimbabwe 2020-04-24 29 1
20870 Africa Non-Northern Africa Zimbabwe 2020-04-25 31 2
20871 Africa Non-Northern Africa Zimbabwe 2020-04-26 31 0
20872 Africa Non-Northern Africa Zimbabwe 2020-04-27 32 1
20873 Africa Non-Northern Africa Zimbabwe 2020-04-28 32 0

20874 rows × 6 columns

Subsetting and Concatenating Rows

Creating Eurasia data from Asia and Europe data

In [30]:
asia_data = covid_data.loc[covid_data['continent'] == 'Asia',]
europe_data = covid_data.loc[covid_data['continent'] == 'Europe',]
eurasia_data = pd.concat([asia_data , europe_data], ignore_index = True)
eurasia_data
Out[30]:
continent region country date total_cases daily_cases
0 Asia Southern Asia Afghanistan 2020-01-22 0 0
1 Asia Southern Asia Afghanistan 2020-01-23 0 0
2 Asia Southern Asia Afghanistan 2020-01-24 0 0
3 Asia Southern Asia Afghanistan 2020-01-25 0 0
4 Asia Southern Asia Afghanistan 2020-01-26 0 0
... ... ... ... ... ... ...
9403 Europe Northern Europe United Kingdom 2020-04-24 143464 5386
9404 Europe Northern Europe United Kingdom 2020-04-25 148377 4913
9405 Europe Northern Europe United Kingdom 2020-04-26 152840 4463
9406 Europe Northern Europe United Kingdom 2020-04-27 157149 4309
9407 Europe Northern Europe United Kingdom 2020-04-28 161145 3996

9408 rows × 6 columns

In [31]:
del(asia_data, eurasia_data, europe_data)

Merging Columns

Adding in ISO Country Codes to Covid Data

In [32]:
covid_data = pd.merge(covid_data, country_data[['country', 'iso_country_code']], how = 'left', on = 'country')
covid_data = covid_data[['continent', 'region', 'country', 'iso_country_code', 'date', 'total_cases', 'daily_cases']]
covid_data
Out[32]:
continent region country iso_country_code date total_cases daily_cases
0 Asia Southern Asia Afghanistan AFG 2020-01-22 0 0
1 Asia Southern Asia Afghanistan AFG 2020-01-23 0 0
2 Asia Southern Asia Afghanistan AFG 2020-01-24 0 0
3 Asia Southern Asia Afghanistan AFG 2020-01-25 0 0
4 Asia Southern Asia Afghanistan AFG 2020-01-26 0 0
... ... ... ... ... ... ... ...
20869 Africa Non-Northern Africa Zimbabwe ZWE 2020-04-24 29 1
20870 Africa Non-Northern Africa Zimbabwe ZWE 2020-04-25 31 2
20871 Africa Non-Northern Africa Zimbabwe ZWE 2020-04-26 31 0
20872 Africa Non-Northern Africa Zimbabwe ZWE 2020-04-27 32 1
20873 Africa Non-Northern Africa Zimbabwe ZWE 2020-04-28 32 0

20874 rows × 7 columns

Reshaping to Wide Data

In [33]:
covid_wide_data = covid_data
covid_wide_data = covid_wide_data.astype({'date' : 'str'}) 
covid_wide_data = covid_wide_data.pivot(index = 'country', columns = 'date', values = ['total_cases', 'daily_cases']) 
covid_wide_data.columns = ['_'.join(column).strip() for column in covid_wide_data.columns.values]
covid_wide_data.index.name = None
covid_wide_data
Out[33]:
total_cases_2020-01-22 total_cases_2020-01-23 total_cases_2020-01-24 total_cases_2020-01-25 total_cases_2020-01-26 total_cases_2020-01-27 total_cases_2020-01-28 total_cases_2020-01-29 total_cases_2020-01-30 total_cases_2020-01-31 ... daily_cases_2020-04-19 daily_cases_2020-04-20 daily_cases_2020-04-21 daily_cases_2020-04-22 daily_cases_2020-04-23 daily_cases_2020-04-24 daily_cases_2020-04-25 daily_cases_2020-04-26 daily_cases_2020-04-27 daily_cases_2020-04-28
Afghanistan 0 0 0 0 0 0 0 0 0 0 ... 63 30 66 84 103 72 112 68 172 125
Albania 0 0 0 0 0 0 0 0 0 0 ... 14 22 25 25 29 15 34 14 10 14
Algeria 0 0 0 0 0 0 0 0 0 0 ... 95 89 93 99 97 120 129 126 135 132
Andorra 0 0 0 0 0 0 0 0 0 0 ... 9 4 0 6 0 8 7 0 5 0
Angola 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 1 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
West Bank and Gaza 0 0 0 0 0 0 0 0 0 0 ... 19 12 17 8 6 4 -142 0 0 1
Western Sahara 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Yemen 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Zambia 0 0 0 0 0 0 0 0 0 0 ... 4 4 5 4 2 8 0 4 0 7
Zimbabwe 0 0 0 0 0 0 0 0 0 0 ... 0 0 3 0 0 1 2 0 1 0

213 rows × 196 columns

Reshaping to Long Data

In [34]:
covid_long_data = covid_wide_data.stack()
covid_long_data = covid_long_data.to_frame()
covid_long_data.reset_index(inplace = True)
covid_long_data.columns = ['country', 'case_type_and_date', 'cases']
covid_long_data['case_type'] = ''
covid_long_data.loc[covid_long_data['case_type_and_date'].str.contains(r'^total_cases_') ,'case_type'] = 'total'
covid_long_data.loc[covid_long_data['case_type_and_date'].str.contains(r'^daily_cases_') ,'case_type'] = 'daily'
covid_long_data['case_type_and_date'] = covid_long_data['case_type_and_date'].str[12:]
covid_long_data.columns = ['country', 'date', 'cases', 'case_type']
covid_long_data = covid_long_data[['country', 'date', 'case_type', 'cases']]
covid_long_data 
Out[34]:
country date case_type cases
0 Afghanistan 2020-01-22 total 0
1 Afghanistan 2020-01-23 total 0
2 Afghanistan 2020-01-24 total 0
3 Afghanistan 2020-01-25 total 0
4 Afghanistan 2020-01-26 total 0
... ... ... ... ...
41743 Zimbabwe 2020-04-24 daily 1
41744 Zimbabwe 2020-04-25 daily 2
41745 Zimbabwe 2020-04-26 daily 0
41746 Zimbabwe 2020-04-27 daily 1
41747 Zimbabwe 2020-04-28 daily 0

41748 rows × 4 columns

In [35]:
del(covid_wide_data, covid_long_data)

Removing Rows with Missing Data

In [36]:
covid_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20874 entries, 0 to 20873
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   continent         20580 non-null  object        
 1   region            20580 non-null  object        
 2   country           20874 non-null  object        
 3   iso_country_code  20482 non-null  object        
 4   date              20874 non-null  datetime64[ns]
 5   total_cases       20874 non-null  int64         
 6   daily_cases       20874 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(4)
memory usage: 1.3+ MB
In [37]:
covid_data.dropna(axis = 'index', how = 'any', inplace = True)
covid_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20482 entries, 0 to 20873
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   continent         20482 non-null  object        
 1   region            20482 non-null  object        
 2   country           20482 non-null  object        
 3   iso_country_code  20482 non-null  object        
 4   date              20482 non-null  datetime64[ns]
 5   total_cases       20482 non-null  int64         
 6   daily_cases       20482 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(4)
memory usage: 1.3+ MB

Aggregating Data to Obtain Total Cases By Region

In [38]:
aggregate_data = covid_data.loc[covid_data['date'] == covid_data['date'].max(), ['region', 'total_cases']]
aggregate_data = aggregate_data.groupby('region').sum()
aggregate_data.index.name = None
aggregate_data
Out[38]:
total_cases
Central Asia 5674
Central and South America 189978
Eastern Asia 108904
Eastern Europe 155584
Non-Northern Africa 21375
North Africa 14303
Northern America 1063841
Northern Europe 228025
Oceania 8320
Southeastern Asia 41932
Southern Asia 147740
Southern Europe 477098
Western Asia 190554
Western Europe 461806
In [39]:
del(aggregate_data)

Example Plot

In [40]:
us_data = covid_data.loc[covid_data['country'] == 'US',]
us_data = us_data.loc[us_data['date'] >= '2020-03-01',]
us_data = us_data.reset_index()
us_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   index             59 non-null     int64         
 1   continent         59 non-null     object        
 2   region            59 non-null     object        
 3   country           59 non-null     object        
 4   iso_country_code  59 non-null     object        
 5   date              59 non-null     datetime64[ns]
 6   total_cases       59 non-null     int64         
 7   daily_cases       59 non-null     int64         
dtypes: datetime64[ns](1), int64(3), object(4)
memory usage: 3.8+ KB
In [41]:
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.bar(us_data['date'], us_data['daily_cases'])
plt.show()
In [42]:
del(us_data, plt, fig, ax)