I am digressing a bit in my learning of programming in Python, and I come back to a task which I have kept failing at so far, namely at reading data out of the World Economic Outlook database, published by the International Monetary Fund (https://www.imf.org/en/Publications/WEO/weo-database/2020/October ). This is good training in data cleansing. When I click on that link, I can choose two different formats: TAB delimited values or SDMX. The former download as an Excel file, essentially. Frankly, I feel not up to treating the latter: it is a dynamic format, essentially based on an XML tree. Still to learn, for me. This is one of those cases when I prefer staying in the cave. Daylight can wait. I stick to Excel. I download it, I open it in Excel and I preliminarily cleanse the spreadsheet of the most salient stuff, such as e.g. title rows in the heading above the column labels.
Preliminary cleansing done, I copy the Excel workbook to the working directory of my Anaconda, which is, by default, the root directory of my user profile. I create a new notebook in JupyterLab, and I start by importing whatever I think can be useful:
>> import numpy as np
>> import pandas as pd
>> import os
>> import math
I check the presence of the Excel file with the ‘os.listdir()’ command, and I go:
>> WEO=pd.DataFrame(pd.read_excel(‘WEOOct2020all.xlsx’,sheet_name=’WEOOct2020all’,header=0))
Seems cool. The kernel has swallowed the command. Just in case, I check with ‘WEO.describe()’, and I get:
Estimates Start After | |
count | 7585.000000 |
mean | 2015.186421 |
std | 80.240679 |
min | 0.000000 |
25% | 2018.000000 |
50% | 2019.000000 |
75% | 2019.000000 |
max | 2020.000000 |
WTF? ‘Estimates start after’ is the last column of a two-dimensional table in Excel, and this column gives the year up to which the database provides actual empirics, and after which it is just projections. Besides this one, the database contains numerical columns corresponding to years, starting from 1980. When I go ‘WEO.columns’, I get:
Index([ ‘Country’, ‘Subject Descriptor’,
‘Units’, ‘Scale’,
1980, 1981,
1982, 1983,
1984, 1985,
1986, 1987,
1988, 1989,
1990, 1991,
1992, 1993,
1994, 1995,
1996, 1997,
1998, 1999,
2000, 2001,
2002, 2003,
2004, 2005,
2006, 2007,
2008, 2009,
2010, 2011,
2012, 2013,
2014, 2015,
2016, 2017,
2018, 2019,
2020, 2021,
2022, 2023,
2024, 2025,
‘Estimates Start After’],
dtype=’object’)
Aha! These columns are there, only Python sees them as non-numerical and does not compute any stats from them. As we say in Poland, I am trying to get my man from another angle. I open the source XLSX file in Excel and I save a copy thereof in the CSV format, in the working directory of my Anaconda. I remember that when saved out of an XLSX file, CSVs tend to have the semi-column as separator, instead of the coma. To everyone their ways, mind you. Thus, I go:
>> WEO2=pd.DataFrame(pd.read_csv(‘WEOOct2020all.csv’,header=0,sep=’;’))
When I check with ‘WEO2.info()’, I get:
<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 8775 entries, 0 to 8774
Data columns (total 51 columns):
# Column Non-Null Count Dtype
— —— ————– —–
0 Country 8775 non-null object
1 Subject Descriptor 8775 non-null object
2 Units 8775 non-null object
3 Scale 3900 non-null object
4 1980 3879 non-null object
5 1981 4008 non-null object
6 1982 4049 non-null object
7 1983 4091 non-null object
8 1984 4116 non-null object
9 1985 4192 non-null object
10 1986 4228 non-null object
11 1987 4249 non-null object
12 1988 4338 non-null object
13 1989 4399 non-null object
14 1990 4888 non-null object
15 1991 5045 non-null object
16 1992 5428 non-null object
17 1993 5621 non-null object
18 1994 5748 non-null object
19 1995 6104 non-null object
20 1996 6247 non-null object
21 1997 6412 non-null object
22 1998 6584 non-null object
23 1999 6662 non-null object
24 2000 7071 non-null object
25 2001 7193 non-null object
26 2002 7289 non-null object
27 2003 7323 non-null object
28 2004 7391 non-null object
29 2005 7428 non-null object
30 2006 7433 non-null object
31 2007 7441 non-null object
32 2008 7452 non-null object
33 2009 7472 non-null object
34 2010 7475 non-null object
35 2011 7477 non-null object
36 2012 7484 non-null object
37 2013 7493 non-null object
38 2014 7523 non-null object
39 2015 7545 non-null object
40 2016 7547 non-null object
41 2017 7551 non-null object
42 2018 7547 non-null object
43 2019 7539 non-null object
44 2020 7501 non-null object
45 2021 7449 non-null object
46 2022 7389 non-null object
47 2023 7371 non-null object
48 2024 7371 non-null object
49 2025 7371 non-null object
50 Estimates Start After 7585 non-null float64
dtypes: float64(1), object(50)
memory usage: 3.4+ MB
There is some progress, still it is not the entire progress I expected. I still don’t have numerical data, in ‘float64’ type, where I expect it to have. I dig a bit and I see the source of the problem. In the WEO database there is plenty of empty cells, especially before the year 2000. They correspond to missing data, quite simply. In the source XLSX file, they are either just empty, or filled with something that looks like a double hyphen: ‘- -‘. Python shows the contents of these cells as ‘NaN’, which stands for ‘Not a Number’. That double hyphen is the most annoying of the two, as Excel does not see it in the command ‘Replace’. I need to use Python. I do two phases of cleansing:
>> WEO3=WEO2.replace(np.nan,”, regex=True)
>> WEO4=WEO3.replace(‘–‘,”, regex=True)
I check with ‘WEO4.info()’ aaaaand… Bingo! Columns from ‘1980’ to ‘2025’ are of the type ‘float64’.
The WEO database is made of several variables stacked one underneath the other in consecutive rows. You have one country, and for that country you have variables such as GDP, fiscal balance and whatnot. Essentially, it is a database presented in the two-dimensional format with multiple indexes, embedded one inside the other. The complexity of indexes replaces the multitude of dimensions in the actual data. I start intuitively, with creating lists of column labels corresponding, respectively, to numerical data, and to index descriptors:
>> Numerical_Data=[‘1980’, ‘1981’,
‘1982’, ‘1983’, ‘1984’, ‘1985’, ‘1986’, ‘1987’, ‘1988’, ‘1989’, ‘1990’,
‘1991’, ‘1992’, ‘1993’, ‘1994’, ‘1995’, ‘1996’, ‘1997’, ‘1998’, ‘1999’,
‘2000’, ‘2001’, ‘2002’, ‘2003’, ‘2004’, ‘2005’, ‘2006’, ‘2007’, ‘2008’,
‘2009’, ‘2010’, ‘2011’, ‘2012’, ‘2013’, ‘2014’, ‘2015’, ‘2016’, ‘2017’,
‘2018’, ‘2019’, ‘2020’, ‘2021’, ‘2022’, ‘2023’, ‘2024’, ‘2025’]
>> Index_descriptors=[‘Country’, ‘Subject Descriptor’, ‘Units’, ‘Scale’,’Estimates Start After’]
Now, I mess around a bit with those dictionaries and with indexing that big dataset. In a moment, you will understand why I do so. I go:
>> Subject_Descriptors=pd.unique(WEO4[‘Subject Descriptor’]) # I made a data frame out of unique index labels in the column ‘Subject Descriptor’. I get:
>> array([‘Gross domestic product, constant prices’,
‘Gross domestic product, current prices’,
‘Gross domestic product, deflator’,
‘Gross domestic product per capita, constant prices’,
‘Gross domestic product per capita, current prices’,
‘Output gap in percent of potential GDP’,
‘Gross domestic product based on purchasing-power-parity (PPP) share of world total’,
‘Implied PPP conversion rate’, ‘Total investment’,
‘Gross national savings’, ‘Inflation, average consumer prices’,
‘Inflation, end of period consumer prices’,
‘Six-month London interbank offered rate (LIBOR)’,
‘Volume of imports of goods and services’,
‘Volume of Imports of goods’,
‘Volume of exports of goods and services’,
‘Volume of exports of goods’, ‘Unemployment rate’, ‘Employment’,
‘Population’, ‘General government revenue’,
‘General government total expenditure’,
‘General government net lending/borrowing’,
‘General government structural balance’,
‘General government primary net lending/borrowing’,
‘General government net debt’, ‘General government gross debt’,
‘Gross domestic product corresponding to fiscal year, current prices’,
‘Current account balance’], dtype=object)
In other words, each country is characterized in the WEOOct2020 database with the above characteristics. I need to group and extract data so as to have those variables separated. The kind of transformation which I want to nail down is to transpose those variables with years. In the source version of WEOOct2020, years are separate columns that cut across three basic indexes: countries, for one, the above presented subject descriptors, for two, and finally the indexes of units and scale. The latter is important to the extent that most macroeconomic aggregates are presented either as absolute amounts or as percentages of the country’s GDP. Probably you remember from math classes at school, and those of physics and chemistry too, actually, that confusing units of measurement is a cardinal sin in science. What I want to do is to flip the thing on its side. I want each country to be associated with a series of index labels corresponding to years, and variables associated with proper units of measurement being the columns of the dataset.
In other words, now, years are the main quantitative categories of the WEOOct202 data frame, and categorial variables are index labels, or phenomenological units of observation. I want these two to change places, as it essentially should be: categorial variables should become phenomenological categories, and years should gracefully step down to the status of observational units.
As I don’t know what to do, I reach to what I know how to do, i.e. to creating some sort of dictionaries out of index labels. What I did for subject descriptors, I do for units,
>> Units=pd.unique(WEO4[‘Units’])
…which yields:
array([‘National currency’, ‘Percent change’, ‘U.S. dollars’,
‘Purchasing power parity; international dollars’, ‘Index’,
‘Purchasing power parity; 2017 international dollar’,
‘Percent of potential GDP’, ‘Percent’,
‘National currency per current international dollar’,
‘Percent of GDP’, ‘Percent of total labor force’, ‘Persons’],
dtype=object)
and…
>> Countries=pd.unique(WEO4[‘Country’])
…which yields:
array([‘Afghanistan’, ‘Albania’, ‘Algeria’, ‘Angola’,
‘Antigua and Barbuda’, ‘Argentina’, ‘Armenia’, ‘Aruba’,
‘Australia’, ‘Austria’, ‘Azerbaijan’, ‘The Bahamas’, ‘Bahrain’,
‘Bangladesh’, ‘Barbados’, ‘Belarus’, ‘Belgium’, ‘Belize’, ‘Benin’,
‘Bhutan’, ‘Bolivia’, ‘Bosnia and Herzegovina’, ‘Botswana’,
‘Brazil’, ‘Brunei Darussalam’, ‘Bulgaria’, ‘Burkina Faso’,
‘Burundi’, ‘Cabo Verde’, ‘Cambodia’, ‘Cameroon’, ‘Canada’,
‘Central African Republic’, ‘Chad’, ‘Chile’, ‘China’, ‘Colombia’,
‘Comoros’, ‘Democratic Republic of the Congo’, ‘Republic of Congo’,
‘Costa Rica’, “CÙte d’Ivoire”, ‘Croatia’, ‘Cyprus’,
‘Czech Republic’, ‘Denmark’, ‘Djibouti’, ‘Dominica’,
‘Dominican Republic’, ‘Ecuador’, ‘Egypt’, ‘El Salvador’,
‘Equatorial Guinea’, ‘Eritrea’, ‘Estonia’, ‘Eswatini’, ‘Ethiopia’,
‘Fiji’, ‘Finland’, ‘France’, ‘Gabon’, ‘The Gambia’, ‘Georgia’,
‘Germany’, ‘Ghana’, ‘Greece’, ‘Grenada’, ‘Guatemala’, ‘Guinea’,
‘Guinea-Bissau’, ‘Guyana’, ‘Haiti’, ‘Honduras’, ‘Hong Kong SAR’,
‘Hungary’, ‘Iceland’, ‘India’, ‘Indonesia’,
‘Islamic Republic of Iran’, ‘Iraq’, ‘Ireland’, ‘Israel’, ‘Italy’,
‘Jamaica’, ‘Japan’, ‘Jordan’, ‘Kazakhstan’, ‘Kenya’, ‘Kiribati’,
‘Korea’, ‘Kosovo’, ‘Kuwait’, ‘Kyrgyz Republic’, ‘Lao P.D.R.’,
‘Latvia’, ‘Lebanon’, ‘Lesotho’, ‘Liberia’, ‘Libya’, ‘Lithuania’,
‘Luxembourg’, ‘Macao SAR’, ‘Madagascar’, ‘Malawi’, ‘Malaysia’,
‘Maldives’, ‘Mali’, ‘Malta’, ‘Marshall Islands’, ‘Mauritania’,
‘Mauritius’, ‘Mexico’, ‘Micronesia’, ‘Moldova’, ‘Mongolia’,
‘Montenegro’, ‘Morocco’, ‘Mozambique’, ‘Myanmar’, ‘Namibia’,
‘Nauru’, ‘Nepal’, ‘Netherlands’, ‘New Zealand’, ‘Nicaragua’,
‘Niger’, ‘Nigeria’, ‘North Macedonia’, ‘Norway’, ‘Oman’,
‘Pakistan’, ‘Palau’, ‘Panama’, ‘Papua New Guinea’, ‘Paraguay’,
‘Peru’, ‘Philippines’, ‘Poland’, ‘Portugal’, ‘Puerto Rico’,
‘Qatar’, ‘Romania’, ‘Russia’, ‘Rwanda’, ‘Samoa’, ‘San Marino’,
‘S„o TomÈ and PrÌncipe’, ‘Saudi Arabia’, ‘Senegal’, ‘Serbia’,
‘Seychelles’, ‘Sierra Leone’, ‘Singapore’, ‘Slovak Republic’,
‘Slovenia’, ‘Solomon Islands’, ‘Somalia’, ‘South Africa’,
‘South Sudan’, ‘Spain’, ‘Sri Lanka’, ‘St. Kitts and Nevis’,
‘St. Lucia’, ‘St. Vincent and the Grenadines’, ‘Sudan’, ‘Suriname’,
‘Sweden’, ‘Switzerland’, ‘Syria’, ‘Taiwan Province of China’,
‘Tajikistan’, ‘Tanzania’, ‘Thailand’, ‘Timor-Leste’, ‘Togo’,
‘Tonga’, ‘Trinidad and Tobago’, ‘Tunisia’, ‘Turkey’,
‘Turkmenistan’, ‘Tuvalu’, ‘Uganda’, ‘Ukraine’,
‘United Arab Emirates’, ‘United Kingdom’, ‘United States’,
‘Uruguay’, ‘Uzbekistan’, ‘Vanuatu’, ‘Venezuela’, ‘Vietnam’,
‘West Bank and Gaza’, ‘Yemen’, ‘Zambia’, ‘Zimbabwe’], dtype=object)
Once again, I don’t really know what I am doing. I just intuitively look for some sort of landmarks in that landscape of data. By the way, this is what we all do when we don’t know what to do: we look for reliable ways to partition observable reality into categories.
Now, I want to make sure that Python has the same views as me as for what index descriptors are in that dataset. I go:
>> pd.MultiIndex.from_frame(WEO4[Index_descriptors])
… and I get:
MultiIndex([(‘Afghanistan’, …),
(‘Afghanistan’, …),
(‘Afghanistan’, …),
(‘Afghanistan’, …),
(‘Afghanistan’, …),
(‘Afghanistan’, …),
(‘Afghanistan’, …),
(‘Afghanistan’, …),
(‘Afghanistan’, …),
(‘Afghanistan’, …),
…
( ‘Zimbabwe’, …),
( ‘Zimbabwe’, …),
( ‘Zimbabwe’, …),
( ‘Zimbabwe’, …),
( ‘Zimbabwe’, …),
( ‘Zimbabwe’, …),
( ‘Zimbabwe’, …),
( ‘Zimbabwe’, …),
( ‘Zimbabwe’, …),
( ‘Zimbabwe’, …)],
names=[‘Country’, ‘Subject Descriptor’, ‘Units’, ‘Scale’, ‘Estimates Start After’], length=8775)
Seems OK.
Now, I need to fuse somehow the index of Subject Descriptor with the Index of Units, so as to have categories ready for flipping. I keep sort of feeling my way forward, rather than seeing it clearly. Love it, actually. I create an empty data series to contain the merged indexes of ‘Subject Descriptor’ and ‘Units’:
>> Variable=pd.Series(‘object’) # The ‘object’ part means that I want to have words in that data series
Now, I append and I check:
>> WEO4.append(Variable,ignore_index=True)
Aaaand… it doesn’t work. When I check ‘WEO4.info()’, I get the list of columns I had before, without the ‘Variable’. In other words, Python acknowledged that I want to append that columns, and it sort of appended, but just sort of. There is that thing I have already learnt with Python: there is a huge difference between having sort of expected output, on the one hand, and having it 100%, on the other hand. The one hand is bloody frustrating.
I try another trick, the ‘df.insert’ command. I do:
>> WEO4.insert(1,’Variable’,’ ‘)
I check with ‘WEO4.info()’ aaaaand….this time, it worked sort of. I get the new column ‘Variable’, yes, and I have all my numerical columns, the one with ‘Year’ headers, turned back into the ‘object’ format. I f**king love programming. I do:
>> for i in range(0,len(WEO4.columns)):
WEO4.iloc[:,i]=pd.to_numeric(WEO4.iloc[:,i], errors=’ignore’)
… and I check with ‘WEO4.info()’ once again. Victory: numerical is back to numerical.
Now, I am looking for a method to sort of concatenate smartly the contents of two incumbent columns, namely ‘Subject Descriptor’ and ‘Units’, into the new vessel, i.e. the column ‘Variable’. I found the simplest possible method, which is straightforward addition:
>> WEO4[“Variable”]=WEO4[“Subject Descriptor”]+WEO4[“Units”]
I typed it, I executed, and, as strange as it seems, Python seems to be OK with that. Well, after all, Python is a language, and languages have that thing: they add words to each other. It is called ‘making sentences’. Cool. I check by creating an array of unique values in the index labels of ‘Variable:
>> Variables=pd.unique(WEO4[‘Variable’])
I check by just typing:
>> Variables
… and running it as a command. I get:
array([‘Gross domestic product, constant pricesNational currency’,
‘Gross domestic product, constant pricesPercent change’,
‘Gross domestic product, current pricesNational currency’,
‘Gross domestic product, current pricesU.S. dollars’,
‘Gross domestic product, current pricesPurchasing power parity; international dollars’,
‘Gross domestic product, deflatorIndex’,
‘Gross domestic product per capita, constant pricesNational currency’,
‘Gross domestic product per capita, constant pricesPurchasing power parity; 2017 international dollar’,
‘Gross domestic product per capita, current pricesNational currency’,
‘Gross domestic product per capita, current pricesU.S. dollars’,
‘Gross domestic product per capita, current pricesPurchasing power parity; international dollars’,
‘Output gap in percent of potential GDPPercent of potential GDP’,
‘Gross domestic product based on purchasing-power-parity (PPP) share of world totalPercent’,
‘Implied PPP conversion rateNational currency per current international dollar’,
‘Total investmentPercent of GDP’,
‘Gross national savingsPercent of GDP’,
‘Inflation, average consumer pricesIndex’,
‘Inflation, average consumer pricesPercent change’,
‘Inflation, end of period consumer pricesIndex’,
‘Inflation, end of period consumer pricesPercent change’,
‘Six-month London interbank offered rate (LIBOR)Percent’,
‘Volume of imports of goods and servicesPercent change’,
‘Volume of Imports of goodsPercent change’,
‘Volume of exports of goods and servicesPercent change’,
‘Volume of exports of goodsPercent change’,
‘Unemployment ratePercent of total labor force’,
‘EmploymentPersons’, ‘PopulationPersons’,
‘General government revenueNational currency’,
‘General government revenuePercent of GDP’,
‘General government total expenditureNational currency’,
‘General government total expenditurePercent of GDP’,
‘General government net lending/borrowingNational currency’,
‘General government net lending/borrowingPercent of GDP’,
‘General government structural balanceNational currency’,
‘General government structural balancePercent of potential GDP’,
‘General government primary net lending/borrowingNational currency’,
‘General government primary net lending/borrowingPercent of GDP’,
‘General government net debtNational currency’,
‘General government net debtPercent of GDP’,
‘General government gross debtNational currency’,
‘General government gross debtPercent of GDP’,
‘Gross domestic product corresponding to fiscal year, current pricesNational currency’,
‘Current account balanceU.S. dollars’,
‘Current account balancePercent of GDP’], dtype=object)
Cool. It seems to have worked.