Once again, I don’t really know what I am doing, and I love the feeling

I am digressing a bit in my learning of programming in Python, and I come back to a task which I have kept failing at so far, namely at reading data out of the World Economic Outlook database, published by the International Monetary Fund (https://www.imf.org/en/Publications/WEO/weo-database/2020/October ). This is good training in data cleansing. When I click on that link, I can choose two different formats: TAB delimited values or SDMX. The former download as an Excel file, essentially. Frankly, I feel not up to treating the latter: it is a dynamic format, essentially based on an XML tree. Still to learn, for me. This is one of those cases when I prefer staying in the cave. Daylight can wait. I stick to Excel. I download it, I open it in Excel and I preliminarily cleanse the spreadsheet of the most salient stuff, such as e.g. title rows in the heading above the column labels.

Preliminary cleansing done, I copy the Excel workbook to the working directory of my Anaconda, which is, by default, the root directory of my user profile. I create a new notebook in JupyterLab, and I start by importing whatever I think can be useful:

>> import numpy as np

>> import pandas as pd

>> import os

>> import math    

I check the presence of the Excel file with the ‘os.listdir()’ command, and I go:

>> WEO=pd.DataFrame(pd.read_excel(‘WEOOct2020all.xlsx’,sheet_name=’WEOOct2020all’,header=0))     

Seems cool. The kernel has swallowed the command. Just in case, I check with ‘WEO.describe()’, and I get:

 Estimates Start After
count7585.000000
mean2015.186421
std80.240679
min0.000000
25%2018.000000
50%2019.000000
75%2019.000000
max2020.000000

WTF? ‘Estimates start after’ is the last column of a two-dimensional table in Excel, and this column gives the year up to which the database provides actual empirics, and after which it is just projections. Besides this one, the database contains numerical columns corresponding to years, starting from 1980. When I go ‘WEO.columns’, I get:

Index([              ‘Country’,    ‘Subject Descriptor’,

                       ‘Units’,                 ‘Scale’,

                          1980,                    1981,

                          1982,                    1983,

                          1984,                    1985,

                          1986,                    1987,

                          1988,                    1989,

                          1990,                    1991,

                          1992,                    1993,

                          1994,                    1995,

                          1996,                    1997,

                          1998,                    1999,

                          2000,                    2001,

                          2002,                    2003,

                          2004,                    2005,

                          2006,                    2007,

                          2008,                    2009,

                          2010,                    2011,

                          2012,                    2013,

                          2014,                    2015,

                          2016,                    2017,

                          2018,                    2019,

                          2020,                    2021,

                          2022,                    2023,

                          2024,                    2025,

       ‘Estimates Start After’],

      dtype=’object’)

Aha! These columns are there, only Python sees them as non-numerical and does not compute any stats from them. As we say in Poland, I am trying to get my man from another angle. I open the source XLSX file in Excel and I save a copy thereof in the CSV format, in the working directory of my Anaconda. I remember that when saved out of an XLSX file, CSVs tend to have the semi-column as separator, instead of the coma. To everyone their ways, mind you. Thus, I go:

>> WEO2=pd.DataFrame(pd.read_csv(‘WEOOct2020all.csv’,header=0,sep=’;’))

When I check with ‘WEO2.info()’, I get:

<class ‘pandas.core.frame.DataFrame’>

RangeIndex: 8775 entries, 0 to 8774

Data columns (total 51 columns):

 #   Column                 Non-Null Count  Dtype 

—  ——                 ————–  —– 

 0   Country                8775 non-null   object

 1   Subject Descriptor     8775 non-null   object

 2   Units                  8775 non-null   object

 3   Scale                  3900 non-null   object

 4   1980                   3879 non-null   object

 5   1981                   4008 non-null   object

 6   1982                   4049 non-null   object

 7   1983                   4091 non-null   object

 8   1984                   4116 non-null   object

 9   1985                   4192 non-null   object

 10  1986                   4228 non-null   object

 11  1987                   4249 non-null   object

 12  1988                   4338 non-null   object

 13  1989                   4399 non-null   object

 14  1990                   4888 non-null   object

 15  1991                   5045 non-null   object

 16  1992                   5428 non-null   object

 17  1993                   5621 non-null   object

 18  1994                   5748 non-null   object

 19  1995                   6104 non-null   object

 20  1996                   6247 non-null   object

 21  1997                   6412 non-null   object

 22  1998                   6584 non-null   object

 23  1999                   6662 non-null   object

 24  2000                   7071 non-null   object

 25  2001                   7193 non-null   object

 26  2002                   7289 non-null   object

 27  2003                   7323 non-null   object

 28  2004                   7391 non-null   object

 29  2005                   7428 non-null   object

 30  2006                   7433 non-null   object

 31  2007                   7441 non-null   object

 32  2008                   7452 non-null   object

 33  2009                   7472 non-null   object

 34  2010                   7475 non-null   object

 35  2011                   7477 non-null   object

 36  2012                   7484 non-null   object

 37  2013                   7493 non-null   object

 38  2014                   7523 non-null   object

 39  2015                   7545 non-null   object

 40  2016                   7547 non-null   object

 41  2017                   7551 non-null   object

 42  2018                   7547 non-null   object

 43  2019                   7539 non-null   object

 44  2020                   7501 non-null   object

 45  2021                   7449 non-null   object

 46  2022                   7389 non-null   object

 47  2023                   7371 non-null   object

 48  2024                   7371 non-null   object

 49  2025                   7371 non-null   object

 50  Estimates Start After  7585 non-null   float64

dtypes: float64(1), object(50)

memory usage: 3.4+ MB

There is some progress, still it is not the entire progress I expected. I still don’t have numerical data, in ‘float64’ type, where I expect it to have. I dig a bit and I see the source of the problem. In the WEO database there is plenty of empty cells, especially before the year 2000. They correspond to missing data, quite simply. In the source XLSX file, they are either just empty, or filled with something that looks like a double hyphen: ‘- -‘. Python shows the contents of these cells as ‘NaN’, which stands for ‘Not a Number’. That double hyphen is the most annoying of the two, as Excel does not see it in the command ‘Replace’. I need to use Python. I do two phases of cleansing:

>> WEO3=WEO2.replace(np.nan,”, regex=True)

>> WEO4=WEO3.replace(‘–‘,”, regex=True)

 I check with ‘WEO4.info()’ aaaaand… Bingo! Columns from ‘1980’ to ‘2025’ are of the type ‘float64’.

The WEO database is made of several variables stacked one underneath the other in consecutive rows. You have one country, and for that country you have variables such as GDP, fiscal balance and whatnot. Essentially, it is a database presented in the two-dimensional format with multiple indexes, embedded one inside the other. The complexity of indexes replaces the multitude of dimensions in the actual data. I start intuitively, with creating lists of column labels corresponding, respectively, to numerical data, and to index descriptors:

>> Numerical_Data=[‘1980’, ‘1981’,

       ‘1982’, ‘1983’, ‘1984’, ‘1985’, ‘1986’, ‘1987’, ‘1988’, ‘1989’, ‘1990’,

       ‘1991’, ‘1992’, ‘1993’, ‘1994’, ‘1995’, ‘1996’, ‘1997’, ‘1998’, ‘1999’,

       ‘2000’, ‘2001’, ‘2002’, ‘2003’, ‘2004’, ‘2005’, ‘2006’, ‘2007’, ‘2008’,

       ‘2009’, ‘2010’, ‘2011’, ‘2012’, ‘2013’, ‘2014’, ‘2015’, ‘2016’, ‘2017’,

       ‘2018’, ‘2019’, ‘2020’, ‘2021’, ‘2022’, ‘2023’, ‘2024’, ‘2025’]

>>  Index_descriptors=[‘Country’, ‘Subject Descriptor’, ‘Units’, ‘Scale’,’Estimates Start After’]

Now, I mess around a bit with those dictionaries and with indexing that big dataset. In a moment, you will understand why I do so. I go:

>> Subject_Descriptors=pd.unique(WEO4[‘Subject Descriptor’]) # I made a data frame out of unique index labels in the column ‘Subject Descriptor’.  I get:

>> array([‘Gross domestic product, constant prices’,

       ‘Gross domestic product, current prices’,

       ‘Gross domestic product, deflator’,

       ‘Gross domestic product per capita, constant prices’,

       ‘Gross domestic product per capita, current prices’,

       ‘Output gap in percent of potential GDP’,

       ‘Gross domestic product based on purchasing-power-parity (PPP) share of world total’,

       ‘Implied PPP conversion rate’, ‘Total investment’,

       ‘Gross national savings’, ‘Inflation, average consumer prices’,

       ‘Inflation, end of period consumer prices’,

       ‘Six-month London interbank offered rate (LIBOR)’,

       ‘Volume of imports of goods and services’,

       ‘Volume of Imports of goods’,

       ‘Volume of exports of goods and services’,

       ‘Volume of exports of goods’, ‘Unemployment rate’, ‘Employment’,

       ‘Population’, ‘General government revenue’,

       ‘General government total expenditure’,

       ‘General government net lending/borrowing’,

       ‘General government structural balance’,

       ‘General government primary net lending/borrowing’,

       ‘General government net debt’, ‘General government gross debt’,

       ‘Gross domestic product corresponding to fiscal year, current prices’,

       ‘Current account balance’], dtype=object)

In other words, each country is characterized in the WEOOct2020 database with the above characteristics. I need to group and extract data so as to have those variables separated. The kind of transformation which I want to nail down is to transpose those variables with years. In the source version of WEOOct2020, years are separate columns that cut across three basic indexes: countries, for one, the above presented subject descriptors, for two, and finally the indexes of units and scale. The latter is important to the extent that most macroeconomic aggregates are presented either as absolute amounts or as percentages of the country’s GDP. Probably you remember from math classes at school, and those of physics and chemistry too, actually, that confusing units of measurement is a cardinal sin in science. What I want to do is to flip the thing on its side. I want each country to be associated with a series of index labels corresponding to years, and variables associated with proper units of measurement being the columns of the dataset.

In other words, now, years are the main quantitative categories of the WEOOct202 data frame, and categorial variables are index labels, or phenomenological units of observation. I want these two to change places, as it essentially should be: categorial variables should become phenomenological categories, and years should gracefully step down to the status of observational units.

As I don’t know what to do, I reach to what I know how to do, i.e. to creating some sort of dictionaries out of index labels. What I did for subject descriptors, I do for units,     

>> Units=pd.unique(WEO4[‘Units’])

…which yields:

array([‘National currency’, ‘Percent change’, ‘U.S. dollars’,

       ‘Purchasing power parity; international dollars’, ‘Index’,

       ‘Purchasing power parity; 2017 international dollar’,

       ‘Percent of potential GDP’, ‘Percent’,

       ‘National currency per current international dollar’,

       ‘Percent of GDP’, ‘Percent of total labor force’, ‘Persons’],

      dtype=object)

and…

>> Countries=pd.unique(WEO4[‘Country’])

…which yields:

array([‘Afghanistan’, ‘Albania’, ‘Algeria’, ‘Angola’,

       ‘Antigua and Barbuda’, ‘Argentina’, ‘Armenia’, ‘Aruba’,

       ‘Australia’, ‘Austria’, ‘Azerbaijan’, ‘The Bahamas’, ‘Bahrain’,

       ‘Bangladesh’, ‘Barbados’, ‘Belarus’, ‘Belgium’, ‘Belize’, ‘Benin’,

       ‘Bhutan’, ‘Bolivia’, ‘Bosnia and Herzegovina’, ‘Botswana’,

       ‘Brazil’, ‘Brunei Darussalam’, ‘Bulgaria’, ‘Burkina Faso’,

       ‘Burundi’, ‘Cabo Verde’, ‘Cambodia’, ‘Cameroon’, ‘Canada’,

       ‘Central African Republic’, ‘Chad’, ‘Chile’, ‘China’, ‘Colombia’,

       ‘Comoros’, ‘Democratic Republic of the Congo’, ‘Republic of Congo’,

       ‘Costa Rica’, “CÙte d’Ivoire”, ‘Croatia’, ‘Cyprus’,

       ‘Czech Republic’, ‘Denmark’, ‘Djibouti’, ‘Dominica’,

       ‘Dominican Republic’, ‘Ecuador’, ‘Egypt’, ‘El Salvador’,

       ‘Equatorial Guinea’, ‘Eritrea’, ‘Estonia’, ‘Eswatini’, ‘Ethiopia’,

       ‘Fiji’, ‘Finland’, ‘France’, ‘Gabon’, ‘The Gambia’, ‘Georgia’,

       ‘Germany’, ‘Ghana’, ‘Greece’, ‘Grenada’, ‘Guatemala’, ‘Guinea’,

       ‘Guinea-Bissau’, ‘Guyana’, ‘Haiti’, ‘Honduras’, ‘Hong Kong SAR’,

       ‘Hungary’, ‘Iceland’, ‘India’, ‘Indonesia’,

       ‘Islamic Republic of Iran’, ‘Iraq’, ‘Ireland’, ‘Israel’, ‘Italy’,

       ‘Jamaica’, ‘Japan’, ‘Jordan’, ‘Kazakhstan’, ‘Kenya’, ‘Kiribati’,

       ‘Korea’, ‘Kosovo’, ‘Kuwait’, ‘Kyrgyz Republic’, ‘Lao P.D.R.’,

       ‘Latvia’, ‘Lebanon’, ‘Lesotho’, ‘Liberia’, ‘Libya’, ‘Lithuania’,

       ‘Luxembourg’, ‘Macao SAR’, ‘Madagascar’, ‘Malawi’, ‘Malaysia’,

       ‘Maldives’, ‘Mali’, ‘Malta’, ‘Marshall Islands’, ‘Mauritania’,

       ‘Mauritius’, ‘Mexico’, ‘Micronesia’, ‘Moldova’, ‘Mongolia’,

       ‘Montenegro’, ‘Morocco’, ‘Mozambique’, ‘Myanmar’, ‘Namibia’,

       ‘Nauru’, ‘Nepal’, ‘Netherlands’, ‘New Zealand’, ‘Nicaragua’,

       ‘Niger’, ‘Nigeria’, ‘North Macedonia’, ‘Norway’, ‘Oman’,

       ‘Pakistan’, ‘Palau’, ‘Panama’, ‘Papua New Guinea’, ‘Paraguay’,

       ‘Peru’, ‘Philippines’, ‘Poland’, ‘Portugal’, ‘Puerto Rico’,

       ‘Qatar’, ‘Romania’, ‘Russia’, ‘Rwanda’, ‘Samoa’, ‘San Marino’,

       ‘S„o TomÈ and PrÌncipe’, ‘Saudi Arabia’, ‘Senegal’, ‘Serbia’,

       ‘Seychelles’, ‘Sierra Leone’, ‘Singapore’, ‘Slovak Republic’,

       ‘Slovenia’, ‘Solomon Islands’, ‘Somalia’, ‘South Africa’,

       ‘South Sudan’, ‘Spain’, ‘Sri Lanka’, ‘St. Kitts and Nevis’,

       ‘St. Lucia’, ‘St. Vincent and the Grenadines’, ‘Sudan’, ‘Suriname’,

       ‘Sweden’, ‘Switzerland’, ‘Syria’, ‘Taiwan Province of China’,

       ‘Tajikistan’, ‘Tanzania’, ‘Thailand’, ‘Timor-Leste’, ‘Togo’,

       ‘Tonga’, ‘Trinidad and Tobago’, ‘Tunisia’, ‘Turkey’,

       ‘Turkmenistan’, ‘Tuvalu’, ‘Uganda’, ‘Ukraine’,

       ‘United Arab Emirates’, ‘United Kingdom’, ‘United States’,

       ‘Uruguay’, ‘Uzbekistan’, ‘Vanuatu’, ‘Venezuela’, ‘Vietnam’,

       ‘West Bank and Gaza’, ‘Yemen’, ‘Zambia’, ‘Zimbabwe’], dtype=object)

Once again, I don’t really know what I am doing. I just intuitively look for some sort of landmarks in that landscape of data. By the way, this is what we all do when we don’t know what to do: we look for reliable ways to partition observable reality into categories.

Now, I want to make sure that Python has the same views as me as for what index descriptors are in that dataset. I go:

>> pd.MultiIndex.from_frame(WEO4[Index_descriptors])

… and I get:

MultiIndex([(‘Afghanistan’, …),

            (‘Afghanistan’, …),

            (‘Afghanistan’, …),

            (‘Afghanistan’, …),

            (‘Afghanistan’, …),

            (‘Afghanistan’, …),

            (‘Afghanistan’, …),

            (‘Afghanistan’, …),

            (‘Afghanistan’, …),

            (‘Afghanistan’, …),

            …

            (   ‘Zimbabwe’, …),

            (   ‘Zimbabwe’, …),

            (   ‘Zimbabwe’, …),

            (   ‘Zimbabwe’, …),

            (   ‘Zimbabwe’, …),

            (   ‘Zimbabwe’, …),

            (   ‘Zimbabwe’, …),

            (   ‘Zimbabwe’, …),

            (   ‘Zimbabwe’, …),

            (   ‘Zimbabwe’, …)],

           names=[‘Country’, ‘Subject Descriptor’, ‘Units’, ‘Scale’, ‘Estimates Start After’], length=8775)

Seems OK.

Now, I need to fuse somehow the index of Subject Descriptor with the Index of Units, so as to have categories ready for flipping. I keep sort of feeling my way forward, rather than seeing it clearly. Love it, actually. I create an empty data series to contain the merged indexes of ‘Subject Descriptor’ and ‘Units’:

>> Variable=pd.Series(‘object’) # The ‘object’ part means that I want to have words in that data series

Now, I append and I check:

>> WEO4.append(Variable,ignore_index=True)

Aaaand… it doesn’t work. When I check ‘WEO4.info()’, I get the list of columns I had before, without the ‘Variable’. In other words, Python acknowledged that I want to append that columns, and it sort of appended, but just sort of. There is that thing I have already learnt with Python: there is a huge difference between having sort of expected output, on the one hand, and having it 100%, on the other hand. The one hand is bloody frustrating.  

I try another trick, the ‘df.insert’ command. I do:

>> WEO4.insert(1,’Variable’,’ ‘)

I check with ‘WEO4.info()’ aaaaand….this time, it worked sort of. I get the new column ‘Variable’, yes, and I have all my numerical columns, the one with ‘Year’ headers, turned back into the ‘object’ format. I f**king love programming. I do:

>> for i in range(0,len(WEO4.columns)):

    WEO4.iloc[:,i]=pd.to_numeric(WEO4.iloc[:,i], errors=’ignore’)

… and I check with ‘WEO4.info()’ once again. Victory: numerical is back to numerical.

Now, I am looking for a method to sort of concatenate smartly the contents of two incumbent columns, namely ‘Subject Descriptor’ and ‘Units’, into the new vessel, i.e. the column ‘Variable’. I found the simplest possible method, which is straightforward addition:

>> WEO4[“Variable”]=WEO4[“Subject Descriptor”]+WEO4[“Units”]

I typed it, I executed, and, as strange as it seems, Python seems to be OK with that. Well, after all, Python is a language, and languages have that thing: they add words to each other. It is called ‘making sentences’. Cool. I check by creating an array of unique values in the index labels of ‘Variable:

>> Variables=pd.unique(WEO4[‘Variable’])

I check by just typing:

>> Variables

… and running it as a command. I get:

array([‘Gross domestic product, constant pricesNational currency’,

       ‘Gross domestic product, constant pricesPercent change’,

       ‘Gross domestic product, current pricesNational currency’,

       ‘Gross domestic product, current pricesU.S. dollars’,

       ‘Gross domestic product, current pricesPurchasing power parity; international dollars’,

       ‘Gross domestic product, deflatorIndex’,

       ‘Gross domestic product per capita, constant pricesNational currency’,

       ‘Gross domestic product per capita, constant pricesPurchasing power parity; 2017 international dollar’,

       ‘Gross domestic product per capita, current pricesNational currency’,

       ‘Gross domestic product per capita, current pricesU.S. dollars’,

       ‘Gross domestic product per capita, current pricesPurchasing power parity; international dollars’,

       ‘Output gap in percent of potential GDPPercent of potential GDP’,

       ‘Gross domestic product based on purchasing-power-parity (PPP) share of world totalPercent’,

       ‘Implied PPP conversion rateNational currency per current international dollar’,

       ‘Total investmentPercent of GDP’,

       ‘Gross national savingsPercent of GDP’,

       ‘Inflation, average consumer pricesIndex’,

       ‘Inflation, average consumer pricesPercent change’,

       ‘Inflation, end of period consumer pricesIndex’,

       ‘Inflation, end of period consumer pricesPercent change’,

       ‘Six-month London interbank offered rate (LIBOR)Percent’,

       ‘Volume of imports of goods and servicesPercent change’,

       ‘Volume of Imports of goodsPercent change’,

       ‘Volume of exports of goods and servicesPercent change’,

       ‘Volume of exports of goodsPercent change’,

       ‘Unemployment ratePercent of total labor force’,

       ‘EmploymentPersons’, ‘PopulationPersons’,

       ‘General government revenueNational currency’,

       ‘General government revenuePercent of GDP’,

       ‘General government total expenditureNational currency’,

       ‘General government total expenditurePercent of GDP’,

       ‘General government net lending/borrowingNational currency’,

       ‘General government net lending/borrowingPercent of GDP’,

       ‘General government structural balanceNational currency’,

       ‘General government structural balancePercent of potential GDP’,

       ‘General government primary net lending/borrowingNational currency’,

       ‘General government primary net lending/borrowingPercent of GDP’,

       ‘General government net debtNational currency’,

       ‘General government net debtPercent of GDP’,

       ‘General government gross debtNational currency’,

       ‘General government gross debtPercent of GDP’,

       ‘Gross domestic product corresponding to fiscal year, current pricesNational currency’,

       ‘Current account balanceU.S. dollars’,

       ‘Current account balancePercent of GDP’], dtype=object)

Cool. It seems to have worked.

Leave a Reply