I intend to work on iterations. My general purpose in learning to program in Python is to create my own artificial neural network algorithms, in line with what I have already done in that respect using just Excel. Iteration is the essence of artificial intelligence, to the extent that the latter manifests as an intelligent structure producing many alternative versions of itself. ‘Many’ means one at a time, over many repetitions.
When I run my neural networks in Excel, they do a finite number of iterations. That would be Definite Iteration in Python, thus the structure based on the ‘for’ statement. I am helping myself with the tutorial available at https://realpython.com/python-for-loop/ . Still, as programming is supposed to enlarge my Excel-forged intellectual horizons, I also want to understand and practice the ‘while’ loop in Python, thus Indefinite Iteration (https://realpython.com/python-while-loop/ ).
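To fix the difference between the two in my head, here is a minimal sketch of my own (a toy example, not taken from either tutorial) of the same finite count done both ways:

# Definite iteration: the number of passes is known up front
for epoch in range(5):
    print('epoch number', epoch)

# Indefinite iteration: I loop for as long as a condition holds true
epoch = 0
while epoch < 5:
    print('epoch number', epoch)
    epoch += 1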
Anyway, programming a loop is very different from looping over multiple rows of an Excel sheet. The latter simply makes a formula repeat over many rows, whilst the former requires defining the exact operation to iterate, the input domain which the iteration takes as data, and the output dataset to store the outcome of iteration.
It is time, therefore, to describe exactly the iteration I want to program in Python. As a matter of fact, we are talking about a few different iterations. The first one is the standardisation of my source data. I can use two different ways of standardising it, depending on the neural activation function I use. The baseline method is to standardise each variable over its maximum, and then it fits every activation function I use. The standardised value of x, AKA s(x), is calculated as s(x) = x/max(x).
If I focus just on the hyperbolic tangent as the activation function, I can use the first method, or I can standardise by mean-reversion, where s(x) = [x – avg(x)]/std(x). In a first step, I subtract from x the average expected value of x – this is the [x – avg(x)] expression – and then I divide the resulting difference by the standard deviation of x, or std(x).
The essential difference between those two modes of standardisation is the range of standardised values. When denominated in units of max(x), standardised values fall in the range 0 ≤ s(x) ≤ 1 (assuming non-negative data). When I standardise by mean-reversion, values are centred on 0 and fall roughly in the range -1 ≤ s(x) ≤ 1.
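Just to have the two formulas in front of my eyes as code, here is a minimal sketch of my own, written with NumPy on a toy vector (the function names are mine, purely illustrative):

import numpy as np

def standardise_over_max(x):
    # s(x) = x / max(x): values land between 0 and 1 for non-negative data
    return x / np.max(x)

def standardise_mean_reversion(x):
    # s(x) = (x - avg(x)) / std(x): values are centred on 0
    return (x - np.mean(x)) / np.std(x)

x = np.array([2.0, 4.0, 6.0, 8.0])
print(standardise_over_max(x))         # 0.25, 0.5, 0.75, 1.0
print(standardise_mean_reversion(x))   # roughly -1.34, -0.45, 0.45, 1.34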
The piece of programming I start this specific learning with consists in transforming my source Data Frame ‘df’ into its standardised version ‘s_df’ by dividing the values in each column of df by their maximum. As I think of all that, I recall something I have recently learnt, namely that operations on NumPy arrays, in Python, are much faster than the same operations on data frames built with Python Pandas. I check if I can make a Data Frame out of an imported CSV file, and then turn it into a NumPy array.
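Before I get to the real dataset, that speed claim is easy to sanity-check on synthetic data. Here is a rough benchmark sketch of my own (toy numbers, not a rigorous measurement), comparing the same standardisation-over-maximum on a Pandas Data Frame and on the equivalent NumPy array:

import time
import numpy as np
import pandas as pd

# Toy data: 500 000 rows, 4 numeric columns
df = pd.DataFrame(np.random.rand(500_000, 4), columns=['a', 'b', 'c', 'd'])
arr = df.to_numpy()

t0 = time.perf_counter()
for _ in range(20):
    df / df.max()            # column-wise standardisation on the Data Frame
t1 = time.perf_counter()
for _ in range(20):
    arr / arr.max(axis=0)    # the same operation on the NumPy array
t2 = time.perf_counter()

print(f'pandas: {t1 - t0:.3f} s, numpy: {t2 - t1:.3f} s')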
Let’s waltz. I start by opening JupyterLab at https://hub.gke2.mybinder.org/user/jupyterlab-jupyterlab-demo-nocqldur/lab and creating a notebook with Python 3 as its kernel. Then, I import the libraries which I expect to use one way or another: NumPy, Pandas, Matplotlib, os, and math. In other words, I do:
>> import numpy as np
>> import pandas as pd
>> import matplotlib.pyplot as plt
>> import math
>> import os
Then, I upload a CSV file and I import it into a Data Frame. It is a database I used in my research on cities and urbanization, its name is ‘DU_DG database.csv’, and, as it is transformed from an Excel file, I take care to specify that the separators are semicolons.
>> DU_DG=pd.DataFrame(pd.read_csv('DU_DG database.csv', sep=';'))
The resulting Data Frame is structured as:
Index(['Index', 'Country', 'Year', 'DU/DG', 'Population',
       'GDP (constant 2010 US$)', 'Broad money (% of GDP)',
       'urban population absolute',
       'Energy use (kg of oil equivalent per capita)', 'agricultural land km2',
       'Cereal yield (kg per hectare)'],
      dtype='object')
Import being successful (I just check with commands ‘DU_DG.shape’ and ‘DU_DG.head()’), I am trying to create a NumPy array. Of course, there is not much sense in translating names of countries and labels of years into a NumPy array. I try to select numerical columns ‘DU/DG’, ‘Population’, ‘GDP (constant 2010 US$)’, ‘Broad money (% of GDP)’, ‘urban population absolute’, ‘Energy use (kg of oil equivalent per capita)’, ‘agricultural land km2’, and ‘Cereal yield (kg per hectare)’, by commanding:
>> DU_DGnumeric=np.array(DU_DG['DU/DG','Population','GDP (constant 2010 US$)','Broad money (% of GDP)','urban population absolute','Energy use (kg of oil equivalent per capita)','agricultural land km2','Cereal yield (kg per hectare)'])
The answer I get from Python 3 is a gentle ‘f**k you!’, i.e. an elaborate error message.
KeyError Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2656 try:
-> 2657 return self._engine.get_loc(key)
2658 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: ('DU/DG', 'Population', 'GDP (constant 2010 US$)', 'Broad money (% of GDP)', 'urban population absolute', 'Energy use (kg of oil equivalent per capita)', 'agricultural land km2', 'Cereal yield (kg per hectare)')
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-18-e438a5ba1aa2> in <module>
----> 1 DU_DGnumeric=np.array(DU_DG['DU/DG','Population','GDP (constant 2010 US$)','Broad money (% of GDP)','urban population absolute','Energy use (kg of oil equivalent per capita)','agricultural land km2','Cereal yield (kg per hectare)'])
/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
2925 if self.columns.nlevels > 1:
2926 return self._getitem_multilevel(key)
-> 2927 indexer = self.columns.get_loc(key)
2928 if is_integer(indexer):
2929 indexer = [indexer]
/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2657 return self._engine.get_loc(key)
2658 except KeyError:
-> 2659 return self._engine.get_loc(self._maybe_cast_indexer(key))
2660 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2661 if indexer.ndim > 1 or indexer.size > 1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: ('DU/DG', 'Population', 'GDP (constant 2010 US$)', 'Broad money (% of GDP)', 'urban population absolute', 'Energy use (kg of oil equivalent per capita)', 'agricultural land km2', 'Cereal yield (kg per hectare)')
Didn’t work, obviously. I try something else. I proceed in two steps. First, I create a second Data Frame out of the numerical columns of DU_DG. I go:
>> DU_DGNumCol=pd.DataFrame(DU_DG.columns['DU/DG','Population','GDP (constant 2010 US$)','Broad money (% of GDP)','urban population absolute','Energy use (kg of oil equivalent per capita)','agricultural land km2','Cereal yield (kg per hectare)'])
Python seems to have accepted the command without objection, and yet something strange happens. Informative commands about that second Data Frame, i.e. DU_DGNumCol, such as ‘DU_DGNumCol.head()’, ‘DU_DGNumCol.shape’ or ‘DU_DGNumCol.info’, don’t work, as if DU_DGNumCol had no structure at all.
Cool. I investigate. I want to check how Python sees the data in my DU_DG data frame. I do ‘DU_DG.describe()’ first, and, to my surprise, I can see descriptive statistics just for the columns ‘Index’ and ‘Year’. The legitimate WTF? question pushes me to type ‘DU_DG.info()’ and here is what I get:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 896 entries, 0 to 895
Data columns (total 11 columns):
Index 896 non-null int64
Country 896 non-null object
Year 896 non-null int64
DU/DG 896 non-null object
Population 896 non-null object
GDP (constant 2010 US$) 896 non-null object
Broad money (% of GDP) 896 non-null object
urban population absolute 896 non-null object
Energy use (kg of oil equivalent per capita) 896 non-null object
agricultural land km2 896 non-null object
Cereal yield (kg per hectare) 896 non-null object
dtypes: int64(2), object(9)
memory usage: 77.1+ KB
I think I understand. My numerical data has been imported as the object type, whereas I want float values. Once again, I get the same valuable lesson: before I do anything with my data in Python, I need to check and curate it. It is strangely connected to my theory of collective intelligence. Our human perception accepts empirical experience for further processing, especially for collective processing at the level of culture, only if said experience has the right form. We tend to ignore phenomena which manifest in a form we are not used to processing cognitively.
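Back to the technical problem: a plausible culprit, since the file comes from Excel with semicolons as separators, is that decimals are written with commas, which Pandas then reads as text. Here is a hedged sketch of my own of how the curation could go, assuming comma decimals (the simpler route would probably be to re-import with pd.read_csv('DU_DG database.csv', sep=';', decimal=',')):

# Columns I expect to be numeric (everything except Index, Country and Year)
num_cols = ['DU/DG', 'Population', 'GDP (constant 2010 US$)',
            'Broad money (% of GDP)', 'urban population absolute',
            'Energy use (kg of oil equivalent per capita)',
            'agricultural land km2', 'Cereal yield (kg per hectare)']

# Assumption: the object columns hold numbers written with comma decimals
for col in num_cols:
    DU_DG[col] = pd.to_numeric(DU_DG[col].astype(str).str.replace(',', '.'),
                               errors='coerce')

# Selecting several columns takes a list, hence the double square brackets,
# which is what my earlier, failed np.array() attempt was missing
DU_DGnumeric = np.array(DU_DG[num_cols])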
Just out of sheer curiosity, I take another dataset and I repeat the whole sequence of import from CSV and definition of data type. This time, I take a reduced version of Penn Tables 9.1. The full citation due in this case is: Feenstra, Robert C., Robert Inklaar and Marcel P. Timmer (2015), “The Next Generation of the Penn World Table”, American Economic Review, 105(10), 3150-3182, available for download at www.ggdc.net/pwt. The ‘reduced’ part means that I took out of the database all the rows (i.e. country-year observations) with at least one empty cell. I go:
>> PWT=pd.DataFrame(pd.read_csv('PWT 9_1 no empty cells.csv'))
…aaaaand it lands. Import successful. I test the properties of PWT data frame:
>> PWT.info()
yields:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3006 entries, 0 to 3005
Data columns (total 44 columns):
country 3006 non-null object
year 3006 non-null int64
rgdpe 3006 non-null float64
rgdpo 3006 non-null float64
pop 3006 non-null float64
emp 3006 non-null float64
emp / pop 3006 non-null float64
avh 3006 non-null float64
hc 3006 non-null float64
ccon 3006 non-null float64
cda 3006 non-null float64
cgdpe 3006 non-null float64
cgdpo 3006 non-null float64
cn 3006 non-null float64
ck 3006 non-null float64
ctfp 3006 non-null float64
cwtfp 3006 non-null float64
rgdpna 3006 non-null float64
rconna 3006 non-null float64
rdana 3006 non-null float64
rnna 3006 non-null float64
rkna 3006 non-null float64
rtfpna 3006 non-null float64
rwtfpna 3006 non-null float64
labsh 3006 non-null float64
irr 3006 non-null float64
delta 3006 non-null float64
xr 3006 non-null float64
pl_con 3006 non-null float64
pl_da 3006 non-null float64
pl_gdpo 3006 non-null float64
csh_c 3006 non-null float64
csh_i 3006 non-null float64
csh_g 3006 non-null float64
csh_x 3006 non-null float64
csh_m 3006 non-null float64
csh_r 3006 non-null float64
pl_c 3006 non-null float64
pl_i 3006 non-null float64
pl_g 3006 non-null float64
pl_x 3006 non-null float64
pl_m 3006 non-null float64
pl_n 3006 non-null float64
pl_k 3006 non-null float64
dtypes: float64(42), int64(1), object(1)
memory usage: 1.0+ MB
>> PWT.describe()
gives nice descriptive statistics. This dataset has been imported in the format I want. I do the same thing I attempted with the DU_DG dataset: I try to convert it into a NumPy array and to check the shape obtained. I do:
>> PWTNumeric=np.array(PWT)
>> PWTNumeric.shape
I get (3006,44), i.e. 3006 rows over 44 columns.
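One caveat I note for later: since PWT still contains the ‘country’ column of strings, np.array(PWT) most likely comes out with dtype object rather than float. A minimal sketch of how I could keep only the numeric part, assuming I want a purely float array for fast computation:

# Keep only the numeric columns before converting to a NumPy array
PWTFloat = np.array(PWT.select_dtypes(include='number'))
print(PWTFloat.dtype, PWTFloat.shape)   # expected: float64 (3006, 43)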
I try to wrap my mind around standardising values in PWT. I start gently. I slice one column out of PWT, namely the AVH variable, which stands for the average number of hours worked per person per year. I do:
>> AVH=pd.DataFrame(PWT['avh'])
>> stdAVH=pd.DataFrame(AVH/AVH.max())
Apparently, it worked. I check with ‘stdAVH.describe()’ and I get a nice distribution of values between 0 and 1.
I do the same thing with mean-reversion. I create the ‘mrAVH’ data frame according to the s(x) = [x – avg(x)]/std(x) drill. I do:
>> mrAVH=pd.DataFrame((AVH-AVH.mean())/AVH.std())
…and I get a nice distribution of mean reverted values.
Cool. Now, it is time to try and iterate the same standardisation over many columns of the same Data Frame. I have already rummaged a bit and apparently it is not going to be as simple as in Excel. It usually isn’t.
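Still, a first draft of that loop is within reach, reusing the two formulas from above on the numeric columns of PWT. A sketch of my own, to be properly tested in the next session:

# Draft: a 'for' loop standardising every numeric column of PWT
numeric_cols = PWT.select_dtypes(include='number').columns

s_PWT = pd.DataFrame()     # standardised over the maximum
mr_PWT = pd.DataFrame()    # standardised by mean-reversion

for col in numeric_cols:
    s_PWT[col] = PWT[col] / PWT[col].max()
    mr_PWT[col] = (PWT[col] - PWT[col].mean()) / PWT[col].std()

For the record, Pandas can probably do the same in one shot with PWT[numeric_cols]/PWT[numeric_cols].max(), but the explicit loop is precisely what I want to practise.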
That would be all in this update. A short summary is due. It works again. I mean, learning something and keeping a journal of how exactly I learn, that thing works. I feel that special vibe, like ‘What the hell, even if it sucks, it is interesting’. Besides the technical details of programming, I have already learnt two big things about data analysis in Python. Firstly, however comfortable it is to use libraries such as NumPy or Pandas, being really efficient requires understanding the small details at the very basic level, e.g. conversion of data types and, as a matter of fact, the practical workability of different data types, selection of values in a data set by row and by column, iteration over rows and columns, etc. Secondly, once again, data works well in Python when it has been properly curated prior to analysis. Learning quick algorithmic ways to curate that data, without having to do it manually in Excel, is certainly an asset I need to acquire.