Clustering GDPs

22 Dec 2018

In this blog post, we are going to have a look at GDPs of various countries and try to group them according to a suitable measure.

The raw notebook can be found here.

Data

The purchasing power parity gross domestic product (GDP (PPP)) figures have been obtained from the paper by James and Gakidou. The reference currency is the 2005 USD as defined by the IMF.

The time series span the years 1950 to 2018 and contain data points for 210 countries or territories (countries, for short, in the following).

Data format

The time series are loaded into a pandas dataframe. The countries, encoded by their three-letter ISO 3166-1 alpha-3 abbreviations, serve as row indices. Each year is stored in a separate column.

import numpy as np
import pandas as pd

path_to_db = r'path\to\GDP\data\1478-7954-10-12-S3.xlsx'
sheet_name = 'Sheet1'

df = pd.read_excel(path_to_db, sheet_name = sheet_name, header = None)

# group the rows by country code (column 0) and collect the GDP values (column 1)
groups = df.groupby(0)
df = pd.DataFrame.from_dict({n : g[1].values for n, g in groups}, orient = 'index')

df.head()
0 1 2 3 4 5 6 7 8 9 ... 56 57 58 59 60 61 62 63 64 65
ABW 1266.2420 1327.6650 1393.9080 1467.9210 1546.2200 1623.0460 1717.7520 1792.5970 1867.8290 1943.5910 ... 20333.6900 20143.8800 19444.6300 15729.5800 16451.7000 17275.2300 18179.3200 19139.9100 20157.2100 21228.3800
AFG 179.4909 182.1864 185.5338 192.9397 194.3511 195.4881 200.4814 198.2413 204.9097 207.1753 ... 263.0708 292.6143 294.5654 351.5893 372.7821 387.9407 404.6445 419.7369 436.9097 455.4637
AGO 1164.6930 1192.4080 1220.7880 1248.8910 1212.4560 1281.8420 1252.5360 1338.9390 1385.0590 1377.6910 ... 2289.3820 2675.2520 2945.2890 2878.3360 2958.8980 3076.2210 3174.6280 3268.8260 3347.4790 3384.9220
AIA 1443.0830 1502.3340 1566.0870 1637.5250 1712.6170 1784.9380 1875.6700 1943.4890 2010.6670 2077.3580 ... 13845.4000 15838.2700 15682.0600 11612.8400 12059.6800 12573.3900 13137.4100 13733.3200 14360.5000 15016.1900
ALB 778.0425 809.2832 814.2191 845.1029 869.4968 912.0341 923.9318 976.7833 1017.5170 1056.7130 ... 2731.5010 2889.3540 3093.5480 3179.8590 3246.2210 3333.4310 3436.3720 3562.8720 3711.6290 3877.9590

5 rows × 66 columns

Firstly, we investigate how the GDPs have changed over time. To this end, a histogram is created for each year in which each bin corresponds to roughly 2% of the maximum GDP over all years.

def create_histogram(X, bins = None):
    """
    Creates a histogram for each column.
    Parameters:
        X (np.ndarray (n_row, n_col)) : raw data
        bins (None or np.histogram bins) : the bins. If None, 50 equally spaced bin edges 
        (i.e. 49 bins) are created in the [0, np.max(X)] range
    Returns:
        hst (np.ndarray (n_col, n_bins)) : one histogram per column
    """
    
    if bins is None:
        bins = np.linspace(0, np.max(X), num = 50)
    
    # density histograms scaled by the bin widths so that each row sums to one
    hst = np.array([np.histogram(x, bins = bins, density = True)[0] for x in X.T])
    hst = hst * np.diff(bins)
    return hst

histo_year = create_histogram(df.values)

(Figure: histograms of the GDP distribution for each year)

It is readily seen that both the maximum and the GDPs in general have increased over the years. The distribution has also become more spread out.

Moments

The median and the first three moments of the GDP distribution are plotted below. It is worth comparing the median and the mean. Both of them increase steadily (apart from the years following the collapse of the Eastern Bloc between 1989 and 1991). However, the median shows an increase of 450% whereas the mean only tripled. This means that the countries, in general, became richer with respect to the whole distribution, which is also seen in the overall reduction of the skew. The median, however, decreased by ~20% after 1990 compared to the dent of ~2% in the mean. This implies that the countries with lower GDPs were hit harder than those with higher GDPs.

df_year_stats = pd.DataFrame({'mean' : df.mean(axis = 0),
                              'median' : df.median(axis = 0),
                              'std' : df.std(axis = 0), 
                              'skew' : df.skew(axis = 0)})
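
The relative changes quoted above can be checked directly from this frame; a minimal sketch (not in the original post), assuming df_year_stats as constructed in the snippet above:

# ratio of the last and first years for each statistic
rel_change = df_year_stats.iloc[-1] / df_year_stats.iloc[0]
print(rel_change[['mean', 'median']])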

(Figure: mean, median, standard deviation and skew of the GDP distribution over time)

Relative GDP

It is revealing to compare the GDPs to their maximum value in each year. In doing so, the relative wealth distribution can be assessed. The df_scaled_year_stats frame lists the median, mean and standard deviation of the GDPs relative to the highest GDP in each year.

def scale_by_max(X):
    maxval = np.max(X)
    
    # guard against division by zero for an all-zero column
    if maxval == 0.0:
        maxval = 1.0
    scaled = X / maxval

    return scaled

# scale each column (year) by its yearly maximum
df_scaled = df.apply(scale_by_max, axis = 0)

df_scaled_year_stats = pd.DataFrame({'maximum' : df.max(axis = 0),
                                     'mean' : df_scaled.mean(axis = 0),
                                     'median' : df_scaled.median(axis = 0),
                                     'std' : df_scaled.std(axis = 0)})
histo_scaled_year = create_histogram(df_scaled.values)

(Figure: histograms of the GDPs scaled by the yearly maximum)

The wealth of the countries has been increasing relative to that of the wealthiest one as time progresses. In 1950, 70% of the countries had a GDP not exceeding 2% of the corresponding maximum. By 2010, this ratio had dropped to 40%.

Further signs of equalisation are that the mean doubles and the median triples. It is still astounding, though, that half of the countries have at most 3% of the maximum GDP. It is also interesting to observe that the separation between the wealthiest country and the rest had been increasing since the middle of the previous decade. This effect was moderated by the crisis in 2008, as shown by the less negative slope of the mean and std curves.
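
The 2% figures above can be reproduced from df_scaled; a small check (not part of the original post), assuming column 0 corresponds to 1950 and the columns advance by one year:

# fraction of countries whose GDP is at most 2% of the yearly maximum
frac_below = (df_scaled <= 0.02).mean(axis = 0)
print(frac_below.iloc[0], frac_below.iloc[60])   # 1950 and 2010 under the above assumption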

(Figure: statistics of the scaled GDPs over time)

Entropy

Entropy can measure how uniformly the values are distributed between ranges of equal width. The bins are taken from the histogram above, that is, each bin corresponds to $0.02 \cdot \text{GDP}_{max}$. If $w_{ij} \in \left[0, 1 \right]$ is the weight of the $i$-th bin in the $j$-th year, the entropy can be calculated as

\[H(j) = -\sum\limits_{i = 1}^{N_{bins}}w_{ij}\ln w_{ij}\]

The entropy is then normalised by the maximum possible value of the entropy, which is attained when each bin holds an equal number of countries:

\[\tilde{H}(j) = \frac{H(j)}{\ln N_{bins}} \in \left[0, 1\right]\]

The entropy increases over the decades, indicating that the GDPs become less concentrated around one value, i.e. the bins become more evenly populated. It does not, however, state anything about the values themselves. Whether 90% of the countries sit in the first bin and 10% in the last one, or vice versa, the entropy is the same. Nevertheless, the entropy provides a single-number indicator of how uniform the GDP distribution is. Please note that if all countries had equal wealth, the entropy would be zero, for all values would fall in one bin. The entropy is thus not suitable for measuring inequality. To that end we will invoke the Gini coefficient.

entropy = calc_entropy(histo_year)
entropy_scaled = calc_entropy(histo_scaled_year)
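
As a toy sanity check (not part of the original analysis), a perfectly uniform histogram should yield a normalised entropy of one, and a fully concentrated one should yield zero:

# two four-bin distributions: uniform and fully concentrated
toy = np.array([[0.25, 0.25, 0.25, 0.25],
                [1.00, 0.00, 0.00, 0.00]])
print(calc_entropy(toy))   # expected: [1. 0.]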

(Figure: normalised entropy of the GDP and scaled GDP distributions over time)

Inequality - Lorenz curve

The Lorenz curve is the cumulative fraction of wealth plotted against the cumulative fraction of the population. If the wealth is equally distributed, the curve is the straight line between the points (0, 0) and (1, 1), which is also an upper bound of the Lorenz curve. The more the curve deviates from this diagonal, the less equal the wealth distribution is. The Gini coefficient is related to the integral of the Lorenz curve: it is the ratio of the area enclosed by the line of equality and the Lorenz curve to the area under the line of equality. It expresses the distance from equality; 0 meaning total equality, 1 complete inequality.
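
In terms of the Lorenz curve $L(p)$, where $p$ is the cumulative population fraction, this ratio of areas reduces to the expression used by calculate_gini in the appendix:

\[G = \frac{\frac{1}{2} - \int\limits_{0}^{1} L(p)\,\mathrm{d}p}{\frac{1}{2}} = 1 - 2\int\limits_{0}^{1} L(p)\,\mathrm{d}p\]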

The coefficient decreases over time, by a modest 0.1 point. (Remember, the median in terms of the highest GDP was still only 3% in 2015!) The fall of communism left behind a number of poor countries with economic struggles, which shows up as an increase of the coefficient lasting a good twenty years from the early eighties. By 2000, the Gini coefficient starts to decrease again, showing that some countries had started to recover.

lorenz_curve = calculate_lorenz_curve(df.values)
gini_coeff = calculate_gini(df.values)

(Figure: Lorenz curves and the Gini coefficient over time)

Uncovering groups of countries

We have so far been concerned with characterising the distribution. A logical next step is to uncover what underlying components form the distribution; that is, whether there are groups of countries that evolve similarly over time.

Temporal characteristics

If there are countries that move together, the correlation between the rankings in two subsequent years should show this. We are going to use Spearman’s $\rho$ and Kendall’s $\tau$ rank correlations (pandas uses the scipy.stats Kendall $\tau$ implementation, which is the $\tau$-b variant). Spearman’s $\rho$ takes into account the magnitude of the changes between two consecutive rankings, whereas Kendall’s rank correlation counts the number of changes in the rankings regardless of their magnitude. Therefore Kendall’s $\tau$ is expected to be more sensitive to any variation of the ranking.

The correlation coefficients are calculated for a number of increasing delays, namely 1, 5, 10 and 20 years, and are shown below. The almost constant nature of Spearman’s $\rho$ indicates that there are no major reorderings of the countries’ relative positions. The minor - or local, if you wish - changes are, however, magnified by Kendall’s $\tau$ coefficient.

year_spearman_corr = df.corr(method = 'spearman')
year_kendall_corr = df.corr(method = 'kendall')
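
The snippet above yields the full year-by-year correlation matrices; the delayed coefficients themselves are not extracted there. One possible way (a sketch, not the original code) is to read off the appropriate off-diagonal, where delayed_corr is a hypothetical helper:

def delayed_corr(corr, delay):
    """Correlation between the rankings in year t and year t + delay, for every valid t."""
    values = corr.values
    return np.array([values[t, t + delay] for t in range(values.shape[0] - delay)])

spearman_lag_5 = delayed_corr(year_spearman_corr, 5)
kendall_lag_5 = delayed_corr(year_kendall_corr, 5)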

(Figure: Spearman and Kendall rank correlations at delays of 1, 5, 10 and 20 years)

The country rankings barely change between two subsequent years. Correlation coefficients at larger delays are indicative of longer term dynamics. E.g. the five-year-forward coefficient rapidly decreases from 1985, reaching its minimum at 1989, showing that the GDP rankings fluctuated between the years 1990 and 1994. From 1989 onwards the correlation increases as the new order stabilises throughout those years. The ten-year-forward correlation exhibits a small but noticeable slump starting at 1990. This means the rankings from 2000 onwards differ more and more from those ten years earlier. This correlation begins to increase at 1995, so there is again a period of stabilisation between 1995 and 2005.

Clustering

It would be interesting to see whether there are groups of countries whose GDPs move together and stay close to each other over time. Provided there are, we wish to find the GDP trajectory most characteristic of each group. A number of clustering methods can achieve these goals; to name a few: k-medoids, the Markov cluster algorithm (MCA) and affinity propagation. Since MCA has been used on numerous occasions in previous posts, we are going to choose the k-medoids method. A slim implementation thereof can be found at the bottom of this post.

Distance between GDP time series

What distance should be used in the clustering algorithm? We are looking for similarity in two respects:

  1. the series are close to each other with respect to their values
  2. their dynamics are similar

Correlation is insensitive to the proximity of the time series, therefore it is not a good indicator of criterion 1. The ordinary $L_{2}$ distance ignores the relative magnitude (mean) of the series. This is illustrated in the figure below: the $L_{2}$ distance between series (a) and (b) is the same as between series (b) and (c). However, (c) is only 1.5 times greater than (b), whereas (b) is twice the magnitude of (a).

(Figure: three example series illustrating the shortcomings of the $L_{2}$ distance)


The overlap, defined below, takes into account the similarity of the time dependence, as correlation does, while the magnitudes of the series are also accounted for through the denominator:

\[D_{ij} = \frac{\int\limits_{t} (f_{i}(t) - f_{j}(t))^{2} \mathrm{d}t}{\int\limits_{t} f_{i}(t)^{2}\mathrm{d}t + \int\limits_{t} f_{j}(t)^{2}\mathrm{d}t} = 1 - 2 \cdot \frac{\int\limits_{t} f_{i}(t) \cdot f_{j}(t)\mathrm{d}t}{\int\limits_{t} f_{i}(t)^{2}\mathrm{d}t + \int\limits_{t} f_{j}(t)^{2}\mathrm{d}t}\]

This measure is implemented as the overlap function below.

from scipy.spatial.distance import pdist, squareform

overlap = lambda u, v: 1 - 2 * (u * v).sum() / ((u * u).sum() + (v * v).sum())
X = squareform(pdist(df.values, overlap))
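
A quick numerical illustration (not in the original post) of why the overlap is preferred: correlation cannot distinguish a series from a scaled copy of itself, whereas the overlap distance can.

u = df.values[0]
print(np.corrcoef(u, 2 * u)[0, 1])   # 1.0: perfectly correlated despite different magnitudes
print(overlap(u, u))                 # 0.0: identical series
print(overlap(u, 2 * u))             # 0.2: same shape, different magnitude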

k-medoids clustering

The optimal number of clusters is unknown, thus a range of values is tried out. The goodness of each clustering is measured by the silhouette score. The corresponding results are saved in the results list of dictionaries. We do not perform cross-validation, for we are interested in grouping all of the countries.

from sklearn.metrics import silhouette_score

clusterer = KMedoids()

results = []

for n_clusters in range(2, 30):
    
    clusterer.set_params(n_clusters = n_clusters)
    labels = clusterer.fit_predict(X)
    score = clusterer.score
    # X is a distance matrix, hence the precomputed metric
    sil_score = silhouette_score(X, labels, metric = 'precomputed')
    
    results.append({'n_clusters' : n_clusters,
                    'labels' : labels,
                    'sil_score' : sil_score,
                    'score' : score,
                    'medoids' :  clusterer.medoids})

Determining the number of clusters

There are manifold ways to ascertain the optimal number of clusters. One can plot the intra-cluster variance as a function of the number of clusters and choose the point beyond which the variance decreases only by a small amount compared to the previous values. This method would suggest five as the optimal number of clusters (left panel of the figure below). One can alternatively choose the maximum of the silhouette score. This measure is inversely related to the intra-cluster variance but also penalises an increasing number of clusters. There is a maximum at $n_{clusters} = 3$ and a secondary maximum at $n_{clusters} = 11$.

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2)
fig.set_size_inches(10, 5)

axes[0].scatter([x['n_clusters'] for x in results], [x['score'] for x in results], color = 'green', label = 'Var')
axes[1].scatter([x['n_clusters'] for x in results], [x['sil_score'] for x in results], color = 'green', label = 'silhouette score')
axes[0].legend(); axes[1].legend()
axes[0].grid(True); axes[1].grid(True)
axes[0].set_xlim((0, results[-1]['n_clusters'] + 1)); axes[1].set_xlim((0, results[-1]['n_clusters'] + 1));
axes[0].set_ylim(bottom = 0.0); axes[1].set_ylim(bottom = 0.0)
axes[0].set_xlabel(r'$n_{clusters}$'); axes[1].set_xlabel(r'$n_{clusters}$')
axes[0].set_ylabel(r'$Var$'); axes[1].set_ylabel(r'$silhouette \,\, score$')
plt.show()

(Figure: intra-cluster variance (left) and silhouette score (right) against the number of clusters)
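
The silhouette optimum can also be read off programmatically rather than from the plot; a short sketch (not in the original post) using the results list built above:

best = max(results, key = lambda r: r['sil_score'])
print("best n_clusters by silhouette score:", best['n_clusters'])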

Before making a decision, the first twelve clusters are shown in the figure below. The exemplars of the clusters are drawn with bold lines. The other members of the clusters appear in half tone of the corresponding colors.

(Figure: clusterings with increasing numbers of clusters; exemplars drawn with bold lines)

It is instructive to compare the cluster size distributions. The clusters are ordered according to the mean of their exemplars.

from collections import Counter

sel = [1, 3, 9]   # indices of the results with 3, 5 and 11 clusters
for result in [results[idx] for idx in sel]:
    print("n_clusters:", result['n_clusters'], "cluster sizes:", sorted(Counter(result['labels']).most_common()))
n_clusters: 3 cluster sizes: [(0, 67), (1, 80), (2, 63)]
n_clusters: 5 cluster sizes: [(0, 44), (1, 39), (2, 50), (3, 39), (4, 38)]
n_clusters: 11 cluster sizes: [(0, 11), (1, 16), (2, 17), (3, 23), (4, 26), (5, 23), (6, 19), (7, 20), (8, 21), (9, 27), (10, 7)]

$n_{clusters} = 3$ and $n_{clusters} = 5$ split the time series into groups of roughly equal sizes. These groups are more likely to represent a geometric partitioning rather than one arising from the underlying structure. At $n_{clusters} = 3$, the third (brown) exemplar clearly falls between two strands of time series, in a low density region. When $n_{clusters} = 11$, the clusters have strongly unequal cardinalities, which suggests that some of the structure is captured. For instance, the tenth - red - exemplar lies in the middle of a band of time series. We therefore set the number of clusters to 11.

The clusters

The clusters are shown below. The exemplar is plotted in full tone whilst the other members are in half tone. The series which do not belong to the particular cluster are drawn in grey.

import json

# country code -> country name mapping
with open(r'path\to\ccodes.json' , 'r') as fproc:
    ccodes = json.load(fproc)

df_stats = pd.DataFrame(index = df.index)
df_stats['mean'] = df.mean(axis = 1)
# average annual return: mean of the year-on-year growth factors minus one
df_stats['AAR'] = np.mean(df.values[:, 1:] / df.values[:, :-1], axis = 1) - 1
df_stats['label'] = results[9]['labels']
df_stats['cname'] = df_stats.index.map(lambda x: ccodes.get(x, x))

(Figure: the eleven clusters; exemplars in full tone, other members in half tone)

The first four clusters collect countries whose GDPs remain low and whose associated average annual growths are around or below 2%. The next metacluster includes labels 5–10, where the mean GDP becomes progressively larger and the AAR is - on average - larger than 2%. The last group contains only one cluster, in which the majority of the time series represent a decreasing trend.

The primary separating dimension between the clusters is the average value of the GDPs across the years, as shown in the figure below, where the clusters are plotted against the mean and the average annual return (AAR). The AAR was chosen to represent the dynamics of the GDPs’ evolution.
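
For reference, the AAR computed in the df_stats frame above is the arithmetic mean of the year-on-year growth factors minus one,

\[\text{AAR}_{i} = \frac{1}{T - 1}\sum\limits_{t = 1}^{T - 1}\frac{\text{GDP}_{i, t + 1}}{\text{GDP}_{i, t}} - 1\]

where $T$ is the number of years in the series.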

(Figure: the clusters in the mean-AAR plane)

Using the mean and the AAR to describe the GDPs corresponds to a linear model, and as such it loses a wealth of information. This is apparent from the statistics table below. The mean is a reasonably good separating variable, whereas the AAR has rather poor discriminating power.

df_stats.groupby('label')[['mean', 'AAR']].describe().drop(['count', '25%', '75%'], axis = 1, level = 1)
mean AAR
mean std min 50% max mean std min 50% max
label
0 200.667123 51.618273 115.757428 216.478273 262.963111 0.017625 0.006906 0.008979 0.017365 0.032888
1 390.181510 97.923013 288.012800 365.500341 547.447041 0.022100 0.006737 0.014136 0.020379 0.038545
2 426.115283 92.886202 237.693017 431.601847 602.453659 0.005278 0.006111 -0.004056 0.005978 0.016868
3 847.386273 144.170860 620.250145 874.343756 1127.438545 0.018557 0.011233 0.005120 0.015508 0.049247
4 1479.747107 237.391618 814.175010 1467.340734 1845.074602 0.025489 0.015684 -0.001702 0.022071 0.066189
5 2490.528235 315.883235 1964.560733 2472.128227 3066.859439 0.023795 0.007599 0.011861 0.022081 0.037412
6 3827.121639 803.842542 2155.531905 3775.780500 5561.711500 0.024806 0.014834 0.005628 0.023129 0.069728
7 6300.830484 1052.993821 4450.766646 6115.175629 8619.286455 0.030364 0.017886 0.005252 0.027920 0.095589
8 11682.299887 2402.926593 7863.646803 11266.818530 15862.491985 0.033281 0.012924 0.013057 0.029593 0.057549
9 24691.065267 5082.077178 14681.687576 23907.654848 39449.978788 0.021609 0.017823 -0.038920 0.022235 0.073707
10 54237.905693 15460.532122 39840.211061 46557.844697 81882.728182 0.013774 0.015100 -0.009593 0.024289 0.027826

Future analysis

There are four main avenues to extend this short analysis:

Appendix

Misc functions

def calculate_lorenz_curve(X):
    """
    Calculates the Lorenz curve.
    Parameters:
        X (np.ndarray [pop_size, n_records]) : the wealth distribution over time.
    Returns:
        X_trf (np.ndarray [pop_size, n_records]) : Lorenz curve
    """

    X_trf = X.copy()
    X_trf = np.nancumsum(np.sort(X_trf, axis = 0), axis = 0) / np.nansum(X_trf, axis = 0)

    return X_trf

def calculate_gini(X):
    """
    Calculates the Gini coefficient.
    Parameters:
        X (np.ndarray [pop_size, n_records]) : the wealth distribution over time.
    Returns:
        gini_coeff (np.ndarray [n_records]) : Gini coefficient for each record.
    """
    
    lc = calculate_lorenz_curve(X)
    pop_size = lc.shape[0]
    
    auc = np.sum(lc, axis = 0) / pop_size
    
    gini_coeff = 1.0 - 2.0 * auc
    
    return gini_coeff

def calc_entropy(X):
    """
    Calculates the normalised entropy along the last dimension.
    Parameters:
        X (np.ndarray [n_records, n_categories]) : observations
    Returns:
        entropy (np.ndarray [n_records]) : the normalised entropy
    """
    
    # normalise each row so that it sums to one (the rows are treated as distributions)
    normed = X / np.sum(X, axis = 1, keepdims = True)
    
    logp = np.zeros(X.shape, dtype = float)
    logp[normed != 0.0] = np.log(normed[normed != 0.0])
    
    entropy = -np.sum(logp * normed, axis = 1)
    entropy /= np.log(X.shape[1])
    
    return entropy

k-medoids clustering

class KMedoids:
    """
    A slim implementation of the k-medoids algorithm.
    """
    
    def __init__(self, n_clusters = 8, n_init = 20, maxiter = 100):
        """
        Parameters:
            n_clusters (int) : number of clusters
            n_init (int) : number of restarts
            maxiter (int) : maximum number of iterations
        """
        
        self._n_clusters = n_clusters
        self._n_init = n_init
        self._maxiter  = maxiter
        self._labels = None
        self._medoids = None
         
    @property
    def labels_(self):
        return self._labels
    
    @property
    def medoids(self):
        return self._medoids
    
    @property
    def score(self):
        return self._score
        
    def fit(self, X):
        """
        Performs a clustering.
        
        Parameters:
            X (np.ndarray) : distance matrix
        Returns:
            self
        """
        
        # input check
        n_dim = X.ndim
        if n_dim != 2:
            raise ValueError("2D arrays expected. Got: {0}".format(n_dim))
            
        m, n = X.shape
        if n != m:
            raise ValueError("Square distance matrix expected.")
        
        # clear up from possible previous fit
        self._labels = None
        self._medoids = None
        
        # any clustering can only be better than this
        score_best = X.sum()
        
        # choose best from batch of fits
        for idx in range(self._n_init): 
            score_, labels_, medoids_ = self._single_fit(X)
            
            if score_ < score_best:
                score_best = score_
                self._medoids = medoids_
                self._labels = labels_
                self._score = score_best
        
        # relabel clusters
        labels = np.full_like(self._labels, -1)
        
        for idx, label in enumerate(self._medoids):
            labels[self._labels == label] = idx
        
        self._labels = labels
        
        return self
      
    def fit_predict(self, X):
        """
        Performs a clustering.
        Parameters:
            X (np.ndarray) : distance matrix
        Returns:
            self.labels_ (np.ndarray of int) : list of cluster labels
        """
        
        return self.fit(X).labels_
     
    def set_params(self, **params):
        """
        Sets the named parameters of the estimator.
        """
        
        for k, v in params.items():
            if k == "maxiter":
                self._maxiter = v
            
            elif k == "n_init":
                self._n_init = v
            
            elif k == "n_clusters":
                self._n_clusters = v
            
    def _single_fit(self, X):
        """
        Performs a single clustering.
        """
        
        medoids  = np.sort(np.random.choice(X.shape[0], self._n_clusters, replace = False))
        lookup = np.arange(X.shape[0])
        
        i_iter = 0
        
        while i_iter < self._maxiter:
            
            i_iter += 1
            d_min = X[medoids[0]].copy()
            labels = np.full(X.shape[0], medoids[0])
            
            # assignment step
            # assign each datum to its closest medoid
            for idx in medoids[1:]:
                mask = d_min > X[idx]
                d_min[mask] = X[idx][mask]
                labels[mask] = idx

            # update step
            # find the new medoid of each cluster

            is_converged = True

            for idx in range(self._n_clusters):
                mask = labels == medoids[idx]
                minidx = np.argmin(X[np.ix_(mask, mask)].sum(axis = 1))
                
                old_medoid = medoids[idx] + 0
                medoids[idx] = lookup[mask][minidx]
        
                is_converged &= medoids[idx] == old_medoid
        
            if is_converged:
                break
    
        # calculate cost function
        cost = d_min.sum()
            
        return cost, labels, medoids
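
A toy usage example (not from the original post), assuming numpy and the scipy distance helpers imported earlier: cluster a small random set of points through its Euclidean distance matrix to check that the class runs end to end.

rng = np.random.default_rng(42)
points = rng.normal(size = (20, 2))
D = squareform(pdist(points))            # 20 x 20 Euclidean distance matrix

km = KMedoids(n_clusters = 3, n_init = 10)
labels = km.fit_predict(D)
print(labels, km.medoids, km.score)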