Clustering GDPs

22 Dec 2018

In this blog post, we are going to have a look at GDPs of various countries and try to group them according to a suitable measure.

The raw notebook can be found here.

Data

The purchasing power parity gross domestic product (GDP (PPP)) figures have been obtained from the paper by James and Gakidou. The reference currency is the 2005 USD as defined by the IMF.

The time series span the years 1950 to 2018 and contain data points for 210 countries or territories (countries, for short, in the following).

Data format

The time series are loaded into a pandas dataframe. The countries, encoded by their three-letter ISO 3166-1 alpha-3 abbreviations, serve as row indices. Each year is stored in a separate column.

import numpy as np
import pandas as pd

path_to_db = r'path\to\GDP\data\1478-7954-10-12-S3.xlsx'
sheet_name = 'Sheet1'

df = pd.read_excel(path_to_db, sheet_name = sheet_name, header = None)

# group the rows by country code (column 0) and collect the GDP values (column 1)
groups = df.groupby(0)
df = pd.DataFrame.from_dict({n : g[1].values for n, g in groups}, orient = 'index')

df.head()
0 1 2 3 4 5 6 7 8 9 ... 56 57 58 59 60 61 62 63 64 65
ABW 1266.2420 1327.6650 1393.9080 1467.9210 1546.2200 1623.0460 1717.7520 1792.5970 1867.8290 1943.5910 ... 20333.6900 20143.8800 19444.6300 15729.5800 16451.7000 17275.2300 18179.3200 19139.9100 20157.2100 21228.3800
AFG 179.4909 182.1864 185.5338 192.9397 194.3511 195.4881 200.4814 198.2413 204.9097 207.1753 ... 263.0708 292.6143 294.5654 351.5893 372.7821 387.9407 404.6445 419.7369 436.9097 455.4637
AGO 1164.6930 1192.4080 1220.7880 1248.8910 1212.4560 1281.8420 1252.5360 1338.9390 1385.0590 1377.6910 ... 2289.3820 2675.2520 2945.2890 2878.3360 2958.8980 3076.2210 3174.6280 3268.8260 3347.4790 3384.9220
AIA 1443.0830 1502.3340 1566.0870 1637.5250 1712.6170 1784.9380 1875.6700 1943.4890 2010.6670 2077.3580 ... 13845.4000 15838.2700 15682.0600 11612.8400 12059.6800 12573.3900 13137.4100 13733.3200 14360.5000 15016.1900
ALB 778.0425 809.2832 814.2191 845.1029 869.4968 912.0341 923.9318 976.7833 1017.5170 1056.7130 ... 2731.5010 2889.3540 3093.5480 3179.8590 3246.2210 3333.4310 3436.3720 3562.8720 3711.6290 3877.9590

5 rows × 66 columns

Firstly, we investigate how the GDPs have changed over time. To this end, a histogram is created for each year in which each bin corresponds to roughly 2% of the maximum GDP over all years.

def create_histogram(X, bins = None):
    """
    Creates a histogram for each column.
    Parameters:
        X (np.ndarray (n_row, n_col)) : raw data
        bins (None or np.histogram bins) : the bins. If None, 50 equally spaced bin edges 
        (i.e. 49 bins) are created in the [0, np.max(X)] range
    Returns:
        hst (np.ndarray (n_col, n_bins)) : one histogram per column
    """
    
    if bins is None:
        bins = np.linspace(0, np.max(X), num = 50)
    
    # density histograms scaled by the bin widths so that each row sums to one
    hst = np.array([np.histogram(x, bins = bins, density = True)[0] for x in X.T])
    hst = hst * np.diff(bins)
    return hst

histo_year = create_histogram(df.values)

(Figure: histograms of the GDP distribution for each year)

It is readily seen that both the maximum and the GDPs in general have increased over the years. The distribution has also become more spread out.

Moments

The median and the first three moments of the GDP distribution are plotted below. It is worth comparing the median and the mean. Both of them increase steadily (apart from the years following the collapse of the Eastern Bloc between 1989 and 1991). However, the median shows an increase of 450% whereas the mean only tripled. This means that the countries, in general, became richer with respect to the whole distribution, which is also seen in the overall reduction of the skew. The median, however, decreased by ~20% after 1990 compared to the dent of ~2% in the mean. This implies that the countries with lower GDPs were hit harder than those with higher GDPs.

df_year_stats = pd.DataFrame({'mean' : df.mean(axis = 0),
                              'median' : df.median(axis = 0),
                              'std' : df.std(axis = 0), 
                              'skew' : df.skew(axis = 0)})
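
The relative changes quoted above can be checked directly from this frame; a minimal sketch (not in the original post), assuming df_year_stats as constructed in the snippet above:

# ratio of the last and first years for each statistic
rel_change = df_year_stats.iloc[-1] / df_year_stats.iloc[0]
print(rel_change[['mean', 'median']])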

(Figure: mean, median, standard deviation and skew of the GDP distribution over time)

Relative GDP

It is revealing to compare the GDPs to their maximum value in each year. In doing so, the relative wealth distribution can be assessed. The df_scaled_year_stats frame lists the median, mean and standard deviation of the GDPs relative to the highest GDP in each year.

def scale_by_max(X):
    maxval = np.max(X)
    
    # guard against division by zero for an all-zero column
    if maxval == 0.0:
        maxval = 1.0
    scaled = X / maxval

    return scaled

# scale each column (year) by its yearly maximum
df_scaled = df.apply(scale_by_max, axis = 0)

df_scaled_year_stats = pd.DataFrame({'maximum' : df.max(axis = 0),
                                     'mean' : df_scaled.mean(axis = 0),
                                     'median' : df_scaled.median(axis = 0),
                                     'std' : df_scaled.std(axis = 0)})
histo_scaled_year = create_histogram(df_scaled.values)

(Figure: histograms of the GDPs scaled by the yearly maximum)

The wealth of the countries has been increasing relative to that of the wealthiest one as time progresses. In 1950, 70% of the countries had a GDP not exceeding 2% of the corresponding maximum. By 2010, this ratio had dropped to 40%.

Further signs of equalisation are that the mean doubles and the median triples. It is still astounding, though, that half of the countries have at most 3% of the maximum GDP. It is also interesting to observe that the separation between the wealthiest country and the rest had been increasing since the middle of the previous decade. This effect was moderated by the crisis in 2008, as shown by the less negative slope of the mean and std curves.
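
The 2% figures above can be reproduced from df_scaled; a small check (not part of the original post), assuming column 0 corresponds to 1950 and the columns advance by one year:

# fraction of countries whose GDP is at most 2% of the yearly maximum
frac_below = (df_scaled <= 0.02).mean(axis = 0)
print(frac_below.iloc[0], frac_below.iloc[60])   # 1950 and 2010 under the above assumption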

(Figure: statistics of the scaled GDPs over time)

Entropy

Entropy can measure how uniformly the values are distributed between ranges of equal width. The bins are taken from the histogram above, that is, each bin corresponds to $0.02 \cdot \text{GDP}_{max}$. If $w_{ij} \in \left[0, 1 \right]$ is the weight of the $i$-th bin in the $j$-th year, the entropy can be calculated as

\[H(j) = -\sum\limits_{i = 1}^{N_{bins}}w_{ij}\ln w_{ij}\]

The entropy is then normalised by the maximum possible value of the entropy, which is attained when each bin holds an equal number of countries:

\[\tilde{H}(j) = \frac{H(j)}{\ln N_{bins}} \in \left[0, 1\right]\]

The entropy increases over the decades, indicating that the GDPs become less concentrated around one value, i.e. the bins become more evenly populated. It does not, however, state anything about the values themselves. Whether 90% of the countries sit in the first bin and 10% in the last one, or vice versa, the entropy is the same. Nevertheless, the entropy provides a single-number indicator of how uniform the GDP distribution is. Please note that if all countries had equal wealth, the entropy would be zero, for all values would fall in one bin. The entropy is thus not suitable for measuring inequality. To that end we will invoke the Gini coefficient.

entropy = calc_entropy(histo_year)
entropy_scaled = calc_entropy(histo_scaled_year)
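
As a toy sanity check (not part of the original analysis), a perfectly uniform histogram should yield a normalised entropy of one, and a fully concentrated one should yield zero:

# two four-bin distributions: uniform and fully concentrated
toy = np.array([[0.25, 0.25, 0.25, 0.25],
                [1.00, 0.00, 0.00, 0.00]])
print(calc_entropy(toy))   # expected: [1. 0.]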

(Figure: normalised entropy of the GDP and scaled GDP distributions over time)

Inequality - Lorenz curve

The Lorenz curve is the cumulative fraction of wealth plotted against the cumulative fraction of the population. If the wealth is equally distributed, the curve is the straight line between the points (0, 0) and (1, 1), which is also an upper bound of the Lorenz curve. The more the curve deviates from this diagonal, the less equal the wealth distribution is. The Gini coefficient is related to the integral of the Lorenz curve: it is the ratio of the area enclosed by the line of equality and the Lorenz curve to the area under the line of equality. It expresses the distance from equality; 0 meaning total equality, 1 complete inequality.
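
In terms of the Lorenz curve $L(p)$, where $p$ is the cumulative population fraction, this ratio of areas reduces to the expression used by calculate_gini in the appendix:

\[G = \frac{\frac{1}{2} - \int\limits_{0}^{1} L(p)\,\mathrm{d}p}{\frac{1}{2}} = 1 - 2\int\limits_{0}^{1} L(p)\,\mathrm{d}p\]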

The coefficient decreases over time, by a modest 0.1 point. (Remember, the median in terms of the highest GDP was still only 3% in 2015!) The fall of communism left behind a number of poor countries with economic struggles, which shows up as an increase of the coefficient lasting a good twenty years from the early eighties. By 2000, the Gini coefficient starts to decrease again, showing that some countries had started to recover.

lorenz_curve = calculate_lorenz_curve(df.values)
gini_coeff = calculate_gini(df.values)

(Figure: Lorenz curves and the Gini coefficient over time)

Uncovering groups of countries

We have so far been concerned with characterising the distribution. A logical next step is to uncover what underlying components form the distribution; that is, whether there are groups of countries that evolve similarly over time.

Temporal characteristics

If there are countries that move together, the correlation between the rankings in two subsequent years should show this. We are going to use Spearman’s $\rho$ and Kendall’s $\tau$ rank correlations (pandas uses the scipy.stats Kendall $\tau$ implementation, which is the $\tau$-b variant). Spearman’s $\rho$ takes into account the magnitude of the changes between two consecutive rankings, whereas Kendall’s rank correlation counts the number of changes in the rankings regardless of their magnitude. Therefore Kendall’s $\tau$ is expected to be more sensitive to any variation of the ranking.

The correlation coefficients are calculated for a number of increasing delays, namely 1, 5, 10 and 20 years, and are shown below. The almost constant nature of Spearman’s $\rho$ indicates that there are no major reorderings of the countries’ relative positions. The minor - or local, if you wish - changes are, however, magnified by Kendall’s $\tau$ coefficient.

year_spearman_corr = df.corr(method = 'spearman')
year_kendall_corr = df.corr(method = 'kendall')
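
The snippet above yields the full year-by-year correlation matrices; the delayed coefficients themselves are not extracted there. One possible way (a sketch, not the original code) is to read off the appropriate off-diagonal, where delayed_corr is a hypothetical helper:

def delayed_corr(corr, delay):
    """Correlation between the rankings in year t and year t + delay, for every valid t."""
    values = corr.values
    return np.array([values[t, t + delay] for t in range(values.shape[0] - delay)])

spearman_lag_5 = delayed_corr(year_spearman_corr, 5)
kendall_lag_5 = delayed_corr(year_kendall_corr, 5)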

(Figure: Spearman and Kendall rank correlations at delays of 1, 5, 10 and 20 years)

The country rankings barely change between two subsequent years. Correlation coefficients at larger delays are indicative of longer term dynamics. E.g. the five-year-forward coefficient rapidly decreases from 1985, reaching its minimum at 1989, showing that the GDP rankings fluctuated between the years 1990 and 1994. From 1989 onwards the correlation increases as the new order stabilises throughout those years. The ten-year-forward correlation exhibits a small but noticeable slump starting at 1990. This means the rankings from 2000 onwards differ more and more from those ten years earlier. This correlation begins to increase at 1995, so there is again a period of stabilisation between 1995 and 2005.

Clustering

It would be interesting to see whether there are groups of countries whose GDPs move together and stay close to each other over time. Provided there are, we wish to find the GDP trajectory most characteristic of each group. A number of clustering methods can achieve these goals; to name a few: k-medoids, the Markov cluster algorithm (MCA) and affinity propagation. Since MCA has been used on numerous occasions in previous posts, we are going to choose the k-medoids method. A slim implementation thereof can be found at the bottom of this post.

Distance between GDP time series

What distance should be used in the clustering algorithm? We are looking for similarity in two respects:

  1. the series are close to each other with respect to their values
  2. their dynamics are similar

Correlation is insensitive to the proximity of the time series, therefore it is not a good indicator of criterion 1. The ordinary $L_{2}$ distance ignores the relative magnitude (mean) of the series. This is illustrated in the figure below: the $L_{2}$ distance between series (a) and (b) is the same as between series (b) and (c). However, (c) is only 1.5 times greater than (b), whereas (b) is twice the magnitude of (a).

(Figure: three example series illustrating the shortcomings of the $L_{2}$ distance)


The overlap, defined below, takes into account the similarity of the time dependence, as correlation does, while the magnitudes of the series are also accounted for through the denominator:

\[D_{ij} = \frac{\int\limits_{t} (f_{i}(t) - f_{j}(t))^{2} \mathrm{d}t}{\int\limits_{t} f_{i}(t)^{2}\mathrm{d}t + \int\limits_{t} f_{j}(t)^{2}\mathrm{d}t} = 1 - 2 \cdot \frac{\int\limits_{t} f_{i}(t) \cdot f_{j}(t)\mathrm{d}t}{\int\limits_{t} f_{i}(t)^{2}\mathrm{d}t + \int\limits_{t} f_{j}(t)^{2}\mathrm{d}t}\]

This measure is implemented as the overlap function below.

from scipy.spatial.distance import pdist, squareform

overlap = lambda u, v: 1 - 2 * (u * v).sum() / ((u * u).sum() + (v * v).sum())
X = squareform(pdist(df.values, overlap))
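
A quick numerical illustration (not in the original post) of why the overlap is preferred: correlation cannot distinguish a series from a scaled copy of itself, whereas the overlap distance can.

u = df.values[0]
print(np.corrcoef(u, 2 * u)[0, 1])   # 1.0: perfectly correlated despite different magnitudes
print(overlap(u, u))                 # 0.0: identical series
print(overlap(u, 2 * u))             # 0.2: same shape, different magnitude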

k-medoids clustering

The optimal number of clusters is unknown, thus a range of values is tried out. The goodness of each clustering is measured by the silhouette score. The corresponding results are saved in the results list of dictionaries. We do not perform cross-validation, for we are interested in grouping all of the countries.

from sklearn.metrics import silhouette_score

clusterer = KMedoids()

results = []

for n_clusters in range(2, 30):
    
    clusterer.set_params(n_clusters = n_clusters)
    labels = clusterer.fit_predict(X)
    score = clusterer.score
    # X is a distance matrix, hence the precomputed metric
    sil_score = silhouette_score(X, labels, metric = 'precomputed')
    
    results.append({'n_clusters' : n_clusters,
                    'labels' : labels,
                    'sil_score' : sil_score,
                    'score' : score,
                    'medoids' :  clusterer.medoids})

Determining the number of clusters

There are manifold ways to ascertain the optimal number of clusters. One can plot the intra-cluster variance as a function of the number of clusters and choose the point beyond which the variance decreases only by a small amount compared to the previous values. This method would suggest five as the optimal number of clusters (left panel of the figure below). One can alternatively choose the maximum of the silhouette score. This measure is inversely related to the intra-cluster variance but also penalises an increasing number of clusters. There is a maximum at $n_{clusters} = 3$ and a secondary maximum at $n_{clusters} = 11$.

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2)
fig.set_size_inches(10, 5)

axes[0].scatter([x['n_clusters'] for x in results], [x['score'] for x in results], color = 'green', label = 'Var')
axes[1].scatter([x['n_clusters'] for x in results], [x['sil_score'] for x in results], color = 'green', label = 'silhouette score')
axes[0].legend(); axes[1].legend()
axes[0].grid(True); axes[1].grid(True)
axes[0].set_xlim((0, results[-1]['n_clusters'] + 1)); axes[1].set_xlim((0, results[-1]['n_clusters'] + 1));
axes[0].set_ylim(bottom = 0.0); axes[1].set_ylim(bottom = 0.0)
axes[0].set_xlabel(r'$n_{clusters}$'); axes[1].set_xlabel(r'$n_{clusters}$')
axes[0].set_ylabel(r'$Var$'); axes[1].set_ylabel(r'$silhouette \,\, score$')
plt.show()

(Figure: intra-cluster variance (left) and silhouette score (right) against the number of clusters)
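
The silhouette optimum can also be read off programmatically rather than from the plot; a short sketch (not in the original post) using the results list built above:

best = max(results, key = lambda r: r['sil_score'])
print("best n_clusters by silhouette score:", best['n_clusters'])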

Before making a decision, the first twelve clusters are shown in the figure below. The exemplars of the clusters are drawn with bold lines. The other members of the clusters appear in half tone of the corresponding colors.

(Figure: clusterings with increasing numbers of clusters; exemplars drawn with bold lines)

It is instructive to compare the cluster size distributions. The clusters are ordered according to the mean of their exemplars.

from collections import Counter

sel = [1, 3, 9]   # indices of the results with 3, 5 and 11 clusters
for result in [results[idx] for idx in sel]:
    print("n_clusters:", result['n_clusters'], "cluster sizes:", sorted(Counter(result['labels']).most_common()))
n_clusters: 3 cluster sizes: [(0, 67), (1, 80), (2, 63)]
n_clusters: 5 cluster sizes: [(0, 44), (1, 39), (2, 50), (3, 39), (4, 38)]
n_clusters: 11 cluster sizes: [(0, 11), (1, 16), (2, 17), (3, 23), (4, 26), (5, 23), (6, 19), (7, 20), (8, 21), (9, 27), (10, 7)]

$n_{clusters} = 3$ and $n_{clusters} = 5$ split the time series into groups of roughly equal sizes. These groups are more likely to represent a geometric partitioning rather than one arising from the underlying structure. At $n_{clusters} = 3$, the third (brown) exemplar clearly falls between two strands of time series, in a low density region. When $n_{clusters} = 11$, the clusters have strongly unequal cardinalities, which suggests that some of the structure is captured. For instance, the tenth - red - exemplar lies in the middle of a band of time series. We therefore set the number of clusters to 11.

The clusters

The clusters are shown below. The exemplar is plotted in full tone whilst the other members are in half tone. The series which do not belong to the particular cluster are drawn in grey.

import json

# country code -> country name mapping
with open(r'path\to\ccodes.json' , 'r') as fproc:
    ccodes = json.load(fproc)

df_stats = pd.DataFrame(index = df.index)
df_stats['mean'] = df.mean(axis = 1)
# average annual return: mean of the year-on-year growth factors minus one
df_stats['AAR'] = np.mean(df.values[:, 1:] / df.values[:, :-1], axis = 1) - 1
df_stats['label'] = results[9]['labels']
df_stats['cname'] = df_stats.index.map(lambda x: ccodes.get(x, x))

(Figure: the eleven clusters; exemplars in full tone, other members in half tone)

The first four clusters collect countries whose GDPs remain low and whose associated average annual growths are around or below 2%. The next metacluster includes labels 5–10, where the mean GDP becomes progressively larger and the AAR is - on average - larger than 2%. The last group contains only one cluster, in which the majority of the time series represent a decreasing trend.

The primary separating dimension between the clusters is the average value of the GDPs across the years, as shown in the figure below, where the clusters are plotted against the mean and the average annual return (AAR). The AAR was chosen to represent the dynamics of the GDPs’ evolution.
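
For reference, the AAR computed in the df_stats frame above is the arithmetic mean of the year-on-year growth factors minus one,

\[\text{AAR}_{i} = \frac{1}{T - 1}\sum\limits_{t = 1}^{T - 1}\frac{\text{GDP}_{i, t + 1}}{\text{GDP}_{i, t}} - 1\]

where $T$ is the number of years in the series.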

(Figure: the clusters in the mean-AAR plane)

Using the mean and the AAR to describe the GDPs corresponds to a linear model, and as such it loses a wealth of information. This is apparent from the statistics table below. The mean is a reasonably good separating variable, whereas the AAR has rather poor discriminating power.

df_stats.groupby('label')[['mean', 'AAR']].describe().drop(['count', '25%', '75%'], axis = 1, level = 1)
mean AAR
mean std min 50% max mean std min 50% max
label
0 200.667123 51.618273 115.757428 216.478273 262.963111 0.017625 0.006906 0.008979 0.017365 0.032888
1 390.181510 97.923013 288.012800 365.500341 547.447041 0.022100 0.006737 0.014136 0.020379 0.038545
2 426.115283 92.886202 237.693017 431.601847 602.453659 0.005278 0.006111 -0.004056 0.005978 0.016868
3 847.386273 144.170860 620.250145 874.343756 1127.438545 0.018557 0.011233 0.005120 0.015508 0.049247
4 1479.747107 237.391618 814.175010 1467.340734 1845.074602 0.025489 0.015684 -0.001702 0.022071 0.066189
5 2490.528235 315.883235 1964.560733 2472.128227 3066.859439 0.023795 0.007599 0.011861 0.022081 0.037412
6 3827.121639 803.842542 2155.531905 3775.780500 5561.711500 0.024806 0.014834 0.005628 0.023129 0.069728
7 6300.830484 1052.993821 4450.766646 6115.175629 8619.286455 0.030364 0.017886 0.005252 0.027920 0.095589
8 11682.299887 2402.926593 7863.646803 11266.818530 15862.491985 0.033281 0.012924 0.013057 0.029593 0.057549
9 24691.065267 5082.077178 14681.687576 23907.654848 39449.978788 0.021609 0.017823 -0.038920 0.022235 0.073707
10 54237.905693 15460.532122 39840.211061 46557.844697 81882.728182 0.013774 0.015100 -0.009593 0.024289 0.027826

Future analysis

There are four main avenues to extend this short analysis:

Appendix

Misc functions

def calculate_lorenz_curve(X):
    """
    Calculates the Lorenz curve.
    Parameters:
        X (np.ndarray [pop_size, n_records]) : the wealth distribution over time.
    Returns:
        X_trf (np.ndarray [pop_size, n_records]) : Lorenz curve
    """

    X_trf = X.copy()
    X_trf = np.nancumsum(np.sort(X_trf, axis = 0), axis = 0) / np.nansum(X_trf, axis = 0)

    return X_trf

def calculate_gini(X):
    """
    Calculates the Gini coefficient.
    Parameters:
        X (np.ndarray [pop_size, n_records]) : the wealth distribution over time.
    Returns:
        gini_coeff (np.ndarray [n_records]) : Gini coefficient for each record.
    """
    
    lc = calculate_lorenz_curve(X)
    pop_size = lc.shape[0]
    
    auc = np.sum(lc, axis = 0) / pop_size
    
    gini_coeff = 1.0 - 2.0 * auc
    
    return gini_coeff

def calc_entropy(X):
    """
    Calculates the normalised entropy along the last dimension.
    Parameters:
        X (np.ndarray [n_records, n_categories]) : observations
    Returns:
        entropy (np.ndarray [n_records]) : the normalised entropy
    """
    
    # normalise each row so that it sums to one (the rows are treated as distributions)
    normed = X / np.sum(X, axis = 1, keepdims = True)
    
    logp = np.zeros(X.shape, dtype = float)
    logp[normed != 0.0] = np.log(normed[normed != 0.0])
    
    entropy = -np.sum(logp * normed, axis = 1)
    entropy /= np.log(X.shape[1])
    
    return entropy

k-medoids clustering

class KMedoids:
    """
    A slim implementation of the k-medoids algorithm.
    """
    
    def __init__(self, n_clusters = 8, n_init = 20, maxiter = 100):
        """
        Parameters:
            n_clusters (int) : number of clusters
            n_init (int) : number of restarts
            maxiter (int) : maximum number of iterations
        """
        
        self._n_clusters = n_clusters
        self._n_init = n_init
        self._maxiter  = maxiter
        self._labels = None
        self._medoids = None
         
    @property
    def labels_(self):
        return self._labels
    
    @property
    def medoids(self):
        return self._medoids
    
    @property
    def score(self):
        return self._score
        
    def fit(self, X):
        """
        Performs a clustering.
        
        Parameters:
            X (np.ndarray) : distance matrix
        Returns:
            self
        """
        
        # input check
        n_dim = X.ndim
        if n_dim != 2:
            raise ValueError("2D arrays expected. Got: {0}".format(n_dim))
            
        m, n = X.shape
        if n != m:
            raise ValueError("Square distance matrix expected.")
        
        # clear up from possible previous fit
        self._labels = None
        self._medoids = None
        
        # any clustering can only be better than this
        score_best = X.sum()
        
        # choose best from batch of fits
        for idx in range(self._n_init): 
            score_, labels_, medoids_ = self._single_fit(X)
            
            if score_ < score_best:
                score_best = score_
                self._medoids = medoids_
                self._labels = labels_
                self._score = score_best
        
        # relabel clusters
        labels = np.full_like(self._labels, -1)
        
        for idx, label in enumerate(self._medoids):
            labels[self._labels == label] = idx
        
        self._labels = labels
        
        return self
      
    def fit_predict(self, X):
        """
        Performs a clustering.
        Parameters:
            X (np.ndarray) : distance matrix
        Returns:
            self.labels_ (np.ndarray of int) : list of cluster labels
        """
        
        return self.fit(X).labels_
     
    def set_params(self, **params):
        """
        Sets the named parameters of the estimator.
        """
        
        for k, v in params.items():
            if k == "maxiter":
                self._maxiter = v
            
            elif k == "n_init":
                self._n_init = v
            
            elif k == "n_clusters":
                self._n_clusters = v
            
    def _single_fit(self, X):
        """
        Performs a single clustering.
        """
        
        medoids  = np.sort(np.random.choice(X.shape[0], self._n_clusters, replace = False))
        lookup = np.arange(X.shape[0])
        
        i_iter = 0
        
        while i_iter < self._maxiter:
            
            i_iter += 1
            d_min = X[medoids[0]].copy()
            labels = np.full(X.shape[0], medoids[0])
            
            # assignment step
            # assign each datum to its closest medoid
            for idx in medoids[1:]:
                mask = d_min > X[idx]
                d_min[mask] = X[idx][mask]
                labels[mask] = idx

            # update step
            # find the new medoid of each cluster

            is_converged = True

            for idx in range(self._n_clusters):
                mask = labels == medoids[idx]
                minidx = np.argmin(X[np.ix_(mask, mask)].sum(axis = 1))
                
                old_medoid = medoids[idx] + 0
                medoids[idx] = lookup[mask][minidx]
        
                is_converged &= medoids[idx] == old_medoid
        
            if is_converged:
                break
    
        # calculate cost function
        cost = d_min.sum()
            
        return cost, labels, medoids
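
A toy usage example (not from the original post), assuming numpy and the scipy distance helpers imported earlier: cluster a small random set of points through its Euclidean distance matrix to check that the class runs end to end.

rng = np.random.default_rng(42)
points = rng.normal(size = (20, 2))
D = squareform(pdist(points))            # 20 x 20 Euclidean distance matrix

km = KMedoids(n_clusters = 3, n_init = 10)
labels = km.fit_predict(D)
print(labels, km.medoids, km.score)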