Is it Cfa or Cfb?

29 Jan 2019

Goals

We attempt to construct a pipeline to predict the Koppen climate of a geographical location based on the observed temperatures and precipitation.

The raw notebook can be found here.

The data

Source

The temperature and precipitation data were obtained from the WorldClim - Global Climate Data database. The resolution is 10’ on an equidistant latitude and longitude grid. Please note that such a grid gives increasing weight to points towards the polar regions, where the cells cover smaller areas. The Koppen classification can be found on the WORLD MAPS OF Koppen–Geiger Climate Classification page of Universitat Wien. The data cover most of the landmass, save for Antarctica.

The two datasets were projected onto a common grid and saved to a csv store.
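The projection itself is not part of this notebook. A minimal sketch of how such a merge could be carried out is given below; the file names, the column layout and the half-degree cell size are assumptions, not the actual preprocessing.

import numpy as np
import pandas as pd

def snap_to_cell_centre(coord, step = 0.5):
    # snap a coordinate to the centre of its grid cell (hypothetical cell size)
    return np.floor(coord / step) * step + step / 2

clim = pd.read_csv('worldclim-monthly.csv')   # hypothetical: Lat, Lon, monthly values
kg = pd.read_csv('koppen-geiger.csv')         # hypothetical: Lat, Lon, Cls

for frame in (clim, kg):
    frame['Lat'] = snap_to_cell_centre(frame['Lat'])
    frame['Lon'] = snap_to_cell_centre(frame['Lon'])

# keep the grid cells that appear in both datasets
merged = kg.merge(clim, on = ['Lat', 'Lon'], how = 'inner')
merged.to_csv('temperature-precipitation-cls.csv')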

Structure

The data loaded from the store comprise roughly 64,000 points:

import numpy as np
import pandas as pd

path_to_store = r'path/to/temperature-precipitation-cls.csv'
df = pd.read_csv(path_to_store, index_col = 0)

print("Number of records: {0}".format(len(df)))
Number of records: 64409

The fields in the column header have the following meaning: Lat and Lon are the coordinates of the grid point, Cls is the Koppen class, columns 2–13 hold the monthly temperatures and columns 14–25 the monthly precipitation values:

df.head()
Lat Lon Cls 2 3 4 5 6 7 8 ... 16 17 18 19 20 21 22 23 24 25
1 -55.75 -67.75 ET 86.0 83.0 76.0 56.0 36.0 20.0 15.0 ... 86.0 81.0 75.0 66.0 67.0 63.0 51.0 51.0 75.0 66.0
2 -55.75 -67.25 ET 83.0 80.0 73.0 53.0 32.0 17.0 12.0 ... 72.0 70.0 61.0 53.0 52.0 55.0 44.0 43.0 58.0 59.0
3 -55.25 -70.75 ET 88.0 86.0 76.0 57.0 40.0 25.0 21.0 ... 145.0 132.0 121.0 113.0 105.0 107.0 103.0 92.0 105.0 107.0
4 -55.25 -70.25 ET 88.0 86.0 76.0 57.0 39.0 24.0 20.0 ... 131.0 119.0 110.0 104.0 95.0 97.0 90.0 80.0 95.0 96.0
5 -55.25 -69.75 ET 71.0 69.0 59.0 39.0 20.0 6.0 2.0 ... 120.0 108.0 99.0 94.0 86.0 88.0 80.0 72.0 89.0 88.0

5 rows × 27 columns

The latitude and longitude columns are removed for they will not be used in the following:

df.drop(['Lat', 'Lon'], axis = 1, inplace = True)

Preliminary checks

A quick sanity check is worth the time:

df.describe()
2 3 4 5 6 7 8 9 10 11 ... 16 17 18 19 20 21 22 23 24 25
count 64409.000000 64409.000000 64409.000000 64409.000000 64409.000000 64409.000000 64409.000000 64409.000000 64409.000000 64409.000000 ... 64409.000000 64409.000000 64409.000000 64409.000000 64409.000000 64409.000000 64409.000000 64409.000000 64409.000000 64409.000000
mean -19.496421 -7.391420 27.436694 76.984102 124.579748 160.791535 178.200174 170.462063 139.925818 92.115248 ... 55.060628 53.122390 56.857893 63.541989 73.573740 73.104815 64.611576 58.248350 54.097471 54.739369
std 221.942406 218.333765 199.267312 164.706627 126.037252 99.320742 88.278280 91.803491 111.517567 146.584784 ... 75.628189 67.982472 70.617548 79.706175 90.766774 84.688932 72.526402 66.720411 66.163412 74.047716
min -512.000000 -469.000000 -442.000000 -357.000000 -206.000000 -116.000000 -82.000000 -95.000000 -165.000000 -275.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% -219.000000 -199.000000 -130.000000 -41.000000 35.000000 94.000000 119.000000 107.000000 56.000000 -22.000000 ... 11.000000 13.000000 14.000000 14.000000 19.000000 20.000000 19.000000 17.000000 13.000000 11.000000
50% -29.000000 -12.000000 34.000000 88.000000 135.000000 160.000000 180.000000 172.000000 142.000000 94.000000 ... 24.000000 28.000000 34.000000 43.000000 52.000000 53.000000 44.000000 36.000000 30.000000 25.000000
75% 214.000000 221.000000 229.000000 230.000000 232.000000 243.000000 249.000000 248.000000 242.000000 238.000000 ... 64.000000 62.000000 67.000000 78.000000 88.000000 89.000000 80.000000 71.000000 67.000000 63.000000
max 337.000000 332.000000 331.000000 339.000000 359.000000 384.000000 392.000000 382.000000 357.000000 325.000000 ... 609.000000 707.000000 759.000000 1471.000000 1728.000000 1232.000000 903.000000 919.000000 802.000000 705.000000

8 rows × 24 columns

The precipitation data exhibit a distribution heavily skewed by outliers. This is best visualised by plotting the density and cumulative distribution functions along with the (0.1, 0.2, 0.3, …, 0.9, 0.99) quantiles in the top row of the figure below.

df_deciles = df.describe(percentiles = [.1, .2, .3, .4, .5, .6, .7, .8, .9, .99]).drop(['count', 'mean', 'std'])
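The densities and cumulative distributions in the top row can be produced along the following lines; this is only a sketch, the actual figure is generated in the notebook.

import matplotlib.pyplot as plt

temp = df.loc[:, '2' : '13'].values.ravel()
precip = df.loc[:, '14' : '25'].values.ravel()

fig, axes = plt.subplots(1, 2, figsize = (10, 4))
for ax, values, name in zip(axes, (temp, precip), ('temperature', 'precipitation')):
    ax.hist(values, bins = 100, density = True, alpha = 0.5, label = 'density')
    ax.hist(values, bins = 100, density = True, cumulative = True, histtype = 'step', label = 'cumulative')
    # mark the deciles and the 99th percentile
    for q in np.percentile(values, [10, 20, 30, 40, 50, 60, 70, 80, 90, 99]):
        ax.axvline(q, color = 'grey', linewidth = 0.5)
    ax.set_title(name)
    ax.legend()
plt.show()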

The box plots of the monthly data in the middle row also emphasise the presence of a large number of outliers. In addition, the temperature and precipitation data are on different scales.

png

We are foremost interested in how the observations are distributed within each class. To this end, the monthly mean temperature and precipitation, and the standard deviations thereof, are calculated for all climates to see where they are positioned on a common scale. It is apparent that the temperature data encompass a range five to ten times larger than that of the precipitation data.
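The per-class statistics summarised in the figure below can be obtained with a simple groupby aggregation; a minimal sketch (the plotting itself is left to the notebook):

class_stats = df.groupby('Cls').agg(['mean', 'std'])

# mean and spread of the first temperature column for each climate class
print(class_stats['2'].head())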

png

Ideally, all of the features should be on similar scales, because that would make the distances in each feature subspace comparable. Another desirable property is that each feature of each class be normally distributed, since it would enable us to use methods that require normally distributed data, such as discriminant analysis. Thus we will attempt to bring the features onto a common scale while reducing the influence of the outliers.

StandardScaler and MinMaxScaler are out of the question, for they cannot handle outliers. The outliers have two effects. Firstly, using the min-max scaler, the range where the bulk of the data lies would be squashed because of the large values of the outlying points. Secondly, the variance would be inflated by the spread of the outliers, and it would thus not reflect the variance of the bulk of the data. Performing any variance-based transformation, such as PCA, or fitting, e.g. a discriminant analysis, would therefore potentially lead to information loss.

RobustScaler uses specified quantiles to scale the data. Choosing the quantile range carefully, say 0.10 to 0.90, would avoid squashing the bulk of the data, but the data would still be riddled with outliers.
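For illustration only, such a robust scaling of the precipitation block could look like the sketch below; the quantile range is an assumption and this scaler is not used in what follows.

from sklearn.preprocessing import RobustScaler

robust = RobustScaler(with_centering = False, quantile_range = (10.0, 90.0))
precip_robust = robust.fit_transform(df.loc[:, '14' : '25'].values.reshape(-1, 1))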

The transform that we apply to the data circumvents both problems: the Yeo–Johnson power transform. It maps the data to follow an approximately normal distribution, and the transformed values are then standardised to zero mean and unit variance. The temperature and precipitation values are each transformed as one pooled block, which moves the two data sets to similar ranges. It is worth noting that the individual features will not be exactly normally distributed, although they are expected to be far less riddled with outliers.

from sklearn.preprocessing import power_transform

temp_trf = power_transform(df.loc[:, '2' : '13'].values.reshape(-1, 1), method = 'yeo-johnson') 

precip_trf = power_transform(df.loc[:, '14' : '25'].values.reshape(-1, 1), method = 'yeo-johnson')

# create new dataframe of the transformed values
df_trf = pd.DataFrame(data = np.concatenate([temp_trf.reshape(-1, 12), precip_trf.reshape(-1, 12)], axis = 1),   
                      columns = [str(x) for x in range(2, 26)])   
 
df_trf['Cls'] = df['Cls'].values
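The deciles of the transformed columns can be tabulated in the same fashion as before to confirm that the temperature and precipitation blocks now lie on similar scales; a quick check:

df_trf_deciles = df_trf.describe(percentiles = [.1, .2, .3, .4, .5, .6, .7, .8, .9, .99]).drop(['count', 'mean', 'std'])
print(df_trf_deciles[['2', '13', '14', '25']])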

As we can see in the top row of the plot below, the 0th and 99th percentiles of the temperature and precipitation distributions are positioned close to each other. The number and distance of the outliers are greatly reduced as indicated by the middle row. These changes are further confirmed by the scatter plots of the monthly data in the bottom row.

png

The monthly precipitation and temperature values are now neatly positioned within the same range for all climates as shown in the plot below.

png

Feature engineering

In this section, we further transform the data to generate features that can optimally be used to establish classification rules.

Firstly, the Yeo–Johnson transformed data are saved in variable X:

X = df_trf.iloc[:, :-1].values

Label encoding

The literal class labels are converted to integer ones and stored in the vector y:

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(df_trf['Cls'].values)
label_codes = dict(zip(encoder.transform(encoder.classes_), encoder.classes_))
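As a quick check, the integer codes can be mapped back to the literal labels at any point, either through the dictionary above or with the encoder itself:

print(label_codes[y[0]])
print(encoder.inverse_transform(y[:3]))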

Dimensionality reduction

Before we proceed to select classifiers, it is most useful to investigate the cardinality of the various classes.

from collections import Counter
", ".join( "({0} : {1})".format(label_codes[x[0]], x[1]) for x in Counter(y).most_common())
'(Dfc : 10894), (ET : 7219), (BWh : 7193), (Aw : 5894), (Dfb : 4406), (BSk : 3273), (BSh : 3107), (Cfa : 2855), (Af : 2440), (Cfb : 2234), (EF : 1980), (BWk : 1838), (Am : 1778), (Cwa : 1501), (Dfd : 1410), (Dwc : 1315), (Csa : 996), (Dfa : 756), (Csb : 677), (Dwb : 651), (Cwb : 560), (Dwa : 313), (As : 276), (Dsc : 261), (Dsb : 207), (Cfc : 175), (Dwd : 100), (Dsa : 79), (Csc : 11), (Cwc : 10)'

There are four or five classes in which the number of samples is comparable to the dimension of the data. Clearly, a dimensionality reduction would help here. We invoke our old friend, principal component analysis (PCA).

It is readily seen that the first six principal components account for almost 99% of the total variance. This implies there is a chance to cluster the data in a projected subspace whose dimension is well below the number of samples in almost all classes.

from sklearn.decomposition import PCA

pca = PCA()
X_pca = pca.fit_transform(X)
variance_ratio = pca.explained_variance_ratio_
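The claim above can be verified directly from the explained variance ratios; the figure below plots the same quantities. As stated, the cumulative share should reach roughly 99% around the sixth component.

cumulative_variance = np.cumsum(variance_ratio)
print(cumulative_variance[:6])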

png

Classification

We now proceed to select, from the plethora of classification algorithms, a suitable method to process the climate data.

Nearest means

Because it is too hard to choose, we create our own classification method based on the class means. The set of labels is denoted by $L$. The label $l_{i}$ of datum $x_{i}$ is that of the class whose mean $\mu_{j}$ is closest to the point in question:

\[l_{i} = \underset{j \in L}{\operatorname{argmin}} ||\mu_{j} - x_{i}||_{2}\, .\]

The partition above results in the Voronoi decomposition of the data around the class means. This method is implemented in the NearestMeanClassifier class:

from sklearn.base import BaseEstimator
from sklearn.metrics.pairwise import pairwise_distances_argmin

class NearestMeanClassifier(BaseEstimator):
    
    def __init__(self):
        
        self._is_fitted = False
    
    def fit(self, X, y):
        """
        Calculates the class centres on a training sample.
        
        Parameters:
            X np.ndarray, (n_samples, n_features) : training observations
            y (n_samples) :  training class labels
        
        Returns:
            self
        """
        
        labels = np.unique(y)
        n_clusters = labels.size
        
        self.cluster_centers_ = np.zeros((n_clusters, X.shape[1]))
        self.cluster_labels_ = np.zeros(n_clusters, dtype = int)
        
        for idx, label in enumerate(labels):
            self.cluster_centers_[idx] = np.mean(X[y == label], axis = 0)
            self.cluster_labels_[idx] = label
            
        self._is_fitted = True
        
        return self

    def predict(self, X):
        """
        Returns the predicted class labels on a test sample.
        Parameters:
            X np.ndarray, (n_samples, n_features) : observations
            
        Returns:
            labels_ np.ndarray, (n_samples) : predicted labels
        """
        
        if not self._is_fitted:
            raise ValueError("Classifier must be fit before prediction")
            
        label_positions = pairwise_distances_argmin(X, self.cluster_centers_)
        self.labels_ = self.cluster_labels_[label_positions]
        
        return self.labels_
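A quick sanity check of the classifier on a toy example (purely illustrative, with made-up points):

X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y_toy = np.array([0, 0, 1, 1])

nmc = NearestMeanClassifier()
nmc.fit(X_toy, y_toy)
print(nmc.predict(np.array([[0.1, 0.1], [4.8, 5.2]])))   # expected: [0 1]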

The classification is carried out in the projected PCA space. A fivefold cross-validation is invoked to determine the optimum number of principal components. Accuracy and the adjusted Rand score (to account for chance matches) will measure the goodness of the classifier.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import accuracy_score, adjusted_rand_score, confusion_matrix, make_scorer

cv = KFold(n_splits = 5, shuffle = True)
pca = PCA()
classifier = NearestMeanClassifier()

pipeline = Pipeline([('pca', pca), ('classifier', classifier)])

param_grid_nmc = [{'pca__n_components' : [2, 4, 6, 8, 10, 12, 14]}]

scoring = {'ars': make_scorer(adjusted_rand_score), 'accuracy': make_scorer(accuracy_score)}

grid_nmc = GridSearchCV(pipeline, cv = cv, param_grid = param_grid_nmc, 
                        scoring = scoring, 
                        refit = False,
                        return_train_score = False)
_ = grid_nmc.fit(X, y)
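The scores plotted below are taken from the cross-validation results; they can also be inspected directly, for example:

results_nmc = pd.DataFrame(grid_nmc.cv_results_)
print(results_nmc[['param_pca__n_components', 'mean_test_accuracy', 'mean_test_ars']])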

png

The classifier is surprisingly poor. What could the possible reasons be? We tacitly required that the classes be compact, roughly spherically distributed around their means, of similar spread, and well separated from each other.

These are really strict requirements which are not guaranteed to be met.

Structure of data

It is high time to delve deeper into the structure of the data. The projections of the error ellipses onto the 2D planes spanned by pairs of the first six principal components are shown in the figure below. The labels in the top left corner are the indices of the components spanning each plane, 0 corresponding to the component with the largest variance. Only the most populous classes are shown for the sake of clarity. Each colour corresponds to a class. It is clear that the distributions are far from symmetric, and there is a considerable amount of overlap between the various classes.

png

Linear discriminant analysis

One can rightly point out that binary combinations of the PCA vectors (those with unit or zero coefficients) are not necessarily the ones that maximally separate the classes.

Linear Discriminant Analysis (LDA) assumes that all classes are normally distributed with identical covariance matrices. Under this assumption it finds those projections of the data along which the classes are most separated. Using fewer than all of the principal components poses the danger of discarding variables that might be important components of the separating planes. It is, however, reasonable to assume that the separating planes are correlated with the principal components, given the large number of classes. (On the other hand, it is not at all difficult to construct an example proving otherwise, but it would be a really contrived one.)

To use LDA, the classes are required to be normally distributed with identical covariances. If any of these criteria is grossly violated, this classification method should not be used, or at least may result in poor performance. Nevertheless, the projections onto the 2D LDA planes are plotted below. The figure reassures us as to our assumption that linear combinations of the vectors do not separate the classes well. (A cross-validated LDA fit would result in approximately 80% accuracy.)

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

lda = LDA()
X_lda = lda.fit_transform(X, y)

png
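The roughly 80% figure quoted above comes from a cross-validated fit; a minimal sketch of how it can be reproduced:

from sklearn.model_selection import KFold, cross_val_score

lda_scores = cross_val_score(LDA(), X, y, cv = KFold(n_splits = 5, shuffle = True), scoring = 'accuracy')
print(lda_scores.mean())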

k-nearest neighbours

The separating surfaces must thus be of higher order than linear. Unfortunately, quadratic discriminant analysis cannot be invoked because of the low membership of certain classes. Another option is to use support vector machines with nonlinear kernels.

Since the data are so intermingled, why not use them directly to decide membership? A point is most likely to have a label identical to those of its neighbours. The k-nearest neighbours method utilises this idea.

Space to search in

The cost of running a KNN classifier depends on the dimension. It is thus advantageous to reduce the dimensionality of the data. LDA creates a series of projections along which the class separation is enhanced. We therefore use an increasing number of them to build the KNN classifier, and compare the result with an analogous PCA-based pipeline.

from sklearn.neighbors import KNeighborsClassifier

lda = LDA()
pca = PCA()

knn = KNeighborsClassifier()

pipeline_lda = Pipeline([('lda', lda), ('knn', knn)])
pipeline_pca = Pipeline([('pca', pca), ('knn', knn)])

cv = KFold(n_splits = 5, shuffle = True)

param_grid_lda = [{'lda__n_components' : [4, 6, 8, 10]}]
param_grid_pca = [{'pca__n_components' : [4, 6, 8, 10]}]

grid_knn_lda = GridSearchCV(pipeline_lda, cv = cv, param_grid = param_grid_lda, 
                        scoring = scoring, 
                        refit = False,
                        return_train_score = False)

grid_knn_pca = GridSearchCV(pipeline_pca, cv = cv, param_grid = param_grid_pca, 
                        scoring = scoring, 
                        refit = False,
                        return_train_score = False)

_1 = grid_knn_lda.fit(X, y)
_2 = grid_knn_pca.fit(X, y)

The fact that PCA outperforms LDA at lower dimensions reminds us that many of the assumptions of LDA are not met. At higher dimensions, both methods perform equally well. In the following, we choose PCA for feature engineering.

png

KNN can be perceived as a voting algorithm. The label of a datum is decided upon by a handful of its closest neighbours. Two notions ought to be specified, namely what constitutes a handful and how distance enters the vote.

Increasing the number of neighbours decreases the locality of the model, for points further away are included in casting a vote on the class membership. The parameter n_neighbors specifies how many neighbours should be taken into account. With the default settings of the KNN estimator, the distance to the neighbours does not affect their votes. However, one might expect that points closer to a specific datum are more likely to have labels identical to that of the datum in question. If the weights parameter is set to distance, closer points contribute with larger weight to the decision on the class membership.

param_grid = {'pca__n_components' : [6, 8, 10, 12],
              'knn__n_neighbors' : [5, 10, 15],
              'knn__weights' : ['uniform', 'distance']}


grid = GridSearchCV(pipeline_pca, param_grid = param_grid, 
                    cv = cv,
                    scoring = scoring, 
                    refit = False, 
                    return_train_score = False)

_2 = grid.fit(X, y)
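The best-scoring combination of hyperparameters can be read off the cross-validation results, for instance:

results_knn = pd.DataFrame(grid.cv_results_)
best = results_knn.loc[results_knn['mean_test_accuracy'].idxmax()]
print(best[['param_pca__n_components', 'param_knn__n_neighbors', 'param_knn__weights', 'mean_test_accuracy']])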

The variation of the accuracy in response to the hyperparameters is shown below. It plateaus beyond the tenth principal component in all settings. Taking the distance of the neighbours into account leads to an increase of 1%. It is curious that the accuracy decreases as the number of neighbours grows; this can be attributed to the classes overlapping heavily.

png

Summary

We have created a pipeline to classify climates based on the observed precipitation and temperature data. Using dimensionality reduction techniques and an optimised k-nearest neighbours classifier we have managed to achieve 90% accuracy.

Further improvements

A considerable amount of time was spent on choosing how to generate the features, and again some labour was needed to optimise the classifier. In a following post, we are going to merge these two steps in a neural network.