Goals
We attempt to construct a pipeline to predict the Koppen climate of a geographical location based on the observed temperatures and precipitation.
The raw notebook can be found here.
The data
Source
The temperature and precipitation data were obtained from the WorldClim - Global Climate Data database. The resolution is 10’ on an equidistant latitude and longitude grid. Please note that this gives increasing weight to points towards the polar regions. The Koppen classification can be found at the WORLD MAPS OF Koppen–Geiger Climate Classification page of Universitat Wien. The data cover most of the landmass save for Antarctica.
The two datasets were projected onto a common grid and saved to a csv store.
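For illustration, a minimal sketch of how such a merge onto a common grid might look, assuming both sources have already been resampled to the same 10’ grid and exported to flat files (the file names and column layouts here are hypothetical):

import pandas as pd

# hypothetical flat files: monthly climate variables and Koppen labels on the same 10' grid
df_clim = pd.read_csv('worldclim-monthly.csv')        # columns: Lat, Lon, 2, ..., 25
df_koppen = pd.read_csv('koppen-geiger-labels.csv')   # columns: Lat, Lon, Cls

# an inner join on the grid coordinates keeps only the points present in both datasets
df_merged = pd.merge(df_koppen, df_clim, on = ['Lat', 'Lon'], how = 'inner')
df_merged.to_csv('temperature-precipitation-cls.csv')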
Structure
The data loaded from the store comprise roughly 64,000 points:
import numpy as np
import pandas as pd

path_to_store = r'path/to/temperature-precipitation-cls.csv'
df = pd.read_csv(path_to_store, index_col = 0)
print("Number of records: {0}".format(len(df)))
Number of records: 64409
The fields in the column header have the following meaning:

Lat
: latitude in degrees

Lon
: longitude in degrees

Cls
: Koppen climate label

2-13
: monthly average temperature in $10 \cdot {}^{\circ}\text{C}$ (degrees Celsius multiplied by ten), from January to December

14-25
: monthly average precipitation in millimetres, from January to December
df.head()
Lat | Lon | Cls | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | -55.75 | -67.75 | ET | 86.0 | 83.0 | 76.0 | 56.0 | 36.0 | 20.0 | 15.0 | ... | 86.0 | 81.0 | 75.0 | 66.0 | 67.0 | 63.0 | 51.0 | 51.0 | 75.0 | 66.0 |
2 | -55.75 | -67.25 | ET | 83.0 | 80.0 | 73.0 | 53.0 | 32.0 | 17.0 | 12.0 | ... | 72.0 | 70.0 | 61.0 | 53.0 | 52.0 | 55.0 | 44.0 | 43.0 | 58.0 | 59.0 |
3 | -55.25 | -70.75 | ET | 88.0 | 86.0 | 76.0 | 57.0 | 40.0 | 25.0 | 21.0 | ... | 145.0 | 132.0 | 121.0 | 113.0 | 105.0 | 107.0 | 103.0 | 92.0 | 105.0 | 107.0 |
4 | -55.25 | -70.25 | ET | 88.0 | 86.0 | 76.0 | 57.0 | 39.0 | 24.0 | 20.0 | ... | 131.0 | 119.0 | 110.0 | 104.0 | 95.0 | 97.0 | 90.0 | 80.0 | 95.0 | 96.0 |
5 | -55.25 | -69.75 | ET | 71.0 | 69.0 | 59.0 | 39.0 | 20.0 | 6.0 | 2.0 | ... | 120.0 | 108.0 | 99.0 | 94.0 | 86.0 | 88.0 | 80.0 | 72.0 | 89.0 | 88.0 |
5 rows × 27 columns
The latitude and longitude columns are removed for they will not be used in the following:
df.drop(['Lat', 'Lon'], axis = 1, inplace = True)
Preliminary checks
A quick sanity check is worth the time:
df.describe()
2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | ... | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 64409.000000 | 64409.000000 | 64409.000000 | 64409.000000 | 64409.000000 | 64409.000000 | 64409.000000 | 64409.000000 | 64409.000000 | 64409.000000 | ... | 64409.000000 | 64409.000000 | 64409.000000 | 64409.000000 | 64409.000000 | 64409.000000 | 64409.000000 | 64409.000000 | 64409.000000 | 64409.000000 |
mean | -19.496421 | -7.391420 | 27.436694 | 76.984102 | 124.579748 | 160.791535 | 178.200174 | 170.462063 | 139.925818 | 92.115248 | ... | 55.060628 | 53.122390 | 56.857893 | 63.541989 | 73.573740 | 73.104815 | 64.611576 | 58.248350 | 54.097471 | 54.739369 |
std | 221.942406 | 218.333765 | 199.267312 | 164.706627 | 126.037252 | 99.320742 | 88.278280 | 91.803491 | 111.517567 | 146.584784 | ... | 75.628189 | 67.982472 | 70.617548 | 79.706175 | 90.766774 | 84.688932 | 72.526402 | 66.720411 | 66.163412 | 74.047716 |
min | -512.000000 | -469.000000 | -442.000000 | -357.000000 | -206.000000 | -116.000000 | -82.000000 | -95.000000 | -165.000000 | -275.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | -219.000000 | -199.000000 | -130.000000 | -41.000000 | 35.000000 | 94.000000 | 119.000000 | 107.000000 | 56.000000 | -22.000000 | ... | 11.000000 | 13.000000 | 14.000000 | 14.000000 | 19.000000 | 20.000000 | 19.000000 | 17.000000 | 13.000000 | 11.000000 |
50% | -29.000000 | -12.000000 | 34.000000 | 88.000000 | 135.000000 | 160.000000 | 180.000000 | 172.000000 | 142.000000 | 94.000000 | ... | 24.000000 | 28.000000 | 34.000000 | 43.000000 | 52.000000 | 53.000000 | 44.000000 | 36.000000 | 30.000000 | 25.000000 |
75% | 214.000000 | 221.000000 | 229.000000 | 230.000000 | 232.000000 | 243.000000 | 249.000000 | 248.000000 | 242.000000 | 238.000000 | ... | 64.000000 | 62.000000 | 67.000000 | 78.000000 | 88.000000 | 89.000000 | 80.000000 | 71.000000 | 67.000000 | 63.000000 |
max | 337.000000 | 332.000000 | 331.000000 | 339.000000 | 359.000000 | 384.000000 | 392.000000 | 382.000000 | 357.000000 | 325.000000 | ... | 609.000000 | 707.000000 | 759.000000 | 1471.000000 | 1728.000000 | 1232.000000 | 903.000000 | 919.000000 | 802.000000 | 705.000000 |
8 rows × 24 columns
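As part of the sanity check, it is also worth confirming that there are no missing values; a minimal check:

# count missing values across all columns
print("Number of missing values: {0}".format(df.isna().sum().sum()))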
The precipitation data exhibit a distribution heavily skewed by outliers. This is best visualised by plotting the density and cumulative distribution functions along with the (0.1, 0.2, 0.3, …, 0.9, 0.99) quantiles in the top row of the figure below.
df_deciles = df.describe(percentiles = [.1, .2, .3, .4, .5, .6, .7, .8, .9, .99]).drop(['count', 'mean', 'std'])
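A bare-bones sketch of how the precipitation distribution and the chosen quantiles might be plotted (the figure in the notebook is more elaborate):

import numpy as np
import matplotlib.pyplot as plt

# pooled monthly precipitation values
precip = df.loc[:, '14' : '25'].values.ravel()

fig, (ax_pdf, ax_cdf) = plt.subplots(1, 2, figsize = (10, 4))
ax_pdf.hist(precip, bins = 100, density = True)
ax_pdf.set_title('precipitation density')
ax_cdf.hist(precip, bins = 1000, density = True, cumulative = True, histtype = 'step')
ax_cdf.set_title('precipitation cumulative distribution')

# mark the quantiles listed above on the cumulative plot
for q in [.1, .2, .3, .4, .5, .6, .7, .8, .9, .99]:
    ax_cdf.axvline(np.quantile(precip, q), color = 'grey', alpha = 0.5)
plt.show()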
The box plots of the monthly data in the middle row also emphasise the presence of a large number of outliers. In addition, the temperature and precipitation data are on different scales.
We are foremost interested in how the observations are distributed within a class. To this end, the monthly mean temperature and precipitation, and the standard deviations thereof, are calculated for all climates to see where they are positioned on a common scale. It is apparent that the temperature data span a range five to ten times larger than that of the precipitation data.
Ideally, all of the features should be on similar scales, because that would make the distances in each feature subspace comparable. Another desirable property is that each feature of each class is normally distributed. It would enable us to use methods that require normally distributed data, such as discriminant analysis. Thus we will attempt to:
- Bring the temperature and precipitation data to similar scales
- Further transform them to approximately normal distribution for each feature.
StandardScaler and MinMaxScaler are out of the question, for they cannot handle outliers. The outliers have two effects. Firstly, with the min-max scaler, the ranges where the bulk of the data lies would be squashed due to the large values of the outlying points. Secondly, the variance would be inflated by the spread of the outliers and would thus not reflect the variance of the bulk of the data. Therefore performing any variance-based transformation, such as PCA, or fitting, e.g. discriminant analysis, would potentially lead to information loss.
RobustScaler uses specific quantiles to scale the data. Choosing the quantile range carefully, say 0.10 to 0.90, would avoid squashing the data, but the data would still be riddled with outliers.
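For completeness, this is roughly how such a robust scaling would be set up (the quantile range here is an assumption for illustration):

from sklearn.preprocessing import RobustScaler

# scale the precipitation columns by the 10th-90th percentile range, without centring
robust_scaler = RobustScaler(with_centering = False, quantile_range = (10.0, 90.0))
precip_robust = robust_scaler.fit_transform(df.loc[:, '14' : '25'].values)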
The transform that we apply to the data circumvents both problems: the Yeo–Johnson power transform. It transforms the data so that they follow an approximately normal distribution; the result is then standardised to zero mean and unit variance. In this manner the temperature and precipitation data sets are moved to similar ranges. It is worth noting that the individual features will not be exactly normally distributed, although they are expected to be far less riddled with outliers.
from sklearn.preprocessing import power_transform

# transform the pooled temperature and the pooled precipitation values separately
temp_trf = power_transform(df.loc[:, '2' : '13'].values.reshape(-1, 1), method = 'yeo-johnson')
precip_trf = power_transform(df.loc[:, '14' : '25'].values.reshape(-1, 1), method = 'yeo-johnson')

# create new dataframe of the transformed values
df_trf = pd.DataFrame(data = np.concatenate([temp_trf.reshape(-1, 12), precip_trf.reshape(-1, 12)], axis = 1),
                      columns = [str(x) for x in range(2, 26)])
df_trf['Cls'] = df['Cls'].values
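A quick way to verify that the transformed temperature and precipitation features end up on similar scales:

# compare the ranges of the transformed temperature and precipitation features
temp_cols = [str(x) for x in range(2, 14)]
precip_cols = [str(x) for x in range(14, 26)]
print("temperature range:   ({0:.2f}, {1:.2f})".format(df_trf[temp_cols].values.min(), df_trf[temp_cols].values.max()))
print("precipitation range: ({0:.2f}, {1:.2f})".format(df_trf[precip_cols].values.min(), df_trf[precip_cols].values.max()))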
As we can see in the top row of the plot below, the 0th and 99th percentiles of the temperature and precipitation distributions are positioned close to each other. The number and distance of the outliers are greatly reduced as indicated by the middle row. These changes are further confirmed by the scatter plots of the monthly data in the bottom row.
The monthly precipitation and temperature values are now neatly positioned within the same range for all climates as shown in the plot below.
Feature engineering
In this section, we further transform the data to generate features that can be optimally used to establish classification rules.
Firstly, the Yeo–Johnson transformed data are saved in the variable X:
X = df_trf.iloc[:, :-1].values
Label encoding
The literal class labels are converted to integer ones and stored in the vector y:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y = encoder.fit_transform(df_trf['Cls'].values)
label_codes = dict(zip(encoder.transform(encoder.classes_), encoder.classes_))
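The mapping can be inspected, and predictions can later be translated back to the literal labels with the same encoder:

# integer -> literal label lookup, e.g. for annotating plots and confusion matrices
print(label_codes)
print(encoder.inverse_transform([0, 1, 2]))   # the first three classes in alphabetical order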
Dimensionality reduction
Before we proceed to select classifiers, it is most useful to investigate the cardinality of the various classes.
from collections import Counter
", ".join( "({0} : {1})".format(label_codes[x[0]], x[1]) for x in Counter(y).most_common())
'(Dfc : 10894), (ET : 7219), (BWh : 7193), (Aw : 5894), (Dfb : 4406), (BSk : 3273), (BSh : 3107), (Cfa : 2855), (Af : 2440), (Cfb : 2234), (EF : 1980), (BWk : 1838), (Am : 1778), (Cwa : 1501), (Dfd : 1410), (Dwc : 1315), (Csa : 996), (Dfa : 756), (Csb : 677), (Dwb : 651), (Cwb : 560), (Dwa : 313), (As : 276), (Dsc : 261), (Dsb : 207), (Cfc : 175), (Dwd : 100), (Dsa : 79), (Csc : 11), (Cwc : 10)'
There are four or five classes in which the number of samples is comparable to the dimension of the data. Clearly, a dimensionality reduction would help here. We invoke our old friend, principal component analysis (PCA).
It is readily seen that the first six principal components account for almost 99% of the total variance. This implies there is a chance to cluster the data in a projected subspace whose dimension is sufficiently smaller than the number of samples of almost all classes.
from sklearn.decomposition import PCA
pca = PCA()
X_pca = pca.fit_transform(X)
variance_ratio = pca.explained_variance_ratio_
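The claim above can be verified by printing the cumulative explained variance:

# cumulative explained variance of the leading principal components
cumulative_variance = np.cumsum(variance_ratio)
print("First six components explain {0:.1%} of the total variance".format(cumulative_variance[5]))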
Classification
We now proceed to select, from the plethora of classification algorithms, a suitable method to process the climate data.
Nearest means
Because it is too hard to choose, we create our own classification method based on the class means. The set of labels is denoted by $L$. The cluster label $l_{i}$ of datum $i$ is that of the class whose mean $\mu_{j}$ is closest to the point in question:
\[l_{i} = \underset{j \in L}{\operatorname{argmin}} ||\mu_{j} - x_{i}||_{2}\, .\]

The partition above results in the Voronoi decomposition of the data around the class means. This method is implemented in the NearestMeanClassifier class:
from sklearn.base import BaseEstimator
from sklearn.metrics.pairwise import pairwise_distances_argmin


class NearestMeanClassifier(BaseEstimator):

    def __init__(self):
        self._is_fitted = False

    def fit(self, X, y):
        """
        Calculates the class centres on a training sample.

        Parameters:
            X np.ndarray, (n_samples, n_features) : training observations
            y np.ndarray, (n_samples,) : training class labels

        Returns:
            self
        """
        labels = np.unique(y)
        n_clusters = labels.size

        self.cluster_centers_ = np.zeros((n_clusters, X.shape[1]))
        self.cluster_labels_ = np.zeros(n_clusters, dtype = int)

        # each class centre is the mean of the observations bearing that label
        for idx, label in enumerate(labels):
            self.cluster_centers_[idx] = np.mean(X[y == label], axis = 0)
            self.cluster_labels_[idx] = label

        self._is_fitted = True

        return self

    def predict(self, X):
        """
        Returns the predicted class labels on a test sample.

        Parameters:
            X np.ndarray, (n_samples, n_features) : observations

        Returns:
            labels_ np.ndarray, (n_samples) : predicted labels
        """
        if not self._is_fitted:
            raise ValueError("Classifier must be fit before prediction")

        # assign each observation the label of the closest class centre
        label_positions = pairwise_distances_argmin(X, self.cluster_centers_)
        self.labels_ = self.cluster_labels_[label_positions]

        return self.labels_
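As a quick smoke test, the classifier can be fitted and evaluated on the training data directly (the six retained components are an arbitrary choice for illustration; the proper, cross-validated evaluation follows below):

from sklearn.metrics import accuracy_score

# fit on the leading principal components and predict on the same (training) data
nmc = NearestMeanClassifier()
nmc.fit(X_pca[:, :6], y)
y_pred = nmc.predict(X_pca[:, :6])
print("Training accuracy: {0:.3f}".format(accuracy_score(y, y_pred)))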
The classification is carried out in the projected PCA space. A fivefold cross-validation is invoked to determine the optimum number of principal components. Accuracy and the adjusted Rand score – the latter to account for chance matches – will measure the goodness of the classifier.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import accuracy_score, adjusted_rand_score, confusion_matrix, make_scorer
cv = KFold(n_splits = 5, shuffle = True)
pca = PCA()
classifier = NearestMeanClassifier()
pipeline = Pipeline([('pca', pca), ('classifier', classifier)])
param_grid_nmc = [{'pca__n_components' : [2, 4, 6, 8, 10, 12, 14]}]
scoring = {'ars': make_scorer(adjusted_rand_score), 'accuracy': make_scorer(accuracy_score)}
grid_nmc = GridSearchCV(pipeline, cv = cv, param_grid = param_grid_nmc,
scoring = scoring,
refit = False,
return_train_score = False)
_ = grid_nmc.fit(X, y)
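Since the grid search was run with refit = False, the scores are read off the cv_results_ attribute:

# cross-validated accuracy and adjusted Rand score for each number of retained components
results_nmc = pd.DataFrame(grid_nmc.cv_results_)
print(results_nmc[['param_pca__n_components', 'mean_test_accuracy', 'mean_test_ars']])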
The classifier is surprisingly poor. What could the possible reasons be? We tacitly required that
- the class distributions are pairwise symmetric with respect to the dividing hyperplanes,
- the distribution of each class is confined within its Voronoi boundary.

These are really strict requirements which are not guaranteed to be met.
Structure of data
It is high time to delve deeper into the structure of the data. The projections of the error ellipses onto the 2D hyperplanes spanned by pairs of the first six principal components are shown in the figure below. The labels in the top left corners are the indices of the hyperplanes, 0 being the one associated with the largest variance. Only the most populous classes are shown for the sake of clarity. Each colour corresponds to a class. It is clear that the distributions are far from symmetric. Also, there is a considerable amount of overlap between the various classes.
Linear discriminant analysis
One can rightly point out that binary combinations of the PCA vectors (unit or zero coefficients) are not necessarily those that maximally separate the classes.
Linear Discriminant Analysis (LDA) assumes that the classes are normally distributed with identical covariances. Under this assumption it finds those projections of the data along which the classes are most separated. Using fewer than all of the principal components poses the danger of discarding variables that might be important components of the separating planes. It is, however, reasonable to assume that the separating planes are correlated with the principal components, given the large number of classes. (On the other hand, it is not at all difficult to draw an example which would prove otherwise, but that would be a really contrived one.)
To use LDA, the classes are required to be normally and identically distributed. If any of these criteria is grossly violated, this classification method should not be used, or can at least be expected to perform poorly. Nevertheless, the projections onto the 2D LDA hyperplanes are plotted below. The figure reassures us in our assumption that linear combinations of the vectors do not separate the classes well. (A cross-validated LDA fit would result in approximately 80% accuracy.)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA()
X_lda = lda.fit_transform(X, y)
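A bare-bones version of such a projection plot, colouring the points by their class label:

import matplotlib.pyplot as plt

# scatter of the first two discriminant directions
plt.figure(figsize = (6, 6))
plt.scatter(X_lda[:, 0], X_lda[:, 1], c = y, s = 1, cmap = 'tab20')
plt.xlabel('LDA component 0')
plt.ylabel('LDA component 1')
plt.show()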
k-nearest neighbours
The separating surfaces must thus be of higher order than linear. Unfortunately, quadratic discriminant analysis cannot be invoked due to the low membership of certain classes. Another option would be to use support vector machines with nonlinear kernels.
Since the data are mingled, why not use them directly to decide membership? A point is more likely to have a label identical to those of its neighbours. The k-nearest neighbours method utilises this idea.
Space to search in
The cost of running a KNN classifier depends on the dimension. It is thus advantageous to reduce the dimensionality of the data. LDA creates a series of projections along which the class separation is enhanced. We therefore use a succession of them – and, for comparison, a succession of the principal components – to build the KNN classifier.
from sklearn.neighbors import KNeighborsClassifier
lda = LDA()
pca = PCA()
knn = KNeighborsClassifier()
pipeline_lda = Pipeline([('lda', lda), ('knn', knn)])
pipeline_pca = Pipeline([('pca', pca), ('knn', knn)])
cv = KFold(n_splits = 5, shuffle = True)
param_grid_lda = [{'lda__n_components' : [4, 6, 8, 10]}]
param_grid_pca = [{'pca__n_components' : [4, 6, 8, 10]}]
grid_knn_lda = GridSearchCV(pipeline_lda, cv = cv, param_grid = param_grid_lda,
scoring = scoring,
refit = False,
return_train_score = False)
grid_knn_pca = GridSearchCV(pipeline_pca, cv = cv, param_grid = param_grid_pca,
scoring = scoring,
refit = False,
return_train_score = False)
_1 = grid_knn_lda.fit(X, y)
_2 = grid_knn_pca.fit(X, y)
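The cross-validated accuracies of the two pipelines can be placed side by side:

# mean test accuracy as a function of the number of retained components
results_lda = pd.DataFrame(grid_knn_lda.cv_results_)
results_pca = pd.DataFrame(grid_knn_pca.cv_results_)
print(results_lda[['param_lda__n_components', 'mean_test_accuracy']])
print(results_pca[['param_pca__n_components', 'mean_test_accuracy']])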
The fact that PCA outperforms LDA at lower dimensions reminds us that many of the assumptions of LDA are not met. At higher dimensions, both methods perform equally well. In the following we choose PCA for feature engineering.
Hyperparameter search
KNN can be perceived as a voting algorithm. The label of a datum is decided upon by a handful of its closest neighbours. Two notions ought to be specified, namely "handful" and "distance".
Increasing the number of neighbours decreases the locality of the model, for points further away are included in casting a vote on the class membership. The parameter n_neighbors specifies how many neighbours should be taken into account. The distance between points is irrelevant when using the default settings of the KNN estimator. However, one might expect that data closer to a specific datum are more likely to have labels identical to that of the datum in question. If the weights parameter is set to 'distance', closer points contribute with larger weight to the decision on the class membership.
param_grid = {'pca__n_components' : [6, 8, 10, 12],
'knn__n_neighbors' : [5, 10, 15],
'knn__weights' : ['uniform', 'distance']}
grid = GridSearchCV(pipeline_pca, param_grid = param_grid,
cv = cv,
scoring = scoring,
refit = False,
return_train_score = False)
_2 = grid.fit(X, y)
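The best combination is read off from the cross-validation results:

# hyperparameter combination with the highest cross-validated accuracy
results = pd.DataFrame(grid.cv_results_)
best = results.loc[results['mean_test_accuracy'].idxmax()]
print(best[['param_pca__n_components', 'param_knn__n_neighbors', 'param_knn__weights', 'mean_test_accuracy']])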
The variation of accuracy as a response to the hyperparameters is shown below. It plateaus beyond the tenth principal component in all settings. Taking the distance of the neighbours into account leads to an increase of about 1%. It is curious that the accuracy depends negatively on the number of neighbours. This can be attributed to the heavy overlap between classes.
Summary
We have created a pipeline to classify climates based on the observed precipitation and temperature data. Using dimensionality reduction techniques and an optimised k-nearest neighbours classifier we have managed to achieve 90% accuracy.
Further improvements
A considerable amount of time has been spent on choosing how to generate features, and again some labour was needed to optimise the classifier. In a following post we are going to merge these two steps in a neural network.