We are going to explore the entry requirements for visiting countries across the globe. We will find out how easy or difficult it is to travel from one country to another, and which are the most welcoming or reclusive nations according to the travel documents they demand on entering their territories.
Preliminaries
Modules
The following modules and tools will be needed:
from collections import Counter
from itertools import chain, count, cycle, islice, product, repeat
import json
import ete3
from matplotlib import pyplot as plt
from matplotlib.colors import LogNorm, ListedColormap
import networkx as nx
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.cluster.hierarchy import linkage
import scipy.stats as scs
import scipy.spatial.distance as scd
from sklearn.decomposition import PCA
from sklearn.metrics import fowlkes_mallows_score, silhouette_score
from sklearn.preprocessing import normalize
Utility functions
Firstly, a handful of helper functions are defined:
def load_dict_from_json(path_to_file):
    """
    Loads a dictionary from a json file.
    Parameters:
        path_to_file (str) : **full** path to the json storage
    Returns:
        dict_ ({}) : dictionary
    """
    with open(path_to_file, 'r') as fproc:
        dict_ = json.load(fproc)
    return dict_
def load_df_from_json(path_to_file, orient = 'index'):
    """
    Loads a pandas dataframe from a json file.
    Parameters:
        path_to_file (str) : **full** path to the json storage
        orient (str) : orientation of the json (see pandas.read_json for details)
    Returns:
        df (pandas.DataFrame) : a dataframe
    """
    with open(path_to_file, 'r') as fproc:
        df = pd.read_json(fproc, orient = orient)
    return df
def most_common(arr):
    """
    Finds the most common element in the array.
    Parameters:
        arr (iterable) : an iterable
    Returns:
        most_common_element (object) : the most common element in the iterable. In case of ties it returns only one of them.
    """
    counter = Counter(arr)
    most_common_element = counter.most_common(1)[0][0]
    return most_common_element
Coding style
We will tend to use a functional approach for the following reasons:
- It is concise
- It can save memory
- It reflects the transformational nature of data analysis
For example, the most_common
function would look like
most_common = lambda x: Counter(x).most_common(1)[0][0]
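For example, with some made-up input:
most_common(['a', 'b', 'b', 'c']) # returns 'b'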
In some cases, the code snippets are only modestly interesting, hence they are hidden from this post. The raw jupyter notebook can be found here; it contains all the code used to process the data.
Graphs
If a script used to generate a plot is not particularly insightful, it will be omitted. Again, all of them can be found here.
Data collection
The data was collected from the most excellent website passportindex.org. I would encourage everyone to have a look at the site; one can learn about the (geo)political relationships between countries in a really entertaining way.
Preparing the data
The data
The database is loaded from a json
storage.
path_to_visa_json_db = r'path/to/visa_by_country.json'
df_visa = load_df_from_json(path_to_visa_json_db)
df_visa.head(3)
| AD | AE | AF | AG | AL | AM | AO | AR | AT | AU | ... | VA | VC | VE | VN | VU | WS | YE | ZA | ZM | ZW
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
AD | self | visa-free | visa required | visa-free | visa-free | visa-free | visa required | visa-free | visa-free | eTA | ... | visa-free | visa-free | visa-free | visa required | visa-free | visa on arrival | visa required | visa-free | visa on arrival | visa on arrival |
AE | visa-free | self | visa required | visa required / e-Visa | visa-free | visa-free | visa required | visa-free | visa-free | visa required / e-Visa | ... | visa-free | visa-free | visa required | visa required | visa-free | visa on arrival | visa on arrival | visa required | visa on arrival | visa on arrival |
AF | visa required | visa required / e-Visa | self | visa required / e-Visa | visa required | visa required | visa required | visa required | visa required | visa required / e-Visa | ... | visa required | visa-free | visa required | visa required | visa required | visa on arrival | visa required | visa required | visa required / e-Visa | visa required / e-Visa |
3 rows × 199 columns
The row indices, or labels, denote the countries whose citizens are travelling to the countries enumerated in the columns. For example, the entry (AD, AL) = visa-free tells us that the citizens of Andorra are granted visa free entry to Albania. A guest country is the country one travels from, whereas a host country is the country one travels to.
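For instance, the (AD, AL) entry quoted above can be read off with a plain pandas lookup:
print(df_visa.loc['AD', 'AL']) # guest AD (row) --> host AL (column): 'visa-free'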
Cleaning the data
Firstly, all of the entry requirements are checked. We are looking for
- aliases
- mistyped values
A quick histogram binning returns all types of entry requirements:
raw_requirements = Counter(np.ravel(df_visa.values))
for idx, req in enumerate(raw_requirements.most_common()):
    print("{0}\t: {1}".format(idx, req))
0 : ('visa required', 18396)
1 : ('visa-free', 11990)
2 : ('visa on arrival', 5660)
3 : ('visa required / e-Visa', 1799)
4 : ('visa on arrival / eVisa', 537)
5 : ('eTA', 479)
6 : ('self', 199)
7 : ('visitor’s permit', 196)
8 : ('e-Visa', 153)
9 : ('eVisa', 100)
10 : ('Visa waiver on arrival', 80)
11 : ('eVistors', 5)
12 : ('EVW', 4)
13 : ('visa-free ', 3)
It appears that the electronic visa has two aliases, e-Visa and eVisa. Also, visa-free carries a trailing space in a few entries. The terms eVistors (sic!) and EVW are also curious. The following utility, find_in_df, helps us figure out where these entries appear:
def find_in_df(df, value):
    """
    Finds all (row index, column name) pairs whose entry equals `value`.
    """
    idcs = []
    for col in df.columns:
        row_idcs = df[df[col] == value].index.tolist()
        idcs.extend([(idx, col) for idx in row_idcs])
    return idcs
print("Countries with 'EVW' {0}".format(find_in_df(df_visa, 'EVW')))
print("Countries with 'eVistors' {0}".format(find_in_df(df_visa, 'eVistors')))
Countries with 'EVW' [('AE', 'UK'), ('KW', 'UK'), ('OM', 'UK'), ('QA', 'UK')]
Countries with 'eVistors' [('BG', 'AU'), ('CY', 'AU'), ('CZ', 'AU'), ('HR', 'AU'), ('RO', 'AU')]
- EVW is a form of eTA (ESTA for the USA), thus it will be merged into that category.
- eVistor (eVisitor, if spelt correctly) is also a type of eTA, so it will be merged into that category too.
Looking at Visa waiver on arrival and visitor’s permit reveals that they cover Qatar’s and the Seychelles’ travel requirements.
countries = set([x[1] for x in find_in_df(df_visa, 'Visa waiver on arrival')])
print("Countries with 'Visa waiver on arrival' {0}".format(countries))
countries = set([x[1] for x in find_in_df(df_visa, 'visitor’s permit')])
print("Countries with 'visitor’s permit' {0}".format(countries))
Countries with 'Visa waiver on arrival' {'QA'}
Countries with 'visitor’s permit' {'SC'}
It is worth noting that the Seychelles offers universal visa free entry, hence its entries ought to be changed to visa-free.
The most interesting term, Visa waiver on arrival, granted to certain nationals travelling to Qatar, is in practice equivalent to visa free travel. It will thus be relabelled as such:
replace_map = {'EVW' : 'eTA',
'eVistors' : 'eTA',
'visa-free ' : 'visa-free',
'Visa waiver on arrival' : 'visa-free',
'visitor’s permit' : 'visa-free',
'e-Visa': 'eVisa'}
df_visa.replace(replace_map, inplace = True);
The cases where one country explicitly bars specific nationals from entering its territory are missing from the original database. They can be compiled from this wikipedia site. In the dictionary below, the keys are the guest countries and the values are the lists of host countries from which their citizens are barred. (Can the reader find out which country bans Taiwan?)
barred_countries = {"AM" : ["AZ", "PK"], "BD" : ["SD", "LY"], "VE" : ["US"], "TD" : ["US"],
"IR" : ["US", "LY"], "LY" : ["US"], "SY" : ["US", "LY", "ML"],"YE" : ["US", "LY"],
"PK" : ["LY"],"SD" : ["LY"], "KP" : ["KR", "JP"],
"SO" : ["AU","NZ","FI", "SE", "DE", "NO", "CA", "SK", "CZ", "HU", "GR", "LU", "NL", "UK", "US"],
"QA" : ["LY", "AE", "SA"], "TW" : ["GE"],
"RK" : ["AM", "AZ", "BY", "KZ", "KG", "MD", "RU", "TJ", "TM", "UA", "UZ", "CU", "GE", "SC"],
"IL" : ["IR","KW","LB","LY","SA","SY","YE","IQ","MY","DZ","BD","BN","OM","PK","AE"]}
for guest, hosts in barred_countries.items():
    for host in hosts:
        df_visa.loc[guest, host] = 'barred'
A quick check shows all substitutions have been performed correctly.
requirements = Counter(np.ravel(df_visa.values))
for idx, req in enumerate(requirements.most_common()):
    print("{0}\t: {1}".format(idx, req))
0 : ('visa required', 18341)
1 : ('visa-free', 12268)
2 : ('visa on arrival', 5658)
3 : ('visa required / e-Visa', 1791)
4 : ('visa on arrival / eVisa', 537)
5 : ('eTA', 488)
6 : ('eVisa', 253)
7 : ('self', 199)
8 : ('barred', 66)
The cleaned data are then saved.
path_to_visa_clean_db = r'path/to/visa_by_country_clean.json'
df_visa.to_json(path_to_visa_clean_db, orient = 'index')
Preparation
There is a wide range of possibilities to explore. If the reader is interested in 1D statistics, such as which country is the most welcoming, they are referred to the passportindex.org website.
Encoding
We are dealing with categorical values, therefore one can assign an integer to each category. They are also ordered, in a sense, based on how restricted the entry to a certain country is. One can attempt to create a mapping that represents the various levels of freedom, although it will be slightly subjective.
requirement_order_map = {'self' : -1,
'visa-free' : 0,
'eTA' : 1,
'visa on arrival' : 2,
'visa on arrival / eVisa' : 3,
'eVisa' : 4,
'visa required / e-Visa' : 5,
'visa required' : 6,
'barred' : 7}
category_lookup = dict(zip(requirement_order_map.values(), requirement_order_map.keys()))
_ = category_lookup.pop(-1)
This mapping is applied to the clean database:
path_to_visa_clean_db = r'path/to/visa_by_country_clean.json'
df_visa_clean = load_df_from_json(path_to_visa_clean_db)
df_visa_clean.replace(requirement_order_map, inplace = True)
Two lookup tables are generated to facilitate matching the country codes and names to the indices:
cnt_code_lookup = dict(enumerate(df_visa_clean.index)) # index --> country code
code_name_dict = load_dict_from_json(r'path/to/ccodes.json')
cnt_name_lookup = {_k : code_name_dict[_v] for _k, _v in cnt_code_lookup.items()} # index --> country name
Finally, colourmaps are created to encode the entry categories.
cat_colors = ['#405da0', '#3189b2', '#45935d', '#a5ea85', '#e2f28c', '#edc968', '#bc6843', '#474543']
cat_colors_aug = ['#000000', '#405da0', '#3189b2', '#45935d', '#a5ea85', '#e2f28c', '#edc968', '#bc6843', '#474543']
cat_cmap = ListedColormap(cat_colors)
cat_cmap_aug = ListedColormap(cat_colors_aug)
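As a minimal illustration (the actual plotting scripts live in the notebook; the vmin/vmax choice here is ours), the augmented colourmap can render the encoded table directly:
plt.matshow(df_visa_clean.values, cmap = cat_cmap_aug, vmin = -1, vmax = 7)
plt.show()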
Terminology
It is worth introducing the following terminology and notation.
- The number of countries is denoted by $n_{g}$ (g refers to guest).
- The entry requirements are categories, hence the full set of them is denoted by $C = \bigcup\limits_{k} \{ c_{k} \}$, where $c_{k}$ is an entry category such as visa free.
- The table containing the entry requirements is called the visa table and is denoted by $V \in C^{n_{g} \times n_{g}} \iff V_{ij} \in C$, where the first index, $i$, is the guest country and the second index, $j$, is the host country. $V$ is symmetric only if each pair of countries has reciprocal entry requirements.
- The i-th row of $V$ is denoted by $v_{i,*}$; it represents the entry requirements imposed on a citizen of country i.
- The j-th column of $V$ is denoted by $v_{*,j}$; it represents the requirements demanded by country j.
V = df_visa_clean.values
ng = V.shape[0]
nc = len(requirement_order_map) - 1 # ignore `self`
Analysis
General observations
The entropy, $H$, measures the information content or randomness of a distribution. The higher it is, the less likely one can predict which value will be observed next from the distribution of interest. Entropy has a lower bound of zero; however, in general, it does not have a constant upper bound. The entropy of the distribution of categories equals \(H(C) = - \sum\limits_{i=1}^{|C|} p(c_{i}) \log(p(c_{i})) \text{, where}\) $p$ is the probability mass function of $C$.
Should all countries have random requirements, the entropy would be maximal for the entire table and for each country: \(H_{max}(C) = - \sum\limits_{i=1}^{|C|} \frac{1}{|C|} \log\frac{1}{|C|} = \log(|C|).\)
We can then define the following relative entropies, each with an upper bound of unity, to investigate the diversity of the distributions:
- the relative overall entropy: $H_{t} = H(V) / H_{max}(C)$,
- the relative guest entropy: $H_{g}(i) = H(v_{i,*}) / H_{max}(C)$,
- the relative host entropy: $H_{h}(i) = H(v_{*,i}) / H_{max}(C)$.
max_entropy = np.log2(nc)
get_frequencies = lambda x: np.unique(x[x != -1], return_counts = True)[1]
# all countries (scipy's entropy defaults to the natural logarithm, hence base = 2 to match max_entropy)
overall_entropy = scs.entropy(get_frequencies(V), base = 2) / max_entropy
# guest countries
guest_entropy = np.fromiter((scs.entropy(get_frequencies(g), base = 2) for g in V), dtype = float, count = ng) / max_entropy
# host countries
host_entropy = np.fromiter((scs.entropy(get_frequencies(h), base = 2) for h in V.T), dtype = float, count = ng) / max_entropy
print("Entropy of requirements for all countries: {0:5.3f}".format(overall_entropy))
Entropy of requirements for all countries: 0.431
The joint distribution of the host and guest entropies is shown below. (I prefer to use scatter plots along with kde-type plots, as the latter can sometimes be misleading, especially violin plots.)
It is immediately seen that countries do not have random requirements, for the largest relative entropy is around 0.4. On the contrary, each country is biased towards certain requirements. The guest entropy never hits its lower bound, indicating each country faces at least two kinds of exit requirements. The countries whose host entropy is zero demand the same documents from all incoming citizens. There are around ten of them. However, the entropy does not state anything about the values of the random variate. A quick sorting (pun intended) tells us which ones these are.
The statistics of the countries will be collected in the df_cnt_stats
dataframe.
df_cnt_stats = pd.DataFrame({'code' : df_visa_clean.index})
df_cnt_stats['name'] = df_cnt_stats['code'].apply(lambda x: code_name_dict[x])
df_cnt_stats['guest_entropy'] = guest_entropy
df_cnt_stats[ 'host_entropy'] = host_entropy
df_cnt_stats['most_common_entry'] = [category_lookup[most_common(h)] for h in V.T]
df_cnt_stats['most_common_exit'] = [category_lookup[most_common(g)] for g in V]
# there is a weird bug in nsmallest and nlargest so we use sort_values instead
df_cnt_stats.sort_values(by=['host_entropy'])[['name', 'host_entropy', 'most_common_entry']].head(10)
| name | host_entropy | most_common_entry
---|---|---|---
57 | MICRONESIA | 0.000000 | visa-free |
111 | MADAGASCAR | 0.000000 | visa on arrival |
2 | AFGHANISTAN | 0.000000 | visa required |
121 | MALDIVES | 0.000000 | visa on arrival |
92 | NORTH KOREA | 0.000000 | visa required |
90 | COMOROS | 0.000000 | visa on arrival |
163 | SOMALIA | 0.000000 | visa on arrival |
44 | DJIBOUTI | 0.010582 | visa on arrival |
46 | DOMINICA | 0.010582 | visa-free |
154 | SEYCHELLES | 0.010582 | visa-free |
Indeed, a uniform requirement within a country does not mean identical requirements across countries: visa-free, visa required and other categories all appear among these zero-entropy hosts.
Distribution of requirements
Host and guest scores
A crude estimate of how welcoming a country is can be expressed as the mean of its entry requirements. This will be called the host score, $s_{h}$.
\(s_{h}(i) = \frac{ \sum\limits_{j \neq i} v_{ji} } {n_{g} - 1}\) Likewise, the average of the exit requirements of a country will be called the guest score, $s_{g}$ and indicates how welcomed the citizens of that country are. \(s_{g}(i) = \frac{ \sum\limits_{j \neq i} v_{ij} } {n_{g} - 1}\)
# the diagonal 'self' entries are encoded as -1, hence the +1 correction
sh = (V.sum(axis = 0) + 1) / (ng - 1)
sg = (V.sum(axis = 1) + 1) / (ng - 1)
The score distribution is shown below.
The histogram of the host and guest scores hints at three major types of mutual relationships. The agglomeration of countries in the upper right corner indicates mutually restricted entry criteria. The bottom middle region of the graph groups countries that have mixed entry requirements but whose citizens, in general, can enter other countries reasonably easily. The top left cluster represents welcoming countries whose citizens face strict entry requirements.
Distribution of requirements
The distributions of exit requirements, i.e. of the documents required from the citizens of each country, are collected in $\mathbf{p}_{g} \in \mathbb{R}^{n_{g} \times |C|}$. Its i-th row, $\mathbf{p}_{g}(i)$, is the histogram of the i-th row of $V$ and reflects how easily the citizens of country i can travel.
pg = np.zeros((ng, nc))
count_gen = (np.unique(v, return_counts = True) for v in V)
for row, (idcs, count_) in zip(pg, count_gen): # blessed be passing references
    # np.unique sorts, so the leading element is always the 'self' entry (-1), which is discarded
    row[idcs[1:]] = count_[1:]
pg = pg / pg.sum(axis = 1)[:, None]
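A quick sanity check that each row of pg is now a proper probability distribution:
assert np.allclose(pg.sum(axis = 1), 1.0)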
The distributions for Azerbaijan, Central African Republic, Cameroon, Cape Verde, Bhutan, DR Congo, Belgium and The Netherlands are shown below. It is readily observed that the citizens of the first six countries face considerably more paperwork when travelling abroad than those of the last two. A question then naturally arises: are there groups of countries with similar requirement patterns?
As a start, we can now easily figure out which country has the most polar exit requirements. The term ‘polar’ means a separated bimodal distribution:
- there are two categories, $c_{k}$ and $c_{l}$, that are more likely than any other: $\forall m \in C \setminus \{k, l\}: p_{g}(i)_{k}, p_{g}(i)_{l} > p_{g}(i)_{m}$
- their probabilities are close to each other: $p_{g}(i)_{k} \approx p_{g}(i)_{l}$
- their separation is larger than half of the cardinality of the categories: $|k - l| \geq \frac{|C|}{2}$
A limiting case is where two categories have 0.5 and 0.5 probabilities. Based on the previous plots these categories are likely to be visa free, visa on arrival and visa required.
mask = np.zeros((3, nc), dtype = float)
mask[0,0] = mask[0,6] = 0.5 # visa free -- visa required
mask[1,0] = mask[1,2] = 0.5 # visa free -- visa on arrival
mask[2,2] = mask[2,6] = 0.5 # visa on arrival -- visa required
overlap = np.linalg.norm(pg[:,None,:] - mask, axis = 2)
bimodals = np.array([np.argsort(x)[:3] for x in overlap.T])
Each row of the plot above shows the three most polar distributions of requirements for the visa free - visa required, visa free - visa on arrival and visa on arrival - visa required pairings.
Comparison of requirements
The distributions of the countries are compared via the Euclidean distance matrix, $D \in \mathbb{R}^{n_{g}\times n_{g}}$: \(d_{ij} = \| \mathbf{p}_{g}(i) - \mathbf{p}_{g}(j) \|_{2}\)
dist_mat = scd.squareform(scd.pdist(pg))
If there are $n$ clusters of similar $\mathbf{p}_{g}$-s, then the maximum number of peaks in the distance plot is \(\binom{n}{2} + 1.\)
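As a quick check of this bound, the maximum peak counts for a few candidate cluster numbers (a minimal sketch using scipy.special.comb for the binomial coefficient):
from scipy.special import comb
for n in (2, 3, 4):
    print(n, int(comb(n, 2)) + 1) # n = 2 -> 2, n = 3 -> 4, n = 4 -> 7 peaks at most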
There are between 4 and 7 peaks in the plot above. This implies the number of clusters is 3 or 4. However, one cannot tell which countries are in a certain cluster. In order to reveal this, we will group the $\mathbf{p}_{g}(i)$ distributions of all countries. Firstly, we calculate the Ward linkage between the individual distributions. The number of clusters is then chosen. The following options are given:
- Perform a number of consecutive clusterings then choose the one where the intra cluster variance is below a certain limit with the maximum number of clusters.
- Calculate the linkage matrix, $Z$, and plot the inter cluster distances. The number of clusters to keep is deduced from the location of the largest gap between the successive inter cluster distances (see the sketch below).
- Calculate the linkage matrix and compute the intra cluster distance in a breadth first traversal. Stop the traversal when the distance drops below a certain limit. This would save us the computational burden of performing consecutive clusterings, i.e. assembling the same linkage matrix over and over again.
Z = linkage(pg, 'ward')
In the plot above each point corresponds to a merge. The merge height is also an upper bound for the intra cluster variance of the joined clusters. We therefore use the second option to determine the number of clusters. The largest gap is between the penultimate point and the one before it, thus we will have three clusters.
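This rule can be written down in a couple of lines (a minimal sketch; the name n_clusters_gap is ours):
# merge heights sit in increasing order in the third column of Z
gap_idx = np.argmax(np.diff(Z[:, 2]))
# cutting inside the largest gap keeps the clusters formed up to merge gap_idx
n_clusters_gap = ng - (gap_idx + 1) # expected to be 3 for this data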
In the following we create a circular dendrogram of the clusters.
The dendrogram is split into three clusters:
clustaz = split_at_n_top(Z, 3)
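The helper split_at_n_top is defined in the hidden part of the notebook. An equivalent grouping can be obtained with scipy's own fcluster (a sketch, not the notebook's implementation; the name clustaz_alt is ours):
from scipy.cluster.hierarchy import fcluster
labels = fcluster(Z, t = 3, criterion = 'maxclust') # one cluster label per country
clustaz_alt = {k : np.where(labels == k)[0] for k in np.unique(labels)}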
The means and variances of the requirement distributions are then calculated:
means = np.array([np.mean(pg[_v], axis = 0) for _k, _v in clustaz.items()])
stds = np.array([np.std(pg[_v], axis = 0) for _k, _v in clustaz.items()])
The three groups correspond to the most welcomed countries, e.g. those in the EU; the welcomed ones, such as many Asian and African countries; and a third group consisting of countries with strained international relationships (Armenia) or serious internal conflicts (Syria).
The dendrogram shows these clusters and the base distributions from which the distances were calculated, in anticlockwise order.
Mutual requirements
We turn our attention to the two way relationships embodied in the requirements.
Reciprocity
Firstly, we investigate the reciprocity of requirements. To this end, we pair the $i \rightarrow j$ and $j \rightarrow i$ requirements and make a histogram of the pairs. The resulting matrix is called the reciprocity matrix, $R$:
\[\begin{eqnarray} R & \in & \mathbb{N}^{|C| \times |C|} \\ r_{ij} & = & \sum\limits_{k<l} \delta(v_{kl}, c_{i}) \cdot \delta(v_{lk}, c_{j}) \end{eqnarray}\]
tril = chain(*map(lambda x: zip(repeat(x-1, x-1), range(x-1)), range(ng, 1, -1))) # generator of strictly lower triangle indices (diagonal excluded)
recip_histo = Counter(map(lambda x: (V[x[1], x[0]], V[x[0], x[1]]), tril)) # count tuples
recip_mat = np.zeros((nc, nc), dtype = float) # convert to np.ndarray
for idcs, count_ in recip_histo.items():
    recip_mat[idcs] = count_
recip_mat = recip_mat / np.sum(recip_mat)
The matrix is not symmetric, meaning that there are pairs of countries which demand different travel documents depending on the direction of travel. The most popular mutual requirements are
- (0,0): the number of cases where entry from i to j is visa free and entry from j to i is visa free too
- (0,6): the number of cases where entry from i to j is visa free and entry from j to i requires a visa
- likewise, (6,0): the number of cases where entry from i to j requires a visa and entry from j to i is visa free
- (2,6) and (6,2): popular categories where a visa on arrival in one direction is met with a visa requirement in the other
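The departure from symmetry can also be quantified directly (a quick sketch; the name asym_mass is ours):
# probability mass not matched by the transposed element, counted once per unordered pair
asym_mass = 0.5 * np.sum(np.abs(recip_mat - recip_mat.T))
print("Probability mass in asymmetric pairs: {0:.4f}".format(asym_mass))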
The ratio of the identical requirements, r_req_iden
is the trace of the matrix.
r_req_iden = np.trace(recip_mat)
print("Ratio of mutually identical entry requirements: {0:.4f}".format(r_req_iden))
Ratio of mutually identical entry requirements: 0.4997
Comparing countries by reciprocity patterns
In order to compare individual countries, their reciprocity matrices, $R^{(i)} \in \mathbb{R}^{n_{c} \times n_{c}}$, have to be calculated.
\[r^{(i)}_{kl} = \frac{ \sum\limits_{j \neq i} \delta(c_{k},v_{ij}) \cdot \delta(c_{l}, v_{ji})} {n_{g} - 1}\]
The reciprocity matrix, $R$, defined earlier is recovered, up to normalisation, from the individual reciprocity matrices:
\[R \propto \sum\limits_{i} R^{(i)}\]
The few lines below conveniently compute the individual matrices:
cnt_recip_mat = np.zeros((ng, nc, nc), dtype = float)
for idx, (g, h) in enumerate(zip(V, V.T)): # make pairs
    count_dict = dict(Counter(zip(g, h))) # count pairs of requirements
    _ = count_dict.pop((-1, -1), None) # remove the (-1, -1) self pair
    for (r, c), count_ in count_dict.items(): # populate the matrix
        cnt_recip_mat[idx, r, c] = count_
cnt_recip_mat = np.reshape(cnt_recip_mat, (ng, nc * nc))
normalize(cnt_recip_mat, norm = 'l1', axis = 1, copy = False);
A way to identify repeating patterns is to perform a PCA. There are, however, two issues with PCA: firstly, it only accounts for linear dependencies; secondly, it only applies linear transformations to the data. Nevertheless, it recovers some basic relationships.
pca = PCA();
pca.fit(cnt_recip_mat)
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
We are going to keep the first few principal components that account for 95% of the total variance.
evr = pca.explained_variance_ratio_
cevr = np.cumsum(evr)
keep_idcs = np.arange(nc * nc, dtype = int)[cevr < 0.95] # there are nc * nc = 64 components in total
print(keep_idcs)
[0 1 2 3 4 5]
Only six components are retained out of sixty-four. It is interesting to see that most of the linear relationships involve the visa required restriction. For example, the first component (top left corner) indicates that a pair of countries either has (visa required, visa on arrival) xor (visa required, visa required) mutual requirements.
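For later reference, the countries can be projected onto the retained components with the fitted PCA object (a sketch; the name proj is ours):
proj = pca.transform(cnt_recip_mat)[:, keep_idcs] # shape: (ng, 6)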
Clustering
It would be interesting to see whether there are clusters formed around certain reciprocity patterns. First of all, a dissimilarity measure is needed. Taking the $L_{2}$ overlap of the reciprocity distributions can be misleading. For example, consider the following three distributions below:
The $L_{p}$ distances between any two distributions are equal. However, we feel that (1) is closer to (2) than to (3), for visa free is closer to eTA than to visa required. In general, when comparing probability distributions over ordered categorical variables, it is advantageous to use a distance that represents the underlying ordering. One such function is the earth mover’s distance (EMD), or Wasserstein metric. In a nutshell, it is the minimum work required to move one distribution to match another one. Before we calculate EMDs for the countries, two notes are in order:
- The two dimensional nature of the reciprocity matrix is taken into account by generating a 2D ground distance matrix. For the details please see calc_ground_distance_2D in the appendix.
- The hierarchical clustering matches pairs of distributions with gradually increasing $L_{2}$ distance. As a consequence, the above effect is mitigated to some extent, for the widely dissimilar vectors are unlikely to be paired.
The ground distance matrix is calculated:
def calc_ground_distance_2D(N):
    """
    Calculates the 2D Manhattan distance for an N*N grid. It returns the flattened 2D distance matrix
    d((i,j), (k,l)) = |i - k| + |j - l| (4D)
    Let: m = i * N + j and n = k * N + l
    Let D \in N^{N^2 x N^2}
    Then D(m,n) = d((m//N, mod(m,N)), (n//N, mod(n,N))) = d((i,j), (k,l))
    Parameters:
        N (int) : gridsize
    Returns:
        dmat (np.ndarray[:,:], dtype = int) : the flattened Manhattan distance matrix
    """
    i_arr = np.repeat(np.arange(N, dtype = int), N) # row index of each flattened cell
    j_arr = np.tile(np.arange(N, dtype = int), N)   # column index of each flattened cell
    dmat = np.abs(np.subtract.outer(i_arr, i_arr)) + np.abs(np.subtract.outer(j_arr, j_arr))
    return dmat
d_ground = calc_ground_distance_2D(8)
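A quick check of the construction: with N = 8, the flattened index 9 corresponds to cell (1, 1), whose Manhattan distance from cell (0, 0) (flattened index 0) is |0 - 1| + |0 - 1| = 2.
assert d_ground[0, 9] == 2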
The matrix below is the ground distance between two flattened reciprocity matrices. The $(m,n)$-th element is the $L_{1}$ distance between
\[\big(i \leftarrow \lfloor \frac{m} {n_{c}} \rfloor, j \leftarrow \text{mod}(m, n_{c})\big) \text{ and } \big(k \leftarrow \lfloor \frac{n} {n_{c}} \rfloor, l \leftarrow \text{mod}(n, n_{c})\big)\]
We can now calculate the EMD, emd, for all pairs of countries using the Python Optimal Transport module. We also calculate the Euclidean distance, l2d, for comparison.
import ot
emd = scd.squareform(scd.pdist(cnt_recip_mat, lambda x, y: ot.emd2(x, y, d_ground)))
l2d = scd.squareform(scd.pdist(cnt_recip_mat))
It is readily seen from the joint plot below that the earth mover’s distance is usually larger than the ordinary $L_{2}$ distance. This is due to taking the distance between the ordered categories into account.
Two hierarchical clusterings are generated, one for EMD and one for $L_{2}$ distance.
Z_emd = linkage(emd, 'ward')
Z_l2d = linkage(l2d, 'ward')
The EMD is an order of magnitude larger than the $L_{2}$ distance, due to the underlying ground distance. The distributions of merge heights also differ: the $L_{2}$ metric generates a more branched structure at larger heights.
The labels in the top forty clusters are compared with respect to the induced Fowlkes-Mallows score. The closer its value is to unity, the more labels can be found grouped in the same clusters in the two hierarchies.
label_pairs = ((linkage_to_labels(Z_emd, n), linkage_to_labels(Z_l2d, n)) for n in range(1,40))
scores = list(fowlkes_mallows_score(*lp) for lp in label_pairs)
The score is unity when all countries are in a single cluster, as expected. It plummets to just below 0.6, indicating that the topmost merge partitions the countries quite differently. The next lower merges increase the value of the score, so the countries within the top two clusters are grouped into similar clusters. This jagged behaviour is due to the EMD and $L_{2}$ distance being correlated. In the case of two independent partitionings, the score would be expected to decrease monotonically.
Back to the EMD clustering: where to cut the dendrogram? Since there is no apparent gap (or elbow) in the succession of merge heights, the silhouette method will be invoked.
label_gen = (linkage_to_labels(Z_emd, n) for n in range(2,40))
score_gen = (silhouette_score(emd, label, metric = 'precomputed') for label in label_gen)
scores = np.fromiter(score_gen, dtype = float)
n_clusters = np.argmax(scores) + 2
print("Silhouette maximum at number of clusters: {0}".format(n_clusters))
Silhouette maximum at number of clusters: 5
The tree thus will be cut to yield five clusters. The means of the reciprocity matrices within each cluster are shown below. The first three roughly correspond to the (visa required, visa on arrival), (visa required, visa required) and (visa free, visa free) relationships. The fourth group collates countries that are less welcoming than welcomed. The fifth represents a group of countries which either have mutual visa free arrangements or face a visa requirement in both directions.
clustaz_ = split_at_n_top(Z_emd, 5)
means_ = np.array([np.mean(cnt_recip_mat[_v], axis = 0) for _k, _v in clustaz_.items()])
The dendrogram of the five clusters is shown below in anticlockwise order.
Asymmetry
Next, we calculate the conditional distributions of the requirements with respect to the exit requirement. This is easily achieved by normalising each row of the overall reciprocity matrix.
p_recip_mat = recip_mat / np.sum(recip_mat, axis = 1)[:, None] # each row sums to one
From these probability plots one can deduce that visa free and visa on arrival requirements are the most reciprocated ($p_{0,0} = 0.7$, $p_{2,2} = 0.55$). Countries also tend to mutually bar each other’s citizens ($p_{7,7} = 0.95$). At least there is something they agree on.
Quantifying reciprocity
To see how much a country’s entry requirements are reciprocated, the overlap between the guest, $v_{i,*}$, and host, $v_{*,i}$, vectors is calculated.
\[A_{us}(i) = \frac{\sum\limits_{k} \delta(v_{ik},v_{ki})}{n_{g}}\]
A more sophisticated indicator is to count the signs of the differences, $A_{sg}$. Negative values are assigned to countries that are more welcomed than welcoming.
\[A_{sg}(i) = \frac{\sum\limits_{k} \text{sign}(v_{ik} - v_{ki}) } {n_{g}}\]
A third measure of asymmetry is the signed sum of differences, $A_{ss}$. However, the assignment of the actual numerical values to the categories, apart from being ordered, is somewhat accidental.
\[A_{ss}(i) = \frac{\sum\limits_{k} v_{ik} - v_{ki} } {n_{g} \cdot (\max(C) - \min(C)) }\]
# A_{us} -- unsigned overlap
a_unsigned = np.fromiter((np.sum(g == h) for g, h in zip(V, V.T)), count = ng, dtype = float) / ng
# A_{sg} -- signed difference
a_signed = np.fromiter((np.sum(np.sign(h - g)) for h, g in zip(V, V.T)), count = ng, dtype = float) / ng
# A_{ss} -- sum of signed differences, scaled by max(C) - min(C) = 7
a_signed_sum = np.fromiter((np.sum(h - g) for h, g in zip(V, V.T)), count = ng, dtype = float) / ng / 7
df_cnt_stats['unsigned'] = a_unsigned
df_cnt_stats['signed'] = a_signed
df_cnt_stats['signed sum'] = a_signed_sum
The three most welcomed countries:
df_cnt_stats.sort_values(by=['signed'])[['name', 'signed', 'unsigned']].head(3)
| name | signed | unsigned
---|---|---|---
186 | UNITED STATES OF AMERICA | -0.814070 | 0.185930 |
28 | CANADA | -0.768844 | 0.201005 |
9 | AUSTRALIA | -0.572864 | 0.035176 |
The three least welcomed countries:
df_cnt_stats.sort_values(by=['signed'])[['name', 'signed', 'unsigned']].tail(3)
| name | signed | unsigned
---|---|---|---
163 | SOMALIA | 0.788945 | 0.110553 |
33 | COTE D'IVOIRE | 0.824121 | 0.115578 |
73 | HAITI | 0.869347 | 0.120603 |
The sorted scores are shown below. The flatter regions indicate that numerous countries have similarly asymmetric relations. However, looking at $A_{us}$ or $A_{sg}$ alone does not tell us whether the asymmetries of different signs compensate each other or a country has symmetric relationships.
A joint plot of $A_{us}$ and $A_{sg}$ can resolve this issue. The more asymmetric a country is, the more likely it is to be less welcomed than welcoming. These countries constitute the top left cluster. If a country has symmetric relationships ($A_{sg}$ close to zero), it is likely to be less welcoming than welcomed, according to the grouping in the bottom right corner.
Blocks
It is interesting to see whether there are groups of countries which have stronger ties among themselves than with others. An extreme example is a group where all entry requirements are visa free in both directions.
To this end, consider the directed graph where the vertices are the countries, each pair of them connected by two directed edges corresponding to the entry requirements. We then colour the edges according to their types, e.g. Holy See and Italy are connected by two antiparallel blue edges. Finding a group of identical entry requirements is then equivalent to finding the n-cliques containing only edges of the same colour.
This can be done in the following way:
\[\begin{eqnarray} & & \textbf{Algorithm:} \textit{ Find maximum disjoint coloured cliques} \\ & & \quad \texttt{FMDCC}(G, c) \\ & & \quad G' \leftarrow \left(V, G(E) \setminus \{v_{ij} \in G(E): \,\, \texttt{Colour}(v_{ij}) \neq \texttt{Colour}(v_{ji}) \} \right) \\ & & \quad G' \leftarrow \left(V, G'(E) \setminus \{v_{ij} \in G'(E): \,\, \texttt{Colour}(v_{ij}) \neq c\} \right) \\ & & \quad \mathbf{Blocks} \leftarrow () \\ & & \quad \mathbb{S} \leftarrow \texttt{FindDisjointSubgraphs}(G') \\ & & \quad \textbf{for} \, S \in \mathbb{S} \, \textbf{do} \\ & & \quad \quad \textbf{while} \, |S| > 0 \, \textbf{do} \\ & & \quad \quad \quad C \leftarrow \texttt{FindMaximumClique}(S) \\ & & \quad \quad \quad \mathbf{Blocks}.\texttt{append}(C) \\ & & \quad \quad \quad S \leftarrow S \setminus C \\ & & \quad \quad \textbf{end while} \\ & & \quad \textbf{end for} \end{eqnarray}\]
(To generate the plots below a fair amount of modestly interesting code is required. All of the functions can be found in the raw jupyter notebook.)
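The helper find_colour_max_cliques used below lives in the notebook; a possible rendering of FMDCC with networkx is sketched here (under the integer encoding above and not necessarily the author's implementation; nx.find_cliques enumerates maximal cliques, from which the largest is peeled off):
def find_colour_max_cliques(V, colour):
    """
    Sketch of FMDCC: keep only mutual edges of the sought colour,
    then greedily peel maximum cliques off each disjoint subgraph.
    """
    n = V.shape[0]
    G = nx.Graph()
    G.add_nodes_from(range(n))
    # keep an undirected edge only where both directed requirements carry the sought colour
    for i in range(n):
        for j in range(i + 1, n):
            if (V[i, j] == colour) and (V[j, i] == colour):
                G.add_edge(i, j)
    blocks = []
    for comp in nx.connected_components(G):
        subgraph = G.subgraph(comp).copy()
        while subgraph.number_of_nodes() > 0:
            clique = max(nx.find_cliques(subgraph), key = len)
            blocks.append(clique)
            subgraph.remove_nodes_from(clique)
    return blocks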
The cliques of visa free countries, cliques_vf
are retrieved as:
cliques_vf = find_colour_max_cliques(V,0)
The blocks are then printed on a circular layout where each node corresponds to a country, and each edge represents a mutual visa free requirement. The largest one consists of the Schengen/EEA/European microstates and numerous South American countries. Several Central American (plus a few African) states constitute the second largest group. The ECOWAS member states make up the third block. In general, countries in geopolitical proximity tend to form cliques. However, all connections in a clique should be visa free, therefore even one asymmetric relationship can remove an otherwise perfectly connected country. A fine example is Russia, which requires a visa from some of the Central Asian republics, therefore it cannot be in the mainly post Soviet fourth block.
The original matrix, $V$, (a) shows no apparent structure. When countries are reordered (b), the blocks appear along the diagonal. Off-diagonal blocks can also be spotted where the requirements are dominated by a certain category. This is amplified if the mode of each block is plotted (c). The largest visa free clique comprises European and South American countries. Again, it is easier to travel from than to travel to these countries, as the leftmost and topmost bands suggest.
Requirement similarity
Finally, we compare countries by their country specific requirements. Again, we start with defining a distance. The most straightforward choice is the simple matching coefficient, $d_{smc}$, excluding the host and guest countries:
\[d_{smc}(i,j) = \sum\limits_{k\notin \{i,j\}} \delta ( v_{ik}, v_{jk}) \, .\]
This, however, is not a proper metric and does not take into account the ordering of the categories. The second choice can be the $L_{1}$ distance, again excluding the host and guest countries:
\[d_{L1}(i,j) = \sum\limits_{k\notin \{i,j\}} | v_{ik} - v_{jk} | \, .\]
The drawback of this one is that the numerical values of the ordinal categories are arbitrary. One can also use the Jaccard distance over the sets of (host, requirement) tuples:
\[d_{Jac}(i,j) = 1 - \frac{ |S_{i} \cap S_{j}| } { |S_{i} \cup S_{j}| }\,\, \text{where } S_{k} = \{(l, v_{kl}) : l \neq k \}\]
We will use the $L_{1}$ distance, masked_l1, for it reflects the ordering of the categories. It is then fed into a hierarchical Ward clustering.
def masked_l1(v1, v2, mask_value = -1):
    """
    Calculates the L1 distance between two arrays where all elements of the specified value are masked.
    Parameters:
        v1, v2 (np.ndarray) : arrays to compare
        mask_value ({int, float}) : value to ignore
    Returns:
        l1 (float) : L1 distance
    """
    mask = (v1 != mask_value) & (v2 != mask_value)
    l1 = np.sum(np.abs(v1[mask] - v2[mask]))
    return l1
d_l1m = scd.squareform(scd.pdist(V, lambda x, y: masked_l1(x, y)))
Z_l1m = linkage(d_l1m, 'ward')
The histogram of the distances suggests there are at least three clusters ($3 \cdot 2 / 2$ inter cluster peaks).
The silhouette index assumes its maximum at the topmost merge. However, we choose to add one more split, which is justified by the density plot above and the large gap between the second and third merges (between 3 and 4 clusters).
label_gen = (linkage_to_labels(Z_l1m, n) for n in range(2,20))
score_gen = (silhouette_score(d_l1m, label, metric = 'precomputed') for label in label_gen)
scores = np.fromiter(score_gen, dtype = float)
Again, a geopolitical segmentation is readily observed, most strikingly in the cases of the post Soviet and African states.