Diapsalmata

Biomes 2: Climatic analysis

Introduction

I finally got around to following up on the Southern African biomes. I have accumulated quite a few maps of the region but I have since realised that most of the descriptions are based on vegetation structure. This is also why, for example, some of these maps recognise a very large number of units (i.e. not at the biome or bioregional scale). I considered whether it may be possible to use these as base units for larger units but this would seem to put the cart before the horse as one would in essence be assigning biomes that one already has in mind.

As such I thought it may be possible to come from the other direction and see what kind of latent structures are present in the climatic description of Southern Africa. There are obvious bioclimatic factors such as temperature and rainfall which vary across the region, and it may be possible to find areas which can be grouped together on this basis.

Study area

The area was bounded to the north at 14°S, as this was the area I was interested in for a different project, but this does correspond well with most concepts of Southern Africa, which often terminate at the Cunene and the Zambezi. Our area extends a bit further north, which gives a useful buffer zone - it makes sense not to have only a few points on the borders. The figure below shows some of the key parameters over this area. I added labels for some locations which are useful for understanding the extents of the groups.

Map 1 - Study area

In all the "maps" I will present I have just used seaborn's scatterplot with the x and y coords, along with the relevant parameter as the hue/colour. It's worth noting that this is not a sophisticated "projection" but it works well enough at this scale and avoids needing to treat this as a GIS problem.
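A minimal sketch of this kind of "map", assuming the samples sit in a pandas DataFrame with coordinate columns named x and y (hypothetical names) and one column per parameter. The grid and the "rain" values below are made up purely so that the example runs; they are not the WorldClim data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def plot_point_map(df, value_col, x_col="x", y_col="y", ax=None):
    """Scatter the sample points at their coordinates, coloured by one parameter."""
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 6))
    sns.scatterplot(
        data=df, x=x_col, y=y_col, hue=value_col,
        palette="viridis", s=5, linewidth=0, ax=ax,
    )
    # Plain lon/lat axes with equal scaling: no real map projection is applied.
    ax.set_aspect("equal")
    return ax


# Synthetic stand-in grid (assumed extent, not the real sample points).
lon, lat = np.meshgrid(np.linspace(11, 41, 60), np.linspace(-35, -14, 42))
demo = pd.DataFrame({"x": lon.ravel(), "y": lat.ravel()})
demo["rain"] = (demo["x"] - 11) * 30  # made-up west-to-east gradient

ax = plot_point_map(demo, "rain")
```

Calling ax.figure.savefig(...) afterwards writes the figure to disk. The same function works for a cluster label column if it is passed as the hue.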

Data

Data were mostly derived from WorldClim's (WC) global dataset of climatic and bioclimatic variables but a few others were added to capture some particular conditions which I considered might be important discriminators. The data, generally at a native resolution of 30 arc-seconds, were sampled at a resolution of 5 arc-minutes, resulting in around 56k points. The full list of parameters and sources:

Temperature

Precipitation

Other

Analysis

All analysis was done using Scikit-learn, with other utilities such as pandas, matplotlib and seaborn as mentioned.

Processing

Input data were scaled using RobustScaler. Initially StandardScaler was used but some parameters are not well distributed, particularly those relating to rainfall (i.e. many values are lumped close to zero with a long tail). RobustScaler is intended to handle this kind of outlier better than StandardScaler.

PCA was used to reduce the parameter space. A threshold of 90% explained variance was used, which corresponded to 5 components (i.e. a big reduction from ~30).
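The two processing steps can be sketched as a single scikit-learn pipeline. This is a minimal sketch on synthetic data (the array X below is a made-up stand-in for the ~30 climatic parameters), so the number of components it keeps will not match the 5 reported above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
n_points, n_params = 1000, 30

# Synthetic stand-in: correlated columns driven by a few latent factors...
latent = rng.normal(size=(n_points, 5))
X = latent @ rng.normal(size=(5, n_params)) + 0.1 * rng.normal(size=(n_points, n_params))
# ...plus some rainfall-like columns lumped near zero with a long tail.
X[:, :10] = rng.gamma(shape=0.5, scale=20.0, size=(n_points, 10))

# RobustScaler centres on the median and scales by the interquartile range.
# A float n_components asks PCA to keep just enough components to reach
# that share of explained variance.
reducer = make_pipeline(RobustScaler(), PCA(n_components=0.90, svd_solver="full"))
X_reduced = reducer.fit_transform(X)

pca = reducer.named_steps["pca"]
print(pca.n_components_, pca.explained_variance_ratio_.cumsum())
```

Keeping both steps in one pipeline means that any new points are scaled and projected with exactly the same fitted parameters.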

Clustering

As I am approaching this as an unsupervised problem, I wanted to look also at how some clustering algorithms may differ in their outputs. I also wanted to make sure I could specify the number of clusters as a model parameter, to see how varying this would compare. The following clustering algorithms were used: Kmeans, Bisecting Kmeans, Gaussian Mixtures and Agglomerative Clustering (with Ward linkage).

The main hyperparameter for all of these is the number of clusters, but Agglomerative Clustering additionally requires a nearest neighbours graph to run efficiently - the number of neighbours is then another parameter. In this case it was set to 50 but it did not have a major effect. Too large a value would result in insufficient memory (56k points...) but IIRC values between 10 and 100 all returned similar results.

In each case the number of clusters was varied within [9, 12, 15, 18, 20, 30, 40, 50]. 9 is the number of biomes defined by Mucina and Rutherford for South Africa and I wanted this to be the minimum. 50 ends up with clusters closer to a bioregional scale. In the interest of brevity I will focus on the Kmeans results between 9 and 18 but in some cases I compare across the methods.

Kmeans

Map 2 - Kmeans 9

I thought it might be interesting to look through at least one of the maps in detail. As an aside, calling these biomes is poetic licence as they are actually just clusters in the parameter space. The 9 biome case identifies the following "biomes":

Very interesting here is the apparent distinctness of a number of units otherwise classified as Savanna - units 1, 2, 5 and 7. At the same time a number of biomes easily differentiated on vegetation are not segregated on these climatic parameters. Fynbos and Succulent Karoo are grouped whilst the Albany Thicket is somewhere in the border of a number of other zones. Adding more clusters to try and prise these out does not seem to "bring out" the Thicket but does tend to separate the core Fynbos from the Succulent Karoo. The Indian Ocean belt is also reliably split into northern and southern parts.

Contingency

It is possible to compare the overlap between the different methods using a contingency matrix:

Figure 3 - Contingency 9
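A contingency matrix of this kind can be sketched with scikit-learn's contingency_matrix. The two tiny label arrays below are made up for illustration. Normalising by rows or by columns gives the two directions of comparison: the share of each cluster in the first labelling that falls into a cluster of the second, and the reverse.

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

# Made-up assignments of six points by two hypothetical methods.
labels_a = np.array([0, 0, 0, 1, 1, 2])
labels_b = np.array([0, 0, 1, 1, 1, 1])

# Rows follow the clusters of labels_a, columns those of labels_b.
counts = contingency_matrix(labels_a, labels_b)

# Share of each a-cluster that lands in each b-cluster (rows sum to 1).
row_share = counts / counts.sum(axis=1, keepdims=True)
# Share of each b-cluster that comes from each a-cluster (columns sum to 1).
col_share = counts / counts.sum(axis=0, keepdims=True)

print(counts)
print(row_share.round(2))
print(col_share.round(2))
```

In this toy case cluster 1 of the first labelling lies entirely within cluster 1 of the second, but it makes up only half of that larger cluster, so the two normalisations give different pictures of the same overlap.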

A darker colour at an intersection corresponds to a high degree of similar assignment but the direction of the comparison does matter. For example, Kmeans_3 only really overlaps with Bikmeans_3 but Bikmeans_3 is clearly a larger unit as it also overlaps with Kmeans_8. Referring to the figure above this implies that Bikmeans_3 must be a combined Nama Karoo and Grassland-type area, which it is. There is generally a decent correspondence between these units, with most only having one or two high ranking intersections and not too much "spreading" of the overlap. I think (and with the benefit of being able to look at similar maps as above for the other methods) that there is general agreement among these on what are similar units but that the units are not always grouped at the same scale. I could not think of a very eloquent way to generalise the agreement between the models but I did try:

Similarity

A very rough attempt was made to quantify the degree of overlap on a per-point basis. The approach essentially comes down to counting the number of identical assignments between the 4 models and scaling this as a rough "degree of similarity". There may be more effective set-theory type methods but most of what I found related to the similarity of the clusters and not about metrics which apply to the samples.
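One possible way to compute such a per-point index is sketched below. The method is an assumption, since the post does not describe the details: cluster labels are arbitrary in each model, so each model is first relabelled to best match a reference model (Hungarian matching on the contingency matrix), and the agreement is then counted per point. The function names align_labels and similarity_index are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix


def align_labels(reference, labels, unmatched=-1):
    """Relabel `labels` so its clusters best overlap those of `reference`."""
    counts = contingency_matrix(reference, labels)  # rows: reference, cols: labels
    ref_idx, lab_idx = linear_sum_assignment(-counts)  # maximise total overlap
    ref_classes, lab_classes = np.unique(reference), np.unique(labels)
    mapping = {lab_classes[j]: ref_classes[i] for i, j in zip(ref_idx, lab_idx)}
    # Clusters without a partner get the `unmatched` value.
    return np.array([mapping.get(label, unmatched) for label in labels])


def similarity_index(label_sets):
    """Per-point agreement in [0, 1]; the first label set is the reference."""
    reference = np.asarray(label_sets[0])
    aligned = [reference] + [
        # A distinct negative value per model so unmatched clusters never agree.
        align_labels(reference, np.asarray(labels), unmatched=-(k + 1))
        for k, labels in enumerate(label_sets[1:])
    ]
    stacked = np.column_stack(aligned)  # shape (n_points, n_models)
    # Size of the largest group of models giving the same label to the point.
    agree = np.array([np.unique(row, return_counts=True)[1].max() for row in stacked])
    return (agree - 1) / (stacked.shape[1] - 1)


# Made-up example: the second model is the same partition with swapped labels,
# the third disagrees on one point.
sim = similarity_index([
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
])
print(sim)
```

The result depends on which model is used as the reference and on how well the clusters can be paired one-to-one, so it is only a rough indicator.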

Map 4 - Similarity index

Here, the darker/bluer colours correspond to higher agreement and could be thought of as "core" climatic regions. The lighter/redder areas are the areas with poor agreement and seem to visualise the tension zones between these "core" biomes. Some areas are clearly stable:

For interest's sake I have also plotted the distributions of the clusters in the original parameters (although still scaled). This gives a rough idea of what parameters are contributing to the distinctness of each biome/groups of biomes. Again, all these graphs are also produced for the other numbers of clusters and the other methods.

Figure 5 - Boxplots

More to be done

This is just a quick report although it did get a bit long. I think I will need to keep digging but this has answered some questions whilst raising a few more. Particularly tricky is the classification of borders. Perhaps the existence of "core" areas can be used to derive classification methods for the other points which can then be scrutinised (e.g. CART).

Cheers,

DW