Biomes 2: Climatic analysis
Introduction
I finally got around to following up on the Southern African biomes. I have accumulated quite a few maps of the region but I have since realised that most of the descriptions are based on vegetation structure. This is also why, for example, some of these maps recognise a very large number of units (i.e. not at the biome or bioregional scale). I considered whether it may be possible to use these as base units for larger units but this would seem to put the cart before the horse as one would in essence be assigning biomes that one already has in mind.
As such I thought it may be possible to come from the other direction and see what kind of latent structures are present in the climatic description of Southern Africa. There are obvious bioclimatic factors such as temperature and rainfall, which are distributed across the region and it may be possible to find areas which can be grouped together on this basis.
Study area
The area was bounded in the north at 14°S (latitude -14), as this was the limit I was interested in for a different project, but it does correspond well with most concepts of Southern Africa, which often terminate at the Cunene and the Zambezi. Our area extends a bit further north than that, which we can consider a useful buffer zone: it makes sense not to have only a few points on the borders. The figure below shows some of the key parameters over this area. I added labels for some locations which are useful for understanding the extents of the groups.
In all the "maps" I will present I have just used seaborn's scatterplot with the x and y coords, along with the relevant parameter as the hue/colour. It's worth noting that this is not a sophisticated "projection" but it works well enough at this scale and avoids needing to treat this as a GIS problem.
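As a rough illustration of this kind of "map", here is a minimal sketch. The column names (lon, lat, annual_precip), the helper name and the random data are placeholders of mine, not the actual dataset or plotting code.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def plot_parameter_map(df, value_col, ax=None):
    """Scatter lon/lat points coloured by one parameter (no map projection)."""
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 6))
    sns.scatterplot(
        data=df, x="lon", y="lat", hue=value_col,
        palette="viridis", s=4, linewidth=0, legend=False, ax=ax,
    )
    ax.set_aspect("equal")  # one degree of lon drawn the same length as one of lat
    ax.set_title(value_col)
    return ax


# Hypothetical points scattered over a box roughly covering the study area.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "lon": rng.uniform(11, 41, 500),
    "lat": rng.uniform(-35, -14, 500),
})
demo["annual_precip"] = rng.gamma(2.0, 300.0, 500)
ax = plot_parameter_map(demo, "annual_precip")
```

With a dense regular grid of points the markers merge into something that reads like a raster, which is why no GIS tooling is needed at this scale.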
Data
Data were mostly derived from WorldClim's (WC) global dataset of climatic and bioclimatic variables, but a few others were added to capture particular conditions which I considered might be important discriminators. The data, generally at a native resolution of 30 arc-seconds, were sampled at 5 arc-minute resolution, resulting in around 56k points. The full list of parameters and sources:
Temperature
- Annual Mean Temperature (WC)
- Mean Diurnal Range (WC)
- Isothermality (WC)
- Temperature Seasonality (WC)
- Max Temperature of Warmest Month (WC)
- Min Temperature of Coldest Month (WC)
- Temperature Annual Range (WC)
- Mean Temperature of Wettest Quarter (WC)
- Mean Temperature of Driest Quarter (WC)
- Mean Temperature of Warmest Quarter (WC)
- Mean Temperature of Coldest Quarter (WC)
Precipitation
- Annual Precipitation (WC)
- Precipitation of Wettest Month (WC)
- Precipitation of Driest Month (WC)
- Precipitation Seasonality (WC)
- Precipitation of Wettest Quarter (WC)
- Precipitation of Driest Quarter (WC)
- Precipitation of Warmest Quarter (WC)
- Precipitation of Coldest Quarter (WC)
- Winter Precipitation Concentration (calculated based on WC)
- Spring Precipitation Concentration (calculated based on WC)
- Summer Precipitation Concentration (calculated based on WC)
- Autumn Precipitation Concentration (calculated based on WC)
Other
- Elevation (WC) [It was not 100% clear to me that this is an appropriate parameter. Elevation affects temperature but is only one of the factors in that regard. This was not used in the analysis going forward]
- Evapotranspiration (Zomer et al. 2022a)
- Aridity Index (Zomer et al. 2022b)
- Soil Moisture Stress Annual Average (calculated based on Trabucco et al. 2019)
- Soil Moisture Stress Winter Average (calculated based on Trabucco et al. 2019)
- Soil Moisture Stress Summer Average (calculated based on Trabucco et al. 2019)
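The seasonal precipitation concentration entries in the list above are derived quantities, and their exact definition is not spelled out here. One plausible reading, which is purely an assumption on my part, is the share of annual precipitation that falls in each three-month season, using southern-hemisphere seasons (summer = Dec-Feb and so on). A minimal sketch under that assumption, with a hypothetical function name and made-up monthly totals:

```python
import numpy as np

# Assumed southern-hemisphere seasons as calendar month numbers (1 = January).
SEASON_MONTHS = {
    "summer": (12, 1, 2),
    "autumn": (3, 4, 5),
    "winter": (6, 7, 8),
    "spring": (9, 10, 11),
}


def seasonal_concentration(monthly_precip):
    """Fraction of annual precipitation falling in each season.

    monthly_precip: array of shape (n_points, 12), January first.
    Points with zero annual precipitation get a share of 0 in every season.
    """
    monthly = np.asarray(monthly_precip, dtype=float)
    annual = monthly.sum(axis=1)
    shares = {}
    for season, months in SEASON_MONTHS.items():
        idx = [m - 1 for m in months]
        seasonal = monthly[:, idx].sum(axis=1)
        shares[season] = np.divide(
            seasonal, annual, out=np.zeros_like(annual), where=annual > 0
        )
    return shares


# Three made-up points: pure winter rainfall, even rainfall, no rainfall.
demo = np.array([
    [0, 0, 0, 0, 0, 10, 20, 30, 0, 0, 0, 0],
    [10] * 12,
    [0] * 12,
])
shares = seasonal_concentration(demo)
```

Under this definition a pure winter-rainfall point scores 1 for winter, and a point with evenly spread rainfall scores 0.25 in every season.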
Analysis
All analysis was done using Scikit-learn, with other utilities such as pandas, matplotlib and seaborn as mentioned.
Processing
Input data were scaled using RobustScaler. Initially StandardScaler was used but some parameters are not well distributed, particularly those relating to rainfall (i.e. many values are lumped close to zero with a long tail). RobustScaler is intended to handle this kind of outlier better than StandardScaler, as it centres on the median and scales by the interquartile range rather than the mean and standard deviation.
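To make the difference concrete, here is a small sketch on a synthetic rainfall-like variable. The data are invented for illustration and are not taken from the actual inputs.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

rng = np.random.default_rng(0)

# Hypothetical skewed variable: most values small, plus a few extreme ones.
rain = rng.exponential(scale=20.0, size=(1000, 1))
rain[:5] = 2000.0

robust = RobustScaler().fit_transform(rain)      # (x - median) / IQR
standard = StandardScaler().fit_transform(rain)  # (x - mean) / std

# Spread of the bulk of the data (everything except the 5 extremes).
bulk_spread_robust = np.ptp(robust[5:])
bulk_spread_standard = np.ptp(standard[5:])
```

Because the few extreme values inflate the standard deviation, StandardScaler squeezes the bulk of the points into a much narrower range than RobustScaler does, which leaves less contrast for a distance-based clustering to work with.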
PCA was used to reduce the parameter space. A threshold of 90% explained variance was used, corresponding to 5 components (i.e. a big reduction from ~30).
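A sketch of how such a variance threshold can be applied in scikit-learn: passing a float between 0 and 1 as n_components asks PCA to keep just enough components to exceed that share of the variance. The input here is synthetic (28 correlated columns built from 5 hidden factors), chosen only so that the example behaves similarly to what is described above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the climate table: 28 correlated columns
# generated from 5 latent factors plus a little noise.
latent = rng.normal(size=(2000, 5))
mixing = rng.normal(size=(5, 28))
X = latent @ mixing + 0.1 * rng.normal(size=(2000, 28))

X_scaled = RobustScaler().fit_transform(X)

# A float n_components keeps the fewest components whose cumulative
# explained variance exceeds the threshold.
pca = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print(pca.n_components_, pca.explained_variance_ratio_.cumsum())
```

The fitted pca.components_ array gives the loading of each original parameter on each component, which is one way to see which variables dominate the reduced space.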
Clustering
As I am approaching this as an unsupervised problem, I wanted to look also at how some clustering algorithms may differ in their outputs. I also wanted to make sure I could specify the number of clusters as a model parameter, to see how varying this would compare. The following clustering algorithms were used: Kmeans, Bisecting Kmeans, Gaussian Mixtures and Agglomerative Clustering (with Ward linkage).
The main hyperparameter for all of these is the number of clusters, but Agglomerative Clustering additionally requires a nearest-neighbours graph to run efficiently, and the number of neighbours is then another parameter. In this case it was set to 50 but it did not have a major effect. Too large a value would result in insufficient memory (56k points...) but IIRC values between 10 and 100 all returned similar results.
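A sketch of the neighbours-graph setup, using scikit-learn's kneighbors_graph to build a sparse connectivity matrix for Ward clustering. The random points are placeholders, and the sample size is kept small so the example runs quickly.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))  # placeholder for the PCA-reduced points

# Sparse graph linking each point to its 50 nearest neighbours. Merges are
# then only considered between connected points, which avoids working with
# the full n x n set of pairs.
connectivity = kneighbors_graph(X, n_neighbors=50, include_self=False)

model = AgglomerativeClustering(
    n_clusters=9, linkage="ward", connectivity=connectivity
)
labels = model.fit_predict(X)
```

The graph stores n_samples x n_neighbors entries, so its size grows linearly with the number of neighbours, which is consistent with very large values becoming a memory problem.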
In each case the number of clusters was varied within [9, 12, 15, 18, 20, 30, 40, 50]. 9 is the number of biomes defined by Mucina and Rutherford for South Africa and I wanted this to be the minimum. 50 ends up with clusters closer to a bioregional scale. In the interest of brevity I will focus on the Kmeans results between 9 and 18 but in some cases I look among the methods.
Kmeans
I thought it might be interesting to look through at least one of the maps in detail. As an aside, calling these biomes is poetic licence as they are actually just clusters in the parameter space. The 9-cluster case identifies the following "biomes":
- 0 - A coastal strip, I suppose a mega Indian Ocean Coastal Belt but larger in spatial scale and extending further south.
- 1 - The Zimbabwean, Zambian and Angolan Miombo.
- 2 - A Kalahari band stretching from Windhoek to around Serowe and sputtering out beyond it.
- 3 - Mesic grasslands of Lesotho, KZN etc. with outliers at the border of Zimbabwe and Mozambique (Manica highlands - these islands are stable in many model runs).
- 4 - A Cape area comprising the better part of the Fynbos and Succulent Karoo.
- 5 - An area which corresponds to the Mopane-dominated areas in SA and Zimbabwe (both N and S) but I am not sure about what happens in between. Mopane does occur in Northern Namibia again. Allusions to vegetation linkages may be hallucinatory as such concepts play no role in this assignment, but it is interesting to note such overlaps.
- 6 - Namib desert.
- 7 - Lowveld (with Moz extensions).
- 8 - Nama Karoo. Note the extension into dry grassland around Bloemfontein. This is also a stable assignment and interestingly touches on an old phytogeographic question around the borders of the Nama Karoo and the grassland. Rutherford and Westfall make a point to identify this as "Grassland invaded by Karoo" but I note that this may now be considered an ecotonal area - from Mucina 2023 (Biomes of the Southern Hemisphere): "Mucina et al. (2023) [Biomes of Southern Africa, in press] disintegrated the heterogenous Grassland Biome and assigned the bioregion Drakensberg Grassland to the zonobiome A2 (Subtropical Alpine Biome), while the remaining bioregions were recognised as ecotonal. Mesic Highveld Grassland and Sub-Escarpment Grassland were to form the zonoecotone E2–T3, while the Dry Highveld Grassland was recognised as a zonoecotone E2–S2, hence a transitional biome straddling the zonobiomes E2 (arid savanna) and S2 (semidesert)." (my emphasis)
Very interesting here is the apparent distinctness of a number of units otherwise classified as Savanna - units 1, 2, 5 and 7. At the same time a number of biomes easily differentiated on vegetation are not segregated on these climatic parameters. Fynbos and Succulent Karoo are grouped, whilst the Albany Thicket sits somewhere on the border of a number of other zones. Adding more clusters to try and prise these out does not seem to "bring out" the Thicket but does tend to separate the core Fynbos from the Succulent Karoo. The Indian Ocean belt is also reliably split into northern and southern parts.
Contingency
It is possible to compare the overlap between the different methods using a contingency matrix:
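A minimal sketch of building such a matrix with scikit-learn's contingency_matrix, on a made-up pair of label vectors in which one model splits a unit that the other keeps whole. The label values are invented for illustration only.

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

# Hypothetical labels for 9 points from two models. The first model
# separates units 3 and 8; the second lumps them together as unit 3.
kmeans_labels = np.array([3, 3, 3, 8, 8, 8, 8, 1, 1])
bikmeans_labels = np.array([3, 3, 3, 3, 3, 3, 3, 1, 1])

# Rows follow the sorted unique labels of the first argument (1, 3, 8),
# columns those of the second (1, 3).
counts = contingency_matrix(kmeans_labels, bikmeans_labels)

# Normalising by row or by column answers two different questions,
# which is why the direction of the comparison matters.
row_share = counts / counts.sum(axis=1, keepdims=True)  # where does each row unit go?
col_share = counts / counts.sum(axis=0, keepdims=True)  # what is each column unit made of?
```

Drawn as a heatmap with one model on each axis, a table like this can be read directly.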
A darker colour at an intersection corresponds to a high degree of similar assignment but the direction of the comparison does matter. For example, Kmeans_3 only really overlaps with Bikmeans_3 but Bikmeans_3 is clearly a larger unit as it also overlaps with Kmeans_8. Referring to the figure above this implies that Bikmeans_3 must be a combined Nama Karoo and Grassland-type area, which it is. There is generally a decent correspondence between these units, with most only having one or two high-ranking intersections and not too much "spreading" of the overlap. I think (and with the benefit of being able to look at similar maps as above for the other methods) that there is general agreement among these on what are similar units but that the scales are not always grouped the same. I could not think of a very eloquent way to generalise the agreement between the models but I did try:
Similarity
A very rough attempt was made to quantify the degree of overlap on a per-point basis. The approach essentially comes down to counting the number of identical assignments between the 4 models and scaling this as a rough "degree of similarity". There may be more effective set-theory type methods but most of what I found related to the similarity of the clusters and not about metrics which apply to the samples.
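One way such a per-point count could be implemented is sketched below. Since cluster ids are arbitrary in each model, the sketch first aligns every model's ids to a reference model by one-to-one matching on the contingency matrix (scipy's linear_sum_assignment), then scores each point by the fraction of model pairs that agree. The alignment step and the function names are my assumptions, not necessarily what was done here.

```python
import itertools

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix


def align_labels(reference, labels, unmatched=-1):
    """Rename clusters in `labels` to the ids of their best-matching reference clusters."""
    counts = contingency_matrix(reference, labels)
    ref_ids, lab_ids = np.unique(reference), np.unique(labels)
    rows, cols = linear_sum_assignment(counts, maximize=True)
    mapping = {lab_ids[c]: ref_ids[r] for r, c in zip(rows, cols)}
    # Clusters with no partner get a sentinel so they never count as agreeing.
    return np.array([mapping.get(lab, unmatched) for lab in labels])


def pointwise_agreement(label_sets):
    """Fraction of model pairs giving each point the same (aligned) label."""
    reference = np.asarray(label_sets[0])
    aligned = [reference]
    for i, labels in enumerate(label_sets[1:], start=1):
        aligned.append(align_labels(reference, np.asarray(labels), unmatched=-i))
    pairs = list(itertools.combinations(range(len(aligned)), 2))
    agree = sum((aligned[i] == aligned[j]).astype(int) for i, j in pairs)
    return agree / len(pairs)


# Four made-up models labelling six points (non-negative ids assumed).
a = [0, 0, 1, 1, 2, 2]
b = [1, 1, 0, 0, 2, 2]  # same partition as a, different ids
c = [0, 0, 1, 2, 2, 2]
d = [2, 2, 1, 1, 0, 1]
score = pointwise_agreement([a, b, c, d])
```

Mapping this score at each point's coordinates gives a picture of where the models agree.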
Here, the darker/bluer colours correspond to higher agreement and could be thought of as "core" climatic regions. The lighter/redder areas are the areas with poor agreement and seem to visualise the tension zones between these "core" biomes. Some areas are clearly stable:
- A northern Miombo area
- A northern Kalahari area
- A southern Kalahari area, which bleeds into what is normally considered Nama Karoo
- The Namib desert
- A core Nama Karoo area between Beaufort West and Colesberg (roughly)
- The SA/Moz lowveld
- The eastern grasslands south of JHB/PRE
For interest's sake I have also plotted the distributions of the clusters in the original parameters (although still scaled). This gives a rough idea of what parameters are contributing to the distinctness of each biome/groups of biomes. Again, all these graphs are also produced for the other numbers of clusters and the other methods.
More to be done
This is just a quick report although it did get a bit long. I think I will need to keep digging but this has answered some questions whilst raising a few more. Particularly tricky is the classification of borders. Perhaps the existence of "core" areas can be used to derive classification methods for the other points which can then be scrutinised (e.g. CART).
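As a sketch of what that CART idea might look like: train a shallow decision tree on the "core" points only, then let it assign the remaining border points. Everything here is hypothetical, including the function name, the agreement threshold and the synthetic data, and it has not been tried on the real dataset.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier


def classify_from_core(X, labels, agreement, core_threshold=1.0, max_depth=4):
    """Fit a tree on high-agreement points and relabel the rest with it."""
    core = agreement >= core_threshold
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X[core], labels[core])
    predicted = labels.copy()
    if (~core).any():
        predicted[~core] = tree.predict(X[~core])
    return tree, predicted


# Synthetic stand-in: 3 clusters, with ~20% of points arbitrarily
# flagged as low-agreement "border" points.
X, labels = make_blobs(n_samples=600, centers=3, n_features=5, random_state=0)
rng = np.random.default_rng(0)
agreement = np.where(rng.random(600) < 0.2, 0.5, 1.0)

tree, predicted = classify_from_core(X, labels, agreement)
```

A shallow tree has the advantage that its split rules can be printed (e.g. with sklearn.tree.export_text) and checked against what is known about the region.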
Cheers,
DW