Overview
Spatial thinning is widely used across many disciplines to reduce the density of spatial point data by selectively removing points based on specified criteria, such as a threshold spacing or grid cells. This technique is applied in fields like remote sensing, where it improves data visualization and processing, as well as in geostatistics and spatial ecological modeling. Among these, species distribution modeling (SDM) is a key area where thinning plays a crucial role in mitigating the effects of spatial sampling bias.
When working with large-scale datasets containing hundreds of thousands of points, applying thinning manually is impractical. While existing tools in R provide thinning methods for SDM, they often lack scalability and computational efficiency. GeoThinneR provides an efficient, scalable, and user-friendly approach to spatial thinning, implementing optimized algorithms based on kd-trees to accelerate neighbor searches. This vignette demonstrates how to get started with GeoThinneR and how to apply the different spatial thinning methods.
Setup and loading data
To get started with GeoThinneR, we will also load the following packages:
- terra: for working with raster and spatial data.
- sf: for handling spatial polygons.
- ggplot2: for visualizing the thinning process.
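The corresponding setup chunk is a standard one (assuming GeoThinneR and the packages above are installed):

```r
library(GeoThinneR)
library(terra)
library(sf)
library(ggplot2)
```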
We’ll use two datasets to demonstrate GeoThinneR: (i) a simulated dataset with n = 2000 occurrence points randomly distributed within a 40 x 20 degree area for two species, and (ii) a real-world dataset containing a subset of loggerhead sea turtle (Caretta caretta) occurrences in the Mediterranean Sea.
# Set seed for reproducibility
set.seed(123)
# Simulated dataset
n <- 2000
sim_data <- data.frame(
lon = runif(n, min = -10, max = 10),
lat = runif(n, min = -5, max = 5),
sp = sample(c("sp1", "sp2"), n, replace = TRUE)
)
# Load real-world occurrence dataset
data("caretta")
# Load Mediterranean Sea polygon
medit <- system.file("extdata", "mediterranean_sea.gpkg", package = "GeoThinneR")
medit <- sf::st_read(medit)
Quick start guide
Now that our data is loaded, let’s use GeoThinneR for spatial thinning. You can use GeoThinneR via the wrapper function thin_points(), which requires, at a minimum, a dataset containing georeferenced points in matrix, data.frame, or tibble format. The query dataset is provided through the data argument, and the lon_col and lat_col parameters define the column names containing the longitude and latitude (or x, y) coordinates. If you don’t specify any columns, the first two columns are used by default.
The thinning method is defined using the method parameter, with the following options:
- "distance": Distance-based thinning
- "grid": Grid-based thinning
- "precision": Precision-based thinning
The thinning constraint (distance, grid resolution, or decimal precision) depends on the chosen method; these are detailed in the next section. Regardless of the method, the algorithm selects which points to retain and which to remove. Since this selection is non-deterministic, you can run multiple randomized thinning trials to maximize the number of retained points.
In this example, we run 5 iterations (trials) and return the results of all trials (all_trials = TRUE). In the following code snippet, we apply distance-based thinning using a distance threshold of 20 km (thin_dist).
# Apply spatial thinning to the simulated data
thin_sim_data <- thin_points(
data = sim_data, # Dataframe with coordinates
lon_col = "lon", # Longitude column name
lat_col = "lat", # Latitude column name
method = "distance", # Method for thinning
thin_dist = 20, # Thinning distance in km
trials = 5, # Number of trials
all_trials = TRUE, # Return all trials
seed = 123 # Seed for reproducibility
)
The output is a list of thinned datasets, one per trial. From the 2000 simulated points, each trial retains between 1385 and 1391 points separated by at least 20 km; the small differences between trials come from the randomness of the algorithm when selecting which points to retain. You can use seed for reproducibility. By setting all_trials = FALSE, only the trial that kept the most points is returned.
# Number of kept points in each trial
sapply(thin_sim_data, nrow)
#> [1] 1387 1385 1388 1391 1387
You can access each thinned dataset by indexing the list, and you can match retained points back to the original data using their row names.
head(thin_sim_data[[1]]) # Access the first trial's thinned dataset
#> lon lat sp
#> 1 -4.2484496 -3.4032602 sp2
#> 3 -1.8204616 -3.5081961 sp2
#> 4 7.6603481 0.1443426 sp1
#> 6 -9.0888700 1.1634277 sp1
#> 7 0.5621098 -0.5257711 sp1
#> 8 7.8483809 -4.4432328 sp1
head(sim_data[rownames(thin_sim_data[[1]]), ]) # Same rows in the original data
#> lon lat sp
#> 1 -4.2484496 -3.4032602 sp2
#> 3 -1.8204616 -3.5081961 sp2
#> 4 7.6603481 0.1443426 sp1
#> 6 -9.0888700 1.1634277 sp1
#> 7 0.5621098 -0.5257711 sp1
#> 8 7.8483809 -4.4432328 sp1
We now apply a 30 km thinning distance to the real-world dataset, returning only the best trial (the one with the maximum number of retained points, all_trials set to FALSE).
# Apply spatial thinning to the real data
thin_real_data <- thin_points(
data = caretta,
lon_col = "decimalLongitude", # Longitude column in the caretta dataset
lat_col = "decimalLatitude", # Latitude column
method = "distance",
thin_dist = 30, # Thinning distance in km
trials = 5,
all_trials = FALSE,
seed = 123
)
# Thinned dataframe stored in the first element of the output list
dim(thin_real_data[[1]])
#> [1] 499 5
As you can see, with GeoThinneR it is very easy to apply spatial thinning to your dataset. In the next section we explain the different thinning methods and how to use them, as well as the optimizations implemented for large-scale datasets.
Thinning methods
In this section, we will show how to use each thinning method, highlighting its unique parameters and features.
Distance-based thinning
Distance-based thinning (method = "distance") ensures that all retained points are separated by at least a user-defined minimum distance, thin_dist (in km). For geographic coordinates, distance = "haversine" (the default) computes great-circle distances; alternatively, depending on the search method, the Euclidean distance between 3D Cartesian coordinates is used to identify neighboring points within the distance threshold. The algorithm iterates through the dataset and removes points that are too close to each other. The trials parameter allows multiple iterations of the thinning process.
In general, the choice of search method depends on the size of the dataset and the desired thinning distance:
- search_type = "brute": useful for small datasets; low scalability.
- search_type = "kd_tree": more memory-efficient; better scalability for medium-sized datasets.
- search_type = "local_kd_tree": faster than "kd_tree" and reduces memory usage; can be parallelized to improve speed.
- search_type = "k_estimation": very fast when the number of points to remove is low or the points are not very clustered; otherwise similar to or worse than "local_kd_tree".
You can find the benchmarking results for each method in the package manuscript (see references).
Brute force
This is the most straightforward greedy approach (search_type = "brute"): it directly computes all pairwise distances and retains points that are far enough apart based on the thinning distance. By default, it computes the Haversine distance using the rdist.earth() function from the fields package; if distance = "euclidean", it computes the Euclidean distance instead. The main advantage of this method is that it computes the exact full distance matrix between your points. The main disadvantage is that it is very time- and memory-consuming, as it requires all pairwise comparisons.
thin_sim_data <- thin_points(
data = sim_data,
method = "distance",
search_type = "brute",
thin_dist = 20,
trials = 50,
all_trials = FALSE,
seed = 123
)
nrow(thin_sim_data[[1]])
#> [1] 1391
Traditional kd-tree
A kd-tree is a binary tree that partitions space into
regions to optimize nearest neighbor searches. The
search_type="kd_tree"
method uses the
nabor R package to implement kd-trees via the
libnabo library. This method is more memory-efficient
than brute force. Since this implementation of kd-trees is
based on Euclidean distance, when working with geographic coordinates
(distance = "haversine"
), GeoThinneR
transforms the coordinates into XYZ Cartesian coordinates. Although this
method scales better than the brute-force method for large datasets,
since we want to identify all neighboring points of each query point,
its performance can still be improved for very large datasets.
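The coordinate transformation can be sketched in base R (a sketch under simplifying assumptions, not the package's internal code; it assumes a spherical Earth of radius about 6371 km):

```r
# Convert lon/lat (degrees) to 3D Cartesian coordinates so that
# Euclidean kd-tree searches approximate great-circle neighborhoods.
lonlat_to_xyz <- function(lon, lat, r = 6371) {
  lon <- lon * pi / 180
  lat <- lat * pi / 180
  cbind(
    x = r * cos(lat) * cos(lon),
    y = r * cos(lat) * sin(lon),
    z = r * sin(lat)
  )
}

# Two points one degree of latitude apart (~111 km on a great circle)
xyz <- lonlat_to_xyz(c(0, 0), c(0, 1))
chord_km <- sqrt(sum((xyz[1, ] - xyz[2, ])^2))
# For small separations the 3D chord length is very close to the
# great-circle distance, so a Euclidean radius works as a threshold.
```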
thin_sim_data <- thin_points(
data = sim_data,
method = "distance",
search_type = "kd_tree",
thin_dist = 20,
trials = 50,
all_trials = FALSE,
seed = 123
)
nrow(thin_sim_data[[1]])
#> [1] 1391
Local kd-trees
To reduce the complexity of searching within a single huge
kd-tree containing all points from the dataset, we propose a
simple modification: creating small local kd-trees by dividing
the spatial domain into multiple smaller regions, each with its own tree
(search_type="local_kd_tree"
). This notably reduces memory
usage, allowing this thinning method to be used for very large datasets
on personal computers. Additionally, you can run in parallel this method
to improve speed using the n_cores
parameter.
thin_sim_data <- thin_points(
data = sim_data,
method = "distance",
search_type = "local_kd_tree",
thin_dist = 20,
trials = 50,
all_trials = FALSE,
seed = 123
)
nrow(thin_sim_data[[1]])
#> [1] 1391
Restricted neighbor search
Another way to reduce the computational complexity of
kd-tree neighbor searches is by limiting the number of
neighbors to find. In the nabor::knn()
function, we can
specify the number of neighbors to search for (k
). For
spatial thinning, we need to identify all neighbors for each point,
which increases computational cost by setting k
to the
total number of points (n
). However, with the
search_type="k_estimation"
method, we can estimate the
maximum number of neighbors a point can have based on the spatial
density of points, thereby reducing the k
parameter. This
method is particularly useful when datasets are not highly clustered, as
it estimates a smaller k
value, significantly reducing
computational complexity.
thin_sim_data <- thin_points(
data = sim_data,
method = "distance",
search_type = "k_estimation",
thin_dist = 20,
trials = 50,
all_trials = FALSE,
seed = 123
)
nrow(thin_sim_data[[1]])
#> [1] 1391
Grid-based thinning
Grid sampling is a standard method where the area is divided into a
grid, and n
points are sampled from each grid cell. This
method is very fast and memory-efficient. The only drawback is that you
cannot ensure a minimum distance between points. There are two main ways
to apply grid sampling in GeoThinneR: (i) Define the
characteristics of the grid, or (ii) pass your own grid as a raster
(terra::SpatRaster
).
For the first approach, you can use the thin_dist parameter to define the grid cell size (the distance in km is approximated in degrees, with one degree roughly 111.32 km), or you can pass the resolution of the grid directly (e.g., resolution = 0.25 for 0.25 x 0.25-degree cells). If you want to align the grid with external data or covariate layers, you can pass the origin argument as a pair of values (e.g., c(0, 0)). Similarly, you can specify the coordinate reference system (CRS) of your grid (crs).
system.time(
thin_sim_data <- thin_points(
data = sim_data,
method = "grid",
resolution = 1,
origin = c(0, 0),
crs = "epsg:4326",
n = 1, # one point per grid cell
trials = 50,
all_trials = FALSE,
seed = 123
))
#> user system elapsed
#> 0.015 0.001 0.014
nrow(thin_sim_data[[1]])
#> [1] 200
Alternatively, you can pass a terra::SpatRaster object, and that grid will be used for the thinning process.
rast_obj <- terra::rast(xmin = -10, xmax = 10, ymin = -5, ymax = 5, res = 1)
system.time(
thin_sim_data <- thin_points(
data = sim_data,
method = "grid",
raster_obj = rast_obj,
trials = 50,
n = 1,
all_trials = FALSE,
seed = 123
))
#> user system elapsed
#> 0.005 0.000 0.002
nrow(thin_sim_data[[1]])
#> [1] 200
Precision thinning
In this approach, coordinates are rounded to a certain precision to
remove points that fall too close together. After removing points based
on coordinate precision, the coordinate values are restored to their
original locations. This is the simplest method and is more of a
filtering when working with data from different sources with varying
coordinate precisions. To use it, you need to define the
precision
parameter, indicating the number of decimals to
which the coordinates should be rounded.
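Conceptually, the approach amounts to something like this base-R sketch (an illustration only, not the package's internal code):

```r
# Toy coordinates; points 1 and 2 coincide once rounded to 0 decimals
lon <- c(1.23, 1.49, 5.81)
lat <- c(2.10, 2.31, 7.02)

# Round to precision = 0 and keep the first point in each rounded
# location; retained points keep their original coordinates
rounded <- paste(round(lon, 0), round(lat, 0))
keep <- !duplicated(rounded)
data.frame(lon, lat)[keep, ] # rows 1 and 3 are retained
```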
system.time(
thin_sim_data <- thin_points(
data = sim_data,
method = "precision",
precision = 0,
trials = 50,
all_trials = FALSE,
seed = 123
))
#> user system elapsed
#> 0.003 0.000 0.003
nrow(thin_sim_data[[1]])
#> [1] 230
These are the methods implemented in GeoThinneR. Depending on your specific dataset and research needs, one method may be more suitable than others.
Additional customization features
Spatial thinning by group
In some cases, your dataset may include different groups, such as
species, time periods, areas, or conditions, that you want to thin
independently. The group_col
parameter allows you to
specify the column containing the grouping factor, and the thinning will
be performed separately for each group. For example, in the simulated
data where we have two species, we can use this parameter to thin each
species independently:
thin_sim_data <- thin_points(
data = sim_data,
thin_dist = 100,
seed = 123
)
thin_sim_data_group <- thin_points(
data = sim_data,
group_col = "sp",
thin_dist = 100,
seed = 123
)
nrow(thin_sim_data[[1]])
#> [1] 167
nrow(thin_sim_data_group[[1]])
#> [1] 321
Fixed number of points
What if you need to retain a fixed number of points that best covers
the area where your data points are located, or for class balancing? The
target_points
parameter allows you to specify the number of
points to keep, and the function will return that number of points
spaced as separated as possible. Additionally, you can also set a
thin_dist
parameter so that no points closer than this
distance will be retained. Currently, this approach is only implemented
using the brute force method (method = "distance"
and
search_type = "brute"
), so be cautious when applying it to
very large datasets.
thin_real_data <- thin_points(
data = caretta,
lon_col = "decimalLongitude",
lat_col = "decimalLatitude",
target_points = 150,
thin_dist = 30,
all_trials = FALSE,
seed = 123,
verbose = TRUE
)
#> Starting spatial thinning at 2025-03-27 19:20:57
#> Thinning using method: distance
#> For specific target points, brute force method is used.
#> Thinning process completed.
#> Total execution time: 3.09 seconds
nrow(thin_real_data[[1]])
#> [1] 150
Select points by priority
In some scenarios, you may want to prioritize certain points based on a specific criterion, such as uncertainty, data quality, or any other variable. The priority parameter allows you to pass a vector representing the priority of each point. Currently, this feature can be used with the "grid" and "precision" methods. For example, the sea turtle data downloaded from GBIF include a column named coordinateUncertaintyInMeters. We can use it to prioritize points with lower uncertainty within each grid cell (in the grid method) or when rounding coordinates (in the precision method). Keep in mind that larger uncertainty values represent lower priority, so we have to reverse these values.
thin_real_data <- thin_points(
data = caretta,
lon_col = "decimalLongitude",
lat_col = "decimalLatitude",
method = "grid",
resolution = 5,
seed = 123
)
# Subtract from the maximum: the highest uncertainty becomes the lowest priority and vice versa
priority <- max(caretta$coordinateUncertaintyInMeters) - caretta$coordinateUncertaintyInMeters
thin_real_data_uncert <- thin_points(
data = caretta,
lon_col = "decimalLongitude",
lat_col = "decimalLatitude",
method = "grid",
resolution = 5,
priority = priority,
seed = 123
)
mean(thin_real_data[[1]]$coordinateUncertaintyInMeters)
#> [1] 36355.38
mean(thin_real_data_uncert[[1]]$coordinateUncertaintyInMeters)
#> [1] 16950.97
Other Packages
The GeoThinneR package was inspired by the work of many others who have developed methods and packages for working with spatial data and thinning techniques. Our goal with GeoThinneR is to offer additional flexibility in method selection and to address specific needs we encountered while using other packages. We would like to mention other tools that may be suitable for your work:
- spThin: The thin() function provides brute-force spatial thinning, maximizing the number of retained points through random iterative repetitions, using the Haversine distance.
- enmSdmX: Includes the geoThin() function, which calculates all pairwise distances between points for thinning.
- dismo: The gridSample() function samples points using a grid as stratification, providing an efficient method for spatial thinning.
References
Aiello‐Lammens, M. E., Boria, R. A., Radosavljevic, A., Vilela, B., and Anderson, R. P. (2015). spThin: an R package for spatial thinning of species occurrence records for use in ecological niche models. Ecography, 38(5), 541-545.
Elseberg, J., Magnenat, S., Siegwart, R., & Nüchter, A. (2012). Comparison of nearest-neighbor-search strategies and implementations for efficient shape registration. Journal of Software Engineering for Robotics, 3(1), 2-12.
Hijmans, R.J., Phillips, S., Leathwick, J., & Elith, J. (2023). dismo: Species Distribution Modeling. R Package Version 1.3-14. https://cran.r-project.org/package=dismo
Mestre-Tomás J (2025). GeoThinneR: Simple Spatial Thinning for Ecological and Spatial Analysis. R package version 1.1.0, https://github.com/jmestret/GeoThinneR
Smith, A. B., Murphy, S. J., Henderson, D., & Erickson, K. D. (2023). Including imprecisely georeferenced specimens improves accuracy of species distribution models and estimates of niche breadth. Global Ecology and Biogeography, 32(3), 342-355.
Session info
sessionInfo()
#> R version 4.4.3 (2025-02-28)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.5.1 sf_1.0-20 terra_1.8-29 GeoThinneR_2.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.9 class_7.3-23 KernSmooth_2.23-26 digest_0.6.37
#> [5] magrittr_2.0.3 evaluate_1.0.3 grid_4.4.3 iterators_1.0.14
#> [9] fastmap_1.2.0 maps_3.4.2.1 foreach_1.5.2 jsonlite_1.9.1
#> [13] e1071_1.7-16 DBI_1.2.3 spam_2.11-1 viridisLite_0.4.2
#> [17] scales_1.3.0 codetools_0.2-20 textshaping_1.0.0 jquerylib_0.1.4
#> [21] cli_3.6.4 rlang_1.1.5 units_0.8-7 munsell_0.5.1
#> [25] withr_3.0.2 cachem_1.1.0 yaml_2.3.10 tools_4.4.3
#> [29] colorspace_2.1-1 vctrs_0.6.5 R6_2.6.1 matrixStats_1.5.0
#> [33] proxy_0.4-27 lifecycle_1.0.4 classInt_0.4-11 nabor_0.5.0
#> [37] fs_1.6.5 ragg_1.3.3 pkgconfig_2.0.3 desc_1.4.3
#> [41] pkgdown_2.1.1 pillar_1.10.1 bslib_0.9.0 gtable_0.3.6
#> [45] glue_1.8.0 data.table_1.17.0 Rcpp_1.0.14 fields_16.3.1
#> [49] systemfonts_1.2.1 xfun_0.51 tibble_3.2.1 knitr_1.50
#> [53] farver_2.1.2 htmltools_0.5.8.1 rmarkdown_2.29 labeling_0.4.3
#> [57] dotCall64_1.2 compiler_4.4.3