
Overview

Spatial thinning is widely used across many disciplines to reduce the density of spatial point data by selectively removing points based on specified criteria, such as a threshold spacing or grid cells. The technique is applied in fields like remote sensing, where it improves data visualization and processing, as well as in geostatistics and spatial ecological modeling. Among these, species distribution modeling (SDM) is a key area where thinning plays a crucial role in mitigating the effects of spatial sampling bias.

When working with large-scale datasets containing hundreds of thousands of points, thinning cannot be applied manually. While existing R tools provide thinning methods for SDM, they often lack scalability and computational efficiency. GeoThinneR provides an efficient, scalable, and user-friendly approach to spatial thinning, implementing optimized algorithms based on kd-trees to accelerate neighbor searches. This vignette demonstrates how to get started with GeoThinneR and how to apply the different spatial thinning methods.

Setup and loading data

To get started with GeoThinneR, we will also load the following packages:

  • terra: for working with raster and spatial data.
  • sf: for handling spatial polygons.
  • ggplot2: for visualizing the thinning process.

We’ll use two datasets to demonstrate the use of GeoThinneR: (i) a simulated dataset with n = 2000 occurrence points for two species, randomly distributed within a 40 x 20 degree area, and (ii) a real-world dataset containing a subset of loggerhead sea turtle (Caretta caretta) occurrences in the Mediterranean Sea.

# Load packages
library(GeoThinneR)
library(terra)
library(sf)
library(ggplot2)

# Set seed for reproducibility
set.seed(123)

# Simulated dataset
n <- 2000
sim_data <- data.frame(
  lon = runif(n, min = -10, max = 10),
  lat = runif(n, min = -5, max = 5),
  sp = sample(c("sp1", "sp2"), n, replace = TRUE)
)

# Load real-world occurrence dataset
data("caretta")

# Load Mediterranean Sea polygon
medit <- system.file("extdata", "mediterranean_sea.gpkg", package = "GeoThinneR")
medit <- sf::st_read(medit)

Quick start guide

Now that our data is loaded, let’s use GeoThinneR for spatial thinning. You can use GeoThinneR via the wrapper function thin_points(), which requires, at a minimum, a dataset containing georeferenced points in matrix, data.frame, or tibble format. The query dataset is provided through the data argument, and with lon_col and lat_col parameters, we can define the column names containing longitude and latitude (or x, y) coordinates. If you don’t specify any columns, it will take the first two columns by default.

The thinning method is defined using the method parameter, with the following options:

  • "distance": Distance-based thinning
  • "grid": Grid-based thinning
  • "precision": Precision-based thinning

The thinning constraint (distance, grid resolution, or decimal precision) depends on the chosen method (these will be detailed in the next section). Regardless of the method, the algorithm selects which points to retain and which to remove. Since this selection is non-deterministic, you can run multiple randomized thinning trials to maximize the number of retained points.

In this example, we run 5 iterations (trials) and return the results of all trials (all_trials = TRUE). In the following code snippet, we apply distance-based thinning using a distance threshold of 20 km (thin_dist).

# Apply spatial thinning to the simulated data
thin_sim_data <- thin_points(
  data = sim_data,     # Dataframe with coordinates
  lon_col = "lon",   # Longitude column name
  lat_col = "lat",     # Latitude column name
  method = "distance", # Method for thinning
  thin_dist = 20,      # Thinning distance in km
  trials = 5,          # Number of trials
  all_trials = TRUE,   # Return all trials
  seed = 123           # Seed for reproducibility
)

The output is a list of thinned datasets, one per trial. From the 2000 simulated points, the best trial retains 1391 points separated by at least 20 km. The number of retained points varies slightly between trials due to the randomness of the algorithm when selecting which points to keep; the seed argument makes the results reproducible. By setting all_trials = FALSE, only the trial that kept the most points is returned.

# Number of retained points in each trial
sapply(thin_sim_data, nrow)
#> [1] 1387 1385 1388 1391 1387

You can access the thinned datasets by indexing them in the list and you can compare them with the original data using the rowname values.

head(thin_sim_data[[1]]) # Accessing the first trial's thinned dataset
#>          lon        lat  sp
#> 1 -4.2484496 -3.4032602 sp2
#> 3 -1.8204616 -3.5081961 sp2
#> 4  7.6603481  0.1443426 sp1
#> 6 -9.0888700  1.1634277 sp1
#> 7  0.5621098 -0.5257711 sp1
#> 8  7.8483809 -4.4432328 sp1
head(sim_data[rownames(thin_sim_data[[1]]), ]) # Retained points matched back to the original data
#>          lon        lat  sp
#> 1 -4.2484496 -3.4032602 sp2
#> 3 -1.8204616 -3.5081961 sp2
#> 4  7.6603481  0.1443426 sp1
#> 6 -9.0888700  1.1634277 sp1
#> 7  0.5621098 -0.5257711 sp1
#> 8  7.8483809 -4.4432328 sp1
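
To confirm that the thinning worked, you can check that no two retained points are closer than the threshold. The sketch below is our own illustrative check, not part of the GeoThinneR API; it uses rdist.earth() from the fields package (the same function GeoThinneR uses for brute-force distances) to compute the minimum pairwise great-circle distance.

# Minimum pairwise great-circle distance (km) among retained points
thinned <- thin_sim_data[[1]]
d <- fields::rdist.earth(as.matrix(thinned[, c("lon", "lat")]), miles = FALSE)
diag(d) <- Inf  # ignore self-distances
min(d)          # should be at least the 20 km thinning distance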

We now apply a 30 km thinning distance to the real-world dataset, returning only the best trial (the one with the maximum number of points retained, all_trials set to FALSE).

# Apply spatial thinning to the real data
thin_real_data <- thin_points(
  data = caretta,                # Real-world occurrence dataset
  lat_col = "decimalLatitude",
  lon_col = "decimalLongitude",
  method = "distance",
  thin_dist = 30, # Thinning distance in km
  trials = 5,
  all_trials = FALSE,
  seed = 123
)

# Thinned dataframe stored in the first element of the output list
dim(thin_real_data[[1]])
#> [1] 499   5

As you can see, with GeoThinneR it’s very easy to apply spatial thinning to your dataset. In the next section we explain the different thinning methods and how to use them, as well as the optimizations implemented for large-scale datasets.

Thinning methods

In this section, we will show how to use each thinning method, highlighting its unique parameters and features.

Distance-based thinning

Distance-based thinning (method="distance") ensures that all retained points are separated by at least a user-defined minimum distance, thin_dist (in km). For geographic coordinates, the default distance = "haversine" computes great-circle distances; alternatively, depending on the search method, the Euclidean distance between 3D Cartesian coordinates is used to identify neighboring points within the distance threshold. The algorithm iterates through the dataset and removes points that are too close to each other, and the trials parameter allows multiple iterations of the thinning process.

In general, the choice of search method depends on the size of the dataset and the desired thinning distance:

  • search_type="brute": Useful for small datasets, low scalability.
  • search_type="kd_tree": More memory-efficient, better scalability for medium-sized datasets.
  • search_type="local_kd_tree": Faster than "kd_tree" and reduces memory usage. Can be parallelized to improve speed.
  • search_type="k_estimation": Very fast when the points to remove are few or not strongly clustered; otherwise similar to or worse than "local_kd_tree".

You can find the benchmarking results for each method in the package manuscript (see references).
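
As a quick way to compare the search methods on your own data, you can time each search_type on the same thinning task. This is an illustrative sketch (timings will vary with dataset size and clustering), using only documented thin_points() parameters:

# Compare wall-clock time of the four search strategies on the simulated data
for (st in c("brute", "kd_tree", "local_kd_tree", "k_estimation")) {
  t <- system.time(
    thin_points(sim_data, method = "distance", search_type = st,
                thin_dist = 20, seed = 123)
  )
  cat(sprintf("%-15s %.3f s\n", st, t["elapsed"]))
}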

Brute force

This is the most straightforward greedy method (search_type="brute"): it directly computes all pairwise distances and retains points that are far enough apart given the thinning distance. By default, it computes Haversine distances using the rdist.earth() function from the fields package; if distance = "euclidean", it computes Euclidean distances instead. The main advantage of this method is that it computes the full distance matrix between your points. The main disadvantage is that it is very time- and memory-consuming, as it requires all pairwise comparisons.

thin_sim_data <- thin_points(
  data = sim_data,
  method = "distance",
  search_type = "brute",
  thin_dist = 20,
  trials = 50,
  all_trials = FALSE,
  seed = 123
)
nrow(thin_sim_data[[1]])
#> [1] 1391

Traditional kd-tree

A kd-tree is a binary tree that partitions space into regions to optimize nearest neighbor searches. The search_type="kd_tree" method uses the nabor R package, which implements kd-trees via the libnabo library. This method is more memory-efficient than brute force. Since this kd-tree implementation is based on Euclidean distance, when working with geographic coordinates (distance = "haversine") GeoThinneR transforms the coordinates into XYZ Cartesian coordinates. Although this method scales better than brute force for large datasets, we still need to identify all neighboring points of each query point, so its performance can be improved further for very large datasets.
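
The lon/lat-to-Cartesian projection mentioned above is the standard geographic conversion, so that Euclidean distances between the projected points approximate great-circle distances. The sketch below is our own illustration (the helper name and the Earth radius value are our choices, not part of the GeoThinneR API):

# Illustrative lon/lat (degrees) -> XYZ Cartesian (km) conversion
lonlat_to_xyz <- function(lon, lat, r = 6371) {  # r: mean Earth radius in km
  lon <- lon * pi / 180
  lat <- lat * pi / 180
  cbind(x = r * cos(lat) * cos(lon),
        y = r * cos(lat) * sin(lon),
        z = r * sin(lat))
}
head(lonlat_to_xyz(sim_data$lon, sim_data$lat))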

thin_sim_data <- thin_points(
  data = sim_data,
  method = "distance",
  search_type = "kd_tree",
  thin_dist = 20,
  trials = 50,
  all_trials = FALSE,
  seed = 123
)
nrow(thin_sim_data[[1]])
#> [1] 1391

Local kd-trees

To reduce the complexity of searching within a single huge kd-tree containing all points from the dataset, we propose a simple modification: creating small local kd-trees by dividing the spatial domain into multiple smaller regions, each with its own tree (search_type="local_kd_tree"). This notably reduces memory usage, allowing this thinning method to be used for very large datasets on personal computers. Additionally, you can run this method in parallel using the n_cores parameter to improve speed.

thin_sim_data <- thin_points(
  data = sim_data,
  method = "distance",
  search_type = "local_kd_tree",
  thin_dist = 20,
  trials = 50,
  all_trials = FALSE,
  seed = 123
)
nrow(thin_sim_data[[1]])
#> [1] 1391

k estimation

Another way to reduce the computational complexity of kd-tree neighbor searches is to limit the number of neighbors searched for. In the nabor::knn() function, we can specify the number of neighbors to return (k). For spatial thinning we need to identify all neighbors of each point, which would require setting k to the total number of points (n) and thus increases computational cost. With the search_type="k_estimation" method, we instead estimate the maximum number of neighbors a point can have from the spatial density of points, thereby reducing the k parameter. This method is particularly useful when datasets are not highly clustered, as it estimates a smaller k value, significantly reducing computational complexity.

thin_sim_data <- thin_points(
  data = sim_data,
  method = "distance",
  search_type = "k_estimation",
  thin_dist = 20,
  trials = 50,
  all_trials = FALSE,
  seed = 123
)
nrow(thin_sim_data[[1]])
#> [1] 1391
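
To see why a much smaller k usually suffices, note that for roughly uniform points the expected number of neighbors within radius r is about n times the ratio of the search-circle area to the total study area. The back-of-the-envelope calculation below is our own illustration, not the package's internal estimator:

# Rough estimate of neighbors within 20 km for the simulated data
n <- nrow(sim_data)
total_area  <- (20 * 111.32) * (10 * 111.32)  # 20 x 10 degree extent in km^2 (approx.)
circle_area <- pi * 20^2                      # 20 km search radius
k_est <- ceiling(n * circle_area / total_area)
k_est                                         # far smaller than n = 2000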

Grid-based thinning

Grid sampling is a standard method where the area is divided into a grid, and n points are sampled from each grid cell. This method is very fast and memory-efficient. The only drawback is that you cannot ensure a minimum distance between points. There are two main ways to apply grid sampling in GeoThinneR: (i) Define the characteristics of the grid, or (ii) pass your own grid as a raster (terra::SpatRaster).

For the first method, you can use the thin_dist parameter to define the grid cell size (the distance in km is approximated in degrees to define the cell size; one degree is roughly 111.32 km at the equator), or you can pass the resolution of the grid (e.g., resolution = 0.25 for 0.25x0.25-degree cells). If you want to align the grid with external data or covariate layers, you can pass the origin argument as a vector of two values (e.g., c(0, 0)). Similarly, you can specify the coordinate reference system (CRS) of your grid (crs).
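
The km-to-degrees approximation behind thin_dist can be sketched as follows (illustrative only; GeoThinneR's exact internal conversion may differ):

# Approximate grid resolution in degrees for a 20 km thinning distance
thin_dist <- 20
res_deg <- thin_dist / 111.32  # one degree of latitude is ~111.32 km
res_deg                        # ~0.18 degrees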

system.time(
thin_sim_data <- thin_points(
  data = sim_data,
  method = "grid",
  resolution = 1,
  origin = c(0, 0),
  crs = "epsg:4326",
  n = 1, # one point per grid cell
  trials = 50,
  all_trials = FALSE,
  seed = 123
))
#>    user  system elapsed 
#>   0.015   0.001   0.014
nrow(thin_sim_data[[1]])
#> [1] 200

Alternatively, you can pass a terra::SpatRaster object, and that grid will be used for the thinning process.

rast_obj <- terra::rast(xmin = -10, xmax = 10, ymin = -5, ymax = 5, res = 1)
system.time(
thin_sim_data <- thin_points(
  data = sim_data,
  method = "grid", 
  raster_obj = rast_obj,
  trials = 50,
  n = 1,
  all_trials = FALSE,
  seed = 123
))
#>    user  system elapsed 
#>   0.005   0.000   0.002
nrow(thin_sim_data[[1]])
#> [1] 200

Precision thinning

In this approach, coordinates are rounded to a given precision, and points that become duplicates after rounding are removed; the retained points then recover their original coordinate values. This is the simplest method and acts more as a filtering step when working with data from different sources with varying coordinate precision. To use it, define the precision parameter, which indicates the number of decimal places to which the coordinates are rounded.
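
Conceptually, precision thinning behaves like the following sketch (our own illustration of the idea, not the package's implementation): round, drop duplicates, and keep the original coordinates of the survivors.

# Illustrative precision filter: keep one point per rounded coordinate pair
prec <- 0
key  <- paste(round(sim_data$lon, prec), round(sim_data$lat, prec))
kept <- sim_data[!duplicated(key), ]  # original coordinates are preserved
nrow(kept)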

system.time(
thin_sim_data <- thin_points(
  data = sim_data,
  method = "precision", 
  precision = 0,
  trials = 50,
  all_trials = FALSE,
  seed = 123
))
#>    user  system elapsed 
#>   0.003   0.000   0.003
nrow(thin_sim_data[[1]])
#> [1] 230

These are the methods implemented in GeoThinneR. Depending on your specific dataset and research needs, one method may be more suitable than others.

Additional customization features

Spatial thinning by group

In some cases, your dataset may include different groups, such as species, time periods, areas, or conditions, that you want to thin independently. The group_col parameter allows you to specify the column containing the grouping factor, and the thinning will be performed separately for each group. For example, in the simulated data where we have two species, we can use this parameter to thin each species independently:

thin_sim_data <- thin_points(
  data = sim_data,
  thin_dist = 100,
  seed = 123
)
thin_sim_data_group <- thin_points(
  data = sim_data,
  group_col = "sp",
  thin_dist = 100,
  seed = 123
)

nrow(thin_sim_data[[1]])
#> [1] 167
nrow(thin_sim_data_group[[1]])
#> [1] 321

Fixed number of points

What if you need to retain a fixed number of points that best covers the area where your data points are located, or for class balancing? The target_points parameter allows you to specify the number of points to keep, and the function will return that number of points spaced as far apart as possible. You can also set a thin_dist parameter so that no points closer than this distance are retained. Currently, this approach is only implemented using the brute force method (method = "distance" and search_type = "brute"), so be cautious when applying it to very large datasets.

thin_real_data <- thin_points(
  data = caretta,
  lon_col = "decimalLongitude",
  lat_col = "decimalLatitude",
  target_points = 150,
  thin_dist = 30,
  all_trials = FALSE,
  seed = 123,
  verbose = TRUE
)
#> Starting spatial thinning at 2025-03-27 19:20:57 
#> Thinning using method: distance
#> For specific target points, brute force method is used.
#> Thinning process completed.
#> Total execution time: 3.09 seconds
nrow(thin_real_data[[1]])
#> [1] 150

Select points by priority

In some scenarios, you may want to prioritize certain points based on a specific criterion, such as uncertainty, data quality, or any other variable. The priority parameter allows you to pass a vector representing the priority of each point. Currently, this feature can be used with the "grid" and "precision" methods.

For example, in the sea turtle data downloaded from GBIF, there is a column named coordinateUncertaintyInMeters. We can use this to prioritize points with lower uncertainty within each grid cell (in the grid method) or when rounding coordinates (in the precision method). Keep in mind that larger uncertainty values mean lower priority, so we have to reverse these values.

thin_real_data <- thin_points(
  data = caretta,
  lon_col = "decimalLongitude",
  lat_col = "decimalLatitude",
  method = "grid",
  resolution = 5,
  seed = 123
)

# Subtract from the maximum: the highest uncertainty becomes the lowest priority and vice versa
priority <- max(caretta$coordinateUncertaintyInMeters) - caretta$coordinateUncertaintyInMeters
thin_real_data_uncert <- thin_points(
  data = caretta,
  lon_col = "decimalLongitude",
  lat_col = "decimalLatitude",
  method = "grid",
  resolution = 5,
  priority = priority,
  seed = 123
)

mean(thin_real_data[[1]]$coordinateUncertaintyInMeters)
#> [1] 36355.38
mean(thin_real_data_uncert[[1]]$coordinateUncertaintyInMeters)
#> [1] 16950.97

Other Packages

The GeoThinneR package was inspired by the work of many others who have developed methods and packages for working with spatial data and thinning techniques. Our goal with GeoThinneR is to offer additional flexibility in method selection and to address specific needs we encountered while using other packages. We would like to mention other tools that may be suitable for your work:

  • spThin: The thin() function provides a brute-force spatial thinning of data, maximizing the number of retained points through random iterative repetitions, using the Haversine distance.
  • enmSdmX: Includes the geoThin() function, which calculates all pairwise distances between points for thinning purposes.
  • dismo: The gridSample() function samples points using a grid as stratification, providing an efficient method for spatial thinning.

References

  • Aiello‐Lammens, M. E., Boria, R. A., Radosavljevic, A., Vilela, B., and Anderson, R. P. (2015). spThin: an R package for spatial thinning of species occurrence records for use in ecological niche models. Ecography, 38(5), 541-545.

  • Elseberg, J., Magnenat, S., Siegwart, R., & Nüchter, A. (2012). Comparison of nearest-neighbor-search strategies and implementations for efficient shape registration. Journal of Software Engineering for Robotics, 3(1), 2-12.

  • Hijmans, R.J., Phillips, S., Leathwick, J., & Elith, J. (2023). dismo: Species Distribution Modeling. R Package Version 1.3-14. https://cran.r-project.org/package=dismo

  • Mestre-Tomás J (2025). GeoThinneR: Simple Spatial Thinning for Ecological and Spatial Analysis. R package version 1.1.0, https://github.com/jmestret/GeoThinneR

  • Smith, A. B., Murphy, S. J., Henderson, D., & Erickson, K. D. (2023). Including imprecisely georeferenced specimens improves accuracy of species distribution models and estimates of niche breadth. Global Ecology and Biogeography, 32(3), 342-355.

Session info

sessionInfo()
#> R version 4.4.3 (2025-02-28)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_3.5.1    sf_1.0-20        terra_1.8-29     GeoThinneR_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.9         class_7.3-23       KernSmooth_2.23-26 digest_0.6.37     
#>  [5] magrittr_2.0.3     evaluate_1.0.3     grid_4.4.3         iterators_1.0.14  
#>  [9] fastmap_1.2.0      maps_3.4.2.1       foreach_1.5.2      jsonlite_1.9.1    
#> [13] e1071_1.7-16       DBI_1.2.3          spam_2.11-1        viridisLite_0.4.2 
#> [17] scales_1.3.0       codetools_0.2-20   textshaping_1.0.0  jquerylib_0.1.4   
#> [21] cli_3.6.4          rlang_1.1.5        units_0.8-7        munsell_0.5.1     
#> [25] withr_3.0.2        cachem_1.1.0       yaml_2.3.10        tools_4.4.3       
#> [29] colorspace_2.1-1   vctrs_0.6.5        R6_2.6.1           matrixStats_1.5.0 
#> [33] proxy_0.4-27       lifecycle_1.0.4    classInt_0.4-11    nabor_0.5.0       
#> [37] fs_1.6.5           ragg_1.3.3         pkgconfig_2.0.3    desc_1.4.3        
#> [41] pkgdown_2.1.1      pillar_1.10.1      bslib_0.9.0        gtable_0.3.6      
#> [45] glue_1.8.0         data.table_1.17.0  Rcpp_1.0.14        fields_16.3.1     
#> [49] systemfonts_1.2.1  xfun_0.51          tibble_3.2.1       knitr_1.50        
#> [53] farver_2.1.2       htmltools_0.5.8.1  rmarkdown_2.29     labeling_0.4.3    
#> [57] dotCall64_1.2      compiler_4.4.3