This work is distributed under the Creative Commons Attribution 4.0 License.
Global fields of daily accumulation-mode particle number concentrations using in situ observations, reanalysis data and machine learning
Abstract. Accurate global estimates of accumulation-mode particle number concentrations (N100) are essential for understanding aerosol–cloud interactions and their climate effects, and for improving Earth System Models. However, traditional methods relying on sparse in situ measurements lack comprehensive coverage, and indirect satellite retrievals have limited sensitivity in the relevant size range. To overcome these challenges, we apply two machine learning (ML) techniques, multiple linear regression (MLR) and eXtreme Gradient Boosting (XGB), to generate daily global N100 fields, using in situ measurements as target variables and reanalysis data from the Copernicus Atmosphere Monitoring Service (CAMS) and ERA5 as predictor variables. Our cross-validation showed that the ML models captured N100 concentrations well in environments well represented in the training set, with over 70 % of daily estimates within a factor of 1.5 of observations. However, performance declines in underrepresented regions and conditions, such as clean and remote environments, underscoring the need for more diverse observations. The most important predictors for N100 in the ML models were aerosol-phase sulfate and gas-phase ammonia concentrations, followed by carbon monoxide and sulfur dioxide. Although black carbon and organic matter showed the highest feature importance values, their opposing signs in the MLR model coefficients suggest their effects largely offset each other's contribution to the N100 estimate. By directly linking estimates to in situ measurements, our ML approach provides valuable insights into the global distribution of N100 and serves as a complementary tool for evaluating Earth System Model outputs and advancing the understanding of aerosol processes and their role in the climate system.
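For illustration, a minimal sketch of the kind of pipeline the abstract describes: fitting an MLR and an XGB regressor on collocated reanalysis predictors to estimate daily N100. The file name, predictor columns and log-transformed target are assumptions for illustration, not the authors' exact setup.

```python
# Minimal sketch (not the authors' exact pipeline): fit MLR and XGB on
# collocated reanalysis predictors to estimate daily N100.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor

# Hypothetical table: one row per station-day, observed N100 plus
# collocated CAMS/ERA5 variables. Column names are placeholders.
df = pd.read_csv("n100_training_table.csv")
predictors = ["so4", "nh3", "co", "so2", "bc", "om"]
X = df[predictors].to_numpy()
y = np.log10(df["n100"].to_numpy())  # log target assumed: N100 spans decades

mlr = LinearRegression().fit(X, y)                # interpretable linear baseline
xgb = XGBRegressor(n_estimators=500, max_depth=6,
                   learning_rate=0.05).fit(X, y)  # nonlinear gradient boosting

n100_mlr = 10 ** mlr.predict(X)  # back-transform to concentrations (cm^-3)
n100_xgb = 10 ** xgb.predict(X)
```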
Competing interests: Some authors are members of the editorial board of the journal Aerosol Research (AR).
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this preprint. The responsibility to include appropriate place names lies with the authors.
Status: open (until 12 Aug 2025)
RC1: 'Comment on ar-2025-18', Anonymous Referee #1, 01 Aug 2025
Review of "Global fields of daily accumulation-mode particle number concentrations using in situ observations, reanalysis data and machine learning" by Ovaska et al.
Accumulation-mode aerosols are climatologically important because of their interactions with radiation and clouds. Measurements of accumulation-mode number concentrations from satellites are inferred from radiative properties (e.g., extinction) and are only available in cloud-free regions, which reduces our ability to constrain aerosol–cloud interactions in climate models. Observations from ground-based or airborne instruments are spatially and temporally limited but of high fidelity, and useful for testing satellite retrievals and/or model simulations. The rise of complex machine learning (ML) approaches offers the opportunity to create diverse datasets that relate observed aerosol number concentrations to more widely observed/simulated meteorological phenomena. In this paper, Ovaska et al. use two established explainable-ML approaches to relate observed accumulation-mode number (N100) concentrations to coincident reanalysis fields and thereby create a high-coverage N100 dataset. The methodology accounts for various confounding factors which may bias results, including the paucity of surface measurements, which are mostly confined to Europe, and the varying lengths of the observational datasets.
The paper is very well written, timely and important, and provides a benchmark method for calculating aerosol number concentrations from predictor variables using ML methods. As someone with limited ML knowledge, I particularly appreciate the comprehensive Methods section, which highlights the complexities involved in ML training and gives critical details for reproducing the results and applying these approaches to similar scenarios. The justification for choosing both ML methods (L107) is also very useful to an ML novice. While I am not an ML expert, as someone with a statistical background I find that the decisions made by the authors, as documented in Sections 2-4, intuitively make sense. The results are also intuitive: the models show skill at stations with long observational datasets or close proximity to other stations, and reduced skill elsewhere. I have some minor comments which I think would improve the manuscript, but otherwise I think the paper is an excellent fit for the journal.
General comments
The greatest source of uncertainty, I think, is in the spatial distribution of the surface observations (Fig. 1), which is very Europe-centric. Additionally, all or most of the surface sites are over land, which diminishes the skill at predicting over oceans, where the sources of aerosol are frankly very different from those over land. I think the manuscript should be re-framed as a "global land network of accumulation-mode number" rather than global, as there is limited evidence of skill over the oceans. I do not think this diminishes the paper; it is simply more reflective of the results and limitations of the study. If the authors can provide some evaluation of concentrations over the ocean, even if rudimentary, that would be useful.
The CAMS fields used as predictor variables (Table 2) are predominantly gas and aerosol tracers in that model, with some limited data assimilation of aerosol (from satellite AOD) but, as far as I'm aware, none of gases. The paper lacks a quantification of the uncertainty in these predictor variables, both at the surface sites and in general. This is not to say that the use of these predictor variables is wrong, but I would appreciate some quantification of their relative uncertainty. If the surface sites measured these variables (e.g., EMEP over Europe), that would be a useful way to evaluate this uncertainty. Some qualitative evaluation is provided in Sections 5.3 and 5.4, but this should be extended.
The Brock et al. reference seems like an interesting counterpoint, and I would have appreciated a direct comparison of the results of the two different approaches to deriving aerosol number concentrations, even if only over the measurement sites.
Specific comments
[L36] "need to be captured within a factor of 1.5 of their true values" – this is mentioned tangentially in Rosenfeld et al. without context. Is there any reasoning behind the factor of 1.5? If so, please include it.
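For concreteness, the factor-of-1.5 skill criterion quoted in the abstract can be written as the fraction of estimates whose ratio to the observation falls between 1/1.5 and 1.5; a minimal sketch:

```python
import numpy as np

def fraction_within_factor(obs, pred, factor=1.5):
    """Fraction of predictions within a multiplicative factor of observations."""
    ratio = np.asarray(pred, dtype=float) / np.asarray(obs, dtype=float)
    return float(np.mean((ratio >= 1.0 / factor) & (ratio <= factor)))

# A value of 0.7 or higher would match the "over 70 %" reported in the abstract.
```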
[L41] This is optional, but I wonder if you can mention any specific regions/seasons which are blighted by a lack of CCN observation coverage due to, e.g., too much cloud cover, lack of satellite data, etc.
[L125] What is the difference between testing/validation and holdout data? As an ML novice, this jargon appears to describe very similar partitions of the data phase space and might warrant a one-line explanation of how they differ (a sketch of one common convention follows).
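One common convention, sketched below under the assumption that the manuscript follows something similar (the authors should confirm): the validation set steers hyperparameter tuning during training, while the holdout (test) set is used exactly once, for the final skill estimate. This reuses `X` and `y` from the first sketch.

```python
from sklearn.model_selection import train_test_split

# 60 % train / 20 % validation / 20 % holdout, via two successive splits.
# A random split is shown for simplicity; station-wise or time-wise splits
# guard better against leakage between partitions.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_holdout, y_val, y_holdout = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)
```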
[L165] Are the thresholds for ‘excellent’, ‘good’, and ‘poor’ fit at 0.2/0.3 arbitrary or established in the ML sciences? I understand the need for categorising model skill but perhaps there should be a neutral level between good and poor.
[L208] Just a quick check that the altitude of the interpolated predictor variables matched the altitude of the measurement station and corresponding CCN?
[Section 4] This is a very useful section and I thank the authors for their level of detail
[L503] Presumably the HAD issue is dust, or is it NPF-related (cf. "New particle formation, growth and apparent shrinkage at a rural background site in western Saudi Arabia", ACP)?
[Section 5.1] This section is currently a little weak and would benefit from hypotheses about why we see certain biases (see comment above). Potentially this would also be useful to the CAMS model developers. The line at [L515] starts to do this.
[L521] Potentially my only suggestion for adding to the methodology, which I think is optional: did you try to re-train the ML model without one of either BC or OC, given their significant correlation? Potentially there is some important detail that is missed by these offsetting predictor variables and that could be recovered by including only one of them (see the sketch below)?
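A sketch of the drop-one-predictor retraining suggested here, reusing the hypothetical table and skill metric from the earlier sketches. Evaluating on the training data, as shown, is only a smoke test; a proper comparison would score on held-out stations.

```python
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

df = pd.read_csv("n100_training_table.csv")           # hypothetical file
predictors = ["so4", "nh3", "co", "so2", "bc", "om"]  # illustrative names

def fraction_within_factor(obs, pred, factor=1.5):
    ratio = np.asarray(pred, dtype=float) / np.asarray(obs, dtype=float)
    return float(np.mean((ratio >= 1.0 / factor) & (ratio <= factor)))

for dropped in ["bc", "om"]:
    keep = [p for p in predictors if p != dropped]
    model = XGBRegressor(n_estimators=500).fit(df[keep], np.log10(df["n100"]))
    pred = 10 ** model.predict(df[keep])   # in-sample only: a smoke test
    print(f"without {dropped}: within factor 1.5 = "
          f"{fraction_within_factor(df['n100'], pred):.2f}")
```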
[L534] Presumably the CO mixing ratio importance is because it's a useful proxy for biomass burning smoke? It seems like an odd fit here given its limited aerosol chemistry, so it would be good to identify a reason for its inclusion.
[L545] This relates to my general comment: I think that the model framework, with no marine sites, will have very limited skill over the oceans. Perhaps the ocean ML-predicted concentrations can be tested against satellite, field-campaign or ship measurements where available; the negative coefficient for sea salt is highly suggestive.
Citation: https://doi.org/10.5194/ar-2025-18-RC1
RC2: 'Comment on ar-2025-18', Anonymous Referee #2, 05 Aug 2025
Review of “Global fields of daily accumulation-mode particle number concentrations using in situ observations, reanalysis data and machine learning” by Aino Ovaska and co-authors.
In this manuscript the authors use a database of N100 measurements from 35 stations, combined with reanalysis data (ERA5 meteorology and CAMS aerosol), to train and test two established machine learning (ML) models. N100 is a good proxy for CCN concentrations, which are themselves important for cloud microphysical and radiative properties. There is a paucity of in-situ measurements for both variables, which results in considerable uncertainty when evaluating climate models and estimating future projections. A ML model that can provide robust global estimates of N100 and CCN concentrations would be a very important step forward for the community; therefore the focus of this study is very relevant.
The authors take commendable effort towards robustly training and testing the ML models. All steps are considered and justified throughout the manuscript. The results are well presented and discussed, and framed very well with regard to the associated limitations and the steps required to improve the ML models' accuracy and representativeness.
I thoroughly enjoyed reading this manuscript and recommend publication in Aerosol Research following a discussion on a few largely minor comments.
General comments
Some regions may have meteorological drivers that are not found in other regions – for example, the Southern Ocean. The synoptic/seasonal meteorology and the sources of aerosol will be very different there than in any region included in your training dataset. The time series analysis in Figure 7 (lines 409-502) for Alert (located in a relatively remote region) demonstrates that the model struggles with this. Does this limit the use of the ML models in regions very far from any of the stations (e.g., much of the Southern Hemisphere)? I should note that this is still an excellent dataset and just clearly demonstrates the need for additional measurements in these remote regions.
This is a comment rather than a suggestion. Given the lack of observations in many regions – I wonder whether one way to test this is to repeat the methods using output from a global aerosol microphysics model. There would still be associated uncertainty due to the microphysical processes but it would be a very good test of the methods.
How would you expect the ML models to perform in a pre-industrial (PI) scenario where you have removed the most important features (the anthropogenic aerosol sources)? Do you think there is a possibility that they are only representative of the present-day (PD) environment?
Overall, do you think the community can use this as a realistic proxy for N100 (outside of Europe – Line 647) in the absence of a detailed aerosol microphysics model? Is there sufficient skill? Or do you believe more work is required?
Minor comments
Line 194. The aerosols will likely have diurnal cycles in some locations. Therefore, there is an implicit assumption that the model reanalysis is able to capture the diurnal cycle correctly – is this a valid assumption? Why not use 6-hourly data? Some predictors (e.g., u, v, T, RH, some aerosol emissions) will likely vary throughout the diurnal cycle and may be overlooked during the feature importance analysis etc. (see the sketch below).
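As an illustration of the sensitivity check implied here, daily and 6-hourly aggregation of a hypothetical hourly predictor series can be compared directly; the file and variable names are placeholders, not the study's actual inputs.

```python
import pandas as pd

# Hypothetical hourly time series of one predictor at a station.
ts = pd.read_csv("station_hourly.csv", index_col="time",
                 parse_dates=True)["so2"]

daily = ts.resample("1D").mean()
six_hourly = ts.resample("6h").mean()

# Within-day spread of the 6-hourly means, relative to the daily mean, is a
# crude measure of how much diurnal structure daily averaging discards.
spread = six_hourly.groupby(six_hourly.index.normalize()).std()
relative_spread = spread / daily
```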
Line 196. How many data points did this remove from the sets?
Line 197. As you often measured very low concentrations, did you also include zero counts when calculating the daily mean?
Line 212-217. How were the gridded datasets spatially collocated with the measurement stations? Was the grid cell average used, or were values linearly interpolated from neighbouring grid points? (A minimal illustration follows.)
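The two collocation options mentioned here differ by one call in, e.g., xarray; a sketch with placeholder file, variable and coordinate names:

```python
import xarray as xr

ds = xr.open_dataset("cams_fields.nc")   # hypothetical gridded file
lat, lon = 61.85, 24.29                  # an example station location

# Option 1: value of the grid cell containing the station.
nearest = ds["so4"].sel(latitude=lat, longitude=lon, method="nearest")

# Option 2: bilinear (linear in lat/lon) interpolation from neighbours.
bilinear = ds["so4"].interp(latitude=lat, longitude=lon)
```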
Line 221. Related to above, how were gridded datasets collocated with the altitude of the measurement stations? Stations on a hilltop or a mountain site may not be well represented by the model surface mean.
Line 312. Downweighing = downweighting?
Line 336. Re manually selecting parameter combinations: was there not a statistical method that could be used to eliminate any human-sourced bias?
Line 367. Did you pay any specific attention in the analysis to how the ML models performed as a function of how remote the excluded station was?
Line 406. weighed = weighted?
Figure 3. Suggest making the figure wider to clearly show the notches and features of the boxplots.
Line 441 (and 213). 2020 to 2022 had a reasonably strong negative ENSO index. Could this bias the comparison between the training and testing RMSE values?
Figure 5. Suggest adding ‘MLR model’ and ‘XGB models’ to the top of panels (a) and (c) to make it automatically clear to the reader what is different.
Line 459. It seems XGB is best except at extreme values. Would you therefore recommend using a combination of both, i.e., XGB unless MLR predicts values < 25 or > 5000 (sketched below)?
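The hybrid proposed here reduces to a one-line selection rule; a sketch, with the thresholds taken from the comment rather than validated:

```python
import numpy as np

def hybrid_n100(n100_mlr, n100_xgb, lo=25.0, hi=5000.0):
    """Use XGB by default; fall back to MLR where MLR signals extremes."""
    n100_mlr = np.asarray(n100_mlr, dtype=float)
    n100_xgb = np.asarray(n100_xgb, dtype=float)
    extreme = (n100_mlr < lo) | (n100_mlr > hi)
    return np.where(extreme, n100_mlr, n100_xgb)
```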
Line 468. Suggest the authors add the number of stations where XGB < MLR and vice versa.
Line 521. Given the importance of BC and OM, how well are they represented in CAMS?
Line 531. What happens if you were to remove either BC or OM as one of the predictor variables?
Figure 9. Missing "M" in the label of panel (c). Suggest making the station circles in (c) larger.
Line 602. Would you therefore recommend concentrating on regions with extreme N100 magnitudes to better train the global ML models?
Line 667. Worth noting that in well-mixed boundary layers these ground-based ML models will likely provide representative values at cloud base.
Citation: https://doi.org/10.5194/ar-2025-18-RC2
Data sets
Daily Averaged Accumulation Mode Particle Number Concentrations (N100) from 35 Stations (2003-2019) A. Ovaska, E. Rauth, D. Holmberg, P. Artaxo, J. Backman, B. Bergmans, D. Collins, M. A. Franco, S. Gani, R. M. Harrison, R. K. Hooda, T. Hussein, A. Hyvärinen, K. Jaars, A. Kristensson, M. Kulmala, L. Laakso, A. Laaksonen, N. Mihalopoulos, C. O'Dowd, J. Ondracek, T. Petäjä, K. Plauškaitė-Šukienė, M. Pöhlker, X. Qi, P. Tunved, V. Vakkari, A. Wiedensohler, K. Puolamäki, T. Nieminen, V.-M. Kerminen, V. A. Sinclair, and P. Paasonen https://doi.org/10.5281/zenodo.15222674