Predicting Ice Nucleation Particle properties in a Boreal Environment using machine learning

Wu, Yusheng; Brasseur, Zoé; Castarède, Dimtri; Heikkilä, Paavo; Keskinen, Jorma; Möhler, Ottmar; Kulmala, Markku; Petäjä, Tuukka; Thomson, Erik S.; Duplissy, Jonathan

doi:10.5194/ar-2026-4

29 Jan 2026

Status: this preprint is currently under review for the journal AR.

Abstract. Mixed-phase clouds, which are dominant in mid- and high-latitude regions, strongly influence Earth’s radiative balance and precipitation processes. Their formation depends critically on the presence of ice-nucleating particles (INPs), which are rare relative to cloud condensation nuclei. The HyICE-2018 measurement campaign took place at the Station for Measuring Ecosystem-Atmosphere Relations (SMEAR) II in the high-latitude boreal forest of Hyytiälä, Finland, between February and June 2018. Two continuous-flow diffusion chambers, the Portable Ice Nucleation Chambers I and II (PINC and PINCii), were deployed with high-frequency sampling to measure INP concentrations. We applied machine-learning techniques to explore predictors of INP variability using more than 500 high-resolution atmospheric, aerosol, and ecosystem variables measured continuously at SMEAR II. We identify distinct differences between winter and spring/summer measurements. The winter measurements conducted with PINC appear to be nearly independent of any monitored variable. In contrast, the spring/summer measurements conducted with PINCii appear to be more closely linked to ambient aerosol properties. Furthermore, we find that classical parameterizations based on particle concentration overestimate observed INP concentrations in the boreal environment. However, similar empirical fits based on local proxies, such as a marker of biogenic aerosol or nitrate, yield improved agreement during spring and summer, while no improvement occurs during winter. These results underscore the need for site-specific parameterizations to capture INP variability in complex boreal environments.

Received: 22 Jan 2026 – Discussion started: 29 Jan 2026

Competing interests: At least one of the (co-)authors is a member of the editorial board of Aerosol Research.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.

Status: open (until 22 Mar 2026)

RC1: 'Comment on ar-2026-4', Anonymous Referee #1, 05 Mar 2026

Review of “Predicting Ice Nucleation Particle properties in a Boreal Environment using machine learning” by Wu et al.
General comment:

The manuscript reports on the feasibility of predicting INP concentrations measured at -31°C and -32°C during the HyICE-2018 campaign in the boreal forest from 84 complementary variables measured at the site. The paper's honest takeaway is that no variable strongly predicts INP concentrations in the boreal winter, and that only moderate skill (R²≈0.5) is achieved in spring/summer. This is an important negative result and is nicely summarized at the end of the introduction: even many co-located variables at one of the best-instrumented stations cannot explain INP variability in general. The authors could commit more fully to this message. The paper is framed around "predicting INP using machine learning", yet the conclusion effectively states that strong links remain "abstruse." This is a cautionary study about the limits of data-driven approaches for INP prediction, which is arguably more useful to the community than a modest positive result would be.

However, although the paper presents itself as a machine learning study, the ML algorithms are used exclusively as importance-ranking tools, not for predictive modeling. The more closely investigated relations (Table 1) are simple power-law fits with only two parameters, no different from classical empirical fitting. No ML-based predictive model is evaluated with train/test splits. With data from 84 variables, a proper ML regression model (e.g., gradient boosting or random forest regressor) could be benchmarked against the power-law fits. As the paper stands, the title overstates the ML contribution.
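The kind of benchmark meant here could look like the following minimal sketch. It uses synthetic data, not the campaign data, and a plain k-nearest-neighbour regressor as a stand-in for the gradient boosting or random forest models named above, so that it runs with the Python standard library alone. All names (x1, x2, inp, fit_power_law, knn_predict) are hypothetical, and the 70/30 hold-out split is an assumption, not the authors' procedure.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-in data (NOT campaign data): two hypothetical predictors and
# a target that follows a noisy power law in both of them.
n = 400
x1 = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(n)]
x2 = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(n)]
inp = [2.0 * u**0.8 * v**0.5 * math.exp(rng.gauss(0.0, 0.3)) for u, v in zip(x1, x2)]

# Hold-out split: first 70 % for fitting, last 30 % for evaluation. For real
# time series a blocked (chronological) split like this is usually safer than a
# random one, because neighbouring samples are correlated.
n_train = int(0.7 * n)
train, test = slice(0, n_train), slice(n_train, n)


def fit_power_law(x, y):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
    return math.exp(my - b * mx), b


def knn_predict(train_rows, train_targets, query, k=5):
    """Mean target of the k training rows closest to `query` (Euclidean)."""
    nearest = sorted((math.dist(row, query), t) for row, t in zip(train_rows, train_targets))
    return sum(t for _, t in nearest[:k]) / k


def r_squared(obs, pred):
    """Coefficient of determination of predictions against observations."""
    mean = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


log_inp = [math.log(v) for v in inp]
features = [(math.log(u), math.log(v)) for u, v in zip(x1, x2)]

# Baseline: two-parameter power law in x1 only, fitted on the training part.
a_hat, b_hat = fit_power_law(x1[train], inp[train])
pred_power = [math.log(a_hat) + b_hat * math.log(v) for v in x1[test]]

# Multivariate model: k-NN regression on both log-transformed predictors.
pred_knn = [knn_predict(features[train], log_inp[train], q) for q in features[test]]

# Both models are scored on the same held-out data, in log space.
r2_power = r_squared(log_inp[test], pred_power)
r2_knn = r_squared(log_inp[test], pred_knn)
print(f"power law (x1 only): R2 = {r2_power:.2f}, fitted exponent = {b_hat:.2f}")
print(f"k-NN (x1 and x2):    R2 = {r2_knn:.2f}")
```

The point of the sketch is the evaluation harness rather than the particular model: skill is reported on data that were not used for fitting, and the multivariate model and the power law are scored on the same held-out samples.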
Specific comments:

line 15: Specify how the results underscore the need for site-specific parameterizations and suggest on which variables parameterizations could instead be based, considering that none of the 84 variables included here work.
line 20: Explain why the coexistence is inherently unstable. In the same sentence it is mentioned that mixed-phase clouds persist for days, which seems to contradict this statement.
line 22: The Arctic is considered a high-latitude region, not a region beyond high latitudes.
line 23: Summarize what the underlying amplifying feedbacks are.
line 26: Summarize how CCN and INP are fundamental and to which cloud processes.
line 27: Clarify the difference between INP occurrence and abundance.
line 33: Providing some more details on the instrumentation, specifically about the instruments measuring the 84 variables used in this study, would be helpful. This could be done in a supplementary table including the instrument name, brand, sampling frequency, and volume/flow.
lines 50, 75: Specify the time resolution with which the CFDCs measure INP concentrations and explain why the data are averaged over 20 min or 1 hour (Figure 3). Figure 1 shows INP concentrations often reaching hundreds per liter. With a sample flow of 1 L/min there should be plenty of signal in 1 min averages or even 10 s averages, which would substantially increase the number of INP data points available for analysis. Additionally, correlations of ambient measurements depend on the time resolution or averaging interval. The temporal scale at which a correlation is seen also identifies the scale of the process that drives the changes in variables. Investigating correlations at different time resolutions, which seems possible with this dataset, could be interesting for the highly variable INP concentrations. Such an analysis could be added, and the dependence of correlation analyses on the temporal data resolution should be discussed.
Section 2.2.: Currently, the ML techniques are not explained in this section. Consider changing the section title to "Complementary Data Selection". Add more details about how datasets were processed for the analysis, for example how differences in sampling frequency, sample volume, and cutoff sizes were handled.
Add a separate methodology section about ML algorithms. The abstract (line 7), introduction (line 38), and Sec. 3.2 explicitly refer to several machine learning algorithms or analysis techniques. Add some information about these algorithms and techniques, for example feature selection criteria, to make the approach reproducible.
line 73: Explain which hypotheses are tested to illuminate the sources and mechanisms, for example including organic matter to test whether NPF generates INPs.
line 76: Clarify what is meant by "straightforwardly intercompared".
Section 2.3.: Clarify why the details on the CFDC chambers are relevant for this paper. None of the details are referred to later.
line 107: Contrary to what is stated here, Brasseur et al., 2022 mention a sampling window of 15 min for PINCii.
Figure 1: Indicate the temperature at which INP concentrations were measured. Check the units in c); I assume this should be the same data as in Brasseur et al. 2022 Fig. 6c). In e), the last line is shown with very weak colours.
line 119–120: Clarify why NPF is relevant here.
line 122: What relationships can be observed in Figure 1 that motivate the analysis?
line 140: The references attribute bio-INP to much lower concentrations at higher temperatures than below -30°C. Provide a supporting citation for biological particles contributing substantially at low temperatures.
line 144: Only half of the variables are listed in Figure 2, while others are selected by intuition. Specify for each variable based on which information or hypothesis it was selected.
line 148ff: That BC ranks highly in Fig. 2 and yields the highest adjusted R² of all predictors for PINCii (Table 1), yet is a poor INP in laboratory studies, is a provocative finding that could be explored beyond noting that it is surprising and pointing to aging and oxidation as possible explanations. It could be mentioned that Paramonov et al. (2020) also reported that BC correlated well with INP concentrations on a short timescale (their Fig. 5).

In general, the paper would be strengthened if connections to findings from previous HyICE articles were integrated in more detail.
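The dependence of correlations on the averaging interval raised in the comment on lines 50 and 75 can be illustrated with a minimal sketch. It uses synthetic 1 min series, not campaign data: two hypothetical signals (inp_like, proxy) share a slow 12 h oscillation and carry independent fast noise. The window lengths of 1, 20, and 60 min and the gap-free, fixed-step series are assumptions made for illustration.

```python
import math
import random


def block_mean(values, width):
    """Average consecutive, non-overlapping blocks of `width` samples."""
    n_blocks = len(values) // width
    return [sum(values[i * width:(i + 1) * width]) / width for i in range(n_blocks)]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Ten days of synthetic 1 min data: a shared slow oscillation (12 h period)
# plus independent measurement-like noise on each series.
minutes = 10 * 24 * 60
slow = [math.sin(2.0 * math.pi * t / 720.0) for t in range(minutes)]
inp_like = [s + rng.gauss(0.0, 2.0) for s in slow]
proxy = [s + rng.gauss(0.0, 2.0) for s in slow]

# Correlation between the two series after averaging over 1, 20, and 60 min.
r_by_window = {
    w: pearson(block_mean(inp_like, w), block_mean(proxy, w)) for w in (1, 20, 60)
}
for window, r in r_by_window.items():
    print(f"{window:>2d} min averages: n = {minutes // window:5d}, r = {r:.2f}")
```

In this construction the correlation strengthens as the window grows, because averaging suppresses the independent noise while keeping the shared slow signal. With real data the behaviour depends on the time scale of the process that links the variables, which is why reporting r at several resolutions is informative.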
line 151ff: Clarify which instruments were used to measure the variables on the horizontal axes. For example, are >500 nm data from APS or WIBS?
line 155, Table 1: Previously, it is implied that organic mass and >500 nm concentration were included based on intuition and not a high-skill ranking.
Figure 3 and related interpretation: Explain why the data are split for the Pearson correlation analysis if the goal is to investigate the variables' predictive power for INP concentrations. The difference between PINC and PINCii data shows that the correlations are only good for a subset, not generally. Provide a discussion on what this implies for the many campaign-based INP correlations reported in the literature.
Figure 3: For all panels, consider marking significant r values with an asterisk instead of reporting absurdly small p values. a), c): Check units on the horizontal axes. The different Pearson correlations for PINC and PINCii data, which seem to overlap for the most part in the log–log scatter plots, are surprising. Clarify if the correlation was calculated using the raw data or the log-transformed data. Consider showing the data on a linear scale. Provide the number of PINC and PINCii data points. Are they the same in each panel? Why are hourly mean data used here, and does using the mean affect the outcome of the analysis compared to using 20 min data?
line 192ff: Eq. (2) reproduces Eq. (2) from Tobo et al., 2013, which uses n_AP > 500 nm. However, the text seems to imply that Eq. (3) uses the number of fluorescent particles. How would the updated Eq. (3) from Tobo et al., 2013 perform? Which formula was used for Figure 8 in Brasseur et al., 2022?
Figure 4: a) Would p = 1 not require χ² = 0? Double-check the values. b) Clarify why χ² for PINC data is huge and p = 0. Double-check whether the fitting and statistics were performed correctly. The width of component 1 and the location of the second mode for fitting PINC data do not seem optimal. The χ² and visual inspection do not support the statement in the caption that tails are better resolved by the bi-modal fit.
line 198: Explain why the CFDCs operating above water saturation were not measuring immersion freezing comparable to the parameterization derived from datasets obtained with INSEKT. It would be interesting to see how well the parameterizations perform at low temperatures.
Figure 5: Double-check the calculation using Tobo et al., 2013. Figure 8 in Brasseur et al., 2022 shows good agreement between Tobo 2013 and PINC, PINCii measured INP concentrations during the instrument intercomparison.
line 209: Clarify why in Figure 5b) Tobo 2013 never performs well.
line 218: Please provide a more in-depth interpretation of the coefficients found in Table 1.
Figure 6 a), c): Check units in panel titles.
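The question under Figure 3 of whether correlations were computed on raw or log-transformed data matters for heavy-tailed quantities such as concentrations. The following minimal sketch uses synthetic log-normal data, not campaign data, and hypothetical names (x, y). It computes the Pearson coefficient both ways for the same point cloud.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(2)  # fixed seed so the sketch is reproducible

# Synthetic log-normal data: y is proportional to x with multiplicative scatter,
# i.e. a straight line of slope 1 with Gaussian noise in log-log space.
n = 500
log_x = [rng.gauss(0.0, 1.5) for _ in range(n)]
log_y = [lx + rng.gauss(0.0, 1.0) for lx in log_x]
x = [math.exp(v) for v in log_x]
y = [math.exp(v) for v in log_y]

# Same points, two coefficients: one on the raw values, one on log10 values.
r_raw = pearson(x, y)
r_log = pearson([math.log10(v) for v in x], [math.log10(v) for v in y])
print(f"n = {n}, r (raw data) = {r_raw:.2f}, r (log10 data) = {r_log:.2f}")
```

On raw data the coefficient is dominated by the few largest values, so two datasets that overlap in a log-log scatter plot can still yield different r. Stating which variant was used, together with the number of points, makes the reported values interpretable.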
Technical corrections:

line 36: "bridging" seems to be the wrong term as there is only one campaign. Maybe "extending".
line 132–134: Repetition from Sec. 2.2.
line 139: Use “INP concentration” instead of “ice nucleation activity”.
line 162: Use “contain INPs” instead of “to be ice active”.
Use either 0.5 µm or 500 nm consistently throughout.
In the reference list, the author list for Brasseur et al., 2022 is incomplete.
References:
Brasseur, Z., Castarède, D., Thomson, E. S., Adams, M. P., Drossaart van Dusseldorp, S., Heikkilä, P., Korhonen, K., Lampilahti, J., Paramonov, M., Schneider, J., Vogel, F., Wu, Y., Abbatt, J. P. D., Atanasova, N. S., Bamford, D. H., Bertozzi, B., Boyer, M., Brus, D., Daily, M. I., Fösig, R., Gute, E., Harrison, A. D., Hietala, P., Höhler, K., Kanji, Z. A., Keskinen, J., Lacher, L., Lampimäki, M., Levula, J., Manninen, A., Nadolny, J., Peltola, M., Porter, G. C. E., Poutanen, P., Proske, U., Schorr, T., Silas Umo, N., Stenszky, J., Virtanen, A., Moisseev, D., Kulmala, M., Murray, B. J., Petäjä, T., Möhler, O., and Duplissy, J.: Measurement report: Introduction to the HyICE-2018 campaign for measurements of ice-nucleating particles and instrument inter-comparison in the Hyytiälä boreal forest, Atmos. Chem. Phys., 22, 5117–5145, https://doi.org/10.5194/acp-22-5117-2022, 2022.
Paramonov, M., Drossaart van Dusseldorp, S., Gute, E., Abbatt, J. P. D., Heikkilä, P., Keskinen, J., Chen, X., Luoma, K., Heikkinen, L., Hao, L., Petäjä, T., and Kanji, Z. A.: Condensation/immersion mode ice-nucleating particles in a boreal environment, Atmos. Chem. Phys., 20, 6687–6706, https://doi.org/10.5194/acp-20-6687-2020, 2020.
Tobo, Y., Prenni, A. J., DeMott, P. J., Huffman, J. A., McCluskey, C. S., Tian, G., Pöhlker, C., Pöschl, U., and Kreidenweis, S. M.: Biological aerosol particles as a key determinant of ice nuclei populations in a forest ecosystem, Journal of Geophysical Research: Atmospheres, 118, 10,100–10,110, https://doi.org/10.1002/jgrd.50801, 2013.

Citation: https://doi.org/10.5194/ar-2026-4-RC1

Viewed

Total article views: 264 (including HTML, PDF, and XML)

HTML	PDF	XML	Total	BibTeX	EndNote
174	69	21	264	22	22

Views and downloads (calculated since 29 Jan 2026)

Month	HTML	PDF	XML	Total
Jan 2026	48	17	4	69
Feb 2026	105	46	14	165
Mar 2026	21	6	3	30

Viewed (geographical distribution)

Total article views: 259 (including HTML, PDF, and XML) Thereof 259 with geography defined and 0 with unknown origin.

Latest update: 10 Mar 2026

Short summary

Clouds in cold regions affect climate and precipitation, but their behavior depends on rare airborne particles that help ice form. We measured these particles over several months in a Finnish forest and compared them with many environmental observations. We found that ice formation in winter was largely unpredictable, while in spring and summer it was more strongly linked to particle amount and composition. This shows that local conditions are needed to better represent clouds in climate models.

