the Creative Commons Attribution 4.0 License.
Predicting Ice Nucleation Particle properties in a Boreal Environment using machine learning
Abstract. Mixed-phase clouds, which are dominant in mid- and high-latitude regions, strongly influence Earth’s radiative balance and precipitation processes. Their formation depends critically on the presence of ice-nucleating particles (INPs), which are rare relative to cloud condensation nuclei. The HyICE-2018 measurement campaign took place at the SMEAR II station in the high-latitude boreal forest of Hyytiälä, Finland, between February and June 2018. Two continuous-flow diffusion chambers Portable Ice Nucleation Chamber I and II (PINC and PINCii) with high-frequency sampling were deployed to measure INP concentrations. We applied machine-learning techniques to explore predictors of INP variability using more than 500 high-resolution atmospheric, aerosol, and ecosystem variables measured continuously at Station for Measuring Ecosystem-Atmosphere Relations (SMEAR) II. We identify distinct differences between winter and spring/summer measurements. The winter measurements conducted with PINC appear to be nearly independent of any monitored variable. In contrast, the spring/summer measurements conducted with PINCii appear to be more closely linked to and responsive to ambient aerosol properties. Furthermore, we find that classical parameterizations based on particle concentration overestimate observed INP concentrations in the boreal environment. However, similar empirical fits based on local proxies, such as a marker of biogenic aerosol or nitrate, yield improved agreement during spring and summer, while no improvement occurs during winter. These results underscore the need for site-specific parameterizations to capture INP variability in the complex boreal environments.
Competing interests: At least one of the (co-)authors is a member of the editorial board of Aerosol Research.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.
Status: final response (author comments only)
- RC1: 'Comment on ar-2026-4', Anonymous Referee #1, 05 Mar 2026
- RC2: 'Comment on ar-2026-4', Anonymous Referee #2, 18 Mar 2026
This study investigates potential correlations between INP concentrations and other variables measured at the SMEAR station during the HyICE-2018 campaign. The general idea is to also look at variables that are not usually investigated as candidates for influencing INP concentrations and see which ones correlate. The authors also compare the performance of INP parameterisations for their case and derive a modified parameterisation for their case.
While the idea of the study is straightforward and sensible, and I like the open-mindedness behind the screening approach, the way it was conducted (or at least how it is represented in the paper) is not. The major points to correct or explain are:
Screening of variables: methodology
If I understand correctly, several ML techniques were used to scope variables of interest, i.e. those that have some correlation with INP concentrations.
- While you mention their names around l. 129, no further methodological information is given. Please elaborate on that in a separate methodology section.
- Where can we see that the separate treatments resulted in qualitatively similar results (l. 131 and l. 159)? Fig. 2 only shows one of these, and a comparison should at least be part of the Appendix.
- How was the subset (of the 84 variables that you analysed) selected for Fig. 2? Are these the ones with the highest importance?
Uptake of screened variables
From the variables in Fig. 2, you make another selection for Fig. 3. l. 144 states that these were selected "based upon their highly ranked outcomes and physical intuition". At first glance, this appears a reasonable and often-used approach, but in the context of your study it seems to invalidate the idea that the study is based on: screening as many variables as possible. If you exclude those variables that you didn't expect to show a correlation (physical intuition) after the screening, why include them in the first place?
Many of the variables that appear with high importance in Fig. 2 also remain unmentioned. I think at the least you ought to explain why you exclude them from further analysis (perhaps in a Table?).
Title and framing
The title suggests that you used machine learning extensively and that you are predicting INP properties. In my reading, this does not reflect what you did and what is explained in the manuscript:
- the ML was only used to screen variables and not documented well (see above)
- your modified parameterisation shows only modest skill, and only for spring measurements. It is also only a small part of the study.
- The only thing you are predicting is INP concentration, and thus only one INP property.
This can easily be read as overselling, and while the title may have once been the aim of your research, it does not reflect the results that you generated, nor the main content of the paper. To me what stands out is that many of the variables do not have a correlation with INP concentrations, and thus especially in winter they remain unpredictable. As reviewer 1 noted, this "null result" is an interesting result to share.
As you state in the conclusion, "no single parameter emerges that is strongly linked to INP". The paper title and framing need to reflect this.
Conclusion from the null results
As I stated above, I think the "null result" is an important one. However, the conclusion jumps from "drawing strong conclusions that can illuminate causality will likely remain illusive" to suggesting more, longer and heavily-equipped measurements. Granted, this is the default conclusion that atmospheric scientists like to call for, but given your sobering findings, wouldn't the opposite make more sense? More and more measurements will NOT magically "help the community to identify key predictor parameters".
Why do you think you couldn't find a correlation? What does this imply about the nature of INPs? What could one/the field do differently?
I realise that I might sound polemical here, but I encourage the authors to think beyond standard recipes and consider perhaps uncomfortable but sound conclusions from their findings.
New parameterization does not respond to predictors
l. 222 "PINC measurements ... have almost no response to the predictors": it's the predicted INP that don't vary, while the measured INP do. However, I don't understand how this can be the case? Seeing the simple parameterisation formula, it implies that the predictors themselves don't vary. Is that correct? Fig. 1 implies otherwise, at least the particle mass is varying quite a bit there, so I do not understand why a parameterisation based on it would give constant results.
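The question above can be made concrete with a short worked relation. This is only a sketch: it assumes the modified parameterisation has the two-parameter power-law form that Referee #1 describes for Table 1, and the symbols a, b and x are placeholders rather than the manuscript's notation.

```latex
n_{\mathrm{INP}} = a\,x^{b}
\quad\Longrightarrow\quad
\Delta \ln n_{\mathrm{INP}} = b\,\Delta \ln x .
```

Under this assumed form, near-constant predicted INP concentrations require either a nearly constant predictor x or a fitted exponent b close to zero. Reporting the fitted exponents alongside the range of each predictor would show which of the two applies.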
MINOR POINTS
In the methodology you explain that many of the variables that could have potentially been explored were excluded due to excessive NaN values or little variability, which left you with 84 variables. I think this is the number that should be mentioned in the abstract, rather than a grand 500 variables (l. 7).
l. 7 and throughout: aren't INP concentrations what you're after? I don't see you using a measure of variability anywhere, nor do you correlate with variability itself.
Introduction
l. 18: they don't need to "form" below 0°C, the cloud forms as the cloud droplet forms.
l. 20: "would prefer to be ice" -- "school language"
l. 20: The link between the sentences is missing: water would "prefer to be ice", yet it is not!
l. 23: what is meant with "underlying"?
l. 24: "only" - ice formation or ice in MPCs can also occur because of ice sedimenting into the cloud from another cloud above
l. 34: give citations for ICOS, ACTRIS and SITES (and explain the acronyms)
l. 35: what were the most important take-aways (about INPs) from these other campaign studies?
l. 40 "INP data": as for "variability" above, be specific.
l. 41 "even": I would argue that "especially" fits better, because the more variables you try (and the less physical intuition you have for them, see above), the more likely it is that you find spurious correlations. This is a wider issue that I think could be addressed or at least mentioned somewhere in the manuscript.
Missing: since you draw on "physical intuition" to argue for the variables that you include in Fig. 3, I miss background on that intuition in the introduction. What have INPs been correlated with before?
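The concern about spurious correlations raised for l. 41 can be illustrated with a small simulation. This is a minimal sketch on purely synthetic noise, not on the HyICE-2018 data. The sample size of 200, the seed and the function names are assumptions chosen for illustration; only the count of 84 predictors mirrors the number of screened variables. It counts how many mutually unrelated random predictors pass an approximate 5 % significance threshold for Pearson's r by chance alone.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_chance_hits(n_predictors=84, n_samples=200, seed=0):
    """Count unrelated random predictors whose |r| passes a ~5 % two-sided test."""
    rng = random.Random(seed)
    # Large-sample approximation of the critical |r| at the 5 % level.
    r_crit = 1.96 / math.sqrt(n_samples)
    target = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    hits = 0
    for _ in range(n_predictors):
        predictor = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        if abs(pearson_r(predictor, target)) > r_crit:
            hits += 1
    return hits


if __name__ == "__main__":
    # Averaged over many seeds, roughly 0.05 * 84 predictors pass by chance.
    print(count_chance_hits())
```

A screening of many variables would therefore be expected to flag a handful of predictors even when none is related to the target, which is one reason to report a multiple-testing correction or a permutation baseline alongside the importance ranking.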
Methods and Data
l. 50: you say "two" but list 3
l. 53-58 and 73-74: there is no need to repeat the introduction; move this content there if it hasn't been covered.
l. 68: in principle I think all data that you use should be described with this much information, or else please explain why you describe only this one and where information on the others can be found.
l. 104: please state the main take-aways from the comparison
l. 110-111: what do you mean by "slightly greater variability"? Can you quantify this, for example by giving a range or fitting a distribution?
Results
Fig. 1 caption: I don't see squares for PINCii in a? Why are there no measurements of fluorescent biological aerosol particles before 03-15 in c?
l. 132 - 137: repetition or move to Methods if it hasn't been said before
l. 142: related to the major point above, why don't you investigate the link between surprising variables and INP concentrations more closely? This would add value to your idea of sampling as many variables as possible.
l. 145: you give co-variance as an example for exclusion, but do not state it as a reason for exclusion in the sentence prior.
Fig. 2: the comparison to Fig. 3 is difficult. Could you highlight the variables that you chose to investigate in detail in Fig. 3?
Fig. 2 y-axis: Many of these variables need further specification, for example "nitrate" what? Number concentration? Also, these variables need more detail, which you do give nicely in the caption of Fig. 3, for the variables that are included there.
Fig. 2 caption: The explanation of the decision tree belongs in the Methods section (and, as stated above, the other ML methods need to be explained there as well).
l. 152 - 156: repetitive of what's been said just before
Fig. 3: You may want to remind the reader that PINC corresponds to winter and PINCii to spring measurements.
l. 160: why would a method that shows you correlations be enough to "shed light on causation"? Please explain what made you reason that the correlations you find do not reflect causation.
l. 167: Why do log-normal frequency distributions imply long-range transport?
Is your point here that the absence of a correlation with a co-located variable may be due to the INPs stemming from long-range transport?
Could you substantiate this hypothesis, for example with backtrajectories?
If this is your main hypothesis (which I do find interesting and important), please highlight it in the conclusions. Also, doesn't this render your parameterisation attempts, which follow, less promising from the get-go?
l. 188 "likely": with all the data that you have at your disposal, can't you confirm whether this is the case?
Also, the idea that your site is far away from dust sources stands in opposition to your idea that long-range transport is a source of INPs.
l. 207: I don't understand why one would expect them to capture it. After all, you see limited correlations, so why would a parameterisation based on those weakly correlating variables have any chance? Please connect the approach of testing the parameterisations logically to your findings from previous sections.
l. 209: Why the disagreement with previous findings? Is that due to the time resolution of your measurements?
Caption of Table 1: no, in my understanding you did not select based on high scoring alone, but also based on "physical intuition" (that requires more explanation, see above)
l. 233: This seems like an easy thing to test given all the data you have available, for example by correlating nitrate and acetone concentrations (again, please specify which "nitrate" property you mean) with aerosol burden.
l. 238: what is the "intrinsic natural variability of INP"? Isn't this what you've been trying to explain? How could that be intrinsic? It needs to be linked to something, doesn't it?
Fig. 6 caption, last sentence: this refers only to the boreal atmosphere in spring.
Conclusion
l. 230: What do you mean by "qualitatively linked"? The links that you get with your ML are quantitative, aren't they?
Open Research Section
Please share the scripts you've been using for the ML (or at least documentation thereof, the packages, ...) and the filtering.
Language
l. 6: ',' around "Portable Ice Nucleation... (PINC and PINCii)"
l. 8: "the" Station
l. 12: aerosol particle concentration
l. 30: "to focus" -> with a focus
l. 59: singular purpose
l. 131: "as illustrated" -> "which are illustrated". See above: Fig. 2 illustrates results for one methodology, not how they compare.
l. 207: "both agrees and disagrees"; I think what you mean here is that they disagree with Tobo performing well and agree with DeMott underpredicting. If you do, please simplify that statement accordingly.
l. 214 algorithms: again, in this manuscript you only show one
Citation: https://doi.org/10.5194/ar-2026-4-RC2
Viewed
- HTML: 236
- PDF: 86
- XML: 33
- Total: 355
- BibTeX: 32
- EndNote: 29
Review of “Predicting Ice Nucleation Particle properties in a Boreal Environment using machine learning” by Wu et al.
General comment:
The manuscript reports on the feasibility of predicting INP concentrations measured at -31°C and -32°C during the HyICE-2018 campaign in the boreal forest from 84 complementary variables measured at the site. The paper's honest takeaway is that no variable strongly predicts INP in the boreal winter, and only moderate skill (R²≈0.5) is achieved in spring/summer. This is an important negative result and is nicely summarized at the end of the introduction, telling the community that even many co-located variables at one of the best-instrumented stations cannot explain INP variability in general. The authors could commit more fully to this message. The paper is framed around "predicting INP using machine learning", yet the conclusion effectively states that strong links remain "abstruse." This is a cautionary study about the limits of data-driven approaches for INP prediction, which is arguably more useful to the community than a modest positive result would be. However, although the paper presents itself as a machine learning study, the ML algorithms are used exclusively as importance-ranking tools, not for predictive modeling. The more closely investigated relations (Table 1) are simple power-law fits with only two parameters, no different from classical empirical fitting. There is no actual ML-based predictive model evaluated with train/test splits. With data from 84 variables, a proper ML regression model (e.g., gradient boosting or random forest regressor) could be benchmarked against the power-law fits. As the paper stands, the title overstates the ML contribution.
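The benchmarking suggested above could be structured roughly as follows. This is a minimal sketch on synthetic data, not the authors' pipeline. To stay dependency-free it uses a k-nearest-neighbour regressor as a plainly labelled stand-in for a tree ensemble such as a random forest or gradient boosting, and the data-generating power law, the noise level, the split fraction and all function names are assumptions made for illustration. The point is only that both models are fitted on the same training subset and scored on the same held-out subset.

```python
import math
import random


def fit_power_law(x, y):
    """Least-squares fit of log10(y) = log10(a) + b * log10(x); returns (a, b)."""
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    sxx = sum((u - mx) ** 2 for u in lx)
    b = sxy / sxx
    a = 10 ** (my - b * mx)
    return a, b


def knn_predict(x_train, y_train, x_new, k=5):
    """k-nearest-neighbour regression in log space (stand-in for a tree ensemble)."""
    pairs = [(math.log10(u), math.log10(v)) for u, v in zip(x_train, y_train)]
    preds = []
    for xq in x_new:
        lq = math.log10(xq)
        nearest = sorted(pairs, key=lambda p: abs(p[0] - lq))[:k]
        preds.append(10 ** (sum(p[1] for p in nearest) / len(nearest)))
    return preds


def r2_log(y_true, y_pred):
    """Coefficient of determination computed on log10-transformed values."""
    lt = [math.log10(v) for v in y_true]
    lp = [math.log10(v) for v in y_pred]
    mean = sum(lt) / len(lt)
    ss_res = sum((t - p) ** 2 for t, p in zip(lt, lp))
    ss_tot = sum((t - mean) ** 2 for t in lt)
    return 1.0 - ss_res / ss_tot


def benchmark(seed=1, n=300, test_fraction=0.3):
    """Fit both models on a training subset and score them on held-out data."""
    rng = random.Random(seed)
    # Synthetic predictor and target: y = 2 * x^0.8 with log-normal scatter.
    x = [10 ** rng.uniform(-1.0, 2.0) for _ in range(n)]
    y = [2.0 * v ** 0.8 * 10 ** rng.gauss(0.0, 0.3) for v in x]
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(n * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    x_tr, y_tr = [x[i] for i in train], [y[i] for i in train]
    x_te, y_te = [x[i] for i in test], [y[i] for i in test]
    a, b = fit_power_law(x_tr, y_tr)
    scores = {
        "power_law": r2_log(y_te, [a * v ** b for v in x_te]),
        "knn": r2_log(y_te, knn_predict(x_tr, y_tr, x_te)),
    }
    return (a, b), scores


if __name__ == "__main__":
    params, scores = benchmark()
    print(params, scores)
```

The random shuffle here is appropriate only because the synthetic samples are independent. For autocorrelated campaign time series, holding out contiguous blocks in time would likely give a more honest estimate of predictive skill.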
Specific comments:
line 15: Specify how the results underscore the need for site-specific parameterizations and suggest on which variables parameterizations could instead be based, considering that none of the 84 variables included here work.
line 20: Explain why the coexistence is inherently unstable. In the same sentence it is mentioned that mixed-phase clouds persist for days, which seems to contradict this statement.
line 22: The Arctic is considered a high-latitude region, not a region beyond high latitudes.
line 23: Summarize what the underlying amplifying feedbacks are.
line 26: Summarize how CCN and INP are fundamental and to which cloud processes.
line 27: Clarify the difference between INP occurrence and abundance.
line 33: Providing some more details on the instrumentation, specifically about the instruments measuring the 84 variables used in this study, would be helpful. This could be done in a supplementary table including the instrument name, brand, sampling frequency, and volume/flow.
line 50, 75: Specify the time resolution with which the CFDCs measure INP concentrations and explain why the data are averaged over 20 min or 1 hour (Figure 3). Figure 1 shows INP concentrations often reaching hundreds per liter. With a sample flow of 1 L/min there should be plenty of signal in 1 min averages or even 10 s averages, which would substantially increase the number of INP data points available for analysis. Additionally, correlations of ambient measurements depend on the time resolution or averaging interval. The temporal scale at which a correlation is seen also identifies the scale of the process that drives the changes in variables. Investigating correlations at different time resolutions, which seems possible with this dataset, could be interesting for the very variable INP concentrations. Such an analysis could be added, and the dependence of correlation analyses on the temporal data resolution should be discussed.
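The dependence of correlation on averaging interval mentioned above can be sketched as follows. This is an illustrative example on synthetic series, not on the campaign data: the 1-min base resolution, the 600-min period of the shared signal, the noise level and the function names are all assumptions. Two series share a slow signal and carry independent fast noise, and Pearson's r is computed after block-averaging at several window lengths.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def block_mean(series, window):
    """Average consecutive, non-overlapping blocks of `window` samples."""
    n_blocks = len(series) // window
    return [
        sum(series[i * window:(i + 1) * window]) / window
        for i in range(n_blocks)
    ]


def correlation_by_window(x, y, windows):
    """Pearson's r between x and y after averaging at each window length."""
    return {w: pearson_r(block_mean(x, w), block_mean(y, w)) for w in windows}


def synthetic_pair(n=6000, seed=2):
    """Two series sharing a slow signal plus independent fast noise."""
    rng = random.Random(seed)
    slow = [math.sin(2 * math.pi * t / 600) for t in range(n)]
    x = [s + rng.gauss(0.0, 2.0) for s in slow]
    y = [s + rng.gauss(0.0, 2.0) for s in slow]
    return x, y


if __name__ == "__main__":
    x, y = synthetic_pair()
    for w, r in correlation_by_window(x, y, [1, 20, 60]).items():
        print(f"{w:>3d}-sample average: r = {r:.2f}")
```

In this construction the correlation strengthens as the averaging window grows, because averaging suppresses the independent noise faster than the shared slow signal. Real data may behave differently, which is why reporting correlations at several resolutions would be informative.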
Section 2.2.: Currently, the ML techniques are not explained in this section. Consider changing the section title to "Complementary Data Selection". Add more details about how datasets were processed for the analysis, for example how differences in sampling frequency, sample volume, and cutoff sizes were handled.
Add a separate methodology section about ML algorithms. The abstract (line 7), introduction (line 38), and Sec. 3.2 explicitly refer to several machine learning algorithms or analysis techniques. Add some information about these algorithms and techniques, for example feature selection criteria, to make the approach reproducible.
line 73: Explain which hypotheses are tested to illuminate the sources and mechanisms, for example including organic matter to test whether NPF generates INPs.
line 76: Clarify what is meant by "straightforwardly intercompared".
Section 2.3.: Clarify why the details on the CFDC chambers are relevant for this paper. None of the details are referred to later.
line 107: Contrary to what is stated here, Brasseur et al., 2022 mention a sampling window of 15 min for PINCii.
Figure 1: Indicate the temperature at which INP concentrations were measured. Check the units in c); I assume this should be the same data as in Brasseur et al. 2022 Fig. 6c). In e), the last line is shown with very weak colours.
line 119–120: Clarify why NPF is relevant here.
line 122: What relationships can be observed in Figure 1 that motivate the analysis?
line 140: The references attribute bio-INP to much lower concentrations at higher temperatures than below -30°C. Provide a supporting citation for biological particles contributing substantially at low temperatures.
line 144: Only half of the variables are listed in Figure 2, while others are selected by intuition. Specify for each variable based on which information or hypothesis it was selected.
line 148ff: That BC ranks highly in Fig. 2 and yields the highest adjusted R² of all predictors for PINCii (Table 1), yet is a poor INP in laboratory studies, is a provocative finding that could be explored beyond noting that it is surprising and pointing to aging and oxidation as possible explanations. It could be mentioned that Paramonov et al. (2020) also reported that BC correlated well with INP concentrations on a short timescale (their Fig. 5).
In general, the paper would be strengthened if connections to findings from previous HyICE articles were integrated in more detail.
line 151ff: Clarify which instruments were used to measure the variables on the horizontal axes. For example, are >500 nm data from APS or WIBS?
line 155, Table 1: Previously, it is implied that organic mass and >500 nm concentration were included based on intuition and not a high-skill ranking.
Figure 3 and related interpretation: Explain why the data are split for the Pearson correlation analysis if the goal is to investigate the variables' predictive power for INP concentrations. The difference between PINC and PINCii data shows that the correlations are only good for a subset, not generally. Provide a discussion on what this implies for the many campaign-based INP correlations reported in the literature.
Figure 3: For all panels, consider marking significant r values with an asterisk instead of reporting absurdly small p values. a), c): Check units on the horizontal axes. The different Pearson correlations for PINC and PINCii data, which seem to overlap for the most part in the log–log scatter plots, are surprising. Clarify if the correlation was calculated using the raw data or the log-transformed data. Consider showing the data on a linear scale. Provide the number of PINC and PINCii data points. Are they the same in each panel? Why are hourly mean data used here, and does using the mean affect the outcome of the analysis compared to using 20 min data?
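Why the raw-versus-log question matters can be shown with a short sketch. It uses synthetic, strictly positive data with multiplicative scatter; the exponent, the scatter widths, the sample size and the function names are assumptions for illustration and are not taken from the manuscript. The same data set is scored once on raw values and once on log10-transformed values.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def raw_and_log_r(x, y):
    """Return Pearson's r for the raw values and for their log10 transforms."""
    r_raw = pearson_r(x, y)
    r_log = pearson_r(
        [math.log10(v) for v in x],
        [math.log10(v) for v in y],
    )
    return r_raw, r_log


def synthetic_lognormal(n=500, seed=3):
    """Power-law related, strictly positive data with multiplicative scatter."""
    rng = random.Random(seed)
    x = [10 ** rng.gauss(0.0, 0.5) for _ in range(n)]
    y = [v ** 0.7 * 10 ** rng.gauss(0.0, 0.4) for v in x]
    return x, y


if __name__ == "__main__":
    x, y = synthetic_lognormal()
    r_raw, r_log = raw_and_log_r(x, y)
    print(f"raw: r = {r_raw:.2f}, log10: r = {r_log:.2f}")
```

For skewed data the two coefficients generally differ, and the raw-value coefficient tends to be dominated by a few large values. Stating which one is reported in Figure 3 would remove the ambiguity.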
line 192ff: Eq. (2) reproduces Eq. (2) from Tobo et al., 2013, which uses nAP > 500 nm. However, the text seems to imply that Eq. (3) uses the number of fluorescent particles. How would the updated Eq. (3) from Tobo et al., 2013 perform? Which formula was used for Figure 8 in Brasseur et al., 2022?
Figure 4: a) Would p = 1 not require χ² = 0? Double-check the values. b) Clarify why χ² for PINC data is huge and p = 0. Double-check whether the fitting and statistics were performed correctly. The width of component 1 and the location of the second mode for fitting PINC data do not seem optimal. The χ² and visual inspection do not support the statement in the caption that tails are better resolved by the bi-modal fit.
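For reference, the relation behind the first question can be written out. This assumes that the reported p is the usual upper-tail probability of a χ² goodness-of-fit test with ν > 0 degrees of freedom; the test actually used in the manuscript is not stated in this comment, so that is an assumption. Here F_ν is the χ² cumulative distribution function and Q is the regularized upper incomplete gamma function.

```latex
p = \Pr\left(X \ge \chi^2_{\mathrm{obs}}\right)
  = 1 - F_{\nu}\!\left(\chi^2_{\mathrm{obs}}\right)
  = Q\!\left(\tfrac{\nu}{2}, \tfrac{\chi^2_{\mathrm{obs}}}{2}\right),
\qquad
p = 1 \iff \chi^2_{\mathrm{obs}} = 0 .
```

Under this definition, a value printed as p = 1 with a non-zero χ² can only arise from rounding, which happens when χ² is much smaller than ν. Conversely, a χ² that is very large relative to ν drives p towards 0, so stating ν alongside χ² and p would make both panels easier to check.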
line 198: Explain why the CFDCs operating above water saturation were not measuring immersion freezing comparable to the parameterization derived from datasets obtained with INSEKT. It would be interesting to see how well the parameterizations perform at low temperatures.
Figure 5: Double-check the calculation using Tobo et al., 2013. Figure 8 in Brasseur et al., 2022 shows good agreement between Tobo 2013 and PINC, PINCii measured INP concentrations during the instrument intercomparison.
line 209: Clarify why in Figure 5b) Tobo 2013 never performs well.
line 218: Please provide a more in-depth interpretation of the coefficients found in Table 1.
Figure 6 a), c): Check units in panel titles.
Technical corrections:
line 36: "bridging" seems to be the wrong term as there is only one campaign. Maybe "extending".
line 132–134: Repetition from Sec. 2.2.
line 139: Use “INP concentration” instead of “ice nucleation activity”.
line 162: Use “contain INPs” instead of “to be ice active”.
Wherever it occurs, use either 0.5 µm or 500 nm.
In the reference list, the author list for Brasseur et al., 2022 is incomplete.
References:
Brasseur, Z., Castarède, D., Thomson, E. S., Adams, M. P., Drossaart van Dusseldorp, S., Heikkilä, P., Korhonen, K., Lampilahti, J., Paramonov, M., Schneider, J., Vogel, F., Wu, Y., Abbatt, J. P. D., Atanasova, N. S., Bamford, D. H., Bertozzi, B., Boyer, M., Brus, D., Daily, M. I., Fösig, R., Gute, E., Harrison, A. D., Hietala, P., Höhler, K., Kanji, Z. A., Keskinen, J., Lacher, L., Lampimäki, M., Levula, J., Manninen, A., Nadolny, J., Peltola, M., Porter, G. C. E., Poutanen, P., Proske, U., Schorr, T., Silas Umo, N., Stenszky, J., Virtanen, A., Moisseev, D., Kulmala, M., Murray, B. J., Petäjä, T., Möhler, O., and Duplissy, J.: Measurement report: Introduction to the HyICE-2018 campaign for measurements of ice-nucleating particles and instrument inter-comparison in the Hyytiälä boreal forest, Atmos. Chem. Phys., 22, 5117–5145, https://doi.org/10.5194/acp-22-5117-2022, 2022.
Paramonov, M., Drossaart van Dusseldorp, S., Gute, E., Abbatt, J. P. D., Heikkilä, P., Keskinen, J., Chen, X., Luoma, K., Heikkinen, L., Hao, L., Petäjä, T., and Kanji, Z. A.: Condensation/immersion mode ice-nucleating particles in a boreal environment, Atmos. Chem. Phys., 20, 6687–6706, https://doi.org/10.5194/acp-20-6687-2020, 2020.
Tobo, Y., Prenni, A. J., DeMott, P. J., Huffman, J. A., McCluskey, C. S., Tian, G., Pöhlker, C., Pöschl, U., and Kreidenweis, S. M.: Biological aerosol particles as a key determinant of ice nuclei populations in a forest ecosystem, Journal of Geophysical Research: Atmospheres, 118, 10,100–10,110, https://doi.org/10.1002/jgrd.50801, 2013.