Appendix B Sampling
Robust and reliable estimation of carbon in forest systems based on sampling must consider the following principles:
Identifying population units
The population is the total set of items, or units, under consideration. Population units being sampled can range from plots to trees to points. Whatever type is chosen, the population units must be clearly identifiable, and any exclusions and their treatment noted. When sampling to calibrate an allometric model, for example, the logical unit is a tree, but care is needed in dealing with its different parts (e.g. for the roots, what is the practical minimum diameter to be considered?). Plots for measuring forest stand characteristics can vary in size, with examples ranging from 0.01 ha to over 1 ha, and can also include clusters of sub-plots (related to each other through their spatial placement) or designs where size-based sub-populations are measured only on parts of a plot. Plot shape can be related to attributes of remotely sensed data (e.g. pixel size of optical sensors); plots are usually rectangular, square or circular. Optimum size and shape of plots will vary with forest conditions, with small plots more typical in relatively homogeneous populations, while larger plots are required in tropical forests where large trees result in high spatial variation in biomass. The combination of field and RS data may require larger plots to achieve correspondence between ground conditions and the minimum mapping unit.
Selecting which individuals in the population to sample
Individuals are selected under either of two general sampling approaches: probability-based or model-based.
Probability-based approaches rely on the ability to assign a probability of selection to each individual in the population. With such probability samples, sample-based estimates of parameters such as the mean or total can be inferred to represent the entire population. For example, simple random sampling, the most basic of these designs, assigns an equal probability to each individual. More efficient design-based approaches may be employed when some structure in the population can be reliably identified. For example, stratified sampling uses strata of relatively homogeneous sub-populations to improve the efficiency of a given sampling effort. Design-based (or probability-based) inference requires probability samples, whereas model-based inference can use, but does not require, probability samples.
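As an illustration of stratified sampling, the following is a minimal sketch of a stratified estimate of a mean and its standard error. It assumes simple random sampling within each stratum, stratum weights proportional to stratum area, and no finite population correction. The function name and the plot values are hypothetical and are not taken from any particular inventory.

```python
import math
from statistics import mean, variance


def stratified_mean(strata):
    """Estimate a population mean from stratified plot data.

    strata: list of (stratum_area, plot_values) pairs. Each stratum needs
    at least two plots so that a within-stratum variance can be computed.
    Returns (estimate, standard_error).
    """
    total_area = sum(area for area, _ in strata)
    estimate = 0.0
    var_of_estimate = 0.0
    for area, values in strata:
        weight = area / total_area  # stratum weight W_h
        estimate += weight * mean(values)
        # Variance of the stratum mean, ignoring the finite population correction.
        var_of_estimate += weight ** 2 * variance(values) / len(values)
    return estimate, math.sqrt(var_of_estimate)


# Hypothetical example: two strata with areas in ha and plot carbon in t/ha.
example = [
    (600.0, [100.0, 110.0, 90.0]),
    (400.0, [200.0, 220.0, 180.0]),
]
est, se = stratified_mean(example)
print(f"stratified mean = {est:.1f}, standard error = {se:.2f}")
```

Where strata are internally homogeneous, the within-stratum variances are small and the standard error falls relative to an unstratified sample of the same size.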
Model-based sampling can be used to select individuals to help parameterize a model. For this purpose, individuals do not need to be selected using a probability-based design, but rather are often selected to cover the range over which the model will be applied. Individuals may be selected to cover critical locations in the model domain, e.g. at the extremes, inflection points or where straight line relationships are anticipated. The way the individuals for measurement are identified and located should be transparent. Once the model has been constructed, it can be used with model-based inference to infer estimates of population parameters.
These two approaches are not mutually exclusive, e.g. model-based approaches have been used within design-based approaches like stratified random sampling (Wood & Schreuder, 1986). Box 31 provides more detail on design-based and model-based sampling.
Box 31: Design-based and model-based sampling
Design-based sampling, also known as probability-based sampling, is a widely known sampling system. In this system, sample locations are selected by a predetermined random (probability-based) process. The most frequent examples are simple random sampling, systematic sampling with a randomly selected starting point, and stratified random sampling, but cluster, double and sequential sampling approaches are also common. Every possible location must have a probability greater than zero of selection into the sample, with the randomization process determining the particular sample locations. The probabilities are the sole basis for drawing conclusions or "inferences", usually formulated as probability statements, from the sample about the population size (total, mean), the proportion of the population with given characteristics (such as disturbance or occurrence of a rare species), or variance. This means that, if a sample is selected correctly according to the chosen random design, any inference based on these probabilities is valid and calculations do not rely on any assumption about the spatial distribution or other pattern in the population. Apart from measurement and observation errors and the errors from using allometric models, sampling is the only source of stochasticity considered and the effects of this uncertainty can be readily calculated. NFIs are typical probability-based sampling systems with plots established on systematic grids (with or without stratification) where the probability of selection for each plot (within a stratum) is equal and known. Probability sampling designs do not preclude unequal probabilities of selection into the sample. Examples include sampling proportional to size (as in point sampling or variable radius sampling) or proportional to a prediction (estimated volume or height, as in 3P sampling – Probability Proportional to Prediction).
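Systematic sampling with a randomly selected starting point can be sketched as follows. This is an illustrative example only: it assumes a rectangular area of interest in projected coordinates (metres) and a square grid, and the function name and extent are hypothetical. In practice, grid points falling outside the forest boundary would still need to be screened out.

```python
import random


def systematic_grid(x_min, y_min, x_max, y_max, spacing, rng=None):
    """Return grid intersections inside a rectangle, using a random start.

    The random offset in [0, spacing) along each axis is what gives every
    location a non-zero probability of selection.
    """
    rng = rng if rng is not None else random.Random()
    x0 = x_min + rng.random() * spacing
    y0 = y_min + rng.random() * spacing
    points = []
    row = 0
    while y0 + row * spacing < y_max:
        col = 0
        while x0 + col * spacing < x_max:
            points.append((x0 + col * spacing, y0 + row * spacing))
            col += 1
        row += 1
    return points


# Hypothetical 1 km x 1 km area sampled on a 100 m grid; the seed is fixed
# only so that the example is repeatable.
grid = systematic_grid(0.0, 0.0, 1000.0, 1000.0, 100.0, rng=random.Random(42))
print(len(grid), "sample locations")
```

Recording the seed or the realised starting point keeps the selection transparent and allows the same grid to be re-laid later.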
Model-based sampling systems hypothesise the existence of a model that relates predictor (X or independent) variables to the response (Y, or dependent) variables of interest. A sample is drawn to allow inferences about this model, and the distribution of data around the model predictions. Two types of inference are therefore made under model-based sampling, concerning: (i) the values at locations unvisited during sampling; and (ii) parameters of the model, including the confidence intervals of the parameterised model. Estimates of the mean Y in a model-based system would be based on the inferences about the model at the value of the mean X. For example, a model-based system that uses LIDAR as a predictor variable might rely on an assumption that biomass is linearly related to the mean height above the ground of the returns per unit area. A purposive sample of field locations could be drawn to parameterise this model and the mean biomass of the forest could be estimated from this parameterised model and the mean LIDAR return over the entire forest. Accuracy of these estimates would depend on the legitimacy of the assumed model and the actual sample locations (within the model space). Inferences at specific locations could also be made although these will be less precise than the population mean estimates. Model-based systems do not assume that the probabilities of any sample location (pair of X and Y variables) are determined by the design, but rather they are an outcome of the chosen random model – for any given X, the Y values are likely to be centred around the model mean. Where the variation in Y around the model prediction is less than the total variation in Y, model-based systems can provide increased precision of estimates.
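The LIDAR example in the box can be sketched in code. The snippet below is a minimal illustration, assuming a simple linear model fitted by ordinary least squares. The function names and all data values are hypothetical, and a real application would also need to quantify the uncertainty of the fitted parameters.

```python
def fit_line(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_mean(intercept, slope, population_mean_x):
    """Estimate mean Y from the model evaluated at the population mean of X."""
    return intercept + slope * population_mean_x


# Hypothetical purposive sample chosen to span the range of canopy heights.
lidar_height_m = [4.0, 9.0, 15.0, 22.0, 30.0]
plot_biomass_t_ha = [45.0, 95.0, 160.0, 225.0, 310.0]
intercept, slope = fit_line(lidar_height_m, plot_biomass_t_ha)

# Hypothetical mean LIDAR return height over the entire forest.
forest_mean_height_m = 17.5
print(predict_mean(intercept, slope, forest_mean_height_m))
```

The estimate is only as good as the assumed linear relationship, which is why the sample should cover the range of X over which the model will be applied.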
Selecting the number of individuals to sample
To select a ground sample, the first step is to determine the sample size, n, which is usually predetermined. Predetermined sample size approaches include those where: (i) the sample size is fixed by the available budget or the need for historical consistency; (ii) a systematic approach to sample selection is adopted (e.g. by use of a spatial grid of predetermined resolution); (iii) a prior estimate has been made of the number required to produce usefully precise estimates. Predetermined sample sizes to produce usefully precise estimates for the targeted population (or sub-population or stratum), or for parameter estimation in the case of model-based sampling, must be based on estimates of the variability of the (sub-)populations, which may be available from existing data or reconnaissance surveys. Useful estimates are often defined in terms of the desired precision, which in many cases is taken by default to be a confidence interval half-width of 10% of the mean at the 95% confidence level. The estimated sample size required under simple random sampling of a population (or a stratum within a population) is:
$$ n = \left( \frac{t \, \sigma}{P} \right)^{2} $$
where σ is the sample standard deviation expressed as a percentage of the mean when the sample is used alone to produce an estimate, or the standard deviation of the residual errors if the sample is used in combination with auxiliary data (e.g. remotely sensed data or existing maps) to produce an estimate; P is half the width of the confidence interval, also expressed as a percentage of the mean; and t is taken from the t distribution with degrees of freedom equal to n minus the number of parameters being estimated, at the desired significance level, commonly 0.05, corresponding to 95% confidence. Because t itself depends on n, the calculation is repeated until n stabilises. Sample sizes to detect rare occurrences (e.g. disturbance in forests such as deforestation) may need to be relatively large under simple random sampling designs. For example, a sample size of n > 300 is required if annual levels of forest disturbance were expected to be only about 1% of the population units, and sample units were selected via simple random sampling. Stratified sampling can increase efficiency significantly.
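The sample size calculation for simple random sampling can be sketched as follows, assuming the usual form n = (tσ/P)². Two simplifications are made here and should be read as assumptions: the normal quantile is used as a large-sample stand-in for t (which avoids the iteration over degrees of freedom), and the rare-event calculation interprets "required" as the sample size giving a 95% chance of observing at least one disturbed unit. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(sd_percent, half_width_percent, confidence=0.95):
    """Approximate n = (t * sigma / P)^2, with sigma and P in % of the mean.

    Uses the standard normal quantile in place of t, so the result is
    slightly too small when n is small.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sd_percent / half_width_percent) ** 2)


def detection_sample_size(proportion, confidence=0.95):
    """Smallest n for which P(at least one rare unit in the sample) >= confidence.

    Assumes independent draws, i.e. a large population relative to n.
    """
    return math.ceil(math.log(1 - confidence) / math.log(1 - proportion))


# Hypothetical inputs: 50% coefficient of variation, target of +/-10% of the mean.
print(required_sample_size(50, 10))   # 97
# Disturbance affecting about 1% of population units.
print(detection_sample_size(0.01))    # 299
```

Under these assumptions the rare-event result is of the same order as the figure of about 300 quoted in the text, and the first result shows how quickly n grows as variability increases or the target interval narrows.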
Supplementary sampling may be used where an NFI or other extensive plot-based measurement system with a predefined sample size is already in place but does not adequately cover the whole population, or results in a precision that is too low to be reliable for the proposed forest monitoring system. Given the need for random selection (ability to determine the individuals to be selected) in probability sampling, the selection of additional sample units will be difficult in some circumstances. Where a systematic approach to sampling was originally used (e.g. sample locations at the intersection of a regularly spaced grid that was randomly overlaid on the population), additional sampling points can be assigned as an extension of that grid into areas originally excluded. Such an extension is particularly relevant when individuals in the original sample had been excluded due to tenure (e.g. by not including land managed by an Agricultural or Conservation Department even though it included forest by the national definition). The extended areas should maintain a separate identity if a stratified approach is used, but the systematic grid may be manipulated (e.g. by selecting only every 2nd intersection) to ensure the sample size within the new stratum is appropriate (the number of samples per ha does not need to be constant between strata). Alternatively, if the stratum boundaries have not altered since the original sample but it has been determined that the precision of the stratum parameter estimates is insufficient, additional sample units can be selected using the original sampling approach (e.g. truly random or, more commonly, re-laying the same systematic grid but randomly choosing additional intersection points).
Where the original sample was not systematic and the population or strata boundaries have changed, it is difficult to add sample units under a design-based approach. One possibility could be to draw an entirely new probability sample, calculate estimates from each sample separately, and then combine the estimates. Otherwise a model-based approach may be more appropriate. The original sample data may be used to parameterise the hypothesised model, with additional sample units chosen to improve the precision of the inferences about that model. For example, the original sample may be used to parameterise a model that relates LIDAR data or canopy characteristics to plot measurements of carbon. Additional plots should be established in strata not included in the original sample to ensure the hypothesised model is appropriate for the extended population. Under a model-based system, the additional sample units need not use the original method of sample selection, as inferences are not based on the selection design. Consequently, if the inferences about the model are insufficiently precise (e.g. confidence limits of the model around the strata mean are too wide), then additional, ad hoc, sample points can be added provided they use the same plot measurement protocols as the original sample. Under a model-based approach using a linear model, additional sample units that add the most information tend to be those measured at the extremes of the independent value range (e.g. tallest forests as determined by LIDAR), although sampling covering the full range of dependent variables, irrespective of how the underlying population is clumped along this range, is useful to ensure the model is appropriate.
Using sample measurements to make inferences about the target population
The number of individuals selected for field measurement must be sufficient to make it likely that estimates of population means and sampling errors will be sufficiently accurate and precise to cover the variability within the population of interest.
Where population parameters are estimated from the sum of sub-samples or separate models or relationships, double counting of pools must be avoided. All errors must, as far as possible, be identified and quantified. These include sampling errors, measurement errors, and model errors.
Effective application of sampling strategies and models often relies on stratification by climate (rainfall, temperature) or broad environmental conditions (altitude, topography, soil type), possibly integrated into bio-geo-climatic zones. Such data may also be used directly to develop growth indices (e.g. net primary productivity) or as input into growth models or for prediction of carbon allocation ratios. Networks of weather stations and historical records can be enhanced through spatial modelling approaches to develop climate surfaces for use as input into models or for more effective stratification.
Permanent plots can be used to improve the accuracy of change estimation when repeatedly measured over time. However, if these plots are treated in a way that is different from the rest of the forest (e.g. not harvested or thinned in the same way), or if the original population changes due to the removal of specific types of land without a corresponding removal of plots, the permanent plot sample will no longer be representative of the current forest. Remotely sensed data, such as canopy cover or disturbance, may be used to determine whether the permanent plots have been treated in a non-representative fashion. If the permanent plots are no longer representative of the larger forest, then new plots may be required to represent the current condition more accurately. If a subset of the already established plots continues to be representative, these can continue to be used by regarding them as a stratum or strata.
Alternatively, permanent plots may be incorporated into an approach whereby models and remotely sensed auxiliary variables are used to increase precision. Sampling with partial replacement, in which a proportion of plots is replaced in each measurement period, has been used in the past as a compromise between estimating change and estimating current condition, but such systems have generally been found to be complex and difficult to maintain.