elocation-id: e3618
When sampling pests at low densities, it is common to obtain a large number of zeros. Such data are difficult to handle because the Poisson and negative binomial probability distributions are not suitable for modeling them and equations to estimate the optimal sample size are not available. The objectives of this study were to model the excess of zeros, to estimate the parameters of the zero-inflated Poisson and zero-inflated negative binomial distributions by the methods of moments and maximum likelihood, and to derive equations to calculate the optimal sample size. Systematic sampling was used to select 100 trees per grove of Río Red grapefruit (Citrus paradisi Macfad) at Finca Sayula, Veracruz, Mexico (latitude 19.20722, longitude -96.35194), in June and July 2021 and January 2022. The numbers of leafminers (Phyllocnistis citrella Stainton) and aphids (Toxoptera citricida Kirkaldy) present on three leaves per shoot per tree, taken as the sample unit, were counted. Simulations were performed in RStudio with different proportions of zeros (0.1, 0.4, and 0.6) for comparison with the parameters obtained in the field by the methods of moments and maximum likelihood. Equations were derived to estimate the optimal sample size in studies of pests with low densities, based on the zero-inflated Poisson and zero-inflated negative binomial probability distributions. The method of moments yields optimal sample sizes smaller than those obtained by maximum likelihood because it distinguishes the origin of the zeros, so its use is recommended.
sampling, zero-inflated negative binomial, zero-inflated Poisson.
In the population dynamics of pest organisms, count data reflect the presence and abundance of species over a fixed period of time (Hashim et al., 2021). Samples of pest populations commonly contain an excess of zeros because of the complex interactions between biotic and abiotic components, the inherent characteristics of pest species, spatio-temporal dependencies, unexplained environmental heterogeneity (Zou et al., 2021) and agroecological control techniques (Villanueva-Jiménez et al., 2017; García-González et al., 2018).
Studying and monitoring the periods in which pest populations show excess zeros can be very useful: it allows early stages of a pest invasion to be recognized and preventive management methods to be applied, such as those offered by precision agriculture (Jankielsohn, 2017; Clay et al., 2018). It also allows control tactics to be used before pests damage the crop, which would prevent the excessive use of synthetic organic pesticides and thereby reduce damage to the environment (Shannon et al., 2018; Talaviya et al., 2020).
The excess of zeros is a theoretical and practical problem that arises when the high frequency of zeros departs from the probabilities expected under the discrete Poisson and negative binomial distributions (Yesilova et al., 2010; Hashim et al., 2021; Haslett et al., 2022). Moreover, no attention has been paid to the mechanisms that explain the origin of the zeros, despite their impact on the estimation of population parameters in species of pest organisms (Haslett et al., 2022).
For the study of pest populations in agroecosystems, it is proposed to analyze the excess of zeros following Mullahy (1986) and Lambert (1992); that is, to recognize two possible origins of a zero by distinguishing the structural zero (plants without susceptible shoots for the establishment of a pest) from the non-structural component (plants with susceptible shoots, either free of the pest, which gives a non-structural zero, or infested), to model the zeros by their origin with binomial distributions (Lambert, 1992; Zou et al., 2021; Haslett et al., 2022) and, depending on the observed counts greater than zero, to study the effect of overdispersion (Hall, 2000; Cheung, 2002; Doyle, 2009).
In pest counts, the optimal sample size equations for the Poisson or negative binomial distribution are used routinely, but with an excess of zeros the estimated optimal sample sizes are so large as to be impractical (Southwood and Henderson, 2000). However, in integrated pest management there are no equations that estimate the optimal sample size for zero-inflated distributions, nor proposals that consider the origin of the zeros.
Here, equations to estimate the optimal sample size are proposed by adapting those of Karandinos (1976) to zero-inflated distributions. The objectives of the present research were to model the excess of zeros, to estimate the parameters of the zero-inflated Poisson and zero-inflated negative binomial distributions by the methods of moments and maximum likelihood, and to derive equations to calculate the optimal sample size.
For the estimation of the optimal sample size, the excess of zeros was modeled; the parameters were determined by the methods of moments and maximum likelihood of the zero-inflated Poisson and zero-inflated negative binomial distributions and the equations for calculating the sample size were derived.
To model the excess of zeros, the following stages were performed: i) the absence of plant tissue that can host the pest was included as a cause of extra zeros. In this way, a zero has two possible origins: the ‘structural zero’, when there is no susceptible tissue in the plant that can be occupied by the pest, and the ‘non-structural zero’, when there is adequate tissue in the plant but it is not inhabited by the pest.
With this definition, the frequency of structural zeros was modeled using a binomial distribution (Mullahy, 1986). If X is the number of structural zeros present in a sample of size n, then $X \sim \mathrm{Binomial}(n, \pi)$, where $\pi$ is the proportion of structural zeros and $1-\pi$ is the proportion of susceptible plant tissue free from the presence of the pest (non-structural zeros) plus the plant tissue inhabited by the target species (positive integer values).
Thus, the probability function of the random variable X, the number of structural zeros in a sample of size n, is given by:

$$P(X = x) = \binom{n}{x}\,\pi^{x}(1-\pi)^{n-x}, \quad x = 0, 1, \ldots, n \qquad (1)$$

where $\pi$ is the proportion of structural zeros. If $\pi$ is very large, the hosts have little tissue susceptible to damage. ii) the probability of presence-absence of the pest was estimated as a conditional variable with a binomial distribution. If Y is the number of non-structural zeros (susceptible tissue without the presence of the pest) in a sample of size n, then:
$$P(Y = y \mid X = x) = \binom{n-x}{y}\,\theta^{y}(1-\theta)^{n-x-y}, \quad y = 0, 1, \ldots, n-x \qquad (2)$$

Where: $\theta$ is the probability of occurrence of a non-structural zero; then, in a sample of size n, X = x is the number of structural zeros in the sample, Y = y is the number of non-structural zeros, while $n - x - y$ is the number of units of plant tissue with the presence of the pest; in this way, $1-\theta$ represents the proportion of the population of susceptible tissue inhabited by the organism of interest. iii) to model the abundance of a pest excluding structural zeros, the Poisson count distribution was used when the mean is equal to the variance (equidispersion) and the negative binomial distribution when the variance is greater than the mean (overdispersion) (Hilbe, 2011).
The Poisson distribution is used when Y is the number of insects in a sample unit that is not a structural zero, so that:

$$P(Y = y) = \frac{e^{-\lambda}\lambda^{y}}{y!}, \quad y = 0, 1, 2, \ldots \qquad (3)$$

Where: λ is the mean number of insects in the population, excluding structural zeros (i.e., sample units without susceptible tissue are not considered).
With overdispersion, the negative binomial distribution is used, where Y is the number of insects in a unit that is not a structural zero:

$$P(Y = y) = \frac{\Gamma(y + k)}{\Gamma(k)\, y!}\left(\frac{k}{k+\lambda}\right)^{k}\left(\frac{\lambda}{k+\lambda}\right)^{y}, \quad y = 0, 1, 2, \ldots \qquad (4)$$

Where: λ is the mean number of insects in the population, excluding structural zeros; k is an overdispersion parameter and Γ(·) is the gamma function. In this way, estimates are not affected by excess zeros (structural zeros).
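As an illustration of the two count distributions used for units that are not structural zeros, the following minimal sketch evaluates both probability functions and checks their mean and variance numerically. It assumes the common parameterization of the negative binomial with mean λ and variance λ + λ²/k; the function names and the values of `lam` and `k` are hypothetical and are not estimates from the study.

```python
import math

def poisson_pmf(y, lam):
    """P(Y = y) for a Poisson count with mean lam."""
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))

def negbin_pmf(y, lam, k):
    """P(Y = y) for a negative binomial count with mean lam and dispersion k.

    Assumed parameterization: variance = lam + lam**2 / k.
    """
    log_p = (math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
             + k * math.log(k / (k + lam))
             + y * math.log(lam / (k + lam)))
    return math.exp(log_p)

def mean_and_variance(pmf, upper=500):
    """Numerical mean and variance of a count distribution truncated at `upper`."""
    probs = [pmf(y) for y in range(upper + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

lam, k = 1.5, 0.8  # illustrative values only
pois_mean, pois_var = mean_and_variance(lambda y: poisson_pmf(y, lam))
nb_mean, nb_var = mean_and_variance(lambda y: negbin_pmf(y, lam, k))
print("Poisson mean, variance:", pois_mean, pois_var)
print("Negative binomial mean, variance:", nb_mean, nb_var)
```

Under these assumptions the Poisson variance equals its mean (equidispersion), while the negative binomial variance exceeds the mean by λ²/k (overdispersion).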
It can be noted that, under this scheme, the probability of a zero in a unit that is not a structural zero (a non-structural zero) is $e^{-\lambda}$ if the distribution is Poisson and $\left(\frac{k}{k+\lambda}\right)^{k}$ if it is negative binomial. The probability of a structural zero in both cases is $\pi$. iv) to model the abundance of the pest considering the mixture of structural and non-structural zeros (the two origins of zero), there are two cases. If the mean and variance were equal (equidispersion), the population was modeled with the zero-inflated Poisson distribution (Lambert, 1992; Zou et al., 2021) as follows:
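$$P(Y = y) = \begin{cases} \pi + (1-\pi)\,e^{-\lambda}, & y = 0 \\ (1-\pi)\,\dfrac{e^{-\lambda}\lambda^{y}}{y!}, & y = 1, 2, \ldots \end{cases} \qquad (5)$$

where $\pi$ is the probability of a structural zero and $\lambda$ is the mean excluding structural zeros. The mean of this distribution is $(1-\pi)\lambda$; in addition, the variance is $(1-\pi)\lambda(1+\pi\lambda)$. In the second case, when overdispersion was found, the zero-inflated negative binomial (ZINB) distribution was used (Fang et al., 2016), where:

$$P(Y = y) = \begin{cases} \pi + (1-\pi)\left(\dfrac{k}{k+\lambda}\right)^{k}, & y = 0 \\ (1-\pi)\,\dfrac{\Gamma(y+k)}{\Gamma(k)\,y!}\left(\dfrac{k}{k+\lambda}\right)^{k}\left(\dfrac{\lambda}{k+\lambda}\right)^{y}, & y = 1, 2, \ldots \end{cases} \qquad (6)$$

The mean of this distribution is $(1-\pi)\lambda$; in addition, the variance is $(1-\pi)\lambda\left(1+\pi\lambda+\lambda/k\right)$.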
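The mean and variance of the two zero-inflated distributions can be checked with a short derivation. The sketch below is not taken from the source: it assumes an indicator Z of a structural zero with probability π (a symbol introduced here for illustration), a count component with mean λ and variance σ², and the negative binomial parameterization with variance λ + λ²/k.

```latex
% Mixture: Y = 0 if Z = 1 (structural zero, probability \pi),
% otherwise Y follows the count distribution with mean \lambda and variance \sigma^2.
\begin{align*}
E[Y]   &= (1-\pi)\,\lambda \\
E[Y^2] &= (1-\pi)\,(\sigma^2 + \lambda^2) \\
\operatorname{Var}[Y] &= E[Y^2] - E[Y]^2
        = (1-\pi)(\sigma^2 + \lambda^2) - (1-\pi)^2\lambda^2
        = (1-\pi)\,(\sigma^2 + \pi\lambda^2) \\[4pt]
\text{Poisson component } (\sigma^2 = \lambda):\quad
\operatorname{Var}[Y] &= (1-\pi)\,\lambda\,(1 + \pi\lambda) \\
\text{Negative binomial component } (\sigma^2 = \lambda + \lambda^2/k):\quad
\operatorname{Var}[Y] &= (1-\pi)\,\lambda\,\bigl(1 + \pi\lambda + \lambda/k\bigr)
\end{align*}
```

In both cases the variance exceeds the mean whenever π > 0, which is why an excess of structural zeros looks like overdispersion when the origin of the zeros is ignored.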
To obtain the parameters of the i) zero-inflated Poisson and ii) zero-inflated negative binomial distributions, the methods of moments and maximum likelihood were used. a) For the zero-inflated Poisson distribution, the moment estimators for $\lambda$ and $\pi$ given by Banik and Kibria (2009) are used:

$$\hat{\lambda} = \bar{x} + \frac{s^{2}}{\bar{x}} - 1, \qquad \hat{\pi} = \frac{s^{2} - \bar{x}}{s^{2} + \bar{x}^{2} - \bar{x}} \qquad (7)$$

With $\hat{\lambda}$ the moment estimator of the mean, $\bar{x}$ the sample mean, $s^{2}$ the sample variance and $\hat{\pi}$ the moment estimator of the probability of occurrence of a structural zero.
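The moment estimators for the zero-inflated Poisson distribution follow from equating the sample mean and variance to (1 − π)λ and (1 − π)λ(1 + πλ). A minimal sketch of that calculation is given here; the function name and the counts are hypothetical, and the sample variance uses the n − 1 denominator, which is an assumption.

```python
from statistics import mean, variance

def zip_moment_estimates(counts):
    """Moment estimates (lambda_hat, pi_hat) for a zero-inflated Poisson sample."""
    xbar = mean(counts)
    s2 = variance(counts)  # sample variance with n - 1 in the denominator
    if xbar <= 0 or s2 <= xbar:
        # With variance <= mean the estimate of pi would not be positive.
        raise ValueError("moment estimates need a positive mean and variance > mean")
    lam_hat = xbar + s2 / xbar - 1.0
    pi_hat = (s2 - xbar) / (s2 + xbar ** 2 - xbar)
    return lam_hat, pi_hat

# Hypothetical counts for 100 sample units (not data from the study).
counts = [0] * 60 + [1] * 15 + [2] * 13 + [3] * 8 + [4] * 4
lam_hat, pi_hat = zip_moment_estimates(counts)
print(f"lambda_hat = {lam_hat:.3f}, pi_hat = {pi_hat:.3f}")
```

By construction, (1 − pi_hat) × lam_hat reproduces the sample mean, which is a quick check that the two estimates are mutually consistent.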
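The maximum likelihood estimators for $\lambda$ and $\pi$ are obtained by maximizing the log-likelihood function given by:

$$\ell(\pi, \lambda) = n_{0}\,\log\!\left[\pi + (1-\pi)\,e^{-\lambda}\right] + \sum_{y_i > 0}\left[\log(1-\pi) - \lambda + y_i \log\lambda - \log(y_i!)\right] \qquad (8)$$

where $n_{0}$ is the number of observed zeros and $y_i$ is the count in sample unit i. b) For the zero-inflated negative binomial distribution, there are no moment estimators for $\lambda$, k and $\pi$ (Banik and Kibria, 2009; Hilbe, 2011). Since the excess zeros are structural (without susceptible tissue), with X = x structural zeros in a sample of size n and since $E(X) = n\pi$, the moment estimator of $\pi$ is given by: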
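$$\hat{\pi} = \frac{x}{n} \qquad (9)$$

If structural zeros are excluded, the remaining elements of the sample have a negative binomial distribution, with moment estimators of k and λ given by:

$$\hat{\lambda} = \bar{x}, \qquad \hat{k} = \frac{\bar{x}^{2}}{s^{2} - \bar{x}} \qquad (10)$$

Where: $\hat{\lambda}$ is the mean parameter estimated by the method of moments, $\bar{x}$ the sample mean and $s^{2}$ the sample variance, both computed without the structural zeros, and $\hat{k}$ the moment estimator of the dispersion parameter.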
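The two-step moment calculation for the zero-inflated negative binomial case (proportion of structural zeros first, then negative binomial moments on the remaining units) can be sketched as follows. The sketch assumes that the observer can tell which sample units are structural zeros (no susceptible tissue) and that the negative binomial variance is λ + λ²/k; the function name and the counts are hypothetical.

```python
from statistics import mean, variance

def zinb_moment_estimates(nonstructural_counts, n_structural_zeros):
    """Moment estimates (pi_hat, lambda_hat, k_hat) excluding structural zeros.

    nonstructural_counts: counts from units with susceptible tissue
                          (zeros here are non-structural zeros).
    n_structural_zeros:   number of units without susceptible tissue.
    """
    n = len(nonstructural_counts) + n_structural_zeros
    pi_hat = n_structural_zeros / n
    xbar = mean(nonstructural_counts)
    s2 = variance(nonstructural_counts)  # n - 1 denominator (assumption)
    if s2 <= xbar:
        raise ValueError("no overdispersion: k_hat is undefined when s2 <= mean")
    lam_hat = xbar
    k_hat = xbar ** 2 / (s2 - xbar)
    return pi_hat, lam_hat, k_hat

# Hypothetical sample of 100 units: 40 without susceptible tissue and
# 60 with susceptible tissue (20 of them free of the pest).
susceptible = [0] * 20 + [1] * 15 + [2] * 13 + [3] * 8 + [4] * 4
pi_hat, lam_hat, k_hat = zinb_moment_estimates(susceptible, 40)
print(f"pi_hat = {pi_hat:.2f}, lambda_hat = {lam_hat:.3f}, k_hat = {k_hat:.3f}")
```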
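The maximum likelihood estimators for $\lambda$, k and $\pi$ are obtained by maximizing the log-likelihood function given by:

$$\ell(\pi, \lambda, k) = n_{0}\,\log\!\left[\pi + (1-\pi)\left(\frac{k}{k+\lambda}\right)^{k}\right] + \sum_{y_i > 0}\left[\log(1-\pi) + \log\Gamma(y_i + k) - \log\Gamma(k) - \log(y_i!) + k\log\frac{k}{k+\lambda} + y_i\log\frac{\lambda}{k+\lambda}\right] \qquad (11)$$

where $n_{0}$ is the number of observed zeros and $y_i$ is the count in sample unit i. Based on the above, it is proposed to use the moment estimators of the negative binomial distribution (Banik and Kibria, 2009), but excluding structural zeros from the equation, as an approximation to the moments of the zero-inflated negative binomial distribution.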
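To make the maximum likelihood step concrete, the sketch below writes the two log-likelihoods in their standard form and maximizes the zero-inflated Poisson one with a simple EM iteration. This is a swapped-in technique chosen because it needs no external library; the study itself reports using zeroinfl. The function names, starting values and counts are hypothetical, and the negative binomial parameterization (variance λ + λ²/k) is an assumption.

```python
import math

def zip_log_likelihood(counts, pi, lam):
    """Log-likelihood of a zero-inflated Poisson sample."""
    ll = 0.0
    for y in counts:
        if y == 0:
            ll += math.log(pi + (1 - pi) * math.exp(-lam))
        else:
            ll += math.log(1 - pi) - lam + y * math.log(lam) - math.lgamma(y + 1)
    return ll

def zinb_log_likelihood(counts, pi, lam, k):
    """Log-likelihood of a zero-inflated negative binomial sample."""
    log_q = math.log(k / (k + lam))
    log_r = math.log(lam / (k + lam))
    ll = 0.0
    for y in counts:
        if y == 0:
            ll += math.log(pi + (1 - pi) * math.exp(k * log_q))
        else:
            ll += (math.log(1 - pi) + math.lgamma(y + k) - math.lgamma(k)
                   - math.lgamma(y + 1) + k * log_q + y * log_r)
    return ll

def zip_mle_em(counts, n_iter=500, tol=1e-10):
    """Maximum likelihood estimates (pi, lam) of the ZIP model by EM."""
    n = len(counts)
    n_zero = sum(1 for y in counts if y == 0)
    total = sum(counts)
    pi, lam = 0.5 * n_zero / n, max(total / n, 1e-6)  # crude starting values
    for _ in range(n_iter):
        # E-step: expected number of structural zeros among the observed zeros.
        w = pi / (pi + (1 - pi) * math.exp(-lam))
        expected_structural = n_zero * w
        # M-step: update the mixing proportion and the Poisson mean.
        new_pi = expected_structural / n
        new_lam = total / (n - expected_structural)
        converged = abs(new_pi - pi) < tol and abs(new_lam - lam) < tol
        pi, lam = new_pi, new_lam
        if converged:
            break
    return pi, lam

# Hypothetical counts for 100 sample units (not data from the study).
counts = [0] * 60 + [1] * 15 + [2] * 13 + [3] * 8 + [4] * 4
pi_mle, lam_mle = zip_mle_em(counts)
print(f"pi = {pi_mle:.3f}, lambda = {lam_mle:.3f}, "
      f"log-lik = {zip_log_likelihood(counts, pi_mle, lam_mle):.3f}")
```

Note that this likelihood uses only the observed counts, so every zero is treated as a possible mixture of both origins; this is the sense in which the log-likelihood approach does not distinguish structural from non-structural zeros.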
To derive the equations of optimal sample size, the parameters obtained from the models of stages iii and iv were substituted in the equations of Karandinos (1976), which are based on the coefficient of variation (CV), the fixed proportion of the mean (D) and half of a confidence interval (h) (Table 1). The values of CV, D, and h are arbitrary, so the value used in each case depends on the precision defined in each research (Ramírez et al., 2013; Taherdoost, 2016). The coefficient of variation used was 25% (0.25), proposed by Southwood and Henderson (2000) as a level suitable for ecological studies.
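Since Table 1 is not reproduced in this text, the following sketch only illustrates the kind of substitution described. It assumes the general sample size forms commonly attributed to Karandinos (1976), written in terms of a mean and a variance, and plugs in the mean and variance of the zero-inflated distributions with π as the probability of a structural zero. The function names, the default z = 1.96 and the parameter values are assumptions, and the expressions may differ from those in Table 1.

```python
def zip_moments(pi, lam):
    """Mean and variance of a zero-inflated Poisson distribution."""
    mean = (1 - pi) * lam
    var = (1 - pi) * lam * (1 + pi * lam)
    return mean, var

def zinb_moments(pi, lam, k):
    """Mean and variance of a zero-inflated negative binomial distribution
    (assumed parameterization: count variance = lam + lam**2 / k)."""
    mean = (1 - pi) * lam
    var = (1 - pi) * lam * (1 + pi * lam + lam / k)
    return mean, var

def n_by_cv(mean, var, cv=0.25):
    """Sample size for a target coefficient of variation of the mean."""
    return var / (cv * mean) ** 2

def n_by_proportion(mean, var, d=0.5, z=1.96):
    """Sample size so the confidence half-width is a proportion d of the mean."""
    return (z / d) ** 2 * var / mean ** 2

def n_by_half_width(var, h, z=1.96):
    """Sample size for a fixed confidence half-width h (in count units)."""
    return (z / h) ** 2 * var

# Illustrative parameter values only (not estimates from the study).
m_zip, v_zip = zip_moments(pi=0.4, lam=1.5)
m_zinb, v_zinb = zinb_moments(pi=0.4, lam=1.5, k=2.0)
print("ZIP :", n_by_cv(m_zip, v_zip), n_by_proportion(m_zip, v_zip),
      n_by_half_width(v_zip, h=0.5))
print("ZINB:", n_by_cv(m_zinb, v_zinb), n_by_proportion(m_zinb, v_zinb),
      n_by_half_width(v_zinb, h=0.5))
```

In practice the results would be rounded up to the next whole sample unit.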
Six systematic samplings (n= 100) were carried out in three Río Red grapefruit (Citrus paradisi Macfad) groves at Finca Sayula, SPR de RL de CV, Veracruz, Mexico (latitude 19.20722, longitude -96.35194). Sampling data were direct counts in small units (three leaves per shoot per tree), conducted during the months of June and July 2021 and January 2022.
Three of the samplings were carried out to detect the presence of the citrus leafminer Phyllocnistis citrella Stainton and three more to detect the presence of the citrus tristeza virus vector aphid Toxoptera citricida Kirkaldy. In addition, three samplings were simulated with the zero-inflated Poisson distribution and three with the zero-inflated negative binomial distribution, each with n = 100 randomly generated numbers. The simulations were performed in RStudio using the functions rbinom(100, size = 1, prob = 0.1, 0.4, 0.6), rpois(100-x, 1.5), rnbinom(100, 1.5) and zeroinfl(x ~ 1 | 1, dist = "poisson", "negbin"), with the VGAM and pscl libraries; the three values of prob and the two values of dist are the alternative settings used across simulations.
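The simulation scheme described above (a Bernoulli draw per unit to decide whether it is a structural zero, then a count for each remaining unit) can be sketched without external libraries as follows. This is an illustration rather than the code used in the study: the function names, the seed and the dispersion value k = 2.0 are hypothetical, and the negative binomial draw assumes a gamma-Poisson mixture with mean λ and variance λ + λ²/k.

```python
import math
import random

def draw_poisson(rng, lam):
    """Poisson draw by Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def draw_negbin(rng, lam, k):
    """Negative binomial draw as a gamma-Poisson mixture (mean lam, dispersion k)."""
    rate = rng.gammavariate(k, lam / k)  # shape k, scale lam / k
    return draw_poisson(rng, rate)

def simulate_zero_inflated(rng, n, pi, draw_count):
    """Return (sample, x): n counts and the number x of structural zeros."""
    x = sum(1 for _ in range(n) if rng.random() < pi)  # structural zeros
    counts = [draw_count() for _ in range(n - x)]       # units with susceptible tissue
    return [0] * x + counts, x

rng = random.Random(2021)  # arbitrary seed for reproducibility
for pi in (0.1, 0.4, 0.6):
    zip_sample, x_zip = simulate_zero_inflated(
        rng, 100, pi, lambda: draw_poisson(rng, 1.5))
    zinb_sample, x_zinb = simulate_zero_inflated(
        rng, 100, pi, lambda: draw_negbin(rng, 1.5, 2.0))
    print(pi, x_zip, zip_sample.count(0), x_zinb, zinb_sample.count(0))
```

Because the simulation records x, the number of structural zeros, the simulated samples can be analyzed either with or without knowledge of the origin of each zero.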
For the six field samplings, three of P. citrella (Table 2) and three of T. citricida (Table 3), and for the six simulations (Table 4), the simulated and observed proportion of structural zeros, the non-structural zeros, the overdispersion parameter k, the probability of structural zero and the optimal sample size were estimated using the coefficient of variation equations, proportion of mean and half confidence interval (Table 1).
[i] log-lik= log-likelihood; mom= moments; Prsz= proportion of structural zeros; Prnsz= proportion of non-structural zeros; k= overdispersion parameter; pe= estimated probability of structural zero; optimal sample size by: CV= coefficient of variation; D = proportion of the mean; h= half the confidence interval amplitude.
[i] log-lik= log-likelihood; mom= moments; Prsz= proportion of structural zeros; Prnsz= proportion of non-structural zeros; k= overdispersion parameter; pe= estimated probability of structural zero; optimal sample size by: CV= coefficient of variation; D = proportion of the mean; h= half the confidence interval amplitude.
[i] ZIPS= zero-inflated Poisson simulations (1-3); ZINBS= zero-inflated negative binomial simulations (1-3); log-lik= log-likelihood; Prsz= proportion of structural zeros; k= overdispersion parameter; pe= estimated probability of structural zero; optimal sample size by: CV= coefficient of variation; D = proportion of the mean; h= half the confidence interval amplitude.
The equations proposed to estimate the optimal sample size of pests with excess zeros are detailed in the methodology (Table 1).
It was found that the optimal sample size calculated with the proportion of the mean (D = 0.5) is equivalent to that calculated with the coefficient of variation (CV = 0.25) proposed by Southwood and Henderson (2000). For the estimation of the optimal sample size by half the confidence interval (h), no setting that allowed equivalence with the coefficient of variation or the proportion of the mean was found.
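One way to see why these two criteria can coincide, while the half-width criterion cannot, is the following worked arithmetic. It is offered as a plausible explanation rather than as the authors' derivation, and it assumes the general sample size forms commonly attributed to Karandinos (1976), with μ and σ² the mean and variance of the count distribution and z the standard normal quantile.

```latex
\begin{align*}
n_{CV} &= \frac{\sigma^{2}}{CV^{2}\,\mu^{2}}, \qquad
n_{D}  = \left(\frac{z}{D}\right)^{2}\frac{\sigma^{2}}{\mu^{2}}, \qquad
n_{h}  = \left(\frac{z}{h}\right)^{2}\sigma^{2} \\[4pt]
n_{CV} = n_{D} &\iff \frac{1}{CV} = \frac{z}{D} \iff CV = \frac{D}{z} \\[4pt]
D = 0.5,\ z \approx 2 &\;\Rightarrow\; \left(\frac{z}{D}\right)^{2} = 16 = \frac{1}{0.25^{2}}
\quad\text{(with } z = 1.96:\ (1.96/0.5)^{2} \approx 15.37\text{)} \\[4pt]
\frac{n_{h}}{n_{CV}} &= \left(\frac{z\,CV\,\mu}{h}\right)^{2}
\end{align*}
```

Under these assumptions the ratio between the h-based and CV-based sample sizes depends on the mean μ, so no single value of h reproduces the CV or D result across samples with different densities.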
The optimal sample size of half the confidence interval (h) increased as the overdispersion parameter (k) increased, resulting in very large or difficult-to-estimate optimal sample sizes when pest populations have excess zeros (Tables 2, 3 and 4).
The log-likelihood estimates of the parameter k for the samples of P. citrella (Table 2) indicated that the samples follow a zero-inflated Poisson distribution. The k estimated by the method of moments for the zero-inflated negative binomial distribution, which excludes structural zeros, showed that the non-structural zeros and positive integer values were overdispersed.
This result is consistent with that reported by Banik and Kibria (2009), who indicated that, by conditioning or eliminating the structural zeros of a population modeled with a zero-inflated Poisson distribution, it can also be modeled with a negative binomial distribution, provided that the data of the non-structural component present overdispersion.
The values of pe obtained by the methods of moments and log-likelihood for the zero-inflated Poisson distribution were similar; therefore, both methods are efficient for estimating the parameters. The sample sizes for P. citrella are smaller when estimated by moments than by log-likelihood, even when the proportion of structural zeros (Prsz) is greater; however, the difference between the two estimates is not very large (< 20 units).
Overdispersion markedly affected the sample size estimated by h. For P. citrella, the results indicate that estimation by CV or by D is preferable: although the estimates range from 47 to 70, the sample size is smaller than that obtained with the Poisson and negative binomial distributions, because the methods proposed here consider the numbers of structural and non-structural zeros.
In the samplings of T. citricida (Table 3), an insect with a strong tendency to aggregate, the k values estimated by log-likelihood indicate populations with a zero-inflated negative binomial distribution. The method of moments yielded low values of k, which indicates that, when the structural component is excluded, the few sample units found with the pest showed low variation.
This result is interesting since populations with a zero-inflated negative binomial distribution show a random distribution at the farm level, but the few occupied trees had a high number of individuals, indicating aggregation, in accordance with the biology of the insect. The exclusion of structural zeros, the frequency of non-structural zeros, and the reduced variation in the positive counts resulted in very small sample sizes for CV and D when estimated by the method of moments.
The optimal sample size of the zero-inflated negative binomial distribution, calculated by moments, is smaller because it distinguishes the different origins of zero. By considering only the non-structural zeros and the positive integer values for the estimation of the sample size, a difference was established with the parameters estimated by log-likelihood that does not distinguish the origin of zero. Therefore, the method of moments for zero-inflated Poisson and zero-inflated negative binomial allows estimating optimal sample sizes similar to or smaller than those estimated by maximum likelihood.
In the simulations (Table 4), the sample size increased in both distributions as the number of structural zeros increased. This is because the sample size for the simulations was estimated only by the log-likelihood method, which does not distinguish the origin of the zeros. In addition, the estimated value of the overdispersion parameter k is consistent with the values obtained in the field.
For the zero-inflated Poisson simulations, very small k values were obtained due to the proximity of the mean and variance values, while for the zero-inflated negative binomial simulations the overdispersion parameter was greater than zero, indicating overdispersion, similar to that reported by Zou et al. (2021) and Haslett et al. (2022).
The zero-inflated Poisson and zero-inflated negative binomial probability distributions allow modeling populations of pest organisms with low densities and excess zeros. The parameters obtained by the moment method distinguish the origin of zero and estimate optimal sample sizes equivalent to or less than those estimated by log-likelihood, which does not distinguish the origin of zero. A zero-inflated Poisson population can also be modeled with a negative binomial distribution, provided that the non-structural component is overdispersed.
The estimation of the optimal sample size in pest populations with excess zeros can be performed equivalently with the coefficient of variation (CV) equation and the proportion of the mean (D) equation. On the other hand, the estimation of the optimal sample size with the half confidence interval (h) equation depends on the value of the overdispersion parameter (k), since h does not have a fixed value that allows establishing an equivalence.
Banik, S. and Kibria, B. M. G. 2009. On some discrete distributions and their applications with real life data. USA. JMASM. 8(2):423-447. https://doi.org/10.22237/jmasm/1257034020.
Cheung, Y. B. 2002. Zero inflated models for regression analysis of count data: a study of growth and development. USA. Statist. Med. 21(10):1461-1469. https://doi.org/10.1002/sim.1088.
Clay, S. A.; French, B. W. and Mathew, F. M. 2018. Pest measurement and management. In: precision agriculture basics. Shannon, D. K.; Clay, D. E. and Kitchen, N. R. (eds.). Ed. ASA, CSSA, and SSSA Books. USA. 93-102 pp. https://doi.org/10.2134/precisionagbasics.2016.0090.
Doyle, S. R. 2009. Examples of computing power for zero-inflated and over dispersed count data. USA. JMASM. 8(2):360-376. https://doi.org/10.22237/jmasm/1257033720.
Fang, R.; Wagner, B. D.; Harris, J. K. and Fillon, S. A. 2016. Zero inflated negative binomial mixed models: an application to two microbial organisms important in oesophagitis. UK. Epidemiol. Infect. 144(1):2447-2455. http://doi.org/10.1017/S0950268816000662.
García-González, J. C.; López-Collado, J.; García-García, C. G.; Villanueva-Jiménez, J. A. y Nava-Tablada, M. E. 2018. Factores bióticos, abióticos y agronómicos que afectan las poblaciones de adultos de mosca pinta (Hemiptera: Cercopidae) en cultivos de caña de azúcar en Veracruz, México. México. Acta Zool. Mex. 33(3):508-517. https://doi.org/10.21829/azm.2017.3331152.
Hall, D. B. 2000. Zero inflated Poisson and binomial regression with random effects: a case study. USA. Biometrics. 56(1):1030-1039. https://doi.org/10.1111/j.0006-341x.2000.01030.x.
Hashim, L. H.; Hashim, K. H. and Shiker, M. A. K. 2021. An application comparison of two Poisson models on zero count data. UK. Journal of Physics: Conference Series. 1818(012165):1-12. https://doi.org/10.1088/1742-6596/1818/1/012165.
Haslett, J.; Parnell, A. C.; Hinde, J. and de Andrade, M. R. 2022. Modelling excess of zeros in count data: a new perspective on modelling approaches. USA. International statistical review. 90(2):216-236. https://doi.org/10.1111/insr.12479.
Jankielsohn, A. 2017. The redesign of suitable agricultural crop ecosystems by increasing natural ecosystem services provided by insects. Hong Kong SAR China. Advances in ecological and environmental research. 1(1):365-381. http://www.ss-pub.org/wp-content/uploads/2017/09/AEER2017040501-1.pdf.
Karandinos, M. G. 1976. Optimum sample size and comments on one published formula. USA. Bull. Entomol. Soc. Amer. 22(4):417-421. https://doi.org/10.1093/besa/22.4.417.
Lambert, D. 1992. Zero inflated Poisson regression, with an application to defects manufacturing. USA. Technometrics. 34(1):1-14. https://doi.org/10.2307/1269547.
Mullahy, J. 1986. Specification and testing of some modified count data models. Netherlands. J. Econ. 33(1):341-365. https://doi.org/10.1016/0304-4076(86)90002-3.
Ramírez, I. C.; Barrera, C. J. y Correa, J. C. 2013. Efecto del tamaño de muestra y el número de réplicas bootstrap. Colombia. Inycompe. 15(1):93-101. https://www.redalyc.org/articulo.oa?id=291329165008.
Shannon, D. K.; Clay, D. E. and Sudduth, K. A. 2018. An introduction to precision agriculture. In: precision agriculture basics. Shannon, D. K.; Clay, D. E. and Kitchen, N. R. (eds.). Ed. ASA, CSSA, and SSSA Books. USA. 1-12 pp. https://doi.org/10.2134/precisionagbasics.2016.0084.
Southwood, T. R. E. and Henderson, P. A. 2000. Ecological methods. Blackwell science. 3rd Ed. Oxford, UK. 7-66 pp. https://www.researchgate.net/publication/260051655-Ecological-Methods-3rd-edition.
Taherdoost, H. 2016. Sampling methods in research methodology, how to choose a sampling technique for research. Brazil. IJARM. 5(2):18-27. http://dx.doi.org/10.2139/ssrn.3205035.
Talaviya, T.; Shah, D.; Patel, N.; Yagnik, H. and Shah, M. 2020. Implementation of artificial intelligence in agriculture for optimization of irrigation and application of pesticides and herbicides. China. Artificial Intelligence in Agric. 4(1):58-73. https://doi.org/10.1016/j.aiia.2020.04.002.
Villanueva-Jiménez, J. A.; Reyes-Pérez, N. y Abato-Zárate, M. 2017. Manejo integrado de plagas y sostenibilidad. In: agricultura sostenible como base para los agronegocios. Jarquín, G. R. y Huerta, P. A. (coords.). 1a Ed. Universidad Autónoma de San Luis Potosí. México. 32-42 pp. https://www.researchgate.net/publication/320779257-Manejo-Integrado-de-Plagas-y-Sostenibilidad.
Yesilova, A.; Kaydan, M. B. and Kaya, Y. 2010. Modeling insect-egg data with excess zero using zero-inflated regression models. Hacettepe J. Math. Stat. 39(2):273-282. http://www.hjms.hacettepe.edu.tr/uploads/c879f14e-8c0d-4f30-8bfa-e28658a8fe0b.pdf.
Zou, Y.; Hanning, J. and Young, D. S. 2021. Generalized fiducial inference on the mean of zero inflated Poisson and Poisson hurdle models. Germany. J Statistical Distributions and Applications. 8(5):1-15. https://doi.org/10.1186/s40488-021-00117-0.