The rapid and unprecedented onset of the coronavirus disease 2019 (COVID-19) has had devastating effects on our global society, with the ramifications towards public health, mortality rates, and economic impacts still unfolding (Wu et al. 2020). Fully understanding all of the risk factors that drive the spread of COVID-19 will likely take years of study, but an increasing number of epidemiological studies are being conducted in an attempt to estimate associations with different biological and societal factors at varying scales and based on limited data (Berg et al. 2021). The aim of this study is to explore potential spatial relationships between environmental, demographic, and economic factors with COVID-19 rates at the census tract-level within Santa Clara County, CA.
Santa Clara County is located within the San Francisco Bay Area within California. It contains the city of San Jose as well as serves as an informal boundary to the region known as Silicon Valley. This area is a high technology business sector known globally for its commitment to innovation as well as its concentration of corporations and personal wealth. As the COVID-19 pandemic progressed, the Santa Clara County Public Health Department tracked and reported COVID-19 cases and deaths, wastewater concentrations, as well as testing and vaccination information.
Data for this project include the cumulative number of COVID-19 cases reported for each of the 372 census tracts in Santa Clara County divided by the total population of each census tract in order to quantify a COVID-19 rate (Figure 1). Data were reported by the California Reportable Disease Information Exchange and published and updated by the Santa Clara County Public Health Department until January 24, 2023. Each record contains a unique Geographic Identifier in order to facilitate matching with census tract boundaries and other statistical data. This study also included data on particulate matter 2.5 concentrations as well as U.S. census estimates for demographic, housing, and other economic statistics (Table 1).
Figure 1: Cumulative COVID-19 cases from 2020 to 2023 per census tract population in Santa Clara County, CA.
Table 1: Metadata of all sources of data for this analysis.
The projected coordinate system used for this analysis is the NAD 1983 State Plane California III (U.S. feet). As this zone extends from west to east across California, this system uses a Lambert Conformal Conic projection. State Plane coordinate systems are often used by government agencies as each zone has a defined boundary and the relatively smaller areas result in less distortion.
Ten potential explanatory variables were chosen based on factors that were found to be important predictors of COVID-19 mortalities at county-levels (Knittel & Ozaltun 2020; Wu et al. 2020) as well as towards predicting COVID-19 infections at census tract-levels (Berg et al. 2021). These variables are summarized in Table 2 and include: particulate matter 2.5 concentrations (PM_Pred), median household income (MedHInc), percent of the civilian population with health insurance (HealthIns), percent of workers 16 years and older that commute to work by public transportation (PubTrans), percent of the population that is Black or African American (PctBlack), percent of the population 65 years and older (Pop65), population density (PopDens), percent of the population 25 years and older that are high school nongraduates (LessHSchool), percent of the population that speak English at home less than "very well" (LessEnglish), and the percent of owner-occupied housing units (OwnerOcc).
Ordinary least squares (OLS) is one linear regression model that is used to assess the strength of relationships between explanatory variables and the dependent variable. It is global in the sense that it assigns a single regression equation of coefficients to all of the input features to be analyzed. The aim of the OLS method is to minimize the sum of square differences between the observed values and the values that the model predicts. A series of statistical checks can be applied to evaluate the model performance, significance, and biases in order to systematically determine the most representative explanatory variables of those input to the model.
If there are a large number of potential explanatory variables that could be contributing to the model, such as with this study, then exploratory regression can also be utilized. Exploratory regression tests all possible combinations of explanatory variables to determine which models, if any, pass all of the OLS statistical checks. These diagnostics include assessing the coefficient of determination (R²), comparing models based on the corrected Akaike Information Criterion (AIC), measuring the amount of multicollinearity between variables based on the Variance Inflation Factor (VIF), assessing the normality of the residuals based on the Jarque-Bera p-values, determining model nonstationarity through the Koenker (Breusch-Pagan) p-values, and testing for spatial autocorrelation of the residuals based on the Global Moran's I. The inputs for the exploratory regression of the ten explanatory variables in predicting COVID-19 rates as well as the search criteria for the listed diagnostic checks are shown in Figure 2.
Both OLS and exploratory regression assume that variable relationships are static over space. Alternatively, geographically weighted regression (GWR) is a local linear regression method that allows for modelling spatially varying relationships. A regression equation is calculated for each feature from nearby features using different neighborhood types and local weighting schemes. If the explanatory variables are estimated for a future time, then a GWR model can be used to predict the dependent variable for that future time within the study area. The inputs for the GWR of the ten explanatory variables in predicting COVID-19 rates, including the neighborhood type and weighting scheme, are shown in Figure 3. The outputs of GWR also include diagnostic statistics that can be used to compare models, such as the adjusted R² and corrected AIC.
GWR operates under the assumption that variables can change over space, otherwise known as nonstationarity. From the results of the top three models found from the exploratory regression (Table 3), the Koenker (BP) p-values are all significant at the 0.05 level. This indicates that nonstationarity is present in the relationship between the ten explanatory variables and COVID-19 rates. This is in line with second-order nonstationarity effects, in that there can be interaction between events as COVID-19 is highly contagious between individuals. To better determine how first-order nonstationarity effects may be present is justification for performing GWR.
Assumptions
For the exploratory regression, all ten variables were thought to be potentially explanatory and were thus allowed to all be included at once. The minimum number of variables was set to two in an attempt to avoid misspecification. As there are relatively few regression studies of this type for the COVID-19 pandemic, there is little information on what an acceptable adjusted R² might be. The default value of 0.50 was chosen, which resulted in 78% of models passing for this criterion. The typical p-values and cutoff values were set for the coefficients, Jarque-Bera, and VIF values. None of the models passed when keeping the spatial autocorrelation p-value at 0.1 significance level. The regression was repeated at the 0.05 significance level, but this still resulted in none of the models passing.
For GWR, the model type was set to be continuous as there were a wide range of COVID-19 rates present. The number of neighbors were used for the calculations, as the irregular areas and thus densities of the census tracts seemed more important than any set distances between them. The Golden search method was used for neighborhood selection, as it determines the number of neighbors with the lowest corrected AIC as the neighborhood size. Finally, a Gaussian weighting scheme was used to increase the chance of variation in the many neighbors each feature will have as well as decrease the possibility of local collinearity.
Table 2: Description of the dependent and all explanatory variables used in this analysis.
Figure 2: Geoprocessing inputs for exploratory regression of the ten explanatory variables in predicting COVID rates (Population Density variable included but not visible).
Figure 3: Geoprocessing inputs for geographically weighted regression of the ten explanatory variables in predicting COVID rates (Population Density variable included but not visible).
None of the models created from the exploratory regression passed all of the statistical criteria, as a result of all of the models having statistically significant spatial autocorrelation of the residuals. This clustering of residuals is typically indicative of model misspecification, in that there are key explanatory variables that have not been included in the analysis. Although they did not pass all of the diagnostic checks, the top three models are described in Table 3. All three models have the highest and nearly identical R² values, which suggests that each model accounts for 74% of the variation in COVID-19 rates when adjusted for the number of variables included. These three models also had the lowest corrected AIC values of the models tested. However, as they are all within a value of three from one another, the lowest value is not necessarily indicative of the best model. In this case, Model 1 is the best of the tested models as it contained a fewer number of explanatory variables.
Model 1 accounts for 74% of the variation in COVID-19 rates when adjusted for the eight explanatory variables included. These variables are PM 2.5, median household income, public transportation, population >65, percent black, <high school graduate, owner occupied housing, and population density. Model 1 also has the lowest corrected AIC value of -1466.65. The Jarque-Bera p-value is statistically significant, which indicates that the residuals are not normally distributed and that the model is therefore biased. The Koenker (BP) p-value of 0.03 is also significant, in that nonstationarity is likely present. The VIF value of 3.16 indicates that redundancy is not present among the eight explanatory variables. The Global Moran's I is significant, as spatial autocorrelation is present in the residuals.
The results of the GWR analysis are given in Table 4. For the 372 census tracts in Santa Clara County, a neighborhood size of 32 was found to result in the lowest corrected AIC values from the Golden search method. The R² value indicates that 85% of the COVID-19 rate variance is accounted for by the GWR model. However, normalizing this value by the degrees of freedom instead gives an adjusted value of 0.81 which can be a preferable indicator of model performance. Smaller values of the sigma-squared and sigma-squared maximum likelihood estimation are preferred, as these are estimates of the variance of the residuals. The larger adjusted R² value as well as the lowest corrected AIC value of -1545.49 overall indicate that the results of the GWR are a better model than Model 1 from the exploratory regression analysis for predicting COVID-19 rates.
Outputs of the GWR analysis also include a scatterplot matrix (Figure 4), which visualizes the correlation between COVID-19 rates and each explanatory variable as well as the correlation between each pair of explanatory variables. Trend lines are included, and strong correlations between any two pairs of explanatory variables indicate multicollinearity. The variable with the highest R² value, percent of high school nongraduates, is highlighted. The GWR results can also be symbolized based on the coefficients calculated for each of the explanatory variables. Figure 5 illustrates the relationship between high school nongraduates and COVID-19 rates, where darker purple census tracts indicate a stronger relationship. This relationship appears to be the weakest in downtown San Jose as well as the urban core along the San Francisco Bay. The highest relationships occur in the suburban areas to the west as well as the less densely populated areas to the east. The variable with the second highest R² value when plotted against COVID-19 rates was the median household income. Figure 6 shows this relationship, where darker green census tracts represent a stronger relationship. The lowest relationship appears to be in the area surrounding downtown San Jose, while the strongest relationship is in the highly affluent northwest from Cupertino to Stanford.
Table 3: Top three models and corresponding statistics from the exploratory regression results (*p<0.1; **p<0.05; ***p<0.01).
Table 4: Results and statistics from the geographically weighted regression.
Figure 4: Relationships between each variable and COVID rates from the geographically weighted regression, with the variable with the highest R² value (<High School Graduate) highlighted.
Figure 5: High school nongraduates coefficients from the GWR results, where darker purple colors represent a stronger relationship.
Figure 6: Median household income coefficients from the GWR results, where darker green colors represent a stronger relationship.
Significant spatial autocorrelation was present in every one of the models tested under exploratory regression. A phenomenon as complicated as localized disease propagation would likely be better explained by including a variety of additional factors. The study by Berg et al. (2021) started from 20 covariates at the census tract-level, while Wu et al. (2020) selected from an initial 42 potential covariates at the county-level. The sign of each relationship from the exploratory models, either positive or negative, are in line with what would be initially expected for almost all of the explanatory variables. As the percent of the population ages 65 and older increased, however, the COVID-19 rates actually decreased. As the PM 2.5 concentrations increased, the predicted COVID-19 rates also decreased.
The results of the geographically weighted regression indicate an overall better model of COVID-19 rates than any of the models from the exploratory regression analysis. By its nature it can better account for how demographic data in particular can vary widely across nearby census tract areas. As data is not available on the individual-level for representative populations, GWR could be more useful in visualizing preliminary relationships and aiding in public health response.
One limitation that could affect this analysis is the use of COVID-19 infection rates, which as described earlier could lead to second-order nonstationarity effects. The variables used in this analysis could be more explanatory towards mortality rates instead, such as in the results of the county-level studies by Knittel & Ozaltun (2020) and Wu et al. (2020). Additionally, using static census data from 2019 to predict cumulative cases throughout the years of the pandemic could result in unclear relationships. After viewing the scatterplot matrix in Figure 4, transformations of some of the variables could also have led to more highly correlated linear relationships. Based on prior studies, it seemed that PM 2.5 concentrations would be a more strongly related predictive factor. However, calculating a single average value over the four years of data from the daily predicted concentrations for each census tract likely reduced localized variation that could have resulted in a more predictive model.
This study has aimed to identify and potentially understand correlations with COVID-19 rates and does not intend to imply causation. Although the high spatial correlation between census tract demographics would likely require additional variables and regression techniques to improve model performance, spatial risk factors can be used to help predict COVID-19 rates within Santa Clara County in California.
Berg, K., Romer Present, P., & Richardson, K. (2021). Long-term air pollution and other risk factors associated with COVID-19 at the census tract level in Colorado. Environmental Pollution (Barking, Essex : 1987), 287, 117584. https://doi.org/10.1016/j.envpol.2021.117584
Knittel, C. R., & Ozaltun, B. (2020). What does and does not correlate with COVID-19 death rates (Working Paper 27391). National Bureau of Economic Research. https://doi.org/10.3386/w27391
Wu, X., Nethery, R. C., Sabath, M. B., Braun, D., & Dominici, F. (2020). Air pollution and COVID-19 mortality in the United States: Strengths and limitations of an ecological regression analysis. Science Advances, 6(45), eabd4049. https://doi.org/10.1126/sciadv.abd4049