<p>Treatments and Observations: Blog for interesting results from statistics, causal inference, and machine learning.</p> <p>On cross-fitting with plug-in estimators (2022-10-24), syadlowsky.com/blog/semiparametric/2022/10/24/on-cross-fitting-with-plug-in-estimators</p> <style> .nm_conditions ol { type: "i" } </style> <h3> Table of Contents </h3> <ul id="markdown-toc"> <li><a href="#introduction" id="markdown-toc-introduction">Introduction</a> <ul> <li><a href="#background" id="markdown-toc-background">Background</a></li> <li><a href="#nuisance-parameter-estimates-with-kernel-smoothing" id="markdown-toc-nuisance-parameter-estimates-with-kernel-smoothing">Nuisance Parameter Estimates with Kernel Smoothing</a></li> </ul> </li> <li><a href="#same-statistical-properties-with-cross-fitting" id="markdown-toc-same-statistical-properties-with-cross-fitting">Same statistical properties with cross-fitting</a> <ul> <li><a href="#cross-fit-and-not-master-theorems" id="markdown-toc-cross-fit-and-not-master-theorems">Cross-fit (and not) master theorems</a></li> <li><a href="#kernel-smoothing-and-cross-fitting-analogue-of-theorem-811" id="markdown-toc-kernel-smoothing-and-cross-fitting-analogue-of-theorem-811">Kernel smoothing and cross-fitting (analogue of Theorem 8.11)</a></li> </ul> </li> <li><a href="#references" id="markdown-toc-references">References</a></li> </ul> <h2 id="introduction">Introduction</h2> <p>There’s a strong correlation between cross-fitting and doubly robust methods in the recent literature on semiparametric statistics that can make it difficult for new researchers to easily understand the purpose of each of these. 
While extremely powerful when used together, they solve different “problems” that can come up with semiparametric two-stage methods.</p> <p>In this post, I will discuss the role of cross-fitting, how it affects the asymptotic distribution of an estimator, and why, in some cases, it isn’t necessary. Additionally, by comparing estimators with and without cross-fitting to sample-splitting estimators, we will more clearly understand the effect of nuisance parameter estimation on the asymptotic distribution of the estimator.</p> <h3 id="background">Background</h3> <p>Here, we restrict our discussion to a specific subset of two-stage methods, where we first estimate some unknown (nonparametric) nuisance parameter $$\gamma(\cdot)$$ with a function $$\hat{\gamma}(\cdot)$$ and then plug it into a known function $$g(\cdot)$$ that we average,</p> $\hat{\theta} = \frac{1}{n} \sum_{i=1}^n g(Z_i, \hat{\gamma}).$ <p>Many of the ideas will apply to other semiparametric estimators, but this simplifies a lot of the algebra.</p> <p>An evident issue with such an estimator is the re-use of data for fitting the nuisance parameter estimator $$\hat{\gamma}$$ and evaluating it in $$g(Z_i, \hat{\gamma})$$. For example, suppose we wanted to estimate the average treatment effect of the treatment $$W$$ on the outcome $$Y$$ with observed confounders $$X$$ using the augmented inverse propensity weighting (AIPW) estimator, where, writing $$\gamma = (\mu_0, \mu_1, \pi)$$ for the two outcome regressions and the propensity score,</p> $g_{\text{AIPW}}(Z_i, \gamma) = \mu_1(X_i) - \mu_0(X_i) + \frac{W_i - \pi(X_i)}{\pi(X_i)(1 - \pi(X_i))}(Y_i - \mu_{W_i}(X_i)).$ <p>When the outcome regressions are fit on the same data, the residuals $$Y_i - \mu_{W_i}(X_i)$$ will be <em>in-sample</em> residuals, instead of <em>out-of-sample</em> residuals; it is well known that in-sample residuals are often systematically smaller than out-of-sample residuals. However, the properties of out-of-sample residuals are important for giving the AIPW estimator the doubly-robust property. 
One solution to this problem is to use a technique called <em>cross-fitting</em>, which is closely related to <em>cross-validation</em> in machine learning.</p> <p>By cross-fitting, we mean the following procedure:</p> <ol> <li>splitting the data into $$k$$ folds,</li> <li>iterating over folds $$j=1,\dots,k$$, fitting $$\hat{\gamma}^{j}(\cdot)$$ on the data from all <em>but</em> the $$j$$-th fold, and</li> <li>finally, computing the estimate as the average</li> </ol> $\hat{\theta}_{\text{cross-fit}} = \frac{1}{n} \sum_{j=1}^k \sum_{i~\text{in fold}~j} g(Z_i, \hat{\gamma}^{j}).$ <p>Cross-fitting has gained significant popularity in the semiparametric literature recently, due to its importance for developing clean black-box master theorems for the use of machine learning in semiparametric estimation methods. While cross-fitting intuitively removes own-observation biases, as in the AIPW example above, it isn’t always clear in these master theorems what other roles cross-fitting plays in defining the statistical properties of the resulting method.</p> <h3 id="nuisance-parameter-estimates-with-kernel-smoothing">Nuisance Parameter Estimates with Kernel Smoothing</h3> <p>An interesting point of reference is a classical semiparametric method and theoretical analysis from the Handbook of Econometrics, Vol IV, Chp 36, by Newey and McFadden. In Section 8, they discuss a plug-in method as suggested above using kernel smoothing estimators to fit the nuisance parameters nonparametrically, using no cross-fitting. Kernel smoothing is a good nonparametric estimation technique when the function to be estimated has a similar level of smoothness all across the input space. 
Assuming this is true, and the function is sufficiently smooth (we will formalize this later), then cross-fitting is not necessary to obtain asymptotic normality of the plug-in estimator.</p> <details><summary>Expand for details on kernel smoothing nuisance parameter estimation</summary> </details> <p><br /></p> <p>The formal assumptions about the nuisance parameters and the bandwidth of the estimator that ensure that the estimated nuisance parameters converge sufficiently quickly are below. With these, we can restate Newey and McFadden’s result for plug-in semiparametric estimators with kernel density nuisance parameter estimators. Here, we give a slightly more precise statement (asymptotic linearity) that is shown in the proof in Newey and McFadden, but not stated in the main theorem. The asymptotic linearity directly implies the asymptotic normality due to the central limit theorem.</p> <details><summary>Expand for assumptions for kernel smoothing (Assumptions 8.1-8.3)</summary> Placeholder. 
</details> <p><br /></p> <p><strong>Theorem 8.11 <em>(Newey and McFadden, 1994)</em></strong></p> <p><em>Suppose that Assumptions 8.1-8.3 are satisfied, $$E[g(Z, \gamma_0)] = 0,$$ $$E[ \|g(Z, \gamma_0)\|^2 ] &lt; \infty,$$ $$\mathcal{X}$$ is a compact set, the kernel bandwidth $$\sigma$$ satisfies $$n \sigma^{2r + 4d} / (\operatorname{ln} n)^2 \to \infty$$ and $$n \sigma^{2m} \to 0,$$ and there is a vector of functionals $$G(z, \gamma)$$ that is linear in $$\gamma$$ such that</em></p> <ol> <li><em>(linearization) for all $$\gamma$$ with $$\gamma - \gamma_0$$ small enough, $$\| g(z, \gamma) - g(z, \gamma_0) - G(z, \gamma - \gamma_0) \| \le b(z) \| \gamma - \gamma_0 \|^2$$, and $$E[b(Z)] \sqrt{n} \| \hat{\gamma} - \gamma_0 \|^2 \stackrel{p}{\to} 0$$;</em></li> <li><em>(bounded linear) $$\vert G(z, \gamma)\vert \le c(z) \| \gamma \|,$$ with $$E[ c^2(Z) ] &lt; \infty$$;</em></li> <li><em>(representation) there is $$v(x)$$ with $$\int G(z, \gamma) \text{d}P(z) = \int v(x) \gamma(x) \text{d}{x}$$ for all $$\| \gamma \| &lt; \infty$$;</em></li> <li><em>$$v(x)$$ is continuous almost everywhere, $$\int \|v(x)\| \text{d} x &lt; \infty$$, and there is $$\epsilon &gt; 0$$ such that $$E[\sup_{\| \nu \| \le \epsilon} \| v(x + \nu) \|^4 ] &lt; \infty.$$</em></li> </ol> <p><em>Then, for $$\delta(z) = v(x) y - E[v(X) Y],$$</em></p> $\frac{1}{\sqrt{n}} \sum_{i=1}^n g(Z_i, \hat{\gamma}) = \frac{1}{\sqrt{n}} \sum_{i = 1}^n \left[ g(Z_i, \gamma_0) + \delta(Z_i) \right] + o_P\left(1\right).$ <h2 id="same-statistical-properties-with-cross-fitting">Same statistical properties with cross-fitting</h2> <p>In the rest of this post, we will show that using cross-fitting versus not does not affect the asymptotic properties of the plug-in estimator with kernel-smoothing estimates of the nuisance parameters, by articulating a cross-fit analogue of this result.</p> <h3 id="cross-fit-and-not-master-theorems">Cross-fit (and not) master theorems</h3> <p>We start with a more general master theorem (Theorem 8.1 in N+M) that 
can be used with any nuisance parameter estimator. As before, we state the asymptotic linearity result implied in the proof, but not stated in the main theorem.</p> <p><strong>Theorem 8.1 <em>(Newey and McFadden, 1994)</em></strong></p> <p><em>Suppose that $$E[g(Z, \gamma_0)] = 0,~E[\|g(Z, \gamma_0)\|^2 ] &lt; \infty,$$ and there is $$\delta(z)$$ with $$E[\delta(Z)] = 0,$$ $$E[\|\delta(Z)\|^2] &lt; \infty$$ and</em></p> <ol> <li><em>(linearization) there is a function $$G(z, \gamma - \gamma_0)$$ that is linear in $$\gamma - \gamma_0$$ such that for all $$\gamma$$ with $$\gamma - \gamma_0$$ small enough, $$\| g(z, \gamma) - g(z, \gamma_0) - G(z, \gamma - \gamma_0) \| \le b(z) \| \gamma - \gamma_0 \|^2$$, and $$E[b(Z)] \sqrt{n} \| \hat{\gamma} - \gamma_0 \|^2 \stackrel{p}{\to} 0$$;</em></li> <li><em>(stochastic equicontinuity) $$\tfrac{1}{\sqrt{n}} \sum_{i=1}^n [G(Z_i, \hat{\gamma} - \gamma_0) - \int G(z, \hat{\gamma} - \gamma_0) \text{d}{P}(z)] \stackrel{p}{\to} 0$$;</em></li> <li><em>(mean-square differentiability) there is a measure $$\tilde{F}$$ such that for all $$\| \gamma - \gamma_0 \|$$ small enough, $$\int G(z, \hat{\gamma} - \gamma_0) \text{d}P(z) = \int \delta(z) \text{d}\tilde{F}(z)$$;</em></li> <li><em>the measure $$\tilde{F}$$ approaches the empirical measure, in the sense that $$\sqrt{n}[ \int \delta(z) \text{d}\tilde{F}(z) - \tfrac{1}{n} \sum_{i=1}^n \delta(Z_i) ] \stackrel{p}{\to} 0.$$</em></li> </ol> <p><em>Then, $$\tfrac{1}{\sqrt{n}} \sum_{i=1}^n g(Z_i, \hat{\gamma}) = \tfrac{1}{\sqrt{n}} \sum_{i=1}^n \left[ g(Z_i, \gamma_0) + \delta(Z_i) \right] + o_{P}(1).$$</em></p> <p>Essentially, this result provides very general conditions, not specific to the choice of nuisance parameter estimator, under which we can <em>linearize</em> the first order effect of plugging in the estimated nuisance parameter $$\hat{\gamma}$$ instead of the true nuisance parameter $$\gamma_0$$, as simply a sum of per-observation components $$\delta(Z_i).$$</p> <p>This result can easily be 
generalized to a cross-fitting estimator, and in the process we will simplify the <em>stochastic equicontinuity</em> condition into a simple condition that the linearization $$G(z, \gamma)$$ has $$L_2$$-bounded coefficients.</p> <p>The first step in doing so is articulating the result for a sample-splitting estimator. A sample-splitting estimator is a simple form of cross-fitting where we split the data into two pieces, $$I_1$$ and $$I_2$$, each a constant fraction of the total sample size, $$\vert I_1 \vert = a \cdot n$$, for $$0 &lt; a &lt; 1$$, use $$I_2$$ to fit the nuisance parameters $$\hat{\gamma}$$, and use $$I_1$$ to construct the average, $$\hat{\theta}_{\text{sample-split}} = \tfrac{1}{an} \sum_{i \in I_1} g(Z_i, \hat{\gamma}).$$</p> <p><strong>Theorem 1</strong></p> <p><em>Suppose that $$E[g(Z, \gamma_0)] = 0,~E[\|g(Z, \gamma_0)\|^2 ] &lt; \infty,$$ and there is $$\delta(z)$$ with $$E[\delta(Z)] = 0,$$ $$E[\|\delta(Z)\|^2] &lt; \infty$$ and</em></p> <ol> <li><em>(linearization) there is a function $$G(z, \gamma - \gamma_0)$$ that is linear in $$\gamma - \gamma_0$$ such that for all $$\gamma$$ with $$\gamma - \gamma_0$$ small enough, $$\| g(z, \gamma) - g(z, \gamma_0) - G(z, \gamma - \gamma_0) \| \le b(z) \| \gamma - \gamma_0 \|^2$$, and $$E[b(Z)] \sqrt{n} \| \hat{\gamma} - \gamma_0 \|^2 \stackrel{p}{\to} 0$$;</em></li> <li><em>(bounded linear) $$\vert G(z, \gamma)\vert \le c(z) \| \gamma \|,$$ with $$E[ c^2(Z) ] &lt; \infty$$;</em></li> <li><em>(mean-square differentiability) there is a measure $$\tilde{F}$$ such that for all $$\| \gamma - \gamma_0 \|$$ small enough, $$\int G(z, \hat{\gamma} - \gamma_0) \text{d}P(z) = \int \delta(z) \text{d}\tilde{F}(z)$$;</em></li> <li><em>the measure $$\tilde{F}$$ approaches the empirical measure over $$I_2$$, in the sense that $$\sqrt{n}[ \int \delta(z) \text{d}\tilde{F}(z) - \tfrac{1}{(1-a)n} \sum_{j \in I_2} \delta(Z_j) ] \stackrel{p}{\to} 0.$$</em></li> </ol> <p><em>Then, $$\tfrac{1}{a \sqrt{n}} \sum_{i \in 
I_1} g(Z_i, \hat{\gamma}) = \tfrac{1}{a \sqrt{n}} \sum_{i \in I_1} g(Z_i, \gamma_0) + \tfrac{1}{(1-a) \sqrt{n}} \sum_{j \in I_2} \delta(Z_j) + o_{P}(1).$$</em></p> <details><summary>Expand to see proof of Theorem 1.</summary> <p><b><i>Proof.</i></b></p> <p> The proof here is very similar to the proof of Theorem 8.1 by Newey and McFadden (1994). However, we provide a bit more detail on the expansion here. The main idea is to take the terms in the sum and write them as an approximation that will be easier to analyze, and an approximation error term that we will show will be lower order. The first step is to add and subtract a linearization of $$g(z, \gamma)$$ as $$G(z, \gamma)$$. \begin{aligned} \tfrac{1}{a \sqrt{n}} \sum_{i \in I_1} g(Z_i, \hat{\gamma}) &amp;= \underbrace{\tfrac{1}{a \sqrt{n}} \sum_{i \in I_1} g(Z_i, \gamma_0)}_{\text{Term 1}} \\ &amp;~~~~~~~~~~~ + \underbrace{\tfrac{1}{a \sqrt{n}} \sum_{i \in I_1} \left[ g(Z_i, \hat{\gamma}) - g(Z_i, \gamma_0) - G(Z_i, \hat{\gamma} - \gamma_0) \right]}_{\text{Term 2}} \\ &amp;~~~~~~~~~~~ + \tfrac{1}{a \sqrt{n}} \sum_{i \in I_1} G(Z_i, \hat{\gamma} - \gamma_0) \end{aligned} Then, we can look at the difference between this empirical average and the population average of the linearization, $$\int G(z, \hat\gamma - \gamma_0) \text{d}P(z),$$ \begin{aligned} \tfrac{1}{a \sqrt{n}} \sum_{i \in I_1} G(Z_i, \hat{\gamma} - \gamma_0) &amp;= \underbrace{\tfrac{1}{a \sqrt{n}} \sum_{i \in I_1} \left[ G(Z_i, \hat{\gamma} - \gamma_0) - \int G(z, \hat{\gamma} - \gamma_0) \text{d}P(z) \right]}_{\text{Term 3}} \\ &amp;~~~~~~~~~~~ + \sqrt{n} \int G(z, \hat{\gamma} - \gamma_0) \text{d}P(z) \end{aligned} Finally, using assumptions 3 and 4 of the theorem, we can linearize the integral as \begin{aligned} \sqrt{n} \int G(z, \hat{\gamma} - \gamma_0) \text{d}P(z) &amp;= \underbrace{\sqrt{n}\left(\int G(z, \hat{\gamma} - \gamma_0) \text{d}P(z) - \int \delta(z) \text{d}\tilde{F}(z)\right)}_{\text{Term 4}} \\ &amp;~~~~~~~~~~~ + \underbrace{\sqrt{n} \left[\int \delta(z) 
\text{d}\tilde{F}(z) - \tfrac{1}{(1-a)n} \sum_{j \in I_2} \delta(Z_j)\right]}_{\text{Term 5}} \\ &amp;~~~~~~~~~~~ + \underbrace{\tfrac{1}{a \sqrt{n}} \sum_{i \in I_1} \tfrac{1}{(1-a)n} \sum_{j \in I_2} \delta(Z_j)}_{\text{Term 6}} \end{aligned} Putting all of these pieces together, we have \begin{aligned} \tfrac{1}{a \sqrt{n}} \sum_{i \in I_1} g(Z_i, \hat{\gamma}) &amp;= \underbrace{\tfrac{1}{a \sqrt{n}} \sum_{i \in I_1} g(Z_i, \gamma_0)}_{\text{Term 1}} \\ &amp;~~~~~~~~~~~ + \underbrace{\tfrac{1}{a \sqrt{n}} \sum_{i \in I_1} \left[ g(Z_i, \hat{\gamma}) - g(Z_i, \gamma_0) - G(Z_i, \hat{\gamma} - \gamma_0) \right]}_{\text{Term 2}} \\ &amp;~~~~~~~~~~~ + \underbrace{\tfrac{1}{a \sqrt{n}} \sum_{i \in I_1} \left[ G(Z_i, \hat{\gamma} - \gamma_0) - \int G(z, \hat{\gamma} - \gamma_0) \text{d}P(z) \right]}_{\text{Term 3}} \\ &amp;~~~~~~~~~~~ + \underbrace{\sqrt{n}\left(\int G(z, \hat{\gamma} - \gamma_0) \text{d}P(z) - \int \delta(z) \text{d}\tilde{F}(z)\right)}_{\text{Term 4}} \\ &amp;~~~~~~~~~~~ + \underbrace{\sqrt{n} \left[\int \delta(z) \text{d}\tilde{F}(z) - \tfrac{1}{(1-a)n} \sum_{j \in I_2} \delta(Z_j)\right]}_{\text{Term 5}} \\ &amp;~~~~~~~~~~~ + \underbrace{\tfrac{1}{(1-a)\sqrt{n}} \sum_{j \in I_2} \delta(Z_j)}_{\text{Term 6}} \end{aligned} Terms 1 and 6 will remain as dominant terms, and the rest will be lower order. Terms 4 and 5 are lower order by assumptions 3 and 4 of the theorem, respectively. Therefore, to prove the result, it remains to show that terms 2 and 3 are lower order. </p> <p><b>Term 2</b></p> <p> Assumption 1 of the theorem and Chebyshev's inequality show that this term converges to zero in probability. From the assumed convergence of $$\hat\gamma$$, for some $$n_0$$ large enough, we know that $$n \ge n_0$$ implies $$\hat\gamma - \gamma_0$$ is small enough such that the linearization holds. 
Therefore, for all $$n \ge n_0$$, \begin{aligned} \left\vert \tfrac{1}{a\sqrt{n}} \sum_{i \in I_1} (g(Z_i, \hat{\gamma}) - g(Z_i, \gamma_0) - G(Z_i, \hat{\gamma} - \gamma_0)) \right\vert &amp;\le \tfrac{1}{an} \sum_{i \in I_1} \sqrt{n} \vert g(Z_i, \hat{\gamma}) - g(Z_i, \gamma_0) - G(Z_i, \hat{\gamma} - \gamma_0) \vert \\ &amp;\le \tfrac{1}{an} \sum_{i \in I_1} \sqrt{n} b(Z_i)\| \hat\gamma - \gamma_0 \|^2. \end{aligned} The weak law of large numbers implies that $$\tfrac{1}{an} \sum_{i \in I_1} b(Z_i) \stackrel{p}{\to} E[b(Z)],$$ and by assumption, $$E[b(Z)] \sqrt{n} \| \hat\gamma - \gamma_0 \|^2 \stackrel{p}{\to} 0,$$ so by the continuous mapping theorem, $\tfrac{1}{an} \sum_{i \in I_1} \sqrt{n} b(Z_i)\| \hat\gamma - \gamma_0 \|^2 \stackrel{p}{\to} 0.$ </p> <p><b>Term 3</b></p> <p> This term is where sample splitting is crucial, and where our proof diverges from Newey and McFadden (1994). Given $$\hat{\gamma},$$ the observations $$\{ G(Z_i, \hat{\gamma}-\gamma_0) - \int G(z, \hat{\gamma}-\gamma_0) \}_{i \in I_1}$$ are independent and mean zero. Therefore, by Chebyshev's inequality, \begin{aligned} &amp;P\left( \left\vert \frac{1}{a\sqrt{n}} \sum_{i \in I_1} G(Z_i, \hat{\gamma}-\gamma_0) - \int G(z, \hat{\gamma}-\gamma_0) \right\vert &gt; \epsilon \middle\vert \hat\gamma\right) \\ &amp;~~~~~~~~\le \frac{\operatorname{var}\left( \frac{1}{a\sqrt{n}} \sum_{i \in I_1} G(Z_i, \hat{\gamma}-\gamma_0) - \int G(z, \hat{\gamma}-\gamma_0) \middle\vert \hat\gamma \right)}{\epsilon^2} \\ &amp;~~~~~~~~\le \frac{\operatorname{var}\left( G(Z_i, \hat{\gamma}-\gamma_0) - \int G(z, \hat{\gamma}-\gamma_0) \middle\vert \hat\gamma \right)}{a \epsilon^2} \\ &amp;~~~~~~~~\le \frac{\operatorname{E}\left[ G^2(Z_i, \hat{\gamma}-\gamma_0) \middle\vert \hat\gamma \right]}{a \epsilon^2}. 
\end{aligned} By assumption 2, $$\operatorname{E}\left[ G^2(Z_i, \hat{\gamma}-\gamma_0) \mid \hat{\gamma}\right] \le \operatorname{E}\left[ c^2(Z)\| \hat{\gamma} - \gamma_0 \|^2 \mid \hat{\gamma}\right] = \operatorname{E}[ c^2(Z) ] \|\hat\gamma - \gamma_0\|^2.$$ By assumption, $$\|\hat\gamma - \gamma_0\|^2 \stackrel{p}{\to} 0$$, so conditionally on $$\hat\gamma$$, this sequence converges. Chernozhukov et al. (2018), Lemma 6.1, shows that conditional convergence implies unconditional convergence, and so $$\text{Term 3} \stackrel{p}{\to} 0.$$ <span align="right" style="display:block; clear: both;"> $$\Box$$</span> </p> </details> <p><br /></p> <p>This representation more clearly isolates the component of the influence function that comes from estimating the nuisance parameters, in that we can see that the $$\delta(Z_j)$$ terms come from a sum over $$I_2,$$ where $$\hat{\gamma}$$ was estimated.</p> <p>From here, we can think of a cross-fit estimator as an average of sample-split estimators. We will use $$I_{1, j}$$ and $$I_{2, j}$$ to denote the two pieces of the $$j$$-th of these estimators, and in each one, $$a = 1/k.$$</p> \begin{aligned} \hat{\theta}_{\text{cross-fit}} &amp;= \frac{1}{k} \sum_{j=1}^k \frac{k}{n} \sum_{i \in I_{1, j}} g(Z_i, \hat{\gamma}^{j}) \\ &amp;= \frac{1}{k} \sum_{j=1}^k \left( \frac{k}{n} \sum_{i \in I_{1, j}} g(Z_i, \gamma_0) + \frac{k}{(k - 1) n} \sum_{i \in I_{2, j}} \delta(Z_i) + o_P\left(\tfrac{1}{\sqrt{n}}\right) \right) \\ &amp;= \frac{1}{n} \sum_{j=1}^k \left[ \sum_{i \in I_{1, j}} g(Z_i, \gamma_0) + \sum_{i \in I_{2, j}} \frac{1}{(k - 1)}\delta(Z_i) \right] + o_P\left(\tfrac{1}{\sqrt{n}}\right) \end{aligned} <p>A given example $$i$$ appears in $$I_{1,j}$$ for just 1 value of $$j$$, and in $$I_{2,j}$$ for $$k-1$$ of them. 
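</p> <p>The fold bookkeeping used in this argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names <code>make_folds</code>, <code>cross_fit</code>, and <code>training_counts</code>, the round-robin fold assignment, and the toy choice of nuisance parameter (a sample mean) are all hypothetical and not part of the original analysis.</p>

```python
from statistics import mean


def make_folds(n, k):
    """Split indices 0..n-1 into k evaluation folds I_{1,j} (round-robin)."""
    return [list(range(j, n, k)) for j in range(k)]


def cross_fit(data, k, fit, g):
    """Average g(Z_i, gamma_hat^j) over all i, where gamma_hat^j is fit
    on every observation outside the fold that contains i."""
    n = len(data)
    total = 0.0
    for eval_idx in make_folds(n, k):
        held_out = set(eval_idx)
        train = [data[i] for i in range(n) if i not in held_out]  # I_{2,j}
        gamma_hat = fit(train)
        total += sum(g(data[i], gamma_hat) for i in eval_idx)
    return total / n


def training_counts(n, k):
    """Number of folds j for which index i belongs to the training set I_{2,j}."""
    counts = [0] * n
    for eval_idx in make_folds(n, k):
        held_out = set(eval_idx)
        for i in range(n):
            if i not in held_out:
                counts[i] += 1
    return counts


# Toy usage (hypothetical): gamma is a mean, g is a squared deviation from it.
data = [0.5, 1.5, 2.0, 3.5, 4.0, 4.5, 6.0, 7.5]
theta_hat = cross_fit(data, k=4, fit=mean, g=lambda z, gamma: (z - gamma) ** 2)
```

<p>With equal fold sizes, <code>training_counts(n, k)</code> returns $$k-1$$ for every index, which is the counting fact the next step relies on.</p> <p>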
Therefore, the sum $$\sum_{j=1}^k \sum_{i \in I_{2, j}} \frac{1}{k-1} \delta(Z_i) = \sum_{i=1}^n \delta(Z_i).$$ Altogether, this careful accounting gives us the following corollary:</p> <p><strong>Corollary 1</strong></p> <p><em>Under the assumptions of Theorem 1, the cross-fitting estimator has the following asymptotically linear representation:</em></p> $\hat{\theta}_{\text{cross-fit}} = \frac{1}{n} \sum_{i = 1}^n \left[ g(Z_i, \gamma_0) + \delta(Z_i) \right] + o_P\left(\tfrac{1}{\sqrt{n}}\right).$ <h3 id="kernel-smoothing-and-cross-fitting-analogue-of-theorem-811">Kernel smoothing and cross-fitting (analogue of Theorem 8.11)</h3> <p>With this, we can give an analogue of Theorem 8.11 from Newey and McFadden (1994) for when the nuisance parameters are estimated using kernel smoothing. Interestingly, the assumption replacing the stochastic equicontinuity assumption in our Theorem 1 is exactly the same as used in Assumption 8.? from Newey and McFadden, so there are no additional assumptions or conditions in this theorem.</p> <p><strong>Theorem 2 <em>(cross-fitting analogue of Thm 8.11, Newey and McFadden, 1994)</em></strong></p> <p><em>Suppose that Assumptions 8.1-8.3 are satisfied, $$E[g(Z, \gamma_0)] = 0,$$ $$E[ \|g(Z, \gamma_0)\|^2 ] &lt; \infty,$$ $$\mathcal{X}$$ is a compact set, the kernel bandwidth $$\sigma$$ satisfies $$n \sigma^{2r + 4d} / (\operatorname{ln} n)^2 \to \infty$$ and $$n \sigma^{2m} \to 0,$$ and there is a vector of functionals $$G(z, \gamma)$$ that is linear in $$\gamma$$ such that</em></p> <ol> <li><em>(linearization) for all $$\gamma$$ with $$\gamma - \gamma_0$$ small enough, $$\| g(z, \gamma) - g(z, \gamma_0) - G(z, \gamma - \gamma_0) \| \le b(z) \| \gamma - \gamma_0 \|^2$$, and $$E[b(Z)] \sqrt{n} \| \hat{\gamma} - \gamma_0 \|^2 \stackrel{p}{\to} 0$$;</em></li> <li><em>(bounded linear) $$\vert G(z, \gamma)\vert \le c(z) \| \gamma \|,$$ with $$E[ c^2(Z) ] &lt; \infty$$;</em></li> <li><em>(representation) there is $$v(x)$$ with $$\int G(z, \gamma) 
\text{d}P(z) = \int v(x) \gamma(x) \text{d}{x}$$ for all $$\| \gamma \| &lt; \infty$$;</em></li> <li><em>$$v(x)$$ is continuous almost everywhere, $$\int \|v(x)\| \text{d} x &lt; \infty$$, and there is $$\epsilon &gt; 0$$ such that $$E[\sup_{\| \nu \| \le \epsilon} \| v(x + \nu) \|^4 ] &lt; \infty.$$</em></li> </ol> <p><em>Then, for $$\delta(z) = v(x) y - E[v(X) Y],$$</em></p> $\frac{1}{\sqrt{n}} \sum_{j=1}^k \sum_{i~\text{in fold}~j} g(Z_i, \hat{\gamma}^{j}) = \frac{1}{\sqrt{n}} \sum_{i = 1}^n \left[ g(Z_i, \gamma_0) + \delta(Z_i) \right] + o_P\left(1\right).$ <p>The proof is almost identical to the original proof of Theorem 8.11 in Newey and McFadden (1994), using Corollary 1 in place of their Theorem 8.1. Note that the second condition is the same between this result and that of Theorem 1 / Corollary 1, making the conditions of Theorem 2 identical to those assumed by Newey and McFadden. Therefore, we can see that cross-fitting is not necessary for kernel smoothing under the assumed conditions, and does not change the result. As a consequence, any time a result in the literature cites Theorem 8.11 from N+M, a cross-fit version of the same estimator would work in the same place.</p> <h2 id="references">References</h2> <ol> <li>Newey, Whitney K., and Daniel McFadden. “Large sample estimation and hypothesis testing.” Handbook of Econometrics, Vol 4 (1994): 2111-2245.</li> <li>Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 
“Double/debiased machine learning for treatment and structural parameters.” The Econometrics Journal 21, no. 1 (2018): C1-C68.</li> </ol> <p>Covariate Adjustment in Randomized Trials (2020-11-01), syadlowsky.com/blog/semiparametric/2020/11/01/adjustment_se</p> <style> .nm_conditions ol { type: "i" } </style> <h3> Table of Contents </h3> <ul id="markdown-toc"> <li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li> <li><a href="#the-data-distribution-and-effect-estimators" id="markdown-toc-the-data-distribution-and-effect-estimators">The data distribution and effect estimators</a></li> <li><a href="#statistical-properties" id="markdown-toc-statistical-properties">Statistical properties</a></li> <li><a href="#conclusions" id="markdown-toc-conclusions">Conclusions</a></li> <li><a href="#references" id="markdown-toc-references">References</a></li> </ul> <h2 id="introduction">Introduction</h2> <p>$$\newcommand{\var}{ {\rm Var} } \newcommand{\cov}{ {\rm Cov} } \newcommand{\corr}{\mathop{\rm Corr} } \newcommand{\simiid}{\stackrel{\rm iid}{\sim} } \newcommand{\R}{\mathbb{R} } \newcommand{\E}{\mathbb{E} } \newcommand{\normal}{\mathcal{N} } \newcommand{\what}[1]{\hat{#1}}$$ A frequent question is whether covariate adjustment will likely increase or reduce the estimation error of causal inferences made from a randomized trial. <a href="Freedman08">Freedman (2008)</a> gives an example of how applying regression with an invalid structural model can create bias and invalid standard errors. 
<a href="Lin13">Lin (2013)</a> shows that in the asymptotic large-sample regime (when the number of samples $$n$$ grows, $$n \to \infty$$), these issues can easily be fixed by using Huber-White robust standard errors in a model with treatment-covariate interactions, so that the regression adjusted estimator has equal, or lower, asymptotic variance.</p> <p>In this short note, we study the finite sample behavior in a simple model where both the unadjusted and regression adjusted estimators are justified. In this model, we give conditions under which the regression adjustment would lead to a higher variance estimator than the unadjusted estimator, in terms of the dimension and strength of the effect of the covariates on the outcome. When the dimension of the covariates stays fixed as $$n \to \infty$$, the adjusted estimator will have no worse variance than the unadjusted estimator. However, for fixed sample sizes, or if $$\lim_{n \to \infty} d / n &gt; 0$$, then the adjusted estimator may have higher variance.</p> <h2 id="the-data-distribution-and-effect-estimators">The data distribution and effect estimators</h2> <p>Let $$Y$$ be the observed outcome in a randomized trial with treatment $$W \in \R$$ and observed pre-treatment covariates $$X \in \R^d$$. Suppose that the following structural model holds:</p> $Y(w) = \tau w + \beta^\top X + \epsilon,$ <p>where $$Y(w)$$ is the counterfactual outcome at treatment level $$w$$, $$\beta \in \R^d$$ is fixed (but unknown) and $$\epsilon \simiid \normal{}(0, 1)$$. Additionally, assume the standard consistency assumption, $$Y = Y(W)$$, and let the treatment be randomized, $$W \simiid \normal{}(0, 1)$$, and the covariates drawn from a population with $$X \simiid \normal{}(0, I_d)$$. 
We will use $$Z = (X^\top~W)^\top$$ and $$\gamma = (\beta^\top~\tau)^\top$$ to simplify notation.</p> <p>Given a sample of $$n$$ observations $$\{(Y_i, W_i, X_i)\}_{i=1}^n$$ with $$n &gt; d + 2$$ (so that the variances computed below are finite), consider the following two consistent estimators of $$\tau$$. First,</p> $\what{\tau}_{\text{unadj}} = \frac{\tfrac{1}{n}\sum_{i=1}^n W_i Y_i}{\tfrac{1}{n}\sum_{i=1}^n W_i^2},$ <p>is the unadjusted OLS estimator from regressing the outcome on the treatment. Second,</p> $\what{\tau}_{\text{adj}} = \left[ \begin{array}{cc} 0 &amp; 1 \end{array} \right] \left( \frac{1}{n} \sum_{i=1}^n Z_i Z_i^\top \right)^{-1} \frac{1}{n} \sum_{i=1}^n Z_i Y_i$ <p>is the least squares estimator that adjusts for baseline covariates. Plugging in the model $$Y_i = \tau W_i + \beta^\top X_i + \epsilon_i$$ gives the expanded forms</p> $\what{\tau}_{\text{unadj}} = \tau + \frac{\tfrac{1}{n}\sum_{i=1}^n (\beta^\top X_i + \epsilon_i)W_i }{\tfrac{1}{n}\sum_{i=1}^n W_i^2},$ <p>and</p> \begin{aligned} \what{\tau}_{\text{adj}} &amp;= \left[ \begin{array}{cc} 0 &amp; 1 \end{array} \right] \left( \frac{1}{n} \sum_{i=1}^n Z_i Z_i^\top \right)^{-1} \frac{1}{n} \sum_{i=1}^n Z_i (Z_i^\top \gamma + \epsilon_i) \\ &amp; = \tau + \left[ \begin{array}{cc} 0 &amp; 1 \end{array} \right] \left( \frac{1}{n} \sum_{i=1}^n Z_i Z_i^\top \right)^{-1} \frac{1}{n} \sum_{i=1}^n Z_i \epsilon_i. \end{aligned} <h2 id="statistical-properties">Statistical properties</h2> <p>Noting that $$\E[\epsilon_i \mid Z_i] = 0$$ shows that both of these estimators are unbiased. Next, we turn to their variance. 
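</p> <p>As a quick numerical sanity check of the two estimators and of the unbiasedness claim, the following sketch simulates the model in the special case $$d = 1$$ using only the Python standard library. The function names, the parameter values ($$\tau = 2$$, $$\beta = 0.5$$, $$n = 30$$), the number of replications, and the seed are all assumptions chosen for illustration.</p>

```python
import random


def simulate(n, tau, beta, rng):
    """Draw one sample from the model with d = 1; return (tau_unadj, tau_adj)."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ws = [rng.gauss(0, 1) for _ in range(n)]
    ys = [tau * w + beta * x + rng.gauss(0, 1) for x, w in zip(xs, ws)]

    sww = sum(w * w for w in ws)
    sxx = sum(x * x for x in xs)
    sxw = sum(x * w for x, w in zip(xs, ws))
    swy = sum(w * y for w, y in zip(ws, ys))
    sxy = sum(x * y for x, y in zip(xs, ys))

    # Unadjusted estimator: regress Y on W alone (no intercept).
    tau_unadj = swy / sww
    # Adjusted estimator: last entry of (Z'Z)^{-1} Z'Y with Z = (X, W),
    # written out with the closed-form inverse of the 2x2 matrix Z'Z.
    det = sxx * sww - sxw * sxw
    tau_adj = (sxx * swy - sxw * sxy) / det
    return tau_unadj, tau_adj


def monte_carlo(reps=2000, n=30, tau=2.0, beta=0.5, seed=0):
    """Average both estimators over independent simulated samples."""
    rng = random.Random(seed)
    draws = [simulate(n, tau, beta, rng) for _ in range(reps)]
    mean_unadj = sum(u for u, _ in draws) / reps
    mean_adj = sum(a for _, a in draws) / reps
    return mean_unadj, mean_adj


mean_unadj, mean_adj = monte_carlo()
```

<p>Both Monte Carlo averages should land close to the true $$\tau$$, up to simulation noise.</p> <p>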
We derive both from the law of total variance, $$\var(A) = \var(\E[A \mid B]) + \E[\var(A \mid B)]$$.</p> <p>First, for the variance of $$\sqrt{n}(\what{\tau}_{\text{unadj}} - \tau)$$, we note that $$\var(\beta^\top X_i + \epsilon_i) = \|\beta\|_2^2 + 1$$, and apply the law of total variance conditionally on $$\{W_i\}_{i=1}^n$$, to get</p> \begin{aligned} \var(\sqrt{n}(\what{\tau}_{\text{unadj}} - \tau)) &amp;= n \E[\var(\what{\tau}_{\text{unadj}} - \tau \mid \{W_i\}_{i=1}^n)] + n \var( \E[ \what{\tau}_{\text{unadj}} - \tau \mid \{W_i\}_{i=1}^n]) \\ &amp;= \E\left[ \frac{\frac{1}{n} \sum_{i=1}^n ( \| \beta\|_{2}^2 + 1)W_i^2}{\left(\frac{1}{n} \sum_{i=1}^n W_i^2\right)^2} \right] \\ &amp;= \left( \| \beta\|_{2}^2 + 1 \right) \frac{n}{n - 2}. \end{aligned} <p>Here the conditional mean is zero, so the second term vanishes, and the last equality uses that $$\sum_{i=1}^n W_i^2$$ is $$\chi^2_n$$-distributed, with $$\E[1/\chi^2_n] = 1/(n-2)$$ for $$n &gt; 2$$. For the variance of $$\sqrt{n}(\what{\tau}_{\text{adj}} - \tau)$$, we apply a similar trick, conditioning on $$\{Z_i\}_{i=1}^n$$,</p> \begin{aligned} \var(\sqrt{n}(\what{\tau}_{\text{adj}} - \tau)) &amp;= \left[ \begin{array}{cc} 0 &amp; 1 \end{array} \right] \var\left(\sqrt{n}\left( \frac{1}{n} \sum_{i=1}^n Z_i Z_i^\top \right)^{-1} \frac{1}{n} \sum_{i=1}^n Z_i \epsilon_i \right)\left[ \begin{array}{c} 0 \\ 1 \end{array} \right]. 
\end{aligned} <p>Noting that this is simply the bottom right element of $$\var(\sqrt{n}( \frac{1}{n} \sum_{i=1}^n Z_i Z_i^\top )^{-1} \frac{1}{n} \sum_{i=1}^n Z_i \epsilon_i )$$, we work out this variance to simplify notation.</p> \begin{aligned} \var\left(\sqrt{n}\left( \frac{1}{n} \sum_{i=1}^n Z_i Z_i^\top \right)^{-1} \frac{1}{n} \sum_{i=1}^n Z_i \epsilon_i \right) &amp;= \E\left[ \var\left(\sqrt{n}\left( \frac{1}{n} \sum_{i=1}^n Z_i Z_i^\top \right)^{-1} \frac{1}{n} \sum_{i=1}^n Z_i \epsilon_i~\middle|~\{Z_i\}_{i=1}^n \right) \right] \\ &amp;~~~~~~~~~~+ \var\left( \E\left[\sqrt{n}\left( \frac{1}{n} \sum_{i=1}^n Z_i Z_i^\top \right)^{-1} \frac{1}{n} \sum_{i=1}^n Z_i \epsilon_i~\middle|~\{Z_i\}_{i=1}^n \right] \right) \\ &amp;= \E\left[ \left( \frac{1}{n} \sum_{i=1}^n Z_i Z_i^\top \right)^{-1} \left( \frac{1}{n} \sum_{i=1}^n Z_i Z_i^\top \right) \left( \frac{1}{n} \sum_{i=1}^n Z_i Z_i^\top \right)^{-1} \right] \\ &amp;= \E\left[ \left( \frac{1}{n} \sum_{i=1}^n Z_i Z_i^\top \right)^{-1} \right] \\ &amp;= \frac{n}{n - d - 2} I_{d+1}, \end{aligned} <p>where the second equality uses that the conditional mean is zero and $$\var(\epsilon_i \mid Z_i) = 1$$, and the last equality follows from the fact that $$\left( \sum_{i=1}^n Z_i Z_i^\top \right)^{-1}$$ is a $$(d+1)$$-dimensional inverse Wishart matrix with $$n$$ degrees of freedom (<a href="Haff79">Haff, 1979</a>). Therefore $$\var(\sqrt{n}(\what{\tau}_{\text{adj}} - \tau)) = (1 - (d+2)/n)^{-1}$$.</p> <h2 id="conclusions">Conclusions</h2> <p>Notice that the variance of the adjusted least squares estimator is greater than that of the unadjusted estimator whenever</p> $\| \beta \|_2^2 &lt; \frac{d}{n - 2- d}.$ <p>If $$d$$ stays fixed as $$n \to \infty,$$ then $$\frac{d}{n - 2- d} \to 0$$, so the adjusted estimator will always have asymptotic variance less than or equal to that of the unadjusted estimator, consistent with the observations of <a href="Lin13">Lin (2013)</a>, and strictly less if $$\liminf_{n \to \infty} \|\beta\|_2 &gt; 0$$. 
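</p> <p>For a concrete sense of scale, the following sketch evaluates the two variance formulas derived above and the threshold $$d/(n - 2 - d)$$ at a few values. The function names and the choices $$n = 50$$, $$d = 10$$, and the two values of $$\|\beta\|_2^2$$ are assumptions made for this example only.</p>

```python
def var_unadjusted(beta_sq_norm, n):
    """Variance of sqrt(n) * (tau_unadj - tau): (||beta||^2 + 1) * n / (n - 2)."""
    return (beta_sq_norm + 1.0) * n / (n - 2)


def var_adjusted(d, n):
    """Variance of sqrt(n) * (tau_adj - tau): n / (n - d - 2)."""
    return n / (n - d - 2)


def threshold(d, n):
    """Value of ||beta||^2 at which the two variances coincide."""
    return d / (n - 2 - d)


# Illustrative values (hypothetical): n = 50 observations, d = 10 covariates.
n, d = 50, 10
for beta_sq_norm in (0.1, 1.0):
    print(
        f"||beta||^2 = {beta_sq_norm}: "
        f"unadjusted = {var_unadjusted(beta_sq_norm, n):.3f}, "
        f"adjusted = {var_adjusted(d, n):.3f}, "
        f"threshold = {threshold(d, n):.3f}"
    )
```

<p>With these values the threshold is $$10/38$$, so the weak-covariate case $$\|\beta\|_2^2 = 0.1$$ favors the unadjusted estimator while $$\|\beta\|_2^2 = 1$$ favors adjustment.</p> <p>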
However, in finite samples, it can have larger variance, especially if $$\|\beta\|_2$$ is small. The model with additional regression adjustment for interaction terms $$XW$$ would have $$d$$ more covariates, so that the variance would be $$(1 - (2d + 2)/n)^{-1}$$.</p> <h2 id="references">References</h2> <ol> <li>D. A. Freedman. On regression adjustments to experimental data. Advances in Applied Mathematics, 40(2):180–193, 2008, <a href="Freedman08">Online here</a>.</li> <li>L. Haff. An identity for the Wishart distribution with applications. Journal of Multivariate Analysis, 9(4):531–544, 1979, <a href="Haff79">Online here</a>.</li> <li>W. Lin. Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique. Annals of Applied Statistics, 7(1):295–318, 2013, <a href="Lin13">Online here</a>.</li> </ol>