Wednesday, August 22, 2012

Final Post

Now that GSOC is over, I'd like to wrap up what I accomplished this summer.

In general, I have four main features that are tested and whose results match their Matlab/R counterparts.  They are:

Inference on descriptive statistics
Inference on linear regression
Inference and estimation of the accelerated failure time model
Inference and estimation of regression with instrumental variables


Although these features compute correctly, only the descriptive branch is ready to be merged into master, as I am still generating documentation and working with the developers to determine exactly how to incorporate the new features.

Here are a few more specifics:

Descriptive

This allows us to compute confidence intervals for descriptive statistics via EL.
For example:

import statsmodels.api as sm
import numpy as np

data = np.random.standard_normal(20)
EL = sm.emplike.DescStat(data)
EL.ci_mean()
EL.ci_var()
EL.ci_kurt()
EL.ci_skew()


The above will return the empirical likelihood confidence regions for mean, variance, skewness and kurtosis of 'data.'

Regression

data = sm.datasets.stackloss.load()
model = sm.OLS(data.endog, sm.add_constant(data.exog, prepend=True))
fitted = model.fit()
fitted.ci_params(0)  # EL confidence interval for the first regression parameter

This returns the EL confidence interval for the first regression parameter.  This is useful because it accounts for heteroskedasticity.  Furthermore, we can add the option stochastic_exog=0 to change the assumption on the X matrix from stochastic to fixed.  The resulting confidence interval will be smaller.
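For instance, under the fixed-regressor assumption the call might look something like the following.  This is only a sketch against the ci_params interface shown above; the exact signature, including the parameter index and where the stochastic_exog keyword goes, is my assumption.

fitted.ci_params(0, stochastic_exog=0)  # EL interval for the first parameter, treating X as fixed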

AFT model

data = sm.datasets.heart.load() # heart not yet in master, I added it.
aftmod = sm.emplike.emplikeAFT(data.endog, data.exog, data.censors)
aftres = aftmod.fit()
aftres.params

This will give the parameters of an accelerated failure time model.  I also implemented hypothesis tests and confidence intervals, but the confidence intervals are very computationally intensive, as they involve a quadruply nested optimization, with one of the optimizations being a modified EM algorithm.  This model was by far the most difficult of my entire GSOC and took over 3 weeks to implement.
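As a rough illustration of the hypothesis-testing side, a call might look like the following.  The method name test_beta and its arguments are assumptions modeled on the current statsmodels emplike interface and may not match the branch described in this post.

# Sketch only: EL test of the null hypothesis that the second AFT parameter equals 0.
# test_beta(b0_vals, param_nums) is assumed from the current statsmodels emplike API.
result = aftres.test_beta([0], [1])
print(result)  # expected to contain the log-likelihood ratio and its p-value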

IVmodel

The basic framework for this is

model = sm.emplike.IVReg(endog, exog, instruments)
fittedIV = model.fit()
fittedIV.params

This returns the model's parameters by solving for an optimal probability distribution such that the probability-weighted moment condition Z'(y - XB) = 0 holds, where Z is a matrix of instruments, y is the endogenous variable, X is a matrix of covariates, and B is the vector of model parameters.
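To make the moment condition concrete, here is a small numpy sketch for the just-identified case, where solving Z'(y - XB) = 0 directly gives the familiar IV estimator (and the EL probabilities reduce to 1/n).  This is illustrative only, not the emplike code; all names and data are made up.

import numpy as np

np.random.seed(0)
n = 200
Z = np.random.standard_normal((n, 2))            # instruments
X = Z + 0.5 * np.random.standard_normal((n, 2))  # covariates correlated with the instruments
B_true = np.array([1.0, -2.0])
y = X.dot(B_true) + np.random.standard_normal(n)

# Just-identified case: Z'(y - XB) = 0 can be solved directly for B.
B_iv = np.linalg.solve(Z.T.dot(X), Z.T.dot(y))

print(B_iv)                      # close to B_true
print(Z.T.dot(y - X.dot(B_iv)))  # moment condition, numerically ~ 0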


Like I said, only the descriptive branch is ready to merge, but I plan to continue working with EL in statsmodels, to bring the other branches up to the same standard, and to work with the maintainers on how best to incorporate the new features.


As I mentioned in my last post, if any users have questions on how to conduct EL in statsmodels or would like a specific new feature implemented, please email me or ping the statsmodels mailing list and I will do my best to accommodate.

Special thanks again to everybody that helped this summer.  Skipper and Josef were more helpful than I could have anticipated.


To see some of the code, check out pull request #326, which contains a lot of the meat of the project, including the setup for optimizing an empirical likelihood ratio.  Pull requests #427 and #406 are also EL code but are continually being edited to get ready to merge.

Sunday, August 19, 2012

Winding Down

GSOC comes to a close tonight at 12:00 midnight and I could not have asked for a more rewarding summer job. 

As far as my proposal, I have implemented EL descriptive statistics, linear regression, censored regression and instrumental variable regression.  Although they are not all ready to be merged into statsmodels master, the computation is there, and all that is left is documentation and cleanup, which I will keep working on.

I plan on staying active in the statsmodels community and hope to continue contributing empirical likelihood models as well as other models in the indefinite future.

Also, if anybody would like to see other specific empirical likelihood models implemented in statsmodels, please post to the mailing list or email me directly at justin.grana@gmail.com.

A special thanks to Skipper and Josef as well as Ralf for their help this summer.  Your advice was invaluable. 

Wednesday, August 8, 2012

Slight change of plans

Just to make it public, I have decided not to implement model selection criteria but instead to implement instrumental variable regression.  More to come soon.

Tuesday, August 7, 2012

Estimating an AFT model

Estimation of the parametric accelerated failure time (AFT) model is no walk in the park.  Much of the research points to two main methods: Koul et al. (1981), which uses a synthetic data approach, and Stute (1993, 1996), which uses WLS regression.  These are often preferred to other estimators due to their computational simplicity.

HOWEVER, in all cases the method of Koul performs worse than that of Stute.  This is because an AFT model requires an estimated CDF for the survival probabilities.  In Koul's theoretical paper, the authors do not use the Kaplan-Meier estimator because, as they note, "in our proof we need to take the log," and using a KM estimator can potentially lead to taking the log of 0.  However, if we replace Koul's estimate of the survival probabilities with the Kaplan-Meier estimate, the bias in the estimates essentially goes away.  Furthermore, by assuming that the maximum of the dependent variable is uncensored, we can avoid the divide-by-0 dilemma.  To add to the ambiguity, other papers that cite Koul claim that Koul uses the Kaplan-Meier estimate when he actually doesn't.
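To make the modification concrete, here is a rough numpy sketch of Koul's synthetic-data idea with the survival probabilities replaced by a Kaplan-Meier estimate, as described above.  It is purely illustrative (it glosses over ties and left limits, and all names and data are made up); it is not the statsmodels implementation.

import numpy as np
import statsmodels.api as sm

def km_survival(times, events):
    # Kaplan-Meier survival estimate evaluated at each observed time,
    # where 'events' flags the event of interest (here, censoring).
    order = np.argsort(times)
    at_risk = len(times) - np.arange(len(times))   # size of the risk set at each ordered time
    factors = 1.0 - events[order] / at_risk        # KM multiplicative factors
    surv = np.empty(len(times))
    surv[order] = np.cumprod(factors)
    return surv

np.random.seed(0)
n = 200
x = np.random.standard_normal(n)
t_true = np.exp(1.0 + 0.5 * x + np.random.standard_normal(n))  # AFT model on the log scale
c = np.random.exponential(5.0, n)                               # censoring times
obs = np.minimum(t_true, c)
delta = (t_true <= c).astype(float)                             # 1 = uncensored
delta[np.argmax(obs)] = 1.0  # treat the largest observation as uncensored to avoid dividing by 0

# Koul-style synthetic responses: inflate uncensored responses by the inverse
# KM estimate of the probability of not yet being censored.
surv_c = km_survival(obs, 1.0 - delta)   # KM estimate of P(C > t)
y_synth = delta * np.log(obs) / surv_c

# OLS on the synthetic responses recovers the AFT parameters (roughly).
print(sm.OLS(y_synth, sm.add_constant(x)).fit().params)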

My issue is that if I cannot find a theoretical justification for using the KM estimate in Koul's method, I cannot use it in the code.  That said, is there any theoretical justification for adapting Koul's synthetic data method and replacing the survival probabilities with the KM estimate?  I have looked but haven't been able to find any.