Wednesday, July 11, 2012

Midterm Review

To start things off, GSOC has been a 10/10 for me.  I enjoy the work I am doing and the downsides are minimal.

My GSOC project is right on target with my proposal.  The features I have so far are empirical likelihood hypothesis tests and confidence intervals for:

the mean
the variance
the skewness
the kurtosis
multivariate means
regression parameters
regression parameters in a model forced through the origin.
 * I also added some neat plots that I didn't expect to do.
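
To make the simplest item on that list concrete, here is a minimal sketch of an empirical likelihood ratio test for a univariate mean.  This is not the code in my branch; the function name el_mean_test is made up for illustration, and it assumes only numpy and scipy.  It solves for the Lagrange multiplier with a scalar root finder and compares -2 log R to a chi-square distribution with one degree of freedom.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2


def el_mean_test(x, mu0):
    """Empirical likelihood ratio test of H0: E[x] = mu0 (illustrative sketch).

    Returns (-2 log R, p-value), using the chi-square(1) approximation.
    """
    x = np.asarray(x, dtype=float)
    z = x - mu0
    # The test is only defined when mu0 is strictly inside the data range.
    if z.min() >= 0 or z.max() <= 0:
        raise ValueError("mu0 must lie strictly inside the range of the data")

    # The EL weights are p_i = 1 / (n * (1 + lam * z_i)), so every
    # 1 + lam * z_i must stay positive.  That bounds lam to an open
    # interval; shrink it slightly to stay away from the endpoints.
    shrink = 1.0 - 1e-8
    lo = -shrink / z.max()
    hi = -shrink / z.min()

    # lam solves sum_i z_i / (1 + lam * z_i) = 0, which is strictly
    # decreasing in lam on (lo, hi), so the root is unique.
    def score(lam):
        return np.sum(z / (1.0 + lam * z))

    lam = brentq(score, lo, hi)
    llr = 2.0 * np.sum(np.log1p(lam * z))
    return llr, chi2.sf(llr, df=1)


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(el_mean_test(sample, 3.5))
```

A confidence interval follows from the same function by collecting the values of mu0 where -2 log R stays below the chi-square critical value.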

Regression through the origin includes empirical likelihood estimates, since this is the only case (so far) where the empirical likelihood estimates differ from the traditional estimates.  Tomorrow (Thursday) I will be adding ANOVA.  I will also be adding an option to choose whether the regressors are fixed or stochastic.  The only difference is that with fixed regressors, moment conditions on the independent variables are added.

I have not worked with Bartlett corrections yet because I cannot find a test case.  However, I am not ruling it out.

In general, the problems I have faced in GSOC are not the ones I expected, and the ones I expected turned out to be easier than I thought.  For example, I thought sifting through the theory to understand the correct computations would be difficult.  However, this was rather easy, and it was easy to pick out the necessary building blocks from the textbooks.

Without question, the unforeseen problem that has given me the most trouble so far is working with Git.  However, after creating a test repository of my own and playing around for a couple of hours (completely destroying the test "project" at times), I finally got the hang of it.  If I have any recommendation for future GSOCers working on statsmodels, it is to do the same.  A couple of hours of tinkering could save days of frustration.

What I can use help on
Although I am catching on, I still feel there are many style elements that pass me by.  Maybe my documentation is not up to snuff.  Maybe the unit tests aren't exactly clear.  Anyway, I have been following the "no news is good news" motto, but I have a feeling that there can't be all that much good news.  There is always room for improvement.

Statsmodels Team
100% helpful and timely in their responses.  Thanks to Skipper, Josef, Ralph and a couple of others who have been actively providing feedback and answers to my questions.

If I were to offer any suggestions in terms of feedback, it would be to make the recommendations as concrete as possible.  This is mostly because I am still getting the hang of the lingo.  I noticed it has become a bit easier to respond to comments as the summer has progressed.

Second Half Changes
I mentioned this to Skipper, but I would like to make a couple of changes to my plan.  I still want to start Monday with censored regression.  Then, instead of moving into model selection, I would like to implement EL IV regression.  Finally, with my leftover time, I would like to create a general EL framework where the user enters an array of estimating equations, a parameter vector, and a hypothesis, and the program returns the results of the hypothesis test.  This is essentially the setup (but will not be close to the same code) as Bruce Hansen's MATLAB/Gauss package.  Then, if I have time, I will do EL model selection criteria.  Any comments on this idea would be greatly appreciated.
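
To give a rough idea of what I mean by a general framework, here is a sketch under several assumptions.  The names el_test and est_eqs are hypothetical, it covers only the just-identified case (as many estimating equations as parameters, with the whole parameter vector fixed by the hypothesis), and it assumes zero lies inside the convex hull of the estimating-equation values.  It maximizes the dual problem in the Lagrange multipliers, using Owen's pseudo-logarithm so the objective stays defined during the search.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2


def _log_star(u, eps):
    """Owen's pseudo-log: log(u) for u >= eps, a matching quadratic below."""
    out = np.empty_like(u)
    ok = u >= eps
    out[ok] = np.log(u[ok])
    low = u[~ok]
    out[~ok] = np.log(eps) - 1.5 + 2.0 * low / eps - low ** 2 / (2.0 * eps ** 2)
    return out


def el_test(est_eqs, data, theta0):
    """EL ratio test of H0: theta = theta0 (illustrative sketch only).

    est_eqs(data, theta) must return an (n, q) array whose i-th row holds
    the q estimating equations evaluated at observation i.
    Returns (-2 log R, p-value) using a chi-square(q) approximation.
    """
    g = np.asarray(est_eqs(data, theta0), dtype=float)
    if g.ndim == 1:
        g = g[:, None]  # a single estimating equation
    n, q = g.shape
    eps = 1.0 / n

    # Dual problem: maximize sum_i log(1 + lam' g_i) over lam,
    # written here as minimizing the negative.
    def objective(lam):
        return -np.sum(_log_star(1.0 + g @ lam, eps))

    res = minimize(objective, np.zeros(q), method="BFGS")
    llr = -2.0 * res.fun
    return llr, chi2.sf(llr, df=q)


if __name__ == "__main__":
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    # Estimating equation for a mean: E[x - theta] = 0.
    mean_eq = lambda data, theta: data - theta
    print(el_test(mean_eq, x, 3.5))
```

The user-facing part is just the estimating-equation function, so testing a different parameter means passing a different function rather than writing a new solver.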

Some Code
I sent two files to the statsmodels mailing list that walk through what I have implemented so far.  Please pardon any spelling and grammar mistakes, as I wanted to get these files out as soon as possible.  Both can be run in their entirety, but I suggest running them line by line.  The code is full of in-line comments to guide the user through.  To run the code, make sure to install statsmodels from my emplike_reg branch.  I am assuming matplotlib is installed.  If you would like these files, feel free to email me at

I'd be happy to answer any questions on or off list and would be interested to hear of any suggested improvements.

Other notes
I am using emacs as my editor.  Another benefit of GSOC is learning how to navigate this nifty and powerful program.  Getting a taste of Lisp isn't a bad side effect.

Almost all of my test cases are against MATLAB.  With censored regression I will most likely be using R. 

Sunday, July 8, 2012

Git holes

(title thanks to Skipper).

This week's post is a bit different but still relevant to GSOC.  Coming into GSOC I had no idea what Git, GitHub, or even version control meant.  Now, I can still say I don't know what they are, but I can say I have learned how (not) to use them.

I think the key to learning Git is creating a repository and trying out as many git commands as possible and seeing what they do first hand.  This was especially helpful for me because I was unfamiliar with the lingo and often I found the documentation very difficult to grasp for someone new to git.

From my minimal experience, anybody trying to learn Git should create their own repo and make multiple branches, push commits, pull from remote repositories, cherry-pick, and add and delete files.  They should also get familiar with which commands make use of a remote repository and which remain local.  Two hours of tinkering gave me more insight than days of reading documentation.

Monday, July 2, 2012

Interesting problem

I've run into an interesting problem with empirical likelihood.  Owen recognizes it as one of the method's shortfalls, and now I am facing it firsthand.  One of the advantages of EL is that there are no distributional assumptions placed on the data or the parameters.  Therefore, EL is well suited for problems that have nonstandard distributions.  However, theoretically, EL inference can only be conducted if the hypothesized values lie in the convex hull of the data.  Without going into too much detail, the more skewed the data is, the less likely it is that a hypothesized value of the parameter lies in the convex hull of the data.  When the hypothesized parameter value is not in the convex hull, the optimization problem becomes infeasible.  This leads to an interesting conclusion: the strength of EL is also one of its major downfalls.
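
One way to check the convex hull condition before running the optimization is to ask whether the hypothesized value can be written as a weighted average of the data points with nonnegative weights.  The sketch below does this as a linear programming feasibility problem with scipy.  The function name in_convex_hull is made up for illustration and is not part of statsmodels.  Note that it tests the closed hull, while EL needs the value strictly inside, so points on the boundary still need care.

```python
import numpy as np
from scipy.optimize import linprog


def in_convex_hull(points, target):
    """Return True if target is a convex combination of the rows of points.

    Feasibility problem: find weights w >= 0 with sum(w) = 1 and
    sum_i w_i * points[i] = target.  Illustrative sketch only.
    """
    points = np.asarray(points, dtype=float)
    if points.ndim == 1:
        points = points[:, None]  # univariate data: one column
    target = np.atleast_1d(np.asarray(target, dtype=float))
    n = points.shape[0]

    # Equality constraints: the weighted average equals the target,
    # and the weights sum to one.
    A_eq = np.vstack([points.T, np.ones(n)])
    b_eq = np.append(target, 1.0)

    # The objective is irrelevant here; only feasibility matters.
    res = linprog(np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.status == 0


if __name__ == "__main__":
    square = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(in_convex_hull(square, [0.5, 0.5]))
    print(in_convex_hull(square, [2.0, 2.0]))
```

For a univariate mean this reduces to checking that the hypothesized value is between the smallest and largest observation.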

On another note, midterm evaluations are next week, and I think I will be looking to alter my second half plans.  I would like to include EL IV regression.  More to come on that...