*Change the path below to the folder where the data is located on your computer*
use "C:\Users\Goosephie\Desktop\GradQuant\Panel\mus08psidextract.dta", clear
describe
summarize
* sum is an abbreviation of summarize and produces the same table
sum
* To list every observation of id, t, exp and wks, use: list id t exp wks
* Tell Stata which variable identifies the individual and which one represents time.
xtset id t
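* Illustration only: once the panel is declared, time-series operators respect the id/t structure.
* For example, L.lwage is the previous period's lwage for the same individual
* (missing in each individual's first period).
list id t lwage L.lwage in 1/10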
xtdescribe
xtsum
*We can see that about 2.5% of the individuals changed region during the panel period*
xttab south
*transition probabilities with xttrans*
xttrans south
* 99.7% of those outside the South stay outside the South in the next period. 99.2% of those in the South stay in the South in the next period. The variable south is close to time-invariant.
*If you want to look at a variable by year:
sort t
by t: sum exp
by t: sum exp if id==488
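* A compact alternative sketch: tabstat collects the yearly summaries of exp in a single table
* instead of printing one summarize block per year.
tabstat exp, by(t) statistics(n mean sd min max)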
*We can sort by id too:
sort id
*Look at the data editor and see how it can be sorted either by t or id, depending on what you ask Stata

*Plot of the first 20 individuals' log wages and weeks worked through time
quietly xtline lwage if id<=20, overlay legend(off) saving(lwage, replace)
quietly xtline wks if id<=20, overlay legend(off) saving(wks, replace)
graph combine lwage.gph wks.gph, iscale(1)
* legend(off) removes the legend; with 20 individuals a legend would be cumbersome.
*The overlay option puts all the time series in the same plot. The saving() option stores the graphs on disk so that they can be combined afterwards.
*iscale() rescales the text and markers of the individual graphs inside the combined graph; iscale(1) keeps them at their original size.

*Plot of all data points of lwage against experience with linear and quadratic fit
graph twoway (scatter lwage exp) (lfit lwage exp) (qfit lwage exp)

*Pooled OLS regression
reg lwage exp exp2 wks ed
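* A possible alternative, assuming exp2 was constructed as exp^2: with factor-variable notation
* Stata forms the quadratic term itself, so margins can report the average marginal effect of experience.
reg lwage c.exp##c.exp wks ed
margins, dydx(exp)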

*Pooled OLS regression with cluster-robust standard errors (since we suspect that the error term is correlated over time for a given individual).
*Notice how the standard errors are larger. The vce(cluster id) option corrects the standard errors for within-cluster correlation of the errors (and for heteroskedasticity); the coefficient estimates themselves do not change.
reg lwage exp exp2 wks ed, vce(cluster id)

*We already told Stata to organize the panel data with id as the individual identifier and t as the time variable. Let's run some fixed effects regressions
xtreg lwage exp exp2 wks ed, fe

* And use the vce(cluster id) option
xtreg lwage exp exp2 wks ed, fe vce(cluster id)

* Random effects
xtreg lwage exp exp2 wks ed, re vce(cluster id)

*Comparison
global xlist exp exp2 wks ed
quietly reg lwage $xlist, vce(cluster id)
estimates store Pooled_rob
quietly xtreg lwage $xlist, fe
estimates store FE
quietly xtreg lwage $xlist, fe vce(cluster id)
estimates store FE_rob
quietly xtreg lwage $xlist, re
estimates store RE
quietly xtreg lwage $xlist, re vce(cluster id)
estimates store RE_rob
* r2_o, r2_b and r2_w are the overall, between and within R-squared stored by xtreg
estimates table Pooled_rob FE FE_rob RE RE_rob, b se stats(N r2 r2_o r2_b r2_w sigma_u sigma_e rho) b(%7.4f)

*Hausman test. Since we already ran and stored the FE and RE estimates (without robust standard errors), all we have to type is:
hausman FE RE, sigmamore
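
* A possible robust cross-check, sketched under assumptions: the standard Hausman test relies on
* the RE estimator being fully efficient, which fails under heteroskedasticity or serial correlation.
* A Mundlak-style check adds the individual means of the time-varying regressors to the RE model
* and tests them jointly using cluster-robust standard errors.
* The m_* variable names are made up for this illustration; ed is left out because it is assumed
* to be time-invariant (its individual mean would be collinear with ed itself).
foreach v in exp exp2 wks {
    bysort id: egen m_`v' = mean(`v')
}
quietly xtreg lwage exp exp2 wks ed m_exp m_exp2 m_wks, re vce(cluster id)
test m_exp m_exp2 m_wks
* Rejecting the null that the means are jointly zero points toward FE rather than RE.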

*The Breusch-Pagan LM test checks whether the variance of the individual effects is zero, that is, whether pooled OLS is adequate or RE is preferred
xtreg lwage exp exp2 wks ed, re
xttest0
*The null of zero variance across individuals is strongly rejected, so use RE instead of pooling

*Testing for contemporaneous correlation. Null is that there is no contemporaneous correlation
xtreg lwage exp exp2 wks ed, fe
xttest2
*If you don't have xttest2, type: ssc install xttest2

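*Test for heteroskedasticity in the FE model (xttest3 is also user-written; if you don't have it, type: ssc install xttest3):
xtreg lwage exp exp2 wks ed, fe
xttest3
*The null is homoskedasticity, which is strongly rejected. vce(robust) corrects the standard errors for this; with xtreg it gives the same results as vce(cluster id)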
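
* A quick way to see this, as a sketch: re-estimate FE with vce(robust) and put it next to
* the FE_rob results stored earlier. FE_vrob is a name chosen here for illustration only.
* In recent Stata versions the two sets of standard errors should coincide.
quietly xtreg lwage exp exp2 wks ed, fe vce(robust)
estimates store FE_vrob
estimates table FE_rob FE_vrob, b(%7.4f) se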
