Homework I: Stata Guide


Econ 120B Stata Guide Hw1
Claudio Labanca, Love Lofstrom

This will serve as a guide for you to learn Stata, a program used to process data for statistical inference. These instructions will aid you in completing your first homework assignment. If anything, really anything, is unclear, four of your best resources will be:

I. Office Hours (found on TED).
II. The help command in Stata: help x, or google "help x stata". Replace x with the command that you are unsure of.
III. statalab.ucsd@gmail.com
IV. A great self-help guide from UCLA.

Commands will be in bold (type the phrase in bold, then hit enter). For example, describe will show the variables contained in the dataset. Stata is extremely case sensitive. If you enter a command and the variable cannot be found, it is possible that you entered happins, not Happins.

Clicking will be in italics. The title is what you click; -> indicates what you click next. E.g., File -> Open -> Documents -> School -> Stata Homework -> dataset.dta.

I. Logistics

A) If you are using your own computer then this first step may be redundant. Once you open Stata, clear will remove all previous variables in the program. This will ensure that the only variables in Stata are related to the homework assignment.
B) set more off tells Stata not to pause with a "more" prompt each time the Results window fills, so the analysis runs without interruption.
C) set mem 15 (only if you run Stata 11 or earlier, which is unlikely if you use it through VCL or a UCSD computer).
D) cap log close will close the existing log file. A log file records what is done in Stata.
E) Choose the working directory: File -> Change Working Directory -> select a folder
F) Create a log file in which the results of the programming will be saved. E.g.: File -> Log -> Begin -> select the folder where you want to save it -> pick a name -> Save it

G) Open the dataset (dta file). File -> Open -> Find and select the file country_happiness.dta
H) Save your data as a new file. This will make sure that you do not tamper with the original file. File -> Save As -> select the folder where you want to save it -> pick a name -> Save

IA. Analysis: Happiness

A) describe allows you to see what variables are contained in the dataset. The dataset contains socioeconomic variables and happiness scores for 75 countries.
    describe happins gdp2002
   (the two variables that we are interested in for this homework assignment).
B) summarize will give you summary statistics on the variables that you enter. It will give you: number of observations, mean, Std. Dev., min/max values.
    summarize happins
    summarize gdp2002
C) sort will re-arrange the observations in ascending order of the variable. This will allow us to see which countries are the happiest/saddest.
    sort happins
   browse will show you the data in cell-format (like Excel). Enter the command to see for yourself that the observations are re-arranged.
D) We want to find the least/most happy country in the dataset. In order to do so, we will use list.
   * _N is the total number of observations.
   * _n is the observation/row number. E.g., after sorting, _n==5 is the fifth unhappiest country in the dataset.
   i) To find the unhappiest country:
       list country_name happins if _n==1
   ii) To find the happiest country:
       list country_name happins if _n==_N
   * list can be used to find the happiness index of particular countries. We want to see how happy people are in the USA and Italy:
   iii) list cty happins if country_name == "United States"
   iv) list cty happins if country_name == "Italy"

E) We can use count to see how many countries are happier than a specific country. Let's see how many countries are happier than the U.S. by using the happiness index for the U.S. It is also possible to see which countries those are, and which values fall between two countries: Portugal and USA.
   i) count if happins > 
   ii) list country_name if happins > 
   iii) a. list cty happins if country_name == "Portugal"
        b. list if happins >  & happins < 

IB. Analysis: Religion

A. Religion is a string variable (non-numerical), so summarize won't work for it. Instead we will use the tabulate command. It gives us frequencies, percentages, and the cumulative distribution for each religion type.
   * describe religion, see for yourself
   * tabulate religion
B. We can look at different countries to see what religion has a majority in a particular country. For example, let's see in which countries Shiites are in the majority. It is also possible to see which countries don't practice certain religions. In order to do so we use !=, the "does not equal" operator. Don't forget quotation marks for string variables!
   i) list country_name happins religion if religion == "Shia Islam"
   ii) list country_name happins if religion != "Catholic Heavily"
C. Once again we want to look at the summary statistics for happiness scores and GDP/capita.
   * summarize happins gdp2002
D. As you saw previously, it is extremely easy to find the standard deviation, mean, etc. Let's test your understanding of statistics by finding Std. Dev. manually in Stata. This will be done in a few steps.
   i) We need to create a variable for the deviation. We will subtract the mean from each observation of happins.
       generate happins_deviation = happins 
      (we got the mean by using summarize happins).

   ii) The deviation must be squared.
       generate happins_deviation_sq = happins_deviation^2
   iii) Now it is time to add up all of the squared deviations. tabstat allows us to produce a table of statistics.
       tabstat happins_deviation_sq, statistics(sum)
   iv) In order to do calculations in Stata we use display: display 1+1, display 5*5, display 1-1, display 5/0 (j/k, you can't divide by 0). In order to get the sample variance we will divide the sum of squared deviations by N-1.
       display /74
      Alternatively, display /(_N-1).
   v) In order to get the Standard Deviation we need to take the square root of the sample variance.
       display sqrt( )
E. Now try to calculate the standard deviation for the GDP variable.
   i) generate gdp_deviation_sq = (gdp )^2
   ii) tabstat gdp_deviation_sq, statistics(sum) columns(variables)
   iii) display sqrt(1.05e+10/74)
   iv) The value won't be exactly the same as the one shown by summarize. This is due to rounding.
F. It is possible to plot the distribution using Stata's graphical tools. We will overlay a normal distribution that has the same mean and standard deviation as happins.
    histogram happins, frequency normal
G. Let's plot a histogram for GDP as well.
    histogram gdp2002, normal
H. We can look at the correlation between GDP and Happiness in two ways.
   i) corr happins gdp2002, which gives us the correlation between happiness and GDP.
   ii) scatter happins gdp2002, which will graph a scatter plot of their relationship.
   iii) Save the graph. In the graph window: File -> Save As -> Save as type: Portable document format (*.pdf) -> select the folder where you want to save it -> pick a name -> Save File

I. Let's figure out which country has a GDP/capita closest to $60,000. This could be hard to do by eye. Fortunately, we can add labels to the scatter plot.
   i) scatter happins gdp2002, mlab(country_name) mlabsize(small)
      Luxembourg should be that country. Notice the two axes, which show the dataset labels for our two variables, happins and gdp2002. The variable you type first will be displayed on the y-axis.
   ii) Let's make the graph user friendly. We can do so by naming the graph and the axes.
       scatter happins gdp2002, mlabel(country_name) mlabelsize(vsmall) title(Scatterplot: Happiness Score and GDP/capita) ytitle(Happiness Score) xtitle(GDP/capita)
   iii) Outliers can be dangerous in Econometrics. If we consider Luxembourg an outlier we can easily get rid of the observation. Using an if qualifier with drop removes Luxembourg, so the scatter diagram can then be graphed without it.
       drop if country == "Luxembourg"
J. We are done with the analysis for these variables. However, let's save the dataset and close the log file before moving on.
   i) File -> Save

II. Analysis: Money

A) It is now time to use a different dataset. Before getting started we need to use some of the commands from the Logistics section.
   i) clear
   ii) set more off
   iii) cap log close
   iv) File -> Log -> Begin -> select a folder -> pick a name -> Save it
   v) Open the dataset. File -> Open -> find and open CEOSAL1.dta
   vi) Then save the file before getting started. File -> Save As -> select a folder -> pick a name -> Save it
B) It's generally a good idea to look at the variables in the dataset.
    describe
C) The two variables of interest are CEO salaries and return on equity.
    list salary roe if _n <25
D) sum
E) The industry that the data is drawn from should give additional information. This is a discrete variable. The better way to describe this type of variable is

through the command
    tab indus
F) It is possible to look at the cross-tab of two discrete variables. With the row option, the cross-tab reports the relative frequency within its row for each cell. In our example, it gives the conditional distribution of financial firms given that industrial firms takes value 0 (first row) or 1 (second row). It is essentially the conditional distribution of the column variable given the row variable.
    tabulate indus finance, row
G) We can also find the conditional distribution of the row variable given the column variable.
    tabulate indus finance, column
H) Lastly, it is possible to get the joint distribution of industrial firms and financial firms.
    tabulate indus finance, cell
I) Let us look at the correlation between salary and return on equity while excluding potential outliers.
    corr salary roe if salary <5000
J) It is time to create another scatter plot. To make the axes easier to read we are going to add reference lines. We want to see how many CEOs make more than $5,000,000/year and how many companies have an ROE of 50% or higher.
    scatter salary roe, yline(5000) xline(50)
K) Let's plot a histogram for salary.
   i) hist salary
   ii) histogram salary, normal (this compares the histogram of salary to a normal plot)
   iii) Income variables such as salary are usually represented as the natural log of salary.
       histogram lsalary, normal
      This creates a histogram that is more tractable compared to the previous one.
L) hist roe, normal
M) File -> Save
N) File -> Log -> Close

Good luck!
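The manual standard-deviation steps in section IB (D and E) involve retyping the mean and the sum of squares by hand, which is where the rounding differences come from. As a rough sketch, the same calculation can instead read those numbers from the results that summarize stores in r(). The example assumes Stata's shipped auto dataset and its price variable as stand-ins for the homework data; the names dev, dev_sq, s_mean, s_n, s_sd and s_ss are illustrative only and are not part of the assignment.

```stata
* Sketch only: auto.dta and price stand in for the homework dataset.
sysuse auto, clear

* summarize stores the mean in r(mean), the sample size in r(N),
* and the standard deviation in r(sd)
quietly summarize price
scalar s_mean = r(mean)
scalar s_n    = r(N)
scalar s_sd   = r(sd)

* deviation from the mean, then squared deviation
generate double dev    = price - scalar(s_mean)
generate double dev_sq = dev^2

* summarize also stores the sum of a variable in r(sum)
quietly summarize dev_sq
scalar s_ss = r(sum)

* sample variance divides by N-1; the standard deviation is its square root
display "manual Std. Dev.    = " sqrt(scalar(s_ss)/(scalar(s_n) - 1))
display "summarize Std. Dev. = " scalar(s_sd)
```

Because nothing is retyped, the two displayed values should agree up to floating-point precision rather than only up to rounding.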

Summary Table of the Logical Expressions in Stata

Command   Short description
<         less than
<=        less than or equal
==        equal
>         greater than
>=        greater than or equal
!=        not equal
&         and
|         or
!         not

Summary Table of the Stata Commands seen in Tutorial 1

Command: describe
Short description: will show characteristics of the variable/s contained in the dataset.
Example: des variable_name

Command: summarize
Short description: will give you summary statistics on the variables that you enter.
Example: sum variable_name

Command: sort
Short description: will re-arrange the variable in ascending order.
Example: sort variable_name

Command: browse
Short description: will show you the data in cell-format (like Excel).

Command: list
Short description: can be used to find the value of a particular variable.
Example: list country_name happins religion if religion == "Shia Islam"

Command: count
Short description: to see how many countries are happier than a specific country.
Example: count if happins > 

Command: generate
Short description: to create a variable.
Example: gen variable_name = insert_formula

Command: tabstat
Short description: allows us to produce a table of statistics.
Example: tabstat variable_name

Command: tabstat variable_name, statistics(sum)
Short description: add up all of the values stored for a certain variable.
Example: tabstat variable_name, statistics(sum)

Command: display
Short description: in order to do calculations in Stata.
Example: display sqrt(1.05e+10/74)

Command: histogram
Short description: plot a histogram.
Example: histogram variable_name, normal

Command: corr
Short description: look at the correlation between variables.
Example: corr variable_name1 variable_name2

Command: scatter
Short description: will graph a scatter plot of their relationship.
Example: scatter variable_name1 variable_name2

Command: tab
Short description: additional information to describe variables.
Example: tab indus
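Two points from the tutorial are easy to get wrong when typing: _n (the current row) versus _N (the last row), and the double quotes needed around string values. A minimal sketch of both follows, assuming Stata's shipped auto dataset, where make and price stand in for country_name and happins; the car name used is just one value that happens to exist in that dataset.

```stata
* Sketch only: auto.dta, make and price stand in for the homework variables.
sysuse auto, clear
sort price

* _n is the current row number; _N is the total number of rows
* first row after sorting holds the smallest value
list make price if _n == 1
* last row after sorting holds the largest value
list make price if _n == _N

* string values must be wrapped in double quotes
list make price if make == "Volvo 260"
count if make != "Volvo 260"
```

Typing _n == _n instead of _n == _N is true for every row, so list would print the whole dataset rather than the last observation.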

STATA Tutorial #2

KEY: Type into Command box; Left Click

If you need any additional guidance, or are having other issues with STATA, try the following:
* Attend office hours, the exact times of which can be found on TED.
* Use the help command in STATA or Google (e.g. help scatter if you want clarification on how the scatter command works).
* Send questions to

1. clear
2. cap log close
   a. The cap log close command, in this case, tells STATA to close any log files you may currently have open.
3. File > Log > Begin
   a. This allows you to begin a new log (which you will need to do in order to turn in your homework assignments). Make sure to save your log as a .log to receive full points on your homework assignment!
4. File > Open
   a. Open your dataset (wine.dta).
   b. Alternatively, you could choose to use STATA's use command, which also tells STATA to load a designated dataset.
5. save wine_out.dta, replace
   a. We don't want to actually alter the original dataset (wine.dta), so we will save it under a new name: in this case, wine_out.dta.
   b. The replace option here tells STATA to overwrite any existing wine_out.dta file with our new one.
6. describe
   a. The describe command shows us what our dataset contains: the number of observations, variables, etc. Often, it will also give a brief description of what each variable represents.
7. scatter alcohol heart, mlabel(country) mlabsize(vsmall)
   a. We are now using the scatter command to create a scatterplot representing the relationship between alcohol consumption and heart disease. Note that alcohol consumption, listed first here, is on the Y-axis, while heart disease, listed second here, is on the X-axis.

   b. The mlabel option allows us to label the points by country, while the mlabsize option allows us to manipulate the appearance of said labels (in this case, vsmall tells STATA to make the label text very small).
   c. We can see, based on the scatterplot produced, that the two variables appear to be negatively correlated: the higher the wine consumption, the lower the deaths by heart disease.
8. scatter alcohol liver, mlabel(country) mlabsize(vsmall)
   a. We can create a similar scatterplot to observe the relationship between alcohol consumption and deaths by liver disease (in this case, the variables appear to be positively correlated).
9. regress heart alcohol, robust
   a. We now want to run a regression between deaths by heart disease and wine consumption. The regress command tells STATA to run a linear regression.
      i. Recall that if errors are not homoscedastic, we must use heteroscedasticity-robust standard errors in order to make valid inferences. We can tag on the robust option to accommodate this.
   b. STATA gives us a lot of information: in the top right corner, we can see the sample size, the standard error, and the R-squared. We are also told the degrees of freedom, estimated coefficients, and standard errors, displayed in other regions of the command output.
10. display / 
   a. We can manually calculate the R-squared using the display command.
      i. The Explained Sum of Squares (ESS) is given to us by Stata as the Model SS; the Sum of Squared Residuals (SSR) is given to us as the Residual SS; and the Total Sum of Squares (TSS) is given to us as the Total SS.
      ii. To calculate the R-squared, divide the ESS value by the TSS value ( / ).
11. display 1-( / )
   a. Alternatively, we can calculate the R-squared using the formula 1-(SSR/TSS). Again we can show this in STATA using the display command.
12. display _b[_cons] + _b[alcohol]* 8
   a. STATA stores the coefficient values in the form of the variable _b.
Thus _b[_cons] gives us the coefficient of the constant term (the intercept). Meanwhile _b[alcohol] gives us the slope of the regression line.
   b. To predict the value of deaths by heart disease in a country with a wine-per-capita consumption of 8 liters per year, use the display command as shown above. We are essentially plugging 8 [liters] into the regression line.
13. twoway (lfit heart alcohol) (scatter heart alcohol, mlabel(country) mlabsize(vsmall))
   a. The twoway command produces a twoway graph according to our specifications.
      i. The lfit option generates a line of best fit through our original scatterplot (initially generated in step 7).

The next two steps (14-15) are somewhat irrelevant to the tutorial as a whole but will help you in the completion of your second homework assignment.

14. twoway (lfit heart alcohol) (scatter heart alcohol, mlabel(country) mlabsize(vsmall)) (function y= *x, range(alcohol))
   a. The function option appended to our command from step 13 draws a function in the above graph: in this case, y = *x.
15. twoway (lfit heart alcohol) (scatter heart alcohol, mlabel(country) mlabsize(vsmall)) (function y= *x, range(alcohol)), legend(order(1 2 "Observed" 3 "A function of interest"))
   a. Here we'll attempt to make the graph legend a little clearer. The legend option allows us to label our graph more deliberately (to better illustrate this, try also twoway (lfit heart alcohol) (scatter heart alcohol, mlabel(country) mlabsize(vsmall)) (function y= *x, range(alcohol)) and see what your key would look like in this case).
16. predict yhat_h
   a. This saves all fitted values.
   b. Data > Variables Manager shows the new variable yhat_h, labelled Fitted Values.
17. predict uhat_h, residuals
   a. Let's also save the residuals from the regression. Again, Data > Variables Manager should show you the new variable uhat_h, labelled Residuals.
18. generate uhat_alt= heart - yhat_h
   a. Experimentally, we can verify that the difference between the actual observed value and the value predicted by the model equals the residual.
19. drop uhat_alt
   a. Drop the variable uhat_alt.
20. tabstat uhat_h, statistic(sum)
   a. We can check that the sum of the residuals equals zero using the tabstat command with the statistic(sum) option.
21. rvpplot alcohol, yline(0) mlabel(country) mlabsize(vsmall)
   a. Using the rvpplot command, we can plot the residuals. Note that the values of the residuals are shown on the vertical axis, and that the level of alcohol consumption is displayed on the horizontal axis.
22. rvfplot, yline(0) mlabel(country) mlabsize(vsmall)
   a. Let's now instead plot the residuals against the fitted values, using the rvfplot command.
23. sort uhat_h
   a.
Use the sort command to organize the residuals in ascending order (recall the sort command from the first tutorial and homework assignment).
24. list country alcohol heart yhat_h uhat_h
   a. Using the list command, try to observe the typical size of the residuals. By observing the residual values, we can more readily see the countries that don't fit the OLS regression well.
25. regress heart alcohol if country!= "Japan"
   a. We can see that Japan doesn't seem to fit this regression model well (note its large residual). Let's try running the regression without Japan.

   b. The if country!= "Japan" qualifier tells STATA to run the regression only on observations where the country's name is not Japan.
26. set seed 
   a. STATA can be used to generate a random sample of size n; suppose this random sample is called bsample. In order to generate a sample we must set a seed value, in this case a number. The seed can be whatever number you like.
27. bsample 10
   a. To take our random sample, we'll use the bsample command, followed by our desired sample size. We'll use a sample size of n=10.
28. describe
   a. Use the describe command to see your 10 observations.
29. regress heart alcohol
   a. Let's run the regression again, on our 10 observations.
30. save wine_out.dta, replace
   a. Close the current dataset.
31. clear
   a. Let's begin anew.
32. File > Open
   a. We will now use the dataset with CEO salaries. Locate and open it in STATA.
33. save ceosal2_tut2.dta, replace
34. describe
   a. Use the describe command to familiarize yourself with the new dataset. Observe the variables, their descriptions, etc.
35. regress salary ceoten
   a. Let's run a regression between predicted salary (salary) and the number of years an individual has been a CEO (ceoten).
36. twoway (scatter salary ceoten) (lfit salary ceoten), legend(order(1 "Observed" 2 "Fitted by Linear Model"))
   a. Use the twoway command to create a twoway graph that illustrates the relationship between salary and length of CEO tenure. Note the line of best fit that appears alongside the data points on the scatterplot.
37. regress lsalary ceoten
   a. We'll use the regress command to regress the log of salary on CEO tenure.
38. twoway (scatter lsalary ceoten) (lfit lsalary ceoten)
   a. Again, let's use the twoway command to create a twoway graph that shows us visually the line of best fit through a scatterplot of the data points.
39. Predicted_salary = exp(b0_hat + b1_hat * ceoten)
   a. It is possible for us to observe this relationship using salary instead of the log of salary.
Note that if Predicted_log(salary) = b0_hat + b1_hat * ceoten, then we can find a value for the predicted salary such that Predicted_salary = exp(b0_hat + b1_hat * ceoten).
40. twoway (scatter salary ceoten) (function y = exp(_b[_cons] + _b[ceoten]*x), range(ceoten)), legend(order(1 "Observed" 2 "Fitted by Log Model"))
   a. From here, we can draw a twoway graph that visually expresses the relationship between salary and CEO tenure.

41. regress lsalary lsales
   a. Let's regress the log of salary on the log of sales. We are effectively estimating a constant elasticity model that relates the CEO's salary to sales generated by the firm in millions of dollars. This relationship is modeled by log(salary) = b0 + b1 log(sales) + u.
42. regress salary ceoten
43. summarize
   a. Recall that the summarize command can be used to familiarize ourselves with the dataset: here we can use it to find values such as the average salary and tenure of a CEO.
44. display _b[_cons] + _b[ceoten]* 
   a. If we plug the average tenure of the CEO into our estimated regression, we should get back the average salary of a CEO. We can use STATA to verify this.
45. regress salary ceoten, robust
   a. Recall that if the errors are not homoscedastic, homoscedasticity-only standard errors of the estimators are not appropriate. If errors are not homoscedastic, then we must use heteroscedasticity-robust standard errors in order to make valid inferences.
   b. To tell STATA that we want heteroscedasticity-robust errors (as opposed to homoscedasticity-only errors, which STATA gives us by default) we tag on the robust option.
46. set seed 
   a. Again, STATA allows us to generate a random sample of size n. Recall that to do so, we must set a seed value, here just a numeric value.
47. bsample 100
   a. Let's set our sample size to 100.
48. describe
   a. The describe command should show you that we do in fact have 100 observations in our dataset now.
49. regress salary ceoten, robust
   a. We can perform our last regression again, but this time with our new, reduced set of 100 observations.
50. use CEOSAL2_tut2.DTA, clear
   a. Let's return to our old dataset.
51. describe
   a. Note that we are back to our original 177 observations.
52. set seed 
   a. Now we'll take a different random sample and perform the regression again. In this case, let's use a different seed value.
53. bsample 
54. regress salary ceoten, robust
   a.
Observe that the estimated coefficients are different from those obtained before, since we took a different random sample.
55. save CEOSAL2_tut2.dta, replace
56. File > Log > Close
   a. Close the log and finish!
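Steps 10 to 12 copy the sums of squares from the regression table by hand. As a hedged sketch, the same check can be run from the results that regress stores in e(): e(mss) is the Model SS, e(rss) is the Residual SS, and e(r2) is the reported R-squared. The example assumes Stata's shipped auto dataset, with price and mpg standing in for heart and alcohol; the scalar names and the value 20 plugged into the fitted line are arbitrary illustrations, not values from the tutorial.

```stata
* Sketch only: auto.dta, price and mpg stand in for the wine data.
sysuse auto, clear
regress price mpg

* regress stores the Model SS in e(mss) and the Residual SS in e(rss)
scalar s_ess = e(mss)
scalar s_ssr = e(rss)
scalar s_tss = scalar(s_ess) + scalar(s_ssr)

* both formulas should reproduce the R-squared in the output header
display "ESS/TSS     = " scalar(s_ess)/scalar(s_tss)
display "1 - SSR/TSS = " 1 - scalar(s_ssr)/scalar(s_tss)
display "e(r2)       = " e(r2)

* plugging a chosen x value into the fitted line, as in step 12
display "prediction at mpg = 20: " _b[_cons] + _b[mpg]*20
```

Reading the stored results avoids the small discrepancies that come from retyping rounded numbers off the screen.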

Summary Table of the Stata Commands seen in Tutorial 2

Command: regress
Short description: performs linear regression on variables.
Example: regress depvar indepvar, option
Note: depvar: vertical axis; indepvar: horizontal axis. The option robust can be used to obtain correct standard errors when errors are heteroskedastic.

Command: twoway
Short description: plots twoway graphs (scatter, line, etc.).
Example: twoway scatter variable1 variable2
Note: when the only type of graph is scatterplot or line, twoway may be omitted when inputting the command.

Command: twoway lfit
Short description: adds a line of best fit to the graph.
Example: twoway (scatter variable1 variable2) (lfit variable1 variable2)

Command: predict
Short description: obtains predictions, residuals, etc., after estimation.
Example: predict variable, option
Note: the option residuals generates residuals.

Command: rvpplot
Short description: plots the residual on the vertical axis and the specified variable on the horizontal axis.
Example: rvpplot variable
Note: variable can be, for example, the x variable of a regression.

Command: rvfplot
Short description: plots the residual on the vertical axis and the fitted y on the horizontal axis.
Example: rvfplot, options
Note: some examples of options are yline(), mlabel(), mlabsize().

Command: bsample
Short description: draws bootstrap samples (random samples with replacement) from the data in memory.
Example: bsample sample_size
Note: before inputting the command, set seed number.

Command: set seed
Short description: must set seed value before generating sample.
Example: set seed number
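Two claims from Tutorial 2 can be checked directly rather than by eye: that heart - yhat_h equals the saved residual, and that setting the same seed before bsample reproduces the same sample. The sketch below is one way to do that, assuming Stata's shipped auto dataset with price and mpg as stand-ins for heart and alcohol. The variable names ending in _p, the scalar names, and the seed 12345 are hypothetical choices, not values from the tutorial.

```stata
* Sketch only: auto.dta, price and mpg stand in for the wine data.
sysuse auto, clear
regress price mpg

* store fitted values and residuals in double precision
predict double yhat_p
predict double uhat_p, residuals
generate double uhat_alt = price - yhat_p

* assert stops with an error if the two residual series ever differ
assert abs(uhat_p - uhat_alt) < 1e-6

* the OLS residuals should sum to (numerically) zero
quietly summarize uhat_p
display "sum of residuals = " r(sum)

* same seed, same bootstrap sample: draw twice and compare the sample means
preserve
set seed 12345
bsample 10
quietly summarize price
scalar s_first = r(mean)
restore

preserve
set seed 12345
bsample 10
quietly summarize price
scalar s_second = r(mean)
restore

display "first draw mean  = " scalar(s_first)
display "second draw mean = " scalar(s_second)
```

preserve and restore are used here so that each bsample draw starts from the full dataset; without them the second draw would resample the 10 observations kept by the first.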

STATA Tutorial #3

If you need any additional guidance, or are having other issues with STATA, try the following:
* Attend office hours, the exact times of which can be found on TED.
* Use the help command in STATA or Google (e.g. help scatter if you want clarification on how the scatter command works).
* Send questions to

1. clear
2. cap log close
   a. The cap log close command, in this case, tells STATA to close any log files you may currently have open.
3. cd "CURRENT DIRECTORY PATH"
   The cd command will set the current directory in Stata. This is the directory where your data are saved and where you want the log files, graphs, etc. to be saved. In order for Stata to find that folder we need to indicate a CURRENT DIRECTORY PATH. To get this to work, create a folder on your desktop. In that folder create two more folders, one called logs, the second one called data. Save your data (i.e. dta files) in the data folder. To find out the CURRENT DIRECTORY PATH, right click on either the logs or data folder. Then click on Properties. In the window that pops up, copy the path that you find to the right of Location and paste it in place of the words CURRENT DIRECTORY PATH after cd. Don't forget to keep the quotes.
   Example: cd "C:\Desktop\Stata Tutorial 3\" will set the current directory to be the folder called Stata Tutorial 3 on the Desktop of this computer C.
4. log using logs\tutorial3.log, replace
   a. This allows you to begin a new log (which you will need to do in order to turn in your homework assignments). Make sure to save your log as a .log to receive full points on your homework assignment! The replace option will replace any existing log file.
5. use data\vote.dta, clear
   a. Begin by opening the dataset (vote1.dta). The clear option will clear the memory in Stata of any existing data file.
6. save vote1_out.dta, replace
   a. We don't want to actually alter the original dataset (vote1.dta), so we will save it under a new name: in this case, vote1_out.dta.
   b.
The replace option here tells STATA to overwrite any existing vote1_out.dta file with our new one.

7. describe
   a. The describe command shows us what our dataset contains: the number of observations, variables, etc. Often, it will also give a brief description of what each variable represents.
8. generate id=_n
   a. Let's generate an id for each observation; using this command we now have the observations numbered.
9. browse
   a. Notice how there's a new variable (last column), the one you just generated (id). Also notice the units in which the variables are measured: for example, votea and prtystr are in percentage points, so a value of 43 for votea means that candidate A got 43% of the votes.
10. reg votea expenda expendb, robust
   a. Let's start by regressing the percentage vote received by the incumbent on the campaign expenditures incurred by each candidate.
   b. In the top right corner, you will find, among others, the overall F-statistic (test of the joint hypothesis that all the slope coefficients are zero), the R-squared, and what we call SER (standard error of the regression), which STATA calls Root MSE (mean squared error). In the following table, you find the 3 estimates of the coefficients, the robust standard errors, and the t-statistics (which test the hypothesis that each individual coefficient is zero). Now, let's interpret the meaning of the estimated regression coefficients.
      i. When expenditures for both parties are 0, the percentage of votes received by candidate A (the incumbent) is predicted to be 49.6 percentage points, on average.
      ii. An increase in expenditures by candidate A of $1000 is predicted to increase, on average, his/her total vote by 0.38 percentage points, keeping candidate B's (the challenger) expenditures constant.
      iii. For each $1000 increase in expenditures by candidate B, candidate A will lose, on average, about 0.036 percentage points, when candidate A's expenditures are held constant.
11. display _b[_cons] + _b[expendb]*2+_b[expenda]
   a.
Use the command to show the estimated increase in the percentage of votes for $1000 more expenda when expendb=2.
12. test expenda expendb
   a. To test the hypothesis that both coefficients are equal to zero.
13. test expenda
   a. To test the hypothesis that the coefficient on expenda is different from 0 we can use the test command as shown above.

   b. Since the p-value is smaller than 0.01, we reject the null hypothesis.
14. test (expenda=1) (expendb=0)
   a. We use this to test the joint hypothesis that the coefficient on expenda is equal to 1 and that the coefficient on expendb equals 0. To comment on the fit of the model, notice that both slope coefficients are highly significant and the R-squared demonstrates that this model explains about 53% of the variance of vote share.
      i. The SER (Root MSE) indicates that the typical deviation from the predicted value of each electoral district is about 11.6 percentage points, but this number is hard to evaluate in isolation. In short, this is a reasonably good fit for a model.
15. sum expenda expendb
    display _b[_cons]+ _b[expenda]* _b[expendb]* 
   a. To predict the fraction of votes for candidate A at the average expenditure of A and the average expenditure of B, first find the averages of expenda and expendb using the command sum (above).
   b. Then multiply the coefficient of each variable by the average found in point a.
16. sum expenda
   a. We can see what happens to the percent vote for the incumbent if incumbent campaign spending increased by one standard deviation, while the challenger's expenditures remain fixed.
17. display _b[expenda]* 
   a. Multiply the coefficient for expenda by its standard deviation.
   b. All else equal, a one standard deviation increase in expenditures by the incumbent would lead to an increase in vote share of about 10.8 percentage points.
18. gen lnvotea=log(votea)
    gen lnexpenda=log(expenda)
    reg lnvotea lnexpenda expendb, robust
   a. Suppose you want to know the percentage change in votea for a 1% change in expenda. You can directly obtain this result from the regression by running a log regression. Keeping expenditure for candidate B constant, a 1% increase in expenditure for candidate A corresponds to a 0.17% increase in the percentage of votes received by candidate A.
19. generate expenda_sq= expenda^2
    reg votea expenda expenda_sq, robust
   a.
Imagine you are the adviser for an incumbent candidate. You come across a theory that there are diminishing marginal returns to campaign expenditures by incumbent candidates.
   b. You want to test this theory, so you decide to model the relationship between

percent vote and expenditures for the incumbents as a quadratic function.
      i. What do the regression results show you?
      ii. There appear to be diminishing marginal returns to expenditures. Notice that the coefficient on the squared value of incumbent expenditures is negative.
      iii. This indicates that each new increase in expenditures will yield smaller returns than the one before. Eventually, we will reach a point where increasing expenditures actually costs an incumbent votes. How do you explain this turning point?
      iv. A possible explanation is that airwaves become fully saturated and overexposure leads voters in a particular district to turn against the candidate.
20. twoway (scatter votea expenda) (qfit votea expenda), legend(order(1 2 "Quadratic Fit"))
   a. We plot the estimated relation.
   b. scatter shows you the points in your sample; qfit plots the estimated quadratic relationship.
21. twoway (scatter votea expenda) (qfit votea expenda) (lfit votea expenda), legend(order(1 2 "Quadratic Fit" 3 "Linear Fit"))
   a. In this graph, we compare the quadratic fit with the linear fit. To test the theory, beyond visual comparison of the two fits, we can formally test the hypothesis that the relationship between votea and expenda is linear, against the alternative that it is nonlinear. If the relationship is linear, the coefficient on expenda_sq is zero. The t-statistic for this test is -6, thus we reject the null hypothesis. There is evidence that the relationship is nonlinear.
22. display (_b[_cons]+ _b[expenda]*110+_b[expenda_sq]*110^2) - (_b[_cons]+_b[expenda]*100+_b[expenda_sq]*100^2)
23. display (_b[_cons]+_b[expenda]*510+_b[expenda_sq]*510^2) - (_b[_cons]+_b[expenda]*500+_b[expenda_sq]*500^2)
   a. To show that there are diminishing marginal returns to campaign expenditures, we compute the effect of increasing campaign expenditure by $10,000, when spending is $100,000 and when spending is $500,000.
      i.
Adding an additional $1000 in spending after having already spent $100,000 will lead to an additional 0.69 percentage points in voting for candidate A. ii. But, adding an additional $1000 in spending after having already spent $500,000 will only lead to an additional 0.23 percentage points in voting for candidate A. 24. count if expenda > 700 a. The visual analysis of the scatter plot reveals that there is a turning point at

around $700,000 in spending. We want to see if there are a lot of districts with incumbent expenditures over $700,000.
25. list id state district expenda if expenda > 700
   a. To know which districts those are, you can use the list command.
26. gen sharea_dummy=(sharea>50)
    gen votea_dummy=(votea>50)
    tab sharea_dummy
    tab votea_dummy
    reg votea_dummy sharea_dummy, robust
   a. Suppose candidate A wants to know the effect of spending more than candidate B on the probability of getting more than 50% of the votes. You can find that out by generating the dummy variables above and regressing one on the other.
   b. Having the higher expenditure increases the probability of winning the majority of votes by (0.84*100) = 84 percentage points.
27. reg votea expenda expenda_sq expendb prtystra, robust
   a. There are other factors besides incumbent spending that influence votes. The vote share of the incumbent is also affected by the opponent's spending (expendb) and the strength of the incumbent's own party (prtystra). We run a regression controlling for those factors.
   b. All coefficients are significantly different from zero at the 1% significance level. There are still diminishing marginal returns to incumbent campaign expenditure.
   c. With other variables held constant, an increase of $1000 in the opponent's spending will cost the incumbent percentage points of the vote share.
   d. An increase in the strength of the incumbent's party of 1 percentage point, keeping all other variables constant, will yield a 0.32 percentage point increase in the incumbent's vote share.
   e. With this model we have now explained 65% of the variation in the vote share of the incumbent. More importantly, we have reduced the SER, which indicates that we are starting to achieve a relatively good fit.
28. sum expenda expendb prtystra
29. display _b[_cons]+_b[expenda]* _b[expenda_sq]*( ^2)+_b[expendb]* _b[prtystra]*65
   a. You want to predict the incumbent's share of the vote if party strength were 65 percent and the candidates kept their expenditures at their mean levels.
   b. About 58.46% of the vote.
30. reg votea lexpenda, robust
   a. In general, when you want to do a regression with a variable in logarithm form,

you have to generate that variable by writing, for example, generate ln_expenda=ln(expenda). In this case, the logs of campaign expenditures for each candidate are already variables in this dataset, so we don't need to generate them.
   b. The coefficient on lexpenda is highly significant and indicates that a 1% increase in expenditure would yield an increase in vote share of (6.51/100) = 0.0651 percentage points.
31. twoway (scatter votea lexpenda) (lfit votea lexpenda), legend(order(1 "Actual Values" 2 "Fitted Values"))
   a. Plot the relationship between votea and log(expenda) together with the fitted line.
32. reg votea lexpenda lexpendb prtystra, robust
   a. Now we keep the linear-log specification but, fearing omitted variable bias, we add the control variables log(expendb) and prtystra.
   b. Interpretation of results: a 1% increase in incumbent expenditures leads to an increase in incumbent vote share in the amount of percentage points, keeping all other variables constant.
   c. A 1% increase in challenger expenditures leads to a reduction in incumbent vote share of percentage points, keeping all other variables constant.
   d. An increase in the incumbent's party strength of 1 percentage point leads to an increase in incumbent vote share of 0.15 percentage points, keeping all other variables constant.
   e. We are confident in the results of this model. All variables are highly significant. We have explained 79% of the variation in incumbent vote share, and the SER has been reduced to only 7.7 percentage points.
33. display _b[_cons]+_b[lexpenda]*(ln(400))+_b[lexpendb]*(ln(500))+_b[prtystra]*50
   a. Compute the predicted vote share for your candidate if his/her expenditures are $400,000, the opponent's are $500,000, and the incumbent's party strength is 50%.
34. display _b[_cons]+_b[lexpenda]*(ln(600))+_b[lexpendb]*(ln(500))+_b[prtystra]*50
   a. Compute what happens if your candidate increases expenditures to $600,000, keeping the other variables constant.
35. display _b[lexpenda]*(ln(600)-ln(400))
   a. The increase in your candidate's vote share would be 2.47 percentage points, from to percent. You can compute this increase directly by using the command above.
36. save vote1_out.dta, replace
    clear

   a. Close this dataset.
37. log close
   a. Close the log.

Summary Table of the Stata Commands seen in Tutorial 3

Command | Short description | Example
regress | Running a linear regression on multiple variables | reg votea expenda expendb, robust
regress | Running a log regression on multiple variables | reg lnvotea lnexpenda expendb, robust
test | To test the hypothesis that the coefficient is different from 0 | test expenda
test | To test the joint hypothesis that the coefficient on variable one is different from 1 and that the coefficient on variable two is different from 0 | test (variable1=1) (expendb=0)
twoway | To plot the estimated relation between two variables | twoway (scatter votea expenda) (qfit votea expenda) (lfit votea expenda), legend(order(1 2 "Quadratic Fit" 3 "Linear Fit"))
generate | To generate dummy variables | gen sharea_dummy=(sharea>50)
generate | To generate an id for each observation | generate id=_n
count | To see how many districts are over a particular value | count if expenda > 700
list | To show the names of those districts that are over the particular value | list id state district expenda if expenda > 700
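The diminishing-returns calculations in steps 22 to 24 can also be sketched with lincom and nlcom, two Stata post-estimation commands that this tutorial does not otherwise use. This is a minimal sketch and not the tutorial's own method. It assumes the tutorial's dataset is in memory, that expenda is measured in thousands of dollars, and that expenda_sq is the square of expenda. The lincom point estimates should reproduce the display results from steps 22 and 23, with a standard error and confidence interval added.

```stata
* Sketch only: assumes the tutorial's dataset is already in memory.
* expenda_sq was created earlier in the tutorial (assumed to be expenda^2);
* capture suppresses the error if the variable already exists.
capture generate expenda_sq = expenda^2

* Quadratic model from the tutorial
regress votea expenda expenda_sq, robust

* Effect of raising spending from 100 to 110 (a $10,000 increase):
* 10*b1 + (110^2 - 100^2)*b2 = 10*b1 + 2100*b2
lincom 10*expenda + 2100*expenda_sq

* Same $10,000 increase starting from 500:
* 10*b1 + (510^2 - 500^2)*b2 = 10*b1 + 10100*b2
lincom 10*expenda + 10100*expenda_sq

* Turning point of the fitted parabola: -b1 / (2*b2)
display -_b[expenda] / (2*_b[expenda_sq])

* The same quantity with a delta-method standard error
nlcom (turning_point: -_b[expenda] / (2*_b[expenda_sq]))
```

The expression -b1/(2*b2) is the spending level at which the slope of the fitted quadratic, b1 + 2*b2*expenda, equals zero. It corresponds to the turning point read off the scatter plot in step 24.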
