# Homework

## Homework assignments

The breakdown of marks for this course is as follows:

1. Assignments (50%)
2. Presenting a talk on a topic/paper (20%)
3. Moderating a discussion on a topic/paper (20%)
4. Overall participation in discussion (10%)

## Assignment 1: Graphics

This assignment is due Friday, September 28, 2018.

• Find a graph drawn from data and published by your thesis supervisor. The graph should be one that could be much improved. (If your supervisor is flawless, pick a graph by yourself or by someone else in your group or department.)
• Students from the same lab: don’t choose the same or very similar graphs.
• Analyze the graph. Explain the study. Explain what patterns the graph is intended to display. Explain the flaws in the graph: why does it not succeed?
• Redraw the graph in R using principles of effective display. Try to obtain and use the raw data; otherwise, extract the values from the graph or simulate raw data.
• Analyze your new graph according to principles of good graph design. Explain how your improvements display the patterns more effectively than the original. Why does your graph succeed?
• Email your paper to me as a single PDF file named LASTNAME.FIRSTNAME.ASSIGNMENT1.PDF
• Grade will be based on: the quality of your analysis of the original graph; the degree of improvement of the new graph (choose your graph so as to leave yourself plenty of room for improvement); your interpretation of it and explanation of how it is improved; the quality of your R code.
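
As a starting point for the redrawing step, here is a minimal sketch in base R. It assumes the original figure was a bar chart of group means and that the raw data had to be simulated. The data frame `dat`, its columns `treatment` and `response`, and all values are hypothetical. The idea is to show every observation and overlay the group means with 95% confidence intervals.

```r
# Hypothetical simulated raw data standing in for values behind a published bar chart
set.seed(1)
n_per_group <- 12
dat <- data.frame(
  treatment = factor(rep(c("control", "low", "high"), each = n_per_group),
                     levels = c("control", "low", "high")),
  response  = c(rnorm(n_per_group, mean = 10, sd = 2),
                rnorm(n_per_group, mean = 12, sd = 2),
                rnorm(n_per_group, mean = 15, sd = 2))
)

# Show every data point rather than only bars of means
stripchart(response ~ treatment, data = dat, vertical = TRUE,
           method = "jitter", jitter = 0.1, pch = 16, col = "grey40",
           xlim = c(0.5, 3.5),
           xlab = "Treatment", ylab = "Response (units)")

# Overlay group means with 95% confidence intervals, offset to the right
means <- tapply(dat$response, dat$treatment, mean)
ses   <- tapply(dat$response, dat$treatment, function(v) sd(v) / sqrt(length(v)))
tcrit <- qt(0.975, df = n_per_group - 1)
xpos  <- seq_along(means) + 0.25
points(xpos, means, pch = 16, cex = 1.3)
segments(xpos, means - tcrit * ses, xpos, means + tcrit * ses)
```

Whether a strip chart is the right replacement depends on the original graph; the same approach (plot the data, then add the summary) carries over to other displays.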

## Assignment 2: Linear model

This assignment is due Friday, November 2nd, at 5PM.

Obtain a data set and analyze it by fitting a linear, mixed, or generalized linear model in R.

• Obtain a data set from your supervisor or an online data repository (e.g. Dryad).
• Include just one response variable.
• For the explanatory variables, include at least one proper fixed factor, such as an experimental or observational treatment. It can be categorical or numeric.
• Include at least 1, and no more than 2, additional explanatory variables (random or fixed factors, blocks, covariates, etc).
• Explain (in a paragraph) the purpose of the study that yielded the data.
• Explain the specific data set you are using. For example, say where the data are from, give the meaning of the variables, and so on.
• Illustrate and describe the main patterns revealed in the data.
• State what parameters (magnitudes) you will estimate with these data.
• State what hypotheses you will test with these data.
• Fit a linear model to the data in R. Explain in words the model you fit.
• Interpret the output. To assess biological significance, explain the parameter estimates (magnitudes). What do they mean, and what are your conclusions based on these parameter estimates? To assess statistical significance, explain the null hypotheses and interpret the test results.
• Visualize the model fit to the data. Explain what the graph is showing. (Can be similar to your initial graph, but with the model fit added.)
• Address how well the statistical assumptions of your analysis were met, and how you handled violations.
• State the overall conclusions reached from your analyses of biological and statistical significance.
• Include your clean R code in an appendix.
• Include all your writing and graphs in a single PDF file (titled LASTNAME.FIRSTNAME.ASSIGNMENT2.PDF) and email it to me.
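
The workflow above can be sketched in R roughly as follows. This is only an illustration on simulated data, assuming one response (`growth`), one fixed treatment factor (`treatment`), and one numeric covariate (`size`). All names and values are hypothetical, and a real data set may call for a mixed or generalized linear model instead of `lm()`.

```r
# Hypothetical data: one response, one fixed treatment factor, one numeric covariate
set.seed(42)
n <- 20
dat <- data.frame(
  treatment = factor(rep(c("control", "fertilized"), each = n)),
  size      = runif(2 * n, min = 5, max = 15)
)
dat$growth <- 2 + 0.5 * dat$size +
  ifelse(dat$treatment == "fertilized", 1.5, 0) +
  rnorm(2 * n, sd = 1)

# Fit a linear model with the covariate and the treatment, no interaction
fit <- lm(growth ~ size + treatment, data = dat)

# Parameter estimates (magnitudes) with 95% confidence intervals
summary(fit)$coefficients
confint(fit)

# Sequential (Type I) tests: treatment is tested after adjusting for size
anova(fit)

# Visualize the model fit: data plus one fitted line per treatment
plot(growth ~ size, data = dat, pch = 16,
     col = c("grey50", "black")[dat$treatment],
     xlab = "Initial size", ylab = "Growth")
newx <- seq(min(dat$size), max(dat$size), length.out = 50)
for (tr in levels(dat$treatment)) {
  nd <- data.frame(size = newx,
                   treatment = factor(tr, levels = levels(dat$treatment)))
  lines(newx, predict(fit, newdata = nd),
        col = if (tr == "control") "grey50" else "black")
}

# Check assumptions: residuals vs fitted values and a normal quantile plot
par(mfrow = c(1, 2))
plot(fit, which = 1:2)
par(mfrow = c(1, 1))
```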

## Assignment 3: What is the best model?

This assignment is due Friday, November 30.

Clues to the inheritance patterns of population differences can be gained by fitting linear models to measurements of traits in parents and hybrids. In this assignment you will use model selection methods to compare the fit of three alternative genetic models of divergence in soil arsenic tolerance in two populations of the grass Agrostis capillaris (Watkins and MacNair 1991, Genetics of arsenic tolerance in Agrostis capillaris. Heredity 66: 47-54). One population occurred on an abandoned, arsenic-contaminated mine; the other was from an edaphically similar, non-toxic site.

To accomplish this you will need to choose a criterion (AIC or BIC) for comparing the fit of the models to the data, and to determine which criterion is best suited to your purposes. You need to defend your choice vigorously in your report, which will require some research. Why did you decide to use it instead of the other criterion? Decide on the criterion before you analyze the data.
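
For reference while researching this choice, the standard definitions of the two criteria are given below, for a model with $k$ estimated parameters, maximized likelihood $\hat{L}$, and sample size $n$. Lower values indicate a better-supported model under either criterion.

$$
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}
$$

The two differ only in the penalty per parameter ($2$ versus $\ln n$), so BIC penalizes additional parameters more heavily than AIC whenever $n \ge 8$.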

Height of plant tillers of different cross generations can be downloaded here.

Height is the cube root of tiller height (in mm) when grown on arsenic-laced soil. Line refers to the parent population from the contaminated site (“high” tolerance), the parent population from the uncontaminated site (“low” tolerance), their F1 and F2 hybrids (“f1”, “f2”), and the backcrosses between the F1 hybrid and each parent population (“bh” for high and “bl” for low tolerance). I’ll refer to these crosses as genotypes.

Analyze these data in R according to the following methods. Fit linear models with fixed effects only. Assume that all the data for a given cross type are independent. Provide all necessary explanations in your report. Always show your model fits graphically, as usual. No P-values are allowed in your report. Include your R commands in an appendix.

1. Graph the data. Explain your graph. What is the pattern in the data?
2. Create a table of means and standard deviations of genotypes. Design the table as you would if you were publishing it.
3. Add a numeric variable to the data set to represent the proportion of the genome inherited from the high-tolerance parent:
   • 1 for the high-tolerance parent genotype
   • 0 for the low-tolerance parent genotype
   • 0.5 for the F1 and F2 hybrids
   • 0.25 for the backcross to the low-tolerance population
   • 0.75 for the backcross to the high-tolerance population

   Make sure that the variable is numeric rather than a factor or character.
4. Fit the numeric variable you created in (3) to the height data using a linear model. This is called the additive model, whereby tolerance increases linearly with the proportion of the genome inherited from the high-tolerance parent. Evaluate the model fit (remember: no P-values!).
5. Add another numeric variable to the data set to represent dominance effects that might be present in the hybrids:
   • 0 for both parent genotypes
   • 1 for the F1 hybrid
   • 0.5 for the remaining three hybrid genotypes

   Make sure that the variable is numeric rather than a factor or character.
6. Fit a second model to the same data that includes both of the numeric variables created in (3) and (5). Leave out any interaction terms. This is the additive plus dominance model. Any dominance effects present will displace the mean value of the hybrids toward one or the other of the parents relative to the values predicted by the additive model. Evaluate model fit.
7. Finally, fit a third model that has the original genotype variable as the only explanatory variable. The fit of this model will deviate from the model fitted in (6) if there is interaction (epistasis) between genes inherited from the two parents.
8. Present your results, comparing model fits. Which genetic model best fit the data? Explain and summarize.
9. Explain how the procedure you used above to analyze these data differs from that of conventional null hypothesis significance testing. In your view, would a null hypothesis significance testing approach be a poorer, equivalent, or superior approach to the one used above to decide between the three models? Explain.
10. Include your clean R code in an appendix.
11. Email your paper to me as a PDF file named LASTNAME.FIRSTNAME.ASSIGNMENT3.PDF
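
The mechanics of steps 2 to 8 can be sketched in R as follows. This is a rough outline on simulated data, not a solution. It assumes the downloaded file has a character or factor column `line` (with the codes listed above) and a numeric column `height`. The data frame `x`, the simulated values, and the new column names `add` and `dom` are hypothetical. Both criteria are computed here only to show the calls; the assignment asks you to commit to one before analyzing the data.

```r
# Hypothetical stand-in for the downloaded file; column names and values are assumptions
set.seed(7)
genotypes <- c("high", "low", "f1", "f2", "bh", "bl")
true_mean <- c(high = 5.0, low = 3.0, f1 = 4.4, f2 = 4.2, bh = 4.7, bl = 3.6)
x <- data.frame(line = rep(genotypes, each = 15), stringsAsFactors = FALSE)
x$height <- rnorm(nrow(x), mean = true_mean[x$line], sd = 0.4)

# Step 2: means and standard deviations by genotype
tab <- data.frame(
  mean = tapply(x$height, x$line, mean),
  sd   = tapply(x$height, x$line, sd)
)
round(tab, 2)

# Steps 3 and 5: numeric codings via named lookup vectors.
# as.character() matters: indexing by a factor would use its integer codes.
additive  <- c(high = 1, low = 0, f1 = 0.5, f2 = 0.5, bh = 0.75, bl = 0.25)
dominance <- c(high = 0, low = 0, f1 = 1,   f2 = 0.5, bh = 0.5,  bl = 0.5)
x$add <- unname(additive[as.character(x$line)])
x$dom <- unname(dominance[as.character(x$line)])
stopifnot(is.numeric(x$add), is.numeric(x$dom))

# Steps 4, 6, and 7: the three fixed-effects models
m_add  <- lm(height ~ add, data = x)            # additive
m_dom  <- lm(height ~ add + dom, data = x)      # additive plus dominance
m_full <- lm(height ~ factor(line), data = x)   # separate mean per genotype

# Step 8: compare the models (k counts the residual variance as a parameter)
ic <- data.frame(
  model = c("additive", "additive + dominance", "genotype means"),
  k     = AIC(m_add, m_dom, m_full)$df,
  AIC   = AIC(m_add, m_dom, m_full)$AIC,
  BIC   = BIC(m_add, m_dom, m_full)$BIC
)
ic$dAIC <- ic$AIC - min(ic$AIC)
ic$dBIC <- ic$BIC - min(ic$BIC)
ic

# Show one model fit graphically: data and the additive-model line
plot(jitter(x$add, amount = 0.01), x$height, pch = 16, col = "grey40",
     xlab = "Proportion of genome from high-tolerance parent",
     ylab = "Cube root of tiller height")
abline(m_add)
```

Using named lookup vectors keeps the coding in one visible place, which makes it easy to check against the values listed in steps 3 and 5.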