
Working with biological data in R


The following dataset contains the integration sites of the fictional virus RandoVirus, that is, the positions in the host genome where the retrovirus was found integrated in a sample. These integration sites were annotated against known datasets of the host genome with respect to the nearest gene and CpG islands, both of which are indicators of the transcriptional activity at the site.

Link to dataset


1. If you have not done so yet, download the dataset and save it to your data directory.


2. Read the data into a data frame in R.
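A minimal sketch of this step, assuming the dataset is a comma-separated file with a header row. The file name and the toy rows are hypothetical stand-ins that let the snippet run on its own; only the column names are taken from the exercises on this sheet. With the real data, point `path` at the file in your data directory.

```r
# Toy stand-in for the real dataset (hypothetical values), written to a
# temporary file so that this sketch is self-contained.
toy <- data.frame(
  Chr = c("chr1", "chr2", "chr10", "chrX"),
  Site.orientation = c("+", "-", "+", "-"),
  Distance.to.nearest.5. = c(1200, -350, 87000, -15)
)
path <- file.path(tempdir(), "integration_sites.csv")  # hypothetical file name
write.csv(toy, path, row.names = FALSE)

# Read the file into a data frame. If the real file is tab-separated,
# read.delim() is the equivalent call.
sites <- read.csv(path, stringsAsFactors = FALSE)

str(sites)   # column names and types
head(sites)  # first few rows
```

Checking the result with `str()` right after reading is a cheap way to catch a wrong separator or a column that was read as text instead of numbers.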


3. Using ggplot() and geom_bar(), plot how many integration sites are found in each chromosome. Is the order of the bars in the chart what you expected?

Example:

# `sites` is a placeholder name for the data frame you read in step 2
ggplot(sites) + geom_bar(aes(x = Chr))
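The bar order comes from how R sorts the chromosome labels. The sketch below shows one way to put them in karyotype order with an explicit factor. It assumes the labels look like "chr1", "chr10", "chrX"; the toy data frame is hypothetical, and the real file may use a different naming scheme.

```r
library(ggplot2)

# Hypothetical chromosome labels standing in for the real Chr column.
sites <- data.frame(Chr = c("chr1", "chr10", "chr2", "chrX", "chr2", "chr1"))

# A character column is laid out alphabetically on the axis,
# so "chr10" sorts before "chr2".
sort(unique(sites$Chr))

# Build the desired order: numbered chromosomes by number, then the rest.
labels <- unique(sites$Chr)
number <- suppressWarnings(as.integer(sub("^chr", "", labels)))
chr_levels <- labels[order(is.na(number), number, labels)]

# A factor with explicit levels fixes the order of the bars.
sites$Chr <- factor(sites$Chr, levels = chr_levels)

ggplot(sites) +
  geom_bar(aes(x = Chr))
```

Setting the levels on the column itself, rather than inside the plot call, keeps the same order in every later plot and table that uses `Chr`.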


4. Using ggplot() and geom_bar(), plot how many integration sites are found in each orientation. Is RandoVirus biased towards integration in either orientation? How could you test this formally? Draw the same plot separately for each chromosome. Do you find the same bias on all chromosomes?
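Counting rows per category is what geom_bar() does, and a plot alone cannot say whether a difference in counts is more than chance. One possible approach, sketched here on hypothetical toy data, is an exact binomial test against the null hypothesis that both orientations are equally likely, followed by the same bar chart split by chromosome with facet_wrap(). The orientation labels "+" and "-" are an assumption about how the column is coded.

```r
library(ggplot2)

# Hypothetical toy data standing in for the real data frame.
sites <- data.frame(
  Chr = rep(c("chr1", "chr2"), each = 5),
  Site.orientation = c("+", "-", "+", "+", "-", "+", "-", "+", "+", "-")
)

# Counts per orientation.
table(sites$Site.orientation)

# Exact binomial test: is the share of "+" sites compatible with 0.5?
n_plus <- sum(sites$Site.orientation == "+")
result <- binom.test(n_plus, nrow(sites), p = 0.5)
result$p.value

# The same bar chart, one panel per chromosome.
ggplot(sites) +
  geom_bar(aes(x = Site.orientation)) +
  facet_wrap(~ Chr)
```

When repeating the test for every chromosome, remember that many tests inflate the chance of a false positive, so some correction for multiple testing (for example with `p.adjust()`) is worth considering.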


5. The relative orientation between an integrated provirus and the host gene may affect the expression of either.


6. The Distance.to.nearest.5. column describes the genomic distance (along the chromosome) between each viral integration site and the nearest 5’ end of a host gene. The distance is reported as a negative value if the host gene is upstream of the integrated virus. Using ggplot() and geom_boxplot(), plot the distribution of these values.

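# `sites` is a placeholder name for the data frame you read in step 2
ggplot(sites) +
  geom_boxplot(aes(x = Site.orientation, y = Distance.to.nearest.5.,
                   col = Site.orientation)) +
  # overlay the individual sites; use abs(Distance.to.nearest.5.) in both
  # layers instead if only the magnitude of the distance is of interest
  ggbeeswarm::geom_beeswarm(aes(x = Site.orientation, y = Distance.to.nearest.5.,
                                col = Site.orientation))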


7. Save your last plot to a file called “exs3.png” and to a file called “exs3.pdf”. What is the difference between the two files?
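A minimal sketch of saving with ggsave(), which picks the graphics device from the file extension. The toy plot, the output directory, and the chosen width, height, and resolution are all assumptions for illustration; with your own plot, pass it as `plot =` or rely on the default, which is the last plot displayed.

```r
library(ggplot2)

# Hypothetical toy data; any ggplot object is saved the same way.
toy <- data.frame(
  Site.orientation = c("+", "-", "+", "-"),
  Distance.to.nearest.5. = c(1200, -350, 40, -9000)
)
p <- ggplot(toy) +
  geom_boxplot(aes(x = Site.orientation, y = Distance.to.nearest.5.))

out_dir <- tempdir()  # swap in your own output directory

# PNG is a raster format: dpi sets how many pixels are drawn per inch.
png_file <- file.path(out_dir, "exs3.png")
ggsave(png_file, plot = p, width = 6, height = 4, dpi = 300)

# PDF is a vector format: it stores shapes rather than pixels.
pdf_file <- file.path(out_dir, "exs3.pdf")
ggsave(pdf_file, plot = p, width = 6, height = 4)
```

Opening both files and zooming in on a line or a text label is a quick way to see how the two formats behave.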