The following dataset contains the integration sites (positions where a retrovirus was found in a sample to be integrated into the host genome) of the fictional virus RandoVirus. These integration sites were annotated using known datasets of the host genome with respect to the nearest gene and CpG islands, both measures of the transcriptional activity at the site.
1. If you have not done so yet, download the dataset and save it to your
data directory.
2. Read the data into a data frame in R.
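A minimal sketch of step 2, assuming the dataset is a comma-separated file with a header row. The file written below is a toy stand-in (its rows and the chromosome labels are made up for illustration); only the column names come from this exercise. Replace the temporary path with the path to the file in your data directory.

```r
# Toy stand-in for the real file: the rows below are invented for illustration.
toy_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "Chr,Site.orientation,Orientation.of.gene,Distance.to.nearest.5.",
  "chr1,+,+,1200",
  "chr10,-,+,-350",
  "chr2,+,-,80"
), toy_csv)

# read.csv() returns a data frame; header = TRUE is its default
sites <- read.csv(toy_csv)

# Inspect column names and types before plotting
str(sites)
head(sites)
```

Checking the output of str() first is a cheap way to confirm that the distance column was read as numeric and the orientation columns as text.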
3. Using ggplot() and geom_bar(), plot
how many integration sites are found in each chromosome. Is the order of
the bars in the plot what you expected?
Example:
geom_bar(aes(x = Chr))
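The sketch below shows one way the bar order could be controlled, assuming Chr is stored as text. The chromosome labels in the toy data frame are assumptions, not values taken from the real dataset. Text values are placed on the x axis in alphabetical order, so converting the column to a factor with explicit levels is one option for changing that order.

```r
library(ggplot2)

# Toy data: these chromosome labels are made up for illustration
sites <- data.frame(
  Chr = c("chr1", "chr2", "chr10", "chr2", "chr10", "chr1", "chr2")
)

# Default behaviour: character values are ordered alphabetically
p_default <- ggplot(sites) +
  geom_bar(aes(x = Chr))

# Explicit factor levels fix the order of the bars
sites$Chr <- factor(sites$Chr, levels = c("chr1", "chr2", "chr10"))
p_ordered <- ggplot(sites) +
  geom_bar(aes(x = Chr))

print(p_default)
print(p_ordered)
```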
4. Using ggplot() and geom_bar(),
plot how many integration sites are found in each orientation. Is
RandoVirus biased for integration in either orientation? How would you
know for sure? Draw the same plot separately for each chromosome. Do you
find the same bias for all chromosomes?
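One possible approach to step 4 is sketched here on toy data. The orientation values "+" and "-" and the chromosome labels are assumptions. facet_wrap() is one way to repeat a plot per chromosome, and binom.test() is offered only as one candidate for checking whether the two counts differ from a 50:50 split; other tests may suit the real data better.

```r
library(ggplot2)

# Toy data: labels and counts are invented for illustration
sites <- data.frame(
  Chr = rep(c("chr1", "chr2"), each = 6),
  Site.orientation = c("+", "+", "-", "+", "-", "-",
                       "+", "-", "-", "+", "+", "-")
)

# Count of integration sites per orientation
p_all <- ggplot(sites) +
  geom_bar(aes(x = Site.orientation))

# Same plot, one panel per chromosome
p_by_chr <- p_all + facet_wrap(~ Chr)

# Exact binomial test of the two counts against an even split
counts <- table(sites$Site.orientation)
bias_test <- binom.test(as.vector(counts), p = 0.5)

print(p_by_chr)
print(bias_test)
```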
5. The relative orientation between an integrated provirus and
the host gene may affect the expression of either.
- Using ifelse(), define a new relort
column where the value is “Same” if the
virus (Site.orientation) and gene
(Orientation.of.gene) orientations are equal, and “Opposite” if they are
not.
- Use ggplot2 and geom_bar() to answer the
question: is RandoVirus biased for integration in either relative
orientation?
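A minimal sketch of the relort column on toy data. The column names come from this exercise, while the "+" and "-" values are assumptions about how orientation is encoded. ifelse() is vectorised, so the comparison is evaluated once per row.

```r
library(ggplot2)

# Toy data: orientation values are assumed to be "+" and "-"
sites <- data.frame(
  Site.orientation    = c("+", "-", "+", "-", "+"),
  Orientation.of.gene = c("+", "+", "-", "-", "+")
)

# Row-wise comparison of virus and gene orientation
sites$relort <- ifelse(
  sites$Site.orientation == sites$Orientation.of.gene,
  "Same",
  "Opposite"
)

# Count of integration sites per relative orientation
p_relort <- ggplot(sites) +
  geom_bar(aes(x = relort))

print(table(sites$relort))
print(p_relort)
```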
6. The Distance.to.nearest.5. column describes the
genomic distance (along the chromosome) between each viral integration
site and the nearest 5’ end of a host gene. This distance is reported as
a negative value if the host gene is upstream of the integrated virus.
- Using ggplot() and geom_boxplot(), plot the
distribution of the values observed here.
Example:
geom_boxplot(aes(y = Distance.to.nearest.5., x = Site.orientation,
col = Site.orientation))
- Instead of Distance.to.nearest.5., plot
abs(Distance.to.nearest.5.). What is the difference?
- Install ggbeeswarm using
install.packages(). Use
ggbeeswarm::geom_beeswarm() instead of the boxplot. What is the
difference? What would happen if you include both?
Example:
ggbeeswarm::geom_beeswarm(aes(y = abs(Distance.to.nearest.5.), x = Site.orientation,
col = Site.orientation))
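The sub-steps of exercise 6 could be combined along these lines. The distances in the toy data frame are invented, and the "+" and "-" orientation labels are assumptions. The beeswarm layer is added only if ggbeeswarm is already installed, so the sketch still runs without it.

```r
library(ggplot2)

# Toy data: distances are made up for illustration
sites <- data.frame(
  Site.orientation = rep(c("+", "-"), each = 5),
  Distance.to.nearest.5. = c(-1200, 350, 80, -40, 5000,
                             900, -15, 220, -3100, 60)
)

# Signed distances: negative values stay below zero
p_signed <- ggplot(sites) +
  geom_boxplot(aes(y = Distance.to.nearest.5., x = Site.orientation,
                   col = Site.orientation))

# Absolute distances: the direction is discarded, only magnitude remains
p_abs <- ggplot(sites, aes(x = Site.orientation,
                           y = abs(Distance.to.nearest.5.),
                           col = Site.orientation)) +
  geom_boxplot()

# Both layers together: boxplot summary with individual points on top
if (requireNamespace("ggbeeswarm", quietly = TRUE)) {
  p_both <- p_abs + ggbeeswarm::geom_beeswarm()
  print(p_both)
}

print(p_signed)
print(p_abs)
```

Putting the shared aesthetics in ggplot() rather than in each geom means every added layer inherits the same mapping.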
7. Save your last plot to a file called “exs3.png” and to a file
called “exs3.pdf”. What is the difference?
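A hedged sketch of step 7 using ggsave(), which picks the output device from the file extension. The plot and its data are toy stand-ins for your own last plot, and the width, height and dpi values are arbitrary choices. The files are written to a temporary directory here; swap in your own output directory.

```r
library(ggplot2)

# Toy plot standing in for "your last plot"; the values are made up
toy <- data.frame(
  Site.orientation = c("+", "-", "+", "-"),
  Distance.to.nearest.5. = c(120, -40, 900, 15)
)
p <- ggplot(toy) +
  geom_boxplot(aes(x = Site.orientation, y = abs(Distance.to.nearest.5.)))

# Temporary output directory for this sketch; replace with your own
out_dir  <- tempdir()
png_path <- file.path(out_dir, "exs3.png")
pdf_path <- file.path(out_dir, "exs3.pdf")

# PNG is a raster format, so the resolution (dpi) matters
ggsave(png_path, plot = p, width = 6, height = 4, dpi = 300)

# PDF is a vector format, so no dpi is needed
ggsave(pdf_path, plot = p, width = 6, height = 4)
```

Passing plot = p explicitly avoids depending on which plot happened to be displayed last.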