The following dataset contains the integration sites (positions where a retrovirus was found in a sample to be integrated into the host genome) of the fictional virus RandoVirus. These integration sites were annotated using known datasets of the host genome with respect to the nearest gene and CpG islands, both measures of the transcriptional activity at the site.
Download and save the dataset to your data directory.
Open the data in Excel. What is your impression of the file in terms of the messy/tidy concepts we looked over?
What format is the file?
Load the file into R and assign the data to an object. What structure does your data have?
What are the column headers of the data? How many columns are there? How many rows?
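As a rough sketch of the loading and inspection steps, the example below reads a tiny made-up table instead of the real file. The CSV format, the object name `sites`, and every column name other than `Site.orientation` are assumptions for illustration only; the actual dataset may differ.

```r
# Toy stand-in for the real dataset. CSV is an assumed format and all
# column names except Site.orientation are hypothetical.
toy_csv <- "Chr,Position,Site.orientation,Nearest.gene,In.gene
chr1,10500,forward,GENE_A,TRUE
chr2,88210,reverse,GENE_B,FALSE
chr7,4300,forward,GENE_C,TRUE"

# read.csv() normally takes a file path; text = lets us parse a string instead
sites <- read.csv(text = toy_csv)

str(sites)       # structure of the object and the type of each column
colnames(sites)  # column headers
ncol(sites)      # number of columns
nrow(sites)      # number of rows
```

With the real file you would pass its path to the reading function that matches its format, rather than using `text =`.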
What does the unique() function do?
Which values can the column Site.orientation take?
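To see how `unique()` behaves before applying it to the dataset, here is a small sketch on a made-up vector. The values `"forward"` and `"reverse"` are invented; the real Site.orientation column may use different labels.

```r
# Hypothetical orientation values, not taken from the real dataset
orientation <- c("forward", "reverse", "forward", "forward", "reverse")

# unique() keeps one copy of each distinct value, in order of first appearance
distinct_values <- unique(orientation)
distinct_values

# table() is a related helper that also counts how often each value occurs
table(orientation)
```

On a data frame, the same call would be applied to a single column, for example `unique(my_data$Site.orientation)`, where `my_data` is whatever object name you chose.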
Subset the data to only the integration sites that are within a gene.
Subset the data to only the integration sites that are in an intron.
What are the dimensions of the new tables?
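The two subsetting tasks and the dimension check could look roughly like the sketch below. It uses a toy data frame; the columns `In.gene` and `Feature` and their values are hypothetical, so you will need to find the corresponding columns in the real data.

```r
# Toy data frame; In.gene and Feature are hypothetical column names
sites <- data.frame(
  Site.orientation = c("forward", "reverse", "forward", "reverse"),
  Nearest.gene     = c("GENE_A", "GENE_B", "GENE_C", "GENE_A"),
  In.gene          = c(TRUE, FALSE, TRUE, TRUE),
  Feature          = c("intron", "intergenic", "exon", "intron"),
  stringsAsFactors = FALSE
)

# Logical indexing: keep rows where the condition is TRUE, keep all columns
in_gene <- sites[sites$In.gene, ]

# subset() is an equivalent, more readable alternative
in_intron <- subset(sites, Feature == "intron")

dim(in_gene)    # rows, then columns
dim(in_intron)
```

Both approaches return a data frame with the same columns as the original and only the matching rows.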
Which genes have integration sites in them? Extract a list of these gene names so that you can upload it to pathway analysis tools to see whether particular pathways are targeted by the RandoVirus.
How long is this list?
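One possible way to build a de-duplicated gene list and measure its length is sketched here. The data frame `in_gene` and the column `Nearest.gene` are illustrative assumptions standing in for your own within-gene subset.

```r
# Hypothetical subset of sites that fall within a gene
in_gene <- data.frame(
  Nearest.gene = c("GENE_A", "GENE_C", "GENE_A", "GENE_D"),
  stringsAsFactors = FALSE
)

# A gene hit by several integration sites should appear only once in the list
gene_list <- unique(in_gene$Nearest.gene)

gene_list
length(gene_list)  # number of distinct genes
```

Pathway analysis tools generally expect each gene once, which is why the duplicates are dropped before counting.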
Export (write) the data to a file using writeLines().
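A minimal sketch of the export step follows, assuming the gene names are held in a character vector. The vector contents and the file name `randovirus_genes.txt` are made up, and a temporary directory is used here so the example does not touch your data directory.

```r
# Hypothetical gene names standing in for your extracted list
gene_list <- c("GENE_A", "GENE_C", "GENE_D")

# Write to a temporary location; in the exercise, use your own output path
out_file <- file.path(tempdir(), "randovirus_genes.txt")

# writeLines() writes one element of the character vector per line
writeLines(gene_list, con = out_file)

# Read the file back to confirm what was written
readLines(out_file)
```

The resulting plain-text file has one gene name per line, a layout that many pathway analysis tools accept as input.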