Wednesday, 17 February 2016

Data Analysis in R - 2013 American Community Survey

Hi everyone,

Mary and I both recently finished a summer internship at UBS in their FRC department.
It was a highly rewarding experience for both of us. During that time I picked up a new technical skill, R, and Mary learnt Excel VBA. R is a programming language for statistical computing and graphics.

Note that this blog post is about data mining rather than finance.

To consolidate my R knowledge, we decided to use DataCamp's R platform for a quick analysis of the 2013 American Community Survey, a dataset provided on Kaggle. This survey is similar to the ABS surveys in Australia. The aim is to determine whether it is worthwhile pursuing a PhD.

This was our approach: 

1. Load the data to see how observations are formatted in the dataset.
2. Load the dplyr package, which provides tools to manipulate datasets efficiently.
3. Using dplyr, convert the dataset into a table.
4. Clean up the dataset: remove NA values, keep only university-level qualifications (Bachelor's, Master's and PhD), and group by these levels for further analysis.
5. Perform an inner join of our formatted table with one that contains the number of higher education holders, and use the result to produce a bar graph.
6. Using a separate income dataset (code for relevant calculations such as min, median, max and interquartile ranges provided below), use box plots to compare incomes.
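
As a rough sketch, steps 1 to 5 might look something like the following. This is an illustration under stated assumptions, not the exact code from the analysis: the toy data frame stands in for the Kaggle file, the column names SCHL (education code) and PINCP (personal income) and the codes 21, 22 and 24 are assumed from the survey's data dictionary, and the `degree_labels` lookup table is hypothetical.

```r
library(dplyr)

# Step 1: a toy stand-in for the survey data (assumed column names).
# In practice this would be read from the Kaggle CSV with read.csv().
survey <- data.frame(
  SCHL  = c(21, 22, 24, 16, NA, 21, 22, 24, 21),
  PINCP = c(52000, 68000, 90000, 30000, 41000, 47000, NA, 85000, 60000)
)
str(survey)  # inspect how the observations are formatted

# Hypothetical lookup table: assumed education codes and their degree names.
degree_labels <- data.frame(
  SCHL   = c(21, 22, 24),
  degree = c("Bachelor", "Master", "PhD")
)

degree_counts <- survey %>%
  as_tibble() %>%                             # step 3: convert to a table
  filter(!is.na(SCHL)) %>%                    # step 4: drop missing education codes
  filter(SCHL %in% c(21, 22, 24)) %>%         # step 4: keep university degrees only
  group_by(SCHL) %>%                          # step 4: group by degree level
  summarise(holders = n()) %>%                # count respondents per level
  inner_join(degree_labels, by = "SCHL")      # step 5: attach readable degree names

print(degree_counts)

# Step 5: bar graph of the number of holders per degree.
barplot(degree_counts$holders,
        names.arg = degree_counts$degree,
        ylab = "Number of respondents")
```

The inner join keeps only rows whose code appears in both tables, so any education level missing from the lookup table would drop out of the chart.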







  Calculations for box plot:
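
A minimal sketch of these calculations in base R, assuming a made-up income vector for a single degree group (the figures are hypothetical, not survey results):

```r
# Hypothetical incomes for one degree group; not real survey figures.
income <- c(32000, 45000, 51000, 58000, 64000, 72000, 90000, 150000)

# The values summarised by a box plot.
box_stats <- c(
  min    = min(income),
  q1     = unname(quantile(income, 0.25)),  # lower quartile
  median = median(income),
  q3     = unname(quantile(income, 0.75)),  # upper quartile
  max    = max(income)
)

# Interquartile range: the distance between the upper and lower quartiles.
iqr <- IQR(income)

print(box_stats)
print(iqr)
```

Note that `boxplot()` draws its box from the hinges returned by `fivenum()`, which can differ slightly from the `quantile()` values on small samples. With the real data, the incomes could be compared across degrees with a formula call such as `boxplot(income ~ degree, data = incomes)`, where `incomes` is a hypothetical data frame holding one row per respondent.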



Our next few blog posts will be related to data mining and analysis. 

Thanks for reading! 


