Wednesday, 17 February 2016

Data Analysis in R - Interesting Datasets & Graphs

Hi everyone, 

This blog post again covers R functions that are useful for data analysis, this time applied to some quite interesting datasets. If you're a student at the University of Sydney, Business Analytics is a great Business School major to take, as it lets us perform our own analysis on whatever dataset we like, and R is highly recommended by our academics. Read on for tips on how to approach datasets and apply functions in R.

1) We are using race results available on the American Racing Pigeon Union's website. The aim is to identify any relationship between a pigeon's speed and the colour of its feathers. For greater reliability, we filtered the data frame to include only colours that appear more than ten times in the database.

We load ggplot2, a graphics package that can build complex plots such as correlation plots. The calculations include the average flying speed for each colour group. Speed is our independent variable and rank is the dependent variable, though these could be switched.

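Note: %>% is a piping operator. It passes the result of one function on as the first argument of the next, so a chain of operations on the same data can be written without re-entering the variable name at each step. It comes from the magrittr package and is re-exported by dplyr, so loading dplyr makes it available.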
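The pigeon steps above can be sketched roughly as follows. This is a minimal sketch under assumptions: the data frame `races`, its columns `colour`, `speed` and `rank`, and all the values are made up for illustration, since the real column names in the American Racing Pigeon Union data may differ.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the race results (assumed names, random values).
set.seed(1)
races <- data.frame(
  colour = c(rep("BB", 15), rep("BC", 12), rep("RC", 3)),
  speed  = runif(30, min = 100, max = 170)
)
races$rank <- rank(-races$speed)  # fastest bird gets rank 1

# Keep only colours that appear more than ten times.
common <- races %>%
  group_by(colour) %>%
  filter(n() > 10) %>%
  ungroup()

# Average flying speed for each remaining colour group.
avg_speed <- common %>%
  group_by(colour) %>%
  summarise(mean_speed = mean(speed), n = n())

# Rank against speed, with points coloured by feather colour.
p <- ggplot(common, aes(x = speed, y = rank, colour = colour)) +
  geom_point()
```

Each %>% hands the data frame on to the next function, so `races` is only typed once per chain. Printing `p` draws the plot.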

2) There is a research paper on the optimum length of chopsticks. The key performance indicator is the number of peanuts that can be picked up and placed into a cup (variable name: food picking efficiency). Whilst the results of the study have been publicised, it is always fun and rewarding to reproduce the graph in a new software environment. Data visualisation is just as important as the analysis itself.

We use ggplot to visualise the data. aes is short for aesthetic, i.e. it maps a user-specified variable to a user-specified part of the plot. We use fill to group the data by chopstick length, and the dependent variable is relative frequency, that is, how often a given food picking efficiency is observed relative to the total number of observations. geom_density displays a smooth density estimate of that relative frequency, while alpha sets the transparency, which is useful when several plots overlap. Other optional settings include the weight of an observation (weight), border colour (colour), size and line type (linetype). We can conclude that a chopstick length of 240mm is optimal.
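A density plot along these lines might look like the sketch below. The data frame `chopsticks`, its columns `length_mm` and `efficiency`, the three lengths and the alpha value are all assumptions for illustration, and the numbers are random rather than taken from the study.

```r
library(ggplot2)

# Hypothetical stand-in data (assumed names, random values).
set.seed(42)
chopsticks <- data.frame(
  length_mm  = factor(rep(c(180, 240, 330), each = 30)),
  efficiency = c(rnorm(30, mean = 24, sd = 3),
                 rnorm(30, mean = 26, sd = 3),
                 rnorm(30, mean = 23, sd = 3))
)

# One semi-transparent density curve per chopstick length.
p <- ggplot(chopsticks, aes(x = efficiency, fill = length_mm)) +
  geom_density(alpha = 0.3) +
  labs(x = "Food picking efficiency",
       y = "Density",
       fill = "Chopstick length (mm)")
```

Mapping `length_mm` to fill inside aes gives one curve per length, while `alpha` is set outside aes because it is a fixed value rather than a variable in the data.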

3) We are reproducing two graphs based on Spanish silver production during the 18th century. Firstly, we plot annual silver production as a time series. Secondly, we plot annual production again, with another time series overlaid on it to show cumulative production over time.

We use the ggplot2 package for graphical output. geom_area produces area plots, which are similar to a continuous stacked bar chart and aim to show how the composition varies over the x range (time). We store the data for the cumulative graph in a new variable, silver_cs. We mutate the data frame to add an extra column named cumsum, which is the cumulative sum of silver production. mutate seems to act much faster than transform for large data frames. The hex colour code #c0c0c0 is the colour silver. Here the fill is slightly less transparent, with alpha set to 0.5.
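The cumulative plot can be sketched as below. The names silver_cs and cumsum, the colour #c0c0c0 and alpha = 0.5 come from the description above. The data frame `silver`, its columns `year` and `silver`, and every figure in it are assumptions made up for illustration, not real production data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in data (assumed names, invented values).
silver <- data.frame(
  year   = 1720:1729,
  silver = c(10, 12, 9, 14, 13, 15, 11, 16, 18, 17)
)

# Add a column holding the running total of annual production.
silver_cs <- silver %>%
  mutate(cumsum = cumsum(silver))

# Cumulative production as a silver-coloured area,
# with annual production drawn over it as a line.
p <- ggplot(silver_cs, aes(x = year)) +
  geom_area(aes(y = cumsum), fill = "#c0c0c0", alpha = 0.5) +
  geom_line(aes(y = silver)) +
  labs(x = "Year", y = "Silver production")
```

Inside mutate, `cumsum(silver)` calls the base R function on the `silver` column, and the result is stored in a column that happens to share the function's name.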
