Webscraping using readLines and RCurl

There is a massive amount of data available on the web. Some of it is in the form of precompiled, downloadable datasets which are easy to access. But the majority of online data exists as web content such as blogs, news stories and cooking recipes. With precompiled files, accessing the data is fairly straightforward: just download the file, unzip if necessary, and import into R. For "wild" data, however, getting the data into an analyzable format is more difficult. Accessing online data of this sort is sometimes referred to as "webscraping".
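As a rough illustration of the idea, here is a minimal sketch that reads a page line by line with readLines() and pulls out the link targets. The page contents, the temporary file, and the variable names are all made up for the example so that it runs offline; for a live page, readLines() on a URL or RCurl's getURL() would supply the text instead.

```r
# Hypothetical page contents, written to a temporary file so the sketch runs offline.
html <- c(
  "<html><body>",
  "<h1>Recipes</h1>",
  "<a href=\"/recipes/soup\">Soup</a>",
  "<a href=\"/recipes/bread\">Bread</a>",
  "</body></html>"
)
page_file <- tempfile(fileext = ".html")
writeLines(html, page_file)

# readLines() returns a character vector with one element per line of the page
page <- readLines(page_file)

# Keep only the lines that contain a link, then extract the href value from each
link_lines <- grep("<a href=", page, value = TRUE, fixed = TRUE)
links <- sub(".*<a href=\"([^\"]*)\".*", "\\1", link_lines)
print(links)
```

Real pages are rarely this tidy, so pattern matching of this kind usually needs adjusting to the markup of the particular site.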

Helpful statistical references

In a previous article I provided a list of R programming resources. As a complement to that post, I've compiled below a list of statistically oriented websites that colleagues and I have found useful. For the most part, these sites focus on statistics and quantitative research methods rather than programming.

Positioning charts with fig and fin

R offers several ways to spatially orient multiple graphs in a single graphing space. The layout() function and mfrow/mfcol parameter settings are adequate solutions for many tasks and allow the graphing space to be broken up into tabular or matrix-based arrangements. For more fine-grained manipulation, the fig and fin parameter settings are available. This article illustrates the capabilities and use of fig and fin.

First we'll create some simulation data to work with:
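The article's own data are not reproduced on this page, so the following is only a sketch with made-up simulated values. It assumes a default-sized graphics device and shows fig placing three plots by their coordinates (as fractions of the device area, in the order x1, x2, y1, y2), followed by fin setting the figure region size in inches.

```r
# Simulated data (hypothetical; not the article's original data)
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)

# Main scatterplot in the lower-left 80% of the device
par(fig = c(0, 0.8, 0, 0.8))
plot(x, y)

# Horizontal boxplot of x above the scatterplot; new = TRUE keeps the existing plot
par(fig = c(0, 0.8, 0.55, 1), new = TRUE)
boxplot(x, horizontal = TRUE, axes = FALSE)

# Vertical boxplot of y to the right of the scatterplot
par(fig = c(0.65, 1, 0, 0.8), new = TRUE)
boxplot(y, axes = FALSE)

# fin gives the figure region as c(width, height) in inches;
# setting it starts a new plot
par(fin = c(4, 3))
plot(x, y)
```

Because fig is expressed as proportions, the layout rescales with the device, whereas fin fixes the figure region at an absolute size.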

Online R programming resources

R can legitimately be called both a programming language and a statistical package. Many books address both the programming and statistical components of R, but invariably the discussion of statistical topics is more detailed than the discussion of programming capabilities. As a supplement, I've started the list of links below. Each of these sources deals specifically and almost exclusively with the programming aspects of R: objects, arrays, loops and conditional statements, custom functions, debugging, and so on. I'll add to this list as I become aware of other sites.

A Handbook of Statistical Analyses Using R - Everitt and Hothorn (2006)

Title: A Handbook of Statistical Analyses Using R
Author(s): Brian S. Everitt; Torsten Hothorn
Publisher/Date: Chapman & Hall/2006
Statistics level: Intermediate to advanced
Programming level: Intermediate
Overall recommendation: Highly recommended

A Handbook of Statistical Analyses Using R addresses several common statistical analyses in great detail. Over the course of 15 chapters, the handbook takes the reader from an introduction to R through a discussion of statistical inference, to linear and logistic regression, tree analysis, survival analysis, longitudinal analysis, meta-analysis, factoring, scaling, and clustering. The handbook has a peer-reviewed journal style that will be familiar to academic researchers, and each chapter stands on its own. This approach makes the text exceptionally useful in the academic setting, as a professor can distribute and assign the first chapter of the book to her Research Methods 101 course; the final chapters on scaling and dimensionality to her Psychometrics Methods course; the last chapter on clustering to her Marketing Research course; and require the entire book for her graduate methods course. For custom research shops that are making the transition to R or that frequently hire new entry-level R users, this book will work well as a reference and training manual.

The handbook does show typical first edition flaws. There are sporadic mistakes such as misspellings and incorrect words. The overall organization of the book is strong, but the chapter-level organization is less effective. Each chapter begins with a discussion of all of the datasets used in that chapter and is followed by examples and applications based on those datasets. In chapters where there are several examples, the discussion of the data is too detached from its corresponding example. When the reader reaches the example based on the first dataset, they have likely forgotten the relevant details about that data's structure. Grouping the data discussions with the examples they accompany would have made the example-based approach more effective.

Controlling margins and axes with oma and mgp

When creating graphs, we're usually most concerned with what happens near the center of our displays, as this is where most of the important information is generally held. But sometimes, either for aesthetics or clarity, we want to adjust what's outside of the box - in the margins, labels or tick marks. The par() function offers several ways to do this and I'll discuss two that deal primarily with spatial orientation - rather than content - below.
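As a small sketch of the two settings named in the title (the data and titles here are invented for illustration), oma reserves outer margin space in lines of text around the whole set of plots, in the order bottom, left, top, right, while mgp sets the distance in lines of the axis title, axis labels, and axis line from the plot edge.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(42)
x <- rnorm(50)
y <- rnorm(50)

# Two panels side by side, three lines of outer margin on top,
# and axis titles and labels pulled closer to the axes than the default c(3, 1, 0)
par(mfrow = c(1, 2), oma = c(0, 0, 3, 0), mgp = c(2, 0.5, 0))

plot(x, y, main = "Panel A")
hist(x, main = "Panel B")

# outer = TRUE writes into the outer margin reserved by oma
mtext("Shared title in the outer margin", outer = TRUE, cex = 1.3)
```

Saving the value returned by par() and passing it back to par() afterwards is a common way to restore the previous settings once the figure is done.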

The oma, omd, and omi options
