While a lot of statistics deals with linear relationships, we live in a very non-linear world. Power law distributions (80/20 relationships, the Pareto principle) appear in many areas of business, economics, and the social sciences. A handful of observations at the fringes of the distribution take on extreme values, which makes it hard to fit a linear trend line through the data, obscures patterns, and can distort your analysis. Fortunately, log transforms can help. Taking the logarithm of your data compresses the range of values and makes patterns and relationships easier to see. Log transforms can also bring right-skewed data closer to a normal distribution, which many statistical methods assume. In this article, we’ll show you how to use R to perform log transforms, and explain why they are helpful and sometimes necessary for working with data.
You can apply a logarithmic transformation to the dependent variable, the independent variables, or both, to counter skew in your original data that would otherwise distort a linear regression or other linear model. Log-transforming a right-skewed variable often brings its distribution closer to normal, so the assumptions behind many linear models and statistical tests are easier to satisfy on the transformed data. Keep in mind that the logarithm is only defined for positive values, so data containing zeros or negative values must be shifted or handled separately first.
Introducing the log() function in R
A log transformation in R is handled via the log() function. It takes the form log(value, base) and returns the logarithm of the value in the specified base. If the base is omitted, it defaults to the natural logarithm (base e). There are also shortcut functions for base 2 and base 10: log2() and log10(). Log transformations can help make your data more normally distributed, reduce skewness, and create a numeric variable that is better suited to regression analysis and scatter plots. They are one of several common non-linear transformations, alongside the square root, reciprocal (inverse), logit, and arcsine transformations, and they often work well for right-skewed, strictly positive data.
# log in r - core syntax
> log(9,3)
[1] 2
This is the basic logarithm function with 9 as the value and 3 as the base. The result is 2 because 9 is the square of 3.
# log in r example
> log(5)
[1] 1.609438
Here, the second parameter has been omitted, so the base defaults to e and the result is the natural logarithm of 5.
# log in R - base 10 log
> log(100,10)
[1] 2
> log10(100)
[1] 2
Here, we compare the base 10 log of 100 computed with the general function and with its shortcut. In both cases, the answer is 2 because 100 is 10 squared.
# log in r - base notation
> log(8,2)
[1] 3
> log2(8)
[1] 3
Here, we compare the base 2 logarithm of 8 computed with the general function and with its shortcut. In both cases, the answer is 3 because 8 is 2 cubed.
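As a quick check on the examples above, the following sketch confirms that the two-argument form, the base 2 shortcut, and the change-of-base identity (log of x in base b equals ln(x) / ln(b)) all agree. It uses only base R functions; the variable names are illustrative.

```r
# Illustrative values: any positive x and any positive base other than 1 would work.
x <- 8
b <- 2

general  <- log(x, base = b)   # two-argument form
shortcut <- log2(x)            # base 2 shortcut
by_hand  <- log(x) / log(b)    # change of base using natural logs

# all.equal() compares with a small tolerance, which is safer than ==
# for floating-point results.
isTRUE(all.equal(general, shortcut))
isTRUE(all.equal(general, by_hand))

# exp() is the inverse of the natural log, so this recovers the original value.
round_trip <- exp(log(5))
isTRUE(all.equal(round_trip, 5))
```

Comparing with all.equal() rather than == is a deliberate choice here, since logarithms computed by different routes can differ in the last few bits.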
How Does A Log Transformation Help Us Analyze Data?
There are several reasons why you might want to do a log transform of your data:
- Reduce skewness: A log transform pulls in a long right tail, which is useful because many statistical tests assume approximately normal data.
- Stabilize variance: If your data has unequal variances across different groups or levels, a log transform can help stabilize the variances and make them more equal.
- Make patterns visible: Sometimes, it can be easier to see patterns in data on a log scale than on a linear scale.
- Simplify interpretation: Equal differences on a log scale correspond to equal ratios on the original scale, so multiplicative changes such as percentage growth become additive.
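The first point in the list above can be illustrated with simulated data. This is a minimal sketch under stated assumptions: the data are drawn from a log-normal distribution with rlnorm(), and sample_skewness() is a hypothetical helper defined here because base R has no built-in skewness function.

```r
# Hypothetical helper (not part of base R): moment-based sample skewness.
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(42)                                  # make the simulation reproducible
raw    <- rlnorm(1000, meanlog = 0, sdlog = 1)  # simulated right-skewed data
logged <- log(raw)                            # the log of log-normal data is normal

skew_raw <- sample_skewness(raw)
skew_log <- sample_skewness(logged)

# A strongly positive value indicates a long right tail;
# a value near zero indicates a roughly symmetric distribution.
c(raw = skew_raw, logged = skew_log)
```

Real data will rarely be exactly log-normal, so the improvement is usually less clean than in this simulation.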
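How To Apply a Log Transformation to an R Vector
The log() function works element-wise, so you can apply it directly to a vector. If the vector may contain zeros, a common workaround is to add 1 before taking the log, as in the example below. For right-skewed data, the new vector will be less skewed than the original.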
# log in R - vector transformation
> v = c(100,10,5,2,1,0.5,0.1,0.05,0.01,0.001,0.0001)
> q=log(v+1)
> q
[1] 4.6151205168 2.3978952728 1.7917594692 1.0986122887 0.6931471806 0.4054651081
[7] 0.0953101798 0.0487901642 0.0099503309 0.0009995003 0.0000999950
> plot(v)
> plot(q)
A close look at the numbers above shows that q spans a much narrower range than v (roughly 0.0001 to 4.6, versus 0.0001 to 100), so it is less skewed. The difference is even more evident in the graphs produced by the two plot() calls.
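The add-one step matters most when the data contain zeros. The sketch below uses an illustrative vector (counts is a made-up example, not data from this article) to show what log() returns for zero, and that base R's log1p() gives the same result as log(x + 1), with expm1() available to reverse it.

```r
# Illustrative vector that includes a zero.
counts <- c(0, 1, 10, 100, 1000)

plain <- log(counts)           # R returns -Inf for log(0)
shifted   <- log(counts + 1)   # the add-one approach used above
via_log1p <- log1p(counts)     # base R function for log(1 + x)

# The two approaches agree up to floating-point tolerance.
isTRUE(all.equal(shifted, via_log1p))

# expm1() computes exp(x) - 1, so it reverses log1p().
recovered <- expm1(via_log1p)
isTRUE(all.equal(recovered, counts))
```

log1p() is also more accurate than log(x + 1) when x is very close to zero, which is one reason to prefer it.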
How To Apply a Log Transformation to an R Data Frame
Applying a log transformation to an R data frame can be a bit trickier than applying it to a vector. You usually need to transform a specific column rather than the entire data structure.
You can do this with R’s column operations, creating a new column in the data frame that holds the log-transformed values. This also helps with collaboration, since the code is relatively easy for a colleague or new analyst to understand after you hand off a project.
# log in R example - data frame column
> ChickWeight$logweight=log(ChickWeight$weight)
> head(ChickWeight)
  weight Time Chick Diet logweight
1     42    0     1    1  3.737670
2     51    2     1    1  3.931826
3     59    4     1    1  4.077537
4     64    6     1    1  4.158883
5     76    8     1    1  4.330733
6     93   10     1    1  4.532599
> plot(head(ChickWeight$Time),head(ChickWeight$logweight))
> plot(head(ChickWeight$Time),head(ChickWeight$weight))
As you can see, the pattern for accessing an individual column is dataframe$column. The head() function returns a specified number of rows from the beginning of a data frame, with a default of 6. The two plot() calls graph log weight vs. time and weight vs. time for those first six rows, to illustrate the difference a log transformation makes.
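Once the log column exists, it can be used directly in a regression. The following is a minimal sketch, assuming a simple linear model of log weight on Time is adequate for illustration; the names chicks, fit, slope, and daily_growth_factor are introduced here and are not part of the dataset.

```r
# ChickWeight ships with R in the datasets package.
# Work on a plain data frame copy so the built-in dataset is left untouched.
chicks <- as.data.frame(ChickWeight)
chicks$logweight <- log(chicks$weight)

# Linear model on the log scale: log(weight) = a + b * Time
fit <- lm(logweight ~ Time, data = chicks)
slope <- unname(coef(fit)["Time"])

# On the log scale the slope is an additive change per unit of Time (days).
# Exponentiating it gives the estimated multiplicative change in weight
# per day under this model.
daily_growth_factor <- exp(slope)
daily_growth_factor
```

Reading the slope as a growth factor is the practical payoff of the transformation: a constant percentage increase per day appears as a straight line on the log scale.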
Final Thoughts
While log functions have numerous uses, in data science they are especially handy for compressing wide-ranging values onto a scale where patterns are easier to see. They reduce the skew in data so that more detail is visible. In R, they can be applied to everything from single numbers to vectors and data frame columns. The usefulness of the log function in R is another reason why R is an excellent tool for data science.