In statistics, data are often skewed: most values are concentrated at one end and the rest trail off toward the other end, producing a peak with a long tail. One way of dealing with this type of data is to put it on a logarithmic scale, which often gives the data a more normal pattern. A logarithmic transformation can be applied to the dependent variable, the independent variable, or both, to counter skew that would otherwise distort a linear regression or any other analysis that relies on a linear relationship in your original data. Because many statistical tests assume normality, log-transformed data are often easier to fit with a linear model and to test.
Log in R
The basic way of taking a logarithm in R is the log() function, called in the format log(value, base), which returns the logarithm of the value in the given base. If the base is omitted, the function returns the natural logarithm of the value. Applied to skewed, positive data, the logarithm compresses large values far more than small ones, which often reduces skewness and leaves the data better suited to regression analysis and scatter plots. It is one of several common data transformations, alongside the square root, reciprocal, arcsine, and logit transformations; which one works best depends on the data. There are shortcut functions for base 2 and base 10: log2() and log10().
# log in r - core syntax
> log(9,3)
[1] 2
This is the basic logarithm function with 9 as the value and 3 as the base. The result is 2 because 9 is the square of 3.
# log in r example
> log(5)
[1] 1.609438
Here, the second parameter has been omitted, resulting in a base of e and producing the natural logarithm of 5.
# log in R - base 10 log
> log(100,10)
[1] 2
> log10(100)
[1] 2
Here, we have a comparison of the base 10 logarithm of 100 obtained by the basic logarithm function and by its shortcut. For both cases, the answer is 2 because 100 is 10 squared.
# log in r - base notation
> log(8,2)
[1] 3
> log2(8)
[1] 3
Here, we have a comparison of the base 2 logarithm of 8 obtained by the basic logarithm function and by its shortcut. For both cases, the answer is 3 because 8 is 2 cubed.
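The relationship between the general form and the shortcuts can be checked directly. The following is a minimal sketch, not part of the original examples: the vector x and the base b are arbitrary values picked for illustration, and all.equal() is used instead of == because floating-point results can differ in the last digits.

```r
# Values and a base chosen purely for illustration.
x <- c(1, 8, 9, 100)
b <- 3

# log(x, base) matches the change-of-base rule: log(x) / log(base).
by_argument <- log(x, base = b)
by_formula  <- log(x) / log(b)
print(all.equal(by_argument, by_formula))

# The shortcut functions agree with the general form.
print(all.equal(log2(x), log(x, base = 2)))
print(all.equal(log10(x), log(x, base = 10)))

# exp() undoes the natural logarithm.
print(all.equal(exp(log(x)), x))
```

Note that log() is vectorized, so it is applied to every element of x in one call.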
A log transformation is the process of applying a logarithm to data to reduce its skew. This is usually done when the numbers are highly skewed, so that the data can be understood more easily. Log transformation in R is accomplished by applying the log() function to a vector, a data frame, or another data set. When the data may contain zeros, 1 is commonly added to each value before the logarithm is applied, because log(0) is -Inf in R. The resulting presentation of the data is less skewed than the original, making it easier to understand.
Log-transforming a vector in R is a simple matter of adding 1 to the vector and then applying the log() function. The result is a new vector that is less skewed than the original.
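To put a number on "less skewed", one option is to compare sample skewness before and after the transformation. The sketch below rests on assumptions that are not from the original article: the data are simulated with rlnorm(), the seed is arbitrary, and skewness_sketch() is a hypothetical helper (base R has no built-in skewness function) that computes the third standardized moment. Because the simulated values are strictly positive, no 1 is added here.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Simulated right-skewed data (log-normal), used only for illustration.
raw <- rlnorm(1000, meanlog = 0, sdlog = 1)

# Hypothetical helper: sample skewness as the third standardized moment.
skewness_sketch <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness_sketch(raw)
skew_log <- skewness_sketch(log(raw))

cat("skewness before log:", skew_raw, "\n")
cat("skewness after log: ", skew_log, "\n")
```

A skewness near 0 indicates a roughly symmetric distribution, so the second value should be much closer to 0 than the first.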
# log in R - vector transformation
> v = c(100,10,5,2,1,0.5,0.1,0.05,0.01,0.001,0.0001)
> q=log(v+1)
> q
[1] 4.6151205168 2.3978952728 1.7917594692 1.0986122887 0.6931471806 0.4054651081
[7] 0.0953101798 0.0487901642 0.0099503309 0.0009995003 0.0000999950
> plot(v)
> plot(q)
A close look at the numbers above shows that v is more skewed than q. This is more evident in the graphs produced by the two plot() calls included in the code.
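R also provides log1p(), which computes log(1 + x) directly, and expm1(), which reverses it. The following is a minimal sketch, assuming the same illustrative vector with a zero appended to show why the shift matters; the variable names are made up for this example.

```r
# Same illustrative vector as above, with a zero appended.
v <- c(100, 10, 5, 2, 1, 0.5, 0.1, 0.05, 0.01, 0.001, 0.0001, 0)

# Without the shift, a zero turns into negative infinity.
print(log(0))

# log(v + 1) and log1p(v) give the same result.
shifted <- log(v + 1)
builtin <- log1p(v)
print(all.equal(shifted, builtin))

# expm1() reverses log1p(), recovering the original values.
restored <- expm1(builtin)
print(all.equal(restored, v))
```

For values very close to 0, log1p() is also numerically more accurate than computing log(v + 1) by hand.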
Log-transforming data in a data frame is a little trickier because you usually have to pick out the part of the data you want first. Taking the log of the entire data frame gets you the log of each data point, which only works when every column is numeric. However, you usually need the log of only one column of data.
# log in R example - data frame column
> ChickWeight$logweight=log(ChickWeight$weight)
> head(ChickWeight)
  weight Time Chick Diet logweight
1     42    0     1    1  3.737670
2     51    2     1    1  3.931826
3     59    4     1    1  4.077537
4     64    6     1    1  4.158883
5     76    8     1    1  4.330733
6     93   10     1    1  4.532599
> plot(head(ChickWeight$Time),head(ChickWeight$logweight))
> plot(head(ChickWeight$Time),head(ChickWeight$weight))
As you can see, the pattern for accessing an individual column is dataframe$column. The head() function returns a specified number of rows from the beginning of a data frame, 6 by default. The two plot() calls graph log weight vs. time and weight vs. time to illustrate the difference a log transformation makes.
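When more than one column needs transforming, the numeric columns can be selected first so that factor columns are left alone. This is a sketch under stated assumptions rather than the article's method: it works on a copy named cw (a made-up name), and it uses log1p() instead of log() because the Time column contains zeros.

```r
# Plain data.frame copy of the built-in ChickWeight dataset.
cw <- as.data.frame(ChickWeight)

# Identify numeric columns; Chick and Diet are factors and are skipped.
is_num <- sapply(cw, is.numeric)

# Time contains zeros, so log1p() (log of 1 + x) is used instead of log().
logged <- as.data.frame(lapply(cw[is_num], log1p))

print(names(logged))
print(head(logged, 3))
```

The result keeps the original column names, so the transformed columns can be compared with or joined back to the source data.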
While log functions themselves have numerous uses, in data science they are most often used to present skewed data in a more understandable pattern. They are handy for reducing the skew in data so that more detail can be seen. In R, they can be applied to all sorts of data, from single numbers to vectors and even data frames. The usefulness of the log function in R is another reason why R is an excellent tool for data science.