Descriptive Statistics in R

As part of the Learn Data analysis using R series, let’s look at different ways to compute descriptive statistics in R. These include measures of central tendency, variability, and distribution shape for continuous variables.

For this tutorial, we shall use the built-in dataset mtcars. This dataset consists of 32 observations and 11 variables. We shall use only three of these variables for calculating descriptive statistics.

Refine the data

> data("mtcars")
> data<-c("mpg","hp","wt")
> head(mtcars[data])
                   mpg  hp    wt
Mazda RX4         21.0 110 2.620
Mazda RX4 Wag     21.0 110 2.875
Datsun 710        22.8  93 2.320
Hornet 4 Drive    21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant           18.1 105 3.460

Base R provides the summary() function, which can be used to obtain descriptive statistics.

Example for descriptive statistics

> summary(mtcars[data])
      mpg              hp              wt       
 Min.   :10.40   Min.   : 52.0   Min.   :1.513  
 1st Qu.:15.43   1st Qu.: 96.5   1st Qu.:2.581  
 Median :19.20   Median :123.0   Median :3.325  
 Mean   :20.09   Mean   :146.7   Mean   :3.217  
 3rd Qu.:22.80   3rd Qu.:180.0   3rd Qu.:3.610  
 Max.   :33.90   Max.   :335.0   Max.   :5.424

The summary() function provides the minimum, maximum, quartiles, and mean for numerical variables, and frequencies for factors and logical vectors. The above results don’t include the standard deviation, variance, skewness, or kurtosis. What if you need these statistics? For this you may use the stat.desc() function in the pastecs package. With its default arguments it reports the variance and standard deviation, as shown below; setting norm = TRUE adds skewness and kurtosis.

> install.packages("pastecs")
> library(pastecs)
> stat.desc(mtcars[data])
                     mpg           hp          wt
nbr.val       32.0000000   32.0000000  32.0000000
nbr.null       0.0000000    0.0000000   0.0000000
nbr.na         0.0000000    0.0000000   0.0000000
min           10.4000000   52.0000000   1.5130000
max           33.9000000  335.0000000   5.4240000
range         23.5000000  283.0000000   3.9110000
sum          642.9000000 4694.0000000 102.9520000
median        19.2000000  123.0000000   3.3250000
mean          20.0906250  146.6875000   3.2172500
SE.mean        1.0654240   12.1203173   0.1729685
CI.mean.0.95   2.1729465   24.7195501   0.3527715
var           36.3241028 4700.8669355   0.9573790
std.dev        6.0269481   68.5628685   0.9784574
coef.var       0.2999881    0.4674077   0.3041285
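
The output above does not include skewness or kurtosis. As a minimal sketch, assuming the plain moment-based definitions (packages such as psych may apply slightly different bias corrections, so their numbers can differ a little), these can be computed in base R. The helper names skewness_moment and kurtosis_excess are made up for this illustration and do not come from any package.

```r
# Base R only; no extra packages needed.
x <- mtcars$mpg

# Sample standard deviation and variance (n - 1 denominator)
sd_x  <- sd(x)
var_x <- var(x)

# Hypothetical helpers using the simple moment-based definitions
skewness_moment <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}
kurtosis_excess <- function(v) {
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2 - 3   # excess kurtosis: 0 for a normal distribution
}

round(c(sd = sd_x, var = var_x,
        skewness = skewness_moment(x),
        kurtosis = kurtosis_excess(x)), 4)
```

The sd and var values should match the std.dev and var rows printed by stat.desc() above.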

Many other packages are available; for example, you may try the describe() function in the psych package. Let me know which function or package you prefer for calculating descriptive statistics.

Importing data from SPSS in R

SPSS is one of the most popular statistical packages used for data analysis, but it is commercial software and a license is expensive. SPSS datasets can be imported into R using the read.spss() function in the foreign package.

Alternatively, you can use the spss.get() function in the Hmisc package.

spss.get() is a wrapper function that automatically sets many parameters of read.spss() for you, making the transfer easier and more consistent with what data analysts expect as a result.

First, download and install the Hmisc package. The foreign package is already installed by default with R.

install.packages("Hmisc")

Then use the following code to import the data:

library(Hmisc)
dataframe <- spss.get("sample.sav", use.value.labels=TRUE)

In this code, sample.sav is the SPSS data file to be imported, use.value.labels=TRUE tells the function to convert variables with value labels into R factors with those same levels, and dataframe is the resulting R data frame.
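
For comparison, here is a rough sketch of the same import done directly with read.spss() from the foreign package. The file name sample.sav is only a placeholder, so the code checks that the file exists before trying to read it; the object name imported is also just for illustration.

```r
# Sketch only: "sample.sav" is a placeholder file name.
spss_file <- "sample.sav"

if (file.exists(spss_file)) {
  # read.spss() returns a list by default;
  # to.data.frame = TRUE asks for a data frame instead.
  imported <- foreign::read.spss(spss_file,
                                 use.value.labels = TRUE,
                                 to.data.frame = TRUE)
  str(imported)
} else {
  imported <- NULL
  message("File not found: ", spss_file)
}
```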

 

Useful functions for working with data objects in R

We have seen different data structures in R. Now let’s see some useful functions for working with these data objects.

Function              Purpose
length(x)             Number of elements/components.
dim(x)                Dimensions of x.
str(x)                Structure of x.
class(x)              Class or type of x.
mode(x)               How x is stored.
names(x)              Names of components in x.
c(x1, x2, ...)        Combines x1, x2 into a vector.
cbind(x1, x2, ...)    Combines x1, x2 as columns.
rbind(x1, x2, ...)    Combines x1, x2 as rows.
x                     Prints x.
head(x)               Lists the first part of x.
tail(x)               Lists the last part of x.
ls()                  Lists current objects.
rm(x1, x2, ...)       Deletes one or more objects. The statement
                      rm(list = ls()) will remove most objects
                      from the working environment.
y <- edit(x)          Edits x and saves as y.
fix(x)                Edits in place.
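
To see a few of these functions in action, here is a small sketch on a toy data frame. The object x and its values are invented for this example only.

```r
# Toy data frame, invented for illustration
x <- data.frame(id = 1:3, score = c(87, 91, 66))

length(x)   # number of components (columns, for a data frame)
dim(x)      # rows and columns
class(x)    # "data.frame"
mode(x)     # "list": a data frame is stored as a list of columns
names(x)    # column names
str(x)      # compact overview of the structure

y <- cbind(x, passed = x$score > 70)             # add a column
z <- rbind(x, data.frame(id = 4L, score = 57))   # add a row

head(z, 2)  # first two rows
tail(z, 2)  # last two rows

rm(y)       # delete the object y from the workspace
```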

If I have missed any functions, let me know in the comment section below.

Data structures in R – Part 2

R has a wide range of options for holding data, such as scalars, vectors, matrices, arrays, data frames, and lists. In Data structures in R – Part 1 we saw scalars, vectors, matrices, and arrays. Now let’s look at data frames and lists.

Data frames

A data frame is a table, or a two-dimensional array-like structure, in which each column contains values of one variable and each row contains one set of values from each column. Here, different columns can contain different modes of data (numeric, character, etc.). It’s similar to the datasets that we see in IBM SPSS, SAS and Stata. Data frames are the most common data structure used within R.

Characteristics of a data frame

  1. Column names should not be empty.
  2. Row names should be unique.
  3. Data stored in a data frame can be of numeric, factor or character type.
  4. Each column should contain the same number of data items.

Let’s see an example:

> marklist<-data.frame(
+ rollno = c(1001:1006),
+ name = c("Abdul","Balu","Charlie","Daniel","Elisa","Fathima"),
+ marks = c(87,91,66,57,83,72)
+ )
> marklist
  rollno    name marks
1   1001   Abdul    87
2   1002    Balu    91
3   1003 Charlie    66
4   1004  Daniel    57
5   1005   Elisa    83
6   1006 Fathima    72

In the above example, you can observe that each column holds only one data type, but different columns in the data frame can have different data types.
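
One way to check the type of each column is sketched below. It rebuilds the marklist data frame so that it runs by itself; stringsAsFactors = FALSE is set explicitly here as an assumption, so the name column stays character on older versions of R as well.

```r
marklist <- data.frame(
  rollno = 1001:1006,
  name   = c("Abdul", "Balu", "Charlie", "Daniel", "Elisa", "Fathima"),
  marks  = c(87, 91, 66, 57, 83, 72),
  stringsAsFactors = FALSE
)

# Class of every column: integer, character, numeric
col_types <- sapply(marklist, class)
col_types

# str() gives the same information together with the first few values
str(marklist)
```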

We can subscript a data frame the same way we subscript matrices. Let’s see this with the marklist dataset used above.

> marklist[1,3]
[1] 87
> marklist[1:3]
  rollno    name marks
1   1001   Abdul    87
2   1002    Balu    91
3   1003 Charlie    66
4   1004  Daniel    57
5   1005   Elisa    83
6   1006 Fathima    72
> marklist[c(1,3)]
  rollno marks
1   1001    87
2   1002    91
3   1003    66
4   1004    57
5   1005    83
6   1006    72
> marklist[c("rollno","marks")]
  rollno marks
1   1001    87
2   1002    91
3   1003    66
4   1004    57
5   1005    83
6   1006    72
> marklist$name
[1] "Abdul"   "Balu"    "Charlie" "Daniel" 
[5] "Elisa"   "Fathima"
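
Subscripts can also pick rows by a condition. The following is a small sketch of that idea on the same data; the object names high, daniel and ranked are only for illustration.

```r
marklist <- data.frame(
  rollno = 1001:1006,
  name   = c("Abdul", "Balu", "Charlie", "Daniel", "Elisa", "Fathima"),
  marks  = c(87, 91, 66, 57, 83, 72),
  stringsAsFactors = FALSE
)

# Rows where marks are above 80 (note the comma: rows first, then columns)
high <- marklist[marklist$marks > 80, ]
high

# A single value: the marks of one student
daniel <- marklist[marklist$name == "Daniel", "marks"]
daniel

# All rows, sorted from highest to lowest marks
ranked <- marklist[order(-marklist$marks), ]
ranked
```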

Factors

Factors are used to categorize data and store it as levels. They can be built from both strings and integers. This is useful for columns that have a limited number of unique values, for example Male, Female, Neutral or True, False. They are useful in data analysis for statistical modelling. Factors are created using the factor() function, which takes a vector as input.

We will see more about factors in practice when we discuss statistical methods.
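
Until then, here is a short sketch of what factor() does with a character vector. The values are made up for illustration.

```r
gender <- c("Male", "Female", "Male", "Neutral", "Female", "Male")

# Convert the character vector to a factor
gender_f <- factor(gender)

levels(gender_f)      # the unique categories, sorted alphabetically by default
as.integer(gender_f)  # the integer codes stored behind the labels
table(gender_f)       # frequency of each level

# The order of the levels can be set explicitly
gender_o <- factor(gender, levels = c("Male", "Female", "Neutral"))
levels(gender_o)
```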

Data structures in R – Part 1

R has a wide range of options for holding data, such as scalars, vectors, matrices, arrays, data frames, and lists. Let’s look at each structure in this post.

Scalars

Scalars are one-element vectors. These are used to hold constants.

Example

a <- 1
b <- "Phone"
c <- TRUE

Vectors

Vectors are one-dimensional arrays that hold numbers, characters, or logical data. The combine function c() is used to form a vector. A vector can hold only one data type; you can’t mix numbers with characters in the same vector. Let’s look at some examples.

Numeric vector

a <- c(2,10,-5,15)

Character vector

b <- c("Male", "Female", "Neutral")

Logical vector

c <- c(TRUE, FALSE, FALSE, TRUE)
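
What happens if you try to mix types anyway? R does not raise an error; it quietly converts everything to one common type. A small sketch (the object names are only for illustration):

```r
# Numbers, characters and logicals together: everything becomes character
mixed <- c(1, "a", TRUE)
class(mixed)
mixed

# Numbers and logicals together: TRUE/FALSE become 1/0
num_log <- c(1, TRUE, FALSE)
class(num_log)
num_log
```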

To refer to elements of a vector, you can use square brackets. For example,

> a<-c(2,4,6,8,10,12,14,16,18,20)
> a[6]
[1] 12
> a[3:6]
[1]  6  8 10 12
> a[c(1,7)]
[1]  2 14
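
Square brackets accept more than positions. As a brief sketch, negative numbers drop elements and logical conditions keep only the elements that match (the object names are only for illustration):

```r
a <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)

without_first <- a[-1]         # everything except the first element
above_15      <- a[a > 15]     # elements greater than 15
last_one      <- a[length(a)]  # the last element
pos_of_12     <- which(a == 12)  # position of the value 12

above_15
pos_of_12
```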

Matrices

A matrix is a two-dimensional array where each element has the same data type. Matrices are created with the matrix function. The syntax for the matrix function is

a <- matrix(vector, nrow=number_of_rows, ncol=number_of_columns,
byrow=logical_value, dimnames=list( char_vector_rownames, char_vector_colnames))

where vector contains the elements for the matrix, nrow and ncol specify the row and column dimensions, and dimnames contains optional row and column labels stored in character vectors. The option byrow indicates whether the matrix should be filled in by row (byrow=TRUE) or by column (byrow=FALSE). The default is by column.

Let’s see some examples of matrices now.

Creating a 5×2 matrix

> a<-matrix(1:10, nrow=5,ncol=2)
> a
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

Let’s create a 2x2 matrix with row and column labels

> cells <- c(2,8,12,16)
> r <- c("A1","A2")
> c <- c("X1","X2")
> b<-matrix(cells,nrow = 2, ncol = 2, byrow = TRUE,dimnames = list(r,c))
> b
   X1 X2
A1  2  8
A2 12 16

In the above example, the matrix was filled with byrow = TRUE. Try the same call with byrow = FALSE and see the difference.
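
If you want to check your answer, here is a sketch that builds both versions side by side (labels are left out to keep it short; by_row and by_col are just illustrative names).

```r
cells <- c(2, 8, 12, 16)

by_row <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE)
by_col <- matrix(cells, nrow = 2, ncol = 2, byrow = FALSE)

by_row
#      [,1] [,2]
# [1,]    2    8
# [2,]   12   16

by_col
#      [,1] [,2]
# [1,]    2   12
# [2,]    8   16

# Filling by column gives the transpose of filling by row
identical(by_col, t(by_row))
```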

Subscripts in matrix

You can also subscript a matrix using square brackets.

> x<-matrix(11:20, nrow=2)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]   11   13   15   17   19
[2,]   12   14   16   18   20
> m <-x[,3]
> m
[1] 15 16
> n <-x[1,4]
> n
[1] 17
> o <-x[2,c(3,4,5)]
> o
[1] 16 18 20

First, we created a 2x5 matrix; then we subscripted it with square brackets, giving the row number first and the column number second. Leaving one of them empty selects the whole row or column.
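
One detail worth knowing: selecting a single row or column returns a plain vector by default. A short sketch of how to keep the matrix shape with drop = FALSE, and of selecting by condition (object names are only for illustration):

```r
x <- matrix(11:20, nrow = 2)

col3_vec <- x[, 3]                # a plain vector of length 2
col3_mat <- x[, 3, drop = FALSE]  # still a matrix with 2 rows and 1 column

dim(col3_vec)   # NULL, because a vector has no dimensions
dim(col3_mat)

# A logical condition returns the matching elements as a vector
big <- x[x > 17]
big
```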

Arrays

Arrays are similar to matrices; the difference is that an array can have more than two dimensions. If we create an array of dimension (2, 3, 4) then it creates 4 rectangular matrices each with 2 rows and 3 columns. Arrays can store only one data type. An array can be created with the array function. The syntax for the function is

array<-array(vector, dimensions, dimnames)

Here vector contains the data for the array, dimensions is a numeric vector giving the maximal index for each dimension, and dimnames is an optional list of dimension labels. Arrays are useful in programming new statistical methods.

Let’s see this with the following example:

> column <- c("COL1","COL2","COL3")
> row <- c("ROW1","ROW2","ROW3")
> matrix <- c("Matrix1","Matrix2")
> a <- array(1:18,c(3,3,2),dimnames = list(column,row,matrix))
> a
, , Matrix1

     ROW1 ROW2 ROW3
COL1    1    4    7
COL2    2    5    8
COL3    3    6    9

, , Matrix2

     ROW1 ROW2 ROW3
COL1   10   13   16
COL2   11   14   17
COL3   12   15   18
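
Elements of an array are picked with one subscript per dimension. Here is a sketch on the same 3 x 3 x 2 array, rebuilt with the labels written inline so that the code runs by itself:

```r
a <- array(1:18, dim = c(3, 3, 2),
           dimnames = list(c("COL1", "COL2", "COL3"),
                           c("ROW1", "ROW2", "ROW3"),
                           c("Matrix1", "Matrix2")))

dim(a)                          # 3 3 2

a[2, 3, 1]                      # by position: 8
a["COL1", "ROW2", "Matrix2"]    # by label: 13

second <- a[, , 2]              # the whole second matrix
second
```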

Keep reading about data structures. Data structures in R – Part 2

Getting help in R

This post is part of the Learn Data analysis using R tutorials. It explains how to find and use help inside R.

R has extensive built-in help; the best starting point is to run the function help.start(). This will open a local page in your browser with links to the R manuals, the R FAQ, a search engine, and other resources.

Help function

Now let’s see how to get help on a particular function. In the R Console, the function help can be used to see the help file of a specific function.

Example: Getting help for the mean function in R

Use the following command to get help on the mean function.

help(mean)

You will get the following output, explaining the arguments available in the function and giving examples of how to use it.

Arithmetic Mean

Description:

     Generic function for the (trimmed) arithmetic mean.

Usage:

     mean(x, ...)
     
     ## Default S3 method:
     mean(x, trim = 0, na.rm = FALSE, ...)
     
Arguments:

       x: An R object.  Currently there are methods for numeric/logical
          vectors and date, date-time and time interval objects.
          Complex vectors are allowed for ‘trim = 0’, only.

    trim: the fraction (0 to 0.5) of observations to be trimmed from
          each end of ‘x’ before the mean is computed.  Values of trim
          outside that range are taken as the nearest endpoint.

   na.rm: a logical value indicating whether ‘NA’ values should be
          stripped before the computation proceeds.

     ...: further arguments passed to or from other methods.

Value:

     If ‘trim’ is zero (the default), the arithmetic mean of the values
     in ‘x’ is computed, as a numeric or complex vector of length one.
     If ‘x’ is not logical (coerced to numeric), numeric (including
     integer) or complex, ‘NA_real_’ is returned, with a warning.

     If ‘trim’ is non-zero, a symmetrically trimmed mean is computed
     with a fraction of ‘trim’ observations deleted from each end
     before the mean is computed.

References:

     Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S
     Language_. Wadsworth & Brooks/Cole.

See Also:

     ‘weighted.mean’, ‘mean.POSIXct’, ‘colMeans’ for row and column
     means.

Examples:

     x <- c(0:10, 50)
     xm <- mean(x)
     c(xm, mean(x, trim = 0.10))
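
To make the trim argument from the help page concrete, here is a small sketch that runs the example and then reproduces the trimmed mean by hand. The names n, k and manual are only for illustration.

```r
x <- c(0:10, 50)

mean(x)               # ordinary mean: 8.75
mean(x, trim = 0.10)  # trimmed mean: 5.5

# By hand: with 12 values and trim = 0.10, floor(12 * 0.10) = 1 value
# is dropped from each end of the sorted data before averaging.
n <- length(x)
k <- floor(n * 0.10)
manual <- mean(sort(x)[(k + 1):(n - k)])
manual
```

Dropping the smallest value (0) and the largest (50) leaves 1 to 10, whose mean is 5.5. This is why the trimmed mean is much less affected by the outlier 50.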

help.search Function

Use the function help.search to list help files that contain a certain word or phrase. Use the following command to search for the phrase “linear regression”.

help.search("linear regression")

You will get the following output:

Help files with alias or concept or title matching ‘linear regression’
using fuzzy matching:


datasets::anscombe        Anscombe's Quartet of 'Identical' Simple Linear
                          Regressions
KernSmooth::dpill         Select a Bandwidth for Local Linear Regression
MASS::area                Adaptive Numerical Integration
                          Concepts: Non-linear Regression
MASS::rms.curv            Relative Curvature Measures for Non-Linear
                          Regression
                          Concepts: Non-linear Regression
stats::D                  Symbolic and Algorithmic Derivatives of Simple
                          Expressions
                          Concepts: Non-linear Regression
stats::getInitial         Get Initial Parameter Estimates
                          Concepts: Non-linear Regression
stats::nlm                Non-Linear Minimization
                          Concepts: Non-linear Regression
stats::nls                Nonlinear Least Squares
                          Concepts: Non-linear Regression
stats::nls.control        Control the Iterations in nls
                          Concepts: Non-linear Regression
stats::optim              General-purpose Optimization
                          Concepts: Non-linear Regression
stats::plot.profile.nls   Plot a profile.nls Object
                          Concepts: Non-linear Regression
stats::predict.nls        Predicting from Nonlinear Least Squares Fits
                          Concepts: Non-linear Regression
stats::profile.nls        Method for Profiling nls Objects
                          Concepts: Non-linear Regression
stats::vcov               Calculate Variance-Covariance Matrix for a
                          Fitted Model Object
                          Concepts: Non-linear Regression

Type ’help(FOO, package = PKG)’ to inspect entry ’FOO(PKG) TITLE’.
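
A few related helpers are worth a quick sketch as well. apropos() searches object names rather than help pages, and args() shows the arguments of a function without opening its help file. The shortcuts ?mean and ??"linear regression" are shown only as comments, since they open the help viewer.

```r
# Object names that start with "mean"
mean_like <- apropos("^mean")
mean_like

# Argument list of a function, without opening the help page
args(mean)

# Shortcuts you can type at the console:
# ?mean                   same as help(mean)
# ??"linear regression"   same as help.search("linear regression")
```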

Each R package comes with a manual, which can be accessed from within R or read on CRAN.

 

Learn Data analysis in R

R is a popular open-source program used in data analysis. Mukesh Ambani, the Indian billionaire, has pointed out that data is the new oil. In such an era, programs like R are becoming very popular among data analysts. Let’s dive in, learn the different concepts, and learn how to use R. For a user-friendly experience and ease of use for beginners, the tutorials are demonstrated using RStudio, an IDE for R.

Installing packages in R

One of the main reasons that R has become so popular is the vast collection of packages available in the CRAN repository. The base R system comes with basic functionality. The packages are developed and published by the larger R community. In the past few years, the number of available packages has grown from about one thousand to more than ten thousand.

Command-line method to install packages in R

Installing packages from the command line is a faster and easier process once you get used to it. Let’s see how to install a package named “quantmod”.

install.packages("quantmod")

Now we have learned how to install a single package in R. Next, let’s see how to install multiple packages. Use the following command to install two packages, namely quantmod and MASS.

install.packages(c("quantmod", "MASS"))
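
If you run the same script often, you may not want to reinstall packages every time. One possible approach, sketched below, is to check which packages are missing first. The helper missing_packages is a hypothetical name written for this example, not a function from R or any package.

```r
# Hypothetical helper: returns the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Demo with packages that ship with every R installation: nothing is missing
missing_packages(c("stats", "utils"))

# Typical use (commented out so this sketch never downloads anything by itself):
# to_install <- missing_packages(c("quantmod", "MASS"))
# if (length(to_install) > 0) install.packages(to_install)
```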

Graphical method to install packages in R

If you want to install packages without entering commands, follow these steps in RStudio:

Step 1: Go to Tools -> Install Packages..., and you will get a pop-up window.

Step 2: Type the package you want to install in the pop-up window.

Installing R packages from external repositories

The above methods work for packages that are available on CRAN, the official repository of R. In certain cases you might want to install packages from external repositories. For this, first install the devtools package using the command install.packages("devtools") and load it with library(devtools). Once that is done, you can use the following functions.

Installing R package from Bioconductor

install_bioc()

Installing R package from a git repository

install_git()

Installing R package from Bitbucket

install_bitbucket()

Installing R package from a URL

install_url()

Loading installed packages

Once you have installed a package, you have to load it in R to make use of it. For this, use the simple command

library(quantmod)
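
Loading with library() is not the only way to reach a function in an installed package. As a small sketch, the package::function form calls a single function without loading the whole package. The tools package is used here only because it ships with R and is not loaded by default.

```r
# Call one function directly, without loading the package first
title <- tools::toTitleCase("descriptive statistics")
title

# library() attaches the package, so its functions can be called by name
library(tools)
is_attached <- "package:tools" %in% search()
is_attached

toTitleCase("installing packages in r")
```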

Introduction to R programming

R is an open-source programming language designed for statistical analysis. It is an implementation of the S language, which was developed at Bell Laboratories in the late 1970s. R itself was created by researchers at the University of Auckland, New Zealand, and is freely available under the GNU General Public License.

R is very popular in the academic community; I learned and used R myself for my PhD in finance. R has wonderful graphing functionality. The demand for R has grown to a new level in recent years. Beyond academia, large companies have started to use R for big data analysis.

Another major advantage of R is that it is extremely customizable. There are thousands of extensions for R, about 16081 at the time I am writing this post in August 2019. Extension packages cover everything from time series analysis to genomic science to text mining. You can find these extensions on CRAN, a free repository maintained by the R community.

R also boasts impressive graphics, free and polished integrated development environments (IDEs), programmatic access to and from many general‐purpose languages, interfaces with popular proprietary analytics solutions including MATLAB and SAS, and even commercial support from Revolution Analytics.