Descriptive Statistics in R

As part of the Learn Data Analysis using R series, let’s look at different methods for computing descriptive statistics in R. These include measures of central tendency, variability, and distribution shape for continuous variables.

For this tutorial, we shall use the built-in dataset `mtcars`. This dataset consists of 32 observations and 11 variables. We shall use only three of the variables for calculating descriptive statistics.

Refine the data

```
> data("mtcars")
> data <- c("mpg","hp","wt")
> head(mtcars[data])
                   mpg  hp    wt
Mazda RX4         21.0 110 2.620
Mazda RX4 Wag     21.0 110 2.875
Datsun 710        22.8  93 2.320
Hornet 4 Drive    21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant           18.1 105 3.460
```

The base R installation has the `summary()` function, which can be used to obtain descriptive statistics.

Example for descriptive statistics

```> summary(mtcars[data])
     mpg              hp              wt
Min.   :10.40   Min.   : 52.0   Min.   :1.513
1st Qu.:15.43   1st Qu.: 96.5   1st Qu.:2.581
Median :19.20   Median :123.0   Median :3.325
Mean   :20.09   Mean   :146.7   Mean   :3.217
3rd Qu.:22.80   3rd Qu.:180.0   3rd Qu.:3.610
Max.   :33.90   Max.   :335.0   Max.   :5.424```

The `summary()` function provides the minimum, maximum, quartiles, and mean for numerical variables, and frequencies for factors and logical vectors. The above results don’t include the standard deviation, variance, skewness, or kurtosis. What if you need these statistics? For this you may use the `stat.desc()` function in the `pastecs` package. By default it reports the variance and standard deviation; calling it with `norm = TRUE` also adds skewness and kurtosis.

```> install.packages("pastecs")
> library(pastecs)
> stat.desc(mtcars[data])
                     mpg           hp          wt
nbr.val       32.0000000   32.0000000  32.0000000
nbr.null       0.0000000    0.0000000   0.0000000
nbr.na         0.0000000    0.0000000   0.0000000
min           10.4000000   52.0000000   1.5130000
max           33.9000000  335.0000000   5.4240000
range         23.5000000  283.0000000   3.9110000
sum          642.9000000 4694.0000000 102.9520000
median        19.2000000  123.0000000   3.3250000
mean          20.0906250  146.6875000   3.2172500
SE.mean        1.0654240   12.1203173   0.1729685
CI.mean.0.95   2.1729465   24.7195501   0.3527715
var           36.3241028 4700.8669355   0.9573790
std.dev        6.0269481   68.5628685   0.9784574
coef.var       0.2999881    0.4674077   0.3041285```
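The default output above stops at variance and standard deviation. As a rough sketch, skewness and kurtosis can also be computed in base R from the central moments. The helper names `skewness_moment` and `kurtosis_moment` are made up for this illustration, and other packages may use slightly different (bias-corrected) formulas, so the numbers need not match theirs exactly.

```r
# Moment-based skewness and excess kurtosis in base R (illustrative helpers,
# not functions from any package).
data("mtcars")

skewness_moment <- function(x) {
  x  <- x[!is.na(x)]
  m  <- mean(x)
  m2 <- mean((x - m)^2)   # second central moment (population variance)
  m3 <- mean((x - m)^3)   # third central moment
  m3 / m2^(3 / 2)
}

kurtosis_moment <- function(x) {
  x  <- x[!is.na(x)]
  m  <- mean(x)
  m2 <- mean((x - m)^2)
  m4 <- mean((x - m)^4)   # fourth central moment
  m4 / m2^2 - 3           # subtract 3 so a normal distribution gives 0
}

vars <- c("mpg", "hp", "wt")

# One column per variable, one row per statistic
shape_stats <- sapply(mtcars[vars], function(x) c(
  sd       = sd(x),
  variance = var(x),
  skewness = skewness_moment(x),
  kurtosis = kurtosis_moment(x)
))
shape_stats
```

The `sd` and `variance` rows should agree with `std.dev` and `var` in the `stat.desc()` output, which is a quick way to check that the same columns were used.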

There are many other packages available; for example, you may try the `describe()` function in the `psych` package. Let me know which function or package you prefer for calculating descriptive statistics.

Importing data from SPSS in R

SPSS is one of the most popular statistical packages used for data analysis, but it is commercial software and you have to pay a large amount to buy it. SPSS datasets can be imported into R using the `read.spss()` function in the `foreign` package.

Alternatively, you can use the `spss.get()` function in the `Hmisc` package.

`spss.get()` is a wrapper function that automatically sets many parameters of `read.spss()` for you, making the transfer easier and more consistent with what data analysts expect as a result.

First, download and install the `Hmisc` package. The `foreign` package is already installed by default in R.

`install.packages("Hmisc")`

Then use the following code to import the data:

```library(Hmisc)
dataframe <- spss.get("sample.sav", use.value.labels=TRUE)```

In this code, `sample.sav` is the SPSS data file to be imported, `use.value.labels=TRUE` tells the function to convert variables with value labels into R factors with those same levels, and `dataframe` is the resulting R data frame.
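For comparison, here is a minimal sketch of the same import done directly with `read.spss()` from the `foreign` package. The file name `sample.sav` is the placeholder used above, and the object names `spss_path` and `survey` are only illustrative; the `file.exists()` guard is there so the snippet does not fail when the file is absent.

```r
library(foreign)  # ships with standard R installations

spss_path <- "sample.sav"  # placeholder file name, replace with your own file

if (file.exists(spss_path)) {
  # to.data.frame = TRUE returns a data frame instead of a plain list
  survey <- read.spss(spss_path,
                      use.value.labels = TRUE,
                      to.data.frame    = TRUE)
  str(survey)  # inspect the imported variables
} else {
  message("File not found: ", spss_path)
}
```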

Useful functions for working with data objects in R

We have seen different data structures in R. Now let’s see some useful functions for working with these data objects.

| Function | Purpose |
|---|---|
| `length(x)` | Number of elements/components. |
| `dim(x)` | Dimensions of x. |
| `str(x)` | Structure of x. |
| `class(x)` | Class or type of x. |
| `mode(x)` | How x is stored. |
| `names(x)` | Names of components in x. |
| `c(x1, x2, ...)` | Combines x1, x2 into a vector. |
| `cbind(x1, x2, ...)` | Combines x1, x2 as columns. |
| `rbind(x1, x2, ...)` | Combines x1, x2 as rows. |
| `x` | Prints x. |
| `head(x)` | Lists the first part of x. |
| `tail(x)` | Lists the last part of x. |
| `ls()` | Lists current objects. |
| `rm(x1, x2, ...)` | Deletes one or more objects. The statement `rm(list = ls())` will remove most objects from the working environment. |
| `y <- edit(x)` | Edits `x` and saves the result as `y`. |
| `fix(x)` | Edits `x` in place. |

If I have missed any functions, let me know in the comment section below.
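A short sketch of a few of these functions in action may help. The objects `x`, `by_rows`, and `by_cols` are throwaway examples, not part of any dataset.

```r
# A small named numeric vector to experiment with
x <- c(a = 10, b = 20, c = 30)

length(x)   # 3
names(x)    # "a" "b" "c"
class(x)    # "numeric"
mode(x)     # "numeric"
str(x)      # compact description of the structure

# rbind() stacks vectors as rows, cbind() places them side by side as columns
by_rows <- rbind(x, x * 2)
by_cols <- cbind(x, x * 2)
dim(by_rows)   # 2 3
dim(by_cols)   # 3 2

# head() and tail() show the first and last parts of an object
head(letters, 3)   # "a" "b" "c"
tail(letters, 3)   # "x" "y" "z"

# ls() lists the objects created so far; rm() deletes them
ls()
rm(by_cols)
```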

Data structures in R – Part 2

R offers a wide range of options for holding data, such as scalars, vectors, matrices, arrays, data frames, and lists. In Data structures in R – Part 1 we saw scalars, vectors, matrices, and arrays. Now let’s look at data frames and factors.

Data frames

A data frame is a table, or two-dimensional array-like structure, in which each column contains the values of one variable and each row contains one set of values, one from each column. Unlike a matrix, different columns can hold different modes of data (numeric, character, etc.). It is similar to the datasets you see in IBM SPSS, SAS, and Stata. Data frames are the most commonly used data structure in R.

Characteristics of a data frame

1. Column names should not be empty.
2. Row names should be unique.
3. Data stored in a data frame can be of numeric, factor or character type.
4. Each column should contain the same number of data items.

Let’s see an example.

```
> marklist <- data.frame(
+   rollno = c(1001:1006),
+   name = c("Abdul","Balu","Charlie","Daniel","Elisa","Fathima"),
+   marks = c(87,91,66,57,83,72)
+ )
> marklist
  rollno    name marks
1   1001   Abdul    87
2   1002    Balu    91
3   1003 Charlie    66
4   1004  Daniel    57
5   1005   Elisa    83
6   1006 Fathima    72
```

In the above example, each column holds a single data type, but different columns in the same data frame can hold different types: `rollno` and `marks` are numeric, while `name` is text.

We can subscript a data frame the same way we subscript matrices. Let’s see this with an example using the `marklist` data frame from above.
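To check the type of each column yourself, a small sketch like the following can be used. It rebuilds the same `marklist` data frame; `stringsAsFactors = FALSE` is set explicitly as an assumption so that `name` stays a character column regardless of the R version, and `col_types` is just an illustrative variable name.

```r
marklist <- data.frame(
  rollno = 1001:1006,
  name   = c("Abdul", "Balu", "Charlie", "Daniel", "Elisa", "Fathima"),
  marks  = c(87, 91, 66, 57, 83, 72),
  stringsAsFactors = FALSE   # keep names as character in any R version
)

# One class per column: rollno is "integer", name is "character",
# marks is "numeric"
col_types <- sapply(marklist, class)
col_types

# str() gives the same information together with a preview of the values
str(marklist)
```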

```> marklist[1,3]
[1] 87
> marklist[1:3]
  rollno    name marks
1   1001   Abdul    87
2   1002    Balu    91
3   1003 Charlie    66
4   1004  Daniel    57
5   1005   Elisa    83
6   1006 Fathima    72
> marklist[c(1,3)]
  rollno marks
1   1001    87
2   1002    91
3   1003    66
4   1004    57
5   1005    83
6   1006    72
> marklist[c("rollno","marks")]
  rollno marks
1   1001    87
2   1002    91
3   1003    66
4   1004    57
5   1005    83
6   1006    72
> marklist$name
[1] "Abdul"   "Balu"    "Charlie" "Daniel"
[5] "Elisa"   "Fathima"```
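Because a data frame is indexed as `[rows, columns]`, the row position can also hold a logical condition. The sketch below rebuilds `marklist` so it runs on its own; the cut-offs of 80 and 70 marks and the names `high` and `low` are arbitrary choices for illustration.

```r
marklist <- data.frame(
  rollno = 1001:1006,
  name   = c("Abdul", "Balu", "Charlie", "Daniel", "Elisa", "Fathima"),
  marks  = c(87, 91, 66, 57, 83, 72),
  stringsAsFactors = FALSE
)

# Keep only the rows where marks exceed 80 (all columns)
high <- marklist[marklist$marks > 80, ]
high$name   # "Abdul" "Balu" "Elisa"

# subset() expresses the same idea and can also pick columns
low <- subset(marklist, marks < 70, select = c(rollno, marks))
low$rollno  # 1003 1004
```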

Factors

Factors categorize data and store it as levels. They can be built from both strings and integers, and are useful for columns that have a limited number of unique values, for example Male, Female, Neutral or True, False. They matter in data analysis because statistical modelling functions treat them as categorical variables. Factors are created with the `factor()` function, which takes a vector as input.

We will see more about factors in practice when we discuss statistical methods.
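As a minimal sketch, here is how a character vector becomes a factor. The vectors `gender` and `sizes` are invented sample data.

```r
gender <- c("Male", "Female", "Neutral", "Female", "Male")

# factor() finds the unique values and stores them as levels
gender_f <- factor(gender)
levels(gender_f)       # "Female" "Male" "Neutral" (sorted alphabetically)

# Internally each value is stored as the integer position of its level
as.integer(gender_f)   # 2 1 3 1 2

# table() counts how often each level occurs
table(gender_f)

# The order of the levels can be set explicitly with the levels argument
sizes <- factor(c("M", "S", "L", "M"), levels = c("S", "M", "L"))
levels(sizes)          # "S" "M" "L"
```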

Data structures in R – Part 1

R offers a wide range of options for holding data, such as scalars, vectors, matrices, arrays, data frames, and lists. Let’s look at the first four structures in this post.

Scalars

Scalars are one-element vectors. These are used to hold constants.

Example

```a <- 1
b <- "Phone"
c <- TRUE```

Vectors

Vectors are one-dimensional arrays that hold numeric, character, or logical data. The combine function `c()` is used to form a vector. A vector can hold only one data type; you cannot mix numbers with characters. Let’s look at some examples.

Numeric vector

`a <- c(2,10,-5,15)`

Character vector

`b <- c("Male", "Female", "Neutral")`

Logical vector

`c <- c(TRUE, FALSE, FALSE, TRUE)`
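If you do combine different types in one call to `c()`, R does not raise an error; it silently converts everything to a single common type. A small sketch with throwaway values:

```r
# Mixing numbers, text and logicals: everything becomes character
mixed <- c(1, "a", TRUE)
mixed          # "1" "a" "TRUE"
class(mixed)   # "character"

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0
num_lgl <- c(1, TRUE, FALSE)
num_lgl        # 1 1 0
class(num_lgl) # "numeric"
```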

To refer to the elements of a vector, use square brackets. For example,

```
> a <- c(2,4,6,8,10,12,14,16,18,20)
> a[6]
[1] 12
> a[3:6]
[1]  6  8 10 12
> a[c(1,7)]
[1]  2 14
```

Matrices

A matrix is a two-dimensional array in which every element has the same data type. Matrices are created with the `matrix()` function. The syntax of the function is

```a <- matrix(vector, nrow=number_of_rows, ncol=number_of_columns,
byrow=logical_value, dimnames=list( char_vector_rownames, char_vector_colnames))```

where `vector` contains the elements for the matrix, `nrow` and `ncol` specify the row and column dimensions, and `dimnames` contains optional row and column labels stored in character vectors. The option `byrow` indicates whether the matrix should be filled in by row (`byrow=TRUE`) or by column (`byrow=FALSE`). The default is by column.

Let’s see some examples for matrices now

Creating a 5×2 matrix

```> a<-matrix(1:10, nrow=5,ncol=2)
> a
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10```

Let’s create a `2x2` matrix with row and column labels

```> cells <- c(2,8,12,16)
> r <- c("A1","A2")
> c <- c("X1","X2")
> b<-matrix(cells,nrow = 2, ncol = 2, byrow = TRUE,dimnames = list(r,c))
> b
   X1 X2
A1  2  8
A2 12 16```

In the above example, the matrix was created with `byrow = TRUE`. Try the same call with `byrow = FALSE` and see the difference.
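To compare the two fill orders side by side, a sketch along these lines can be used. It reuses the same cell values; the names `row_names`, `col_names`, `by_row`, and `by_col` are chosen here only to avoid reusing `c`, which is also the name of the combine function.

```r
cells     <- c(2, 8, 12, 16)
row_names <- c("A1", "A2")
col_names <- c("X1", "X2")

# Filled row by row: the first row is 2, 8
by_row <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE,
                 dimnames = list(row_names, col_names))

# Filled column by column (the default): the first column is 2, 8
by_col <- matrix(cells, nrow = 2, ncol = 2, byrow = FALSE,
                 dimnames = list(row_names, col_names))

by_row["A1", "X2"]   # 8
by_col["A1", "X2"]   # 12

# The two matrices hold the same values, transposed
all(by_col == t(by_row))   # TRUE
```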

Subscripts in matrix

You can also subscript a matrix using square brackets

```> x<-matrix(11:20, nrow=2)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]   11   13   15   17   19
[2,]   12   14   16   18   20
> m <-x[,3]
> m
[1] 15 16
> n <-x[1,4]
> n
[1] 17
> o <-x[2,c(3,4,5)]
> o
[1] 16 18 20```

First, we created a `2x5` matrix. We then subscripted it with square brackets in the form `x[row, column]`; leaving the row or column position empty selects every row or every column.

Arrays

Arrays are similar to matrices, except that they can have more than two dimensions. For example, an array of dimension (2, 3, 4) consists of 4 rectangular matrices, each with 2 rows and 3 columns. Like matrices, arrays can store only one data type. Arrays are created with the `array()` function. The syntax of the function is

`myarray <- array(vector, dimensions, dimnames)`

Here `vector` contains the data for the array, `dimensions` is a numeric vector giving the maximal index for each dimension, and `dimnames` is an optional list of dimension labels. Arrays are useful when programming new statistical methods.

Let’s see this with the following example.

```
> rows <- c("ROW1","ROW2","ROW3")
> cols <- c("COL1","COL2","COL3")
> mats <- c("Matrix1","Matrix2")
> a <- array(1:18, c(3,3,2), dimnames = list(rows, cols, mats))
> a
, , Matrix1

     COL1 COL2 COL3
ROW1    1    4    7
ROW2    2    5    8
ROW3    3    6    9

, , Matrix2

     COL1 COL2 COL3
ROW1   10   13   16
ROW2   11   14   17
ROW3   12   15   18
```
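Array elements are addressed with one index per dimension, in the order row, column, matrix. A minimal sketch on an unlabelled 3 x 3 x 2 array (built here without `dimnames` to keep it short):

```r
a <- array(1:18, dim = c(3, 3, 2))

dim(a)       # 3 3 2

# Single element: row 2, column 3 of the first matrix
a[2, 3, 1]   # 8

# Leaving an index empty selects everything along that dimension
a[, , 2]     # the whole second matrix (values 10 to 18)
a[1, , 2]    # first row of the second matrix: 10 13 16
```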

Keep reading about data structures in Data structures in R – Part 2.

Getting help in R

This post is part of the Learn Data Analysis using R tutorials. It explains how to find and use help inside R.

R has extensive built-in help. The best starting point is to run the function `help.start()`, which opens a local page in your browser with links to the R manuals, the R FAQ, a search engine, and other resources.

Help function

Now let’s see how to get help on a particular function. In the R Console, the function `help` can be used to see the help file of a specific function.

Example: Getting help for mean function in R

Use the following command to get help on the `mean` function.

`help(mean)`

You will get the following output, which explains the function’s arguments and gives examples of how to use it.

```Arithmetic Mean

Description:

Generic function for the (trimmed) arithmetic mean.

Usage:

mean(x, ...)

## Default S3 method:
mean(x, trim = 0, na.rm = FALSE, ...)

Arguments:

x: An R object.  Currently there are methods for numeric/logical
vectors and date, date-time and time interval objects.
Complex vectors are allowed for ‘trim = 0’, only.

trim: the fraction (0 to 0.5) of observations to be trimmed from
each end of ‘x’ before the mean is computed.  Values of trim
outside that range are taken as the nearest endpoint.

na.rm: a logical value indicating whether ‘NA’ values should be
stripped before the computation proceeds.

...: further arguments passed to or from other methods.

Value:

If ‘trim’ is zero (the default), the arithmetic mean of the values
in ‘x’ is computed, as a numeric or complex vector of length one.
If ‘x’ is not logical (coerced to numeric), numeric (including
integer) or complex, ‘NA_real_’ is returned, with a warning.

If ‘trim’ is non-zero, a symmetrically trimmed mean is computed
with a fraction of ‘trim’ observations deleted from each end
before the mean is computed.

References:

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S
Language_.  Wadsworth & Brooks/Cole.

See Also:

‘weighted.mean’, ‘mean.POSIXct’, ‘colMeans’ for row and column
means.

Examples:

x <- c(0:10, 50)
xm <- mean(x)
c(xm, mean(x, trim = 0.10))
```
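To see what the `trim` and `na.rm` arguments described in the help page actually do, the example from the help file can be extended a little. This is only a sketch; `x_na` is an extra vector added here for illustration.

```r
x <- c(0:10, 50)

# Plain arithmetic mean: the outlier 50 pulls it upwards
mean(x)                # 8.75

# trim = 0.10 drops the lowest and highest value (one from each end of
# these 12 observations) before averaging the remaining 1 to 10
mean(x, trim = 0.10)   # 5.5

# With a missing value the result is NA unless na.rm = TRUE is given
x_na <- c(x, NA)
mean(x_na)                 # NA
mean(x_na, na.rm = TRUE)   # 8.75

# In an interactive session, ?mean is shorthand for help(mean), and
# example(mean) runs the examples from the help page.
```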

help.search Function

Use the function `help.search` to list help files that contain a certain word or phrase. Use the following command to search for the phrase “linear regression”.

`help.search("linear regression")`

You will get the following output

```Help files with alias or concept or title matching ‘linear regression’
using fuzzy matching:

datasets::anscombe Anscombe's Quartet of 'Identical' Simple Linear
Regressions
KernSmooth::dpill Select a Bandwidth for Local Linear Regression
Concepts: Non-linear Regression
MASS::rms.curv Relative Curvature Measures for Non-Linear
Regression
Concepts: Non-linear Regression
stats::D Symbolic and Algorithmic Derivatives of Simple
Expressions
Concepts: Non-linear Regression
stats::getInitial Get Initial Parameter Estimates
Concepts: Non-linear Regression
stats::nlm Non-Linear Minimization
Concepts: Non-linear Regression
stats::nls Nonlinear Least Squares
Concepts: Non-linear Regression
stats::nls.control Control the Iterations in nls
Concepts: Non-linear Regression
stats::optim General-purpose Optimization
Concepts: Non-linear Regression
stats::plot.profile.nls
Plot a profile.nls Object
Concepts: Non-linear Regression
stats::predict.nls Predicting from Nonlinear Least Squares Fits
Concepts: Non-linear Regression
stats::profile.nls Method for Profiling nls Objects
Concepts: Non-linear Regression
stats::vcov Calculate Variance-Covariance Matrix for a
Fitted Model Object
Concepts: Non-linear Regression

Type ’help(FOO, package = PKG)’ to inspect entry ’FOO(PKG) TITLE’.```
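Two related tools are worth knowing. `??` is shorthand for `help.search()`, and `apropos()` searches the names of objects that are currently available rather than the help files. The sketch below keeps the interactive calls as comments and only runs `apropos()`; the variable name `mean_like` is illustrative.

```r
# Shorthand for help.search("linear regression"); run it interactively:
# ??"linear regression"

# Help on a function inside a specific package:
# help(nls, package = "stats")

# apropos() returns the names of objects matching a pattern
# (matching is case-insensitive by default)
mean_like <- apropos("mean")
mean_like

"colMeans" %in% mean_like   # TRUE
```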

Each package in R comes with a manual, which can be accessed from within R or read on CRAN.

Learn Data analysis in R

R is a popular open-source program used for data analysis. Mukesh Ambani, the Indian billionaire, pointed out that data is the new oil. In such an era, programs like R are becoming very popular among data analysts. Let’s dive in and learn the different concepts and how to use R. For a user-friendly experience and ease of use for beginners, the tutorials are demonstrated using RStudio, an IDE for R.

Installing packages in R

One of the main reasons R has become so popular is the vast collection of packages available in the CRAN repository. The base R system comes with basic functionality; the packages are developed and published by the larger R community. In the past few years, the number of available packages has grown rapidly, from about one thousand to more than ten thousand.

Command-line method to install packages in R

Installing packages from the command line is a faster and easier process once you get used to it. Let’s see how to install a package named “quantmod”.

`install.packages("quantmod")`

Now we have learned how to install a single package in R. Next, let’s see how to install multiple packages. Use the following command to install two packages, quantmod and MASS.

`install.packages(c("quantmod", "MASS"))`
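Running `install.packages()` again for a package that is already present simply reinstalls it. One possible way to avoid that is to check first which packages are missing. The helper `missing_packages()` below is a made-up name for this sketch, not a function from R or any package, and the `interactive()` guard is an assumption added so that nothing is downloaded when the code runs in a script.

```r
# Return the subset of pkgs that cannot be loaded on this machine
# (illustrative helper, not part of base R)
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("quantmod", "MASS"))
to_install

# Install only what is missing, and only in an interactive session
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```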

Graphical method to install packages in R

If you want to install packages without entering commands, follow these steps in RStudio.

Step 1: Go to `Tools -> Install Packages...`, and you will get a pop-up window.

Step 2: Type the name of the package you want to install in the pop-up window and click Install.

Installing R packages from external repositories

The above methods work for packages that are available on CRAN, the official repository of R. In certain cases you may want to install packages from external repositories. For this, first install the `devtools` package using the command `install.packages("devtools")`. Once it is installed and loaded with `library(devtools)`, you can use the following functions.

Installing R package from Bioconductor

`install_bioc()`

Installing R package from a git repository

`install_git()`

Installing R package from Bitbucket

`install_bitbucket()`

Installing R package from a URL

`install_url()`

Once you have installed a package, you have to load it in R to make use of it. For this, use the simple command

`library(quantmod)`

Introduction to R programming

R is an open-source programming language designed for statistical analysis. R grew out of the S language, which was developed at Bell Laboratories in the late 1970s. R itself was created by researchers at the University of Auckland, New Zealand, and is freely available under the GNU General Public License.

R is very popular in the academic community; I learned and used R myself for my PhD in finance. R has wonderful graphing functionality. The demand for R has grown to a new level in recent years. Beyond academia, large companies have started to use R for big data analysis.

Another main advantage of R is that it is extremely customizable. There are thousands of extensions for R: about 16,081 at the time of writing in August 2019. Extension packages cover everything from time series analysis to genomic science to text mining. You can find these extensions on CRAN, a free repository maintained by the R community.

R also boasts impressive graphics, free and polished integrated development environments (IDEs), programmatic access to and from many general‐purpose languages, interfaces with popular proprietary analytics solutions including MATLAB and SAS, and even commercial support from Revolution Analytics.