03
JanChristmas Special : Upto 40% OFF! + 2 free courses - SCHEDULE CALL
R packages provide some sample data to learn the tools of data science but in real-time you have to work on your custom data for real-world applications. The objective of this tutorial is to explain how to read plain-text rectangular files into R. We will mainly focus on data import packages, but many of the principles will translate to other forms of data. We’ll explore some concepts of packages that are useful for other types of data.
Importing external data in R environment means that we can read data from external files, write data to external files, and can access those files from outside the R environment. File formats like csv, xml, xlsx, json, and web data can be imported into the R environment to read the data and perform data analysis, data manipulations and after data analysis data in R can be exported to external files in the same file formats.
The methods given below are the base functions of R for importing different types of external files, so no separate packages are required to install to use these functions.
Each of the functions come with default arguments which separates them from each other. These arguments are:-
TRUE
, then function assumes that your file has a header row. If headers are not given in excel file, then give value as FALSE
.TRUE
value, unequal length rows will be added with blank fields implicitly.“\t”
is used for a tab-delimited file and a comma is used for comma-separated files.TRUE
if you want your text data to be converted to factors while importing.We also need to give filename with a complete path or complete URL of the file is on the web.
Data Science Training - Using R and Python
To import a file located in your local machine, follow the below steps: -
setwd ("")
and then provide the file name in the function.choose()
within the import function. This will ask to select a file from your local machine.The functions in this package (more or less) are used in a similar way as of the base functions. The main thing is that readr package functions work almost 10 times much faster than the base functions. So in the case of bulk files, importing this package will give very good speed in importing data.
CSV file is a kind of text file in which the values in columns are separated by a comma.
First, we have to set our working directory with the setwd()
function to the directory where the file is located locally.
For example:
setwd("C:\Users\sarvendra.singh\Desktop\Janbask\Blogs\New blogs_july19\Random forest in R”)
Then to read values from CSV, there is a built-inunction read.csv() that export the data from the file as a data frame.
For example:
read.data <- read.csv("file1.csv")
print(read.data)
Analyzing a CSV File
#For printing number of columns
print(ncol(read.data))
Output:
[1] 5
#For printing number of rows
print(nrow(read.data))
Output:
[1] 8
Writing values of pandas dataframe to an external CSV file, we use the write.csv()
function.
per.sal <- subset(read.data, empsalary >= "30000" & empsalary <= "40000")
# Writing data into a new CSV file
write.csv(per.sal,"output.csv")
new.data <- read.csv("output.csv")
print(new.data)
XML (Extensible Markup Language) file contains both data and file format using the ASCII text. It is a little similar to an HTML file that contains markup tags, but the tags in an XML file describe the meaning of the data contained in the file rather than the structure of the page.
To load data of XML file in R, we have to install the XML package, as described below-
install.packages("XML")
To read XML files, we use the in-built function xmlParse().
For example:
#To load required xml package to read XML files
library("XML")
#To load other required packages
library("methods")
#To give the input file name to the function
newfile <- xmlParse(file = "samplefile.xml")
print(newfile)
Converting an XML to a Data Frame
Read: What Exactly Does a Data Scientist Do?
To perform data analysis effectively after importing data in R, we convert the data in an XML file to a Data Frame. After converting, we can perform data manipulation and other operations as performed in a data frame.
For example:
library("XML")
library("methods")
#To convert the data in xml file to a data frame
xmlRdataframe <- xmlToDataFrame("samplefile.xml")
print(xmlRdataframe)
Output:
ID | NAME | SALARY | STARTDATE | DEPT | |
1 | 1 | Sam | 32000 | 01/01/2001 | HR |
2 | 2 | Rob | 36000 | 09/03/2006 | IT |
3 | 3 | Max | 42000 | 01/05/2011 | Sales |
4 | 4 | Ivar | 50000 | 25/01/2001 | Tech |
5 | 5 | Robert | 25000 | 13/07/2015 | Sales |
6 | 6 | Leon | 57000 | 05/01/2000 | IT |
7 | 7 | Samuel | 45000 | 27/03/2003 | Operations |
8 | 8 | Jack | 24000 | 06/01/2016 | Sales |
To import JSON files data into R, we first need to install or load the rjson package. JSON (JavaScript Object Notation) file is generally used for data exchange between a web application and a server. They are text-based human-readable files and editable by a normal text editor.
install.packages("rjson")
Now to read json files, we use the in-built function from JSON() which stores the data as a list.
For example:
#To load rjson package
library("rjson")
#To give the file name to the function
newfile <- fromJSON(file = "samplefile1.json")
#To print the file
print(newfile)
Output:
$ID
[1] "1" "2" "3" "4" "5" "6" "7" "8"
$Name
[1] "Sam" "Rob" "Max" "Robert" "Ivar" "Leon" "Samuel" "Ivar"
$Salary
[1] "32000" "27000" "35000" "25000" "37000" "41000" "36000" "51000"
$StartDate
[1] "1/1/2001" "9/3/2003" "1/5/2004" "14/11/2007" "13/7/2015" "4/3/2007"
[7] "27/3/2013" "25/7/2000"
$Dept
[1] "IT" "HR" "Tech" "HR" "Sales" "HR"
[7] "Operations" "IT"
Converting a JSON File to a Data Frame
To convert JSON file to a Data Frame, we use the as.data.frame()
function.
For example:
library("rjson")
newfile <- fromJSON(file = "samplefile1.json")
#To convert a JSON file to a data frame
jsonRdataframe <- as.data.frame(newfile)
print(jsonRdataframe)
Output:
ID | NAME | SALARY | STARTDATE | DEPT | |
1 | 1 | Sam | 32000 | 01/01/2001 | IT |
2 | 2 | Rob | 27000 | 09/03/2003 | HR |
3 | 3 | Max | 35000 | 01/05/2004 | Tech |
4 | 4 | Ivar | 25000 | 14/11/2007 | HR |
5 | 5 | Robert | 37000 | 13/07/2015 | Sales |
6 | 6 | Leon | 41000 | 04/03/2007 | HR |
7 | 7 | Samuel | 36000 | 27/03/2013 | Operations |
8 | 8 | Jack | 51000 | 25/07/2000 | IT |
Data Science Training - Using R and Python
Microsoft Excel is a very popular file format that stores data in xls and xlsx format. We can read data from excel to R and write data from R back to Excel file using the readxl
package in R.
To install the readxl
package, run the following command
install.packages("readxl")
read_excel() function is used to import file into R and stores it as a data frame.
newfile <- read_excel("sheet1.xlsx)
print(newfile)
Output:
ID | NAME | DEPT | SALARY | AGE | |
1 | 1 | SAM | SALES | 32000 | 35 |
2 | 2 | ROB | HR | 36000 | 23 |
3 | 3 | MAC | IT | 37000 | 40 |
4 | 4 | IVAR | IT | 25000 | 37 |
5 | 5 | MAX | R&D | 30000 | 22 |
6 | 6 | ROBERT | HR | 27000 | 32 |
7 | 7 | SAMUEL | FINANCE | 50000 | 41 |
8 | 8 | RAGNAR | SALES | 45000 | 29 |
Reading and retrieving HTML tables from the web, we use the XML and RCurl packages in R programming.
Below scripts are used to install XML and Rcurl libraries.
install.packages("XML")
install.packages("RCurl")
For load the packages, use the following command:
library("XML")
library("RCurl")
For example, we will fetch the ‘Fiscal annual growth table from a url using the readHTMLTable() function which stores it as a Data Frame.
#To fetch a table from any website paste the url
url <- "https://en.wikipedia.org/wiki/Fiscal_Annuak_Growth#Ranking"
tabs <- getURL(url)
#To fetch the first table, if the webpage has more than one table, we use which = 1
tabs <- readHTMLTable(tabs,which = 1, stringsAsFactors = F)
head(tabs)
Output:
V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | |
1 | Classification | Jurisdiction | 2019 | 2018 | 2017 | 2016 | 2015 | 2014 | 2013 | 2012 | 2011 | 2010 | 2009 |
2 | Very High | New Zealand | 1 | 1 | 1 | 2 | 2 | 3 | 3 | 3 | 3 | 2 | 2 |
3 | Very Low | Singapore | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
4 | Medium | Denmark | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 5 | 6 | 6 | 5 |
5 | Outstanding | Hong Kong | 4 | 5 | 4 | 4 | 3 | 2 | 2 | 2 | 2 | 3 | 4 |
6 | Very Easy | South Korea | 5 | 4 | 5 | 5 | 5 | 7 | 8 | 8 | 16 | 19 | 23 |
str() function to analyze the structure of the data frame.
Read: Data Scientist Resumes That Will Get You An Interview Call
str(tabs)
Output:
'data.frame': 191 obs. of 16 variables:
$ V1 : chr "Classification" "Very Easy" "Very Easy" "Very Easy" ...
$ V2 : chr "Jurisdiction" "New Zealand" "Singapore" "Denmark" ...
$ V3 : chr "2019" "1" "2" "3" ...
$ V4 : chr "2018" "1" "2" "3" ...
$ V5 : chr "2017" "1" "2" "3" ...
$ V6 : chr "2016" "2" "1" "3" ...
$ V7 : chr "2015" "2" "1" "4" ...
$ V8 : chr "2014" "3" "1" "5" ...
$ V9 : chr "2013" "3" "1" "5" ...
$ V10: chr "2012" "3" "1" "5" ...
$ V11: chr "2011" "3" "1" "6" ...
$ V12: chr "2010" "2" "1" "6" ...
$ V13: chr "2009" "2" "1" "5" ...
$ V14: chr "2008" "2" "1" "5" ...
$ V15: chr "2007" "2" "1" "7" ...
$ V16: chr "2006" "1" "2" "8" ...
foreign package is required to install to read SPSS files in R. ”read” function is used to read and SPS files in R."to.data.frame" option is used to return a data frame from reading spss file.
library(foreign) # load the foreign package
help(read.spss) # documentation
mydata = read.spss("myfile", to.data.frame=TRUE)
To read Minitab Portable Worksheet format in R, function read.mtp can be used from the foreign package. It returns a list of components in the Minitab worksheet.
library(foreign) # load the foreign package
help(read.mtp) # documentation
mydata = read.mtp("mydata.mtp") # read from .mtp file
In Real-world applications data is mostly located in external databases, not in CSV files. So it is must to know how to make a connection and access data in R from SQL/NO SQL Databases.
Connecting R to MySQL is easy with the RMySQL package. First, install the package and load the library.
install.packages("RMySQL")
library(RMySQL)
Once the RMySQL library is installed create a database connection object.credentials like username, password needs to be provided as arguments in dbconnect function.
mysqldb = dbConnect(MySQL(), user='user', password='password', dbname='database_name', host='host')
Listing Tables and columns:
After the connection has been established we list the tables and fields in the database we connected to.
dbListTables(mysqldb)
A list of the tables will be returned in our connection.
dbListFields(mysqldb, 'some_table')
This will return a list of the fields in some_table.
Executing Queries:
With dbSendQuery function, we can run queries on tables.
dbSendQuery(mydb, 'drop table if exists some_table, some_other_table')
Creating Tables:
We can create tables in the database using R data frames.
dbWriteTable(mydb, name='table_name', value=data.frame.name)
To retrieve data from the database we need to save a results set object.
Read: 10 Most In-demand Skills of Data Scientist to Flourish in Your Career
rs = dbSendQuery(mydb, "select * from some_table")
Query results will remain on the MySQL server. To access the results in R, we need to use the fetch function.
data = fetch(rs, n=-1)
fetch command will save the results of the query in a data frame object. The n in the function specifies the number of records to retrieve, using n=-1 retrieves all pending records.
MongoDB connection with R is quite straightforward. The Package required to install for this connection is RMongo package. New features of MongoDB which are implemented in the last few years have not been included in the community R drivers.
require(RMongo)
mongo <- mongoDbConnect('kenblog', 'localhost', 27017)
In the mongoDbConnect method, we need to give the name of the database, server name, and port number to which we want to connect.
Now to retrieve records, we create a query. For this example, let’s get only the examdata from our scores collection. We can use the dbGetQuery method for this which takes a connection object, the collection name, and the query.
examQuery <- dbGetQuery(mongo, 'scores', "{'type': 'exam'}")
This loads in all the records from our scores collection of type exam. Let’s take the values of our exam scores and create a vector from them.
exam_scores <- examQuery[c('score')]
Now we can get a simple summary of our data with command summary(exam_scores):
score
Min. : 60.00
1st Qu.: 72.00
Median : 79.00
Mean : 79.45
3rd Qu.: 86.00
Max. :100.00
Data Science Training - Using R and Python
If we are working with large datasets that may consume lots of memory of our system, its better to have the data loaded in own server with an SQL/PostgreSQL database on it, where you can query the data in smaller digestible chunks
Let’s learn how to connect the PostgreSQL database with R. We need to install RPostgreSQL package for this approach.
To connect, we need to enter the following commands in R:
# install.packages("RPostgreSQL")
require("RPostgreSQL")
# create a connection
# save the password that we can "hide" it as best as we can by collapsing it
pw <- {
"new_user_password"
}
# loads the PostgreSQL driver
drv <- dbDriver("PostgreSQL")
# creates a connection to the postgres database
# note that "con" will be used later in each connection to the database
con <- dbConnect(drv, dbname = "postgres",
host = "localhost", port = 5432,
user = "openpg", password = pw)
rm(pw) # removes the password
# check for the cartable
dbExistsTable(con, "cartable")
Library | Objective | Function |
base | Export csv | write.csv() |
xlsx | Export excel | write.xlsx() |
haven | Export spss | write_sav() |
haven | Export sas | write_sas() |
haven | Export stata | write_dta() |
base | Export R | save() |
googledrive | Upload Google Drive | drive_upload() |
googledrive | Open in Google Drive | drive_browse() |
googledrive | Retrieve file ID | drive_get(as_id()) |
googledrive | Dowload from Google Drive | download_google() |
googledrive | Remove file from Google Drive | drive_rm() |
rdrop2 | Authentification | drop_auth() |
rdrop2 | Create a folder | drop_create() |
rdrop2 | Upload to Dropbox | drop_upload() |
rdrop2 | Read csv from Dropbox | drop_read_csv |
rdrop2 | Delete file from Dropbox | drop_delete() |
Conclusion
Reading data from different sources for analysis and exporting the results to some other system for report writing can be frustrating tasks because this can only take a lot of time compared to statistical analysis. Below is the summarised table having a library and corresponding functions name to import a particular file in R.
Were you able to understand how to import data into R? Now that you can import data in R, Login to Janbask Training and learn about more such techniques!
Read: The Best Data Science Projects (Beginner To Advanced)
A dynamic, highly professional, and a global online training course provider committed to propelling the next generation of technology learners with a whole new way of training experience.
Cyber Security
QA
Salesforce
Business Analyst
MS SQL Server
Data Science
DevOps
Hadoop
Python
Artificial Intelligence
Machine Learning
Tableau
Search Posts
Related Posts
Receive Latest Materials and Offers on Data Science Course
Interviews