Python Pandas Tutorial Guide for Beginners

Pandas is a Python package and data manipulation tool developed by Wes McKinney. It is built on top of the NumPy package, and its main data structure is the DataFrame. Pandas provides fast, flexible data structures that make working with relational and labeled data easy and intuitive. It supplies the fundamental high-level building blocks for practical, real-world data analysis in Python, and it aims to be one of the most powerful and flexible open-source data analysis and manipulation tools available in any language. Its popularity has grown over time: it is now estimated to have 5 to 10 million users, and it is a must-use tool in the Python data science toolkit.

Python Pandas Tutorial

Pandas is a BSD-licensed Python library. Python with Pandas is used in a wide array of fields, both academic and commercial, such as finance, economics, statistics, and analytics. In this tutorial, we will learn different features of Python Pandas and its practical applications. Pandas is sometimes called the “SQL of Python” because it helps to manage two-dimensional data tables in Python.

Prerequisites

The Pandas library is built on top of NumPy and uses much of its functionality, so it is recommended to go through our tutorial on NumPy before proceeding with this one.

Installing Pandas

The source code is currently hosted on GitHub at: https://github.com/pandas-dev/pandas

Below are the commands to install it using conda or pip:


# conda
conda install pandas
# or PyPI
pip install pandas

Pandas Features

Below are a few of the main features of pandas:

  • Missing data handling (missing values are represented as NaN)
  • Columns can be inserted into and deleted from data frames, so data frames are size-mutable
  • Sorting data in ascending or descending order
  • Filtering a data frame with any condition
  • Powerful group-by functionality for data aggregation and data transformation
  • Merging, concatenating, and joining data sets
  • Reshaping and pivoting of data sets with many options
  • Strong I/O mechanisms for importing data from flat files (CSV and delimited), Excel files, and databases, and for saving/loading data in the ultrafast HDF5 format
  • Creating date ranges, frequency conversion, date shifting, lagging, and other time-series functionality
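
A few of the features listed above can be sketched in a handful of lines. This is only a minimal illustration: the sales and managers tables, and all of their values, are made up for the example.

```python
import pandas as pd

# Hypothetical data, invented for this illustration
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "units": [10, 4, 7, 12],
})
managers = pd.DataFrame({
    "region": ["East", "West"],
    "manager": ["Asha", "Ben"],
})

# Filtering rows with a condition
large_orders = sales[sales["units"] > 5]

# Sorting in descending order
sorted_sales = sales.sort_values("units", ascending=False)

# Grouping by a column and aggregating
units_per_region = sales.groupby("region")["units"].sum()

# Merging two data sets on a shared column
merged = sales.merge(managers, on="region")

print(large_orders)
print(sorted_sales)
print(units_per_region)
print(merged)
```

Each operation returns a new object and leaves the original sales table unchanged.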

Pandas data structures

There are two types of data structures in pandas: Series and DataFrame.

1). Series: A Series is a one-dimensional array-like object. It can contain any data type, e.g. integers, floats, strings, Python objects, and so on. It can be thought of as two arrays: one holding the index/labels, and the other holding the actual data.

We can create a sample Series object, as in the following example, by instantiating a Pandas Series with a list.


import pandas as pd
S = pd.Series([11, 28, 72, 3, 5, 8])
S

The above Python code returned the following result:


0    11
1    28
2    72
3     3
4     5
5     8
dtype: int64
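
To make the "two arrays" picture of a Series concrete, here is a small sketch that gives the same values explicit string labels. The labels "a" to "f" are arbitrary and chosen only for this illustration.

```python
import pandas as pd

# The same values as above, with explicit string labels as the index
s = pd.Series([11, 28, 72, 3, 5, 8], index=["a", "b", "c", "d", "e", "f"])

print(s.index.tolist())  # the labels
print(s.values)          # the underlying data as a NumPy array
print(s["c"])            # look up a value by its label
```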

2). DataFrame: A Pandas DataFrame is a two-dimensional data structure, much like a relational table with rows and columns. The columns have names and the rows have index labels.

The idea of a DataFrame is based on spreadsheets: it has both a row index and a column index, and it contains an ordered collection of columns, similar to an Excel sheet. Different columns of a DataFrame can have different types; for example, the first column may consist of strings while the second consists of boolean values, and so on.
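
As a minimal sketch of this, the following DataFrame has one string column, one boolean column, and one float column. The column names and values are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Tom", "James", "Ricky"],   # strings
    "active": [True, False, True],       # booleans
    "rating": [4.23, 3.24, 3.98],        # floats
})

print(df.index)    # the row index
print(df.columns)  # the column index
print(df.dtypes)   # one data type per column
```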

Pandas - Descriptive Statistics Functions

Pandas has an array of functions and methods that calculate descriptive statistics on DataFrame columns. Most of them are aggregations, like sum() and mean(), which return a smaller result, but some of them, like cumsum(), produce an object of the same size. An axis argument can be given to these methods, just as with ndarray.{sum, std, ...}, but the axis can be specified by name or by integer.

DataFrame − “index” (axis=0, default), “columns” (axis=1)
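
The difference between an aggregation and a same-size result, and the two ways of naming the axis, can be sketched as follows. The small DataFrame is made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_totals = df.sum(axis="index")    # same as axis=0: one value per column
row_totals = df.sum(axis="columns")  # same as axis=1: one value per row
running = df.cumsum()                # same shape as df: running totals down each column

print(col_totals)
print(row_totals)
print(running)
```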

DataFrame creation: A DataFrame can be created with pandas as shown below.


import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
df = pd.DataFrame(d)
print(df)

Output (first four rows):


    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
2  Ricky   25    3.98
3    Vin   23    2.56

sum()

This function returns the sum of the values for the requested axis. By default, the axis is the index (axis=0), which gives one total per column. The example below instead passes axis=1 to add up the numeric fields of each row.


import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
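df = pd.DataFrame(d)
# axis=1 sums across the columns of each row; numeric_only=True skips the string column
print(df.sum(axis=1, numeric_only=True))

Output:


0     29.23
1     29.24
2     28.98
3     25.56
4     33.20
5     33.60
6     26.80
7     37.78
8     42.98
9     34.80
10    55.10
11    49.65
dtype: float64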

mean()

It returns the average of the values for the requested axis.

std()

It returns the standard deviation of the numerical columns.
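
As a brief sketch, mean() and std() can be called on the first four rows of the sample data used in this section. The numeric_only=True argument is an addition here, so that the string column is skipped.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tom", "James", "Ricky", "Vin"],
    "Age": [25, 26, 25, 23],
    "Rating": [4.23, 3.24, 3.98, 2.56],
})

# numeric_only=True skips the string column "Name"
means = df.mean(numeric_only=True)
stds = df.std(numeric_only=True)  # sample standard deviation (ddof=1 by default)

print(means)
print(stds)
```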

Summarizing Data

The describe() function computes summary statistics for the DataFrame columns: the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum. By default it excludes string columns and summarizes only the numeric ones.


import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe())

Its output is as follows −


            Age                Rating
count     12.000000           12.000000
mean      31.833333            3.743333
std       9.232682             0.661628
min       23.000000            2.560000
25%       25.000000            3.230000
50%       29.500000            3.790000
75%       35.500000            4.132500
max       51.000000            4.800000

Pandas - Function Application

At a high level, there are three important methods for applying functions. Which one to use depends on whether we want to apply an operation to the entire DataFrame, row- or column-wise, or element-wise.

  • DataFrame-level function application: pipe()
  • Row- or column-level function application: apply()
  • Element-level function application: applymap()

DataFrame-wise Function Application

Custom functions can be applied by passing the function, followed by any additional arguments, to pipe(). The DataFrame itself is passed as the function's first argument, so the operation is performed on the whole DataFrame.

This adder function adds the two values passed as parameters and returns the sum.


def adder(ele1,ele2):
    return ele1+ele2

Let’s apply the custom function at the DataFrame level.


df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
df.pipe(adder,2)

Let’s see the full program −


import pandas as pd
import numpy as np
def adder(ele1,ele2):
    return ele1+ele2
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
# pipe() calls adder(df, 2), which adds 2 to every element
print(df.pipe(adder,2))

Its output is as follows −


     col1        col2        col3
0   2.176704   2.219691   1.509360
1   2.222378   2.422167   3.953921
2   2.241096   1.135424   2.696432
3   2.355763   0.376672   1.182570
4   2.308743   2.714767   2.130288

Row or Column Wise Function Application

With the apply() method, arbitrary functions can be applied along the axes (row- or column-wise) of a DataFrame. If no axis is given, the operation is performed column-wise by default: the function receives each column in turn.

Example 1


import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df.apply(np.mean))

Output −


col1   -0.288022
col2   1.044839
col3   -0.187009
dtype: float64

If the axis parameter is set to 1, the operation is performed row-wise, as shown below.

Example 2


import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
# axis=1 passes each row to np.mean, giving one mean per row
print(df.apply(np.mean,axis=1))

Output: a Series with one mean per row, indexed 0 to 4. The values differ from run to run because the data is random.

Example 3


import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
# for each column, compute the range (maximum minus minimum)
print(df.apply(lambda x: x.max() - x.min()))

Output: a Series with one non-negative value per column (col1, col2, col3). The values differ from run to run because the data is random.

Element Wise Function Application

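The applymap() method applies a function that accepts and returns a scalar to every element of a DataFrame. Since pandas 2.1 the same operation is available as DataFrame.map(), and applymap() is deprecated. For a single column, Series.map() does the same job.

Example


import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
# My custom function, applied to every element of one column
print(df['col1'].map(lambda x:x*100))
# The same function applied to every element of the DataFrame
# (use df.map() instead on pandas 2.1 and later)
print(df.applymap(lambda x:x*100))

Output: the first print shows col1 multiplied by 100, and the second shows the whole DataFrame multiplied by 100. The values differ from run to run because the data is random.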

Pandas Iteration Operation

Iteration means extracting each item of something like a list or array, one after another. A DataFrame can be iterated over much like a dictionary, because it consists of rows and columns.

We can iterate over a pandas DataFrame in two ways:

  • Iterating over rows
  • Iterating over columns

Iterating over rows: Pandas has three built-in iteration functions. iterrows() and itertuples() iterate over rows, while items() (called iteritems() before pandas 2.0) iterates over columns.

Iteration over rows using iterrows(): iterrows() yields each row as a pair of its index label and a Series holding the row's values.


# importing pandas as pd
import pandas as pd
# dictionary of lists
data = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score':[90, 40, 80, 98]}
# creating a dataframe from a dictionary
df = pd.DataFrame(data)
# iterating over rows using iterrows() function
for i, j in df.iterrows():
    print(i, j)
    print()

Iteration using items():

The second method is the items() function (named iteritems() in pandas versions before 2.0). Although it is often listed with the row iterators, it runs over the columns: each iteration yields a key-value pair with the column label as the key and the column's values as a Series object.


# importing pandas as pd
import pandas as pd
# dictionary of lists
data = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score':[90, 40, 80, 98]}
# creating a dataframe from a dictionary
df = pd.DataFrame(data)
# using items() function to retrieve each column
for key, value in df.items():
    print(key, value)
    print()
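
The third function mentioned for row iteration, itertuples(), is not shown in the examples so far. A minimal sketch on the same data might look like this; collecting the rows into a list is done only to make them easy to inspect.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["aparna", "pankaj", "sudhir", "Geeku"],
    "degree": ["MBA", "BCA", "M.Tech", "MBA"],
    "score": [90, 40, 80, 98],
})

# Each row is a namedtuple; its first field, Index, is the row label
rows = list(df.itertuples())
for row in rows:
    print(row.Index, row.name, row.degree, row.score)
```

Because each row is a lightweight tuple rather than a Series, itertuples() is usually faster than iterrows().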

Iterating over Columns:

To iterate over columns, we first create a list of the DataFrame's column names and then loop over that list.


# creating a list of dataframe columns
columns = list(df)
for i in columns:
    # printing the third element of the column
    print(df[i][2])

Output:


sudhir
M.Tech
80

Pandas – Sorting

Pandas has a sort_values() function to sort a data frame by a particular column in ascending or descending order. It is different from Python's built-in sorted() function, which returns a list.

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

Parameters:

  • by: Name or list of names of the columns by which the dataset needs to be sorted.
  • axis: 0 to sort rows and 1 to sort columns.
  • ascending: Sort in ascending order if True, descending order if False.
  • inplace: Boolean value. If True, modify the data frame itself instead of returning a sorted copy.
  • kind: Sorting algorithm ('quicksort', 'mergesort' or 'heapsort') to apply during sorting.
  • na_position: 'first' or 'last', setting where null values are placed. Default is 'last'.

# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv")
# display
data


# sorting data frame by salary, with NaN values placed first
data.sort_values("Salary", axis = 0, ascending = True,
                 inplace = True, na_position ='first')

In the result, the NaN values are at the top, followed by the Salary values in ascending order.
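
Since nba.csv may not be at hand, the same behaviour can be sketched with a small made-up table. The names and salaries below are invented stand-ins, not values from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the nba.csv data
data = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Salary": [5000.0, np.nan, 1200.0, 3000.0],
})

# inplace=False (the default) returns a new, sorted data frame
nan_first = data.sort_values("Salary", ascending=True, na_position="first")
descending = data.sort_values("Salary", ascending=False)  # NaN goes last by default

print(nan_first)
print(descending)
```

Leaving inplace at its default keeps the original data frame untouched, which makes it easier to compare different orderings.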

Pandas - Indexing and Selecting Data

The Pandas indexing operator "[ ]" and attribute operator "." provide quick access to Pandas data structures across a wide range of use cases.

Pandas also supports multi-axis indexing. The two forms covered below are the label-based .loc and the integer-based .iloc.

.loc()

Pandas provides several methods for purely label-based indexing. When slicing with labels, both the start and the stop bound are included. Integers are valid labels, but they refer to the label and not the position.

.loc() has multiple access methods like −

  • A single scalar label
  • A list of labels
  • A slice object
  • A Boolean array

.loc takes two selectors (a single label, a list, or a range) separated by ','. The first selects rows and the second selects columns.

Example 1


import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
#select all rows of dataset for a particular column
print(df.loc[:,'A'])

Output


a   0.391548
b  -0.070649
c  -0.317212
d  -2.162406
e   2.202797
f   0.613709
g   1.050559
h   1.122680
Name: A, dtype: float64

Example 2


# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
# Select all rows for more than one columns
print(df.loc[:,['A','C']])

Its output is as follows −

          A         C
a  0.391548  0.745623
b -0.070649  1.620406
c -0.317212  1.448365
d -2.162406 -0.873557
e  2.202797  0.528067
f  0.613709  0.286414
g  1.050559  0.216526
h  1.122680 -1.621420
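
The four access methods listed for .loc can be sketched together on a small DataFrame. The data here uses consecutive integers instead of random numbers, purely so that the results are predictable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(32).reshape(8, 4),
                  index=list("abcdefgh"), columns=list("ABCD"))

single = df.loc["a", "A"]             # a single scalar label
rows_list = df.loc[["a", "c"], :]     # a list of labels
row_slice = df.loc["b":"d", "A":"B"]  # a slice object; both ends are included
mask = df.loc[df["A"] > 20]           # a Boolean array

print(single)
print(rows_list)
print(row_slice)
print(mask)
```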

.iloc()

Pandas provides several methods for purely integer-based indexing. As in Python and NumPy, the positions are 0-based. The various access methods are as follows:

  • An Integer
  • A list of integers
  • A range of values

Example 1


# importing the pandas and numpy libraries
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(12, 4), columns = ['A', 'B', 'C', 'D'])
# select the first four rows
print(df.iloc[:4])

Its output is as follows −


          A         B         C         D
0  0.699435  0.256239 -1.270702 -0.645195
1 -0.685354  0.890791 -0.813012  0.631615
2 -0.783192 -0.531378  0.025070  0.230806
3  0.539042 -1.284314  0.826977 -0.026251
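
The three access methods listed for .iloc can be sketched in the same way. Again, the data uses consecutive integers rather than random numbers so that the results are easy to check.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list("ABCD"))

first_row = df.iloc[0]            # an integer: the first row as a Series
picked = df.iloc[[0, 2], [1, 3]]  # lists of integers: rows 0 and 2, columns 1 and 3
block = df.iloc[1:3, 0:2]         # ranges: the stop position is excluded

print(first_row)
print(picked)
print(block)
```

Note the contrast with .loc: a label slice includes its stop bound, while a position slice does not.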

Pandas - Missing Data

Missing data occurs when no information is provided for a cell in a DataFrame. It is a serious problem in real-life scenarios because it can affect model behavior. Missing values are also called NA (Not Available) values in pandas.

In Pandas, missing data is represented by two values:

  • None: None is a Python singleton object that is often used for missing data in Python code.
  • NaN: NaN (Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.

Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. It provides several useful functions for finding, removing, and replacing null values in a DataFrame:

  • isnull()
  • notnull()
  • dropna()
  • fillna()
  • replace()

Example 1


# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("employees.csv")
# creating bool series True for NaN values
bool_series = pd.isnull(data["Gender"])
# filtering data
# displaying data only with Gender = NaN
data[bool_series]

Example 2


# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe using dictionary
df = pd.DataFrame(scores)
# using notnull() function
df.notnull()
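
The remaining functions from the list, dropna(), fillna(), and replace(), can be sketched on the same scores data. The fill values 0 and -1 are arbitrary choices for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "First Score": [100, 90, np.nan, 95],
    "Second Score": [30, 45, 56, np.nan],
    "Third Score": [np.nan, 40, 80, 98],
})

missing_per_column = df.isnull().sum()  # number of missing values in each column
complete_rows = df.dropna()             # keep only rows with no missing values
filled = df.fillna(0)                   # replace every missing value with 0
replaced = df.replace(np.nan, -1)       # replace() can also target NaN

print(missing_per_column)
print(complete_rows)
print(filled)
print(replaced)
```

All four calls return new objects; the original df still contains its NaN values afterwards.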

Pandas - Working with Text Data

A Series has a set of string processing methods that operate on each element of the array. These methods are accessed through the str attribute and generally have names matching the equivalent built-in string methods.

Splitting and Replacing Strings: Python's built-in str.split() returns a list of strings after splitting the given string by the specified separator, but it can only be applied to an individual string. The Pandas str.split() method can be applied to a whole Series. To replace data, we use str.replace(), which works like Python's .replace() method but operates on every element of the Series.
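
A short sketch of both methods on a whole Series follows. The names are made up for the example, and regex=False is passed so that the replacement is treated as plain text.

```python
import pandas as pd

# Hypothetical names, invented for this illustration
names = pd.Series(["Jai Kumar", "Princi Singh", "Gaurav Gupta"])

parts = names.str.split(" ")                 # each element becomes a list of strings
columns = names.str.split(" ", expand=True)  # expand=True spreads the parts over columns
renamed = names.str.replace("Kumar", "K.", regex=False)

print(parts)
print(columns)
print(renamed)
```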

String Concatenation

We can use str.cat() to concatenate strings. It joins the strings of the calling Series with the strings of the Series passed to it. The values in the two Series can differ, but both Series must have the same length.

Example 1


# importing pandas module
import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# making copy of address column
new = df["Address"].copy()
# concatenating address with name column
# overwriting name column
df["Name"]= df["Name"].str.cat(new, sep =", ")
# display
print(df)

Example 2


# importing pandas module
import pandas as pd
# reading the csv file
data = pd.read_csv("nba.csv")
# making copy of team column
new = data["Team"].copy()
# concatenating team with name column
# overwriting name column
data["Name"]= data["Name"].str.cat(new, sep =", ")
# display
data

Pandas str methods:

FUNCTION          DESCRIPTION
str.lower()       Converts the characters of each string to lowercase
str.upper()       Converts the characters of each string to uppercase
str.find()        Searches for a substring in each string in a series
str.rfind()       Searches for a substring in each string in a series, starting from the right side
str.findall()     Finds all occurrences of a pattern in each string in a series
str.isalpha()     Checks whether all characters in each string in a series are alphabetic (a-z/A-Z)
str.isdecimal()   Checks whether all characters in each string are decimal
str.title()       Capitalizes the first letter of every word in each string
str.len()         Returns the number of characters in each string
str.replace()     Replaces a substring within each string with another value that the user provides
str.contains()    Tests whether a pattern or regex is contained within each string of a Series or Index
str.extract()     Extracts groups from the first match of a regular expression pattern
str.startswith()  Tests whether the start of each string element matches a pattern
str.endswith()    Tests whether the end of each string element matches a pattern
str.isdigit()     Checks whether all characters in each string in a series are digits
str.lstrip()      Removes whitespace from the left side (beginning) of each string
str.rstrip()      Removes whitespace from the right side (end) of each string
str.strip()       Removes leading and trailing whitespace from each string
str.split()       Splits each string on occurrences of a user-specified separator
str.join()        Joins all elements of the list in each entry of a series with the passed delimiter
str.cat()         Concatenates strings to the calling series of strings
str.repeat()      Repeats each string value in the series the given number of times
str.get()         Gets the element at the passed position
str.partition()   Splits each string only at the first occurrence of the separator, unlike str.split()
str.rpartition()  Splits each string only once, at the last occurrence of the separator; otherwise it works like str.partition()
str.pad()         Adds padding (whitespace or other characters) to every string element in a series
str.swapcase()    Swaps the case of each string in a series
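
A handful of the methods from the table can be tried out on a small Series. The city names echo the earlier address example, with stray whitespace and mixed case added for the illustration.

```python
import pandas as pd

s = pd.Series(["  Nagpur ", "kanpur", "ALLAHABAD"])

stripped = s.str.strip()                          # remove leading/trailing whitespace
lower = stripped.str.lower()                      # convert to lowercase
title = stripped.str.title()                      # capitalize the first letter of each word
lengths = stripped.str.len()                      # number of characters in each string
has_pur = lower.str.contains("pur", regex=False)  # plain substring test
starts_k = lower.str.startswith("k")              # test the start of each string

print(title)
print(lengths)
print(has_pur)
print(starts_k)
```

Because every str method returns a new Series, the calls can be chained, for example s.str.strip().str.lower().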

Conclusion

Python with Pandas is used in a wide range of academic and commercial domains, including finance, retail, statistics, and analytics. Pandas is a great library for every step from importing data to analyzing it and deriving insightful results. Together with packages like NumPy and matplotlib, it makes most data analysis and data visualization tasks easy and convenient.
