Introduction to Data Wrangling:
Primarily, data wrangling refers to the processing of data into various formats for the purpose of analysis like: grouping, concatenation and other requirements. In python, we have various built-in functions to perform required data wrangling process for better analytics. In the code, given below I have highlight some of the functions.
Pre Requisites:
Have Installed Python (I am using Jupyter Notebook)
Python Libraries:
Numpy
Pandas
Dataset(s) file (I am using CSV Data file, however we can perform using any other data frame and can also put data runtime as well. You can access the data using following link:
https://docs.google.com/spreadsheets/d/12HsYU0P6DmlsrmVTj0i-liSSGMxGZ1pXIfiglw7f1Qc/edit?usp=sharing
Data wrangling functions:
First of all lets import numpy and pandas libraries
import numpy as np
import pandas as pd
Read and Replace values:
#Read CSV File and Replace NULL Values with NAN
originaldata = pd.read_csv(‘./Group 1_CustomerData.csv’, encoding = ‘ISO-8859-1’, na_values = [‘NA’, ‘N/A’,’-‘])
print(originaldata)
Data Manipulation using Grouping and Filtering:
grouping = originaldata.groupby(‘city’)
print(grouping.get_group(‘Hong Kong’))
Data Manipulation using Sorting Data:
sorted_data = originaldata.sort_values(‘city’, # sort by City
ascending=False, # enable descending order
kind=’heapsort’, #sorting algorithm
na_position=’last’) # keep NaN value at last
print(sorted_data)
Dealing with Missing values:
#check weather there are missing values or not
sorted_data.isnull().values.any()
#lets calculate the no of missing values against each category
sorted_data.isnull().sum()
#####For missing values I am replacing with “Not Provided”
sorted_data[“middleName”].fillna(“Not Provided”, inplace = True)
###repeat this step for all categories/columns and verify the results
print(sorted_data)
Data Manipulation using Merge:
Data manipulation includes sorting, merging, grouping, and altering the data. Pandas offers multiple functions to manipulate data as: Sorting, Grouping and Merging.
we have already discuss the sorting and grouping. For merge we are using the same code and here is the working of merge:
In Python Pandas library has merge
function which can be used to combine two dataframes, just like two SQL tables using joins as:
head
and tail
will get the three rows from the top and bottom as dataframes.
outer
is required to combine two dissimilar (no common rows) dataframes (tables).
- Enabling
indicator
will provide information about the dataframe source of each row (left or right).
mergedata = pd.merge(sorted_data.head(3),
sorted_data.tail(3),
how=’outer’,
indicator = True)
#lets see the difference
print(mergedata)
#Similarly for Concatenate:
top_df = sorted_data.head(3)
bottom_df = sorted_data.tail(3)
combined_originaldata = pd.concat( [top_df, bottom_df] )
print(combined_originaldata)
Reshaping and PIVOT data:
#PIVOT data
Pivotdata = combined_originaldata.pivot(index=’customerNo’,columns=’firstName’,values=’zipCode’)
print(Pivotdata)
#Stacked Data
Stacked_data = Pivotdata.stack()
print(Stacked_data)
#Now let’s check the duplicates:
Duplicates = sorted_data.duplicated(subset=None, keep=’first’)
print(Duplicates)
Aggregate functions:
total_customers = sorted_data[‘customerNo’].count()
print(total_customers)
#Insert count in to data file:
sorted_data.insert(2, “total_customers”, total_customers)
print(sorted_data)
Aggregated_data = sorted_data.pivot_table(index=’customerNo’,columns=’firstName’,values=’zipCode’, aggfunc=[‘count’])
print(Aggregated_data)
We can also rename the headers using following code:
sorted_data.rename(columns = {‘Unnamed: 102′:’Renamed Column 1’}, inplace = True)
print(sorted_data)
Write data into CSV file:
Write data into a CSV file and save it into directory
#Create CSV File and put wrangling data into file
sorted_data.to_csv(‘datawrangling.csv’,
index=False,
sep=’,’,
encoding=’utf-8′)
Summary:
In this post, I have discussed various data wrangling functions available built-in in Pandas library for Python. In coming posts, we will try to explore more features.