Python Data Wrangling

Pandas is a package in Python that can be used for data manipulation.

What is Data Manipulation?

Data manipulations can be organized around six key verbs:

arrange: sort the rows of a dataframe by index or by one or more variables
select: choose a specific variable or set of variables (i.e. select columns)
filter: subset a dataframe according to condition(s) on one or more variables (i.e. select rows)
mutate: transform a dataframe by adding new variables (e.g. a calculated column)
group_by: create a grouped dataframe
summarize: reduce a variable to a summary value (e.g. its mean)

Here, a variable is a column in the dataset.

We'll cover how to perform the above operations on a dataset using Pandas.

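from io import StringIO

import pandas as pd

# Sample data: two space-separated columns with a header row
text = '''colA colB
Jan 239
Feb 234
'''

# Wrap the string in a file-like object and parse it into a dataframe
df = pd.read_csv(StringIO(text), sep=' ')
print(df)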
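As a rough sketch of how the verbs listed earlier might map onto Pandas, the example below runs arrange, select, mutate, group_by and summarize on a small dataset. The extra rows, the colC column and the colB_share name are hypothetical additions made only for illustration, and the Pandas methods shown are one common choice for each verb rather than the only one.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data, extended from the two-row example above
text = '''colA colB colC
Jan 239 east
Feb 234 east
Jan 150 west
Feb 175 west
'''
df = pd.read_csv(StringIO(text), sep=' ')

# arrange: sort rows by a variable, largest colB first
arranged = df.sort_values('colB', ascending=False)

# select: keep a subset of columns
selected = df[['colA', 'colB']]

# mutate: add a calculated column (colB_share is a made-up name)
mutated = df.assign(colB_share=df['colB'] / df['colB'].sum())

# group_by + summarize: mean of colB for each value of colA
summary = df.groupby('colA', as_index=False)['colB'].mean()
print(summary)
```

Each step returns a new dataframe and leaves df unchanged, so the steps can be chained in whatever order the analysis needs.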

We can filter data to get a subset of rows from the complete dataset. This is similar to the WHERE clause in SQL.
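A minimal sketch of row filtering on the sample data, assuming the same colA/colB layout. The thresholds are arbitrary values chosen for illustration. It shows a boolean mask, a combined condition, and the string-based DataFrame.query form.

```python
from io import StringIO

import pandas as pd

text = '''colA colB
Jan 239
Feb 234
'''
df = pd.read_csv(StringIO(text), sep=' ')

# Boolean mask: roughly SELECT * FROM df WHERE colB > 235
high = df[df['colB'] > 235]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
jan_high = df[(df['colA'] == 'Jan') & (df['colB'] > 200)]

# DataFrame.query takes the condition as a string
feb = df.query("colA == 'Feb'")
print(high)
```

The mask form works with any expression that produces a boolean Series, while query can read closer to SQL for simple conditions.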

R is also an excellent programming language for data manipulation. dplyr is an R package that can be used to perform the above operations.

An excellent article by Ben, The 5 verbs of dplyr, provides more details on this.

Another article that compares R and Python can be found here.

The Pandas docs provide excellent details with examples.

2025-01-12 Jul 2018