Pandas

  • df.T - transposes the DataFrame (rows become columns)
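For example, with a throwaway frame:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.T  # rows and columns swap: index becomes ['a', 'b'], columns become [0, 1]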

Read CSV

# Data Types
dtype_ = {
    'Quantity' : 'int64',
    'Amount' : 'float64',
}

# Date Parser
my_date_parser = lambda x: pd.to_datetime(x, format="%d/%m/%Y", errors='coerce')

# Read CSV with explicit dtypes, thousands separator, and parsed dates
# (date_parser is deprecated since pandas 2.0; see the sketch below)
df = pd.read_csv(file, low_memory=False, dtype=dtype_, thousands=',',
                 parse_dates=['load_date', 'transaction_date'], date_parser=my_date_parser)

# find date range in huge CSV
df = pd.read_csv(filepath, usecols=['Date'], parse_dates=['Date'], date_parser=my_date_parser)
df.Date.max(), df.Date.min()
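Note that date_parser is deprecated since pandas 2.0. A sketch of the equivalent approach, parsing after the read:

# pandas 2.0+: read the date columns as plain strings, then parse
df = pd.read_csv(file, low_memory=False, dtype=dtype_, thousands=',')
for col in ['load_date', 'transaction_date']:
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y", errors='coerce')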

Select data

  • You can select data by index or label. These selectors not only read but also assign, that is, write data.
  • loc takes real labels, that is, column names and row names. It does not take positional indexes.
  • iloc takes positional indexes; labels can still be used afterwards in a separate set of square brackets (chained indexing).
  • at and iat access a single value in the DataFrame; see the sketch after the loc examples below.
  • Plain slicing with no function, df[2:4], works only on row positions.
df.iloc[1:3, 2:4]  # row slice, column slice; positional, same rules as list slicing
df.iloc[1:3][2:4]  # NOT the same: the second [2:4] slices rows of the result again
df.iloc[:, [2, 4, 6, 22]]  # all rows, specific column positions
df.iloc[1:3][['product', 'category']]  # chained indexing: positional rows, then label columns

df.loc[1:3, 'category']
df.loc[1:3, 'product':'value']  # all columns from product to value, inclusive
df.loc[1:3, ['product', 'value']]  # only product and value, nothing in between
df.loc[1, 'product']  # returns a scalar in the column's dtype (str, int, ...)
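As mentioned in the bullets above, at and iat read or write a single cell; a minimal sketch:

df.at[1, 'product']  # single value by row label and column name
df.iat[1, 2]         # single value by row and column position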

df[2:20:2]  # rows at positions 2 to 18, step 2, i.e. every other row

Filter Conditions - Where clause

df = df[df.column_name != value]
df[df.column_name.isin(values_list)]
df[~df.column_name.isin(values_list)] # reverse of condition
df = df[(df.height != 11) & (df.age != 31)] # multiple conditions
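For OR logic, use | and wrap each comparison in its own parentheses (a minimal sketch):

df[(df.height == 11) | (df.age == 31)]  # either condition matches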

Delete Data

  • Option 1: use a where-clause (boolean mask), slice the DataFrame, and assign the result to a new variable.
  • Option 2: use df.drop and pass the indexes to drop. To find the indexes, use query or df slicing.
# Option 1: build a True/False mask, slice the dataframe, store the result
df_new = df[df.col_name != value]

# Option 2: slice where value is 0, take the index, pass it to drop as row indexes, in place
df.drop(df[df['value'] == 0].index, inplace=True)

Updates / Inserts

  • Wherever you can select a cell or column, you can also assign a value there to make updates or inserts.
df.loc[4,'emp_name'] = 'John' # updates cell at row_index 4 and column "emp_name" with value 'John'

To insert a new row into a DataFrame, you can

  • concat the new row, as a DataFrame, to the existing DataFrame
  • use df.loc[-1] with an index shift to insert at the start

Insert DataFrame using Concat

# df is existing dataframe

targets = [
    {
        "year": 2023,
        "brand": "Target",
        "score": 52
    },
    {
        "year": 2024,
        "brand": "Target",
        "score": 68
    }
]
df_targets = pd.DataFrame(targets)

df = pd.concat([df, df_targets ]).reset_index(drop=True)

Here, the new DataFrame can be built from a list of one or more dicts. More on inserting rows: Stack Overflow.

Insert a List at a Row Index (start, end, middle, etc.)

 df.loc[-1] = [2, 3, 4]   # add a row at label -1
 df.index = df.index + 1  # shift the index so the new row becomes 0
 df = df.sort_index()     # sort by index to move it to the start
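The snippet above inserts at the start. To append at the end, assuming a default 0..n-1 RangeIndex:

df.loc[len(df)] = [2, 3, 4]  # label len(df) does not exist yet, so this appends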

Data Type Change

To avoid errors when converting float IDs to strings, and to handle NULLs, do:

df.cust_id = df.cust_id.fillna('')  # the method is fillna, and it must be assigned back
df.cust_id = df.cust_id.apply(lambda x: str(x).replace('.0', ''))  # 11.0 -> '11'

# astype(str) changes the dtype but keeps the decimal: 11.0 -> '11.0'
df['emp_id'] = df['emp_id'].astype(str)

# this converts 1.04 to '1' (int() truncates the fraction)
df.cust_id = df.cust_id.apply(lambda x: str(int(x)))

# to check
df.dtypes
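Alternatively, assuming the IDs are whole numbers, the nullable Int64 dtype keeps NULLs as <NA> without forcing the column to float (a sketch):

df['cust_id'] = df['cust_id'].astype('Int64')  # 11.0 -> 11, NaN -> <NA>; raises on non-integral values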

More on Stack Overflow

Group / Summarize / Aggregate

Group Data

  • aggregation functions: mean, sum, count, size, max, min, first, last; also agg for custom or multiple aggregations
# Group by a column with multiple aggregations (pass a list to group by multiple columns)
result = df.groupby('Courses').aggregate(
    {
        'Duration':'count',
        'Fee':['min','max'],
        'user_id': 'nunique'        # count distinct
    }
)

# Aggregate and rename; x, y, z are the new column names
df.agg(x=('A', 'max'), y=('B', 'min'), z=('C', 'mean'))

# agg and aggregate are aliases

# Example
df_array2.groupby(['region']).aggregate(
    p25 = ('score',lambda x: x.quantile(0.25) ),
    min_salary = ('salary', 'min'),
    p75 = ('score',lambda x: x.quantile(0.75) )
)

# Create a groupby object
df_group = df.groupby("Product_Category")
# Select only the required columns
df_columns = df_group[["UnitPrice(USD)", "Quantity"]]
# Apply the aggregate function
df_columns.mean()

User defined function with aggregate

def my_email_merge(x):
    ...

df_grouped = df.groupby(grp_by_cols).aggregate(
    email_id=('email_address', lambda x: my_email_merge(x)),
)

Above, my_email_merge() is a user-defined function: it receives the email_address values of each group, processes them, and the result is stored in a new column email_id of the grouped DataFrame.
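For illustration, a hypothetical my_email_merge that joins the distinct addresses in each group:

def my_email_merge(x):
    # x is the Series of email_address values within one group
    return '; '.join(x.dropna().astype(str).unique())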

Group and Work on Grouped Data

grouped = df_m[:].groupby(cols_to_group, dropna=False)  # df_m[:] takes a copy
# dropna=False keeps rows that have NaN in cols_to_group

def build_category(x):
    # x is df in a group
    ...

df['category'] = grouped.apply(lambda x: build_category(x), include_groups=False).values


# returning multiple values from a UDF on grouped data
# (df_g is a per-group frame, e.g. the result of an aggregation, aligned with `grouped`)
res = grouped.apply(lambda x: build_category(x), include_groups=False).values
df_g['category'] = [r[0] for r in res]
df_g['data'] = [r[1] for r in res]


# add more columns from grouped
df_g['name'] = grouped['name'].first().values
df_g['surname'] = grouped['surname'].first().values
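For illustration, a hypothetical build_category, assuming a numeric value column, that returns two values per group:

def build_category(x):
    # x is the sub-DataFrame of one group; return (category, data)
    total = x['value'].sum()
    return ('high' if total > 100 else 'low', total)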

Sort Data

# Sort Descending
df.sort_values(by='count',ascending=False)

# Sort by custom order
source_sort_order = {'External': 1, 'Third Party': 2, 'Internal': 3}
df_m.sort_values(by=['source'], key=lambda x: x.map(source_sort_order), inplace=True)

Create New Columns or Calculated Fields

# with axis=1, apply works row-wise: x is a row (Series) and the function runs once per row

def build_some(x):
    fname = x['first_name'].strip().title()
    name = fname  # default, so name is defined even when surname is missing

    if pd.notnull(x['surname']) and len(x['surname']) > 0:
        name = fname + ' ' + x['surname']

    return (name, fname)

df[ ['name', 'fname'] ] = df.apply(lambda x : build_some(x) , axis = 1, result_type='expand')

# Example using function
df['item_type'] = df.apply(lambda x : get_item_type(x), axis=1)
def get_item_type(x):
    y = ""
    if x['order_type'] == "Valid" and len(x['inbound_reason']) > 1:
        y = "Inbound"
    elif x['order_type'] == "Need more info to respond":
        y = "Need Info"
    elif x['order_type'] == "Error in Data":
        y = "Error in Reporting"
    else:
        y = "Awaiting Category"
    return y

# Example, single line; pd.isnull also catches NaN, which `is None` would miss
df['has_responded'] = df.apply(lambda x: 'Awaiting' if pd.isnull(x['responded_by']) else 'Responded', axis=1)

ID Column

sql_max_id = """
select max(id) as max_id from [dbo].[employee]
"""

import sqlalchemy

connection = sqlalchemy.create_engine(connection_string, echo=False)
df_max_id = pd.read_sql(sql=sql_max_id, con=connection)

# hold the max id from the table (falls back to 0 when the result is None)
max_id = df_max_id.iat[0, 0] or 0

# assign ids from max_id+1 up to the new length
df_users_dvs['id'] = range(max_id + 1, max_id + 1 + len(df_users_dvs))

# Ideally use, auto-increment / Identity

Date Operations

df_wm['date_wm'].dt.year # year from datetime
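A few more common .dt accessors on the same column:

df_wm['date_wm'].dt.month               # month number
df_wm['date_wm'].dt.day_name()          # e.g. 'Monday'
df_wm['date_wm'].dt.strftime('%Y-%m')   # format as string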

Renaming Column Names

import re

def readable_column_names(df):
    """Makes column name sentance readable

    re.sub(pattern, repl, string)
    find a pattern and replaces in a string

    Args:
        df (DataFrame): data frame to clean

    Returns:
        DataFrame: cleaned readable column names
    """
    col_list = []  # holds names to check duplicates
    renamer = dict()  # holds col name and its number of duplicates

    for col in df.columns:

        new_col_name = col

        # camel case to space
        new_col_name = re.sub('([a-z0-9])([A-Z])', r'\1 \2', new_col_name).lower()

        new_col_name = new_col_name.replace("_", " ").replace("-", " ").title()

        if new_col_name in col_list:
            # duplicate name: add a numeric suffix
            index = int(renamer.get(new_col_name, 0))
            renamer[new_col_name] = index + 1
            new_col_name += "_" + str(index + 1)
        else:
            col_list.append(new_col_name)
        # print(f'col: {col}, new: {new_col_name}')
        df.rename(columns={col: new_col_name}, inplace=True)

    return df


def system_column_names(df):
    """Converts column names to snake_case

    Args:
        df (dataframe): dataframe

    Returns:
        dataframe: dataframe with updated column names
    """
    col_list = []     # holds names to check duplicates
    renamer = dict()  # holds col name and its number of duplicates
    for col in df.columns:

        new_col_name = col.strip()

        # Camel case to space
        new_col_name = re.sub('([a-z0-9])([A-Z])', r'\1_\2', new_col_name).lower()

        # collapse multiple spaces into one (raw string avoids an invalid-escape warning)
        new_col_name = re.sub(r'\s{2,}', ' ', new_col_name)

        new_col_name = new_col_name.lower().replace(' ', '_').replace('-', '_')

        # collapse multiple underscores into one
        new_col_name = re.sub('_{2,}', '_', new_col_name)


        if new_col_name in col_list:
            # duplicate name: add a numeric suffix
            index = int(renamer.get(new_col_name, 0))
            renamer[new_col_name] = index + 1
            new_col_name += '_' + str(index + 1)
        else:
            col_list.append(new_col_name)
        # print(f'col: {col}, new: {new_col_name}')
        df.rename(columns={col: new_col_name}, inplace=True)
    return df
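A quick usage sketch with hypothetical column names, showing the duplicate handling:

df = pd.DataFrame(columns=['First Name', 'firstName', 'Order-ID'])
system_column_names(df).columns
# Index(['first_name', 'first_name_1', 'order_id'], dtype='object')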

Plot in Pandas

  • Params
    • kind:str - The kind of plot to produce:
      • 'line' : line plot (default)
      • 'bar' : vertical bar plot
      • 'barh' : horizontal bar plot
      • 'hist' : histogram
      • 'box' : boxplot
      • 'kde' : Kernel Density Estimation plot
      • 'density' : same as 'kde'
      • 'area' : area plot
      • 'pie' : pie plot
      • 'scatter' : scatter plot (DataFrame only)
      • 'hexbin' : hexbin plot (DataFrame only)
import pandas as pd
import matplotlib.pyplot as plt
# ...
df.plot(kind='scatter', x='Duration', y='Calories')
plt.show()

pandas.DataFrame.query

df_new = df.query('sales != 0')
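query can also reference Python variables with @ (the column names here are hypothetical):

threshold = 100
df_new = df.query('sales > @threshold and region == "EU"')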

pandas.DataFrame.drop

  • pandas.DataFrame.drop
  • is used for dropping whole columns or rows
  • Params
    • labels - a list of row indexes or column names to drop.
    • axis - 0 for rows, 1 for columns; default is 0 (rows).
  • Returns
    • DataFrame
df.drop(labels=['Employee_Name2'], axis=1) # deletes column
df.drop([0, 1]) # deletes row at index 0 and 1, that is, first and second row.

pandas.to_datetime

df['date'] = pd.to_datetime(df['date_'], format='%d %b %Y')
df['load_datetime'] = pd.to_datetime(arg='now', utc=True)
