Pandas
df.T
- Transposes the dataframe (swaps rows and columns).
Read CSV
# Data types
dtype_ = {
    'Quantity': 'int64',
    'Amount': 'float64',
}
# Date parser (date_parser is deprecated since pandas 2.0; see the note after this block)
my_date_parser = lambda x: pd.to_datetime(x, format="%d/%m/%Y", errors='coerce')
# Read CSV with proper data types, parsed dates, and thousands-separated currency
df = pd.read_csv(file, low_memory=False, dtype=dtype_, thousands=',',
                 parse_dates=['load_date', 'transaction_date'], date_parser=my_date_parser)
# Find the date range in a huge CSV by reading only the date column
df = pd.read_csv(filepath, usecols=['Date'], parse_dates=['Date'], date_parser=my_date_parser)
df.Date.max(), df.Date.min()
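On pandas 2.0+, date_parser is deprecated in favour of the date_format argument; a minimal sketch of the same read using it (note: the errors='coerce' behaviour of the lambda above is not replicated):
df = pd.read_csv(file, low_memory=False, dtype=dtype_, thousands=',',
                 parse_dates=['load_date', 'transaction_date'], date_format="%d/%m/%Y")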
Select data
- You can select data by index or label. Not only select but also assign, that is, write data.
- loc takes real labels, that is, column names and row names. It does not take positional indexes.
- iloc takes positional indexes, and in separate square brackets it can also take labels.
- at and iat are for accessing only a single value in the dataframe (single-value example after the code below).
- Plain slicing with no function, df[2:4], works only on row positions.
df.iloc[1:3, 2:4]   # pass row, column; takes positions the same way as list slicing
df.iloc[1:3][2:4]   # NOT the same as above: the second [2:4] slices rows of the 2-row result (empty here)
df.iloc[:, [2, 4, 6, 22]]   # all rows and specific columns by position
df.iloc[1:3][['product', 'category']]   # chained: slice rows by position, then pick columns by label
df.loc[1:3, 'category']
df.loc[1:3, 'product':'value']    # all columns between product and value, inclusive
df.loc[1:3, ['product', 'value']]  # only product and value, not the columns in between
df.loc[1, 'product']   # returns a single cell, typed as the column (str, int, ...)
df[2:20:2]   # rows at positions 2 to 19 with step 2, that is, every alternate row
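at and iat from the list above read or write a single cell; a quick sketch on the same product column:
df.at[1, 'product']      # one value by row label and column label
df.iat[1, 2]             # one value by row position and column position
df.at[1, 'product'] = 'Widget'   # both can also assign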
Filter Conditions - Where clause
df = df[df.column_name != value]
df[df.column_name.isin(values_list)]
df[~df.column_name.isin(values_list)] # reverse of condition
df = df[(df.height != 11) & (df.age != 31)] # multiple conditions
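For text columns, substring matching is a common extra filter; a sketch assuming a hypothetical description column, with na=False so nulls drop out instead of raising:
df[df['description'].str.contains('refund', case=False, na=False)]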
Delete Data
- Option 1: Use a where-clause, then index slicing, and copy the result to a df_new variable.
- Option 2: Use df.drop and pass the indexes to drop. To find the indexes to drop, use query or df-slicing.
# Option 1: build a True/False mask, slice the dataframe, store the result
df_new = df[df.col_name != value]
# Option 2: where value is 0, slice the dataframe, get its index, pass to drop as row indexes to drop, in place
df.drop(df[df['value'] == 0].index, inplace=True)
Updates / Inserts
- The way you select a cell or column, there you can assign a value to make updates or inserts (see the sketch below).
- To insert a new row into a dataframe, you can concat the new row as a dataframe to the existing dataframe. df.loc[-1] can be used to insert at the start.
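A minimal update sketch, assuming a hypothetical status column: select with loc, then assign.
df.loc[df['status'] == 'old', 'status'] = 'archived'   # update all matching rows
df.loc[2, 'status'] = 'new'                            # update a single cell by row label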
Insert DataFrame using Concat
# df is existing dataframe
targets = [
    {
        "year": 2023,
        "brand": "Target",
        "score": 52
    },
    {
        "year": 2024,
        "brand": "Target",
        "score": 68
    }
]
df_targets = pd.DataFrame(targets)
df = pd.concat([df, df_targets]).reset_index(drop=True)
Here, the new dataframe can be built from a list of one or more dicts. More on inserting into a DataFrame: Stack Overflow.
Insert a List as a Row at an Index (start, end, middle, etc.)
df.loc[-1] = [2, 3, 4] # adding a row
df.index = df.index + 1 # shifting index
df = df.sort_index() # sorting by index
Data Type Change
To avoid errors when converting float IDs to strings, and to handle NULLs, do:
# this changes the datatype but keeps the decimal: 11.0 becomes '11.0'
df['emp_id'] = df['emp_id'].astype(str)
# this converts 1.04 to '1' (truncates through int); note it raises on NaN
df.cust_id = df.cust_id.apply(lambda x: str(int(x)))
# to check
df.dtypes
More on Stack Overflow.
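A null-safe alternative, assuming all non-null IDs are whole numbers: cast through pandas' nullable Int64 dtype so NaN survives as <NA> instead of raising inside int(x).
# 11.0 -> '11', NaN -> '<NA>' (assumes the non-null values are integral)
df['cust_id'] = df['cust_id'].astype('Int64').astype(str)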
Group / Summarize / Aggregate
df.product.nunique()
- count of unique products in the dataframe. See pandas.Series.nunique.
df.product.value_counts()
- unique products and the record count for each. See pandas.Series.value_counts.
Group Data
- Aggregation functions: mean, sum, count, size, max, min, first, last. Also, agg.
# Groupby multiple columns & multiple aggregations
result = df.groupby('Courses').aggregate(
    {
        'Duration': 'count',
        'Fee': ['min', 'max'],
        'user_id': 'nunique'  # count distinct
    }
)
# Aggregate and rename, x y z are new col names
df.agg(x=('A', 'max'), y=('B', 'min'), z=('C', 'mean'))
# agg and aggregate are aliases; they behave the same.
# Example
df_array2.groupby(['region']).aggregate(
    p25=('score', lambda x: x.quantile(0.25)),
    min_salary=('salary', 'min'),
    p75=('score', lambda x: x.quantile(0.75))
)
# Create a groupby object
df_group = df.groupby("Product_Category")
# Select only required columns
df_columns = df_group[["UnitPrice(USD)", "Quantity"]]
# Apply aggregate function
df_columns.mean()
User defined function with aggregate
def my_email_merge(x):
    ...

df_grouped = df.groupby(grp_by_cols).aggregate(
    email_id=('email_address', lambda x: my_email_merge(x)),
)
In the above, my_email_merge() is a user-defined function: it takes the grouped email_address column, does its processing, and the result is stored in a new column email_id in the grouped dataframe.
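The real processing is elided above; a hypothetical body, assuming the goal is to merge a group's distinct addresses into one string:
def my_email_merge(x):
    # x is the email_address Series of one group (hypothetical implementation)
    return '; '.join(x.dropna().unique())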
Group and Work on Grouped Data
grouped = df_m[:].groupby(cols_to_group, dropna=False)
# dropna false will keep rows that have null value in cols_to_group
def build_category(x):
    # x is the sub-dataframe for one group
    ...

df['category'] = grouped.apply(lambda x: build_category(x), include_groups=False).values
# returning multiple values from UDF on Grouped Data
res = grouped.apply(lambda x: build_category(x), include_groups=False).values
df_g['category'] = [r[0] for r in res]
df_g['data'] = [r[1] for r in res]
# add more columns from grouped
df_g['name'] = grouped['name'].first().values
df_g['surname'] = grouped['surname'].first().values
- More on aggregation: PyData pandas Aggregate docs
- Examples from the PyData pandas user guide
Sort Data
# Sort Descending
df.sort_values(by='count', ascending=False)
# Sort by custom order
source_sort_order = {'External': 1, 'Third Party': 2, 'Internal': 3}
df_m.sort_values(by=['source'], key=lambda x: x.map(source_sort_order), inplace=True)
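sort_values also takes lists for multi-column sorts with a per-column direction; a small sketch assuming source and count columns:
df.sort_values(by=['source', 'count'], ascending=[True, False])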
Create New Columns or Calculated Fields
# applying with axis=1, so x is a row (a Series) and the function runs once per row
def build_some(x):
    fname = x['first_name'].strip().title()
    if pd.notnull(x['surname']) and len(x['surname']) > 0:
        name = fname + ' ' + x['surname']
    else:
        name = fname  # fall back to first name when surname is missing
    return (name, fname)

df[['name', 'fname']] = df.apply(lambda x: build_some(x), axis=1, result_type='expand')
# Example using function
df['item_type'] = df.apply(lambda x : get_item_type(x), axis=1)
def get_item_type(x):
    if x['order_type'] == "Valid" and len(x['inbound_reason']) > 1:
        y = "Inbound"
    elif x['order_type'] == "Need more info to respond":
        y = "Need Info"
    elif x['order_type'] == "Error in Data":
        y = "Error in Reporting"
    else:
        y = "Awaiting Category"
    return y
# Example: single line; pd.isnull catches both None and NaN, which `is None` would miss
df['has_responded'] = df.apply(lambda x: 'Awaiting' if pd.isnull(x['responded_by']) else 'Responded', axis=1)
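The row-wise apply above can usually be vectorised; a sketch of the same get_item_type logic with numpy.select (much faster on large frames; assumes the same columns):
import numpy as np

conditions = [
    (df['order_type'] == 'Valid') & (df['inbound_reason'].str.len() > 1),
    df['order_type'] == 'Need more info to respond',
    df['order_type'] == 'Error in Data',
]
choices = ['Inbound', 'Need Info', 'Error in Reporting']
df['item_type'] = np.select(conditions, choices, default='Awaiting Category')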
ID Column
import sqlalchemy

sql_max_id = """
select max(id) as max_id from [dbo].[employee]
"""
connection = sqlalchemy.create_engine(connection_string, echo=False)
df_max_id = pd.read_sql(sql=sql_max_id, con=connection)
# hold the max id in the table; fall back to 0 when the table is empty and max(id) comes back as None
max_id = df_max_id.iat[0, 0] or 0
# assign new ids from max_id + 1 up to the new length
df_users_dvs['id'] = range(max_id + 1, max_id + 1 + len(df_users_dvs))
# Ideally, use auto-increment / Identity in the database instead.
Date Operations
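A minimal sketch of common date operations with the .dt accessor, assuming a transaction_date column parsed as in the Read CSV section:
df['year'] = df['transaction_date'].dt.year
df['month'] = df['transaction_date'].dt.month_name()
df['weekday'] = df['transaction_date'].dt.day_name()
# age of each transaction in days
df['age_days'] = (pd.Timestamp('now') - df['transaction_date']).dt.days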
Renaming Column Names
import re
def readable_column_names(df):
    """Makes column names sentence-readable.

    re.sub(pattern, repl, string) finds a pattern and replaces it in a string.

    Args:
        df (DataFrame): data frame to clean
    Returns:
        DataFrame: data frame with cleaned, readable column names
    """
    col_list = []     # holds names to check duplicates
    renamer = dict()  # holds col name and its number of duplicates
    for col in df.columns:
        new_col_name = col
        # camel case to space
        new_col_name = re.sub(r'([a-z0-9])([A-Z])', r'\1 \2', new_col_name).lower()
        new_col_name = new_col_name.replace("_", " ").replace("-", " ").title()
        if new_col_name in col_list:
            # duplicate: append a running suffix
            index = int(renamer.get(new_col_name, 0))
            renamer[new_col_name] = index + 1
            new_col_name += "_" + str(index + 1)
        else:
            col_list.append(new_col_name)
        # print(f'col: {col}, new: {new_col_name}')
        df.rename(columns={col: new_col_name}, inplace=True)
    return df
def system_column_names(df):
    """Converts column names to snake_case.

    Args:
        df (DataFrame): dataframe to clean
    Returns:
        DataFrame: dataframe with updated column names
    """
    col_list = []     # holds names to check duplicates
    renamer = dict()  # holds col name and its number of duplicates
    for col in df.columns:
        new_col_name = col.strip()
        # camel case to snake_case
        new_col_name = re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', new_col_name).lower()
        # collapse multiple spaces to a single space
        new_col_name = re.sub(r'\s{2,}', ' ', new_col_name)
        new_col_name = new_col_name.lower().replace(' ', '_').replace('-', '_')
        # collapse multiple underscores to a single underscore
        new_col_name = re.sub(r'_{2,}', '_', new_col_name)
        if new_col_name in col_list:
            # duplicate: append a running suffix
            index = int(renamer.get(new_col_name, 0))
            renamer[new_col_name] = index + 1
            new_col_name += '_' + str(index + 1)
        else:
            col_list.append(new_col_name)
        # print(f'col: {col}, new: {new_col_name}')
        df.rename(columns={col: new_col_name}, inplace=True)
    return df
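Quick usage sketch on a hypothetical frame, showing the duplicate suffixing:
df = pd.DataFrame(columns=['FirstName', 'Last-Name', 'first name'])
df = system_column_names(df)
print(df.columns.tolist())   # ['first_name', 'last_name', 'first_name_1']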
Plot in Pandas
- Params
  - kind : str - the kind of plot to produce:
    - 'line' : line plot (default)
    - 'bar' : vertical bar plot
    - 'barh' : horizontal bar plot
    - 'hist' : histogram
    - 'box' : boxplot
    - 'kde' : Kernel Density Estimation plot
    - 'density' : same as 'kde'
    - 'area' : area plot
    - 'pie' : pie plot
    - 'scatter' : scatter plot (DataFrame only)
    - 'hexbin' : hexbin plot (DataFrame only)
import pandas as pd
import matplotlib.pyplot as plt
# ...
df.plot(kind='scatter', x='Duration', y='Calories')
plt.show()
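The same plot call works on aggregated series; a sketch assuming a hypothetical category column:
df['category'].value_counts().plot(kind='barh')
plt.show()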
pandas.DataFrame.query
- pandas.DataFrame.query
- pass a where-condition as a string
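The where-clause filters from the Filter Conditions section can be written as query strings; @name references a Python variable:
df.query('height != 11 and age != 31')
min_age = 30
df.query('age > @min_age')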
pandas.DataFrame.drop
- pandas.DataFrame.drop
- is used for dropping whole columns or rows
- Params
  - labels - list of index or column labels, that is, a row index or a column name.
  - axis - 0 for rows, 1 for columns; default is 0 (rows).
- Returns
  - DataFrame
df.drop(labels=['Employee_Name2'], axis=1) # deletes column
df.drop([0, 1]) # deletes row at index 0 and 1, that is, first and second row.
pandas.to_datetime
df['date'] = pd.to_datetime(df['date_'], format='%d %b %Y')
df['load_datetime'] = pd.to_datetime(arg='now', utc=True)