Data Structures and Algorithms¶
Recursion
- Break a big problem into smaller subproblems of the same kind, solve the smallest one, and build the solution back up on top of it.
Sorting Algorithms
- insertion sort
- merge sort
- quicksort
- which to use depends on the data and how sorted it already is in a given scenario
Searching Algorithms
Binary
Breadth first
Depth first
Use breadth-first search when finding the shortest path between two nodes in a tree.
Don't use it on a tree with a high branching factor.
DSA Overview¶
Why is DSA important?
- Every problem has a budget of time and money, hence it is important to understand the memory cost and execution time of a solution.
Edge Cases
- extreme cases of input or situation
- e.g. empty array, badly formatted input
Steps to solve a problem using DSA
- state the problem clearly
- define inputs and outputs and their formats
- state test cases and edge cases in plain English
- state the correct solution in plain English
- code and test
- repeat steps 2-5
- analyze the algorithm's complexity and find inefficiencies
- refactor the code, test; repeat steps 2-8
Best Practices in Coding using DSA
- create a function
- give it a meaningful name and follow naming standards
- do test-driven development
- code, build, test, repeat in ~10-minute cycles; take small steps
- abstract code into functions and make them generic. E.g., repetitive logic can be moved to a separate function and reused where only the condition changes: in binary search the core logic is the same but the matching condition may vary, so make a generic binary-search function and use it from other functions.
- the solving approach and the coding are both important, but the approach and right technique matter more than the coding
- don't start coding directly; rather, write the problem statement clearly, then the solution in plain English, then code
Linear search (brute-force) algorithm
- simple: start from index 0 and traverse the whole array
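A minimal sketch of this (function and variable names are illustrative):

def linear_search(items, target):
    # Scan from index 0; return the index of target, or -1 if absent.
    for i in range(len(items)):
        if items[i] == target:
            return i
    return -1

print(linear_search([4, 2, 7, 1], 7))  # 2 - worst case scans all n elements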
Time calculation of code execution
A 1 GHz machine can execute 10^9 instructions in 1 second (giga = 10^9). Instructions are low-level binary code, so one high-level Python statement can execute tens or thousands of instructions. That is how you estimate the execution time of one statement; the set of statements in a function, multiplied by the number of iterations of that function in the worst case, gives its complexity: iterations --of--> function of statements --having--> statement --built_of--> instructions --processed_on--> CPU
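A rough worked example under these numbers: if one Python statement costs ~1,000 instructions and a function runs it 10^6 times in the worst case, that is ~10^9 instructions, i.e. about 1 second on a 1 GHz machine.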
Analyze complexity and find inefficiency
- can you reduce the number of iterations in the loop? eg, use binary search instead of linear search on a sorted array.
- the complexity of an algorithm is the maximum time it will take to finish, i.e. the total time in the worst case; it is usually expressed in terms of the array size n
- O(n), Big O notation, or 'order of n', means that a code block is executed at most n times
- O(log n) means that the code will iterate log n times
Time Complexity¶
Time Complexity
The amount of time taken in the worst-case scenario, given the length of the input.
O(1): simple variable access - constant time complexity
- given a list, you do not iterate the whole list, you just access one element; even if the list length increases, the time does not increase, as your function just prints one element
O(n): loop over a list - linear
O(log n): binary search, where you eliminate half the remaining results at each step
O(n^2): a for loop within a for loop, iterating over each pair of list elements - quadratic
O(n log n) (e.g. merge sort) also appears often; O(2^n) and O(n!) (n factorial) are not common.
You should know the time complexity of your solution.
If there are multiple steps, the overall time complexity is that of the worst step; that is what we want to know.
Big O
- O represents 'order of'. O(n) means the time complexity increases as the order of n: if n increases, the time increases with it.
- if you plot time t against n, say the equation is $t = an + b$ where $a$ and $b$ are constants, then t increases mainly with $a \times n$, which is of order n, hence $O(n)$. You ignore the coefficients in the equation.
- for $t = 20$, i.e. $t = 20 \times 1$, it is $O(1)$
- for $t = an^2 + c$, the notation is $O(n^2)$
- Big O does not depend on machine capability: a twice-as-fast machine only changes the coefficient, which is ignored. The function's complexity remains the same, that is, constant, linear, quadratic, etc., irrespective of the machine's capability.
Evaluating Big O
- an assignment operation has constant complexity, 1, so $O(1)$
- now, if this is inside a for loop, it repeats n times, hence the time is $n \times O(1)$, which is $O(n)$
- now, if there is another for loop outside this one, the time taken is $n \times O(n)$, which is $O(n \cdot n)$ or $O(n^2)$
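A minimal sketch of these three cases (function names are illustrative):

def constant(arr):
    x = arr[0]            # one assignment: O(1), independent of len(arr)
    return x

def linear(arr):
    total = 0
    for v in arr:         # the O(1) body repeats n times: O(n)
        total += v
    return total

def quadratic(arr):
    pairs = 0
    for a in arr:         # an O(n) loop ...
        for b in arr:     # ... nested inside another n-loop: O(n^2)
            pairs += 1
    return pairs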
Comparison of time complexity with common algo and common orders
Algo | Binary Search | Simple Search | Quicksort | Selection Sort | The Traveling Salesman |
---|---|---|---|---|---|
Complexity | $O(\log n)$ | $O(n)$ | $O(n\log n)$ | $O(n^2)$ | $O(n!)$ |
n = 10 | 0.3s | 1s | 3.3s | 10s | 4.2 days |
n = 100 | 0.6s | 10s | 66.4s | 16.6m | $2.9 \times 10^{149}$ yrs |
n = 1000 | 1s | 100s | 996s | 27.7 hrs | $1.27 \times 10^{2559}$ yrs |

(The timings assume a machine performing 10 operations per second.)
- Link - YK Sugi CS Dojo
from scipy.special import factorial
import numpy as np
from plotly.subplots import make_subplots
import plotly.graph_objects as go

x = np.arange(1, 5, 1)
fig = make_subplots(rows=1, cols=5, shared_yaxes=True)
# go.Scatter with mode='lines' replaces the deprecated go.Line
fig.add_trace(go.Scatter(x=x, y=np.log2(x), mode='lines', name='log n'), row=1, col=1)
fig.add_trace(go.Scatter(x=x, y=x, mode='lines', name='n'), row=1, col=2)
fig.add_trace(go.Scatter(x=x, y=x*np.log2(x), mode='lines', name='n log n'), row=1, col=3)
fig.add_trace(go.Scatter(x=x, y=x*x, mode='lines', name='n<sup>2</sup>'), row=1, col=4)
fig.add_trace(go.Scatter(x=x, y=factorial(x), mode='lines', name='n!'), row=1, col=5)
fig.update_layout(height=300, width=1000, title_text="Big O - Time Complexity")
fig.show()
Built In Data Types¶
Tuples
- good when you don't have to change the data
- immutable, iterable, indexed, ordered
Lists
- mutable, use when you have to change
- ordered
- useful methods: reverse, sort, append, pop, remove
Dictionaries
- mutable, iterable, ordered (3.7+)
Sets
- unique elements, mutable, fast membership tests, unordered (cannot index), iterable
- add, update, remove
- intersection, difference: math set operations
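A quick illustration of the four built-in types above:

point = (3, 4)                   # tuple: immutable, indexed, ordered
nums = [3, 1, 2]
nums.sort(); nums.append(4)      # list: mutable, ordered; sort/append in place
ids = {'alice': 1, 'bob': 2}     # dict: mutable key-value map, ordered in 3.7+
a, b = {1, 2, 3}, {2, 3, 4}
print(a & b, a - b)              # set math: intersection {2, 3}, difference {1}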
Abstract Data Types¶
These are general, well-known data types in the computer science domain, e.g. the linked list. People know what data a type can hold and how to access it. They are language-agnostic.
Advantages
- abstraction - you need not know the inner workings
- consistency - the inner implementation can change but the outer interface remains the same, so code using it need not change
Data Structure
- Implementation of an abstract data type.
- Implementation means how you realize an ADT in a program, e.g. how you implement a linked list in Python. Once you implement it, it becomes a data structure.
List vs LinkedList
- Memory: a list (array) stores elements in contiguous (bordering) memory, one next to another, so adding new elements requires contiguous memory to be available (otherwise they cannot be stored). This also means you have to reserve the memory even if you do not use it (e.g. in Java you initialize a 10-size array). Linked lists store values anywhere in memory where there is free space, and each element holds a pointer to the next element.
- Accessibility: arrays are good because there is a sequential index; you can directly access the last element, or the fifth, or the fourth-last, etc. But in a LL you have to traverse the whole list to access the last element, as a LL has no index. Arrays are good for random access; LLs are good for accessing the whole list.
Worst Case | Arrays | Linked Lists | Reason |
---|---|---|---|
Reading | O(1) | O(n) | Reading the last elem in a LL is the worst case, so O(n) |
Insertion | O(n) | O(1) | Insert at the start of an array and you have to move all elems, hence O(n) |
Deletion | O(n) | O(1) | Delete the 1st elem of an array and you move all elems one slot back |
- Use Case: many writes, few reads - e.g. an expense entry system where you enter expenses each day but read the total only at month end. You would prefer a LL, as it has a lower order of execution for writes. Remember: arrays allow fast reads; linked lists allow fast inserts and deletes.
Linked List
Each node has a pointer to the next element's memory location.
A list whose head pointer is empty is an empty linked list.
Recursive Data Type [?]
Operations
- Add
- Remove
- Is empty
- how many nodes
- find node's index given its key
When to use
- insert in between other items
- collection size is unknown
- don't need random access
- not concerned about memory
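A minimal singly linked list sketch covering the operations above (the class and method names are my own, not a standard API; remove is omitted for brevity but follows the same traversal pattern):

class Node:
    def __init__(self, data):
        self.data = data
        self.next = None      # pointer to the next element's location

class LinkedList:
    def __init__(self):
        self.head = None      # no pointer: an empty linked list

    def is_empty(self):
        return self.head is None

    def add(self, data):      # insert at the head: O(1)
        node = Node(data)
        node.next = self.head
        self.head = node

    def size(self):           # how many nodes: O(n) traversal
        count, cur = 0, self.head
        while cur:
            count, cur = count + 1, cur.next
        return count

    def find(self, key):      # a node's index given its key, or -1
        i, cur = 0, self.head
        while cur:
            if cur.data == key:
                return i
            i, cur = i + 1, cur.next
        return -1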
Linear Data Structures¶
Stacks
- Linear data structure
- LIFO
- maintains the order in which items were added, serving the most recent first
- use when you want quick access to the latest item
Queues
- FIFO
- Ordered
- Use when you have to process something in order
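A minimal sketch of both, using a Python list as the stack and collections.deque as the queue:

from collections import deque

stack = []
stack.append('a'); stack.append('b')  # push
print(stack.pop())                    # 'b' - LIFO: quick access to the latest item

queue = deque()
queue.append('a'); queue.append('b')  # enqueue
print(queue.popleft())                # 'a' - FIFO: processed in the order added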
Linked List
- Singly linked, ordered
- When you have an unknown number of items
- One-direction movement is ok
- Insert in between
Doubly linked list
- move both directions
- ordered
- when you need to move both ways
- sequential access of data
Non-Linear Data Structures¶
Non Linear Data Structures
- Trees
- Binary Search Trees
- Graphs
- Hash Maps
Trees
- hierarchical: levels and nodes
- use when you need to store hierarchical data
Binary
- smaller values on the left, larger on the right
- root is the midpoint
- lookup takes O(log n) time, as each step eliminates half the tree
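A minimal sketch of the smaller-left / larger-right rule and the halving lookup (names are illustrative; O(log n) holds while the tree stays balanced):

class BSTNode:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    if root is None:
        return BSTNode(key)
    if key < root.key:
        root.left = insert(root.left, key)    # smaller on the left
    else:
        root.right = insert(root.right, key)  # larger on the right
    return root

def lookup(root, key):
    if root is None:
        return False
    if key == root.key:
        return True
    # each step eliminates half the (balanced) tree
    return lookup(root.left, key) if key < root.key else lookup(root.right, key)

root = None
for k in [8, 3, 10, 1, 6]:
    root = insert(root, k)
print(lookup(root, 6), lookup(root, 7))  # True False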
Graphs
- A finite set of nodes connected by edges
- Edges can have optional values (attributes)
- when you need relationship between nodes
Hash Maps
- key - value
- like dictionaries, e.g. mapping a name to an ID, and the ID to a phone number
Binary Search¶
It searches a sorted list. It starts from the middle and compares the item; each comparison tells it which half the item must be in, so it ignores the other half. This makes the search possible in at most $\log_2 n$ steps.
How it is $\log_2 n$:
- since in each step the search space is halved (32 - 16 - 8 - 4 - 2 - 1), remember it as: the total elements keep getting chopped (logged) in base 2
- mathematically, the total n elements reduce by powers of 2, so ask: how many steps are there?
- what power of 2 gives 32? It is 5
- and $\log_2 32$ is what? It is 5
Hence it is an $O(\log n)$ complexity algorithm.
Binary Search Algorithm
- if an array is sorted, find the number in the middle, then look left or right
- complexity is log(n) - that means in the worst case the number of iterations will be log(n) for an array of length n
How?
- let's say we do k iterations (at max) to search a number in an array of length n, i.e. the loop will run no more than k times, which is the worst-case scenario; then:
- iter 1: length n/2
- iter 2: length n/4 = n/2^2
- iter 3: length n/8 = n/2^3
- ...
- iter k: length n/2^k; this is the last iteration, and we know that in the worst case the length will be reduced to 1
- so n/2^k = 1; solve it to get k = log2(n), which means the code will have to run at most log(n) times
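A minimal iterative sketch of the algorithm just derived:

def binary_search(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:                  # at most log2(n) halvings
        mid = (lo + hi) // 2         # find the number in the middle
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1             # look right
        else:
            hi = mid - 1             # look left
    return None

print(binary_search([1, 3, 5, 7, 9, 11], 9))  # 4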
Links
- Binary Search, Linked Lists and Complexity | Data Structures and Algorithms in Python - direct code, no plain algo dry run
Selection Sort¶
Sorting simply by comparing an item in the list with the other items is easy. Its complexity: item 1 is compared to n others, item 2 to n others, and so on until item n is compared to n others, which is n x O(n), or $O(n^2)$.
Complexity: $O(n^2)$
You actually compare one fewer item in each step (n-1, n-2, ...), which averages to n/2 comparisons per pass; the constant 1/2 is ignored in Big O.
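A minimal selection sort sketch; the inner scan is what makes each of the n passes cost O(n):

def selection_sort(arr):
    arr = list(arr)                       # sort a copy
    for i in range(len(arr)):             # n passes ...
        smallest = i
        for j in range(i + 1, len(arr)):  # ... each scanning the remainder
            if arr[j] < arr[smallest]:
                smallest = j
        arr[i], arr[smallest] = arr[smallest], arr[i]
    return arr

print(selection_sort([5, 3, 6, 2, 10]))   # [2, 3, 5, 6, 10]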
Recursion¶
Recursion is when a function calls itself.
Every recursive function has two parts: the base case and the recursive case. The base case tells the function when to stop recursing and end execution.
The recursive case is where the function calls itself; in the base case it doesn't, hence the recursion ends. E.g.:
def countdown(i):
    print(i)
    if i <= 0: # Base case
        return
    else: # Recursive case
        countdown(i-1)
Example 2: Factorial of a number
def fact(x):
    if x <= 1: # Base case (<= guards x=0 too, else the recursion never ends)
        return 1
    else: # Recursive case
        return x * fact(x-1)
Call Stacks
- During program execution there is a call stack, in which each item is a function call; the item is popped from the stack when the function returns or ends.
- So if a function B is called from a line in A, then B is stacked on top of A until B returns.
- In recursion, say A calls B, the factorial function: the stack builds as A, B(5), B(4), B(3), B(2), B(1). The top item, B(1), returns $1$; then the next frame returns $2 \times 1$, and so on, so you finally get $1 \times 2 \times 3 \times 4 \times 5$.
- Tip: When you’re writing a recursive function involving an array, the base case is often an empty array or an array with one element.
Quicksort¶
Divide & Conquer
D&C algorithms are recursive algorithms. To solve a problem using D&C, there are two steps:
- Figure out the base case. This should be the simplest possible case.
- Divide or decrease your problem until it becomes the base case.
D&C isn’t a simple algorithm that you can apply to a problem. Instead, it’s a way to think about a problem.
Quicksort Algorithm
- pick an element of the array, called the pivot
- make a new array having all elements less than the pivot, less[]
- make a new array having all elements greater than the pivot, greater[]
- recursively do the same thing on less and greater
Base Case: You need not sort an empty array or an array with one element; it is already sorted, e.g. [] or [20].
Recursive Case: You need to build the less and greater arrays and then do the same on each.
Code
def quicksort(array):
    if len(array) < 2:
        return array # Base case: arrays with 0 or 1 element are already "sorted"
    else:
        pivot = array[0] # Recursive case
        less = [i for i in array[1:] if i <= pivot]
        greater = [i for i in array[1:] if i > pivot]
        return quicksort(less) + [pivot] + quicksort(greater)
Complexity
$O(n \log n)$, average case
$O(n^2)$, worst case
Hash Tables¶
Let's build something that searches in O(1) complexity, that is, instantly. No matter how long the array is, you get the result instantly.
They’re also known as hash maps, maps, dictionaries, and associative arrays.
You’ll probably never have to implement hash tables yourself. Any good language will have an implementation for hash tables. Python has hash tables; they’re called dictionaries.
Hash Functions
- It is a function where you put in a string and get back a number. That number is the index where the value is stored.
- A dict in Python is such an implementation: you put in a key, it knows the index, reads it, and returns a value.
Code
# create a hash table
book = dict()
Usage
- Relationship of one thing to another, or key-value pair
- Filtering out duplicates
- They have no ordering
Implementation
- The hash function is important; it should avoid collisions (the same index for different keys).
- Reading an array is O(1): you simply read from memory at the index you provide. 1 element or 1 billion, you just read that index, hence O(1).
- Insert and delete in a linked list are fast, O(1).
- A hash table is the best of both, in the average case.
- If the load factor of a hash table (occupied slots / total slots) exceeds 0.7, resize it to maintain optimal performance.
- A real implementation of a good hash function is beyond our scope for now!
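A toy illustration only (made up for this note, nothing like a production hash function): map a key to an array index by summing character codes modulo the array size; two keys landing on the same index would be a collision.

def toy_hash(key, size):
    return sum(ord(c) for c in key) % size   # same key always -> same index

slots = [None] * 10
slots[toy_hash("apple", 10)] = 0.67          # write straight to the computed index
print(slots[toy_hash("apple", 10)])          # read back in O(1): 0.67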
Breadth-First Search¶
When trying to solve a problem like finding the shortest distance, try to model the problem as a graph and solve it using breadth-first search (BFS).
Graphs
- Rama ---owes---> Bob shows that Rama owes money to Bob. A graph can have many such relations, where 1 owes 2 and 3, and 3 owes 1 and 4, etc.
- It has nodes and edges.
- A directed graph has edges pointing to nodes; an undirected graph's edges have no direction.
BFS
- Let's say you want to find who in your friend network works at Google.
- You build a list of your friends and check one by one whether they work at Google.
- If not, you add their friends (friends of friends, 2nd-degree relations) to the end of the list, so that you can find someone working at Google anywhere in your network.
- But you search all your 1st-degree friends first; this is Breadth-First Search. Because you want the shortest path to a Google-working friend, you search all your 1st-degree connections before any 2nd-degree ones (i.e. search in the order they are added to the search list). The data structure for this is a queue, and for representing the relations, a hash table.
- BFS finds whether a path from A to B exists, and if so, it finds the shortest path.
Queues
- FIFO - first in, first out (like a queue at a bus stop); the opposite of a stack, which is LIFO.
Code
You can implement this using a list for the queue and a dictionary for the hash map, and maintain another list of already-searched nodes, to avoid a circular relation sending you into an infinite loop.
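A minimal sketch of the friend-of-friend search described above; the friend network and the works_at_google check are made up for illustration, and collections.deque stands in for the plain list queue:

from collections import deque

friends = {                        # hypothetical network, as a hash map
    'you': ['alice', 'bob'],
    'alice': ['peggy'],
    'bob': ['anuj', 'peggy'],
    'anuj': [], 'peggy': [], 'thom': []
}

def works_at_google(person):
    return person == 'peggy'       # stand-in for the real check

def bfs(start):
    queue = deque(friends[start])  # 1st-degree friends are searched first
    searched = []                  # avoids circular relations / infinite loops
    while queue:
        person = queue.popleft()
        if person in searched:
            continue
        if works_at_google(person):
            return person          # found via the shortest path
        queue += friends[person]   # friends of friends go to the end
        searched.append(person)
    return None

print(bfs('you'))                  # 'peggy'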
Time Complexity
- Following all edges takes $O(E)$, the number of edges
- Running the queue over every person takes $O(V)$, the number of vertices
- Running breadth-first search, i.e. both of the above, takes
- $O(V+E)$
Tree
- Say you see a family tree showing parents and children, where edges represent the parent relationship; they can point only one way, son ---parent---> father, as a son can't be the parent of his father. Such graphs are trees.
- A tree is a special type of graph where no edges ever point back.
flowchart TD
    Son --> Mother --> GrandFatherM
    Son --> Father --> GrandFatherF
    Father --> GrandMotherF
    Mother --> GrandMotherM
Dijkstra’s Algorithm¶
Weighted graphs: graphs whose edges carry a weight.
flowchart LR
    Start--6-->A
    Start--2-->B
    A--1-->Finish
    B--3-->A
    B--5-->Finish
Dijkstra’s algorithm
Let's say you have to find a path from start to finish: start is connected to some nodes, they to other nodes, and so on until finish, and each edge has a travel time.
- From all the neighbors of start, find the cheapest node. This is the node you can get to in the least amount of time*.
- Update the costs of the neighbors of this node.
- Repeat until you’ve done this for every node in the graph.
- Calculate the final path.
You can do this by maintaining a table with three columns (parent, node, cost) and updating it in each iteration as you traverse the nodes and edges.
Iter 1
Parent | Node | Cost |
---|---|---|
Start | Stop A | 6 |
Start | Stop B | 2 |
Stop A | Finish | infinity |
...
Keep updating cost and parent, for each node in each iteration and finally you can find the path.
Iter Last
Parent | Node | Cost |
---|---|---|
Stop B | Stop A | 5 |
Start | Stop B | 2 |
Stop A | Finish | 6 |
- *There is no other way to reach the cheapest node more cheaply than via the direct link from start, so it is safe to finalize it. Also, this is only true when you have positive weights.
Negative-weight edges
An edge can have a negative weight, e.g. you give money to B and B returns you money: A ---(-$2)---> B.
Now Dijkstra's algorithm does not work, as the claim "the shortest path to the cheapest node from start is the direct link" no longer holds. Hence, you can't use Dijkstra's algorithm if you have negative-weight edges.
The Bellman-Ford algorithm can be used in this case (out of scope for now!).
Implementation of Dijkstra
You need three hash tables to represent the weighted-graph data, plus a list (matching the table structure above) that keeps track of what has been processed while finding the shortest path:
- graph - dict - represents the weighted graph (see graph)
- costs - dict - keeps track of the lowest known cost to each node in each run (see table)
- parents - dict - keeps track of the parent giving the lowest cost in each run (see table)
- processed - list - keeps track of nodes already processed, so we do not repeat them
graph = {
    'start' : { # all neighbours to start
        'a': 6,
        'b': 2
    },
    'a' : { # all neighbours to a
        'finish' : 1
    },
    'b' : { # all neighbours to b
        'a' : 3,
        'finish' : 5
    },
    'finish' : {} # finish has no neighbours
}
costs = { # at Iter 1
    'a': 6,
    'b': 2,
    'finish': float("inf") # key must match 'finish' in graph
}
parents = { # at Iter 1
    'a': "start",
    'b': "start",
    'finish': None
}
Keep executing, and you will reach the final iteration, where the costs and parents hash tables hold the shortest-path route.
Shortest Path
flowchart LR
    Start-.6.->A
    Start==2==>B
    A==1==>Finish
    B==3==>A
    B-.5.->Finish
Code
graph = {
    'start' : { # all neighbours to start
        'a': 6,
        'b': 2
    },
    'a' : { # all neighbours to a
        'finish' : 1
    },
    'b' : { # all neighbours to b
        'a' : 3,
        'finish' : 5
    },
    'finish' : {} # finish has no neighbours
}
costs = { # at Iter 1
    'a': 6,
    'b': 2,
    'finish': float("inf") # key must match 'finish' in graph
}
parents = { # at Iter 1
    'a': "start",
    'b': "start",
    'finish': None
}
processed = []
# Dijkstra's Algo
def find_lowest_cost_node(costs):
    lowest_cost = float("inf")
    lowest_cost_node = None
    for node in costs:  # Go through each node.
        cost = costs[node]
        if cost < lowest_cost and node not in processed:  # If it's the lowest cost so far and hasn't been processed yet ...
            lowest_cost = cost  # ... set it as the new lowest-cost node.
            lowest_cost_node = node
    return lowest_cost_node

node = find_lowest_cost_node(costs)  # Find the cheapest node that you haven't processed yet.
while node is not None:  # If you've processed all the nodes, this while loop is done.
    cost = costs[node]
    neighbors = graph[node]
    for n in neighbors.keys():  # Go through all the neighbors of this node.
        new_cost = cost + neighbors[n]  # If it's cheaper to get to this neighbor
        if costs[n] > new_cost:  # by going through this node ...
            costs[n] = new_cost  # ... update the cost for this node.
            parents[n] = node  # This node becomes the new parent for this neighbor.
    processed.append(node)  # Mark the node as processed.
    node = find_lowest_cost_node(costs)  # Find the next node to process, and loop.
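To calculate the final path, walk the parents table back from finish to start after the loop above has run; a minimal sketch using the tables above:

node, path = 'finish', ['finish']
while node != 'start':
    node = parents[node]                # follow each node's cheapest parent
    path.append(node)
print(' -> '.join(reversed(path)))      # start -> b -> a -> finish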
Summary
- Breadth-first search is used to calculate the shortest path for an unweighted graph.
- Dijkstra’s algorithm is used to calculate the shortest path for a weighted graph.
- Dijkstra’s algorithm works when all the weights are positive.
- If you have negative weights, use the Bellman-Ford algorithm.
Greedy Algorithms¶
Some problems have no fast algorithmic solution. You can identify them and use a greedy strategy to find an approximate solution: at each step, pick the locally optimal choice.
Dynamic Programming¶
When a problem is hard to solve directly, we break it down into smaller subproblems, solve those, and combine their results; this is dynamic programming.
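A classic minimal example (Fibonacci is my illustration; the text names no specific problem): naive recursion re-solves the same small subproblems repeatedly, while caching each subproblem's answer makes it fast.

def fib(n, memo={}):      # memo caches already-solved subproblems
    if n <= 1:
        return n
    if n not in memo:
        memo[n] = fib(n - 1) + fib(n - 2)
    return memo[n]

print(fib(50))            # instant: 12586269025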
University of California, Berkeley¶
UC Berkeley (UCB) is the University of California campus located in Berkeley, California, USA. It has CS lectures with an open syllabus and course material (lecture slides/videos, assignments) available for free online in an organized manner.
- Courses have a page and a professor. Either the videos are on the course page, or you go to the professor's YouTube channel to find the lectures.
EECS is UCB's Electrical Engineering and Computer Science department.
EECS courses cover all fundamental and advanced CS topics. Each course has a page where you can see its prerequisites, which will be other courses.
- CS61A/B/C (alternatives CS47ABC)
- 61A: Teaches you how to think like a programmer
- 61B: The go-to class that turns a lot of people into great programmers or gets them interested in software.
- 61C: Everything left in CS that isn't taught in 61A/B: Low-level things. Clocks, CPUs, Assembly, C, etc.
- or skip/skim 61A altogether and just watch/do 61B, then the 61C material
- Those who have taken these courses claim that after completing the series they feel they can achieve or learn almost anything they want, because they are already well versed in the lingo and tools of CS: programming, problem solving, and low-level fundamentals.
- DA DSA
- CS 61A: Structure and Interpretation of Computer Programs
- Prof John DeNero
- mostly teaches in Python
- his YouTube channel has all lecture and assignment videos
- go to the playlists to see all lectures
- DataStructur.es
- info about all CS61 courses
- Resources Algo Visual
- CS 9A/B/C/D/E/F/G/H
- Basic programming courses
- CS 9C. C for Programmers
- CS 9D. Scheme and Functional Programming for Programmers
- CS 9E. Productive Use of the UNIX Environment
- CS 186. Introduction to Database Systems
- CS 261. Security in Computer Systems
- CS 162: Operating Systems and System Programming
- CS61A/B/C (alternatives CS47ABC)
Programming Systems (PS) - an area at UCB where many discoveries were made. It includes the following courses:
- CS 61A. The Structure and Interpretation of Computer Programs
- CS 61B. Data Structures
- CS 164. Programming Languages and Compilers
- CS 169. Software Engineering
- CS 263. Design of Programming Languages
- CS 264. Implementation of Programming Languages
- CS 265. Compiler Optimization and Code Generation
- CS C267. Applications of Parallel Computers
Seasons in the United States
- Spring - March to May. Q2
- Summer - June to August. Q3
- Fall/Autumn - September to November. Q4
- Winter - December to February. Q1
Books Recommended
- A Philosophy of Software Design by John Ousterhout - recommended by a student on Reddit
CS 61A: Structure and Interpretation of Computer Programs¶
Prof John DeNero
CS61A - Fall 2023 - 37 lectures have a playlist on YouTube. Text is on the site.
CS 61B Data Structures, Fall 2023¶
- Prof Josh Hug has a lecture playlist. Other playlists can be found on the course page.
Explore Study Material¶
UCB has the best open material, based on Reddit and my own search.
It will take 2-3 months to complete all the lectures, so be patient and just do it without second-guessing the result. It will benefit you.