Data Structures and Algorithms¶
Recursion
- Break a big problem into smaller subproblems of the same kind, solve the smallest one, and build the solution back up on top of it.
Sorting Algorithms
- insertion sort
- merge sort
- quicksort
- which to use depends on the data and how sorted it already is in a given scenario
Searching Algorithms
Binary
Breadth first
Depth first
Use breadth-first search when finding the shortest path between two nodes in a tree.
Don't use it on a tree with a high branching factor.
DSA Overview¶
Why is DSA important?
- Every problem has a budget of time and money, hence it is important to understand the memory cost and execution time of a solution.
Edge Cases
- extreme cases of input or situation
- e.g. empty array, badly formatted input
Steps to solve a problem using DSA
- state the problem clearly
- define inputs and outputs and their formats
- state test cases and edge cases in plain English
- state the correct solution in plain English
- code and test
- repeat steps 2-5
- analyze the algorithm's complexity and find inefficiencies
- refactor the code, test; repeat steps 2-8
Best Practices in Coding using DSA
- create a function
- give it a meaningful name and follow naming standards
- do test-driven development
- code, build, test, repeat in ~10-minute cycles; take small steps
- abstract code into functions and make them generic. E.g., repetitive logic can be moved to a separate function and reused where only the condition changes: in binary search the core logic is the same but the matching condition may vary, so make a generic binary-search function and use it from other functions.
- the solving approach and the coding are both important, but the approach and right technique matter more than the coding
- don't start coding directly; rather, write the problem statement clearly, then the solution in plain English, then code
Linear search (brute-force) algorithm
- simple: start from index 0 and traverse the whole array
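A minimal sketch of this (function and variable names are illustrative):

def linear_search(items, target):
    # Scan from index 0; return the index of target, or -1 if absent.
    for i in range(len(items)):
        if items[i] == target:
            return i
    return -1

print(linear_search([4, 2, 7, 1], 7))  # 2 - worst case scans all n elements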
Time calculation of code execution
A 1 GHz machine can execute 10^9 instructions in 1 second (giga = 10^9). Instructions are low-level binary code, so one high-level Python statement can execute tens or thousands of instructions. That is how you estimate the execution time of one statement; the set of statements in a function, multiplied by the number of iterations of that function in the worst case, gives its complexity: iterations --of--> function of statements --having--> statement --built_of--> instructions --processed_on--> CPU
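A rough worked example under these numbers: if one Python statement costs ~1,000 instructions and a function runs it 10^6 times in the worst case, that is ~10^9 instructions, i.e. about 1 second on a 1 GHz machine.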
Analyze complexity and find inefficiency
- can you reduce the number of iterations in the loop? eg, use binary search instead of linear search on a sorted array.
- the complexity of an algorithm is the maximum time it will take to finish, i.e. the total time in the worst case; it is usually expressed in terms of the array size n
- O(n), Big O notation, or 'order of n', means that a code block is executed at most n times
- O(log n) means that the code will iterate log n times
Time Complexity¶
Time Complexity
The amount of time taken in the worst-case scenario, given the length of the input.
O(1): simple variable access - constant time complexity
- given a list, you do not iterate the whole list, you just access one element; even if the list length increases, the time does not increase, as your function just prints one element
O(n): loop over a list - linear
O(log n): binary search, where you eliminate half the remaining results at each step
O(n^2): a for loop within a for loop, iterating over each pair of list elements - quadratic
O(n log n) (e.g. merge sort) also appears often; O(2^n) and O(n!) (n factorial) are not common.
You should know the time complexity of your solution.
If there are multiple steps, the overall time complexity is that of the worst step; that is what we want to know.
Big O
- O represents 'order of'. O(n) means the time complexity increases as the order of n: if n increases, the time increases with it.
- if you plot time t against n, say the equation is $t = an + b$ where $a$ and $b$ are constants, then t increases mainly with $a \times n$, which is of order n, hence $O(n)$. You ignore the coefficients in the equation.
- for $t = 20$, i.e. $t = 20 \times 1$, it is $O(1)$
- for $t = an^2 + c$, the notation is $O(n^2)$
- Big O does not depend on machine capability: a twice-as-fast machine only changes the coefficient, which is ignored. The function's complexity remains the same, that is, constant, linear, quadratic, etc., irrespective of the machine's capability.
Evaluating Big O
- an assignment operation has constant complexity, 1, so $O(1)$
- now, if this is inside a for loop, it repeats n times, hence the time is $n \times O(1)$, which is $O(n)$
- now, if there is another for loop outside this one, the time taken is $n \times O(n)$, which is $O(n \cdot n)$ or $O(n^2)$
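A minimal sketch of these three cases (function names are illustrative):

def constant(arr):
    x = arr[0]            # one assignment: O(1), independent of len(arr)
    return x

def linear(arr):
    total = 0
    for v in arr:         # the O(1) body repeats n times: O(n)
        total += v
    return total

def quadratic(arr):
    pairs = 0
    for a in arr:         # an O(n) loop ...
        for b in arr:     # ... nested inside another n-loop: O(n^2)
            pairs += 1
    return pairs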
Comparison of time complexity with common algo and common orders
Algo | Binary Search | Simple Search | Quicksort | Selection Sort | The Traveling Salesman |
---|---|---|---|---|---|
Complexity | $O(\log n)$ | $O(n)$ | $O(n\log n)$ | $O(n^2)$ | $O(n!)$ |
n = 10 | 0.3s | 1s | 3.3s | 10s | 4.2 days |
n = 100 | 0.6s | 10s | 66.4s | 16.6m | $2.9 \times 10^{149}$ yrs |
n = 1000 | 1s | 100s | 996s | 27.7 hrs | $1.27 \times 10^{2559}$ yrs |

(The timings assume a machine performing 10 operations per second.)
- Link - YK Sugi CS Dojo
from scipy.special import factorial
import numpy as np
from plotly.subplots import make_subplots
import plotly.graph_objects as go

x = np.arange(1, 5, 1)
fig = make_subplots(rows=1, cols=5, shared_yaxes=True)
# go.Scatter with mode='lines' replaces the deprecated go.Line
fig.add_trace(go.Scatter(x=x, y=np.log2(x), mode='lines', name='log n'), row=1, col=1)
fig.add_trace(go.Scatter(x=x, y=x, mode='lines', name='n'), row=1, col=2)
fig.add_trace(go.Scatter(x=x, y=x*np.log2(x), mode='lines', name='n log n'), row=1, col=3)
fig.add_trace(go.Scatter(x=x, y=x*x, mode='lines', name='n<sup>2</sup>'), row=1, col=4)
fig.add_trace(go.Scatter(x=x, y=factorial(x), mode='lines', name='n!'), row=1, col=5)
fig.update_layout(height=300, width=1000, title_text="Big O - Time Complexity")
fig.show()
Built In Data Types¶
Tuples
- good when you don't have to change the data
- immutable, iterable, indexed, ordered
Lists
- mutable, use when you have to change
- ordered
- useful methods: reverse, sort, append, pop, remove
Dictionaries
- mutable, iterable, ordered (3.7+)
Sets
- unique elements, mutable, fast membership tests, unordered (cannot index), iterable
- add, update, remove
- intersection, difference: math set operations
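A quick illustration of the four built-in types above:

point = (3, 4)                   # tuple: immutable, indexed, ordered
nums = [3, 1, 2]
nums.sort(); nums.append(4)      # list: mutable, ordered; sort/append in place
ids = {'alice': 1, 'bob': 2}     # dict: mutable key-value map, ordered in 3.7+
a, b = {1, 2, 3}, {2, 3, 4}
print(a & b, a - b)              # set math: intersection {2, 3}, difference {1}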
Abstract Data Types¶
These are general, well-known data types in the computer science domain, e.g. the linked list. People know what data a type can hold and how to access it. They are language-agnostic.
Advantages
- abstraction - you need not know the inner workings
- consistency - the inner implementation can change but the outer interface remains the same, so code using it need not change
Data Structure
- Implementation of an abstract data type.
- Implementation means how you realize an ADT in a program, e.g. how you implement a linked list in Python. Once you implement it, it becomes a data structure.
List vs LinkedList
- Memory: a list (array) stores elements in contiguous (bordering) memory, one next to another, so adding new elements requires contiguous memory to be available (otherwise they cannot be stored). This also means you have to reserve the memory even if you do not use it (e.g. in Java you initialize a 10-size array). Linked lists store values anywhere in memory where there is free space, and each element holds a pointer to the next element.
- Accessibility: arrays are good because there is a sequential index; you can directly access the last element, or the fifth, or the fourth-last, etc. But in a LL you have to traverse the whole list to access the last element, as a LL has no index. Arrays are good for random access; LLs are good for accessing the whole list.
Worst Case | Arrays | Linked Lists | Reason |
---|---|---|---|
Reading | O(1) | O(n) | Reading the last elem in a LL is the worst case, so O(n) |
Insertion | O(n) | O(1) | Insert at the start of an array and you have to move all elems, hence O(n) |
Deletion | O(n) | O(1) | Delete the 1st elem of an array and you move all elems one slot back |
- Use Case: many writes, few reads - e.g. an expense entry system where you enter expenses each day but read the total only at month end. You would prefer a LL, as it has a lower order of execution for writes. Remember: arrays allow fast reads; linked lists allow fast inserts and deletes.
Linked List
Each node has a pointer to the next element's memory location.
A list whose head pointer is empty is an empty linked list.
Recursive Data Type [?]
Operations
- Add
- Remove
- Is empty
- how many nodes
- find node's index given its key
When to use
- insert in between other items
- collection size is unknown
- don't need random access
- not concerned about memory
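A minimal singly linked list sketch covering the operations above (the class and method names are my own, not a standard API; remove is omitted for brevity but follows the same traversal pattern):

class Node:
    def __init__(self, data):
        self.data = data
        self.next = None      # pointer to the next element's location

class LinkedList:
    def __init__(self):
        self.head = None      # no pointer: an empty linked list

    def is_empty(self):
        return self.head is None

    def add(self, data):      # insert at the head: O(1)
        node = Node(data)
        node.next = self.head
        self.head = node

    def size(self):           # how many nodes: O(n) traversal
        count, cur = 0, self.head
        while cur:
            count, cur = count + 1, cur.next
        return count

    def find(self, key):      # a node's index given its key, or -1
        i, cur = 0, self.head
        while cur:
            if cur.data == key:
                return i
            i, cur = i + 1, cur.next
        return -1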
Linear Data Structures¶
Stacks
- Linear data structure
- LIFO
- maintains the order in which items were added, serving the most recent first
- use when you want quick access to the latest item
Queues
- FIFO
- Ordered
- Use when you have to process something in order
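A minimal sketch of both, using a Python list as the stack and collections.deque as the queue:

from collections import deque

stack = []
stack.append('a'); stack.append('b')  # push
print(stack.pop())                    # 'b' - LIFO: quick access to the latest item

queue = deque()
queue.append('a'); queue.append('b')  # enqueue
print(queue.popleft())                # 'a' - FIFO: processed in the order added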
Linked List
- Singly linked, ordered
- When you have an unknown number of items
- One-direction movement is ok
- Insert in between
Doubly linked list
- move both directions
- ordered
- when you need to move both ways
- sequential access of data
Non-Linear Data Structures¶
Non Linear Data Structures
- Trees
- Binary Search Trees
- Graphs
- Hash Maps
Trees
- hierarchical: levels and nodes
- use when you need to store hierarchical data
Binary
- smaller values on the left, larger on the right
- root is the midpoint
- lookup takes O(log n) time, as each step eliminates half the tree
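A minimal sketch of the smaller-left / larger-right rule and the halving lookup (names are illustrative; O(log n) holds while the tree stays balanced):

class BSTNode:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    if root is None:
        return BSTNode(key)
    if key < root.key:
        root.left = insert(root.left, key)    # smaller on the left
    else:
        root.right = insert(root.right, key)  # larger on the right
    return root

def lookup(root, key):
    if root is None:
        return False
    if key == root.key:
        return True
    # each step eliminates half the (balanced) tree
    return lookup(root.left, key) if key < root.key else lookup(root.right, key)

root = None
for k in [8, 3, 10, 1, 6]:
    root = insert(root, k)
print(lookup(root, 6), lookup(root, 7))  # True False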
Graphs
- A finite set of nodes connected by edges
- Edges can have optional values (attributes)
- when you need relationship between nodes
Hash Maps
- key - value
- like dictionaries, e.g. mapping a name to an ID, and the ID to a phone number
Binary Search¶
It searches a sorted list. It starts from the middle and compares the item; each comparison tells it which half the item must be in, so it ignores the other half. This makes the search possible in at most $\log_2 n$ steps.
How it is $\log_2 n$:
- since in each step the search space is halved (32 - 16 - 8 - 4 - 2 - 1), remember it as: the total elements keep getting chopped (logged) in base 2
- mathematically, the total n elements reduce by powers of 2, so ask: how many steps are there?
- what power of 2 gives 32? It is 5
- and $\log_2 32$ is what? It is 5
Hence it is an $O(\log n)$ complexity algorithm.
Binary Search Algorithm
- if an array is sorted, find the number in the middle, then look left or right
- complexity is log(n) - that means in the worst case the number of iterations will be log(n) for an array of length n
How?
- let's say we do k iterations (at max) to search a number in an array of length n, i.e. the loop will run no more than k times, which is the worst-case scenario; then:
- iter 1: length n/2
- iter 2: length n/4 = n/2^2
- iter 3: length n/8 = n/2^3
- ...
- iter k: length n/2^k; this is the last iteration, and we know that in the worst case the length will be reduced to 1
- so n/2^k = 1; solve it to get k = log2(n), which means the code will have to run at most log(n) times
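A minimal iterative sketch of the algorithm just derived:

def binary_search(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:                  # at most log2(n) halvings
        mid = (lo + hi) // 2         # find the number in the middle
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1             # look right
        else:
            hi = mid - 1             # look left
    return None

print(binary_search([1, 3, 5, 7, 9, 11], 9))  # 4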
Links
- Binary Search, Linked Lists and Complexity | Data Structures and Algorithms in Python - direct code, no plain algo dry run
Selection Sort¶
Sorting simply by comparing an item in the list with the other items is easy. Its complexity: item 1 is compared to n others, item 2 to n others, and so on until item n is compared to n others, which is n x O(n), or $O(n^2)$.
Complexity: $O(n^2)$
You actually compare one fewer item in each step (n-1, n-2, ...), which averages to n/2 comparisons per pass; the constant 1/2 is ignored in Big O.
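A minimal selection sort sketch; the inner scan is what makes each of the n passes cost O(n):

def selection_sort(arr):
    arr = list(arr)                       # sort a copy
    for i in range(len(arr)):             # n passes ...
        smallest = i
        for j in range(i + 1, len(arr)):  # ... each scanning the remainder
            if arr[j] < arr[smallest]:
                smallest = j
        arr[i], arr[smallest] = arr[smallest], arr[i]
    return arr

print(selection_sort([5, 3, 6, 2, 10]))   # [2, 3, 5, 6, 10]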
Recursion¶
Recursion is when a function calls itself.
Every recursive function has two parts: the base case and the recursive case. The base case tells the function when to stop recursing and end execution.
The recursive case is where the function calls itself; in the base case it doesn't, hence the recursion ends. E.g.:
def countdown(i):
    print(i)
    if i <= 0: # Base case
        return
    else: # Recursive case
        countdown(i-1)
Example 2: Factorial of a number
def fact(x):
    if x <= 1: # Base case (<= guards x=0 too, else the recursion never ends)
        return 1
    else: # Recursive case
        return x * fact(x-1)
Call Stacks
- During program execution there is a call stack, in which each item is a function call; the item is popped from the stack when the function returns or ends.
- So if a function B is called from a line in A, then B is stacked on top of A until B returns.
- In recursion, say A calls B, the factorial function: the stack builds as A, B(5), B(4), B(3), B(2), B(1). The top item, B(1), returns $1$; then the next frame returns $2 \times 1$, and so on, so you finally get $1 \times 2 \times 3 \times 4 \times 5$.
- Tip: When you’re writing a recursive function involving an array, the base case is often an empty array or an array with one element.
Quicksort¶
Divide & Conquer
D&C algorithms are recursive algorithms. To solve a problem using D&C, there are two steps:
- Figure out the base case. This should be the simplest possible case.
- Divide or decrease your problem until it becomes the base case.
D&C isn’t a simple algorithm that you can apply to a problem. Instead, it’s a way to think about a problem.
Quicksort Algorithm
- pick an element of the array, called the pivot
- make a new array having all elements less than the pivot, less[]
- make a new array having all elements greater than the pivot, greater[]
- recursively do the same thing on less and greater
Base Case: You need not sort an empty array or an array with one element; it is already sorted, e.g. [] or [20].
Recursive Case: You need to build the less and greater arrays and then do the same on each.
Code
def quicksort(array):
    if len(array) < 2:
        return array # Base case: arrays with 0 or 1 element are already "sorted"
    else:
        pivot = array[0] # Recursive case
        less = [i for i in array[1:] if i <= pivot]
        greater = [i for i in array[1:] if i > pivot]
        return quicksort(less) + [pivot] + quicksort(greater)
Complexity
$O(n \log n)$, average case
$O(n^2)$, worst case
Hash Tables¶
Let's build something that searches in O(1) complexity, that is, instantly. No matter how long the array is, you get the result instantly.
They’re also known as hash maps, maps, dictionaries, and associative arrays.
You’ll probably never have to implement hash tables yourself. Any good language will have an implementation for hash tables. Python has hash tables; they’re called dictionaries.
Hash Functions
- It is a function where you put in a string and get back a number. That number is the index where the value is stored.
- A dict in Python is such an implementation: you put in a key, it knows the index, reads it, and returns a value.
Code
# create a hash table
book = dict()
Usage
- Relationship of one thing to another, or key-value pair
- Filtering out duplicates
- They have no ordering
Implementation
- The hash function is important; it should avoid collisions (the same index for different keys).
- Reading an array is O(1): you simply read from memory at the index you provide. 1 element or 1 billion, you just read that index, hence O(1).
- Insert and delete in a linked list are fast, O(1).
- A hash table is the best of both, in the average case.
- If the load factor of a hash table (occupied slots / total slots) exceeds 0.7, resize it to maintain optimal performance.
- A real implementation of a good hash function is beyond our scope for now!
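A toy illustration only (made up for this note, nothing like a production hash function): map a key to an array index by summing character codes modulo the array size; two keys landing on the same index would be a collision.

def toy_hash(key, size):
    return sum(ord(c) for c in key) % size   # same key always -> same index

slots = [None] * 10
slots[toy_hash("apple", 10)] = 0.67          # write straight to the computed index
print(slots[toy_hash("apple", 10)])          # read back in O(1): 0.67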
Breadth-First Search¶
When trying to solve a problem like finding the shortest distance, try to model the problem as a graph and solve it using breadth-first search (BFS).
Graphs
- Rama ---owes---> Bob shows that Rama owes money to Bob. A graph can have many such relations, where 1 owes 2 and 3, and 3 owes 1 and 4, etc.
- It has nodes and edges.
- A directed graph has edges pointing to nodes; an undirected graph's edges have no direction.
BFS
- Let's say you want to find who in your friend network works at Google.
- You build a list of your friends and check one by one whether they work at Google.
- If not, you add their friends (friends of friends, 2nd-degree relations) to the end of the list, so that you can find someone working at Google anywhere in your network.
- But you search all your 1st-degree friends first; this is Breadth-First Search. Because you want the shortest path to a Google-working friend, you search all your 1st-degree connections before any 2nd-degree ones (i.e. search in the order they are added to the search list). The data structure for this is a queue, and for representing the relations, a hash table.
- BFS finds whether a path from A to B exists, and if so, it finds the shortest path.
Queues
- FIFO - first in, first out (like a queue at a bus stop); the opposite of a stack, which is LIFO.
Code
You can implement this using a list for the queue and a dictionary for the hash map, and maintain another list of already-searched nodes, to avoid a circular relation sending you into an infinite loop.
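A minimal sketch of the friend-of-friend search described above; the friend network and the works_at_google check are made up for illustration, and collections.deque stands in for the plain list queue:

from collections import deque

friends = {                        # hypothetical network, as a hash map
    'you': ['alice', 'bob'],
    'alice': ['peggy'],
    'bob': ['anuj', 'peggy'],
    'anuj': [], 'peggy': [], 'thom': []
}

def works_at_google(person):
    return person == 'peggy'       # stand-in for the real check

def bfs(start):
    queue = deque(friends[start])  # 1st-degree friends are searched first
    searched = []                  # avoids circular relations / infinite loops
    while queue:
        person = queue.popleft()
        if person in searched:
            continue
        if works_at_google(person):
            return person          # found via the shortest path
        queue += friends[person]   # friends of friends go to the end
        searched.append(person)
    return None

print(bfs('you'))                  # 'peggy'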
Time Complexity
- Following all edges takes $O(E)$, the number of edges
- Running the queue over every person takes $O(V)$, the number of vertices
- Running breadth-first search, i.e. both of the above, takes
- $O(V+E)$
Tree
- Say you see a family tree showing parents and children, where edges represent the parent relationship; they can point only one way, son ---parent---> father, as a son can't be the parent of his father. Such graphs are trees.
- A tree is a special type of graph where no edges ever point back.
flowchart TD
    Son --> Mother --> GrandFatherM
    Son --> Father --> GrandFatherF
    Father --> GrandMotherF
    Mother --> GrandMotherM
Dijkstra’s Algorithm¶
Weighted graphs: graphs whose edges carry a weight.
flowchart LR
    Start--6-->A
    Start--2-->B
    A--1-->Finish
    B--3-->A
    B--5-->Finish
Dijkstra’s algorithm
Let's say you have to find a path from start to finish: start is connected to some nodes, they to other nodes, and so on until finish, and each edge has a travel time.
- From all the neighbors of start, find the cheapest node. This is the node you can get to in the least amount of time*.
- Update the costs of the neighbors of this node.
- Repeat until you’ve done this for every node in the graph.
- Calculate the final path.
You can do this by maintaining a table with three columns (parent, node, cost) and updating it in each iteration as you traverse the nodes and edges.
Iter 1
Parent | Node | Cost |
---|---|---|
Start | Stop A | 6 |
Start | Stop B | 2 |
Stop A | Finish | infinity |
...
Keep updating cost and parent, for each node in each iteration and finally you can find the path.
Iter Last
Parent | Node | Cost |
---|---|---|
Stop B | Stop A | 5 |
Start | Stop B | 2 |
Stop A | Finish | 6 |
- *There is no other way to reach the cheapest node more cheaply than via the direct link from start, so it is safe to finalize it. Also, this is only true when you have positive weights.
Negative-weight edges
An edge can have a negative weight, e.g. you give money to B and B returns you money: A ---(-$2)---> B.
Now Dijkstra's algorithm does not work, as the claim "the shortest path to the cheapest node from start is the direct link" no longer holds. Hence, you can't use Dijkstra's algorithm if you have negative-weight edges.
The Bellman-Ford algorithm can be used in this case (out of scope for now!).
Implementation of Dijkstra
You need three hash tables to represent the weighted-graph data, plus a list (matching the table structure above) that keeps track of what has been processed while finding the shortest path:
- graph - dict - represents the weighted graph (see graph)
- costs - dict - keeps track of the lowest known cost to each node in each run (see table)
- parents - dict - keeps track of the parent giving the lowest cost in each run (see table)
- processed - list - keeps track of nodes already processed, so we do not repeat them
graph = {
    'start' : { # all neighbours to start
        'a': 6,
        'b': 2
    },
    'a' : { # all neighbours to a
        'finish' : 1
    },
    'b' : { # all neighbours to b
        'a' : 3,
        'finish' : 5
    },
    'finish' : {} # finish has no neighbours
}
costs = { # at Iter 1
    'a': 6,
    'b': 2,
    'finish': float("inf") # key must match 'finish' in graph
}
parents = { # at Iter 1
    'a': "start",
    'b': "start",
    'finish': None
}
Keep executing, and you will reach the final iteration, where the costs and parents hash tables hold the shortest-path route.
Shortest Path
flowchart LR
    Start-.6.->A
    Start==2==>B
    A==1==>Finish
    B==3==>A
    B-.5.->Finish
Code
graph = {
    'start' : { # all neighbours to start
        'a': 6,
        'b': 2
    },
    'a' : { # all neighbours to a
        'finish' : 1
    },
    'b' : { # all neighbours to b
        'a' : 3,
        'finish' : 5
    },
    'finish' : {} # finish has no neighbours
}
costs = { # at Iter 1
    'a': 6,
    'b': 2,
    'finish': float("inf") # key must match 'finish' in graph
}
parents = { # at Iter 1
    'a': "start",
    'b': "start",
    'finish': None
}
processed = []
# Dijkstra's Algo
def find_lowest_cost_node(costs):
    lowest_cost = float("inf")
    lowest_cost_node = None
    for node in costs:  # Go through each node.
        cost = costs[node]
        if cost < lowest_cost and node not in processed:  # If it's the lowest cost so far and hasn't been processed yet ...
            lowest_cost = cost  # ... set it as the new lowest-cost node.
            lowest_cost_node = node
    return lowest_cost_node

node = find_lowest_cost_node(costs)  # Find the cheapest node that you haven't processed yet.
while node is not None:  # If you've processed all the nodes, this while loop is done.
    cost = costs[node]
    neighbors = graph[node]
    for n in neighbors.keys():  # Go through all the neighbors of this node.
        new_cost = cost + neighbors[n]  # If it's cheaper to get to this neighbor
        if costs[n] > new_cost:  # by going through this node ...
            costs[n] = new_cost  # ... update the cost for this node.
            parents[n] = node  # This node becomes the new parent for this neighbor.
    processed.append(node)  # Mark the node as processed.
    node = find_lowest_cost_node(costs)  # Find the next node to process, and loop.
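To calculate the final path, walk the parents table back from finish to start after the loop above has run; a minimal sketch using the tables above:

node, path = 'finish', ['finish']
while node != 'start':
    node = parents[node]                # follow each node's cheapest parent
    path.append(node)
print(' -> '.join(reversed(path)))      # start -> b -> a -> finish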
Summary
- Breadth-first search is used to calculate the shortest path for an unweighted graph.
- Dijkstra’s algorithm is used to calculate the shortest path for a weighted graph.
- Dijkstra’s algorithm works when all the weights are positive.
- If you have negative weights, use the Bellman-Ford algorithm.
Greedy Algorithms¶
Some problems have no fast algorithmic solution. You can identify them and use a greedy strategy to find an approximate solution: at each step, pick the locally optimal choice.
Dynamic Programming¶
When a problem is hard to solve directly, we break it down into smaller subproblems, solve those, and combine their results; this is dynamic programming.
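A classic minimal example (Fibonacci is my illustration; the text names no specific problem): naive recursion re-solves the same small subproblems repeatedly, while caching each subproblem's answer makes it fast.

def fib(n, memo={}):      # memo caches already-solved subproblems
    if n <= 1:
        return n
    if n not in memo:
        memo[n] = fib(n - 1) + fib(n - 2)
    return memo[n]

print(fib(50))            # instant: 12586269025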
University of California, Berkeley¶
UC Berkeley (UCB) is the University of California campus located in Berkeley, California, USA. It has CS lectures with an open syllabus and course material (lecture slides/videos, assignments) available for free online in an organized manner.
- Courses have a page and a professor. Either the videos are on the course page, or you go to the professor's YouTube channel to find the lectures.
EECS is UCB's Electrical Engineering and Computer Science department.
EECS courses cover all fundamental and advanced CS topics. Each course has a page where you can see its prerequisites, which will be other courses.
- CS61A/B/C (alternatives CS47ABC)
- 61A: Teaches you how to think like a programmer
- 61B: The go-to class that turns a lot of people into great programmers or gets them interested in software.
- 61C: Everything left in CS that isn't taught in 61A/B: Low-level things. Clocks, CPUs, Assembly, C, etc.
- or skip/skim 61A altogether and just watch/do 61B, then the 61C material
- Those who have taken these courses claim that after completing the series they feel they can achieve or learn almost anything they want, because they are already well versed in the lingo and tools of CS: programming, problem solving, and low-level fundamentals.
- DA DSA
- CS 61A: Structure and Interpretation of Computer Programs
- Prof John DeNero
- mostly teaches in Python
- his YouTube channel has all lecture and assignment videos
- go to the playlists to see all lectures
- DataStructur.es
- info about all CS61 courses
- Resources Algo Visual
- CS 9A/B/C/D/E/F/G/H
- Basic programming courses
- CS 9C. C for Programmers
- CS 9D. Scheme and Functional Programming for Programmers
- CS 9E. Productive Use of the UNIX Environment
- CS 186. Introduction to Database Systems
- CS 261. Security in Computer Systems
- CS 162: Operating Systems and System Programming
- CS61A/B/C (alternatives CS47ABC)
Programming Systems (PS) - an area at UCB where many discoveries were made. It includes the following courses:
- CS 61A. The Structure and Interpretation of Computer Programs
- CS 61B. Data Structures
- CS 164. Programming Languages and Compilers
- CS 169. Software Engineering
- CS 263. Design of Programming Languages
- CS 264. Implementation of Programming Languages
- CS 265. Compiler Optimization and Code Generation
- CS C267. Applications of Parallel Computers
Seasons in the United States
- Spring - March to May. Q2
- Summer - June to August. Q3
- Fall/Autumn - September to November. Q4
- Winter - December to February. Q1
Books Recommended
- A Philosophy of Software Design by John Ousterhout - recommended by a student on Reddit
CS 61A: Structure and Interpretation of Computer Programs¶
Prof John DeNero
CS61A - Fall 2023 - 37 lectures have a playlist on YouTube. Text is on the site.
CS 61B Data Structures, Fall 2023¶
- Prof Josh Hug has a lecture playlist. Other playlists can be found on the course page.
Explore Study Material¶
UCB has the best open material, based on Reddit and my own search.
It will take 2-3 months to complete all the lectures, so be patient and just do it without second-guessing the result. It will benefit you.