Databases & SQL
All about databases and SQL, with query examples for each topic.
A database stores data, usually in tabular format.
----- DDL -----
Data Definition Language: commands that define table structures and the objects around them.
CASCADE
Can be used when defining tables (e.g., on foreign keys); it propagates an action on the parent table to the child table, as in the sketch below.
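A minimal sketch, assuming a hypothetical parent/child pair of tables, of ON DELETE CASCADE on a foreign key (PostgreSQL-style syntax):
CREATE TABLE parent (
    id integer PRIMARY KEY
);
CREATE TABLE child (
    id integer PRIMARY KEY,
    parent_id integer REFERENCES parent (id) ON DELETE CASCADE
);
-- deleting a parent row also deletes its child rows
DELETE FROM parent WHERE id = 1;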
TRIGGERS
Triggers let you perform an action on another table in response to activity (insert/update/delete) on one table, as in the sketch below.
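A minimal PostgreSQL-style sketch; the audit table, function, and trigger names are made up. It inserts an audit row whenever a hypothetical parent table is updated (EXECUTE FUNCTION needs PostgreSQL 11+; older versions use EXECUTE PROCEDURE):
CREATE TABLE audit_log (changed_id integer, changed_at timestamp);

CREATE OR REPLACE FUNCTION log_change() RETURNS trigger AS $$
BEGIN
    INSERT INTO audit_log VALUES (NEW.id, now());  -- NEW is the updated row
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_log_change
AFTER UPDATE ON parent
FOR EACH ROW EXECUTE FUNCTION log_change();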
----- DML -----
Data Manipulation Language: commands that manipulate the data itself.
Date Operations
Link https://www.postgresql.org/docs/8.1/functions-datetime.html
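A few common PostgreSQL date/time calls, as a quick sketch:
SELECT now();                                    -- current timestamp
SELECT current_date + integer '7';               -- a week from today
SELECT age(timestamp '2000-01-01');              -- interval since that date
SELECT date_trunc('month', now());               -- start of the current month
SELECT extract(year from timestamp '2021-03-15 10:30:00');  -- 2021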
Distinct
Can be used after SELECT to fetch only distinct rows in the result set. With multiple columns it treats them as a set, so the whole row must be unique.
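Example, assuming a hypothetical customers table:
-- unique (country, city) combinations; the whole row must be unique
SELECT DISTINCT country, city
FROM customers;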
NVL COALESCE
Used to replace NULL cells with a value (NVL is the Oracle form).
COALESCE
Returns the first non-null argument, or NULL if all arguments are null: COALESCE(a, b, c, ...)
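Example, assuming hypothetical phone columns (Oracle's NVL takes exactly two arguments, COALESCE takes any number):
SELECT COALESCE(phone_mobile, phone_home, 'no phone') AS contact_phone
FROM customers;                      -- first non-null value per row

-- Oracle
SELECT NVL(commission, 0) FROM employees;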
CASE WHEN THEN
ELSE is optional, but at least one WHEN ... THEN branch must be included, and the expression closes with END.
SELECT ...,
  CASE
    WHEN col_a = 'val_1' THEN 'res_1'
    WHEN col_a = 'val_2' THEN 'res_2'
    WHEN col_b = 'val_4' THEN 'res_4'
    ELSE 'val_0'
  END AS col_my
FROM ...
-- syntax 2: compare one column against values
CASE col_a
  WHEN 'val_1' THEN 'res_1'
  WHEN 'val_2' THEN 'res_2'
  ELSE 'res_0'
END
-- Use in ORDER BY
SELECT CustomerName, Country
FROM Customer
ORDER BY
  (CASE
     WHEN Country = 'India' THEN Country
     ELSE CustomerName
   END);
DECODE
Oracle-specific; works like a compact CASE, as in the example below.
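An Oracle sketch equivalent to the first CASE above (column and value names are placeholders):
SELECT DECODE(col_a, 'val_1', 'res_1',
                     'val_2', 'res_2',
                     'res_0') AS col_my
FROM my_table;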
Use AVG to find Percentage
A percentage is flag/total, e.g., flag = 1 for active users and total = total users.
AVG is sum / count (like average age = sum of ages / number of students). So if a column only holds 1s and 0s, like a flag, then AVG(flag) = SUM(flag) / total rows, which is exactly the percentage. So use:
select
  avg(case when is_active_flg = 1 then 1.0 else 0 end) as pct_active
from users u
GROUP ROLLUP CUBE
Use GROUP BY
to group data. The non-aggregated columns in SELECT should match the GROUP BY clause; any other column must be aggregated.
Use ROLLUP
to add sub-totals and a grand total over the grouped columns.
Use CUBE
to get sub-totals for every combination of dimensions, similar to ROLLUP.
E.g., say a table has country, state, department, and sales. A cube gives sub-totals by every combination of dimensions: country-department, country-state, state-department, and the total over all three. See the sketch below.
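A sketch on a hypothetical sales table with those dimensions:
-- sub-totals per country, plus a grand total
SELECT country, state, SUM(sales) AS total_sales
FROM sales
GROUP BY ROLLUP (country, state);

-- sub-totals for every combination of the dimensions
SELECT country, state, department, SUM(sales) AS total_sales
FROM sales
GROUP BY CUBE (country, state, department);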
HAVING
HAVING is used to specify a condition for a group or an aggregate function.
WHERE filters before grouping, then HAVING filters after grouping.
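Example, assuming a hypothetical orders table:
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
WHERE status = 'completed'       -- filters rows before grouping
GROUP BY customer_id
HAVING SUM(amount) > 1000;       -- filters groups after aggregation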
EXISTS
The EXISTS operator tests whether the subquery returns any rows. If a matching subquery row is found, the condition is flagged TRUE and the search in the inner query stops; if no matching row is found after checking all inner rows, the condition is flagged FALSE.
SELECT employee_id, last_name, job_id, department_id
FROM employees outer
WHERE EXISTS (SELECT 'X'
              FROM employees
              WHERE manager_id = outer.employee_id);
Sub Queries
- A subquery is a query within a query.
- We can use it in multiple ways (sketches below):
- in FROM, as another table
- in WHERE, as a set to filter the main query
- as a column in SELECT
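Sketches of each placement; the employees/departments tables and columns are assumed for illustration:
-- in FROM, as another table
SELECT t.department_id, t.avg_sal
FROM (SELECT department_id, AVG(salary) AS avg_sal
      FROM employees
      GROUP BY department_id) t;

-- in WHERE, as a set to filter the main query
SELECT *
FROM employees
WHERE department_id IN (SELECT id FROM departments WHERE region = 'EMEA');

-- as a column in SELECT
SELECT e.last_name,
       (SELECT d.department_name FROM departments d WHERE d.id = e.department_id) AS dept
FROM employees e;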
Correlated Sub-Queries
Correlated subqueries depend on the outer query and are executed once for each row of the main query.
Whereas, regular subqueries (non-correlated) are independent of the outer query and are evaluated only once before the main query runs.
Correlated subqueries are useful when you need to filter the result of the outer query based on some condition that requires referencing values from the outer query itself. They are commonly used in scenarios where the condition relies on data from related tables.
SELECT last_name, salary, department_id
FROM employees outer
WHERE salary > (SELECT AVG(salary)
                FROM employees
                WHERE department_id = outer.department_id
                GROUP BY department_id);
JOINS
A CROSS JOIN
produces a cartesian product between the two tables, returning all possible combinations of all rows. It has no ON clause because you're just joining everything to everything.
A FULL JOIN
and FULL OUTER JOIN
are the same thing, while a plain JOIN
defaults to INNER JOIN. These joins take an ON
clause with the matching condition between the tables. See the sketch below.
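A quick sketch with hypothetical tables:
-- CROSS JOIN: every size combined with every colour, no ON clause
SELECT s.size, c.colour
FROM sizes s CROSS JOIN colours c;

-- INNER JOIN: only rows that satisfy the ON condition
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;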
Problem: Rank with Self Join
You can self join a table:
- left side is your table
- right side is again your table, but only the rows whose sales are greater than or equal to the current row's sales, so
- sales 5 on the left will have 5, 6, 7, 8 joined to it
- sales 10 will have only 10 and 11 joined to it
- then count the rows joined from the right table; that count gives the rank.
SELECT a1.Name, a1.Sales, COUNT(a2.sales) Sales_Rank
FROM Total_Sales a1, Total_Sales a2
WHERE a1.Sales <= a2.Sales or (a1.Sales=a2.Sales and a1.Name = a2.Name)
GROUP BY a1.Name, a1.Sales
ORDER BY a1.Sales DESC, a1.Name DESC;
Link: https://www.experts-exchange.com/questions/24783940/RANK-Using-a-SELF-JOIN-or-other.html
UNION
In a UNION the row order is not guaranteed to be preserved. To keep track of it, you can add row_number()
or rank()
plus a hardcoded table_number (e.g., 1 for table1, 2 for table2), then reselect from the union, as in the sketch below.
UNION
only returns unique records, while UNION ALL
returns all the records (including duplicates).
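A sketch of the reselect trick, assuming two hypothetical tables table1 and table2 with id and name columns:
SELECT *
FROM (
    SELECT 1 AS table_number, row_number() OVER (ORDER BY id) AS rn, name FROM table1
    UNION ALL
    SELECT 2 AS table_number, row_number() OVER (ORDER BY id) AS rn, name FROM table2
) u
ORDER BY table_number, rn;   -- order restored after the union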
CTE - Common Table Expression
- It is a query that we write before the main query.
- Use
WITH
to create the CTE and then use it in the main query.
- CTEs can be recursive.
- CTEs are reusable within the query.
- CTEs are more readable and help manage long and complex queries.
- CTEs are similar to subqueries; however, a CTE can't be used inside a WHERE clause.
The WITH clause lets you define a table. The definition is available only to the query in which the WITH clause occurs.
Recursion
Imagine you have a tree structure:
drop table if exists data_sci.org_structure;
create table data_sci.org_structure (
id integer,
department_name text,
parent_department_id integer);
insert into data_sci.org_structure values
(1, 'CEO Office', null),
(2, 'VP Sales', 1),
(3, 'VP Operations', 1),
(4, 'Northeast Sales',2),
(5, 'Northwest Sales',2),
(6, 'Infrastructure Operations', 3),
(7, 'Management Operations', 3);
and you want to find the complete path from one node to the root (top parent). The algorithm: find the parent of the child, then the grand-parent, and so on until you reach the root.
This is where recursion can help.
WITH RECURSIVE <recursive_cte_name> (<columns>) AS (
    <terminal select statement>
    UNION
    <recursive select statement>
)
SELECT
...
Eg
with recursive report_structure(id, department_name,parent_department_id) as
(select id, department_name, parent_department_id
from data_sci.org_structure where id = 7
union
select os.id, os.department_name, os.parent_department_id
from data_sci.org_structure os
join report_structure rs on rs.parent_department_id = os.id
)
select
*
from report_structure
This finds all parents of child id 7; you can similarly pass any id and find all of its ancestors using recursion.
Window Functions
You can partition the data over a column and then apply a function over the rows in each partition. The function's value relates to the rows in the partition and resets for each new partition. Some of the functions you can apply:
- RANK()
gives a rank based on the value; ties share a rank and the next rank is skipped, e.g., 1, 2, 2, 4 (items 2 and 3 tie at rank 2, so rank 3 is skipped and the 4th item gets rank 4).
- DENSE_RANK()
gives a rank without skipping, e.g., 1, 2, 2, 3, and so on.
- ROW_NUMBER()
gives the row number: 1, 2, 3, 4, and so on.
- SUM(col_a)
e.g., sum(cost) over months gives a running total by month.
- NTILE(n)
numbers each row in a partition by dividing it into n buckets, e.g., NTILE(4) assigns quartiles based on the value of the row in the window.
- LEAD(sales)
shows the 2nd row's value in the 1st row, and so on.
- LAG(sales)
shows the 1st row's value in the 2nd row, and so on; LAG(sales, 2)
shows the 1st row's value in the 3rd row. You can specify the offset; by default it is 1.
Syntax: RANK() OVER (PARTITION BY col_b ORDER BY col_a)
- Important: inside PARTITION BY ..., the ORDER BY ... makes a huge difference, because the ordering changes how the function is applied, e.g.:
-- this gives running total by each month and resets on year
sum(total_sales) over (partition by year_sold order by year_sold, month_sold) as running_total
-- this gives one total per year (no ORDER BY), so the same yearly total is repeated for each month
sum(total_sales) over (partition by year_sold) as running_total
The partition that you create is called a window. You can use it inline as above, or, if you need the same partition again and again, define it with a WINDOW clause and re-use it. Syntax: WINDOW win_name AS (PARTITION BY col_a ORDER BY col_b)
Example query:
-- Window Function
SELECT emp.firstName, mdl.model, count(sls.salesId) as num_sold,
rank() over (PARTITION by emp.firstName ORDER by count(sls.salesId) desc) as ranks, -- this could also use the named window fname; written out in full just to show
dense_rank() over (fname) as dense_ranks,
row_number() over (fname) as row_numbers,
sum(sls.salesAmount) over (fname) as running_sales
FROM sales sls
INNER JOIN employee emp
ON sls.employeeId = emp.employeeId
INNER JOIN inventory inv
ON inv.inventoryId = sls.inventoryId
INNER JOIN model mdl
ON mdl.modelId = inv.modelId
GROUP BY emp.firstName, emp.lastName, mdl.model
WINDOW fname AS (PARTITION by emp.firstName ORDER by count(sls.salesId) desc)
-- Running Total using Window fn with CTE
with total_sales_cte as (
select
soldDate
, strftime('%Y', soldDate) as year_sold
, strftime('%m', soldDate) as month_sold
, sum(salesAmount) as total_sales
from sales
GROUP by year_sold, month_sold
ORDER by soldDate
--LIMIT 5
)
select *
, sum(total_sales) over (partition by year_sold order by year_sold, month_sold) as running_total
from total_sales_cte
Pivot
Lets say you have table as below
Name | Occupation |
---|---|
Ashley | Professor |
Samantha | Actor |
Julia | Doctor |
Britney | Professor |
Maria | Professor |
Meera | Professor |
You can pivot by hardcoding the column names and using a CASE statement.
Let's first give row numbers within each occupation:
Occupation | Name | row_id |
---|---|---|
Actor | Eve | 1 |
Actor | Jennifer | 2 |
Actor | Kate | 3 |
Actor | Samantha | 4 |
Doctor | Amanda | 1 |
Doctor | Julia | 2 |
Doctor | Emma | 3 |
Professor | Ashley | 1 |
Professor | Jason | 2 |
Professor | Britney | 3 |
Then select from this table and hardcode the column names
-- sql server
select
t1.row_id
, max(case when o='Doctor' then name else NULL END) as 'Doctors'
, max(case when o='Professor' then name else NULL END) as 'Professors'
, max(case when o='Singer' then name else NULL END) as 'Singers'
, max(case when o='Actor' then name else NULL END) as 'Actors'
FROM
(select Occupation as o
, Name
, Row_number() over (partition by occupation order by name) as row_id
from occupations) as t1
GROUP BY t1.row_id
ORDER by t1.row_id
Output:
row_id | Doctors | Professors | Singers | Actors |
---|---|---|---|---|
1 | Amanda | Ashley | Chris | Eve |
2 | Julia | Boris | Jane | Jennifer |
3 | Jane | Britney | Jenny | Emma |
4 | NULL | Maria | Kris | Samantha |
5 | NULL | Meera | NULL | NULL |
6 | NULL | Naomi | NULL | NULL |
7 | NULL | Priyanka | NULL | NULL |
That's how you can pivot tables.
----- DBMS -----
Normalization
Normalization is done to avoid data duplication and to easily manage relational data, especially 1-many and many-many relationships.
OLTP
- It is mostly normalized, as there are many reads and writes but on specific data, eg, add new book, update an author
- you work at the row level, e.g., updating a customer's address: you need all the details, but only for one customer, so one row.
Snowflake schemas are normalized, avoiding data redundancy by storing dimensional data in separate tables. This improves data quality and reduces the required storage space.
LINK: NORM VS DE-NORM
Denormalisation
- it improves read performance.
- reduced risk of anomalies as we are mostly analysing and not updating / writing / deleting data at atomic level.
- streaming inserts, like from IoT can use denormalized data model as it has simple structure of timestamp and value.
- Eliminates complex joins and gives performance gains.
- A common approach is the star schema, where you have facts and dimensions.
- A wide table also works, but at petabyte scale use purpose-built databases like Google BigQuery.
OLAP
- It is for data analysis: many reads by many processes over many data points, hence denormalized is preferred.
- you work at the column level, e.g., the average of a temperature column.
Why star schema is not normalized?
- In a star schema, there is only one level of dimension tables, and all foreign key relationships are defined between the fact table and the dimensions. A query never needs to join tables beyond the single layer of dimension tables, resulting in better performance than if the dimension tables were normalized.
Partitioning
- A large table can be inefficient, as it may have large indexes to manage and scan.
- Horizontal partitioning divides a table into parts and limits scans to a subset of partitions. If you break a table into 3 horizontal partitions, each has a smaller index to scan/maintain and fewer rows to read.
- Efficient deletes: you can drop an old time-based partition, e.g., if it is older than 5 years, drop the whole partition.
Vertical Partitions
- partitions into sets of columns; good when using columnar storage
- retains all rows in each partition
- can have global indexes, with reduced I/O since less data is read
Range Partitioning
- similar to horizontal, uses time / numeric range / alphabetical range for partitioning
- it partitions on non-overlapping keys.
Hash Partitioning
- Pick a partitioning key, apply a hash, and take the modulus (remainder) of the hash to decide which partition a row goes into. E.g., first determine the number of partitions you need, then hash-partition on the key.
- Useful when range partitioning does not make sense, or you need more balanced partitions.
Code in Postgres
- Define the partitioning scheme when creating the table.
- You can then create the partition tables with
PARTITION OF ... FOR VALUES FROM ... TO ...
CREATE TABLE table_x (
    ...
    msmt_date date not null,
    ...
)
PARTITION BY RANGE (msmt_date);

CREATE TABLE tab_part_a PARTITION OF table_x
FOR VALUES FROM ('2021-01-01') TO ('2021-02-01'); -- monthly; the upper bound is exclusive
Materialized Views
- Instead of executing the query again and again through the execution plan, you can save its results using a materialized view.
- Persist the results, form of caching
- trade space for time.
When to use
- long running complex queries with joins
- computing aggregate data
When NOT to use
- The view can get out of sync with the underlying data, so it depends on how fresh the data in the view needs to be; e.g., a view refreshed hourly will not contain data generated in the last 59 minutes.
CREATE MATERIALIZED VIEW mv_myview AS (
-- your sql query
)
SELECT * FROM mv_myview; -- use as normal table
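In PostgreSQL the stored results are refreshed explicitly, which is where the out-of-sync trade-off comes from:
REFRESH MATERIALIZED VIEW mv_myview;                  -- recompute the stored results
-- REFRESH MATERIALIZED VIEW CONCURRENTLY mv_myview; -- avoids blocking reads, needs a unique index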
Read Replicas
There is a primary server that takes all read and write operations; you can add a read replica of this server, to which all the writes are also applied, and query the replica instead.
Data is kept up to date via the write-ahead log (WAL).
This lets the primary focus on writes, while the replica(s) handle heavy reads.
There are things to consider, like consistency requirements and transaction completion; it all depends on trade-offs.
Challenge: Sensor Data - Write-Heavy Example
You have sensor data coming in from IoT devices; there should be no latency or lag in writes.
System Design
- Model the table to match the incoming data, e.g., (device_id, timestamp, measure1, measure2), exactly these 4 columns.
- Then have materialized views on top of it for aggregated hourly/daily results.
- Also consider partitioning, which would ideally be on the time column.
- Consider when to purge the materialized view; old hourly data need not be kept.
- The latest-hour materialized view may need to be refreshed more frequently if required.
- Ask the domain expert all these questions; interrogate the requirements.
- If low-level raw data is required, avoid reads on the primary server; create a read replica for this instead.
Indexes
This is about indexing strategy. Indexes help reduce the number of data blocks that must be read, but there is a cost: more space, and more writes on each update/insert. The higher the cardinality (more distinct values), the more useful an index is.
Indexes are ordered, whereas tables might be unordered.
Indexing Attributes
- Access Types: does it help to access range or a specific value
- Access Time: time to find the result
- Insertion Time: time to find the appropriate space to insert data
- Deletion Time: time to find data, delete it and update the index.
- Space Overhead: additional space required by the index.
Indexes are not used in Google BigQuery or AWS Redshift; they are columnar storage databases and use different strategies.
Type
- B-tree
- bit map
- hash index
B-tree
- balanced tree index
- look-up in logarithmic time, same as binary search.
- it's like a binary tree: smaller values go to the left, larger go to the right, and so on. It halves and halves the search space, hence \(O(\log n)\)
Bitmap index
- For columns with a small set of possible values, encode each value as a bit string, e.g., yes is 1, no is 0, null is 00; any enum can be represented by a bitmap.
- Best for read-intensive workloads with few writes, because maintaining bitmaps on writes is expensive.
- PostgreSQL does not support bitmap indexes natively.
Hash
- Function for mapping arbitrary length data to a fixed-size string
- Effectively unique output values, e.g., hello
is 5d41402abc4b2a76b9719d911017c592
in MD5, and the hash is always the same for a given string.
- Any change in the input produces a new hash.
- You cannot reverse a hash; that's simply how the math works. E.g., if 912 were a hash, it could have come from 900 + 12, or 400 + 512, and so on: you would have to try all possible inputs to find a match, you can't know exactly what it was.
- Use case: useful when a column usually holds unique values, like email. Rather than fetching by email directly, you create a hash index on it and fetch using a hash index scan. See the sketch below.
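A PostgreSQL sketch for the email example (the users table is hypothetical):
CREATE INDEX idx_users_email_hash ON users USING hash (email);

-- equality lookups can now use the hash index
SELECT * FROM users WHERE email = 'alice@example.com';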
GiST and SP-GiST
- Generalized search tree.
- For specialized data types, like
ltree
and hstore
, and special types like circles, boxes, latitudes, etc.
- SP-GiST: space-partitioned GiST.
GIN
and BRIN
- GIN: generalized inverted index, for values that are inside composite items (e.g., arrays, documents).
- Slower insertion; there is a trade-off between fast inserts and fast fetches.
BRIN
- Block range index; used with very large tables on values like postcodes or dates.
- Works on pages that are adjacent in the table.
- Stores summary information per block range.
Challenge
Claims data, large in size, 12 columns. How would you index it to optimize a join on claim id?
Solution
- claim_id is used for the join. By default, Postgres creates b-tree indexes, but claim_id is unique in the dataset and the join only needs equality matching, so a hash index can be used instead, giving one-to-one lookups.
SQL Execution Plan
How the database engine executes SQL statements.
- In SQL we say what we want, not how to get it; in Java/Python we spell out how to do it.
- The cost of execution mostly depends on the number of rows.
- A few scans of a large table are okay; otherwise, prefer indexes.
- Indexes help filter data via the index rather than scanning rows, and may help fetch just a partition.
Table Join Algo
- Tables join on keys; internally one of these algorithms is used:
- Nested Loop Join - compare every row of one table with every row of the other; roughly O(n*m).
- Hash Join - calculate a hash of the keys and join rows with matching hash values.
- Sort Merge Join - sort both tables, then join the rows taking advantage of the order; roughly O(n log n).
Plan Builder
- It builds the execution plan based on the metadata statistics.
- You can look at the query plan for any query.
- If that query plan is inefficient, you can run the
ANALYZE
command to update the statistics in the metadata.
- SQL is a declarative language, where you declare what you need; procedural programming instead tells the engine how to do things.
Explain
- Add
EXPLAIN
before any query to see the query plan.
- It shows the operations, like Seq Scan or Append; some parts may run in parallel.
- With an index on the table, the plan changes: instead of a sequential scan you can see an index scan, which takes less time.
- This way you can see what's going on under the hood: are the indexes useful?
- Explain a join: here you can see the algorithm used for the join. If the execution plan is expensive, consider adding indexes on the key used for the join.
- E.g.,
Seq Scan on employee (cost=0.00..26.50 rows=323 width=65)
means it is a full table scan.
Analyze
If you add EXPLAIN ANALYZE
before a query, the query is actually executed and you can see the time it took to complete.
Challenge
- You have to optimise a long running query. You see that it has
- 3 tables
- 2 left joins
- 2 tables with ~500k rows and one with 200 rows
Solution
- Run EXPLAIN
- Look for full-table scans; if present, create an index on the column used in the join.
- Look to filter out data that is not required.
- Run ANALYZE to ensure stats are up to date.
Tip: You can generate fake data in Postgres using the generate_series()
function. It is like range() in Python and can generate numbers, timestamps, etc. Use random()
for random data.
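Quick sketches:
-- numbers 1 to 5
SELECT generate_series(1, 5);

-- one row per day of January 2021, with a random value
SELECT day, random() AS reading
FROM generate_series('2021-01-01'::timestamp,
                     '2021-01-31'::timestamp,
                     interval '1 day') AS day;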
- Some databases (e.g., PostgreSQL, MySQL, SQLite) let you use SELECT aliases in GROUP BY, even though GROUP BY is logically evaluated before SELECT.
- examples
- you can calculate month and year from order_date and use those aliases in GROUP BY
- use SUM over months to find a running total
- make this a CTE, then reuse it
User Defined Functions
- You can define functions to encapsulate a few steps; functions can be overloaded. A sketch is below.
- Postgres supports writing functions in Python (PL/Python).
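A minimal PostgreSQL sketch; the function name and logic are made up. This one uses a default argument; you could also create a second add_tax with a different signature to overload it:
CREATE OR REPLACE FUNCTION add_tax(amount numeric, rate numeric DEFAULT 0.2)
RETURNS numeric AS $$
    SELECT amount * (1 + rate);
$$ LANGUAGE sql;

SELECT add_tax(100);        -- 120, uses the default rate
SELECT add_tax(100, 0.1);   -- 110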
HStore
You can use hstore with Postgres to store key-value pairs efficiently.
- Useful when you have a large number of attributes that are mostly unused, which would otherwise create sparse columns.
- GIN and GiST indexes can be used on it efficiently.
- E.g., a catalog of store items.
- You can have a normal table with its usual columns, plus one column of type hstore holding key-value pairs for the attributes.
- Useful when the attributes have no fixed structure; it combines SQL with NoSQL. See the sketch below.
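A minimal sketch with a hypothetical catalog table (hstore ships as a PostgreSQL extension):
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE catalog (
    id serial PRIMARY KEY,
    name text,
    attributes hstore
);

INSERT INTO catalog (name, attributes)
VALUES ('t-shirt', 'color => "red", size => "M"');

SELECT name, attributes -> 'color' AS color
FROM catalog
WHERE attributes ? 'size';   -- rows that have a "size" key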
JSON and Semi-structured data
Postgres 9+ lets you store both kinds of data in a relational database, with the flexibility of semi-structured data. You can declare a column of type json (or jsonb) and store JSON in it. The json type stores the JSON as text (jsonb is a binary form), and keys can be queried and filtered using the ->
notation.
CREATE TABLE customer_summary
(id serial primary key,
customer_doc jsonb);
insert into customer_summary (customer_doc) values
('{
"customer_name": { "first_name": "Alice", "last_name": "Johnson" },
"address": { "city": "Boston", "state": "MA" }
}');
select customer_doc->'customer_name'->>'first_name' from customer_summary;
-- Alice
----- Databases -----
- Relational Databases
- store data in rows, e.g., MySQL,
PostgreSQL
- good for storing transactions and building relationships.
- Columnar Databases
- used commonly in warehousing. eg, Amazon Redshift, Google BigQuery, Apache Cassandra.
- NoSQL Databases
- They do not have defined structure and are document based.
- Elasticsearch - a search engine based on
Apache Lucene
that can be used as a NoSQL db.
- Apache Kibana - adds a GUI to Elasticsearch.
PostgreSQL
- PostgreSQL is relational db.
- pgadmin4
is a GUI for it; it is web-based and can run on localhost.
- pgcli
is a CLI client for database access (with Emacs-style keybindings); it can be installed with pip as a package.
- Log in with the Linux user on the server as
username@server
- then add the server, expand it, and see the database, schemas, public.
SQLite
- It is a micro database that can run in memory or be saved in a file, e.g.,
store.db
- Queries are the same as any other SQL.
- It can be used in many ways, some are:
- Python program with the DB engine in memory
- Python program with the DB engine as a file
- SQLite installed as a utility and accessed via a shell; this is the
sqlite3.exe
program.
- SQLite ODBC driver.
Interaction
- shell
sqlite3
opens a shell on the command line, just like the mysql shell; the DB is in memory.
- to open a file, use
.open FILENAME
to open an existing database.
- or launch directly with a file:
sqlite3 data.sqlite
to work on that file.
ctrl + z
then Enter to exit.
- DDL
.tables
shows all tables;
.schema orders
shows the CREATE statement.
- load CSV
.mode csv
and then
.import data.csv orders
loads the CSV into the db, creating the table if it does not exist.
- GUI SQLite Browser
MySQL
Installation:
brew install mysql
Start Server:
brew services start mysql
- starts mysql as a background service
mysql.server start
- starts it without a background service
mysql_secure_installation
- run this to set the root password, etc.
ps -ef | grep mysqld
- shows the mysql process
Login:
mysql -u root -p
then enter password
Queries:
show DATABASES;
create user 'bob_me'@'localhost' identified with mysql_native_password by 'bob_pwd';
create database bob_db;
grant all privileges on bob_db.* to 'bob_me'@'localhost';
Client - SequelPro:
- Connecting:
- Standard
- Host: 127.0.0.1
- enter username and password and connect.
Shutdown Server:
mysql.server stop
- stops server
Other Notes:
- Table and database names can be case-sensitive depending on the OS/filesystem (e.g., on Linux); column names are not.
mysqladmin
is also installed
Trouble Shooting:
- If you see errors in clients, e.g., Sequel Pro, the client might not yet support the newer default user authentication method, link.
- do
ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'newrootpassword';
then try logging in.
Microsoft SQL Server
Snippets
-- Create and Insert
SELECT FirstName, LastName
INTO TestTable
FROM Person.Contact
WHERE EmailPromotion = 2
-- Existing Table Append
INSERT INTO TestTable
SELECT FirstName, LastName
FROM Person.Contact
WHERE EmailPromotion = 2
Redis
redis-server
to start the server.
Elasticsearch
- Elastic Search is a search engine based on
Apache Lucene
and can be used as a NoSQL db.
- you can set up clusters and nodes in Elasticsearch.
- once installed it doesn't have a GUI; it is an API, and at
localhost:9200
you can see the JSON output from the API
- uses a JSON query language called Elastic Query DSL (Domain Specific Language)
- Apache Kibana
- adds GUI to ElasticSearch.
- you can use it to build visualizations and dashboards of your data held in Elasticsearch.
- after installation it can be accessed on
http://localhost:5601
- you can create a viz from an index in Elasticsearch; a viz is a widget that can be added to a dashboard. A dashboard filter applies to all widgets where the field is present.
- developer tools
is a scratch-pad area where you can create and query data/indices.