Parallel reading and processing from Postgres using (Py)Spark - python

I have a question about reading large amounts of data from a Postgres database and processing it with Spark in parallel. Let's assume I have a table in Postgres that I would like to read into Spark using JDBC, and that it has the following columns:
id (bigint)
date (datetime)
many other columns (different types)
Currently the Postgres table is not partitioned. I would like to transform a lot of data in parallel, and eventually store the transformed data somewhere else.
Question: How can we optimize parallel reading of the data from Postgres?
The documentation (https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html) suggests using a partitionColumn to process the queries in parallel. In addition, one is required to set lowerBound and upperBound. From what I understand, in my case I could use either the id or the date column as partitionColumn. However, the problem is how to set the lowerBound and upperBound values when partitioning on one of those columns. I noticed that data skew arises in my case if they are not set properly. For processing in Spark, I do not care about natural partitions; I just need to transform all data as quickly as possible, so optimizing for unskewed partitions would be preferred, I think.
I have come up with a solution for this, but I am unsure whether it actually makes sense. Essentially it hashes the ids into partitions: my solution is to use mod() on the id column with a specified number of partitions, so the dbtable field would be something like:
"(SELECT *, mod(id, <<num-parallel-queries>>) as part FROM <<schema>>.<<table>>) as t"
And then I use partitionColumn="part", lowerBound=0, and upperBound=<<num-parallel-queries>> as options for the Spark read JDBC job.
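For concreteness, here is a rough sketch of what I mean in PySpark (the connection URL, schema, and table name are placeholders):

# Sketch of the mod-based parallel read; host, database, and table are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

num_parallel_queries = 16  # illustrative value

dbtable = f"(SELECT *, mod(id, {num_parallel_queries}) AS part FROM myschema.mytable) AS t"

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")   # placeholder URL
      .option("driver", "org.postgresql.Driver")
      .option("dbtable", dbtable)
      .option("partitionColumn", "part")
      .option("lowerBound", 0)
      .option("upperBound", num_parallel_queries)
      .option("numPartitions", num_parallel_queries)
      .load())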
Please, let me know if this makes sense!

It is a good idea to "partition" by the primary key column.
To get partitions of equal size, use the table statistics:
SELECT histogram_bounds::text::bigint[]
FROM pg_stats
WHERE tablename = 'mytable'
AND attname = 'id';
If you have default_statistics_target at its default value of 100, this will be an array of 101 values that delimit the percentiles from 0 to 100. You can use this to partition your table evenly.
For example: if the array looks like {42,10001,23066,35723,49756,...,999960} and you need 50 partitions, the first would be all rows with id < 23066, the second all rows with 23066 ≤ id < 49756, and so on.
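On the Spark side, one way to use those bounds is to pass explicit per-partition predicates to the JDBC reader. A rough sketch, assuming the bounds have already been fetched into a Python list (connection details and names are placeholders):

# Build one JDBC predicate per partition from the pg_stats histogram bounds.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative bounds; in practice take every k-th value of histogram_bounds
# so you end up with the number of partitions you want.
bounds = [23066, 49756, 74321, 999960]

predicates = [f"id < {bounds[0]}"]
predicates += [f"id >= {lo} AND id < {hi}" for lo, hi in zip(bounds[:-1], bounds[1:])]
predicates.append(f"id >= {bounds[-1]}")

df = spark.read.jdbc(
    url="jdbc:postgresql://dbhost:5432/mydb",   # placeholder URL
    table="myschema.mytable",                   # placeholder table
    predicates=predicates,                      # one Spark partition per predicate
    properties={"user": "user", "password": "pass", "driver": "org.postgresql.Driver"},
)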

Related

pyspark groupby sum all in one time vs partial where then groupby sum for huge table

Suppose we have a very, very huge table similar to this.
If we use PySpark to read this table and do groupBy("id").agg({'value': 'sum'}),
compared with
partially reading this table with a WHERE on date first, then
doing groupBy("id").agg({'value': 'sum'}),
and then summing all the partial values together:
Which one should be faster?
Can you please elaborate on your question? I am assuming that you want to perform the following operations:
case-1: sum(value) group by id (adds all date)
case-2: sum(value) group by id where date = 1
In either case, your performance will depend on the following:
The cardinality of your id column: whether you have a very large number of unique values, or a small number of unique values repeated many times.
The file format you are using: columnar (like Parquet) vs. row-based (like CSV).
The partitioning and bucketing strategy you are using for storing the files: if the date column holds few distinct values, go for partitioning; otherwise, bucketing.
All these factors help determine whether the above two cases will show similar processing times or drastically different ones. This is because reading a large amount of data takes more time than reading less/pruned data with the given filters. Also, your shuffle block size and number of partitions play a key role.
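For reference, a minimal sketch of the two cases as I read them, assuming a DataFrame df with columns id, date, and value; comparing the physical plans shows whether the date filter is pushed down to the scan:

from pyspark.sql import functions as F

# case-1: sum(value) grouped by id over all dates
case1 = df.groupBy("id").agg(F.sum("value").alias("total"))

# case-2: filter on a single date first, then aggregate
case2 = df.where(F.col("date") == 1).groupBy("id").agg(F.sum("value").alias("total"))

case1.explain()  # look for PushedFilters / partition pruning in the scan node
case2.explain()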

SQL ORDER BY Timestamp, identical values

I've got an SQL database (MariaDB) with a timestamp index. Sadly there are multiple rows with identical timestamps. The best way would be to reject the current data and start using a unique index, but sadly that is not easily done.
When querying like that:
SELECT TIMESTAMP, DEVICE, VALUE FROM df WHERE DEVICE IN ('A', 'B') ORDER BY TIMESTAMP ASC
The order of the elements with identical timestamps isn't the same as the order seen when looking at the complete data without ORDER BY TIMESTAMP ASC. I would like to get the data in the same order it was written into the database. I am querying in Python with pandas, and my work-around would be to prepare the complete data with Python, but that is slower.
Can you help me? I know it should be done different in the first place but maybe there is a work-around.
Fabian
SQL doesn't guarantee the order the data is retrieved in. You need to use another column to force retrieval in a specific order. Don't you have another column you can use in the ORDER BY?
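For example, if the table has (or can be given) an auto-increment column, here called row_id purely for illustration, it can serve as a tie-breaker when querying from pandas:

# Sketch only: row_id is a hypothetical auto-increment column used to break timestamp ties.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:pass@dbhost/mydb")  # placeholder DSN

query = """
    SELECT TIMESTAMP, DEVICE, VALUE
    FROM df
    WHERE DEVICE IN ('A', 'B')
    ORDER BY TIMESTAMP ASC, row_id ASC
"""
data = pd.read_sql(query, engine)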

PySpark repartition according to a specific column

I am looking at how to repartition (in PySpark) a dataset so that all rows that have the same ID in a specified column move to the same partition. In fact, I have to run, in each partition, a program which computes a single value for all rows having the same ID.
I have a dataframe (df) built from a Hive QL query (which, let's say, contains 10000 distinct IDs).
I tried :
df = df.repartition("My_Column_Name")
By default I get 200 partitions, but I always obtain 199 IDs for which I get duplicated computed values when I run the program.
I looked on the web, and some people recommended defining a custom partitioner to use with the repartition method, but I wasn't able to find out how to do that in Python.
Is there a way to do this repartition correctly?
I only want ALL rows with the same ID moved to the same partition; it's no problem if a partition contains several groups of rows with distinct IDs. 1000 was just an example: the number of different IDs can be very high, so partitioning a DataFrame into as many partitions as there are distinct IDs should not lead to good performance. I need this because I run a function (which cannot be implemented using basic Spark transformation functions) via the RDD mapPartitions method. This function produces one result per distinct ID, which is why I need all rows with the same ID in the same partition.
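To make the intent concrete, here is a rough sketch of the pattern I mean, with compute_single_value standing in for my actual function:

# Sketch only: group rows by ID within each partition and emit one result per ID.
from itertools import groupby
from operator import itemgetter

def compute_single_value(rows):
    return len(rows)  # placeholder for the real per-ID computation

def process_partition(rows):
    rows = sorted(rows, key=itemgetter("My_Column_Name"))
    for key, group in groupby(rows, key=itemgetter("My_Column_Name")):
        yield (key, compute_single_value(list(group)))

result = (df.repartition("My_Column_Name")
            .rdd
            .mapPartitions(process_partition)
            .collect())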

How to make Spark read only specified rows?

Suppose I'm selecting given rows from a large table A. The target rows are given either by a small index table B, or by a list C. The default behavior of
A.join(broadcast(B), 'id').collect()
or
A.where(col('id').isin(C)).collect()
will create a task that reads in all data of A before filtering out the target rows. Take the broadcast join as an example: in the task DAG, we see that the Scan parquet step determines the columns to read, which in this case are all columns.
The problem is that each row of A is quite large and the selected rows are quite few, so ideally it would be better to:
read in only the id column of A;
decide the rows to output with broadcast join;
read in only the selected rows to output from A according to step 2.
Is it possible to achieve this?
BTW, the rows to output could be scattered across A, so it's not possible to make use of partition keys.
will create a task that reads in all data of A
You're wrong. While the first scenario doesn't push down any filters other than IsNotNull on the join key (in the case of an inner or left join), the second approach will push In down to the source.
If the isin list is large, this might not necessarily be faster, but it is optimized nonetheless.
If you want to fully benefit from possible optimizations, you should still use bucketing (DISTRIBUTE BY) or partitioning (PARTITION BY). These are useful in the isin scenario, but bucketing can also be used in the first one, where B is too large to be broadcast.
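You can verify this yourself by comparing the physical plans (A, B, and C as defined in the question, with A assumed to be parquet-backed):

from pyspark.sql.functions import broadcast, col

A.join(broadcast(B), "id").explain()   # only IsNotNull(id) appears under PushedFilters
A.where(col("id").isin(C)).explain()   # In(id, [...]) is pushed down to the scan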

Optimize python csv processing into parent and EAV child table

There have been several similar questions online about processing large CSV files into multiple PostgreSQL tables with Python. However, none seem to address a couple of concerns around optimizing database reads/writes and system memory/processing.
Say I have a row of product data that looks like this:
name,sku,datetime,decimal,decimal,decimal,decimal,decimal,decimal
Where the name and sku are stored in one table (parent), then each decimal field is stored in a child EAV table that essentially contains the decimal, parent_id, and datetime.
Let's say I have 20000 of these rows in a csv file, so I end up chunking them up. Right now, I take chunks of 2000 of these rows and loop line by line. Each iteration checks to see if the product exists and creates it if not, retrieving the parent_id. Then, I have a large list of insert statements generated for the child table with the decimal values. If the user has selected to only overwrite non-modified decimal values, then this also checks each individual decimal value to see if it has been modified before adding to the insert list.
In this example, if I had the worst case scenario, I'd end up doing 160,000 database reads and anywhere from 10-20010 writes. I'd also be storing up to 12000 insert statements in a list in memory for each chunk (however, this would only be one list, so that part isn't as bad).
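To make the current flow concrete, here is a stripped-down sketch of what each chunk does today (table, column, and attribute names are made up for illustration, and the optional check for only overwriting unmodified values is omitted):

# Rough sketch of the per-row read/write pattern described above (psycopg2 assumed).
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
cur = conn.cursor()

def process_chunk(rows):
    child_inserts = []
    for row in rows:
        # One read per row to find or create the parent product.
        cur.execute("SELECT id FROM product WHERE sku = %s", (row["sku"],))
        found = cur.fetchone()
        if found:
            parent_id = found[0]
        else:
            cur.execute(
                "INSERT INTO product (name, sku) VALUES (%s, %s) RETURNING id",
                (row["name"], row["sku"]),
            )
            parent_id = cur.fetchone()[0]
        # One pending insert per decimal value for the EAV child table.
        for attr in ("d1", "d2", "d3", "d4", "d5", "d6"):
            child_inserts.append((parent_id, attr, row[attr], row["datetime"]))
    for parent_id, attr, value, dt in child_inserts:
        cur.execute(
            "INSERT INTO eav (parent_id, attr_name, attr_value, datetime) "
            "VALUES (%s, %s, %s, %s)",
            (parent_id, attr, value, dt),
        )
    conn.commit()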
My main question is:
How can I optimize this to be faster, use fewer database operations (since this also affects network traffic), and use less processing and memory? I'd also rather have the processing speed be slower if it could save on the other two, as those cost more money when translated to server/database pricing in something like AWS.
Some sub questions are:
Is there a way I can combine all the product read/writes and replace them in the file before doing the decimals?
Should I be doing a smaller chunk size to help with memory?
Should I be utilizing threads or keeping it linear?
Could I have it build a more efficient SQL query that creates the product if it does not exist and references it inline, thus moving some of the processing into SQL rather than Python?
Could I optimize the child insert statements to do something better than thousands of INSERT INTO statements?
A fun question, but one that's hard to answer precisely, since there are many variables defining the best solution that may or may not apply.
Below is one approach, based on the following assumptions -
You don't need the database code to be portable.
The csv is structured with a header, or at the least the attribute names are known and fixed.
The sku (or name/sku combo) in the product table has a unique constraint.
Likewise, the EAV table has a unique constraint on product_id and attr_name.
Corollary: you didn't specify, but I also assume that the EAV table has a field for the attribute name.
The process boils down to -
Load the data into the database by the fastest path possible
Unpivot the csv from a tabular structure to EAV structure during or after the load
"Upsert" the resulting records - update if present, insert otherwise.
Approach -
With all that background, given a similar problem, here is the approach I would take.
Create temp tables mirroring the final destination, but without pks, types, or constraints
The temp tables will get deleted when the database session ends
Load the .csv straight into the temp tables in a single pass; two SQL executions per row
One for product
One for the EAV, using the 'multi-value' insert - insert into tmp_eav (sku, attr_name, attr_value) values (%s, %s, %s), (%s, %s, %s), ...
psycopg2 has a custom method to do this for you: http://initd.org/psycopg/docs/extras.html#psycopg2.extras.execute_values
Select from tmp tables to upsert into final tables, using a statement like insert into product (name, sku) select name, sku from tmp_product on conflict (sku) do nothing
This requires PostgreSQL 9.5+.
For the user-selectable requirement to optionally update fields based on the csv, you can change do nothing to do update set col = excluded.col. excluded is the input row that conflicted
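A hedged sketch of those steps with psycopg2 (all table, column, and file names here are invented for illustration, the csv is assumed to have a header row, and rows are batched with execute_values rather than inserted one at a time, as a simplification):

import csv
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
cur = conn.cursor()

# Temp tables mirroring the destinations; dropped automatically at session end.
cur.execute("CREATE TEMP TABLE tmp_product (name text, sku text)")
cur.execute("CREATE TEMP TABLE tmp_eav (sku text, attr_name text, attr_value text)")

product_rows, eav_rows = [], []
with open("products.csv") as f:
    reader = csv.DictReader(f)
    attr_names = [c for c in reader.fieldnames if c not in ("name", "sku", "datetime")]
    for row in reader:
        product_rows.append((row["name"], row["sku"]))
        eav_rows.extend((row["sku"], attr, row[attr]) for attr in attr_names)

# Multi-value inserts into the temp tables.
execute_values(cur, "INSERT INTO tmp_product (name, sku) VALUES %s", product_rows)
execute_values(cur, "INSERT INTO tmp_eav (sku, attr_name, attr_value) VALUES %s", eav_rows)

# Upsert from the temp table into the final product table (PostgreSQL 9.5+).
cur.execute("""
    INSERT INTO product (name, sku)
    SELECT name, sku FROM tmp_product
    ON CONFLICT (sku) DO NOTHING
""")
conn.commit()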
Alternative approach -
Create the temp table based on the structure of the csv (this assumes you have enough metadata to do this on each run, or that the csv structure is fixed and can be consistently translated to a table).
Load the csv into the database using the COPY command (supported in psycopg2 via the cursor.copy_from method, passing in the csv as a file object). This will be faster than anything you write in Python.
Caveat: this works if the csv is very dependable (same number of cols on every row) and the temp table is very lax w/ nulls, all strings w/ no type coercion.
You can 'unpivot' the csv rows with a union all query that combines one select per column to transpose columns into rows. The 6 decimals in your example should be manageable.
For example:
select sku, 'foo' as attr_name, foo as attr_value from tmp_csv union all
select sku, 'bar' as attr_name, bar as attr_value from tmp_csv union all
...
order by sku;
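A sketch of that COPY-based load, again with illustrative names and a deliberately lax all-text temp table:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
cur = conn.cursor()

# All columns as text: no type coercion, tolerant of messy values.
cur.execute("""
    CREATE TEMP TABLE tmp_csv (
        name text, sku text, dt text,
        foo text, bar text, baz text
    )
""")

with open("products.csv") as f:
    next(f)  # skip the header row; COPY expects data rows only
    cur.copy_from(f, "tmp_csv", sep=",")

conn.commit()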
This solution hits a couple of the things you were interested in:
Python application memory remains flat
Network I/O is limited to what it takes to get the .csv into the db and issue the right follow-up SQL statements
A little general advice to close out -
Optimal and "good enough" are almost never the same thing
Optimal is only required under very specific situations
So, aim for "good enough", but be precise about what "good enough" means - i.e., pick one or two measures.
Iterate, solving for one variable at a time. In my experience, the first hurdle (say, "end to end processing time less than X seconds") is often sufficient.
