I am working with a heavily nested, non-splittable JSON input dataset housed in S3.
The files vary a lot in size, from 10 KB at the smallest to 300 MB at the largest.
When reading the files with the code below and simply repartitioning to the desired number of partitions, I get straggling tasks: most tasks finish within seconds, but one lasts a couple of hours and then runs into memory issues (missing heartbeats, heap space errors, etc.).
I repartition in an attempt to randomize the partition-to-file mapping, since Spark may read files in sequence and files within the same directory tend to be of the same nature (all large, all small, etc.).
df = spark.read.json('s3://my/parent/directory')
df.repartition(396)
# Settings (few):
default parallelism = 396
total number of cores = 400
What I tried:
I figured that the input partition scheme (S3 prefixes, not Spark partitions, i.e. the folder hierarchy) might be leading to this skewed-partitions problem, where some S3 folders (technically 'prefixes') have just one file while others have thousands. So I transformed the input to a flattened directory structure using a hash code, where each folder has just one file:
Earlier:
/parent1
/file1
/file2
.
.
/file1000
/parent2/
/file1
Now:
hashcode=FEFRE#$#$$#FE/parent1/file1
hashcode=#$#$#Cdvfvf##/parent1/file1
But it didn't have any effect.
I have also tried really large clusters, thinking that even if there is input skew, that much memory should be able to handle the larger files. But I still get straggling tasks.
When I check the number of files assigned to each partition (each file becomes a single row in the DataFrame due to its nested, unsplittable nature), I see between 2 and 32 files per partition. Is this because Spark packs files into partitions based on spark.sql.files.maxPartitionBytes, so it assigns only two files to a partition when the files are huge, and many more files to a single partition when the files are small?
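For reference, a rough sketch of how this per-partition file count can be checked (reusing the placeholder path and the spark session from the snippets above; spark_partition_id and input_file_name are standard PySpark functions):
from pyspark.sql import functions as F

df = spark.read.json('s3://my/parent/directory')

# Tag each row with its source file and the Spark partition it landed in,
# then count distinct files per partition to see how Spark packed them.
files_per_partition = (
    df.select(
        F.spark_partition_id().alias('partition_id'),
        F.input_file_name().alias('file'),
    )
    .groupBy('partition_id')
    .agg(F.countDistinct('file').alias('num_files'))
    .orderBy('num_files')
)
files_per_partition.show(50, truncate=False)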
Any recommendations on how to make the job work properly and distribute the tasks uniformly, given that the size of the input files cannot be changed due to the nature of the input?
EDIT: Code added for latest trial per Abdennacer's comment
This is the entire code.
Results:
The job gets stuck in Running, even though in the worker logs I see the task finished. The driver log has the error 'can not increase buffer size'. I don't know what is causing the 2 GB buffer issue, since I am not issuing any 'collect' or similar statement.
The configuration for this job is as below, to ensure the executor has huge memory per task.
Driver/Executor Cores: 2.
Driver/ Executor memory: 198 GB.
# sample s3 source path:
s3://masked-bucket-urcx-qaxd-dccf/gp/hash=8e71405b-d3c8-473f-b3ec-64fa6515e135/group=34/db_name=db01/2022-12-15_db01.json.gz
#####################
# Boilerplate
#####################
spark = SparkSession.builder.appName('query-analysis').getOrCreate()
#####################
# Queries Dataframe
#####################
dfq = spark.read.json(S3_STL_RAW, primitivesAsString="true")
dfq.write.partitionBy('group').mode('overwrite').parquet(S3_DEST_PATH)
Great job flattening the files to increase read speed. Prefixes, as you seem to understand, are related to buckets, and bucket read speed is related to the number of files under each prefix and their size. The approach you took will speed up reading compared to your original strategy, but it will not help you with skew in the data itself.
One thing you might consider is that your raw data and working data do not need to be the same set of files. There is a strategy for landing data and then pre-processing it for performance.
That is to say, keep the raw data in the format that you have now, then make a copy of the data in a more convenient format for regular queries. (Parquet is the best choice for working with S3.)
Land data in a 'landing zone'.
As needed, process the data stored in the landing zone into a convenient splittable format for querying (a 'pre-processed' folder).
Once your raw data is processed, move it to a 'processed' folder. (Use your existing flat folder structure.) Keeping this processed raw data is important should you need to rebuild the table or make changes to the table format.
Create a view that is a union of the data in the 'landing zone' and the 'pre-processed' folder. This gives you a performant table with up-to-date data, as sketched below.
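A minimal PySpark sketch of that layout (the bucket paths and view name are placeholders, and unionByName with allowMissingColumns assumes Spark 3.1+):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('preprocess-landing-zone').getOrCreate()

LANDING_ZONE = 's3://my-bucket/landing/'        # raw, non-splittable JSON
PREPROCESSED = 's3://my-bucket/preprocessed/'   # splittable Parquet copy

# Periodic pre-processing step: convert newly landed raw JSON to Parquet.
# (Afterwards the raw files would be moved to the 'processed' folder, per
# the steps above, so they are not converted twice.)
raw = spark.read.json(LANDING_ZONE)
raw.write.mode('append').parquet(PREPROCESSED)

# Query-time view: union of anything still sitting in the landing zone with
# the already pre-processed Parquet, so queries always see up-to-date data.
landing_df = spark.read.json(LANDING_ZONE)
preprocessed_df = spark.read.parquet(PREPROCESSED)
landing_df.unionByName(preprocessed_df, allowMissingColumns=True) \
    .createOrReplaceTempView('queries_current')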
If you are using the latest S3 you should get consistent reads, which lets you ensure you are querying all of the data. In the past S3 was only eventually consistent, meaning you might miss some data while it was in transit; this issue is supposedly fixed in recent versions of S3. Run this 'processing' as often as needed and you should have a performant table to run large queries on.
S3 was designed as long-term, cheap storage. It's not made to perform quickly, but they've been trying to make it better over time.
It should be noted this will solve skew on read but won't solve skew on query. To solve the query portion you can enable adaptive query execution in your config. (This will adaptively add an extra shuffle to make queries run faster.)
spark.sql.adaptive.enabled=true
Related
I have 130 GB of csv.gz files in S3 that were loaded using a parallel unload from Redshift to S3. Since the data is spread across multiple files, I wanted to reduce the number of files so that it's easier to read for my ML model (using sklearn).
I have managed to convert the multiple files from S3 into a Spark DataFrame (called spark_df1) using:
spark_df1=spark.read.csv(path,header=False,schema=schema)
spark_df1 contains hundreds of columns (features) and is my time-series inference data for millions of customer IDs. Since it is time-series data, I want to make sure that all the data points of a given 'customerID' end up in the same output file, as I will be reading each partition file as a chunk.
I want to unload this data back into S3. I don't mind smaller partitions of the data, but each partitioned file SHOULD contain the entire time series of a single customer. In other words, one customer's data cannot be split across 2 files.
current code:
datasink3=spark_df1.repartition(1).write.format("parquet").save(destination_path)
However, this takes forever to run, the output is a single file, and it is not even compressed. I also tried using '.coalesce(1)' instead of '.repartition(1)', but it was slower in my case.
You can partition it using the customerID:
spark_df1.write.partitionBy("customerID") \
    .format("parquet") \
    .save(destination_path)
You can read more about it here: https://sparkbyexamples.com/pyspark/pyspark-repartition-vs-partitionby/
This code worked and the time to run reduced to 1/5 of the original. The only thing to note is to make sure the load is split equally among the nodes (in my case I had to make sure that each customer ID had roughly the same number of rows):
spark_df1.repartition("customerID").write.partitionBy("customerID").format("csv").option("compression","gzip").save(destination_path)
Adding to manks' answer: you need to repartition the DataFrame by customerID and then write.partitionBy(customerID) to get one file per customer.
You can see a similar issue here.
Also, regarding your comment that the Parquet files are not zipped: the default compression is Snappy, which has some pros and cons compared to gzip compression, but it's still much better than uncompressed.
Background
Ultimately I'm trying to load a large quantity of data into GBQ. Records exist for the majority of seconds across many dates.
I want to have a table for each day's data so I may query the set using the syntax:
_TABLE_SUFFIX BETWEEN '20170403' AND '20170408'...
Some of the columns will be nested, and the sets of columns currently originate from a high number of CSV files, each between 30 and 150 MB in size.
Approach
My approach to processing the files is to:
1) Create tables with correct schema
2) Fill each table with a record for each second in preparation for cycling through data and running UPDATE queries
3) Cycle through csv files and load data
All this is intended to be driven by the GBQ client library for python.
Step (2) may seem odd, but I think it's needed in order to simplify the loading as I will be able to guarantee that UPDATE DML will work in all cases.
I do realise that (3) has little chance of working given the issues with (2), but I have run a similar operation at a slightly smaller scale leading me to believe that my approach might not be entirely hopeless at this larger scale.
I'm also left wondering whether this is the most appropriate method for loading data of this nature, but some of the options presented in the GBQ documentation seem like overkill for a one-time load, i.e. I would not want to learn Apache Beam unless I know I have to, given that my chief concern is analysing the dataset. That said, I'm not beyond learning something like Apache Beam if I know it's the only way of completing such a load, as it could come in handy in the future.
Problem that I've run into
In order to run (2) I have three nested for loops cycling through each hour, minute, and second, constructing a long string to go into the INSERT query:
(TIMESTAMP '{timestamp}', TIMESTAMP '2017-04-03 00:00:00', ... + ~86k times for each day)
GBQ doesn't allow you to run such a long query; there is a limit, from memory 250k characters (I could be wrong). For that reason the length of the query is tested at each iteration, and if it exceeds a number slightly below 250k characters the INSERT query is performed.
Although this worked for a similar type of operation involving more columns, this time the process terminates when GBQ returns:
"google.cloud.exceptions.BadRequest: 400 Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.."
When I reduce the size of the strings and run smaller queries, this error is returned less frequently, but it does not disappear altogether. Unfortunately, at this reduced query size the script would take too long to complete to be a feasible option for step (2): it takes >600 seconds, and the entire data set would take in excess of 20 days to run.
I had thought that changing the query mode to 'BATCH' might overcome this, but it's been pointed out to me that it won't help. I'm not yet clear why, but I'm happy to take someone's word for it. Hence I've detailed the problem further, as it seems likely that my approach may be wrong.
This is what I see as the relevant code block:
from google.cloud import bigquery
from google.cloud.bigquery import Client
... ...
bqc = Client()
dataset = bqc.dataset('test_dataset')
job = bqc.run_async_query(str(uuid.uuid4()), sql)
job.use_legacy_sql = False
job.begin()
What would be the best approach to loading this dataset? Is there a way of getting my current approach to work?
Query
INSERT INTO `project.test_dataset.table_20170403` (timestamp) VALUES (TIMESTAMP '2017-04-03 00:00:00'), ... for ~7000 distinct values
Update
It seems the GBQ client library has had a not-insignificant upgrade since I last downloaded it: Client.run_async_query() appears to have been replaced with Client.query(). I'll experiment with the new code and report back.
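For reference, a minimal sketch of what the newer call might look like (assuming a recent google-cloud-bigquery client; the project/dataset/table names are the same placeholders as in the query above):
from google.cloud import bigquery

bqc = bigquery.Client()

sql = """
INSERT INTO `project.test_dataset.table_20170403` (timestamp)
VALUES (TIMESTAMP '2017-04-03 00:00:00')
"""

# Client.query() starts the job (standard SQL by default in the newer client);
# result() blocks until it finishes and raises on error.
job = bqc.query(sql)
job.result()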
I have nearly 60-70 timing log files (all .csv files, with a total size of nearly 100 MB). I need to analyse these files in a single go. So far I've tried the following methods:
Merged all these files into a single file, stored it in a DataFrame (pandas, Python), and analysed it.
Stored all the CSV files in a database table and analysed them.
My doubt is: which of these two methods is better? Or is there any other way to process and analyse these files?
Thanks.
For me, I usually merge the files into a DataFrame and save it as a pickle. The merged file will be pretty big and use up a lot of RAM when loaded, but it is the fastest way if your machine has a lot of RAM.
Storing it in a database is better in the long term, but you will spend time uploading the CSVs to the database and then even more time retrieving them. In my experience, you use a database if you want to query specific things from the table, such as a log from date A to date B; if you use pandas to query all of that, this method is not very good.
Sometimes, depending on your use case, you might not even need to merge everything: use the filename as a way to query and get the right logs to process (via the filesystem), then merge only the log files relevant to your analysis. You can save that as a pickle for further processing in the future.
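A minimal sketch of the merge-and-pickle approach described above (the 'logs/' directory, file pattern, and pickle name are placeholders):
import glob
import pandas as pd

files = glob.glob('logs/*.csv')

# Concatenate all timing logs into one DataFrame (needs enough RAM).
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Persist the merged frame so later runs skip the CSV parsing step.
df.to_pickle('merged_logs.pkl')

# Later: reloading the pickle is much faster than re-reading the CSVs.
df = pd.read_pickle('merged_logs.pkl')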
What exactly does 'analysis in a single go' mean?
I think your problem(s) might be solved using dask, and particularly the dask DataFrame.
However, note that the dask documentation recommends working with one big pandas DataFrame if it fits comfortably in the RAM of your machine.
Nevertheless, an advantage of dask might be its better support for parallelized or distributed computing compared to pandas.
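A small sketch of the dask.dataframe approach, assuming the logs share a schema (the path pattern is a placeholder):
import dask.dataframe as dd

# Reads all CSVs lazily as one logical dataframe, partitioned per file/block.
ddf = dd.read_csv('logs/*.csv')

# Operations build a task graph; compute() runs them in parallel and
# returns a regular pandas object.
summary = ddf.describe().compute()
print(summary)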
Problem:
Large number of files. Each file is 10 MB and consists of records in JSON format, gzipped.
My snippet loads all the data into memory. There is no need to do this; I just need a few hours of data in memory at a time. I need a sliding window.
Is it possible to apply the 'window' idea from spark streaming to the files and how would I do this?
I'm using python
location = "s3://bucketname/xxxx/2016/10/1[1-2]/*/file_prefix*.gz"
rdd = sc.textFile(location)
The snippet you posted actually does no computation. Spark execution is lazy, and it only forces computation of "transformations" like maps, filters, and even textFile when you ask for a result, for example by counting the RDD.
Another note is that most Spark operations stream by default. If you have 300 10 MB JSON files, you're going to get 300 separate partitions or tasks. If you're willing to wait, you could perform most RDD operations on this dataset on one core.
If you need a sliding window, there's good functionality for that in the Spark Streaming package. But the snippet you posted has no problems as it is!
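To illustrate the laziness point, a small sketch assuming the SparkContext sc from the question (the paths reuse the question's placeholder pattern, and the hour component is assumed to be the '*' directory level):
location = "s3://bucketname/xxxx/2016/10/1[1-2]/*/file_prefix*.gz"

# No data is read here: textFile() only records the transformation.
rdd = sc.textFile(location)

# An action such as count() is what actually triggers the read.
print(rdd.count())

# To keep only a few hours in memory at a time, one option is simply to
# narrow the path glob and process each window separately.
window = sc.textFile("s3://bucketname/xxxx/2016/10/11/0[0-3]/file_prefix*.gz")
# ...then run the per-window computation on `window` as needed.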
I'm currently rewriting some Python code to make it more efficient, and I have a question about saving Python arrays so that they can be re-used / manipulated later.
I have a large amount of data saved in CSV files. Each file contains time-stamped values of the data that I am interested in, and I have reached the point where I have to deal with tens of millions of data points. The data has become so large that the processing time is excessive and inefficient: the way the current code is written, the entire data set has to be reprocessed every time some new data is added.
What I want to do is this:
Read in all of the existing data to python arrays
Save the variable arrays to some kind of database/file
Then, the next time more data is added, I load my database, append the new data, and resave it. This way only a small amount of data needs to be processed at any one time.
I would like the saved data to be accessible to further python scripts but also to be fairly "human readable" so that it can be handled in programs like OriginPro or perhaps even Excel.
My question is: what's the best format to save the data in? HDF5 seems like it might have all the features I need, but would something like SQLite make more sense?
EDIT: My data is one-dimensional. I essentially have 30 arrays which are (millions, 1) in size. If it weren't for the fact that there are so many points, CSV would be an ideal format! I am unlikely to want to do lookups of single entries; more likely is that I might want to plot small subsets of the data (e.g. the last 100 hours, or the last 1000 hours, etc.).
HDF5 is an excellent choice! It has a nice interface, is widely used (in the scientific community at least), many programs support it (MATLAB, for example), and there are libraries for C, C++, Fortran, Python, ... There is a complete toolset for displaying the contents of an HDF5 file. If you later want to do complex MPI calculations on your data, HDF5 supports concurrent reads/writes. It's very well suited to handling very large datasets.
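As a minimal sketch of the append-and-resave workflow with h5py (one common Python HDF5 binding; the file name, dataset name, and dummy data are placeholders):
import h5py
import numpy as np

new_data = np.random.rand(1000)  # stand-in for newly processed values

with h5py.File('timeseries.h5', 'a') as f:
    if 'values' not in f:
        # Resizable 1-D dataset so future appends don't require rewriting.
        f.create_dataset('values', data=new_data, maxshape=(None,),
                         chunks=True, compression='gzip')
    else:
        ds = f['values']
        ds.resize(ds.shape[0] + new_data.shape[0], axis=0)
        ds[-new_data.shape[0]:] = new_data

# Reading back just a recent slice (e.g. for plotting the last N points):
with h5py.File('timeseries.h5', 'r') as f:
    recent = f['values'][-1000:]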
Maybe you could use some kind of key-value database like Redis, Berkeley DB, MongoDB... but it would be nice to have some more info about the schema you would be using.
EDITED
If you choose Redis for example, you can index very long lists:
The max length of a list is 2^32 - 1 elements (4294967295, more than 4 billion of elements per list). The main features of Redis Lists from the point of view of time complexity are the support for constant time insertion and deletion of elements near the head and tail, even with many millions of inserted items. Accessing elements is very fast near the extremes of the list but is slow if you try accessing the middle of a very big list, as it is an O(N) operation.
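For illustration, a rough sketch with redis-py, assuming a local Redis server (the key name and value encoding are made up for the example):
import redis

r = redis.Redis(host='localhost', port=6379)

# Append time-stamped samples to the tail of a list (fast even for long lists).
r.rpush('sensor:1', '2017-04-03T00:00:00,0.123')
r.rpush('sensor:1', '2017-04-03T00:00:01,0.456')

# Fetch only the most recent 1000 entries, e.g. for plotting a small subset.
recent = r.lrange('sensor:1', -1000, -1)
print(len(recent))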
I would use a single file with a fixed record length for this use case. No specialised DB solution (which seems like overkill to me in this case), just plain old struct (see the documentation for struct.py) and read()/write() on a file. If you have just millions of entries, everything should work nicely in a single file of some dozens or hundreds of MB (which is hardly too large for any file system). You also get random access to subsets in case you need that later.
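A minimal sketch of that idea, assuming each record is a (timestamp, value) pair of doubles (the format string and file name are placeholders):
import struct

RECORD = struct.Struct('<dd')  # 16 bytes: float64 timestamp, float64 value

# Append new records without touching existing data.
with open('timeseries.bin', 'ab') as f:
    f.write(RECORD.pack(1491177600.0, 0.123))
    f.write(RECORD.pack(1491177601.0, 0.456))

# Random access: seek straight to the last N records and unpack them.
N = 2
with open('timeseries.bin', 'rb') as f:
    f.seek(-N * RECORD.size, 2)  # 2 = seek relative to end of file
    records = [RECORD.unpack(f.read(RECORD.size)) for _ in range(N)]
print(records)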