At the risk of being a bit off-topic, I want to share a simple solution for loading large CSV files into a dask dataframe in a way that lets you apply the option sorted=True and save a significant amount of processing time.
I found doing set_index within dask unworkable for the size of the toy cluster I am using for learning and the size of the files (33 GB).
So if your problem is loading large unsorted CSV files (multiple tens of gigabytes) into a dask dataframe and quickly starting to run groupbys, my suggestion is to sort them beforehand with the unix command "sort".
sort's processing needs are modest and will not push your RAM to unmanageable limits. You can define the number of parallel processes it runs as well as the amount of RAM it consumes as a buffer. As long as you have disk space, this rocks.
The trick here is to export LC_ALL=C in your environment before issuing the command. Otherwise, pandas/dask ordering and unix sort ordering will not match.
Here is the code I have used:
export LC_ALL=C
zcat BigFat.csv.gz |
fgrep -v "<your header line>" |   # have headers? strip them out here
sort -t "," -k1,1 |               # fancy multi-field sorting/index? add e.g. -k3,3 -k4,4; --parallel=N and -S <size> control processes and buffer RAM
split -l 10000000 - part_         # partitions? adjust the line count per chunk; output files are part_aa, part_ab, ...
The result is ready for a
ddf = dd.read_csv(.....)
ddf = ddf.set_index('mykey', sorted=True)
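For completeness, a slightly fuller sketch of that last step, assuming the part_ output files from the split above; the column names here are placeholders for your actual schema:

import dask.dataframe as dd

# the header was stripped before sorting, so supply the column names explicitly
ddf = dd.read_csv('part_*', header=None, names=['mykey', 'value1', 'value2'])

# the files are already globally sorted on mykey, so this is cheap
ddf = ddf.set_index('mykey', sorted=True)

# groupbys on the index now avoid a full shuffle
result = ddf.groupby('mykey').value1.sum().compute()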
Hope this helps
JC
As discussed above, I am just posting this as a solution to my problem. Hope it works for others.
I am not claiming this is the best, most efficient or more pythonic! :-)
I am working with a heavily nested, non-splittable JSON input dataset housed in S3.
The files vary a lot in size - the smallest is 10 KB while others are 300 MB.
When reading the files using the code below, a simple repartition to the desired number of partitions leads to straggling tasks - most tasks finish within seconds but one lasts for a couple of hours and then runs into memory issues (heartbeat missing / heap space etc.).
I repartition in an attempt to randomize the partition-to-file mapping, since Spark may be reading files in sequence and files within the same directory tend to have the same nature - all large / all small etc.
df = spark.read.json('s3://my/parent/directory')
df = df.repartition(396)  # repartition returns a new dataframe; reassign it
# Settings (few):
default parallelism = 396
total number of cores = 400
What I tried:
I figured that the input partition scheme (S3 partitions, not Spark) - i.e. the folder hierarchy - might be leading to this skewed-partitions problem, where some S3 folders (technically 'prefixes') have just one file while others have thousands, so I transformed the input into a flattened directory structure using a hashcode, where each folder has just one file:
Earlier:
/parent1
/file1
/file2
.
.
/file1000
/parent2/
/file1
Now:
hashcode=FEFRE#$#$$#FE/parent1/file1
hashcode=#$#$#Cdvfvf##/parent1/file1
But it didn't have any effect.
I have tried with really large clusters too - thinking that even if there is input skew, that much memory should be able to handle the larger files. But I still get the straggling tasks.
When I check the number of files assigned to each partition (each file becomes a row in the dataframe due to its nested, unsplittable nature), I see between 2 and 32 files per partition. Is it because Spark packs files into partitions based on spark.sql.files.maxPartitionBytes - assigning only two files to a partition where the files are huge, and many more files to a single partition when the files are small?
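For reference, this is the setting I mean, together with the related openCostInBytes (the values below are just the documented defaults, not my production config):

# how Spark decides how many files to pack into one input partition
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))  # 128 MB default
spark.conf.set("spark.sql.files.openCostInBytes", str(4 * 1024 * 1024))      # 4 MB default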
Any recommendations to make the job work properly and distribute the tasks uniformly - given that the size of the input files cannot be changed due to the nature of the input?
EDIT: Code added for latest trial per Abdennacer's comment
This is the entire code.
Results:
The job gets stuck in Running - even though in the worker logs I see the task finished. The driver log has the error 'can not increase buffer size' - I don't know what is causing the 2 GB buffer issue, since I am not issuing any 'collect' or similar statement.
Configuration for this job is as below, to ensure the executor has huge memory per task.
Driver/Executor Cores: 2.
Driver/ Executor memory: 198 GB.
# sample s3 source path:
s3://masked-bucket-urcx-qaxd-dccf/gp/hash=8e71405b-d3c8-473f-b3ec-64fa6515e135/group=34/db_name=db01/2022-12-15_db01.json.gz
#####################
# Boilerplate
#####################
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('query-analysis').getOrCreate()
#####################
# Queries Dataframe
#####################
dfq = spark.read.json(S3_STL_RAW, primitivesAsString="true")
dfq.write.partitionBy('group').mode('overwrite').parquet(S3_DEST_PATH)
Great job flattening the files to increase read speed. Prefixes, as you seem to understand, are related to buckets, and bucket read speed is related to the number of files under each prefix and their size. The approach you took will end up reading faster than your original strategy. It will not, however, help you with skew in the data itself.
One thing you might consider is that your raw data and working data do not need to be the same set of files. There is a strategy of landing data and then pre-processing it for performance.
That is to say, keep the raw data in the format that you have now, then make a copy of the data in a more convenient format for your regular queries. (Parquet is the best choice for working with S3.)
Land data in a 'landing zone'.
As needed, process the data stored in the landing zone into a convenient splittable format for querying (a 'pre-processed' folder).
Once your raw data is processed, move it to a 'processed' folder. (Use your existing flat folder structure.) Keeping this processed data is important should you need to rebuild the table or make changes to the table format.
Create a view that is a union of the data in the 'landing zone' and the 'pre-processed' folder, as sketched below. This gives you a performant table with up-to-date data.
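A rough sketch of what that could look like in PySpark; the bucket name, paths and view name are made up for illustration, and it assumes the landing JSON and the pre-processed Parquet share a schema:

# periodically convert what has landed into a splittable, columnar copy
landing = spark.read.json("s3://my-bucket/landing/")
landing.write.mode("append").parquet("s3://my-bucket/preprocessed/")
# afterwards, move the converted raw files from landing/ to processed/ so they are not converted twice

# expose one queryable table: already-converted data plus whatever has just landed
raw = spark.read.json("s3://my-bucket/landing/")
converted = spark.read.parquet("s3://my-bucket/preprocessed/")
converted.unionByName(raw).createOrReplaceTempView("events")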
If you are using the latest S3 you should get consistent reads, which let you ensure you are querying all of the data. In the past S3 was eventually consistent, meaning you might miss some data while it was in transit; this is supposedly fixed in recent versions of S3. Run this 'processing' as often as needed and you should have a performant table to run large queries on.
S3 was designed as long-term, cheap storage. It's not made to perform quickly, but they've been trying to make it better over time.
It should be noted this will solve skew on read but won't solve skew on query. To solve the query portion you can enable Adaptive Query Execution in your config. (This adaptively adds an extra shuffle to make queries run faster.)
spark.sql.adaptive.enabled=true
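Set at session build time it would look roughly like this (continuing from the builder in your snippet, with the same app name):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('query-analysis')
         .config("spark.sql.adaptive.enabled", "true")   # let Spark re-plan shuffles at runtime
         .getOrCreate())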
I am wondering why my jobs are running very slowly, and it appears to be because I am not using all of the memory available in PySpark.
When I go to the spark UI, and I click on "Executors" I see the following memory used:
And when I look at my executors I see the following table:
I am wondering why the "Used" memory is so small compared to the "Total" memory. What can I do to use as much of the memory as possible?
Other information:
I have a small broadcast table, but it is only 1 MB in size. It should be replicated once per executor, so I do not imagine it would affect this much.
I am using spark managed by yarn
I am using spark 1.6.1
Config settings are:
spark.executor.memory=45g
spark.executor.cores=2
spark.executor.instances=4
spark.sql.broadcastTimeout = 9000
spark.memory.fraction = 0.6
The dataset I am processing has 8397 rows, and 80 partitions. I am not doing any shuffle operations aside from the repartitioning initially to 80 partitions.
It is the part where I am adding columns that becomes slow. All of the parts before that seem reasonably fast, but when I try to add a column using a custom udf (using withColumn) it slows down at that point.
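For context, the column-add looks roughly like this (the column and function names here are illustrative, not my actual code):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# a plain Python udf: every row is shipped to a Python worker process
label_udf = udf(lambda value: "high" if value > 100 else "low", StringType())
df = df.withColumn("label", label_udf(df["value"]))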
There is a similar question here:
How can I tell if my spark job is progressing? But my question is more pointed - why does the "Memory Used" column show such a low number?
Thanks.
I have nearly 60-70 timing log files (all .csv files, with a total size of nearly 100 MB). I need to analyse these files in a single go. So far, I've tried the following methods:
Merged all these files into a single file and stored it in a DataFrame (Pandas Python) and analysed them.
Stored all the csv files in a database table and analysed them.
My doubt is, which of these two methods is better? Or is there any other way to process and analyse these files?
Thanks.
For me, I usually merge the files into a DataFrame and save it as a pickle. If you merge them the file will be pretty big and will use up a lot of RAM when you load it, but it is the fastest way if your machine has a lot of RAM.
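A minimal sketch of that approach, assuming the logs sit together in one folder (adjust the glob pattern and paths to your layout):

import glob
import pandas as pd

files = sorted(glob.glob("timing_logs/*.csv"))
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

df.to_pickle("all_logs.pkl")          # one-off cost
df = pd.read_pickle("all_logs.pkl")   # reloads much faster than re-parsing the CSVs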
Storing it in a database is better in the long term, but you will spend time uploading the CSVs to the database and then even more time retrieving them. From my experience, you use a database when you want to query specific things from the table, such as a log from date A to date B; if you are going to pull all of it into pandas anyway, this method is not very good.
Sometimes, depending on your use case, you might not even need to merge everything: use the filenames as a way to query and get the right logs to process (using the filesystem), then merge only the log files your analysis is concerned with, and you can save that as a pickle for further processing in the future.
What exactly do you mean by analysing them in a single go?
I think your problem(s) might be solved using dask, and particularly the dask dataframe.
However, note that the dask documentation recommends working with one big pandas dataframe if it fits comfortably in the RAM of your machine.
Nevertheless, an advantage of dask might be its better support for parallelized or distributed computing than pandas.
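A small sketch of what that would look like; the glob pattern and column names are placeholders for whatever your logs contain:

import dask.dataframe as dd

# one lazy dataframe over all 60-70 files; the work is split across cores
ddf = dd.read_csv("timing_logs/*.csv")
summary = ddf.groupby("operation").elapsed_ms.mean().compute()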
Problem:
A large number of files. Each file is 10 MB and consists of records in JSON format, gzipped.
My snippet is loading all the data into memory. There is no need to do this. I just need a few hours of data in memory at a time. I need a sliding window.
Is it possible to apply the 'window' idea from spark streaming to the files and how would I do this?
I'm using python
location = "s3://bucketname/xxxx/2016/10/1[1-2]/*/file_prefix*.gz"
rdd = sc.textFile(location)
The snippet you posted actually does no computation. Spark execution is lazy, and only forces computation of "transformations" like maps, filters, and even textFile when you ask for a result - counting the RDD, for example.
Another note is that most Spark operations stream by default. If you have 300 10 MB JSON files, you're going to get 300 separate partitions or tasks. If you're willing to wait, you could perform most RDD operations on this dataset on one core.
If you need a sliding window, then there's good functionality for that in the Spark streaming package. But the snippet you posted has no problems as it is!
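If you do want to restrict processing to a few hours of data at a time, a rough sketch (it assumes one JSON record per line, each carrying a timestamp field, here called ts, stored as a sortable string):

import json

location = "s3://bucketname/xxxx/2016/10/1[1-2]/*/file_prefix*.gz"
rdd = sc.textFile(location)

records = rdd.map(json.loads)
window = records.filter(lambda r: "2016-10-11T00:00:00" <= r["ts"] < "2016-10-11T06:00:00")

# nothing is read from S3 until an action such as this runs
print(window.count())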
I have a really big collection of files, and my task is to open a couple of random files from this collection, treat their contents as sets of integers and compute their intersection.
This process is quite slow due to the long time it takes to read the files from disk into memory, so I'm wondering whether the reading from file can be sped up by rewriting my program in some "quick" language. Currently I'm using python, which could be inefficient for this kind of job. (I could run the tests myself if I knew some other languages besides python and javascript...)
Also, will putting all the data into a database help? The files won't fit into RAM anyway, so it will be reading from disk again, only with database-related overhead.
The content of the files is a list of long integers. 90% of the files are quite small, less than 10-20 MB, but the remaining 10% are around 100-200 MB. As input I have filenames, and I need to read each of the given files and output the integers present in every one of them.
I've tried to put this data in mongodb, but that was as slow as the plain-file-based approach, because I tried to use mongo's index capabilities and mongo does not store indexes in RAM.
Now I just cut out the 10% biggest files, store the rest in redis, and sometimes access those big files directly. This is obviously a temporary solution, because my data grows and the amount of RAM available does not.
One thing you could try is calculating the intersections of the files on a chunk-by-chunk basis (i.e., read x bytes into memory from each, calculate their intersections, and continue, finally calculating the intersection of all intersections).
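A related sketch in plain Python that never holds more than one full file plus the running intersection in memory; it streams each subsequent file line by line and assumes text files with one integer per line (as in the shell answer further down), with placeholder filenames:

def ints_in(path):
    # stream one file lazily, line by line
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield int(line)

def intersect_files(paths):
    result = set(ints_in(paths[0]))           # only the first file is held in full
    for path in paths[1:]:
        result = {n for n in ints_in(path) if n in result}
        if not result:                         # early exit: nothing left in common
            break
    return result

common = intersect_files(["file1.txt", "file2.txt", "file3.txt"])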
Or, you might consider using some "heavy-duty" libraries to help you. Consider looking into PyTables (with HDF storage) or using numpy for calculating the intersections. The benefit there is that the HDF layer should help with not keeping the entire array structure in memory all at once - though I haven't tried any of these tools before, it seems like they offer what you need.
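For the numpy route, a quick sketch (again assuming plain text files with one integer per line, and placeholder filenames):

import numpy as np

a = np.loadtxt("file1.txt", dtype=np.int64)
b = np.loadtxt("file2.txt", dtype=np.int64)

common = np.intersect1d(a, b)   # sorted array of the values present in both files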
If no file contains duplicate numbers, I'd try this:
sort file1 file2 | uniq -d
If they may contain duplicates, then you need to eliminate duplicates first:
sort -u file1 > /tmp/file1
sort -u file2 > /tmp/file2
cat /tmp/file1 /tmp/file2 | sort | uniq -d
Or, if you prefer a version that doesn't (explicitly) use temporary files:
(sort -u file1; sort -u file2) | sort | uniq -d
You don't say what format the files are in (the above assumes text, with one integer per line). If they're in some binary format, you would also need a command to translate them before applying the above commands. By using pipes you can compose this step like this:
(decode file1 | sort -u ; decode file2 | sort -u) | sort | uniq -d
Here decode is the name of a program you would have to write that parses your file format.
Apart from being incredibly short and simple, the good thing about this shell solution is that it works with files of any size, even if they don't fit into RAM.
It's not clear from your question whether you have 2 or an arbitrary number of files to intersect (the start of your question says "a couple", the end "a list of filenames"). To deal with, for example, 5 files instead of 2, use uniq -c | awk '{ if ($1=="5") print $2; }' instead of uniq -d