PyArrow SIGSEGV error when using UnionDatasets - python

The context:
I am using PyArrow to read a folder structured as exchange/symbol/date.parquet. The folder contains multiple exchanges, multiple symbols and multiple files. At the time I am writing the folder is about 30GB/1.85M files.
If I use a single PyArrow Dataset to read/manage the entire folder, the simplest process with just the dataset defined will occupy 2.3GB of RAM. The problem is, I am instanciating this dataset on multiple processes but since every process only needs some exchanges (typically just one), I don't need to read all folders and files in every single process.
So I tried to use a UnionDataset composed of single exchange Dataset. In this way, every process just loads the required folder/files as a dataset. By a simple test, by doing so every process now occupy just 868MB of RAM, -63%.
The problem:
When using a single Dataset for the entire folder/files, I have no problem at all. I can read filtered data without problems and it's fast as duck.
But when I read the UnionDataset filtered data, I always get Process finished with exit code 139 (interrupted by signal 11: SIGSEGV) error. So after looking every single source of the problem, I noticed that if I create a dummy folder with multiple exchanges but just some symbols, in order to limit the files amout to read, I don't get that error and it works normally. If I then copy new symbols folders (any) I get again that error.
I came up thinking that the problem is not about my code, but linked instead to the amout of files that the UnionDataset is able to manage.
Am I correct or am I doing something wrong? Thank you all, have a nice day and good work.

Related

PDF Grabbing Code is terminated (Cache full trying to find a workaround)

So I just started coding with Python. I have a lot of pdfs which are my target for data grabbing. I have the script finished and it works with out errors if I limit it to a small number of pdfs (~200). If i let the skript run with 4000 pdfs the script is terminated without an error. Friend of mine told me that this is due to the cache.
I save the grabbed data to a list and in the last step create a DataFrame out of the different lists. The DataFrame is then exported to excel.
So i tried to export the DataFrame after 200 pdfs (and then clear all lists and the dataframe) but then pandas overwrites the prior results. Is this the right way to go? Or can anyone think of a different approach to get arround the Termination by large number of pdfs?
Right now i use:
MN=list()
Vds=list()
data={'Materialnummer': MN,'Verwendung des Stoffs':VdS}
df=pd.DataFrame(data)
df.to_excel('test.xls')

Is Luigi suited for building a pipeline around lots of small files (100k+)

My first gut reaction is that Luigi isn't suited for this sort of thing, but I would like the "pipeline" functionality and everything keeps pointing me back to Luigi/Airflow. I can't use Airflow as it is a Windows environment.
My use-case:
So currently from my “source” folder we have 20 or so machines that produce XML data. Over time, some process puts these files into a folder continually on each machine (its log data). On any given day these machines could have 0 files in the folder or it could have 100k+ files (each machine). Eventually someone will go delete all of the files.
One part of this process is to to watch all of these directories on all of these machines, and copy the files down to an archive folder if they are new.
My current process makes a listing of all the files on each machine every 5 minutes, grabs a listing of the files and loops over the source checking if the file is available at the destination. Copies if it doesn't exist at destination, skips if it does.
It seems that Luigi wants to work with only "a" (singular) file in its output and/or target. The issue is, I could have 1 new file, or several thousand files that shows up.
This same exact issue happens through the entire process. As the next step in my pipeline is to add the files and its metadata information (size, filename, directory location) to a db record. At that point another process reads all of the metadata record rows and puts them into a content extracted table of the XML log data.
Is Luigi even suited for something like this? Luigi seems to want to deal with one thing, do some work on it, and then emit that information out to another single file.
I can tell you that mine workflow works with 10K log files every day without any glitches. The main good stuff here that I have created one task for work with each file.

Why does Spark output a set of csv's instead or just one?

I had a hard time last week getting data out of Spark, in the end I had to simply go with
df.toPandas().to_csv('mycsv.csv')
out of this answer.
I had tested the more native
df.write.csv('mycsv.csv')
for Spark 2.0+ but as per the comment underneath, it drops a set of csv files instead of one which need to be concatenated, whatever that means in this context. It also dropped an empty file into the directory called something like 'success'. The directory name was /mycsv/ but the csv itself had an unintelligible name out of a long string of characters.
This was the first I had heard of such a thing. Well, Excel has multiple tabs which must somehow be reflected in an .xls file, and NumPy arrays can be multidimensional, but I thought a csv file was just a header, values separated into columns by commas in rows.
Another answer suggested:
query.repartition(1).write.csv("cc_out.csv", sep='|')
So this drops just one file and the blank 'success' file, still the file does not have the name you want, the directory does.
Does anyone know why Spark is doing this, why will it not simply output a csv, how does it name the csv, what is that success file supposed to contain, and if concatenating csv files means here joining them vertically, head to tail.
There are a few reasons why Spark outputs multiple CSVs:
- Spark runs on a distributed cluster. For large datasets, all the data may not be able to fit on a single machine, but it can fit across a cluster of machines. To write one CSV, all the data would presumably have to be on one machine and written by one machine, which one machine may not be able to do.
- Spark is designed for speed. If data lives on 5 partitions across 5 executors, it makes sense to write 5 CSVs in parallel rather than move all data to a single executor and have one executor write the entire dataset.
If you need one CSV, my presumption is that your dataset is not super large. My recommendation is to download all the CSV files into a directory, and run cat *.csv > output.csv in the relevant directory. This will join your CSV files head-to-tail. You may need to do more work to strip headers from each part file if you're writing with headers.
Does anyone know why Spark is doing this, why will it not simply output a csv,
Because it is designed for distributed computing where each chunk of data (a.k.a. partition) is written independently of others.
how does it name the csv
Name depends on the partition number.
what is that success file supposed to contain
Nothing. It just indicates success.
This basically happens because Spark dumps file based on the number of partitions between which the data is divided. So, each partition would simply dump it's own file seperately. You can use the coalesce option to save them to a single file. Check this link for more info.
However, this method has a disadvantage that it needs to collect all the data in the Master Node, hence the Master Node should contain enough memory. A workaround for this can seen in this answer.
This link also sheds some more information about this behavior of Spark:
Spark is like Hadoop - uses Hadoop, in fact - for performing actions like outputting data to HDFS. You'll know what I mean the first time you try to save "all-the-data.csv" and are surprised to find a directory named all-the-data.csv/ containing a 0 byte _SUCCESS file and then several part-0000n files for each partition that took part in the job.

Debugging a python script which first needs to read large files. Do I have to load them every time anew?

I have a python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files every time anew, because they will not change. So I mainly want to use this for debugging.
It happens to often, that I run scripts with bugs in them, but I only see the error message after minutes, because the reading took so long.
Are there any tricks to do something like this?
(If it is feasible, I create smaller test files)
I'm not good at Python, but it seems to be able to dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to demonstrate the bug you are observing? Most probably it is only a small part of your input file that is relevant.
Secondly, are these particular files required, or the problem will show up on any big amount of data? If it shows only on particular files, then once again most probably it is related to some feature of these files and will show also on a smaller file with the same feature. If the main reason is just big amount of data, you might be able to avoid reading it by generating some random data directly in a script.
Thirdly, what is a bottleneck of your reading the file? Is it just hard drive performance issue, or do you do some heavy processing of the read data in your script before actually coming to the part that generates problems? In the latter case, you might be able to do that processing once and write the results to a new file, and then modify your script to load this processed data instead of doing the processing each time anew.
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.

What big data solution can I use to process a huge number of input files?

I am currently searching for the best solution + environment for a problem I have. I'm simplifying the problem a bit, but basically:
I have a huge number of small files uploaded to Amazon S3.
I have a rule system that matches any input across all file content (including file names) and then outputs a verdict classifying each file. NOTE: I cannot combine the input files because I need an output for each input file.
I've reached the conclusion that Amazon EMR with MapReduce is not a good solution for this. I'm looking for a big data solution that is good at processing a large number of input files and performing a rule matching operation on the files, outputting a verdict per file. Probably will have to use ec2.
EDIT: clarified 2 above
Problem with Hadoop is when you get a very large number of files that you do not combine with CombineFileInput format, it makes the job less efficient.
Spark doesnt seem to have a problem with this though, Ive had jobs run without problems with 10s of 1000s of files and output 10s of 1000s of files. Not tried to really push the limits, not sure if there even is one!

Categories