Python Pandas - Appending to dataframe: One by One or Batch

I'm writing an app that monitors an application's scanning process. Of course, to check this over time I have to log the progress (don't ask me why this isn't in the app already).
To do this, the app runs every half hour, determines what is worth logging and what isn't, and adds it to a pandas DataFrame that is then saved locally as a CSV so the next run can determine whether progress is as we expect.
My question is: should I append the data I need as I find it throughout the run, or store it in a list or another DataFrame and append it all at the end of the run before saving to CSV?
Is there a benefit to one way or the other, or is the difference between running append multiple times versus once negligible?
The reason I ask is that this could eventually be large amounts of data being appended, so building efficiencies in from the start is a good idea.
Thanks in advance.

This really depends on what you mean by large amounts of data. If it's MB, then keeping everything as a DataFrame in memory is fine; however, if it's GB, then it's better to save the pieces to CSV and concatenate them into a new DataFrame:
from glob import glob
import pandas as pd
df = pd.concat([pd.read_csv(i) for i in glob('/path/to/csv_files/*.csv')])
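Either way, the expensive part is calling append on a DataFrame repeatedly, since each call copies the whole frame; collecting plain Python objects during the run and building the frame once at the end avoids that. A minimal sketch of that batch pattern (scan_results, worth_logging, and the column names are placeholders, not from the original app):
import os
import pandas as pd

rows = []                                   # accumulate plain dicts during the run
for item in scan_results:                   # hypothetical iterable produced by the scan
    if worth_logging(item):                 # hypothetical filter
        rows.append({'timestamp': item.time, 'status': item.status})

# build the DataFrame once at the end, then append the batch to the log CSV
log_path = 'scan_log.csv'
pd.DataFrame(rows).to_csv(log_path, mode='a',
                          header=not os.path.exists(log_path), index=False)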

Related

Python pandas DataFrame failing to execute to_csv() in run mode while it succeeds in debug mode

I'm using Python pandas to create a relatively large DataFrame and storing it as a CSV file on the Windows operating system. The DataFrame is relatively large: thousands of rows by about 100 columns.
The problem I'm seeing is inconsistency in creating and storing the file during execution, without any errors. The file name is created dynamically and is valid, and the data changes every time. I can get Windows to create and save the same DataFrame as a CSV in debug mode with single stepping, i.e. when there is a gap between the dataframe.to_csv() call and the next step, but at run time, irrespective of the delay, the file is sometimes generated and sometimes not.
Could someone advise the best way to debug this erratic behavior and what the root cause might be? I'm sure the data and the file name are not the culprit.
import time
import pandas as pd

# logic for the valid pd_csv_fileName creation
# logic for the valid DataFrame df1 creation & manipulation
time.sleep(x)
df1.to_csv(pd_csv_fileName, mode='a')
time.sleep(x)
# Check if the file is created
The CSV file is not open in any other program during execution.
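One way to narrow this down (a hedged diagnostic sketch rather than a confirmed fix; the explicit handle, the fsync call, and the header logic are my additions, with pd_csv_fileName and df1 taken from the snippet above) is to write through an explicit file handle, force the buffers to disk, and verify the result against an absolute path:
import os

out_path = os.path.abspath(pd_csv_fileName)    # pd_csv_fileName and df1 as in the snippet above
write_header = not os.path.exists(out_path)    # only write the header for a brand-new file

# write through an explicit handle so flushing is under our control
with open(out_path, 'a', newline='') as f:
    df1.to_csv(f, header=write_header, index=False)
    f.flush()
    os.fsync(f.fileno())                       # push OS buffers to disk before moving on

# verify immediately, using the absolute path
print(out_path, os.path.exists(out_path), os.path.getsize(out_path))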

Is there a way to compare CSVs in Python while ignoring row ordering?

I'm setting up an automatic job that needs to parse CSV files from an FTP site, each containing several tens of thousands of lines. I want to pre-process the directory to eliminate duplicate files before parsing the remaining ones. The problem is that duplicate files are being pushed to the FTP site but with different row ordering (i.e. same data, different order). That results in "duplicate files" that have different hashes and fail byte-by-byte comparisons, so I can't weed them out with minimal processing.
I want to keep file manipulation to a minimum, so I tried sorting the CSVs using the csvsort module, but this gives me an IndexError: list index out of range. Here is the relevant code:
from csvsort import csvsort
csvsort(input_filename=file_path, columns=[1, 2])
I tried finding and eliminating empty rows, but that didn't seem to be the problem and, like I said, I want to keep file manipulation to a minimum to retain file integrity. Moreover, I have no control over the creation of the files or their being pushed to the FTP site.
I can think of a number of ways to work around this issue, but they would all involve opening the CSVs, reading the contents, manipulating them, etc. Is there any way I can do a lightweight file comparison that ignores row ordering, or will I have to go for heavier processing?
You do not specify how much data you have. My take on this would differ depending on the size. Are we talking hundreds of lines? Or multi-million lines?
If you have "few" lines you can easily sort them. But if the data gets longer, you can use other strategies.
I have previously solved the problem of "weeding out lines from file A that appear in file B" using AWK, since AWK can do this with just one pass through the long file (A), making the process extremely fast. However, you may need to call an external program, so I'm not sure if this is ideal for you.
If your lines are not completely identical - let's say you need to compare just one of several fields - AWK can do that as well; just extract the fields into variables.
If you choose to go this way, the script is something along these lines:
# first pass (FNR==NR is true only while reading the first file, SMALL_LIST):
# remember every line and move on
FNR==NR{
    a[$0]++; cnt[1]+=1; next
}
# second pass: print only the lines of FULL_LIST that were not seen in SMALL_LIST
!a[$0]
Use with
c:\path\to\awk.exe -f awkscript.awk SMALL_LIST FULL_LIST > DIFF_LIST
DIFF_LIST will contain the items from FULL_LIST that are NOT in SMALL_LIST.
So, it turns out that pandas has a built-in hashing function with an option to ignore the index. As the hash is calculated per row, you need an additional sum to get a single order-independent value per file. In terms of code, it is about as lightweight as I could wish for; in terms of runtime, it parses ~15 files in ~5 seconds (~30k rows, 17 columns in each file).
from collections import defaultdict
from pandas import read_csv
from pandas.util import hash_pandas_object

# files: iterable of CSV file paths, defined elsewhere
duplicate_check = defaultdict(list)
for f in files:
    # hash each row ignoring the index, then sum so row order does not matter
    duplicate_check[hash_pandas_object(read_csv(f), index=False).sum()].append(f)
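From there, any hash that maps to more than one path marks a group of duplicate files; a short follow-up sketch (the keep-the-first policy is just an illustration):
for file_hash, paths in duplicate_check.items():
    if len(paths) > 1:
        keep, duplicates = paths[0], paths[1:]   # keep one copy, treat the rest as duplicates
        print('duplicates of %s: %s' % (keep, duplicates))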

PDF grabbing code is terminated (cache full, trying to find a workaround)

So I just started coding with Python. I have a lot of PDFs which are my target for data grabbing. The script is finished and it works without errors if I limit it to a small number of PDFs (~200). If I let the script run with 4000 PDFs it is terminated without an error. A friend of mine told me that this is due to the cache.
I save the grabbed data to lists and in the last step create a DataFrame out of the different lists. The DataFrame is then exported to Excel.
So I tried to export the DataFrame after every 200 PDFs (and then clear all lists and the DataFrame), but then pandas overwrites the prior results. Is this the right way to go? Or can anyone think of a different approach to get around the termination with a large number of PDFs?
Right now I use:
import pandas as pd

MN = list()
VdS = list()
# ... MN and VdS are filled while parsing the PDFs ...
data = {'Materialnummer': MN, 'Verwendung des Stoffs': VdS}
df = pd.DataFrame(data)
df.to_excel('test.xls')
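One common way around the overwriting is to append each batch to a single file on disk instead of rewriting it; a hedged sketch of that idea (the chunk size, file name, and the switch to CSV for the intermediate output are assumptions, not the original code):
import os
import pandas as pd

def flush_batch(MN, VdS, path='results.csv'):
    # append the current batch to disk, then clear the in-memory lists
    batch = pd.DataFrame({'Materialnummer': MN, 'Verwendung des Stoffs': VdS})
    batch.to_csv(path, mode='a', header=not os.path.exists(path), index=False)
    MN.clear()
    VdS.clear()

# e.g. call flush_batch(MN, VdS) after every ~200 PDFs, then read everything
# back at the end with pd.read_csv('results.csv') and export to Excel once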

Why does Spark output a set of CSVs instead of just one?

I had a hard time last week getting data out of Spark; in the end I had to simply go with
df.toPandas().to_csv('mycsv.csv')
out of this answer.
I had tested the more native
df.write.csv('mycsv.csv')
for Spark 2.0+, but as per the comment underneath, it drops a set of CSV files instead of one, which then need to be concatenated, whatever that means in this context. It also dropped an empty file into the directory called something like 'success'. The directory name was /mycsv/ but the CSV itself had an unintelligible name made up of a long string of characters.
This was the first I had heard of such a thing. Well, Excel has multiple tabs which must somehow be reflected in an .xls file, and NumPy arrays can be multidimensional, but I thought a CSV file was just a header and rows of values separated into columns by commas.
Another answer suggested:
query.repartition(1).write.csv("cc_out.csv", sep='|')
So this drops just one file plus the blank 'success' file, but the file still does not have the name you want; the directory does.
Does anyone know why Spark is doing this, why it will not simply output a single CSV, how it names the CSV, what that success file is supposed to contain, and whether concatenating CSV files here means joining them vertically, head to tail?
There are a few reasons why Spark outputs multiple CSVs:
- Spark runs on a distributed cluster. For large datasets, all the data may not be able to fit on a single machine, but it can fit across a cluster of machines. To write one CSV, all the data would presumably have to be on one machine and written by one machine, which one machine may not be able to do.
- Spark is designed for speed. If data lives on 5 partitions across 5 executors, it makes sense to write 5 CSVs in parallel rather than move all data to a single executor and have one executor write the entire dataset.
If you need one CSV, my presumption is that your dataset is not super large. My recommendation is to download all the CSV files into a directory, and run cat *.csv > output.csv in the relevant directory. This will join your CSV files head-to-tail. You may need to do more work to strip headers from each part file if you're writing with headers.
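If you would rather stay in Python than shell out to cat, a hedged sketch of the same idea that also avoids repeated headers (the output directory and file names are placeholders, and it assumes each part file was written with a header):
from glob import glob
import pandas as pd

# combine Spark's part files into one CSV
parts = sorted(glob('mycsv.csv/part-*'))          # placeholder output directory
combined = pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)
combined.to_csv('mycsv_single.csv', index=False)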
Does anyone know why Spark is doing this, why will it not simply output a csv,
Because it is designed for distributed computing where each chunk of data (a.k.a. partition) is written independently of others.
how does it name the csv
Name depends on the partition number.
what is that success file supposed to contain
Nothing. It just indicates success.
This basically happens because Spark writes files based on the number of partitions the data is divided into, so each partition simply dumps its own file separately. You can use the coalesce option to save them to a single file. Check this link for more info.
However, this method has the disadvantage that it needs to collect all the data on the master node, so the master node must have enough memory. A workaround for this can be seen in this answer.
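For reference, a hedged PySpark sketch of the coalesce approach described above (the input path, output path, and DataFrame are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('input_data.csv', header=True)        # placeholder input

# squeeze everything into one partition so Spark writes a single part file;
# the result is still a directory holding one part-xxxxx file plus a _SUCCESS marker
df.coalesce(1).write.csv('single_csv_out', header=True, mode='overwrite')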
This link also sheds some more light on this behavior of Spark:
Spark is like Hadoop - uses Hadoop, in fact - for performing actions like outputting data to HDFS. You'll know what I mean the first time you try to save "all-the-data.csv" and are surprised to find a directory named all-the-data.csv/ containing a 0 byte _SUCCESS file and then several part-0000n files for each partition that took part in the job.

access pandas dataframe from another file while it is still being updated

Consider a DataFrame that constantly has new values appended at a given interval (say, every 10 minutes) over a specific length of time (say, 300 minutes). While data is being added to this DataFrame, I want to simultaneously be able to read it from another file [meaning: perform some further processing/analytics on the DataFrame values from another .py file]. How can I achieve this? I suspect that I need to use the multiprocessing or multithreading library, but can I read the DataFrame from memory, or do I have to first write it to disk and read the stored file?
Also, how can I run the first file (which appends the data) in the background so as to be able to work on other files from the IPython shell (I am using Spyder 3.3 and Python 2.7)?
I did some online reading on multiprocessing but couldn't understand how to go about the two issues referred to above. Generally, any pointers on how this can be achieved in the simplest possible way would be helpful.
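The question is left open here, but the simpler of the two options it mentions, writing to disk and re-reading, looks roughly like the following hedged sketch (file names, intervals, and the placeholder data are assumptions; sharing the frame purely in memory would instead need multiprocessing machinery such as a Manager or a Queue):
# writer.py - appends one new row to a shared CSV at each interval
import os
import time
import pandas as pd

for i in range(30):                                       # e.g. 300 minutes in 10-minute steps
    row = pd.DataFrame({'tick': [i], 'value': [i * 2]})   # placeholder data
    row.to_csv('shared_data.csv', mode='a',
               header=not os.path.exists('shared_data.csv'), index=False)
    time.sleep(600)

# reader.py - run from a separate shell or kernel; re-read the file for a fresh snapshot
import pandas as pd
snapshot = pd.read_csv('shared_data.csv')
print(snapshot.tail())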
