Consider a dataframe that constantly has new values appended to it at a given interval (say, every 10 minutes) over a specific length of time (say, 300 minutes). While data is being added to this dataframe, I want to be able to read it simultaneously from another file [meaning, perform further processing/analytics on the dataframe values from another .py file]. How can I achieve this? I suspect I need to use the multiprocessing or multithreading library, but can I read the dataframe from memory, or do I have to write it to disk first and read the stored file?
Also, how can I run the first file (which appends the data) in the background so that I can work on other files from the IPython shell? (I am using Spyder 3.3 and Python 2.7.)
I did some online reading on multiprocessing but couldn't understand how to go about the two issues above. Generally, any pointers on how this can be achieved in the simplest possible way would be helpful.
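The simplest route is probably to share the data through disk rather than memory: the appending script persists the dataframe after every update, and the analysis script re-reads it whenever it needs fresh values. A minimal sketch under that assumption; the file name and the value column are purely illustrative:

# appender.py -- run in the background; appends a row every interval
# and overwrites a shared snapshot on disk
import time
import pandas as pd

df = pd.DataFrame(columns=['timestamp', 'value'])
for i in range(30):                               # 300 min / 10 min interval
    df.loc[len(df)] = [pd.Timestamp.now(), i]     # placeholder value
    df.to_pickle('shared_df.pkl')                 # persist the latest snapshot
    time.sleep(600)

# analysis.py -- run from another shell / .py file whenever needed
# import pandas as pd
# latest = pd.read_pickle('shared_df.pkl')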
I'm writing an app that monitors another application's scanning process. Of course, to check this over time I have to log the progress (don't ask me why this isn't in the app already).
To do this, the app runs every half hour, determines what is worth logging and what is not, and adds it to a pandas dataframe that is then saved locally as a CSV, so that on the next run it can determine whether progress is as expected.
My question is: should I append the data as I find it during the run, or store it in a list or another dataframe and append it all at the end of the run before saving to CSV?
Is there a benefit to one way or the other, or is the difference between running append multiple times versus once negligible?
The reason I ask is that this could eventually be a large amount of data being appended, so building in efficiencies from the start is a good idea.
Thanks in Advance
This really depends on what you mean by large amounts of data. If it's MBs, then keeping everything as a dataframe in memory is fine; however, if it's GBs, it's better to save each run to CSV and concat them into a new dataframe when needed:
from glob import glob
import pandas as pd

# read every saved run and stack them into one dataframe
df = pd.concat([pd.read_csv(f) for f in glob('/path/to/csv_files/*.csv')])
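As for appending during the run versus once at the end: each DataFrame.append (or concat) copies the whole frame, so collecting rows in a plain Python list and building the dataframe once at the end is generally cheaper. A minimal sketch with made-up names (scan_results and its fields are illustrative):

import pandas as pd

rows = []                                   # collect plain dicts during the run
for item in scan_results:                   # hypothetical iterable of findings
    rows.append({'time': item.time, 'status': item.status})

# build the dataframe once at the end of the run, then persist it
df = pd.DataFrame(rows)
df.to_csv('progress_log.csv', index=False)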
So I just started coding with Python. I have a lot of PDFs which are my target for data grabbing. I have the script finished and it works without errors if I limit it to a small number of PDFs (~200). If I let the script run with 4000 PDFs, it is terminated without an error. A friend of mine told me that this is due to the cache.
I save the grabbed data to lists and in the last step create a DataFrame out of the different lists. The DataFrame is then exported to Excel.
So I tried to export the DataFrame after every 200 PDFs (and then clear all the lists and the dataframe), but then pandas overwrites the prior results. Is this the right way to go? Or can anyone think of a different approach to get around the termination with a large number of PDFs?
Right now I use:
import pandas as pd

MN = list()
VdS = list()
data = {'Materialnummer': MN, 'Verwendung des Stoffs': VdS}
df = pd.DataFrame(data)
df.to_excel('test.xls')
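On the overwriting problem: to_excel rewrites the whole file each time, so one way around it is to flush each chunk to a CSV in append mode, writing the header only for the first chunk. A minimal sketch under that assumption (the function and file names are illustrative):

import os
import pandas as pd

def flush_chunk(mn_chunk, vds_chunk, path='test.csv'):
    # build a small frame holding only the current chunk
    chunk = pd.DataFrame({'Materialnummer': mn_chunk,
                          'Verwendung des Stoffs': vds_chunk})
    # append to the CSV; write the header only if the file does not exist yet
    chunk.to_csv(path, mode='a', header=not os.path.exists(path), index=False)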
I had a hard time last week getting data out of Spark; in the end I had to simply go with
df.toPandas().to_csv('mycsv.csv')
out of this answer.
I had tested the more native
df.write.csv('mycsv.csv')
for Spark 2.0+, but as per the comment underneath, it drops a set of CSV files instead of one, which need to be concatenated, whatever that means in this context. It also dropped an empty file into the directory called something like 'success'. The directory name was /mycsv/, but the CSV itself had an unintelligible name made up of a long string of characters.
This was the first I had heard of such a thing. Well, Excel has multiple tabs, which must somehow be reflected in an .xls file, and NumPy arrays can be multidimensional, but I thought a CSV file was just a header and rows of values separated into columns by commas.
Another answer suggested:
query.repartition(1).write.csv("cc_out.csv", sep='|')
So this drops just one file plus the blank 'success' file, but the file still does not have the name you want; the directory does.
Does anyone know why Spark is doing this, why will it not simply output a csv, how does it name the csv, what is that success file supposed to contain, and if concatenating csv files here means joining them vertically, head to tail?
There are a few reasons why Spark outputs multiple CSVs:
- Spark runs on a distributed cluster. For large datasets, all the data may not be able to fit on a single machine, but it can fit across a cluster of machines. To write one CSV, all the data would presumably have to be on one machine and written by one machine, which one machine may not be able to do.
- Spark is designed for speed. If data lives on 5 partitions across 5 executors, it makes sense to write 5 CSVs in parallel rather than move all data to a single executor and have one executor write the entire dataset.
If you need one CSV, my presumption is that your dataset is not super large. My recommendation is to download all the CSV part files into a directory and run cat *.csv > output.csv in that directory. This will join your CSV files head to tail. You may need to do more work to strip the header from each part file if you're writing with headers.
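If you'd rather do the join from Python, pandas can stack the part files and drop the repeated headers in one go; a minimal sketch, assuming the parts were downloaded into out_dir/:

from glob import glob
import pandas as pd

# read every part file (each with its own header) and stack them head to tail
parts = [pd.read_csv(p) for p in sorted(glob('out_dir/part-*.csv'))]
pd.concat(parts, ignore_index=True).to_csv('output.csv', index=False)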
Does anyone know why Spark is doing this, why will it not simply output a csv,
Because it is designed for distributed computing where each chunk of data (a.k.a. partition) is written independently of others.
how does it name the csv
Name depends on the partition number.
what is that success file supposed to contain
Nothing. It just indicates success.
This basically happens because Spark dumps one file per partition the data is divided into, so each partition simply dumps its own file separately. You can use coalesce to save them to a single file. Check this link for more info.
However, this method has the disadvantage that it needs to collect all the data on the master node, so the master node must have enough memory. A workaround for this can be seen in this answer.
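A minimal sketch of the coalesce route (the output path is illustrative; the result is still a directory holding a single part file plus the _SUCCESS marker):

# bring all partitions down to one before writing, so only one part file is produced
query.coalesce(1).write.csv('cc_out_single', sep='|', header=True)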
This link also sheds some more light on this behaviour of Spark:
Spark is like Hadoop - uses Hadoop, in fact - for performing actions like outputting data to HDFS. You'll know what I mean the first time you try to save "all-the-data.csv" and are surprised to find a directory named all-the-data.csv/ containing a 0 byte _SUCCESS file and then several part-0000n files for each partition that took part in the job.
We have a dataframe we are working with in an IPython notebook. Granted, it would be ideal if one could save a dataframe in such a way that the whole group could access it through their notebooks, and I'd love to know how to do that. However, could you help with the following specific problem?
When we do df.to_csv("Csv file name"), it appears to be located in the exact same place as the files we placed in object storage to use in the IPython notebook. However, when one goes to Manage Files, it's nowhere to be found.
When one runs pd.DataFrame.to_csv(df), the text of the CSV file is apparently given. However, when one copies that into a text editor (e.g. Sublime Text), saves it as a CSV, and attempts to read it back into a dataframe, the expected dataframe is not yielded.
How does one export a dataframe to CSV format and then access it?
I'm not familiar with Bluemix, but it sounds like you're trying to save a pandas dataframe in a way that all of your collaborators can access, and have it look the same for everyone.
Maybe saving to and reading from CSV is messing up the formatting of your dataframe. Have you tried pickling? Since pickling is built around Python, it should give consistent results.
Try this:
import pandas as pd
pd.to_pickle(df, "/path/to/pickle/My_pickle")
and on the read side:
df_read = pd.read_pickle("/path/to/pickle/My_pickle")
This is probably very easy, but after looking through documentation and possible examples online for the past several hours I cannot figure it out.
I have a large dataset (a spreadsheet) that gets heavily cleaned by a do-file. In the do-file I then want to save certain variables of the cleaned data as a temporary .csv, run some Python scripts that produce a new CSV, and then append that output to my cleaned data.
If that was unclear, here is an example.
After cleaning, my dataset (XYZ) goes from variables A to Z with 100 observations. I want to take variables A and D through F and save them as test.csv. I then want to run a Python script that takes this data and creates new variables AA to GG. I then want to take that information and append it to the XYZ dataset (making the dataset now go from A to GG with 100 observations), and then be able to run the second part of my do-file for analysis.
I have been doing this manually and it is fine but the file is going to start changing quickly and it would save me a lot of time.
Would this work (assuming you can get to Python)?
tempfile myfiletemp
save `myfiletemp'                    // keep the cleaned data for later
outsheet using myfile1.csv, comma    // export the selected variables as CSV
shell python.exe myscript.py         // run the Python step
insheet using myfile2.csv, clear     // read the CSV produced by Python
append using `myfiletemp'            // combine with the saved cleaned data
Type "help shell" in Stata. What you want to do is shell out from Stata, call Python, and then have Stata resume whatever you want it to do after the Python script has completed.