I recently had to take a dataframe and prepare it to output to an Excel file. However, I didn't want to save it to the local system, but rather pass the prepared data to a separate function that saves to the cloud based on a URI. After searching through a number of ExcelWriter examples, I couldn't find what I was looking for.
The goal is to take the dataframe, e.g.:
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
And temporarily store it as bytes in a variable, e.g.:
processed_data = <bytes representing the excel output>
The solution I came up with is provided in the answers and hopefully will help someone else. Would love to see others' solutions as well!
Update #2 - Example Use Case
In my case, I created an io module that allows you to use URIs to specify different cloud destinations. For example, "paths" starting with gs:// get sent to Google Storage (using gsutils-like syntax). I process the data as my first step, and then pass that processed data to a "save" function, which itself filters to determine the right path.
df.to_csv() actually works with no path and automatically returns a string (at least in recent versions), so this is my solution to allow to_excel() to do the same.
Works like the common examples, but instead of specifying a file path in ExcelWriter, it uses the standard library's BytesIO to store the result in a variable (processed_data):
from io import BytesIO
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6]
})

output = BytesIO()
with pd.ExcelWriter(output) as writer:  # the context manager closes the writer; writer.save() was removed in pandas 2.0
    df.to_excel(writer)  # plus any **kwargs
processed_data = output.getvalue()
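The URI-dispatching "save" function mentioned in the update could look something like this minimal sketch. The function name, the gs:// handling stub, and the fallback-to-local-path behavior are all illustrative assumptions, not part of any real library:

```python
import os
import tempfile

def save(uri: str, data: bytes) -> None:
    # Hypothetical dispatcher: route serialized bytes based on URI scheme.
    if uri.startswith("gs://"):
        # A real implementation would upload to Google Cloud Storage here.
        raise NotImplementedError("cloud upload not shown")
    # Fallback: treat the URI as a local filesystem path.
    with open(uri, "wb") as f:
        f.write(data)

# processed_data would normally come from output.getvalue() above.
processed_data = b"placeholder excel bytes"
path = os.path.join(tempfile.gettempdir(), "report.xlsx")
save(path, processed_data)
```

The point is that once the Excel output exists only as bytes, any destination (cloud or local) can be handled by one function that inspects the URI.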
Does Parquet support storing various data frames of different widths (numbers of columns) in a single file? In HDF5, for example, it is possible to store multiple such data frames and access them by key. So far my reading suggests that Parquet does not support this, so the alternative would be storing multiple Parquet files in the file system. I have a rather large number (say 10,000) of relatively small frames (~1-5 MB each) to process, so I'm not sure whether this could become a concern.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

dfs = []
df1 = pd.DataFrame(data={"A": [1, 2, 3], "B": [4, 5, 6]},
                   columns=["A", "B"])
df2 = pd.DataFrame(data={"X": [1, 2], "Y": [3, 4], "Z": [5, 6]},
                   columns=["X", "Y", "Z"])
dfs.append(df1)
dfs.append(df2)

for i in range(2):
    table1 = pa.Table.from_pandas(dfs[i])
    pq.write_table(table1, "my_parq_" + str(i) + ".parquet")
No, this is not possible, as Parquet files have a single schema. They normally also don't appear as single files but as multiple files in a directory, with all files sharing the same schema. This enables tools to read these files as if they were one, either fully into local RAM, distributed over multiple nodes, or to evaluate an (SQL) query on them.
Parquet can store these data frames efficiently even at this small size, so it should be a suitable serialization format for your use case. In contrast to HDF5, Parquet is only a serialization format for tabular data. As mentioned in your question, HDF5 also supports file-system-like key-value access. Since you have a large number of files, which might be problematic for the underlying filesystem, you should look at finding a replacement for this layer. A possible approach is to first serialize the DataFrame to Parquet in memory and then store it in a key-value container; this could be a simple zip archive or a real key-value store such as LevelDB.
I am trying to write a pandas dataframe to the parquet file format (introduced in pandas 0.21.0) in append mode. However, instead of appending to the existing file, the file is overwritten with new data. What am I missing?
the write syntax is
df.to_parquet(path, mode='append')
the read syntax is
pd.read_parquet(path)
It looks like it's possible to append row groups to an already existing parquet file using fastparquet. This is quite a unique feature, since most libraries don't implement it.
Below is from the pandas docs:
DataFrame.to_parquet(path, engine='auto', compression='snappy', index=None, partition_cols=None, **kwargs)
We have to pass in both engine and **kwargs:
engine: {'auto', 'pyarrow', 'fastparquet'}
**kwargs: additional arguments passed to the parquet library.
The **kwarg we need to pass here is append=True (from fastparquet).
import os.path

import pandas as pd

file_path = "D:\\dev\\output.parquet"
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

if not os.path.isfile(file_path):
    df.to_parquet(file_path, engine='fastparquet')
else:
    df.to_parquet(file_path, engine='fastparquet', append=True)
If append is set to True and the file does not exist, you will see the error below:
AttributeError: 'ParquetFile' object has no attribute 'fmd'
Running the above script 3 times appends the data three times to the parquet file.
If I inspect the metadata, I can see that this resulted in 3 row groups.
Note:
Appending can be inefficient if you write too many small row groups. The typically recommended row-group size is closer to 100,000 or 1,000,000 rows. This has a few benefits over very small row groups: compression works better, since compression operates within a row group only, and there is less overhead spent on storing statistics, since each row group stores its own statistics.
To append, do this:
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
dataframe = pd.read_csv('content.csv')
output = "/Users/myTable.parquet"
# Create a parquet table from your dataframe
table = pa.Table.from_pandas(dataframe)
# Write directly to your parquet file
pq.write_to_dataset(table, root_path=output)
This will automatically append to your table: each call adds a new file under the dataset directory.
I used the AWS Data Wrangler library; it works like a charm. The reference docs are below:
https://aws-data-wrangler.readthedocs.io/en/latest/stubs/awswrangler.s3.to_parquet.html
I read from a Kinesis stream, used the kinesis-python library to consume the messages, and wrote to S3. The JSON-processing logic is not included, since this post deals with the problem of being unable to append data to S3. This was executed in an AWS SageMaker Jupyter notebook.
Below is the sample code I used:
!pip install awswrangler
import awswrangler as wr
import pandas as pd
event_data = pd.DataFrame({'a': [a], 'b': [b], 'c': [c], 'd': [d],
                           'e': [e], 'f': [f], 'g': [g]},
                          columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
# print(event_data)
s3_path = "s3://<your bucket>/table/temp/<your folder name>/e=" + e + "/f=" + str(f)

try:
    wr.s3.to_parquet(
        df=event_data,
        path=s3_path,
        dataset=True,
        partition_cols=['e', 'f'],
        mode="append",
        database="wat_q4_stg",
        table="raw_data_v3",
        catalog_versioning=True  # Optional
    )
    print("write successful")
except Exception as e:
    print(str(e))
I'm happy to help with any clarifications. In a few other posts I have seen the advice to read the data and overwrite it again, but as the data gets larger that slows down the process; it is inefficient.
There is no append mode in pandas to_parquet(). What you can do instead is read the existing file, change it, and write it back, overwriting it.
Use the fastparquet write function
from fastparquet import write
write(file_name, df, append=True)
The file must already exist as I understand it.
API is available here (for now at least): https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write
Pandas to_parquet() can handle both single files and directories with multiple files in them. Pandas will silently overwrite the file if it already exists. To append to a parquet dataset, just add a new file to the same parquet directory.
import datetime
import os

import pandas as pd

os.makedirs(path, exist_ok=True)

# write append (replace the naming logic with what works for you)
filename = f'{datetime.datetime.utcnow().timestamp()}.parquet'
df.to_parquet(os.path.join(path, filename))

# read
pd.read_parquet(path)
I have a calculation that creates an excel spreadsheet using xlsxwriter to show results. It would be useful to sort the table after knowing the results.
One solution would be to create a separate Data structure in python, and sort the data structure, and use xlsx later, but it is not very elegant, requires a lot of data type handling.
I cannot find a way to sort the structures in the xlsx module.
Can anybody help with the internal data structure of that module? Can that be sorted, before writing it to disk.
Another solution would be reopening the file, sort the stuff and close it again?
import xlsxwriter

workbook = xlsxwriter.Workbook("Trial.xlsx")
worksheet = workbook.add_worksheet("first")
worksheet.write_number(0, 1, 2)
worksheet.write_number(0, 2, 1)
...worksheet.sort  # the kind of method I'm looking for
Can anybody help with the internal data structure of that module? Can that be sorted, before writing it to disk.
I am the author of the module and the short answer is that this can't or shouldn't be done.
It is possible to sort worksheet data in Excel at runtime but that isn't part of the file specification so it can't be done with XlsxWriter.
One solution would be to create a separate Data structure in python, and sort the data structure, and use xlsx later, but it is not very elegant, requires a lot of data type handling.
That sounds like a reasonable solution to me.
You should process your data before writing it to a Workbook as it is not easily possible to manipulate the data once in the spreadsheet.
The following example would write a column of numbers unsorted:
import xlsxwriter
with xlsxwriter.Workbook("Trial.xlsx") as workbook:
    worksheet = workbook.add_worksheet("first")

    data = [5, 2, 7, 3, 8, 1]
    for rowy, value in enumerate(data):
        worksheet.write_number(rowy, 0, value)  # use column 0
But if you first sort the data as follows:
import xlsxwriter
with xlsxwriter.Workbook("Trial.xlsx") as workbook:
    worksheet = workbook.add_worksheet("first")

    data = sorted([5, 2, 7, 3, 8, 1])
    for rowy, value in enumerate(data):
        worksheet.write_number(rowy, 0, value)  # use column 0
You would get the column in sorted order: 1, 2, 3, 5, 7, 8.
After looking around for about a week, I have been unable to find an answer that I can get to work. I am making an assignment manager for a project for my first year CS class. Everything else works how I'd like it to (no GUI, just text) except that I cannot save data to use each time you reopen it. Basically, I would like to save my classes dictionary:
classes = {period_1:assignment_1, period_2:assignment_2, period_3:assignment_3, period_4:assignment_4, period_5:assignment_5, period_6:assignment_6, period_7:assignment_7}
after the program closes so that I can retain the data stored in the dictionary. However, I cannot get anything I have found to work. Again, this is a beginner CS class, so I don't need anything fancy, just something basic that will work. I am using a school-licensed form of Canopy for the purposes of the class.
L3viathan's post might be a direct answer to this question, but I would suggest the following for your purpose: using pickle.
import pickle

# To save a dictionary to a pickle file:
with open("assignments.p", "wb") as f:
    pickle.dump(classes, f)

# To load from a pickle file:
with open("assignments.p", "rb") as f:
    classes = pickle.load(f)
By this method, the variable would retain its original structure without having to write and convert to different formats manually.
Either use the csv library, or do something simple like:
with open("assignments.csv", "w") as f:
    for key, value in classes.items():
        f.write(key + "," + value + "\n")
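Reading the dictionary back in on the next run is just the reverse; here is a self-contained round trip (the sample data and file path are illustrative):

```python
import os
import tempfile

# Write a sample dictionary in the same "key,value" format as above.
classes = {"period_1": "assignment_1", "period_2": "assignment_2"}
path = os.path.join(tempfile.gettempdir(), "assignments.csv")
with open(path, "w") as f:
    for key, value in classes.items():
        f.write(key + "," + value + "\n")

# Load it back into a dictionary.
loaded = {}
with open(path) as f:
    for line in f:
        key, value = line.rstrip("\n").split(",", 1)
        loaded[key] = value
```

Note that split(",", 1) splits only on the first comma, so values containing commas still round-trip correctly as long as the keys contain none.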
Edit: Since it seems that you can't read or write files in your system, here's an alternative solution (with pickle and base85):
import pickle, base64

def save(something):
    pklobj = pickle.dumps(something)
    print(base64.b85encode(pklobj).decode('utf-8'))

def load():
    pklobj = base64.b85decode(input("> ").encode('utf-8'))
    return pickle.loads(pklobj)
To save something, call save on your object and copy the string that is printed to your clipboard; then you can save it in a file, for instance.
>>> save(classes) # in my case: {34: ['foo#', 3]}
fCGJT081iWaRDe;1ONa4W^ZpJaRN&NWpge
To load, you call load() and enter the string:
>>> load()
> fCGJT081iWaRDe;1ONa4W^ZpJaRN&NWpge
{34: ['foo#', 3]}
The pickle approach described by Ébe Isaac and L3viathan is the way to go. In case you also want to do something else with the data, you might want to consider pandas (which you should only use if you do something more than just exporting the data).
As there are only basic strings in your dictionary according to your comment below your question, it is straightforward to use; if you have more complicated data structures, then you should use the pickle approach:
import pandas as pd
classes = {'period_1':'assignment_1', 'period_2':'assignment_2', 'period_3':'assignment_3', 'period_4':'assignment_4', 'period_5':'assignment_5', 'period_6':'assignment_6', 'period_7':'assignment_7'}
pd.DataFrame.from_dict(classes, orient='index').sort_index().rename(columns={0: 'assignments'}).to_csv('my_csv.csv')
That gives you the following output:
assignments
period_1 assignment_1
period_2 assignment_2
period_3 assignment_3
period_4 assignment_4
period_5 assignment_5
period_6 assignment_6
period_7 assignment_7
In detail:
.from_dict(classes, orient='index') creates the actual dataframe using the dictionary as in input
.sort_index() sorts the index which is not sorted as you use a dictionary for the creation of the dataframe
.rename(columns={0: 'assignments'}) that just assigns a more reasonable name to your column (by default '0' is used)
.to_csv('my_csv.csv') that finally exports the dataframe to a csv
If you want to read in the file again you can do it as follows:
df2 = pd.read_csv('my_csv.csv', index_col=0)
I'm happy to use csv.Dialect objects for reading and writing CSV files in python. My only problem with this now is the following:
it seems like I can't use them as a to_csv parameter in pandas
to_csv and Dialect (and read_csv) parameters differ (e.g. to_csv has sep instead of delimiter)... so generating a key-value parameter list doesn't seem to be a good idea
So I'm a little lost here, what to do.
What can I do if I have a dialect specified but I have a pandas.DataFrame I have to write into CSV? Should I create a parameter mapping by hand?! Should I change to something else from to_csv?
I have pandas-0.13.0.
Note: to_csv(csv.reader(..., dialect=...), ...) didn't work:
need string or buffer, _csv.writer found
If you have a CSV reader, then you don't need to also do a pandas.read_csv call. You can create a dataframe from a dictionary, so your code would look something like:
csv_dict = # Insert dialect code here to read in the CSV as a dictonary of the format {'Header_one': [1, 2, 3], 'Header_two': [4, 5, 6]}
df = pd.DataFrame(csv_dict)
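To address the "mapping by hand" part of the question: one workable approach is a small hand-rolled translation from csv.Dialect attributes to to_csv() keyword names. This is a sketch, not an official pandas feature; lineterminator is deliberately left out because its to_csv spelling has changed across pandas versions:

```python
import csv

import pandas as pd

def dialect_to_csv_kwargs(dialect):
    # Translate the documented csv.Dialect attributes to their
    # DataFrame.to_csv() keyword equivalents.
    return {
        "sep": dialect.delimiter,
        "quotechar": dialect.quotechar,
        "doublequote": dialect.doublequote,
        "escapechar": dialect.escapechar,
        "quoting": dialect.quoting,
    }

class SemiDialect(csv.Dialect):
    # An example dialect: semicolon-separated values.
    delimiter = ";"
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    escapechar = None
    quoting = csv.QUOTE_MINIMAL
    lineterminator = "\r\n"

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
# to_csv with path_or_buf=None returns the CSV text as a string.
out = df.to_csv(None, index=False, **dialect_to_csv_kwargs(SemiDialect()))
```

The mapping only needs to be written once and can then be reused for every dialect in the codebase.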