I'm new to Dremio, and I was following this SQS + S3 + Dremio tutorial to learn more about it. One of the code snippets mentions that get_messages_from_queue creates a CSV file, which is later used in the upload_file method to upload to S3.
However, I'm missing the portion of the code that converts the messages into CSV. Can anyone help me create a CSV using pandas? I'm new to pandas and still learning.
The SQS message body looks like this:
"Body": "{\"holiday\":\"None\",\"temp\":288.28,\"rain_1h\":0.0,\"snow_1h\":0.0,\"clouds_all\":40,\"weather_main\":\"Clouds\",\"weather_description\":\"scattered clouds\",\"date_time\":\"2012-10-02 09:00:00\",\"traffic_volume\":5545}"
Add the SQS message to a dictionary, load it into a pandas DataFrame, and finally use to_csv to export the CSV file.
import pandas as pd

# sqs_message is the parsed message body (a dict of scalars),
# so wrap it in a list to get one row
df = pd.DataFrame([sqs_message])
df.to_csv('sqs_messages.csv', index=False)  # pass the path and file name of the CSV
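If get_messages_from_queue returns several messages, the parsing and the export can be combined. This is only a sketch, not the tutorial's actual code: the helper name messages_to_csv is made up, and it assumes each message is a dict with a "Body" key holding a JSON string like the one shown above.

```python
import json

import pandas as pd


def messages_to_csv(messages, path):
    # parse the JSON string stored in each message's "Body"
    rows = [json.loads(m["Body"]) for m in messages]
    # one DataFrame row per message, then export
    pd.DataFrame(rows).to_csv(path, index=False)
```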
I've seen a number of solutions for sending a CSV as an email attachment via Python.
In my case, my Python code needs to extract data from a Snowflake view and send it to a user group as a CSV attachment. While I know how to do that with to_csv from a pandas DataFrame, my question is: do I have to create the CSV file externally at all? Can I simply run the pandas DataFrame through MIMEText? If so, what file name do I use in the header?
You don't have to create a temporary CSV file on disk, but you also can't just "attach a dataframe" since it'd have no specified on-wire format. Use a BytesIO to have Pandas serialize the CSV to memory:
import io

import pandas as pd

df = pd.DataFrame(...)
bio = io.BytesIO()
df.to_csv(bio, mode="wb", ...)  # pandas >= 1.2 accepts a binary buffer when mode contains "b"
bio.seek(0)  # rewind the in-memory file
# use the file object wherever a file object is supported,
# or extract the binary data with `bio.getvalue()`
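As for the file name in the header: it is whatever you choose to put there, since no file ever exists on disk. Here is a hedged sketch using the stdlib email.message.EmailMessage (rather than MIMEText, which is meant for text bodies, not attachments); the addresses, subject, and the attachment name report.csv are all arbitrary placeholders.

```python
import io
from email.message import EmailMessage

import pandas as pd


def dataframe_to_email(df, sender, recipient):
    # serialize the DataFrame to CSV bytes entirely in memory
    csv_bytes = df.to_csv(index=False).encode("utf-8")

    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Snowflake extract"
    msg.set_content("The extract is attached.")
    # the filename is arbitrary: it only controls what the recipient's
    # mail client displays and saves the attachment as
    msg.add_attachment(csv_bytes, maintype="text", subtype="csv",
                       filename="report.csv")
    return msg
```

The resulting message can then be handed to smtplib's send_message as usual.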
I'm building a site that, based on a user's input, sorts through JSON data and prints a schedule for them into an HTML table. I want to add the functionality that, once their table is created, they can export the data to a CSV/Excel file so we don't have to store their credentials (logins & schedules) in a database. Is this possible? If so, how can I do it, using Python preferably?
This is not an exact answer but rather the steps to follow to get a solution:
1. Read the data from JSON: some_dict = json.loads(json_string)
2. Write the appropriate code to get the data out of the dictionary (sorting, conditions, etc.) into a 2D array (list)
3. Save that list as CSV: https://realpython.com/python-csv/
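The three steps above can be sketched as follows; the JSON shape and the field names (course, day) are made up for illustration:

```python
import csv
import json

json_string = '[{"course": "Math", "day": "Mon"}, {"course": "Bio", "day": "Tue"}]'

# 1. read the data from JSON
rows = json.loads(json_string)

# 2. sort/filter as needed; sorting by day stands in for your conditions
rows.sort(key=lambda r: r["day"])

# 3. save the rows as CSV
with open("schedule.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["course", "day"])
    writer.writeheader()
    writer.writerows(rows)
```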
I'm pretty lazy and like to utilize pandas for things like this. It would be something along the lines of
import json

import pandas as pd

file = 'data.json'
with open(file) as j:
    json_data = json.load(j)

df = pd.DataFrame.from_dict(json_data, orient='index')
df.to_csv("data.csv")
When I put an object into S3 with Python, I can set the metadata at the same time. Example:
self.s3_client.put_object(
    Bucket=self._bucket,
    Key=key,
    Body=body,
    ContentEncoding=self._compression,
    ContentType="application/json",
    ContentLanguage="en-US",
    Metadata={'other-key': 'value'},
)
It seems like both pyarrow and fastparquet don't let me pass those particular keywords, despite the pandas documentation saying that extra keywords are passed on.
This saves the data how I want it to, but I can't seem to attach the metadata with any syntax that I try:
df.to_parquet(s3_path, compression='gzip')
It would help if there were an easy way to compress the parquet file and convert it to a byte stream.
I'd rather not write the file twice (either locally and then transferring to AWS, or twice on AWS).
Ok. Found it quicker than I thought.
import io

import pandas as pd

# read in the data as a DataFrame
df = pd.read_csv('file.csv')

# serialize to compressed parquet in memory
body = io.BytesIO()
df.to_parquet(
    path=body,
    compression="gzip",
    engine="pyarrow",
)

bucket = 'MY_BUCKET'
key = 'prefix/key'
s3_client.put_object(
    Bucket=bucket,
    Key=key,
    Body=body.getvalue(),
    ContentEncoding='gzip',
    ContentType="application/x-parquet",
    ContentLanguage="en-US",
    Metadata={'user-key': 'value'},
)
I'm attempting to convert a JSON file to an SQLite or CSV file so that I can manipulate the data with Python. Here is where the data is housed: JSON File.
I found a few converters online, but they couldn't handle the quite large JSON file I was working with. I also tried a Python module called sqlbiter but, like the others, it was never really able to output or convert the file.
I'm not sure where to go now; if anyone has any recommendations or insights on how to get this data into a database, I'd really appreciate it.
Thanks in advance!
EDIT: I'm not looking for anyone to do it for me, I just need to be pointed in the right direction. Are there other methods I haven't tried that I could learn?
You can utilize the pandas module for this data-processing task as follows:
First, read the JSON file using with, open, and json.load.
Second, change the format of your file a bit by turning the large dictionary that has a main key for every airport into a list of dictionaries instead.
Third, use some pandas magic to convert your list of dictionaries into a DataFrame using pd.DataFrame(data=list_of_dicts).
Finally, use pandas's to_csv function to write your DataFrame to a CSV file on disk.
It would look something like this:
import json

import pandas as pd

with open('./airports.json.txt', 'r') as f:
    j = json.load(f)

l = list(j.values())
df = pd.DataFrame(data=l)
df.to_csv('./airports.csv', index=False)
You need to load your JSON file and parse it so that all the fields are available, or load the contents into a dictionary. Then you could use pyodbc to write those fields to the database, or write them to a CSV (after import csv).
But this is just the general idea; you need to study Python and work through each step.
For instance, for writing to the database you could do something like:
for i in range(0, max_len):
    sql_order = "UPDATE MYTABLE SET MYTABLE.MYFIELD ...."
    cursor1.execute(sql_order)
cursor1.commit()
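As a concrete illustration of the database step, here is the same idea sketched with the stdlib sqlite3 module in place of pyodbc (the table name, columns, and sample records are made up), using parameterized queries rather than string-built SQL:

```python
import sqlite3

# rows extracted from the parsed JSON (hypothetical sample data)
records = [("JFK", "New York"), ("LAX", "Los Angeles")]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE airports (code TEXT, city TEXT)")
# parameterized inserts avoid quoting bugs and SQL injection
cur.executemany("INSERT INTO airports (code, city) VALUES (?, ?)", records)
conn.commit()
```

With pyodbc the structure is the same, only the connection string and parameter marker may differ.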
I am trying to write a pandas DataFrame to the parquet file format (introduced in the most recent pandas version, 0.21.0) in append mode. However, instead of appending to the existing file, the file is overwritten with the new data. What am I missing?
the write syntax is
df.to_parquet(path, mode='append')
the read syntax is
pd.read_parquet(path)
Looks like it's possible to append row groups to an already existing parquet file using fastparquet. This is quite a unique feature, since most libraries don't implement it.
Below is from the pandas docs:
DataFrame.to_parquet(path, engine='auto', compression='snappy', index=None, partition_cols=None, **kwargs)
We have to pass in both engine and **kwargs:
engine: {'auto', 'pyarrow', 'fastparquet'}
**kwargs: additional arguments passed on to the parquet library
The extra keyword we need to pass here is append=True (from fastparquet).
import os.path

import pandas as pd

file_path = "D:\\dev\\output.parquet"
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

if not os.path.isfile(file_path):
    df.to_parquet(file_path, engine='fastparquet')
else:
    df.to_parquet(file_path, engine='fastparquet', append=True)
If append is set to True and the file does not exist, you will see the error below:
AttributeError: 'ParquetFile' object has no attribute 'fmd'
Running the above script 3 times appends three batches of rows to the parquet file.
If I inspect the metadata, I can see that this resulted in 3 row groups.
Note:
Appending can be inefficient if you write too many small row groups. The typically recommended size of a row group is closer to 100,000 or 1,000,000 rows. This has a few benefits over very small row groups: compression works better, since compression operates within a row group only, and there is less overhead spent on storing statistics, since each row group stores its own statistics.
To append, do this:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

dataframe = pd.read_csv('content.csv')
output = "/Users/myTable.parquet"

# create a parquet table from your dataframe
table = pa.Table.from_pandas(dataframe)

# write directly to your parquet dataset directory
pq.write_to_dataset(table, root_path=output)
This will automatically append to your table: each call writes a new file under root_path rather than overwriting the existing data.
I used the AWS Wrangler library, and it works like a charm.
Below are the reference docs:
https://aws-data-wrangler.readthedocs.io/en/latest/stubs/awswrangler.s3.to_parquet.html
I read from a Kinesis stream and used the kinesis-python library to consume the messages and write them to S3. I have not included the JSON-processing logic, as this post deals with being unable to append data in S3. This was executed in an AWS SageMaker Jupyter notebook.
Below is the sample code I used:
!pip install awswrangler

import awswrangler as wr
import pandas as pd

event_data = pd.DataFrame(
    {'a': [a], 'b': [b], 'c': [c], 'd': [d], 'e': [e], 'f': [f], 'g': [g]},
    columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'],
)
# print(event_data)

s3_path = "s3://<your bucket>/table/temp/<your folder name>/e=" + e + "/f=" + str(f)

try:
    wr.s3.to_parquet(
        df=event_data,
        path=s3_path,
        dataset=True,
        partition_cols=['e', 'f'],
        mode="append",
        database="wat_q4_stg",
        table="raw_data_v3",
        catalog_versioning=True,  # optional
    )
    print("write successful")
except Exception as err:  # don't shadow the column variable e
    print(str(err))
I'm happy to clarify anything. In a few other posts I have read suggestions to read the existing data back and overwrite it again, but as the data gets larger that slows the process down; it is inefficient.
There is no append mode in pandas.to_parquet(). What you can do instead is read the existing file, change it, and write it back, overwriting it.
Use the fastparquet write function
from fastparquet import write
write(file_name, df, append=True)
The file must already exist as I understand it.
API is available here (for now at least): https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write
Pandas to_parquet() can handle both single files and directories containing multiple files. Pandas will silently overwrite the file if it already exists. To append to a parquet "object", just add a new file to the same parquet directory.
import datetime
import os

import pandas as pd

os.makedirs(path, exist_ok=True)

# write/append (replace the naming logic with what works for you)
filename = f'{datetime.datetime.utcnow().timestamp()}.parquet'
df.to_parquet(os.path.join(path, filename))

# read: pandas loads every file in the directory back as one frame
pd.read_parquet(path)