I am aware that this can be done in R as follows
ds <- open_dataset("nyc-taxi/csv/2019", format = "csv",
                    partitioning = "month")
But is there a way to do this in Python? I tried these, but it seems like that's not an option:
from pyarrow import csv
table = csv.read_csv("*.csv")
import os
from pyarrow import csv
path = os.getcwd()
table = csv.read_csv(path)
table
Is there a way to make it happen in Python?
Yes, you can do this with pyarrow as well, similarly to R, using the pyarrow.dataset submodule (the pyarrow.csv submodule only exposes functionality for dealing with single CSV files).
Example code:
import pyarrow.dataset as ds
dataset = ds.dataset("nyc-taxi/csv/2019", format="csv", partitioning=["month"])
table = dataset.to_table()
And then in the to_table() method you can specify row/column filters.
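For example, a minimal sketch of selecting a subset of columns and filtering on the partition column (the column names passenger_count and total_amount are just hypothetical examples of what the CSV files might contain):
import pyarrow.dataset as ds

dataset = ds.dataset("nyc-taxi/csv/2019", format="csv", partitioning=["month"])

# Read only the listed columns, and only the rows of one partition
# (adjust the filter value to match how the month partition is encoded)
table = dataset.to_table(
    columns=["passenger_count", "total_amount"],  # hypothetical column names
    filter=ds.field("month") == 1,
)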
I have a Python script to insert a CSV file into a MongoDB collection:
import pymongo
import pandas as pd
import json
client = pymongo.MongoClient("mongodb://localhost:27017")
df = pd.read_csv("iris.csv")
data = df.to_dict(orient="records")
db = client["Database name"]
db.CollectionName.insert_many(data)
Here all the columns of the CSV file are getting inserted into the Mongo collection. How can I insert only specific columns of the CSV file into the Mongo collection?
What changes can I make to the existing code?
Let's say I also have the database already created in my Mongo. Will this command work even if the database is already present (db = client["Database name"])?
Have you checked out pymongoarrow? The latest release has write support, so you can import a CSV file into MongoDB. Here are the release notes and documentation. You can also use mongoimport to import a CSV file (documentation is here), but I can't see any way to exclude fields the way you can with pymongoarrow.
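If you'd rather keep the pandas + pymongo code from your question, a simple alternative (just a sketch; the column names sepal_length and species are placeholders for your CSV's real headers) is to let pandas read only the columns you want before converting to records:
import pymongo
import pandas as pd

client = pymongo.MongoClient("mongodb://localhost:27017")

# usecols restricts which columns pandas reads from the CSV
df = pd.read_csv("iris.csv", usecols=["sepal_length", "species"])
data = df.to_dict(orient="records")

db = client["Database name"]
db.CollectionName.insert_many(data)
As for your second question: db = client["Database name"] works whether or not the database already exists, since MongoDB creates databases and collections lazily on the first write.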
Since the Foundry documentation is rather patchy and didn't really provide an answer:
Is it somehow possible to use a Foundry code repository (the python-docx library is available and used) and a df as input to produce Word documents (.docx) as output?
I thought that a composition of the transform input/output and the python-docx document.save() functionality might work, but I couldn't come up with a proper solution.
from pyspark.sql import functions as F
from transforms.api import transform, transform_df, Input, Output
import os, docx
import pandas as pd

@transform(
    output=Output("some_folder/"),
    source_df=Input(""),
)
def compute(source_df, output):
    df = source_df.dataframe()
    test = df.toPandas()
    document = docx.Document()
    document.add_paragraph(str(test.loc[1, 1]))
    document.save('test.docx')
    output.write_dataframe(df)
This code of course doesn't work, but I would appreciate a working solution (in an ideal world it would be possible to have multiple .docx files as output).
Your best bet is to use Spark to distribute the file generation over the executors. This transformation generates a Word doc for each row and stores it in a dataset container, which is recommended over using Compass (Foundry's folder system). Browse to the dataset to download the underlying files.
# from pyspark.sql import functions as F
from transforms.api import transform, Output
import pandas as pd
import docx

'''
# ====================================================== #
# === [DISTRIBUTED GENERATION OF FILESYSTEM OUTPUTS] === #
# ====================================================== #

Description
-----------
Generates a spark dataframe containing docx files with strings contained in a source spark dataframe

Strategy
--------
1. Create dummy spark dataframe with primary key and random text
2. Use a udf to open the filesystem and write a docx with the contents of the text column above
'''


@transform(
    output=Output("ri.foundry.main.dataset.7e0f243f-e97f-4e05-84b3-ebcc4b4a2a1c")
)
def compute(ctx, output):
    # gen data
    pdf = pd.DataFrame({'name': ['docx_1', 'docx_2'], 'content': ['doc1 content', 'doc2 content']})
    data = ctx.spark_session.createDataFrame(pdf)

    # function to write files
    def strings_to_doc(df, transform_output):
        rdd = df.rdd

        def generate_files(row):
            filename = row['name'] + '.docx'
            with transform_output.filesystem().open(filename, 'wb') as worddoc:
                doc = docx.Document()
                doc.add_heading(row['name'])
                doc.add_paragraph(row['content'])
                doc.save(worddoc)

        rdd.foreach(generate_files)

    return strings_to_doc(data, output)
A pandas UDF will also work if you prefer the input as a pandas dataframe, but you are forced to define a schema, which is inconvenient for your usage.
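For completeness, here is roughly what that alternative could look like with Spark's mapInPandas, reusing the imports and the @transform decorator from the block above. This is only a sketch, assuming Spark 3.x is available in your Foundry environment; the required schema argument is exactly the inconvenience mentioned:
def compute(ctx, output):
    pdf = pd.DataFrame({'name': ['docx_1', 'docx_2'], 'content': ['doc1 content', 'doc2 content']})
    data = ctx.spark_session.createDataFrame(pdf)

    def write_docs(batches):
        for batch in batches:  # each batch arrives as a pandas DataFrame
            for _, row in batch.iterrows():
                with output.filesystem().open(row['name'] + '.docx', 'wb') as worddoc:
                    doc = docx.Document()
                    doc.add_heading(row['name'])
                    doc.add_paragraph(row['content'])
                    doc.save(worddoc)
            yield batch

    # mapInPandas is lazy and requires a declared output schema, so reuse the
    # input schema and force execution with count()
    data.mapInPandas(write_docs, schema=data.schema).count()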
How do I store custom metadata to a ParquetDataset using pyarrow?
For example, if I create a Parquet dataset using Dask
import dask
dask.datasets.timeseries().to_parquet('temp.parq')
I can then read it using pyarrow
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('temp.parq')
However, the same method I would use for writing metadata for a single parquet file (outlined in How to write Parquet metadata with pyarrow?) does not work for a ParquetDataset, since there is no replace_schema_metadata function or similar.
I think I would probably like to write a custom _common_metadata file, as the metadata I'd like to store pertains to the whole dataset. I imagine the procedure would be something similar to:
meta = pq.read_metadata('temp.parq/_common_metadata')
custom_metadata = { b'type': b'mydataset' }
merged_metadata = { **custom_metadata, **meta.metadata }
# TODO: Construct FileMetaData object with merged_metadata
new_meta.write_metadata_file('temp.parq/_common_metadata')
One possibility (that does not directly answer the question) is to use dask.
import dask
# Sample data
df = dask.datasets.timeseries()
df.to_parquet('test.parq', custom_metadata={'mymeta': 'myvalue'})
Dask does this by writing the metadata to all the files in the directory, including _common_metadata and _metadata.
from pathlib import Path
import pyarrow.parquet as pq
files = Path('test.parq').glob('*')
all([b'mymeta' in pq.ParquetFile(file).metadata.metadata for file in files])
# True
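If you want to stay in pyarrow rather than Dask, one way to fill in the TODO from the question is to rebuild an Arrow schema that carries the merged key/value metadata and write it back out with pq.write_metadata (a sketch only, assuming a _common_metadata file already exists in the dataset directory):
import pyarrow.parquet as pq

meta = pq.read_metadata('temp.parq/_common_metadata')

custom_metadata = {b'type': b'mydataset'}
merged_metadata = {**custom_metadata, **(meta.metadata or {})}

# Attach the merged metadata to the Arrow schema and rewrite the sidecar file
schema = meta.schema.to_arrow_schema().with_metadata(merged_metadata)
pq.write_metadata(schema, 'temp.parq/_common_metadata')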
I want to read a folder full of parquet files that contain pandas DataFrames. In addition to the data that I'm reading I want to store the filenames from which the data is read in the column "file_origin". In pandas I am able to do it like this:
import pandas as pd
from pathlib import Path
data_dir = Path("path_of_folder_with_files")
df = pd.concat(
    pd.read_parquet(parquet_file).assign(file_origin=parquet_file.name)
    for parquet_file in data_dir.glob("*")
)
Unfortunately this is quite slow. Is there a similar way to do this with pyarrow (or any other efficient package)?
import pyarrow.parquet as pq
table = pq.read_table(data_dir, use_threads=True)
df = table.to_pandas()
You could implement it using arrow instead of pandas:
import pyarrow as pa
import pyarrow.parquet as pq

def read_with_file_origin(data_dir):  # wrapper function so the return below is valid
    batches = []
    for file_name in data_dir.glob("*"):
        table = pq.read_table(file_name)
        # Paths are converted to strings so they fit in a pa.string() column
        table = table.append_column("file_name", pa.array([str(file_name)] * len(table), pa.string()))
        batches.extend(table.to_batches())
    return pa.Table.from_batches(batches)
I don't expect it to be significantly faster, unless you have a lot of strings and objects in your table (which are slow in pandas).
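A variant of the same idea using the pyarrow.dataset API, which saves you from globbing the directory yourself (again just a sketch; it should perform about the same):
import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset(data_dir, format="parquet")

tables = []
for fragment in dataset.get_fragments():
    table = fragment.to_table()
    # fragment.path holds the path of the underlying parquet file
    table = table.append_column(
        "file_origin", pa.array([fragment.path] * len(table), pa.string())
    )
    tables.append(table)

full_table = pa.concat_tables(tables)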
I have a .csv file that I am converting into a table format using the following Python script. In order to make this useful, I need to create a table within the Excel file that holds the data (actually formatted as a table: Insert > Table). Is this possible within Python? I feel like it should be relatively easy, but I can't find anything on the internet.
The idea here is that the Python script takes the CSV file, converts it to xlsx with a table embedded on Sheet1, and then moves it to the correct folder.
import os
import shutil
import pandas as pd
src = r"C:\Users\xxxx\Python\filename.csv"
src2 = r"C:\Users\xxxx\Python\filename.xlsx"
read_file = pd.read_csv(src)  # convert to Excel
read_file.to_excel(src2, index=None, header=True)
dest = r"C:\Users\xxxx\Python\repository"
destination = shutil.copy2(src2, dest)
Edit: I got sidetracked by the original MWE.
This should work, using xlsxwriter:
import pandas as pd
import xlsxwriter
# Dummy data
my_data = {"list1": [1, 2, 3, 4], "list2": "a b c d".split()}
df1 = pd.DataFrame(my_data)
df1.to_csv("myfile.csv", index=False)
df2 = pd.read_csv("myfile.csv")

# List of column name dictionaries
headers = [{"header": i} for i in list(df2.columns)]

# Create and populate workbook
workbook = xlsxwriter.Workbook('output.xlsx')
worksheet1 = workbook.add_worksheet()
worksheet1.add_table(0, 0, len(df2), len(df2.columns) - 1, {"columns": headers, "data": df2.values.tolist()})
workbook.close()
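To finish the flow from your question (convert, then move to the correct folder), you can copy the generated workbook afterwards exactly as in your original script; the destination path below is the placeholder from your question:
import shutil

dest = r"C:\Users\xxxx\Python\repository"
destination = shutil.copy2('output.xlsx', dest)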