Output .docx document using repository in palantir foundry - python

Since the foundry documentation is rather patchy and didn't really provide an answer:
Is it somehow possible to use a foundry code repository (python-docx library is available and used) and a df as input to produce word documents (.docx) as output?
I thought that maybe a composition of the transform input/output and the python-docx document.save() functionality might work, but I couldn't come up with a proper solution.
from pyspark.sql import functions as F
from transforms.api import transform, transform_df, Input, Output
import os, docx
import pandas as pd
@transform(
    output=Output("some_folder/"),
    source_df=Input(""),
)
def compute(source_df, output):
    df = source_df.dataframe()
    test = df.toPandas()
    document = docx.Document()
    document.add_paragraph(str(test.loc[1, 1]))
    document.save('test.docx')
    output.write_dataframe(df)
This code of course doesn't work, but I would appreciate a working solution (in an ideal world it would be possible to have multiple .docx files as output).

Your best bet is to use Spark to distribute the file generation over the executors. This transformation generates a Word doc for each row and stores it in a dataset container, which is recommended over using Compass (Foundry's folder system). Browse to the dataset to download the underlying files.
# from pyspark.sql import functions as F
from transforms.api import transform, Output
import pandas as pd
import docx

'''
# ====================================================== #
# === [DISTRIBUTED GENERATION OF FILESYSTEM OUTPUTS] === #
# ====================================================== #

Description
-----------
Generates docx files from the strings contained in a source Spark dataframe

Strategy
--------
1. Create a dummy Spark dataframe with a primary key and some random text
2. Use a udf to open the output filesystem and write a docx with the contents of the text column above
'''

@transform(
    output=Output("ri.foundry.main.dataset.7e0f243f-e97f-4e05-84b3-ebcc4b4a2a1c")
)
def compute(ctx, output):

    # generate sample data
    pdf = pd.DataFrame({'name': ['docx_1', 'docx_2'], 'content': ['doc1 content', 'doc2 content']})
    data = ctx.spark_session.createDataFrame(pdf)

    # write one docx per row into the output dataset's filesystem
    def strings_to_doc(df, transform_output):
        rdd = df.rdd

        def generate_files(row):
            filename = row['name'] + '.docx'
            with transform_output.filesystem().open(filename, 'wb') as worddoc:
                doc = docx.Document()
                doc.add_heading(row['name'])
                doc.add_paragraph(row['content'])
                doc.save(worddoc)

        rdd.foreach(generate_files)

    return strings_to_doc(data, output)
A pandas udf will also work if you prefer the input as a pandas dataframe, but you are forced to define a schema, which is inconvenient for your usage.
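For reference, a rough sketch of that pandas route (not part of the original answer), using Spark 3's mapInPandas and assuming the output filesystem can be used from the executors just as in the rdd version above; the 'status' column exists only to satisfy the mandatory schema:
def strings_to_doc_pandas(df, transform_output):

    def generate_files(pdf_iter):
        # receives an iterator of pandas dataframes, one batch at a time
        for pdf in pdf_iter:
            for _, row in pdf.iterrows():
                with transform_output.filesystem().open(row['name'] + '.docx', 'wb') as worddoc:
                    doc = docx.Document()
                    doc.add_heading(row['name'])
                    doc.add_paragraph(row['content'])
                    doc.save(worddoc)
            # dummy output purely to satisfy the schema requirement
            yield pdf[['name']].assign(status='written')

    # the schema declaration below is the inconvenience mentioned above
    df.mapInPandas(generate_files, schema='name string, status string').collect()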

Related

python looping through csv files

I'm new to programming. I have a certain number of csv files. What I want to do is read the text columns in these files, translate them to Spanish with the Google Translate API, and then save the data frame as a new csv file.
My code goes like this:
!pip install googletrans==4.0.0rc1
from numpy.ma.core import append
import googletrans
import pandas as pd
import numpy as np
from googletrans import Translator
translator = Translator()
df = pd.read_csv("file.csv")
sentences= df['text'].tolist()
result = []
text_es=[]
[result.append(translator.translate(sentence,dest='es')) for sentence in sentences]
for s in result:
    text_es.append(s.text)
df['text_es'] = np.array(text_es)
df.to_csv('es_file.csv', index=False)
Instead of uploading every single file and applying the code manually, I want to write code that applies this to all the files. How can I do this?
Ok, so what you're going to want to do is create an array of paths to all your csv files.
csv_paths = [Path1, Path2, Path3, Path4]
Then you need to loop over this list, which is pretty simple: use a for-each loop like this:
for path in csv_paths:
Now you can do almost exactly what you were doing before but inside the loop:
    df = pd.read_csv(path)
    sentences = df['text'].tolist()
    result = []
    text_es = []
    [result.append(translator.translate(sentence, dest='es')) for sentence in sentences]
    for s in result:
        text_es.append(s.text)
    df['text_es'] = np.array(text_es)
    # give each output its own name so successive files don't overwrite each other
    df.to_csv(path.replace('.csv', '_es.csv'), index=False)
I hope that helps :)
You can list all files in a folder using:
os.listdir('My_Downloads/Music')
And write a loop on this list.
See the os.listdir documentation for more info.
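Putting the two suggestions together, a minimal sketch (the folder path and the '_es' output suffix are placeholders) that globs every csv in a folder and runs the same translation over each one:
import glob
import pandas as pd
from googletrans import Translator

translator = Translator()

for path in glob.glob('My_Downloads/*.csv'):
    df = pd.read_csv(path)
    # translate each row of the 'text' column and keep the Spanish text
    df['text_es'] = [translator.translate(s, dest='es').text for s in df['text']]
    df.to_csv(path.replace('.csv', '_es.csv'), index=False)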

How to store custom Parquet Dataset metadata with pyarrow?

How do I store custom metadata to a ParquetDataset using pyarrow?
For example, if I create a Parquet dataset using Dask
import dask
dask.datasets.timeseries().to_parquet('temp.parq')
I can then read it using pyarrow
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('temp.parq')
However, the same method I would use for writing metadata for a single parquet file (outlined in How to write Parquet metadata with pyarrow?) does not work for a ParquetDataset, since there is no replace_schema_metadata function or similar.
I think I would probably like to write a custom _common_metadata file, as the metadata I'd like to store pertains to the whole dataset. I imagine the procedure would be something similar to:
meta = pq.read_metadata('temp.parq/_common_metadata')
custom_metadata = { b'type': b'mydataset' }
merged_metadata = { **custom_metadata, **meta.metadata }
# TODO: Construct FileMetaData object with merged_metadata
new_meta.write_metadata_file('temp.parq/_common_metadata')
One possibility (that does not directly answer the question) is to use dask.
import dask
# Sample data
df = dask.datasets.timeseries()
df.to_parquet('test.parq', custom_metadata={'mymeta': 'myvalue'})
Dask does this by writing the metadata to all the files in the directory, including _common_metadata and _metadata.
from pathlib import Path
import pyarrow.parquet as pq
files = Path('test.parq').glob('*')
all([b'mymeta' in pq.ParquetFile(file).metadata.metadata for file in files])
# True
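As a quick sanity check (assuming the 'test.parq' layout written above), the custom key can also be read back from the dataset-level _common_metadata file:
import pyarrow.parquet as pq

# the key/value pairs come back as bytes
common_meta = pq.read_metadata('test.parq/_common_metadata')
print(common_meta.metadata[b'mymeta'])  # b'myvalue'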

Can I load multiple csv files using pyarrow?

I am aware that this can be done in R as follows
ds <- open_dataset("nyc-taxi/csv/2019", format = "csv",
partitioning = "month")
But is there a way to do it in Python? I tried these, but it seems like that's not an option:
from pyarrow import csv
table = csv.read_csv("*.csv")
from pyarrow import csv
path = os.getcwd()
table = csv.read_csv(path)
table
Is there a way to make it happen in Python?
Yes, you can do this with pyarrow as well, similarly to R, using the pyarrow.dataset submodule (the pyarrow.csv submodule only exposes functionality for dealing with single csv files).
Example code:
import pyarrow.dataset as ds
dataset = ds.dataset("nyc-taxi/csv/2019", format="csv", partitioning=["month"])
table = dataset.to_table()
And then in the to_table() method you can specify row/column filters.
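For example, a hedged sketch of that filtering (the column names and the integer partition value are assumptions about the dataset layout):
import pyarrow.dataset as ds

dataset = ds.dataset("nyc-taxi/csv/2019", format="csv", partitioning=["month"])
table = dataset.to_table(
    columns=["passenger_count", "month"],   # column projection
    filter=ds.field("month") == 1,          # row filter on the (assumed integer) partition column
)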

Method called twice instead of single call in Dask's multiprocessing

I am trying to download files from a Google Storage bucket and parse them. There are millions of such files that need to be downloaded, parsed, and processed (natural language processing etc.).
I am trying the code below using Dask's parallel processing and it is working, but it is calling extract_skill twice instead of once for each row in the pandas dataframe. Please help me understand why extract_skill is being called twice.
import pandas as pd
import numpy as np
import dask
import dask.dataframe as dd
# downloading file and extract skill sets and store in skill_sets column
chunk_size = 20
df_list = np.array_split(temp_df, temp_df.shape[0]/chunk_size)
temp_df["skill_sets"] = ""
result_df = pd.DataFrame(data={}, columns=temp_df.columns)
for df_ in df_list:
    df_["skill_sets"] = dd.from_pandas(df_, npartitions=4, sort=False, name='x').apply(extract_skill, axis=1, meta='object').compute()
    result_df = pd.concat([result_df, df_], axis=0)
extract_skill()
def extract_skill(row):
    # download file, parse and do some nlp stuff
    file_name = row['file_path']
    ......
    ......
    return skill_sets
Thanks in advance.
The DataFrame.apply method runs your function on a small sample of data in order to determine the datatypes and columns of the output. See the docstring of this function and look for the keyword "meta" for more information.
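For what it's worth, a more explicit form of that meta hint (the column name and dtype here are assumptions) spells out the expected output so Dask does not have to infer it by sampling:
import dask.dataframe as dd

ddf = dd.from_pandas(df_, npartitions=4, sort=False)
# naming the result column and dtype up front avoids Dask's sample evaluation
skills = ddf.apply(extract_skill, axis=1, meta=('skill_sets', 'object')).compute()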

How do I use pandas.read_csv on Google Cloud ML?

I'm trying to deploy a training script on Google Cloud ML. Of course, I've uploaded my datasets (CSV files) in a bucket on GCS.
I used to import my data with read_csv from pandas, but it doesn't seem to work with a GCS path.
How should I proceed (I would like to keep using pandas) ?
import pandas as pd
data = pd.read_csv("gs://bucket/folder/file.csv")
Output:
ERROR 2018-02-01 18:43:34 +0100 master-replica-0 IOError: File gs://bucket/folder/file.csv does not exist
You will need to use file_io from tensorflow.python.lib.io to do that, as demonstrated below:
from tensorflow.python.lib.io import file_io
from pandas.compat import StringIO
from pandas import read_csv
# read csv file from google cloud storage
def read_data(gcs_path):
    file_stream = file_io.FileIO(gcs_path, mode='r')
    csv_data = read_csv(StringIO(file_stream.read()))
    return csv_data
Now call the above function
gcs_path = 'gs://bucket/folder/file.csv' # change path according to your bucket, folder and path
df = read_data(gcs_path)
# print(df.head()) # displays top 5 rows including headers as default
Pandas does not have native GCS support. There are two alternatives:
1. Copy the file to the VM using the gsutil CLI.
2. Use the TensorFlow file_io library to open the file and pass the file object to pd.read_csv(), as in the detailed answer above.
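A minimal sketch of option 2 (the bucket path is a placeholder, and it assumes TensorFlow is available in the training environment):
import pandas as pd
from tensorflow.python.lib.io import file_io

# pd.read_csv accepts the open file object directly
with file_io.FileIO('gs://bucket/folder/file.csv', mode='r') as f:
    data = pd.read_csv(f)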
You could also use Dask to extract and then load the data into, let's say, a Jupyter Notebook running on GCP.
Make sure you have Dask installed:
conda install dask #conda
pip install dask[complete] #pip
import dask.dataframe as dd #Import
dataframe = dd.read_csv('gs://bucket/datafile.csv') #Read CSV data
dataframe2 = dd.read_csv('gs://bucket/path/*.csv') #Read multiple CSV files at once
This is all you need to load the data.
You can filter and manipulate data with Pandas syntax now.
dataframe['z'] = dataframe.x + dataframe.y
dataframe_pd = dataframe.compute()
