Try/Except in for loop failing - python

I am trying to catch exceptions while processing files from an AWS S3 bucket. I know the processing normally works fine, because the error I get is expected: I generated it myself by altering the column names of one file. The bucket contains several files that should process normally, while the one file I altered should throw an exception. If a file cannot be processed, I want to append its filename to a list, log the exception with the logging module, and continue processing the remaining files. This is my code:
for item in settings.keys:
    try:
        response = settings.client.get_object(Bucket=settings.source_bucket, Key=item)
        tmp = pd.read_csv(io.BytesIO(response['Body'].read()), encoding='unicode_escape', sep=None, engine='python')
        tmp['account_number'] = item.split('/')[4][:-4]
        tmp.columns = tmp.columns.str.strip()
        tmp.columns = tmp.columns.map(settings._config['balances']['columns'])
        df = pd.concat([df, tmp], ignore_index=False)
    except:
        settings.unprocessed.append(item)
        logger.exception(f'{item} Not Processed')
Before I altered that one file, everything processed as it should. With try/except, I want to catch the exception if a file contains errors and still process the rest of the files. However, after I altered the one file, every single file in the bucket threw an exception and nothing was processed. Does anyone have any input as to why this happens?
2023-01-25 14:59:56 - ERROR - xxxx.csv Not Processed
Traceback (most recent call last):
File "C:\Users\xxxx\Desktop\xxxx\Python\xxxx\xxxx\xxxx.py", line 19, in balances
df = pd.concat([df, tmp], ignore_index=False)
File "C:\Users\xxxx\Desktop\xxxx\Python\xxxx\xxxx\xxxx\venv\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\xxxx\Desktop\xxxx\Python\xxxx\xxxx\xxxx\venv\lib\site-packages\pandas\core\reshape\concat.py", line 360, in concat
return op.get_result()
File "C:\Users\xxxx\Desktop\xxxx\Python\xxxx\xxxx\xxxx\venv\lib\site-packages\pandas\core\reshape\concat.py", line 591, in get_result
indexers[ax] = obj_labels.get_indexer(new_labels)
File "C:\Users\xxxx\Desktop\xxxx\Python\xxxx\xxxx\xxxx\venv\lib\site-packages\pandas\core\indexes\base.py", line 3721, in get_indexer
raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
raise ValueError("cannot reindex on an axis with duplicate labels")
ValueError: cannot reindex on an axis with duplicate labels
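For reference, a minimal sketch of the per-file try/except pattern, reusing the settings object and logger from the code above: each successfully parsed frame goes into a list, and the concat happens once at the end, so the accumulated dataframe is only built from files that parsed cleanly.
import io
import pandas as pd

processed_frames = []  # frames that parsed and mapped cleanly

for item in settings.keys:  # settings and logger are assumed to be the same objects as above
    try:
        response = settings.client.get_object(Bucket=settings.source_bucket, Key=item)
        tmp = pd.read_csv(io.BytesIO(response['Body'].read()),
                          encoding='unicode_escape', sep=None, engine='python')
        tmp['account_number'] = item.split('/')[4][:-4]
        tmp.columns = tmp.columns.str.strip()
        tmp.columns = tmp.columns.map(settings._config['balances']['columns'])
        processed_frames.append(tmp)
    except Exception:  # per-file failure: record it and keep going
        settings.unprocessed.append(item)
        logger.exception(f'{item} Not Processed')

# one concat over the frames that succeeded
df = pd.concat(processed_frames, ignore_index=False) if processed_frames else pd.DataFrame()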
UPDATE:
I did the same for some other files, and there it works as expected: the error I generated is caught and printed to the console, and the filename is appended to the list. This is the working code that behaves as expected:
for item in settings.keys:
    try:
        tmp = pd.DataFrame()
        response = settings.client.get_object(Bucket=settings.source_bucket, Key=item)
        if item.endswith('.csv'):
            tmp = pd.read_csv(io.BytesIO(response['Body'].read()), encoding='unicode_escape', sep=None, engine='python')
        elif item.endswith('.xlsx'):
            tmp = pd.read_excel(io.BytesIO(response['Body'].read()))
        tmp['file'] = item.split('/')[4]
        tmp.columns = tmp.columns.map(settings._config['account statements']['columns'])
        tmp['row'] = tmp.index + 2
        tmp.columns = tmp.columns.astype(str)
        tmp.rename(columns=lambda x: x.strip())
        for col in tmp.columns:
            if col.startswith('Beløp'):
                settings.statement_currencies[item.split('/')[-1:][0]] = col[-3:]
                tmp[col] = tmp[col].astype(str)
                tmp[col] = tmp[col].str.replace(',', '.')
                tmp[col] = tmp[col].astype(float)
                tmp['direction'] = np.where(tmp[col] > 0, 'Incoming', 'Outgoing')
        df = pd.concat([df, tmp], ignore_index=False)
    except:
        settings.unprocessed.append(item)
        logger.exception(f'{item} Not Processed')

Related

I am getting this error - raise ValueError("Unsupported predictor value: %d"%ft) TypeError: %d format: a real number is required, not bytes

I am trying to extract text from PDFs and compare the content, finally saving the result as an Excel file. But when I run it (the code is given below), I get the error. I have provided the whole traceback.
import pdfminer
import pandas as pd
from time import sleep
from tqdm import tqdm
from itertools import chain
import slate

# List of pdf files to process
pdf_files = ['file1.pdf', 'file2.pdf']

# Create a list to store the text from each PDF
pdf1_text = []
pdf2_text = []

# Iterate through each pdf file
for pdf_file in tqdm(pdf_files):
    # Open the pdf file
    with open(pdf_file, 'rb') as pdf_now:
        # Extract text using slate
        text = slate.PDF(pdf_now)
        text = text[0].split('\n')
        if pdf_file == pdf_files[0]:
            pdf1_text.append(text)
        else:
            pdf2_text.append(text)
    sleep(20)

pdf1_text = list(chain.from_iterable(pdf1_text))
pdf2_text = list(chain.from_iterable(pdf2_text))

differences = set(pdf1_text).symmetric_difference(pdf2_text)

# Create a new dataframe to hold the differences
differences_df = pd.DataFrame(columns=['pdf1_text', 'pdf2_text'])

# Iterate through the differences and add them to the dataframe
for difference in differences:
    # Create a new row in the dataframe with the difference from pdf1 and pdf2
    differences_df = differences_df.append({'pdf1_text': difference if difference in pdf1_text else '',
                                            'pdf2_text': difference if difference in pdf2_text else ''}, ignore_index=True)

# Write the dataframe to an excel sheet
differences_df = differences_df.applymap(lambda x: x.encode('unicode_escape').decode('utf-8') if isinstance(x, str) else x)
differences_df.to_excel('differences.xlsx', index=False, engine='openpyxl')

import openpyxl
import re

# Load the Excel file into a dataframe
df = pd.read_excel("differences.xlsx")

# Create a condition to check the number of words in each cell
for column in ["pdf1_text", "pdf2_text"]:
    df[f"{column}_word_count"] = df[column].str.split().str.len()
    condition = df[f"{column}_word_count"] < 10
    # Drop the rows that meet the condition
    df = df[~condition]

for column in ["pdf1_text", "pdf2_text"]:
    df = df.drop(f"{column}_word_count", axis=1)

# Save the modified dataframe to a new Excel file
df.to_excel("differences.xlsx", index=False)
This is my code, and below is the error I am getting. The whole traceback is listed below:
Traceback (most recent call last):
File "c:\Users\lmohandas\stuff\1801pdfs\slatetrial.py", line 22, in <module>
text = slate.PDF(pdf_now)
File "C:\Users\lmohandas\AppData\Local\Programs\Python\Python310\lib\site-packages\slate\classes.py", line 61, in __init__
self.doc = PDFDocument(self.parser, password)
File "C:\Users\lmohandas\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfminer\pdfdocument.py", line 558, in __init__
self.read_xref_from(parser, pos, self.xrefs)
File "C:\Users\lmohandas\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfminer\pdfdocument.py", line 789, in read_xref_from
xref.load(parser)
File "C:\Users\lmohandas\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfminer\pdfdocument.py", line 242, in load
self.data = stream.get_data()
File "C:\Users\lmohandas\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfminer\pdftypes.py", line 292, in get_data
self.decode()
File "C:\Users\lmohandas\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfminer\pdftypes.py", line 283, in decode
data = apply_png_predictor(pred, colors, columns, bitspercomponent, data)
File "C:\Users\lmohandas\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfminer\utils.py", line 46, in apply_png_predictor
raise ValueError("Unsupported predictor value: %d"%ft)
TypeError: %d format: a real number is required, not bytes
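I can't pin down the root cause from the traceback alone, but one hedged alternative is to skip slate and extract the text with pdfminer.six's high-level API directly. A minimal sketch, assuming pdfminer.six is installed and using the same placeholder file names:
# Assumes pdfminer.six is installed (pip install pdfminer.six); slate is not needed here.
from pdfminer.high_level import extract_text

pdf_files = ['file1.pdf', 'file2.pdf']  # placeholder file names

# Split each PDF's text into lines, mirroring the text[0].split('\n') step above
pdf_texts = {pdf_file: extract_text(pdf_file).split('\n') for pdf_file in pdf_files}
pdf1_text = pdf_texts[pdf_files[0]]
pdf2_text = pdf_texts[pdf_files[1]]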

AttributeError - 'Div' object has no attribute 'set_index'

I am trying to run this code:
def parse_data(contents, filename):
    content_type, content_string = contents.split(',')
    decoded = base64.b64decode(content_string)
    try:
        if 'csv' in filename:
            # Assume that the user uploaded a CSV or TXT file
            df = pd.read_csv(
                io.StringIO(decoded.decode('utf-8')))
        elif 'xls' in filename:
            # Assume that the user uploaded an excel file
            df = pd.read_excel(io.BytesIO(decoded))
        elif 'txt' or 'tsv' in filename:
            # Assume that the user uploaded a TXT/TSV file
            df = pd.read_csv(
                io.StringIO(decoded.decode('utf-8')), delimiter=r'\s+')
    except Exception as e:
        print(e)
        return html.Div([
            'There was an error processing this file.'
        ])
    return df

def update_graph(contents, filename):
    fig = {
        'layout': go.Layout(
            plot_bgcolor=colors["graphBackground"],
            paper_bgcolor=colors["graphBackground"])
    }
    if contents:
        contents = contents[0]
        filename = filename[0]
        df = parse_data(contents, filename)
        df = df.set_index(df.columns[0])
        fig['data'] = df.iplot(asFigure=True, kind='scatter', mode='lines+markers', size=1)
    return fig
And I get this error:
Traceback (most recent call last):
File "/Users/.../PycharmProjects/pythonProject2/main.py", line 93, in update_graph
df = df.set_index(df.columns[0])
AttributeError: 'Div' object has no attribute 'set_index'
Any ideas what might be wrong? Thanks a lot!
The problem is in your parse_data(). If it can't read the file, it runs return html.Div(...), so df = parse_data(...) gives you df = html.Div(...). You never check whether you actually got data back in df, so df.set_index() ends up being html.Div().set_index().
It may be better to return None and check for that after df = parse_data(...):
def parse_data(contents, filename):
    # ... code ...
    try:
        # ... code ...
    except Exception as e:
        print(e)
        return None
    return df
and later
df = parse_data(contents, filename)
if df is None:
    html.Div(['There was an error processing this file.'])
else:
    df = df.set_index(df.columns[0])
    fig['data'] = df.iplot(asFigure=True, kind='scatter', mode='lines+markers', size=1)
But this can still be a problem for fig['data'] when the file can't be read. I can't test your code, but maybe it should assign the Div to fig['data']:
if df is None:
    fig['data'] = html.Div(['There was an error processing this file.'])
else:
    df = df.set_index(df.columns[0])
    fig['data'] = df.iplot(asFigure=True, kind='scatter', mode='lines+markers', size=1)
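If it helps, an alternative sketch reusing the names from the question (pd, io, base64, html, go, colors): let parse_data() raise on failure so it only ever returns a dataframe, and handle the error inside the callback.
def parse_data(contents, filename):
    content_type, content_string = contents.split(',')
    decoded = base64.b64decode(content_string)
    # no try/except here: let read errors propagate to the caller
    if 'csv' in filename:
        return pd.read_csv(io.StringIO(decoded.decode('utf-8')))
    elif 'xls' in filename:
        return pd.read_excel(io.BytesIO(decoded))
    elif 'txt' in filename or 'tsv' in filename:
        return pd.read_csv(io.StringIO(decoded.decode('utf-8')), delimiter=r'\s+')
    raise ValueError(f'Unsupported file type: {filename}')

def update_graph(contents, filename):
    fig = {
        'layout': go.Layout(plot_bgcolor=colors["graphBackground"],
                            paper_bgcolor=colors["graphBackground"])
    }
    if contents:
        try:
            df = parse_data(contents[0], filename[0])
        except Exception as e:
            print(e)
            return fig  # nothing to plot; return the empty layout
        df = df.set_index(df.columns[0])
        fig['data'] = df.iplot(asFigure=True, kind='scatter', mode='lines+markers', size=1)
    return fig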

How can I speed up these dataframe operations on 12k files/50gb?

Background:
I have 12,000 CSV files (50 GB) of data that mostly share the same format, but some may be missing a column or two, and the header row does not always start on the first row of the file.
I have a class with a couple of functions that use pandas to analyze and normalize these CSV files, stored either locally or in a Google bucket.
The following actions occur in these functions:
In analyze_files:
loop through all the files, "peeking" at their contents to determine the headers and whether any rows need to be skipped to reach the header row.
translate all collected headers into a standard format, removing everything but alphanumerics and underscores from the column names.
In normalize_files:
loop through all files, this time loading each one completely.
convert the column headers to the standardized versions of the headers from analyze_files.
upload or save the updated version of the file.
The functions work as intended, but I'm looking for ways to speed things up.
Using the version below (simplified into an MCVE) with 12,000 local files on an 8-core, 16 GB RAM machine:
analyze_files takes around 2-4 minutes
normalize_files takes around 52 minutes
from google.cloud import storage
import pandas as pd
import numpy as np
import glob
import os
import re

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "./service_account_details.json"

class MyClass(object):
    def __init__(self, uses_gs=False, gs_bucket_name=None, gs_folder_path=None):
        self.__uses_gs = uses_gs
        if uses_gs:
            self.__gs_client = storage.Client()
            self.__gs_bucket_name = gs_bucket_name
            self.__gs_bucket = self.__gs_client.get_bucket(gs_bucket_name)
            self.__gs_folder_path = gs_folder_path
        else:
            # save to a subfolder of current directory
            self.__save_location = os.path.join(os.path.dirname(os.path.abspath(__file__)), self.__class__.__name__)
            if not os.path.exists(self.__save_location):
                os.mkdir(self.__save_location)
        self.__file_analysis = dict()
        self.__file_columns = set()
        self.__file_column_mapping = dict()

    def analyze_files(self):
        # collect the list of files
        files_to_analyze = list()
        if self.__uses_gs:
            gs_files = self.__gs_client.list_blobs(self.__gs_bucket, prefix=self.__gs_folder_path, delimiter="/")
            for file in gs_files:
                if file.name == self.__gs_folder_path:
                    continue
                gs_filepath = f"gs://{self.__gs_bucket_name}/{file.name}"
                files_to_analyze.append(gs_filepath)
        else:
            local_files = glob.glob(os.path.join(self.__save_location, "*.csv"))
            files_to_analyze.extend(local_files)
        # analyze each collected file
        for filepath in files_to_analyze:
            # determine how many rows to skip in order to start at the header row,
            # then collect the headers for this particular file, to be utilized for comparisons in `normalize_files`
            skiprows = None
            while True:
                try:
                    df = pd.read_csv(filepath, nrows=nrows, skiprows=skiprows)
                    break
                except pd.errors.ParserError as e:
                    try:
                        start_row_index = re.findall(r"Expected \d+ fields in line (\d+), saw \d+", str(e))[0]
                        skiprows = int(start_row_index) - 1
                    except IndexError:
                        print("Could not locate start_row_index in pandas ParserError message")
                        continue
            headers = df.columns.values.tolist()
            self.__file_columns.update(headers)
            # store file details as pandas parameters, so we can smoothly transition into reading the files efficiently
            skiprows = skiprows + 1 if skiprows else 1  # now that we know the headers, we can skip the header row
            self.__file_analysis[filepath] = dict(skiprows=skiprows, names=headers, dtype=dict.fromkeys(headers, str))
        # convert the columns to their bigquery-compliant equivalents
        non_alpha = re.compile(r"([\s\W]|^\d+)")
        multi_under = re.compile(r"(_{2,})")
        self.__file_column_mapping.update({
            file_column: multi_under.sub("_", non_alpha.sub("_", file_column)).upper()
            for file_column in self.__file_columns
        })

    def normalize_files(self):
        # perform the normalizations and upload/save the final results
        total_columns = len(self.__file_columns)
        for filepath, params in self.__file_analysis.items():
            df = pd.read_csv(filepath, **params)
            # rename the column header to align with bigquery columns
            df.rename(columns=self.__file_column_mapping, inplace=True)
            if len(params["names"]) != total_columns:
                # swap the missing column names out for the bigquery equivalents
                missing_columns = [self.__file_column_mapping[c] for c in self.__file_columns - set(params["names"])]
                # add the missing columns to the dataframe
                df[[*missing_columns]] = pd.DataFrame([[np.nan] * len(missing_columns)], index=df.index)
            if self.__uses_gs:
                blob_path = filepath[5 + len(self.__gs_bucket_name) + 1:]  # "gs://" + "{bucket_name}" + "/"
                self.__gs_bucket.blob(blob_path).upload_from_string(df.to_csv(index=False), "text/csv")
            else:  # save locally
                df.to_csv(filepath, index=False)
I thought about using dask, combined with ProcessPool and ThreadPool from the multiprocessing module, but I'm struggling with exactly what approach to take.
Since the dataframe operations are CPU-bound, they seem best suited for dask, possibly combined with a ProcessPool to divvy up the 12k files across the 8 available cores, with dask then utilizing the threads of each core (overcoming GIL limitations).
Uploading the files back to disk or a Google bucket seems better suited to a ThreadPool, since that activity is network-bound.
As for reading files from a Google bucket, I'm not sure which approach would work best.
Basically, it comes down to two scenarios:
What methods/logic would perform best when working with local files?
And what methods/logic would perform best when pulling from and saving back to (overwriting/updating) a Google bucket?
Can someone please provide some direction or code that would deliver the most efficient speed boost for the two functions above?
Benchmark tests would be greatly appreciated, as I've been pondering this topic for the better part of a week and it would be great to have statistics to back up the choice of methodology.
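For the network-bound uploads specifically, a minimal sketch of the ThreadPool idea, assuming the normalized dataframes are available as (blob_path, dataframe) pairs; the bucket name and worker count are placeholders:
from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-bucket")  # placeholder bucket name

def upload_csv(item):
    # item is a (blob_path, dataframe) pair; serialize and upload one CSV blob
    blob_path, df = item
    bucket.blob(blob_path).upload_from_string(df.to_csv(index=False), "text/csv")
    return blob_path

# normalized_frames is assumed to be an iterable of (blob_path, dataframe) pairs
with ThreadPoolExecutor(max_workers=16) as pool:
    for finished in pool.map(upload_csv, normalized_frames):
        print("uploaded", finished)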
Current Benchmarks from what I've tried
def local_analysis_test_dir_pd(test_dir):
    file_analysis, file_columns = dict(), set()
    local_files = glob.glob(os.path.join(test_dir, "*.csv"))
    for filepath in local_files:
        skiprows = None
        while True:
            try:
                df = pd.read_csv(filepath, nrows=nrows, skiprows=skiprows)
                break
            except pd.errors.ParserError as e:
                try:
                    start_row_index = re.findall(r"Expected \d+ fields in line (\d+), saw \d+", str(e))[0]
                    skiprows = int(start_row_index) - 1
                except IndexError:
                    print("Could not locate start_row_index in pandas ParserError message")
                    continue
        headers = df.columns.values.tolist()  # noqa
        skiprows = skiprows + 1 if skiprows else 1
        file_analysis[filepath] = dict(skiprows=skiprows, names=headers, dtype=dict.fromkeys(headers, str))
        file_columns.update(headers)
    non_alpha = re.compile(r"([\s\W]|^\d+)")
    multi_under = re.compile(r"(_{2,})")
    file_column_mapping = {
        file_column: multi_under.sub(" ", non_alpha.sub("_", file_column)).upper()
        for file_column in file_columns
    }
    # print dictionary length for sanity check; to ensure both functions are performing identical actions.
    print("['local_analysis_test_dir_pd'] result:", len(file_analysis), len(file_columns))
    return file_analysis, file_columns, file_column_mapping
def local_analysis_test_dir_dd(test_dir):
    file_analysis, file_columns = dict(), set()
    local_files = glob.glob(os.path.join(test_dir, "*.csv"))

    def dask_worker(filepath):
        siloed_analysis, siloed_columns = dict(), set()
        skiprows = None
        while True:
            try:
                df = pd.read_csv(filepath, nrows=nrows, skiprows=skiprows)
                break
            except pd.errors.ParserError as e:
                try:
                    start_row_index = re.findall(r"Expected \d+ fields in line (\d+), saw \d+", str(e))[0]
                    skiprows = int(start_row_index) - 1
                except IndexError:
                    print("Could not locate start_row_index in pandas ParserError message")
                    return siloed_analysis, siloed_columns
        headers = df.columns.values.tolist()
        siloed_analysis[filepath] = dict(skiprows=skiprows, names=headers, dtype=dict.fromkeys(headers, str))
        siloed_columns.update(headers)
        return siloed_analysis, siloed_columns

    dask_futures = [dask.delayed(dask_worker)(filepath) for filepath in local_files]
    file_analyses, column_sets = map(list, zip(*list(dask.compute(*dask_futures))))
    for analysis in file_analyses:
        file_analysis.update(analysis)
    file_columns.update(*column_sets)
    non_alpha = re.compile(r"([\s\W]|^\d+)")
    multi_under = re.compile(r"(_{2,})")
    file_column_mapping = {
        file_column: multi_under.sub(" ", non_alpha.sub("_", file_column)).upper()
        for file_column in file_columns
    }
    # print dictionary length for sanity check; to ensure both functions are performing identical actions.
    print("['local_analysis_test_dir_dd'] result:", len(file_analysis), len(file_columns))
def remote_analysis_test_dir_pd(test_dir):
    remote_files, file_analysis, file_columns = list(), dict(), set()
    prefix = test_dir.replace("gs://webscraping/", "") + "/"
    gs_files = gs_client.list_blobs("webscraping", prefix=prefix, delimiter="/")
    for file in gs_files:
        if file.name == prefix:
            continue
        elif file.name.endswith(".xlsx"):
            continue
        elif not file.name.endswith(".csv"):
            continue
        gs_filepath = f"gs://webscraping/{file.name}"
        remote_files.append(gs_filepath)
    for filepath in remote_files:
        skiprows = None
        while True:
            try:
                df = pd.read_csv(filepath, nrows=nrows, skiprows=skiprows)
                break
            except pd.errors.ParserError as e:
                try:
                    start_row_index = re.findall(r"Expected \d+ fields in line (\d+), saw \d+", str(e))[0]
                    skiprows = int(start_row_index) - 1
                except IndexError:
                    print("Could not locate start_row_index in pandas ParserError message")
                    continue
        headers = df.columns.values.tolist()  # noqa
        skiprows = skiprows + 1 if skiprows else 1
        file_analysis[filepath] = dict(skiprows=skiprows, names=headers, dtype=dict.fromkeys(headers, str))
        file_columns.update(headers)
    non_alpha = re.compile(r"([\s\W]|^\d+)")
    multi_under = re.compile(r"(_{2,})")
    file_column_mapping = {
        file_column: multi_under.sub("_", non_alpha.sub("_", file_column)).upper()
        for file_column in file_columns
    }
    # print dictionary length for sanity check; to ensure both functions are performing identical actions.
    print("['remote_analysis_test_dir_pd'] result:", len(file_analysis), len(file_columns))
    return file_analysis, file_columns, file_column_mapping
def remote_analysis_test_dir_dd(test_dir):
    remote_files, file_analysis, file_columns = list(), dict(), set()
    prefix = test_dir.replace("gs://webscraping/", "") + "/"
    gs_files = gs_client.list_blobs("webscraping", prefix=prefix, delimiter="/")
    for file in gs_files:
        if file.name == prefix:
            continue
        elif file.name.endswith(".xlsx"):
            continue
        elif not file.name.endswith(".csv"):
            continue
        gs_filepath = f"gs://webscraping/{file.name}"
        remote_files.append(gs_filepath)

    def dask_worker(filepath):
        siloed_analysis, siloed_columns = dict(), set()
        skiprows = None
        while True:
            try:
                df = pd.read_csv(filepath, nrows=nrows, skiprows=skiprows)
                break
            except pd.errors.ParserError as e:
                try:
                    start_row_index = re.findall(r"Expected \d+ fields in line (\d+), saw \d+", str(e))[0]
                    skiprows = int(start_row_index) - 1
                except IndexError:
                    print("Could not locate start_row_index in pandas ParserError message")
                    return siloed_analysis, siloed_columns
        headers = df.columns.values.tolist()
        siloed_analysis[filepath] = dict(skiprows=skiprows, names=headers, dtype=dict.fromkeys(headers, str))
        siloed_columns.update(headers)
        return siloed_analysis, siloed_columns

    dask_futures = [dask.delayed(dask_worker)(filepath) for filepath in remote_files]
    file_analyses, column_sets = map(list, zip(*list(dask.compute(*dask_futures))))
    for analysis in file_analyses:
        file_analysis.update(analysis)
    file_columns.update(*column_sets)
    non_alpha = re.compile(r"([\s\W]|^\d+)")
    multi_under = re.compile(r"(_{2,})")
    file_column_mapping = {
        file_column: multi_under.sub("_", non_alpha.sub("_", file_column)).upper()
        for file_column in file_columns
    }
    # print dictionary length for sanity check; to ensure both functions are performing identical actions.
    print("['remote_analysis_test_dir_dd'] result:", len(file_analysis), len(file_columns))
    return file_analysis, file_columns, file_column_mapping
def normalization_plain_with_pd(file_analysis, file_columns, file_column_mapping, meta_columns):
    total_columns = len(file_columns)
    for filepath, params in file_analysis.items():
        df = pd.read_csv(filepath, **params)
        # rename the column header to align with bigquery columns
        df.rename(columns=file_column_mapping, inplace=True)
        if len(params["names"]) != total_columns:
            missing_columns = [file_column_mapping[c] for c in file_columns - set(params["names"])]
            # add the missing columns to the dataframe
            df[[*missing_columns]] = pd.DataFrame([[np.nan] * len(missing_columns)], index=df.index)
        fpath, fname = os.path.split(filepath)
        if not fpath.startswith("gs://"):
            updated_path = os.path.join(fpath, "normalized_with_pd")
            if not os.path.exists(updated_path):
                os.mkdir(updated_path)
            new_path = os.path.join(updated_path, fname)
        else:
            new_path = "/".join([fpath, "normalized_with_pd", fname])
        df.to_csv(new_path, index=False)
def normalization_plain_with_dd(file_analysis, _file_columns, _file_column_mapping, _meta_columns):
    def dask_worker(file_item, file_columns, file_column_mapping, meta_columns):
        total_columns = len(file_columns)
        filepath, params = file_item
        df = pd.read_csv(filepath, **params)
        # rename the column header to align with bigquery columns
        df.rename(columns=file_column_mapping, inplace=True)
        if len(params["names"]) != total_columns:
            missing_columns = [file_column_mapping[c] for c in file_columns - set(params["names"])]
            # add the missing columns to the dataframe
            df[[*missing_columns]] = pd.DataFrame([[np.nan] * len(missing_columns)], index=df.index)
        fpath, fname = os.path.split(filepath)
        if not fpath.startswith("gs://"):
            updated_path = os.path.join(fpath, "normalized_with_dd")
            if not os.path.exists(updated_path):
                os.mkdir(updated_path)
            new_path = os.path.join(updated_path, fname)
        else:
            new_path = "/".join([fpath, "normalized_with_dd", fname])
        df.to_csv(new_path, index=False)

    dask_futures = [
        dask.delayed(dask_worker)(file_item, _file_columns, _file_column_mapping, _meta_columns)
        for file_item in file_analysis.items()
    ]
    dask.compute(*dask_futures)
if __name__ == "__main__":
for size, params in local_dirs.items():
print(f"['{size}_local_analysis_dir_tests'] ({params['items']} files, {params['size']})")
local_analysis_test_dir_pd(params["directory"])
local_analysis_test_dir_dd(params["directory"])
for size, settings in local_dirs.items():
print(f"['{size}_pre_test_file_cleanup']")
for file in glob.glob(os.path.join(settings["directory"], '*', '*.csv')):
os.remove(file)
print(f"['{size}_local_normalization_dir_tests'] ({settings['items']} files, {settings['size']})")
files, columns, column_mapping = local_analysis_test_dir_pd(settings["directory"])
local_normalization_plain_with_pd(files, columns, column_mapping, {})
local_normalization_plain_with_dd(files, columns, column_mapping, {})
for size, settings in remote_dirs.items():
print(f"['{size}_remote_analysis_dir_tests'] ({settings['items']} files, {settings['size']})")
_, _, _ = remote_analysis_test_dir_pd(settings["directory"])
files, columns, column_mapping = remote_analysis_test_dir_dd(settings["directory"])
print(f"['{size}_remote_normalization_dir_tests'] ({settings['items']} files, {settings['size']})")
normalization_plain_with_pd(files, columns, column_mapping, {})
normalization_plain_with_dd(files, columns, column_mapping, {})
Conclusions thus far:
local_analysis is fastest with pandas.read_csv, benchmarked against:
a single file of 343 MB ( 0.0210 sec using pandas VS 0.5141 sec using dask)
a small dir of 8 files/ 1.12 GB ( 0.1263 sec using pandas VS 0.1357 sec using dask)
a medium dir of 474 files/ 2.03 GB ( 3.2991 sec using pandas VS 3.7717 sec using dask)
an xlarge dir of 13,361 files/46.30 GB (131.5941 sec using pandas VS 132.6982 sec using dask)
local_normalization is fastest with pandas.read_csv, benchmarked against:
a small dir of 8 files/ 1.12 GB ( 61.2338 sec using pandas VS 62.2033 sec using dask)
a medium dir of 474 files/ 2.03 GB ( 136.8900 sec using pandas VS 132.7574 sec using dask)
an xlarge dir of 13,361 files/46.30 GB (3166.0797 sec using pandas VS 3265.4251 sec using dask)
remote_analysis is fastest with dask.delayed, benchmarked against:
a small dir of 8 files/ 1.12 GB ( 8.6728 sec using pandas VS 6.0795 sec using dask)
a medium dir of 474 files/ 2.03 GB ( 149.7931 sec using pandas VS 37.3509 sec using dask)
remote_normalization is fastest with dask.delayed, benchmarked against:
a small dir of 8 files/ 1.12 GB (1758.1562 sec using pandas VS 1431.9895 sec using dask)
medium and xlarge datasets not benchmarked yet
NOTE: dask tests use pandas.read_csv inside dask.delayed() calls to gain maximum time reduction
Like Code Different said, the upload_from_string bit takes a while. Have you considered writing them to Google BigQuery as opposed to saving them as .csv files in a bucket? I found that faster for my purpose.
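A minimal sketch of that idea with the google-cloud-bigquery client, assuming df is one of the normalized dataframes; the project, dataset, and table names are placeholders:
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.normalized_files"  # placeholder destination table

# Append the normalized dataframe directly to BigQuery instead of writing a CSV to the bucket
# (load_table_from_dataframe requires pyarrow to be installed)
job_config = bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")
load_job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
load_job.result()  # block until the load job finishes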
The delayed API might be suitable here. The class you provided is rather elaborate, but this is the rough pattern that might work for this case:
import dask

@dask.delayed
def analyze_one_file(file_name):
    # use the code you run on a single file here
    return dict(skiprows=skiprows, names=headers, dtype=dict.fromkeys(headers, str))

# form delayed computations
delayed_values = [analyze_one_file(filepath) for filepath in files_to_analyze]

# execute the delayed computations
results = dask.compute(delayed_values)

# now results will be a list of dictionaries (or whatever
# the delayed function returns)

# apply similar wrapping to normalize_files loop
There may be a more efficient ETL procedure for your case, but that is situation-specific. Assuming that iterating over the files to discover the number of rows to skip is necessary, wrapping things up with delayed is probably sufficient to reduce the dataframe processing time by roughly the number of cores.
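Applying the same wrapping to the normalization step might look roughly like this; a hedged sketch that reuses the names from the question (file_analysis, file_columns, file_column_mapping are assumed to come from analyze_files):
import dask
import numpy as np
import pandas as pd

@dask.delayed
def normalize_one_file(filepath, params, file_columns, file_column_mapping):
    # read with the skiprows/names/dtype parameters discovered during analysis
    df = pd.read_csv(filepath, **params)
    df.rename(columns=file_column_mapping, inplace=True)
    # add any columns this file is missing, filled with NaN
    for column in (file_column_mapping[c] for c in file_columns - set(params["names"])):
        df[column] = np.nan
    df.to_csv(filepath, index=False)
    return filepath

tasks = [
    normalize_one_file(filepath, params, file_columns, file_column_mapping)
    for filepath, params in file_analysis.items()
]
dask.compute(*tasks)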

keyError when moving file to another directory

Let's say you have the following df:
dfresult_secondlook = {'relfilepath': ['test.pdf', 'epic.pdf' ], 'col2': [3, 4]}
I want to move a file that is in this df to another folder with the following code:
# moving files from secondlook df to secondlook folder
sourceDir = 'C:\\Users\\Max12\\Desktop\\xml\\pdfminer\\UiPath\\attachments\\75090058\\Status\\PDFsend'
destDir = 'C:\\Users\\Max12\\Desktop\\xml\\pdfminer\\UiPath\\attachments\\75090058\\Status\\SecondLook'

files = os.listdir(sourceDir)
filesToMove = dfresult_secondlook

def move(file, sourceDir, destDir):
    sourceFile = os.path.join(sourceDir, file)
    if not os.path.exists(destDir):
        os.makedirs(destDir)
    try:
        shutil.move(sourceFile, destDir)
    except:
        pass

for i in range(len(filesToMove)):
    file = filesToMove['relfilepath'][i]
    move(file, sourceDir, destDir)

# writing files to excel for further examination
book = load_workbook(r"C:\Users\Max12\Desktop\xml\pdfminer\UiPath\attachments\75090058\secondlook.xlsx")
writer = pd.ExcelWriter(r"C:\Users\Max12\Desktop\xml\pdfminer\UiPath\attachments\75090058\secondlook.xlsx", engine='openpyxl')
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)

dfresult_secondlook.to_excel(writer, "Main", header=False, index=False, startrow=writer.sheets['Main'].max_row)
writer.save()
However, I'm getting a KeyError:
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-13-4043ec85df9c> in <module>
17
18 for i in range(len(filesToMove)):
---> 19 file = filesToMove['relfilepath'][i]
20 move(file,sourceDir,destDir)
21
I don't see what's going wrong after 2 hours of looking. Please help!
According to the sample data you provided, relfilepath is a dict which does not always have 0 as a key, so your for loop, which starts from 0, fails.
You could then try this:
for i in range(len(filesToMove)):
    try:
        file = filesToMove['relfilepath'][i]
        move(file, sourceDir, destDir)
    except KeyError:
        continue
PS: you should modify the beginning of your post, where dfresult_secondlook shows 'relfilepath' as a list instead of a dict.
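An alternative sketch, assuming relfilepath may be backed by a list, a dict, or a pandas Series: iterate over the values directly and skip positional indexing altogether.
relfilepaths = filesToMove['relfilepath']
if isinstance(relfilepaths, dict):
    relfilepaths = relfilepaths.values()  # a dict's values are the paths themselves

for file in relfilepaths:  # works for lists and pandas Series as well
    move(file, sourceDir, destDir)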

parallel processing write dictionary to multiple csv files

I have a large dataframe that I would like to write to different files depending on the value in a particular column.
The first function takes a dictionary where the key is the file to write out to and the value is a numpy array which is a subset of the original dataframe.
def write_in_parallel(inputDict):
    for key, value in inputDict.items():
        df = pd.DataFrame(value)
        with open(baseDir + outDir + outputFileName + key + outputFileType, 'a') as oFile:
            df.to_csv(oFile, sep='|', index=False, header=False)
        print("Finished writing month: " + outputFileName + key)
The second function takes the column values used for partitioning the dataframe, plus the dataframe itself, and returns a dictionary of the slices.
def make_slices(files, df):
    outlist = dict()
    for item in files:
        data = np.array(df[df.iloc[:, 1] == item])
        outlist[item] = data
    return outlist
The final function uses multiprocessing to call write_in_parallel, iterating over the dictionary from make_slices, hopefully in parallel.
def make_dynamic_columns():
    perfPath = baseDir + rawDir
    perfFiles = glob.glob(perfPath + "/*" + inputFileType)
    perfFrame = pd.DataFrame()
    for file_ in perfFiles:
        df = pd.read_table(file_, delimiter='|', header=None)
        df.fillna(missingDataChar, inplace=True)
        df.iloc[:, 1] = df.iloc[:, 1].astype(str)
        fileList = list(df.iloc[:, 1].astype('str').unique())
        with mp.Pool(processes=10) as pool:
            pool.map(write_in_parallel, make_slices(fileList, df))
The error I am getting is 'str object has no attribute items', which leads me to believe that write_in_parallel is not receiving the dictionary from pool.map. I am not sure how to solve this issue. Any help is greatly appreciated.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/ssun/library/python/Python-3.5.2/build/lib/python3.5/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/home/ssun/library/python/Python-3.5.2/build/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "_FHLMC_LLP_dataprep.py", line 22, in write_in_parallel
for key,value in dict.items():
AttributeError: 'str' object has no attribute 'items'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "_FHLMC_LLP_dataprep.py", line 59, in <module>
make_dynamic_columns_freddie()
File "_FHLMC_LLP_dataprep.py", line 55, in make_dynamic_columns_freddie
pool.map(write_in_parallel, dictinput)
File "/home/ssun/library/python/Python-3.5.2/build/lib/python3.5/multiprocessing/pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/ssun/library/python/Python-3.5.2/build/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
AttributeError: 'str' object has no attribute 'items'
Your problem is that make_slices returns a dictionary, not a list, and pool.map() does not like that. It just passes your dictionary keys to the workers, which means they are strings (try printing what you receive as inputDict). It is not a dictionary, just the keys.
def make_slices(files, df):
    outlist = []
    for item in files:
        data = df + item
        outlist.append({item: data})
    return outlist
Could you try something like this, so that you actually return a list? The members would then be single-item dictionaries. (I had to modify your code so that data just creates something I could test with.)
This way each worker receives a key and its related data item, if that is what you want to do.
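For completeness, a rough sketch of how the pool side could consume that list of single-item dicts, assuming the same module-level constants (baseDir, outDir, outputFileName, outputFileType) and the fileList/df built in make_dynamic_columns:
import multiprocessing as mp
import pandas as pd

def write_in_parallel(single_item_dict):
    # each task is one {key: numpy_array} dict produced by make_slices
    for key, value in single_item_dict.items():
        df = pd.DataFrame(value)
        out_path = baseDir + outDir + outputFileName + key + outputFileType
        with open(out_path, 'a') as oFile:
            df.to_csv(oFile, sep='|', index=False, header=False)
        print("Finished writing month: " + outputFileName + key)

if __name__ == "__main__":
    with mp.Pool(processes=10) as pool:
        pool.map(write_in_parallel, make_slices(fileList, df))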
