How can I speed up these dataframe operations on 12k files/50gb? - python
Background:
I have 12,000 CSV files (50 GB) of data that mostly have the same format, but some may be missing a column or two, and the header row doesn't always start on the first line of the file.
I have a class with a couple of functions that use pandas to analyze and normalize these CSV files, stored either locally or in a Google bucket.
The following actions occur in these functions:
In analyze_files
loop through all the files, "peeking" at their contents to determine the headers and whether any rows need to be skipped in order to reach the header row.
translate all collected headers into a standard format, keeping only alphanumeric characters and underscores in the column names.
In normalize_files
loop through all files, loading each one completely this time.
convert the column headers to the standardized versions of the headers from analyze_files.
upload or save the updated version of the file
The functions work as intended, but I'm looking for ways to speed things up.
Using the version below (simplified into an MCVE) with 12,000 local files on an 8-core/16 GB RAM machine:
analyze_files takes around 2-4 minutes
normalize_files takes around 52 minutes
from google.cloud import storage
import pandas as pd
import numpy as np
import glob
import os
import re
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "./service_account_details.json"
class MyClass(object):
    def __init__(self, uses_gs=False, gs_bucket_name=None, gs_folder_path=None):
        self.__uses_gs = uses_gs
        if uses_gs:
            self.__gs_client = storage.Client()
            self.__gs_bucket_name = gs_bucket_name
            self.__gs_bucket = self.__gs_client.get_bucket(gs_bucket_name)
            self.__gs_folder_path = gs_folder_path
        else:
            # save to a subfolder of current directory
            self.__save_location = os.path.join(os.path.dirname(os.path.abspath(__file__)), type(self).__name__)
            if not os.path.exists(self.__save_location):
                os.mkdir(self.__save_location)
        self.__file_analysis = dict()
        self.__file_columns = set()
        self.__file_column_mapping = dict()

    def analyze_files(self):
        # collect the list of files
        files_to_analyze = list()
        if self.__uses_gs:
            gs_files = self.__gs_client.list_blobs(self.__gs_bucket, prefix=self.__gs_folder_path, delimiter="/")
            for file in gs_files:
                if file.name == self.__gs_folder_path:
                    continue
                gs_filepath = f"gs://{self.__gs_bucket_name}/{file.name}"
                files_to_analyze.append(gs_filepath)
        else:
            local_files = glob.glob(os.path.join(self.__save_location, "*.csv"))
            files_to_analyze.extend(local_files)
        # analyze each collected file
        for filepath in files_to_analyze:
            # determine how many rows to skip in order to start at the header row,
            # then collect the headers for this particular file, to be utilized for comparisons in `normalize_files`
            skiprows = None
            while True:
                try:
                    df = pd.read_csv(filepath, nrows=nrows, skiprows=skiprows)  # nrows: small peek size (defined elsewhere in the full script)
                    break
                except pd.errors.ParserError as e:
                    try:
                        start_row_index = re.findall(r"Expected \d+ fields in line (\d+), saw \d+", str(e))[0]
                        skiprows = int(start_row_index) - 1
                    except IndexError:
                        print("Could not locate start_row_index in pandas ParserError message")
                        continue
            headers = df.columns.values.tolist()
            self.__file_columns.update(headers)
            # store file details as pandas parameters, so we can smoothly transition into reading the files efficiently
            skiprows = skiprows + 1 if skiprows else 1  # now that we know the headers, we can skip the header row
            self.__file_analysis[filepath] = dict(skiprows=skiprows, names=headers, dtype=dict.fromkeys(headers, str))
        # convert the columns to their bigquery-compliant equivalents
        non_alpha = re.compile(r"([\s\W]|^\d+)")
        multi_under = re.compile(r"(_{2,})")
        self.__file_column_mapping.update({
            file_column: multi_under.sub("_", non_alpha.sub("_", file_column)).upper()
            for file_column in self.__file_columns
        })

    def normalize_files(self):
        # perform the normalizations and upload/save the final results
        total_columns = len(self.__file_columns)
        for filepath, params in self.__file_analysis.items():
            df = pd.read_csv(filepath, **params)
            # rename the column header to align with bigquery columns
            df.rename(columns=self.__file_column_mapping, inplace=True)
            if len(params["names"]) != total_columns:
                # swap the missing column names out for the bigquery equivalents
                missing_columns = [self.__file_column_mapping[c] for c in self.__file_columns - set(params["names"])]
                # add the missing columns to the dataframe
                df[[*missing_columns]] = pd.DataFrame([[np.nan] * len(missing_columns)], index=df.index)
            if self.__uses_gs:
                blob_path = filepath[5 + len(self.__gs_bucket_name) + 1:]  # "gs://" + "{bucket_name}" + "/"
                self.__gs_bucket.blob(blob_path).upload_from_string(df.to_csv(index=False), "text/csv")
            else:  # save locally
                df.to_csv(filepath, index=False)
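For context, a minimal usage sketch of the class above (the bucket name and folder path are placeholders):

processor = MyClass(uses_gs=True, gs_bucket_name="my-bucket", gs_folder_path="data/")
processor.analyze_files()
processor.normalize_files()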
I thought about using dask, combined with ProcessPool and ThreadPool from the multiprocessing module, but I'm struggling with exactly which approach to take.
Since the dataframe operations are CPU-bound, they seem best suited for dask, possibly combined with a ProcessPool to divvy up the 12k files across the 8 available cores; dask would then utilize the threads of each core (overcoming GIL limitations).
Uploading the files back to disk or to a Google bucket seems better suited to a ThreadPool, since that activity is network-bound.
As for reading in files from a Google bucket, I'm not sure what approach would work best.
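For the read side specifically, one low-effort option is to keep the per-file peek exactly as it is and fan it out over a thread pool, since each pd.read_csv("gs://...") call is dominated by network latency rather than CPU. A rough sketch, where peek_headers is an illustrative helper mirroring the analyze_files logic and files_to_analyze is the same list built in analyze_files:

from concurrent.futures import ThreadPoolExecutor

def peek_headers(filepath, nrows=5):
    # same header-discovery logic as analyze_files, for a single file
    skiprows = None
    while True:
        try:
            df = pd.read_csv(filepath, nrows=nrows, skiprows=skiprows)
            return filepath, skiprows, df.columns.values.tolist()
        except pd.errors.ParserError as e:
            match = re.findall(r"Expected \d+ fields in line (\d+), saw \d+", str(e))
            if not match:
                return filepath, None, []  # give up instead of retrying forever
            skiprows = int(match[0]) - 1

# network-bound, so threads (not processes) are usually enough here
with ThreadPoolExecutor(max_workers=16) as pool:
    peeks = list(pool.map(peek_headers, files_to_analyze))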
Basically, it comes down to two scenarios:
What methods/logic would perform best when working with local files?
And what methods/logic would perform best when pulling from and saving back to (overwriting/updating) a Google bucket?
Can someone please provide some direction or code that will provide the most efficient speed boost for the above two functions?
Benchmark tests would be greatly appreciated, as I've been pondering this topic for the better part of a week and it would be great to have statistics to back up the choice of methodology.
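To make the local case concrete, here is a rough sketch of the ProcessPool idea described above applied to the normalization step. The normalize_one helper and the worker/chunksize values are illustrative assumptions; file_analysis, file_columns and file_column_mapping are the structures produced by analyze_files:

from concurrent.futures import ProcessPoolExecutor

def normalize_one(args):
    # read, rename and rewrite a single file; runs in a worker process
    filepath, params, column_mapping, all_columns = args
    df = pd.read_csv(filepath, **params)
    df.rename(columns=column_mapping, inplace=True)
    for col in (column_mapping[c] for c in all_columns - set(params["names"])):
        df[col] = np.nan  # add any columns this file was missing
    df.to_csv(filepath, index=False)
    return filepath

jobs = [
    (filepath, params, file_column_mapping, file_columns)
    for filepath, params in file_analysis.items()
]
with ProcessPoolExecutor(max_workers=8) as pool:  # one worker per core
    for _ in pool.map(normalize_one, jobs, chunksize=50):
        pass  # progress logging could go here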
Current Benchmarks from what I've tried
# benchmark harness; the pd/np/dask/glob/os/re imports, the gs_client storage client,
# the small `nrows` peek size and the local_dirs/remote_dirs test configs are defined elsewhere
def local_analysis_test_dir_pd(test_dir):
    file_analysis, file_columns = dict(), set()
    local_files = glob.glob(os.path.join(test_dir, "*.csv"))
    for filepath in local_files:
        skiprows = None
        while True:
            try:
                df = pd.read_csv(filepath, nrows=nrows, skiprows=skiprows)
                break
            except pd.errors.ParserError as e:
                try:
                    start_row_index = re.findall(r"Expected \d+ fields in line (\d+), saw \d+", str(e))[0]
                    skiprows = int(start_row_index) - 1
                except IndexError:
                    print("Could not locate start_row_index in pandas ParserError message")
                    continue
        headers = df.columns.values.tolist()  # noqa
        skiprows = skiprows + 1 if skiprows else 1
        file_analysis[filepath] = dict(skiprows=skiprows, names=headers, dtype=dict.fromkeys(headers, str))
        file_columns.update(headers)
    non_alpha = re.compile(r"([\s\W]|^\d+)")
    multi_under = re.compile(r"(_{2,})")
    file_column_mapping = {
        file_column: multi_under.sub("_", non_alpha.sub("_", file_column)).upper()
        for file_column in file_columns
    }
    # print dictionary length for sanity check; to ensure both functions are performing identical actions.
    print("['local_analysis_test_dir_pd'] result:", len(file_analysis), len(file_columns))
    return file_analysis, file_columns, file_column_mapping
def local_analysis_test_dir_dd(test_dir):
    file_analysis, file_columns = dict(), set()
    local_files = glob.glob(os.path.join(test_dir, "*.csv"))

    def dask_worker(filepath):
        siloed_analysis, siloed_columns = dict(), set()
        skiprows = None
        while True:
            try:
                df = pd.read_csv(filepath, nrows=nrows, skiprows=skiprows)
                break
            except pd.errors.ParserError as e:
                try:
                    start_row_index = re.findall(r"Expected \d+ fields in line (\d+), saw \d+", str(e))[0]
                    skiprows = int(start_row_index) - 1
                except IndexError:
                    print("Could not locate start_row_index in pandas ParserError message")
                    return siloed_analysis, siloed_columns
        headers = df.columns.values.tolist()
        skiprows = skiprows + 1 if skiprows else 1
        siloed_analysis[filepath] = dict(skiprows=skiprows, names=headers, dtype=dict.fromkeys(headers, str))
        siloed_columns.update(headers)
        return siloed_analysis, siloed_columns

    dask_futures = [dask.delayed(dask_worker)(filepath) for filepath in local_files]
    file_analyses, column_sets = map(list, zip(*list(dask.compute(*dask_futures))))
    for analysis in file_analyses:
        file_analysis.update(analysis)
    file_columns.update(*column_sets)
    non_alpha = re.compile(r"([\s\W]|^\d+)")
    multi_under = re.compile(r"(_{2,})")
    file_column_mapping = {
        file_column: multi_under.sub("_", non_alpha.sub("_", file_column)).upper()
        for file_column in file_columns
    }
    # print dictionary length for sanity check; to ensure both functions are performing identical actions.
    print("['local_analysis_test_dir_dd'] result:", len(file_analysis), len(file_columns))
def remote_analysis_test_dir_pd(test_dir):
    remote_files, file_analysis, file_columns = list(), dict(), set()
    prefix = test_dir.replace("gs://webscraping/", "") + "/"
    gs_files = gs_client.list_blobs("webscraping", prefix=prefix, delimiter="/")
    for file in gs_files:
        if file.name == prefix:
            continue
        elif file.name.endswith(".xlsx"):
            continue
        elif not file.name.endswith(".csv"):
            continue
        gs_filepath = f"gs://webscraping/{file.name}"
        remote_files.append(gs_filepath)
    for filepath in remote_files:
        skiprows = None
        while True:
            try:
                df = pd.read_csv(filepath, nrows=nrows, skiprows=skiprows)
                break
            except pd.errors.ParserError as e:
                try:
                    start_row_index = re.findall(r"Expected \d+ fields in line (\d+), saw \d+", str(e))[0]
                    skiprows = int(start_row_index) - 1
                except IndexError:
                    print("Could not locate start_row_index in pandas ParserError message")
                    continue
        headers = df.columns.values.tolist()  # noqa
        skiprows = skiprows + 1 if skiprows else 1
        file_analysis[filepath] = dict(skiprows=skiprows, names=headers, dtype=dict.fromkeys(headers, str))
        file_columns.update(headers)
    non_alpha = re.compile(r"([\s\W]|^\d+)")
    multi_under = re.compile(r"(_{2,})")
    file_column_mapping = {
        file_column: multi_under.sub("_", non_alpha.sub("_", file_column)).upper()
        for file_column in file_columns
    }
    # print dictionary length for sanity check; to ensure both functions are performing identical actions.
    print("['remote_analysis_test_dir_pd'] result:", len(file_analysis), len(file_columns))
    return file_analysis, file_columns, file_column_mapping
def remote_analysis_test_dir_dd(test_dir):
    remote_files, file_analysis, file_columns = list(), dict(), set()
    prefix = test_dir.replace("gs://webscraping/", "") + "/"
    gs_files = gs_client.list_blobs("webscraping", prefix=prefix, delimiter="/")
    for file in gs_files:
        if file.name == prefix:
            continue
        elif file.name.endswith(".xlsx"):
            continue
        elif not file.name.endswith(".csv"):
            continue
        gs_filepath = f"gs://webscraping/{file.name}"
        remote_files.append(gs_filepath)

    def dask_worker(filepath):
        siloed_analysis, siloed_columns = dict(), set()
        skiprows = None
        while True:
            try:
                df = pd.read_csv(filepath, nrows=nrows, skiprows=skiprows)
                break
            except pd.errors.ParserError as e:
                try:
                    start_row_index = re.findall(r"Expected \d+ fields in line (\d+), saw \d+", str(e))[0]
                    skiprows = int(start_row_index) - 1
                except IndexError:
                    print("Could not locate start_row_index in pandas ParserError message")
                    return siloed_analysis, siloed_columns
        headers = df.columns.values.tolist()
        skiprows = skiprows + 1 if skiprows else 1
        siloed_analysis[filepath] = dict(skiprows=skiprows, names=headers, dtype=dict.fromkeys(headers, str))
        siloed_columns.update(headers)
        return siloed_analysis, siloed_columns

    dask_futures = [dask.delayed(dask_worker)(filepath) for filepath in remote_files]
    file_analyses, column_sets = map(list, zip(*list(dask.compute(*dask_futures))))
    for analysis in file_analyses:
        file_analysis.update(analysis)
    file_columns.update(*column_sets)
    non_alpha = re.compile(r"([\s\W]|^\d+)")
    multi_under = re.compile(r"(_{2,})")
    file_column_mapping = {
        file_column: multi_under.sub("_", non_alpha.sub("_", file_column)).upper()
        for file_column in file_columns
    }
    # print dictionary length for sanity check; to ensure both functions are performing identical actions.
    print("['remote_analysis_test_dir_dd'] result:", len(file_analysis), len(file_columns))
    return file_analysis, file_columns, file_column_mapping
def normalization_plain_with_pd(file_analysis, file_columns, file_column_mapping, meta_columns):
    total_columns = len(file_columns)
    for filepath, params in file_analysis.items():
        df = pd.read_csv(filepath, **params)
        # rename the column header to align with bigquery columns
        df.rename(columns=file_column_mapping, inplace=True)
        if len(params["names"]) != total_columns:
            missing_columns = [file_column_mapping[c] for c in file_columns - set(params["names"])]
            # add the missing columns to the dataframe
            df[[*missing_columns]] = pd.DataFrame([[np.nan] * len(missing_columns)], index=df.index)
        fpath, fname = os.path.split(filepath)
        if not fpath.startswith("gs://"):
            updated_path = os.path.join(fpath, "normalized_with_pd")
            if not os.path.exists(updated_path):
                os.mkdir(updated_path)
            new_path = os.path.join(updated_path, fname)
        else:
            new_path = "/".join([fpath, "normalized_with_pd", fname])
        df.to_csv(new_path, index=False)
def normalization_plain_with_dd(file_analysis, _file_columns, _file_column_mapping, _meta_columns):
    def dask_worker(file_item, file_columns, file_column_mapping, meta_columns):
        total_columns = len(file_columns)
        filepath, params = file_item
        df = pd.read_csv(filepath, **params)
        # rename the column header to align with bigquery columns
        df.rename(columns=file_column_mapping, inplace=True)
        if len(params["names"]) != total_columns:
            missing_columns = [file_column_mapping[c] for c in file_columns - set(params["names"])]
            # add the missing columns to the dataframe
            df[[*missing_columns]] = pd.DataFrame([[np.nan] * len(missing_columns)], index=df.index)
        fpath, fname = os.path.split(filepath)
        if not fpath.startswith("gs://"):
            updated_path = os.path.join(fpath, "normalized_with_dd")
            if not os.path.exists(updated_path):
                os.mkdir(updated_path)
            new_path = os.path.join(updated_path, fname)
        else:
            new_path = "/".join([fpath, "normalized_with_dd", fname])
        df.to_csv(new_path, index=False)

    dask_futures = [
        dask.delayed(dask_worker)(file_item, _file_columns, _file_column_mapping, _meta_columns)
        for file_item in file_analysis.items()
    ]
    dask.compute(*dask_futures)
if __name__ == "__main__":
    for size, params in local_dirs.items():
        print(f"['{size}_local_analysis_dir_tests'] ({params['items']} files, {params['size']})")
        local_analysis_test_dir_pd(params["directory"])
        local_analysis_test_dir_dd(params["directory"])
    for size, settings in local_dirs.items():
        print(f"['{size}_pre_test_file_cleanup']")
        for file in glob.glob(os.path.join(settings["directory"], '*', '*.csv')):
            os.remove(file)
        print(f"['{size}_local_normalization_dir_tests'] ({settings['items']} files, {settings['size']})")
        files, columns, column_mapping = local_analysis_test_dir_pd(settings["directory"])
        normalization_plain_with_pd(files, columns, column_mapping, {})
        normalization_plain_with_dd(files, columns, column_mapping, {})
    for size, settings in remote_dirs.items():
        print(f"['{size}_remote_analysis_dir_tests'] ({settings['items']} files, {settings['size']})")
        _, _, _ = remote_analysis_test_dir_pd(settings["directory"])
        files, columns, column_mapping = remote_analysis_test_dir_dd(settings["directory"])
        print(f"['{size}_remote_normalization_dir_tests'] ({settings['items']} files, {settings['size']})")
        normalization_plain_with_pd(files, columns, column_mapping, {})
        normalization_plain_with_dd(files, columns, column_mapping, {})
Conclusions thus far:
local_analysis is fastest with pandas.read_csv, based against:
a single file of 343 MB ( 0.0210 sec using pandas VS 0.5141 sec using dask)
a small dir of 8 files/ 1.12 GB ( 0.1263 sec using pandas VS 0.1357 sec using dask)
a medium dir of 474 files/ 2.03 GB ( 3.2991 sec using pandas VS 3.7717 sec using dask)
an xlarge dir of 13,361 files/46.30 GB (131.5941 sec using pandas VS 132.6982 sec using dask)
local_normalization is fastest with pandas.read_csv, based against:
a small dir of 8 files/ 1.12 GB ( 61.2338 sec using pandas VS 62.2033 sec using dask)
a medium dir of 474 files/ 2.03 GB ( 136.8900 sec using pandas VS 132.7574 sec using dask)
an xlarge dir of 13,361 files/46.30 GB (3166.0797 sec using pandas VS 3265.4251 sec using dask)
remote_analysis is fastest with dask.delayed, based against:
a small dir of 8 files/ 1.12 GB ( 8.6728 sec using pandas VS 6.0795 sec using dask)
a medium dir of 474 files/ 2.03 GB ( 149.7931 sec using pandas VS 37.3509 sec using dask)
remote_normalization is fastest with dask.delayed, based against:
a small dir of 8 files/ 1.12 GB (1758.1562 sec using pandas VS 1431.9895 sec using dask)
medium and xlarge datasets not benchmarked yet
NOTE: dask tests utilize pandas.read_csv inside dask.delayed() calls to gain maximum time reduction
Like Code Different said, the upload_from_string bit takes a while. Have you considered writing them to Google BigQuery as opposed to saving them as .csv files in a bucket? I found that faster for my purpose.
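A minimal sketch of that BigQuery route with the official client, in case it helps (the table name is a placeholder, and load_table_from_dataframe needs pyarrow installed):

from google.cloud import bigquery

bq_client = bigquery.Client()
table_id = "my-project.my_dataset.normalized_files"  # placeholder

job_config = bigquery.LoadJobConfig(write_disposition="WRITE_APPEND", autodetect=True)
# df is the already-normalized dataframe from normalize_files
load_job = bq_client.load_table_from_dataframe(df, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish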
The delayed API might be suitable here. The class you provided is rather elaborate, but this is the rough pattern that might work for this case:
import dask

@dask.delayed
def analyze_one_file(file_name):
    # use the code you run on a single file here
    return dict(skiprows=skiprows, names=headers, dtype=dict.fromkeys(headers, str))

# form delayed computations
delayed_values = [analyze_one_file(filepath) for filepath in files_to_analyze]

# execute the delayed computations (dask.compute returns a tuple)
results = dask.compute(delayed_values)[0]

# now results will be a list of dictionaries (or whatever
# the delayed function returns)

# apply similar wrapping to normalize_files loop
There may be a more efficient ETL procedure for your case, but that is situation-specific. Assuming that iterating over the files to discover the number of rows to skip is necessary, wrapping things up with delayed is probably sufficient to reduce the dataframe processing time by roughly the core count.
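A similarly rough sketch of wrapping the normalize_files loop, where the per-file body is abbreviated and params, file_column_mapping and file_analysis come from the analysis step:

@dask.delayed
def normalize_one_file(filepath, params, column_mapping):
    # same per-file body as the normalize_files loop
    df = pd.read_csv(filepath, **params)
    df.rename(columns=column_mapping, inplace=True)
    df.to_csv(filepath, index=False)
    return filepath

delayed_writes = [
    normalize_one_file(filepath, params, file_column_mapping)
    for filepath, params in file_analysis.items()
]
dask.compute(delayed_writes)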