I have a large dataframe that I would like to write to different files depending on the value in a particular column.
The first function takes a dictionary where the key is the file to write out to and the value is a numpy array which is a subset of the original dataframe.
def write_in_parallel(inputDict):
    for key, value in inputDict.items():
        df = pd.DataFrame(value)
        with open(baseDir + outDir + outputFileName + key + outputFileType, 'a') as oFile:
            df.to_csv(oFile, sep='|', index=False, header=False)
        print("Finished writing month: " + outputFileName + key)
The second function takes the column values used to partition the dataframe, together with the dataframe itself, and returns a dictionary of slices.
def make_slices(files, df):
    outlist = dict()
    for item in files:
        data = np.array(df[df.iloc[:, 1] == item])
        outlist[item] = data
    return outlist
The final function uses multiprocessing to call write_in_parallel over the dictionary returned by make_slices, hopefully in parallel.
def make_dynamic_columns():
    perfPath = baseDir + rawDir
    perfFiles = glob.glob(perfPath + "/*" + inputFileType)
    perfFrame = pd.DataFrame()
    for file_ in perfFiles:
        df = pd.read_table(file_, delimiter='|', header=None)
        df.fillna(missingDataChar, inplace=True)
        df.iloc[:, 1] = df.iloc[:, 1].astype(str)
        fileList = list(df.iloc[:, 1].astype('str').unique())
        with mp.Pool(processes=10) as pool:
            pool.map(write_in_parallel, make_slices(fileList, df))
The error I am getting is "'str' object has no attribute 'items'", which leads me to believe that write_in_parallel is not receiving the dictionary from pool.map. I am not sure how to solve this issue. Any help is greatly appreciated.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/ssun/library/python/Python-3.5.2/build/lib/python3.5/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/home/ssun/library/python/Python-3.5.2/build/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "_FHLMC_LLP_dataprep.py", line 22, in write_in_parallel
for key,value in dict.items():
AttributeError: 'str' object has no attribute 'items'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "_FHLMC_LLP_dataprep.py", line 59, in <module>
make_dynamic_columns_freddie()
File "_FHLMC_LLP_dataprep.py", line 55, in make_dynamic_columns_freddie
pool.map(write_in_parallel, dictinput)
File "/home/ssun/library/python/Python-3.5.2/build/lib/python3.5/multiprocessing/pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/ssun/library/python/Python-3.5.2/build/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
AttributeError: 'str' object has no attribute 'items'
Your problem is that make_slices returns a dictionary, not a list, and pool.map() does not handle that the way you expect: iterating over a dictionary yields only its keys, so each worker receives a plain string (try printing what arrives as inputDict). The workers get the keys, not the dictionary.
def make_slices(files, df):
    outlist = []
    for item in files:
        data = df + item
        outlist.append({item: data})
    return outlist
Could you try something like this, so that you actually return a list? Each member is then a single-item dictionary. (I had to modify your code so that data just creates something I could test with.) This way each worker receives a key and its related data item, if that is what you want to do.
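If you would rather keep make_slices returning a dictionary, another option is to map the worker over the dictionary's items so that each process receives one (key, value) pair. A rough sketch under that assumption; the output path is a placeholder rather than the asker's baseDir/outDir configuration, and fileList/df refer to the variables built in make_dynamic_columns:

import multiprocessing as mp
import pandas as pd

def write_one_slice(item):
    # item is a single (key, value) pair taken from the slices dictionary
    key, value = item
    df = pd.DataFrame(value)
    # placeholder path; build it from your own baseDir/outDir/outputFileName settings
    out_path = "month_" + str(key) + ".csv"
    df.to_csv(out_path, sep='|', index=False, header=False, mode='a')
    return key

if __name__ == '__main__':
    slices = make_slices(fileList, df)  # the existing dictionary of slices
    with mp.Pool(processes=10) as pool:
        finished = pool.map(write_one_slice, list(slices.items()))
    print("Finished months:", finished)

Passing list(slices.items()) gives each worker call exactly one slice, so the workers never see bare keys.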
Related
I am trying to catch some exceptions when processing files from an AWS S3 bucket. I know that the processing normally works fine, because the error I get is expected: I generated it myself by altering the column names of one file. The bucket contains several files that should process normally, while the one file I altered should throw an exception. What I want is to append the filename to a list if a file is not processed, print the exception with the logging module, and continue processing the rest of the files. This is my code:
for item in settings.keys:
    try:
        response = settings.client.get_object(Bucket=settings.source_bucket, Key=item)
        tmp = pd.read_csv(io.BytesIO(response['Body'].read()), encoding='unicode_escape', sep=None, engine='python')
        tmp['account_number'] = item.split('/')[4][:-4]
        tmp.columns = tmp.columns.str.strip()
        tmp.columns = tmp.columns.map(settings._config['balances']['columns'])
        df = pd.concat([df, tmp], ignore_index=False)
    except:
        settings.unprocessed.append(item)
        logger.exception(f'{item} Not Processed')
Before I altered the one file, everything processed as it should. By using try/except, I want to catch the exception if a file contains errors and still process the rest of the files. However, after I altered the one file, every single file in the bucket threw an exception and nothing was processed. Does anyone have any input as to why this happens?
2023-01-25 14:59:56 - ERROR - xxxx.csv Not Processed
Traceback (most recent call last):
File "C:\Users\xxxx\Desktop\xxxx\Python\xxxx\xxxx\xxxx.py", line 19, in balances
df = pd.concat([df, tmp], ignore_index=False)
File "C:\Users\xxxx\Desktop\xxxx\Python\xxxx\xxxx\xxxx\venv\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\xxxx\Desktop\xxxx\Python\xxxx\xxxx\xxxx\venv\lib\site-packages\pandas\core\reshape\concat.py", line 360, in concat
return op.get_result()
File "C:\Users\xxxx\Desktop\xxxx\Python\xxxx\xxxx\xxxx\venv\lib\site-packages\pandas\core\reshape\concat.py", line 591, in get_result
indexers[ax] = obj_labels.get_indexer(new_labels)
File "C:\Users\xxxx\Desktop\xxxx\Python\xxxx\xxxx\xxxx\venv\lib\site-packages\pandas\core\indexes\base.py", line 3721, in get_indexer
raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
raise ValueError("cannot reindex on an axis with duplicate labels")
ValueError: cannot reindex on an axis with duplicate labels
UPDATE:
I did the same for some other files, and there it works as expected: the error I generated is caught and printed to the console, and the filename is appended to the list. This is the working code that behaves as expected:
for item in settings.keys:
    try:
        tmp = pd.DataFrame()
        response = settings.client.get_object(Bucket=settings.source_bucket, Key=item)
        if item.endswith('.csv'):
            tmp = pd.read_csv(io.BytesIO(response['Body'].read()), encoding='unicode_escape', sep=None, engine='python')
        elif item.endswith('.xlsx'):
            tmp = pd.read_excel(io.BytesIO(response['Body'].read()))
        tmp['file'] = item.split('/')[4]
        tmp.columns = tmp.columns.map(settings._config['account statements']['columns'])
        tmp['row'] = tmp.index + 2
        tmp.columns = tmp.columns.astype(str)
        tmp.rename(columns=lambda x: x.strip())
        for col in tmp.columns:
            if col.startswith('Beløp'):
                settings.statement_currencies[item.split('/')[-1:][0]] = col[-3:]
                tmp[col] = tmp[col].astype(str)
                tmp[col] = tmp[col].str.replace(',', '.')
                tmp[col] = tmp[col].astype(float)
                tmp['direction'] = np.where(tmp[col] > 0, 'Incoming', 'Outgoing')
        df = pd.concat([df, tmp], ignore_index=False)
    except:
        settings.unprocessed.append(item)
        logger.exception(f'{item} Not Processed')
I'd like to ask what the best way is to replace a specific line in multiple JSON files. In every file it is the same line that needs to be replaced.
import json

with open('3.json') as f:
    data = json.load(f)

for item in data['attributes']:
    item['value'] = item['value'].replace("Untitled", item['BgTest'])

with open('3.json', 'w') as d:
    json.dump(data, d)
I tried this code I found but it keeps giving me an error:
"/Users/jakubpitonak/Desktop/NFT/Gnomes Collection/ART-GEN-TUTORIAL 2.0/bin/python" /Users/jakubpitonak/PycharmProjects/pythonProject1/update.py
Traceback (most recent call last):
File "/Users/jakubpitonak/PycharmProjects/pythonProject1/update.py", line 25, in <module>
item['value'] = item['value'].replace("Untitled", item['BgTest'])
KeyError: 'BgTest'
Process finished with exit code 1
So item['BgTest'] does not exist in the items you're iterating through. I think you want to replace the "Untitled" value with the value "BgTest". In that case, replace the for loop with the one below:
for item in data['attributes']:
    if item['value'] == 'Untitled':
        item['value'] = 'BgTest'
import json

with open('3.json') as f:
    data = json.load(f)

for item in data['attributes']:
    item['value'] = "Your value here"

with open('3.json', 'w') as d:
    json.dump(data, d)
BgTest is not a valid key in the example you posted. If you only have that key in certain items of the list, you cannot use it in the for loop.
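Since the question is about multiple JSON files, here is a small sketch of applying the same conditional replacement to every .json file in the current directory. The 'Untitled'/'BgTest' values and the flat attributes layout are taken from the snippets above; adjust them to your actual data:

import glob
import json

for path in glob.glob('*.json'):
    with open(path) as f:
        data = json.load(f)

    # same replacement as above, applied file by file
    for item in data['attributes']:
        if item['value'] == 'Untitled':
            item['value'] = 'BgTest'

    with open(path, 'w') as f:
        json.dump(data, f, indent=2)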
I'm facing some issues with string parsing and the multiprocessing library. Here is my code; I also outline the function calls and the error below.
def semi_func(tile):
    with open(tile, 'rb') as f:
        img = Image.open(BytesIO(f.read()))
    resized_im, seg_map = MODEL.run(img)
    vis_segmentation_tiles(str(tile), resized_im, seg_map)
    x = np.unique(seg_map)
    x = x.tolist()
    print("THIS IS X", x)
    ans_tiles[str(tile)] = x
    print(x)
    return ans_tiles
def split_tiles_new(image_path, tiledir):
    print("1")
    pool = Pool(processes=5)
    print("2")
    num_tiles = 9
    tiles = image_slicer.slice(image_path, num_tiles, save=False)
    print("3")
    print(tiles)
    image_slicer.save_tiles(tiles, directory=tiledir)
    print(tiles)
    print("TILES ABOEVE")
    onlytiles = [os.path.join(tiledir, f) for f in listdir(tiledir) if isfile(join(tiledir, f))]
    ans_tiles = {}
    print(onlytiles)
    onlytiles = list(map(str, onlytiles))
    for t in onlytiles:
        print(t)
    for tile in onlytiles:
        print(tile)
        pool.map(semi_func, tile)
    pool.close()
    pool.join()
    print(ans_tiles)
    return ans_tiles
Here's what I'm feeding in terms of my functions:
ans_tiles = split_tiles_new(local_jpg, tiledir)
local_jpg = 'wheat044146108.jpg'
tiledir = 'tiles044146108'
Inside tiledir (the directory), there's a bunch of tiled images:
['tiles044146108/_03_02.png', 'tiles044146108/_03_01.png', 'tiles044146108/_02_02.png', 'tiles044146108/_01_01.png', 'tiles044146108/_03_03.png', 'tiles044146108/_01_02.png', 'tiles044146108/_02_01.png', 'tiles044146108/_02_03.png', 'tiles044146108/_01_03.png']
That's what is in the variable 'onlytiles'.
But my issue is this error:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "serve_wh.py", line 128, in semi_func
with open(tile, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 't'
"""
I am not sure why it is slicing the string further. Any idea what I can do to ensure it just grabs each file from the 'onlytiles' list separately?
Your iterable is a single filename string; that's why it's trying to open a file named 't'. Check Pool.map's second argument.
pool.map(semi_func,tile)
You should use
pool.map(semi_func,onlytiles)
Drop the surrounding for loop, so that pool.map iterates over the whole list rather than over the characters of a single filename.
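A rough sketch of how that can look, assuming semi_func returns the per-tile dictionary as in the question. Because each worker runs in a separate process, updates it makes to a module-level dict are not visible to the parent, so merging the return values of pool.map is a simple way to build the final result:

from multiprocessing import Pool

def collect_tiles(onlytiles):
    # Map the worker over the whole list of tile paths, not over a single string
    with Pool(processes=5) as pool:
        results = pool.map(semi_func, onlytiles)
    # Each call to semi_func returns a dict; merge them here in the parent process
    merged = {}
    for partial in results:
        merged.update(partial)
    return merged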
I'm running the following Python code, which uses the fileinput library, in a Linux environment.
filelist = glob.glob(os.path.join(LOCAL_DESTINATION, "*.*"))
for file in filelist:
    if comment_type.lower() == 'header':
        f = fileinput.input(file, inplace=1)
        print(f)
        print(f.__dict__)
        for xline in f:
            print(4567)
            if f.isfirstline():
                sys.stdout.write(comments + '\n' + xline)
            else:
                sys.stdout.write(xline)
This is the stderr I see, even though the file is present in the LOCAL_DESTINATION folder:
'NoneType' object is not iterable
Exception ignored in: <bound method FileInput.__del__ of <fileinput.FileInput object at 0x7fb6164ed240>>
Traceback (most recent call last):
File "/usr/lib/python3.5/fileinput.py", line 229, in __del__
File "/usr/lib/python3.5/fileinput.py", line 233, in close
File "/usr/lib/python3.5/fileinput.py", line 290, in nextfile
TypeError: 'NoneType' object is not callable
Can someone tell me what the problem could be?
P.S. f.__dict__ prints out the following:
{'_file': None, '_backup': '', '_openhook': None, '_filename': None, '_savestdout': None, '_mode': 'r', '_inplace': 1, '_startlineno': 0, '_files': ('4f5b11ef-601f-4607-a4d0-45173d2bbc53/Q3_2019_PlacementGUID_555168561629745350_f99a8d275e4_11_13_2019.txt',), '_isstdin': False, '_filelineno': 0, '_backupfilename': None, '_output': None}
You get the error

'NoneType' object is not iterable

when you try to loop over None or a null value, e.g.:

k = None
for i in k:
    print(k)

In the above case you get the error because you are trying to iterate over a None value.
In your case you have two for loops (for file in filelist and for xline in f), so either filelist is None or f is None.
Here there are two places where you are iterating over a collection.
for file in filelist:
for xline in f:
The error means that either filelist or f is None instead of a collection of elements. In other words, there is nothing to iterate over; it is not even an empty collection, it simply has no value.
You could use something like the code below to avoid the error. However, you must still check for and handle the empty collection.
for file in filelist or []:
for xline in f or []:
I have a query that grabs data from a database and returns the values so I can parse them.
def executeScriptsFromFile(monitor):
    # Open and read the file as a single buffer
    fd = open(os.path.join(BASE_DIR, 'sql/{0}.sql'.format(monitor)), 'r')
    if args.alias:
        sql_query = fd.read().format("'" + args.alias + "'")
    else:
        sql_query = fd.read()
    fd.close()
    # Execute SQL query from the input file
    cursor.execute(sql_query)
    result = cursor.fetchone()
    return result
The query can differ, so I'm trying to build in logic that skips this part if JobCount isn't one of the returned values.
query_data = executeScriptsFromFile(args.monitor)
print query_data
if query_data.JobCount:
    print query_data.JobCount
else:
    send_status = send_data(job_data)
    print send_status
Unfortunately I get the following traceback. How do I ignore the value if it isn't there?
Traceback (most recent call last):
File "tidal-zabbix.py", line 92, in <module>
if query_data.JobCount:
AttributeError: 'pyodbc.Row' object has no attribute 'JobCount'
If you want to check whether 'JobCount' is an attribute of query_data, use hasattr():
if hasattr(query_data, 'JobCount'):
    print query_data.JobCount
else:
    send_status = send_data(job_data)
    print send_status
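An equivalent alternative, if you prefer not to branch on hasattr(), is getattr() with a default. A minimal sketch (send_data and job_data are the asker's own names; note that this treats a JobCount of None the same as a missing column):

job_count = getattr(query_data, 'JobCount', None)
if job_count is not None:
    print(job_count)
else:
    send_status = send_data(job_data)
    print(send_status)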