I recently started learning about multiprocessing in Python and wrote this code to test it. I have around 1300 CSV files which I simply want to open and then save to a folder, using multiprocessing to test the speed. The issue is that the first 600-700 files are processed and saved in less than 10 seconds, but the next 600-700 files take more than 1 minute. I am really not sure why this is happening. I have 8 cores and 16 GB RAM in my system. Below is my code:
import pandas as pd
import os, time
import multiprocessing
import numpy as np

def csv_processing(p):
    final_df = pd.DataFrame(columns=['File_name', 'col'])
    for file in p:
        url = 'E:\\Ashish\\Market\\Data\\Processed GFDL_options\\Bank Nifty\\Intraday\\'
        output = 'E:\\Testing\\'
        df = pd.read_csv(url + file)
        df.to_csv(output + file)

def split_list_into_processes(main_list, req_process):
    index_freq = round(.5 + len(main_list)/req_process)
    splitted_list = [main_list[r*index_freq:(r+1)*index_freq] for r in range(req_process)]
    return [x for x in splitted_list if len(x) > 0]
if __name__ == '__main__':
    start_time = time.time()
    processes = []
    all_files = os.listdir('E:\\Ashish\\Market\\Data\\Processed GFDL_options\\Bank Nifty\\Intraday\\')
    print(len(all_files))
    data = split_list_into_processes(all_files, os.cpu_count())
    print(data)
    print(len(data))
    for t in data:
        p = multiprocessing.Process(target=csv_processing, args=(t,))
        p.start()
        processes.append(p)
    for l in processes:
        l.join()
    end_time = time.time()
    time_took = end_time - start_time
    print(time_took)
Considering I have a large number of JSON files, but small in size (about 20,000 files, around 100 MB in total), reading them for the first time with this code snippet:
from time import perf_counter
from glob import glob

def load_all_jsons_serial():
    t_i = perf_counter()
    json_files = glob("*json")
    for file in json_files:
        with open(file, "r") as f:
            f.read()
    t_f = perf_counter()
    return t_f - t_i

load_all_jsons_serial()
takes around 50 seconds.
However, if I rerun the code, it takes less than a second to finish! Could someone please:
Explain this observation: why does it take longer the first time and less for subsequent runs?
How can I reduce the loading time for the first run?
I am on a Windows 11 machine and run the code in a notebook extension of VS Code.
Thanks.
You can read in parallel with aiofiles. Here is a full example, where I had 1000 JSON files (200 KB each) in the folders jsonfiles\async\ and jsonfiles\sync\ to prevent any hard-disk or OS-level caching. I removed and recreated the JSON files after each run.
from glob import glob
import aiofiles
import asyncio
from time import perf_counter

###
# Synchronous file operation:
###
def load_all_jsons_serial():
    json_files = glob("jsonfiles\\sync\\*.json")
    for file in json_files:
        with open(file, "r") as f:
            f.read()
    return

t_i = perf_counter()
load_all_jsons_serial()
t_f = perf_counter()
print(f"Synchronous: {t_f - t_i}")

###
# Async file operation
###
async def load_async(files: list[str]):
    for file in files:
        async with aiofiles.open(file, "r") as f:
            await f.read()
    return

async def main():
    json_files = glob("jsonfiles\\async\\*.json")
    no_of_tasks = 10
    files_per_task = len(json_files)//no_of_tasks + 1
    tasks = []
    for i in range(no_of_tasks):
        tasks.append(
            asyncio.create_task(load_async(
                json_files[i*files_per_task : i*files_per_task+files_per_task]))
        )
    await asyncio.gather(*tasks)
    return

t_i = perf_counter()
asyncio.run(main())
t_f = perf_counter()
print(f"Asynchronous: {t_f - t_i}")
It's not exactly science but you can see there is a significant gain in performance:
Synchronous: 13.353551400010474
Asynchronous: 3.1800755000440404
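A thread pool is another way to overlap the blocking reads for this kind of I/O-bound work, without switching to async file handles. Here is a minimal sketch for comparison, assuming the same jsonfiles\sync\ folder as above; ThreadPoolExecutor and the max_workers=10 value are my choices, not from the original:
from concurrent.futures import ThreadPoolExecutor
from glob import glob
from time import perf_counter

def read_one(path):
    # Plain blocking read; the threads overlap the time spent waiting on the disk/OS.
    with open(path, "r") as f:
        return len(f.read())

t_i = perf_counter()
json_files = glob("jsonfiles\\sync\\*.json")
with ThreadPoolExecutor(max_workers=10) as ex:
    sizes = list(ex.map(read_one, json_files))
t_f = perf_counter()
print(f"Threaded: {t_f - t_i}")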
I have a folder that contains thousands of subfolders, each of which contains thousands of files.
cb = []
for root, dirs, files in os.walk(dir):
    for name in files:
        filepath = root + os.sep + name
        df = pd.read_csv(filepath, index_col=False)
        df['TimeStamp'] = pd.to_datetime(df.TimeStamp, format='%Y-%m-%d %H:%M:%S')
        date = df['TimeStamp'].dt.date.values[0]
        time = df['TimeStamp'].dt.time.values[0]
        if (df.shape[0] > 0):
            cb.append({'Time': time, 'Date': date})
I need to open all the files, do some data processing on them, and append the results to an empty dataframe.
Doing it sequentially takes days to run. Is there a way I can use multiprocessing/threading to reduce the time without skipping any files in the process?
You can put the per-file work into a separate function and then use a multiprocessing pool to push the processing to separate processes. This helps with CPU-bound calculations, but the file reads will take just as long as in your original serial processing. The trick to multiprocessing is to keep the amount of data flowing through the pool itself to a minimum. Since you only pass a file name and return a couple of datetime objects in this example, you're good on that point.
import multiprocessing as mp
import pandas as pd
import os

def worker(filepath):
    df = pd.read_csv(filepath, index_col=False)
    # Check for empty files before indexing into the first row.
    if df.shape[0] == 0:
        return None
    df['TimeStamp'] = pd.to_datetime(df.TimeStamp, format='%Y-%m-%d %H:%M:%S')
    date = df['TimeStamp'].dt.date.values[0]
    time = df['TimeStamp'].dt.time.values[0]
    return {'Time': time, 'Date': date}

if __name__ == "__main__":
    # `dir` is the top-level folder path from your original loop.
    csv_files = [root + os.sep + name
                 for root, dirs, files in os.walk(dir)
                 for name in files]
    with mp.Pool() as pool:
        cb = [result for result in pool.map(worker, csv_files, chunksize=1)
              if result]
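If you then want the results in a dataframe rather than a list of dicts, as in the original sequential loop, one line after the pool finishes is enough; this assumes cb is the filtered list built above:
results_df = pd.DataFrame(cb)  # columns 'Time' and 'Date', one row per non-empty file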
I have the following code. I want to load the data once into memory and then run the function get_id in parallel. Right now the data is loaded 8 times, which results in a memory error. I would also be very happy about hints on optimizing the multiprocessing.
I use Python 3.8 on Windows, with 16 GB RAM and 8 CPUs.
import multiprocessing as mp
import os
import json
import logging
import datetime
from dateutil.relativedelta import relativedelta
import re
import time

NUM_CPUS = mp.cpu_count()

os.chdir(r'C:\Users\final_tweets_de')
directory = r'C:\Users\final_tweets_de'
path = r'C:\Users\final_tweets_de'

for file in os.listdir(directory):
    fh = open(os.path.join(path, file), 'r')
    if file == "SARS_CoV.json":
        with open(file, 'r', encoding='utf-8') as json_file:
            data_tweets = json.load(json_file)

def get_id(data_tweets):
    for i in range(0, len(data_tweets)):
        try:
            account = data_tweets[i]['user_screen_name']
            created = datetime.datetime.strptime(data_tweets[i]['date'], '%Y-%m-%d').date()
            until = created + relativedelta(days=10)
            id = data_tweets[i]['id']
            filename = re.search(r'(.*).json', file).group(1) + '_' + 'tweet_id_' + str(id) + '_' + 'user_id_' + str(data_tweets[i]['user_id'])
            try:
                os.system('snscrape twitter-search "(to:'+account+') since:'+created.strftime("%Y-%m-%d")+' until:'+until.strftime("%Y-%m-%d")+' filter:replies" >C:\\Users\\Antworten\\antworten_SARS_CoV.json\\'+filename)
            except:
                continue
        except Exception:
            logging.exception("get_id failed at index %d", i)

if __name__ == "__main__":
    pool = mp.Pool(NUM_CPUS)
    get_id(data_tweets)
    pool.close()
    pool.join()
Update
After the comment from @tdelaney, I split the data into smaller pieces and have had no memory errors so far. But the cores are still not fully used; CPU utilization is only around 20 percent.
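For reference, here is a minimal sketch of how the chunks could be pushed through the pool so the workers actually run in parallel. It assumes get_id is the function from the question (it already takes a list of tweet dicts, so a chunk works directly); the chunked helper and the CHUNK_SIZE value are mine, not from the original. Loading the JSON inside the __main__ block keeps the spawned child processes from re-loading it:
import json
import multiprocessing as mp

NUM_CPUS = mp.cpu_count()
CHUNK_SIZE = 1000  # hypothetical value; tune so each task stays small

def chunked(seq, size):
    # Yield successive slices of the tweet list (helper added for this sketch).
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

if __name__ == "__main__":
    # Load once, in the parent only; module-level loading is what made
    # every spawned worker read the file again.
    with open("SARS_CoV.json", "r", encoding="utf-8") as json_file:
        data_tweets = json.load(json_file)

    with mp.Pool(NUM_CPUS) as pool:
        pool.map(get_id, list(chunked(data_tweets, CHUNK_SIZE)))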
I have the code below to list the Excel files in a folder, read each file with pandas, do a fill-down and some other processing on each, and compile all of the files into one dataframe.
Below is the single-threaded version of the code, and it works:
Folder = "Folder Name"
StartRead = 2
num_cores = 2

DefaultPath = "C:\\Users\\"
path = DefaultPath + Folder
file_identifier = "*.xlsx"

def reader(filename):
    raw_read = pd.read_excel(filename, skiprows=StartRead)
    fixed_read = raw_read.fillna(method='ffill')
    return fixed_read

def load_serial():
    dfs = pd.DataFrame()
    list_ = []
    file_list = glob2.glob(path + "\\*" + file_identifier)
    for f in file_list:
        df = reader(f)
        list_.append(df)
    dfs = pd.concat(list_)
    return dfs

data = load_serial()
I'm trying to do the same in parallel, since the actual set of files to be compiled is around ~2000 Excel files.
But somehow it doesn't show anything after running it in Jupyter with 10 test files, just like above:
import time
import glob2
import pandas as pd
import multiprocessing as mp
from joblib import Parallel, delayed
import os
import concurrent.futures

Folder = "Folder"
StartRead = 2
num_cores = 2

DefaultPath = "C:\\Users\\"
path = DefaultPath + Folder
file_identifier = "*.xlsx"

def reader(filename):
    raw_read = pd.read_excel(filename, skiprows=StartRead)
    fixed_read = raw_read.fillna(method='ffill')
    return fixed_read

def load_serial():
    dfs = pd.DataFrame()
    list_ = []
    file_list = glob2.glob(path + "\\*" + file_identifier)
    for f in file_list:
        df = reader(f)
        list_.append(df)
    dfs = pd.concat(list_)
    return dfs

start1 = time.time()

if __name__ == "__main__":
    file_list = glob2.glob(path + "\\*" + file_identifier)
    pool = mp.Pool(num_cores)
    list_of_results = pool.map_async(reader, file_list)
    pool.close()
    pool.join()

end1 = time.time()
Did I miss something? Thank you.
This was tested on my dual-core notebook with 10 files.
The actual run will be on my 32-core server running Windows, with ~2000 Excel files in each folder.
Thank you.
Edit:
Finally I have it running.
First, the reader function needs to be in a separate file.
Then wrap the main code inside if __name__ == '__main__'.
I've tested it on my main 32-core machine and it properly spools up all the cores.
I'll put the fixed code up tomorrow; I'm really sleepy right now...
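In the meantime, here is a minimal sketch of the shape that fix takes, based on the description above; the module name excel_reader.py is hypothetical, and the folder/path values are the placeholders from the question:
# excel_reader.py (hypothetical module) -- the worker must be importable by the child processes
import pandas as pd

StartRead = 2

def reader(filename):
    raw_read = pd.read_excel(filename, skiprows=StartRead)
    return raw_read.fillna(method='ffill')

# main script
import glob2
import pandas as pd
import multiprocessing as mp
from excel_reader import reader  # hypothetical module name

path = "C:\\Users\\" + "Folder Name"
file_identifier = "*.xlsx"
num_cores = 2

if __name__ == "__main__":
    file_list = glob2.glob(path + "\\*" + file_identifier)
    with mp.Pool(num_cores) as pool:
        list_of_results = pool.map(reader, file_list)  # blocks until every file is read
    dfs = pd.concat(list_of_results)
    print(dfs.shape)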
I want to download 20 CSV files, about 5 MB in total.
Here is the first version of my code:
import os
from bs4 import BeautifulSoup
import urllib.request
import datetime

def get_page(url):
    try:
        return urllib.request.urlopen(url).read()
    except:
        print("[warn] %s" % (url))
        raise

def get_all_links(page):
    soup = BeautifulSoup(page)
    links = []
    for link in soup.find_all('a'):
        url = link.get('href')
        if '.csv' in url:
            return url
    print("[warn] Can't find a link with CSV file!")

def get_csv_file(company):
    link = 'http://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices'
    g = link.find('s=')
    name = link[g + 2:g + 6]
    link = link.replace(name, company)
    urllib.request.urlretrieve(get_all_links(get_page(link)), os.path.join('prices', company + '.csv'))
    print("[info][" + company + "] Download is complete!")

if __name__ == "__main__":
    start = datetime.datetime.now()
    security_list = ["AAPL", "ADBE", "AMD", "AMZN", "CRM", "EXPE", "FB", "GOOG", "GRPN", "INTC", "LNKD", "MCD", "MSFT", "NFLX", "NVDA", "NVTL", "ORCL", "SBUX", "STX"]
    for security in security_list:
        get_csv_file(security)
    end = datetime.datetime.now()
    print('[success] Total time: ' + str(end-start))
This code downloads the 20 CSV files (about 5 MB in total) in 1.2 minutes.
Then I tried to use multiprocessing to make the download faster.
Here is version 2:
if __name__ == "__main__":
    import multiprocessing
    start = datetime.datetime.now()
    security_list = ["AAPL", "ADBE", "AMD", "AMZN", "CRM", "EXPE", "FB", "GOOG", "GRPN", "INTC", "LNKD", "MCD", "MSFT", "NFLX", "NVDA", "NVTL", "ORCL", "SBUX", "STX"]
    for i in range(20):
        p = multiprocessing.Process(target=hP.get_csv_files([index] + security_list), args=(i,))
        p.start()
    end = datetime.datetime.now()
    print('[success] Total time: ' + str(end-start))
But, unfortunately, version 2 downloads the same 20 CSV files (about 5 MB in total) in 2.4 minutes.
Why does multiprocessing slow down my program?
What am I doing wrong?
What is the best way to download these files faster than now?
Thank you!
I don't know what exactly you are trying to start with Process in your example (I think you have a few typos). I think you want something like this:
processes = []
for security in security_list:
    p = multiprocessing.Process(target=get_csv_file, args=(security,))
    p.start()
    processes.append(p)

for p in processes:
    p.join()
You can iterate in this way over the security names, create a new process for each one, and put the process in a list.
After you have started all the processes, you loop over them and wait for them to finish using join.
There is also a simpler way to do this, using Pool and its parallel map implementation.
pool = multiprocessing.Pool(processes=5)
pool.map(get_csv_file, security_list)
You create a Pool of processes (if you omit the argument, it will create a number equal to your processor count), and then you apply your function to each element in the list using map. The pool will take care of the rest.
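For completeness, here is a minimal, self-contained sketch of the Pool version; the if __name__ == "__main__" guard matters when processes are spawned (as on Windows), and get_csv_file is the function from the question, assumed to be defined in or imported into the same script:
import multiprocessing

if __name__ == "__main__":
    security_list = ["AAPL", "ADBE", "AMD", "AMZN", "CRM", "EXPE", "FB", "GOOG"]
    # Five workers download in parallel; map blocks until every download is done.
    with multiprocessing.Pool(processes=5) as pool:
        pool.map(get_csv_file, security_list)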