How to efficiently perform row-wise operations using pandas? - python

I want to get some basic statistics from some csv files without loading the whole file in memory. I do it in two ways: one seemingly "smart" way using pandas and another casual way using csv. I expected the pandas way to be faster, but the csv way is actually faster by a very large margin. I was wondering why.
Here is my code:
import pandas as pd
import csv
movies = pd.read_csv('movies.csv') # movieId,title,genres
movie_count = movies.shape[0] # 9742
movieId_min = movies.movieId.min()
movieId_max = movies.movieId.max()
movieId_disperse = movies.movieId.sort_values().to_dict()
movieId_squeeze = {v: k for k, v in movieId_disperse.items()}

def get_ratings_stats():
    gp_by_user = []
    gp_by_movie = [0] * movie_count
    top_rator = (0, 0)  # (idx, value)
    top_rated = (0, 0)  # (idx, value)
    rating_count = 0
    user_count = 0
    last_user = -1
    for row in csv.DictReader(open('ratings.csv')):
        user = int(row['userId'])-1
        movie = movieId_squeeze[int(row['movieId'])]
        if last_user != user:
            last_user = user
            user_count += 1
            gp_by_user += [0]
        rating_count += 1
        gp_by_user[user] += 1
        gp_by_movie[movie] += 1
        top_rator = (user, gp_by_user[user]) if gp_by_user[user] > top_rator[1] else top_rator
        top_rated = (movie, gp_by_movie[movie]) if gp_by_movie[movie] > top_rated[1] else top_rated
    top_rator = (top_rator[0]+1, top_rator[1])
    top_rated = (movieId_disperse[top_rated[0]], top_rated[1])
    return rating_count, top_rator, top_rated
Now if I replace the line:
for row in csv.DictReader(open('ratings.csv')):
With:
for chunk in pd.read_csv('ratings.csv', chunksize=1000):
    for _, row in chunk.iterrows():
The code actually becomes 10 times slower.
Here are the timing results:
> %timeit get_ratings_stats() # with csv
325 ms ± 9.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %timeit get_ratings_stats() # with pandas
3.45 s ± 67.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Any comments as to how I can make this code better/faster/more readable would be much appreciated.

I think the point is that you shouldn't use pandas if you're going to then treat the big, expensive data structure like a dict. The question shouldn't be how to get pandas to be better at that, it should be how to write your code with pandas to do what you want.
import pandas as pd

def get_ratings_stats():
    movie_rating_data = pd.read_csv('ratings.csv')
    # Get the movie with the best rating
    top_movie = movie_rating_data.loc[:, ['movieId', 'rating']].groupby('movieId').agg('max').sort_values(by='rating', ascending=False).iloc[:, 0]
    # Get the user with the best rating
    top_user = movie_rating_data.loc[:, ['userId', 'rating']].groupby('userId').agg('max').sort_values(by='rating', ascending=False).iloc[:, 0]
    return movie_rating_data.shape[0], top_movie, top_user

def get_ratings_stats_slowly():
    movies = pd.DataFrame(columns=["movieId", "rating"])
    users = pd.DataFrame(columns=["userId", "rating"])
    data_size = 0
    for chunk in pd.read_csv('ratings.csv', chunksize=1000):
        # reset_index so movieId/userId stay as columns for the final groupby
        movies = movies.append(chunk.loc[:, ['movieId', 'rating']].groupby('movieId').agg('max').reset_index())
        users = users.append(chunk.loc[:, ['userId', 'rating']].groupby('userId').agg('max').reset_index())
        data_size += chunk.shape[0]
    top_movie = movies.loc[:, ['movieId', 'rating']].groupby('movieId').agg('max').sort_values(by='rating', ascending=False).iloc[:, 0]
    top_user = users.loc[:, ['userId', 'rating']].groupby('userId').agg('max').sort_values(by='rating', ascending=False).iloc[:, 0]
    return data_size, top_movie, top_user
I'm not really sure that this is what you want to do overall, but your code is incomprehensible - this should be a good place to start (you could replace .agg('max') with .count() if you're interested in the number of ratings, etc).
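If what you actually want is the number of ratings (as in the original question) rather than the maximum rating value, a minimal vectorized sketch could look like this; like the code above it loads the whole file, and it assumes the usual userId,movieId,rating columns in ratings.csv:
import pandas as pd

def get_rating_counts():
    ratings = pd.read_csv('ratings.csv')  # userId,movieId,rating,timestamp
    per_user = ratings['userId'].value_counts()    # ratings per user
    per_movie = ratings['movieId'].value_counts()  # ratings per movie
    top_rator = (per_user.idxmax(), per_user.max())
    top_rated = (per_movie.idxmax(), per_movie.max())
    return len(ratings), top_rator, top_rated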

I think parallel processing is the answer to your question. I've tried doing some parallel processing on your problem, but I had to split the ratings file into multiple files for processing.
What I did initially was to duplicate the ratings data from the CSV file by a factor of 10, then I executed your script to get a baseline execution time, which for me was about 3.6 seconds. By splitting the data into multiple files that can be handled by multiple child processes, and for example running my script with -k 2 (basically 2 workers), the total execution time dropped to 1.87 seconds. With -k 4 (4 workers) the execution time was 1.13 seconds.
I am not sure whether it is possible to read a single big CSV in chunks and have the workers do a parallel seek into it, but that would make it a lot faster; the only drawback is the need for an initial count of the rows in the big CSV file, to know how many rows go to each worker. A rough sketch of that single-file idea follows below.
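A rough sketch of that single-file idea, under my assumptions (header row present, total data-row count already known, and hypothetical names count_chunk/parallel_movie_counts); in practice seeking by byte offset would avoid re-scanning the skipped rows, but this keeps the idea simple:
import pandas as pd
from multiprocessing import Pool

RATINGS_PATH = 'ratings.csv'  # single big file with a header row (assumption)

def count_chunk(args):
    start, nrows = args
    # Skip the data rows before this worker's slice, but keep the header line
    chunk = pd.read_csv(RATINGS_PATH, header=0,
                        skiprows=range(1, start + 1), nrows=nrows)
    return chunk.groupby('movieId').size()

def parallel_movie_counts(total_rows, workers=4):
    per_worker = total_rows // workers
    slices = [(i * per_worker,
               per_worker if i < workers - 1 else total_rows - i * per_worker)
              for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(count_chunk, slices)
    # Combine the per-slice counts into one Series of counts per movieId
    return pd.concat(partials).groupby(level=0).sum()

if __name__ == '__main__':
    counts = parallel_movie_counts(total_rows=100_000, workers=4)  # example row count
    print(counts.idxmax(), counts.max())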
The splitting script:
import csv
file_path = "data/ratings.csv"
out_path = "data/big_ratings_{}.csv"
out_csv = None
for i in range(10):
    print("Iteration #{}".format(i+1))
    pin = open(file_path, "r")
    pout = open(out_path.format(i), "w")
    in_csv = csv.DictReader(pin)
    out_csv = csv.DictWriter(pout, fieldnames=in_csv.fieldnames)
    out_csv.writeheader()
    for row in in_csv:
        out_csv.writerow(row)
    pin.close()
    pout.close()
The actual rating processing script
import time
import csv
import argparse
import os
import sys
from multiprocessing import Process, Queue, Value
import pandas as pd
top_rator_queue = Queue()
top_rated_queue = Queue()
DEFAULT_NO_OF_WORKERS = 1
RATINGS_FILE_PATH = "data/big_ratings_{}.csv"
NUMBER_OF_FILES = 10
class ProcessRatings(Process):
    def __init__(self, file_index_range, top_rator_queue, top_rated_queue, movie_id_squeeze):
        super(ProcessRatings, self).__init__()
        self.file_index_range = file_index_range
        self.top_rator_queue = top_rator_queue
        self.top_rated_queue = top_rated_queue
        self.movie_id_squeeze = movie_id_squeeze

    def run(self):
        for file_index in self.file_index_range:
            print("[PID: {}] Processing file index {}.".format(os.getpid(), file_index))
            start = time.time()
            gp_by_user = []
            gp_by_movie = [0] * movie_count
            top_rator = (0, 0)  # (idx, value)
            top_rated = (0, 0)  # (idx, value)
            rating_count = 0
            user_count = 0
            last_user = -1
            for row in csv.DictReader(open(RATINGS_FILE_PATH.format(file_index))):
                user = int(row['userId'])-1
                movie = self.movie_id_squeeze[int(row['movieId'])]
                if last_user != user:
                    last_user = user
                    user_count += 1
                    gp_by_user += [0]
                gp_by_user[user] += 1
                gp_by_movie[movie] += 1
                top_rator = (user, gp_by_user[user]) if gp_by_user[user] > top_rator[1] else top_rator
                top_rated = (movie, gp_by_movie[movie]) if gp_by_movie[movie] > top_rated[1] else top_rated
            end = time.time()
            print("[PID: {}] Processing time for file index {}: {}s!".format(os.getpid(), file_index, end-start))
        print("[PID: {}] WORKER DONE!".format(os.getpid()))


if __name__ == "__main__":
    print("Processing ratings in multiple worker processes.")
    start = time.time()

    # script arguments handling
    parser = argparse.ArgumentParser()
    parser.add_argument("-k", dest="workers", action="store")
    args_space = parser.parse_args()

    # determine the number of workers
    number_of_workers = DEFAULT_NO_OF_WORKERS
    if args_space.workers:
        number_of_workers = int(args_space.workers)
    else:
        print("Number of workers not specified. Assuming: {}".format(number_of_workers))

    # rating data
    rating_count = 0
    movies = pd.read_csv('data/movies.csv')  # movieId,title,genres
    movie_count = movies.shape[0]  # 9742
    movieId_min = movies.movieId.min()
    movieId_max = movies.movieId.max()
    movieId_disperse = movies.movieId.sort_values().to_dict()
    movieId_squeeze = {v: k for k, v in movieId_disperse.items()}

    # process data
    processes = []

    # initialize the worker processes
    number_of_files_per_worker = NUMBER_OF_FILES // number_of_workers
    for i in range(number_of_workers):
        p = ProcessRatings(
            # each worker gets its own contiguous, non-overlapping slice of file indices
            range(i * number_of_files_per_worker, (i + 1) * number_of_files_per_worker),
            top_rator_queue,
            top_rated_queue,
            movieId_squeeze
        )
        p.start()
        processes.append(p)

    print("MAIN: Wait for processes to finish ...")
    # wait until all processes are done
    while True:
        # determine if the processes are still running
        if not any(p.is_alive() for p in processes):
            break

    # gather the data and do a final processing
    end = time.time()
    print("Processing time: {}s".format(end - start))
    print("Rating count: {}".format(rating_count))

Related

Export data to gsheet workbooks / worksheets after looping through a script 10 times

I have the following script I'm running to get data from Google's pagespeed insights tool via API:
from datetime import datetime
from urllib import request
import requests
import pandas as pd
import numpy as np
import re
from os import truncate
import xlsxwriter
import time
import pygsheets
import pickle
domain_strip = 'https://www.example.co.uk'
gc = pygsheets.authorize(service_file='myservicefile.json')
API = "myapikey"
strat = "mobile"
def RunCWV():
    with open('example_urls_feb_23.txt') as pagespeedurls:
        content = pagespeedurls.readlines()
        content = [line.rstrip('\n') for line in content]
    #Dataframes
    dfCWV2 = pd.DataFrame({'Page':[],'Overall Performance Score':[],'FCP (seconds) CRUX':[],'FCP (seconds) Lab':[],'FID (seconds)':[],'Max Potential FID (seconds)':[],'LCP (seconds) CRUX':[],'LCP (seconds) Lab':[],'LCP Status':[],'CLS Score CRUX':[],'Page CLS Score Lab':[],'CLS Status':[],'Speed Index':[],'Uses Efficient Cache Policy?':[],'Landing Page':[]})
    dfCLSPath2 = pd.DataFrame({'Page':[],'Path':[],'Selector':[],'Node Label':[],'Element CLS Score':[],'Landing Page':[],'large_uid':[]})
    dfUnsizedImages2 = pd.DataFrame({'Page':[],'Image URL':[],'Landing Page':[],'unsized_uid':[]})
    dfNCAnim2 = pd.DataFrame({'Page':[],'Animation':[],'Failure Reason':[],'Landing Page':[]})
    dfLCP_Overview = pd.DataFrame({'Page':[],'Preload LCP Savings (seconds)':[],'Resize Images Savings (seconds)':[],'Text Compression Savings (seconds)':[],'Preload Key Requests Savings (seconds)':[],'Preconnect Savings (seconds)':[],'Unused CSS Savings (seconds)':[],'Unused JS Savings (seconds)':[],'Unminified CSS Savings (seconds)':[],'Unminified JS Savings (seconds)':[],'Efficiently Animated Content Savings':[],'Landing Page':[]})
    dfLCPOb2 = pd.DataFrame({'Page':[],'LCP Tag':[],'LCP Tag Type':[],'LCP Image Preloaded?':[],'Wasted Seconds':[],'Action':[],'Landing Page':[]})
    dfresize_img = pd.DataFrame({'Page':[],'Image URL':[],'Total Bytes':[],'Wasted Bytes':[],'Overall Savings (seconds)':[],'Action':[],'Landing Page':[]})
    dfFontDisplay2 = pd.DataFrame({'Page':[],'Resource':[],'Font Display Utilised?':[],'Wasted Seconds':[],'Action':[],'Landing Page':[]})
    dfTotalBW2 = pd.DataFrame({'Page':[],'Total Byte Weight of Page':[],'Large Network Payloads?':[],'Resource':[],'Total KB':[],'Landing Page':[]})
    dfRelPreload2 = pd.DataFrame({'Page':[],'Resource':[],'Wasted Seconds':[],'Landing Page':[]})
    dfRelPreconnect2 = pd.DataFrame({'Page':[],'Resource':[],'Wasted Ms':[],'Passed Audit':[],'Landing Page':[]})
    dfTextCompression2 = pd.DataFrame({'Page':[],'Text Compression Optimal?':[],'Action':[],'Savings':[],'Landing Page':[]})
    dfUnusedCSS2 = pd.DataFrame({'Page':[],'CSS File':[],'Unused CSS Savings KiB':[],'Unused CSS Savings (seconds)':[],'Wasted %':[],'Landing Page':[]})
    dfUnusedJS2 = pd.DataFrame({'Page':[],'JS File':[],'Unused JS Savings (seconds)':[],'Total Bytes':[],'Wasted Bytes':[],'Wasted %':[],'Landing Page':[]})
    dfUnminCSS2 = pd.DataFrame({'Page':[],'CSS File':[],'Total Bytes':[],'Wasted Bytes':[],'Wasted %':[],'Landing Page':[]})
    dfUnminJS2 = pd.DataFrame({'Page':[],'JS File':[],'Total Bytes':[],'Wasted Bytes':[],'Wasted %':[],'Landing Page':[]})
    dfCritRC2 = pd.DataFrame({'Page':[],'Resource':[],'Start Time':[],'End Time':[],'Total Time':[],'Transfer Size':[],'Landing Page':[]})
    dfAnimContent2 = pd.DataFrame({'Page':[],'Efficient Animated Content?':[],'Resource':[],'Total Bytes':[],'Wasted Bytes':[],'Landing Page':[]})
    dfSRT2 = pd.DataFrame({'Page':[],'Passed Audit?':[],'Server Response Time ms':[],'Server Response Time Savings':[],'Landing Page':[]})
    dfRedirects2 = pd.DataFrame({'Page':[],'Redirects':[],'Wasted ms':[],'Landing Page':[]})
    dfFID_Summary2 = pd.DataFrame({'Page':[],'FID (seconds)':[],'Total Blocking Time (seconds)':[],'FID Rating':[],'Total Tasks':[],'Total Task Time of Page (seconds)':[],'Tasks over 50ms':[],'Tasks over 100ms':[],'Tasks over 500ms':[],'3rd Party Total Wasted Seconds':[],'Bootup Time (seconds)':[],'Number of Dom Elements':[],'Mainthread work Total Seconds':[],'Duplicate JS Savings (Seconds)':[],'Legacy JS Savings (seconds)':[],'Landing Page':[]})
    dflongTasks2 = pd.DataFrame({'Page':[],'Task':[],'Task Duration Seconds':[],'Total Tasks':[],'Total Task Time of Page (seconds)':[],'Tasks over 50ms':[],'Tasks over 100ms':[],'Tasks over 500ms':[],'Landing Page':[]})
    dfthirdP2 = pd.DataFrame({'Page':[],'3rd Party Total wasted Seconds':[],'3rd Party Total Blocking Time (seconds)':[],'3rd Party Resource Name':[],'Landing Page':[]})
    dfbootup2 = pd.DataFrame({'Page':[],'Page Bootup Time Score':[],'Resource':[],'Time spent Parsing / Compiling Ms':[]})
    dfthread2 = pd.DataFrame({'Page':[],'Score':[],'Mainthread work total seconds':[],'Mainthread work Process Type':[],'Duration (Seconds)':[],'Landing Page':[]})
    dfDOM2 = pd.DataFrame({'Page':[],'Dom Size Score':[],'DOM Stat':[],'DOM Value':[],'Landing Page':[],})
    dfdupJS2 = pd.DataFrame({'Page':[],'Score':[],'Audit Status':[],'Duplicate JS Savings (seconds)':[], 'Landing Page':[]})
    dflegacyJS2 = pd.DataFrame({'Page':[],'Audit Status':[],'Legacy JS Savings (seconds)':[],'JS File of Legacy Script':[],'Wasted Bytes':[],'Landing Page':[]})
    #Run PSI
    for line in content:
        x = f'https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url={line}&strategy={strat}&key={API}'
        print(f'Running CWV Audit on {strat} from: {line} - Please Wait...')
        r = requests.get(x)
        data = r.json()
        line_stripped = line
        if domain_strip in line:
            line_stripped = line_stripped.replace(domain_strip, '')
        else:
            pass
        #CWV Overview
        try:
            op_score = data["lighthouseResult"]["categories"]["performance"]["score"] * 100
            fcp_score_CRUX = data["loadingExperience"]["metrics"]["FIRST_CONTENTFUL_PAINT_MS"]["percentile"] / 1000
            fcp_score_LAB = data["lighthouseResult"]["audits"]["first-contentful-paint"]["numericValue"] / 1000
            fid_score = data["loadingExperience"]["metrics"]["FIRST_INPUT_DELAY_MS"]["percentile"] / 1000
            Max_P_FID = data["lighthouseResult"]["audits"]["max-potential-fid"]["numericValue"] / 1000
            lcp_score_CRUX_ms = data["loadingExperience"]["metrics"]["LARGEST_CONTENTFUL_PAINT_MS"]["percentile"]
            lcp_score_CRUX = data["loadingExperience"]["metrics"]["LARGEST_CONTENTFUL_PAINT_MS"]["percentile"] / 1000
            lcp_score_LAB = data["lighthouseResult"]["audits"]["first-contentful-paint"]["numericValue"] / 1000
            cls_score_Sitewide = data["loadingExperience"]["metrics"]["CUMULATIVE_LAYOUT_SHIFT_SCORE"]["percentile"] / 100
            cls_score_Page_mult = data["lighthouseResult"]["audits"]["cumulative-layout-shift"]["numericValue"] * 1000
            cls_score_Page = data["lighthouseResult"]["audits"]["cumulative-layout-shift"]["numericValue"]
            speed_index = data["lighthouseResult"]["audits"]["speed-index"]["numericValue"] / 1000
            efficient_cache = data["lighthouseResult"]["audits"]["uses-long-cache-ttl"]["score"]
            if efficient_cache == 1:
                efficient_cache = "Yes"
            else:
                efficient_cache = "No"
            lcp_status = lcp_score_CRUX_ms
            if lcp_score_CRUX_ms <=2500:
                lcp_status = "Good"
            elif lcp_score_CRUX_ms in range (2501, 4000):
                lcp_status = "Needs Improvement"
            else:
                lcp_status = "Poor"
            cls_status = cls_score_Page_mult
            if cls_score_Page_mult <=100:
                cls_status = "Good"
            elif cls_score_Page_mult in range (101,150):
                cls_status = "Needs Improvement"
            else:
                cls_status = "Poor"
            new_row = pd.DataFrame({'Page':line_stripped,'Overall Performance Score':op_score, 'FCP (seconds) CRUX':round(fcp_score_CRUX,4),'FCP (seconds) Lab':round(fcp_score_LAB,4), 'FID (seconds)':round(fid_score,4),
                'Max Potential FID (seconds)':round(Max_P_FID,4), 'LCP (seconds) CRUX':round(lcp_score_CRUX,4),'LCP (seconds) Lab':round(lcp_score_LAB,4), 'LCP Status':lcp_status, 'CLS Score CRUX':round(cls_score_Sitewide,4),
                'Page CLS Score Lab':round(cls_score_Page,4),'CLS Status':cls_status,'Speed Index':round(speed_index,4),'Uses Efficient Cache Policy?':efficient_cache, 'Landing Page':line_stripped}, index=[0])
            dfCWV2 = pd.concat([dfCWV2, new_row], ignore_index=True) #, ignore_index=True
        except KeyError:
            print(f'<KeyError> CWV Summary One or more keys not found {line}.')
        except TypeError:
            print(f'TypeError on {line}.')
        print ('CWV Summary')
        print (dfCWV2)
        #Export to GSheets line by line
        sh = gc.open('CWV Overview AWP - example Feb 2023')
        worksheet = sh.worksheet_by_title('CWV')
        df_worksheet = worksheet.get_as_df()
        result = pd.concat([df_worksheet, dfCWV2], ignore_index=True)
        result=result.drop_duplicates(keep='last')
        worksheet.set_dataframe(result, 'A1')
        # #End test
        #CLS
        #Large Shifts
        try:
            for x in range (len(data["lighthouseResult"]["audits"]["layout-shift-elements"]["details"]["items"])):
                path = data["lighthouseResult"]["audits"]["layout-shift-elements"]["details"]["items"][x]["node"]["path"]
                selector = data["lighthouseResult"]["audits"]["layout-shift-elements"]["details"]["items"][x]["node"]["selector"]
                nodeLabel = data["lighthouseResult"]["audits"]["layout-shift-elements"]["details"]["items"][x]["node"]["nodeLabel"]
                score = data["lighthouseResult"]["audits"]["layout-shift-elements"]["details"]["items"][x]["score"]
                i = 1
                new_row = pd.DataFrame({'Page':line_stripped, 'Path':path, 'Selector':selector, 'Node Label':nodeLabel,'Element CLS Score':round(score,4), 'Landing Page':line_stripped, 'large_uid':i}, index=[0])
                dfCLSPath2 = pd.concat([dfCLSPath2, new_row], ignore_index=True)
        except KeyError:
            print(f'<KeyError> Layout Shift Elements - One or more keys not found {line}.')
        except TypeError:
            print(f'TypeError on {line}.')
        print ('Large Shifts')
        print (dfCLSPath2)
        sh = gc.open('CLS Audit AWP - example Feb 2023')
        worksheet = sh.worksheet_by_title('Large CLS Elements')
        df_worksheet = worksheet.get_as_df()
        result = pd.concat([df_worksheet, dfCLSPath2], ignore_index=True)
        result=result.drop_duplicates(keep='last')
        worksheet.set_dataframe(result, 'A1')
        #Unsized Images
        try:
            for x in range (len(data["lighthouseResult"]["audits"]["unsized-images"]["details"]["items"])):
                unsized_url = data["lighthouseResult"]["audits"]["unsized-images"]["details"]["items"][x]["url"]
                i = 1
                new_row = pd.DataFrame({'Page':line_stripped, 'Image URL':unsized_url, 'Landing Page':line_stripped, 'unsized_uid':i}, index=[0])
                dfUnsizedImages2 = pd.concat([dfUnsizedImages2, new_row], ignore_index=True)
        except KeyError:
            print(f'<KeyError> Unsized Images One or more keys not found {line}.')
        except TypeError:
            print(f'TypeError on {line}.')
        print ('Unsized Images')
        print(dfUnsizedImages2)
        sh = gc.open('CLS Audit AWP - example Feb 2023')
        worksheet = sh.worksheet_by_title('Unsized Images')
        df_worksheet = worksheet.get_as_df()
        result = pd.concat([df_worksheet, dfUnsizedImages2], ignore_index=True)
        result=result.drop_duplicates(keep='last')
        worksheet.set_dataframe(result, 'A1')
I've only included the first few try blocks, as the script is very long. Essentially what I want to do is the same as I have here, but rather than exporting the results from the dataframes after every URL has run, I want to export them, say, every 10 URLs (or more). I have around 4000 URLs in total and I need to capture the results from the audit for every one of them.
I used to have the script set up to export to Google Sheets at the end of the whole run, but it always crashed before it looped through every URL I'm auditing, which is why I set it up as above to export line by line. That is SUPER slow, though, taking over two weeks to run through all the URLs in my text file, so I want to speed it up by exporting only every 10 URLs' worth of data at a time. That way, if the script crashes, I've only lost the last 10 URLs rather than everything.
I tried setting a counter on each of the export blocks:
results = []
results_to_export = []
for i in range(10):
    counter = 0
    while counter < 5000:
        print("Starting loop iteration")
        results.append(dfCWV2)
        counter += 1
    if counter % 10 == 0:
        print("Running after 10 loops")
        result = pd.concat(results, ignore_index=True)
        result = result.drop_duplicates(keep='last')
        # add results to export list
        results_to_export.append(result)
        if results_to_export:
            sh = gc.open('CWV Overview AWP - example Feb 2023')
            worksheet = sh.worksheet_by_title('CWV')
            combined_results = pd.concat(results_to_export, ignore_index=True)
            worksheet.set_dataframe(combined_results, 'A1')
            results_to_export.clear()
            results = []
But this just kept looping through the while loop and not moving on to the next try block or throwing any errors (I tried every version of unindenting the if statements too, but nothing worked).
Please help!
A shorter program would be more likely to get an expert answer
It may be a long time until you find somebody stumbling on to this page who is willing to read so much text, and who knows how to solve the problem. To improve your chances, it is best to trim your program to the absolute smallest size that allows the problem to manifest.
Is your if counter statement not indented enough?
Currently you have:
results = []
results_to_export = []
for i in range(10):
    counter = 0
    while counter < 5000:
        # your other code here
        print("Starting loop iteration")
        results.append(dfCWV2)
        counter += 1
    if counter % 10 == 0:
        print("Running after 10 loops")
But the if counter, positioned where it is in the above code, will only be reached after 5000 steps of "while counter".
Did you mean this?
results = []
results_to_export = []
for i in range(10):
    counter = 0
    while counter < 5000:
        # your other code here
        print("Starting loop iteration")
        results.append(dfCWV2)
        counter += 1
        if counter % 10 == 0:
            print("Running after 10 loops")

Parallelizing the process

I want to parallelize over the spec objects generated by _spectrum_generator. I am using futures.ThreadPoolExecutor, which is called in _gather_lcms_data; each spec is passed through the function. The file is in .mzML format. Below is the output that I get, which is empty:
(base) ashish#14-ce3xxx:~/GNPS_LCMSDashboard$ python3 lcms_map.py
Empty DataFrame
Columns: [mz, rt, i, scan, index, polarity]
Index: []
The output should look like this:
(base) ashish#14-ce3xxx:/media/ashish/ubuntu7/GNPS_LCMSDashboard$ python3 lcms_map.py
mz rt i scan index polarity
0 169.038696 0.003722 1652.959961 1 1 1
1 177.969086 0.003722 1786.755127 1 1 1
2 194.156967 0.003722 1802.361450 1 1 1
3 154.059418 0.003722 1840.889160 1 1 1
4 164.080978 0.003722 1973.758423 1 1 1
5 150.079514 0.003722 1976.528687 1 1 1
6 160.096634 0.003722 2057.728516 1 1 1
7 201.182205 0.003722 2077.768311 1 1 1
8 162.078735 0.003722 2101.843018 1 1 1
9 171.044205 0.003722 2223.230713 1 1 1
Below is the code of _spectrum_generator:
def _spectrum_generator(filename, min_rt, max_rt):
    run = pymzml.run.Reader(filename, MS_precisions=MS_precisions)

    # Don't do this if the min_rt and max_rt are not reasonable values
    if min_rt <= 0 and max_rt > 1000:
        for spec in run:
            yield spec
    else:
        try:
            min_rt_index = _find_lcms_rt(run, min_rt)  # These are inclusive on left
            max_rt_index = _find_lcms_rt(run, max_rt) + 1  # Exclusive on the right
            for spec_index in tqdm(range(min_rt_index, max_rt_index)):
                spec = run[spec_index]
                yield spec
            print("USED INDEX")
        except:
            run = pymzml.run.Reader(filename, MS_precisions=MS_precisions)
            for spec in run:
                yield spec
            print("USED BRUTEFORCE")
Below is code of lcms_map.py:
import os
import pymzml
import numpy as np
import datashader as ds
from tqdm import tqdm
import json
import pandas as pd
import xarray
import time
import utils
import plotly.express as px
import plotly.graph_objects as go
from utils import _spectrum_generator
from utils import _get_scan_polarity
from multiprocessing import Pool
import concurrent.futures
from multiprocessing import Process
# Enum for polarity
POLARITY_POS = 1
POLARITY_NEG = 2
def _gather_lcms_data(filename, min_rt, max_rt, min_mz, max_mz, polarity_filter="None", top_spectrum_peaks=100, include_polarity=False):
    all_mz = []
    all_rt = []
    all_polarity = []
    all_i = []
    all_scan = []
    all_index = []

    spectrum_index = 0
    number_spectra = 0

    all_msn_mz = []
    all_msn_rt = []
    all_msn_polarity = []
    all_msn_scan = []
    all_msn_level = []

    # fun(filename, min_rt, max_rt)
    for spec in _spectrum_generator(filename, min_rt, max_rt):
        rt = spec.scan_time_in_minutes()

        try:
            # Still waiting for the window
            if rt < min_rt:
                continue
                # pass

            # We've passed the window
            if rt > max_rt:
                break
        except:
            pass

        if polarity_filter == "None":
            pass
        else:
            scan_polarity = _get_scan_polarity(spec)
            if polarity_filter != scan_polarity:
                continue

        if spec.ms_level == 1:
            spectrum_index += 1
            number_spectra += 1

            try:
                # Filtering peaks by mz
                if min_mz <= 0 and max_mz >= 2000:
                    peaks = spec.peaks("raw")
                else:
                    peaks = spec.reduce(mz_range=(min_mz, max_mz))

                # Filtering out zero rows
                peaks = peaks[~np.any(peaks < 1.0, axis=1)]

                # Sorting by intensity
                peaks = peaks[peaks[:,1].argsort()]
                peaks = peaks[-1 * top_spectrum_peaks:]

                mz, intensity = zip(*peaks)

                all_mz += list(mz)
                all_i += list(intensity)
                all_rt += len(mz) * [rt]
                all_scan += len(mz) * [spec.ID]
                all_index += len(mz) * [number_spectra]

                # Adding polarity
                if include_polarity is True:
                    scan_polarity = _get_scan_polarity(spec)
                    if scan_polarity == "Positive":
                        all_polarity += len(mz) * [POLARITY_POS]
                    else:
                        all_polarity += len(mz) * [POLARITY_NEG]
            except:
                pass
        elif spec.ms_level > 1:
            try:
                msn_mz = spec.selected_precursors[0]["mz"]
                if msn_mz < min_mz or msn_mz > max_mz:
                    continue
                all_msn_mz.append(msn_mz)
                all_msn_rt.append(rt)
                all_msn_scan.append(spec.ID)
                all_msn_level.append(spec.ms_level)

                # Adding polarity
                if include_polarity is True:
                    scan_polarity = _get_scan_polarity(spec)
                    if scan_polarity == "Positive":
                        all_msn_polarity.append(POLARITY_POS)
                    else:
                        all_msn_polarity.append(POLARITY_NEG)
            except:
                pass

    ms1_results = {}
    ms1_results["mz"] = all_mz
    ms1_results["rt"] = all_rt
    ms1_results["i"] = all_i
    ms1_results["scan"] = all_scan
    ms1_results["index"] = all_index

    msn_results = {}
    msn_results["precursor_mz"] = all_msn_mz
    msn_results["rt"] = all_msn_rt
    msn_results["scan"] = all_msn_scan
    msn_results["level"] = all_msn_level

    # Adding polarity
    if include_polarity is True:
        ms1_results["polarity"] = all_polarity
        msn_results["polarity"] = all_msn_polarity

    ms1_results = pd.DataFrame(ms1_results)
    msn_results = pd.DataFrame(msn_results)

    return ms1_results, number_spectra, msn_results
def _get_feather_filenames(filename):
    output_ms1_filename = filename + ".ms1.feather"
    output_msn_filename = filename + ".msn.feather"
    return output_ms1_filename, output_msn_filename

# These are caching layers for fast loading
def _save_lcms_data_feather(filename):
    output_ms1_filename, output_msn_filename = _get_feather_filenames(filename)

    start = time.time()

    # with Pool(5) as p:
    #     #ms1_results, number_spectra, msn_results = p.starmap(_gather_lcms_data, (filename, 0, 1000000, 0, 10000, "None", 100000, True))
    #     ms1_results, number_spectra, msn_results = _gather_lcms_data(filename, 0, 1000000, 0, 10000, polarity_filter="None", top_spectrum_peaks=100000, include_polarity=True)

    # with concurrent.futures.ProcessPoolExecutor(max_workers=100) as executor:
    #     f = executor.submit(_gather_lcms_data, filename, 0, 1000000, 0, 10000, polarity_filter="None", top_spectrum_peaks=100000, include_polarity=True)
    #     ms1_results, number_spectra, msn_results = f.result()

    ms1_results, number_spectra, msn_results = _gather_lcms_data(filename, 0, 1000000, 0, 10000, polarity_filter="None", top_spectrum_peaks=100000, include_polarity=True)

    print(ms1_results.head(10))
    print("Gathered data in", time.time() - start)

    ms1_results = ms1_results.sort_values(by='i', ascending=False).reset_index()
    ms1_results.to_feather(output_ms1_filename)
    msn_results.to_feather(output_msn_filename)

_save_lcms_data_feather("/media/ashish/ubuntu7/GNPS_LCMSDashboard/QC_0.mzML")
How do I get the desired output by parallelizing? Please suggest the changes I need to make in order to make it work.
As Simon Lundberg pointed out, you posted very complicated code, which makes it difficult to parallelize and even more difficult to explain how it is to be done. But if you were to present a simplified version of your code that was readily parallelizable, any answer would not be dealing with the complexities of your actual current code and would therefore be of little help. So I will try to create code that is an abstraction of your code's structure and then show how I would parallelize that. I am afraid that since you are not that familiar with multiprocessing, this may be rather difficult for you to follow.
First, a few observations about your code:
_gather_lcms_data is currently passed a filename and then, using the generator function _spectrum_generator, loops over all of its elements, each called spec. In each loop iteration, results are appended to various lists and the variable number_spectra is conditionally incremented. You have another variable, spectrum_index, that is also conditionally incremented, but its value is never otherwise used and could be eliminated. Finally, these lists are added to various dictionaries.
To parallelize the _gather_lcms_data function, it needs to process a single element, spec, from the _spectrum_generator function so that we can run multiple invocations of this function in parallel. Consequently it needs to return a tuple of elements back to the main process, which will do the necessary appending to lists and then create the dictionaries.
In your current code, for each iteration over spec you conditionally increment number_spectra and conditionally append elements to various lists. Since we are now parallelizing this function by returning individual elements, the main process must (1) accumulate the returned number_spectra values and (2) append the returned elements to the result lists. Where the original code did not append an element to a list for a given iteration, the parallelized code must return a None value so that the main process knows that, for that iteration, nothing needs to be appended.
In this abstraction, I have also reduced the number of lists down to two and I am generating dummy results.
First an abstraction of your current code.
def _spectrum_generator(filename, min_rt, max_rt):
    # run = pymzml.run.Reader(filename, MS_precisions=MS_precisions)
    run = [1, 2, 3, 4, 5, 6]
    for spec in run:
        yield spec

def _gather_lcms_data(filename, min_rt, max_rt, min_mz, max_mz, polarity_filter="None", top_spectrum_peaks=100, include_polarity=False):
    # Remainder of the list declarations omitted for simplicity
    all_mz = []
    all_msn_mz = []

    number_spectra = 0

    for spec in _spectrum_generator(filename, min_rt, max_rt):
        ...  # Code omitted for simplicity
        number_spectra += 1  # Conditionally done
        msn_mz = spec  # Conditionally done
        all_msn_mz.append(msn_mz)
        mz = (spec * spec,)  # Conditionally done
        all_mz += list(mz)
        ...

    ms1_results = {}
    msn_results = {}
    ...
    ms1_results["mz"] = all_mz
    msn_results["precursor_mz"] = all_msn_mz
    ...

    # Return
    return ms1_results, number_spectra, msn_results

def _save_lcms_data_feather(filename):
    ms1_results, number_spectra, msn_results = _gather_lcms_data(filename, 0, 1000000, 0, 10000, polarity_filter="None", top_spectrum_peaks=100000, include_polarity=True)

    print(ms1_results)
    print(number_spectra)
    print(msn_results)

if __name__ == '__main__':
    _save_lcms_data_feather("/media/ashish/ubuntu7/GNPS_LCMSDashboard/QC_0.mzML")
Prints:
{'mz': [1, 4, 9, 16, 25, 36]}
6
{'precursor_mz': [1, 2, 3, 4, 5, 6]}
This is the parallelized version of the above code:
def _spectrum_generator(filename, min_rt, max_rt):
    # run = pymzml.run.Reader(filename, MS_precisions=MS_precisions)
    run = [1, 2, 3, 4, 5, 6]
    for spec in run:
        yield spec

def _gather_lcms_data(spec, min_rt, max_rt, min_mz, max_mz, polarity_filter="None", top_spectrum_peaks=100, include_polarity=False):
    # Remainder of the list declarations omitted for simplicity
    number_spectra = 0

    ...  # Code omitted for simplicity
    number_spectra += 1
    msn_mz = spec  # Conditionally done. If not done then set msn_mz to None
    mz = list((spec * spec,))  # Conditionally done. If not done then set mz to None
    ...

    return mz, number_spectra, msn_mz

def _save_lcms_data_feather(filename):
    from multiprocessing import Pool
    from functools import partial

    min_rt = 0
    max_rt = 1000000

    worker_function = partial(_gather_lcms_data, min_rt=min_rt, max_rt=max_rt, min_mz=0, max_mz=10000, polarity_filter="None", top_spectrum_peaks=1000000, include_polarity=True)
    with Pool() as pool:
        all_mz = []
        all_msn_mz = []
        number_spectra = 0
        for mz, _number_spectra, msn_mz in pool.map(worker_function, _spectrum_generator(filename, min_rt, max_rt)):
            if mz is not None:
                all_mz += mz
            number_spectra += _number_spectra
            if msn_mz is not None:
                all_msn_mz.append(msn_mz)

        ms1_results = {}
        msn_results = {}

        ms1_results["mz"] = all_mz
        msn_results["precursor_mz"] = all_msn_mz

        print(ms1_results)
        print(number_spectra)
        print(msn_results)

if __name__ == '__main__':
    _save_lcms_data_feather("/media/ashish/ubuntu7/GNPS_LCMSDashboard/QC_0.mzML")

Python multiprocessing multiple iterations

I am trying to use multiprocessing to speed up my data processing. I am working on a machine with 6 cores. I want to iterate through a table of 12 million rows, and for each of these rows I iterate through several time steps, doing a calculation (executing a function).
This is the line I would like to split up so that it runs in parallel on different cores:
test = [rowiteration(i, output, ini_cols, cols) for i in a] # this should run in parallel
I tried something with
from multiprocessing import Pool
but I did not manage to pass the arguments of the function and the iterator.
I would appreciate any ideas. I am new to Python.
This is what I have:
import os
import pyreadr
import pandas as pd
import numpy as np
import time
from datetime import timedelta
import functools
from pathlib import Path

def read_data():
    current_path = os.getcwd()
    myfile = os.path.join(str(Path(current_path).parents[0]), 'dummy.RData')
    result = pyreadr.read_r(myfile)
    pc = result["pc"]
    u = result["u"]
    return pc, u

# add one column per time
def prepare_output_structure(pc):
    ini_cols = pc.columns
    pc = pc.reindex(columns=[*pc.columns, *np.arange(0, 11), 'cat'], fill_value=0)
    pc.reset_index(level=0, inplace=True)
    # print(pc.columns, pc.shape, pc.dtypes)
    return pc, ini_cols

def conjunction(*conditions):
    return functools.reduce(np.logical_and, conditions)

def timeloop(t_final: int, count_final: int, tipo):
    if tipo == 'A':
        count_ini = 35
    else:  # B:
        count_ini = 30
    yy_list = []
    for t in np.arange(0, 11):
        yy = ((count_final - count_ini) / t_final) * t + count_ini
        yy_list.append(int(yy))
    return yy_list

def rowiteration(i, output, ini_cols, cols):
    c_2: bool = pc.loc[i, 'tipo'] == u.iloc[:, 0].str[:1]  # first character of category e.g. 'A1'
    c_5: bool = pc.loc[i, 't_final'] >= u.iloc[:, 1]  # t_min (u)
    c_6: bool = pc.loc[i, 't_final'] <= (u.iloc[:, 2])  # t_max (u)
    pc.loc[i, 'cat'] = u[conjunction(c_2, c_5, c_6)].iloc[0, 0]
    pc.iloc[i, (0 + (len(ini_cols))+1):(10 + (len(ini_cols))+2)] = timeloop(int(pc.loc[i, 't_final']), int(pc.loc[i, 'count_final']), pc.loc[i, 'tipo'])
    out = pd.DataFrame(pc.iloc[i, :])
    out = pd.DataFrame(out.transpose(), columns=cols)
    output = output.append(out.iloc[0, :])
    return output

if __name__ == '__main__':
    start_time = time.time()
    pc, u = read_data()
    nrowpc = len(pc.index)
    a = np.arange(0, nrowpc)  # rows of table pc
    # print(a, nrowpc, len(pc.index))
    pc, ini_cols = prepare_output_structure(pc)
    cols = pc.columns
    output = pd.DataFrame()
    test = [rowiteration(i, output, ini_cols, cols) for i in a]  # this should run in parallel
    pc2 = pd.concat(test, ignore_index=True)
    pc2 = pc2.iloc[:, np.r_[5, (len(ini_cols)+1):(len(pc2.columns))]]
    print(pc2.head)
    elapsed_time_secs = time.time() - start_time
    msg = "Execution took: %s secs (Wall clock time)" % timedelta(seconds=elapsed_time_secs)
    print(msg)
Replace your [rowiteration(i, output, ini_cols, cols) for i in a] with:
from multiprocessing import Pool

n_cpu = 10  # put in the number of threads of cpu
with Pool(processes=n_cpu) as pool:
    ret = pool.starmap(rowiteration,
                       [(i, output, ini_cols, cols) for i in a])
Here is an approach that I think solves the problem and that only sends what is necessary to the worker processes. I haven't tested this as is (which would be difficult without the data your code reads in), but this is the basic idea:
import multiprocessing as mp

p = mp.Pool(processes=mp.cpu_count())

# Note that you already define the static cols and ini_cols
# in global scope so you don't need to pass them to the Pool.

# ... Other functions you've defined ...

def rowiteration(row):
    c_2: bool = row['tipo'] == u.iloc[:, 0].str[:1]
    c_5: bool = row['t_final'] >= u.iloc[:, 1]
    c_6: bool = row['t_final'] <= (u.iloc[:, 2])
    row['cat'] = u[conjunction(c_2, c_5, c_6)].iloc[0, 0]
    row[(0 + (len(ini_cols))+1):(10 + (len(ini_cols))+2)] = timeloop(int(row['t_final']), int(row['count_final']), row['tipo'])
    return row

out = []
for row in p.imap_unordered(rowiteration, [r for _, r in pc.iterrows()]):
    row.index = cols
    out.append(row)

pc2 = pd.DataFrame(out).reset_index(drop=True)

While loop incrementer not functioning properly

Right now, my code is correctly spitting out the first game (identified by start_id) in games. I am trying to increment in the bottom two lines, but the while loop doesn't seem to register the fact that I'm incrementing. So the output of this with start_id 800 and end_id 802 is just the information from 800, for some reason.
Am I using the incrementers correctly? Should I be initializing one of i or start_id elsewhere?
games = console(start_id, end_id)
final_output = []
while start_id < (end_id + 1):
    single_game = []
    i = 0
    game_id = games[i][0]
    time_entries = games[i][1][2][0]
    play_entries = games[i][1][2][1]
    score_entries = games[i][1][2][2]
    team_entries = games[i][1][2][3]
    bovada = games[i][1][0][0][0]
    at_capacity = games[i][1][0][1]
    idisagree_yetrespect_thatcall = games[i][1][0][2][0]
    imsailingaway = games[i][1][1][0][0]
    homeiswheretheheartis = games[i][1][1][1][0]
    zipper = zip(time_entries, play_entries, score_entries, team_entries)
    for play_by_play in zipper:
        single_game.append(game_id)
        single_game.append(play_by_play)
        single_game.append(bovada)
        single_game.append(at_capacity)
        single_game.append(idisagree_yetrespect_thatcall)
        single_game.append(imsailingaway)
        single_game.append(homeiswheretheheartis)
    start_id += 1
    i += 1
    final_output.append(single_game)
return final_output
Your problem is that you initialize the incrementer i inside the while loop, so every time your loop iterates, i is reset to zero.
Try changing it to:
i = 0
while start_id < (end_id + 1):
    ...
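A further tidy-up (my suggestion, not part of the fix above): since start_id and i always move in lockstep, you can drop both manual counters and let enumerate drive the loop, which removes this class of bug entirely; the per-game unpacking is abbreviated here:
games = console(start_id, end_id)
final_output = []
for i, game in enumerate(games):
    single_game = []
    game_id = game[0]
    time_entries = game[1][2][0]
    play_entries = game[1][2][1]
    score_entries = game[1][2][2]
    team_entries = game[1][2][3]
    # ... pull out bovada, at_capacity, etc. from game[1] exactly as before ...
    for play_by_play in zip(time_entries, play_entries, score_entries, team_entries):
        single_game.append(game_id)
        single_game.append(play_by_play)
        # ... append the remaining per-game fields as before ...
    final_output.append(single_game)
return final_output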

object of type '_Task' has no len() error

I am using the Parallel Python (pp) module. I have a function that returns an array, but when I print the variable that should contain the result of the parallelized function, it shows "pp._Task object at 0x04696510" instead of the value of the matrix.
Here is the code:
from __future__ import print_function
import scipy, pylab
from scipy.io.wavfile import read
import sys
import peakpicker as pea
import pp
import fingerprint as fhash
import matplotlib
import numpy as np
import tdft
import subprocess
import time
if __name__ == '__main__':
    start = time.time()

    #Peak picking dimensions
    f_dim1 = 30
    t_dim1 = 80
    f_dim2 = 10
    t_dim2 = 20
    percentile = 80
    base = 100  # lowest frequency bin used (peaks below are too common/not as useful for identification)
    high_peak_threshold = 75
    low_peak_threshold = 60

    #TDFT parameters
    windowsize = 0.008  # set the window size (0.008s = 64 samples)
    windowshift = 0.004  # set the window shift (0.004s = 32 samples)
    fftsize = 1024  # set the fft size (if srate = 8000, 1024 --> 513 freq. bins separated by 7.797 Hz from 0 to 4000Hz)

    #Hash parameters
    delay_time = 250  # 250*0.004 = 1 second#200
    delta_time = 250*3  # 750*0.004 = 3 seconds#300
    delta_freq = 128  # 128*7.797Hz = approx 1000Hz#80

    #Time pair parameters
    TPdelta_freq = 4
    TPdelta_time = 2

    # Loading stored data
    database = np.loadtxt('database.dat')
    songnames = np.loadtxt('songnames.dat', dtype=str, delimiter='\t')
    separator = '.'

    print('Please enter an audio sample file to identify: ')
    userinput = raw_input('---> ')
    subprocess.call(['ffmpeg','-y','-i',userinput, '-ac', '1','-ar', '8k', 'filesample.wav'])
    sample = read('filesample.wav')
    userinput = userinput.split(separator,1)[0]
    print('Analyzing the audio sample: '+str(userinput))

    srate = sample[0]  # sample rate in samples/second
    audio = sample[1]  # audio data
    spectrogram = tdft.tdft(audio, srate, windowsize, windowshift, fftsize)
    mytime = spectrogram.shape[0]
    freq = spectrogram.shape[1]
    print('The size of the spectrogram is time: '+str(mytime)+' and freq: '+str(freq))

    threshold = pea.find_thres(spectrogram, percentile, base)
    peaks = pea.peak_pick(spectrogram,f_dim1,t_dim1,f_dim2,t_dim2,threshold,base)
    print('The initial number of peaks is:'+str(len(peaks)))
    peaks = pea.reduce_peaks(peaks, fftsize, high_peak_threshold, low_peak_threshold)
    print('The reduced number of peaks is:'+str(len(peaks)))

    #Store information for the spectrogram graph
    samplePeaks = peaks
    sampleSpectro = spectrogram

    hashSample = fhash.hashSamplePeaks(peaks,delay_time,delta_time,delta_freq)
    print('The dimensions of the hash matrix of the sample: '+str(hashSample.shape))

    # tuple of all parallel python servers to connect with
    ppservers = ()
    #ppservers = ("10.0.0.1",)

    if len(sys.argv) > 1:
        ncpus = int(sys.argv[1])
        # Creates jobserver with ncpus workers
        job_server = pp.Server(ncpus, ppservers=ppservers)
    else:
        # Creates jobserver with automatically detected number of workers
        job_server = pp.Server(ppservers=ppservers)

    print("Starting pp with", job_server.get_ncpus(), "workers")
    print('Attempting to identify the sample audio clip.')
Here I call the function in fingerprint; the commented-out line worked, but when I try to parallelize it, it doesn't work:
    timepairs = job_server.submit(fhash.findTimePairs, (database, hashSample, TPdelta_freq, TPdelta_time, ))
    # timepairs = fhash.findTimePairs(database, hashSample, TPdelta_freq, TPdelta_time)
    print (timepairs)

    #Compute number of matches by song id to determine a match
    numSongs = len(songnames)
    songbins = np.zeros(numSongs)
    numOffsets = len(timepairs)
    offsets = np.zeros(numOffsets)
    index = 0
    for i in timepairs:
        offsets[index] = i[0]-i[1]
        index = index+1
        songbins[i[2]] += 1

    # Identify the song
    #orderarray=np.column_stack((songbins,songnames))
    #orderarray=orderarray[np.lexsort((songnames,songbins))]
    q3 = np.percentile(songbins, 75)
    q1 = np.percentile(songbins, 25)
    j = 0
    for i in songbins:
        if i > (q3+(3*(q3-q1))):
            print("Result-> "+str(i)+":"+songnames[j])
        j += 1

    end = time.time()
    print('Tiempo: '+str(end-start)+' s')
    print("Time elapsed: ", +time.time() - start, "s")

    fig3 = pylab.figure(1003)
    ax = fig3.add_subplot(111)
    ind = np.arange(numSongs)
    width = 0.35
    rects1 = ax.bar(ind, songbins, width, color='blue', align='center')
    ax.set_ylabel('Number of Matches')
    ax.set_xticks(ind)
    xtickNames = ax.set_xticklabels(songnames)
    matplotlib.pyplot.setp(xtickNames)
    pylab.title('Song Identification')
    fig3.show()
    pylab.show()

    print('The sample song is: '+str(songnames[np.argmax(songbins)]))
The function in fingerprint that I try to parallelize is:
def findTimePairs(hash_database, sample_hash, deltaTime, deltaFreq):
    "Find the matching pairs between sample audio file and the songs in the database"
    timePairs = []

    for i in sample_hash:
        for j in hash_database:
            if (i[0] > (j[0]-deltaFreq) and i[0] < (j[0] + deltaFreq)):
                if (i[1] > (j[1]-deltaFreq) and i[1] < (j[1] + deltaFreq)):
                    if (i[2] > (j[2]-deltaTime) and i[2] < (j[2] + deltaTime)):
                        timePairs.append((j[3], i[3], j[4]))
                    else:
                        continue
                else:
                    continue
            else:
                continue

    return timePairs
The complete error is:
Traceback (most recent call last):
File "analisisPrueba.py", line 93, in <module>
numOffsets = len(timepairs)
TypeError: object of type '_Task' has no len()
The submit() method submits a task to the server. What you get back is a reference to the task, not its result. (How could it return its result? submit() returns before any of that work has been done!) You should instead provide a callback function to receive the results. For example, timepairs.append is a function that will take the result and append it to the list timepairs.
timepairs = []
job_server.submit(fhash.findTimePairs, (database, hashSample, TPdelta_freq, TPdelta_time, ), callback=timepairs.append)
(Each findTimePairs call should calculate one result, in case that isn't obvious, and you should submit multiple tasks. Otherwise you're invoking all the machinery of Parallel Python for no reason. And make sure you call job_server.wait() to wait for all the tasks to finish before trying to do anything with your results. In short, read the documentation and some example scripts and make sure you understand how it works.)
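To make that concrete, here is a hedged sketch of submitting one task per CPU by slicing the database (the slicing, the collect callback, and the variable names are my additions; any partition of hash_database works, because the matches found in each slice are independent):
n_parts = job_server.get_ncpus()
slices = np.array_split(database, n_parts)   # database was loaded with np.loadtxt

timepairs = []
def collect(result):
    # each task returns a list of (j[3], i[3], j[4]) tuples
    timepairs.extend(result)

for part in slices:
    job_server.submit(fhash.findTimePairs,
                      (part, hashSample, TPdelta_freq, TPdelta_time),
                      callback=collect)

job_server.wait()   # block until every submitted task has finished
print('Found '+str(len(timepairs))+' time pairs')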
