ThreadPoolExecutor multiprocessing with while loop and breadth first search? - python

I'm trying to speed up some API calls by using ThreadPoolExecutor. I have a class that accepts a comma-separated string of h3 cells like cell1,cell2. h3 uses hexagons at different resolutions to get finer detail in mapping. The class methods take each cell and derive information about it that is passed to an API as params. The API returns a total number of results (which could be over 1000). Because the API is limited to returning at most the first 1000 results through pagination, I use h3 to zoom into each cell until all of its children/grandchildren/etc. each have a total number of results under 1000. This is effectively a BFS from the original cells provided.
When running this code with the run method, the expectation is that search_queue will be empty once all cells have been processed. However, with the way it's set up currently, only the origin_cells provided to the class get processed, and retrieving search_queue shows unprocessed items. Swapping the while and ThreadPoolExecutor lines does run everything as expected, but it runs at the same speed as without ThreadPoolExecutor.
Is there a way to make the threaded version work as expected?
Edit with working example
import h3
import math
import requests
from concurrent.futures import ThreadPoolExecutor
from time import sleep

dummy_results = {
    '85489e37fffffff': {'total': 1001},
    '85489e27fffffff': {'total': 999},
    '86489e347ffffff': {'total': 143},
    '86489e34fffffff': {'total': 143},
    '86489e357ffffff': {'total': 143},
    '86489e35fffffff': {'total': 143},
    '86489e367ffffff': {'total': 143},
    '86489e36fffffff': {'total': 143},
    '86489e377ffffff': {'total': 143},
}

class SearchH3Test(object):
    def __init__(self, origin_cells):
        self.search_queue = list(filter(None, origin_cells.split(',')))
        self.params_list = []

    def get_h3_radius(self, cell, buffer=False):
        """
        Get the approximate radius of the h3 cell
        """
        return math.ceil(
            math.sqrt(
                (h3.cell_area(cell)) / (1.5 * math.sqrt(3))
            ) * 1000
            + ((100 * (h3.h3_get_resolution(cell) / 10)) if buffer else 0)
        )

    def get_items(self, cell):
        """
        Return API items from the passed params, including the total number of items and a dict of items
        r = requests.get(
            url = 'https://someapi.com',
            headers = api_headers,
            params = params
        ).json()
        """
        sleep(1)
        r = dummy_results[cell]
        return r['total']

    def get_hex_params(self, cell):
        """
        Return results from the derived params of the h3 cell
        """
        lat, long = h3.h3_to_geo(cell)
        radius = self.get_h3_radius(cell, buffer=True)
        params = {
            'latitude': lat,
            'longitude': long,
            'radius': radius,
        }
        total = self.get_items(cell)
        print(total)
        return total, params

    def hex_search(self):
        """
        Check if the popped h3 cell produces a total value over 1000.
        If over 1000, get the h3 cell's children and append them to the search_queue.
        If greater than 0, append params to params_list.
        """
        cell = self.search_queue.pop(0)
        total, params = self.get_hex_params(cell)
        if total > 1000:
            self.search_queue.extend(list(h3.h3_to_children(cell)))
        elif total > 0:
            self.params_list.append(params)

    def get_params_list(self):
        """
        Keep looping through the search queue until no items remain.
        Use the thread pool to speed things up.
        """
        with ThreadPoolExecutor() as e:
            while self.search_queue:
                e.submit(self.hex_search)

    def run(self):
        self.get_params_list()

h = SearchH3Test(
    '85489e37fffffff,85489e27fffffff',
)
h.run()

len(h.search_queue)  # returns 7: the children that weren't processed as expected
len(h.params_list)   # returns 1: the cell under 1000
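(A note on how this is usually restructured: the while loop only sees the queue as it is at that instant, so it exits, and the executor shuts down, as soon as the two origin cells have been popped, before any worker has had a chance to append children. The sketch below is a minimal illustration, not the class above verbatim; it assumes hex_search is changed to accept a cell and return the child cells still to be searched, so the coordinating loop can track the futures and resubmit children as results arrive.)

from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

# Sketch of the two reworked methods (intended as drop-in replacements on the
# class above). hex_search now takes the cell as an argument and *returns* any
# children to search next instead of mutating self.search_queue from a worker.
def hex_search(self, cell):
    total, params = self.get_hex_params(cell)
    if total > 1000:
        return list(h3.h3_to_children(cell))
    if total > 0:
        self.params_list.append(params)
    return []

def get_params_list(self):
    with ThreadPoolExecutor() as e:
        # Seed the pool with the origin cells, then keep resubmitting children
        # until there are no pending futures left.
        pending = {e.submit(self.hex_search, cell) for cell in self.search_queue}
        self.search_queue = []
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                for child in future.result():
                    pending.add(e.submit(self.hex_search, child))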

Related

Apply function not looping through list in dataframe correctly

I have a dataframe that looks something like the below
Client      Nodes
Client A    [32321, 32312, 2133432, 43242, ...]
Client B    [575945, 545345, 54353, 5345, ...]
I'm trying to use the apply function to loop through each item in each client's list and run the function on each number, so first use 32321 and then 32312 for client A, then take the results of both of those, put them in a list, and return that in the next column.
Right now my function below is taking the first item from each row's list and applying that, so each row gets the same result every time.
def FindNodeLL(route_nodes):
    for node in route_nodes:
        try:
            response_xml = requests.get(f'https://api.openstreetmap.org/api/0.6/node/{node}')
            response_xml_as_string = response_xml.content
            responseXml = ET.fromstring(response_xml_as_string)
            for child in responseXml.iter('node'):
                RouteNodeLL.append((float(child.attrib['lat']), float(child.attrib['lon'])))
            return RouteNodeLL
        except:
            pass

df[f'Route Nodes LL'] = df.apply(lambda row: FindNodeLL(row['Route Nodes']), axis = 1)
You just need to return after your for loop and instantiate your list within the function:
import pandas as pd
import requests
import xml.etree.ElementTree as ET

data = {
    "client": ["client a", "client b"],
    "nodes": [[1, 2, 10], [11, 12, 13]],
}
df = pd.DataFrame(data)

def FindNodeLL(route_nodes):
    RouteNodeLL = []
    for node in route_nodes:
        try:
            response_xml = requests.get(
                f"https://api.openstreetmap.org/api/0.6/node/{node}"
            )
            response_xml_as_string = response_xml.content
            responseXml = ET.fromstring(response_xml_as_string)
            for child in responseXml.iter("node"):
                RouteNodeLL.append(
                    (float(child.attrib["lat"]), float(child.attrib["lon"]))
                )
        except:
            pass
    return RouteNodeLL

df[f"Route Nodes LL"] = df["nodes"].apply(FindNodeLL)

Is creating multiple instances of a class dynamically good practice? Alternatives?

Background
I'm working with API calls that return network element data (telecommunications industry, FYI).
I'm dynamically creating 'x' number of class instances based on a 'meta' response from the API, which tells me how many such elements exist in the network. This is not something I can know beforehand (it may be 1 element, it may be 10, 100, etc.).
Each element has several possible attributes (name, lat/long, ip_address, MAC, type, etc.), hence why I thought a class structure would be best. This is my first time using classes.
Question
Is this best practice? What are the alternatives to this approach?
Code
# Update the Node class's 'num_nodes' class level variable
Node.get_number_of_nodes()
# Create 'x' amount of Node instances and store in a list, based on 'num_nodes'
nodes = [Node() for x in range(Node.num_nodes)]
# Grab unique node ids and assign to the instances of Node
[node.get_node_ids() for node in nodes]

class Node:
    """
    A class to represent a network node.
    """
    # Shared class variables
    num_nodes = 0
    # Shared class instances
    session = Session()

    @classmethod
    def get_number_of_nodes(cls):
        """
        Api call to ### which fetches number of nodes metadata
        """
        path = "####"
        url = f"{cls.session.HOST}{path}"
        headers = {'Authorization': #####,}
        params = {'limit': 10000,}
        resp = request(
            "GET", url, headers=headers, params=params, verify=False
        )
        resp = resp.text.encode('utf8')
        resp = json.loads(resp)
        cls.num_nodes = resp['meta']['total']

    def get_node_ids(self):
        """
        For an instance of Node, get its attributes from ###.
        """
        pass
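One alternative that often comes up for this kind of "build N instances from an API response" problem is to keep Node as a plain data holder and move the fetching and construction into a single classmethod (or factory function) that returns the whole list, so each instance is built directly from one element of the response rather than counted first and populated later. The sketch below is only an illustration with a hypothetical /nodes endpoint and response shape, not the poster's actual API:

from dataclasses import dataclass
from typing import List
import requests

@dataclass
class Node:
    # Only a few attributes shown; the real element has more (MAC, type, ...)
    node_id: str = ""
    name: str = ""
    ip_address: str = ""

    @classmethod
    def fetch_all(cls, session: requests.Session, base_url: str) -> List["Node"]:
        """Build one Node per element returned by the (hypothetical) API."""
        resp = session.get(f"{base_url}/nodes", params={"limit": 10000})
        payload = resp.json()
        return [
            cls(
                node_id=str(item.get("id", "")),
                name=item.get("name", ""),
                ip_address=item.get("ip_address", ""),
            )
            for item in payload.get("data", [])
        ]

# Usage (hypothetical base URL):
# nodes = Node.fetch_all(requests.Session(), "https://example.invalid/api")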

Simplest way complex dask graph creation

There is a complex system of calculations over some objects.
The difficulty is that some calculations are group calculations.
This can be demonstrated by the following example:
from dask.distributed import Client

def load_data_from_db(id):
    # load some data
    ...
    return data

def task_a(data):
    # some calculations
    ...
    return result

def group_task(*args):
    # some calculations
    ...
    return result

def task_b(data, group_data):
    # some calculations
    ...
    return result

def task_c(data, task_a_result):
    # some calculations
    ...
    return result

ids = [1, 2]
dsk = {'id_{}'.format(i): id for i, id in enumerate(ids)}
dsk['data_0'] = (load_data_from_db, 'id_0')
dsk['data_1'] = (load_data_from_db, 'id_1')
dsk['task_a_result_0'] = (task_a, 'data_0')
dsk['task_a_result_1'] = (task_a, 'data_1')
dsk['group_result'] = (
    group_task,
    'data_0', 'task_a_result_0',
    'data_1', 'task_a_result_1')
dsk['task_b_result_0'] = (task_b, 'data_0', 'group_result')
dsk['task_b_result_1'] = (task_b, 'data_1', 'group_result')
dsk['task_c_result_0'] = (task_c, 'data_0', 'task_a_result_0')
dsk['task_c_result_1'] = (task_c, 'data_1', 'task_a_result_1')

client = Client(scheduler_address)
result = client.get(
    dsk,
    ['task_a_result_0',
     'task_b_result_0',
     'task_c_result_0',
     'task_a_result_1',
     'task_b_result_1',
     'task_c_result_1'])
The list of objects runs to thousands of elements, and the number of tasks is in the dozens (including several group tasks).
With this method of graph creation it is difficult to modify the graph (add new tasks, change dependencies, etc.).
Is there a more efficient way of doing this kind of distributed computation with dask?
Added
With futures, the graph becomes:
from itertools import chain

client = Client(scheduler_address)
ids = [1, 2]
data = client.map(load_data_from_db, ids)
result_a = client.map(task_a, data)
group_args = list(chain(*zip(data, result_a)))
result_group = client.submit(group_task, *group_args)
result_b = client.map(task_b, data, [result_group] * len(ids))
result_c = client.map(task_c, data, result_a)
result = client.gather(result_a + result_b + result_c)
And inside the task functions the input arguments are Future instances, so arg.result() is called before use.
If you want to modify the computation during computation then I recommend the futures interface.

How do I join results of looping script into a single variable?

I have a looping script returning different filtered results. I can make this data return as an array for each of the different filter classes, but I am unsure of the best method to join all of these arrays together.
import mechanize
import urllib
import json
import re
import random
import datetime
from sched import scheduler
from time import time, sleep
from sets import Set

##### Code to loop the script and set up scheduling time
s = scheduler(time, sleep)
random.seed()

##### Code to stop duplicates part 1
userset = set()

def run_periodically(start, end, interval, func):
    event_time = start
    while event_time < end:
        s.enterabs(event_time, 0, func, ())
        event_time += interval + random.randrange(-5, 10)
    s.run()

##### Code to get the data required from the URL desired
def getData():
    post_url = "URL OF INTEREST"
    browser = mechanize.Browser()
    browser.set_handle_robots(False)
    browser.addheaders = [('User-agent', 'Firefox')]
    ##### These are the parameters you've got from checking with the aforementioned tools
    parameters = {'page' : '1',
                  'rp' : '250',
                  'sortname' : 'race_time',
                  'sortorder' : 'asc'
                  }
    ##### Encode the parameters
    data = urllib.urlencode(parameters)
    trans_array = browser.open(post_url, data).read().decode('UTF-8')
    xmlload1 = json.loads(trans_array)
    pattern2 = re.compile('/control/profile/view/(.*)\' title=')
    pattern4 = re.compile('title=\'posted: (.*) strikes:')
    pattern5 = re.compile('strikes: (.*)\'><img src=')
    for row in xmlload1['rows']:
        cell = row["cell"]
        ##### defining the Keys (key is the area from which data is pulled in the XML) for use in the pattern finding/regex
        user_delimiter = cell['username']
        selection_delimiter = cell['race_horse']
        user_numberofselections = float(re.findall(pattern4, user_delimiter)[0])
        user_numberofstrikes = float(re.findall(pattern5, user_delimiter)[0])
        strikeratecalc1 = user_numberofstrikes/user_numberofselections
        strikeratecalc2 = strikeratecalc1*100
        userid_delimiter_results = (re.findall(pattern2, user_delimiter)[0])
        ##### Code to stop duplicates throughout the day part 2 (skips if the id is already in the userset)
        if userid_delimiter_results in userset: continue
        userset.add(userid_delimiter_results)
        arraym = ""
        arrayna = ""
        if strikeratecalc2 > 50 and strikeratecalc2 < 100:
            arraym0 = "System M"
            arraym1 = "user id = ", userid_delimiter_results
            arraym2 = "percentage = ", strikeratecalc2, "%"
            arraym3 = ""
            arraym = [arraym0, arraym1, arraym2, arraym3]
        if strikeratecalc2 > 0 and strikeratecalc2 < 50:
            arrayna0 = "System NA"
            arrayna1 = "user id = ", userid_delimiter_results
            arrayna2 = "percentage = ", strikeratecalc2, "%"
            arrayna3 = ""
            arrayna = [arrayna0, arrayna1, arrayna2, arrayna3]

getData()
run_periodically(time()+5, time()+1000000, 10, getData)
What I want to be able to do is return both 'arraym' and 'arrayna' as one final array. However, due to the looping nature of the script, on each loop the old 'arraym'/'arrayna' are overwritten; currently my attempts to yield one array containing all of the data have resulted in only the last user id for 'System M' and the last user id for 'System NA'. This is obviously because each run of the loop overwrites the old 'arraym' and 'arrayna', but I do not know of a way around this so that all of my data can be accumulated in one array. Please note, I have been coding for a cumulative two weeks now, so there may well be some simple function to overcome this problem.
Kind regards AEA
Without looking at that huge code segment, typically you can do something like:
my_array = []  # Create an empty list
for <some loop>:
    my_array.append(some_value)
# At this point, my_array is a list containing some_value for each loop iteration
print(my_array)
Look into python's list.append()
So your code might look something like:
#...
arraym = []
arrayna = []
for row in xmlload1['rows']:
    #...
    if strikeratecalc2 > 50 and strikeratecalc2 < 100:
        arraym.append("System M")
        arraym.append("user id = %s" % userid_delimiter_results)
        arraym.append("percentage = %s%%" % strikeratecalc2)
        arraym.append("")
    if strikeratecalc2 > 0 and strikeratecalc2 < 50:
        arrayna.append("System NA")
        arrayna.append("user id = %s" % userid_delimiter_results)
        arrayna.append("percentage = %s%%" % strikeratecalc2)
        arrayna.append("")
#...
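If the results should also accumulate across the repeated getData() calls made by run_periodically (not just within one call), one option is to return both per-call lists and extend module-level accumulators from a small wrapper. This is only a sketch; getDataAndStore is a hypothetical wrapper name, not part of the answer above:

# Sketch: accumulate results across repeated getData() calls.
all_m = []
all_na = []

def getData():
    arraym = []
    arrayna = []
    # ... build arraym and arrayna exactly as in the answer above ...
    return arraym, arrayna

def getDataAndStore():
    m, na = getData()
    all_m.extend(m)
    all_na.extend(na)

# Then schedule the wrapper instead:
# run_periodically(time() + 5, time() + 1000000, 10, getDataAndStore)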

Use of SET to ignore pre logged users in a looping script

I am trying to use a set in order to stop users being re-printed in the following code. I managed to get Python to accept the code without producing any errors, but if I let the code run on a 10-second loop it continues to print the users who should have already been logged. This is my first attempt at using a set, and I am a complete novice at Python (building it all so far based on examples I have seen and reverse engineering them).
Below is an example of the code I am using
import mechanize
import urllib
import json
import re
import random
import datetime
from sched import scheduler
from time import time, sleep

###### Code to loop the script and set up scheduling time
s = scheduler(time, sleep)
random.seed()

def run_periodically(start, end, interval, func):
    event_time = start
    while event_time < end:
        s.enterabs(event_time, 0, func, ())
        event_time += interval + random.randrange(-5, 45)
    s.run()

###### Code to get the data required from the URL desired
def getData():
    post_url = "URL OF INTEREST"
    browser = mechanize.Browser()
    browser.set_handle_robots(False)
    browser.addheaders = [('User-agent', 'Firefox')]
    ###### These are the parameters you've got from checking with the aforementioned tools
    parameters = {'page' : '1',
                  'rp' : '250',
                  'sortname' : 'roi',
                  'sortorder' : 'desc'
                  }
    ##### Encode the parameters
    data = urllib.urlencode(parameters)
    trans_array = browser.open(post_url, data).read().decode('UTF-8')
    xmlload1 = json.loads(trans_array)
    pattern1 = re.compile('> (.*)<')
    pattern2 = re.compile('/control/profile/view/(.*)\' title=')
    pattern3 = re.compile('<span style=\'font-size:12px;\'>(.*)<\/span>')
    ##### Making the code identify each row, removing the need to numerically quantify the number of rows in the xmlfile,
    ##### thus making the number of rows dynamic (changes as the list grows, required for the looping function to work uninterrupted)
    for row in xmlload1['rows']:
        cell = row["cell"]
        ##### defining the Keys (key is the area from which data is pulled in the XML) for use in the pattern finding/regex
        user_delimiter = cell['username']
        selection_delimiter = cell['race_horse']
        if strikeratecalc2 < 12: continue
        ##### REMAINDER OF THE REGEX DELIMITATIONS
        username_delimiter_results = re.findall(pattern1, user_delimiter)[0]
        userid_delimiter_results = (re.findall(pattern2, user_delimiter)[0])
        user_selection = re.findall(pattern3, selection_delimiter)[0]
        ##### Code to stop duplicate posts of each user throughout the day
        userset = set([])
        if userid_delimiter_results in userset: continue
        ##### Printing the results of the code at hand
        print "user id = ", userid_delimiter_results
        print "username = ", username_delimiter_results
        print "user selection = ", user_selection
        print ""
        ##### Code to stop duplicate posts of each user throughout the day part 2 (updating set to add users already printed to the ignore list)
        userset.update(userid_delimiter_results)

getData()
run_periodically(time()+5, time()+1000000, 300, getData)
Any comments will be greatly appreciated; this may seem like common sense to you seasoned coders, but I really am just getting past "Hello world".
Kind regards AEA
This:
userset.update(userid_delimiter_results)
Should probably be this:
userset.add(userid_delimiter_results)
To prove it, try printing the contents of userset after each call.
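For a quick illustration of the difference (not taken from the question's code): update() treats a string as an iterable and adds each character, while add() stores the whole string as a single element.

s = set()
s.update("12345")   # update() iterates the string: s becomes {'1', '2', '3', '4', '5'}
print(s)

t = set()
t.add("12345")      # add() stores the whole string: t becomes {'12345'}
print(t)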
