I am running a dask.distributed Client that fetches data from an API with several parameter combinations, parses the results, and joins/aggregates on each result. This is done with client.map().
Sometimes an API call returns an empty string because the specific combination of input parameters doesn't exist. In that case it doesn't make sense to continue with the computations, and I would like to just kill that worker (without passing on e.g. a None).
How do I tell Dask to kill a worker if its result is None/error and exclude that future from the following operations?
Please let me know if you need more details.
Thanks.
EDIT:
Added a minimal working example to show the logic: the first map produces a lot of "useless" workers that I would like to kill.
Please note that this is not my actual use case (I am querying an Influx database via HTTP requests), but the general structure of the code is the same. I am open to any comments on how to do this faster/more efficiently.
```python
import requests
import numpy as np
import pandas as pd
from dask.distributed import Client, LocalCluster, as_completed
import dask.dataframe as dd


def fetch_html(pair):
    req_string = 'https://www.bitstamp.net/api/v2/order_book/{currency_pair}/'
    response = requests.get(req_string.format(currency_pair=pair))
    try:
        result = response.json()
        return result
    except Exception as e:
        print('Error: {}\nMessage: {}'.format(e, response.reason))
        return None


def parse_result(result):
    if result:
        data = {}
        data['prices'] = [e[0] for e in result['bids']]
        data['vols'] = [e[1] for e in result['bids']]
        data['index'] = [result['timestamp'] for i in data['prices']]
        df = pd.DataFrame.from_dict(data).set_index('index')
        return df
    else:
        return pd.DataFrame()


def other_calcs(result):
    if not result.empty:
        # something
        return result
    else:
        return pd.DataFrame()


def aggregator(res1, res2):
    if (not res1.empty) and (not res2.empty):
        # something
        return res1
    elif not res2.empty:
        # something
        return res2
    elif not res1.empty:
        return res1
    else:
        return pd.DataFrame()


if __name__ == '__main__':

    pairs = [
        # legit params (100s of these):
        'btcusd',
        'btceur',
        'btcgbp',
        'bateur',
        'batbtc',
        'umausd',
        'xrpusdt',
        'eurteur',
        'eurtusd',
        'manausd',
        'sandeur',
        'storjusd',
        'storjeur',
        'adausd',
        'adaeur',
        # bad params resulting in error / empty result (100s of these)
        'foobar',
        'foobaz',
        'foousd',
        'barbaz',
        'bazbar',
    ]

    cluster = LocalCluster(n_workers=16, threads_per_worker=1)
    client = Client(cluster)

    futures_list = client.map(fetch_html, pairs)
    futures_list = client.map(parse_result, futures_list)
    futures_list = client.map(other_calcs, futures_list)

    seq = as_completed(futures_list)
    while seq.count() > 1:
        f1 = next(seq)
        f2 = next(seq)
        new = client.submit(aggregator, f1, f2, priority=1)
        seq.add(new)

    final = next(seq)
    final = final.result()
    print(final.head())
```
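The only workaround I can think of is to block on the first stage and filter out the empty results by hand before mapping the next step, roughly like this (just a sketch of the idea, which I would like to avoid because it forces an early gather and loses the pipelining):

```python
# Sketch: gather the first stage, drop None/empty responses,
# then continue mapping only on the surviving results.
raw = client.gather(client.map(fetch_html, pairs))
good = [r for r in raw if r]
futures_list = client.map(parse_result, good)
futures_list = client.map(other_calcs, futures_list)
```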
Related
I started two days ago with the Ethereum blockchain, so my knowledge is still a little bit all over the place. Nevertheless, I managed to connect to a node, pull some general block data and so on. As the next level of difficulty, I tried to start building event filters in order to look at more specific types of historical data (to be clear, I don't want to fetch live data; I would rather query through the entire chain and get historical sample extracts for various types of data).
See here my first attempt to build an event filter for the USDC Uniswap V2 contract, in order to collect Swap events (it's not about speed or efficiency right now, just about making it work):
w3 = Web3(Web3.HTTPProvider(NODE_ADDRESS))

# uniswap v2 USDC
address = w3.toChecksumAddress('0xb4e16d0168e52d35cacd2c6185b44281ec28c9dc')

# get the ABI for uniswap v2 pair events
resp = requests.get("https://unpkg.com/@uniswap/v2-core@1.0.0/build/IUniswapV2Pair.json")
if resp.status_code == 200:
    abi = json.loads(resp.content)['abi']

# create contract object
contract = w3.eth.contract(address=address, abi=abi)

# get topics by hashing abi event signatures
res = contract.events.Swap.build_filter()

# put this into a filter input dictionary
filter_params = {'fromBlock': int_to_hex(12000000), 'toBlock': int_to_hex(12010000), **res.filter_params}
# res.filter_params contains: 'topics' and 'address'

# create a filter id (i.e. a hashed version of the filter data, representing the filter)
method = 'eth_newFilter'
params = [filter_params]
resp = self.block_manager.general_sample_request(method, params)
if 'error' in resp:
    print(resp)
else:
    filter_id = resp['result']

# pass on the filter id, in order to query the respective logs
params = [filter_id]
method = 'eth_getFilterLogs'
resp = self.block_manager.general_sample_request(method, params)
# takes about 10-12s for about 12000 events
The resulting array contains event logs with this structure:
resp['result'][0]
>>>
{'address': '0xb4e16d0168e52d35cacd2c6185b44281ec28c9dc',
'topics': ['0xd78ad95fa46c994b6551d0da85fc275fe613ce37657fb8d5e3d130840159d822',
'0x0000000000000000000000007a250d5630b4cf539739df2c5dacb4c659f2488d',
'0x0000000000000000000000000ffd670749d4179558b6b367e30e72ce2efea28f'],
'data': '0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000034f0f8a0c7663264000000000000000000000000000000000000000000000\
000000000019002d5b60000000000000000000000000000000000000000000000000000000000000000',
'blockNumber': '0xb71b01',
'transactionHash': '0x76403053ee0300411b68fc223b327b51fb4f1a26e1f6cb8667e05ec370e8176e',
'transactionIndex': '0x22',
'blockHash': '0x4bd35cb48395e77fd317a0309342c95d6687dbc4fcb85ada2d635fe266d1e769',
'logIndex': '0x16',
'removed': False}
As far as I understand now, I can somehow apply the ABI to decode the 'data' field.
I tried with this function:
contract.decode_function_input(resp['result'][0]['data'])
but it gives me this error:
>>> ValueError: Could not find any function with matching selector
Seems like there is some problem with decoding the data. However, I am so close now to getting the real data that I don't want to give up. Any help will be appreciated!
Thanks!
import json
import traceback
from pprint import pprint
from eth_utils import event_abi_to_log_topic, to_hex
from hexbytes import HexBytes
from web3._utils.events import get_event_data
from web3.auto import w3


def decode_tuple(t, target_field):
    output = dict()
    for i in range(len(t)):
        if isinstance(t[i], (bytes, bytearray)):
            output[target_field[i]['name']] = to_hex(t[i])
        elif isinstance(t[i], (tuple)):
            output[target_field[i]['name']] = decode_tuple(t[i], target_field[i]['components'])
        else:
            output[target_field[i]['name']] = t[i]
    return output


def decode_list_tuple(l, target_field):
    output = l
    for i in range(len(l)):
        output[i] = decode_tuple(l[i], target_field)
    return output


def decode_list(l):
    output = l
    for i in range(len(l)):
        if isinstance(l[i], (bytes, bytearray)):
            output[i] = to_hex(l[i])
        else:
            output[i] = l[i]
    return output


def convert_to_hex(arg, target_schema):
    """
    utility function to convert byte codes into human readable and json serializable data structures
    """
    output = dict()
    for k in arg:
        if isinstance(arg[k], (bytes, bytearray)):
            output[k] = to_hex(arg[k])
        elif isinstance(arg[k], (list)) and len(arg[k]) > 0:
            target = [a for a in target_schema if 'name' in a and a['name'] == k][0]
            if target['type'] == 'tuple[]':
                target_field = target['components']
                output[k] = decode_list_tuple(arg[k], target_field)
            else:
                output[k] = decode_list(arg[k])
        elif isinstance(arg[k], (tuple)):
            target_field = [a['components'] for a in target_schema if 'name' in a and a['name'] == k][0]
            output[k] = decode_tuple(arg[k], target_field)
        else:
            output[k] = arg[k]
    return output


def _get_topic2abi(abi):
    if isinstance(abi, (str)):
        abi = json.loads(abi)
    event_abi = [a for a in abi if a['type'] == 'event']
    topic2abi = {event_abi_to_log_topic(_): _ for _ in event_abi}
    return topic2abi


def _sanitize_log(log):
    for i, topic in enumerate(log['topics']):
        if not isinstance(topic, HexBytes):
            log['topics'][i] = HexBytes(topic)
    if 'address' not in log:
        log['address'] = None
    if 'blockHash' not in log:
        log['blockHash'] = None
    if 'blockNumber' not in log:
        log['blockNumber'] = None
    if 'logIndex' not in log:
        log['logIndex'] = None
    if 'transactionHash' not in log:
        log['transactionHash'] = None
    if 'transactionIndex' not in log:
        log['transactionIndex'] = None


def decode_log(log, abi):
    if abi is not None:
        try:
            # get a dict with all available events from the ABI
            topic2abi = _get_topic2abi(abi)
            # ensure the log contains all necessary keys
            _sanitize_log(log)
            # get the ABI of the event in question (stored as the first topic)
            event_abi = topic2abi[log['topics'][0]]
            # get the event name
            evt_name = event_abi['name']
            # get the event data
            data = get_event_data(w3.codec, event_abi, log)['args']
            target_schema = event_abi['inputs']
            decoded_data = convert_to_hex(data, target_schema)
            return (evt_name, decoded_data, target_schema)
        except Exception:
            return ('decode error', traceback.format_exc(), None)
    else:
        return ('no matching abi', None, None)
Example usage:
output = decode_log(
    {'data': '0x000000000000000000000000000000000000000000000000000000009502f90000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000093f8f932b016b1c',
     'topics': [
         '0xd78ad95fa46c994b6551d0da85fc275fe613ce37657fb8d5e3d130840159d822',
         '0x0000000000000000000000007a250d5630b4cf539739df2c5dacb4c659f2488d',
         '0x000000000000000000000000242301fa62f0de9e3842a5fb4c0cdca67e3a2fab'],
     },
    pair_abi
)
print(output[0])
pprint(output[1])
# Swap
# {'amount0In': 2500000000,
# 'amount0Out': 0,
# 'amount1In': 0,
# 'amount1Out': 666409132118600476,
# 'sender': '0x7a250d5630B4cF539739dF2C5dAcb4c659F2488D',
# 'to': '0x242301FA62f0De9e3842A5Fb4c0CdCa67e3A2Fab'}
Or in your case:
output = decode_log(resp['result'][0], pair_abi)
print(output[0])
pprint(output[1])
# Swap
# {'amount0In': 0,
# 'amount0Out': 6711072182,
# 'amount1In': 3814822253806629476,
# 'amount1Out': 0,
# 'sender': '0x7a250d5630B4cF539739dF2C5dAcb4c659F2488D',
# 'to': '0x0Ffd670749D4179558b6B367E30e72ce2efea28F'}
Now, note that you need to provide the pair_abi variable yourself. Which ABI you need depends on the type of smart contract you're querying: on Uniswap V3, I've found that the UniswapV2Pair ABI worked for some events, while the UniswapV3Pool ABI worked for others, in particular for the Swap event, which I've found the most useful.
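For the Uniswap V2 pair in your question, one way to get pair_abi is the same unpkg request you already make (a sketch reusing the URL from your code):

```python
import json
import requests

# Reuse the ABI fetch from the question to obtain pair_abi.
resp = requests.get("https://unpkg.com/@uniswap/v2-core@1.0.0/build/IUniswapV2Pair.json")
pair_abi = json.loads(resp.content)['abi']
```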
After a few hours of digging I managed to find this solution, which is a slightly modified version of the one proposed in: https://towardsdatascience.com/decoding-ethereum-smart-contract-data-eed513a65f76 Big thumbs up to its author 👍 You can read more there on parsing the transaction input too.
I have code that produces multiple string objects, and I want to collect them into an array. The end result currently looks like this:
Queue1
Queue2
Queue3
but I need it like this:
[Queue1, Queue2, Queue3]
P.S. I am new to programming
import boto3
import numpy

rg = boto3.client('resource-groups')
cloudwatch = boto3.client('cloudwatch')

#def queuenames(rg):
response = rg.list_group_resources(
    Group='env_prod'
)
resources = response.get('Resources')

for idents in resources:
    identifier = idents.get('Identifier')
    resourcetype = identifier.get('ResourceType')
    if resourcetype == 'AWS::SQS::Queue':
        RArn = identifier.get('ResourceArn')
        step0 = RArn.split(':')
        step1 = step0[5]
        print(step1)
To convert a string to a list do this:
arr = 'Queue1 Queue2 Queue3'.split(' ')
# Result:
['Queue1', 'Queue2', 'Queue3']
You have a loop in which you print a string on each step. Try creating an array before the loop and adding each string to it inside the loop, like this (I'm not fluent in Python, so please excuse any syntax mistakes):
import boto3
import numpy

rg = boto3.client('resource-groups')
cloudwatch = boto3.client('cloudwatch')

#def queuenames(rg):
response = rg.list_group_resources(
    Group='env_prod'
)
resources = response.get('Resources')

myArray = []
for idents in resources:
    identifier = idents.get('Identifier')
    resourcetype = identifier.get('ResourceType')
    if resourcetype == 'AWS::SQS::Queue':
        RArn = identifier.get('ResourceArn')
        step0 = RArn.split(':')
        step1 = step0[5]
        print(step1)
        myArray.append(step1)
The code above will not change the way your output is displayed, but it builds the array you need. You can remove the print line and print the array after the loop instead.
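For example, the end of the snippet could then look like this (same logic, printing the collected array once after the loop):

```python
myArray = []
for idents in resources:
    identifier = idents.get('Identifier')
    if identifier.get('ResourceType') == 'AWS::SQS::Queue':
        # keep only the queue name (the last element of the ARN)
        myArray.append(identifier.get('ResourceArn').split(':')[5])

print(myArray)  # e.g. ['Queue1', 'Queue2', 'Queue3']
```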
There is a complex system of calculations over some objects.
The difficulty is that some calculations are group calculations.
This can be demonstrated by the following example:
from dask.distributed import Client


def load_data_from_db(id):
    # load some data
    ...
    return data


def task_a(data):
    # some calculations
    ...
    return result


def group_task(*args):
    # some calculations
    ...
    return result


def task_b(data, group_data):
    # some calculations
    ...
    return result


def task_c(data, task_a_result):
    # some calculations
    ...
    return result
ids = [1, 2]

dsk = {'id_{}'.format(i): id for i, id in enumerate(ids)}
dsk['data_0'] = (load_data_from_db, 'id_0')
dsk['data_1'] = (load_data_from_db, 'id_1')
dsk['task_a_result_0'] = (task_a, 'data_0')
dsk['task_a_result_1'] = (task_a, 'data_1')
dsk['group_result'] = (
    group_task,
    'data_0', 'task_a_result_0',
    'data_1', 'task_a_result_1')
dsk['task_b_result_0'] = (task_b, 'data_0', 'group_result')
dsk['task_b_result_1'] = (task_b, 'data_1', 'group_result')
dsk['task_c_result_0'] = (task_c, 'data_0', 'task_a_result_0')
dsk['task_c_result_1'] = (task_c, 'data_1', 'task_a_result_1')

client = Client(scheduler_address)
result = client.get(
    dsk,
    ['task_a_result_0',
     'task_b_result_0',
     'task_c_result_0',
     'task_a_result_1',
     'task_b_result_1',
     'task_c_result_1'])
The list of objects numbers in the thousands, and the number of tasks is in the dozens (including several group tasks).
With this method of graph creation it is difficult to modify the graph (add new tasks, change dependencies, etc.).
Is there a more convenient and efficient way to do this kind of distributed computation with dask?
Added
With futures, the graph is:
from itertools import chain

client = Client(scheduler_address)

ids = [1, 2]
data = client.map(load_data_from_db, ids)
result_a = client.map(task_a, data)
group_args = list(chain(*zip(data, result_a)))
result_group = client.submit(group_task, *group_args)
result_b = client.map(task_b, data, [result_group] * len(ids))
result_c = client.map(task_c, data, result_a)
result = client.gather(result_a + result_b + result_c)
Inside the task functions, the input arguments arrive as Future instances, so I call arg.result() on them before use.
If you want to modify the computation while it is running, then I recommend the futures interface.
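For illustration, a minimal sketch of what modifying the computation while it runs can look like with futures, using as_completed; the predicate and the follow-up task here are hypothetical placeholders:

```python
from dask.distributed import Client, as_completed

client = Client(scheduler_address)

data = client.map(load_data_from_db, ids)
result_a = client.map(task_a, data)

extra = []
for future in as_completed(result_a):
    # inspect each result as soon as it is ready ...
    if needs_follow_up(future.result()):          # hypothetical predicate
        # ... and extend the running computation on the fly
        extra.append(client.submit(follow_up_task, future))  # hypothetical task

results = client.gather(extra)
```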
Suppose I am given a collection of documents. I am required to tokenize them and then turn them into vectors for further work. As I find elasticsearch's tokenizer works much better than my own solution, I am switching to that. However, it is considerably slower. The end result is then expected to be fed into the vectorizer as a stream.
The whole process can be done with a chain of generators:
def fetch_documents(_cursor):
    with _cursor:
        # a lot of documents expected, may not fit in memory
        _cursor.execute('select ... from ...')
        for doc in _cursor:
            yield doc


def tokenize(documents):
    for doc in documents:
        yield elasticsearch_tokenize_me(doc)


def build_model(documents):
    some_model = SomeModel()
    for doc in documents:
        some_model.add_document(doc)
    return some_model


# cursor obtained from the database connection elsewhere
build_model(tokenize(fetch_documents(cursor)))
So this basically works fine, but it doesn't utilize all the available processing capability. As dask is used in other related projects, I tried to adapt it and ended up with this (I am using psycopg2 for database access):
from itertools import chain

from dask import delayed
import psycopg2
import psycopg2.extras
from elasticsearch import Elasticsearch
from elasticsearch.client import IndicesClient


def loader():
    conn = psycopg2.connect()
    cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
    cur.execute('''
        SELECT document, ... FROM ...
        ''')
    return cur


@delayed
def tokenize(partition):
    result = []
    client = IndicesClient(Elasticsearch())
    for row in partition:
        _result = client.analyze(analyzer='standard', text=row['document'])
        result.append(dict(row,
                           tokens=tuple(item['token'] for item in _result['tokens'])))
    return result


@delayed
def build_model(sequence_of_data):
    some_model = SomeModel()
    for item in chain.from_iterable(sequence_of_data):
        some_model.add_document(item)
    return some_model


with loader() as cur:
    partitions = []
    for idx_start in range(0, cur.rowcount, 200):
        partitions.append(delayed(cur.fetchmany)(200))

    tokenized = []
    for partition in partitions:
        tokenized.append(tokenize(partition))

    result = build_model(tokenized)
    result.compute()
The code more or less works, except that at the end all documents are tokenized before being fed into the model. While this is fine for a smaller collection of data, it is not for a huge collection (due to huge memory consumption). Should I just use plain concurrent.futures for this work, or am I using dask wrongly?
A simple solution would be to load data locally on your machine (it's hard to partition a single SQL query) and then send the data to the dask-cluster for the expensive tokenization step. Perhaps something as follows:
rows = cur.execute(''' SELECT document, ... FROM ... ''')

from toolz import partition_all, concat
partitions = partition_all(10000, rows)

from dask.distributed import Executor
e = Executor('scheduler-address:8786')

futures = []
for part in partitions:
    x = e.submit(tokenize, part)
    y = e.submit(process, x)
    futures.append(y)

results = e.gather(futures)
result = list(concat(results))
In this example the functions tokenize and process expect to consume and return a list of elements.
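The process function is not defined here; as a hypothetical placeholder, it could be any step that reduces a tokenized partition to something compact before it is gathered, for example:

```python
# Hypothetical `process` step: shrink each tokenized partition
# before gathering it back on the client.
def process(tokenized_rows):
    return [
        {'document': row['document'], 'n_tokens': len(row['tokens'])}
        for row in tokenized_rows
    ]
```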
Using just concurrent.futures for the work
from concurrent.futures import ProcessPoolExecutor


def loader():
    conn = psycopg2.connect()
    cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
    cur.execute('''
        SELECT document, ... FROM ...
        ''')
    return cur


def tokenize(partition):
    result = []
    client = IndicesClient(Elasticsearch())
    for row in partition:
        _result = client.analyze(analyzer='standard', text=row['document'])
        result.append(dict(row,
                           tokens=tuple(item['token'] for item in _result['tokens'])))
    return result


def do_something(partitions):
    # consume the submitted futures and feed their results into the model
    some_model = SomeModel()
    for partition in partitions:
        result = partition.result()
        for item in result:
            some_model.add_document(item)
    return some_model


with loader() as cur, \
        ProcessPoolExecutor(max_workers=8) as executor:
    print(cur.rowcount)
    partitions = []
    for idx_start in range(0, cur.rowcount, 200):
        partitions.append(executor.submit(tokenize,
                                          cur.fetchmany(200)))
    do_something(partitions)
I have the following program to scrape data from a website. I want to improve the code below by using a generator with yield instead of calling generate_url and callme multiple times sequentially. The purpose of this exercise is to properly understand yield and the contexts in which it can be used.
import requests
import shutil

start_date = '03-03-1997'
end_date = '10-04-2015'
yf_base_url = 'http://real-chart.finance.yahoo.com/table.csv?s=%5E'
index_list = ['BSESN', 'NSEI']


def generate_url(index, start_date, end_date):
    s_day = start_date.split('-')[0]
    s_month = start_date.split('-')[1]
    s_year = start_date.split('-')[2]
    e_day = end_date.split('-')[0]
    e_month = end_date.split('-')[1]
    e_year = end_date.split('-')[2]
    if (index == 'BSESN') or (index == 'NSEI'):
        url = yf_base_url + index + '&a={}&b={}&c={}&d={}&e={}&f={}'.format(s_day, s_month, s_year, e_day, e_month, e_year)
        return url


def callme(url, index):
    print('URL {}'.format(url))
    r = requests.get(url, verify=False, stream=True)
    if r.status_code != 200:
        print("Failure!!")
        exit()
    else:
        r.raw.decode_content = True
        with open(index + "file.csv", 'wb') as f:
            shutil.copyfileobj(r.raw, f)
        print("Success")


if __name__ == '__main__':
    url = generate_url(index_list[0], start_date, end_date)
    callme(url, index_list[0])

    url = generate_url(index_list[1], start_date, end_date)
    callme(url, index_list[1])
There are multiple options. You could use yield to iterate over URLs, or over request objects.
If your index_list were long, I would suggest yielding URLs.
Then you could use multiprocessing.Pool to map a function that makes the request and saves the output over these URLs. That would execute them in parallel, potentially making it a lot faster (assuming you have enough network bandwidth and Yahoo Finance doesn't throttle connections).
import multiprocessing

yf = ('http://real-chart.finance.yahoo.com/table.csv?s=%5E'
      '{}&a={}&b={}&c={}&d={}&e={}&f={}')
index_list = ['BSESN', 'NSEI']


def genurl(symbols, start_date, end_date):
    # assemble the URLs
    s_day, s_month, s_year = start_date.split('-')
    e_day, e_month, e_year = end_date.split('-')
    for s in symbols:
        url = yf.format(s, s_day, s_month, s_year, e_day, e_month, e_year)
        yield url


def download(url):
    # Do the request, save the file
    ...


p = multiprocessing.Pool()
rv = p.map(download, genurl(index_list, '03-03-1997', '10-04-2015'))
If I understand you correctly, what you want to know is how to change the code so that you can replace the last part by
if __name__ == '__main__':
    for url in generate_url(index_list, start_date, end_date):
        callme(url, index)
If this is correct, you need to change generate_url, but not callme. Changing generate_url is rather mechanical. Make the first parameter index_list instead of index, wrap the function body in a for index in index_list loop, and change return url to yield url.
You don't need to change callme, because you never want to write something like for call in callme(...). You won't do anything with it other than a normal function call.
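For illustration, the mechanical change might look roughly like this (a sketch; note that callme still needs the index, so the loop zips the yielded URLs back with index_list):

```python
def generate_url(index_list, start_date, end_date):
    s_day, s_month, s_year = start_date.split('-')
    e_day, e_month, e_year = end_date.split('-')
    for index in index_list:
        if index in ('BSESN', 'NSEI'):
            yield yf_base_url + index + '&a={}&b={}&c={}&d={}&e={}&f={}'.format(
                s_day, s_month, s_year, e_day, e_month, e_year)


if __name__ == '__main__':
    for index, url in zip(index_list, generate_url(index_list, start_date, end_date)):
        callme(url, index)
```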