I am given a collection of documents that I need to tokenize and then turn into vectors for further work. Elasticsearch's tokenizer works much better than my own solution, so I am switching to it, even though it is considerably slower. The end result is expected to be fed into the vectorizer as a stream.
The whole process can be expressed as a chain of generators:
def fetch_documents(_cursor):
    with _cursor:
        # a lot of documents expected, may not fit in memory
        _cursor.execute('select ... from ...')
        for doc in _cursor:
            yield doc

def tokenize(documents):
    for doc in documents:
        yield elasticsearch_tokenize_me(doc)

def build_model(documents):
    some_model = SomeModel()
    for doc in documents:
        some_model.add_document(doc)
    return some_model

build_model(tokenize(fetch_documents(cursor)))  # cursor obtained elsewhere
This basically works, but it doesn't utilize all the available processing capacity. Since dask is used in other related projects, I tried to adapt the code and ended up with the following (I am using psycopg2 for database access).
from itertools import chain

from dask import delayed
import psycopg2
import psycopg2.extras
from elasticsearch import Elasticsearch
from elasticsearch.client import IndicesClient

def loader():
    conn = psycopg2.connect()
    cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
    cur.execute('''
        SELECT document, ... FROM ...
        ''')
    return cur

@delayed
def tokenize(partition):
    result = []
    client = IndicesClient(Elasticsearch())
    for row in partition:
        _result = client.analyze(analyzer='standard', text=row['document'])
        result.append(dict(row,
                           tokens=tuple(item['token'] for item in _result['tokens'])))
    return result

@delayed
def build_model(sequence_of_data):
    some_model = SomeModel()
    for item in chain.from_iterable(sequence_of_data):
        some_model.add_document(item)
    return some_model

with loader() as cur:
    partitions = []
    for idx_start in range(0, cur.rowcount, 200):
        partitions.append(delayed(cur.fetchmany)(200))
    tokenized = []
    for partition in partitions:
        tokenized.append(tokenize(partition))
    result = build_model(tokenized)
    result.compute()
The code more or less works, except that all documents end up tokenized before being fed into the model. That is fine for a smaller collection of data, but not for a huge one, because of the memory consumption. Should I just use plain concurrent.futures for this work, or am I using dask wrongly?
A simple solution would be to load data locally on your machine (it's hard to partition a single SQL query) and then send the data to the dask cluster for the expensive tokenization step. Perhaps something like the following:
rows = cur.execute(''' SELECT document, ... FROM ... ''')

from toolz import partition_all, concat
partitions = partition_all(10000, rows)

from dask.distributed import Executor
e = Executor('scheduler-address:8786')

futures = []
for part in partitions:
    x = e.submit(tokenize, part)
    y = e.submit(process, x)
    futures.append(y)

results = e.gather(futures)
result = list(concat(results))
In this example the functions tokenize and process expect to consume and return a list of elements.
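For illustration only (process is not defined anywhere in this thread), any function that takes a list and returns a list fits the pattern; a hypothetical version that simply extracts the token tuples produced by tokenize could look like this:

def process(tokenized_partition):
    # Hypothetical placeholder: reduce each tokenized row to its token tuple.
    # Anything that consumes a list and returns a list works here.
    return [row['tokens'] for row in tokenized_partition]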
Using just concurrent.futures for the work
from concurrent.futures import ProcessPoolExecutor

import psycopg2
import psycopg2.extras
from elasticsearch import Elasticsearch
from elasticsearch.client import IndicesClient

def loader():
    conn = psycopg2.connect()
    cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
    cur.execute('''
        SELECT document, ... FROM ...
        ''')
    return cur

def tokenize(partition):
    result = []
    client = IndicesClient(Elasticsearch())
    for row in partition:
        _result = client.analyze(analyzer='standard', text=row['document'])
        result.append(dict(row,
                           tokens=tuple(item['token'] for item in _result['tokens'])))
    return result

def build_model(partitions):
    some_model = SomeModel()
    for partition in partitions:
        result = partition.result()
        for item in result:
            some_model.add_document(item)
    return some_model

with loader() as cur, \
        ProcessPoolExecutor(max_workers=8) as executor:
    print(cur.rowcount)
    partitions = []
    for idx_start in range(0, cur.rowcount, 200):
        partitions.append(executor.submit(tokenize,
                                          cur.fetchmany(200)))
    build_model(partitions)
I am running a dask.distributed Client that gets data from an API with several parameters, parses results and joins/aggregates on each result. This is done with client.map()
Sometimes the API call gives an empty string because the specific combination of input parameters doesn't exist. It doesn't make sense to continue with computations and I would like to just kill that worker (without passing on e.g. a None).
How do I tell Dask to kill a worker if its result is None/error and exclude that future from the following operations?
Please let me know if you need more details.
Thanks.
EDIT:
Added a minimal working example to show the logic: the first map produces a lot of "useless" workers that I would like to kill.
Please note that this is not my actual use case; I am querying an InfluxDB instance via HTTP requests, but the general structure of the code is the same. I am open to any comments on how to do this faster/more efficiently.
```python
import requests
import numpy as np
import pandas as pd
from dask.distributed import Client, LocalCluster, as_completed
import dask.dataframe as dd


def fetch_html(pair):
    req_string = 'https://www.bitstamp.net/api/v2/order_book/{currency_pair}/'
    response = requests.get(req_string.format(currency_pair=pair))
    try:
        result = response.json()
        return result
    except Exception as e:
        print('Error: {}\nMessage: {}'.format(e, response.reason))
        return None


def parse_result(result):
    if result:
        data = {}
        data['prices'] = [e[0] for e in result['bids']]
        data['vols'] = [e[1] for e in result['bids']]
        data['index'] = [result['timestamp'] for i in data['prices']]
        df = pd.DataFrame.from_dict(data).set_index('index')
        return df
    else:
        return pd.DataFrame()


def other_calcs(result):
    if not result.empty:
        # something
        return result
    else:
        return pd.DataFrame()


def aggregator(res1, res2):
    if (not res1.empty) and (not res2.empty):
        # something
        return res1
    elif not res2.empty:
        # something
        return res2
    elif not res1.empty:
        return res1
    else:
        return pd.DataFrame()


if __name__ == '__main__':

    pairs = [
        # legit params (100s of these):
        'btcusd',
        'btceur',
        'btcgbp',
        'bateur',
        'batbtc',
        'umausd',
        'xrpusdt',
        'eurteur',
        'eurtusd',
        'manausd',
        'sandeur',
        'storjusd',
        'storjeur',
        'adausd',
        'adaeur',
        # bad params resulting in error / empty result (100s of these)
        'foobar',
        'foobaz',
        'foousd',
        'barbaz',
        'bazbar',
    ]

    cluster = LocalCluster(n_workers=16, threads_per_worker=1)
    client = Client(cluster)

    futures_list = client.map(fetch_html, pairs)
    futures_list = client.map(parse_result, futures_list)
    futures_list = client.map(other_calcs, futures_list)

    seq = as_completed(futures_list)
    while seq.count() > 1:
        f1 = next(seq)
        f2 = next(seq)
        new = client.submit(aggregator, f1, f2, priority=1)
        seq.add(new)

    final = next(seq)
    final = final.result()
    print(final.head())
```
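Not part of the thread, but to make the "exclude that future" idea concrete: one hedged sketch is to drop futures whose DataFrame came back empty before the aggregation loop, rather than trying to kill the workers that produced them:

# Hypothetical filtering step, assuming the futures_list / aggregation loop above.
# Note: .result() pulls each completed DataFrame to the client.
kept = [f for f in as_completed(futures_list) if not f.result().empty]
seq = as_completed(kept)  # aggregation loop then only ever sees non-empty results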
I'm writing some code to query data in Elasticsearch. We have huge amounts of data, so I am using the scan feature and searching a specific index. We index Elasticsearch by day, so for example today = index_2019_04_15 and yesterday = index_2019_04_14. Is there a way I can query only the previous day's index?
Second, if I query _all and then limit the query to, say, 2019-04-14, will I see a big performance hit? If not, I can just do the previous-day query that way.
Here's my code:
import pandas as pd
from elasticsearch_dsl import Search
from elasticsearch_dsl import connections

class get_data:

    def __init__(self, host, query):
        self.host = host
        self.query = query

    def pull_es_data(self):
        connections.create_connection(alias='client', hosts=self.host, timeout=60)
        s = Search(using='client', index="data-2019-04-15") \
            .query("match", clientid=r"AB1234-12345")
        response = s.scan()
        return response

test = get_data("localhost", "test")
x = test.pull_es_data()
results_df = pd.DataFrame(([item.clientid, item.clientlocation] for item in x),
                          columns=['clientid', 'clientlocation'])
I was able to take care of this using Index in Elasticsearch-dsl
from elasticsearch_dsl import Index

def get_index_list(self):
    i = Index("*").get_alias("client")
    return i
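As a side note (my addition, not part of the original answer): since the indices are named by day, another option is simply to build yesterday's index name from the date and pass it to Search. A minimal sketch, assuming the data-YYYY-MM-DD naming used in the code above:

from datetime import datetime, timedelta

from elasticsearch_dsl import Search

# Build yesterday's index name and restrict the search to that single index.
yesterday = (datetime.utcnow() - timedelta(days=1)).strftime('%Y-%m-%d')
s = Search(using='client', index='data-{}'.format(yesterday)) \
    .query('match', clientid=r'AB1234-12345')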
There is a complex system of calculations over some objects.
The difficulty is that some calculations are group calculations.
This can be demonstrated by the following example:
from dask.distributed import Client

def load_data_from_db(id):
    # load some data
    ...
    return data

def task_a(data):
    # some calculations
    ...
    return result

def group_task(*args):
    # some calculations
    ...
    return result

def task_b(data, group_data):
    # some calculations
    ...
    return result

def task_c(data, task_a_result):
    # some calculations
    ...
    return result
ids = [1, 2]
dsk = {'id_{}'.format(i): id for i, id in enumerate(ids)}
dsk['data_0'] = (load_data_from_db, 'id_0')
dsk['data_1'] = (load_data_from_db, 'id_1')
dsk['task_a_result_0'] = (task_a, 'data_0')
dsk['task_a_result_1'] = (task_a, 'data_1')
dsk['group_result'] = (
    group_task,
    'data_0', 'task_a_result_0',
    'data_1', 'task_a_result_1')
dsk['task_b_result_0'] = (task_b, 'data_0', 'group_result')
dsk['task_b_result_1'] = (task_b, 'data_1', 'group_result')
dsk['task_c_result_0'] = (task_c, 'data_0', 'task_a_result_0')
dsk['task_c_result_1'] = (task_c, 'data_1', 'task_a_result_1')
client = Client(scheduler_address)
result = client.get(
    dsk,
    ['task_a_result_0',
     'task_b_result_0',
     'task_c_result_0',
     'task_a_result_1',
     'task_b_result_1',
     'task_c_result_1'])
The list of objects numbers in the thousands of elements, and the number of tasks is in the dozens (including several group tasks).
With this method of graph creation it is difficult to modify the graph (add new tasks, change dependencies, etc.).
Is there a more efficient way to do distributed computing with dask in this context?
Added
With futures, the graph becomes:
from itertools import chain

client = Client(scheduler_address)

ids = [1, 2]
data = client.map(load_data_from_db, ids)
result_a = client.map(task_a, data)
group_args = list(chain(*zip(data, result_a)))
result_group = client.submit(group_task, *group_args)
result_b = client.map(task_b, data, [result_group] * len(ids))
result_c = client.map(task_c, data, result_a)

result = client.gather(result_a + result_b + result_c)
Inside the task functions, the input arguments are Future instances, so each one needs arg.result() before use.
If you want to modify the computation during computation then I recommend the futures interface.
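A minimal sketch (the predicate and extra task below are hypothetical names, not from the original code) of what modifying the computation on the fly looks like with futures and as_completed:

from dask.distributed import Client, as_completed

client = Client(scheduler_address)
data = client.map(load_data_from_db, ids)
seq = as_completed(client.map(task_a, data))
for future in seq:
    if needs_followup(future.result()):            # needs_followup is hypothetical
        seq.add(client.submit(extra_task, future))  # extra_task is hypothetical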
In the code below, the worker function checks whether the data passed in is valid. If it is valid, it returns a dictionary that will be used in a bulk SQLAlchemy Core insert. If it is invalid, I want the None value not to be added to receiving_list, because if it is, the bulk insert will fail: a single None value cannot be mapped onto the table structure.
from datetime import datetime
import multiprocessing

from sqlalchemy import MetaData, Table
from sqlalchemy.exc import IntegrityError

CONN = Engine.connect()  # Engine is imported from another module
meta = MetaData()
NUM_CONSUMERS = multiprocessing.cpu_count()
p = multiprocessing.Pool(NUM_CONSUMERS)

def process_data(data):
    # Long process to validate data
    if is_valid_data(data) == True:
        returned_dict = {}
        returned_dict['created_at'] = datetime.now()
        returned_dict['col1'] = data[0]
        returned_dict['colN'] = data[N]
        return returned_dict
    else:
        return None

def spawn_some_processes(data):
    table_to_insert = Table('postgresql_database_table', meta, autoload=True,
                            autoload_with=Engine)
    while True:
        # Get some data here and pass it on to the worker
        receiving_list = p.map(process_data, data_to_process)
        try:
            if len(receiving_list) > 0:
                trans = CONN.begin()
                CONN.execute(table_to_insert.insert(), receiving_list)
                trans.commit()
        except IntegrityError:
            trans.rollback()
        except:
            trans.rollback()
To rephrase the question: how can I stop a spawned process from adding to receiving_list when it returns None?
A workaround is to incorporate a queue with queue.put() and queue.get() that holds only valid data. The disadvantage is that after the processes are done I then have to unpack the queue, which adds overhead. My ideal solution would be one where a clean list of dictionaries is returned, which SQLAlchemy can use to do the bulk insert.
You can just remove the None entries from the list:
received_list = filter(None, p.map(process_data, data_to_process))
This is pretty quick even for really huge lists:
>>> timeit.timeit('l = filter(None, l)', 'l = range(0,10000000)', number=1)
0.47683095932006836
Note that using filter will remove anything where bool(val) is False, like empty strings, empty lists, etc. This should be fine for your use-case, though.
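One extra caveat (my note, not part of the original answer): on Python 3, filter() returns a lazy iterator rather than a list, and filter(None, ...) drops every falsy value. An explicit list comprehension that only drops None avoids both issues:

# Keeps falsy-but-valid rows and yields a real list, which is what
# SQLAlchemy's execute() expects for the bulk insert.
received_list = [d for d in p.map(process_data, data_to_process) if d is not None]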
I'm trying to fetch results in a Python 2.7 App Engine app using cursors, but each time I use with_cursor() it fetches the same result set.
query = Model.all().filter("profile =", p_key).order('-created')
if r.get('cursor'):
    query = query.with_cursor(start_cursor=r.get('cursor'))
cursor = query.cursor()
objs = query.fetch(limit=10)
count = len(objs)
for obj in objs:
    ...
Each time through I'm getting the same 10 results. I'm thinking it has to do with using end_cursor, but how do I get that value if query.cursor() returns the start_cursor? I've looked through the docs, but this is poorly documented.
Your formatting is a bit screwy, by the way. Looking at your code (which is incomplete and therefore potentially leaving something out), I have to assume you have forgotten to store the cursor after fetching results (or return it to the user; I am assuming r is a request?).
So after you have fetched some data, you need to call cursor() on the query. For example, this function counts all entities using a cursor:
def count_entities(kind):
    c = None
    count = 0
    q = kind.all(keys_only=True)
    while True:
        if c:
            q.with_cursor(c)
        i = q.fetch(1000)
        count = count + len(i)
        if not i:
            break
        c = q.cursor()
    return count
See how, after fetch() has been called, c = q.cursor() captures the cursor, and it is used as the starting cursor the next time through the loop.
Here's what finally worked:
query = Model.all().filter("profile =", p_key).order('-created')
if request.get('cursor'):
    query = query.with_cursor(request.get('cursor'))
objs = query.fetch(limit=10)
cursor = query.cursor()
for obj in objs:
    ...