How would I do a Spark explode in Dask? - python

I'm new to dask so bear with me.
I have a JSON file where each row has the following schema:
{
'id': 2,
'version': 7.3,
'participants': range(10)
}
participants is a nested field.
import json
import dask.bag as db
input_file = 'data.json'
df = db.read_text(input_file).map(json.loads)
I can do either:
df.pluck(['id', 'version'])
or
df.pluck('participants').flatten()
But how can I do the equivalent of a Spark explode, where I select id and version and flatten participants at the same time?
So the output would be:
{'id': 2, 'version': 7.3, 'participants': 0}
{'id': 2, 'version': 7.3, 'participants': 1}
{'id': 2, 'version': 7.3, 'participants': 2}
{'id': 2, 'version': 7.3, 'participants': 3}
...

It's possible to write a custom function that reads and transforms the rows of the file, and then load the result with dask.bag.from_sequence:
import json
import dask.bag as db

def mapper(row, denest_field):
    # parse one JSON line and emit one dict per element of the nested field
    js = json.loads(row)
    for v in js[denest_field]:
        yield {'id': js['id'], denest_field: v, 'version': js['version']}

def yield_unnested(fname, denest_field):
    with open(fname) as f:
        for row in f:
            yield from mapper(row, denest_field)
I've saved a file called 'data.json' with the following contents
{"id": 2, "version": 7.3, "participants": [0,1,2,3,4,5,6,7,9,9]}
Then reading with from_sequence
df = db.from_sequence(yield_unnested('data.json', 'participants'))
list(df) # outputs:
[{'id': 2, 'participants': 0, 'version': 7.3},
{'id': 2, 'participants': 1, 'version': 7.3},
{'id': 2, 'participants': 2, 'version': 7.3},
{'id': 2, 'participants': 3, 'version': 7.3},
{'id': 2, 'participants': 4, 'version': 7.3},
{'id': 2, 'participants': 5, 'version': 7.3},
{'id': 2, 'participants': 6, 'version': 7.3},
{'id': 2, 'participants': 7, 'version': 7.3},
{'id': 2, 'participants': 9, 'version': 7.3},
{'id': 2, 'participants': 9, 'version': 7.3}]
Note that I'm new to dask and this may not be the most efficient way to go about things.
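As a possible alternative to feeding the whole file through a single generator, the same explode can probably be expressed with the bag's own map and flatten, so that each partition is handled independently. The following is a minimal sketch rather than a benchmarked solution: explode_record and explode_bag are hypothetical helper names, and bag is assumed to be a bag of already-parsed dicts such as the one built with db.read_text(input_file).map(json.loads) in the question.

```python
def explode_record(record, field):
    """Return one copy of `record` per element of record[field]."""
    # Every other top-level key is carried over unchanged.
    return [{**record, field: value} for value in record[field]]


def explode_bag(bag, field):
    """Explode `field` across a bag of dicts (hypothetical helper).

    `bag` is assumed to be a dask.bag.Bag of parsed dicts: map() turns
    each record into a list of rows, flatten() concatenates those lists.
    """
    return bag.map(explode_record, field).flatten()
```

With the question's setup this would be called as explode_bag(df, 'participants'). Unlike mapper above, it keeps every top-level key of the row, not only id and version.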

Related

Python: Descending order and just 3 objects has a high value [duplicate]

I have a list of objects like the one below. It is not sorted by value. I want it in descending order, keeping only the objects that have one of the 3 highest values:
[{'id': 1, 'value': 3},
{'id': 2, 'value': 6},
{'id': 3, 'value': 8},
{'id': 4, 'value': 8},
{'id': 5, 'value': 10},
{'id': 6, 'value': 9},
{'id': 7, 'value': 8},
{'id': 8, 'value': 4},
{'id': 9, 'value': 5}]
The result I want is in descending order and contains only the objects with the 3 highest values, like this:
[{'id': 5, 'value': 10},
{'id': 6, 'value': 9},
{'id': 7, 'value': 8},
{'id': 3, 'value': 8},
{'id': 4, 'value': 8},]
Please help me, thanks
t = [{'id': 1, 'value': 3},
{'id': 2, 'value': 6},
{'id': 3, 'value': 8},
{'id': 4, 'value': 8},
{'id': 5, 'value': 10},
{'id': 6, 'value': 9},
{'id': 7, 'value': 8}]
newlist = sorted(t, key=lambda d: d['value'])
newlist.reverse()
print(newlist[:3])
# [{'id': 5, 'value': 10}, {'id': 6, 'value': 9}, {'id': 7, 'value': 8}]
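Note that newlist[:3] returns exactly three objects, while the expected result in the question keeps every object whose value is among the three highest values, ties included. A minimal sketch of that reading follows; top_n_by_value is a hypothetical helper name, and the order among tied objects here follows the input order, which may differ from the order shown in the question.

```python
def top_n_by_value(items, n=3, key="value"):
    """Keep every item whose `key` is among the n largest distinct values."""
    # The n largest distinct values, e.g. {10, 9, 8} for the sample data.
    top_values = set(sorted({item[key] for item in items}, reverse=True)[:n])
    kept = [item for item in items if item[key] in top_values]
    # sorted() is stable, so tied items stay in their original order.
    return sorted(kept, key=lambda d: d[key], reverse=True)


t = [{'id': 1, 'value': 3}, {'id': 2, 'value': 6}, {'id': 3, 'value': 8},
     {'id': 4, 'value': 8}, {'id': 5, 'value': 10}, {'id': 6, 'value': 9},
     {'id': 7, 'value': 8}, {'id': 8, 'value': 4}, {'id': 9, 'value': 5}]

result = top_n_by_value(t)
print(result)
```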

Separate list elements by their property in Python

I have list p1:
p1 = [
{'id': 1, 'area': 5},
{'id': 2, 'area': 6},
{'id': 3, 'area': 10},
{'id': 4, 'area': 6},
{'id': 5, 'area': 6},
{'id': 6, 'area': 6},
{'id': 7, 'area': 4},
{'id': 8, 'area': 4}
]
And I need to separate this list by area value, like this (p2):
p2 = {
4: [
{'id': 7, 'area': 4},
{'id': 8, 'area': 4}
],
5: [
{'id': 1, 'area': 5}
],
6: [
{'id': 2, 'area': 6},
{'id': 4, 'area': 6},
{'id': 5, 'area': 6},
{'id': 6, 'area': 6}
],
10: [
{'id': 3, 'area': 10}
]
}
My solution is:
areas = {x['area'] for x in p1}
p2 = {}
for area in areas:
    p2[area] = [x for x in p1 if x['area'] == area]
It seems to work, but is there any better and more "pythonic" solution?
Using groupby you get
>>> import itertools
>>> f = lambda t: t['area']
>>> {i: list(b) for i, b in itertools.groupby(sorted(p1, key=f), key=f)}
Gives
{4: [{'area': 4, 'id': 7},
{'area': 4, 'id': 8}],
5: [{'area': 5, 'id': 1}],
6: [{'area': 6, 'id': 2},
{'area': 6, 'id': 4},
{'area': 6, 'id': 5},
{'area': 6, 'id': 6}],
10: [{'area': 10, 'id': 3}]}
edit: If you don't like using lambdas you can also do, as suggested by bro-grammer
>>> import operator
>>> f = operator.itemgetter('area')
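For reference, the itemgetter variant can be put together into a complete script as in the sketch below. The name get_area is only illustrative; the important detail is that groupby merges adjacent items only, so the list has to be sorted by the same key first.

```python
import itertools
import operator

p1 = [
    {'id': 1, 'area': 5}, {'id': 2, 'area': 6}, {'id': 3, 'area': 10},
    {'id': 4, 'area': 6}, {'id': 5, 'area': 6}, {'id': 6, 'area': 6},
    {'id': 7, 'area': 4}, {'id': 8, 'area': 4},
]

get_area = operator.itemgetter('area')

# Sort by area so that equal areas are adjacent, then group them.
p2 = {
    area: list(group)
    for area, group in itertools.groupby(sorted(p1, key=get_area), key=get_area)
}
print(p2)
```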
You can simply use defaultdict:
from collections import defaultdict
result = defaultdict(list)
for i in p1:
    result[i['area']].append(i)
Yes, use one of the grouping idioms. Using a vanilla dict:
In [15]: p1 = [
...: {'id': 1, 'area': 5},
...: {'id': 2, 'area': 6},
...: {'id': 3, 'area': 10},
...: {'id': 4, 'area': 6},
...: {'id': 5, 'area': 6},
...: {'id': 6, 'area': 6},
...: {'id': 7, 'area': 4},
...: {'id': 8, 'area': 4}
...: ]
In [16]: p2 = {}
In [17]: for d in p1:
    ...:     p2.setdefault(d['area'], []).append(d)
    ...:
In [18]: p2
Out[18]:
{4: [{'area': 4, 'id': 7}, {'area': 4, 'id': 8}],
5: [{'area': 5, 'id': 1}],
6: [{'area': 6, 'id': 2},
{'area': 6, 'id': 4},
{'area': 6, 'id': 5},
{'area': 6, 'id': 6}],
10: [{'area': 10, 'id': 3}]}
Or more neatly, using a defaultdict:
In [23]: from collections import defaultdict
In [24]: p2 = defaultdict(list)
In [25]: for d in p1:
    ...:     p2[d['area']].append(d)
    ...:
In [26]: p2
Out[26]:
defaultdict(list,
{4: [{'area': 4, 'id': 7}, {'area': 4, 'id': 8}],
5: [{'area': 5, 'id': 1}],
6: [{'area': 6, 'id': 2},
{'area': 6, 'id': 4},
{'area': 6, 'id': 5},
{'area': 6, 'id': 6}],
10: [{'area': 10, 'id': 3}]})

Sort list of lists that each contain a dictionary

I have this list:
list_users= [[{'points': 9, 'values': 1, 'division': 1, 'user_id': 3}], [{'points': 3, 'values': 0, 'division': 1, 'user_id': 1}], [{'points': 2, 'values': 0, 'division': 1, 'user_id': 4}], [{'points': 9, 'values': 0, 'division': 1, 'user_id': 11}], [{'points': 3, 'values': 0, 'division': 1, 'user_id': 10}], [{'points': 100, 'values': 4, 'division': 1, 'user_id': 2}], [{'points': 77, 'values': 2, 'division': 1, 'user_id': 5}], [{'points': 88, 'values': 3, 'division': 1, 'user_id': 6}], [{'points': 66, 'values': 1, 'division': 1, 'user_id': 7}], [{'points': 2, 'values': 0, 'division': 1, 'user_id': 8}]]
I need to sort the list by points and values.
How can I sort it if dict is inside a list inside the main list?
I generated this list by running a query and then appending each result to list_users.
Access the dictionary containing points and values by indexing on the inner list:
list_users_sorted = sorted(list_users, key=lambda x: (x[0]['points'], x[0]['values']))
# x[0] selects the single dict inside each inner list
Sort using a key function for sorted that builds a tuple of points and values for each dict in each list.
def kf(x):
    return (x[0]["points"], x[0]["values"])
s = sorted(list_users, key=kf)
print(s)
Output:
[[{'division': 1, 'points': 2, 'user_id': 4, 'values': 0}],
[{'division': 1, 'points': 2, 'user_id': 8, 'values': 0}],
[{'division': 1, 'points': 3, 'user_id': 1, 'values': 0}],
[{'division': 1, 'points': 3, 'user_id': 10, 'values': 0}],
[{'division': 1, 'points': 9, 'user_id': 11, 'values': 0}],
[{'division': 1, 'points': 9, 'user_id': 3, 'values': 1}],
[{'division': 1, 'points': 66, 'user_id': 7, 'values': 1}],
[{'division': 1, 'points': 77, 'user_id': 5, 'values': 2}],
[{'division': 1, 'points': 88, 'user_id': 6, 'values': 3}],
[{'division': 1, 'points': 100, 'user_id': 2, 'values': 4}]]

Merging arrays of versioned dictionaries

Given the following two arrays of dictionaries, how can I merge them such that the resulting array of dictionaries contains only those dictionaries whose version is greatest?
data1 = [{'id': 1, 'name': u'Oneeee', 'version': 2},
{'id': 2, 'name': u'Two', 'version': 1},
{'id': 3, 'name': u'Three', 'version': 2},
{'id': 4, 'name': u'Four', 'version': 1},
{'id': 5, 'name': u'Five', 'version': 1}]
data2 = [{'id': 1, 'name': u'One', 'version': 1},
{'id': 2, 'name': u'Two', 'version': 1},
{'id': 3, 'name': u'Threeee', 'version': 3},
{'id': 6, 'name': u'Six', 'version': 2}]
The merged result should look like this:
data3 = [{'id': 1, 'name': u'Oneeee', 'version': 2},
{'id': 2, 'name': u'Two', 'version': 1},
{'id': 3, 'name': u'Threeee', 'version': 3},
{'id': 4, 'name': u'Four', 'version': 1},
{'id': 5, 'name': u'Five', 'version': 1},
{'id': 6, 'name': u'Six', 'version': 2}]
If you want the highest version for each dictionary id, you can use the itertools.groupby function like this:
import itertools

sdata = sorted(data1 + data2, key=lambda x: x['id'])
res = []
for _, v in itertools.groupby(sdata, key=lambda x: x['id']):
    v = list(v)
    if len(v) > 1:  # the same id appeared in both lists
        # append the one with the higher version
        res.append(v[0] if v[0]['version'] > v[1]['version'] else v[1])
    else:  # the id appeared in only one of the two lists
        res.append(v[0])
The solution is not a one-liner, but I think it is simple enough (once you understand groupby(), which is not trivial).
This will result in res containing this list:
[{'id': 1, 'name': u'Oneeee', 'version': 2},
{'id': 2, 'name': u'Two', 'version': 1},
{'id': 3, 'name': u'Threeee', 'version': 3},
{'id': 4, 'name': u'Four', 'version': 1},
{'id': 5, 'name': u'Five', 'version': 1},
{'id': 6, 'name': u'Six', 'version': 2}]
I think it is possible to shrink the solution even more, but the result could be quite hard to understand.
Hope this helps!
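One way the groupby loop above might be shortened is to take max() over each group instead of comparing two entries by hand. This is only a sketch of that idea, using the question's sample data; when versions tie, max() keeps the first entry it sees, which here is the one from data1.

```python
import itertools
from operator import itemgetter

data1 = [{'id': 1, 'name': 'Oneeee', 'version': 2},
         {'id': 2, 'name': 'Two', 'version': 1},
         {'id': 3, 'name': 'Three', 'version': 2},
         {'id': 4, 'name': 'Four', 'version': 1},
         {'id': 5, 'name': 'Five', 'version': 1}]
data2 = [{'id': 1, 'name': 'One', 'version': 1},
         {'id': 2, 'name': 'Two', 'version': 1},
         {'id': 3, 'name': 'Threeee', 'version': 3},
         {'id': 6, 'name': 'Six', 'version': 2}]

by_id = itemgetter('id')

# Sort by id so equal ids are adjacent, then keep the highest version per id.
merged = [
    max(group, key=itemgetter('version'))
    for _, group in itertools.groupby(sorted(data1 + data2, key=by_id), key=by_id)
]
print(merged)
```

Because max() works on groups of any size, the same expression also covers the case where more than two lists are concatenated.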
A fairly straightforward procedural solution, where we build a dictionary keyed by item id, and then replace the items:
indexed_data = { item['id']: item for item in data1 }
# or, pre-Python2.7:
# indexed_data = dict((item['id'], item) for item in data1)
for item in data2:
    if indexed_data.get(item['id'], {'version': float('-inf')})['version'] < item['version']:
        indexed_data[item['id']] = item
data3 = [item for (_, item) in sorted(indexed_data.items())]
The same thing, but using a more functional approach:
sorted_items = sorted(data1 + data2, key=lambda item: (item['id'], item['version']))
merged = { item['id']: item for item in sorted_items }
# or, pre-Python2.7:
# merged = dict((item['id'], item) for item in sorted_items )
data3 = [item for (_, item) in sorted(merged.items())]

Find a minimal value in each array in Python

Suppose I have the following data in Python 3.3:
my_array = [
    [{'man_id': 1, '_id': ObjectId('1234566'), 'type': 'worker', 'value': 11},
     {'man_id': 1, '_id': ObjectId('1234577'), 'type': 'worker', 'value': 12}],
    [{'man_id': 2, '_id': ObjectId('1234588'), 'type': 'worker', 'value': 11},
     {'man_id': 2, '_id': ObjectId('3243'), 'type': 'worker', 'value': 7},
     {'man_id': 2, '_id': ObjectId('54'), 'type': 'worker', 'value': 99},
     {'man_id': 2, '_id': ObjectId('9879878'), 'type': 'worker', 'value': 135}],
    #.............................
    [{'man_id': 13, '_id': ObjectId('111'), 'type': 'worker', 'value': 1},
     {'man_id': 13, '_id': ObjectId('222'), 'type': 'worker', 'value': 2},
     {'man_id': 13, '_id': ObjectId('3333'), 'type': 'worker', 'value': 9}]
]
There are 3 arrays. How do I find an element in each array with minimal value?
[min(arr, key=lambda s:s['value']) for arr in my_array]
Maybe something like this is acceptable for you. It prints every row that ties for the minimal value in each array:
for arr in my_array:
    minVal = min(row['value'] for row in arr)
    print([row for row in arr if row['value'] == minVal])
