Handle missing data when flattening nested array field in pandas dataframe - python

We need to flatten this into a standard 2D DataFrame:
arr = [
    [{'id': 3, 'abbr': 'ORL', 'record': {'win': 3, 'loss': 7}},
     {'id': 5, 'abbr': 'ATL', 'record': {'win': 3, 'loss': 7}}],
    [{'id': 7, 'abbr': 'NYK', 'record': {'win': 3, 'loss': 7}},
     {'id': 9, 'abbr': 'BOS', 'record': {'win': 3, 'loss': 7}}]
]
mydf = pd.DataFrame(data={'name': ['nick', 'tom'], 'arr': arr})
Here's our code, which works just fine for this dummy example:
output_list = []
for i in range(len(mydf)):
    team1 = mydf['arr'][i][0]
    team2 = mydf['arr'][i][1]
    zed = {'t1': team1['abbr'], 't2': team2['abbr']}
    output_list.append(zed)

output_df = pd.DataFrame(output_list)
final_df = pd.concat([mydf, output_df], axis=1)
final_df.pop('arr')
final_df
   name   t1   t2
0  nick  ORL  ATL
1   tom  NYK  BOS
Our source of data is not reliable and may have missing values, and our code seems fraught with structural weaknesses. In particular, errors are thrown when either of the following is the raw data (a missing dict, a missing field):
# missing dict
arr = [
    [{'id': 3, 'abbr': 'ORL', 'record': {'win': 3, 'loss': 7}}],
    [{'id': 7, 'abbr': 'NYK', 'record': {'win': 3, 'loss': 7}},
     {'id': 9, 'abbr': 'BOS', 'record': {'win': 3, 'loss': 7}}]
]
mydf = pd.DataFrame(data={'name': ['nick', 'tom'], 'arr': arr})
# missing "abbr" field
arr = [
[{ 'id': 3, 'record': { 'win': 3, 'loss': 7 }},
{ 'id': 5, 'abbr': 'ATL', 'record': { 'win': 3, 'loss': 7 }}],
[{ 'id': 7, 'abbr': 'NYK', 'record': { 'win': 3, 'loss': 7 }},
{ 'id': 9, 'abbr': 'BOS', 'record': { 'win': 3, 'loss': 7 }}]
]
mydf = pd.DataFrame(data = {'name': ['nick', 'tom'], 'arr': arr })
Is it possible to (a) replace the for-loop with a more structurally sound approach (apply), and (b) handle the missing data concerns?

The main issue with your code is that the "abbr" key may not exist. You can account for that using the dict.get method. If you replace:
zed = {'t1': team1['abbr'], 't2': team2['abbr']}
with
zed = {'t1': team1.get('abbr', np.nan), 't2': team2.get('abbr', np.nan)}
it will handle the missing "abbr" field. The missing-dict case additionally needs a length check on the team list, as in the sketch below.
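A minimal sketch of the loop with both guards, assuming a missing team should simply come out as NaN:
import numpy as np

output_list = []
for i in range(len(mydf)):
    teams = mydf['arr'][i]
    # Guard against a short list (missing dict) and a missing 'abbr' key.
    t1 = teams[0].get('abbr', np.nan) if len(teams) > 0 else np.nan
    t2 = teams[1].get('abbr', np.nan) if len(teams) > 1 else np.nan
    output_list.append({'t1': t1, 't2': t2})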
An alternative approach that doesn't use an explicit loop: explode the column and str.get the abbr field, convert the result to a list, build a DataFrame from it, and join it back to df:
df = pd.DataFrame(data={'name': ['nick', 'tom'], 'arr': arr})
out = (df.join(pd.DataFrame(df['arr']
                            .explode()
                            .str.get('abbr')
                            .groupby(level=0)
                            .agg(list)
                            .tolist(),
                            columns=['t1', 't2'], index=df.index))
         .drop(columns='arr')
         .fillna(np.nan))
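Note that pd.DataFrame pads rows of unequal length with NaN when built from a list of lists, which is what turns the missing second dict into a NaN in t2.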
For the sample that works in your code:
   name   t1   t2
0  nick  ORL  ATL
1   tom  NYK  BOS
For the first sample that doesn't work (missing dict):
   name   t1   t2
0  nick  ORL  NaN
1   tom  NYK  BOS
For the second sample that doesn't work (missing "abbr" field):
   name   t1   t2
0  nick  NaN  ATL
1   tom  NYK  BOS
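Since the question asks about apply specifically, here is a minimal sketch along those lines (extract_abbrs is a hypothetical helper name; a missing team or field becomes NaN):
import numpy as np

def extract_abbrs(teams):
    # teams is the list of team dicts for one row; it may be short,
    # and any dict may lack the 'abbr' key.
    def get(i):
        return teams[i].get('abbr', np.nan) if i < len(teams) else np.nan
    return pd.Series({'t1': get(0), 't2': get(1)})

final_df = mydf.join(mydf['arr'].apply(extract_abbrs)).drop(columns='arr')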

Related

Nested structures in beam

Question: I want to do an operation similar to
ARRAY_AGG(STRUCT(table)) in Beam for Python.
Background:
Similar to this thread, I'm running a Beam pipeline in Python. I have two tables, one with ids and a sum:
ID  total
1   10
2   15
3   5
And one breakdown table where each row is:
table1_id  item_name  item_price
1          a          2
1          b          8
2          c          5
2          d          5
2          e          5
3          f          7
I want the output in BigQuery to look like:
id  total  item.item_name  item.item_price
1   10     a               2
           b               8
2   15     c               5
           d               5
           e               5
3   5      f               7
In BQ this is solvable by doing an ARRAY_AGG(STRUCT(line_items)) and grouping by table1_id, which can then be joined on table1. Is there a smart way to do this in Beam with Python?
(I assume it's something with GroupBy, but I haven't been able to get it working.)
Here is full code implementing your solution as a unit test:
from typing import Any, Dict, List, Tuple

import apache_beam as beam
from apache_beam import Create
from apache_beam.pvalue import AsList
from apache_beam.testing.test_pipeline import TestPipeline


class TestArrayAggStruct:
    def test_pipeline(self):
        with TestPipeline() as p:
            ids = [
                {'ID': 1, 'total': 10},
                {'ID': 2, 'total': 15},
                {'ID': 3, 'total': 5}
            ]
            items = [
                {'table1_id': 1, 'item_name': 'a', 'item_price': 2},
                {'table1_id': 1, 'item_name': 'b', 'item_price': 8},
                {'table1_id': 2, 'item_name': 'c', 'item_price': 5},
                {'table1_id': 2, 'item_name': 'd', 'item_price': 5},
                {'table1_id': 2, 'item_name': 'e', 'item_price': 5},
                {'table1_id': 3, 'item_name': 'f', 'item_price': 7}
            ]
            ids_side_inputs = p | 'Side input IDs' >> Create(ids)
            result = (p
                      | 'Input items' >> Create(items)
                      | beam.GroupBy(lambda i: i['table1_id'])
                      | beam.Map(self.to_item_tuple_with_total, ids=AsList(ids_side_inputs))
                      | beam.Map(self.to_item_result))
            result | 'Print outputs' >> beam.Map(print)

    def to_item_tuple_with_total(self, item_tuple: Tuple[int, Any], ids: List[Dict]) -> Tuple[Dict, List[Dict]]:
        table_id = item_tuple[0]
        total = next(id_element for id_element in ids if id_element['ID'] == table_id)['total']
        return {'id': table_id, 'total': total}, item_tuple[1]

    def to_item_result(self, item_tuple: Tuple[Dict, Any]) -> Dict:
        item_key = item_tuple[0]
        return {'id': item_key['id'], 'total': item_key['total'], 'item': item_tuple[1]}
The result is:
{'id': 1, 'total': 10, 'item': [
    {'table1_id': 1, 'item_name': 'a', 'item_price': 2},
    {'table1_id': 1, 'item_name': 'b', 'item_price': 8}]}
{'id': 2, 'total': 15, 'item': [
    {'table1_id': 2, 'item_name': 'c', 'item_price': 5},
    {'table1_id': 2, 'item_name': 'd', 'item_price': 5},
    {'table1_id': 2, 'item_name': 'e', 'item_price': 5}]}
{'id': 3, 'total': 5, 'item': [
    {'table1_id': 3, 'item_name': 'f', 'item_price': 7}]}
Some explanations:
I simulated the items input PCollection from BigQuery
I simulated the ids side-input PCollection from BigQuery
I added a GroupBy on table1_id in the items PCollection
I added a Map with the IDs side-input list to link each total to its items
The last Map returns a Dict with the expected fields before saving the result to BigQuery
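To actually save that result to BigQuery with the nested item field, something along these lines should work (a sketch: the table reference is hypothetical and the schema dict simply mirrors the layout above):
schema = {'fields': [
    {'name': 'id', 'type': 'INTEGER', 'mode': 'NULLABLE'},
    {'name': 'total', 'type': 'INTEGER', 'mode': 'NULLABLE'},
    {'name': 'item', 'type': 'RECORD', 'mode': 'REPEATED', 'fields': [
        {'name': 'table1_id', 'type': 'INTEGER', 'mode': 'NULLABLE'},
        {'name': 'item_name', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name': 'item_price', 'type': 'INTEGER', 'mode': 'NULLABLE'},
    ]},
]}
result | 'Write to BQ' >> beam.io.WriteToBigQuery(
    'my-project:my_dataset.my_table',  # hypothetical table reference
    schema=schema,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)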

Store rows of DataFrame with certain value in list

I have a DataFrame like:
id  country  city   amount  duplicated
1   France   Paris  200     1
2   France   Paris  200     1
3   France   Lyon   50      2
4   France   Lyon   50      2
5   France   Lyon   50      2
And I would like to store a list per distinct value in duplicated, like:
list 1
[
{
"id": 1,
"country": "France",
"city": "Paris",
"amount": 200,
},
{
"id": 2,
"country": "France",
"city": "Paris",
"amount": 200,
}
]
list 2
[
{
"id": 3,
"country": "France",
"city": "Lyon",
"amount": 50,
},
{
"id": 4,
"country": "France",
"city": "Lyon",
"amount": 50,
},
{
"id": 5,
"country": "France",
"city": "Lyon",
"amount": 50,
}
]
I tried filtering duplicates with
df[df.duplicated(['country','city','amount', 'duplicated'], keep = False)]
but it just returns the same df.
You can use groupby:
lst = (df.groupby(['country', 'city', 'amount'])  # or .groupby('duplicated')
         .apply(lambda x: x.to_dict('records'))
         .tolist())
Output:
>>> lst
[[{'id': 3,
'country': 'France',
'city': 'Lyon',
'amount': 50,
'duplicated': 2},
{'id': 4,
'country': 'France',
'city': 'Lyon',
'amount': 50,
'duplicated': 2},
{'id': 5,
'country': 'France',
'city': 'Lyon',
'amount': 50,
'duplicated': 2}],
[{'id': 1,
'country': 'France',
'city': 'Paris',
'amount': 200,
'duplicated': 1},
{'id': 2,
'country': 'France',
'city': 'Paris',
'amount': 200,
'duplicated': 1}]]
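Note that groupby sorts the group keys by default, which is why the Lyon group appears before the Paris group here; pass sort=False to groupby to keep first-appearance order.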
Another solution if you want a dict indexed by duplicated key:
data = {k: v.to_dict('records') for k, v in df.set_index('duplicated').groupby(level=0)}
>>> data[1]
[{'id': 1, 'country': 'France', 'city': 'Paris', 'amount': 200},
{'id': 2, 'country': 'France', 'city': 'Paris', 'amount': 200}]
>>> data[2]
[{'id': 3, 'country': 'France', 'city': 'Lyon', 'amount': 50},
{'id': 4, 'country': 'France', 'city': 'Lyon', 'amount': 50},
{'id': 5, 'country': 'France', 'city': 'Lyon', 'amount': 50}]
If I understand you correctly, you can use DataFrame.to_dict('records') to make your lists:
list_1 = df[df['duplicated'] == 1].to_dict('records')
list_2 = df[df['duplicated'] == 2].to_dict('records')
Or, for an arbitrary number of values in the column, you can build a dict:
result = {}
for value in df['duplicated'].unique():
    result[value] = df[df['duplicated'] == value].to_dict('records')
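The same loop can also be written as a dict comprehension over groupby, which avoids one filtering pass per distinct value (equivalent to the loop above):
result = {value: group.to_dict('records') for value, group in df.groupby('duplicated')}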

pandas: split string, and count values? [duplicate]

This question already has answers here:
How to unnest (explode) a column in a pandas DataFrame, into multiple rows
(16 answers)
Closed 3 years ago.
I've got a pandas dataset with a column that's a comma-separated string, e.g. 1,2,3,10:
data = [
{ 'id': 1, 'score': 9, 'topics': '11,22,30' },
{ 'id': 2, 'score': 7, 'topics': '11,18,30' },
{ 'id': 3, 'score': 6, 'topics': '1,12,30' },
{ 'id': 4, 'score': 4, 'topics': '1,18,30' }
]
df = pd.DataFrame(data)
I'd like to get a count and a mean score for each value in topics. So:
topic_id,count,mean
1,2,5
11,2,8
12,1,6
et cetera. How can I do this?
I've got as far as:
df['topic_ids'] = df.topics.str.split(',')
But now I guess I want to explode topic_ids out, so there's a column for each unique value in the entire set of values...?
Unnest, then groupby and agg:
import numpy as np

df.topics = df.topics.str.split(',')
new_df = pd.DataFrame({'topics': np.concatenate(df.topics.values),
                       'id': df.id.repeat(df.topics.apply(len)),
                       'score': df.score.repeat(df.topics.apply(len))})
new_df.groupby('topics').score.agg(['count', 'mean'])
Out[1256]:
        count  mean
topics
1           2   5.0
11          2   8.0
12          1   6.0
18          2   5.5
22          1   9.0
30          4   6.5
In [111]: def mean1(x): return np.array(x).astype(int).mean()

In [112]: df.topics.str.split(',', expand=False).agg([mean1, len])
Out[112]:
       mean1  len
0  21.000000    3
1  19.666667    3
2  14.333333    3
3  16.333333    3
This is one way. Reindex & stack, then groupby & agg.
import pandas as pd

data = [
    {'id': 1, 'score': 9, 'topics': '11,22,30'},
    {'id': 2, 'score': 7, 'topics': '11,18,30'},
    {'id': 3, 'score': 6, 'topics': '1,12,30'},
    {'id': 4, 'score': 4, 'topics': '1,18,30'}
]
df = pd.DataFrame(data)
df.topics = df.topics.str.split(',')

df2 = (pd.DataFrame(df.topics.tolist(), index=[df.id, df.score])
         .stack()
         .reset_index(name='topics')
         .drop(columns='level_2'))
df2.groupby('topics').score.agg(['count', 'mean']).reset_index()
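Since pandas 0.25 the same unnesting can be done with the built-in explode; a minimal sketch, re-using the raw data above (topics still comma-separated strings):
df = pd.DataFrame(data)
out = (df.assign(topics=df['topics'].str.split(','))
         .explode('topics')
         .groupby('topics')['score']
         .agg(['count', 'mean'])
         .reset_index())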

How can I create aggregate expressions of this list of dicts?

I have a list of dictionaries that expresses periods+days for a class in a student information system. Here's the data I'd like to aggregate:
[
{
'period': {
'name': '1',
'sort_order': 1
},
'day': {
'name': 'A',
'sort_order': 1
}
},
{
'period': {
'name': '1',
'sort_order': 1
},
'day': {
'name': 'B',
'sort_order': 2
}
},
{
'period': {
'name': '1',
'sort_order': 1
},
'day': {
'name': 'C',
'sort_order': 1
}
},
{
'period': {
'name': '3',
'sort_order': 3
},
'day': {
'name': 'A',
'sort_order': 1
}
},
{
'period': {
'name': '3',
'sort_order': 3
},
'day': {
'name': 'B',
'sort_order': 2
}
},
{
'period': {
'name': '3',
'sort_order': 3
},
'day': {
'name': 'C',
'sort_order': 2
}
},
{
'period': {
'name': '4',
'sort_order': 4
},
'day': {
'name': 'D',
'sort_order': 3
}
}
]
The aggregated string I'd like the above to reduce to is 1,3(A-C) 4(D). Notice that objects that aren't "adjacent" to each other (determined by the object's sort_order) are delimited by ",", while "adjacent" records are joined with a "-".
EDIT
Let me try to elaborate on the aggregation process. Each "class meeting" object contains a period and day. There are usually ~5 periods per day, and the days alternate cyclically between A,B,C,D, etc. So if I have a class that occurs 1st period on an A day, we might express that as 1(A). If a class occurs on 1st and 2nd period on an A day, the raw form of that might be 1(A),2(A), but it can be shortened to 1-2(A).
Some classes might not be in "adjacent" periods or days. A class might occur on 1st period and 3rd period on an A day, so its short form would be 1,3(A). However, if that class were on 1st, 2nd, and 3rd period on an A day, it could be written as 1-3(A). This also applies to days, so if a class occurs on 1st,2nd, and 3rd period, on A,B, and C day, then we could write it 1-3(A-C).
Finally, if a class occurs on 1st,2nd, and 3rd period and on A,B, and C day, but also on 4th period on D day, its short form would be 1-3(A-C) 4(D).
What I've tried
The first step that occurs to me is to "group" the meeting objects into related sub-lists with the following function:
def _to_related_lists(meetings):
    """Given a list of section-meeting dicts, return a list of lists, where each
    sub-list contains section meetings related by period or day."""
    related_list = []
    sub_list = []
    related_values = set()
    for section_meeting_object in meetings:
        if not related_values:
            # starting with an empty values set
            related_values.add(section_meeting_object['period']['name'])
            related_values.add(section_meeting_object['day']['name'])
            sub_list.append(section_meeting_object)
        elif (section_meeting_object['period']['name'] in related_values
              or section_meeting_object['day']['name'] in related_values):
            related_values.add(section_meeting_object['period']['name'])
            related_values.add(section_meeting_object['day']['name'])
            sub_list.append(section_meeting_object)
        else:
            # no related values found in the current section_meeting_object
            related_list.append(sub_list)
            sub_list = []
            related_values = set()
            related_values.add(section_meeting_object['period']['name'])
            related_values.add(section_meeting_object['day']['name'])
            sub_list.append(section_meeting_object)
    related_list.append(sub_list)
    return related_list
Which returns:
[
[{
'period': {
'sort_order': 1,
'name': '1'
},
'day': {
'sort_order': 1,
'name': 'A'
}
}, {
'period': {
'sort_order': 1,
'name': '1'
},
'day': {
'sort_order': 2,
'name': 'B'
}
}, {
'period': {
'sort_order': 2,
'name': '2'
},
'day': {
'sort_order': 1,
'name': 'A'
}
}, {
'period': {
'sort_order': 2,
'name': '2'
},
'day': {
'sort_order': 2,
'name': 'B'
}
}],
[{
'period': {
'sort_order': 4,
'name': '4'
},
'day': {
'sort_order': 3,
'name': 'C'
}
}]
]
If the entire string 1-3(A-C) 4(D) is the aggregate expression I'd like in the end, let's call 1-3(A-C) and 4(D) "sub-expressions". Each related sub-list would be a "sub-expression", so I was thinking I'd somehow iterate through every sublist and create the sub-expression, but I'm not exactly sure how to do that.
First, let us define your list as d_list.
d_list = [
{'period': {'sort_order': 1, 'name': '1'}, 'day': {'sort_order': 1, 'name': 'A'}},
{'period': {'sort_order': 1, 'name': '1'}, 'day': {'sort_order': 2, 'name': 'B'}},
{'period': {'sort_order': 1, 'name': '1'}, 'day': {'sort_order': 1, 'name': 'C'}},
{'period': {'sort_order': 3, 'name': '3'}, 'day': {'sort_order': 1, 'name': 'A'}},
{'period': {'sort_order': 3, 'name': '3'}, 'day': {'sort_order': 2, 'name': 'B'}},
{'period': {'sort_order': 3, 'name': '3'}, 'day': {'sort_order': 2, 'name': 'C'}},
{'period': {'sort_order': 4, 'name': '4'}, 'day': {'sort_order': 3, 'name': 'D'}},
]
Note that I use the Python standard-library module string to establish that B falls between A and C. What you may want to do is:
agg0 = {}
for d in d_list:
name = d['period']['name']
if name not in agg0:
agg0[name] = []
day = d['day']
agg0[name].append(day['name'])
agg1 = {}
for k,v in agg0.items():
pos_in_alph = [string.ascii_lowercase.index(el.lower()) for el in v]
allowed_indexes = [max(pos_in_alph),min(pos_in_alph)]
agg1[k] = [el for el in v if string.ascii_lowercase.index(el.lower()) in allowed_indexes]
agg = {}
for k,v in agg1.items():
w = tuple(v)
if w not in agg:
agg[w] = {'ks':[],'gr':len(agg0[k])>2}
agg[w]['ks'].append(k)
print agg[w]
str_ = ''
for k,v in sorted(agg.items(), key=lambda item:item[0], reverse=False):
str_ += ' {pnames}({dnames})'.format(pnames=('-' if v['gr'] else ',').join(sorted(v['ks'])),
dnames='-'.join(k))
print(str_.strip())
which outputs 1-3(A-C) 4(D)
Following @NathanJones's comment, note that if d_list were defined as
d_list = [
{'period': {'sort_order': 1, 'name': '1'}, 'day': {'sort_order': 1, 'name': 'A'}},
##{'period': {'sort_order': 1, 'name': '1'}, 'day': {'sort_order': 2, 'name': 'B'}},
{'period': {'sort_order': 1, 'name': '1'}, 'day': {'sort_order': 1, 'name': 'C'}},
{'period': {'sort_order': 3, 'name': '3'}, 'day': {'sort_order': 1, 'name': 'A'}},
{'period': {'sort_order': 3, 'name': '3'}, 'day': {'sort_order': 2, 'name': 'B'}},
{'period': {'sort_order': 3, 'name': '3'}, 'day': {'sort_order': 2, 'name': 'C'}},
{'period': {'sort_order': 4, 'name': '4'}, 'day': {'sort_order': 3, 'name': 'D'}},
]
The code above would print 1,3(A-C) 4(D)
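Note that the choice between '-' and ',' really hinges on whether the sort_order values form a consecutive run. A small standalone helper for that run-compression step (a hypothetical sketch, not part of the answer above):
def compress(names_orders):
    # names_orders: (name, sort_order) pairs sorted by sort_order.
    # Consecutive orders collapse to 'first-last'; gaps are joined with ','.
    runs = []
    start_name = prev_name = None
    start = prev = None
    for name, order in names_orders:
        if start is None or order != prev + 1:
            if start is not None:
                runs.append(start_name if start == prev else '{}-{}'.format(start_name, prev_name))
            start_name, start = name, order
        prev_name, prev = name, order
    if start is not None:
        runs.append(start_name if start == prev else '{}-{}'.format(start_name, prev_name))
    return ','.join(runs)

compress([('1', 1), ('3', 3)])            # -> '1,3'
compress([('A', 1), ('B', 2), ('C', 3)])  # -> 'A-C'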

How to aggregate a particular property value, grouped by a particular property, in a list

I have a list which represents a customer's item purchases:
purchases = [
    {'id': 1, 'product': 'Item 1', 'price': 12.4, 'qty': 4},
    {'id': 1, 'product': 'Item 1', 'price': 12.4, 'qty': 8},
    {'id': 2, 'product': 'Item 2', 'price': 7.5, 'qty': 10},
    {'id': 3, 'product': 'Item 3', 'price': 18, 'qty': 7}
]
Now I want output that returns the distinct products with an aggregated qty:
result = [
    {'id': 1, 'product': 'Item 1', 'price': 12.4, 'qty': 12},  # 8 + 4
    {'id': 2, 'product': 'Item 2', 'price': 7.5, 'qty': 10},
    {'id': 3, 'product': 'Item 3', 'price': 18, 'qty': 7}
]
And the answers here have never made sense to me:
How to sum dict elements
In pandas this is simple: groupby with agg, then to_dict:
import pandas as pd

df = pd.DataFrame(purchases)
print(df)

   id  price product  qty
0   1   12.4  Item 1    4
1   1   12.4  Item 1    8
2   2    7.5  Item 2   10
3   3   18.0  Item 3    7

print(df.groupby('product', as_index=False)
        .agg({'id': 'first', 'price': 'first', 'qty': 'sum'})
        .to_dict(orient='records'))

[{'qty': 12, 'product': 'Item 1', 'price': 12.4, 'id': 1},
 {'qty': 10, 'product': 'Item 2', 'price': 7.5, 'id': 2},
 {'qty': 7, 'product': 'Item 3', 'price': 18.0, 'id': 3}]
It is also possible to group by all 3 columns:
print(df.groupby(['id', 'product', 'price'], as_index=False)['qty'].sum()
        .to_dict(orient='records'))

[{'qty': 12, 'product': 'Item 1', 'id': 1, 'price': 12.4},
 {'qty': 10, 'product': 'Item 2', 'id': 2, 'price': 7.5},
 {'qty': 7, 'product': 'Item 3', 'id': 3, 'price': 18.0}]
from itertools import groupby
from operator import itemgetter

grouper = itemgetter('id', 'product', 'price')
result = []
for key, grp in groupby(sorted(purchases, key=grouper), grouper):
    temp_dict = dict(zip(['id', 'product', 'price'], key))
    temp_dict['qty'] = sum(item['qty'] for item in grp)
    result.append(temp_dict)
print(result)

[{'qty': 12, 'product': 'Item 1', 'id': 1, 'price': 12.4},
 {'qty': 10, 'product': 'Item 2', 'id': 2, 'price': 7.5},
 {'qty': 7, 'product': 'Item 3', 'id': 3, 'price': 18}]
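Note that itertools.groupby only merges runs of consecutive items with equal keys, which is why the list is sorted with the same grouper before grouping.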
EDIT by comment:
purchases = [
    {'id': 1, 'product': {'id': 1, 'name': 'item 1'}, 'price': 12.4, 'qty': 4},
    {'id': 1, 'product': {'id': 1, 'name': 'item 2'}, 'price': 12.4, 'qty': 8},
    {'id': 2, 'product': {'id': 2, 'name': 'item 3'}, 'price': 7.5, 'qty': 10},
    {'id': 3, 'product': {'id': 3, 'name': 'item 4'}, 'price': 18, 'qty': 7}
]
import pandas as pd

df = pd.json_normalize(purchases)  # pandas >= 1.0; formerly pandas.io.json.json_normalize
print(df)

   id  price  product.id product.name  qty
0   1   12.4           1       item 1    4
1   1   12.4           1       item 2    8
2   2    7.5           2       item 3   10
3   3   18.0           3       item 4    7

print(df.groupby(['id', 'product.id', 'price'], as_index=False)['qty'].sum()
        .to_dict(orient='records'))

[{'qty': 12.0, 'price': 12.4, 'id': 1.0, 'product.id': 1.0},
 {'qty': 10.0, 'price': 7.5, 'id': 2.0, 'product.id': 2.0},
 {'qty': 7.0, 'price': 18.0, 'id': 3.0, 'product.id': 3.0}]
Another solution, not the most elegant, but easier to understand:
from collections import Counter

c = Counter()
some = [((x['id'], x['product'], x['price']), x['qty']) for x in purchases]
for key, qty in some:
    c[key] += qty

[{'id': k[0], 'product': k[1], 'price': k[2], 'qty': v} for k, v in c.items()]
I measured this solution against @jezrael's groupby solution: 100000 loops, best of 3: 9.03 µs per loop, vs. @jezrael's 100000 loops, best of 3: 12.2 µs per loop.
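For reference, a micro-benchmark like that can be reproduced along these lines (a sketch: counter_solution is a hypothetical wrapper around the Counter approach above, purchases is the flat list from the question, and absolute numbers will vary by machine):
from collections import Counter
from timeit import timeit

def counter_solution(rows):
    c = Counter()
    for x in rows:
        c[(x['id'], x['product'], x['price'])] += x['qty']
    return [{'id': k[0], 'product': k[1], 'price': k[2], 'qty': v}
            for k, v in c.items()]

print(timeit(lambda: counter_solution(purchases), number=100000))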
