pandas: split string, and count values? [duplicate] - python

This question already has answers here:
How to unnest (explode) a column in a pandas DataFrame, into multiple rows
(16 answers)
Closed 3 years ago.
I've got a pandas dataset with a column that's a comma-separated string, e.g. 1,2,3,10:
data = [
{ 'id': 1, 'score': 9, 'topics': '11,22,30' },
{ 'id': 2, 'score': 7, 'topics': '11,18,30' },
{ 'id': 3, 'score': 6, 'topics': '1,12,30' },
{ 'id': 4, 'score': 4, 'topics': '1,18,30' }
]
df = pd.DataFrame(data)
I'd like to get a count and a mean score for each value in topics. So:
topic_id,count,mean
1,2,5
11,2,8
12,1,6
et cetera. How can I do this?
I've got as far as:
df['topic_ids'] = df.topics.str.split(',')
But now I guess I want to explode topic_ids out, so there's a column for each unique value in the entire set of values...?

Unnest, then groupby and agg:
import numpy as np

df.topics = df.topics.str.split(',')  # lists of topic ids
New_df = pd.DataFrame({'topics': np.concatenate(df.topics.values),
                       'id': df.id.repeat(df.topics.apply(len)),
                       'score': df.score.repeat(df.topics.apply(len))})
New_df.groupby('topics').score.agg(['count', 'mean'])
Out[1256]:
        count  mean
topics
1           2   5.0
11          2   8.0
12          1   6.0
18          2   5.5
22          1   9.0
30          4   6.5
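If you are on pandas 0.25 or newer, the same unnest can be written with DataFrame.explode; a minimal sketch, reusing the data from the question:

import pandas as pd

data = [
    {'id': 1, 'score': 9, 'topics': '11,22,30'},
    {'id': 2, 'score': 7, 'topics': '11,18,30'},
    {'id': 3, 'score': 6, 'topics': '1,12,30'},
    {'id': 4, 'score': 4, 'topics': '1,18,30'},
]
df = pd.DataFrame(data)

# Split the comma-separated string into lists, explode to one topic per row,
# then aggregate the score column per topic id.
out = (df.assign(topics=df.topics.str.split(','))
         .explode('topics')
         .groupby('topics')
         .score.agg(['count', 'mean']))
print(out)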

In [111]: def mean1(x): return np.array(x).astype(int).mean()
In [112]: df.topics.str.split(',', expand=False).agg([mean1, len])
Out[112]:
mean1 len
0 21.000000 3
1 19.666667 3
2 14.333333 3
3 16.333333 3

This is one way. Reindex & stack, then groupby & agg.
import pandas as pd
data = [
{ 'id': 1, 'score': 9, 'topics': '11,22,30' },
{ 'id': 2, 'score': 7, 'topics': '11,18,30' },
{ 'id': 3, 'score': 6, 'topics': '1,12,30' },
{ 'id': 4, 'score': 4, 'topics': '1,18,30' }
]
df = pd.DataFrame(data)
df.topics = df.topics.str.split(',')
df2 = pd.DataFrame(df.topics.tolist(), index=[df.id, df.score])\
    .stack()\
    .reset_index(name='topics')\
    .drop('level_2', axis=1)
df2.groupby('topics').score.agg(['count', 'mean']).reset_index()

Related

Nested structures in beam

Question: I want to do a similar operation to
ARRAY_AGG(STRUCT(table)) in Beam for Python.
Background:
Similar to this thread, I'm running a Beam pipeline in Python. I have two tables, one with ids and a sum:
ID  total
1   10
2   15
3   5
And one breakdown table where each row is:
table1_id  item_name  item_price
1          a          2
1          b          8
2          c          5
2          d          5
2          e          5
3          f          7
I want the output in BigQuery to look like:
id  total  item.item_name  item.item_price
1   10     a               2
           b               8
2   15     c               5
           d               5
           e               5
3   5      f               7
In BQ this is solvable by doing an ARRAY_AGG(STRUCT(line_items)) and grouping by table1_id, which can then be joined on table1. Is there a smart way to do this in Beam with Python?
(I assume it's something with groupby, but I haven't been able to get it working.)
Here is complete code implementing your solution in a unit test:
from typing import List, Dict, Tuple, Any

import apache_beam as beam
import pytest
from apache_beam import Create
from apache_beam.pvalue import AsList
from apache_beam.testing.test_pipeline import TestPipeline


def test_pipeline(self):
    with TestPipeline() as p:
        ids = [
            {'ID': 1, 'total': 10},
            {'ID': 2, 'total': 15},
            {'ID': 3, 'total': 5}
        ]
        items = [
            {'table1_id': 1, 'item_name': 'a', 'item_price': 2},
            {'table1_id': 1, 'item_name': 'b', 'item_price': 8},
            {'table1_id': 2, 'item_name': 'c', 'item_price': 5},
            {'table1_id': 2, 'item_name': 'd', 'item_price': 5},
            {'table1_id': 2, 'item_name': 'e', 'item_price': 5},
            {'table1_id': 3, 'item_name': 'f', 'item_price': 7}
        ]

        ids_side_inputs = p | 'Side input IDs' >> Create(ids)

        result = (p
                  | 'Input items' >> Create(items)
                  | beam.GroupBy(lambda i: i['table1_id'])
                  | beam.Map(self.to_item_tuple_with_total, ids=AsList(ids_side_inputs))
                  | beam.Map(self.to_item_result))

        result | "Print outputs" >> beam.Map(print)


def to_item_tuple_with_total(self, item_tuple: Tuple[int, Any], ids: List[Dict]) -> Tuple[Dict, List[Dict]]:
    table_id = item_tuple[0]
    total = next(id_element for id_element in ids if id_element['ID'] == table_id)['total']
    return {'id': table_id, 'total': total}, item_tuple[1]


def to_item_result(self, item_tuple: Tuple[Dict, Any]) -> Dict:
    item_key = item_tuple[0]
    return {'id': item_key['id'], 'total': item_key['total'], 'item': item_tuple[1]}
The result is:
{
    'id': 1,
    'total': 10,
    'item': [
        {'table1_id': 1, 'item_name': 'a', 'item_price': 2},
        {'table1_id': 1, 'item_name': 'b', 'item_price': 8}
    ]
}
{
    'id': 2,
    'total': 15,
    'item': [
        {'table1_id': 2, 'item_name': 'c', 'item_price': 5},
        {'table1_id': 2, 'item_name': 'd', 'item_price': 5},
        {'table1_id': 2, 'item_name': 'e', 'item_price': 5}
    ]
}
{
    'id': 3,
    'total': 5,
    'item': [
        {'table1_id': 3, 'item_name': 'f', 'item_price': 7}
    ]
}
Some explanations:
I simulated the items input PCollection from BigQuery
I simulated the ids side input PCollection from BigQuery
I added a GroupBy on table1_id to the items PCollection
I added a Map with the side input list of IDs to link the total to the items
The last Map returns a Dict with the expected fields before saving the result to BigQuery
An alternative based on CoGroupByKey is sketched below.
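As a further alternative (not from the original answer), the join can also be expressed with Beam's CoGroupByKey instead of a side input; a minimal sketch, assuming the same ids and items lists as in the test above:

import apache_beam as beam

with beam.Pipeline() as p:
    # Key both collections by the id they share.
    # `ids` and `items` are the lists defined in the test above.
    ids_kv = (p | 'Create ids' >> beam.Create(ids)
                | 'Key ids' >> beam.Map(lambda r: (r['ID'], r['total'])))
    items_kv = (p | 'Create items' >> beam.Create(items)
                  | 'Key items' >> beam.Map(lambda r: (r['table1_id'], r)))

    def to_row(kv):
        # kv is (key, {'totals': <iterable of totals>, 'items': <iterable of items>})
        key, grouped = kv
        return {'id': key,
                'total': next(iter(grouped['totals'])),
                'item': list(grouped['items'])}

    ({'totals': ids_kv, 'items': items_kv}
     | 'Join on id' >> beam.CoGroupByKey()
     | 'To row' >> beam.Map(to_row)
     | 'Print' >> beam.Map(print))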

Handle missing data when flattening nested array field in pandas dataframe

We need to flatten this into a standard 2D DataFrame:
arr = [
[{ 'id': 3, 'abbr': 'ORL', 'record': { 'win': 3, 'loss': 7 }},
{ 'id': 5, 'abbr': 'ATL', 'record': { 'win': 3, 'loss': 7 }}],
[{ 'id': 7, 'abbr': 'NYK', 'record': { 'win': 3, 'loss': 7 }},
{ 'id': 9, 'abbr': 'BOS', 'record': { 'win': 3, 'loss': 7 }}]
]
mydf = pd.DataFrame(data={'name': ['nick', 'tom'], 'arr': arr})
Here's our code, which works just fine for this dummy example:
output_list = []
for i in range(len(mydf)):
    team1 = mydf['arr'][i][0]
    team2 = mydf['arr'][i][1]
    zed = {'t1': team1['abbr'], 't2': team2['abbr']}
    output_list.append(zed)
output_df = pd.DataFrame(output_list)
final_df = pd.concat([mydf, output_df], axis=1)
final_df.pop('arr')
final_df
name t1 t2
0 nick ORL ATL
1 tom   NYK BOS
Our source of data is not reliable and may have missing values, and our code seems fraught with structural weaknesses. In particular, errors are thrown when either of these is the raw data (missing field, missing dict):
# missing dict
arr = [
[{ 'id': 3, 'abbr': 'ORL', 'record': { 'win': 3, 'loss': 7 }}],
[{ 'id': 7, 'abbr': 'NYK', 'record': { 'win': 3, 'loss': 7 }},
{ 'id': 9, 'abbr': 'BOS', 'record': { 'win': 3, 'loss': 7 }}]
]
mydf = pd.DataFrame(data = {'name': ['nick', 'tom'], 'arr': arr })
# missing "abbr" field
arr = [
[{ 'id': 3, 'record': { 'win': 3, 'loss': 7 }},
{ 'id': 5, 'abbr': 'ATL', 'record': { 'win': 3, 'loss': 7 }}],
[{ 'id': 7, 'abbr': 'NYK', 'record': { 'win': 3, 'loss': 7 }},
{ 'id': 9, 'abbr': 'BOS', 'record': { 'win': 3, 'loss': 7 }}]
]
mydf = pd.DataFrame(data = {'name': ['nick', 'tom'], 'arr': arr })
Is it possible to (a) replace the for-loop with a more structurally sound approach (apply), and (b) handle the missing data concerns?
The main issue with your code is that the "abbr" key may not exist. You could account for that using the dict.get method. If you replace:
zed = { 't1': team1['abbr'], 't2': team2['abbr'] }
with
zed = { 't1': team1.get('abbr', np.nan), 't2': team2.get('abbr', np.nan) }
it will work as expected.
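A minimal sketch of the loop with both failure modes handled (the missing-dict case additionally needs a length check, because indexing [1] raises an IndexError when a row holds only one dict); mydf is the frame from the question:

import numpy as np
import pandas as pd

output_list = []
for i in range(len(mydf)):          # mydf as defined in the question
    teams = mydf['arr'][i]
    zed = {
        # .get covers a missing 'abbr' key; the length check covers a missing dict.
        't1': teams[0].get('abbr', np.nan) if len(teams) > 0 else np.nan,
        't2': teams[1].get('abbr', np.nan) if len(teams) > 1 else np.nan,
    }
    output_list.append(zed)
output_df = pd.DataFrame(output_list)
final_df = pd.concat([mydf, output_df], axis=1).drop(columns='arr')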
An alternative approach that doesn't use an explicit loop:
You could explode and str.get the abbr field; convert it to a list; build a DataFrame with it and join it back to df:
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'name': ['nick', 'tom'], 'arr': arr})
out = (df.join(pd.DataFrame((df['arr']
                             .explode()
                             .str.get('abbr')
                             .groupby(level=0)
                             .agg(list)
                             .tolist()),
                            columns=['t1', 't2'], index=df.index))
         .drop(columns='arr')
         .fillna(np.nan))
For the sample that works in your code:
name t1 t2
0 nick ORL ATL
1 tom NYK BOS
For the first sample that doesn't work:
name t1 t2
0 nick ORL NaN
1 tom NYK BOS
For the second sample that doesn't work:
name t1 t2
0 nick NaN ATL
1 tom NYK BOS

Pandas to_dict data structure, using column as dictionary index

This is just a very specific data structure transformation that I'm trying to achieve with pandas, so if you know how to do it, please share :)
Imagine I have a dataframe that looks like this
id  value  date
1   1      2021-04-01
1   5      2021-04-02
1   10     2021-04-03
2   3      2021-04-01
2   4      2021-04-02
2   11     2021-04-03
Now I want to transform this into an object, where the keys are the ids, and the values are arrays of information about that id. So it would look like this...
{
    '1': [
        {'value': 1, 'date': '2021-04-01'},
        {'value': 5, 'date': '2021-04-02'},
        {'value': 10, 'date': '2021-04-03'}
    ],
    '2': [
        {'value': 3, 'date': '2021-04-01'},
        {'value': 4, 'date': '2021-04-02'},
        {'value': 11, 'date': '2021-04-03'}
    ],
}
I imagine I have to use .to_dict() somehow, but I can't quite figure out how to do it?
Thoughts?
Edit: I've already figured out a brute-force way of doing it, I'm looking for something more elegant ;)
You can use groupby() on id and then apply() to_dict() on each group:
df.groupby('id').apply(lambda x: x[['value', 'date']].to_dict(orient='records')).to_dict()
{1: [{'value': 1, 'date': '2021-04-01'}, {'value': 5, 'date': '2021-04-02'}, {'value': 10, 'date': '2021-04-03'}], 2: [{'value': 3, 'date': '2021-04-01'}, {'value': 4, 'date': '2021-04-02'}, {'value': 11, 'date': '2021-04-03'}]}
You can use a list comprehension after converting the dataframe to a dict object.
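For example, a minimal sketch of that approach, with df being the dataframe from the question:

# One dict per row, then group the rows by id with a comprehension.
records = df.to_dict(orient='records')
out = {i: [{'value': r['value'], 'date': r['date']} for r in records if r['id'] == i]
       for i in df['id'].unique()}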
But here's a more pandas-native way, if your id column is a regular column of the dataframe:
df = df.set_index('id').T.to_dict()
If you meant id to be the index of the dataframe, just use:
df = df.T.to_dict()

Pandas json_normalize on recursively nested json

I have a json file with a deeply nested recursive structure:
{"children": [
"val" = x
"data" = y
"children": [{
"val" = x
"data" = y
"children": [{
....
"val" = x
"data" = y
"children": [{
"val" = x
"data" = y
"children": [{
....
Using pandas json_normalize as follows:
json_normalize(data = self.data["children"], record_path="children")
gives a dataframe where the first level is flattened, but the deeper levels remain json strings within the dataframe.
How can I flatten my dataframe such that the entire json tree is unpacked and flattened?
Provided your json is well formatted and has the same structure at all levels, you can extract all the data by passing a list of keys, one per level, to json_normalize.
from pandas import json_normalize

json = {'children': [{
    'val': 1,
    'data': 2,
    'children': [{
        'val': 3,
        'data': 4,
        'children': [{'val': 4, 'data': 5}],
    }],
}, {
    'val': 6,
    'data': 7,
    'children': [{
        'val': 8,
        'data': 9,
        'children': [{'val': 10, 'data': 11}],
    }],
}]}
for i in range(1, 4):
    print(json_normalize(data=json, record_path=['children'] * i))
This gives the following output, which you can recursively combine into a single DataFrame if you wish.
children data val
0 [{'val': 3, 'data': 4, 'children': [{'val': 4,... 2 1
1 [{'val': 8, 'data': 9, 'children': [{'val': 10... 7 6
children data val
0 [{'val': 4, 'data': 5}] 4 3
1 [{'val': 10, 'data': 11}] 9 8
data val
0 5 4
1 11 10
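If you want everything in one flat frame instead of one frame per level, a minimal sketch (not from the original answer) that walks the children tree recursively and records each node with its depth:

import pandas as pd

def walk(nodes, depth=0):
    # Yield one flat record per node in a nested 'children' tree.
    for node in nodes:
        yield {'val': node.get('val'), 'data': node.get('data'), 'depth': depth}
        yield from walk(node.get('children', []), depth + 1)

# `json` is the nested dict from the example above.
flat = pd.DataFrame(list(walk(json['children'])))
print(flat)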

Converting to Pandas MultiIndex

I have a dataframe of the form:
SpeciesName 0
0 A [[Year: 1, Quantity: 2],[Year: 3, Quantity: 4...]]
1 B [[Year: 1, Quantity: 7],[Year: 2, Quantity: 15...]]
2 C [[Year: 2, Quantity: 9],[Year: 4, Quantity: 13...]]
I'm attempting to try and create a MultiIndex that uses the SpeciesName and the year as the index:
SpeciesName Year
A 1 Data
2 Data
B 1 Data
2 Data
I have not been able to get pandas.MultiIndex(..) to work and my attempts at iterating through the dataset and manually creating a new object have not been very fruitful. Any insights would be greatly appreciated!
I'm going to assume your data is a list of dictionaries... because if I don't, what you've written makes no sense unless they are strings, and I don't want to parse strings.
df = pd.DataFrame([
    ['A', [dict(Year=1, Quantity=2), dict(Year=3, Quantity=4)]],
    ['B', [dict(Year=1, Quantity=7), dict(Year=2, Quantity=15)]],
    ['C', [dict(Year=2, Quantity=9), dict(Year=4, Quantity=13)]]
], columns=['SpeciesName', 0])
df
SpeciesName 0
0 A [{'Year': 1, 'Quantity': 2}, {'Year': 3, 'Quantity': 4}]
1 B [{'Year': 1, 'Quantity': 7}, {'Year': 2, 'Quantity': 15}]
2 C [{'Year': 2, 'Quantity': 9}, {'Year': 4, 'Quantity': 13}]
Then the solution is obvious
pd.DataFrame.from_records(
    *zip(*(
        [d, s]
        for s, l in zip(
            df['SpeciesName'], df[0].values.tolist())
        for d in l
    ))
).set_index('Year', append=True)
        Quantity
  Year
A 1            2
  3            4
B 1            7
  2           15
C 2            9
  4           13
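On pandas 0.25+ an alternative sketch, assuming the same list-of-dicts data as above: explode the list column, expand the dicts into columns, then set the MultiIndex.

import pandas as pd

# df is the frame built above: one row per species, column 0 holds a list of dicts.
exploded = df.explode(0)
out = (pd.concat([exploded['SpeciesName'].reset_index(drop=True),
                  pd.DataFrame(exploded[0].tolist())], axis=1)
         .set_index(['SpeciesName', 'Year']))
print(out)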
