Python pandas: grouping a dataframe by the unique values of a column

I have a dataframe in this format
A B
1990-02 1
1990-03 1
1999-05 1
1992-08 2
1996-12 2
2020-01 2
1990-05 3
1995-08 3
1999-11 3
2021-12 3
How can I convert this dataframe into groups based on the unique values of column B?
My results should be in this format:
[[[1990-02, 1],[1990-03, 1],[1999-05, 1]],
[[1992-08, 2],[1996-12, 2],[2020-01, 2]],
[[1990-05, 3],[1995-08, 3],[1999-11, 3],[2021-12, 3]]
]

This should do the job:
import pandas as pd
data = {"A": ["1990-02", "1990-03", "1999-05", "1992-08", "1996-12",
              "2020-01", "1990-05", "1995-08", "1999-11", "2021-12"],
        "B": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]}
df = pd.DataFrame(data=data)
out = df.groupby("B")['A'].apply(list)
output = [[[date, b_value] for date in block]
          for b_value, block in zip(out.index, out.values)]
print(output)

Here's one way to get an equivalent structure with arrays:
>>> df.groupby("B").apply(pd.DataFrame.to_numpy).values
[array([['1990-02', 1],
['1990-03', 1],
['1999-05', 1]], dtype=object)
array([['1992-08', 2],
['1996-12', 2],
['2020-01', 2]], dtype=object)
array([['1990-05', 3],
['1995-08', 3],
['1999-11', 3],
['2021-12', 3]], dtype=object)]
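If you prefer plain nested lists over NumPy arrays, each grouped array can simply be converted with .tolist(); a small sketch reusing the df built in the first answer (on pandas versions where groupby().apply() still includes the grouping column in each group frame, as the output above shows):
nested = [arr.tolist() for arr in df.groupby("B").apply(pd.DataFrame.to_numpy)]
print(nested)
This prints the same nested-list structure shown in the next answer.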

Here is one way to get exactly what you want:
df.assign(l=df.agg(list, axis=1)).groupby('B')['l'].agg(list).tolist()
output:
[[['1990-02', 1], ['1990-03', 1], ['1999-05', 1]],
[['1992-08', 2], ['1996-12', 2], ['2020-01', 2]],
[['1990-05', 3], ['1995-08', 3], ['1999-11', 3], ['2021-12', 3]]]

Related

How to apply Max function between rows on 2D list in pandas grouped dataframe

I have a dataframe similar to the following where "data" is a 2D array:
id grouping_val data
1 a [[0, 1], [1, 0]]
2 a [[1, 0], [0, 1]]
3 b [[2, 0], [3, 0]]
4 b [[0, 4], [4, 5]]
How can I group them by "grouping_val" and take the max value at each index in the "data" column across all the rows, resulting in the following dataframe:
id grouping_val data
1 a [[1, 1], [1, 1]]
2 b [[2, 4], [4, 5]]
You can np.stack() the grouped arrays and take their max() along axis=0:
import numpy as np

df = (df.groupby('grouping_val').data
        .apply(lambda x: np.stack(x).max(axis=0))
        .reset_index())
# grouping_val data
# 0 a [[1, 1], [1, 1]]
# 1 b [[2, 4], [4, 5]]
Alternatively, the element-wise maxima can be taken with plain zip and map:
df = (
    df.groupby("grouping_val")["data"]
    .apply(lambda x: [[*map(max, zip(*subl))] for subl in zip(*x)])
    .reset_index()
)
print(df)
Prints:
grouping_val data
0 a [[1, 1], [1, 1]]
1 b [[2, 4], [4, 5]]
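For completeness, here is a self-contained sketch of the np.stack() approach; the construction of the frame with list-valued cells is an assumption about how the original data was stored, and .tolist() is added so the result cells print as plain lists:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "grouping_val": ["a", "a", "b", "b"],
    "data": [[[0, 1], [1, 0]], [[1, 0], [0, 1]],
             [[2, 0], [3, 0]], [[0, 4], [4, 5]]],
})
out = (df.groupby("grouping_val")["data"]
         .apply(lambda x: np.stack(x).max(axis=0).tolist())  # stack per group, element-wise max
         .reset_index())
print(out)
This prints the same two-row result shown above.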

How to make a dataframe with one key to multiple list values to a dictionary in python?

I have a dataframe like this
ID A B
1 3 5
1 4 2
1 0 4
2 2 1
2 4 5
2 9 3
3 2 1
3 4 6
I tried code from other posts on Stack Overflow to convert them:
df.set_index('ID').T.to_dict('list')
But it returns only one list value for each ID:
{'1': [3,5], '2': [2,1], '3': [2,1]}
Is it possible to make a dict like this?
{'1': ([3,5],[4,2],[0,4]), '2': ([2,1],[4,5],[9,3]), '3': ([2,1],[4,6])}
The dictionary keys should be the IDs; every ID maps to a collection of lists, and every list contains two values.
In [150]: df.groupby('ID')[['A','B']].apply(lambda x: x.values.tolist()).to_dict()
Out[150]:
{'1': [[3, 5], [4, 2], [0, 4]],
'2': [[2, 1], [4, 5], [9, 3]],
'3': [[2, 1], [4, 6]]}
defaultdict
This is a good approach. It might have a for loop, require an import, and be multiple lines (all the things that discourage upvotes). But it is actually a good solution and very fast.
from collections import defaultdict
d = defaultdict(list)
for i, a, b in df.values.tolist():
    d[i].append([a, b])
dict(d)
{1: [[3, 5], [4, 2], [0, 4]], 2: [[2, 1], [4, 5], [9, 3]], 3: [[2, 1], [4, 6]]}
Alternative
Getting a tad creative with numpy.ndarray
BTW: please don't actually do this
pd.Series(
    df[['A', 'B']].values[:, None].tolist(),
    df.ID.values
).sum(level=0).to_dict()
{1: [[3, 5], [4, 2], [0, 4]], 2: [[2, 1], [4, 5], [9, 3]], 3: [[2, 1], [4, 6]]}
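If you specifically want the tuple-of-lists shape from the question, here is a hedged one-line sketch (assuming df holds the frame shown in the question, with integer IDs; int() is used only so the keys print as plain ints):
d = {int(k): tuple(v.values.tolist()) for k, v in df.groupby('ID')[['A', 'B']]}
print(d)
{1: ([3, 5], [4, 2], [0, 4]), 2: ([2, 1], [4, 5], [9, 3]), 3: ([2, 1], [4, 6])}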

Extend lists within a pandas Series

I have a pandas series that looks like this:
group
A [1,0,5,4,6,...]
B [2,2,0,1,9,...]
C [3,5,2,0,6,...]
I have a similar series that I would like to add to the existing series by extending each of the lists. How can I do this?
I tried
for x in series:
    x.extend(series[series.index[x]])
but this isn't working.
Consider the series s
s = pd.Series([[1, 0], [2, 2], [4, 1]], list('ABC'), name='group')
s
A [1, 0]
B [2, 2]
C [4, 1]
Name: group, dtype: object
You can extend each list with a similar series simply by adding them. pandas will use the underlying object's __add__ method to combine the pairwise elements. In the case of a list, __add__ concatenates the lists.
s + s
A [1, 0, 1, 0]
B [2, 2, 2, 2]
C [4, 1, 4, 1]
Name: group, dtype: object
However, this would not work if the elements were numpy.array:
s = pd.Series([[1, 0], [2, 2], [4, 1]], list('ABC'), name='group')
s = s.apply(np.array)
In this case, I'd make sure they are lists:
s.apply(list) + s.apply(list)
A [1, 0, 1, 0]
B [2, 2, 2, 2]
C [4, 1, 4, 1]
Name: group, dtype: object
Solution with add function (borrowed data sample from piRSquared):
s1 = s.add(s)
print (s1)
A [1, 0, 1, 0]
B [2, 2, 2, 2]
C [4, 1, 4, 1]
Name: group, dtype: object
EDIT:
If some index values are different, it is more complicated, because you need to reindex to the union of all index values and replace NaN with empty lists using combine_first:
s = pd.Series([[1, 0], [2, 2], [4, 1]], list('ABC'), name='group')
s1 = pd.Series([[3, 9], [6, 4]], list('AD'), name='group')
idx = s.index.union(s1.index)
s = s.reindex(idx).combine_first(pd.Series([[]], index=idx))
s1 = s1.reindex(idx).combine_first(pd.Series([[]], index=idx))
s2 = s.add(s1)
print (s2)
A [1, 0, 3, 9]
B [2, 2]
C [4, 1]
D [6, 4]
Name: group, dtype: object
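A hedged alternative sketch for the mismatched-index case: concatenate both series and join the per-key lists in a groupby; sum(x, []) is plain Python list concatenation, which can get slow for many long lists:
s = pd.Series([[1, 0], [2, 2], [4, 1]], list('ABC'), name='group')
s1 = pd.Series([[3, 9], [6, 4]], list('AD'), name='group')
s2 = pd.concat([s, s1]).groupby(level=0).apply(lambda x: sum(x, []))
This produces the same s2 as printed above, without building the intermediate empty-list series.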

Updating a NumPy array with another

Seemingly simple question: I have an array with two columns, the first represents an ID and the second a count. I'd like to update it with another, similar array such that
import numpy as np
a = np.array([[1, 2],
[2, 2],
[3, 1],
[4, 5]])
b = np.array([[2, 2],
[3, 1],
[4, 0],
[5, 3]])
a.update(b) # ????
>>> np.array([[1, 2],
[2, 4],
[3, 2],
[4, 5],
[5, 3]])
Is there a way to do this with indexing/slicing such that I don't simply have to iterate over each row?
Generic case
Approach #1: You can use np.add.at to do such an ID-based adding operation like so -
# First column of output array as the union of first columns of a,b
out_id = np.union1d(a[:,0],b[:,0])
# Initialize second column of output array
out_count = np.zeros_like(out_id)
# Find indices where the first columns of a,b are placed in out_id
_,a_idx = np.where(a[:,None,0]==out_id)
_,b_idx = np.where(b[:,None,0]==out_id)
# Place second column of a into out_count & add in second column of b
out_count[a_idx] = a[:,1]
np.add.at(out_count, b_idx,b[:,1])
# Stack the ID and count arrays into a 2-column format
out = np.column_stack((out_id,out_count))
To find a_idx and b_idx, np.searchsorted could be used as a probably faster alternative, like so -
a_idx = np.searchsorted(out_id, a[:,0], side='left')
b_idx = np.searchsorted(out_id, b[:,0], side='left')
Sample input-output :
In [538]: a
Out[538]:
array([[1, 2],
[4, 2],
[3, 1],
[5, 5]])
In [539]: b
Out[539]:
array([[3, 7],
[1, 1],
[4, 0],
[2, 3],
[6, 2]])
In [540]: out
Out[540]:
array([[1, 3],
[2, 3],
[3, 8],
[4, 2],
[5, 5],
[6, 2]])
Approach #2: You can use np.bincount to do the same ID based adding -
# First column of output array as the union of first columns of a,b
out_id = np.union1d(a[:,0],b[:,0])
# Get all IDs and counts in a single arrays
id_arr = np.concatenate((a[:,0],b[:,0]))
count_arr = np.concatenate((a[:,1],b[:,1]))
# Get binned summations
summed_vals = np.bincount(id_arr,count_arr)
# Get mask of valid bins
mask = np.in1d(np.arange(np.max(out_id)+1),out_id)
# Mask valid summed bins for final counts array output
out_count = summed_vals[mask]
# Stack the ID and count arrays into a 2-column format
out = np.column_stack((out_id,out_count))
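For reference, a self-contained run of Approach #2 on the a and b from the question; the .astype(int) cast is an addition here, because np.bincount returns floats when given weights:
import numpy as np

a = np.array([[1, 2], [2, 2], [3, 1], [4, 5]])
b = np.array([[2, 2], [3, 1], [4, 0], [5, 3]])

out_id = np.union1d(a[:, 0], b[:, 0])
id_arr = np.concatenate((a[:, 0], b[:, 0]))
count_arr = np.concatenate((a[:, 1], b[:, 1]))
summed_vals = np.bincount(id_arr, count_arr)             # float result
mask = np.in1d(np.arange(np.max(out_id) + 1), out_id)    # keep only IDs that actually occur
out = np.column_stack((out_id, summed_vals[mask].astype(int)))
print(out)
[[1 2]
 [2 4]
 [3 2]
 [4 5]
 [5 3]]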
Specific case
If the ID columns in a and b are sorted, it becomes easier, as we can just use masks with np.in1d to index into the output ID array created with np.union1d, like so -
# First column of output array as the union of first columns of a,b
out_id = np.union1d(a[:,0],b[:,0])
# Masks of first columns of a and b matches in the output ID array
mask1 = np.in1d(out_id,a[:,0])
mask2 = np.in1d(out_id,b[:,0])
# Initialize second column of output array
out_count = np.zeros_like(out_id)
# Place second column of a into out_count & add in second column of b
out_count[mask1] = a[:,1]
np.add.at(out_count, np.where(mask2)[0],b[:,1])
# Stack the ID and count arrays into a 2-column format
out = np.column_stack((out_id,out_count))
Sample run -
In [552]: a
Out[552]:
array([[1, 2],
[2, 2],
[3, 1],
[4, 5],
[8, 5]])
In [553]: b
Out[553]:
array([[2, 2],
[3, 1],
[4, 0],
[5, 3],
[6, 2],
[8, 2]])
In [554]: out
Out[554]:
array([[1, 2],
[2, 4],
[3, 2],
[4, 5],
[5, 3],
[6, 2],
[8, 7]])
>>> col=np.unique(np.hstack((b[:,0],a[:,0])))
>>> dif=np.setdiff1d(col,a[:,0])
>>> val=b[np.in1d(b[:,0],dif)]
>>> result=np.concatenate((a,val))
array([[1, 2],
[2, 2],
[3, 1],
[4, 5],
[5, 3]])
Note that if you want the result to be sorted, you can use np.lexsort:
result[np.lexsort((result[:,0],result[:,0]))]
Explanation :
First you can find the unique ids with following command :
>>> col=np.unique(np.hstack((b[:,0],a[:,0])))
>>> col
array([1, 2, 3, 4, 5])
Then find the difference between the IDs of a and all of the IDs:
>>> dif=np.setdiff1d(col,a[:,0])
>>> dif
array([5])
Then find the items within b whose IDs are in dif:
>>> val=b[np.in1d(b[:,0],dif)]
>>> val
array([[5, 3]])
Finally, concatenate val with the array a:
>>> np.concatenate((a,val))
Consider another example, with sorting:
>>> a = np.array([[1, 2],
... [2, 2],
... [3, 1],
... [7, 5]])
>>>
>>> b = np.array([[2, 2],
... [3, 1],
... [4, 0],
... [5, 3]])
>>>
>>> col=np.unique(np.hstack((b[:,0],a[:,0])))
>>> dif=np.setdiff1d(col,a[:,0])
>>> val=b[np.in1d(b[:,0],dif)]
>>> result=np.concatenate((a,val))
>>> result[np.lexsort((result[:,0],result[:,0]))]
array([[1, 2],
[2, 2],
[3, 1],
[4, 0],
[5, 3],
[7, 5]])
This is an old question, but here is a solution with pandas (which can be generalized to aggregation functions other than sum). Sorting also happens automatically:
import pandas as pd
import numpy as np
a = np.array([[1, 2],
[2, 2],
[3, 1],
[4, 5]])
b = np.array([[2, 2],
[3, 1],
[4, 0],
[5, 3]])
print((pd.DataFrame(a[:, 1], index=a[:, 0])
       .add(pd.DataFrame(b[:, 1], index=b[:, 0]), fill_value=0)
       .astype(int))
      .reset_index()
      .to_numpy())
Output:
[[1 2]
[2 4]
[3 2]
[4 5]
[5 3]]
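As noted above, the pattern generalizes to other aggregations. Here is a hedged sketch using max instead of sum, switching to concat plus groupby because DataFrame.add is specific to addition (reusing the a and b defined above):
print(pd.concat([pd.DataFrame(a[:, 1], index=a[:, 0]),
                 pd.DataFrame(b[:, 1], index=b[:, 0])])
        .groupby(level=0).max()
        .reset_index()
        .to_numpy())
Output:
[[1 2]
 [2 2]
 [3 1]
 [4 5]
 [5 3]]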

Conditional filtering in numpy arrays or pandas DataFrame

Assume I have the following data, which can be either a numpy array or a pandas DataFrame:
array([[4092, 3],
[4095, 4],
[4097, 4],
[4124, 1],
[4128, 0],
[4129, 0],
[4131, 5],
[4132, 5],
[4133, 2],
[4134, 2]], dtype=int64)
I would like to get an array containing the minimal values in each category (2nd column). I could loop over each unique value, perform the min operation, and store the results, but I was wondering whether there is a faster and cleaner way to do it.
The output would look like the following:
array([[4092, 3],
[4095, 4],
[4124, 1],
[4128, 0],
[4131, 5],
[4133, 2]], dtype=int64)
In pandas it would be done by performing a groupby and then calling min() on the 1st column. Here my df has column names 0 and 1; I then call reset_index to restore the grouped index back as a column. As the column ordering is now a bit messed up, I use .loc and 'fancy indexing' to get the order you desire:
In [22]:
result = df.groupby(1)[0].min().reset_index()
result.loc[:,[0,1]]
Out[22]:
0 1
0 4128 0
1 4124 1
2 4133 2
3 4092 3
4 4095 4
5 4131 5
The above method is vectorised; as such, it will be much faster and scale much better than iterating over each row.
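Since the data may also be a plain NumPy array, here is a hedged NumPy-only sketch of the same grouped-minimum idea, assuming a is the array from the question (constructed below): sort by category and then by value with np.lexsort, and keep the first row of each category block, which is that category's minimum.
import numpy as np

order = np.lexsort((a[:, 0], a[:, 1]))   # sort by category (col 1), then by value (col 0)
srt = a[order]
is_first = np.concatenate(([True], srt[1:, 1] != srt[:-1, 1]))  # first row of each category block
result = srt[is_first]
result then holds the same six rows as the pandas output above, ordered by category.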
I created the dataframe using the following code:
In [4]:
import numpy as np
a = np.array([[4092, 3],
[4095, 4],
[4097, 4],
[4124, 1],
[4128, 0],
[4129, 0],
[4131, 5],
[4132, 5],
[4133, 2],
[4134, 2]], dtype=np.int64)
a
Out[4]:
array([[4092, 3],
[4095, 4],
[4097, 4],
[4124, 1],
[4128, 0],
[4129, 0],
[4131, 5],
[4132, 5],
[4133, 2],
[4134, 2]], dtype=int64)
In [23]:
import pandas as pd
df = pd.DataFrame(a)
df
Out[23]:
0 1
0 4092 3
1 4095 4
2 4097 4
3 4124 1
4 4128 0
5 4129 0
6 4131 5
7 4132 5
8 4133 2
9 4134 2
