Pandas groupby on a column of lists - python

I have a pandas dataframe with a column that contains lists:
df = pd.DataFrame({'List': [['once', 'upon'], ['once', 'upon'], ['a', 'time'], ['there', 'was'], ['a', 'time']], 'Count': [2, 3, 4, 1, 2]})
Count List
2 [once, upon]
3 [once, upon]
4 [a, time]
1 [there, was]
2 [a, time]
How can I combine the duplicate List values and sum their Counts? The expected result is:
Count List
5 [once, upon]
6 [a, time]
1 [there, was]
I've tried:
df.groupby('List')['Count'].sum()
which results in:
TypeError: unhashable type: 'list'

One way is to convert to tuples first. This is because pandas.groupby requires keys to be hashable. Tuples are immutable and hashable, but lists are not.
res = df.groupby(df['List'].map(tuple))['Count'].sum()
Result:
List
(a, time) 6
(once, upon) 5
(there, was) 1
Name: Count, dtype: int64
If you need the result as lists in a dataframe, you can convert back:
res = df.groupby(df['List'].map(tuple))['Count'].sum().reset_index()
res['List'] = res['List'].map(list)
# List Count
# 0 [a, time] 6
# 1 [once, upon] 5
# 2 [there, was] 1
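As a quick side check (not part of the original answer), the root cause is easy to reproduce: a tuple hashes fine while a list raises the same TypeError that groupby surfaced:
hash(('once', 'upon'))        # works: tuples are hashable
try:
    hash(['once', 'upon'])    # lists are not
except TypeError as e:
    print(e)                  # unhashable type: 'list'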

Related

Extract all elements from sets in pandas DataFrame

I have a pandas DataFrame where each cell is a set of numbers. I would like to go through the DataFrame and run each number along with the row index in a function. What's the most pandas-esque and efficient way to do this? Here's an example of one way to do it with for-loops, but I'm hopeful that there's a better approach.
import pandas as pd

def my_func(a, b):
    pass

d = {"a": [{1}, {4}], "b": [{1, 2, 3}, {2}]}
df = pd.DataFrame(d)

for index, item in df.iterrows():
    for j in item:
        for a in list(j):
            my_func(index, a)
Instead of iterating we can reshape the values into 1 column using stack then explode into separate rows:
s = df.stack().explode()
s:
0  a    1
   b    1
   b    2
   b    3
1  a    4
   b    2
dtype: object
We can further droplevel if we don't want the old column labels:
s = df.stack().explode().droplevel(1)
s:
0 1
0 1
0 2
0 3
1 4
1 2
dtype: object
reset_index can be used to create a DataFrame instead of a Series:
new_df = df.stack().explode().droplevel(1).reset_index()
new_df.columns = ['a', 'b'] # Rename columns to whatever
new_df:
a b
0 0 1
1 0 1
2 0 2
3 0 3
4 1 4
5 1 2
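To tie this back to the original loop, the exploded Series can drive my_func directly, replacing the three nested loops with one. A minimal sketch (my_func is the placeholder from the question):
import pandas as pd

def my_func(a, b):            # placeholder from the question
    print(a, b)

d = {"a": [{1}, {4}], "b": [{1, 2, 3}, {2}]}
df = pd.DataFrame(d)

s = df.stack().explode().droplevel(1)   # row index -> one value per row
for index, value in s.items():          # single flat loop
    my_func(index, value)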
If I fully understood your problem, this might be one way of doing it:
[list(item) for sublist in df.values.tolist() for item in sublist]
The output will look like this:
[[1], [1, 2, 3], [4], [2]]
Since this is a nested list, you can flatten it if your requirement is a single list.
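For example, a small sketch of that flattening step using itertools.chain (building on the line above):
from itertools import chain
import pandas as pd

df = pd.DataFrame({"a": [{1}, {4}], "b": [{1, 2, 3}, {2}]})
nested = [list(item) for sublist in df.values.tolist() for item in sublist]
flat = list(chain.from_iterable(nested))
print(flat)   # [1, 1, 2, 3, 4, 2]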

how do you filter a Pandas dataframe by a multi-column set?

Is there a way to filter a large dataframe by comparing multiple columns against a set of tuples where each element in the tuple corresponds to a different column value?
For example, is there a .isin() method that compares multiple columns of the DataFrame against a set of tuples?
Example:
df = pd.DataFrame({
    'a': [1, 1, 1],
    'b': [2, 2, 0],
    'c': [3, 3, 3],
    'd': ['not', 'relevant', 'column'],
})
# Filter the DataFrame by checking if the values in columns [a, b, c] match any tuple in value_set
value_set = set([(1,2,3), (1, 1, 1)])
new_df = ?? # should contain just the first two rows of df
You can use Series.isin, but first it is necessary to create tuples from the first 3 columns:
print (df[df[['a','b','c']].apply(tuple, axis=1).isin(value_set)])
Or convert columns to index and use Index.isin:
print (df[df.set_index(['a','b','c']).index.isin(value_set)])
a b c d
0 1 2 3 not
1 1 2 3 relevant
Another idea is to use an inner join with DataFrame.merge and a helper DataFrame that has the same 3 column names; the on parameter can be omitted because the join defaults to the intersection of the column names of both DataFrames:
print (df.merge(pd.DataFrame(list(value_set), columns=['a','b','c'])))
a b c d
0 1 2 3 not
1 1 2 3 relevant
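For reference, a self-contained sketch that runs all three approaches on the question's data and checks they agree (value_set is wrapped in list() for the DataFrame constructor, since the constructor does not accept a plain set):
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 1],
    'b': [2, 2, 0],
    'c': [3, 3, 3],
    'd': ['not', 'relevant', 'column'],
})
value_set = {(1, 2, 3), (1, 1, 1)}

mask_tuples = df[['a', 'b', 'c']].apply(tuple, axis=1).isin(value_set)
mask_index = df.set_index(['a', 'b', 'c']).index.isin(value_set)
merged = df.merge(pd.DataFrame(list(value_set), columns=['a', 'b', 'c']))

assert df[mask_tuples].equals(df[mask_index])
print(merged)   # the first two rows of df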

For each item in list L, find all of the corresponding items in a dataframe

I'm looking for a fast solution to this Python problem:
- For each item in list L, find all of the corresponding items in a dataframe column (df['col1']).
The catch is that both L and df['col1'] may contain duplicate values, and all duplicates should be returned.
For example:
L = [1,4,1]
d = {'col1': [1,2,3,4,1,4,4], 'col2': ['a','b','c','d','e','f','g']}
df = pd.DataFrame(data=d)
The desired output would be a new DataFrame where df['col1'] contains the values:
[1,1,1,1,4,4,4]
and rows are duplicated accordingly. Note that 1 appears 4 times (twice in L * twice in df)
I have found that the obvious solutions like .isin() don't work because they drop duplicates.
A list comprehension does work, but it is too slow for my real-life problem, where len(df) = 16 million and len(L) = 150,000:
idx = [y for x in L for y in df.index[df['col1'].values == x]]
res = df.loc[idx].reset_index(drop=True)
This is basically just a problem of comparing two lists (with a bit of dataframe indexing difficulty tacked on), and a clever and very fast solution by Mad Physicist almost works for this, except that duplicates in L are dropped (it returns [1, 4, 1, 4, 4] in the example above; i.e., it finds the duplicates in df but ignores the duplicates in L).
train = np.array([...]) # my df['col1']
keep = np.array([...]) # my list L
keep.sort()
ind = np.searchsorted(keep, train, side='left')
ind[ind == keep.size] -= 1
train_keep = train[keep[ind] == train]
I'd be grateful for any ideas.
Initial data:
L = [1,4,1]
df = pd.DataFrame({'col':[1,2,3,4,1,4,4] })
You can create a dataframe from L:
df2 = pd.DataFrame({'col':L})
and merge it with the initial dataframe:
result = df.merge(df2, how='inner', on='col')
print(result)
Result:
col
0 1
1 1
2 1
3 1
4 4
5 4
6 4
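The same merge idea carries over to the question's original two-column frame; the extra columns travel along with the matched rows. A sketch (using col1/col2 from the question):
import pandas as pd

L = [1, 4, 1]
df = pd.DataFrame({'col1': [1, 2, 3, 4, 1, 4, 4],
                   'col2': ['a', 'b', 'c', 'd', 'e', 'f', 'g']})

# Duplicates on both sides are kept: each of the two 1s in L matches
# both 1-rows in df, and the single 4 matches all three 4-rows.
res = pd.DataFrame({'col1': L}).merge(df, on='col1', how='inner')
print(res)   # 7 rows: four with col1 == 1, three with col1 == 4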
IIUC try:
L = [1,4,1]
pd.concat([df.loc[df['col'].eq(el), 'col'] for el in L], axis=0)
(Not sure how you want the indexes; the above returns the result in a somewhat raw format.)
Output:
0 1
4 1
3 4
5 4
6 4
0 1
4 1
Name: col, dtype: int64
Reindexed:
pd.concat([df.loc[df['col'].eq(el), 'col'] for el in L], axis=0).reset_index(drop=True)
#output:
0 1
1 1
2 4
3 4
4 4
5 1
6 1
Name: col, dtype: int64

count unique lists in dataframe

I have a pandas dataframe with a column of lists, and I would like to find a way to return a dataframe with the lists in one column and the total counts in another. My problem is finding a way to add together lists that contain the same values; for example, I want the counts for ['a', 'b'] and ['b', 'a'] to be combined in the end.
So for example the dataframe:
Lists Count
['a','b'] 2
['a','c'] 4
['b','a'] 3
would return:
Lists Count
['a','b'] 5
['a','c'] 4
Lists are unhashable, so sort and convert to tuples:
In [80]: df
Out[80]:
count lists
0 2 [a, b]
1 4 [a, c]
2 3 [b, a]
In [82]: df['lists'] = df['lists'].map(lambda x: tuple(sorted(x)))
In [83]: df
Out[83]:
count lists
0 2 (a, b)
1 4 (a, c)
2 3 (a, b)
In [76]: df.groupby('lists').sum()
Out[76]:
count
lists
(a, b) 5
(a, c) 4
You can also use sets (after coercing them to strings).
df = pd.DataFrame({'Lists': [['a', 'b'], ['a', 'c'], ['b', 'a']],
                   'Value': [2, 4, 3]})
df['Sets'] = df.Lists.apply(set).astype(str)
>>> df.groupby(df.Sets).Value.sum()
Sets
set(['a', 'b']) 5
set(['a', 'c']) 4
Name: Value, dtype: int64
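Combining the two ideas, a minimal end-to-end sketch (column names follow the question) that sorts each list into a hashable key and converts back to lists at the end:
import pandas as pd

df = pd.DataFrame({'Lists': [['a', 'b'], ['a', 'c'], ['b', 'a']],
                   'Count': [2, 4, 3]})

key = df['Lists'].map(lambda x: tuple(sorted(x)))   # order-insensitive, hashable
out = df.groupby(key)['Count'].sum().reset_index()
out['Lists'] = out['Lists'].map(list)               # back to lists if needed
print(out)
#     Lists  Count
# 0  [a, b]      5
# 1  [a, c]      4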

Selecting rows from pandas DataFrame using a list

I have a list of lists as below
[[1, 2], [1, 3]]
The DataFrame is similar to
A B C
0 1 2 4
1 0 1 2
2 1 3 0
I would like a DataFrame containing only the rows where the value in column A equals the first element of any of the nested lists and the value in column B of the same row equals the second element of that same nested list.
Thus the resulting DataFrame should be
A B C
0 1 2 4
2 1 3 0
The code below does what you need:
import pandas

tmp_filter = pandas.DataFrame(None)  # The dataframe you want

# Create your list and your dataframe
tmp_list = [[1, 2], [1, 3]]
tmp_df = pandas.DataFrame([[1, 2, 4], [0, 1, 2], [1, 3, 0]], columns=['A', 'B', 'C'])

# This function goes through the df column by column and
# only keeps the rows with the values you want
def pass_true_df(df, cond):
    for i, c in enumerate(cond):
        df = df[df.iloc[:, i] == c]
    return df

# Pass through your list and add the rows you want to keep
for i in tmp_list:
    tmp_filter = pandas.concat([tmp_filter, pass_true_df(tmp_df, i)])
import pandas

df = pandas.DataFrame([[1, 2, 4], [0, 1, 2], [1, 3, 0], [0, 2, 5], [1, 4, 0]],
                      columns=['A', 'B', 'C'])
filt = pandas.DataFrame([[1, 2], [1, 3], [0, 2]],
                        columns=['A', 'B'])
accum = []
# grouped to-filter
data_g = df.groupby('A')
for k2, v2 in data_g:
    accum.append(v2[v2.B.isin(filt.B[filt.A == k2])])
print(pandas.concat(accum))
result:
A B C
3 0 2 5
0 1 2 4
2 1 3 0
(I made the data and filter a little more complicated as a test.)
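This is essentially the same problem as the multi-column-set question above, so the tuple/isin trick works here too. A minimal sketch (not from the answers above):
import pandas

df = pandas.DataFrame([[1, 2, 4], [0, 1, 2], [1, 3, 0]], columns=['A', 'B', 'C'])
pairs = [[1, 2], [1, 3]]

# Build an (A, B) tuple per row and test membership against the pairs.
mask = df[['A', 'B']].apply(tuple, axis=1).isin([tuple(p) for p in pairs])
print(df[mask])
#    A  B  C
# 0  1  2  4
# 2  1  3  0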
