count unique lists in dataframe - python

I have a pandas dataframe with a column of lists, and I would like to find a way to return a dataframe with the lists in one column and the total counts in another. My problem is finding a way to add together lists that contain the same values; for example, I want to combine the counts of ['a', 'b'] and ['b', 'a'] in the end.
So for example the dataframe:
Lists Count
['a','b'] 2
['a','c'] 4
['b','a'] 3
would return:
Lists Count
['a','b'] 5
['a','c'] 4

Lists are unhashable, so sort each one and convert it to a tuple:
In [80]: df
Out[80]:
count lists
0 2 [a, b]
1 4 [a, c]
2 3 [b, a]
In [82]: df['lists'] = df['lists'].map(lambda x: tuple(sorted(x)))
In [83]: df
Out[83]:
count lists
0 2 (a, b)
1 4 (a, c)
2 3 (a, b)
In [76]: df.groupby('lists').sum()
Out[76]:
count
lists
(a, b) 5
(a, c) 4
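If you'd rather not overwrite the lists column, the normalization and the groupby can be combined in one expression. A minimal, self-contained sketch of the same approach:

import pandas as pd

df = pd.DataFrame({'lists': [['a', 'b'], ['a', 'c'], ['b', 'a']],
                   'count': [2, 4, 3]})

# Normalize each list to a sorted tuple (hashable), then group and sum.
out = (df.groupby(df['lists'].map(lambda x: tuple(sorted(x))))['count']
         .sum()
         .reset_index())
print(out)
#     lists  count
# 0  (a, b)      5
# 1  (a, c)      4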

You can also use sets (after coercing them to strings).
df = pd.DataFrame({'Lists': [['a', 'b'], ['a', 'c'], ['b', 'a']],
                   'Value': [2, 4, 3]})
df['Sets'] = df.Lists.apply(set).astype(str)
>>> df.groupby(df.Sets).Value.sum()
Sets
set(['a', 'b']) 5
set(['a', 'c']) 4
Name: Value, dtype: int64
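A variant of the same idea: frozenset is hashable, so you can group on it directly, without the string coercion. A sketch; like the set approach, it treats ['a', 'b'] and ['b', 'a'] as the same key and ignores duplicates:

# frozenset is hashable, so it can serve as a group key directly.
df.groupby(df.Lists.apply(frozenset)).Value.sum()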

Related

How to concatenate two lists into pandas DataFrame?

Hey, I have two different lists.
The first is a list of strings:
['A',
'B',
'C',
'D',
'E']
The second list contains tuples of floats:
[(-0.07154222477384509, 0.03681057318023705),
(-0.23678194754416643, 3.408617573881597e-12),
(-0.24277881018771763, 6.991906304566735e-13),
(-0.16858465905189185, 7.569580517034595e-07),
(-0.21850787663602167, 1.1718560531238815e-10)]
I want one DataFrame with three columns that looks like this:
var_name val1 val2
A -0.07154222477384509 0.03681057318023705
Ideally the new DataFrame wouldn't use scientific notation, and I don't want the values stored as strings.
Use a list comprehension with zip to build a list of tuples, and pass it to the DataFrame constructor:
import pandas as pd

a = ['A',
'B',
'C',
'D',
'E']
b = [(-0.07154222477384509, 0.03681057318023705),
(-0.23678194754416643, 3.408617573881597e-12),
(-0.24277881018771763, 6.991906304566735e-13),
(-0.16858465905189185, 7.569580517034595e-07),
(-0.21850787663602167, 1.1718560531238815e-10)]
df = pd.DataFrame([(a, *b) for a, b in zip(a,b)])
print (df)
0 1 2
0 A -0.071542 3.681057e-02
1 B -0.236782 3.408618e-12
2 C -0.242779 6.991906e-13
3 D -0.168585 7.569581e-07
4 E -0.218508 1.171856e-10
With the column names set:
df = pd.DataFrame([(a, *b) for a, b in zip(a,b)],
columns=['var_name','val1','val2'])
print (df)
var_name val1 val2
0 A -0.071542 3.681057e-02
1 B -0.236782 3.408618e-12
2 C -0.242779 6.991906e-13
3 D -0.168585 7.569581e-07
4 E -0.218508 1.171856e-10
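Since b is already a list of two-element tuples, the comprehension can be skipped entirely. A sketch reusing the a and b lists from above:

# b already holds (val1, val2) pairs, so it can feed the constructor
# directly; then insert the names as the first column.
df = pd.DataFrame(b, columns=['val1', 'val2'])
df.insert(0, 'var_name', a)
print(df)
#   var_name      val1          val2
# 0        A -0.071542  3.681057e-02
# ...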

how to groupby and join multiple rows from multiple columns at a time?

I want to know how to group by a single column and join the strings from multiple columns in each row.
Here's an example dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['a', 'a', 'b', 'b'], [1, 1, 2, 2],
                            ['k', 'l', 'm', 'n']]).T,
                  columns=['a', 'b', 'c'])
print(df)
a b c
0 a 1 k
1 a 1 l
2 b 2 m
3 b 2 n
I've tried something like:
df.groupby(['b', 'a'])['c'].apply(','.join).reset_index()
b a c
0 1 a k,l
1 2 b m,n
But that is not my required output.
Desired output:
b a c
0 1 a,a k,l
1 2 b,b m,n
How can I achieve this? I need a scalable solution because I'm dealing with millions of rows.
I think you need to group by the b column only and then, if necessary, pass a list of columns to GroupBy.agg:
df1 = df.groupby('b')[['a','c']].agg(','.join).reset_index()
# alternative if you want to join all columns except b
# df1 = df.groupby('b').agg(','.join).reset_index()
print (df1)
b a c
0 1 a,a k,l
1 2 b,b m,n
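If different columns eventually need different treatment, agg also accepts a dict mapping each column to its own function. A sketch on the same df, with the same output:

# Per-column functions; here both columns happen to use ','.join.
df1 = df.groupby('b').agg({'a': ','.join, 'c': ','.join}).reset_index()
print(df1)
#    b    a    c
# 0  1  a,a  k,l
# 1  2  b,b  m,n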

pandas filter Series with a list

I have a Series and a list like this
>>> import pandas as pd
>>> s = pd.Series(data=[1, 2, 3, 4], index=['A', 'B', 'C', 'D'])
>>> filter_list = ['A', 'C', 'D']
>>> print(s)
A 1
B 2
C 3
D 4
How can I create a new Series with row B removed using s and filter_list?
I mean I want to create a Series new_s with the following content
>>> print(new_s)
A 1
C 3
D 4
s.isin(filter_list) doesn't work, because I want to filter based on the index of the Series, not its values.
Use Series.loc if all the values in the list exist in the index:
new_s = s.loc[filter_list]
print (new_s)
A 1
C 3
D 4
dtype: int64
If some of the values might not exist in the index, use Index.intersection, or isin as in @Yusuf Baktir's solution:
filter_list = ['A', 'C', 'D', 'E']
new_s = s.loc[s.index.intersection(filter_list)]
print (new_s)
A 1
C 3
D 4
dtype: int64
Another alternative is numpy.in1d:
import numpy as np

filter_list = ['A', 'C', 'D', 'E']
new_s = s[np.in1d(s.index, filter_list)]
print (new_s)
A 1
C 3
D 4
dtype: int64
Basically, those are index values, so filtering on the index will work:
s[s.index.isin(filter_list)]
for i in filter_list:
    print(i, s[i])
A 1
C 3
D 4
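The same selection can also be phrased from the other direction, by dropping the complement. A sketch that, like Index.intersection, tolerates labels in filter_list that are missing from the index:

# Drop every index label that is *not* in filter_list.
new_s = s.drop(s.index.difference(filter_list))
print(new_s)
# A    1
# C    3
# D    4
# dtype: int64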

Pandas groupby on a column of lists

I have a pandas dataframe with a column that contains lists:
df = pd.DataFrame({'List': [['once', 'upon'], ['once', 'upon'], ['a', 'time'],
                            ['there', 'was'], ['a', 'time']],
                   'Count': [2, 3, 4, 1, 2]})
Count List
2 [once, upon]
3 [once, upon]
4 [a, time]
1 [there, was]
2 [a, time]
How can I combine the List columns and sum the Count columns? The expected result is:
Count List
5 [once, upon]
6 [a, time]
1 [there, was]
I've tried:
df.groupby('List')['Count'].sum()
which results in:
TypeError: unhashable type: 'list'
One way is to convert the lists to tuples first. This is because groupby requires its keys to be hashable; tuples are immutable and hashable, but lists are not.
res = df.groupby(df['List'].map(tuple))['Count'].sum()
Result:
List
(a, time) 6
(once, upon) 5
(there, was) 1
Name: Count, dtype: int64
If you need the result as lists in a dataframe, you can convert back:
res = df.groupby(df['List'].map(tuple))['Count'].sum().reset_index()
res['List'] = res['List'].map(list)
#            List  Count
# 0     [a, time]      6
# 1  [once, upon]      5
# 2  [there, was]      1
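If the list elements are all strings, another workable key is the joined string. A sketch; note the list structure is lost in the resulting index:

# Series.str.join concatenates the elements of each list.
res = df.groupby(df['List'].str.join(' '))['Count'].sum()
# List
# a time       6
# once upon    5
# there was    1
# Name: Count, dtype: int64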

For every row in Pandas dataframe determine if a column value exists in another column

I have a pandas data frame like this:
df = pd.DataFrame({'category': ['A', 'B', 'C', 'A'],
                   'category_pred': [['A'], ['B', 'D'], ['A', 'B', 'C'], ['D']]})
print(df)
category category_pred
0 A [A]
1 B [B, D]
2 C [A, B, C]
3 A [D]
I would like to have an output like this:
category category_pred count
0 A [A] 1
1 B [B, D] 1
2 C [A, B, C] 1
3 A [D] 0
That is, for every row, determine if the value in 'category' appears in 'category_pred'. Note that 'category_pred' can contain multiple values.
I can do a for-loop like this one, but it is really slow:
for i in df.index:
    if df.category[i] in df.category_pred[i]:
        df['count'][i] = 1
I am looking for an efficient way to do this operation. Thanks!
You can make use of the DataFrame's apply method.
df['count'] = df.apply(lambda x: 1 if x.category in x.category_pred else 0, axis=1)
This will add the new column as you want.
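If speed matters on larger frames, a plain list comprehension over the zipped columns does the same check without the per-row overhead of apply. A sketch on the same df:

# Avoids apply(axis=1), which builds a Series object for every row.
df['count'] = [int(cat in preds)
               for cat, preds in zip(df['category'], df['category_pred'])]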
