Is there a way to groupby only groups with 2 or more rows?
Or can I delete groups that contain only 1 row from a grouped dataframe?
Thank you very much for your help!
Yes, there is a way. Here is an example:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    np.array([['A', 'A', 'B', 'C', 'C'], [1, 2, 1, 1, 2]]).T,
    columns=['type', 'value']
)
groups = df.groupby('type')
# Keep only the (key, sub-DataFrame) pairs with more than one row.
groups_without_single_row_df = [g for g in groups if len(g[1]) > 1]
Iterating over a groupby object yields tuples.
Here, the group key 'type' (A, B or C) is the first element of each tuple and the sub-DataFrame is the second element.
You can check the length of each sub-DataFrame with len(), as in [g for g in groups if len(g[1]) > 1], where we check the length of the second element of the tuple.
If len() is greater than 1, the group is included in the output list.
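If you would rather keep the result as a single DataFrame instead of a list of tuples, GroupBy.filter applies the same test in one step. Here is a minimal sketch of that variant, using the same example data:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    np.array([['A', 'A', 'B', 'C', 'C'], [1, 2, 1, 1, 2]]).T,
    columns=['type', 'value']
)
# filter() keeps only the rows whose group passes the predicate.
filtered = df.groupby('type').filter(lambda g: len(g) > 1)
print(filtered)
#   type value
# 0    A     1
# 1    A     2
# 3    C     1
# 4    C     2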
Hope it helps
Let's say I have the following dataframe:
Value
[None, A, B, C]
[None]
I would like to replace the None values in the column with the string 'none', but I couldn't figure out how.
I tried this, but it doesn't work:
df['Value'] = df['Value'].str.replace('None','none')
None is a built-in constant in Python, not a string, so to get a lowercase 'none' you have to replace it with the string 'none'; .str.replace doesn't apply here anyway, because the column holds lists, not strings.
There is no built-in way in Pandas to replace values in lists, but you can use explode to expand all the lists so that each individual item of each list gets its own row in the column, then replace, then group back together into the original list format:
df['Value'] = df['Value'].explode().replace({None: 'none'}).groupby(level=0).apply(list)
Output:
>>> df
Value
0 [none, A, B, C]
1 [none]
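For clarity, here is the same pipeline broken into separate steps (assuming, as above, that each cell is a list and the index holds the original row labels):
s = df['Value'].explode()                      # one row per list element; the index repeats the original row label
s = s.replace({None: 'none'})                  # swap the None items for the string 'none'
df['Value'] = s.groupby(level=0).apply(list)   # regroup the elements into one list per original row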
Here is a way using map() (note the identity check i is None, which is the idiomatic way to test for None):
df['Value'] = df['Value'].map(lambda x: ['none' if i is None else i for i in x])
Output:
Value
0 [none, A, B, C]
1 [none]
How can I join the multiple lists in a Pandas column 'B' and get the unique values only:
A B
0 10 [x50, y-1, sss00]
1 20 [x20, MN100, x50, sss00]
2 ...
Expected output:
[x50, y-1, sss00, x20, MN100]
You can do this simply with a list comprehension and the sum() method (summing a Series of lists concatenates them):
result=[x for x in set(df['B'].sum())]
Now if you print result you will get your desired output (set() does not preserve order, so the ordering may vary):
['y-1', 'x20', 'sss00', 'x50', 'MN100']
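If you need the unique values in the order they first appear, one standard Python idiom (my addition, not part of the original answer) is dict.fromkeys, since dicts preserve insertion order:
result = list(dict.fromkeys(x for lst in df['B'] for x in lst))
print(result)
# ['x50', 'y-1', 'sss00', 'x20', 'MN100']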
If the input data are strings rather than lists, first create the lists:
df.B = df.B.str.strip('[]').str.split(',')
Or:
import ast
df.B = df.B.apply(ast.literal_eval)
Use Series.explode to get one Series from the lists, with Series.unique to remove duplicates, if order is important:
L = df.B.explode().unique().tolist()
#alternative
#L = df.B.explode().drop_duplicates().tolist()
print (L)
['x50', 'y-1', 'sss00', 'x20', 'MN100']
Another idea, if order is not important, is a set comprehension that flattens the lists:
L = list({y for x in df.B for y in x})
print (L)
['x50', 'MN100', 'x20', 'sss00', 'y-1']
0
0 [g,k]
1 [e,g]
2 [e]
3 [k,e]
4 [s]
5 [g]
I am trying to get the values which appear only once in the column; in this example the solution should be 's'.
But I can only find methods that solve this problem for two Series or two DataFrame columns.
I can't do it within one column, because if a value is part of a combination, unique() won't work as far as I know.
To test which values occur only once, use Series.explode with Series.value_counts and then filter the index by 1 with boolean indexing:
s = df[0].explode().value_counts()
L = s.index[s == 1].tolist()
print (L)
['s']
Or use a pure Python solution with Counter, flattening the nested lists in the Series with a list comprehension:
from collections import Counter
L = [k for k, v in Counter([y for x in df[0] for y in x]).items() if v == 1]
print (L)
['s']
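A third option (my addition, the same idea expressed purely in pandas) is drop_duplicates(keep=False), which discards every value that occurs more than once:
L = df[0].explode().drop_duplicates(keep=False).tolist()
print(L)
# ['s']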
I am trying to create a column with the result of a comparison between a Dataframe cell list and a list
I have this dataframe with list values:
df = pd.DataFrame({'A': [['KB4525236', 'KB4485447', 'KB4520724', 'KB3192137', 'KB4509091']], 'B': [['a', 'b']]})
and a list with this value:
findKBs = ['KB4525236','KB4525202']
The expected result :
A B C
0 [KB4525236, KB4485447, KB4520724, KB3192137, K... [a, b] [KB4525202]
I don't know how to iterate over my list, compare it with the cell list, and find the non-matches. Can you help me?
You can simply compare the two lists: loop through the values of findKBs and add them to a new list if they are not in df['A'][0].
df['C'] = [[x for x in findKBs if x not in df['A'][0]]]
Result:
A B C
0 [KB4525236, KB4485447, KB4520724, KB3192137, K... [a, b] [KB4525202]
There's probably a pandas-centric way you could do it, but this appears to work:
df['C'] = [list(filter(lambda el: el not in df['A'][0], findKBs))]
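If the order of the result doesn't matter, a set difference (my addition, not part of either answer) is shorter still:
df['C'] = [list(set(findKBs) - set(df['A'][0]))]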
Can a data frame column (Series) of lists be used as a conditional check within a dictionary?
I have a column of lists of words (split up tweets) that I'd like to feed to a vocab dictionary to see if they all exist - if one does not exist, I'd like to skip it, continue on and then run a function over the existing words.
This code produces the intended result for one row of the column; however, I get an "unhashable type: list" error if I try to apply it to more than one row.
w2v_sum = w2v[[x for x in train['words'].values[1] if x in w2v.vocab]].sum()
Edit with reproducible example:
df = pd.DataFrame(data={'words':[['cow','bird','cat'],['red','blue','green'],['low','high','med']]})
d = {'cow':1,'bird':4,'red':1,'blue':1,'green':1,'high':6,'med':3}
Desired output is total (the sum of the dictionary values for the words in each list):
total words
0 5 [cow, bird, cat]
1 3 [red, blue, green]
2 9 [low, high, med]
This should do what you want:
import pandas as pd
df = pd.DataFrame(data={'words':[['cow','bird','cat'],['red','blue','green'],['low','high','med']]})
d = {'cow':1,'bird':4,'red':1,'blue':1,'green':1,'high':6,'med':3}
EDIT:
To handle the lists inside the column, use this nested comprehension:
list_totals = [[d[x] for x in y if x in d] for y in df['words'].values]
list_totals = [sum(x) for x in list_totals]
list_totals
[5, 3, 9]
You can then add list_totals as a column of your DataFrame.
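For example:
df['total'] = list_totals
print(df)
#                 words  total
# 0    [cow, bird, cat]      5
# 1  [red, blue, green]      3
# 2    [low, high, med]      9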
One solution is to use collections.Counter and a list comprehension; Counter returns 0 for missing keys, so words that aren't in the dictionary are safely counted as 0:
from collections import Counter
d = Counter({'cow':1,'bird':4,'red':1,'blue':1,'green':1,'high':6,'med':3})
df['total'] = [sum(map(d.__getitem__, L)) for L in df['words']]
print(df)
words total
0 [cow, bird, cat] 5
1 [red, blue, green] 3
2 [low, high, med] 9
Alternatively, if you always have a fixed number of words, you can split into multiple Series and use pd.DataFrame.applymap:
df['total'] = pd.DataFrame(df['words'].tolist()).applymap(d.get).sum(1).astype(int)
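An explode-based variant (my sketch, not from the original answers) also handles variable-length lists without building a wide intermediate frame:
# Map each word to its value (missing words become NaN with a plain dict),
# then sum per original row; sum() treats NaN as 0 by default.
df['total'] = df['words'].explode().map(d).groupby(level=0).sum().astype(int)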