Drop rows that only relate to one value in other columns pandas - python

imagine I have dataframe like this:
item name gender
banana tom male
banana kate female
apple kate female
kiwi jim male
apple tom male
banana kimmy female
kiwi kate female
banana tom male
Is there any way to drop rows that the person only relate(buy) less than 2 item? Also I don't want to drop duplicates. So the output I want like this:
item name gender
banana tom male
banana kate female
apple kate female
apple tom male
kiwi kate female
banana tom male

#sammywemmy's solution:
df.loc[df.groupby('name').item.transform('size').ge(2)]
groupby to group rows with the same name together
# Get Each Group
print(df.groupby('name').apply(lambda s: s.reset_index()))
index item name gender
name
jim 0 3 kiwi jim male
kate 0 1 banana kate female
1 2 apple kate female
2 6 kiwi kate female
kimmy 0 5 banana kimmy female
tom 0 0 banana tom male
1 4 apple tom male
2 7 banana tom male
transform to get a value in every row that represents the group size. (Number of rows)
# Turn Each Item Into The Number of Rows in The Group
df['group_size'] = df.groupby('name')['item'].transform('size')
print(df)
item name gender group_size
0 banana tom male 3
1 banana kate female 3
2 apple kate female 3
3 kiwi jim male 1
4 apple tom male 3
5 banana kimmy female 1
6 kiwi kate female 3
7 banana tom male 3
This could have been done on any column in this case:
# Turn Each Item Into The Number of Rows in The Group
df['group_size'] = df.groupby('name')['gender'].transform('size')
print(df)
item name gender group_size
0 banana tom male 3
1 banana kate female 3
2 apple kate female 3
3 kiwi jim male 1
4 apple tom male 3
5 banana kimmy female 1
6 kiwi kate female 3
7 banana tom male 3
Notice how now each row has the corresponding group size at the end. tom has 3 instances so every name == tom row has 3 in group_size.
ge Convert to Boolean Index based on relational operator
# Add Condition To determine if the row should be kept or not
df['should_keep'] = df.groupby('name')['item'].transform('size').ge(2)
print(df)
item name gender group_size should_keep
0 banana tom male 3 True
1 banana kate female 3 True
2 apple kate female 3 True
3 kiwi jim male 1 False
4 apple tom male 3 True
5 banana kimmy female 1 False
6 kiwi kate female 3 True
7 banana tom male 3 True
loc use Boolean Index to get the desired rows
print(df.groupby('name')['item'].transform('size').ge(2))
0 True
1 True
2 True
3 False
4 True
5 False
6 True
7 True
Name: item, dtype: bool
loc will include any index that is True, any index that is False will be excluded. (indexes 3 and 5 are False so they will not be included)
All together:
import pandas as pd
df = pd.DataFrame({'item': {0: 'banana', 1: 'banana', 2: 'apple',
3: 'kiwi', 4: 'apple', 5: 'banana',
6: 'kiwi', 7: 'banana'},
'name': {0: 'tom', 1: 'kate', 2: 'kate',
3: 'jim', 4: 'tom', 5: 'kimmy',
6: 'kate', 7: 'tom'},
'gender': {0: 'male', 1: 'female',
2: 'female', 3: 'male',
4: 'male', 5: 'female',
6: 'female', 7: 'male'}})
print(df.loc[df.groupby('name')['name'].transform('size').ge(2)])
item name gender
0 banana tom male
1 banana kate female
2 apple kate female
4 apple tom male
6 kiwi kate female
7 banana tom male

Related

How to add a suffix to the first N columns in pandas?

I want to add a suffix to the first N columns. But I can't.
This is how to add a suffix to all columns:
import pandas as pd
df = pd.DataFrame( {"name" : ["John","Alex","Kate","Martin"], "surname" : ["Smith","Morgan","King","Cole"],
"job": ["Engineer","Dentist","Coach","Teacher"],"Age":[25,20,25,30],
"Id": [1,2,3,4]})
df.add_suffix("_x")
And this is the result:
name_x surname_x job_x Age_x Id_x
0 John Smith Engineer 25 1
1 Alex Morgan Dentist 20 2
2 Kate King Coach 25 3
3 Martin Cole Teacher 30 4
But I want to add the first N columns so let's say the first 3. Desired output is:
name_x surname_x job_x Age Id
0 John Smith Engineer 25 1
1 Alex Morgan Dentist 20 2
2 Kate King Coach 25 3
3 Martin Cole Teacher 30 4
Work with the indices and take slices to modify a subset of them:
df.columns = (df.columns[:3]+'_x').union(df.columns[3:], sort=False)
print(df)
name_x surname_x job_x Age Id
0 John Smith Engineer 25 1
1 Alex Morgan Dentist 20 2
2 Kate King Coach 25 3
3 Martin Cole Teacher 30 4
This should work:
N=3
cols=[i for i in df.columns[:N]]
new_cols=[i+'_x' for i in df.columns[:N]]
dict_cols=dict(zip(cols,new_cols))
df.rename(dict_cols,axis=1)
set the column labels using a list comprehension:
n = 3
df.columns = [f'{c}_x' if i < n else c for i, c in enumerate(df.columns)]
results in
name_x surname_x job_x Age Id
0 John Smith Engineer 25 1
1 Alex Morgan Dentist 20 2
2 Kate King Coach 25 3
3 Martin Cole Teacher 30 4

Adding rows to a column for every element in the list for every unique value in another column in python pandas

I have two lists of unequal length:
Name = ['Tom', 'Jack', 'Nick', 'Juli', 'Harry']
bId= list(range(0,3))
I want to build a data frame that would look like below:
'Name' 'bId'
Tom 0
Tom 1
Tom 2
Jack 0
Jack 1
Jack 2
Nick 0
Nick 1
Nick 2
Juli 0
Juli 1
JUli 2
Harry 0
Harry 1
Harry 2
Please suggest.
Use itertools.product with DataFrame constructor:
from itertools import product
df = pd.DataFrame(product(Name, bId), columns=['Name','bId'])
print (df)
Name bId
0 Tom 0
1 Tom 1
2 Tom 2
3 Jack 0
4 Jack 1
5 Jack 2
6 Nick 0
7 Nick 1
8 Nick 2
9 Juli 0
10 Juli 1
11 Juli 2
12 Harry 0
13 Harry 1
14 Harry 2

Insert rows in pandas where one column misses some value in groupby

Here's my dataframe:
user1 user2 cat quantity + other quantities
----------------------------------------------------
Alice Bob 0 ....
Alice Bob 1 ....
Alice Bob 2 ....
Alice Carol 0 ....
Alice Carol 2 ....
I want to make sure that every user1-user2 pair has a row corresponding to each category (there are three: 0,1,2). If not, I want to insert a row, and set the other columns to zero.
user1 user2 cat quantity + other quantities
----------------------------------------------------
Alice Bob 0 ....
Alice Bob 1 ....
Alice Bob 2 ....
Alice Carol 0 ....
Alice Carol 1 <SET ALL TO ZERO>
Alice Carol 2 ....
what I have so far is the list of all user1-user2 which has less than 3 values for cat:
df.groupby(['user1','user2']).agg({'cat':'count'}).reset_index()[['user1','user2']]
I could iterate over these users, but that will take a long time (there are >1M such pairs). I've checked at other solutions for inserting rows in pandas based on some condition (like Pandas/Python adding row based on condition and Insert row in Pandas Dataframe based on a condition) but they're not exactly the same.
Also, since this is a huge dataset, the solution has to be vectorized. How should I proceed?
Use set_index with reindex by MultiIndex.from_product:
print (df)
user1 user2 cat quantity a
0 Alice Bob 0 2 4
1 Alice Bob 1 3 4
2 Alice Bob 2 4 4
3 Alice Carol 0 6 4
4 Alice Carol 2 3 4
df = df.set_index(['user1','user2', 'cat'])
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0).reset_index()
print (df)
user1 user2 cat quantity a
0 Alice Bob 0 2 4
1 Alice Bob 1 3 4
2 Alice Bob 2 4 4
3 Alice Carol 0 6 4
4 Alice Carol 1 0 0
5 Alice Carol 2 3 4
Another solution is create new Dataframe by all combinations of unique values of columns and merge with right join:
from itertools import product
df1 = pd.DataFrame(list(product(df['user1'].unique(),
df['user2'].unique(),
df['cat'].unique())), columns=['user1','user2', 'cat'])
df = df.merge(df1, how='right').fillna(0)
print (df)
user1 user2 cat quantity a
0 Alice Bob 0 2.0 4.0
1 Alice Bob 1 3.0 4.0
2 Alice Bob 2 4.0 4.0
3 Alice Carol 0 6.0 4.0
4 Alice Carol 2 3.0 4.0
5 Alice Carol 1 0.0 0.0
EDIT2:
df['user1'] = df['user1'] + '_' + df['user2']
df = df.set_index(['user1', 'cat']).drop('user2', 1)
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0).reset_index()
df[['user1','user2']] = df['user1'].str.split('_', expand=True)
print (df)
user1 cat quantity a user2
0 Alice 0 2 4 Bob
1 Alice 1 3 4 Bob
2 Alice 2 4 4 Bob
3 Alice 0 6 4 Carol
4 Alice 1 0 0 Carol
5 Alice 2 3 4 Carol
EDIT3:
cols = df.columns.difference(['user1','user2'])
df = (df.groupby(['user1','user2'])[cols]
.apply(lambda x: x.set_index('cat').reindex(df['cat'].unique(), fill_value=0))
.reset_index())
print (df)
user1 user2 cat a quantity
0 Alice Bob 0 4 2
1 Alice Bob 1 4 3
2 Alice Bob 2 4 4
3 Alice Carol 0 4 6
4 Alice Carol 1 0 0
5 Alice Carol 2 4 3

Counting elements in Pandas

Let's say I have a Panda DataFrame like this
import pandas as pd
a=pd.Series([{'Country'='Italy','Name'='Augustina','Gender'='Female','Number'=1}])
b=pd.Series([{'Country'='Italy','Name'='Piero','Gender'='Male','Number'=2}])
c=pd.Series([{'Country'='Italy','Name'='Carla','Gender'='Female','Number'=3}])
d=pd.Series([{'Country'='Italy','Name'='Roma','Gender'='Female','Number'=4}])
e=pd.Series([{'Country'='Greece','Name'='Sophia','Gender'='Female','Number'=5}])
f=pd.Series([{'Country'='Greece','Name'='Zeus','Gender'='Male','Number'=6}])
df=pd.DataFrame([a,b,c,d,e,f])
then, I sort with multiindex, like
df.set_index(['Country','Gender'],inplace=True)
Now, I wold like to know how to count how many people are from Italy, or how many Greek female I have in the dataframe.
I've tried
df['Italy'].count()
and
df['Greece']['Female'].count()
. None of them works,
Thanks
I think you need groupby with aggregatingsize:
What is the difference between size and count in pandas?
a=pd.DataFrame([{'Country':'Italy','Name':'Augustina','Gender':'Female','Number':1}])
b=pd.DataFrame([{'Country':'Italy','Name':'Piero','Gender':'Male','Number':2}])
c=pd.DataFrame([{'Country':'Italy','Name':'Carla','Gender':'Female','Number':3}])
d=pd.DataFrame([{'Country':'Italy','Name':'Roma','Gender':'Female','Number':4}])
e=pd.DataFrame([{'Country':'Greece','Name':'Sophia','Gender':'Female','Number':5}])
f=pd.DataFrame([{'Country':'Greece','Name':'Zeus','Gender':'Male','Number':6}])
df=pd.concat([a,b,c,d,e,f], ignore_index=True)
print (df)
Country Gender Name Number
0 Italy Female Augustina 1
1 Italy Male Piero 2
2 Italy Female Carla 3
3 Italy Female Roma 4
4 Greece Female Sophia 5
5 Greece Male Zeus 6
df = df.groupby('Country').size()
print (df)
Country
Greece 2
Italy 4
dtype: int64
df = df.groupby(['Country', 'Gender']).size()
print (df)
Country Gender
Greece Female 1
Male 1
Italy Female 3
Male 1
dtype: int64
If need only some sizes with select by MultiIndex by xs or slicers:
df.set_index(['Country','Gender'],inplace=True)
print (df)
Name Number
Country Gender
Italy Female Augustina 1
Male Piero 2
Female Carla 3
Female Roma 4
Greece Female Sophia 5
Male Zeus 6
print (df.xs('Italy', level='Country'))
Name Number
Gender
Female Augustina 1
Male Piero 2
Female Carla 3
Female Roma 4
print (len(df.xs('Italy', level='Country').index))
4
print (df.xs(('Greece', 'Female'), level=('Country', 'Gender')))
Name Number
Country Gender
Greece Female Sophia 5
print (len(df.xs(('Greece', 'Female'), level=('Country', 'Gender')).index))
1
#KeyError: 'MultiIndex Slicing requires
#the index to be fully lexsorted tuple len (2), lexsort depth (0)'
df.sort_index(inplace=True)
idx = pd.IndexSlice
print (df.loc[idx['Italy', :],:])
Name Number
Country Gender
Italy Female Augustina 1
Female Carla 3
Female Roma 4
Male Piero 2
print (len(df.loc[idx['Italy', :],:].index))
4
print (df.loc[idx['Greece', 'Female'],:])
Name Number
Country Gender
Greece Female Sophia 5
print (len(df.loc[idx['Greece', 'Female'],:].index))
1

Drop duplicates for rows with interchangeable name values (Pandas, Python)

I have a DataFrame of form
person1, person2, ..., someMetric
John, Steve, ..., 20
Peter, Larry, ..., 12
Steve, John, ..., 20
Rows 0 and 2 are interchangeable duplicates, so I'd want to drop the last row. I can't figure out how to do this in Pandas.
Thanks!
Here's a NumPy based solution -
df[~(np.triu(df.person1.values[:,None] == df.person2.values)).any(0)]
Sample run -
In [123]: df
Out[123]:
person1 person2 someMetric
0 John Steve 20
1 Peter Larry 13
2 Steve John 19
3 Peter Parker 5
4 Larry Peter 7
In [124]: df[~(np.triu(df.person1.values[:,None] == df.person2.values)).any(0)]
Out[124]:
person1 person2 someMetric
0 John Steve 20
1 Peter Larry 13
3 Peter Parker 5
an approach in pandas
df = pd.DataFrame(
{'person2': {0: 'Steve', 1: 'Larry', 2: 'John', 3: 'Parker', 4: 'Peter'},
'person1': {0: 'John', 1: 'Peter', 2: 'Steve', 3: 'Peter', 4: 'Larry'},
'someMetric': {0: 20, 1: 13, 2: 19, 3: 5, 4: 7}})
print(df)
person1 person2 someMetric
0 John Steve 20
1 Peter Larry 13
2 Steve John 19
3 Peter Parker 5
4 Larry Peter 7
df['ordered-name'] = df.apply(lambda x: '-'.join(sorted([x['person1'],x['person2']])),axis=1)
df = df.drop_duplicates(['ordered-name'])
df.drop(['ordered-name'], axis=1, inplace=True)
print df
which gives:
person1 person2 someMetric
0 John Steve 20
1 Peter Larry 13
3 Peter Parker 5

Categories