Removing values from column within groups based on conditions - python

I am really struggling with this even though I feel like it should be extremely easy.
I have a dataframe that looks like this:
Title     Release Date  Released  In Stores
Seinfeld  1995
Seinfeld  1999          Yes
Seinfeld  1999                    Yes
Friends   2000          Yes
Friends   2004                    Yes
Friends   2004
I am grouping by Title and then by Release Date, and then looking at the values of Released and In Stores. If both Released and In Stores have a value of "Yes" within the same Release Date year, the In Stores value should be removed.
So in the above dataframe, the category Seinfeld --> 1999 would have the "Yes" removed from In Stores, but the "Yes" would stay in the In Stores category for "2004" since it is the only "Yes" in the Friends --> 2004 category.
I am starting by using
df.groupby(['Title', 'Release Date'])[['Released', 'In Stores']].count()
But I cannot figure out the syntax for removing values from In Stores.
Desired output:
Title     Release Date  Released  In Stores
Seinfeld  1995
Seinfeld  1999          Yes
Seinfeld  1999
Friends   2000          Yes
Friends   2004                    Yes
Friends   2004
EDIT: I have tried this line given in the top comment:
flag = (df.groupby(['Title', 'Release Date']).transform(lambda x: (x == 'Yes').any()) .all(axis=1))
but the kernel runs indefinitely.

You can use groupby.transform to flag rows where In Stores needs to be removed, based on whether the row's ['Title', 'Release Date'] group has at least one value of 'Yes' in column Released, and also in column In Stores.
flag = (df.groupby(['Title', 'Release Date'])
          .transform(lambda x: (x == 'Yes').any())
          .all(axis=1))
print(flag)
0 False
1 True
2 True
3 False
4 False
5 False
dtype: bool
df.loc[flag, 'In Stores'] = np.nan
Result:
      Title  Release Date Released In Stores
0  Seinfeld          1995      NaN       NaN
1  Seinfeld          1999      Yes       NaN
2  Seinfeld          1999      NaN       NaN
3   Friends          2000      Yes       NaN
4   Friends          2004      NaN       Yes
5   Friends          2004      NaN       NaN
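The edit above reports the lambda transform running indefinitely. The same flag can be built column by column with the built-in 'any' aggregation, which avoids per-group Python lambdas and is usually much faster. A sketch reconstructing the sample frame (which column each "Yes" sits in is inferred from the result table):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Title': ['Seinfeld', 'Seinfeld', 'Seinfeld', 'Friends', 'Friends', 'Friends'],
    'Release Date': [1995, 1999, 1999, 2000, 2004, 2004],
    'Released': [np.nan, 'Yes', np.nan, 'Yes', np.nan, np.nan],
    'In Stores': [np.nan, np.nan, 'Yes', np.nan, 'Yes', np.nan],
})

keys = [df['Title'], df['Release Date']]
# True where the (Title, Release Date) group has a 'Yes' in BOTH columns
flag = (df['Released'].eq('Yes').groupby(keys).transform('any')
        & df['In Stores'].eq('Yes').groupby(keys).transform('any'))
df.loc[flag, 'In Stores'] = np.nan
```

This clears the 1999 Seinfeld "Yes" in In Stores while the lone 2004 Friends "Yes" survives, matching the result above.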


Filling column based on conditions

In the DataFrame below
import numpy as np
import pandas as pd

df = pd.DataFrame([('Ve_Paper', 'Buy', '-', 'Canada', np.NaN),
                   ('Ve_Gasoline', 'Sell', 'Done', 'Britain', np.NaN),
                   ('Ve_Water', 'Sell', '-', 'Canada', np.NaN),
                   ('Ve_Plant', 'Buy', 'Good', 'China', np.NaN),
                   ('Ve_Soda', 'Sell', 'Process', 'Germany', np.NaN)],
                  columns=['Name', 'Action', 'Status', 'Country', 'Value'])
I am trying to update the Value column based on the following conditions: if Action is 'Sell' and Status is not '-', Value should be the first two characters of Country; if Action is 'Sell' and Status is '-', Value should be Name without the 'Ve_' prefix; if Action is not 'Sell', leave Value as np.NaN.
But the output I am expecting is
Name Action Status Country Value
Ve_Paper Buy - Canada np.NaN # Because Action is not Sell
Ve_Gasoline Sell Done Britain Br # The first two characters of Country Since Action is sell and Status is not "-"
Ve_Water Sell - Canada Water # The Name value without 'Ve_' since Action is Sell and the Status is '-'
Ve_Plant Buy Good China np.NaN
Ve_Soda Sell Process Germany Ge
I have tried np.where and df.loc; neither worked. Please do help me because I am out of options now.
What I have tried so far is
import numpy as np
df['Value'] = np.where(df['Action']== 'Sell',df['Country'].str[:2] if df['Status'].str != '-' else df['Name'].str[3:],df['Value'])
but wherever I am trying to extract a substring I get <pandas.core.strings.StringMethods object at 0x000001EDB8F662B0> as the output
so the output looks like this
Name Action Status Country Value
Ve_Paper Buy - Canada np.NaN
Ve_Gasoline Sell Done Britain <pandas.core.strings.StringMethods object at 662B0>
Ve_Water Sell - Canada <pandas.core.strings.StringMethods object at 0x000001EDB8F662B0>
Ve_Plant Buy Good China np.NaN
Ve_Soda Sell Process Germany <pandas.core.strings.StringMethods object at 0x000001EDB8F662B0>
You have two conditions, so we can do it with np.select:
conda = df.Action.eq('Sell')
condb = df.Status.eq('-')
df['value'] = np.select([conda & condb, conda & ~condb],
                        [df.Name.str.split('_').str[1], df.Country.str[:2]],
                        default=np.nan)
df
Out[343]:
Name Action Status Country Value value
0 Ve_Paper Buy - Canada NaN NaN
1 Ve_Gasoline Sell Done Britain NaN Br
2 Ve_Water Sell - Canada NaN Water
3 Ve_Plant Buy Good China NaN NaN
4 Ve_Soda Sell Process Germany NaN Ge
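The OP's np.where attempt can also be repaired directly: a Python if/else cannot choose between whole columns row by row, but a nested np.where can. A sketch (my own variant, not the answerer's code), reusing the sample frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([('Ve_Paper', 'Buy', '-', 'Canada', np.nan),
                   ('Ve_Gasoline', 'Sell', 'Done', 'Britain', np.nan),
                   ('Ve_Water', 'Sell', '-', 'Canada', np.nan),
                   ('Ve_Plant', 'Buy', 'Good', 'China', np.nan),
                   ('Ve_Soda', 'Sell', 'Process', 'Germany', np.nan)],
                  columns=['Name', 'Action', 'Status', 'Country', 'Value'])

# Outer np.where: is this a 'Sell' row at all?
# Inner np.where: '-' status -> Name without the 'Ve_' prefix,
#                 otherwise  -> first two characters of Country
df['Value'] = np.where(df['Action'].eq('Sell'),
                       np.where(df['Status'].eq('-'),
                                df['Name'].str.replace('Ve_', '', regex=False),
                                df['Country'].str[:2]),
                       np.nan)
```

The key difference from the OP's version is that `.str` is never compared to anything: each branch is a fully materialized Series, and np.where picks between them elementwise.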

Clear column based on a value appearing in another column

I feel like this should be really simple but I have been stuck on this all day.
I have a dataframe that looks like:
Name Date Graduated Really?
Bob 2014 Yes
Bob 2020 Yes
Sally 1995 Yes
Sally 1999
Sally 1999 No
Robert 2005 Yes
Robert 2005 Yes
I am grouping by Name and Date. In each group, if "Yes" appears in Graduated, then clear the Really? column. And if "Yes" doesn't appear in the group, then leave as is. The output should look like:
Name Date Graduated Really?
Bob 2014 Yes
Bob 2020 Yes
Sally 1995 Yes
Sally 1999
Sally 1999 No
Robert 2005
Robert 2005 Yes
I keep trying different variations of mask = df.groupby(['Name','Date'])['Graduated'].isin('Yes') before doing df.loc[mask, "Really?"] = None, but receive an AttributeError (I assume my syntax is incorrect).
Edited expected output.
Try this:
s = df['Graduated'].eq('Yes').groupby([df['Name'],df['Date']]).transform('any')
df['Really?'] = df['Really?'].mask(s,'')
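A runnable sketch of that pattern; how the sample values split across Graduated and Really? is my own guess from the flattened table:

```python
import pandas as pd

df = pd.DataFrame({
    'Name':      ['Bob',  'Sally', 'Sally', 'Robert', 'Robert'],
    'Date':      [2014,    1999,    1999,    2005,     2005],
    'Graduated': ['Yes',   None,    'No',    'Yes',    None],
    'Really?':   [None,    None,    None,    'Yes',    'Yes'],
})

# For each (Name, Date) group: True on every row if any Graduated value is 'Yes'
s = df['Graduated'].eq('Yes').groupby([df['Name'], df['Date']]).transform('any')
# Blank out Really? on the flagged rows, leave the rest untouched
df['Really?'] = df['Really?'].mask(s, '')
```

Note that `.eq('Yes')` is computed before the groupby, so the transform is a cheap boolean 'any' rather than a per-group lambda.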

DataFrame non-NaN series assignment results in NaN

I cannot find a reason why, when I assign the scaled variable (which contains no NaNs) to the original DataFrame, I get NaNs even though the index matches (years).
Can anyone help? I am leaving out details which I think are not necessary, happy to provide more details if needed.
So, given the following multi-index dataframe df:
value
country year
Canada 2007 1
2006 2
2005 3
United Kingdom 2007 4
2006 5
And the following series scaled:
2006 99
2007 54
2005 78
dtype: int64
You can assign it as a new column if reindexed and converted to a list first, like this:
df.loc["Canada", "new_values"] = scaled.reindex(df.loc["Canada", :].index).to_list()
print(df.loc["Canada", :])
# Output
value new_values
year
2007 1 54.0
2006 2 99.0
2005 3 78.0
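The NaNs most likely come from index alignment: assigning the year-indexed series directly makes pandas align it against the frame's (country, year) MultiIndex, where no labels match. Converting to a plain list, as the answer does, sidesteps alignment entirely. A runnable reconstruction of the example:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('Canada', 2007), ('Canada', 2006), ('Canada', 2005),
     ('United Kingdom', 2007), ('United Kingdom', 2006)],
    names=['country', 'year'])
df = pd.DataFrame({'value': [1, 2, 3, 4, 5]}, index=idx)
scaled = pd.Series({2006: 99, 2007: 54, 2005: 78})

# Reorder scaled to match the year order under 'Canada',
# then hand over the raw values so no label alignment happens
df.loc['Canada', 'new_values'] = scaled.reindex(df.loc['Canada', :].index).to_list()
```

The rows outside 'Canada' get NaN in new_values, as in the printed output above.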

Can I copy values from other rows and column and automatically replace the missing values?

So, my dataframe is
price model_year model condition cylinders fuel odometer transmission type paint_color is_4wd date_posted days_listed
0 9400 2011.0 bmw x5 good 6.0 gas 145000.0 automatic SUV NaN True 2018-06-23 19
1 25500 NaN ford f-150 good 6.0 gas 88705.0 automatic pickup white True 2018-10-19 50
2 5500 2013.0 hyundai sonata like new 4.0 gas 110000.0 automatic sedan red False 2019-02-07 79
3 1500 2003.0 ford f-150 fair 8.0 gas NaN automatic pickup NaN False 2019-03-22 9
4 14900 2017.0 chrysler 200 excellent 4.0 gas 80903.0 automatic sedan black False 2019-04-02 28
As you can see, row 1's model is the same as row 3's, but row 1's model year is missing. It would naturally follow that I can replace row 1's model year with row 3's so there isn't a NaN there. I'm aware I can change it manually, but the dataframe is over 50,000 rows long and there are many more values just like that. Is there an automated way to go about replacing these values?
Edit: After looking over the df just now, I've realized that I can't really replace the model year like that as it can change even within the same model, although I would still love to know how it's done if possible for future reference
You can merge the dataframe with itself and fillna it.
df_want = df.merge(df[['model_year', 'model']].dropna().drop_duplicates(),
                   on='model', how='left')
df_want['model_year'] = df_want['model_year_x'].fillna(df_want['model_year_y'])
df_want = df_want.drop(['model_year_x', 'model_year_y'], axis=1)
Yes, you can replace all NaN model years with the non-NaN entry like this (note that df.at only accepts a single row/column label pair, so the boolean masks need df.loc):
models = df['model'].unique()
for m in models:
    year = df.loc[df['model_year'].notna() & (df['model'] == m), 'model_year'].values[0]
    df.loc[df['model_year'].isna() & (df['model'] == m), 'model_year'] = year
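Both approaches can also be collapsed into one vectorized fill with groupby.transform('first'), which picks each model's first non-NaN year (with the caveat from the edit that a model's year need not be unique). A sketch on a minimal frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'model': ['bmw x5', 'ford f-150', 'hyundai sonata', 'ford f-150'],
    'model_year': [2011.0, np.nan, 2013.0, 2003.0],
})

# 'first' skips NaN within each group, so every missing year is filled
# with that model's first observed non-NaN year
df['model_year'] = df['model_year'].fillna(
    df.groupby('model')['model_year'].transform('first'))
```

Unlike the explicit loop, this never raises an IndexError when a model has no non-NaN year at all; such rows simply stay NaN.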

How do I turn objects or values in pandas df into boolean?

What I want to do is change values in a column into boolean.
What I am looking at: I have a dataset of artists with a column named "Death Year".
Within that column it has either the death year or NaN, which I changed to "Alive". I want to turn this column into booleans: death years become False and "Alive" becomes True. The dtype for this column is object.
Reproducible Example:
df = pd.DataFrame({'DeathYear':[2005,2003,np.nan,1993]})
DeathYear
0 2005.0
1 2003.0
2 NaN
3 1993.0
which you turned into
df['DeathYear'] = df['DeathYear'].fillna('Alive')
DeathYear
0 2005
1 2003
2 Alive
3 1993
You can just use
df['BoolDeathYear'] = df['DeathYear'] == 'Alive'
DeathYear BoolDeathYear
0 2005 False
1 2003 False
2 Alive True
3 1993 False
Notice that, if your final goal is to have the bool column, you don't have to fill the NaNs at all.
Can just do
df['BoolDeathYear'] = df['DeathYear'].isnull()
