In the DataFrame below
import pandas as pd
import numpy as np

df = pd.DataFrame([('Ve_Paper', 'Buy', '-', 'Canada', np.NaN),
                   ('Ve_Gasoline', 'Sell', 'Done', 'Britain', np.NaN),
                   ('Ve_Water', 'Sell', '-', 'Canada', np.NaN),
                   ('Ve_Plant', 'Buy', 'Good', 'China', np.NaN),
                   ('Ve_Soda', 'Sell', 'Process', 'Germany', np.NaN)],
                  columns=['Name', 'Action', 'Status', 'Country', 'Value'])
I am trying to update the Value column based on the following conditions: if Action is Sell and Status is not -, Value should become the first two characters of Country; if Action is Sell and Status is -, Value should become Name without the leading Ve_; if Action is not Sell, Value should stay np.NaN.
But the output I am expecting is
Name         Action  Status   Country  Value
Ve_Paper     Buy     -        Canada   np.NaN  # because Action is not Sell
Ve_Gasoline  Sell    Done     Britain  Br      # the first two characters of Country, since Action is Sell and Status is not "-"
Ve_Water     Sell    -        Canada   Water   # the Name value without 'Ve_', since Action is Sell and Status is '-'
Ve_Plant     Buy     Good     China    np.NaN
Ve_Soda      Sell    Process  Germany  Ge
I have tried np.where and df.loc; neither worked. Please do help me, because I am out of options now.
What I have tried so far is
import numpy as np
df['Value'] = np.where(df['Action']== 'Sell',df['Country'].str[:2] if df['Status'].str != '-' else df['Name'].str[3:],df['Value'])
but I am getting <pandas.core.strings.StringMethods object at 0x000001EDB8F662B0> wherever I am trying to extract substrings,
so the output looks like this
Name         Action  Status   Country  Value
Ve_Paper     Buy     -        Canada   np.NaN
Ve_Gasoline  Sell    Done     Britain  <pandas.core.strings.StringMethods object at 0x000001EDB8F662B0>
Ve_Water     Sell    -        Canada   <pandas.core.strings.StringMethods object at 0x000001EDB8F662B0>
Ve_Plant     Buy     Good     China    np.NaN
Ve_Soda      Sell    Process  Germany  <pandas.core.strings.StringMethods object at 0x000001EDB8F662B0>
You have two conditions (plus a default), so we can do it with np.select. Your np.where attempt fails because df['Status'].str is the string accessor object itself, so df['Status'].str != '-' never compares the column's values, and the Python a if cond else b ternary is evaluated once for the whole Series rather than element-wise.
conda = df.Action.eq('Sell')
condb = df.Status.eq('-')
df['value'] = np.select([conda & condb, conda & ~condb],
                        [df.Name.str.split('_').str[1], df.Country.str[:2]],
                        default=np.nan)
df
Out[343]:
Name Action Status Country Value value
0 Ve_Paper Buy - Canada NaN NaN
1 Ve_Gasoline Sell Done Britain NaN Br
2 Ve_Water Sell - Canada NaN Water
3 Ve_Plant Buy Good China NaN NaN
4 Ve_Soda Sell Process Germany NaN Ge
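If you would rather stay close to your original np.where, a nested, fully element-wise version works too (a sketch using the same column names as the np.select answer above):

df['Value'] = np.where(df['Action'].eq('Sell'),
                       np.where(df['Status'].eq('-'),
                                df['Name'].str.replace('Ve_', '', regex=False),
                                df['Country'].str[:2]),
                       df['Value'])

Both inner branches are computed element-wise up front, so no Python-level if/else is involved.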
I am trying to convert an excel sheet in a particular format to another one.
Current Format
[image: sheet in long format with Map ID, Map Name, Map Criteria and Map Values columns]
Expected Format
[image: sheet pivoted so that each Map Criteria value becomes its own column]
The Map Criteria list is exhaustive and values may not be present in all cases.
Though I am able to do it in Excel itself, I need to schedule it to run on a recurring basis; that's why I am trying to solve it using Python.
I tried the DataFrame aggregate and melt functions, which are not giving the intended result.
You're looking for:
df.pivot(index=['Map ID','Map Name'], columns=['Map Criteria'], values='Map Values')
Output:
Brand Country Department Gender Product
Map ID Map Name
1 AAA Brand A United KIngdom Marketing Male Laptop
2 BBB Brand B NaN Finance NaN NaN
3 CCCC NaN United Kindgom NaN NaN NaN
4 DDD NaN NaN NaN NaN Mobile
5 DDD NaN NaN NaN Female NaN
Input data:
df = pd.DataFrame({
    'Map ID': [1,1,1,1,1, 2,2, 3,4,5],
    'Map Name': [*['AAA']*5, *['BBB']*2, 'CCCC', *['DDD']*2],
    'Map Criteria': ['Brand','Department','Country','Product','Gender']*2,
    'Map Values': ['Brand A','Marketing','United KIngdom','Laptop','Male',
                   'Brand B','Finance','United Kindgom','Mobile','Female']
})
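If you need a flat frame to write back to Excel on a schedule, you can reset the index afterwards (a sketch; the output path is hypothetical, and note that passing a list of columns to index in pivot needs a reasonably recent pandas, 1.1 or newer if I recall correctly):

out = (df.pivot(index=['Map ID', 'Map Name'],
                columns='Map Criteria',
                values='Map Values')
         .reset_index()
         .rename_axis(columns=None))
out.to_excel('mapped.xlsx', index=False)  # hypothetical output path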
I am really struggling with this even though I feel like it should be extremely easy.
I have a dataframe that looks like this:
Title        Release Date   Released   In Stores
Seinfeld     1995
Seinfeld     1999           Yes
Seinfeld     1999                      Yes
Friends      2000           Yes
Friends      2004                      Yes
Friends      2004
I am first grouping by Title and then Release Date, and then observing the values of Released and In Stores. If both Released and In Stores have a value of "Yes" in the same Release Date year, then remove the In Stores value.
So in the above dataframe, the category Seinfeld --> 1999 would have the "Yes" removed from In Stores, but the "Yes" would stay in the In Stores category for "2004" since it is the only "Yes" in the Friends --> 2004 category.
I am starting by using
df.groupby(['Title', 'Release Date'])[['Released', 'In Stores']].count()
But I cannot figure out the syntax of removing values from In_Stores.
Desired output:
Title        Release Date   Released   In Stores
Seinfeld     1995
Seinfeld     1999           Yes
Seinfeld     1999
Friends      2000           Yes
Friends      2004                      Yes
Friends      2004
EDIT: I have tried this line given in the top comment:
flag = (df.groupby(['Title', 'Release Date']).transform(lambda x: (x == 'Yes').any()).all(axis=1))
but the kernel runs indefinitely.
You can use groupby.transform to flag rows where In Stores needs to be removed, based on whether the row's ['Title', 'Release Date'] group has at least one value of 'Yes' in column Released, and also in column In Stores.
flag = (df.groupby(['Title', 'Release Date'])
          .transform(lambda x: (x == 'Yes').any())
          .all(axis=1))
print(flag)
0 False
1 True
2 True
3 False
4 False
5 False
dtype: bool
df.loc[flag, 'In Stores'] = np.nan
Result:
Title        Release Date   Released   In Stores
Seinfeld     1995           nan        nan
Seinfeld     1999           Yes        nan
Seinfeld     1999           nan        nan
Friends      2000           Yes        nan
Friends      2004           nan        Yes
Friends      2004           nan        nan
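On the edit about the kernel running indefinitely: the lambda is plain Python applied group by group, which can crawl on large frames. A vectorized variant (a sketch, assuming the same column names) does the comparison once outside the groupby:

import numpy as np

yes = df[['Released', 'In Stores']].eq('Yes')
flag = (yes.groupby([df['Title'], df['Release Date']])
           .transform('any')
           .all(axis=1))
df.loc[flag, 'In Stores'] = np.nan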
I have a data frame named df_cp, which has the data shown in the answer's output below.
I need to insert a new project name for CompanyID 'LCM' at the first empty cell in the row with index 1. I have found the index of the row of interest using this:
index_row = df_cp[df_cp['CompanyID']=='LCM'].index
How can I iterate within the row at index 1 and replace its first NaN with "Healthcare"?
Please help with this.
IIUC, you can use isna and idxmax:
df.loc[1, df.loc[1].isna().idxmax()] = 'Healthcare'
Output:
CompanyID Project01 Project02 Project03 Project04 Project05
0 134 oil furniture NaN NaN NaN
1 LCM oil furniture car Healthcare NaN
2 Z01 oil furniture NaN NaN NaN
3 453 oil furniture agro meat NaN
Note: idxmax returns the index of the first occurrence of the maximum value; on a boolean mask that is the first True, i.e. the first NaN.
More generalized:
m = df['CompanyID'] == 'LCM'
df.loc[m, df[m].isna().idxmax(axis=1)] = 'Healthcare'
df
Output:
CompanyID Project01 Project02 Project03 Project04 Project05
0 134 oil furniture NaN NaN NaN
1 LCM oil furniture car Healthcare NaN
2 Z01 oil furniture NaN NaN NaN
3 453 oil furniture agro meat NaN
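If more than one row could match 'LCM', the single-label version can also be applied row by row (a sketch of the same idea):

# assumes each matching row has at least one empty cell
for i in df.index[df['CompanyID'] == 'LCM']:
    df.loc[i, df.loc[i].isna().idxmax()] = 'Healthcare'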
I have a pandas dataframe which looks like this:
Country Sold
Japan 3432
Japan 4364
Korea 2231
India 1130
India 2342
USA 4333
USA 2356
USA 3423
I have used the code below to get the sum of the "Sold" column:
df1= df.groupby(df['Country'])
df2 = df1.sum()
I want to ask how to calculate each country's percentage of the total of the "Sold" column.
You can get the percentage by adding this code
df2["percentage"] = df2['Sold']*100 / df2['Sold'].sum()
In the output dataframe, a column with the percentage of each country is added.
We can use transform to divide the original Sold column by the grouped sums, broadcast back to the same length as the original DataFrame:
df.assign(
    pct_per=df['Sold'] / df.groupby('Country')['Sold'].transform('sum')
)
Country Sold pct_per
0 Japan 3432 0.440226
1 Japan 4364 0.559774
2 Korea 2231 1.000000
3 India 1130 0.325461
4 India 2342 0.674539
5 USA 4333 0.428501
6 USA 2356 0.232991
7 USA 3423 0.338509
Simple Solution
You were almost there.
First you need to group by country
Then create the new percentage column (by dividing the grouped sales by the sum of all sales)
# reset_index() is only there because the groupby makes the grouped column the index
df_grouped_countries = df.groupby(df.Country).sum().reset_index()
df_grouped_countries['pct_sold'] = df_grouped_countries.Sold / df.Sold.sum()
Are you looking for the percentage after or before aggregation?
import pandas as pd
countries = [['Japan',3432],['Japan',4364],['Korea',2231],['India',1130], ['India',2342],['USA',4333],['USA',2356],['USA',3423]]
df = pd.DataFrame(countries,columns=['Country','Sold'])
df1 = df.groupby(df['Country'])
df2 = df1.sum()
df2['percentage'] = (df2['Sold']/df2['Sold'].sum()) * 100
df2
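With the sample data above, this should give roughly the following (percentages rounded to two decimals as a sanity check, not verbatim console output):

         Sold  percentage
Country
India    3472       14.71
Japan    7796       33.02
Korea    2231        9.45
USA     10112       42.83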
I'm trying to fill country names in my dataframe where they are null, based on city and country pairs that do exist elsewhere. For example, in the dataframe below I want to replace the NaN for the City Bangalore with the Country India, since that City already appears with its Country in the dataframe.
df1=
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore NaN
I am new to this so any help would be appreciated :).
You can create a series mapping after dropping nulls and duplicates.
Then use fillna with pd.Series.map:
g = df.dropna(subset=['Country']).drop_duplicates('City').set_index('City')['Country']
df['Country'] = df['Country'].fillna(df['City'].map(g))
print(df)
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore India
This solution will also work if NaN occurs first within a group.
I believe
df1.groupby('City')['Country'].fillna(method='ffill')
should resolve your issue by forward filling missing values within each group.
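Assign the result back to make the change stick (a sketch):

df1['Country'] = df1.groupby('City')['Country'].fillna(method='ffill')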
One of the ways could be -
non_null_cities = df1.dropna().drop_duplicates(['City']).rename(columns={'Country':'C'})
df1 = df1.merge(non_null_cities, on='City', how='left')
df1.loc[df1['Country'].isnull(), 'Country'] = df1['C']
del df1['C']
Hope this will be helpful!
Here is one nasty way to do it:
first use forward fill and then backward fill (for the case where the NaN occurs first within a group)
df = (df.groupby('City')[['City', 'Country']].fillna(method='ffill')
        .groupby('City')[['City', 'Country']].fillna(method='bfill'))
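An order-independent alternative (a sketch, same column names assumed) is groupby.transform('first'), which broadcasts the first non-null Country seen for each City:

df['Country'] = df['Country'].fillna(
    df.groupby('City')['Country'].transform('first'))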