Pandas deleting rows based on same string in columns - python

   Manufacturer          Buy Box Seller
0  Goli                  Goli Nutrition Inc.
1  Hanes                 3rd Street Brands
2  NaN                   Inspiring Life
3  Sports Research       Sports Research
4  Beckham Luxury Linen  Thalestris Co.
Hello, I am using a pandas DataFrame to clean this file and want to delete rows which contain the manufacturer's name in the Buy Box Seller column. For example, the first row will be deleted because its Buy Box Seller value contains the string 'Goli'.

There are missing values, so first replace them with DataFrame.fillna, then test whether the Manufacturer value occurs in the Buy Box Seller value with a not in check inside DataFrame.apply with axis=1, and filter with boolean indexing:
mask = (df.fillna('Missing vals')
          .apply(lambda x: x['Manufacturer'] not in x['Buy Box Seller'], axis=1))
df = df[mask]
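As a self-contained sketch with the sample frame from the question (the fillna placeholder is arbitrary; it just has to be a string so the `in` test never sees a float NaN):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Manufacturer": ["Goli", "Hanes", np.nan, "Sports Research",
                     "Beckham Luxury Linen"],
    "Buy Box Seller": ["Goli Nutrition Inc.", "3rd Street Brands",
                       "Inspiring Life", "Sports Research", "Thalestris Co."],
})

# Keep only rows whose Manufacturer string does NOT occur inside the
# Buy Box Seller string; fillna first so `in` never sees a float NaN.
mask = (df.fillna("Missing vals")
          .apply(lambda x: x["Manufacturer"] not in x["Buy Box Seller"], axis=1))
df = df[mask]
print(df)  # rows 1, 2 and 4 survive
```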

Related

Groupby 2 Columns, ['year_range', 'director']. Where director is not in a year range, show 0

I have a dataframe df with different columns, including ['year_range', 'popularity', 'director']. I want to do a groupby() to see the mean score of the 'popularity' column for each director per 'year_range'. Some 'director' values don't fall in some 'year_range', so nothing is returned for those. I want to return 0 for those 'year_range' where the 'director' was not active:
df.groupby(['year_range', 'director']).popularity.mean()
If a director was not active in the 2010s, the 2010s should return 0 for that director, e.g. James Cameron.
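One way to get explicit zeros for the missing (year_range, director) pairs is to unstack the grouped result with fill_value=0 and stack it back; a sketch with made-up sample data standing in for the question's dataframe:

```python
import pandas as pd

# Made-up sample data -- the question's real columns are assumed
df = pd.DataFrame({
    "year_range": ["2000s", "2000s", "2010s"],
    "director":   ["James Cameron", "Sofia Coppola", "Sofia Coppola"],
    "popularity": [80.0, 60.0, 70.0],
})

out = (df.groupby(["year_range", "director"]).popularity.mean()
         .unstack(fill_value=0)  # missing (year_range, director) pairs become 0
         .stack())               # back to a MultiIndex Series
print(out)
```

An equivalent alternative is reindexing the grouped Series over the full product of year ranges and directors with fill_value=0.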

Reference df1 to check validity in df2 and create a new column in pandas

I have a dataframe with three columns denoting three zones of countries a user can be subscribed to. In each of the three columns there is a list of countries (some countries appear in all three columns).
In another dataframe I have a list of users and the countries they are in.
The objective is to identify which zone, if any, the user is in, and to mark whether they are allowed to use the service in that country.
df1 contains the country the user is in and the zone the user is subscribed to, as well as other fields.
df2 contains the zones available and the list of countries for that zone, as well as other fields.
df1.head()
name alias3 status_y country
Thetis Z1 active Romania
Demis Z1 active No_country
Donis Z1 active Sweden
Rhona Z3 active Germany
Theau Z2 active Bangladesh
df2.head()
Zone 1 Zone 2 Zone 3
ALBANIA ALBANIA ALBANIA
BELGIUM BELGIUM BELGIUM
BULGARIA AUSTRIA AUSTRIA
NaN CROATIA CROATIA
NaN NaN DENMARK
I have written conditions listing one of the three zones the user is subscribed to.
I have written values that select the country the user is in and check whether that country is in the zone the user is subscribed to.
conditions = [
    (df1['alias3'] == 'Z1'),
    (df1['alias3'] == 'Z2'),
    (df1['alias3'] == 'Z3')
]
values = [
    df1['country'].str.upper().isin(country_zone['Zone 1']),
    df1['country'].str.upper().isin(country_zone['Zone 2']),
    df1['country'].str.upper().isin(country_zone['Zone 3'])
]
df1['valid_country'] = np.select(conditions, values)
Is there a better way to do this in pandas?
One easy way would be:
def valid(sdf):
    zone = sdf.alias3.iat[0][-1]  # 'Z1' -> '1'
    sdf["valid_country"] = sdf.country.str.upper().isin(df2[f"Zone {zone}"])
    return sdf

df1 = df1.groupby("alias3").apply(valid)
groupby df1 over the alias3s and then
apply a function to the groups that checks if the uppered country names in the group's country column are in the respective column of df2 and stores the result in a column named valid_country
Another way would be to alter df2 a bit:
df2.columns = df2.columns.str.replace("one ", "")
df2 = (
    df2.melt(var_name="alias3", value_name="country")
       .dropna()
       .assign(valid_country=True)
)
df2.country = df2.country.str.capitalize()
Transforming the column names from 'Zone 1/2/3' to 'Z1/2/3'
melt-ing: the Zone-column names into a column named alias3 with the respective country names in a column named country
Dropping the NaNs
Adding a column named valid_country all True
Capitalizing the country names
And then:
df1 = df1.merge(df2, on=["alias3", "country"], how="left")
df1.valid_country = df1.valid_country.fillna(False)
Left-merge-ing the result with df1 on the columns alias3 and country
Filling the missing values in the column valid_country with False
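End to end, the second approach runs like this (a sketch; the zone columns are cut down and adjusted so the sample users have something to match against):

```python
import pandas as pd

df1 = pd.DataFrame({
    "name":    ["Thetis", "Demis", "Donis", "Rhona", "Theau"],
    "alias3":  ["Z1", "Z1", "Z1", "Z3", "Z2"],
    "country": ["Romania", "No_country", "Sweden", "Germany", "Bangladesh"],
})
df2 = pd.DataFrame({
    "Zone 1": ["ALBANIA", "ROMANIA", None],
    "Zone 2": ["ALBANIA", "BANGLADESH", "GERMANY"],
    "Zone 3": ["ALBANIA", "GERMANY", "DENMARK"],
})

df2.columns = df2.columns.str.replace("one ", "")   # 'Zone 1' -> 'Z1'
lookup = (df2.melt(var_name="alias3", value_name="country")
             .dropna()
             .assign(valid_country=True))
lookup.country = lookup.country.str.capitalize()    # 'ROMANIA' -> 'Romania'

df1 = df1.merge(lookup, on=["alias3", "country"], how="left")
df1.valid_country = df1.valid_country.fillna(False)
print(df1)
```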

Add lists to DF / Expand string to new column and data

I have a CSV file with a column that contains a string with vehicle options.
Brand Options
Toyota Color:Black,Wheels:18
Toyota Color:Black,
Chevy Color:Red,Wheels:16,Style:18
Honda Color:Green,Personalization:"Customer requested detailing"
Chevy Color:Black,Wheels:16
I want to expand the "Options" string to new columns with the appropriate name. The dataset is considerably large so I am trying to name the columns programmatically (ie: Color, Wheels, Personalization) then apply the respective value to the row or a null value.
Adding new data
import pandas as pd
Cars = pd.read_csv("Cars.CSV") # Loads cars into df
split = Cars["Options"].str.split(",", expand = True) # Data in form of {"Color:Black", "Wheels:16"}
split[0][0].split(":") # returns ['Color', 'Black']
What is an elegant way to concat these lists to the original dataframe Cars without specified columns?
You can prepare for a clean split by first using rstrip to avoid a null column, since you have one row with a comma at the end. Then, after splitting, explode to multiple rows and split again by :, this time using expand=True. Then, pivot the dataset into the desired format and concat back to the original dataframe:
pd.concat([df,
           df['Options'].str.rstrip(',')
                        .str.split(',')
                        .explode()
                        .str.split(':', expand=True)
                        .pivot(values=1, columns=0)],
          axis=1).drop('Options', axis=1)
Out[1]:
Brand Color Personalization Style Wheels
0 Toyota Black NaN NaN 18
1 Toyota Black NaN NaN NaN
2 Chevy Red NaN 18 16
3 Honda Green "Customer requested detailing" NaN NaN
4 Chevy Black NaN NaN 16
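The same chain, runnable end to end (the answer's df is the question's Cars frame, built inline here instead of read from CSV):

```python
import pandas as pd

df = pd.DataFrame({
    "Brand": ["Toyota", "Toyota", "Chevy", "Honda", "Chevy"],
    "Options": ["Color:Black,Wheels:18",
                "Color:Black,",
                "Color:Red,Wheels:16,Style:18",
                'Color:Green,Personalization:"Customer requested detailing"',
                "Color:Black,Wheels:16"],
})

out = pd.concat([df,
                 df["Options"].str.rstrip(",")  # drop trailing comma in row 1
                              .str.split(",")   # list of "Key:Value" strings per row
                              .explode()        # one "Key:Value" string per row
                              .str.split(":", expand=True)   # two columns: key, value
                              .pivot(values=1, columns=0)],  # keys become columns
                axis=1).drop("Options", axis=1)
print(out)
```

The pivot works despite the duplicated index produced by explode because each (row index, key) pair is unique.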

Pandas: sum of values in one dataframe based on the group in a different dataframe

I have a dataframe that contains companies with their sectors
Symbol Sector
0 MCM Industrials
1 AFT Health Care
2 ABV Health Care
3 AMN Health Care
4 ACN Information Technology
I have another dataframe that contains companies with their positions
Symbol Position
0 ABC 1864817
1 AAP -3298989
2 ABV -1556626
3 AXC 2436387
4 ABT 878535
What I want is a dataframe that contains the aggregate position for each sector, i.e. the sum of the positions of all the companies in a given sector. I can do this for a single sector by
df2[df2.Symbol.isin(df1.groupby('Sector').get_group('Industrials')['Symbol'].to_list())]
I am looking for a more efficient pandas approach rather than looping over each sector of the groupby. The final dataframe should look like the following:
Sector Sum Position
0 Industrials 14567232
1 Health Care -329173249
2 Information Technology -65742234
3 Energy 6574352342
4 Pharma 6342387658
Any help is appreciated.
If I understood the question correctly, one way to do it is joining both data frames and then group by sector and sum the position column, like so:
df_agg = df1.join(df2['Position']).drop('Symbol', axis=1)
df_agg.groupby('Sector').sum()
Where, df1 is the df with Sectors and df2 is the df with Positions.
You can map the Symbol column to sector and use that Series to group.
df2.groupby(df2.Symbol.map(df1.set_index('Symbol').Sector)).Position.sum()
Let us just do a merge:
df2.merge(df1,how='left').groupby('Sector').Position.sum()
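The last two one-liners, run against cut-down sample frames (positions made up; note that symbols present in only one frame simply drop out of the sums):

```python
import pandas as pd

df1 = pd.DataFrame({
    "Symbol": ["MCM", "AFT", "ABV", "AMN", "ACN"],
    "Sector": ["Industrials", "Health Care", "Health Care", "Health Care",
               "Information Technology"],
})
df2 = pd.DataFrame({
    "Symbol":   ["ABC", "AAP", "ABV", "AXC", "ABT", "AFT", "ACN"],
    "Position": [1864817, -3298989, -1556626, 2436387, 878535, 1000, 500],
})

# Map each symbol to its sector and group by that Series ...
by_map = df2.groupby(df2.Symbol.map(df1.set_index("Symbol").Sector)).Position.sum()
# ... or merge the sector in first, then group by the Sector column.
by_merge = df2.merge(df1, how="left").groupby("Sector").Position.sum()
print(by_map)
```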

Pandas, join row from target file based on condition

I need to merge a row from a target DataFrame into my source DataFrame on a fuzzy matching condition that has already been developed, let's call the method fuzzyTest. If fuzzy test returns True, I want to merge the row from the target file into my source file when matched.
So basically do a left join where the TARGET COMPANY passes the fuzzyTest when compared to the SOURCE COMPANY.
Source DataFrame
SOURCE COMPANY
0 Cool Company
1 BigPharma
2 Tod Kompany
3 Wallmart
Target DataFrame
TARGET COMPANY
0 Kool Company
1 Big farma
2 Todd's Company
3 C-Mart
4 SuperMart
5 SmallStore
6 ShopRus
Hopefully after mapping through fuzzyTest the output would be:
SOURCE COMPANY TARGET COMPANY
0 Cool Company Kool Company
1 BigPharma Big farma
2 Tod Kompany Todd's Company
3 Wallmart NaN
So if your fuzzy logic only compares the two strings on each row, just wrap it as a function that takes in the source column and the target column.
Make both columns in one dataframe then run:
def FuzzyTest(source, target):
    .....
    if ...:
        return target
    else:
        return None

df['Target Company'] = df.apply(lambda x: FuzzyTest(x['Source'], x['Target']), axis=1)
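Since the question's fuzzyTest isn't shown, here is a hypothetical stand-in built on difflib.SequenceMatcher with a 0.6 cutoff; it also picks the best-scoring target per source row rather than comparing row-by-row, which reproduces the expected output above:

```python
from difflib import SequenceMatcher

import pandas as pd

source = pd.DataFrame({"SOURCE COMPANY": ["Cool Company", "BigPharma",
                                          "Tod Kompany", "Wallmart"]})
target = pd.DataFrame({"TARGET COMPANY": ["Kool Company", "Big farma",
                                          "Todd's Company", "C-Mart",
                                          "SuperMart", "SmallStore", "ShopRus"]})

def fuzzy_best(name, candidates, cutoff=0.6):
    # Hypothetical stand-in for the question's fuzzyTest: pick the candidate
    # with the highest similarity ratio, or None if nothing clears the cutoff.
    scores = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates]
    best_score, best = max(scores)
    return best if best_score >= cutoff else None

source["TARGET COMPANY"] = source["SOURCE COMPANY"].apply(
    fuzzy_best, candidates=target["TARGET COMPANY"])
print(source)  # Wallmart gets no match and stays None
```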
