I have 3 Excel files. File A.xlsx is the main file I work in; it has two columns: Material and Location.
File B.xlsx has Material and Seasonality columns.
Same for file C.xlsx: Material and Seasonality.
The Location column contains only two countries (France and Spain) plus blank values.
File B only has seasonality information for France,
and file C only has seasonality information for Spain.
I would like to take the seasonality information from files B and C and put it in a separate column (named "Seasonality") in file A, corresponding to the Location value.
df1

Material  Location
2527      France
2528      Spain
2627      Spain
2628      NaN
2725      France

df2

Material  Seasonality
2527      Summer
2725      Autumn

df3

Material  Seasonality
2528      Winter
2627      Spring

desired output

Material  Location  Seasonality
2527      France    Summer
2528      Spain     Winter
2627      Spain     Spring
2628      NaN       NaN
2725      France    Autumn
The code I have tried is below. I know it's wrong, but I saw something like it recommended elsewhere:
import pandas as pd

df  = pd.read_excel('A.xlsx')
df1 = pd.read_excel('B.xlsx')
df3 = pd.read_excel('C.xlsx')

df4 = pd.merge(df, df1[['Material', 'Seasonality']], on='Material', how='left')  # merge for France
# This next line is not valid Python -- it was my attempt at a conditional merge for Spain:
# df5 = pd.merge(df, df3[['Material', 'Seasonality']], on='Material', how='left') if df['Location'] == 'Spain'
print(df4)
Concatenating the B and C files is not possible.
P.S. I don't know if it helps, but I have already merged files A and B, so the dataframe now holds the seasonality values for France, with NaNs for the Spain and blank rows from the Location column. Could you please help?
Let us try concat followed by groupby:
out = pd.concat([df1,df2,df3]).groupby('Material',as_index=False).first()
Out[90]:
Material Location Seasonality
0 2527 France Summer
1 2528 Spain Winter
2 2627 Spain Spring
3 2628 None None
4 2725 France Autumn
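For completeness, here is a self-contained sketch of the concat-then-groupby approach, with the sample frames built directly from the question (in practice they would come from pd.read_excel):

```python
import pandas as pd
import numpy as np

# Sample frames copied from the question
df1 = pd.DataFrame({'Material': [2527, 2528, 2627, 2628, 2725],
                    'Location': ['France', 'Spain', 'Spain', np.nan, 'France']})
df2 = pd.DataFrame({'Material': [2527, 2725],
                    'Seasonality': ['Summer', 'Autumn']})   # France
df3 = pd.DataFrame({'Material': [2528, 2627],
                    'Seasonality': ['Winter', 'Spring']})   # Spain

# Stack all three frames, then collapse each Material group, keeping the
# first non-null value found in each column (GroupBy.first skips NaN).
out = pd.concat([df1, df2, df3]).groupby('Material', as_index=False).first()
print(out)
```

Because B and C cover disjoint sets of materials, each group ends up with at most one non-null Seasonality, so no Location check is needed at all.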
I currently have a long list of countries (234 values); for simplicity's sake, only a handful are shown below. This is what my df currently looks like:

Country
China
India
U.S.
Indonesia
Pakistan
...
Montserrat
Falkland Islands
Niue
Tokelau
Vatican City

I want to create a matrix of some sort, where the countries listed in each row are also the column headers. This is what I want my output dataframe to look like (the same country list as both the index and the columns):

              China  India  U.S.  Indonesia  Pakistan  ...  Vatican City
China
India
U.S.
Indonesia
Pakistan
...
Vatican City
So to reiterate the question: how do I take the value in each row of column 1 and copy it into my dataframe's column headers to create a matrix? This dataframe is also being scraped from a website using requests and Beautiful Soup, so I can't just create a CSV file of a pre-made dataframe. Is what I want to do possible?
Initialize a Pandas DataFrame as follows
countryList = ['China', 'India', 'U.S.']
pd.DataFrame(columns=countryList, index=countryList)
and just append all elements of countryList according to your use case. This yields an empty dataframe to insert data into.
       China  India  U.S.
China    NaN    NaN   NaN
India    NaN    NaN   NaN
U.S.     NaN    NaN   NaN
Will something like this work?
data = ["US","China","England","Spain",'Brazil']
df = pd.DataFrame({"Country":data})
df[df.Country.values] = ''
df
Output:
Country US China England Spain Brazil
0 US
1 China
2 England
3 Spain
4 Brazil
You can even set the country as the index like:
data = ["US","China","England","Spain",'Brazil']
df = pd.DataFrame({"Country":data})
df[df.Country.values] = ''
df = df.set_index(df.Country)[df.Country.values].rename_axis(index=None)
Output:
US China England Spain Brazil
US
China
England
Spain
Brazil
7Shoe's answer is good, but in case you already have a dataframe:
import pandas as pd
df = pd.DataFrame({'Country':['U.S.','Canada', 'India']})
pd.DataFrame(columns=df.Country, index=df.Country).rename_axis(None)
Output
Country U.S. Canada India
U.S. NaN NaN NaN
Canada NaN NaN NaN
India NaN NaN NaN
I have a df named population with a column of countries. I want to merge rows so they reflect regions (africa, west hem, asia, europe, mideast). I have another df named regionref, from Kaggle, that has all the countries of the world and the region each is associated with.
How do I create a new column in the population df that holds the corresponding region for each country in the country column, using the region column from the Kaggle dataset?
So essentially this is the population dataframe:

CountryName  1960  1950 ...
US
Zambia
India

And this is the regionref dataset:

Country  Region    GDP ...
US       West Hem
Zambia   Africa
India    Asia

And I want the population df to look like:

CountryName  Region    1960  1950 ...
US           West Hem
Zambia       Africa
India        Asia
EDIT: I tried the concatenation, but for some reason the two columns are not recognizing the same values:

population['Country Name'].isin(regionref['Country']).value_counts()

This returned False for all values, as if there were no values in common, even though the two columns clearly do share country names.
You just need join functionality, or, to put it the pandas way, a concatenate.
Given two DataFrames pop, region:
pop = pd.DataFrame([['US', 1000, 2000], ['CN', 2000, 3000]], columns=['CountryName', 1950, 1960])
CountryName 1950 1960
0 US 1000 2000
1 CN 2000 3000
region = pd.DataFrame([['US', 'AMER', '5'], ['CN', 'ASIA', '4']], columns = ['Country', 'Region', 'GDP'])
Country Region GDP
0 US AMER 5
1 CN ASIA 4
You can do:
pd.concat([region.set_index('Country'), pop.set_index('CountryName')], axis=1)\
  .drop('GDP', axis=1)
Region 1950 1960
US AMER 1000 2000
CN ASIA 2000 3000
axis=1 concatenates horizontally; setting the index on each frame first is what makes the rows join up correctly.
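If you would rather not juggle indexes, a plain left merge produces the same result; a sketch using the sample frames above:

```python
import pandas as pd

pop = pd.DataFrame([['US', 1000, 2000], ['CN', 2000, 3000]],
                   columns=['CountryName', 1950, 1960])
region = pd.DataFrame([['US', 'AMER', '5'], ['CN', 'ASIA', '4']],
                      columns=['Country', 'Region', 'GDP'])

# Merge on the differently named key columns, then drop the duplicate
# key column that the merge carries over from the right frame.
out = pop.merge(region[['Country', 'Region']],
                left_on='CountryName', right_on='Country',
                how='left').drop(columns='Country')
print(out)
```

A left merge also has the advantage of keeping every population row even when a country is missing from regionref (its Region simply becomes NaN).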
I have a pandas dataframe which looks like this:
Country Sold
Japan 3432
Japan 4364
Korea 2231
India 1130
India 2342
USA 4333
USA 2356
USA 3423
I have used the code below to get the sum of the Sold column per country:

df1 = df.groupby(df['Country'])
df2 = df1.sum()

How do I now calculate each country's percentage of the total Sold?
You can get the percentage by adding this line:
df2["percentage"] = df2['Sold']*100 / df2['Sold'].sum()
In the output dataframe, a column with the percentage of each country is added.
We can divide the original Sold column by the per-group sums, broadcast back to the length of the original DataFrame with transform:

df.assign(
    pct_per=df['Sold'] / df.groupby('Country')['Sold'].transform('sum')
)
Country Sold pct_per
0 Japan 3432 0.440226
1 Japan 4364 0.559774
2 Korea 2231 1.000000
3 India 1130 0.325461
4 India 2342 0.674539
5 USA 4333 0.428501
6 USA 2356 0.232991
7 USA 3423 0.338509
Simple solution
You were almost there. First group by country, then create the new percentage column by dividing the grouped sales by the sum of all sales:
# reset_index() is only there because the groupby makes the grouped column the index
df_grouped_countries = df.groupby(df.Country).sum().reset_index()
df_grouped_countries['pct_sold'] = df_grouped_countries.Sold / df.Sold.sum()
Are you looking for the percentage after or before aggregation?
import pandas as pd
countries = [['Japan',3432],['Japan',4364],['Korea',2231],['India',1130], ['India',2342],['USA',4333],['USA',2356],['USA',3423]]
df = pd.DataFrame(countries,columns=['Country','Sold'])
df1 = df.groupby(df['Country'])
df2 = df1.sum()
df2['percentage'] = (df2['Sold']/df2['Sold'].sum()) * 100
df2
I'm trying to fill in missing country names in my dataframe based on city/country pairs that do exist elsewhere in it. For example, in the dataframe below I want to replace the NaN for the second Bangalore row with India, since a Bangalore/India pair already exists:
df1=
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore NaN
I am new to this so any help would be appreciated :).
You can create a series mapping after dropping nulls and duplicates.
Then use fillna with pd.Series.map:
g = df.dropna(subset=['Country']).drop_duplicates('City').set_index('City')['Country']
df['Country'] = df['Country'].fillna(df['City'].map(g))
print(df)
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore India
This solution will also work if NaN occurs first within a group.
I believe

df1.groupby('City')['Country'].fillna(method='ffill')

should resolve your issue by forward-filling missing values within each group. (Note that, unlike the map-based solution above, this leaves a NaN in place if it occurs first within its group.)
One of the ways could be:
non_null_cities = df1.dropna().drop_duplicates(['City']).rename(columns={'Country':'C'})
df1 = df1.merge(non_null_cities, on='City', how='left')
df1.loc[df1['Country'].isnull(), 'Country'] = df1['C']
del df1['C']
Hope this will be helpful!
Here is one admittedly nasty way to do it:
first forward-fill, then backward-fill (to handle a NaN that occurs first within its group).
df = df.groupby('City')[['City','Country']].fillna(method = 'ffill').groupby('City')[['City','Country']].fillna(method = 'bfill')
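To make the map-based answer above fully reproducible, here is a sketch with the sample frame constructed directly from the question:

```python
import pandas as pd
import numpy as np

# Sample data from the question
df = pd.DataFrame({
    'City': ['Bangalore', 'Delhi', 'London', 'California',
             'Dubai', 'Abu Dhabi', 'Bangalore'],
    'Country': ['India', 'India', 'UK', 'USA', 'UAE', 'UAE', np.nan],
})

# Build a City -> Country lookup from the rows that do have a country...
lookup = (df.dropna(subset=['Country'])
            .drop_duplicates('City')
            .set_index('City')['Country'])
# ...and use it to fill only the missing entries.
df['Country'] = df['Country'].fillna(df['City'].map(lookup))
print(df)
```

Because the lookup ignores row order entirely, this works no matter where within a group the NaN appears.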
I have a data set that I've been cleaning; to clean it, I needed to put it into a pivot table to summarize some of the data. I'm now putting it back into a dataframe so that I can merge it with some other dataframes. df1 looks something like this:
Count Region Period ACV PRJ
167 REMAINING US WEST 3/3/2018 5 57
168 REMAINING US WEST 3/31/2018 10 83
169 SAN FRANCISCO 1/13/2018 99 76
170 SAN FRANCISCO 1/20/2018 34 21
df2 looks something like this:
Count MKTcode Region
11 RSMR0 REMAINING US SOUTH
12 RWMR0 REMAINING US WEST
13 SFR00 SAN FRANCISCO
I've tried merging them with this code:
df3 = pd.merge(df1, df2, on='Region', how='inner')
but for some reason pandas is not interpreting the Region columns as the same data and the merge is turning up NaN data in the MKTcode column and it seems to be appending df2 to df1, like this:
Count Region Period ACV PRJ MKTcode
193 WASHINGTON, D.C. 3/3/2018 36 38 NaN
194 WASHINGTON, D.C. 3/31/2018 12 3 NaN
195 ATLANTA NaN NaN NaN ATMR0
196 BOSTON NaN NaN NaN B2MRN
I've tried inner and outer joins, but the real problem seems to be that pandas is interpreting the Region column of each dataframe as different elements.
The MKTcode and Region columns in df2 have only 12 observations, each occurring exactly once, whereas df1 has many repeats in the Region column (multiples of the same city). Is there a way to create a list of the 12 MKTcodes I need and perform a merge that matches each one with every row for the region I designate, like a one-to-many match?
Thanks.
When a merge isn't working as expected, the first thing to do is look at the offending columns.
The biggest culprit in most cases is trailing/leading whitespaces. These are usually introduced when DataFrames are incorrectly read from files.
Try getting rid of extra whitespace characters by stripping them out. Assuming you need to join on the "Region" column, use
for df in (df1, df2):
    # Strip the column(s) you're planning to join with
    df['Region'] = df['Region'].str.strip()
Now, merging should work as expected,
pd.merge(df1, df2, on='Region', how='inner')
Count_x Region Period ACV PRJ Count_y MKTcode
0 167 REMAINING US WEST 3/3/2018 5 57 12 RWMR0
1 168 REMAINING US WEST 3/31/2018 10 83 12 RWMR0
2 169 SAN FRANCISCO 1/13/2018 99 76 13 SFR00
3 170 SAN FRANCISCO 1/20/2018 34 21 13 SFR00
Another possibility, if you're still getting NaNs, is a difference in whitespace between words. For example, 'REMAINING  US WEST' (with two spaces) will not compare equal to 'REMAINING US WEST' (with one).
This time, the fix is to use str.replace:
for df in (df1, df2):
    # regex=True is required on newer pandas, where the default is literal matching
    df['Region'] = df['Region'].str.replace(r'\s+', ' ', regex=True)
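Putting both fixes together, here is a minimal self-contained reproduction (the sample values are shortened and partly invented) showing how normalizing whitespace rescues the merge:

```python
import pandas as pd

# 'SAN FRANCISCO ' has a trailing space and 'REMAINING  US WEST' a double
# space, so a naive merge on Region would find no matches at all.
df1 = pd.DataFrame({'Region': ['SAN FRANCISCO ', 'REMAINING  US WEST'],
                    'ACV': [99, 5]})
df2 = pd.DataFrame({'Region': ['SAN FRANCISCO', 'REMAINING US WEST'],
                    'MKTcode': ['SFR00', 'RWMR0']})

for df in (df1, df2):
    df['Region'] = (df['Region']
                    .str.strip()                            # leading/trailing spaces
                    .str.replace(r'\s+', ' ', regex=True))  # collapse internal runs

out = pd.merge(df1, df2, on='Region', how='inner')
print(out)
```

After normalization, every row in df1 finds its MKTcode, so the inner merge returns both rows instead of an empty frame.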