Merge dataframes - python

I'm trying to merge these two dataframes:
df1=
pais ano cantidad
0 Chile 2000 10
1 Chile 2001 11
2 Chile 2002 12
df2=
pais ano cantidad
0 Chile 1999 0
1 Chile 2000 0
2 Chile 2001 0
3 Chile 2002 0
4 Chile 2003 0
I'm trying to merge df1 into df2 and replace the rows for existing 'ano' (year) values with those from df1. This is the code I'm trying right now and what I'm getting:
df=df1.combine_first(df2)
df=
pais ano cantidad
0 Chile 2000.0 10.0
1 Chile 2001.0 11.0
2 Chile 2002.0 12.0
3 Chile 2002.0 0.0
4 Chile 2003.0 0.0
As you can see, the row corresponding to 1999 is missing, and the one for 2002 with 'cantidad' = 0 shouldn't be there. My desired output is this:
df=
pais ano cantidad
0 Chile 1999 0
1 Chile 2000 10
2 Chile 2001 11
3 Chile 2002 12
4 Chile 2003 0
Any ideas? Thank you!

Add the how='outer' param to the merge.
By default, merge uses how='inner', which keeps only the keys that are in both dataframes (the intersection), while you want their union.
Also, you may want to add on='ano' to declare which column you want to merge on. It may not be needed in your case, but it's worth checking.
Please check Pandas Merging 101 for more details
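For reference, a minimal sketch of that suggestion; the suffixes and the fillna/astype steps are additions of mine, needed to collapse the two cantidad columns after the outer join:
df = df1.merge(df2, on=['pais', 'ano'], how='outer', suffixes=('', '_df2'))
# keep df1's cantidad where it exists, otherwise fall back to df2's zero
df['cantidad'] = df['cantidad'].fillna(df['cantidad_df2'])
df = (df.drop(columns='cantidad_df2')
        .sort_values('ano')
        .reset_index(drop=True))
df['cantidad'] = df['cantidad'].astype(int)  # the outer join introduces NaN, which forces float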

You can perform a left join on df2 and fillna the missing values from df2.cantidad. I'm joining on pais and ano because I assume your real dataframe has more countries than 'Chile'. Passing a dict of Series to fillna fills by index alignment, which works here because the left join keeps df2's row order.
df = df2[['pais','ano']].merge(df1, on=['pais','ano'], how='left').fillna({'cantidad': df2.cantidad})
df.cantidad = df.cantidad.astype('int')
df
Out:
pais ano cantidad
0 Chile 1999 0
1 Chile 2000 10
2 Chile 2001 11
3 Chile 2002 12
4 Chile 2003 0


getting new dataframe from existing in dataframe with conditions on multiple columns

I am trying to sort a pandas dataframe. The data looks like this:
year  state    district  Party   rank  share in votes
2010  haryana  kaithal   Winner  1     40.12
2010  haryana  kaithal   bjp     2     30.52
2010  haryana  kaithal   NOTA    3     29
2010  goa      panji     Winner  3     10
2010  goa      panji     INC     2     40
2010  goa      panji     BJP     1     50
2013  up       meerut    Winner  2     40
2013  up       meerut    SP      1     60
2015  haryana  kaithal   Winner  2     15
2015  haryana  kaithal   BJP     3     35
2015  haryana  kaithal   INC     1     50
This data is for multiple states over multiple years.
In this dataset there are multiple rows, one per party, for each district, and I want to calculate a margin of share for each district in the manner below. I have tried this, but I am not able to write it fully; I cannot work out how to define the margin of share in code and get a dataframe with only one margin-of-share value per district instead of the party-wise shares.
for year in df['YEAR']:
    for state in df['STATE']:
        for district in df['DISTRICT']:
            for rank in df['RANK']:
                for party in df['PARTY']:
                    if rank == 1 and party == 'WINNER':
                        # then margin of share = share of Winner - share of party at rank 2;
                        # if the Winner does not have rank 1, then
                        # margin of share = share of Winner - share of party at rank 1
                        ...
I am basically trying to get this output-
| year | state   | district | margin of share |
|------|---------|----------|-----------------|
| 2010 | haryana | kaithal  | 9.6             |
| 2010 | goa     | panji    | -40             |
| 2013 | up      | meerut   | -20             |
| 2015 | haryana | kaithal  | -35             |
I wish to create a different dataframe with the columns year, state, district, and margin of share.
Create a MultiIndex from the first 3 columns with DataFrame.set_index, create masks, filter with DataFrame.loc and subtract the values, and finally use Series.fillna to replace the values not matched by condition m3:
df1 = df.set_index(['year', 'state', 'district'])
m1 = df1.Party=='Winner'
m2 = df1['rank']==2
m3 = df1['rank']==1
s1 = (df1.loc[m1 & m3, 'share in votes']
         .sub(df1.loc[m2, 'share in votes']))
print (s1)
year state district
2010 goa panji NaN
haryana kaithal 9.6
2013 up meerut NaN
2015 haryana kaithal NaN
Name: share in votes, dtype: float64
s2 = (df1.loc[m1, 'share in votes']
         .sub(df1.loc[m3, 'share in votes']))
print (s2)
year state district
2010 haryana kaithal 0.0
goa panji -40.0
2013 up meerut -20.0
2015 haryana kaithal -35.0
Name: share in votes, dtype: float64
df = s1.fillna(s2).reset_index()
print (df)
year state district share in votes
0 2010 goa panji -40.0
1 2010 haryana kaithal 9.6
2 2013 up meerut -20.0
3 2015 haryana kaithal -35.0
Use groupby and where with conditions:
g = df.groupby(['year', 'state', 'district'])
cond1 = df['Party'].eq('Winner')
cond2 = df['rank'].eq(1)
cond3 = df['rank'].eq(2)
df1 = g['share in votes'].agg(
    lambda x: (x.where(cond1).sum() - x.where(cond3).sum())       # Winner holds rank 1: subtract rank 2
              if x.where(cond1 & cond2).sum() != 0
              else (x.where(cond1).sum() - x.where(cond2).sum())  # otherwise subtract rank 1
).reset_index()
Result (df1):
year state district share in votes
0 2010 goa panji -40.0
1 2010 haryana kaithal 9.6
2 2013 up meerut -20.0
3 2015 haryana kaithal -35.0
If you want the same row order as df, use the following code:
df.iloc[:, :3].drop_duplicates().merge(df1)
Result:
year state district share in votes
0 2010 haryana kaithal 9.6
1 2010 goa panji -40.0
2 2013 up meerut -20.0
3 2015 haryana kaithal -35.0
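If the lambda feels dense, here is a spelled-out restatement of the same rule (my own sketch, not from the answers above), using a named function with groupby.apply:
def margin(g):
    # share and rank of the 'Winner' row in this district
    winner_share = g.loc[g['Party'].eq('Winner'), 'share in votes'].iloc[0]
    winner_rank = g.loc[g['Party'].eq('Winner'), 'rank'].iloc[0]
    # subtract rank 2 if the Winner holds rank 1, otherwise subtract rank 1
    other_rank = 2 if winner_rank == 1 else 1
    other_share = g.loc[g['rank'].eq(other_rank), 'share in votes'].iloc[0]
    return winner_share - other_share

out = (df.groupby(['year', 'state', 'district'], sort=False)[['Party', 'rank', 'share in votes']]
         .apply(margin)
         .reset_index(name='margin of share'))
sort=False keeps the row order of df, like the merge trick above.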

Python summing selected values in a column that match given condition

Here's the data after the preliminary data cleaning.
year  country  employees
2001  US       9
2001  Canada   81
2001  France   22
2001  Japan    31
2001  Chile    7
2001  Mexico   15
2001  Total    165
2002  US       5
2002  Canada   80
2002  France   20
2002  Japan    30
2002  Egypt    35
2002  Total    170
...   ...      ...
2010  US       32
...   ...      ...
What I want to get is the table below, which sums all countries except US, Canada, France, and Japan into 'Others'. The list of countries varies from year to year between 2001 and 2010, so I want to use a for loop with an if condition to loop over every year.
year  country  employees
2001  US       9
2001  Canada   81
2001  France   22
2001  Japan    31
2001  Others   22
2001  Total    165
2002  US       5
2002  Canada   80
2002  France   20
2002  Japan    30
2002  Others   35
2002  Total    170
Any leads would be greatly appreciated!
You may consider dropping Total from your dataframe.
However, as stated, your question can be solved by using Series.where to map the values you don't recognize to 'Others':
country = df["country"].where(df["country"].isin(["US", "Canada", "France", "Japan", "Total"]), "Others")
df.groupby([df["year"], country]).sum(numeric_only=True)

How to add null value rows into pandas dataframe for missing years in a multi-line chart plot

I am building a chart from a dataframe with a series of yearly values for six countries. This table is created by an SQL query and then passed to pandas with the read_sql command...
country date value
0 CA 2000 123
1 CA 2001 125
2 US 1999 223
3 US 2000 235
4 US 2001 344
5 US 2002 355
...
Unfortunately, not every country has a value for every year; nevertheless, the chart tool requires each country to have the same set of years in the dataframe. Years that have no value need a NaN (null) row added.
In the end, I want the pandas dataframe to look as follows for all six countries....
country date value
0 CA 1999 NaN
1 CA 2000 123
2 CA 2001 125
3 CA 2002 NaN
4 US 1999 223
5 US 2000 235
6 US 2001 344
7 US 2002 355
8 DE 1999 NaN
9 DE 2000 NaN
10 DE 2001 423
11 DE 2002 326
...
Are there any tools or shortcuts for determining the min-max dates and then ensuring a new NaN row is created where needed?
Use DataFrame.unstack with DataFrame.stack, passing dropna=False so the all-NaN rows created by unstack are kept:
df = df.set_index(['country','date']).unstack().stack(dropna=False).reset_index()
print (df)
country date value
0 CA 1999 NaN
1 CA 2000 123.0
2 CA 2001 125.0
3 CA 2002 NaN
4 US 1999 223.0
5 US 2000 235.0
6 US 2001 344.0
7 US 2002 355.0
Another idea with DataFrame.reindex:
mux = pd.MultiIndex.from_product([df['country'].unique(),
                                  range(df['date'].min(), df['date'].max() + 1)],
                                 names=['country','date'])
df = df.set_index(['country','date']).reindex(mux).reset_index()
print (df)
country date value
0 CA 1999 NaN
1 CA 2000 123.0
2 CA 2001 125.0
3 CA 2002 NaN
4 US 1999 223.0
5 US 2000 235.0
6 US 2001 344.0
7 US 2002 355.0
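A self-contained version of the reindex idea, with made-up rows shaped like the question's data (including a DE country missing its early years), to show how min/max define the span:
import pandas as pd

df = pd.DataFrame({
    'country': ['CA', 'CA', 'US', 'US', 'US', 'US', 'DE', 'DE'],
    'date': [2000, 2001, 1999, 2000, 2001, 2002, 2001, 2002],
    'value': [123, 125, 223, 235, 344, 355, 423, 326],
})

# every country crossed with every year between the global min and max
mux = pd.MultiIndex.from_product(
    [df['country'].unique(),
     range(df['date'].min(), df['date'].max() + 1)],
    names=['country', 'date'])

df = df.set_index(['country', 'date']).reindex(mux).reset_index()
# DE 1999 and DE 2000 now appear with value NaN, as in the desired output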

Wide to Long data frame returning NaN instead of float values

I have a large data frame that looks like this:
Country 2010 2011 2012 2013
0 Germany 4.625e+10 4.814e+10 4.625e+10 4.593e+10
1 France 6.178e+10 6.460e+10 6.003e+10 6.241e+10
2 Italy 4.625e+10 4.625e+10 4.625e+10 4.625e+10
I want to reshape the data so that Country, Year, and Value are all columns. I used the melt method:
dftotal = pd.melt(dftotal, id_vars='Country',
                  value_vars=[2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017],
                  var_name='Year', value_name='Total')
I was able to attain:
Country Year Total
0 Germany 2010 NaN
1 France 2010 NaN
2 Italy 2010 NaN
My issue is that the float values turn into NaN, and I don't know how to reshape the data frame to keep the values as floats.
Omit the value_vars argument and it works:
pd.melt(dftotal, id_vars='Country', var_name ='Year', value_name='Total')
Country Year Total
0 Germany 2010 4.625000e+10
1 France 2010 6.178000e+10
2 Italy 2010 4.625000e+10
3 Germany 2011 4.814000e+10
4 France 2011 6.460000e+10
5 Italy 2011 4.625000e+10
6 Germany 2012 4.625000e+10
7 France 2012 6.003000e+10
8 Italy 2012 4.625000e+10
9 Germany 2013 4.593000e+10
10 France 2013 6.241000e+10
11 Italy 2013 4.625000e+10
The problem is probably that your column names are not ints but strings, so you could do:
dftotal = pd.melt(dftotal, id_vars='Country',
                  value_vars=['2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017'],
                  var_name='Year', value_name='Total')
And it would also work.
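A quick way to check which form your headers actually have (a one-line sketch; dftotal as in the question):
print(dftotal.columns.tolist())
# e.g. ['Country', '2010', '2011', '2012', '2013'] -> quoted years mean string headers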
Alternatively, using stack:
dftotal = (dftotal.set_index('Country').stack()
                  .reset_index()
                  .rename(columns={'level_1': 'Year', 0: 'Total'})
                  .sort_values('Year'))
This will get you the same output (but less succinctly).

Applying a condition to a df to get the aggregate counts

I have this df structured like this, where each year has the same rows/entries:
Year Name Expire
2001 Bob 2002
2001 Tim 2003
2001 Will 2004
2002 Bob 2002
2002 Tim 2003
2002 Will 2004
2003 Bob 2002
2003 Tim 2003
2003 Will 2004
I have subsetted the df with df[df['Expire'] > df['Year']]:
2001 Bob 2002
2001 Tim 2003
2001 Will 2004
2002 Tim 2003
2002 Will 2004
2003 Will 2004
Now I want to return, for each year, the count of names that have expired, something like:
Year count
2001 0
2002 1
2003 1
How can I accomplish this? I can't do df[df['Expire'] <= df['Year']].groupby('Year')['Name'].agg(['count']), because that would return unnecessary rows for me. Any way to count only the last instance?
You can use groupby with a boolean mask and aggregate with sum:
print (df['Expire']<= df['Year'])
0 False
1 False
2 False
3 True
4 False
5 False
6 True
7 True
8 False
dtype: bool
df=(df['Expire']<=df['Year']).groupby(df['Year']).sum().astype(int).reset_index(name='count')
print (df)
Year count
0 2001 0
1 2002 1
2 2003 2
Verifying:
print (df[df['Expire']<= df['Year']])
Year Name Expire
3 2002 Bob 2002
6 2003 Bob 2002
7 2003 Tim 2003
IIUC, you can use .apply and sum the True values, i.e.:
df.groupby('Year').apply(lambda x: (x['Expire']<=x['Year']).sum())
Output:
Year
2001 0
2002 1
2003 2
