join on index AND column - python

I would like to join two dataframes based on two conditions:
1. Join via Index
2. If a column header appears in both dataframes, join those columns as well
To give an example, let's imagine I have these two dataframes:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'date': [2010, 2011, 2012],
                    'a': [np.nan, 30, np.nan],
                    'b': [55, np.nan, np.nan],
                    'c': [55, 40, 84]})
df1 = df1.set_index("date")
df2 = pd.DataFrame({'date': [2010, 2011, 2012],
                    'a': [10, np.nan, 30],
                    'b': [np.nan, 80, 84],
                    'd': [55, 40, 84]})
df2 = df2.set_index("date")
If I now join the two via pd.concat, I get columns such as "a" twice:
pd.concat([df1, df2], axis=1)
         a     b   c     a     b   d
date
2010   NaN  55.0  55  10.0   NaN  55
2011  30.0   NaN  40   NaN  80.0  40
2012   NaN   NaN  84  30.0  84.0  84
But I would rather have:
         a     b   c   d
date
2010  10.0  55.0  55  55
2011  30.0  80.0  40  40
2012  30.0  84.0  84  84
Thanks in advance!

Try this, using add with fill_value=0 (df1 and df2 already have date as their index, so there is no need to set it again):
print(df1.add(df2, fill_value=0))
         a     b     c     d
date
2010  10.0  55.0  55.0  55.0
2011  30.0  80.0  40.0  40.0
2012  30.0  84.0  84.0  84.0
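Since df1 and df2 never both hold a real value in the same cell here, combine_first reads a little more directly than add: it keeps df1's values and fills the gaps from df2. A sketch under that assumption (add with fill_value=0 would instead sum any overlapping cells):
# keep df1's values, fall back to df2 where df1 has NaN
print(df1.combine_first(df2))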

Related

Join two dataframes on multiple conditions in python

I have the following problem: I am trying to join df1 = ['ID', 'Earnings', 'WC', 'Year'] and df2 = ['ID', 'F1_Earnings', 'WC', 'Year']. So, for example, the 'F1_Earnings' of a particular company in df2 (a.k.a. the forward earnings), say with ID = 1 and year = 1996, should get joined onto df1 in a way that they show up in df1 under ID = 1 and year = 1995.
I have no clue how to specify a join on two conditions: of course they need to join on "ID", but how do I add a second condition specifying that they also join on "df1_year = df2_year - 1"?
d1 = {'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
      'Earnings': [100, 200, 400, 250, 300, 350, 400, 550, 700, 259, 300, 350],
      'WC': [20, 40, 35, 55, 60, 65, 30, 28, 32, 45, 60, 52],
      'Year': [1995, 1996, 1997, 1996, 1997, 1998, 1995, 1997, 1998, 1996, 1997, 1998]}
df1 = pd.DataFrame(data=d1)
d2 = {'ID': [1, 2, 3, 4],
      'F1_Earnings': [120, 220, 420, 280],
      'WC': [23, 37, 40, 52],
      'Year': [1996, 1997, 1998, 1999]}
df2 = pd.DataFrame(data=d2)
I did the following, but I guess there must be a smarter way? I am afraid it won't work for larger datasets:
df3 = pd.merge(df1, df2, how='left', on='ID')
df3.loc[df3['Year_x'] == df3['Year_y'] - 1]
You can use a Series as a key in merge:
df1.merge(df2, how='left',
          left_on=['ID', 'Year'],
          right_on=['ID', df2['Year'].sub(1)])
output:
    ID  Year  Earnings  WC_x  Year_x  F1_Earnings  WC_y  Year_y
0    1  1995       100    20    1995        120.0  23.0  1996.0
1    1  1996       200    40    1996          NaN   NaN     NaN
2    1  1997       400    35    1997          NaN   NaN     NaN
3    2  1996       250    55    1996        220.0  37.0  1997.0
4    2  1997       300    60    1997          NaN   NaN     NaN
5    2  1998       350    65    1998          NaN   NaN     NaN
6    3  1995       400    30    1995          NaN   NaN     NaN
7    3  1997       550    28    1997        420.0  40.0  1998.0
8    3  1998       700    32    1998          NaN   NaN     NaN
9    4  1996       259    45    1996          NaN   NaN     NaN
10   4  1997       300    60    1997          NaN   NaN     NaN
11   4  1998       350    52    1998        280.0  52.0  1999.0
Or change Year to Year - 1 before the merge:
df1.merge(df2.assign(Year=df2['Year'].sub(1)),
          how='left', on=['ID', 'Year'])
output:
    ID  Earnings  WC_x  Year  F1_Earnings  WC_y
0    1       100    20  1995        120.0  23.0
1    1       200    40  1996          NaN   NaN
2    1       400    35  1997          NaN   NaN
3    2       250    55  1996        220.0  37.0
4    2       300    60  1997          NaN   NaN
5    2       350    65  1998          NaN   NaN
6    3       400    30  1995          NaN   NaN
7    3       550    28  1997        420.0  40.0
8    3       700    32  1998          NaN   NaN
9    4       259    45  1996          NaN   NaN
10   4       300    60  1997          NaN   NaN
11   4       350    52  1998        280.0  52.0
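If you only need the forward earnings, a small variation on the second approach is to restrict the right frame to the join keys plus F1_Earnings before the shifted merge, which avoids the WC_x/WC_y suffix collisions (out is just a name for the result):
# keep only the join keys and the value column from df2
out = df1.merge(df2[['ID', 'Year', 'F1_Earnings']].assign(Year=df2['Year'] - 1),
                how='left', on=['ID', 'Year'])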

Impute missing dates and values using Python

I have a DataFrame named df with the following values:
      Date  Available  Used  Total  Free
  06072019          5    19     24     5
  06202019         14    10     24     6
  07072019          6    16     24     6
  07202019         20     4     24    20
  08072019         23     1     24    23
I am missing the date 08202019 and am looking to impute the missing values with the average of the existing data that I have.
This is what I am currently doing:
import numpy as np
import datetime as dt
df.groupby([df.index.date]).transform(lambda x: x.fill(x.mean()))
However, I know that some of the syntax is not correct here and would like some suggestions.
Use a list comprehension to get the means of all the columns excluding the first (Date) column, and create a transposed dataframe, which will essentially be one row. Then concat this new one-row dataframe with the main dataframe.
df = pd.DataFrame({'Date': {0: '06-07-2019',
                            1: '06-20-2019',
                            2: '07-07-2019',
                            3: '07-20-2019',
                            4: '08-07-2019'},
                   'Available': {0: 5, 1: 14, 2: 6, 3: 20, 4: 23},
                   'Used': {0: 19, 1: 10, 2: 16, 3: 4, 4: 1},
                   'Total': {0: 24, 1: 24, 2: 24, 3: 24, 4: 24},
                   'Free': {0: 5, 1: 6, 2: 6, 3: 20, 4: 23}})
s = pd.DataFrame([df[col].mean() for col in df.columns[1:]]).T
s.columns = df.columns[1:]
s['Date'] = '08-20-2019'
df_new = pd.concat([df, s])
df_new
Out[1]:
         Date  Available  Used  Total  Free
0  06-07-2019        5.0  19.0   24.0   5.0
1  06-20-2019       14.0  10.0   24.0   6.0
2  07-07-2019        6.0  16.0   24.0   6.0
3  07-20-2019       20.0   4.0   24.0  20.0
4  08-07-2019       23.0   1.0   24.0  23.0
0  08-20-2019       13.6  10.0   24.0  12.0
In regard to your comment, you can create a missing_dates list and everything else follows automatically:
df = pd.DataFrame({'Date': {0: '06-07-2019',
                            1: '06-20-2019',
                            2: '07-07-2019',
                            3: '07-20-2019',
                            4: '08-07-2019'},
                   'Available': {0: 5, 1: 14, 2: 6, 3: 20, 4: 23},
                   'Used': {0: 19, 1: 10, 2: 16, 3: 4, 4: 1},
                   'Total': {0: 24, 1: 24, 2: 24, 3: 24, 4: 24},
                   'Free': {0: 5, 1: 6, 2: 6, 3: 20, 4: 23}})
s = pd.DataFrame([df[col].mean() for col in df.columns[1:]]).T
missing_dates = ['08-20-2019', '08-30-2019']
# one copy of the mean row per missing date (DataFrame.append was
# removed in pandas 2.0, so concat is used to stack the copies)
t = pd.concat([s] * len(missing_dates))
t.columns = df.columns[1:]
t['Date'] = missing_dates
df_new = pd.concat([df, t])
df_new
Out[2]:
         Date  Available  Used  Total  Free
0  06-07-2019        5.0  19.0   24.0   5.0
1  06-20-2019       14.0  10.0   24.0   6.0
2  07-07-2019        6.0  16.0   24.0   6.0
3  07-20-2019       20.0   4.0   24.0  20.0
4  08-07-2019       23.0   1.0   24.0  23.0
0  08-20-2019       13.6  10.0   24.0  12.0
0  08-30-2019       13.6  10.0   24.0  12.0
Consider your dataframe:
import pandas as pd
import numpy as np
df = pd.read_clipboard()
df["Date"] = pd.to_datetime(df["Date"], format="%m%d%Y")
df = df.set_index("Date")
print(df)
            Available  Used  Total  Free
Date
2019-06-07          5    19     24     5
2019-06-20         14    10     24     6
2019-07-07          6    16     24     6
2019-07-20         20     4     24    20
2019-08-07         23     1     24    23
You can create a list of missing dates, convert it to a pandas datetime index, take the union with the existing index, and then reindex and fill the NaNs with the column means.
missing_dates = ["2019-08-20", "2019-09-07", "2019-09-20"]
missing_dates = pd.to_datetime(missing_dates)
new_index = df.index.union(missing_dates)
df = df.reindex(new_index).fillna(df.mean(numeric_only=True))
print(df)
            Available  Used  Total  Free
2019-06-07        5.0  19.0   24.0   5.0
2019-06-20       14.0  10.0   24.0   6.0
2019-07-07        6.0  16.0   24.0   6.0
2019-07-20       20.0   4.0   24.0  20.0
2019-08-07       23.0   1.0   24.0  23.0
2019-08-20       13.6  10.0   24.0  12.0
2019-09-07       13.6  10.0   24.0  12.0
2019-09-20       13.6  10.0   24.0  12.0
In case you want to check for dates that are non-existent in the data, you can try this:
# suppose data series goes from '2020-09-30' to '2020-10-04' and data on some dates may be missing.
df = pd.DataFrame.from_dict(
    dict(datetime=['09302020', '10012020', '10042020'],
         val1=[1, 3, 5],
         val2=[6, 10, 12]))
df.datetime = pd.to_datetime(df.datetime, format='%m%d%Y')
print(df)
dates_missing = pd.date_range(start='2020-09-30', end='2020-10-04').difference(df.datetime)
val_means = {col: df[col].mean() for col in list(df.columns) if col != 'datetime'}
# DataFrame.append was removed in pandas 2.0, so stack the rows with concat
df = pd.concat([df, pd.DataFrame.from_dict(
    dict(datetime=dates_missing, **val_means))])
df = df.sort_values(by=['datetime'])
df

Get last non-null value of a row and its column in pandas DataFrame

I want to get the last non-null value (rightmost) of row C in this DataFrame.
With that, I also want to get its Year (column name).
Here is my DataFrame :
df = pd.DataFrame(np.random.randint(0, 100, size=(4, 5)),
                  columns=['2016', '2017', '2018', '2019', '2020'],
                  index=['A', 'B', 'C', 'D'])
df.iloc[2, 2:5] = np.nan
print(df)
   2016  2017  2018  2019  2020
A    41    69  63.0  85.0  16.0
B    12    99  88.0  87.0  13.0
C    80    15   NaN   NaN   NaN
D    42    27   3.0  76.0   6.0
Result should look like {'year' : 2017, 'value' : 15}.
What's the best way of achieving that result?
Something like this should solve it:
In [1]: import pandas as pd
...: import numpy as np
...: df = pd.DataFrame(np.random.randint(0,100,size=(4, 5)),
...: columns=['2016', '2017', '2018', '2019', '2020'],
...: index=['A', 'B', 'C', 'D'])
...: df.iloc[2, 2:5] = np.nan
...: print(df)
   2016  2017  2018  2019  2020
A    13    78   9.0  13.0  98.0
B    35     3  32.0   6.0  42.0
C    26    24   NaN   NaN   NaN
D    77    91  96.0  60.0  94.0
In [2]: value = int(df.loc['C'].dropna().iloc[-1])
In [3]: year = df.loc['C'][df.loc['C'] == value].index.values[0]
In [4]: result = {'year': year, 'value': value}
In [5]: result
Out[5]: {'year': '2017', 'value': 24}
You can break the expressions above part by part to better understand how each functionality is getting used together here to yield the desired output.
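A shorter route, for what it's worth, is Series.last_valid_index, which returns the label of the last non-null entry directly; it also sidesteps the lookup by value above, which could pick an earlier year if the same value appeared twice in the row:
row = df.loc['C']
year = row.last_valid_index()  # column label of the last non-NaN cell
result = {'year': year, 'value': int(row[year])}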

Dropping values before date X for each column in DataFrame

Say I have the following DataFrame:
d = pd.DataFrame({'A': [20, 0.5, 40, 45, 40, 35, 20, 25],
                  'B': [5, 10, 6, 8, 9, 7, 5, 8]},
                 index=pd.date_range(start="2010Q1", periods=8, freq='QS'))
               A   B
2010-01-01  20.0   5
2010-04-01   0.5  10
2010-07-01  40.0   6
2010-10-01  45.0   8
2011-01-01  40.0   9
2011-04-01  35.0   7
2011-07-01  20.0   5
2011-10-01  25.0   8
Also, assume I have the following series of dates:
D = d.idxmax()
A 2010-10-01
B 2010-04-01
dtype: datetime64[ns]
What I'm trying to do is essentially to "drop" the values in the DataFrame d that occur before the dates in the series D, for each column.
That is, what I'm looking for is:
               A     B
2010-01-01   NaN   NaN
2010-04-01   NaN  10.0
2010-07-01   NaN   6.0
2010-10-01  45.0   8.0
2011-01-01  40.0   9.0
2011-04-01  35.0   7.0
2011-07-01  20.0   5.0
2011-10-01  25.0   8.0
Notice that all the values in column A before 2010-10-01 are dropped and all the values in column B are dropped before 2010-04-01.
It's fairly simple to iterate through the columns to do this, but the DataFrame I'm working with is extremely large and this process takes a lot of time.
Is there a simpler way that does this in bulk, rather than column by column?
Thanks
Not sure that this is the most elegant answer, but since there aren't any other answers yet, I figured I would offer a working solution:
import pandas as pd
import numpy as np

d = pd.DataFrame({'A': [20, 0.5, 40, 45, 40, 35, 20, 25],
                  'B': [5, 10, 6, 8, 9, 7, 5, 8]},
                 index=pd.date_range(start="2010Q1", periods=8, freq='QS'))
D = d.idxmax()
for column in D.index:
    d.loc[d.index < D[column], column] = np.nan
Output:
               A     B
2010-01-01   NaN   NaN
2010-04-01   NaN  10.0
2010-07-01   NaN   6.0
2010-10-01  45.0   8.0
2011-01-01  40.0   9.0
2011-04-01  35.0   7.0
2011-07-01  20.0   5.0
2011-10-01  25.0   8.0
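To address the "in bulk" part of the question: since D holds one cutoff per column, you can broadcast a single comparison between the row index and the cutoffs and mask everything at once, avoiding the Python-level loop. A vectorized sketch, assuming d's columns and D's index line up (which d.idxmax() guarantees):
# boolean mask, rows x columns: True where the row date is on or after
# that column's cutoff; where() turns the False cells into NaN in one shot
mask = d.index.values[:, None] >= D.values
print(d.where(mask))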

Python: doing multiple column aggregation in pandas

I have a dataframe where I want to do multiple column aggregations in pandas.
import pandas as pd
import numpy as np
df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                   'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]})
df2 = df.groupby(['ser_no', 'CTRY_NM']).lat.agg({'avg_lat': np.mean})
With this code, I get the mean for lat. I would also like to find the mean for long.
I tried df2 = df.groupby(['ser_no', 'CTRY_NM']).lat.agg({'avg_lat': np.mean}).long.agg({'avg_long': np.mean}) but this produces
AttributeError: 'DataFrame' object has no attribute 'long'
If I just do avg_long, the code works as well.
df2 = df.groupby(['ser_no', 'CTRY_NM']).long.agg({'avg_long': np.mean})
In[2]: df2
Out[42]:
                avg_long
ser_no CTRY_NM
1      a            21.5
       b            23.0
2      a            26.0
       b            27.0
       e            24.5
3      b            28.5
       d            30.0
Is there a way to do this in one step or is this something I have to do separately and join back later?
I think it is simpler to use GroupBy.mean:
print(df.groupby(['ser_no', 'CTRY_NM']).mean())
                 lat  long
ser_no CTRY_NM
1      a         1.5  21.5
       b         3.0  23.0
2      a         6.0  26.0
       b         7.0  27.0
       e         4.5  24.5
3      b         8.5  28.5
       d        10.0  30.0
Or, if you need to define the columns for aggregating:
print(df.groupby(['ser_no', 'CTRY_NM']).agg({'lat': 'mean', 'long': 'mean'}))
                 lat  long
ser_no CTRY_NM
1      a         1.5  21.5
       b         3.0  23.0
2      a         6.0  26.0
       b         7.0  27.0
       e         4.5  24.5
3      b         8.5  28.5
       d        10.0  30.0
More info in docs.
EDIT:
If you need to rename the column names, i.e. remove the MultiIndex in the columns, you can use a list comprehension:
import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                   'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
                   'date': pd.date_range(pd.to_datetime('2016-02-24'),
                                         pd.to_datetime('2016-02-28'), freq='10H')})
print(df)
  CTRY_NM                date  lat  long  ser_no
0       a 2016-02-24 00:00:00    1    21       1
1       a 2016-02-24 10:00:00    2    22       1
2       b 2016-02-24 20:00:00    3    23       1
3       e 2016-02-25 06:00:00    4    24       2
4       e 2016-02-25 16:00:00    5    25       2
5       a 2016-02-26 02:00:00    6    26       2
6       b 2016-02-26 12:00:00    7    27       2
7       b 2016-02-26 22:00:00    8    28       3
8       b 2016-02-27 08:00:00    9    29       3
9       d 2016-02-27 18:00:00   10    30       3
df2 = df.groupby(['ser_no', 'CTRY_NM']).agg({'lat': 'mean',
                                             'long': 'mean',
                                             'date': ['min', 'max', 'count']})
df2.columns = ['_'.join(col) for col in df2.columns]
print(df2)
                lat_mean            date_min            date_max  date_count  \
ser_no CTRY_NM
1      a             1.5 2016-02-24 00:00:00 2016-02-24 10:00:00           2
       b             3.0 2016-02-24 20:00:00 2016-02-24 20:00:00           1
2      a             6.0 2016-02-26 02:00:00 2016-02-26 02:00:00           1
       b             7.0 2016-02-26 12:00:00 2016-02-26 12:00:00           1
       e             4.5 2016-02-25 06:00:00 2016-02-25 16:00:00           2
3      b             8.5 2016-02-26 22:00:00 2016-02-27 08:00:00           2
       d            10.0 2016-02-27 18:00:00 2016-02-27 18:00:00           1

                long_mean
ser_no CTRY_NM
1      a             21.5
       b             23.0
2      a             26.0
       b             27.0
       e             24.5
3      b             28.5
       d             30.0
You are getting the error because you first select the lat column of the dataframe and do operations on that column. Getting the long column through that Series is not possible; you need the dataframe.
df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg("mean")
would do the same operation for both columns. If you want the column names changed, you can rename the columns afterwards:
df2 = (df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]]
         .agg("mean")
         .rename(columns={"lat": "avg_lat", "long": "avg_long"}))
In [22]: df2 = (df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]]
    ...:          .agg("mean")
    ...:          .rename(columns={"lat": "avg_lat", "long": "avg_long"}))
    ...: df2
Out[22]:
                avg_lat  avg_long
ser_no CTRY_NM
1      a            1.5      21.5
       b            3.0      23.0
2      a            6.0      26.0
       b            7.0      27.0
       e            4.5      24.5
3      b            8.5      28.5
       d           10.0      30.0
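On pandas 0.25 or newer, named aggregation expresses the aggregation and the renaming in one step, and it replaces the dict-renaming form from the question (which current pandas rejects with a SpecificationError):
# one output column per keyword: new_name=(source_column, aggregation)
df2 = df.groupby(['ser_no', 'CTRY_NM']).agg(avg_lat=('lat', 'mean'),
                                            avg_long=('long', 'mean'))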
