I have a dataframe where some [Year] & [Week] combinations are missing. I have another dataframe that serves as a reference calendar, from which I can look up the missing keys. How do I fill in these missing rows using pandas?
I have tried using reindex to set them up, but I am getting the following error:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
import pandas as pd
d1 = {'Year': [2019,2019,2019,2019,2019], 'Week': [1,2,4,6,7], 'Value': [20,40,60,75,90]}
d2 = {'Year': [2019,2019,2019,2019,2019,2019,2019,2019,2019,2019], 'Week':[1,2,3,4,5,6,7,8,9,10]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
df1 = df1.set_index(['Year', 'Week'])
df2 = df2.set_index(['Year', 'Week'])
df1 = df1.reindex(df2, fill_value=0)  # raises the ValueError: a DataFrame is passed where an index is expected
print(df1)
You should pass the index instead, i.e. df2.index:
df1.reindex(df2.index,fill_value=0)
Out[851]:
Value
Year Week
2019 1 20
2 40
3 0
4 60
5 0
6 75
7 90
df2.index.difference(df1.index)
Out[854]:
MultiIndex(levels=[[2019], [3, 5]],
labels=[[0, 0], [0, 1]],
names=['Year', 'Week'],
sortorder=0)
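index.difference confirms which (Year, Week) keys are absent from df1; these are exactly the rows that reindex filled with 0.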
Update
If you only want to fill gaps up to the last week that actually has a value (i.e. drop the trailing calendar weeks 8, 9 and 10 rather than fill them with 0), reindex first, then keep only the rows where a later value still exists:
s = df1.reindex(df2.index)
s[s.bfill().notnull().values].fillna(0)
Out[877]:
Value
Year Week
2019 1 20.0
2 40.0
3 0.0
4 60.0
5 0.0
6 75.0
7 90.0
import pandas as pd
d1 = {'Year': [2019,2019,2019,2019,2019], 'Week': [1,2,4,6,7], 'Value': [20,40,60,75,90]}
d2 = {'Year': [2019,2019,2019,2019,2019,2019,2019], 'Week':[1,2,3,4,5,6,7]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
df1 = df1.set_index(['Year', 'Week'])
df2 = df2.set_index(['Year', 'Week'])
fill_value = df1['Value'].mean()  # value to fill NaN rows with; swap in other logic if you do not want the mean
df1 = df1.join(df2, how='right')  # keep every Year/Week key from the calendar
df1 = df1.fillna(value=fill_value)  # fill the missing rows (assign back; fillna is not in-place by default)
print(df1)
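Note that join(how='right') keeps every (Year, Week) key from the calendar, so it is equivalent to reindexing df1 by df2.index as in the first answer; only the fill strategy differs.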
Related
I have two dataframes:
df1:
date score perf
0 2021-08-01 2 4
1 2021-08-02 4 5
2 2021-08-03 6 7
df2:
date score perf
0 2021-08-01 2 7
1 2021-08-02 4 8
2 2021-08-03 6 7
I want to return df1, df2, and the variation in perf between df1 and df2 as a third dataframe, displayed together as shown in this picture:
The illustration of the tables you shared does not match the values of df1 and df2.
But in any case, you can use pandas.merge to bring over the column(s) you need for the subtraction.
import pandas as pd
df1 = pd.DataFrame({'date': ['01/08/2021', '02/08/2021', '03/08/2021'],
                    'score': [2, 4, 6],
                    'perf': [4, 5, 7]})
df2 = pd.DataFrame({'date': ['01/08/2021', '02/08/2021', '03/08/2021'],
                    'score': [2, 4, 6],
                    'perf': [7, 8, 7]})
out = df1.merge(df2[['date','perf']], on='date', how='left')
out['perf'] = out['perf_x'] - out['perf_y']
out = out.drop(['perf_x','perf_y'], axis=1)
>>> print(out)
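With the sample frames above, the subtraction gives:
         date  score  perf
0  01/08/2021      2    -3
1  02/08/2021      4    -3
2  03/08/2021      6     0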
Note: In case you want to subtract all the columns of both dataframes, you can use pandas.DataFrame.sub instead to subtract one dataframe from another.
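For example, a minimal sketch, assuming you first align both frames on 'date':
diff = df1.set_index('date').sub(df2.set_index('date'))
print(diff)
            score  perf
date
01/08/2021      0    -3
02/08/2021      0    -3
03/08/2021      0     0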
Edit:
Now, if you want to display the three dataframes in an Excel sheet (as shown in your illustration), you can use this function (taking df3 to be the difference table out computed above):
df3 = out  # the difference table computed above
list_df = [df1, df2, df3]
df1.name = 'Table 1'
df2.name = 'Table 2'
df3.name = 'Table 3 (tabl1perf - tabl2perf)'

def display_results(dfs):
    sr = 1  # row at which the next table starts (one row below its title)
    # worksheet.write requires the xlsxwriter engine
    with pd.ExcelWriter('your_excel_name.xlsx', engine='xlsxwriter') as writer:
        for df in dfs:
            df.to_excel(writer, startrow=sr, index=False)
            worksheet = writer.sheets['Sheet1']
            worksheet.write(sr - 1, 0, df.name)  # write the title one row above the table
            sr += df.shape[0] + 3  # leave blank rows before the next table

display_results(list_df)
>>> Output (in Excel)
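A small caveat: name is not a regular DataFrame attribute. Assigning it works, but pandas drops it on copies and most operations, so for anything beyond a quick script a plain list of (title, dataframe) pairs is sturdier.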
I have two data frames
df1:
ID Date Value
0 9560 07/3/2021 25
1 9560 03/03/2021 20
2 9712 12/15/2021 15
3 9712 08/30/2021 10
4 9920 4/11/2021 5
df2:
ID Value
0 9560
1 9712
2 9920
In df2, I want to get the latest value from the "Value" column of df1 for each ID.
This is my expected output:
ID Value
0 9560 25
1 9712 15
2 9920 5
How could I achieve it?
Based on Daniel Afriyie's approach, I came up with this solution:
import pandas as pd
# Setup for demo
df1 = pd.DataFrame(
    columns=['ID', 'Date', 'Value'],
    data=[
        [9560, '07/3/2021', 25],
        [9560, '03/03/2021', 20],
        [9712, '12/15/2021', 15],
        [9712, '08/30/2021', 10],
        [9920, '4/11/2021', 5]
    ]
)
df2 = pd.DataFrame(
    columns=['ID', 'Value'],
    data=[[9560, None], [9712, None], [9920, None]]
)
## Actual solution
# Casting 'Date' column to actual dates
df1['Date'] = pd.to_datetime(df1['Date'])
# Sorting by dates, newest first
df1 = df1.sort_values(by='Date', ascending=False)
# Dropping duplicates of 'ID' (since it's ordered by date, only the newest row of each ID is kept)
df1 = df1.drop_duplicates(subset=['ID'])
# Merging the values from df1 into df2
df2 = pd.merge(df2[['ID']], df1[['ID', 'Value']])
output:
ID Value
0 9560 25
1 9712 15
2 9920 5
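An equivalent shortcut (a sketch, assuming 'Date' has already been parsed with pd.to_datetime as in the setup above): sort ascending, take the last value per ID, and map it onto df2.
df2['Value'] = df2['ID'].map(
    df1.sort_values('Date').groupby('ID')['Value'].last()
)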
I would like to group the rows by the Type column and, for each group, apply a function that returns the first row where the Value column is not NaN and copies it into a separate data frame.
I got the following so far:
dummy data:
df1 = {'Date': ['04.12.1998','05.12.1998','06.12.1998','04.12.1998','05.12.1998','06.12.1998'],
       'Type': [1,1,1,2,2,2],
       'Value': ['NaN', 100, 120, 'NaN', 'NaN', 20]}
df2 = pd.DataFrame(df1, columns=['Date', 'Type', 'Value'])
print (df2)
Date Type Value
0 04.12.1998 1 NaN
1 05.12.1998 1 100
2 06.12.1998 1 120
3 04.12.1998 2 NaN
4 05.12.1998 2 NaN
5 06.12.1998 2 20
import pandas as pd
selectedStockDates = {'Date': [], 'Type': [], 'Values': []}
selectedStockDates = pd.DataFrame(selectedStockDates, columns = ['Date', 'Type', 'Values'])
first_valid_index = df2[['Value']].first_valid_index()
selectedStockDates.loc[df2.index[first_valid_index]] = df2.iloc[first_valid_index]
The code above should work for the first id, but I am struggling to apply this to all ids in the data frame. Does anyone know how to do this?
Let's mask the rows of the dataframe where the Value column is NaN, then group the dataframe on Type and aggregate using first:
df2['Value'] = pd.to_numeric(df2['Value'], errors='coerce')
df2.mask(df2['Value'].isna()).groupby('Type', as_index=False).first()
Type Date Value
0 1.0 05.12.1998 100.0
1 2.0 06.12.1998 20.0
Just use groupby and first, but you need to make sure that your null values are np.nan and not strings, as they are in your sample data:
df2.groupby('Type')['Value'].first()
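For instance, one way to do that conversion (a sketch, assuming the placeholders are literally the string 'NaN'):
import numpy as np

df2['Value'] = df2['Value'].replace('NaN', np.nan)
print(df2.groupby('Type')['Value'].first())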
I've got two data frames that represent similar data, and I want to merge them after changing the column names. There are a few ways to achieve this, but given the size of my actual data frames, I'd like to use the following method. However, I'm getting NaN values for the second df.
import pandas as pd
df1 = pd.DataFrame({
    'time': ['2012-08-02 09:50:20.0','2012-08-02 09:50:32.5','2012-08-02 09:50:34.8'],
    'Val': ['1,2,3','1,2,3','1,2,3'],
    'Val2': [1,2,3],
    'Val3': [1.1,2.1,3.1]
})
df2 = pd.DataFrame({
    'time': ['2012-08-02 09:50:20.0','2012-08-02 09:50:32.5','2012-08-02 09:50:34.8'],
    'Val': ['1,2,3','1,2,3','1,2,3'],
    'Val2': [1,2,3],
    'Val3': [1.1,2.1,3.1]
})
df1['time'] = pd.to_datetime(df1['time'])
df2['time'] = pd.to_datetime(df2['time'])
df1.columns.values[1:4] = ['first_' + str(x) for x in df1.columns[1:4]]
df2.columns.values[1:4] = ['second_' + str(x) for x in df2.columns[1:4]]
df3 = pd.merge(df1, df2, on = 'time')
print(df3)
time first_Val first_Val2 first_Val3 second_Val second_Val2 second_Val3
0 2012-08-02 09:50:20.000 1,2,3 1 1.1 NaN NaN NaN
1 2012-08-02 09:50:32.500 1,2,3 2 2.1 NaN NaN NaN
2 2012-08-02 09:50:34.800 1,2,3 3 3.1 NaN NaN NaN
Intended output:
time first_Val first_Val2 first_Val3 second_Val second_Val2 second_Val3
0 2012-08-02 09:50:20.000 1,2,3 1 1.1 1,2,3 1 1.1
1 2012-08-02 09:50:32.500 1,2,3 2 2.1 1,2,3 2 2.1
2 2012-08-02 09:50:34.800 1,2,3 3 3.1 1,2,3 3 3.1
The issue is the slice assignment of the column names: df1.columns.values[1:4] = ... writes into the underlying numpy array without updating the DataFrame's internals. It fails in pandas 1.1.1 and 1.1.2, though it happened to work in 1.0.1 and 1.0.5.
In the version below, 'time' is set as the index, the remaining column names are changed in a list comprehension, and the index is then reset. This demonstrates that it's fine to rename the columns with a list comprehension, just not by slicing df.columns.values.
.reset_index() can be removed to leave 'time' as the index, in which case use df.join instead of pd.merge.
The options, then, are to set the column that won't get a new name as the index, or to use .rename for the specific columns (sketched after the demo below).
df1 = pd.DataFrame({
    'time': ['2012-08-02 09:50:20.0','2012-08-02 09:50:32.5','2012-08-02 09:50:34.8'],
    'Val': ['1,2,3','1,2,3','1,2,3'],
    'Val2': [1,2,3],
    'Val3': [1.1,2.1,3.1]
})
df1['time'] = pd.to_datetime(df1['time'])
df1.set_index('time', inplace=True)
df1.columns = ['first_' + str(x) for x in df1.columns]
df1.reset_index(inplace=True)
df2 = pd.DataFrame({
    'time': ['2012-08-02 09:50:20.0','2012-08-02 09:50:32.5','2012-08-02 09:50:34.8'],
    'Val': ['1,2,3','1,2,3','1,2,3'],
    'Val2': [1,2,3],
    'Val3': [1.1,2.1,3.1]
})
df2['time'] = pd.to_datetime(df2['time'])
df2.set_index('time', inplace=True)
df2.columns = ['second_' + str(x) for x in df2.columns]
df2.reset_index(inplace=True)
# merge
df3 = pd.merge(df1, df2, on = 'time', how='left')
time first_Val first_Val2 first_Val3 second_Val second_Val2 second_Val3
0 2012-08-02 09:50:20.000 1,2,3 1 1.1 1,2,3 1 1.1
1 2012-08-02 09:50:32.500 1,2,3 2 2.1 1,2,3 2 2.1
2 2012-08-02 09:50:34.800 1,2,3 3 3.1 1,2,3 3 3.1
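For the .rename option mentioned above, a minimal sketch using a dict comprehension that skips 'time':
df1 = df1.rename(columns={c: 'first_' + c for c in df1.columns if c != 'time'})
df2 = df2.rename(columns={c: 'second_' + c for c in df2.columns if c != 'time'})
df3 = pd.merge(df1, df2, on='time')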
Okay, let's try this a different way:
df1 = df1.set_index('time').add_prefix('first_')
df2 = df2.set_index('time').add_prefix('second_')
df3 = pd.merge(df1, df2, on = 'time')
print(df3)
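Since add_prefix leaves 'time' as the shared index here, an equivalent one-liner (a sketch) is an index-on-index join:
df3 = df1.join(df2)  # joins on the common 'time' index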
I am trying to calculate a new dataframe which is dataframe1 divided by dataframe2, matching on column name, with the date index matched to the closest date (a non-exact match).
idx1 = pd.DatetimeIndex(['2017-01-01','2018-01-01','2019-01-01'])
idx2 = pd.DatetimeIndex(['2017-02-01','2018-03-01','2019-04-01'])
df1 = pd.DataFrame(index = idx1,data = {'XYZ': [10, 20, 30],'ABC': [15, 25, 30]})
df2 = pd.DataFrame(index = idx2,data = {'XYZ': [1, 2, 3],'ABC': [3, 5, 6]})
#looking for some code
#df3 = df1/df2 on matching column and closest matching row
This should produce a dataframe that looks like this:
XYZ ABC
2017-01-01 10 5
2018-01-01 10 5
2019-01-01 10 5
You can use an asof merge to do a match on a "close" row. Then we'll group over the columns axis and divide.
df3 = pd.merge_asof(df1, df2, left_index=True, right_index=True,
                    direction='nearest')
# XYZ_x ABC_x XYZ_y ABC_y
#2017-01-01 10 15 1 3
#2018-01-01 20 25 2 5
#2019-01-01 30 30 3 6
df3 = (df3.groupby(df3.columns.str.split('_').str[0], axis=1)
          .apply(lambda x: x.iloc[:, 0] / x.iloc[:, 1]))
# ABC XYZ
#2017-01-01 5.0 10.0
#2018-01-01 5.0 10.0
#2019-01-01 5.0 10.0
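Alternatively, since the goal is just to divide by the row with the nearest date, a shorter sketch (assuming both indexes are sorted, which reindex with method='nearest' requires):
# align df2's rows to df1's dates by nearest match, then divide;
# the columns align automatically by name
df3 = df1 / df2.reindex(df1.index, method='nearest')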