I am trying to figure out a way to combine two DataFrames in pandas into one, based on a couple of factors.
There is an ID field that exists in both dfs.
Each df has a timestamp; df_1 can have one or multiple timestamps associated with an ID.
df_2 only has one timestamp associated with an ID.
The df_2 timestamp will always be the earliest timestamp compared to the timestamps in df_1.
I want to combine both dataframes so that the df_2 timestamp is the first timestamp in a column, and each subsequent timestamp from df_1 comes after it.
so the output will look something like:
ID    | Timestamp
E4242 | earliest_timestamp from df_2
E4242 | next_timestamp from df_1
E4242 | next_timestamp from df_1
Thanks for looking!
If it's always true that df2 only contains one date per ID, and that date is always the earliest date for that ID, could you simply concatenate df1 and df2, then sort by ID and timestamp? For example:
import pandas as pd

# Generate example data
df1 = pd.DataFrame({'id': [1, 1, 2, 3, 3, 3],
                    'timestamp': pd.to_datetime(['2019-01-01',
                                                 '2019-01-02',
                                                 '2019-01-15',
                                                 '2019-01-17',
                                                 '2019-02-01',
                                                 '2019-02-03'])})
df2 = pd.DataFrame({'id': [1, 2, 3],
                    'timestamp': pd.to_datetime(['1959-06-01',
                                                 '1989-12-01',
                                                 '1999-01-25'])})

# Concatenate, then sort so each id's earliest (df2) timestamp comes first
df = pd.concat([df1, df2])
df = df.sort_values(by=['id', 'timestamp']).reset_index(drop=True)
df
id timestamp
0 1 1959-06-01
1 1 2019-01-01
2 1 2019-01-02
3 2 1989-12-01
4 2 2019-01-15
5 3 1999-01-25
6 3 2019-01-17
7 3 2019-02-01
8 3 2019-02-03
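If you want to double-check the stated assumption that df2 always holds the earliest timestamp per ID, a minimal sanity check (just a sketch, reusing df1 and df2 from above) could be:
# Every df2 timestamp should be no later than the earliest df1 timestamp per id
earliest_df1 = df1.groupby('id')['timestamp'].min()
assert (df2.set_index('id')['timestamp'] <= earliest_df1).all()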
I have two dataframes:
df1:
date score perf
0 2021-08-01 2 4
1 2021-08-02 4 5
2 2021-08-03 6 7
df2:
date score perf
0 2021-08-01 2 7
1 2021-08-02 4 8
2 2021-08-03 6 7
I want to return df1, df2, and the variation in perf between df1 and df2 as a third dataframe, displayed together as shown in this picture:
The illustration of tables you shared does not match the values of df1 and df2.
But anyway, you can use pandas.merge to bring in the column(s) you need to do the subtraction.
import pandas as pd

df1 = pd.DataFrame({'date': ['01/08/2021', '02/08/2021', '03/08/2021'],
                    'score': [2, 4, 6],
                    'perf': [4, 5, 7]})
df2 = pd.DataFrame({'date': ['01/08/2021', '02/08/2021', '03/08/2021'],
                    'score': [2, 4, 6],
                    'perf': [7, 8, 7]})

# Bring df2's perf alongside df1's, subtract, then drop the helper columns
out = df1.merge(df2[['date', 'perf']], on='date', how='left')
out['perf'] = out['perf_x'] - out['perf_y']
out = out.drop(['perf_x', 'perf_y'], axis=1)

>>> print(out)
         date  score  perf
0  01/08/2021      2    -3
1  02/08/2021      4    -3
2  03/08/2021      6     0
Note: In case you want to subtract all the columns of both dataframes, you can use pandas.DataFrame.sub instead to subtract one dataframe from another.
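For instance, a minimal sketch (assuming date is unique in both frames) that aligns on date and subtracts every remaining column at once:
# Subtract all shared columns element-wise, aligned on 'date'
diff = df1.set_index('date').sub(df2.set_index('date'))
print(diff)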
Edit:
Now, if you want to display the three dataframes in an Excel sheet (as shown in your illustration), you can use this function:
df3 = out  # the third dataframe: the perf difference computed above
list_df = [df1, df2, df3]
df1.name = 'Table 1'
df2.name = 'Table 2'
df3.name = 'Table 3 (tabl1perf - tabl2perf)'

def display_results(dfs):
    sr = 1
    # the xlsxwriter engine provides the worksheet.write() call used below
    with pd.ExcelWriter('your_excel_name.xlsx', engine='xlsxwriter') as writer:
        for df in dfs:
            df.to_excel(writer, startrow=sr, index=False)
            worksheet = writer.sheets['Sheet1']
            worksheet.write(sr - 1, 0, df.name)  # write the title one row above the table
            sr += df.shape[0] + 3

display_results(list_df)
>>> Output (in Excel)
I am struggling with the following: I have one dataset and have transposed it. After transposing, the first column was set as the index automatically, and from then on, this "index" column is not recognized as a variable. Here is an example of what I mean:
df =
    Date      A  B  C
    1/1/2021  1  2  3
    1/2/2021  4  5  6

input: df_T = df.T
output:
index  1/1/2021  1/2/2021
A             1         4
B             2         5
C             3         6
I would like to have a proper variable (and name it, if possible) instead of the generated "index".
To reproduce this dataset, I have used this chunk of code:
import pandas as pd

data = [['1/1/2021', 1, 2, 3], ['3/1/2021', 4, 5, 6]]
df = pd.DataFrame(data)
df.columns = ['Date', 'A', 'B', 'C']
df.set_index('Date', inplace=True)
To have a meaningful column in place of the index, the next line can be run:
df_T = df.T.reset_index()
To rename the column 'index', the rename method can be used (note that rename returns a copy, so assign the result):
df_T = df_T.rename(columns={'index': 'Variable'})
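As an alternative sketch, the same result can be reached in one chain by naming the transposed index with rename_axis before resetting it:
# Name the index first, then turn it into a regular 'Variable' column
df_T = df.T.rename_axis('Variable').reset_index()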
A pandas DataFrame typically has names both for the columns and for the rows; the list of names for the rows is called the "index". When you transpose, rows and columns switch places. In your case, the Date column is the index, so it becomes the column names of the new data frame. You need to create a new index and turn "Date" back into a regular column. As @sophocies wrote above, this is achieved with df.reset_index(...). I hope the code example below will be helpful.
import pandas as pd

df = pd.DataFrame(columns=['Date', 'A', 'B', 'C'],
                  data=[['1/1/2021', 1, 2, 3], ['1/2/2021', 4, 5, 6]])
df.set_index('Date', inplace=True)
print(df.transpose(copy=True))  # recreates the problem

df.reset_index(inplace=True)
print(df.transpose())
Output

             0         1
Date  1/1/2021  1/2/2021
A            1         4
B            2         5
C            3         6
I hope this is what you wanted!
I have two data frames
df1:
ID Date Value
0 9560 07/3/2021 25
1 9560 03/03/2021 20
2 9712 12/15/2021 15
3 9712 08/30/2021 10
4 9920 4/11/2021 5
df2:
ID Value
0 9560
1 9712
2 9920
In df2, I want to get the latest value from the "Value" column of df1 for each ID.
This is my expected output:
ID Value
0 9560 25
1 9712 15
2 9920 5
How could I achieve it?
Based on Daniel Afriyie's approach, I came up with this solution:
import pandas as pd

# Setup for demo
df1 = pd.DataFrame(
    columns=['ID', 'Date', 'Value'],
    data=[
        [9560, '07/3/2021', 25],
        [9560, '03/03/2021', 20],
        [9712, '12/15/2021', 15],
        [9712, '08/30/2021', 10],
        [9920, '4/11/2021', 5]
    ]
)
df2 = pd.DataFrame(
    columns=['ID', 'Value'],
    data=[[9560, None], [9712, None], [9920, None]]
)

## Actual solution
# Casting 'Date' column to actual dates
df1['Date'] = pd.to_datetime(df1['Date'])
# Sorting by dates, newest first
df1 = df1.sort_values(by='Date', ascending=False)
# Dropping duplicates of 'ID' (since rows are ordered by date, only the newest row per ID is kept)
df1 = df1.drop_duplicates(subset=['ID'])
# Merging the values from df1 into df2
df2 = pd.merge(df2[['ID']], df1[['ID', 'Value']])
print(df2)
output:
ID Value
0 9560 25
1 9712 15
2 9920 5
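An alternative sketch with the same result, picking each ID's newest row directly with idxmax instead of sorting and dropping duplicates (assuming df1 still holds all rows, with Date already cast to datetime):
# Grab the row index of the latest Date per ID, then merge just those rows
latest = df1.loc[df1.groupby('ID')['Date'].idxmax(), ['ID', 'Value']]
print(df2[['ID']].merge(latest, on='ID'))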
I would like to group the rows by the Type column and apply a function to each group that returns the first row where the Value column is not NaN, copying it into a separate data frame.
I got the following so far:
dummy data:
df1 = {'Date': ['04.12.1998', '05.12.1998', '06.12.1998', '04.12.1998', '05.12.1998', '06.12.1998'],
       'Type': [1, 1, 1, 2, 2, 2],
       'Value': ['NaN', 100, 120, 'NaN', 'NaN', 20]}
df2 = pd.DataFrame(df1, columns=['Date', 'Type', 'Value'])
print(df2)
Date Type Value
0 04.12.1998 1 NaN
1 05.12.1998 1 100
2 06.12.1998 1 120
3 04.12.1998 2 NaN
4 05.12.1998 2 NaN
5 06.12.1998 2 20
import pandas as pd
selectedStockDates = {'Date': [], 'Type': [], 'Values': []}
selectedStockDates = pd.DataFrame(selectedStockDates, columns = ['Date', 'Type', 'Values'])
first_valid_index = df2[['Value']].first_valid_index()
selectedStockDates.loc[df2.index[first_valid_index]] = df2.iloc[first_valid_index]
The code above should work for the first id, but I am struggling to apply this to all ids in the data frame. Does anyone know how to do this?
Let's mask the rows where the Value column is NaN, then group the dataframe on Type and aggregate using first:
# Turn the 'NaN' strings into real missing values
df2['Value'] = pd.to_numeric(df2['Value'], errors='coerce')
# Masked rows become all-NaN, so first() skips them within each group
df2.mask(df2['Value'].isna()).groupby('Type', as_index=False).first()
Type Date Value
0 1.0 05.12.1998 100.0
1 2.0 06.12.1998 20.0
Just use groupby and first, but you need to make sure that your null values are np.nan and not strings like they are in your sample data:
df2.groupby('Type')['Value'].first()
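For example, a minimal sketch of that cleanup, assuming the sample df2 above with its 'NaN' strings:
import numpy as np

# Turn the 'NaN' strings into real missing values; GroupBy.first then
# returns the first non-null Value within each Type group
df2['Value'] = df2['Value'].replace('NaN', np.nan)
print(df2.groupby('Type')['Value'].first())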
I have a dataframe (DF1) as such - each Personal-ID will have 3 dates associated w/that ID:
I have created a dataframe (DF_ID) w/1 row for each Personal-ID & Column for Each Respective Date (which is currently blank) and would like load/loop the 3 dates/Personal-ID (DF1) into the respective date columns the final dataframe to look as such:
I am trying to learn Python and have tried a number of coding scripts to accomplish this, such as:
for index, row in df_bnp_5.iterrows():
    df_id['Date-1'] = (row.loc[0, 'hv_lab_test_dt'])
    df_id['Date-2'] = (row.loc[1, 'hv_lab_test_dt'])
    df_id['Date-3'] = (row.loc[2, 'hv_lab_test_dt'])

for i in range(len(df_bnp_5)):
    df_id['Date-1'] = df1.iloc[i, 0], df_id['Date-2'] = df1.iloc[i, 2]
Any assistance would be appreciated.
Thank You!
Here is one way. I created a 'helper' column to arrange the dates for each Personal-ID.
import pandas as pd

# create data frame
df = pd.DataFrame({'Personal-ID': [1, 1, 1, 5, 5, 5],
                   'Date': ['10/01/2019', '12/28/2019', '05/08/2020',
                            '01/19/2020', '06/05/2020', '07/19/2020']})

# change data type
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')

# create grouping key: rank each ID's dates from earliest to latest
df['x'] = df.groupby('Personal-ID')['Date'].rank().astype(int)

# convert to wide table
df = df.pivot(index='Personal-ID', columns='x', values='Date')

# change column names
df = df.rename(columns={1: 'Date-1', 2: 'Date-2', 3: 'Date-3'})

print(df)
x Date-1 Date-2 Date-3
Personal-ID
1 2019-10-01 2019-12-28 2020-05-08
5 2020-01-19 2020-06-05 2020-07-19
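An alternative sketch for the helper column uses cumcount on the date-sorted frame instead of rank (equivalent here as long as no Personal-ID has duplicate dates); this assumes df is still the long frame built above, before the pivot:
# Number each ID's rows 1..n in date order, then pivot exactly as before
df['x'] = df.sort_values('Date').groupby('Personal-ID').cumcount() + 1
wide = df.pivot(index='Personal-ID', columns='x', values='Date')
wide.columns = [f'Date-{i}' for i in wide.columns]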