I have a dataframe (DF1) in which each Personal-ID has 3 dates associated with it:
I have created a dataframe (DF_ID) with one row per Personal-ID and one column for each respective date (currently blank). I would like to load/loop the 3 dates per Personal-ID from DF1 into the respective date columns, so that the final dataframe looks like this:
I am trying to learn Python and have tried a number of scripts to accomplish this, such as:
for index, row in df_bnp_5.iterrows():
    df_id['Date-1'] = (row.loc[0,'hv_lab_test_dt'])
    df_id['Date-2'] = (row.loc[1,'hv_lab_test_dt'])
    df_id['Date-3'] = (row.loc[2,'hv_lab_test_dt'])
for i in range(len(df_bnp_5)) :
    df_id['Date-1'] = df1.iloc[i, 0], df_id['Date-2'] = df1.iloc[i, 2])
Any assistance would be appreciated.
Thank You!
Here is one way. I created a 'helper' column to arrange the dates for each Personal-ID.
import pandas as pd
# create data frame
df = pd.DataFrame({'Personal-ID': [1, 1, 1, 5, 5, 5],
'Date': ['10/01/2019', '12/28/2019', '05/08/2020',
'01/19/2020', '06/05/2020', '07/19/2020']})
# change data type
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
# create grouping key
df['x'] = df.groupby('Personal-ID')['Date'].rank().astype(int)
# convert to wide table
df = df.pivot(index='Personal-ID', columns='x', values='Date')
# change column names
df = df.rename(columns={1: 'Date-1', 2: 'Date-2', 3: 'Date-3'})
print(df)
x Date-1 Date-2 Date-3
Personal-ID
1 2019-10-01 2019-12-28 2020-05-08
5 2020-01-19 2020-06-05 2020-07-19
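A variation on the same idea, in case it helps: rank() returns fractional values when two dates tie within a Personal-ID, which would break the astype(int) step. The sketch below numbers the rows with cumcount() after sorting instead. The helper column name 'n' and the variable name 'wide' are my own choices, and the rows are shuffled on purpose to show that the input order does not matter.

```python
import pandas as pd

# Same sample values as above, deliberately shuffled
df = pd.DataFrame({'Personal-ID': [5, 1, 1, 5, 1, 5],
                   'Date': ['06/05/2020', '12/28/2019', '10/01/2019',
                            '01/19/2020', '05/08/2020', '07/19/2020']})
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')

# Sort first, then number the rows 1, 2, 3 within each Personal-ID
df = df.sort_values(['Personal-ID', 'Date'])
df['n'] = df.groupby('Personal-ID').cumcount() + 1

# Pivot to one row per Personal-ID and build the Date-N column names
wide = df.pivot(index='Personal-ID', columns='n', values='Date')
wide.columns = [f'Date-{n}' for n in wide.columns]
print(wide)
```

Building the column names from whatever numbers appear also means an ID with more than 3 dates would simply get extra Date-N columns.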
I have two dataframes:
df1:
date score perf
0 2021-08-01 2 4
1 2021-08-02 4 5
2 2021-08-03 6 7
df2:
date score perf
0 2021-08-01 2 7
1 2021-08-02 4 8
2 2021-08-03 6 7
I want to return df1, df2, and a variation in perf of df1 and df2 as a third dataframe together as shown in this picture:
The illustration of the tables you shared does not match the values of df1 and df2.
But anyway, you can use pandas.merge to bring in the column(s) you need for the subtraction.
import pandas as pd
df1 = pd.DataFrame({'date': ['01/08/2021', '02/08/2021', '03/08/2021'],
'score' : [2, 4, 6],
'perf': [4, 5, 7]})
df2 = pd.DataFrame({'date': ['01/08/2021', '02/08/2021', '03/08/2021'],
'score' : [2, 4, 6],
'perf': [7, 8, 7]})
out = df1.merge(df2[['date','perf']], on='date', how='left')
out['perf'] = out['perf_x'] - out['perf_y']
out = out.drop(['perf_x','perf_y'], axis=1)
>>> print(out)
Note: In case you want to subtract all the columns of both dataframes, you can use pandas.DataFrame.sub instead to subtract one dataframe from another.
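A minimal sketch of that DataFrame.sub route, assuming both frames have the same dates. Setting 'date' as the index is my assumption about how you would want the rows matched; without it, sub would also try to subtract the date strings. The name 'diff' is just for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'date': ['01/08/2021', '02/08/2021', '03/08/2021'],
                    'score': [2, 4, 6],
                    'perf': [4, 5, 7]})
df2 = pd.DataFrame({'date': ['01/08/2021', '02/08/2021', '03/08/2021'],
                    'score': [2, 4, 6],
                    'perf': [7, 8, 7]})

# Index both frames by date so rows are aligned on the date, not on position
diff = df1.set_index('date').sub(df2.set_index('date')).reset_index()
print(diff)
```

Unlike the merge version, this subtracts 'score' as well, so that column becomes all zeros here.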
Edit:
Now, if you want to display the three dataframes in an Excel sheet (as shown in your illustration), you can use this function:
df3 = out  # the difference table computed above
list_df = [df1, df2, df3]
df1.name = 'Table 1'
df2.name = 'Table 2'
df3.name = 'Table 3 (tabl1perf - tabl2perf)'
def display_results(dfs):
    sr = 1
    # worksheet.write() below is the XlsxWriter API, so request that engine explicitly
    with pd.ExcelWriter('your_excel_name.xlsx', engine='xlsxwriter') as writer:
        for df in dfs:
            df.to_excel(writer, startrow=sr, index=False)
            workbook = writer.book
            worksheet = writer.sheets['Sheet1']
            title = df.name
            # write the table title on the row just above the table
            worksheet.write(sr-1, 0, title)
            sr += (df.shape[0] + 3)
display_results(list_df)
>>> Output (in Excel)
I am struggling with the following: I have one dataset and have transposed it. After transposing, the first column was automatically set as the index, and from then on this "index" column is not recognized as a variable. Here is an example of what I mean:
df =
Date      A  B  C
1/1/2021  1  2  3
1/2/2021  4  5  6
input: df_T = df.T
output:
index  1/1/2021  1/2/2021
A      1         4
B      2         5
C      3         6
I would like to have a regular column, with a name of my choice if possible, instead of the generated "index".
To reproduce this dataset, I have used this chunk of code:
data = [['1/1/2021', 1, 2, 3], ['3/1/2021', 4, 5, 6]]
df = pd.DataFrame(data)
df.columns = ['Date', 'A', 'B', 'C']
df.set_index('Date', inplace=True)
To get a regular column in place of the index, run the next line:
df_T = df.T.reset_index()
To rename the resulting 'index' column, use rename. It returns a new dataframe, so assign the result back:
df_T = df_T.rename(columns={'index':'Variable'})
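The two steps can also be folded into one by naming the index before resetting it. This is only a sketch; 'Variable' is the same illustrative name as above, and clearing columns.name is optional tidying I added.

```python
import pandas as pd

data = [['1/1/2021', 1, 2, 3], ['3/1/2021', 4, 5, 6]]
df = pd.DataFrame(data, columns=['Date', 'A', 'B', 'C']).set_index('Date')

# Name the transposed index first so reset_index() produces a 'Variable' column
df_T = df.T.rename_axis('Variable').reset_index()

# Optional: drop the leftover 'Date' label on the column axis
df_T.columns.name = None
print(df_T)
```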
A pandas DataFrame typically has names for both the columns and the rows. The list of row names is called the "index". When you transpose, rows and columns switch places. In your case, the Date column is for some reason the index, so it becomes the column names of the new data frame. You need to create a new index and turn "Date" back into a regular column. As #sophocies wrote above, this is achieved with df.reset_index(...). I hope the code example below will be helpful.
import pandas as pd
df = pd.DataFrame(columns=['Date', 'A', 'B', 'C'], data=[['1/1/2021', 1, 2,3], ['1/2/2021', 4,5,6]])
df.set_index('Date', inplace=True)
print(df.transpose(copy=True))
#Recreated the problem
df.reset_index(inplace=True)
print(df.transpose())
Output
             0         1
Date  1/1/2021  1/2/2021
A            1         4
B            2         5
C            3         6
I hope this is what you wanted!
I have a column in a dataframe that contains time in the below format.
Dataframe: df
column: time
value: 07:00:00, 13:00:00 or 14:00:00
The column will have only one of these three values in each row. I want to convert these to 0, 1 and 2. Can you help replace the times with these numeric values?
Current:
df['time'] = [07:00:00, 13:00:00, 14:00:00]
Expected:
df['time'] = [0, 1, 2]
Thanks in advance.
You can use map to do this:
import datetime
# integer literals with leading zeros (07) are a syntax error in Python 3
mapping = {datetime.time(7, 0, 0): 0, datetime.time(13, 0, 0): 1, datetime.time(14, 0, 0): 2}
df['time']=df['time'].map(mapping)
One approach is to use map
Ex:
import pandas as pd
val = {"07:00:00":0, "13:00:00":1, "14:00:00":2}
df = pd.DataFrame({'time':["07:00:00", "13:00:00", "14:00:00"] })
df["time"] = df["time"].map(val)
print(df)
Output:
time
0 0
1 1
2 2
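If you would rather not write the dict by hand, categorical codes are another option. A rough sketch, assuming the column holds strings (for datetime.time objects you would probably need df['time'].astype(str) first); the list name 'order' is my own.

```python
import pandas as pd

df = pd.DataFrame({'time': ['13:00:00', '07:00:00', '14:00:00', '07:00:00']})

# Declare the allowed values in the order that should become 0, 1, 2
order = ['07:00:00', '13:00:00', '14:00:00']
df['time'] = pd.Categorical(df['time'], categories=order).codes
print(df)
```

One difference from map: a value that is not in the list becomes -1 here, whereas map would leave NaN.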
I am trying to figure out a way to combine two dfs in pandas/Python into one, based on a couple of factors.
There is an ID field that exists in both dfs.
Each df has a timestamp; df_1 can have one or multiple timestamps associated with an ID.
df_2 only has one timestamp associated with an ID.
The df_2 timestamp will always be the earliest (first) timestamp compared to the timestamps in df_1.
I want to combine both dataframes so that the df_2 timestamp is the first timestamp in the column, and each subsequent timestamp from df_1 comes after it.
So the output will look something like:
I.D | Timestamp
E4242 earliest_timestamp from df_2
E4242 next_timestamp from df_1
E4242 next_timestamp from df_1
Thanks for looking!
If it's always true that df2 only contains one date per ID, and that date is always the earliest date for that ID, could you simply concatenate df1 and df2, then sort by ID and timestamp? For example:
import pandas as pd
# Generate example data
df1 = pd.DataFrame({'id': [1, 1, 2, 3, 3, 3],
'timestamp': pd.to_datetime(['2019-01-01',
'2019-01-02',
'2019-01-15',
'2019-01-17',
'2019-02-01',
'2019-02-03'])})
df2 = pd.DataFrame({'id': [1, 2, 3],
'timestamp': pd.to_datetime(['1959-06-01',
'1989-12-01',
'1999-01-25'])})
df = pd.concat([df1, df2])
df = df.sort_values(by=['id', 'timestamp']).reset_index(drop=True)
df
id timestamp
0 1 1959-06-01
1 1 2019-01-01
2 1 2019-01-02
3 2 1989-12-01
4 2 2019-01-15
5 3 1999-01-25
6 3 2019-01-17
7 3 2019-02-01
8 3 2019-02-03
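Since the concat-and-sort approach depends on df2 really holding the earliest timestamp per ID, it may be worth checking that before combining. A small sketch with cut-down example data; 'first_df1' and 'violations' are names I made up.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 1, 2],
                    'timestamp': pd.to_datetime(['2019-01-01',
                                                 '2019-01-02',
                                                 '2019-01-15'])})
df2 = pd.DataFrame({'id': [1, 2],
                    'timestamp': pd.to_datetime(['1959-06-01',
                                                 '1989-12-01'])})

# Earliest df1 timestamp per id, looked up for every row of df2
first_df1 = df2['id'].map(df1.groupby('id')['timestamp'].min())

# Rows of df2 whose timestamp is later than the earliest df1 timestamp for that id
violations = df2[df2['timestamp'] > first_df1]
print(violations)
```

If 'violations' comes back empty, sorting by ID and timestamp will put the df2 row first for every ID.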
I have a dict that I would like to turn into a DataFrame with a MultiIndex.
The dict is:
dikt = {'bloomberg': Timestamp('2009-01-26 10:00:00'),
'investingcom': Timestamp('2009-01-01 09:00:00')}
I construct a MultiIndex as follows:
MI= MultiIndex(levels=[['Existing Home Sales MoM'], ['investingcom', 'bloomberg']],
labels=[[0, 0], [0, 1]],
names=['indicator', 'source'])
Then a DataFrame as such:
df = pd.DataFrame(index = MI, columns=["datetime"],data =np.full((2,1),np.NaN))
Then lastly I fill the df with the data stored in the dict:
for key in dikt:
    df.loc[('Existing Home Sales MoM',key), "datetime"] = dikt[key]
and get the expected result:
But would there be a more concise way of doing so by passing the dikt directly into the construction of the df such as
df = pd.DataFrame(index = MI, columns=["datetime"],data =dikt)
so as to combine the 2 last steps in 1?
You can create a dataframe from a dictionary using from_dict:
pd.DataFrame.from_dict(dikt, orient='index')
0
bloomberg 2009-01-26 10:00:00
investingcom 2009-01-01 09:00:00
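You can chain the column and index definitions to get the result you're after in 1 step. Note that set_index(MI) attaches the labels by position, so the entries of MI must be in the same order as the rows coming out of the dict: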
pd.DataFrame.from_dict(dikt, orient='index') \
.rename(columns={0: 'datetime'}) \
.set_index(MI)
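One way to avoid depending on row order is to build the MultiIndex from the dict keys themselves, so each label always sits next to its own value. A sketch, using pd.MultiIndex.from_product in place of the hand-written index from the question; 'mi' is my own name.

```python
import pandas as pd

dikt = {'bloomberg': pd.Timestamp('2009-01-26 10:00:00'),
        'investingcom': pd.Timestamp('2009-01-01 09:00:00')}

# Build the index from the dict keys so labels and values stay paired
mi = pd.MultiIndex.from_product([['Existing Home Sales MoM'], list(dikt)],
                                names=['indicator', 'source'])
df = pd.DataFrame({'datetime': list(dikt.values())}, index=mi)
print(df)
```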