Setting DataFrame value using a datetime as index - python

I have two data frames. The first, dataframeA, has 3 rows and 4 columns plus a date index:
                TYPE  UNIT    PRICE  PERCENT
2010-01-05    REDUCE   CAR  2300.00      3.0
2010-06-03  INCREASE  BOAT  1000.00      2.0
2010-07-01  INCREASE   CAR  3500.00      3.0
and another, dataframeB, with hundreds of dates as index and two columns (initially empty; shown here with the values I want to fill in):
               CAR   BOAT
2010-01-01     NaN    0.0
2010-01-02     NaN    0.0
2010-01-03     NaN    0.0
2010-01-04     NaN    0.0
2010-01-05  -69.00    0.0
.....
2010-06-03     NaN  20.00
...
2010-07-01  105.00    0.0
I need to read each row from the first data frame, find the matching date in the second, and, based on the unit type, assign the corresponding increase or reduction in dataframeB.
I've read that you should avoid iterating when dealing with dataframes, but I'm not sure how else to do this. How can I evaluate each row and then set the value in dataframeB?
I tried the following:
for index, row in dataframeA.iterrows():
    type = row['TYPE']
    unit = row['UNIT']
    price = row['PRICE']
    percent = row['PERCENT']
    # then, with basic math, compute the reduction or increase,
    # assign it to dataframeB, and do the same for the others
My question is: is this the right approach, and how do I assign the value I compute to dataframeB?

If your first dataframe is limited to just the variables stated, you can do this. It's not terribly elegant, but it works. If you have many more combinations in the dataframe, it would have to be rethought. See the comments inline.
import io
import pandas as pd

df = pd.read_csv(io.StringIO('''date TYPE UNIT PRICE PERCENT
2010-01-05 REDUCE CAR 2300.00 3.0
2010-06-03 INCREASE BOAT 1000.00 2.0
2010-07-01 INCREASE CAR 3500.00 3.0'''), sep=r'\s+', engine='python').set_index('date')
df1 = pd.read_csv(io.StringIO('''date
2010-01-01
2010-01-02
2010-01-03
2010-01-04
2010-01-05
2010-06-03
2010-07-01'''), engine='python').set_index('date')
# calculate your changes in first dataframe
df.loc[df.TYPE == 'REDUCE', 'Change'] = - df['PRICE'] * df['PERCENT'] / 100
df.loc[df.TYPE == 'INCREASE', 'Change'] = df['PRICE'] * df['PERCENT'] / 100
#merge the Changes into car and boat dataframes; rename columns
df_car = df[['Change']].loc[df.UNIT == 'CAR'].merge(df1, right_index=True, left_index=True, how='right')
df_car.rename(columns={'Change':'Car'}, inplace=True)
df_boat = df[['Change']].loc[df.UNIT == 'BOAT'].merge(df1, right_index=True, left_index=True, how='right')
df_boat.rename(columns={'Change':'Boat'}, inplace=True)
# merge car and boat
dfnew = df_car.merge(df_boat, right_index=True, left_index=True, how='right')
dfnew
                Car    Boat
date
2010-01-01      NaN     NaN
2010-01-02      NaN     NaN
2010-01-03      NaN     NaN
2010-01-04      NaN     NaN
2010-01-05  -69.000     NaN
2010-06-03      NaN  20.000
2010-07-01  105.000     NaN
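A more compact variant (my sketch, not part of the original answer, reusing df and df1 from above): since each date/UNIT pair occurs at most once, you can pivot the Change column by UNIT and reindex onto the full date range.
# pivot Change so CAR and BOAT become columns, then align to df1's index;
# dates without an event come out as NaN
dfnew = (df.pivot(columns='UNIT', values='Change')
           .reindex(df1.index)
           .rename(columns={'CAR': 'Car', 'BOAT': 'Boat'}))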

Related

How to create a new metric column based on 1 year lag of a date column?

I would like to create a new column that references the date column minus 1 year and displays the corresponding consumption values:
import numpy as np
import pandas as pd
Input DF
df = pd.DataFrame({'consumption': [0, 1, 3, 5],
                   'date': [pd.to_datetime('2017-04-01'),
                            pd.to_datetime('2017-04-02'),
                            pd.to_datetime('2018-04-01'),
                            pd.to_datetime('2018-04-02')]})
>>> df
   consumption       date
0            0 2017-04-01
1            1 2017-04-02
2            3 2018-04-01
3            5 2018-04-02
Expected DF
df = pd.DataFrame({'consumption': [0, 1, 3, 5],
                   'prev_year_consumption': [np.nan, np.nan, 0, 1],
                   'date': [pd.to_datetime('2017-04-01'),
                            pd.to_datetime('2017-04-02'),
                            pd.to_datetime('2018-04-01'),
                            pd.to_datetime('2018-04-02')]})
>>> df
   consumption  prev_year_consumption       date
0            0                    NaN 2017-04-01
1            1                    NaN 2017-04-02
2            3                    0.0 2018-04-01
3            5                    1.0 2018-04-02
So prev_year_consumption simply holds the values from the consumption column one year earlier, looked up dynamically via the date column.
In SQL I would probably do something like:
SELECT df_past.consumption AS prev_year_consumption, df_current.consumption
FROM df AS df_current
LEFT JOIN df AS df_past ON year(df_current.date) = year(df_past.date) + 1
I'd appreciate any hints.
The notation in pandas is similar. We are still doing a self merge; however, we need to specify that the right_on (or left_on) key has a DateOffset of 1 year:
new_df = df.merge(
    df,
    left_on='date',
    right_on=df['date'] + pd.offsets.DateOffset(years=1),
    how='left'
)
new_df:
        date  consumption_x     date_x  consumption_y     date_y
0 2017-04-01              0 2017-04-01            NaN        NaT
1 2017-04-02              1 2017-04-02            NaN        NaT
2 2018-04-01              3 2018-04-01            0.0 2017-04-01
3 2018-04-02              5 2018-04-02            1.0 2017-04-02
We can further drop and rename columns to get exact output:
new_df = df.merge(
    df,
    left_on='date',
    right_on=df['date'] + pd.offsets.DateOffset(years=1),
    how='left'
).drop(columns=['date_x', 'date_y']).rename(columns={
    'consumption_y': 'prev_year_consumption'
})
new_df:
        date  consumption_x  prev_year_consumption
0 2017-04-01              0                    NaN
1 2017-04-02              1                    NaN
2 2018-04-01              3                    0.0
3 2018-04-02              5                    1.0
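An equivalent sketch (mine, not the answerer's) that avoids the suffixed columns altogether: shift the dates on a renamed copy of df, then merge on the plain date column.
# copy of df with dates moved forward one year and the column pre-renamed
prev = (df.rename(columns={'consumption': 'prev_year_consumption'})
          .assign(date=lambda d: d['date'] + pd.offsets.DateOffset(years=1)))
# left join keeps every current row; rows with no prior-year match get NaN
new_df = df.merge(prev, on='date', how='left')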

Correlation between columns of different dataframes

I have many dataframes. They all share the same column structure: "date", "open_position_profit", and more columns:
        date  open_position_profit  col2   col3
0 2008-04-01                -260.0     1  290.0
1 2008-04-02                -340.0     1  -60.0
2 2008-04-03                 100.0     1   40.0
3 2008-04-04                 180.0     1  -90.0
4 2008-04-05                   0.0     0    0.0
Although "date" is present in all dataframes, they might or might not have the same count (some dates might be in one dataframe but not the other).
I want to compute a correlation matrix of the columns "open_position_profit" of all these dataframes.
I've tried this
dfs = [df1[["date", "open_position_profit"]], df2[["date", "open_position_profit"]], ...]
pd.concat(dfs).groupby('date', as_index=False).corr()
But this gives me a series of the correlation for each cell:
                          open_position_profit
0  open_position_profit                    1.0
1  open_position_profit                    1.0
2  open_position_profit                    1.0
3  open_position_profit                    1.0
4  open_position_profit                    NaN
I want the correlation for the entire time series, not each single cell. How can I do this?
If I understand your intention correctly, you need to do an outer join first. The following code does an outer join on the date key; missing values are represented by NaN.
df = pd.merge(df1, df2, on='date', how='outer')
date open_position_profit_x open_position_profit_y ... ...
0 2019-01-01 ...
1 2019-01-02 ...
2 2019-01-03 ...
3 2019-01-04 ...
Then you can calculate the correlation with the new DataFrame.
df.corr()
open_position_profit_x open_position_profit_y ... ...
open_position_profit_x 1.000000 0.866025
open_position_profit_y 0.866025 1.000000
... 1.000000 1.000000
... 1.000000 1.000000
See: pd.merge
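Since the question mentions many dataframes, here is a sketch (mine, under the assumption that each frame has 'date' and 'open_position_profit' columns) generalizing the outer-join idea with functools.reduce; df1, df2, df3 stand in for your frames.
from functools import reduce
import pandas as pd

dfs = [df1, df2, df3]
# give each profit column a unique name before merging
renamed = [d[['date', 'open_position_profit']]
               .rename(columns={'open_position_profit': f'profit_{i}'})
           for i, d in enumerate(dfs)]
# outer-join everything on date, then correlate the profit columns
merged = reduce(lambda left, right: pd.merge(left, right, on='date', how='outer'),
                renamed)
print(merged.drop(columns='date').corr())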

How to create a timeseries from a dataframe of event durations?

I have a dataframe full of bookings for one room (rows: booking_id, check-in date, and check-out date) that I want to transform into a timeseries indexed by every day of the year (index: day of year, feature: booked or not).
I have calculated the duration of the bookings, and reindexed the dataframe daily.
Now I need to forward-fill the dataframe, but only a limited number of times: the duration of each booking.
I tried iterating through each row with ffill, but it applies to the entire dataframe, not just the selected rows.
Any idea how I can do that?
Here is my code:
import numpy as np
import pandas as pd

# create dataframe
data = [[1, '2019-01-01', '2019-01-02', 1],
        [2, '2019-01-03', '2019-01-07', 4],
        [3, '2019-01-10', '2019-01-13', 3]]
df = pd.DataFrame(data, columns=['booking_id', 'check-in', 'check-out', 'duration'])
# cast dates to datetime format
df['check-in'] = pd.to_datetime(df['check-in'])
df['check-out'] = pd.to_datetime(df['check-out'])
# create timeseries indexed on check-in date
df2 = df.set_index('check-in')
# create new index and reindex timeseries
idx = pd.date_range(min(df['check-in']), max(df['check-out']), freq='D')
ts = df2.reindex(idx)
I have this:
            booking_id  check-out  duration
2019-01-01         1.0 2019-01-02       1.0
2019-01-02         NaN        NaT       NaN
2019-01-03         2.0 2019-01-07       4.0
2019-01-04         NaN        NaT       NaN
2019-01-05         NaN        NaT       NaN
2019-01-06         NaN        NaT       NaN
2019-01-07         NaN        NaT       NaN
2019-01-08         NaN        NaT       NaN
2019-01-09         NaN        NaT       NaN
2019-01-10         3.0 2019-01-13       3.0
2019-01-11         NaN        NaT       NaN
2019-01-12         NaN        NaT       NaN
2019-01-13         NaN        NaT       NaN
I expect to have:
            booking_id  check-out  duration
2019-01-01         1.0 2019-01-02       1.0
2019-01-02         1.0 2019-01-02       1.0
2019-01-03         2.0 2019-01-07       4.0
2019-01-04         2.0 2019-01-07       4.0
2019-01-05         2.0 2019-01-07       4.0
2019-01-06         2.0 2019-01-07       4.0
2019-01-07         NaN        NaT       NaN
2019-01-08         NaN        NaT       NaN
2019-01-09         NaN        NaT       NaN
2019-01-10         3.0 2019-01-13       3.0
2019-01-11         3.0 2019-01-13       3.0
2019-01-12         3.0 2019-01-13       3.0
2019-01-13         NaN        NaT       NaN
filluntil = ts['check-out'].ffill()
m = ts.index < filluntil.values
# reshape the mask to be the same shape as ts
m = np.repeat(m, ts.shape[1]).reshape(ts.shape)
ts = ts.ffill().where(m)
First we create a series where the check-out dates are forward-filled. Then we create a mask where the index is less than those filled values, and forward-fill ts based on that mask.
If you want to include the row with the check-out date, change the comparison in m from < to <=.
I think that to "forward-fill the dataframe" you should use the pandas interpolate method. Documentation can be found here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html
You can do something like this:
int_how_many_consecutive_to_fill = 3
df2 = df2.interpolate(axis=0, limit=int_how_many_consecutive_to_fill, limit_direction='forward')
Look at the specific documentation for interpolate; there is a lot of custom functionality you can add to the method with flags.
EDIT:
To do this using the row value in the duration column for each interpolation: this is a bit messy, but I think it should work (there may be a less hacky, cleaner solution using some functionality in pandas or another library I am unaware of):
# get rows that contain NaNs:
nans_df = df2[df2.isnull().any(axis=1)]
# get rows without NaNs:
non_nans_df = df2[~df2.isnull().any(axis=1)]
# list of dfs we will concat vertically at the end to get the final dataframe
dfs = []
# iterate through each row that contains NaNs
for nan_index, nan_row in nans_df.iterrows():
    previous_day = nan_index - pd.DateOffset(1)
    # check whether the day before this NaN row holds non-NaN values; if the
    # previous day is also a NaN day, skip this iteration. This mostly handles
    # the case where the first row is a NaN one.
    if previous_day not in non_nans_df.index:
        continue
    date_offset = 0
    # count how many sequential all-NaN rows there are starting from this one;
    # the count is stored in the date_offset variable
    while (nan_index + pd.DateOffset(date_offset)) in nans_df.index:
        date_offset += 1
    # the first date after the continuous run of all-NaN days
    end_sequence_date = nan_index + pd.DateOffset(date_offset)
    # build a dataframe whose first row is the previous day (confirmed non-NaN
    # by the if statement above), followed by the run of NaN rows after it
    df_to_interpolate = pd.concat([non_nans_df.loc[[previous_day]],
                                   nans_df.loc[nan_index:end_sequence_date]])
    # pull the duration value from the first (non-NaN) row
    limit_val = int(df_to_interpolate['duration'].iloc[0])
    # interpolate the dataframe forward, limited by limit_val
    df_to_interpolate = df_to_interpolate.interpolate(axis=0, limit=limit_val,
                                                      limit_direction='forward')
    # append df_to_interpolate to our list that gets combined at the end
    dfs.append(df_to_interpolate)
# final dataframe, filled forward using a dynamic limit from each booking's duration
final_df = pd.concat(dfs)
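For comparison, a more direct sketch (mine, assuming the df built in the question): expand each booking into the run of days it covers with pd.date_range, then reindex onto the full calendar. Here duration is read as the number of days to mark starting at check-in; widen periods by one if you also want the check-out day marked.
rows = []
for _, r in df.iterrows():
    # the days this booking occupies, starting at check-in
    days = pd.date_range(r['check-in'], periods=int(r['duration']), freq='D')
    rows.append(pd.DataFrame({'booking_id': r['booking_id'],
                              'check-out': r['check-out'],
                              'duration': r['duration']}, index=days))
idx = pd.date_range(df['check-in'].min(), df['check-out'].max(), freq='D')
ts = pd.concat(rows).reindex(idx)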

Pandas dataframe puts NaN and NaT

I have generated a DataFrame with pandas:
df_output = pd.DataFrame(columns=["id", "Payout date", "Amount"])
Column 'Payout date' holds a datetime and 'Amount' a float. I'm taking the values for each row from a CSV:
df=pd.read_csv("file.csv", encoding = "ISO-8859-1", low_memory=False)
but when I assign the values:
df_output.loc[df_output['id'] == index, 'Payout date'].iloc[0]=(parsed_date)
pay=payments.get()
ref=refunds.get()
df_output.loc[df_output['id'] == index, 'Amount'].iloc[0]=(pay+ref-for_next_day)
and print the columns 'Payout date' and 'Amount', only the id prints correctly: I get NaT for the payout dates and NaN for the amounts, even when casting them to floats, or using
df_output['Amount']=pd.to_numeric(df_output['Amount'])
df_output['Payout date'] = pd.to_datetime(df_output['Payout date'])
I've also tried casting the values before passing them to the DataFrame, with no luck. What I'm getting is this:
id  Payout date  Amount
 1          NaT     NaN
 2          NaT     NaN
 3          NaT     NaN
 4          NaT     NaN
 5          NaT     NaN
Instead, I'm looking for something like this:
id  Payout date  Amount
 1   2019-03-11     3.2
 2   2019-03-11     3.2
 3   2019-03-11     3.2
 4   2019-03-11     3.2
 5   2019-03-11     3.2
EDIT
print(df_output.head(5))
print(df.head(5))
id  Payout date  Amount
 1          NaT     NaN
 2          NaT     NaN
 3          NaT     NaN
 4          NaT     NaN
 5          NaT     NaN
id        Created (UTC)    Type Currency  Amount    Fee     Net
 1  2016-07-27 13:28:00  charge      mxn   672.0  31.54  640.46
 2  2016-07-27 15:21:00  charge      mxn   146.0   9.58  136.42
 3  2016-07-27 16:18:00  charge      mxn   200.0  11.83  188.17
 4  2016-07-27 17:18:00  charge      mxn   146.0   9.58  136.42
 5  2016-07-27 18:11:00  charge      mxn   286.0  15.43  270.57
Probably the easiest thing to do would be just to rename the columns of the dataframe you're loading:
df = pd.read_csv("file.csv", encoding = "ISO-8859-1", low_memory=False, index_col='id')
df.rename(columns={"Created (UTC)": 'Payout Date'}, inplace=True)
df_output = df[['Payout Date', 'Amount']]
EDIT:
If you're trying to assign a column in one dataframe to the column of another, just do this:
output_df['Amount'] = df['Amount']
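As an aside (my note, not part of the answer above): the NaT/NaN symptom is exactly what chained-indexing assignment produces, because df_output.loc[...].iloc[0] = value writes into a temporary copy. Putting the row mask and the column label in a single .loc call assigns in place; a minimal sketch with made-up values:
import pandas as pd

df_output = pd.DataFrame({'id': [1, 2, 3, 4, 5]})
df_output['Payout date'] = pd.NaT
df_output['Amount'] = float('nan')

# row selector and column label in one .loc call, so the write sticks
df_output.loc[df_output['id'] == 1, 'Payout date'] = pd.Timestamp('2019-03-11')
df_output.loc[df_output['id'] == 1, 'Amount'] = 3.2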

Pandas Error when creating a pivot table (KeyError)

I am trying to create a pivot table from a DataFrame using pandas. A view of my DataFrame is given below:
category,date,type1,type2,total
PROD_A,2018-10-01,2,2,4
PROD_A,2018-10-02,2,0,2
PROD_B,2018-10-01,0,0,0
PROD_A,2018-10-03,0,0,0
I am trying to create a pivot and save the output to an excel file
Summary = pd.pivot_table(df, values=['total'], index=['category'], columns='date')
Summary.to_excel(writer, sheet_name='Summary')
I get the error below:
KeyError: 'total'
Could anyone guide me on where I am going wrong with this? Thanks.
Update with the datatypes:
category     object
date         object
type1         int64
type2         int64
total       float64
dtype: object
Output of df.head():
category,date,type1,type2,total
PROD_A,2018-10-01,2,2,4
PROD_A,2018-10-02,2,0,2
PROD_B,2018-10-01,0,0,0
PROD_A,2018-10-03,0,0,0
PROD_B,2018-10-03,2,3,5
The problem is ['total']: passing a list creates a MultiIndex in the columns:
Summary = pd.pivot_table(df, values=['total'], index=['category'], columns='date')
print (Summary)
               total
date      2018-10-01 2018-10-02 2018-10-03
category
PROD_A           4.0        2.0        0.0
PROD_B           0.0        NaN        NaN
The solution is to remove the list:
Summary = pd.pivot_table(df, values='total', index='category', columns='date')
print (Summary)
date      2018-10-01 2018-10-02 2018-10-03
category
PROD_A           4.0        2.0        0.0
PROD_B           0.0        NaN        NaN
Last, reset the index with reset_index (here drop=True removes the category index entirely):
Summary = (pd.pivot_table(df, values='total', index='category', columns='date')
             .reset_index(drop=True))
print (Summary)
date  2018-10-01  2018-10-02  2018-10-03
0            4.0         2.0         0.0
1            0.0         NaN         5.0
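For completeness, since the question also writes the pivot to an Excel file, here is a minimal sketch of the surrounding writer setup (the output filename is my assumption; writing .xlsx requires an engine such as openpyxl):
Summary = pd.pivot_table(df, values='total', index='category', columns='date')
# 'summary.xlsx' is a hypothetical path
with pd.ExcelWriter('summary.xlsx') as writer:
    Summary.to_excel(writer, sheet_name='Summary')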
