How to convert time data to a numeric value? - python

I have a dataframe out:
dates min max wh
0 2005-09-06 07:41:18 21:59:57 14:18:39
1 2005-09-12 14:49:22 14:49:22 00:00:00
2 2005-09-19 11:08:56 11:24:05 00:15:09
3 2005-09-21 21:19:21 21:20:15 00:00:54
4 2005-09-22 19:41:52 19:41:52 00:00:00
5 2005-10-13 11:22:07 21:05:41 09:43:34
6 2005-11-22 11:53:12 21:21:22 09:28:10
7 2005-11-23 00:07:01 14:08:50 14:01:49
8 2005-11-30 13:42:48 23:59:19 10:16:31
9 2005-12-01 00:05:16 10:24:12 10:18:56
10 2005-12-21 17:38:43 19:26:03 01:47:20
11 2005-12-22 09:20:07 11:25:40 02:05:33
12 2006-01-23 07:46:20 08:01:52 00:15:32
13 2006-04-27 16:27:54 19:29:52 03:01:58
14 2006-05-11 12:48:34 23:10:44 10:22:10
15 2006-05-15 10:14:59 22:28:12 12:13:13
16 2006-05-16 01:14:07 23:55:51 22:41:44
17 2006-05-17 01:12:45 23:57:56 22:45:11
18 2006-05-18 02:42:08 21:48:49 19:06:41
and I want the average work hours per day (represented by the column wh) per month.
out['dates'] = pd.to_datetime(out['dates'])
out['month'] = pd.PeriodIndex(out.dates, freq='M')
out2 = out.groupby('month')['wh'].mean().reset_index(name='wh2')
I have used this so far, but the values in wh are not numeric, so I can't compute the mean. How can I convert the whole wh column so that the mean can be computed?
My wh column was built as follows:
import pandas as pd

df = pd.read_csv("Testordner2/"+i, parse_dates=True)
df['new_time'] = pd.to_datetime(df['new_time'])
df['dates'] = df['new_time'].dt.date
df['time'] = df['new_time'].dt.time
out = df.groupby(df['dates']).agg({'time': ['min', 'max']}) \
        .stack(level=0).droplevel(1)
out['min_as_time_format'] = pd.to_datetime(out['min'], format="%H:%M:%S")
out['max_as_time_format'] = pd.to_datetime(out['max'], format="%H:%M:%S")
out['wh'] = out['max_as_time_format'] - out['min_as_time_format']
# slicing the string representation is what leaves wh non-numeric
out['wh'] = out['wh'].astype(str).str[-18:-10]

One possible solution is to convert the timedeltas to their native integer representation, aggregate the mean, and then convert back to timedeltas:
import numpy as np

out['dates'] = pd.to_datetime(out['dates'])
out['month'] = pd.PeriodIndex(out.dates, freq='M')
out['wh'] = pd.to_timedelta(out['wh']).astype(np.int64)
out2 = pd.to_timedelta(out.groupby('month')['wh'].mean()).reset_index(name='wh2')
print (out2)
month wh2
0 2005-09 02:54:56.400000
1 2005-10 09:43:34
2 2005-11 11:15:30
3 2005-12 04:43:56.333333
4 2006-01 00:15:32
5 2006-04 03:01:58
6 2006-05 17:25:47.800000
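For what it's worth, recent pandas versions can also average a genuine timedelta64 column directly, so the int64 round trip is optional there. A minimal sketch, assuming wh holds the time strings shown above:
out['dates'] = pd.to_datetime(out['dates'])
out['month'] = pd.PeriodIndex(out.dates, freq='M')
out['wh'] = pd.to_timedelta(out['wh'])  # real timedelta64 dtype
# groupby-mean works on timedelta columns directly in recent pandas
out2 = out.groupby('month')['wh'].mean().reset_index(name='wh2')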

Related

Add a row for each missing period and calculate the average of the last 3 months for it

I am trying to write code that adds missing periods to the dataframe and calculates their respective averages. Refer to the example below:
Invoice Date Amount
9 01/2020 227500
4 02/2020 56000
0 03/2020 22000
1 05/2020 25000
5 06/2020 75000
2 07/2020 27000
6 08/2020 48000
3 09/2020 35000
7 10/2020 115000
8 12/2020 85000
In the above dataframe, we see that the record for '11/2020' is missing. I am trying to add a record for the period 11/2020 and calculate its mean from the last three months, i.e., if 11/2020 is missing, take the amounts of 12/2020, 10/2020 and 9/2020, calculate their mean, and append it to the dataframe.
Expected output:
Invoice Date Amount
10 01/2020 227500.00
4 02/2020 56000.00
0 03/2020 22000.00
5 04/2020 75000.00
1 05/2020 25000.00
6 06/2020 48000.00
2 07/2020 27000.00
7 08/2020 115000.00
3 09/2020 35000.00
8 10/2020 77000.00
11 11/2020 65666.67
9 12/2020 85000.00
Please note that I am able to arrive at the above result with the following code:
import pandas as pd

FundAdmin = {
    'Invoice Date': ['03/2020', '05/2020', '07/2020', '09/2020', '02/2020',
                     '04/2020', '06/2020', '08/2020', '10/2020', '12/2020',
                     '01/2020'],
    'Amount': [22000, 25000, 27000, 35000, 56000, 75000, 48000, 115000,
               77000, 85000, 227500]
}
expected_dates = ['01/2020', '02/2020', '03/2020', '04/2020', '05/2020',
                  '06/2020', '07/2020', '08/2020', '09/2020', '10/2020',
                  '11/2020', '12/2020']
df = pd.DataFrame(FundAdmin, columns=['Invoice Date', 'Amount'])
current_dates = df['Invoice Date']
missing_dates = list(set(expected_dates) - set(current_dates))
sorted_df = df.sort_values(by='Invoice Date')
for i in missing_dates:
    Top_3_Rows = sorted_df.tail(3)  # print(Top_3_Rows)
    Top_3_Rows_Amount = round(Top_3_Rows.mean(), 2)
    CalcDF = {
        'Invoice Date': i,
        'Amount': float(Top_3_Rows_Amount)
    }
    FullDF = df.append(CalcDF, ignore_index=True)
print(FullDF)
However, my code is not able to handle the calculation for missing records in the middle of the dataframe. It adds the missing period to the dataframe, but it does not pick up the values of the previous 3 months; instead it assigns the same mean amount to every missing period. For example, if the record for 4/2020 is missing, the code should add a new record for 4/2020 with the mean of 1/2020, 2/2020 and 3/2020. Instead, it assigns the mean value calculated for the other missing period. Please refer to the below:
Expected Output (if both 11/2020 and 4/2020 are missing):
Invoice Date Amount
10 01/2020 227500.00
4 02/2020 56000.00
0 03/2020 22000.00
5 04/2020 101833.33 <---- New record inserted for 4/2020 by calculating the mean of 3/2020, 2/2020 and 1/2020
1 05/2020 25000.00
6 06/2020 48000.00
2 07/2020 27000.00
7 08/2020 115000.00
3 09/2020 35000.00
8 10/2020 77000.00
11 11/2020 65666.67 <---- New record inserted for 11/2020 by calculating the mean of 12/2020, 10/2020 and 9/2020
9 12/2020 85000.00
My Output (if both 11/2020 and 4/2020 are missing):
Invoice Date Amount
10 01/2020 227500.00
4 02/2020 56000.00
0 03/2020 22000.00
5 04/2020 65666.67 <--- Value same as 11/2020
1 05/2020 25000.00
6 06/2020 48000.00
2 07/2020 27000.00
7 08/2020 115000.00
3 09/2020 35000.00
8 10/2020 77000.00
11 11/2020 65666.67 <--- This works fine.
9 12/2020 85000.00
From my observation, my code is not able to fetch the last 3 records when the missing period happens to be in the middle of the dataframe: because I am using the tail() method, it fetches the records for 9/2020, 10/2020 and 12/2020, calculates their mean, and assigns that same value to 4/2020. I am a complete beginner in Python, and any assistance in resolving this issue is greatly appreciated.
Would this work for you?
import numpy as np
import pandas as pd
from datetime import datetime
from random import randint

df_len = 100
df = pd.DataFrame({
    'Invoice': [randint(1, 10) for _ in range(df_len)],
    'Dates': [(datetime.today() - pd.DateOffset(months=mnths_ago)).date()
              for mnths_ago in range(df_len)],
    'Amount': [randint(1, 100000) for _ in range(df_len)],
})
# Drop 10 random rows
drop_indices = np.random.choice(df.index, 10, replace=False)
df = df.drop(drop_indices)
df
Invoice Dates Amount
0 1 2020-05-19 23797
1 6 2020-04-19 54101
2 10 2020-03-19 91522
3 5 2020-02-19 48762
4 1 2020-01-19 54497
.. ... ... ...
93 1 2012-08-19 56834
94 10 2012-07-19 21382
95 2 2012-06-19 33056
96 1 2012-05-19 93336
98 7 2012-03-19 12406
from dateutil import relativedelta

def get_prev_mean(date):
    return df[:df.loc[df.Dates == date].index[0]].tail(3)['Amount'].mean()

r = relativedelta.relativedelta(df.Dates.min(), df.Dates.max())
n_months = -(r.years * 12) + r.months
all_months = [(df.Dates.max() - pd.DateOffset(months=mnths_ago)).date()
              for mnths_ago in range(n_months)]
missing_months = [mnth for mnth in all_months if mnth in list(df.Dates)]
dct = {mnth: get_prev_mean(mnth) for mnth in missing_months}
to_merge = pd.DataFrame(data=dct.values(), index=dct.keys()).reset_index()
to_merge.columns = ['Dates', 'Amount']
out = pd.concat([df, to_merge], sort=False).sort_values(by='Dates').reset_index(drop=True)
out
Invoice Dates Amount
0 7.0 2012-03-19 12406.0
1 1.0 2012-05-19 93336.0
2 2.0 2012-06-19 33056.0
3 10.0 2012-07-19 21382.0
4 1.0 2012-08-19 56834.0
.. ... ... ...
171 10.0 2020-03-19 91522.0
172 NaN 2020-04-19 23797.0
173 6.0 2020-04-19 54101.0
174 NaN 2020-05-19 NaN
175 1.0 2020-05-19 23797.0
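A different sketch for the original monthly data, assuming each gap should be filled with the mean of the three most recent known months before it (note the asker's expected 65666.67 for 11/2020 instead averages the surrounding months 9, 10 and 12/2020, so adjust the window rule to taste):
import pandas as pd

df = pd.DataFrame({
    'Invoice Date': ['01/2020', '02/2020', '03/2020', '05/2020', '06/2020',
                     '07/2020', '08/2020', '09/2020', '10/2020', '12/2020'],
    'Amount': [227500, 56000, 22000, 25000, 75000, 27000, 48000, 35000,
               115000, 85000],
})

# Index by month period and expand to the full monthly range.
months = pd.to_datetime(df['Invoice Date'], format='%m/%Y').dt.to_period('M')
s = df.set_index(months)['Amount']
full = s.reindex(pd.period_range(s.index.min(), s.index.max(), freq='M'))

# Fill gaps in chronological order; earlier fills can feed later ones.
for period in full[full.isna()].index:
    full[period] = round(full[:period].dropna().tail(3).mean(), 2)

print(full)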

check if each user has consecutive dates in a python 3 pandas dataframe

Imagine there is a dataframe:
id date balance_total transaction_total
0 1 01/01/2019 102.0 -1.0
1 1 01/02/2019 100.0 -2.0
2 1 01/03/2019 100.0 NaN
3 1 01/04/2019 100.0 NaN
4 1 01/05/2019 96.0 -4.0
5 2 01/01/2019 200.0 -2.0
6 2 01/02/2019 100.0 -2.0
7 2 01/04/2019 100.0 NaN
8 2 01/05/2019 96.0 -4.0
here is the command to create the dataframe:
import pandas as pd
import numpy as np

users = pd.DataFrame(
    [
        {'id': 1, 'date': '01/01/2019', 'transaction_total': -1, 'balance_total': 102},
        {'id': 1, 'date': '01/02/2019', 'transaction_total': -2, 'balance_total': 100},
        {'id': 1, 'date': '01/03/2019', 'transaction_total': np.nan, 'balance_total': 100},
        {'id': 1, 'date': '01/04/2019', 'transaction_total': np.nan, 'balance_total': 100},
        {'id': 1, 'date': '01/05/2019', 'transaction_total': -4, 'balance_total': np.nan},
        {'id': 2, 'date': '01/01/2019', 'transaction_total': -2, 'balance_total': 200},
        {'id': 2, 'date': '01/02/2019', 'transaction_total': -2, 'balance_total': 100},
        {'id': 2, 'date': '01/04/2019', 'transaction_total': np.nan, 'balance_total': 100},
        {'id': 2, 'date': '01/05/2019', 'transaction_total': -4, 'balance_total': 96}
    ]
)
How could I check whether each id has consecutive dates or not? I tried the "shift" idea from Calculating time difference between two rows, but it doesn't seem to work:
df['index_col'] = df.index
for id in df['id'].unique():
    # create an empty QA dataframe
    column_names = ["Delta"]
    df_qa = pd.DataFrame(columns=column_names)
    df_qa['Delta'] = (df['index_col'] - df['index_col'].shift(1))
    if (df_qa['Delta'].iloc[1:] != 1).any() is True:
        print('id ' + id + ' might have non-consecutive dates')
        # doesn't print any account => each customer's daily balance has consecutive dates
        break
Ideal output:
it should print id 2 might have non-consecutive dates
Thank you!
Use groupby and diff:
df["date"] = pd.to_datetime(df["date"],format="%m/%d/%Y")
df["difference"] = df.groupby("id")["date"].diff()
print (df.loc[df["difference"] > pd.Timedelta(1, unit="d")])
id date transaction_total balance_total difference
7 2 2019-01-04 NaN 100.0 2 days
Use DataFrameGroupBy.diff with Series.dt.days, compare for values greater than 1, and filter only the id column with DataFrame.loc:
users['date'] = pd.to_datetime(users['date'])
i = users.loc[users.groupby('id')['date'].diff().dt.days.gt(1), 'id'].tolist()
print (i)
[2]
for val in i:
    print(f'id {val} might have non-consecutive dates')
id 2 might have non-consecutive dates
The first step is to parse the date column:
users['date'] = pd.to_datetime(users.date)
Then add shifted copies of the id and date columns:
users['id_shifted'] = users.id.shift(1)
users['date_shifted'] = users.date.shift(1)
The difference between date and date_shifted columns is of interest:
>>> users.date - users.date_shifted
0 NaT
1 1 days
2 1 days
3 1 days
4 1 days
5 -4 days
6 1 days
7 2 days
8 1 days
dtype: timedelta64[ns]
You can now query the DataFrame for what you want:
users[(users.id_shifted == users.id) & (users.date - users.date_shifted != np.timedelta64(1, 'D'))]
That is, consecutive lines of the same user with a date difference != 1 day.
This solution does assume the data is sorted by (id, date).
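If all you need is one yes/no per id, here is a compact sketch building on the groupby/diff idea above (it sorts within each id first, so unsorted input is fine too):
users['date'] = pd.to_datetime(users['date'], format='%m/%d/%Y')
# True for an id when every gap between successive dates is exactly one day
is_consecutive = (users.sort_values(['id', 'date'])
                       .groupby('id')['date']
                       .apply(lambda s: s.diff().dropna().eq(pd.Timedelta(days=1)).all()))
print(is_consecutive)  # id 1 -> True, id 2 -> False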

Correlation between two dataframes' columns with matched headers

I have two dataframes from Excel files which look like the ones below. The first dataframe has a multi-index header.
I am trying to find the correlation between each column in the dataframe with the corresponding dataframe based on the currency (i.e KRW, THB, USD, INR). At the moment, I am doing a loop to iterate through each column, matching by index and corresponding header before finding the correlation.
for stock_name in index_data.columns.get_level_values(0):
    stock_prices = index_data.xs(stock_name, level=0, axis=1)
    stock_prices = stock_prices.dropna()
    fx = currency_data[stock_prices.columns.get_level_values(1).values[0]]
    fx = fx[fx.index.isin(stock_prices.index)]
    merged_df = pd.merge(stock_prices, fx, left_index=True, right_index=True)
    merged_df[0].corr(merged_df[1])
Is there a more panda-ish way of doing this?
So you wish to find the correlation between the stock price and its related currency. (Or stock price correlation to all currencies?)
import numpy as np
import pandas as pd

# dummy data
date_range = pd.date_range('2019-02-01', '2019-03-01', freq='D')
stock_prices = pd.DataFrame(
    np.random.randint(1, 20, (date_range.shape[0], 4)),
    index=date_range,
    columns=[['BYZ6DH', 'BLZGSL', 'MBT', 'BAP'],
             ['KRW', 'THB', 'USD', 'USD']])
fx = pd.DataFrame(np.random.randint(1, 20, (date_range.shape[0], 3)),
                  index=date_range, columns=['KRW', 'THB', 'USD'])
This is what it looks like; calculating correlations on this data won't make much sense, since it is random.
>>> print(stock_prices.head())
BYZ6DH BLZGSL MBT BAP
KRW THB USD USD
2019-02-01 15 10 19 19
2019-02-02 5 9 19 5
2019-02-03 19 7 18 10
2019-02-04 1 6 7 18
2019-02-05 11 17 6 7
>>> print(fx.head())
KRW THB USD
2019-02-01 15 11 10
2019-02-02 6 5 3
2019-02-03 13 1 3
2019-02-04 19 8 14
2019-02-05 6 13 2
Use apply to calculate the correlation between columns with the same currency.
def f(x, fx):
    correlation = x.corr(fx[x.name[1]])
    return correlation

correlation = stock_prices.apply(f, args=(fx,), axis=0)
>>> print(correlation)
BYZ6DH KRW -0.247529
BLZGSL THB 0.043084
MBT USD -0.471750
BAP USD 0.314969
dtype: float64
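A slightly more pandas-ish variant on the same dummy frames: align fx to the stocks' currency level and let DataFrame.corrwith compute all pairwise column correlations in one call:
# Select fx columns in the stocks' currency order (duplicates allowed),
# then relabel so corrwith matches columns one-to-one.
fx_aligned = fx[stock_prices.columns.get_level_values(1)]
fx_aligned.columns = stock_prices.columns
print(stock_prices.corrwith(fx_aligned))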

pandas dataframe column means [duplicate]

I am new to Python and pandas. I have a pandas dataframe with monthly columns ranging from 2000 (2000-01) to 2016 (2016-06).
I want to find the average of every three months and assign it to a new quarterly column (2000q1). I know I can do the following:
df['2000q1'] = df[['2000-01', '2000-02', '2000-03']].mean(axis=1)
df['2000q2'] = df[['2000-04', '2000-05', '2000-06']].mean(axis=1)
.
.
.
df['2016q2'] = df[['2016-04', '2016-05', '2016-06']].mean(axis=1)
But this is very tedious. I would appreciate it if someone could help me find a better way.
You can use groupby on columns:
df.groupby(np.arange(len(df.columns))//3, axis=1).mean()
Or, those can be converted to datetime. You can use resample:
df.columns = pd.to_datetime(df.columns)
df.resample('Q', axis=1).mean()
Here's a demo:
cols = pd.date_range('2000-01', '2000-06', freq='MS')
cols = cols.strftime('%Y-%m')
cols
Out:
array(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'],
dtype='<U7')
df = pd.DataFrame(np.random.randn(10, 6), columns=cols)
df
Out:
2000-01 2000-02 2000-03 2000-04 2000-05 2000-06
0 -1.263798 0.251526 0.851196 0.159452 1.412013 1.079086
1 -0.909071 0.685913 1.394790 -0.883605 0.034114 -1.073113
2 0.516109 0.452751 -0.397291 -0.050478 -0.364368 -0.002477
3 1.459609 -1.696641 0.457822 1.057702 -0.066313 -0.910785
4 -0.482623 1.388621 0.971078 -0.038535 0.033167 0.025781
5 -0.016654 1.404805 0.100335 -0.082941 -0.418608 0.588749
6 0.684735 -2.007105 0.552615 1.969356 -0.614634 0.021459
7 0.382475 0.965739 -1.826609 -0.086537 -0.073538 -0.534753
8 1.548773 -0.157250 0.494819 -1.631516 0.627794 -0.398741
9 0.199049 0.145919 0.711701 0.305382 -0.118315 -2.397075
First alternative:
df.groupby(np.arange(len(df.columns))//3, axis=1).mean()
Out:
0 1
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
Second alternative:
df.columns = pd.to_datetime(df.columns)
df.resample('Q', axis=1).mean()
Out:
2000-03-31 2000-06-30
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
You can assign this to a DataFrame:
res = df.resample('Q', axis=1).mean()
Change column names as you like:
res = res.rename(columns=lambda col: '{}q{}'.format(col.year, col.quarter))
res
Out:
2000q1 2000q2
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
And attach this to your current DataFrame by:
pd.concat([df, res], axis=1)
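One caveat: newer pandas versions deprecate axis=1 in groupby and resample, so if that path is closed off, transposing achieves the same quarterly means:
# Resample down the datetime index of the transpose, then flip back.
res = df.T.resample('Q').mean().T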

stepwise time series pandas

I have the following problem in pandas where I have a time series with specific time stamps and values:
ts1 = pd.DatetimeIndex(['1995-05-26', '1995-05-30', '1995-05-31', '1995-06-01',
                        '1995-06-02', '1995-06-05', '1995-06-06', '1995-06-08',
                        '1995-06-09', '1995-06-12'],
                       dtype='datetime64[ns]', freq=None)
Then I have a time index that contains these timestamps, and some other timestamps in between. How do I create a stepwise function (forward fill) that fills forward the same constant value from [T-1, T) for T in ts1?
Something like this?:
dfg1 = pd.DataFrame(range(len(ts1)), index=ts1)
idx = pd.date_range(start=min(ts1), end=max(ts1), freq='D')
>>> dfg1.reindex(index=idx).ffill()
0
1995-05-26 0
1995-05-27 0
1995-05-28 0
1995-05-29 0
1995-05-30 1
1995-05-31 2
1995-06-01 3
1995-06-02 4
1995-06-03 4
1995-06-04 4
1995-06-05 5
1995-06-06 6
1995-06-07 6
1995-06-08 7
1995-06-09 8
1995-06-10 8
1995-06-11 8
1995-06-12 9
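Equivalently, since dfg1 already carries ts1 as its index, a daily resample plus forward fill should produce the same step function:
dfg1.resample('D').ffill()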
