pandas dataframe column means [duplicate] - python

I am new to Python and Pandas. I have a pandas dataframe with monthly columns ranging from 2000 (2000-01) to 2016 (2016-06).
I want to find the average of every three months and assign it to a new quarterly column (2000q1). I know I can do the following:
df['2000q1'] = df[['2000-01', '2000-02', '2000-03']].mean(axis=1)
df['2000q2'] = df[['2000-04', '2000-05', '2000-06']].mean(axis=1)
.
.
.
df['2016q2'] = df[['2016-04', '2016-05', '2016-06']].mean(axis=1)
But this is very tedious. I would appreciate it if someone could help me find a better way.

You can use groupby on columns:
df.groupby(np.arange(len(df.columns))//3, axis=1).mean()
Or, the column labels can be converted to datetime, and then you can use resample:
df.columns = pd.to_datetime(df.columns)
df.resample('Q', axis=1).mean()
Here's a demo:
cols = pd.date_range('2000-01', '2000-06', freq='MS')
cols = cols.strftime('%Y-%m')
cols
Out:
array(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'],
dtype='<U7')
df = pd.DataFrame(np.random.randn(10, 6), columns=cols)
df
Out:
2000-01 2000-02 2000-03 2000-04 2000-05 2000-06
0 -1.263798 0.251526 0.851196 0.159452 1.412013 1.079086
1 -0.909071 0.685913 1.394790 -0.883605 0.034114 -1.073113
2 0.516109 0.452751 -0.397291 -0.050478 -0.364368 -0.002477
3 1.459609 -1.696641 0.457822 1.057702 -0.066313 -0.910785
4 -0.482623 1.388621 0.971078 -0.038535 0.033167 0.025781
5 -0.016654 1.404805 0.100335 -0.082941 -0.418608 0.588749
6 0.684735 -2.007105 0.552615 1.969356 -0.614634 0.021459
7 0.382475 0.965739 -1.826609 -0.086537 -0.073538 -0.534753
8 1.548773 -0.157250 0.494819 -1.631516 0.627794 -0.398741
9 0.199049 0.145919 0.711701 0.305382 -0.118315 -2.397075
First alternative:
df.groupby(np.arange(len(df.columns))//3, axis=1).mean()
Out:
0 1
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
Second alternative:
df.columns = pd.to_datetime(df.columns)
df.resample('Q', axis=1).mean()
Out:
2000-03-31 2000-06-30
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
You can assign the result to a new DataFrame:
res = df.resample('Q', axis=1).mean()
Change column names as you like:
res = res.rename(columns=lambda col: '{}q{}'.format(col.year, col.quarter))
res
Out:
2000q1 2000q2
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
And attach this to your current DataFrame by:
pd.concat([df, res], axis=1)
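Putting the pieces together, here is a minimal end-to-end sketch of the resample approach, assuming the monthly string columns shown above. Note that on newer pandas versions the axis=1 arguments used above may be deprecated; transposing first, as done here, is a workable alternative:
import pandas as pd

# Assumes df has monthly string columns like '2000-01', ..., '2016-06'.
df.columns = pd.to_datetime(df.columns)

# Transpose so the dates become the index, take quarterly means, transpose back.
res = df.T.resample('Q').mean().T

# Rename the quarter-end timestamps to labels like '2000q1'.
res.columns = ['{}q{}'.format(c.year, c.quarter) for c in res.columns]

# Attach the quarterly columns to the original DataFrame.
out = pd.concat([df, res], axis=1)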

Related

check if each user has consecutive dates in a python 3 pandas dataframe

Imagine there is a dataframe:
id date balance_total transaction_total
0 1 01/01/2019 102.0 -1.0
1 1 01/02/2019 100.0 -2.0
2 1 01/03/2019 100.0 NaN
3 1 01/04/2019 100.0 NaN
4 1 01/05/2019 96.0 -4.0
5 2 01/01/2019 200.0 -2.0
6 2 01/02/2019 100.0 -2.0
7 2 01/04/2019 100.0 NaN
8 2 01/05/2019 96.0 -4.0
here is the create dataframe command:
import pandas as pd
import numpy as np
users = pd.DataFrame(
    [
        {'id': 1, 'date': '01/01/2019', 'transaction_total': -1, 'balance_total': 102},
        {'id': 1, 'date': '01/02/2019', 'transaction_total': -2, 'balance_total': 100},
        {'id': 1, 'date': '01/03/2019', 'transaction_total': np.nan, 'balance_total': 100},
        {'id': 1, 'date': '01/04/2019', 'transaction_total': np.nan, 'balance_total': 100},
        {'id': 1, 'date': '01/05/2019', 'transaction_total': -4, 'balance_total': 96},
        {'id': 2, 'date': '01/01/2019', 'transaction_total': -2, 'balance_total': 200},
        {'id': 2, 'date': '01/02/2019', 'transaction_total': -2, 'balance_total': 100},
        {'id': 2, 'date': '01/04/2019', 'transaction_total': np.nan, 'balance_total': 100},
        {'id': 2, 'date': '01/05/2019', 'transaction_total': -4, 'balance_total': 96}
    ]
)
How could I check if each id has consecutive dates or not? I use the
"shift" idea here but it doesn't seem to work:
Calculating time difference between two rows
df['index_col'] = df.index
for id in df['id'].unique():
    # create an empty QA dataframe
    column_names = ["Delta"]
    df_qa = pd.DataFrame(columns=column_names)
    df_qa['Delta'] = (df['index_col'] - df['index_col'].shift(1))
    if (df_qa['Delta'].iloc[1:] != 1).any() is True:
        print('id ' + id + ' might have non-consecutive dates')
        # doesn't print any account => Each Customer's Daily Balance has Consecutive Dates
        break
Ideal output:
it should print id 2 might have non-consecutive dates
Thank you!
Use groupby and diff:
df["date"] = pd.to_datetime(df["date"],format="%m/%d/%Y")
df["difference"] = df.groupby("id")["date"].diff()
print (df.loc[df["difference"]>pd.Timedelta(1, unit="d")])
#
id date transaction_total balance_total difference
7 2 2019-01-04 NaN 100.0 2 days
Use DataFrameGroupBy.diff with Series.dt.days, compare for values greater than 1, and filter only the id column with DataFrame.loc:
users['date'] = pd.to_datetime(users['date'])
i = users.loc[users.groupby('id')['date'].diff().dt.days.gt(1), 'id'].tolist()
print (i)
[2]
for val in i:
    print(f'id {val} might have non-consecutive dates')
id 2 might have non-consecutive dates
The first step is to parse the date column:
users['date'] = pd.to_datetime(users.date)
Then add a shifted column on the id and date columns:
users['id_shifted'] = users.id.shift(1)
users['date_shifted'] = users.date.shift(1)
The difference between date and date_shifted columns is of interest:
>>> users.date - users.date_shifted
0 NaT
1 1 days
2 1 days
3 1 days
4 1 days
5 -4 days
6 1 days
7 2 days
8 1 days
dtype: timedelta64[ns]
You can now query the DataFrame for what you want:
users[(users.id_shifted == users.id) & (users.date - users.date_shifted != np.timedelta64(1, 'D'))]
That is, consecutive lines of the same user with a date difference != 1 day.
This solution does assume the data is sorted by (id, date).
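If you want a single True/False flag per id rather than the offending rows, here is a small sketch building on the groupby/diff idea above (assuming the frame is sorted by id and date, as in the question):
users['date'] = pd.to_datetime(users['date'], format='%m/%d/%Y')

# For each id: True if any gap between consecutive dates exceeds one day.
has_gap = (users.sort_values(['id', 'date'])
                .groupby('id')['date']
                .apply(lambda s: s.diff().dt.days.gt(1).any()))
print(has_gap)
# id
# 1    False
# 2     True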

How to convert time data to numeric value?

I have a dataframe out:
dates min max wh
0 2005-09-06 07:41:18 21:59:57 14:18:39
1 2005-09-12 14:49:22 14:49:22 00:00:00
2 2005-09-19 11:08:56 11:24:05 00:15:09
3 2005-09-21 21:19:21 21:20:15 00:00:54
4 2005-09-22 19:41:52 19:41:52 00:00:00
5 2005-10-13 11:22:07 21:05:41 09:43:34
6 2005-11-22 11:53:12 21:21:22 09:28:10
7 2005-11-23 00:07:01 14:08:50 14:01:49
8 2005-11-30 13:42:48 23:59:19 10:16:31
9 2005-12-01 00:05:16 10:24:12 10:18:56
10 2005-12-21 17:38:43 19:26:03 01:47:20
11 2005-12-22 09:20:07 11:25:40 02:05:33
12 2006-01-23 07:46:20 08:01:52 00:15:32
13 2006-04-27 16:27:54 19:29:52 03:01:58
14 2006-05-11 12:48:34 23:10:44 10:22:10
15 2006-05-15 10:14:59 22:28:12 12:13:13
16 2006-05-16 01:14:07 23:55:51 22:41:44
17 2006-05-17 01:12:45 23:57:56 22:45:11
18 2006-05-18 02:42:08 21:48:49 19:06:41
and I want the average working hours per day (represented by the column wh) for each month.
out['dates'] = pd.to_datetime(out['dates'])
out['month']= pd.PeriodIndex(out.dates, freq='M')
out2=out.groupby('month')['wh'].mean().reset_index(name='wh2')
I have used this so far, but the values in wh are not numeric, so I can't compute the mean. How can I convert the whole column wh so that I can compute the mean?
My wh was made by the following:
df = pd.read_csv("Testordner2/"+i, parse_dates=True)
df['new_time'] = pd.to_datetime(df['new_time'])
df['dates']= df['new_time'].dt.date
df['time'] = df['new_time'].dt.time
out = df.groupby(df['dates']).agg({'time': ['min', 'max']}) \
.stack(level=0).droplevel(1)
out['min_as_time_format'] = pd.to_datetime(out['min'], format="%H:%M:%S")
out['max_as_time_format'] = pd.to_datetime(out['max'], format="%H:%M:%S")
out['wh'] = out['max_as_time_format'] - out['min_as_time_format']
out['wh'].astype(str).str[-18:-10]
One possible solution is to convert the timedeltas to their native integer representation (nanoseconds), aggregate the mean, and then convert back to timedeltas:
out['dates'] = pd.to_datetime(out['dates'])
out['month'] = pd.PeriodIndex(out.dates, freq='M')
out['wh'] = pd.to_timedelta(out['wh']).astype(np.int64)
out2 = pd.to_timedelta(out.groupby('month')['wh'].mean()).reset_index(name='wh2')
print (out2)
month wh2
0 2005-09 02:54:56.400000
1 2005-10 09:43:34
2 2005-11 11:15:30
3 2005-12 04:43:56.333333
4 2006-01 00:15:32
5 2006-04 03:01:58
6 2006-05 17:25:47.800000
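An alternative sketch with the same column names, if you prefer to avoid the raw nanosecond representation, is to average in seconds via Series.dt.total_seconds and convert the result back:
out['dates'] = pd.to_datetime(out['dates'])
out['month'] = pd.PeriodIndex(out.dates, freq='M')

# Average working hours in seconds per month, then back to timedeltas.
wh_seconds = pd.to_timedelta(out['wh']).dt.total_seconds()
out2 = pd.to_timedelta(wh_seconds.groupby(out['month']).mean(), unit='s') \
         .reset_index(name='wh2')
print(out2)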

Equivalent of 'mutate_at' dplyr function in Python pandas

and thank you in advance for the help.
I am looking to create multiple new columns in a pandas dataframe, by dividing a subset of existing columns by another existing column, dynamically named with a suffix. Below is dummy code illustrating the general gist of what I want to do, except for 25+ columns with various transformations.
R code
library(dplyr)
player = c('John','Peter','Michael')
min = c(20, 23, 35)
points = c(10,12,14)
rebounds = c(5,7,9)
assists = c(4,6,7)
df = data.frame(player,min,points,rebounds,assists)
df = df %>%
mutate_at(vars(points:assists),.funs=funs(per_min=./min))
Expected output
player min points rebounds assists points_per_min rebounds_per_min assists_per_min
1 John 20 10 5 4 0.5000000 0.2500000 0.2000000
2 Peter 23 12 7 6 0.5217391 0.3043478 0.2608696
3 Michael 35 14 9 7 0.4000000 0.2571429 0.2000000
I know that I can reproduce the above in pandas as follows:
import pandas as pd
data = pd.DataFrame({'player': ['John', 'Peter', 'Michael'],
                     'min': [20, 23, 35],
                     'points': [10, 12, 14],
                     'rebounds': [5, 7, 9],
                     'assists': [4, 6, 7]})
df = pd.DataFrame(data)
df['points_per_minute'] = df['points']/df['min']
df['rebounds_per_minute'] = df['rebounds']/df['min']
df['assists_per_minute'] = df['assists']/df['min']
df.head()
player min points rebounds assists points_per_minute rebounds_per_minute assists_per_minute
0 John 20 10 5 4 0.500000 0.250000 0.20000
1 Peter 23 12 7 6 0.521739 0.304348 0.26087
2 Michael 35 14 9 7 0.400000 0.257143 0.20000
However, I have to do this for 25+ columns, with different transformations, and explicitly naming every column and operation will become rather cumbersome. Is there any pandas replication of this?
Similar to base R, assign a block of columns at once with basic arithmetic. Base R often translates more directly to NumPy/pandas than dplyr does.
R
cols <- c("points", "rebounds", "assists")
df[paste0(cols, "_per_min")] <- df[cols] / df$min
Python pandas
cols = ["points", "rebounds", "assists"]
df[[col+'_per_min' for col in cols]] = df[cols].div(df['min'], axis='index')
Method 1:
Take the list of columns (if you don't have a list of columns and want to get all columns after the min column, use cols=df.iloc[:, df.columns.get_loc('min')+1:].columns):
cols=['points','rebounds','assists']
Create a copy of the subset of those columns with df.loc[] and add_suffix('_per_minute'), then divide them by the min column:
m=df.loc[:,cols].add_suffix('_per_minute')
df[m.columns]=m.div(df['min'],axis=0)
print(df)
Method 2: concat:
cols=['points','rebounds','assists']
df=pd.concat([df,df.loc[:,cols].add_suffix('_per_minute').div(df['min'],axis=0)],axis=1)
Method 3:
Directly assign them with string formatting, using the same logic:
cols=['points','rebounds','assists']
df[[f"{i}_per_minute" for i in cols]]=df.loc[:,cols].div(df['min'],axis=0)
print(df)
player min points rebounds assists points_per_minute \
0 John 20 10 5 4 0.500000
1 Peter 23 12 7 6 0.521739
2 Michael 35 14 9 7 0.400000
rebounds_per_minute assists_per_minute
0 0.250000 0.20000
1 0.304348 0.26087
2 0.257143 0.20000
mutate_at is superseded by mutate and across.
Here is how you can do it in a dplyr way in python:
>>> from datar.all import c, f, tibble, mutate, across
>>>
>>> player = c('John','Peter','Michael')
>>> min = c(20, 23, 35)
>>> points = c(10,12,14)
>>> rebounds = c(5,7,9)
>>> assists = c(4,6,7)
>>>
>>> df = tibble(player,min,points,rebounds,assists)
>>>
>>> df = df >> mutate(
... # f.min passed to lambda as y
... across(f[f.points:f.assists], {'per_min': lambda x, y: x / y}, f.min)
... )
>>> df
player min points rebounds assists points_per_min rebounds_per_min assists_per_min
<object> <int64> <int64> <int64> <int64> <float64> <float64> <float64>
0 John 20 10 5 4 0.500000 0.250000 0.20000
1 Peter 23 12 7 6 0.521739 0.304348 0.26087
2 Michael 35 14 9 7 0.400000 0.257143 0.20000
I am the author of the datar package. Feel free to submit issues if you have any questions.
With the specific goal of making this feel more like dplyr, I really prefer method-chaining solutions because of their syntactic similarity to piped dplyr code.
This solution uses pandas.DataFrame.assign and dictionary unpacking.
updated_data = data.assign(**{f"{col}_per_minute": lambda x, col=col: x[col] / x["min"]
                              for col in ["points", "rebounds", "assists"]})
Note the col=col default argument: it binds each column name when the lambda is defined; without it, every lambda would capture the last value of col and all three new columns would be computed from assists.
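For the broader case in the question (25+ columns with different transformations), one possible sketch is to drive everything from a mapping of suffixes to column lists and functions; the '_doubled' entry below is a hypothetical placeholder, not something from the question:
# Each entry: suffix -> (columns to transform, function of (frame, column name)).
transformations = {
    '_per_minute': (['points', 'rebounds', 'assists'], lambda d, c: d[c] / d['min']),
    '_doubled': (['points'], lambda d, c: d[c] * 2),  # hypothetical example
}

for suffix, (cols, func) in transformations.items():
    for c in cols:
        df[c + suffix] = func(df, c)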

Forward filling missing dates into Python Pandas Dataframe

I have a Panda's dataframe that is filled as follows:
ref_date tag
1/29/2010 1
2/26/2010 3
3/31/2010 4
4/30/2010 4
5/31/2010 1
6/30/2010 3
8/31/2010 1
9/30/2010 4
12/31/2010 2
Note how there are missing months (i.e. 7, 10, 11) in the data. I want to fill in the missing data through a forward filling method so that it looks like this:
ref_date tag
1/29/2010 1
2/26/2010 3
3/31/2010 4
4/30/2010 4
5/31/2010 1
6/30/2010 3
7/30/2010 3
8/31/2010 1
9/30/2010 4
10/29/2010 4
11/30/2010 4
12/31/2010 2
The missing dates should take the tag of the previous month. All dates represent the last business day of the month.
This is what I tried to do:
idx = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
df.ref_date.index = pd.to_datetime(df.ref_date.index)
df = df.reindex(index=[idx], columns=[ref_date], method='ffill')
It's giving me the error:
TypeError: Cannot compare type 'Timestamp' with type 'int'
where pd is pandas and df is the dataframe.
I'm new to Pandas Dataframe, so any help would be appreciated!
You were very close; you just need to set the dataframe's index to ref_date, reindex it to the business-month-end index while specifying ffill as the method, then reset the index and rename the column back to the original:
# First ensure the dates are Pandas Timestamps.
df['ref_date'] = pd.to_datetime(df['ref_date'])
# Create a monthly index.
idx_monthly = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
# Reindex to the business-month-end index, forward filling missing months.
>>> (df
.set_index('ref_date')
.reindex(idx_monthly, method='ffill')
.reset_index()
.rename(columns={'index': 'ref_date'}))
ref_date tag
0 2010-01-29 1.0
1 2010-02-26 3.0
2 2010-03-31 4.0
3 2010-04-30 4.0
4 2010-05-31 1.0
5 2010-06-30 3.0
6 2010-07-30 3.0
7 2010-08-31 1.0
8 2010-09-30 4.0
9 2010-10-29 4.0
10 2010-11-30 4.0
11 2010-12-31 2.0
Thanks to the previous person who answered this question but then deleted their answer, I got the solution:
df['ref_date'] = pd.to_datetime(df['ref_date'])
idx = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
df = df.set_index('ref_date').reindex(idx).ffill().reset_index().rename(columns={'index': 'ref_date'})
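As a side note, because every date in this data already falls on the last business day of a month, a shorter sketch along the same lines uses asfreq, which generates the missing business-month-end dates between the first and last index values:
df['ref_date'] = pd.to_datetime(df['ref_date'])

# Forward fill the tag for any business month end missing between the first
# and last dates; reset_index brings the dates back as a column.
filled = df.set_index('ref_date').asfreq('BM', method='ffill').reset_index()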

stepwise time series pandas

I have the following problem in pandas where I have a time series with specific time stamps and values:
ts1 = DatetimeIndex(['1995-05-26', '1995-05-30', '1995-05-31', '1995-06-01',
'1995-06-02', '1995-06-05', '1995-06-06', '1995-06-08',
'1995-06-09', '1995-06-12'],
dtype='datetime64[ns]', freq=None, tz=None)
Then I have a time index that contains these timestamps, and some other timestamps in between. How do I create a stepwise function (forward fill) that fills forward the same constant value from [T-1, T) for T in ts1?
Something like this?:
dfg1 = pd.DataFrame(range(len(ts1)), index=ts1)
idx = pd.date_range(start=min(ts1), end=max(ts1), freq='D')
>>> dfg1.reindex(index=idx).ffill()
0
1995-05-26 0
1995-05-27 0
1995-05-28 0
1995-05-29 0
1995-05-30 1
1995-05-31 2
1995-06-01 3
1995-06-02 4
1995-06-03 4
1995-06-04 4
1995-06-05 5
1995-06-06 6
1995-06-07 6
1995-06-08 7
1995-06-09 8
1995-06-10 8
1995-06-11 8
1995-06-12 9
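If the target index is not a plain daily range but an arbitrary index that contains ts1 plus extra timestamps in between (other_index below is a hypothetical placeholder for it), the same reindex-then-ffill pattern applies:
# other_index: a hypothetical, sorted DatetimeIndex containing ts1 plus extra
# timestamps; values at the extra timestamps are forward filled from ts1.
stepwise = dfg1.reindex(other_index).ffill()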
