Summing up values from one column based on values in another column - python

I have a dataframe something like below,
Timestamp           count
20180702-06:26:20   50
20180702-06:27:11   10
20180702-07:05:10   20
20180702-07:10:10   30
20180702-08:27:11   40
I want output something like below,
Timestamp     Sum_of_count
20180702-06   60
20180702-07   50
20180702-08   40
Basically, I need to find sum of count for every hour.
Any help is really appreciated.

You need to separate the hour part of the value somehow - one way is to split on ':' and select the first piece with str[0], and then aggregate with sum:
s = df['Timestamp'].str.split(':', n=1).str[0]
df1 = df['count'].groupby(s).sum().reset_index(name='Sum_of_count')
Or convert the values to datetimes with to_datetime and format the hour with strftime:
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%Y%m%d-%H:%M:%S')
s = df['Timestamp'].dt.strftime('%Y%m%d-%H')
df1 = df['count'].groupby(s).sum().reset_index(name='Sum_of_count')
print(df1)
     Timestamp  Sum_of_count
0  20180702-06            60
1  20180702-07            50
2  20180702-08            40
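If the Timestamp column has already been converted to a datetime, one more variant (my own addition, not from either answer) is to truncate each timestamp to the start of its hour with dt.floor and group on that; a minimal sketch:
import pandas as pd

df = pd.DataFrame({
    'Timestamp': ['20180702-06:26:20', '20180702-06:27:11', '20180702-07:05:10',
                  '20180702-07:10:10', '20180702-08:27:11'],
    'count': [50, 10, 20, 30, 40],
})
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%Y%m%d-%H:%M:%S')

# Truncate every timestamp to the start of its hour, then sum the counts per hour.
# Note the group labels are full datetimes (2018-07-02 06:00:00) rather than '20180702-06' strings.
hourly = (df.groupby(df['Timestamp'].dt.floor('h'))['count']
            .sum()
            .reset_index(name='Sum_of_count'))
print(hourly)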

Use
In [252]: df.groupby(df.Timestamp.dt.strftime('%Y-%m-%d-%H'))['count'].sum()
Out[252]:
Timestamp
2018-07-02-06 60
2018-07-02-07 50
2018-07-02-08 40
Name: count, dtype: int64
In [254]: (df.groupby(df.Timestamp.dt.strftime('%Y-%m-%d-%H'))['count'].sum()
     ...:    .reset_index(name='Sum_of_count'))
Out[254]:
Timestamp Sum_of_count
0 2018-07-02-06 60
1 2018-07-02-07 50
2 2018-07-02-08 40


Use a method of a value as the condition of numpy.where

let's say you have this data frame:
df = pd.DataFrame(data=['2014-04-07 10:55:35.087000+00:00',
                        '2014-04-07 13:59:37.251500+00:00',
                        '2014-04-02 13:23:59.629000+00:00',
                        '2014-04-07 12:17:48.182000+00:00',
                        '2014-04-06 17:00:23.912000+00:00'],
                  columns=['timestamp'],
                  dtype=np.datetime64)
and you want to create a new column where the values are 1 if the timestamp is a weekday or 0 if it is not. Then I would run something like this:
df['weekday'] = df['timestamp'].apply(lambda x: 1 if x.weekday() < 5 else 0 )
So far so good. However, in my case I have about 10 million rows of such timestamp values and it just takes forever to run. So, I looked around for vectorization options and I found numpy.where(). But, of course, this does not work: np.where(df['timestamp'].weekday() < 5, 1, 0)
So, is there a way to access the .weekday() method of the timestamps when using numpy.where or is there any other way to produce the weekday column when having 10 million rows? Thanks.
Use Series.dt.dayofweek / Series.dt.weekday with Series.lt and Series.astype:
df['weekday'] = df['timestamp'].dt.dayofweek.lt(5).astype(int)
print(df)
                   timestamp  weekday
0 2014-04-07 10:55:35.087000        1
1 2014-04-07 13:59:37.251500        1
2 2014-04-02 13:23:59.629000        1
3 2014-04-07 12:17:48.182000        1
4 2014-04-06 17:00:23.912000        0
I recommend you read: When should I (not) want to use pandas apply() in my code?
We could also use np.where:
df['weekday'] = np.where(df['timestamp'].dt.dayofweek.lt(5), 1, 0)
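To see why this matters at the scale mentioned in the question, here is a rough, illustrative timing sketch (the one-million-row frame and the random data are my own assumptions, not taken from the answer):
import time
import numpy as np
import pandas as pd

# One million random timestamps spread over a year (illustrative data only).
n = 1_000_000
ts = pd.to_datetime('2014-01-01') + pd.to_timedelta(np.random.randint(0, 365 * 24 * 3600, size=n), unit='s')
df = pd.DataFrame({'timestamp': ts})

start = time.perf_counter()
slow = df['timestamp'].apply(lambda x: 1 if x.weekday() < 5 else 0)
print('apply      :', time.perf_counter() - start)

start = time.perf_counter()
fast = df['timestamp'].dt.dayofweek.lt(5).astype(int)
print('vectorized :', time.perf_counter() - start)

assert (slow == fast).all()   # both approaches agree; the vectorized one is typically far faster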

Iterate over dates pandas

I have a pandas dataframe:
   CITY  DT        Error%
1  A     1/1/2020  0.03436722
2  A     1/2/2020  0.03190177
3  B     1/9/2020  0.040218757
4  B     1/8/2020  0.098921665
I want to go through the dataframe and check whether a DT and the DT one week later both have an Error% of less than 0.05.
I want the result to be these dataframe rows:
2  A  1/2/2020  0.03190177
3  B  1/9/2020  0.040218757
IIUC,
df['DT'] = pd.to_datetime(df['DT'])
idx = df[df['DT'].sub(df['DT'].shift()).gt('6 days')].index.tolist()
indices = []
for i in idx:
    indices.append(i-1)
    indices.append(i)
print(df.loc[df['Error%'] <= 0.05].loc[indices])
  CITY         DT    Error%
2    A 2020-01-02  0.031902
3    B 2020-01-09  0.040219
Not particularly elegant, but it gets the job done and maybe some of the professionals here can improve on it:
First, merge the information for each day with the information for the day a week later by performing a self-join on the time-shifted DT column. We can use an inner join since we're only interested in rows that have an entry for the following week:
tmp = df.set_index(df.DT.apply(lambda x: x + pd.Timedelta('7 days'))) \
        .join(df.set_index('DT'), lsuffix='_L', how='inner')
Then select the date column for those entries where both error margins are satisfied:
tmp = tmp.DT.loc[(tmp['Error%_L'] < 0.05) & (tmp['Error%'] < 0.05)]
tmp is now a pd.Series with the shifted dates (the later week) in its index and the first week's dates as its values. Since both dates are wanted in the output, compile the "index dates" by taking the unique values among all of them:
idx = list(set(tmp.tolist() + tmp.index.tolist()))
And finally, grab the corresponding rows from the original dataframe:
df.set_index('DT').loc[idx].reset_index()
This, however, loses the original row numbers. If those are needed, you'll have to save the index to a column first and set the index back to that column after selecting the relevant rows.
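Putting the pieces together, here is a minimal sketch of that approach which also preserves the original row labels (the row column name and the inline rebuild of the sample frame are my own, for illustration only):
import pandas as pd

df = pd.DataFrame({
    'CITY': ['A', 'A', 'B', 'B'],
    'DT': pd.to_datetime(['1/1/2020', '1/2/2020', '1/9/2020', '1/8/2020']),
    'Error%': [0.03436722, 0.03190177, 0.040218757, 0.098921665],
}, index=[1, 2, 3, 4])

orig = df.reset_index().rename(columns={'index': 'row'})   # keep the original row label as a column

# Self-join: each row paired with the row whose DT is exactly 7 days later.
tmp = (orig.set_index(orig.DT + pd.Timedelta('7 days'))
           .join(orig.set_index('DT'), lsuffix='_L', how='inner'))

# Keep only the pairs where both weeks satisfy the error threshold.
tmp = tmp.loc[(tmp['Error%_L'] < 0.05) & (tmp['Error%'] < 0.05)]

# Collect the row labels of both members of every qualifying pair.
rows = sorted(set(tmp['row_L']).union(tmp['row']))
print(orig.set_index('row').loc[rows])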

Pandas: Calculate remaining time in grouping

I have a requirement to sort a table by date, starting from the oldest. The total field is created by grouping the name and kind fields and applying a sum. Now, for each row, I need to calculate the remaining time within the same name-kind grouping.
The csv looks like this:
date      name  kind  duration  total  remaining
1-1-2017  a     1     10        100    ? should be 90
2-1-2017  b     1     5         35     ? should be 30
3-1-2017  a     2     3         50     ? should be 47
4-1-2017  b     2     1         25     ? should be 24
5-1-2017  a     1     8         100    ? should be 82
6-1-2017  b     1     2         35     ? should be 33
7-1-2017  a     2     3         50     ? should be 44
8-1-2017  b     2     6         25     ? should be 18
...
My question is how do I calculate the remaining value while having the DataFrame grouped by name and kind?
My initial approach was to shift the column and add the values from duration to each other, like this:
df['temp'] = df.groupby(['name', 'kind'])['duration'].apply(lambda x: x.shift() + x)
and then:
df['duration'] = df.apply(lambda x: x['total'] - x['temp'], axis=1)
But it did not work as expected.
Is there a clean way to do it, or is using iloc, ix, or loc somehow the way to go?
Thanks.
You could do something like:
df["cumsum"] = df.groupby(['name', 'kind'])["duration"].cumsum()
df["remaining"] = df["total"] - df["cumsum"]
Be careful with resetting the index, though.
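A quick check of that against the sample data (a sketch; the frame below is just the question's table retyped, and the rows are already in date order):
import pandas as pd

df = pd.DataFrame({
    'date': ['1-1-2017', '2-1-2017', '3-1-2017', '4-1-2017',
             '5-1-2017', '6-1-2017', '7-1-2017', '8-1-2017'],
    'name': list('abababab'),
    'kind': [1, 1, 2, 2, 1, 1, 2, 2],
    'duration': [10, 5, 3, 1, 8, 2, 3, 6],
    'total': [100, 35, 50, 25, 100, 35, 50, 25],
})

# Running sum of duration within each (name, kind) group, taken in date order.
df['cumsum'] = df.groupby(['name', 'kind'])['duration'].cumsum()
df['remaining'] = df['total'] - df['cumsum']
print(df[['date', 'name', 'kind', 'duration', 'total', 'remaining']])
# remaining comes out as 90, 30, 47, 24, 82, 28, 44, 18 - this matches the expected column
# except the 6-1-2017 row, where 35 - 5 - 2 = 28 rather than the 33 shown in the question.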

Remove row whose timestamp is in previous sliding window via Pandas in Python

Here's a problem for cleansing my data. The dataframe looks as below:
timestamp
0 10
1 12
2 23
3 25
4 27
5 34
6 45
What I intend to do is iterate through the timestamps from top to bottom, taking one if no previous timestamp has been taken (for initialization it takes '10'), and then omitting every row whose timestamp falls within [10, 10+10], such as '12'. Likewise, '23' should be taken next and '25', '27' omitted, since they fall within [23, 23+10]. Finally, '34' and '45' should be taken as well.
Eventually, the result would be
timestamp
0 10
2 23
5 34
6 45
Could anyone give me an idea of how to do this in Pandas? Many thanks!
I don't believe there is a way to solve this custom problem using a groupby-like construct, but here is a coding solution that gives you the index locations and timestamp values.
stamps = [df.timestamp.iat[0]]
index = [df.index[0]]
for idx, ts in df.timestamp.items():
    if ts >= stamps[-1] + 10:
        index.append(idx)
        stamps.append(ts)
>>> index
[0, 2, 5, 6]
>>> stamps
[10, 23, 34, 45]
>>> df.iloc[index]
timestamp
0 10
2 23
5 34
6 45
I am not sure if I understood the initialization correctly, but see if this helps you:
df = pd.read_csv("data.csv")
gap = 10
actual = 0
for timestamp in df['timestamp']:
    if timestamp >= (actual + gap):
        print(timestamp)
        actual = timestamp
If you want to create a new DataFrame:
df = pd.read_csv("data.csv")
gap = 10
actual = 0
index = []
for i, timestamp in enumerate(df['timestamp']):
    if timestamp >= (actual + gap):
        actual = timestamp
    else:
        index.append(i)
new_df = df.drop(df.index[index])
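Both answers are plain loops over the column; for reuse, the same greedy rule can be wrapped in a small helper (my own sketch, and the function name is made up):
import numpy as np
import pandas as pd

def drop_in_window(ts, window=10):
    # Keep a timestamp only if it lies outside the window opened by the last kept timestamp.
    values = ts.to_numpy()
    keep = np.zeros(len(values), dtype=bool)
    last = None
    for i, value in enumerate(values):
        if last is None or value >= last + window:
            keep[i] = True
            last = value
    return ts[keep]

df = pd.DataFrame({'timestamp': [10, 12, 23, 25, 27, 34, 45]})
print(drop_in_window(df['timestamp']))   # keeps 10, 23, 34, 45 with their original index labels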

Aggregating unbalanced panel to time series using pandas

I have an unbalanced panel that I'm trying to aggregate up to a regular, weekly time series. The panel looks as follows:
Group Date value
A 1/1/2000 5
A 1/17/2000 10
B 1/9/2000 3
B 1/23/2000 7
C 1/22/2000 20
To give a better sense of what I'm looking for, I'm including an intermediate step, which I'd love to skip if possible. Basically some data needs to be filled in so that it can be aggregated. As you can see, missing weeks in between observations are interpolated. All other values are set equal to zero.
Group Date value
A 1/1/2000 5
A 1/8/2000 5
A 1/15/2000 10
A 1/22/2000 0
B 1/1/2000 0
B 1/8/2000 3
B 1/15/2000 3
B 1/22/2000 7
C 1/1/2000 0
C 1/8/2000 0
C 1/15/2000 0
C 1/22/2000 20
The final result that I'm looking for is as follows:
Date value
1/1/2000 5 = 5 + 0 + 0
1/8/2000 8 = 5 + 3 + 0
1/15/2000 13 = 10 + 3 + 0
1/22/2000 27 = 0 + 7 + 20
I haven't gotten very far; I managed to create a panel:
panel = df.set_index(['Group','week']).to_panel()
Unfortunately, if I try to resample, I get an error
panel.resample('W')
TypeError: Only valid with DatetimeIndex or PeriodIndex
Assuming df is your second dataframe with weeks, you can try the following:
df.groupby('week').sum()['value']
See the documentation of groupby() for details on how it is applied. It's similar to the GROUP BY clause in SQL.
To obtain the second dataframe from the first one, try the following:
First, prepare a function to map the day of the month to a week number:
def d2w_map(day):
    if day <= 7:
        return 1
    elif day <= 14:
        return 2
    elif day <= 21:
        return 3
    else:
        return 4
In the method above, days from 29 to 31 are considered in week 4. But you get the idea. You can modify it as needed.
Second, take the day column out of the first dataframe and convert days to weeks:
df['Week'] = df['Day'].apply(d2w_map)
del df['Day']
Third, initialize your second dataframe with only the 'Group' and 'Week' columns, leaving 'value' out. Assuming your initialized new dataframe is result, you can now do a join:
result = result.join(df, on=['Group', 'Week'])
Last, write a function to fill the NaN values in the 'value' column with the nearby elements. The NaNs are what you need to interpolate. Since I am not sure how you want the interpolation to work, I will leave that to you.
Here is how you can change d2w_map to convert a date string into an integer day of the week:
from datetime import datetime

def d2w_map(day_str):
    return datetime.strptime(day_str, '%m/%d/%Y').weekday()
A returned value of 0 means Monday, 1 means Tuesday, and so on.
If you have the package dateutil installed, the function can be more robust:
from dateutil.parser import parse

def d2w_map(day_str):
    return parse(day_str).weekday()
Sometimes, things you want are already implemented by magic :)
Turns out the key is to resample a groupby object like so:
df_temp = (df.set_index('Date')
             .groupby('Group')
             .resample('W', how='sum', fill_method='ffill'))
ts = (df_temp.reset_index()
             .groupby('Date')
             .sum()['value'])
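The how=/fill_method= keywords have since been removed from resample; on current pandas, roughly the same idea looks like the sketch below (my own translation, and the weekly bin edges may not line up exactly with the hand-built intermediate table):
import pandas as pd

df = pd.DataFrame({
    'Group': ['A', 'A', 'B', 'B', 'C'],
    'Date': pd.to_datetime(['1/1/2000', '1/17/2000', '1/9/2000', '1/23/2000', '1/22/2000']),
    'value': [5, 10, 3, 7, 20],
})

# Weekly sum per group; min_count=1 leaves empty weeks as NaN so they can be forward-filled.
weekly = (df.set_index('Date')
            .groupby('Group')['value']
            .resample('W')
            .sum(min_count=1))
weekly = weekly.groupby(level='Group').ffill()

# Collapse the groups into a single weekly series (NaN counts as 0 in the sum).
ts = weekly.groupby(level='Date').sum()
print(ts)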
Used this tab delimited test.txt:
Group Date value
A 1/1/2000 5
A 1/17/2000 10
B 1/9/2000 3
B 1/23/2000 7
C 1/22/2000 20
You can skip the intermediate dataframe as follows. I don't have time right now, so just play around with it to get it right.
import pandas as pd
import datetime

time_format = '%m/%d/%Y'
Y = pd.read_csv('test.txt', sep="\t")
dates = Y['Date']
dates_right_format = [datetime.datetime.strptime(s, time_format) for s in dates]
values = Y['value']
X = pd.DataFrame(values)
X.index = dates_right_format
print(X)
X = X.sort_index()
print(X)
print(X.resample('W', closed='right', label='right').sum(min_count=1))
The last print gives:
            value
2000-01-02      5
2000-01-09      3
2000-01-16    NaN
2000-01-23     37
