Pandas select values from each hour for each ID - python

I have a dataframe with some IDs, and for each ID some values and timestamps (roughly one value every 5 minutes, for 5 to 7 days in a row). I would like to select, for each hour and each ID, the mean, median and variance of the values in that hour, and store them in separate columns, like in the following result:
hour  mean  var  median    ID
0     2     4    4       1234
1     4     5    3       1234
...
23    2     2    3       1234
My columns are:
ID int64
Value float64
Date datetime64[ns]
dtype: object
My timestamps are in the following type:
%Y-%m-%d %H:%M:%S.%f
How do I create the final dataframe for each ID? Thank you very much
Edit:
With the following line I created a column correctly with the hours:
df['hour'] = df.Date.dt.hour
Now the problem is that I have a very long column of repeated hour values, and if I use resample like this:
df = df.set_index('Date').resample('60T').mean().reset_index()
it automatically replaces the Value column with the hourly means and drops everything else. I would like to keep those columns, so that I can create separate columns for the mean, variance and median based on the Value column. How can I do that part?

Try this:
# Extract the hour from the Date column
h = df['Date'].dt.hour.rename('Hour')
# Group by ID and Hour
df.groupby(['ID', h]).agg({
    'Value': ['mean', 'var', 'median']
})
You can replace the h series with pd.Grouper. By default pd.Grouper groups on the index; set the key parameter so that it targets another column. Note that pd.Grouper with freq='1H' bins each calendar hour separately, while dt.hour pools the same hour of day across all days:
df.groupby([pd.Grouper(freq='1H', key='Date'), 'ID']).agg({
    'Value': ['mean', 'var', 'median']
})
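For reference, here is a minimal runnable sketch of the dt.hour approach on made-up data (the column names ID/Value/Date follow the question; the values themselves are invented):
import pandas as pd
import numpy as np

# Synthetic data: one ID, one reading every 5 minutes for two hours
df = pd.DataFrame({
    'ID': 1234,
    'Value': np.arange(24, dtype=float),
    'Date': pd.date_range('2021-01-01', periods=24, freq='5min'),
})

h = df['Date'].dt.hour.rename('hour')
out = df.groupby(['ID', h])['Value'].agg(['mean', 'var', 'median']).reset_index()
print(out)
This produces one row per (ID, hour) with mean, var and median columns, the layout asked for in the question.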

Related

python pandas: Can you perform multiple operations in a groupby?

Suppose I have the following DataFrame:
df = pd.DataFrame(
    {
        'year': [2015, 2015, 2018, 2018, 2020],
        'total': [100, 200, 50, 150, 400],
        'tax': [10, 20, 5, 15, 40]
    }
)
I want to sum up the total and tax columns by year and obtain the size at the same time.
The following code gives me the sum of the two columns:
df_total_tax = df.groupby('year', as_index=False)
df_total_tax = df_total_tax[['total','tax']].apply(np.sum)
However, I can't figure out how to also include a column for size at the same time. Must I perform a different groupby, then use .size() and then append that column to df_total_tax? Or is there an easier way?
The end result would look like this:
   year  total  tax  size
0  2015    300   30     2
1  2018    200   20     2
2  2020    400   40     1
Thanks
You can specify an aggregate function for each column separately using named aggregation (available since pandas 0.25):
df = df.groupby('year', as_index=False).agg(total=('total', 'sum'),
                                            tax=('tax', 'sum'),
                                            size=('tax', 'size'))
print (df)
   year  total  tax  size
0  2015    300   30     2
1  2018    200   20     2
2  2020    400   40     1
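If you are stuck on a pandas version older than 0.25 without named aggregation, a dict-based agg plus flattening the resulting MultiIndex columns gives the same table (a sketch, not part of the original answer):
import pandas as pd

df = pd.DataFrame({
    'year': [2015, 2015, 2018, 2018, 2020],
    'total': [100, 200, 50, 150, 400],
    'tax': [10, 20, 5, 15, 40],
})

out = df.groupby('year').agg({'total': 'sum', 'tax': ['sum', 'size']})
out.columns = ['total', 'tax', 'size']  # flatten the MultiIndex columns
print(out.reset_index())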

How to create a new column in a dataframe with the value of another column in that dataframe, but within 1 hour

I have a pandas dataframe, with columns:
ID
DateTime
Capacity
I want to add a new column to this dataframe with the capacity one hour later.
For example, I have
ID Datetime Capacity
1 20200101 12:23:10 435
So in the new column, I want the Capacity of the record of ID=1 and Datetime=13:23:10
Is there a statement to handle this?
Thanks in advance!
You can add 1 hour to the original column, convert to HH:MM:SS strings, and then filter with boolean indexing:
Datetime = '13:23:10'
ID = 1
df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df[(df['Datetime'] + pd.Timedelta('1H')).dt.strftime('%H:%M:%S').eq(Datetime) & df['ID'].eq(ID)]
print (df)
  ID            Datetime  Capacity
0  1 2020-01-01 12:23:10       435
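Note that this only filters out the matching row. To actually add the one-hour-ahead capacity as a new column, one option is a self-merge on ID and the shifted timestamp; this is a sketch on made-up data (pd.merge_asof with a tolerance would be more robust if the timestamps are not exactly one hour apart):
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1],
    'Datetime': pd.to_datetime(['2020-01-01 12:23:10', '2020-01-01 13:23:10']),
    'Capacity': [435, 500],
})

# Shift a lookup copy back one hour so each row lines up with the row
# exactly one hour earlier, then merge the capacity across.
lookup = df.rename(columns={'Capacity': 'Capacity_1h'})
lookup['Datetime'] = lookup['Datetime'] - pd.Timedelta('1H')

out = df.merge(lookup, on=['ID', 'Datetime'], how='left')
print(out)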

Get rows with min date associated with every ID

I have a pandas dataframe with multiple IDs and, among other columns, one date column, say 'date1'. I want to get, for each ID, the row with the minimum date; the other column values should also be retained.
What I have:
ID date1 value
1 1/1/2013 a
1 4/1/2013 a
1 8/3/2014 b
2 11/4/2013 a
2 19/5/2016 b
2 8/4/2017 b
The output I want:
ID date1 value
1 1/1/2013 a
2 11/4/2013 a
Thank you
Convert to datetime:
df = df.assign(date1 = pd.to_datetime(df.date1))
Get the label index of the minimum and subset:
df.loc[df.groupby("ID").date1.idxmin()]
ID date1 value
0 1 2013-01-01 a
3 2 2013-11-04 a
Assuming you have IDs in ID and dates in DATE:
df.groupby('ID')['DATE'].min()
This groups by ID and selects the minimum date in each group, returning a Series; if you want a DataFrame, call .reset_index() on the result.
If you instead want to select only the minimum rows, I would take that output as keys and join back with new_df.join(old_df.set_index(['ID', 'DATE']), on=['ID', 'DATE']) rather than dealing with index-based shenanigans.
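For concreteness, a runnable sketch of that join approach (the column names ID/DATE/value are assumed; data adapted from the question):
import pandas as pd

old_df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2],
    'DATE': pd.to_datetime(['2013-01-01', '2013-04-01', '2014-08-03',
                            '2013-11-04', '2016-05-19', '2017-04-08']),
    'value': ['a', 'a', 'b', 'a', 'b', 'b'],
})

# Minimum date per ID, then join back to recover the other columns
new_df = old_df.groupby('ID', as_index=False)['DATE'].min()
result = new_df.join(old_df.set_index(['ID', 'DATE']), on=['ID', 'DATE'])
print(result)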

Return a pandas series of date time in chronological order by the original series' indices

I compiled a pandas series of date time like the following (the below shows part of the series as an example):
0 2002-02-03
1 1979-01-01
2 2006-12-25
3 2008-07-16
4 2005-05-30
Note: the dtype of each cell is 'pandas._libs.tslib.Timestamp'
For the above example, I would like to rank them in chronological order and return a series indexed by the original series' indices, like this (the second column):
0 1
1 0
2 3
3 4
4 2
I've tried using a mix of .order(), .sort(), and .index() to achieve this but to no avail so far. What will be the easiest way to do get a series of date time in chronological order by the original series' indices?
Thank you.
You can use Series.rank, subtract 1 and cast to int:
a = df['date'].rank(method='dense').sub(1).astype(int)
print (a)
0 1
1 0
2 3
3 4
4 2
Name: date, dtype: int32
Parameter method in Series.rank:
method : {'average', 'min', 'max', 'first', 'dense'}
average: average rank of group
min: lowest rank in group
max: highest rank in group
first: ranks assigned in order they appear in the array
dense: like ‘min’, but rank always increases by 1 between groups
Alternatively, convert your date time series from tslib.Timestamp with to_datetime() or to_pydatetime(), then create a column holding the original index:
df['org_ind'] = np.arange(len(df))
And then do:
df.sort_values(by='date', ascending=True)
You will get your dates in chronological order alongside the original index...
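A quick runnable check of this on the question's data: sort_values keeps the original index, so the left column of the sorted output is the original position.
import pandas as pd

s = pd.to_datetime(pd.Series(
    ['2002-02-03', '1979-01-01', '2006-12-25', '2008-07-16', '2005-05-30']))
print(s.sort_values())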

How to sum values of one column and group them by another column

I have a data frame like this:
df
time type qty
12:00 A 5
13:00 A 1
14:00 B 9
I need to sum the values of qty grouped by type. This is how I tried it, but it doesn't work because I don't know how to add up qty:
keys = df['type'].unique()
summary = pd.DataFrame()
for k in keys:
    summary[k] = df[df['type']==k].sum()
GroupBy has a sum method:
In [11]: df.groupby("type").sum()
Out[11]:
qty
type
A 6
B 9
see the groupby docs.
To make sure you are summing up the column you want to:
df.groupby(by=['type'])['qty'].sum()
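As a quick runnable check with the question's data (as_index=False keeps type as a regular column instead of the index):
import pandas as pd

df = pd.DataFrame({
    'time': ['12:00', '13:00', '14:00'],
    'type': ['A', 'A', 'B'],
    'qty': [5, 1, 9],
})
print(df.groupby('type', as_index=False)['qty'].sum())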
