I have time series data with column that calculates 2-day sums. I want to get the last value in each 2 day period and write it in a column, by user id.
The data looks like (with desired output column 'new'):
df
timestamp uid cols new
2020-10-10 00:00 1 10
2020-10-10 00:00 2 5
2020-10-10 00:10 1 20
2020-10-10 00:10 2 20
2020-10-10 00:20 1 40
....
2020-10-11 23:50 1 3400
2020-10-11 23:50 2 5250
2020-10-12 00:00 1 20 3400
2020-10-12 00:00 2 15 5250
How can I achieve this?
My two dataframes:
wetter
Out[223]:
level_0 index TEMPERATURE:TOTAL SLP HOUR Time
0 0 2018-01-01 00:00:00 9.8 NaN 00 00:00
1 1 2018-01-01 01:00:00 9.8 NaN 01 01:00
2 2 2018-01-01 02:00:00 9.2 NaN 02 02:00
3 3 2018-01-01 03:00:00 8.4 NaN 03 03:00
4 4 2018-01-01 04:00:00 8.5 NaN 04 04:00
... ... ... ... ... ...
49034 49034 2018-12-31 22:40:00 8.5 NaN 22 22:40
49035 49035 2018-12-31 22:45:00 8.4 NaN 22 22:45
49036 49036 2018-12-31 22:50:00 8.4 NaN 22 22:50
49037 49037 2018-12-31 22:55:00 8.4 NaN 22 22:55
49038 49038 2018-12-31 23:00:00 8.4 NaN 23 23:00
[49039 rows x 6 columns]
df
Out[224]:
0 Time -14 -13 ... 17 18 NaN
1 00:00 1,256326635 1,218256131 ... 0,080348715 0,040194189 00:15
2 00:15 1,256564788 1,218487067 ... 0,080254367 0,039517006 00:30
3 00:30 1,260350982 1,222158528 ... 0,080219518 0,039054261 00:45
4 00:45 1,259306606 1,221145800 ... 0,080758578 0,039176953 01:00
5 01:00 1,258521518 1,220384502 ... 0,080444585 0,038164953 01:15
.. ... ... ... ... ... ... ...
92 22:45 1,253545107 1,215558891 ... 0,080164570 0,042697436 23:00
93 23:00 1,241253483 1,203639741 ... 0,078395829 0,039685235 23:15
94 23:15 1,242890274 1,205226933 ... 0,078801415 0,039170364 23:30
95 23:30 1,240459118 1,202869448 ... 0,079511294 0,039013684 23:45
96 23:45 1,236228281 1,198766818 ... 0,079186806 0,037570494 00:00
[96 rows x 35 columns]
I want to fill the SLP column of wetter based on TEMPERATURE:TOTAL and Time.
For this I want to look at the df dataframe and fill SLP depending on the columns of df, where the headers are temperatures.
So for the first TEMPERATURE:TOTAL of 9.8 at 00:00, SLP should be filled with the value of the column that is simply called 9 in row 00:00 of Time
I have tried to do this, which is why i also created the Time columns but I am stuck. I thought of some nested loops but knowing a bit of pandas I guess there is probably a two-liner solution for this?
Here is one way!
import numpy as np
import pandas as pd
This is me simulating your dataframes (you are free to skip this step) - next time please provide them.
wetter = pd.DataFrame()
df = pd.DataFrame()
wetter['TEMPERATURE:TOTAL'] = np.random.rand(10) * 10
wetter['SLP'] = np.nan * 10
wetter['Time'] = pd.date_range("00:00", periods=10, freq="H")
df['Time'] = pd.date_range("00:00", periods=10, freq="15T")
for i in range(-14, 18):
df[i] = np.random.rand(10)
Preprocess:
wetter['temp'] = np.floor(wetter['TEMPERATURE:TOTAL'])
wetter = wetter.astype({'temp': 'int'})
wetter.set_index('Time')
df.set_index('Time')
df = df.reset_index()
value_vars_ = list(range(-14, 18))
df_long = pd.melt(df, id_vars='Time', value_vars=value_vars_, var_name='temp', value_name="SLP")
Left-join two dataframes on Time and temp:
final = pd.merge(wetter.drop('SLP', axis=1), df_long, how="left", on=["Time", "temp"])
I have a Pandas dataframe that describes arrivals at stations. It has two columns: time and station id.
Example:
time id
0 2019-10-31 23:59:36 22
1 2019-10-31 23:58:23 260
2 2019-10-31 23:54:55 82
3 2019-10-31 23:54:46 82
4 2019-10-31 23:54:42 21
I would like to resample this into five minute blocks, which shows the number of arrivals at the station in the time-block that starts at the time, so it should look like this:
time id arrivals
0 2019-10-31 23:55:00 22 1
1 2019-10-31 23:50:00 22 5
2 2019-10-31 23:55:00 82 0
3 2019-10-31 23:25:00 82 325
4 2019-10-31 23:21:00 21 1
How could I use some high performance function to achieve this?
pandas.DataFrame.resample does not seem to be a possibility, since it requires the index to be a timestamp, and in this case several rows can have the same time.
df.groupby(['id',pd.Grouper(key='time', freq='5min')])\
.size()\
.to_frame('arrivals')\
.reset_index()
I think it's a horrible solution (couldn't find a better one at the moment), but it more or less gets you where you want:
df.groupby("id").resample("5min", on="time").count()[["id"]].swaplevel(0, 1, axis=0).sort_index(axis=0).set_axis(["arrivals"], axis=1)
Try with groupby and resample:
>>> df.set_index("time").groupby("id").resample("5min").count()
id
id time
21 2019-10-31 23:50:00 1
22 2019-10-31 23:55:00 1
82 2019-10-31 23:50:00 2
260 2019-10-31 23:55:00 1
First of all thank you for your help.
I have two dataframes row indexed by date (DD-MM-YYYY HH:MM) as follows:
DF1
date temp wind
0 31-12-2002 23:00 12.3 80
1 01-01-2004 00:00 15.2 NAN
2 01-01-2004 01:00 18.4 NAN
........
DF2
date temp wind
0 31-12-2002 23:00 14.5 86
1 01-01-2003 00:00 28.7 98
2 01-01-2003 01:00 26.7 88
........
n 01-01-2004 00:00 34.5 23
m 01-01-2004 01:00 35.7 NAN
MergedDF
date temp wind
0 31-12-2002 23:00 12.3 80
1 01-01-2003 00:00 28.7 98
2 01-01-2003 01:00 26.7 88
........
n 01-01-2004 00:00 15.2 23
m 01-01-2004 01:00 18.4 NAN
In DF1 there's one whole year (2003) missing and also some NAN values in the rest of the years.
Basically I want to merge both dataframes, adding the year missing and replacing NAN values if this information is in DF2.
Someone could help me? I don't know very well how to implement this on python/pandas.
MergedDF = df1.append(df2).groupby('date', as_index=False).first()
as_index=False option of group_by is useful to keep the same table index in the aggregated output.
.first() will keep the first non-null value for each date.
I'm working with the following dataset with hourly counts in columns. The dataframe has more than 1400 columns and 100 rows.
My dataset looks like this:
CITY 2019-10-01 00:00 2019-10-01 01:00 2019-10-01 02:00 .... 2019-12-01 12:00
Wien 15 16 16 .... 14
Graz 11 11 11 .... 10
Innsbruck 12 12 10 .... 12
....
How can I convert this datatime to datetime such as this:
CITY 2019-10-01 2019-10-02 2019-10-03 .... 2019-12-01
(or 1 day) (or 2 day) (or 3 day) (or 72 day)
Wien 14 15 16 .... 12
Graz 13 12 14 .... 10
Innsbruck 13 12 12 .... 12
....
I would like the average of all hours of the day to be in the column of the one day.
The data type is:
type(df.columns[0])
out: str
type(df.columns[1])
out: pandas._libs.tslibs.timestamps.Timestamp
Thanks for your help!
I would do something like this:
days = df.columns[1:].to_series().dt.normalize()
df.set_index('CITY').groupby(days, axis=1).mean()
Output:
2019-10-01 2019-12-01
CITY
Wien 15.666667 14.0
Salzburg 12.000000 14.0
Graz 11.000000 10.0
Innsbruck 11.333333 12.0