How to convert time durations to numeric in polars? - python

Is there any built-in function in polars or a better way to convert time durations to numeric by defining the time resolution (e.g.: days, hours, minutes)?
import polars as pl

# Create a dataframe
df = pl.DataFrame(
    {
        "from": ["2023-01-01", "2023-01-02", "2023-01-03"],
        "to": ["2023-01-04", "2023-01-05", "2023-01-06"],
    }
)
# Convert to date and calculate the time difference
df = df.with_columns(
    [
        pl.col("from").str.strptime(pl.Date, "%Y-%m-%d").alias("from_date"),
        pl.col("to").str.strptime(pl.Date, "%Y-%m-%d").alias("to_date"),
    ]
).with_columns((pl.col("to_date") - pl.col("from_date")).alias("time_diff"))
# Convert the time difference to int (in days)
df = df.with_columns(
    ((pl.col("time_diff") / (24 * 60 * 60 * 1000)).cast(pl.Int8)).alias("time_diff_int")
)

The .dt accessor lets you obtain individual components; is that what you're looking for?
df["time_diff"].dt.days()
Series: 'time_diff' [i64]
[
3
3
3
]
df["time_diff"].dt.hours()
Series: 'time_diff' [i64]
[
72
72
72
]
df["time_diff"].dt.minutes()
Series: 'time_diff' [i64]
[
4320
4320
4320
]
docs: API reference, series/timeseries
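Note that newer Polars releases renamed these duration accessors to .dt.total_days(), .dt.total_hours() and .dt.total_minutes(). Below is a minimal expression-only sketch of the same computation, assuming one of those newer versions (the column names and the out variable are mine):
import polars as pl

df = pl.DataFrame(
    {
        "from": ["2023-01-01", "2023-01-02", "2023-01-03"],
        "to": ["2023-01-04", "2023-01-05", "2023-01-06"],
    }
)

out = (
    df.with_columns(
        pl.col("from").str.strptime(pl.Date, "%Y-%m-%d"),
        pl.col("to").str.strptime(pl.Date, "%Y-%m-%d"),
    )
    .with_columns((pl.col("to") - pl.col("from")).alias("time_diff"))
    .with_columns(
        pl.col("time_diff").dt.total_days().alias("days"),
        pl.col("time_diff").dt.total_hours().alias("hours"),
        pl.col("time_diff").dt.total_minutes().alias("minutes"),
    )
)
print(out)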

Related

How to get temperature measure given location and time values in a dataframe?

I have a pandas dataframe consisting of geo-locations and a time in the past.
geo_time = pd.read_csv(r'geo_time.csv')
print(geo_time)
> +---------+---------+---------+-------------------+
| latitude|longitude| altitude| start|
+---------+---------+---------+-------------------+
| 48.2393| 11.5713| 520|2020-03-12 13:00:00|
+---------+---------+---------+-------------------+
| 35.5426| 139.5975| 5|2020-07-31 18:00:00|
+---------+---------+---------+-------------------+
| 49.2466|-123.2214| 5|2020-06-23 11:00:00|
+---------+---------+---------+-------------------+
...
I want to add the temperatures at these locations and time in a new column from the Meteostat library in Python.
The library has the "Point" class. For a single location, it works like this:
location = Point(40.416775, -3.703790, 660)
You can then use this with the "Hourly" class, which returns a dataframe of different climatic variables. (Normally you pass "start" and "end" to get values for every hour in that range, but passing "start" twice gives you only one row for the desired time.) The output below is just an example of what the dataframe looks like.
data = Hourly(location, start, start).fetch()
print (data)
> temp dwpt rhum prcp ... wpgt pres tsun coco
time ...
2020-01-10 01:00:00 -15.9 -18.8 78.0 0.0 ... NaN 1028.0 NaN 0.0
What I want to do now is to use the values from the dataframe "geo_time" as parameters for the classes to get a temperature for every row. My naive idea was the following:
geo_time['location'] = Point(geo_time['latitude'], geo_time['longitude'], geo_time['altitude'])
data = Hourly(geo_time['location'], geo_time['start'], geo_time['start'])
Afterwards, I would add the "temp" column from "data" to "geo_time".
Does someone have an idea how to solve this problem, or know whether Meteostat is even capable of doing this?
Thanks in advance!
With the dataframe you provided:
import pandas as pd
from meteostat import Hourly, Point

df = pd.DataFrame(
    {
        "latitude": [48.2393, 35.5426, 49.2466],
        "longitude": [11.5713, 139.5975, -123.2214],
        "altitude": [520, 5, 5],
        "start": ["2020-03-12 13:00:00", "2020-07-31 18:00:00", "2020-06-23 11:00:00"],
    }
)
Here is one way to do it with Pandas to_datetime and apply methods:
df["start"] = pd.to_datetime(df["start"], format="%Y-%m-%d %H:%M:%S")
df["temp"] = df.apply(
lambda x: Hourly(
Point(x["latitude"], x["longitude"], x["altitude"]),
x["start"],
x["start"],
)
.fetch()["temp"]
.values[0],
axis=1,
)
Then:
print(df)
# Output
latitude longitude altitude start temp
0 48.2393 11.5713 520 2020-03-12 13:00:00 16.8
1 35.5426 139.5975 5 2020-07-31 18:00:00 24.3
2 49.2466 -123.2214 5 2020-06-23 11:00:00 14.9
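A slightly more defensive variant of the same idea (my own sketch, not part of the answer above): wrap the lookup in a small helper, here called hourly_temp (a name introduced for illustration), so that rows for which Meteostat returns no data yield None instead of raising an IndexError:
import pandas as pd
from meteostat import Hourly, Point

def hourly_temp(lat, lon, alt, when):
    # Fetch the single hourly record covering `when`; return None if
    # Meteostat has no data for that location/time.
    data = Hourly(Point(lat, lon, alt), when, when).fetch()
    return data["temp"].iloc[0] if not data.empty else None

df["start"] = pd.to_datetime(df["start"])
df["temp"] = [
    hourly_temp(r.latitude, r.longitude, r.altitude, r.start)
    for r in df.itertuples(index=False)
]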

Concatenate arrays into a single table using pandas

I have a .csv file. From this file, I group by year so that it gives me as a result the maximum, minimum and average values.
import pandas as pd

DF = pd.read_csv("PJME_hourly.csv")
for i in range(2002, 2019):
    neblina = DF[DF.Datetime.str.contains(str(i))]
    dateframe = neblina.agg({"PJME_MW": ['max', 'min', 'mean']})
    print(i, pd.concat([dateframe], axis=0, sort=False))
Its output is as follows:
2002 PJME_MW
max 55934.000000
min 19247.000000
mean 31565.617106
2003 PJME_MW
max 53737.000000
min 19414.000000
mean 31698.758621
2004 PJME_MW
max 51962.000000
min 19543.000000
mean 32270.434867
I would like to know how I can join it all into a single column (PJME_MW), with each group of operations (max, min, mean) identified by the year it corresponds to.
If you convert the dates with to_datetime(), you can group them using the .dt.year accessor:
df = pd.read_csv('PJME_hourly.csv')
df.Datetime = pd.to_datetime(df.Datetime)
df.groupby(df.Datetime.dt.year).agg(['min', 'max', 'mean'])
Toy example:
df = pd.DataFrame({'Datetime': ['2019-01-01','2019-02-01','2020-01-01','2020-02-01','2021-01-01'], 'PJME_MV': [3,5,30,50,100]})
# Datetime PJME_MV
# 0 2019-01-01 3
# 1 2019-02-01 5
# 2 2020-01-01 30
# 3 2020-02-01 50
# 4 2021-01-01 100
df.Datetime = pd.to_datetime(df.Datetime)
df.groupby(df.Datetime.dt.year).agg(['min', 'max', 'mean'])
# PJME_MV
# min max mean
# Datetime
# 2019 3 5 4
# 2020 30 50 40
# 2021 100 100 100
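If the goal is literally a single PJME_MW column with each statistic labelled by the year it belongs to, here is a small follow-up sketch (using the toy frame above, so the column is PJME_MV here) that reshapes the aggregated result with stack():
agg = df.groupby(df.Datetime.dt.year)['PJME_MV'].agg(['min', 'max', 'mean'])
long = agg.stack().rename('PJME_MV')  # one value per (year, statistic) pair
long.index.names = ['year', 'stat']
For the toy data, long[(2019, 'mean')] is 4.0, long[(2020, 'max')] is 50, and so on.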
The code could be optimized, but to get it working as it is now, change this part of your code:
for i in range(2002, 2019):
    neblina = DF[DF.Datetime.str.contains(str(i))]
    dateframe = neblina.agg({"PJME_MW": ['max', 'min', 'mean']})
    print(i, pd.concat([dateframe], axis=0, sort=False))
Use this instead
aggs = ['max', 'min', 'mean']
df_group = df.groupby('Datetime')['PJME_MW'].agg(aggs).reset_index()
out_columns = ['agg_year', 'PJME_MW']
out = []
for agg in aggs:
    aux = pd.DataFrame(columns=out_columns)
    aux['agg_year'] = agg + '_' + df_group['Datetime']
    aux['PJME_MW'] = df_group[agg]
    out.append(aux)
df_out = pd.concat(out)
Edit: Concatenation form has been changed
Final edit: I didn't understand the whole problem, sorry. You don't need the code after the groupby function.

Python: Working with dataframes of different sizes to create new columns based on datetime conditions

I have 2 dataframes of different sizes in Python. The smaller dataframe has 2 datetime columns, one for the beginning datetime and one for the ending datetime. The other dataframe is bigger (more rows and columns) and it has one datetime column.
df A
Date_hour_beginning Date_hour_end
3/8/2019 18:35 3/8/2019 19:45
4/8/2019 14:22 4/8/2019 14:55
df B
Date_hour compression
3/8/2019 18:37 41
3/8/2019 18:55 47
3/8/2019 19:30 55
3/8/2019 19:51 51
4/8/2019 14:10 53
4/8/2019 14:35 48
4/8/2019 14:51 51
4/8/2019 15:02 58
I want to add to df_A the mean and amplitude of the compression values that fall within each datetime range, to get the following result:
df_A
Date_hour_beginning Date_hour_end mean_compression amplitude
3/8/2019 18:35 3/8/2019 19:45 47.66 14
4/8/2019 14:22 4/8/2019 14:55 49.5 3
I tried np.where and groupby, but I got an error about mismatching dataframe shapes.
Here is my solution. It is kind of a more verbose (and maybe more readable?) version of eva-vw's. eva-vw's answer uses the .apply() method, which is the fastest way of looping over the rows of your dataframe. However, it should only make a significant difference in run time if your df_A has really many rows (which does not seem to be the case here).
for i, row in df_A.iterrows():
    start = row['Date_hour_beginning']
    end = row['Date_hour_end']
    mask = (df_B['Date_hour'] >= start) & (df_B['Date_hour'] <= end)
    compression_values = df_B.loc[mask, 'compression']
    df_A.loc[i, 'avg comp'] = compression_values.mean()
    df_A.loc[i, 'amp comp'] = compression_values.max() - compression_values.min()
For completeness, here is how I created the dataframes:
import numpy as np
import pandas as pd

columns = ['Date_hour_beginning', 'Date_hour_end']
times_1 = pd.to_datetime(['3/8/2019 18:35', '3/8/2019 19:45'])
times_2 = pd.to_datetime(['4/8/2019 14:22', '4/8/2019 14:55'])
df_A = pd.DataFrame(data=[times_1, times_2], columns=columns)

data_B = [['3/8/2019 18:37', 41],
          ['3/8/2019 18:55', 47],
          ['3/8/2019 19:30', 55],
          ['3/8/2019 19:51', 51],
          ['4/8/2019 14:10', 53],
          ['4/8/2019 14:35', 48],
          ['4/8/2019 14:51', 51],
          ['4/8/2019 15:02', 58]]
columns_B = ['Date_hour', 'compression']
df_B = pd.DataFrame(data=data_B, columns=columns_B)
df_B['Date_hour'] = pd.to_datetime(df_B['Date_hour'])
To go a bit further: to solve your problem, you need to loop over the rows of df_A. This can be done in three main ways: (i) with a plain for loop over the indices of the rows of the dataframe, (ii) with a for loop using the .iterrows() method, or (iii) with the .apply() method.
I ordered them from the slowest to the fastest at runtime. I picked method (ii) and eva-vw picked method (iii). The advantage of .apply() is that it is the fastest, but its disadvantage (to me) is that you have to write everything you want to do with the row in a one-line lambda function.
# create test dataframes
import pandas as pd

df_A = pd.DataFrame(
    {
        "Date_hour_beginning": ["3/8/2019 18:35", "4/8/2019 14:22"],
        "Date_hour_end": ["3/8/2019 19:45", "4/8/2019 14:55"],
    }
)
df_B = pd.DataFrame(
    {
        "Date_hour": [
            "3/8/2019 18:37",
            "3/8/2019 18:55",
            "3/8/2019 19:30",
            "3/8/2019 19:51",
            "4/8/2019 14:10",
            "4/8/2019 14:35",
            "4/8/2019 14:51",
            "4/8/2019 15:02",
        ],
        "compression": [41, 47, 55, 51, 53, 48, 51, 58],
    }
)

# convert to datetime
df_A['Date_hour_beginning'] = pd.to_datetime(df_A['Date_hour_beginning'])
df_A['Date_hour_end'] = pd.to_datetime(df_A['Date_hour_end'])
df_B['Date_hour'] = pd.to_datetime(df_B['Date_hour'])

# accumulate compression values per range
df_A["compression"] = df_A.apply(
    lambda row: df_B.loc[
        (df_B["Date_hour"] >= row["Date_hour_beginning"])
        & (df_B["Date_hour"] <= row["Date_hour_end"]),
        "compression",
    ].values.tolist(),
    axis=1,
)
# calculate mean compression and amplitude
df_A['mean_compression'] = df_A['compression'].apply(lambda x: sum(x) / len(x))
df_A['amplitude'] = df_A['compression'].apply(lambda x: max(x) - min(x))
Use this:
df_A['Date_hour_beginning'] = pd.to_datetime(df_A['Date_hour_beginning'])
df_A['Date_hour_end'] = pd.to_datetime(df_A['Date_hour_end'])
df_B['Date_hour'] = pd.to_datetime(df_B['Date_hour'])
df_A = df_A.assign(key=1)
df_B = df_B.assign(key=1)
df_merge = pd.merge(df_A, df_B, on='key').drop('key',axis=1)
df_merge = df_merge.query('Date_hour >= Date_hour_beginning and Date_hour <= Date_hour_end')
df_merge['amplitude'] = df_merge.groupby(['Date_hour_beginning','Date_hour_end'])['compression'].transform(lambda x: x.max()-x.min())
df_merge = df_merge.groupby(['Date_hour_beginning','Date_hour_end']).mean()
Output:
compression amplitude
Date_hour_beginning Date_hour_end
2019-03-08 18:35:00 2019-03-08 19:45:00 47.666667 14.0
2019-04-08 14:22:00 2019-04-08 14:55:00 49.500000 3.0
groupby can accept a series with the same index, i.e.:
df['Date_hour'] = pd.to_datetime(df['Date_hour'])
df_a['begin'] = pd.to_datetime(df_a['begin'])
df_a['end'] = pd.to_datetime(df_a['end'])
selector = df.apply(lambda x: df_a.query(f'begin <= \'{x["Date_hour"]}\' <= end').index[0], axis=1)
for i_gr, gr in df.groupby(selector):
    print(i_gr, gr)
And then go on with your .mean() or .median()
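One more sketch, not taken from any of the answers above: if the cross join is too memory-hungry (it materialises len(df_A) x len(df_B) rows), the timestamps of df_B can be binned into the intervals defined by df_A using pd.IntervalIndex and pd.cut. This assumes the intervals in df_A do not overlap, as in the example:
import pandas as pd

# One closed interval per row of df_A.
intervals = pd.IntervalIndex.from_arrays(
    df_A['Date_hour_beginning'], df_A['Date_hour_end'], closed='both'
)

# Assign each df_B timestamp to the interval (i.e. df_A row) it falls into.
binned = pd.cut(df_B['Date_hour'], intervals)

stats = (
    df_B.groupby(binned, observed=False)['compression']
        .agg(mean_compression='mean', amplitude=lambda s: s.max() - s.min())
        .reindex(intervals)  # align the results with the row order of df_A
)

df_A[['mean_compression', 'amplitude']] = stats.to_numpy()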

Moving average on pandas.groupby object that respects time

Given a pandas dataframe in the following format:
import pandas as pd

toy = pd.DataFrame({
    'id': [1, 2, 3,
           1, 2, 3,
           1, 2, 3],
    'date': ['2015-05-13', '2015-05-13', '2015-05-13',
             '2016-02-12', '2016-02-12', '2016-02-12',
             '2018-07-23', '2018-07-23', '2018-07-23'],
    'my_metric': [395, 634, 165,
                  144, 305, 293,
                  23, 395, 242]
})
# Make sure 'date' has datetime format
toy.date = pd.to_datetime(toy.date)
The my_metric column contains some (random) metric I wish to compute a time-dependent moving average of, conditional on the column id and within some time interval that I specify myself. I will refer to this time interval as the "lookback time"; it could be 5 minutes or 2 years. To determine which observations are to be included in the lookback calculation, we use the date column (which could be the index if you prefer).
To my frustration, I have discovered that such a procedure is not easily performed using pandas builtins, since I need to perform the calculation conditionally on id, and at the same time the calculation should only be made on observations within the lookback time (checked using the date column). Hence, the output dataframe should consist of one row for each id-date combination, with the my_metric column now being the average of all observations that are contained within the lookback time (e.g. 2 years, including today's date).
For clarity, I have included a figure with the desired output format (apologies for the oversized figure) when using a 2-year lookback time.
I have a solution but it does not make use of specific pandas built-in functions and is likely sub-optimal (combination of list comprehension and a single for-loop). The solution I am looking for will not make use of a for-loop, and is thus more scalable/efficient/fast.
Thank you!
Calculating lookback time: (Current_year - 2 years)
from dateutil.relativedelta import relativedelta
from dateutil import parser
import datetime
In [1691]: dt = '2018-01-01'
In [1695]: dt = parser.parse(dt)
In [1696]: lookback_time = dt - relativedelta(years=2)
Now, filter the dataframe on lookback time and calculate rolling average
In [1722]: toy['new_metric'] = ((toy.my_metric + toy[toy.date > lookback_time].groupby('id')['my_metric'].shift(1))/2).fillna(toy.my_metric)
In [1674]: toy.sort_values('id')
Out[1674]:
date id my_metric new_metric
0 2015-05-13 1 395 395.0
3 2016-02-12 1 144 144.0
6 2018-07-23 1 23 83.5
1 2015-05-13 2 634 634.0
4 2016-02-12 2 305 305.0
7 2018-07-23 2 395 350.0
2 2015-05-13 3 165 165.0
5 2016-02-12 3 293 293.0
8 2018-07-23 3 242 267.5
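For the general case (arbitrary lookback, grouped by id), a sketch using pandas' time-based rolling windows may be closer to what the question asks for. It assumes the frame is sorted by date and uses '730D' as an approximation of a 2-year lookback; the column name my_metric_2y_avg is mine:
import pandas as pd

toy = toy.sort_values('date')

# Trailing time-based window per id: for each row, the mean of `my_metric`
# over the preceding 730 days (including the current row) within the same id.
rolled = toy.groupby('id').rolling('730D', on='date').mean()

# `rolled` is indexed by (id, original row label); drop the id level so the
# result aligns back onto the original rows of `toy`.
toy['my_metric_2y_avg'] = rolled['my_metric'].droplevel(0)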
So, after some tinkering I found an answer that will generalize adequately. I used a slightly different 'toy' dataframe (slightly more relevant to my case). For completeness' sake, here is the data:
Consider now the following code:
# Define a custom function which groups by time (using the index)
def rolling_average(x, dt):
    xt = x.sort_index().groupby(lambda x: x.time()).rolling(window=dt).mean()
    xt.index = xt.index.droplevel(0)
    return xt

dt = '730D'  # rolling average window: 730 days = 2 years

# Group by the 'id' column
g = toy.groupby('id')

# Apply the custom function
df = g.apply(rolling_average, dt=dt)

# Massage the data to appropriate format
df.index = df.index.droplevel(0)
df = df.reset_index().drop_duplicates(keep='last', subset=['id', 'date'])
The result is as expected:

How can I convert a pandas dataframe to Morris dataset for bootstrap

Hi, I want to create graphs and tables with Bootstrap using Morris.js.
I have the following dataframe:
date x y
0 2016-10-03 156 123
1 2016-10-04 220 156
2 2016-10-05 153 152
I need to get this in this format:
[
{ date: '2016-10-03',x:156, y:123 },
{ date: '2016-10-04',x:220, y:156 },
{ date: '2016-10-05',x:153, y:152 }
]
I tried this with to_json, but it isn't the correct format, and it transforms the dates to milliseconds (or to datetimes when selecting iso).
Is there a built-in function for this, or do I need to write a custom function with for loops to get this format?
Use to_json:
print(df.to_json(orient='records'))
[{"date":"2016-10-03","x":156,"y":123},{"date":"2016-10-04","x":220,"y":156},{"date":"2016-10-05","x":153,"y":152}]
Something like this should get you the output you're looking for.
somelist = []
for n, i in df.iterrows():
    row = {'date': i.date, 'x': i.x, 'y': i.y}
    somelist.append(row)
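If you want plain Python dicts rather than a JSON string (for example, to pass into a template), here is a short sketch using to_dict; the strftime step just guards against the date column having already been converted to datetimes:
import pandas as pd

records = (
    df.assign(date=pd.to_datetime(df['date']).dt.strftime('%Y-%m-%d'))
      .to_dict(orient='records')
)
# [{'date': '2016-10-03', 'x': 156, 'y': 123},
#  {'date': '2016-10-04', 'x': 220, 'y': 156},
#  {'date': '2016-10-05', 'x': 153, 'y': 152}]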
