Merge two dataframes using pandas - python

First of all thank you for your help.
I have two dataframes row indexed by date (DD-MM-YYYY HH:MM) as follows:
DF1
date temp wind
0 31-12-2002 23:00 12.3 80
1 01-01-2004 00:00 15.2 NAN
2 01-01-2004 01:00 18.4 NAN
........
DF2
date temp wind
0 31-12-2002 23:00 14.5 86
1 01-01-2003 00:00 28.7 98
2 01-01-2003 01:00 26.7 88
........
n 01-01-2004 00:00 34.5 23
m 01-01-2004 01:00 35.7 NAN
MergedDF
date temp wind
0 31-12-2002 23:00 12.3 80
1 01-01-2003 00:00 28.7 98
2 01-01-2003 01:00 26.7 88
........
n 01-01-2004 00:00 15.2 23
m 01-01-2004 01:00 18.4 NAN
In DF1 one whole year (2003) is missing, and there are also some NaN values in the other years.
Basically I want to merge both dataframes, adding the missing year and replacing NaN values wherever that information exists in DF2.
Could someone help me? I don't know how to implement this in python/pandas.

MergedDF = pd.concat([df1, df2]).groupby('date', as_index=False).first()
(DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.)
The as_index=False option of groupby keeps date as a regular column in the aggregated output.
.first() keeps the first non-null value for each date, so DF1's values take precedence wherever both frames have data.
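A self-contained sketch of this approach, using small hypothetical frames modeled on the question's DF1/DF2 (abbreviated to a few rows):

```python
import numpy as np
import pandas as pd

# Hypothetical frames mirroring the question's DF1/DF2 (abbreviated)
df1 = pd.DataFrame({
    'date': ['31-12-2002 23:00', '01-01-2004 00:00', '01-01-2004 01:00'],
    'temp': [12.3, 15.2, 18.4],
    'wind': [80, np.nan, np.nan],
})
df2 = pd.DataFrame({
    'date': ['31-12-2002 23:00', '01-01-2003 00:00', '01-01-2004 00:00'],
    'temp': [14.5, 28.7, 34.5],
    'wind': [86, 98, 23],
})

# pd.concat replaces the removed DataFrame.append; .first() keeps the
# first non-null value per date, so DF1 wins wherever it has data
merged = (pd.concat([df1, df2])
            .groupby('date', as_index=False)
            .first())

# parse the DD-MM-YYYY strings so the result can be sorted chronologically
merged['date'] = pd.to_datetime(merged['date'], format='%d-%m-%Y %H:%M')
merged = merged.sort_values('date', ignore_index=True)
```

Because df1 comes first in the concat, its non-null values win; gaps (the missing 2003 rows and the NaN winds) are filled from df2.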

Related

Rolling average in time series for specific time intervals

I have data and want to add a column that shows the moving average of the val column within each day.
df
timestamp val val_mean
2022-10-10 00:00 10 10
2022-10-10 00:01 20 15
..
2022-10-10 23:59 50 23
2022-10-11 00:00 80 80
How can I achieve this?
Looks like you want a grouped, expanding mean:
group = pd.to_datetime(df['timestamp']).dt.normalize()
df['val_mean'] = df.groupby(group)['val'].expanding().mean().droplevel(0)
output:
timestamp val val_mean
0 2022-10-10 00:00 10 10.000000
1 2022-10-10 00:01 20 15.000000
2 2022-10-10 23:59 50 26.666667
3 2022-10-11 00:00 80 80.000000
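For reference, the same approach as a self-contained sketch using the four sample rows from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'timestamp': ['2022-10-10 00:00', '2022-10-10 00:01',
                  '2022-10-10 23:59', '2022-10-11 00:00'],
    'val': [10, 20, 50, 80],
})

# normalize() truncates each timestamp to midnight, giving one group per day
day = pd.to_datetime(df['timestamp']).dt.normalize()

# expanding().mean() is the running mean within each day; droplevel(0)
# removes the day level so the result aligns back with df's original index
df['val_mean'] = df.groupby(day)['val'].expanding().mean().droplevel(0)
```

Note the third row comes out as 26.666667 (the mean of 10, 20, 50), not the 23 shown in the question's sketch; the expanding mean restarts at 80 on the next day.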

How can I grab rows with max date for each year by ID from Pandas dataframe?

My sample dataframe looks like this:
ID Date Value
2 2020-06-30 124
1 2020-09-30 265
1 2021-12-31 140
1 2020-12-31 142
2 2020-12-31 147
1 2019-12-31 677
1 2021-03-31 235
2 2021-09-30 917
2 2021-03-31 149
I want to grab rows of max date for each year of each ID.
The final output would be:
ID Date Value
1 2019-12-31 677
1 2020-12-31 142
1 2021-12-31 140
2 2020-12-31 147
2 2021-09-30 917
I tried groupby on ID but I'm not sure how to grab the rows with the max date for each year.
Many thanks for your help!
Here is one way to accomplish it:
df.assign(yr=pd.to_datetime(df['Date']).dt.year).groupby(['ID','yr']).max().reset_index().drop(columns=['yr'])
Since a max for each year is needed, a temporary year column yr is created via assign, then the frame is grouped by ID and year to get the max for each year. Finally the yr column is dropped from the result. (pd.to_datetime replaces astype('datetime64'), which newer pandas rejects as a unit-less cast.)
ID Date Value
0 1 2019-12-31 677
1 1 2020-12-31 265
2 1 2021-12-31 235
3 2 2020-12-31 147
4 2 2021-09-30 917
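One caveat worth noting: .max() aggregates every column independently, so the Value shown above is the year's maximum Value, not the Value on the latest date (compare 265/235 here with the 142/140 in the asker's expected output). To keep whole rows, a grouped idxmax on Date is one option, sketched here with the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [2, 1, 1, 1, 2, 1, 1, 2, 2],
    'Date': ['2020-06-30', '2020-09-30', '2021-12-31', '2020-12-31',
             '2020-12-31', '2019-12-31', '2021-03-31', '2021-09-30',
             '2021-03-31'],
    'Value': [124, 265, 140, 142, 147, 677, 235, 917, 149],
})
df['Date'] = pd.to_datetime(df['Date'])

# idxmax returns the row label of the latest Date within each (ID, year)
# group, so .loc keeps each full row intact
idx = df.groupby([df['ID'], df['Date'].dt.year])['Date'].idxmax()
out = df.loc[idx].sort_values(['ID', 'Date']).reset_index(drop=True)
```

This reproduces the asker's expected table exactly, including Value 142 for (1, 2020) and 140 for (1, 2021).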
First you would need to extract the year from the date:
df['year'] = pd.DatetimeIndex(df['Date']).year
then, since you want the rows with the max date per ID and year, get those max dates:
maxDf = df.groupby(['ID', 'year'])['Date'].max()
then you can filter your dataframe on the max dates:
maxDates = maxDf.tolist()
df.loc[df['Date'].isin(maxDates)]
(Note that isin filters on Date alone, so this assumes one group's max date never coincides with a non-max date in another group.)

How to fill a column of one dataframe, conditionally bound to two columns of another dataframe?

My two dataframes:
wetter
Out[223]:
level_0 index TEMPERATURE:TOTAL SLP HOUR Time
0 0 2018-01-01 00:00:00 9.8 NaN 00 00:00
1 1 2018-01-01 01:00:00 9.8 NaN 01 01:00
2 2 2018-01-01 02:00:00 9.2 NaN 02 02:00
3 3 2018-01-01 03:00:00 8.4 NaN 03 03:00
4 4 2018-01-01 04:00:00 8.5 NaN 04 04:00
... ... ... ... ... ...
49034 49034 2018-12-31 22:40:00 8.5 NaN 22 22:40
49035 49035 2018-12-31 22:45:00 8.4 NaN 22 22:45
49036 49036 2018-12-31 22:50:00 8.4 NaN 22 22:50
49037 49037 2018-12-31 22:55:00 8.4 NaN 22 22:55
49038 49038 2018-12-31 23:00:00 8.4 NaN 23 23:00
[49039 rows x 6 columns]
df
Out[224]:
0 Time -14 -13 ... 17 18 NaN
1 00:00 1,256326635 1,218256131 ... 0,080348715 0,040194189 00:15
2 00:15 1,256564788 1,218487067 ... 0,080254367 0,039517006 00:30
3 00:30 1,260350982 1,222158528 ... 0,080219518 0,039054261 00:45
4 00:45 1,259306606 1,221145800 ... 0,080758578 0,039176953 01:00
5 01:00 1,258521518 1,220384502 ... 0,080444585 0,038164953 01:15
.. ... ... ... ... ... ... ...
92 22:45 1,253545107 1,215558891 ... 0,080164570 0,042697436 23:00
93 23:00 1,241253483 1,203639741 ... 0,078395829 0,039685235 23:15
94 23:15 1,242890274 1,205226933 ... 0,078801415 0,039170364 23:30
95 23:30 1,240459118 1,202869448 ... 0,079511294 0,039013684 23:45
96 23:45 1,236228281 1,198766818 ... 0,079186806 0,037570494 00:00
[96 rows x 35 columns]
I want to fill the SLP column of wetter based on TEMPERATURE:TOTAL and Time.
For this I want to look at the df dataframe and fill SLP depending on the columns of df, where the headers are temperatures.
So for the first TEMPERATURE:TOTAL of 9.8 at 00:00, SLP should be filled with the value of the df column that is simply called 9, in the row where Time is 00:00.
I have tried to do this, which is why I also created the Time columns, but I am stuck. I thought of some nested loops, but knowing a bit of pandas I guess there is probably a two-liner solution for this?
Here is one way!
import numpy as np
import pandas as pd
This is me simulating your dataframes (you are free to skip this step) - next time please provide them.
wetter = pd.DataFrame()
df = pd.DataFrame()
wetter['TEMPERATURE:TOTAL'] = np.random.rand(10) * 10
wetter['SLP'] = np.nan
wetter['Time'] = pd.date_range("00:00", periods=10, freq="h")
df['Time'] = pd.date_range("00:00", periods=10, freq="15min")
for i in range(-14, 18):
    df[i] = np.random.rand(10)
Preprocess: floor the temperature to an integer so it can match df's integer column headers, then reshape df from wide (one column per temperature) to long form:
wetter['temp'] = np.floor(wetter['TEMPERATURE:TOTAL']).astype(int)
value_vars_ = list(range(-14, 18))
df_long = pd.melt(df, id_vars='Time', value_vars=value_vars_, var_name='temp', value_name='SLP')
Left-join the two dataframes on Time and temp:
final = pd.merge(wetter.drop('SLP', axis=1), df_long, how="left", on=["Time", "temp"])
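A minimal, hypothetical check of the melt-and-merge idea with tiny hand-made frames (two times, two integer temperature columns, values invented for illustration):

```python
import pandas as pd

# wide lookup table: one column per integer temperature
df = pd.DataFrame({'Time': ['00:00', '01:00'],
                   9: [1.1, 1.2],
                   10: [2.1, 2.2]})

# long form: one row per (Time, temp) pair
df_long = pd.melt(df, id_vars='Time', value_vars=[9, 10],
                  var_name='temp', value_name='SLP')
# the melted labels may come back object-dtype; cast to int for the join
df_long['temp'] = df_long['temp'].astype(int)

wetter = pd.DataFrame({'Time': ['00:00', '01:00'],
                       'TEMPERATURE:TOTAL': [9.8, 10.3]})
# truncate the measured temperature so it matches a column header
wetter['temp'] = wetter['TEMPERATURE:TOTAL'].astype(int)

final = wetter.merge(df_long, how='left', on=['Time', 'temp'])
```

The 9.8 reading at 00:00 picks up the value from column 9 at 00:00, and 10.3 at 01:00 picks up column 10 at 01:00, which is exactly the lookup the question describes.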

Resample pandas dataframe by two columns

I have a Pandas dataframe that describes arrivals at stations. It has two columns: time and station id.
Example:
time id
0 2019-10-31 23:59:36 22
1 2019-10-31 23:58:23 260
2 2019-10-31 23:54:55 82
3 2019-10-31 23:54:46 82
4 2019-10-31 23:54:42 21
I would like to resample this into five-minute blocks showing the number of arrivals at each station in the block that starts at that time, so it should look like this:
time id arrivals
0 2019-10-31 23:55:00 22 1
1 2019-10-31 23:50:00 22 5
2 2019-10-31 23:55:00 82 0
3 2019-10-31 23:25:00 82 325
4 2019-10-31 23:21:00 21 1
How could I use some high performance function to achieve this?
pandas.DataFrame.resample does not seem to be a possibility, since it requires the index to be a timestamp, and in this case several rows can have the same time.
df.groupby(['id', pd.Grouper(key='time', freq='5min')])\
  .size()\
  .to_frame('arrivals')\
  .reset_index()
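A self-contained version of this answer using the five sample rows from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'time': pd.to_datetime(['2019-10-31 23:59:36', '2019-10-31 23:58:23',
                            '2019-10-31 23:54:55', '2019-10-31 23:54:46',
                            '2019-10-31 23:54:42']),
    'id': [22, 260, 82, 82, 21],
})

# pd.Grouper bins the time column into 5-minute blocks without needing
# a DatetimeIndex; .size() counts the rows (arrivals) per (id, block)
arrivals = (df.groupby(['id', pd.Grouper(key='time', freq='5min')])
              .size()
              .to_frame('arrivals')
              .reset_index())
```

This sidesteps the asker's concern about resample requiring a unique timestamp index, since Grouper works on an ordinary column with duplicate times.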
I think it's a horrible solution (couldn't find a better one at the moment), but it more or less gets you where you want:
(df.groupby("id")
   .resample("5min", on="time")
   .count()[["id"]]
   .swaplevel(0, 1, axis=0)
   .sort_index(axis=0)
   .set_axis(["arrivals"], axis=1))
Try with groupby and resample:
>>> df.set_index("time").groupby("id").resample("5min").count()
id
id time
21 2019-10-31 23:50:00 1
22 2019-10-31 23:55:00 1
82 2019-10-31 23:50:00 2
260 2019-10-31 23:55:00 1

How to assign values to a dataframe's column by comparing values in another dataframe

I have two data frames. One has rows for every five minutes in a day:
df
TIMESTAMP TEMP
1 2011-06-01 00:05:00 24.5
200 2011-06-01 16:40:00 32.0
1000 2011-06-04 11:20:00 30.2
5000 2011-06-18 08:40:00 28.4
10000 2011-07-05 17:20:00 39.4
15000 2011-07-23 02:00:00 29.3
20000 2011-08-09 10:40:00 29.5
30656 2011-09-15 10:40:00 13.8
I have another dataframe that ranks the days:
ranked
TEMP DATE RANK
62 43.3 2011-08-02 1.0
63 43.1 2011-08-03 2.0
65 43.1 2011-08-05 3.0
38 43.0 2011-07-09 4.0
66 42.8 2011-08-06 5.0
64 42.5 2011-08-04 6.0
84 42.2 2011-08-24 7.0
56 42.1 2011-07-27 8.0
61 42.1 2011-08-01 9.0
68 42.0 2011-08-08 10.0
Both the TIMESTAMP and DATE columns are datetime datatypes (dtype returns dtype('M8[ns]')).
What I want to do is add a RANK column to df, filled with the corresponding day's rank from ranked (so within a day, all the 5-minute timesteps share the same rank).
So, the final result would look something like this:
df
TIMESTAMP TEMP RANK
1 2011-06-01 00:05:00 24.5 98.0
200 2011-06-01 16:40:00 32.0 98.0
1000 2011-06-04 11:20:00 30.2 96.0
5000 2011-06-18 08:40:00 28.4 50.0
10000 2011-07-05 17:20:00 39.4 9.0
15000 2011-07-23 02:00:00 29.3 45.0
20000 2011-08-09 10:40:00 29.5 40.0
30656 2011-09-15 10:40:00 13.8 100.0
What I have done so far:
# Separate the date and times.
df['DATE'] = df['YYYYMMDDHHmm'].dt.normalize()
df['TIME'] = df['YYYYMMDDHHmm'].dt.time
df = df[['DATE', 'TIME', 'TAIR']]
df['RANK'] = 0
for index, row in df.iterrows():
    df.loc[index, 'RANK'] = ranked[ranked['DATE'] == row['DATE']]['RANK'].values
But I think I am going in a very wrong direction because this takes ages to complete.
How do I improve this code?
IIUC, you can play with indexes to match the values
df = df.set_index(df.TIMESTAMP.dt.normalize())\
       .assign(RANK=ranked.set_index('DATE').RANK)\
       .set_index(df.index)
(dt.normalize() keeps datetime64 midnights, so the index aligns with ranked's DATE column; dt.date would produce python date objects that don't match datetime64 values.)
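The same idea can be written as an explicit lookup with map, sketched here on small hypothetical frames (a few rows and ranks invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'TIMESTAMP': pd.to_datetime(['2011-08-02 00:05:00', '2011-08-02 16:40:00',
                                 '2011-07-09 11:20:00']),
    'TEMP': [24.5, 32.0, 30.2],
})
ranked = pd.DataFrame({
    'DATE': pd.to_datetime(['2011-08-02', '2011-07-09']),
    'RANK': [1.0, 4.0],
})

# normalize() truncates each timestamp to midnight; map() then looks the
# day up in a DATE -> RANK Series, so every timestep of a day gets its rank
df['RANK'] = df['TIMESTAMP'].dt.normalize().map(ranked.set_index('DATE')['RANK'])
```

Both versions are vectorized, so they avoid the row-by-row iterrows loop that was taking ages.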
