I have data of the following form:
6460 2001-07-24 00:00:00 67.5 75.1 75.9 71.0 75.2 81.8
6490 2001-06-24 00:00:00 68.4 74.9 76.1 70.9 75.5 82.7
6520 2001-05-25 00:00:00 69.6 74.7 76.3 70.8 75.5 83.2
6550 2001-04-25 00:00:00 69.2 74.6 76.1 70.6 75.0 83.1
6580 2001-03-26 00:00:00 69.1 74.4 75.9 70.5 74.3 82.8
6610 2001-02-24 00:00:00 69.0 74.0 75.3 69.8 73.8 81.9
6640 2001-01-25 00:00:00 68.9 73.9 74.6 69.7 73.5 80.0
6670 2000-12-26 00:00:00 69.0 73.5 75.0 69.5 72.6 81.8
6700 2000-11-26 00:00:00 69.8 73.2 75.1 69.5 72.0 82.7
6730 2000-10-27 00:00:00 70.3 73.1 75.0 69.4 71.3 82.6
6760 2000-09-27 00:00:00 69.4 73.0 74.8 69.4 71.0 82.3
6790 2000-08-28 00:00:00 69.6 72.8 74.6 69.2 70.7 81.9
6820 2000-07-29 00:00:00 67.8 72.9 74.4 69.1 70.6 81.8
I want all the dates to have a 30 day difference between each other. I know how to set a specific day or month on a datetime object with something like
ndfd = ndf['Date'].astype('datetime64[ns]')
ndfd = ndfd.apply(lambda dt: dt.replace(day=15))
But this does not take into account the difference in days from month to month.
How can I ensure there is a consistent step in days from month to month in my data, given that I am able to change the day as long as it remains on the same month?
You could use date_range:
df['date'] = pd.date_range(start=df['date'][0], periods=len(df), freq='30D')
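For example, on a small frame (a sketch; the 'date' column name is assumed, and note the generated dates ascend, so for newest-first data like yours freq='-30D' may be what you want):
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2001-07-24', '2001-06-24', '2001-05-25'])})
df['date'] = pd.date_range(start=df['date'][0], periods=len(df), freq='30D')
print(df['date'].diff().dropna().unique())  # every step is exactly 30 days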
IIUC you could change your date column like this:
import datetime
a = df.iloc[0,0] # first date, assuming date col is first
df['date'] = [a + datetime.timedelta(days=30 * i) for i in range(len(df))]
I haven't tested this, so I'm not sure it works as smoothly as I think it will =).
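A vectorized variant of the same idea (a sketch; it assumes the 'date' column is already datetime64 and the first row holds the anchor date):
import numpy as np
import pandas as pd

df['date'] = df['date'].iloc[0] + pd.to_timedelta(30 * np.arange(len(df)), unit='D')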
You can transform your first day into an ordinal, add 30*i to it, and then transform it back:
import datetime

# toordinal() uses only the date part, so any time-of-day is dropped
first_day = df.iloc[0]['date_column'].toordinal()
df['date'] = [datetime.date.fromordinal(first_day + 30 * i) for i in range(len(df))]
I wish to create a DataFrame where each row is one day, and the columns provide the date, hourly data, and maximum minimum of the day's data. Here is an example (I provide the input data further down in the question):
Date_time 00:00 01:00 02:00 03:00 04:00 05:00 06:00 07:00 08:00 09:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00 Max Min
0 2019-02-03 18.6 18.6 18.2 18.0 18.0 18.3 18.7 20.1 21.7 23.3 23.7 24.6 25.1 24.5 23.9 19.6 19.2 19.8 19.6 19.3 19.2 19.3 18.8 19.0 25.7 17.9
1 2019-02-04 18.9 18.8 18.6 18.4 18.7 18.8 19.0 19.7 21.4 23.5 NaN NaN NaN 25.8 25.4 22.1 21.8 21.0 18.9 18.8 18.9 18.8 18.8 18.9 27.8 18.1
My input DataFrame has a row for each hour, with the date & time, mean, max, and min for each hour as its columns.
I wish to iterate through each day in the input DataFrame and do the following:
Check that there is a row for each hour of the day
Check that there is both maximum and minimum data for each hour of the day
If the conditions above are met, I wish to:
Add a row to the output DataFrame for the given date
Use the date to fill the 'Date_time' cell for the row
Transpose the hourly data to the hourly cells
Find the max of the hourly max data, and use it to fill the max cell for the row
Find the min of the hourly min data, and use it to fill the min cell for the row
Example daily input data follows.
Example 1
All hours for day available
Max & min available for each hour
Proceed to create row in output DataFrame
Date_time Mean_temp Max_temp Min_temp
0 2019-02-03 00:00:00 18.6 18.7 18.5
1 2019-02-03 01:00:00 18.6 18.7 18.5
2 2019-02-03 02:00:00 18.2 18.5 18.0
3 2019-02-03 03:00:00 18.0 18.0 17.9
4 2019-02-03 04:00:00 18.0 18.1 17.9
5 2019-02-03 05:00:00 18.3 18.4 18.1
6 2019-02-03 06:00:00 18.7 19.1 18.4
7 2019-02-03 07:00:00 20.1 21.3 19.1
8 2019-02-03 08:00:00 21.7 22.9 21.0
9 2019-02-03 09:00:00 23.2 23.9 22.8
10 2019-02-03 10:00:00 23.7 24.1 23.3
11 2019-02-03 11:00:00 24.6 25.5 24.0
12 2019-02-03 12:00:00 25.1 25.7 24.7
13 2019-02-03 13:00:00 24.5 25.0 24.2
14 2019-02-03 14:00:00 23.9 25.3 21.2
15 2019-02-03 15:00:00 19.6 21.2 18.8
16 2019-02-03 16:00:00 19.2 19.5 18.7
17 2019-02-03 17:00:00 19.8 19.9 19.4
18 2019-02-03 18:00:00 19.6 19.8 19.5
19 2019-02-03 19:00:00 19.3 19.4 19.1
20 2019-02-03 20:00:00 19.2 19.4 19.1
21 2019-02-03 21:00:00 19.3 19.4 18.9
22 2019-02-03 22:00:00 18.8 19.0 18.7
23 2019-02-03 23:00:00 19.0 19.1 18.9
Example 2
All hours for day available
Max & min available for each hour
NaN values for some Mean_temp entries
Proceed to create row in output DataFrame
Date_time Mean_temp Max_temp Min_temp
24 2019-02-04 00:00:00 18.9 19.0 18.9
25 2019-02-04 01:00:00 18.8 18.9 18.7
26 2019-02-04 02:00:00 18.6 18.8 18.4
27 2019-02-04 03:00:00 18.4 18.6 18.1
28 2019-02-04 04:00:00 18.7 18.9 18.4
29 2019-02-04 05:00:00 18.8 18.8 18.7
30 2019-02-04 06:00:00 19.0 19.3 18.8
31 2019-02-04 07:00:00 19.7 20.4 19.3
32 2019-02-04 08:00:00 21.4 22.8 20.3
33 2019-02-04 09:00:00 23.5 23.9 22.8
34 2019-02-04 10:00:00 NaN 25.7 23.6
35 2019-02-04 11:00:00 NaN 26.5 25.4
36 2019-02-04 12:00:00 NaN 27.1 26.1
37 2019-02-04 13:00:00 25.8 26.8 24.8
38 2019-02-04 14:00:00 25.4 27.8 23.7
39 2019-02-04 15:00:00 22.1 24.1 20.2
40 2019-02-04 16:00:00 21.8 22.6 20.2
41 2019-02-04 17:00:00 20.9 22.4 19.6
42 2019-02-04 18:00:00 18.9 19.6 18.6
43 2019-02-04 19:00:00 18.8 18.9 18.6
44 2019-02-04 20:00:00 18.9 19.0 18.8
45 2019-02-04 21:00:00 18.8 18.9 18.7
46 2019-02-04 22:00:00 18.8 18.9 18.7
47 2019-02-04 23:00:00 18.9 19.2 18.7
Example 3
Not all hours of the day are available
Do not create row in output DataFrame
Date_time Mean_temp Max_temp Min_temp
48 2019-02-05 00:00:00 19.2 19.3 19.0
49 2019-02-05 01:00:00 19.3 19.4 19.3
50 2019-02-05 02:00:00 19.3 19.4 19.2
51 2019-02-05 03:00:00 19.4 19.5 19.4
52 2019-02-05 04:00:00 19.5 19.6 19.3
53 2019-02-05 05:00:00 19.3 19.5 19.1
54 2019-02-05 06:00:00 20.1 20.6 19.2
55 2019-02-05 07:00:00 21.1 21.7 20.6
56 2019-02-05 08:00:00 22.3 23.2 21.7
57 2019-02-05 15:00:00 25.3 25.8 25.0
58 2019-02-05 16:00:00 25.8 26.0 25.2
59 2019-02-05 17:00:00 24.3 25.2 23.3
60 2019-02-05 18:00:00 22.5 23.3 22.1
61 2019-02-05 19:00:00 21.6 22.1 21.1
62 2019-02-05 20:00:00 21.1 21.3 20.9
63 2019-02-05 21:00:00 21.2 21.3 20.9
64 2019-02-05 22:00:00 20.9 21.0 20.6
65 2019-02-05 23:00:00 19.9 20.6 19.7
Example 4
All hours of the day are available
Max and/or min have at least one NaN value
Do not create row in output DataFrame
Date_time Mean_temp Max_temp Min_temp
66 2019-02-06 00:00:00 19.7 19.8 19.7
67 2019-02-06 01:00:00 19.6 19.7 19.3
68 2019-02-06 02:00:00 19.0 19.3 18.6
69 2019-02-06 03:00:00 18.5 18.6 18.4
70 2019-02-06 04:00:00 18.6 18.7 18.4
71 2019-02-06 05:00:00 18.5 18.6 NaN
72 2019-02-06 06:00:00 19.0 19.6 18.5
73 2019-02-06 07:00:00 20.3 21.2 19.6
74 2019-02-06 08:00:00 21.5 21.7 21.2
75 2019-02-06 09:00:00 21.4 22.3 20.9
76 2019-02-06 10:00:00 23.5 24.4 22.3
77 2019-02-06 11:00:00 24.7 25.4 24.3
78 2019-02-06 12:00:00 24.9 25.5 23.9
79 2019-02-06 13:00:00 23.4 24.0 22.9
80 2019-02-06 14:00:00 23.3 23.8 22.9
81 2019-02-06 15:00:00 24.4 23.7
82 2019-02-06 16:00:00 24.9 25.1 24.7
83 2019-02-06 17:00:00 24.4 24.9 23.8
84 2019-02-06 18:00:00 22.5 23.8 21.7
85 2019-02-06 19:00:00 20.8 21.8 19.6
86 2019-02-06 20:00:00 19.1 19.6 18.9
87 2019-02-06 21:00:00 19.0 19.1 18.9
88 2019-02-06 22:00:00 19.1 19.1 19.0
89 2019-02-06 23:00:00 19.1 19.1 19.0
Just to recap, the above inputs would create the following output:
Date_time 00:00 01:00 02:00 03:00 04:00 05:00 06:00 07:00 08:00 09:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00 Max Min
0 2019-02-03 18.6 18.6 18.2 18.0 18.0 18.3 18.7 20.1 21.7 23.3 23.7 24.6 25.1 24.5 23.9 19.6 19.2 19.8 19.6 19.3 19.2 19.3 18.8 19.0 25.7 17.9
1 2019-02-04 18.9 18.8 18.6 18.4 18.7 18.8 19.0 19.7 21.4 23.5 NaN NaN NaN 25.8 25.4 22.1 21.8 21.0 18.9 18.8 18.9 18.8 18.8 18.9 27.8 18.1
I've had a really good think about this, and I can only come up with a horrible set of if statements that I know will be terribly slow and will take ages to write (apologies, this is due to me being bad at coding)!
Does anyone have any pointers to Pandas functions that could begin to deal with this problem efficiently?
You can use a groupby on the day of the Date_time column, and build each row of your final_df from each group (skipping a group whenever there are any missing values in the Max_temp or Min_temp columns, or whenever the length of the group is less than 24).
Note that I'm assuming your Date_time column is of type datetime64[ns]. If it isn't, first run the line: df['Date_time'] = pd.to_datetime(df['Date_time'])
all_hours = list(pd.date_range(start='1/1/22 00:00:00', end='1/1/22 23:00:00', freq='h').strftime('%H:%M'))
final_df = pd.DataFrame(columns=['Date_time'] + all_hours + ['Max','Min'])

## construct final_df by using a groupby on the day of the 'Date_time' column
for group, df_group in df.groupby(df['Date_time'].dt.date):
    ## only keep days with all 24 hours present and no NaN in 'Max_temp' or 'Min_temp'
    if (df_group[['Max_temp','Min_temp']].isnull().sum().sum() == 0) & (len(df_group) == 24):
        ## build a dictionary for the new row of final_df
        new_df_data = {'Date_time': group}
        new_df_data.update(dict(zip(all_hours, [[val] for val in df_group['Mean_temp']])))
        new_df_data['Max'], new_df_data['Min'] = df_group['Max_temp'].max(), df_group['Min_temp'].min()
        final_df = pd.concat([final_df, pd.DataFrame(new_df_data)])
Output:
>>> final_df
Date_time 00:00 01:00 02:00 03:00 04:00 05:00 06:00 07:00 08:00 09:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00 Max Min
0 2019-02-03 18.6 18.6 18.2 18.0 18.0 18.3 18.7 20.1 21.7 23.2 23.7 24.6 25.1 24.5 23.9 19.6 19.2 19.8 19.6 19.3 19.2 19.3 18.8 19.0 25.7 17.9
0 2019-02-04 18.9 18.8 18.6 18.4 18.7 18.8 19.0 19.7 21.4 23.5 NaN NaN NaN 25.8 25.4 22.1 21.8 20.9 18.9 18.8 18.9 18.8 18.8 18.9 27.8 18.1
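As a side note, growing final_df with pd.concat inside the loop re-copies the frame on every iteration; here is a sketch of the same loop that collects plain row dicts and builds the frame once at the end (same df, all_hours and column names as above):
rows = []
for group, df_group in df.groupby(df['Date_time'].dt.date):
    if df_group[['Max_temp', 'Min_temp']].isnull().sum().sum() == 0 and len(df_group) == 24:
        row = {'Date_time': group}
        row.update(dict(zip(all_hours, df_group['Mean_temp'])))  # hourly means, in time order
        row['Max'] = df_group['Max_temp'].max()
        row['Min'] = df_group['Min_temp'].min()
        rows.append(row)
final_df = pd.DataFrame(rows, columns=['Date_time'] + all_hours + ['Max', 'Min'])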
time a b
2021-05-23 22:06:54 10.4 70.1
2021-05-23 22:21:41 10.7 68.3
2021-05-23 22:36:28 10.4 69.4
2021-05-23 22:51:15 9.9 71.7
2021-05-23 23:06:02 9.5 73.1
... ... ...
2021-11-19 08:18:31 19.8 43.0
2021-11-19 08:20:04 21.0 42.0
2021-11-19 08:21:25 35.5 20.0
2021-11-19 08:21:32 19.8 43.0
2021-11-19 08:23:05 21.0 42.0
Here, time is in the index, not a column.
When I did df.between_time("2021-11-17 08:15:00", "2021-11-19 08:00:00"), it threw ValueError: Cannot convert arg ['2021-11-17 08:15:00'] to a time, since between_time only accepts times of day, not full timestamps. The timestamps in the frame are also not evenly spaced.
What I want to do: when I pass a time range or date range, get all the data between the given bounds.
Thanks
Use truncate (note that it requires the index to be sorted):
>>> df.truncate("2021-05-23 23:00:00", "2021-11-19 08:20:00")
a b
time
2021-05-23 23:06:02 9.5 73.1
2021-11-19 08:18:31 19.8 43.0
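Since the index is a sorted DatetimeIndex, plain .loc slicing also works and is inclusive of both endpoints. A minimal sketch, using a small stand-in frame built from rows of the question:
import pandas as pd

df = pd.DataFrame(
    {'a': [10.4, 9.5, 19.8], 'b': [70.1, 73.1, 43.0]},
    index=pd.to_datetime(['2021-05-23 22:06:54', '2021-05-23 23:06:02', '2021-11-19 08:18:31']),
)
df.index.name = 'time'

# label-based slicing on a sorted DatetimeIndex keeps both endpoints
print(df.loc['2021-05-23 23:00:00':'2021-11-19 08:20:00'])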
I have the following data contained in a DataFrame that is part of a custom class, and I want to compute stats on it for night-time periods.
LAeq,T LAFmax,T LA90,T
Start date & time
2021-08-18 22:00:00 71.5 90.4 49.5
2021-08-18 22:15:00 70.6 94.0 45.7
2021-08-18 22:30:00 69.3 82.2 48.3
2021-08-18 22:45:00 70.1 89.9 46.4
2021-08-18 23:00:00 68.9 82.4 46.0
... ... ... ...
2021-08-24 08:30:00 72.3 85.0 61.3
2021-08-24 08:45:00 72.9 84.6 62.2
2021-08-24 09:00:00 73.1 86.1 62.6
2021-08-24 09:15:00 72.8 86.4 61.6
2021-08-24 09:30:00 73.2 93.5 61.5
For example, I want to find the nth highest LAFmax,T for each given night-time period.
The night-time period typically spans 23:00 to 07:00, and I have managed to accomplish my goal using the resample() method as follows.
def compute_nth_lmax(self, n):
    nth_lmax = self.df["LAFmax,T"].between_time(
        self._night_start, self._day_start,
        include_start=True, include_end=False
    ).resample(
        rule=self._night_length, offset=pd.Timedelta(self._night_start)
    ).apply(
        lambda x: np.sort(x)[-n] if x.size > 0 else np.nan
    ).dropna()
    return nth_lmax
The problem is that resample() assumes a regular resampling, and this works fine when the night-time period is 8 hours and therefore subdivides 24 equally (as in the default case of 23:00 to 07:00), but not for an irregular night-time period (say, if I extended it to 22:00 to 07:00).
I have tried to accomplish this using groupby(), but had no luck.
The only thing I can think of is adding another column to label each of the rows as "Night-time 1", "Night-time 2" etc., and grouping by these, but that feels rather messy.
I decided to go with what I consider a slightly inelegant approach and create a separate column which flags the night-time periods, before processing them. Still, I managed to achieve my goal in 2 lines of code.
self.df["Night-time indices"] = (self.df.index - pd.Timedelta(self._day_start)).date
nth_event = self.df.sort_values(by=[col], ascending=False).between_time(self._night_start, self._day_start)[
[col, period]].groupby(by=period).nth(n)
Out[43]:
Night-time indices
2021-08-18 100.0
2021-08-19 96.9
2021-08-20 97.7
2021-08-21 95.5
2021-08-22 101.7
2021-08-23 92.7
2021-08-24 85.8
Name: LAFmax,T, dtype: float64
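A self-contained sketch of the same labelling idea, with hypothetical stand-ins for the class attributes (col, night_start, day_start and n are all assumptions; nlargest is used instead of sort_values plus nth so the result is stable across pandas versions):
import numpy as np
import pandas as pd

col = 'LAFmax,T'
night_start, day_start = '23:00', '07:00'
n = 2  # 1-based: n=1 is the highest value per night, n=2 the second highest, ...

# shift timestamps back by the day-start offset (07:00 here) so every row of a
# night (23:00 through 06:59 the next morning) is labelled with the date the night began
df['Night-time indices'] = (df.index - pd.Timedelta(hours=7)).date

# inclusive='left' (pandas >= 1.4) drops the 07:00 endpoint, mirroring include_end=False above
nth_event = (df.between_time(night_start, day_start, inclusive='left')
               .groupby('Night-time indices')[col]
               .apply(lambda s: s.nlargest(n).iloc[-1] if len(s) >= n else np.nan)
               .dropna())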
I'm trying to compare two dataframes and drop rows from the first dataframe that aren't between the dates in the second dataframe (or, equivalently, select the rows that are between the dates in the 2nd dataframe). The selection should be inclusive. This might be really simple, but it's just not clicking for me right now.
Example data is below. Dataframe 1 can be generated using daily data starting July 1 2018 and ending November 30 2018, with random numbers in the 'Number' column. The ... rows in dataframe 1 indicate skipped data; the data is present in the real dataframe.
Dataframe 1:
Number
Date
2018-07-01 15.2
2018-07-02 17.3
2018-07-03 19.5
2018-07-04 13.7
2018-07-05 19.1
...
2018-09-15 30.4
2018-09-16 25.7
2018-09-17 21.2
2018-09-18 19.7
2018-09-19 23.4
...
2018-11-01 30.8
2018-11-02 47.2
2018-11-03 25.3
2018-11-04 39.7
2018-11-05 43.8
Dataframe 2:
Change
Date
2018-07-02 Start
2018-07-04 End
2018-09-16 Start
2018-09-18 End
2018-11-02 Start
2018-11-04 End
With the example above, the output should be:
Number
Date
2018-07-02 17.3
2018-07-03 19.5
2018-07-04 13.7
2018-09-16 25.7
2018-09-17 21.2
2018-09-18 19.7
2018-11-02 47.2
2018-11-03 25.3
2018-11-04 39.7
You can try this, assuming the Start and End markers alternate and the index is sorted; it works because label-based slicing with df[i:j] on a DatetimeIndex is inclusive of both endpoints:
df3 = pd.concat([df[i:j] for i, j in zip(df2.loc[df2['Change']=='Start'].index, df2.loc[df2['Change']=='End'].index)])
Number
Date
2018-07-02 17.3
2018-07-03 19.5
2018-07-04 13.7
2018-09-16 25.7
2018-09-17 21.2
2018-09-18 19.7
2018-11-02 47.2
2018-11-03 25.3
2018-11-04 39.7
You can build an IntervalIndex from df2's index and search in logarithmic time. get_indexer returns -1 for timestamps that fall inside no interval, so keeping the rows where the result is greater than -1 selects exactly the in-range dates.
df2.index = pd.to_datetime(df2.index)
idx = pd.IntervalIndex.from_arrays(df2.index[df2['Change'] == 'Start'],
                                   df2.index[df2['Change'] == 'End'],
                                   closed='both')
df1[idx.get_indexer(pd.to_datetime(df1.index)) > -1]
Number
Date
2018-07-02 17.3
2018-07-03 19.5
2018-07-04 13.7
2018-09-16 25.7
2018-09-17 21.2
2018-09-18 19.7
2018-11-02 47.2
2018-11-03 25.3
2018-11-04 39.7
I have two data frames. One has a row for every five minutes of each day:
df
TIMESTAMP TEMP
1 2011-06-01 00:05:00 24.5
200 2011-06-01 16:40:00 32.0
1000 2011-06-04 11:20:00 30.2
5000 2011-06-18 08:40:00 28.4
10000 2011-07-05 17:20:00 39.4
15000 2011-07-23 02:00:00 29.3
20000 2011-08-09 10:40:00 29.5
30656 2011-09-15 10:40:00 13.8
I have another dataframe that ranks the days
ranked
TEMP DATE RANK
62 43.3 2011-08-02 1.0
63 43.1 2011-08-03 2.0
65 43.1 2011-08-05 3.0
38 43.0 2011-07-09 4.0
66 42.8 2011-08-06 5.0
64 42.5 2011-08-04 6.0
84 42.2 2011-08-24 7.0
56 42.1 2011-07-27 8.0
61 42.1 2011-08-01 9.0
68 42.0 2011-08-08 10.0
Both the TIMESTAMP and DATE columns are datetime datatypes (dtype returns dtype('M8[ns]')).
What I want to do is add a RANK column to df, filling it for each row based on its TIMESTAMP and the corresponding day's rank from ranked (so within a day, all the 5-minute timesteps share the same rank).
So, the final result would look something like this:
df
TIMESTAMP TEMP RANK
1 2011-06-01 00:05:00 24.5 98.0
200 2011-06-01 16:40:00 32.0 98.0
1000 2011-06-04 11:20:00 30.2 96.0
5000 2011-06-18 08:40:00 28.4 50.0
10000 2011-07-05 17:20:00 39.4 9.0
15000 2011-07-23 02:00:00 29.3 45.0
20000 2011-08-09 10:40:00 29.5 40.0
30656 2011-09-15 10:40:00 13.8 100.0
What I have done so far:
# Separate the date and times.
df['DATE'] = df['YYYYMMDDHHmm'].dt.normalize()
df['TIME'] = df['YYYYMMDDHHmm'].dt.time
df = df[['DATE', 'TIME', 'TAIR']]
df['RANK'] = 0
for index, row in df.iterrows():
    df.loc[index, 'RANK'] = ranked[ranked['DATE'] == row['DATE']]['RANK'].values
But I think I'm going in the wrong direction, because iterating row by row takes ages to complete.
How do I improve this code?
IIUC, you can play with indexes to match the values:
df = df.set_index(df.TIMESTAMP.dt.date)\
.assign(RANK=ranked.set_index('DATE').RANK)\
.set_index(df.index)
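If the index alignment above proves fiddly (it requires both indexes to hold the same dtype), here is a merge-based sketch of the same idea, using the TIMESTAMP/DATE/RANK names from the question and assuming ranked['DATE'] is normalized to midnight:
# give df a DATE key at midnight, then pull each day's rank across
df['DATE'] = df['TIMESTAMP'].dt.normalize()
df = df.merge(ranked[['DATE', 'RANK']], on='DATE', how='left').drop(columns='DATE')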