I'm trying to insert a range of date labels in my dataframe, df1. I've managed some part of the way, but I still have some bumps that I want to smooth out.
I'm trying to generate a column with dates from 2017-01-01 to 2020-12-31 with all dates repeated 24 times, i.e., a column with 35,068 rows.
dates = pd.date_range(start="01-01-2017", end="31-12-2020")
num_repeats = 24
repeated_dates = pd.DataFrame(dates.repeat(num_repeats))
df1.insert(0, 'Date', repeated_dates)
However, it only fills in some repetitions of the last date, meaning that my column is NaT for the remaining hours.
output:
Date DK1 Up DK1 Down DK2 Up DK2 Down
0 2017-01-01 0.0 0.0 0.0 0.0
1 2017-01-01 0.0 0.0 0.0 0.0
2 2017-01-01 0.0 0.0 0.0 0.0
3 2017-01-01 0.0 0.0 0.0 0.0
4 2017-01-01 0.0 0.0 0.0 0.0
... ... ... ... ... ...
35063 2020-12-31 0.0 0.0 0.0 0.0
35064 NaT 0.0 0.0 0.0 0.0
35065 NaT 0.0 -54.1 0.0 0.0
35066 NaT 25.5 0.0 0.0 0.0
35067 NaT 0.0 0.0 0.0 0.0
Furthermore, how can I change the date format from '2017-01-01' to '01-01-2017'?
You set this up perfectly, so here are the dates that you have:
import pandas as pd
import numpy as np
dates = pd.date_range(start="01-01-2017", end="31-12-2020")
num_repeats = 24
df = pd.DataFrame(dates.repeat(num_repeats), columns=['date'])
and converting the column to the format you want is simple with the dt.strftime method:
df['newFormat'] = df['date'].dt.strftime('%d-%m-%Y')
Which gives
date newFormat
0 2017-01-01 01-01-2017
1 2017-01-01 01-01-2017
2 2017-01-01 01-01-2017
3 2017-01-01 01-01-2017
4 2017-01-01 01-01-2017
... ... ...
35059 2020-12-31 31-12-2020
35060 2020-12-31 31-12-2020
35061 2020-12-31 31-12-2020
35062 2020-12-31 31-12-2020
35063 2020-12-31 31-12-2020
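Note that strftime returns strings, so newFormat is an object (string) column rather than a datetime one; keep the original date column around if you still need .dt accessors or date arithmetic.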
now
dates = pd.date_range(start="01-01-2017", end="31-12-2020")
gives
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
'2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
'2017-01-09', '2017-01-10',
...
'2020-12-22', '2020-12-23', '2020-12-24', '2020-12-25',
'2020-12-26', '2020-12-27', '2020-12-28', '2020-12-29',
'2020-12-30', '2020-12-31'],
dtype='datetime64[ns]', length=1461, freq='D')
and
1461 * 24 = 35064
so I am not sure where 35,068 comes from. Are you sure about that number?
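If the remaining columns really are hourly values, a sketch that sidesteps the repeat entirely is to build an hourly index directly (this assumes your data starts at midnight on 2017-01-01):
hourly = pd.date_range(start="2017-01-01 00:00", end="2020-12-31 23:00", freq="H")
len(hourly)  # 35064 == 1461 days * 24 hours
which again comes out at 35,064 rows, so it is worth checking where the four extra rows in df1 come from before inserting the column.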
I want to sum up some market volumes based on equal prices in, let's say, 6 hours of 2017.
I have a DataFrame, df1 (market_volumes), that contains the market volumes in some areas. Then I have another DataFrame, df2 (mFRR_price), which contains some market prices.
df1
Date NO1 Up NO1 Down NO2 Up ... DK1 Up DK1 Down DK2 Up DK2 Down
35062 31-12-2020 54.0 0.0 214.0 ... 33.0 0.0 31.0 0.0
35063 31-12-2020 3.0 0.0 121.0 ... 125.0 0.0 21.0 0.0
35064 31-12-2020 0.0 -28.0 0.0 ... 0.0 -9.0 0.0 0.0
35065 31-12-2020 0.0 -83.0 0.0 ... 0.0 0.0 0.0 0.0
35066 31-12-2020 0.0 -80.0 0.0 ... 0.0 -55.0 0.0 0.0
35067 31-12-2020 0.0 -42.0 0.0 ... 79.0 0.0 23.0 0.0
df2
Date NO1 Up NO2 Up NO3 Up ... SE4 Up FI Up DK1 Up DK2 Up
35062 31-12-2020 47.4 47.4 27.2 ... 61.1 61.1 94.1 94.1
35063 31-12-2020 31.0 31.0 25.7 ... 58.0 35.3 89.4 89.4
35064 31-12-2020 24.8 24.8 24.8 ... 54.5 24.8 56.7 56.7
35065 31-12-2020 24.8 24.8 24.8 ... 51.2 28.0 52.4 52.4
35066 31-12-2020 24.6 24.6 24.6 ... 45.8 26.6 51.9 51.9
35067 31-12-2020 24.1 24.1 23.3 ... 24.1 24.1 78.7 78.7
Now, I want to sum up the market volumes from df1 IF the values in a row in df2 are equal to the value in column "NO1 Up".
i.e., I am looking for a way to end up with a new DataFrame that would result in:
df3
Date NO1 Up NO1 Down NO2 Up ... DK1 Up DK1 Down DK2 Up DK2 Down SUM
35062 31-12-2020 54.0 0.0 214.0 ... 33.0 0.0 31.0 0.0 (54+214)
35063 31-12-2020 3.0 0.0 121.0 ... 125.0 0.0 21.0 0.0 (3+121)
35064 31-12-2020 0.0 -28.0 0.0 ... 0.0 -9.0 0.0 0.0 etc.
35065 31-12-2020 0.0 -83.0 0.0 ... 0.0 0.0 0.0 0.0
35066 31-12-2020 0.0 -80.0 0.0 ... 0.0 -55.0 0.0 0.0
35067 31-12-2020 0.0 -42.0 0.0 ... 79.0 0.0 23.0 0.0
... because it locates the area prices that are equal and sums the market volumes at those positions in the DataFrame.
I've been working on this:
market_volumes['sum'] = mFRR_price.eq(mFRR_price['NO1 Up'], axis=0).mul(mFRR_price['NO1 Up'], axis=0).sum(axis=1)
But it sums the values from df2 and puts them in df1. I need the POSITIONS in df2, but the values from df1.
You can use .loc and boolean indexing. The unconditional sum would just be
df3['SUM'] = df3['NO1 Up'] + df3['NO2 Up']
but you only want to add the volumes where the prices match, so condition it on df2:
df1.loc[df2['NO1 Up'] == df2['NO2 Up'], 'SUM'] = df1['NO1 Up'] + df1['NO2 Up']
df1.loc[df2['NO1 Up'] != df2['NO2 Up'], 'SUM'] = 0
The first line goes down df2's index and checks whether the values in columns NO1 Up and NO2 Up are equal. It then creates a column called 'SUM' whose value depends on the outcome of that boolean mask. Where the mask is True, the SUM column gets
df1['NO1 Up'] + df1['NO2 Up']
Conversely, where the outcome is False, pandas inserts NaN into your SUM column.
Not sure if you are OK with NaN values. Most are not, so the second line of code is more or less the inverse of the first: if df2['NO1 Up'] != df2['NO2 Up'], then insert the integer 0 in df1's SUM column.
Again, there are probably other ways to accomplish what you want.
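For instance, if the condition has to hold across more areas than just NO2 Up, here is a sketch of the same idea vectorized over all shared 'Up' columns (it assumes df1 and df2 share the same row index, as in your excerpts):
# price columns that can match 'NO1 Up'
up_cols = [c for c in df2.columns if c.endswith('Up')]

# True where df2's price equals that row's 'NO1 Up' price
mask = df2[up_cols].eq(df2['NO1 Up'], axis=0)

# take df1's volumes at those positions, zero elsewhere, and sum per row
df1['SUM'] = df1[up_cols].where(mask, 0).sum(axis=1)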
I have time series data from "5 Jan 2015" to "28 Dec 2018". I observed that some working days' dates and their values are missing. How can I check how many weekdays are missing in my time range, and which dates they are, so that I can extrapolate the values for those dates?
Example:
Date Price Volume
2018-12-28 172.0 800
2018-12-27 173.6 400
2018-12-26 170.4 500
2018-12-25 171.0 2200
2018-12-21 172.8 800
Looking at the calendar, 21st Dec 2018 was a Friday. Excluding Saturday and Sunday, the dataset should therefore contain "24th Dec 2018", but it's missing. I need to identify such missing dates from the range.
My approach till now:
I tried using
pd.date_range('2015-01-05','2018-12-28',freq='W')
to identify the number of weeks and calculate the number of weekdays from them manually, to get the number of missing dates.
But it didn't solve the purpose, as I need to identify the actual missing dates from the range.
Let's say this is your full dataset:
Date Price Volume
2018-12-28 172.0 800
2018-12-27 173.6 400
2018-12-26 170.4 500
2018-12-25 171.0 2200
2018-12-21 172.8 800
And dates were:
dates = pd.date_range('2018-12-15', '2018-12-31')
First, make sure the Date column is actually a date type:
df['Date'] = pd.to_datetime(df['Date'])
Then set Date as the index:
df = df.set_index('Date')
Then reindex with unutbu's solution:
df = df.reindex(dates, fill_value=0.0)
Then reset the index to make it easier to work with:
df = df.reset_index()
It now looks like this:
index Price Volume
0 2018-12-15 0.0 0.0
1 2018-12-16 0.0 0.0
2 2018-12-17 0.0 0.0
3 2018-12-18 0.0 0.0
4 2018-12-19 0.0 0.0
5 2018-12-20 0.0 0.0
6 2018-12-21 172.8 800.0
7 2018-12-22 0.0 0.0
8 2018-12-23 0.0 0.0
9 2018-12-24 0.0 0.0
10 2018-12-25 171.0 2200.0
11 2018-12-26 170.4 500.0
12 2018-12-27 173.6 400.0
13 2018-12-28 172.0 800.0
14 2018-12-29 0.0 0.0
15 2018-12-30 0.0 0.0
16 2018-12-31 0.0 0.0
Do:
df['weekday'] = df['index'].dt.dayofweek
Finally, select the weekdays that are missing in your time range:
missing_weekdays = df[(~df['weekday'].isin([5,6])) & (df['Volume'] == 0.0)]
Result:
>>> missing_weekdays
index Price Volume weekday
2 2018-12-17 0.0 0.0 0
3 2018-12-18 0.0 0.0 1
4 2018-12-19 0.0 0.0 2
5 2018-12-20 0.0 0.0 3
9 2018-12-24 0.0 0.0 0
16 2018-12-31 0.0 0.0 0
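For completeness, a more direct sketch of the same check uses a business-day range and DatetimeIndex.difference (note that freq='B' knows nothing about holidays such as Christmas):
import pandas as pd

bdays = pd.date_range('2015-01-05', '2018-12-28', freq='B')  # Mon-Fri only
missing = bdays.difference(pd.to_datetime(df['Date']))       # weekdays absent from df
print(len(missing))
print(missing)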
I have two dataframes that I need to merge based on date. The first dataframe looks like:
Time Stamp HP_1H_mean Coolant1_1H_mean Extreme_1H_mean
0 2019-07-26 07:00:00 410.637966 414.607081 0.0
1 2019-07-26 08:00:00 403.521735 424.787366 0.0
2 2019-07-26 09:00:00 403.143925 425.739639 0.0
3 2019-07-26 10:00:00 410.542895 426.210538 0.0
...
17 2019-07-27 00:00:00 0.000000 0.000000 0.0
18 2019-07-27 01:00:00 0.000000 0.000000 0.0
19 2019-07-27 02:00:00 0.000000 0.000000 0.0
20 2019-07-27 03:00:00 0.000000 0.000000 0.0
The second is like this:
Time Stamp Qty Compl
0 2019-07-26 150
1 2019-07-27 20
2 2019-07-29 230
3 2019-07-30 230
4 2019-07-31 170
Both Time Stamp columns are datetime64[ns]. I wanted to merge left and forward-fill the date into all the other rows for a day. My problem is that at the merge, the Qty Compl from the second df is applied only at midnight of each day, and some days do not have a midnight time stamp, such as the first day in the first dataframe.
Is there a way to merge and match every row that contains the same day? The desired output would look like this:
Time Stamp HP_1H_mean Coolant1_1H_mean Extreme_1H_mean Qty Compl
0 2019-07-26 07:00:00 410.637966 414.607081 0.0 150
1 2019-07-26 08:00:00 403.521735 424.787366 0.0 150
2 2019-07-26 09:00:00 403.143925 425.739639 0.0 150
3 2019-07-26 10:00:00 410.542895 426.210538 0.0 150
...
17 2019-07-27 00:00:00 0.000000 0.000000 0.0 20
18 2019-07-27 01:00:00 0.000000 0.000000 0.0 20
19 2019-07-27 02:00:00 0.000000 0.000000 0.0 20
20 2019-07-27 03:00:00 0.000000 0.000000 0.0 20
Use merge_asof with both DataFrames sorted by their datetimes:
#if necessary
df1['Time Stamp'] = pd.to_datetime(df1['Time Stamp'])
df2['Time Stamp'] = pd.to_datetime(df2['Time Stamp'])
df1 = df1.sort_values('Time Stamp')
df2 = df2.sort_values('Time Stamp')
df = pd.merge_asof(df1, df2, on='Time Stamp')
print(df)
Time Stamp HP_1H_mean Coolant1_1H_mean Extreme_1H_mean \
0 2019-07-26 07:00:00 410.637966 414.607081 0.0
1 2019-07-26 08:00:00 403.521735 424.787366 0.0
2 2019-07-26 09:00:00 403.143925 425.739639 0.0
3 2019-07-26 10:00:00 410.542895 426.210538 0.0
4 2019-07-27 00:00:00 0.000000 0.000000 0.0
5 2019-07-27 01:00:00 0.000000 0.000000 0.0
6 2019-07-27 02:00:00 0.000000 0.000000 0.0
7 2019-07-27 03:00:00 0.000000 0.000000 0.0
Qty Compl
0 150
1 150
2 150
3 150
4 20
5 20
6 20
7 20
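An alternative sketch that matches rows on the calendar day explicitly, via a normalized helper key (the 'day' column is introduced here only for the join):
# midnight-of-day key for the hourly frame
df1['day'] = df1['Time Stamp'].dt.normalize()

# df2's 'Time Stamp' is already at midnight, so it can act as the same key
df = (df1.merge(df2.rename(columns={'Time Stamp': 'day'}), on='day', how='left')
         .drop(columns='day'))
Unlike merge_asof, this leaves NaN for days that have no row in df2 instead of carrying the previous day's value forward.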
Even though this seems really simple, it drives me nuts. Why is .astype(int) not changing the floats to ints? Thank you
import numpy as np
import pandas as pd

df_new = pd.crosstab(df["date"], df["place"]).reset_index()
places = ['cityA', "cityB", "cityC"]
df_new[places] = df_new[places].fillna(0).astype(int)
sums = df_new.select_dtypes(np.number).sum().rename('total')
df_new = df_new.append(sums)
print(df_new)
Output:
place date cityA cityB cityC
0 2008-01-01 0.0 0.0 51.0
1 2009-06-01 0.0 618.0 0.0
2 2015-07-01 549.0 0.0 0.0
3 2016-01-01 41.0 0.0 0.0
4 2016-04-01 62.0 0.0 0.0
5 2017-01-01 800.0 0.0 0.0
6 2018-07-01 69.0 0.0 0.0
total NaT 1521.0 618.0 51.0
If there are NAs (which are floats in Pandas), the other values will be floats as well. Here, the appended totals row is a single Series with no 'date' value; that one NaN forces the whole row, and with it the numeric columns, to float.
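If you want integers even when missing values are present, a sketch using pandas' nullable integer dtype (the capital-I 'Int64', available since pandas 0.24):
import numpy as np
import pandas as pd

s = pd.Series([1.0, 618.0, np.nan])
# s.astype(int) would raise: plain int64 cannot represent NaN
print(s.astype('Int64'))  # 1, 618, <NA>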
I'm trying to restructure a large DataFrame of the following form as a MultiIndex:
date store_nbr item_nbr units snowfall preciptotal event
0 2012-01-01 1 1 0 0.0 0.0 0.0
1 2012-01-01 1 2 0 0.0 0.0 0.0
2 2012-01-01 1 3 0 0.0 0.0 0.0
3 2012-01-01 1 4 0 0.0 0.0 0.0
4 2012-01-01 1 5 0 0.0 0.0 0.0
I want to group by store_nbr (1-45), within each store_nbr group by item_nbr (1-111) and then for the corresponding index pair (e.g., store_nbr=12, item_nbr=109), display the rows in chronological order, so that ordered rows will look like, for example:
store_nbr=12, item_nbr=109: date=2014-02-06, units=0, snowfall=...
date=2014-02-07, units=0, snowfall=...
date=2014-02-08, units=0, snowfall=...
... ...
store_nbr=12, item_nbr=110: date=2014-02-06, units=0, snowfall=...
date=2014-02-07, units=1, snowfall=...
date=2014-02-08, units=1, snowfall=...
...
It looks like some combination of groupby and set_index might be useful here, but I'm getting stuck after the following line:
grouped = stores.set_index(['store_nbr', 'item_nbr'])
This produces the following MultiIndex:
date units snowfall preciptotal event
store_nbr item_nbr
1 1 2012-01-01 0 0.0 0.0 0.0
2 2012-01-01 0 0.0 0.0 0.0
3 2012-01-01 0 0.0 0.0 0.0
4 2012-01-01 0 0.0 0.0 0.0
5 2012-01-01 0 0.0 0.0 0.0
Does anyone have any suggestions from here? Is there an easy way to do this by manipulating groupby objects?
You can sort your rows with:
df.sort_values(by='date')
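Combined with the set_index step from the question, a sketch that produces the grouped, chronologically ordered view you describe:
# sort by store, then item, then date, and index on the (store, item) pairs
grouped = (stores.sort_values(['store_nbr', 'item_nbr', 'date'])
                 .set_index(['store_nbr', 'item_nbr']))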