Pivoting a pandas dataframe - python

I have the following dataframe:
ID Datetime Y
0 1000 00:29:59 0.117
1 1000 00:59:59 0.050
2 1000 01:29:59 0.025
3 1000 01:59:59 0.025
4 1000 02:29:59 0.049
... ... ...
48973133 2999 21:59:59 0.618
48973134 2999 22:29:59 0.495
48973135 2999 22:59:59 0.745
48973136 2999 23:29:59 0.514
48973137 2999 23:59:59 0.419
The Datetime column is not actually in that format, here it is:
0 00:29:59
1 00:59:59
2 01:29:59
3 01:59:59
4 02:29:59
...
48973133 21:59:59
48973134 22:29:59
48973135 22:59:59
48973136 23:29:59
48973137 23:59:59
Name: Datetime, Length: 48973138, dtype: object
I am trying to run the following pivot code:
print(df.assign(group=df.index//48).pivot(index='group', values='Y', columns=df['Datetime'][0:48]))
But I am getting the following error:
KeyError: '00:29:59'
How can I fix it? I expect to get 48 columns (1 day of half-hourly measured data) in the pivoted dataframe, so my columns should be:
00:29:59 00:59:59 01:29:59 ... 23:29:59 23:59:59
The first row should have the first 48 values of Y, the second row should have the next 48, and so on.
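The KeyError most likely comes from passing a Series of values to columns=; DataFrame.pivot expects a column label there. A minimal sketch of the intended reshape, assuming every consecutive block of 48 rows really is one full day:
# pass the column name, not its values, to columns=
wide = (df.assign(group=df.index // 48)
          .pivot(index='group', columns='Datetime', values='Y'))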
EDIT: Picture of the cumcount() issue:

Based on your comment, it seems you have the same ID for multiple days. I would therefore suggest keeping track of the day with cumcount before pivoting:
df['Day'] = df.groupby(['ID', 'Datetime']).cumcount()
df.pivot(index=['ID', 'Day'], values='Y', columns='Datetime')
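For illustration, here is a tiny made-up frame (not the asker's data) showing what cumcount does: it numbers the repeated (ID, Datetime) pairs 0, 1, 2, ..., which works as a per-day counter (a list for index= in pivot needs pandas 1.1+):
import pandas as pd

# toy data: ID 1000 has two days of readings, ID 2999 has one
toy = pd.DataFrame({
    'ID':       [1000, 1000, 1000, 1000, 2999, 2999],
    'Datetime': ['00:29:59', '00:59:59', '00:29:59', '00:59:59', '00:29:59', '00:59:59'],
    'Y':        [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

toy['Day'] = toy.groupby(['ID', 'Datetime']).cumcount()
print(toy.pivot(index=['ID', 'Day'], values='Y', columns='Datetime'))
# Datetime   00:29:59  00:59:59
# ID   Day
# 1000 0          0.1       0.2
#      1          0.3       0.4
# 2999 0          0.5       0.6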
Edit: based on your comment under my answer, it seems that not all days have all timestamps. A solution could be to generate the right number of timestamps (repeating [00:29:59 00:59:59 01:29:59 ... 23:29:59 23:59:59]) and add missing values to df. This would be quite CPU intensive though:
import pandas as pd
from itertools import cycle

# gapless, repeating sequence of the 48 half-hourly Datetime strings
dt = (x for i in range(24) for x in [f"{i}:29:59".zfill(8), f"{i}:59:59".zfill(8)])

for i, t in enumerate(cycle(dt)):
    if i == len(df):
        break
    if df.loc[i, 'Datetime'] != t:
        if t == "00:29:59":  # a new day starts: take the ID of the current row
            id_ = df.loc[i, 'ID']
        else:                # otherwise reuse the ID of the previous row
            id_ = df.loc[i-1, 'ID']
        # insert a placeholder row (Y stays NaN) for the missing timestamp
        df = pd.concat([df.loc[0:i-1],
                        pd.DataFrame({'ID': id_, 'Datetime': [t]}),
                        df.loc[i:]],
                       ignore_index=True)
Then apply groupby and pivot like shown above.
Edit2: using cycle instead of chain + tee

You can use DataFrame.pivot_table to avoid ValueError: Index contains duplicate entries, cannot reshape - if there are multiple values for the same ID and Datetime, they are aggregated; here the default function mean is used:
td = pd.timedelta_range('00:00:00','24:00:00', freq='30Min')[1:]
td = [f'{x - pd.Timedelta("1 sec")}'[-8:] for x in td]
print (td)
['00:29:59', '00:59:59', '01:29:59', '01:59:59', '02:29:59', '02:59:59', '03:29:59', '03:59:59', '04:29:59', '04:59:59', '05:29:59', '05:59:59', '06:29:59', '06:59:59', '07:29:59', '07:59:59', '08:29:59', '08:59:59', '09:29:59', '09:59:59', '10:29:59', '10:59:59', '11:29:59', '11:59:59', '12:29:59', '12:59:59', '13:29:59', '13:59:59', '14:29:59', '14:59:59', '15:29:59', '15:59:59', '16:29:59', '16:59:59', '17:29:59', '17:59:59', '18:29:59', '18:59:59', '19:29:59', '19:59:59', '20:29:59', '20:59:59', '21:29:59', '21:59:59', '22:29:59', '22:59:59', '23:29:59', '23:59:59']
df1 = df.pivot_table(index='ID', columns='Datetime', values='Y', aggfunc='mean')
print (df1)
Datetime 00:29:59 00:59:59 01:29:59 01:59:59 02:29:59 21:59:59 \
ID
1000 0.117 0.05 0.025 0.025 0.049 NaN
2999 NaN NaN NaN NaN NaN 0.618
Datetime 22:29:59 22:59:59 23:59:59
ID
1000 NaN NaN NaN
2999 0.495 0.745 0.4665
If you need all times, add DataFrame.reindex:
df1 = (df.pivot_table(index='ID', columns='Datetime', values='Y', aggfunc='mean')
         .reindex(td, axis=1))
print (df1)
Datetime 00:29:59 00:59:59 01:29:59 01:59:59 02:29:59 02:59:59 \
ID
1000 0.117 0.05 0.025 0.025 0.049 NaN
2999 NaN NaN NaN NaN NaN NaN
Datetime 03:29:59 03:59:59 04:29:59 04:59:59 05:29:59 05:59:59 \
ID
1000 NaN NaN NaN NaN NaN NaN
2999 NaN NaN NaN NaN NaN NaN
Datetime 06:29:59 06:59:59 07:29:59 07:59:59 08:29:59 08:59:59 \
ID
1000 NaN NaN NaN NaN NaN NaN
2999 NaN NaN NaN NaN NaN NaN
Datetime 09:29:59 09:59:59 10:29:59 10:59:59 11:29:59 11:59:59 \
ID
1000 NaN NaN NaN NaN NaN NaN
2999 NaN NaN NaN NaN NaN NaN
Datetime 12:29:59 12:59:59 13:29:59 13:59:59 14:29:59 14:59:59 \
ID
1000 NaN NaN NaN NaN NaN NaN
2999 NaN NaN NaN NaN NaN NaN
Datetime 15:29:59 15:59:59 16:29:59 16:59:59 17:29:59 17:59:59 \
ID
1000 NaN NaN NaN NaN NaN NaN
2999 NaN NaN NaN NaN NaN NaN
Datetime 18:29:59 18:59:59 19:29:59 19:59:59 20:29:59 20:59:59 \
ID
1000 NaN NaN NaN NaN NaN NaN
2999 NaN NaN NaN NaN NaN NaN
Datetime 21:29:59 21:59:59 22:29:59 22:59:59 23:29:59 23:59:59
ID
1000 NaN NaN NaN NaN NaN NaN
2999 NaN 0.618 0.495 0.745 NaN 0.4665

Related

yfinance shows 2 rows for the same day with Nan values

I'm using the yfinance library with 2 tickers (^BVSP and BRL=X), but when I display the dataframe it shows 2 rows per day, where each row contains the information of only one ticker. The information about the other ticker is NaN. I want to put all the information in one row.
How can I solve this?
I tried this
import datetime
import yfinance as yf

dados_bolsa = ["^BVSP", "BRL=X"]
today = datetime.datetime.now()
one_year = today - datetime.timedelta(days=365)
print(one_year)
dados_mercado = yf.download(dados_bolsa, one_year, today)
display(dados_mercado)
i get
2022-02-06 13:27:29.158181
[*********************100%***********************] 2 of 2 completed
Adj Close Close High Low Open Volume
BRL=X ^BVSP BRL=X ^BVSP BRL=X ^BVSP BRL=X ^BVSP BRL=X ^BVSP BRL=X ^BVSP
Date
2022-02-07 00:00:00+00:00 5.3269 NaN 5.3269 NaN 5.3430 NaN 5.276800 NaN 5.326200 NaN 0.0 NaN
2022-02-07 03:00:00+00:00 NaN 111996.00000 NaN 111996.00000 NaN 112517.000000 NaN 111490.00000 NaN 112247.000000 NaN 10672800.0
2022-02-08 00:00:00+00:00 5.2626 NaN 5.2626 NaN 5.2849 NaN 5.251000 NaN 5.262800 NaN 0.0 NaN
2022-02-08 03:00:00+00:00 NaN 112234.00000 NaN 112234.00000 NaN 112251.000000 NaN 110943.00000 NaN 111995.000000 NaN 10157500.0
2022-02-09 00:00:00+00:00 5.2584 NaN 5.2584 NaN 5.2880 NaN 5.232774 NaN 5.256489 NaN 0.0 NaN
Note that we have 2 rows for the same day with NaN. I want just one row, but with all the information.
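One way to collapse the pairs into a single row per day (a sketch, assuming the two rows of each day always fall on the same UTC calendar date): normalize the index to midnight and keep the first non-NaN value per column:
# drop the time-of-day part so both rows of a day share the same index value
dados_mercado.index = dados_mercado.index.normalize()

# groupby(...).first() keeps the first non-NaN value per column, merging the two rows
dados_mercado = dados_mercado.groupby(level=0).first()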

Filling NaN rows in big pandas datetime indexed dataframe using other not NaN rows values

I have a big weather CSV dataframe containing several hundred thousand rows as well as many columns. The rows are time series sampled every 10 minutes over many years. The datetime index consists of year, month, day, hour, minute and second. Unfortunately, there are several thousand missing rows containing only NaNs. The goal is to fill these rows using the values of other rows collected at the same time of day and same day of the year, but in other years, provided those are not NaN.
I wrote a Python for-loop solution, but it seems very time consuming. I need your help for a more efficient and faster solution.
The raw dataframe is as follows:
print(df)
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:10:00 996.52 -8.02 265.40 -8.90 93.30
2004-01-01 00:20:00 996.57 -8.41 265.01 -9.28 93.40
2004-01-01 00:40:00 996.51 -8.31 265.12 -9.07 94.20
2004-01-01 00:50:00 996.51 -8.27 265.15 -9.04 94.10
2004-01-01 01:00:00 996.53 -8.51 264.91 -9.31 93.90
... ... ... ... ... ...
2020-12-31 23:20:00 1000.07 -4.05 269.10 -8.13 73.10
2020-12-31 23:30:00 999.93 -3.35 269.81 -8.06 69.71
2020-12-31 23:40:00 999.82 -3.16 270.01 -8.21 67.91
2020-12-31 23:50:00 999.81 -4.23 268.94 -8.53 71.80
2021-01-01 00:00:00 999.82 -4.82 268.36 -8.42 75.70
[820551 rows x 5 columns]
For whatever reason, there are missing rows in the df dataframe. To identify them, it is possible to apply the function below:
findnanrows(df.groupby(pd.Grouper(freq='10T')).mean())
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:30:00 NaN NaN NaN NaN NaN
2009-10-08 09:50:00 NaN NaN NaN NaN NaN
2009-10-08 10:00:00 NaN NaN NaN NaN NaN
2013-05-16 09:00:00 NaN NaN NaN NaN NaN
2014-07-30 08:10:00 NaN NaN NaN NaN NaN
... ... ... ... ... ...
2016-10-28 12:00:00 NaN NaN NaN NaN NaN
2016-10-28 12:10:00 NaN NaN NaN NaN NaN
2016-10-28 12:20:00 NaN NaN NaN NaN NaN
2016-10-28 12:30:00 NaN NaN NaN NaN NaN
2016-10-28 12:40:00 NaN NaN NaN NaN NaN
[5440 rows x 5 columns]
The aim is to fill all these NaN rows. As an example, the first NaN row, which corresponds to the datetime 2004-01-01 00:30:00, should be filled with the non-NaN values of another row collected at the same datetime xxxx-01-01 00:30:00 of another year, like 2005-01-01 00:30:00 or 2006-01-01 00:30:00 and so on, or even 2003-01-01 00:30:00 or 2002-01-01 00:30:00 if they exist. It is possible to apply an average over all these other years.
Here are the values of the row with the datetime index 2005-01-01 00:30:00:
print(df.loc["2005-01-01 00:30:00", :])
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2005-01-01 00:30:00 996.36 12.67 286.13 7.11 68.82
After filling the row corresponding to the index datetime 2004-01-01 00:30:00 using the values of the row having the index datetime 2005-01-01 00:30:00, the df dataframe will have the following row:
print(df.loc["2004-01-01 00:30:00", :])
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:30:00 996.36 12.67 286.13 7.11 68.82
The two functions that I created are the following. The first is to identify the NaN rows. The second is to fill them.
import numpy as np
import pandas as pd

def findnanrows(df):
    is_NaN = df.isnull()
    row_has_NaN = is_NaN.any(axis=1)
    rows_with_NaN = df[row_has_NaN]
    return rows_with_NaN

def filldata(weatherdata):
    fillweatherdata = weatherdata.copy()
    allyears = fillweatherdata.index.year.unique().tolist()
    dfnan = findnanrows(fillweatherdata.groupby(pd.Grouper(freq='10T')).mean())
    for i in range(dfnan.shape[0]):
        dnan = dfnan.index[i]
        if dnan.year == min(allyears):
            # earliest year: search later years for a non-NaN row at the same timestamp
            y = 0
            dnew = dnan.replace(year=dnan.year+y)
            while dnew in dfnan.index:
                dnew = dnew.replace(year=dnew.year+y)
                y += 1
        else:
            # otherwise search earlier years
            y = 0
            dnew = dnan.replace(year=dnan.year-y)
            while dnew in dfnan.index:
                dnew = dnew.replace(year=dnew.year-y)
                y += 1
        new_row = pd.DataFrame(np.array([fillweatherdata.loc[dnew, :]]).tolist(),
                               columns=fillweatherdata.columns.tolist(), index=[dnan])
        fillweatherdata = pd.concat([fillweatherdata, pd.DataFrame(new_row)], ignore_index=False)
        #fillweatherdata = fillweatherdata.drop_duplicates()
    fillweatherdata = fillweatherdata.sort_index()
    return fillweatherdata
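A possible vectorized alternative (a sketch, not the asker's code, assuming all timestamps sit exactly on the 10-minute grid): reindex to a complete grid, then fill each gap with the mean of the same month/day/time-of-day taken over the other years:
import pandas as pd

# complete 10-minute grid over the whole period; the missing rows appear as NaN
full_index = pd.date_range(df.index.min(), df.index.max(), freq='10T')
filled = df.reindex(full_index)

# group rows by calendar position (month, day, time of day) and fill the NaNs
# with the per-group mean computed over the years that do have data
key = [filled.index.month, filled.index.day, filled.index.time]
filled = filled.fillna(filled.groupby(key).transform('mean'))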

Pandas trying to make values within a column into new columns after groupby on column

My original dataframe looked like:
timestamp variables value
1 2017-05-26 19:46:41.289 inf 0.000000
2 2017-05-26 20:40:41.243 tubavg 225.489639
... ... ... ...
899541 2017-05-02 20:54:41.574 caspre 684.486450
899542 2017-04-29 11:17:25.126 tvol 50.895000
Now I want to bucket this dataset by time, which can be done with the code:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby(pd.Grouper(key='timestamp', freq='5min'))
But I also want all the different metrics to become columns in the new dataframe. For example the first two rows from the original dataframe would look like:
timestamp inf tubavg caspre tvol ...
1 2017-05-26 19:46:41.289 0.000000 225.489639 xxxxxxx xxxxx
... ... ... ...
xxxxx 2017-05-02 20:54:41.574 xxxxxx xxxxxx 684.486450 50.895000
As can be seen, the time has been bucketed into 5-minute intervals, and the idea is to take all the values appearing in the variables column and turn them into columns for every bucket. Each bucket keeps the timestamp of the first value that fell into it.
In order to solve this, I have tried a couple of different solutions, but I can't seem to find anything that works without constant errors.
Try unstacking the variables column from rows to columns with .unstack(1). The parameter is 1, because we want the second index column (0 would be the first).
Then, drop the level of the multi-index you just created to make it a little bit cleaner with .droplevel().
Finally, use pd.Grouper. Since the date/time is on the index, you don't need to specify a key.
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.set_index(['timestamp','variables']).unstack(1)
df.columns = df.columns.droplevel()
df = df.groupby(pd.Grouper(freq='5min')).mean().reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-04-29 11:20:00 NaN NaN NaN NaN
2 2017-04-29 11:25:00 NaN NaN NaN NaN
3 2017-04-29 11:30:00 NaN NaN NaN NaN
4 2017-04-29 11:35:00 NaN NaN NaN NaN
... ... ... ... ...
7885 2017-05-26 20:20:00 NaN NaN NaN NaN
7886 2017-05-26 20:25:00 NaN NaN NaN NaN
7887 2017-05-26 20:30:00 NaN NaN NaN NaN
7888 2017-05-26 20:35:00 NaN NaN NaN NaN
7889 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
Another way would be to .groupby the variables as well and then .unstack(1) again:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby([pd.Grouper(freq='5min', key='timestamp'), 'variables']).mean().unstack(1)
df.columns = df.columns.droplevel()
df = df.reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-05-02 20:50:00 684.48645 NaN NaN NaN
2 2017-05-26 19:45:00 NaN 0.0 NaN NaN
3 2017-05-26 20:40:00 NaN NaN 225.489639 NaN

Groupby time bins in multilevel index

I have a sparsely filled data frame that looks like this:
entity_id 59e75f2b9e182f68cf25721d 59e75f2bc0bd722a5f395ee9 59e75f2c05e40310ebe1f433 ...
organisation_id group_id datetime ...
59e7515edb84e482acce8339 59e75177575fc94638c1f8e7 2018-04-01 02:01:00 NaN NaN NaN ...
2018-04-01 02:02:00 NaN 2.15 NaN ...
2018-04-01 02:03:00 NaN NaN 3.689 ...
2018-04-01 02:04:00 NaN NaN NaN ...
2018-04-01 02:05:00 NaN NaN NaN ...
... ... ... ... ...
5cb590649f18c69541d34f7a 2019-04-01 01:55:00 NaN NaN NaN ...
2019-04-01 01:56:00 NaN NaN NaN ...
2019-04-01 01:57:00 NaN NaN NaN ...
2019-04-01 01:58:00 NaN NaN NaN ...
2019-04-01 01:59:00 NaN NaN NaN ...
I would like to group this frame by group_id and 10-minute bins applied to the datetime index (for each group I want values that occurred inside the same 10-minute window to be grouped, so I can take the mean over columns, essentially disregarding the minute portion of the datetime index).
I have tried using pd.Grouper(freq='10T'), but that doesn't seem to work in conjunction with multilevel indices.
group_mean = frame.groupby(
    pd.Grouper(freq='10T'), level='datetime').mean(axis=1)
This gives me the error message
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'MultiIndex'
For reference, my wanted output should look something like this:
group_mean
organisation_id group_id datetime
59e7515edb84e482acce8339 59e75177575fc94638c1f8e7 2018-04-01 02:10:00 mean(axis=1)
2018-04-01 02:20:00 mean(axis=1)
...
5cb590649f18c69541d34f7a 2019-04-01 01:50:00 mean(axis=1)
2019-04-01 02:00:00 mean(axis=1)
...
where mean(axis=1) is the mean of all columns that are not NaN for that specific group and time bin.
The solution needs a DatetimeIndex, so first convert the other levels to columns and add them to the groupby list:
Notice: the mean is per group, not per column.
group_mean = (frame.reset_index(['organisation_id','group_id'])
                   .groupby(['organisation_id',
                             'group_id',
                             pd.Grouper(freq='10T', level='datetime')])
                   .mean())
If you need the mean across columns:
df = frame.mean(axis=1)
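To get exactly the wanted output (one mean over all entity columns per group and 10-minute bin), the two steps can be combined; a sketch, assuming the index levels are named as shown above:
import pandas as pd

# mean across the entity columns first, then group the resulting Series
row_mean = frame.mean(axis=1).rename('group_mean')

group_mean = (row_mean.reset_index(['organisation_id', 'group_id'])
                      .groupby(['organisation_id',
                                'group_id',
                                pd.Grouper(freq='10T', level='datetime')])['group_mean']
                      .mean())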

pandas - Extend Index of a DataFrame setting all columns for new rows to NaN?

I have time-indexed data:
from datetime import date
import pandas as pd

df2 = pd.DataFrame({'day': pd.Series([date(2012, 1, 1), date(2012, 1, 3)]),
                    'b': pd.Series([0.22, 0.3])})
df2 = df2.set_index('day')
df2
b
day
2012-01-01 0.22
2012-01-03 0.30
What is the best way to extend this data frame so that it has one row for every day in January 2012 (say), where all columns are set to NaN (here only b) where we don't have data?
So the desired result would be:
b
day
2012-01-01 0.22
2012-01-02 NaN
2012-01-03 0.30
2012-01-04 NaN
...
2012-01-31 NaN
Many thanks!
Use this (current as of pandas 1.1.3):
ix = pd.date_range(start=date(2012, 1, 1), end=date(2012, 1, 31), freq='D')
df2.reindex(ix)
Which gives:
b
2012-01-01 0.22
2012-01-02 NaN
2012-01-03 0.30
2012-01-04 NaN
2012-01-05 NaN
[...]
2012-01-29 NaN
2012-01-30 NaN
2012-01-31 NaN
For older versions of pandas replace pd.date_range with pd.DatetimeIndex.
You can set a daily frequency with asfreq('D'); without specifying a fill method, missing values will be NaN-filled, as you desired:
df3 = df2.asfreq('D')
df3
Out[16]:
b
2012-01-01 0.22
2012-01-02 NaN
2012-01-03 0.30
To answer your second part, I can't think of a more elegant way at the moment:
df3 = pd.DataFrame({'day': pd.Series([date(2012, 1, 4), date(2012, 1, 31)])})
df3.set_index('day', inplace=True)
merged = pd.concat([df2, df3])  # df2.append(df3) on pandas < 2.0, where append still exists
merged = merged.asfreq('D')
merged
Out[46]:
b
2012-01-01 0.22
2012-01-02 NaN
2012-01-03 0.30
2012-01-04 NaN
2012-01-05 NaN
2012-01-06 NaN
2012-01-07 NaN
2012-01-08 NaN
2012-01-09 NaN
2012-01-10 NaN
2012-01-11 NaN
2012-01-12 NaN
2012-01-13 NaN
2012-01-14 NaN
2012-01-15 NaN
2012-01-16 NaN
2012-01-17 NaN
2012-01-18 NaN
2012-01-19 NaN
2012-01-20 NaN
2012-01-21 NaN
2012-01-22 NaN
2012-01-23 NaN
2012-01-24 NaN
2012-01-25 NaN
2012-01-26 NaN
2012-01-27 NaN
2012-01-28 NaN
2012-01-29 NaN
2012-01-30 NaN
2012-01-31 NaN
This constructs a second time series and then we just concatenate the two and call asfreq('D') as before.
Here's another option:
First add a NaN record on the last day you want, then resample. This way resampling will fill the missing dates for you.
Starting Frame:
import pandas as pd
import numpy as np
from datetime import date
df2 = pd.DataFrame({ 'day': pd.Series([date(2012, 1, 1), date(2012, 1, 3)]), 'b' : pd.Series([0.22, 0.3]) })
df2= df2.set_index('day')
df2
Out:
b
day
2012-01-01 0.22
2012-01-03 0.30
Filled Frame:
df2.loc[date(2012, 1, 31), 'b'] = np.nan  # set_value() was removed in newer pandas
df2.asfreq('D')
Out:
b
day
2012-01-01 0.22
2012-01-02 NaN
2012-01-03 0.30
2012-01-04 NaN
2012-01-05 NaN
2012-01-06 NaN
2012-01-07 NaN
2012-01-08 NaN
2012-01-09 NaN
2012-01-10 NaN
2012-01-11 NaN
2012-01-12 NaN
2012-01-13 NaN
2012-01-14 NaN
2012-01-15 NaN
2012-01-16 NaN
2012-01-17 NaN
2012-01-18 NaN
2012-01-19 NaN
2012-01-20 NaN
2012-01-21 NaN
2012-01-22 NaN
2012-01-23 NaN
2012-01-24 NaN
2012-01-25 NaN
2012-01-26 NaN
2012-01-27 NaN
2012-01-28 NaN
2012-01-29 NaN
2012-01-30 NaN
2012-01-31 NaN
Mark's answer no longer seems to work on pandas 1.1.1.
However, using the same idea, the following works:
from datetime import datetime
import pandas as pd
# get start and desired end dates
first_date = df['date'].min()
today = datetime.today()
# set index
df.set_index('date', inplace=True)
# and here is where the magic happens
idx = pd.date_range(first_date, today, freq='D')
df = df.reindex(idx)
EDIT: just found out that this exact use case is in the docs:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html#pandas.DataFrame.reindex
import datetime

def extendframe(df, ndays):
    """
    (df, ndays) -> df that is padded by ndays in beginning and end
    """
    ixd = df.index - datetime.timedelta(ndays)
    ixu = df.index + datetime.timedelta(ndays)
    ixx = df.index.union(ixd.union(ixu))
    df_ = df.reindex(ixx)
    return df_
This is not exactly the question, since here you know that the second index is all days in January, but suppose you have another index, say from another data frame df1, which might be disjoint and have a random frequency. Then you can do this:
ix = pd.DatetimeIndex(list(df2.index) + list(df1.index)).unique().sort_values()
df2.reindex(ix)
Converting indices to lists allows one to create a longer list in a natural way.
