How to select the first valid rows in a pandas dataframe? - python

I have a pd.DataFrame with a time series as index:
a b
2018-01-02 12:30:00+00:00 NaN NaN
2018-01-02 13:45:00+00:00 NaN 232.0
2018-01-02 14:00:00+00:00 133.0 133.0
2018-01-02 14:15:00+00:00 134.0 134.0
I am interested in preserving the first non-NaN value of each column; all other elements should be NaN:
a b
2018-01-02 12:30:00+00:00 NaN NaN
2018-01-02 13:45:00+00:00 NaN 232.0
2018-01-02 14:00:00+00:00 133.0 NaN
2018-01-02 14:15:00+00:00 NaN NaN
Does pandas/numpy have an operation to achieve this in a vectorized way (without writing for loops)?

You can apply Series.first_valid_index per column and mask the other rows with NaN:
df[df.apply(lambda col: col.index != col.first_valid_index())] = np.nan
print(df)
a b
2018-01-02 12:30:00+00:00 NaN NaN
2018-01-02 13:45:00+00:00 NaN 232.0
2018-01-02 14:00:00+00:00 133.0 NaN
2018-01-02 14:15:00+00:00 NaN NaN

Using boolean masking:
m1 = df.isna().cummin() # True for the NaNs before the first non-NaN
m2 = m1.shift(fill_value=True) # m1 moved down one row; True on the first row so a valid first row is kept
out = df.where(m2&~m1)
output:
a b
2018-01-02 12:30:00+00:00 NaN NaN
2018-01-02 13:45:00+00:00 NaN 232.0
2018-01-02 14:00:00+00:00 133.0 NaN
2018-01-02 14:15:00+00:00 NaN NaN
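A further variant, shown here only as a minimal sketch: count the valid values cumulatively per column and keep the cell where that count first reaches 1. The frame below rebuilds the sample data from the question; the names valid and first_valid are illustrative and not taken from the answers above.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question.
idx = pd.to_datetime(
    ["2018-01-02 12:30", "2018-01-02 13:45",
     "2018-01-02 14:00", "2018-01-02 14:15"],
    utc=True,
)
df = pd.DataFrame(
    {"a": [np.nan, np.nan, 133.0, 134.0],
     "b": [np.nan, 232.0, 133.0, 134.0]},
    index=idx,
)

valid = df.notna()
# The running count of valid cells equals 1 from the first valid cell until
# the next valid one; requiring the cell itself to be valid isolates it.
first_valid = valid & valid.cumsum().eq(1)
out = df.where(first_valid)
print(out)
```

This also keeps a value that sits in the very first row, and a column that is entirely NaN stays entirely NaN.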

Related

Filling NaN rows in big pandas datetime indexed dataframe using other not NaN rows values

I have a big weather CSV dataframe containing several hundred thousand rows and many columns. The rows are time series sampled every 10 minutes over many years. The datetime index consists of year, month, day, hour, minute and second. Unfortunately, several thousand rows are missing (they contain only NaNs). The goal is to fill them using the values of rows collected at the same date and time in other years, provided those are not NaN.
I wrote a Python for loop, but it is very time-consuming. I am looking for a more efficient and faster solution.
The raw dataframe is as follows:
print(df)
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:10:00 996.52 -8.02 265.40 -8.90 93.30
2004-01-01 00:20:00 996.57 -8.41 265.01 -9.28 93.40
2004-01-01 00:40:00 996.51 -8.31 265.12 -9.07 94.20
2004-01-01 00:50:00 996.51 -8.27 265.15 -9.04 94.10
2004-01-01 01:00:00 996.53 -8.51 264.91 -9.31 93.90
... ... ... ... ... ...
2020-12-31 23:20:00 1000.07 -4.05 269.10 -8.13 73.10
2020-12-31 23:30:00 999.93 -3.35 269.81 -8.06 69.71
2020-12-31 23:40:00 999.82 -3.16 270.01 -8.21 67.91
2020-12-31 23:50:00 999.81 -4.23 268.94 -8.53 71.80
2021-01-01 00:00:00 999.82 -4.82 268.36 -8.42 75.70
[820551 rows x 5 columns]
For some reason, there are missing rows in the df dataframe. To identify them, the function below can be applied:
findnanrows(df.groupby(pd.Grouper(freq='10T')).mean())
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:30:00 NaN NaN NaN NaN NaN
2009-10-08 09:50:00 NaN NaN NaN NaN NaN
2009-10-08 10:00:00 NaN NaN NaN NaN NaN
2013-05-16 09:00:00 NaN NaN NaN NaN NaN
2014-07-30 08:10:00 NaN NaN NaN NaN NaN
... ... ... ... ... ...
2016-10-28 12:00:00 NaN NaN NaN NaN NaN
2016-10-28 12:10:00 NaN NaN NaN NaN NaN
2016-10-28 12:20:00 NaN NaN NaN NaN NaN
2016-10-28 12:30:00 NaN NaN NaN NaN NaN
2016-10-28 12:40:00 NaN NaN NaN NaN NaN
[5440 rows x 5 columns]
The aim is to fill all these NaN rows. As an example, the first NaN row, which corresponds to the datetime 2004-01-01 00:30:00, should be filled with the non-NaN values of another row collected at the same datetime xxxx-01-01 00:30:00 in another year, such as 2005-01-01 00:30:00 or 2006-01-01 00:30:00 and so on, or even 2003-01-01 00:30:00 or 2002-01-01 00:30:00 if they exist. It is also possible to apply an average over all these other years.
Seeing the values of the row with the datetime index 2005-01-01 00:30:00:
print(df.loc["2005-01-01 00:30:00", :])
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2005-01-01 00:30:00 996.36 12.67 286.13 7.11 68.82
After filling the row corresponding to the index datetime 2004-01-01 00:30:00 using the values of the row having the index datetime 2005-01-01 00:30:00, the df dataframe will have the following row:
print(df.loc["2004-01-01 00:30:00", :])
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:30:00 996.36 12.67 286.13 7.11 68.82
The two functions that I created are the following. The first is to identify the NaN rows. The second is to fill them.
def findnanrows(df):
    is_NaN = df.isnull()
    row_has_NaN = is_NaN.any(axis=1)
    rows_with_NaN = df[row_has_NaN]
    return rows_with_NaN
def filldata(weatherdata):
    fillweatherdata = weatherdata.copy()
    allyears = fillweatherdata.index.year.unique().tolist()
    dfnan = findnanrows(fillweatherdata.groupby(pd.Grouper(freq='10T')).mean())
    for i in range(dfnan.shape[0]):
        dnan = dfnan.index[i]
        if dnan.year == min(allyears):
            y = 0
            dnew = dnan.replace(year=dnan.year+y)
            while dnew in dfnan.index:
                dnew = dnew.replace(year=dnew.year+y)
                y += 1
        else:
            y = 0
            dnew = dnan.replace(year=dnan.year-y)
            while dnew in dfnan.index:
                dnew = dnew.replace(year=dnew.year-y)
                y += 1
        new_row = pd.DataFrame(np.array([fillweatherdata.loc[dnew, :]]).tolist(), columns=fillweatherdata.columns.tolist(), index=[dnan])
        fillweatherdata = pd.concat([fillweatherdata, pd.DataFrame(new_row)], ignore_index=False)
    #fillweatherdata = fillweatherdata.drop_duplicates()
    fillweatherdata = fillweatherdata.sort_index()
    return fillweatherdata
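One possible loop-free approach, given here only as a sketch under assumptions: put the data on a regular 10-minute grid, group every timestamp by its calendar slot (month, day, hour, minute) regardless of year, and fill the NaN cells with the per-slot mean over the other years. The function name fill_from_other_years and its freq parameter are hypothetical, and the sketch assumes a sorted DatetimeIndex without duplicates.

```python
import numpy as np
import pandas as pd

def fill_from_other_years(df, freq="10min"):
    """Fill missing rows with the mean of the same calendar slot in other years."""
    # Put the data on a regular grid; timestamps absent from df become NaN rows.
    full = df.asfreq(freq)
    # One key per calendar slot, ignoring the year.
    idx = full.index
    keys = [idx.month, idx.day, idx.hour, idx.minute]
    # Per-slot mean over all years. mean() skips NaN, so only real
    # observations contribute; transform keeps the original shape.
    slot_mean = full.groupby(keys).transform("mean")
    # Only NaN cells are replaced; existing observations are left untouched.
    return full.fillna(slot_mean)

# Small daily demo: drop one day in 2004 and recover it from 2005.
demo_idx = pd.date_range("2004-01-01", "2005-01-02", freq="D")
demo = pd.DataFrame({"T": np.arange(len(demo_idx), dtype=float)}, index=demo_idx)
demo = demo.drop(pd.Timestamp("2004-01-02"))
filled = fill_from_other_years(demo, freq="D")
print(filled.loc[pd.Timestamp("2004-01-02")])
```

A slot that is missing in every year stays NaN, and 29 February can only be filled from other leap years.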

Reindexing timeseries data

I have an issue similar to "ValueError: cannot reindex from a duplicate axis". No solution is provided there.
I have an Excel file containing multiple rows and columns of weather data. Data is missing at certain intervals, although this is not shown in the sample below. I want to reindex the time column at 5-minute intervals so that I can interpolate the missing values. Data sample:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:30 a 30.7 51 19.4 2.2
04/01/18 12:40 a 30.9 51 19.6 0.9
Here's what I have tried.
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
dt = pd.date_range("2018-04-01 00:00:00", "2018-05-01 00:00:00", freq='5min', name='T')
idx = pd.DatetimeIndex(dt)
ts.reindex(idx)
I just want to have my index at a 5-minute frequency so that I can interpolate the NaN values later. Expected output:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:15 a NaN NaN NaN NaN
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:25 a NaN NaN NaN NaN
04/01/18 12:30 a 30.7 51 19.4 2.2
One more approach.
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index(['Time']).resample('5min').last().reset_index()
df['Time'] = df['Time'].dt.time
df
output
Time Date Temp Hum Dewpnt WindSpd
0 00:05:00 4/1/2018 30.6 49.0 18.7 2.7
1 00:10:00 4/1/2018 NaN 51.0 19.3 1.3
2 00:15:00 NaN NaN NaN NaN NaN
3 00:20:00 4/1/2018 30.7 NaN 19.1 2.2
4 00:25:00 NaN NaN NaN NaN NaN
5 00:30:00 4/1/2018 30.7 51.0 19.4 2.2
6 00:35:00 NaN NaN NaN NaN NaN
7 00:40:00 4/1/2018 30.9 51.0 19.6 0.9
If times from multiple dates have to be resampled, you can use the code below.
However, you will have to separate the 'Date' and 'Time' columns later.
df1['DateTime'] = df1['Date']+df1['Time']
df1['DateTime'] = pd.to_datetime(df1['DateTime'],format='%d/%m/%Y%I:%M %p')
df1 = df1.set_index(['DateTime']).resample('5min').last().reset_index()
df1
Output
DateTime Date Time Temp Hum Dewpnt WindSpd
0 2018-01-04 00:05:00 4/1/2018 12:05 AM 30.6 49.0 18.7 2.7
1 2018-01-04 00:10:00 4/1/2018 12:10 AM NaN 51.0 19.3 1.3
2 2018-01-04 00:15:00 NaN NaN NaN NaN NaN NaN
3 2018-01-04 00:20:00 4/1/2018 12:20 AM 30.7 NaN 19.1 2.2
4 2018-01-04 00:25:00 NaN NaN NaN NaN NaN NaN
5 2018-01-04 00:30:00 4/1/2018 12:30 AM 30.7 51.0 19.4 2.2
6 2018-01-04 00:35:00 NaN NaN NaN NaN NaN NaN
7 2018-01-04 00:40:00 4/1/2018 12:40 AM 30.9 51.0 19.6 0.9
You can try this for example:
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
ts.resample('5min').mean()
More information here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
Set the Time column as the index, making sure it has a datetime dtype, then try
ts.asfreq('5min')
or use
ts.asfreq('5min', method='ffill')
to pull previous values forward.
I would take the approach of creating a blank table and filling it in with the data as it comes from your data source. For this example, three observations are read in as NaN, and the rows for 1:15 and 1:25 are missing.
import pandas as pd
import numpy as np
rawpd = pd.read_excel('raw.xlsx')
print(rawpd)
Date Time Col1 Col2
0 2018-04-01 01:00:00 1.0 10.0
1 2018-04-01 01:05:00 2.0 NaN
2 2018-04-01 01:10:00 NaN 10.0
3 2018-04-01 01:20:00 NaN 10.0
4 2018-04-01 01:30:00 5.0 10.0
Now create a dataframe targpd with the ideal structure.
time5min = pd.date_range(start='2018/04/1 01:00',periods=7,freq='5min')
targpd = pd.DataFrame(np.nan,index = time5min,columns=['Col1','Col2'])
print(targpd)
Col1 Col2
2018-04-01 01:00:00 NaN NaN
2018-04-01 01:05:00 NaN NaN
2018-04-01 01:10:00 NaN NaN
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN NaN
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 NaN NaN
Now the trick is to update targpd with the data sent to you in rawpd. For this to happen the Date and Time columns have to be combined in rawpd and made into an index.
print(rawpd.Date,rawpd.Time)
0 2018-04-01
1 2018-04-01
2 2018-04-01
3 2018-04-01
4 2018-04-01
Name: Date, dtype: datetime64[ns]
0 01:00:00
1 01:05:00
2 01:10:00
3 01:20:00
4 01:30:00
Name: Time, dtype: object
You can see above the trick in all this. Your date data was converted to datetime, but your time data has dtype object rather than a datetime type. Below, a proper index is created with a lambda function.
rawidx=rawpd.apply(lambda r : pd.Timestamp.combine(r['Date'],r['Time']),axis=1)
print(rawidx)
This can be applied to the rawpd DataFrame as an index.
rawpd2=pd.DataFrame(rawpd[['Col1','Col2']].values,index=rawidx,columns=['Col1','Col2'])
rawpd2=rawpd2.sort_index()
print(rawpd2)
Once this is in place the update command can get you what you want.
targpd.update(rawpd2,overwrite=True)
print(targpd)
Col1 Col2
2018-04-01 01:00:00 1.0 10.0
2018-04-01 01:05:00 2.0 NaN
2018-04-01 01:10:00 NaN 10.0
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN 10.0
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 5.0 10.0
You now have a table ready for interpolation.
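As a rough illustration of the interpolation step that follows, here is a sketch that rebuilds the 5-minute table shown above and fills the gaps with DataFrame.interpolate. The choice of method="time" is an assumption; any other interpolation method could be substituted.

```python
import numpy as np
import pandas as pd

# Rebuild the 5-minute table from the answer above.
time5min = pd.date_range(start="2018-04-01 01:00", periods=7, freq="5min")
targpd = pd.DataFrame(
    {"Col1": [1.0, 2.0, np.nan, np.nan, np.nan, np.nan, 5.0],
     "Col2": [10.0, np.nan, 10.0, np.nan, 10.0, np.nan, 10.0]},
    index=time5min,
)

# Linear interpolation weighted by the actual time gaps in the index.
filled = targpd.interpolate(method="time")
print(filled)
```

Because the index is evenly spaced here, method="time" and the default linear method give the same result; they differ only when the timestamps are irregular.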
I have got it to work. Thank you, everyone, for your time. Here is the working code.
import pandas as pd
df = pd.read_excel(r'E:\DATA\AP.xlsx', sheet_name='Sheet1', parse_dates=[['Date', 'Time']])
df = df.set_index(['Date_Time']).resample('5min').last().reset_index()
print(df)

Create a DataFrame in Pandas using an index from an existing time series and a column from another time series

I want to create a DataFrame or time series using the index of an existing time series and the values from another time series with different time indices. The time series look like this:
<class 'pandas.core.series.Series'>
DT
2018-01-02 172.3000
2018-01-03 174.5500
2018-01-04 173.4700
2018-01-05 175.3700
2018-01-08 175.6100
2018-01-09 175.0600
2018-01-10 174.3000
2018-01-11 175.4886
2018-01-12 177.3600
2018-01-16 179.3900
2018-01-17 179.2500
2018-01-18 180.1000
...
and
<class 'pandas.core.series.Series'>
DT
2018-01-02 NaN
2018-01-09 175.610
2018-01-16 177.360
2018-01-23 180.100
...
I want to use the index from the first time series and fill it with the values at the matching index from the second one, like this:
<class 'pandas.core.series.Series'>
DT
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 NaN
2018-01-08 NaN
2018-01-09 175.610
2018-01-10 NaN
2018-01-11 NaN
2018-01-12 NaN
2018-01-16 177.360
2018-01-17 NaN
2018-01-18 NaN
...
Thx
IIUC, use Series.reindex:
new_s = s2.reindex(s1.index)
#2018-01-02 NaN
#2018-01-03 NaN
#2018-01-04 NaN
#2018-01-05 NaN
#2018-01-08 NaN
#2018-01-09 175.61
#2018-01-10 NaN
#2018-01-11 NaN
#2018-01-12 NaN
#2018-01-16 177.36
#2018-01-17 NaN
#2018-01-18 NaN
#Name: s2, dtype: float64
Convert your Series into DataFrames, then use the following line:
pd.merge(TS1,TS2,left_index=True,right_index=True,how='left').iloc[:,-1]

Pandas Set all value in a day equal to data of a time of that day

Generating the data
np.random.seed(42)
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(np.random.randint(0,10,size=(len(date_rng), 3)),
columns=['data1', 'data2', 'data3'],
index= date_rng)
daily_mean_df = pd.DataFrame(np.zeros([len(date_rng), 3]),
columns=['data1', 'data2', 'data3'],
index= date_rng)
mask = np.random.choice([1, 0], df.shape, p=[.35, .65]).astype(bool)
df[mask] = np.nan
df
>>>
data1 data2 data3
2018-01-01 00:00:00 1.0 3.0 NaN
2018-01-01 01:00:00 8.0 5.0 8.0
2018-01-01 02:00:00 5.0 NaN 6.0
2018-01-01 03:00:00 4.0 7.0 4.0
2018-01-01 04:00:00 NaN 8.0 NaN
... ... ... ...
2018-01-07 20:00:00 8.0 7.0 NaN
2018-01-07 21:00:00 5.0 4.0 5.0
2018-01-07 22:00:00 NaN 6.0 NaN
2018-01-07 23:00:00 2.0 4.0 3.0
2018-01-08 00:00:00 NaN NaN NaN
I want to select a specific time each day, then set all values in that day equal to the data at that time.
For example, if I select 01:00:00, all data of 2018-01-01 should equal the row at 2018-01-01 01:00:00, all data of 2018-01-02 should equal the row at 2018-01-02 01:00:00, etc.
I know how to select the data of the time:
timestamp = "01:00:00"
df[df.index.strftime("%H:%M:%S") == timestamp]
but I don't know how to set data of the day equal to it.
Thank you for reading.
Try reindex:
s=df[df.index.strftime("%H:%M:%S") == timestamp]
s.index=s.index.date
df[:]=s.reindex(df.index.date).values
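An alternative sketch that avoids rebuilding the index by hand: blank out every row except the chosen time, then let a per-day groupby spread the surviving row over its day. The helper name broadcast_time_of_day and the small demo frame are illustrative assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd

def broadcast_time_of_day(df, timestamp):
    """Set every row of a day to that day's row at the given time of day."""
    at_time = df.index.strftime("%H:%M:%S") == timestamp
    # Work on a float copy so that NaN can be written without dtype issues.
    masked = df.astype(float)
    masked.loc[~at_time, :] = np.nan
    # "first" returns the first non-NaN value per column within each day,
    # which is the selected row, since all other rows are now NaN.
    return masked.groupby(masked.index.normalize()).transform("first")

# Demo data: three full days of hourly rows with deterministic values.
date_rng = pd.date_range("2018-01-01", "2018-01-03 23:00", freq="60min")
values = np.arange(len(date_rng) * 3, dtype=float).reshape(-1, 3)
demo = pd.DataFrame(values, columns=["data1", "data2", "data3"], index=date_rng)

out = broadcast_time_of_day(demo, "01:00:00")
print(out.loc["2018-01-02"].head())
```

A day that has no row at the chosen time, or a NaN in that row, ends up NaN for the affected columns.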

How to change entire row if NaN present if a single column has NaN

I have a DataFrame like this
gauge satellite
1979-06-23 18:00:00 6.700000 2.484378
1979-06-27 03:00:00 NaN 8.891460
1979-06-27 06:00:00 1.833333 4.053460
1979-06-27 09:00:00 NaN 2.876649
1979-07-31 18:00:00 6.066667 1.438324
I want to obtain a DataFrame Like this
gauge satellite
1979-06-23 18:00:00 6.700000 2.484378
1979-06-27 03:00:00 NaN NaN
1979-06-27 06:00:00 1.833333 4.053460
1979-06-27 09:00:00 NaN NaN
1979-07-31 18:00:00 6.066667 1.438324
What I would do is reindex:
df.dropna().reindex(df.index)
Or use mask:
df.mask(df.gauge.isna())
gauge satellite
1979-06-23 18:00:00 6.700000 2.484378
1979-06-27 03:00:00 NaN NaN
1979-06-27 06:00:00 1.833333 4.053460
1979-06-27 09:00:00 NaN NaN
1979-07-31 18:00:00 6.066667 1.438324
Use np.where to add NaN:
import numpy as np
df['satellite'] = np.where(df['gauge'].isnull(),np.nan,df['satellite'])
Second solution:
use .loc and isnull
df.loc[df['gauge'].isnull(),'satellite'] = np.nan
You can use np.where:
df['satellite'] = np.where(df['gauge'].isna(), np.nan, df['satellite'])
df['gauge'] = np.where(df['satellite'].isna(), np.nan, df['gauge'])
You need to find whether a row contains NaN; .any(axis=1) gives you a per-row mask.
df.loc[df.isna().any(axis=1)] = np.nan
Output:
gauge satellite
1979-06-23 18:00:00 6.700000 2.484378
1979-06-27 03:00:00 NaN NaN
1979-06-27 06:00:00 1.833333 4.053460
1979-06-27 09:00:00 NaN NaN
1979-07-31 18:00:00 6.066667 1.438324
