How do I fill in NaN values in a dataframe with a default date of 2015-01-01?
What do I use here?
df['SIGN_DATE'] = df['SIGN_DATE'].fillna(??, inplace=True)
>>>df.SIGN_DATE.head()
0 2012-03-28 14:14:18
1 2011-05-18 00:41:48
2 2011-06-13 16:36:58
3 nan
4 2011-05-22 23:43:56
Name: SIGN_DATE, dtype: object
type(df.SIGN_DATE)
pandas.core.series.Series
Pass the default date as the value argument; note that fillna with inplace=True returns None, so don't also assign the result back:
df['SIGN_DATE'].fillna(value=pd.to_datetime('1/1/2015'), inplace=True)
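A runnable sketch with a toy frame (the column values are invented stand-ins for the head() above):

import pandas as pd
import numpy as np

# toy stand-in for the question's column (object dtype with a missing entry)
df = pd.DataFrame({'SIGN_DATE': ['2012-03-28 14:14:18', np.nan, '2011-05-22 23:43:56']})

# assignment style -- equivalent to the inplace call above
df['SIGN_DATE'] = df['SIGN_DATE'].fillna(pd.to_datetime('1/1/2015'))
print(df)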
I have a sample dataframe:
id created_at is_valid
1 2022-02-23 13:00:00 1
2 2022-02-24 12:12:00 1
3 2022-03-21 11:00:00 1
4 0000-00-00 00:00:00 0
5 null 0
6 0
How can I create a is_valid column based on the datetime column?
Convert the values to datetimes with errors='coerce' in to_datetime, which generates NaT for anything that is not a valid datetime; then check them with Series.notna and cast to integer to map True/False to 1/0:
df['is_valid'] = pd.to_datetime(df['created_at'], errors='coerce').notna().astype(int)
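A quick self-contained check (the frame is reconstructed from the sample above):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'created_at': ['2022-02-23 13:00:00', '2022-02-24 12:12:00',
                   '2022-03-21 11:00:00', '0000-00-00 00:00:00',
                   None, np.nan],
})
# invalid entries become NaT under errors='coerce', so notna() flags the valid rows
df['is_valid'] = pd.to_datetime(df['created_at'], errors='coerce').notna().astype(int)
print(df['is_valid'].tolist())  # [1, 1, 1, 0, 0, 0]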
I am trying to resample a Pandas dataframe after subsetting for 2 columns. Below is the head of the dataframe. Both columns are Pandas Series.
temp_2011_clean[['visibility', 'dry_bulb_faren']].head()
visibility dry_bulb_faren
2011-01-01 00:53:00 10.00 51.0
2011-01-01 01:53:00 10.00 51.0
2011-01-01 02:53:00 10.00 51.0
2011-01-01 03:53:00 10.00 50.0
2011-01-01 04:53:00 10.00 50.0
type(temp_2011_clean['visibility'])
pandas.core.series.Series
type(temp_2011_clean['dry_bulb_faren'])
pandas.core.series.Series
While the .resample('W') call successfully creates the resampler object, chaining .mean() onto it picks up only one column instead of the expected two. Can someone suggest what the issue could be? Why is one column missed?
temp_2011_clean[['visibility', 'dry_bulb_faren']].resample('W')
<pandas.core.resample.DatetimeIndexResampler object at 0x0000016F4B943288>
temp_2011_clean[['visibility', 'dry_bulb_faren']].resample('W').mean().head()
dry_bulb_faren
2011-01-02 44.791667
2011-01-09 50.246637
2011-01-16 41.103774
2011-01-23 47.194313
2011-01-30 53.486188
I think the problem is that the visibility column is not numeric, so the non-numeric column is excluded from the mean.
print (temp_2011_clean.dtypes)
visibility object
dry_bulb_faren float64
dtype: object
df = temp_2011_clean[['visibility', 'dry_bulb_faren']].resample('W').mean()
print (df)
dry_bulb_faren
2011-01-02 50.6
So convert the column to numeric with to_numeric and errors='coerce', which turns non-numeric values into NaN:
temp_2011_clean['visibility'] = pd.to_numeric(temp_2011_clean['visibility'], errors='coerce')
print (temp_2011_clean.dtypes)
visibility float64
dry_bulb_faren float64
dtype: object
df = temp_2011_clean[['visibility', 'dry_bulb_faren']].resample('W').mean()
print (df)
visibility dry_bulb_faren
2011-01-02 10.0 50.6
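For reference, a self-contained sketch with invented values shaped like the question's data:

import pandas as pd

idx = pd.date_range('2011-01-01 00:53', periods=5, freq='h')
temp_2011_clean = pd.DataFrame({
    'visibility': ['10.00'] * 5,  # strings, i.e. object dtype, as in the question
    'dry_bulb_faren': [51.0, 51.0, 51.0, 50.0, 50.0],
}, index=idx)

temp_2011_clean['visibility'] = pd.to_numeric(temp_2011_clean['visibility'], errors='coerce')
print(temp_2011_clean.resample('W').mean())  # both columns now appear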
I have a Date column in my dataframe containing dates in two different formats (YYYY-DD-MM 00:00:00 and YYYY-DD-MM):
Date
0 2023-01-10 00:00:00
1 2024-27-06
2 2022-07-04 00:00:00
3 NaN
4 2020-30-06
(you can use pd.read_clipboard(sep='\s\s+') after copying the previous dataframe to get it in your notebook)
I would like to have only the YYYY-MM-DD format. Consequently, I would like to get:
Date
0 2023-10-01
1 2024-06-27
2 2022-04-07
3 NaN
4 2020-06-30
How could I do this, please?
Use Series.str.replace to strip the literal ' 00:00:00' suffix, then to_datetime with the format parameter, so a single '%Y-%d-%m' format matches every row:
df['Date'] = pd.to_datetime(df['Date'].str.replace(' 00:00:00',''), format='%Y-%d-%m')
print (df)
Date
0 2023-10-01
1 2024-06-27
2 2022-04-07
3 NaT
4 2020-06-30
Another idea is to match both formats separately and combine the results:
d1 = pd.to_datetime(df['Date'], format='%Y-%d-%m', errors='coerce')
d2 = pd.to_datetime(df['Date'], format='%Y-%d-%m 00:00:00', errors='coerce')
df['Date'] = d1.fillna(d2)
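If plain YYYY-MM-DD strings are needed afterwards rather than datetime64 values, a small follow-up (rows that are NaT come back as NaN):

df['Date'] = df['Date'].dt.strftime('%Y-%m-%d')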
I have a dataframe with IDs and timestamps as a multi-index. The index is sorted by IDs and timestamps, and I want to pick the latest timestamp for each ID. For example:
IDs timestamp value
0 2010-10-30 1
2010-11-30 2
1 2000-01-01 300
2007-01-01 33
2010-01-01 400
2 2000-01-01 11
So basically the result I want is
IDs timestamp value
0 2010-11-30 2
1 2010-01-01 400
2 2000-01-01 11
What is the command to do that in pandas?
Given this setup:
import pandas as pd
import numpy as np
import io
content = io.StringIO("""\
IDs timestamp value
0 2010-10-30 1
0 2010-11-30 2
1 2000-01-01 300
1 2007-01-01 33
1 2010-01-01 400
2 2000-01-01 11""")
df = pd.read_table(content, header=0, sep=r'\s+', parse_dates=[1])
df.set_index(['IDs', 'timestamp'], inplace=True)
Using reset_index followed by groupby:
df.reset_index(['timestamp'], inplace=True)
print(df.groupby(level=0).last())
yields
timestamp value
IDs
0 2010-11-30 00:00:00 2
1 2010-01-01 00:00:00 400
2 2000-01-01 00:00:00 11
This does not feel like the best solution, however. There should be a way to do this without calling reset_index...
As you point out in the comments, last ignores NaN values. To keep NaN values from being skipped, you could use groupby/agg like this:
df.reset_index(['timestamp'], inplace=True)
grouped = df.groupby(level=0)
print(grouped.agg(lambda x: x.iloc[-1]))
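A minimal illustration of the difference (values invented): last() skips a trailing NaN, while the iloc[-1] aggregation keeps it:

import pandas as pd
import numpy as np

g = pd.DataFrame({'IDs': [0, 0], 'value': [1.0, np.nan]}).groupby('IDs')
print(g.last())                      # value is 1.0 -- the NaN was skipped
print(g.agg(lambda x: x.iloc[-1]))   # value is NaN -- positional last row kept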
One can also use
df.groupby("IDs").tail(1)
This will take the last row of each label in level "IDs" and will not ignore NaN values.
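A quick check of the tail approach against the setup above (assuming df still carries the (IDs, timestamp) MultiIndex, i.e. before the reset_index calls), which also shows that the index is preserved:

print(df.groupby('IDs').tail(1))
#                 value
# IDs timestamp
# 0   2010-11-30      2
# 1   2010-01-01    400
# 2   2000-01-01     11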
I am trying to get the 10-day aggregate of my data, which has NaN values. The sum over 10 days should return NaN if there is a NaN value anywhere in the 10-day window.
When I apply the code below, pandas treats NaN as zero and returns the sum of the remaining days.
dateRange = pd.date_range(start_date, periods=len(data), freq='D')
# Creating a data frame so that the timeseries can handle numpy array.
df = pd.DataFrame(data)
base_Series = pd.DataFrame(list(df.values), index=dateRange)
# Converting to aggregated series
agg_series = base_Series.resample('10D').sum()
agg_data = agg_series.values
Sample Data:
2011-06-01 46.520536
2011-06-02 8.988311
2011-06-03 0.133823
2011-06-04 0.274521
2011-06-05 1.283360
2011-06-06 2.556313
2011-06-07 0.027461
2011-06-08 0.001584
2011-06-09 0.079193
2011-06-10 2.389549
2011-06-11 NaN
2011-06-12 0.195844
2011-06-13 0.058720
2011-06-14 6.570925
2011-06-15 0.015107
2011-06-16 0.031066
2011-06-17 0.073008
2011-06-18 0.072198
2011-06-19 0.044534
2011-06-20 0.240080
Output:
2011-06-01 62.254651
2011-06-11 7.301481
This uses the numpy sum, which returns NaN if a NaN is present in the window:
In [35]: s = pd.Series(np.random.randn(100), index=pd.date_range('20130101', periods=100))
In [36]: s.iloc[11] = np.nan
In [37]: s.resample('10D').apply(lambda x: x.values.sum())
Out[37]:
2013-01-01 6.910729
2013-01-11 NaN
2013-01-21 -1.592541
2013-01-31 -2.013012
2013-02-10 1.129273
2013-02-20 -2.054807
2013-03-02 4.669622
2013-03-12 3.489225
2013-03-22 0.390786
2013-04-01 -0.005655
dtype: float64
To filter out the days which contain any NaNs, I propose:
noNaN_days_only = s.groupby(lambda ts: ts.date()).filter(lambda g: not g.isnull().any())
where s is the Series above.
Just apply an agg function:
agg_series = base_Series.resample('10D').agg(lambda x: np.nan if np.isnan(x).any() else np.sum(x))
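The same idea in a compact, runnable sketch (data invented; the raw numpy sum over .values propagates NaN, unlike the NaN-skipping pandas sum):

import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, 4.0],
              index=pd.date_range('2011-06-01', periods=4, freq='D'))
print(s.resample('2D').apply(lambda x: x.values.sum()))
# 2011-06-01    NaN
# 2011-06-03    7.0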