Drop string entries from index list of Pandas series - python

I have a series whose index consists of dates, except for some entries whose index is a string, e.g.
2015-11-27 00:00:00 4
2016-08-03 00:00:00 1
Some string 1
2015-05-29 00:00:00 1
2015-05-20 00:00:00 2
2015-08-14 00:00:00 6
I want to drop those string entries, but I haven't found a nice way to do it. I would appreciate any ideas.

You can convert the index to a datetime index while coercing the unconvertible values to NaT, then filter those out:
s.index = pd.to_datetime(s.index, errors="coerce")
s = s[s.index.notna()]
to get
2015-11-27 4
2016-08-03 1
2015-05-29 1
2015-05-20 2
2015-08-14 6
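A self-contained sketch of the idea (the series below is a stand-in for the asker's data):
import pandas as pd

s = pd.Series([4, 1, 1, 1, 2, 6],
              index=["2015-11-27 00:00:00", "2016-08-03 00:00:00",
                     "Some string", "2015-05-29 00:00:00",
                     "2015-05-20 00:00:00", "2015-08-14 00:00:00"])

# Entries that cannot be parsed as dates become NaT under errors="coerce"
s.index = pd.to_datetime(s.index, errors="coerce")

# Keep only the rows whose index converted successfully
s = s[s.index.notna()]
print(s)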

Related

Number timestamps based on time of timestamp

I have up to three different timestamps for each day in a dataframe. In a new column called 'Category' I want to give them a number from 1 to 3 based on the time of the timestamp, much like PARTITION BY with RANK() in SQL.
Something like: for each day, check the time of the run and assign a rank based on whether it was the first, second, or third run (if there is a third run).
The dataframe has about half a million rows: a few years of data at hourly resolution, with 2-3 runs every day.
Any suggestions on how to do this most efficiently?
Example of how it is supposed to look:
Timestamp            Category
2020-01-17 08:18:00  1
2020-01-17 11:57:00  2
2020-01-17 15:35:00  3
2020-01-18 09:00:00  1
2020-01-18 12:00:00  2
2020-01-18 17:00:00  3
Use groupby() and .cumcount():
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%Y-%m-%d %H:%M:%S')
df['Category'] = df.groupby(df['Timestamp'].dt.to_period('D')).cumcount().add(1)
# equivalently, with a daily Grouper
df['Category'] = df.groupby(pd.Grouper(freq='D', key='Timestamp')).cumcount().add(1)
Output:
>>> df
Timestamp Category
0 2020-01-17 08:18:00 1
1 2020-01-17 11:57:00 2
2 2020-01-17 15:35:00 3
3 2020-01-18 09:00:00 1
4 2020-01-18 12:00:00 2
5 2020-01-18 17:00:00 3
UPDATE: If the same timestamp can repeat within a day and duplicates should share a number, count only the changes within each day (note the cumulative sum must also be taken per day, or the counter never resets):
day = df['Timestamp'].dt.normalize()
df['Category'] = df.groupby(day)['Timestamp'].diff().ne(pd.Timedelta(0)).groupby(day).cumsum()
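For reference, a self-contained version of the basic cumcount approach (sample data taken from the question):
import pandas as pd

df = pd.DataFrame({'Timestamp': pd.to_datetime([
    '2020-01-17 08:18:00', '2020-01-17 11:57:00', '2020-01-17 15:35:00',
    '2020-01-18 09:00:00', '2020-01-18 12:00:00', '2020-01-18 17:00:00'])})

# cumcount() numbers rows 0, 1, 2 within each calendar day; add(1) makes it 1-based.
# Assumes rows are already sorted by time within each day.
df['Category'] = df.groupby(df['Timestamp'].dt.to_period('D')).cumcount().add(1)
print(df)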

Match datetime YYYY-MM-DD object in pandas dataframe

I have a pandas DataFrame of the form:
id amount birth
0 4 78.0 1980-02-02 00:00:00
1 5 24.0 1989-03-03 00:00:00
2 6 49.5 2014-01-01 00:00:00
3 7 34.0 2014-01-01 00:00:00
4 8 49.5 2014-01-01 00:00:00
I am interested in only the year, month and day in the birth column of the dataframe. I tried to use pandas' datetime conversion, but it resulted in an error:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1054-02-07 00:00:00
The birth column is an object dtype.
My guess is that there is an incorrect date somewhere. I would not like to pass errors="coerce" into the to_datetime method, because every item is important and I only need the YYYY-MM-DD part.
I also tried a regex through the pandas string methods:
df["birth"].str.find("(\d{4})-(\d{2})-(\d{2})")
But this returns NaNs. How can I resolve this?
Thanks
Because not all values can be converted to datetimes, you can split by the first whitespace and select the first part:
df['birth'] = df['birth'].str.split().str[0]
Then, if necessary, convert to periods, which can represent out-of-bounds dates.
print (df)
id amount birth
0 4 78.0 1980-02-02 00:00:00
1 5 24.0 1989-03-03 00:00:00
2 6 49.5 2014-01-01 00:00:00
3 7 34.0 2014-01-01 00:00:00
4 8 49.5 0-01-01 00:00:00
def to_per(x):
    splitted = x.split('-')
    return pd.Period(year=int(splitted[0]),
                     month=int(splitted[1]),
                     day=int(splitted[2]), freq='D')

df['birth'] = df['birth'].str.split().str[0].apply(to_per)
print (df)
id amount birth
0 4 78.0 1980-02-02
1 5 24.0 1989-03-03
2 6 49.5 2014-01-01
3 7 34.0 2014-01-01
4 8 49.5 0000-01-01
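As a side note, Series.str.find searches for a literal substring rather than a regex, which is why the asker's attempt found no matches. The regex-based method is Series.str.extract; a sketch of using it to pull out just the YYYY-MM-DD part:
import pandas as pd

df = pd.DataFrame({'birth': ['1980-02-02 00:00:00', '1054-02-07 00:00:00']})

# extract() pulls out the first capture group; expand=False returns a Series.
# Out-of-bounds years survive because the result stays a plain string.
df['birth'] = df['birth'].str.extract(r'(\d{4}-\d{2}-\d{2})', expand=False)
print(df)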

Delete all (hourly) day entries per row based on a daily table in python

I have a dataframe with a datetime64[ns] column containing data at an hourly resolution:
Datum Values
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-02-28 00:00:00 5
2020-03-01 00:00:00 4
and another table with closing days, also a datetime64[ns] column, but holding only dates:
Dates
2020-02-28
2020-02-29
....
How can I delete all days in the first dataframe df that occur in the second dataframe Dates? So that df becomes:
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-03-01 00:00:00 4
Use Series.dt.floor to set the times to midnight, then filter with Series.isin and an inverted mask in boolean indexing:
df['Datum'] = pd.to_datetime(df['Datum'])
df1['Dates'] = pd.to_datetime(df1['Dates'])
df = df[~df['Datum'].dt.floor('d').isin(df1['Dates'])]
print (df)
Datum Values
0 2020-01-01 00:00:00 1
1 2020-01-01 01:00:00 10
3 2020-03-01 00:00:00 4
EDIT: For a flag column, convert the mask to integers with Series.view or Series.astype:
df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).view('i1')
#alternative
#df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).astype('int')
print (df)
Datum Values flag
0 2020-01-01 00:00:00 1 0
1 2020-01-01 01:00:00 10 0
2 2020-02-28 00:00:00 5 1
3 2020-03-01 00:00:00 4 0
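Put together as a runnable sketch (the frames are reconstructed from the question's sample):
import pandas as pd

df = pd.DataFrame({'Datum': pd.to_datetime(['2020-01-01 00:00:00',
                                            '2020-01-01 01:00:00',
                                            '2020-02-28 00:00:00',
                                            '2020-03-01 00:00:00']),
                   'Values': [1, 10, 5, 4]})
df1 = pd.DataFrame({'Dates': pd.to_datetime(['2020-02-28', '2020-02-29'])})

# floor('d') zeroes the time-of-day so whole days can be compared to df1
df = df[~df['Datum'].dt.floor('d').isin(df1['Dates'])]
print(df)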
Taking your added comment into consideration:
Build a regex alternation of the dates in df1:
c = "|".join(df1.Dates.values)
Coerce Datum to datetime:
df['Datum'] = pd.to_datetime(df['Datum'])
Extract Datum as string dates:
df.set_index(df['Datum'], inplace=True)
df['Dates'] = df.index.date.astype(str)
Boolean-select the dates present in both:
m = df.Dates.str.contains(c)
Mark inclusive dates as 0 and exclusive as 1 (this uses NumPy):
import numpy as np
df['drop'] = np.where(m, 0, 1)
Drop the helper column and reset the index:
df.reset_index(drop=True).drop(columns=['Dates'])

How can I split a DataFrame every x rows?

I have DataFrame in following format:
Date Open High Low Close
0 2015-06-19 20:00:00 1201.60 1202.84 1201.55 1202.13
1 2015-06-19 21:00:00 1202.13 1202.50 1200.84 1200.88
2 2015-06-19 22:00:00 1200.88 1201.55 1200.61 1201.06
3 2015-06-19 23:00:00 1201.06 1201.26 1200.02 1200.57
4 2015-06-22 01:00:00 1200.57 1201.48 1197.04 1198.94
5 2015-06-22 02:00:00 1198.94 1199.79 1198.49 1199.34
6 2015-06-22 03:00:00 1199.34 1200.05 1198.64 1199.74
7 2015-06-22 04:00:00 1199.74 1200.34 1199.14 1199.66
I am trying to split this DataFrame by date, and after that I am trying to split each date into 4-hour chunks. Here is how I select the DataFrame for one date:
i = 0
this_date = df["Date"][i:i+1].values[0].split(" ")[0]
today = df[df["Date"].apply(lambda x: x.split(" ")[0]) == this_date]
Now I need to split the today dataframe into 4-hour chunks. The last chunk will be of size 3 in total, as the day ends at 23:00.
How can I do this? Is there an easy way, or do I need to map over the DataFrame and do it manually?
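One possible approach (a sketch, not from the original thread, assuming the Date column parses cleanly to datetimes): parse the column once, then group with pd.Grouper at a 4-hour frequency, which buckets each day into 00-04, 04-08, ... and avoids the string splitting entirely:
import pandas as pd

# Continuing from the question's df, whose Date column holds strings
# like "2015-06-19 20:00:00"
df['Date'] = pd.to_datetime(df['Date'])

for bin_start, chunk in df.groupby(pd.Grouper(key='Date', freq='4h')):
    if not chunk.empty:   # Grouper also emits empty bins for gaps (e.g. weekends)
        print(bin_start, len(chunk))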

How to get last day of each month in Pandas DataFrame index (using TimeGrouper)

I have a DataFrame with incomplete dates and I only need the date/row of the last available day of each month.
I tried using TimeGrouper and taking .last() of each group.
import pandas as pd

idx = [pd.Timestamp(2016,2,1), pd.Timestamp(2017,1,20),
       pd.Timestamp(2017,2,1), pd.Timestamp(2017,2,27)]
df = pd.DataFrame([1,2,3,4], index=idx)
df
0
2016-02-01 1
2017-01-20 2
2017-02-01 3
2017-02-27 4
Expecting:
df_eom
0
2016-02-01 1
2017-01-20 2
2017-02-27 4
However I got this:
df_eom = df.groupby(pd.TimeGrouper(freq='1M')).last()
df_eom
0
2016-02-29 1.0
2016-03-31 NaN
2016-04-30 NaN
2016-05-31 NaN
2016-06-30 NaN
2016-07-31 NaN
2016-08-31 NaN
2016-09-30 NaN
2016-10-31 NaN
2016-11-30 NaN
2016-12-31 NaN
2017-01-31 2.0
2017-02-28 4.0
Not only does it create dates that weren't in df, it also changes the index of the first and last rows of df. Am I using TimeGrouper wrong?
Here's one way
In [795]: df.iloc[df.reset_index().groupby(df.index.to_period('M'))['index'].idxmax()]
Out[795]:
0
2016-02-01 1
2017-01-20 2
2017-02-27 4
Or
In [802]: df.loc[df.groupby(df.index.to_period('M')).apply(lambda x: x.index.max())]
Out[802]:
0
2016-02-01 1
2017-01-20 2
2017-02-27 4
You could group by the year and month and iterate through the groups to find the last date, like so:
groups = df.groupby([df.index.year, df.index.month])
df_eom = pd.DataFrame()
for idx, group in groups:
    df_eom = df_eom.append(group.iloc[-1])
df_eom
0
2016-02-01 1
2017-01-20 2
2017-02-27 4
I don't really like this because of the looping, but given that you really can't have an outrageous number of years, and each year will have a maximum of 12 month groups, it shouldn't be too awful.
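For what it's worth, pd.TimeGrouper and DataFrame.append have both since been deprecated and removed from pandas. A loop-free equivalent of the answer above, using the question's df and assuming the index is sorted: GroupBy.tail(1) keeps the last row of each month while preserving the original dates as the index:
# Last row per calendar month, original index retained
df_eom = df.groupby(df.index.to_period('M')).tail(1)
print(df_eom)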
I believe this solution is appropriate in more use cases. The previous answers only work when the date is exactly a month end; with financial data, for example, the last day available in a month may or may not be the calendar month end. This solution accounts for that (it assumes the frame is sorted by date, with the dates in an as_of_date column):
df[df['as_of_date'].dt.month.shift(-1) != df['as_of_date'].dt.month].reset_index(drop=True)
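Applied to the question's frame, where the dates live in the index, a comparable sketch (an adaptation, assuming the index is sorted; comparing year-month periods instead of bare month numbers also avoids a false match when the same calendar month recurs in a later year with nothing in between):
months = pd.Series(df.index.to_period('M'), index=df.index)

# A row is the last of its month when the next row falls in a different period
df_eom = df[months.ne(months.shift(-1))]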
