Remove time from a range of dates in a pandas dataframe - python

I need all the values of the DataFrame to not have a time element.
For example with the dataframe:
df = pd.date_range("2019-01-01", periods=9, freq='M')
Currently df[2] shows:
2019-03-31 00:00:00
What is the best way to remove the time part and just include the date for all elements in the array?
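The thread leaves this unanswered; a minimal sketch of two common options, assuming the index from the question (note that pd.date_range actually returns a DatetimeIndex, not a DataFrame):

```python
import pandas as pd

idx = pd.date_range("2019-01-01", periods=9, freq="M")

# Option 1: .date returns an array of plain datetime.date objects,
# dropping the time component entirely
dates_only = idx.date

# Option 2: .normalize() keeps the datetime64 dtype but zeroes the time
# (here the times are already midnight, so only the display changes)
normalized = idx.normalize()

print(dates_only[2])  # 2019-03-31
```

The .date accessor is the one that actually removes the time; normalize() is useful when you want to keep datetime64 dtype for further date arithmetic.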

Related

extract data in specific interval

I have a table with two columns: date (01-01-2010 to 31-08-2021) and value (mm).
I would like to get only the data during 2020. Is there a function or similar way to get only the data in a specific period?
For example, to create a pivot table.
Try this:
df = pd.DataFrame(
    {'date': ['27-02-2010', '31-1-2020', '31-1-2021', '02-1-2020', '13-2-2020',
              '07-2-2019', '30-4-2018', '04-8-2020', '06-4-2013', '21-6-2020'],
     'value': ['foo', 'bar', 'lorem', 'ipsum', 'alpha', 'omega', 'big', 'small', 'salt', 'pepper']})
df[pd.to_datetime(df['date']).dt.year == 2020]
Output:
date value
1 31-1-2020 bar
3 02-1-2020 ipsum
4 13-2-2020 alpha
7 04-8-2020 small
9 21-6-2020 pepper
Or, to search within an arbitrary range, you can use this (note that > and < exclude the endpoints; use >= and <= to include them):
df['date'] = pd.to_datetime(df['date'])
df[(df['date'] > pd.Timestamp(2020, 1, 1)) & (df['date'] < pd.Timestamp(2020, 12, 31))]
Here is an example of an idea of how you can return the values from a dataset based on the year, using string slicing. If this doesn't apply to your situation, please edit your post with a specific code example.
import pandas as pd
df = pd.DataFrame(
    {'date': ['27-02-2010', '31-1-2020', '31-1-2021', '02-1-2020', '13-2-2020',
              '07-2-2019', '30-4-2018', '04-8-2020', '06-4-2013', '21-6-2020'],
     'value': ['foo', 'bar', 'lorem', 'ipsum', 'alpha', 'omega', 'big', 'small', 'salt', 'pepper']})
for row in df.iterrows():
    if row[1]['date'][-4:] == '2020':
        print(row[1]['value'])
This will return only the values from the DataFrame whose dates have a year of 2020.
Pandas has extensive time series features that you may want to use, but for a simpler approach, you could define the date column as the index and then slice the data (assuming the table is already sorted by date):
import pandas as pd
df = pd.DataFrame({'date': ['31-12-2019', '01-01-2020', '01-07-2020',
'31-12-2020', '01-01-2021'],
'value': [1, 2, 3, 4, 5]})
df.index = df.date
df.loc['01-01-2020':'31-12-2020']
date value
date
01-01-2020 01-01-2020 2
01-07-2020 01-07-2020 3
31-12-2020 31-12-2020 4
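The string slice above relies on the rows already being sorted; one of the time series features mentioned in passing is partial-string indexing on a real DatetimeIndex, which makes the lookup date-aware. A sketch with the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'date': ['31-12-2019', '01-01-2020', '01-07-2020',
                            '31-12-2020', '01-01-2021'],
                   'value': [1, 2, 3, 4, 5]})
# Parse the dates first so the index compares as dates,
# not as dd-mm-yyyy strings
df.index = pd.to_datetime(df['date'], format='%d-%m-%Y')
result = df.loc['2020']          # partial-string indexing: every row in 2020
print(result['value'].tolist())  # [2, 3, 4]
```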

How to subset rows based on date overlap range efficiently using python pandas?

My data frame has two date type columns: start and end (yyyy-mm-dd).
Here's my data frame:
import pandas as pd
import datetime
data=[["2016-10-17","2017-03-08"],["2014-08-17","2016-09-08"],["2014-01-01","2015-01-01"],["2017-12-20","2019-01-01"]]
df=pd.DataFrame(data,columns=['start','end'])
df['start'] = pd.to_datetime(df['start'], format='%Y-%m-%d')
df['end'] = pd.to_datetime(df['end'], format='%Y-%m-%d')
start end
0 2016-10-17 2017-03-08
1 2014-08-17 2016-09-08
2 2014-01-01 2015-01-01
3 2017-12-20 2019-01-01
And I have reference start and end date as following.
ref_start=datetime.date(2015, 9, 20)
ref_end=datetime.date(2017,1,31)
print(ref_start,ref_end)
2015-09-20 2017-01-31
I would like to subset rows where the row's start-end date range overlaps with the reference start and end dates. The third and fourth rows are not selected, since their start-end date ranges do not overlap with the reference date range (2015-09-20 ~ 2017-01-31).
So my desired outcome looks like this:
start end
0 2016-10-17 2017-03-08
1 2014-08-17 2016-09-08
To do that, I was thinking about using the following codes based on this: Efficient date range overlap calculation in python?
df[(max(df['start'],ref_start)>min(df['end'],ref_end))]
However, it doesn't work. Is there any way to get the desired outcome efficiently?
A trick I learned early on in my career is what I call "crossing the dates": you compare the start of one range against the end of the other.
# pd.Timestamp can do everything that datetime/date does and some more
ref_start = pd.Timestamp(2015, 9, 20)
ref_end = pd.Timestamp(2017,1,31)
# Compare the start of one range to the end of the other, and vice versa
# Made into a separate variable for readability
cond = (ref_start <= df['end']) & (ref_end >= df['start'])
df[cond]

In pandas dataframes, how would you convert all index labels as type DatetimeIndex to datetime.datetime?

Just as the title says, I am trying to convert my DataFrame labels to type datetime. In the following attempted solution I pulled the labels from the DataFrame into dates_index and tried converting them to datetime with the function DatetimeIndex.to_datetime; however, the interpreter says that DatetimeIndex has no attribute to_datetime.
dates_index = df.index[0::]
dates = DatetimeIndex.to_datetime(dates_index)
I've also tried using the pandas.to_datetime function.
dates = pandas.to_datetime(dates_index, errors='coerce')
This returns the datetime wrapped in DatetimeIndex instead of just datetimes.
My DatetimeIndex labels contain data for date and time, and my goal is to push that data into two separate columns of the DataFrame.
If your DatetimeIndex is myindex, then df.reset_index() will create a myindex column, which you can do what you want with. If you want to make it an index again later, you can revert with df.set_index('myindex').
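A minimal sketch of that round trip, using a hypothetical frame whose index is named myindex:

```python
import pandas as pd

df = pd.DataFrame({'v': [1, 2]},
                  index=pd.DatetimeIndex(['2020-01-01', '2020-01-02'],
                                         name='myindex'))

df = df.reset_index()          # 'myindex' becomes an ordinary column
print(df.columns.tolist())     # ['myindex', 'v']
df = df.set_index('myindex')   # revert: make it the index again
```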
You can set the index after converting the datatype of the column.
To convert datatype to datetime, use: to_datetime
And, to set the column as index use: set_index
Hope this helps!
import pandas as pd
df = pd.DataFrame({
'mydatecol': ['06/11/2020', '06/12/2020', '06/13/2020', '06/14/2020'],
'othcol1': [10, 20, 30, 40],
'othcol2': [1, 2, 3, 4]
})
print(df)
print(f'Index type is now {df.index.dtype}')
df['mydatecol'] = pd.to_datetime(df['mydatecol'])
df.set_index('mydatecol', inplace=True)
print(df)
print(f'Index type is now {df.index.dtype}')
Output is
mydatecol othcol1 othcol2
0 06/11/2020 10 1
1 06/12/2020 20 2
2 06/13/2020 30 3
3 06/14/2020 40 4
Index type is now int64
othcol1 othcol2
mydatecol
2020-06-11 10 1
2020-06-12 20 2
2020-06-13 30 3
2020-06-14 40 4
Index type is now datetime64[ns]
I found a quick solution to my problem. You can create a new pandas column based on the index and then use datetime to reformat the date.
df['date'] = df.index # Creates new column called 'date' of type Timestamp
df['date'] = df['date'].dt.strftime('%m/%d/%Y %I:%M%p') # Date formatting

Pandas resampling on same frequency - how to copy rows?

Please consider the following reproducible dataframe as an example:
import pandas as pd
import numpy as np
from datetime import datetime
list_dates = ['2018-01-05',
              '2019-01-01',
              '2019-01-02',
              '2019-01-05',
              '2019-01-08',
              '2019-01-22']
index = []
for i in list_dates:
    tp = datetime.strptime(i, "%Y-%m-%d")
    index.append(tp)
data = np.array([np.arange(6)]*3).T
columns = ['A', 'B', 'C']
df = pd.DataFrame(data, index=index, columns=columns)
df['D'] = ['Loc1', 'Loc1', 'Loc2', 'Loc2', 'Loc4', 'Loc3']
df['E'] = [0.1, 1, 10, 100, 1000, 10000]
Then, I try to create a new dataframe df2 by resampling the above dataset, so I have all daily dates between 2018-01-05 (first date in list_dates) and 2019-01-22 (last date in list_dates). When doing the resampling, I basically create new rows in my dataframe for which I don't have any data.
These new rows should simply be copies of their last known value. So for example, in my example dataframe above, I have data for 2018-01-05, but not for 2018-01-06 until 2018-12-31. All these rows should be filled with the values of the previous / last known value (= the row of 2018-01-05).
I tried doing that using:
df2 = df.resample('D').last()
However, this doesn't work. Instead I get the full range of dates from 2018-01-05 until 2019-01-22, where all new rows (that were not in the original dataframe df) have nan values only.
What am I missing? Any suggestions on how I can fix this?
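The thread leaves this unanswered; assuming the goal stated above (carry the last known row forward into the new daily rows), the usual fix is to chain ffill() on the resampler instead of last():

```python
import pandas as pd
import numpy as np
from datetime import datetime

list_dates = ['2018-01-05', '2019-01-01', '2019-01-02',
              '2019-01-05', '2019-01-08', '2019-01-22']
index = [datetime.strptime(i, '%Y-%m-%d') for i in list_dates]
data = np.array([np.arange(6)] * 3).T
df = pd.DataFrame(data, index=index, columns=['A', 'B', 'C'])
df['D'] = ['Loc1', 'Loc1', 'Loc2', 'Loc2', 'Loc4', 'Loc3']
df['E'] = [0.1, 1, 10, 100, 1000, 10000]

# resample('D') creates the missing daily rows; ffill() propagates the
# last known row forward into them (last() leaves empty bins as NaN)
df2 = df.resample('D').ffill()
print(df2.loc['2018-01-06'])  # same values as the 2018-01-05 row
```

Equivalently, df.resample('D').last().ffill() first takes the last value per bin and then forward-fills the empty bins.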

Pandas - Add seconds from a column to datetime in other column

I have a DataFrame with two columns, ["StartDate", "duration"].
The elements in the StartDate column are datetime type, and the durations are ints.
Something like:
StartDate Duration
08:16:05 20
07:16:01 20
I expect to get:
EndDate
08:16:25
07:16:21
Simply add the seconds to the hour.
I'd been checking some ideas about it, like the timedelta types, and I know datetimes support adding timedeltas, but so far I can't find how to do it with DataFrames in a vectorized fashion (it is surely possible to iterate over all the rows performing the operation, but that seems inefficient).
consider this df
StartDate duration
0 01/01/2017 135
1 01/02/2017 235
You can get the datetime column like this
df['EndDate'] = pd.to_datetime(df['StartDate']) + pd.to_timedelta(df['duration'], unit='s')
df.drop(['StartDate', 'duration'], axis=1, inplace=True)
You get
EndDate
0 2017-01-01 00:02:15
1 2017-01-02 00:03:55
EDIT: with the sample dataframe that you posted
df['EndDate'] = pd.to_timedelta(df['StartDate']) + pd.to_timedelta(df['Duration'], unit='s')
Or, row by row (slower, but equivalent):
df['EndDate'] = df.apply(lambda x: pd.to_timedelta(x.StartDate) + pd.Timedelta(seconds=x.Duration), axis=1)
