Please consider the following reproducible dataframe as an example:
import pandas as pd
import numpy as np
from datetime import datetime
list_dates = ['2018-01-05',
'2019-01-01',
'2019-01-02',
'2019-01-05',
'2019-01-08',
'2019-01-22']
index = []
for i in list_dates:
    tp = datetime.strptime(i, "%Y-%m-%d")
    index.append(tp)
data = np.array([np.arange(6)]*3).T
columns = ['A','B', 'C']
df = pd.DataFrame(data, index = index, columns=columns)
df['D']= ['Loc1', 'Loc1', 'Loc2', 'Loc2', 'Loc4', 'Loc3']
df['E'] = [0.1, 1, 10, 100, 1000, 10000]
Then, I try to create a new dataframe df2 by resampling the above dataset, so I have all daily dates between 2018-01-05 (first date in list_dates) and 2019-01-22 (last date in list_dates). When doing the resampling, I basically create new rows in my dataframe for which I don't have any data.
These new rows should simply be copies of their last known value. So for example, in my example dataframe above, I have data for 2018-01-05, but not for 2018-01-06 until 2018-12-31. All these rows should be filled with the values of the previous / last known value (= the row of 2018-01-05).
I tried doing that using:
df2 = df.resample('D').last()
However, this doesn't work. Instead I get the full range of dates from 2018-01-05 until 2019-01-22, where all new rows (that were not in the original dataframe df) have nan values only.
What am I missing? Any suggestions on how I can fix this?
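One possible fix, sketched on a trimmed copy of the example data: resample('D').last() aggregates each daily bin on its own, so days without any rows come back as NaN. Forward-filling after the resample carries the last known row into the new dates instead. The smaller column set below is an assumption made to keep the example short.

```python
import pandas as pd

# Trimmed version of the example dataframe (only columns A, D and E)
index = pd.to_datetime(['2018-01-05', '2019-01-01', '2019-01-02',
                        '2019-01-05', '2019-01-08', '2019-01-22'])
df = pd.DataFrame({'A': range(6),
                   'D': ['Loc1', 'Loc1', 'Loc2', 'Loc2', 'Loc4', 'Loc3'],
                   'E': [0.1, 1, 10, 100, 1000, 10000]},
                  index=index)

# Upsample to daily frequency and copy the last known row forward
df2 = df.resample('D').ffill()

print(df2.loc['2018-01-05':'2018-01-08'])
```

df.asfreq('D', method='ffill') should give the same result here, since the original index has at most one row per day.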
I have the following DataFrame with a Date column,
0 2021-12-13
1 2021-12-10
2 2021-12-09
3 2021-12-08
4 2021-12-07
...
7990 1990-01-08
7991 1990-01-05
7992 1990-01-04
7993 1990-01-03
7994 1990-01-02
I am trying to find the index for a specific date in this DataFrame using the following code,
# import raw data into DataFrame
df = pd.DataFrame.from_records(data['dataset']['data'])
df.columns = data['dataset']['column_names']
df['Date'] = pd.to_datetime(df['Date'])
# sample date to search for
sample_date = dt.date(2021,12,13)
print(sample_date)
# return index of sample date
date_index = df.index[df['Date'] == sample_date].tolist()
print(date_index)
The output of the program is,
2021-12-13
[]
I can't understand why. I have cast the Date column in the DataFrame to a DateTime and I'm doing a like-for-like comparison.
I reproduced your DataFrame with a minimal sample. Comparing the column against a datetime value instead of a date makes the lookup work, like this:
import pandas as pd
import datetime as dt
df = pd.DataFrame({'Date':['2021-12-13','2021-12-10','2021-12-09','2021-12-08']})
df['Date'] = pd.to_datetime(df['Date'].astype(str), format='%Y-%m-%d')
sample_date = dt.datetime.strptime('2021-12-13', '%Y-%m-%d')
date_index = df.index[df['Date'] == sample_date].tolist()
print(date_index)
output:
[0]
The date searched for is at index 0 of the DataFrame.
Please let me know if this one has any issues
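If the search value has to stay a datetime.date, another option is to wrap it in pd.Timestamp at comparison time, so both sides of == are datetime-like. This is a minimal sketch on a cut-down copy of the data, not the only way to do it.

```python
import datetime as dt
import pandas as pd

# Cut-down copy of the Date column from the question
df = pd.DataFrame({'Date': ['2021-12-13', '2021-12-10', '2021-12-09']})
df['Date'] = pd.to_datetime(df['Date'])

sample_date = dt.date(2021, 12, 13)

# Convert the plain date to a Timestamp before comparing
date_index = df.index[df['Date'] == pd.Timestamp(sample_date)].tolist()
print(date_index)
```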
I have one table with two columns: date (01-01-2010 to 31-08-2021), value (mm)
I would like to get only the data from 2020. Is there a function or something similar to select only the data in a specific period?
For example to create one pivot.
try this:
df = pd.DataFrame(
{'date':['27-02-2010','31-1-2020','31-1-2021','02-1-2020','13-2-2020',
'07-2-2019','30-4-2018','04-8-2020','06-4-2013','21-6-2020'],
'value':['foo','bar','lorem','ipsum','alpha','omega','big','small','salt','pepper']})
# the dates are day-first, so state the format explicitly
df[pd.to_datetime(df['date'], format='%d-%m-%Y').dt.year == 2020]
Output:
date value
1 31-1-2020 bar
3 02-1-2020 ipsum
4 13-2-2020 alpha
7 04-8-2020 small
9 21-6-2020 pepper
Or, to search within an arbitrary range, you can use this (>= and <= keep the first and last day of the range):
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')
df[(df['date'] >= pd.Timestamp(2020, 1, 1)) & (df['date'] <= pd.Timestamp(2020, 12, 31))]
Here is one idea for returning values from a dataset based on the year, using string slicing. If this doesn't fit your situation, please edit your post to include a specific code example.
import pandas as pd
df = pd.DataFrame(
{'date':['27-02-2010','31-1-2020','31-1-2021','02-1-2020','13-2-2020','07-2-2019','30-4-2018','04-8-2020','06-4-2013','21-6-2020'],'value':['foo','bar','lorem','ipsum','alpha','omega','big','small','salt','pepper']})
for row in df.iterrows():
    # the last four characters of the date string are the year
    if row[1]['date'][-4:] == '2020':
        print(row[1]['value'])
This prints only the values from the dataframe whose dates fall in the year 2020.
Pandas has extensive time series features that you may want to use, but for a simpler approach, you could define the date column as the index and then slice the data (assuming the table is already sorted by date):
import pandas as pd
df = pd.DataFrame({'date': ['31-12-2019', '01-01-2020', '01-07-2020',
'31-12-2020', '01-01-2021'],
'value': [1, 2, 3, 4, 5]})
df.index = df.date
df.loc['01-01-2020':'31-12-2020']
date value
date
01-01-2020 01-01-2020 2
01-07-2020 01-07-2020 3
31-12-2020 31-12-2020 4
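The slice above works on string labels, so it depends on the row order. As a hedged alternative, converting the column to real datetimes and using it as a DatetimeIndex lets pandas select a whole year by partial string. The explicit format string assumes the dates are day-first, as in the example.

```python
import pandas as pd

df = pd.DataFrame({'date': ['31-12-2019', '01-01-2020', '01-07-2020',
                            '31-12-2020', '01-01-2021'],
                   'value': [1, 2, 3, 4, 5]})

# Parse the day-first strings and use them as a sorted DatetimeIndex
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')
df = df.set_index('date').sort_index()

# Partial string indexing: every row whose date falls in 2020
only_2020 = df.loc['2020']
print(only_2020)
```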
I need all the values of the DataFrame to not have a time element.
For example with the dataframe:
df = pd.date_range("2019-01-01", periods = 9, freq='M')
Currently df[2] shows:
2019-03-31 00:00:00
What is the best way to remove the time part and just include the date for all elements in the array?
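Two common ways to drop the time part, sketched on a short hand-written index rather than the date_range call above: the .date attribute gives plain datetime.date objects, and .strftime gives formatted strings. Which one fits depends on whether the result still needs to behave like a date.

```python
import pandas as pd

# Short stand-in for the month-end index from the question
idx = pd.to_datetime(['2019-01-31', '2019-02-28', '2019-03-31'])

# Array of datetime.date objects (no time component)
dates_only = idx.date

# Index of strings in YYYY-MM-DD form
as_text = idx.strftime('%Y-%m-%d')

print(dates_only[2])
print(as_text[2])
```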
Just as the title says, I am trying to convert my DataFrame labels to type datetime. In the following attempted solution I pulled the labels from the DataFrame into dates_index and tried converting them to datetime using the function DatetimeIndex.to_datetime. However, Python reports that DatetimeIndex has no attribute to_datetime.
dates_index = df.index[0::]
dates = DatetimeIndex.to_datetime(dates_index)
I've also tried using the pandas.to_datetime function.
dates = pandas.to_datetime(dates_index, errors='coerce')
This returns the datetime wrapped in DatetimeIndex instead of just datetimes.
My DatetimeIndex labels contain both date and time data, and my goal is to push that data into two separate columns of the DataFrame.
If your DatetimeIndex is named myindex, then df.reset_index() will create a myindex column, which you can do what you want with. If you want to make it the index again later, you can revert with df.set_index('myindex').
You can set the index after converting the datatype of the column.
To convert datatype to datetime, use: to_datetime
And, to set the column as index use: set_index
Hope this helps!
import pandas as pd
df = pd.DataFrame({
'mydatecol': ['06/11/2020', '06/12/2020', '06/13/2020', '06/14/2020'],
'othcol1': [10, 20, 30, 40],
'othcol2': [1, 2, 3, 4]
})
print(df)
print(f'Index type is now {df.index.dtype}')
df['mydatecol'] = pd.to_datetime(df['mydatecol'])
df.set_index('mydatecol', inplace=True)
print(df)
print(f'Index type is now {df.index.dtype}')
Output is
mydatecol othcol1 othcol2
0 06/11/2020 10 1
1 06/12/2020 20 2
2 06/13/2020 30 3
3 06/14/2020 40 4
Index type is now int64
othcol1 othcol2
mydatecol
2020-06-11 10 1
2020-06-12 20 2
2020-06-13 30 3
2020-06-14 40 4
Index type is now datetime64[ns]
I found a quick solution to my problem. You can create a new pandas column based on the index and then use datetime to reformat the date.
df['date'] = df.index # Creates new column called 'date' of type Timestamp
df['date'] = df['date'].dt.strftime('%m/%d/%Y %I:%M%p') # Date formatting
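Since the stated goal was two separate columns, here is a small sketch that splits a DatetimeIndex into a date column and a time column using the index's .date and .time attributes. The sample data and the column names date and time are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with a DatetimeIndex holding both date and time
df = pd.DataFrame({'value': [1, 2]},
                  index=pd.to_datetime(['2020-06-11 08:30:00',
                                        '2020-06-12 17:45:00']))

# One column per component, taken straight from the index
df['date'] = df.index.date   # datetime.date objects
df['time'] = df.index.time   # datetime.time objects

print(df)
```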
I have a pandas DataFrame with two date columns (A and B), and I would like to create a third column (C) holding dates built from the month and year of column A and the day of column B. Where that day does not exist in the month, it would need to be adjusted: for example, an attempt to create 31st Feb 2020 should become 29th Feb 2020.
For example
import pandas as pd
df = pd.DataFrame({'A': ['2020-02-21', '2020-03-21', '2020-03-21'],
'B': ['2020-01-31', '2020-02-11', '2020-02-01']})
for c in df.columns:
    dfx[c] = pd.to_datetime(dfx[c])
Then I want to create a new column C that is a new datetime that is:
year = df.A.dt.year
month = df.A.dt.month
day = df.B.dt.day
I don't know how to create this column. Can you please help?
Here is one way to do it, using pandas' time series functionality:
import pandas as pd
# your example data
df = pd.DataFrame({'A': ['2020-02-21', '2020-03-21', '2020-03-21'],
'B': ['2020-01-31', '2020-02-11', '2020-02-01']})
for c in df.columns:
    # keep using the same dataframe here
    df[c] = pd.to_datetime(df[c])
# set back every date from A to the end of the previous month,
# then add the number of days from the date in B
df['C'] = df.A - pd.offsets.MonthEnd() + pd.to_timedelta(df.B.dt.day, unit='D')
print(df)
Result:
A B C
0 2020-02-21 2020-01-31 2020-03-02
1 2020-03-21 2020-02-11 2020-03-11
2 2020-03-21 2020-02-01 2020-03-01
As you can see in row 0, this handles the case of "February 31st" not quite as you suggested, but still in a logical way.
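If the result should instead be clamped to the last day of the month, as the question asks (29th Feb 2020 rather than 2nd March), one possible sketch is to cap the day taken from B at the number of days in A's month and then assemble the date from its parts. This is a variation on the answer above, not the answerer's method; the helper variable day is an assumption.

```python
import pandas as pd

df = pd.DataFrame({'A': ['2020-02-21', '2020-03-21', '2020-03-21'],
                   'B': ['2020-01-31', '2020-02-11', '2020-02-01']})
for c in df.columns:
    df[c] = pd.to_datetime(df[c])

# Day from B, capped at the length of the month in A
day = df['B'].dt.day.clip(upper=df['A'].dt.days_in_month)

# Assemble the new date from year/month of A and the capped day
parts = pd.DataFrame({'year': df['A'].dt.year,
                      'month': df['A'].dt.month,
                      'day': day})
df['C'] = pd.to_datetime(parts)

print(df)
```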