I have a column of data which looks like the following:
I am trying to set a range of the entire month:
rng = pd.date_range('2016-09-01 00:00:00', '2016-09-30 23:59:58', freq='S')
But my column of data (above) is missing a few hours, and I am unsure where (since my data is 2 million rows long).
I tried to use the reindex command, but it instead seemed to have filled everything with zeroes.
The code that I was using is as follows:
df = pd.DataFrame(df_csv)
rng = pd.date_range('2016-09-01 00:00:00', '2016-09-30 23:59:58', freq='S')
df = df.reindex(rng, fill_value=0.0)
How do I properly fill in the missing date/times without filling everything with a 0?
I think you need to call set_index on the date column first; then it is possible to use reindex:
# cast column date if its dtype is not already datetime
df.date = pd.to_datetime(df.date)
df = df.set_index('date').reindex(rng, fill_value=0.0)
You got zeroes everywhere because reindexing an integer index with datetime values produces all NaN, and fill_value=0.0 then replaced every NaN with 0.0.
Also, if the date column is sorted, a more general solution is to select the first and last values of that column:
start_date = df.date.iat[0]
end_date = df.date.iat[-1]
rng = pd.date_range(start_date, end_date, freq='S')
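A minimal self-contained sketch of the approach above, with a made-up one-column frame (the 'value' column and the dropped timestamps are purely illustrative):

```python
import pandas as pd

# Build a tiny per-second series, then knock three seconds out of it
full = pd.date_range('2016-09-01 00:00:00', '2016-09-01 00:00:09', freq='S')
df = pd.DataFrame({'value': range(10)}, index=full).drop(full[3:6])

# Rebuild the full range from the first and last timestamps, then reindex;
# the three missing seconds come back filled with 0.0
rng = pd.date_range(df.index[0], df.index[-1], freq='S')
df = df.reindex(rng, fill_value=0.0)
```

After the reindex, the frame is back to ten rows and only the previously missing seconds carry the fill value.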
Related
I am desperately trying to group my data in order to see in which months most people travel, but first I want to remove all the data from before a certain year.
As you can see in the picture, I have data going all the way back to the year 0003, which I do not want to include.
How can I set an interval from 1938-01-01 to 2020-09-21 with pandas and datetime?
One way to solve this is:
Verify that the date is in datetime format (it is necessary to convert it first):
df.date_start = pd.to_datetime(df.date_start)
Set date_start as the new index:
df.index = df.date_start
Then apply the groupby (np is the standard numpy import):
import numpy as np

df.groupby([pd.Grouper(freq="1M"), "country_code"]) \
    .agg({"Name of the column with frequencies": np.sum})
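A self-contained sketch of the same idea; the data and the 'trips' column name are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy travel data; the column names here are assumptions
df = pd.DataFrame({
    'date_start': pd.to_datetime(['2019-01-05', '2019-01-20', '2019-02-11']),
    'country_code': ['US', 'US', 'FR'],
    'trips': [1, 2, 3],
})
df.index = df.date_start

# One row per (month-end, country) with the summed frequencies
out = df.groupby([pd.Grouper(freq='1M'), 'country_code']).agg({'trips': np.sum})
```

pd.Grouper(freq='1M') buckets the DatetimeIndex by calendar month, so the result is a MultiIndex of (month end, country_code).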
Boolean indexing with pandas.Series.between
# sample data
df = pd.DataFrame(pd.date_range('1910-01-01', '2020-09-21', periods=10), columns=['Date'])
# boolean indexing with Series.between
new_df = df[df['Date'].between('1938-01-01', '2020-09-21')]
# group by month and get the count of each group
months = new_df.groupby(new_df['Date'].dt.month).count()
I am trying to set new columns (day of year & hour).
My datetime consists of a date and an hour. I tried to split it up by using
data['dayofyear'] = data['Date'].dt.dayofyear
and
df['Various', 'Day'] = df.index.dayofyear
df['Various', 'Hour'] = df.index.hour
but it always returns an error; I am not sure how I can split this up and get it into new columns.
I think the problem is that there is no DatetimeIndex, so use to_datetime first and then assign to the new column names:
df.index = pd.to_datetime(df.index)
df['Day'] = df.index.dayofyear
df['Hour'] = df.index.hour
Or use DataFrame.assign:
df.index = pd.to_datetime(df.index)
df = df.assign(Day = df.index.dayofyear, Hour = df.index.hour)
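For example, a minimal sketch assuming an index of date-hour strings (the sample timestamps are made up):

```python
import pandas as pd

# Two illustrative rows with string timestamps in the index
df = pd.DataFrame({'value': [1, 2]},
                  index=['2020-01-01 05:00', '2020-02-01 13:00'])

# Convert to a DatetimeIndex, then pull out day-of-year and hour
df.index = pd.to_datetime(df.index)
df = df.assign(Day=df.index.dayofyear, Hour=df.index.hour)
```

Jan 1 maps to day-of-year 1 and Feb 1 to 32, with the hours 5 and 13 split into their own column.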
I have a pandas dataframe 'df' with a column 'DateTimes' of type datetime.time.
The entries of that column are hours of a single day:
00:00:00
.
.
.
23:59:00
Seconds are skipped; it counts by minutes.
How can I choose rows by hour, for example the rows between 00:00:00 and 00:01:00?
If I try this:
df.between_time('00:00:00', '00:00:10')
I get an error that index must be a DateTimeIndex.
I set the index as such with:
df=df.set_index(keys='DateTime')
but I get the same error.
I can't seem to get 'loc' to work either. Any suggestions?
Here is a working example of what you are trying to do:
import numpy as np
import pandas as pd

times = pd.date_range('3/6/2012 00:00', periods=100, freq='S', tz='UTC')
df = pd.DataFrame(np.random.randint(10, size=(100, 1)), index=times)
df.between_time('00:00:00', '00:00:30')
Note the index has to be of type DatetimeIndex.
I understand you have a column with your dates/times. The problem is probably that your column is not of this type, so you have to convert it first before setting it as the index:
# Method A
df = df.set_index(pd.to_datetime(df['column_name']))
# Method B
df.index = pd.to_datetime(df['column_name'])
df = df.drop('column_name', axis=1)
(The drop is only necessary if you want to remove the original column after setting it as the index.)
Check out these links:
convert column to date type: Convert DataFrame column type from string to datetime
filter dataframe on dates: Filtering Pandas DataFrames on dates
Hope this helps
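Applied to a column of datetime.time objects like the one described in the question, a sketch might look like this (the 'DateTime' and 'value' names are assumptions):

```python
import datetime
import pandas as pd

df = pd.DataFrame({'DateTime': [datetime.time(0, 0), datetime.time(0, 1),
                                datetime.time(0, 3)],
                   'value': [10, 20, 30]})

# time objects must become full timestamps before between_time works;
# parsing their string form attaches today's date, which is harmless here
df.index = pd.to_datetime(df['DateTime'].astype(str))
subset = df.between_time('00:00:00', '00:01:00')
```

Both endpoints are included by default, so the 00:00 and 00:01 rows survive while 00:03 is filtered out.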
(I think) I have a dataset with the columns representing datetime intervals
Columns were transformed in datetime with:
for col in df.columns:
    df.rename({col: pd.to_datetime(col, infer_datetime_format=True)}, inplace=True)
Then, I need to resample the columns (year and month '2001-01') into quarters using mean
I tried
df = df.resample('1q', how='mean', axis=1)
The DataFrame also has a multindex set ['RegionName', 'County']
But I get the error:
Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
Is the problem in the to_datetime function or in the wrong sampling?
(I think) you are renaming each column header rather than making the entire columns object a DatetimeIndex.
Try this instead:
df.columns = pd.to_datetime(df.columns)
Then run your resample.
Note: I'd do it with Period after converting to a DatetimeIndex. That way, you get the period in your column header rather than the end date of the quarter.
df.groupby(df.columns.to_period('Q'), axis=1).mean()
demo
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(2, -1),
                  columns=['2011-01-31', '2011-02-28', '2011-03-31',
                           '2011-04-30', '2011-05-31', '2011-06-30'])
df.columns = pd.to_datetime(df.columns)
print(df.groupby(df.columns.to_period('Q'), axis=1).mean())
2011Q1 2011Q2
0 1 4
1 7 10
I have a dataframe that contains the columns company_id, seniority, join_date and quit_date. I am trying to extract the number of days between join date and quit date. However, I get NaNs.
If I drop off all the columns in the dataframe except for quit date and join date and run the same code again, I get what I expect. However with all the columns, I get NaNs.
Here's my code:
df['join_date'] = pd.to_datetime(df['join_date'])
df['quit_date'] = pd.to_datetime(df['quit_date'])
df['days'] = df['quit_date'] - df['join_date']
df['days'] = df['days'].astype(str)
df1 = pd.DataFrame(df.days.str.split(' ').tolist(), columns = ['days', 'unwanted', 'stamp'])
df['numberdays'] = df1['days']
This is what I get:
days numberdays
585 days 00:00:00 NaN
340 days 00:00:00 NaN
I want 585 from the 'days' column in the 'numberdays' column. Similarly for every such row.
Can someone help me with this?
Thank you!
Instead of converting to string, extract the number of days from the timedelta value using the dt accessor.
import pandas as pd
df = pd.DataFrame({'join_date': ['2014-03-24', '2013-04-29', '2014-10-13'],
                   'quit_date': ['2015-10-30', '2014-04-04', '']})
df['join_date'] = pd.to_datetime(df['join_date'])
df['quit_date'] = pd.to_datetime(df['quit_date'])
df['days'] = df['quit_date'] - df['join_date']
df['number_of_days'] = df['days'].dt.days
# Mohammad Yusuf Ghazi points out that dt.day, not dt.days, is the accessor to use when working with datetime data rather than timedeltas.
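Rebuilt as a self-contained check (dropping the row with the empty quit date), the .dt.days route gives the whole-day counts from the question directly:

```python
import pandas as pd

df = pd.DataFrame({'join_date': pd.to_datetime(['2014-03-24', '2013-04-29']),
                   'quit_date': pd.to_datetime(['2015-10-30', '2014-04-04'])})

# Subtracting two datetime columns yields a timedelta column;
# .dt.days extracts the integer day count with no string parsing
df['number_of_days'] = (df['quit_date'] - df['join_date']).dt.days
print(df['number_of_days'].tolist())  # → [585, 340]
```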