My data frame looks like this:
In [1]: df.head()
Out[1]:
Datetime Value
2018-04-21 14:08:30.761 offline
2018-04-21 14:08:40.761 offline
2018-04-21 14:08:50.761 offline
2018-04-21 14:09:00.761 offline
2018-04-21 14:09:10.761 offline
I have data for 2 weeks. I want to plot Value against time (hours:minutes) for each day of the week. Seeing the data one week at a time would also work.
I took a slice for a single day and created a plot using plotly.
In[9]: df['numval'] = df.Value.apply(lambda x: 1 if x == 'online' else -1)
In[10]: df.iplot()
If I could get multiple plots like this for Sunday through Saturday in a few lines, it would speed up my work.
Suggestions -
Ideally I could pass arguments for weekday (0-6), time (x-axis) and Value (y-axis), and it would create 7 plots.
In[11]: df['weekday'] = df.index.weekday
In[12]: df['weekdayname'] = df.index.weekday_name
In[13]: df['time'] = df.index.time
Any library would work, as I just want to see the data and will need to test out modifications to it.
Optional: a distribution curve over the data, similar to a KDE, would be nice.
This may not be the exact answer you are looking for. Just giving an approach which could be helpful.
The approach here is to group the data based on date and then generate a plot for each group. For this you need to split the Datetime column into two columns, date and time. The code below will do that:
# Split each "date time" string once, vectorized; this replaces the original
# append loop, since Series.append was removed in pandas 2.0
split = df['Datetime'].astype(str).str.split(' ', n=1, expand=True)
date_series = split[0]
time_series = split[1]
Code above will give you two separate pandas series. One for date and the other one for time. Now you can add the two columns to your dataframe
df['date'] = date_series
df['time'] = time_series
Now you can use groupby functionality to group the data based on date and plot data for each group. Something like this:
First replace 'offline' and 'online' with numeric values:
df1 = df.replace({'offline': 0, 'online': 1})
Now group the data based on date and plot:
for title, group in df1.groupby('date'):
    group.plot(x='time', y='Value', title=title)
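A self-contained sketch of this group-and-plot approach, using the index's own datetime accessors (the sample rows and column names are assumptions modelled on the question's data):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the loop runs headless
import pandas as pd

# Hypothetical sample resembling the question's data
idx = pd.to_datetime(["2018-04-21 14:08:30", "2018-04-21 14:08:40",
                      "2018-04-22 09:00:00", "2018-04-22 09:00:10"])
df = pd.DataFrame({"Value": ["offline", "online", "online", "offline"]}, index=idx)

# Numeric encoding of the status, plus date/time split via the index
df["numval"] = df["Value"].map({"online": 1, "offline": -1})
df["date"] = df.index.date
df["time"] = df.index.strftime("%H:%M:%S")

# One plot per day; any plotting library could replace .plot here
for title, group in df.groupby("date"):
    group.plot(x="time", y="numval", title=str(title))
```

The same loop generates one figure per weekday when run over the full two weeks of data.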
I have a dataset with 10 years of data, from 2000 to 2010. The initial datetime is 2000-01-01, with the data resampled to daily. I also have a weekly counter, so that when I slice I can ask only for week 5 to week 21 (February 1 to May 30).
I am a little stuck on how to slice it for every year. Does it involve a loop, or is there a time-series function in Python that knows how to slice a specific period in every year? Below is the code I have so far; I had a for loop that was supposed to slice(5, 21), but that didn't work.
Any suggestions on how I might get this to work?
import pandas as pd
from datetime import datetime, timedelta
initial_datetime = pd.to_datetime("2000-01-01")
# Read the file
df = pd.read_csv("D:/tseries.csv")
# Convert seconds to datetime
df["Time"] = df["Time"].map(lambda dt: initial_datetime+timedelta(seconds=dt))
df = df.set_index(pd.DatetimeIndex(df["Time"]))
resampling_period = "24H"
df = df.resample(resampling_period).mean().interpolate()
df["Week"] = df.index.map(lambda dt: dt.week)
print(df)
You can slice using loc:
df.loc[df.Week.isin(range(5,22))]
If you want separate calculations per year (e.g. the mean), you can use groupby:
subset = df.loc[df.Week.isin(range(5,22))]
subset.groupby(subset.index.year).mean()
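Put together, a runnable sketch of this week-based slicing (synthetic daily data stands in for the CSV, whose contents are not shown):

```python
import pandas as pd

# Synthetic stand-in for the resampled daily series from the question
idx = pd.date_range("2000-01-01", "2002-12-31", freq="D")
df = pd.DataFrame({"value": range(len(idx))}, index=idx)

# isocalendar().week replaces the deprecated .week accessor on newer pandas
df["Week"] = df.index.isocalendar().week

# Keep weeks 5-21 of every year, then aggregate per year
subset = df.loc[df["Week"].isin(range(5, 22))]
yearly_means = subset.groupby(subset.index.year)["value"].mean()
```

Because the week filter is a plain boolean mask, it applies across all years at once with no explicit loop.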
I'm desperately trying to group my data in order to see in which months most people travel, but first I want to remove all the data from before a certain year.
As you can see in the picture, I have data going all the way back to the year 0003, which I do not want to include.
How can I set an interval from 1938-01-01 to 2020-09-21 with pandas and datetime?
One way to solve this is:
Verify that the date is in datetime format (it is necessary to convert it first):
df.date_start = pd.to_datetime(df.date_start)
Set date_start as new index:
df.index = df.date_start
Apply this grouping (np here requires import numpy as np):
df.groupby([pd.Grouper(freq="1M"), "country_code"]) \
    .agg({"Name of the column with frequencies": np.sum})
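A minimal, self-contained version of this Grouper-based aggregation (the column names and sample rows are hypothetical, mirroring the answer's placeholders):

```python
import pandas as pd

# Hypothetical travel records; names mirror the answer's placeholders
df = pd.DataFrame({
    "date_start": pd.to_datetime(["1901-05-01", "1950-03-01", "1950-03-15", "2001-07-04"]),
    "country_code": ["SE", "SE", "SE", "US"],
    "travellers": [9, 2, 3, 1],
})

# Keep only the interval of interest, then index by the start date
df = df[df["date_start"].between("1938-01-01", "2020-09-21")]
df = df.set_index("date_start")

# Monthly totals per country; the pre-1938 row is excluded
monthly = df.groupby([pd.Grouper(freq="1M"), "country_code"])["travellers"].sum()
```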
Boolean indexing with pandas.Series.between
# sample data
df = pd.DataFrame(pd.date_range('1910-01-01', '2020-09-21', periods=10), columns=['Date'])
# boolean indexing with Series.between
new_df = df[df['Date'].between('1938-01-01', '2020-09-21')]
# groupby month and get the count of each group
months = new_df.groupby(new_df['Date'].dt.month).count()
I have data like this that I want to plot by month and year using matplotlib.
df = pd.DataFrame({'date':['2018-10-01', '2018-10-05', '2018-10-20','2018-10-21','2018-12-06',
'2018-12-16', '2018-12-27', '2019-01-08','2019-01-10','2019-01-11',
'2019-01-12', '2019-01-13', '2019-01-25', '2019-02-01','2019-02-25',
'2019-04-05','2019-05-05','2018-05-07','2019-05-09','2019-05-10'],
'counts':[10,5,6,1,2,
5,7,20,30,8,
9,1,10,12,50,
8,3,10,40,4]})
First, I converted the date column to datetime format and got the year and month from each date.
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
Then, I tried to do groupby like this.
aggmonth = df.groupby(['year', 'month']).sum()
And I want to visualize it in a bar chart or something like that. But as you notice above, there are missing months in between the data. I want those missing months to be filled with 0s, but I don't know how to do that in a dataframe like this. Previously, I asked this question about filling missing dates in a period of data, where I converted the dates to a period range in month-year format.
by_month = pd.to_datetime(df['date']).dt.to_period('M').value_counts().sort_index()
by_month.index = pd.PeriodIndex(by_month.index)
df_month = by_month.rename_axis('month').reset_index(name='counts')
df_month
idx = pd.period_range(df_month['month'].min(), df_month['month'].max(), freq='M')
s = df_month.set_index('month').reindex(idx, fill_value=0)
s
But when I tried to plot s using matplotlib, it returned an error. It turned out you cannot plot period data directly with matplotlib.
So basically I got these two ideas in my head, but both are stuck, and I don't know which one I should keep pursuing to get the result I want.
What is the best way to do this? Thanks.
Convert the date column to pandas datetime series, then use groupby on monthly period and aggregate the data using sum, next use DataFrame.resample on the aggregated dataframe to resample using monthly frequency:
df['date'] = pd.to_datetime(df['date'])
# select 'counts' so the datetime and helper columns are not summed as well
df1 = df.groupby(df['date'].dt.to_period('M'))['counts'].sum()
df1 = df1.resample('M').asfreq().fillna(0)
Plotting the data:
df1.plot(kind='bar')
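End to end with an abbreviated version of the question's data, the group-then-resample step looks like this (the bar chart itself additionally needs matplotlib):

```python
import pandas as pd

# Abbreviated version of the question's data
df = pd.DataFrame({
    "date": ["2018-10-01", "2018-12-06", "2019-01-08", "2019-02-01", "2019-04-05"],
    "counts": [10, 2, 20, 12, 8],
})

# Aggregate per month, then resample so the missing months show up as 0
df["date"] = pd.to_datetime(df["date"])
monthly = df.groupby(df["date"].dt.to_period("M"))["counts"].sum()
monthly = monthly.resample("M").asfreq().fillna(0)
```

The resulting series runs continuously from 2018-10 to 2019-04, with 2018-11 and 2019-03 filled as 0.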
I have the following data set:
df
OrderDate Total_Charged
7/9/2017 5
7/9/2017 5
7/20/2017 10
8/20/2017 6
9/20/2019 1
...
I want to make a bar plot with month_year (x-axis) and total charged per month/year, i.e. summed over month and year. First I want to group by month and year, and then make the plot. However, I get an error on the first step:
df["OrderDate"]=pd.to_datetime(df['OrderDate'])
monthly_orders=df.groupby([(df.index.year),(df.index.month)]).sum()["Total_Charged"]
I got the following error:
AttributeError: 'RangeIndex' object has no attribute 'year'
What am I doing wrong (what does the error mean)? How can I fix it?
Not sure why you're grouping by the index there. If you want to group by year and month respectively, you could do the following:
df["OrderDate"]=pd.to_datetime(df['OrderDate'])
df.groupby([df.OrderDate.dt.year, df.OrderDate.dt.month]).sum().plot.bar()
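A self-contained version of that one-liner, with sample rows mirroring the question's table (the error in the question came from grouping on the default RangeIndex, which has no .year; grouping on the column's .dt accessors avoids it):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for the bar chart
import pandas as pd

# Sample rows mirroring the question's table
df = pd.DataFrame({
    "OrderDate": ["7/9/2017", "7/9/2017", "7/20/2017", "8/20/2017", "9/20/2019"],
    "Total_Charged": [5, 5, 10, 6, 1],
})

# Parse the dates, then group on the year and month of the column itself
df["OrderDate"] = pd.to_datetime(df["OrderDate"])
monthly = df.groupby([df.OrderDate.dt.year, df.OrderDate.dt.month])["Total_Charged"].sum()
monthly.plot.bar()
```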
pandas.DataFrame.resample
This is a versatile option that easily implements aggregation over various time ranges (e.g. weekly, daily, quarterly, etc.).
Code:
A more expansive dataset:
This code block sets up the sample dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed for plt.show() below
from datetime import datetime, timedelta
# list of dates
first_date = datetime(2017, 1, 1)
last_date = datetime(2019, 9, 20)
x = 4
list_of_dates = [date for date in np.arange(first_date, last_date, timedelta(days=x)).astype(datetime)]
df = pd.DataFrame({'OrderDate': list_of_dates,
'Total_Charged': [np.random.randint(10) for _ in range(len(list_of_dates))]})
Using resample for Monthly Sum:
requires a datetime index
df.OrderDate = pd.to_datetime(df.OrderDate)
df.set_index('OrderDate', inplace=True)
monthly_sums = df.resample('M').sum()
monthly_sums.plot.bar(figsize=(8, 6))
plt.show()
An example with Quarterly Avg:
this shows the versatility of resample compared to groupby
Quarterly would not be easily implemented with groupby
quarterly_avg = df.resample('Q').mean()
quarterly_avg.plot.bar(figsize=(8, 6))
plt.show()
I am working on a code that takes hourly data for a month and groups it into 24-hour sums. My problem is that I would like the index to show the date/year, but I am just getting an index of 1-30.
The code I am using is
df = df.iloc[:,16:27].groupby([lambda x: x.day]).sum()
example of output I am getting
DateTime data
1 1772.031568
2 19884.42243
3 28696.72159
4 24906.20355
5 9059.120325
example of output I would like
DateTime data
1/1/2017 1772.031568
1/2/2017 19884.42243
1/3/2017 28696.72159
1/4/2017 24906.20355
1/5/2017 9059.120325
This is an old question, but I don't think the accepted solution is the best in this particular case. What you want to accomplish is to downsample time-series data, and pandas has built-in functionality for this called resample(). For your example you would do:
df = df.iloc[:,16:27].resample('D').sum()
or if the datetime column is not the index
df = df.iloc[:,16:27].resample('D', on='datetime_column_name').sum()
There are (at least) 2 benefits from doing it this way as opposed to accepted answer:
resample() can upsample and downsample; groupby() can only downsample
No lambdas, list comprehensions or date formatting functions required.
For more information and examples, see documentation here: resample()
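A minimal sketch of the downsampling this answer describes, with synthetic hourly data standing in for the question's columns (the column selection iloc[:,16:27] is omitted since the real frame is not shown):

```python
import numpy as np
import pandas as pd

# Hourly data for three days (stand-in for the question's dataframe)
idx = pd.date_range("2017-01-01", periods=72, freq="H")
df = pd.DataFrame({"data": np.ones(72)}, index=idx)

# Downsample the hourly values to daily sums; the index keeps full dates
daily = df.resample("D").sum()
```

Each row of daily is labelled with the actual date (2017-01-01, 2017-01-02, ...) rather than a bare day number.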
If your index is a datetime, you can build a combined groupby clause:
df = df.iloc[:,16:27].groupby([lambda x: "{}/{}/{}".format(x.day, x.month, x.year)]).sum()
or, even better:
df = df.iloc[:,16:27].groupby([lambda x: x.strftime("%d/%m/%Y")]).sum()
Both assume your index is a datetime object; convert it first if it is not.
import pandas as pd

df = pd.DataFrame({'data': [1772.031568, 19884.42243, 28696.72159, 24906.20355, 9059.120325]},
                  index=[1, 2, 3, 4, 5])
print(df.head())
rng = pd.date_range('1/1/2017', periods=len(df.index), freq='D')
df.set_index(rng, inplace=True)
print(df.head())
will result in
data
1 1772.031568
2 19884.422430
3 28696.721590
4 24906.203550
5 9059.120325
data
2017-01-01 1772.031568
2017-01-02 19884.422430
2017-01-03 28696.721590
2017-01-04 24906.203550
2017-01-05 9059.120325
First you need to create an index on your datetime column to expose functions that break the datetime into smaller pieces efficiently (like the year and month of the datetime).
Next, you need to group by the year, month and day of the index if you want to apply an aggregate method (like sum()) to each day of the year, and retain separate aggregations for each day.
The reset_index() and rename() functions allow us to rename our group_by categories to their original names. This "flattens" out our data, making the category an actual column on the resulting dataframe.
import pandas as pd

# 'df.created_at' is the datetime column in your dataframe
date_index = pd.DatetimeIndex(df.created_at)
counted = df.groupby([date_index.year, date_index.month, date_index.day])\
    .agg({'column_to_sum': 'sum'})\
    .reset_index()\
    .rename(columns={'level_0': 'year',
                     'level_1': 'month',
                     'level_2': 'day'})
# The three unnamed group keys come back from reset_index() as level_0/level_1/level_2
# Resulting dataframe has columns "column_to_sum", "year", "month", "day" available
You can exploit pandas' DatetimeIndex:
working_df = df.iloc[:, 16:27]
result = working_df.groupby(pd.DatetimeIndex(working_df.DateTime).date).sum()
This works if your DateTime column actually holds datetimes (and be careful of the timezone).
This way you will have valid dates in the index, so you can easily do other manipulations.
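A runnable sketch of this .date-based grouping (the column names and values are hypothetical):

```python
import datetime
import pandas as pd

# Hypothetical frame whose DateTime column arrives as strings
df = pd.DataFrame({
    "DateTime": ["2017-01-01 03:00", "2017-01-01 15:00", "2017-01-02 03:00"],
    "data": [1.0, 2.0, 3.0],
})

# Group on the .date attribute of a DatetimeIndex built from the column
result = df.groupby(pd.DatetimeIndex(df["DateTime"]).date)["data"].sum()
```

The resulting index holds datetime.date objects, one per calendar day.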