Using GroupBy with DateTime in Pandas (Python)

I have data that looks like this, which I get from an API (in JSON form, of course):
0,1500843600,8872
1,1500807600,18890
2,1500811200,2902
.
.
.
where the second column is the date/time in ticks (seconds since the Unix epoch), and the third column is some value. I basically have the data for every hour of the day, for every day over a couple of months. What I want is the minimum value of the third column for every week. The code segment below correctly returns the minimum value, but I also want the specific Timestamp, i.e. the date/time at which that week's lowest value occurred. How can I modify my code below so that I get the Timestamp along with the minimum value?
import pandas
df = pandas.DataFrame(columns=['Timestamp', 'Value'])
# dic holds the data that I get from my API.
for i in range(len(dic)):
    df.loc[i] = [dic[i][1], dic[i][2]]
df['Timestamp'] = pandas.to_datetime(df['Timestamp'], unit='s')
# sort_values returns a new frame, so the result has to be assigned back
df = df.sort_values(by=['Timestamp'])
df.set_index(df['Timestamp'], inplace=True)
df.groupby([pandas.Grouper(key='Timestamp', freq='W-MON')])['Value'].min()

I think you need DataFrameGroupBy.idxmin to get the index label of the minimum of column Value in each weekly group, and then select those rows with loc, which keeps the matching Timestamp:
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='s')
df = df.loc[df.groupby([pd.Grouper(key='Timestamp', freq='W-MON')])['Value'].idxmin()]
print (df)
            Timestamp  Value
2 2017-07-23 12:00:00   2902
Detail:
print (df.groupby([pd.Grouper(key='Timestamp', freq='W-MON')])['Value'].idxmin())
Timestamp
2017-07-24 2
Freq: W-MON, Name: Value, dtype: int64
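Put together as a self-contained sketch, using only the three sample rows from the question (the variable names rows and weekly_min are just for illustration). It assumes every weekly bin contains at least one row; empty weeks may need extra handling, since idxmin has no row to point at there.

```python
import pandas as pd

# The three sample rows from the question: (id, Unix seconds, value).
rows = [(0, 1500843600, 8872), (1, 1500807600, 18890), (2, 1500811200, 2902)]

df = pd.DataFrame([(ts, value) for _, ts, value in rows],
                  columns=["Timestamp", "Value"])
df["Timestamp"] = pd.to_datetime(df["Timestamp"], unit="s")

# idxmin returns, for each weekly bin, the index label of the row
# holding the smallest Value.
idx = df.groupby(pd.Grouper(key="Timestamp", freq="W-MON"))["Value"].idxmin()

# Selecting those labels with loc keeps both Timestamp and Value.
weekly_min = df.loc[idx]
print(weekly_min)
```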

Related

pandas: Extracting only the month-day values for rows from a column

I want to plot a line graph for my data, but the x-axis labels end up crammed together because of the long date format (Y-M-D). I checked the data type of 'date' and it returned:
In[200]: df['date'].dtypes
Out[200]: dtype('O')
So my 'date' values are:
date
----
2020-04-12
2020-05-13
2020-02-02
but I want to extract only the month and day to make the column look like
date
----
04-12
05-13
02-02
How should I do this? I apologise for dupes as I couldn't find anything similar due to my datatype being 'O'. Appreciate all the help!
Use Series.str.split and select the second list element by indexing with str[1]:
df['date'] = df['date'].str.split('-', n=1).str[1]
# if the values are date objects rather than strings
#df['date'] = df['date'].astype(str).str.split('-', n=1).str[1]
print (df)
    date
0  04-12
1  05-13
2  02-02
Or convert to datetimes by to_datetime with Series.dt.strftime:
df['date'] = pd.to_datetime(df['date']).dt.strftime('%m-%d')
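A small runnable sketch comparing the two approaches on the three dates from the question. The names by_split and by_strftime are only for illustration; both results are plain strings, so they work as axis labels but no longer sort or behave as dates.

```python
import pandas as pd

# Object-dtype column holding date strings, as in the question.
df = pd.DataFrame({"date": ["2020-04-12", "2020-05-13", "2020-02-02"]})

# Approach 1: split once on the first '-' and keep the part after the year.
by_split = df["date"].str.split("-", n=1).str[1]

# Approach 2: parse to datetimes, then format as month-day strings.
by_strftime = pd.to_datetime(df["date"]).dt.strftime("%m-%d")

print(by_split.tolist())
print(by_strftime.tolist())
```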

Filter is not working on Pandas Date Index

I have created a DataFrame with the Date column as its index. When I try to filter it by index with sales_df['2020'], it throws an error. Ideally I should be able to filter by year, month or day when the index is a date. Could you tell me what's going on here?
sales_df['2020'] selects a column named '2020' from the data frame, which obviously doesn't exist in your data frame.
To filter rows using a date time index, use
df.loc['2020'] # which returns nothing for your data set
Or something like
df.loc['2010-02':'2010-09']
                 Date  Store
Date
2010-02-12 2010-02-12      1
2010-09-10 2010-09-10      1
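A minimal sketch of this kind of partial-string selection. The frame below is hypothetical sample data (only the two 2010 dates come from the output above), and the index is sorted first, since slicing by date strings is most predictable on a sorted DatetimeIndex.

```python
import pandas as pd

# Hypothetical sales data indexed by date.
sales_df = pd.DataFrame(
    {"Store": [1, 1, 1]},
    index=pd.to_datetime(["2010-02-12", "2010-09-10", "2011-01-07"]),
)
sales_df.index.name = "Date"
sales_df = sales_df.sort_index()

# .loc with a year string selects every row in that year.
rows_2010 = sales_df.loc["2010"]

# A slice of month strings selects everything from February
# through the end of September.
feb_to_sep = sales_df.loc["2010-02":"2010-09"]

print(rows_2010)
print(feb_to_sep)
```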

How can I set a date interval for my data in Python?

I'm desperately trying to group my data in order to see in which months most people travel, but first I want to remove all the data from before a certain year.
My data goes all the way back to the year 0003, which I do not want to include.
How can I set an interval from 1938-01-01 to 2020-09-21 with pandas and datetime?
One way to solve this is:
Make sure the date column is in datetime format (converting it is necessary):
df.date_start = pd.to_datetime(df.date_start)
Set date_start as the new index:
df.index = df.date_start
Then group by month and country code and aggregate:
# recent pandas versions prefer freq="ME" (month end) over "1M"
df.groupby([pd.Grouper(freq="1M"), "country_code"]) \
    .agg({"Name of the column with frequencies": np.sum})
Alternatively, use boolean indexing with pandas.Series.between:
# sample data
df = pd.DataFrame(pd.date_range('1910-01-01', '2020-09-21', periods=10), columns=['Date'])
# boolean indexing with Series.between (both endpoints are included)
new_df = df[df['Date'].between('1938-01-01', '2020-09-21')]
# groupby month and get the count of each group
months = new_df.groupby(new_df['Date'].dt.month).count()
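A self-contained sketch of the same idea that also copes with a stray very early date like the year 0003 mentioned in the question. The sample values are hypothetical; date_start is the column name used in the first answer. Depending on the pandas version, such an early date either cannot be represented and becomes NaT under errors="coerce", or parses to a real date; either way the between filter drops it.

```python
import pandas as pd

# Hypothetical travel dates, including one implausibly early entry.
df = pd.DataFrame(
    {"date_start": ["0003-05-01", "1950-07-14", "1999-07-02", "2020-01-15"]}
)

# errors="coerce" turns anything that cannot be represented into NaT
# instead of raising.
df["date_start"] = pd.to_datetime(df["date_start"], errors="coerce")

# Keep only rows inside the wanted interval (endpoints included);
# NaT never satisfies the comparison, so it is excluded as well.
in_range = df[df["date_start"].between("1938-01-01", "2020-09-21")]

# Number of trips per calendar month (1 = January, ..., 12 = December).
trips_per_month = in_range.groupby(in_range["date_start"].dt.month).size()
print(trips_per_month)
```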

How do I get the sum of columns from a csv within specified rows using dates inputting as variables in python?

Date,hrs,Count,Status
2018-01-02,4,15,SFZ
2018-01-03,5,16,ACZ
2018-01-04,3,14,SFZ
2018-01-05,5,15,SFZ
2018-01-06,5,18,ACZ
This is a fraction of the data I've been working on. The actual data is in the same format, with around 1000 entries for each date. I am taking the start_date and end_date as inputs from the user:
start_date=dt.date(2018, 1, 2)
end_date=dt.date(2018, 1, 23)
Now, I have to display a total for hrs and Count within the selected date range in the output. I am able to do so by entering the dates directly into the between call, using this snippet:
df = df.loc[df['Date'].between('2018-01-02','2018-01-06'), ['hrs','Count']].sum()
print (df)
Output:
hrs 22
Count 78
dtype: int64
I am using the pandas and datetime libraries. But I want to pass the dates using the variables start_date and end_date, as they might change every time. I tried replacing them; it doesn't give me an error, but the total shows 0.
df = df.loc[df['Date'].between('start_date','end_date'), ['hrs','Count']].sum()
print (df)
Output:
Duration_hrs 0
Reject_Count 0
dtype: int64
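In your attempt, 'start_date' and 'end_date' are in quotes, so between receives those literal strings instead of the variables. Pass the variables themselves and convert all the values to a compatible type, pd.Timestamp:
df = df.loc[pd.to_datetime(df['Date']).between(pd.Timestamp(start_date),
                                               pd.Timestamp(end_date)),
            ['hrs','Count']].sum()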
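As a runnable sketch on the five sample rows from the question, reading the CSV text from a string instead of a file (csv_text, mask and totals are names chosen for illustration):

```python
import datetime as dt
import io

import pandas as pd

# The sample rows from the question, inlined so the example runs without a file.
csv_text = """Date,hrs,Count,Status
2018-01-02,4,15,SFZ
2018-01-03,5,16,ACZ
2018-01-04,3,14,SFZ
2018-01-05,5,15,SFZ
2018-01-06,5,18,ACZ
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# User-supplied bounds, as in the question.
start_date = dt.date(2018, 1, 2)
end_date = dt.date(2018, 1, 23)

# Convert the datetime.date bounds to pd.Timestamp so they compare
# cleanly against the datetime column.
mask = df["Date"].between(pd.Timestamp(start_date), pd.Timestamp(end_date))
totals = df.loc[mask, ["hrs", "Count"]].sum()
print(totals)
```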

Is it possible to resample and sum values in a Pandas dataframe by specifying a date range?

I have a dataframe like the following (dates with an associated binary value (whether or not a flood occurs), spanning a total of 20 years):
...
2019-12-27 0.0
2019-12-28 1.0
2019-12-29 1.0
2019-12-30 0.0
2019-12-31 0.0
...
I need to produce a count (i.e. sum, considering the values are binary) over a series of custom date ranges, e.g. '24-05-2019 to 09-09-2019', or '15-10-2019 to 29-12-2019', etc.
My initial thoughts were to use the resample method, however as I understand this will not allow me to select a custom date range, rather it will allow me to resample over a set time period, e.g. month or year.
Any ideas out there?
Thanks in advance
If the dates are a DatetimeIndex and the index of the dataframe or Series, you can directly select the relevant rows:
df.loc['24-05-2019':'09-09-2019', 'flood'].sum()
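A short sketch of summing over several custom ranges this way. The five flood values are the ones shown in the question; the two date ranges are made up to fit that small sample, and they are written in ISO format (YYYY-MM-DD) on the assumption that this avoids any day/month ambiguity when pandas parses the strings.

```python
import pandas as pd

# The five days shown in the question, as a Series with a DatetimeIndex.
dates = pd.date_range("2019-12-27", "2019-12-31", freq="D")
flood = pd.Series([0.0, 1.0, 1.0, 0.0, 0.0], index=dates, name="flood")

# Hypothetical custom ranges; both endpoints are included by .loc slicing.
ranges = [("2019-12-27", "2019-12-29"), ("2019-12-29", "2019-12-31")]

# Sum the binary flags inside each range to get the flood count.
counts = {(start, end): flood.loc[start:end].sum() for start, end in ranges}
print(counts)
```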
Since it's a Pandas dataframe you should be able to do something like:
start_date = df[df.date == '24-05-2019'].index[0]
end_date = df[df.date == '09-09-2019'].index[0]
subset = df.loc[start_date:end_date]  # .loc slicing includes the end row
subset.flood.sum()  # Or other maths
where 'date' and 'flood' are your column headers, and 'df' is your dataframe. This assumes your dates are strings, that the rows are in date order, and that each date only appears once. index[0] takes the first match, so if a date repeats you'll have to pick which of the matching index values you want.
