Pandas - How to remove duplicates based on another series? - python

I have a dataframe that contains three series called Date, Element,
and Data_Value--their types are string, string, and numpy.int64
respectively. Date has dates in the form of yyyy-mm-dd; Element has
strings that say either TMIN or TMAX, and it denotes whether the
Data_Value is the minimum or maximum temperature of a particular date;
lastly, the Data_Value series just represents the actual temperature.
The date series has multiple duplicates of the same date. E.g. for the
date 2005-01-01, there are 19 entries for the temperature column, the
values start at 28 and go all the way up to 156. I want to create a
new dataframe with the date and the maximum temperature only--I'll
eventually want one for TMIN values too, but I figure that if I can do
one I can figure out the other. I'll post some pseudocode with
explanation below to show what I've tried so far.
So far I have pulled in the csv and assigned it to a variable, df.
Then I sorted the values by Date, Element and Temperature
(Data_Value). After that, I created a variable called tmax that grabs
the necessary dates (I only need the data from 2005-2014) that have
'TMAX' as its Element value. I cast tmax into a new DataFrame, reset
its index to get rid of the useless index data from the first
dataframe, and dropped the 'Element' column since it was redundant at
this point. Now I'm (ultimately) trying to create a list of all the
Temperatures for TMAX so that I can plot it with pyplot. But I can't
figure out for the life of me how to reduce the dataframe to just the
single date and max value for that date. If I could just get that then
I could easily convert the series to a list and plot it.
import pandas as pd

def record_high_and_low_temperatures():
    # read in the csv
    df = pd.read_csv('somedata.csv')
    # sort values so they're in a nice order
    df.sort_values(by=['Date', 'Element', 'Data_Value'], inplace=True)
    # grab all entries for TMAX in the correct date range
    tmax = df[(df['Element'] == 'TMAX') & (df['Date'].between("2005-01-01", "2014-12-31"))]
    # cast to a new dataframe with only the columns I need
    tmax = pd.DataFrame(tmax, columns=['Date', 'Data_Value'])
    # remove the index carried over from the previous dataframe
    tmax.reset_index(drop=True, inplace=True)
    # this is where I'm stuck: how do I get the max value per unique date?
    max_temp_by_date = tmax.loc[tmax['Data_Value'].idxmax()]
Any and all help is appreciated; let me know if I need to clarify anything.
TL;DR:
The input dataframe looks like:
date | data_value
2005-01-01 28
2005-01-01 33
2005-01-01 33
2005-01-01 44
2005-01-01 56
2005-01-02 0
2005-01-02 12
2005-01-02 30
2005-01-02 28
2005-01-02 22
Expected df should look like:
date | data_value
2005-01-01 79
2005-01-02 90
2005-01-03 88
2005-01-04 44
2005-01-05 63
I just want a dataframe that has each unique date coupled with the highest temperature on that day.

If I understand you correctly, what you want to do (as Grzegorz already suggested in the comments) is to group by date (take all elements of one date) and then take the maximum within that date:
df.groupby('date').max()
This will take all your groups and reduce each of them to a single row, taking the maximum element of every group. In this case, max() is called the aggregation function of the group. As you mentioned that you will also need the minimum at some point, a nice way to do this (instead of two groupbys) is the following:
df.groupby('date').agg(['max', 'min'])
which will pass over all groups once and apply both aggregation functions, max and min, returning two columns for each input column. More documentation on aggregation is in the pandas user guide.
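For concreteness, here is a minimal sketch of the grouped aggregation applied to the TL;DR sample data from the question (the column names follow that sample):
import pandas as pd

df = pd.DataFrame({
    'date': ['2005-01-01'] * 5 + ['2005-01-02'] * 5,
    'data_value': [28, 33, 33, 44, 56, 0, 12, 30, 28, 22],
})

# one row per unique date, with the max and min temperature for that day
extremes = df.groupby('date')['data_value'].agg(['max', 'min'])
print(extremes)
#             max  min
# date
# 2005-01-01   56   28
# 2005-01-02   30    0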

Try this:
df.groupby('Date')['Data_Value'].max()
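If you then need the result back as a two-column dataframe for plotting (a small follow-up sketch, not part of the original answer):
max_temp_by_date = df.groupby('Date')['Data_Value'].max().reset_index()
# one row per date, with columns 'Date' and 'Data_Value';
# max_temp_by_date['Data_Value'].tolist() gives the list mentioned in the question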

Related

How to best calculate average issue age for a range of historic dates (pandas)

Objective:
I need to show the trend in ageing of issues. e.g. for each date in 2021 show the average age of the issues that were open as at that date.
Starting data (historic issue list), "df":
ref    Created      resolved
a1     20/4/2021    15/5/2021
a2     21/4/2021    15/6/2021
a3     23/4/2021    15/7/2021
Endpoint, "df2":
Date        Avg_age
1/1/2021    x
2/1/2021    y
3/1/2021    z
where x,y,z are the calculated averages of age for all issues open on the Date.
Tried so far:
I got this to work in what feels like a very poor way.
Create a date range with pd.date_range(start, finish, freq="D").
Loop through the dates in this range; for each date, filter the "df" dataframe (boolean filtering) to show only issues live on the date in question, then calculate age (date - created) and average it for those. Each result is appended to a list.
Once done, I just convert the list into a series for my final result, which I can then graph or whatever.
hist_dates = pd.date_range(start="2021-01-01", end="2021-12-31", freq="D")
results_list = []

for each_date in hist_dates:
    f1 = df.Created < each_date    # filter 1: already created
    f2 = df.resolved >= each_date  # filter 2: not yet resolved
    df['Refdate'] = each_date      # make a column to allow refdate - created
    df['Age'] = df.Refdate - df.Created
    results_list.append(df[f1 & f2].Age.mean())
Problems:
This works, but it feels sloppy and it doesn't seem fast. The current data-set is small, but I suspect this wouldn't scale well. I'm trying not to solve everything with loops as I understand it is a common mistake for beginners like me.
I'll give you two solutions: the first one is step-by-step so you can understand the idea and process; the second one replicates the functionality in a much more condensed way, skipping some intermediate steps.
First, create a new column that holds your issue age, i.e. df['age'] = df.resolved - df.Created (I'm assuming your columns are of datetime type, if not, use pd.to_datetime to convert them)
You can then use groupby to group your data by creation date. This will internally slice your dataframe into several pieces, one for each distinct value of Created, grouping all values with the same creation date together. This way, you can then use aggregation on a creation date level to get the average issue age like so
# [['Created', 'age']] selects only the columns you are interested in
df[['Created', 'age']].groupby('Created').mean()
With an additional fourth data point [a4, 2021/4/20, 2021/4/30] (to enable some proper aggregation), this would end up giving you the following Series with the average issue age by creation date:
age
Created
2021-04-20 17 days 12:00:00
2021-04-21 55 days 00:00:00
2021-04-23 83 days 00:00:00
A more condensed way of doing this is to define a custom function and apply it to each creation date group:
def issue_age(group: pd.DataFrame):
    return (group['resolved'] - group['Created']).mean()

df.groupby('Created').apply(issue_age)
This call will give you the same Series as before.
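For completeness, a sketch of the same computation without a custom function, combining the age column from the step-by-step version with a plain groupby (assuming resolved and Created are datetime columns):
avg_age_by_created = (
    df.assign(age=df['resolved'] - df['Created'])
      .groupby('Created')['age']
      .mean()
)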

Find indices of rows of dates falling within datetime range

Using datetime and a dataframe, I want to find which rows fall within the range of dates I have specified.
Sample dataframe:
import numpy as np
import pandas as pd

times = pd.date_range(start="2018-01-01", end="2020-02-02")
values = np.random.rand(len(times))  # one random value per date

# Make df
df = pd.DataFrame({'Time': times,
                   'Value': values})
How do I easily select all values that fall within a certain month or range?
I feel like a good step is using:
pd.to_datetime(df['Time']).dt.to_period('M')
0    2018-01
1    2018-01
2    2018-01
...
But I wouldn't know how to continue. I would like to be able to select a year/month like 2019-01, or a range like 2019-01:2020-01, and find the indices in the dataframe that match the input.
I apparently did the right thing already, but had the syntax wrong.
It was quick, but just to be sure, here is the answer:
np.where(pd.to_datetime(df['Time']).dt.to_period('M') == '2018-01')
Then a range can be specified as well.
With query you can also select date ranges pretty quickly:
df.query('"2019-01-01" <= Time < "2019-02-01"')
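Another option, if you set Time as the index, is partial string indexing on the DatetimeIndex, which handles both a single month and a range (a sketch built on the sample frame above):
ts = df.set_index('Time')

jan_2019 = ts.loc['2019-01']                    # everything in January 2019
jan19_to_jan20 = ts.loc['2019-01':'2020-01']    # January 2019 through January 2020, inclusive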

Pandas calculated column from datetime index groups loop

I have a Pandas df with a Datetime Index. I want to loop over the following code with different values of strike, based on the index date value (different strike for different time period). Here is my code that produces what I am after for 1 strike across the whole time series:
import pandas as pd
import numpy as np

index = pd.date_range('2017-10-1 00:00:00', '2018-12-31 23:50:00', freq='30min')
df = pd.DataFrame(np.random.randn(len(index), 2).cumsum(axis=0), columns=['A', 'B'], index=index)

strike = 40
payoffs = df[df > strike] - strike
mean_payoff = payoffs.fillna(0).mean()
dist = mean_payoff.describe(percentiles=[0.05, .5, .95])
print(dist)
I want to use different values of strike based on the time period (index value).
So far I have tried to create a categorical calculated column with the intention of using map or apply row wise on the df. I have also played around with creating a dictionary and mapping the dict across the df.
Even if I get the calculated column with the correct strike value, I can't think how to subtract the calculated column value (strike) from all other columns to get the payoffs from above.
I feel like I need to use for loop and potentially create groups of date chunks that get appended together at the end of the loop, maybe with pd.concat.
Thanks in advance
I think you need to convert the DatetimeIndex to quarter periods with to_period, then to strings, and finally map them with a dict.
For the comparison, use gt together with sub:
d = {'2017Q4':30, '2018Q1':40, '2018Q2':50, '2018Q3':60, '2018Q4':70}
strike = df.index.to_series().dt.to_period('Q').astype(str).map(d)
payoffs = df[df.gt(strike, axis=0)].sub(strike, axis=0)
mean_payoff = payoffs.fillna(0).mean()
dist = mean_payoff.describe(percentiles=[0.05,.5,.95])
Mapping your dataframe index into a dictionary can be a starting point.
import random

a = dict()
a[2017] = 30
a[2018] = 40

# given your index used in the example (21936 timestamps)
ranint = random.choices([30, 35, 40, 45], k=len(index))
df = pd.DataFrame({'values': ranint}, index=index)

df['year'] = df.index.year
df['strike'] = df['year'].map(a)
df.head(3)
                     values  year  strike
2017-10-01 00:00:00      30  2017      30
2017-10-01 00:30:00      30  2017      30
2017-10-01 01:00:00      45  2017      30
df['returns'] = df['values'] - df['strike']
Then you can extract the returns that are greater than 0:
df[df['returns'] > 0]
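From there, one possible way to summarise the result per strike period (a sketch that simply reuses the year and returns columns built above):
# average positive return for each year's strike level
positive = df[df['returns'] > 0]
print(positive.groupby('year')['returns'].mean())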

How to get average of a column inside a time era?

I need to get the average of a column (which I will set in the input of my function) during a precise time period:
In my case the date is the index, so I can get the week with index.week.
Then I would like to compute some basic statistics every 2 weeks, for instance.
So I will need to "slice" the dataframe every 2 weeks and then compute. It can destroy the part of the dataframe already computed, but what's still in the dataframe mustn't be erased.
My first guess was to parse the data with a row iterator and then compare:
# get the week number of the first row
# temp.data is my dataframe
start_week = temp.data.index.week[0]

for index, row in temp.data.iterrows():
    while index.week < start_week + 2:
        print(index.week)
but it's really slow, so this shouldn't be the proper way.
Welcome to Stack Overflow. Please note that your question is not very specific, which makes it difficult to give you exactly what you want. Ideally, you would supply code to recreate your dataset and also post the expected outcome. I'll cover two parts: (i) working with dataframes sliced using time-specific functions and (ii) applying statistical functions using rolling window operations.
Working with Dataframes and time indices
The question is not how to get the mean of x, because you know how to do that (x.mean()). The question is, how to get x: How do you select elements of a dataframe which satisfy certain conditions on their timestamp? I will use a series generated by the documentation which I found after googling for one minute:
In[13]: ts
Out[13]:
2011-01-31 0.356701
2011-02-28 -0.814078
2011-03-31 1.382372
2011-04-29 0.604897
2011-05-31 1.415689
2011-06-30 -0.237188
2011-07-29 -0.197657
2011-08-31 -0.935760
2011-09-30 2.060165
2011-10-31 0.618824
2011-11-30 1.670747
2011-12-30 -1.690927
Then, you can select some time series based on index weeks using
ts[(ts.index.week > 3) & (ts.index.week < 10)]
And specifically, if you want to get the mean of this series, you can do
ts[(ts.index.week > 3) & (ts.index.week < 10)].mean()
If you work with a dataframe, you might want to select the column first:
df[(df.index.week > 3) & (df.index.week < 10)]['someColumn'].mean()
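As a side note, DatetimeIndex.week and Series.dt.week are deprecated in recent pandas releases; a sketch of the same selection using isocalendar() (assuming pandas >= 1.1):
weeks = ts.index.isocalendar().week
ts[(weeks > 3) & (weeks < 10)].mean()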
Rolling window operations
Now, if you want to apply rolling statistics to a time-series-indexed pandas object, have a look at the rolling-window part of the manual.
Given that I have a monthly time series, say I want the mean for 3 months, I'd do:
pd.rolling_mean(ts, window=3)
Out[25]:
2011-01-31 NaN
2011-02-28 NaN
2011-03-31 0.308331
2011-04-29 0.391064
2011-05-31 1.134319
2011-06-30 0.594466
2011-07-29 0.326948
2011-08-31 -0.456868
2011-09-30 0.308916
2011-10-31 0.581076
2011-11-30 1.449912
2011-12-30 0.199548
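As an aside, the rolling_mean function used above comes from older pandas versions and has since been removed; in current pandas the equivalent is the rolling method:
ts.rolling(window=3).mean()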

Indexing by row counts in a pandas dataframe

I have a pandas dataframe with a two-element hierarchical index ("month" and "item_id"). Each row represents a particular item at a particular month, and has columns for several numeric measures of interest. The specifics are irrelevant, so we'll just say we have column X for our purposes here.
My problem stems from the fact that items vary in the months for which they have observations, which may or may not be contiguous. I need to calculate the average of X, across all items, for the 1st, 2nd, ..., n-th month in which there is an observation for that item.
In other words, the first row in my result should be the average across all items of the first row in the dataframe for each item, the second result row should be the average across all items of the second observation for that item, and so on.
Stated another way, if we were to take all the date-ordered rows for each item and index them from i=1,2,...,n, I need the average across all items of the values of rows 1,2,...,n. That is, I want the average of the first observation for each item across all items, the average of the second observation across all items, and so on.
How can I best accomplish this? I can't use the existing date index, so do I need to add another index to the dataframe (something like I describe in the previous paragraph), or is my only recourse to iterate across the rows for each item and keep a running average? This would work, but is not leveraging the power of pandas whatsoever.
Adding some example data:
item_id date X DUMMY_ROWS
20 2010-11-01 16759 0
2010-12-01 16961 1
2011-01-01 17126 2
2011-02-01 17255 3
2011-03-01 17400 4
2011-04-01 17551 5
21 2007-09-01 4 6
2007-10-01 5 7
2007-11-01 6 8
2007-12-01 10 9
22 2006-05-01 10 10
2006-07-01 13 11
23 2006-05-01 2 12
24 2008-01-01 2 13
2008-02-01 9 14
2008-03-01 18 15
2008-04-01 19 16
2008-05-01 23 17
2008-06-01 32 18
I've added a dummy rows column that does not exist in the data for explanatory purposes. The operation I'm describing would effectively give the mean of rows 0, 6, 10, 12, and 13 (the first observation for each item), then the mean of rows 1, 7, 11, and 14 (the second observation for each item, excluding item 23 because it has only one observation), and so on.
One option is to reset the index then group by id.
df_new = df.reset_index()
df_new.groupby(['item_id']).X.agg(np.mean)
this leaves your original df intact and gets you the mean across all months for each item id.
For your updated question (great example, by the way) I think the approach would be to add an "item_sequence_id". I've done this in the past with similar data.
df.sort_values(['item_id', 'date'], inplace=True)

def sequence_id(item):
    item['seq_id'] = range(len(item))
    return item

df_with_seq_id = df.groupby(['item_id']).apply(sequence_id)
df_with_seq_id.groupby(['seq_id']).agg(np.mean)
The idea here is that the seq_id allows you to identify the position of the data point in time per item_id; assigning non-unique seq_id values to the items will allow you to group across multiple items. The context I've used this in before relates to users doing something first in a session. Using this ID structure I can identify all of the first, second, third, etc. actions taken by users regardless of their absolute time and user id.
Hopefully this is more of what you want.
Here's an alternative method for this that I finally figured out (which assumes we don't care about the actual dates for the purposes of calculating the mean). Recall the method proposed by @cwharland:
def sequence_id(item):
    item['seq'] = range(0, len(item), 1)
    return item

dfWithSeqID_old = df.groupby(level='item_id').apply(sequence_id)
Testing this on a 10,000 row subset of the data frame:
%timeit -n10 dfWithSeqID_old = shrink.groupby(level='item_id').apply(sequence_id)
10 loops, best of 3: 301 ms per loop
It turns out we can simplify things by remembering that pandas' default behavior (i.e. without specifying an index column) is to generate a numeric index for a dataframe numbered from 0 to n (the number of rows in the frame). We can leverage this like so:
dfWithSeqID_new = df.groupby(level='item_id').apply(lambda x: x.reset_index(drop=True))
The only difference in the output is that we have a new, unlabeled numeric index with the same content as the 'seq' column used in the previous answer, BUT it's almost 4 times faster (I can't compare the methods for the full 13 million row dataframe, as the first method was resulting in memory errors):
%timeit -n10 dfWithSeqID_new = df.groupby(level='item_id').apply(lambda x: x.reset_index(drop=True))
10 loops, best of 3: 77.2 ms per loop
Calculating the average as in my original question is only slightly different. The original method was:
dfWithSeqID_old.groupby('seq').agg(np.mean).head()
But now we simply have to account for the fact that we're using the new unlabeled index instead of the 'seq' column:
dfWithSeqID_new.mean(level=1).head()
The result is the same.
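If you prefer an explicit column over the unlabeled index, the same per-item sequence id can also be produced with groupby's cumcount (a sketch, assuming rows are already date-ordered within each item_id):
df['seq'] = df.groupby(level='item_id').cumcount()
df.groupby('seq')['X'].mean()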
