I have a pandas DataFrame with multiple IDs and, among other columns, one date column, say 'date1'. For each ID I want to get the row with the minimum date. The other column values should also be retained.
What I have:
ID date1 value
1 1/1/2013 a
1 4/1/2013 a
1 8/3/2014 b
2 11/4/2013 a
2 19/5/2016 b
2 8/4/2017 b
The output I want :
ID date1 value
1 1/1/2013 a
2 11/4/2013 a
Thank you
Convert to datetime:
df = df.assign(date1 = pd.to_datetime(df.date1))
Get the label index of the minimum and subset:
df.loc[df.groupby("ID").date1.idxmin()]
ID date1 value
0 1 2013-01-01 a
3 2 2013-11-04 a
Assuming you have IDs in ID and dates in DATE:
df.groupby('ID')['DATE'].min()
This groups by ID and then selects the minimum date in each group. It returns a Series; if you want a DataFrame instead, call .reset_index() on the output.
If you instead want to select only the minimum rows, I would use that output as keys and then do new_df.join(old_df.set_index(['ID', 'DATE'])) rather than dealing with index-based shenanigans.
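For instance, a minimal sketch of that keys-then-join idea (the frame and values below are made up for illustration; the ID and DATE column names follow the assumption above):
import pandas as pd

# Hypothetical data, just to illustrate the pattern
old_df = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'DATE': pd.to_datetime(['2013-01-01', '2013-04-01', '2013-11-04', '2016-05-19']),
    'value': ['a', 'a', 'a', 'b'],
})

# Per-ID minimum dates as a frame of (ID, DATE) keys
new_df = old_df.groupby('ID')['DATE'].min().reset_index()

# Keep only the rows whose (ID, DATE) pair matches one of the minima
result = (new_df.set_index(['ID', 'DATE'])
                .join(old_df.set_index(['ID', 'DATE']))
                .reset_index())
print(result)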
I can't figure out how DataFrame.groupby() works.
Specifically, given the following dataframe:
df = pd.DataFrame([['usera',1,100],['usera',5,130],['userc',1,100],['userd',5,100]])
df.columns = ['id','date','sum']
id date sum
0 usera 1 100
1 usera 5 130
2 userc 1 100
3 userd 5 100
Running the code below returns:
df['shift'] = df['date']-df.groupby(['id'])['date'].shift(1)
id date sum shift
0 usera 1 100 NaN
1 usera 5 130 4.0
2 userc 1 100 NaN
3 userd 5 100 NaN
How did pandas know that I meant for it to match by the id column? It doesn't even appear in df['date'].
Let us dissect the command df['shift'] = df['date']-df.groupby(['id'])['date'].shift(1).
1. df['shift'] appends a new column "shift" to the dataframe.
2. df['date'] returns a Series built from the date column of the dataframe.
0 1
1 5
2 1
3 5
Name: date, dtype: int64
3. df.groupby(['id'])['date'].shift(1): groupby(['id']) creates a GroupBy object. From that object we select the date column and shift its values down by one position within each group using shift(1). By the way, this is also a Series.
df.groupby(['id'])['date'].shift(1)
0 NaN
1 1.0
2 NaN
3 NaN
Name: date, dtype: float64
4. The Series obtained in step 3 is subtracted (element-wise) from the Series obtained in step 2. The result is assigned to the df['shift'] column.
df['date']-df.groupby(['id'])['date'].shift(1)
0 NaN
1 4.0
2 NaN
3 NaN
Name: date, dtype: float64
I don't know exactly what you are trying to do, but the groupby() method is useful when a column contains several identical values (like your usera) and you want to calculate, for example, the sum(), mean(), max(), etc. of all columns or of one specific column.
e.g. df.groupby(['id'])['sum'].sum() groups by id, selects just the sum column and adds it up per group, so for usera the result is 230. If you used .mean() instead it would output 115, and so on. It does the same for every other unique id in your id column, so in the example above it outputs one column with three rows (usera, userc and userd).
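As a quick runnable sketch of those aggregations on the question's frame:
import pandas as pd

df = pd.DataFrame([['usera', 1, 100], ['usera', 5, 130],
                   ['userc', 1, 100], ['userd', 5, 100]],
                  columns=['id', 'date', 'sum'])

# Total of the 'sum' column per id: usera -> 230, userc -> 100, userd -> 100
print(df.groupby(['id'])['sum'].sum())

# Mean per id: usera -> 115.0, userc -> 100.0, userd -> 100.0
print(df.groupby(['id'])['sum'].mean())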
Greetz, miGa
I have a dataframe in which I have some IDs, and for each ID I have some values and timestamps (around one value every 5 minutes, for 5 to 7 days in a row). I would like to select, for each hour and for each ID, the mean, median and variance of the values in that hour and store them in different columns, like in the following result:
hour mean var median ID
0 2 4 4 1234
1 4 5 3 1234
...
23 2 2 3 1234
My columns are:
ID int64
Value float64
Date datetime64[ns]
dtype: object
My timestamps are in the following format:
%Y-%m-%d %H:%M:%S.%f
How do I create the final dataframe for each ID? Thank you very much
Edit:
With the following line I created a column correctly with the hours:
df['hour'] = df.Date.dt.hour
Now the problem is that I have a very long column with repeated hours, and if I use resample like this:
df = df.set_index('Date').resample('60T').mean().reset_index()
it automatically overwrites the Value column with the mean values. I would like to keep that column, so that I can create separate columns for mean, variance and median based on the values in the Value column. How can I do that part?
Try this:
# Extract the hour from the Date column
h = df['Date'].dt.hour.rename('Hour')
# Group by ID and Hour
df.groupby(['ID', h]).agg({
'Value': ['mean', 'var', 'median']
})
You can replace the h series with pd.Grouper. By default pd.Grouper groups on the index; you can set the key parameter so that it targets another column instead (note that freq='1H' bins by full hourly timestamps rather than by hour of day, so the grouping is slightly different from the dt.hour approach):
df.groupby([pd.Grouper(freq='1H', key='Date'), 'ID']).agg({
    'Value': ['mean', 'var', 'median']
})
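Note that agg with a dict of lists returns a two-level column index (Value/mean, Value/var, Value/median). If you prefer flat column names like in the desired output, one possible variant (a sketch using named aggregation, which requires pandas >= 0.25) is:
out = (df.groupby(['ID', df['Date'].dt.hour.rename('hour')])
         .agg(mean=('Value', 'mean'),
              var=('Value', 'var'),
              median=('Value', 'median'))
         .reset_index())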
So I have two different dataframes and I concatenated them. All columns are the same; however, the date column has all sorts of different dates in the M/D/YR format.
dataframe dates get shuffled around later in the sequence
Is there a way to keep the whole dataframe and just sort the rows based on the dates in the date column? I also want to keep the format the date is in.
so basically
date people
6/8/2015 1
7/10/2018 2
6/5/2015 0
gets converted into:
date people
6/5/2015 0
6/8/2015 1
7/10/2018 2
Thank you!
PS: I've tried the options in the other post on this, but they do not work.
Trying to elaborate on what can be done:
Initialize/merge the dataframe and convert the column into datetime type:
df= pd.DataFrame({'people':[1,2,0],'date': ['6/8/2015','7/10/2018','6/5/2015',]})
df.date=pd.to_datetime(df.date,format="%m/%d/%Y")
print(df)
Output:
date people
0 2015-06-08 1
1 2018-07-10 2
2 2015-06-05 0
Sort on the basis of date
df=df.sort_values('date')
print(df)
Output:
date people
2 2015-06-05 0
0 2015-06-08 1
1 2018-07-10 2
Restore the original format:
df['date']=df['date'].dt.strftime('%m/%d/%Y')
print(df)
Output:
date people
2 06/05/2015 0
0 06/08/2015 1
1 07/10/2018 2
Try changing the 'date' column to pandas datetime and then sorting:
import pandas as pd
df = pd.DataFrame({'people': [1, 1, 1, 2],
                   'date': ['4/12/1961', '5/5/1961', '7/21/1961', '8/6/1961']})
df['date'] = pd.to_datetime(df.date)
df.sort_values(by='date')
Output:
date people
1961-04-12 1
1961-05-05 1
1961-07-21 1
1961-08-06 2
To get back the initial format:
df['date']=df['date'].dt.strftime('%m/%d/%y')
Output:
date people
04/12/61 1
05/05/61 1
07/21/61 1
08/06/61 2
Why not simply:
dataset[SortBy["date"]]
Can you share what you tried, or how your data is structured?
In case you need to sort in reverse order, do:
dataset[SortBy["date"]][Reverse]
I have some data that looks something like this:
ID Value Starts Ends
0 A 1 2000-01-01 2000-06-01
1 A 2 2000-06-02 2000-12-31
2 A 1 2001-01-01 2001-06-01
3 A 1 2001-06-02 2001-12-31
What I want to do is collapse consecutive rows where the ID and Value are the same. So ideally the output would be:
ID Value Starts Ends
0 A 1 2000-01-01 2000-06-01
1 A 2 2000-06-02 2000-12-31
2 A 1 2001-01-01 2001-12-31
However, if you naively group by ID and Value and take np.min(Starts) and np.max(Ends), the (A, 1) range appears to span the (A, 2) range:
gb = df.groupby(['ID', 'Value'], as_index=False)
df = gb.agg({'Starts': np.min, 'Ends': np.max})
ID Value Starts Ends
0 A 1 2000-01-01 2001-12-31
1 A 2 2000-06-02 2000-12-31
Is there an efficient way to get Pandas to do what I want?
If you add a column (let's call it "extra") that increments each time the groupby category changes, you can groupby that instead. The challenge is then to make the addition of the new column efficient, and this is the most vectorized way I can think of to make it work.
increment = pd.Series((df.Value.values[:-1] != df.Value.values[1:]) | (df.ID.values[:-1] != df.ID.values[1:])).cumsum()
df["extra"] = pd.concat((pd.Series([0]), increment), ignore_index=True)
The first line compares each row's ID and Value with the next row's (going through .values so pandas doesn't align on the index) and takes the cumulative sum of the resulting boolean array of change points; the second tacks a zero onto the front and adds the result to the dataframe as the new column.
Then you can do
gb = df.groupby(['ID', 'Value', 'extra'], as_index=False)
df = gb.agg({'Starts': np.min, 'Ends': np.max}).drop(columns='extra')
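If you prefer, here is a shift()-based sketch of the same idea (not the original code above) that avoids the slicing entirely:
# Flag rows where ID or Value differs from the previous row, then number the runs
change = (df['ID'] != df['ID'].shift()) | (df['Value'] != df['Value'].shift())
df['extra'] = change.cumsum()

result = (df.groupby(['ID', 'Value', 'extra'], as_index=False)
            .agg({'Starts': 'min', 'Ends': 'max'})
            .drop(columns='extra'))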
Just do df.drop_duplicates(subset=['ID', 'Value'], inplace=True)
This will drop the rows that have duplicate ID and Value combinations.
I have a data frame like this:
df
time type qty
12:00 A 5
13:00 A 1
14:00 B 9
I need to sum the values of qty and group them by type. This is how I do it, but it doesn't seem to work, because I don't know how to add up qty:
keys = df['type'].unique()
summary = pd.DataFrame()
for k in keys:
    summary[k] = df[df['type']==k].sum()
GroupBy has a sum method:
In [11]: df.groupby("type").sum()
Out[11]:
qty
type
A 6
B 9
see the groupby docs.
To make sure you are summing up the column you want to:
df.groupby(by=['type'])['qty'].sum()
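And if you want the result back as a regular DataFrame with type as a column rather than as the index, one option is:
summary = df.groupby('type', as_index=False)['qty'].sum()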