collapse group into one row pandas dataframe - python

I have a dataframe as below:
id timestamp name
1 2018-01-23 15:49:53 "aaa"
1 2018-01-23 15:54:56 "bbb"
1 2018-01-23 15:49:57 "bbb"
1 2018-01-23 15:49:54 "ccc"
This is one example of an id group from my data; I have several such groups.
What I am trying to do is collapse each group into a single row, with the names in chronological order according to timestamp, e.g. like this:
id name
1 aaa->ccc->bbb->bbb
The values in name are in chronological order according to their timestamps. Any pointers regarding this?

I took the liberty of adding some data to your df:
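For reference, a minimal sketch of how the extended frame could be built (the values are simply copied from the output below):
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2],
    'timestamp': ['2018-01-23T15:49:53', '2018-01-23T15:54:56', '2018-01-23T15:49:57',
                  '2018-01-23T15:49:54', '2018-01-23T15:49:54', '2018-01-23T15:49:57'],
    'name': ['aaa', 'bbb', 'bbb', 'ccc', 'ccc', 'aaa'],
})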
print(df)
Output:
id timestamp name
0 1 2018-01-23T15:49:53 aaa
1 1 2018-01-23T15:54:56 bbb
2 1 2018-01-23T15:49:57 bbb
3 1 2018-01-23T15:49:54 ccc
4 2 2018-01-23T15:49:54 ccc
5 2 2018-01-23T15:49:57 aaa
Then you need:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.sort_values(['id', 'timestamp'])
grp = df.groupby('id')['name'].aggregate(lambda x: '->'.join(tuple(x))).reset_index()
print(grp)
Output:
id name
0 1 aaa->ccc->bbb->bbb
1 2 ccc->aaa
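As a side note, the tuple conversion in the lambda isn't strictly needed; '->'.join can consume the group values directly, so the aggregation could also be written as (a small simplification, not the original answer's wording):
grp = df.groupby('id')['name'].agg('->'.join).reset_index()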


How to join in pandas on a.date<=b.date and then take only the row where a.date is max?

I am trying to join two dataframes by ID and Date. However, the date criterion is a.date <= b.date, and when a.date has many matching rows, I want to take the one where a.date is the maximum (but still < b.date). How would I do that?
Dataframe A (cumulative sales table)
ID| date | cumulative_sales
1 | 2020-01-01 | 10
1 | 2020-01-03 | 15
1 | 2021-01-02 | 20
Dataframe B
ID| date | cumulative_sales (up to this date, how much was purchased for a given ID?)
1 | 2020-05-01 | 15
In SQL, I would do a join by a.date<=b.date, then I would next do a dense_rank() and take the max value within that partition for each ID. Not sure how to approach this with Pandas. Any suggestion?
Looks like you simply want a merge_asof:
dfA['date'] = pd.to_datetime(dfA['date'])
dfB['date'] = pd.to_datetime(dfB['date'])
out = pd.merge_asof(dfB.sort_values(by='date'),
                    dfA.sort_values(by='date'),
                    on='date', by='ID')
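A minimal end-to-end sketch of this on the question's sample data (assuming dfB's column is also literally named cumulative_sales, in which case merge_asof adds _x/_y suffixes to disambiguate):
import pandas as pd

dfA = pd.DataFrame({'ID': [1, 1, 1],
                    'date': ['2020-01-01', '2020-01-03', '2021-01-02'],
                    'cumulative_sales': [10, 15, 20]})
dfB = pd.DataFrame({'ID': [1],
                    'date': ['2020-05-01'],
                    'cumulative_sales': [15]})

dfA['date'] = pd.to_datetime(dfA['date'])
dfB['date'] = pd.to_datetime(dfB['date'])

# direction='backward' (the default) matches each dfB row with the last dfA row
# whose date is <= the dfB date, within the same ID
out = pd.merge_asof(dfB.sort_values('date'), dfA.sort_values('date'),
                    on='date', by='ID')
# cumulative_sales_x comes from dfB, cumulative_sales_y is the matched dfA value (15)
print(out)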
Here's a way to do what your question asks:
dfA = dfA.sort_values(['ID', 'date']).join(
    dfB.set_index('ID'), on='ID', rsuffix='_b').query('date <= date_b').drop(
    columns='date_b').groupby(['ID']).last().reset_index()
Explanation:
sort dfA by ID, date
use join to join with dfB on ID and bring the columns from dfB in with the suffix _b
use query to keep only rows where dfA.date <= dfB.date
use groupby on ID and then last to select the row with the highest remaining value of dfA.date (i.e., the highest dfA.date that is <= dfB.date for each ID)
use reset_index to convert ID from an index level back into a column label
Full test code:
import pandas as pd
dfA = pd.DataFrame({'ID':[1,1,1,2,2,2], 'date':['2020-01-01','2020-01-03','2020-01-02','2020-01-01','2020-01-03','2020-01-02'], 'cumulative_sales':[10,15,20,30,40,50]})
dfB = pd.DataFrame({'ID':[1,2], 'date':['2020-05-01','2020-01-01'], 'cumulative_sales':[15,30]})
print(dfA)
print(dfB)
dfA = dfA.sort_values(['ID', 'date']).join(
    dfB.set_index('ID'), on='ID', rsuffix='_b').query(
    'date <= date_b').drop(columns='date_b').groupby(['ID']).last().reset_index()
print(dfA)
Input:
dfA:
ID date cumulative_sales
0 1 2020-01-01 10
1 1 2020-01-03 15
2 1 2020-01-02 20
3 2 2020-01-01 30
4 2 2020-01-03 40
5 2 2020-01-02 50
dfB:
ID date cumulative_sales
0 1 2020-05-01 15
1 2 2020-01-01 30
Output:
ID date cumulative_sales cumulative_sales_b
0 1 2020-01-03 15 15
1 2 2020-01-01 30 30
Note: I have left cumulative_sales_b in place in case you want it. If it's not needed, it can be dropped by replacing drop(columns='date_b') with drop(columns=['date_b', 'cumulative_sales_b']).
UPDATE:
For fun, if your version of Python has the walrus operator := (an assignment expression, available since Python 3.8), you can do this instead of using query:
dfA = (dfA := dfA.sort_values(['ID', 'date']).join(
    dfB.set_index('ID'), on='ID', rsuffix='_b'))[dfA.date <= dfA.date_b].drop(
    columns='date_b').groupby(['ID']).last().reset_index()
We can do a merge:
out = df1.merge(df2, on='ID', suffixes=('', '_x')).\
    query('date<=date_x').sort_values('date').drop_duplicates('ID', keep='last')[df1.columns]
Out[272]:
ID date cumulative_sales
1 1 2020-01-03 15

How to check if date ranges are overlapping in a pandas dataframe according to a categorical column?

Let's take this sample dataframe :
df = pd.DataFrame({'ID':[1,1,2,2,3],'Date_min':["2021-01-01","2021-01-20","2021-01-28","2021-01-01","2021-01-02"],'Date_max':["2021-01-23","2021-12-01","2021-09-01","2021-01-15","2021-01-09"]})
df["Date_min"] = df["Date_min"].astype('datetime64')
df["Date_max"] = df["Date_max"].astype('datetime64')
ID Date_min Date_max
0 1 2021-01-01 2021-01-23
1 1 2021-01-20 2021-12-01
2 2 2021-01-28 2021-09-01
3 2 2021-01-01 2021-01-15
4 3 2021-01-02 2021-01-09
I would like to check, for each ID, whether there are overlapping date ranges. I can use a loopy solution like the following one, but it is not efficient and consequently quite slow on a really big dataframe:
L_output = []
for index, row in df.iterrows():
    if len(df[(df["ID"] == row["ID"]) & (df["Date_min"] <= row["Date_min"]) &
              (df["Date_max"] >= row["Date_min"])].index) > 1:
        print("overlapping date ranges for ID %d" % row["ID"])
        L_output.append(row["ID"])
Output :
overlapping date ranges for ID 1
Do you know of a better way to check that ID 1 has overlapping date ranges?
Expected output :
[1]
Try:
Create a column "Dates" that contains the list of dates from "Date_min" to "Date_max" for each row
explode the "Dates" column
get the duplicated rows
df["Dates"] = df.apply(lambda row: pd.date_range(row["Date_min"], row["Date_max"]), axis=1)
df = df.explode("Dates").drop(["Date_min", "Date_max"], axis=1)
#if you want all the ID and Dates that are duplicated/overlap
>>> df[df.duplicated()]
ID Dates
1 1 2021-01-20
1 1 2021-01-21
1 1 2021-01-22
1 1 2021-01-23
#if you just want a count of overlapping dates per ID
>>> df.groupby("ID").agg(lambda x: x.duplicated().sum())
Dates
ID
1 4
2 0
3 0
You can extract each row's Date_min/Date_max values, construct pd.Interval objects from them, and iterate over all possible interval combinations within each ID:
from itertools import combinations
import pandas as pd
def group_has_overlap(group):
    timestamps = group[["Date_min", "Date_max"]].values.tolist()
    for t1, t2 in combinations(timestamps, 2):
        i1 = pd.Interval(t1[0], t1[1])
        i2 = pd.Interval(t2[0], t2[1])
        if i1.overlaps(i2):
            return True
    return False

for ID, group in df.groupby("ID"):
    print(ID, group_has_overlap(group))
Output is :
1 True
2 False
3 False
Set the index as an IntervalIndex, and use groupby to get your overlapping IDs:
(df.set_index(pd.IntervalIndex
                .from_arrays(df.Date_min,
                             df.Date_max,
                             closed='both'))
   .groupby('ID')
   .apply(lambda df: df.index.is_overlapping)
)
ID
1 True
2 False
3 False
dtype: bool
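If you want the question's expected output ([1]) rather than a boolean per ID, you can filter that result; a small sketch building on the answer above and the question's df:
overlap = (df.set_index(pd.IntervalIndex
                          .from_arrays(df.Date_min,
                                       df.Date_max,
                                       closed='both'))
             .groupby('ID')
             .apply(lambda g: g.index.is_overlapping))
print(overlap[overlap].index.tolist())  # [1]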

Returning the rows based on specific value without column name

I know how to return the rows based on specific text by specifying the column name like below.
import pandas as pd
data = {'id': ['1', '2', '3', '4'],
        'City1': ['abc', 'def', 'abc', 'khj'],
        'City2': ['JH', 'abc', 'abc', 'yuu'],
        'City2': ['JRR', 'ytu', 'rr', 'abc']}
df = pd.DataFrame(data)
df.loc[df['City1']== 'abc']
and output is -
id City1 City2
0 1 abc JRR
2 3 abc rr
but what I need is: my specific value 'abc' can be in any column, and I need to return the rows that contain that specific text, e.g. 'abc', without giving a column name. Is there any way? I need output as below:
id City1 City2
0 1 abc JRR
1 3 abc rr
2 4 khj abc
You can use any with axis=1 (the 1 in the code below) to check every column of each row and get the expected result:
>>> df[(df == 'abc').any(1)]
id City1 City2
0 1 abc JRR
2 3 abc rr
3 4 khj abc
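A small side note, not from the original answer: on recent pandas versions it is clearer to spell the axis out explicitly rather than passing it positionally:
df[(df == 'abc').any(axis=1)]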

Pandas - Replace row values based on multi-column match [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
This is a simple question, but most of the solutions I found here were based on a single-column match (mainly only ID).
Df1
'Name' 'Dept' 'Amount' 'Leave'
ABC 1 10 0
BCD 1 5 0
Df2
'Alias_Name', 'Dept', 'Amount', 'Leave', 'Address', 'Join_Date'
ABC 1 100 5 qwerty date1
PQR 2 0 2 asdfg date2
I want to replace row values in df1 when both Name and Dept are matched.
I tried merge(left_on=['Name', 'Dept'], right_on=['Alias_Name', 'Dept'], how='left'), but it gives me double the number of columns, with _x and _y suffixes. I just need to replace the Dept, Amount, Leave in df1 if Name and Dept are matched with any row in df2.
Desired Output:
Name Dept Amount Leave
ABC 1 100 5
BCD 1 5 0
new_df = df1[['Name', 'Dept']].merge(
    df2[['Alias_Name', 'Dept', 'Amount', 'Leave']].rename(columns={'Alias_Name': 'Name'}),
    how='left').fillna(df1[['Amount', 'Leave']])
Result:
Name Dept Amount Leave
0 ABC 1 100.0 5.0
1 BCD 1 5.0 0.0
You can use new_df[['Amount', 'Leave']] = new_df[['Amount', 'Leave']].astype(int) to re-cast the dtype if that's important.
You can create a temp column in both data frames that is a combination of "Name" and "Dept". That column can then be used as the primary key for matching.
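That answer is descriptive only; a minimal sketch of the idea might look like this (the key construction and the column list below are illustrative assumptions, not from the original answer):
key1 = df1['Name'] + '_' + df1['Dept'].astype(str)
key2 = df2['Alias_Name'] + '_' + df2['Dept'].astype(str)

# map the replacement values from df2 onto df1 via the combined key,
# keeping df1's original values where there is no match
for col in ['Amount', 'Leave']:
    df1[col] = key1.map(df2.set_index(key2)[col]).fillna(df1[col])
Note that columns touched by fillna may come back as float; cast with astype(int) if needed.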
Try:
# select rows that should be replaced
replace_df = df1[['Name', 'Dept']].merge(df2, left_on=['Name', 'Dept'], right_on=['Alias_Name', 'Dept'], how='inner')
# replace rows in df1
df1.iloc[replace_df.index] = replace_df
Result:
Name Dept Amount Leave
0 ABC 1 100 5
1 BCD 1 5 0

how to get the percentage of elements equal to a string value

How can I get, per group (name), the percentage of t_result values equal to "ok"?
name t_result
0 aaa ok
1 aaa err_1
2 bbb err_1
3 bbb ok
4 aaa err_2
5 aaa ok
name, percentage
aaa 0.5
bbb 0.5
You can compare with Series.eq to get a boolean mask, convert it to 0/1 with Series.view or Series.astype, and take the mean per group by aggregating on df['name']:
df1 = (df['t_result'].eq('ok')
         .view('i1')  # or .astype(int)
         .groupby(df['name'])
         .mean()
         .reset_index(name='percentage'))
print (df1)
name percentage
0 aaa 0.5
1 bbb 0.5
Solution with a new column, aggregating by the name column:
df1 = (df.assign(percentage=df['t_result'].eq('ok').view('i1'))
         .groupby('name', as_index=False)
         .mean())
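One caveat, not from the original answer: on pandas 2.0+, groupby().mean() no longer silently drops non-numeric columns such as t_result, so you may need to select the column (or pass numeric_only=True), e.g.:
df1 = (df.assign(percentage=df['t_result'].eq('ok').astype(int))
         .groupby('name', as_index=False)['percentage']
         .mean())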
