What is the most efficient way to transform this dataframe:
date transaction_ids id status
2020-01-01 00 1 approved
2020-02-01 00 1 approved
2020-03-01 25 1 disapproved
2020-03-08 33 1 trying_to_approve
2020-03-21 25 1 disapproved
2020-03-22 00 1 approved
2020-04-01 00 1 approved
2020-01-01 00 2 approved
2020-02-01 25 2 disapproved
2020-03-01 25 2 disapproved
2020-04-01 25 2 disapproved
To this one:
id status days_until_approve transaction_ids_not_approved total_transaction_ids
1 approved 21 [25, 33, 25] 3
Where days_until_approve is the time, in days, between the first disapproved date for id 1 (2020-03-01) and the first approved (00) date after it (2020-03-22).
Yes, use groupby and define your own function to build each of the output columns:
# make sure date is a datetime object
df['date'] = pd.to_datetime(df['date'])

def fun(df):
    r = pd.DataFrame(columns=['status', 'days_until_approve',
                              'transaction_ids_not_approved',
                              'total_transaction_ids'])
    id = df.id.max()
    r.loc[id, :] = None
    r.loc[id, 'status'] = df.query('date == date.max()').status.iloc[0]
    first_disapproved = df[df.status != 'approved'].index[0]
    next_approve = df[(df.status == 'approved') &
                      (df.index > first_disapproved)]
    if not next_approve.empty:
        next_approve = next_approve.index[0]
        # these are index labels, so use .loc rather than .iloc
        r.loc[id, 'days_until_approve'] = (df.loc[next_approve].date -
                                           df.loc[first_disapproved].date).days
    else:
        r.loc[id, 'days_until_approve'] = None
    r.loc[id, 'transaction_ids_not_approved'] = df.query(
        'status != "approved"'
    ).transaction_ids.to_list()
    r.loc[id, 'total_transaction_ids'] = df.transaction_ids.nunique()
    return r

t = df.groupby('id', group_keys=False).apply(fun)
t.head()
# status days_until_approve transaction_ids_not_approved total_transaction_ids
# id
# 1 approved 21 [25, 33, 25] 3
# 2 disapproved None [25, 25, 25] 2
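An equivalent, slightly leaner sketch returns a Series per group instead of building a one-row DataFrame inside the function. This is an illustration, not the answer's code; the name `summarize` and the inlined sample data are mine:

```python
import pandas as pd

# sample data reconstructed from the question
df = pd.DataFrame({
    "date": ["2020-01-01", "2020-02-01", "2020-03-01", "2020-03-08", "2020-03-21",
             "2020-03-22", "2020-04-01", "2020-01-01", "2020-02-01", "2020-03-01",
             "2020-04-01"],
    "transaction_ids": [0, 0, 25, 33, 25, 0, 0, 0, 25, 25, 25],
    "id": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "status": ["approved", "approved", "disapproved", "trying_to_approve",
               "disapproved", "approved", "approved", "approved", "disapproved",
               "disapproved", "disapproved"],
})
df["date"] = pd.to_datetime(df["date"])

def summarize(g):
    # index labels of rows that are not approved, in order
    not_ok = g.index[g["status"] != "approved"]
    days = None
    if len(not_ok):
        # first approved row after the first non-approved row
        later_ok = g.index[(g["status"] == "approved") & (g.index > not_ok[0])]
        if len(later_ok):
            days = (g.loc[later_ok[0], "date"] - g.loc[not_ok[0], "date"]).days
    return pd.Series({
        "status": g.loc[g["date"].idxmax(), "status"],
        "days_until_approve": days,
        "transaction_ids_not_approved": g.loc[not_ok, "transaction_ids"].tolist(),
        "total_transaction_ids": g["transaction_ids"].nunique(),
    })

out = df.groupby("id").apply(summarize)
print(out)
```

Returning a Series avoids the `r.loc[id, :] = None` bookkeeping, and the result is already indexed by id.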
df['date'] = pd.to_datetime(df['date'])

# Create new dataframe to populate
column_names = ['id', 'status', 'days_until_approve',
                'transaction_ids_not_approved', 'total_transaction_ids']
new_df = pd.DataFrame(columns=column_names)

for id, group in df.groupby('id'):
    # Set values for the new dataframe
    new_df.at[id, 'total_transaction_ids'] = group['transaction_ids'].nunique()
    new_df.at[id, 'id'] = id
    new_df.at[id, 'status'] = 'approved' if group['status'].isin(['approved']).any() else 'not approved'

    # Find transaction ids that were never approved
    t_ids = group['transaction_ids'].unique()
    not_approved = [str(i) for i in t_ids
                    if ((group['transaction_ids'] == i) & (~group['status'].isin(['approved']))).any()]
    new_df.at[id, 'transaction_ids_not_approved'] = ','.join(not_approved)

    # Find the number of days it took to approve for each id group
    first_rows_dis = [i for i, row in group.iterrows()
                      if row['status'] == 'disapproved'
                      and ((group.index > i) & (group['status'].isin(['approved']))).any()]
    if first_rows_dis:
        first_row = min(first_rows_dis)
        first_disapproved_date = group.loc[first_row, 'date']
        first_approved_date = group.loc[(group.index > first_row) & (group['status'].isin(['approved'])), 'date'].min()
        diff_days = (first_approved_date - first_disapproved_date).days
        new_df.at[id, 'days_until_approve'] = diff_days
    else:
        new_df.at[id, 'days_until_approve'] = None
Output:
id status days_until_approve transaction_ids_not_approved total_transaction_ids
1 approved 21 25,33 3
2 not approved None 25 2
I want this dataframe as the outcome:
inv_data = {  # avoid naming this dict `int`, which shadows the builtin
    "vendor": ['A','B','C','D','E'],
    "country": ['Spain','Spain','Germany','Italy','Italy'],
    "yeardum": ['2015','2020','2014','2016','2019'],
    "sales_year_data": ['15','205','24','920','1310'],
    "country_image_data": ['2','5','-6','7','-1'],
}
df_inv = pd.DataFrame(inv_data)
The data in column "sales_year_data" of df_inv comes from df1:
sales_year_data = {
"country":['Spain','France','Germany','Belgium','Italy'],
"2014":['45','202','24','216','219'],
"2015":['15','55','214','2016','209'],
"2016":['615','2333','205','207','920'],
"2017":['1215','255','234','2116','101'],
"2018":['415','1320','214','2516','2019'],
"2019":['215','220','5614','416','1310'],
"2020":['205','202','44','296','2011'],
}
df1 = pd.DataFrame(sales_year_data)
As you can see in the column "sales_year_data" of df_inv, the number 15 is the intersection in df1 between year 2015 and Spain, the number 205 is in the intersection between Spain and 2020, 24 is in the intersection between Germany and 2014 and so on.
The data in column "country_image_data" of df_inv comes from df2:
country_change_data = {
"country":['Spain','Spain','Germany','Italy','Italy'],
"2014":['4','2','-6','6','9'],
"2015":['2','5','-5','2','3'],
"2016":['5','3','5','7','9'],
"2017":['8','7','5','6','1'],
"2018":['5','1','4','6','2'],
"2019":['1','2','4','6','-1'],
"2020":['5','2','4','6','2'],
}
df2 = pd.DataFrame(country_change_data)
As you can see in the column "country_image_data" of df_inv, the number 2 is the intersection in df2 between year 2015 and Spain, the number 5 is in the intersection between Spain and 2020, -6 is in the intersection between Germany and 2014 and so on.
If my original dataframe is:
inv = {
"vendor":['A','B','C','D','E'],
"country":['Spain','Spain','Germany','Italy','Italy'],
"yeardum":['2015','2020','2014','2016','2019'],
}
df0 = pd.DataFrame(inv)
How could I automate the search across df1 and df2 for the intersections of interest, building df_inv starting from df0?
This does it.
sales_counters = {}
country_counters = {}
new_df_data = []

for _, row in df0.iterrows():
    c = row['country']
    y = row['yeardum']
    # count how many times we've seen this country, to step through duplicate rows
    sales_idx = sales_counters[c] = sales_counters.get(c, -1) + 1
    country_idx = country_counters[c] = country_counters.get(c, -1) + 1
    d1 = df1[df1['country'] == c]
    d2 = df2[df2['country'] == c]
    sales_year = d1.iloc[min(sales_idx, d1.shape[0] - 1)][y]
    country_image = d2.iloc[min(country_idx, d2.shape[0] - 1)][y]
    new_df_data.append([sales_year, country_image])

df0 = pd.concat([df0, pd.DataFrame(new_df_data)], axis=1).rename(
    {0: 'sales_year_data', 1: 'country_image_data'}, axis=1)
Test:
>>> df0
vendor country yeardum sales_year_data country_image_data
0 A Spain 2015 15 2
1 B Spain 2020 205 2
2 C Germany 2014 24 -6
3 D Italy 2016 920 7
4 E Italy 2019 1310 -1
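A loop-free alternative for the df1 lookup, where each country appears only once, is to melt the wide year columns into long form and merge on (country, year). This is a sketch of that idea, not the answer above, and df2's duplicated countries would still need the positional counters:

```python
import pandas as pd

# toy frames mirroring the question's df0 and df1
df0 = pd.DataFrame({
    "vendor": ["A", "B", "C", "D", "E"],
    "country": ["Spain", "Spain", "Germany", "Italy", "Italy"],
    "yeardum": ["2015", "2020", "2014", "2016", "2019"],
})
df1 = pd.DataFrame({
    "country": ["Spain", "France", "Germany", "Belgium", "Italy"],
    "2014": ["45", "202", "24", "216", "219"],
    "2015": ["15", "55", "214", "2016", "209"],
    "2016": ["615", "2333", "205", "207", "920"],
    "2019": ["215", "220", "5614", "416", "1310"],
    "2020": ["205", "202", "44", "296", "2011"],
})

# melt the wide year columns into (country, yeardum, value) rows, then merge
long1 = df1.melt(id_vars="country", var_name="yeardum", value_name="sales_year_data")
out = df0.merge(long1, on=["country", "yeardum"], how="left")
print(out)
```

Each (country, yeardum) pair in df0 then picks up its intersection value in a single vectorized merge.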
I have two sets of dataframes and I want to create a third one. I am trying to write code that does the following:
If A_pd["from"] and A_pd["To"] both fall within the range B_pd["from"] to B_pd["To"], then add A_pd["from"], A_pd["To"], and B_pd["Value"] to the C_pd dataframe.
If A_pd["from"] falls within the range B_pd["from"] to B_pd["To"], but A_pd["To"] falls within the range of the next row, then I want to split A_pd["from"] to A_pd["To"] into two ranges, (A_pd["from"], B_pd["To"]) and (B_pd["To"], A_pd["To"]), each with its corresponding B_pd["Value"].
I created the following code:
import pandas as pd

A_pd = {'from': [0, 20, 80, 180, 250],
        'To':   [20, 50, 120, 210, 300]}
A_pd = pd.DataFrame(A_pd)
B_pd = {'from': [0, 20, 100, 200],
        'To':   [20, 100, 200, 300],
        'Value': [20, 17, 15, 12]}
B_pd = pd.DataFrame(B_pd)

for i in range(len(A_pd)):
    numberOfIntrupt = 0
    for j in range(len(B_pd)):
        if A_pd["from"].values[i] >= B_pd["from"].values[j] and A_pd["from"].values[i] > B_pd["To"].values[j]:
            numberOfIntrupt += 1

cols = ['C_from', 'C_To', 'C_value']
C_dp = pd.DataFrame(columns=cols, index=range(len(A_pd) + numberOfIntrupt))

for i in range(len(A_pd)):
    for j in range(len(B_pd)):
        a = A_pd["from"].values[i]
        b = A_pd["To"].values[i]
        c_eval = B_pd["Value"].values[j]
        range_s = B_pd["from"].values[j]
        range_f = B_pd["To"].values[j]
        if a >= range_s and a <= range_f and b >= range_s and b <= range_f:
            C_dp['C_from'].loc[i] = a
            C_dp['C_To'].loc[i] = b
            C_dp['C_value'].loc[i] = c_eval
        elif a >= range_s and b > range_f:
            C_dp['C_from'].loc[i] = a
            C_dp['C_To'].loc[i] = range_f
            C_dp['C_value'].loc[i] = c_eval
            C_dp['C_from'].loc[i + 1] = range_f
            C_dp['C_To'].loc[i + 1] = b
            C_dp['C_value'].loc[i + 1] = B_pd["Value"].values[j + 1]
print(C_dp)
The current result is C_dp:
C_from C_To C_value
0 0 20 20
1 20 50 17
2 80 100 17
3 180 200 15
4 250 300 12
5 200 300 12
6 NaN NaN NaN
7 NaN NaN NaN
The expected result should be:
C_from C_To C_value
0 0 20 20
1 20 50 17
2 80 100 17
3 100 120 15
4 180 200 15
5 200 210 12
6 250 300 12
Thank you a lot for the support.
I'm sure there is a better way to do this without loops, but this will help your logic flow.
import pandas as pd

A_pd = pd.DataFrame({'from': [0, 20, 80, 180, 250],
                     'To':   [20, 50, 120, 210, 300]})
B_pd = pd.DataFrame({'from': [0, 20, 100, 200],
                     'To':   [20, 100, 200, 300],
                     'Value': [20, 17, 15, 12]})

cols = ['C_from', 'C_To', 'C_value']
rows = []  # collect result rows in a list; DataFrame.append was removed in pandas 2.0
spillover = False

for i in range(len(A_pd)):
    for j in range(len(B_pd)):
        a_from = A_pd["from"].values[i]
        a_to = A_pd["To"].values[i]
        b_from = B_pd["from"].values[j]
        b_to = B_pd["To"].values[j]
        b_value = B_pd['Value'].values[j]
        if a_from >= b_to:
            # a_from outside b range
            continue  # next b
        elif a_from >= b_from:
            # a_from within b range
            if a_to <= b_to:
                rows.append({"C_from": a_from, "C_To": a_to, "C_value": b_value})
                break  # next a
            else:
                # a spills past this b range; finish it in a later b range
                rows.append({"C_from": a_from, "C_To": b_to, "C_value": b_value})
                spillover = True
                continue
        if spillover:
            if a_to <= b_to:
                rows.append({"C_from": b_from, "C_To": a_to, "C_value": b_value})
                spillover = False
                break
            else:
                rows.append({"C_from": b_from, "C_To": b_to, "C_value": b_value})
                spillover = True
                continue

C_dp = pd.DataFrame(rows, columns=cols)
print(C_dp)
Output
C_from C_To C_value
0 0 20 20
1 20 50 17
2 80 100 17
3 100 120 15
4 180 200 15
5 200 210 12
6 250 300 12
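As hinted above, the loops can be avoided entirely. One hedged sketch (my own, not the answer's code) is a cross join of the two range tables followed by an interval-intersection filter; `how="cross"` needs pandas 1.2 or later:

```python
import pandas as pd

A_pd = pd.DataFrame({"from": [0, 20, 80, 180, 250],
                     "To":   [20, 50, 120, 210, 300]})
B_pd = pd.DataFrame({"from": [0, 20, 100, 200],
                     "To":   [20, 100, 200, 300],
                     "Value": [20, 17, 15, 12]})

# pair every A range with every B range, then keep the overlapping pairs;
# the intersection of two intervals is (max of the froms, min of the tos)
m = A_pd.merge(B_pd, how="cross", suffixes=("_a", "_b"))
m["C_from"] = m[["from_a", "from_b"]].max(axis=1)
m["C_To"] = m[["To_a", "To_b"]].min(axis=1)
C_dp = (m[m["C_from"] < m["C_To"]]
        .rename(columns={"Value": "C_value"})[["C_from", "C_To", "C_value"]]
        .reset_index(drop=True))
print(C_dp)
```

An A range that straddles a B boundary naturally produces two intersection rows, which is exactly the splitting behavior the question asks for.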
I have a json like the following:
js = """[{"id": 13, "kits": [{"kit": "KIT1216A", "quantity_parts": 80, "quantity_kit": 1},
                     {"kit": "KIT1216B", "quantity_parts": 680, "quantity_kit": 11}],
          "transaction_date": "2020-11-27T05:02:03.822000Z", "dispatch_date": "2020-11-27T05:02:05.919000Z",
          "transaction_no": 2005, "transporter_name": "TCI", "vehicle_details": "hr55ab3337",
          "invoice_number": "355733019", "remarks": "0", "sending_location": 11, "owner": 4}]"""
Where kits is a list which contains multiple dictionaries.
How do I flatten the dataframe created from it so that the data in kits is included in the row itself?
I simply tried:
data = json.loads(js)
df = pd.DataFrame(data)
output:
id kits transaction_date dispatch_date transaction_no transporter_name vehicle_details invoice_number remarks sending_location owner
0 13 [{'kit': 'KIT1216A', 'quantity_parts': 80, 'qu... 2020-11-27T05:02:03.822000Z 2020-11-27T05:02:05.919000Z 2005 TCI hr55ab3337 355733019 0 11 4
Desired Output:
Use json_normalize:
data = json.loads(js)
cols = ['id','transaction_date','dispatch_date','transaction_no','transporter_name',
'vehicle_details','invoice_number','remarks','sending_location','owner']
df = pd.json_normalize(data, 'kits', cols)
print (df)
kit quantity_parts quantity_kit id transaction_date \
0 KIT1216A 80 1 13 2020-11-27T05:02:03.822000Z
1 KIT1216B 680 11 13 2020-11-27T05:02:03.822000Z
dispatch_date transaction_no transporter_name \
0 2020-11-27T05:02:05.919000Z 2005 TCI
1 2020-11-27T05:02:05.919000Z 2005 TCI
vehicle_details invoice_number remarks sending_location owner
0 hr55ab3337 355733019 0 11 4
1 hr55ab3337 355733019 0 11 4
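If you already have the frame with the list column, another route is explode plus json_normalize on the exploded dicts. This is a sketch of the same flattening with a trimmed copy of the question's payload, not the answer above:

```python
import json
import pandas as pd

# trimmed copy of the question's payload
js = """[{"id": 13, "kits": [{"kit": "KIT1216A", "quantity_parts": 80, "quantity_kit": 1},
{"kit": "KIT1216B", "quantity_parts": 680, "quantity_kit": 11}],
"transaction_no": 2005, "transporter_name": "TCI"}]"""

df = pd.DataFrame(json.loads(js))
# one row per kit dict, then expand each dict into its own columns
df = df.explode("kits").reset_index(drop=True)
kits = pd.json_normalize(df["kits"].tolist())
df = pd.concat([df.drop(columns="kits"), kits], axis=1)
print(df)
```

The scalar columns (id, transaction_no, ...) are repeated on each exploded row, just as in the json_normalize output.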
I have the following data frame:
import pandas as pd
pandas_df = pd.DataFrame([
["SEX", "Male"],
["SEX", "Female"],
["EXACT_AGE", None],
["Country", "Afghanistan"],
["Country", "Albania"]],
columns=['FullName', 'ResponseLabel'
])
Now what I need to do is to add sort order to this dataframe. Each new "FullName" would increment it by 100 and each consecutive "ResponseLabel" for a given "FullName" would increment it by 1 (for this specific "FullName"). So I basically create two different sort orders that I sum later on.
pandas_full_name_increment = pandas_df[['FullName']].drop_duplicates()
pandas_full_name_increment = pandas_full_name_increment.reset_index()
pandas_full_name_increment.index += 1
pandas_full_name_increment['SortOrderFullName'] = pandas_full_name_increment.index * 100
pandas_df['SortOrderResponseLabel'] = pandas_df.groupby(['FullName']).cumcount() + 1
pandas_df = pd.merge(pandas_df, pandas_full_name_increment, on = ['FullName'], how = 'left')
Result:
FullName ResponseLabel SortOrderResponseLabel index SortOrderFullName SortOrder
0 SEX Male 1 0 100 101
1 SEX Female 2 0 100 102
2 EXACT_AGE NULL 1 2 200 201
3 Country Afghanistan 1 3 300 301
4 Country Albania 2 3 300 302
The "SortOrder" column I get is correct, but I wonder if there is a better approach pandas-wise?
Thank you!
The best way to do this would be to use ngroup and cumcount:
name_group = pandas_df.groupby('FullName')
pandas_df['sort_order'] = (
name_group.ngroup(ascending=False).add(1).mul(100) +
name_group.cumcount().add(1)
)
Output
FullName ResponseLabel sort_order
0 SEX Male 101
1 SEX Female 102
2 EXACT_AGE None 201
3 Country Afghanistan 301
4 Country Albania 302
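If you want groups numbered in order of first appearance, rather than ngroup's sorted order (which the answer compensates for with ascending=False), pd.factorize is another option. A sketch with the question's data:

```python
import pandas as pd

pandas_df = pd.DataFrame({
    "FullName": ["SEX", "SEX", "EXACT_AGE", "Country", "Country"],
    "ResponseLabel": ["Male", "Female", None, "Afghanistan", "Albania"],
})

# factorize numbers FullName groups in order of first appearance: SEX=0, EXACT_AGE=1, ...
group_no = pd.factorize(pandas_df["FullName"])[0] + 1
# position of each row within its FullName group, starting at 1
within = pandas_df.groupby("FullName").cumcount() + 1
pandas_df["SortOrder"] = group_no * 100 + within
print(pandas_df)
```

This keeps working even when group names would not sort into the order they appear in the data.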
I have a dataframe as shown below
Token Label StartID EndID
0 Germany Country 0 2
1 Berlin Capital 6 9
2 Frankfurt City 15 18
3 four million Number 21 24
4 Sweden Country 26 27
5 United Kingdom Country 32 34
6 ten million Number 40 45
7 London Capital 50 55
I am trying to get a row based on a certain condition, i.e. associate the label Number with the closest Capital, i.e. Berlin:
3 four million Number 21 24 -> 1 Berlin Capital 6 9
or something like:
df[row3] -> df [row1]
The pseudo-logic:
First, check for the rows with the label Number; the assumption is that the matching Capital row is always two rows above or below. Also, a Capital row always appears after a Country row.
What I have done until now:
columnsName =['Token', 'Label', 'StartID', 'EndID']
df = pd.read_csv('resources/testcsv.csv', index_col= 0, skip_blank_lines=True, header=0)
print(df)
key_number = 'Number'
df_with_number = (df[df['Label'].str.lower().str.contains(r"\b{}\b".format(key_number), regex=True, case=False)])
print(df_with_number)
key_capital = 'Capital'
df_with_capitals = (df[df['Label'].str.lower().str.contains(r"\b{}\b".format(key_capital), regex=True, case=False)])
print(df_with_capitals)
key_country = 'Country'
df_with_country = (df[df['Label'].str.lower().str.contains(r"\b{}\b".format(key_country), regex=True, case=False)])
print(df_with_country)
The logic is to compare the indexes and then make the possible relations,
i.e.
df[row3] -> [ df [row1], df[row7]]
You could use merge_asof with the parameter direction='nearest', for example:
df_nb_cap = pd.merge_asof(df_with_number.reset_index(),
df_with_capitals.reset_index(),
on='index',
suffixes=('_nb', '_cap'), direction='nearest')
print (df_nb_cap)
index Token_nb Label_nb StartID_nb EndID_nb Token_cap Label_cap \
0 3 four_million Number 21 24 Berlin Capital
1 6 ten_million Number 40 45 London Capital
StartID_cap EndID_cap
0 6 9
1 50 55
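The snippet above relies on the frames built earlier in the question; a self-contained sketch of the same merge_asof idea, with the sample data reconstructed from the question's table:

```python
import pandas as pd

df = pd.DataFrame({
    "Token": ["Germany", "Berlin", "Frankfurt", "four million",
              "Sweden", "United Kingdom", "ten million", "London"],
    "Label": ["Country", "Capital", "City", "Number",
              "Country", "Country", "Number", "Capital"],
})

# reset_index turns the row position into an 'index' column usable as the asof key
numbers = df[df["Label"] == "Number"].reset_index()
capitals = df[df["Label"] == "Capital"].reset_index()
# match each Number row to the Capital row whose original position is nearest
pairs = pd.merge_asof(numbers, capitals, on="index",
                      suffixes=("_nb", "_cap"), direction="nearest")
print(pairs)
```

merge_asof requires both frames to be sorted on the key, which holds here because the filtered rows keep their original order.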
from io import StringIO

import numpy as np
import pandas as pd

# adjusted sample data
s = """Token,Label,StartID,EndID
Germany,Country,0,2
Berlin,Capital,6,9
Frankfurt,City,15,18
four million,Number,21,24
Sweden,Country,26,27
United Kingdom,Country,32,34
ten million,Number,40,45
London,Capital,50,55
ten million,Number,40,45
ten million,Number,40,45"""
df = pd.read_csv(StringIO(s))

# create a mask for Number where Capital is 2 above or below
# and where Country is three above Number or one below Number
mask = (df['Label'] == 'Number') & (((df['Label'].shift(2) == 'Capital') |
                                     (df['Label'].shift(-2) == 'Capital')) &
                                    (df['Label'].shift(3) == 'Country') |
                                    (df['Label'].shift(-1) == 'Country'))

# create a mask for Capital where Number is 2 above or below
# and where Country is one above Capital
mask2 = (df['Label'] == 'Capital') & (((df['Label'].shift(2) == 'Number') |
                                       (df['Label'].shift(-2) == 'Number')) &
                                      (df['Label'].shift(1) == 'Country'))

# hstack the two masks and create a frame
new_df = pd.DataFrame(np.hstack([df[mask].to_numpy(), df[mask2].to_numpy()]))
print(new_df)
0 1 2 3 4 5 6 7
0 four million Number 21 24 Berlin Capital 6 9