How to create a new data frame from a larger dataset - Python

I am working with a dataset (10,000 data points) that provides 100 different account numbers with transaction amounts, dates and times of transactions, etc.
From this dataset I want to create a separate data frame for one account number, which then contains all the transactions (ordered by time) that that account number made throughout the year.
I tried to do this by:
group = df.groupby('account_num')
which then gives me
pandas.core.groupby.generic.DataFrameGroupBy
Then, when I want to get the group for a specific account number, say 51234:
group.get_group('51234')
I receive an error:
KeyError: 51234
How can I make a separate data frame containing all the transactions for one single account number?
(Sorry if this is a very basic question, I'm a newbie.)

IIUC, you can get your output in a slightly different way. Start by making sure your time column, which I assume is a date based on your description, is actually a datetime object, then filter your dataframe for the specific account number. There are plenty of ways to do this; a common one is loc, but in my case I use query. Then sort by your date using sort_values, and lastly use groupby on the year part of your date column:
# Convert your date column to datetime
df['date'] = pd.to_datetime(df['date'])
# Filter and sort
print(df.query('account_num == 51234')\
    .sort_values(by=['date'], ascending=True))
# Equivalently with loc
print(df.loc[df['account_num'] == 51234]\
    .sort_values(by=['date'], ascending=True))
account_num date
0 51234 2020-01-01
1 51234 2020-02-01
2 51234 2020-03-01
7 51234 2020-08-01
9 51234 2020-08-01
11 51234 2020-08-01
13 51234 2020-08-01
3 51234 2021-04-01
4 51234 2021-05-01
5 51234 2023-06-01
6 51234 2023-07-01
8 51234 2023-07-01
10 51234 2023-07-01
12 51234 2023-07-01
# Filter, sort, and get yearly count
print(df.query('account_num == 51234')\
    .sort_values(by=['date'], ascending=True)\
    .groupby(df['date'].dt.year).account_num.count())
date
2020 7
2021 2
2023 5
Based on the below sample DF:
{'account_num': {0: 51234,
1: 51234,
2: 51234,
3: 51234,
4: 51234,
5: 51234,
6: 51234,
7: 51234,
8: 51234,
9: 51234,
10: 51234,
11: 51234,
12: 51234,
13: 51234,
14: 512346,
15: 512346,
16: 512346,
17: 512346,
18: 512346,
19: 512346,
20: 512346,
21: 512346,
22: 512346,
23: 13123,
24: 13123,
25: 13123,
26: 13123,
27: 13123,
28: 13123,
29: 13123,
30: 13123,
31: 13123},
'date': {0: '01/01/2020',
1: '02/01/2020',
2: '03/01/2020',
3: '04/01/2021',
4: '05/01/2021',
5: '06/01/2023',
6: '07/01/2023',
7: '08/01/2020',
8: '07/01/2023',
9: '08/01/2020',
10: '07/01/2023',
11: '08/01/2020',
12: '07/01/2023',
13: '08/01/2020',
14: '09/01/2020',
15: '10/01/2020',
16: '11/01/2020',
17: '12/01/2020',
18: '13/01/2020',
19: '14/01/2020',
20: '15/01/2020',
21: '16/01/2020',
22: '17/01/2020',
23: '18/01/2020',
24: '19/01/2020',
25: '20/01/2020',
26: '21/01/2020',
27: '22/01/2020',
28: '23/01/2020',
29: '24/01/2020',
30: '25/01/2020',
31: '26/01/2020'}}
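As for the KeyError in the question itself: it is most likely a type mismatch. The traceback shows KeyError: 51234 without quotes, which suggests account_num is stored as an integer, so get_group must be called with an integer key rather than the string '51234'. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "account_num": [51234, 51234, 13123],
    "date": pd.to_datetime(["2020-02-01", "2020-01-01", "2020-01-05"]),
})

groups = df.groupby("account_num")
# groups.get_group('51234') would raise KeyError here: the keys are ints
acct = groups.get_group(51234).sort_values("date")
print(acct)
```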

Related

Dataframe sort and remove on date

I have the following data frame
import pandas as pd
from pandas import Timestamp
df=pd.DataFrame({
'Tech en Innovation Fonds': {0: '63.57', 1: '63.57', 2: '63.57', 3: '63.57', 4: '61.03', 5: '61.03', 6: 61.03}, 'Aandelen Index Fonds': {0: '80.22', 1: '80.22', 2: '80.22', 3: '80.22', 4: '79.85', 5: '79.85', 6: 79.85},
'Behoudend Mix Fonds': {0: '44.80', 1: '44.8', 2: '44.8', 3: '44.8', 4: '44.8', 5: '44.8', 6: 44.8},
'Neutraal Mix Fonds': {0: '50.43', 1: '50.43', 2: '50.43', 3: '50.43', 4: '50.37', 5: '50.37', 6: 50.37},
'Dynamisch Mix Fonds': {0: '70.20', 1: '70.2', 2: '70.2', 3: '70.2', 4: '70.04', 5: '70.04', 6: 70.04},
'Risicomijdende Strategie': {0: '46.03', 1: '46.03', 2: '46.03', 3: '46.03', 4: '46.08', 5: '46.08', 6: 46.08},
'Tactische Strategie': {0: '48.69', 1: '48.69', 2: '48.69', 3: '48.69', 4: '48.62', 5: '48.62', 6: 48.62},
'Aandelen Groei Strategie': {0: '52.91', 1: '52.91', 2: '52.91', 3: '52.91', 4: '52.77', 5: '52.77', 6: 52.77},
'Datum': {0: Timestamp('2022-07-08 18:00:00'), 1: Timestamp('2022-07-11 19:42:55'), 2: Timestamp('2022-07-12 09:12:09'), 3: Timestamp('2022-07-12 09:29:53'), 4: Timestamp('2022-07-12 15:24:46'), 5: Timestamp('2022-07-12 15:30:02'), 6: Timestamp('2022-07-12 15:59:31')}})
I scrape these from a website several times a day.
I am looking for a way to clean the dataframe, so that for each day only the latest entry is kept.
So for this dataframe, 2022-07-12 has 5 entries, but I want to keep only the last one, i.e. 2022-07-12 15:59:31.
The entries on the previous days were already fixed manually :-(
I intend to do this once a month, so each day will have several entries.
I already tried
dfclean=df.sort_values('Datum').drop_duplicates('Datum', keep='last')
But that gives me all the records back because the times are different.
Does anyone have an idea how to do this?
If the data is sorted by date, use a groupby.last:
df.groupby(df['Datum'].dt.date, as_index=False).last()
else:
df.loc[df.groupby(df['Datum'].dt.date)['Datum'].idxmax()]
output:
Tech en Innovation Fonds Aandelen Index Fonds Behoudend Mix Fonds \
0 63.57 80.22 44.80
1 63.57 80.22 44.8
2 61.03 79.85 44.8
Neutraal Mix Fonds Dynamisch Mix Fonds Risicomijdende Strategie \
0 50.43 70.20 46.03
1 50.43 70.2 46.03
2 50.37 70.04 46.08
Tactische Strategie Aandelen Groei Strategie Datum
0 48.69 52.91 2022-07-08 18:00:00
1 48.69 52.91 2022-07-11 19:42:55
2 48.62 52.77 2022-07-12 15:59:31
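The idxmax branch can be verified on a small self-contained frame (toy column names, not the data above): idxmax returns the row label of the latest timestamp within each calendar day, so this works even when the frame is not sorted.

```python
import pandas as pd

df = pd.DataFrame({
    "value": [1, 2, 3, 4],
    "Datum": pd.to_datetime([
        "2022-07-11 19:42:55",
        "2022-07-12 09:12:09",
        "2022-07-12 15:30:02",
        "2022-07-12 15:59:31",
    ]),
})

# one row kept per calendar day: the one whose timestamp is largest
latest = df.loc[df.groupby(df["Datum"].dt.date)["Datum"].idxmax()]
print(latest)
```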
You can use .max() with datetime columns like this:
dfclean = df.loc[
(df['Datum'].dt.date < df['Datum'].max().date()) |
(df['Datum'] == df['Datum'].max())
]
Output:
Tech en Innovation Fonds Aandelen Index Fonds Behoudend Mix Fonds \
0 63.57 80.22 44.80
1 63.57 80.22 44.8
6 61.03 79.85 44.8
Neutraal Mix Fonds Dynamisch Mix Fonds Risicomijdende Strategie \
0 50.43 70.20 46.03
1 50.43 70.2 46.03
6 50.37 70.04 46.08
Tactische Strategie Aandelen Groei Strategie Datum
0 48.69 52.91 2022-07-08 18:00:00
1 48.69 52.91 2022-07-11 19:42:55
6 48.62 52.77 2022-07-12 15:59:31
Below is a working example, where I keep only the date part of the timestamp to filter the dataframe:
df['Datum_Date'] = df['Datum'].dt.date
dfclean = df.sort_values('Datum_Date').drop_duplicates('Datum_Date', keep='last')
dfclean = dfclean.drop(columns='Datum_Date', axis=1)
Does this get you what you need? (Group on the date part, not dt.day, which is only the day-of-month number, and take idxmax of the full timestamp.)
df['Day'] = df['Datum'].dt.date
df.loc[df.groupby('Day')['Datum'].idxmax()]

Create a 2nd column based on the maximum date By Month in 1st column

I would like to create a 2nd column based on the maximum date by month in the 1st column, but I'm having trouble identifying the maximum date by month (first step below).
I'm trying to do a groupby but I'm getting a ValueError: Cannot index with multidimensional key.
I believe the steps are:
1. Within the datadate column, identify the maximum date by month, e.g. 1/29/1993, 2/11/1993, 3/29/1993, etc.
2. For the datadate row that equals the maximum date by month: in a new column called last_day_in_month, put the maximum possible date, e.g. 1/31/1993, 2/28/1993, 3/31/1993, etc. For all the other rows where datadate != maximum date by month, put False.
Sample Data and Ideal Output:
{'tic': {0: 'SPY', 1: 'SPY', 2: 'SPY', 3: 'SPY', 4: 'SPY', 5: 'SPY', 6: 'SPY', 7: 'SPY', 8: 'SPY', 9: 'SPY'}, 'cusip': {0: '78462F103', 1: '78462F103', 2: '78462F103', 3: '78462F103', 4: '78462F103', 5: '78462F103', 6: '78462F103', 7: '78462F103', 8: '78462F103', 9: '78462F103'}, 'datadate': {0: '1993-01-29', 1: '1993-02-01', 2: '1993-02-02', 3: '1993-02-03', 4: '1993-02-04', 5: '1993-02-05', 6: '1993-02-08', 7: '1993-02-09', 8: '1993-02-10', 9: '1993-02-11'}, 'prccd': {0: 43.938, 1: 44.25, 2: 44.34375, 3: 44.8125, 4: 45.0, 5: 44.96875, 6: 44.96875, 7: 44.65625, 8: 44.71875, 9: 44.9375}, 'next_year': {0: '1994-01-25', 1: '1994-01-26', 2: '1994-01-27', 3: '1994-01-28', 4: '1994-01-31', 5: '1994-02-01', 6: '1994-02-02', 7: '1994-02-03', 8: '1994-02-04', 9: '1994-02-07'}, 'next_year_px': {0: 47.1875, 1: 47.3125, 2: 47.75, 3: 47.875, 4: 48.21875, 5: 47.96875, 6: 48.28125, 7: 48.0625, 8: 46.96875, 9: 47.1875}, 'one_yr_chg': {0: 0.073956484136738, 1: 0.0692090395480226, 2: 0.076814658210007, 3: 0.0683403068340306, 4: 0.0715277777777777, 5: 0.0667129951355107, 6: 0.0736622654621264, 7: 0.0762771168649405, 8: 0.050314465408805, 9: 0.0500695410292072}, 'daily_chg': {0: nan, 1: 0.0071009149255769, 2: 0.0021186440677967, 3: 0.0105708245243127, 4: 0.0041841004184099, 5: -0.0006944444444444, 6: 0.0, 7: -0.0069492703266157, 8: 0.0013995801259623, 9: 0.004891684136967}, 'last_day_in_month': {0: '1993-01-31', 1: 'False', 2: 'False', 3: 'False', 4: 'False', 5: 'False', 6: 'False', 7: 'False', 8: 'False', 9: '1993-02-28'}}
Group by month and use idxmax to find the maximum day in each month; use to_period and to_timestamp to get the last day of each month.
datetime = pd.to_datetime(df.datadate)
max_day_indx = datetime.groupby(datetime.dt.month).idxmax()
df['last_day_in_month'] = False
df.loc[max_day_indx, 'last_day_in_month'] = datetime[max_day_indx].dt.to_period('M').dt.to_timestamp('M').dt.strftime('%Y-%m-%d')
print(df)
tic cusip datadate prccd next_year next_year_px one_yr_chg \
0 SPY 78462F103 1993-01-29 43.93800 1994-01-25 47.18750 0.073956
1 SPY 78462F103 1993-02-01 44.25000 1994-01-26 47.31250 0.069209
2 SPY 78462F103 1993-02-02 44.34375 1994-01-27 47.75000 0.076815
3 SPY 78462F103 1993-02-03 44.81250 1994-01-28 47.87500 0.068340
4 SPY 78462F103 1993-02-04 45.00000 1994-01-31 48.21875 0.071528
5 SPY 78462F103 1993-02-05 44.96875 1994-02-01 47.96875 0.066713
6 SPY 78462F103 1993-02-08 44.96875 1994-02-02 48.28125 0.073662
7 SPY 78462F103 1993-02-09 44.65625 1994-02-03 48.06250 0.076277
8 SPY 78462F103 1993-02-10 44.71875 1994-02-04 46.96875 0.050314
9 SPY 78462F103 1993-02-11 44.93750 1994-02-07 47.18750 0.050070
daily_chg last_day_in_month
0 NaN 1993-01-31
1 0.007101 False
2 0.002119 False
3 0.010571 False
4 0.004184 False
5 -0.000694 False
6 0.000000 False
7 -0.006949 False
8 0.001400 False
9 0.004892 1993-02-28
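One caveat worth noting: grouping on dt.month alone would lump January 1993 and January 1994 together if the data spanned multiple years; grouping on the year-month period avoids that. A minimal sketch of the same idea (toy dates, using end_time to get the month end):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["1993-01-29", "1993-02-11", "1994-01-28"]))

# group by year-month period, not dt.month, so different years stay separate
max_day_idx = dates.groupby(dates.dt.to_period("M")).idxmax()

# end_time of a monthly period is the last instant of that month
month_end = dates[max_day_idx].dt.to_period("M").dt.end_time.dt.strftime("%Y-%m-%d")
print(month_end.tolist())
```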

Nested for loop filtering inner loop based on outer loop and appending dataframe

I am trying to build a dataframe that combines individual dataframes of county-level high school enrollment projections generated in a for loop.
I can do this for a single county, based on this SO question. It works great. My goal now is to do a nested for loop that would take multiple county FIPS codes, filter the inner loop on that, and generate an 11-row dataframe that would then be appended to a master dataframe. For three counties, for example, the final dataframe would be 33 rows.
But I haven't been able to get it right. I've tried to model on this SO question and answer.
This is my starting dataframe:
df = pd.DataFrame({"year": ['2020_21', '2020_21','2020_21'],
"county_fips" : ['06019','06021','06023'] ,
"grade11" : [5000,2000,2000],
"grade12": [5200,2200,2200],
"grade11_chg": [1.01,1.02,1.03],
"grade11_12_ratio": [0.9,0.8,0.87]})
df
This is my code with the nested loops. My intent is to run through the county codes in the outer loop and the projection year calculations in the inner loop.
projection_years=['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']
for i in df['county_fips'].unique():
print(i)
grade11_change=df.iloc[0]['grade11_chg']
grade11_12_ratio=df.iloc[0]['grade11_12_ratio']
full_name=[]
for year in projection_years:
#print(year)
df_select=df[df['county_fips']==i]
lr = df_select.iloc[-1]
row = {}
row['year'] = year
row['county_fips'] = i
row = {}
row['grade11'] = int(lr['grade11'] * grade11_change)
row['grade12'] = int(lr['grade11'] * grade11_12_ratio)
df_select = df_select.append([row])
full_name.append(df_select)
df_final=pd.concat(full_name)
df_final=df_final[['year','county_fips','grade11','grade12']]
print('Finished processing')
But I end up with NaN values and repeating years. Below is my desired output (I built this in Excel, so the numbers reflect rounding). (Update: this corrects the original df_final_goal.)
df_final_goal=pd.DataFrame({'year': {0: '2020_21', 1: '2021_22', 2: '2022_23', 3: '2023_24', 4: '2024_25', 5: '2025_26',
6: '2026_27', 7: '2027_28', 8: '2028_29', 9: '2029_30', 10: '2030_31', 11: '2020_21', 12: '2021_22', 13: '2022_23',
14: '2023_24', 15: '2024_25', 16: '2025_26', 17: '2026_27', 18: '2027_28', 19: '2028_29', 20: '2029_30', 21: '2030_31',
22: '2020_21', 23: '2021_22', 24: '2022_23', 25: '2023_24', 26: '2024_25', 27: '2025_26', 28: '2026_27', 29: '2027_28',
30: '2028_29', 31: '2029_30', 32: '2030_31'},
'county_fips': {0: '06019', 1: '06019', 2: '06019', 3: '06019', 4: '06019', 5: '06019', 6: '06019', 7: '06019', 8: '06019',
9: '06019', 10: '06019', 11: '06021', 12: '06021', 13: '06021', 14: '06021', 15: '06021', 16: '06021', 17: '06021', 18: '06021',
19: '06021', 20: '06021', 21: '06021', 22: '06023', 23: '06023', 24: '06023', 25: '06023', 26: '06023', 27: '06023',
28: '06023', 29: '06023', 30: '06023', 31: '06023', 32: '06023'},
'grade11': {0: 5000, 1: 5050, 2: 5101, 3: 5152, 4: 5203, 5: 5255, 6: 5308, 7: 5361, 8: 5414, 9: 5468, 10: 5523,
11: 2000, 12: 2040, 13: 2081, 14: 2122, 15: 2165, 16: 2208, 17: 2252, 18: 2297, 19: 2343, 20: 2390, 21: 2438,
22: 2000, 23: 2060, 24: 2122, 25: 2185, 26: 2251, 27: 2319, 28: 2388, 29: 2460, 30: 2534, 31: 2610, 32: 2688},
'grade12': {0: 5200, 1: 4500, 2: 4545, 3: 4590, 4: 4636, 5: 4683, 6: 4730, 7: 4777, 8: 4825, 9: 4873, 10: 4922,
11: 2200, 12: 1600, 13: 1632, 14: 1665, 15: 1698, 16: 1732, 17: 1767, 18: 1802, 19: 1838, 20: 1875, 21: 1912,
22: 2200, 23: 1740, 24: 1792, 25: 1846, 26: 1901, 27: 1958, 28: 2017, 29: 2078, 30: 2140, 31: 2204, 32: 2270}})
Thanks for any assistance.
Creating a helper function for calculating grade11 helps make this a bit easier.
import pandas as pd
def expand_grade11(
grade11: int,
grade11_chg: float,
len_projection_years: int
) -> list:
"""
Calculate `grade11` values based on current
`grade11`, `grade11_chg`, and number of
`projection_years`.
"""
list_of_vals = []
while len(list_of_vals) < len_projection_years:
grade11 = int(grade11 * grade11_chg)
list_of_vals.append(grade11)
return list_of_vals
# initial info
df = pd.DataFrame({
"year": ['2020_21', '2020_21','2020_21'],
"county_fips": ['06019','06021','06023'] ,
"grade11": [5000,2000,2000],
"grade12": [5200,2200,2200],
"grade11_chg": [1.01,1.02,1.03],
"grade11_12_ratio": [0.9,0.8,0.87]
})
projection_years = ['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']
# converting to pd.MultiIndex
prods_index = pd.MultiIndex.from_product((df.county_fips.unique(), projection_years), names=["county_fips", "year"])
# setting index for future grouping/joining
df.set_index(["county_fips", "year"], inplace=True)
# calculate grade11
final = df.groupby([
"county_fips",
"year",
]).apply(lambda x: expand_grade11(x.grade11, x.grade11_chg, len(projection_years)))
final = final.explode()
final.index = prods_index
final = final.to_frame("grade11")
# concat with original df to get other columns
final = pd.concat([
df, final
])
final.sort_index(level=["county_fips", "year"], inplace=True)
final.grade11_12_ratio.ffill(inplace=True)
# calculate grade12
grade12 = final.groupby([
"county_fips"
]).apply(lambda x: x["grade11"] * x["grade11_12_ratio"])
grade12 = grade12.groupby("county_fips").shift(1)
grade12 = grade12.droplevel(0)
# put it all together
final.grade12.fillna(grade12, inplace=True)
final = final[["grade11", "grade12"]]
final = final.astype(int)
final.reset_index(inplace=True)
There are some bugs in the code; the version below seems to produce the result you expect (the final dataframe is currently not consistent with the initial one):
projection_years = ['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']
full_name = []
for i in df['county_fips'].unique():
print(i)
df_select = df[df['county_fips']==i]
grade11_change = df_select.iloc[0]['grade11_chg']
grade11_12_ratio = df_select.iloc[0]['grade11_12_ratio']
for year in projection_years:
#print(year)
lr = df_select.iloc[-1]
row = {}
row['year'] = year
row['county_fips'] = i
row['grade11'] = int(lr['grade11'] * grade11_change)
row['grade12'] = int(lr['grade11'] * grade11_12_ratio)
df_select = df_select.append([row])
full_name.append(df_select)
df_final = pd.concat(full_name)
df_final = df_final[['year','county_fips','grade11','grade12']].reset_index()
print('Finished processing')
fixes:
full_name initialized before the outer loop
do not redefine df_select in the inner loop
row was initialized twice inside the inner loop
full_name.append moved outside of the inner loop and after it
added reset_index() to df_final (mostly cosmetic)
(edit) grade change variables (grade11_change and grade11_12_ratio) are now computed from df_select's first row (and not from df)
the final result (print(df_final.to_markdown())) with the above code is:
| | index | year | county_fips | grade11 | grade12 |
|---|---|---|---|---|---|
| 0 | 0 | 2020_21 | 06019 | 5000 | 5200 |
| 1 | 0 | 2021_22 | 06019 | 5050 | 4500 |
| 2 | 0 | 2022_23 | 06019 | 5100 | 4545 |
| 3 | 0 | 2023_24 | 06019 | 5151 | 4590 |
| 4 | 0 | 2024_25 | 06019 | 5202 | 4635 |
| 5 | 0 | 2025_26 | 06019 | 5254 | 4681 |
| 6 | 0 | 2026_27 | 06019 | 5306 | 4728 |
| 7 | 0 | 2027_28 | 06019 | 5359 | 4775 |
| 8 | 0 | 2028_29 | 06019 | 5412 | 4823 |
| 9 | 0 | 2029_30 | 06019 | 5466 | 4870 |
| 10 | 0 | 2030_31 | 06019 | 5520 | 4919 |
| 11 | 1 | 2020_21 | 06021 | 2000 | 2200 |
| 12 | 0 | 2021_22 | 06021 | 2040 | 1600 |
| 13 | 0 | 2022_23 | 06021 | 2080 | 1632 |
| 14 | 0 | 2023_24 | 06021 | 2121 | 1664 |
| 15 | 0 | 2024_25 | 06021 | 2163 | 1696 |
| 16 | 0 | 2025_26 | 06021 | 2206 | 1730 |
| 17 | 0 | 2026_27 | 06021 | 2250 | 1764 |
| 18 | 0 | 2027_28 | 06021 | 2295 | 1800 |
| 19 | 0 | 2028_29 | 06021 | 2340 | 1836 |
| 20 | 0 | 2029_30 | 06021 | 2386 | 1872 |
| 21 | 0 | 2030_31 | 06021 | 2433 | 1908 |
| 22 | 2 | 2020_21 | 06023 | 2000 | 2200 |
| 23 | 0 | 2021_22 | 06023 | 2060 | 1740 |
| 24 | 0 | 2022_23 | 06023 | 2121 | 1792 |
| 25 | 0 | 2023_24 | 06023 | 2184 | 1845 |
| 26 | 0 | 2024_25 | 06023 | 2249 | 1900 |
| 27 | 0 | 2025_26 | 06023 | 2316 | 1956 |
| 28 | 0 | 2026_27 | 06023 | 2385 | 2014 |
| 29 | 0 | 2027_28 | 06023 | 2456 | 2074 |
| 30 | 0 | 2028_29 | 06023 | 2529 | 2136 |
| 31 | 0 | 2029_30 | 06023 | 2604 | 2200 |
| 32 | 0 | 2030_31 | 06023 | 2682 | 2265 |
note: edited to address the comments
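One caveat with both loop versions above: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. A minimal sketch of the same projection loop that collects plain dicts and builds the frame once at the end (same toy data as the question; projection_years shortened for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    "year": ["2020_21", "2020_21", "2020_21"],
    "county_fips": ["06019", "06021", "06023"],
    "grade11": [5000, 2000, 2000],
    "grade12": [5200, 2200, 2200],
    "grade11_chg": [1.01, 1.02, 1.03],
    "grade11_12_ratio": [0.9, 0.8, 0.87],
})
projection_years = ["2021_22", "2022_23"]  # shortened list

rows = []
for fips, grp in df.groupby("county_fips"):
    last = grp.iloc[-1].to_dict()
    chg, ratio = last["grade11_chg"], last["grade11_12_ratio"]
    # keep the original row for this county
    rows.append({k: last[k] for k in ("year", "county_fips", "grade11", "grade12")})
    for year in projection_years:
        new = {
            "year": year,
            "county_fips": fips,
            "grade11": int(last["grade11"] * chg),
            "grade12": int(last["grade11"] * ratio),
        }
        rows.append(new)
        last = {**last, **new}  # next projection builds on this row

df_final = pd.DataFrame(rows)
print(df_final)
```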

How can I map all the names with common 3-letter sets in a new column in pandas dataframe in Python?

I have a pandas dataframe df which looks as follows:
Unnamed: 0 Characters Length Characters Split A B C D Names with common 3-letters
0 FROKDUWJU 9 [FRO, KDU, WJU] FRO KDU WJU NaN
1 IDJWPZSUR 9 [IDJ, WPZ, SUR] IDJ WPZ SUR NaN
2 UCFURKIRODCQ 12 [UCF, URK, IRO, DCQ] UCF URK IRO DCQ
3 ORI 3 [ORI] ORI NaN NaN NaN
4 PROIRKIQARTIBPO 15 [PRO, IRK, IQA, RTI, BPO] PRO IRK IQA RTI
5 QAZWREDCQIBR 12 [QAZ, WRE, DCQ, IBR] QAZ WRE DCQ IBR
6 PLPRUFSWURKI 12 [PLP, RUF, SWU, RKI] PLP RUF SWU RKI
7 FROIEUSKIKIR 12 [FRO, IEU, SKI, KIR] FRO IEU SKI KIR
8 ORIUWJZSRFRO 12 [ORI, UWJ, ZSR, FRO] ORI UWJ ZSR FRO
9 URKIFJVUR 9 [URK, IFJ, VUR] URK IFJ VUR NaN
10 RUFOFR 6 [RUF, OFR] RUF OFR NaN NaN
11 IEU 3 [IEU] IEU NaN NaN NaN
12 PIMIEU 6 [PIM, IEU] PIM IEU NaN NaN
In the last column, Names with common 3-letters, I'd like to have a list of all the names from the first column which have a common 3-letter set in their names. For example, in the first row, I'd like to have a list of all the names which have FRO, KDU or WJU in their names. These 3-letter splits of the names can also be found in the "Characters Split" or A, B, C, and D columns for reference.
Stepwise: I need to check whether a 3-letter set present in a name in a given row is also present in any name in the rest of the rows. If it is, I need to add the corresponding name of the other row to the list in the "Names with common 3-letters" column. As an example, in the screenshot attached, in column C, the yellow highlighted cells have the names that have a common 3-letter set with the name in the same row.
What would be the appropriate way to accomplish this? Should I use a function or a loop-statement?
Note: df.to_dict() looks as follows:
{'Unnamed: 0': {0: 'FROKDUWJU',
1: 'IDJWPZSUR',
2: 'UCFURKIRODCQ',
3: 'ORI',
4: 'PROIRKIQARTIBPO',
5: 'QAZWREDCQIBR',
6: 'PLPRUFSWURKI',
7: 'FROIEUSKIKIR',
8: 'ORIUWJZSRFRO',
9: 'URKIFJVUR',
10: 'RUFOFR',
11: 'IEU',
12: 'PIMIEU'},
'Characters Length': {0: 9,
1: 9,
2: 12,
3: 3,
4: 15,
5: 12,
6: 12,
7: 12,
8: 12,
9: 9,
10: 6,
11: 3,
12: 6},
'Characters Split': {0: ['FRO', 'KDU', 'WJU'],
1: ['IDJ', 'WPZ', 'SUR'],
2: ['UCF', 'URK', 'IRO', 'DCQ'],
3: ['ORI'],
4: ['PRO', 'IRK', 'IQA', 'RTI', 'BPO'],
5: ['QAZ', 'WRE', 'DCQ', 'IBR'],
6: ['PLP', 'RUF', 'SWU', 'RKI'],
7: ['FRO', 'IEU', 'SKI', 'KIR'],
8: ['ORI', 'UWJ', 'ZSR', 'FRO'],
9: ['URK', 'IFJ', 'VUR'],
10: ['RUF', 'OFR'],
11: ['IEU'],
12: ['PIM', 'IEU']},
'A': {0: 'FRO',
1: 'IDJ',
2: 'UCF',
3: 'ORI',
4: 'PRO',
5: 'QAZ',
6: 'PLP',
7: 'FRO',
8: 'ORI',
9: 'URK',
10: 'RUF',
11: 'IEU',
12: 'PIM'},
'B': {0: 'KDU',
1: 'WPZ',
2: 'URK',
3: nan,
4: 'IRK',
5: 'WRE',
6: 'RUF',
7: 'IEU',
8: 'UWJ',
9: 'IFJ',
10: 'OFR',
11: nan,
12: 'IEU'},
'C': {0: 'WJU',
1: 'SUR',
2: 'IRO',
3: nan,
4: 'IQA',
5: 'DCQ',
6: 'SWU',
7: 'SKI',
8: 'ZSR',
9: 'VUR',
10: nan,
11: nan,
12: nan},
'D': {0: nan,
1: nan,
2: 'DCQ',
3: nan,
4: 'RTI',
5: 'IBR',
6: 'RKI',
7: 'KIR',
8: 'FRO',
9: nan,
10: nan,
11: nan,
12: nan},
'Names with common 3-letters': {0: '',
1: '',
2: '',
3: '',
4: '',
5: '',
6: '',
7: '',
8: '',
9: '',
10: '',
11: '',
12: ''}}
There may be a quicker way to search and create the lists, but this works:
# create a different, temporary column (you can't search the Characters Split column directly, as the 3-letter combinations aren't honored)
df['patrn'] = df.apply( lambda x: '|'.join(x['Characters Split']), axis=1)
def find_matches(x):
# print(x.name) # index number
new_df = df[~df.index.isin([x.name])] # all rows except current index
return set(new_df.loc[df['patrn'].str.contains(x['patrn'], case=False)]['Unnamed: 0'].tolist())
df['Names with common 3-letters'] = df.apply(lambda x: find_matches(x), axis=1)
df
Output
Unnamed: 0 Characters Length Characters Split A B C D Names with common 3-letters patrn
0 FROKDUWJU 9 [FRO, KDU, WJU] FRO KDU WJU NaN {FROIEUSKIKIR, ORIUWJZSRFRO} FRO|KDU|WJU
1 IDJWPZSUR 9 [IDJ, WPZ, SUR] IDJ WPZ SUR NaN {} IDJ|WPZ|SUR
2 UCFURKIRODCQ 12 [UCF, URK, IRO, DCQ] UCF URK IRO DCQ {URKIFJVUR, QAZWREDCQIBR} UCF|URK|IRO|DCQ
3 ORI 3 [ORI] ORI NaN NaN NaN {ORIUWJZSRFRO} ORI
4 PROIRKIQARTIBPO 15 [PRO, IRK, IQA, RTI, BPO] PRO IRK IQA RTI {} PRO|IRK|IQA|RTI|BPO
5 QAZWREDCQIBR 12 [QAZ, WRE, DCQ, IBR] QAZ WRE DCQ IBR {UCFURKIRODCQ} QAZ|WRE|DCQ|IBR
6 PLPRUFSWURKI 12 [PLP, RUF, SWU, RKI] PLP RUF SWU RKI {RUFOFR} PLP|RUF|SWU|RKI
7 FROIEUSKIKIR 12 [FRO, IEU, SKI, KIR] FRO IEU SKI KIR {PIMIEU, FROKDUWJU, ORIUWJZSRFRO, IEU} FRO|IEU|SKI|KIR
8 ORIUWJZSRFRO 12 [ORI, UWJ, ZSR, FRO] ORI UWJ ZSR FRO {FROKDUWJU, FROIEUSKIKIR, ORI} ORI|UWJ|ZSR|FRO
9 URKIFJVUR 9 [URK, IFJ, VUR] URK IFJ VUR NaN {UCFURKIRODCQ} URK|IFJ|VUR
10 RUFOFR 6 [RUF, OFR] RUF OFR NaN NaN {PLPRUFSWURKI} RUF|OFR
11 IEU 3 [IEU] IEU NaN NaN NaN {PIMIEU, FROIEUSKIKIR} IEU
12 PIMIEU 6 [PIM, IEU] PIM IEU NaN NaN {FROIEUSKIKIR, IEU} PIM|IEU
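A regex-free alternative: since the triplets are already split out, you can compare the triplet sets directly with set intersection instead of building a pattern string. A sketch on a reduced version of the same frame (same column names, fewer rows):

```python
import pandas as pd

df = pd.DataFrame({
    "Unnamed: 0": ["FROKDUWJU", "FROIEUSKIKIR", "PIMIEU", "IDJWPZSUR"],
    "Characters Split": [["FRO", "KDU", "WJU"],
                         ["FRO", "IEU", "SKI", "KIR"],
                         ["PIM", "IEU"],
                         ["IDJ", "WPZ", "SUR"]],
})

# two names match when their triplet sets intersect; skip the row itself
triplets = df["Characters Split"].apply(set)
df["Names with common 3-letters"] = [
    {name for j, name in enumerate(df["Unnamed: 0"])
     if j != i and triplets[i] & triplets[j]}
    for i in range(len(df))
]
print(df["Names with common 3-letters"].tolist())
```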

Create a ranking of data based on the dates and categories of another column

I have the following dataframe:
account_id contract_id type date activated
0 1 AAA Downgrade 2021-01-05
1 1 ADS Original 2020-12-12
2 1 ADGD Upgrade 2021-02-03
3 1 BB Winback 2021-05-08
4 1 CC Upgrade 2021-06-01
5 2 HHA Original 2021-03-05
6 2 HAKD Downgrade 2021-03-06
7 3 HADSA Original 2021-05-01
I want the following output:
account_id contract_id type date activated Renewal Order
0 1 ADS Original 2020-12-12 Original
1 1 AAA Downgrade 2021-01-05 1st
2 1 ADGD Upgrade 2021-02-03 2nd
3 1 BB Winback 2021-05-08 Original
4 1 CC Upgrade 2021-06-01 1st
5 2 HHA Original 2021-03-05 Original
6 2 HAKD Downgrade 2021-03-06 1st
7 3 HADSA Original 2021-05-01 Original
The column I want to create is "Renewal Order". Each account can have multiple contracts. The ordering is determined per account (account_id), by the type column (specifically the "Original" and "Winback" values), and by the order in which the contracts were activated (date activated). The first contract (tagged "Original" in the type column) is identified as "Original", and the succeeding contracts as "1st", "2nd", and so on. The order resets when a contract is tagged "Winback" in the type column: that contract is again identified as "Original" and its succeeding contracts as "1st", "2nd", and so on (refer to contract_id BB).
I tried the following code but it does not consider the condition on the "Winback":
def format_order(n):
if n == 0:
return 'Original'
suffix = ['th', 'st', 'nd', 'rd', 'th'][min(n % 10, 4)]
if 11 <= (n % 100) <= 13:
suffix = 'th'
return str(n) + suffix
df = df.sort_values(['account_id', 'date_activated']).reset_index(drop=True)
# apply
df['Renewal Order'] = df.groupby('account_id').cumcount().apply(format_order)
Here's the dictionary of the original dataframe:
{'account_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 3},
'contract_id': {0: 'AAA',
1: 'ADS',
2: 'ADGD',
3: 'BB',
4: 'CC',
5: 'HHA',
6: 'HAKD',
7: 'HADSA'},
'type': {0: 'Downgrade',
1: 'Original',
2: 'Upgrade',
3: 'Winback',
4: 'Upgrade',
5: 'Original',
6: 'Downgrade',
7: 'Original'},
'date activated': {0: Timestamp('2021-01-05 00:00:00'),
1: Timestamp('2020-12-12 00:00:00'),
2: Timestamp('2021-02-03 00:00:00'),
3: Timestamp('2021-05-08 00:00:00'),
4: Timestamp('2021-06-01 00:00:00'),
5: Timestamp('2021-03-05 00:00:00'),
6: Timestamp('2021-03-06 00:00:00'),
7: Timestamp('2021-05-01 00:00:00')}}
Here's the dictionary for the result:
{'account_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 3},
'contract_id': {0: 'ADS',
1: 'AAA',
2: 'ADGD',
3: 'BB',
4: 'CC',
5: 'HHA',
6: 'HAKD',
7: 'HADSA'},
'type': {0: 'Original',
1: 'Downgrade',
2: 'Upgrade',
3: 'Winback',
4: 'Upgrade',
5: 'Original',
6: 'Downgrade',
7: 'Original'},
'date activated': {0: Timestamp('2020-12-12 00:00:00'),
1: Timestamp('2021-01-05 00:00:00'),
2: Timestamp('2021-02-03 00:00:00'),
3: Timestamp('2021-05-08 00:00:00'),
4: Timestamp('2021-06-01 00:00:00'),
5: Timestamp('2021-03-05 00:00:00'),
6: Timestamp('2021-03-06 00:00:00'),
7: Timestamp('2021-05-01 00:00:00')},
'Renewal Order': {0: 'Original',
1: '1st',
2: '2nd',
3: 'Original',
4: '1st',
5: 'Original',
6: '1st',
7: 'Original'}}
Let us just change the cumcount result:
s = df.groupby('account_id').cumcount()
s[df.type=='Winback'] = 0
df['Renewal Order'] = s.apply(format_order)
Using @BENY's solution:
df = df.sort_values(['account_id', 'date activated']).reset_index(drop=True)
s = df.groupby(['account_id',
(df['type'] == 'Winback').cumsum()
]).cumcount()
df['Renewal Order'] = s.apply(format_order)
Output:
account_id contract_id type date activated Renewal Order
0 1 ADS Original 2020-12-12 Original
1 1 AAA Downgrade 2021-01-05 1st
2 1 ADGD Upgrade 2021-02-03 2nd
3 1 BB Winback 2021-05-08 Original
4 1 CC Upgrade 2021-06-01 1st
5 2 HHA Original 2021-03-05 Original
6 2 HAKD Downgrade 2021-03-06 1st
7 3 HADSA Original 2021-05-01 Original
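The reset logic is easy to check on a toy frame: the (type == 'Winback').cumsum() term bumps the group key at every winback row, so cumcount restarts at 0 there (this assumes the frame is already sorted by account and activation date):

```python
import pandas as pd

df = pd.DataFrame({
    "account_id": [1, 1, 1, 1, 1, 2, 2],
    "type": ["Original", "Downgrade", "Upgrade", "Winback", "Upgrade",
             "Original", "Downgrade"],
})

def format_order(n):
    if n == 0:
        return "Original"
    suffix = ["th", "st", "nd", "rd", "th"][min(n % 10, 4)]
    if 11 <= (n % 100) <= 13:
        suffix = "th"
    return str(n) + suffix

# each Winback row starts a fresh counting group within the account
s = df.groupby(["account_id", (df["type"] == "Winback").cumsum()]).cumcount()
df["Renewal Order"] = s.apply(format_order)
print(df["Renewal Order"].tolist())
```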
