String split using a delimiter on pandas column to create new columns - python

I have a dataframe with a column like this
Col1
1 A, 2 B, 3 C
2 B, 4 C
1 B, 2 C, 4 D
I have used the .str.split(',', expand=True), the result is like this
0 | 1 | 2
1 A | 2 B | 3 C
2 B | 4 C | None
1 B | 2 C | 4 D
What I am trying to achieve is this:
Col A| Col B| Col C| Col D
1 A | 2 B | 3 C | None
None | 2 B | 4 C | None
None | 1 B | 2 C | 4 D
I am stuck. How can I get the new columns arranged like this?

Let's try:
# split and explode
s = df['Col1'].str.split(', ').explode()
# create new multi-level index
s.index = pd.MultiIndex.from_arrays([s.index, s.str.split().str[-1].tolist()])
# unstack to reshape
out = s.unstack().add_prefix('Col ')
Details:
# split and explode
0 1 A
0 2 B
0 3 C
1 2 B
1 4 C
2 1 B
2 2 C
2 4 D
Name: Col1, dtype: object
# create new multi-level index
0 A 1 A
B 2 B
C 3 C
1 B 2 B
C 4 C
2 B 1 B
C 2 C
D 4 D
Name: Col1, dtype: object
# unstack to reshape
Col A Col B Col C Col D
0 1 A 2 B 3 C NaN
1 NaN 2 B 4 C NaN
2 NaN 1 B 2 C 4 D
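To try the three steps above end to end, here is a self-contained sketch. The sample frame is rebuilt from the question, and it assumes every item is separated by ', ' and ends in the letter that names its target column.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({'Col1': ['1 A, 2 B, 3 C', '2 B, 4 C', '1 B, 2 C, 4 D']})

# One item per row; the original row label is repeated for each item
s = df['Col1'].str.split(', ').explode()

# Second index level = trailing letter of each item (assumed to be the column key)
s.index = pd.MultiIndex.from_arrays([s.index, s.str.split().str[-1].tolist()])

# Pivot the letter level into columns; items a row does not have become NaN
out = s.unstack().add_prefix('Col ')
print(out)
```

Note that unstack needs each (row, letter) pair to be unique, so a row containing the same letter twice would raise an error.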

Most probably there are more general approaches you can use but this worked for me. Please note that this is based on a lot of assumptions and constraints of your particular example.
test_dict = {'col_1': ['1 A, 2 B, 3 C', '2 B, 4 C', '1 B, 2 C, 4 D']}
df = pd.DataFrame(test_dict)
First, we split the df into initial columns:
df2 = df.col_1.str.split(pat=',', expand=True)
Result:
0 1 2
0 1 A 2 B 3 C
1 2 B 4 C None
2 1 B 2 C 4 D
Next (first assumption), we need to ensure that we can later use ' ' as the delimiter to extract the column names. To do that, we strip all leading and trailing spaces from each string:
func = lambda x: pd.Series([i.strip() for i in x])
df2 = df2.astype(str).apply(func, axis=1)
Next, we need a list of unique columns. To get it, we first extract the column name from each cell:
func = lambda x: pd.Series([i.split(' ')[1] for i in x if i != 'None'])
df3 = df2.astype(str).apply(func, axis=1)
Result:
0 1 2
0 A B C
1 B C NaN
2 B C D
Then create a list of unique columns ['A', 'B', 'C', 'D'] that are present in your DataFrame:
columns_list = pd.unique(df3[df3.columns].values.ravel('K'))
columns_list = [x for x in columns_list if not pd.isna(x)]
And create an empty base dataframe with those columns which will be used to assign the corresponding values:
result_df = pd.DataFrame(columns=columns_list)
Once the preparations are done we can assign column values for each of the rows and use pd.concat to merge them back in to one DataFrame:
result_list = []
result_list.append(result_df)  # Adding the empty base table to ensure the columns are present
for row in df2.iterrows():
    result_object = {}  # dict that will be used to represent each row in source DataFrame
    for column in columns_list:
        for value in row[1]:  # row is returned in the format of tuple where first value is row_index that we don't need
            if value != 'None':
                if value.split(' ')[1] == column:  # Checking for a correct column to assign
                    result_object[column] = [value]
    result_list.append(pd.DataFrame(result_object))  # Adding dicts per row
Once the list of DataFrames is generated we can use pd.concat to put it together:
final_df = pd.concat(result_list, ignore_index=True) # ignore_index will rebuild the index for the final_df
And the result will be:
A B C D
0 1 A 2 B 3 C NaN
1 NaN 2 B 4 C NaN
2 NaN 1 B 2 C 4 D
I don't think this is the most elegant or efficient way to do it, but it will produce the results you need.
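The same per-row idea can be sketched more compactly by building one dict per row and letting the DataFrame constructor align the keys. This assumes, as above, that items are separated by ', ' and that the last space-separated token of each item names its column. The helper name row_to_dict is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'col_1': ['1 A, 2 B, 3 C', '2 B, 4 C', '1 B, 2 C, 4 D']})

def row_to_dict(cell):
    # Hypothetical helper: map the trailing letter of each item to the full item
    return {item.split()[-1]: item for item in cell.split(', ')}

# A list of dicts becomes one row per dict; keys missing from a row are filled with NaN
final_df = pd.DataFrame([row_to_dict(cell) for cell in df['col_1']])

# Sort the columns so the order does not depend on which letters appear first
final_df = final_df[sorted(final_df.columns)]
print(final_df)
```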

Related

Find duplicate rows in one column and print duplicated rows to a new dataframe table as a group using python pandas

My objective is to get the duplicated groups of column A and print/extract them into a new dataframe, ultimately to print each new dataframe into csv.
my current dataframe:
column A  column B
A         2
A         2
A         3
B         2
B         3
B         4
C         2
C         2
D         2
D         2
D         3
desired output:
column A  column B
A         2
A         2
A         3
column A  column B
B         2
B         3
column A  column B
C         2
C         2
column A  column B
D         2
D         2
D         3
You can loop over the unique values of column A and display the rows that match each value.
Code:
[df[df['ColA']==i] for i in set(df.ColA.values)]
Output:
[ ColA ColB
0 A 2
1 A 2
2 A 3,
ColA ColB
6 C 2
7 C 2,
ColA ColB
3 B 2
4 B 3
5 B 4,
ColA ColB
8 D 2
9 D 2
10 D 3]
g = df.groupby('column A')
dup_chk = df.loc[df['column A'].eq('A'), 'column B']
out = [g.get_group(x)[lambda x: x['column B'].isin(dup_chk)] for x in g.groups]
out (a list of dataframes):
[ column A column B
0 A 2
1 A 2
2 A 3,
column A column B
3 B 2
4 B 3,
column A column B
6 C 2
7 C 2,
column A column B
8 D 2
9 D 2
10 D 3]
Use the groupby function to group the repeated values of column A, then use a for loop to go through each group:
grouped_df = df.groupby('column A')
for group in grouped_df:
    print(group)
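Since the stated goal is to end up with one CSV per group, the loop can be extended as in the sketch below. The temporary output directory and the group_<key>.csv naming scheme are assumptions made for illustration, and the sample frame is rebuilt from the question.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    'column A': list('AAABBBCCDDD'),
    'column B': [2, 2, 3, 2, 3, 4, 2, 2, 2, 2, 3],
})

out_dir = Path(tempfile.mkdtemp())  # assumed output location, replace with your own
written = {}

# Iterating a groupby yields (key, sub-frame) pairs
for key, group in df.groupby('column A'):
    path = out_dir / f'group_{key}.csv'  # hypothetical naming scheme
    group.to_csv(path, index=False)
    written[key] = path

print(written)
```

This writes every row of each group. Any extra filtering (for example, dropping the B/4 row as in the desired output) would have to be applied to group before calling to_csv.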

perform df.loc to groupby df

I have a df consisting of person, origin and destination:
df = pd.DataFrame({'PersonID':['1','1','2','2','2','3'],'O':['A','B','C','B','A','X'],'D':['B','A','B','A','B','Y']})
the df:
PersonID O D
1 A B
1 B A
2 C B
2 B A
2 A B
3 X Y
I grouped the df with df_grouped = df.groupby(['O','D']) to match it with another dataframe, taxi.
TaxiID O D
T1 B A
T2 A B
T3 C B
Similarly, I grouped the taxi frame by O and D. Then I merged the two after aggregating and counting the PersonID and TaxiID per O-D pair, to see how many taxis are available for how many people.
O D PersonID TaxiID
count count
A B 2 1
B A 2 1
C B 1 1
Now, I want to use df.loc to keep only those PersonIDs that were counted in the merged frame. How can I do this? I've tried:
seek = df.loc[df.PersonID.isin(merged['PersonID'])]
but it returns an empty dataframe. How can I fix this?
Edit: here is the complete code for this case using dummy data:
df = pd.DataFrame({'PersonID':['1','1','2','2','2','3'],'O':['A','B','C','B','A','X'],'D':['B','A','B','A','B','Y']})
taxi = pd.DataFrame({'TaxiID':['T1','T2','T3'],'O':['B','A','C'],'D':['A','B','B']})
df_grouped = df.groupby(['O','D'])
taxi_grouped = taxi.groupby(['O','D'])
dfm = df_grouped.agg({'PersonID':['count',list]}).reset_index()
tgm = taxi_grouped.agg({'TaxiID':['count',list]}).reset_index()
merged = pd.merge(dfm, tgm, how='inner')
seek = df.loc[df.PersonID.isin(merged['PersonID'])]
Select the MultiIndex column by its tuple, then use Series.explode to turn the nested lists into scalars:
seek = df.loc[df.PersonID.isin(merged[('PersonID', 'list')].explode().unique())]
print (seek)
PersonID O D
0 1 A B
1 1 B A
2 2 C B
3 2 B A
4 2 A B
For better performance, it is possible to flatten the nested lists into a set instead:
seek = df.loc[df.PersonID.isin(set(z for x in merged[('PersonID', 'list')] for z in x))]
print (seek)
PersonID O D
0 1 A B
1 1 B A
2 2 C B
3 2 B A
4 2 A B
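As a variation, the MultiIndex columns can be avoided altogether with named aggregation, so the list column is selected by a plain name. This is only a sketch: the column names person_count, person_ids, taxi_count and taxi_ids are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'PersonID': ['1', '1', '2', '2', '2', '3'],
                   'O': ['A', 'B', 'C', 'B', 'A', 'X'],
                   'D': ['B', 'A', 'B', 'A', 'B', 'Y']})
taxi = pd.DataFrame({'TaxiID': ['T1', 'T2', 'T3'],
                     'O': ['B', 'A', 'C'],
                     'D': ['A', 'B', 'B']})

# Named aggregation gives flat column names (all four names are hypothetical)
dfm = df.groupby(['O', 'D'], as_index=False).agg(
    person_count=('PersonID', 'count'),
    person_ids=('PersonID', list),
)
tgm = taxi.groupby(['O', 'D'], as_index=False).agg(
    taxi_count=('TaxiID', 'count'),
    taxi_ids=('TaxiID', list),
)

# Keep only the O-D pairs present in both frames
merged = dfm.merge(tgm, on=['O', 'D'], how='inner')

# Flatten the per-pair lists into the unique PersonIDs that were matched
matched_ids = merged['person_ids'].explode().unique()
seek = df.loc[df['PersonID'].isin(matched_ids)]
print(seek)
```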

Matching two columns from Pandas Dataframe but the order matters

I have two DataFrames
df_1:
idx A X
0 1 A
1 2 B
2 3 C
3 4 D
4 1 E
5 2 F
and
df_2:
idx B Y
0 1 H
1 2 I
2 4 J
3 2 K
4 3 L
5 1 M
my goal is get the following:
df_result:
idx A X B Y
0 1 A 1 H
1 2 B 2 I
2 4 D 4 J
3 2 F 2 K
I am trying to match columns A and B, based on column B from df_2.
Columns A and B repeat their content after getting to 4. The order matters here and because of that the row from df_1 with idx = 4 does not match the one from df_2 with idx = 5.
I was trying to use:
matching = list(set(df_1["A"]) & set(df_2["B"]))
and then
df1_filt = df_1[df_1['A'].isin(matching)]
df2_filt = df_2[df_2['B'].isin(matching)]
But this does not take the order into consideration.
I am looking for a solution without many for loops.
Edit:
df_result = pd.merge_asof(left=df_1, right=df_2, left_on='idx', right_on='idx', left_by='A', right_by='B', direction='backward', tolerance=2).dropna().drop(labels='idx', axis='columns').reset_index(drop=True)
Gets me what I want.
IIUC this should work:
df_result = df_1.merge(df_2,
                       left_on=['idx', 'A'], right_on=['idx', 'B'])
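For reference, the merge_asof call from the question's edit can be laid out as a self-contained sketch. It assumes idx is an ordinary, sorted integer column in both frames (rebuilt here from the question) and uses on='idx' in place of the identical left_on/right_on pair.

```python
import pandas as pd

# Sample data rebuilt from the question, with idx as a regular column
df_1 = pd.DataFrame({'idx': range(6),
                     'A': [1, 2, 3, 4, 1, 2],
                     'X': list('ABCDEF')})
df_2 = pd.DataFrame({'idx': range(6),
                     'B': [1, 2, 4, 2, 3, 1],
                     'Y': list('HIJKLM')})

# For each df_1 row, take the latest df_2 row with B == A whose idx is
# at most 2 positions earlier (direction='backward', tolerance=2)
matched = pd.merge_asof(df_1, df_2, on='idx',
                        left_by='A', right_by='B',
                        direction='backward', tolerance=2)

# Rows without a partner have NaN in B and Y; drop them and tidy up
df_result = (matched.dropna()
                    .drop(columns='idx')
                    .reset_index(drop=True))
print(df_result)
```

Because unmatched rows introduce NaN before they are dropped, column B comes back as float. Append .astype({'B': int}) if integer values are needed.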

How can I remove a certain type of values in a group in pandas?

I have the following dataframe which is a small part of a bigger one:
acc_num trans_cdi
0 1 c
1 1 d
3 3 d
4 3 c
5 3 d
6 3 d
I'd like to delete all rows where the last items are "d". So my desired dataframe would look like this:
acc_num trans_cdi
0 1 c
3 3 d
4 3 c
So the point is, that a group shouldn't have "d" as the last item.
There is a code that deletes the last row in the groups where the last item is "d". But in this case, I have to run the code twice to delete all last "d"-s in group 3 for example.
clean_3 = clean_2[clean_2.groupby('account_num')['trans_cdi'].transform(lambda x: (x.iloc[-1] != "d") | (x.index != x.index[-1]))]
Is there a better solution to this problem?
We can use idxmax on the reversed data ([::-1]) to find the last non-"d" row in each group, then keep everything up to that index:
grps = df['trans_cdi'].ne('d').groupby(df['acc_num'], group_keys=False)
idx = grps.apply(lambda x: x.loc[:x[::-1].idxmax()]).index
df.loc[idx]
acc_num trans_cdi
0 1 c
3 3 d
4 3 c
Testing on consecutive value
acc_num trans_cdi
0 1 c
1 1 d <--- d between two c, so we need to keep
2 1 c
3 1 d <--- row to be dropped
4 3 d
5 3 c
6 3 d
7 3 d
grps = df['trans_cdi'].ne('d').groupby(df['acc_num'], group_keys=False)
idx = grps.apply(lambda x: x.loc[:x[::-1].idxmax()]).index
df.loc[idx]
acc_num trans_cdi
0 1 c
1 1 d
2 1 c
4 3 d
5 3 c
Still gives correct result.
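The same "scan from the end" idea can also be written as a boolean mask with a reversed cumulative maximum per group, which avoids apply. This is a sketch that assumes the index is unique; the sample frame is the test data above.

```python
import pandas as pd

df = pd.DataFrame({'acc_num': [1, 1, 1, 1, 3, 3, 3, 3],
                   'trans_cdi': ['c', 'd', 'c', 'd', 'd', 'c', 'd', 'd']})

# 1 where the row is not 'd', 0 where it is
not_d = df['trans_cdi'].ne('d').astype(int)

# Walk each group from its last row: the running max turns to 1 at the
# last non-'d' row and stays 1 for every earlier row of that group
seen_from_end = (not_d.iloc[::-1]
                      .groupby(df['acc_num'].iloc[::-1])
                      .cummax())

# Restore the original row order and use the result as a mask
keep = seen_from_end.reindex(df.index).astype(bool)
out = df[keep]
print(out)
```

A group that consists only of "d" rows is dropped entirely by this mask.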
You can try this not so pandorable solution.
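def r(x):
    c = 0
    # count the trailing 'd' values, walking the group from its last row
    for v in x['trans_cdi'].iloc[::-1]:
        if v == 'd':
            c = c + 1
        else:
            break
    # slice by length so that c == 0 keeps the whole group (x.iloc[:-0] would be empty)
    return x.iloc[:len(x) - c]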
df.groupby('acc_num', group_keys=False).apply(r)
acc_num trans_cdi
0 1 c
3 3 d
4 3 c
First, use shift to compare each row with the previous one and flag the rows where both values are 'd'; ~ negates the mask so that those rows are filtered out.
Second, make sure the last remaining value is not 'd'. If it is, delete that row.
Code:
df = df[~((df['trans_cdi'] == 'd') & (df.shift(1)['trans_cdi'] == 'd'))]
if df['trans_cdi'].iloc[-1] == 'd': df = df.iloc[0:-1]
df
Input (I tested it on more input data to check for bugs):
acc_num trans_cdi
0 1 c
1 1 d
3 3 d
4 3 c
5 3 d
6 3 d
7 1 d
8 1 d
9 3 c
10 3 c
11 3 d
12 3 d
Output:
acc_num trans_cdi
0 1 c
1 1 d
4 3 c
5 3 d
9 3 c
10 3 c

Assigning incremental values based on an unique value of a column

I am trying to add an incremental value to a column based on specific values of another column in a dataframe. So that...
col A col B
A 0
B 1
C 2
A 3
A 4
B 5
Would become something like this:
col A col B
A 1
B 2
C 3
A 1
A 1
B 2
C 3
I have tried using the groupby function but can't really get my head around setting incremental values on column B.
Any thoughts?
Thanks
I think you need factorize:
df['col B'] = pd.factorize(df['col A'])[0] + 1
print (df)
col A col B
0 A 1
1 B 2
2 C 3
3 A 1
4 A 1
5 B 2
Another solution:
df['col B'] = pd.Categorical(df['col A']).codes + 1
print (df)
col A col B
0 A 1
1 B 2
2 C 3
3 A 1
4 A 1
5 B 2
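The two solutions agree on this data only because the values first appear in alphabetical order. By default, pd.factorize numbers values by order of first appearance, while pd.Categorical sorts its categories first. A small sketch with made-up data, where 'B' comes first, shows the difference:

```python
import pandas as pd

# Made-up data in which the first value is not the smallest one
df = pd.DataFrame({'col A': ['B', 'A', 'C', 'B']})

# Numbered by order of first appearance: B=1, A=2, C=3
by_appearance = pd.factorize(df['col A'])[0] + 1

# Numbered by sorted category order: A=1, B=2, C=3
by_sorted = pd.Categorical(df['col A']).codes + 1

# groupby(...).ngroup() offers both behaviours through the sort flag
ngroup_sorted = df.groupby('col A').ngroup() + 1
ngroup_appearance = df.groupby('col A', sort=False).ngroup() + 1

print(by_appearance)  # [1 2 3 1]
print(by_sorted)      # [2 1 3 2]
```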
