Append column value if string is contained in another string - python

I want to add a new column a3 to my dataframe df: if a string in column "b" contains a string from column "b2" of dataframe df2, the new column a3 should take the corresponding value from a2 of df2.
first dataframe df:
d = {'a': [100, 300], 'b': ["abc", "dfg"]}
df = pd.DataFrame(data=d, index=[1, 2])
print(df)
     a    b
1  100  abc
2  300  dfg
second dataframe df2:
d2 = {'a2': ["L1", "L2", "L3"], 'b2': ["bc", "op", "fg"]}
df2 = pd.DataFrame(data=d2, index=[1, 2, 3])
print(df2)
   a2  b2
1  L1  bc
2  L2  op
3  L3  fg
The output should look like this:
print(df)
     a    b  a3
1  100  abc  L1
2  300  dfg  L3
I tried a nested for loop, which did not work.
for i in df.b:
    for ii in df2.b2:
        for iii in df2.a3:
            if ii in i:
                df["a3"] = iii

You need to test all combinations. You could still take advantage of pandas' vectorized str.contains:
common = (pd.DataFrame({x: df['b'].str.contains(x) for x in df2['b2']})
            .replace({False: pd.NA})
            .stack()
            .reset_index(level=1, name='b2')['level_1'].rename('b2')
          )
# 1 bc
# 2 fg
# Name: b2, dtype: object
df.join(common).merge(df2, on='b2')
output:
     a    b  b2  a2
0  100  abc  bc  L1
1  300  dfg  fg  L3

You can half fix your logic as follows:
for i in df.b:
    for ii, iii in zip(df2.b2, df2.a2):
        if ii in i:
            df["a3"] = iii
However, the final line df["a3"] = iii assigns iii to every row, so you just get the last value of iii from the loop for every row:
     a    b  a3
1  100  abc  L3
2  300  dfg  L3
You will get many 'correct' options, but one that is closest to your attempt is perhaps:
new_column = [None] * len(df)  # create list of Nones same 'height' as df
for i, b in enumerate(df.b):
    for a2, b2 in zip(df2.a2, df2.b2):
        if b2 in b:
            new_column[i] = a2
            break  # this moves us on to the next 'row' in df
df["a3"] = new_column
A difference from your attempt is that this builds the 'new_column' separately and then adds it to your dataframe after. In the case where there is no match you will be left with None. In the case of multiple matches, you will get the first (top) match. You could remove the break line to instead get the last (bottom) match.

Among a lot of approaches, you can use a list comprehension:
df["a2"] = [df2.iloc[i]["a2"] for y in df.b for i,x in enumerate(df2.b2) if x in y]
df
Output:
     a    b  a2
1  100  abc  L1
2  300  dfg  L3
Also note that it shouldn't be d2 = {'a2': [10, 30, 25], 'b2': ["bc", "op", "fg"]}; rather it should be d2 = {'a2': ["L1", "L2", "L3"], 'b2': ["bc", "op", "fg"]}.
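Note that the list comprehension above only lines up with the frame when every row of df has exactly one match; with zero or multiple matches the resulting list no longer has len(df) elements and the assignment fails. A safer variant in the same spirit (my sketch, using the question's data) takes the first match per row via next() with a None default:

```python
import pandas as pd

df = pd.DataFrame({'a': [100, 300], 'b': ["abc", "dfg"]}, index=[1, 2])
df2 = pd.DataFrame({'a2': ["L1", "L2", "L3"], 'b2': ["bc", "op", "fg"]}, index=[1, 2, 3])

# for each b, take the first a2 whose b2 is a substring; None if nothing matches
df["a3"] = [
    next((a2 for a2, b2 in zip(df2.a2, df2.b2) if b2 in b), None)
    for b in df.b
]
print(df)
#      a    b  a3
# 1  100  abc  L1
# 2  300  dfg  L3
```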

Related

How to compare a value of a single column over multiple columns in the same row using pandas?

I have a dataframe that looks like this:
np.random.seed(21)
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B1', 'B2', 'B3'])
df['current_State'] = [df['B1'][0], df['B1'][1], df['B2'][2], df['B2'][3], df['B3'][4], df['B3'][5], df['B1'][6], df['B2'][7]]
df
I need to create a new column that contains the name of the column where the value of 'current_State' is the same, this is the desired output:
I tried many combinations of apply and lambda functions but without success. Any help is very welcome!
You can compare the current_State column with all the remaining columns to create a boolean mask, then use idxmax along axis=1 on this mask to get the name of the column where the value in the given row equals the corresponding value in current_State:
c = 'current_State'
df['new_column'] = df.drop(columns=c).eq(df[c], axis=0).idxmax(axis=1)
In case if there is a possibility that there are no matching values we can instead use:
c = 'current_State'
m = df.drop(columns=c).eq(df[c], axis=0)
df['new_column'] = m.idxmax(axis=1).mask(~m.any(axis=1))
>>> df
A B1 B2 B3 current_State new_column
0 -0.051964 -0.111196 1.041797 -1.256739 -0.111196 B1
1 0.745388 -1.711054 -0.205864 -0.234571 -1.711054 B1
2 1.128144 -0.012626 -0.613200 1.373688 -0.613200 B2
3 1.610992 -0.689228 0.691924 -0.448116 0.691924 B2
4 0.162342 0.257229 -1.275456 0.064004 0.064004 B3
5 -1.061857 -0.989368 -0.457723 -1.984182 -1.984182 B3
6 -1.476442 0.231803 0.644159 0.852123 0.231803 B1
7 -0.464019 0.697177 1.567882 1.178556 1.567882 B2
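To see why the mask matters, here is a tiny sketch (my own toy frame, not the question's data) where one row has no matching column:

```python
import pandas as pd

# toy frame: row 0's current_State matches column B, row 1 matches nothing
df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0], 'current_State': [3.0, 9.9]})

m = df.drop(columns='current_State').eq(df['current_State'], axis=0)
# idxmax alone would mislabel row 1 as 'A' (the first all-False column);
# masking rows where no column matches leaves NaN instead
df['new_column'] = m.idxmax(axis=1).mask(~m.any(axis=1))
print(df)
```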

enumerate equal elements within dataframe column

I would like to enumerate elements in a column which appear more than once. Elements that appear only once should not be modified.
I have come up with two solutions, but they seem to be very inelegant, and I am hoping that there is a better solution.
Input:
X
0 A
1 B
2 C
3 A
4 C
5 C
6 D
Output:
new_name
X
A A1
A A2
B B
C C1
C C2
C C3
D D
Here are two possible ways of achieving this, one using .expanding().count(), the other using .cumcount(), but both are pretty ugly:
import pandas as pd

def solution_1(df):
    pvt = (df.groupby(by='X')
             .expanding()
             .count()
             .rename(columns={'X': 'Counter'})
             .reset_index()
             .drop('level_1', axis=1)
             .assign(name=lambda s: s['X'] + s['Counter'].astype(int).astype(str))
             .set_index('X')
           )
    pvt2 = (df.reset_index()
              .groupby(by='X')
              .count()
              .rename(columns={'index': 'C'})
            )
    df2 = pd.merge(left=pvt, right=pvt2, left_index=True, right_index=True)
    ind = df2['C'] > 1
    df2.loc[ind, 'new_name'] = df2.loc[ind, 'name']
    df2.loc[~ind, 'new_name'] = df2.loc[~ind].index
    df2 = df2.drop(['Counter', 'C', 'name'], axis=1)
    return df2

def solution_2(df):
    pvt = pd.DataFrame(df.groupby(by='X')
                         .agg({'X': 'cumcount'})
                       ).rename(columns={'X': 'Counter'})
    pvt2 = pd.DataFrame(df.groupby(by='X')
                          .agg({'X': 'count'})
                        ).rename(columns={'X': 'Total Count'})
    # print(pvt2)
    df2 = df.merge(pvt, left_index=True, right_index=True)
    df3 = df2.merge(pvt2, left_on='X', right_index=True)
    ind = df3['Total Count'] > 1
    df3['Counter'] = df3['Counter'] + 1
    df3.loc[ind, 'new_name'] = df3.loc[ind, 'X'] + df3.loc[ind, 'Counter'].astype(int).astype(str)
    df3.loc[~ind, 'new_name'] = df3.loc[~ind, 'X']
    df3 = df3.drop(['Counter', 'Total Count'], axis=1).set_index('X')
    return df3

if __name__ == '__main__':
    s = ['A', 'B', 'C', 'A', 'C', 'C', 'D']
    df = pd.DataFrame(s, columns=['X'])
    print(df)
    sol_1 = solution_1(df)
    print(sol_1)
    sol_2 = solution_2(df)
    print(sol_2)
Any suggestions? Thanks a lot.
First we use GroupBy.cumcount to get a cumulative count for each unique value in X.
Then we add 1 and convert the numeric values to string with Series.astype.
Finally we concatenate the values to our original column with Series.str.cat:
df['new_name'] = df['X'].str.cat(df.groupby('X').cumcount().add(1).astype(str))
X new_name
0 A A1
1 A A2
2 B B1
3 C C1
4 C C2
5 C C3
6 D D1
If you actually don't want a number on the values which only appear once, we can use:
df['new_name'] = np.where(df.groupby('X')['X'].transform('size').eq(1),
                          df['new_name'].str.replace(r'\d+', '', regex=True),
                          df['new_name'])
X new_name
0 A A1
1 A A2
2 B B
3 C C1
4 C C2
5 C C3
6 D D
All in one line:
df['new_name'] = np.where(df.groupby('X')['X'].transform('size').ne(1),
                          df['X'].str.cat(df.groupby('X').cumcount().add(1).astype(str)),
                          df['X'])
IIUC
df.X+(df.groupby('X').cumcount()+1).mask(df.groupby('X').X.transform('count').eq(1),'').astype(str)
Out[18]:
0 A1
1 B
2 C1
3 A2
4 C2
5 C3
6 D
dtype: object
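For readability, the one-liner above can be unpacked into named steps (a sketch; the intermediate names are mine):

```python
import pandas as pd

df = pd.DataFrame({'X': ['A', 'B', 'C', 'A', 'C', 'C', 'D']})

counter = df.groupby('X').cumcount() + 1                 # 1-based position within each group
group_size = df.groupby('X')['X'].transform('count')     # how often each value occurs overall
suffix = counter.mask(group_size.eq(1), '').astype(str)  # blank out the suffix for singletons
df['new_name'] = df['X'] + suffix
print(df['new_name'].tolist())  # ['A1', 'B', 'C1', 'A2', 'C2', 'C3', 'D']
```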

Python iterate each sub group of rows and apply function

I need to combine all iterations of subgroups to apply a function to and return a single value output along with concatenated string items identifying which iterations were looped.
I understand how to use pd.groupby and can set level=0 or level=1 and then call agg{'LOOPED_AVG':'mean'}. However, I need to group (or subset) rows by subgroup and then combine all rows from an iteration and then apply the function to it.
Input data table:
MAIN_GROUP SUB_GROUP CONCAT_GRP_NAME X_1
A 1 A1 9
A 1 A1 6
A 1 A1 3
A 2 A2 7
A 3 A3 9
B 1 B1 7
B 1 B1 3
B 2 B2 7
B 2 B2 8
C 1 C1 9
Desired result:
LOOP_ITEMS LOOPED_AVG
A1 B1 C1 6.166666667
A1 B2 C1 7
A2 B1 C1 6.5
A2 B2 C1 7.75
A3 B1 C1 7
A3 B2 C1 8.25
Assuming that you have three main groups, you can apply the following; for more groups, adjust the script accordingly. I wanted to give you a way to solve the problem; this may not be the most efficient way, but it gives a starting point.
import pandas as pd
import numpy as np
ls = [
    ['A', 1, 'A1', 9],
    ['A', 1, 'A1', 6],
    ['A', 1, 'A1', 3],
    ['A', 2, 'A2', 7],
    ['A', 3, 'A3', 9],
    ['B', 1, 'B1', 7],
    ['B', 1, 'B1', 3],
    ['B', 2, 'B2', 7],
    ['B', 2, 'B2', 8],
    ['C', 1, 'C1', 9],
]
# convert to dataframe
df = pd.DataFrame(ls, columns=["Main_Group", "Sub_Group", "Concat_GRP_Name", "X_1"])
# get count and sum of concatenated groups
df_sum = df.groupby('Concat_GRP_Name')['X_1'].agg(['sum', 'count']).reset_index()
# use the permutations formula to calculate the different permutation combos
import itertools as it
perms = it.permutations(df_sum.Concat_GRP_Name)

def compute_combinations(df, colname, main_group_series):
    l = []
    import itertools as it
    perms = it.permutations(df[colname])
    # provides a sorted list of unique values in the Series
    unique_groups = np.unique(main_group_series)
    for perm_pairs in perms:
        # take in only the first three items of each permutation and make sure
        # the first starts with A, the second with B, and the third with C
        if all([main_group in perm_pairs[ind] for ind, main_group in enumerate(unique_groups)]):
            l.append([perm_pairs[ind] for ind in range(unique_groups.shape[0])])
    return l

t = compute_combinations(df_sum, 'Concat_GRP_Name', df['Main_Group'])
# convert to dataframe and drop duplicate pairs
df2 = pd.DataFrame(t, columns=["Item1", "Item2", "Item3"]).drop_duplicates()
# join the sums and counts for each Concat_GRP_Name onto df2;
# since there are three item columns, we must apply this three times
merged = df2.merge(df_sum[['sum', 'count', 'Concat_GRP_Name']], left_on=['Item1'], right_on=['Concat_GRP_Name'], how='inner')\
            .drop(['Concat_GRP_Name'], axis=1)\
            .rename({'sum': 'item1_sum'}, axis=1)\
            .rename({'count': 'item1_count'}, axis=1)
merged2 = merged.merge(df_sum[['sum', 'count', 'Concat_GRP_Name']], left_on=['Item2'], right_on=['Concat_GRP_Name'], how='inner')\
            .drop(['Concat_GRP_Name'], axis=1)\
            .rename({'sum': 'item2_sum'}, axis=1)\
            .rename({'count': 'item2_count'}, axis=1)
merged3 = merged2.merge(df_sum[['sum', 'count', 'Concat_GRP_Name']], left_on=['Item3'], right_on=['Concat_GRP_Name'], how='inner')\
            .drop(['Concat_GRP_Name'], axis=1)\
            .rename({'sum': 'item3_sum'}, axis=1)\
            .rename({'count': 'item3_count'}, axis=1)
# get the sum of all of the item_sum cols
merged3['sums'] = merged3[['item3_sum', 'item2_sum', 'item1_sum']].sum(axis=1)
# get the sum of all the item_count cols
merged3['counts'] = merged3[['item3_count', 'item2_count', 'item1_count']].sum(axis=1)
# find the average
merged3['LOOPED_AVG'] = merged3['sums'] / merged3['counts']
# remove irrelevant fields
merged3 = merged3.drop(['item3_count', 'item2_count', 'item1_count', 'item3_sum', 'item2_sum', 'item1_sum', 'counts', 'sums'], axis=1)
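For comparison, a shorter route to the same table (my sketch, not the answer's approach) is to take the cartesian product of the per-main-group labels with itertools.product and average the pooled rows directly:

```python
import itertools
import pandas as pd

df = pd.DataFrame([
    ['A', 1, 'A1', 9], ['A', 1, 'A1', 6], ['A', 1, 'A1', 3],
    ['A', 2, 'A2', 7], ['A', 3, 'A3', 9],
    ['B', 1, 'B1', 7], ['B', 1, 'B1', 3],
    ['B', 2, 'B2', 7], ['B', 2, 'B2', 8],
    ['C', 1, 'C1', 9],
], columns=['MAIN_GROUP', 'SUB_GROUP', 'CONCAT_GRP_NAME', 'X_1'])

# one CONCAT_GRP_NAME chosen per MAIN_GROUP -> cartesian product of choices
choices = df.groupby('MAIN_GROUP')['CONCAT_GRP_NAME'].unique()
rows = []
for combo in itertools.product(*choices):
    pooled = df.loc[df['CONCAT_GRP_NAME'].isin(combo), 'X_1']
    rows.append({'LOOP_ITEMS': ' '.join(combo), 'LOOPED_AVG': pooled.mean()})
result = pd.DataFrame(rows)
print(result)
```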

Checking data match and mismatch between two columns using python pandas

Sample data is shown below (the original post provided it as an image).
Input for File A and File B is given, and the output format is also given. Can someone help me with this?
I'd also be curious to see a clever/pythonic solution to this. My "ugly" solution iterating over index is as follows:
dfa, dfb are the two dataframes, columns named as in example.
dfa = pd.DataFrame({'c1':['v','f','h','m','s','d'],'c2':['100','110','235','999','333','39'],'c3':['tech','jjj',None,'iii','mnp','lf'],'c4':['hhh','scb','kkk','lop','sos','kdk']})
dfb = pd.DataFrame({'c1':['v','h','m','f','L','s'],'c2':['100','235','999','110','777','333'],'c3':['tech',None,'iii','jkl','9kdf','mnp1'],'c4':['hhh','mckkk','lok','scb','ooo','sos1']})
Now let's create lists of indexes to identify the rows that don't match between dfa and dfb
dfa, dfb = dfa.set_index(['c1','c2']), dfb.set_index(['c1','c2'])
mismatch3, mismatch4 = [], []
for i in dfa.index:
    if i in dfb.index:
        if dfa.loc[i,'c3'] != dfb.loc[i,'c3']:
            mismatch3.append(i)
        if dfa.loc[i,'c4'] != dfb.loc[i,'c4']:
            mismatch4.append(i)
mismatch = list(set(mismatch3 + mismatch4))
Now that this is done, we want to rename dfb, perform the join operation on the mismatched indexes, and add the "status" columns based on mismatch3 and mismatch4.
dfb = dfb.rename(index=str, columns={'c3':'b_c3','c4':'b_c4'})
df = dfa.loc[mismatch].join(dfb)
df['c3_status'] = 'match'
df['c4_status'] = 'match'
df.loc[mismatch3, 'c3_status'] = 'mismatch'
df.loc[mismatch4, 'c4_status'] = 'mismatch'
Finally, let's get those columns in the right order :)
result = df[['c3','b_c3','c3_status','c4','b_c4','c4_status']]
Once again, I'd love to see a prettier solution. I hope this helps!
Here are four lines of code that may do what you are looking for:
columns_to_compare =['c2','c3']
dfa['Combo'] = dfa[columns_to_compare].apply(lambda x: ', '.join(x[x.notnull()]), axis = 1)
dfb['Combo1'] = dfb[columns_to_compare].apply(lambda x: ', '.join(x[x.notnull()]), axis = 1)
[i for i,x in enumerate(dfb['Combo1'].tolist()) if x not in dfa['Combo'].tolist()]
explanation
Assume that you want to see what dfb rows are not in dfa, for columns c2 and c3.
To do this, consider the following approach:
Create a column "Combo" in dfa where each row of "Combo" contains a comma separated string, representing the values of the chosen columns to compare (for the row concerned)
dfa['Combo'] = dfa[columns_to_compare].apply(lambda x: ', '.join(x[x.notnull()]), axis = 1)
c1 c2 c3 c4 Combo
0 v 100 tech hhh 100, tech
1 f 110 jjj scb 110, jjj
2 h 235 None kkk 235
3 m 999 iii lop 999, iii
4 s 333 mnp sos 333, mnp
5 d 39 lf kdk 39, lf
Apply the same logic to dfb
c1 c2 c3 c4 Combo1
0 v 100 tech hhh 100, tech
1 h 235 None mckkk 235
2 m 999 iii lok 999, iii
3 f 110 jkl scb 110, jkl
4 L 777 9kdf ooo 777, 9kdf
5 s 333 mnp1 sos1 333, mnp1
Create a list containing the required indices from dfb:
[i for i,x in enumerate(dfb['Combo1'].tolist()) if x not in dfa['Combo'].tolist()]
or to show the actual row values (not indices):
[[x] for i,x in enumerate(dfb['Combo1'].tolist()) if x not in dfa['Combo'].tolist()]
Row Index Result
[3, 4, 5]
Row Value Result
[['110, jkl'], ['777, 9kdf'], ['333, mnp1']]
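As an alternative sketch to the row-wise string concatenation, a left merge with indicator=True can flag dfb rows absent from dfa on the chosen columns (my variant; the fillna('') is only there so missing values compare equal):

```python
import pandas as pd

dfa = pd.DataFrame({'c1': ['v','f','h','m','s','d'],
                    'c2': ['100','110','235','999','333','39'],
                    'c3': ['tech','jjj',None,'iii','mnp','lf'],
                    'c4': ['hhh','scb','kkk','lop','sos','kdk']})
dfb = pd.DataFrame({'c1': ['v','h','m','f','L','s'],
                    'c2': ['100','235','999','110','777','333'],
                    'c3': ['tech',None,'iii','jkl','9kdf','mnp1'],
                    'c4': ['hhh','mckkk','lok','scb','ooo','sos1']})

cols = ['c2', 'c3']
# left-merge dfb's key columns against dfa's; '_merge' tells us which
# rows found a partner ('both') and which did not ('left_only')
probe = dfb[cols].fillna('').merge(dfa[cols].fillna('').drop_duplicates(),
                                   on=cols, how='left', indicator=True)
missing_idx = probe.index[probe['_merge'] == 'left_only'].tolist()
print(missing_idx)  # [3, 4, 5]
```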

Reshaping strings-as-lists into rows

I have a pandas data frame like this:
df = pandas.DataFrame({
    'Grouping': ["A", "B", "C"],
    'Elements': ['[\"A1\"]', '[\"B1\", \"B2\", \"B3\"]', '[\"C1\", \"C2\"]']
}).set_index('Grouping')
so
Elements
Grouping
===============================
A ["A1"]
B ["B1", "B2", "B3"]
C ["C1", "C2"]
i.e. some lists are encoded as strings-as-lists. What is a clean way to reshape this into a tidy data set like this:
Elements
Grouping
====================
A A1
B B1
B B2
B B3
C C1
C C2
without resorting to a for-loop? The best I can come up with is:
df1 = pandas.DataFrame()
for index, row in df.iterrows():
    df_temp = pandas.DataFrame({'Elements': row['Elements'].replace("[\"", "").replace("\"]", "").split('\", \"')})
    df_temp['Grouping'] = index
    df1 = pandas.concat([df1, df_temp])
df1.set_index('Grouping', inplace=True)
but that's pretty ugly.
You can use .str.extractall():
df.Elements.str.extractall(r'"(.+?)"').reset_index(level="match", drop=True).rename({0:"Elements"}, axis=1)
the result:
Elements
Grouping
A A1
B B1
B B2
B B3
C C1
C C2
You can convert your string 'list' to an actual list with ast.literal_eval, then apply pd.Series and stack:
import ast
df.Elements=df.Elements.apply(ast.literal_eval)
df.Elements.apply(pd.Series).stack().reset_index(level=1,drop=True).to_frame('Elements')
Elements
Grouping
A A1
B B1
B B2
B B3
C C1
C C2
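On recent pandas (0.25+), DataFrame.explode offers a similar one-liner once the strings are parsed:

```python
import ast
import pandas as pd

df = pd.DataFrame({
    'Grouping': ["A", "B", "C"],
    'Elements': ['["A1"]', '["B1", "B2", "B3"]', '["C1", "C2"]']
}).set_index('Grouping')

# parse each string into a real list, then emit one row per list element
tidy = df.assign(Elements=df['Elements'].apply(ast.literal_eval)).explode('Elements')
print(tidy)
```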
