FuzzyWuzzy using two pandas dataframes - python

I want to find the fuzz.ratio of strings that are in two dataframes. Let's say I have 2 dataframes: df with columns A, B, and bt_df with columns A1, B1. I want to compare the column df['B'] with bt_df['B1'] and return the best matching score and its corresponding id in df['A'].
df
Out[8]:
A B
0 11111111111111111111 Cheesesalad
1 22222222222222222222 Cheese
2 33333333333333333333 salad
3 44444444444444444444 BMWSalad
4 55555555555555555555 BMW
5 66666666666666666666 Apple
6 77777777777777777777 Apple####
7 88888888888888888888 Macrooni!
bt_df
Out[9]:
A1 B1
0 180336 NaN
1 154263 Cheese
2 130876 Salad
3 204430 Macrooni
4 153490 NaN
5 48879 NaN
6 185495 NaN
7 105099 NaN
8 8645 Apple
9 54038 NaN
10 156523 NaN
11 18156 BWM
Hence the result should be:
B1 matchedstring score id
Cheese Cheese 100 22222222222222222222
.....
.....
Thanks in advance.
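A minimal sketch of one approach, assuming fuzzywuzzy is installed (the loop and the results/out names are illustrative, not from the original post):

import pandas as pd
from fuzzywuzzy import fuzz, process

results = []
for s in bt_df['B1'].dropna():
    # extractOne returns the best-scoring (match, score) pair for this string
    match, score = process.extractOne(s, df['B'].tolist(), scorer=fuzz.ratio)
    # look up the id in df['A'] that belongs to the matched string
    matched_id = df.loc[df['B'] == match, 'A'].iloc[0]
    results.append({'B1': s, 'matchedstring': match, 'score': score, 'id': matched_id})

out = pd.DataFrame(results)
print(out)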

Related

Search N consecutive rows with same value in one dataframe

I need to write Python code that searches a dataframe column for N (a variable) consecutive rows with the same value, different from NaN, like this.
I can't figure out how to do it with a for loop because I don't know which row I'm looking at in each case. Any idea how I can do it?
Fruit   2 matches  5 matches
Apple   No         No
NaN     No         No
Pear    No         No
Pear    Yes        No
Pear    Yes        No
Pear    Yes        No
Pear    Yes        Yes
NaN     No         No
NaN     No         No
NaN     No         No
NaN     No         No
NaN     No         No
Banana  No         No
Banana  Yes        No
Update: testing the solution by @Corralien
counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum())  # virtual groups
            .transform('cumcount').add(1)                           # cumulative counter
            .where(df['Fruit'].notna(), other=0))                   # set NaN to 0
N = 2
df['Matches'] = df.where(counts >= N, other='No')
VSCode returns the message 'Frame skipped from debugging during step-in.' when executing the last line, and an exception is raised in the previous for loop.
Compute consecutive values and set NaN to 0. Once you have calculated the cumulative counter, you just have to check if the counter is greater than or equal to N:
counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum())  # virtual groups
            .transform('cumcount').add(1)                           # cumulative counter
            .where(df['Fruit'].notna(), other=0))                   # set NaN to 0
N = 2
df['2 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})
N = 5
df['5 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})
Output:
>>> df
Fruit 2 matches 5 matches
0 Apple No No
1 NaN No No
2 Pear No No
3 Pear Yes No
4 Pear Yes No
5 Pear Yes No
6 Pear Yes Yes
7 NaN No No
8 NaN No No
9 NaN No No
10 NaN No No
11 NaN No No
12 Banana No No
13 Banana Yes No
>>> counts
0 1
1 0
2 1
3 2
4 3
5 4
6 5
7 0
8 0
9 0
10 0
11 0
12 1
13 2
dtype: int64
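For reference, the grouping key can be inspected on its own; a quick sketch (the groups name is illustrative). Note that NaN != NaN, so each NaN row starts its own group, which is harmless here because the counter is zeroed for NaN anyway:
groups = df['Fruit'].ne(df['Fruit'].shift()).cumsum()
print(groups.tolist())
# one id per run of equal values, e.g. [1, 2, 3, 3, 3, 3, 3, 4, 5, 6, 7, 8, 9, 9]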
Update
If I need to replace "Yes" with the fruit name, for example:
N = 2
df['2 matches'] = df.where(counts >= N, other='No')
print(df)
# Output
Fruit 2 matches
0 Apple No
1 NaN No
2 Pear No
3 Pear Pear
4 Pear Pear
5 Pear Pear
6 Pear Pear
7 NaN No
8 NaN No
9 NaN No
10 NaN No
11 NaN No
12 Banana No
13 Banana Banana

Pandas: Create column with rolling sum of previous n rows of another column within the same id/group

Sample dataset:
id fruit
0 7 NaN
1 7 apple
2 7 NaN
3 7 mango
4 7 apple
5 7 potato
6 3 berry
7 3 olive
8 3 olive
9 3 grape
10 3 NaN
11 3 mango
12 3 potato
In the fruit column, NaN and potato count as 0; all other strings count as 1. I want to generate a new column sum_last3 where each row holds the sum of the previous 3 rows (inclusive) of the fruit column. When a new id appears, the calculation should restart from the beginning.
Output I want:
id fruit sum_last3
0 7 NaN 0
1 7 apple 1
2 7 NaN 1
3 7 mango 2
4 7 apple 2
5 7 potato 2
6 3 berry 1
7 3 olive 2
8 3 olive 3
9 3 grape 3
10 3 NaN 2
11 3 mango 2
12 3 potato 1
My Code:
df['sum_last5'] = (df['fruit'].ne('potato') & df['fruit'].notna())
.groupby('id',sort=False, as_index=False)['fruit']
.rolling(min_periods=1, window=3).sum().astype(int).values
You can modify your code slightly, as follows:
df['sum_last3'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
                    .groupby(df['id'], sort=False)
                    .rolling(min_periods=1, window=3).sum().astype(int)
                    .droplevel(0))
or use .values as in your code:
df['sum_last3'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
                    .groupby(df['id'], sort=False)
                    .rolling(min_periods=1, window=3).sum().astype(int)
                    .values)
Your code is close; you just need to change id to df['id'] in the .groupby() call. Since .groupby() is now being called on a boolean Series rather than on df itself, it cannot resolve the column by the label 'id' alone and needs the dataframe reference to fully identify the column.
Also remove as_index=False, since that parameter applies to dataframes rather than to the (boolean) Series used here.
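For reference, a self-contained sketch of the first variant, constructing the sample frame inline (the mask name is illustrative):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id':    [7, 7, 7, 7, 7, 7, 3, 3, 3, 3, 3, 3, 3],
    'fruit': [np.nan, 'apple', np.nan, 'mango', 'apple', 'potato',
              'berry', 'olive', 'olive', 'grape', np.nan, 'mango', 'potato'],
})

# 1 for countable fruit, 0 for NaN or 'potato'
mask = df['fruit'].ne('potato') & df['fruit'].notna()

# rolling sum of the last 3 rows within each id; droplevel(0) removes the
# extra group level that groupby().rolling() adds to the index
df['sum_last3'] = (mask.groupby(df['id'], sort=False)
                       .rolling(window=3, min_periods=1).sum().astype(int)
                       .droplevel(0))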
Result:
print(df)
id fruit sum_last3
0 7 NaN 0
1 7 apple 1
2 7 NaN 1
3 7 mango 2
4 7 apple 2
5 7 potato 2
6 3 berry 1
7 3 olive 2
8 3 olive 3
9 3 grape 3
10 3 NaN 2
11 3 mango 2
12 3 potato 1

Filling NaN in a column in pandas based on a multiple-column condition

I would like to fill NaN in a pandas dataframe on the rows where two columns both have NaN.
A    B    C
2    3    5
NaN  NaN  7
4    7    9
NaN  4    9
12   5    8
NaN  NaN  6
In the above dataframe, I would like to replace only the rows where both column A and column B have NaN with "Not Available".
Thus:
A              B              C
2              3              5
Not Available  Not Available  7
4              7              9
NaN            4              9
12             5              8
Not Available  Not Available  6
I have tried multiple approaches but I am getting undesirable results.
If you want to test only the A and B columns, use DataFrame.loc with a mask that tests for missing values, using DataFrame.all to check that both are missing:
m = df[['A','B']].isna().all(axis=1)
df.loc[m, ['A','B']] = 'Not Available'
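A self-contained sketch of that first approach, constructing the sample frame inline:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [2, np.nan, 4, np.nan, 12, np.nan],
                   'B': [3, np.nan, 7, 4, 5, np.nan],
                   'C': [5, 7, 9, 9, 8, 6]})

# m is True only where both A and B are missing on the same row
m = df[['A', 'B']].isna().all(axis=1)
df.loc[m, ['A', 'B']] = 'Not Available'
print(df)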
If you need to test any two columns, first count the number of missing values per row: rows where exactly one value is missing are kept as-is, and fillna is applied to the rest (which here only changes the rows with two missing values):
m = df.isna().sum(axis=1).eq(1)
df = df.where(m, df.fillna('Not Available'))
print (df)
A B C
0 2 3 5
1 Not Available Not Available 7
2 4 7 9
3 NaN 4 9
4 12 5 8
5 Not Available Not Available 6

Attempting to pivot a dataframe with only text columns - "Index contains duplicate entries, cannot reshape"

I'm having issues with pivoting the below data
index column data
0 1 A cat
1 1 B blue
2 1 C seven
3 2 A dog
4 2 B green
5 2 B red
6 2 C eight
7 2 C five
8 3 A fish
9 3 B pink
10 3 C one
I am attempting to pivot it by using
df.pivot(index='index', columns='column', values="data")
But I receive the error "Index contains duplicate entries, cannot reshape".
I have looked through a large number of similar posts, but none of the solutions I tried worked.
My desired output is
index A B C
1 cat blue seven
2 dog green eight
2 dog green five
2 dog red eight
2 dog red five
3 fish pink one
What would be the best solution for this?
In the question Pandas pivot warning about repeated entries on index it is stated that duplicate pairs (i.e. a duplicated pair across the 'index' and 'column' columns) cannot be pivoted.
In your dataset, index 2 has the column values B and C twice.
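One quick way to see the offending rows, a sketch using duplicated (the dupes name is illustrative):

# list the (index, column) pairs that occur more than once and block pivot
dupes = df[df.duplicated(subset=['index', 'column'], keep=False)]
print(dupes)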
Can you change the 'index' column?
See my new dataframe as an example:
df = pd.DataFrame({'index': [1,1,1,2,2,3,2,4,3,4,3],
                   'column': ['A','B','C','A','B','B','C','C','A','B','C'],
                   'data': ['cat','blue','seven','dog','green','red',
                            'eight','five','fish','pink','one']})
df
out:
index column data
0 1 A cat
1 1 B blue
2 1 C seven
3 2 A dog
4 2 B green
5 3 B red
6 2 C eight
7 4 C five
8 3 A fish
9 4 B pink
10 3 C one
df.pivot(index='index', columns='column', values='data')
out:
column A B C
index
1 cat blue seven
2 dog green eight
3 fish red one
4 NaN pink five
Option 2
If you use unstack with append=True:
testing = df.set_index(['index', 'column'], append=True).unstack('column')
testing
data
column A B C
index
0 1 cat NaN NaN
1 1 NaN blue NaN
2 1 NaN NaN seven
3 2 dog NaN NaN
4 2 NaN green NaN
5 2 NaN red NaN
6 2 NaN NaN eight
7 3 NaN NaN five
8 3 fish NaN NaN
9 3 NaN pink NaN
10 3 NaN NaN one
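If the goal is the cross-product output shown in the question, one hedged sketch (not part of the original answers) aggregates the duplicates into lists and then explodes column by column:

wide = df.pivot_table(index='index', columns='column', values='data', aggfunc=list)
for col in wide.columns:
    wide = wide.explode(col)   # expand each list cell into its own row
print(wide.reset_index())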

How to drop duplicates from a subset of rows in a pandas dataframe?

I have a dataframe like this:
A B C
12 true 1
12 true 1
3 nan 2
3 nan 3
I would like to drop all rows where the value of column A is a duplicate, but only if the value of column B is 'true'.
The resulting dataframe I have in mind is:
A B C
12 true 1
3 nan 2
3 nan 3
I tried using: df.loc[df['B']=='true'].drop_duplicates('A', inplace=True, keep='first') but it doesn't seem to work.
Thanks for your help!
You can use pd.concat after splitting the df by B. (Your attempt doesn't work because df.loc[...] returns a copy, so drop_duplicates(..., inplace=True) modifies that copy rather than the original df.)
df = pd.concat([df.loc[df.B != True],
                df.loc[df.B == True].drop_duplicates(['A'], keep='first')]).sort_index()
df
Out[1593]:
A B C
0 12 True 1
2 3 NaN 2
3 3 NaN 3
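Alternatively, a single boolean mask achieves the same result: keep rows where B is not True, or where A has not been seen before: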
df[df.B.ne(True) | ~df.A.duplicated()]
A B C
0 12 True 1
2 3 NaN 2
3 3 NaN 3
