Fuzzy Match List with Column in a data frame

I have a list of strings that I am trying to match against the values in a column. If the best match scores below 95, I want to keep the current column value; if it scores 95 or above, I want to return the best fuzzy match from the list. All returned values should go into a new column. I keep getting the error "tuple index out of range". I think this may be because it wants to return a tuple with the score and name, but I only want the name. Here is my current code:
from fuzzywuzzy import process
from fuzzywuzzy import fuzz
L = [ducks, frogs, doggies]
df
FOO PETS
a duckz
b frags
c doggies
def fuzz_m(column, pet_list, score_t):
    for c in column:
        new_name, score = process.extractOne(c, pet_list, score_t)
        if score<95:
            return c
        else:
            return new_name
df['NEW_PETS'] = fuzz_m(df,L, fuzz.ratio)
Desired output:
FOO PETS NEW_PETS
a duckz ducks
b frags frogs
c doggies doggies

Several corrections.
Change
df['NEW_PETS'] = fuzz_m(df,L, fuzz.ratio)
to
df['NEW_PETS'] = fuzz_m(df['PETS'], L, fuzz.ratio)
Make your list elements strings.
fuzzywuzzy's extractOne method accepts both a processor and a scorer, in that order, after the query and the choices. Your positional argument fuzz.ratio is therefore interpreted as a processor, when it's really a scorer. Change process.extractOne(c, pet_list, score_t) to process.extractOne(c, pet_list, scorer=score_t).
The loop-based code will not work as expected either. fuzz_m is called only once and returns during the first iteration of its loop, so that single return value is broadcast into all entries of the series df['NEW_PETS'].
A more pandas-friendly way:
L = ['ducks', 'frogs', 'doggies']
def fuzz_m(col, pet_list, score_t):
    new_name, score = process.extractOne(col, pet_list, scorer=score_t)
    if score<95:
        return col
    else:
        return new_name
df['NEW_PETS'] = df['PETS'].apply(fuzz_m, pet_list=L, score_t=fuzz.ratio)
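To see how the threshold behaves without installing anything, here is a minimal sketch that swaps fuzzywuzzy for difflib from the standard library. It is not the fuzzywuzzy API: the function name best_match_or_original and the threshold parameter are made up for illustration, the choices list is assumed to be non-empty, and difflib's scores may differ from fuzzywuzzy's.

```python
from difflib import SequenceMatcher


def best_match_or_original(value, choices, threshold=95):
    """Return the closest choice if its score reaches threshold, else value.

    Scores are difflib similarity ratios scaled to 0-100 (a stand-in for
    fuzz.ratio, not the same implementation). Assumes choices is non-empty.
    """
    scored = [
        (SequenceMatcher(None, value, choice).ratio() * 100, choice)
        for choice in choices
    ]
    best_score, best_choice = max(scored)
    return best_choice if best_score >= threshold else value


pets = ["duckz", "frags", "doggies"]
choices = ["ducks", "frogs", "doggies"]

# With a cutoff of 95 the near misses keep their original spelling.
print([best_match_or_original(p, choices, threshold=95) for p in pets])
# A lower cutoff lets the near misses be replaced by the list entries.
print([best_match_or_original(p, choices, threshold=75) for p in pets])
```

Under difflib's scoring, "duckz" against "ducks" and "frags" against "frogs" both come out at 80, so a cutoff of 95 leaves them unchanged. If the desired output is ducks and frogs, the cutoff probably needs to be lower than 95.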

Related

Calling arguments from 2 separate lists into one function

I'm trying to calculate the alcohol by volume (abv) of some beer by using variables from 2 separate lists (which I took from a dictionary entry). I'm having trouble getting the values from both lists to be applied to the equation that I have for abv (and it's probably not possible to have a for loop with an and statement like the one I have below). Is it possible to get variables from two separate lists to be subbed into the same equation in one for loop?
Right now it's telling me that I have a type error where 'bool' object is not iterable. Here's what I've tried so far in terms of coding:
beers = {"SG": [1.050, 1.031, 1.077, 1.032, 1.042, 1.055, 1.019, 1.089, 1.100, 1.032],
"FG": [1.010, 1.001, 1.044, 1.003, 1.003, 1.013, 1.002, 1.020, 1.056, 1.000],
"grad student 1": [5.264, 3.983, 4.101, 7.216, 2.313, 4.876, 2.255, 8.991, 5.537, 4.251],
"grad student 2": [5.211, 3.008, 4.117, 3.843, 5.168, 5.511, 3.110, 8.903, 5.538, 4.255]}
#separating the SG and FG values from the dictionary entry
SG_val = beers["SG"]
FG_val = beers['FG']
def find_abv(SG = SG_val, FG = FG_val):
    abv_list = []
    i = 0.0
    j = 0.0
    for i in SG_val and j in FG_val:
        abv = (((1.05/0.79)*((i - j)/j))*100)
        abv_list.append(abv)
        return abv_list
find_abv()
print(abv_list)

You cannot use and to iterate two variables in a single for loop. You can use the zip function to do that:
def find_abv(SG = SG_val, FG = FG_val):
    abv_list = []
    for i, j in zip(SG, FG):
        abv = (((1.05/0.79)*((i - j)/j))*100)
        abv_list.append(abv)
    return abv_list
abv_list = find_abv()
print(abv_list)
You also need to assign the result of find_abv() to a variable in order to print it, which your code does not appear to do.
Another thing: using SG_val and FG_val inside the loop of find_abv defeats the purpose of the SG and FG parameters, so iterate over the parameters instead.
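As a compact variant of the zip approach, here is a sketch that passes the two lists in explicitly and builds the result with a list comprehension. The parameter names sg_values and fg_values are my own, and only the first three readings from the question are used to keep it short.

```python
# First three SG/FG readings from the question's dictionary.
beers = {"SG": [1.050, 1.031, 1.077],
         "FG": [1.010, 1.001, 1.044]}


def find_abv(sg_values, fg_values):
    """Return the ABV percentage for each (SG, FG) pair."""
    return [
        (1.05 / 0.79) * ((sg - fg) / fg) * 100
        for sg, fg in zip(sg_values, fg_values)  # pairs items by position
    ]


abv_list = find_abv(beers["SG"], beers["FG"])
print([round(abv, 3) for abv in abv_list])
```

Passing the lists as arguments instead of relying on default values keeps the function reusable for other readings.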

You can't use a for loop to directly iterate through multiple lists. Currently, your function is trying to iterate through (SG_val and j in FG_val), which itself is a boolean and can therefore not be iterated through.
If the two lists will always have the same number of items, then you could simply iterate through the indexes:
# len(SG_val) returns the length of SG_val
for i in range(len(SG_val)):
    abv = (((1.05/0.79)*((SG_val[i] - FG_val[i])/FG_val[i]))*100)
    abv_list.append(abv)
# put the return outside of the for loop so that it can finish iterating before returning the value
return abv_list
If the lists aren't always going to be the same length, then you can write for i in range(len(SG_val) if len(SG_val) <= len(FG_val) else len(FG_val)): (equivalently, for i in range(min(len(SG_val), len(FG_val))):) instead of for i in range(len(SG_val)): so that it iterates only until it reaches the end of the shorter list.
Also, to output the value returned by the function you have to assign it to something and then print it or just print it directly:
abv_list = find_abv()
print(abv_list)
# or
print(find_abv())
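For lists of unequal length, the index-based loop can be sketched as a complete function like this. The parameter names and the three-versus-two sample readings are assumptions for illustration; the readings themselves are taken from the question's data.

```python
def find_abv(sg_values, fg_values):
    """Return ABV percentages for paired readings, stopping at the shorter list."""
    abv_list = []
    pair_count = min(len(sg_values), len(fg_values))
    for i in range(pair_count):
        sg, fg = sg_values[i], fg_values[i]
        abv_list.append((1.05 / 0.79) * ((sg - fg) / fg) * 100)
    # Return only after the loop has processed every pair.
    return abv_list


# Three SG readings but only two FG readings: the third SG value is ignored.
result = find_abv([1.050, 1.031, 1.077], [1.010, 1.001])
print(len(result))
```

zip has the same stop-at-the-shorter-list behaviour built in, so this version is mainly useful when the index itself is needed.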

Parse Pandas Return As List

I run the following code:
df = pd.read_excel(excel_file, columns = ['DeviceNumber','DeviceAddress','DeviceCity','DeviceState','StoreNumber','StoreName','DeviceConnect','Keys'])
df.index.name = 'ID'
def srch_knums(knum_search):
get_knums = df.loc[df['DeviceNumber'] == knum_search]
return get_knums
test = srch_knums(int(13))
print(test)
The output is as follows:
DeviceNumber DeviceAddress DeviceCity DeviceState StoreNumber StoreName DeviceConnect Keys ID
12 13 135 Sesame Street Imaginary AZ 410 Verizon Here On Site
btw, that looks prettier in terminal... haha
What I want to do is take the value test and use individual parts of it, for example to print them in specific places in a GUI that I am creating. What is the syntax for accessing the individual values in test? I would rather supply my own labels when presenting it in the GUI. For example, I want to take test[0], which should be the value for the device number (13), and assign it to a variable, then make a label that says "kiosk number" and show that variable beside it, and so on. I would rather format it myself than use the weird printout from the return value.

If you want to return a scalar value, filter the rows by testing column col1 and select column col2 for output; loc is necessary for that. Wrapping the result in next with iter returns a default value if there is no match:
def srch_knums(col1, knum_search, col2):
    return next(iter(df.loc[df[col1] == knum_search, col2]), 'no match')
test = srch_knums('DeviceNumber', int(13), 'StoreNumber')
print (test)
410
If you want a list:
def srch_knums(col1, knum_search, col2):
    return df.loc[df[col1] == knum_search, col2].tolist()
test = srch_knums('DeviceNumber', int(13), 'StoreNumber')
print (test)
[410]
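The next(iter(...), default) idiom from the scalar version works on any iterable, including the Series that df.loc[...] returns. Here it is in isolation on plain lists; first_or_default is a hypothetical helper name, not part of pandas.

```python
def first_or_default(values, default="no match"):
    """Return the first item of an iterable, or default if it is empty."""
    return next(iter(values), default)


print(first_or_default([410]))  # 410
print(first_or_default([]))     # no match
```

Without the second argument, next raises StopIteration on an empty iterable, which is why the default is supplied.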

Change the line:
get_knums = df.loc[df['DeviceNumber'] == knum_search]
to
get_knums = df[df['DeviceNumber'] == knum_search]
You don't need to use loc for a plain boolean row filter like this.

Spark Streaming updateStateByKey with tuple as a value

Is it possible to use the updateStateByKey() function with a tuple as a value? I am using PySpark and my input is (word, (count, tweet_id)), which means word is the key and the tuple (count, tweet_id) is the value. For each word, updateStateByKey should sum the counts and build a list of all tweet_ids that contain the word.
I implemented the following update function, but I get the error "list index out of range" for new_values with index 1:
def updateFunc(new_values, last_sum):
    count = 0
    tweets_id = []
    if last_sum:
        count = last_sum[0]
        tweets_id = last_sum[1]
    return sum(new_values[0]) + count, tweets_id.extend(new_values[1])
And calling the method:
running_counts.updateStateByKey(updateFunc)

I've found the solution. The problem was with checkpointing, which persists the current state to disk in case of a failure. It caused problems because, after I changed my definition of the state, the checkpoint still held the old state without a tuple. I therefore deleted the checkpoint from disk and implemented the final solution as:
def updateFunc(new_values, last_sum):
    count = 0
    counts = [field[0] for field in new_values]
    ids = [field[1] for field in new_values]
    if last_sum:
        count = last_sum[0]
        new_ids = last_sum[1] + ids
    else:
        new_ids = ids
    return sum(counts) + count, new_ids
Finally, the answer to my question is: yes, the state can be a tuple or any other data type for storing more values.
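The update function can be tried without a Spark cluster by calling it with plain Python values shaped like what updateStateByKey passes in: a list of new (count, tweet_id) values for one key, and the previous state (None the first time). This is only a sketch; the name update_func and the tweet ids "t1" to "t3" are made up for illustration.

```python
def update_func(new_values, last_state):
    """Combine new (count, tweet_id) pairs with the previous (count, ids) state."""
    counts = [count for count, _ in new_values]
    ids = [tweet_id for _, tweet_id in new_values]
    previous_count, previous_ids = last_state if last_state else (0, [])
    # Build a new list with + instead of list.extend, which returns None.
    return sum(counts) + previous_count, previous_ids + ids


# First batch for a word: no previous state yet.
state = update_func([(1, "t1"), (2, "t2")], None)
# Second batch: the previous state is fed back in.
state = update_func([(1, "t3")], state)
print(state)  # (4, ['t1', 't2', 't3'])
```

This also shows why the function in the question could not work even with a fresh checkpoint: new_values is a list of tuples, so new_values[1] is the second tuple rather than the list of ids, and tweets_id.extend(...) evaluates to None.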

Python: Summing a pig tuple containing float values

I'm fairly new to Pig/Python and in need of help. Trying to write a Pig Script that reconciles financial data. The parameters used follow a syntax like (grand_tot, x1, x2,... xn), meaning that the first value should equal the sum of remaining values.
I don't know of a way to accomplish this using Pig alone, so I've been trying to write a Python UDF. Pig passes a tuple to Python; if the sum of x1:xn equals grand_tot, then Python should return a "1" to Pig to show that the numbers match, otherwise it returns a "0".
Here is what I have so far:
register 'myudf.py' using jython as myfuncs;
A = LOAD '$file_nm' USING PigStorage(',') AS (grand_tot,west_region,east_region,prod_line_a,prod_line_b, prod_line_c, prod_line_d);
A1 = GROUP A ALL;
B = FOREACH A1 GENERATE TOTUPLE($recon1) as flds;
C = FOREACH B GENERATE myfuncs.isReconciled(flds) AS res;
DUMP C;
$recon1 is passed as a parameter, and defined as:
grand_tot, west_region, east_region
I will later pass $recon2 as:
grand_tot, prod_line_a, prod_line_b, prod_line_c, prod_line_d
Sample row of data (in $file_nm) looks like:
grand_tot,west_region,east_region,prod_line_a,prod_line_b, prod_line_c, prod_line_d
10000,4500,5500,900,2200,450,3700,2750
12500,7500,5000,3180,2770,300,3950,2300
9900,7425,2475,1320,460,3070,4630,1740
Lastly... here is what I'm trying to do with Python UDF code:
#outputSchema("result")
def isReconciled(arrTuple):
    arrTemp = []
    arrNew = []
    string1 = ""
    result = 0
    ## the first element of the Tuple should be the sum of remaining values
    varGrandTot = arrTuple[0]
    ## create a new array with the remaining Tuple values
    arrTemp = arrTuple[1:]
    for item in arrTuple:
        arrNew.append(item)
    ## sum the second to the nth values
    varSum = sum(arrNew)
    ## if the first value in the tuple equals the sum of all remaining values
    if varGrandTot = varSum then:
        #reconciled to the penny
        result = 1
    else:
        result = 0
    return result
The error message I receive is:
unsupported operand type(s) for +: 'int' and 'array.array'
I've tried numerous things attempting to convert the array values into numeric and convert to float so that I can sum, but with no success.
Any ideas??? Thanks for looking!

You can do this in Pig itself.
First, specify the data type in the schema. PigStorage uses bytearray as the default data type, which is why your Python script throws the error. Your sample data appears to contain ints, although your question mentions floats.
Second, add the fields starting from the second field or the fields of your choice.
Third, use the bincond operator to check the first field value with the sum.
A = LOAD '$file_nm' USING PigStorage(',') AS (grand_tot:float,west_region:float,east_region:float,prod_line_a:float,prod_line_b:float, prod_line_c:float, prod_line_d:float);
A1 = FOREACH A GENERATE grand_tot,SUM(TOBAG(prod_line_a,prod_line_b,prod_line_c,prod_line_d)) as SUM_ALL;
B = FOREACH A1 GENERATE (grand_tot == SUM_ALL ? 1 : 0);
DUMP B;

It is very likely that your arrTuple is not an array of numbers, but that some item in it is itself an array.
To check it, modify your code by adding some checks:
#outputSchema("result")
def isReconciled(arrTuple):
    # some checks (long exists in Jython / Python 2)
    tmpl = "Item # {i} shall be a number (has value {itm} of type {tp})"
    for i, itm in enumerate(arrTuple):
        msg = tmpl.format(i=i, itm=itm, tp=type(itm))
        assert isinstance(itm, (int, long, float)), msg
    # end of checks
    arrTemp = []
    arrNew = []
    string1 = ""
    result = 0
    ## the first element of the Tuple should be the sum of remaining values
    varGrandTot = arrTuple[0]
    ## create a new array with the remaining Tuple values
    arrTemp = arrTuple[1:]
    for item in arrTemp:
        arrNew.append(item)
    ## sum the second to the nth values
    varSum = sum(arrNew)
    ## if the first value in the tuple equals the sum of all remaining values
    if varGrandTot == varSum:
        #reconciled to the penny
        result = 1
    else:
        result = 0
    return result
It is very likely that it will raise an AssertionError on one of the items. Read the assertion message to learn which item is causing the trouble.
Anyway, if you want to return 1 when the first number equals the sum of the rest of the array and 0 otherwise, the following would work too:
#outputSchema("result")
def isReconciled(arrTuple):
    if arrTuple[0] == sum(arrTuple[1:]):
        return 1
    else:
        return 0
And if you would be happy with getting True instead of 1 and False instead of 0:
#outputSchema("result")
def isReconciled(arrTuple):
    return arrTuple[0] == sum(arrTuple[1:])
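Since the question mentions float values, exact equality may be too strict once the amounts have fractional parts. Below is a plain-Python sketch (not tested inside Pig/Jython) that compares within a tolerance. The name is_reconciled and the half-cent tolerance are assumptions, and it presumes every item is a number or a numeric string.

```python
def is_reconciled(values, tolerance=0.005):
    """Return 1 if values[0] equals the sum of values[1:] within tolerance, else 0."""
    grand_total = float(values[0])
    parts_total = sum(float(v) for v in values[1:])
    return 1 if abs(grand_total - parts_total) <= tolerance else 0


print(is_reconciled((10000, 4500, 5500)))  # row from the question: 1
print(is_reconciled((0.3, 0.1, 0.2)))      # 1, although 0.3 == 0.1 + 0.2 is False
print(is_reconciled((9900, 7425, 2470)))   # made-up mismatch: 0
```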

Outputting Bubble sorting

I have a file called countries.txt that lists all the countries by name, area (in km2), and population (e.g. ["Afghanistan", 647500.0, 25500100]).
def readCountries(filename):
    result=[]
    lines=open(filename)
    for line in lines:
        result.append(line.strip('\n').split(',\t'))
    for sublist in result:
        sublist[1]=float(sublist[1])
        sublist[2]=int(sublist[2])
I am trying to sort the list using a bubble sort according to the area of each country:
>>> c = countryByArea(7)
>>> c
["India",3287590.0,1239240000]
Given the parameter n, it should return the country with the nth largest area.
I have this, but I'm not sure how to output the information:
def countryByArea(area):
    myList=readCountries('countries.txt')
    for i in range(0,len(list)):
        for j in range(0,len(list)-1):
            if list[j]>list[j+1]:
                temp=list[j]
                list[j]=list[j+1]
                list[j+1]=temp

First of all, implement a generic bubble sort function. This is a correct bubble sort implementation; I'm sure you can find other implementations on http://rosettacode.org
def bubble_sort(a_list,a_key):
    changed=True
    while changed:
        changed = False
        for i in range(len(a_list)-1):
            if a_key(a_list[i]) > a_key(a_list[i+1]):
                a_list[i],a_list[i+1] = a_list[i+1],a_list[i]
                changed = True
Then simply pass a key function that represents the data you want to sort by (in this case the middle value, index one of each row):
import csv
def sort_by_area(fname):
    with open(fname) as f:
        a = list(csv.reader(f))
    # areas such as 647500.0 are floats, so int() would raise ValueError
    bubble_sort(a, lambda row: float(row[1]))
    return a
a = sort_by_area("a_file.txt")
print(a[-7])  # the 7th largest by area
You can take this info and combine it to complete your assignment, but really this is a question you should have asked a classmate or your teacher for help with.
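Putting the pieces together on in-memory data, a sketch might look like the following. The first two rows come from the question; "Smallland" and the helper name nth_largest_by_area are made up for illustration, and the rows are assumed to be already parsed into [name, area, population].

```python
def bubble_sort(items, key):
    """Sort items in place, ascending by key(item), using bubble sort."""
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(items) - 1):
            if key(items[i]) > key(items[i + 1]):
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True


def nth_largest_by_area(rows, n):
    """Return the row with the nth largest area (n=1 is the largest)."""
    ordered = list(rows)  # copy so the caller's list keeps its order
    bubble_sort(ordered, key=lambda row: row[1])
    return ordered[-n]


rows = [
    ["Afghanistan", 647500.0, 25500100],  # from the question
    ["India", 3287590.0, 1239240000],     # from the question
    ["Smallland", 1000.0, 50000],         # hypothetical entry
]
print(nth_largest_by_area(rows, 1))
```

Because the sort is ascending, the nth largest row sits at index -n, which is what the function returns to the caller.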
