python naive string token matcher

I've written a very naive token-based string matcher. It's a little too naive, though: with the following code it returns every artist in the artists list, because of how 'a r i z o n a' is tokenised.
import collections
import re
def __tokenised_match(artist, search_artist):
    matches = []
    # only use token matching when the search string has more than one token
    if len(re.split(r'[\\\s/-]', search_artist)) > 1:
        # artist is a dict, so look the sanitised name up by key
        a = [artist['sanitisedOne'], search_artist]
        bag_of_words = [collections.Counter(re.findall(r'\w+', words)) for words in a]
        sumbags = sum(bag_of_words, collections.Counter())
        print(sumbags)
        for key, value in sumbags.items():
            if len(re.findall(r'\b({k})\b'.format(k=key), search_artist)) > 0 and value > 1:
                matches.append(artist)
    if len(matches):
        return matches
artists = [
    {'artist': 'A R I Z O N A', 'sanitisedOne': 'a r i z o n a'},
    {'artist': 'Wutang Clan', 'sanitisedOne': 'wutang clan'}
]
search_artist = 'a r i z o n a'
for artist in artists:
    print(__tokenised_match(artist, search_artist))
This produces one sumbags per artist, like this:
Counter({'a': 4, 'r': 2, 'i': 2, 'z': 2, 'o': 2, 'n': 2})
Counter({'a': 2, 'wutang': 1, 'clan': 1, 'r': 1, 'i': 1, 'z': 1, 'o': 1, 'n': 1})
This is something of an edge case, but I wonder how I can guard against it. It would be fine for 'wutang clan' to match, but with single-letter tokens like these the matcher is far too permissive: it returns every artist, because 'a' is counted at least twice.

The basic problem is that you report success on a single matching token. This will kill your accuracy for any artist with an easily matched token in the name. We could tune your algorithm to require a certain percentage of matching words, or to use a bag-of-letters, intersection-over-union ratio, but ...
I recommend that you use something a bit stronger, such as a string-similarity measure. Packaged implementations are readily available for Python, and using one is much easier than coding your own solution.
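As a rough illustration of the string-similarity idea, here is a minimal sketch that uses difflib.SequenceMatcher from the standard library. The helper names similarity and similar_artists, and the 0.8 threshold, are assumptions made for this example rather than anything from the question, and the threshold would need tuning against real data.

```python
import difflib

def similarity(a, b):
    """Return a ratio between 0.0 and 1.0; identical strings score 1.0."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def similar_artists(search_artist, artists, threshold=0.8):
    # threshold is an assumed cut-off, not a recommended value
    return [
        artist for artist in artists
        if similarity(artist['sanitisedOne'], search_artist) >= threshold
    ]

artists = [
    {'artist': 'A R I Z O N A', 'sanitisedOne': 'a r i z o n a'},
    {'artist': 'Wutang Clan', 'sanitisedOne': 'wutang clan'},
]

# The exact name matches only itself; a near miss still finds its artist.
print(similar_artists('a r i z o n a', artists))
print(similar_artists('wutang clang', artists))
```

Because the whole string is compared, a shared single-letter token such as 'a' no longer makes every artist match.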

Related

how to find the most popular letter in a string that also has the lowest ascii value

Implement the function most_popular_character(my_string), which gets the string argument my_string and returns its most frequent letter. In case of a tie, break it by returning the letter of smaller ASCII value.
Note that lowercase and uppercase letters are considered different (e.g., ‘A’ < ‘a’). You may assume my_string consists of English letters only, and is not empty.
Example 1: >>> most_popular_character("HelloWorld") >>> 'l'
Example 2: >>> most_popular_character("gggcccbb") >>> 'c'
Explanation: cee and gee appear three times each (and bee twice), but cee precedes gee lexicographically.
Hints (you may ignore these):
Build a dictionary mapping letters to their frequency;
Find the largest frequency;
Find the smallest letter having that frequency.
def most_popular_character(my_string):
    char_count = {} # define dictionary
    for c in my_string:
        if c in char_count: #if c is in the dictionary:
            char_count[c] = 1
        else: # if c isn't in the dictionary - create it and put 1
            char_count[c] = 1
    sorted_chars = sorted(char_count) # sort the dictionary
    char_count = char_count.keys() # place the dictionary in a list
    max_per = 0
    for i in range(len(sorted_chars) - 1):
        if sorted_chars[i] >= sorted_chars[i+1]:
            max_per = sorted_chars[i]
            break
    return max_per
My function returns 0 right now, and I think the problem is in the last for loop and if statement, but I can't figure out what it is.
If you have any suggestions on how to adjust the code, they would be much appreciated!

Your dictionary gets off to a bad start: instead of adding 1 to the character count, you reset it to 1 each time.
Have a look here to get the gist of getting the maximum value from a dict: https://datagy.io/python-get-dictionary-key-with-max-value/
def most_popular_character(my_string):
    # NOTE: you might want to convert the entire string to upper or lower case first, depending on the use
    # e.g. my_string = my_string.lower()
    char_count = {} # define dictionary
    for c in my_string:
        if c in char_count: # if c is in the dictionary:
            char_count[c] += 1 # add 1 to it
        else: # if c isn't in the dictionary - create it and put 1
            char_count[c] = 1
    # Never underestimate the power of print in debugging
    print(char_count)
    # max(char_count.values()) gives the highest count
    max_count = max(char_count.values())
    # But there may be more than 1 item with the highest count, so get them all
    max_keys = [key for key, value in char_count.items() if value == max_count]
    # Choose the lowest by sorting them and picking the first item
    low_item = sorted(max_keys)[0]
    return low_item, max_count
print(most_popular_character("HelloWorld"))
print(most_popular_character("gggcccbb"))
print(most_popular_character("gggHHHAAAAaaaccccbb 12 3"))
Result:
{'H': 1, 'e': 1, 'l': 3, 'o': 2, 'W': 1, 'r': 1, 'd': 1}
('l', 3)
{'g': 3, 'c': 3, 'b': 2}
('c', 3)
{'g': 3, 'H': 3, 'A': 4, 'a': 3, 'c': 4, 'b': 2, ' ': 2, '1': 1, '2': 1, '3': 1}
('A', 4)
So: l and 3, c and 3, A and 4

def most_popular_character(my_string):
    history_l = [l for l in my_string] #each letter in string
    char_dict = {} #creating dict
    for item in history_l: #for each letter in string
        char_dict[item] = history_l.count(item)
    return [max(char_dict.values()),min(char_dict.values())]
I didn't understand the last part about the minimum, so I made this function return the maximum frequency and the minimum frequency as a list!

Use a Counter to count the characters, and use the max function to select the "biggest" character according to your two criteria.
>>> from collections import Counter
>>> def most_popular_character(my_string):
...     chars = Counter(my_string)
...     return max(chars, key=lambda c: (chars[c], -ord(c)))
...
>>> most_popular_character("HelloWorld")
'l'
>>> most_popular_character("gggcccbb")
'c'
Note that using max is more efficient than sorting the entire dictionary, because it only needs to iterate over the dictionary once and find the single largest item, as opposed to sorting every item relative to every other item.
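For comparison, the three hints from the question can also be followed literally without importing anything. This is only a sketch of that route; it uses dict.get to avoid the reset-to-1 bug in the question, and the third test string is an extra example added here.

```python
def most_popular_character(my_string):
    # Hint 1: build a dictionary mapping letters to their frequency
    char_count = {}
    for c in my_string:
        char_count[c] = char_count.get(c, 0) + 1
    # Hint 2: find the largest frequency
    top = max(char_count.values())
    # Hint 3: find the smallest letter (by ASCII value) having that frequency
    return min(c for c, n in char_count.items() if n == top)

print(most_popular_character("HelloWorld"))  # l
print(most_popular_character("gggcccbb"))    # c
print(most_popular_character("bbAA"))        # A (uppercase sorts before lowercase)
```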

How to create function that takes a text string and returns a dictionary containing how many times some defined characters occur even if not present?

Hello, I asked this question previously and now want to adjust the code I have. I want it to assign the value 0 to any requested letter that is not present in the text string.
def func(text, let):
    count = {}
    for l in text.lower():
        if l in let:
            if l in count.keys():
                count[l] += 1
            else:
                count[l] = 1
    return count
It currently returns this:
example = "Sample String"
print(func(example, "sao"))
{'s': 2, 'a' : 1}
This would be my desired output
example = "Sample String"
print(func(example, "sao"))
{'s': 2, 'a' : 1, 'o' :0}

If you don't mind using tools designed especially for your purpose, then the following will do:
from collections import Counter
def myfunc(inp, vals):
    # lower-case the input so that 'S' and 's' are counted together
    c = Counter(inp.lower())
    return {e: c[e] for e in vals}
s = 'Sample String'
print(myfunc(s, 'sao'))
Otherwise you can explicitly set all missing values in your function.
def func(inp, vals):
    count = {e: 0 for e in vals}
    for s in inp.lower():
        if s in count:
            count[s] += 1
    return count

# create a function
def stringFunc(string, letters):
    # convert string of letters to a list of letters
    letter_list = list(letters)
    # dictionary comprehension to count the number of times a letter is in the string
    d = {letter: string.lower().count(letter) for letter in letter_list}
    return d
stringFunc('Hello World', 'lohdx')
# {'l': 3, 'o': 2, 'h': 1, 'd': 1, 'x': 0}

You can use a dict comprehension and str.count:
def count_letters(text, letters):
    lower_text = text.lower()
    return {c: lower_text.count(c) for c in letters}
print(count_letters("Sample String", "sao"))
result: {'s': 2, 'a': 1, 'o': 0}

You can use collections.Counter and obtain character counts via the get method:
from collections import Counter
def func(string, chars):
    counts = Counter(string.lower())
    return {c: counts.get(c, 0) for c in chars}

Program is repeating itself -can't figure out why

This is my program so far. It takes a message (input from the user) and tells the user how many A's are in the message, how many B's, and so on. But when I input a message such as "Dad", it tells me how many D's there are twice instead of reporting each letter once.
It says:
D ... 2
A ... 1
D ... 2
I want it to say:
A ... 1
D ... 2
How do I fix this without using zip, and without importing anything?
message=input("what is your message?").upper()
alphabet=["A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z"]
count=[0]*len(alphabet)
for i in message:
    if i in alphabet:
        count[alphabet.index(i)]+=1
for i in message:
    print (i,"...",count[alphabet.index(i)])
(Thanks to Uriel Eli for helping me get the program this far btw).

I don't agree with your approach here; you have overcomplicated this. The proper way to solve it is to use a dictionary to keep track of all the letters in the string and increment a count every time the same character comes up. Note that this also sticks to the rule of not importing anything.
Furthermore, this removes the need for a list of letters to check against.
Also, if you need to count upper and lower case characters separately, do not call upper on your input; just remove it. If you have to count upper and lower case as the same character, then you can leave it.
message=input("what is your message?").upper()
d = {}
for c in message:
    if c in d:
        d[c] += 1
    else:
        d[c] = 1
print(d)
Demo
what is your message?thisisastringofthings
{'T': 3, 'H': 2, 'I': 4, 'S': 4, 'A': 1, 'R': 1, 'N': 2, 'G': 2, 'O': 1, 'F': 1}
To provide an output similar to what you are expecting, you just need to iterate through your final result and print:
for character, count in d.items():
    print("{} ... {}".format(character, count))
Finally, just for the sake of showing it, the best way to do this is to use Counter from collections:
>>> from collections import Counter
>>> Counter("thisisastring")
Counter({'s': 3, 'i': 3, 't': 2, 'h': 1, 'n': 1, 'a': 1, 'r': 1, 'g': 1})

Just for future reference (I know that you can't import anything right now), the best way would probably be:
from collections import Counter
message=input("what is your message?").upper()
print(Counter(message))
# Counter({'D': 2, 'A': 1})

Your second for loop is iterating through message, so if the user input DAD (well... after upper casing it), you're gonna get:
message == DAD
i = D --> Shows 2
i = A --> Shows 1
i = D --> Shows 2 (again)
Maybe you'd want to iterate through count instead, keeping the index you are at so you can use it later to look up the matching letter in the alphabet list. Something like this:
for index, num_occurences in enumerate(count):
    if num_occurences > 0:
        print("Match found at i=%s which corresponds with alphabet[%s]=%s" %
              (index, index, alphabet[index]))
        print(alphabet[index], "...", num_occurences)
You should check what enumerate does.
If you still want to iterate through message, you can, by keeping track of which letters you have already displayed in an auxiliary set (so you don't show the same letter again):
already_shown_letters = set()
for i in message:
    if i not in already_shown_letters:
        print (i,"...",count[alphabet.index(i)])
        already_shown_letters.add(i)
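Putting the first suggestion together, here is a minimal sketch that avoids both zip and imports and prints the letters in alphabetical order, as in the desired output. The function name letter_report is made up for this example, and it takes the message as a parameter instead of calling input so that it can be tried directly.

```python
def letter_report(message):
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    count = [0] * len(alphabet)
    for ch in message.upper():
        # skip spaces, digits and punctuation
        if ch in alphabet:
            count[alphabet.index(ch)] += 1
    lines = []
    # walk the alphabet, not the message, so each letter is reported once
    for index in range(len(alphabet)):
        if count[index] > 0:
            lines.append(alphabet[index] + " ... " + str(count[index]))
    return lines

for line in letter_report("Dad"):
    print(line)
```

For "Dad" this prints A ... 1 followed by D ... 2.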

key error: 'x' --> adding key value pair in for loop, key being char

I am a beginner in Python and I got this error while trying to solve a coding problem, and I don't understand why. I went through a couple of Q&As here, but they don't seem to solve my problem. Essentially, I am trying to iterate over the characters of a string and fill a dictionary in which the characters are the keys and the values are the number of times each character appears. So I'm trying the following:
def myfunc(mystring):
    for i in mystring:
        if charCounter[i]:
            charCounter[i] += 1
        charCounter[i] = 1
mystring = "hello! how are you ?"
myfunc(mystring)
and I'm getting the following error:
  File "xyq.py", line 3, in myfunc
    if charCounter[i]:
KeyError: 'h'
Can someone please suggest where I am going wrong? And, if possible, how can I improve the code?
Thanks

You need to check if i is in charCounter before you try to retrieve it:
if i in charCounter:
    charCounter[i] += 1
else:
    charCounter[i] = 1
Or alternatively:
if charCounter.get(i):
    ...

if charCounter[i]:
throws KeyError if the key does not exist. What you want to do is use if i in charCounter: instead:
if i in char_counter:
    char_counter[i] += 1
else:
    char_counter[i] = 1
Alternatively you could use get, which returns the value if the key exists, or the second (optional) argument if it does not:
char_counter[i] = char_counter.get(i, 0) + 1
However this counting pattern is so popular that a whole class exists for it: collections.Counter:
from collections import Counter
def my_func(my_string):
    return Counter(my_string)
Example:
>>> counts = my_func('hello! how are you ?')
>>> counts
Counter({' ': 4, 'o': 3, 'h': 2, 'l': 2, 'e': 2, '!': 1, 'r': 1, 'a': 1,
'?': 1, 'w': 1, 'u': 1, 'y': 1})
>>> counts[' ']
4
collections.Counter is a subclass of dictionary, so it would behave in the same way that an ordinary dictionary would do with item access and so forth.
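One difference from a plain dict is worth a small example: looking up a missing key in a Counter returns 0 instead of raising KeyError, so the original loop works without any membership check. The sketch below reuses the names from this answer and fills the Counter by hand only to show that behaviour.

```python
from collections import Counter

def my_func(my_string):
    char_counter = Counter()
    for ch in my_string:
        # a missing key counts as 0, so += 1 works on first sight of ch
        char_counter[ch] += 1
    return char_counter

counts = my_func('hello! how are you ?')
print(counts['h'])    # 2
print(counts['x'])    # 0, and no KeyError
print('x' in counts)  # False: the lookup did not add the key
```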

Converting CNC drillings from old to new system (using Python)

I have this kind of file (part):
H DX=615 DY=425 DZ=22.15 -AB C=0 T=0 R=999 *MM /"def" BX=2.5 BY=452.5 BZ=25 ;M20150710.
XBO X=100 Y=50 Z=5 V=1000 R=0 x=0 y=0 D=10 N="P" F=1 ;Test F1/10P.
...
which I want to convert to a new programming system. What I want to do is first read the header (H) and put the DX, DY and DZ values in respectively named variables. I managed to do this, but when I came to process my XBO line (a drilling, from which I need X, Y, Z, V, R, x, y, D, N, F and ;, also in separate variables) my code started looking very ugly very fast.
So I started over, and came up with this:
Debug = 1 # set to 0 to silence the debug output
charbuffr = ""
f = open("input.xxl") # open input file
for line in f:
    if Debug==1: print(line)
    for char in line:
        charbuffr=charbuffr+char
        if "H" in charbuffr:
            if Debug==1: print('HEADER found!')
            charbuffr=""
        if "XBO" in charbuffr:
            if Debug==1: print('XBO found!')
            charbuffr=""
This correctly identifies the separate commands H and XBO, but I'm kind of stuck now. I can use the same method to extract all the variables, from loops inside the H and XBO loops, but this does not seem like good coding...
Can anyone set me on the right foot please? I don't want a full solution, as I love coding (well my main job is coding for CNC machines, which seems easy now compared to Python), but would love to know which approach is best...

Instead of converting data types by hand, you could use ast.literal_eval. This helper function takes a list of the form ['a=2', 'b="abc"'] and converts it into a dictionary {'a': 2, 'b': 'abc'}:
import ast
def dict_from_row(row):
    """Convert a list of strings in the form 'name=value' into a dict."""
    res = []
    for entry in row:
        name, value = entry.split('=')
        res.append('"{name}": {value}'.format(name=name, value=value))
    dict_string = '{{{}}}'.format(', '.join(res))
    return ast.literal_eval(dict_string)
Now parsing the file becomes a bit simpler:
for line in f:
    row = line.split()
    if not row:
        continue
    if row[0] == 'H':
        header = dict_from_row(row[1:4])
    elif row[0] == 'XBO': # compare the first field, not the first character of the line
        xbo = dict_from_row(row[1:11])
Results:
>>> header
{'DX': 615, 'DY': 425, 'DZ': 22.15}
>>> xbo
{'D': 10, 'F': 1, 'N': 'P', 'R': 0, 'V': 1000, 'X': 100, 'Y': 50, 'Z': 5, 'x': 0, 'y': 0}

As inspiration, you can do something like this:
for raw_line in f:
    line = raw_line.split()
    if not line:
        continue
    if line[0] == 'H':
        header = {}
        for entry in line[1:4]:
            name, value = entry.split('=')
            header[name] = float(value)
    elif line[0] == 'XBO':
        xbo = {}
        for entry in line[1:11]:
            name, value = entry.split('=')
            try:
                xbo[name] = int(value)
            except ValueError:
                xbo[name] = value[1:-1] # strip the surrounding ""
Now header contains the DX, DY and DZ values:
{'DX': 615.0, 'DY': 425.0, 'DZ': 22.15}
and xbo the other values:
{'D': 10,
'F': 1,
'N': 'P',
'R': 0,
'V': 1000,
'X': 100,
'Y': 50,
'Z': 5,
'x': 0,
'y': 0}
Access the individual values in the dictionaries:
>>> header['DX']
615.0
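If you also need the text after the ; and would rather not count fields by position, a regular expression can pick out every name=value pair on a line. The following is only a sketch: the names PAIR, convert and parse_line are invented for this example, and it assumes that a ; never appears inside a quoted value and that fields without an = (such as -AB or *MM) can be ignored.

```python
import re

# name=value, where the value is a quoted string or a run of non-space characters
PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')

def convert(raw):
    """Turn a raw value into a str (if quoted), int or float; else leave it as-is."""
    if raw.startswith('"'):
        return raw[1:-1]
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw

def parse_line(line):
    """Split one line into (command, params, comment)."""
    body, _, comment = line.partition(';')
    fields = body.split()
    command = fields[0] if fields else None
    params = {name: convert(raw) for name, raw in PAIR.findall(body)}
    return command, params, comment.strip()

command, params, comment = parse_line(
    'XBO X=100 Y=50 Z=5 V=1000 R=0 x=0 y=0 D=10 N="P" F=1 ;Test F1/10P.')
print(command, params['X'], params['N'], comment)
```

Each line then yields a command name you can dispatch on, a dictionary of its values, and the trailing comment.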
