Extract words from a string

Extract words from a string - python

Sample Input:
'note - Part model D3H6 with specifications X30G and Y2A is having features 12H89.'
Expected Output:
['D3H6', 'X30G', 'Y2A', '12H89']
My code:
split_note = re.split(r'[.;,\s]\s*', note)
pattern = re.compile("^[a-zA-Z0-9]+$")
#if pattern.match(ini_str):
for a in n2:
    if pattern.match(a):
        alphaList.append(a)
I need to extract all the alphanumeric words from a split string and store them in a list.
The code above does not give the expected output.

Maybe this can solve the problem:
import re
# input string
stri = "Part model D3H6 with specifications X30 and Y2 is having features 12H89"
# words tokenization
split = re.findall(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|['\w\-]+", stri)
# this statement returns words containing both numbers and letters
print([word for word in split if re.match(r'^(?=.*[a-zA-Z])(?=.*[0-9])', word)])
#output: ['D3H6', 'X30', 'Y2', '12H89']

^ and $ anchor the beginning and end of a line (or of the whole string), not of a word.
Besides, your example words don't include lower case, so why add a-z?
Considering your example, if what you need is to fetch a word that always contains both at least one letter and at least one digit and always ends with a digit, this is the pattern:
\b[0-9A-Z]+\d+\b
If it may end with a letter rather than a digit, but still requires at least one digit and one letter, then it gets more complex:
\b[0-9A-Z]*\d|[A-Z][0-9A-Z]*\b
\b stands for a word boundary.
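Note that | has the lowest precedence in a regex, so an alternation written without a group splits the whole pattern into two independent halves. A minimal sketch of one way to require "at least one letter and at least one digit" inside a single token is to use two lookaheads. The name MIXED_TOKEN is made up for this illustration, and the pattern assumes tokens consist only of upper-case letters and digits, as in the sample input:

```python
import re

# Assumption: tokens are made of upper-case ASCII letters and digits only.
# The first lookahead demands a letter somewhere in the token,
# the second demands a digit; \b ... \b pins the match to a whole word.
MIXED_TOKEN = re.compile(r'\b(?=[0-9A-Z]*[A-Z])(?=[0-9A-Z]*\d)[0-9A-Z]+\b')

note = 'note - Part model D3H6 with specifications X30G and Y2A is having features 12H89.'
codes = MIXED_TOKEN.findall(note)
print(codes)
```

Because the lookaheads do not capture anything, findall returns the full matched tokens.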

Related

Derive words from string based on key words

I have a string (text_string) from which I want to find words based on my so called key_words. I want to store the result in a list called expected_output.
The expected output is always the word after the keyword (the number of spaces between the keyword and the output word doesn't matter). The expected_output word is then all characters until the next space.
Please see the example below:
text_string = "happy yes_no!?. why coding without paus happy yes"
key_words = ["happy","coding"]
expected_output = ['yes_no!?.', 'without', 'yes']
expected_output explanation:
yes_no!?. (since it comes after happy. All signs are included until the next space.)
without (since it comes after coding. The number of spaces surrounding the word doesn't matter)
yes (since it comes after happy)

You can solve it with a regex, for example:
import re
expected_output = re.findall(r'(?:{0})\s+?([^\s]+)'.format('|'.join(key_words)), text_string)
Explanation
(?:{0}) takes your key_words list and builds a non-capturing group containing all the words in the list.
\s+? uses a lazy quantifier to consume the whitespace after any of those keywords, up to the next character that isn't a space.
([^\s]+) captures the text right after the keyword, up to the next whitespace character.
Note: if you run this many times, for example inside a loop, compile the pattern once with re.compile beforehand to improve performance.
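As a rough sketch of the re.compile suggestion, the pattern can be built once and reused. The helper name build_pattern is hypothetical, and re.escape is an addition that the answer above does not use; it only matters if a keyword contains regex metacharacters:

```python
import re

def build_pattern(key_words):
    # Escape each keyword so characters like '.' or '+' are matched literally.
    alternatives = '|'.join(re.escape(word) for word in key_words)
    # Keyword, then whitespace, then capture everything up to the next whitespace.
    return re.compile(r'(?:{0})\s+(\S+)'.format(alternatives))

# Compile once, reuse for every string that needs to be searched.
pattern = build_pattern(["happy", "coding"])
result = pattern.findall("happy yes_no!?. why coding without paus happy yes")
print(result)
```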

We will use Python's re module to split the string on whitespace.
Then the idea is to go over each word and check whether it is one of your keywords. If it is, we set take_it to True, so that on the next iteration the word is added to taken, which stores all the words you're looking for.
import re
def find_next_words(text, keywords):
    take_it = False
    taken = []
    for word in re.split(r'\s+', text):
        # the previous word was a keyword, so keep this one
        if take_it:
            taken.append(word)
        take_it = word in keywords
    return taken
print(find_next_words("happy yes_no!?. why coding without paus happy yes", ["happy", "coding"]))
results in ['yes_no!?.', 'without', 'yes']

How do I get a program to print the number of words in a sentence and each word in order

I need to print how many characters there are in a sentence the user specifies, print how many words there are in a sentence the user specifies and print each word, the number of letters in the word, and the first and last letter in the word. Can this be done?

I want you to take your time and understand what is going on in the code below, and I suggest you read these resources.
http://docs.python.org/3/library/re.html
http://docs.python.org/3/library/functions.html#len
http://docs.python.org/3/library/functions.html
http://docs.python.org/3/library/stdtypes.html#str.split
import re
def count_letter(word):
    """(str) -> int
    Return the number of letters in a word.
    >>> count_letter('cat')
    3
    >>> count_letter('cat1')
    3
    """
    return len(re.findall('[a-zA-Z]', word))
if __name__ == '__main__':
    sentence = input('Please enter your sentence: ')
    words = re.sub(r"[^\w]", " ", sentence).split()
    # The number of characters in the sentence.
    print(len(sentence))
    # The number of words in the sentence.
    print(len(words))
    # Print all the words in the sentence, the number of letters, the first
    # and last letter.
    for i in words:
        print(i, count_letter(i), i[0], i[-1])
Please enter your sentence: hello user
10
2
hello 5 h o
user 4 u r

Please read Python's string documentation; it is self-explanatory. Here is a short explanation of the different parts with some comments.
We know that a sentence is composed of words, each of which is composed of letters. What we have to do first is split the sentence into a list of words. Each entry in this list is a word, and each word is stored as a sequence of characters that we can access individually.
sentence = "This is my sentence"
# split the sentence
words = sentence.split()
# use len() to obtain the number of elements (words) in the list words
print('There are {} words in the given sentence'.format(len(words)))
# go through each word
for word in words:
    # len() counts the number of elements again,
    # but this time it's the chars in the string
    print('There are {} characters in the word "{}"'.format(len(word), word))
    # python is a 0-based language, in the sense that the first element is indexed at 0
    # you can go backward in an array too using negative indices.
    #
    # However, notice that the last element is at -1 and second to last is -2,
    # it can be a little bit confusing at the beginning when we know that the second
    # element from the start is indexed at 1 and not 2.
    print('The first being "{}" and the last "{}"'.format(word[0], word[-1]))

We don't do your homework for you on Stack Overflow... but I will get you started.
The most important method you will need is one of these two (depending on the version of python):
Python3.X - input([prompt]),.. If the prompt argument is present, it is written
to standard output without a trailing newline. The function then
reads a line from input, converts it to a string (stripping a
trailing newline), and returns that. When EOF is read, EOFError is
raised. http://docs.python.org/3/library/functions.html#input
Python2.X raw_input([prompt]),... If the prompt argument is
present, it is written to standard output without a trailing newline.
The function then reads a line from input, converts it to a string
(stripping a trailing newline), and returns that. When EOF is read,
EOFError is raised. http://docs.python.org/2.7/library/functions.html#raw_input
You can use them like this:
>>> my_sentence = raw_input("Do you want us to do your homework?\n")
Do you want us to do your homework?
yes
>>> my_sentence
'yes'
As you can see, the text that was typed is stored in the my_sentence variable.
To get the number of characters in a string, you need to understand that a string is a sequence of characters, much like a list! So if you want to know the number of characters you can use:
len(s),... Return the length (the number of items) of an object.
The argument may be a sequence (string, tuple or list) or a mapping
(dictionary). http://docs.python.org/3/library/functions.html#len
I'll let you figure out how to use it.
Finally you're going to need to use a built in function for a string:
str.split([sep[, maxsplit]]),...Return a list of the words in the
string, using sep as the delimiter string. If maxsplit is given, at
most maxsplit splits are done (thus, the list will have at most
maxsplit+1 elements). If maxsplit is not specified or -1, then there
is no limit on the number of splits (all possible splits are made).
http://docs.python.org/2/library/stdtypes.html#str.split

Can't convert 'list'object to str implicitly Python

I am trying to import the alphabet but split it so that each character is in one array but not one string. splitting it works but when I try to use it to find how many characters are in an inputted word I get the error 'TypeError: Can't convert 'list' object to str implicitly'. Does anyone know how I would go around solving this? Any help appreciated. The code is below.
import string
alphabet = string.ascii_letters
print (alphabet)
splitalphabet = list(alphabet)
print (splitalphabet)
x = 1
j = year3wordlist[x].find(splitalphabet)
k = year3studentwordlist[x].find(splitalphabet)
print (j)
EDIT: Sorry, my explanation is kinda bad, I was in a rush. What I am wanting to do is count each individual letter of a word because I am coding a spelling bee program. For example, if the correct word is 'because', and the user who is taking part in the spelling bee has entered 'becuase', I want the program to count the characters and location of the characters of the correct word AND the user's inputted word and compare them to give the student a mark - possibly by using some kind of point system. The problem I have is that I can't simply say if it is right or wrong, I have to award 1 mark if the word is close to being right, which is what I am trying to do. What I have tried to do in the code above is split the alphabet and then use this to try and find which characters have been used in the inputted word (the one in year3studentwordlist) versus the correct word (year3wordlist).

There is a much simpler solution if you use the in keyword. You don't even need to split the alphabet in order to check if a given character is in it:
import string
year3wordlist = ['asdf123', 'dsfgsdfg435']
total_sum = 0
for word in year3wordlist:
    word_sum = 0
    for char in word:
        if char in string.ascii_letters:
            word_sum += 1
    total_sum += word_sum
# Number of characters that are ASCII letters:
# total_sum == 12
# Total number of characters in all words:
# sum(len(w) for w in year3wordlist) == 18
EDIT:
Since the OP comments he is trying to create a spelling bee contest, let me try to answer more specifically. The distance between a correctly spelled word and a similar string can be measured in many different ways. One of the most common ways is called 'edit distance' or 'Levenshtein distance'. This represents the number of insertions, deletions or substitutions that would be needed to rewrite the input string into the 'correct' one.
You can find that distance implemented in the Python-Levenshtein package. You can install it via pip:
$ sudo pip install python-Levenshtein
And then use it like this:
from __future__ import division
import Levenshtein
correct = 'because'
student = 'becuase'
distance = Levenshtein.distance(correct, student) # distance == 2
mark = ( 1 - distance / len(correct)) * 10 # mark == 7.14
The last line is just a suggestion on how you could derive a grade from the distance between the student's input and the correct answer.
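If installing the package is not an option, the same distance can be computed with a short dynamic-programming routine. This is only a sketch of the standard algorithm, not the package's implementation; the function name edit_distance is made up here, and the code assumes Python 3, where / is true division:

```python
def edit_distance(a, b):
    """Return the Levenshtein distance between strings a and b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

correct = 'because'
student = 'becuase'
distance = edit_distance(correct, student)
mark = (1 - distance / len(correct)) * 10
print(distance, round(mark, 2))
```

Only two rows of the table are kept at a time, so memory use grows with the length of the second word rather than with the product of both lengths.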

I think what you need is join:
>>> "".join(splitalphabet)
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

join is a method of str (an ordinary method, not a class method), so you can call it on the separator:
''.join(splitalphabet)
or, equivalently, through the class:
str.join('', splitalphabet)

To convert the list splitalphabet to a string, so you can use it with the find() function you can use separator.join(iterable):
"".join(splitalphabet)
Using it in your code:
j = year3wordlist[x].find("".join(splitalphabet))

I don't know why half the answers are telling you how to put the split alphabet back together...
To count the number of characters in a word that appear in the splitalphabet, do it the functional way:
count = len([c for c in word if c in splitalphabet])

import string
# making letters a set makes "ch in letters" very fast
letters = set(string.ascii_letters)
def letters_in_word(word):
    return sum(ch in letters for ch in word)
Edit: it sounds like you should look at Levenshtein edit distance:
from Levenshtein import distance
distance("because", "becuase") # => 2

While join recreates the string from the split list, you do not have to do that, as you can call find with the original string (alphabet). However, I do not think that is what you are trying to do. Note that the find call you are attempting looks for the whole of splitalphabet (actually alphabet) within year3wordlist[x], which will always fail (a result of -1).
If what you are trying to do is to get the indices of all the letters of a word from the word list within the alphabet, then for each letter in the word you need to determine its index within alphabet:
j = []
for c in word:
    j.append(alphabet.find(c))
print(j)
On the other hand, if you are attempting to find the index within the word of each character of the alphabet, then you need to loop over splitalphabet to get an individual character to find within the word. That is:
l = []
for c in splitalphabet:
    j = word.find(c)
    if j != -1:
        l.append((c, j))
print(l)
This gives the list of tuples showing those characters found and the index.
I just saw that you talk about counting the number of letters. I am not sure what you mean by this as len(word) gives the number of characters in each word while len(set(word)) gives the number of unique characters. On the other hand, are you saying that your word might have non-ascii characters in it and you want to count the number of ascii characters in that word? I think that you need to be more specific in what you want to determine.
If what you are doing is attempting to determine if the characters are all alphabetic, then all you need to do is use the isalpha() method on the word. You can either say word.isalpha() and get True or False or check each character of word to be isalpha()
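For illustration, the three built-in checks mentioned above behave like this on a sample word. The variable names are arbitrary and the sample strings are borrowed from the question:

```python
word = 'because'

total_chars = len(word)         # every character counts, repeats included
unique_chars = len(set(word))   # each distinct character counts once
only_letters = word.isalpha()   # True only if every character is a letter

print(total_chars, unique_chars, only_letters)

# A single non-letter character is enough to make isalpha() return False.
mixed_is_alpha = 'becuase1'.isalpha()
print(mixed_is_alpha)
```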

Check if a string is a possible abbreviation for a name

I'm trying to develop a Python algorithm to check if a string could be an abbreviation for another word. For example:
fck is a match for fc kopenhavn because it matches the first characters of the word. fhk would not match.
fco should not match fc kopenhavn because no one irl would abbreviate FC Kopenhavn as FCO.
irl is a match for in real life.
ifk is a match for ifk goteborg.
aik is a match for allmanna idrottskluben.
aid is a match for allmanna idrottsklubben. This is not a real team name abbreviation, but I guess it is hard to exclude it unless you apply domain-specific knowledge on how Swedish abbreviations are formed.
manu is a match for manchester united.
It is hard to describe the exact rules of the algorithm, but I hope my examples show what I'm after.
Update: I made a mistake in showing the strings with the matching letters uppercased. In the real scenario, all letters are lowercase, so it is not as easy as just checking which letters are uppercased.

This passes all the tests, including a few extra I created. It uses recursion. Here are the rules that I used:
1. The first letter of the abbreviation must match the first letter of the text.
2. The rest of the abbreviation (the abbreviation minus its first letter) must be an abbreviation for either:
   the remaining words, or
   the remaining text starting from any position in the first word.
tests=(
    ('fck','fc kopenhavn',True),
    ('fco','fc kopenhavn',False),
    ('irl','in real life',True),
    ('irnl','in real life',False),
    ('ifk','ifk gotebork',True),
    ('ifko','ifk gotebork',False),
    ('aik','allmanna idrottskluben',True),
    ('aid','allmanna idrottskluben',True),
    ('manu','manchester united',True),
    ('fz','faz zoo',True),
    ('fzz','faz zoo',True),
    ('fzzz','faz zoo',False),
    )
def is_abbrev(abbrev, text):
    abbrev=abbrev.lower()
    text=text.lower()
    words=text.split()
    if not abbrev:
        return True
    if abbrev and not text:
        return False
    if abbrev[0]!=text[0]:
        return False
    else:
        return (is_abbrev(abbrev[1:],' '.join(words[1:])) or
                any(is_abbrev(abbrev[1:],text[i+1:])
                    for i in range(len(words[0]))))
for abbrev,text,answer in tests:
    result=is_abbrev(abbrev,text)
    print(abbrev,text,result,answer)
    assert result==answer

Here's a way to accomplish what you seem to want to do:
import re
def is_abbrev(abbrev, text):
    pattern = ".*".join(abbrev.lower())
    return re.match("^" + pattern, text.lower()) is not None
The caret makes sure that the first character of the abbreviation matches the first character of the word, which should be true for most abbreviations.
Edit:
Your new update changed the rules a bit. By using "(|.*\s)" instead of ".*", the characters in the abbreviation will only match if they are next to each other, or if the next character appears at the start of a new word.
This will correctly match fck with FC Kopenhavn, but not fco.
However, matching aik with allmanna idrottskluben will not work, as that requires knowledge of the Swedish language and is not as trivial to do.
Here's the new code with the minor modification:
import re
def is_abbrev(abbrev, text):
    pattern = r"(|.*\s)".join(abbrev.lower())
    return re.match("^" + pattern, text.lower()) is not None

@Ocaso Protal asked in a comment how you should decide that aik is valid but aid is not, and he is right.
The algorithm that came to my mind is to work with a word threshold (the number of words separated by spaces).
def make_abbrev(string):
    words = string.strip().split()
    if len(words) > 2:
        # take the first letter of every word
        return ''.join(w[0] for w in words)
    elif len(words) == 2:
        # take two letters from the first word and one letter from the other
        return words[0][:2] + words[1][:1]
    else:
        # we have a single word: take the first three letters, or as you like
        return string.strip()[:3]
You have to define your own logic; you can't find an abbreviation blindly.

Your algorithm seems simple: the abbreviation is the concatenation of all upper-case letters.
So:
# word_i_want_to_check, _list_of_abbrevations and great_success are placeholders
upper_case_letters = "QWERTYUIOPASDFGHJKLZXCVBNM"
abbrevation = ""
for letter in word_i_want_to_check:
    if letter in upper_case_letters:
        abbrevation += letter
for abb in _list_of_abbrevations:
    if abb == abbrevation:
        great_success()

This might be good enough. It only checks that every letter of the abbreviation occurs somewhere in the word:
def is_abbrevation(abbrevation, word):
    lowword = word.lower()
    lowabbr = abbrevation.lower()
    for c in lowabbr:
        if c not in lowword:
            return False
    return True
print(is_abbrevation('fck', 'FC Kopenhavn'))

Find max length word from arbitrary letters

I have 10 arbitrary letters and need to find the longest matching word in a words file.
I started learning regular expressions only recently and can't seem to find a suitable pattern.
My first idea was to use a character set, [10 chars], but it also matches repeated characters and I don't know how to avoid that.
I started learning Python recently too, before regular expressions; maybe they aren't needed and this can be solved without them.
Using a "for this in that:" loop seems inappropriate, but maybe itertools (which I'm not familiar with) can do it easily.
I guess the solution is known even to novice programmers/scripters, but not to me.
Thanks

I'm guessing this is something like finding possible words given a set of Scrabble tiles, so that a character can be repeated only as many times as it is repeated in the original list.
The trick is to efficiently test each character of each word in your word file against a set containing your source letters. For each character, if found in the test set, remove it from the test set and proceed; otherwise, the word is not a match, and go on to the next word.
Python has a nice function all for testing a set of conditions based on elements in a sequence. all has the added feature that it will "short-circuit", that is, as soon as one item fails the condition, then no more tests are done. So if your first letter of your candidate word is 'z', and there is no 'z' in your source letters, then there is no point in testing any more letters in the candidate word.
My first shot at writing this was simply:
matches = []
for word in wordlist:
    testset = set(letters)
    if all(c in testset for c in word):
        matches.append(word)
Unfortunately, the bug here is that if the source letters contained a single 'm', a word with several 'm's would erroneously match, since each 'm' would separately match the given 'm' in the source testset. So I needed to remove each letter as it was matched.
I took advantage of the fact that set.remove(item) returns None, which Python treats as a Boolean False, and expanded my generator expression used in calling all. For each c in word, if it is found in testset, I want to additionally remove it from testset, something like (pseudo-code, not valid Python):
all(c in testset and "remove c from testset" for c in word)
Since set.remove returns a None, I can replace the quoted bit above with "not testset.remove(c)", and now I have a valid Python expression:
all(c in testset and not testset.remove(c) for c in word)
Now we just need to wrap that in a loop that checks each word in the list (be sure to build a fresh testset before checking each word, since our all test has now become a destructive test):
matches = []
for word in wordlist:
    # rebuild the test set for every word, because the check consumes it
    testset = set(letters)
    if all(c in testset and not testset.remove(c) for c in word):
        matches.append(word)
The final step is to sort the matches by descending length. We can pass a key function to sort. The builtin len would be good, but that would sort by ascending length. To change it to a descending sort, we use a lambda to give us not len, but -len:
matches.sort(key=lambda wd: -len(wd))
Now you can just print out the longest word, at matches[0], or iterate over all matches and print them out.
(I was surprised that this brute force approach runs so well. I used the 2of12inf.txt word list, containing over 80,000 words, and for a list of 10 characters, I get back the list of matches in about 0.8 seconds on my little 1.99GHz laptop.)

I think this code will do what you are looking for:
>>> words = open('file.txt').read()
>>> max(len(word) for word in set(words.split()))
If you require more sophisticated tokenising, for example if you're not using Latin text, you should use NLTK:
>>> import nltk
>>> words = open('file.txt').read()
>>> max(len(word) for word in set(nltk.word_tokenize(words)))

I assume you are trying to find out what is the longest word that can be made from your 10 arbitrary letters.
You can keep your 10 arbitrary letters in a dict along with the frequency they occur.
e.g., your 4 (using 4 instead of 10 for simplicity) arbitrary letters are: e, w, l, l. This would be in a dict as:
{'e':1, 'w':1, 'l':2}
Then for each word in the text file, see if all of the letters for that word can be found in your dict of arbitrary letters. If so, then that is one of your candidate words.
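So, given the words:
we
wall
well
all of the letters in well (and in we) can be found in your dict of arbitrary letters, so save each such word and its length for comparison against other words. wall is rejected because there is no 'a' among the letters.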
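The frequency-dict idea can be sketched with collections.Counter, which is a dict subclass that returns 0 for missing keys. Unlike a plain set, a Counter keeps track of how many copies of each letter are available, so a source with two l's still allows well. The function names can_build and longest_words and the sample word list are made up for this illustration:

```python
from collections import Counter

def can_build(word, available):
    """True if every letter of word occurs often enough in available."""
    needed = Counter(word)
    # Counter returns 0 for letters that are not available at all.
    return all(available[ch] >= count for ch, count in needed.items())

def longest_words(wordlist, letters):
    """Return the buildable words, longest first."""
    available = Counter(letters)
    matches = [word for word in wordlist if can_build(word, available)]
    return sorted(matches, key=len, reverse=True)

# Letters e, w, l, l as in the example above.
result = longest_words(['we', 'wall', 'well', 'wellll'], 'ewll')
print(result)
```

The first element of the returned list is then the longest word that can be made from the given letters.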
