Tokenizer method in python without using NLTK

New to Python - I need some help figuring out how to write a tokenizer method in Python without using any libraries like NLTK. How would I start? Thank you!

Depending on the complexity you can simply use the string split function.
# Words independent of sentences
words = raw_text.split(' ')
# Sentences and words
sentences = raw_text.split('. ')
words_in_sentences = [sentence.split(' ') for sentence in sentences]
If you want to do something more sophisticated, you can use the standard-library module re, which provides support for regular expressions.
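As a rough sketch of the regular-expression route: the function name tokenize and the token definition (a run of word characters, or a single punctuation mark) are assumptions made for illustration, not something specified in the question.

```python
import re

# Assumed token definition: a run of word characters, or one
# character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Return the word and punctuation tokens found in text, in order."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Hello, world!"))
```

Unlike split(' '), this separates punctuation from the neighbouring word and copes with tabs, newlines and repeated spaces.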

I am assuming you are talking about a tokenizer for a compiler. Such tokens are usually definable by a regular language, for which regular expressions/finite state automata are the natural solutions. An example:
import re
from collections import namedtuple
Token = namedtuple('Token', ['type','value'])
def lexer(text):
    IDENTIFIER = r'(?P<IDENTIFIER>[a-zA-Z_][a-zA-Z_0-9]*)'
    ASSIGNMENT = r'(?P<ASSIGNMENT>=)'
    NUMBER = r'(?P<NUMBER>\d+)'
    MULTIPLIER_OPERATOR = r'(?P<MULTIPLIER_OPERATOR>[*/])'
    ADDING_OPERATOR = r'(?P<ADDING_OPERATOR>[+-])'
    WHITESPACE = r'(?P<WHITESPACE>\s+)'
    EOF = r'(?P<EOF>\Z)'
    ERROR = r'(?P<ERROR>.)' # catch everything else, which is an error
    tokenizer = re.compile('|'.join([IDENTIFIER, ASSIGNMENT, NUMBER, MULTIPLIER_OPERATOR, ADDING_OPERATOR, WHITESPACE, EOF, ERROR]))
    seen_error = False
    for m in tokenizer.finditer(text):
        if m.lastgroup != 'WHITESPACE': # ignore whitespace
            if m.lastgroup == 'ERROR':
                if not seen_error:
                    yield Token(m.lastgroup, m.group())
                    seen_error = True # scan until we find a non-error input
            else:
                yield Token(m.lastgroup, m.group())
                seen_error = False
        else:
            seen_error = False
for token in lexer('foo = x12 * y / z - 3'):
    print(token)
Prints:
Token(type='IDENTIFIER', value='foo')
Token(type='ASSIGNMENT', value='=')
Token(type='IDENTIFIER', value='x12')
Token(type='MULTIPLIER_OPERATOR', value='*')
Token(type='IDENTIFIER', value='y')
Token(type='MULTIPLIER_OPERATOR', value='/')
Token(type='IDENTIFIER', value='z')
Token(type='ADDING_OPERATOR', value='-')
Token(type='NUMBER', value='3')
Token(type='EOF', value='')
The above code defines each token such as IDENTIFIER, ASSIGNMENT, etc. as simple regular expressions and then combines them into a single regular expression pattern using the | operator and compiles the expression as variable tokenizer. It then uses the regular expression finditer method with the input text as its argument to create a "scanner" that tries to match successive input tokens against the tokenizer regular expression. As long as there are matches, Token instances consisting of type and value are yielded by the lexer generator function. In this example, WHITESPACE tokens are not yielded on the assumption that whitespace is to be ignored by the parser and only serves to separate other tokens.
There is a catchall ERROR token defined as the last token that will match a single character if none of the other token regular expressions match (a . is used for this, which will not match a newline character unless flag re.S is used, but there is no need to match a newline since the newline character is being matched by the WHITESPACE token regular expression and is therefore a "legal" match). Special code is added to prevent successive ERROR tokens from being generated. In effect, the lexer generates an ERROR token and then throws away input until it can once again match a legal token.
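To see the error-recovery behaviour in isolation, here is a reduced sketch with only NUMBER, WHITESPACE and ERROR tokens. The name lex_with_recovery and the cut-down grammar are assumptions made to keep the example short; the seen_error logic mirrors the description above.

```python
import re

# Hypothetical three-token grammar, reduced from the lexer above for brevity.
TOKEN_RE = re.compile(r'(?P<NUMBER>\d+)|(?P<WHITESPACE>\s+)|(?P<ERROR>.)')

def lex_with_recovery(text):
    """Yield (type, value) pairs, reporting only the first of a run of errors."""
    seen_error = False
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup
        if kind == 'ERROR':
            if not seen_error:
                yield (kind, m.group())
            seen_error = True  # swallow further errors until a legal token
        else:
            seen_error = False
            if kind != 'WHITESPACE':  # whitespace is legal but not reported
                yield (kind, m.group())

print(list(lex_with_recovery('12 @#! 34')))
```

The three bad characters @#! produce a single ERROR token (for the @), after which lexing resumes normally at 34.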

Use gensim instead.
import gensim
tokenized_word = gensim.utils.simple_preprocess(str(sentences), deacc=True) # deacc=True strips accent marks; punctuation is dropped by the tokenizer itself

Related

wildcard match & replace and/or multiple string wildcard matching

I have two very related questions:
I want to match a string pattern with a wildcard (i.e. containing one or more '*' or '?')
and then form a replacement string with a second wildcard pattern. There the placeholders should refer to the same matched substring
(As for instance in the DOS copy command)
Example: pattern='*.txt' and replacement-pattern='*.doc':
I want aaa.txt --> aaa.doc and xx.txt.txt --> xx.txt.doc
Ideally it would work with multiple, arbitrarily placed wildcards: e.g., pattern='*.*' and replacement-pattern='XX*.*'.
Of course one needs to apply some constraints (e.g. greedy strategy). Otherwise patterns such as X*X*X are not unique for string XXXXXX.
or, alternatively, form a multi-match. That is I have one or more wildcard patterns each with the same number of wildcard characters. Each pattern is matched to one string but the wildcard characters should refer to the same matching text.
Example: pattern1='*.txt' and pattern2='*-suffix.txt'
Should match the pair string1='XX.txt' and string2='XX-suffix.txt' but not
string1='XX.txt' and string2='YY-suffix.txt'
In contrast to the first this is a more well defined problem as it avoids the ambiguity problem but is perhaps quite similar.
I am sure there are algorithms for these tasks, however, I am unable to find anything useful.
The Python library has fnmatch, but this does not support what I want to do.

There are many ways to do this, but I came up with the following, which should work for your first question. Based on your examples I’m assuming you don’t want to match whitespace.
This function turns the first passed pattern into a regex and the passed replacement pattern into a string suitable for the re.sub function.
import re
def replaceWildcards(string, pattern, replacementPattern):
    splitPattern = re.split(r'([*?])', pattern)
    splitReplacement = re.split(r'([*?])', replacementPattern)
    if (len(splitPattern) != len(splitReplacement)):
        raise ValueError("Provided pattern wildcards do not match")
    reg = ""
    sub = ""
    for idx, (regexPiece, replacementPiece) in enumerate(zip(splitPattern, splitReplacement)):
        if regexPiece in ["*", "?"]:
            if replacementPiece != regexPiece:
                raise ValueError("Provided pattern wildcards do not match")
            reg += f"(\\S{regexPiece if regexPiece == '*' else ''})" # Match anything but whitespace
            sub += f"\\{idx + 1}" # Regex matches start at 1, not 0
        else:
            reg += f"({re.escape(regexPiece)})"
            sub += f"{replacementPiece}"
    return re.sub(reg, sub, string)
Sample output:
replaceWildcards("aaa.txt xx.txt.txt aaa.bat", "*.txt", "*.doc")
# 'aaa.doc xx.txt.doc aaa.bat'
replaceWildcards("aaa10.txt a1.txt aaa23.bat", "a??.txt", "b??.doc")
# 'aab10.doc a1.txt aaa23.bat'
replaceWildcards("aaa10.txt a1-suffix.txt aaa23.bat", "a*-suffix.txt", "b*-suffix.doc")
# 'aaa10.txt b1-suffix.doc aaa23.bat'
replaceWildcards("prefix-2aaa10-suffix.txt a1-suffix.txt", "prefix-*a*-suffix.txt", "prefix-*b*-suffix.doc")
# 'prefix-2aab10-suffix.doc a1-suffix.txt'
Note f-strings require Python >=3.6.
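The second question (matching several patterns whose wildcards must capture the same text) can be approached in the same spirit. The following is only a sketch under stated assumptions: the helper names wildcard_to_regex and multi_match are made up for illustration, wildcards never match whitespace (as in the function above), and each pattern is matched greedily on its own, so ambiguous patterns such as X*X*X are not resolved jointly.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a compiled regex, one group per wildcard."""
    pieces = []
    for part in re.split(r'([*?])', pattern):
        if part == '*':
            pieces.append(r'(\S*)')   # any run of non-whitespace
        elif part == '?':
            pieces.append(r'(\S)')    # exactly one non-whitespace character
        else:
            pieces.append(re.escape(part))
    return re.compile(''.join(pieces))

def multi_match(patterns, strings):
    """True if every string matches its pattern and all wildcard captures agree."""
    captured = None
    for pattern, text in zip(patterns, strings):
        m = wildcard_to_regex(pattern).fullmatch(text)
        if m is None:
            return False
        if captured is None:
            captured = m.groups()      # captures of the first pair are the reference
        elif m.groups() != captured:
            return False
    return True

print(multi_match(['*.txt', '*-suffix.txt'], ['XX.txt', 'XX-suffix.txt']))
print(multi_match(['*.txt', '*-suffix.txt'], ['XX.txt', 'YY-suffix.txt']))
```

The first call accepts the pair because both wildcards capture XX; the second rejects it because the captures differ.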

how to define two tokens as one token?

I am trying to define two words separated by space as one token in my lexical analyzer
but when I pass an input like in out it says LexToken(KEYIN,'in',1,0)
and LexToken(KEYOUT,'out',1,3)
I need it to be like this LexToken(KEYINOUT,'in out',1,0)
PS: KEYIN and KEYOUT are two different tokens as the grammar's definition
Following is the test which causes the problem:
import lex
reserved = {'in': 'KEYIN', 'out': 'KEYOUT', 'in\sout': 'KEYINOUT'} # the problem is in here
tokens = ['PLUS', 'MINUS', 'IDENTIFIER'] + list(reserved.values())
t_MINUS = r'-'
t_PLUS = r'\+'
t_ignore = ' \t'
def t_IDENTIFIER(t):
    r'[a-zA-Z]+([(a-zA-Z)*|(\d+)*|(_*)])*'
    t.type = reserved.get(t.value, 'IDENTIFIER') # Check for reserved words
    return t
def t_error(t):
    print("Illegal character '%s'" % t.value[0], "at line", t.lexer.lineno, "at position", t.lexer.lexpos)
    t.lexer.skip(1)
lex.lex()
lex.input("in out inout + - ")
while True:
    tok = lex.token()
    print(tok)
    if not tok:
        break
Output:
LexToken(KEYIN,'in',1,0)
LexToken(KEYOUT,'out',1,3)
LexToken(IDENTIFIER,'inout',1,7)
LexToken(PLUS,'+',1,13)
LexToken(MINUS,'-',1,15)
None

This is your function which recognizes IDENTIFIERs and keywords:
def t_IDENTIFIER(t):
    r'[a-zA-Z]+([(a-zA-Z)*|(\d+)*|(_*)])*'
    t.type = reserved.get(t.value, 'IDENTIFIER') # Check for reserved words
    return t
First, it is clear that the keywords it can recognize are precisely the keys of the dictionary reserved, which are:
in
out
in\sout
Since in out is not a key in that dictionary (in\sout is not the same string), it cannot be recognised as a keyword no matter what t.value happens to be.
But t.value cannot be in out either, because t.value will always match the regular expression which controls t_IDENTIFIER:
[a-zA-Z]+([(a-zA-Z)*|(\d+)*|(_*)])*
and that regular expression never matches anything with a space character. (That regular expression has various problems; the characters *, (, ), | and + inside the second character class are treated as ordinary characters. See below for a correct regex.)
You could certainly match in out as a token in a manner similar to that suggested in your original question, prior to the edit. However,
t_KEYINOUT = r'in\sout'
will not work, because Ply does not use the common "maximum munch" algorithm for deciding which regular expression pattern to accept. Instead, it simply orders all of the patterns and picks the first one which matches, where the order consists of all of the tokenizing functions (in the order in which they are defined), followed by the token variables sorted in reverse order of regex length. Since t_IDENTIFIER is a function, it will be tried before the variable t_KEYINOUT. To ensure that t_KEYINOUT is tried first, it must be made into a function and placed before t_IDENTIFIER.
However, that is still not exactly what you want, since it will tokenize
in outwards
as
LexToken(KEYINOUT,'in out',1,0)
LexToken(IDENTIFIER,'wards',1,6)
rather than
LexToken(KEYIN,'in',1,0)
LexToken(IDENTIFIER,'outwards',1,3)
To get the correct analysis, you need to ensure that in out only matches if out is a complete word; in other words, if there is a word boundary at the end of the match. So one solution is:
reserved = {'in': 'KEYIN', 'out': 'KEYOUT'}
def t_KEYINOUT(t):
    r'in\sout\b'
    return t
def t_IDENTIFIER(t):
    r'[a-zA-Z][a-zA-Z0-9_]*'
    t.type = reserved.get(t.value, 'IDENTIFIER') # Check for reserved words
    return t
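The effect of rule order and of the trailing \b can be checked with the standard re module alone. This sketch is not Ply itself; it only assumes that "first pattern that matches wins" behaves like a left-to-right regex alternation, and the helper first_token is a made-up name.

```python
import re

IDENT = r'(?P<IDENTIFIER>[a-zA-Z][a-zA-Z0-9_]*)'
KEYINOUT = r'(?P<KEYINOUT>in\sout\b)'

# Alternation keeps the first branch that matches, even when a later
# branch would consume more text, so the order of the branches matters.
identifier_first = re.compile(IDENT + '|' + KEYINOUT)
keyword_first = re.compile(KEYINOUT + '|' + IDENT)

def first_token(pattern, text):
    """Return (token type, matched text) for the token at the start of text."""
    m = pattern.match(text)
    return m.lastgroup, m.group()

print(first_token(identifier_first, 'in out'))     # identifier rule tried first
print(first_token(keyword_first, 'in out'))        # keyword rule tried first
print(first_token(keyword_first, 'in outwards'))   # \b rejects the partial word
```

With the identifier rule first, in out yields only in; with the keyword rule first it yields in out, and the \b keeps in outwards from being split after out.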
However, it is almost certainly not necessary for the lexer to recognize in out as a single token. Since both in and out are keywords, it is easy to leave it to the parser to notice when they are used together as an in out designator:
parameter: KEYIN IDENTIFIER
         | KEYOUT IDENTIFIER
         | KEYIN KEYOUT IDENTIFIER

Python regex: re.search() is extremely slow on large text files

My code does the following:
Take a large text file (i.e. a legal document that is 300 pages as a PDF).
Find a certain keyword (e.g. "small").
Return n words to the left and n words to the right of the keyword.
NOTE: In this context, a "word" is any string of non-space characters. "$cow123" would be a word, but "health care" would be two words.
Here is my problem:
The code takes an extremely long time to run on the 300 pages, and that time tends to increase very quickly as n increases.
Here is my code:
import re
fileHandle = open('test_pdf.txt', mode='r')
document = fileHandle.read()
def search(searchText, doc, n):
    # Searches for text, and retrieves n words either side of the text, which are returned separately
    surround = r"\s*(\S*)\s*"
    groups = re.search(r'{}{}{}'.format(surround*n, searchText, surround*n), doc).groups()
    return groups[:n],groups[n:]
Here is the nasty culprit:
print(search(r"\$27.5 million", document, 10))
Here's how you can test this code:
Copy the function definition from the code block above and run the following:
t = "The world is a small place, we $.205% try to take care of it."
print(search(r"\$.205", t, 3))
I suspect that I have a nasty case of catastrophic backtracking, but I'm too new to regex to point my finger on the problem.
How do I speed up my code?

How about using re.search (or even string.find if you're only searching for fixed strings) to find the string, without any surrounding capturing groups. Then you use the position and length of the match (.start and .end on a re matchobject, or the return value of find plus the length of the search string). Get the substring before the match and do /\s*(\S*)\s*\z/ etc. on it, and get the substring after the match and do /\A\s*(\S*)\s*/ etc. on it.
Also, for help with your backtracking: you can use a pattern like \s+\S+\s+ instead of \s*\S*\s* (two chunks of whitespace have to be separated by a non-zero amount of non-whitespace, or else they wouldn't be two chunks), and you shouldn't butt up two consecutive \s*s like you do. I think r'(\S+)'.join([r'\s+'] * (n + 1)) would give the right pattern for capturing n previous words (but my Python is rusty, so check that).
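A minimal sketch of the first suggestion: locate the keyword on its own, then take the words from the text before and after it. The function name search_context is an assumption, and str.split() stands in for the anchored regexes mentioned above, which gives the same "runs of non-space characters" notion of a word.

```python
import re

def search_context(search_text, doc, n):
    """Return (n words before, n words after) the first occurrence of search_text.

    search_text is treated as a literal string, so no manual escaping is needed.
    Returns None when the text does not occur.
    """
    m = re.search(re.escape(search_text), doc)
    if m is None:
        return None
    before = doc[:m.start()].split()
    after = doc[m.end():].split()
    # Guard n == 0: a slice of [-0:] would return the whole list.
    return (before[-n:] if n else []), after[:n]

t = "The world is a small place, we $.205% try to take care of it."
print(search_context("$.205", t, 3))
```

Because the keyword search and the word splitting are separate linear passes, the running time no longer depends on how the optional groups interact.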

I see several problems here. The first, and probably worst, is that everything in your "surround" regex is not just optional but independently optional. Given this string:
"Lorem ipsum tritani impedit civibus ei pri"
...when searchText = "tritani" and n = 1, this is what it has to go through before it finds the first match:
regex:     \s*  \S*      \s*  tritani
offset 0:  ''   'Lorem'  ' '  FAIL
           ''   'Lorem'  ''   FAIL
           ''   'Lore'   ''   FAIL
           ''   'Lor'    ''   FAIL
           ''   'Lo'     ''   FAIL
           ''   'L'      ''   FAIL
           ''   ''       ''   FAIL
...then it bumps ahead one position and starts over:
offset 1:  ''   'orem'   ' '  FAIL
           ''   'orem'   ''   FAIL
           ''   'ore'    ''   FAIL
           ''   'or'     ''   FAIL
           ''   'o'      ''   FAIL
           ''   ''       ''   FAIL
... and so on. According to RegexBuddy's debugger, it takes almost 150 steps to reach the offset where it can make the first match:
position 5: ' ' 'ipsum' ' ' 'tritani'
And that's with just one word to skip over, and with n=1. If you set n=2 you end up with this:
\s*(\S*)\s*\s*(\S*)\s*tritani\s*(\S*)\s*\s*(\S*)\s*
I'm sure you can see where this is going. Note especially that when I change it to this:
(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)tritani(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)
...it finds the first match in a little over 20 steps. This is one of the most common regex anti-patterns: using * when you should be using +. In other words, if it's not optional, don't treat it as optional.
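A pattern of that shape can be generated for any n. This is a sketch, not the asker's code: the name build_context_pattern is made up, the keyword is escaped as a literal, and it assumes the keyword is separated from its neighbours by whitespace (so it would not handle a keyword embedded in a longer word).

```python
import re

def build_context_pattern(search_text, n):
    """Compile a regex capturing n mandatory words on each side of search_text."""
    word_before = r'(\S+)\s+'   # a word, then at least one whitespace character
    word_after = r'\s+(\S+)'    # at least one whitespace character, then a word
    return re.compile(word_before * n + re.escape(search_text) + word_after * n)

text = "Lorem ipsum tritani impedit civibus ei pri"
pattern = build_context_pattern("tritani", 2)
print(pattern.pattern)
print(pattern.search(text).groups())
```

Every quantifier here is +, so the engine cannot match a word or a gap as the empty string, which removes the alternatives it had to try with *.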
Finally, you may have noticed the \s*\s* the auto-generated regex

You could try using mmap and appropriate regex flags, eg (untested):
import re
import mmap
with open('your file', 'rb') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    # in Python 3, your_re must be a bytes pattern to search an mmap
    for match in re.finditer(your_re, mf, flags=re.DOTALL):
        print(match.group()) # do something with your match
This'll only keep memory usage lower though...
The alternative is to have a sliding window of words (simple example of just single word before and after)...:
import re
import mmap
from itertools import islice, tee, zip_longest # izip_longest in Python 2
with open('testingdata.txt', 'rb') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    # bytes pattern, because an mmap exposes bytes in Python 3
    words = (m.group() for m in re.finditer(rb'\w+', mf, flags=re.DOTALL))
    grouped = [islice(el, idx, None) for idx, el in enumerate(tee(words, 3))]
    for group in zip_longest(*grouped, fillvalue=b''):
        if group[1] == b'something': # check criteria for group
            print(group)
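The sliding-window idea generalises to n words on each side. The sketch below uses collections.deque instead of tee/islice; the function name contexts is an assumption, and for simplicity it skips occurrences that have fewer than n words on either side.

```python
from collections import deque

def contexts(words, keyword, n):
    """Yield (before, after) tuples of n words around each occurrence of keyword."""
    size = 2 * n + 1
    window = deque(maxlen=size)  # the oldest word drops off automatically
    for word in words:
        window.append(word)
        # Once the window is full, the candidate keyword sits in the middle.
        if len(window) == size and window[n] == keyword:
            items = tuple(window)
            yield items[:n], items[n + 1:]

sample = "The world is a small place".split()
print(list(contexts(sample, "a", 2)))
```

Since words can be any iterable, the generator of regex matches over the mmap could be passed in directly, and only 2n + 1 words are held in memory at a time.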

I think you are going about this completely backwards (I'm a little confused as to what you are doing in the first place!)
I would recommend checking out the re_search function I developed in the textools module of my cloud toolbox
with re_search you could solve this problem with something like:
from cloudtb import textools
data_list = textools.re_search('my match', pdf_text_str) # search for character objects
# you now have a list of strings and RegPart objects. Parse through them:
for i, regpart in enumerate(data_list):
    if isinstance(regpart, basestring):
        words = textools.re_search(r'\w+', regpart)
        # do stuff with words
    else:
        # I think you are ignoring these? Not totally sure
        pass
Here is a link on how to use and how it works:
http://cloudformdesign.com/?p=183
In addition to this, your regular expressions would also be printed out in more readable format.
You might also want to check out my tool Search The Sky or the similar tool Kiki to help you build and understand your regular expressions.

Efficient way to search for invalid characters in python

I am building a forum application in Django and I want to make sure that users don't enter certain characters in their forum posts. I need an efficient way to scan their whole post to check for the invalid characters. What I have so far is the following, although it does not work correctly and I do not think the idea is very efficient.
def clean_topic_message(self):
    topic_message = self.cleaned_data['topic_message']
    words = topic_message.split()
    if (topic_message == ""):
        raise forms.ValidationError(_(u'Please provide a message for your topic'))
    for word in words:
        if (re.match(r'[^<>/\{}[]~`]$',topic_message)):
            raise forms.ValidationError(_(u'Topic message cannot contain the following: <>/\{}[]~`'))
    return topic_message
Thanks for any help.

For a regex solution, there are two ways to go here:
Find one invalid char anywhere in the string.
Validate every char in the string.
Here is a script that implements both:
import re
topic_message = 'This topic is a-ok'
# Option 1: Invalidate one char in string.
re1 = re.compile(r"[<>/{}[\]~`]")
if re1.search(topic_message):
    print("RE1: Invalid char detected.")
else:
    print("RE1: No invalid char detected.")
# Option 2: Validate all chars in string.
re2 = re.compile(r"^[^<>/{}[\]~`]*$")
if re2.match(topic_message):
    print("RE2: All chars are valid.")
else:
    print("RE2: Not all chars are valid.")
Take your pick.
Note: the original regex erroneously has a right square bracket in the character class which needs to be escaped.
Benchmarks: After seeing gnibbler's interesting solution using set(), I was curious to find out which of these methods would actually be fastest, so I decided to measure them. Here are the benchmark data and statements measured and the timeit result values:
Test data:
r"""
TEST topic_message STRINGS:
ok: 'This topic is A-ok. This topic is A-ok.'
bad: 'This topic is <not>-ok. This topic is {not}-ok.'
MEASURED PYTHON STATEMENTS:
Method 1: 're1.search(topic_message)'
Method 2: 're2.match(topic_message)'
Method 3: 'set(invalid_chars).intersection(topic_message)'
"""
Results:
r"""
Seconds to perform 1000000 Ok-match/Bad-no-match loops:
Method  Ok-time  Bad-time
1       1.054    1.190
2       1.830    1.636
3       4.364    4.577
"""
The benchmark tests show that Option 1 is slightly faster than option 2 and both are much faster than the set().intersection() method. This is true for strings which both match and don't match.
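A comparison along these lines could be reproduced with timeit. This is only a sketch: the wrapper function names are made up, the loop count is reduced to keep the run short, and the timings it prints will differ from the figures above depending on machine and Python version.

```python
import re
import timeit

INVALID_CHARS = "<>/{}[]~`"
re1 = re.compile(r"[<>/{}\[\]~`]")        # method 1: find one invalid char
re2 = re.compile(r"^[^<>/{}\[\]~`]*$")    # method 2: validate every char

def has_invalid_search(message):
    return re1.search(message) is not None

def has_invalid_match(message):
    return re2.match(message) is None

def has_invalid_set(message):
    return bool(set(INVALID_CHARS).intersection(message))

ok = 'This topic is A-ok. This topic is A-ok.'
bad = 'This topic is <not>-ok. This topic is {not}-ok.'

for func in (has_invalid_search, has_invalid_match, has_invalid_set):
    # Time one pass over the ok string and one over the bad string per loop.
    seconds = timeit.timeit(lambda: (func(ok), func(bad)), number=10000)
    print(func.__name__, round(seconds, 4))
```

All three functions give the same answer for a given message; only their cost differs.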

You have to be much more careful when using regular expressions - they are full of traps.
In the case of [^<>/\{}[]~] the first ] closes the character class, which is probably not what you intended. If you want to use ] in a character class, it has to be the first character after the [ (or be escaped as \]), e.g. []^<>/\{}[~]
A simple test confirms this:
>>> import re
>>> re.search("[[]]","]")
>>> re.search("[][]","]")
<_sre.SRE_Match object at 0xb7883db0>
Regex is overkill for this problem anyway:
def clean_topic_message(self):
    topic_message = self.cleaned_data['topic_message']
    invalid_chars = r'^<>/\{}[]~`$'
    if (topic_message == ""):
        raise forms.ValidationError(_(u'Please provide a message for your topic'))
    if set(invalid_chars).intersection(topic_message):
        raise forms.ValidationError(_(u'Topic message cannot contain the following: %s'%invalid_chars))
    return topic_message

If efficiency is a major concern I would re.compile() the re string, since you're going to use the same regex many times.

re.match and re.search behave differently. Splitting words is not required to search using regular expressions.
import re
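symbols_re = re.compile(r"[<>/\\{}\[\]~`]") # ] and \ are escaped inside the character class
if symbols_re.search(self.cleaned_data['topic_message']):
    raise forms.ValidationError(_(u'Topic message cannot contain the following: <>/\{}[]~`'))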

I can't say what would be more efficient, but you certainly should get rid of the $ (unless it's an invalid character for the message)... right now you only match the re if the characters are at the end of topic_message because $ anchors the match to the right-hand side of the line.
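The difference between re.match and re.search, and the effect of the trailing $, can be seen on a short string. This is a small illustrative sketch with a reduced character class, not the asker's full pattern.

```python
import re

message = "a<b"

at_start = re.match(r"[<>]", message)        # match() only tries position 0
anywhere = re.search(r"[<>]", message)       # search() scans the whole string
anchored = re.search(r"[<>]$", message)      # "$" requires the char to be last
anchored_hit = re.search(r"[<>]$", "ab<")    # here the invalid char is last

print(at_start, anywhere, anchored, anchored_hit)
```

Only the unanchored search reports the < in the middle of the message, which is the behaviour a validator needs.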

In any case you need to scan the entire message. So wouldn't something simple like this work?
def checkMessage(topic_message):
    for char in topic_message:
        if char in "<>/\\{}[]~`":
            return False
    return True

is_valid = not any(k in text for k in '<>/{}[]~`')

I agree with gnibbler, regex is overkill for this situation. After removing these unwanted chars you'll probably want to remove unwanted words as well; here's a basic way to do it:
def remove_bad_words(title):
    '''Helper to remove bad words from a sentence based on a list of words.
    '''
    # BAD_WORDS is a list of unwanted words.
    # Build a new list instead of removing items while iterating,
    # which would skip the word following each removed one.
    word_list = [word for word in title.split(' ') if word not in BAD_WORDS]
    # let's build the string again
    return u' '.join(word_list)

Example: just tailor to your needs.
### valid chars: 0-9 , a-z, A-Z only
import re
REGEX_FOR_INVALID_CHARS=re.compile( r'[^0-9a-zA-Z]+' )
list_of_invalid_chars_found=REGEX_FOR_INVALID_CHARS.findall( topic_message )

python regex for repeating string

I am wanting to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print(' matched.groups()', matched.groups())
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.

In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ; and then use a regular expression to return all matches, not just a single match, with re.findall('c?[0-9]+', text).

You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # returns a boolean if it starts with this string
s.endswith(";") # returns a boolean if it ends with this string
s[6:-1].split(', ') # will give you a list of tokens separated by the string ", "
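For the string from the question, which has extra text after the semicolon, the same string tools can be combined as below. This is a sketch: the function name parse_codes and the choice to return None on invalid input are assumptions.

```python
def parse_codes(s):
    """Return the codes between 'start:' and the first ';', or None if malformed."""
    prefix = "start:"
    if not s.startswith(prefix):
        return None
    # partition() splits at the first ';' and tolerates trailing text after it.
    body, separator, _rest = s[len(prefix):].partition(";")
    if not separator:
        return None  # no terminating ';' found
    # strip() removes the spaces that surround each code.
    return [code.strip() for code in body.split(",") if code.strip()]

print(parse_codes("start: c12354, c3456, 34526; other stuff that I don't care about"))
```

Using partition rather than s[6:-1] means the trailing "other stuff" is ignored, and strip() avoids the leading space that a plain split(', ') leaves on the first code.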

This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException as exc:
            # Oh noes: string doesn't match!
            continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.

import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
    res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']
