I want to tokenize a sentence (purely with regex without having to install NLTK or similar). I want the tokenizer to:
split around hyphens '-' and apostrophes "'" --> e.g. ( I haven't heard good-news:: I haven ' t heard good - news )
split around all other characters if only followed by a space --> e.g. ( I have, some 16,000.13 dollars (A) to spare :: I have , some 16,000.13 dollars ( A ) to spare)
I created this function, but it still does not split around "(":
def tok(txt):
    # We want only apostrophes and hyphens to be splitting points, and all other
    # non-alpha characters not followed by a space to be non-splitting points.
    sub = re.sub(r'(?u)(\W)(?!\S)', r' \1 ', txt)
    sub = re.sub(r"(?u)([\-\'\[\(\{])", r' \1 ', sub)  # will need to add more exceptions
    return [v for v in re.split(r'(?u)\s+', sub) if v]
A Python Oddity: re.split doesn't split on zero-width matches (before Python 3.7)
Most regex engines allow you to split on a zero-width match, i.e., at a certain position in the string. For instance, you can use the lookbehind (?<=&) to split wherever the previous character is a &. Historically, however, Python's re.split refused patterns that can match the empty string, unless you used the third-party regex module with the V1 flag turned on. (This changed in Python 3.7, where re.split gained support for zero-width matches.) To see the difference on an older interpreter, try:
re.split("(?=&)", "a&fine&tree")
and
regex.split("(?V1)(?=&)", "a&fine&tree")
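On Python 3.7 or newer, the stdlib re call works on its own, so the regex module is no longer required for this particular trick. A quick check:

```python
import re

# Since Python 3.7, re.split accepts patterns that can match an empty string,
# so splitting on the zero-width lookahead (?=&) works with plain re:
parts = re.split(r"(?=&)", "a&fine&tree")
print(parts)
# => ['a', '&fine', '&tree']
```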
Splitting vs. Match All
So if we want to stick to re, splitting may be awkward. Luckily, splitting a string and matching all the tokens of interest are two sides of the same coin. In this case, matching is faster and gives you the same list.
Please note that I have adjusted your rules based on your desired output, but we can tweak them back. For instance, it does sound like you want to split unconditionally around ( since there is no space after ( in (A. Also, digits sound like they should be treated like letters.
Just use this:
result = re.findall("[-'()]|[^a-z0-9 ](?= )|(?:[a-z0-9]|[^-'()a-z0-9 ](?! ))+", subject, re.IGNORECASE)
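A quick sanity check of that pattern against both example sentences from the question (standard re only):

```python
import re

# First alternative: standalone -, ', ( or ); second: any other non-alphanumeric
# followed by a space; third: runs of alphanumerics, allowing internal punctuation
# (like the , and . in 16,000.13) when it is not followed by a space.
pattern = r"[-'()]|[^a-z0-9 ](?= )|(?:[a-z0-9]|[^-'()a-z0-9 ](?! ))+"

print(re.findall(pattern, "I haven't heard good-news", re.IGNORECASE))
# => ['I', 'haven', "'", 't', 'heard', 'good', '-', 'news']

print(re.findall(pattern, "I have, some 16,000.13 dollars (A) to spare", re.IGNORECASE))
# => ['I', 'have', ',', 'some', '16,000.13', 'dollars', '(', 'A', ')', 'to', 'spare']
```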
The tokens:
I
haven
'
t
heard
good
-
news
I
have
,
some
16,000.13
dollars
(
A
)
to
spare
References
The Elements of Good Regex Style (search for "split")
re does not split on zero-width matches
I have a list of strings taken from chat log data, and I am trying to find the optimal method of splitting the speaker from the content of speech. Two examples are as follows:
mystr = ['bob123 (5:09:49 PM): hi how are you',
'jane_r16 (12/01/2020 1:39:12 A.M.) : What day is it today?']
Note that, while they are broadly similar, there are some stylistic differences I need to account for (inclusion of dates, period marks, extra spaces etc.). I require a way to standardize and split such strings, and others like these, into something like the following list:
mystrList = [['bob123','hi how are you'],['jane_r16','What day is it today']]
Given that I do not need the times, numbers, or most punctuation, I thought a reasonable first step would be to remove anything non-essential. After doing so, I now have the following:
myCleanstr = ['bob(): hi how are you','janer() : What day is it today?']
Doing this has given me a fairly distinctive sequence of characters, "():", in each string that is unlikely to appear elsewhere in the same string. My subsequent thinking was to use this as a delimiter to split each string using regex:
mystr_split = [re.split(r'\(\)( ){,2}:', i, maxsplit=1, flags=re.I) for i in myCleanstr]
Here, my intention was the following:
\(\) Find a sequence of an open followed by a closed parentheses symbol
( ){,2} Then find zero, one, or two whitespaces
: Then find a colon symbol
However, in both instances, I receive three objects per string. I get the correct speaker ID, and speech content. But, in the first string I get an additional NoneType Object, and in the second string I get an additional string filled with a single white-space.
I had assumed that including maxsplit=1 would mean that the process would end after the first split has been found, but this doesn't appear to be the case. Rather than filter my results on the content I need I would like to understand why it is performing as it is.
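The extra items are not extra splits: when the pattern passed to re.split contains capturing groups, whatever each group captured is inserted into the result list. Here ( ){,2} is a capturing group, so you get its capture back: None when it matched zero spaces, and ' ' when it matched one. A minimal illustration using the cleaned strings from above:

```python
import re

# The capturing group ( ) is returned by re.split alongside the split parts.
# Zero spaces between () and : means the group never participated -> None:
print(re.split(r'\(\)( ){,2}:', 'bob(): hi how are you', maxsplit=1))
# => ['bob', None, ' hi how are you']

# One space between () and : -> the group captured ' ':
print(re.split(r'\(\)( ){,2}:', 'janer() : What day is it today?', maxsplit=1))
# => ['janer', ' ', ' What day is it today?']

# Making the group non-capturing removes the extra item:
print(re.split(r'\(\)(?: ){,2}:', 'bob(): hi how are you', maxsplit=1))
# => ['bob', ' hi how are you']
```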
You can use
^(\S+)\s*\([^()]*\)\s*:\s*(.+)
Or, if the name can have whitespaces:
^(\S[^(]*?)\s*\([^()]*\)\s*:\s*(.+)
See the regex demo #1 and regex demo #2. The regex matches:
^ - start of string
(\S+) - Group 1: any one or more non-whitespace chars
[^(]*? - zero or more chars other than a ( char, as few as possible
\s* - zero or more whitespaces
\( - a ( char
[^()]* - zero or more chars other than ( and )
\) - a ) char
\s*:\s* - a colon enclosed with zero or more whitespaces
(.+) - Group 2: any one or more chars other than line break chars, as many as possible (the whole rest of the line).
See the Python demo:
import re
result = []
mystr = ['bob123 (5:09:49 PM): hi how are you', 'jane_r16 (12/01/2020 1:39:12 A.M.) : What day is it today?']
for s in mystr:
    m = re.search(r'^(\S+)\s*\([^()]*\)\s*:\s*(.+)', s)
    if m:
        result.append(list(m.groups()))
print(result)
# => [['bob123', 'hi how are you'], ['jane_r16', 'What day is it today?']]
Apologies if this is cross-listed; I searched for a while!
I'm working with some very large, very messy data in Pandas. The variable of interest is a string, and contains one or more instances of business names with(out) typical business suffixes (e.g., LLC, LP, LTD). For example, I might have "ABC LLC XYZ,LLC XYZ, LTD". My goal is to find the first instance of a suffix, matched from a list. I also need to extract everything up to this first match. For the above example, I'd expect to find/extract "ABC LLC". Consider the following data:
sfx = ['LLC','LP','LTD']
dat = pd.DataFrame({'name':['ABC LLC XYZ,LLC XYZ, LTD','IJK LP, ADDRESS']})
So far, I've accomplished this for a single case in a convoluted way that isn't working for me:
one_string = 'ABC LLC XYZ,LLC XYZ, LTD'
indexes = []
keywords = dict()
for sf in sfx:
    indexes.append(one_string.index(sf, 0))
    keywords[one_string.index(sf, 0)] = sf
indexes.sort()
print(one_string[0:indexes[0]] + keywords[indexes[0]])
I'm looking for a more efficient (possibly vectorized) way of doing this for an entire column. In addition, I need to incorporate regex in order to avoid extracting suffixes when the same letter combinations just happen to appear in the text. The regex pattern I need to match might look something like this (LLC appears after space or comma and is at the end of a word):
reg_pattern = r'(?<=[\s\,])LLC\b|(?<=[\s\,])LP\b|(?<=[\s\,])LTD\b'
UPDATE
Straightforward solution by Wiktor. I also realized that once I have extracted what precedes the suffix, I will then need to extract everything that comes after it separately. Throwing the solution into a positive lookbehind didn't work. Very appreciative!
To get the texts that come before and including the keywords, you may use
pattern = r"^(.*?\b(?:{}))(?!\w)".format("|".join(map(re.escape, names)))
and then
df['results'] = df['texts'].str.extract(pattern, expand=False)
Adjust the column names to match your code. The pattern will look like ^(.*?\b(?:LLC|LP|LTD))(?!\w) and will mean:
^ - start of string
(.*?\b(?:LLC|LP|LTD)) - Group 1 (this value will be returned by .str.extract):
.*? - any 0+ chars other than line break chars, as few as possible
\b - a word boundary
(?:LLC|LP|LTD) - one of the alternatives: LLC, LP or LTD
(?!\w) - not followed with a word char: letter, digit or _.
To get all text after a match, you may use
pattern = r"\b(?:{})(?!\w)(.*)".format("|".join(map(re.escape, names)))
Here, the pattern will look like \b(?:LLC|LP|LTD)(?!\w)(.*) and it first matches one of the names as a whole word, and then captures into Group 1 all the rest of the line (matched with (.*) - any 0 or more chars other than line break chars).
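For reference, here are the two patterns side by side on the question's sample data. This sketch uses plain re on a list of strings; with pandas you would pass the same pattern strings to .str.extract as shown above.

```python
import re

sfx = ['LLC', 'LP', 'LTD']
names = ['ABC LLC XYZ,LLC XYZ, LTD', 'IJK LP, ADDRESS']

# Text up to and including the first suffix:
before_pat = re.compile(r"^(.*?\b(?:{}))(?!\w)".format("|".join(map(re.escape, sfx))))
# Text after the first suffix:
after_pat = re.compile(r"\b(?:{})(?!\w)(.*)".format("|".join(map(re.escape, sfx))))

print([before_pat.search(s).group(1) for s in names])
# => ['ABC LLC', 'IJK LP']
print([after_pat.search(s).group(1) for s in names])
# => [' XYZ,LLC XYZ, LTD', ', ADDRESS']
```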
This is the text I am referring to:
' High 4:55AM 1.3m Low 11:35AM 0.34m High 5:47PM 1.12m Low 11:40PM 0.47m First Light 5:59AM Sunrise 6:24AM Sunset 5:01PM Last Light 5:27PM '
Using Python and regex, I only want to capture: "High 4:55AM 1.3m Low 11:35AM 0.34" (which is the first part of the text, and ideally I'd like to capture it without the extra spaces).
I've tried this regex so far: .{44}
It manages to capture the group of text I want, which is the first 44 characters, but it also captures subsequent groups of 44 characters which I don't want.
If you really just want the first 44 characters, you don't need a regex: you can simply use the Python string-slice operator:
first_44_characters = s[:44]
However, a regex is much more powerful, and could account for the fact that the length of the section you're interested in might change. For example, if the time is 10AM instead of 4AM the length of that part might change (or might not, maybe that's what the space padding is for?). In that case, you can capture it with a regex like this:
>>> re.match(r'\s+(High.*?)m', s).group(1)
'High 4:55AM 1.3'
\s matches any whitespace character, + matches one or more of the preceding element, the parentheses define a group starting with High and containing a minimal-length sequence of any character, and the m after the parentheses says the group ends right before a lowercase m character.
If you want, you can also use the regex to extract the individual parts of the sequence:
>>> re.match(r'\s+(High)\s+(\d+\:\d+)(AM|PM)\s+(\d+\.\d+)m', s).groups()
('High', '4:55', 'AM', '1.3')
This regex will capture everything starting with the first "High" until the next "High" (not included), or until the end of the string if there is no next one. It gets rid of the extra spaces at the beginning and end of the captured group.
^\s*(High.*?)\s*(?=$|High)
If you want to reduce all multiple spaces inside the captured group to single ones, you can afterwards do a replacement, substituting the regex " +" with " ".
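Both steps together, applied to the question's string (a quick check with the stdlib re):

```python
import re

s = ' High 4:55AM 1.3m Low 11:35AM 0.34m High 5:47PM 1.12m Low 11:40PM 0.47m First Light 5:59AM Sunrise 6:24AM Sunset 5:01PM Last Light 5:27PM '

# Capture from the first "High" up to (but not including) the next "High",
# trimming surrounding whitespace via \s* outside the group:
m = re.search(r'^\s*(High.*?)\s*(?=$|High)', s)
part = m.group(1)
print(part)
# => High 4:55AM 1.3m Low 11:35AM 0.34m

# Collapse any runs of spaces inside the capture:
print(re.sub(' +', ' ', part))
```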
I'm looking to find words in a string that match a specific pattern.
Problem is, if the words are part of an email address, they should be ignored.
To simplify, the pattern of the "proper words" \w+\.\w+ - one or more characters, an actual period, and another series of characters.
The sentence that causes problem, for example, is a.a b.b:c.c d.d#e.e.e.
The goal is to match only [a.a, b.b, c.c] . With most Regexes I build, e.e returns as well (because I use some word boundary match).
For example:
>>> re.findall(r"(?:^|\s|\W)(?<!#)(\w+\.\w+)(?!#)\b", "a.a b.b:c.c d.d#e.e.e")
['a.a', 'b.b', 'c.c', 'e.e']
How can I match only among words that do not contain "#"?
I would definitely clean it up first and simplify the regex.
first we have
words = re.split(r':|\s', "a.a b.b:c.c d.d#e.e.e")
then filter out the words that have an # in them.
words = [word for word in words if re.match(r'^((?!#).)*$', word)]
Properly parsing email addresses with a regex is extremely hard, but for your simplified case, with a simple definition of word ~ \w\.\w and the email ~ any sequence that contains #, you might find this regex to do what you need:
>>> re.findall(r"(?:^|[:\s]+)(\w+\.\w+)(?=[:\s]+|$)", "a.a b.b:c.c d.d#e.e.e")
['a.a', 'b.b', 'c.c']
The trick here is not to focus on what comes in the next or previous word, but on what the word currently captured has to look like.
Another trick is in properly defining word separators. Before the word we'll allow multiple whitespaces, : and string start, consuming those characters, but not capturing them. After the word we require almost the same (except string end, instead of start), but we do not consume those characters - we use a lookahead assertion.
You may match the email-like substrings with \S+#\S+\.\S+ and match and capture your pattern with (\w+\.\w+) in all other contexts. Use re.findall to only return captured values and filter out empty items (they will be in re.findall results when there is an email match):
import re
rx = r"\S+#\S+\.\S+|(\w+\.\w+)"
s = "a.a b.b:c.c d.d#e.e.e"
res = list(filter(None, re.findall(rx, s)))
print(res)
# => ['a.a', 'b.b', 'c.c']
See the Python demo.
See the regex demo.
I'm having an issue in python creating a regex to get each occurance that matches a regex.
I have this code that I made that I need help with.
strToSearch= "1A851B 1C331 1A3X1 1N111 1A3 and a whole lot of random other words."
print(re.findall(r'\d{1}[A-Z]{1}\d{3}', strToSearch.upper()))  # 1C331, 1N111
print(re.findall(r'\d{1}[A-Z]{1}\d{1}[X]\d{1}', strToSearch.upper()))  # 1A3X1
print(re.findall(r'\d{1}[A-Z]{1}\d{3}[A-Z]{1}', strToSearch.upper()))  # 1A851B
print(re.findall(r'\d{1}[A-Z]{1}\d{1}', strToSearch.upper()))  # 1A3
>['1A851', '1C331', '1N111']
>['1A3X1']
>['1A851B']
>['1A8', '1C3', '1A3', '1N1', '1A3']
As you can see it returns "1A851" in the first one, which I don't want it to. How do I keep it from showing in the first regex? Some things for you to know is it may appear in the string like " words words 1A851B?" so I need to keep the punctuation from being grabbed.
Also how can I combine these into one regex. Essentially my end goal is to run an if statement in python similar to the pseudo code below.
lstResults = []
strToSearch= " Alot of 1N1X1 people like to eat 3C191 cheese and I'm a 1A831B aka 1A8."
lstResults = re.findall('<REGEX HERE>', strToSearch)
for r in lstResults:
print(r)
And the desired output would be
1N1X1
3C191
1A831B
1A8
With single regex pattern:
strToSearch= " Alot of 1N1X1 people like to eat 3C191 cheese and I'm a 1A831B aka 1A8."
lstResults = [i[0] for i in re.findall(r'(\d[A-Z]\d{1,3}(X\d|[A-Z])?)', strToSearch)]
print(lstResults)
The output:
['1N1X1', '3C191', '1A831B', '1A8']
You may use word boundaries:
\b\d{1}[A-Z]{1}\d{3}\b
See demo
For the combination, it is unclear by what criterion you consider a word a "random word", but you can use something like this:
[A-Z\d]*\d[A-Z\d]*[A-Z][A-Z\d]*
This is a word that contains at least a digit and at least a non-digit character. See demo.
Or maybe you can use:
\b\d[A-Z\d]*[A-Z][A-Z\d]*
for a word that starts with a digit and contains at least one non-digit character. See demo.
Or if you want to combine exactly those regexes, use:
\b\d[A-Z]\d(X\d|\d{2}[A-Z]?)?\b
See the final demo.
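A runnable check of that combined pattern on the question's target string. The inner groups are written as non-capturing (?:...) here so that re.findall returns the whole matches rather than group tuples:

```python
import re

strToSearch = " Alot of 1N1X1 people like to eat 3C191 cheese and I'm a 1A831B aka 1A8."

# \b\d[A-Z]\d matches the common prefix (digit, letter, digit); the optional
# group then allows either X plus a digit, or two more digits with an optional
# trailing letter; the final \b keeps punctuation such as the '.' after 1A8
# out of the match.
codes = re.findall(r"\b\d[A-Z]\d(?:X\d|\d{2}[A-Z]?)?\b", strToSearch)
print(codes)
# => ['1N1X1', '3C191', '1A831B', '1A8']
```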
If you want to find "words" where there are both digits and letters mixed, the easiest is to use the word-boundary operator, \b; but notice that you need to use r'' strings / escape the \ in the code (which you would need to do for the \d anyway in future Python versions). To match any sequence of alphanumeric characters separated by word boundary, you could use
r'\b[0-9A-Z]+\b'
However, this wouldn't yet guarantee that there is at least one number and at least one letter. For that we will use positive zero-width lookahead assertion (?= ) which means that the whole regex matches only if the contained pattern matches at that point. We need 2 of them: one ensures that there is at least one digit and one that there is at least one letter:
>>> p = r'\b(?=[0-9A-Z]*[0-9])(?=[0-9A-Z]*[A-Z])[0-9A-Z]+\b'
>>> re.findall(p, '1A A1 32 AA 1A123B')
['1A', 'A1', '1A123B']
This will now match everything, including 33333A or AAAAAAAAAA3A, as long as there is at least one digit and one letter. However, if the pattern will always start with a digit and always contain a letter, it becomes slightly easier, for example:
>>> p = r'\b\d+[A-Z][0-9A-Z]*\b'
>>> re.findall(p, '1A A1 32 AA 1A123B')
['1A', '1A123B']
i.e. A1 didn't match because it doesn't start with a digit.