match multiple substrings using findall from re library - python

I have a large array that contains strings with the following format in Python
some_array = ['MATH_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
'SCIENCE_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
'ART_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE']
I just need to extract the substrings that start with MATH, SCIENCE and ART. This is what I'm currently using:
my_str = re.findall('MATH_.*? ', some_array)
if len(my_str) > 0:
    print(my_str)
my_str = re.findall('SCIENCE_.*? ', some_array)
if len(my_str) != 0:
    print(my_str)
my_str = re.findall('ART_.*? ', some_array)
if len(my_str) > 0:
    print(my_str)
It seems to work, but I was wondering if the findall function can look for more than one substring in the same line or maybe there is a cleaner way of doing it with another function.

You can use | to match multiple different strings in a regular expression.
re.findall('(?:MATH|SCIENCE|ART)_.*? ', ...)
You could also use str.startswith along with a list comprehension.
res = [x for x in some_array if any(x.startswith(prefix)
for prefix in ('MATH', 'SCIENCE', 'ART'))]
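As a small sketch of the same idea: str.startswith also accepts a tuple of prefixes, so the any(...) call can be dropped. The names prefixes and first_tokens are illustrative only, the HISTORY entry is made-up data added to show filtering, and the example assumes the wanted substring is always the first space-separated token, as in the sample data.

```python
some_array = ['MATH_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
              'SCIENCE_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
              'ART_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
              'HISTORY_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE']  # made-up extra entry

# str.startswith takes a tuple and returns True if any prefix matches
prefixes = ('MATH_', 'SCIENCE_', 'ART_')

# keep only the first space-separated token of each matching string
first_tokens = [s.split(' ', 1)[0] for s in some_array if s.startswith(prefixes)]
print(first_tokens)
```

Unlike the regex versions, this only looks at the start of each string, so it would miss a MATH_ token that appears later in the line.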

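You could also start with a word boundary to prevent a partial-word match, match one of the alternatives, and then match optional non-whitespace characters. Add a trailing space to the pattern, as in the example further down, if the match should include the single space that follows: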
\b(?:MATH|SCIENCE|ART)_\S*
Or, to allow only word characters (\w):
\b(?:MATH|SCIENCE|ART)_\w*
Example
import re
some_array = ['MATH_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
'SCIENCE_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
'ART_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE']
pattern = re.compile(r"\b(?:MATH|SCIENCE|ART)_\S* ")
for s in some_array:
    print(pattern.findall(s))
Output
['MATH_SOME_TEXT_AND_NUMBER ']
['SCIENCE_SOME_TEXT_AND_NUMBER ']
['ART_SOME_TEXT_AND_NUMBER ']

Related

how to get a pattern repeating multiple times in a string using regular expression

I am still new to regular expressions, as in the Python library re.
I want to extract all the proper nouns as a whole word if they are separated by space.
I tried
result = re.findall(r'(\w+)\w*/NNP (\w+)\w*/NNP', tagged_sent_str)
Input: I have a string like
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
Output expected:
[('European Community'), ('European')]
Current output:
[('European','Community')]
But this only gives the pairs, not the single ones. I want both kinds.
IIUC, itertools.groupby is more suited for this kind of job:
from itertools import groupby
def join_token(string_, type_='NNP'):
    res = []
    for k, g in groupby([i.split('/') for i in string_.split()], key=lambda x: x[1]):
        if k == type_:
            res.append(' '.join(i[0] for i in g))
    return res
join_token(tagged_sent_str)
Output:
['European Community', 'European']
and it doesn't require a modification if you expect three or more consecutive types:
str2 = "European/NNP Community/NNP Union/NNP French/JJ European/NNP export/VB"
join_token(str2)
Output:
['European Community Union', 'European']
Interesting requirement. The code is explained in the comments; this is a fast solution using only regex:
import re
# make it more complex
text = "export1/VB European0/NNP export/VB European1/NNP Community1/NNP Community2/NNP French/JJ European2/NNP export/VB European2/NNP"
# 1: First clean up the target words, turning word/NNP into word.
# str.replace would do, but this shows a technique:
# a backreference to a group, written \index_of_group
# re.sub(r'/NNP', '', text)
# text.replace('/NNP', '')
_text = re.sub(r'(\w+)/NNP', r'\1', text)
# this pattern strips the leading and trailing spaces
RE_FIND_ALL = r'(?:\s+|^)((?:(?:\s|^)?\w+(?=\s+|$)?)+)(?:\s+|$)'
print('RESULT : ', re.findall(RE_FIND_ALL, _text))
OUTPUT:
RESULT : ['European0', 'European1 Community1 Community2', 'European2', 'European2']
Explaining the regex:
(?:\s+|^) : skip leading spaces
((?:(?:\s)?\w+(?=\s+|$))+) : a capturing group around a non-capturing subgroup. The subgroup (?:(?:\s)?\w+(?=\s+|$)) matches each word that is followed by whitespace or the end of the line, and the outer group captures the whole run of such words. Without the outer group, the match would return only the first word.
(?:\s+|$) : remove the trailing space of the sequence
I needed to remove /NNP from the target words because you want each sequence of word/NNP tokens in a single group. Something like (word)/NNP (word)/NNP returns two elements in one group, not a single text. Once the tag is removed the text is just word word, so the regex ((?:\w+\s)+) would capture the sequence of words. It is not quite that simple, because we must capture only the words that do not have /sequence_of_letters at the end. This way there is no need to loop over the matched groups and concatenate elements to build the text.
NOTE: both solutions work if all words are in the format word/sequence_of_letters. If some words are not in this format, you need to fix them first:
add /NNP to the end of each word you want to keep, or /DUMMY to each word you want to drop.
Using re.split, which is slower because I use a list comprehension to fix up the result:
import re
# make it more complex
text = "export1/VB Europian0/NNP export/VB Europian1/NNP Community1/NNP Community2/NNP French/JJ Europian2/NNP export/VB Europian2/NNP export/VB export/VB"
RE_SPLIT = r'\w+/[^N]\w+'
result = [x.replace('/NNP', '').strip() for x in re.split(RE_SPLIT, text) if x.strip()]
print('RESULT: ', result)
You'd like to get a pattern but with some parts deleted from it.
You can get it with two successive regexes:
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
[ re.sub(r"/NNP","",s) for s in re.findall(r"\w+/NNP(?:\s+\w+/NNP)*",tagged_sent_str) ]
['European Community', 'European']
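The two-regex idea can be wrapped in a small helper that works for any tag. This is only a sketch: the function name tagged_runs and its tag parameter are hypothetical, and it assumes every token has the form word/TAG, separated by whitespace.

```python
import re

def tagged_runs(tagged, tag='NNP'):
    """Return runs of consecutive word/<tag> tokens with the tag removed."""
    # \b after the tag stops 'NN' from matching the start of 'NNP'
    token = r'\w+/' + re.escape(tag) + r'\b'
    run = re.compile(token + r'(?:\s+' + token + r')*')
    strip_tag = re.compile('/' + re.escape(tag) + r'\b')
    # first regex finds each run, second regex deletes the tags inside it
    return [strip_tag.sub('', m) for m in run.findall(tagged)]

tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
print(tagged_runs(tagged_sent_str))
print(tagged_runs(tagged_sent_str, tag='JJ'))
```

Because the run pattern uses only non-capturing groups, findall returns the whole matched run instead of a tuple of groups.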

Splitting a string using re module of python

I have a string
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
#I have to capture only the field 'count_EVENT_GENRE'
field = re.split(r'[(==)(>=)(<=)(in)(like)]', s)[0].strip()
#o/p is 'cou'
# for s = 'sum_EVENT_GENRE in [1,2,3,4,5]' o/p = 'sum_EVENT_GENRE'
which is fine
My problem is that the string is split at any single character that appears in (in)(like), and I get the first slice (after "cou" it finds a matching character, namely n). This happens for any string that contains any character from (in)(like).
Example: 'percentage_AMOUNT' gives 'p',
because it finds the matching character 'e' after 'p'.
So I would like advice on how to treat (in)(like) as words rather than as characters when splitting.
Please suggest a syntax.
Answering your question, the [(==)(>=)(<=)(in)(like)] is a character class matching single characters you defined inside the class. To match sequences of characters, you need to remove [ and ] and use alternation:
r'==?|>=?|<=?|\b(?:in|like)\b'
or better:
r'[=><]=?|\b(?:in|like)\b'
Your code would look like:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
    field = re.split(r'[=><]=?|\b(?:in|like)\b', s)[0].strip()
    print(field)
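To see the difference between the character class and the alternation side by side, here is a minimal sketch that runs both patterns on the strings from the question. The helper name first_field is made up for illustration.

```python
import re

CHAR_CLASS = r'[(==)(>=)(<=)(in)(like)]'      # a set of single characters: ( ) = > < i n l k e
ALTERNATION = r'[=><]=?|\b(?:in|like)\b'      # operators, or the whole words "in" / "like"

def first_field(pattern, s):
    """Split s on pattern and return the stripped first piece."""
    return re.split(pattern, s)[0].strip()

for s in ['count_EVENT_GENRE in [1,2,3,4,5]', 'percentage_AMOUNT']:
    print(first_field(CHAR_CLASS, s), '|', first_field(ALTERNATION, s))
```

With the character class, the split happens at the first n or e; with the alternation, it happens only at an operator or at a standalone in or like.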
However, there might be other (easier, or safer - depending on the actual specifications) ways to get what you want (splitting with space and getting the first item, use re.match with r'\w+' or r'[a-z]+(?:_[A-Z]+)+', etc.)
If your value is at the start of the string and starts with lowercase ASCII letters, and then can have any amount of sequences of _ followed with uppercase ASCII letters, use:
re.match(r'[a-z]+(?:_[A-Z]+)*', s)
Full demo code:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
    fieldObj = re.match(r'[a-z]+(?:_[A-Z]+)*', s)
    if fieldObj:
        print(fieldObj.group())
If you want only the first word of your string, then this should do the job:
import re
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
field = re.split(r'\W', s)[0]
# count_EVENT_GENRE
Is there anything wrong with using split?
>>> s = 'count_EVENT_GENRE in [1,2,3,4,5]'
>>> s.split(' ')[0]
'count_EVENT_GENRE'
>>> s = 'coint_EVENT_GENRE = "ROMANCE"'
>>> s.split(' ')[0]
'coint_EVENT_GENRE'
>>>

Breaking up substrings in Python based on characters

I am trying to write code that will take a string and remove specific data from it. I know that the data will look like the line below, and I only need the data within the " " marks, not the marks themselves.
inputString = 'type="NN" span="123..145" confidence="1.0" '
Is there a way to take the substring that lies between two given characters, using them as the start and stop points?
You can extract all the text between pairs of " characters using regular expressions:
import re
inputString = 'type="NN" span="123..145" confidence="1.0" '
pat = re.compile('"([^"]*)"')
strings = []  # collects the text found between each pair of quotes
while True:
    mat = pat.search(inputString)
    if mat is None:
        break
    strings.append(mat.group(1))
    inputString = inputString[mat.end():]  # continue after the closing quote
print(strings)
or, easier:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
strings=re.findall('"([^"]*)"', inputString)
print(strings)
Output for both versions:
['NN', '123..145', '1.0']
fields = inputString.split('"')
# the quoted values sit at the odd indices of the split result
print(fields[1], fields[3], fields[5])
You could split the string at each space to get a list of 'key="value"' substrings and then use regular expressions to parse the substrings.
Using your input string:
>>> input_string = 'type="NN" span="123..145" confidence="1.0" '
>>> input_string_split = input_string.split()
>>> print(input_string_split)
['type="NN"', 'span="123..145"', 'confidence="1.0"']
Then use regular expressions:
>>> import re
>>> pattern = r'"([^"]+)"'
>>> for substring in input_string_split:
...     match_obj = re.search(pattern, substring)
...     print(match_obj.group(1))
...
NN
123..145
1.0
The regular expression '"([^"]+)"' matches anything within quotation marks (provided there is at least one character). The round brackets indicate the bit of the regular expression that you are interested in.
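If the keys are wanted as well, one pattern with two groups can capture each key="value" pair directly. This is a sketch that assumes keys consist of word characters and values contain no embedded quotes; the names pairs and attributes are illustrative.

```python
import re

input_string = 'type="NN" span="123..145" confidence="1.0" '

# group 1 is the key, group 2 is the text between the quotes
pairs = re.findall(r'(\w+)="([^"]*)"', input_string)
attributes = dict(pairs)

print(pairs)
print(attributes['span'])
```

With two capturing groups, re.findall returns a list of (key, value) tuples, which dict accepts as-is.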

string to list conversion in python?

I have a string like :
searchString = "u:sads asdas asdsad n:sadasda as:adds sdasd dasd a:sed eee"
what I want is list :
["u:sads asdas asdsad","n:sadasda","as:adds sdasd dasd","a:sed eee"]
What I have done is :
values = re.split(r'\s', searchString)
mylist = []
word = ''
for elem in values:
    if ':' in elem:
        if word:
            mylist.append(word)
        word = elem
    else:
        word = word + ' ' + elem
mylist.append(word)
return mylist
But I want optimized code for Python 2.6.
Thanks
Use regular expressions:
import re
mylist = re.split(r'\s+(?=\w+:)', searchString)
This splits the string wherever whitespace is followed by one or more word characters and a colon. The look-ahead (the (?= part) makes it split on the whitespace while keeping the \w+: parts.
You can use the "look ahead" feature offered by many regular expression engines. Basically, a lookahead makes the regex engine check for a pattern without consuming it.
import re
s = "u:sads asdas asdsad n:sadasda as:adds sdasd dasd a:sed eee"
re.split(r'\s(?=[a-z]+:)', s)
This means: split only where a \s is followed by one or more letters and a colon, but don't consume those tokens. The + matters, because a single [a-z] would not split before a multi-letter key such as as:.
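The effect of the lookahead is easiest to see next to a pattern that consumes the key. A minimal sketch using the string from the question (the variable names are illustrative):

```python
import re

search_string = "u:sads asdas asdsad n:sadasda as:adds sdasd dasd a:sed eee"

# lookahead: the key is checked but stays in the following piece
with_lookahead = re.split(r'\s+(?=\w+:)', search_string)

# no lookahead: the key is part of the separator and is thrown away
consuming = re.split(r'\s+\w+:', search_string)

print(with_lookahead)
print(consuming)
```

Only the first version keeps n:, as: and a: at the start of their pieces.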

Regex for extraction in Python

I have a string like this:
"a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more".
I would like to get this as an output:
(("bla", 123, 456), ("bli", 789, 123), ("blu", 789))
I haven't been able to find the proper python regex to achieve that.
>>> re.findall(' {{(\w+)\|(\w+)(?:\|(\w+))?}} ', s)
[('bla', '123', '456'), ('bli', '789', '123'), ('blu', '789', '')]
If you want numbers in the result, iterate over the output and convert the digit strings to integers with int.
You need a lot of escapes in your regular expression since {, } and | are special characters in them. A first step to extract the relevant parts of the string would be this:
regex = re.compile(r'\{\{(.*?)\|(.*?)(?:\|(.*?))?\}\}')
regex.findall(line)
For the example this gives:
[('bla', '123', '456'), ('bli', '789', '123'), ('blu', '789', '')]
You can then convert the digit strings to integers and remove empty strings such as the one in the last match.
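That follow-up step might look like the sketch below. The helper name convert is made up, and it assumes a field counts as a number only when str.isdigit() is true (so signs and decimals are left as strings).

```python
import re

line = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"
regex = re.compile(r'\{\{(.*?)\|(.*?)(?:\|(.*?))?\}\}')

def convert(field):
    """Turn a digit string into an int; leave anything else unchanged."""
    return int(field) if field.isdigit() else field

# drop the empty string left by the optional third group, then convert
result = tuple(
    tuple(convert(field) for field in match if field)
    for match in regex.findall(line)
)
print(result)
```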
[re.split(r'\|', i) for i in re.findall(r"{{(.*?)}}", s)]
Returns:
[['bla', '123', '456'], ['bli', '789', '123'], ['blu', '789']]
This method works regardless of the number of elements in the {{ }} blocks.
To get the exact output you wrote, you need a regex and a split:
import re
[p.split("|") for p in re.findall(r"\{\{([^}]*)\}\}", s)]
To get it with the numbers converted, do this:
toint = lambda x: int(x) if x.isdigit() else x
[[toint(x) for x in p.split("|")] for p in re.findall(r"\{\{([^}]*)\}\}", s)]
Assuming your actual format is {{[a-z]+|[0-9]+|[0-9]+}}, here's a complete program with conversion to ints.
import re
s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"
result = []
for match in re.finditer('{{.*?}}', s):
    # Split on pipe (|) and filter out non-alphanumerics
    parts = [''.join(filter(str.isalnum, part)) for part in match.group().split('|')]
    # Convert to int when possible
    for index, part in enumerate(parts):
        try:
            parts[index] = int(part)
        except ValueError:
            pass
    result.append(tuple(parts))
We might be able to get fancy and do everything in a single complicated regular expression, but that way lies madness. Let's do one regexp that grabs the groups, and then split the groups up. We could use a regexp to split the groups, but we can just use str.split(), so let's do that.
import re
pat_group = re.compile("{{([^}]*)}}")
def mixed_tuple(iterable):
    lst = []
    for x in iterable:
        try:
            lst.append(int(x))
        except ValueError:
            lst.append(x)
    return tuple(lst)
s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"
lst_groups = re.findall(pat_group, s)
lst = [mixed_tuple(x.split("|")) for x in lst_groups]
In pat_group, "{{" just matches literal "{{". "(" starts a group. "[^}]" is a character class that matches any character except for "}", and '*' allows it to match zero or more such characters. ")" closes out the group and "}}" matches literal characters. Thus, we match the "{{...}}" patterns, and can extract everything between the curly braces as a group.
re.findall() returns a list of groups matched from the pattern.
Finally, a list comprehension splits each string and returns the result as a tuple.
Is pyparsing overkill for this? Maybe, but without too much suffering, it does deliver the desired output, without a thicket of backslashes to escape the '{', '|', or '}' characters. Plus, there's no need for post-parse conversions of integers and whatnot - the parse actions take care of this kind of stuff at parse time.
from pyparsing import Word, Suppress, alphas, alphanums, nums, delimitedList
LBRACE,RBRACE,VERT = map(Suppress,"{}|")
word = Word(alphas,alphanums)
integer = Word(nums)
integer.setParseAction(lambda t: int(t[0]))
patt = (LBRACE*2 + delimitedList(word|integer, VERT) + RBRACE*2)
patt.setParseAction(lambda toks:tuple(toks.asList()))
s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"
print(tuple(p[0] for p in patt.searchString(s)))
Prints:
(('bla', 123, 456), ('bli', 789, 123), ('blu', 789))
