Splitting a string using re module of python - python

I have a string
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
#I have to capture only the field 'count_EVENT_GENRE'
field = re.split(r'[(==)(>=)(<=)(in)(like)]', s)[0].strip()
#o/p is 'cou'
# for s = 'sum_EVENT_GENRE in [1,2,3,4,5]' o/p = 'sum_EVENT_GENRE'
which is fine
My doubt is for any character in (in)(like) it is splitting the string s at that character and giving me first slice.(as after "cou" it finds one matching char i:e n). It's happening for any string that contains any character from (in)(like).
Ex : 'percentage_AMOUNT' o/p = 'p'
as it finds a matching char as 'e' after p.
So i want some advice how to treat (in)(like) as words not as characters , when splitting occurs/matters.
please suggest a syntax.

Answering your question, the [(==)(>=)(<=)(in)(like)] is a character class matching single characters you defined inside the class. To match sequences of characters, you need to remove [ and ] and use alternation:
r'==?|>=?|<=?|\b(?:in|like)\b'
or better:
r'[=><]=?|\b(?:in|like)\b'
You code would look like:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
field = re.split(r'[=><]=?|\b(?:in|like)\b', s)[0].strip()
print(field)
However, there might be other (easier, or safer - depending on the actual specifications) ways to get what you want (splitting with space and getting the first item, use re.match with r'\w+' or r'[a-z]+(?:_[A-Z]+)+', etc.)
If your value is at the start of the string and starts with lowercase ASCII letters, and then can have any amount of sequences of _ followed with uppercase ASCII letters, use:
re.match(r'[a-z]+(?:_[A-Z]+)*', s)
Full demo code:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
fieldObj = re.match(r'[a-z]+(?:_[A-Z]+)*', s)
if fieldObj:
print(fieldObj.group())

If you want only the first word of your string, then this should do the job:
import re
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
field = re.split(r'\W', s)[0]
# count_EVENT_GENRE

Is there anything wrong with using split?
>>> s = 'count_EVENT_GENRE in [1,2,3,4,5]'
>>> s.split(' ')[0]
'count_EVENT_GENRE'
>>> s = 'coint_EVENT_GENRE = "ROMANCE"'
>>> s.split(' ')[0]
'coint_EVENT_GENRE'
>>>

Related

split string on any special character using python

currently I can have many dynamic separators in string like
new_123_12313131
new$123$12313131
new#123#12313131
etc etc . I just want to check if there is a special character in string then just get value after last separator like in this example just want 12313131
This is a good use case for isdigit():
l = [
'new_123_12313131',
'new$123$12313131',
'new#123#12313131',
]
output = []
for s in l:
temp = ''
for char in s:
if char.isdigit():
temp += char
output.append(temp)
print(output)
Result: ['12312313131', '12312313131', '12312313131']
Assuming you define 'special character' as anything thats not alphanumeric, you can use the str.isalnum() function to determine the first special character and leverage it something like this:
def split_non_special(input) -> str:
"""
Find first special character starting from the end and get the last piece
"""
for i in reversed(input):
if not i.isalnum():
return input.split(i)[-1] # return as soon as a separator is found
return '' # no separator found
# inputs = ['new_123_12313131', 'new$123$12313131', 'new#123#12313131', 'eefwfwrfwfwf3243']
# outputs = [split_non_special(input) for input in inputs]
# ['12313131', '12313131', '12313131', ''] # outputs
just get value after last separator
the more obvious way is using re.findall:
from re import findall
findall(r'\d+$',text) # ['12313131']
Python supplies what seems to be what you consider "special" characters using the string library as string.punctuation. Which are these characters:
!"#$%&'()*+,-./:;<=>?#[\]^_`{|}~
Using that in conjunction with the re module you can do this:
from string import punctuation
import re
re.split(f"[{punctuation}]", my_string)
my_string being the string you want to split.
Results for your examples
['new', '123', '12313131']
To get just digits you can use:
re.split("\d", my_string)
Results:
['123', '12313131']

How to start at a specific letter and end when it hits a digit?

I have some sample strings:
s = 'neg(able-23, never-21) s2-1/3'
i = 'amod(Market-8, magical-5) s1'
I've got the problem where I can figure out if the string has 's1' or 's3' using:
word = re.search(r's\d$', s)
But if I want to know if the contains 's2-1/3' in it, it won't work.
Is there a regex expression that can be used so that it works for both cases of 's#' and 's#+?
Thanks!
You can allow the characters "-" and "/" to be captured as well, in addition to just digits. It's hard to tell the exact pattern you're going for here, but something like this would capture "s2-1/3" from your example:
import re
s = "neg(able-23, never-21) s2-1/3"
word = re.search(r"s\d[-/\d]*$", s)
I'm guessing that maybe you would want to extract that with some expression, such as:
(s\d+)-?(.*)$
Demo 1
or:
(s\d+)-?([0-9]+)?\/?([0-9]+)?$
Demo 2
Test
import re
expression = r"(s\d+)-?(.*)$"
string = """
neg(able-23, never-21) s211-12/31
neg(able-23, never-21) s2-1/3
amod(Market-8, magical-5) s1
"""
print(re.findall(expression, string, re.M))
Output
[('s211', '12/31'), ('s2', '1/3'), ('s1', '')]

Getting word from string

How can i get word example from such string:
str = "http://test-example:123/wd/hub"
I write something like that
print(str[10:str.rfind(':')])
but it doesn't work right, if string will be like
"http://tests-example:123/wd/hub"
You can use this regex to capture the value preceded by - and followed by : using lookarounds
(?<=-).+(?=:)
Regex Demo
Python code,
import re
str = "http://test-example:123/wd/hub"
print(re.search(r'(?<=-).+(?=:)', str).group())
Outputs,
example
Non-regex way to get the same is using these two splits,
str = "http://test-example:123/wd/hub"
print(str.split(':')[1].split('-')[1])
Prints,
example
You can use following non-regex because you know example is a 7 letter word:
s.split('-')[1][:7]
For any arbitrary word, that would change to:
s.split('-')[1].split(':')[0]
many ways
using splitting:
example_str = str.split('-')[-1].split(':')[0]
This is fragile, and could break if there are more hyphens or colons in the string.
using regex:
import re
pattern = re.compile(r'-(.*):')
example_str = pattern.search(str).group(1)
This still expects a particular format, but is more easily adaptable (if you know how to write regexes).
I am not sure why do you want to get a particular word from a string. I guess you wanted to see if this word is available in given string.
if that is the case, below code can be used.
import re
str1 = "http://tests-example:123/wd/hub"
matched = re.findall('example',str1)
Split on the -, and then on :
s = "http://test-example:123/wd/hub"
print(s.split('-')[1].split(':')[0])
#example
using re
import re
text = "http://test-example:123/wd/hub"
m = re.search('(?<=-).+(?=:)', text)
if m:
print(m.group())
Python strings has built-in function find:
a="http://test-example:123/wd/hub"
b="http://test-exaaaample:123/wd/hub"
print(a.find('example'))
print(b.find('example'))
will return:
12
-1
It is the index of found substring. If it equals to -1, the substring is not found in string. You can also use in keyword:
'example' in 'http://test-example:123/wd/hub'
True

Python re match last underscore in a string

I have some strings that look like this
S25m\S25m_16Q_-2dB.png
S25m\S25m_1_16Q_0dB.png
S25m\S25m_2_16Q_2dB.png
I want to get the string between slash and the last underscore, and also the string between last underscore and extension, so
Desired:
[S25m_16Q, S25m_1_16Q, S25m_2_16Q]
[-2dB, 0dB, 2dB]
I was able to get the whole thing between slash and extension by doing
foo = "S25m\S25m_16Q_-2dB.png"
match = re.search(r'([a-zA-Z0-9_-]*)\.(\w+)', foo)
match.group(1)
But I don't know how to make a pattern so I could split it by the last underscore.
Capture the groups you want to get.
>>> re.search(r'([-\w]*)_([-\w]+)\.\w+', "S25m\S25m_16Q_-2dB.png").groups()
('S25m_16Q', '-2dB')
>>> re.search(r'([-\w]*)_([-\w]+)\.\w+', "S25m\S25m_1_16Q_0dB.png").groups()
('S25m_1_16Q', '0dB')
>>> re.search(r'([-\w]*)_([-\w]+)\.\w+', "S25m\S25m_2_16Q_2dB.png").groups()
('S25m_2_16Q', '2dB')
* matches the previous character set greedily (consumes as many as possible); it continues to the last _ since \w includes letters, numbers, and underscore.
>>> zip(*[m.groups() for m in re.finditer(r'([-\w]*)_([-\w]+)\.\w+', r'''
... S25m\S25m_16Q_-2dB.png
... S25m\S25m_1_16Q_0dB.png
... S25m\S25m_2_16Q_2dB.png
... ''')])
[('S25m_16Q', 'S25m_1_16Q', 'S25m_2_16Q'), ('-2dB', '0dB', '2dB')]
A non-regex solution (albeit rather messy):
>>> import os
>>> s = "S25m\S25m_16Q_-2dB.png"
>>> first, _, last = s.partition("\\")[2].rpartition('_')
>>> print (first, os.path.splitext(last)[0])
('S25m_16Q', '-2dB')
I know it says using re, but why not just use split?
strings = """S25m\S25m_16Q_-2dB.png
S25m\S25m_1_16Q_0dB.png
S25m\S25m_2_16Q_2dB.png"""
strings = strings.split("\n")
parts = []
for string in strings:
string = string.split(".png")[0] #Get rid of file extension
string = string.split("\\")
splitString = string[1].split("_")
firstPart = "_".join(splitString[:-1]) # string between slash and last underscore
parts.append([firstPart, splitString[-1]])
for line in parts:
print line
['S25m_16Q', '-2dB']
['S25m_1_16Q', '0dB']
['S25m_2_16Q', '2dB']
Then just transpose the array,
for line in zip(*parts):
print line
('S25m_16Q', 'S25m_1_16Q', 'S25m_2_16Q')
('-2dB', '0dB', '2dB')

Breaking up substrings in Python based on characters

I am trying to write code that will take a string and remove specific data from it. I know that the data will look like the line below, and I only need the data within the " " marks, not the marks themselves.
inputString = 'type="NN" span="123..145" confidence="1.0" '
Is there a way to take a Substring of a string within two characters to know the start and stop points?
You can extract all the text between pairs of " characters using regular expressions:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
pat=re.compile('"([^"]*)"')
while True:
mat=pat.search(inputString)
if mat is None:
break
strings.append(mat.group(1))
inputString=inputString[mat.end():]
print strings
or, easier:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
strings=re.findall('"([^"]*)"', inputString)
print strings
Output for both versions:
['NN', '123..145', '1.0']
fields = inputString.split('"')
print fields[1], fields[3], fields[5]
You could split the string at each space to get a list of 'key="value"' substrings and then use regular expressions to parse the substrings.
Using your input string:
>>> input_string = 'type="NN" span="123..145" confidence="1.0" '
>>> input_string_split = input_string.split()
>>> print input_string_split
[ 'type="NN"', 'span="123..145"', 'confidence="1.0"' ]
Then use regular expressions:
>>> import re
>>> pattern = r'"([^"]+)"'
>>> for substring in input_string_split:
match_obj = search(pattern, substring)
print match_obj.group(1)
NN
123..145
1.0
The regular expression '"([^"]+)"' matches anything within quotation marks (provided there is at least one character). The round brackets indicate the bit of the regular expression that you are interested in.

Categories