Breaking up substrings in Python based on characters - python

I am trying to write code that will take a string and remove specific data from it. I know that the data will look like the line below, and I only need the data within the " " marks, not the marks themselves.
inputString = 'type="NN" span="123..145" confidence="1.0" '
Is there a way to extract the substring that lies between two known characters, using those characters as the start and stop points?

You can extract all the text between pairs of " characters using regular expressions:
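import re
inputString = 'type="NN" span="123..145" confidence="1.0" '
pat = re.compile('"([^"]*)"')
strings = []  # collects the text found between each pair of quotes
while True:
    mat = pat.search(inputString)
    if mat is None:
        break
    strings.append(mat.group(1))
    # continue searching after the end of the current match
    inputString = inputString[mat.end():]
print(strings)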
Or, more simply:
import re
inputString = 'type="NN" span="123..145" confidence="1.0" '
strings = re.findall('"([^"]*)"', inputString)
print(strings)
Output for both versions:
['NN', '123..145', '1.0']
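If the key names matter as well, the same findall idea can be extended with a second capture group. This is only a sketch: the `\w+` pattern for the keys and the names `input_string`, `pairs` and `fields` are assumptions, not something given in the question.

```python
import re

input_string = 'type="NN" span="123..145" confidence="1.0" '

# With two groups in the pattern, findall returns (key, value) tuples.
pairs = re.findall(r'(\w+)="([^"]*)"', input_string)

# Turn the tuples into a dictionary so values can be looked up by key.
fields = dict(pairs)

print(fields)
```

With this, `fields['span']` gives the value without the surrounding quote marks.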

fields = inputString.split('"')
# after splitting on ", the quoted values sit at the odd indexes
print(fields[1], fields[3], fields[5])

You could split the string at each space to get a list of 'key="value"' substrings and then use regular expressions to parse the substrings.
Using your input string:
>>> input_string = 'type="NN" span="123..145" confidence="1.0" '
>>> input_string_split = input_string.split()
>>> print(input_string_split)
['type="NN"', 'span="123..145"', 'confidence="1.0"']
Then use regular expressions:
>>> import re
>>> pattern = r'"([^"]+)"'
>>> for substring in input_string_split:
...     match_obj = re.search(pattern, substring)
...     print(match_obj.group(1))
...
NN
123..145
1.0
The regular expression '"([^"]+)"' matches anything within quotation marks (provided there is at least one character). The round brackets indicate the bit of the regular expression that you are interested in.
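To make the role of the round brackets concrete, here is a small sketch comparing the whole match with the captured group, and `+` with `*` on an empty value. The sample strings (including `note=""`) are made up for illustration.

```python
import re

m = re.search(r'"([^"]+)"', 'span="123..145"')
whole = m.group(0)  # the full match, quotes included
inner = m.group(1)  # only the part inside the round brackets

# With +, an empty pair of quotes does not match at all ...
empty_plus = re.search(r'"([^"]+)"', 'note=""')
# ... while * accepts it and captures an empty string.
empty_star = re.search(r'"([^"]*)"', 'note=""')

print(whole, inner, empty_plus, repr(empty_star.group(1)))
```

Since `re.search` returns None when nothing matches, it is worth checking the result before calling `.group()` if empty values can occur.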

Related

match multiple substrings using findall from re library

I have a large array that contains strings with the following format in Python
some_array = ['MATH_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
              'SCIENCE_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
              'ART_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE']
I just need to extract the substrings that start with MATH, SCIENCE and ART. This is what I'm currently using:
my_str = re.findall('MATH_.*? ', some_array )
if len(my_str) > 0:
    print(my_str)
my_str = re.findall('SCIENCE_.*? ', some_array )
if len(my_str) !=0:
    print(my_str)
my_str = re.findall('ART_.*? ', some_array )
if len(my_str) > 0:
    print(my_str)
It seems to work, but I was wondering if the findall function can look for more than one substring in the same line or maybe there is a cleaner way of doing it with another function.
You can use | to match multiple different strings in a regular expression.
re.findall('(?:MATH|SCIENCE|ART)_.*? ', ...)
You could also use str.startswith along with a list comprehension.
res = [x for x in some_array
       if any(x.startswith(prefix) for prefix in ('MATH', 'SCIENCE', 'ART'))]
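As a side note, str.startswith also accepts a tuple of prefixes, so the any(...) call is not strictly needed. A minimal sketch, assuming the prefixes should include the underscore (so that something like 'ARTIST_...' is not picked up) and using a made-up HISTORY_ entry to show the filtering:

```python
# Hypothetical sample data; the HISTORY_ entry is only there to be filtered out.
some_array = [
    'MATH_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
    'HISTORY_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
    'ART_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
]

# Assumption: the underscore is part of the prefix.
prefixes = ('MATH_', 'SCIENCE_', 'ART_')

# startswith returns True if the string starts with any item of the tuple.
matching = [s for s in some_array if s.startswith(prefixes)]

# Keep only the first space-separated token of each matching string.
first_tokens = [s.split(' ', 1)[0] for s in matching]

print(first_tokens)
```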
You could also match one of the alternatives followed by optional non-whitespace characters, starting with a word boundary to prevent a partial-word match. Append a single space to the pattern if you also want to match the trailing space:
\b(?:MATH|SCIENCE|ART)_\S*
Or, if the rest of the token can only contain word characters (\w):
\b(?:MATH|SCIENCE|ART)_\w*
Example
import re
some_array = ['MATH_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
              'SCIENCE_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
              'ART_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE']
pattern = re.compile(r"\b(?:MATH|SCIENCE|ART)_\S* ")
for s in some_array:
    print(pattern.findall(s))
Output
['MATH_SOME_TEXT_AND_NUMBER ']
['SCIENCE_SOME_TEXT_AND_NUMBER ']
['ART_SOME_TEXT_AND_NUMBER ']
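If a single flat list is more convenient than one list per string, the same pattern can be applied with finditer inside a comprehension. This is a sketch only: the trailing space is left out of the pattern here, and the 'SMART_THING ...' entry is invented to show what the word boundary does.

```python
import re

# Hypothetical data; the last entry shows that \b rejects 'ART' inside 'SMART'.
some_array = [
    'MATH_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
    'SCIENCE_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
    'SMART_THING ART_SOME_TEXT_AND_NUMBER SOME_VALUE',
]

pattern = re.compile(r"\b(?:MATH|SCIENCE|ART)_\S*")

# Collect every match from every string into one flat list.
tokens = [m.group() for s in some_array for m in pattern.finditer(s)]

print(tokens)
```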

split string on any special character using python

Currently my strings can contain various dynamic separators, for example:
new_123_12313131
new$123$12313131
new#123#12313131
and so on. I just want to check whether the string contains a special character and, if so, get the value after the last separator. In these examples I just want 12313131.
This is a good use case for isdigit():
l = [
    'new_123_12313131',
    'new$123$12313131',
    'new#123#12313131',
]
output = []
for s in l:
    temp = ''
    for char in s:
        if char.isdigit():
            temp += char
    output.append(temp)
print(output)
Result: ['12312313131', '12312313131', '12312313131']
Note that this keeps every digit in the string, so the digits of the middle field (123) are included as well.
Assuming you define 'special character' as anything that's not alphanumeric, you can use the str.isalnum() method to find the last special character and use it like this:
def split_non_special(text: str) -> str:
    """
    Find first special character starting from the end and get the last piece
    """
    for i in reversed(text):
        if not i.isalnum():
            return text.split(i)[-1]  # return as soon as a separator is found
    return ''  # no separator found

# inputs = ['new_123_12313131', 'new$123$12313131', 'new#123#12313131', 'eefwfwrfwfwf3243']
# outputs = [split_non_special(s) for s in inputs]
# ['12313131', '12313131', '12313131', ''] # outputs
"just get value after last separator"
A more direct way is re.findall with a pattern anchored at the end of the string:
from re import findall
text = 'new_123_12313131'
findall(r'\d+$', text)  # ['12313131']
Python provides what you seem to mean by "special" characters in the string module, as string.punctuation. It contains these characters:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Using that in conjunction with the re module you can do this:
from string import punctuation
import re
re.split(f"[{punctuation}]", my_string)
my_string being the string you want to split.
Results for your examples
['new', '123', '12313131']
To get just the runs of digits you can use:
re.findall(r"\d+", my_string)
Results:
['123', '12313131']
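Putting the pieces together for the original goal (the value after the last separator), something along these lines could work. It is a sketch under assumptions: `last_field` and `SEPARATORS` are made-up names, separators are taken to be exactly the characters in string.punctuation, and re.escape is used as a precaution so that characters such as ] or \ are safe inside the character class.

```python
import re
from string import punctuation

# Character class built from all punctuation characters, escaped for safety.
SEPARATORS = re.compile(f"[{re.escape(punctuation)}]")

def last_field(text):
    """Return the part after the last punctuation character, or '' if there is none."""
    parts = SEPARATORS.split(text)
    return parts[-1] if len(parts) > 1 else ""

samples = ['new_123_12313131', 'new$123$12313131', 'new#123#12313131', 'eefwfwrfwfwf3243']
results = [last_field(s) for s in samples]

print(results)
```

Returning an empty string when no separator is present is a design choice for this sketch; returning the whole string or None would be equally reasonable depending on the caller.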

Splitting a string using re module of python

I have a string
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
#I have to capture only the field 'count_EVENT_GENRE'
field = re.split(r'[(==)(>=)(<=)(in)(like)]', s)[0].strip()
#o/p is 'cou'
# for s = 'sum_EVENT_GENRE in [1,2,3,4,5]' o/p = 'sum_EVENT_GENRE'
which is fine.
My problem is that the string s is split at any single character from (in)(like), and I get the first slice (after "cou" it finds a matching character, i.e. n). This happens for any string that contains any character from (in)(like).
Example: 'percentage_AMOUNT' gives 'p', because it finds the matching character 'e' after 'p'.
So I would like some advice on how to treat (in) and (like) as words rather than as individual characters when splitting.
Please suggest a syntax.
Answering your question, the [(==)(>=)(<=)(in)(like)] is a character class matching single characters you defined inside the class. To match sequences of characters, you need to remove [ and ] and use alternation:
r'==?|>=?|<=?|\b(?:in|like)\b'
or better:
r'[=><]=?|\b(?:in|like)\b'
Your code would look like:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
    field = re.split(r'[=><]=?|\b(?:in|like)\b', s)[0].strip()
    print(field)
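To see the difference between the character class and the alternation side by side, here is a small comparison on the string from the question. The variable names `by_class` and `by_alternation` are just for illustration.

```python
import re

s = 'count_EVENT_GENRE in [1,2,3,4,5]'

# Character class: splits at the first single character from the set
# ( ) = > < i n l k e, which here is the 'n' in 'count'.
by_class = re.split(r'[(==)(>=)(<=)(in)(like)]', s)[0].strip()

# Alternation: splits only at a whole operator or at the whole word in/like.
by_alternation = re.split(r'[=><]=?|\b(?:in|like)\b', s)[0].strip()

print(by_class)
print(by_alternation)
```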
However, there might be other (easier, or safer - depending on the actual specifications) ways to get what you want (splitting with space and getting the first item, use re.match with r'\w+' or r'[a-z]+(?:_[A-Z]+)+', etc.)
If your value is at the start of the string, starts with lowercase ASCII letters, and can then have any number of sequences of _ followed by uppercase ASCII letters, use:
re.match(r'[a-z]+(?:_[A-Z]+)*', s)
Full demo code:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
    fieldObj = re.match(r'[a-z]+(?:_[A-Z]+)*', s)
    if fieldObj:
        print(fieldObj.group())
If you want only the first word of your string, then this should do the job:
import re
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
field = re.split(r'\W', s)[0]
# count_EVENT_GENRE
Is there anything wrong with using split?
>>> s = 'count_EVENT_GENRE in [1,2,3,4,5]'
>>> s.split(' ')[0]
'count_EVENT_GENRE'
>>> s = 'coint_EVENT_GENRE = "ROMANCE"'
>>> s.split(' ')[0]
'coint_EVENT_GENRE'
>>>
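One caveat with split(' '): it treats every single space as a separator, so leading spaces or tabs change the result. If that can happen in your data (an assumption, the question does not say), calling split with no separator is a bit more forgiving. A small sketch with a made-up input:

```python
# Hypothetical input with leading spaces and a tab instead of a space.
s = '  count_EVENT_GENRE\tin [1,2,3,4,5]'

# Splitting on a literal space yields an empty first item here.
with_space = s.split(' ')[0]

# Splitting on any run of whitespace, at most once, yields the field name.
with_default = s.split(None, 1)[0]

print(repr(with_space), repr(with_default))
```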

How to extract just the characters "abc-3456" from the given text in python

I have this code:
import re
text = "this is my desc abc-3456"
m = re.findall("\w+\\-\d+", text)
print m
This prints ['abc-3456'], but I want to get only abc-3456 (without the square brackets and the quotes).
How to do this?
import re
text = "this is my desc abc-3456"
m = re.findall(r"\w+-\d+", text)  # raw string, so the backslashes reach the regex engine unchanged
print(m[0])
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings.
findall returns a list of strings. If you want the first one, use m[0].
Printing m[0] gives the string without [] and ''.
If you only want the first (or only) result, do this:
import re
text = "this is my desc abc-3456"
m = re.search(r"\w+-\d+", text)
print(m.group())
re.findall returns a list of matches, and each item in that list is a string. You can use re.finditer if you would rather iterate over match objects.
In Python, a list's representation is in brackets: [member1, member2, ...].
A string ("somestring") representation is in quotes: 'somestring'.
This means the representation of a list of strings is:
['somestring1', 'somestring2', ...]
So you have a string inside a list; the characters you want to remove are part of Python's representation of the list, not part of your data.
To get the string simply take the first element from the list:
mystring = m[0]
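Both m[0] and m.group() fail if the pattern is not found (IndexError on an empty list, AttributeError on None), so a small wrapper can help. This is just a sketch; `first_match` is a hypothetical helper name, not part of the re module.

```python
import re

def first_match(pattern, text, default=None):
    """Return the first match as a plain string, or default if nothing matches."""
    m = re.search(pattern, text)
    return m.group() if m else default

code = first_match(r"\w+-\d+", "this is my desc abc-3456")
missing = first_match(r"\w+-\d+", "no ticket id here")

print(code)
print(missing)
```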

Search for quotes with regular expression

I'm looking for a way to search a text file for quotes made by an author and then print them out. My script so far:
import re
#searches end of string
print re.search('"$', 'i am searching for quotes"')
#searches start of string
print re.search('^"' , '"i am searching for quotes"')
What I would like to do
import re
## load text file
quotelist = open('A.txt','r').read()
## search for strings contained with quotation marks
re.search ("-", quotelist)
## Store in list or Dict
Dict = quotelist
## Print quotes
print Dict
I also tried
import re
buffer = open('bbc.txt','r').read()
quotes = re.findall(r'.*"[^"].*".*', buffer)
for quote in quotes:
    print quote
# Add quotes to list
l = []
for quote in quotes:
    print quote
    l.append(quote)
Develop a regular expression that matches all the characters you expect to see inside a quoted string. Then use re.findall to find all occurrences of the match.
import re
with open('file.txt', 'r') as f:  # the file is closed automatically
    buffer = f.read()
quotes = re.findall(r'"[^"]*"', buffer)
for quote in quotes:
    print(quote)
Searching between " and ” needs the Unicode code point of the closing quote in the pattern (Python 3 syntax shown):
quotes = re.findall(r'"[^\u201d]*\u201d', buffer)
And for a document that uses " and ” interchangeably to close a quotation:
quotes = re.findall(r'"[^"\u201d]*["\u201d]', buffer)
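For a quick check without a file, the idea can be tried on an in-memory string. This sketch assumes Python 3 and, unlike the patterns above, assumes that curly quotations open with “ (U+201C) and close with ” (U+201D); the sample text is made up. Capture groups are used so that only the quoted text is returned.

```python
import re

# Hypothetical text with one straight-quoted and one curly-quoted passage.
text = 'He said "first quote" and then \u201csecond quote\u201d before leaving.'

# Straight quotes: capture everything up to the next straight quote.
straight = re.findall(r'"([^"]*)"', text)

# Curly quotes: opening U+201C, closing U+201D.
curly = re.findall('\u201c([^\u201d]*)\u201d', text)

all_quotes = straight + curly
print(all_quotes)
```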
You don't need regular expressions to find static strings. You should use this Python idiom for finding strings:
>>> haystack = 'this is the string to search!'
>>> needle = '!'
>>> if needle in haystack:
...     print('Found', needle)
...
Creating a list is easy enough -
>>> matches = []
Storing matches is easy too...
>>> matches.append('add this string to matches')
This should be enough to get you started. Good luck!
An addendum to address the comment below...
l = []
for quote in matches:
    print(quote)
    l.append(quote)
