How to extract a specific field from a structured string? - python

I'm trying to extract text from a structured string with a regex.
For instance,
string = "field1:afield3:bfield2:cfield3:d"
All I want are the values of field3, which are 'b' and 'd'.
I tried the regex "(field1:.*?)?(field2:.*?)?field3:"
and split the raw string by it,
but I got this:
['', 'field1:a', None, 'b', None, 'field2:c', 'd']
So, what is the solution?
The real case is:
string = "1st sentence---------------------- Forwarded by Michelle
Cash/HOU/ECT on ---------------------------Ava Syon#ENRON To: Michelle
Cash/HOU/ECT#ECTcc: Twanda Sweet/HOU/ECT#ECT Subject: 2nd sentence---------
------------- Forwarded by Michelle Cash/HOU/ECT on -----------------------
----Ava Syon#ENRON To: Michelle Cash/HOU/ECT#ECTcc: Twanda
Sweet/HOU/ECT#ECT Subject: 3rd sentence"
(a single-line string, without \n)
The result I need is the list
re = ["1st sentence","2nd sentence","3rd sentence"]
Thanks!

Use
re.findall(r'field3:(.*?)(?=field\d+:|$)', s)
NOTE: re.findall returns the contents of the capturing group, so you do not need a lookbehind in the pattern; a capturing group will do.
The regex matches:
field3: - a literal char sequence
(.*?) - any 0+ chars other than line break chars, as few as possible (if you use the re.DOTALL modifier, the dot will match a newline, too)
(?=field\d+:|$) - a positive lookahead that requires (but does not consume, does not add to the match or capture) the presence of field, 1+ digits, : or the end of string after the current position.
Python demo:
import re
rx = r"field3:(.*?)(?=field\d+:|$)"
s = "field1:afield3:b and morefield2:cfield3:d and here"
res = re.findall(rx, s)
print(res)
# => ['b and more', 'd and here']
NOTE: A more efficient (unrolled) version of the same regex is
field3:([^f]*(?:f(?!ield\d+:)[^f]*)*)
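As a quick check that the unrolled pattern behaves like the lazy one, the sketch below runs both on the demo string plus one extra sample. The second sample is made up for illustration (it is not from the question) and was chosen because its value contains the letter f.

```python
import re

lazy = re.compile(r"field3:(.*?)(?=field\d+:|$)")
unrolled = re.compile(r"field3:([^f]*(?:f(?!ield\d+:)[^f]*)*)")

samples = [
    "field1:afield3:b and morefield2:cfield3:d and here",
    # hypothetical extra sample: the value itself contains 'f'
    "field3:a few fishfield2:x",
]

for s in samples:
    # both patterns should print the same list for each sample
    print(lazy.findall(s), unrolled.findall(s))
```

One difference to keep in mind: [^f] also matches line breaks, whereas . does not unless re.DOTALL is set.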

You could use a positive lookbehind. It will find the single character directly after field3: :
>>> import re
>>> string = "field1:afield3:bfield2:cfield3:d"
>>> re.findall(r'(?<=field3:).', string)
['b', 'd']
This would only work for a single character. I could add a positive lookahead, but it would become the same answer as Wiktor's.
So here's an alternative with re.split():
>>> string = "field1:afield3:boatfield2:cfield3:dolphin"
>>> elements = re.split(r'(field\d+:)',string)
>>> [elements[i+1] for i, x in enumerate(elements) if x == 'field3:']
['boat', 'dolphin']
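The same re.split() idea might also fit the real case from the question. This is only a sketch: the string below is a shortened stand-in for the forwarded-mail text, and the pattern assumes every header starts with a run of dashes followed by " Forwarded by" and ends at "Subject: ".

```python
import re

# Stand-in for the one-line string from the question (assumed layout).
header = ("---------------------- Forwarded by Michelle Cash/HOU/ECT on "
          "---------------------------Ava Syon#ENRON To: Michelle "
          "Cash/HOU/ECT#ECTcc: Twanda Sweet/HOU/ECT#ECT Subject: ")
text = "1st sentence" + header + "2nd sentence" + header + "3rd sentence"

# Split on each whole header block; what remains are the sentences.
parts = re.split(r"-+ Forwarded by .*?Subject: ", text)
print(parts)
# => ['1st sentence', '2nd sentence', '3rd sentence']
```

If real headers vary more than this, the pattern would need to be loosened accordingly.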

A solution that gets field values by field number using the built-in str.replace(), str.split() and str.startswith() methods:
def getFieldValues(s, field_number):
    # insert a delimiter in front of every field name
    delimited = s.replace('field', '|field')
    prefix = 'field' + str(field_number) + ':'
    # keep only the requested field; split once so values may contain ':'
    return [i.split(':', 1)[1] for i in delimited.split('|') if i.startswith(prefix)]
s = "field1:a hello-againfield3:b some textfield2:c another textfield3:d and data"
print(getFieldValues(s, 3))
# ['b some text', 'd and data']
print(getFieldValues(s, 1))
# ['a hello-again']
print(getFieldValues(s, 2))
# ['c another text']
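If values for several field numbers are needed, one pass with a regex can collect them all. The following is a minimal sketch, assuming the same fieldN:value layout; the helper name group_fields is made up for illustration.

```python
import re
from collections import defaultdict

def group_fields(s):
    """Map each field number to the list of its values, in order of appearance."""
    grouped = defaultdict(list)
    # group 1 = field number, group 2 = value up to the next field or end of string
    for number, value in re.findall(r"field(\d+):(.*?)(?=field\d+:|$)", s):
        grouped[int(number)].append(value)
    return dict(grouped)

s = "field1:a hello-againfield3:b some textfield2:c another textfield3:d and data"
fields = group_fields(s)
print(fields[3])
# => ['b some text', 'd and data']
```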

Related

How to get the text that is separated by a comma after a keyword

so I have a string that has multiple patterns like
s1 = "foo, bar"
s1 = "x, y"
s2 = "hello, hi"
s3 = "bar, foo."
I'm wondering how I can get the strings that are separated by a comma (insert random text here).
So from this example, I want to get strings ["foo","bar"] and ["x","y"] when I'm looking for "s1", and "hello" & "hi" when I look for s2, etc.
Thanks!
EDIT:
Let's assume using .split(',') is impractical due to a large number of commas outside this specific pattern I listed
The question was edited, but for the original string:
"s1: foo, bar s1:x,y s2:hello, hi s3: bar, foo."
You could use a pattern to match the specific part and then use re.split to split on a comma and optional space.
\bs1: ?(\w+(?:, ?\w+)*)
Explanation
\bs1: ? Match s1: and optional space
( Capture group 1
\w+(?:, ?\w+)* Match 1+ word chars, optionally repeat comma, optional space and 1+ word chars
) Close group 1
Example code (python 3)
import re
s = "s1: foo, bar s1:x,y s2:hello, hi s3: bar, foo."
def findByPrefix(prefix, s):
    pattern = rf"\b{re.escape(prefix)}: ?(\w+(?:, ?\w+)*)"
    res = []
    for m in re.findall(pattern, s):
        res.append(re.split(", ?", m))
    return res
print(findByPrefix("s1", s))
Output
[['foo', 'bar'], ['x', 'y']]
You can use:
my_string.split(',')
It should return a list of every element.
Here is how you can use the re module to split a string by a given delimiter:
import re
re.split(", ", my_string)
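To illustrate the difference between the two calls, here is a small sketch on a made-up string (my_string is not from the question). str.split(',') keeps the surrounding spaces, while a regex delimiter can absorb them.

```python
import re

my_string = "foo, bar,baz ,  qux"  # made-up sample with uneven spacing

print(my_string.split(","))
# => ['foo', ' bar', 'baz ', '  qux']

# \s*,\s* treats the comma plus any whitespace around it as the delimiter
print(re.split(r"\s*,\s*", my_string))
# => ['foo', 'bar', 'baz', 'qux']
```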

Python Regex replace all newline characters directly followed by a char with char

Example String:
str = "test sdf sfwe \n \na dssdf"
I want to replace the:
\na
with
a
Where 'a' could be any character.
I tried:
str = "test \n \na"
res = re.sub('[\n.]','a',str)
But how can I store the character behind the \n and use it as replacement?
You may use this regex with a capture group:
>>> s = "test sdf sfwe \n \na dssdf"
>>> print(re.sub(r'\n(.)', r'\1', s))
test sdf sfwe  a dssdf
The search regex r'\n(.)' matches \n followed by any character and captures that character in group #1.
The replacement r'\1' is a back-reference to capture group #1, which is put back into the string.
It is better to avoid str as a variable name, since it shadows the built-in str type in Python.
If by "any character" you meant any non-space character, then use \S (non-whitespace character) instead of . :
>>> print(re.sub(r'\n(\S)', r'\1', s))
test sdf sfwe
a dssdf
This lookahead-based approach also works and doesn't need a capture group:
>>> print(re.sub(r'\n(?=\S)', '', s))
test sdf sfwe
a dssdf
Note that [\n.] matches a single character that is either \n or a literal dot; it does not match \n followed by any character.
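The difference between the character class and the capture-group pattern can be seen on a small made-up string (sample is an assumption, not taken from the question):

```python
import re

sample = "a.b\nc"  # made-up sample containing a dot and a newline

# character class: each match is ONE char, either a newline or a literal dot
print(re.findall(r"[\n.]", sample))           # => ['.', '\n']

# sequence: a newline followed by any char, which is captured
print(re.findall(r"\n(.)", sample))           # => ['c']

# replacing the pair with just the captured char removes the newline
print(repr(re.sub(r"\n(.)", r"\1", sample)))  # => 'a.bc'
```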
Find all the matches:
matches = re.findall( r'\n\w', str )
Replace all of them:
for m in matches:
    str = str.replace(m, m[1])
That's all, folks! =)
I think the best way to avoid extra whitespace in your text is the following:
>>> import re
>>> string = "test sdf sfwe \n \na dssdf"
>>> ' '.join(re.findall(r'\w+', string))
'test sdf sfwe a dssdf'

Python split with multiple delimiters not working

I have a string:
feature.append(freq_and_feature(text, freq))
I want a list containing each word of the string, like [feature, append, freq, and, feature, text, freq], where each word is a string, of course.
These strings are contained in a file called helper.txt, so I'm doing the following, as suggested by multiple SO posts, like the accepted answer for this one (Python: Split string with multiple delimiters):
import re
with open("helper.txt", "r") as helper:
    for row in helper:
        print(re.split('\' .,()_', row))
However, I get the following, which is not what I want.
[' feature.append(freq_pain_feature(text, freq))\n']
re.split('\' .,()_', row)
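This is treated as one pattern that must match in sequence (a quote, a space, any character, a comma, an empty group, an underscore), not as a set of alternative delimiters. You probably meant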
re.split('[\' .,()_]', row)
re.split takes a regular expression as the first argument. To say "this OR that" in regular expressions, you can write a|b and it will match either a or b. If you wrote ab, it would only match a followed by b. Luckily, we don't have to write '| |.|,|(|...: there's a nicer form where you put the characters inside [] to state "match any one of these".
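A small sketch of the three forms side by side, using a made-up string (sample is an assumption chosen so that the delimiters sit next to each other):

```python
import re

sample = "xabyaz"  # made-up sample

# "ab" only matches the exact sequence a followed by b
print(re.split("ab", sample))    # => ['x', 'yaz']

# alternation and a character class both split on a OR b
print(re.split("a|b", sample))   # => ['x', '', 'y', 'z']
print(re.split("[ab]", sample))  # => ['x', '', 'y', 'z']
```

The empty string in the last two results comes from the adjacent a and b, which is why the answers in this thread either use + after the class or filter out empty items.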
It seems you want to split a string with non-word or underscore characters. Use
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[\W_]+', s) if x])
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
The [\W_]+ regex matches 1+ characters that are either non-word characters (\W = [^a-zA-Z0-9_]) or underscores.
You can get rid of the if x if you remove initial and trailing non-word characters from the input string, e.g. re.sub(r'^[\W_]+|[\W_]+$', '', s).
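Put together, the trim-then-split variant could look like the sketch below (the variable names trimmed and words are chosen here for illustration):

```python
import re

s = 'feature.append(freq_and_feature(text, freq))'

# strip leading/trailing delimiters first so the split yields no empty strings
trimmed = re.sub(r'^[\W_]+|[\W_]+$', '', s)
words = re.split(r'[\W_]+', trimmed)
print(words)
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
```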
You can try this:
parts = re.split('[.(_,)]+', row)
parts.pop()  # drop the last item, i.e. whatever follows the final delimiter (the line break)
print(parts)
This will result:
['feature', 'append', 'freq', 'and', 'feature', 'text', ' freq']
I think you are trying to split on the basis of non-word characters. It should be
re.split(r'[^A-Za-z0-9]+', s)
[^A-Za-z0-9] can be translated to --> [\W_]
Python Code
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[^A-Za-z0-9]+', s) if x])
This will also work, indeed
p = re.compile(r'[^\W_]+')
test_str = "feature.append(freq_and_feature(text, freq))"
print(re.findall(p, test_str))

How to replace part of string via regex with saving part of pattern?

For example, I have strings like this:
s = "chapter1 in chapters"
How can I replace it with regex to this:
s = "chapter 1 in chapters"
That is, I only need to insert a space between "chapter" and its number, if one is present. re.sub(r'chapter\d+', r'chapter \d+', s) doesn't work.
You can use lookarounds:
>>> s = "chapter1 in chapters"
>>> print(re.sub(r"(?<=\bchapter)(?=\d)", ' ', s))
chapter 1 in chapters
RegEx Breakup:
(?<=\bchapter) # asserts a position where the preceding text is the word start followed by chapter
(?=\d) # asserts a position where the next char is a digit
You can use capture groups, something like this:
>>> s = "chapter1 in chapters"
>>> re.sub(r'chapter(\d+)',r'chapter \1',s)
'chapter 1 in chapters'
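Both approaches can be compared on a slightly longer input. This is a sketch: the extended sample string is made up, and a \b was added to the capture-group pattern so that it lines up with the lookaround version.

```python
import re

s = "chapter1 in chapters, see chapter12"  # extended, made-up sample

# capture the digits and put them back after a space
with_group = re.sub(r"\bchapter(\d+)", r"chapter \1", s)

# zero-width match between "chapter" and a digit; only a space is inserted
with_lookarounds = re.sub(r"(?<=\bchapter)(?=\d)", " ", s)

print(with_group)
# => chapter 1 in chapters, see chapter 12
print(with_group == with_lookarounds)
# => True
```

The word "chapters" is left alone in both cases because no digit follows "chapter" there.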

Find all strings that are in between two sub strings

I have the following string as an example:
string = "## cat $$ ##dog$^"
I want to extract all the strings that are enclosed between "##" and "$", so the output will be:
[" cat ","dog"]
I only know how to extract the first occurrence:
import re
r = re.compile('##(.*?)$')
m = r.search(string)
if m:
    result_str = m.group(1)
Thoughts & suggestions on how to catch them all are welcomed.
Use re.findall() to get every occurrence of your substring. $ is a special character in regular expressions (the "end of the string" anchor), so you need to escape it to match a literal $.
>>> import re
>>> s = '## cat $$ ##dog$^'
>>> re.findall(r'##(.*?)\$', s)
[' cat ', 'dog']
To remove the leading and trailing whitespace, you can simply match it outside of the capture group.
>>> re.findall(r'##\s*(.*?)\s*\$', s)
['cat', 'dog']
Also, if the content may span multiple lines, you may consider using a negated character class instead of the dot.
>>> re.findall(r'##\s*([^$]*)\s*\$', s)
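A short sketch of how the negated class behaves, including on a made-up multi-line sample (multi is an assumption, not from the question):

```python
import re

s = '## cat $$ ##dog$^'
print(re.findall(r'##\s*([^$]*)\s*\$', s))
# => ['cat ', 'dog']  (the greedy [^$]* also takes the trailing space)

multi = '## multi\nline $'  # made-up sample whose content spans a newline

# the dot does not cross the newline, so the lazy pattern finds nothing
print(re.findall(r'##\s*(.*?)\s*\$', multi))
# => []

# [^$] matches newlines too; strip() trims the surrounding whitespace
print([m.strip() for m in re.findall(r'##([^$]*)\$', multi)])
# => ['multi\nline']
```

Because the greedy class swallows trailing whitespace, calling strip() on each match is one simple way to trim the results.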
