How to extract a specific field from a structured string?

I'm trying to extract text from a structured string with a regex.
For instance,
string = "field1:afield3:bfield2:cfield3:d"
All I want is the values of field3, which are 'b' and 'd'.
I tried the regex "(field1:.*?)?(field2:.*?)?field3:"
and split the raw string by it,
but I got this:
['', 'field1:a', None, 'b', None, 'field2:c', 'd']
So, what is the solution?
The real case is:
string = "1st sentence---------------------- Forwarded by Michelle
Cash/HOU/ECT on ---------------------------Ava Syon#ENRON To: Michelle
Cash/HOU/ECT#ECTcc: Twanda Sweet/HOU/ECT#ECT Subject: 2nd sentence---------
------------- Forwarded by Michelle Cash/HOU/ECT on -----------------------
----Ava Syon#ENRON To: Michelle Cash/HOU/ECT#ECTcc: Twanda
Sweet/HOU/ECT#ECT Subject: 3rd sentence"
(a single-line string, without \n)
The result needed is the list
result = ["1st sentence","2nd sentence","3rd sentence"]
Thanks!

Use
re.findall(r'field3:(.*?)(?=field\d+:|$)', s)
NOTE: re.findall returns the contents of the capturing group, so you do not need a lookbehind in the pattern; a capturing group will do.
The regex matches:
field3: - a literal char sequence
(.*?) - any 0+ chars other than line break (if you use re.DOTALL modifier, the dot will match a newline, too)
(?=field\d+:|$) - a positive lookahead that requires (but does not consume, does not add to the match or capture) the presence of field, 1+ digits, : or the end of string after the current position.
Python demo:
import re
rx = r"field3:(.*?)(?=field\d+:|$)"
s = "field1:afield3:b and morefield2:cfield3:d and here"
res = re.findall(rx, s)
print(res)
# => ['b and more', 'd and here']
NOTE: A more efficient (unrolled) version of the same regex is
field3:([^f]*(?:f(?!ield\d+:)[^f]*)*)
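If you need this for more than one field, the pattern can be wrapped in a small helper. This is only a sketch: the name field_values and its number parameter are made up for illustration, and it assumes every field label looks like field<digits>:.

```python
import re

def field_values(s, number):
    """Return all values of field<number> in s (hypothetical helper)."""
    # Same pattern as above, with the field number filled in.
    pattern = rf"field{number}:(.*?)(?=field\d+:|$)"
    # re.DOTALL lets a value span line breaks.
    return re.findall(pattern, s, flags=re.DOTALL)

s = "field1:afield3:b and morefield2:cfield3:d and here"
print(field_values(s, 3))
print(field_values(s, 2))
```

Because the colon is part of the pattern, asking for field 1 should not pick up a label such as field10:.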

You could use a positive lookbehind. It will find the single character directly after field3: each time.
>>> import re
>>> string = "field1:afield3:bfield2:cfield3:d"
>>> re.findall(r'(?<=field3:).', string)
['b', 'd']
This only works for a single-character value. I would add a positive lookahead, but then it would become the same answer as Wiktor's.
So here's an alternative with re.split():
>>> string = "field1:afield3:boatfield2:cfield3:dolphin"
>>> elements = re.split(r'(field\d+:)',string)
>>> [elements[i+1] for i, x in enumerate(elements) if x == 'field3:']
['boat', 'dolphin']
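The same re.split() idea might also cover the real case from the question. The sketch below rebuilds that string from the text in the question (the header variable is just a convenience introduced here) and assumes every forwarded block starts with dashes followed by " Forwarded by" and ends with "Subject: ".

```python
import re

# One forwarded-message header, copied from the question's example string.
header = ("---------------------- Forwarded by Michelle Cash/HOU/ECT on "
          "---------------------------Ava Syon#ENRON To: Michelle "
          "Cash/HOU/ECT#ECTcc: Twanda Sweet/HOU/ECT#ECT Subject: ")
s = "1st sentence" + header + "2nd sentence" + header + "3rd sentence"

# Split on everything from the dashes before "Forwarded by"
# up to and including the next "Subject: ".
sentences = re.split(r"-+ Forwarded by.*?Subject: ", s)
print(sentences)
```

The lazy .*? matters here: a greedy .* would run to the last "Subject: " and swallow the 2nd sentence.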

A solution that gets field values by field number using the built-in str.replace(), str.split() and str.startswith() methods (it assumes the values themselves never contain 'field' or '|'):
def getFieldValues(s, field_number):
    delimited = s.replace('field', '|field')  # insert a delimiter before each field
    prefix = 'field' + str(field_number) + ':'  # trailing colon keeps field1 from matching field10
    return [i.split(':', 1)[1] for i in delimited.split('|') if i.startswith(prefix)]
s = "field1:a hello-againfield3:b some textfield2:c another textfield3:d and data"
print(getFieldValues(s, 3))
# ['b some text', 'd and data']
print(getFieldValues(s, 1))
# ['a hello-again']
print(getFieldValues(s, 2))
# ['c another text']

Related

How to get the text that is separated by a comma after a keyword

so I have a string that has multiple patterns like
s1 = "foo, bar"
s1 = "x, y"
s2 = "hello, hi"
s3 = "bar, foo."
I'm wondering how I can get the strings that are separated by a comma (insert random text here).
So from this example, I want to get strings ["foo","bar"] and ["x","y"] when I'm looking for "s1", and "hello" & "hi" when I look for s2, etc.
Thanks!
EDIT:
Let's assume using .split(',') is impractical due to a large number of commas outside this specific pattern I listed

The question was edited, but for the original string:
"s1: foo, bar s1:x,y s2:hello, hi s3: bar, foo."
You could use a pattern to match the specific part and then use re.split to split on a comma and optional space.
\bs1: ?(\w+(?:, ?\w+)*)
Explanation
\bs1: ? Match s1: and optional space
( Capture group 1
\w+(?:, ?\w+)* Match 1+ word chars, optionally repeat comma, optional space and 1+ word chars
) Close group 1
Example code (python 3)
import re
s = "s1: foo, bar s1:x,y s2:hello, hi s3: bar, foo."
def findByPrefix(prefix, s):
    pattern = rf"\b{re.escape(prefix)}: ?(\w+(?:, ?\w+)*)"
    res = []
    for m in re.findall(pattern, s):
        res.append(re.split(", ?", m))
    return res
print(findByPrefix("s1", s))
Output
[['foo', 'bar'], ['x', 'y']]

You can use:
my_string.split(',')
It should return a list of every element.

Here is how you can use the re module to split a string by a given delimiter:
import re
re.split(", ", my_string)
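As a small illustration of why re.split can be handier than str.split here, the sketch below splits on a comma followed by any amount of whitespace. The sample value of my_string is made up for the example.

```python
import re

# Hypothetical sample input with inconsistent spacing after the commas.
my_string = "foo, bar,baz,   qux"

# ",\s*" matches a comma plus any whitespace that follows it.
parts = re.split(r",\s*", my_string)
print(parts)
```

With plain my_string.split(', ') the items "bar,baz" would stay glued together, because there is no space after that comma.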

Python Regex replace all newline characters directly followed by a char with char

Example String:
str = "test sdf sfwe \n \na dssdf"
I want to replace the:
\na
with
a
Where 'a' could be any character.
I tried:
str = "test \n \na"
res = re.sub('[\n.]','a',str)
But how can I store the character behind the \n and use it as replacement?

You may use this regex with a capture group:
>>> s = "test sdf sfwe \n \na dssdf"
>>> print(re.sub(r'\n(.)', r'\1', s))
test sdf sfwe  a dssdf
Search regex r'\n(.)' will match \n followed by any character and capture following character in group #1
Replacement r'\1' is back-reference to capture group #1 which is placed back in original string.
Better to avoid str as a variable name, since it shadows Python's built-in str type.
If by "any character" you meant any non-whitespace character, use \S (non-whitespace character) instead of .:
>>> print(re.sub(r'\n(\S)', r'\1', s))
test sdf sfwe
 a dssdf
A lookahead-based approach that doesn't need any capture group also works:
>>> print(re.sub(r'\n(?=\S)', '', s))
test sdf sfwe
 a dssdf
Note that [\n.] matches a single character that is either \n or a literal dot; it does not match \n followed by any character.
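Since whitespace is hard to see in printed output, here is a short side-by-side sketch of the three substitutions using repr(). The variable names are only for illustration.

```python
import re

s = "test sdf sfwe \n \na dssdf"

# Capture group: every "\n" plus the next character is replaced by that character.
any_char = re.sub(r'\n(.)', r'\1', s)
# \S: only a "\n" directly followed by a non-whitespace character is removed.
non_space = re.sub(r'\n(\S)', r'\1', s)
# Lookahead: same effect as the \S version, without a capture group.
lookahead = re.sub(r'\n(?=\S)', '', s)
# Character class: replaces each newline OR literal dot on its own.
char_class = re.sub(r'[\n.]', 'a', "x.\ny")

print(repr(any_char))    # 'test sdf sfwe  a dssdf'
print(repr(non_space))   # 'test sdf sfwe \n a dssdf'
print(repr(lookahead))   # 'test sdf sfwe \n a dssdf'
print(repr(char_class))  # 'xaay'
```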

Find all the matches (s holds the input string):
matches = re.findall(r'\n\w', s)
Replace all of them:
for m in matches:
    s = s.replace(m, m[1])
That's all, folks! =)

I think the best way to avoid ending up with extra spaces in your text is the following:
import re
string = "test sdf sfwe \n \na dssdf"
' '.join(re.findall(r'\w+', string))
# => 'test sdf sfwe a dssdf'

Python split with multiple delimiters not working

I have a string:
feature.append(freq_and_feature(text, freq))
I want a list containing each word of the string, like [feature, append, freq, and, feature, text, freq], where each word is a string, of course.
These strings are contained in a file called helper.txt, so I'm doing the following, as suggested by multiple SO posts, like the accepted answer for this one (Python: Split string with multiple delimiters):
import re
with open("helper.txt", "r") as helper:
    for row in helper:
        print(re.split('\' .,()_', row))
However, I get the following, which is not what I want.
[' feature.append(freq_pain_feature(text, freq))\n']

re.split('\' .,()_', row)
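This treats the whole argument as one pattern: a quote, a space, any character (.), a comma, an empty group () and an underscore, in that order. No such sequence occurs in the row, so nothing is split. You probably meant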
re.split('[\' .,()_]', row)
re.split takes a regular expression as the first argument. To say "this OR that" in regular expressions, you can write a|b and it will match either a or b. If you wrote ab, it would only match a followed by b. Luckily, so we don't have to write '| |.|,|(|..., there's a nice form where you can use []s to state that everything inside should be treated as "match one of these".
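To make the difference concrete, here is a small sketch comparing the alternation form with the character class on the row from the question. It also shows that splitting on single delimiters leaves empty strings wherever two delimiters are adjacent; adding + to the class is one way to reduce them.

```python
import re

row = "feature.append(freq_and_feature(text, freq))"

# Spelled-out alternation: special characters must be escaped one by one.
alternation = re.split(r"'| |\.|,|\(|\)|_", row)
# Character class: same meaning, and . ( ) need no escaping inside [].
char_class = re.split(r"[' .,()_]", row)
# With +, a run of adjacent delimiters counts as a single separator.
merged = re.split(r"[' .,()_]+", row)

print(alternation == char_class)
print(char_class)
print(merged)
```

The last item of merged is still an empty string, because the row ends with delimiters.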

It seems you want to split a string with non-word or underscore characters. Use
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[\W_]+', s) if x])
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
The [\W_]+ regex matches 1+ characters that are not word (\W = [^a-zA-Z0-9_]) or underscores.
You can get rid of the if x if you remove initial and trailing non-word characters from the input string, e.g. re.sub(r'^[\W_]+|[\W_]+$', '', s).
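A minimal sketch of that last tip, stripping the leading and trailing delimiters first so that no filtering is needed afterwards (the variable names are arbitrary):

```python
import re

s = 'feature.append(freq_and_feature(text, freq))'

# Remove non-word characters and underscores from both ends of the string.
trimmed = re.sub(r'^[\W_]+|[\W_]+$', '', s)
# Now the split cannot produce empty strings at the edges.
words = re.split(r'[\W_]+', trimmed)
print(words)
```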

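You can try this:
parts = re.split('[.(_,)]+', row, flags=re.IGNORECASE)
parts.pop()  # drop the last piece left over after the closing parentheses
print(parts)
This results in (note the leading space in ' freq', since space is not in the character class):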
['feature', 'append', 'freq', 'and', 'feature', 'text', ' freq']

I think you are trying to split on the basis of non-word characters. It should be
re.split(r'[^A-Za-z0-9]+', s)
For ASCII text, [^A-Za-z0-9] is equivalent to [\W_].
Python code:
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[^A-Za-z0-9]+', s) if x])
This will also work:
p = re.compile(r'[^\W_]+')
test_str = "feature.append(freq_and_feature(text, freq))"
print(re.findall(p, test_str))

How to replace part of string via regex with saving part of pattern?

For example, I have strings like this:
s = "chapter1 in chapters"
How can I replace it with regex to this:
s = "chapter 1 in chapters"
i.e. I only need to insert whitespace between "chapter" and its number if it exists. re.sub(r'chapter\d+', r'chapter \d+', s) doesn't work.

You can use lookarounds:
>>> s = "chapter1 in chapters"
>>> print(re.sub(r"(?<=\bchapter)(?=\d)", ' ', s))
chapter 1 in chapters
RegEx Breakup:
(?<=\bchapter) # asserts a position where the preceding text is chapter, starting at a word boundary
(?=\d) # asserts a position where the next char is a digit

You can use capture groups, Something like this -
>>> s = "chapter1 in chapters"
>>> re.sub(r'chapter(\d+)',r'chapter \1',s)
'chapter 1 in chapters'
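The two approaches are not quite identical. As a rough comparison, the sketch below runs both on a made-up sentence: the lookaround version carries a \b, so it leaves "subchapter3" alone, while the capture-group version as written does not.

```python
import re

# Hypothetical test sentence, not from the question.
text = "chapter1 in chapters, chapter12 and subchapter3"

# Lookarounds: insert a space at the empty position between "chapter" and a digit.
lookaround = re.sub(r"(?<=\bchapter)(?=\d)", " ", text)
# Capture group: re-emit the captured digits after "chapter ".
capture = re.sub(r"chapter(\d+)", r"chapter \1", text)

print(lookaround)  # chapter 1 in chapters, chapter 12 and subchapter3
print(capture)     # chapter 1 in chapters, chapter 12 and subchapter 3
```

Adding \b in front of chapter in the second pattern should make the two behave the same on this input.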

Find all strings that are in between two sub strings

I have the following string as an example:
string = "## cat $$ ##dog$^"
I want to extract all the strings that are enclosed between "##" and "$", so the output will be:
[" cat ","dog"]
I only know how to extract the first occurrence:
import re
r = re.compile('##(.*?)$')
m = r.search(string)
if m:
    result_str = m.group(1)
Thoughts & suggestions on how to catch them all are welcomed.

Use re.findall() to get every occurrence of your substring. $ is a special character in regular expressions (the "end of the string" anchor), so you need to escape it to match a literal $.
>>> import re
>>> s = '## cat $$ ##dog$^'
>>> re.findall(r'##(.*?)\$', s)
[' cat ', 'dog']
To remove the leading and trailing whitespace, you can simply match it outside of the capture group.
>>> re.findall(r'##\s*(.*?)\s*\$', s)
['cat', 'dog']
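Also, if the text between the markers may span multiple lines, consider a negated character class, since [^$] matches newlines while . does not. Note that the greedy [^$]* also takes any trailing whitespace into the group.
>>> re.findall(r'##\s*([^$]*)\s*\$', s)
['cat ', 'dog']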
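To see the newline issue in action, here is a small sketch with a made-up input in which the first item contains a line break. It compares the dot-based pattern, the negated class, and the dot with re.DOTALL.

```python
import re

# Hypothetical input: the first item spans two lines.
s = "## cat\nfood $ ##dog$"

# "." stops at the newline, so the first item is never matched.
dot = re.findall(r'##(.*?)\$', s)
# [^$] matches anything except "$", newlines included.
negated = re.findall(r'##([^$]*)\$', s)
# re.DOTALL makes "." match newlines as well.
dotall = re.findall(r'##(.*?)\$', s, flags=re.DOTALL)

print(dot)      # ['dog']
print(negated)  # [' cat\nfood ', 'dog']
print(dotall)   # [' cat\nfood ', 'dog']
```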
