I have the following string as an example:
string = "## cat $$ ##dog$^"
I want to extract all the strings that are enclosed between "##" and "$", so the output will be:
[" cat ","dog"]
I only know how to extract the first occurrence:
import re
r = re.compile('##(.*?)$')
m = r.search(string)
if m:
    result_str = m.group(1)
Thoughts & suggestions on how to catch them all are welcomed.
Use re.findall() to get every occurrence of your substring. $ is a special character in regular expressions: it is the "end of the string" anchor, so you need to escape it as \$ to match a literal dollar sign.
>>> import re
>>> s = '## cat $$ ##dog$^'
>>> re.findall(r'##(.*?)\$', s)
[' cat ', 'dog']
To remove the leading and trailing whitespace, you can simply match it outside of the capture group.
>>> re.findall(r'##\s*(.*?)\s*\$', s)
['cat', 'dog']
Also, if the content may span multiple lines, consider using a negated character class instead of the dot, since . does not match a newline by default.
>>> re.findall(r'##\s*([^$]*)\s*\$', s)
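To see why the negated class helps, here is a minimal sketch. The multi-line input is a made-up example, not taken from the question. Note that the greedy class also swallows trailing whitespace before the $, so you may still want to strip the results.

```python
import re

# Hypothetical input: the first item contains a line break.
text = "## cat\nfish $$ ##dog$^"

# '.' does not match '\n' by default, so the first item is missed.
dot_matches = re.findall(r'##\s*(.*?)\s*\$', text)

# [^$] matches any character except '$', newlines included.
negated_matches = re.findall(r'##\s*([^$]*)\s*\$', text)

# re.DOTALL is the other common fix: it lets '.' match '\n' as well.
dotall_matches = re.findall(r'##\s*(.*?)\s*\$', text, flags=re.DOTALL)

print(dot_matches)      # ['dog']
print(negated_matches)  # ['cat\nfish ', 'dog']
print(dotall_matches)   # ['cat\nfish', 'dog']
```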
I'm now trying to extract the text from a structured string by regex.
For instance,
string = "field1:afield3:bfield2:cfield3:d"
all I want is the values of field3 which are 'b' and 'd'
I try to use the regex = "(field1:.*?)?(field2:.*?)?field3:"
and split the raw string by it.
but I've got this:
['', 'field1:a', None, 'b', None, 'field2:c', 'd']
So, what is the solution?
The real case is:
string = "1st sentence---------------------- Forwarded by Michelle
Cash/HOU/ECT on ---------------------------Ava Syon#ENRON To: Michelle
Cash/HOU/ECT#ECTcc: Twanda Sweet/HOU/ECT#ECT Subject: 2nd sentence---------
------------- Forwarded by Michelle Cash/HOU/ECT on -----------------------
----Ava Syon#ENRON To: Michelle Cash/HOU/ECT#ECTcc: Twanda
Sweet/HOU/ECT#ECT Subject: 3rd sentence"
(one line string, without \n)
a list
re = ["1st sentence","2nd sentence","3rd sentence"]
is the result needed
Thanks!
Use
re.findall(r'field3:(.*?)(?=field\d+:|$)', s)
NOTE: re.findall returns the contents of the capturing group, so you do not need a lookbehind in the pattern; a capturing group will do.
The regex matches:
field3: - a literal char sequence
(.*?) - any 0+ chars other than line break, as few as possible (if you use re.DOTALL modifier, the dot will match a newline, too)
(?=field\d+:|$) - a positive lookahead that requires (but does not consume, does not add to the match or capture) the presence of field, 1+ digits, : or the end of string after the current position.
Python demo:
import re
rx = r"field3:(.*?)(?=field\d+:|$)"
s = "field1:afield3:b and morefield2:cfield3:d and here"
res = re.findall(rx, s)
print(res)
# => ['b and more', 'd and here']
NOTE: A more efficient (unrolled) version of the same regex is
field3:([^f]*(?:f(?!ield\d+:)[^f]*)*)
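As a quick sanity check, the lazy and the unrolled patterns can be run side by side. This is only a sketch: the sample string is made up and deliberately contains the letter f inside a value, which is the case the unrolled form has to handle. One difference to keep in mind is that [^f] also matches newlines, while the dot does not unless re.DOTALL is set.

```python
import re

# Hypothetical sample with an 'f' inside a field value.
s = "field1:afield3:b of funfield2:cfield3:d and here"

lazy = re.compile(r'field3:(.*?)(?=field\d+:|$)')
unrolled = re.compile(r'field3:([^f]*(?:f(?!ield\d+:)[^f]*)*)')

lazy_result = lazy.findall(s)
unrolled_result = unrolled.findall(s)

print(lazy_result)      # ['b of fun', 'd and here']
print(unrolled_result)  # ['b of fun', 'd and here']
```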
You could use a positive lookbehind. It will match the single character directly after field3:
>>> import re
>>> string = "field1:afield3:bfield2:cfield3:d"
>>> re.findall(r'(?<=field3:).', string)
['b', 'd']
This only works for single-character values. I would add a positive lookahead, but then it would become the same answer as Wiktor's.
So here's an alternative with re.split():
>>> string = "field1:afield3:boatfield2:cfield3:dolphin"
>>> elements = re.split(r'(field\d+:)',string)
>>> [elements[i+1] for i, x in enumerate(elements) if x == 'field3:']
['boat', 'dolphin']
A complex solution to get field values by field number using built-in str.replace(), str.split() and str.startswith() functions:
def getFieldValues(s, field_number):
    delimited = s.replace('field', '|field') # setting delimiter between fields
    return [i.split(':')[1] for i in delimited.split('|') if i.startswith('field' + str(field_number))]
s = "field1:a hello-againfield3:b some textfield2:c another textfield3:d and data"
print(getFieldValues(s, 3))
# ['b some text', 'd and data']
print(getFieldValues(s, 1))
# ['a hello-again']
print(getFieldValues(s, 2))
# ['c another text']
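Two edge cases are worth knowing about with this approach: split(':')[1] cuts a value off at its first colon, and startswith('field1') also accepts field10. A possible variant is sketched below; get_field_values is a hypothetical name and the inputs are made up. It still assumes that values never contain the word field or the | character.

```python
def get_field_values(s, field_number):
    # Include the colon in the prefix so that field1 does not match field10.
    prefix = 'field{}:'.format(field_number)
    chunks = s.replace('field', '|field').split('|')
    # Slice off the prefix instead of splitting on ':' so colons in values survive.
    return [chunk[len(prefix):] for chunk in chunks if chunk.startswith(prefix)]

print(get_field_values("field1:a 10:30 notefield3:b", 1))  # ['a 10:30 note']
print(get_field_values("field1:afield10:b", 1))            # ['a']
```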
I have a string:
feature.append(freq_and_feature(text, freq))
I want a list containing each word of the string, like [feature, append, freq, and, feature, text, freq], where each word is a string, of course.
These strings are contained in a file called helper.txt, so I'm doing the following, as suggested by multiple SO posts, like the accepted answer for this one (Python: Split string with multiple delimiters):
import re
with open("helper.txt", "r") as helper:
    for row in helper:
        print(re.split('\' .,()_', row))
However, I get the following, which is not what I want.
['feature.append(freq_and_feature(text, freq))\n']
re.split('\' .,()_', row)
This looks for the string ' .,()_ to split on. You probably meant
re.split('[\' .,()_]', row)
re.split takes a regular expression as its first argument. To say "this OR that" in a regular expression, you can write a|b, which matches either a or b; ab would only match a followed by b. Luckily, instead of writing '| |.|,|(|..., you can put the characters inside [] to say "match any one of these".
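The difference is easy to see by running both forms on the sample line. This is a small sketch; the variable names are only for illustration.

```python
import re

row = "feature.append(freq_and_feature(text, freq))"

# Without brackets the pattern must match as one sequence (quote, space,
# any character, comma, empty group, underscore), which never occurs here.
sequence_split = re.split("' .,()_", row)

# With brackets, any single listed character is a delimiter.
class_split = re.split("[' .,()_]", row)

# Adding + collapses runs of delimiters such as ", " into one split point.
run_split = re.split("[' .,()_]+", row)

print(sequence_split)  # the whole row, unsplit
print(class_split)     # words, with '' between adjacent delimiters
print(run_split)       # words, with a single '' after the trailing '))'
```

The empty strings come from delimiters that sit next to each other or at the end of the string, so they usually need to be filtered out.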
It seems you want to split a string with non-word or underscore characters. Use
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[\W_]+', s) if x])
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
The [\W_]+ regex matches 1+ characters that are either non-word characters (\W = [^a-zA-Z0-9_]) or underscores.
You can get rid of the if x if you remove initial and trailing non-word characters from the input string, e.g. re.sub(r'^[\W_]+|[\W_]+$', '', s).
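That trim-then-split variant could look roughly like this (a sketch; if the input consists only of delimiters, the result is [''] rather than an empty list):

```python
import re

s = 'feature.append(freq_and_feature(text, freq))'

# Strip leading and trailing delimiters first, so split yields no empty strings.
trimmed = re.sub(r'^[\W_]+|[\W_]+$', '', s)
words = re.split(r'[\W_]+', trimmed)

print(words)
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
```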
You can try this
parts = re.split('[.(_,)]+', row)
parts.pop()  # drop the trailing element left after the final delimiter
print(parts)
This will result in:
['feature', 'append', 'freq', 'and', 'feature', 'text', ' freq']
I think you are trying to split on the basis of non-word characters. It should be
re.split(r'[^A-Za-z0-9]+', s)
[^A-Za-z0-9] is [\W_] spelled out: for ASCII text the two classes match the same characters.
Python Code
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[^A-Za-z0-9]+', s) if x])
This will also work, indeed
p = re.compile(r'[^\W_]+')
test_str = "feature.append(freq_and_feature(text, freq))"
print(re.findall(p, test_str))
Suppose there is a series of strings. Important items are enclosed in quotes, but other items are enclosed in escaped quotes. How can you return only the important items?
Example where both are returned:
import re
testString = 'this, is a test "one" it should only return the first item \"two\" and not the second'
pattern = '"([^\\\"]*)"'
print(re.findall(pattern, testString))
Result prints
['one', 'two']
How can I get python's re to only print
['one']
You can use negative lookbehinds to ensure there's no backslash before the quote:
import re
testString = r'this, is a test "one" it should only return the first item \"two\" and not the second'
pattern = r'(?<!\\)"([^"]*)(?<!\\)"'
#           ^^^^^^^        ^^^^^^^
print(re.findall(pattern, testString))
Even though you write \" to mark the other items, in a normal Python string literal \" is just an escaped quote, so the string actually contains "two" with no backslashes. Use a raw string literal, where \" is kept as a backslash followed by a quote:
import re
testString = r'this, is a test "one" it should only return the first item \"two\" and not the second'
pattern = r'"(\w*)"'
print(re.findall(pattern, testString))
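The effect of the r prefix on the test string can be checked directly. In this sketch the two short literals are made-up stand-ins for the longer test string, and the pattern is the negative-lookbehind one from the earlier answer.

```python
import re

plain = 'item \"two\" here'   # \" is just an escaped quote: no backslash is stored
raw = r'item \"two\" here'    # the raw literal keeps both backslashes

plain_has_backslash = '\\' in plain
raw_has_backslash = '\\' in raw

pattern = r'(?<!\\)"([^"]*)(?<!\\)"'
plain_matches = re.findall(pattern, plain)
raw_matches = re.findall(pattern, raw)

print(plain_has_backslash, plain_matches)  # False ['two']
print(raw_has_backslash, raw_matches)      # True []
```

So no pattern can tell the two kinds of quotes apart unless the backslashes are really present in the data.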
How to use regular expressions to only capture a word by itself rather than the word inside another word?
For example, I'd like to replace only the "Co" within "Company & Co."
import re
re.subn('Co','',"Company & Co")
>> ('mpany & ', 2)  # which I don't want
>> "Company & "  # desired result
You want word boundaries.
They are expressed with \b in most regex dialects (and with \< and \> in some). Python uses \b.
import re
re.subn(r'\bCo\b', '', "Company & Co")
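Note the r in front of the pattern: in a regular string literal, \b is the backspace character, so the raw string is needed for the regex engine to see \b as a word boundary.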
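A short sketch of what goes wrong without the r prefix (the variable names are only for illustration):

```python
import re

plain_pattern = '\bCo\b'   # each '\b' is a single backspace character (0x08)
raw_pattern = r'\bCo\b'    # backslash + 'b', which the regex engine reads as a word boundary

print(len(plain_pattern), len(raw_pattern))  # 4 6

plain_result = re.subn(plain_pattern, '', "Company & Co")
raw_result = re.subn(raw_pattern, '', "Company & Co")

print(plain_result)  # ('Company & Co', 0)
print(raw_result)    # ('Company & ', 1)
```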
"Word itself" means that the word is spanned by spaces or beginning/end of the sentence. So...
re.subn(r'(\s|^)Co(\s|$)', r'\g<1>\g<2>', "Company & Co")
What about this:
import re
print(re.subn('Co$','',"Company & Co"))
$ is one of the regex metacharacters; they are very useful and worth looking at. Note that this only removes a Co at the very end of the string.
Use the r"\b" expression to match the empty string at the beginning or end of what you're looking for to ensure that it's a whole word and not part of another word:
>>> import re
>>> pat1 = re.compile("Co")
>>> pat2 = re.compile(r"\bCo\b")
>>> pat1.match("Company")
<_sre.SRE_Match object at 0x106b92780>
>>> pat2.search("Company")
# (fails)
>>> pat2.search("Co")
<_sre.SRE_Match object at 0x106b927e8>
>>> pat2.search("Co & Something")
<_sre.SRE_Match object at 0x106b92780> # succeeds
This syntax works whether the boundary around what you're looking for is:
white space
beginning of string
end of string
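Those cases, plus punctuation, can be checked in one go. This is a sketch with made-up sample strings. Since \b is defined relative to \w, digits and underscores count as word characters, which is why the last sample does not match.

```python
import re

pat = re.compile(r'\bCo\b')

# Hypothetical samples covering each kind of boundary.
samples = [
    "Company & Co",    # end of string
    "Co & Something",  # beginning of string, then white space
    "Smith & Co.",     # punctuation also forms a boundary
    "Company",         # 'Co' is followed by a letter
    "Co_1",            # '_' is a word character
]

results = {text: bool(pat.search(text)) for text in samples}
for text, found in results.items():
    print(text, found)
```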
I want to print all substrings of a string that match the following pattern, plus the next 2 characters:
for example, get the substrings
$iwantthis*12
$and this*11
from the string:
$iwantthis*1231 $and this*1121
At the moment I use
print re.search('(.*)$(.*) *',string)
and I get $iwantthis*1231, but how can I limit the number of characters after the last pattern symbol *?
Greetings
In [13]: s = '$iwantthis*1231 $and this*1121'
In [14]: re.findall(r'[$].*?[*].{2}', s)
Out[14]: ['$iwantthis*12', '$and this*11']
Here,
[$] matches $;
.*?[*] matches the shortest sequence of characters followed by *;
.{2} matches any two characters.
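If the two characters after the * are always digits, a stricter variant is possible. The sketch below assumes that; the second input is a made-up example showing where the two patterns differ.

```python
import re

s = '$iwantthis*1231 $and this*1121'

any_two = re.findall(r'[$].*?[*].{2}', s)       # any two characters after '*'
two_digits = re.findall(r'\$[^*]*\*\d{2}', s)   # exactly two digits after '*'

print(any_two)     # ['$iwantthis*12', '$and this*11']
print(two_digits)  # ['$iwantthis*12', '$and this*11']

# Hypothetical input where only one digit follows the first '*'.
s2 = '$x*1 $y*22'
any_two_s2 = re.findall(r'[$].*?[*].{2}', s2)
two_digits_s2 = re.findall(r'\$[^*]*\*\d{2}', s2)

print(any_two_s2)     # ['$x*1 ', '$y*22']
print(two_digits_s2)  # ['$y*22']
```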
import re
data = "$iwantthis*1231 $and this*1121"
print(re.findall(r'(\$.*?\d{2})', data))
Output
['$iwantthis*12', '$and this*11']