How to write regular expression to use re.split in python

I have a string like this:
----------
FT Weekend
----------
Why do we run marathons?
Are marathons and cycling races about more than exercise? What does the
literature of endurance tell us about our thirst for self-imposed hardship?
I want to delete everything from the first ---------- through the next ----------, inclusive.
I have been trying re.sub:
pattern =r"-+\n.+\n-+"
re.sub(pattern, '', thestring)

pattern =r"-+\n.+?\n-+"
re.sub(pattern, '', thestring,flags=re.DOTALL)
Just use the DOTALL flag. The problem with your regex was that, by default, . does not match \n, so you need to add the DOTALL flag explicitly to make it match \n.
See demo.
https://regex101.com/r/hR7tH4/24
or
pattern =r"-+\n[\s\S]+?\n-+"
re.sub(pattern, '', thestring)
if you don't want to add a flag.
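As a quick check of the two variants above, here is a minimal sketch that runs both on a shortened copy of the text from the question. The variable names (text, with_flag, without_flag) are just for illustration, and the exact layout of the original string is an assumption.

```python
import re

# Shortened copy of the text from the question (assumed layout: a title
# between two lines of dashes, followed by the body).
text = (
    "----------\n"
    "FT Weekend\n"
    "----------\n"
    "Why do we run marathons?\n"
    "Are marathons and cycling races about more than exercise?\n"
)

# Variant 1: let "." match newlines via the DOTALL flag.
with_flag = re.sub(r"-+\n.+?\n-+", "", text, flags=re.DOTALL)

# Variant 2: no flag; [\s\S] matches any character, newlines included.
without_flag = re.sub(r"-+\n[\s\S]+?\n-+", "", text)

print(with_flag == without_flag)  # True
print(repr(with_flag))
```

Both calls remove the dashed header and leave the newline that followed it, so you may still want to strip leading whitespace afterwards.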

Your regex doesn't match the expected part because .+ doesn't match the newline character. You can use the re.DOTALL flag (or re.S) to force . to match newlines, but instead of that you can use a negated character class:
>>> print(re.sub(r"-+[^-]+-+", '', s))
''
Why do we run marathons?
Are marathons and cycling races about more than exercise? What does the
literature of endurance tell us about our thirst for self-imposed hardship?
>>>
Or, more precisely, you can do:
>>> print(re.sub(r"-+[^-]+-+[^\w]+", '', s))
Why do we run marathons?
Are marathons and cycling races about more than exercise? What does the
literature of endurance tell us about our thirst for self-imposed hardship?
>>>

The problem with your regex (-+\n.+\n-+) is that . matches any character except a newline, and that .+ is greedy: once . is allowed to match newlines, it can span multiple ------- blocks.
You can use the following regex:
pattern = r"(?s)-+\n.+?\n-+"
The (?s) inline flag (the same as re.DOTALL, called "singleline" in some other regex flavors) makes . match any character, including a newline.
The .+? pattern matches one or more characters, but as few as possible, up to the next ----.
For a more thorough cleanup, I'd recommend:
pattern = r"(?s)\s*-+\n.+?\n-+\s*"
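To make the greedy versus lazy difference concrete, here is a small sketch. The two-header text is a made-up example (the question only shows one header), and the variable names are illustrative.

```python
import re

# Hypothetical text with two dashed headers, to expose the greedy problem.
two_headers = (
    "----------\nFT Weekend\n----------\n"
    "First article.\n"
    "----------\nFT Magazine\n----------\n"
    "Second article.\n"
)

# Greedy .+ runs to the LAST "\n----" it can find, swallowing the first article.
greedy = re.sub(r"(?s)-+\n.+\n-+", "", two_headers)

# Lazy .+? stops at the NEXT "\n----", so each header is removed separately.
lazy = re.sub(r"(?s)-+\n.+?\n-+", "", two_headers)

# The \s* version also eats the whitespace around a header.
one_header = "----------\nFT Weekend\n----------\nWhy do we run marathons?\n"
tidy = re.sub(r"(?s)\s*-+\n.+?\n-+\s*", "", one_header)

print(repr(greedy))  # '\nSecond article.\n'
print(repr(lazy))    # '\nFirst article.\n\nSecond article.\n'
print(repr(tidy))    # 'Why do we run marathons?\n'
```

Note that with several headers the \s* version would also remove the line break between the preceding text and the header, so it fits best when the header sits at the top of the string.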

Related

Split according to regex condition

Here is another question:
string = "Organization: S.P. Dyer Computer Consulting, Cambridge MA"
How can I capture all the characters after "Organization: ", whether they are full stops, digits, or anything else, using a regex?
result_organization = re.search(r"(Organization: )(\w*\.*\w*\.*\w*\s*\w*\s*\w*\s*)", string)
My code above is far too long and not a sensible approach.

I would recommend using the str.find method, like this:
print(string[string.find("Organization")+14:])

You don't need a regex for that; this simple code should give you the desired result:
s = "Organization: S.P. Dyer Computer Consulting, Cambridge MA"
# Strip the 14-character "Organization: " prefix if it is present.
if s.startswith("Organization: "):
    s = s[14:]
print(s)
You could also use the pattern (?<=Organization: ).+
Explanation:
(?<=Organization: ) - positive lookbehind; asserts that the text immediately before the current position is "Organization: "
.+ - matches one or more characters other than newline characters
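For reference, a minimal sketch of the lookbehind pattern in use. The variable names are illustrative, and the None fallback is my own addition for lines without the prefix.

```python
import re

string = "Organization: S.P. Dyer Computer Consulting, Cambridge MA"

# The lookbehind is not part of the match, so group(0) is already the value.
match = re.search(r"(?<=Organization: ).+", string)
organization = match.group(0) if match else None

print(organization)  # S.P. Dyer Computer Consulting, Cambridge MA
```

Python only accepts fixed-width lookbehinds, which is fine here because the prefix is a literal string.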

You could use a single capturing group instead of two.
Instead of spelling out every word with (\w*\.*\w*\.*\w*\s*\w*\s*\w*\s*), you could match any character except a newline with the dot, repeated one or more times, so the match runs to the end of the line.
But note that this would also match strings like ##$$ ++
^Organization: (.+)
For example
import re
string = "Organization: S.P. Dyer Computer Consulting, Cambridge MA"
result_organization = re.search("Organization: (.*)", string)
print(result_organization.group(1))
If you want a somewhat more restrictive pattern you might use a character class and specify what you would allow to match. For example:
^Organization: ([\w.,]+(?: [\w.,]+)*)
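Here is a short sketch of how the more restrictive pattern behaves compared with the catch-all one. The helper name extract_organization and the constant ORG_RE are made up for this example.

```python
import re

# Rule taken from the pattern above: "words" made of word characters,
# dots and commas, separated by single spaces.
ORG_RE = re.compile(r"^Organization: ([\w.,]+(?: [\w.,]+)*)")


def extract_organization(line):
    """Return the text after 'Organization: ', or None if it does not fit."""
    match = ORG_RE.search(line)
    return match.group(1) if match else None


good = extract_organization(
    "Organization: S.P. Dyer Computer Consulting, Cambridge MA"
)
bad = extract_organization("Organization: ##$$ ++")

print(good)  # S.P. Dyer Computer Consulting, Cambridge MA
print(bad)   # None
```

Whether rejecting input like ##$$ ++ is desirable depends on your data; (.+) is the safer choice if organization names can contain arbitrary punctuation.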

extract word and before word and insert between ”_” in regex

I need some help writing a regex. I want to find a word and the word before it, and insert "_" between them, using a Python regex. My inputs look like the following:
Input
s2 = 'Some other medical terms and stuff diagnosis of R45.2 was entered for this patient. Where did Doctor Who go? Then xxx feea fdsfd'
# my regex pattern
re.sub(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,1}diagnosis", r"\1_", s2)
Desired Output:
s2 = 'Some other medical terms and stuff_diagnosis of R45.2 was entered for this patient. Where did Doctor Who go? Then xxx feea fdsfd'

You have no capturing group defined in your regex, but are using \1 placeholder (replacement backreference) to refer to it.
You want to replace 1+ special chars other than - and ' before the word diagnosis, thus you may use
re.sub(r"[^\w'-]+(?=diagnosis)", "_", s2)
Details
[^\w'-]+ - one or more characters that are neither word characters nor ' nor -
(?=diagnosis) - a positive lookahead that does not consume the text (does not add to the match value and thus re.sub does not remove this piece of text) but just requires diagnosis text to appear immediately to the right of the current location.
Or
re.sub(r"[^\w'-]+(diagnosis)", r"_\1", s2)
Here, [^\w'-]+ also matches those special chars, but (diagnosis) is a capturing group whose text can be referred to using the \1 placeholder in the replacement pattern.
NOTE: If you want to make sure diagnosis is matched as a whole word, use \b around it, \bdiagnosis\b (mind the r raw string literal prefix!).
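Putting the two variants and the \b note together, here is a minimal sketch on a shortened version of the input. The variable names are illustrative.

```python
import re

s2 = ("Some other medical terms and stuff diagnosis of R45.2 was entered "
      "for this patient.")

# Lookahead version: only the separator is matched and replaced.
with_lookahead = re.sub(r"[^\w'-]+(?=\bdiagnosis\b)", "_", s2)

# Capturing-group version: the word is matched too, then restored via \1.
with_group = re.sub(r"[^\w'-]+(\bdiagnosis\b)", r"_\1", s2)

print(with_lookahead)
print(with_lookahead == with_group)  # True
```

Both calls produce the same string here; the lookahead version is handy when you do not want to repeat the matched word in the replacement.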

Match regex with \\n in it

I have the following string:
>>> repr(s)
" NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64 Video Service Corp
I want to match the text immediately before each \\n, going back to the preceding whitespace character. The output should be:
['NBCUniversal', 'VOLGAFILMINC']
Here is what I have so far:
re.findall(r'[^s].+\\n\d{1,2}', s)
What would be the correct regex for this?

EDIT: sorry, I hadn't read your question carefully.
If you want to find all runs of letters immediately before a literal \n, re.findall is appropriate. You can obtain the result you want with:
>>> import re
>>> s = " NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64 Video Service Corp "
>>> re.findall(r'(?i)[a-z]+(?=\\n)', s)
['NBCUniversal', 'VOLGAFILMINC']
OLD ANSWER:
re.findall is not the appropriate method since you only need one result (that is a pair of strings). Here the re.search method is more appropriate:
>>> import re
>>> s = " NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64 Video Service Corp "
>>> res = re.search(r'(?i)^[^a-z\\]*([a-z]+)\\n[^a-z\\]*([a-z]+)', s)
>>> res.groups()
('NBCUniversal', 'VOLGAFILM')
Note: I have assumed that there are no other characters between the first word and the literal \n, but if it isn't the case, you can add [^a-z\\]* before the \\n in the pattern.

If you want to fix your existing code instead of replacing it, you're on the right track; you've just got a few minor problems.
Let's start with your pattern:
>>> re.findall(r'[^s].+\\n\d{1,2}', s)
[' NBCUniversal\\n63 VOLGAFILM, INC VOLGAFILMINC\\n64']
The first problem is that .+ will match everything that it can, all the way up to the very last \\n\d{1,2}, rather than just to the next \\n\d{1,2}. To fix that, add a ? to make it non-greedy:
>>> re.findall(r'[^s].+?\\n\d{1,2}', s)
[' NBCUniversal\\n63', ' VOLGAFILM, INC VOLGAFILMINC\\n64']
Notice that we now have two strings, as we should. The problem is, those strings don't just have whatever matched the .+?, they have whatever matched the entire pattern. To fix that, wrap the part you want to capture in () to make it a capturing group:
>>> re.findall(r'[^s](.+?)\\n\d{1,2}', s)
[' NBCUniversal', ' VOLGAFILM, INC VOLGAFILMINC']
That's nicer, but it still has a bunch of extra stuff on the left end. Why? Well, you're capturing everything after [^s], which means any character except the letter s. You almost certainly meant [\s], meaning any character in the whitespace class. (Note that \s is already the whitespace class, so wrapping it in brackets as [\s] is unnecessary.) That's better, but it's still only going to match one space, not all the spaces. And it will match the earliest space it can that still leaves .+? something to match, not the latest. So if you want to soak up all the excess spaces, you need to repeat it:
>>> re.findall(r'\s+(.+?)\\n\d{1,2}', s)
['NBCUniversal', 'VOLGAFILM, INC VOLGAFILMINC']
Getting closer… but the .+? matches anything, including the space between VOLGAFILM and VOLGAFILMINC, and again, the \s+ is going to match the first run of spaces it can, leaving the .+? to match everything after that.
You could fiddle with the prefix, but there's an easier solution. If you don't want spaces in your capture group, just capture a run of nonspaces instead of a run of anything, using \S:
>>> re.findall(r'\s+(\S+?)\\n\d{1,2}', s)
['NBCUniversal', 'VOLGAFILMINC']
And notice that once you've done that, the \s+ isn't really doing anything anymore, so let's just drop it:
>>> re.findall(r'(\S+?)\\n\d{1,2}', s)
['NBCUniversal', 'VOLGAFILMINC']
I've obviously made some assumptions above that are correct for your sample input, but may not be correct for real data. For example, if you had a string like Weyland-Yutani\\n…, I'm assuming you want Weyland-Yutani, not just Yutani. If you have a different rule, like only letters, just change the part in parentheses to whatever fits that rule, like (\w+?) or ([A-Za-z]+?).
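To illustrate how the choice of capture rule changes the result, here is a small sketch. The sample string with Weyland-Yutani is hypothetical, built only to show the difference.

```python
import re

# Hypothetical sample; "\\n" in the literal is a backslash followed by "n",
# not a real newline.
s = " NBCUniversal\\n63 Weyland-Yutani\\n64 Video Service Corp "

# Any run of non-whitespace characters before the literal \n and digits.
non_space = re.findall(r'(\S+?)\\n\d{1,2}', s)

# Letters only: the hyphen is not allowed, so only the tail is captured.
letters_only = re.findall(r'([A-Za-z]+?)\\n\d{1,2}', s)

print(non_space)     # ['NBCUniversal', 'Weyland-Yutani']
print(letters_only)  # ['NBCUniversal', 'Yutani']
```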

Assuming that the input actually has the sequence \n (backslash followed by letter 'n') and not a newline, this will work:
>>> re.findall(r'(\S+)\\n', s)
['NBCUniversal', 'VOLGAFILMINC']
If the string actually contains newlines then replace \\n with \n in the regular expression.
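The two cases can be compared side by side. This is a minimal sketch with made-up, shortened strings; the variable names are illustrative.

```python
import re

literal = "NBCUniversal\\n63 VOLGAFILMINC\\n64"  # backslash followed by "n"
newline = "NBCUniversal\n63 VOLGAFILMINC\n64"    # real line breaks

# In a raw-string pattern, \\n matches a backslash plus "n"...
from_literal = re.findall(r'(\S+)\\n', literal)

# ...while \n matches an actual newline character.
from_newline = re.findall(r'(\S+)\n', newline)

print(from_literal)  # ['NBCUniversal', 'VOLGAFILMINC']
print(from_newline)  # ['NBCUniversal', 'VOLGAFILMINC']
```

If repr(s) shows \\n, as in the question, the string holds the two-character sequence and the first pattern is the one to use.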

Regex to extract top level domain from email address

From email address like
xxx#site.co.uk
xxx#site.uk
xxx#site.me.uk
I want to write a regex that returns 'uk' in all of these cases.
I have tried
'+#([^.]+)\..+'
which gives only the domain name. I have tried using
'[^/.]+$'
but it gives an error.

The regex to extract what you are asking for is:
\.([^.\n\s]*)$ with /gm modifiers
explanation:
\. matches the character . literally
1st Capturing group ([^.\n\s]*)
[^.\n\s]* match a single character not present in the list below
Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
. the literal character .
\n matches a line-feed (newline) character (ASCII 10)
\s match any white space character [\r\n\t\f ]
$ assert position at end of a line
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
g modifier: global. All matches
for your input example, it will be:
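import re
# Sample input from the question, one address per line.
data = "xxx#site.co.uk\nxxx#site.uk\nxxx#site.me.uk"
m = re.compile(r'\.([^.\n\s]*)$', re.M)
f = m.findall(data)
print(f)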
output:
['uk', 'uk', 'uk']
hope this helps.

As myemail#com is a valid address, you can use:
#(?:.*\.)?([^.]+)$

You don't need regex. This would always give you 'uk' in your examples:
>>> url = 'foo#site.co.uk'
>>> url.split('.')[-1]
'uk'
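For comparison, here is a sketch of the string-method approach next to a regex anchored at the end of the string. The addresses are written with # exactly as in the question, and the variable names are illustrative.

```python
import re

addresses = ["xxx#site.co.uk", "xxx#site.uk", "xxx#site.me.uk"]

# String-method version: whatever follows the last dot.
by_split = [addr.rpartition(".")[2] for addr in addresses]

# Regex version: a run of non-dot characters anchored at the end.
by_regex = [re.search(r"[^.]+$", addr).group(0) for addr in addresses]

print(by_split)  # ['uk', 'uk', 'uk']
print(by_regex)  # ['uk', 'uk', 'uk']
```

Neither version validates the address; for input without any dot, both return the whole string.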

Wouldn't a simple .*\.(\w+) do the job?
You can add more validation for "#" to the regular expression if needed.

Match single quotes from python re

How can I match the following? I want all the names within the single quotes.
This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'
How do I extract only the names within the single quotes?
name = re.compile(r'^\'+\w+\'')

The following regex finds all single words enclosed in quotes:
In [6]: re.findall(r"'(\w+)'", s)
Out[6]: ['Tom', 'Harry', 'rock']
Here:
the ' matches a single quote;
the \w+ matches one or more word characters;
the ' matches a single quote;
the parentheses form a capture group: they define the part of the match that gets returned by findall().
If you only wish to find words that start with a capital letter, the regex can be modified like so:
In [7]: re.findall(r"'([A-Z]\w*)'", s)
Out[7]: ['Tom', 'Harry']

I'd suggest
r = re.compile(r"\B'\w+'\B")
apos = r.findall("This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'")
Result:
>>> apos
["'Tom'", "'Harry'", "'rock'"]
The "negative word boundaries" (\B) prevent matches like the 'n' in words like Rock'n'Roll.
Explanation:
\B # make sure that we're not at a word boundary
' # match a quote
\w+ # match one or more alphanumeric characters
' # match a quote
\B # make sure that we're not at a word boundary
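A small sketch of what the \B guards change in practice. The test sentence is made up to include the Rock'n'Roll case.

```python
import re

# Made-up sentence containing both an in-word apostrophe pair and real quotes.
s = "Rock'n'Roll fans quote 'Tom' and 'Harry'"

# Without the guards, the 'n' inside Rock'n'Roll is picked up as well.
plain = re.findall(r"'\w+'", s)

# With \B on both sides, quotes that touch a word character are skipped.
guarded = re.findall(r"\B'\w+'\B", s)

print(plain)    # ["'n'", "'Tom'", "'Harry'"]
print(guarded)  # ["'Tom'", "'Harry'"]
```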

^ ('hat' or 'caret', among other names) in regex means "start of the string" (or, given particular options, "start of a line"), which you don't care about. Omitting it makes your regex work fine:
>>> re.findall(r'\'+\w+\'', s)
["'Tom'", "'Harry'", "'rock'"]
The regexes others have suggested might be better for what you're trying to achieve; this is the minimal change to fix your problem.

Your regex can only match a pattern following the start of the string. Try something like: r"'([^']*)'"
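One caveat worth checking with the sample sentence: [^']* also accepts spaces, so the apostrophes in "hasn't" and "turn's" can pair up. A minimal sketch comparing it with the \w+ version (variable names are illustrative):

```python
import re

s = ("This hasn't been much that much of a twist and turn's to "
     "'Tom','Harry' and u know who..yes its 'rock'")

# Anything between two quotes, including spaces.
loose = re.findall(r"'([^']*)'", s)

# Only single words between two quotes.
strict = re.findall(r"'(\w+)'", s)

print(loose[0])  # t been much that much of a twist and turn
print(strict)    # ['Tom', 'Harry', 'rock']
```

So for this particular input the \w+ form is the safer of the two; [^']* is more suitable when the quoted names may contain spaces and the text has no stray apostrophes.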
