REGEX extracting specific part non greedy - python

I'm new to Python 2.7. Using regular expressions, I'm trying to extract from a text file just the emails from input lines. I am using the non-greedy method as the emails are repeated 2 times in the same line. Here is my code:
import re
f_hand = open('mail.txt')
for line in f_hand:
    line.rstrip()
    if re.findall('\S+#\S+?',line): print re.findall('\S+#\S+?',line)
However, this is what I'm getting instead of just the email address:
['href="mailto:secretary#abc-mediaent.com">sercetary#a']
What shall I use in re.findall to get just the email out?

If you parse a simple file with anchors for email addresses and always the same syntax (like double quotes to enclose attributes), you can use:
for line in f_hand:
    print re.findall(r'href="mailto:([^"#]+#[^"]+)">\1</a>', line)
(re.findall returns only the capture group. \1 stands for the content of the first capture group.)
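A minimal sketch of how the capture group and the \1 backreference behave, written in Python 3 syntax. The sample line is an assumption modelled on the output shown in the question, not the asker's real data.

```python
import re

# Hypothetical input line: the address appears once in href and once as link text.
line = '<a href="mailto:secretary#abc-mediaent.com">secretary#abc-mediaent.com</a>'

# The group captures the address; \1 requires the link text to repeat it exactly.
pattern = r'href="mailto:([^"#]+#[^"]+)">\1</a>'

addresses = re.findall(pattern, line)
print(addresses)  # ['secretary#abc-mediaent.com']
```

Because the pattern contains exactly one group, re.findall returns a list of that group's contents rather than the full matches.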
If the file is a more complicated HTML file, use a parser, extract the links, and filter them. Alternatively, use XPath, something like: substring-after(//a/@href[starts-with(., "mailto:")], "mailto:")

\S means not a space. " and > are not spaces.
You should use mailto:([^#]+#[^"]+) as the regex (quoted form: 'mailto:([^#]+#[^"]+)'). This will put the email address in the first capture group.
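To see why \S overshoots, here is a small Python 3 sketch comparing the pattern from the question with the anchored one. The sample line is an assumption reconstructed from the output in the question.

```python
import re

# Hypothetical input line, reconstructed from the question's output.
line = '<a href="mailto:secretary#abc-mediaent.com">secretary#abc-mediaent.com</a>'

# \S also matches quotes and angle brackets, so the match starts at href
# and the lazy \S+? stops after a single character.
loose = re.findall(r'\S+#\S+?', line)

# Anchoring on mailto: and stopping at the closing quote isolates the address.
anchored = re.findall(r'mailto:([^#]+#[^"]+)', line)

print(loose)     # ['href="mailto:secretary#abc-mediaent.com">secretary#a']
print(anchored)  # ['secretary#abc-mediaent.com']
```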

Try this:
re.findall('mailto:(\S+#\S+?\.\S+)\"',str)
It should give you something like
['secretary#abc-mediaent.com']

\S accepts many characters that aren't valid in an e-mail address. Try a regular expression of
[a-zA-Z0-9_.-]+#[a-zA-Z0-9_.-]+\.[a-zA-Z0-9_.-]+
(presuming you are not trying to support Unicode -- it seems that you aren't since your input is a "text file").
This will require a "." in the server portion of the e-mail address, and your match will stop on the first character that is not valid within the e-mail address.
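A rough sketch of this character-class approach in Python 3, with the hyphen placed last in each class so it cannot be read as a range. The sample line and the de-duplication step are assumptions added for illustration, since the question says each address appears twice per line.

```python
import re

# Character-class pattern: the match stops at the first character
# that is not allowed in the (simplified) address.
EMAIL_RE = re.compile(r'[a-zA-Z0-9_.-]+#[a-zA-Z0-9_.-]+\.[a-zA-Z0-9_.-]+')

# Hypothetical input line with the address repeated.
line = '<a href="mailto:secretary#abc-mediaent.com">secretary#abc-mediaent.com</a>'

matches = EMAIL_RE.findall(line)

# dict.fromkeys drops duplicates while keeping first-seen order.
unique = list(dict.fromkeys(matches))
print(unique)  # ['secretary#abc-mediaent.com']
```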

This is the format of an email address - https://www.rfc-editor.org/rfc/rfc5322#section-3.4.1.
Keeping that in mind the regex that you need is - r"([a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)". (This works without having to depend on the text surrounding an email address.)
The following lines of code -
html_str = r'<a href="mailto:sachin.gokhale#indiacast.com">sachin.gokhale#indiacast.com</a>'
email_regex = r"([a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)"
print re.findall(email_regex, html_str)
yields -
['sachin.gokhale#indiacast.com', 'sachin.gokhale#indiacast.com']
P.S. - I got the regex for email addresses by googling for "email address regex" and clicking on the first site - http://emailregex.com/

Related

Extract URL from text without space between URL in Python3

I have a problem with a Python regex: I would like to extract every URL in a text, except email addresses. My current pattern can't extract a URL if there is no space before it. This is my regex pattern:
\b((?:(?:https|ftp|http)?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!#)))
You can check it in this regex editor (https://regex101.com/r/lcNc9N/9). My pattern still can't recognize a URL if there's no space before it. Any hints or solutions are welcome.
If there are characters other than a space before it, it's no longer a URL :)
From the RFC:
In general, URLs are written as follows:
<scheme>:<scheme-specific-part>
So I don't really know what you mean by "URL," but replacing the first \b with something like:
[\s\w]*?
Might be what you want. The first group will match URLs even if there are digits, alphabet letters or underscores before them.
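The effect of the leading \b can be shown with a much smaller pattern. The sketch below (Python 3) uses a deliberately simplified, hypothetical URL pattern, not the long one from the question: \b only matches between a word character and a non-word character, so it fails when a letter sits directly in front of "http".

```python
import re

# Hypothetical text with no space before the URL.
text = 'see thishttp://example.com/page now'

# \b fails here: 's' and 'h' are both word characters, so there is no boundary.
with_boundary = re.findall(r'\bhttps?://\S+', text)

# Dropping the boundary lets the URL match wherever it starts.
without_boundary = re.findall(r'https?://\S+', text)

# The suggested prefix consumes the preceding letters; the group keeps only the URL.
with_prefix = re.findall(r'[\s\w]*?(https?://\S+)', text)

print(with_boundary)     # []
print(without_boundary)  # ['http://example.com/page']
print(with_prefix)       # ['http://example.com/page']
```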

REGEX match for portions of a document between two headers

I'm trying and failing to write a Python-compliant regex that captures multiple parts of a document. My code will actually be in Python, but right now I've only tried it on regex101.com to get the expression right (unsuccessfully, obviously :) ).
My file-based text looks something like this:
<#
.SYNOPSIS
This is the synopsis text, that is a multiline
synopsis - I want to match all of this text
as a capture group.
.PARAMETER
This a another block of
multiline text that I want to capture
.SOMEOTHER HEADER
And some other multiline text
#>
I'd like to capture 2 groups (the header and the body text), globally. (i.e for each section).
My ultimate aim is a python array of dictionaries like;
[
{'header':'SYNOPSIS', 'text': }
{'header':'PARAMETER', 'text': }
]
The header section is always anchored to the beginning of the line, starting with '.' and followed by uppercase TEXT. The body of the section includes any words and non-word characters, including CR/LF (Windows based).
The Header names are not guaranteed to be fixed literals, or in a specific order. Nor do I know how many headers might exist.
Right now it looks like this
(^\.[A-Z]+)([\n\W\w]+)
Right now I can match the header followed by a body, but I'm having a hard time telling REGEX to essentially 'stop looking when you hit the next .HEADERTEXT'.
I've created a Regex101 https://regex101.com/r/YqibeH/4 if it's of use (not sure how this might work out).
My pseudocode says something like:
Find all lines beginning with ^.[A-Z] as a capture group, then continue to match all text (multiline) after the header as a second capture group. Stop capturing just before the next header that begins ^.[A-Z]
Any help greatly appreciated.
I believe what you're looking for is lookaheads. Additionally, the search you were doing is greedy and should be swapped for a lazy quantifier. With the multiline flag set, so that ^ matches at the start of every line, this should work:
^\.\w+[\n\W\w]+?(?=^\.\w+|^#>)
https://regex101.com/r/YqibeH/7
^\.\w+ Greedily captures your header text.
[\n\W\w]+? Lazily searches for your body text.
(?=^\.\w+|^#>) until it looks ahead and finds either a line beginning with another header text or a line beginning with a header closing tag.
Note that if the greedy quantifier + were used rather than +?, it would continue matching until the last possible instance it could match.
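As a sketch of how this could feed the list of dictionaries the question asks for, here is a Python 3 variation of the pattern with two capture groups added. The sample text is shortened from the question, the group layout and the strip() cleanup are my assumptions, and a body line that itself starts with a dot followed by capitals would be mistaken for a header.

```python
import re

# Shortened sample in the same shape as the question's text.
text = (
    "<#\n"
    ".SYNOPSIS\n"
    "This is the synopsis text, that is a multiline\n"
    "synopsis.\n"
    ".PARAMETER\n"
    "This is another block of\n"
    "multiline text\n"
    "#>\n"
)

# Group 1: header name. Group 2: body, matched lazily up to the next
# header line or the closing #> line. MULTILINE makes ^ match at each line start.
SECTION_RE = re.compile(
    r'^\.([A-Z]+)[ \t]*\r?\n([\s\S]+?)(?=^\.[A-Z]+|^#>)',
    re.MULTILINE,
)

sections = [
    {'header': m.group(1), 'text': m.group(2).strip()}
    for m in SECTION_RE.finditer(text)
]
print(sections)
```

finditer is used instead of findall so that each match object can be turned into a dictionary directly.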
If I understood your problem correctly, I solved it in the following way. This gives you a list with all the elements that you need to append to your dictionary, after cleaning up the strings, of course.
import re
text = '<#\n.SYNOPSIS\nThis is the block of code that I would like to have matched along with the .SYNOPSIS header, ' \
       'as this block belongs to SYNOPSIS\n .NOTES\n This block needs to belong with\nNOTES ' \
       'header\n.SOMEOTHERHEADER\nAnd resulting text\n\n#> '
pattern = r"(\.[A-Z]+\n)+"
print(re.split(pattern, text))

Regex - extract word inside < > brackets

I am trying to extract an email address from a string like
John Smith <jsmith#email.com>
I just need the email address in the < > brackets.
Here is what I have tried so far, but I'm not very good with regex and it doesn't seem to be working, can anyone help?
import re
sender = str(message.sender)
p = re.search(r"\<(\w+)\>", sender)
logging.info(p.group(1))
You can try this:
import re
s = "John Smith <jsmith#email.com>"
email = re.findall('<(.*?)>', s)[0]
Output:
'jsmith#email.com'
Or, a more email-specific solution:
email = re.findall(r'(?<=<)\w+#[a-zA-Z]+\.[a-z]+(?=>)', s)[0]
Output:
'jsmith#email.com'
Currently your regex is: "\<(\w+)\>"
You do not actually need to escape the < and >, so it becomes: "<(\w+)>"
\w matches letters, digits, and the underscore '_'. An e-mail address contains other characters as well.
You have two options: either just accept anything inside the <> with a regex like "<(.*)>", or actually parse an e-mail address.
A simple regex for that would be "<\S+#\S+>" (non-whitespace characters, followed by #, followed by non-whitespace characters).
Restricting ourselves to the more commonly used characters, we can write: "<[a-zA-Z0-9+_.-]+#[a-zA-Z0-9.-]+>". This still permits certain illegal e-mail addresses because I have kept it fairly simple.
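A short Python 3 sketch comparing these options. The two-sender sample string and the added capture groups are assumptions for illustration; with a single address per string, the greedy and lazy versions would return the same thing.

```python
import re

# Hypothetical header with two senders, to expose the greedy behaviour.
s = "John Smith <jsmith#email.com>, Jane Doe <jdoe#email.com>"

# Greedy .* runs to the last > in the string.
greedy = re.findall(r'<(.*)>', s)

# Lazy .*? stops at the first > after each <.
lazy = re.findall(r'<(.*?)>', s)

# The restricted character classes only accept address-like content.
strict = re.findall(r'<([a-zA-Z0-9+_.-]+#[a-zA-Z0-9.-]+)>', s)

print(greedy)  # ['jsmith#email.com>, Jane Doe <jdoe#email.com']
print(lazy)    # ['jsmith#email.com', 'jdoe#email.com']
print(strict)  # ['jsmith#email.com', 'jdoe#email.com']
```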
Use a negative character set:
import re
s = "John Smith <jsmith#email.com>"
email = re.findall('<([^>]+)>', s)[0]
That matches one or more characters that are not a >, so everything that's inside the angle brackets.

python - make my regex less greedy?

I am looking for some Python regex which I can use to extract a string from a file which has a single line of text (which is actually JavaScript code).
An example of what I'm looking for is to extract the variable name that a substring is being taken from:
So if the line of text I was parsing is:
"var foo = bar.substr(baz % qux, morestuffhere"
I want my match to be bar. I'm using the following, which matches after the equals sign and before the modulo operator:
pat = r"\s?\=\s?(.*?)\.substr\(\s?baz\s?\%\s?"
This works great if the string of interest is on a new line, however when part of a longer string it fails. See here for a failed example:
I think the issue is being less greedy with my regex? Although not sure. Pointers appreciated.
As revo says in the comments, you need to be more specific in your match group:
.*? matches any character
\S*? matches non-space characters
\w*? matches word characters
So you can try this:
\s?\=\s?(\w*?)\.substr\(\s?baz\s?\%\s?
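A quick Python 3 check of the difference. The longer input line is an assumption standing in for the failed example, which isn't shown in the question.

```python
import re

# Hypothetical longer line: an earlier assignment precedes the one we want.
text = 'var a = 1; var foo = bar.substr(baz % qux, morestuffhere'

# .*? is lazy, but the match still starts at the first '=', so the group
# swallows everything from there up to '.substr('.
loose = re.findall(r'\s?\=\s?(.*?)\.substr\(\s?baz\s?\%\s?', text)

# \w*? cannot cross ';' or spaces, so only the identifier is captured.
strict = re.findall(r'\s?\=\s?(\w*?)\.substr\(\s?baz\s?\%\s?', text)

print(loose)   # ['1; var foo = bar']
print(strict)  # ['bar']
```

The lazy quantifier only controls how far the group extends to the right; it does not move the start of the match, which is why narrowing the character class is what fixes it.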

How to match emails with specific rules

How do I achieve the following with a regex:
Match if string doesn't start with a certain character
Match if there are no two ","'s or any other characters
Match if the string has double ", even if they are not adjacent
Using Python.
Currently I am attempting to match email addresses with these rules included. The current pattern I have is
pattern = '^([A-Z0-9._-\"]|\"[!\,;]\"){1-127}+#[^-][A-Z0-9.-]{3-256}+\.[A-Z]{2,4}[^-]$'
But I am confused with how to implement these rules.
Being more specific:
I want a pattern that matches an email adress consisting of 2 parts (name, domain).
The name part should be no longer than 128 characters and should come before the #. It should consist of a-z0-9 characters and also ., _, -, ". The name can't have two adjacent dots.
If the name has a " then it should be paired with another ". The name can have the characters !;, if they are between paired ".
The domain name should be no longer than 256 and no shorter than 3 characters, and should be separated by a dot. The domain name can't begin or end with -.
This information is given to help you understand what I want; the main question is about the three rules I stated at the top. I would greatly appreciate it if you could tell me how to achieve them.
I am confused about your question. Your title says comma separated list but then you talk about email addresses. There is no single official regex for emails, but a widely circulated pattern based on RFC 5322 is:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")#(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
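For the specific rules listed in the question, lookaheads are one way to express "must not start with", "no two adjacent", and "only inside paired quotes". The Python 3 sketch below is my reading of those rules, not a standard: the names NAME_RE, DOMAIN_RE, and is_valid are hypothetical, and the separator defaults to # only because that is how the question writes it.

```python
import re

NAME_RE = re.compile(
    r'''^(?=.{1,128}$)          # at most 128 characters in total
        (?!.*\.\.)              # no two adjacent dots anywhere
        (?: [a-z0-9._-]         # ordinary characters
          | "[a-z0-9._!;,-]*"   # a paired-quote run, where ! ; , are allowed
        )+$''',
    re.VERBOSE | re.IGNORECASE,
)

DOMAIN_RE = re.compile(
    # 3 to 256 characters, no leading or trailing hyphen, at least one dot.
    r'^(?=.{3,256}$)(?!-)(?!.*-$)[a-z0-9-]+(?:\.[a-z0-9-]+)+$',
    re.IGNORECASE,
)

def is_valid(address, separator='#'):
    """Split on the last separator and check each part (assumed rules)."""
    name, found, domain = address.rpartition(separator)
    return bool(found and NAME_RE.match(name) and DOMAIN_RE.match(domain))

print(is_valid('john.smith#example.com'))   # True
print(is_valid('john..smith#example.com'))  # False: adjacent dots
print(is_valid('"a!b"c#example.com'))       # True: ! sits inside paired quotes
print(is_valid('a!b#example.com'))          # False: ! outside quotes
print(is_valid('john#-example.com'))        # False: domain starts with -
```

Splitting first and validating each part separately keeps both patterns short; a single combined regex is possible but much harder to read.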
