python regexp parse - python

I have a dict:
{'login': u'myemail (myemail#gmail.com)'}
I need to extract only the email, myemail#gmail.com.
What regexp should I write?

No regex is needed. Use string manipulation instead. This splits the value on whitespace, then strips the parentheses from the second item ([1]) of the returned list.
yourhash = {'login': u'myemail (myemail#gmail.com)'}
email = yourhash['login'].split()[1].strip("()")
print(email)
# myemail#gmail.com

If you really need a regular expression solution (versus the excellent string split options also posted) this will do it for you:
>>> import re
>>> re.match(r'.*\((.*)\)', 'myemail (myemail#gmail.com)').group(1)
'myemail#gmail.com'
>>>
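
The one-liner above raises AttributeError when the value has no parentheses, because re.match returns None. A minimal sketch that guards against that case; the helper name extract_parenthesized is hypothetical, not something from the original answers:

```python
import re

def extract_parenthesized(value):
    """Return the text inside the first (...) pair, or None if there is none."""
    # [^)]* stops at the first closing parenthesis instead of the last one.
    m = re.search(r'\(([^)]*)\)', value)
    return m.group(1) if m else None

print(extract_parenthesized('myemail (myemail#gmail.com)'))
print(extract_parenthesized('myemail'))
```

Using re.search rather than re.match means the leading .* is not needed.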

Use string methods instead:
my_dict['login'].split('(')[1].strip(')')

There are many patterns for matching emails. A good resource can be found here.
For example,
^[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}$
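
Note that the pattern above lists only uppercase letters, so it presumably expects a case-insensitive flag. A small sketch under that assumption, keeping the # separator used in this thread's sample data; the function name looks_like_email is made up for illustration:

```python
import re

# Same pattern as above; re.IGNORECASE lets [A-Z] match lowercase letters too.
EMAIL_RE = re.compile(r'^[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}$', re.IGNORECASE)

def looks_like_email(candidate):
    """Return True if the whole string has the address shape described by EMAIL_RE."""
    return EMAIL_RE.match(candidate) is not None

print(looks_like_email('myemail#gmail.com'))
print(looks_like_email('myemail (myemail#gmail.com)'))
```

Because of the ^ and $ anchors, this validates a string that is already isolated; it does not pull an address out of surrounding text.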

Related

find multiple things in a string using regex in python

My input string contains various entities like this:
conn_type://host:port/schema#login#password
I want to extract all of them using a regex in Python.
So far I am only able to find them one by one, like this:
conn_type = re.search(r'[a-zA-Z]+', test_string)
if conn_type:
    print("conn_type:", conn_type.group())
    next_substr_len = conn_type.end()
    host = re.search(r'[^:/]+', test_string[next_substr_len:])
and so on.
Is there a way to do it without if and else?
I expect there is some way, but I am not able to find it. Please note that the regex for each entity is different.
Please help, I don't want to write boring code.
Why don't you use re.findall?
Here is an example:
import re
s = 'conn_type://host:port/schema#login#password asldasldasldasdasdwawwda conn_type://host:port/schema#login#email'
def get_all_matches(s):
    # Return every substring shaped like conn_type://host:port/schema#login#password
    matches = re.findall(r'[a-zA-Z]+_[a-zA-Z]+:/+[a-zA-Z]+:+[a-zA-Z]+/+[a-zA-Z]+#+[a-zA-Z]+#[a-zA-Z]+', s)
    return matches
print(get_all_matches(s))
This returns a list of all matches for the regex, which in this case is:
['conn_type://host:port/schema#login#password', 'conn_type://host:port/schema#login#email']
If you need help making regex patterns in Python I would recommend using the following website:
A pretty neat online regex tester
Also check the re module's documentation for more on re.findall
Documentation for re.findall
Hope this helps!
>>> import re
>>> uri = "conn_type://host:port/schema#login#password"
>>> res = re.findall(r'(\w+)://(.*?):([A-Za-z0-9]+)/(\w+)#(\w+)#(\w+)', uri)
>>> res
[('conn_type', 'host', 'port', 'schema', 'login', 'password')]
No need for ifs. Use findall or finditer to search through your collection of connection strings, and filter the resulting list of tuples as needed.
If you prefer a do-it-yourself approach, consider writing a tokenizer. It is an elegant, Pythonic solution.
Or use the standard library: https://docs.python.org/3/library/urllib.parse.html. Note, however, that your sample URL is not fully valid: there is no scheme 'conn_type' and it contains two anchors (#), so urlparse won't work as expected. For real-life URLs, though, I highly recommend this approach.
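
One way to avoid both the chain of ifs and positional tuples is a single pattern with named groups. This is only a sketch: the group names mirror the placeholders in the question, and the character classes are assumptions that may need loosening for real data.

```python
import re

# One named group per entity; adjust each sub-pattern to the real format.
CONN_RE = re.compile(
    r'(?P<conn_type>\w+)://(?P<host>[^:/]+):(?P<port>\w+)/'
    r'(?P<schema>\w+)#(?P<login>\w+)#(?P<password>\w+)'
)

def parse_conn(text):
    """Return a dict of the named parts, or None if the text does not match."""
    m = CONN_RE.search(text)
    return m.groupdict() if m else None

parts = parse_conn("conn_type://host:port/schema#login#password")
print(parts)
```

The only remaining check is whether parse_conn returned None, and each entity is then available by name, for example parts['host'].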

Python Regular Expression For Host Header

Given this string:
GET /dsadda HTTP/1.1\r\nUser-Agent: curl/7.26.0\r\nHost: www.youtube.com\r\nAccept: */*\r\n\r\n
How would I obtain everything in a Python regex group between Host: and \r\n?
In this example, I would like re.match.group(1) to return www.youtube.com
You could use this Regex to match
>>> a = 'GET /dsadda HTTP/1.1\r\nUser-Agent: curl/7.26.0\r\nHost: www.youtube.com\r\nAccept: */*\r\n\r\n'
>>> import re
>>> re.search(r"Host: (.+)\r\n",a).group(1)
'www.youtube.com'
Small note: it is better to use the re.MULTILINE flag, since the input string contains \n, though it is not required in this particular case.
Additionally, as Antti Haapala mentions, anchoring with ^ to match the start of the line is also a better option, as there may be other header fields with Host in their name. Thus the final regex would be something like re.search(r"^Host: (.+)\r\n",a,re.M).group(1).
Using a positive lookbehind and a positive lookahead:
>>> import re
>>> a = 'GET /dsadda HTTP/1.1\r\nUser-Agent: curl/7.26.0\r\nHost: www.youtube.com\r\nAccept: */*\r\n\r\n'
>>> re.search(r"(?<=Host: )(\S+)(?=\r\n)", a).group(1)
'www.youtube.com'
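
If more than one header is needed, the same line-anchored idea can be extended to collect every "Name: value" pair in one pass. This is a rough sketch rather than a full HTTP parser; the names HEADER_RE and parse_headers are hypothetical, and folded or repeated headers are not handled.

```python
import re

RAW_REQUEST = (
    'GET /dsadda HTTP/1.1\r\n'
    'User-Agent: curl/7.26.0\r\n'
    'Host: www.youtube.com\r\n'
    'Accept: */*\r\n\r\n'
)

# With re.MULTILINE, ^ and $ anchor at each line. The request line in this
# sample contains no colon, so it is not picked up as a header.
HEADER_RE = re.compile(r'^([^:\r\n]+):[ \t]*(.*?)\r?$', re.MULTILINE)

def parse_headers(raw):
    """Map lowercased header names to their values."""
    return {name.lower(): value for name, value in HEADER_RE.findall(raw)}

headers = parse_headers(RAW_REQUEST)
print(headers.get('host'))
```

Lowercasing the names makes the lookup case-insensitive, which matches how HTTP treats header field names.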

finding email address in a web page using regular expression

I'm a beginner-level student of Python. Here is the code I have to find instances of email addresses from a web page.
page = urllib.request.urlopen("http://website/category")
reg_ex = re.compile(r'[-a-z0-9._]+#([-a-z0-9]+)(\.[-a-z0-9]+)+', re.IGNORECASE
m = reg_ex.search_all(page)
m.group()
When I ran it, the Python module said that there is an invalid syntax and it is on the line:
m = reg_ex.search_all(page)
Would anyone tell me why it is invalid?
Consider an alternative:
import re
## Suppose we have a text with many email addresses
text = 'purple alice#google.com, blah monkey bob#abc.com blah dishwasher'
## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+#[\w\.-]+', text)
## ['alice#google.com', 'bob#abc.com']
for email in emails:
    # do something with each found email string
    print(email)
Source: https://developers.google.com/edu/python/regular-expressions
Besides, reg_ex has no search_all method. And you should pass in page.read().
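
Putting those two points together, here is a sketch of how the question's code might look once repaired. It assumes the page body arrives as bytes (which is what urllib.request's read() returns in Python 3), keeps the # separator from the question, and uses non-capturing groups so that findall returns whole addresses instead of tuples of groups. The name find_emails and the sample bytes are made up for illustration.

```python
import re

# Non-capturing (?:...) groups: findall then returns the full match as a string.
EMAIL_RE = re.compile(r'[-a-z0-9._]+#[-a-z0-9]+(?:\.[-a-z0-9]+)+', re.IGNORECASE)

def find_emails(page_bytes, encoding='utf-8'):
    """Decode the raw page body and return every address-like string in it."""
    text = page_bytes.decode(encoding, errors='replace')
    return EMAIL_RE.findall(text)

# Stand-in for page.read(); a real call would fetch this from the network.
sample = b'<p>Contact alice#google.com or bob#abc.com</p>'
found = find_emails(sample)
print(found)
```

With the capturing groups from the original pattern, findall would instead return pairs such as ('google', '.com').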
You are missing the closing ) on this line. It should read:
reg_ex = re.compile(r'[a-z0-9._]+#([-a-z0-9]+)(\.[-a-z0-9]+)+', re.IGNORECASE)
Plus, your regex is not valid, try this instead:
"[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
FYI, validating email using regex is not that trivial, see these threads:
Python check for valid email address?
Using a regular expression to validate an email address
There is no .search_all method in the re module.
Maybe the one you are looking for is .findall.
You can try:
re.findall(r"(\w(?:[-.+]?\w+)+\#(?:[a-zA-Z0-9](?:[-+]?\w+)*\.)+[a-zA-Z]{2,})", text)
I assume text is the text to search; in your case that should be text = page.read() (decoded to str, since urllib.request returns bytes in Python 3).
Or you can compile the regex first:
r = re.compile(r"(\w(?:[-.+]?\w+)+\#(?:[a-z0-9](?:[-+]?\w+)*\.)+[a-z]{2,})", re.I)
results = r.findall(text)
Note:
.findall returns a list of matches.
If you need to iterate and get a match object for each hit, you can use .finditer
(reusing the example above):
r = re.compile(r"(\w(?:[-.+]?\w+)+\#(?:[a-z0-9](?:[-+]?\w+)*\.)+[a-z]{2,})", re.I)
for email_match in r.finditer(text):
    email_addr = email_match.group()  # or anything else you need from the match object
Now the problem is which regex to use :)
Change r'[-a-z0-9._]+#([-a-z0-9]+)(\.[-a-z0-9]+)+' to r'[aA-zZ0-9._]+#([aA-zZ0-9]+)(\.[aA-zZ0-9]+)+'. The - character before a-z is the cause

Python regex with question mark literal

I'm using Django's URLconf; the URL I will receive is /?code=authenticationcode
I want to match the URL using r'^\?code=(?P<code>.*)$', but it doesn't work.
Then I found out that the problem is the '?'.
Because I tried to match /aaa?aaa using r'aaa\?aaa', r'aaa\\?aaa', and even r'aaa.*aaa', and all of them failed, but it works when the character is "+" or anything else.
How do I match the '?'? Is it special?
>>> s="aaa?aaa"
>>> import re
>>> re.findall(r'aaa\?aaa', s)
['aaa?aaa']
The reason /aaa?aaa won't match inside your URL is because a ? begins a new GET query.
So, the matchable part of the URL is only up to the first 'aaa'. The remaining '?aaa' is a new query string separated by the '?' mark, containing a variable "aaa" being passed as a GET parameter.
What you can do here is encode the variable before it makes its way into the URL. The encoded form of ? is %3F.
You should also not match a GET query such as /?code=authenticationcode using regex at all. Instead, match your URL up to / using r'^$'. Django will pass the variable code as a GET parameter to the request object, which you can obtain in your view using request.GET.get('code').
You are not allowed to use ? in a URL as a variable value. The ? indicates that there are variables coming in.
Like: http://www.example.com?variable=1&another_variable=2
Replace it or escape it. Here's some nice documentation.
Django's urls.py does not parse query strings, so there is no way to get this information at the urls.py file.
Instead, parse it in your view:
def foo(request):
    code = request.GET.get('code')
    if code:
        pass  # do stuff
    else:
        pass  # No code!
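
To see why a URL pattern never gets the chance to match the ?, it can help to split the request target the way a URL is normally decomposed. The sketch below uses only the standard library and is not Django's own code; the helper name split_request_target is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

def split_request_target(target):
    """Separate the path (what URL patterns see) from the query parameters."""
    parts = urlsplit(target)
    return parts.path, parse_qs(parts.query)

path, params = split_request_target('/?code=authenticationcode')
print(path)
print(params.get('code'))
```

Only the path part is a candidate for URL matching; the query part is what ends up available through request.GET in the view.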
"How to match the '?', is it special?"
Yes, but you are properly escaping it by using the backslash. I do not see where you have accounted for the leading forward slash, though. That bit just needs to be added in:
r'^/\?code=(?P<code>.*)$'
Suppress the regex metacharacters with []:
>>> s
'/?code=authenticationcode'
>>> r=re.compile(r'^/[?]code=(.+)')
>>> m=r.match(s)
>>> m.groups()
('authenticationcode',)
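
For plain Python regex matching (outside URL routing), another option besides a backslash or a character class is to let re.escape quote the literal text. A short sketch, with the variable names chosen only for illustration:

```python
import re

literal = 'aaa?aaa'

# re.escape backslash-escapes regex metacharacters such as ?
pattern = re.compile(re.escape(literal))
match = pattern.search('/aaa?aaa')
print(match.group() if match else None)

# Unescaped, the ? makes the preceding 'a' optional, so the literal text is not found.
print(re.search(literal, '/aaa?aaa'))
```

This is handy when the text to match comes from a variable and may contain any metacharacter.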

Using Python Regular Expression in Django

I have a web address:
http://www.example.com/org/companyA
I want to be able to pass CompanyA to a view using regular expressions.
This is what I have:
(r'^org/?P<company_name>\w+/$',"orgman.views.orgman")
and it doesn't match.
Ideally, all URLs that look like example.com/org/X would pass X to the view.
Thanks in advance!
You need to wrap the group name in parentheses. The syntax for named groups is (?P<name>regex), not ?P<name>regex. Also, if you don't want to require a trailing slash, you should make it optional.
It's easy to test regular expression matching with the Python interpreter, for example:
>>> import re
>>> re.match(r'^org/?P<company_name>\w+/$', 'org/companyA')
>>> re.match(r'^org/(?P<company_name>\w+)/?$', 'org/companyA')
<_sre.SRE_Match object at 0x10049c378>
>>> re.match(r'^org/(?P<company_name>\w+)/?$', 'org/companyA').groupdict()
{'company_name': 'companyA'}
Your regex isn't valid. It should probably look like
r'^org/(?P<company_name>\w+)/$'
It should look more like r'^org/(?P<company_name>\w+)'
>>> r = re.compile(r'^org/(?P<company_name>\w+)')
>>> r.match('org/companyA').groups()
('companyA',)
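
The three patterns suggested in these answers differ in how they treat the trailing slash and any extra path segments. A quick comparison sketch; the labels and the test paths are made up for illustration:

```python
import re

patterns = {
    'optional_slash': r'^org/(?P<company_name>\w+)/?$',
    'required_slash': r'^org/(?P<company_name>\w+)/$',
    'prefix_only': r'^org/(?P<company_name>\w+)',
}
paths = ['org/companyA', 'org/companyA/', 'org/companyA/extra']

# For each pattern, record which of the paths it accepts.
results = {
    label: [bool(re.match(regex, path)) for path in paths]
    for label, regex in patterns.items()
}
for label, accepted in results.items():
    print(label, accepted)
```

The unanchored variant also accepts longer paths, which may or may not be wanted depending on what other routes exist.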
