Python Regular Expression For Host Header - python

Given this string:
GET /dsadda HTTP/1.1\r\nUser-Agent: curl/7.26.0\r\nHost: www.youtube.com\r\nAccept: */*\r\n\r\n
How would I obtain everything in a Python regex group between Host: and \r\n?
In this example, I would like re.match.group(1) to return www.youtube.com

You could use this regex with re.search:
>>> a = 'GET /dsadda HTTP/1.1\r\nUser-Agent: curl/7.26.0\r\nHost: www.youtube.com\r\nAccept: */*\r\n\r\n'
>>> import re
>>> re.search(r"Host: (.+)\r\n",a).group(1)
'www.youtube.com'
Small note: re.MULTILINE only changes how ^ and $ behave, so it makes no difference for the pattern above, which uses neither.
Additionally, as Antti Haapala mentions, anchoring the pattern with ^ so that it matches only at the start of a line is the better option, as there may be other header fields whose names end in Host. That anchor does need re.MULTILINE (re.M), so the final regex would be something like re.search(r"^Host: (.+)\r\n",a,re.M).group(1).
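To illustrate why the anchor matters, here is a minimal sketch. It assumes a request that also carries an X-Forwarded-Host header; that header and its value are invented for this example and are not part of the original question.

```python
import re

# Hypothetical request: the X-Forwarded-Host line is added only to show the pitfall.
request = (
    "GET /dsadda HTTP/1.1\r\n"
    "X-Forwarded-Host: proxy.example.com\r\n"
    "Host: www.youtube.com\r\n"
    "Accept: */*\r\n\r\n"
)

# Without an anchor, "Host: " is first found inside "X-Forwarded-Host: ".
unanchored = re.search(r"Host: (.+)\r\n", request).group(1)

# With ^ and re.M, the pattern can only match at the start of a line.
anchored = re.search(r"^Host: (.+)\r\n", request, re.M).group(1)

print(unanchored)  # proxy.example.com
print(anchored)    # www.youtube.com
```

The unanchored search picks up the wrong header because it is satisfied by the first occurrence of "Host: " anywhere in the text.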

Using a positive lookbehind and a positive lookahead:
>>> import re
>>> a = 'GET /dsadda HTTP/1.1\r\nUser-Agent: curl/7.26.0\r\nHost: www.youtube.com\r\nAccept: */*\r\n\r\n'
>>> re.search(r"(?<=Host: )(\S+)(?=\r\n)", a).group(1)
'www.youtube.com'

Related

In Python, is there a way to capture the following YYYYMMDD-N from a URL

I am looking for a way to capture the following with either a regular expression or a built-in function in Python.
From /url-path/YYYYMMDD-N/url-path-cont I only need YYYYMMDD-N. Sometimes the -N is present and sometimes it is not. I have tried various methods, but so far all my attempts either stop at YYYYMMDD or capture part of /url-path-cont.
I would like to capture only the YYYYMMDD-N with the -N as optional whenever present.
There are probably better ways of doing this, but as long as there is always the same number of / characters, you could use the split method:
url_path = "/url-path/YYYYMMDD-N/url-path-cont"
date_only = url_path.split("/")[2]
print(date_only)
Here is a regular expression that will extract the date from a string.
>>> import re
>>> url = "url-path/YYYYMMDD-N/url-path-cont"
>>> re.compile(r"/(\w+-?\w?)/").search(url).group(1)
'YYYYMMDD-N'
>>>
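If the placeholders stand for digits, a stricter pattern can make the -N suffix explicitly optional. The following is only a sketch under that assumption; the sample paths and the names DATE_SEGMENT and extract_date are made up for illustration.

```python
import re

# Eight digits, optionally followed by a hyphen and one or more digits,
# enclosed by slashes on both sides.
DATE_SEGMENT = re.compile(r"/(\d{8}(?:-\d+)?)/")

def extract_date(path):
    """Return the YYYYMMDD or YYYYMMDD-N segment of path, or None if absent."""
    match = DATE_SEGMENT.search(path)
    return match.group(1) if match else None

print(extract_date("/url-path/20130415-2/url-path-cont"))  # 20130415-2
print(extract_date("/url-path/20130415/url-path-cont"))    # 20130415
```

The non-capturing group (?:-\d+)? keeps the suffix optional without adding a second group, so group(1) always holds the whole segment.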

Find string with regular expression in python

I am a newbie in Python and I am trying to extract part of a string.
I looked at other similar questions but could not find my answer.
I have a variable which contains a list of URLs that look like this:
http://92.230.38.21/ios/Channel767/Hotbird.mp3
http://92.230.38.21/ios/Channel9798/Coldbird.mp3
....
I want the mp3 file name (in this example Hotbird, Coldbird, etc.).
I know I must be able to do it with re.findall(), but I have no idea which regular expression I need to use.
Any ideas?
Update:
Here is the part I used:
for final in match2:
    netname=re.findall('\W+\//\W+\/\W+\/\W+\/\W+', final)
    print final
    print netname
That did not work. Then I tried this one, which only extracts the IP address (92.230.38.21) but not the name:
for final in match2:
    netname=re.findall('\d+\.\d+\.\d+\.\d+', final)
    print final
You may just use str.split():
>>> urls = ["http://92.230.38.21/ios/Channel767/Hotbird.mp3", "http://92.230.38.21/ios/Channel9798/Coldbird.mp3"]
>>> for url in urls:
...     print(url.split("/")[-1].split(".")[0])
...
Hotbird
Coldbird
And here is an example regex-based approach:
>>> import re
>>>
>>> pattern = re.compile(r"/(\w+)\.mp3$")
>>> for url in urls:
...     print(pattern.search(url).group(1))
...
Hotbird
Coldbird
Here the capturing group (\w+) captures the mp3 file name, consisting of one or more word characters (letters, digits, or underscores), which must be followed by .mp3 at the end of the URL.
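How about this?
([^/]*mp3)$
I think that might work.
Basically, it says: anchor at the end of the line, require the text to end in mp3, and match everything back to the last slash. Note that the group then holds the whole file name, extension included (Hotbird.mp3).
I think it will perform well.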
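A short comparison may help here. This sketch runs the pattern above next to a variant that moves the extension outside the group; the variant and the variable names are assumptions added for illustration.

```python
import re

url = "http://92.230.38.21/ios/Channel767/Hotbird.mp3"

# The group covers the extension too, so the result includes ".mp3".
with_extension = re.search(r"([^/]*mp3)$", url).group(1)

# Keeping \.mp3 outside the group leaves only the bare name.
name_only = re.search(r"([^/]*)\.mp3$", url).group(1)

print(with_extension)  # Hotbird.mp3
print(name_only)       # Hotbird
```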

RegEx in Python for WikiMarkup

I'm trying to create a re in python that will match this pattern in order to parse MediaWiki Markup:
<ref>*Any_Character_Could_Be_Here</ref>
But I'm totally lost when it comes to regex. Can someone help me, or point me to a tutorial or resource that might be of some help? Thanks!
Assuming that svick is correct that MediaWiki Markup is not valid xml (or html), then you could use re in this circumstance (although I will certainly defer to better solutions):
>>> import re
>>> test_string = '''<ref>*Any_Character_Could_Be_Here</ref>
<ref>other characters could be here</ref>'''
>>> re.findall(r'<ref>.*?</ref>', test_string)
['<ref>*Any_Character_Could_Be_Here</ref>', '<ref>other characters could be here</ref>'] # a list of matching strings
In any case, you will want to familiarize yourself with the re module (whether or not you use a regex to solve this particular problem).
srhoades28, this will match your pattern.
if re.search(r"<ref>\*[^<]*</ref>", subject):
    # Successful match
    pass
else:
    # Match attempt failed
    pass
Note that from your post, I assumed that the * after <ref> always occurs, and that the only variable part is the text that follows it, in your example "Any_Character_Could_Be_Here".
If this is not the case, let me know and I will tweak the expression.
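If only the text between the tags is wanted, a capturing group can be added to the lazy pattern. The sketch below assumes plain <ref> tags without attributes (a tag such as <ref name="..."> would not match) and uses made-up wiki text.

```python
import re

# Made-up wiki text; the second reference spans two lines.
wiki_text = "Intro<ref>first source</ref> more text <ref>second\nsource</ref>"

# The group captures only the text between the tags. re.DOTALL lets "."
# match newlines, and the lazy ".*?" stops at the nearest closing tag.
references = re.findall(r"<ref>(.*?)</ref>", wiki_text, re.DOTALL)

print(references)  # ['first source', 'second\nsource']
```

When a pattern contains exactly one group, re.findall returns the group contents rather than the whole match, which is why the tags are dropped from the result.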

python regex on variable

Please help with my regex problem
Here is my string
source="http://www.amazon.com/ref=s9_hps_bw_g200_t2?pf_rd_m=ATVPDKIKX0DER&pf_rd_i=3421"
source_resource="pf_rd_m=ATVPDKIKX0DER"
The source_resource value occurs inside source and may be followed by & or by . (for example).
So far,
regex = re.compile("pf_rd_m=ATVPDKIKX0DER+[&.]")
regex.findall(source)
[u'pf_rd_m=ATVPDKIKX0DER&']
I have hard-coded the text here. Rather than hard-coding it, how can I use the source_resource variable, followed by & or ., to find this?
If the goal is to extract the pf_rd_m value (which it apparently is, as you are using regex.findall), then I'm not sure a regex is the easiest solution here. This uses Python 2's urlparse module; in Python 3 the same functions live in urllib.parse:
>>> import urlparse
>>> qs = urlparse.urlparse(source).query
>>> urlparse.parse_qs(qs)
{'pf_rd_m': ['ATVPDKIKX0DER'], 'pf_rd_i': ['3421']}
>>> urlparse.parse_qs(qs)['pf_rd_m']
['ATVPDKIKX0DER']
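For readers on Python 3, the same idea can be sketched with urllib.parse. The variable names query and params are chosen here for illustration.

```python
from urllib.parse import urlparse, parse_qs

source = "http://www.amazon.com/ref=s9_hps_bw_g200_t2?pf_rd_m=ATVPDKIKX0DER&pf_rd_i=3421"

# Split off the query string, then parse it into a dict of lists.
query = urlparse(source).query
params = parse_qs(query)

print(params["pf_rd_m"])  # ['ATVPDKIKX0DER']
```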
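You can build the pattern from the variable. Escaping the . is harmless, although inside a character class it is already literal:
pattern = re.compile(source_resource + r'[&\.]')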
You can just build the string for the regular expression like a normal string, utilizing all string-formatting options available in Python:
import re
source_and = "http://rads.stackoverflow.com/amzn/click/B0030DI8NA/pf_rd_m=ATVPDKIKX0DER&"
source_dot = "http://rads.stackoverflow.com/amzn/click/B0030DI8NA/pf_rd_m=ATVPDKIKX0DER."
source_resource = "pf_rd_m=ATVPDKIKX0DER"
# Raw string, so the backslash reaches the regex engine unchanged.
regex_string = source_resource + r"[&\.]"
regex = re.compile(regex_string)
print(regex.findall(source_and))
print(regex.findall(source_dot))
This prints:
['pf_rd_m=ATVPDKIKX0DER&']
['pf_rd_m=ATVPDKIKX0DER.']
I hope this is what you mean.
Just note that I modified your regular expression: I escaped the . (harmless, although inside a character class it is already literal) and dropped the +, which applied only to the final R. I assumed the string occurs only once, which makes the + unnecessary.
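One further precaution, not raised in the answers above: if the variable could ever contain regex metacharacters, re.escape() can neutralise them before the pattern is compiled. A minimal sketch, reusing the strings from the question:

```python
import re

source = "http://www.amazon.com/ref=s9_hps_bw_g200_t2?pf_rd_m=ATVPDKIKX0DER&pf_rd_i=3421"
source_resource = "pf_rd_m=ATVPDKIKX0DER"

# re.escape() makes every character of the variable match literally;
# the character class then accepts a trailing & or a literal dot.
pattern = re.compile(re.escape(source_resource) + r"[&.]")

print(pattern.findall(source))  # ['pf_rd_m=ATVPDKIKX0DER&']
```

For this particular value the escaping changes nothing, but it guards against values that contain characters such as ?, + or (.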

Does anyone see why the first part of my regex isn't working in Python?

I tested this regex out in RegexBuddy
,[A-Z\s]+?,(LA|RO|MU|FE|AV|CA),(ML|FE|MN|FS|UN)?,(\d+/\d+/\d{4})?
and it seems to be able to do what I need it to do - capture a piece of data that looks like one of the following:
,POWDER,RO,ML,8/19/2002
,POWDER,RO,,,
,POWDER,RO,,8/19/2002
,POWDER,RO,ML,,
When I use it in a python string:
r",[A-Z\s]+?,(LA|RO|MU|FE|AV|CA),(ML|FE|MN|FS|UN)?,(\d+/\d+/\d{4})?"
It misses the first part of the match, and my resulting matches look like RO,ML,8/19/2002 or RO,ML, or just RO,
The first token is a word that is stored in all caps and may have spaces (and/or possibly punctuation, which I will need to address shortly) in it. If I remove the space, it still doesn't capture the one-word names that it should. Did I miss something obvious?
Yes. You did not capture the first group.
r",([A-Z\s]+),(LA|RO|MU|FE|AV|CA),(ML|FE|MN|FS|UN)?,(\d+/\d+/\d{4})?"
#  ^         ^
BTW, it seems that you are parsing a CSV file with regex. In Python, there is already a csv module.
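As a rough sketch of the csv route, assuming the records sit one per line and start with an empty leading field as in the samples from the question (the in-memory data string below stands in for a real file):

```python
import csv
import io

# Two sample records from the question, one per line.
data = ",POWDER,RO,ML,8/19/2002\n,POWDER,RO,,8/19/2002\n"

# csv.reader splits each line on commas and keeps empty fields as ''.
rows = list(csv.reader(io.StringIO(data)))

for row in rows:
    print(row[1:])  # skip the empty leading field
```

Empty fields come back as empty strings, so the optional columns need no special handling in the pattern.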
The first part of your regex doesn't have capturing parentheses around it. Try the regex:
,([A-Z\s]+?),(LA|RO|MU|FE|AV|CA),(ML|FE|MN|FS|UN)?,(\d+/\d+/\d{4})?
#^          ^ This was [A-Z\s]+?; needs to be ([A-Z\s]+?)
which would be this in python:
r",([A-Z\s]+?),(LA|RO|MU|FE|AV|CA),(ML|FE|MN|FS|UN)?,(\d+/\d+/\d{4})?"
Example from the interpreter:
>>> import re
>>> r = re.compile(r",[A-Z\s]+?,(LA|RO|MU|FE|AV|CA),(ML|FE|MN|FS|UN)?,(\d+/\d+/\d{4})?")
>>> r.match(",POWDER,RO,ML,8/19/2002").groups()
('RO', 'ML', '8/19/2002')
>>> r = re.compile(r",([A-Z\s]+?),(LA|RO|MU|FE|AV|CA),(ML|FE|MN|FS|UN)?,(\d+/\d+/\d{4})?")
>>> r.match(",POWDER,RO,ML,8/19/2002").groups()
('POWDER', 'RO', 'ML', '8/19/2002')
I'm not into Python, but you just forgot to use parentheses to indicate that you want to capture that part:
,([A-Z\s]+)?,(LA|RO|MU|FE|AV|CA),(ML|FE|MN|FS|UN)?,(\d+/\d+/\d{4})?
should do what you want.
Yes, you missed the grouping parentheses:
>>> import re
>>> s = ",POWDER,RO,ML,8/19/2002"
>>> pat = r",([A-Z\s]+?),(LA|RO|MU|FE|AV|CA),(ML|FE|MN|FS|UN)?,(\d+/\d+/\d{4})?"
>>> re.match(pat, s).groups()
('POWDER', 'RO', 'ML', '8/19/2002')
