Split from a specific delimiter - python

How can I extract the host from a URL like http://www.facebook.com/pages/create.php, so that the result is www.facebook.com?
I tried this, but it doesn't work:
line.split('/', 2)[2]
The problem is probably the two forward slashes (//), and some of the URLs start directly with www, without the http:// prefix.
Thanks for your help, Adia

You might want to look at Python's urlparse module.
>>> from urlparse import urlparse
>>> o = urlparse('http://www.facebook.com/pages/create.php')
>>> o.netloc
'www.facebook.com'
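For reference, in Python 3 the same function lives in urllib.parse. The sketch below is one possible way to also cover the URLs that start with www and have no scheme. The helper name extract_host is made up for illustration, and it assumes that a string without "//" is just a host followed by an optional path.

```python
from urllib.parse import urlparse  # Python 3 location of urlparse


def extract_host(url):
    """Return the network location (host) part of a URL."""
    # Without a scheme or a leading "//", urlparse treats the whole
    # string as a path, so add "//" to mark where the host starts.
    if "//" not in url:
        url = "//" + url
    return urlparse(url).netloc


print(extract_host("http://www.facebook.com/pages/create.php"))  # www.facebook.com
print(extract_host("www.facebook.com/pages/create.php"))         # www.facebook.com
```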

Probably the best bet would be to extract the server part with a regex, i.e.,
\/[a-z0-9\-\.]*[a-zA-Z0-9\-]+\.[a-z]{2,3}\/
That can cover www.facebook.com, facebook.com, some-domain.tv, www.some-domain.net, etc.
NOTE: the leading and trailing slashes are part of the regex and are not regex delimiters.
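A rough sketch of how that pattern might be applied with the re module. The names HOST_RE and host_from_regex are hypothetical. Because the pattern requires a slash on both sides of the host and a top-level domain of two or three lowercase letters, a URL without http:// (or without anything after the host) is not matched and the function returns None.

```python
import re

# Pattern copied from the answer; the slashes are literal characters.
HOST_RE = re.compile(r"\/[a-z0-9\-\.]*[a-zA-Z0-9\-]+\.[a-z]{2,3}\/")


def host_from_regex(url):
    """Return the host matched by HOST_RE, or None if nothing matches."""
    match = HOST_RE.search(url)
    if match is None:
        return None
    # Drop the leading and trailing slash that the pattern requires.
    return match.group(0).strip("/")


print(host_from_regex("http://www.facebook.com/pages/create.php"))  # www.facebook.com
print(host_from_regex("www.facebook.com/pages/create.php"))         # None
```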

Try:
line.split("//", 1)[-1].split("/", 1)[0]
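To see why the double split behaves differently from the original attempt, here is a small sketch. The function name host_by_split and the two sample variables are only illustrative.

```python
def host_by_split(line):
    """Drop everything up to "//" (if present), then cut at the next "/"."""
    without_scheme = line.split("//", 1)[-1]
    return without_scheme.split("/", 1)[0]


url_with_scheme = "http://www.facebook.com/pages/create.php"
url_without_scheme = "www.facebook.com/pages/create.php"

# The original attempt splits on single slashes, so "//" yields an empty
# piece and index 2 still contains the path.
print(url_with_scheme.split("/", 2))
# ['http:', '', 'www.facebook.com/pages/create.php']

print(host_by_split(url_with_scheme))     # www.facebook.com
print(host_by_split(url_without_scheme))  # www.facebook.com
```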

I would do:
ch[7 if ch[0:7]=='http://' else 0:].partition('/')[0]
I'm not sure it's valid for all the cases you'll encounter (for instance, it only strips an http:// prefix, not https://).
Also:
ch[(ch[0:7]=='http://')*7:].partition('/')[0]

Related

Find string with regular expression in python

I am new to Python and I am trying to extract a piece of a string from another string.
I looked at other similar questions but could not find my answer.
I have a variable that contains a list of URLs that look like this:
http://92.230.38.21/ios/Channel767/Hotbird.mp3
http://92.230.38.21/ios/Channel9798/Coldbird.mp3
....
I want the mp3 file name (in this example Hotbird, Coldbird, etc.).
I know I should be able to do it with re.findall(), but I have no idea which regular expression I need to use.
Any idea?
Update:
Here is the part I used:
for final in match2:
    netname = re.findall('\W+\//\W+\/\W+\/\W+\/\W+', final)
    print(final)
    print(netname)
That did not work. Then I tried this one, which only extracts the IP address (92.230.38.21) but not the name:
for final in match2:
    netname = re.findall('\d+\.\d+\.\d+\.\d+', final)
    print(final)
You may just use str.split():
>>> urls = ["http://92.230.38.21/ios/Channel767/Hotbird.mp3", "http://92.230.38.21/ios/Channel9798/Coldbird.mp3"]
>>> for url in urls:
...     print(url.split("/")[-1].split(".")[0])
...
Hotbird
Coldbird
And here is an example regex-based approach:
>>> import re
>>>
>>> pattern = re.compile(r"/(\w+)\.mp3$")
>>> for url in urls:
...     print(pattern.search(url).group(1))
...
Hotbird
Coldbird
Here we use a capturing group (\w+) to capture the mp3 file name, consisting of one or more word characters (letters, digits, or underscores), which is followed by a dot and mp3 at the end of the URL.
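One thing to keep in mind is that pattern.search() returns None for a URL that does not end in .mp3, so calling .group(1) on it would raise AttributeError. A minimal sketch that skips such entries, using the same pattern; the names MP3_NAME and mp3_names and the third sample URL are made up for illustration.

```python
import re

# Same pattern as above: a slash, the captured name, then ".mp3" at the end.
MP3_NAME = re.compile(r"/(\w+)\.mp3$")


def mp3_names(urls):
    """Collect the mp3 base names, ignoring URLs that do not match."""
    names = []
    for url in urls:
        match = MP3_NAME.search(url)
        if match is not None:
            names.append(match.group(1))
    return names


urls = [
    "http://92.230.38.21/ios/Channel767/Hotbird.mp3",
    "http://92.230.38.21/ios/Channel9798/Coldbird.mp3",
    "http://92.230.38.21/ios/index.html",  # hypothetical non-mp3 entry
]
print(mp3_names(urls))  # ['Hotbird', 'Coldbird']
```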
How about this?
([^/]*mp3)$
I think that might work.
Basically it says:
Anchor at the end of the line, require mp3 there, then match everything back to the last slash. Note that the captured group includes the .mp3 extension (for example Hotbird.mp3).
I think it will perform well.
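If a regex is not required, the standard library can do the same job. A minimal sketch, written for Python 3 (where urlparse lives in urllib.parse); file_stem is a hypothetical helper name. It returns the bare name without the extension and does not depend on the extension being mp3.

```python
import posixpath
from urllib.parse import urlparse


def file_stem(url):
    """Return the last path component of a URL without its extension."""
    path = urlparse(url).path            # e.g. "/ios/Channel767/Hotbird.mp3"
    filename = posixpath.basename(path)  # e.g. "Hotbird.mp3"
    stem, _extension = posixpath.splitext(filename)
    return stem


print(file_stem("http://92.230.38.21/ios/Channel767/Hotbird.mp3"))    # Hotbird
print(file_stem("http://92.230.38.21/ios/Channel9798/Coldbird.mp3"))  # Coldbird
```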

python regex on variable

Please help with my regex problem.
Here is my string:
source="http://www.amazon.com/ref=s9_hps_bw_g200_t2?pf_rd_m=ATVPDKIKX0DER&pf_rd_i=3421"
source_resource="pf_rd_m=ATVPDKIKX0DER"
source_resource occurs in source and may be followed by & or by . (for example).
So far, I have:
regex = re.compile("pf_rd_m=ATVPDKIKX0DER+[&.]")
regex.findall(source)
[u'pf_rd_m=ATVPDKIKX0DER&']
I have hard-coded the text here. Instead of hard-coding it, how can I use the source_resource variable, followed by & or ., to find this?
If the goal is to extract the pf_rd_m value (which it apparently is, as you are using regex.findall), then I'm not sure regexes are the easiest solution here:
>>> import urlparse
>>> qs = urlparse.urlparse(source).query
>>> urlparse.parse_qs(qs)
{'pf_rd_m': ['ATVPDKIKX0DER'], 'pf_rd_i': ['3421']}
>>> urlparse.parse_qs(qs)['pf_rd_m']
['ATVPDKIKX0DER']
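The session above uses the Python 2 urlparse module. In Python 3 the same functions are in urllib.parse, so an equivalent sketch could look like this (the variable names params and value are just for illustration):

```python
from urllib.parse import urlparse, parse_qs

source = "http://www.amazon.com/ref=s9_hps_bw_g200_t2?pf_rd_m=ATVPDKIKX0DER&pf_rd_i=3421"

# parse_qs maps each parameter name to a list of its values.
params = parse_qs(urlparse(source).query)
value = params["pf_rd_m"][0]

print(params)  # {'pf_rd_m': ['ATVPDKIKX0DER'], 'pf_rd_i': ['3421']}
print(value)   # ATVPDKIKX0DER
```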
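You should also escape the . (strictly speaking it is only special outside a character class, so inside [...] the backslash is optional but harmless):
pattern = re.compile(source_resource + r'[&\.]')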
You can just build the string for the regular expression like a normal string, utilizing all string-formatting options available in Python:
import re
source_and="http://rads.stackoverflow.com/amzn/click/B0030DI8NA/pf_rd_m=ATVPDKIKX0DER&"
source_dot="http://rads.stackoverflow.com/amzn/click/B0030DI8NA/pf_rd_m=ATVPDKIKX0DER."
source_resource="pf_rd_m=ATVPDKIKX0DER"
regex_string = source_resource + r"[&\.]"
regex = re.compile(regex_string)
print(regex.findall(source_and))  # ['pf_rd_m=ATVPDKIKX0DER&']
print(regex.findall(source_dot))  # ['pf_rd_m=ATVPDKIKX0DER.']
I hope this is what you mean.
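Just take note that I modified your regular expression. I escaped the . (it is a special symbol outside a character class; inside [...] the escape is optional but harmless), and I dropped the +, which in your pattern applied only to the preceding R (one or more R characters). I assumed the string will only occur once, which makes the + unnecessary.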
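If the variable might ever contain characters that are special in a regex (a ., ?, or + for example), re.escape can be applied before building the pattern. A small sketch under that assumption; find_resource is a hypothetical helper name.

```python
import re


def find_resource(source, source_resource):
    """Find source_resource followed by '&' or '.' inside source."""
    # re.escape makes every character of the variable match literally.
    pattern = re.compile(re.escape(source_resource) + r"[&.]")
    return pattern.findall(source)


source_and = "http://rads.stackoverflow.com/amzn/click/B0030DI8NA/pf_rd_m=ATVPDKIKX0DER&"
source_dot = "http://rads.stackoverflow.com/amzn/click/B0030DI8NA/pf_rd_m=ATVPDKIKX0DER."
source_resource = "pf_rd_m=ATVPDKIKX0DER"

print(find_resource(source_and, source_resource))  # ['pf_rd_m=ATVPDKIKX0DER&']
print(find_resource(source_dot, source_resource))  # ['pf_rd_m=ATVPDKIKX0DER.']
```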

How to extract a substring from inside a string?

Let's say I have the string /Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherthing and I want to extract just the '0-1-2-3-4-5' part. I tried this:
str='/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
print(str[str.find("-")-1:str.find("-")])
But the result is only 0. How can I extract just the '0-1-2-3-4-5' part?
Use os.path.basename and rsplit:
>>> from os.path import basename
>>> name = '/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
>>> number, tail = basename(name).rsplit('-', 1)
>>> number
'0-1-2-3-4-5'
You're almost there:
str='/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
print(str[str.find("-")-1:str.rfind("-")])
rfind will search from the end. This assumes that no dashes appear anywhere else in the path. If they can, do this instead:
import os
str='/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
str = os.path.basename(str)
print(str[str.find("-")-1:str.rfind("-")])
basename will grab the filename, excluding the rest of the path. That's probably what you want.
Edit:
As pointed out by @bradley.ayers, this breaks down when the filename isn't exactly as described in the question (for example, when the first number has more than one digit). Since we're using basename, we can omit the beginning index:
print(str[:str.rfind("-")])
This would parse '/Apath1/Bpath2/Cpath3/10-1-2-3-4-5-something.otherhing' as '10-1-2-3-4-5'.
This works:
>>> str='/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
>>> str.split('/')[-1].rsplit('-', 1)[0]
'0-1-2-3-4-5'
This assumes that what you want is just what's between the last '/' and the last '-'. The other suggestions with os.path might make better sense (as long as there is no OS confusion over what a proper path looks like).
You could use re:
>>> import re
>>> ss = '/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
>>> re.search(r'(?:\d-)+\d',ss).group(0)
'0-1-2-3-4-5'
While slightly more complicated, it seems like a solution similar to this might be slightly more robust...
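The approaches above can be put side by side in a short sketch. Both function names are hypothetical. posixpath is used instead of os.path so that '/' is treated as the separator on every platform, and the regex uses \d+ rather than \d on the assumption that a number may have more than one digit (with single \d, an input starting with 10-1-... would match from the 0 onward).

```python
import posixpath
import re


def prefix_before_last_dash(path):
    """Return the file name up to (not including) its last '-', or None."""
    name = posixpath.basename(path)
    head, dash, _tail = name.rpartition("-")
    return head if dash else None


def dashed_number_run(path):
    """Return the first run of dash-separated numbers in the file name."""
    match = re.search(r"(?:\d+-)+\d+", posixpath.basename(path))
    return match.group(0) if match else None


sample = "/Apath1/Bpath2/Cpath3/10-1-2-3-4-5-something.otherhing"
print(prefix_before_last_dash(sample))  # 10-1-2-3-4-5
print(dashed_number_run(sample))        # 10-1-2-3-4-5
```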

how to handle '../' in python?

I need to strip ../something/ from a URL,
e.g. strip ../first/ from ../first/bit/of/the/url.html, where first can be anything.
What's the best way to achieve this?
Thanks :)
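You can simply split the path twice at the path separator and take the last bit. For local filesystem paths the official separator is os.sep; URLs always use '/', so split on '/' explicitly for them (the session here assumes a POSIX system, where os.sep is '/'):
>>> import os
>>> s = "../first/bit/of/the/path.html"
>>> s.split(os.sep, 2)[-1]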
'bit/of/the/path.html'
This is also more efficient than splitting the path completely and stringing it back together.
Note that this code does not complain when the path contains fewer than three path elements (for instance, 'file.html' yields 'file.html'). If you want the code to raise an exception when the path is not of the expected form, you can just ask for its third element (which is not present for paths that are too short):
>>> s.split(os.sep, 2)[2]
This can help detect some subtle errors.
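Since the question is about a URL, where the separator is always '/', a variant of the same idea could be sketched as follows. The function name strip_leading_two and the choice of ValueError are assumptions made for this example.

```python
def strip_leading_two(url_path):
    """Remove the first two '/'-separated elements, e.g. '../first/'."""
    parts = url_path.split("/", 2)
    if len(parts) < 3:
        # Fewer than three elements: not of the expected '../x/rest' form.
        raise ValueError("path too short: %r" % url_path)
    return parts[2]


print(strip_leading_two("../first/bit/of/the/url.html"))  # bit/of/the/url.html
```

Calling strip_leading_two('file.html') raises ValueError instead of silently returning the input.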
EOL has given a nice and clean approach; however, I could not resist giving a regex alternative to it :)
>>> import re
>>> m = re.search(r'^(\.{2}\/\w+/)(.*)$', '../first/bit/of/the/path.html')
>>> m.group(1)
'../first/'
>>> m.group(2)
'bit/of/the/path.html'

Python Regex Question

I have an end tag followed by a carriage return and line feed (x0Dx0A), followed by one or more tabs (x09), followed by a new start tag.
Something like this:
</tag1>x0Dx0Ax09x09x09<tag2> or </tag1>x0Dx0Ax09x09x09x09x09<tag2>
What Python regex should I use to replace it with something like this:
</tag1><tag3>content</tag3><tag2>
Thanks in advance.
Here is code for something like what you say that you need:
>>> import re
>>> sample = '</tag1>\r\n\t\t\t\t<tag2>'
>>> sample
'</tag1>\r\n\t\t\t\t<tag2>'
>>> pattern = '(</tag1>)\r\n\t+(<tag2>)'
>>> replacement = r'\1<tag3>content</tag3>\2'
>>> re.sub(pattern, replacement, sample)
'</tag1><tag3>content</tag3><tag2>'
>>>
Note that \r\n\t+ may be a bit too specific, especially if production of your input is not under your control. It may be better to adopt the much more general \s* (zero or more whitespace characters).
Using regexes to parse XML and HTML is not a good idea in general ... while it's hard to see a failure mode here (apart from elementary errors in getting the pattern correct), you might like to tell us what the underlying problem is, in case some other solution is better.
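For illustration, the more general \s* variant mentioned above might look like this. It is only a sketch; the names TAG_GAP and insert_tag3 are invented for the example, and the tag names are the placeholders from the question.

```python
import re

# Any run of whitespace (including none) between the two tags.
TAG_GAP = re.compile(r"(</tag1>)\s*(<tag2>)")


def insert_tag3(text):
    """Replace the whitespace between </tag1> and <tag2> with a new element."""
    return TAG_GAP.sub(r"\1<tag3>content</tag3>\2", text)


print(insert_tag3("</tag1>\r\n\t\t\t<tag2>"))  # </tag1><tag3>content</tag3><tag2>
print(insert_tag3("</tag1>\n    <tag2>"))      # </tag1><tag3>content</tag3><tag2>
```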
