Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
If I have a URL www.somewebsite/category/category-xyz and I want to segregate all such category URLs from a list of URLs that I already have, how do I do it in Python?
Take a look at urlparse
>>> from urllib.parse import urlparse
>>> url = "http://www.test.com:8080/cat1/cat2"
>>> parsed = urlparse(url)
>>> parsed
ParseResult(scheme='http', netloc='www.test.com:8080', path='/cat1/cat2', params='', query='', fragment='')
>>> parsed.path
'/cat1/cat2'
>>> parsed.path.split("/")
['', 'cat1', 'cat2']
As you can see above, urlparse breaks out the parts you don't care about, which makes the string processing easier. In the example it cleanly separates the protocol, host, and port and leaves you with just the path to operate on. If there were any query parameters, it would break those out too.
Once you have the path string, you can parse it as you would any other string. Since the path here starts with a "/", you can drop the empty first element from the split:
>>> parsed.path.split("/")[1:]
['cat1', 'cat2']
Please note: if your URL does not contain a path, this returns an empty list, and a trailing "/" leaves an empty string at the end of the list. If you need more detail, you should add the end result you're looking for to the question.
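To separate the category URLs from an existing list, a minimal sketch could filter on the first path segment. The helper name `is_category_url`, the sample list, and the rule that a category URL has `category` as its first path segment followed by at least one more segment are assumptions based on the example URL in the question.

```python
from urllib.parse import urlparse

def is_category_url(url, prefix="category"):
    """Return True if the first path segment equals `prefix` (hypothetical rule)."""
    # urlparse only fills in netloc when a scheme or a leading "//" is present,
    # so scheme-less inputs such as "www.somewebsite/..." get a "//" prefix.
    if "://" not in url:
        url = "//" + url
    # Drop empty segments caused by leading or trailing slashes.
    segments = [s for s in urlparse(url).path.split("/") if s]
    return len(segments) >= 2 and segments[0] == prefix

# Illustrative input list; these URLs are made up for the example.
urls = [
    "www.somewebsite/category/category-xyz",
    "http://www.somewebsite/about",
    "https://www.somewebsite/category/category-abc?page=2",
    "http://www.somewebsite/category",
]

category_urls = [u for u in urls if is_category_url(u)]
other_urls = [u for u in urls if not is_category_url(u)]
print(category_urls)
```

Comparing whole path segments, instead of searching for the substring "category" anywhere in the URL, avoids false positives from query strings or host names that happen to contain the word.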
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
I am very new to Python and I am trying to create a list out of a string.
Input = "<html><body><ul style="padding-left: 5pt"><i>(See attached file: File1.pdf)</i><i>(See attached file: File2.ppt)</i><i>(See attached file: File3.docx)</i></ul></body></html>"
Desired Output = [File1.pdf, File2.ppt, File3.docx]
What is the most efficient and pythonic way to achieve this? Any help will be very much appreciated.
Thanks
You can use BeautifulSoup, which has HTML parsing utilities.
>>> from bs4 import BeautifulSoup
>>> html = """<html><body><ul style="padding-left: 5pt"><i>(See attached file: File1.pdf)</i><i>(See attached file: File2.ppt)</i><i>(See attached file: File3.docx)</i></ul></body></html>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> files_list = [i.text.split('file: ')[1].replace(')', '') for i in soup.find_all('i')]
>>> print(files_list)
['File1.pdf', 'File2.ppt', 'File3.docx']
There might be a nicer way to do this using an HTML parser, as shree.pat18 suggested, but here is a quick and dirty way using str.split():
Output = [s.split(")")[0] for s in Input.split("file: ")[1:]]
Splitting on "file: " first gives a list of strings. The first one holds the beginning of the original string, so we skip it. The others start with the filenames we want, and the first character we don't care about is ")". So we split each of them on ")" and take the first part.
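The same idea can also be written as a single regular expression. This is an alternative sketch, not one of the approaches given above; it assumes every filename is introduced by the literal text "file: " and ends at the next ")", as in the sample input. The variable names are illustrative.

```python
import re

# Sample input from the question, written as a valid Python string literal.
html_input = (
    '<html><body><ul style="padding-left: 5pt">'
    '<i>(See attached file: File1.pdf)</i>'
    '<i>(See attached file: File2.ppt)</i>'
    '<i>(See attached file: File3.docx)</i>'
    '</ul></body></html>'
)

# Capture everything between "file: " and the next closing parenthesis.
file_names = re.findall(r"file: ([^)]+)\)", html_input)
print(file_names)
```

Like the split-based version, this treats the HTML as plain text, so it only holds up while the "(See attached file: ...)" wording stays fixed.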
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I'd like to remove all whitespace in URLs / email addresses. The addresses are in a "normal" string, like: "Today the weather is fine. Tomorrow, we'll see. More information: www.weather .com or info #weather.com"
I'm looking for a good regex (using the re module of Python), but my versions can't handle all cases:
re.sub(u'(www)([ .])([a-zA-Z\-]+)([ .])([a-z]+)', '\\1.\\3.\\5')
Your expression for the URL just requires a little fixing. The expression for the email address can be derived from the URL expression.
>>> #EXPRESSIONS:
>>> url = r"(www)+([ .])+([a-zA-Z\-]+)+([ .])+([a-z]+)"
>>> ema = r"([a-zA-Z]+)+([ +#]+)+([a-zA-Z\-]+\.com)"
>>>
>>> #IMPORTS:
>>> import re
>>>
>>> #YOUR DATA:
>>> string = "Today the weather is fine. Tomorrow, we'll see. More information: www.weather .com or info #weather.com"
>>>
>>> #Scraping Data
>>> "".join(re.findall(url,string)[0])
'www.weather.com'
>>> "".join(re.findall(ema,string)[0]).replace(" ","")
'info#weather.com'
>>>
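The session above extracts the addresses. To clean them inside the original string instead, one possible sketch uses re.sub with a replacement function. The two patterns, the helper names, and the assumption that "#" stands in for the email separator (as in the question's sample) are illustrative choices, not part of the answer above.

```python
import re

# Hypothetical patterns: "www", a name, and a lowercase suffix, with optional
# whitespace around the dots; and a local part, "#", then a domain.
URL_PATTERN = re.compile(r"www\s*\.\s*[a-zA-Z\-]+\s*\.\s*[a-z]+")
EMAIL_PATTERN = re.compile(r"[a-zA-Z]+\s*#\s*[a-zA-Z\-]+\s*\.\s*[a-z]+")

def strip_spaces(match):
    """Remove all whitespace from the matched address."""
    return re.sub(r"\s+", "", match.group(0))

def clean_addresses(text):
    """Rewrite URLs and email-like addresses in `text` without inner whitespace."""
    text = URL_PATTERN.sub(strip_spaces, text)
    return EMAIL_PATTERN.sub(strip_spaces, text)

text = ("Today the weather is fine. Tomorrow, we'll see. "
        "More information: www.weather .com or info #weather.com")
cleaned = clean_addresses(text)
print(cleaned)
```

Because the patterns tolerate whitespace around every dot, a URL that ends a sentence and is followed by a lowercase word could be merged with that word, so the patterns would likely need tightening for real data.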
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I have the following link:
https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk
I have multiple links in a dataset, and each link follows the same pattern. I want to get a specific part of each link: the text starting from the second http up to, but not including, the first + sign.
I don't know how to do this using regex. I am working in Python. Kindly help me out.
If each link has the same pattern, you do not need regex. You can use str.find() and string slicing:
link = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"
# This finds the second occurrence of "https://" and returns the position
second_https = link.find("https://", link.find("https://")+1)
# Index of the end of the link
end_of_link = link.find("+")
new_link = link[second_https:end_of_link]
print(new_link)
This will print "https://cooking.nytimes.com/learn-to-cook" and will work as long as the link follows the pattern described (the target starts at the second https:// in the link and ends at the first + sign).
I'd go with urlparse (Python 2) or urllib.parse (Python 3) and a little bit of regex:
import re
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
url_example = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"
parsed = urlparse(url_example)
result = re.findall('https?.*', parsed.query)[0].split('+')[0]
print(result)
Output:
https://cooking.nytimes.com/learn-to-cook
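A regex-free variant is also possible with parse_qs from the same module. This is a sketch that assumes the `q` parameter always has the layout `cache:<id>:<url>` seen in the single sample link; the helper name `cached_target` is made up for the example.

```python
from urllib.parse import urlparse, parse_qs

def cached_target(link):
    """Return the URL embedded in the q parameter (assumed layout cache:<id>:<url>)."""
    query = parse_qs(urlparse(link).query)
    # parse_qs decodes "+" as a space, so the trailing "+" becomes trailing whitespace.
    q_value = query["q"][0]
    # Split on the first two colons only, so the colon in "https://" is kept.
    _cache_tag, _cache_id, target = q_value.split(":", 2)
    return target.strip()

link = ("https://webcache.googleusercontent.com/search?"
        "q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk")
result = cached_target(link)
print(result)
```

A link without a `q` parameter raises KeyError, and a `q` value with fewer than two colons raises ValueError, so a dataset with irregular links would need those cases handled.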
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I am not sure how to extract the named groups I created in my regular expression, specifically datetime and IP. I have read other postings and the documentation, but I am getting a bit confused. I was wondering if someone could provide an example for me to follow. What I would like to do is extract datetime and IP for later use, perhaps stored in variables to be used later on.
sample log:
log = 'Oct 7 13:24:36 192.168.10.2 2013: 10:07-13:24:35 httpproxy[15359]: id="0001"'
httpproxy515139 = re.compile(r'(?P<datetime>\w\w\w\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<IP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*')
This sample should help you:
>>> import re
>>> sample = 'this is a sample text'
>>> third_word = re.compile(r'\S+ \S+ (?P<word>\S+) .*')
>>> ms = third_word.match(sample)
>>> ms.groupdict()
{'word': 'a'}
You need to access the groupdict() method of the returned match object.
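Applied to the log line from the question, this might look like the sketch below. The pattern is the question's pattern with the IP group written as `\d{1,3}` for every octet, and the log string is the sample line closed with a quote; the variable names `datetime_str` and `ip` are illustrative.

```python
import re

log = 'Oct 7 13:24:36 192.168.10.2 2013: 10:07-13:24:35 httpproxy[15359]: id="0001"'

httpproxy = re.compile(
    r'(?P<datetime>\w\w\w\s+\d+\s+\d\d:\d\d:\d\d)\s+'
    r'(?P<IP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*'
)

match = httpproxy.match(log)
if match:
    # Named groups can be read one at a time...
    datetime_str = match.group('datetime')
    ip = match.group('IP')
    # ...or all together as a dictionary.
    fields = match.groupdict()
    print(datetime_str, ip)
```

match() returns None when the line does not fit the pattern, which is why the group access sits inside the `if match:` check.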
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
How to open a webpage and search for a word in python?
This is a little simplified:
>>> import urllib.request  # Python 2: import urllib and call urllib.urlopen
>>> import re
>>> page = urllib.request.urlopen("http://google.com").read().decode("utf-8", errors="replace")
# => via regular expression
>>> re.findall("Shopping", page)
['Shopping']
# => via str.find, returns the position ...
>>> page.find("Shopping")
2716
First, get the page (e.g. via urllib.request.urlopen). Second, use a regular expression to find the portions of the text you are interested in, or use str.find.
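You can use urllib2 (urllib.request in Python 3):
from urllib.request import urlopen  # Python 2: from urllib2 import urlopen
# In Python 3, read() returns bytes, so decode before searching for a str
webp = urlopen("the_page").read().decode("utf-8", errors="replace")
webp.find("the_word")
Hope that helps :D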
How to open a webpage?
I think the most convenient way is:
from urllib.request import urlopen  # Python 2: from urllib2 import urlopen
page = urlopen('http://www.example.com').read().decode('utf-8', errors='replace')
How to search for a word?
I guess you are going to search for some pattern in the page next, so here we go:
import re
pattern = re.compile('^some regex$')
match = pattern.search(page)
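Putting the two steps together for Python 3, a minimal sketch might look like this. The function names `fetch_page` and `find_word`, the UTF-8 default, and the choice of a case-insensitive whole-word match are assumptions for illustration; the sample string stands in for a downloaded page so the example runs without network access.

```python
import re
from urllib.request import urlopen

def fetch_page(url, encoding="utf-8"):
    """Download a page and decode it to text (the encoding is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode(encoding, errors="replace")

def find_word(text, word):
    """Return the start positions of `word` as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [m.start() for m in pattern.finditer(text)]

# Stand-in for a downloaded page.
sample = "<a href='/shop'>Shopping</a> and more shopping"
positions = find_word(sample, "shopping")
print(positions)
```

For a live page, the call would be `find_word(fetch_page('http://www.example.com'), 'some_word')`. re.escape keeps characters such as "." or "+" in the search word from being read as regex syntax.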