Web Scraping - How to get a specific part of a weblink [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I have the following link:
https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk
I have multiple links in a dataset, and each link follows the same pattern. I want to extract a specific part of each link: the text starting at the second "http" and ending just before the first "+" sign. For the link above, that is https://cooking.nytimes.com/learn-to-cook.
I don't know how to do this with regex. I am working in Python. Kindly help me out.

If each link has the same pattern, you do not need regex. You can use str.find() and string slicing:
link = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"
# This finds the second occurrence of "https://" and returns the position
second_https = link.find("https://", link.find("https://")+1)
# Index of the end of the link
end_of_link = link.find("+")
new_link = link[second_https:end_of_link]
print(new_link)
This prints "https://cooking.nytimes.com/learn-to-cook" and works as long as the link follows the pattern described (the target starts at the second https:// in the link and ends at the + sign).
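Since the question mentions a whole dataset of links, the same find-and-slice idea could be wrapped in a small helper. This is only a sketch: the function name extract_target and its marker and terminator parameters are made up for illustration, and it assumes a link that does not contain two "https://" markers should yield None.

```python
def extract_target(link, marker="https://", terminator="+"):
    # Locate the first marker, then search again just past it for the second one.
    first = link.find(marker)
    second = link.find(marker, first + 1) if first != -1 else -1
    if second == -1:
        return None  # fewer than two markers: the pattern is not present
    end = link.find(terminator, second)
    # If no terminator follows, keep everything up to the end of the string.
    return link[second:] if end == -1 else link[second:end]


links = [
    "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk",
]
targets = [extract_target(link) for link in links]
print(targets)
```

Searching for the terminator only after the second marker avoids a wrong slice if a "+" happens to appear earlier in the link.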

I'd go with urlparse (Python 2) or urllib.parse (Python 3) and a little bit of regex:
import re
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
url_example = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"
parsed = urlparse(url_example)
result = re.findall('https?.*', parsed.query)[0].split('+')[0]
print(result)
Output:
https://cooking.nytimes.com/learn-to-cook
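Because the question asked for regex specifically, a single pattern can also do the whole job. The sketch below is one possible way, not the answerers' method; the names SECOND_URL and second_url are invented here, and it assumes the wanted part never contains a "+" itself.

```python
import re

# Skip lazily past the first URL scheme, then capture from the second
# "http://" or "https://" up to (not including) the next "+".
SECOND_URL = re.compile(r"https?://.*?(https?://[^+]+)")


def second_url(link):
    match = SECOND_URL.search(link)
    return match.group(1) if match else None


link = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"
print(second_url(link))
```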

Related

Python re.search assistance for Django [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed last year.
I need to replace part of an HTML string via a Django custom template tag that I am customising.
The string in its raw format is as follows:
arg = "<a class='login-password-forget' tabindex='1' href='{% url 'account:pwdreset'%}'>Forgot your password?</a>"
I need to replace the {% url 'account:pwdreset'%} part with a url string using re.search().
The code that I have written is clumsy, and I would appreciate help finding a better way of achieving the same result.
url_string = re.search("{.*?}", arg)
url_string_inner = re.search("'(.+?)'", url_string.group())
add_html = SafeText(''.join([arg.split('{')[0], reverse(url_string_inner.group(1)), arg.split('}')[1]]))
!!UPDATE!!
The solution that I ran with is as follows:
url_string = re.search("{.*?}", arg)
url_string_inner = re.search("'(.+?)'", url_string.group())
add_html = SafeText(''.join([arg.split('{')[0], reverse(url_string_inner.group(1)), arg.split('}')[1]]))
Thank you Fourth Bird for your help.

If you only want to replace the part with 'account:pwdreset', you could use re.sub with a capture group and use that group in the replacement between single quotes:
'{%\s*url '([^']*)'%}
import re
pattern = r"'{%\s*url '([^']*)'%}"
s = "<a class='login-password-forget' tabindex='1' href='{% url 'account:pwdreset'%}>Forgot your password?</a>"
print(re.sub(pattern, r"'\1'", s))
Output
<a class='login-password-forget' tabindex='1' href='account:pwdreset'>Forgot your password?</a>
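To go one step further and substitute the resolved URL, as the question's reverse() call does, re.sub also accepts a function as the replacement. The sketch below is an assumption-laden illustration: fake_reverse and the ROUTES table are hypothetical stand-ins for django.urls.reverse so the example runs without Django, and the path "/account/password-reset/" is invented.

```python
import re

# Hypothetical stand-in for django.urls.reverse; the route below is made up.
ROUTES = {"account:pwdreset": "/account/password-reset/"}


def fake_reverse(name):
    return ROUTES[name]


# Matches {% url 'name' %} with optional whitespace and captures the name.
URL_TAG = re.compile(r"\{%\s*url\s+'([^']*)'\s*%\}")


def expand_url_tags(html, resolve=fake_reverse):
    # Replace each url tag with whatever the resolver returns for its name.
    return URL_TAG.sub(lambda m: resolve(m.group(1)), html)


arg = "<a class='login-password-forget' tabindex='1' href='{% url 'account:pwdreset'%}'>Forgot your password?</a>"
print(expand_url_tags(arg))
```

In a real template tag, passing resolve=reverse (imported from django.urls) would presumably take the place of the stand-in.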

Search and create list from a string Python [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
I am very new to Python and I am trying to create a list out of a string in Python.
Input = '<html><body><ul style="padding-left: 5pt"><i>(See attached file: File1.pdf)</i><i>(See attached file: File2.ppt)</i><i>(See attached file: File3.docx)</i></ul></body></html>'
Desired Output = [File1.pdf, File2.ppt, File3.docx]
What is the most efficient and pythonic way to achieve this? Any help will be very much appreciated.
Thanks

You can use BeautifulSoup, which has HTML parsing utilities.
>>> from bs4 import BeautifulSoup
>>> html = """<html><body><ul style="padding-left: 5pt"><i>(See attached file: File1.pdf)</i><i>(See attached file: File2.ppt)</i><i>(See attached file: File3.docx)</i></ul></body></html>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> files_list = [i.text.split('file: ')[1].replace(')', '') for i in soup.find_all('i')]
>>> print(files_list)
['File1.pdf', 'File2.ppt', 'File3.docx']

There might be a nice way to do this using an HTML parser like shree.pat18 suggested, but here is a quick and dirty way using str.split():
Output = [s.split(")")[0] for s in Input.split("file: ")[1:]]
Splitting on "file: " first gives a list of strings. The first one holds the start of the original string, so we discard it with [1:]. Each of the others starts with a filename we want, and the first character we don't care about is ")". So we split on ")" and take the first part.
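The same extraction can also be sketched with a single regular expression. This is an alternative illustration rather than either answerer's approach; the function name attached_files is invented, and it assumes every filename is preceded by "attached file: " and followed by ")".

```python
import re

html = ('<html><body><ul style="padding-left: 5pt">'
        '<i>(See attached file: File1.pdf)</i>'
        '<i>(See attached file: File2.ppt)</i>'
        '<i>(See attached file: File3.docx)</i></ul></body></html>')


def attached_files(text):
    # Capture everything between "attached file: " and the closing parenthesis.
    return re.findall(r"attached file: ([^)]+)\)", text)


print(attached_files(html))
```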

Need Python regex to extract last two words of URL [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 2 years ago.
Please recommend a regex to extract the last two words from a URL, like:
INPUT OUTPUT
'www.abcd.google.com' --> 'google.com'
'www.xyz.stackoverflow.com' --> 'stackoverflow.com'

Use this regex with a negative lookahead:
import re
for url in ['www.abcd.google.com', 'www.xyz.stackoverflow.com']:
    print(re.search(r'\w*\.(?!\w*\.)\w*', url)[0])
google.com
stackoverflow.com

Use split with . as the delimiter:
url_strings = 'www.xyz.stackoverflow.com'
s = '.'.join(url_strings.split('.')[-2:])
# stackoverflow.com
print(s)
If input validation is required:
url_strings = 'www.xyz.stackoverflow.com'
def return_last_words(url_string, last_words_count=2):
    splitted = url_string.split('.')
    if last_words_count < len(splitted):
        return '.'.join(splitted[-last_words_count:])
    return url_string
print(return_last_words(url_strings))

Removing links from a reddit comments using python and regex [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I want to remove links in the format Reddit uses
comment = "Hello this is my [website](https://www.google.com)"
no_links = RemoveLinks(comment)
# no_links == "Hello this is my website"
I found a similar question about the same thing, but I don't know how to translate it to Python.
I am not that familiar with regex, so I would appreciate it if you explained what is happening.

You could do the following:
import re
pattern = re.compile(r'\[(.*?)\]\(.*?\)')
comment = "Hello this is my [website](https://www.google.com)"
print(pattern.sub(r'\1', comment))
The line:
pattern = re.compile(r'\[(.*?)\]\(.*?\)')
creates a regex pattern that matches any text enclosed in square brackets, immediately followed by any text enclosed in parentheses. The '?' after each '.*' makes the match non-greedy, so it consumes as little text as possible.
The call sub(r'\1', comment) replaces each match with the first capturing group, which in this case is the text inside the brackets.
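Wrapped up as the function the question asks for, the idea might look like the sketch below. The name remove_links mirrors the question's RemoveLinks, and the character classes [^\]] and [^)] are a swapped-in variant of the non-greedy .*? that assumes the link text contains no "]" and the URL contains no ")".

```python
import re

# [text](url): capture the text, discard the URL.
MARKDOWN_LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")


def remove_links(comment):
    # Replace every [text](url) with just its text.
    return MARKDOWN_LINK.sub(r"\1", comment)


comment = "Hello this is my [website](https://www.google.com)"
print(remove_links(comment))
```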

How to open a webpage and search for a word in python [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
How to open a webpage and search for a word in python?

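This is a little simplified (Python 2; in Python 3, urlopen lives in urllib.request and read() returns bytes that must be decoded before searching with a str pattern):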
>>> import urllib
>>> import re
>>> page = urllib.urlopen("http://google.com").read()
# => via regular expression
>>> re.findall("Shopping", page)
['Shopping']
# => via string.find, returns the position ...
>>> page.find("Shopping")
2716
First, get the page (e.g. via urllib.urlopen). Second, use a regular expression to find the portions of the text you are interested in, or use str.find.

You can use urllib2 (Python 2 only; in Python 3 its urlopen lives in urllib.request):
import urllib2
webp=urllib2.urlopen("the_page").read()
webp.find("the_word")
hope that helps :D

How to open a webpage?
I think the most convenient way is (Python 2; in Python 3, import urlopen from urllib.request):
from urllib2 import urlopen
page = urlopen('http://www.example.com').read()
How to search for a word?
I guess you are going to search for some pattern in the page next, so here we go:
import re
pattern = re.compile('^some regex$')
match = pattern.search(page)
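The answers above target Python 2. A rough Python 3 sketch of the same two steps follows; the function names fetch_text and find_word are invented for illustration, UTF-8 is assumed as the page encoding, and matching whole words only is a design choice of this sketch rather than something the answers specify.

```python
import re
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    # Download the page and decode the bytes; the encoding is an assumption.
    with urlopen(url) as response:
        return response.read().decode(encoding, errors="replace")


def find_word(text, word):
    # Return the start index of every whole-word occurrence of `word`.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [m.start() for m in pattern.finditer(text)]


sample = "Shopping cart: go Shopping"
print(find_word(sample, "Shopping"))
```

For a live page, the two pieces would be combined as find_word(fetch_text("http://www.example.com"), "Example"); re.escape keeps any special characters in the word from being read as regex syntax.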
