Python copy or print Hyperlink

Hey, I would like to copy or print the hyperlink behind a word.
For example: Gift Cards
What code would make this possible?
Can I use urllib2?
If somebody speaks German it would be simpler :)

You'll want to use BeautifulSoup:
import urllib2
from bs4 import BeautifulSoup

htmlfile = urllib2.urlopen(url).read()
soup = BeautifulSoup(htmlfile, 'html.parser')
a_tag = soup.find('a')  # finds the first occurrence of an <a> tag
print a_tag['href']  # prints the URL the <a> tag links to (the hyperlink)
Of course this is pretty basic code. If you look at the documentation of the module, you'll see many other methods you can use to achieve your results :)
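For instance, a minimal sketch of going beyond the first link and collecting every hyperlink on a page (Python 3 syntax, with a small made-up HTML string standing in for the downloaded page):

```python
from bs4 import BeautifulSoup

# Invented sample HTML standing in for a page fetched with urllib2/urllib
html = '''
<p><a href="https://example.com/gift-cards">Gift Cards</a>
<a href="https://example.com/toys">Toys</a></p>
'''

soup = BeautifulSoup(html, 'html.parser')

# find_all('a') returns every <a> tag; ['href'] reads its hyperlink
links = [a['href'] for a in soup.find_all('a')]
print(links)
```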

Related

Why is the html in view-source different from what I see in the terminal when I call prettify()?

I have decided to view a website's source code, and chose a class, which is "expanded" (I found it using view-source, prettify() shows different code). I wanted to print out all of its contents, with this code:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.quora.com/How-can-I-write-a-bot-using-Python")
soup = BeautifulSoup(page.content, 'html.parser')
print soup.find_all(class_='expanded')
but it simply prints out:
[]
Please help me detect what's wrong.
I already saw this thread and tried following what the answer said but it did not help me since this error appears in the terminal:
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
I had a look at the site in question and the only class similar was actually named ui_qtext_expanded
find_all (or the older findAll) returns a list of items, so you have to iterate over it and call .text on each one; that is, if you want the text and not the actual HTML.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.quora.com/How-can-I-write-a-bot-using-Python")
soup = BeautifulSoup(page.content, 'html.parser')
res = soup.find_all(class_='ui_qtext_expanded')
for i in res:
    print i.text
The beginning of the output from your link is
A combination of mechanize, Requests and BeautifulSoup works pretty good for the basic stuff. Learn about mechanize here. Mechanize is sufficient for basic form filling, form submission and that sort of stuff, but for real browser emulation (like dealing with Javascript rendered HTML) you should look into selenium.

beautiful soup parse url from messy output

I have beautiful soup code that looks like:
for item in beautifulSoupObj.find_all('cite'):
    pagelink.append(item.get_text())
the problem is, the html code I'm trying to parse looks like:
<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>
My current selector above would get everything, including strong tags in it.
Thus, how can I parse only:
https://www.websiteurl.com/id=6
Note <cite> appears multiple times throughout the page, and I want to extract, and print everything.
Thank you.
Extracting only the text portion is as easy as calling .text on the object.
We can use basic BeautifulSoup methods to traverse the tree hierarchy.
from bs4 import BeautifulSoup
html = '''<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.cite.text)
# is the same as soup.find('cite').text
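Since `<cite>` appears multiple times on the real page, a sketch (Python 3, with an invented two-entry snippet) that loops over all of them:

```python
from bs4 import BeautifulSoup

# Invented snippet with two <cite> entries, each split by a <strong> tag
html = '''
<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>
<cite>https://www.<strong>othersite.org/id=7</strong></cite>
'''

soup = BeautifulSoup(html, 'html.parser')

# .text joins the text of all descendants, so the <strong> split disappears
pagelinks = [cite.text for cite in soup.find_all('cite')]
print(pagelinks)
```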

Filtering out one string from a print statement in python/BeautifulSoup

I am using BeautifulSoup to scrape a website's many pages for comments. Each page of this website has the comment "[[commentMessage]]". I want to filter out this string so it does not print every time the code runs. I'm very new to python and BeautifulSoup, but I couldn't seem to find this after looking for a bit, though I may be searching for the wrong thing. Any suggestions? My code is below:
from bs4 import BeautifulSoup
import urllib
r = urllib.urlopen('website url').read()
soup = BeautifulSoup(r, "html.parser")
comments = soup.find_all("div", class_="commentMessage")
for element in comments:
    print element.find("span").get_text()
All of the comments are in spans within divs of the class commentMessage, including the unnecessary comment "[[commentMessage]]".
A simple if should do
for element in comments:
    text = element.find("span").get_text()
    if "[[commentMessage]]" not in text:
        print text

How do I only select DIV with similar ID

I am parsing a poorly designed web page using beautiful soup.
At the moment, what I need is to select the comment section of the web page, but each comment is treated as a DIV and each has an ID like "IAMCOMMENT_00001", and that's it. No class (that would have helped a lot).
So I am forced to search for all DIVs that start with "IAMCOMMENT" but I can't figure out how to do this. The closest I could find is SoupStrainer but couldn't understand how to even use it.
How would I be able to achieve this?
I would use BeautifulSoup's built-in find_all function:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(yourhtml, 'html.parser')
soup.find_all('div', id=re.compile('IAMCOMMENT_'))
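A runnable sketch of that regex filter in action (Python 3, with invented sample markup; note it is class_ that needs the trailing underscore, while id can be passed directly):

```python
import re
from bs4 import BeautifulSoup

# Invented markup: comment divs share an ID prefix but have no class
html = '''
<div id="IAMCOMMENT_00001">first comment</div>
<div id="sidebar">not a comment</div>
<div id="IAMCOMMENT_00002">second comment</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# A compiled pattern passed as the id filter matches against each div's ID
comment_divs = soup.find_all('div', id=re.compile(r'^IAMCOMMENT_\d+$'))
print([d.text for d in comment_divs])
```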
If the divs you want live inside HTML comments, you first need to find the comments in your html. A way to do this is:
import re
from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup(myhtml, 'html.parser')
comments = soup.find_all(text=lambda text: isinstance(text, Comment))
to find the divs inside a comment,
for comment in comments:
    cmnt_soup = BeautifulSoup(comment, 'html.parser')
    divs = cmnt_soup.find_all('div', attrs={"id": re.compile(r'IAMCOMMENT_\d+')})
    # do things with the divs
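Putting the two steps together, a sketch (Python 3, with an invented sample page) that pulls a div out of an HTML comment:

```python
import re
from bs4 import BeautifulSoup, Comment

# Invented page: one comment div hidden inside an HTML comment, one visible
html = '''
<body>
<!-- <div id="IAMCOMMENT_00001">hidden comment</div> -->
<div id="IAMCOMMENT_00002">visible comment</div>
</body>
'''

soup = BeautifulSoup(html, 'html.parser')

found = []
# HTML comments are Comment nodes; re-parse each one as its own document
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    cmnt_soup = BeautifulSoup(comment, 'html.parser')
    for div in cmnt_soup.find_all('div', attrs={'id': re.compile(r'IAMCOMMENT_\d+')}):
        found.append(div.text)

print(found)
```

Only the div inside the comment is collected here; the visible one would be found by a plain find_all on the outer soup.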

Web Scraper Not Producing Results Using Python

I am a young grasshopper in need of your help. I've done a lot of research and can't seem to find the solution. I wrote the following code below. When run, it doesn't pull any of the titles. I believe my regular expressions are correct. Not sure what the problem is. Probably obvious to a seasoned sensei. Thanks in advance.
from urllib import urlopen
import re
url = urlopen('http://www.realclearpolitics.com/epolls/2012/senate/ma/massachusetts_senate_brown_vs_warren-2093.html#polls').read()
'''
<a href="http://multimedia.heraldinteractive.com/misc/umlrvnov2012final.pdf">Title</a>
'''
A = '<a href.*pdf">(expression to pull everything)</a>'
B = re.compile(A)
C = re.findall(B,url)
print C
This comes up pretty often here on SO. Rather than using Regular Expressions you should be using an HTML parser that allows you to search/traverse the document tree.
I would use BeautifulSoup:
Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match "foo.com", or "Find the table heading that's got bold text, then give me that text."
>>> from bs4 import BeautifulSoup
>>> html = ? # insert your raw HTML here
>>> soup = BeautifulSoup(html, 'html.parser')
>>> a_tags = soup.find_all("a")
>>> for anchor in a_tags:
...     print anchor.contents
I'll echo the other comment about not using RegEx for parsing HTML, but sometimes it is quick and easy. It looks like the HTML in your example is not quite correct, but I'd try something like:
re.findall(r'href.*?pdf">(.+?)</a>', url)  # search the downloaded page, not the pattern string
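As a runnable sketch (Python 3, with a made-up snippet standing in for the downloaded poll page), the two approaches side by side:

```python
import re
from bs4 import BeautifulSoup

# Invented HTML standing in for the downloaded page
page = '<p><a href="http://example.com/report.pdf">Poll Report</a></p>'

# Regex approach: quick, but brittle against attribute order and whitespace
titles_re = re.findall(r'href=".*?\.pdf">(.+?)</a>', page)

# Parser approach: survives markup variations the regex would miss
soup = BeautifulSoup(page, 'html.parser')
titles_bs = [a.text for a in soup.find_all('a', href=re.compile(r'\.pdf$'))]

print(titles_re, titles_bs)
```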
