BeautifulSoup .select() returns nothing - Python

I'm an amateur Python user. Currently I'm trying to figure out the Beautiful Soup module, but I can't get the select method to find anything.
I have made an example HTML file (more or less copied from the book "Automate the Boring Stuff with Python"), the content of which is:
<html><head><title>The Website Title</title></head>
<body>
<p><strong>Hi There!</strong> here is a link to a website: <a href="http://inventwithpython.com">a website thing</a>.</p>
<p class="slogan">this is a roundup, this is a low flying panic attack.</p>
<p>By <span id="author">Yonatan.</span></p>
</body></html>
I've entered this code into the shell:
examplefile = open('example.html')
examplesoup = bs4.BeautifulSoup(examplefile.read())
elem = examplesoup.select('#author')
but what I get as elem is an empty list. I've checked examplefile.read() and it's the real thing. I also tried select('p') and got nothing.
is there something very obvious that I'm missing here? I'm also new to html.

Try this:
examplefile = open('example.html')
myfile = examplefile.read()
examplesoup = bs4.BeautifulSoup(myfile, 'html.parser')
elem = examplesoup.select('#author')
This should work.
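A likely culprit: a file object can only be read once. If you already called examplefile.read() in the shell to check the contents, the file pointer is sitting at end-of-file, so the next read() returns an empty string and BeautifulSoup gets nothing to parse. Reading the file into a variable once, as above, avoids that. A minimal sketch (the file is recreated here just to make the example self-contained):

```python
# Recreate a small example file so the sketch is self-contained.
with open('example.html', 'w') as f:
    f.write('<p>By <span id="author">Yonatan.</span></p>')

examplefile = open('example.html')
first = examplefile.read()   # returns the whole file
second = examplefile.read()  # file pointer is already at EOF: returns ''
examplefile.close()

print(second == '')  # True: a second read() yields nothing to parse
```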

Related

How to extract Email, Telephone, Fax number and Address from many different html links by writing a python script?

I tried this code but it isn't working right (it doesn't extract from all sites, among other issues). Need help!
from bs4 import BeautifulSoup
import re
import requests

allsite = ["https://www.ionixxtech.com/", "https://sumatosoft.com", "https://4irelabs.com/", "https://www.leewayhertz.com/",
           "https://stackoverflow.com", "https://www.vardot.com/en", "http://www.clickjordan.net/", "https://vtechbd.com/"]
emails = []
tels = []
for l in allsite:
    r = requests.get(l)
    soup = BeautifulSoup(r.content, "html.parser")
    for link in soup.findAll('a', attrs={'href': re.compile("^mailto:")}):
        emails.append(link.get('href'))
    for tel in soup.findAll('a', attrs={'href': re.compile("^tel:")}):
        tels.append(tel.get('href'))
print(emails)
print(tels)
This is neither a regex nor an HTML parsing issue. Print out r.content and you will notice (e.g. for https://vtechbd.com/) that the actual HTML source you are parsing isn't the same as the one rendered by your browser when you access the site.
<!-- Contact Page -->
<section class="content hide" id="contact">
<h1>Contact</h1>
<h5>Get in touch.</h5>
<p>Email: <span class="__cf_email__" data-cfemail="2e474048416e585a4b4d464c4a004d4143">[email protected]</span><br />
So I assume the information you are interested in is loaded dynamically by some JavaScript. Python's requests library is an HTTP client, not a web scraper.
Also, it's not cool to ask people to debug your code because it's 5pm, you want to get out of the office, and you hope somebody will have solved your issue by tomorrow morning. I may be wrong, but the way your question is asked leaves me with the impression you spent about two minutes pasting your source code in.
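Incidentally, the [email protected] placeholder in that snippet is Cloudflare's email obfuscation: the data-cfemail attribute is a hex string whose first byte is an XOR key for every following byte. A small decoding sketch (this only recovers Cloudflare-obfuscated addresses; it doesn't help with content built by JavaScript):

```python
def decode_cfemail(cfemail):
    """Decode Cloudflare's data-cfemail attribute: the first hex byte
    is an XOR key applied to every following byte."""
    key = int(cfemail[:2], 16)
    return ''.join(
        chr(int(cfemail[i:i + 2], 16) ^ key)
        for i in range(2, len(cfemail), 2)
    )

# The value from the snippet above:
print(decode_cfemail("2e474048416e585a4b4d464c4a004d4143"))  # info@vtechbd.com
```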

HTML source while webscraping seems inconsistent for website

I checked out say:
https://www.calix.com/search-results.html?searchKeyword=C7
And if I inspect element on the first link I get this:
<a class="title viewDoc"
href="https://www.calix.com/content/dam/calix/mycalix-
misc/ed-svcs/learning_paths/C7_lp.pdf" data-
preview="/session/4e14b237-f19b-47dd-9bb5-d34cc4c4ce01/"
data-preview-count="1" target="_blank"><i class="fa fa-file-
pdf-o grn"></i><b>C7</b> Learning Path</a>
I coded:
import requests, bs4
res = requests.get('https://www.calix.com/search-results.html?searchKeyword=C7', headers={'User-Agent': 'test'})
print(res)
#res.raise_for_status()
bs_obj = bs4.BeautifulSoup(res.text, "html.parser")
elems = bs_obj.findAll('a', attrs={"class": "title viewDoc"})
print(elems)
And there was [] as output (empty list).
So, I thought about actually looking through the "view-source" for the page.
view-source:https://www.calix.com/search-results.html?searchKeyword=C7
If you search through the "view-source" you will not find the code for the "inspect element" I mentioned earlier.
There is no "a class="title viewDoc"" in the view-source of the page.
That is probably why my code isn't returning anything.
Then I went to www.nba.com and inspected a link:
<a class="content_list--item clearfix"
href="/article/2018/07/07/demarcus-cousins-discusses-
stacked-golden-state-warriors-roster"><h5 class="content_list-
-title">Cousins on Warriors' potential: 'Scary'</h5><time
class="content_list--time">in 5 hours</time></a>
The content of "inspect" for this link was in the "view-source" of the page.
And, obviously my code was working for this page.
I have seen a few other examples of the first issue.
Just curious why the HTML differs between what I inspect and what I download, or am I missing something?
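The difference can be reproduced offline. "View source" shows the HTML the server sent, which is all requests ever sees, while "Inspect element" shows the live DOM after JavaScript has run. The strings below are made-up stand-ins for the two views of the calix page:

```python
# What the server sends (and what requests.get / view-source sees): a shell
# page plus a script that will build the results in the browser.
view_source = '<div id="search-results"></div><script src="search.js"></script>'

# What "Inspect element" shows after the script has run in the browser.
inspected_dom = ('<div id="search-results">'
                 '<a class="title viewDoc" href="C7_lp.pdf"><b>C7</b> Learning Path</a>'
                 '</div>')

marker = 'title viewDoc'
print(marker in view_source)    # False: the element is built by JavaScript
print(marker in inspected_dom)  # True: it exists only in the live DOM
```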

Selenium how to extract href from attributes

<div class="turbolink_scroller" id="container">
<article><div class="inner-article">
<a style="height:81px;" href="LINK TO EXTRACT">
<img width="81" height="81" src="//image.jpg" alt="code" />
Hello! I'm pretty new to selenium and I've been playing around with how to get sources for my webdriver. So far, I'm trying to extract a href link given an alt code as above and I'm not sure if the documentation has a means to do this. I'm feeling that the answer is find_by_xpath but I'm not entirely sure. Thank you for any tips!
The way is as follows
href = driver.find_element_by_tag_name('a').get_attribute('href')
of course, you may have a lot of 'a' tags in a page, so you may need to narrow the path to your respective tag,
e.g.:
div = driver.find_element_by_id('container')
a = div.find_element_by_tag_name('a')
href = a.get_attribute('href')
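find_element_by_xpath works too (note that in Selenium 4 the find_element_by_* helpers were removed in favour of driver.find_element(By.XPATH, ...)). A handy XPath pattern for this exact case is to match the <img> by its alt text and step up to the parent <a>. Sketched offline with the stdlib ElementTree parser against the question's snippet (with the tags closed so it parses; the href value is the question's placeholder):

```python
import xml.etree.ElementTree as ET

# The question's snippet, closed up to be well-formed XML.
snippet = '''<div class="turbolink_scroller" id="container">
  <article><div class="inner-article">
    <a style="height:81px;" href="LINK TO EXTRACT">
      <img width="81" height="81" src="//image.jpg" alt="code" />
    </a>
  </div></article>
</div>'''

root = ET.fromstring(snippet)
# Match the <img> by its alt attribute, then ".." steps up to the enclosing <a>.
link = root.find(".//img[@alt='code']/..")
print(link.get('href'))  # LINK TO EXTRACT
```

In Selenium the same idea would be driver.find_element(By.XPATH, "//img[@alt='code']/..").get_attribute('href').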

Beautiful Soup not recognizing Button Tag

I'm currently experimenting with Beautiful Soup 4 in Python 2.7.6
Right now, I have a simple script to scrape Soundcloud.com. I'm trying to print out the number of button tags on the page, but I'm not getting the answer I expect.
from bs4 import BeautifulSoup
import requests
page = requests.get('http://soundcloud.com/sondersc/waterfalls-sonder')
data = page.text
soup = BeautifulSoup(data)
buttons = soup.findAll('button')
print 'num buttons =', len(buttons)
When I run this, I get the output
num buttons = 0
This confuses me. I know for a fact that the button tags exist on this page so it shouldn't be returning 0. Upon inspecting the button elements directly underneath the waveform, I find these...
<button class="sc-button sc-button-like sc-button-medium sc-button-responsive" tabindex="0" title="Like">Like</button>
<button class="sc-button sc-button-medium sc-button-responsive sc-button-addtoset" tabindex="0" title="Add to playlist">Add to playlist</button>
<button class="sc-button sc-button-medium sc-button-responsive sc-button-addtogroup" tabindex="0" title="Add to group">Add to group</button>
<button class="sc-button sc-button-share sc-button-medium sc-button-responsive" title="Share" tabindex="0">Share</button>
At first I thought that the way I was trying to find the button elements was incorrect. However, if I modify my code to scrape an arbitrary youtube page...
page = requests.get('http://www.youtube.com/watch?v=UiyDmqO59QE')
then I get the output
num buttons = 37
So that means that soup.findAll('button') is doing what it's supposed to, just not on SoundCloud.
I've also tried specifying the exact button I want, expecting to get a return result of 1
buttons = soup.findAll('button', class_='sc-button sc-button-like sc-button-medium sc-button-responsive')
print 'num buttons =', len(buttons)
but it still returns 0.
I'm kind of stumped on this one. Can anyone explain why this is?
The reason you cannot get the buttons is that there are no button tags inside the html you are getting:
>>> import requests
>>> page = requests.get('http://soundcloud.com/sondersc/waterfalls-sonder')
>>> data = page.text
>>> '<button' in data
False
This means that there is much more involved in forming the page: AJAX requests, JavaScript function calls, etc.
Also, note that soundcloud provides an API - there is no need to crawl HTML pages of the site. There is also a python wrapper around the Soundcloud API available.
Also, be careful about web-scraping, study Terms of Use:
You must not employ scraping or similar techniques to aggregate,
repurpose, republish or otherwise make use of any Content.
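The point is general and has nothing to do with BeautifulSoup itself: a parser can only count tags that are present in the bytes it receives. A stdlib sketch with made-up HTML for the two cases:

```python
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Count occurrences of a single tag name in an HTML string."""
    def __init__(self, name):
        super().__init__()
        self.name = name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.name:
            self.count += 1

# Server-rendered page: the buttons are right there in the source.
static = '<div><button title="Like">Like</button><button title="Share">Share</button></div>'
c1 = TagCounter('button')
c1.feed(static)
print(c1.count)  # 2

# JavaScript-built page: the response is just a shell, so every parser sees 0.
shell = '<div id="app"></div><script src="app.js"></script>'
c2 = TagCounter('button')
c2.feed(shell)
print(c2.count)  # 0
```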

I am not able to parse using Beautiful Soup

<td>
<a name="corner"></a>
<div>
<div style="aaaaa">
<div class="class-a">My name is alis</div>
</div>
<div>
<span><span class="class-b " title="My title"><span>Very Good</span></span> </span>
<b>My Description</b><br />
My Name is Alis I am a python learner...
</div>
<div class="class-3" style="style-2 clear: both;">
alis
</div>
</div>
<br /></td>
I want the description after scraping it:
My Name is Alis I am a python learner...
I tried a lot of things but I could not figure out the best way. Can you give a general solution for this?
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("Your sample html here")
soup.td.div('div')[2].contents[-1]
This will return the string you are looking for (the unicode string, with any applicable whitespace, it should be noted).
This works by parsing the html, grabbing the first td tag and its contents, grabbing any div tags within the first div tag, selecting the 3rd item in the list (list index 2), and grabbing the last of its contents.
In BeautifulSoup, there are A LOT of ways to do this, so this answer probably hasn't taught you much and I genuinely recommend you read the tutorial that David suggested.
Have you tried reading the examples provided in the documentation? The quick start is located here: http://www.crummy.com/software/BeautifulSoup/documentation.html#Quick Start
Edit:
To find the div, you would load your HTML up via
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("My html here")
myDiv = soup.find("div", { "class" : "class-a" })
Also remember you can do most of this via the python console and then using dir() along with help() walk through what you're trying to do. It might make life easier on you to try out ipython or perhaps python IDLE which have very friendly consoles for beginners.
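For completeness, here is the same extraction with the current bs4 package (the BeautifulSoup 3 import used above is long deprecated). Anchoring on the <b>My Description</b> label is a little more robust than counting divs; this assumes bs4 is installed:

```python
from bs4 import BeautifulSoup

# The HTML sample from the question.
html = '''<td><a name="corner"></a><div>
<div style="aaaaa"><div class="class-a">My name is alis</div></div>
<div><span><span class="class-b " title="My title"><span>Very Good</span></span> </span>
<b>My Description</b><br />
My Name is Alis I am a python learner...
</div>
<div class="class-3">alis</div>
</div><br /></td>'''

soup = BeautifulSoup(html, 'html.parser')
# Find the <b> label, step up to its <div>, and take the trailing text node.
description = soup.find('b', string='My Description').parent.contents[-1].strip()
print(description)  # My Name is Alis I am a python learner...
```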
