Getting source code for a section directly with requests in Python - python

I want to get the source code of only one section of a website instead of the whole page, and then parse out that section, since this should be faster than loading the whole page and parsing it. I tried passing the section link (with a fragment) as the URL, but I still get the whole page.
url = 'https://stackoverflow.com/questions/19012495/smooth-scroll-to-div-id-jquery/#answer-19013712'
response = requests.get(url)
print(response.text)

You cannot fetch a specific section directly with the requests API, but you can use BeautifulSoup to extract it after the fact.
A small sample is given on the Dataquest website:
import requests
from bs4 import BeautifulSoup

page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
print(page.content)
Running the above script will output this HTML string:
<html>
<head>
<title>A simple example page
</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
First paragraph.
</p><p class="inner-text">
Second paragraph.
</p></div>
<p class="outer-text first-item" id="second"><b>
First outer paragraph.
</b></p><p class="outer-text"><b>
Second outer paragraph.
</b>
</p>
</body>
</html>
You can get a specific section by finding it through tag type, class, or id.
By tag type:
soup.find_all('p')
By class:
soup.find_all('p', class_='outer-text')
By id:
soup.find_all(id="first")
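Applied to the original question, a small sketch (assuming Stack Overflow marks each answer with an id of the form answer-<id>, which is what the URL fragment suggests):
import requests
from bs4 import BeautifulSoup

# The fragment (#answer-19013712) never reaches the server, so the whole
# page is downloaded either way; slice out the one answer locally.
url = 'https://stackoverflow.com/questions/19012495/smooth-scroll-to-div-id-jquery'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
answer = soup.find(id='answer-19013712')  # assumed id format on the page
print(answer.get_text(strip=True) if answer else 'answer not found')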

HTTP will not allow you to do that: the fragment part of the URL (#answer-19013712) is resolved by the browser and never sent to the server, so the server always returns the whole page.
You can use the Stack Overflow API instead: pass the answer id 19013712 and you get only that specific answer back.
Note that you may still have to register for an app key.
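A minimal sketch of such a call, assuming the public Stack Exchange API's /answers endpoint and its built-in withbody filter (small anonymous quotas work without a key):
import requests

# Fetch a single answer from the Stack Exchange API as JSON.
resp = requests.get(
    'https://api.stackexchange.com/2.3/answers/19013712',
    params={'site': 'stackoverflow', 'filter': 'withbody'},
)
items = resp.json().get('items', [])
if items:
    print(items[0]['body'])  # HTML of just that answer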

Related

How to find a piece of text between <h3> and </h3> in an HTML page with Python

There is an HTML page, and I need to collect into a list the text contained between the <h3> and </h3> tags:
<h3 id="basics">1. Creating a Web Page</h3>
<p>
Once you've made your "home page" (index.html) you can add more pages to
your site, and your home page can link to them.
<h3 id="syntax">>2. HTML Syntax</h3>
I don't know how to write a pattern for this; please help me get the values "1. Creating a Web Page" and ">2. HTML Syntax".
You can use a library like BeautifulSoup for parsing web pages:
import requests
from bs4 import BeautifulSoup

html = requests.get('url to your page')
html.encoding = 'utf-8'
sp = BeautifulSoup(html.text, "html5lib")
# to get all h3 tags in the page
list_h3 = sp.find_all('h3')
for h3 in list_h3:
    print(h3.text)
This should also work, by stripping away the parts of the actual tags with plain string operations:
html="<h3 id='basics'>1. Creating a Web Page</h3>"
text=html.replace("<h3","").split(">")[1].split("</")[0]
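Since the question literally asks for a pattern, here is a minimal regex sketch (my addition; fragile on nested or multi-line markup, which is why a parser is usually the better tool):
import re

html = '<h3 id="basics">1. Creating a Web Page</h3>\n<h3 id="syntax">>2. HTML Syntax</h3>'
# Non-greedy capture of everything between an opening h3 tag and </h3>.
print(re.findall(r'<h3[^>]*>(.*?)</h3>', html))
# ['1. Creating a Web Page', '>2. HTML Syntax']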

BS4 breaks HTML trying to repair it

BS4 corrects faulty HTML. Usually this is not a problem. I tried parsing, altering and saving the HTML of this page: ulisses-regelwiki.de/index.php/sonderfertigkeiten.html
In this case the repair changes the rendering: after it, many lines of the page are no longer centered but left-aligned instead.
Since I have to work with the broken HTML of said page, I cannot simply repair the HTML code myself.
How can I prevent BS4 from repairing the HTML, or fix the "correction" somehow?
(This minimal example just shows BS4 repairing broken HTML code; I couldn't create a minimal example where BS4 does this in a wrong way, as with the page mentioned above.)
#!/usr/bin/env python3
from bs4 import BeautifulSoup, NavigableString

html = '''
<!DOCTYPE html>
<center>
Some Test content
<!-- A comment -->
<center>
'''

def is_string_only(t):
    return type(t) is NavigableString

soup = BeautifulSoup(html, 'lxml')  # or 'html.parser'
print(str(soup))
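As a side note (my addition, not part of the question): BS4 delegates the repair to whichever parser you choose, and each parser fixes broken markup differently, so comparing them on the same snippet may reveal one whose repair preserves the layout you need. A small sketch, assuming lxml and html5lib are installed:
from bs4 import BeautifulSoup

broken = '<!DOCTYPE html><center>Some Test content<center>'
# Each parser builds a different tree from the same broken input.
for parser in ('html.parser', 'lxml', 'html5lib'):
    print(parser, '->', BeautifulSoup(broken, parser))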
Try this library:
from simplified_scrapy import SimplifiedDoc
html = '''
<!DOCTYPE html>
<center>
Some Test content
<!-- A comment -->
<center>
'''
doc = SimplifiedDoc(html)
print(doc.html)
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples

BeautifulSoup: how to select all the 'a' tags

I am a newbie to BeautifulSoup and Python. Here is my HTML:
<html>
<head></head>
<body>
<a href="http://google.com">Google</a>
<a href="http://yahoo.com">Yahoo</a>
</body>
</html>
Now my code:
from bs4 import BeautifulSoup

# html holds the page source shown above, so fetching it with requests is not necessary here
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('a'))
This gives just one link, but I want to get all of them.
Thanks in advance.
You are using .find(), which returns only the first match; use .find_all() instead to get a list of all the <a> tags.
print(soup.find_all('a'))
To get the hrefs with a for loop:
for link in soup.find_all('a'):
    print(link.get('href'))
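An equivalent with CSS selectors (my addition), which also skips anchors that lack an href attribute entirely:
for link in soup.select('a[href]'):
    print(link['href'])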

How to extract email, telephone, fax number and address from many different HTML links by writing a Python script?

I tried this code but it isn't working right (it doesn't extract from all sites, and there are many other issues). Need help!
from bs4 import BeautifulSoup
import re
import requests

allsite = ["https://www.ionixxtech.com/", "https://sumatosoft.com", "https://4irelabs.com/", "https://www.leewayhertz.com/",
           "https://stackoverflow.com", "https://www.vardot.com/en", "http://www.clickjordan.net/", "https://vtechbd.com/"]
emails = []
tels = []
for l in allsite:
    r = requests.get(l)
    soup = BeautifulSoup(r.content, "html.parser")
    for link in soup.findAll('a', attrs={'href': re.compile("^mailto:")}):
        emails.append(link.get('href'))
    for tel in soup.findAll('a', attrs={'href': re.compile("^tel:")}):
        tels.append(tel.get('href'))
print(emails)
print(tels)
This is neither a regex nor an HTML parsing issue. Print out r.content and you will notice (e.g. for https://vtechbd.com/) that the actual HTML source you are parsing isn't the same as the one rendered by your browser when you access the site:
<!-- Contact Page -->
<section class="content hide" id="contact">
<h1>Contact</h1>
<h5>Get in touch.</h5>
<p>Email: <span class="__cf_email__" data-cfemail="2e474048416e585a4b4d464c4a004d4143">[email protected]</span><br />
So I assume the information you are interested in is loaded dynamically by some JavaScript. Python's requests library is an HTTP client, not a web scraper.
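A common workaround, not mentioned in the original answer, is to render the page in a real browser first; a minimal sketch with Selenium, assuming the selenium package and a Chrome driver are installed:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://vtechbd.com/')
# page_source reflects the DOM after the page's JavaScript has run
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
print(soup.find_all('a', href=True))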
Also: it's not cool to ask people to debug your code because it's 5 pm, you want to get out of the office, and you hope somebody will have solved your issue by tomorrow morning. I may be wrong, but the way your question is asked leaves me with the impression that you spent about two minutes pasting your source code in.

Exception handling when the input link doesn't have the appropriate form

For instance, I have a list of links like this:
linklists = ['www.right1.com', 'www.right2.com', 'www.wrong.com', 'www.right3.com']
and the HTML of right1, right2 and right3 has this form:
<html>
<p>
hi
</p>
<strong>
hello
</strong>
</html>
and the form of www.wrong.com's HTML is (the actual HTML is much more complicated):
<html>
<p>
hi
</p>
</html>
and I'm using code like this:
import re
import urllib2
from BeautifulSoup import BeautifulSoup

stronglist = []
for httplink in linklists:
    url = httplink
    page = urllib2.urlopen(url)
    html = page.read()
    soup = BeautifulSoup(html)
    findstrong = soup.findAll("strong")
    findstrong = str(findstrong)
    findstrong = re.sub(r'\[|\]|\s*<[^>]*>\s*', '', findstrong)  # remove tags
    stronglist.append(findstrong)
What I want to do is:
go through the HTML of each link in the list 'linklists'
find the data between <strong> tags
add it to the list 'stronglist'
But the problem is:
the wrong link (www.wrong.com) has no <strong> tag,
so the code raises an error.
What I want is exception handling (or something else) so that when a link has no 'strong' field, the code appends the string 'null' to stronglist, since it can't get data from that link.
I have been trying to solve this with 'if', but it's a bit hard for me.
Any suggestions?
There is no need to use exception handling for this. Just check whether the findAll method returns an empty list and handle that case:
from BeautifulSoup import BeautifulSoup
import urllib2

strong_list = []
for url in link_list:
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    strong_tags = soup.findAll("strong")
    if not strong_tags:
        strong_list.append('null')
        continue
    for strong_tag in strong_tags:
        strong_list.append(strong_tag.text)
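If you also want to survive links that fail to load at all (the question title does mention exception handling), here is a sketch wrapping the fetch in try/except; this is my addition, using the Python 2 urllib2 error type to match the BeautifulSoup 3 code above:
from BeautifulSoup import BeautifulSoup
import urllib2

strong_list = []
for url in link_list:
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError:
        strong_list.append('null')  # unreachable link treated like a missing tag
        continue
    strong_tags = BeautifulSoup(html).findAll("strong")
    if not strong_tags:
        strong_list.append('null')
        continue
    for strong_tag in strong_tags:
        strong_list.append(strong_tag.text)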
