Problems making soup: BeautifulSoup not opening entire page source, stopping at /html - python

Hi, I'm pretty new to scraping and would appreciate your help.
I am trying to open the following URL using:
from bs4 import BeautifulSoup
import urllib2
import csv
import re
amicales = urllib2.urlopen("http://www.journal-officiel.gouv.fr/association/index.php?ACTION=Rechercher&HI_PAGE=1&HI_COMPTEUR=0&original_method=get&WHAT=&JTH_ID=014000%2F014040&JAN_BD_CP=&JRE_ID=%CEle-de-France%2FParis&JAN_LIEU_DECL=&JTY_ID=&JTY_WALDEC=&JTY_SIREN=&JPA_D_D=&JPA_D_F=&rechercher.x=36&rechercher.y=7&rechercher=Rechercher")
soup = BeautifulSoup(amicales)
I want to scrape the results of a search query. The problem is that every result I am interested in ends with </html>.
I believe this forces BeautifulSoup to stop reading the source code after the first search result, so the remaining 20 or so results are ignored.
Here, for example, only the result "NATION INITIATIVE ET OU MACHROU3 WATTAN" is included:
print(soup.prettify())
Can anyone help me to open the whole page, and not just everything before the first /html tag?

Oh dear, that website is thoroughly broken. You can only have one </html> tag per page. If you look at the source, you see that there is only one <html> tag, as opposed to some 50 </html> tags.
One workaround would be to first remove all the </html> tags before passing it to BeautifulSoup.
page = page.replace("</html>", "")
soup = BeautifulSoup(page)
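Putting it together, a minimal sketch of the whole workaround, assuming the urllib2 setup from the question (the same idea works with requests or Python 3):

from bs4 import BeautifulSoup
import urllib2

url = "http://www.journal-officiel.gouv.fr/association/index.php?..."  # the full search URL from the question

page = urllib2.urlopen(url).read()   # read the raw HTML as one string
page = page.replace("</html>", "")   # drop the premature closing tags
soup = BeautifulSoup(page)           # the parser now sees a single document
print(soup.prettify())               # all ~20 results, not just the first

Alternatively, a more forgiving parser such as html5lib (pip install html5lib, then BeautifulSoup(page, "html5lib")) may cope with the stray tags without the replace step.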

Related

Why is the html in view-source different from what I see in the terminal when I call prettify()?

I decided to view a website's source code and chose a class named "expanded" (I found it using view-source; prettify() shows different code). I wanted to print out all of its contents with this code:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.quora.com/How-can-I-write-a-bot-using-Python")
soup = BeautifulSoup(page.content, 'html.parser')
print soup.find_all(class_='expanded')
but it simply prints out:
[]
Please help me detect what's wrong.
I already saw this thread and tried following the answer, but it did not help me, since this error appears in the terminal:
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
I had a look at the site in question, and the only similar class was actually named ui_qtext_expanded.
When you use findAll / find_all, you have to iterate over the result, since it is a list of matching elements; use .text on each item if you want the text and not the actual HTML:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.quora.com/How-can-I-write-a-bot-using-Python")
soup = BeautifulSoup(page.content, 'html.parser')
res = soup.find_all(class_='ui_qtext_expanded')
for i in res:
    print i.text
The beginning of the output from your link is
A combination of mechanize, Requests and BeautifulSoup works pretty good for the basic stuff.Learn about mechanize here.Mechanize is sufficient for basic form filling, form submission and that sort of stuff, but for real browser emulation (like dealing with Javascript rendered HTML) you should look into selenium.
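As for the bs4.FeatureNotFound error you hit when following the other thread: it just means the lxml parser that answer asked for is not installed. Either install it or fall back to the parser bundled with Python; a quick sketch:

# Option 1: install the parser the error complains about
#   pip install lxml
# Option 2: use the parser that ships with Python, no extra install
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>hello</p>", "html.parser")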

parsing html by using beautiful soup and selenium in python

I wanted to practice scraping with a real-world example (Airbnb) by using BeautifulSoup and Selenium in Python. Specifically, my goal is to get all the listing (home) IDs within LA. My strategy is to open Chrome, go to the Airbnb page where I have already manually searched for homes in LA, and start from there. Up to this point, I decided to use Selenium. After that, I wanted to parse the HTML inside the source code and find the listing IDs shown on the current page. Then, basically, I wanted to iterate through all the pages.
Here's my code:
from urllib import urlopen
from bs4 import BeautifulSoup
from selenium import webdriver
option=webdriver.ChromeOptions()
option.add_argument("--incognito")
driver=webdriver.Chrome(executable_path="C:/Users/chromedriver.exe",chrome_options=option)
first_url="https://www.airbnb.com/s/Los-Angeles--CA--United-States/select_homes?refinement_paths%5B%5D=%2Fselect_homes&place_id=ChIJE9on3F3HwoAR9AhGJW_fL-I&children=0&guests=1&query=Los%20Angeles%2C%20CA%2C%20United%20States&click_referer=t%3ASEE_ALL%7Csid%3Afcf33cf1-61b8-41d5-bef1-fbc5d0570810%7Cst%3AHOME_GROUPING_SELECT_HOMES&superhost=false&title_type=SELECT_GROUPING&allow_override%5B%5D=&s_tag=tm-X8bVo"
n=3
for i in range(1,n+1):
    if (i==1):
        driver.get(first_url)
        print first_url
        # HTML parse using BS
        html = driver.page_source
        soup = BeautifulSoup(html, "html.parser")
        listings = soup.findAll("div", {"class": "_f21qs6"})
        # print out all the listing_ids within the current page
        for i in range(len(listings)):
            only_id = listings[i]['id']
            print(only_id[8:])
    after_first_url = first_url + "&section_offset=%d" % i
    print after_first_url
    driver.get(after_first_url)
    # HTML parse using BS
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    listings = soup.findAll("div", {"class": "_f21qs6"})
    # print out all the listing_ids within the current page
    for i in range(len(listings)):
        only_id = listings[i]['id']
        print(only_id[8:])
If you find any inefficient code, please understand, since I'm a beginner; I wrote this by reading and watching multiple sources. Anyway, I guess I have correct code, but the issue is that every time I run it, I get a different result. That is, it loops over the pages, but sometimes it gives results for only a certain number of pages. For example, it loops over page 1 but doesn't give any corresponding output, then loops over page 2 and gives results, but doesn't for page 3. It's so random that it gives results for some pages but not for others. On top of that, sometimes it loops over pages 1, 2, 3, ... in order, but sometimes it loops over page 1, then moves on to the last page (17), and then comes back to page 2. I guess my code is not perfect, since it gives unstable output. Did anyone have a similar experience, or could someone help me figure out what the problem is? Thanks.
Try the method below.
Assuming you are on the page you want to parse, Selenium stores the source HTML in the driver's page_source attribute. You would then load the page_source into BeautifulSoup as follows:
In [8]: from bs4 import BeautifulSoup
In [9]: from selenium import webdriver
In [10]: driver = webdriver.Firefox()
In [11]: driver.get('http://news.ycombinator.com')
In [12]: html = driver.page_source
In [13]: soup = BeautifulSoup(html)
In [14]: for tag in soup.find_all('title'):
   ....:     print tag.text
   ....:
Hacker News
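Two things in the question's own code are worth flagging. First, driver.get returns as soon as the initial document loads, so on a JavaScript-heavy site like Airbnb, page_source may not yet contain the listings, which would explain the pages that randomly produce no output. Second, the inner for i in range(len(listings)) loop reuses the outer loop variable i, so section_offset jumps to however many listings the previous page had, which would explain visiting page 17 right after page 1. A sketch of both fixes, assuming the _f21qs6 class and search URL from the question are still valid:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # plus the incognito option from the question
first_url = "https://www.airbnb.com/s/..."  # the full search URL from the question

def listing_ids(url):
    driver.get(url)
    # Wait (up to 10 s) until JavaScript has rendered at least one listing
    # container, instead of parsing whatever page_source happens to hold.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "_f21qs6")))
    soup = BeautifulSoup(driver.page_source, "html.parser")
    return [div["id"][8:] for div in soup.find_all("div", {"class": "_f21qs6"})]

print(listing_ids(first_url))                 # page 1
for page in range(1, 3):                      # 'page', not 'i': no shadowing
    print(listing_ids(first_url + "&section_offset=%d" % page))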

beautiful soup parse url from messy output

I have beautiful soup code that looks like:
for item in beautifulSoupObj.find_all('cite'):
    pagelink.append(item.get_text())
the problem is, the html code I'm trying to parse looks like:
<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>
My current selector above would get everything, including strong tags in it.
Thus, how can I parse only:
https://www.websiteurl.com/id=6
Note <cite> appears multiple times throughout the page, and I want to extract, and print everything.
Thank you.
Extracting only the text portion is as easy as calling .text on the object.
We can use basic BeautifulSoup methods to traverse the tree hierarchy.
Helpful explanation on how to do that: HERE
from bs4 import BeautifulSoup
html = '''<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.cite.text)
# is the same as soup.find('cite').text
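Since the question notes that <cite> appears multiple times on the page, the same idea extends to every occurrence with find_all; a sketch with a second, made-up cite for illustration:

from bs4 import BeautifulSoup

html = '''<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>
<cite>https://www.<strong>websiteurl.com/id=7</strong></cite>'''
soup = BeautifulSoup(html, 'html.parser')

for cite in soup.find_all('cite'):
    # .text concatenates the text of every descendant, so the <strong>
    # markup disappears but its text stays in place.
    print(cite.text)
# https://www.websiteurl.com/id=6
# https://www.websiteurl.com/id=7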

Filtering out one string from a print statement in python/BeautifulSoup

I am using BeautifulSoup to scrape a website's many pages for comments. Each page of this website has the comment "[[commentMessage]]". I want to filter out this string so it does not print every time the code runs. I'm very new to Python and BeautifulSoup, but I couldn't seem to find this after looking for a bit, though I may be searching for the wrong thing. Any suggestions? My code is below:
from bs4 import BeautifulSoup
import urllib
r = urllib.urlopen('website url').read()
soup = BeautifulSoup(r, "html.parser")
comments = soup.find_all("div", class_="commentMessage")
for element in comments:
    print element.find("span").get_text()
All of the comments are in spans within divs of the class commentMessage, including the unnecessary comment "[[commentMessage]]".
A simple if should do
for element in comments:
    text = element.find("span").get_text()
    if "[[commentMessage]]" not in text:
        print text
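One caveat with the substring test: a genuine comment that merely contains "[[commentMessage]]" would be dropped as well. If the placeholder is always the entire comment, an exact comparison is the safer variant:

for element in comments:
    text = element.find("span").get_text()
    if text.strip() != "[[commentMessage]]":
        print(text)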

Web scraping using Beautiful Soup separating HTML and Javascript and CSS

I am trying to scrape a web page which comprises JavaScript, CSS, and HTML. This web page also has some text. When I open the web page using the file handler and run the soup.get_text() command, I would like to view only the HTML portion and nothing else. Is it possible to do this?
The current source code is:
from bs4 import BeautifulSoup
soup=BeautifulSoup(open("/home/Desktop/try.html"))
print soup.get_text()
What do I change to get only the HTML portion in a web page and nothing else?
Try to remove the contents of the tags that hold the unwanted text (or style attributes).
Here is some code (tested in basic cases)
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("/home/Desktop/try.html"))
# Clear every script tag
for tag in soup.find_all('script'):
    tag.clear()
# Clear every style tag
for tag in soup.find_all('style'):
    tag.clear()
# Remove style attributes (if needed)
for tag in soup.find_all(style=True):
    del tag['style']
print soup.get_text()
It depends on what you mean by get. Dmralev's answer will clear the other tags, which will work fine. However, <HTML> is a tag within the soup, so
print soup.html.get_text()
should also work, with fewer lines, assuming "portion" means that the HTML is separate from the rest of the code (i.e. the other code is not within the <html> tags).
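A side note on clear() versus decompose(): clear() empties a tag but leaves the empty tag in the tree, while decompose() removes the tag and its contents entirely. The get_text() output is the same either way, so this variant of Dmralev's approach is mostly a matter of taste:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("/home/Desktop/try.html"), "html.parser")

# Remove <script> and <style> elements from the tree entirely
for tag in soup.find_all(['script', 'style']):
    tag.decompose()

print(soup.get_text())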
