Why does HTML source code change while extracting text from soup object? - python

I am trying to scrape news articles from the results of a search term using Selenium and BeautifulSoup in Python. I have arrived at the final page, which contains the text, using:
article_page = requests.get(articles.link_of_article[0])
article_soup = BeautifulSoup(article_page.text, "html.parser")
for content in article_soup.find_all('div', {"class": "name_of_class_with_contained_text"}):
    content.get_text()
I notice that "name_of_class_with_contained_text" is present when I visually inspect the page source in the browser, but the class is not present in the soup object. Also, all the "p" tags are replaced with sequences like "\\u003c/p\\u003e\\u003cp\\u003e \\u003c/p\\u003e\\u003cp\\u003e".
I am unable to find the class name or the tags needed to get the contained text.
Any help or reasoning as to why this happens would be appreciated.
P.S: Relatively new to scraping and HTML
UPDATE: Adding the link to the final page here.
https://www.fundfire.com/c/2258443/277443?referrer_module=searchSubFromFF&highlight=eileen%20neill%20verus
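For what it's worth, the \u003c/\u003e sequences are just < and > escaped for embedding markup inside a JavaScript/JSON string, which usually means the article body is rendered client-side and requests only sees the raw template. A minimal sketch of one way to check, assuming the escaped markup in the page source actually contains the article text (the class name is the placeholder from the question):
import requests
from bs4 import BeautifulSoup

url = "https://www.fundfire.com/c/2258443/277443"
page = requests.get(url)
# Undo the JavaScript/JSON string escaping so the embedded markup
# becomes ordinary HTML again, then parse that instead of page.text.
unescaped = page.text.replace("\\u003c", "<").replace("\\u003e", ">")
soup = BeautifulSoup(unescaped, "html.parser")
# Placeholder class name from the question.
for content in soup.find_all("div", {"class": "name_of_class_with_contained_text"}):
    print(content.get_text())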

Related

Python Web Scraper - Trying to get program to scrape data in one specific location not the whole page

I've browsed around the web and read and watched several guides on how to solve my issue, but I'm stuck and am hoping for some input.
I'm trying to build a web scraper that will scrape the M&A deals section from Reuters, and I have successfully managed to write a program that can scrape the headline, summary, date, and link for each article. However, the issue I'm trying to resolve is that I want the program to scrape only the headlines/articles with a summary, which are located directly underneath the Mergers and Acquisitions column. The current program scrapes ALL of the headlines it sees denoted with the tag "article" and class "story", and as a result it scrapes headlines not only from the Mergers and Acquisitions column but from the Market News column as well.
I kept getting AttributeErrors once the bot began scraping the headlines from the Market News column, since those headlines don't have any summaries and thus no text to pull, causing my code to terminate. I attempted to fix this with try/except logic, thinking it would skip the headlines from the Market News column, but the code kept pulling them.
I've tried writing a new line of code that tells the program, instead of looking for all "article" tags, to look only under a more specific parent tag, thinking that if I gave the bot a more direct path to follow it would scrape articles using a top-down approach. However, this failed and now my head just hurts. Thank you all in advance!
Here's my code so far below:
from bs4 import BeautifulSoup
import requests

website = 'https://www.reuters.com/finance/deals/mergers'
source = requests.get(website).text
soup = BeautifulSoup(source, 'lxml')

for article in soup.find_all('article'):
    headline = article.div.a.h3.text.strip()
    # threw in strip() to fix the issue of a bunch of space being printed before the headline title
    print(headline + "\n")
    date = article.find("span", class_='timestamp').text
    print(date)
    try:  # Put in try/except logic to keep the code going
        summary = article.find("div", class_="story-content").p.text
        print(summary + "\n")
        link = article.find('div', class_='story-content').a['href']
        # this bit ['href'] is the syntax needed for me to pull out the URL from the html code
        origin = "https://www.reuters.com/finance/deals/mergers"
        print(origin + link + "\n")
    except Exception as e:
        summary = None
        link = None

# This section here is another part I'm working on to get the scraper to go to
# the next page and continue scraping for headlines, dates, summaries, and links
next_page = soup.find('a', class_='control-nav-next')["href"]
source = requests.get(website + next_page).text
soup = BeautifulSoup(source, 'lxml')
You only need to change this line:
for article in soup.select('div[class="column1 col col-10"] article'):
With this syntax, .select() finds all the article tags beneath <div class="column1 col col-10">, which contains the headlines you are interested in, and not the others.
Here is the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html?highlight=select#css-selectors
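For completeness, a minimal sketch of that one-line change in context, assuming the Reuters markup from the question (the loop body is trimmed to just the headline):
import requests
from bs4 import BeautifulSoup

source = requests.get('https://www.reuters.com/finance/deals/mergers').text
soup = BeautifulSoup(source, 'lxml')
# Only look at articles inside the Mergers & Acquisitions column.
for article in soup.select('div[class="column1 col col-10"] article'):
    headline = article.div.a.h3.text.strip()
    print(headline)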

Crawl a webpage which is generated by Javascript

I want to crawl the data from this website
I only need the text "Pictograph - A spoon 勺 with something 一 in it"
I checked Network -> Doc and I think the information is hidden here.
Because I found there's a line:
i.length > 0 && (r += '<span>» Formation: <\/span>' + i + _Eb)
And I think this code generates the part of the page that we can see from the link.
However, I don't know what this code is. It has HTML, but it also contains so many function() calls.
Update
If the code is JavaScript, I would like to know how I can crawl the website without using Selenium?
Thanks!
This page uses JavaScript to add this element. Using Selenium I can get the HTML after this element is added, and then I can search for the text in that HTML. This HTML has a strange construction - all the text is in one tag, so this part has no special tag to find it by. But it is the last text in this tag and it starts after "Formation:", so I use BeautifulSoup to get all the text with all subtags using get_text(), and then I can use split('Formation:') to get the text after this element.
import selenium.webdriver
from bs4 import BeautifulSoup as BS
driver = selenium.webdriver.Firefox()
driver.get('https://www.archchinese.com/chinese_english_dictionary.html?find=%E4%B8%8E')
soup = BS(driver.page_source, 'html.parser')
text = soup.find('div', {'id': "charDef"}).get_text()
text = text.split('Formation:')[-1]
print(text.strip())
Selenium may run slower, but it was faster to create a solution this way.
If I could find the URL that JavaScript uses to load the data, then I would use it without Selenium, but I didn't see this information in the XHR responses. A few responses were compressed (probably gzip) or encoded, and maybe this text was in there, but I didn't try to uncompress/decode them.
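A possible refinement, not part of the original answer: if JavaScript inserts the element some time after the page loads, driver.page_source can be read too early. A minimal sketch that waits for the element first, reusing the URL and id from the code above (the 15-second timeout is an arbitrary choice):
import selenium.webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup as BS

driver = selenium.webdriver.Firefox()
driver.get('https://www.archchinese.com/chinese_english_dictionary.html?find=%E4%B8%8E')
# Wait up to 15 seconds until JavaScript has inserted the element.
WebDriverWait(driver, 15).until(lambda d: d.find_element_by_id('charDef'))
soup = BS(driver.page_source, 'html.parser')
text = soup.find('div', {'id': 'charDef'}).get_text()
print(text.split('Formation:')[-1].strip())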

BeautifulSoup: Selector not extracting the right data - Yahoo Scrape

I'm trying to extract the text from an element whose class value contains compText. The problem is that it extracts everything but the text that I want.
The CSS selector identifies the element correctly when I use it in the developer tools.
I'm trying to scrape the text that appears in Yahoo SERP when the query entered doesn't have results.
If my query is (quotes included) "klsf gl glkjgsdn lkgsdg" nothing is displayed expect the complementary text "We did not find results blabla" and the Selector extract the data correctly
If my query is (quotes included) "based specialty. Blocks. Organosilicone. Reference". Yahoo will add ads because of the keyword "Organosilicone" and that triggers the behavior described in the first paragraph.
Here is the code:
import requests
from bs4 import BeautifulSoup

url = "http://search.yahoo.com/search?p="
query = '"based specialty chemicals. Blocks. Organosilicone. Reference"'
r = requests.get(url + query)
soup = BeautifulSoup(r.text, "html.parser")
for EachPart in soup.select('div[class*="compText"]'):
    print(EachPart.text)
What could be wrong?
Thx,
EDIT: The text extracted seems to be the definition of the word "Organosilicone", which I can find on the SERP.
EDIT2: This is a snippet of the text I get: "The products created and produced by ‘Specialty Chemicals’ member companies, many of which are Small and Medium Enterprises, stem from original and continuous innovation. They drive the low-carbon, resource-efficient and knowledge based economy of the future." and a screenshot of the SERP when I use my browser
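One possibility worth ruling out (an assumption, since the question doesn't show the raw response): Yahoo may serve different markup to requests' default client than to a real browser, so the compText element visible in the developer tools may simply not be in r.text. A minimal sketch that sends a browser-like User-Agent and lets requests URL-encode the quoted query:
import requests
from bs4 import BeautifulSoup

url = "https://search.yahoo.com/search"
query = '"based specialty chemicals. Blocks. Organosilicone. Reference"'
# Example browser-like header; the exact value is an assumption.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
# params lets requests handle the URL encoding of the quoted query.
r = requests.get(url, params={"p": query}, headers=headers)
soup = BeautifulSoup(r.text, "html.parser")
for EachPart in soup.select('div[class*="compText"]'):
    print(EachPart.text)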

List links of xls files using Beautifulsoup

I'm trying to retrieve a list of downloadable xls files on a website.
I'm a bit reluctant to provide full links to the website in question.
Hopefully I'm able to provide all necessary details all the same.
If this is useless, please let me know.
Download .xls files from a webpage using Python and BeautifulSoup is a very similar question, but the details below will show that the solution most likely has to be different, since the links on that particular site are tagged with an href attribute:
And the ones I'm trying to get are not tagged the same way.
On the webpage, the files that are available for downloading are listed like this:
A simple mouseover gives these further details:
I'm following the setup here with a few changes to produce the snippet below, which provides a list of some links, but not to any of the xls files:
from bs4 import BeautifulSoup
import urllib.request
import re

def getLinks(url):
    with urllib.request.urlopen(url) as response:
        html = response.read()
    soup = BeautifulSoup(html, "lxml")
    links = []
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    return links
links1 = getLinks("https://SOMEWEBSITE")
A further inspection using Ctrl+Shift+I in Google Chrome reveals that those particular links do not have an href attribute, but rather an ng-href attribute:
So I tried changing that in the snippet above, but with no success.
And I've tried different combinations with re.compile("^https://"), attrs={'ng-href': ...} and links.append(link.get('ng-href')), but still with no success.
So I'm hoping someone has a better suggestion!
EDIT - Further details
It seems it's a bit problematic to read these links directly.
When I use Ctrl+Shift+I and the "Select an element in the page to inspect it" tool (Ctrl+Shift+C), this is what I can see when I hover over one of the links listed above:
And what I'm looking to extract here is the information associated with the ng-href attribute. But if I right-click the page and select Show Source, the same attribute only appears once, along with some metadata(?):
And I guess this is why my rather basic approach is failing in the first place.
I'm hoping this makes sense to some of you.
Update:
Using Selenium:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('http://.....')

# wait max 15 seconds until the links appear
xls_links = WebDriverWait(driver, 15).until(lambda d: d.find_elements_by_xpath('//a[contains(@ng-href, ".xls")]'))
# Or
# xls_links = WebDriverWait(driver, 15).until(lambda d: d.find_elements_by_xpath('//a[contains(@href, ".xls")]'))

links = []
for link in xls_links:
    url = "https://SOMEWEBSITE" + link.get_attribute('ng-href')
    print(url)
    links.append(url)
Assuming ng-href is not dynamically generated: from your last image I see that the URL does not start with https:// but with a slash /, so you can try a regex matching URLs that contain .xls:
for link in soup.findAll('a', attrs={'ng-href': re.compile(r"\.xls")}):
    xls_link = "https://SOMEWEBSITE" + link['ng-href']
    print(xls_link)
    links.append(xls_link)
My guess is that the data you are trying to crawl is created dynamically: ng-href is one of AngularJS's constructs. You could try using Google Chrome's network inspection as you already did (Ctrl+Shift+I) and see if you can find the URL that is queried (open the Network tab and reload the page). The query should typically return JSON with the links to the xls files.
There is a thread about a similar problem here. Perhaps that helps you: Unable to crawl some href in a webpage using python and beautifulsoup
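If such a request does show up in the Network tab, the whole thing can be done without Selenium. A minimal sketch, where the endpoint path and the "files"/"url" keys are purely hypothetical placeholders for whatever the real JSON contains:
import requests

# Hypothetical endpoint copied from the browser's Network tab.
api_url = "https://SOMEWEBSITE/api/reports"
data = requests.get(api_url).json()

links = []
for item in data["files"]:            # "files" is a hypothetical key
    if item["url"].endswith(".xls"):  # "url" is a hypothetical key
        links.append("https://SOMEWEBSITE" + item["url"])
print(links)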

Web scraping using Beautiful Soup separating HTML and Javascript and CSS

I am trying to scrape a web page which comprises JavaScript, CSS and HTML. This web page also has some text. When I open the web page using the file handler and run the soup.get_text() command, I would like to view only the HTML text and nothing else. Is it possible to do this?
The current source code is:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("/home/Desktop/try.html"))
print soup.get_text()
What do I change to get only the HTML portion in a web page and nothing else?
Try to remove the contents of the tags that hold the unwanted text (or style attributes).
Here is some code (tested in basic cases):
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("/home/Desktop/try.html"))

# Clear every script tag
for tag in soup.find_all('script'):
    tag.clear()

# Clear every style tag
for tag in soup.find_all('style'):
    tag.clear()

# Remove style attributes (if needed)
for tag in soup.find_all(style=True):
    del tag['style']

print soup.get_text()
It depends on what you mean by "get". Dmralev's answer will clear the other tags, which will work fine. However, <HTML> is a tag within the soup, so
print soup.html.get_text()
should also work, with fewer lines, assuming "portion" means that the HTML is separate from the rest of the code (i.e. the other code is not within <HTML> tags).
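One side note on the first answer's approach: clear() empties a tag but leaves it in the tree, while decompose() removes the tag entirely; for get_text() the visible result is the same. A minimal sketch of the decompose() variant (Python 2 print kept to match the question):
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("/home/Desktop/try.html"))

# decompose() removes the script and style tags from the tree entirely.
for tag in soup.find_all(['script', 'style']):
    tag.decompose()

print soup.get_text()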
