BeautifulSoup: Selector not extracting the right data - Yahoo Scrape - python

I'm trying to extract the text from an element whose class value contains compText. The problem is that it extracts everything but the text that I want.
The CSS selector identifies the element correctly when I use it in the developer tools.
I'm trying to scrape the text that appears in Yahoo SERP when the query entered doesn't have results.
If my query is (quotes included) "klsf gl glkjgsdn lkgsdg", nothing is displayed except the complementary text "We did not find results blabla", and the selector extracts the data correctly.
If my query is (quotes included) "based specialty. Blocks. Organosilicone. Reference", Yahoo adds ads because of the keyword "Organosilicone", and that triggers the behavior described in the first paragraph.
Here is the code:
import requests
from bs4 import BeautifulSoup

url = "https://search.yahoo.com/search"
query = '"based specialty chemicals. Blocks. Organosilicone. Reference"'
# Let requests handle URL-encoding of the quoted query
r = requests.get(url, params={"p": query})
soup = BeautifulSoup(r.text, "html.parser")

# Match every div whose class attribute contains "compText"
for EachPart in soup.select('div[class*="compText"]'):
    print(EachPart.text)
What could be wrong?
Thx,
EDIT: The text extracted seems to be the definition of the word "Organosilicone", which I can find on the SERP.
EDIT2: This is a snippet of the text I get: "The products created and produced by ‘Specialty Chemicals’ member companies, many of which are Small and Medium Enterprises, stem from original and continuous innovation. They drive the low-carbon, resource-efficient and knowledge based economy of the future." I also attached a screenshot of the SERP as it appears in my browser.
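One way to narrow the match, as a minimal sketch: div[class*="compText"] is a substring match, so it can also pick up ad and dictionary modules whose class names contain compText. Filtering the matches by the text you actually expect sidesteps that; the exact wording of the "no results" message used below is an assumption, so adjust it after checking the page:

import requests
from bs4 import BeautifulSoup

url = "https://search.yahoo.com/search"
query = '"based specialty chemicals. Blocks. Organosilicone. Reference"'
r = requests.get(url, params={"p": query})
soup = BeautifulSoup(r.text, "html.parser")

for part in soup.select('div[class*="compText"]'):
    # Keep only the module carrying the "no results" message;
    # the wording below is an assumption
    if "did not find results" in part.text:
        print(part.text)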

Unable to extract text from website containing a filter

I'm trying to get all the locations out of the following website (www.mars.com/locations) using Python, with Requests and BeautifulSoup.
The website has a filter to select continent, country and region, so that it will display only the locations the company has in the selected area. They also include their headquarters at the bottom of the page, and this information is always there regardless of the filter applied.
I have no problem extracting the data for the headquarters using the code below:
import requests
from bs4 import BeautifulSoup
url = 'https://www.mars.com/locations'
page = requests.get(url)
soup = BeautifulSoup(page.text,'html.parser')
HQ = soup.find('div', class_='global-headquarter pr-5 pl-3').text.strip()
print(HQ)
The output of the code is:
Mars, Incorporated (Global Headquarters)
6885 Elm Street
McLean
Virginia
22101
+1(703) 821-4900
I want to do the same for all other locations, but I'm struggling to extract the data using the same approach (adjusting the path, of course). I've tried everything and I'm out of ideas. Would really appreciate someone giving me a hand or at least pointing me in the right direction.
Thanks a lot in advance!
All of the location data can be retrieved as text: it is embedded in the page in a data-location attribute. Pulling that attribute out as a string is one way to do it. I'm not an expert in this field, so I can't help you any more than this:
# The locations are embedded as an attribute on the container div
content_json = soup.find('div', class_='location-container')
data = content_json['data-location']
I'm not an expert in BeautifulSoup, so I'll use parsel to get the data. All the locations are embedded in elements with the location-container CSS class, each carrying a data-location attribute.
import requests
from parsel import Selector

url = 'https://www.mars.com/locations'
response = requests.get(url).text
selector = Selector(text=response)
# Collect the data-location attribute from every matching element
data = selector.css(".location-container").xpath("./@data-location").getall()
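Whichever route you take, the attribute value is just a string. The variable name content_json in the first snippet suggests the data-location attribute holds JSON, but that is an assumption worth verifying; if it does, you can parse each entry into Python objects:

import json

# Assumes each data-location attribute value is a JSON document
for raw in data:
    location = json.loads(raw)
    print(location)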

Python, extract text from webpage

I am working on a project where I am crawling thousands of websites to extract text data, the end use case is natural language processing.
EDIT: Since I am crawling hundreds of thousands of websites, I cannot tailor the scraping code to each one, which means I cannot search for specific element ids; the solution I am looking for is a general one.
I am aware of solutions such as the .get_text() function from Beautiful Soup. The issue with this method is that it gets all the text from the website, much of it irrelevant to the main topic of that particular page. For the most part a page will be dedicated to a single main topic, but along the sides, top, and bottom there may be links or text about other subjects, promotions, or other content.
The .get_text() function returns all the text on the page in one go, combining the relevant parts with the irrelevant ones. Is there another function similar to .get_text() that returns all the text as a list, where each list item is a specific section of the text? That way it would be possible to know where new subjects start and end.
As a bonus, is there a way to identify the main body of text on a web page?
Below are snippets that you can use to query the data in the desired way using BeautifulSoup4 and Python 3:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://yoursite/page')
soup = BeautifulSoup(response.text, 'html.parser')

# Print the body's children in list form
print(soup.body.contents)
# Print the first div found on the html page
print(soup.find('div'))
# Print all divs on the html page in list form
print(soup.find_all('div'))
# Print the element whose id is 'required_element_id'
print(soup.find(id='required_element_id'))
# Print all html elements matching a CSS selector, in list form
print(soup.select('.your-css-selector'))
# Print an attribute's value
print(soup.find(id='someid').get("attribute-name"))

# You can also break one large query into multiple queries
parent = soup.find(id='someid')
# getText() returns the text between the opening and closing tags
print(parent.select(".some-class")[0].getText())
For more advanced requirements, you can check out Scrapy as well. Let me know if you face any challenges implementing this or if your requirement is something else.
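For the bonus question: there is no built-in way in BeautifulSoup to identify the main body of a page, but a rough heuristic, sketched below under the assumption that the main body is the container holding the most paragraph text, is to score each candidate container by the text in its direct <p> children and keep the winner. It will not work on every site:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://yoursite/page')
soup = BeautifulSoup(response.text, 'html.parser')

# Score each container by the text length of its direct <p> children,
# so deeply nested wrapper divs don't dominate
best, best_len = None, 0
for container in soup.find_all(['article', 'section', 'div']):
    text_len = sum(len(p.get_text(strip=True))
                   for p in container.find_all('p', recursive=False))
    if text_len > best_len:
        best, best_len = container, text_len

if best is not None:
    # Each paragraph comes back as its own list item,
    # so you can see where sections start and end
    paragraphs = [p.get_text(strip=True) for p in best.find_all('p', recursive=False)]
    print(paragraphs)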

Python Web Scraper - Trying to get program to scrape data in one specific location not the whole page

I've browsed around the web and read and watched several guides online as to how to solve my issue but I'm stuck and am hoping for some input.
I'm trying to build a web scraper that will scrape the M&A deals section from Reuters and have successfully managed to write a program that can scrape the headline, summary, date, and link for the article. However the issue that I'm trying to resolve is that I want the program to scrape from only the headlines/articles with a summary, which are located directly underneath the Mergers and Acquisitions column. The current program is scraping ALL of the headlines it sees denoted with the tag "article" and attribute/class "story", and thus as a result is not only scraping headlines from the Mergers and Acquisitions column but also the Market News column as well.
I kept getting AttributeErrors once the bot began scraping headlines from the Market News column, since those entries don't have any summaries and thus no text to pull, causing my code to terminate. I've attempted to fix this with try/except logic, thinking it would skip the headlines from the Market News column, but the code kept pulling them.
I've also tried writing a new line of code that tells the program, instead of looking for all article tags, to look only for tags under a more specific parent element, thinking that if I gave the bot a more direct path to follow it would scrape articles with a top-down approach. However, this failed and now my head just hurts. Thank you all in advance!
Here's my code so far below:
from bs4 import BeautifulSoup
import requests

website = 'https://www.reuters.com/finance/deals/mergers'
source = requests.get(website).text
soup = BeautifulSoup(source, 'lxml')

for article in soup.find_all('article'):
    # Threw in strip() to fix the issue of a bunch of space
    # being printed before the headline title
    headline = article.div.a.h3.text.strip()
    print(headline + "\n")

    date = article.find("span", class_='timestamp').text
    print(date)

    try:  # Put in Try/Except logic to keep the code going
        summary = article.find("div", class_="story-content").p.text
        print(summary + "\n")

        # This bit ['href'] is the syntax needed for me to pull
        # out the URL from the html code
        link = article.find('div', class_='story-content').a['href']
        origin = "https://www.reuters.com/finance/deals/mergers"
        print(origin + link + "\n")
    except Exception as e:
        summary = None
        link = None

# This section here is another part I'm working on to get the scraper to go to
# the next page and continue scraping for headlines, dates, summaries, and links
next_page = soup.find('a', class_='control-nav-next')["href"]
source = requests.get(website + next_page).text
soup = BeautifulSoup(source, 'lxml')
Only change this line:
for article in soup.select('div[class="column1 col col-10"] article'):
With this syntax, .select() finds all the article tags beneath <div class="column1 col col-10">, which contains only the headlines you are interested in, and not the others.
Here is the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html?highlight=select#css-selectors
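For context, a minimal sketch of how the changed line fits into the loop from the question (everything inside the loop stays the same):

# Only articles under the Mergers and Acquisitions column are matched now
for article in soup.select('div[class="column1 col col-10"] article'):
    headline = article.div.a.h3.text.strip()
    print(headline + "\n")
    # ...the rest of the loop body from the question is unchanged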

Why does HTML source code change while extracting text from soup object?

I am trying to scrape news articles from the results of a search term using Selenium and BeautifulSoup on Python. I have arrived at the final page which contains the text using:
article_page = requests.get(articles.link_of_article[0])
article_soup = BeautifulSoup(article_page.text, "html.parser")
for content in article_soup.find_all('div', {"class": "name_of_class_with_contained_text"}):
    print(content.get_text())
I notice that "name_of_class_with_contained_text" is present when I visually inspect the source code in the browser but the class is not present in the soup object. Also, all the "p" tags are replaced with the following code "\\u003c/p\\u003e\\u003cp\\u003e \\u003c/p\\u003e\\u003cp\\u003e".
I am unable to find the class name or tags to get the text contained.
Any help or reasoning as to why this happens would be appreciated.
P.S: Relatively new to scraping and HTML
UPDATE: Adding the link to the final page here.
https://www.fundfire.com/c/2258443/277443?referrer_module=searchSubFromFF&highlight=eileen%20neill%20verus
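A possible explanation, offered as an assumption rather than a confirmed diagnosis: \u003c and \u003e are the characters < and > escaped inside a JavaScript/JSON string. That pattern suggests the article body is shipped to the browser embedded in a script and rendered client-side, so requests only ever sees the escaped form while the browser's inspector shows the rendered DOM. If you can locate the escaped fragment in the raw source, a minimal sketch for decoding it:

import json
from bs4 import BeautifulSoup

# Hypothetical escaped fragment as it might appear in the raw page source
escaped = "\\u003cp\\u003eSome article text\\u003c/p\\u003e"

# Wrapping the fragment in quotes lets json.loads decode the \uXXXX escapes
decoded = json.loads('"' + escaped + '"')
print(decoded)  # <p>Some article text</p>

# The decoded markup can then be parsed as usual
print(BeautifulSoup(decoded, "html.parser").get_text())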

Python - Unable to retrieve complete text data for 1 or more pages

I'm a newbie in Python Programming and I am facing following issue:
Objective: I need to scrape the Freelancer website and store the list of the users, along with their attributes (score, ratings, reviews, details, rate, etc.),
into a file. I have the following code, but I am not able to get all the users.
Also, sometimes when I run the program, the output changes.
import requests
from bs4 import BeautifulSoup

pages = 1
fileWriter = open('freelancers.txt', 'w')

url = 'https://www.freelancer.com/freelancers/skills/all/' + str(pages) + '/'
r = requests.get(url)
# Gets the html contents and stores them in a soup object
soup = BeautifulSoup(r.content, "html.parser")

links = soup.findAll("a")
# Finds the freelancer-details nodes and stores the html content in c_data
c_data = soup.findAll("div", {"class": "freelancer-details"})
for item in c_data:
    print(item.text)
    # Writes the result into the text file
    fileWriter.write('Freelancers Details:' + item.text + '\t')
fileWriter.close()
I need to get the user details under specific users. But so far, the output looks dispersed.
Sample Output:
Freelancers Details:
thetechie13
507 Reviews
$20 USD/hr
Top Skills:
Website Design,
HTML,
PHP,
eCommerce,
Volusion
Dear Customer - We are a team of 75 Most Creative People and proud to be
Preferred Freelancer on Freelancer.com. We offer wide range of web
solutions and IT services that are bespoke in nature, can best fit our
clients' business needs and provide them cost benefits.
If you want each individual text component on its own (each assigned a different name), I would advise you to parse the text from the HTML separately. However, if you want it all grouped together, you could join the strings:
print(' '.join(item.text.split()))
This will place a single space between each word.
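If instead you want each component under its own name, a minimal sketch continuing from the question's code; the child class names used here (freelancer-name, user-reviews) are hypothetical placeholders, so inspect the real markup in your browser first:

for item in c_data:
    # These class names are hypothetical; check the page's actual markup
    name = item.find("div", {"class": "freelancer-name"})
    reviews = item.find("span", {"class": "user-reviews"})
    if name:
        fileWriter.write('Name:' + name.text.strip() + '\t')
    if reviews:
        fileWriter.write('Reviews:' + reviews.text.strip() + '\t')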
