Find a word using BeautifulSoup - python

I want to extract ads that contain either of two Persian words, "توافق" or "توافقی", from a website. I am using BeautifulSoup and splitting the content in the soup to find the ads that contain my special words, but my code does not work. Could you please help me?
Here is my simple code:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://divar.ir/s/tehran")
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all("div", attrs={"class": "kt-post-card__body"})
for content in results:
    words = content.split()
    if words == "توافقی" or words == "توافق":
        print(content)

Since توافقی appears in the div tags with the kt-post-card__description class, I will use that. Then you can get the ads by using the tag's properties like .previous_sibling or .parent.
import requests
from bs4 import BeautifulSoup

r = requests.get("https://divar.ir/s/tehran")
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all("div", attrs={"class": "kt-post-card__description"})
for content in results:
    text = content.text
    if "توافقی" in text or "توافق" in text:
        print(content.previous_sibling)  # It's the h2 title.
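One caveat worth noting: in pretty-printed HTML, .previous_sibling often returns the whitespace text node between tags rather than the h2 itself, while find_previous_sibling("h2") skips those. A minimal sketch on inline HTML (the class names are assumed to mirror the live page):

```python
from bs4 import BeautifulSoup

# Inline stand-in for one ad card; class names assumed from the live page.
html = """
<div class="kt-post-card__body">
  <h2 class="kt-post-card__title">Sample ad title</h2>
  <div class="kt-post-card__description">قیمت توافقی</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
desc = soup.find("div", attrs={"class": "kt-post-card__description"})
# .previous_sibling here is a whitespace text node, not the h2:
print(repr(desc.previous_sibling))
# find_previous_sibling skips whitespace-only text nodes:
print(desc.find_previous_sibling("h2").text)  # Sample ad title
```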

Basically, you are trying to split a bs4 Tag object, which raises an error. Before splitting it, you need to convert it into a text string.
import requests
from bs4 import BeautifulSoup

r = requests.get("https://divar.ir/s/tehran")
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all("div", attrs={"class": "kt-post-card__description"})
for content in results:
    words = content.text.split()
    if "توافقی" in words or "توافق" in words:
        print(content)
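To see why the original code fails, here is a small self-contained check on inline HTML (no network needed): the Tag must be converted to a string before splitting, and a list of words is never equal to a single string, so membership testing with in is what's needed.

```python
from bs4 import BeautifulSoup

div = BeautifulSoup("<div>قیمت توافقی</div>", "html.parser").div
# Convert the Tag to a string first, then split into words:
words = div.text.split()
print("توافقی" in words)   # True: membership test on the word list
print(words == "توافقی")   # False: a list never equals a string
```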

There are different issues. The first one, also mentioned by @Tim Roberts, is that you have to test membership in the list of words with in:
if 'توافقی' in words or 'توافق' in words:
Second, you have to separate the texts from each of the child elements, so use get_text() with a separator:
words=content.get_text(' ', strip=True)
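The separator matters when text is split across child elements; a quick illustration on inline HTML:

```python
from bs4 import BeautifulSoup

div = BeautifulSoup("<div><span>قیمت</span><span>توافقی</span></div>",
                    "html.parser").div
print(div.text)                       # the two words run together
print(div.get_text(' ', strip=True))  # 'قیمت توافقی', properly separated
```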
Note: requests does not render dynamic content; it only fetches the static HTML.
Example
import requests
from bs4 import BeautifulSoup

r = requests.get('https://divar.ir/s/tehran')
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('div', attrs={'class': "kt-post-card__body"})
for content in results:
    words = content.get_text(' ', strip=True)
    if 'توافقی' in words or 'توافق' in words:
        print(content.text)
An alternative in this specific case could be the use of CSS selectors, so you could select the whole <article> and pick the elements you need:
results = soup.select('article:-soup-contains("توافقی"), article:-soup-contains("توافق")')
for item in results:
    print(item.h2)
    print(item.span)
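The :-soup-contains pseudo-class comes from the soupsieve package that ships with modern bs4; a small offline demonstration of the same idea:

```python
from bs4 import BeautifulSoup

html = """
<article><h2>Ad one</h2><span>توافقی</span></article>
<article><h2>Ad two</h2><span>1,000,000</span></article>
"""
soup = BeautifulSoup(html, "html.parser")
# Select only articles whose text contains the target word:
matches = soup.select('article:-soup-contains("توافقی")')
print([m.h2.text for m in matches])  # ['Ad one']
```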

Related

How to get specific text hyperlinks in the home webpage by BeautifulSoup?

I want to find every hyperlink whose text includes "article" on https://www.geeksforgeeks.org/, for example, at the bottom of that webpage:
Write an Article
Improve an Article
I want to get all those hyperlinks and print them, so I tried:
from bs4 import BeautifulSoup
import requests
import re

url = 'https://www.geeksforgeeks.org/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, "html.parser")
links = []
for link in soup.findAll('a', href=True):
    # print(link.get("href"))
    if re.search('/article$', link.get("href")):
        links.append(link.get("href"))
However, I get [] as the result. How can I solve this?
Here is something you can try:
Note that there are more links containing the text article on the page you provided, but this gives the idea of how you can deal with it.
In this case I just checked whether the word article is in the text of the tag. You could use a regex search there, but for this example it would be overkill.
import requests
from bs4 import BeautifulSoup

url = 'https://www.geeksforgeeks.org/'
res = requests.get(url)
if res.status_code != 200:
    print('request failed')
soup = BeautifulSoup(res.content, "html.parser")
links_with_article = soup.findAll(lambda tag: tag.name == "a" and "article" in tag.text.lower())
EDIT:
If you know that there is a word in the href, i.e. in the link itself:
soup.select("a[href*=article]")
this will search for the word article in the href attribute of all a elements.
Edit: to get only the hrefs:
hrefs = [link.get('href') for link in links_with_article]
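As a quick offline check of the attribute-substring selector (the links here are made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<a href="/write-an-article/">Write an Article</a>
<a href="/about/">About</a>
<a href="https://example.org/improve-article">Improve an Article</a>
"""
soup = BeautifulSoup(html, "html.parser")
# [href*=article] matches any <a> whose href contains the substring:
hrefs = [a['href'] for a in soup.select('a[href*=article]')]
print(hrefs)  # ['/write-an-article/', 'https://example.org/improve-article']
```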

Webscraping merriam-webster using beautifulsoup

I am using BeautifulSoup and trying to scrape only the first definition ("very cold") of a word from Merriam-Webster, but it scrapes the second line (a sentence) as well. This is my code.
P.S.: I want only the "very cold" part; "put on your jacket..." should not be included in the output. Please, someone help.
import requests
from bs4 import BeautifulSoup
url = "https://www.merriam-webster.com/dictionary/freezing"
r = requests.get(url)
soup = BeautifulSoup(r.content,"lxml")
definition = soup.find("span", {"class" : "dt"})
tag = definition.findChild()
print(tag.text)
Selecting by class is the second-fastest method of CSS selector matching. Using select_one returns only the first match, and next_sibling will take you to the node you want:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.merriam-webster.com/dictionary/freezing')
soup = bs(r.content, 'lxml')
print(soup.select_one('.mw_t_bc').next_sibling.strip())
The way Merriam-Webster structures their page is a little strange, but you can find the <strong> tag that precedes the definition, grab the next sibling, and strip out the whitespace like this:
>>> tag.find('strong').next_sibling.strip()
u'very cold'
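The same next_sibling pattern can be checked offline against a simplified stand-in for the markup (the real Merriam-Webster HTML is more complex and may have changed since these answers were written):

```python
from bs4 import BeautifulSoup

html = ('<span class="dt"><strong class="mw_t_bc">: </strong>'
        'very cold <span class="ex-sent">put on your jacket</span></span>')
soup = BeautifulSoup(html, "html.parser")
# The definition is the text node right after the bold colon marker:
print(soup.select_one('.mw_t_bc').next_sibling.strip())  # very cold
```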

Find a tag using text it contains using BeautifulSoup

I am trying to webscrape some parts of this page:
https://markets.businessinsider.com/stocks/bp-stock
using BeautifulSoup to search for some text contained in the h2 titles of tables.
When I do:
data_table = soup.find('h2', text=re.compile('RELATED STOCKS')).find_parent('div').find('table')
it correctly gets the table I am after.
When I try to get the table "Analyst Opinion" using a similar line, it returns None:
data_table = soup.find('h2', text=re.compile('ANALYST OPINIONS')).find_parent('div').find('table')
I am guessing that there might be some special characters in the HTML code that prevent re from working as expected.
I tried this too:
data_table = soup.find('h2', text=re.compile('.*?STOCK.*?INFORMATION.*?', re.DOTALL))
without success.
I would like to get the table that contains the text "Analyst Opinion" without searching all tables, by checking whether it contains my requested text.
Any idea will be highly appreciated.
Best
You can use a CSS selector to locate the <table>:
import requests
from bs4 import BeautifulSoup

url = 'https://markets.businessinsider.com/stocks/bp-stock'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
table = soup.select_one('div:has(> h2:contains("Analyst Opinions")) table')
for tr in table.select('tr'):
    print(tr.get_text(strip=True, separator=' '))
Prints:
2/26/2018 BP Outperform RBC Capital Markets
9/22/2017 BP Outperform BMO Capital Markets
More about CSS selectors here.
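Note that newer soupsieve versions spell :contains as :-soup-contains (the old name still works but is deprecated). The selector logic can be verified offline on inline HTML:

```python
from bs4 import BeautifulSoup

html = ('<div><h2>Analyst Opinions</h2><table><tr><td>Outperform</td>'
        '<td>RBC</td></tr></table></div>'
        '<div><h2>Related Stocks</h2><table><tr><td>Other</td></tr></table></div>')
soup = BeautifulSoup(html, 'html.parser')
# div:has(> h2:...) picks the div whose direct-child h2 has the text:
table = soup.select_one('div:has(> h2:-soup-contains("Analyst Opinions")) table')
print(table.tr.get_text(strip=True, separator=' '))  # Outperform RBC
```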
EDIT: For a case-insensitive method, you can use the bs4 API with regular expressions (note the flags=re.I). This is the equivalent of the .select() method above:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://markets.businessinsider.com/stocks/bp-stock'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
h2 = soup.find(lambda t: t.name == 'h2' and re.findall('analyst opinions', t.text, flags=re.I))
table = h2.find_parent('div').find('table')
for tr in table.select('tr'):
    print(tr.get_text(strip=True, separator=' '))

Python BeautifulSoup Paragraph Text only

I am very new to anything webscraping-related, and as I understand it, Requests and BeautifulSoup are the way to go for that.
I want to write a program which emails me only one paragraph from a given link every couple of hours (trying a new way to read blogs through the day).
Say this particular link, 'https://fs.blog/mental-models/', has a paragraph each on different models.
from bs4 import BeautifulSoup
import re
import requests
url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
Now soup has a wall of markup before the paragraph text begins: <p> this is what I want to read </p>
soup.title.string works perfectly fine, but I don't know how to move ahead from here. Any directions?
Thanks
Loop over soup.findAll('p') to find all the p tags and then use .text to get their text.
Furthermore, do all that under the div with the class rte, since you don't want the footer paragraphs.
from bs4 import BeautifulSoup
import requests

url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
divTag = soup.find_all("div", {"class": "rte"})
for tag in divTag:
    pTags = tag.find_all('p')
    for p in pTags[:-2]:  # trim the last two irrelevant-looking lines
        print(p.text)
OUTPUT:
Mental models are how we understand the world. Not only do they shape what we think and how we understand but they shape the connections and opportunities that we see.
.
.
.
5. Mutually Assured Destruction
Somewhat paradoxically, the stronger two opponents become, the less likely they may be to destroy one another. This process of mutually assured destruction occurs not just in warfare, as with the development of global nuclear warheads, but also in business, as with the avoidance of destructive price wars between competitors. However, in a fat-tailed world, it is also possible that mutually assured destruction scenarios simply make destruction more severe in the event of a mistake (pushing destruction into the “tails” of the distribution).
 
If you want the text of all the p tags, you can just loop over them using the find_all method:
from bs4 import BeautifulSoup
import requests

url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
data = soup.find_all('p')
for p in data:
    text = p.get_text()
    print(text)
EDIT:
Here is the code to collect them separately in a list. You can then loop over the result list to remove empty strings, unused characters like \n, etc.
from bs4 import BeautifulSoup
import requests

url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
data = soup.find_all('p')
result = []
for p in data:
    result.append(p.get_text())
print(result)
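The cleanup pass mentioned above might look like this (the input list is purely illustrative):

```python
# Hypothetical scraped output: empty strings and stray newlines included.
result = ['Mental models are how we\nunderstand the world.', '', '  ',
          '5. Mutually Assured Destruction']
# Drop empty entries and collapse internal whitespace:
cleaned = [' '.join(p.split()) for p in result if p.strip()]
print(cleaned)
```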
Here is a solution using a scheduler:
from bs4 import BeautifulSoup
import requests
from kivy.clock import Clock  # assuming the Kivy scheduler was intended here

url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
data = soup.find_all('p')
result = []
for p in data:
    result.append(p.get_text())
# schedule_interval needs a callable; passing print(result) directly
# would call print once and schedule None.
Clock.schedule_interval(lambda dt: print(result), 60)
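If Kivy is not already part of the project, a stdlib-only alternative for the "one paragraph per interval" idea is threading.Timer. The deliver helper below is hypothetical, not taken from any answer above:

```python
import threading

def deliver(paragraphs, interval, send=print):
    """Send one paragraph now, then reschedule for the rest."""
    if not paragraphs:
        return
    send(paragraphs[0])
    timer = threading.Timer(interval, deliver,
                            args=(paragraphs[1:], interval, send))
    timer.daemon = True  # don't keep the interpreter alive for pending sends
    timer.start()

# Short interval just for demonstration; use e.g. 2 * 60 * 60 for hours.
deliver(["First mental model...", "Second mental model..."], interval=0.01)
```

Replacing send=print with an emailing function (e.g. one built on smtplib) would give the behavior the question describes.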

Python: BeautifulSoup extract string between div tag by its class

import urllib, urllib2
from bs4 import BeautifulSoup, Comment
url='http://www.amazon.in/product-reviews/B00CE2LUKQ/ref=cm_cr_pr_top_link_1?ie=UTF8&showViewpoints=0&sortBy=bySubmissionDateDescending'
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content, "html.parser")
rows =soup.find_all('div',attrs={"class" : "reviewText"})
print rows
This code is used to extract the reviews from the website, but I get them with the div tags included.
I need help extracting just the text between the div tags with that class.
for row in soup.find_all('div', attrs={"class": "reviewText"}):
    print row.text
or:
[row.text for row in rows]
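The snippets above are Python 2 (urllib2, print statements). For reference, the same extraction step in Python 3, run here on inline HTML instead of the live Amazon page:

```python
from bs4 import BeautifulSoup

html = ('<div class="reviewText">Great phone.</div>'
        '<div class="reviewText">Battery is weak.</div>')
soup = BeautifulSoup(html, "html.parser")
texts = [row.text for row in soup.find_all('div', attrs={"class": "reviewText"})]
print(texts)  # ['Great phone.', 'Battery is weak.']
```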
