I am trying to get all the links on a given website but am stuck with some problems involving HTML entities. Here's my code that crawls websites using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
...
baseRequest = requests.get("https://www.example.com", SOME_HEADER_SETTINGS)
soup = BeautifulSoup(baseRequest.content, "html.parser")
pageLinks = []
for anchor in soup.findAll("a"):
    pageLinks.append(anchor["href"])
...
print(pageLinks)
The code becomes problematic when it sees this kind of element:
<a href="./page?id=123&sect=2">Link</a>
Instead of printing ["./page?id=123&sect=2"], it treats the &sect part as an HTML entity and shows this in the console:
["./page?id=123§=2"]
Is there a way to prevent this?
Here is one approach:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="./page?id=123&sect=2">Link</a>', "html.parser")
pageLinks = []
for anchor in soup.findAll("a"):
    pageLinks.append(anchor["href"])

# raw UTF-8 bytes of the extracted href
uncoded = ''.join(pageLinks).encode('utf-8')
# rebuild the string character by character
decoded = ''.join(map(lambda x: chr(ord(x)), ''.join(pageLinks)))
print('uncoded =', uncoded)
print('decoded =', decoded)
Output:
uncoded = b'./page?id=123\xc2\xa7=2'
decoded = ./page?id=123§=2
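For completeness: the substitution happens because a bare "&" in HTML is supposed to start an entity. If you control the markup (or can pre-escape it), writing the ampersand as &amp;amp; makes the parser produce exactly the URL you expect. A minimal sketch:

```python
from bs4 import BeautifulSoup

# In valid HTML, a literal "&" in an attribute should be escaped as "&amp;",
# so the parser unescapes it back to "&" instead of eating "&sect" as an entity.
soup = BeautifulSoup('<a href="./page?id=123&amp;sect=2">Link</a>', "html.parser")
print(soup.a["href"])  # ./page?id=123&sect=2
```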
I want to extract ads that contain one of two special Persian words, "توافق" or "توافقی", from a website. I am using BeautifulSoup and splitting the content in the soup to find the ads that have my special words, but my code does not work. Could you please help me?
Here is my simple code:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://divar.ir/s/tehran")
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all("div", attrs={"class": "kt-post-card__body"})
for content in results:
    words = content.split()
    if words == "توافقی" or words == "توافق":
        print(content)
Since توافقی appears in the div tags with the kt-post-card__description class, I will use that. Then you can get the ads by using the tag's properties like .previous_sibling or .parent or whatever...
import requests
from bs4 import BeautifulSoup
r = requests.get("https://divar.ir/s/tehran")
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all("div", attrs={"class": "kt-post-card__description"})
for content in results:
    text = content.text
    if "توافقی" in text or "توافق" in text:
        print(content.previous_sibling)  # It's the h2 title.
Basically, you are trying to split a bs4 Tag object, and hence it's giving an error. Before splitting it, you need to convert it into a text string.
import re
from bs4 import BeautifulSoup
import requests
r = requests.get("https://divar.ir/s/tehran")
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all("div", attrs={"class": "kt-post-card__description"})
for content in results:
    words = content.text.split()
    if "توافقی" in words or "توافق" in words:
        print(content)
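To see why the original comparison never matches: split() returns a list of words, so you have to test membership with in rather than comparing the whole list to a single string. A quick illustration:

```python
# split() turns the ad text into a list of words
words = "قیمت توافقی تماس".split()

# membership test: checks each element of the list
print("توافقی" in words)    # True

# comparing the whole list to one string can never be True
print(words == "توافقی")    # False
```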
There are different issues. First, as also mentioned by @Tim Roberts, you have to compare the list items with in:
if 'توافقی' in words or 'توافق' in words:
Second, you have to separate the texts from each of the child elements, so use get_text() with a separator:
words=content.get_text(' ', strip=True)
Note: requests does not render dynamic content; it only fetches the static HTML.
Example
import requests
from bs4 import BeautifulSoup

r = requests.get('https://divar.ir/s/tehran')
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('div', attrs={'class': "kt-post-card__body"})
for content in results:
    words = content.get_text(' ', strip=True)
    if 'توافقی' in words or 'توافق' in words:
        print(content.text)
An alternative in this specific case could be the use of CSS selectors: you could select the whole <article> and pick the elements you need:
results = soup.select('article:-soup-contains("توافقی"),article:-soup-contains("توافق")')
for item in results:
    print(item.h2)
    print(item.span)
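For illustration, here is how :-soup-contains behaves on a minimal, made-up snippet mimicking the ad cards (the selector needs the soupsieve package, which ships with recent versions of bs4):

```python
from bs4 import BeautifulSoup

# hypothetical markup standing in for the real ad cards
html = (
    '<article><h2>Ad one</h2><span>توافقی</span></article>'
    '<article><h2>Ad two</h2><span>1000</span></article>'
)
soup = BeautifulSoup(html, "html.parser")

# :-soup-contains matches elements whose text contains the given substring
results = soup.select('article:-soup-contains("توافقی")')
for item in results:
    print(item.h2.text)  # Ad one
```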
When I run the code below, the result shows me the same sentence twice.
How can I solve this problem?
import requests
from bs4 import BeautifulSoup
res = requests.get("https://www.consumeraffairs.com/education/online-courses/coursera.html?page=2#scroll_to_reviews=true")
soup = BeautifulSoup(res.content, 'html.parser')
data = soup.find_all('p')
for item in data:
    print(item.string)
Changing your parser to lxml will fix this.
By inspecting the website, you will see there are empty p tags around the p tags of the reviews, and they are confusing the html parser.
The snippet below fixes this by changing your parser; then, when iterating over the data list, you can filter out the empty p tags by checking whether they have content in their string attribute.
import requests
from bs4 import BeautifulSoup
res = requests.get("https://www.consumeraffairs.com/education/online-courses/coursera.html?page=2#scroll_to_reviews=true")
soup = BeautifulSoup(res.content, "lxml")
data = soup.find_all('p')
for item in data:
    if item.string:
        print(item.string)
This will print for you the reviews only once!
If you get an error about not having a builder, run
pip install lxml
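As a side note, the .string filter works because .string is None both for empty tags and for tags with more than one child, which is exactly what the stray review wrappers look like. A small sketch:

```python
from bs4 import BeautifulSoup

# three p tags: one plain string child, one with mixed children, one empty
soup = BeautifulSoup("<p>plain text</p><p><b>bold</b> and more</p><p></p>", "html.parser")
ps = soup.find_all("p")
for p in ps:
    # .string is the tag's single child string, or None if the tag is
    # empty or has more than one child
    print(repr(p.string))
```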
Beautiful Soup returns parts of the HTML as escape sequences like "\u041f\u0440\u0438\u043b\u043e\u0436\".
How can this be solved?
Problem with page https://life.com.by/
import requests
from bs4 import BeautifulSoup
result = requests.get('https://life.com.by/company/news')
soup = BeautifulSoup(result.content, 'html.parser')
print(soup)
it returns
var engTranslations = {"test":"test","or":"or","main_page":"Main Page","consultant":"Consultant","android_description":"\u041f\u0440\u0438\u043b\u043e\u0436\u0435\u043d\u0438\u0435 \u0434\u043b\u044f\u00a0\u0441\u043c\u0430\u0440\u0442\u0444\u043e\u043d\u043e\u0432 \u0438\u00a0\u043f\u043b\u0430\u043d\u0448\u0435\u0442\u043e\u0432 \u043d\u0430\u00a0Android","apple_description":"\u041f\u0440\u0438\u043b\u043e\u0436\u0435\u043d\u0438\u0435 \u0434\u043b\u044f\u00a0\u0441\u043c\u0430\u0440\u0442\u0444\u043e\u043d\u043e\u0432 \u0438\u00a0\u043f\u043b\u0430\u043d\u0448\u0435\u0442\u043e\u0432 \u043d\u0430\u00a0iOS","tv_description":"\u041f\u0440\u0438\u043b\u043e\u0436\u0435\u043d\u0438\u0435 \u0434\u043b\u044f\u00a0Android\u00a0TV","notebook_description":"\u0421\u043c\u043e\u0442\u0440\u0435\u0442\u044c \u0432\u00a0\u0431\u0440\u0430\u0443\u0437\u0435\u0440\u0435","new_sim":"New SIM-card","add_traffic":"Add traffic","change_tariff":"\u0421\u043c\u0435\u043d\u0438\u0442\u044c \u0442\u0430\u0440\u0438\u0444\u043d\u044b\u0439\u00a0\u043f\u043b\u0430\u043d","tariff_changing":"\u0421\u043c\u0435\u043d\u0430 \u0442\u0430\u0440\u0438\u0444\u0430","for_a_day":"For a day","for_a_week":"For a week","for_a_month":"For a month","connect":"Activate","call_back_me":"\u041f\u0435\u0440\u0435\
If I just print part of it, it returns a human-readable string.
print('\u041f\u0440\u0438\u043b\u043e\u0436\u0435\u043d\u0438\u0435\u043f\u043e\u043c\u043e\u0436\u0435\u043c')
Приложение поможем
I don't exactly understand what you want, but from what I understood, try this:
soup.encode('utf-8').decode('unicode-escape')
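One caveat with unicode-escape: it operates on bytes and can mangle characters that are already UTF-8 encoded. Since the escaped text here sits inside a JSON-shaped blob (the var engTranslations object above), json.loads also decodes the \uXXXX sequences, and does so safely:

```python
import json

# a \uXXXX-escaped value as it appears inside the engTranslations object
raw = '"\\u041f\\u0440\\u0438\\u043b\\u043e\\u0436\\u0435\\u043d\\u0438\\u0435"'
print(json.loads(raw))  # Приложение
```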
I'm trying to scrape a website and I need to cut the HTML code into parts. The problem is that the HTML code is not really well organized and I can't just use findAll.
Here is my code to parse the HTML code :
resultats = requests.get(URL)
bs = BeautifulSoup(resultats.text, 'html.parser')
What I want to do is to divide bs for each <h2> I find :
The solution might be really simple but I can't find it...
edit : the website, here
This scrapes the whole text, without the HTML in it:
import urllib2, json, re
from bs4 import BeautifulSoup
url = "https://fr.wikipedia.org/wiki/Liste_de_sondages_sur_l'%C3%A9lection_pr%C3%A9sidentielle_fran%C3%A7aise_de_2017#Avril"
resultats = urllib2.urlopen(url)
html = resultats.read()
soup = BeautifulSoup(html, 'html5lib')
soup = soup.get_text() # Extracts Text from HTML
print soup
If you want to leave certain information out, you could add this:
soup = re.sub(re.compile('yourRegex', re.DOTALL), '', soup)\
.strip()
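If the goal is specifically to divide the document at each <h2>, one possible sketch (assuming bs4 and a flat structure where each <h2> and its content are siblings — adjust the markup and tag names for the real page) is:

```python
from bs4 import BeautifulSoup

# hypothetical flat markup where each <h2> heads a section
html = "<h2>A</h2><p>one</p><p>two</p><h2>B</h2><p>three</p>"
soup = BeautifulSoup(html, "html.parser")

sections = {}
for h2 in soup.find_all("h2"):
    parts = []
    for sib in h2.find_next_siblings():
        if sib.name == "h2":          # stop at the next heading
            break
        parts.append(sib.get_text())
    sections[h2.get_text()] = parts
print(sections)  # {'A': ['one', 'two'], 'B': ['three']}
```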
HTML noob here, so I could be misunderstanding something about the HTML document; bear with me.
I'm using Beautiful Soup to parse web data in Python. Here is my code:
import urllib
import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(page)
indicateGameDone = str(soup.find("div", {"class": "nbaModTopStatus"}))
print indicateGameDone
Now, if you look at the website, the HTML code has the line <p class="nbaLiveStatTxSm"> FINAL </p> (inspect the 'Final' text on the left side of the container on the first ATL-WAS game on the page to see it for yourself). But when I run the code above, it doesn't return the 'FINAL' that is seen on the webpage; instead, the nbaLiveStatTxSm class is empty.
On my machine, this is the output when I print indicateGameDone:
<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div>
Does anyone know why this is happening?
EDIT: clarification: the problem isn't retrieving the text within the tag; the problem is that when I take the HTML code from the website and print it out in Python, something that I saw when I inspected the element on the web is not there in the print statement.
You can use this logic to extract any text; this code lets you extract the data between any tags.
Output: FINAL
import urllib
from bs4 import BeautifulSoup

url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url)
soup = BeautifulSoup(page)
indicateGameDone = soup.find("div", {"class": "nbaFnlStatTx"})
for p in indicateGameDone:
    p_text = soup.find("p", {"class": "nbaFnlStatTxSm"})
    print(p_text.getText())
    break
It looks like your problem is not with BeautifulSoup but instead with urllib.
Try running the following commands
>>> import urllib
>>> url = "http://www.nba.com/gameline/20160323/"
>>> page = urllib.urlopen(url).read()
>>> page.find('<div class="nbaModTopStatus">')
44230
Which is no surprise, considering that Beautiful Soup was able to find the div itself. However, when we look a little deeper into what urllib is actually collecting, we can see that the <p class="nbaFnlStatTxSm"> tag is empty by running
>>> page[44230:45000]
'<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div><div id="nbaGLBroadcast"><img src="/.element/img/3.0/sect/gameline/broadcasters/lp.png"></div><div class="nbaTeamsRow"><div class="nbaModTopTeamScr nbaModTopTeamAw"><h5 class="nbaModTopTeamName awayteam">ATL</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/ATL.gif" width="34" height="22" title="Atlanta Hawks"><h4 class="nbaModTopTeamNum win"></h4></div><div class="nbaModTopTeamScr nbaModTopTeamHm"><h5 class="nbaModTopTeamName hometeam">WAS</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/WAS.gif" width="34" '
You can see that the tag is empty, so your problem is the data that's being passed to Beautiful Soup, not the package itself.
With some minor modifications as listed below, your code runs for me. I:
- changed the import of BeautifulSoup to the proper syntax for the current version of BeautifulSoup
- corrected the way you were constructing the BeautifulSoup object
- fixed your find statement, then used the .text attribute to get the string representation of the text in the HTML you're after
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("div", {"class": "nbaModTopStatus"})
print indicateGameDone.text ## "LiveFinal "
To address the comments:
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("p", {"class": "nbaFnlStatTx"})
print indicateGameDone.text