Beautiful Soup does not decode part of an HTML page - python

Beautiful Soup returns part of the HTML with escaped text such as "\u041f\u0440\u0438\u043b\u043e\u0436\" instead of readable characters.
How can this be solved?
The problem occurs with the page https://life.com.by/
import requests
from bs4 import BeautifulSoup
result = requests.get('https://life.com.by/company/news')
soup = BeautifulSoup(result.content, 'html.parser')
print(soup)
It returns:
var engTranslations = {"test":"test","or":"or","main_page":"Main Page","consultant":"Consultant","android_description":"\u041f\u0440\u0438\u043b\u043e\u0436\u0435\u043d\u0438\u0435 \u0434\u043b\u044f\u00a0\u0441\u043c\u0430\u0440\u0442\u0444\u043e\u043d\u043e\u0432 \u0438\u00a0\u043f\u043b\u0430\u043d\u0448\u0435\u0442\u043e\u0432 \u043d\u0430\u00a0Android","apple_description":"\u041f\u0440\u0438\u043b\u043e\u0436\u0435\u043d\u0438\u0435 \u0434\u043b\u044f\u00a0\u0441\u043c\u0430\u0440\u0442\u0444\u043e\u043d\u043e\u0432 \u0438\u00a0\u043f\u043b\u0430\u043d\u0448\u0435\u0442\u043e\u0432 \u043d\u0430\u00a0iOS","tv_description":"\u041f\u0440\u0438\u043b\u043e\u0436\u0435\u043d\u0438\u0435 \u0434\u043b\u044f\u00a0Android\u00a0TV","notebook_description":"\u0421\u043c\u043e\u0442\u0440\u0435\u0442\u044c \u0432\u00a0\u0431\u0440\u0430\u0443\u0437\u0435\u0440\u0435","new_sim":"New SIM-card","add_traffic":"Add traffic","change_tariff":"\u0421\u043c\u0435\u043d\u0438\u0442\u044c \u0442\u0430\u0440\u0438\u0444\u043d\u044b\u0439\u00a0\u043f\u043b\u0430\u043d","tariff_changing":"\u0421\u043c\u0435\u043d\u0430 \u0442\u0430\u0440\u0438\u0444\u0430","for_a_day":"For a day","for_a_week":"For a week","for_a_month":"For a month","connect":"Activate","call_back_me":"\u041f\u0435\u0440\u0435\
If I print just part of it, it comes out as a human-readable string.
print('\u041f\u0440\u0438\u043b\u043e\u0436\u0435\u043d\u0438\u0435\u043f\u043e\u043c\u043e\u0436\u0435\u043c')
Приложение поможем

I don't exactly understand what you want...
But from what I understood, try this:
soup.encode('utf-8').decode('unicode-escape')

Related

Web scraping IMDB with Python's Beautiful Soup

I am trying to parse this page "https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1", but I can't find the href that I need (href="/title/tt0068112/episodes?ref_=tt_eps_sm").
I tried with this code:
import requests
from bs4 import BeautifulSoup
url = "https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
for a in soup.find_all('a'):
    print(a['href'])
What's wrong with this? I also tried to check "manually" with print(soup.prettify()) but it seems that that link is hidden or something like that.
You can get the page HTML with requests; the href is in there, so there is no need for a special API. I tried this and it worked:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1")
soup = BeautifulSoup(page.content, "html.parser")
scooby_link = ""
for item in soup.find_all("a", href="/title/tt0068112/episodes?ref_=tt_eps_sm"):
    print(item["href"])
    # Build the absolute URL from the href that was found
    scooby_link = "https://www.imdb.com" + item["href"]
print(scooby_link)
I'm assuming you also wanted to save the link to a variable for further scraping so I did that as well. 🙂
To get the Episodes link, you can use the following example:
import requests
from bs4 import BeautifulSoup
url = "https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
print(soup.select_one("a:-soup-contains(Episodes)")["href"])
Prints:
/title/tt0068112/episodes?ref_=tt_eps_sm
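If the goal is to keep scraping from the link that was found, the relative href has to be turned into an absolute URL. A minimal sketch using urllib.parse.urljoin from the standard library; the html string is a hypothetical stand-in for the downloaded page, and href=True is used so that anchors without an href are skipped instead of raising a KeyError:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the href value is the one from the question.
html = (
    '<a href="/title/tt0068112/episodes?ref_=tt_eps_sm">Episodes</a>'
    '<a name="top">anchor without href</a>'
)
base_url = "https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1"

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
absolute_links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

print(absolute_links)
```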

BeautifulSoup class searching, no results

I'm using BeautifulSoup to parse the HTML of this site and extract the URLs of the results, but when I use find_all I get an empty list as output. I checked the HTML that I downloaded from the site manually, and it contains the appropriate class.
If somebody could point out where I'm making a mistake or show a better solution, I would be grateful!
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.awf.edu.pl/pracownik/wyszukiwarka-pracownikow?result_5251_result_page=3&queries_search_query=&category_kategorie=wydzia_wychowania_fizycznego&search_page_5251_submit_button=Szukaj&current_result_page=1&results_per_page=20&submitted_search_category=&mode=results")
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find_all('div', class_ = 'search-item photo')
I've also tried the code below to find all links on the site and then pick out the ones I need, but in this case I get only the parent tag. If an 'a' tag is nested inside another 'a' tag, it is skipped, whereas from the documentation I expected it to be included in the output.
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.awf.edu.pl/pracownik/wyszukiwarka-pracownikow?result_5251_result_page=3&queries_search_query=&category_kategorie=wydzia_wychowania_fizycznego&search_page_5251_submit_button=Szukaj&current_result_page=1&results_per_page=20&submitted_search_category=&mode=results")
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find_all('a')
I found an answer to a similar question, "BeautifulSoup can't find class that exists on webpage?", but in my case I can see the HTML that I want to find in my console when I use print(soup.prettify()).
The problem you are facing is linked to the way page.content is being parsed.
Replace:
soup = BeautifulSoup(page.content, 'html.parser')
with:
soup = BeautifulSoup(page.content, 'lxml')
Note that lxml is a separate package and has to be installed first. Hope this helps.
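Independently of the parser, it may help to know how class_ treats a value containing a space: it is compared against the exact string of the class attribute, so a different order or an extra class does not match. This may or may not be what is happening on this particular site. A minimal sketch comparing it with a CSS selector, using a hypothetical HTML snippet (only the class names are taken from the question):

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class names come from the question.
html = """
<div class="search-item photo">A</div>
<div class="photo search-item">B</div>
<div class="search-item photo extra">C</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A string with a space is matched against the exact class attribute value.
exact = soup.find_all("div", class_="search-item photo")

# A CSS selector matches any element carrying both classes, in any order.
by_css = soup.select("div.search-item.photo")

print([d.get_text() for d in exact])
print([d.get_text() for d in by_css])
```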

Beautiful Soup in Python: get full information from HTML

I am trying to get the number of views of my post on Telegram via BeautifulSoup. For example I want to take it from my channel post number 956: https://t.me/dayygesstt/956
<span class="tgme_widget_message_views">3.1K</span>
So "3.1K" is what I need.
import requests
from bs4 import BeautifulSoup
def get_html(url):
    r = requests.get(url,'lxml')
    return r.text
url='https://t.me/dayygesstt/956'
html=get_html(url)
soup=BeautifulSoup(html, )
x = soup.findAll("div", {"class": "tgme_page tgme_page_post"})
for i in x :
    r=i.findAll("div", {"class": "tgme_page_widget"})
    print(r)
and it prints:
[<div class="tgme_page_widget" id="widget">
<script async="" data-telegram-post="dayygesstt/956" data-width="100%" src="https://telegram.org/js/telegram-widget.js?4"></script>
</div>]
I tried different things but I can't get any more information. What am I doing wrong, and how can I get this information properly?
You can use the URL that loads the iframe in your script. Then you get just the widget without the cruft. To do this, take the original URL and append the query string "embed=1".
import requests
from bs4 import BeautifulSoup
url = 'https://t.me/dayygesstt/956?embed=1'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
views = soup.find("span", {"class": "tgme_widget_message_views"})
print(views.text)
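Since find returns None when the element is missing (for example if the markup changes), it can be worth guarding the lookup before reading the text. A small offline sketch, using the span from the question as a stand-in for the downloaded page:

```python
from bs4 import BeautifulSoup

# Stand-in for the fetched embed page; this span is the one quoted in the question.
html = '<span class="tgme_widget_message_views">3.1K</span>'
soup = BeautifulSoup(html, "html.parser")

views = soup.find("span", class_="tgme_widget_message_views")

# find() returns None when nothing matches, so check before reading the text.
views_text = views.get_text(strip=True) if views is not None else None

print(views_text)  # prints: 3.1K
```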
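I think you need to specify which parser BeautifulSoup should use so that it parses the HTML consistently. Without one, BeautifulSoup guesses a parser and emits a warning. So this line:
soup=BeautifulSoup(html, )
needs to be this:
soup=BeautifulSoup(html, 'html.parser')
Also note that in requests.get(url,'lxml') the second positional argument is params, so 'lxml' is sent as part of the query string rather than selecting a parser.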

Cut HTML in half using Python BeautifulSoup

I'm trying to scrape a website and I need to cut the HTML in half. The problem is that the HTML is not really well organized and I can't just use findAll.
Here is my code to parse the HTML:
resultats = requests.get(URL)
bs = BeautifulSoup(resultats.text, 'html.parser')
What I want to do is split bs at each <h2> I find.
The solution might be really simple but I can't find it...
Edit: the website is here
This scrapes the whole text without any HTML in it:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://fr.wikipedia.org/wiki/Liste_de_sondages_sur_l'%C3%A9lection_pr%C3%A9sidentielle_fran%C3%A7aise_de_2017#Avril"
resultats = urlopen(url)
html = resultats.read()
soup = BeautifulSoup(html, 'html5lib')  # the html5lib parser must be installed
text = soup.get_text()  # Extracts the text from the HTML
print(text)
If you want to leave certain information out, you could add this:
text = re.sub('yourRegex', '', text, flags=re.DOTALL).strip()
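To actually split the page at each <h2>, one possible approach is to walk the siblings that follow every heading and stop at the next one. A minimal sketch on a hypothetical snippet; it assumes the headings and their content share the same parent element, which may not hold for the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<div>
<h2>Avril</h2><p>a1</p><table><tr><td>t1</td></tr></table>
<h2>Mars</h2><p>m1</p><p>m2</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

sections = {}
for h2 in soup.find_all("h2"):
    parts = []
    # Collect the tags after this heading until the next <h2> starts.
    for sibling in h2.find_next_siblings():
        if sibling.name == "h2":
            break
        parts.append(sibling.get_text(" ", strip=True))
    sections[h2.get_text(strip=True)] = parts

print(sections)
```

Each key is a heading text and each value holds the text of the elements that belong to that heading, so the sections can be processed separately.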

Beautiful Soup not returning everything in HTML file?

HTML noob here, so I could be misunderstanding something about the HTML document, so bear with me.
I'm using Beautiful Soup to parse web data in Python. Here is my code:
import urllib
import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(page)
indicateGameDone = str(soup.find("div", {"class": "nbaModTopStatus"}))
print indicateGameDone
Now, if you look at the website, the HTML has the line <p class="nbaLiveStatTxSm"> FINAL </p> (inspect the 'Final' text on the left side of the container for the first ATL-WAS game on the page to see it for yourself). But when I run the code above, it doesn't return the 'FINAL' that is seen on the webpage; instead, the nbaLiveStatTxSm tag is empty.
On my machine, this is the output when I print indicateGameDone:
<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div>
Does anyone know why this is happening?
EDIT: clarification: the problem isn't retrieving the text within the tag, the problem is that when I take the html code from the website and print it out in python, something that I saw when I inspected the element on the web is not there in the print statement in Python.
You can use this logic to extract the text between any tags.
Output: Final
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
# The status <p> tags sit inside the nbaModTopStatus <div>
indicateGameDone = soup.find("div", {"class": "nbaModTopStatus"})
p_text = indicateGameDone.find("p", {"class": "nbaFnlStatTx"})
print(p_text.getText())
It looks like your problem is not with BeautifulSoup but instead with urllib.
Try running the following commands
>>> import urllib
>>> url = "http://www.nba.com/gameline/20160323/"
>>> page = urllib.urlopen(url).read()
>>> page.find('<div class="nbaModTopStatus">')
44230
That is no surprise, considering that Beautiful Soup was able to find the div itself. However, when we look a little deeper into what urllib is actually collecting, we can see that the <p class="nbaFnlStatTxSm"> is empty by running:
>>> page[44230:45000]
'<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div><div id="nbaGLBroadcast"><img src="/.element/img/3.0/sect/gameline/broadcasters/lp.png"></div><div class="nbaTeamsRow"><div class="nbaModTopTeamScr nbaModTopTeamAw"><h5 class="nbaModTopTeamName awayteam">ATL</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/ATL.gif" width="34" height="22" title="Atlanta Hawks"><h4 class="nbaModTopTeamNum win"></h4></div><div class="nbaModTopTeamScr nbaModTopTeamHm"><h5 class="nbaModTopTeamName hometeam">WAS</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/WAS.gif" width="34" '
You can see that the tag is empty, so your problem is the data that's being passed to Beautiful Soup, not the package itself.
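The same check can be done with Beautiful Soup itself by listing which status tags in the downloaded HTML actually carry text. A minimal sketch that uses the div printed in the question as offline input, so no request is made:

```python
from bs4 import BeautifulSoup

# The div exactly as it appears in the question's printed output.
html = (
    '<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p>'
    '<p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p>'
    '<p class="nbaFnlStatTxSm"></p></div>'
)
soup = BeautifulSoup(html, "html.parser")
status = soup.find("div", class_="nbaModTopStatus")

# Map each <p> tag's first class name to the text it contains.
texts = {p["class"][0]: p.get_text(strip=True) for p in status.find_all("p")}

# Tags that arrived empty in the raw HTML.
empty = [name for name, text in texts.items() if not text]

print(texts)
print(empty)
```

Tags that show text in the browser but arrive empty here are typically filled in later by JavaScript, which a plain HTTP download does not execute.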
I changed the import of BeautifulSoup to the proper syntax for the current version of BeautifulSoup.
I corrected the way you were constructing the BeautifulSoup object.
I fixed your find statement, then used .text to get the string representation of the text in the HTML you're after.
With the minor modifications listed above, your code runs for me.
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("div", {"class": "nbaModTopStatus"})
print indicateGameDone.text ## "LiveFinal "
To address the comments:
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("p", {"class": "nbaFnlStatTx"})
print indicateGameDone.text
