Python: BeautifulSoup not always getting all text data - python

I've got a strange problem with my web scraper. I am trying to get data from a website using BeautifulSoup.
My code works on 90% of the links I've tried, but on a few it does not read the page fully.
The text that interests me is "1152x864".
When checking the source code in my browser I clearly see the text:
<li class="x-block-grid-item">
<h3 style="margin: 0 0 0.35em;font-size: 1em;letter-spacing: 0.05em;line-height: 1">Resolution</h3>
<p class="man">1152x864</p>
</li>
But when I try to get the source via BeautifulSoup it only shows this:
<li class="x-block-grid-item">
<h3 style="margin: 0 0 0.35em;font-size: 1em;letter-spacing: 0.05em;line-height: 1">Resolution</h3>
</li>
This is my code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://prosettings.net/counterstrike/fer/'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("li",{"class":"x-block-grid-item"})
cont_res = containers[8].p.text
print("Res: " + cont_res)
When I try a different link, for example:
my_url = 'https://prosettings.net/counterstrike/fallen/'
Everything works fine.

Try this. It should not disappoint you:
from urllib.request import urlopen
from bs4 import BeautifulSoup
URL = 'https://prosettings.net/counterstrike/fer/'
res = urlopen(URL).read()
soup = BeautifulSoup(res, "lxml")
cont_res = ' '.join([item.find(class_="man").text for item in soup.find_all(class_="x-block-grid-item") if "Resolution" in item.text])
# or using .select()
# cont_res = ' '.join([item.select_one(".man").text for item in soup.select(".x-block-grid-item") if "Resolution" in item.text])
print("Res: " + cont_res)
Output:
Res: 1152x864
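The answer above looks the value up by its heading text rather than by position (containers[8]), so it keeps working when a page has more or fewer grid items. Below is a minimal offline sketch of that idea, run against a static snippet modelled on the question's markup. The helper name value_for_label and the "Aspect Ratio" heading are made up for illustration.

```python
from bs4 import BeautifulSoup

# Static stand-in for the page so the sketch runs without network access.
# The "Aspect Ratio" item is a hypothetical second entry.
HTML = """
<ul>
  <li class="x-block-grid-item"><h3>Aspect Ratio</h3><p class="man">4:3</p></li>
  <li class="x-block-grid-item"><h3>Resolution</h3><p class="man">1152x864</p></li>
</ul>
"""


def value_for_label(soup, label):
    """Return the text of the <p class="man"> that sits next to the <h3> named label."""
    for item in soup.find_all("li", class_="x-block-grid-item"):
        heading = item.find("h3")
        value = item.find("p", class_="man")
        if heading and value and heading.get_text(strip=True) == label:
            return value.get_text(strip=True)
    return None  # label not present, or the <p> is missing from the parsed tree


soup = BeautifulSoup(HTML, "html.parser")
resolution = value_for_label(soup, "Resolution")
print("Res:", resolution)
```

Returning None instead of raising makes it easy to spot the pages where the value is missing from the parsed tree.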

I'm used to BeautifulSoup.text printing out the text of every tag and its children, but there may be something funny going on with these <p>'s in particular. At any rate, you're not getting the right soup, so maybe try requests instead of urllib, and then dig straight for the <p> tags with bs4.
import requests
from bs4 import BeautifulSoup
site = 'https://prosettings.net/counterstrike/fer/'
r = requests.get(site)
soup = BeautifulSoup(r.content, 'html.parser')
list2 = soup.find_all('p', class_='man')
for item in list2:
    if item.find('p'):
        print(item.text)
Gives me all the class="man" <p> tags' info:
400
2.50
1000
125
1.00
0
6
1
1152x864
4:3
stretched
240
It's literally just the Resolution tag. No idea why.
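One way to narrow down where the text goes missing is to check the raw response first and each parser's tree second. The sketch below does that on a static string. needle_report is a hypothetical helper, and it only tries the parsers that happen to be installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def needle_report(raw_html, needle, parsers=("html.parser", "lxml", "html5lib")):
    """Report whether needle is in the raw text and in each parser's tree.

    raw_html must be a str (decode bytes first).
    """
    report = {"raw": needle in raw_html}
    for name in parsers:
        try:
            soup = BeautifulSoup(raw_html, name)
        except FeatureNotFound:
            continue  # this parser is not installed; skip it
        report[name] = needle in soup.get_text()
    return report


SAMPLE = '<li class="x-block-grid-item"><h3>Resolution</h3><p class="man">1152x864</p></li>'
report = needle_report(SAMPLE, "1152x864")
print(report)
```

If "raw" is False, the server never sent the text to this client. If "raw" is True but a parser reports False, that parser built a different tree from the same markup, and switching parsers (as the first answer does with "lxml") is worth trying.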

Related

BeautifulSoup parsing issues some div not showing

I'm trying to parse this page: https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/
The problem is, in this element: https://gyazo.com/e544be64a41a121bdb0c0f71aef50692 ,
I want the div that contains the price. If you inspect the page, you can see the html code for this part, shows like this:
<div class="price">
<div class="price">
"thePrice"
<sup>93</sup>
</div>
</div>
BUT, when using page_soup = soup(my_html_page, 'html.parser') or page_soup = soup(my_html_page, 'lxml') or page_soup = soup(my_html_page, 'html5lib') I only get this as the result for that part:
<div class="price"></div>
And that's it. I've been searching the internet for hours to figure out why that inner div doesn't get parsed.
I tried three different parsers, and none seems to get past the fact that the inner child shares the same class name as its parent, if that is the issue.
Hope this helps you.
from bs4 import BeautifulSoup
import requests
url = 'https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/'
html = BeautifulSoup(requests.get(url).content, 'html.parser')
prices = html.find_all("div", {"class": "price"})
for price in prices:
    print(price.text)
Output:
561€95
169€94
165€95
1 165€94
7 599€95
267€95
259€94
599€95
511€94
1 042€94
2 572€94
783€95
2 479€94
2 699€95
499€94
386€95
169€94
2 343€95
783€95
499€94
499€94
259€94
267€95
165€95
169€94
2 399€95
561€95
2 699€95
2 699€95
6 059€95
7 589€95
10 991€95
9 619€94
2 479€94
3 135€95
7 589€95
511€94
1 042€94
386€95
599€95
1 165€94
2 572€94
783€95
2 479€94
2 699€95
499€94
169€94
2 343€95
2 699€95
3 135€95
6 816€95
7 589€95
561€95
267€95
To scrape all prices where class="price", see this example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# Select all the 'price' classes
for tag in soup.select('div.price'):
    print(tag.text)
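If the nested structure from the question is present in the parsed tree, selecting every div with class "price" returns both the outer and the inner element. Here is a rough sketch that keeps only the innermost ones and turns text such as "1 165€94" into a number. It assumes the "euros€cents" layout seen in the output above, and parse_price is a hypothetical helper.

```python
from decimal import Decimal

from bs4 import BeautifulSoup

# Static stand-in for one price block, following the nesting shown in the question.
HTML = '<div class="price"><div class="price">1 165€<sup>94</sup></div></div>'


def parse_price(text):
    """Convert text like '1 165€94' to Decimal('1165.94')."""
    compact = "".join(text.split())  # drops ordinary and non-breaking spaces
    euros, cents = compact.split("€")
    return Decimal(f"{euros}.{cents}")


soup = BeautifulSoup(HTML, "html.parser")
# Keep only price divs that do not contain another price div.
innermost = [
    div
    for div in soup.find_all("div", class_="price")
    if div.find("div", class_="price") is None
]
prices = [parse_price(div.get_text()) for div in innermost]
print(prices)
```

Decimal is used instead of float so that the cents are not subject to binary rounding.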

Beautiful soup doesn't return full html

I've noticed that my code doesn't return the full html. Here is the code:
import requests
from bs4 import BeautifulSoup
keys = "blabla" + " -filter:retweets AND -filter:replies"
query = "https://twitter.com/search?f=tweets&vertical=default&q=" + keys + "&src=typd&lang=es"
req = requests.get(query)
soup = BeautifulSoup(req.text, "html.parser")
for tweets in soup.findAll("li", {'class': 'js-stream-item stream-item stream-item'}):
    print(tweets.get("data-tweet-id"))
This doesn't print anything and the "li" tag is not even in the soup object, nor the "div class='stream'", even though the twitter search page looks like this:
<div class="stream">
<ol class="stream-items js-navigable-stream" id="stream-items-id">
<li class="js-stream-item stream-item stream-item
" data-item-id="1211695349607174144"
id="stream-item-tweet-1211695349607174144"
data-item-type="tweet"
There is also a lot of other things which don't appear in my soup object.
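To separate the two possible causes (the markup is not in the response at all, or the lookup does not match it), the lookup can be tried on a static copy of the markup first. In the sketch below the fragment from the question is closed off by hand so that it parses, and tweet_item_ids is a hypothetical helper. It matches on a single class and reads data-item-id, the attribute visible in the excerpt.

```python
from bs4 import BeautifulSoup

# Static stand-in for the markup quoted in the question, completed by hand.
HTML = """
<div class="stream">
<ol class="stream-items js-navigable-stream" id="stream-items-id">
<li class="js-stream-item stream-item stream-item
" data-item-id="1211695349607174144" data-item-type="tweet"></li>
</ol>
</div>
"""


def tweet_item_ids(html):
    """Return the data-item-id of every <li> carrying the js-stream-item class."""
    soup = BeautifulSoup(html, "html.parser")
    items = soup.find_all("li", class_="js-stream-item")
    return [li.get("data-item-id") for li in items]


ids = tweet_item_ids(HTML)
print(ids)
```

If the same function returns an empty list for req.text, the li elements were not part of the HTML that requests received, and no change to the BeautifulSoup call will bring them back.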

Scraping website, but want to choose an img URL from a srcset and do it nine more times

I'm trying to scrape the BBC Sounds website for all of the 'currently playing' images. I'm not bothered about which size to use; 400w might be a good choice.
Below is a relevant excerpt from the HTML and my current python script. A variation on this works brilliantly for the 'now playing' text, but I haven't been able to get it to work for the image URLs, which is what I'm after, I think probably because a) there's so many image URLs to choose from and b) there's a whitespace which no doubt the parser doesn't like. Please bear in mind the HTML code below is repeated about 10 times for each of the channels. I've included just one as an example. Thank you!
import requests
from bs4 import BeautifulSoup
url = "https://www.bbc.co.uk/sounds"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
g_data = soup.find_all("div", {"class": "sc-o-responsive-image__img sc-u-circle"})
print g_data[0].text
print g_data[1].text
print g_data[2].text
print g_data[3].text
print g_data[4].text
print g_data[5].text
print g_data[6].text
print g_data[7].text
print g_data[8].text
print g_data[9].text
.
<div class="gel-layout__item sc-o-island">
<div class="sc-c-network-item__image sc-o-island" aria-hidden="true">
<div class="sc-c-rsimage sc-o-responsive-image sc-o-responsive-image--1by1 sc-u-circle">
<img alt="" class="sc-o-responsive-image__img sc-u-circle"
src="https://ichef.bbci.co.uk/images/ic/400x400/p07fzzgr.jpg" srcSet="https://ichef.bbci.co.uk/images/ic/160x160/p07fzzgr.jpg 160w,
https://ichef.bbci.co.uk/images/ic/192x192/p07fzzgr.jpg 192w,
https://ichef.bbci.co.uk/images/ic/224x224/p07fzzgr.jpg 224w,
https://ichef.bbci.co.uk/images/ic/288x288/p07fzzgr.jpg 288w,
https://ichef.bbci.co.uk/images/ic/368x368/p07fzzgr.jpg 368w,
https://ichef.bbci.co.uk/images/ic/400x400/p07fzzgr.jpg 400w,
https://ichef.bbci.co.uk/images/ic/448x448/p07fzzgr.jpg 448w,
https://ichef.bbci.co.uk/images/ic/496x496/p07fzzgr.jpg 496w,
https://ichef.bbci.co.uk/images/ic/512x512/p07fzzgr.jpg 512w,
https://ichef.bbci.co.uk/images/ic/576x576/p07fzzgr.jpg 576w,
https://ichef.bbci.co.uk/images/ic/624x624/p07fzzgr.jpg 624w"
sizes="(max-width: 400px) 34vw,(max-width: 600px) 25vw,17vw"/>
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.bbc.co.uk/sounds")
soup = BeautifulSoup(r.text, 'html.parser')
for item in soup.findAll("img", {'class': 'sc-o-responsive-image__img sc-u-circle'}):
    print(item.get("src"))
Output:
https://ichef.bbci.co.uk/images/ic/400x400/p05mpj80.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07dg040.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07zml97.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p0428n3t.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p01lyv4b.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p06yphh0.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p05v4t1c.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p06z9zzc.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p06x0hxb.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p06n253f.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p060m6jj.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07l4fjw.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p03710d6.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07nn0dw.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07nn0dw.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p078qrgm.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07sq0gr.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07sq0gr.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p03crmyc.jpg
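The answer above reads the src attribute, which happens to be the 400x400 image. To pick a particular width from the srcset instead, the attribute value can be split into candidates. This is a small sketch using only the standard library. parse_srcset and pick_width are hypothetical helpers, and they assume width descriptors ("400w") and URLs without commas, as in the excerpt.

```python
def parse_srcset(srcset):
    """Map each width descriptor (e.g. 400 for '400w') to its URL."""
    candidates = {}
    for entry in srcset.split(","):
        parts = entry.split()  # also swallows the newlines between entries
        if len(parts) == 2 and parts[1].endswith("w"):
            candidates[int(parts[1][:-1])] = parts[0]
    return candidates


def pick_width(srcset, wanted=400):
    """Return the URL for the wanted width, or the closest width available."""
    candidates = parse_srcset(srcset)
    if not candidates:
        return None
    closest = min(candidates, key=lambda width: abs(width - wanted))
    return candidates[closest]


# Three of the entries from the excerpt in the question.
SRCSET = """https://ichef.bbci.co.uk/images/ic/160x160/p07fzzgr.jpg 160w,
https://ichef.bbci.co.uk/images/ic/400x400/p07fzzgr.jpg 400w,
https://ichef.bbci.co.uk/images/ic/624x624/p07fzzgr.jpg 624w"""

print(pick_width(SRCSET, 400))
```

Inside the loop from the answer this would be called as pick_width(item.get("srcset", "")). That lowercase attribute name is an assumption: the page writes srcSet, and html.parser lowercases attribute names as far as I know, so check item.attrs if the lookup comes back empty.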

Using beautifulsoup4 to scrape a specific part of HTML code

I want to set a variable to the 1.65 near the end of the HTML code. Currently, if I run my code it prints "price-text". Any help getting it to print "1.65" instead would be great.
<div class="priceText_f71sibe"><span class="size14_f7opyze medium_f1wf24vo priceTextSize_frw9zm9" data-automation-id="price-text">1.65</span></div>
html code
uClient.close()
page_soup = soup(page_html, "html.parser")
price_texts = page_soup.findAll("div",{"class":"priceText_f71sibe"})
price_text = price_texts[0]
a =price_text.span["data-automation-id"]
print (a)
The most commonly used is the .text property:
price_text.span.text
But there are other properties and methods that return the text:
price_text.span.text
price_text.span.string
price_text.span.getText()
price_text.span.get_text()
Documentation for method get_text()
Full working code
from bs4 import BeautifulSoup
html = '<div class="priceText_f71sibe"><span class="size14_f7opyze medium_f1wf24vo priceTextSize_frw9zm9" data-automation-id="price-text">1.65</span></div>'
soup = BeautifulSoup(html, "html.parser")
price_texts = soup.findAll("div",{"class":"priceText_f71sibe"})
price_text = price_texts[0]
a = price_text.span["data-automation-id"]
print(price_text.span.text)
print(price_text.span.string)
print(price_text.span.getText())
print(price_text.span.get_text())
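The four calls above give the same result here because the span holds a single piece of text. They diverge once the tag has more than one child, which the following sketch illustrates. The nested <b>each</b> element is a made-up variation on the question's markup.

```python
from bs4 import BeautifulSoup

# The span from the question, and a hypothetical variant with a nested tag.
simple = BeautifulSoup(
    '<span data-automation-id="price-text">1.65</span>', "html.parser"
).span
nested = BeautifulSoup(
    '<span data-automation-id="price-text">1.65 <b>each</b></span>', "html.parser"
).span

print(simple.string)      # the single text child
print(nested.string)      # None, because the span has more than one child
print(nested.get_text())  # text of the span and all of its descendants

# get_text(strip=True) trims whitespace before the numeric conversion.
price = float(simple.get_text(strip=True))
print(price)
```

For scraping numbers, get_text() (or .text) is usually the safer choice, since .string silently becomes None when the markup gains a nested element.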

Iterate through the resultset bs4

I have used bs4 to extract this result set:
<div>
<div>
</div>
Content 1
</div>
<div>
Content 2
</div>
I am trying to extract these 2 elements.
Moi not cute not hot, the ugly bui bui type 1 and Actually, moi also dun know
from bs4 import BeautifulSoup
import urllib
import re
r = urllib.urlopen(
'http://forums.hardwarezone.com.sg/eat-drink-man-woman-16/%5Bofficial%5D-chit-chat-students-part-2-a-5526993-55.html').read()
soup = BeautifulSoup(r, "lxml")
letters = soup.find_all("div", attrs={"id":re.compile("post_message_\d+")})
Here is my code. However, how do I iterate through the result set so that it extracts only the content just before the closing div?
letters.find_all('div') returns an empty set.
All the messages:
from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 location of urlopen
import re
r = urlopen(
    'http://forums.hardwarezone.com.sg/eat-drink-man-woman-16/%5Bofficial%5D-chit-chat-students-part-2-a-5526993-55.html').read()
soup = BeautifulSoup(r, "lxml")
# Raw string so that \d reaches the regex engine unchanged
letters = soup.find_all("div", attrs={"id": re.compile(r"post_message_\d+")})
for a in letters:
    # One list of non-empty, stripped lines per post
    print([b.strip() for b in a.text.strip().split('\n') if b.strip()])
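The loop above prints all text in each post, including text inside nested divs. To keep only the text that belongs to the post's own div, as the question asks, one option is to collect just the direct text children. This is a minimal sketch on a static snippet shaped like the one in the question. own_text is a hypothetical helper and the quoted text is invented.

```python
import re

from bs4 import BeautifulSoup

# Static stand-in shaped like the result set in the question.
HTML = """
<div id="post_message_1">
<div>quoted text from an earlier post</div>
Content 1
</div>
<div id="post_message_2">
Content 2
</div>
"""


def own_text(tag):
    """Join the text nodes that are direct children of tag, skipping nested tags."""
    pieces = tag.find_all(string=True, recursive=False)
    return " ".join(piece.strip() for piece in pieces if piece.strip())


soup = BeautifulSoup(HTML, "html.parser")
posts = soup.find_all("div", id=re.compile(r"^post_message_\d+$"))
messages = [own_text(post) for post in posts]
print(messages)
```

Note that find_all is called on each element of the result set in turn. The result set itself is a list of tags, so the per-post work has to happen inside a loop or comprehension.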
