Webscraping Python Trying to pull changing "id" - python

Following is my code.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://steamcommunity.com/market/listings/730/Souvenir%20P2000%20%7C%20Chainmail%20%28Factory%20New%29'
# open and read
uClient = uReq(my_url)
page_html = uClient.read()
#close
uClient.close()
#html parse
page_soup = soup(page_html,"html.parser")
#grab all listings
containers = page_soup.findAll("div",{"class":"market_listing_item_name_block"})
for container in containers:
    block_container = container.findAll("span",{"class":"market_listing_item_name"})
block_container returns multiple results that are all the same, except that each <span> has an
id = "listing_#_name", where the # is a string of digits that changes for each <span>.
For example -
</br></div>, <div class="market_listing_item_name_block">
<span class="market_listing_item_name" id="listing_2060891817875196312_name" style="color: #FFD700;">Souvenir P2000 | Chainmail (Factory New)</span>
<br/>
<span class="market_listing_game_name">Counter-Strike: Global Offensive</span>
</div>, <div class="market_listing_item_name_block">
<span class="market_listing_item_name" id="listing_2076653149485426829_name" style="color: #FFD700;">Souvenir P2000 | Chainmail (Factory New)</span>
<br/>
Can anyone explain how I can grab the id out of all the spans?

You can get the id from the span tags.
Try:
for container in containers:
    for block_container in container.findAll("span", class_="market_listing_item_name"):
        print(block_container.attrs['id'])
From Beautiful Soup Docs
A tag may have any number of attributes. The tag <b id="boldest"> has
an attribute "id" whose value is "boldest". You can access a tag's
attributes by treating the tag like a dictionary:
tag['id']
# u'boldest'
You can access that dictionary directly as .attrs:
tag.attrs
# {u'id': 'boldest'}
Reference:
Beautiful Soup Docs - Attributes
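If only the numeric part of the id is needed, a regular expression can strip the "listing_" prefix and "_name" suffix. This is a minimal sketch on a trimmed copy of the HTML from the question; the id format listing_<digits>_name is an assumption based on the two examples shown, and the names id_pattern and listing_ids are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question
html = """
<div class="market_listing_item_name_block">
<span class="market_listing_item_name" id="listing_2060891817875196312_name">Souvenir P2000 | Chainmail (Factory New)</span>
</div>
<div class="market_listing_item_name_block">
<span class="market_listing_item_name" id="listing_2076653149485426829_name">Souvenir P2000 | Chainmail (Factory New)</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Assumed id format: listing_<digits>_name
id_pattern = re.compile(r"^listing_(\d+)_name$")

listing_ids = []
for span in soup.find_all("span", class_="market_listing_item_name"):
    # .get() returns the default instead of raising if a span has no id
    match = id_pattern.match(span.get("id", ""))
    if match:
        listing_ids.append(match.group(1))

print(listing_ids)
```

Using span.get("id", "") rather than span["id"] avoids a KeyError if some span happens to lack the attribute.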

Does this help?
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''</br></div>, <div class="market_listing_item_name_block">
<span class="market_listing_item_name" id="listing_2060891817875196312_name" style="color: #FFD700;">Souvenir P2000 | Chainmail (Factory New)</span>
<br/>
<span class="market_listing_game_name">Counter-Strike: Global Offensive</span>
</div>, <div class="market_listing_item_name_block">
<span class="market_listing_item_name" id="listing_2076653149485426829_name" style="color: #FFD700;">Souvenir P2000 | Chainmail (Factory New)</span>
<br/></div>'''
doc = SimplifiedDoc(html)
containers = doc.getElements(tag='div',value='market_listing_item_name_block')
for container in containers:
    block_container = container.span
    # or
    # block_container = container.getElement(tag='span',value='market_listing_item_name')
    # block_containers = container.getElements(tag='span',value='market_listing_item_name').contains('listing_',attr='id')
    print (block_container.id, doc.replaceReg(block_container.id,"(listing_|_name)",""))

Related

BeautifulSoup - extracting text from multiple span elements w/o classes

So that's how HTML looks:
<p class="details">
<span>detail1</span>
<span class="number">1</span>
<span>detail2</span>
<span>detail3</span>
</p>
I need to extract detail2 & detail3.
But with this piece of code I only get detail1.
info = data.find("p", class_ = "details").span.text
How do I extract the needed items?
Thanks in advance!
Select your elements more specifically; in your case, all sibling <span> elements that follow the <span> with class number:
soup.select('span.number ~ span')
Example
from bs4 import BeautifulSoup
html='''<p class="details">
<span>detail1</span>
<span class="number">1</span>
<span>detail2</span>
<span>detail3</span>
</p>'''
soup = BeautifulSoup(html, 'html.parser')
[t.text for t in soup.select('span.number ~ span')]
Output
['detail2', 'detail3']
You can find all <span>s and do normal indexing:
from bs4 import BeautifulSoup
html_doc = """\
<p class="details">
<span>detail1</span>
<span class="number">1</span>
<span>detail2</span>
<span>detail3</span>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")
spans = soup.find("p", class_="details").find_all("span")
for s in spans[-2:]:
    print(s.text)
Prints:
detail2
detail3
Or CSS selectors:
spans = soup.select(".details span:nth-last-of-type(-n+2)")
for s in spans:
    print(s.text)
Prints:
detail2
detail3
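Another option, if you would rather stay with the find* methods than CSS selectors, is find_next_siblings(). This is only a sketch on the HTML from the question, and it assumes the <span class="number"> always sits directly before the details you want; the names marker and details are made up for the example.

```python
from bs4 import BeautifulSoup

html_doc = """<p class="details">
<span>detail1</span>
<span class="number">1</span>
<span>detail2</span>
<span>detail3</span>
</p>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Locate the marker span, then collect every <span> sibling that follows it
marker = soup.find("span", class_="number")
details = []
if marker is not None:
    details = [s.get_text(strip=True) for s in marker.find_next_siblings("span")]

print(details)
```

Unlike spans[-2:], this does not depend on how many detail spans follow the number.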

Web scrape second number between tags

I am new to Python, and never done HTML. So any help would be appreciated.
I need to extract two numbers: '1062' and '348', from a website's inspect element.
This is my code:
page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content, 'html.parser')
Adv = soup.select_one ('.col-sm-6 .advDec:nth-child(1)').text[10:]
Dec = soup.select_two ('.col-sm-6 .advDec:nth-child(2)').text[10:]
The website element looks like below:
<div class="nifty-header-shade1 col-xs-12 col-sm-6 col-md-3">
<div class="row">
<div class="col-sm-12">
<h4>Stocks</h4>
</div>
<div class="col-sm-6">
<p class="advDec">Advanced: 1062</p>
</div>
<div class="col-sm-6">
<p class="advDec">Declined: 348</p>
</div>
</div>
</div>
Using my code, I am able to extract the first number (1062), but not the second number (348). Can you please help?
Assuming the pattern is always the same, you can select your elements by their text and get the next_sibling:
adv = soup.select_one('a:-soup-contains("Advanced:")').next_sibling.strip()
dec = soup.select_one('a:-soup-contains("Declined:")').next_sibling.strip()
Example
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content, "html.parser")
adv = soup.select_one('a:-soup-contains("Advanced:")').next_sibling.strip()
dec = soup.select_one('a:-soup-contains("Declined:")').next_sibling.strip()
print(adv, dec)
If there are always 2 elements, then the simplest way would probably be to unpack the list of selected elements.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content, "html.parser")
adv, dec = [elm.next_sibling.strip() for elm in soup.select(".advDec a") ]
print("Advanced:", adv)
print("Declined:", dec)
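If the markup really is as pasted in the question (the label and the number together inside one <p class="advDec">), there is no next_sibling to read, and :nth-child(2) matches nothing because each <p> is the only child of its own div. In that case you could select both paragraphs and split the text on the colon. A rough sketch against a trimmed copy of the snippet from the question, not the live page; counts is just a name chosen for the example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the snippet from the question (not fetched from the site)
html = """
<div class="row">
<div class="col-sm-6"><p class="advDec">Advanced: 1062</p></div>
<div class="col-sm-6"><p class="advDec">Declined: 348</p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

counts = {}
for p in soup.select("p.advDec"):
    # "Advanced: 1062" -> label "Advanced", value " 1062"
    label, _, value = p.get_text(strip=True).partition(":")
    counts[label] = int(value)  # int() tolerates the leading space

print(counts)
```

Keying the numbers by label also avoids relying on the order of the two elements.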

How to extract exact information by span class using Beautiful Soup

This is my code and output for my price monitoring code:
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="result_0_name").get_text()
price = soup.find("span", class_ = "normal_price")
#converted_price = price[0:3]
print(price.get_text())
print(title.strip())
The output is as follows:
Starting at:
$0.70 USD
$0.67 USD
Operation Broken Fang Case
and the HTML of the page looks like this:
<span class="market_table_value normal_price">Starting at:<br/>
<span class="normal_price" data-currency="1" data-price="69">$0.69 USD</span>
<span class="sale_price">$0.66 USD</span>
</span>
As you can see, there is no ID, so I cannot use that. I only want to display the 'normal_price' and not the other data in that span. Any ideas?
Just make the selection of the span more specific, for example by using the fact that it is a span nested directly inside another span:
soup.select_one('span > span.normal_price').get_text()
Example
from bs4 import BeautifulSoup
html='''
<span class="market_table_value normal_price">
Starting at:<br/>
<span class="normal_price" data-currency="1" data-price="69">$0.69 USD</span>
<span class="sale_price">$0.66 USD</span>
</span>
'''
soup = BeautifulSoup(html,'html.parser')
price = soup.select_one('span > span.normal_price').get_text()
price
Output
$0.69 USD
You can also try like this
from bs4 import BeautifulSoup
html ="""<span class="market_table_value normal_price">
Starting at:<br/>
<span class="normal_price" data-currency="1" data-price="69">$0.69 USD</span>
<span class="sale_price">$0.66 USD</span>
</span>"""
soup = BeautifulSoup(html, 'html.parser')
# exact attribute match
price = soup.select_one('span[data-currency="1"]').get_text()
# attribute value starts with "1"
price = soup.select_one('span[data-currency^="1"]').get_text()
print(price)  # $0.69 USD
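Since this is for price monitoring, it may be easier to read the number from the data-price attribute than to parse "$0.69 USD". A sketch only: it assumes data-price holds the price in cents, which is inferred from 69 appearing next to $0.69 in the question and is not confirmed anywhere.

```python
from bs4 import BeautifulSoup

html = """<span class="market_table_value normal_price">
Starting at:<br/>
<span class="normal_price" data-currency="1" data-price="69">$0.69 USD</span>
<span class="sale_price">$0.66 USD</span>
</span>"""

soup = BeautifulSoup(html, "html.parser")

# Only the inner span carries data-price, so the outer span is skipped
price_tag = soup.select_one("span.normal_price[data-price]")

# Assumption: data-price is the price in cents
cents = int(price_tag["data-price"])
price = cents / 100

print(f"{price:.2f}")
```

A numeric value like this can be compared against a threshold directly, without stripping the currency symbol.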

Web scraping with Beautiful Soup gives empty ResultSet

I am experimenting with Beautiful Soup and I am trying to extract information from an HTML document that contains segments of the following type:
<div class="entity-body">
<h3 class="entity-name with-profile">
<a href="https://www.linkedin.com/profile/view?id=AA4AAAAC9qXUBMuA3-txf-cKOPsYZZ0TbWJkhgfxfpY&trk=manage_invitations_profile"
data-li-url="/profile/mini-profile-with-connections?_ed=0_3fIDL9gCh6b5R-c9s4-e_B&trk=manage_invitations_miniprofile"
class="miniprofile"
aria-label="View profile for Ivan Grigorov">
<span>Ivan Grigorov</span>
</a>
</h3>
<p class="entity-subheader">
Teacher
</p>
</div>
I have used the following commands:
with open("C:\Users\pv\MyFiles\HTML\Invites.html","r") as Invites: soup = bs(Invites, 'lxml')
soup.title
out: <title>Sent Invites\n| LinkedIn\n</title>
invites = soup.find_all("div", class_ = "entity-body")
type(invites)
out: bs4.element.ResultSet
len(invites)
out: 0
Why does find_all return an empty ResultSet object?
Your advice will be appreciated.
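One thing to try is reading the document yourself and passing the markup, essentially a string, to BeautifulSoup. Note that BeautifulSoup also accepts an open file handle (a TextIOWrapper in Python 3, a file object in Python 2), so if this alone does not help, the parser is the more likely cause; the version below also switches from lxml to html.parser. The path is written as a raw string so that the backslashes are not treated as escape sequences.
The code would be:
with open(r"C:\Users\pv\MyFiles\HTML\Invites.html", "r") as Invites:
    soup = BeautifulSoup(Invites.read(), "html.parser")
soup.title
invites = soup.find_all("div", class_="entity-body")
len(invites)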
import bs4
html = '''<div class="entity-body">
<h3 class="entity-name with-profile">
<a href="https://www.linkedin.com/profile/view?id=AA4AAAAC9qXUBMuA3-txf-cKOPsYZZ0TbWJkhgfxfpY&trk=manage_invitations_profile"
data-li-url="/profile/mini-profile-with-connections?_ed=0_3fIDL9gCh6b5R-c9s4-e_B&trk=manage_invitations_miniprofile"
class="miniprofile"
aria-label="View profile for Ivan Grigorov">
<span>Ivan Grigorov</span>
</a>
</h3>
<p class="entity-subheader">
Teacher
</p>
</div>'''
soup = bs4.BeautifulSoup(html, 'lxml')
invites = soup.find_all("div", class_ = "entity-body")
len(invites)
out:
1
This code works fine.
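To check whether the way the file is passed matters at all, you can parse the same file both ways and compare. A small sketch using a temporary file with a cut-down version of the markup; the file and the names from_handle and from_string are made up for the example, and html.parser is used so nothing extra needs to be installed.

```python
import os
import tempfile
from bs4 import BeautifulSoup

html = '<div class="entity-body"><p class="entity-subheader">Teacher</p></div>'

# Write the sample markup to a temporary file (stand-in for Invites.html)
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False, encoding="utf-8") as tmp:
    tmp.write(html)
    path = tmp.name

try:
    # 1) pass the open file handle directly
    with open(path, "r", encoding="utf-8") as fh:
        from_handle = BeautifulSoup(fh, "html.parser")
    # 2) read the file and pass the resulting string
    with open(path, "r", encoding="utf-8") as fh:
        from_string = BeautifulSoup(fh.read(), "html.parser")
finally:
    os.remove(path)

count_handle = len(from_handle.find_all("div", class_="entity-body"))
count_string = len(from_string.find_all("div", class_="entity-body"))
print(count_handle, count_string)
```

If both counts match on your real file as well, the difference is more likely to come from the parser or from the document itself than from how the file was opened.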

How to print the text inside of a child tag and the href of a grandchild element with a single BeautifulSoup Object?

I have a document which contains several div.inventory siblings.
<div class="inventory">
<span class="item-number">123</span>
<span class="cost">
<a href="http://linktoitem">$1.23</a>
</span>
</div>
I would like to iterate over them to print the item number and link of the item.
123 http://linktoitem
456 http://linktoitem2
789 http://linktoitem3
How do I parse these two values after selecting the div.inventory element?
import requests
from bs4 import BeautifulSoup
htmlSource = requests.get(url).text
soup = BeautifulSoup(htmlSource)
matches = soup.select('div.inventory')
for match in matches:
    # prints 123
    # prints http://linktoitem
Also - what is the difference between the select function and find* functions?
You can find both items using find() relying on the class attributes:
soup = BeautifulSoup(data, 'html.parser')
for inventory in soup.select('div.inventory'):
    number = inventory.find('span', class_='item-number').text
    link = inventory.find('span', class_='cost').a.get('href')
    print(number, link)
Example:
from bs4 import BeautifulSoup
data = """
<body>
<div class="inventory">
<span class="item-number">123</span>
<span class="cost">
<a href="http://linktoitem">$1.23</a>
</span>
</div>
<div class="inventory">
<span class="item-number">456</span>
<span class="cost">
<a href="http://linktoitem2">$1.23</a>
</span>
</div>
<div class="inventory">
<span class="item-number">789</span>
<span class="cost">
<a href="http://linktoitem3">$1.23</a>
</span>
</div>
</body>
"""
soup = BeautifulSoup(data, 'html.parser')
for inventory in soup.select('div.inventory'):
    number = inventory.find('span', class_='item-number').text
    link = inventory.find('span', class_='cost').a.get('href')
    print(number, link)
Prints:
123 http://linktoitem
456 http://linktoitem2
789 http://linktoitem3
Note the use of select(): this method lets you use CSS selectors to search the page, whereas the find* methods match on a tag name and attribute filters. Also note the class_ argument: the underscore is important, since class is a reserved keyword in Python.
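On the select vs. find* part of the question, the two styles can usually reach the same element. A small side-by-side sketch; the <a href> inside span.cost is assumed markup, reconstructed from the expected output in the question, and via_select / via_find are names made up for the example.

```python
from bs4 import BeautifulSoup

# Assumed markup: the link is taken to be an <a> inside span.cost
data = """
<div class="inventory">
<span class="item-number">123</span>
<span class="cost"><a href="http://linktoitem">$1.23</a></span>
</div>
"""

soup = BeautifulSoup(data, "html.parser")

# CSS selector: one expression describes the whole path
via_select = soup.select_one("div.inventory span.cost a")

# find(): one call per step, filtering on tag name and class
via_find = soup.find("div", class_="inventory").find("span", class_="cost").find("a")

print(via_select["href"], via_find["href"])
```

select() and select_one() take a CSS selector string, while find() and find_all() take a tag name plus keyword filters; which one reads better is mostly a matter of taste.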
