I having hard time to extract href url from given website code
<div class="expando expando-uninitialized" style="display: none" data-cachedhtml=" <div class="media-preview" id="media-preview-66hch1" style="max-width: 534px"> <div class="media-preview-content"> <a href="https://i.redd.it/nctvpvsnbpsy.jpg" class="may-blank"> <img class="preview" src="https://i.redditmedia.com/UELqh-mbh5mwnXr67PoBbi23nwZuNl2v3flNbkmewQE.jpg?w=534&s=1426be7f811e5d5043760f8882674070" width="534" height="768"> </a> </div> </div> " data-pin-condition="function() {return this.style.display != 'none';}"><span class="error">loading...</span></div>
Probably, you can use regular expressions for this. Here is example:
s = """<div class="expando expando-uninitialized" style="display: none" data-cachedhtml=" <div class="media-preview" id="media-preview-66hch1" style="max-width: 534px"> <div class="media-preview-content"> <a href="https://i.redd.it/nctvpvsnbpsy.jpg" class="may-blank"> <img class="preview" src="https://i.redditmedia.com/UELqh-mbh5mwnXr67PoBbi23nwZuNl2v3flNbkmewQE.jpg?w=534&s=1426be7f811e5d5043760f8882674070" width="534" height="768"> </a> </div> </div> " data-pin-condition="function() {return this.style.display != 'none';}"><span class="error">loading...</span></div>"""
re.search('href="(.*jpg)"', s).groups()[0]
# 'https://i.redd.it/nctvpvsnbpsy.jpg'
Related
I am attempting to scrape a site for for some data on an long list of items. Below is my code:
def find_info():
page = requests.get(URL)
soup = bs(page.content, 'html.parser')
soup2 = bs(soup.prettify(), "html.parser")
lists = soup2.find_all('div', class_="featured_item")
#with open('data.csv', 'w', encoding='utf8', newline='') as f:
# thewriter = writer(f)
# header = ['Rank', 'Number', 'Price'] #csv: comma separated values
# thewriter.writerow(header)
for list in lists:
stats = lists.find_all('div', class_="item_stats")
sale = lists.find('div', class_="sale")
rank = float(stats.select('.item_stats span')[0].text)
number = stats.select('.item_stats span')[1].text.strip().replace('#', '')
price = sale.select('span')[0].text.strip().replace('◎', '')
print(rank, number, price)
find_info()
I get the following error:
Traceback (most recent call last):
File "C:\Users\Cameron Lemon\source\repos\ScrapperGUI\module1.py", line 36, in <module>
find_info()
File "C:\Users\Cameron Lemon\source\repos\ScrapperGUI\module1.py", line 27, in
find_info
stats = lists.find_all('div', class_="item_stats")
File "C:\Users\Cameron Lemon\AppData\Local\Programs\Python\Python310\lib\site-
packages\bs4\element.py", line 2253, in __getattr__
raise AttributeError(
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a
list of elements like a single element. Did you call find_all() when you meant to call
find()?
Press any key to continue . . .
This code runs well, until I add the line for list in lists: and use find_all() instead of find(). I am very confused, because in the past I have coded to find data for the first item in the list successfully and then set a for loop and changed my find() to find_all(). Any advice is much appreciated. Thank you
Below is my lists variable, for two of the many items within it.
[<div class="featured_item">
<div class="featured_item_img">
<a href="/degds/3251/">
<img src="URL"/>
</a>
</div>
<div class="featured_image_desc" style="padding-bottom: 5px;">
<div class="item_stats">
<div class="item_stat">
rank
<span>
1
</span>
</div>
<div class="item_stat">
item no.
<span>
#3251
</span>
</div>
</div>
<div class="sale" style="height: 40px; margin-top: 10px; margin-bottom: 10px;">
<span>
not for sale
</span>
</div>
<div class="last_sale" style="font-size: 11px;">
no sale history
</div>
</div>
</div>, <div class="featured_item">
<div class="featured_item_img">
<a href="/dgds/8628/">
<img src="URL"/>
</a>
</div>
<div class="featured_image_desc" style="padding-bottom: 5px;">
<div class="item_stats">
<div class="item_stat">
rank
<span>
2
</span>
</div>
<div class="item_stat">
item no.
<span>
#8628
</span>
</div>
</div>
<div class="sale" style="height: 40px; margin-top: 10px; margin-bottom: 10px;">
<span>
not for sale
</span>
</div>
<div class="last_sale" style="font-size: 11px;">
no sale history
</div>
</div>
You are using find_all method two times. When you use for loop then you will get each item meaning list didn't exist any more. So we can't use find_all without list that's why you are getting that error.
html='''
<div class="featured_item">
<div class="featured_item_img">
<a href="/degds/3251/">
<img src="URL"/>
</a>
</div>
<div class="featured_image_desc" style="padding-bottom: 5px;">
<div class="item_stats">
<div class="item_stat">
rank
<span>
1
</span>
</div>
<div class="item_stat">
item no.
<span>
#3251
</span>
</div>
</div>
<div class="sale" style="height: 40px; margin-top: 10px; margin-bottom: 10px;">
<span>
not for sale
</span>
</div>
<div class="last_sale" style="font-size: 11px;">
no sale history
</div>
</div>
</div>, <div class="featured_item">
<div class="featured_item_img">
<a href="/dgds/8628/">
<img src="URL"/>
</a>
</div>
<div class="featured_image_desc" style="padding-bottom: 5px;">
<div class="item_stats">
<div class="item_stat">
rank
<span>
2
</span>
</div>
<div class="item_stat">
item no.
<span>
#8628
</span>
</div>
</div>
<div class="sale" style="height: 40px; margin-top: 10px; margin-bottom: 10px;">
<span>
not for sale
</span>
</div>
<div class="last_sale" style="font-size: 11px;">
no sale history
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"html.parser")
lists = soup.find_all('div', class_="featured_item")
for lis in lists:
stats = lis.find('div', class_="item_stats")
sale = lis.find('div', class_="sale")
rank = float(stats.select('.item_stats span')[0].text)
number = stats.select('.item_stats span')[1].text.strip().replace('#', '')
price = sale.select('span')[0].text.strip().replace('◎', '')
print(rank, number, price)
Output:
1.0 3251 not for sale
2.0 8628 not for sale
I'm having trouble parsing HTML elements with "class" attribute using Beautifulsoup.
The html code is like this :
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
I nead to get data (XPANDER 1.5L GLX, MT, 1499, Gasoline)
I try with script detail.find(class_='item-content') just only get XPANDER 1.5L GLX
please help
Use .find_all() or .select():
from bs4 import BeautifulSoup
html_doc = """
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
items = [
item.get_text(strip=True) for item in soup.find_all(class_="item-content")
]
print(*items)
Prints:
XPANDER 1.5L GLX MT 1499 cc Bensin
Or:
items = [item.get_text(strip=True) for item in soup.select(".item-content")]
You can try this
soup = BeautifulSoup(html, "html.parser")
items = [item.text for item in soup.find_all("div", {"class": "item-content"})]
find_all retreives all occurences
i want to scrape a web table with selenium and beautifulsoup. The table contains 10 x 'resultMainRow' and 4 x 'resultMainCell'.
Inside the 4th resultMainCell, there have 8 span classes, for each holding an img src.
The following html code represents one of the table rows. I could only print out the relevant source code of the table. How can I iterate through the full web table together with the img src?
<div class="resultMainTable">
<div class="resultMainRow">
<div class="resultMainCell_1 tableResult2">
<a href="javascript:genResultDetails(2);"
title="Best of the date">20/006 </a></div>
<div class="resultMainCell_2 tableResult2">21/01/2020</div>
<div class="resultMainCell_3 tableResult2"></div>
<div class="resultMainCell_4 tableResult2">
<span class="resultMainCellInner">
<img height="25" src="/info/images/icon/no_3abc”> </span>
<span class="resultMainCellInner">
<img height="25" src = "/info/images/icon/no_14 " ></span>
<span class="resultMainCellInner">
<img height="25" src "/info/images/icon/no_21 " ></span>
<span class="resultMainCellInner">
<img height="25" src="/info/images/icon/no_28 " ></span>
<span class="resultMainCellInner">
<img height="25" src=" /info/images/icon/no_37 "></span>
<span class="resultMainCellInner">
<img height="25" src="/info/images/icon/no_44 "></span>
<span class="resultMainCellInner">
<img height="6" src="/info/images/icon_happy " ></span>
<span class="resultMainCellInner"
<img height="25" src="/info/images/icon/smile "></span>
</div>
</div>
The table contains 10 x 'resultMainRow' and 4 x 'resultMainCell'.
Inside the 4th resultMainCell, there have 8 span classes, for each holding an img src.
My code is as following:
soup = BeautifulSoup(driver.page_source, 'lxml')
sixsix = soup.findAll("div", {"class": "resultMainTable"})
print (sixsix)
for row in sixsix:
images = soup.findAll('img')
for image in images:
if len(images) == 8:
aaa = images[1].find('src')
bbb = images[2].find('src')
ccc = images[3].find('src')
ddd = images[4].find('src')
eee = images[5].find('src')
fff = images[6].find('src')
ggg = images[7].find('src')
hhh = images[8].find('src')
print ((row.text), (image('src')))
You can try this script to iterate over all rows of the table and extract text from first three cells and 8 URLs from src attributes:
from bs4 import BeautifulSoup
txt = '''
<div class="resultMainTable">
<div class="resultMainRow">
<div class="resultMainCell">text1</div>
<div class="resultMainCell">text2</div>
<div class="resultMainCell">text3</div>
<div class="resultMainCell">
<div>
<div>
<span>
<img src="1" />
<img src="2" />
<img src="3" />
<img src="4" />
<img src="5" />
<img src="6" />
<img src="7" />
<img src="8" />
</span>
</div>
</div>
</div>
</div>
<div class="resultMainRow">
<div class="resultMainCell">text3</div>
<div class="resultMainCell">text4</div>
<div class="resultMainCell">text5</div>
<div class="resultMainCell">
<div>
<div>
<span>
<img src="9" />
<img src="10" />
<img src="11" />
<img src="12" />
<img src="13" />
<img src="14" />
<img src="15" />
<img src="16" />
</span>
</div>
</div>
</div>
</div>
</div>'''
soup = BeautifulSoup(txt, 'html.parser')
for row in soup.select('div.resultMainTable .resultMainRow'):
v1, v2, v3, v4 = row.select('div.resultMainCell')
imgs = [img['src'] for img in v4.select('img')]
print(v1.text, v2.text, v3.text, *imgs)
Prints:
text1 text2 text3 1 2 3 4 5 6 7 8
text3 text4 text5 9 10 11 12 13 14 15 16
EDIT (With real HTML code from edited question):
from bs4 import BeautifulSoup
txt = '''<div class="resultMainTable">
<div class="resultMainRow">
<div class="resultMainCell_1 tableResult2">
<a href="javascript:genResultDetails(2);"
title="Best of the date">20/006 </a></div>
<div class="resultMainCell_2 tableResult2">21/01/2020</div>
<div class="resultMainCell_3 tableResult2"></div>
<div class="resultMainCell_4 tableResult2">
<span class="resultMainCellInner">
<img height="25" src="/info/images/icon/no_3abc"> </span>
<span class="resultMainCellInner">
<img height="25" src = "/info/images/icon/no_14 " ></span>
<span class="resultMainCellInner">
<img height="25" src "/info/images/icon/no_21 " ></span>
<span class="resultMainCellInner">
<img height="25" src="/info/images/icon/no_28 " ></span>
<span class="resultMainCellInner">
<img height="25" src=" /info/images/icon/no_37 "></span>
<span class="resultMainCellInner">
<img height="25" src="/info/images/icon/no_44 "></span>
<span class="resultMainCellInner">
<img height="6" src="/info/images/icon_happy " ></span>
<span class="resultMainCellInner"
<img height="25" src="/info/images/icon/smile "></span>
</div>
</div>'''
soup = BeautifulSoup(txt, 'html.parser')
for row in soup.select('div.resultMainTable .resultMainRow'):
v1, v2, v3, v4 = row.select('div[class^="resultMainCell"]')
imgs = [img['src'] for img in v4.select('img')]
print(v1.text, v2.text, v3.text, *imgs)
Prints:
20/006 21/01/2020 /info/images/icon/no_3abc /info/images/icon/no_14 /info/images/icon/no_28 /info/images/icon/no_37 /info/images/icon/no_44 /info/images/icon_happy
I am trying to scrape a title from an h1 class, but I keep getting "None"
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('h1', {'class': 'prod-name'})
print(title)
I've also tried using this way:
name_div = soup.find_all('div', {'class': 'col-md-12 col-sm-12 col-xs-12'})[0]
name = name_div.find('h1').text
print(name)
in which case I get: "IndexError: list index out of range"
Can anybody help me out?
This is the source code:
<div class="row attachDetails __web-inspector-hidebefore-shortcut__">
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12">
<div class="brand-desc">POLO RALPH LAUREN</div>
<h1 class="prod-name">ARAN CREWNECK SWEATER</h1>
<div class="panel-group" id="accordion">
<div class="borders-overview">
<div class="panel-heading">
<h4 class="panel-title">
<label class="overview-label collapsed" data-angle="overview-label" data-toggle="collapse" data-parent="#accordion" href="#collapse1">
<a class="fa fa-angle-up pull-right"></a>
<a class="over-view">OVERVIEW</a>
<span class="color-disp over-view">COLOR: FAWN GREY HEATHER</span>
<span class="style-num over-view">MATERIAL# : 710766783002
</span></label>
</h4>
</div>
<div id="collapse1" class="panel-collapse collapse">
<div class="short-desc-section"></div>
</div>
</div>
<div class="border-details">
<div class="panel-heading">
<h4 class="panel-title">
<label class="prod-details collapsed" data-angle="prod-details" data-toggle="collapse" data-parent="#accordion" href="#collapse2">
<a class="detail-link">Details</a>
<a class="fa fa-angle-up pull-right"></a>
</label>
</h4>
</div>
<div id="collapse2" class="long-desc panel-collapse collapse">
<div><ol><li>STANDARD FIT</li><li>COTTON</li></ol></div>
<ol>
<div><li><b>Board:</b> S196SC23</li></div>
<!--***********************************************************************************************************-->
</ol>
</div>
</div>
</div>
</div>
</div>
</div>
I have following html code:
<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
I'm trying to display only the text inside all rows, where parent tag is legend BBB (in this example - bbb,bbb,bbb).
Currently I've created the code below, but it doesn't look pretty, and I don't know how to find all rows:
bs = BeautifulSoup(request.txt, 'html.parser')
if(bs.find('legend', text='BBB')):
value = parser.find('legend').next_element.next_element.next_element.get_text().strip()
print(value)
Is there any simply way to do this? div class name is the same, just "legend" is variable.
Added a <legend>CCC</legend> so that you may see it scales.
html = """<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>CCC</legend>
<div class="row">ccc</div>
<div class="row">ccc</div>
<div class="row">ccc</div>
...
</fieldset>
</div>"""
after_tag = bs.find("legend", text="BBB").parent # Grabs parent div <fieldset>.
divs = after_tag.find_all("div", {"class": "row"}) # Finds all div inside parent.
for div in divs:
print(div.text)
bbb
bbb
bbb
from bs4 import BeautifulSoup
html = """
<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
"""
soup = BeautifulSoup(html, features='html.parser')
elements = soup.select('div > fieldset')[1]
tuple_obj = ()
for row in elements.select('div.row'):
tuple_obj = tuple_obj + (row.text,)
print(tuple_obj)
the tuple object prints out
('bbb', 'bbb', 'bbb')