How to extract image/href url from div class using scrapy


I am having a hard time extracting the href URL from the following website code:
<div class="expando expando-uninitialized" style="display: none" data-cachedhtml=" <div class="media-preview" id="media-preview-66hch1" style="max-width: 534px"> <div class="media-preview-content"> <a href="https://i.redd.it/nctvpvsnbpsy.jpg" class="may-blank"> <img class="preview" src="https://i.redditmedia.com/UELqh-mbh5mwnXr67PoBbi23nwZuNl2v3flNbkmewQE.jpg?w=534&amp;s=1426be7f811e5d5043760f8882674070" width="534" height="768"> </a> </div> </div> " data-pin-condition="function() {return this.style.display != 'none';}"><span class="error">loading...</span></div>

You can probably use a regular expression for this. Here is an example:
s = """<div class="expando expando-uninitialized" style="display: none" data-cachedhtml=" <div class="media-preview" id="media-preview-66hch1" style="max-width: 534px"> <div class="media-preview-content"> <a href="https://i.redd.it/nctvpvsnbpsy.jpg" class="may-blank"> <img class="preview" src="https://i.redditmedia.com/UELqh-mbh5mwnXr67PoBbi23nwZuNl2v3flNbkmewQE.jpg?w=534&amp;s=1426be7f811e5d5043760f8882674070" width="534" height="768"> </a> </div> </div> " data-pin-condition="function() {return this.style.display != 'none';}"><span class="error">loading...</span></div>"""
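import re
# The quotes may appear literally or as &quot; entities, so accept both.
match = re.search(r'href=(?:"|&quot;)(.*?\.jpg)(?:"|&quot;)', s)
print(match.group(1))
# https://i.redd.it/nctvpvsnbpsy.jpg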
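The &quot; in the pattern above suggests that, in the raw page source, the data-cachedhtml attribute holds entity-escaped HTML. Under that assumption, a minimal sketch is to pull out the attribute value, unescape it with the standard library, and then read the href from the result. The raw string below is a shortened, hypothetical stand-in for the real page source, not the exact markup.

```python
import html
import re

# Hypothetical, shortened sample of the raw page source (assumed to be entity-escaped).
raw = (
    '<div class="expando expando-uninitialized" style="display: none" '
    'data-cachedhtml=" &lt;div class=&quot;media-preview&quot;&gt; '
    '&lt;a href=&quot;https://i.redd.it/nctvpvsnbpsy.jpg&quot; class=&quot;may-blank&quot;&gt; '
    '&lt;/a&gt; &lt;/div&gt; "><span class="error">loading...</span></div>'
)

# Step 1: take the attribute value (it contains no literal double quotes while escaped).
cached = re.search(r'data-cachedhtml="([^"]*)"', raw).group(1)

# Step 2: turn &lt; &gt; &quot; back into real characters.
inner = html.unescape(cached)

# Step 3: read the href of the first <a> tag in the unescaped fragment.
href = re.search(r'<a\s+href="([^"]+)"', inner).group(1)
print(href)
```

In a Scrapy callback the same idea would apply to the attribute value taken from the response; that part is left out here because it depends on the spider.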

Related

Python Trouble with find_all()

I am attempting to scrape a site for some data on a long list of items. Below is my code:
def find_info():
    page = requests.get(URL)
    soup = bs(page.content, 'html.parser')
    soup2 = bs(soup.prettify(), "html.parser")
    lists = soup2.find_all('div', class_="featured_item")
    #with open('data.csv', 'w', encoding='utf8', newline='') as f:
    #    thewriter = writer(f)
    #    header = ['Rank', 'Number', 'Price'] #csv: comma separated values
    #    thewriter.writerow(header)
    for list in lists:
        stats = lists.find_all('div', class_="item_stats")
        sale = lists.find('div', class_="sale")
        rank = float(stats.select('.item_stats span')[0].text)
        number = stats.select('.item_stats span')[1].text.strip().replace('#', '')
        price = sale.select('span')[0].text.strip().replace('◎', '')
        print(rank, number, price)
find_info()
I get the following error:
Traceback (most recent call last):
  File "C:\Users\Cameron Lemon\source\repos\ScrapperGUI\module1.py", line 36, in <module>
    find_info()
  File "C:\Users\Cameron Lemon\source\repos\ScrapperGUI\module1.py", line 27, in find_info
    stats = lists.find_all('div', class_="item_stats")
  File "C:\Users\Cameron Lemon\AppData\Local\Programs\Python\Python310\lib\site-packages\bs4\element.py", line 2253, in __getattr__
    raise AttributeError(
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Press any key to continue . . .
This code runs well until I add the line for list in lists: and use find_all() instead of find(). I am very confused, because in the past I have written code that found the data for the first item in the list successfully, then added a for loop and changed my find() to find_all(). Any advice is much appreciated. Thank you.
Below is my lists variable, for two of the many items within it.
[<div class="featured_item">
<div class="featured_item_img">
<a href="/degds/3251/">
<img src="URL"/>
</a>
</div>
<div class="featured_image_desc" style="padding-bottom: 5px;">
<div class="item_stats">
<div class="item_stat">
rank
<span>
1
</span>
</div>
<div class="item_stat">
item no.
<span>
#3251
</span>
</div>
</div>
<div class="sale" style="height: 40px; margin-top: 10px; margin-bottom: 10px;">
<span>
not for sale
</span>
</div>
<div class="last_sale" style="font-size: 11px;">
no sale history
</div>
</div>
</div>, <div class="featured_item">
<div class="featured_item_img">
<a href="/dgds/8628/">
<img src="URL"/>
</a>
</div>
<div class="featured_image_desc" style="padding-bottom: 5px;">
<div class="item_stats">
<div class="item_stat">
rank
<span>
2
</span>
</div>
<div class="item_stat">
item no.
<span>
#8628
</span>
</div>
</div>
<div class="sale" style="height: 40px; margin-top: 10px; margin-bottom: 10px;">
<span>
not for sale
</span>
</div>
<div class="last_sale" style="font-size: 11px;">
no sale history
</div>
</div>

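Inside the loop you are calling find_all() and find() on lists, which is the ResultSet returned by the first find_all(), instead of on the loop variable. A ResultSet is a list of elements and has no find_all() or find() method of its own, which is why you get that error. Call find() on each individual item inside the loop: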
html='''
<div class="featured_item">
<div class="featured_item_img">
<a href="/degds/3251/">
<img src="URL"/>
</a>
</div>
<div class="featured_image_desc" style="padding-bottom: 5px;">
<div class="item_stats">
<div class="item_stat">
rank
<span>
1
</span>
</div>
<div class="item_stat">
item no.
<span>
#3251
</span>
</div>
</div>
<div class="sale" style="height: 40px; margin-top: 10px; margin-bottom: 10px;">
<span>
not for sale
</span>
</div>
<div class="last_sale" style="font-size: 11px;">
no sale history
</div>
</div>
</div>, <div class="featured_item">
<div class="featured_item_img">
<a href="/dgds/8628/">
<img src="URL"/>
</a>
</div>
<div class="featured_image_desc" style="padding-bottom: 5px;">
<div class="item_stats">
<div class="item_stat">
rank
<span>
2
</span>
</div>
<div class="item_stat">
item no.
<span>
#8628
</span>
</div>
</div>
<div class="sale" style="height: 40px; margin-top: 10px; margin-bottom: 10px;">
<span>
not for sale
</span>
</div>
<div class="last_sale" style="font-size: 11px;">
no sale history
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
lists = soup.find_all('div', class_="featured_item")
for lis in lists:
    stats = lis.find('div', class_="item_stats")
    sale = lis.find('div', class_="sale")
    rank = float(stats.select('.item_stats span')[0].text)
    number = stats.select('.item_stats span')[1].text.strip().replace('#', '')
    price = sale.select('span')[0].text.strip().replace('◎', '')
    print(rank, number, price)
Output:
1.0 3251 not for sale
2.0 8628 not for sale
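If some items on the real page lack an item_stats or sale block, find() returns None for them and the loop above would raise an AttributeError. A minimal sketch of a more defensive variant follows. The helper name parse_item and the trimmed HTML are made up for illustration; the second item is deliberately incomplete.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical sample: the second item has no sale block and only one span.
html = """
<div class="featured_item">
  <div class="item_stats">
    <div class="item_stat">rank <span>1</span></div>
    <div class="item_stat">item no. <span>#3251</span></div>
  </div>
  <div class="sale"><span>not for sale</span></div>
</div>
<div class="featured_item">
  <div class="item_stats">
    <div class="item_stat">rank <span>2</span></div>
  </div>
</div>
"""

def parse_item(item):
    """Return (rank, number, price) for one item, or None if data is missing."""
    spans = item.select(".item_stats span")   # list of Tags, possibly empty
    sale = item.select_one(".sale span")      # a single Tag or None
    if len(spans) < 2 or sale is None:
        return None
    rank = float(spans[0].get_text(strip=True))
    number = spans[1].get_text(strip=True).lstrip("#")
    price = sale.get_text(strip=True).replace("◎", "")
    return rank, number, price

soup = BeautifulSoup(html, "html.parser")
# find_all() returns a ResultSet; each element of it is a Tag that supports find()/select().
rows = [parse_item(item) for item in soup.find_all("div", class_="featured_item")]
print(rows)
```

Returning None for incomplete items keeps the loop running, so the caller can decide whether to skip or log them.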

Beautifulsoup get element with the same class

I'm having trouble parsing HTML elements with "class" attribute using Beautifulsoup.
The html code is like this :
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
I need to get the data (XPANDER 1.5L GLX, MT, 1499, Gasoline).
I tried detail.find(class_='item-content'), but it only returns XPANDER 1.5L GLX.
Please help.

Use .find_all() or .select():
from bs4 import BeautifulSoup
html_doc = """
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
items = [
    item.get_text(strip=True) for item in soup.find_all(class_="item-content")
]
print(*items)
Prints:
XPANDER 1.5L GLX MT 1499 cc Bensin
Or:
items = [item.get_text(strip=True) for item in soup.select(".item-content")]

You can try this
soup = BeautifulSoup(html, "html.parser")
items = [item.text for item in soup.find_all("div", {"class": "item-content"})]
find_all retrieves all occurrences, whereas find returns only the first one.
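When the labels matter as well as the values, one option is to walk the info-item containers and pair each item-name with its item-content. This is only a sketch built on the HTML from the question; the dictionary name specs is an arbitrary choice.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

specs = {}
for info in soup.select("div.info-item"):
    name = info.select_one(".item-name")        # first match inside this container
    content = info.select_one(".item-content")
    if name is not None and content is not None:
        specs[name.get_text(strip=True)] = content.get_text(strip=True)

print(specs)
```

Looking values up by label, for example specs["Fuel"], avoids relying on the order of the elements in the page.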

How can I iterate through the full web table with beautifulsoup?

I want to scrape a web table with selenium and beautifulsoup. The table contains 10 x 'resultMainRow' and 4 x 'resultMainCell'.
Inside the 4th resultMainCell there are 8 spans, each holding an img src.
The following html code represents one of the table rows. I could only print out the relevant source code of the table. How can I iterate through the full web table together with the img src?
<div class="resultMainTable">
<div class="resultMainRow">
<div class="resultMainCell_1 tableResult2">
<a href="javascript:genResultDetails(2);"
title="Best of the date">20/006 </a></div>
<div class="resultMainCell_2 tableResult2">21/01/2020</div>
<div class="resultMainCell_3 tableResult2"></div>
<div class="resultMainCell_4 tableResult2">
<span class="resultMainCellInner">
<img height="25" src="/info/images/icon/no_3abc”> </span>
<span class="resultMainCellInner">
<img height="25" src = "/info/images/icon/no_14 " ></span>
<span class="resultMainCellInner">
<img height="25" src "/info/images/icon/no_21 " ></span>
<span class="resultMainCellInner">
<img height="25" src="/info/images/icon/no_28 " ></span>
<span class="resultMainCellInner">
<img height="25" src=" /info/images/icon/no_37 "></span>
<span class="resultMainCellInner">
<img height="25" src="/info/images/icon/no_44 "></span>
<span class="resultMainCellInner">
<img height="6" src="/info/images/icon_happy " ></span>
<span class="resultMainCellInner"
<img height="25" src="/info/images/icon/smile "></span>
</div>
</div>
My code is as following:
soup = BeautifulSoup(driver.page_source, 'lxml')
sixsix = soup.findAll("div", {"class": "resultMainTable"})
print (sixsix)
for row in sixsix:
    images = soup.findAll('img')
    for image in images:
        if len(images) == 8:
            aaa = images[1].find('src')
            bbb = images[2].find('src')
            ccc = images[3].find('src')
            ddd = images[4].find('src')
            eee = images[5].find('src')
            fff = images[6].find('src')
            ggg = images[7].find('src')
            hhh = images[8].find('src')
        print ((row.text), (image('src')))

You can try this script to iterate over all rows of the table and extract the text from the first three cells and the 8 URLs from the src attributes:
from bs4 import BeautifulSoup
txt = '''
<div class="resultMainTable">
<div class="resultMainRow">
<div class="resultMainCell">text1</div>
<div class="resultMainCell">text2</div>
<div class="resultMainCell">text3</div>
<div class="resultMainCell">
<div>
<div>
<span>
<img src="1" />
<img src="2" />
<img src="3" />
<img src="4" />
<img src="5" />
<img src="6" />
<img src="7" />
<img src="8" />
</span>
</div>
</div>
</div>
</div>
<div class="resultMainRow">
<div class="resultMainCell">text3</div>
<div class="resultMainCell">text4</div>
<div class="resultMainCell">text5</div>
<div class="resultMainCell">
<div>
<div>
<span>
<img src="9" />
<img src="10" />
<img src="11" />
<img src="12" />
<img src="13" />
<img src="14" />
<img src="15" />
<img src="16" />
</span>
</div>
</div>
</div>
</div>
</div>'''
soup = BeautifulSoup(txt, 'html.parser')
for row in soup.select('div.resultMainTable .resultMainRow'):
    v1, v2, v3, v4 = row.select('div.resultMainCell')
    imgs = [img['src'] for img in v4.select('img')]
    print(v1.text, v2.text, v3.text, *imgs)
Prints:
text1 text2 text3 1 2 3 4 5 6 7 8
text3 text4 text5 9 10 11 12 13 14 15 16
EDIT (With real HTML code from edited question):
from bs4 import BeautifulSoup
txt = '''<div class="resultMainTable">
<div class="resultMainRow">
<div class="resultMainCell_1 tableResult2">
<a href="javascript:genResultDetails(2);"
title="Best of the date">20/006 </a></div>
<div class="resultMainCell_2 tableResult2">21/01/2020</div>
<div class="resultMainCell_3 tableResult2"></div>
<div class="resultMainCell_4 tableResult2">
<span class="resultMainCellInner">
<img height="25" src="/info/images/icon/no_3abc"> </span>
<span class="resultMainCellInner">
<img height="25" src = "/info/images/icon/no_14 " ></span>
<span class="resultMainCellInner">
<img height="25" src "/info/images/icon/no_21 " ></span>
<span class="resultMainCellInner">
<img height="25" src="/info/images/icon/no_28 " ></span>
<span class="resultMainCellInner">
<img height="25" src=" /info/images/icon/no_37 "></span>
<span class="resultMainCellInner">
<img height="25" src="/info/images/icon/no_44 "></span>
<span class="resultMainCellInner">
<img height="6" src="/info/images/icon_happy " ></span>
<span class="resultMainCellInner"
<img height="25" src="/info/images/icon/smile "></span>
</div>
</div>'''
soup = BeautifulSoup(txt, 'html.parser')
for row in soup.select('div.resultMainTable .resultMainRow'):
    v1, v2, v3, v4 = row.select('div[class^="resultMainCell"]')
    imgs = [img['src'] for img in v4.select('img')]
    print(v1.text, v2.text, v3.text, *imgs)
Prints:
20/006 21/01/2020 /info/images/icon/no_3abc /info/images/icon/no_14 /info/images/icon/no_28 /info/images/icon/no_37 /info/images/icon/no_44 /info/images/icon_happy
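The unpacking into v1, v2, v3, v4 fails with a ValueError if a row does not contain exactly four cells, and img['src'] raises a KeyError for an image without a src attribute. A hedged sketch of a more tolerant loop is shown below. It collects each row into a dictionary and strips the stray whitespace seen in the src values. The sample markup is a cleaned-up, shortened version of the question's HTML, and the names records, texts, and images are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened, well-formed sample based on the question's table (one row, three images).
txt = """
<div class="resultMainTable">
  <div class="resultMainRow">
    <div class="resultMainCell_1 tableResult2"><a href="#">20/006 </a></div>
    <div class="resultMainCell_2 tableResult2">21/01/2020</div>
    <div class="resultMainCell_3 tableResult2"></div>
    <div class="resultMainCell_4 tableResult2">
      <span class="resultMainCellInner"><img height="25" src="/info/images/icon/no_14 "></span>
      <span class="resultMainCellInner"><img height="25" src=" /info/images/icon/no_37 "></span>
      <span class="resultMainCellInner"><img height="25"></span>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(txt, "html.parser")

records = []
for row in soup.select("div.resultMainTable div.resultMainRow"):
    cells = row.select('div[class^="resultMainCell"]')
    if len(cells) != 4:
        continue  # skip rows that do not have the expected layout
    texts = [cell.get_text(strip=True) for cell in cells[:3]]
    # img[src] matches only images that actually carry a src attribute.
    images = [img["src"].strip() for img in cells[3].select("img[src]")]
    records.append({"texts": texts, "images": images})

print(records)
```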

Unable to scrape h1 class with python/beautiful soup

I am trying to scrape a title from an h1 class, but I keep getting "None"
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('h1', {'class': 'prod-name'})
print(title)
I've also tried using this way:
name_div = soup.find_all('div', {'class': 'col-md-12 col-sm-12 col-xs-12'})[0]
name = name_div.find('h1').text
print(name)
in which case I get: "IndexError: list index out of range"
Can anybody help me out?
This is the source code:
<div class="row attachDetails __web-inspector-hidebefore-shortcut__">
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12">
<div class="brand-desc">POLO RALPH LAUREN</div>
<h1 class="prod-name">ARAN CREWNECK SWEATER</h1>
<div class="panel-group" id="accordion">
<div class="borders-overview">
<div class="panel-heading">
<h4 class="panel-title">
<label class="overview-label collapsed" data-angle="overview-label" data-toggle="collapse" data-parent="#accordion" href="#collapse1">
<a class="fa fa-angle-up pull-right"></a>
<a class="over-view">OVERVIEW</a>
<span class="color-disp over-view">COLOR: FAWN GREY HEATHER</span>
<span class="style-num over-view">MATERIAL# : 710766783002
</span></label>
</h4>
</div>
<div id="collapse1" class="panel-collapse collapse">
<div class="short-desc-section"></div>
</div>
</div>
<div class="border-details">
<div class="panel-heading">
<h4 class="panel-title">
<label class="prod-details collapsed" data-angle="prod-details" data-toggle="collapse" data-parent="#accordion" href="#collapse2">
<a class="detail-link">Details</a>
<a class="fa fa-angle-up pull-right"></a>
</label>
</h4>
</div>
<div id="collapse2" class="long-desc panel-collapse collapse">
<div><ol><li>STANDARD FIT</li><li>COTTON</li></ol></div>
<ol>
<div><li><b>Board:</b> S196SC23</li></div>
<!--***********************************************************************************************************-->
</ol>
</div>
</div>
</div>
</div>
</div>
</div>
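
One way to narrow this down is to run the same lookup against the HTML shown above. In the sketch below the selector does find the h1, so a result of None from the live request probably means that the content returned by requests differs from what the browser shows (for example because the page is built by JavaScript or the request is answered with a different page). That cause is an assumption, not something the question confirms. The helper name find_title is hypothetical.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
snippet = """
<div class="row">
  <div class="col-md-12 col-sm-12 col-xs-12">
    <div class="brand-desc">POLO RALPH LAUREN</div>
    <h1 class="prod-name">ARAN CREWNECK SWEATER</h1>
  </div>
</div>
"""

def find_title(markup):
    """Return the text of <h1 class="prod-name">, or None if it is absent."""
    soup = BeautifulSoup(markup, "html.parser")
    h1 = soup.find("h1", class_="prod-name")
    return h1.get_text(strip=True) if h1 is not None else None

print(find_title(snippet))                        # the h1 is present in this markup
print(find_title("<html><body></body></html>"))   # no h1 here, so None
```

Printing page.status_code and searching page.text for "prod-name" before parsing would show whether the element is in the downloaded HTML at all.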

Parsing all elements which have tag before

I have following html code:
<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
I'm trying to display only the text inside all rows, where parent tag is legend BBB (in this example - bbb,bbb,bbb).
Currently I've created the code below, but it doesn't look pretty, and I don't know how to find all rows:
bs = BeautifulSoup(request.txt, 'html.parser')
if(bs.find('legend', text='BBB')):
    value = parser.find('legend').next_element.next_element.next_element.get_text().strip()
    print(value)
Is there a simple way to do this? The div class name is always the same; only the "legend" text varies.

Added a <legend>CCC</legend> so that you may see it scales.
html = """<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>CCC</legend>
<div class="row">ccc</div>
<div class="row">ccc</div>
<div class="row">ccc</div>
...
</fieldset>
</div>"""
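from bs4 import BeautifulSoup
bs = BeautifulSoup(html, "html.parser")
after_tag = bs.find("legend", string="BBB").parent  # Grabs the parent <fieldset> of the legend.
divs = after_tag.find_all("div", {"class": "row"})  # Finds all row divs inside that fieldset.
for div in divs:
    print(div.text)
Output: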
bbb
bbb
bbb

from bs4 import BeautifulSoup
html = """
<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
"""
soup = BeautifulSoup(html, features='html.parser')
elements = soup.select('div > fieldset')[1]
tuple_obj = ()
for row in elements.select('div.row'):
    tuple_obj = tuple_obj + (row.text,)
print(tuple_obj)
the tuple object prints out
('bbb', 'bbb', 'bbb')
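
Because the legend text is the only thing that varies, another option is to build a mapping from each legend to the rows of its fieldset and then look up the one that is needed. This is a minimal sketch on a shortened version of the HTML above; the name rows_by_legend is arbitrary.

```python
from bs4 import BeautifulSoup

html = """
<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
</fieldset>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows_by_legend = {}
for fieldset in soup.find_all("fieldset"):
    legend = fieldset.find("legend")
    if legend is None:
        continue  # a fieldset without a legend has no key to file it under
    rows_by_legend[legend.get_text(strip=True)] = [
        div.get_text(strip=True) for div in fieldset.find_all("div", class_="row")
    ]

print(rows_by_legend["BBB"])
```

Unlike indexing the second fieldset by position, this lookup does not depend on the order in which the fieldsets appear.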
