I'm having trouble with Beautiful Soup. I started learning it today but can't find a way to fix my issue.
I want to get only one link per article, along with the text in the h1 and p.
article_name_list = soup.find(class_='turbolink_scroller')
# find all articles in the div
article_name_list_items = article_name_list.find_all('article')
# loop over the articles and print the h1 and p of each
for article_name in article_name_list_items:
    names = article_name.find('h1')
    color = article_name.find('p')
    print(names)
    print(color)
The output is:
<h1><a class="name-link" href="/shop/jackets/gw1diqgyr/km21a8hnc">Gonz Logo Coaches Jacket </a></h1>
<p><a class="name-link" href="/shop/jackets/gw1diqgyr/km21a8hnc">Red</a></p>
I would like to get this output:
href="blablabla"
Gonz Logo Coaches Jacket Red
and store it in variables each time (if possible), like link = href"blablabla" and name = "gonz logo ...", or in 3 variables with the color in a separate one.
EDIT: here is what the page looks like:
<div class="turbolink_scroller" id="container" style="opacity: 1;">
<article>
<div class="inner-article">
<a style="height:150px;" href="/shop/jackets/h21snm5ld/jick90fel">
<img width="150" height="150" src="//assets.supremenewyork.com/146917/vi/MCHFhUqvN0w.jpg" alt="Mchfhuqvn0w">
<div class="sold_out_tag" style="">sold out</div>
</a>
<h1><a class="name-link" href="/shop/jackets/h21snm5ld/jick90fel">NY Tapestry Denim Chore Coat</a></h1>
<p><a class="name-link" href="/shop/jackets/h21snm5ld/jick90fel">Maroon</a></p>
</div>
</article>
<article></article>
<article></article>
<article></article>
</div>
EDIT 2: problem resolved, thank you.
Here is the solution for others:
article_name_list = soup.find(class_='turbolink_scroller')
# find all articles in the div
article_name_list_items = article_name_list.find_all('article')
# loop over the articles and print each name, color, and link
for article_name in article_name_list_items:
    link = article_name.find('h1').find('a').get('href')
    names = article_name.find('h1').find('a').get_text()
    color = article_name.find('p').find('a').get_text()
    print(names)
    print(color)
    print(link)
Thank you all for your answers.
I assume you're looking to put each of those into individual lists.
name_list = []
link_list = []
color_list = []
for article_name in article_name_list_items:
    names = article_name.find('h1').find('a', class_='name-link').get_text()
    links = article_name.find('p').find('a', class_='name-link').get('href')
    colors = article_name.find('p').find('a', class_='name-link').get_text()
    name_list.append(names)
    link_list.append(links)
    color_list.append(colors)
I'm not exactly sure what article_name_list_items looks like, but names will get you the text of the <h1> element, links will get you the href of the link inside the <p> element, and colors will get you the text of the <p> element.
You could also collect everything in a single list of lists: initialize a new list list_of_all before the loop and replace the three appends inside the loop with the single append on the second line below:
list_of_all = []
list_of_all.append([names, links, colors])
I believe you are very close. However, you should tell us a little more about the structure of the page. Are all articles structured in the same h1>a,p>a structure?
Assuming this structure, the following should work:
link = article_name.find('h1').find('a').get('href')
names = article_name.find('h1').find('a').get_text()
color = article_name.find('p').find('a').get_text()
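One caveat with the page shown in the question: it also contains empty <article></article> elements, for which find('h1') returns None, so chaining .find('a') on them would raise an AttributeError. A minimal sketch that skips such articles, assuming every populated article follows the h1 > a and p > a structure (the html string is a shortened copy of the question's markup, and the items list and its dictionary keys are illustrative names, not part of the original code):

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question, including an empty <article>.
html = """
<div class="turbolink_scroller" id="container">
  <article>
    <div class="inner-article">
      <h1><a class="name-link" href="/shop/jackets/h21snm5ld/jick90fel">NY Tapestry Denim Chore Coat</a></h1>
      <p><a class="name-link" href="/shop/jackets/h21snm5ld/jick90fel">Maroon</a></p>
    </div>
  </article>
  <article></article>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

items = []
for article in soup.select("div.turbolink_scroller article"):
    name_link = article.select_one("h1 a.name-link")
    color_link = article.select_one("p a.name-link")
    # Skip articles that do not contain both links (e.g. empty placeholders).
    if name_link is None or color_link is None:
        continue
    items.append({
        "link": name_link.get("href"),
        "name": name_link.get_text(strip=True),
        "color": color_link.get_text(strip=True),
    })

for item in items:
    print(item["link"], item["name"], item["color"])
```

Each dictionary keeps the link, name, and color of one article together, which avoids having to keep three parallel lists in sync.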
I am trying to find all the <p> tags inside the div with class content-inner. I don't want the <p> tags that talk about copyright (the last <p> tags, outside that container) to appear in the filtered result. Also, my image list comes out empty or nothing comes out at all, so no image is saved.
import requests
from bs4 import BeautifulSoup

main = requests.get('https://url_on_html.com/')
beautify = BeautifulSoup(main.content, 'html5lib')
news = beautify.find_all('div', {'class': 'jeg_block_container'})
arti = []
for each in news:
    title = each.find('h3', {'class': 'jeg_post_title'}).text
    lnk = each.a.get('href')
    # fetch and parse the linked article page
    r = requests.get(lnk)
    soup = BeautifulSoup(r.text, 'html5lib')
    content = [i.text.strip() for i in soup.find_all('p')]
    content = ' '.join(content)
    images = [i['src'] for i in soup.find_all('img')]
    arti.append({
        'Headline': title,
        'Link': lnk,
        'image': images,
        'content': content
    })
The website's HTML looks like this:
<html><head><title>The simple's story</title></head>
<body>
<div class="content-inner "><div class="addtoany_share_save_cont"><p>He added: “The President king administration has embarked on
railway construction</p>
<p>Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
<script></script>
<p> we will not once in Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
<p>the emergency of our matter is Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
<script></script>
<br></br>
<script></script>
<p>king of our Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
<script></script>
<img src="image.png">
<p>he is our Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
<p>some weas Once upon a time there were three little sisters, and their names were and they lived at the bottom of a well.</p>
</div>
</div>
<div>
<p>Copyright © 2021. All Rights Reserved. Vintage Press Limited. Optimized by iNERD360</p>
</div>
This shows an empty list:
content = [i.text.strip() for i in soup.find_all('div', {'class': 'content-inner'})]
For the images, this code returns an empty list too:
images = [i['src'] for i in soup.find_all('img')]
This returns every <p> tag in the HTML page, which is not what I want:
content = [i.text.strip() for i in soup.find_all('p')]
How do I get all the <p> tags except the last ones outside the content-inner div? Also, how do I extract images correctly with bs4?
Replace: content = [i.text.strip() for i in soup.find_all('p')]
With:
div_list = [div for div in soup.find_all('div', class_="content-inner")]
p_list = [div.find_all('p') for div in div_list]
content = [item.text.strip() for p in p_list for item in p]
Leave the rest of the code unchanged.
This way, your script returns a list containing everything you asked for (including images), except the ads and the copyright string.
Get a list of all paragraphs:
paragraphs = soup.find_all("p")
Produce a filtered list (list comprehension) of paragraphs not starting with the string "Copyright":
paragraphs = [item.text.strip() for item in paragraphs if not item.text.startswith("Copyright")]
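Another option is a CSS selector scoped to the container, so that paragraphs and images outside it are never matched in the first place. A minimal sketch, assuming the article body sits in a div with class content-inner as in the question's HTML (the html string below is a shortened stand-in for the real page, not the site's actual markup):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page structure described in the question.
html = """
<html><body>
<div class="content-inner "><div class="addtoany_share_save_cont">
<p>First paragraph.</p>
<script></script>
<img src="image.png">
<p>Second paragraph.</p>
</div></div>
<div><p>Copyright 2021. All Rights Reserved.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Only <p> tags that are descendants of div.content-inner are selected.
content = " ".join(
    p.get_text(strip=True) for p in soup.select("div.content-inner p")
)

# Only <img> tags inside the container that actually carry a src attribute.
images = [img["src"] for img in soup.select("div.content-inner img[src]")]

print(content)
print(images)
```

If the image list is still empty on the real site, the images may be loaded by JavaScript or stored under a different attribute; that cannot be determined from the HTML shown in the question.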
I would like to remove the HTML tags but preserve the text between them and keep it in the list. This is my code:
comment_list = comment_container.findAll("div", {"class" : "comment-date"})
print(comment_list)
Output is:
[<div class="comment-date">2018-9-11 03:58</div>,
<div class="comment-date">2018-4-4 17:10</div>,
<div class="comment-date">2018-4-26 01:06</div>,
<div class="comment-date">2018-7-19 13:48</div>,
<div class="comment-date">2018-4-12 11:39</div>,
<div class="comment-date">2019-3-14 21:12</div>,
<div class="comment-date">2019-3-4 15:43</div>,
<div class="comment-date">2019-3-12 13:20</div>,
<div class="comment-date">2019-3-10 22:32</div>,
<div class="comment-date">2019-3-8 15:22</div>]
Desired Output:
[2018-9-11 03:58, 2018-4-4 17:10, 2018-4-26 01:06,
2018-7-19 13:48, 2018-4-12 11:39, 2019-3-14 21:12,
2019-3-4 15:43, 2019-3-12 13:20, 2019-3-10 22:32, 2019-3-8 15:22]
I am able to extract the text individually by using a for loop:
for commentDate in comment_list:
    comments = commentDate.text
    print(comments)
I would like to use the dates for comparison (finding the earliest date), so I feel that saving the dates into a list would be the most manageable.
You can convert your list of div elements to list of dates using list comprehension like this to get desired output:
comment_list = comment_container.findAll("div", {"class" : "comment-date"})
comment_dates = [comment.text for comment in comment_list]
print(comment_dates)
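Since the goal is to find the earliest date, note that these strings are not zero-padded, so comparing them as plain text does not sort them chronologically (for example, "2018-4-26" sorts before "2018-4-4"). A minimal sketch that parses them into datetime objects first, assuming every entry follows the year-month-day hour:minute layout shown in the output above (the html string is a shortened stand-in for comment_container):

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Shortened stand-in for the comment container from the question.
html = """
<div class="comment-date">2018-9-11 03:58</div>
<div class="comment-date">2018-4-4 17:10</div>
<div class="comment-date">2018-4-26 01:06</div>
<div class="comment-date">2019-3-14 21:12</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Text of each div, with the surrounding tags removed.
comment_dates = [
    div.get_text(strip=True)
    for div in soup.find_all("div", class_="comment-date")
]

# Parse the strings so that they compare chronologically.
parsed_dates = [datetime.strptime(text, "%Y-%m-%d %H:%M") for text in comment_dates]

earliest = min(parsed_dates)
print(earliest)
```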
I'm a bit new to python/BeautifulSoup, and was wondering if I could get some direction on how to get the following accomplished.
I have html from a webpage, that is structured as follows:
1) A block of code contained within a tag that holds all image names (Name1, Name2, Name3).
2) A block of code contained within a tag that has the image URLs.
3) A date that appears once on the webpage. I put it into the 'date' variable (this has already been extracted).
From the code, I'm trying to extract a list of lists that will contain [['image1','url1', 'date'], ['image2','url2','date']], which I will later convert into a dictionary (via dict(zip(labels, values))) and insert into a MySQL table.
All I can come up with is how to extract two lists, one containing all images and one containing all URLs. Any idea how to accomplish what I'm trying to do?
Few things to keep in mind:
1) number of images always changes, along with names (1:1)
2) date always appears once.
P.S. Also, if there is a more elegant way to extract the data via bs4, please let me know!
from bs4 import BeautifulSoup
name = []
url = []
date = '2017-10-12'
text = '<div class="tabs"> <ul><li> NAME1</li><li> NAME2</li><li> NAME3</li> </ul> <div><div><div class="img-wrapper"><img alt="" src="www.image1.com/1.jpg" title="image1.jpg"></img> </div> <center><a class="button print" href="javascript: w=window.open("www.image1.com/1.jpg); w.print();"> Print</a> </center></div><div> <div class="img-wrapper"><img alt="" src="www.image2.com/2.jpg" title="image2.jpg"></img> </div> <center><a class="button print" href="javascript: w=window.open("www.image2.com/2.jpg"); w.print();">Print</a> </center></div><div> <div class="img-wrapper"><img alt="" src="www.image1.com/3.jpg" title="image3.jpg"></img></div> <center><a class="button print" href="javascript: w=window.open("www.image1.com/3.jpg"); w.print();"> Print</a> </center></div> </div></div>'
soup = BeautifulSoup(text, 'lxml')
#print soup.prettify()
# get image URLs
for imgz in soup.find_all('div', attrs={'class': 'img-wrapper'}):
    for imglinks in imgz.find_all('img', src=True):
        # print(imgz)
        url.append(imglinks['src'])
# get names
for ultag in soup.find_all('ul'):
    for litag in ultag.find_all('li'):
        name.append(litag.text)  # collect all names into a list
print(url)
print(name)
Here's another possible route to pulling the urls and names:
url = [tag.get('src') for tag in soup.find_all('img')]
name = [tag.text.strip() for tag in soup.find_all('li')]
print(url)
# ['www.image1.com/1.jpg', 'www.image2.com/2.jpg', 'www.image1.com/3.jpg']
print(name)
# ['NAME1', 'NAME2', 'NAME3']
As for the final list creation, here's something that's functionally similar to what @t.m.adam has suggested:
print([pair + [date] for pair in list(map(list, zip(url, name)))])
# [['www.image1.com/1.jpg', 'NAME1', '2017-10-12'],
# ['www.image2.com/2.jpg', 'NAME2', '2017-10-12'],
# ['www.image1.com/3.jpg', 'NAME3', '2017-10-12']]
Note that map is used fairly infrequently nowadays, and some style guides discourage it in favor of comprehensions.
Or:
n = len(url)
print(list(map(list, zip(url, name, [date] * n))))
# [['www.image1.com/1.jpg', 'NAME1', '2017-10-12'], ['www.image2.com/2.jpg', 'NAME2', '2017-10-12'], ['www.image1.com/3.jpg', 'NAME3', '2017-10-12']]
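The same result can be built without map by using a plain list comprehension. A minimal sketch, assuming url and name have already been extracted as above and putting the name first, as in the question's desired [['image1', 'url1', 'date'], ...] layout (the labels list is a hypothetical set of column names, chosen here only for illustration):

```python
# Values as extracted in the answer above.
url = ['www.image1.com/1.jpg', 'www.image2.com/2.jpg', 'www.image1.com/3.jpg']
name = ['NAME1', 'NAME2', 'NAME3']
date = '2017-10-12'

# One [name, url, date] row per image; zip pairs the two lists item by item.
rows = [[n, u, date] for n, u in zip(name, url)]
print(rows)

# Hypothetical column labels for the later dict(zip(labels, values)) step.
labels = ['name', 'url', 'date']
records = [dict(zip(labels, row)) for row in rows]
print(records)
```

Because zip stops at the shorter input, a mismatch between the number of names and URLs would silently drop rows, so it may be worth checking that len(url) == len(name) first.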
I have this HTML that I am trying to scrape:
<span class="title NSNTitle">
<small class="text-primary"><strong>
ID 1040-KK-143-6964, 1040001436964
</strong></small>
<br>
<small class="text-primary">
MODIFICATION KIT,
</small>
</span>
I use this code
page_soup = soup(page_html, "html.parser")
FSGcontainer = page_soup.find("h1", {"class": "nopad-top"}).find_all("small", {"class": "text-primary"})
for subcontainer in FSGcontainer:
    FSGsubcard = subcontainer
    if FSGsubcard is not None:
        Nomenclature = FSGsubcard.text
        print(Nomenclature)
and I get this output:
NSN 1040-KK-143-6964, 1005009927288
MODIFICATION KIT,
What I really want is the text "MODIFICATION KIT,".
How can I capture just that text and not the IDs?
Use select_one together with a CSS selector that selects the second small element.
nomenclature = page_soup.find(
    "h1", {"class": "nopad-top"}
).select_one(
    'small:nth-of-type(2)'
).text.strip()
Try this. It will let you fetch the specific items you want.
for item in soup.find_all(class_="title"):
    text_item = item.find_all(class_="text-primary")[1].text.strip()
    print(text_item)
Result:
MODIFICATION KIT,
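Both approaches can be tried against the markup from the question. A minimal sketch, assuming the snippet is parsed on its own; it therefore starts from the span with class title shown in the question rather than from the h1 with class nopad-top used in the question's code, which is not part of the posted HTML:

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
page_html = """
<span class="title NSNTitle">
<small class="text-primary"><strong>
ID 1040-KK-143-6964, 1040001436964
</strong></small>
<br>
<small class="text-primary">
MODIFICATION KIT,
</small>
</span>
"""

page_soup = BeautifulSoup(page_html, "html.parser")
title = page_soup.find("span", class_="title")

# Approach 1: CSS selector for the second <small> among its siblings.
nomenclature = title.select_one("small:nth-of-type(2)").get_text(strip=True)

# Approach 2: index into the list of elements with class text-primary.
second = title.find_all(class_="text-primary")[1].get_text(strip=True)

print(nomenclature)
print(second)
```

Both variants depend on the nomenclature always being the second small element; if the order can vary on the real pages, a different criterion would be needed.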
Given the following element
<ul class="listing-key-specs ">
<li>2004 (54 reg)</li>
<li>Hatchback</li>
<li>90,274 miles</li>
<li>Manual</li>
<li>1.2L</li>
<li>60 bhp</li>
<li>Petrol</li>
</ul>
How do I extract each li element and assign it to a variable with beautiful soup?
Currently, my code looks like this:
detail = car.find('ul', {'class': 'listing-key-specs'}).get_text(strip=True)
and it produces the following output:
2005 (05 reg)Saloon66,038 milesManual1.8L118 bhpPetrol
Please refer to the following question for more context: "None" returned during scraping.
Here is a self-contained example:
from bs4 import BeautifulSoup
html_doc="""
<ul class="listing-key-specs ">
<li>2004 (54 reg)</li>
<li>Hatchback</li>
<li>90,274 miles</li>
<li>Manual</li>
<li>1.2L</li>
<li>60 bhp</li>
<li>Petrol</li>
</ul>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
lst = [_.get_text(strip=True) for _ in soup.find('ul', {'class': 'listing-key-specs'}).find_all('li')]
print(lst)
Currently, you are calling get_text() on the ul tag, which simply returns all its contents as one string. So
<div>
<p>Hello </p>
<p>World </p>
</div>
would become Hello World.
To extract each matching sub-tag and store them as separate elements, find the ul first and then call find_all() on it, like this:
tag_list = car.find('ul', class_='listing-key-specs').find_all('li')
my_list = [i.get_text() for i in tag_list]
This will give you a list with the text of every li tag inside the ul with class 'listing-key-specs'. Now you're free to assign variables, e.g. carType = my_list[1]
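If every listing is known to have the same seven entries in the same order, the list can also be unpacked straight into named variables. A minimal sketch under that assumption (the variable names such as body_type and gearbox are illustrative, not taken from the question):

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
html_doc = """
<ul class="listing-key-specs ">
<li>2004 (54 reg)</li>
<li>Hatchback</li>
<li>90,274 miles</li>
<li>Manual</li>
<li>1.2L</li>
<li>60 bhp</li>
<li>Petrol</li>
</ul>
"""

car = BeautifulSoup(html_doc, "html.parser")

# Text of each <li> inside the ul with class listing-key-specs.
specs = [li.get_text(strip=True) for li in car.select("ul.listing-key-specs li")]

# Tuple unpacking raises ValueError if a listing does not have exactly 7 items.
year, body_type, mileage, gearbox, engine_size, power, fuel = specs

print(body_type)
print(mileage)
```

When listings can omit entries, indexing into specs after checking its length is the safer choice.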