I've tried to crawl a website using BeautifulSoup and ran into this element:
<p data-v-57d17052 class="text text--gray70 text--subtitle2">Hello</p>
and, for some reason, it doesn't match with BeautifulSoup's search methods.
title = soup.find_all(class_={"text, text--gray70, text--subtitle2"})
I think data-v-57d17052 is causing this difficulty.
Does anyone know how to solve this issue?
I have tried BeautifulSoup's various search methods and none of them work.
Remove the commas from the class_= argument; the class attribute holds three separate class names, not one comma-separated string:
from bs4 import BeautifulSoup
html_doc = '''<p data-v-57d17052 class="text text--gray70 text--subtitle2">Hello</p>'''
soup = BeautifulSoup(html_doc, 'html.parser')
title = soup.find(class_="text text--gray70 text--subtitle2")
print(title.text)
Prints:
Hello
You can also use a CSS selector:
title = soup.select_one(".text.text--gray70.text--subtitle2")
print(title.text)
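For completeness, a single class name also matches on its own, because BeautifulSoup treats class as a multi-valued attribute and splits it into a list; a quick sketch:

```python
from bs4 import BeautifulSoup

html = '<p data-v-57d17052 class="text text--gray70 text--subtitle2">Hello</p>'
soup = BeautifulSoup(html, 'html.parser')

# class_ with one class name matches any element carrying that class,
# regardless of what other classes the element has
print(soup.find(class_='text--gray70').text)  # Hello

# a compound CSS selector requires all three classes at once
print(soup.select_one('.text.text--gray70.text--subtitle2').text)  # Hello
```

So the data-v-57d17052 attribute was never the problem; it is simply ignored by both lookups.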
I am a beginner in Python 3, working on a Selenium project for a website.
The text that I want is under the XPath //div[@class='classname']//span[@class='classname2'],
but I cannot extract it without BeautifulSoup:
for i in postsContainer.extract():
    soup = bs(i)
    people.append([soup.find("div", {"class": "classname"}).text])
but it doesn't work without the //span part. How can I express that path in BeautifulSoup?
Any help would be appreciated.
If there were some more HTML to inspect we might find a better solution, but you can use CSS selectors in this case:
soup.select_one('div.css-901oao.r-18jsvk2.r-1qd0xha.r-a023e6.r-16dba41.r-ad9z0x.r-bcqeeo.r-bnwqim.r-qvutc0 > span.css-901oao.css-16my406.r-poiln3.r-bcqeeo.r-qvutc0').get_text()
or:
soup.select_one('div.css-901oao.r-18jsvk2.r-1qd0xha.r-a023e6.r-16dba41.r-ad9z0x.r-bcqeeo.r-bnwqim.r-qvutc0 > span').get_text()
Example
from bs4 import BeautifulSoup
html='''
<div class="classname">
<span class="classname2">text</span>
</div>
'''
soup = BeautifulSoup(html,'html.parser')
soup.select_one('div.classname span.classname2').get_text()
I'm having some serious issues trying to extract the titles from a webpage. I've done this before on some other sites, but this one seems to be an issue because of the JavaScript.
The test link is "https://www.thomasnet.com/products/adhesives-393009-1.html"
The first title I want extracted is "Toagosei America, Inc."
Here is my code:
import requests
from bs4 import BeautifulSoup
url = ("https://www.thomasnet.com/products/adhesives-393009-1.html")
r = requests.get(url).content
soup = BeautifulSoup(r, "html.parser")
print(soup.get_text())
Now if I run it like this, with get_text, I can find the titles in the result; however, as soon as I change it to find_all or find, the titles are lost. I can't find them using the web browser's inspect tool because it's all JavaScript-generated.
Any advice would be greatly appreciated.
You have to specify what to find, in this case an <h2> tag, to get the first title:
import requests
from bs4 import BeautifulSoup
url = 'https://www.thomasnet.com/products/adhesives-393009-1.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
first_title = soup.find('h2')
print(first_title.text)
Prints:
Toagosei America, Inc.
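If you need every supplier name rather than just the first, find_all('h2') returns them all. A minimal offline sketch, where the inline HTML is a stand-in for requests.get(url).content (the real page may nest its <h2> tags differently):

```python
from bs4 import BeautifulSoup

# stand-in for the downloaded page content
html = '''
<h2>Toagosei America, Inc.</h2>
<h2>Another Supplier Co.</h2>
'''
soup = BeautifulSoup(html, 'html.parser')

# find() returns only the first match; find_all() returns every <h2>
titles = [h2.get_text(strip=True) for h2 in soup.find_all('h2')]
print(titles)  # ['Toagosei America, Inc.', 'Another Supplier Co.']
```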
Hi, I'm quite new to Python and my boss has asked me to scrape this data, but it is not my strong point, so I was wondering how I would go about it.
The text that I'm after also changes inside the quote marks every few minutes, so I'm also not sure how to locate it.
I am using Beautiful Soup at the moment, plus lxml, but if there are better alternatives I'm happy to try them.
This is the inspected element of the webpage:
<div class="sometext">
  <h3>somemoretext</h3>
  <p>
    <span class="title" title="text i want">text i want</span>
    <br>
  </p>
</div>
I have tried using:
from lxml import html
import requests
from bs4 import BeautifulSoup
page = requests.get('the url')
soup = BeautifulSoup(page.text)
r = soup.findAll('//span[#class="title"]/text()')
print r
Thank you in advance,any help would be appreciated!
First do this to get what you are looking at in the soup:
soup = BeautifulSoup(page)
print soup
That way you can double check that you are actually dealing will what you think you are dealing with.
Then do this:
r = soup.findAll('span', attrs={"class":"title"})
for span in r:
    print span.text
This will get all the span tags with a class=title, and then text will print out all the text in between the tags.
Edited to Add
Note that esecules' answer will get you the title from the attribute (<span class="title" title="text i want">), whereas mine will get the title from the text between the tags (<span class="title">text i want</span>).
Perhaps find is the method you really need, since you're only ever looking for one element (docs):
r = soup.find('div', 'sometext').find('span','title')['title']
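To make the difference between the two answers concrete, attribute access versus tag text, here is a small sketch with invented values:

```python
from bs4 import BeautifulSoup

html = ('<div class="sometext">'
        '<span class="title" title="attribute value">tag text</span>'
        '</div>')
soup = BeautifulSoup(html, 'html.parser')
span = soup.find('div', 'sometext').find('span', 'title')

print(span['title'])  # the title= attribute: attribute value
print(span.text)      # the text between the tags: tag text
```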
If you're familiar with XPath and you don't need features specific to BeautifulSoup, then using lxml alone is enough (or maybe even better, since lxml is known to be faster):
from lxml import html
import requests
page = requests.get('the url')
root = html.fromstring(page.text)
r = root.xpath('//span[@class="title"]/text()')
print r
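A self-contained version of the same idea, with inline HTML standing in for the network call:

```python
from lxml import html

doc = html.fromstring(
    '<div class="sometext"><span class="title" title="text i want">text i want</span></div>'
)

# text() selects the text nodes inside every matching span;
# xpath() always returns a list, one entry per match
print(doc.xpath('//span[@class="title"]/text()'))  # ['text i want']
```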
from bs4 import BeautifulSoup
import requests
url ="http://www.basketnews.lt/lygos/59-nacionaline-krepsinio-asociacija/2013/naujienos.html"
r = requests.get(url)
soup = BeautifulSoup(r.text)
naujienos = soup.findAll('a', {'class':'title'})
print naujienos
Here is the important part of the HTML:
<div class="title">
<span class="feedbacks"></span>
</div>
I get an empty list. Where is my mistake?
EDIT:
Thanks it worked. Now I want to print news titles. This is how I am trying to do it:
nba = soup.select('div.title > a')
for i in nba:
    print ""+i.string+"\n"
I get at most 5 titles, then an error occurs: cannot concatenate 'str' and 'NoneType' objects.
soup.findAll('a', {'class':'title'})
This says: give me all <a> tags that also have class="title". That's obviously not what you're trying to do.
I think you want <a> tags that are direct children of a tag with class="title". You can try using a CSS selector:
soup.select('div.title > a')
Out[58]:
[Blatche'as: “Garantuoju, kad laimėsime”,
<a href="/news-73147-rockets-veikiausiai-pasiliks-mchalea.html">“Rockets” veikiausiai pasiliks McHale’ą
</a>,
# snip lots of other links
]
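A self-contained sketch of the difference (the HTML is invented for illustration): the attribute lookup matches links that themselves carry class="title", while div.title > a matches links that are direct children of the div:

```python
from bs4 import BeautifulSoup

html = '''
<div class="title"><a href="/news-1.html">Wanted headline</a></div>
<a class="title" href="/other.html">Unwanted link</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# matches <a> tags whose own class is "title" - not what we want
print([a.text for a in soup.find_all('a', {'class': 'title'})])  # ['Unwanted link']

# matches <a> tags that are direct children of <div class="title">
print([a.text for a in soup.select('div.title > a')])            # ['Wanted headline']
```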
I'm working on a python script to do some web scraping. I want to find the base URL of a given section on a web page that looks like this:
<div class='pagination'>
<a href='webpage-category/page/1'>1</a>
<a href='webpage-category/page/2'>2</a>
...
</div>
So I just need everything from the first href except the number ('webpage-category/page/'), and I have the following working code:
pages = [l['href'] for link in soup.find_all('div', class_='pagination')
for l in link.find_all('a') if not re.search('pageSub', l['href'])]
s = pages[0]
f = ''.join([i for i in s if not i.isdigit()])
The problem is that generating this whole list is wasteful, since I just need the first href. I think a generator would be the answer, but I couldn't pull it off. Maybe you guys could help me make this code more concise?
What about this:
from bs4 import BeautifulSoup
html = """ <div class='pagination'>
<a href='webpage-category/page/1'>1</a>
<a href='webpage-category/page/2'>2</a>
</div>"""
soup = BeautifulSoup(html)
link = soup.find('div', {'class': 'pagination'}).find('a')['href']
print '/'.join(link.split('/')[:-1])
prints:
webpage-category/page
Just FYI, speaking about the code you've provided: you can use next() instead of a list comprehension:
s = next(l['href'] for link in soup.find_all('div', class_='pagination')
for l in link.find_all('a') if not re.search('pageSub', l['href']))
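A minimal sketch of the next() pattern, with a plain list standing in for the soup results; the optional second argument is returned when nothing matches, instead of raising StopIteration:

```python
hrefs = ['webpage-category/page/1', 'webpage-category/page/2']

# next() consumes the generator only until the first match,
# so the remaining items are never even examined
first = next((h for h in hrefs if 'pageSub' not in h), None)
print(first)  # webpage-category/page/1

# with no match, the default is returned instead of StopIteration
missing = next((h for h in hrefs if 'pageSub' in h), None)
print(missing)  # None
```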
UPD (using the website link provided):
import urllib2
from bs4 import BeautifulSoup
url = "http://www.hdwallpapers.in/cars-desktop-wallpapers/page/2"
soup = BeautifulSoup(urllib2.urlopen(url))
links = soup.find_all('div', {'class': 'pagination'})[1].find_all('a')
print next('/'.join(link['href'].split('/')[:-1]) for link in links
if link.text.isdigit() and link.text != "1")