import re
from bs4 import BeautifulSoup
from bs4 import SoupStrainer
import os
import httplib2
def make_soup(s):
    match = re.compile('https://|http://|www.|.com|.in|.org|gov.in')
    if re.search(match, s):
        http = httplib2.Http()
        status, response = http.request(s)
        page = BeautifulSoup(response, parse_only=SoupStrainer('a'))
        return page
    else:
        return None

def is_a_valid_link(href):
    match1 = re.compile('http://|https://')
    match2 = re.compile('/r/WritingPrompts/comments/')
    match3 = re.compile('modpost')
    return re.search(match1, href) and re.search(match2, href) and not re.search(match3, href)

def parse(s):
    c = 0
    flag = 0
    soup = make_soup(s)
    match4 = re.compile('comments')
    if soup != None:
        for tag in soup.find_all('a', attrs={'class': ['title may-blank loggedin']}):
            #if(link['class']!=['author may-blank loggedin']):
            #if(not re.search(re.compile('/r/WritingPrompts/comments/'),link['href'])):
            print(tag.string)
            #break
            flag = 1
            c = c + 1

def count_next_of_current(s):
    soup = make_soup(s)
    match = re.compile('https://www.reddit.com/r/WritingPrompts/?count=')
    for link in soup.find_all('a', {'rel': ['next']}):
        href = link['href']
        return href

def read_reddit_images():
    global f
    f = open('spaceporn.txt', 'w')
    i = int(input('Enter the number of NEXT pages from the front WritingPrompts page that you want to scrape\n'))
    s = 'https://www.reddit.com/r/WritingPrompts/'
    soup = make_soup(s)
    parse(s)
    count = 0
    while count < i:
        s = count_next_of_current(s)
        if s != None:
            parse(s)
            count = count + 1
        else:
            break
    f.close()

read_reddit_images()
I am trying to use this code to extract text from posts. As a first step I want to extract just the header text, then the comments and the submitter. I am stuck on the first step. Why can't it find the specific class I mentioned? Isn't that absolutely unique here?
Yes, I do know about PRAW, but it's absolutely frustrating to get it to work. I have read its not-so-well-written documentation twice, and there's a hard limit on the number of posts that can be accessed at once. This is not the case with BeautifulSoup. Any recommendations relating to web scraping in Python or in any other language?
Each class attribute is stored as an individual class in BS4. It is easier to use a CSS selector via the select() method to match by multiple CSS classes. For example, you can use the following CSS selector to match <a class="title may-blank loggedin"> :
for tag in soup.select('a.title.may-blank.loggedin'):
    ...
The syntax for searching for a class with find_all() is
soup.find_all(class_="className")
Note the underscore after class. If you leave it out, Python will raise a SyntaxError, because class is a reserved keyword.
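To see the two lookups side by side, here is a minimal sketch; the markup is a made-up stand-in for the real Reddit link, not taken from the site:

```python
from bs4 import BeautifulSoup

# hypothetical markup standing in for a Reddit post link
html = '<a class="title may-blank loggedin" href="/post/1">First post</a>'
soup = BeautifulSoup(html, "html.parser")

# CSS selector: matches elements carrying all three classes, in any order
print([tag.get_text() for tag in soup.select("a.title.may-blank.loggedin")])

# find_all with class_ matches against each individual class value
print([tag.get_text() for tag in soup.find_all("a", class_="title")])
```

Both lines print ['First post'] here; the select() form is the one that scales to multiple classes.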
Related
I'm trying to scrape data in a widget using Python and the requests-html library.
The value I want is in a gauge with an arrow pointing to one of five possible results.
Each label on the gauge is the same on all pages of the website. The problem I face is that I cannot use a CSS selector on the gauge labels to extract the text; I need to extract the value of the arrow itself, as it will be pointing to a label. The arrow doesn't have a text attribute, so if I use a CSS selector I get None as a response.
Each arrow has a unique class name.
<div class="arrow-F-uE7IX8 arrowToStrongBuy-1ydGKDOo arrowStrongBuyShudder-3xsGK8k5">
https://www.tradingview.com/symbols/NASDAQ-MDB/
StrongBuy:
<div class="arrow-F-uE7IX8 arrowToBuy-1R7d8UMJ arrowBuyShudder-3GMCnG5u">
https://www.tradingview.com/symbols/NYSE-XOM/
Buy:
<div class="arrow-F-uE7IX8 arrowToStrongSell-3UWimXJs arrowStrongSellShudder-2UJhm0_C">
https://www.tradingview.com/symbols/NASDAQ-IDEX/
StrongSell:
What can I do to ensure I get the correct value? I'm not sure how I can check whether the selector contains arrowTo{foo} and store that as a variable.
import pyppdf.patch_pyppeteer
from requests_html import AsyncHTMLSession

asession = AsyncHTMLSession()

async def get_page():
    code = 'NASDAQ-MDB'
    r = await asession.get(f'https://www.tradingview.com/symbols/{code}/')
    await r.html.arender(wait=3)
    return r

results = asession.run(get_page)

for result in results:
    arrow_class_placeholder = "//div[contains(@class,'arrow-F-uE7IX8 arrowToStrongBuy-1ydGKDOo')]//div[1]"
    arrow_class_name = result.html.xpath(arrow_class_placeholder, first=True)
    if arrow_class_name == "//div[contains(@class,'arrow-F-uE7IX8 arrowToStrongBuy-1ydGKDOo')]//div[1]":
        print('StrongBuy')
    else:
        print('not strong buy')
You can use BeautifulSoup4 (bs4), which is a Python library for pulling data out of HTML and XML files, in combination with regular expressions (regex). In this case I used the Python re library for the regex part.
Something like this is what you want (source):
For example, soup.find_all(class_=re.compile("itle")) returns all instances where the string "itle" is found in the class attribute, such as class="title".
For your regex it would look something like "arrowTo*" or even just "arrowTo": soup.find_all(class_=re.compile("arrowTo")).
Your final code should look something like:
import re
from bs4 import BeautifulSoup

# i think result was your html document from the requests library
# the first parameter is your html document variable
soup = BeautifulSoup(result, 'html.parser')
myArrowToList = soup.find_all(class_=re.compile("arrowTo"))
If you wanted "arrowToStrongBuy" just use that in the regex input to the find_all function.
soup.find_all(class_=re.compile("arrowToStrongBuy"))
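For instance, here is a rough sketch against a static copy of the first div from the question; the step that trims the matched class name down to a label like "StrongBuy" is my own guess at what you are after:

```python
import re
from bs4 import BeautifulSoup

# static copy of the StrongBuy arrow div from the question
html = '<div class="arrow-F-uE7IX8 arrowToStrongBuy-1ydGKDOo arrowStrongBuyShudder-3xsGK8k5"></div>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.find(class_=re.compile("arrowTo"))
# pick out the class that starts with "arrowTo" and trim it down to the label
raw = next(c for c in tag["class"] if c.startswith("arrowTo"))
label = raw[len("arrowTo"):].split("-")[0]
print(label)  # StrongBuy
```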
I use BeautifulSoup for parsing a Google search, but I get an empty list. I want to make a spellchecker by using Google's "Did you mean?".
import requests
from bs4 import BeautifulSoup
import urllib.parse
text = "i an you ate goode maan"
data = urllib.parse.quote_plus(text)
url = 'https://translate.google.com/?source=osdd#view=home&op=translate&sl=auto&tl=en&text='
rq = requests.get(url + data)
soup = BeautifulSoup(rq.content, 'html.parser')
words = soup.select('.tlid-spelling-correction spelling-correction gt-spell-correct-message')
print(words)
The output is just [], but I expected "i and you are good man" (sorry for such a bad text example).
First, the element you are looking for is loaded using JavaScript. Since BeautifulSoup does not run JS, the target elements never get loaded into the DOM, so the query selector can't find them. Try using Selenium instead of BeautifulSoup.
Second, the CSS selector should be
.tlid-spelling-correction.spelling-correction.gt-spell-correct-message
Notice the . instead of a space in front of every class name.
I have verified it using the JS query selector.
The selector you were using, .tlid-spelling-correction spelling-correction gt-spell-correct-message, was looking for an element with class gt-spell-correct-message inside an element with class spelling-correction, which itself was inside another element with class tlid-spelling-correction.
By removing the spaces and putting a dot in front of every class name, the selector looks for a single element with all three of the above-mentioned classes.
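The difference is easy to demonstrate on a static snippet (the text below is made up; the real element is rendered by JS, so this only illustrates the selector syntax):

```python
from bs4 import BeautifulSoup

html = ('<span class="tlid-spelling-correction spelling-correction '
        'gt-spell-correct-message">i and you are good man</span>')
soup = BeautifulSoup(html, "html.parser")

# space-separated: a descendant selector, so nothing matches here
print(soup.select(".tlid-spelling-correction .spelling-correction .gt-spell-correct-message"))

# dot-joined: one element that carries all three classes
print(soup.select(".tlid-spelling-correction.spelling-correction.gt-spell-correct-message")[0].get_text())
```

The first select() prints an empty list; the second finds the element and prints its text.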
I am trying to grab all the text inside a tag that has a specific class name. I believe I am very close to getting it right, so I think all it'll take is a simple fix.
In the website these are the tags I'm trying to retrieve data from. I want 'SNP'.
<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>
From what I have currently:
from lxml import html
import requests

def main():
    url_link = "http://finance.yahoo.com/q?s=^GSPC&d=t"
    page = html.fromstring(requests.get(url_link).text)
    for span_tag in page.xpath("//span"):
        class_name = span_tag.get("class")
        if class_name is not None:
            if "rtq_exch" == class_name:
                print(url_link, span_tag.text)

if __name__ == "__main__":
    main()
I get this:
http://finance.yahoo.com/q?s=^GSPC&d=t None
To show that it works, when I change this line:
if "rtq_dash" == class_name:
I get this (please note the '-' which is the same content between the tags):
http://finance.yahoo.com/q?s=^GSPC&d=t -
What I think is happening is it sees the child tag and stops grabbing the data, but I'm not sure why.
I would be happy with receiving
<span class="rtq_dash">-</span>SNP
as a string for span_tag.text, as I can easily chop off what I don't want.
A higher description, I'm trying to get the stock symbol from the page.
Here is the documentation for requests, and here is the documentation for lxml (xpath).
I want to use xpath instead of BeautifulSoup for several reasons, so please don't suggest changing to use that library instead, not that it'd be any easier anyway.
There are some possible ways. You can find the outer span and return its direct-child text node:
>>> url_link = "http://finance.yahoo.com/q?s=^GSPC&d=t"
>>> page = html.fromstring(requests.get(url_link).text)
>>> for span_text in page.xpath("//span[@class='rtq_exch']/text()"):
...     print(span_text)
...
SNP
or find the inner span and get its tail:
>>> for span_tag in page.xpath("//span[@class='rtq_dash']"):
...     print(span_tag.tail)
...
SNP
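Both approaches can be checked offline against a static copy of the markup from the question (requires lxml):

```python
from lxml import html

# static copy of the markup from the question, wrapped in a div so lxml
# parses it as a fragment with a single root
tree = html.fromstring('<div><span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span></div>')

# text() returns the direct-child text nodes of the outer span
print(tree.xpath("//span[@class='rtq_exch']/text()"))

# .tail is the text that follows the inner span inside its parent
print(tree.xpath("//span[@class='rtq_dash']")[0].tail)
```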
Use BeautifulSoup:
import bs4

html = """<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>"""
soup = bs4.BeautifulSoup(html, "html.parser")
snp = list(soup.findAll("span", class_="rtq_exch")[0].strings)[1]
I have half-written code to pull the titles and links from an RSS feed, but it results in the above error. The error occurs in both functions while getting the text. I want to strip the title and link tags from the entered string.
from bs4 import BeautifulSoup
import urllib.request
import re
def getlink(a):
    a = str(a)
    bsoup = BeautifulSoup(a)
    a = bsoup.find('link').getText()
    return a

def gettitle(b):
    b = str(b)
    bsoup = BeautifulSoup(b)
    b = bsoup.find('title').getText()
    return b

webpage = urllib.request.urlopen("http://feeds.feedburner.com/JohnnyWebber?format=xml").read()
soup = BeautifulSoup(webpage)
titlesoup = soup.findAll('title')
linksoup = soup.findAll('link')

for i, j in zip(titlesoup, linksoup):
    i = getlink(i)
    j = gettitle(j)
    print(i)
    print(j)
    print("\n")
EDIT: falsetru's method worked perfectly.
I have one more question: can text be extracted out of any tag by just calling getText()?
I expect the problem is in
def getlink(a):
...
a=bsoup.find('a').getText()
....
Remember find matches tag names, there is no link tag but an a tag. BeautifulSoup will return None from find if there is no matching tag, thus the NoneType error. Check the docs for details.
Edit:
If you really are looking for the text 'link', you can use bsoup.find(text=re.compile('link')).
i and j are already the title and link. Why do you find them again?
for i, j in zip(titlesoup, linksoup):
    print(i.getText())
    print(j.getText())
    print("\n")
Besides that, pass features='xml' to BeautifulSoup if you are parsing an XML file.
soup = BeautifulSoup(webpage, features='xml')
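The parser choice matters here because html.parser treats <link> as a void HTML tag, so its text ends up outside the element, while the xml parser (which needs lxml installed) keeps it as a normal paired tag. A small sketch with a made-up feed item:

```python
from bs4 import BeautifulSoup

feed = "<item><title>First post</title><link>http://example.com/1</link></item>"

# html.parser: <link> is treated as void, so the URL is not inside the tag
html_soup = BeautifulSoup(feed, "html.parser")
print(repr(html_soup.find("link").getText()))  # ''

# xml parser: <link> keeps its text content
xml_soup = BeautifulSoup(feed, features="xml")
print(xml_soup.find("link").getText())  # http://example.com/1
```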
b = bsoup.find('title') returns None.
Try checking your input.
I'd like to get items from a website with BeautifulSoup.
<div class="post item">
The target tag is this one.
Its class attribute has two values separated by white space.
First, I wrote,
roots = soup.find_all("div", "post item")
But, it didn't work.
Then I wrote,
html.find_all("div", {'class':['post', 'item']})
I could get items with this, but I am not sure whether it is correct.
Is this code correct?
//// Additional ////
I am sorry,
html.find_all("div", {'class':['post', 'item']})
didn't work properly.
It also extracts class="item".
And I had to write
soup.find_all("div", class_="post item")
with class_=, not class=. Although this doesn't work for me either... (>_<)
Target url:
https://flipboard.com/section/%E3%83%8B%E3%83%A5%E3%83%BC%E3%82%B9-3uscfrirj50pdtqb
mycode:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup

def main():
    target = "https://flipboard.com/section/%E3%83%8B%E3%83%A5%E3%83%BC%E3%82%B9-3uscfrirj50pdtqb"
    html = urlopen(target)
    soup = BeautifulSoup(html, "html.parser")
    roots = soup.find_all("div", class_="post item")
    print(roots)
    for root in roots:
        print("##################")

if __name__ == '__main__':
    main()
You could use a CSS selector:
soup.select("div.post.item")
Or use class_
.find_all("div", class_="post item")
The docs suggest that if you want to search for tags that match two or more CSS classes, you should use a CSS selector, as per the first example. They give examples of both uses:
You can also search for the exact string value of the class attribute:
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]
If you want to search for tags that match two or more CSS classes, you should use a CSS selector:
css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]
Why your code and any of the above solutions would fail has more to do with the fact that the class does not exist in the source; if it were there, they would all work:
In [6]: r = requests.get("https://flipboard.com/section/%E3%83%8B%E3%83%A5%E3%83%BC%E3%82%B9-3uscfrirj50pdtqb")
In [7]: cont = r.content
In [8]: "post item" in cont
Out[8]: False
If you look at the browser source and do a search you won't find it either. It is generated dynamically and can only be seen if you crack open a developer console or Firebug. The divs also only contain some styling and React ids, so I am not sure what you expect to pull from them even if you did get them.
If you want to get the html that you see in the browser, you will need something like selenium
First of all, note that class is a very special multi-valued attribute and it is a common source of confusion in BeautifulSoup.
html.find_all("div", {'class':['post', 'item']})
This would find all div elements that have either post class or item class (or both, of course). This may produce extra results you don't want to see, assuming you are after div elements with strictly class="post item". If this is the case, you can use a CSS selector:
html.select('div[class="post item"]')
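A quick sketch of the difference on made-up markup:

```python
from bs4 import BeautifulSoup

html = '<div class="post item">both</div><div class="item">item only</div>'
soup = BeautifulSoup(html, "html.parser")

# class list: matches divs that have post OR item, so both divs are returned
print(len(soup.find_all("div", {"class": ["post", "item"]})))  # 2

# exact attribute string: only the div whose class is exactly "post item"
print(len(soup.select('div[class="post item"]')))  # 1
```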
There is also some more information in a similar thread:
BeautifulSoup returns empty list when searching by compound class names