How to scrape text of specific cell from website using BeautifulSoup - python

I've been trying to scrape text from a website for the past hour and have made no progress, simply because I have very little knowledge on how to actually use BSoup.
import requests
from bs4 import BeautifulSoup

def select_ticker():
    url = "https://www.barchart.com/stocks/performance/gap/gap-up?screener=nasdaq"
    r = requests.get(url)
    html = r.text
    soup = BeautifulSoup(html)
    find = soup.findAll('td, {"data-ng-if:"row.blankRow"}')
    print(find)
I'm going to this website and trying to get the first symbol from the table. Right now that symbol is BFBG.
I know this should be extremely easy for someone who actually knows what they're doing with BSoup, but I don't understand how the searching works, and this website's markup doesn't make it easy to search either.
I appreciate your time and thanks for the help!

Actually, you cannot scrape the first symbol from the HTML GET request; the table is filled in by JavaScript, so the data never appears in the page source. You need to fetch the JSON instead.
import urllib3
import json

# Request the same JSON endpoint that the page itself loads via XHR
http = urllib3.PoolManager()
r = http.request('GET', 'https://core-api.barchart.com/v1/quotes/get?lists=stocks.gaps.up.nasdaq&orderDir=desc&fields=symbol,symbolName,lastPrice,priceChange,gapUp,highPrice,lowPrice,volume,tradeTime,symbolCode,symbolType,hasOptions&orderBy=gapUp&meta=field.shortName,field.type,field.description&hasOptions=true&page=1&limit=100&raw=1')

# Parse the body and take the symbol of the first row
print(json.loads(r.data)['data'][0]['symbol'])
And there you have the first symbol.
The JSON also contains every other piece of information you would probably want to scrape.
Here is how you can usually find these JSON endpoints:
Open the browser console, go to the Network tab, select the XHR filter, and reload the page. If a lot of resources are fetched, you can also filter by the name of the domain! :)
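To pull out more fields, here is a minimal sketch (continuing from the request above, and assuming each entry of 'data' carries the fields requested in the URL under the same key names):
rows = json.loads(r.data)['data']
for row in rows:
    # each row should be a dict keyed by the fields listed in the URL's fields parameter
    print(row['symbol'], row['lastPrice'], row['gapUp'])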
However, this syntax is wrong:
soup.findAll('td, {"data-ng-if:"row.blankRow"}')
you need to pass the tag name and a dictionary of attributes to the find_all method, according to the BS4 docs:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
soup.find_all('td', {'data-ng-if':'row.blankRow'})
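For completeness, a small sketch of the corrected call in context (note that on this particular page it will most likely return an empty list, since the table rows are rendered by JavaScript, as explained above):
import requests
from bs4 import BeautifulSoup

url = "https://www.barchart.com/stocks/performance/gap/gap-up?screener=nasdaq"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# tag name first, then a dict of attribute filters
cells = soup.find_all('td', {'data-ng-if': 'row.blankRow'})
print(cells)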
Hope this helps

Related

Is there a way to scrape URLs without scraping links?

Basically, I'm trying to download some images from a site which has been "down for maintenance" for over a year. Attempting to visit any page on the website redirects to the forums. However, visiting image URLs directly will still take you to the specified image.
There's a particular post that I know contains more images than I've been able to brute force by guessing possible file names. I thought, rather than typing every combination of characters possible ad infinitum, I could program a web scraper to do it for me.
However, I'm brand new to Python, and I've been unable to overcome the JavaScript redirects. While I've been able to use requests & BeautifulSoup to scrape 'href' values from the page it redirects to, without circumventing the JS I cannot pull anything from the news article that actually holds the links.
import requests
from bs4 import BeautifulSoup
url = 'https://www.he-man.org/news_article.php?id=6136'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    print(link.get('href'))
I've added allow_redirects=False to every field to no avail. I've tried searching for 'jpg' instead of 'href'. I'm currently attempting to install Selenium, though it's fighting me, and I expect I'll just be getting 3xx errors anyway.
The images from the news article all begin with the same 62 characters (https://www.he-man.org/assets/images/home_news/justinedantzer_), so I've also thought maybe there's a way to just infinite-monkeys-on-a-keyboard scrape the rest of it? Or the type of file (.jpg)? I'm open to suggestions here, I really have no idea what direction to come at this thing from now that my first six plans have failed. I've seen a few mentions of scraping spiders, but at this point I've sunk a lot of time into dead ends. Are they worth looking into? What would you recommend?
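One way to act on that last idea is to probe candidate URLs directly and keep the ones that answer with 200, since the images themselves are still served. Here is a rough sketch, where the short alphanumeric suffixes and the .jpg extension are pure assumptions about how the remaining part of the filename might look:
import itertools
import string
import requests

PREFIX = 'https://www.he-man.org/assets/images/home_news/justinedantzer_'

def candidate_suffixes(max_len=3):
    # hypothetical: try short alphanumeric endings; real filenames may be longer or dated
    chars = string.ascii_lowercase + string.digits
    for length in range(1, max_len + 1):
        for combo in itertools.product(chars, repeat=length):
            yield ''.join(combo)

for suffix in candidate_suffixes():
    url = PREFIX + suffix + '.jpg'
    # HEAD keeps each probe cheap; only the hits are worth a follow-up GET
    if requests.head(url, allow_redirects=False).status_code == 200:
        print('found:', url)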

Scrape data from website with frames or flexbox using python requests and BeautifulSoup

I've been trying to figure this out but with no luck. I found a thread (How to scrape data from flexbox element/container with Python and Beautiful Soup) that I thought would help but I can't seem to make any headway.
The site I'm trying to scrape is...http://www.northwest.williams.com/NWP_Portal/. In particular I want to get the data from the tab/frame of 'Storage Levels' but for the life of me I can't seem to navigate to the right spot to get the data. I've tried various iterations of the code below with no success. I've changed 'lxml' to 'html.parser', looked for tables, looked for 'tr' etc but the code always returns empty. I've also tried looking at the network info but when I click on any of the tabs (System Status, PAL/System Balancing etc) I don't see any change in network activity. I'm sure it's something simple that I'm overlooking but I just can't put my finger on it.
from bs4 import BeautifulSoup as soup
import requests
url = 'http://www.northwest.williams.com/NWP_Portal/'
r = requests.get(url)
html = soup(r.content,'lxml')
page = html.findAll('div',{'class':'dailyOperations-panels'})
How can I 'navigate' to the 'Storage Levels' frame/tab? What is the html that I'm actually looking for? Can I do this with just requests and beautiful soup? I'm not opposed to using Selenium but I haven't used it before and would prefer to just use requests and BeautifulSoup if possible.
Thanks in advance!
Hey, so what I notice is you are trying to get "dailyOperations-panels" from a div, which won't work: that markup most likely lives inside a frame, so it never shows up in the outer page that requests downloads.
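A minimal sketch of the usual first step with frame-based pages (assuming the outer page really does embed frame or iframe tags; the src values printed here would then have to be fetched in a second request):
from bs4 import BeautifulSoup as soup
import requests

url = 'http://www.northwest.williams.com/NWP_Portal/'
outer = soup(requests.get(url).content, 'lxml')

# frames/iframes embed separate documents; their src is where the actual content lives
for frame in outer.find_all(['frame', 'iframe']):
    print(frame.get('name'), frame.get('src'))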

How to download all the comments from a news article using Python?

I have to admit that I don't know much HTML. I am trying to extract all the comments from an online news article using Python. I tried using BeautifulSoup, but it seems the comments are not in the HTML source code, although they do show up when inspecting the element. For instance you can check here. http://www.dailymail.co.uk/sciencetech/article-5100519/Elon-Musk-says-Tesla-Roadster-special-option.html#comments
My code is here and I am stuck.
import urllib.request as urllib2
from bs4 import BeautifulSoup
url = "http://www.dailymail.co.uk/sciencetech/article-5100519/Elon-Musk-says-Tesla-Roadster-special-option.html#comments"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
I want to do this
name_box = soup.find('p', attrs={'class': 'comment-body comment-text'})
but this info is not there in the source-code.
Any suggestion, how to move forward?
I have not attempted things like this, but my guess is that if you want to get it directly from the "page source" you'll need something like Selenium to actually navigate the page, since the page is dynamic.
Alternatively, if you're only interested in comments, you may use dailymail.co.uk's API to acquire them.
Note the items in the query string: "max=1000", "&order", etc. You may also need to use the "offset" parameter alongside "max" to find all the comments if the API has a limit on the maximum "max" value.
I do not know where the API is documented; you can discover it by viewing the network requests that your browser makes while you browse the webpage.
You can get comment data from http://www.dailymail.co.uk/reader-comments/p/asset/readcomments/5100519?max=1000&order=desc&rcCache=shout for that page in JSON format. It appears that every article has a number like "5101863" in its URL; you can swap those numbers in for each new story whose comments you want.
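A small sketch of that idea with requests (the 'payload' and 'page' keys are assumptions about the response layout; print the raw JSON first to confirm the actual structure):
import requests

article_id = 5100519
api = 'http://www.dailymail.co.uk/reader-comments/p/asset/readcomments/{}?max=1000&order=desc&rcCache=shout'.format(article_id)

data = requests.get(api).json()
# assumed layout: comments are expected somewhere under a 'payload' -> 'page' list;
# inspect data.keys() first if this comes back empty
for comment in data.get('payload', {}).get('page', []):
    print(comment)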
Thank you FredMan. I did not know about this API. It seems we only need to give the article id and we can get the comments from the article. This was the solution I was looking for.

Scraping Google Patents with requests only returns style and scripts tags

I'm trying to scrape Google Patents using the following code.
import requests
from bs4 import BeautifulSoup

url = 'https://patents.google.com/?q=usb'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc)
But when I try to inspect the document, using
print(soup.prettify())
I cannot get anything other than this: https://pastebin.com/Xu81LdfE.
I checked the requests status and it is returning 200. Where am I going wrong?
The results on that page come from a different URL:
https://patents.google.com/xhr/query?url=q%3Dusb&exp=
So instead of using BeautifulSoup, you could do r.json(), and find what you want in the dictionary it creates.
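A minimal sketch of that approach (the layout of the returned dictionary is an assumption; print its top-level keys first and drill down from there):
import requests

xhr_url = 'https://patents.google.com/xhr/query?url=q%3Dusb&exp='
r = requests.get(xhr_url)
data = r.json()

# inspect the structure before relying on any particular key
print(data.keys())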
The data is not in the HTML, but loaded with JavaScript.
Therefore, BeautifulSoup cannot scrape it.
Consider using the official APIs, as other usage likely violates the Google terms of service, and they will likely block you then.

Using BeautifulSoup to parse facebook

So I'm trying to parse public Facebook pages using BeautifulSoup. I've managed to successfully scrape LinkedIn, but I've spent hours trying to get it to work on Facebook with no luck. The code I'm trying to use looks like this:
for urls in my_urls:
    try:
        page = urllib2.urlopen(urls)
        soup = BeautifulSoup(page)
        info = soup.find_all("div", class_="fsl fwb fcb")
        info2 = info.findall('a')
The part that's frustrating me is that I can get the title element out, and I can even get pretty far down the document, but I can't get to the part that I actually need.
This line successfully grabs the pageTitle:
info = soup.find_all("title", attrs={"id": "pageTitle"})
This line can get pretty far down the list of elements, but can't go any farther.
info = soup.find_all(id="pagelet_timeline_main_column")
Here's a sample page that I'm trying to parse; I want the current city from it:
https://www.facebook.com/100004210542493
and here's a quick screenshot of what the part I want looks like:
http://prntscr.com/1t8xx6
I feel like I'm really close, but I just can't figure it out. Thanks in advance for any help!
EDIT 2: I should also mention that I can successfully print the whole soup and visually find the part I need, but for whatever reason the parsing just won't work the way it should.
Try looking at the content returned by using curl or wget. What you see in the browser is what has been rendered after the JavaScript has been executed.
wget https://www.facebook.com/100004210542493
You might want to use mechanize or Selenium, since you want to simulate a client browser (instead of handling the raw content).
Another issue related to it might be that Beautiful Soup cannot find a CSS class if the object has other classes, too.
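If the multi-class attribute is the culprit, a CSS selector is usually the more robust way to match it. A quick sketch using the class names from the question on toy markup (the sample HTML is made up purely for illustration):
from bs4 import BeautifulSoup

html = '<div class="fcb fsl fwb"><a href="/about">Current city</a></div>'  # toy markup
soup = BeautifulSoup(html, "html.parser")

# class_="fsl fwb fcb" does an exact string match on the class attribute, so a
# different ordering of the classes will miss; select() matches each class independently
for div in soup.select("div.fsl.fwb.fcb"):
    for link in div.find_all("a"):
        print(link.get_text(), link.get("href"))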
