ScraperWiki Python Loop Issue

I'm creating a scraper through ScraperWiki using Python, but I'm having an issue with the results I get. I'm basing my code on the basic example in ScraperWiki's docs, and everything seems very similar, so I'm not sure where my issue is. I get the first document's title/URL from the page, but there seems to be a problem with the loop, as it does not return the remaining documents after that one. Any advice is appreciated!
import scraperwiki
import requests
import lxml.html

html = requests.get("http://www.store.com/us/a/productDetail/a/910271.htm").content
dom = lxml.html.fromstring(html)
for entry in dom.cssselect('.downloads'):
    document = {
        'title': entry.cssselect('a')[0].text_content(),
        'url': entry.cssselect('a')[0].get('href')
    }
    print document

You need to iterate over the a tags inside the div with class downloads:
for entry in dom.cssselect('.downloads a'):
    document = {
        'title': entry.text_content(),
        'url': entry.get('href')
    }
    print document
Prints:
{'url': '/webassets/kpna/catalog/pdf/en/1012741_4.pdf', 'title': 'Rough In/Spec Sheet'}
{'url': '/webassets/kpna/catalog/pdf/en/1012741_2.pdf', 'title': 'Installation and Care Guide with Service Parts'}
{'url': '/webassets/kpna/catalog/pdf/en/1204921_2.pdf', 'title': 'Installation and Care Guide without Service Parts'}
{'url': '/webassets/kpna/catalog/pdf/en/1011610_2.pdf', 'title': 'Installation Guide without Service Parts'}
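The distinction matters because '.downloads' matches only the wrapper element, so the loop body runs once, while '.downloads a' matches every link inside it. The same idea can be sketched self-contained with the standard library's ElementTree (hypothetical sample markup, not the store's real page):

```python
import xml.etree.ElementTree as ET

# Hypothetical markup mirroring the page: one wrapper div, several links.
html = """<body>
<div class="downloads">
  <a href="/doc1.pdf">Spec Sheet</a>
  <a href="/doc2.pdf">Install Guide</a>
</div>
</body>"""

root = ET.fromstring(html)
wrappers = root.findall(".//div[@class='downloads']")  # the container
links = wrappers[0].findall(".//a")                    # every link inside it

print(len(wrappers))  # 1 -- looping over the wrapper yields a single entry
print(len(links))     # 2 -- looping over the links yields one entry per document
```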

Unable to locate div.class element in html using BeautifulSoup

I am trying to run the following BeautifulSoup code on https://apps.npr.org/best-books/#view=list&year=2022 to locate the titles of the books listed on this page. I am using the code below, which I have confirmed generally works as a basic web scraper:
import requests
from bs4 import BeautifulSoup
url = 'https://apps.npr.org/best-books/#view=list&year=2022'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('div',{'class':'title'})
I would expect this to yield a list of all the book titles. Instead, I am getting an empty list, which essentially means it is not finding the HTML I'm searching for.
For reference, an example of an html string that has the information I want (which can equivalently be found by inspecting the source of the page I linked above):
<div class="title">(Serious) New Cook: Recipes, Tips, and Techniques</div>
Any tips on how to troubleshoot this?
Simply fetch the data from the API; you can find it in your browser's dev tools, under the XHR tab:
import requests
requests.get('https://apps.npr.org/best-books/2022.json').json()
Output
[{'title': 'The School for Good Mothers: A Novel',
'author': 'Jessamine Chan',
'dimensions': {'width': 329, 'height': 500},
'cover': '1982156120',
'tags': ['sci fi, fantasy & speculative fiction',
'book club ideas',
'eye-opening reads',
'family matters',
'identity & culture',
'the states we’re in',
'staff picks',
'the dark side'],
'id': 1},
{'title': 'Young Mungo',
'author': 'Douglas Stuart',
'dimensions': {'width': 336, 'height': 500},
'cover': '0802159559',
'tags': ['realistic fiction',
'book club ideas',
'family matters',
'identity & culture',
'love stories',
'seriously great writing',
'tales from around the world',
'staff picks'],
'id': 2},...]
This is most likely because the content is loaded using JavaScript. You could either fetch from the API as HedgeHog suggested, or load the JavaScript on the page using an automated browser like Selenium.
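If you go the API route, the titles are plain fields on each record. A minimal sketch pulling them out, using a trimmed local copy of the JSON shape shown above so it runs without a network call:

```python
import json

# Two sample records trimmed from the API output shown above.
sample = json.loads("""[
  {"title": "The School for Good Mothers: A Novel", "author": "Jessamine Chan", "id": 1},
  {"title": "Young Mungo", "author": "Douglas Stuart", "id": 2}
]""")

# Each record is a dict; the title is a top-level key.
titles = [book["title"] for book in sample]
print(titles)  # ['The School for Good Mothers: A Novel', 'Young Mungo']
```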

Web Scraping - The data changes on every scroll and only 20 entries are shown

I am trying to web scrape this web site, but the page's content changes when I scroll, and only 20 entries are shown.
As shown in my code below, it only gives 20 entries, and if I scroll down before running the code, the 20 entries change.
I want to get all 896 entries at once.
import requests
from bs4 import BeautifulSoup

main_titles, main_links = [], []
main = requests.get("https://www.sarbanes-oxley-forum.com/category/20/general-sarbanes-oxley-discussion/21")
soup = BeautifulSoup(main.content, "lxml")
main_dump = soup.find_all("h2", {"class": "title", "component": {"topic/header"}})
for k in range(len(main_dump)):
    main_titles.append(main_dump[k].find("a").text)
    main_links.append("https://www.sarbanes-oxley-forum.com/" + main_dump[k].find("a").attrs["href"])
print(len(main_links))
Output: 20
You do not need BeautifulSoup or Selenium in this case, because you can get all the data, structured, from their API. Simply request the first page, check the number of topics, and iterate over the pages:
https://www.sarbanes-oxley-forum.com/api/category/20/general-sarbanes-oxley-discussion/1
Example
Note: replace 894 with 1 to start from the first page; I limited the number of requests for this demo by starting at page 894:
import requests

api_url = 'https://www.sarbanes-oxley-forum.com/api/category/20/general-sarbanes-oxley-discussion/'
data = []

for i in range(894, requests.get(api_url + '1').json()['totalTopicCount'] + 1):
    for e in requests.get(api_url + str(i)).json()['topics']:
        data.append({
            'title': e['title'],
            'url': 'https://www.sarbanes-oxley-forum.com/topic/' + e['slug']
        })
data
Output
[{'title': 'We have a team of expert',
'url': 'https://www.sarbanes-oxley-forum.com/topic/8550/we-have-a-team-of-expert'},
{'title': 'What is the privacy in Google Nest Wifi?',
'url': 'https://www.sarbanes-oxley-forum.com/topic/8552/what-is-the-privacy-in-google-nest-wifi'},
{'title': 'Reporting Requirements _and_amp; Financial Results Release Timin 382',
'url': 'https://www.sarbanes-oxley-forum.com/topic/6214/reporting-requirements-_and_amp-financial-results-release-timin-382'},
{'title': 'Use of digital signatures instead of wet ink signatures on control documentation',
'url': 'https://www.sarbanes-oxley-forum.com/topic/8476/use-of-digital-signatures-instead-of-wet-ink-signatures-on-control-documentation'},...]

Python Requests-HTML - Can't find specific data

I am trying to scrape a web page using the Python requests-html library.
The link to that web page is https://www.koyfin.com/charts/g/USADebt2GDP?view=table .
A screenshot (not reproduced here) marked in red the data I want to get.
My code is like this:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.koyfin.com/charts/g/USADebt2GDP?view=table')
r.html.render(timeout=60)
print(r.text)
The web page's HTML looks like this (snippet omitted). The problem is that when I scrape the web page I can't find the data I want, even though in the HTML source I can see the data inside the first div tags in the body section.
Any specific suggestions for how to solve this?
Thanks.
The problem is that the data is being loaded by JavaScript code after the initial page load. One solution is to use Selenium to drive a web browser to scrape the page. But using a regular browser I looked at the network requests that were being made and it appears that the data you seek is being loaded with the following AJAX call:
https://api.koyfin.com/api/v2/commands/g/g.gec/USADebt2GDP?dateFrom=2010-08-20&dateTo=2020-09-05&period=yearly
So:
import requests

response = requests.get('https://api.koyfin.com/api/v2/commands/g/g.gec/USADebt2GDP?dateFrom=2010-08-20&dateTo=2020-09-05&period=yearly')
results = response.json()
print(results)
for t in results['graph']['data']:
    print(t)
Prints:
{'ticker': 'USADebt2GDP', 'companyName': 'United States Gross Federal Debt to GDP', 'startDate': '1940-12-31T00:00:00.000Z', 'endDate': '2019-12-31T00:00:00.000Z', 'unit': 'percent', 'graph': {'column_names': ['Date', 'Volume'], 'data': [['2010-12-31', 91.4], ['2011-12-31', 96], ['2012-12-31', 100.1], ['2013-12-31', 101.2], ['2014-12-31', 103.2], ['2015-12-31', 100.8], ['2016-12-31', 105.8], ['2017-12-31', 105.4], ['2018-12-31', 106.1], ['2019-12-31', 106.9]]}, 'withoutLiveData': True}
['2010-12-31', 91.4]
['2011-12-31', 96]
['2012-12-31', 100.1]
['2013-12-31', 101.2]
['2014-12-31', 103.2]
['2015-12-31', 100.8]
['2016-12-31', 105.8]
['2017-12-31', 105.4]
['2018-12-31', 106.1]
['2019-12-31', 106.9]
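Once you have that response, the rows live under graph.data as [date, value] pairs. A self-contained sketch using a trimmed copy of the payload printed above, so it runs without the network call:

```python
import json

# Trimmed copy of the API payload shown above.
payload = json.loads("""{
  "ticker": "USADebt2GDP",
  "graph": {"column_names": ["Date", "Volume"],
            "data": [["2010-12-31", 91.4], ["2011-12-31", 96]]}
}""")

# Each row is a [date, value] pair, per graph.column_names.
for date, value in payload["graph"]["data"]:
    print(date, value)
```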
How I Came Up with the URL
Looking at the network requests in the browser's dev tools, clicking the last entry reveals the API call above (screenshots omitted).

Getting Image URL's in Scrapy

I am very new to any form of coding. I started the learning process by attempting to make a simple crawler with Scrapy. It kinda works, but for some reason I can't get an image URL to output properly. It spits out a "data:image/gif;base64..." value instead of the actual link in the src attribute. I've looked for answers, but I can't seem to find anything that gives me a definitive answer (plus I may not fully understand the issue). Any help would be greatly appreciated.
def parse(self, response):
    for data in response.css("a.styles__link--2pzz4"):
        yield {
            'title': data.css('a::attr(title)').get(),
            'price': data.css('span::text').get(),
            'url': data.css('a::attr(href)').get(),
            'image url': data.css('img::attr(src)').get(),
        }

    next_page = response.css('li span a::attr(href)').get()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)
Can you give us the link that you want to scrape?
Sometimes websites lazy-load images and hide the real links in other img attributes, for example data-original or data-src, or keep the image links in JSON stored in a script tag on the page.
Your website might be defining the image data as a base64-encoded blob using a data URI. Basically, the image data is embedded in the HTML, so there is no normal URL available.
Read more here: https://css-tricks.com/data-uris/
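If it is a data URI, the bytes are already in the attribute, so no request is needed; you can decode them directly. A sketch with a hypothetical, truncated GIF data URI (not from the asker's site):

```python
import base64

# Hypothetical src value: a (truncated) GIF embedded as a data URI.
data_uri = "data:image/gif;base64,R0lGODlhAQABAAAAACw="

header, b64 = data_uri.split(",", 1)   # header carries the MIME type
raw = base64.b64decode(b64)            # raw image bytes -- no URL involved

print(header)    # data:image/gif;base64
print(raw[:6])   # b'GIF89a' -- the GIF magic number
```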

Link tag (<a>) in Python post requests

I am trying to post a request to the telegra.ph API (a simple online publishing tool). In order to edit the content of the page, I am using Python requests. From the sample code in the API documentation, I have so far:
import requests

params = {
    'path': '/mypage',
    'title': 'My Title',
    'content': [{"tag": "p", "children": ["WHAT IS GOING ON"]}],
    'author_name': 'My Name',
    'author_url': None,
    'return_content': 'true'
}
url = 'https://api.telegra.ph/editPage'
r = requests.post(url, json=params)
r.raise_for_status()
response = r.json()
Very simple code, and it works fine. My issue is that I would now like to add a link to my content. I tried changing the tag from "p" to "a" but that results in no tags at all in the resulting page. Does anyone know what format they are using for their content, and how I can change the paragraph tag to a link tag?
I used this to create a page with a link:
import requests

params = {
    'access_token': "",
    'path': '/mytestpage',
    'title': 'My Title',
    'content': [{"tag": "p", "children": ["A link to Stackoverflow ", {"tag": "a", "attrs": {"href": "http://stackoverflow.com/", "target": "_blank"}, "children": ["http://stackoverflow.com"]}]}],
    'author_name': 'My Name',
    'author_url': None,
    'return_content': 'true'
}
url = 'https://api.telegra.ph/createPage'
r = requests.post(url, json=params)
r.raise_for_status()
response = r.json()
print(response)
You should be able to do something similar for editPage.
Tip: you can use something like Postman or Chrome dev tools to figure out what is being posted by the telegra.ph UI.
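The key is the nested node structure: a link is an "a" node whose href sits under "attrs", placed inside the parent "p" node's children list next to plain text. A small sketch of that shape using local dicts only (no API call):

```python
# An "a" node: tag, attrs with the href, and the visible text as children.
link_node = {
    "tag": "a",
    "attrs": {"href": "http://stackoverflow.com/", "target": "_blank"},
    "children": ["http://stackoverflow.com"],
}

# The paragraph node nests the link in its children, alongside plain text.
paragraph = {"tag": "p", "children": ["A link to Stackoverflow ", link_node]}
content = [paragraph]

print(content[0]["children"][1]["attrs"]["href"])  # http://stackoverflow.com/
```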
