GET request returns unfriendly response in Python

I am trying to perform a GET request on TCG Player via Requests in Python. I checked the site's robots.txt, which specifies:
User-agent: *
Crawl-Delay: 10
Allow: /
Sitemap: https://www.tcgplayer.com/sitemap/index.xml
This is my first time seeing a robots.txt file.
My code is as follows:
import requests
url = "http://www.tcgplayer.com"
r = requests.get(url)
print(r.text)
I cannot include r.text in my post because the character limit would be exceeded.
I would have expected to receive the HTML content of the webpage, but I got an 'unfriendly' response instead. What is the meaning of the text above? Is there a way to get the HTML so I can scrape the site?
By 'unfriendly' I mean:
The HTML that is returned does not match the HTML that is produced by typing the URL into my web browser.

This is most likely because the page is rendered client-side by JavaScript, as indicated by the empty <div id="app"></div> block in the scraped result: requests only fetches the raw HTML and does not run the scripts that fill that div. To properly handle such content, you will need a more advanced web scraping tool, like Selenium. I'd recommend this tutorial to get started.
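For example, a minimal sketch of that approach (assuming Chrome and a matching chromedriver are installed; the output handling is just illustrative):
from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://www.tcgplayer.com")
html = driver.page_source  # HTML after the JavaScript app has rendered
driver.quit()

soup = BeautifulSoup(html, "html.parser")
print(soup.title)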

Related

Python POST requests - how to extract the HTML of the request destination

I am scraping mortgage data from the official mortgage registry. The problem is that I can't extract the HTML of a particular document. Everything happens via POST: I have all of the data required to build the POST request, but when I print request.url it still shows me the welcome screen page, when it should retrieve the HTML of the particular document. All the data, like the mortgage number and the current page, are listed in dev tools > Network > Form Data, so I bet it must be possible. I'm quite new to web work in Python, so I will appreciate any help.
My code:
import requests
data = {
    'kodWydzialu': 'PT1R',
    'nrKw': '00037314',
    'cyfraK': '9',
}
r = requests.post('https://przegladarka-ekw.ms.gov.pl/eukw_prz/KsiegiWieczyste/wyszukiwanieKW', data=data)
print(r.url)
print(r.content)
You are getting the welcome screen because you aren't sending all the requests required to view the next page.
Go to the Network tab in Chrome's developer tools, and you will see that when you click the submit/search button, a bunch of other GET requests are sent to different URLs after that first POST request.
You need to replicate that in your script. Depending on the website it can be tough to reproduce the full sequence, so you should consider using Selenium.
That said, it's not impossible to do this with requests. You need to send the POST request, and all the GET requests that follow it, in the same session:
import requests

session = requests.Session()

# URL, URL_1, URL_2 and headers are placeholders for the endpoints and request
# headers you observe in the Network tab.
data = {
    'kodWydzialu': 'PT1R',
    'nrKw': '00037314',
    'cyfraK': '9',
}
session.post(URL, headers=headers, data=data)
# Then send the GET requests that followed it, in the same session
session.get(URL_1, headers=headers)
session.get(URL_2, headers=headers)

Web scraping a <span> with an id in Python

I want to scrape the data inside a <span> element on a given website using BeautifulSoup. You can see where it is located in the screenshot. However, the code that I'm using just returns an empty list; I can't find the data I want in it. What am I doing wrong?
from bs4 import BeautifulSoup
import urllib.request

url = "http://144.122.167.229"
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
data = opener.open(url).read()
soup = BeautifulSoup(data, 'html.parser')
your_data = list()
for line in soup.findAll('span', attrs={'id': 'mc1_legend_value'}):
    your_data.append(line.text)
for line in soup.findAll('span'):
    your_data.append(line.text)
Screenshot: https://imgur.com/a/z0vNh
Thank you.
The dashboard in the screenshot looks to me like something JavaScript would generate. If you can't find the tag in the page source, that means it was added later by some JavaScript code, or your browser tried to fix some HTML it considered broken or out of place.
Keep in mind that right now you're sending a request to a server and it serves you the plain HTML back. A browser would parse the HTML and execute any JavaScript code it finds. In your case, neither Beautiful Soup nor urllib executes any JavaScript code: urllib fetches the HTML and Beautiful Soup makes it easier to parse and extract the relevant information.
If you want to get the value from that tag, I recommend using a headless browser to render the page and only then parse its HTML with Beautiful Soup or any other parser.
Give selenium a try: http://selenium-python.readthedocs.io/.
You can control your own browser programmatically: make it request the page for you, render it, save the new HTML in a variable, parse it with Beautiful Soup, and extract the values you're interested in. I believe it also has its own element lookup implemented, which you can use directly to search for that tag.
Or maybe even scrapinghub's splash: https://github.com/scrapinghub/splash
If the dashboard communicates with a server in real time and that value is continuously received from the server, you could take a look at which requests are sent to the server in order to get that value. Open the developer console (F12) and click on Network, then refresh the page: you should see all the requests sent to the server along with their responses. Requests sent by the JavaScript are usually XMLHttpRequests; click on XHR in the Network tab to filter out the other requests. (These are instructions for Google Chrome; Firefox might differ a bit.)
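As an illustration of the headless-browser route, here is a minimal sketch (assuming Chrome and chromedriver are available; the element id is the one from the question):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("http://144.122.167.229")
    # Wait until the JavaScript has actually created the span
    legend = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "mc1_legend_value"))
    )
    print(legend.text)
finally:
    driver.quit()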

Bypassing intrusive cookie statement with requests library

I'm trying to crawl a website using the requests library. However, the particular website I am trying to access (http://www.vi.nl/matchcenter/vandaag.shtml) has a very intrusive cookie statement.
I am trying to access the website as follows:
from bs4 import BeautifulSoup as soup
import requests
website = r"http://www.vi.nl/matchcenter/vandaag.shtml"
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"})
htmlsoup = soup(html.text, "html.parser")
This returns a web page that consists of just the cookie statement with a big button to accept. If you try accessing this page in a browser, you find that pressing the button redirects you to the requested page. How can I do this using requests?
I considered using mechanize.Browser but that seems a pretty roundabout way of doing it.
Try setting:
cookies = dict(BCPermissionLevel='PERSONAL')
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"}, cookies=cookies)
This will bypass the cookie consent page and land you straight on the page you requested.
Note: You can find the above by analyzing the JavaScript code that runs on the cookie consent page; it is a bit obfuscated, but it should not be difficult to follow. If you run into the same type of problem again, take a look at what kind of cookies the JavaScript code executed by the button's event handler sets.
I found this SO question, which asks how to send cookies in a POST using requests. The accepted answer states that the latest build of Requests will build CookieJars for you from simple dictionaries. Below is the POC code included in the original answer.
import requests
cookie = {'enwiki_session': '17ab96bd8ffbe8ca58a78657a918558'}
r = requests.post('http://wikipedia.org', cookies=cookie)

Python Scrape with requests and beautifulsoup

I am trying to do a scraping exercise using Python requests and BeautifulSoup.
Basically I am crawling an Amazon web page.
I am able to crawl the first page without any issues.
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
#do some thing
But when I try to crawl the 2nd page, with "#2" in the URL:
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers#2")
I see that r still has the same value, equivalent to what I get from the first page:
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
I don't know if the #2 is causing trouble while making the request to the second page.
I also googled the issue but could not find a fix.
What is the right way to make a request to a URL with # values? How do I address this issue? Please advise.
"#2" is an fragment identifier, it's not visible on the server-side. Html content that you get, opening "http://someurl.com/page#123" is same as content for "http://someurl.com/page".
In browser you see second page because page's javascript see fragment identifier, create ajax request and inject new content into page. You should find ajax request's url and use it:
Looks like our url is:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&aj
From this we can see that all we need to do is change the "pg" param value to get the other pages.
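A minimal sketch of that idea (the URL pattern is the one found above and may change if Amazon updates its markup):
import requests

base = "http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_{0}?ie=UTF8&pg={0}"
for page in range(1, 3):
    # Request each bestseller page directly by its "pg" value
    r = requests.get(base.format(page), headers={"User-Agent": "Mozilla/5.0"})
    print(page, r.status_code, len(r.text))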
You need to request the URL found in the href attribute of the anchor tags that describe the pagination; they are at the bottom of the page. If I inspect the page in the developer console in Google Chrome, I find the first page's URL is:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_1?ie=UTF8&pg=1
and the second page's url is like this:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2
The anchor tag for the second page looks like this:
<a page="2" ajaxUrl="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&ajax=1" href="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2">21-40</a>
So you need to change the request URL accordingly.
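A minimal sketch of that approach (the "page" attribute used to select the pagination anchors is taken from the tag shown above and may differ in practice):
import requests
from bs4 import BeautifulSoup

start = "http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers"
r = requests.get(start, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, "html.parser")

# Collect the hrefs from the pagination anchors at the bottom of the page
page_urls = [a["href"] for a in soup.find_all("a", attrs={"page": True})]
for url in page_urls:
    page = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    # ... parse each bestseller page here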

Parsing webpage with beautifulsoup to get dynamic content

I am trying to parse the following page
http://www.lyricsnmusic.com/roxy-music/while-my-heart-is-still-beating-lyrics/26925936 for the list of similar songs.
The list of similar songs is not present in the page source but is present when I use 'Inspect Element' in the browser.
How do I do it?
Current code:
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.lyricsnmusic.com/roxy-music/while-my-heart-is-still-beating-lyrics/26925936'
request = urllib2.Request(url)
lyricsPage = urllib2.urlopen(request).read()
soup = BeautifulSoup(lyricsPage)
The code to generate the links is:
for p in soup.find_all('p'):
    s = p.find('a', { "class" : 'title' }).get('href')
Which methods are available to do this?
This is probably handled by some AJAX calls, so it will not be in the page source.
I think you would need to "monitor network" through the developer tools in the browser and look for the requests you are interested in.
For example, a randomly picked request URL from this page:
http://ws.audioscrobbler.com/2.0/?api_key=73581584905631c5fc15720f03b0b9c8&format=json&callback=jQuery1703329798618797213_1380004055342&method=track.getSimilar&limit=10&artist=roxy%20music&track=while%20my%20heart%20is%20still%20beating&_=1380004055943
To see the response, enter the above URL in the browser and look at the content that comes back.
So you need to simulate that request in Python, and after you get the response you have to parse it for the interesting details.
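A minimal sketch of simulating that request with requests (the api_key and parameters are the ones visible in the captured URL above; dropping the jQuery callback parameter should give plain JSON rather than JSONP):
import requests

params = {
    "api_key": "73581584905631c5fc15720f03b0b9c8",
    "format": "json",
    "method": "track.getSimilar",
    "limit": 10,
    "artist": "roxy music",
    "track": "while my heart is still beating",
}
r = requests.get("http://ws.audioscrobbler.com/2.0/", params=params)
data = r.json()
# Inspect the structure first; for track.getSimilar it is typically
# data["similartracks"]["track"], but verify against the actual payload.
for track in data.get("similartracks", {}).get("track", []):
    print(track.get("name"), "-", track.get("artist", {}).get("name"))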
