Background:
Typically, if I want to see what kind of requests a website is getting, I would open Chrome Developer Tools (F12), go to the Network tab, and filter the requests I want to see.
Example:
Once I have the request URL, I can simply parse the URL for the query string parameters I want.
This is a very manual task, and I thought I could write a script that does this for any URL I provide. Python seemed like a great fit for this.
Task:
I have found a library called requests that I use to validate the URL before opening it.
import requests
from urllib import urlopen  # Python 2

testPage = "http://www.google.com"
validatedRequest = str(requests.get(testPage, verify=False).url)
page = urlopen(validatedRequest)
However, I am unsure how to get the requests that the URL I enter receives. Is this possible in Python? A point in the right direction would be great. Once I know how to access these request headers, I can easily parse through them.
Thank you.
You can use the urlparse function to fetch the query parameters.
Demo:
import requests
import urllib
from urlparse import urlparse
testPage = "http://www.google.com"
validatedRequest = str(requests.get(testPage, verify=False).url)
page = urllib.urlopen(validatedRequest)
print urlparse(page.url).query
Result:
gfe_rd=cr&dcr=0&ei=ISdiWuOLJ86dX8j3vPgI
Tested in python2.7
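On Python 3 the same idea looks roughly like the sketch below (urlparse moved into urllib.parse, and requests already follows redirects for you, so urllib.urlopen is not needed):
import requests
from urllib.parse import urlparse, parse_qs

testPage = "http://www.google.com"
finalUrl = requests.get(testPage, verify=False).url  # requests follows redirects for you
query = urlparse(finalUrl).query

print(query)            # raw query string, e.g. 'gfe_rd=cr&dcr=0&...'
print(parse_qs(query))  # parsed into a dict of lists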
Novice web scraper here:
I am trying to scrape the name and address from this website: https://propertyinfo.knoxcountytn.gov/Datalets/Datalet.aspx?sIndex=1&idx=1. I have attempted the following code, which only returns 'None' (or an empty array if I replace find() with find_all()). I would like it to return the HTML of this particular section so I can extract the text and later add it to a CSV file. If the link doesn't work, or doesn't take you to where I'm working, simply go to the Knox County TN website > property search > select a property.
Much appreciation in advance!
from splinter import Browser
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
from webdriver_manager.chrome import ChromeDriverManager
owner_soup = soup(html, 'html.parser')
owner_elem = owner_soup.find('td', class_='DataletData')
owner_elem
OR
# this being the tag and class of the whole section where the info is located
owner_soup = soup(html, 'html.parser')
owner_elem = owner_soup.find_all('div', class_='datalet_div_2')
owner_elem
OR when I try:
browser.find_by_css('td.DataletData')[15]
it returns:
<splinter.driver.webdriver.WebDriverElement at 0x11a763160>
and I can't pull the html contents from that element.
There are a few issues I see, but it could be that you didn't include your code exactly as you actually have it.
Splinter works on its own to get page data by letting you control a browser. You don't need BeautifulSoup or requests if you're using splinter. You use requests if you want the raw response without running any of the things that browsers do for you automatically.
One of these automatic things is following redirects. The link you provided does not return the HTML that you are seeing. It just sends a response header that redirects you to https://propertyinfo.knoxcountytn.gov/, which redirects you again to https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop, which redirects again to https://propertyinfo.knoxcountytn.gov/Search/Disclaimer.aspx?FromUrl=../search/commonsearch.aspx?mode=realprop
On this page you have to hit the 'agree' button to get redirected to https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop, this time with these cookies set:
Cookie: ASP.NET_SessionId=phom3bvodsgfz2etah1wwwjk; DISCLAIMER=1
I'm assuming the session id is autogenerated, and the DISCLAIMER value just needs to be '1' for the server to know you agreed to their terms.
So you really have to study a page and understand what's going on to know how to do it on your own using just the requests and BeautifulSoup libraries. Besides the redirects I mentioned, you still have to figure out which network request gives you that session id so you can manually add it to the cookie header you send on all future requests. You can avoid making some requests, so this way is a lot faster, but you do need to be able to follow along in the developer tools 'Network' tab.
Postman is a good tool to help you set up requests yourself and see their result. Then you can bring all the set up from there into your code.
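To make that concrete, here is a rough sketch of the requests-plus-BeautifulSoup route described above. Setting DISCLAIMER=1 directly on the session follows the assumption made earlier; in practice you may need to replay the exact 'agree' POST (including any ASP.NET form fields) that you see in the Network tab.
import requests
from bs4 import BeautifulSoup

search_url = 'https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop'
datalet_url = 'https://propertyinfo.knoxcountytn.gov/Datalets/Datalet.aspx?sIndex=1&idx=1'

with requests.Session() as s:  # the session keeps ASP.NET_SessionId between requests
    # the first request follows the redirect chain and lands on the disclaimer page
    s.get(search_url)
    # assume agreeing only requires the DISCLAIMER cookie, as described above
    s.cookies.set('DISCLAIMER', '1', domain='propertyinfo.knoxcountytn.gov')
    res = s.get(datalet_url)
    owner_soup = BeautifulSoup(res.text, 'html.parser')
    owner_elem = owner_soup.find('td', class_='DataletData')
    print(owner_elem.get_text(strip=True) if owner_elem else 'not found')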
I have written a simple script for myself, as practice, to find out who has bought the same tracks as me on Bandcamp, ideally to find accounts with similar taste and thus more of the same music on them.
The problem is that the fan list on an album/track page is lazy loaded. Using Python's requests and bs4 I am only getting 60 results out of a potential 700.
I am trying to figure out how to send the request that loads more of the list, e.g. here: https://pitp.bandcamp.com/album/fragments-distancing. After finding out what request is sent when I click 'more' in the browser's network inspector, I tried to replay that JSON request using requests, although without any result:
res = requests.get(track_link)
open_more = {"tralbum_type":"a","tralbum_id":3542956135,"token":"1:1609185066:1714678:0:1:0","count":100}
for i in range(0, 3):
    requests.post(track_link, json=open_more)
Will appreciate any help!
I think that just using a ridiculously large number for count will do. I did some automation on your script too, in case you want to get data on other albums:
from urllib.parse import urlsplit
import json
import requests
from bs4 import BeautifulSoup
# build the post link
get_link="https://pitp.bandcamp.com/album/fragments-distancing"
link=urlsplit(get_link)
base_link=f'{link.scheme}://{link.netloc}'
post_link=f"{base_link}/api/tralbumcollectors/2/thumbs"
with requests.Session() as s:
    res = s.get(get_link)
    soup = BeautifulSoup(res.text, 'lxml')

    # the data for tralbum_type and tralbum_id
    # are stored in a script attribute
    key = "data-band-follow-info"
    data = soup.select_one(f'script[{key}]')[key]
    data = json.loads(data)

    open_more = {
        "tralbum_type": data["tralbum_type"],
        "tralbum_id": data["tralbum_id"],
        "count": 1000,
    }

    r = s.post(post_link, json=open_more).json()
    print(r['more_available'])  # if not False, use a bigger count
I want to retrieve data from a website called myip.ms. I'm using requests to send data to the form, and then I want the resulting page back. When I run the script it returns the same page (the homepage) in the response; I want the next page, for the query I provide. I'm new to web scraping. Here's the code I'm using to achieve this.
import requests
from urllib.parse import urlencode, quote_plus
payload = {
    'name': 'educationmaza.com',
    'value': 'educationmaza.com',
}
payload=urlencode(payload)
r=requests.post("http://myip.ms/s.php",data=payload)
infile=open("E://abc.html",'wb')
infile.write(r.content)
infile.close()
I'm no expert, but it appears that when interacting with the webpage, the form submission is handled by jQuery (JavaScript), which requests cannot execute.
As such, you would have to use the Selenium module to interact with it.
The following code will execute as desired:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("https://myip.ms/s.php")
driver.find_element_by_id("home_txt").send_keys('educationmaza.com')
driver.find_element_by_id("home_submit").click()
html = driver.page_source
infile=open("stack.html",'w')
infile.write(html)
infile.close()
You will have to install the Selenium package, as well as PhantomJS.
I have tested this code, and it works fine. Let me know if you need any further help!
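Side note: newer Selenium releases dropped PhantomJS support and removed the find_element_by_* helpers, so on Selenium 4 a rough equivalent (a sketch, assuming the same element ids) would be:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")          # run Chrome without a window
driver = webdriver.Chrome(options=options)  # needs a matching chromedriver installed

driver.get("https://myip.ms/s.php")
driver.find_element(By.ID, "home_txt").send_keys("educationmaza.com")
driver.find_element(By.ID, "home_submit").click()

with open("stack.html", "w", encoding="utf-8") as infile:
    infile.write(driver.page_source)

driver.quit()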
I am using Python requests to get information from the mobile website of the German railway company (https://mobile.bahn.de/bin/mobil/query.exe/dox).
For instance:
import requests
query = {'S':'Stuttgart Hbf', 'Z':'München Hbf'}
rsp = requests.get('https://mobile.bahn.de/bin/mobil/query.exe/dox', params=query)
which in this case gives the correct page.
However, using the following query:
query = {'S':'Cottbus', 'Z':'München Hbf'}
It gives a different response, in which the user is required to choose one of several suggested options (the server cannot uniquely resolve the departure station, since there are many stations beginning with 'Cottbus').
Now, my question is: given this response, how can I choose one of the given options and then repeat the request without getting this error?
I tried looking at the cookies and using a session instead of a simple GET request, but nothing has worked so far.
I hope you can help me.
Thanks.
You can use BeautifulSoup to parse the response and get the options if there is a select element in the response:
import requests
from bs4 import BeautifulSoup
query = {'S': u'Cottbus', 'Z': u'München Hbf'}
rsp = requests.get('https://mobile.bahn.de/bin/mobil/query.exe/dox', params=query)
soup = BeautifulSoup(rsp.content, 'lxml')
# check if the page has a choice dropdown
if soup.find('select'):
    # get a list of (value, text) tuples that you will need to use in the next POST request
    options_value = [(option['value'], option.text) for option in soup.find_all('option')]
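As a rough follow-up sketch (an assumption about how the site resolves the ambiguity, not a tested recipe): take one of the option values and resend the request with it as the new 'S' parameter. The answer above mentions a follow-up POST of the form, so if resending the GET does not work, replay the form's POST with the same value instead.
import requests
from bs4 import BeautifulSoup

url = 'https://mobile.bahn.de/bin/mobil/query.exe/dox'
query = {'S': u'Cottbus', 'Z': u'München Hbf'}
rsp = requests.get(url, params=query)
soup = BeautifulSoup(rsp.content, 'lxml')

if soup.find('select'):
    # (value, text) pairs for the suggested stations
    options_value = [(option['value'], option.text) for option in soup.find_all('option')]
    chosen_value, chosen_text = options_value[0]  # first suggestion, just as an example
    query['S'] = chosen_value
    rsp = requests.get(url, params=query)         # repeat the request with the resolved station
    soup = BeautifulSoup(rsp.content, 'lxml')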
import requests
from bs4 import BeautifulSoup
a = requests.Session()
soup = BeautifulSoup(a.get("https://www.facebook.com/").content)
payload = {
    "lsd": soup.find("input", {"name": "lsd"})["value"],
    "email": "my_email",
    "pass": "my_password",
    "persistent": "1",
    "default_persistent": "1",
    "timezone": "300",
    "lgnrnd": soup.find("input", {"name": "lgnrnd"})["value"],
    "lgndim": soup.find("input", {"name": "lgndim"})["value"],
    "lgnjs": soup.find("input", {"name": "lgnjs"})["value"],
    "locale": "en_US",
    "qsstamp": soup.find("input", {"name": "qsstamp"})["value"]
}
soup = BeautifulSoup(a.post("https://www.facebook.com/",data = payload).content)
print([i.text for i in soup.find_all("a")])
I'm playing around with requests and have read several threads here on SO about it, so I decided to try it out myself.
I am stumped by this line: "qsstamp": soup.find("input", {"name": "qsstamp"})["value"]
because the find returns nothing, which causes an error.
However, looking at Chrome developer tools, this "qsstamp" is populated. What am I missing here?
The payload is everything shown in the form data in Chrome dev tools, so what is going on?
Using Firebug and searching for qsstamp gives a matched result that leads to the script where it is created.
You can see: j.createHiddenInputs({qsstamp:u},v)
That means qsstamp is dynamically generated by JavaScript.
requests will not run JavaScript (it only fetches the page's raw HTML). You may want to use something like dryscrape, or an automated browser like Selenium.
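If you go the Selenium route, a minimal sketch (assuming the generated hidden input is reachable by its name once Facebook's JavaScript has run) would be:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # or another driver you have installed
driver.get("https://www.facebook.com/")

# the hidden input is created by JavaScript, so it only exists after the page has run
qsstamp = driver.find_element(By.NAME, "qsstamp").get_attribute("value")
print(qsstamp)

driver.quit()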