getting no response from a url? - python

I'm getting no response from a URL when using requests.get, but if I paste the same URL into Firefox it responds fine. The URL points to a JSON file. I don't know what's happening. Here is my code:
from urllib.request import urlopen,Request
import requests
import pprint
import json
import pandas as pd
url = "https://www.nseindia.com/api/option-chain-equities?symbol=ACC"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
response = requests.get(url, headers=headers)
print(response.status_code)
##data_json = json.loads(response.text)
df = pd.read_json(response.text)
pprint.pprint(df['records'][1])

This website protects itself from bots. There are many ways to detect bots; some of them are:
request rate
disabled JavaScript
empty cookies
not using the mouse to click buttons
etc.
To enable JavaScript and cookies, you can use Selenium.
The website you want to scrape has powerful bot detection. I couldn't access the link you shared directly, but when I first opened the site's main page and only then your link, it showed the JSON file.
But this is not an easy site to make a bot for. I tried Selenium and clicked the site's button by moving the mouse, but it still detected that I'm a bot. So we can conclude that the website checks cookies: you need to present plausible cookies (for example, the ones set when you first visit the main page) to access the webpage.
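A minimal sketch of that cookie-priming idea, using a requests.Session that visits the NSE home page first so the follow-up API call carries whatever cookies the server set. The endpoint and User-Agent are the ones from the question; whether this passes the site's current bot checks is not guaranteed:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

with requests.Session() as session:
    # Visit the main page first so the server can set its cookies on the session.
    session.get("https://www.nseindia.com", headers=headers, timeout=10)
    # Reuse the same session (and its cookies) for the API endpoint from the question.
    response = session.get(
        "https://www.nseindia.com/api/option-chain-equities?symbol=ACC",
        headers=headers,
        timeout=10,
    )
    print(response.status_code)
    data = response.json()  # parse the JSON body if the request succeeded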

Related

Web scraping with BeautifulSoup returns no text even though it is in the HTML

I'm new to web scraping and to using BeautifulSoup. I need help, as I don't understand why my code returns no text when there is text in the inspect view on the website.
Here is my simple code:
from bs4 import BeautifulSoup
import requests
source = requests.get("https://www.nummerplade.net/nummerplade/Dd97487.html")
soup = BeautifulSoup(source.text,"html.parser")
name = soup.find("span",id="debitorer_name1")
print(name)
The output of running my code is:
<span id="debitorer_name1"></span>
When I inspect the HTML on the website I can see the desired name I want to extract, but not when running my script. Can anyone help me solve this issue?
Thanks!
If you reload the site, the data in the right-hand pane takes a moment to appear, which means it is loaded dynamically (via JavaScript) and therefore will not be visible in the soup.
How to find the URL that returns the dynamic data:
Go to the Network tab, reload the site, and in the search box type the data you are looking for; it will show you the request that returns it.
Now go to Headers and copy the user-agent and referer values into your own headers. The request returns the data as JSON, from which you can extract whatever you need:
import requests

headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
           "referer": "https://www.nummerplade.net/"}
res = requests.get("https://data3.nummerplade.net/bilbogen2.php?stelnr=salza2bt3nh162519", headers=headers)
Output:
'Sebastian Carl Schwabe'
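The extraction step isn't shown above; assuming the endpoint does return JSON as described, a minimal follow-up is to parse the body and look for the field that holds the name (the exact key names depend on the response, so inspect it first):

data = res.json()  # parse the JSON body returned by the endpoint
print(data)        # inspect the structure to find the field holding the owner's name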

headers in Zillow website - where to get them

The code below extracts data from Zillow Sale.
My first question is where people get the headers information from.
My second question is how do I know when I need headers? For some other pages, like Cars.com, I don't need to pass headers=headers and I can still get the data correctly.
Thank you for your help.
HHC
import requests
from bs4 import BeautifulSoup
import re
url ='https://www.zillow.com/baltimore-md-21201/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%2221201%22%2C%22mapBounds%22%3A%7B%22west%22%3A-76.67377295275878%2C%22east%22%3A-76.5733510472412%2C%22south%22%3A39.26716345016057%2C%22north%22%3A39.32309233550334%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A66811%2C%22regionType%22%3A7%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A14%7D'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
'referer': 'https://www.zillow.com/new-york-ny/rentals/2_p/?searchQueryState=%7B%22pagination'
}
raw_page = requests.get(url, headers=headers)
status = raw_page.status_code
print(status)
# Loading the page content into the beautiful soup
page = raw_page.content
page_soup = BeautifulSoup(page, 'html.parser')
print(page_soup)
You can get the headers by going to the site in your browser and opening the Network tab of the developer tools; select a request and you can see the headers that were sent with it.
Some websites don't serve bots, so to make them think you're not a bot you set the user-agent header to one a real browser uses; some sites require additional headers to pass the "not a bot" test. You can see all the headers being sent in the developer tools and test different combinations until your request succeeds, as in the sketch below.
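A rough sketch of that trial-and-error process. The URL is a trimmed version of the one in the question and the header values are the ones the question already uses; which combination actually satisfies Zillow is not guaranteed:

import requests

url = "https://www.zillow.com/baltimore-md-21201/"  # shortened form of the question's URL

# Candidate header sets copied from the browser's Network tab, from minimal to fuller.
candidates = [
    {},  # no headers at all
    {"user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"},
    {"user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
     "referer": "https://www.zillow.com/"},
]

for headers in candidates:
    resp = requests.get(url, headers=headers, timeout=10)
    print(list(headers), "->", resp.status_code)
    if resp.ok:  # stop at the first header set the site accepts
        break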
From your browser, go to this website: http://myhttpheader.com/
You will find your headers info there.
Secondly, only when a website like Zillow blocks you from scraping data do you need to provide headers.

python requests from usnews.com timing out, other websites work fine

url = "https://www.usnews.com"
page = requests.get(url, timeout = 5)
soup = BeautifulSoup(page.content,"html.parser")
The request to usnews.com is not working properly. The code either runs forever or times out after five seconds as instructed. I have tried other websites, which work perfectly fine (wikipedia.org, google.com).
They are using a special protection against web scrapers like you. Whenever you go to a website, your web browser sends a special piece of data called a User-Agent. It tells the website what type of browser you are using and if you are on a phone or computer. By default, the requests module doesn't do this.
You can set your own User-Agent pretty easily. Using your website as an example:
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"}
url = "https://www.usnews.com"
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content,"html.parser")
This code tells the website that we are an actual person and not a bot.
You can learn more about User Agents here (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent).
You could also try something like Selenium. The code below uses requests-html, which is similar to Selenium but a bit more user friendly:
from requests_html import HTMLSession
#from fake_useragent import UserAgent

#create the session
#ua = UserAgent()
session = HTMLSession()
#define our URL
url = "https://www.usnews.com"
#use the session to get the data
r = session.get(url)
#render the page; raise the scrolldown number to page down multiple times on a page
r.html.render(sleep=1, timeout=30, keep_page=True, scrolldown=1)
print(r.html.html)  #the rendered HTML, after the page's JavaScript has run
This code mimics a real browser and should bypass the bot detection.

python - log in to a page and get cookies

First, thanks for taking the time to read this and maybe trying to help me.
I am trying to make a script to easily log in to a site. I want to get the login cookies too, so that I can maybe reuse them later. I made the script and it logs me in correctly, but I cannot get the cookies. When I try to print them, I see just this:
<RequestsCookieJar[]>
Obviously this doesn't help me. So I would be interested in knowing how to get the real cookie data. Thanks a lot to whoever can help me with that.
My code:
import requests
import cfscrape
from bs4 import BeautifulSoup as bs
header = {"User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"}
s = requests.session()
scraper=cfscrape.create_scraper(sess=s) #i use cfscrape because the page uses cloudflare anti ddos page
scraper.get("https://www.bstn.com/einloggen", headers=header)
myacc={"login[email]": "my#mail.com", #obviously change
"login[password]": "password123"}
entry=scraper.post("https://www.bstn.com/einloggen", headers=header, data=myacc)
soup=bs(entry.text, 'lxml')
accnm=soup.find('meta',{'property':'og:title'})['content']
print("Logged in as: " + accnm)
aaaa=scraper.get("https://www.bstn.com/kundenkonto", headers=header)
print(aaaa.cookies)
If I print the cookies, I just get the empty <RequestsCookieJar[]> described earlier... It would be really nice if anyone could help me get the "real" cookies.
If you want to get your login cookie, you ought to use the response of the POST request, because that is the request that performs the login action. The server will send back session cookies if you enter the correct email and password. The reason you got empty cookies in aaaa is that the website did not want to set or change your cookies on that later GET request.
entry = scraper.post("https://www.bstn.com/einloggen", allow_redirects=False, headers=header, data=myacc)
print(entry.cookies)
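To see the actual names and values inside the jar (the "real cookie data" the question asks for), a short follow-up sketch, assuming the login POST succeeded and the server did set cookies on it:

# cookies set on the login response itself
print(entry.cookies.get_dict())

# all cookies accumulated on the underlying session (including any set by cfscrape/Cloudflare)
for cookie in s.cookies:
    print(cookie.name, "=", cookie.value)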

Scraping Data from website with a login page

I am trying to log in to my university website using Python and the requests library with the following code, but I am not able to.
import requests
payloads = {"User_ID": <username>,
            "Password": <password>,
            "option": "credential",
            "Log in": "Log in"
            }
with requests.Session() as session:
    session.post('', data=payloads)
    get = session.get("")
    print(get.text)
Does anyone have any idea on what I am doing wrong?
In order to log in you will need to post all the information requested by the <input> tags. In your case you will also have to provide the hidden inputs. You can do this by scraping those values and then posting them along with your credentials. You might also need to send some headers to simulate browser behaviour.
from lxml import html
import requests
s = requests.Session()
login_url = "https://intranet.cardiff.ac.uk/students/applications"
session_url = "https://login.cardiff.ac.uk/nidp/idff/sso?sid=1&sid=1"
to_get = s.get(login_url)
tree = html.fromstring(to_get.text)
hidden_inputs = tree.xpath(r'//form//input[@type="hidden"]')
payloads = {x.attrib["name"]: x.attrib["value"] for x in hidden_inputs}
payloads["Ecom_User_ID"] = "<username>"
payloads["Ecom_Password"] = "<password>"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
result = s.post(session_url, data=payloads, headers=headers)
Hope this works
In order to log in to a website like this with Python, you may need a more involved method than the requests library, because you have to simulate a browser in your code and have it make the requests that log in to the school's web servers. The reason is that you need the school's server to think it is getting the request from a real browser; it then returns the contents of the resulting page, and those contents have to be rendered so you can scrape them. Luckily, a great way to do this is with the selenium module in Python.
I would recommend googling around to learn more about Selenium. This blog post is a good example of using Selenium to log in to a web page, with detailed explanations of what each line of code is doing. This SO answer on using Selenium to log in to a website is also a good entry point.
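For illustration, a minimal Selenium sketch of that approach. The URL is the login page used in the other answer; the element IDs used to locate the username, password, and submit fields are assumptions you would need to replace with the real ones from the page's HTML:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes a matching chromedriver is available
driver.get("https://intranet.cardiff.ac.uk/students/applications")

# These element IDs are hypothetical; inspect the login form to find the real ones.
driver.find_element(By.ID, "Ecom_User_ID").send_keys("<username>")
driver.find_element(By.ID, "Ecom_Password").send_keys("<password>")
driver.find_element(By.ID, "loginButton").click()

# After logging in, the rendered page source can be handed to BeautifulSoup for scraping.
html = driver.page_source
driver.quit()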
