I am trying to use the code below to access websites in Python 3 using urllib:
import urllib.request

url = "http://www.goal.com"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
r = urllib.request.Request(url=url, headers=headers)
urllib.request.urlopen(r).read(1000)
It works fine when accessing "yahoo.com", but it always returns error 403 when accessing sites such as "goal.com" and "hkticketing.com.hk", and I cannot figure out what I am missing. I'd appreciate your help.
In Python 2.x you can use urllib2 to fetch the contents. Build an opener, set its addheaders attribute to supply the header information, then invoke the open method, read the contents, and print them.
import urllib2
import sys
print sys.version
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0')]
print opener.open('http://hkticketing.com.hk').read()
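In Python 3 the same idea can be written with urllib.request. A minimal sketch along the lines of the question's code; whether it gets past the 403 depends on what the site actually checks beyond the User-Agent:
import urllib.request

url = "http://hkticketing.com.hk"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}

# Attach the browser-like User-Agent to the request and read part of the body.
req = urllib.request.Request(url=url, headers=headers)
with urllib.request.urlopen(req) as resp:
    print(resp.read(1000))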
As the title says, I'm trying to send a request to a URL using requests with headers, but when I try to print the status code nothing is printed in the terminal. I checked my internet connection and even switched connections to test it, but nothing changes.
Here's my code:
import requests
from bs4 import BeautifulSoup
from requests.exceptions import ReadTimeout
link = "https://www.exampleurl.com"
header = {
    "accept-language": "tr,en;q=0.9,en-GB;q=0.8,en-US;q=0.7",
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36 Edg/99.0.1150.36'
}
r = requests.get(link)
print(r.status_code)
When I execute this, nothing appears and I don't know why. If someone can help me I will be very glad.
You can use requests.head(link) as below:
r = requests.head(link)
print(r.status_code)
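Note that requests.head only asks for the status line and response headers, so there is no body to download; a quick check using the question's placeholder URL:
import requests

link = "https://www.exampleurl.com"  # placeholder URL from the question

# HEAD returns headers only, which is enough to inspect the status code.
r = requests.head(link, timeout=10)
print(r.status_code, r.headers.get("Content-Type"))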
I got the same problem: the get() never returns.
Since you have created a header variable, I thought about using that:
r = requests.get(link, headers=header)
Now I get status 200 returned.
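For completeness, a minimal sketch that combines both points: pass the header dict you already built and set a timeout so the call fails fast instead of hanging (URL and headers are the placeholders from the question):
import requests

link = "https://www.exampleurl.com"  # placeholder URL from the question
header = {
    "accept-language": "tr,en;q=0.9,en-GB;q=0.8,en-US;q=0.7",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36 Edg/99.0.1150.36"
}

# Send the browser-like headers and give up after 10 seconds instead of
# blocking forever; an unresponsive server raises requests.exceptions.Timeout.
r = requests.get(link, headers=header, timeout=10)
print(r.status_code)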
I am unable to fetch a response from this URL, even though it works in a browser, including incognito mode. I am not sure why it is not working: the request just keeps running without any output and no errors. I even tried setting the 'user-agent' request header, but again received no response.
Following is the code used:
import requests
response = requests.get('https://www1.nseindia.com/ArchieveSearch?h_filetype=eqbhav&date=04-12-2020&section=EQ')
print(response.text)
I want the HTML text from the response page for further use.
The server is checking to see whether you are sending the request from a web browser; if not, it doesn't return anything. Try this:
import requests
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0'}
r = requests.get('https://www1.nseindia.com/ArchieveSearch?h_filetype=eqbhav&date=04-12-2020&section=EQ', timeout=3, headers=headers)
print(r.text)
I want to scrape https://www.jdsports.it/ using BeautifulSoup, but I get access denied.
On my PC I don't have any problem accessing the site, and I'm using the same user agent as the Python program, but from the program the result is different; you can see the output below.
EDIT:
I think I need cookies to gain access to the site. How can I get them and use them to access the site with the python program to scrape it?
The script works if I use "https://www.jdsports.com", which is the same site but for a different region.
Thanks!
import time
import requests
from bs4 import BeautifulSoup
import smtplib
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
url = 'https://www.jdsports.it/'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
# findAll is a method; take the text straight from the parsed soup instead
status = soup.get_text()
print(status)
The output is:
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://www.jdsports.it/" on this server.<p>
Reference #18.35657b5c.1589627513.36921df8
</p></body>
</html>
I suspected HTTP/2 at first, but wasn't able to get that working either. Perhaps you'll have more luck; here's an HTTP/2 starting point:
import asyncio
import httpx
import logging
logging.basicConfig(format='%(message)s', level=logging.DEBUG)
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
}
url = 'https://www.jdsports.it/'
async def f():
    client = httpx.AsyncClient(http2=True)
    r = await client.get(url, allow_redirects=True, headers=headers)
    print(r.text)

asyncio.run(f())
(Tested both on Windows and Linux.) Could this have something to do with TLS1.2? That's where I'd look next, as curl works.
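Regarding the EDIT about cookies: below is a minimal sketch of carrying cookies between requests with requests.Session. This assumes the block really is cookie-based; on many retail sites it is actually TLS/browser fingerprinting, in which case a plain session won't be enough.
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}

# A Session stores any Set-Cookie values from earlier responses and sends
# them back automatically on later requests to the same site.
with requests.Session() as session:
    session.headers.update(headers)
    first = session.get('https://www.jdsports.it/')    # may set cookies
    print(session.cookies.get_dict())                   # inspect what was stored
    second = session.get('https://www.jdsports.it/')   # cookies sent automatically
    print(second.status_code)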
I am using Python 3.5.2. I want to scrape a webpage where cookies are required, but when I use requests.session() the cookies maintained in the session are not updated, so my scraping fails constantly. Following is my code snippet.
import requests
from bs4 import BeautifulSoup
import time
import requests.utils
session = requests.session()
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"})
print(session.cookies.get_dict())
url = "http://www.beianbaba.com/"
session.get(url)
print(session.cookies.get_dict())
Do you have any idea about this? Thank you so much in advance.
It seems that the website simply isn't setting any cookies. I used the exact same code but requested https://google.com instead:
import requests
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"})
print(session.cookies.get_dict())
url = "http://google.com/"
session.get(url)
print(session.cookies.get_dict())
And got this output:
{}
{'NID': 'a cookie that i removed'}
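If the cookies you need are only issued by another page, or you have to copy them from a browser session, you can also seed the session's cookie jar by hand; a minimal sketch, where the cookie name and value are made-up placeholders:
import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"})

# Manually add a cookie to the jar; 'example_cookie' and its value are
# placeholders, not something www.beianbaba.com actually requires.
session.cookies.set("example_cookie", "some-value", domain="www.beianbaba.com")

r = session.get("http://www.beianbaba.com/")
print(session.cookies.get_dict())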
I am receiving the following response from the server:
ctrlDateTime%24txtSpecifyFromDate=05%2F02%2F2015&
ctrlDateTime%24rgApplicable=rdoApplicableFor&
ctrlDateTime%24txtSpecifyToDate=05%2F02%2F2015&
I am trying with:
br["ctrlDateTime%24txtSpecifyFromDate"] = "05%2F02%2F2015"
br["ctrlDateTime%24rgApplicable"] = "rdoApplicableFor"
br["ctrlDateTime%24txtSpecifyToDate"] = "05%2F02%2F2015"
How can I fix the ControlNotFoundError? Any idea how to solve it? Here is my code:
import mechanize
import re
br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0')]
response = br.open("http://marketinformation.natgrid.co.uk/gas/frmDataItemExplorer.aspx")
html = response.read()  # keep the page source for the regex search below
br.select_form(nr=0)
br.set_all_readonly(False)
mnext = re.search(r"""<a id="lnkNext" href="javascript:__doPostBack\('(.*?)','(.*?)'\)">XML""", html)
br["tvDataItem_ExpandState"]="cccccccceennncennccccccccc";
br["tvDataItem_SelectedNode"]="";
br["__EVENTTARGET"]="lbtnCSVDaily";
br["__EVENTARGUMENT"]="";
br["tvDataItem_PopulateLog"]="";
br["__VIEWSTATE"]="%2FwEP.....SNIP....%2F90SB9E%3D";
br["__VIEWSTATEGENERATOR"]="B2D04314";
br["__EVENTVALIDATION"]="%2FwEW...SNIP...uPceSw%3D%3D";
br["txtSearch"]="";
br["tvDataItemn11CheckBox"]="on";
br["tvDataItemn15CheckBox"]="on";
br["ctrlDateTime%24txtSpecifyFromDate"]="05%2F02%2F2015";
br["ctrlDateTime%24rgApplicable"]="rdoApplicableFor";
br["ctrlDateTime%24txtSpecifyToDate"]="05%2F02%2F2015";
br["btnViewData"]="View+Data+for+Data+Items";
br["hdnIsAddToList"]="";
response = br.submit()
print(response.read());
Thanks in advance.
P.
This was solved in two steps: 1) I replaced %24 with '$' in the control names; 2) some of the parameters required a plain value to be passed, while others had to be passed as a list, i.e. ['',].
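For illustration, a minimal sketch of what the corrected assignments might look like once the names and values are decoded; the exact values are the question's placeholders, and which controls expect a list is an assumption based on the answer above:
# Control names use '$' instead of the URL-encoded '%24', and the date
# values are plain strings instead of their '%2F'-encoded form.
br["ctrlDateTime$txtSpecifyFromDate"] = "05/02/2015"
br["ctrlDateTime$txtSpecifyToDate"] = "05/02/2015"

# Some controls expect a list of selected values rather than a single string.
br["ctrlDateTime$rgApplicable"] = ["rdoApplicableFor"]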