pycurl is working in an unexpected way - python

I have written some code that visits a URL using pycurl, with Tor enabled.
The URL gets redirected to some other URL.
Below is the code:
import pycurl

curl = pycurl.Curl()
curl.setopt(pycurl.URL, URL)  # URL holds the target address
# Route the request through the local Tor SOCKS proxy
curl.setopt(pycurl.PROXY, '127.0.0.1')
curl.setopt(pycurl.PROXYPORT, 9050)
curl.setopt(pycurl.PROXYTYPE, pycurl.PROXYTYPE_SOCKS5_HOSTNAME)
# Spoof a browser user agent
curl.setopt(pycurl.USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0) Gecko/20100101 Firefox/8.0')
curl.perform()  # the response body is written to stdout by default
It prints the expected HTML content.
But every visit to the URL is supposed to increment a count somewhere else.
Now, when I run the script, I get the HTML content but the count is not incremented; yet when the same HTML output is rendered on an online HTML rendering website (htmledit.squarefree.com), the count is incremented.
Any help on getting the count to increment automatically, using the script itself?
Thanks.

Any update of data on the server when a client visits the website is most likely done through JavaScript.
When website content is loaded on the client machine, it comes with some JavaScript that gets executed on the client's machine to notify the server. When the webpage is visited through a browser, that JavaScript is executed (if the browser has it enabled). But when the webpage is fetched through curl, the JavaScript can't be executed, so the count is never incremented.
I managed to do it using dryscrape, which renders the page and runs its JavaScript. dryscrape uses the HTTP protocol; you can read here for a workaround to enable the SOCKS5 protocol for dryscrape.
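As an illustration, a minimal dryscrape sketch might look like this (assuming dryscrape and its webkit_server backend are installed; routing it through the Tor SOCKS5 proxy still needs the workaround linked above):
import dryscrape

# On a headless server, start an X virtual framebuffer for the rendering backend
dryscrape.start_xvfb()

session = dryscrape.Session()
session.visit(URL)      # same target URL as above; the page's JavaScript is executed
html = session.body()   # fully rendered HTML, after any counter-updating scripts have run
print(html)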

Related

URL requests not working while the Flask app is hosted

I have a Flask web app running a Justdial scraper. In my code I have to request multiple pages of the Justdial site, pass them to the bs4 module to extract the data, and fill it into an Excel sheet. I use requests.Session() for this.
import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"})
session.verify = False  # skip certificate verification

# entry and page_number come from the surrounding scraper code
url = f"{entry}/page-{page_number}"
r = session.get(url).text
Then this "r" is passed into the bs4 module and the extraction process takes place.
Whenever I run this code on localhost the program works fine: the data is extracted and the values are stored in the Excel file. But when I host it as a web app on Heroku and try the same process, I don't get the desired output; no errors are shown by the try/except either, and the Excel file comes out empty.
I tried using urllib, requests.get(), and also requests.get(url, verify=False), but the same problem persists.
This warning pops up while I run the program on localhost:
/home/disciple/.local/lib/python3.8/site-packages/urllib3/connectionpool.py:846: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
warnings.warn((
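As an aside, since certificate verification is being disabled deliberately, the warning can be silenced with urllib3 (a sketch; re-enabling verification would be the better long-term fix):
import urllib3
import requests

# Suppress the InsecureRequestWarning that urllib3 emits when verify=False is used
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

session = requests.Session()
session.verify = False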

How to complete a GeeTest captcha when scraping with python-requests, where the request values are taken from solving the captcha manually?

I'm trying to scrape a website which uses DataDome, and after some requests I have to complete a GeeTest (slider captcha puzzle).
Here is a sample link to it:
captcha link
I've decided not to use Selenium (at least for now) and I'm trying to solve my problem with the Python Requests module.
My idea was to complete the GeeTest myself and then have my program send the same request that my web browser sends after completing the slider.
To begin with, I scraped the HTML code I got on the website after the captcha prompt:
<head><title>allegro.pl</title><style>#cmsg{animation: A 1.5s;}#keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script>var dd={'cid':'AHrlqAAAAAMAsB0jkXrMsMAsis8SQ==','hsh':'77DC0FFBAA0B77570F6B414F8E5BDB','t':'fe','s':29701,'host':'geo.captcha-delivery.com'}</script><script src="https://ct.captcha-delivery.com/c.js"></script></body></html>
I couldn't access the iframe where the most important info is, but I found out that the link to that iframe can be built from the info in the HTML code above. As you can see in the link above, cid is initialCid, hsh is hash, etc.; the cid part of the link is a cookie that I got at the moment the captcha appeared.
I've seen there are services available which can solve the captcha for you, but I decided to complete the captcha myself, copy the exact request (including cookies and headers) that my browser sends, and then send that request from my program with Requests. For now I'm doing it by hand, but it doesn't work: the response is 403, while done manually it's 200 and a redirect.
Here is a sample request that my browser is sending after completing captcha:
sample request
I'm sending it in program by:
s = requests.Session()
s.headers = headers
s.cookies.set(cookie_from_web_browser)
captcha = s.get(request)
Response is 403 and I have no idea how to make it work, help me.
Captchas are really tricky in the web scraping world. Most of the time you can bypass them by solving the captcha manually and then taking the returned cookie and plugging it into your script. Depending on the website, the cookie could hold for 15 minutes, a day, or even longer.
The other alternative is to use a captcha solving service such as https://www.scraperapi.com/, where you pay a fee for x amount of requests but you won't run into the captcha issue, as they solve the captchas for you.
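A rough sketch of the first approach (the cookie name 'datadome' and the domain are assumptions based on the protection vendor; take the actual name and value from your browser after solving the captcha manually):
import requests

s = requests.Session()
s.headers.update({'User-Agent': 'Mozilla/5.0 ...'})  # same User-Agent the browser used

# Cookie copied from the browser's storage after the captcha was solved
s.cookies.set('datadome', '<value copied from browser>', domain='.allegro.pl')

r = s.get('https://allegro.pl/')
print(r.status_code)  # 200 if the cookie is still accepted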
Use a header parameter to solve this problem, like so:
import requests

header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "referer": "https://www.google.com/"
}
r = requests.get("http://webcache.googleusercontent.com/search?q=cache:www.naukri.com/jobs-in-andhra-pradesh", headers=header)
Test it with the Google web cache before running it against the real URL.

GET request returns 403 status code even after using a header

I'm trying to scrape data from an Autotrader page. I managed to grab the link to every offer on that page, but when I try to get the data from each offer I get a 403 response status even though I'm using a header.
What more can I do to get past it?
import requests
import bs4

headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/85.0.4183.121 Safari/537.36'}
page = requests.get("https://www.autotrader.co.uk/car-details/202010145012219", headers=headers)
print(page.status_code)  # 403 Forbidden
content_of_page = page.content
soup = bs4.BeautifulSoup(content_of_page, 'lxml')
title = soup.find('h1', {'class': 'advert-heading__title atc-type-insignia atc-type-insignia--medium '})
print(title.text)
[For people who are in the same position: Autotrader uses Cloudflare to protect every "car-details" page, so I would suggest using Selenium, for example.]
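A minimal Selenium sketch of that suggestion (assuming a Chrome driver is available on the machine; Cloudflare may still challenge automated browsers):
from selenium import webdriver
import bs4

driver = webdriver.Chrome()
driver.get("https://www.autotrader.co.uk/car-details/202010145012219")
html = driver.page_source  # HTML after the browser has rendered the page
driver.quit()

soup = bs4.BeautifulSoup(html, 'lxml')
title = soup.find('h1')
print(title.text if title else "no title found")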
If you can manage to get the data via your browser, i.e. you somehow see this data on a website, then you can likely replicate that with requests.
Briefly, you need the headers in your request to match the browser's request:
Open dev tools in your browser (e.g. F12, Cmd+Opt+I, or via the menu).
Open the Network tab.
Reload the page (the whole website or the target request's URL only, whatever produces the desired response from the server).
Find the HTTP request to the desired URL in the Network tab, right-click it, click 'Copy...', and choose the option you need (e.g. copy as cURL).
Your browser sends tons of extra headers and you never know which ones are actually checked by the server, so this technique will save you a lot of time.
However, this might fail if there's some protection against blunt request copies, e.g. temporary tokens, so the requests cannot be reused. In that case you need Selenium (browser emulation/automation); it's not difficult, so it's worth using.
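As an illustration, a copied request typically translates into something like the following (every header value below is a placeholder; paste the ones your own browser actually sent):
import requests

# Headers copied from the browser's Network tab
headers = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
    "Referer": "https://www.autotrader.co.uk/",
}

page = requests.get("https://www.autotrader.co.uk/car-details/202010145012219", headers=headers)
print(page.status_code)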

Web scraping using Python (BeautifulSoup)

I have just started learning web scraping using Python's BeautifulSoup and requests libraries, with the PyCharm IDE.
import requests
from bs4 import BeautifulSoup
result1 = requests.get("https://www.grainger.com/")
print('result1 is '+ str(result1.status_code))
With this website the request just keeps loading, while with google.com it gives output.
I wonder why I don't get output for the above website?
To get status 200 from this site, specify User-Agent HTTP header:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0'}
result1 = requests.get("https://www.grainger.com/", headers=headers)
print('result1 is '+ str(result1.status_code))
Prints:
result1 is 200
The reason this works is that some sites will ignore requests that don't appear to be made from a web browser. By default, requests uses the User-Agent python-requests, so the website can tell you are not requesting it from a web browser. The reason your request hangs and eventually times out is most likely that their server is ignoring your request.
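You can check the defaults that requests sends, and pass a timeout so a hanging request fails fast (the 10-second value below is arbitrary):
import requests

# Shows the default headers, including 'User-Agent': 'python-requests/x.y.z'
print(requests.utils.default_headers())

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0'}
result1 = requests.get("https://www.grainger.com/", headers=headers, timeout=10)
print(result1.status_code)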
Hmm... there are a couple of possibilities:
The website might not exist.
You're using http instead of https.
The site blocks scraping (send a User-Agent header).
It might be a problem with requests; try using a different library.

Python requests: URL will show a table in the browser but not when I use requests

I am trying to scrape a table on a webpage, or even download the .xlsx file of that table, using the requests library.
Normal workflow:
I log into the site, go to my reporting page, choose a report, and click a button that says "Test"; a second window opens with my table and gives me the option to download the .xlsx file.
I can copy and paste this URL into any Chrome browser where I am currently logged in. When I try it with requests, even when passing an auth into my get(), I get a 200 response, but it is a simple page with one line of text telling me to "contact my tech staff to receive the proper url to enter your username and password". This is the same as when I paste the URL into a browser where I am not logged into the site, except that there I am redirected to a new URL showing the same sentence.
So I imagine there is a slug for the organization that is not passed in the URL but somewhere in the headers or cookies when I access this site in my browser. How do I identify this parameter in the HTTP headers? And how do I send it with requests so I can get my table and move on to automating the .xlsx download?
import requests
url = 'myorganization.com/adhocHTML.xsl?x=adhoc.AdHocFilter-listAdhocData&filterID=45678&source=live'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'}
data = requests.get(url, headers=headers, auth=('username', 'Password'))
Any help would be greatly appreciated, as I am new to the requests library and just trying to automate some data flow before getting to analyzing it.
You need to log in with requests. You can do this by creating a session and making your other requests with that session (it will keep all cookies and other state).
Before writing code you should do a few steps:
Make sure you are logged out.
Open the browser's Inspect tool on the login page and go to the Network tab.
Log in and find the POST request in the Network tab that corresponds to your login; at the end of that request's details you will find the parameters used for login.
Turn those parameters into a dictionary (login_data) in your code and proceed as below:
import requests

session = requests.Session()
session.post('url_to_login_page', data=login_data)
data = session.get(url, headers=headers)
Login data differ from one website to another, so I can't give you a specific example; you should be able to find it as described above. If you have problems with that, tell me.
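For illustration, the whole flow might look roughly like this (the login URL, field names, and report URL are placeholders; take the real ones from the POST request found in the Network tab):
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'}

# Field names vary per site; copy them from the login POST request in dev tools
login_data = {'username': 'your_username', 'password': 'your_password'}

session = requests.Session()
session.headers.update(headers)
session.post('https://myorganization.com/login', data=login_data)

# The session now carries the login cookies, so the report URL should return the real table
report_url = 'https://myorganization.com/adhocHTML.xsl?x=adhoc.AdHocFilter-listAdhocData&filterID=45678&source=live'
response = session.get(report_url)
print(response.text[:500])  # should contain the table rather than the "contact your tech staff" page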
