Connection error during image scraping from Craigslist - python

As part of a project to scrape data from Craigslist, I include image scraping. I've noticed in testing that sometimes the connection is refused. Is there a way around this, or do I need to incorporate error catching for this in my code? I recall that the Twitter API limits queries, so a sleep timer is incorporated there. I'm curious whether I have the same situation with Craigslist. See code and error below.
import requests
from bs4 import BeautifulSoup

# loops through each image and stores it in a local folder
for img in soup_test.select('a.thumb'):
    imgcount += 1
    filename = pathname + "/" + motoid + " - " + str(imgcount) + ".jpg"
    with open(filename, 'wb') as f:
        response = requests.get(img['href'])
        f.write(response.content)
ConnectionError: HTTPSConnectionPool(host='images.craigslist.org', port=443): Max retries exceeded with url: /00707_fbsCmug4hfR_600x450.jpg (Caused by NewConnectionError(': Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it',))
I have 2 questions about this behavior.
Do Craigslist's servers have any rules or protocols, such as blocking the nth request within a certain time frame?
Is there a way to pause the loop after a connection has been denied? Or do I just incorporate error catching so that it doesn't halt my program?
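One approach (not from the thread, just a sketch): wrap each download in a try/except with a pause and a bounded number of retries, so a refused connection doesn't halt the program. This assumes the same soup_test, pathname, motoid and imgcount variables as in the snippet above; the retry counts and delays are arbitrary.

import time
import requests

# Sketch only: bounded retries with a pause, so a refused connection
# does not halt the whole program. Assumes soup_test, pathname, motoid
# and imgcount are defined as in the snippet above.
for img in soup_test.select('a.thumb'):
    imgcount += 1
    filename = pathname + "/" + motoid + " - " + str(imgcount) + ".jpg"
    for attempt in range(3):
        try:
            response = requests.get(img['href'], timeout=10)
            response.raise_for_status()
        except requests.exceptions.RequestException:
            time.sleep(5 * (attempt + 1))  # back off a bit longer each time
            continue
        with open(filename, 'wb') as f:
            f.write(response.content)
        break
    time.sleep(1)  # small pause between images to avoid hammering the server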

Related

Why do I get a connection error using qBittorrentAPI?

I'm trying to run some code from this website but I don't understand why I get this error:
qbittorrentapi.exceptions.APIConnectionError: Failed to connect to qBittorrent. Connection Error: ConnectionError(MaxRetryError("HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v2/auth/login (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001FA519F5840>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))"))
The code in question:
import qbittorrentapi

# instantiate a Client using the appropriate WebUI configuration
qbt_client = qbittorrentapi.Client(
    host='localhost',
    port=8080,
    username='admin',
    password='adminadmin',
)

# the Client will automatically acquire/maintain a logged-in state
# in line with any request. therefore, this is not strictly necessary;
# however, you may want to test the provided login credentials.
try:
    qbt_client.auth_log_in()
except qbittorrentapi.LoginFailed as e:
    print(e)

# display qBittorrent info
print(f'qBittorrent: {qbt_client.app.version}')
print(f'qBittorrent Web API: {qbt_client.app.web_api_version}')
for k, v in qbt_client.app.build_info.items():
    print(f'{k}: {v}')

# retrieve and show all torrents
for torrent in qbt_client.torrents_info():
    print(f'{torrent.hash[-6:]}: {torrent.name} ({torrent.state})')

# pause all torrents
qbt_client.torrents.pause.all()
I'd really appreciate some help with this, thanks in advance :)
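For what it's worth, [WinError 10061] usually means nothing is listening on localhost:8080, i.e. the qBittorrent Web UI is not enabled or is on a different port. Below is a minimal sketch that catches the connection failure instead of crashing; the host, port and credentials are the same placeholders as above, so adjust them to your setup.

import qbittorrentapi

# Sketch only: same placeholder credentials as above; adjust host/port
# to wherever your qBittorrent Web UI actually listens.
qbt_client = qbittorrentapi.Client(
    host='localhost',
    port=8080,
    username='admin',
    password='adminadmin',
)

try:
    qbt_client.auth_log_in()
except qbittorrentapi.exceptions.APIConnectionError as err:
    # raised when nothing is reachable at host:port (e.g. Web UI disabled)
    print(f'Could not reach the qBittorrent Web UI: {err}')
except qbittorrentapi.LoginFailed as err:
    print(f'Connected, but the credentials were rejected: {err}')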

Max retries exceeded with url when attempting to use the Confluence REST API in python

I'm trying to use the Confluence API with this code; it's just a test to see if I can script creating a page in my Confluence space in Python.
from atlassian import Confluence

confluence = Confluence(
    url='http://localhost:8090',
    username='admin',
    password='admin')

status = confluence.create_page(
    space='DEMO',
    title='This is the title',
    body='This is the body. You can use <strong>HTML tags</strong>!')

print(status)
For some reason I keep getting the error below. Any information is helpful. Google wasn't much help.
HTTPConnectionPool(host='localhost', port=8090): Max retries exceeded with url: /rest/api/content (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000028908E62A60>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))
The solution to my problem was to drop the port at the end of the URL. Once this was done, the code executed.
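For reference, a minimal sketch of the fix described above, with the port dropped from the base URL (host and credentials are the same placeholders as in the question):

from atlassian import Confluence

# Sketch of the fix described above: same call, but without the port
# in the base URL (hostname and credentials are placeholders).
confluence = Confluence(
    url='http://localhost',
    username='admin',
    password='admin')

status = confluence.create_page(
    space='DEMO',
    title='This is the title',
    body='This is the body. You can use <strong>HTML tags</strong>!')

print(status)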

urllib.request.urlopen connection error max URL retries reached: how to scrape website bypassing this?

I am trying to scrape all of the pages on this site. I wrote this code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time
import urllib
from urllib.request import urlopen

output = open('signalpeptide.txt', 'a')

for each_page in range(1, 220000):
    if each_page % 1000 == 0:  # this is because of download limit
        time.sleep(5)          # this is because of download limit
    url = 'http://www.signalpeptide.de/index.php?sess=&m=listspdb_bacteria&s=details&id=' + str(each_page) + '&listname='
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    tabs = soup.find_all('table')
    pd_list = pd.read_html(str(tabs[0]))
    temp_list = []
    for i in range(22):
        temp_list.append(str(pd_list[0][2][i]).strip())
    output.write(str(temp_list[1]).strip() + '\t' + str(temp_list[3]).strip() + '\t' + str(temp_list[7]).strip() + '\t' + str(temp_list[15]).strip() + '\t')
    pd_list2 = pd.read_html(str(tabs[1]))
    output.write(str(pd_list2[0][0][1]) + '\t' + str(pd_list2[0][2][1]) + '\n')
My connection is being refused for trying the URL too many times (I know this because when I run the code with requests instead of urllib.request.urlopen, the error says 'Max retries exceeded with url'):
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.signalpeptide.de', port=80): Max retries exceeded with url: /index.php?sess=&m=listspdb_bacteria&s=details&id=1000&listname= (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x11ebc0e48>: Failed to establish a new connection: [Errno 61] Connection refused'))
Other methods suggested here also didn't work; one of the forum users from that post suggested I make a separate post specifically for this issue.
I have looked into Scrapy, but I don't really understand how to tie it into the script above. Can anyone show me how to edit the above script so that I avoid errors such as:
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x1114f0898>: Failed to establish a new connection: [Errno 61] Connection refused
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.signalpeptide.de', port=443): Max retries exceeded with url: /index.php?sess=&m=listspdb_bacteria&s=details&id=2&listname= (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x1114f0898>: Failed to establish a new connection: [Errno 61] Connection refused'
ConnectionRefusedError: [Errno 61] Connection refused
I also tried using urllib3:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time
#import urllib
#from urllib.request import urlopen
import urllib3

http = urllib3.PoolManager()
output = open('signalpeptide.txt', 'a')

for each_page in range(1, 220000):
    if each_page % 1000 == 0:  # this is because of download limit
        time.sleep(5)          # this is because of download limit
    url = 'http://www.signalpeptide.de/index.php?sess=&m=listspdb_bacteria&s=details&id=' + str(each_page) + '&listname='
    page = http.request('GET', url)
    soup = BeautifulSoup(page.data, 'html.parser')  # urllib3 responses expose the body via .data
with the error:
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='www.signalpeptide.de', port=80): Max retries exceeded with url: /index.php?sess=&m=listspdb_bacteria&s=details&id=1&listname= (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x11ce6f5f8>: Failed to establish a new connection: [Errno 61] Connection refused'))
Note that the script does run the first time; the problems only appeared after I had run it several times while writing and testing it. It got through the first 400 entries, then I got the errors above, and now it won't run at all.
If anyone has an idea how to edit this script to get around the maximum number of URL retries, bearing in mind that I am already getting the connection refused error, I would appreciate it.
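Not from the thread, but one common pattern for 'Max retries exceeded' with requests is to mount an HTTPAdapter with a urllib3 Retry so failed requests back off and retry automatically. A sketch under that assumption; the retry counts, backoff factor and sleep interval are arbitrary, and the table parsing is left as in the original script:

import time
from bs4 import BeautifulSoup
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Sketch only: a Session with automatic retries and exponential backoff.
# The retry counts and sleep intervals are arbitrary examples.
session = requests.Session()
retries = Retry(total=5, backoff_factor=2,
                status_forcelist=[429, 500, 502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retries))

base = 'http://www.signalpeptide.de/index.php?sess=&m=listspdb_bacteria&s=details&id={}&listname='
for each_page in range(1, 220000):
    resp = session.get(base.format(each_page), timeout=30)
    soup = BeautifulSoup(resp.text, 'html.parser')
    # ... parse the tables and write the output as in the original script ...
    time.sleep(1)  # pause between pages so the server is not hammered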

For While loop keeps running

The script below is supposed to send the data to the URL once the Google Compute Engine instance (using a Container-Optimized OS image) has started and the dockerized app is working.
Unfortunately, even when it fails to post the data at first, the data is still received once the app is working.
The output is:
('Error', ConnectionError(MaxRetryError("HTTPConnectionPool(host='34.7.8.8', port=12345): Max retries exceeded with url: /didi.json (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 111] Connection refused',))",),))
Does it come from GCE?
Here is the python code:
import time
import requests

for i in range(0, 100):
    while True:
        try:
            response = requests.post('http://%s:12345/didi.json' % ip_of_instance, data=data)
        except requests.exceptions.RequestException as err:
            print("Error", err)
            time.sleep(2)
            continue
        break
Edit - here are the parameters of the post request:
data = {
    'url': 'www.website.com',
    'project': 'webCrawl',
    'spider': 'indexer',
    'setting': 'ELASTICSEARCH_SERVERS=92.xx.xx.xx',
    'protocol': 'https',
    'scraper': 'light'
}
What I see is that you are using a while True loop: when it exceeds the maximum retries you get an error because you are being banned by the server, but this status does not last forever, and when the ban is lifted you start to get more data because the while loop is still running.
If my theory is not right, you can take a look at this other thread: Max retries exceeded with URL.
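Building on that, one way to keep the retry behaviour without the loop running forever is to cap the number of attempts and back off between them. A sketch, assuming ip_of_instance and data are defined as in the question; the limits are arbitrary:

import time
import requests

# Sketch only: bounded retries with a growing pause instead of `while True`.
# Assumes ip_of_instance and data are defined as in the question.
max_attempts = 10
for attempt in range(1, max_attempts + 1):
    try:
        response = requests.post('http://%s:12345/didi.json' % ip_of_instance,
                                 data=data, timeout=10)
        response.raise_for_status()
        break                      # posted successfully, stop retrying
    except requests.exceptions.RequestException as err:
        print("Error", err)
        time.sleep(2 * attempt)    # back off a little more each time
else:
    print("Giving up after", max_attempts, "attempts")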

Making requests through tor, requests.exceptions.ConnectionError Errno 61: Connection Refused

I'm trying to make a simple request to a whatsmyip site while connected to Tor, but no matter what I try I continue to get this error:
requests.exceptions.ConnectionError: SOCKSHTTPSConnectionPool(host='httpbin.org', port=443): Max retries exceeded with url: /get (Caused by NewConnectionError('<urllib3.contrib.socks.SOCKSHTTPSConnection object at 0x1018a7438>: Failed to establish a new connection: [Errno 61] Connection refused'))
I've looked at a lot of posts on here with similar issues but I can't seem to find a fix that works.
This is the current code, but I've tried multiple ways and it's the same error every time:
import requests

def main():
    proxies = {
        'http': 'socks5h://127.0.0.1:9050',
        'https': 'socks5h://127.0.0.1:9050'
    }
    r = requests.get('https://httpbin.org/get', proxies=proxies)
    print(r.text)

if __name__ == '__main__':
    main()
Well, the error says Max retries exceeded with url, so possibly too many requests have been made from the Tor exit node's IP. Try it again with a new Tor identity and see if that works.
If you wanted to, you could catch the exception and put the request in a loop that retries every few seconds, but this may lead to that IP address being refused by the server for longer.
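A sketch of that suggestion: catch the connection error and retry after a pause. It assumes Tor is listening on 127.0.0.1:9050 as in the question; the attempt count and delay are arbitrary.

import time
import requests

# Sketch only: retry through the Tor SOCKS proxy a few times before giving up.
# Assumes Tor is listening on 127.0.0.1:9050; retry count/delay are arbitrary.
proxies = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050'
}

def fetch(url, attempts=5, delay=10):
    for _ in range(attempts):
        try:
            return requests.get(url, proxies=proxies, timeout=30)
        except requests.exceptions.ConnectionError as err:
            print('Connection refused, retrying:', err)
            time.sleep(delay)
    return None

r = fetch('https://httpbin.org/get')
if r is not None:
    print(r.text)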
