I try scrape pages contain underscore in subdomain, etc: https://taxi-3-extreme-rush_1.en.softonic.com
I check specifications and i see that subdomain can contain underscore.
Also i was try use link.encode('idna'), but also not works.
And i have error:
Traceback (most recent call last):
File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 1297, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/lib64/python2.7/site-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/lib64/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "/usr/lib64/python2.7/site-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
result = f(*args, **kw)
File "/usr/lib64/python2.7/site-packages/scrapy/core/downloader/handlers/__init__.py", line 65, in download_request
return handler.download_request(request, spider)
File "/usr/lib64/python2.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 60, in download_request
return agent.download_request(request)
File "/usr/lib64/python2.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 285, in download_request
method, to_bytes(url, encoding='ascii'), headers, bodyproducer)
File "/usr/lib64/python2.7/site-packages/twisted/web/client.py", line 1596, in request
endpoint = self._getEndpoint(parsedURI)
File "/usr/lib64/python2.7/site-packages/twisted/web/client.py", line 1580, in _getEndpoint
return self._endpointFactory.endpointForURI(uri)
File "/usr/lib64/python2.7/site-packages/twisted/web/client.py", line 1456, in endpointForURI
uri.port)
File "/usr/lib64/python2.7/site-packages/scrapy/core/downloader/contextfactory.py", line 59, in creatorForNetloc
return ScrapyClientTLSOptions(hostname.decode("ascii"), self.getContext())
File "/usr/lib64/python2.7/site-packages/twisted/internet/_sslverify.py", line 1201, in __init__
self._hostnameBytes = _idnaBytes(hostname)
File "/usr/lib64/python2.7/site-packages/twisted/internet/_sslverify.py", line 87, in _idnaBytes
return idna.encode(text)
File "/usr/lib/python2.7/site-packages/idna/core.py", line 355, in encode
result.append(alabel(label))
File "/usr/lib/python2.7/site-packages/idna/core.py", line 276, in alabel
check_label(label)
File "/usr/lib/python2.7/site-packages/idna/core.py", line 253, in check_label
raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
InvalidCodepoint: Codepoint U+005F at position 20 of u'taxi-3-extreme-rush_1' not allowed
A workaround:
import idna
idna.idnadata.codepoint_classes['PVALID'] = tuple(
sorted(list(idna.idnadata.codepoint_classes['PVALID']) + [0x5f0000005f])
)
Seems like it's an issue with Twisted.
There's an issue with a solution regarding it here:
Looking at Twisted's code, it'll use the idna library if available. If I pip uninstall idna and issue the same request again, it is successful.
idna gets installed with either pip install twisted[tls] or pip install treq.
I've tried uninstalling idna via pip uninstall idna and the request, indeed, goes through.
I tried with Selenium and it can parse correctly. I can verify this because if I disable the spider's middleware (where my Selenium code is). The same error is thrown.
raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
idna.core.InvalidCodepoint: Codepoint U+005F at position 3 of 'xyx_abc' not allowed
Related
I really don't know how to help myself, being unfamiliar with this kind of error, and not finding anything on the Google landscape really. My last hope is one of you guys since I don't know where else to go with this. I tried reinstalling all libraries and setting up a new venv. For more action I don't trust myself enough in these kinds of things.
The code triggering the error:
from wetterdienst import DWDObservationData
observations_daily = DWDObservationData(
station_ids=station_ids_d,
parameter=params_daily,
time_resolution=TimeResolution.DAILY,
start_date="2015-01-01",
end_date="2020-10-10",
tidy_data=True,
humanize_column_names=True,
)
for df in observations_hourly.collect_data():
name = str(df.STATION_ID.iloc[0]).strip(".0")
df.to_csv('./data/hourly/{}.csv'.format(name))
print('{} done'.format(name))
API is found here: https://github.com/earthobservations/wetterdienst
Error:
Traceback (most recent call last):
File "/Users/sashakaun/PycharmProjects/wetter2.0/main.py", line 83, in <module>
for df in observations_hourly.collect_data():
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/wetterdienst/dwd/observations/api.py", line 178, in collect_data
df_parameter = self._collect_parameter_from_station(
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/wetterdienst/dwd/observations/api.py", line 243, in _collect_parameter_from_station
df_period = collect_climate_observations_data(
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/wetterdienst/dwd/observations/access.py", line 82, in collect_climate_observations_data
filenames_and_files = download_climate_observations_data_parallel(remote_files)
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/wetterdienst/dwd/observations/access.py", line 106, in download_climate_observations_data_parallel
return list(zip(remote_files, files_in_bytes))
File "/usr/local/Cellar/python#3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/_base.py", line 611, in result_iterator
yield fs.pop().result()
File "/usr/local/Cellar/python#3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "/usr/local/Cellar/python#3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
raise self._exception
File "/usr/local/Cellar/python#3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/wetterdienst/dwd/observations/access.py", line 124, in _download_climate_observations_data
return BytesIO(__download_climate_observations_data(remote_file=remote_file))
File "<decorator-gen-2>", line 2, in __download_climate_observations_data
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/dogpile/cache/region.py", line 1356, in get_or_create_for_user_func
return self.get_or_create(
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/dogpile/cache/region.py", line 954, in get_or_create
with Lock(
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/dogpile/lock.py", line 185, in __enter__
return self._enter()
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/dogpile/lock.py", line 94, in _enter
generated = self._enter_create(value, createdtime)
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/dogpile/lock.py", line 178, in _enter_create
return self.creator()
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/dogpile/cache/region.py", line 920, in gen_value
self.backend.set(key, value)
File "/Users/sashakaun/PycharmProjects/wetter2.0/venv/lib/python3.8/site-packages/dogpile/cache/backends/file.py", line 239, in set
dbm[key] = pickle.dumps(value, pickle.HIGHEST_PROTOCOL)
_gdbm.error: Database needs recovery
Thanks a lot!!
A GDBM file has been corrupted. You need to use gdbmtool to recover the database. Install gdbmtool then run
gdbmtool FILENAME
Where FILENAME is the name of the GDBM database. A prompt will appear, then you can enter
gdbmtool> recover summary
If the database can be recovered it will display a summary of the recovery results, eg:
Recovery succeeded.
Keys recovered: 6870650, failed: 5, duplicate: 0
Buckets recovered: 64830, failed: 2
I am trying to set up a simple email notification system for every build using Buildbot's reporter.MailNotifier. I have implemented it on two Windows computers, and one linux machine and replicated the same error. Here is the code snippet
from buildbot.plugins import *
c = BuildmasterConfig = {}
#Added workers, protocols, and other configurations
#Test scheduler
c['schedulers'] = [schedulers.Periodic(name="tester", builderNames=["runtest"], periodicBuildTimer=60)]
####### BUILDBOT SERVICES
mn = reporters.MailNotifier(fromaddr='email#gmail.com', sendToInterestedUsers=False,
relayhost="smtp.gmail.com",smtpPort=587, useTls=True,
extraRecipients=["email#gmail.com"],
smtpUser="email#gmail.com", smtpPassword="email_password")
c['services'] = [mn]
However, every time I receive the following error in twistd.log:
2017-06-15 21:20:14-0700 [ESMTPSender,client] SMTP Client retrying server. Retry: 1
2017-06-15 21:20:15-0700 [ESMTPSender,client] Unhandled Error
Traceback (most recent call last):
File "c:\users\me\buildbot\sandbox\lib\site-packages\twisted\python\log.py", line 103, in callWithLogger
return callWithContext({"system": lp}, func, *args, **kw)
File "c:\users\me\buildbot\sandbox\lib\site-packages\twisted\python\log.py", line 86, in callWithContext
return context.call({ILogContext: newCtx}, func, *args, **kw)
File "c:\users\me\buildbot\sandbox\lib\site-packages\twisted\python\context.py", line 122, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File "c:\users\me\buildbot\sandbox\lib\site-packages\twisted\python\context.py", line 85, in callWithContext
return func(*args,**kw)
--- <exception caught here> ---
File "c:\users\me\buildbot\sandbox\lib\site-packages\twisted\internet\selectreactor.py", line 149, in _doReadOrWrite
why = getattr(selectable, method)()
File "c:\users\me\buildbot\sandbox\lib\site-packages\twisted\internet\tcp.py", line 208, in doRead
return self._dataReceived(data)
File "c:\users\me\buildbot\sandbox\lib\site-packages\twisted\internet\tcp.py", line 214, in _dataReceived
rval = self.protocol.dataReceived(data)
File "c:\users\me\buildbot\sandbox\lib\site-packages\twisted\protocols\tls.py", line 330, in dataReceived
self._flushReceiveBIO()
File "c:\users\me\buildbot\sandbox\lib\site-packages\twisted\protocols\tls.py", line 295, in _flushReceiveBIO
ProtocolWrapper.dataReceived(self, bytes)
File "c:\users\me\buildbot\sandbox\lib\site-packages\twisted\protocols\policies.py", line 120, in dataReceived
self.wrappedProtocol.dataReceived(data)
File "c:\users\me\buildbot\sandbox\lib\site-packages\twisted\protocols\basic.py", line 571, in dataReceived
why = self.lineReceived(line)
File "c:\users\me\buildbot\sandbox\lib\site-packages\twisted\mail\smtp.py", line 995, in lineReceived
why = self._okresponse(self.code, b'\n'.join(self.resp))
File "c:\users\me\buildbot\sandbox\lib\site-packages\twisted\mail\smtp.py", line 1044, in smtpState_to
return self.smtpState_toOrData(0, b'')
File "c:\users\me\buildbot\sandbox\lib\site-packages\twisted\mail\smtp.py", line 1062, in smtpState_toOrData
self.sendLine(b'RCPT TO:' + quoteaddr(self.lastAddress))
File "c:\users\me\buildbot\sandbox\lib\site-packages\twisted\mail\smtp.py", line 179, in quoteaddr
res = email.utils.parseaddr(addr)
File "c:\python27\Lib\email\utils.py", line 214, in parseaddr
addrs = _AddressList(addr).addresslist
File "c:\python27\Lib\email\_parseaddr.py", line 457, in __init__
self.addresslist = self.getaddrlist()
File "c:\python27\Lib\email\_parseaddr.py", line 217, in getaddrlist
while self.pos < len(self.field):
exceptions.TypeError: object of type 'module' has no len()
Quick info: Error appears on both Buildbot 0.9.8 and 0.9.1 on Windows 10 (64-bit) and Ubuntu 14.04. The error log is from Python 2.7.13 virtualenv 15.1.0 twisted 17.5.0. Inserting the following code in _parseaddr.py works but I'm looking for a better fix.
if str(type(self.field)) == "<type 'module'>":
return [('',u'email#gmail.com')]
It was a typo in the recent release of Twisted 17.5.0
Around line 1900 of twisted/mail/smtp.py in the constructor of class SMTPSenderFactory:
toEmailFinal.append(email)
should have been
toEmailFinal.append(_email)
The former passed the entire email module instead of passing just the parsed email, which produced the error. The newer releases will probably fix that or you can manually replace the line in the file. The fix (by rodrigc) can be found in this GitHub commit
I am trying to check if a certain dataset exists in bigquery using the Google Api Client in Python. It always worked untill the last update where I got this strange error I don't know how to fix:
Traceback (most recent call last):
File "/root/miniconda/lib/python2.7/site-packages/dsUtils/bq_utils.py", line 106, in _get
resp = bq_service.datasets().get(projectId=self.project_id, datasetId=self.id).execute(num_retries=2)
File "/root/miniconda/lib/python2.7/site-packages/oauth2client/util.py", line 140, in positional_wrapper
return wrapped(*args, **kwargs)
File "/root/miniconda/lib/python2.7/site-packages/googleapiclient/http.py", line 755, in execute
method=str(self.method), body=self.body, headers=self.headers)
File "/root/miniconda/lib/python2.7/site-packages/googleapiclient/http.py", line 93, in _retry_request
resp, content = http.request(uri, method, *args, **kwargs)
File "/root/miniconda/lib/python2.7/site-packages/oauth2client/client.py", line 598, in new_request
self._refresh(request_orig)
File "/root/miniconda/lib/python2.7/site-packages/oauth2client/client.py", line 864, in _refresh
self._do_refresh_request(http_request)
File "/root/miniconda/lib/python2.7/site-packages/oauth2client/client.py", line 891, in _do_refresh_request
body = self._generate_refresh_request_body()
File "/root/miniconda/lib/python2.7/site-packages/oauth2client/client.py", line 1597, in _generate_refresh_req
uest_body
assertion = self._generate_assertion()
File "/root/miniconda/lib/python2.7/site-packages/oauth2client/service_account.py", line 263, in _generate_ass
ertion
key_id=self._private_key_id)
File "/root/miniconda/lib/python2.7/site-packages/oauth2client/crypt.py", line 97, in make_signed_jwt
signature = signer.sign(signing_input)
File "/root/miniconda/lib/python2.7/site-packages/oauth2client/_pycrypto_crypt.py", line 101, in sign
return PKCS1_v1_5.new(self._key).sign(SHA256.new(message))
File "/root/miniconda/lib/python2.7/site-packages/Crypto/Signature/PKCS1_v1_5.py", line 112, in sign
m = self._key.decrypt(em)
File "/root/miniconda/lib/python2.7/site-packages/Crypto/PublicKey/RSA.py", line 174, in decrypt
return pubkey.pubkey.decrypt(self, ciphertext)
File "/root/miniconda/lib/python2.7/site-packages/Crypto/PublicKey/pubkey.py", line 93, in decrypt
plaintext=self._decrypt(ciphertext)
File "/root/miniconda/lib/python2.7/site-packages/Crypto/PublicKey/RSA.py", line 235, in _decrypt
r = getRandomRange(1, self.key.n-1, randfunc=self._randfunc)
File "/root/miniconda/lib/python2.7/site-packages/Crypto/Util/number.py", line 123, in getRandomRange
value = getRandomInteger(bits, randfunc)
File "/root/miniconda/lib/python2.7/site-packages/Crypto/Util/number.py", line 104, in getRandomInteger
S = randfunc(N>>3)
File "/root/miniconda/lib/python2.7/site-packages/Crypto/Random/_UserFriendlyRNG.py", line 202, in read
return self._singleton.read(bytes)
File "/root/miniconda/lib/python2.7/site-packages/Crypto/Random/_UserFriendlyRNG.py", line 178, in read
return _UserFriendlyRNG.read(self, bytes)
File "/root/miniconda/lib/python2.7/site-packages/Crypto/Random/_UserFriendlyRNG.py", line 137, in read
self._check_pid()
File "/root/miniconda/lib/python2.7/site-packages/Crypto/Random/_UserFriendlyRNG.py", line 153, in _check_pid
raise AssertionError("PID check failed. RNG must be re-initialized after fork(). Hint: Try Random.atfork()")
AssertionError: PID check failed. RNG must be re-initialized after fork(). Hint: Try Random.atfork()
Is someone understanding what is hapening?
Note that I also get this error with other bricks like GCStorage.
Note also that I use the following command to load my Google credentials:
from oauth2client.client import GoogleCredentials
def get_credentials(credentials_path): #my json credentials path
logger.info('Getting credentials...')
try:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credentials_path
credentials = GoogleCredentials.get_application_default()
return credentials
except Exception as e:
raise e
So if anyone know a better way to load my google credentials using my json service account file, and which would avoid the error, please tell me.
It looks like the error is in the PyCrypto module, which appears to be used under the hood by Google's OAuth2 implementation. If your code is calling os.fork() at some point, you may need to call Crypto.Random.atfork() afterward in both the parent and child process in order to update the module's internal state.
See here for PyCrypto docs; search for "atfork" for more info:
https://github.com/dlitz/pycrypto
This question and answer might also be relevant:
PyCrypto : AssertionError("PID check failed. RNG must be re-initialized after fork(). Hint: Try Random.atfork()")
coincidentally, I run pip search django command and I got time out error. even specifing a high value of timeout
Below the logs:
D:\PERFILES\rmaceissoft\virtualenvs\fancy_budget\Scripts>pip search django --timeout=300
Exception:
Traceback (most recent call last):
File "D:\PERFILES\Marquez\rmaceissoft\Workspace\virtualenvs\fancy_budget\lib\s
ite-packages\pip-1.1-py2.7.egg\pip\basecommand.py", line 104, in main
status = self.run(options, args)
File "D:\PERFILES\Marquez\rmaceissoft\Workspace\virtualenvs\fancy_budget\lib\s
ite-packages\pip-1.1-py2.7.egg\pip\commands\search.py", line 34, in run
pypi_hits = self.search(query, index_url)
File "D:\PERFILES\Marquez\rmaceissoft\Workspace\virtualenvs\fancy_budget\lib\s
ite-packages\pip-1.1-py2.7.egg\pip\commands\search.py", line 48, in search
hits = pypi.search({'name': query, 'summary': query}, 'or')
File "C:\Python27\Lib\xmlrpclib.py", line 1224, in __call__
return self.__send(self.__name, args)
File "C:\Python27\Lib\xmlrpclib.py", line 1575, in __request
verbose=self.__verbose
File "C:\Python27\Lib\xmlrpclib.py", line 1264, in request
return self.single_request(host, handler, request_body, verbose)
File "C:\Python27\Lib\xmlrpclib.py", line 1297, in single_request
return self.parse_response(response)
File "C:\Python27\Lib\xmlrpclib.py", line 1462, in parse_response
data = stream.read(1024)
File "C:\Python27\Lib\httplib.py", line 541, in read
return self._read_chunked(amt)
File "C:\Python27\Lib\httplib.py", line 574, in _read_chunked
line = self.fp.readline(_MAXLINE + 1)
File "C:\Python27\Lib\socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
timeout: timed out
Storing complete log in C:\Users\reiner\AppData\Roaming\pip\pip.log
however, another search command finish without problems:
pip search django-registration
Is that a bug of pip due to the big amount of packages name that contains "django"?
Note: speed internet connection = 2 Mbits
the --timeout option doesn't seem to work properly.
I can install django properly by using either:
pip --default-timeout=60 install django
or
export PIP_DEFAULT_TIMEOUT=60
pip install django
Note: using pip version 1.2.1 on RHEL 6.3
Source: DjangoDay2012-Brescia.pdf, page 11
The pypi is probably overloaded. Just enable mirror fallback and caching in pip. Maybe tune the timeout a bit. Add these in ~/.pip/pip.conf:
[global]
default-timeout = 60
download-cache = ~/.pip/cache
[install]
use-mirrors = true
There is too short default timeout set for pip by default. You should really set this environment variable PIP_DEFAULT_TIMEOUT to at least 60 (1 minute)
Source: http://www.pip-installer.org/en/latest/configuration.html
Can anyone help me with this strange error I'm getting when adding a new document to a Whoosh index?
Here's the code:
def add_to_index(self, doc):
ix = index.open_dir(self.index_dir)
writer = AsyncWriter(ix) # use async writer to prevent write lock errors
writer.add_document(**self.get_doc_args(doc))
writer.commit()
def get_doc_args(self, doc):
return {
'id': u""+str(doc['id']),
'org': doc['org__id'],
'created': doc['created_date'],
'date': doc['received_date'],
'from_addr': doc['from_addr'],
'subject': doc['subject'],
'body': doc['messagebody__cleaned_message']
}
I get the following error:
TypeError('ord() expected a character, but string of length 0 found',)
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/celery/execute/trace.py", line 36, in trace
return cls(states.SUCCESS, retval=fun(*args, **kwargs))
File "/usr/local/lib/python2.6/dist-packages/celery/app/task/__init__.py", line 232, in __call__
return self.run(*args, **kwargs)
File "/usr/local/lib/python2.6/dist-packages/celery/app/__init__.py", line 172, in run
return fun(*args, **kwargs)
File "/mnt/deploy/prod/chorus/src/chorus/../chorus/search/__init__.py", line 131, in index_message
MessageSearcher().add_to_index(message)
File "/mnt/deploy/prod/chorus/src/chorus/../chorus/search/__init__.py", line 29, in add_to_index
writer.commit()
File "/usr/local/lib/python2.6/dist-packages/whoosh/writing.py", line 423, in commit
self.writer.commit(*args, **kwargs)
File "/usr/local/lib/python2.6/dist-packages/whoosh/filedb/filewriting.py", line 501, in commit
new_segments = mergetype(self, self.segments)
File "/usr/local/lib/python2.6/dist-packages/whoosh/filedb/filewriting.py", line 78, in MERGE_SMALL
reader = SegmentReader(writer.storage, writer.schema, seg)
File "/usr/local/lib/python2.6/dist-packages/whoosh/filedb/filereading.py", line 63, in __init__
self.termsindex = TermIndexReader(tf)
File "/usr/local/lib/python2.6/dist-packages/whoosh/filedb/filetables.py", line 590, in __init__
super(TermIndexReader, self).__init__(dbfile)
File "/usr/local/lib/python2.6/dist-packages/whoosh/filedb/filetables.py", line 502, in __init__
OrderedHashReader.__init__(self, dbfile)
File "/usr/local/lib/python2.6/dist-packages/whoosh/filedb/filetables.py", line 379, in __init__
HashReader.__init__(self, dbfile)
File "/usr/local/lib/python2.6/dist-packages/whoosh/filedb/filetables.py", line 187, in __init__
self.hashtype = dbfile.read_byte()
File "/usr/local/lib/python2.6/dist-packages/whoosh/filedb/structfile.py", line 219, in read_byte
return ord(self.file.read(1))
Strangely, the exact same code using a standard writer (i.e. not AsyncWriter) works just fine. What am I missing here? Note that in production I have to use AsyncWriter in order to avoid LockErrors.
This error is caused by some kind of index corruption. In my case the machine crashed by another reason while index was being rebuild.
You can easily solve it by deleting whoosh_index folder contents completely and rebuilidng index.
Ended up finding a solution; it's called Solr :-)