How can I add a string to a dictionary in Python? - python

I'm trying to take some http proxies and append them to a list and then test them individually by opening them with urllib but I get the following type error. I have tried wrapping 'proxy' with str() in the test function but that returns another error.
proxies = []
with open('working_proxies.txt', 'rb') as working_proxies:
    for proxy in working_proxies:
        proxy.rstrip()
        proxies.append(proxy)

def test(proxy):
    try:
        urllib.urlopen(
            "http://google.com",
            proxies={'http': proxy}
        )
    except IOError:
        print "Connection error! (Check proxy)"
    else:
        working_proxy = True

working_proxy = False
while working_proxy == False:
    myProxy = proxies.pop()
    test(myProxy)
My error:
Connection error! (Check proxy)
Traceback (most recent call last):
File "proxy_hand.py", line 26, in <module>
test(proxy)
File "proxy_hand.py", line 16, in test
proxies={'http': proxy}
File "/usr/lib/python2.7/urllib.py", line 87, in urlopen
return opener.open(url)
File "/usr/lib/python2.7/urllib.py", line 193, in open
urltype, proxyhost = splittype(proxy)
File "/usr/lib/python2.7/urllib.py", line 1074, in splittype
match = _typeprog.match(url)
TypeError: expected string or buffer

You opened the file with proxies as binary here:
with open('working_proxies.txt', 'rb') as working_proxies:
The b in the 'rb' mode string means you'll be reading binary data, i.e. bytes objects.
Either open the file in text mode (and perhaps specify a codec other than your system default) or decode your bytes objects to str using an explicit bytes.decode() call:
proxies.append(proxy.decode('ascii'))
I'd expect ASCII to be sufficient to decode hostnames suitable to be used as proxies.
Note that your working_proxy flag won't work; it is not marked as global in test. Perhaps you want to catch the IOError exception outside of test instead, or move the loop into that function. You'll also need to figure out what you'll do when you run out of proxies (so when none of them work).
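A minimal sketch of these suggestions, written for Python 3 and not taken from the original post. The file is read in text mode, the stripped line is actually kept (the question's proxy.rstrip() discards its result), and the check returns a value instead of setting a flag. The names load_proxies, first_working and the check callable are hypothetical; check stands in for whatever performs the real urllib request.

```python
import os
import tempfile


def load_proxies(path):
    """Read one proxy per line in text mode, dropping blanks and newlines."""
    with open(path, 'r', encoding='ascii') as handle:
        return [line.strip() for line in handle if line.strip()]


def first_working(proxies, check):
    """Return the first proxy for which check(proxy) is true, else None."""
    for proxy in proxies:
        if check(proxy):
            return proxy
    return None  # ran out of proxies: none of them worked


# Demo with a throwaway file and a fake check (both are assumptions).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('10.0.0.1:8080\n\n10.0.0.2:3128\n')
demo_proxies = load_proxies(tmp.name)
os.remove(tmp.name)

chosen = first_working(demo_proxies, lambda p: p.endswith(':3128'))
print(demo_proxies, chosen)
```

Returning None when the list is exhausted gives the caller one obvious place to decide what to do when no proxy works.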

Related

How to properly serialize and deserialize paging_size in Python?

In my Python application, I make the query to the Cassandra database. I'm trying to implement pagination through the cassandra-driver package. As you can see from the code below, paging_state returns the bytes data type. I can convert this value to the string data type. Then I send the value of the str_paging_state variable to the client. If this client sends me str_paging_state again I want to use it in my query.
This part of code works:
query = "select * from users where user_type = 'clients';"
statement = SimpleStatement(query, fetch_size=10)
results = session.execute(statement)
paging_state = results.paging_state
print(type(paging_state)) # <class 'bytes'>
str_paging_state = str(paging_state)
print(str_paging_state) # "b'\\x00C\\x00\\x00\\x00\\x02\\x00\\x00\\x00\\x03_hk\\x00\\x00\\x00\\x11P]5C#\\x8bGD~\\x8b\\xc7g\\xda\\xe5rH\\xb0\\x00\\x00\\x00\\x03_rk\\x00\\x00\\x00\\x18\\xee\\x14\\xf7\\x83\\x84\\x00tTmw[\\x00\\xec\\xdb\\x9b\\xa9\\xfd\\x00\\xb9\\xff\\xff\\xff\\xff\\xfe\\x01\\x00'"
This part of the code raises an error:
results = session.execute(
    statement,
    paging_state=bytes(str_paging_state.encode())
)
Error:
[ERROR] NoHostAvailable: ('Unable to complete the operation against any hosts')
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 51, in lambda_handler
    results = cassandra_connection.execute(statement, paging_state=bytes(paging_state.encode()))
  File "/opt/python/lib/python3.8/site-packages/cassandra/cluster.py", line 2618, in execute
    return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state, host, execute_as).result()
  File "/opt/python/lib/python3.8/site-packages/cassandra/cluster.py", line 4877, in result
    raise self._final_exception
END RequestId: 4b7bf588-a2d2-45e5-ad7e-8611f1704313
In the Java documentation I found the .fromString method, which creates a PagingState object from a string previously generated with toString(). Unfortunately, I didn't find an equivalent of this method in Python.
I also tried using the codecs package to decode and encode the paging_state:
str_paging_state = codecs.decode(paging_state, encoding='utf-8', errors='ignore')
# "\u0000C\u0000\u0000\u0000\u0002\u0000\u0000\u0000\u0003_hk\u0000\u0000\u0000\u0011P]5C#GD~grH\u0000\u0000\u0000\u0003_rk\u0000\u0000\u0000\u0018\u0014\u0000tTmw[\u0000ۛ\u0000\u0001\u0000"
# Raises an error
results = session.execute(statement, paging_state=codecs.encode(str_paging_state, encoding='utf-8', errors='ignore'))
In this case I see the following error:
[ERROR] ProtocolException: <Error from server: code=000a [Protocol error] message="Invalid value for the paging state">
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 50, in lambda_handler
    results = cassandra_connection.execute(
  File "/opt/python/lib/python3.8/site-packages/cassandra/cluster.py", line 2618, in execute
    return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state, host, execute_as).result()
  File "/opt/python/lib/python3.8/site-packages/cassandra/cluster.py", line 4877, in result
    raise self._final_exception
END RequestId: 979f098a-a566-4904-821a-2ce06522d909
In my case, protocol version is 4.
cluster = Cluster(..., protocol_version=4)
I would appreciate any help!
Just convert the binary data into a hex string or base64; the binascii module does both. For hex, use the hexlify/unhexlify functions (or, in Python 3, the .hex() method of bytes objects); for base64, use the b2a_base64/a2b_base64 functions.
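A short sketch of that round trip, using only the standard library. The paging_state value here is a made-up stand-in for results.paging_state. Note that str() on a bytes object only produces its repr (the b'...' text shown in the question), so encoding that string again does not give back the original bytes; hex and base64 are reversible.

```python
import base64
import binascii

# Stand-in for results.paging_state; real values come from the driver.
paging_state = b'\x00C\x00\x00\x00\x02\xff\xfe\x01\x00'

# Hex: bytes -> str to send to the client, str -> bytes for the next query.
hex_token = paging_state.hex()
from_hex = bytes.fromhex(hex_token)

# The same round trip through binascii, as named in the answer.
via_binascii = binascii.unhexlify(binascii.hexlify(paging_state))

# URL-safe base64 gives a shorter token than hex.
b64_token = base64.urlsafe_b64encode(paging_state).decode('ascii')
from_b64 = base64.urlsafe_b64decode(b64_token)

print(hex_token, b64_token)
print(from_hex == paging_state, via_binascii == paging_state, from_b64 == paging_state)
```

The restored bytes would then be passed back as in the question, for example session.execute(statement, paging_state=bytes.fromhex(token)), where token is whatever string the client returned.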

Compress in Java, decompress in Python - snappy/redis-py-cluster

I am writing a cron script in Python for a Redis cluster, using redis-py-cluster only to read data from a prod server. A separate Java application writes to the Redis cluster with snappy compression and the Java string codec UTF-8.
I am able to read the data but not able to decode it.
from rediscluster import RedisCluster
import snappy
host, port ="127.0.0.1", "30001"
startup_nodes = [{"host": host, "port": port}]
print("Trying connecting to redis cluster host=" + host + ", port=" + str(port))
rc = RedisCluster(startup_nodes=startup_nodes, max_connections=32, decode_responses=True)
print("Connected", rc)
print("Reading all keys, value ...\n\n")
for key in rc.scan_iter("uidx:*"):
    value = rc.get(key)
    #uncompress = snappy.uncompress(value, decoding="utf-8")
    print(key, value)
    print('\n')
print("Done. exit()")
exit()
With decode_responses=False this works fine as long as the snappy line stays commented out. However, changing to decode_responses=True throws an error. My guess is that it's not able to pick the correct decoder.
Traceback (most recent call last):
File "splooks_cron.py", line 22, in <module>
print(key, rc.get(key))
File "/Library/Python/2.7/site-packages/redis/client.py", line 1207, in get
return self.execute_command('GET', name)
File "/Library/Python/2.7/site-packages/rediscluster/utils.py", line 101, in inner
return func(*args, **kwargs)
File "/Library/Python/2.7/site-packages/rediscluster/client.py", line 410, in execute_command
return self.parse_response(r, command, **kwargs)
File "/Library/Python/2.7/site-packages/redis/client.py", line 768, in parse_response
response = connection.read_response()
File "/Library/Python/2.7/site-packages/redis/connection.py", line 636, in read_response
raise e
: 'utf8' codec can't decode byte 0x82 in position 0: invalid start byte
PS: Uncommenting the line uncompress = snappy.uncompress(value, decoding="utf-8") breaks with this error:
Traceback (most recent call last):
File "splooks_cron.py", line 27, in <module>
uncompress = snappy.uncompress(value, decoding="utf-8")
File "/Library/Python/2.7/site-packages/snappy/snappy.py", line 91, in uncompress
return _uncompress(data).decode(decoding)
snappy.UncompressError: Error while decompressing: invalid input
After hours of debugging, I was finally able to solve this.
I am using the xerial/snappy-java compressor in the Java code that writes to the Redis cluster. The interesting thing is that, during compression, xerial's SnappyOutputStream adds some extra bytes (an offset) at the beginning of the compressed data. In my case it looks something like this:
"\x82SNAPPY\x00\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x01\xb6\x8b\x06\\******actual data here*****
Because of this, the decompressor could not make sense of the input. I modified the code as below to remove the offset from the value, and it's working fine now.
for key in rc.scan_iter("uidx:*"):
    value = rc.get(key)
    # in my case the offset was 20, and utf-8 is the default encoder/decoder for snappy
    # https://github.com/andrix/python-snappy/blob/master/snappy/snappy.py
    uncompress_value = snappy.decompress(value[20:])
    print(key, uncompress_value)
    print('\n')
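A hard-coded offset of 20 works for this value, but it can be derived instead of guessed. The sketch below rests on an assumption read off the sample bytes above rather than on any specification: an 8-byte magic (\x82SNAPPY\x00), two 4-byte version fields (both \x00\x00\x00\x01 in the sample), then chunks that are each prefixed with a 4-byte big-endian length (\x00\x00\x01\xb6 in the sample). That adds up to 8 + 4 + 4 + 4 = 20 bytes before the first chunk's data. The function name split_xerial_chunks is hypothetical, and the code only splits the payload; decompression is left to python-snappy.

```python
import struct

XERIAL_MAGIC = b'\x82SNAPPY\x00'     # 8 bytes, as seen in the sample value
HEADER_SIZE = len(XERIAL_MAGIC) + 8  # magic + two 4-byte version fields (assumed)


def split_xerial_chunks(payload):
    """Return the raw compressed chunks of a SnappyOutputStream-style payload."""
    if not payload.startswith(XERIAL_MAGIC):
        return [payload]  # no stream header: treat as one bare Snappy block
    chunks = []
    pos = HEADER_SIZE
    while pos + 4 <= len(payload):
        (size,) = struct.unpack('>I', payload[pos:pos + 4])  # big-endian length
        pos += 4
        chunks.append(payload[pos:pos + size])
        pos += size
    return chunks


# Fake payload: header, then one 3-byte chunk and one 2-byte chunk.
fake = (XERIAL_MAGIC + struct.pack('>II', 1, 1)
        + struct.pack('>I', 3) + b'abc'
        + struct.pack('>I', 2) + b'de')
print(split_xerial_chunks(fake))
# With python-snappy installed, each chunk would then go through
# snappy.decompress(chunk) and the results would be concatenated.
```

Walking the length prefixes also covers values large enough to be written as more than one chunk, which a fixed value[20:] slice would not handle.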

Tornado write_message not sending dict/json

I am trying to send the file over the tornado websocket like this
in_file = open("/home/rootkit/Pictures/test.png", "rb")
data = in_file.read()
in_file.close()
d = {'file': base64.b64encode(data), 'filename': 'test.png'}
self.ws.write_message(message=d)
as per tornado documentation.
The message may be either a string or a dict (which will be encoded as json). If the binary argument is false, the message will be sent as utf8; in binary mode any byte string is allowed.
But I am getting this exception.
ERROR:asyncio:Future exception was never retrieved
future: <Future finished exception=TypeError("Expected bytes, unicode, or None; got <class 'dict'>",)>
Traceback (most recent call last):
File "/home/rootkit/.local/lib/python3.5/site-packages/tornado/gen.py", line 1147, in run
yielded = self.gen.send(value)
File "/home/rootkit/PycharmProjects/socketserver/WebSocketClient.py", line 42, in run
self.ws.write_message(message=d, binary=True)
File "/home/rootkit/.local/lib/python3.5/site-packages/tornado/websocket.py", line 1213, in write_message
return self.protocol.write_message(message, binary=binary)
File "/home/rootkit/.local/lib/python3.5/site-packages/tornado/websocket.py", line 854, in write_message
message = tornado.escape.utf8(message)
File "/home/rootkit/.local/lib/python3.5/site-packages/tornado/escape.py", line 197, in utf8
"Expected bytes, unicode, or None; got %r" % type(value)
TypeError: Expected bytes, unicode, or None; got <class 'dict'>
The documentation you're citing is for WebSocketHandler, which is meant for serving a websocket connection, whereas you're using a websocket client. You'll have to convert your dictionary to JSON manually.
from tornado.escape import json_encode
self.ws.write_message(message=json_encode(d))
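One further catch on Python 3, which the traceback shows is in use: base64.b64encode returns bytes, and JSON encoders reject bytes values, so the dictionary from the question would likely still fail to serialize. A minimal sketch of the payload side follows. It uses the standard library's json.dumps in place of tornado.escape.json_encode so that it runs without Tornado installed, and the data value is a made-up stand-in for the file contents.

```python
import base64
import json

data = b'\x89PNG\r\n\x1a\n fake image bytes'  # stand-in for in_file.read()

# b64encode returns bytes on Python 3; decode to str so JSON can hold it.
payload = {
    'file': base64.b64encode(data).decode('ascii'),
    'filename': 'test.png',
}
message = json.dumps(payload)  # this string is what write_message would send

# Receiving side: parse the JSON and decode the base64 text back to bytes.
received = json.loads(message)
restored = base64.b64decode(received['file'])
print(received['filename'], restored == data)
```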

Encoding Error with Beautiful Soup: Character Maps to Undefined (Python)

I've written a script that is supposed to retrieve html pages off a site and update their contents. The following function looks for a certain file on my system, then attempts to open it and edit it:
def update_sn(files_to_update, sn, table, title):
    paths = files_to_update['files']
    print('updating the sn')
    try:
        sn_htm = [s for s in paths if re.search('^((?!(Default|Notes|Latest_Addings)).)*htm$', s)][0]
        notes_htm = [s for s in paths if re.search('_Notes\.htm$', s)][0]
    except Exception:
        print('no sns were found')
        pass
    new_path_name = new_path(sn_htm, files_to_update['predecessor'], files_to_update['original'])
    new_sn_number = sn
    htm_text = open(sn_htm, 'rb').read().decode('cp1252')
    content = re.findall(r'(<table>.*?<\/table>.*)(?:<\/html>)', htm_text, re.I | re.S)
    minus_content = htm_text.replace(content[0], '')
    table_soup = BeautifulSoup(table, 'html.parser')
    new_soup = BeautifulSoup(minus_content, 'html.parser')
    head_title = new_soup.title.string.replace_with(new_sn_number)
    new_soup.link.insert_after(table_soup.div.next)
    with open(new_path_name, "w+") as file:
        result = str(new_soup)
        try:
            file.write(result)
        except Exception:
            print('Met exception. Changing encoding to cp1252')
            try:
                file.write(result('cp1252'))
            except Exception:
                print('cp1252 did\'nt work. Changing encoding to utf-8')
                file.write(result.encode('utf8'))
                try:
                    print('utf8 did\'nt work. Changing encoding to utf-16')
                    file.write(result.encode('utf16'))
                except Exception:
                    pass
This works in the majority of cases, but sometimes it fails to write, at which point the exception kicks in and I try every feasible encoding without success:
updating the sn
Met exception. Changing encoding to cp1252
cp1252 did'nt work. Changing encoding to utf-8
Traceback (most recent call last):
File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 145, in update_sn
file.write(result)
File "C:\Users\Joseph\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 4006-4007: character maps to <undefined>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 149, in update_sn
file.write(result('cp1252'))
TypeError: 'str' object is not callable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "scraper.py", line 79, in <module>
get_latest(entries[0], int(num), entries[1])
File "scraper.py", line 56, in get_latest
update_files.update_sn(files_to_update, data['number'], data['table'], data['title'])
File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 152, in update_sn
file.write(result.encode('utf8'))
TypeError: write() argument must be str, not bytes
Can anyone give me any pointers on how to better handle html data that might have inconsistent encoding?
In your code you open the file in text mode, but then you attempt to write bytes (str.encode returns bytes) and so Python throws an exception:
TypeError: write() argument must be str, not bytes
If you want to write bytes, you should open the file in binary mode.
BeautifulSoup detects the document's encoding (if it is given bytes) and converts it to a string automatically. We can access the encoding with .original_encoding and use it to encode the content when writing to the file. For example,
soup = BeautifulSoup(b'<tag>ascii characters</tag>', 'html.parser')
data = soup.tag.text
encoding = soup.original_encoding or 'utf-8'
print(encoding)
# ascii
with open('my.file', 'wb+') as file:
    file.write(data.encode(encoding))
In order for this to work you should pass your html as bytes to BeautifulSoup, so don't decode the response content.
If BeautifulSoup fails to detect the correct encoding for some reason, then you could try a list of possible encodings, like you have done in your code.
data = 'Somé téxt'
encodings = ['ascii', 'utf-8', 'cp1252']
with open('my.file', 'wb+') as file:
    for encoding in encodings:
        try:
            file.write(data.encode(encoding))
            break
        except UnicodeEncodeError:
            print(encoding + ' failed.')
Alternatively, you could open the file in text mode and set the encoding in open (instead of encoding the content), but note that this option is not available in Python2.
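A minimal sketch of that alternative on Python 3: the file is opened in text mode with an explicit encoding, so the str is written directly and open() does the encoding. The temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile

data = 'Somé téxt'
path = os.path.join(tempfile.mkdtemp(), 'my.file')

# Text mode with an explicit encoding: write str, not bytes.
with open(path, 'w', encoding='utf-8') as file:
    file.write(data)

# Read the raw bytes back to confirm what ended up on disk.
with open(path, 'rb') as file:
    raw = file.read()
print(raw.decode('utf-8') == data)
```

Without the encoding argument, open() falls back to the platform default, which is how cp1252 came into play in the question's traceback.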
Just out of curiosity, is this line of code a typo: file.write(result('cp1252'))? It seems to be missing the .encode method.
Traceback (most recent call last):
File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 149, in update_sn
file.write(result('cp1252'))
TypeError: 'str' object is not callable
Does it work if you modify the code to file.write(result.encode('cp1252'))?
I once had the same problem writing to a file with an encoding, and worked out my own solution from the following thread:
Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence
My problem was solved by changing the parser from html.parser to html5lib. I traced it to a malformed HTML tag, which the html5lib parser handled. For reference, see the BeautifulSoup documentation on each parser it supports.
Hope this helps

Python ldif3 parser and exception in for loop

From site: https://pypi.python.org/pypi/ldif3/3.2.0
I have this code:
from ldif3 import LDIFParser
from pprint import pprint
parser = LDIFParser(open('data.ldif', 'rb'))
for dn, entry in parser.parse():
    print('got entry record: %s' % dn)
    pprint(entry)
Now, when reading my file data.ldif, I get an exception in parser.parse().
The question is how to catch this exception and let the for loop move on to the next record (continue).
Traceback:
Traceback (most recent call last):
File "ldif.py", line 16, in <module>
for dn, entry in parser.parse():
File "/home/dlubom/anaconda2/lib/python2.7/site-packages/ldif3.py", line 373, in parse
yield self._parse_entry_record(block)
File "/home/dlubom/anaconda2/lib/python2.7/site-packages/ldif3.py", line 346, in _parse_entry_record
attr_type, attr_value = self._parse_attr(line)
File "/home/dlubom/anaconda2/lib/python2.7/site-packages/ldif3.py", line 309, in _parse_attr
return attr_type, attr_value.decode('utf8')
File "/home/dlubom/anaconda2/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb3 in position 6: invalid start byte
I think it's not possible to handle the exception in this case, because it happens before/during the variable assignment.
BTW, you probably want to use the strict parameter:
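To illustrate why: the traceback shows that parse() is a generator (it fails on a yield line). The exception can be caught by driving the iterator with next() by hand, but a generator that has raised is finished, so there is no next record to continue with. The toy parse function below is a hypothetical stand-in, not ldif3's code.

```python
def parse(blocks):
    """Toy stand-in for a generator-based parser such as parser.parse()."""
    for block in blocks:
        yield block.decode('utf8')  # raises UnicodeDecodeError on bad bytes


records = parse([b'ok-1', b'bad-\xb3', b'ok-2'])
seen = []
while True:
    try:
        seen.append(next(records))
    except UnicodeDecodeError:
        seen.append('<error>')  # caught, but the generator is now finished
    except StopIteration:
        break
print(seen)
```

The third record is never reached, which is why skipping bad entries has to happen inside the parser itself.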
strict (boolean) – If set to False, recoverable parse errors will
produce log warnings rather than exceptions.
Example:
parser = LDIFParser(ldif_file, strict=False)
https://ldif3.readthedocs.io/en/latest/
That helped me parse an invalid LDIF file containing commas inside CN attributes.
