I am using the library presented in https://github.com/cmcqueen/cobs-python so that I can safely send data over a serial line.
However, this library requires a string as input, while my data can be just about anything (text files, images, etc.). Since I'm using the serial line without any special features enabled, there is no need to worry about special characters triggering events.
If I could, I would simply send the image raw, since it does not matter how the data gets to the other end, but I need to encode it with this library first.
I have tried the following:
data = open('img.jpg', 'rb')
cobs_packet = cobs.encode(''.join(data).encode('utf-8'))
This gives me the following error:
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
The problem is, if I use different encoding types, the data length is changed, and that can't happen for what I'm trying to do.
Isn't there any way to simply convert the input to string as-is?
EDIT: I'm using Python 2.7.
I haven't tested how far this is valid, but I have found a simple solution that seems to work: using the bytes() function, the data behaves like a string, and I can pass it to the library in question like this:
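(Roughly what I ended up with; in Python 2.7, bytes is just an alias for str, so the raw file contents can be handed straight to the encoder.)
from cobs import cobs               # the encode/decode module from the cobs package

with open('img.jpg', 'rb') as f:    # read the image as raw bytes
    data = f.read()                 # in Python 2 this is a str of raw bytes
cobs_packet = cobs.encode(bytes(data))  # bytes() is effectively a no-op here, same type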
Thanks for the help everyone. Cheers.
So I am currently doing a beginner CTF challenge on pwnable.tw, specifically the "start" challenge. After reversing the challenge binary I found a buffer overflow exploit, and to get an ideal starting point I need to leak a stack address by redirecting execution to a specific address (0x08048087). So I crafted a payload that overwrites the return address with the address I'm aiming for. However, I'm having trouble converting the byte data into a string format that can be fed to the vulnerable program.
Below is my python code:
from pwn import *
shellcode = b'A' * 20
shellcode += pack(0x08048087, 32)
print(shellcode)
I use the pwn library to simplify packing the address, then print the payload and pipe it into the vulnerable binary as stdin. However, when I print it, rather than emitting the raw bytes corresponding to that address, it prints this:
b'AAAAAAAAAAAAAAAAAAAA\x87\x80\x04\x08'
Just a string-literal version of the hex values themselves. This, of course, will not be interpreted by the program the way I intend. So I tried decoding it to UTF-8 or ASCII, or even using str() to convert it, but no matter which way I choose I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 20: invalid start byte
It seems it can't decode 0x87, which makes sense, since there is no valid character for it to decode to. But then my question becomes: how can I deliver my shellcode, specifically the hexadecimal address part, in a way that makes the program interpret that portion of the overflowed buffer as the address I intend, rather than mangling it because my script handed over a stringified version of the hex values?
So I ended up finding the answer: use sys.stdout.buffer.write() rather than print() or sys.stdout.write(). sys.stdout.buffer is a BufferedWriter that operates on raw bytes, whereas the other two operate on text/strings. Thank you to everyone in the comments who helped me!
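For anyone who lands here later, a minimal sketch of that fix (Python 3 with pwntools, using the same payload as above):
import sys
from pwn import *

shellcode = b'A' * 20
shellcode += pack(0x08048087, 32)   # pack the 32-bit return address, little-endian

# Write the raw bytes to stdout; print() and sys.stdout.write() go through the
# text layer and would mangle or reject bytes like 0x87.
sys.stdout.buffer.write(shellcode)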
Let me be more clear.
I'm receiving a string in Python like this:
file = "b'x\\x9c\\xb4'"
The type of file is str, but you can see that the string contains the repr of a <class 'bytes'> object. It is the result of calling str(file) after file was already encoded. I would like to decode it, but I don't know how to decode bytes that are trapped inside a string object.
My question is: is there a way to interpret file as bytes instead of str without calling something like bytes(file, 'utf-8') or file.encode('utf-8')? The problem with those methods is that they would encode the already encoded bytes a second time, as stated before.
Why do I need that?
I'm building an API and I need to send back a significantly big string as a JSON value. Since there was plenty of room to compress it, I ended up using zlib:
import zlib
file = BIG_STRING
file_compressed = zlib.compress(BIG_STRING.encode('utf-8')) # Since zlib expects a bytes object
send_back({"SOME_BIG_STRING": str(file_compressed)})
I'm sending it back as a string because I can't send it back as a bytes object; JSON doesn't support that. And if I try to decode the compressed data before sending, I run into an error:
send_back({"SOME_BIG_STRING": file_compressed.decode('utf-8')})
-> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 1: invalid start byte
And when I receive the same string later in the program, I find myself stuck on the problem described initially.
I lack the knowledge right now to work around this and couldn't find an answer elsewhere. I'd be extremely grateful if anyone could help me!
Anyway, you can call eval("b'x\\x9c\\xb4'") and get your result b'x\x9c\xb4' back if you don't find any other solution. But using eval is generally not recommended and is considered bad practice.
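A short sketch of that approach (hedged aside: ast.literal_eval, which is not part of the original suggestion, accepts the same literal without executing arbitrary code):
import ast

file = "b'x\\x9c\\xb4'"           # the str from the question, holding a bytes repr
raw = eval(file)                  # b'x\x9c\xb4' -- the suggestion above; eval runs arbitrary code
raw = ast.literal_eval(file)      # same result, but restricted to plain Python literals
# 'raw' can then be handed to zlib.decompress() as in the question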
I have a task where I needed to generate a portable version of a data dictionary, with some extra fields inserted. I ended up building a somewhat large Python dictionary, which I then wanted to convert to JSON. However, when I attempt this conversion...
with open('CPS14_data_dict.json','w') as f:
    json.dump(data_dict,f,indent=4,encoding='utf-8')
I get smacked with an exception:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 17: invalid start byte
I know this is a common error, but I cannot, for the life of me, find an instance of how one locates the problematic portion in the dictionary. The only plausible thing I have seen is to convert the dictionary to a string and run ord() on each character:
for i,c in enumerate(str(data_dict)):
    if ord(c)>128:
        print i,'|',c
The problem is, this operation returns nothing at all. Am I missing something about how ord() works? Alternatively, a position (17) is reported, but it's not clear to me what this refers to. There do not appear to be any problems at the 17th character, row, or entry.
I should say that I know about the ensure_ascii=False option. Indeed, it will write to disk (and it's beautiful). This approach, however, seems to just kick the can down the road. I get the same encoding error when I try to read the file back in. Since I will want to use this file for multiple purposes (converted back to a dictionary), this is an issue.
It would also be helpful to note that this is my work computer with Windows 7, so I don't have my shell tools to explore the file (and my VM is on the fritz).
Any help would be greatly appreciated.
I'm using Jeff's demo code for using the YouTube API and Python to interact with captions for my videos, and I have it working great for my videos in English. Unfortunately, when I try to use it with my videos that have automatic transcripts in Spanish, which contain characters such as á, ¡, etc., I get an encoding error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 25: ordinal not in range(128)
My Python script has # -*- coding: utf-8 -*- at the top and I've changed the CAPTIONS_LANGUAGE_CODE to 'es', but it seems like the script is still interpreting the .srt file it downloads as ascii rather than utf-8. The line where it downloads the .srt file is:
if response_headers["status"] == "200":
    self.srt_captions = SubRipFile.from_string(body)
How can I get Python to consider the srt file as utf-8 so that it doesn't throw an encoding error?
Thanks!
It looks like this isn't really a YouTube API issue at all, but a Python one. Note that your error isn't an encoding error but a decoding error; you've stumbled upon the way that Python is designed to work (for better or for worse).
Many, many functions in Python will cast unicode data as 8-bit strings rather than native unicode objects, using \x with a hex number to represent characters greater than 127. (One such method is the from_string method of the SubRipFile object you're using.) So the data is still unicode, but the object is a string.
Because of this, when you then force a cast to a unicode object (triggered by using the join method of a unicode object in the sample code you provided), Python assumes the ascii codec (the default for 8-bit strings, regardless of the data's actual encoding) to deal with the data, which then throws an error on those hex characters.
There are several solutions.
1) You could explicitly tell Python not to assume an ascii codec when you run your join method, but I always struggle with getting that right (and doing it in every case). So I won't attempt some sample code.
2) You could forego native unicode objects and just use 8-bit strings to work with your unicode data; this would only require you changing this line:
body = u'\n'.join(lines[2:])
To this:
body = '\n'.join(lines[2:])
There are potential drawbacks to this approach, however -- again, you'd have to make sure you're doing it in every case; you also wouldn't be leveraging Python-native unicode objects (which may or may not be an issue for later in your code).
3) You could use the low-level 'codecs' module to ensure that the data is cast as a native unicode object from the get-go rather than messing around with 8-bit strings. Normally, you accomplish such a task in this manner:
import codecs
f=codecs.open('captions.srt',encoding='utf-8')
l=f.readlines()
f.close()
type(l[0]) # will be unicode object rather than string object
Of course, you have the complication of using a SubRipFile object which returns a string, but you could get around that by either sending it through a StringIO object (so the codecs module can treat the ripped data as a file), calling .decode('utf-8') on the string, etc. The Python docs have pretty good sections on all of this.
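For instance, a rough sketch of the StringIO route mentioned above (Python 2; here body stands in for the 8-bit string that came back from the download):
import codecs
from StringIO import StringIO

reader = codecs.getreader('utf-8')(StringIO(body))
unicode_body = reader.read()
type(unicode_body) # will be unicode object rather than string object
# unicode_body can then be used wherever body was used before
# (assuming the downstream call accepts a unicode object)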
Best of luck.
I built a django site last year that utilises both a dashboard and an API for a client.
They are, on occasion, putting unicode information (usually via a Microsoft keyboard and a single quote character!) into the database.
It's fine to change this one instance everywhere, but I constantly get something like this error whenever a new character is added that I haven't "converted":
UnicodeDecodeError at /xx/xxxxx/api/xxx.json
'ascii' codec can't decode byte 0xeb in position 0: ordinal not in range(128)
The issue is actually that I need to be able to convert this unicode (from the model) into HTML.
# if a char breaks the system, replace it here (duplicate line)
text = unicode(str(text).replace('\xa3', '£'))
I duplicate this line for each new character; without it, things just break.
I'm tearing my hair out because I know this is straightforward and I'm doing something remarkably silly somewhere.
I have searched around and realised that while my issue is not new, I can't find the answer anywhere.
I assume that text is unicode (which seems a safe assumption, as \xa3 is the Unicode code point for the £ character).
I'm not sure why you need to encode it at all, seeing as the text will be converted to utf-8 on output in the template, and all browsers are perfectly capable of displaying that. There is likely another point further down the line where something (probably your code, unfortunately) is assuming ASCII, and the implicit conversion is breaking things.
In that case, you could just do this:
text = text.encode('ascii', 'xmlcharrefreplace')
which converts the non-ASCII characters into HTML/XML numeric entities such as &#163; for £.
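For example (a quick illustrative Python 2 session):
>>> text = u'\xa3100'
>>> text.encode('ascii', 'xmlcharrefreplace')
'&#163;100'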
Tell the JSON decoder that it should decode the JSON file as unicode. When using the json module directly, this can be done with the following code:
json.JSONDecoder(encoding='utf8').decode(
    json.JSONEncoder(encoding='utf8').encode('blä'))
If the JSON decoding takes place via some other module (Django, ...), you may be able to pass that information through the other module down to the json machinery.