How can I access Firefox's internal indexedDB files using Python?

I need to read Firefox's IndexedDB using Python.
I use the sqlite3 package to retrieve the contents of the IndexedDB file:
import sqlite3

with sqlite3.connect(indexeddb_file) as conn:
    c = conn.cursor()
    c.execute('SELECT * FROM object_data;')
    rows = c.fetchall()
    for row in rows:
        print(row[2])
However, although I know that the contents of the database are strings, they are stored as SQLite binary blobs. Is there a way to read the strings stored as blobs from Python?
I've tried:
the hex() and quote() SQL functions, which just encode the blob to hexadecimal
writing the blob to a file, which has the same problem
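For reference, this is how I'm getting at the raw bytes (a minimal sketch, assuming as above that row[2] is the blob column):
# sqlite3 hands BLOB columns back as buffer (Python 2) / bytes (Python 3);
# converting to bytes lets me inspect the raw content directly.
raw = bytes(row[2])
print(repr(raw[:40]))  # peek at the first bytes of the blob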
UPDATE
Following the coding scheme in the Firefox source code implementation of IndexedDB, pointed out by @paa in one of the comments on this question, I implemented part of the FF encoding method for database keys in Python. So far I have implemented it only for strings, but implementing it for the other types would be even easier:
import re

BYTE_LENGTH = 8

def hex_to_bin(hex_str):
    """Return binary representation of a hexadecimal string."""
    return trim_bin(int(hex_str, 16)).zfill(len(hex_str) * 4)

def byte_to_unicode(bin_byte):
    """Return the character encoded by a binary byte string."""
    return chr(int(str(bin_byte), 2))

def trim_bin(int_n):
    """Return int converted to a binary string, without the '0b' prefix."""
    return bin(int_n)[2:]

def decode(key):
    """Return decoded idb key."""
    decoded = key
    m = re.search("[1-9]", key)  # find the first non-zero digit: the type marker
    if not m:
        return decoded  # error: no type marker found
    i = m.start()
    typeoffset = int(key[i])
    data = key[i + 1:]
    if typeoffset == 1:
        # decode number
        pass
    elif typeoffset == 2:
        # decode date
        pass
    elif typeoffset == 3:
        # decode string
        bin_repr = hex_to_bin(data)
        decoded = ""
        i = 0
        # a while loop is needed here: multi-byte sequences advance i inside
        # the loop body, which a for loop over xrange would silently undo
        while i < len(bin_repr):
            byte = bin_repr[i:i + BYTE_LENGTH]
            if byte[0] == '0':
                # one-byte sequence: the value is stored incremented by 1
                byte_1 = int(byte, 2) - 1
                decoded += byte_to_unicode(trim_bin(byte_1))
            else:
                byte = byte[2:]
                if byte[1] == '0':
                    # two-byte sequence
                    byte_127 = int(byte, 2) + 127
                    decoded += byte_to_unicode(trim_bin(byte_127))
                    i += BYTE_LENGTH
                    decoded += byte_to_unicode(bin_repr[i:i + BYTE_LENGTH])
                elif byte[1] == '1':
                    # three-byte sequence
                    decoded += byte_to_unicode(byte)
                    i += BYTE_LENGTH
                    decoded += byte_to_unicode(bin_repr[i:i + BYTE_LENGTH])
                    i += BYTE_LENGTH
                    decoded += byte_to_unicode(bin_repr[i:i + 2])
            i += BYTE_LENGTH
        return decoded
    elif typeoffset == 4:
        # decode array
        pass
    else:
        # error
        pass
    return decoded
However, I'm still not able to decode the data fields of the IndexedDB. It seems to me that they are not using any sophisticated scheme like the one for the keys, because I can read some parts of the actual values when I decode them as UTF-16.
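For example, a lossy decode makes the readable fragments visible (a sketch; utf-16-le is my guess from inspecting the bytes, not something confirmed by the Firefox source):
raw = bytes(row[2])  # row[2] is the data blob, as in the snippet above
# 'replace' keeps going past the non-UTF-16 sections instead of raising
print(raw.decode('utf-16-le', 'replace'))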

(Typing here since I can't comment yet...)
For the data itself, I've been trying to do the same thing with the data blobs. In my case, I'm trying to extract JSON strings. Looking at the DB I'm sifting through, I do see UTF-16 encoded characters most of the time. But there are strange cases like this one:
"there we go" is encoded as 7400 6800 6500 7200 6500 2000 77 [05060C] 6700 6F00. The [05060C] supposedly encodes "e ".
https://mxr.mozilla.org/mozilla-release/source/dom/indexedDB/IDBObjectStore.cpp
I'm looking into that file to see if there are any clues; there should be plenty of other source files in that directory that could help.

Related

Reading variable-length binary values from a file in Python

I have three text values that I am encrypting and then writing to a file. Later I want to read the values back (in another script) and decrypt them.
I've successfully encrypted the values:
import rsa  # the pure-Python 'rsa' package

cenc = rsa.encrypt(client_name.encode('utf8'), publicKey)
denc = rsa.encrypt(expiry_date.encode('utf8'), publicKey)
fenc = rsa.encrypt(features.encode('utf8'), publicKey)
and written to a binary file:
licensefh = open("license.sfb", "wb")
licensefh.write(cenc)
licensefh.write(denc)
licensefh.write(fenc)
licensefh.close()
The three values cenc, denc and fenc are all of different lengths so when I read the file back:
licensefh = open("license.sfb", "rb")
encMessage = licensefh.read()
encMessage contains the entire file and I don't know how to get the three values back again.
I've tried using a separator between the values:
SEP = bytes(chr(0x02).encode('utf8'))
...
licensefh.write(cenc)
licensefh.write(SEP)
...
and then using encMessage.partition(SEP) or encMessage.split(SEP) but the data invariably contains the SEP value in it somewhere (I've tried a few different characters) so that didn't work.
I tried getting the length of the bytes objects cenc, denc and fenc, but this returned 256 for each value even though the plaintexts are all different lengths. (In hindsight that is expected: RSA ciphertexts are always the size of the key modulus, so a 2048-bit key produces 256-byte ciphertexts regardless of plaintext length.)
My question is this. How do I write these three variable length values to a binary file and then separate them when I read them back again?
Here's an example of the 3 binary values:
b'tX\x10Fo\x89\x10~\x83Pok\xd1\xfb\xbe\x0e<a\xe5\x11md:\xe6\x84#\xfa\xf8\xe5\xeb\xf8\xdc{\xc0Z\xa0\xc0^\xc1\xd9\x820\xec\xec\xb0R\x99/\xa2l\x88\xa9\xa6g\xa3\x01m\xf9\x7f\x91\xb9\xe1\x80\xccs|\xb7_\xa9Fp\x11yvG\xdc\x02d\x8aK2\x92t\x0e\x1f\xca\x19\xbb&\xaf{\xc0y>\t|\x86\xab\x16.\xa5kZ"\xab6\xaaV\xf4w\x7f\xc5q\x07\xef\xa9\xa5\xa3\xf3 6\xdb\x03\x19S\xbd\x81\xf9\xc8\xc5\x90\x1e\x19\x86\xa4q\xe3?i\xc4\xac\t\xd5=3C\x9b#\xc3IuAN,\xeat\xc6\x96VFL\x1eFWZ\xa4\xd73\x92P#\x1d\xb9\x12\x15\xc9\xd4~\x8aWm^\xb8\x8b\x9d\x88\n)\xeb#\xe3\x93\xb1\\\xd6^\xe0\xce\xa2(\x05\xf5\xe6\x8b\xd1\x15\xd8v\xf0\xae\x90\xd8?\x01\r\x00\xf4\xa5\xadM|%\x98\xa9SR\xc6\xd0K\x9e&\xc3\xe0M\x81\x87\xdea\xcc\xd5\x9c\xcd\xfd1l\x1f\xb9?\xed\xd1\x95\xbc\x11\x85U9'
b'l\xd3S\xcc\x03\x9a\xf2\xfdr\xca\xbbA\x06\xfb\xd8\xbbWi\xdc\xb1\xf6&\x97T\x81Kl\r\x86\x9b\x95?\x94}\x8a\xd3\xa1V\x81\xd3]*B\x1f\x96`\xa3\xd1\xf2|B\x84?\xa0\ns\xb7\xcf\x18Y\x87\xcfR\x87!\x14\x81!\xf7\xf2\xe5x|=O\xe3\xba2\xf2!\x93\x0fT7\x0c~4\xa3\xe5\xb7\xf9wy\xb5\x12FM\x96\xd9\xfd\xedn\x9c\xacw\x1b\xc2\x17+\xb6\x05`\x10\xf8\xe4\x01\xde\xc7\xa2\xa0\x80\xd8\x15\xb1+<s\xc7\x19\x9c\x14\xb0\x1a"\x10\xbb\x0f\xe1\x05\x93\xd2?xX\xd9\x93\x8an\x8d\xcd\xbd!c\xd0,\xa45\xbai\xe3\xccx\x08\xaa,\xd1\xe5\'t\x91\xb8\xf2n$\x0c\xf9-\xb4\xc2\x07\x81\xe1\xe7\x8e\xb3\x98\x11\xf3\xa6\xd9wz\x9a3\xc9\x9c?z\xd8\xaa\x08}\xa2\x9c[\xf2\x9d\xe4\xcdb\xddl\xceV\x7f\xf1\x81\xb3\x88\x1e\x9c5?k\x0f\xc9\x86\x86&\xedV.\xa7\x8d\x13&V\xad\xca\xe5\x93\xfe\xa5\x94\xbc\xf5\xd1{Cl\xc0\x030\x92\x03\xc9'
b'#\xbdd7\xe9\xa0{\t\xb9\x87B\x9e\xf9\x97P^\xf3V\xb6\x93\x1f(J\x0b\xa3\xbf\xd8\x04\x86T\xa4\xca\xf3\xe8%\xddC\x11\xdb5\xff,\xf7\x13\xd7\xd2\xbc\xf3\x893\x83\xdcmJ\xc8p\xdf\x07V\x7fb\xeb\xa9\x8b\x0f\xca\xf9\x05\xfc\xdfS\x94b\x90\xcd\xfcn?/]\x11\xaf\xe606\xfb\\U59\xa0>\xbd\xd8\x1c\xa8\xca\x83\xf4C\x95v7\xc6\xe00\xe4,d_/\x83\xa0\xb9mO\x0e\xc4\x97J\x15\xf0\xca-\xa0\xafT\xe4\x82\x03\n\x14:\xa1\xdcL\x98\x9d,1\xfa\x10\xf4\xfd\xa0\x0b\xc7\x13!\xf7\xdb/\xda\x1a\x9df\x1cQ\xc0\x99H\x08\xa0c\x8f9/4\xc4\x05\xc6\x9eM\x8e\xe5V\xf8D\xc3\xfd\xad4\x94A\xb9[\x80\xb9\xcf\xe6\xd9\xb3M2\xd9N\xfbA\x18\x84/W\x9b\x92\xfe\xbb\xd6C\x85\xa3\xc6\xd2T\xd0\xb2\xb9\xf7R\xb4(s\xda\xbcX,9w\x17\x1c\xfb|\xa0\x87\xba\xca6>y\xba\\L4wc\x94\xe7$Y\x89\x07\x9b\xfe\x9b?{\x85'
@pippo1980's comment is how I would do it, using struct:
import struct
cenc = b'tX\x10Fo\x89\x10~\x83Pok\xd1\xfb\xbe\x0e<a\xe5\x11md:\xe6\x84#\xfa\xf8\xe5\xeb\xf8\xdc{\xc0Z\xa0\xc0^\xc1\xd9\x820\xec\xec\xb0R\x99/\xa2l\x88\xa9\xa6g\xa3\x01m\xf9\x7f\x91\xb9\xe1\x80\xccs|\xb7_\xa9Fp\x11yvG\xdc\x02d\x8aK2\x92t\x0e\x1f\xca\x19\xbb&\xaf{\xc0y>\t|\x86\xab\x16.\xa5kZ"\xab6\xaaV\xf4w\x7f\xc5q\x07\xef\xa9\xa5\xa3\xf3 6\xdb\x03\x19S\xbd\x81\xf9\xc8\xc5\x90\x1e\x19\x86\xa4q\xe3?i\xc4\xac\t\xd5=3C\x9b#\xc3IuAN,\xeat\xc6\x96VFL\x1eFWZ\xa4\xd73\x92P#\x1d\xb9\x12\x15\xc9\xd4~\x8aWm^\xb8\x8b\x9d\x88\n)\xeb#\xe3\x93\xb1\\\xd6^\xe0\xce\xa2(\x05\xf5\xe6\x8b\xd1\x15\xd8v\xf0\xae\x90\xd8?\x01\r\x00\xf4\xa5\xadM|%\x98\xa9SR\xc6\xd0K\x9e&\xc3\xe0M\x81\x87\xdea\xcc\xd5\x9c\xcd\xfd1l\x1f\xb9?\xed\xd1\x95\xbc\x11\x85U9'
denc = b'l\xd3S\xcc\x03\x9a\xf2\xfdr\xca\xbbA\x06\xfb\xd8\xbbWi\xdc\xb1\xf6&\x97T\x81Kl\r\x86\x9b\x95?\x94}\x8a\xd3\xa1V\x81\xd3]*B\x1f\x96`\xa3\xd1\xf2|B\x84?\xa0\ns\xb7\xcf\x18Y\x87\xcfR\x87!\x14\x81!\xf7\xf2\xe5x|=O\xe3\xba2\xf2!\x93\x0fT7\x0c~4\xa3\xe5\xb7\xf9wy\xb5\x12FM\x96\xd9\xfd\xedn\x9c\xacw\x1b\xc2\x17+\xb6\x05`\x10\xf8\xe4\x01\xde\xc7\xa2\xa0\x80\xd8\x15\xb1+<s\xc7\x19\x9c\x14\xb0\x1a"\x10\xbb\x0f\xe1\x05\x93\xd2?xX\xd9\x93\x8an\x8d\xcd\xbd!c\xd0,\xa45\xbai\xe3\xccx\x08\xaa,\xd1\xe5\'t\x91\xb8\xf2n$\x0c\xf9-\xb4\xc2\x07\x81\xe1\xe7\x8e\xb3\x98\x11\xf3\xa6\xd9wz\x9a3\xc9\x9c?z\xd8\xaa\x08}\xa2\x9c[\xf2\x9d\xe4\xcdb\xddl\xceV\x7f\xf1\x81\xb3\x88\x1e\x9c5?k\x0f\xc9\x86\x86&\xedV.\xa7\x8d\x13&V\xad\xca\xe5\x93\xfe\xa5\x94\xbc\xf5\xd1{Cl\xc0\x030\x92\x03\xc9'
fenc = b'#\xbdd7\xe9\xa0{\t\xb9\x87B\x9e\xf9\x97P^\xf3V\xb6\x93\x1f(J\x0b\xa3\xbf\xd8\x04\x86T\xa4\xca\xf3\xe8%\xddC\x11\xdb5\xff,\xf7\x13\xd7\xd2\xbc\xf3\x893\x83\xdcmJ\xc8p\xdf\x07V\x7fb\xeb\xa9\x8b\x0f\xca\xf9\x05\xfc\xdfS\x94b\x90\xcd\xfcn?/]\x11\xaf\xe606\xfb\\U59\xa0>\xbd\xd8\x1c\xa8\xca\x83\xf4C\x95v7\xc6\xe00\xe4,d_/\x83\xa0\xb9mO\x0e\xc4\x97J\x15\xf0\xca-\xa0\xafT\xe4\x82\x03\n\x14:\xa1\xdcL\x98\x9d,1\xfa\x10\xf4\xfd\xa0\x0b\xc7\x13!\xf7\xdb/\xda\x1a\x9df\x1cQ\xc0\x99H\x08\xa0c\x8f9/4\xc4\x05\xc6\x9eM\x8e\xe5V\xf8D\xc3\xfd\xad4\x94A\xb9[\x80\xb9\xcf\xe6\xd9\xb3M2\xd9N\xfbA\x18\x84/W\x9b\x92\xfe\xbb\xd6C\x85\xa3\xc6\xd2T\xd0\xb2\xb9\xf7R\xb4(s\xda\xbcX,9w\x17\x1c\xfb|\xa0\x87\xba\xca6>y\xba\\L4wc\x94\xe7$Y\x89\x07\x9b\xfe\x9b?{\x85'
packing_format = "<HHH" # little-endian, 3 * (2-byte unsigned short)
with open("license.sfb", "wb") as licensefh:
licensefh.write(struct.pack(packing_format, len(cenc), len(denc), len(fenc)))
licensefh.write(cenc)
licensefh.write(denc)
licensefh.write(fenc)
# close is automatic with a context-manager
with open("license.sfb", "rb") as licensefh2:
header_length = struct.calcsize(packing_format)
cenc2_len, denc2_len, fenc2_len = struct.unpack(packing_format, licensefh2.read(header_length))
cenc2 = licensefh2.read(cenc2_len)
denc2 = licensefh2.read(denc2_len)
fenc2 = licensefh2.read(fenc2_len)
assert len(cenc2) == cenc2_len and len(denc2) == denc2_len and len(fenc2) == fenc2_len # the file was not truncated
unread_bytes = licensefh2.read() # until EOF
assert len(unread_bytes) == 0 # there is nothing else in the file, everything has been read
assert cenc == cenc2
assert denc == denc2
assert fenc == fenc2
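As a side note on the design: the length-prefix header is what makes this robust where the separator approach failed. Any byte value can occur inside RSA ciphertext, so no in-band separator is safe, but a fixed-size header of lengths is unambiguous.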

Decoding Multiple CDR Records from a File with asn1tools (Python)

How do I decode multiple CDR records from a file with asn1tools?
This is my Python code:
import asn1tools

# input file name
fileName = 'D:/Python/Asn1/SIO/bHWMSC12021043012454329.dat'
# compile the ASN.1 spec into a specification object
Foo = asn1tools.compile_files('asn_Huawei.asn', cache_dir='My-Cache', numeric_enums=True)
# open the binary file
with open(fileName, "rb+") as binaryfile:
    buffer = binaryfile.read()
# match and decode a record against the spec
decoded = Foo.decode('CallEventRecord', buffer)
print(decoded)
print(decoded) gives only the first record, but my file contains 1550 records. How do I read my file tag by tag with asn1tools?
I have the same issue and am trying to figure it out. You can +1 this issue.
I managed to decode multi-CDR files with a combination of pyasn1 and asn1tools; I'll update when it is fully tested.
I would use "decode_with_length" instead of decode.
decoded, lenDecoded = Foo.decode_with_length('CallEventRecord',buffer)
buffer = buffer[lenDecoded:]
Then just loop until the buffer is empty.
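Put together, a minimal sketch of that loop, reusing the schema and file names from the question (decode_with_length requires a reasonably recent asn1tools):
import asn1tools

# Compile once, then repeatedly peel a record off the front of the buffer
# until nothing is left. Names are reused from the question above.
Foo = asn1tools.compile_files('asn_Huawei.asn', numeric_enums=True)
with open(fileName, 'rb') as binaryfile:
    buffer = binaryfile.read()

records = []
while buffer:
    decoded, lenDecoded = Foo.decode_with_length('CallEventRecord', buffer)
    records.append(decoded)
    buffer = buffer[lenDecoded:]

print(len(records))  # should report all 1550 records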
For me (it depends heavily on how your files are structured internally and what your goal is), the fastest method so far is to use find() to jump to the next relevant record; mine always start with b'\xa7':
index = 0
with open(file, 'rb') as encoded_bytes:
    data = encoded_bytes.read()  # close is automatic with a context manager
file_len = len(data)
while index < file_len:
    next_occurence = data[index:].find(b'\xa7')
    if next_occurence < 0:
        break
    # ...
    index += next_occurence
    decoded_record, record_length = schema.decode_with_length('CallEventRecord', data[index:])
    # ...
    index += 1
    continue

What is wrong with my decryption function?

import base64
import re

def encrypt(cleartext, key):
    to_return = bytearray(len(cleartext))
    for i in xrange(len(cleartext)):
        to_return[i] = ord(cleartext[i]) ^ ord(key)
    return base64.encodestring(str(to_return))

def decrypt(ciphertxt, key):
    x = base64.decodestring(re.escape(ciphertxt))
    to_return = bytearray(len(x))
    for i in xrange(len(x)):
        to_return[i] = ord(x[i]) ^ ord(key)
        while to_return[i] > 127:
            to_return[i] -= 127
    return to_return
When I encrypt bob and then use my decrypt function, it returns bob. However, for longer things like paragraphs, where the ciphertext contains backslashes, it does not work: I don't get back ASCII or Base64 characters, I get weird Chinese characters or square characters. Any insight to point me in the right direction would help.
As jasonharper said, you're mangling your Base64 data by calling re.escape on it. Once you get rid of that, your code should be fine. I haven't tested it extensively, but it works correctly for me with multi-line text.
You should also get rid of this from your decrypt function:
while to_return[i] > 127:
    to_return[i] -= 127
It won't do anything if the original cleartext is plain ASCII, but it will corrupt the decoding if the cleartext contains bytes > 127 (for example, a decrypted byte of 227 would silently become 227 - 127 = 100, i.e. 'd').
However, those functions could be a little more efficient.
FWIW, here's a version that works correctly on both Python 2 and Python 3. This code isn't as efficient as it could be on Python 3, due to the compromises made to deal with the changes in text and bytes handling in Python 3.
import base64

def encrypt(cleartext, key):
    buff = bytearray(cleartext.encode())
    key = ord(key)
    buff = bytearray(c ^ key for c in buff)
    return base64.b64encode(bytes(buff))

def decrypt(ciphertext, key):
    buff = bytearray(base64.b64decode(ciphertext))
    key = ord(key)
    buff = bytearray(c ^ key for c in buff)
    return buff.decode()

# Test
s = 'This is a test\nof XOR encryption'
key = b'\x53'
coded = encrypt(s, key)
print(coded)
plain = decrypt(coded, key)
print(plain)
Python 3 output
b'Bzs6IHM6IHMycyc2ICdZPDVzCxwBczY9MCEqIyc6PD0='
This is a test
of XOR encryption

Binary to ASCII in Python

I'm trying to decode binary values located in a .txt file, but I'm stuck and don't see a way around this.
def code():
    ascii = {'01000001': 'A', ...}
    binary = {'A': '01000001', ...}
    print(ascii, binary)

def encode():
    pass

def decode(code, n):
    f = open(code, mode='rb')  # Open a file with filename <code>
    while True:
        chunk = f.read(n)  # Read n characters at a time from the open file
        if chunk == '':  # This is one way to check for End Of File in Python
            break
        if chunk != '\n':
            # Process it????
            pass
How can I take the binary in the .txt file and output it as ASCII?
From your example, your input looks like the string representation of a binary-formatted number.
If so, you don't need a dictionary for that:
def byte_to_char(input):
    return chr(int(input, base=2))
Using the data you gave in the comments, you have to split your binary string into bytes.
input ='01010100011010000110100101110011001000000110100101110011001000000110101001110101011100110111010000100000011000010010000001110100011001010111001101110100001000000011000100110000001110100011000100110000'
length = 8
input_l = [input[i:i+length] for i in range(0,len(input),length)]
And then, per byte, you convert it into a char:
input_c = [chr(int(c,base=2)) for c in input_l]
print(''.join(input_c))
Putting it all together:
def string_decode(input, length=8):
    input_l = [input[i:i+length] for i in range(0, len(input), length)]
    return ''.join([chr(int(c, base=2)) for c in input_l])

string_decode(input)
>'This is just a test 10:10'

Unicode text represented as u'xxxx instead of Japanese in Python 2.7

I've had many struggles with Unicode in Python over the years as I work with many text files in Japanese, so I'm familiar with using .encode("utf-8") to get Japanese text back into Japanese display from u'xxxx. I am NOT getting any encoding/decoding errors. But text I'm reading from a unicode file, manipulating, then writing back into a new file is being represented as strings of u'xxxx instead of the original Japanese text. I have tried .encode() and .decode() in multiple places, and also not using them at all, every time with the same result. Any suggestions are welcome.
Specifically, I am using the Scrapy library to write a spider that takes text from a file it crawls, extracts bits of text to construct the filename of a new file, and then writes the first div of the HTML file as a string into that new file.
What is even more confusing to me is that the bits of text I'm using to create the filename all render in Japanese, as does the filename itself. Is it because I am using str() on the div that I am getting u'xxxx as the content of my file? Please look toward the end of the code to see this line.
Here is my complete code (and please ignore how hacky some of it is):
def parse_item(self, response):
    original = 0
    author = "noauthor"
    title = "notitle"
    year = "xxxx"
    publisher = "xxxx"
    typer = "xxxx"
    ispub = 0
    filename = response.url.split("/")[-1]
    if "_" in filename:
        filename = filename.split("_")[0]
    if filename.isdigit():
        title = response.xpath("//h1/text()").extract()[0].encode("utf-8")
        author = response.xpath("//h2/text()").extract()[0].encode("utf-8")
        ID = filename
        bibliographic_info = response.xpath("//div[2]/text()").extract()
        for subyear in bibliographic_info:
            ispub = 0
            subyear = subyear.encode("utf-8").strip()
            if "初出:" in subyear:
                publisher = subyear.split(":")[1]
                original = 1
                ispub = 1
            if "入力:" in subyear:
                typer = subyear.split(":")[1]
            if len(subyear) > 1 and (original == 1) and (ispub == 0):
                counter = 0
                while counter < len(subyear):
                    if subyear[counter].isdigit():
                        break
                    counter += 1
                if counter != len(subyear):
                    year = subyear[counter:(counter + 4)]
                    original = 0
        body = str(response.xpath("//div[1]/text()").extract())
        new_filename = author + "_" + title + "_" + publisher + "_" + year + "_" + typer + ".html"
        file = open(new_filename, "a")
        file.write(body.encode("utf-8"))
        file.close()
# -*- coding: utf-8 -*-
# u'初出' and u'\u521d\u51fa' are different ways to specify *the same* string
assert u'初出' == u'\u521d\u51fa'
#XXX don't mix Unicode and bytes!!!
assert u'初出' != '初出' and u'初出' != '\u521d\u51fa'
Don't use str() at all with a Unicode string as an argument, use the explicit .encode() instead.
Do not call .encode() or .decode() unless necessary; use the Unicode sandwich instead (a minimal sketch follows below):
decode bytes that you receive from the outside world into Unicode
keep it Unicode inside your script
encode into bytes at the end, to save to a file or send over a network
Both the first and the last step might be implicit, i.e., your program might only ever see Unicode text.
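A minimal sketch of that sandwich on Python 2, with hypothetical input.txt/output.txt filenames:
# -*- coding: utf-8 -*-
import io

# bytes -> unicode on the way in (the decode happens implicitly in io.open)
with io.open('input.txt', encoding='utf-8') as f:
    text = f.read()
# manipulate pure Unicode in the middle
text = text.replace(u'初出:', u'')
# unicode -> bytes on the way out (the encode happens implicitly too)
with io.open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)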
Note, these are three different things:
the way a string looks in the source code when you specify it using a string literal (Unicode escapes, source-code encoding, raw string literals)
the content of the string
how it looks if you print it (repr(), the 'backslashreplace' error handler)
If you see u'...' in the output, it means that at some point repr(unicode_string) was called. It may be implicit, e.g. via print([unicode_string]), because repr() is called on the items of a list when it is converted to a string.
print(u'\u521d\u51fa') # -> 初出 #NOTE: no u'', \u..
print(repr(u'\u521d\u51fa')) # -> u'\u521d\u51fa'
