I've recently tried to create my own Vigenère encryption script for files. It works without issues for both encryption and decryption, but it gets REALLY slow on "big files". For example, a 772 KB file took 13.71 s, and the time doesn't look linear as I would expect it to be. I suppose it has to do with the fact that I read the entire document into a buffer.
The critical part is a single line: for each byte in the buffer (f.read()) I add the corresponding value of the key. I suspect this is not optimal, but I can't think of another method at the moment.
for i, v in enumerate(buf):
    output_buffer += bytes([(v+m*key[i%keylength])%256])
"m" is just a parameter that is set to 1 for encryption and -1 for decryption.
The file buffer "buf" and key are bytes, not strings.
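I'm wondering whether rewriting it along these lines would fix the non-linear behaviour (just a sketch: a preallocated bytearray instead of repeated bytes concatenation, which I suspect is quadratic in the file size):

def vigenere(buf, key, m):
    # sketch: write results into a preallocated bytearray instead of
    # building a new bytes object on every iteration
    keylength = len(key)
    out = bytearray(len(buf))
    for i, v in enumerate(buf):
        out[i] = (v + m * key[i % keylength]) % 256
    return bytes(out)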
If you have any suggestions on how to work with the buffer (without reading the entire document into RAM) or on the actual encryption code, I would be glad to hear them.
Thanks in advance.
Related
I struggle to understand the correct use of an IV (Initialization Vector) when encrypting data in AES.
Specifically, I'm not sure where to store my randomly generated IV: in my script, the data will be encrypted, then saved to a file, then the program will terminate. During the next session, the previously saved data must be decrypted. If my understanding of IVs is correct, I have to use the same IV for decryption as for encryption (but a different random IV for every single encryption process).
Thus, I have to store the IV somewhere. Some people recommend prepending it to the encrypted data, but I'm not sure that solves my problem, because I still need to get hold of the IV before I can decrypt that data in the next session.
Is this correct or did I misunderstand something? I want to avoid saving the encrypted/hashed key and the IV (even if hashed itself) inside some unencrypted plain-text-settings-file or something.
The IV itself is not sensitive data; it's just used to scramble the state of the first cipher-text block, and the scrambling is not recoverable from the IV itself (the key adds in the "secret" factor).
For "proper" chaining modes the IV is separate from the cipher text (you need the initial IV for both encryption and decryption) and must be stored separately and passed separately to the crypto library API. After encryption you can store the IV however you like - just don't lose it ;).
You can certainly "prepend" / "append" the IV to the cipher text so you only have to store a single blob of data - but you'll just have to split it off again prior to decryption, as that is what the API will expect.
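For example, a minimal sketch of that with pyCrypto (the key and plaintext here are placeholders, and the plaintext is assumed to be already padded to the block size):

from Crypto.Cipher import AES
from Crypto import Random

key = b"0123456789abcdef"                  # placeholder 16-byte key
padded_plaintext = b"sixteen byte msg"     # already a multiple of the block size

iv = Random.new().read(AES.block_size)
blob = iv + AES.new(key, AES.MODE_CBC, iv).encrypt(padded_plaintext)

# later, split the IV back off before decrypting
iv, ciphertext = blob[:AES.block_size], blob[AES.block_size:]
plaintext = AES.new(key, AES.MODE_CBC, iv).decrypt(ciphertext)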
The "unproper" way to do an IV (e.g. if your crypto library API doesn't have native IV support, but does support chaining) is just to prepend a single block of random data to the plaintext before encrypting it. In this case there isn't any IV to store separately - you simply encrypt the entire IV+message binary pair - and then you simply delete the first block of data after decrypting it. The "random data" you prepend has the same constraints as a real IV (don't reuse the same random data with the same key, etc).
The two approaches are semantically different at the API level, but the effect on the actual encryption is the same (scramble the first block of real payload unpredictably).
In terms of how the IV is used - there are many possible schemes. See the Wikipedia article on block cipher modes for a convenient picture showing how the IV is used in various chaining modes when it really is stored separately.
https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation
I have written an AES cipher in Python and it works well with simple text files.
When viewing a ~200k .txt file before and after encryption/decryption in a hex editor, the bytes are identical. However, there are issues when I try to encrypt/decrypt any other file type (a PNG of similar size, for example). The beginning of the decrypted file is the same as the original, but then there are differences: a single byte that was present in the original is missing from the decrypted file, while the rest is correct.
What is likely to be the cause? If it were down to the algorithm being incorrect, would it not affect text files as well?
I worked out what was wrong. I was stupidly removing padding with bytearray.remove() rather than .pop()
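For anyone hitting the same thing, a quick illustration of the difference: bytearray.remove() deletes the first byte equal to the given value, while .pop() deletes the element at the end:

data = bytearray(b"\x05PNG\x05")   # toy data whose last byte happens to equal the pad value
data.remove(0x05)                  # removes the FIRST 0x05 -> bytearray(b'PNG\x05'), start corrupted

data = bytearray(b"\x05PNG\x05")
data.pop()                         # removes the LAST byte -> bytearray(b'\x05PNG')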
So I hope this question hasn't already been answered, but I can't seem to figure out the right search term.
First some background:
I have text data files that are tabular and can easily climb into the tens of GBs. The computer processing them is already heavily loaded from the hours-long data collection (at up to 30-50 MB/s), as it is doing device processing and control. Therefore, disk space and access are at a premium. We haven't moved from spinning disks to SSDs due to space constraints.
However, we are looking to do something with the just-collected data that doesn't need every data point. We were hoping to decimate the data and keep every 1000th point. However, loading these files (gigabytes each) puts a huge load on the disk, which is unacceptable as it could interrupt the live collection system.
I was wondering if it is possible to use a low-level method to access every nth byte (or some other method) in the file, like a database does, because the file is very well defined (two 64-bit doubles in each row). I understand that very low-level access might not work because the hard drive might be fragmented, but what would the best approach/method be? I'd prefer a solution in Python or Ruby because that's what the processing will be done in, but in theory R, C, or Fortran could also work.
Finally, upgrading the computer or hardware isn't an option; setting up the system took hundreds of man-hours, so only software changes can be performed. It would be a longer-term project, but if a text file isn't the best way to handle these files, I'm open to other solutions too.
EDIT: We generate (depending on usage) anywhere from 50,000 lines (records)/sec to 5 million lines/sec; databases aren't feasible at this rate regardless.
This should be doable using seek and read methods on a file object. Doing this will prevent the entire file from being loaded into memory, as you would only be working with file streams.
Also, since the files are well defined and predictable, you won't have any trouble seeking ahead N bytes to the next record in the file.
Below is an example. Demo the code below at http://dbgr.cc/o
with open("pretend_im_large.bin", "rb") as f:
start_pos = 0
read_bytes = []
# seek to the end of the file
f.seek(0,2)
file_size = f.tell()
# seek back to the beginning of the stream
f.seek(0,0)
while f.tell() < file_size:
read_bytes.append(f.read(1))
f.seek(9,1)
print read_bytes
The code above assumes pretend_im_large.bin is a file with the contents:
A00000000
B00000000
C00000000
D00000000
E00000000
F00000000
The output of the code above is:
['A', 'B', 'C', 'D', 'E', 'F']
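Adapting the same idea to the records described in the question (two 64-bit doubles, 16 bytes per record), a sketch might look like this; the file name, little-endian layout and step size are assumptions:

import struct

RECORD_SIZE = 16          # two 64-bit doubles per record (assumed layout)
STEP = 1000               # keep every 1000th record

decimated = []
with open("data.bin", "rb") as f:
    f.seek(0, 2)
    file_size = f.tell()
    f.seek(0, 0)
    while f.tell() + RECORD_SIZE <= file_size:
        x, y = struct.unpack("<dd", f.read(RECORD_SIZE))
        decimated.append((x, y))
        # skip the next STEP - 1 records without reading them
        f.seek(RECORD_SIZE * (STEP - 1), 1)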
I don't think that Python is going to give you a strong guarantee that it won't actually read the entire file when you use f.seek. I think that this is too platform- and implementation-specific to rely on. You should use Windows-specific tools that give you a guarantee of random access rather than sequential access.
Here's a snippet of Visual Basic that you can modify to suit your needs. You can define your own record type that's two 64-bit integers long. Alternatively, you can use a C# FileStream object and use its seek method to get what you want.
If this is performance-critical software, I think you need to make sure you're getting access to the OS primitives that do what you want. I can't find any references that indicate that Python's seek is going to do what you want. If you go that route, you need to test it to make sure it does what it seems like it should.
Is the file human-readable text or in the native format of the computer (sometimes called binary)? If the files are text, you could reduce the processing load and file size by switching to native format. Converting from the internal representation of floating point numbers to human-readable numbers is CPU intensive.
If the files are in native format then it should be easy to skip through the file, since each record will be 16 bytes. In Fortran, open the file with an open statement that includes form="unformatted", access="direct", recl=16. Then you can read an arbitrary record X, without reading the intervening records, via rec=X in the read statement. If the file is text, you can also read it with direct IO, but each pair of numbers might not always use the same number of characters (bytes). You can examine your files and answer that question. If the records are always the same length, then you can use the same technique, just with form="formatted". If the records vary in length, then you could read a large chunk and locate your numbers within the chunk.
I want to apply AES-128 encryption (probably CBC plus padding) to a data stream.
In case it matters, I'm sending chunks of around 1500 bits each.
I work in Python, and I did a small test with M2Crypto, with AES encryption on one side and decryption on the other. It works perfectly, but it probably doesn't really secure anything, since I use the same key, the same IVs and all that.
So, the question is: what is the best approach for AES encryption on large data streams?
I thought about loading a new 'keys' file from time to time; the application would then use this file to expand and extract AES keys, or something like that. But it still sounds awful to build a new AES object for each chunk, so there must be a better way.
I believe I can also use IVs here, but I'm not quite sure where and how.
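For example, would something along these lines be the right idea: one CBC cipher object reused across chunks, with a fresh random IV per stream? (Just a sketch with pyCrypto; the key is a placeholder and each chunk is assumed to be a multiple of 16 bytes, with padding only on the last one.)

from Crypto.Cipher import AES
from Crypto import Random

key = b"0123456789abcdef"                   # placeholder 128-bit key
iv = Random.new().read(AES.block_size)      # fresh IV per stream
encryptor = AES.new(key, AES.MODE_CBC, iv)  # one object reused across chunks

def encrypt_chunk(chunk):
    # CBC chaining state carries over between calls, so each chunk just
    # needs to be a multiple of 16 bytes (pad only the very last one)
    return encryptor.encrypt(chunk)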
I need to be able to send encrypted data between a Ruby client and a Python server (and vice versa) and have been having trouble with the ruby-aes gem/library. The library is very easy to use, but we've been having trouble passing data between it and the pyCrypto AES library for Python. These libraries seem to be fine when each is used on its own, but they don't seem to play well across language boundaries. Any ideas?
Edit: We're doing the communication over SOAP and have also tried converting the binary data to base64, to no avail. Also, it's more that the encryption/decryption is almost but not exactly the same between the two (e.g., the lengths differ by one, or there are extra garbage characters at the end of the decrypted string).
(e.g., the lengths differ by one, or there are extra garbage characters at the end of the decrypted string)
I missed that bit. There's nothing wrong with your encryption/decryption. It sounds like a padding problem. AES always encodes data in blocks of 128 bits. If the length of your data isn't a multiple of 128 bits, the data should be padded before encryption and the padding needs to be removed/ignored after decryption.
Turns out what happened was that ruby-aes automatically pads data to fill up 16 chars and sticks a null character on the end of the final string as a delimiter. PyCrypto requires you to do multiples of 16 chars so that was how we figured out what ruby-aes was doing.
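For reference, an explicit PKCS#7-style pad/unpad pair is one way to make the two sides agree instead of relying on each library's behaviour (a sketch, not the exact code we used):

BLOCK = 16

def pad(data):
    n = BLOCK - len(data) % BLOCK
    return data + bytes([n]) * n        # pad with n bytes of value n

def unpad(data):
    return data[:-data[-1]]             # strip the last data[-1] bytes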
It's hard to even guess at what's happening without more information ...
If I were you, I'd check that in your Python and Ruby programs:
The keys are the same (obviously). Dump them as hex and compare each byte.
The initialization vectors are the same. This is the parameter IV in AES.new() in pyCrypto. Dump them as hex too.
The modes are the same. The parameter mode in AES.new() in pyCrypto.
There are defaults for IV and mode in pyCrypto, but don't trust that they are the same as in the Ruby implementation. Use one of the simpler modes, like CBC. I've found that different libraries have different interpretations of how the more complex modes, such as CTR, work.
Wikipedia has a great article about block cipher modes.
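In pyCrypto that means spelling everything out instead of relying on the defaults, e.g. (key and IV here are placeholders):

from Crypto.Cipher import AES

# pass key, mode and IV explicitly on both sides instead of trusting defaults
key = bytes.fromhex("00112233445566778899aabbccddeeff")
iv = bytes.fromhex("000102030405060708090a0b0c0d0e0f")
cipher = AES.new(key, AES.MODE_CBC, iv)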
Kind of depends on how you are transferring the encrypted data. It is possible that you are writing a file in one language and then trying to read it in from the other. Python (especially on Windows) requires that you specify binary mode for binary files. So in Python, assuming you want to decrypt there, you should open the file like this:
f = open('/path/to/file', 'rb')
The "b" indicates binary. And if you are writing the encrypted data to file from Python:
f = open('/path/to/file', 'wb')
f.write(encrypted_data)
Basically what Hugh said above: check the IVs, key sizes and chaining modes to make sure everything is identical.
Test both sides independently: encode some information and check that Ruby and Python encoded it identically. You're assuming that the problem has to do with encryption, but it may just be something as simple as sending the encrypted data with puts, which throws random newlines into the data. Once you're sure they encrypt the data correctly, check that you receive exactly what you think you sent. Keep going step by step until you find the stage that corrupts the data.
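A quick way to do that comparison is to hex-dump the bytes on each side before diffing them, e.g.:

import binascii

def hexdump(data):
    # dump raw bytes as hex so the Python and Ruby output can be diffed directly
    return binascii.hexlify(data).decode("ascii")

# compare hexdump(key), hexdump(iv) and hexdump(ciphertext) with the same dumps on the Ruby side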
Also, I'd suggest using the openssl library that's included in ruby's standard library instead of using an external gem.