Is it possible to convert a file (text/image/mp3) to just the binary code that it's made up of, so it can then be manipulated, for example in Python or whatever language? I poked around a bit online and binary files were mentioned a lot, but nothing was particularly useful or coherent. Thanks for any info; I've done a fair bit of high-level programming, so now I'm looking to branch out a bit.
If you want to manipulate binary files, use the rb (read binary) and wb (write binary) file modes:
with open('binary_file.mp3', 'rb') as f:
    first_byte = f.read(1)
To be clear, all files are binary. Some binary files can be interpreted as text files, but they're still stored in binary underneath. Think of it this way: a file is a series of numbers, and each number can only be in the range 0 to 255. Then in the '60s and '70s some Americans decided that if you see the number 65, it's actually the capital letter "A", 66 is "B", and so on; 97 is lowercase "a", 98 is "b", and so on, and numbers greater than 127 would never be used. You could come up with your own mapping of numbers to letters (and people in different countries did), but you should probably use the mapping people have more or less all agreed on, which is called ASCII (and its extension, UTF-8). If you want to look at the actual numbers under the hood of a file, you need a hex editor, though it shows the numbers in hexadecimal rather than the decimal notation we are used to.
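You can check this mapping from Python directly:

# ord() gives the number behind a character; chr() goes the other way.
print(ord('A'))  # 65
print(ord('a'))  # 97
print(chr(66))   # 'B'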
If you want to see the actual ones and zeros of a file, just use this (the := operator requires Python 3.8+):
with open('binary_file_name', 'rb') as f:
    while byte := f.read(1):
        print(f'{ord(byte):08b}')
A binary file is just an array of bytes, and most programming languages deal with arrays, so there's no "binary code" conversion to do. Binary formats then exist to tell one file type from another (e.g. an image from an mp3), because you can only interpret raw bytes if you gave them a meaning in the first place.
I am currently running simulations written in C and later analyzing the results using Python scripts.
At the moment the C program is writing the results (lots of double values) to a text file, which is slowly but surely eating a lot of disk space.
Is there a file format that is more space-efficient for storing lots of numeric values?
Ideally, though not necessarily, it should fulfill the following requirements:
Values can be appended continuously such that not all values have to be in memory at once.
The file is more or less easily readable using Python.
I feel like this should be a really common question, but looking for an answer I only found descriptions of various data types within C.
A binary file, but please be careful with the format of the data that you are saving. If possible, reduce the width of each variable that you are using. For example, do you need to save a double or a float, or can you get away with just a 16- or 32-bit integer?
Further, yes, you may apply a compression scheme to compress the data before saving and decompress it after reading, but that requires much more work, and it is probably overkill for what you are doing.
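As a minimal sketch of what that can look like from the Python side (the file name and the choice of 32-bit floats are just assumptions for illustration), the standard-library struct module packs values into fixed-width binary:

import struct

# Append values as packed little-endian 32-bit floats ('<f'):
# 4 bytes each instead of ~13 bytes of ASCII text per value.
with open('results.bin', 'ab') as f:
    for value in (0.42442423, 1.25):
        f.write(struct.pack('<f', value))

# Read the file back one value at a time, so not everything
# has to sit in memory at once.
with open('results.bin', 'rb') as f:
    while chunk := f.read(4):
        print(struct.unpack('<f', chunk)[0])

The 'ab' append mode matches the requirement that values can be appended continuously without holding them all in memory.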
I have the following string:
1679.2235398,-1555.40390834,-1140.07728186,-1999.85500108
and I'm using a steganography technique to store it in an image. When I retrieve it back out of the image, sometimes I get it back in complete form and have no issue with that. On other occasions the data are not fully retrieved (due to a modification/alteration of the image), so the result looks something like this:
1679.2235398,-1555.I8\xf3\x1cj~\x9bc\x13\xac\x9e8I>[a\xfdV#\x1c\xe1\xea\xa0\x8ah\x02\xed\xd1\x1c\x84\x96\xe2\xfbk*8'l
Notice that only "1679.2235398,-1555." was correctly retrieved, while the rest is where the modification occurred.
Now, how do I compute (as a percentage) how much I successfully retrieved?
Since the lengths are not the same, I can't do a character-by-character comparison; it seems that I need to slice or convert the modified data into some other form to match the length of the original data.
Any tips?
A lot of this is going to depend on the context of your problem, but you have a number of options here.
If your results always look like that, you could just find the longest common subsequence, then divide by the length of the original string for a percentage.
Levenshtein distance is a common way of comparing strings, measured as the number of single-character edits required to turn one string into another. This question has several answers discussing how to turn that into a percentage.
If you don't expect the strings to always come out in the same order, this answer suggests some algorithms used for DNA work.
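As one concrete, dependency-free option (a sketch, not necessarily the best metric for your case), Python's standard-library difflib computes a similarity ratio based on matching blocks:

from difflib import SequenceMatcher

original = '1679.2235398,-1555.40390834,-1140.07728186,-1999.85500108'
retrieved = '1679.2235398,-1555.garbled-rest'  # placeholder for the damaged data

# ratio() returns 2*M/T, where M is the number of matching characters
# and T is the total number of characters in both strings.
similarity = SequenceMatcher(None, original, retrieved).ratio()
print(f'{similarity:.1%}')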
Well, it really depends. My solution would be something like this:
I would start with the longest substring possible and check if it is in the new string:
if original_string in new_string:
    pass  # something happens here
That check would sit inside a loop that decreases the size of the original string and tries all possible combinations. So the next pass would use substrings that are N-1 long, of which there are 2 (cutting off either the first character or the last), and so on, until you get to a specific threshold, or to strings of length 1. The loop can store the longest string you find in a log inside the if conditional, and afterward you can just check the results; a runnable sketch is below. Hope that helps.
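A minimal runnable sketch of that shrinking-window idea (function and variable names are illustrative):

def longest_matching_substring(original, retrieved):
    # Try the longest lengths first and stop at the first hit.
    for length in range(len(original), 0, -1):
        for start in range(len(original) - length + 1):
            candidate = original[start:start + length]
            if candidate in retrieved:
                return candidate
    return ''

original = '1679.2235398,-1555.40390834,-1140.07728186,-1999.85500108'
retrieved = '1679.2235398,-1555.garbled-rest'  # placeholder for the damaged data
match = longest_matching_substring(original, retrieved)
print(f'{len(match) / len(original):.1%} retrieved')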
All the ASCII characters can be represented in UTF-8 (the seven-bit range). Using UTF-8 exclusively could simplify string handling greatly. Granted, UTF-8 is not a fixed-length format and therefore carries certain performance penalties with respect to ASCII, but I have the feeling Python normally goes for Pythonic before performance.
My question: has it ever been addressed why Python 3 implements strings this way instead of using UTF-8 exclusively? That is, not representing them as a bitstream with different internal representations, but always with the UTF-8 encoding.
I'm not looking for personal opinions from SO users but for PEPs or a transcript from the dictator addressing this very point.
From PEP 393:
Rationale
There are two classes of complaints about the current implementation of the unicode type: on systems only supporting UTF-16, users complain that non-BMP characters are not properly supported. On systems using UCS-4 internally (and also sometimes on systems using UCS-2), there is a complaint that Unicode strings take up too much memory - especially compared to Python 2.x, where the same code would often use ASCII strings (i.e. ASCII-encoded byte strings). With the proposed approach, ASCII-only Unicode strings will again use only one byte per character; while still allowing efficient indexing of strings containing non-BMP characters (as strings containing them will use 4 bytes per character).

One problem with the approach is support for existing applications (e.g. extension modules). For compatibility, redundant representations may be computed. Applications are encouraged to phase out reliance on a specific internal representation if possible. As interaction with other libraries will often require some sort of internal representation, the specification chooses UTF-8 as the recommended way of exposing strings to C code.

For many strings (e.g. ASCII), multiple representations may actually share memory (e.g. the shortest form may be shared with the UTF-8 form if all characters are ASCII). With such sharing, the overhead of compatibility representations is reduced. If representations do share data, it is also possible to omit structure fields, reducing the base size of string objects.
If it is not clear from the above text:
We want most string representations to be space-efficient
We want efficient indexing whenever possible
We want to be compatible with all systems and provide all of Unicode on all systems
The result is that using a single internal representation would fail at least one of these constraints.
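You can observe the flexible representation directly in CPython (exact sizes vary by version and platform, so treat the numbers as illustrative):

import sys

# PEP 393: per-character storage depends on the widest character present.
print(sys.getsizeof('a' * 100))           # ~1 byte per char (Latin-1 range)
print(sys.getsizeof('\u20ac' * 100))      # ~2 bytes per char (BMP)
print(sys.getsizeof('\U0001F600' * 100))  # ~4 bytes per char (non-BMP)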
Someone on here asked a question similar to this, but it got quickly downvoted and closed due to the newbiness of its nature. So I decided to answer it myself, for others who want to know how to make this nifty program, because it isn't really such a bad idea. So here goes nothing!
I'm not going to tell you your code is horrendous. But you could simplify it.
Everything in the try routine could be replaced by a single line:
print 'Space for about', int(float(totstor) * 1048576 / 25), 'standard jpg image files available'
In other words, you can print the result of the calculation directly. Let int() take care of the rounding, getting rid of the ".0", and so on, and rely on the fact that you can print integers (and most other data types) directly without converting them to a string. You simply chain together the items you need in the output using commas. (There are other ways of getting numbers into the desired output text, but this is the simplest.)
First, I took a couple of pictures (JPEG format) (with software I developed myself :3) and got their file sizes, and they averaged out at about 25 KB. So to get the number of pictures that fit on a disk, you take its available space in kilobytes and divide it by 25 (or by a larger number for different file types; not really sure).
For the program, I ask the user to enter the number of gigs available; the program multiplies it by 1048576 (KB in a GB), stores that as a value, then divides it by 25.
So, here is the wonderful code (hopefully the comments kind of explain what is going on; I am not too great at this):
#Main Loop
while True:
    #Set number of Gigs
    totstor = raw_input("Enter the amount of storage on the desired disk (In gb): ")
    #Just in case you get bored
    if totstor == 'quit':
        break
    try:
        #Do the math
        gigs = int(totstor)
        gigs = round(gigs)
        kilos = gigs * 1048576
        kilos = kilos / 25
        kilos = round(kilos)
        kilos = str(kilos)
        kilos = kilos.strip('.0')
        print 'Space for about ' + kilos + ' standard jpg image files available'
    #If an error occurs, let em' know
    except:
        print 'Invalid Number!'
    print '\n'
    print '\n'
#Bye
quit()
Anyone who got any help out of this, leave some feedback. Or just tell me how horrendous my code is xD.
I am tabulating a lot of output from some network analysis, listing one edge per line, which results in dozens of gigabytes, stretching the limits of my resources (understatement). As I only deal with numerical values, it occurred to me that I might do better than the Py3k defaults: some other character encoding might save me quite some space if I only have digits (and spaces and the occasional decimal dot). Constrained as I am, I might even save on the line endings (not duplicating the Windows-standard CRLF). What is the best practice on this?
An example line would read like this:
62233 242344 0.42442423
(The last number is actually pointlessly precise; I will cut it back to three nonzero digits.)
As I will need to read the text file into other software (Stata, actually), I cannot keep the data in an arbitrary binary format, though I see no reason why Stata would read only UTF-8 text. Or would you simply say that avoiding UTF-8 barely saves me anything?
I think compression would not work for me, as I write the text line by line and it would be great to limit the output size even during this. I might easily be mistaken about how compression works, but I thought it could only save me space after the file is generated, while my issue is that my code crashes already while I am tabulating the text file (line by line).
Thanks for all the ideas and clarifying questions!
You can use zlib or gzip to compress the data as you generate it. You won't need to change your format at all; the compression will adapt to the characters and sequences you use the most to produce an optimal file size.
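A minimal sketch with the standard-library gzip module (the file name is illustrative); data are compressed as each line is written, so the uncompressed file never has to exist on disk:

import gzip

# 'wt' opens a text-mode stream that gzip-compresses on the fly;
# newline='\n' avoids the two-byte CRLF line endings on Windows.
with gzip.open('edges.txt.gz', 'wt', encoding='ascii', newline='\n') as f:
    f.write('62233 242344 0.424\n')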
Avoid the character encodings entirely and save your data in a binary format. See Python's struct module. ASCII-encoded, a value like 4 billion takes 10 bytes, but it fits in a 4-byte integer. There are a lot of downsides to a custom binary format (it's hard to debug manually or inspect with other tools, etc.).
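To illustrate the size difference (a sketch; the format code assumes the value fits in an unsigned 32-bit integer):

import struct

value = 4000000000
print(len(str(value)))                 # 10 bytes as ASCII text
print(len(struct.pack('<I', value)))   # 4 bytes as a little-endian uint32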
I have done some study on this. Clever encoding does not matter much once you apply compression: even if you use some binary encoding, the data seem to contain the same entropy and end up at a similar size after compression.
The Power of Gzip
Yes, there are Python libraries that allow you to stream output and automatically compress it.
Lossy encoding does save space; cutting down the precision helps.
I don't know the capabilities of data input in Stata, and a quick search reveals that said capabilities are described in the User's Guide, which seems to be available only in dead-tree copies. So I don't know if my suggestion is feasible.
An instant saving of half the size would come from using 4 bits per character. You have an alphabet of 0 to 9, period, (possibly) minus sign, space and newline, which is 14 characters, fitting perfectly in 2**4 == 16 slots; a sketch of such a packing follows below.
If this can be used in Stata, I can help more with suggestions for quick conversions.
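A minimal sketch of such a 4-bit (nibble) packing, with the alphabet ordering and padding choice as illustrative assumptions:

# Illustrative 14-symbol alphabet; each symbol gets a 4-bit code.
ALPHABET = '0123456789.- \n'
ENCODE = {ch: i for i, ch in enumerate(ALPHABET)}
DECODE = {i: ch for i, ch in enumerate(ALPHABET)}

def pack(text):
    # Two 4-bit codes per byte; pad odd lengths with 0xF, which the
    # 14-symbol alphabet never uses.
    codes = [ENCODE[ch] for ch in text]
    if len(codes) % 2:
        codes.append(0xF)
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))

def unpack(data):
    out = []
    for b in data:
        for code in (b >> 4, b & 0xF):
            if code != 0xF:
                out.append(DECODE[code])
    return ''.join(out)

line = '62233 242344 0.424\n'
packed = pack(line)
assert unpack(packed) == line
print(len(line), '->', len(packed))  # 19 -> 10 bytes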