I am working on a project that requires me to read a file with a .dif extension. DIF stands for Data Interchange Format. The file opens nicely in OpenOffice Calc, where you can easily save it as a CSV file; however, when I open it in Python, all I get are random characters that don't make sense. Here is the last code I tried, just to see if I could read the file.
txt = open(r'C:\myfile.dif', 'rb').read()
print(txt)
I would even be open to programmatically converting the file to CSV before opening it, if someone knows how to do that. As always, any help is much appreciated. Below is a partial screenshot of what I get when I run the code.
Hadn't heard of this file format. Went and got a sample here.
I tested your method and it works fine:
>>> content = open(r"E:\sample.dif", 'rb').read()
>>> print (content)
b'TABLE\r\n0,1\r\n"EXCEL"\r\nVECTORS\r\n0,8\r\n""\r\nTUPLES\r\n0,3\r\n""\r\nDATA\r\n0,0\r\n""\r\n-1,0\r\nBOT\r\n1,0\r\n"Welcome to File Extension FYI Center!"\r\n1,0\r\n""\r\n1,0\r\n""\r\n-1,0\r\nBOT\r\n1,0\r\n""\r\n1,0\r\n""\r\n1,0\r\n""\r\n-1,0\r\nBOT\r\n1,0\r\n"ID"\r\n1,0\r\n"Type"\r\n1,0\r\n"Description"\r\n-1,0\r\nBOT\r\n0,1\r\nV\r\n1,0\r\n"ASP"\r\n1,0\r\n"Active Server Pages"\r\n-1,0\r\nBOT\r\n0,2\r\nV\r\n1,0\r\n"JSP"\r\n1,0\r\n"JavaServer Pages"\r\n-1,0\r\nBOT\r\n0,3\r\nV\r\n1,0\r\n"PNG"\r\n1,0\r\n"Portable Network Graphics"\r\n-1,0\r\nBOT\r\n0,4\r\nV\r\n1,0\r\n"GIF"\r\n1,0\r\n"Graphics Interchange Format"\r\n-1,0\r\nBOT\r\n0,5\r\nV\r\n1,0\r\n"WMV"\r\n1,0\r\n"Windows Media Video"\r\n-1,0\r\nEOD\r\n'
>>>
The question is what is in the file and how do you want to handle it. Personally I liked:
with open(r"E:\sample.dif", 'rb') as f:
    for line in f:
        print(line)
In the first code block, that long line that has a b'' (for bytes!) in front of it can be iterated on \r\n:
b'TABLE\r\n'
b'0,1\r\n'
b'"EXCEL"\r\n'
b'VECTORS\r\n'
b'0,8\r\n'
b'""\r\n'
b'TUPLES\r\n'
b'0,3\r\n'
b'""\r\n'
b'DATA\r\n'
b'0,0\r\n'
.
.
.
b'"Windows Media Video"\r\n'
b'-1,0\r\n'
b'EOD\r\n'
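Since the question also asks about converting to CSV, here is a minimal sketch of a converter. It assumes only the layout visible in the sample above: a header that ends with a three-line DATA item, followed by two-line values where type -1 marks BOT/EOD, type 0 is numeric and type 1 is a quoted string. The names dif_to_rows and rows_to_csv are made up for this illustration, and other DIF variants may need more handling than this.

```python
import csv
import io


def dif_to_rows(text):
    """Parse the DATA section of DIF text into a list of rows (lists of str)."""
    lines = text.splitlines()
    # The header ends with the three-line item: DATA / 0,0 / "".
    start = lines.index("DATA") + 3
    rows, current = [], None
    # Every data value takes two lines: "type,number" and then a string.
    for i in range(start, len(lines) - 1, 2):
        type_field, _, number = lines[i].partition(",")
        value = lines[i + 1]
        kind = int(type_field)
        if kind == -1:  # special value: BOT starts a row, EOD ends the data
            if current is not None:
                rows.append(current)
            if value == "EOD":
                break
            current = []
        elif kind == 0:  # numeric: the number field holds the value
            current.append(number)
        elif kind == 1:  # string: drop the surrounding quotes
            current.append(value[1:-1].replace('""', '"'))
    return rows


def rows_to_csv(rows):
    """Render rows as CSV text using the standard csv module."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

To use it on a file, read the bytes as before and decode them first, for example open(r"E:\sample.dif", 'rb').read().decode('latin-1'). The encoding is a guess here and depends on the program that exported the file.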
Related
Why does the text not show up when I click on the file_io_reverse.ipynb file??
##I am trying to read 'file_io.ipynb' and put the reverse of it into 'file_io_reverse.ipynb', this code doesn't work at all
f = open('file_io_reverse.ipynb', "a")
with open('file_io.ipynb', "r") as f2:
    for i in f2:
        x = i[::-1]
        print(x)
        f.write(x)
f.close()
As @olvin pointed out, your mixture of ways of opening and closing files is inconsistent, but it is not functionally incorrect and should work.
What are you trying to open the file_io_reverse.ipynb file in?
IPYNB notebooks are plain text files formatted using JSON, making them human-readable and easy to share with others. So if you are trying to reverse contents of each line in the file and trying to save it in another file, then that would make the new ipynb file invalid.
Try opening the file in a text editor, and it should have the reversed lines for each line in the file_io.ipynb.
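For comparison, here is a small sketch that opens both files with a single with statement. The function name reverse_lines is hypothetical. It differs from the question's code in two ways that are assumptions about what was intended: it opens the output in "w" mode instead of "a", so re-running does not append a second copy, and it reverses only the text of each line so the newline stays at the end (i[::-1] moves the newline to the front of the line).

```python
def reverse_lines(src_path, dst_path):
    """Write each line of src_path, reversed, to dst_path."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            # Reverse only the text, so the newline stays at the end.
            dst.write(line.rstrip("\n")[::-1] + "\n")
```

Giving the output a .txt name rather than .ipynb avoids having a notebook viewer try to parse the reversed text as JSON.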
I want to open a file, decode the format of data (from base64 to ASCII), rewrite or save the decoded string, either back to the same file, or new one.
I have it opening, reading, decoding (and printing as a test) the decoded base64 string into readable format (ASCII I believe)
My goal is to now save this output to: either a "newfile.txt" document or back to the original "test.mcz" file ready for the next steps of my mission...
I know there are great online base64 decoders and they do work well for what I am doing - I use them often, but my goal is to write my own program as a learning exercise more than anything (also when my internet plays up I need an offline program)
Here's where I am so far (the original file is .mcz format it is a game save)
# PYTHON 3
import base64
f = open('test.mcz', 'r')
f_read = f.read()
# print(f_read) # was just as a test
new_f_read = base64.b64decode(f_read)
print (new_f_read)
This prints a butt-load of readable code that is what I need, but I don't want to have to just copy and paste this output from the Python shell into another editor, I want to save it to a file...for convenience.
Either back into the same test.mcz (I will be re-encoding to base64 again later on anyway) or to a new file - thus leaving my original as it was.
The problem arises when I want to save/write this decoded output that is stored in the new_f_read variable. It's just been a headache. Before I started I could visualise how it needed to be written, but I got tripped up when I had to switch it all over to Python 3 for some reason (don't ask...), and I have tried so many variations from online examples that I wouldn't know where to start explaining what I've tried so far. I can't open the original file as both "r" AND "w" together, so once I've opened and decoded it I can't reopen the original file as "w" because that just wipes the contents (which are still encoded anyway).
I think I need to write functions to handle:
1. Open, read, save string to a variable
2. Manipulate string - decode
3. Write the new string to new or existing file
Sounds easy I know, but I am stuck...so here I am. If anyone shows examples, please take the time to explain what is going on, it seems pointless to me having code I don't understand. Apologies if this seems like a simple thing, help would be appreciated..Thanks
First, you can absolutely open a file for both reading and writing without truncating the contents. That's what the r+ mode is for (see https://docs.python.org/3/library/functions.html#open). If you do this, the model is (a) open the file, (b) read it, (c) seek back to the beginning with e.g. f.seek(0), (d) write it.
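The open/read/seek/write model could look roughly like the sketch below. The function name decode_in_place is made up for illustration. One extra step is added as a precaution: decoded base64 data is shorter than the encoded text, so the file is truncated after writing, otherwise the tail of the old contents would remain.

```python
import base64


def decode_in_place(path):
    """Replace the base64 text in a file with its decoded bytes."""
    # "r+b" opens for reading and writing without truncating the file.
    with open(path, "r+b") as f:
        decoded = base64.b64decode(f.read())
        f.seek(0)          # go back to the beginning
        f.write(decoded)   # overwrite from the start
        f.truncate()       # cut off leftover bytes of the longer original
```

Note that this overwrites the original, so working on a copy of the save file is the safer choice while experimenting.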
Secondly, you can simply open the file, read it, then close the file, and then reopen it, write it, and close it again, like this:
import base64

# open the file for reading, read the data, then close the file
with open('test.mcz', 'rb') as f:
    f_read = f.read()
new_f_read = base64.b64decode(f_read)
# open the file for writing, write the data, then close the file
with open('test.mcz', 'wb') as f:
    f.write(new_f_read)
This is probably the easiest solution.
The easiest thing is to open a read file handle first, close it, and then open a write handle. Read/write handles are more complicated because they have to keep track of where in the file you are, which adds overhead you don't need here. You could do it if you wanted, but it's a waste of time in this case.
Using the with statement to open files is recommended, since the file is closed automatically when you leave the with block.
import base64
with open('test.mcz', 'r') as f:
    encode = base64.b64decode(f.read())
with open('test.mcz', 'wb') as f:
    f.write(encode)
This is the same as
import base64
f = open('test.mcz', 'r')
encode = base64.b64decode(f.read())
f.close()
f = open('test.mcz', 'wb')
f.write(encode)
f.close()
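The question also mentions writing to a new file so the original stays untouched. A minimal sketch of that variant follows; the function name decode_to_new_file and the idea of returning the byte count are illustrative choices, not part of the answers above.

```python
import base64


def decode_to_new_file(src_path, dst_path):
    """Decode base64 from src_path into dst_path; src_path is not modified."""
    # Step 1: open, read, and keep the encoded data in a variable.
    with open(src_path, "rb") as src:
        encoded = src.read()
    # Step 2: manipulate the data (decode it).
    decoded = base64.b64decode(encoded)
    # Step 3: write the result to a different file.
    with open(dst_path, "wb") as dst:
        dst.write(decoded)
    return len(decoded)
```

For example, decode_to_new_file('test.mcz', 'newfile.txt') would leave test.mcz as it was and put the decoded bytes in newfile.txt.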
As part of a bigger project, I would simply like to make sure that a file can be opened and Python can read and use it. So after I opened up the txt file, I said:
data = txtfile.read()
first_line = data.split('\n',1)[0]
print(first_line)
I also tried
print(f1.readline())
where f1 is the txt file. This, again, did nothing.
I am using the spyder IDE, and it just says running file, and doesn't print anything. Is it because my file is too large? It is 4.6 gigs.
Does anyone have any idea what's going on?
"and it just says running file, and doesn't print anything. Is it because my file is too large? It is 4.6 gigs."
Yes.
data = txtfile.read()
This call reads the entire file. Since you stated that the file is 4.6 GB, it is going to take time to load the entire file into memory and then split it by the newline character.
See this: Read large text files in Python
I don't know your context of use, so, if you can process line by line, it would be simpler. Or even chunks would make it simpler than reading the entire file.
first_line = open('myfile.txt', 'r').readline()
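As a rough sketch of the line-by-line and chunked approaches mentioned above: the helpers below never hold more than one line or one chunk in memory. The function names and the 1 MiB default chunk size are arbitrary choices for illustration.

```python
def read_first_line(path):
    """Return the first line without reading the rest of the file."""
    with open(path, "r") as f:
        return f.readline().rstrip("\n")


def count_lines(path):
    """Iterate over the file one line at a time."""
    with open(path, "r") as f:
        return sum(1 for _ in f)


def iter_chunks(path, chunk_size=1024 * 1024):
    """Yield the file's bytes in fixed-size chunks."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

Using with here also closes the file promptly, which the one-liner above leaves to the garbage collector.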
When I use the following code
from PyPDF2 import PdfFileMerger
merge = PdfFileMerger()
for newFile in nlst:
    merge.append(newFile)
merge.write("newFile.pdf")
I got the following error:
raise utils.PdfReadError("EOF marker not found")
PyPDF2.utils.PdfReadError: EOF marker not found
Could anybody tell me what happened?
After encountering this problem using camelot and PyPDF2, I did some digging and have solved the problem.
The end of file marker '%%EOF' is meant to be the very last line, but some PDF files put a huge chunk of javascript after this line, and the reader cannot find the EOF.
Illustration of what the EOF plus javascript looks like if you open it:
b'>>\r\n',
b'startxref\r\n',
b'275824\r\n',
b'%%EOF\r\n',
b'\n',
b'\n',
b'<script type="text/javascript">\n',
b'\twindow.parent.focus();\n',
b'</script><!DOCTYPE html>\n',
b'\n',
b'\n',
b'\n',
So you just need to truncate the file before the javascript begins.
Solution:
import PyPDF2

def reset_eof_of_pdf_return_stream(pdf_stream_in: list):
    # find the line position of the EOF, searching from the end
    for i, x in enumerate(pdf_stream_in[::-1]):
        if b'%%EOF' in x:
            actual_line = len(pdf_stream_in) - i
            print(f'EOF found at line position {-i} = actual {actual_line}, with value {x}')
            break
    # return the list up to that point
    return pdf_stream_in[:actual_line]

# opens the file for reading
with open('data/XXX.pdf', 'rb') as p:
    txt = p.readlines()

# get the new list terminating correctly
txtx = reset_eof_of_pdf_return_stream(txt)

# write to new pdf
with open('data/XXX_fixed.pdf', 'wb') as f:
    f.writelines(txtx)

fixed_pdf = PyPDF2.PdfFileReader('data/XXX_fixed.pdf')
PDF is a file format in which a parser normally starts by reading some global information located at the end of the file. At the very end of the document there needs to be a line with the content
%%EOF
This marker tells the PDF parser that the document ends here and that the global information it needs (a startxref section) should be just before it.
I guess the error message you see means that one of the input documents was truncated and is missing this %%EOF marker.
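To find out which input file is the culprit before merging, one could check the tail of each file for the marker. This is only a sketch: the function name has_trailing_eof_marker and the 1024-byte window are assumptions for illustration, and it does not validate anything else about the PDF.

```python
import os


def has_trailing_eof_marker(path, tail_size=1024):
    """Return True if b'%%EOF' appears in the last tail_size bytes of the file."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Jump to the start of the tail (or to 0 for small files).
        f.seek(max(0, size - tail_size))
        tail = f.read()
    return b"%%EOF" in tail
```

With the list from the question, [name for name in nlst if not has_trailing_eof_marker(name)] would list the files worth inspecting.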
One simple solution for this problem (EOF marker not found): open your .pdf file in another application (I used LibreOffice Draw on Ubuntu 18.04), then export the file as .pdf. With the exported .pdf file the problem no longer occurs.
PyPDF2 cannot find the EOF marker in a PDF that is encrypted.
I came across the same error while I was working through the (excellent) Automate The Boring Stuff. Chapter 15, 2nd edition, page 355, project Combining Select Pages from Many PDFs.
I chose to combine all the PDFs I had made during this chapter into one document and one of them was an encrypted PDF and the project failed when it got to the end of the encrypted document with the error message:
PyPDF2.utils.PdfReadError: EOF marker not found
I moved the encrypted file to a different folder (so it would not be merged with the other PDFs) and the project worked fine.
So, it seems PyPDF2 cannot find the EOF marker in a PDF that is encrypted.
I've also got that problem and got a solution.
First, Python opens PDF files in binary mode: 'rb' for reading and 'wb' for writing.
END OF FILE
This occurs when there was an open parenthesis somewhere on a line but no matching closing parenthesis. Python reached the end of the file while looking for the closing parenthesis.
Here is one solution:
Close the file that you opened earlier using this command:
newfile.close()
Check whether the PDF is also open under another variable, and close that one too:
Same_file_with_another_variable.close()
Now open it only once and use it, and you are good to go.
I wanted to add my hacky solution to this issue.
I had the same error with python requests (application/pdf).
In my case the provider (a shipping labeling service) returned a 200 and a b'string representing the PDF, but in some random cases the EOF marker was missing.
Because it was random, I came up with the following solution:
for obj in label_objects:
    get_label = api.get_label(label_id=obj.label_id)
    while 'EOF' not in str(get_label.content):
        get_label = api.get_label(label_id=obj.label_id)
After a few tries it returns the b'string with EOF and we're good to proceed.
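The loop above retries forever if the service never returns a complete file. A bounded variant might look like the sketch below. The function name fetch_pdf_with_retry, its parameters, and the choice to look for the marker only in the last 1024 bytes are all assumptions for illustration; fetch is any zero-argument callable that returns the response bytes.

```python
import time


def fetch_pdf_with_retry(fetch, max_attempts=5, delay_seconds=0.0):
    """Call fetch() until the returned bytes end with a %%EOF marker."""
    for attempt in range(1, max_attempts + 1):
        content = fetch()
        # Check the tail of the bytes directly instead of str(content).
        if b"%%EOF" in content[-1024:]:
            return content
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(f"no %%EOF marker after {max_attempts} attempts")
```

With the objects from the snippet above, the call could be fetch_pdf_with_retry(lambda: api.get_label(label_id=obj.label_id).content).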
I had the same problem.
For me the solution was to close the previously opened file before working with it again.
Okay, I have a file container produced by a web crawler that contains a lot of different file types, mostly (but not only) HTML, XML, JPG, PNG and PDF. Most of the container is HTML text, so I tried to open it with:
with open(fname) as f:
    content = f.readlines()
which basically fails when I hit a PDF. The files are structured so that every file is preceded by a little meta information telling me what kind of file type follows.
Is there a method similar to .readlines() in Python to read files line by line? I don't need the PDFs; I just want to skip them.
Thanks in advance
Edit:
Example File: GDrive Link
file has a readline() method too, but the idiomatic way is to simply iterate over the file:
with open("/works/even/with/a/pdf/document.pdf") as f:
    for line in f:
        do_something_with(line)
Also I don't understand what you mean by "(it) basically fails when I hit a PDF". I have no problem applying the above code to a pdf file here.
For reading files line by line you could use fileoperations.
from fileoperations import FileReader
print FileReader.LineByLine(fname) #Note this returns a list of lines.
Could you show us a sample of the PDF? This works for my PDFs.
OK, I found a solution: just open the container with open(fname, 'rb') and you can parse it line by line.
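A minimal sketch of that approach: in binary mode each line is a bytes object, so nothing fails on binary payloads (in text mode Python has to decode every line, which is presumably what failed on the PDF parts). The generator below decodes the lines that are valid text and skips the rest. The name iter_text_lines and the UTF-8 default are assumptions, and treating "not decodable" as "binary, skip it" is only a heuristic; the container's meta information would be the more reliable way to skip a whole file.

```python
def iter_text_lines(path, encoding="utf-8"):
    """Yield the decodable lines of a mixed text/binary file."""
    with open(path, "rb") as f:
        for raw in f:  # raw is bytes, split on b"\n"
            try:
                yield raw.decode(encoding)
            except UnicodeDecodeError:
                # Assumed to be part of a binary payload (PDF, JPG, ...).
                continue
```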