For my work, I scrape websites and write them to gzipped web archives (with the extension "warc.gz"). I use Python 2.7.11 and the warc 0.2.1 library.
I noticed that I cannot read the majority of files completely with the warc library. For example, if a warc.gz file has 517 records, I can read only about 200 of them.
After some research I found out that this problem happens only with the gzipped files; files with the extension "warc" are not affected.
Others have reported this problem as well (https://github.com/internetarchive/warc/issues/21), but no solution has been found.
I suspect there might be a bug in the gzip handling in Python 2.7.11. Does anyone have experience with this and know what can be done about it?
Thanks in advance!
Example:
I create new warc.gz files like this:
import warc
warc_path = "\\some_path\file_name.warc.gz"
warc_file = warc.open(warc_path, "wb")
To write records I use:
record = warc.WARCRecord(payload=value, headers=headers)
warc_file.write_record(record)
This creates perfectly valid "warc.gz" files. There are no problems with them; everything, including the "\r\n" line endings, is correct. But the problem starts when I read these files back.
To read files I use:
warc_file = warc.open(warc_path, "rb")
To loop through records I use:
for record in warc_file:
    ...
The problem is that not all records are found during this looping for "warc.gz" files, while all of them are found for "warc" files. Handling of both file types is implemented in the warc library itself.
It seems that the custom gzip handling in warc.gzip2.GzipFile, the file splitting in warc.utils.FilePart and the reading in warc.warc.WARCReader are broken as a whole (tested with Python 2.7.9, 2.7.10 and 2.7.11). The reader stops short when it receives no data instead of a new header.
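To check whether the files themselves are intact, you can count the gzip members directly with zlib (a sketch of my own, not from the warc issue thread); WARC writers normally store each record as its own gzip member, so the member count should match the record count:

import zlib

def count_gzip_members(path):
    """Count the concatenated gzip members in a file."""
    data = open(path, 'rb').read()
    members = 0
    while data:
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect a gzip header
        decomp.decompress(data)
        data = decomp.unused_data  # bytes left over after this member ends
        members += 1
    return members

print count_gzip_members('my_test_file.warc.gz')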
It would seem that the basic stdlib gzip module handles the concatenated files just fine, so this should work as well:
import gzip
import warc

with gzip.open('my_test_file.warc.gz', mode='rb') as gzf:
    for record in warc.WARCFile(fileobj=gzf):
        print record.payload.read()
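This workaround relies on the stdlib gzip module reading straight through concatenated gzip members, so the WARC parser keeps receiving data at exactly the point where the library's own reader gives up.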
As the title indicates, I have an issue with retrieving the information from dump_stats properly. Without further ado, here is my simple code.
Code
import cProfile
import pstats

def fun_to_profile():
    pass  # ... code to be profiled ...

profiler = cProfile.Profile()
profiler.runcall(fun_to_profile)

stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats()
stats.dump_stats("output.txt")
This is the simplest code I could find, and I have read the documentation multiple times.
Problem
My problem is that when I open the file "output.txt", it is either empty or full of incomprehensible characters. Do I need to use a particular file extension, or is the issue with my interpreter?
Thanks in advance.
Apparently working with cProfile is easy and straightforward; I figured out the solution to the problem.
First of all, we need to know that dump_stats writes binary profiling data, so a more appropriate file extension is ".dat". Then we need to read that file back and write it out in the desired human-readable format, such as "text.txt".
For that, we need the following piece of code:
import cProfile
import pstats

# run the function and save the binary profiling output
cProfile.run("fun_to_profile()", "Out_put_profile.dat")

with open("Profile_time.txt", "w") as f:
    p = pstats.Stats("Out_put_profile.dat", stream=f)
    p.sort_stats("time").print_stats()  # sort the analysis by time spent
And just like this we will have more material for analyzing the code, in a human-readable format. Thanks to IDG TECHtalk for sharing the solution.
Link to the YouTube video: https://youtu.be/dmnA3axZ3FY.
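For completeness, here is a sketch of my own (not from the video) showing that the Profile object from the question works too, once its stats are handed to pstats together with an output stream:

import cProfile
import pstats

def fun_to_profile():
    pass  # ... code to be profiled ...

profiler = cProfile.Profile()
profiler.runcall(fun_to_profile)

with open("Profile_time.txt", "w") as f:
    stats = pstats.Stats(profiler, stream=f)  # accepts a Profile instance
    stats.sort_stats("time").print_stats()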
I am trying to return a zip file in a Django HTTP response; the code goes something like this...
archive = shutil.make_archive('testfolder', 'zip', MEDIA_ROOT, 'testfolder')
response = HttpResponse(FileWrapper(open(archive)),
                        content_type=mimetypes.guess_type(archive)[0])
response['Content-Length'] = getsize(archive)
response['Content-Disposition'] = "attachment; filename=test %s.zip" % datetime.now()
return response
Now when this code is executed on Ubuntu the resulting downloaded file opens without any issue, but when it's executed on Windows the created file does not open in WinZip (it gives the error 'Unsupported Zip Format').
Is there something very obvious I am missing here? Isn't python code supposed to be portable?
EDIT:
Thanks to J.F. Sebastian for his comment...
There was no problem in creating the archive; it was in reading it back into the response. So the solution is to change the second line of my code from
response = HttpResponse(FileWrapper(open(archive)),
                        content_type=mimetypes.guess_type(archive)[0])
to,
response = HttpResponse(FileWrapper(open(archive, 'rb')),  # notice extra 'rb'
                        content_type=mimetypes.guess_type(archive)[0])
Check out my answer to this question for more details...
The code you have written should work correctly. I've just run the following line from your snippet to generate a zip file and was able to extract it on both Linux and Windows.
archive = shutil.make_archive('testfolder', 'zip', MEDIA_ROOT, 'testfolder')
There is something funny and specific going on. I recommend you check the following:
Generate the zip file outside of Django with a script that just has that one-liner, then try to extract it on a Windows machine. This will help you rule out anything related to Django, the web server, or the browser.
If that works, then look at exactly what is in the folder you compressed. Do the files have any funny characters in their names? Are there strange file types, or super-long filenames?
Run an MD5 checksum on the zip file on both Windows and Linux, just to make absolutely sure the two files are byte-for-byte identical and rule out any file corruption that might have occurred; a sketch of these checks follows the list.
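A sketch of those checks (my own code; the media path is a placeholder) that can be run unchanged on both machines:

import hashlib
import shutil

# step 1: build the archive outside of Django
archive = shutil.make_archive('testfolder', 'zip', '/path/to/media', 'testfolder')

# step 3: print an MD5 digest to compare between Windows and Linux
md5 = hashlib.md5()
with open(archive, 'rb') as f:  # binary mode matters here too
    for chunk in iter(lambda: f.read(8192), ''):
        md5.update(chunk)
print md5.hexdigest()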
Thanks to J.F. Sebastian for his comment...
I'll still write the solution here in detail...
There was no problem in creating the archive; it was in reading it back into the response. So the solution is to change the second line of my code from
response = HttpResponse(FileWrapper(open(archive)),
                        content_type=mimetypes.guess_type(archive)[0])
to,
response = HttpResponse(FileWrapper(open(archive, 'rb')),  # notice extra 'rb'
                        content_type=mimetypes.guess_type(archive)[0])
because, apparently, hidden somewhere in the Python 2.3 documentation on open:
The most commonly-used values of mode are 'r' for reading, 'w' for
writing (truncating the file if it already exists), and 'a' for
appending (which on some Unix systems means that all writes append to
the end of the file regardless of the current seek position). If mode
is omitted, it defaults to 'r'. The default is to use text mode, which
may convert '\n' characters to a platform-specific representation on
writing and back on reading. Thus, when opening a binary file, you
should append 'b' to the mode value to open the file in binary mode,
which will improve portability. (Appending 'b' is useful even on
systems that don’t treat binary and text files differently, where it
serves as documentation.) See below for more possible values of mode.
So, in simple terms: when reading binary files, using open(file, 'rb') increases the portability of your code (it certainly did in this case).
Now it extracts without trouble on Windows...
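To make the quoted paragraph concrete, here is a small illustration of my own (it assumes Windows, where the translation actually happens):

data = '\x1f\x8b\x08\n\x00'              # pretend gzip bytes containing 0x0A
open('t.bin', 'w').write(data)           # text mode: '\n' is written as '\r\n'
print repr(open('t.bin', 'rb').read())   # '\x1f\x8b\x08\r\n\x00' -- corrupted
open('t.bin', 'wb').write(data)          # binary mode writes the bytes verbatim
print repr(open('t.bin', 'rb').read())   # '\x1f\x8b\x08\n\x00' -- intact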
I got a file that contains a data structure with test results from a Windows user. He created this file using the pickle.dump command. On Ubuntu, I tried to load this test results with the following program:
import pickle
import my_module
f = open('results', 'r')
print pickle.load(f)
f.close()
But I get an error inside the pickle module that there is no module named "my_module".
Could the problem be due to corruption in the file, or could moving from Windows to Linux be the cause?
The problem lies in pickle's way of handling newline characters. Some of the line feed characters cripple module names in dumped / loaded data.
Storing and loading files in binary mode may help, but I was having trouble with that too. After a long time reading docs and searching, I found that pickle supports several different "protocols" for storing data and, due to backward compatibility, it defaults to the oldest one: protocol 0, the original ASCII protocol.
You can select a more modern protocol by specifying the protocol keyword while dumping data, something like this:
pickle.dump(someObj, open("dumpFile.dmp", 'wb'), protocol=2)
or by choosing the highest protocol available (currently 2):
pickle.dump(someObj, open("dumpFile.dmp", 'wb'), protocol=pickle.HIGHEST_PROTOCOL)
The protocol version is stored in the dump file, so pickle.load() handles it automatically.
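A minimal round-trip sketch of my own (the object is a placeholder) tying this together; note the binary mode on both sides:

import pickle

some_obj = {'test_results': [1, 2, 3]}  # placeholder data

with open("dumpFile.dmp", 'wb') as f:
    pickle.dump(some_obj, f, protocol=pickle.HIGHEST_PROTOCOL)

with open("dumpFile.dmp", 'rb') as f:   # no protocol argument needed here
    print pickle.load(f)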
Regards
You should open the pickled file in binary mode, especially if you are using pickle on different platforms. See this and this question for an explanation.
I am trying to use the appdailysales.py module to download our iPhone apps' daily sales. I am a .NET developer, so I tried running this using IronPython in a C# solution, using the following code:
using IronPython.Hosting;
var ipy = Python.CreateRuntime();
dynamic appSales = ipy.UseFile("appdailysales.py");
appSales.main();
Because I didn't have gzip, I took out the references to that module. I was going to use the GZipStream C# class to decompress the file (Apple provides their downloads as .gz files). So I commented out lines 75 and 429-435.
I have tried executing appdailysales.py in my C# solution, directly from IronPython, and using Python 2.7 (I installed ActivePython last night), all with the same results: when I try to open the .gz file using 7-Zip, I get the following error:
CRC Failed ... file is broken
When I try using the GZipStream class I get:
The CRC in GZip footer does not match the CRC calculated from the decompressed data
If I download the .gz file manually, I can decompress the file just fine using 7-Zip or GZipStream.
I am fluent in C#, but new to Python. Any help you can provide would be much appreciated.
Thanks for your time.
Looks like line 444 is the problem. Here are lines 444-446:
downloadFile = open(filename, 'w')
downloadFile.write(filebuffer)
downloadFile.close()
At this stage, IF you have deleted lines 429-435 OR selected not to unzip, then filebuffer refers to the raw gzipped stream that you got from the web. The output file is opened in TEXT mode, and you are on Windows, so every \n in the BINARY gzipped stream will be converted to \r\n -- CORRUPTION, like the error message said.
So: for the module to be used portably on both Windows and other platforms, the open mode must be "wb" (b for binary). If the gunzipped result file is also a binary file, "wb" can be hardcoded in the open call. However if the gunzipped file is a text file (meant to be capable of being opened in a text editor), then you need just "w" for that purpose, and you should set a variable mode to either "wb" or "w" as appropriate, and use mode in the open call.
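A minimal sketch of that suggestion (the flag names here are mine, not from appdailysales.py):

# choose the mode up front: binary whenever the buffer holds raw
# gzipped bytes, text only for a genuinely textual unzipped report
if not unzip_file:          # hypothetical flag: we kept the gzipped stream
    mode = 'wb'
elif report_is_binary:      # hypothetical flag: unzipped result is binary
    mode = 'wb'
else:
    mode = 'w'              # plain-text report; platform line endings are fine

downloadFile = open(filename, mode)
downloadFile.write(filebuffer)
downloadFile.close()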
Big question: I understand why you removed the gzip references for IronPython usage. Did you remove those lines for Python 2.7? Or did you run it under Python 2.7 with those lines still in, but set options.unzipFile to False?
I've got a bunch of FoxPro (VFP9) DBF files on my Ubuntu system, is there a library to open these in Python? I only need to read them, and would preferably have access to the memo fields too.
Update: Thanks @cnu, I used Yusdi Santoso's dbf.py and it works nicely. One gotcha: the memo file extension must be lower case, i.e. .fpt, not .FPT, which was how the filename came over from Windows.
I prefer dbfpy. It supports both reading and writing of .DBF files and can cope with most variations of the format. It's the only implementation I have found that could both read and write the legacy DBF files of some older systems I have worked with.
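For reference, a minimal sketch of typical dbfpy usage (the file name is a placeholder, and this assumes the classic dbf.Dbf interface):

from dbfpy import dbf

db = dbf.Dbf("legacy_table.dbf")  # placeholder filename
for rec in db:
    print rec                     # prints each record's fields
db.close()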
I was able to read a DBF file (with associated BAK, CDX, FBT, TBK files) using the dbf package from PyPI (http://pypi.python.org/pypi/dbf). I am new to Python and know nothing about DBF files, but it worked easily to read a DBF file from my girlfriend's business (created with a music store POS application called AIMsi).
After installing the dbf package (I used aptitude and installed dbf version 0.88, I think), the following Python code worked:
from dbf import *

test = Table("testfile.dbf")
for record in test:
    print record
    x = raw_input("")  # pause between showing records
That's all I know for now, but hopefully it's a useful start for someone else who finds this question!
April 21, 2012 SJK Edit: Per Ethan Furman's comment, I should point out that I actually don't know which of the data files were necessary besides the DBF file. The first time I ran the script, with only the DBF available, it complained of a missing support file. So I just copied over the BAK, CDX, FPT (not FBT, as I said before the edit) and TBK files, and then it worked.
If you're still checking this, I have a GPL FoxPro-to-PostgreSQL converter at https://github.com/kstrauser/pgdbf . We use it to routinely copy our tables into PostgreSQL for fast reporting.
You can try this recipe on ActiveState.
There is also a DBFReader module which you can try.
For support for memo fields, check out http://groups.google.com/group/python-dbase
It currently supports dBase III and Visual FoxPro 6.0 db files... not sure if the file layout changed in VFP 9 or not...
It's 2016 now and I had to fiddle with the dbf package to get it to work. Here is a Python 3 version that just exports a DBF file to a CSV:
import dbf

d = dbf.Table('mydbf.dbf')
d.open()
dbf.export(d, filename='mydf_exported.csv', format='csv', header=True)
I had some Unicode errors at first, but got around them by turning off memos:
import dbf

d = dbf.Table('mydbf.dbf', ignore_memos=True)
d.open()
dbf.export(d, filename='mydf_exported.csv', format='csv', header=True)