Python 3: extract files from tar.gz archive - python

I'm currently working with Semantically Enriched Wikipedia.
The resource is inside a 7.5 GB tar.gz archive, and each file inside it is an XML file with the following structure:
<text>
Plain text
</text>
<annotation>
Annotation for plain text
</annotation>
The current task is to extract each file and then parse the content inside the tags.
First thing I did was to use the tarfile module and its extractall() method, but during the extraction I got this error:
OSError: [Errno 22] Invalid argument: '.\\sew_conservative\\wiki384\\Live_%3F%21*%40_Like_a_Suicide.xml'
Only part of the archive was extracted before the error occurred. (I thought the error was caused by the Unicode character in the XML file name, but I now see that every file name contains one.)
So I planned to work on each file inside the archive using some of the API's methods and the code below.
Unfortunately, the TarInfo object that wraps each file doesn't give access to the file content, and extracting file by file takes too much time.
import tarfile
from pathlib import Path

def parse_sew():
    sew_path = Path("C:/Users/beppe/Desktop/Tesi/Semantically Enriched Wikipedia/sew_conservative.tar.gz")
    with tarfile.open(sew_path, mode='r') as t:
        for item in t:
            pass  # extraction
Is extraction mandatory in order to parse and use the content of the XML files, or is it possible to read the archive content on the fly, without extracting anything, and then parse it?
UPDATE: I'm extracting the files via the tar -xvzf filename.tar.gz command. Everything is going well, but after 15 minutes only 500 MB of the hundred GB had been processed.

I would suggest using 7zip for the extraction. You can start the 7zip extraction from Python and, while it is running, read the files that have already been extracted. This will save quite a lot of time. You can implement this using threads.
Secondly, don't use forward slashes when giving a Windows path. You can use \\ in place of /.

You can also try using shutil as follows.
shutil.unpack_archive('path_to_your_archive', 'path_to_extract')
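On the question of reading without extracting: tarfile can hand back a file-like object for each member through TarFile.extractfile(), so the content can be read in memory. The sketch below is only an illustration; the function name iter_xml_members is made up here, and it assumes the members are UTF-8 text. It also avoids creating files on disk, which may matter because the failing name contains a *, a character Windows does not allow in file names (this is a likely cause of the Errno 22, not a confirmed one).

```python
import tarfile


def iter_xml_members(archive_path):
    """Yield (member_name, text) for every .xml file in a tar.gz archive.

    Nothing is written to disk; each member is read into memory in turn.
    iter_xml_members is a hypothetical helper, not part of tarfile.
    """
    with tarfile.open(archive_path, mode="r:gz") as tar:
        for member in tar:
            # Skip directories, links and anything that is not an XML file.
            if not member.isfile() or not member.name.endswith(".xml"):
                continue
            fileobj = tar.extractfile(member)
            if fileobj is None:
                continue
            # Assumption: the files are UTF-8 encoded.
            yield member.name, fileobj.read().decode("utf-8")
```

If the files really follow the schema shown in the question (a text element followed by an annotation element, with no common root), the returned string would need to be wrapped in a single root element before being passed to an XML parser.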

Related

zipfile.ZipFile extracts the wrong file

I am working on a project that manipulates a document's XML file. My approach is as follows: first convert the DOCX document into a zip archive, then extract the contents of that archive to get access to the document.xml file, and finally convert the XML to a txt file in order to work with it.
I did all of the above on one document and everything worked perfectly. But when I switched to a different document, the zipfile library did not extract the content of the new ZIP archive. Instead, it somehow extracted the contents of the old document that I had processed before, and document.xml was already converted into document.txt without me even running the block of code that does the conversion.
The worst part is that the old document is not even in the directory anymore, so I have no idea how zipfile is extracting the content of that particular document when it's not even in the path.
This is the code I am using in Jupyter notebook.
import os
import shutil
import zipfile

# Convert the DOCX to ZIP
shutil.copyfile('data/docx/input.docx', 'data/zip/document.zip')
# Extract the ZIP
with zipfile.ZipFile('zip/document.zip', 'r') as zip_ref:
    zip_ref.extractall('data/extracted/')
# Convert "document.xml" to txt
os.rename('extracted/word/document.xml', 'extracted/word/document.txt')
# Read the txt file
with open('extracted/word/document.txt') as intxt:
    data = intxt.read()
This is the directory tree of the extracted zip archive for the first document.
data -
    1-docx
    2-zip
    3-extracted/-
        1-customXml/
        2-docProps/
        3-_rels
        4-[Content_Types].xml
        5-word/-document.txt
The second document's directory tree should be as follows:
data -
    1-docx
    2-zip
    3-extracted/-
        1-customXml/
        2-docProps/
        3-_rels
        4-[Content_Types].xml
        5-word/-document.xml
But zipfile is extracting the contents of the first document even when that DOCX file is not in the directory. I am also using Ubuntu 20.04, so I am not sure if it has to do with my OS.
I suspect that you are having issues with relative paths, as unzipping any Word document will create the same file/directory structure. I'd suggest using absolute paths to avoid this. What you may also want to do is, after you are done manipulating and processing the extracted files and directories, delete them. That way you won't encounter any issues with lingering files.
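A minimal sketch of that suggestion, assuming the goal is only to get the text of word/document.xml: resolve the input to an absolute path, extract into a temporary directory, and let that directory be deleted automatically so nothing lingers between runs. The function name read_document_xml is made up for this example. A DOCX file is already a zip archive, so no copy to a .zip file is needed.

```python
import tempfile
import zipfile
from pathlib import Path


def read_document_xml(docx_path):
    """Return the text of word/document.xml from a DOCX file.

    read_document_xml is a hypothetical helper written for this example.
    """
    # An absolute path does not depend on the notebook's working directory.
    docx_path = Path(docx_path).resolve()
    # The temporary directory and everything in it is removed on exit,
    # so files from a previous document cannot be picked up by mistake.
    with tempfile.TemporaryDirectory() as tmp_dir:
        with zipfile.ZipFile(docx_path) as archive:
            archive.extractall(tmp_dir)
        xml_file = Path(tmp_dir) / "word" / "document.xml"
        return xml_file.read_text(encoding="utf-8")
```

If only this one file is needed, calling archive.read("word/document.xml") on the open ZipFile would skip the extraction step entirely.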

python: extracting a .bz2 compressed file from a torrent file

I have a .torrent file that contains a .bz2 file. I am sure that such a file is actually in the .torrent because I extracted the .bz2 with utorrent.
How can I do the same thing in python instead of using utorrent?
I have seen a lot of libraries for dealing with .torrent files in python but apparently none does what I need. Among my unsuccessful attempts I can mention:
import torrent_parser as tp
file_cont = tp.parse_torrent_file('RC_2015-01.bz2.torrent')
file_cont is now a dictionary and file_cont['info']['name']='RC_2015-01.bz2' but if I try to open the file, i.e.
from bz2 import BZ2File
with BZ2File(file_cont['info']['name']) as f:
    what_I_want = f.read()
then the content of the dictionary is (obviously, I'd say) interpreted as a path, and I get
No such file or directory: 'RC_2015-01.bz2'
Other attempts have been even more ruinous.
A .torrent file is just a metadata file, indicating where to get the data and the filename of the file. You can't get the file contents from that file.
Only once you have downloaded the file that the torrent describes to disk (using torrent software) can you use BZ2File to open it (if it is in .bz2 format).
If you want to perform the actual download with Python, the only option I found was torrent-dl, which hasn't been updated for 2 years.
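For the second step, reading the .bz2 file once it is on disk, here is a small sketch. It assumes the decompressed content is UTF-8 text with one record per line, which the question does not state; read_first_lines is a name invented for this example. bz2.open decompresses while reading, so a large file does not have to fit in memory.

```python
import bz2
from itertools import islice


def read_first_lines(bz2_path, count):
    """Return the first `count` lines of a bz2-compressed text file.

    read_first_lines is a hypothetical helper; the UTF-8 encoding
    is an assumption about the file content.
    """
    with bz2.open(bz2_path, mode="rt", encoding="utf-8") as f:
        # islice stops after `count` lines, so the rest is never decompressed.
        return [line.rstrip("\n") for line in islice(f, count)]
```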

How to get information of .jar file in python-magic

I have a folder full of jar, html, css, and exe files. How can I check each file's type?
I have already run the "file" command on *NIX and used python-magic, but every result looks like this:
test : Zip archive data, at least v1.0 to extract
How can I get a specific result such as test : jar, using only the magic number?
How do I do this?
While not required, most JAR files have a META-INF/MANIFEST.MF file contained within them. You could check for the existence of this file, after checking if it's a zip file:
import zipfile

def zipFileContains(zipFileName, pathName):
    # True if any member name in the archive starts with pathName
    with zipfile.ZipFile(zipFileName, "r") as f:
        return any(x.startswith(pathName.rstrip("/")) for x in f.namelist())

print(zipFileContains("test.jar", "META-INF/MANIFEST.MF"))
However, it might be better to just check if it's a zip file that ends in .jar.
Magic alone won't do it for you, since a JAR is literally just a zip file.
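Both checks can be combined in one function. This is a heuristic sketch, not a definitive test: looks_like_jar is a name invented here, and because the manifest is optional, a real JAR without one would be reported as False.

```python
import zipfile


def looks_like_jar(path):
    """Heuristic: a zip archive that contains META-INF/MANIFEST.MF.

    looks_like_jar is a hypothetical helper; JARs without a manifest
    are not recognized by this check.
    """
    # is_zipfile checks for the zip signature, so other files are rejected.
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        return "META-INF/MANIFEST.MF" in archive.namelist()
```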

Reading gzipped data in Python

I have a *.tar.gz compressed file that I would like to read in with Python 2.7. The file contains multiple h5 formatted files as well as a few text files. I'm a novice with Python. Here is the code I'm trying to adapt:
subset_path='c:\data\grant\files'
f=gzip.open(filename,'subset_full.tar.gz')
subset_data_path=os.path.join(subset_path,'f')
The first statement identifies the path to the folder with the data. The second statement tells Python to open a specific compressed file and the third statement (hopefully) executes a join of the prior two statements.
Several lines below this code I get an error when Python tries to use the 'subset_data_path' assignment.
What's going on?
The gzip module will only open a single file that has been compressed, i.e. my_file.gz. You have a tar archive of multiple files that are also compressed. This needs to be both untarred and uncompressed.
Try using the tarfile module instead, see https://docs.python.org/2/library/tarfile.html#examples
edit: To add a bit more information on what has happened, you have successfully opened the zipped tarball into a gzip file object, which will work almost the same as a standard file object. For instance you could call f.readlines() as if f was a normal file object and it would return the uncompressed lines.
However, this did not actually unpack the archive into new files in the filesystem. You did not create a subdirectory 'c:\data\grant\files\f', and so when you try to use the path subset_data_path you are looking for a directory that does not exist.
The following ought to work:
import os
import tarfile

# Raw string: otherwise "\f" in the path is read as a form-feed escape
subset_path = r'c:\data\grant\files'
with tarfile.open("subset_full.tar.gz") as tar:
    tar.extractall(subset_path)
subset_data_path = os.path.join(subset_path, 'subset_full')
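The last line assumes the archive unpacks into a top-level folder called subset_full, which may not be the case. A quick way to check before building the path is to list the top-level names stored in the archive. The helper below is a sketch with a made-up name (top_level_names); it is written for Python 3 and assumes member names do not start with "./".

```python
import tarfile


def top_level_names(archive_path):
    """Return the sorted top-level file and folder names in a tar archive.

    top_level_names is a hypothetical helper written for this example.
    """
    with tarfile.open(archive_path) as tar:
        # Member names use "/" as separator regardless of platform.
        return sorted({name.split("/")[0] for name in tar.getnames()})
```

If this returns a single folder name, that name is what should be joined onto subset_path.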

Renaming a HTML file with Python

A bit of background:
When I save a web page from e.g. IE8 as "webpage, complete", the images and such that the page contains are placed in a subfolder with the postfix "_files". This convention allows Windows to synchronize the .htm file and the accompanying folder.
Now, in order to keep the synchronization intact, when I rename the HTML file from my Python script I want the "_files" folder to be renamed also. Is there an easy way to do this, or will I need to
- rename the .htm file
- rename the _files folder
- parse the .htm file and replace all references to the old _files folder name with the new name?
There is just one easy way: Have IE save the file again under the new name. But if you want to do it later, you must parse the HTML. In this case, BeautifulSoup is your friend.
If you rename the folder, I'm not sure how you can get around parsing the .htm file and replacing instances of _files with the new suffix. Perhaps you can use a folder alias (shortcut?) but then that's not a very clean solution.
You can use a simple string replace on your HTML file without parsing it. It can of course be troublesome if the text being replaced is also mentioned in the content of the HTML itself.
import os

os.rename("test.html", "test2.html")
os.rename("test_files", "test2_files")
# Point the references inside the page at the renamed folder
with open("test2.html", "r") as f:
    s = f.read().replace("test_files", "test2_files")
with open("test2.html", "w") as f:
    f.write(s)
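To reduce the risk mentioned above, the replacement can be limited to src and href attribute values that start with the old folder name. This is a sketch under assumptions: retarget_references is a made-up name, and it assumes the saved page refers to the folder with quoted, relative, unencoded paths such as src="test_files/image.png". It is still a regular expression, not a real HTML parser.

```python
import re


def retarget_references(html, old_folder, new_folder):
    """Rewrite src/href values that start with old_folder to use new_folder.

    retarget_references is a hypothetical helper; text outside of
    src/href attributes is left untouched.
    """
    pattern = re.compile(
        r'((?:src|href)\s*=\s*["\'])' + re.escape(old_folder) + r'(?=[/\\])',
        re.IGNORECASE,
    )
    # A function is used as replacement so new_folder is inserted literally.
    return pattern.sub(lambda m: m.group(1) + new_folder, html)
```

The result of this function would replace the plain f.read().replace(...) step in the snippet above.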
