Renaming a HTML file with Python - python

A bit of background:
When I save a web page from e.g. IE8 as "webpage, complete", the images and such that the page contains are placed in a subfolder with the postfix "_files". This convention allows Windows to synchronize the .htm file and the accompanying folder.
Now, in order to keep the synchronization intact, when I rename the HTML file from my Python script I want the "_files" folder to be renamed also. Is there an easy way to do this, or will I need to
- rename the .htm file
- rename the _files folder
- parse the .htm file and replace all references to the old _files folder name with the new name?

There is just one easy way: Have IE save the file again under the new name. But if you want to do it later, you must parse the HTML. In this case, BeautifulSoup is your friend.

If you rename the folder, I'm not sure how you can get around parsing the .htm file and replacing instances of _files with the new suffix. Perhaps you can use a folder alias (shortcut?) but then that's not a very clean solution.

you can use simple string replace on your HTML file without parsing it, it can of course be troublesome if the text being replaced is mentioned in the HTML itself..
os.rename("test.html", "test2.html")
os.rename("test_files", "test2_files")
with open("test2.html", "r") as f:
s = f.read().replace("test_files", "test2_files")
with open("test2.html", "w") as f:
f.write(s)

Related

zipfile.ZipFile extracts the wrong file

I am working on a project that manipulates with a document's xml file. My approach is like the following. First convert the DOCX document into a zip archive, then extract the contents of that archive in order to have access to the document.xml file, and finally convert the XML to a txt in order to work with it.
So i did all the above on a document, and everything worked perfectly, but when i decided to use a different document, the Zipfile library doesnt extract the content of the new ZIP archive, however it somehow extracts the contents of the old document that i processed before, and converts the document.xml file into document.txt without even me even running that block of code that converts the XML into txt.
The worst part is the old document is not even in the directory anymore, so i have no idea how Zipfile is extracting the content of that particular document when its not even in the path.
This is the code I am using in Jupyter notebook.
import shutil
import zipfile
# Convert the DOCX to ZIP
shutil.copyfile('data/docx/input.docx', 'data/zip/document.zip')
# Extract the ZIP
with zipfile.ZipFile('zip/document.zip', 'r') as zip_ref:
zip_ref.extractall('data/extracted/')
# Convert "document.xml" to txt
os.rename('extracted/word/document.xml', 'extracted/word/document.txt')
# Read the txt file
with open('extracted/word/document.txt') as intxt:
data = intxt.read()
This is the directory tree for the extracted zip archive for the first document.
data -
1-docx
2-zip
3-extracted/-
1-customXml/
2-docProps/
3-_rels
4-[Content_Types].xml
5-word/-document.txt
The 2nd document's directory tree should be as following
data -
1-docx
2-zip
3-extracted/-
1-customXml/
2-docProps/
3-_rels
4-[Content_Types].xml
5-word/-document.xml
But Zipfile is extracting the contents of the first document even when the DOCX file is not in the directory.I am also using Ubuntu 20.04 so i am not sure if it has to do with my OS.
I suspect that you are having issues with relative paths, as unzipping any Word document will create the same file/directory structure. I'd suggest using absolute paths to avoid this. What you may also want to do is, after you are done manipulating and processing the extracted files and directories, delete them. That way you won't encounter any issues with lingering files.

Python 3: extract files from tar.gz archive

I'm currently working with Semantically Enriched Wikipedia.
The resource is inside a 7.5 GB tar.gz archive and each file inside it, it's an XML whose schema is:
<text>
Plain text
</text>
<annotation>
Annotation for plain text
</annotation>
The current task is to extract each file and then parse the content inside the tags.
First thing I did was to use the tarfile module and its extractall() method, but during the extraction I got this error:
OSError: [Errno 22] Invalid argument: '.\\sew_conservative\\wiki384\\Live_%3F%21*%40_Like_a_Suicide.xml'
while a part of it is correctly extracted (I thought the error was due to the unicode char inside the xml file name, but I'm now seeing that each file has it inside).
So I planned to work on each file inside the archive using some of the API's methods and the code below.
Unfortunately, the TarInfo object which wraps each file doesn't allow to access the file content and the extraction, file by file, takes too much time.
def parse_sew():
sew_path = Path("C:/Users/beppe/Desktop/Tesi/Semantically Enriched Wikipedia/sew_conservative.tar.gz")
with tarfile.open(sew_path, mode='r') as t:
for item in t:
// extraction
Is the extraction mandatory to parse and use the content of the XML file or it's possible to read the archive content (on-the-fly, without extracting anything) and then parse the content?
UPDATE: I'm extracting the files via tar -xvzf filename.tar.gz command, everything is going well, but after 15 mins I was able to process only 500MB of the hundred GB.
I would suggest you to use 7zip for extraction. You can initiate 7zip extraction from python and then while it is extracting side by side you can read the files getting extracted. This will save quite a lot of time. You can implement is using threads.
Secondly dont use front slashes while giving windows path. You can use \\ in place of /.
You can also try using shutil as follows.
shutil.unpack_archive('path_to_your_archive', 'path_to_extract')

Python create a directory and save .txt file into it

I am currently working on the school assignment using the Python Socket library.
I have server.py and client.py, and basically, I request a copy of the .txt file from client-side to server-side, and client-side needed to receive the text elements, create a new .txt file and directory to save it.
I am stuck in the file handling on the client-side.
What is the best way I can do create a directory and save .txt file into it?
# create a new .txt file for incoming data and save to new directory
with open(new_dir / "copied_text_file.txt", '+w') as text:
text.write(file_text)
I tried this way, and it does not save in my new directory.
I appreciate your help, thank you!
If you are trying to create a path, use os.path methods, see in particular join.
Is the name of your new directory "new_dir"? If so the command needs to be open("new_dir/copied_text_file.txt", "+w"). If not and new_dir is a string of the directory use open((new_dir + "/copied_text_file.txt"), "+w") better yet would be to use os.path.join(new_dir, "copied_text_file.txt") and call open on the resulting pathname.
open() takes a string as the destination for the file you're going
to be working with. You can pass it a URI just like would would use
when working on the command line.
import os
with open(os.path.join(new_dir / "copied_text_file.txt", '+w')) as text:
text.write(file_text)
You could just concatenate with +,
with open(new_dir+'/'+ "copied_text_file.txt", '+w')) as text:
# ...
However, using + will be lower because path.join lives inside
complied c code the python interpreter has an easier time running,
rather than having to do the concatenation in python which has more
overhead than CPython models.
https://docs.python.org/3.5/library/os.path.html#os.path.join

Copying file from one directory to another

Does anyone know how I can copy/duplicate a file from one directory into another without specification of src path? I got it to work with "shutil.copy2" but it's not exactly what I am looking for since the src argument asks for the path.
My goal is to be able to copy/duplicate a file from one directory into another by filename. Has anyone done this before, if so can you guide me in the right direction? - Thanks
#----------------------------------------------------------------------------------------------------------------#
# These params will be used for specifying which template you want to copy and where to output
#----------------------------------------------------------------------------------------------------------------#
'''Load file from x directory into current working directory '''
#PullTemplate: Specify which template you want to copy, by directory path
TemplateRepo = ("/home/hadoop/BackupFolders/Case_Project/scripts")
#OutputTemplate: Let's you specify where you want to output the copied template.
#Originally set to your current working directory (u".")
OutputTemplate = (u".")
shutil.copy2(TemplateRepo, OutputTemplate)
Well if you are trying to load a file in the same project you need to have at least the folder name inside that project.
You can use json
Something like this.
import json
#someFiles is just a fold name inside the projects main folder.
with open("someFiles\\file_name", "r") as whatever_u_want:
var_of_choice = json.load(whatever_u_want)
print (var_of_choice)
once the file is open you can save the variable var_of_choice as any file name you wish where you wish using the json dump method.
Click the file you want to copy, create a duplicate of the file by choosing Duplicate key under File ( below the Jupyter logo at the top ).
Choose the file copied(file_copy), and choose Move key under File.
Choose the file path where you want to paste/move the copied file.
Rename the copied file name as you wish.
For more info, you can refer to here: https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781785884870/1/ch01lvl1sec12/basic-notebook-operations

Using Python, getting the name of files in a zip archive

I have several very large zip files available to download on a website. I am using Flask microframework (based on Werkzeug) which uses Python.
Is there a way to show the contents of a zip file (i.e. file and folder names) - to someone on a webpage - without actually downloading it? As in doing the working out server side.
Assume that I do not know what are in the zip archives myself.
I apoligize that this post does not include code.
Thank you for helping.
Sure, have a look at zipfile.ZipFile.namelist(). Usage is pretty simple, as you'd expect: you just create a ZipFile object for the file you want, and then namelist() gives you a list of the paths of files stored in the archive.
with ZipFile('foo.zip', 'r') as f:
names = f.namelist()
print names
# ['file1', 'folder1/file2', ...]
http://docs.python.org/library/zipfile.html
Specifically, try using the ZipFile.namelist() method.

Categories