Get XML file name from loaded XML files using Python

My Python code reads the XML files stored at a location, parses them with the lxml library, and loads them into a Python list, as shown below:
import os
from lxml import etree

XMLFILEList = []
FilePath = 'C:\\plugin\\TestPlugin\\'
XMLFilePath = os.listdir(FilePath)
for XMLFILE in XMLFilePath:
    if XMLFILE.endswith('.xml'):
        # build the full path so parsing works regardless of the current directory
        XMLFILEList.append(etree.parse(os.path.join(FilePath, XMLFILE)))
print(XMLFILEList)
Output:
[<lxml.etree._ElementTree object at 0x000001CCEEE0C748>, <lxml.etree._ElementTree object at 0x000001CCEEE0C7C8>]
Currently, I only see the parsed XML file objects.
Can anyone help me pull the original filenames of the XML files? For example, if my HelloWorld.xml file is loaded into XMLFILEList, I should be able to retrieve "HelloWorld.xml".

You have a one-to-one correspondence between XMLFilePath and XMLFILEList: the first holds the files you loaded, the second their parsed contents. Just use that, applying your if statement while building a dictionary:
mydict = {}
for XMLFILE in XMLFilePath:
    if XMLFILE.endswith('.xml'):
        mydict[XMLFILE] = etree.parse(os.path.join(FilePath, XMLFILE))
Your dict will now have the loaded filenames as keys and the parsed files as values.
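For example, a minimal sketch building on the snippet above (assuming the same FilePath, imports, and mydict), the dictionary keys give you back exactly the original filenames:
# the keys are the original filenames, e.g. "HelloWorld.xml"
for filename, tree in mydict.items():
    print(filename, tree.getroot().tag)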

Related

How can one parse whole XML documents using the LXML Sax module?

I have a script that goes through a directory with many XML files and extracts or adds information to these files. I use XPath to identify the elements of interest.
The relevant piece of code is this:
import os
import lxml.etree as et
import lxml.sax

# deleted non relevant code

for root, dirs, files in os.walk(ROOT):
    # iterate all files
    for file in files:
        if file.endswith('.xml'):
            # join root dir and file name
            file_path = os.path.join(ROOT, file)
            # load root element from file
            file_root = et.parse(file_path).getroot()
            # This is a function that I define elsewhere in which I use XPath to identify
            # relevant elements and extract, change or add some information
            xml_dosomething(file_root)
            # init tree object from file_root
            tree = et.ElementTree(file_root)
            # save the modified XML tree to a new file so that I keep a copy of the original
            tree.write(file_path.replace('.xml', '-clean.xml'), encoding='utf-8',
                       doctype='<!DOCTYPE document SYSTEM "estcorpus.dtd">', xml_declaration=True)
I have seen in various places that people recommend using Sax(on) to speed up the processing of large files. After checking the documentation of the lxml SAX module (https://lxml.de/sax.html) I'm at a loss as to how to modify my code to leverage it. I can see the following in the documentation:
handler = lxml.sax.ElementTreeContentHandler()
followed by a list of statements like handler.startElementNS((None, 'a'), 'a', {}) that populate the 'handler' "document" with what would be the elements of an XML document. After that I see:
tree = handler.etree
lxml.etree.tostring(tree.getroot())
I think I understand what handler.etree does, but my problem is that I want 'handler' to be built from the files in the directory I'm working with, rather than from a string I create with 'handler.startElementNS' and the like. What do I need to change in my code to get the SAX module to do the required work with the files as input?
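For reference, the lxml documentation also describes the opposite direction: lxml.sax.saxify() replays an already-parsed tree as SAX events and feeds them to a content handler. A minimal sketch of that call (the handler class, element name, and input file are hypothetical, only to illustrate how events arrive from a file rather than from manual startElementNS calls):
import lxml.etree as et
import lxml.sax
from xml.sax.handler import ContentHandler

class TitleCounter(ContentHandler):
    # hypothetical handler that counts 'title' elements as SAX events arrive
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElementNS(self, name, qname, attributes):
        uri, localname = name
        if localname == 'title':
            self.count += 1

tree = et.parse('some_file.xml')   # hypothetical input file
handler = TitleCounter()
lxml.sax.saxify(tree, handler)     # replays the parsed tree as SAX events
print(handler.count)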

Extract First Page of All PDF Documents in a Library

I am new to PDF handling in Python. I have a document library which contains a large volume of PDF documents, and I am trying to extract the first page of each document. I have produced the code below.
My initial for loop, "for entry in entries", returns the names of all documents in the library. I verify this by successfully printing all document names in the library.
I am using pdfReader.getPage to select the page number in each document and the extractText function to extract the text from that page. However, when I run the entire script, I am thrown an error stating that one of the documents cannot be located, even though the document does exist in the library and is printed in the list of documents in the repository.
I believe the issue is with how extractText iterates through the documents, but I am unclear on how to resolve it. Would anyone have any suggestions?
import os
import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader

# get the file names in the directory
directory = 'Fund Docs'
entries = os.listdir(directory)
for entry in entries:
    print(entry)
    # create a PDF reader object
    pdfFileObj = open(entry, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    print(pdfReader.numPages)
    # creating a page object
    pageObj = pdfReader.getPage(0)
    # extracting text from page
    print(pageObj.extractText())
    # closing the pdf file object
    pdfFileObj.close()
You need to specify the full path:
pdfFileObj = open(directory + '/' + entry, 'rb')
This will open the file at Fund Docs/FILE_NAME.pdf. By only specifying entry, it will look for the file in the current folder, which it won't find. By adding the folder at the start, you're saying to find the entry inside that folder.
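A slightly more portable way to build the same path (just a sketch of the same idea) is os.path.join, which handles the separator for you:
import os

pdfFileObj = open(os.path.join(directory, entry), 'rb')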

zipfile.ZipFile extracts the wrong file

I am working on a project that manipulates a document's XML file. My approach is the following: first convert the DOCX document into a zip archive, then extract the contents of that archive in order to have access to the document.xml file, and finally convert the XML to a txt file in order to work with it.
I did all of the above on one document and everything worked perfectly, but when I decided to use a different document, the zipfile library did not extract the content of the new ZIP archive. Instead it somehow extracted the contents of the old document that I processed before, and converted the document.xml file into document.txt without me even running the block of code that converts the XML to txt.
The worst part is that the old document is not even in the directory anymore, so I have no idea how zipfile is extracting the content of that particular document when it is not even in the path.
This is the code I am using in a Jupyter notebook.
import os
import shutil
import zipfile

# Convert the DOCX to ZIP
shutil.copyfile('data/docx/input.docx', 'data/zip/document.zip')

# Extract the ZIP
with zipfile.ZipFile('zip/document.zip', 'r') as zip_ref:
    zip_ref.extractall('data/extracted/')

# Convert "document.xml" to txt
os.rename('extracted/word/document.xml', 'extracted/word/document.txt')

# Read the txt file
with open('extracted/word/document.txt') as intxt:
    data = intxt.read()
This is the directory tree for the extracted zip archive of the first document:
data/
    docx/
    zip/
    extracted/
        customXml/
        docProps/
        _rels/
        [Content_Types].xml
        word/document.txt
The second document's directory tree should be as follows:
data/
    docx/
    zip/
    extracted/
        customXml/
        docProps/
        _rels/
        [Content_Types].xml
        word/document.xml
But zipfile is extracting the contents of the first document even though its DOCX file is no longer in the directory. I am also using Ubuntu 20.04, so I am not sure whether it has to do with my OS.
I suspect that you are having issues with relative paths, as unzipping any Word document will create the same file/directory structure. I'd suggest using absolute paths to avoid this. What you may also want to do is, after you are done manipulating and processing the extracted files and directories, delete them. That way you won't encounter any issues with lingering files.
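A minimal sketch of that suggestion (the base directory is an assumption; adjust it to your project layout), anchoring every path to one absolute base and removing the extracted tree when done so stale files from a previous run cannot be picked up later:
import os
import shutil
import zipfile

BASE = os.path.abspath('data')   # hypothetical project base directory

zip_path = os.path.join(BASE, 'zip', 'document.zip')
extract_dir = os.path.join(BASE, 'extracted')

shutil.copyfile(os.path.join(BASE, 'docx', 'input.docx'), zip_path)

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

# read document.xml directly; no rename to .txt is needed just to read it
with open(os.path.join(extract_dir, 'word', 'document.xml')) as intxt:
    data = intxt.read()

# clean up the extracted files after processing
shutil.rmtree(extract_dir)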

How to iterate over JSON files in a directory and upload to mongodb

So I have a folder with about 500 JSON files. I need to upload all of them to a local MongoDB database. I tried using MongoDB Compass, but Compass can only upload one file at a time. In Python I tried to write some simple code to iterate through the folder and upload the files one by one, but I ran into some problems. First of all, the JSON files are not comma-separated but line-separated, so the files look like:
{ some JSON object }
{ some JSON object }
...
I wrote the following code to iterate through the folder and upload it:
import os
import pymongo
import json
import pandas as pd
import numpy as np

myclient = pymongo.MongoClient("mongodb://localhost:27017/")
mydb = myclient['Test']
mycol = mydb['data']

directory = os.fsencode("C:/Users/PB/Desktop/test/")

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith(".json"):
        mycol.insert_many(filename)
The code basically goes through a folder, checks if it's a .json file, then inserts it into the database. That is what should happen. However, I get this error:
TypeError: document must be an instance of dict, bson.son.SON,
bson.raw_bson.RawBSONDocument, or a type that inherits from
collections.MutableMapping
I cannot seem to upload them through Python. I tried multiple variations of the code, but for some reason Python does not accept the JSON files. The problem seems to be that Python only allows comma-separated JSON files.
How could I fix this so that all the files get uploaded?
You're inserting the names of the files into Mongo, not the contents of the files.
Assuming you have multiple JSON files in a directory, where each file contains one JSON object per line...
You need to go through all the files, filter them, open them, read them line by line, parse each line into a dict, and then insert. Something like below:
os.chdir(directory)
for file in os.listdir(directory):
    if file.endswith(".json"):
        with open(file) as f:
            for line in f:
                mongo_obj = json.loads(line)
                mycol.insert_one(mongo_obj)   # insert_one is the current PyMongo method; insert() is deprecated
I did a chdir first to avoid having to pass the whole path to open.
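If the files are large, a variant of the same idea (a sketch, assuming the same imports, a string directory path, and the mycol collection from above) batches the inserts with insert_many, so there is one round trip per file instead of one per line:
os.chdir(directory)
for file in os.listdir(directory):
    if file.endswith(".json"):
        with open(file) as f:
            # parse every non-empty line into a dict
            docs = [json.loads(line) for line in f if line.strip()]
        if docs:
            mycol.insert_many(docs)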

search within an unextracted .zip file

I'm trying to use Python to search within a .zip file without actually extracting the zipped file. I understand re.search can search within files, but will it do the same for files that have yet to be extracted?
The way to do this is with the zipfile module; it lets you read the name list and other metadata from the zip file prior to extracting the content.
import zipfile

zf = zipfile.ZipFile('example.zip', 'r')
print(zf.namelist())
You can read more about this in the zipfile module documentation.
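To actually run re.search against the member contents without extracting anything to disk, you can open each member as a file object. A minimal sketch (the archive name, search pattern, and text encoding are assumptions):
import re
import zipfile

pattern = re.compile(r'TODO')   # hypothetical search pattern

with zipfile.ZipFile('example.zip', 'r') as zf:
    for name in zf.namelist():
        # read the member's bytes in memory and decode them to text
        with zf.open(name) as member:
            text = member.read().decode('utf-8', errors='ignore')
        if pattern.search(text):
            print('match in', name)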
