I'm pretty new to programming and this is the first time I've used XML, but for a class I'm doing a gender classification project with a dataset of blogs.
I have a folder containing XML files. First I need to make a list of the names of the files in it.
Then I want to loop through the list, open each XML file, extract what I need (e.g. text and class), and store that in another variable, such as a list or dictionary.
I tried something, but it isn't right and I'm kind of stuck. Can someone help me? This is what I have so far:
import os
import xml.etree.ElementTree as ET

path = '\\Users\\name\\directory\\folder'

def select_files_in_folder(path, ext):
    for filename in os.listdir(path):
        fullname = os.path.join(path, filename)
        tree = ET.parse(fullname)
        for elem in tree.findall('gender'):
            print(elem.get('gender'), elem.text)
If you want to build a list of all the XML files in a given directory, you can do the following:
def get_xml_files(path):
    xml_list = []
    for filename in os.listdir(path):
        if filename.endswith(".xml"):
            xml_list.append(os.path.join(path, filename))
    return xml_list
Just keep in mind that this does not recurse into sub-folders, and it assumes the XML files end with .xml.
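The same list can also be built with the standard-library glob module, which does the pattern matching for you (a sketch; the function name is mine):

```python
import glob
import os

def get_xml_files_glob(path):
    # glob expands the "*.xml" pattern against the directory contents,
    # returning full paths directly (no os.path.join needed afterwards)
    return glob.glob(os.path.join(path, "*.xml"))
```

Like the os.listdir version, this is not recursive; glob also supports a `recursive=True` option with a `**` pattern if you need sub-folders.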
EDIT :
Parsing XML is highly dependent on the library you're using. From your code I guess you're using xml.etree.ElementTree (keep in mind this library is not safe against maliciously constructed data).
def get_xml_data(file_list):
    data = []
    for filename in file_list:
        root = ET.parse(filename).getroot()
        # extend, not assign, so results from earlier files are kept
        data.extend(text for text in root.findall("whatever you want to get"))
    return data
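Putting the two pieces together, here is a minimal end-to-end sketch; the `<post>` element name is just a placeholder, since your blog files may use different tags:

```python
import xml.etree.ElementTree as ET

def collect_element_texts(xml_files, tag):
    # Gather the text content of every <tag> element across all files,
    # in document order, into a single flat list.
    data = []
    for filename in xml_files:
        root = ET.parse(filename).getroot()
        data.extend(elem.text for elem in root.iter(tag))
    return data
```

You would call it as `collect_element_texts(get_xml_files(path), "post")` once you know the real element name in your dataset.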
Related
I have a script that goes through a directory with many XML files and extracts or adds information to these files. I use XPath to identify the elements of interest.
The relevant piece of code is this:
import lxml.etree as et
import lxml.sax

# deleted non relevant code
for root, dirs, files in os.walk(ROOT):
    # iterate all files
    for file in files:
        if file.endswith('.xml'):
            # join the directory currently being walked and the file name
            file_path = os.path.join(root, file)
            # load root element from file
            file_root = et.parse(file_path).getroot()
            # This is a function that I define elsewhere in which I use XPath
            # to identify relevant elements and extract, change or add some information
            xml_dosomething(file_root)
            # init tree object from file_root
            tree = et.ElementTree(file_root)
            # save the modified XML tree to a file with an added suffix
            # so that I keep a copy of the original
            tree.write(file_path.replace('.xml', '-clean.xml'),
                       encoding='utf-8',
                       doctype='<!DOCTYPE document SYSTEM "estcorpus.dtd">',
                       xml_declaration=True)
I have seen in various places that people recommend using SAX to speed up the processing of large files. After checking the documentation of the lxml SAX module (https://lxml.de/sax.html) I'm at a loss as to how to modify my code to leverage it. I can see the following in the documentation:
handler = lxml.sax.ElementTreeContentHandler()
then there is a list of statements like handler.startElementNS((None, 'a'), 'a', {}) that populate the 'handler' document with what would be the elements of the XML document. After that I see:
tree = handler.etree
lxml.etree.tostring(tree.getroot())
I think I understand what handler.etree does, but my problem is that I want 'handler' to be built from the files in the directory I'm working with, rather than from calls to 'handler.startElementNS' and the like that I write by hand. What do I need to change in my code to get the SAX module to do the work that needs to be done with the files as input?
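One thing worth knowing here: lxml.sax is mainly for converting between SAX events and in-memory ElementTree objects, not for streaming files from disk. The usual streaming entry point for large files is iterparse, which both lxml.etree and the standard library's xml.etree.ElementTree provide with the same interface. A sketch (the `record` tag is a placeholder for whatever element you actually process):

```python
import xml.etree.ElementTree as ET  # lxml.etree offers the same iterparse API

def stream_texts(path, tag):
    # Parse the file incrementally, collecting the text of each <tag>
    # element as its end tag is seen, then clearing the element so the
    # whole document is never held in memory at once.
    texts = []
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == tag:
            texts.append(elem.text)
            elem.clear()  # discard the subtree we no longer need
    return texts
```

With iterparse you keep your os.walk loop exactly as it is and only swap the parsing step; note that clearing elements means XPath queries over the whole tree are no longer possible, which is the trade-off for the memory savings.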
What is the best way to copy specific files from a list into a new directory using python?
For example, I have a text document containing file names like the below example:
E3004
D0402
B9404
C6089
I would like to search a directory, copy the files that are found into a new directory, and list the codes that are not found in a new text document.
I am a complete python novice, so any help is much appreciated.
Here's a piece of code from a previous discussion that was put together as a solution to a similar problem. However, I am having trouble understanding where exactly to place my file paths for the src, dst and text document. Furthermore, is there a way to save the data that was not found to a separate text document?
Link to the previous discussion: https://stackoverflow.com/a/51621897/10580480
import os
import shutil
from tkinter import *
from tkinter import filedialog
root = Tk()
root.withdraw()
filePath = filedialog.askopenfilename()
folderPath = filedialog.askdirectory()
destination = filedialog.askdirectory()
filesToFind = []
with open(filePath, "r") as fh:
    for row in fh:
        filesToFind.append(row.strip())

for filename in os.listdir(folderPath):
    if filename in filesToFind:
        filename = os.path.join(folderPath, filename)
        shutil.copy(filename, destination)
    else:
        print("file does not exist: " + filename)
Thanks
Using glob (see https://docs.python.org/3/library/glob.html) you can search for specific filenames in a directory. I have used it to find readme.md files in a directory of git clones like this:

for readmeFile in glob.glob('FULLPATH/clones/*/readme.md'):

So in your case, loop through your document containing filenames and search for each one:

for foundFile in glob.glob('FULLPATH/clones/*/' + filename):

Alternatively, you can use os.listdir() and match on the start of the file name:

for filename in os.listdir('/home/localDir/INPUT/'):
    if filename.startswith("localFileName"):
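Putting it together for this question, here is a minimal sketch (function and variable names are mine; it assumes the names in your text file match the file names on disk exactly, so append an extension if your files have one). It returns the codes that were not found, which you can then write to a report file:

```python
import os
import shutil

def copy_listed_files(codes, src_dir, dst_dir):
    # Copy every file whose name appears in `codes` from src_dir to
    # dst_dir; return the codes that were not found on disk.
    missing = []
    existing = set(os.listdir(src_dir))
    for code in codes:
        if code in existing:
            shutil.copy(os.path.join(src_dir, code), dst_dir)
        else:
            missing.append(code)
    return missing
```

Afterwards, `"\n".join(missing)` written with a plain `open(..., "w")` gives you the separate text document of codes that were not found.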
I have a bunch of Word docx files that have the same embedded Excel table. I am trying to extract the same cells from several files.
I figured out how to hard code to one file:
from docx import Document
document = Document(r"G:\GIS\DESIGN\ROW\ROW_Files\Docx\006-087-003.docx")
table = document.tables[0]
Project_cell = table.rows[2].cells[2]
paragraph = Project_cell.paragraphs[0]
Project = paragraph.text
print(Project)
But how do I batch this? I tried some variations on listdir, but they are not working for me and I am too green to get there on my own.
How you loop over all of the files will really depend on your project deliverables. Are all of the files in a single folder? Are there more than just .docx files?
To address all of the issues, we'll assume that there are subdirectories, and other files mingled with your .docx files. For this, we'll use os.walk() and os.path.splitext()
import os
from docx import Document

# First, we'll create an empty list to hold the path to all of your docx files
document_list = []

# Now, we loop through every file in the folder "G:\GIS\DESIGN\ROW\ROW_Files\Docx"
# (and all its subfolders) using os.walk(). You could alternatively use os.listdir()
# to get a list of files. It would be recommended, and simpler, if all files are
# in the same folder. Consider that change a small challenge for developing your skills!
for path, subdirs, files in os.walk(r"G:\GIS\DESIGN\ROW\ROW_Files\Docx"):
    for name in files:
        # For each file we find, we need to ensure it is a .docx file before adding
        # it to our list
        if os.path.splitext(os.path.join(path, name))[1] == ".docx":
            document_list.append(os.path.join(path, name))

# Now create a loop that goes over each file path in document_list, replacing your
# hard-coded path with the variable.
for document_path in document_list:
    document = Document(document_path)  # Change the document being loaded each loop
    table = document.tables[0]
    project_cell = table.rows[2].cells[2]
    paragraph = project_cell.paragraphs[0]
    project = paragraph.text
    print(project)
For additional reading, here is the documentation on os.listdir().
Also, it would be best to put your code into a function which is re-usable, but that's also a challenge for yourself!
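As a starting point for that challenge, the per-document extraction can be pulled into a function that takes an already-opened python-docx Document (a sketch; it reads the same cell as the hard-coded version):

```python
def extract_project(document):
    # Given an opened python-docx Document, return the text of the
    # first paragraph in row 2, cell 2 of its first table.
    return document.tables[0].rows[2].cells[2].paragraphs[0].text
```

The loop body then becomes `print(extract_project(Document(document_path)))`, and the same function is reusable anywhere else you need that cell.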
Assuming that the code above gets you the data you need, all you need to do is read the files from the disk and process them.
First let's define a function that does what you were already doing; then we'll loop over all the documents in the directory and process them with that function.
Edit the following untested code to suit your needs.
# we'll use os.walk to iterate over all the files in the directory
# we're going to make some simplifying assumptions:
# 1) all the docx files are in the same directory
# 2) you want to store the paragraphs in a list
import os
from docx import Document

DOCS = r'G:\GIS\DESIGN\ROW\ROW_Files\Docx'

def get_para(document):
    table = document.tables[0]
    project_cell = table.rows[2].cells[2]
    paragraph = project_cell.paragraphs[0]
    return paragraph.text

if __name__ == "__main__":
    paragraphs = []
    # os.walk yields (dirpath, dirnames, filenames) tuples; we only
    # need the filenames of the top-level directory
    filenames = next(os.walk(DOCS))[2]
    for filename in filenames:
        file_name = os.path.join(DOCS, filename)
        document = Document(file_name)
        para = get_para(document)
        paragraphs.append(para)
    print(paragraphs)
First of all, thanks for reading this. I am a little stuck with sub-directory walking (and then saving) in Python. My code below is able to walk through each sub-directory in turn and process a file to search for certain strings; I then generate an .xlsx file (using xlsxwriter) and write my search data to it.
I have two problems...
The first problem I have is that I want to process a text file in each directory, but the text file name varies per sub directory, so rather than specifying 'Textfile.txt' I'd like to do something like *.txt (would I use glob here?)
The second problem is that when I open/create an Excel file I would like to save it to the same sub-directory where the .txt file was found and processed. Currently my Excel file is saved to the Python script's directory, and consequently gets overwritten each time a new sub-directory is opened and processed. Would it be wiser to save the Excel file to the sub-directory at the end, or can it be created with the current sub-directory path from the start?
Here's my partially working code...
for root, subFolders, files in os.walk(dir_path):
    if 'Textfile.txt' in files:
        with open(os.path.join(root, 'Textfile.txt'), 'r') as f:
            searchlines = f.readlines()
        searchstringsFilter1 = ['Filter Used :']
        searchstringsFilter0 = ['Filter Used : 0']
        timestampline = None
        timestamp = None

        # Create a workbook and add a worksheet.
        workbook = xlsxwriter.Workbook('Excel.xlsx', {'strings_to_numbers': True})
        worksheetFilter = workbook.add_worksheet("Filter")
Thanks again for looking at this problem.
MikG
I will not solve your code completely, but here are hints:
the text file name varies per sub directory, so rather than specifying 'Textfile.txt' I'd like to do something like *.txt
you can list all the files in the directory, then check the file extension:

for filename in files:
    if filename.endswith('.txt'):
        # do stuff
Also, when creating the workbook, can you enter a path? You have root, right? Why not use it?
You don't want glob because you already have a list of files in the files variable. So, filter it to find all the text files:
import fnmatch
txt_files = fnmatch.filter(files, '*.txt')
To save the file in the same subdirectory:
outfile = os.path.join(root, 'someoutfile.txt')
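The same applies to the workbook: xlsxwriter.Workbook() accepts a full path, so building it from root puts each Excel file next to the .txt file it came from (a small sketch; the default file name is arbitrary):

```python
import os

def workbook_path(root, name="Excel.xlsx"):
    # Build the output path inside the sub-directory currently being
    # walked, so each workbook lands next to its source .txt file.
    return os.path.join(root, name)
```

Inside the os.walk loop you would then write `workbook = xlsxwriter.Workbook(workbook_path(root), {'strings_to_numbers': True})`.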
I do atomistic modelling, and use Python to analyze simulation results. To simplify work with a whole bunch of Python scripts used for different tasks, I decided to write a simple GUI to run the scripts from.
I have a (rather complex) directory structure beginning from some root (say ~/calc), and I want to populate a wx.TreeCtrl control with the directories containing calculation results, preserving their structure. A folder contains results if it contains a file with the .EXT extension. What I try to do is walk through the dirs from the root and check in each dir whether it contains a .EXT file. When such a dir is reached, I add it and its ancestors to the tree:
def buildTree(self, rootdir):
    root = rootdir
    r = len(rootdir.split('/'))
    ids = {root: self.CalcTree.AddRoot(root)}
    for (dirpath, dirnames, filenames) in os.walk(root):
        for dirname in dirnames:
            fullpath = os.path.join(dirpath, dirname)
            if sum([s.find('.EXT') for s in filenames]) > -1 * len(filenames):
                ancdirs = fullpath.split('/')[r:]
                ad = rootdir
                for ancdir in ancdirs:
                    d = os.path.join(ad, ancdir)
                    ids[d] = self.CalcTree.AppendItem(ids[ad], ancdir)
                    ad = d
But this code ends up with many second-level nodes with the same name, and that's definitely not what I want. So I somehow need to see whether a node has already been added to the tree and, if so, attach the new node to the existing one, but I do not understand how this could be done. Could you please give me a hint?
Besides, the code contains 2 dirty hacks I'd like to get rid of:
I get the list of ancestor dirs by splitting the full path at '/' characters, and this is Linux-specific;
I find whether a .EXT file is in the directory by trying to find the extension in the strings of the filenames list, exploiting the fact that s.find returns -1 if the substring is not found.
Is there a way to make these chunks of code more readable?
First of all, the hacks:
To get the path separator for whatever OS you're using, you can use os.sep.
Use str.endswith() and use the fact that in Python the empty list [] evaluates to False:
if [ file for file in filenames if file.endswith('.EXT') ]:
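Both hacks can be wrapped in small, portable helpers (a sketch; the names are mine). os.path.relpath handles the separator and the root-prefix stripping in one step, and any() with endswith replaces the sum-of-find trick:

```python
import os

def has_ext_file(filenames, ext=".EXT"):
    # True if any file in this directory listing has the extension
    return any(name.endswith(ext) for name in filenames)

def ancestor_dirs(fullpath, rootdir):
    # Path components of fullpath below rootdir, split on the
    # platform's own separator instead of a hard-coded '/'
    rel = os.path.relpath(fullpath, rootdir)
    return rel.split(os.sep)
```

These drop straight into the buildTree loop in place of the two expressions you flagged.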
In terms of getting them all nicely nested, you're best off doing it recursively. The pseudocode would look something like the following. Please note this is just provided to give you an idea of how to do it; don't expect it to work as it is!
def buildTree(self, rootdir):
    rootId = self.CalcTree.AddRoot(rootdir)
    self.buildTreeRecursion(rootdir, rootId)

def buildTreeRecursion(self, dir, parentId):
    # Iterate over the entries in dir
    for file in dirFiles:
        id = self.CalcTree.AppendItem(parentId, file)
        if file is a directory:
            self.buildTreeRecursion(file, id)
Hope this helps!
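To make the recursion concrete without depending on wx, here is a runnable sketch that mirrors only the qualifying directories (those with a .EXT file somewhere below) as nested dicts; swapping the dict assignment for self.CalcTree.AppendItem(...) gives the wx version:

```python
import os

def build_tree(rootdir, ext=".EXT"):
    # Recursively mirror the directory structure as nested dicts,
    # keeping only sub-trees that somewhere contain a file with the
    # given extension; returns None for sub-trees with no results.
    node = {}
    for name in sorted(os.listdir(rootdir)):
        full = os.path.join(rootdir, name)
        if os.path.isdir(full):
            child = build_tree(full, ext)
            if child is not None:
                node[name] = child
    has_result = any(n.endswith(ext) for n in os.listdir(rootdir))
    return node if (node or has_result) else None
```

Because each directory is visited exactly once and attached to its parent as it is found, the duplicate second-level nodes from the os.walk version cannot occur.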