I have a bunch of Word docx files that have the same embedded Excel table. I am trying to extract the same cells from several files.
I figured out how to hard code to one file:
from docx import Document
document = Document(r"G:\GIS\DESIGN\ROW\ROW_Files\Docx\006-087-003.docx")
table = document.tables[0]
Project_cell = table.rows[2].cells[2]
paragraph = Project_cell.paragraphs[0]
Project = paragraph.text
print Project
But how do I batch this? I tried some variations on listdir, but they are not working for me and I am too green to get there on my own.
How you loop over all of the files will really depend on your project deliverables. Are all of the files in a single folder? Are there more than just .docx files?
To address all of the issues, we'll assume that there are subdirectories, and other files mingled with your .docx files. For this, we'll use os.walk() and os.path.splitext():
import os
from docx import Document
# First, we'll create an empty list to hold the path to all of your docx files
document_list = []
# Now, we loop through every file in the folder "G:\GIS\DESIGN\ROW\ROW_Files\Docx"
# (and all its subfolders) using os.walk(). You could alternatively use os.listdir()
# to get a list of files. That would be recommended, and simpler, if all files are
# in the same folder. Consider that change a small challenge for developing your skills!
for path, subdirs, files in os.walk(r"G:\GIS\DESIGN\ROW\ROW_Files\Docx"):
    for name in files:
        # For each file we find, we need to ensure it is a .docx file before adding
        # it to our list
        if os.path.splitext(os.path.join(path, name))[1] == ".docx":
            document_list.append(os.path.join(path, name))
# Now create a loop that goes over each file path in document_list, replacing your
# hard-coded path with the variable.
for document_path in document_list:
    document = Document(document_path) # Change the document being loaded each loop
    table = document.tables[0]
    project_cell = table.rows[2].cells[2]
    paragraph = project_cell.paragraphs[0]
    project = paragraph.text
    print project
For additional reading, here is the documentation on os.listdir().
Also, it would be best to put your code into a function which is re-usable, but that's also a challenge for yourself!
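For the single-folder case, a minimal sketch of the os.listdir() approach wrapped in a reusable function could look like this. The name list_docx_files is made up for illustration, and the sketch assumes every document sits directly in one folder. It is written so it also runs on Python 3.

```python
import os

def list_docx_files(folder):
    """Return full paths of the .docx files directly inside folder (no subfolders)."""
    paths = []
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        # Skip subfolders and anything that is not a .docx file.
        # Word leaves lock files such as "~$report.docx" next to open documents; skip those too.
        if (os.path.isfile(full_path)
                and name.lower().endswith(".docx")
                and not name.startswith("~$")):
            paths.append(full_path)
    return paths
```

Each returned path can then be passed to Document() in the same way as document_path in the loop above.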
Assuming that the code above gets you the data you need, all you need to do is read the files from the disk and process them.
First let's define a function that does what you were already doing, then we'll loop over all the documents in the directory and process them with that function.
Edit the following untested code to suit your needs.
# we'll use os.walk to iterate over all the files in the directory
# we're going to make some simplifying assumptions:
# 1) all the docx files are in the same directory
# 2) you want to store the paragraph text in a list.
import os
from docx import Document
DOCS = r'G:\GIS\DESIGN\ROW\ROW_Files\Docx'
def get_para(document):
    table = document.tables[0]
    project_cell = table.rows[2].cells[2]
    paragraph = project_cell.paragraphs[0]
    return paragraph.text
if __name__ == "__main__":
    paragraphs = []
    # os.walk yields (path, subdirectories, filenames); the first item describes DOCS itself
    path, subdirs, filenames = next(os.walk(DOCS))
    for filename in filenames:
        # skip anything that is not a Word document
        if not filename.endswith('.docx'):
            continue
        file_name = os.path.join(DOCS, filename)
        document = Document(file_name)
        para = get_para(document)
        paragraphs.append(para)
    print(paragraphs)
Related
I'm pretty new in programming and it is the first time I use xml, but for class I'm doing a gender classification project with a dataset of Blogs.
I have a folder which consists of xml files. Now I need to make a list of names of the files there.
Then I should be able to run through the list with a loop and open each file containing XML and get out of it what I want (ex. Text and class) and then store that in another variable, like adding it to a list or dictionary.
I tried something, but it isn't right and I'm kind of stuck. Can someone help me? This is what I have so far:
path ='\\Users\\name\\directory\\folder'
dir = os.listdir( path )
def select_files_in_folder(dir, ext):
    for filename in os.listdir(path):
        fullname= os.path.join(path, filename)
        tree = ET.parse(fullname)
        for elem in doc.findall('gender'):
            print(elem.get('gender'), elem.text)
If you want to build a list of all the xml files in a given directory, you can do the following:
def get_xml_files(path):
    xml_list = []
    for filename in os.listdir(path):
        if filename.endswith(".xml"):
            xml_list.append(os.path.join(path, filename))
    return xml_list
Just keep in mind that this does not recurse into subfolders, and it assumes that the names of the xml files end with .xml.
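If the files are spread over subfolders, a recursive variant is a small change. This is only a sketch: the name get_xml_files_recursive is hypothetical, and the case-insensitive extension check is an addition that get_xml_files above does not have.

```python
import os

def get_xml_files_recursive(path):
    """Collect .xml files under path, including all subfolders."""
    xml_list = []
    for root, dirs, files in os.walk(path):
        for filename in files:
            # Lower-casing the name means "BLOG.XML" is matched as well.
            if filename.lower().endswith(".xml"):
                xml_list.append(os.path.join(root, filename))
    # Sort so the result does not depend on the order the OS lists folders in.
    return sorted(xml_list)
```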
EDIT :
Parsing xml is highly dependent on the library you'll be using. From your code I guess you're using xml.etree.ElementTree (keep in mind this lib is not safe against maliciously constructed data).
import xml.etree.ElementTree as ET
def get_xml_data(file_list):
    data = []
    for filename in file_list:
        tree = ET.parse(filename)
        # extend, rather than reassign, so results from earlier files are kept
        data.extend(elem.text for elem in tree.findall("whatever you want to get"))
    return data
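To store both the text and the class for each file, something along these lines may help. It is a sketch under assumptions: the function name read_blog and the tag names "gender" and "post" are guesses for illustration and need to be replaced with whatever tags the dataset really uses.

```python
import xml.etree.ElementTree as ET

def read_blog(filename, class_tag="gender", text_tag="post"):
    """Return {"class": ..., "text": [...]} for one XML file.

    Both tag names are assumptions; replace them with the tags in your data.
    """
    root = ET.parse(filename).getroot()
    # ".//tag" searches the whole tree, not only direct children of the root.
    class_elem = root.find(".//" + class_tag)
    # An empty element has text None, so fall back to an empty string.
    texts = [elem.text or "" for elem in root.iter(text_tag)]
    return {
        "class": class_elem.text if class_elem is not None else None,
        "text": texts,
    }
```

Calling read_blog on every path in the file list and appending the results to a list gives one dictionary per file.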
# Import tools used in code - lxml is not a normal library
# The .whl file will need to be downloaded and added - use PIP to install .whl files
# (xml.etree.cElementTree was removed in Python 3.9, so plain ElementTree is imported here)
import os, gc, shutil, re, xml.etree.ElementTree as ET
# GC is garbage collector
gc.collect()
from lxml import etree
# Read through directory pulling in filenames
# This currently needs to be modified to match folder structure - will retool to work from anywhere to anywhere
for fileCDR in os.listdir("Absolute File Path"):
    # Check files looking for .cdr
    if fileCDR.endswith(".cdr"):
        # os.listdir returns bare file names, so build the full path before parsing
        cdrFile = os.path.join("Absolute File Path", fileCDR)
        # Parse file into string to pull out XML Flags - Try Exception used was having issues with tools
        # Need to remove one of them as they are both not needed
        try:
            pfile = etree.parse(cdrFile).getroot()
        except Exception:
            pfile = ET.parse(cdrFile).getroot()
        # print (file)
        # Read through xml in pfile variable - looking for start time
        for startTime in pfile.findall('XML Tag startTime'):
            start = startTime.text.replace(':', '-')
        # Read through xml in pfile variable looking for Dialed Digits - originalDestinationId
        for originalDestinationId in pfile.findall('XML Tag originalDestinationId'):
            number = originalDestinationId.text
        # Read through xml in pfile variable - looking for filename - correlationId
        for correlationId in pfile.findall('XML Tag correlationId'):
            fileName = correlationId.text
        # Store all the collected variables into one variable to rename .cdr files
        cdrDone = (start + "_" + number + "_" + fileName + ".cdr")
        # Store all the collected variables into one variable to rename .ogg files
        new_fileName = (start + "_" + number + "_" + fileName + ".ogg")
        # Store file name of .cdr file to use in if statement to match files up
        compareFile = fileCDR[:-4]
        # Read through directory pulling in filenames
        for fileOGG in os.listdir("Absolute File Path"):
            # Used to check if filename ends in .ogg
            if fileOGG.endswith(".ogg"):
                # Used to trim file name of .ogg to match .cdr file
                compareFileName = fileOGG[:-4] # type: string
                # Used to make sure filenames match - .cdr and .ogg
                if compareFileName == compareFile:
                    # These all currently needs to be modified to match folder structure - will retool to work from anywhere to anywhere
                    oggFile = os.path.join("Absolute File Path", fileOGG)
                    newOGG = os.path.join("Absolute File Path", new_fileName)
                    shutil.move(oggFile, newOGG)
                    # os.path.join is needed here too; a bare tuple is not a valid destination
                    newCDR = os.path.join("Absolute File Path", cdrDone)
                    shutil.move(cdrFile, newCDR)
                    print (newOGG + " " + newCDR)
Originally I needed help reading in file names from a directory. I was able to find my answers but had more questions. The files I was trying to identify in a folder structure had xml inside them. After identifying those files I needed a way to parse out metadata in the xml. After pulling out the metadata from the xml I needed to rename a file that shared the same name but had a different file extension. I wanted to retain the filename and include the two flags I wanted from the xml. I wanted the files moved outside of the source folder into a new folder while doing the renaming. This is the end result and contains the answer to my question in addition to a few other hurdles I had. I appreciate people's feedback and am aware my question and original request were no good. It was my first time making a post on this site. Thanks again and hopefully this helps someone else out.
P.S. This is my first Python program and I want to do more with it. If I did things that could be modified for efficiency I would appreciate the tips or pointers.
Load this up in a DOM and extract the value of the node you want.
Save that value in a variable. Close the file you want to rename. Use the value stored in the variable to rename the file.
Sorry, it is just pseudo-code, but all the steps are well documented in many places.
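Those steps could be sketched with the standard library's xml.dom.minidom roughly as follows. The function name rename_from_xml is made up for illustration, and the sketch assumes the chosen element contains plain text that is safe to use in a file name.

```python
import os
from xml.dom import minidom

def rename_from_xml(path, tag_name):
    """Rename the file at path after the text of its first <tag_name> element."""
    dom = minidom.parse(path)                    # load the file into a DOM
    nodes = dom.getElementsByTagName(tag_name)
    if not nodes or nodes[0].firstChild is None:
        raise ValueError("no text found in <%s>" % tag_name)
    value = nodes[0].firstChild.data.strip()     # save the node's value in a variable
    dom.unlink()                                 # free the DOM; parse() has already closed the file
    folder, old_name = os.path.split(path)
    extension = os.path.splitext(old_name)[1]    # keep the original extension
    new_path = os.path.join(folder, value + extension)
    os.rename(path, new_path)
    return new_path
```

Characters such as ":" are not allowed in Windows file names, so a value like a timestamp would need cleaning before the rename.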
I have a python program that downloads article text and then turns it into a txt file. The program currently spits out the txt files in the directory the program is located in. I would like to arrange this text in folders specific to the news source they came from. Could I save the data in the folder in the python program itself and change the directory as the news source file changes? Or should I create a shell script that runs the python program inside the folder it needs to be in? Or is there a better way to sort these files that I am missing?
Here is the code of the Python program:
import feedparser
from goose import Goose
import urllib2
import codecs
url = "http://rss.cnn.com/rss/cnn_tech.rss"
feed = feedparser.parse(url)
g = Goose()
entryLength = len(feed['entries'])
count = 0
while True:
    article = g.extract(feed.entries[count]['link'])
    title = article.title
    text = article.cleaned_text
    file = codecs.open(feed['entries'][count]['title'] + ".txt", 'w', encoding = 'utf-8')
    file.write(text)
    file.close()
    count = count + 1
    if count == entryLength:
        break
If you only give your save functions filenames, they will save to the current directory. However, if you provide them with paths, your files will end up there. Python takes care of it.
import os
folder = 'whatever' #the folder you wish to save the files in
name = 'somefilename.txt'
filename = os.path.join(folder, name)
Using that filename will make the file end up in the folder 'whatever/'.
Edit: I see you've posted your code now. As br1ckb0t mentioned in his comment below, in your code you could write something like codecs.open(folder + feed['entries'].... Make sure to append a slash to folder if you do that, or it'll just end up as part of the filename.
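To get one folder per news source, the folder has to exist before the file is opened. Here is a minimal sketch, assuming a hypothetical helper named save_article and a source label you choose yourself (for example "cnn_tech"). It uses io.open instead of codecs.open, which is a substitution of mine.

```python
import io
import os

def save_article(base_folder, source, title, text):
    """Write text to base_folder/source/title.txt, creating the folder if needed."""
    folder = os.path.join(base_folder, source)
    if not os.path.isdir(folder):
        os.makedirs(folder)
    # Titles can contain path separators; replace them so they stay in the file name.
    safe_title = title.replace("/", "-").replace("\\", "-")
    filename = os.path.join(folder, safe_title + ".txt")
    with io.open(filename, "w", encoding="utf-8") as f:
        f.write(text)
    return filename
```

Inside the download loop, the codecs.open/write/close lines would then become a single call such as save_article("articles", "cnn_tech", title, text).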
First of all thanks for reading this. I am a little stuck with sub directory walking (then saving) in Python. My code below is able to walk through each sub directory in turn and process a file to search for certain strings, I then generate an xlsx file (using xlsxwriter) and post my search data to an Excel.
I have two problems...
The first problem I have is that I want to process a text file in each directory, but the text file name varies per sub directory, so rather than specifying 'Textfile.txt' I'd like to do something like *.txt (would I use glob here?)
The second problem is that when I open/create an Excel I would like to save the file to the same sub directory where the .txt file has been found and processed. Currently my Excel is saving to the python script directory, and consequently gets overwritten each time a new sub directory is opened and processed. Would it be wiser to save the Excel at the end to the sub directory or can it be created with the current sub directory path from the start?
Here's my partially working code...
for root, subFolders, files in os.walk(dir_path):
    if 'Textfile.txt' in files:
        with open(os.path.join(root, 'Textfile.txt'), 'r') as f:
            #f = open(file, "r")
            searchlines = f.readlines()
            searchstringsFilter1 = ['Filter Used :']
            searchstringsFilter0 = ['Filter Used : 0']
            timestampline = None
            timestamp = None
            f.close()
            # Create a workbook and add a worksheet.
            workbook = xlsxwriter.Workbook('Excel.xlsx', {'strings_to_numbers': True})
            worksheetFilter = workbook.add_worksheet("Filter")
Thanks again for looking at this problem.
MikG
I will not solve your code completely, but here are hints:
the text file name varies per sub directory, so rather than specifying 'Textfile.txt' I'd like to do something like *.txt
You can list all files in the directory, then check each file's extension:
for filename in files:
    if filename.endswith('.txt'):
        pass  # do stuff
Also, when creating the workbook, can you pass in a path? You have root, right? Why not use it?
You don't want glob because you already have a list of files in the files variable. So, filter it to find all the text files:
import fnmatch
txt_files = filter(lambda fn: fnmatch.fnmatch(fn, '*.txt'), files)
To save the file in the same subdirectory:
outfile = os.path.join(root, 'someoutfile.txt')
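Putting the two hints together, a sketch of the walk might look like this. The generator name plan_outputs and its default arguments are made up for illustration. It only works out the paths and leaves the searching and the workbook code to you.

```python
import fnmatch
import os

def plan_outputs(dir_path, pattern="*.txt", out_name="Excel.xlsx"):
    """Yield (text_file_path, output_path) for each matching file under dir_path."""
    for root, sub_folders, files in os.walk(dir_path):
        # fnmatch.filter keeps only the names that match the pattern.
        for name in sorted(fnmatch.filter(files, pattern)):
            # The output path is built from root, so it lands next to the text file.
            yield os.path.join(root, name), os.path.join(root, out_name)
```

Each output_path can be passed to xlsxwriter.Workbook() in place of the bare 'Excel.xlsx', so every subdirectory gets its own workbook and nothing is overwritten.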
I want to implement a file reader script (folders and subfolders) which detects some tags and deletes those tags from the files.
The files are .cpp, .h, .txt and .xml, and there are hundreds of files under the same folder.
I have no idea about python, but people told me that I can do it easily.
EXAMPLE:
My main folder is A: C:\A
Inside A, I have folders (B,C,D) and some files A.cpp A.h A.txt and A.xml. In B i have folders B1, B2,B3 and some of them have more subfolders, and files .cpp, .xml and .h....
xml files, contains some tags like <!-- $Mytag: some text$ -->
.h and .cpp files contains another kind of tags like //$TAG some text$
.txt has different format tags: #$This is my tag$
It always starts and ends with $ symbol but it always have a comment character (//,
The idea is to run one script and delete all tags from all files so the script must:
Read folders and subfolders
Open files and find tags
If they are there, delete and save files with changes
WHAT I HAVE:
import os
for root, dirs, files in os.walk(os.curdir):
    if files.endswith('.cpp'):
        %Find //$ and delete until next $
    if files.endswith('.h'):
        %Find //$ and delete until next $
    if files.endswith('.txt'):
        %Find #$ and delete until next $
    if files.endswith('.xml'):
        %Find <!-- $ and delete until next $ and -->
The general solution would be to:
use the os.walk() function to traverse the directory tree.
Iterate over the filenames and use fn_name.endswith('.cpp') with if/elif to determine which file you're working with
Use the re module to create a regular expression you can use to determine if a line contains your tag
Open the target file and a temporary file (use the tempfile module). Iterate over the source file line by line and output the filtered lines to your tempfile.
If any lines were replaced, use os.unlink() plus os.rename() to replace your original file
It's a trivial exercise for a Python adept but for someone new to the language, it'll probably take a few hours to get working. You probably couldn't ask for a better task to get introduced to the language though. Good luck!
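The temporary-file step in the list above could be sketched like this. The function name strip_pattern_in_place is hypothetical, and the sketch uses os.replace() (Python 3.3+) instead of os.unlink() plus os.rename(), which is a substitution of mine.

```python
import os
import re
import tempfile

def strip_pattern_in_place(path, regex, replacement=""):
    """Rewrite the file at path with regex matches replaced; return True if it changed."""
    changed = False
    folder = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same folder so the final rename stays on one drive.
    fd, temp_path = tempfile.mkstemp(dir=folder, suffix=".tmp")
    # newline="" keeps the file's original line endings untouched.
    with os.fdopen(fd, "w", newline="") as outfile, open(path, "r", newline="") as infile:
        for line in infile:
            new_line = regex.sub(replacement, line)
            changed = changed or new_line != line
            outfile.write(new_line)
    if changed:
        os.replace(temp_path, path)   # swap the filtered copy in for the original
    else:
        os.remove(temp_path)          # nothing was replaced, discard the copy
    return changed
```

A call such as strip_pattern_in_place("A.cpp", re.compile(r"//\s*\$[^$]*\$")) would remove the whole //$...$ comment. Note that mkstemp creates the file readable only by the current user, so copy the permissions across (for example with shutil.copymode) if they matter.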
----- Update -----
The files attribute returned by os.walk is a list so you'll need to iterate over it as well. Also, the files attribute will only contain the base name of the file. You'll need to use the root value in conjunction with os.path.join() to convert this to a full path name. Try doing just this:
for root, d, files in os.walk('.'):
    for base_filename in files:
        full_name = os.path.join(root, base_filename)
        if full_name.endswith('.h'):
            print full_name, 'is a header!'
        elif full_name.endswith('.cpp'):
            print full_name, 'is a C++ source file!'
If you're using Python 3, the print statements will need to be function calls but the general idea remains the same.
Try something like this:
import os
import re
# Python's re module does not allow variable-width look-behind such as (?<=// *),
# so each pattern captures the comment opener in group 1 and the substitution puts it back.
CPP_TAG_RE = re.compile(r'(// *)\$[^$]+\$')
tag_REs = {
    '.h': CPP_TAG_RE,
    '.cpp': CPP_TAG_RE,
    '.xml': re.compile(r'(<!-- *)\$[^$]+\$(?= *-->)'),
    '.txt': re.compile(r'(# *)\$[^$]+\$'),
}
def process_file(filename, regex):
    # Set up.
    tempfilename = filename + '.tmp'
    infile = open(filename, 'r')
    outfile = open(tempfilename, 'w')
    # Filter the file, keeping the comment opener (group 1) and dropping the $...$ tag.
    for line in infile:
        outfile.write(regex.sub(r'\1', line))
    # Clean up.
    infile.close()
    outfile.close()
    # Enable only one of the two following lines.
    os.rename(filename, filename + '.orig')
    #os.remove(filename)
    os.rename(tempfilename, filename)
def process_tree(starting_point=os.curdir):
    for root, d, files in os.walk(starting_point):
        for filename in files:
            # Get rid of `.lower()` in the following if case matters.
            ext = os.path.splitext(filename)[1].lower()
            if ext in tag_REs:
                process_file(os.path.join(root, filename), tag_REs[ext])
A nice thing about os.path.splitext is that it does the right thing for filenames that start with a dot (.).
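A quick way to see that behaviour for yourself (extension_of is just a throwaway helper name for this illustration):

```python
import os

def extension_of(filename):
    """Return the lower-cased extension, or '' when there is none."""
    return os.path.splitext(filename)[1].lower()

print(extension_of("main.CPP"))        # .cpp
print(extension_of(".gitignore"))      # empty string: a leading dot is not an extension
print(extension_of("archive.tar.gz"))  # .gz (only the last suffix counts)
```

So a hidden file such as .gitignore gets no extension and is never looked up in the tag dictionary.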