XMLCorpusReader is not creating a corpus

I am learning natural language processing with Python's NLTK. I want to create a corpus from an XML file I have in my directory, so I used the following code.
>> from nltk.corpus import XMLCorpusReader
>> corpus_root = "/Desktop/my_dir/corpus/"
>> wiki = XMLCorpusReader(corpus_root ,'output.xml')
>> wiki.fileids()
>>
This code block is supposed to output the file ID 'output.xml', but it doesn't return anything and the cursor just moves to the next ">>" prompt.
My output.xml is in the exact directory specified in corpus_root.
I have full permission to read and write the file 'output.xml'.
I have NLTK and all its data installed, and all the required paths are set.
What should I do to make it work?

Let's walk through your code:
from nltk.corpus import XMLCorpusReader
corpus_root = "/Desktop/my_dir/corpus/"
I'm a bit skeptical of this path name (see this answer: https://stackoverflow.com/a/6617625/583834). A Desktop folder normally lives inside your home directory, so the path should probably be something like /home/my_username/Desktop/my_dir/corpus on Linux or /Users/my_username/Desktop/my_dir/corpus on macOS. Make sure that your path is correct by opening a terminal window, navigating to your directory and executing pwd to get your absolute path. Then copy it above.
wiki = XMLCorpusReader(corpus_root ,'output.xml')
XMLCorpusReader reads a directory as well as a list of filenames already existing in that directory. The second argument here is your input file name, not your output name. (Note the third "how to do it" section here for a sample call of the related WordListCorpusReader: reader = WordListCorpusReader('.', ['wordlist']))
wiki.fileids()
It's likely that you're not getting anything from this last line because the previous two lines are not used correctly.
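Before handing the path to XMLCorpusReader, it can help to confirm from Python that the directory and the file are really where you think they are. The following is only a minimal sketch using the standard library; describe_corpus_root is a hypothetical helper name, and the temporary directory merely stands in for your real corpus folder.

```python
import os
import tempfile


def describe_corpus_root(corpus_root, filename):
    """Return (absolute_root, root_exists, file_exists) for a corpus location."""
    # Expand "~" and make the path absolute so a wrong root is easy to spot.
    absolute_root = os.path.abspath(os.path.expanduser(corpus_root))
    root_exists = os.path.isdir(absolute_root)
    file_exists = os.path.isfile(os.path.join(absolute_root, filename))
    return absolute_root, root_exists, file_exists


# Demonstration: a throwaway directory plays the role of the corpus folder.
with tempfile.TemporaryDirectory() as demo_root:
    with open(os.path.join(demo_root, "output.xml"), "w") as handle:
        handle.write("<doc/>")
    found = describe_corpus_root(demo_root, "output.xml")
    missing = describe_corpus_root(os.path.join(demo_root, "no_such_dir"), "output.xml")
```

If root_exists or file_exists comes back False for your own corpus_root, the reader has nothing to list, which would explain the empty result.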

Related

Can create and save Word files in Python, but cannot print them

I have made a Client Report Generator - by entering info into the various fields, the app will generate a Word file, save it, print it, and then clear the entry fields to be used for the next report. I have gotten everything working, except for the printing. I have tried multiple solutions, and the one that seems to have the most promise is included in my code. If anyone can figure out what is going on, I would greatly appreciate it. File locations are not final, but I will set those up once ready to package the app. The problem is that the program halts with no error message, so I don't even know what to look for. If I use the alternate syntax for the current print module (arguments added as separate strings), I get a "File Not Found" error, and that's it. I am open to a completely different approach to printing, if that has a better chance of working.
Code follows:
#Save and Print
def SavePrint ():
    # Create Word Document
    CliNameVar = CliName.get('1.0', 'end-1c')
    AppDateVar=AppDate.get('1.0', 'end-1c')
    RepBodyVar=RepBody.get('1.0', 'end-1c')
    from docx import Document
    from docx.shared import Pt
    from docx.shared import Inches
    CliDoc = Document ()
    body_style = CliDoc.styles['Body Text']
    body = CliDoc.add_paragraph(style=body_style).add_run(f'{CliNameVar} - {AppDateVar}')
    body.font.name = 'Arial'
    body.font.size = Pt(12)
    body = CliDoc.add_paragraph
    body = CliDoc.add_paragraph(style=body_style).add_run(f'{RepBodyVar}')
    CliDoc.add_picture('D:/Documents/DavidSignatureBlue.png', width=Inches(2.5))
    #Save
    FileLoc=Path("C:/Claire's Documents/AAAAFidler/Clients/%s" % CliNameVar)
    FileName="C:/Claire's Documents/AAAAFidler/Clients/%s/%s - %s.docx" %(CliNameVar, CliNameVar, AppDateVar)
    FileLoc.mkdir(parents=True)
    CliDoc.save(FileName)
    #Print
    import subprocess
    subprocess.Popen("'C:\Program Files (x86)\Microsoft Office\root\Office16\winword.exe' '%s' /mFilePrintDefault /mFileExit", shell=True).communicate() %FileName
UPDATE:
So, after wrestling with the fact that I can't seem to pass variables to the subprocess module, I tried using a TEMP folder, and hard-code the print filename. I got a bit further, in that Word opens (though I would prefer it happen in the background), but it still complains that the file name is not valid - even though I can see that the file exists.
#Print
import time
import shutil
import subprocess
tmpdir = "C:/Claire's Documents/TMPFILE"
os.mkdir(tmpdir)
tmpfil = "C:/Claire's Documents/TMPFILE/prntfile.docx"
shutil.copy(FileName, tmpfil)
time.sleep(3)
subprocess.Popen(["C:/Program Files (x86)/Microsoft Office/root/Office16/winword.exe", "C:/Claire's Documents/TMPFILE/prntfile.docx", "/mFilePrintDefault", "/mFileExit"]).communicate()
time.sleep(5)
shutil.rmtree(tmpdir)
SECOND UPDATE:
Turns out, the subprocess module does not like spaces, even inside double quotes. Removed the space between "Claire's Documents", and now it prints - except it will not close again, per the "/mFileExit". Going to look at closing it as a separate instruction.

So, as mentioned in my update, I found the problem to be twofold: The subprocess module does not like to have variables passed to it, and it doesn't like spaces in the path to the file it will be working on.
As such, the solution was to make a temporary file that always has the same name, and making sure that there were no spaces in the path to the temp file for the subprocess to get hung up on. I didn't even need the temp folder - just dropped the print file in the parent directory, and deleted it afterwards. It also turns out that the fact that the Word document opens in the foreground, and doesn't close is actually a feature, not a bug, so no further wrestling with this problem.
Code follows:
#FileName variable set in Save section
#Print
import os
import time
import shutil
import subprocess
tmpfil = "C:/Claire'sDocuments/prntfile.docx"
shutil.copy(FileName, tmpfil)
time.sleep(3)
subprocess.Popen(["C:/Program Files (x86)/Microsoft Office/Office12/winword.exe", "C:/Claire'sDocuments/prntfile.docx", "/mFilePrintDefault"]).communicate()
os.remove(tmpfil)
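For what it's worth, subprocess generally does accept variables and paths containing spaces, provided the command is passed as a list with one element per argument. The first attempt in the question likely failed for other reasons: the % FileName was applied to the result of communicate() rather than to the command string, and single quotes are not treated as quoting characters on the Windows command line. Below is a minimal sketch of the list form; run_with_file is a hypothetical helper, and a child Python process stands in for winword.exe so the example can run anywhere.

```python
import os
import subprocess
import sys
import tempfile


def run_with_file(program_args, file_path):
    """Run a program with file_path appended as a single argument."""
    # Each list element is passed as exactly one argument, spaces included,
    # so no manual quoting or string formatting is needed.
    completed = subprocess.run(
        list(program_args) + [file_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


with tempfile.TemporaryDirectory() as base:
    spaced_dir = os.path.join(base, "Claire's Documents")
    os.mkdir(spaced_dir)
    file_name = os.path.join(spaced_dir, "client report.docx")
    with open(file_name, "w") as handle:
        handle.write("placeholder")
    # Stand-in for winword.exe: a child process that echoes its first argument.
    echo_args = [sys.executable, "-c", "import sys; print(sys.argv[1])"]
    echoed = run_with_file(echo_args, file_name)
```

With Word, the same pattern would put the path to winword.exe first, then the FileName variable, then the /m switches, each as its own list element. That the switches behave identically in this form is an assumption I have not tested here.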

How to deal in python with the xml files preceded by the TIME prefix in successive runs of SUMO

I want to get the results of successive SUMO runs in CSV format directly from a Python script (not by using the xml2csv tool from the command line). Because the TIME prefix is added in front of the XML file name, I don't know how to handle this part of the code.
Here we want each run to write its results to a separate file, distinguished by the time:
sumoCmd = [sumoBinary, "-c", "test4.sumocfg", "--tripinfo-output", "tripinfo.xml", "--output-prefix", " TIME"]
And here is where I must put the proper XML file name, which is what my question is about:
tree = ET.parse("myfile.xml")
Any help would be appreciated.
Best, Ali

You can just find the file using glob e.g.:
import glob
import xml.etree.ElementTree as ET
tripinfos = glob.glob("*tripinfo.xml")
To get the latest you can use sorted:
latest = sorted(tripinfos)[-1]
tree = ET.parse(latest)
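Since the goal was CSV output, here is a rough sketch that combines the glob lookup with a conversion step. It assumes the tripinfo file has a root element containing tripinfo children whose data are stored as XML attributes; the helper names latest_tripinfo and tripinfo_to_csv, the sample file name and the sample attribute values are all made up for illustration. It picks the newest file by modification time, which is an alternative to sorting by name.

```python
import csv
import glob
import os
import tempfile
import xml.etree.ElementTree as ET


def latest_tripinfo(directory):
    """Return the most recently modified '*tripinfo.xml' in directory, or None."""
    candidates = glob.glob(os.path.join(directory, "*tripinfo.xml"))
    return max(candidates, key=os.path.getmtime) if candidates else None


def tripinfo_to_csv(xml_path, csv_path):
    """Write one CSV row per <tripinfo> element, using its attributes as columns."""
    root = ET.parse(xml_path).getroot()
    rows = [dict(trip.attrib) for trip in root.iter("tripinfo")]
    columns = sorted({name for row in rows for name in row})
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Demonstration with a made-up, time-prefixed file in a temporary directory.
with tempfile.TemporaryDirectory() as run_dir:
    sample = os.path.join(run_dir, "2024-01-01-12-00-00tripinfo.xml")
    with open(sample, "w") as handle:
        handle.write('<tripinfos><tripinfo id="veh0" duration="12.00"/>'
                     '<tripinfo id="veh1" duration="15.00"/></tripinfos>')
    newest = latest_tripinfo(run_dir)
    csv_path = os.path.join(run_dir, "tripinfo.csv")
    row_count = tripinfo_to_csv(newest, csv_path)
    with open(csv_path) as handle:
        csv_lines = handle.read().splitlines()
```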

Using variable in file path in Python

I've found similar questions to this but can't find an exact answer and I'm having real difficulty getting this to work, so any help would be hugely appreciated.
I need to find a XML file in a folder structure that changes every time I run some automated tests.
This piece of code finds the file absolutely fine:
import xml.etree.ElementTree as ET
import glob
report = glob.glob('./Reports/Firefox/**/*.xml', recursive=True)
print(report)
I get a path returned. I then want to use that path, stored in the variable "report", to look for text within the XML file.
The following code finds the text fine IF the Python file is in the same directory as the XML file. However, I need the Python file to reside in the parent directory and to pass the "report" variable into the first line of the code below.
tree = ET.parse("JUnit_Report.xml")
root = tree.getroot()
for testcase in root.iter('testcase'):
    testname = testcase.get('name')
    teststatus = testcase.get('status')
    print(testname, teststatus)
I'm a real beginner at Python, is this even possible?

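glob.glob returns a list of matching paths, not a single string, so pick the entry you want (here, the first one) and build the absolute path to your report file:
import os
report = glob.glob('./Reports/Firefox/**/*.xml', recursive=True)
abs_path_to_report = os.path.abspath(report[0])
Pass that variable to whatever you want:
tree = ET.parse(abs_path_to_report)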
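If more than one report can match the pattern, a loop over the matches may be safer than picking one. This is only a sketch: read_test_statuses is a hypothetical helper, and the folder layout and report contents in the demonstration are invented to mirror the question.

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET


def read_test_statuses(pattern):
    """Return (report_path, test_name, status) for each testcase in each matching report."""
    results = []
    # glob.glob returns a list of paths, so each match is parsed in turn.
    for report_path in sorted(glob.glob(pattern, recursive=True)):
        root = ET.parse(report_path).getroot()
        for testcase in root.iter("testcase"):
            results.append(
                (os.path.abspath(report_path), testcase.get("name"), testcase.get("status"))
            )
    return results


# Demonstration: an invented report nested one level below "Firefox".
with tempfile.TemporaryDirectory() as reports_root:
    run_dir = os.path.join(reports_root, "Firefox", "run_001")
    os.makedirs(run_dir)
    with open(os.path.join(run_dir, "JUnit_Report.xml"), "w") as handle:
        handle.write('<testsuite><testcase name="login" status="PASSED"/></testsuite>')
    statuses = read_test_statuses(os.path.join(reports_root, "Firefox", "**", "*.xml"))
```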

File not found from Python although file exists

I'm trying to load a simple text file with an array of numbers into Python. A MWE is
import numpy as np
BASE_FOLDER = 'C:\\path\\'
BASE_NAME = 'DATA.txt'
fname = BASE_FOLDER + BASE_NAME
data = np.loadtxt(fname)
However, this gives an error while running:
OSError: C:\path\DATA.txt not found.
I'm using VSCode, so in the debug window the link to the path is clickable. And, of course, if I click it the file opens normally, so this tells me that the path is correct.
Also, if I do print(fname), VSCode also gives me a valid path.
Is there anything I'm missing?
EDIT
As per your (very helpful for future reference) comments, I've changed my code using the os module and raw strings:
BASE_FOLDER = r'C:\path_to_folder'
BASE_NAME = r'filename_DATA.txt'
fname = os.path.join(BASE_FOLDER, BASE_NAME)
Still results in error.
Second EDIT
I've tried again with another file. Very basic path and filename
BASE_FOLDER = r'Z:\Data\Enzo\Waste_Code'
BASE_NAME = r'run3b.txt'
And again, I get the same error.
If I try an alternative approach,
os.chdir(BASE_FOLDER)
a = os.listdir()
then select the right file,
fname = a[1]
I still get the error when trying to import it. Even though I'm retrieving it directly from listdir.
>> os.path.isfile(a[1])
False

Using the module os you can check the existence of the file within python by running
import os
os.path.isfile(fname)
If it returns False, no file exists at the path stored in fname. If it returns True, np.loadtxt() should be able to read it.
Extra: good practice working with files and paths
When working with files it is advisable to use the functionality built into the standard library, specifically the os module. os.path.join() will take care of joining path components no matter which operating system you are using.
fname = os.path.join(BASE_FOLDER, BASE_NAME)
In addition, it is advisable to use raw strings by adding an r to the beginning of the string. This makes writing paths less tedious, as it allows you to copy-paste from the navigation bar. It will be something like BASE_FOLDER = r'C:\path'. Note that you don't need to add a trailing '\', as os.path.join takes care of it.
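When isfile keeps returning False for a name you can see on disk, it may help to look at exactly what Python sees. The sketch below is one possible diagnostic, not a definitive fix; diagnose_path is a hypothetical helper, and the temporary folder stands in for BASE_FOLDER.

```python
import os
import tempfile


def diagnose_path(folder, name):
    """Collect facts that help explain a 'file not found' error."""
    full_path = os.path.join(folder, name)
    listing = os.listdir(folder) if os.path.isdir(folder) else []
    return {
        "full_path": full_path,
        "folder_exists": os.path.isdir(folder),
        "is_file": os.path.isfile(full_path),
        # repr() exposes trailing spaces or odd characters in directory entries.
        "similar_entries": [
            repr(entry) for entry in listing
            if entry.strip().lower() == name.strip().lower()
        ],
    }


# Demonstration: a temporary folder containing a single DATA.txt file.
with tempfile.TemporaryDirectory() as base_folder:
    with open(os.path.join(base_folder, "DATA.txt"), "w") as handle:
        handle.write("1 2 3\n")
    report = diagnose_path(base_folder, "DATA.txt")
    missing = diagnose_path(base_folder, "missing.txt")
```

If similar_entries lists a name that differs from the one you typed (different case, an extra space, a hidden second extension), that difference is a likely culprit.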

You may not have the full permission to read the downloaded file. Use
sudo chmod -R a+rwx file_name.txt
in the command prompt to give yourself permission to read if you are using Ubuntu.

For me the problem was that I was using the Linux home symbol in the path (~/path/file). Replacing it with the absolute path /home/user/etc_path/file worked like a charm.
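The ~ shorthand is expanded by the shell, not by Python's file functions, so a path that still contains it can be expanded explicitly. A small sketch follows; resolve_user_path is a hypothetical name, and ~/path/file is the placeholder path from the answer above.

```python
import os


def resolve_user_path(path):
    """Expand a leading '~' and return an absolute path."""
    return os.path.abspath(os.path.expanduser(path))


resolved = resolve_user_path("~/path/file")
```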

Iterating through subdirectories to add unique strings to each file

My goal: To build a program that:
Opens a folder (provided by the user) from the user's computer
Iterates through that folder, opening each document in each subdirectory (named according to language codes; "AR," "EN," "ES," etc.)
Substitutes a string in for another string in each document. Crucially, the new string will change with each document (though the old string will not), according to the language code in the folder name.
My level of experience: Minimal; been learning python for a few months but this is the first program I'm building that's not paint-by-numbers. I'm building it to make a process at work faster. I'm sure I'm not building this as efficiently as possible; I've been throwing it together from my own knowledge and from reading stackexchange religiously while building it.
Research I've done on my own: I've been living in stackexchange the past few days, but I haven't found anyone doing quite what I'm doing (which was very surprising to me). I'm not sure if this is just because I lack the vocabulary to search (tried out a lot of search terms, but none of them totally match what I'm doing) or if this is just the wrong way of going about things.
The issue I'm running into:
I'm getting this error:
Traceback (most recent call last):
File "test5.py", line 52, in <module>
for f in os.listdir(src_dir):
OSError: [Errno 20] Not a directory: 'ExploringEduTubingEN(1).txt'
I'm not sure how to iterate through every file in the subdirectories and update a string within each file (not the file names) with a new and unique string. I thought I had it, but this error has totally thrown me off. Prior to this, I was getting an error for the same line that said "Not a file or directory: 'ExploringEduTubingEN(1).txt'" and it's surprising to me that the first error could request a file or a directory, and once I fixed that, it asked for just a directory; seems like it should've just asked for a directory at the beginning.
With no further ado, the code (placing at bottom because it's long to include context):
import os
ex=raw_input("Please provide an example PDF that we'll append a language code to. ")
#Asking for a PDF to which we'll iteratively append the language codes from below.
lst = ['_ar.pdf', '_cs.pdf', '_de.pdf', '_el.pdf', '_en_gb.pdf', '_es.pdf', '_es_419.pdf',
'_fr.pdf', '_id.pdf', '_it.pdf', '_ja.pdf', '_ko.pdf', '_nl.pdf', '_pl.pdf', '_pt_br.pdf', '_pt_pt.pdf', '_ro.pdf', '_ru.pdf',
'_sv.pdf', '_th.pdf', '_tr.pdf', '_vi.pdf', '_zh_tw.pdf', '_vn.pdf', '_zh_cn.pdf']
#list of language code PDF appending strings.
pdf_list=open('pdflist.txt','w+')
#creating a document to put this group of PDF filepaths in.
pdf2='pdflist.txt'
#making this an actual variable.
for word in lst:
    pdf_list.write(ex + word + "\n")
#creating a version of the PDF example for every item in the language list, and then appending the language codes.
pdf_list.seek(0)
langlist=pdf_list.readlines()
#creating a list of the PDF paths so that I can use it below.
for i in langlist:
    i=i.rstrip("\n")
#removing the line breaks.
pdf_list.close()
#closing the file after removing the line breaks.
file1=raw_input("Please provide the full filepath of the folder you'd like to convert. ")
#the folder provided by the user to iterate through.
folder1=os.listdir(file1)
#creating a list of the files within the folder
pdfpath1="example.pdf"
langfile="example2.pdf"
#setting variables for below
#my thought here is that i'd need to make the variable the initial folder, then make it a list, then iterate through the list.
for ogfile in folder1:
    #want to iterate through all the files in the directory, including in subdirectories
    src_dir=ogfile.split("/",6)
    src_dir="/".join(src_dir[:6])
    #goal here is to cut off the language code folder name and then join it again, w/o language code.
    for f in os.listdir(src_dir):
        f = os.path.join(src_dir, f)
        #i admit this got a little convoluted–i'm trying to make sure the files put the right code in, I.E. that the document from the folder ending in "AR" gets the PDF that will now end in "AR"
        #the perils of pulling from lots of different questions in stackexchange
        with open(ogfile, 'r+') as f:
            content = f.read()
            f.seek(0)
            f.truncate()
            for langfile in langlist:
                f.write(content.replace(pdfpath1, langfile))
#replacing the placeholder PDF link with the created PDF links from the beginning of the code
If you read this far, thanks. I've tried to provide as much information as possible, especially about my thought process. I'll keep trying things and reading, but I'd love to have more eyes on it.

You have to specify the full path to your directories/files. Use os.path.join to create a valid, platform-independent path to your file or directory.
For replacing your string, simply modify your example string using the subfolder name. Assuming that ex has the format filename.pdf, you could use: newstring = ex[:-4] + '_' + str.lower(subfolder) + '.pdf'. That way, you neither have to specify the list of replacement strings nor loop through it.
Solution
To iterate over your directory and replace the content of your files as you'd like, you can do the following:
import os
# Get the name of the file: "example.pdf" (note the .pdf is assumed here)
ex=raw_input("Please provide an example PDF that we'll append a language code to. ")
# Get the folder to go through
folderpath=raw_input("Please provide the full filepath of the folder you'd like to convert. ")
# Get all subfolders and go through them (named: 'AR', 'DE', etc.)
subfolders=os.listdir(folderpath)
for subfolder in subfolders:
    # Get the full path to the subfolder
    fullsubfolder = os.path.join(folderpath,subfolder)
    # If it is a directory, go through it
    if os.path.isdir(fullsubfolder):
        # Find all files in subdirectory and go through each of them
        files = os.listdir(fullsubfolder)
        for filename in files:
            # Get full path to the file
            fullfile = os.path.join(fullsubfolder, filename)
            # If it is a file, process it (note: we do not check if it is a text file here)
            if os.path.isfile(fullfile):
                with open(fullfile, 'r+') as f:
                    content = f.read()
                    f.seek(0)
                    f.truncate()
                    # Create the replacing string based on the subdirectory name. Ex: 'example_ar.pdf'
                    newstring = ex[:-4] + '_' + str.lower(subfolder) + '.pdf'
                    f.write(content.replace(ex, newstring))
Note
Instead of asking the user to type the folder path, you could let them choose the directory with a dialog box. See this question for more info: Use GUI to open directory in Python 3
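For anyone on Python 3, the same idea could be wrapped in a function, roughly as sketched below. This is a variation rather than the code above: localize_links is a hypothetical name, the files are assumed to be UTF-8 text, and only files that actually contain the placeholder are rewritten.

```python
import os
import tempfile


def localize_links(folder_path, placeholder):
    """Replace placeholder with a per-language name in each subfolder's files.

    Returns the list of files that were changed.
    """
    stem, extension = os.path.splitext(placeholder)
    changed = []
    for subfolder in sorted(os.listdir(folder_path)):
        full_subfolder = os.path.join(folder_path, subfolder)
        if not os.path.isdir(full_subfolder):
            continue  # skip stray files next to the language folders
        # Build e.g. 'example_ar.pdf' from 'example.pdf' and folder 'AR'.
        replacement = f"{stem}_{subfolder.lower()}{extension}"
        for filename in sorted(os.listdir(full_subfolder)):
            full_file = os.path.join(full_subfolder, filename)
            if not os.path.isfile(full_file):
                continue
            with open(full_file, encoding="utf-8") as handle:
                content = handle.read()
            if placeholder in content:
                with open(full_file, "w", encoding="utf-8") as handle:
                    handle.write(content.replace(placeholder, replacement))
                changed.append(full_file)
    return changed


# Demonstration with two invented language folders.
with tempfile.TemporaryDirectory() as root:
    for code in ("AR", "EN"):
        os.mkdir(os.path.join(root, code))
        with open(os.path.join(root, code, "doc.txt"), "w", encoding="utf-8") as handle:
            handle.write("See example.pdf for details.")
    changed_files = localize_links(root, "example.pdf")
    with open(os.path.join(root, "AR", "doc.txt"), encoding="utf-8") as handle:
        ar_text = handle.read()
    with open(os.path.join(root, "EN", "doc.txt"), encoding="utf-8") as handle:
        en_text = handle.read()
```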
