I just started learning Python, so this might be a very basic question, but here's where I'm stuck.
I'm trying to parse all XML files in a given folder and output CSV files with the same filenames as the original XML files. I've tested with single files and it works perfectly, but I'm having trouble doing the same for all of them in a loop, since this is meant to run perpetually as a script.
Here's my code:
import os
import xml.etree.cElementTree as Eltree
import pandas as pd

path = r'C:/python_test'
filenames = []
for filename in os.listdir(path):
    if not filename.endswith('.xml'):
        continue
    fullname = os.path.join(path, filename)
    print(fullname)
    filenames.append(fullname)

cols = ["serviceName", "startDate", "endDate"]
rows = []
for filename in filenames:
    xmlparse = Eltree.parse(filename)
    root = xmlparse.getroot()
    csvoutput = []
    for fixed in root.iter('{http://www.w3.org/2001/XMLSchema}channel'):
        channel = fixed.find("channelName").text
    for dyn in root.iter('programInformation'):
        start = dyn.find("publishedStartTime").text
        end = dyn.find("endTime").text
        rows.append({"serviceName": channel, "startDate": start, "endDate": end})

df = pd.DataFrame(rows, columns=cols)
df.to_csv(csvoutput)
This is the error I'm getting:
C:/python_test\1.xml
C:/python_test\2.xml
C:/python_test\3.xml
C:/python_test\4.xml
C:/python_test\5.xml
C:/python_test\6.xml
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 49, in <module>
File "C:\Users\ragehol\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\generic.py", line 3466, in to_csv
return DataFrameRenderer(formatter).to_csv(
File "C:\Users\ragehol\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\formats\format.py", line 1105, in to_csv
csv_formatter.save()
File "C:\Users\ragehol\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\formats\csvs.py", line 237, in save
with get_handle(
File "C:\Users\ragehol\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\common.py", line 609, in get_handle
ioargs = _get_filepath_or_buffer(
File "C:\Users\ragehol\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\common.py", line 396, in _get_filepath_or_buffer
raise ValueError(msg)
ValueError: Invalid file path or buffer object type: <class 'list'>
Any kind of suggestions would be greatly appreciated!
Many thanks!
This is your bug:
csvoutput = [] is defined as a list. Later on you pass it as an argument to df.to_csv(csvoutput), so you are passing a list to a method that expects a file path.
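A minimal sketch of the fix, assuming you want one CSV per XML file with the same base name: derive the output path from the input path, and reset rows for each file so each CSV only contains that file's data.

for filename in filenames:
    xmlparse = Eltree.parse(filename)
    root = xmlparse.getroot()
    rows = []
    # ... same parsing loops as above, appending to rows ...
    # derive the CSV path from the XML path: C:/python_test\1.xml -> C:/python_test\1.csv
    csvoutput = os.path.splitext(filename)[0] + '.csv'
    pd.DataFrame(rows, columns=cols).to_csv(csvoutput, index=False)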
I have an Excel file with some data in rows and columns, and I am trying to take the file names from each row and merge those into one PDF file (simply, each row becomes one PDF file).
Here is an example of a row as a Python list: ['1', '112238', '112239', '112240', '112337', '112338']. The first element will be the PDF name, and the other elements are the file names that are supposed to exist in a directory named Files.
Here's my attempt so far:
from pathlib import Path
import shutil
import pandas as pd
from PyPDF2 import PdfFileMerger

BASE_DIR = Path.cwd()
MAIN_DIR = BASE_DIR / 'Files'
FINAL_DIR = BASE_DIR / 'Final'

try:
    shutil.rmtree(FINAL_DIR)
except:
    pass

FINAL_DIR.mkdir(parents=True, exist_ok=True)

df = pd.read_excel('MainFile.xlsx', dtype=str)

for l in df.T.columns:
    new_list = list(df.T[l][df.T[l].notna()])
    files_list = [MAIN_DIR / f'{i}.pdf' for i in new_list[1:]]
    final_list = []
    final_list.append(new_list[0])
    for file in files_list:
        if file.exists():
            final_list.append(file)
        else:
            print(f'{file} ---> NOT Exists')
    merger = PdfFileMerger()
    for pdf in final_list[1:]:
        merger.append(pdf)
    merger.write(FINAL_DIR / f'{final_list[0]}.pdf')
    merger.close()
Here's a snapshot of the Excel file that I read the file names from, and the PDF files are in a directory named Files.
When I tried to run the script, I encountered an error like this:
Traceback (most recent call last):
File "C:\Users\Future\Desktop\demo.py", line 33, in <module>
merger.append(pdf)
File "C:\Users\Future\AppData\Local\Programs\Python\Python39\lib\site-packages\PyPDF2\merger.py", line 203, in append
self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
File "C:\Users\Future\AppData\Local\Programs\Python\Python39\lib\site-packages\PyPDF2\merger.py", line 133, in merge
pdfr = PdfFileReader(fileobj, strict=self.strict)
File "C:\Users\Future\AppData\Local\Programs\Python\Python39\lib\site-packages\PyPDF2\pdf.py", line 1084, in __init__
self.read(stream)
File "C:\Users\Future\AppData\Local\Programs\Python\Python39\lib\site-packages\PyPDF2\pdf.py", line 1689, in read
stream.seek(-1, 2)
AttributeError: 'WindowsPath' object has no attribute 'seek'
I have tried this modification, merger.append(str(Path(pdf))), and it seems to get past the first problem (I am not sure), but now I get another error:
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
Traceback (most recent call last):
File "C:\Users\Future\Desktop\demo.py", line 39, in <module>
merger.write(FINAL_DIR / f'{str(final_list[0])}.pdf')
File "C:\Users\Future\AppData\Local\Programs\Python\Python39\lib\site-packages\PyPDF2\merger.py", line 230, in write
self.output.write(fileobj)
File "C:\Users\Future\AppData\Local\Programs\Python\Python39\lib\site-packages\PyPDF2\pdf.py", line 487, in write
stream.write(self._header + b_("\n"))
AttributeError: 'WindowsPath' object has no attribute 'write'
Solved by these two modifications:
merger.append(str(Path(pdf)))
and I used os to join the path, as I failed to get it working with Path:
merger.write(os.path.join(str(FINAL_DIR), str(final_list[0] + '.pdf')))
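For what it's worth, the pathlib version should also work once the Path is converted to a string, since this version of PyPDF2 only accepts string paths or file objects:

merger.write(str(FINAL_DIR / f'{final_list[0]}.pdf'))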
I am trying to read the tables of each individual HTML file in a folder using pandas, to find out the number of tables in each file.
This works when specifying a single file, but when I try to run it over the folder it says there are no tables.
This is the code for the single file
import pandas as pd
file = r'C:\Users\Ahmed_Abdelmuniem\Desktop\XXX.html'
table = pd.read_html(file)
print ('tables found:', len(table))
This is the output
C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\python.exe C:/Users/Ahmed_Abdelmuniem/PycharmProjects/PandaHTML/main.py
tables found: 72
Process finished with exit code 0
This is the code for each file in a folder
import pandas as pd
import shutil
import os
source_dir = r'C:\Users\Ahmed_Abdelmuniem\Desktop\TMorning'
target_dir = r'C:\Users\Ahmed_Abdelmuniem\Desktop\TAfternoon'
file_names = os.listdir(source_dir)
for file_name in file_names:
    table = pd.read_html(file_name)
    print('tables found:', len(table))
This is the error log:
C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\python.exe "C:/Users/Ahmed_Abdelmuniem/PycharmProjects/File mover V2.0/main.py"
Traceback (most recent call last):
File "C:\Users\Ahmed_Abdelmuniem\PycharmProjects\File mover V2.0\main.py", line 12, in <module>
table = pd.read_html(file_name)
File "C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\util\_decorators.py", line 299, in wrapper
return func(*args, **kwargs)
File "C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\html.py", line 1085, in read_html
return _parse(
File "C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\html.py", line 913, in _parse
raise retained
File "C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\html.py", line 893, in _parse
tables = p.parse_tables()
File "C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\html.py", line 213, in parse_tables
tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
File "C:\Users\Ahmed_Abdelmuniem\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\html.py", line 543, in _parse_tables
raise ValueError("No tables found")
ValueError: No tables found
Process finished with exit code 1
os.listdir returns a list containing the bare names of the entries in the directory, including subdirectories and any other files, so pd.read_html looks for them relative to the current working directory rather than source_dir. If you want to keep only HTML files, prefer glob.glob:
import glob
file_names = glob.glob(os.path.join(source_dir, '*.html'))
Edit: if you want to use os.listdir, you have to get the actual path to the file:
for file_name in file_names:
    table = pd.read_html(os.path.join(source_dir, file_name))
    print('tables found:', len(table))
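Putting the two together, a minimal sketch of the corrected loop, reusing the paths from the question:

import glob
import os
import pandas as pd

source_dir = r'C:\Users\Ahmed_Abdelmuniem\Desktop\TMorning'

# glob.glob returns full paths, so read_html can open each file directly
for file_path in glob.glob(os.path.join(source_dir, '*.html')):
    table = pd.read_html(file_path)
    print('tables found:', len(table))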
I am having issues getting the basic functions of the nptdms module working.
First, I am just trying to open a TDMS file and print the contents of specific channels within specific groups.
I am using Python 2.7 and the nptdms quick start here.
Following this, I will be writing these specific pieces of data into a new TDMS file. Then my ultimate goal is to be able to take a set of source files, open each, and write (append) to a new file. The source data files contain far more information than is needed, so I am breaking out the specifics into their own file.
The problem I have is that I cannot get past a basic error.
When running this code, I get:
Traceback (most recent call last):
File "PullTDMSdataIntoNewFile.py", line 27, in <module>
tdms_file = TdmsFile(r"C:\\Users\daniel.worts\Desktop\this_is_my_tdms_file.tdms","r")
File "C:\Anaconda2\lib\site-packages\nptdms\tdms.py", line 94, in __init__
self._read_segments(f)
File "C:\Anaconda2\lib\site-packages\nptdms\tdms.py", line 119, in _read_segments
object._initialise_data(memmap_dir=self.memmap_dir)
File "C:\Anaconda2\lib\site-packages\nptdms\tdms.py", line 709, in _initialise_data
mode='w+b', prefix="nptdms_", dir=memmap_dir)
File "C:\Anaconda2\lib\tempfile.py", line 475, in NamedTemporaryFile
(fd, name) = _mkstemp_inner(dir, prefix, suffix, flags)
File "C:\Anaconda2\lib\tempfile.py", line 244, in _mkstemp_inner
fd = _os.open(file, flags, 0600)
OSError: [Errno 2] No such file or directory: 'r\\nptdms_yjfyam'
Here is my code:
from nptdms import TdmsFile
import numpy as np
import pandas as pd
#set Tdms file path
tdms_file = TdmsFile(r"C:\\Users\daniel.worts\Desktop\this_is_my_tdms_file.tdms","r")
# set variable for TDMS groups
group_nameone = '101'
group_nametwo = '752'
# set objects for TDMS channels
channel_dataone = tdms_file.object(group_nameone, 'Payload_1')
channel_datatwo = tdms_file.object(group_nametwo, 'Payload_2')
# set data from channels
data_dataone = channel_dataone.data
data_datatwo = channel_datatwo.data
print data_dataone
print data_datatwo
Big thanks to anyone who may have encountered this before and can help point to what I am missing.
Best,
- Dan
Edit:
Solved the read issue by removing the "r" argument from the TdmsFile call.
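Judging by the traceback, the second positional argument of TdmsFile is memmap_dir, so the "r" was being treated as a directory for temporary memmap files; there is no mode argument. The corrected call is simply:

tdms_file = TdmsFile(r"C:\\Users\daniel.worts\Desktop\this_is_my_tdms_file.tdms")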
Now I am having another error I can't trace when trying to write.
from nptdms import TdmsFile, TdmsWriter, RootObject, GroupObject, ChannelObject
import numpy as np
import pandas as pd
newfilepath = r"C:\\Users\daniel.worts\Desktop\Mined.tdms"
datetimegroup101_channel_object = ChannelObject('101', DateTime, data_datetimegroup101)
with TdmsWriter(newfilepath) as tdms_writer:
    tdms_writer.write_segment([datetimegroup101_channel_object])
Returns error:
Traceback (most recent call last):
File "PullTDMSdataIntoNewFile.py", line 82, in <module>
tdms_writer.write_segment([datetimegroup101_channel_object])
File "C:\Anaconda2\lib\site-packages\nptdms\writer.py", line 68, in write_segment
segment = TdmsSegment(objects)
File "C:\Anaconda2\lib\site-packages\nptdms\writer.py", line 88, in __init__
paths = set(obj.path for obj in objects)
File "C:\Anaconda2\lib\site-packages\nptdms\writer.py", line 88, in <genexpr>
paths = set(obj.path for obj in objects)
File "C:\Anaconda2\lib\site-packages\nptdms\writer.py", line 254, in path
self.channel.replace("'", "''"))
AttributeError: 'TdmsObject' object has no attribute 'replace'
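The traceback shows the writer calling self.channel.replace("'", "''"), which only works if the channel name is a plain string; in the snippet above, DateTime is passed as a bare name. A guess at the fix (assuming the channel is meant to be called 'DateTime'):

datetimegroup101_channel_object = ChannelObject('101', 'DateTime', data_datetimegroup101)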
I'm dealing with a memory issue when searching a folder with millions of files. Does anyone know how to overcome this? Is there some way to put a limit on how many files glob will search, so it could be executed in chunks?
Traceback (most recent call last):
File "./lb2_lmanager", line 533, in <module>
main(sys.argv[1:])
File "./lb2_lmanager", line 318, in main
matched = match_files(policy.directory, policy.file_patterns)
File "./lb2_lmanager", line 32, in wrapper
res = func(*args, **kwargs)
File "./lb2_lmanager", line 380, in match_files
listing = glob.glob(directory)
File "/usr/lib/python2.6/glob.py", line 16, in glob
return list(iglob(pathname))
File "/usr/lib/python2.6/glob.py", line 43, in iglob
yield os.path.join(dirname, name)
File "/usr/lib/python2.6/posixpath.py", line 70, in join
path += '/' + b
MemoryError
Try using generators instead of lists.
To understand what generators are, read this.
import glob

dir_list = glob.iglob(YOUR_DIRECTORY)
for file in dir_list:
    print file
Change YOUR_DIRECTORY to the directory that you would like to list.
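To address the chunking part of the question: since glob.iglob returns a lazy iterator, you can pull a bounded number of paths at a time with itertools.islice. A sketch, where process_batch is a hypothetical stand-in for your per-chunk logic:

import glob
from itertools import islice

def in_chunks(iterator, size):
    # yield lists of at most `size` items until the iterator is exhausted
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            break
        yield batch

for batch in in_chunks(glob.iglob(YOUR_DIRECTORY), 10000):
    process_batch(batch)  # hypothetical per-chunk handler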
In conjunction with my last question, I'm now onto printing the filenames with their sizes next to them in a sort of list. Basically, I am reading filenames from one file (which are added by the user), joining each filename onto the working directory path, and printing its size one by one. However, I'm having an issue with the following block:
print("\n--- Stats ---\n")
with open('userdata/addedfiles', 'r') as read_files:
file_lines = read_files.readlines()
# get path for each file and find in trackedfiles
# use path to get size
print(len(file_lines), "files\n")
for file_name in file_lines:
# the actual files should be in the same working directory
cwd = os.getcwd()
fpath = os.path.join(cwd, file_name)
fsize = os.path.getsize(fpath)
print(file_name.strip(), "-- size:", fsize)
which is returning this error:
tolbiac wpm-public → ./main.py --filestatus
--- Stats ---
1 files
Traceback (most recent call last):
File "./main.py", line 332, in <module>
main()
File "./main.py", line 323, in main
parseargs()
File "./main.py", line 317, in parseargs
tracking()
File "./main.py", line 204, in tracking
fsize = os.path.getsize(fpath)
File "/usr/lib/python3.4/genericpath.py", line 50, in getsize
return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: '/home/tolbiac/code/wpm-public/file.txt\n'
tolbiac wpm-public →
So it looks like something is adding a \n to the end of file_name. I'm not sure if that's something done by os.path.getsize; I tried it with os.stat, but it did the same thing.
Any suggestions? Thanks.
When you're reading in a file, you need to be aware of how the data is separated. In this case, the read-in file has a filename once per line, separated by that \n character, so you need to strip it off before you use it:
for file_name in file_lines:
    file_name = file_name.strip()
    # rest of for loop
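Applied to the block from the question, that looks like:

for file_name in file_lines:
    file_name = file_name.strip()  # drop the trailing newline
    fpath = os.path.join(os.getcwd(), file_name)
    fsize = os.path.getsize(fpath)
    print(file_name, "-- size:", fsize)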