Load specific PyYAML documents from file - python

I have a .yml file, and I'm trying to load certain documents from it. I know that:
print yaml.load(open('doc_to_open.yml', 'r+'))
will open the first (or only) document in a .yml file, and that:
for x in yaml.load_all(open('doc_to_open.yml', 'r+')):
    print x
will print all the YAML documents in the file. But say I just want to open the first three documents in the file, or want to open the 8th document in the file. How would I do that?

If you don't want to parse the first seven YAML documents at all, e.g. for efficiency reasons, you will have to search for the 8th document yourself.
There is the possibility to hook into the first stage of the parser and count the DocumentStartToken instances within the stream, only start passing on the tokens after the 8th and stop doing so at the 9th, but doing that is far from trivial. And that would still tokenize, at the least, all of the preceding documents.
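To illustrate just the counting part of that idea, here is a sketch of mine (not from the original answer), assuming plain PyYAML, whose yaml.scan() exposes the token stream:

import yaml  # PyYAML

def count_document_starts(stream):
    # scan() tokenizes the whole stream; a DocumentStartToken is emitted
    # for each explicit '---' document start marker
    return sum(isinstance(token, yaml.tokens.DocumentStartToken)
               for token in yaml.scan(stream))

with open('input.yaml') as fp:
    print(count_document_starts(fp))

Selectively forwarding the tokens between the 8th and 9th document start to the rest of the parser chain is the part that is far from trivial.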
The completely inefficient way, for which an efficient replacement would IMO need to behave the same, is to use .load_all() and select the appropriate document after completely tokenizing/parsing/composing/resolving all of the documents ¹:
import sys
import ruamel.yaml

yaml = ruamel.yaml.YAML()
for idx, data in enumerate(yaml.load_all(open('input.yaml'))):
    if idx == 7:
        yaml.dump(data, sys.stdout)
If you run the above on a document input.yaml:
---
document: 0
---
document: 1
---
document: 2
---
document: 3
---
document: 4
---
document: 5
---
document: 6
---
document: 7 # < the 8th document
---
document: 8
---
document: 9
...
you get the output:
document: 7 # < the 8th document
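A middle ground, as a sketch of mine rather than part of the original answer: since .load_all() is a generator, you can at least stop parsing once the wanted document has been produced, so the following documents are never processed (the preceding ones still are):

import itertools
import sys
import ruamel.yaml

yaml = ruamel.yaml.YAML()
with open('input.yaml') as fp:
    # islice stops the lazy load_all() generator right after the
    # 8th document (index 7), so documents 9 and 10 are never parsed
    data = next(itertools.islice(yaml.load_all(fp), 7, 8))
yaml.dump(data, sys.stdout)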
You unfortunately cannot naively just count the number of document markers (---), as the document doesn't have to start with one:
document: 0
---
document: 1
.
.
nor does it have to have the marker on the first line if the file starts with a directive ²:
%YAML 1.2
---
document: 0
---
document: 1
.
.
or starts with a "document" consisting of comments only:
# the 8th document is the interesting one
---
document: 0
---
document: 1
.
.
To account for all that you can use:
def get_nth_yaml_doc(stream, doc_nr):
    doc_idx = 0
    data = []
    for line in stream:
        if line == u'---\n' or line.startswith('--- '):
            doc_idx += 1
            continue
        if line == '...\n':
            break
        if doc_nr < doc_idx:
            break
        if line.startswith(u'%'):
            continue
        if doc_idx == 0:  # for YAML files that don't start with an initial '---'
            if line.lstrip().startswith('#'):
                continue
            doc_idx = 1
        if doc_idx == doc_nr:
            data.append(line)
    return yaml.load(''.join(data))

with open("input.yaml") as fp:
    data = get_nth_yaml_doc(fp, 8)
yaml.dump(data, sys.stdout)
and get:
document: 7 # < the 8th document
in all of the above cases, efficiently, without even tokenizing the preceding YAML documents (nor the following).
There is an additional caveat in that the YAML file could start with a byte-order-marker, and that the individual documents within a stream can start with these markers. The above routine doesn't handle that.
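A minimal, untested sketch of mine for extending the routine to tolerate that (assuming UTF-8 input read in text mode, where a byte-order mark appears as U+FEFF):

    # at the top of the for loop in get_nth_yaml_doc, before any line tests:
    line = line.lstrip('\ufeff')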
¹ This was done using ruamel.yaml of which I am the author, and which is an enhanced version of PyYAML. AFAIK PyYAML would work the same (but would e.g. drop the comment on the roundtrip).
² Technically the directive is in its own directives document, so you should count that as a document, but .load_all() doesn't give you that document back, so I don't count it as such.

Related

Adding rows and columns to a pandas DataFrame in multiple loops

I am trying to make a simple tool which can look for keywords (from multiple txt files) in multiple PDFs. In the end, I would like it to produce a report in the following form:
Name of the pdf document | Keyword document 1 | Keyword document ... | Keyword document x
PDF 1                    | 1                  | 546                  | 77
PDF ...                  | 3                  | 8                    | 8
PDF x                    | 324                | 23                   | 34
Where the numbers represent the total number of occurrences of all keywords from the keyword document in that particular file.
This is how far I got - the function can successfully locate, count, and relate summed keywords to the document:
import fitz
import glob

def keyword_finder():
    # access all PDFs from current directory
    for pdf_file in glob.glob('*.pdf'):
        # open files using PyMuPDF
        document = fitz.open(pdf_file)
        # count the number of pages in document
        document_pages = document.page_count
        # access all txt files (these contain the keywords)
        for text_file in glob.glob('*.txt'):
            # empty list to store the results
            occurrences_sdg = []
            # open keywords file
            inputs = open(text_file, 'r')
            # read txt file
            keywords_list = inputs.read()
            # split the words by an 'enter'
            keywords_list_separated = keywords_list.split('\n')
            for keyword in keywords_list_separated[1:-1]:  # omit first and last entry
                occurrences_keyword = []
                # read in text page by page
                for page in range(0, document_pages):
                    # load in text from page
                    text_per_page = document.load_page(page)
                    # search for the keyword on the page, and sum all occurrences
                    keyword_sum = len(text_per_page.search_for(keyword))
                    # add occurrences from each page to list per keyword
                    occurrences_keyword.append(keyword_sum)
                # sum all occurrences of a keyword in the document
                occurrences_sdg.append(sum(occurrences_keyword))
            if sum(occurrences_sdg) > 0:
                print(f'{pdf_file} has {sum(occurrences_sdg)} keyword(s) from {text_file}\n')
I did try using pandas and I believe it is still the best choice. The number of loops makes it difficult for me to decide at which point the "skeleton" dataframe should be made, and when the results should be added. The final goal is to have the produced report saved as CSV.
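One possible shape for the pandas part, as a hedged sketch (the names results and report.csv are mine, and the inner computation is elided): collect plain dicts inside the loops and build the DataFrame once at the end, rather than growing it row by row.

import glob
import pandas as pd

results = {}  # {pdf_file: {text_file: summed keyword occurrences}}
for pdf_file in glob.glob('*.pdf'):
    results[pdf_file] = {}
    for text_file in glob.glob('*.txt'):
        occurrences_sdg = []  # fill exactly as in keyword_finder() above
        results[pdf_file][text_file] = sum(occurrences_sdg)

# rows are PDFs, columns are keyword documents, values are the totals
report = pd.DataFrame.from_dict(results, orient='index')
report.to_csv('report.csv', index_label='Name of the pdf document')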

ruamel.yaml: Preserve comments and blank lines when dumping a class that was previously loaded from yaml

I have a class that I wish to load/modify/dump from and to YAML while preserving comments and general formatting, using ruamel.yaml (0.17.21).
My issue is that after a yaml --> python --> yaml roundtrip, some comments disappear, some inline comments get put on their own line, and some blank lines (which are comments in ruamel.yaml, I believe) are missing.
I'm not sure if I'm doing something wrong, or if this is a bug report.
Here's a minimal working example:
import sys
from ruamel.yaml import YAML, yaml_object

yaml = YAML()

@yaml_object(yaml)
class ExampleClass():
    def __init__(self, subentries):
        if not 'subentry_0' in subentries:
            raise AssertionError
        for k, v in subentries.items():
            setattr(self, k, v)
    # Here I can also define a `__setstate__` method that calls the init for me
    # But it doesn't change much

source = """
# top-level comment
entry: !ExampleClass # entry inline comment
  subentry_0: 0
  subentry_1: 1 # subentry inline comment
  # separation comment
  subentry_2: 2
entry 2: |
  This is a long
  text
  entry
"""
a = yaml.load(source)
yaml.dump(a, sys.stdout)
Outputs:
# top-level comment
entry: !ExampleClass
# entry inline comment
  subentry_0: 0
  subentry_1: 1
  subentry_2: 2
entry 2: |
  This is a long
  text
  entry
Where some funky stuff happened to the comments and blank spaces.
If I initialize my class via a['entry'].__init__(a['entry'].__dict__), I also lose most comments and blank lines, but it looks better:
# top-level comment
entry: !ExampleClass
  subentry_0: 0
  subentry_1: 1
  subentry_2: 2
entry 2: |
  This is a long
  text
  entry
For blank lines, it'd be acceptable to me to just strip them all and then insert blank lines back between top-level entries.
There are two issues here. One is that when you want to round-trip, you should not register tags for your own objects.
ruamel.yaml can round-trip tagged collections (mapping, sequence) and most scalars (most notably, it cannot round-trip a tagged null/~). This gives you subclasses of standard Python types that mostly behave as you would expect and preserve all of the comments as well as any tags.
The second issue is that comments between keys and their values have issues, and that tags interfere with comments (i.e. this is not properly covered by enough test cases, because of laziness on the part of the ruamel.yaml author). IIRC comments between a key and a tagged value get completely lost.
The easiest solution for this second issue (for now) is probably to post-process the output.
import sys
import ruamel.yaml

yaml_str = """\
# top-level comment
entry: !ExampleClass # entry inline comment
    subentry_0: 0
    subentry_1: 1 # subentry inline comment
    # separation comment
    subentry_2: 2
entry 2: |
    This is a long
    text
    entry
"""

yaml = ruamel.yaml.YAML()
yaml.indent(mapping=4)
yaml.preserve_quotes = True
data = yaml.load(yaml_str)
# print(data['entry'].tag.value)

def correct_comment_after_tag(s):
    # if a previous line ends in a tag and this line has enough spaces
    # at the start, append the end of the line to the previous one
    res = []
    prev_line = -1  # -1 if previous line didn't end in tag, else length of previous line
    for line in s.splitlines():
        linesplit = line.split()
        if linesplit and linesplit[-1].startswith('!'):
            prev_line = len(line)
        else:
            if prev_line > 0:
                if line.lstrip().startswith('#') and line.find('#') > prev_line:
                    res[-1] += line[prev_line:]
                    prev_line = -1
                    continue
            prev_line = -1
        res.append(line)
    return '\n'.join(res)

yaml.dump(data, sys.stdout, transform=correct_comment_after_tag)
which gives:
# top-level comment
entry: !ExampleClass # entry inline comment
    subentry_0: 0
    subentry_1: 1 # subentry inline comment
    # separation comment
    subentry_2: 2
entry 2: |
    This is a long
    text
    entry
To get the ExampleClass behaviour I would probably duck-type a __getattr__ on ruamel.yaml.comments.CommentedMap that checks for subentry_0 and returns the value for that key. I usually know up-front whether I am going to round-trip or not, and use yamlrt = YAML() if I do, and yamls = YAML(typ='safe') with classes registered in yamls if I don't.
If you need to do (extra) checks on tagged nodes, it is IMO easiest to recursively walk over the data structure, testing dicts, lists and possibly their items for their .tag attribute, and doing the appropriate check when the tag matches; a sketch of such a walk follows.
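As an illustration, here is my own sketch (not from the original answer); it assumes round-tripped collections expose a .tag attribute with a .value, as used above in data['entry'].tag.value:

def walk_tagged(node, tag, check):
    # recursively visit mappings and sequences, calling check() on any
    # node whose YAML tag matches; plain scalars have no .tag attribute
    node_tag = getattr(node, 'tag', None)
    if node_tag is not None and node_tag.value == tag:
        check(node)
    if isinstance(node, dict):
        for value in node.values():
            walk_tagged(value, tag, check)
    elif isinstance(node, list):
        for value in node:
            walk_tagged(value, tag, check)

walk_tagged(data, '!ExampleClass', lambda node: print(sorted(node)))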
Alternatively, you might get a Python data structure that preserves comments on round-trip by making ExampleClass a subclass of CommentedMap, but I am not sure.

How to convert a .txt to .xml in python

So the current problem I'm facing is converting a text file into an XML file.
The text file would be in this format.
Serial Number: Operator ID: test Time: 00:03:47 Test Step 2 TP1: 17.25 TP2: 2.46
Serial Number: Operator ID: test Time: 00:03:47 Test Step 2 TP1: 17.25 TP2: 2.46
I wanted to convert it into XML with this format:
<?xml version="1.0" encoding="utf-8"?>
<root>
  <filedata>
    <serialnumber></serialnumber>
    <operatorid>test</operatorid>
    <time>00:00:42 Test Step 2</time>
    <tp1>17.25</tp1>
    <tp2>2.46</tp2>
  </filedata>
  ...
</root>
I was using code like this to convert my previous text file to XML, but right now I'm facing problems in splitting the lines.
import xml.etree.ElementTree as ET
import fileinput
import os
import itertools as it

root = ET.Element('root')
with open('text.txt') as f:
    lines = f.read().splitlines()

celldata = ET.SubElement(root, 'filedata')
for line in it.groupby(lines):
    line = line[0]
    if not line:
        celldata = ET.SubElement(root, 'filedata')
    else:
        tag = line.split(":")
        el = ET.SubElement(celldata, tag[0].replace(" ", ""))
        tag = ' '.join(tag[1:]).strip()
        if 'File Name' in line:
            tag = line.split("\\")[-1].strip()
        elif 'File Size' in line:
            splist = filter(None, line.split(" "))
            tag = splist[splist.index('Low:') + 1]
            # splist[splist.index('High:')+1]
        el.text = tag

import xml.dom.minidom as minidom
formatedXML = minidom.parseString(
    ET.tostring(root)).toprettyxml(indent=" ", encoding='utf-8').strip()

with open("test.xml", "wb") as f:
    f.write(formatedXML)
I saw a similar question on Stack Overflow, "Python text file to xml", but the problem is I couldn't change my file into .csv format, as this file is generated by a certain machine.
If anyone knows how to solve it, please do help.
Thank you.
Here is a better method of splitting the lines.
Notice that the text variable would technically be your loaded .txt file, and that I purposely modified it so that we have greater context for the output.
from collections import OrderedDict
from pprint import pprint

# Text would be our loaded .txt file.
text = """Serial Number: test Operator ID: test1 Time: 00:03:47 Test Step 1 TP1: 17.25 TP2: 2.46
Serial Number: Operator ID: test2 Time: 00:03:48 Test Step 2 TP1: 17.24 TP2: 2.47"""

# Headers of the intended break-points in the text files.
headers = ["Serial Number:", "Operator ID:", "Time:", "TP1:", "TP2:"]

information = []
# Split our text by lines.
for line in text.split("\n"):
    # Split our text up so we only have the information per header.
    default_header = headers[0]
    for header in headers[1:]:
        line = line.replace(header, default_header)
    info = [i.strip() for i in line.split(default_header)][1:]

    # Compile our header+information together into OrderedDict's.
    compiled_information = OrderedDict()
    for header, info in zip(headers, info):
        compiled_information[header] = info

    # Append to our overall information list.
    information.append(compiled_information)

# Pretty print the information (not needed, only for better display of data.)
pprint(information)
Outputs:
[OrderedDict([('Serial Number:', 'test'),
              ('Operator ID:', 'test1'),
              ('Time:', '00:03:47 Test Step 1'),
              ('TP1:', '17.25'),
              ('TP2:', '2.46')]),
 OrderedDict([('Serial Number:', ''),
              ('Operator ID:', 'test2'),
              ('Time:', '00:03:48 Test Step 2'),
              ('TP1:', '17.24'),
              ('TP2:', '2.47')])]
This method should generalize better than what you are currently writing; the idea of the code is something I've had saved from another project. I recommend going through the code and understanding its logic.
From here you should be able to loop through the information list and create your custom .xml file, as in the sketch below. I would also recommend checking out dicttoxml, as it might make your life much easier on the final step.
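For instance, here is a sketch of that final step using the standard library's ElementTree (my continuation, not part of the original answer; the tag names are derived from the headers):

import xml.etree.ElementTree as ET

root = ET.Element('root')
for entry in information:
    filedata = ET.SubElement(root, 'filedata')
    for header, value in entry.items():
        # 'Serial Number:' -> 'serialnumber', 'TP1:' -> 'tp1', etc.
        tag = header.rstrip(':').replace(' ', '').lower()
        ET.SubElement(filedata, tag).text = value

ET.ElementTree(root).write('test.xml', encoding='utf-8', xml_declaration=True)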
In regards to your code, remember: breaking down fundamental tasks is easier than trying to incorporate them all into one. By trying to create the XML file while splitting your txt file, you've created a monster that is hard to tackle when it revolts with bugs. Instead, take it one step at a time -- create "checkpoints" that you are 100% certain work, and then move on to the next task.

What is a working method for extracting numeric values with associated data from open text?

I tried to look for a solution but nothing was giving me quite what I needed. I'm not sure regex can do what I need.
I need to process a large amount of data where license information is provided. I just need to grab the number of licenses and the name for each license then group and tally the license counts for each company.
Here's an example of the data pulled:
L00129A578-E105C1D138 1 Centralized Recording
$42.00
L00129A677-213DC6D60E 1 Centralized Recording
$42.00
1005272AE2-C1D6CACEC8 5 Station
$45.00
100525B658-3AC4D2C93A 5 Station
$45.00
I would need to grab the license count and license name, then add like objects: it would grab (1 Centralized Recording, 1 Centralized Recording, 5 Station, 5 Station), then add the license counts and output (2 Centralized Recording, 10 Station).
What would be the easiest way to implement this?
It looks like you're trying to ignore the license number, and get the count and name. So, the following should point you on your way for your data, if it is as uniform as it seems:
import re
r = re.compile(r"\s+(\d+)\s+([A-Za-z ]+)")
m = r.search(" 1 Centralized")
m.groups()
# ('1', 'Centralized')
That regex just says, "Require but ignore 1 or more spaces, pay attention to the string of digits after it, require but ignore 1 or more spaces after that, and pay attention to the capital letters, lower case letters, and spaces after it." (You may need to trim off a newline when you're done.)
The file-handling bit would look like:
f = open('/path/to/your_data_file.txt')
for line in f.readlines():
    # run the regex and do stuff for each line
    pass
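Putting those pieces together with a Counter to do the grouping and tallying (my sketch, not part of the original answer):

import re
from collections import Counter

r = re.compile(r"\s+(\d+)\s+([A-Za-z ]+)")
totals = Counter()
with open('/path/to/your_data_file.txt') as f:
    for line in f:
        m = r.search(line)
        if m:
            # key: license name, value: running total of the license counts
            totals[m.group(2).strip()] += int(m.group(1))

print(totals)  # e.g. Counter({'Station': 10, 'Centralized Recording': 2})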
import re, io, pandas as pd

a = open('your_data_file.txt')
pd.read_csv(io.StringIO(re.sub(r'(?m).*\s(\d+)\s+(.*\S+)\s+$\n|.*', '\\1,\\2', a.read())),
            header=None).groupby(1).sum()[0].to_dict()
Pandas is a good tool for jobs like this. You might have to play around with it a bit. You will also need to export your Excel file as a .csv file. In the interpreter, try:
import pandas
raw = pandas.read_csv('myfile.csv')
print(raw.columns)
That will give you the column headings for the csv file. If you have headers name and nums, then you can extract those as a list of tuples as follows:
extract = list(zip(raw.name, raw.nums))
You can then sort this list by name:
extract = sorted(extract)
Pandas probably has a method for compressing this easily, but I can't recall it so:
def accum(c):
    nm = c[0][0]
    count = 0
    result = []
    for x in c:
        if x[0] == nm:
            count += x[1]
        else:
            result.append((nm, count))
            nm = x[0]
            count = x[1]
    result.append((nm, count))
    return result

done = accum(extract)
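For reference, the pandas shortcut alluded to above would be something like this (my sketch, assuming the name and nums headers):

# groupby sums the counts per name, yielding the same (name, total) pairs
done = list(raw.groupby('name')['nums'].sum().items())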
Now you can write this to a text file as follows (f-strings require Python 3.6+):
with open("myjob.txt", "w+") as fout:
    for x in done:
        line = f"name: {x[0]} count: {x[1]} \n"
        fout.write(line)

Dividing a .yml file up

I need to break .yml files down into 3 parts: header, working (the part I need to edit), and footer. The header is everything that comes before the 'Resource:' block, and the footer is everything after it. I essentially need code that creates 3 lists, dictionaries, strings, whatever works, that hold these three sections of the YAML file, then lets me run more code against the working piece, then concatenates all of them together at the end and produces the new document with the same indentation. No changes should be made to the header or the tail.
Note: I've looked up everything about YAML parsing and whatnot, but cannot seem to implement the advice I've found effectively. A solution that does not involve importing yaml would be preferred, but if you must, please explain what is really going on in the import yaml code so I can understand what I'm messing up.
Files that contain one or more YAML documents (in short: a YAML file, which, since Sept. 2006, has been recommended to have the extension .yaml) are text files and can be concatenated from parts as such. The only requirement is that in the end you have a text file that is a valid YAML file.
The easiest is of course to have the header and footer in separate files for that, but as you are talking about multiple YAML files, this soon becomes unwieldy. It is, however, always possible to do some basic parsing of the file contents.
Your Working part starts with Resource:, and you indicate 3 lists or dictionaries (you cannot have three strings at the root of a YAML document). So the root-level data structure of your YAML document either needs to be a mapping, where everything except the keys of that mapping is indented (in theory it only needs to be indented more, but in practice this almost always means that the keys are not indented), like (m.yaml):
# header
a: 1
b:
- 2
- c: 3 # end of header
Resource:
# footer
c:
  d: "the end" # really
or the root level needs to be a sequence (s.yaml):
# header
- a: 1
  b:
  - 2
  - c: 3
- 42 # end of header
- Resource:
# footer
- c:
    d: "the end" # really
Both can easily be split without loading the YAML. Here is example code that does so for the file with the root-level mapping:
from pathlib import Path
from ruamel.yaml import YAML

inf = Path('m.yaml')
header = []  # list of lines
resource = []
footer = []
for line in inf.open():
    if not resource:
        if line.startswith('Resource:'):  # check if we are at the end of the header
            resource.append(line)
            continue
        header.append(line)
        continue
    elif not footer:
        if not line or line[0] == ' ':  # still in the resource part
            resource.append(line)
            continue
    footer.append(line)

# you now have lists of lines for the header and the footer
# define the new data structure for the resource, this is going to be a single key/value dict
upd_resource = dict(Resource=['some text', 'for the resource spec', {'a': 1, 'b': 2}])

# write the header lines, dump the resource lines, write the footer lines
outf = Path('out.yaml')
with outf.open('w') as out:
    out.write(''.join(header))
    yaml = YAML()
    yaml.indent(mapping=2, sequence=2, offset=0)  # the default values
    yaml.dump(upd_resource, out)
    out.write(''.join(footer))

print(outf.read_text())
this gives:
# header
a: 1
b:
- 2
- c: 3 # end of header
Resource:
- some text
- for the resource spec
- a: 1
  b: 2
# footer
c:
  d: "the end" # really
Doing the same while parsing the YAML file is not more difficult. The following automatically handles both cases (whether the root level is a mapping or a sequence):
from pathlib import Path
import ruamel.yaml

inf = Path('s.yaml')
upd_resource_val = ['some text', 'for the resource spec', {'a': 1, 'b': 2}]
outf = Path('out.yaml')

yaml = ruamel.yaml.YAML()
yaml.indent(mapping=2, sequence=2, offset=0)
yaml.preserve_quotes = True
data = yaml.load(inf)
if isinstance(data, dict):
    data['Resource'] = upd_resource_val
else:  # assume a list
    for item in data:  # search for the item which has as value a dict with key Resource
        try:
            if 'Resource' in item:
                item['Resource'] = upd_resource_val
                break
        except TypeError:
            pass
yaml.dump(data, outf)
This creates the following out.yaml:
# header
- a: 1
  b:
  - 2
  - c: 3
- 42 # end of header
- Resource:
  - some text
  - for the resource spec
  - a: 1
    b: 2
# footer
- c:
    d: "the end" # really
If the m.yaml file had been the input, the output would have
been exactly the same as with the text based "concatenation" example code.
