I have a text file to convert to YAML format. Here are some notes to describe the problem a little better:
Each section in the file has a different number of subheadings.
The values of the subheadings can be any data type (e.g. string, bool, int, double, datetime).
The file is approximately 2,000 lines long.
An example of the format is below:
file_content = '''
Section section_1
section_1_subheading1 = text
section_1_subheading2 = bool
end
Section section_2
section_2_subheading3 = int
section_2_subheading4 = double
section_2_subheading5 = bool
section_2_subheading6 = text
section_2_subheading7 = datetime
end
Section section_3
section_3_subheading8 = numeric
section_3_subheading9 = int
end
'''
I have tried to convert the text to YAML format by:
1. Replacing the equal signs with colons using regex.
2. Replacing "Section section_name" with "section_name :".
3. Removing "end" between each section.
However, I am having difficulty with #2 and #3. This is the text-to-YAML function I have created so far:
import yaml
import re

def convert_txt_to_yaml(file_content):
    """Converts a text file to a YAML file"""
    # Replace "=" with ":"
    file_content2 = file_content.replace("=", ":")
    # Split the lines
    lines = file_content2.splitlines()
    # Define section headings to find and replace
    section_names = "Section "
    section_headings = r"(?<=Section )(.*)$"
    section_colons = r"\1 : "
    end_names = "end"
    # Convert to YAML format, line-by-line
    for line in lines:
        add_colon = re.sub(section_headings, section_colons, line)  # Add colon to end of section name
        remove_section_word = re.sub(section_names, "", add_colon)  # Remove "Section " in section header
        line = re.sub(end_names, "", remove_section_word)  # Remove "end" between sections
    # Join lines back together
    converted_file = "\n".join(lines)
    return converted_file
I believe the problem is within the for loop - I can't manage to figure out why the section headers and endings aren't changing. It prints perfectly if I test it, but the lines themselves aren't saving.
The output format I am looking for is the following:
file_content = '''
section_1 :
section_1_subheading1 : text
section_1_subheading2 : bool
section_2 :
section_2_subheading3 : int
section_2_subheading4 : double
section_2_subheading5 : bool
section_2_subheading6 : text
section_2_subheading7 : datetime
section_3 :
section_3_subheading8 : numeric
section_3_subheading9 : int
'''
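For reference, the usual fix for this symptom is to append each transformed line to a new list instead of rebinding the loop variable; a minimal sketch (with simplified regexes, not the exact patterns above):

```python
import re

def convert_txt_to_yaml(file_content):
    """Sketch: transform each line and collect the results."""
    converted = []  # rebinding `line` inside the loop would be lost
    for line in file_content.splitlines():
        line = line.replace("=", ":")
        line = re.sub(r"^Section (.*)$", r"\1 :", line)  # drop "Section ", add colon
        line = re.sub(r"^end$", "", line)                # blank out "end" markers
        converted.append(line)
    return "\n".join(converted)

print(convert_txt_to_yaml("Section s1\na = text\nend"))
```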
I would rather convert it to a dict and then format it as YAML using the yaml package in Python, as below:
import re
import yaml

def convert_txt_to_yaml(file_content):
    """Converts a text file to a YAML file"""
    config_dict = {}
    # Split the lines
    lines = file_content.splitlines()
    section_title = None
    for line in lines:
        if line == '\n':
            continue
        elif re.match(r'.*end$', line):
            # End of section
            section_title = None
        elif re.match(r'.*Section\s+.*', line):
            # Start of section
            match_obj = re.match(r".*Section\s+(.*)", line)
            section_title = match_obj.groups()[0]
            config_dict[section_title] = {}
        elif section_title and re.match(r".*{}_.*\s+=.*".format(section_title), line):
            match_obj = re.match(r".*{}_(.*)\s+=(.*)".format(section_title), line)
            config_dict[section_title][match_obj.groups()[0]] = match_obj.groups()[1]
    return yaml.dump(config_dict)
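The dict-building core of this approach can also be sketched without regexes; a simplified stand-in parser for the sample format (names here are illustrative, not the asker's exact code):

```python
def parse_sections(text):
    """Build {section: {key: value}} from the 'Section ... end' format."""
    config = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "end":
            section = None                     # close the current section
        elif line.startswith("Section "):
            section = line[len("Section "):]   # section name follows the keyword
            config[section] = {}
        elif section and "=" in line:
            key, _, value = line.partition("=")
            config[section][key.strip()] = value.strip()
    return config

print(parse_sections("Section section_1\nsection_1_subheading1 = text\nend"))
```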
I've been trying various things to solve my issue, but nothing seems to work - I need your help!
I found code that is quite effective at parsing a PDF (e.g. annual report) and extracting sentences from it. I was, however, wondering how I could extract paragraphs from this instead of sentences.
My intuition would be to split paragraphs by not removing the double "\n", i.e. "\n\n", which could indicate the start of a new paragraph. I haven't quite managed to get it to work. Does anyone have any tips?
# PDF parsing
import re
import string

import nltk
from tika import parser


class parsePDF:
    def __init__(self, url):
        self.url = url

    def extract_contents(self):
        """ Extract a pdf's contents using tika. """
        pdf = parser.from_file(self.url)
        self.text = pdf["content"]
        return self.text

    def clean_text(self):
        """ Extract & clean sentences from raw text of pdf. """
        # Remove non ASCII characters
        printables = set(string.printable)
        self.text = "".join(filter(lambda x: x in printables, self.text))
        # Replace tabs with spaces
        self.text = re.sub(r"\t+", r" ", self.text)
        # Aggregate lines where the sentence wraps
        # Also, lines in CAPITALS are counted as headers
        fragments = []
        prev = ""
        for line in re.split(r"\n+", self.text):
            if line.isupper():
                prev = "."  # skip it
            elif line and (line.startswith(" ") or line[0].islower()
                           or not prev.endswith(".")):
                prev = f"{prev} {line}"  # make into one line
            else:
                fragments.append(prev)
                prev = line
        fragments.append(prev)
        # Clean the lines into sentences
        sentences = []
        for line in fragments:
            # Use regular expressions to clean text
            url_str = (r"((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:#\-_=#]+\."
                       r"([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:#\-_=#])*")
            line = re.sub(url_str, r" ", line)  # URLs
            line = re.sub(r"^\s?\d+(.*)$", r"\1", line)  # headers
            line = re.sub(r"\d{5,}", r" ", line)  # figures
            line = re.sub(r"\.+", ".", line)  # multiple periods
            line = line.strip()  # leading & trailing spaces
            line = re.sub(r"\s+", " ", line)  # multiple spaces
            line = re.sub(r"\s?([,:;\.])", r"\1", line)  # punctuation spaces
            line = re.sub(r"\s?-\s?", "-", line)  # split-line words
            # Use nltk to split the line into sentences
            for sentence in nltk.sent_tokenize(line):
                s = str(sentence).strip().lower()  # lower case
                # Exclude tables of contents and short sentences
                if "table of contents" not in s and len(s) > 5:
                    sentences.append(s)
        return sentences
And this is what you can call afterwards:

url = "https://www.aperam.com/sites/default/files/documents/Annual_Report_2021.pdf"  # or any PDF you want
pp = parsePDF(url)
pp.extract_contents()
sentences = pp.clean_text()
All recommendations are greatly appreciated!
PS: If anyone has a better solution already created, I'd be more than happy to have a look at it!
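One way to act on the double-"\n" intuition is to split the raw text on blank lines before any cleanup collapses them; a minimal standalone sketch (not wired into the class above):

```python
import re

def extract_paragraphs(text):
    """Split raw text on blank lines; collapse whitespace inside each paragraph."""
    parts = re.split(r"\n\s*\n", text)  # "\n\n" (possibly with spaces between) = boundary
    paragraphs = [re.sub(r"\s+", " ", p).strip() for p in parts]
    return [p for p in paragraphs if p]

sample = "First line\nstill first para.\n\nSecond para."
print(extract_paragraphs(sample))
```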
Given a file with the following content:
enum class Fruits(id: String) {
    BANANA(id = "banana"),
    LEMON(id = "lemon"),
    DRAGON_FRUIT(id = "dragonFruit"),
    APPLE(id = "apple"); }
I want to sort this file given the pattern "id = ", and then replace these lines with the new sorted lines.
I wrote a piece of code in python that sorts the whole file, but I'm struggling with regex to read/find the pattern so I can sort it.
My python script:
import re

fruitsFile = '/home/genericpath/Fruits.txt'

def sortFruitIds():
    # this is an attempt to get/find the pattern, but it returns an AttributeError:
    # 'NoneType' object has no attribute 'group'
    with open(fruitsFile, "r+") as f:
        lines = sorted(f, key=lambda line: str(re.search(r"(?<=id = )\s+", line)))
        for line in lines:
            f.write(line)
When trying to find the pattern with regex, it returns an AttributeError: 'NoneType' object has no attribute 'group'
Any help is appreciated.
Looks like your main issue is that your regex expects a space character (\s) where you should be looking for non-space characters (\S). With that in mind, this should work:
import re

fruitsFile = 'Fruits.txt'

def sortFruitIds():
    with open(fruitsFile, "r+") as f:
        lines = f.readlines()
        lines_sorted = sorted(lines, key=lambda line: re.search(r"(?<=id = \")\S+|$", line).group())
        f.seek(0)  # rewind before rewriting, otherwise the sorted lines get appended
        for line in lines_sorted:
            f.write(line)
I also added |$ to the regex to return an empty string if there is no match, and added group() to grab the match.
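Checked against a few in-memory lines (file I/O left out), the sort key behaves like this:

```python
import re

lines = [
    'BANANA(id = "banana"),',
    'LEMON(id = "lemon"),',
    'APPLE(id = "apple"),',
]

# Same key as above: everything non-space after the opening quote of the id
key = lambda line: re.search(r"(?<=id = \")\S+|$", line).group()

print([key(l) for l in lines])   # keys keep the trailing quote/comma but sort fine
print(sorted(lines, key=key))    # APPLE first, alphabetically by id
```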
We can approach this by doing a regex find all for all entries in the enum. Then sort them alphabetically by the id string value, and join together the final enum code. Note that below I also extract the first line of the enum for use later in the output.
import re

inp = '''enum class Fruits(id: String) {
    BANANA(id = "banana"),
    LEMON(id = "lemon"),
    DRAGON_FRUIT(id = "dragonFruit"),
    APPLE(id = "apple"); }'''

header = re.search(r'enum.*?\{', inp).group()
items = re.findall(r'\w+\(id\s*=\s*".*?"\)', inp)
items.sort(key=lambda m: re.search(r'"(.*?)"', m).group(1))
output = header + '\n    ' + ',\n    '.join(items) + '; }'
print(output)

This prints:

enum class Fruits(id: String) {
    APPLE(id = "apple"),
    BANANA(id = "banana"),
    DRAGON_FRUIT(id = "dragonFruit"),
    LEMON(id = "lemon"); }
I have a list containing strings from lines in a txt file.
import csv
import re
from collections import defaultdict

parameters = ["name", "associated-interface", "type", "subnet", "fqdn", "wildcard-fqdn", "start-ip", "end-ip", "comment"]
address_dict = defaultdict(dict)
address_statements = []

with open("***somepaths**\\file.txt", "r") as address:
    in_address = False
    for line in address:
        line = line.strip()
        # print(line)
        if in_address and line != "next":
            if line == "end":
                break
            address_statements.append(line)
        else:
            if line == "config firewall address":
                in_address = True

# print(address_statements)
if address_statements:
    for statement in address_statements:
        op, param, *val = statement.split()
        if op == "edit":
            address_id = param
        elif op == "set" and param in parameters:
            address_dict[address_id][param] = ' '.join(val)

# output to the CSV
with open("***somepaths**\\file.csv", "w", newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=parameters)
    writer.writeheader()
    for key in address_dict:
        address_dict[key]['name'] = key
        writer.writerow(address_dict[key])
The output should be like this: edit "name test", but it turns out to cut off at the space after the name, like this: edit "name.
How can I include everything inside the double quotes?
You are using

op, param, *val = statement.split()

which splits at spaces - a line of

edit "CTS SVR"

will put 'edit' into op, '"CTS' into param, and the remainder of the line (split at spaces, as a list) into val: ['SVR"'].
You need a way to split a string by spaces while preserving quoted substrings - if you have params that are internally separated by spaces and delimited by quotes.
Inspired by this answer, the csv module gives you what you need:

import csv

t1 = 'edit1 "some text" bla bla'
t2 = 'edit2 "some text"'
t3 = 'edit3 some thing'

reader = csv.reader([t1, t2, t3], delimiter=" ", skipinitialspace=True)
for row in reader:
    op, param, *remainder = row
    print(op, param, remainder, sep=" --> ")
Output:
edit1 --> some text --> ['bla', 'bla']
edit2 --> some text --> []
edit3 --> some --> ['thing']
You can apply the reader to one line only ( reader = csv.reader([line], delimiter = " ") ).
Probably a duplicate of Split a string by spaces preserving quoted substrings - I close-voted the question earlier and cannot vote to mark it as a duplicate anymore - hence the detailed answer.
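For the same quote-aware splitting, the standard-library shlex module is another option:

```python
import shlex

# shlex.split honours double quotes, so quoted names stay in one field
print(shlex.split('edit "CTS SVR"'))
print(shlex.split('edit1 "some text" bla bla'))
```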
I have some large content in a text file like this:
1. name="user1” age="21”
2. name="user2” age="25”
....
Notice that I have this special type of quote ( ” ) at the end of each value.
I just want to replace that quote ( ” ) with a normal quote (").
Code:
import codecs

f = codecs.open('myfile.txt', encoding='utf-8')
for line in f:
    print "str text : ", line
    a = repr(line)
    print "repr text : ", a
    x = a.replace(u'\u201d', '"')
    print "new text : ", x
Output:
str text : 1. name="user1” age="21”
repr text : u'1. name="user1\u201d age="21\u201d\n'
new text : u'1. name="user1\u201d age="21\u201d\n'
but it's not working. What am I missing here?
Update:
I just tried this:

import codecs

f = codecs.open('one.txt')
for line in f:
    print "str text : ", line
    y = line.replace("\xe2\x80\x9d", '"')
    print "ynew text : ", y

and it is working now.
Still I want to know what was wrong with x = a.replace(u'\u201d', '"')
a is the repr of the line, which does not contain the character ” - it contains the six literal characters \, u, 2, 0, 1, d instead, so there is nothing for the replace to match.
So changing a = repr(line) to a = line will fix the problem.
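A quick Python 3 demonstration of why replacing on the line itself works (the curly quote is one character, written here with its escape):

```python
line = '1. name="user1\u201d age="21\u201d'
fixed = line.replace('\u201d', '"')  # replace the curly quote with a normal one
print(fixed)
```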
I'm currently writing an ascii-binary/binary-ascii converter in Python for a school project, and I have an issue with converting from ascii (String text) to binary. The idea is to print the outcome in the test() on the bottom of the code.
When running the code in WingIDE, an error occurs on the line starting with

bnary = bnary + binary[chnk]

KeyError: "Norway stun Poland 30:28 and spoil Bielecki's birthday party."
What I'm trying to do here is to convert the String of text stored in "text.txt" to a String of integers, and then print this binary string.
Any help is greatly appreciated. I tried looking at other questions related to ascii-to-binary (and vice-versa) conversion, but none seemed to work for me.
My code:
def code():
    binary = {}
    ascii = {}
    # Generate ascii code
    for i in range(0, 128):
        ascii[format(i, '08b')] = chr(i)
    # Reverse the ascii code, this will be binary
    for k, v in ascii.iteritems():
        binary[v] = binary.get(v, [])
        binary[v].append(k)
    return ascii

def encode(text, binary):
    '''
    Encode some text using text from a source
    '''
    bnary = ""
    fi = open(text, mode='rb')
    while True:
        chnk = fi.read()
        if chnk == '':
            break
        if chnk != '\n':
            binry = ""
            bnary = bnary + binary[chnk]
    return bnary

def decode(sourcecode, n, ascii):
    '''
    Decode a sourcecode using chunks of size n
    '''
    sentence = ""
    f = open(sourcecode, mode='rb')  # Open a file with filename <sourcecode>
    while True:
        chunk = f.read(n)  # Read n characters at a time from an open file
        if chunk == '':  # This is one way to check for the End Of File in Python
            break
        if chunk != '\n':
            setence = ""  # The ascii sentence generated
            # create a sentence
            sentence = sentence + ascii[chunk]
    return sentence

def test():
    '''
    A placeholder for some test cases.
    It is recommended that you use some existing framework, like unittest,
    but for a temporary testing in a development version can be done
    directly in the module.
    '''
    print encode('text.txt', code())
    print decode('sourcecode.txt', 8, code())

test()
If you want to encode and decode, here are two solutions.

Encode ascii to bin:

def toBinary(string):
    return "".join([format(ord(char), '#010b')[2:] for char in string])

Encode bin to ascii:

def toString(binaryString):
    return "".join([chr(int(binaryString[i:i+8], 2)) for i in range(0, len(binaryString), 8)])
fi.read() returns the whole document the first time and '' on subsequent calls. So you should do:

text = fi.read()
for char in text:
    do_stuff()
Edit1
You can only read your file once, so you have to get your chars one by one. A file.read() returns a string containing the whole document, and you can iterate over a string to get chars one by one.
The main error is that your binary dict is shaped like {"a": ['01100001'], "b": ...} and you try to access it with the whole text as the key, where you should do it char by char:

string = "azer"
result = []
for c in string:
    result += binary[c]
# result is now a flat list like ['01100001', '01111010', ...]
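Concretely, with a mapping shaped like the one code() builds (recreated here as a dict comprehension so the snippet stands alone), the char-by-char lookup works:

```python
# char -> [8-bit string], mirroring the structure built by code()
binary = {chr(i): [format(i, '08b')] for i in range(128)}

result = []
for c in "az":
    result += binary[c]  # += extends, so result stays a flat list of bit strings
print(result)
```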
# Let's say your filename is stored in fname
def binary(n):
    return '{0:08b}'.format(n)

with open(fname) as f:
    for line in f:
        # ord() takes a single character, so walk each line char by char
        for ch in line.rstrip('\n'):
            print(binary(ord(ch)), end='')
        print('')

This will give you the 8-bit binary value (from ASCII) of each character in the file, line by line.