How to parse a "here document" in Python?

I want to write a Python method that reads a text file with key-values:
FOO=BAR
BUZ=BLEH
I also want to support newlines, both through quoting with \n and through here-docs:
MULTILINE1="This\nis a test"
MULTILINE2= <<DOC
This
is a test
DOC
While the first one is easy to implement, I'm struggling with the second. Is there maybe something in Python's stdlib (e.g. shlex) that I can use already?

"test.txt" content:
FOO=BAR
BUZ=BLEH
MULTILINE1="This\nis a test"
MULTILINE2= <<DOC
This
is a test
DOC
Function:
def read_strange_file(filename):
    with open(filename) as f:
        file_content = f.read().splitlines()
    res = {}
    key, value, delim = "", "", ""
    for line in file_content:
        if "=" in line and not delim:
            # split on the first "=" only, so values may contain "="
            key, value = line.split("=", 1)
            if value.strip(" ").startswith("<<"):
                delim = value.strip(" ")[2:]  # extracting delimiter keyword
                value = ""
                continue
        if not delim or (delim and line == delim):
            if value.startswith("\"") and value.endswith("\""):
                # [1: -1] delete quotes
                value = bytes(value[1: -1], "utf-8").decode("unicode_escape")
            if delim:
                value = value[:-1]  # delete "\n"
            res[key] = value
            delim = ""
        if delim:
            value += line + "\n"
    return res
Usage:
result = read_strange_file("test.txt")
print(result)
Output:
{'FOO': 'BAR', 'BUZ': 'BLEH', 'MULTILINE1': 'This\nis a test', 'MULTILINE2': 'This\nis a test'}

I'm assuming that this is the test string (i.e., there are unseen \n characters at the end of each line):
s = ''
s += 'MULTILINE1="This\nis a test"\n'
s += 'MULTILINE2= <<DOC\n'
s += 'This\n'
s += 'is a test\n'
s += 'DOC\n'
The best I can do is to cheat using NumPy:
import numpy as np
A = np.asarray([ss.rsplit('\n', 1) for ss in ('\n'+s).split('=')])
keys = A[:-1,1].tolist()
values = A[1:,0].tolist()
#optionally parse here-documents
di = 'DOC' #delimiting identifier
values = [v.strip().lstrip('<<%s\n'%di).rstrip('\n%s'%di) for v in values]
print('Keys: ', keys)
print('Values: ', values)
#if you want a dictionary:
d = dict( zip(keys, values) )
This results in:
Keys: ['MULTILINE1', 'MULTILINE2']
Values: ['"This\nis a test"', 'This\nis a test']
It works by sneakily adding a \n character to the beginning of the string, then splitting the whole string on = characters, and finally using rsplit to retain all values to the right of each =, even when those values contain multiple \n characters. Printing the array A makes things clearer:
[['', 'MULTILINE1'],
['"This\nis a test"', 'MULTILINE2'],
[' <<DOC\nThis\nis a test\nDOC', '' ]]
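The same split/rsplit idea can be sketched without NumPy. This is a minimal sketch, not the answerer's code: split_pairs and strip_heredoc are hypothetical helper names, and the here-doc marker is assumed to be known in advance (DOC here). It removes the marker with explicit prefix and suffix checks, because str.lstrip and str.rstrip treat their argument as a set of characters and could also eat the start or end of a body that happens to begin or end with D, O or C.

```python
def split_pairs(text):
    """Split 'KEY=value' text on '=' and recover each key with rsplit."""
    chunks = [chunk.rsplit("\n", 1) for chunk in ("\n" + text).split("=")]
    keys = [chunk[1] for chunk in chunks[:-1]]    # text after the last newline
    values = [chunk[0] for chunk in chunks[1:]]   # text before the last newline
    return keys, values


def strip_heredoc(value, marker="DOC"):
    """Remove a surrounding '<<MARKER' line and closing 'MARKER' line, if present."""
    value = value.strip()
    prefix = "<<" + marker + "\n"
    suffix = "\n" + marker
    if value.startswith(prefix) and value.endswith(suffix):
        return value[len(prefix):-len(suffix)]
    return value


s = 'MULTILINE1="This\nis a test"\nMULTILINE2= <<DOC\nThis\nis a test\nDOC\n'
keys, values = split_pairs(s)
result = dict(zip(keys, (strip_heredoc(v) for v in values)))
print(result)
```

Like the NumPy version, this still assumes that no value contains an = character.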

Related

Find specific values in a txt file and adding them up with python

I have a txt file which looks like that:
[Chapter.Title1]
Irrevelent=90 B
Volt=0.10 ienl
Watt=2 W
Ampere=3 A
Irrevelent=91 C
[Chapter.Title2]
Irrevelent=999
Irrevelent=999
[Chapter.Title3]
Irrevelent=92 B
Volt=0.20 ienl
Watt=5 W
Ampere=6 A
Irrevelent=93 C
What I want is that it catches "Title1" and the values "0.1", "2" and "3", then adds them up (which would be 5.1).
I don't care about the lines with "irrevelent" at the beginning.
And then the same with the third block: catching "Title3" and adding "0.2", "5" and "6".
The second block with "Title2" does not contain "Volt", "Watt" and "Ampere" and is therefore not relevant.
Can anyone please help me out with this?
Thank you and cheers

You can use regular expressions to get the values and the titles in lists, then use them.
txt = """[Chapter.Title1]
Irrevelent=90 B
Volt=1 V
Watt=2 W
Ampere=3 A
Irrevelent=91 C
[Chapter.Title2]
Irrevelent=92 B
Volt=4 V
Watt=5 W
Ampere=6 A
Irrevelent=93 C"""
#that's just the text
import re
rx1=r'Chapter.(.*?)\]'
rxv1=r'Volt=(\d+)'
rxv2=r'Watt=(\d+)'
rxv3=r'Ampere=(\d+)'
res1 = re.findall(rx1, txt)
resv1 = re.findall(rxv1, txt)
resv2 = re.findall(rxv2, txt)
resv3 = re.findall(rxv3, txt)
print(res1)
print(resv1)
print(resv2)
print(resv3)
Here you get the titles and the interesting values you want :
['Title1', 'Title2']
['1', '4']
['2', '5']
['3', '6']
You can then use them as you want, for example :
for title_index in range(len(res1)):
    print(res1[title_index])
    value=int(resv1[title_index])+int(resv2[title_index])+int(resv3[title_index])
    #use float() instead of int() if you have non integer values
    print("the value is:", value)
You get :
Title1
the value is: 6
Title2
the value is: 15
Or you can store them in a dictionary or another structure, for example :
#dict(zip(keys, values))
data= dict(zip(res1, [int(resv1[i])+int(resv2[i])+int(resv3[i]) for i in range(len(res1))] ))
print(data)
You get :
{'Title1': 6, 'Title2': 15}
Edit : added opening of the file
import re
with open('filename.txt', 'r') as file:
    txt = file.read()
rx1=r'Chapter.(.*?)\]'
rxv1=r'Volt=([0-9]+(?:\.[0-9]+)?)'
rxv2=r'Watt=([0-9]+(?:\.[0-9]+)?)'
rxv3=r'Ampere=([0-9]+(?:\.[0-9]+)?)'
res1 = re.findall(rx1, txt)
resv1 = re.findall(rxv1, txt)
resv2 = re.findall(rxv2, txt)
resv3 = re.findall(rxv3, txt)
data= dict(zip(res1, [float(resv1[i])+float(resv2[i])+float(resv3[i]) for i in range(len(res1))] ))
print(data)
Edit 2 : ignoring missing values
import re
with open('filename.txt', 'r') as file:
    txt = file.read()
#divide the text into parts starting with "chapter"
substr = "Chapter"
chunks_idex = [_.start() for _ in re.finditer(substr, txt)]
chunks = [txt[chunks_idex[i]:chunks_idex[i+1]-1] for i in range(len(chunks_idex)-1)]
chunks.append(txt[chunks_idex[-1]:]) #add the last chunk
#print(chunks)
keys=[]
values=[]
rx1=r'Chapter.(.*?)\]'
rxv1=r'Volt=([0-9]+(?:\.[0-9]+)?)'
rxv2=r'Watt=([0-9]+(?:\.[0-9]+)?)'
rxv3=r'Ampere=([0-9]+(?:\.[0-9]+)?)'
for chunk in chunks:
    res1 = re.findall(rx1, chunk)
    resv1 = re.findall(rxv1, chunk)
    resv2 = re.findall(rxv2, chunk)
    resv3 = re.findall(rxv3, chunk)
    # check if we can find all of them by checking if the lists are not empty
    if res1 and resv1 and resv2 and resv3 :
        keys.append(res1[0])
        values.append(float(resv1[0])+float(resv2[0])+float(resv3[0]))
data= dict(zip(keys, values ))
print(data)

Here's a quick and dirty way to do this, reading line by line, if the input file is predictable enough.
In the example I just print out the titles and the values; you can of course process them however you want.
f = open('file.dat','r')
for line in f.readlines():
    ## Catch the title of the line:
    if '[Chapter' in line:
        print(line[9:-2])
    ## Catch the values of the Volt, Watt and Ampere parameters
    elif line[:4] in ['Volt','Watt','Ampe']:
        value = line[line.index('=')+1:line.index(' ')]
        print(value)
    ## If the line is "Irrelevant" or blank, do nothing
f.close()

There are many ways to achieve this. Here's one:
d = dict()
V = {'Volt', 'Watt', 'Ampere'}
with open('chapter.txt', encoding='utf-8') as f:
    key = None
    for line in f:
        if line.startswith('[Chapter'):
            d[key := line.strip()] = 0
        elif key and len(t := line.split('=')) > 1 and t[0] in V:
            d[key] += float(t[1].split()[0])
for k, v in d.items():
    if v > 0:
        print(f'Total for {k} = {v}')
Output:
Total for [Chapter.Title1] = 6.0
Total for [Chapter.Title2] = 15.0
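Because the file in the question has an INI-like shape (bracketed section headers followed by key=value lines), the standard library's configparser may be another option. The following is a minimal sketch under that assumption, not a tested solution for the real file: chapter_totals, SAMPLE and WANTED are hypothetical names, strict=False is used so the repeated "Irrevelent" keys do not raise an error, and the numeric value is assumed to be the first whitespace-separated token after the = sign.

```python
import configparser

# The sample data from the question, inlined so the sketch is self-contained.
SAMPLE = """\
[Chapter.Title1]
Irrevelent=90 B
Volt=0.10 ienl
Watt=2 W
Ampere=3 A
Irrevelent=91 C
[Chapter.Title2]
Irrevelent=999
Irrevelent=999
[Chapter.Title3]
Irrevelent=92 B
Volt=0.20 ienl
Watt=5 W
Ampere=6 A
Irrevelent=93 C
"""

# configparser lower-cases option names by default.
WANTED = ("volt", "watt", "ampere")


def chapter_totals(text):
    """Return {title: Volt + Watt + Ampere} for sections that have all three."""
    # strict=False tolerates duplicate keys within one section.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(text)
    totals = {}
    for section in parser.sections():
        options = parser[section]
        if all(name in options for name in WANTED):
            title = section.split(".", 1)[-1]  # "Chapter.Title1" -> "Title1"
            totals[title] = sum(float(options[name].split()[0]) for name in WANTED)
    return totals


totals = chapter_totals(SAMPLE)
for title, total in totals.items():
    print(f"{title}: {total:.2f}")
```

For a file on disk, parser.read(filename) could replace read_string. Sections that lack any of the three keys, such as Title2, are skipped.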

Python extract string starting with index up to character

Say I have an incoming string that varies a little:
" 1 |r|=1.2e10 |v|=2.4e10"
" 12 |r|=-2.3e10 |v|=3.5e-04"
"134 |r|= 3.2e10 |v|=4.3e05"
I need to extract the numbers (i.e. 1.2e10, 3.5e-04, etc.), so I would like to start at the end of '|r|' and grab all characters up to the ' ' (space) after it. Same for '|v|'.
I've been looking for something that would:
Extract a substring from a string, starting at an index and ending on a specific character...
But have not found anything remotely close.
Ideas?
NOTE: Added new scenario, which is the one that is causing lots of head-scratching...

To keep it elegant and generic, let's utilize split:
First, we split on ' |' into tokens
Then we check whether each token has an equals sign and parse the key-value pair
sabich = "134 |r| = 3.2e10 |v|=4.3e05"
parts = sabich.split(' |')
values = {}
for p in parts:
    if '=' in p:
        k, v = p.split('=')
        values[k.replace('|', '').strip()] = v.strip(' ')
# {'r': '3.2e10', 'v': '4.3e05'}
print(values)
This can be converted to the one-liner:
sabich = "134 |r| = 3.2e10 |v|=4.3e05"
values = {t[0].replace('|', '').strip() : t[1].strip(' ') for t in [tuple(p.split('=')) for p in sabich.split(' |') if '=' in p]}
# {'r': '3.2e10', 'v': '4.3e05'}
print(values)

You can solve it with a regular expression.
import re
strings = [
    " 1 |r|=1.2e10 |v|=2.4e10",
    " 12 |r|=-2.3e10 |v|=3.5e-04"
]
out = []
pattern = r'(?P<name>\|[\w]+\|)=(?P<value>-?\d+(?:\.\d*)(?:e-?\d*)?)'
for s in strings:
    out.append(dict(re.findall(pattern, s)))
print(out)
Output
[{'|r|': '1.2e10', '|v|': '2.4e10'}, {'|r|': '-2.3e10', '|v|': '3.5e-04'}]
And if you want to convert the strings to numbers
out = []
pattern = r'(?P<name>\|[\w]+\|)=(?P<value>-?\d+(?:\.\d*)(?:e-?\d*)?)'
for s in strings:
    # out.append(dict(re.findall(pattern, s)))
    out.append({
        name: float(value)
        for name, value in re.findall(pattern, s)
    })
Output
[{'|r|': 12000000000.0, '|v|': 24000000000.0}, {'|r|': -23000000000.0, '|v|': 0.00035}]
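The pattern above requires the value to follow the = sign directly, so it would miss the third string from the question ("|r|= 3.2e10"). A possible variation is sketched below; it is an assumption-laden sketch rather than a drop-in replacement: PATTERN and extract are hypothetical names, whitespace is allowed on either side of the = sign, the fractional part is optional, and the keys are returned without the surrounding | characters.

```python
import re

# |name| followed by '=', optional spaces, then a signed decimal with optional exponent.
PATTERN = re.compile(r"\|(\w+)\|\s*=\s*([-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?)")


def extract(line):
    """Return {name: float(value)} for every |name|=value pair found in line."""
    return {name: float(value) for name, value in PATTERN.findall(line)}


for line in (
    " 1 |r|=1.2e10 |v|=2.4e10",
    " 12 |r|=-2.3e10 |v|=3.5e-04",
    "134 |r|= 3.2e10 |v|=4.3e05",
):
    print(extract(line))
```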

How do i get this code to count words not letters?

I am fairly new to Python and I am trying to get this code to open txt files, strip the punctuation from them, read those files, create a list of the words, and then count the occurrences of each word. Instead, it is counting the occurrences of letters. Also, how do you properly call functions within other functions?
import os
# create the dictionary
dictionary = {}
# create dictionary list
dictionarylist = []
def make_a_listh():
    path = 'data/training/'
    Heal = path + 'Health/'
    heal_files = os.listdir(Heal)
    # print(heal_files)
    punctuations = '''!()-—[]{};:'"\,<>.|/?##$%^&*_~'''
    no_puncth = ""
    line = "-----------------------------------------------------------------------------"
    for j in heal_files:
        file2 = open(Heal + j, 'r').read()
        for char in file2:
            if char not in punctuations:
                no_puncth = no_puncth + char
        print(j + line, "\n", no_puncth)
def make_a_listm():
    path = 'data/training/'
    Minn = path + 'Minnesota/'
    minn_files = os.listdir(Minn)
    # print the filename and a new line
    punctuations = '''!()—-—[]{};:’'"\,<>.|/?#“#$%^&*_~'''
    no_punctm = ""
    line = "----------------------------------------------------------------------------"
    for i in minn_files:
        file1 = open(Minn + i, 'r')
        for char in file1:
            if char not in punctuations:
                no_punctm = no_punctm + char
        # print(i + line, "\n", no_punctm.replace('"',''))
    return no_punctm
def Freq(file1):
    # as long as there is a line in file loop
    for line in file1:
        # create variable to hold each word from the file
        words = line.split()
        # as long as there is a word in words loop
        for eachword in words:
            # if there is an existing word in dictionary increase occurrence count
            if eachword in dictionary:
                dictionary[eachword] = dictionary[eachword] + 1
            # if there is a word that is new set count to 1
            else:
                dictionary[eachword] = 1
    # for every item (k and v) in dictionary, loop
    for k, v in dictionary.items():
        # create temporary place holder for v and k values
        temporary = [v, k]
        # (add) temporary values to dictionaryList
        dictionarylist.append(temporary)
    # print out each value from dictionaryList in descending order on new lines
    print("\n".join(map(str, sorted(dictionarylist, reverse=True))))
Freq(file1=make_a_listm())

Here is how you can use the Counter class from the collections module, and how you can use re.sub() to handle the punctuation more efficiently:
from glob import glob
import re
from collections import Counter
words = []
for file in glob("C:\\Users\\User\\Desktop\\Folder\\*.txt"): # For every file in Folder that ends with .txt
    with open(file, 'r') as r: # Open the file in read mode
        nopunc = re.sub(r'\W', ' ', r.read()) # Use re.sub to replace all punctuation with spaces
        words += [w.strip().lower() for w in nopunc.split() if w.strip()] # List all the words in lower case, and add the list to words
print(Counter(words)) # prints out a Counter (a dict subclass) with each unique word as a key and its frequency as the value
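As for why the question's code counts letters: make_a_listm() returns one long string, and Freq() then loops over it with "for line in file1". Iterating over a str yields single characters, so every "line" is one letter. The small sketch below (with a made-up sample sentence) shows the difference between iterating over a string and splitting it first. It also hints at the second part of the question: one function's return value is simply passed as an argument to the next call.

```python
from collections import Counter

text = "the cat and the hat"  # made-up sample text

# Iterating over a str yields characters, so this counts letters.
letter_counts = Counter(ch for ch in text if not ch.isspace())

# Splitting on whitespace first yields whole words, so this counts words.
word_counts = Counter(text.split())

print(letter_counts.most_common(3))
print(word_counts.most_common(2))
```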

Creating a dictionary from FASTA file

I have a file that looks like this:
%Labelinfo
string1
string2
%Labelinfo2
string3
string4
string5
I would like to create a dictionary whose keys are the %Labelinfo strings and whose values are the concatenation of the strings between one Labelinfo line and the next. Basically this :
{%Labelinfo : string1+string2 , %Labelinfo : string2+string3+string4}
The problem is that there can be any number of lines between two "Labelinfo" lines. For example, between %Labelinfo and %Labelinfo2 there can be 5 lines. Then, between %Labelinfo2 and %Labelinfo3 there can be, let's say, 4 lines.
However, the line that contains "Labelinfo" always starts with the same character, for example %.
How do I solve this problem?

#!/usr/bin/env python
# coding:utf-8
'''黄哥Python'''
d = {}
with open('Labelinfo.txt') as f:
    for line in f:
        if len(line) > 1:
            if '%Labelinf' in line:
                key = line.strip()
                d[key] = ""
            else:
                d[key] += line.strip() + "+"
d = {key: d[key][:-1] for key in d}
print(d)
{'%Labelinfo2': 'string3+string4+string5', '%Labelinfo': 'string1+string2'}

Here's how I would write it:
The program loops through every line in the file and checks whether the line is empty; if it is, the line is ignored. If it isn't empty, we process it. Anything with a % at the start denotes a label, so we add it to the dictionary as a key and remember it in a variable, current. Then we keep appending to the dictionary at key current until the next % line.
di = {}
with open("fasta.txt","r") as f:
    current = ""
    for line in f:
        line = line.strip()
        if line == "":
            continue
        if line[0] == "%":
            di[line] = ""
            current = line
        else:
            if di[current] == "":
                di[current] = line
            else:
                di[current] += "+" + line
print(di)
Output:
{'%Labelinfo2': 'string3+string4+string5', '%Labelinfo': 'string1+string2'}
Note: In Python versions before 3.7, dictionaries do not preserve insertion order, so the keys may print out of order, but they are still accessible in the same way. And, just a heads up, your example output is slightly wrong: you forgot to put in the 2 after one of the %Labelinfo.
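The approach described above (collect lines under the most recent % label) can also be sketched by gathering each label's lines in a list and joining them at the end, which avoids the special case for the first line. This is a minimal sketch: parse_labeled_blocks is a hypothetical name, and the sample lines are taken from the question rather than read from a file.

```python
def parse_labeled_blocks(lines, marker="%", joiner="+"):
    """Group non-empty lines under the most recent line starting with marker."""
    groups = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(marker):
            current = line                # start a new label
            groups[current] = []
        elif current is not None:
            groups[current].append(line)  # ignore lines before the first label
    return {label: joiner.join(parts) for label, parts in groups.items()}


sample = """%Labelinfo
string1
string2
%Labelinfo2
string3
string4
string5
""".splitlines()

print(parse_labeled_blocks(sample))
```

With a real file, the open file object could be passed in place of sample, since it also yields lines when iterated.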

import re
d = {}
text = open('fasta.txt').read()
for el in [ x for x in re.split(r'\s+', text) if x]:
    if el.startswith('%'):
        key = el
        d[key] = ''
    else:
        value = d[key] + el
        d[key] = value
print(d)
{'%Labelinfo': 'string1string2', '%Labelinfo2': 'string3string4string5'}

How to parse data in a variable length delimited file?

I have a text file which does not conform to standards. So I know the (end,start) positions of each column value.
Sample text file :
#     #   #   #
Techy Inn Val NJ
Found the position of # using this code :
f = open('sample.txt', 'r')
i = 0
positions = []
for line in f:
    if line.find('#') >= 0:
        print(line)
        for each in line:
            i += 1
            if each == '#':
                positions.append(i)
1 7 11 15 => Positions
So far, so good! Now, how do I fetch the values from each row based on the positions I fetched? I am trying to construct an efficient loop but any pointers are greatly appreciated guys! Thanks (:

Here's a way to read fixed width fields using regexp
>>> import re
>>> s="Techy Inn Val NJ"
>>> var1,var2,var3,var4 = re.match("(.{5}) (.{3}) (.{3}) (.{2})",s).groups()
>>> var1
'Techy'
>>> var2
'Inn'
>>> var3
'Val'
>>> var4
'NJ'
>>>

Off the top of my head:
f = open(.......)
header = next(f) # get first line
posns = [i for i, c in enumerate(header + "#") if c == '#']
for line in f:
    fields = [line[posns[k]:posns[k+1]] for k in range(len(posns) - 1)]
Update with tested, fixed code:
import sys
f = open(sys.argv[1])
header = next(f) # get first line
print(repr(header))
posns = [i for i, c in enumerate(header) if c == '#'] + [-1]
print(posns)
for line in f:
    posns[-1] = len(line)
    fields = [line[posns[k]:posns[k+1]].rstrip() for k in range(len(posns) - 1)]
    print(fields)
Input file:
#      #  #
Foo    BarBaz
123456789abcd
Debug output:
'#      #  #\n'
[0, 7, 10, -1]
['Foo', 'Bar', 'Baz']
['1234567', '89a', 'bcd']
Robustification notes:
This solution caters for any old rubbish (or nothing) after the last # in the header line; it doesn't need the header line to be padded out with spaces or anything else.
The OP needs to consider whether it's an error if the first character of the header is not #.
Each field has trailing whitespace stripped; this automatically removes a trailing newline from the rightmost field (and doesn't run amok if the last line is not terminated by a newline).
Final(?) update: Leapfrogging @gnibbler's suggestion to use slice(): set up the slices once before looping.
import sys
f = open(sys.argv[1])
header = next(f) # get first line
print(repr(header))
posns = [i for i, c in enumerate(header) if c == '#']
print(posns)
slices = [slice(lo, hi) for lo, hi in zip(posns, posns[1:] + [None])]
print(slices)
for line in f:
    fields = [line[sl].rstrip() for sl in slices]
    print(fields)
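The slice-based approach above can be wrapped into two small helpers that work on strings instead of a file. This is a minimal sketch, not the answerer's code: make_slices and split_fixed are hypothetical names, and the header string assumes '#' markers at columns 0, 6, 10 and 14, which matches the positions 1 7 11 15 reported in the question.

```python
def make_slices(header, marker="#"):
    """Build one slice per field; each field starts at a marker in the header."""
    starts = [i for i, c in enumerate(header) if c == marker]
    # The last slice is open-ended so the final field runs to the end of the line.
    return [slice(lo, hi) for lo, hi in zip(starts, starts[1:] + [None])]


def split_fixed(line, slices):
    """Cut a data line into fields and strip trailing whitespace/newlines."""
    return [line[sl].rstrip() for sl in slices]


header = "#     #   #   #"   # assumed column layout
slices = make_slices(header)
print(split_fixed("Techy Inn Val NJ\n", slices))
```

Building the slices once and reusing them for every line keeps the per-line work down to plain slicing.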

Adapted from John Machin's answer
>>> header = "#     #   #   #"
>>> row = "Techy Inn Val NJ"
>>> posns = [i for i, c in enumerate(header) if c == '#']
>>> [row[slice(*x)] for x in zip(posns, posns[1:]+[None])]
['Techy ', 'Inn ', 'Val ', 'NJ']
You can also write the last line like this
>>> [row[i:j] for i,j in zip(posns, posns[1:]+[None])]
For the other example you give in the comments, you just need to have the correct header
>>> header = "#       #     #     #"
>>> row = "Techiyi Iniin Viial NiiJ"
>>> posns = [i for i, c in enumerate(header) if c == '#']
>>> [row[slice(*x)] for x in zip(posns, posns[1:]+[None])]
['Techiyi ', 'Iniin ', 'Viial ', 'NiiJ']
>>>

OK, to be a little different and to provide the generalized solution asked for in the comments, I use the header line directly, with a generator function instead of slices. Additionally, the first columns can be treated as a comment by leaving their part of the header blank, and field names can be multi-character instead of only '#'.
The downside is that single-character fields cannot have header names; they can only be marked with '#' in the header line (which, as in the previous solutions, is always treated as the beginning of a field, even directly after letters in the header).
sample="""
 HOTEL CAT ST DEP ##
Test line Techy Inn Val NJ FT FT
"""
data=sample.splitlines()[1:]
def fields(header,line):
    previndex=0
    prevchar=''
    for index,char in enumerate(header):
        if char == '#' or (prevchar != char and prevchar == ' '):
            if previndex or header[0] != ' ':
                yield line[previndex:index]
            previndex=index
        prevchar = char
    yield line[previndex:]
header,dataline = data
print(list(fields(header,dataline)))
Output
['Techy Inn ', 'Val ', 'NJ ', 'FT ', 'F', 'T']
One practical use of this is parsing fixed-width data without knowing the field lengths: take a copy of a data line in which all fields are present, blank out any comment column, replace spaces inside fields with something else such as '_', and replace single-character field values with #. The result serves as the header.
Header from sample line:
' Techy_Inn Val NJ FT ##'

def parse(your_file):
    first_line = next(your_file).rstrip()
    slices = []
    start = None
    for e, c in enumerate(first_line):
        if c != '#':
            continue
        if start is None:
            start = e
            continue
        slices.append(slice(start, e))
        start = e
    if start is not None:
        slices.append(slice(start, None))
    for line in your_file:
        parsed = [line[s] for s in slices]
        yield parsed

import re
f = open('sample.txt', 'r')
pos = [m.span() for m in re.finditer(r'#\s*', next(f))]
pos[-1] = (pos[-1][0], None)
for line in f:
    print([line[i:j].strip() for i, j in pos])
f.close()

How about this?
with open('somefile','r') as source:
    line = next(source)
    # each field is as wide as its '#' plus everything up to the next '#'
    sizes = [len(part) + 1 for part in line.split("#")[1:]]
    positions = [(sum(sizes[:x]), sum(sizes[:x+1])) for x in range(len(sizes))]
    for line in source:
        fields = [line[start:end] for start, end in positions]
Is this what you're looking for?
