How to parse data in a variable length delimited file? - python

I have a text file which does not conform to standards, but I know the (start, end) positions of each column value.
Sample text file:
#     #   #   #
Techy Inn Val NJ
I found the positions of # using this code:
f = open('sample.txt', 'r')
i = 0
positions = []
for line in f:
    if line.find('#') != -1:
        print line
        for each in line:
            i += 1
            if each == '#':
                positions.append(i)
1 7 11 15 => Positions
So far, so good! Now, how do I fetch the values from each row based on the positions I fetched? I am trying to construct an efficient loop but any pointers are greatly appreciated guys! Thanks (:

Here's a way to read fixed width fields using a regexp:
>>> import re
>>> s="Techy Inn Val NJ"
>>> var1,var2,var3,var4 = re.match("(.{5}) (.{3}) (.{3}) (.{2})",s).groups()
>>> var1
'Techy'
>>> var2
'Inn'
>>> var3
'Val'
>>> var4
'NJ'
>>>
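For the same fixed widths (5, 3, 3, 2 separated by single spaces), the standard library's struct module is an alternative worth knowing; a small sketch, not part of the original answer:

```python
import struct

s = "Techy Inn Val NJ"
# "5s1x3s1x3s1x2s": 5-char field, skip 1 byte, 3-char field, ... (the x's eat the spaces)
fields = struct.unpack("5s1x3s1x3s1x2s", s.encode())
print([f.decode() for f in fields])  # ['Techy', 'Inn', 'Val', 'NJ']
```

This avoids writing out a regexp but requires the total format width to match the line length exactly.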

Off the top of my head:
f = open(.......)
header = f.next()  # get first line
posns = [i for i, c in enumerate(header + "#") if c == '#']
for line in f:
    fields = [line[posns[k]:posns[k+1]] for k in xrange(len(posns) - 1)]
Update with tested, fixed code:
import sys
f = open(sys.argv[1])
header = f.next()  # get first line
print repr(header)
posns = [i for i, c in enumerate(header) if c == '#'] + [-1]
print posns
for line in f:
    posns[-1] = len(line)
    fields = [line[posns[k]:posns[k+1]].rstrip() for k in xrange(len(posns) - 1)]
    print fields
Input file:
#      #  #
Foo    BarBaz
123456789abcd
Debug output:
'# # #\n'
[0, 7, 10, -1]
['Foo', 'Bar', 'Baz']
['1234567', '89a', 'bcd']
Robustification notes:
This solution caters for any old rubbish (or nothing) after the last # in the header line; it doesn't need the header line to be padded out with spaces or anything else.
The OP needs to consider whether it's an error if the first character of the header is not #.
Each field has trailing whitespace stripped; this automatically removes a trailing newline from the rightmost field (and doesn't run amok if the last line is not terminated by a newline).
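A tiny self-contained sketch of that last point (hypothetical data, Python 3 print; same slicing idea as the code above):

```python
# rstrip() trims both the padding spaces and the final newline of the rightmost field.
line = "Foo    Bar89a\n"
posns = [0, 7, 10, len(line)]
fields = [line[posns[k]:posns[k+1]].rstrip() for k in range(len(posns) - 1)]
print(fields)  # ['Foo', 'Bar', '89a']
```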
Final(?) update: Leapfrogging @gnibbler's suggestion to use slice(): set up the slices once before looping.
import sys
f = open(sys.argv[1])
header = f.next()  # get first line
print repr(header)
posns = [i for i, c in enumerate(header) if c == '#']
print posns
slices = [slice(lo, hi) for lo, hi in zip(posns, posns[1:] + [None])]
print slices
for line in f:
    fields = [line[sl].rstrip() for sl in slices]
    print fields

Adapted from John Machin's answer
>>> header = "#     #   #   #"
>>> row = "Techy Inn Val NJ"
>>> posns = [i for i, c in enumerate(header) if c == '#']
>>> [row[slice(*x)] for x in zip(posns, posns[1:]+[None])]
['Techy ', 'Inn ', 'Val ', 'NJ']
You can also write the last line like this
>>> [row[i:j] for i,j in zip(posns, posns[1:]+[None])]
For the other example you give in the comments, you just need to have the correct header
>>> header = "#       #     #     #"
>>> row = "Techiyi Iniin Viial NiiJ"
>>> posns = [i for i, c in enumerate(header) if c == '#']
>>> [row[slice(*x)] for x in zip(posns, posns[1:]+[None])]
['Techiyi ', 'Iniin ', 'Viial ', 'NiiJ']
>>>

OK, to be a little different and to give the generalized solution asked for in the comments, I use the header line instead of slices, plus a generator function. Additionally I have allowed the first column to be a comment, by not putting a field name in the first column, and allowed multi-character field names instead of only '#'.
The minus point is that one-character fields cannot have header names, only a '#' in the header line (a '#' is always considered, as in the previous solutions, to begin a field, even right after letters in the header).
sample = """
          HOTEL     CAT ST DEP##
Test line Techy Inn Val NJ FT FT
"""
data = sample.splitlines()[1:]

def fields(header, line):
    previndex = 0
    prevchar = ''
    for index, char in enumerate(header):
        if char == '#' or (prevchar != char and prevchar == ' '):
            if previndex or header[0] != ' ':
                yield line[previndex:index]
            previndex = index
        prevchar = char
    yield line[previndex:]

header, dataline = data
print list(fields(header, dataline))
Output
['Techy Inn ', 'Val ', 'NJ ', 'FT ', 'F', 'T']
One practical use of this is parsing fixed-field-length data without knowing the field lengths: just take a copy of a data line that has all fields present and no comment, replace spaces inside values with something else like '_', and replace single-character field values with '#'.
Header from sample line:
' Techy_Inn Val NJ FT ##'

def parse(your_file):
    first_line = your_file.next().rstrip()
    slices = []
    start = None
    for e, c in enumerate(first_line):
        if c != '#':
            continue
        if start is None:
            start = e
            continue
        slices.append(slice(start, e))
        start = e
    if start is not None:
        slices.append(slice(start, None))
    for line in your_file:
        parsed = [line[s] for s in slices]
        yield parsed

import re

f = open('sample.txt', 'r')
pos = [m.span() for m in re.finditer(r'#\s*', f.next())]
pos[-1] = (pos[-1][0], None)
for line in f:
    print [line[i:j].strip() for i, j in pos]
f.close()

How about this?
with open('somefile', 'r') as source:
    line = source.next()
    sizes = map(len, line.split("#"))[1:]
    positions = [(sum(sizes[:x]), sum(sizes[:x+1])) for x in xrange(len(sizes))]
    for line in source:
        fields = [line[start:end] for start, end in positions]
Is this what you're looking for?

Related

Find specific values in a txt file and adding them up with python

I have a txt file which looks like this:
[Chapter.Title1]
Irrevelent=90 B
Volt=0.10 ienl
Watt=2 W
Ampere=3 A
Irrevelent=91 C
[Chapter.Title2]
Irrevelent=999
Irrevelent=999
[Chapter.Title3]
Irrevelent=92 B
Volt=0.20 ienl
Watt=5 W
Ampere=6 A
Irrevelent=93 C
What I want is for it to catch "Title1" and the values "0.1", "2" and "3", then add them up (which would be 5.1).
I don't care about the lines with "Irrevelent" at the beginning.
And then the same with the third block: catching "Title3" and adding "0.2", "5" and "6".
The second block with "Title2" does not contain "Volt", "Watt" and "Ampere" and is therefore not relevant.
Can anyone please help me out with this?
Thank you and cheers
You can use regular expressions to get the values and the titles in lists, then use them.
txt = """[Chapter.Title1]
Irrevelent=90 B
Volt=1 V
Watt=2 W
Ampere=3 A
Irrevelent=91 C
[Chapter.Title2]
Irrevelent=92 B
Volt=4 V
Watt=5 W
Ampere=6 A
Irrevelent=93 C"""
#that's just the text
import re
rx1 = r'Chapter\.(.*?)\]'
rxv1 = r'Volt=(\d+)'
rxv2 = r'Watt=(\d+)'
rxv3 = r'Ampere=(\d+)'
res1 = re.findall(rx1, txt)
resv1 = re.findall(rxv1, txt)
resv2 = re.findall(rxv2, txt)
resv3 = re.findall(rxv3, txt)
print(res1)
print(resv1)
print(resv2)
print(resv3)
Here you get the titles and the interesting values you want:
['Title1', 'Title2']
['1', '4']
['2', '5']
['3', '6']
You can then use them as you want, for example :
for title_index in range(len(res1)):
    print(res1[title_index])
    value = int(resv1[title_index]) + int(resv2[title_index]) + int(resv3[title_index])
    # use float() instead of int() if you have non-integer values
    print("the value is:", value)
You get :
Title1
the value is: 6
Title2
the value is: 15
Or you can store them in a dictionary or another structure, for example:
# dict(zip(keys, values))
data = dict(zip(res1, [int(resv1[i]) + int(resv2[i]) + int(resv3[i]) for i in range(len(res1))]))
print(data)
You get :
{'Title1': 6, 'Title2': 15}
Edit : added opening of the file
import re

with open('filename.txt', 'r') as file:
    txt = file.read()

rx1 = r'Chapter\.(.*?)\]'
rxv1 = r'Volt=([0-9]+(?:\.[0-9]+)?)'
rxv2 = r'Watt=([0-9]+(?:\.[0-9]+)?)'
rxv3 = r'Ampere=([0-9]+(?:\.[0-9]+)?)'
res1 = re.findall(rx1, txt)
resv1 = re.findall(rxv1, txt)
resv2 = re.findall(rxv2, txt)
resv3 = re.findall(rxv3, txt)
data = dict(zip(res1, [float(resv1[i]) + float(resv2[i]) + float(resv3[i]) for i in range(len(res1))]))
print(data)
Edit 2 : ignoring missing values
import re

with open('filename.txt', 'r') as file:
    txt = file.read()

# divide the text into parts starting with "Chapter"
substr = "Chapter"
chunks_idex = [_.start() for _ in re.finditer(substr, txt)]
chunks = [txt[chunks_idex[i]:chunks_idex[i+1]-1] for i in range(len(chunks_idex)-1)]
chunks.append(txt[chunks_idex[-1]:])  # add the last chunk
# print(chunks)

keys = []
values = []
rx1 = r'Chapter\.(.*?)\]'
rxv1 = r'Volt=([0-9]+(?:\.[0-9]+)?)'
rxv2 = r'Watt=([0-9]+(?:\.[0-9]+)?)'
rxv3 = r'Ampere=([0-9]+(?:\.[0-9]+)?)'
for chunk in chunks:
    res1 = re.findall(rx1, chunk)
    resv1 = re.findall(rxv1, chunk)
    resv2 = re.findall(rxv2, chunk)
    resv3 = re.findall(rxv3, chunk)
    # check if we can find all of them by checking if the lists are not empty
    if res1 and resv1 and resv2 and resv3:
        keys.append(res1[0])
        values.append(float(resv1[0]) + float(resv2[0]) + float(resv3[0]))
data = dict(zip(keys, values))
print(data)
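Since the input is INI-like, the standard library's configparser offers yet another route; a sketch not in the original answer, under the assumption that strict=False is acceptable (it tolerates the duplicate Irrevelent keys, keeping the last one):

```python
import configparser

text = """[Chapter.Title1]
Irrevelent=90 B
Volt=0.10 ienl
Watt=2 W
Ampere=3 A
Irrevelent=91 C
[Chapter.Title2]
Irrevelent=999
Irrevelent=999
"""

cp = configparser.ConfigParser(strict=False)  # strict=False tolerates duplicate keys
cp.read_string(text)
for section in cp.sections():
    try:
        # option names are lower-cased by configparser; split()[0] drops the unit after the number
        total = sum(float(cp[section][k].split()[0]) for k in ("volt", "watt", "ampere"))
    except KeyError:
        continue  # section is missing Volt/Watt/Ampere, skip it
    print(section, total)  # e.g. "Chapter.Title1 5.1"
```

Sections without all three of Volt, Watt and Ampere (such as Title2 here) are skipped automatically.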
Here's a quick and dirty way to do this, reading line by line, if the input file is predictable enough.
In the example I just print out the titles and the values; you can of course process them however you want.
f = open('file.dat', 'r')
for line in f.readlines():
    ## Catch the title of the line:
    if '[Chapter' in line:
        print(line[9:-2])
    ## catch the values of the Volt, Watt, Ampere parameters
    elif line[:4] in ['Volt', 'Watt', 'Ampe']:
        value = line[line.index('=')+1:line.index(' ')]
        print(value)
    ## if the line is "Irrevelent", or blank, do nothing
f.close()
There are many ways to achieve this. Here's one:
d = dict()
V = {'Volt', 'Watt', 'Ampere'}
with open('chapter.txt', encoding='utf-8') as f:
    key = None
    for line in f:
        if line.startswith('[Chapter'):
            d[key := line.strip()] = 0
        elif key and len(t := line.split('=')) > 1 and t[0] in V:
            d[key] += float(t[1].split()[0])
for k, v in d.items():
    if v > 0:
        print(f'Total for {k} = {v}')
Output:
Total for [Chapter.Title1] = 6
Total for [Chapter.Title2] = 15

Find and write to next blank cell in a column

I need to find and write to next blank cell.
import csv
with open(r'C:\filepath\file.txt', 'r') as input_file:
    reader = csv.reader(input_file)
    with open(r'C:\filepath\file.csv', 'a', newline='') as output_file:
        writer = csv.writer(output_file)
        for row in reader:
            content = [i.split('~') for i in row]
            for row1 in content:
                con = [len(k.split('*')) for k in row1]
                conn = [m.split('*') for m in row1]
                for b in conn:
                    if con[0] > 4:
                        if b[0] == 'NM1' and b[1] == '82' and b[2] == '1':
                            writer.writerow([b[3]] + [b[4]])
                            print(b[3] + b[4])
                        elif b[0] == 'GS':
                            writer.writerow(['', '', '', b[2]])
                            print(b[2])
I am seeking to get the output as shown in the pic above. Right now only 'App1' is printing in the first row, then the names in the second row, etc. The input file I am using is below:
ISA*16* 00 0*T*>~
GS*IN*APP1*0999~
HPT*1*2~ SE*21*0001~
GE*1*145~
NM1*82*1*Tiger1a*Test1*K****~
NM1*82*1*Lion1a*Test2*K****~
NM1*82*1*Elephant1a*Test3*K****~
ISA*16* 00 0*T*>~
GS*IN*APP2*0999~
HPT*1*2~ SE*21*0001~
GE*1*145~
NM1*82*1*Tiger1a*Test4*K****~
ISA*16* 00 0*T*>~
GS*IN*APP1*0999~
HPT*1*2~
SE*21*0001~
GE*1*145~
NM1*82*1*Tiger1a*Test4*K****~
NM1*82*1*Lion1a*Test5*K****~
NM1*82*1*Elephant1a*Test6*K****~
ISA*16* 00 0*T*>~
GS*IN*APP10999~
HPT*1*2~
SE*21*0001~
GE*1*145~
NM1*82*1*Tiger1a*Test7*K****~
Ok, I assume that you have an input file where '~' is a record separator and '*' is a field separator. As the csv module only deals with lines I would first use a generator to split the input file on ~.
Then I would feed 2 lists, one with records starting with NM1*82*1 and containing a list of the 2 following fields, one with records starting with GS containing one single field.
Finally I would add each line of the second list to the corresponding line in the first one.
Code could be:
def splitter(fd, sep):
    """Splits fd (assumed to be an input file object) on sep, ignoring ends of lines"""
    last = ""
    for line in fd:
        lines = line.strip().split(sep)
        lines[0] = last + lines[0]
        last = lines.pop()
        for l in lines:
            yield l.strip()
    if last != "":
        yield last.strip()
    return
import csv

with open(r'C:\filepath\file.txt', 'r') as input_file, \
     open(r'C:\filepath\file.csv', 'a', newline='') as output_file:
    rd = csv.reader(splitter(input_file, '~'), delimiter='*')
    wr = csv.writer(output_file)
    ls1 = []
    ls2 = []
    for b in rd:
        if b[0] == 'NM1' and b[1] == '82' and b[2] == '1':
            ls1.append([b[3], b[4]])
        elif b[0] == 'GS':
            ls2.append(b[2])
    for i, b in enumerate(ls2):
        ls1[i].append(b)
    wr.writerows(ls1)
I obtain:
Tiger1a,Test1,APP1
Lion1a,Test2,APP2
Elephant1a,Test3,APP1
Tiger1a,Test4,APP10999
Tiger1a,Test4
Lion1a,Test5
Elephant1a,Test6
Tiger1a,Test7
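To see the splitter generator on its own, here is a self-contained sketch with io.StringIO standing in for the real file (hypothetical sample records):

```python
import io

def splitter(fd, sep):
    """Split fd on sep, ignoring end-of-line boundaries (same idea as the answer above)."""
    last = ""
    for line in fd:
        parts = line.strip().split(sep)
        parts[0] = last + parts[0]
        last = parts.pop()
        for p in parts:
            yield p.strip()
    if last != "":
        yield last.strip()

fd = io.StringIO("GS*IN*APP1*0999~NM1*82*1*Tiger1a*Test1~\nGE*1*145~\n")
print(list(splitter(fd, '~')))  # ['GS*IN*APP1*0999', 'NM1*82*1*Tiger1a*Test1', 'GE*1*145']
```

Note that the record boundary is the '~', not the newline, so records may span physical lines.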
Try reading the files into separate dictionaries with line numbers as keys. You can then iterate through both dictionaries at the same time using the zip function.
def zip(*iterables):
    # zip('ABCD', 'xy') --> Ax By
    sentinel = object()
    iterators = [iter(it) for it in iterables]
    while iterators:
        result = []
        for it in iterators:
            elem = next(it, sentinel)
            if elem is sentinel:
                return
            result.append(elem)
        yield tuple(result)
More info here: Python3 zip function
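A minimal sketch of that idea, with hypothetical in-memory lines standing in for the two files:

```python
# Hypothetical sample data standing in for the two input files.
lines_a = ["NM1*82*1*Tiger1a*Test1", "NM1*82*1*Lion1a*Test2"]
lines_b = ["GS*IN*APP1*0999", "GS*IN*APP2*0999"]

# Read each "file" into a dict keyed by line number.
dict_a = dict(enumerate(lines_a))
dict_b = dict(enumerate(lines_b))

# Iterate through both dictionaries at the same time with zip.
for (i, a), (_, b) in zip(sorted(dict_a.items()), sorted(dict_b.items())):
    print(i, a.split('*')[3], b.split('*')[2])
```

This pairs the name field of each NM1 record with the application field of the GS record on the same line number.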

How to combine the output of two files into single in python?

My code looks like this:
Right now my code outputs two text files named absorbance.txt and energy.txt separately. I need to modify it so that it outputs only one file named combined.txt, such that every line of combined.txt has two values separated by a comma. The first value must be from absorbance.txt and the second from energy.txt. (I apologize if anyone is confused by my writing; please comment if you need more clarification.)
import easygui  # needed for the file dialog below

g = open("absorbance.txt", "w")
h = open("Energy.txt", "w")
ask = easygui.fileopenbox()
f = open(ask, "r")
a = f.readlines()
bg = []
wavelength = []
for string in a:
    index_j = 0
    comma_count = 0
    for j in string:
        index_j += 1
        if j == ',':
            comma_count += 1
            if comma_count == 1:
                slicing_point = index_j
    t = string[slicing_point:]
    new_str = string[:(slicing_point - 1)]
    new_energ = (float(1239.842 / int(float(new_str))) * 8065.54)
    print >>h, new_energ

import math
list = []
for i in range(len(ref)):
    try:
        ans = ((float(ref[i]) - float(bg[i])) / (float(sample[i]) - float(bg[i])))
        print ans
        base = 10
        final_ans = (math.log(ans, base))
    except:
        ans = -1 * ((float(ref[i]) - float(bg[i])) / (float(sample[i]) - float(bg[i])))
        print ans
        base = 10
        final_ans = (math.log(ans, base))
    print >>g, final_ans
Similar to Robert's approach, but aiming to keep control flow as simple as possible.
absorbance.txt:
Hello
How are you
I am fine
Does anybody want a peanut?
energy.txt:
123
456
789
Code:
input_a = open("absorbance.txt")
input_b = open("energy.txt")
output = open("combined.txt", "w")

for left, right in zip(input_a, input_b):
    # use rstrip to remove the newline character from the left string
    output.write(left.rstrip() + ", " + right)

input_a.close()
input_b.close()
output.close()
combined.txt:
Hello, 123
How are you, 456
I am fine, 789
Note that the fourth line of absorbance.txt was not included in the result, because energy.txt does not have a fourth line to go with it.
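If the unmatched trailing lines should be kept rather than dropped, itertools.zip_longest (izip_longest on Python 2) pads the shorter file; a sketch with the same sample data inlined:

```python
from itertools import zip_longest

absorbance = ["Hello", "How are you", "I am fine", "Does anybody want a peanut?"]
energy = ["123", "456", "789"]

# fillvalue="" pads energy's missing fourth entry with an empty string
combined = [left + ", " + right if right else left
            for left, right in zip_longest(absorbance, energy, fillvalue="")]
for line in combined:
    print(line)
```

The fourth line now survives on its own, with no energy value attached.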
You can open both text files and append them to the new text file as shown below. This is what I gave based on your question, not necessarily the code your provided.
combined = open("Combined.txt","w")
with open(r'Engery.txt', "rU") as EnergyLine:
with open(r'Absorbance.txt', "rU") as AbsorbanceLine:
for line in EnergyLine:
Eng = line[:-1]
for line2 in AbsorbanceLine:
Abs = line2[:-1]
combined.write("%s,%s\n" %(Eng,Abs))
break
combined.close()

Delete and save duplicate in another file

In test.txt:
1 a
2 b
3 c
4 a
5 d
6 c
I want to remove duplicates and save the rest in test2.txt:
2 b
5 d
I tried to start with the codes below.
file1 = open('../test.txt').read().split('\n')
#file2 = open('../test2.txt', "w")
word = set()
for line in file1:
if line:
sline = line.split('\t')
if sline[1] not in word:
print sline[0], sline[1]
word.add(sline[1])
#file2.close()
The results from the code showed:
1 a
2 b
3 c
5 d
Any suggestion?
You can use collections.OrderedDict here:
>>> from collections import OrderedDict
with open('abc') as f:
    dic = OrderedDict()
    for line in f:
        v, k = line.split()
        dic.setdefault(k, []).append(v)
Now dic looks like:
OrderedDict([('a', ['1', '4']), ('b', ['2']), ('c', ['3', '6']), ('d', ['5'])])
Now we only need those keys which contain only one item in the list.
for k, v in dic.iteritems():
    if len(v) == 1:
        print v[0], k
...
2 b
5 d
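On Python 3 (where iteritems is gone and plain dicts keep insertion order), the same idea can be written with collections.Counter; a sketch with the sample rows inlined instead of read from a file:

```python
from collections import Counter

rows = ["1 a", "2 b", "3 c", "4 a", "5 d", "6 c"]
pairs = [line.split() for line in rows]
counts = Counter(letter for _, letter in pairs)

# keep only the lines whose letter occurs exactly once
for num, letter in pairs:
    if counts[letter] == 1:
        print(num, letter)  # prints "2 b" then "5 d"
```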
What you're doing is just making sure every second item (the letter) gets printed out only once, which is obviously not what you say you want.
You must split your code into two halves: reading and gathering statistics about letter counts, and a part which prints only those which have count == 1.
Converting your original code (I just made it a little simpler):
file1 = open('../test.txt')
words = {}
for line in file1:
    if line:
        line_num, letter = line.split('\t')
        if letter not in words:
            words[letter] = [1, line_num]
        else:
            words[letter][0] += 1
for letter, (count, line_num) in words.iteritems():
    if count == 1:
        print line_num, letter
I tried to keep it as similar to your style as possible:
file1 = open('../test.txt').read().split('\n')
word = set()
test = []
duplicate = []
sin_duple = []
num_lines = 0
num_duplicates = 0
for line in file1:
    if line:
        sline = line.split(' ')
        test.append(" ".join([sline[0], sline[1]]))
        if sline[1] not in word:
            word.add(sline[1])
            num_lines = num_lines + 1
        else:
            sin_duple.append(sline[1])
            duplicate.append(" ".join([sline[0], sline[1]]))
            num_lines = num_lines + 1
            num_duplicates = num_duplicates + 1
for i in range(0, num_lines + 1):
    for item in test:
        for j in range(0, num_duplicates):
            #print((str(i) + " " + str(sin_duple[j])))
            if item == (str(i) + " " + str(sin_duple[j])):
                test.remove(item)
file2 = open("../test2.txt", 'w')
for item in test:
    file2.write("%s\n" % item)
file2.close()
How about some Pandas?
import pandas as pd
a = pd.read_csv("test_remove_dupl.txt", sep=",")
# keep=False drops every row whose value in column "a" is duplicated
b = a.drop_duplicates(subset="a", keep=False)

Deleting surplus blank lines using Python

I want the Notepad++'s wonderful feature "Delete Surplus blank lines" in Python.
Say if I have a file like this:
A



B


C

D
I want:
A

B

C

D
What is the pythonic way of doing this?
Here is what I tried:
A = ['a', '\n', '\n', '\n', 'a', 'b', '\n', '\n', 'C', '\n', '\n', '\n', '\n', '\n', '\n', 'D']
B = []
count = 0
for l in range(0, len(A)):
    if A[l] == '\n':
        count = count + 1
    else:
        count = 0
    if count > 1:
        if A[l+1] == '\n':
            continue
        else:
            B.append('\n')
    else:
        if A[l] != '\n':
            B.append(A[l])
print B
Make sure there's no more than \n\n, e.g.:
import re
print re.sub('\n{3,}', '\n\n', your_string, flags=re.M)
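For example (Python 3 print; the same substitution as above):

```python
import re

text = "a\n\n\n\nb\n\nc"
# runs of three or more newlines collapse to exactly two (i.e. one blank line)
print(re.sub(r'\n{3,}', '\n\n', text))
```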
And, using itertools.groupby for large files:
from itertools import groupby

with open('your_file') as fin:
    for has_value, lines in groupby(fin, lambda L: bool(L.strip())):
        if not has_value:
            print
            continue
        for line in lines:
            print line,
Here is a one-liner:
In [35]: A=['a','\n','\n','\n','a','b','\n','\n','C','\n','\n','\n','\n','\n','\n','D']
In [36]: B = [A[0]] + [A[i] for i in range(1, len(A)) if A[i] != '\n' or A[i-1] != '\n']
In [37]: B
Out[37]: ['a', '\n', 'a', 'b', '\n', 'C', '\n', 'D']
It basically omits newlines that follow other newlines.
Is this what you are looking for?
>>> def delete_surplus_blank_lines(text):
        while '\n\n\n' in text:
            text = text.replace('\n\n\n', '\n\n')
        return text
>>> text = 'a\n\n\nab\n\nC\n\n\n\n\n\nD'
>>> print(text)
a


ab

C





D
>>> print(delete_surplus_blank_lines(text))
a

ab

C

D
>>>
A more efficient implementation (based on ideas from NPE) would be:
def delete_surplus_blank_lines(text):
    return text[:2] + ''.join(text[index] for index in range(2, len(text))
                              if text[index-2:index+1] != '\n\n\n')
A one-liner of that function is fairly easy to create with a lambda (note that a lambda body is an expression, so there is no return keyword):
delete_surplus_blank_lines = lambda text: text[:2] + ''.join(text[index] for index in range(2, len(text)) if text[index-2:index+1] != '\n\n\n')
You have a file, so let's define a function called clean_up to clean up the file you give:
def clean_up(file_name, blanks=1):
    with open(file_name, 'r+') as f:
        lines = f.readlines()  # read everything first; don't write while iterating the file
        f.seek(0)
        blank = 0
        for line in lines:
            if line == "\n":
                blank += 1
                if blank <= blanks:
                    f.write(line)
            else:
                blank = 0
                f.write(line)
        f.truncate()  # drop any leftover bytes from the old contents
Now this will iterate through your file, and make sure there are no more than blanks number of blank lines in a row!