Python parsing a text file and logical methods

I'm a bit stuck with python logic.
I'd like some advice on how to tackle a problem I'm having with Python and the methods for parsing data.
I've spent a bit of time reading the Python reference documents and going through this site. I understand there are several ways to do what I'm trying to achieve, and this is the path I've gone down.
I'm re-formatting some text files with data generated from some satellite hardware to be uploaded into a MySQL database.
This is the raw data:
TP N: 1
Frequency: 12288.635 Mhz
Symbol rate: 3000 KS
Polarization: Vertical
Spectrum: Inverted
Standard/Modulation: DVB-S2/QPSK
FEC: 1/2
RollOff: 0.20
Pilot: on
Coding mode: ACM/VCM
Short frame
Transport stream
Single input stream
RF-Level: -49 dBm
Signal/Noise: 6.3 dB
Carrier width: 3.600 Mhz
BitRate: 2.967 Mbit/s
The above section is repeated for each transponder (TP N) on the satellite.
I'm using this script to extract the data I need:
strings = ("Frequency", "Symbol", "Polar", "Mod", "FEC", "RF", "Signal", "Carrier", "BitRate")
sat_raw = open('/BLScan/reports/1520.txt', 'r')
sat_out = open('1520out.txt', 'w')
for line in sat_raw:
    if any(s in line for s in strings):
        for word in line.split():
            if ':' in word:
                sat_out.write(line.split(':')[-1])
sat_raw.close()
sat_out.close()
The output data is then formatted like this before it's sent to the database:
12288.635 Mhz
3000 KS
Vertical
DVB-S2/QPSK
1/2
-49 dBm
6.3 dB
3.600 Mhz
2.967 Mbit/s
This script is working fine, but for some features I want to implement in MySQL I need to edit the output further:
Remove the decimal point, the 3 digits after it, and "Mhz" from the first "Frequency" line.
Remove all the trailing measurement units (KS, dBm, dB, Mhz, Mbit).
Join the 9 fields into a comma-delimited string so that each transponder (approx. 30 per file) is on its own line.
I'm unsure whether to continue down this path, adding onto this existing script (I'm stuck at the point where the output file is written), or to rethink my approach to the way I'm processing the raw file.

My solution is crude and might not work in corner cases, but it is a good start.
import re
import csv

strings = ("Frequency", "Symbol", "Polar", "Mod", "FEC", "RF", "Signal", "Carrier", "BitRate")
sat_raw = open('/BLScan/reports/1520.txt', 'r')
sat_out = open('1520out.txt', 'w')
csv_writer = csv.writer(sat_out)
csv_output = []
for line in sat_raw:
    if any(s in line for s in strings):
        try:
            m = re.match(r'^.*:\s+(\S+)', line)
            value = m.groups()[0]
            # Attempt to convert to int, thus removing the decimal part
            value = int(float(value))
        except ValueError:
            pass  # Ignore conversion
        except AttributeError:
            pass  # Ignore case when m is None (no match)
        csv_output.append(value)
    elif line.startswith('TP N'):
        # Before we start a new set of values, write out the old set
        if csv_output:
            csv_writer.writerow(csv_output)
        csv_output = []
# If we reach the end of the file, don't miss the last set of values
if csv_output:
    csv_writer.writerow(csv_output)
sat_raw.close()
sat_out.close()
Discussion
The csv module helps with CSV output.
The re (regular expression) module helps parse each line and extract the value from it.
In the line that reads value = int(float(value)), we attempt to turn the string value into an integer, thus removing the dot and the following digits.
When the code encounters a line that starts with 'TP N', which signals a new set of values, we write out the old set of values to the CSV file.
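To illustrate how the pattern picks out the first token after the colon, here is a minimal standalone demo (the sample line is taken from the raw data above):
import re

line = "Frequency: 12288.635 Mhz"
m = re.match(r'^.*:\s+(\S+)', line)
print(m.group(1))              # '12288.635' -- the first token after the colon
print(int(float(m.group(1))))  # 12288 -- the decimal part is dropped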

import math

strings = ("Frequency", "Symbol", "Polar", "Mod", "FEC", "RF", "Signal", "Carrier", "BitRate")
files = ['/BLScan/reports/1520.txt']
sat_out = open('1520out.txt', 'w')
combineOutput = []
for myfile in files:
    sat_raw = open(myfile, 'r')
    singleOutput = []
    for line in sat_raw:
        if any(s in line for s in strings):
            marker = line.split(':')[1]
            try:
                # Numeric value: drop the decimal part
                data = str(int(math.floor(float(marker.split()[0]))))
            except ValueError:
                # Non-numeric value: keep the first word as-is
                data = marker.split()[0]
            singleOutput.append(data)
    sat_raw.close()
    combineOutput.append(",".join(singleOutput))
for rec in combineOutput:
    sat_out.write("%s\n" % rec)
sat_out.close()
Add all the files that you want to parse to the files list. It will write the output of each file as a separate line, with each field comma-separated.
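If the reports all live in one directory, the files list can also be built automatically; a small sketch (the directory path is an assumption based on the question):
import glob

# Hypothetical path from the question; picks up every .txt report in the directory.
files = sorted(glob.glob('/BLScan/reports/*.txt'))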

Python checking if line above or below equals to phrase

I am trying to make an automated monthly cost calculator for my family. The idea is that whenever they shop, they take a picture of the receipt and send it to an email address. A Python script downloads that picture and, using the Google Vision API, scans for the total amount, which then gets written into a .csv file for later use. (I have yet to make the CSV part, so it's only being saved into .txt files for now.)
This works because in my country the receipts all look the same due to regulations; however, the Google Vision API returns the OCRed text line by line. What I am trying to do now is check the text line by line for the total amount, which is always in the format (number space currency), and then check whether the OCR mixed something up, such as putting "Total amount" above or below the actual numbers.
My problem is that if I run this script on more than 3 .txt OCR files, it only gets the first 2 right, even though they look the same when I check them manually. If I run it on them one by one, it gets them right every time.
The OCR data looks like this:
Total amount:
1000 USD
or
1000 USD
Total amount:
My code so far:
import re
import os
import codecs

for files in os.listdir('texts/'):
    filedir = "texts/" + str(files)
    with codecs.open(filedir, 'rb', 'utf-8') as f:
        lines = f.readlines()
        lines = [l.strip() for l in lines]
        for index, line in enumerate(lines):
            match = re.search(r"(\d+) USD", line)
            if match:
                if lines[index + 1].endswith("USD"):
                    amount = re.sub(r'(\d)\s+(\d)', r'\1\2', lines[index])
                    amount = amount.replace(" USD", "")
                    print(amount)
                    with open('amount.txt', "a") as data:
                        data.write(amount)
                        data.write("\n")
                if lines[index - 1].endswith("USD"):
                    amount = re.sub(r'(\d)\s+(\d)', r'\1\2', lines[index])
                    amount = amount.replace(" USD", "")
                    print(amount)
                    with open('amount.txt', "a") as data:
                        data.write(amount)
                        data.write("\n")
Question: checking if line above or below equals to phrase
Simplify to the following:
Assumptions:
The amount line has the format (number space currency).
The exact phrase "Total amount:" always exists on the other line.
The two lines are separated by a blank line.
FILE1 = u"""Total amount:

1000 USD
"""
FILE2 = u"""1000 USD

Total amount:"""
import io
import os
import codecs

total = []
#for files in os.listdir('texts/'):
for files in [FILE1, FILE2]:
    # filedir="texts/"+str(files)
    # with codecs.open(filedir,'rb','utf-8') as f:
    with io.StringIO(files) as f:
        v1 = next(f).rstrip()
        # eat empty line
        next(f)
        v2 = next(f).rstrip()
        if v1 == 'Total amount:':
            total.append(v2.split()[0])
        else:
            total.append(v1.split()[0])
print(total)
# csv_writer.writerows(total)
Output:
[u'1000', u'1000']
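To run the same logic over the real OCR files instead of the inline samples, swap the commented-out lines back in; a sketch (the texts/ directory comes from the question, and the blank line between the two lines is still assumed):
import codecs
import os

total = []
for name in os.listdir('texts/'):
    with codecs.open('texts/' + name, 'r', 'utf-8') as f:
        v1 = next(f).rstrip()
        next(f)  # eat the assumed blank line
        v2 = next(f).rstrip()
        total.append((v2 if v1 == 'Total amount:' else v1).split()[0])
print(total)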

convert a .dat file with python

We have a .dat file that is completely unreadable in a text editor. This .dat file holds information about materials and is used on a machine which can display the data.
Now we are going to use the files for other purposes and need to convert them to a readable format.
When I open the .dat file in Notepad++, this is what I see:
2ñg½x¹¾T?B½ÛÁ#½^fÓ¼":°êȽ¸ô»YY‚½g *½$S)¼¤“è¼F„J¼c1¼$ ¼*‡Ç»½Ú7¼F]S¼Ê(Ï<(‚¤½Y´½å½#ø;N‡;o¸¼¨*S:ΣC¼ÎÀR½žO<š_å¼T÷½p4>½8«»«=ýÆZ<¿[=²”<æt¼pc»q³<×R<ï4¼}Ž‚8pýw<~ï»z†:Qš¼^Kp;XI=<Ѷ ¼
j½…é-=*Ý=;-X7½ßÓ:<ÐZ<Ás!=²LÀ;æã=võu<„4½§V9„燺ý$D<"Š|»å€,<E{=+»¥;2wN¼¸rF=h®<ç[=²=é\=Îý<…À¦¼Î,è¼u…<#_.¼¾Ã¨9æ3½Å°“<ª×½°ÇD¼JÝþ»ph{=Ÿv8;Ne¼’Q; ´{»(ì¿<6Þï»éõ¼*p½©m¼ÝM–<ròä¼½™™¼Õö=j|½±‰Í;2¥C¼¯ 輓?½>¼:„3» ­ù¼¦k
¼wÞ¹¼Öm‚»=T¼êy¦¼k[…»ÎÉO¼Žc¼$ï½ÖN;H¼4Ø:8¸ž¼dLý¼ø9ø»cI(;4뼈Q¼ž7½,h?¼À ɼy½Å’œ¼¶Åº¼å"±¼bžu¼ Z;½¨½øáY¼ZÖ»2
½ð^š<Þ„§<»ƒ<#±c<f<ŸPÝ;‹œlºÐöï»ö²ñ;ÜŠb=¦';f´<ò=¬3B<\mÛ¼¹©»åB<»Xô;€ºp»¸ ±¼‰Øâ¼7Ug¼€÷ø¼lËû»j}»²‘ô;wu½®ö²¼Ÿ„¼ŠÉ¼ÖV8 Š¼‹÷¯¼ål¼é°ª¼‹o4½ðî$<4Q:.A<
<Ž¬ë<^·G<n
œ<¶l<: è;’MÜ9êÁa<’¢T;~&¼gY®»"P¼¤µº;$H=½…o<6ëæ»ûÒ¼Ê,<‚p½¯À¼#êw»Ír¥¼¸wغA:«<TDI»Nºµ<€ŠMºwnܸ·6:CÕj<àÆ:Dr<7ëo9STÏ<G¼R?M<:)N;.3 <†L<ºZ=I,Y<ñF;iÙ.» pºå0<;:=Tʪ;—ÄË;?'й0Ž:J’J<jR¯»´/½Ô”ؼ•¥˜¼hμd™<9¼iˆ‘<(Šd<ɇÖ#·³È‚»#O><Úo<Ó¸ <ëî;ÒQ<õöî<#Nm¼öw4¼’O¼v <:3<
We know the data in the .dat file has the following format:
MaterialBase ThicknessBase ThicknessIterated Pixel Value
Plastic 0 0 1 -5.662651e-02
Plastic 0 0 2 -1.501216e-01
Plastic 0 0 3 -4.742368e-02
By searching through lots of code, here as well of course, I came up with the following code:
import time
import binascii
import csv
import serial
import numpy as np

with open('validationData.dat.201805271617', 'rb') as input_file:
    lines = input_file.readlines()
    newLines = []
    for line in lines:
        newLine = line.strip('|').split()
        newLines.append(newLine)
with open('file.csv', 'w') as output_file:
    file_writer = csv.writer(output_file)
    file_writer.writerows(newLines)
The error I get now is:
File "c:\Users\joost.bazelmans\Documents\python\dat2csv.py", line 15, in
<module>
newLine = line.strip('|').split()
TypeError: a bytes-like object is required, not 'str'
It looks like the script is reading the file, but then it cannot split it by the | character. I am lost now. Any ideas on how to continue?
Edit 2018-07-23 13:00
Based on guidot's reply, we tried to work with struct. We are now able to get a list of floating-point values out of the .dat file. This is also what the R script does, but the R script then translates the floats into readable data.
import struct

datavector = []
with open('validationData.dat.201805271617', "rb") as f:
    n = 1000000
    count = 0
    byte = f.read(4)  # same as size = 4 in R
    while count < n and byte != b"":
        datavector.append(struct.unpack('f', byte))
        count += 1
        byte = f.read(4)
print(datavector)
The result we get is something like this:
[(-0.05662650614976883,), (-0.1501215696334839,), (-0.047423675656318665,), (-0.04705987498164177,), (-0.025805648416280746,), (0.0006194132147356868,), (-0.09810388088226318,), (-0.007468236610293388,), (-0.06364697962999344,), (-0.04153480753302574,), (-0.010334763675928116,), (-0.028390713036060333,), (-0.01236063800752163,), (-0.010809036903083324,), (-0.0195484422147274,), (-0.006089110858738422,), (-0.011221584863960743,), (-0.012900656089186668,), (0.02528800442814827,), (-0.0803263783454895,), (-0.03630480542778969,), (-0.03244496509432793,), (0.007571130990982056,), (0.004120028577744961,), (-0.022513896226882935,), (0.0008055367507040501,), (-0.011940909549593925,), (-0.05145340412855148,), (0.008258728310465813,), (-0.02799968793988228,), (-0.035880401730537415,), (-0.04643672704696655,), (-0.005221989005804062,), (0.03542486950755119,), (0.013353106565773487,), (0.035976167768239975,), (0.008336232975125313,), (-0.01492307148873806,), (-0.003470425494015217,), (0.02190450392663479,), (0.012822589837014675,), (-0.008801682852208614,), (6.225423567229882e-05,), (0.015136107802391052,), (-0.007297097705304623,), (0.0010259768459945917,), (-0.018891485407948494,), (0.0036666016094386578,), (0.01155313104391098,), (-0.009809211827814579,), (-0.03696637228131294,), (0.04245902970433235,), (0.002897093538194895,), (-0.04476182535290718,), (0.011403053067624569,), (0.01330728828907013,), (0.03941703215241432,), (0.005868517793715,), (0.031955622136592865,), (0.015012135729193687,), (-0.0439620167016983,), (0.00014146660396363586,), (-0.0010368679650127888,), (0.011971709318459034,), (-0.003853448200970888,), (0.010528777725994587,), (0.06129004433751106,), (0.00505771255120635,), (-0.012601660564541817,), (0.01481446623802185,), (0.019019771367311478,), (0.004633020609617233,), (-0.021741455420851707,), (-0.033449672162532806,), (-0.021316081285476685,), (0.00593474181368947,), (0.0030296281911432743,), (0.023055575788021088,), (0.0256675872951746,), (0.03663543984293938,), (0.044298700988292694,), (0.01264342200011015,), (0.032493121922016144,), (-0.06546197831630707,), (0.031123168766498566,), (0.005013703368604183,), (-0.006611336953938007,), (-0.041526272892951965,), (0.0007577596697956324,), (0.030475322157144547,), (0.034476157277822495,), (-0.015037396922707558,), (0.07587681710720062,)]
Now the question is how to convert these floating-point numbers into readable content.
Since you opened your file in binary format, mode='rb', I think you probably just need to specify a bytes-like character to strip:
newLine = line.strip(b'|').split()
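As a side note on the struct approach from the edit: if the whole file really is just a stream of 4-byte floats, it can be decoded in one pass with struct.iter_unpack; a minimal sketch, assuming little-endian IEEE floats (the '<f' format) and the file name from the question:
import struct

with open('validationData.dat.201805271617', 'rb') as f:
    data = f.read()

# iter_unpack walks the buffer in 4-byte steps and yields 1-tuples;
# it assumes the file length is an exact multiple of 4 bytes.
floats = [v for (v,) in struct.iter_unpack('<f', data)]
print(floats[:5])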

Parsing floating number from ping output in text file

So I am writing a Python program that must extract the round-trip times from a text file that contains numerous pings; what's in the text file is previewed below:
64 bytes from a104-100-153-112.deploy.static.akamaitechnologies.com (104.100.153.112): icmp_seq=1 ttl=60 time=12.6ms
64 bytes from a104-100-153-112.deploy.static.akamaitechnologies.com (104.100.153.112): icmp_seq=2 ttl=60 time=1864ms
64 bytes from a104-100-153-112.deploy.static.akamaitechnologies.com (104.100.153.112): icmp_seq=3 ttl=60 time=107.8ms
What I want to extract from the text file is the 12.6, 1864, and the 107.8. I used regex to do this and have the following:
import re
ping = open("pingoutput.txt")
rawping = ping.read()
roundtriptimes = re.findall(r'time=(\d+.\d+)', rawping)
roundtriptimes.sort()
print (roundtriptimes)
The issue I'm having is that I believe the numbers are being read into the roundtriptimes list as strings so when I go to sort them they do not sort as I would like them to.
Any idea how to modify my regex findall command to make sure it recognizes them as numbers would help tremendously! Thanks!
I don't know of a way to do that in RegEx, but if you add the following line before the sort, it should take care of it for you:
roundtriptimes[:] = [float(x) for x in roundtriptimes]
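Alternatively (a small addition, not from the original answer), the conversion can be folded into the sort itself by passing a key function, which sorts numerically while keeping the items as strings:
roundtriptimes.sort(key=float)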
Non-regex:
Simply split on whitespace, grab the last entry, then split on =, take the second element, and omit the last two characters (ms). Cast to a float.
All of that is done in a list-comprehension:
Note that readlines is used to have a list containing each line of the file, which will be much easier to manage.
with open('ping_results.txt') as f:
    data = f.readlines()

times = [float(line.split()[-1].split('=')[1][:-2]) for line in data]
print(times)  # [12.6, 1864.0, 107.8]
regex:
The key thing here is to pay attention to the regex being used:
time=(\d*\.?\d+)
Look for time=, then start a capture group (), and grab digits (\d*), optional decimal (\.?), digits (\d+).
import re

with open('ping_results.txt') as f:
    data = f.readlines()

times = [float(re.findall(r'time=(\d*\.?\d+)', line)[0]) for line in data]
print(times)  # [12.6, 1864.0, 107.8]

Correct mistakes in a python program dealing with CSV

I'm trying to edit a CSV file using information from another one. That doesn't seem simple to me, as I have to filter on multiple things. Let me explain my problem.
I have two CSV files, let's say patch.csv and origin.csv. The output CSV file should have the same layout as origin.csv, but with corrected values.
I want to replace the trip_headsign column fields in origin.csv using the forward_line_name column in patch.csv if the direction_id field in the origin.csv row is 0, or using backward_line_name if direction_id is 1.
I want to do this only if the part of the line_id value in patch.csv between the first and second ":" symbols is the same as the part of the route_id value in origin.csv before the ":" symbol.
I know how to replace a whole line, but not just some fields, especially since I sometimes have to look at only part of a value.
Here is a sample of origin.csv:
route_id,service_id,trip_id,trip_headsign,direction_id,block_id
210210109:001,2913,70405957139549,70405957,0,
210210109:001,2916,70405961139553,70405961,1,
and a sample of patch.csv:
line_id,line_code,line_name,forward_line_name,forward_direction,backward_line_name,backward_direction,line_color,line_sort,network_id,commercial_mode_id,contributor_id,geometry_id,line_opening_time,line_closing_time
OIF:100110010:10OIF439,10,Boulogne Pont de Saint-Cloud - Gare d'Austerlitz,BOULOGNE / PONT DE ST CLOUD - GARE D'AUSTERLITZ,OIF:SA:8754700,GARE D'AUSTERLITZ - BOULOGNE / PONT DE ST CLOUD,OIF:SA:59400,DFB039,91,OIF:439,metro,OIF,geometry:line:100110010:10,05:30:00,25:47:00
OIF:210210109:001OIF30,001,FFOURCHES LONGUEVILLE PROVINS,Place Mérot - GARE DE LONGUEVILLE,,GARE DE LONGUEVILLE - Place Mérot,OIF:SA:63:49,000000 1,OIF:30,bus,OIF,,05:39:00,19:50:00
Each file has hundreds of lines I need to parse and edit this way.
The separator in my CSV files is a comma.
Based on mhopeng's answer to a previous question, I came up with this code:
#!/usr/bin/env python2
from __future__ import print_function
import fileinput
import sys

# first get the route info from patch.csv
f = open(sys.argv[1])
d = open(sys.argv[2])
# ignore header line
#line1 = f.readline()
#line2 = d.readline()
# get line of data
for line1 in f.readline():
    line1 = f.readline().split(',')
    route_id = line1[0].split(':')[1] # '210210109'
    route_forward = line1[3]
    route_backward = line1[5]
    line_code = line1[1]
    # process origin.csv and replace lines in-place
    for line in fileinput.input(sys.argv[2], inplace=1):
        line2 = d.readline().split(',')
        num_route = line2[0].split(':')[0]
        # prevent lines with same route_id but different line_code
        # from being considered the same line
        if line.startswith(route_id) and (num_route == line_code):
            if line.startswith(route_id):
                newline = line.split(',')
                if newline[4] == 0:
                    newline[3] = route_backward
                else:
                    newline[3] = route_forward
                print('\t'.join(newline), end="")
        else:
            print(line, end="")
Unfortunately, that doesn't put the right forward_line_name or backward_line_name in trip_headsign (forward is always used), the condition comparing the patch.csv line_code to the end of the origin.csv route_id (after the ":") doesn't work, and the script finally triggers this error before it finishes parsing the file:
Traceback (most recent call last):
File "./GTFS_enhancer_headsigns.py", line 28, in
if newline[4] == 0:
IndexError: list index out of range
Could you please help me fix these three problems?
Thanks for your help :)
You really should consider using the Python csv module instead of split(). In my experience, everything is much easier when working with CSV files through the csv module.
This way you can iterate through the dataset in a structured way, without the risk of index-out-of-range errors. A sketch of that approach follows.
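For illustration only (a Python 3 sketch, not from the original answer): it uses csv.DictReader and csv.DictWriter, assumes the headers shown in the samples above, and the output file name origin_fixed.csv is made up:
import csv

# Build a lookup from patch.csv: the key is the part of line_id between the
# first and second ':' plus the line_code; the value is the two headsigns.
patch = {}
with open('patch.csv', newline='') as f:
    for row in csv.DictReader(f):
        key = (row['line_id'].split(':')[1], row['line_code'])
        patch[key] = (row['forward_line_name'], row['backward_line_name'])

with open('origin.csv', newline='') as f_in, \
     open('origin_fixed.csv', 'w', newline='') as f_out:
    reader = csv.DictReader(f_in)
    writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # route_id looks like '210210109:001'; both parts must match patch.csv.
        key = tuple(row['route_id'].split(':')[:2])
        if key in patch:
            forward, backward = patch[key]
            row['trip_headsign'] = forward if row['direction_id'] == '0' else backward
        writer.writerow(row)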

Convert a Column oriented file to CSV output using shell

I have a file that comes from MapReduce output in the format below, and it needs conversion to CSV using a shell script.
25-MAY-15
04:20
Client
0000000010
127.0.0.1
PAY
ISO20022
PAIN000
100
1
CUST
API
ABF07
ABC03_LIFE.xml
AFF07/LIFE
100000
Standard Life
================================================
==================================================
AFF07-B000001
2000
ABC Corp
..
BE900000075000027
AFF07-B000002
2000
XYZ corp
..
BE900000075000027
AFF07-B000003
2000
3MM corp
..
BE900000075000027
I need the output in the CSV format below, where I want to repeat some of the values in the file and add the TRANSACTION ID, as in this format:
25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,AFF07-B000001, 2000,ABC Corp,..,BE900000075000027
25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,AFF07-B000002,2000,XYZ Corp,..,BE900000075000027
The TRANSACTION IDs are AFF07-B000001, AFF07-B000002, AFF07-B000003, which have different values, and I have put a marker line where the transaction IDs start. Before the demarcation, the values should repeat, and the transaction ID column needs to be added along with the repeating values, as given in the format above.
I may need a Bash shell script; CentOS is the flavour of Linux I'm on.
I am getting the error below when I execute the code:
Traceback (most recent call last):
File "abc.py", line 37, in <module>
main()
File "abc.py", line 36, in main
createTxns(fh)
File "abc.py", line 7, in createTxns
first17.append( fh.readLine().rstrip() )
AttributeError: 'file' object has no attribute 'readLine'
Can someone help me out?
Is this a correct description of the input file and output format?
The input file consists of:
17 lines, followed by
groups of 10 lines each - each group holding one transaction id
Each output row consists of:
29 common fields, followed by
5 fields derived from each of the 10-line groups above
So we just translate this into some Python:
def createTxns(fh):
    """fh is the file handle of the input file"""
    # 1. Read 17 lines from fh
    first17 = []
    for i in range(17):
        first17.append(fh.readline().rstrip())
    # 2. Form the common fields.
    commonFields = first17 + first17[0:12]
    # 3. Process the rest of the file in groups of ten lines.
    while True:
        # read 10 lines
        group = []
        for i in range(10):
            x = fh.readline()
            if x == '':
                break
            group.append(x.rstrip())
        if len(group) != 10:
            break  # we've reached the end of the file
        fields = commonFields + [group[2], group[4], group[6], group[7], group[9]]
        row = ",".join(fields)
        print(row)

def main():
    with open("input-file", "r") as fh:
        createTxns(fh)

main()
This code shows how to:
open a file handle
read lines from a file handle
strip off the ending newline
check for end of input when reading from a file
concatenate lists together
join strings together
I would recommend reading Input and Output if you are going the Python route.
You just have to break the problem down and try it. For the first 17 lines, use f.readline() and concatenate them into a string. Then use the replace method to get the beginning of the string that you want in the CSV:
str.replace("\n", ",")
Then use the split method to break the rest down into a list:
str.split("\n")
Then write the file out in a loop, using a counter to make your life easier. First write out the header string:
25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API
Then write each item in the list, preceded by a ",":
,AFF07-B000001, 2000,ABC Corp,..,BE900000075000027
At a count of 5, write the "\n" followed by the header again, and don't forget to reset your counter so it can begin again:
\n25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API
Give it a try and let us know if you need more assistance. I assumed that you have some scripting background :) Good luck!! A compact sketch of this approach follows.
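For illustration, a compact sketch of the approach described above (the input file name is hypothetical; it assumes five non-empty lines per transaction group and uses the '=' marker lines from the sample as the demarcation):
# Sketch only: field positions are inferred from the sample data above.
with open('input.txt') as f:
    text = f.read()

head, _, body = text.partition('================')
# The 17 header lines joined with commas, plus the first 12 repeated.
first17 = [l for l in head.splitlines() if l.strip()][:17]
header = ','.join(first17 + first17[:12])

# Five non-empty lines per transaction; skip the '=' marker lines themselves.
txn_lines = [l for l in body.splitlines() if l.strip() and not l.startswith('=')]
for i in range(0, len(txn_lines), 5):
    group = txn_lines[i:i + 5]
    print(header + ',' + ','.join(group))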
