Pythonic way to Compare two CSV files to track changes

Pythonic way to Compare two CSV files to track changes - python

I have a Python Script that generate a CSV (data parsed from a website).
Here is an exemple of the CSV file:
File1.csv
China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
Italy;Bari;Bari, The British School;;Yes;
China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
China;Beijing;BeiwaiOnline BFSU;;;
Italy;Curno;Bergamo, Anderson House;;Yes;
File2.csv
China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
Italy;Bari;Bari, The British School;;Yes;
China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
This;Is;A;New;Line;;
Italy;Curno;Bergamo, Anderson House;;Yes;
As you can see,
China;Beijing;BeiwaiOnline BFSU;;; ==> This line from File1.csv is not more present in File2.csv and This;Is;A;New;Line;; ==> This line from File2.csv is new (is not present in File1.csv).
I am looking for a way to compare this two CSV files (one important thing to know is that the order of the lines doesn't count ... they cant be anywhere).
What I'd like to have is a script which can tell me:
- One new line : This;Is;A;New;Line;;
- One removed line : China;Beijing;BeiwaiOnline BFSU;;;
And so on ... !
I've tried but without any success:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import csv
f1 = file('now.csv', 'r')
f2 = file('past.csv', 'r')
c1 = csv.reader(f1)
c2 = csv.reader(f2)
now = [row for row in c2]
past = [row for row in c1]
for row in now:
#print row
lol = past.index(row)
print lol
f1.close()
f2.close()
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Any idea of the best way to proceed ? Thank you so much in advance ;)
EDIT:
import csv
f1 = file('now.csv', 'r')
f2 = file('past.csv', 'r')
c1 = csv.reader(f1)
c2 = csv.reader(f2)
s1 = set(c1)
s2 = set(c2)
lol = s1 - s2
print type(lol)
print lol
This seems to be a good idea but :
Traceback (most recent call last):
File "compare.py", line 20, in <module>
s1 = set(c1)
TypeError: unhashable type: 'list'
EDIT 2 (Please don't care about what is above):
*with your help, here is the script I'm writing :*
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import csv
### COMPARISON THING ###
x=0
fichiers = os.listdir('/me/CSV')
for fichier in fichiers:
if '.csv' in fichier:
print('%s -----> %s' % (x,fichier))
x=x+1
choice = raw_input("Which file do you want to compare with the new output ? ->>>")
past_file = fichiers[int(choice)]
print 'We gonna compare %s to our output' % past_file
s_now = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/now.csv', 'r'), delimiter=';')) ## OUR OUTPUT
s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE
added = [";".join(row) for row in s_now - s_past] # in "now" but not in "past"
removed = [";".join(row) for row in s_past - s_now] # in "past" but not in "now"
c = csv.writer(open("CHANGELOG.csv", "a"),delimiter=";" )
line = ['AD']
for item_added in added:
line.append(item_added)
c.writerow(['AD',item_added])
line = ['RM']
for item_removed in removed:
line.append(item_removed)
c.writerow(line)
Two kind of errors:
File "programcompare.py", line 21, in <genexpr>
s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE
_csv.Error: line contains NULL byte
or
File "programcompare.py", line 21, in <genexpr>
s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE
_csv.Error: newline inside string
It was working few minutes ago but I've changed the CSV files to test with different datas and here I am :-)
Sorry, last question !

If your data is not prohibitively large, loading them into a set (or frozenset) will be an easy approach:
s_now = frozenset(tuple(row) for row in csv.reader(open('now.csv', 'r'), delimiter=';'))
s_past = frozenset(tuple(row) for row in csv.reader(open('past.csv', 'r'), delimiter=';'))
To get the list of entries that were added:
added = [";".join(row) for row in s_now - s_past] # in "now" but not in "past"
# Or, simply "added = list(s_now - s_past)" to keep them as tuples.
similarly, list of entries that were removed:
removed = [";".join(row) for row in s_past - s_now] # in "past" but not in "now"
To address your updated question on why you're seeing TypeError: unhashable type: 'list', the csv returns each entry as a list when iterated. lists are not hashable and therefore cannot be inserted into a set.
To address this, you'll need to convert the list entries into tuples before adding the to the set. See previous section in my answer for an example of how this can be done.
To address the additional errors you're seeing, they are both due to the content of your CSV files.
_csv.Error: newline inside string
It looks like you have quote characters (") somewhere in data which confuses the parser. I'm not familiar enough with the CSV module to tell you exactly what has gone wrong, not without having a peek at your data anyway.
I did however manage to reproduce the error as such:
>>> [e for e in csv.reader(['hello;wo;"rld'], delimiter=";")]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
_csv.Error: newline inside string
In this case, it can fixed by instructing the reader not to do any special processing with quotes (see csv.QUOTE_NONE). (Do note that this will disable the handling of quoted data whereby delimiters can appear within a quoted string without the string being split into separate entries.)
>>> [e for e in csv.reader(['hello;wo;"rld'], delimiter=";", quoting=csv.QUOTE_NONE)]
[['hello', 'wo', '"rld']]
_csv.Error: line contains NULL byte
I'm guessing this might be down to the encoding of your CSV files. See the following questions:
Python CSV error: line contains NULL byte
"Line contains NULL byte" in CSV reader (Python)

Read the csv files line by line into sets. Compare the sets.
>>> s1 = set('''China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
... United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
... Italy;Bari;Bari, The British School;;Yes;
... China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
... China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
... China;Beijing;BeiwaiOnline BFSU;;;
... Italy;Curno;Bergamo, Anderson House;;Yes;'''.split('\n'))
>>> s2 = set('''China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
... United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
... Italy;Bari;Bari, The British School;;Yes;
... China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
... China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
... This;Is;A;New;Line;;
... Italy;Curno;Bergamo, Anderson House;;Yes;'''.split('\n'))
>>> s1 - s2
set(['China;Beijing;BeiwaiOnline BFSU;;;'])
>>> s2 - s1
set(['This;Is;A;New;Line;;'])

Related

How to check txt file for content

I am trying to create a python script that will read data from a text file and then checks if it has .(two letters), which well tell me if is a country code. I have tried using split and other methods but have not got it to work? Here is the code I have so far -->
# Python program to
# demonstrate reading files
# using for loop
import re
file2 = open('contry.txt', 'w')
file3 = open('noncountry.txt', 'w')
# Opening file
file1 = open('myfile.txt', 'r')
count = 0
noncountrycount = 0
countrycounter = 0
# Using for loop
print("Using for loop")
for line in file1:
count += 1
pattern = re.compile(r'^\.\w{2}\s')
if pattern.match(line):
print(line)
countrycounter += 1
else:
print("fail", line)
noncountrycount += 1
print(noncountrycount)
print(countrycounter)
file1.close()
file2.close()
file3.close()
The txt file has this in it
.aaa generic American Automobile Association, Inc.
.aarp generic AARP
.abarth generic Fiat Chrysler Automobiles N.V.
.abb generic ABB Ltd
.abbott generic Abbott Laboratories, Inc.
.abbvie generic AbbVie Inc.
.abc generic Disney Enterprises, Inc.
.able generic Able Inc.
.abogado generic Minds + Machines Group Limited
.abudhabi generic Abu Dhabi Systems and Information Centre
.ac country-code Internet Computer Bureau Limited
.academy generic Binky Moon, LLC
.accenture generic Accenture plc
.accountant generic dot Accountant Limited
.accountants generic Binky Moon, LLC
.aco generic ACO Severin Ahlmann GmbH & Co. KG
.active generic Not assigned
.actor generic United TLD Holdco Ltd.
.ad country-code Andorra Telecom
.adac generic Allgemeiner Deutscher Automobil-Club e.V. (ADAC)
.ads generic Charleston Road Registry Inc.
.adult generic ICM Registry AD LLC
.ae country-code Telecommunication Regulatory Authority (TRA)
.aeg generic Aktiebolaget Electrolux
.aero sponsored Societe Internationale de Telecommunications Aeronautique (SITA INC USA)
I am getting this error now
File "C:/Users/tyler/Desktop/Python Class/findcountrycodes/Test.py", line 15, in for line in file1: File "C:\Users\tyler\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0] UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 8032: character maps to

Is this something You were looking for:
with open('lorem.txt') as file:
data = file.readlines()
for line in data:
temp = line.split()[0]
if len(temp) == 3:
print(temp)
In short:
file.readlines() in this case returns a list of all lines in the file, pretty much it split the file by \n.
Then for each of those lines it gets split even more by spaces, and since the code You need is the first in the line it is also first in the list, so now it is important to check if the first item in the list is 3 characters long because since Your formatting seems pretty consistent only a length of 3 will be a country code.

It's usually not only an issue with the code, so we need all the context to reproduce, debug and solve.
Encoding error
The final hint was the console output (error, stacktrace) you pasted.
Read the stacktrace & research
This is how I read & analyze the error-output (Python's stacktrace):
... C:/Users/tyler/Desktop ...
... findcountrycodes/Test.py", line 15 ...
... Python36\lib\encodings*cp1252*.py ...
... UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 8032:
From this output we can extract important contextual information to research & solve the issue:
you are using Windows
the line 15 in your script Test.py points to the erroneous statement reading the file: file1 = open('myfile.txt', 'r')
you are using Python 3.6 and the currently used encoding was Windows 1252 (cp-1252)
the root-cause is UnicodeDecodeError, a frequently occuring Python Exception when reading files
You can now:
research Stackoverflow and the web for this exception: UnicodeDecodeError.
improve your question by adding this context (as keywords, tag, or dump as plain output)
Try a different encoding
One answer suggests to use the nowadays common UTF-8:
open(filename, encoding="utf8")
Detect the file encoding
An methodical solution-approach would be:
check the file's encoding or charset, e.g. using an editor, on windows Notepad or Notepad++
open the file your Python code with the proper encoding
See also:
UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>
Get encoding of a file in Windows
Filtering lines for country-codes
So you want only the lines with country-codes.
Filtering expected
Then expected these 3 lines of your input file to be filtered:
.ad country-code Andorra Telecom
.ac country-code Internet Computer Bureau Limited
.ae country-code Telecommunication Regulatory Authority (TRA)
Solution using regex
As you already did, test each line of the file.
Test if the line starts with these 4 characters .xx (where xx can be any ASCII-letter).
Regex explained
This regular expression tests for a valid two-letter country code:
^\.\w{2}\s
^ from the start of the string (line)
\. (first) letter should be a dot
\w{2} (followed by) any two word-characters (⚠️ also matches _0)
\s (followed by) a single whitespace (blank, tab, etc.)
Python code
This is done in your code as follows (assuming the line is populated from read lines):
import re
line = '.ad '
pattern = re.compile(r'^\.\w{2}\s')
if pattern.match(line):
print('found country-code')
Here is a runnable demo on IDEone
Further Readings
Filter list with regex
Python 3 documentation: Regular Expression HOWTO
Bharath Sivakumar, on Medium (2020): Extracting Words from a string in Python using the “re” module
koenwoortman's blog (2020): Remove None values from a list in Python

You are splitting on three spaces but the character codes are only followed by one space so your logic is wrong.
>>> s = '.ac country-code Internet Computer Bureau Limited'
>>> s.strip().split(' ')
['.ac country-code', ' Internet Computer Bureau Limited']
>>>
Check if the third character is not a space and the fourth character is a space.
>>> if s[2] != ' ' and s[3] == ' ':
... print(f'country code: {s[:3]}')
... else: print('NO')
...
country code: .ac
>>> s = '.abogado generic Minds + Machines Group Limited'
>>> if s[2] != ' ' and s[3] == ' ':
... print(f'country code: {s[:3]}')
... else: print('NO')
...
NO
>>>

non-standard characters cause program to end

I am learning python for data mining and I have a text file that contains a list of world cities and their coordinates. With my code, I am trying to find the coordinates of a list of cities. Everything works well until there is a city name with non-standard characters. I expect the program will skip that name and move to the next, but it terminates. How will I make the program skip names it cannot find and continue to the next?
lst = ['Paris', 'London', 'Helsinki', 'Amsterdam', 'Sant Julià de Lòria',
'New York', 'Dublin']
source = 'world.txt'
fh = open(source)
n = 0
for line in fh:
line.rstrip()
if lst[n] not in line:
continue
else:
co = line.split(',')
print lst[n], 'Lat: ', co[5], 'Long: ', co[6]
if n < (len(lst)-1):
n = n + 1
else:
break
The outcome of this run is:
>>>
Paris Lat: 33.180704 Long: 67.470836
London Lat: -11.758217 Long: 17.084013
Helsinki Lat: 60.175556 Long: 24.934167
Amsterdam Lat: 6.25 Long: -57.5166667
>>>

Your code has a number of issues. The following fixes most, if not all, of them — and should never terminate when a city isn't found.
# -*- coding: iso-8859-1 -*-
from __future__ import print_function
cities = ['Paris', 'London', 'Helsinki', 'Amsterdam', 'Sant Julià de Lòria', 'New York',
'Dublin']
SOURCE = 'world.txt'
for city in cities:
with open(SOURCE) as fh:
for line in fh:
if city in line:
fields = line.split(',')
print(fields[0], 'Lat: ', fields[5], 'Long: ', fields[6])
break

It may be an encoding problem. You have to know in which encoding is the file "world.txt".
If you do not know it, try the most commonly used encodings.
Replace the line :
fh = open(source)
With the lines :
import codecs
fh = codecs.open(source, 'r', 'utf-8')
If it still does not work, replace the 'utf-8' by 'cp1252', then by 'iso-8859-1'.
If none of these common encoding works, you have to find the encoding by yourself. Try to open "world.txt" in Notepad++, this text editor is able to make encoding inference. (Not sure if Notepad++ is able to open a 3 million line files, though).
It is also a good practice to know in which encoding is your own python source file, and to tell it explicitly by adding a line like that # -*- coding: utf-8 -*- at the beginning of your source file.
Of course, you have to specify the exact encoding in which your source file is. Once again, you can determine it by opening it in Notepad++.

Correct mistakes in a python program dealing with CSV

I'm trying to edit a CSV file using informations from a first one. That doesn't seem simple to me as I should filter multiple things. Let's explain my problem.
I have two CSV files, let's say patch.csv and origin.csv. Output csv file should have the same pattern as origin.csv, but with corrected values.
I want to replace trip_headsign column fields in origin.csv using forward_line_name column in patch.csv if direction_id field in origin.csv row is 0, or using backward_line_name if direction_id is 1.
I want to do this only if the part of the line_id value in patch.csv between ":" and ":" symbols is the same as the part of route_id value in origin.csv before the ":" symbol.
I know how to replace a whole line, but not only some parts, especially that I sometimes have to look only part of a value.
Here is a sample of origin.csv:
route_id,service_id,trip_id,trip_headsign,direction_id,block_id
210210109:001,2913,70405957139549,70405957,0,
210210109:001,2916,70405961139553,70405961,1,
and a sample of patch.csv:
line_id,line_code,line_name,forward_line_name,forward_direction,backward_line_name,backward_direction,line_color,line_sort,network_id,commercial_mode_id,contributor_id,geometry_id,line_opening_time,line_closing_time
OIF:100110010:10OIF439,10,Boulogne Pont de Saint-Cloud - Gare d'Austerlitz,BOULOGNE / PONT DE ST CLOUD - GARE D'AUSTERLITZ,OIF:SA:8754700,GARE D'AUSTERLITZ - BOULOGNE / PONT DE ST CLOUD,OIF:SA:59400,DFB039,91,OIF:439,metro,OIF,geometry:line:100110010:10,05:30:00,25:47:00
OIF:210210109:001OIF30,001,FFOURCHES LONGUEVILLE PROVINS,Place Mérot - GARE DE LONGUEVILLE,,GARE DE LONGUEVILLE - Place Mérot,OIF:SA:63:49,000000 1,OIF:30,bus,OIF,,05:39:00,19:50:00
Each file has hundred of lines I need to parse and edit this way.
Separator is comma in my csv files.
Based on mhopeng answer to a previous question, I obtained that code:
#!/usr/bin/env python2
from __future__ import print_function
import fileinput
import sys
# first get the route info from patch.csv
f = open(sys.argv[1])
d = open(sys.argv[2])
# ignore header line
#line1 = f.readline()
#line2 = d.readline()
# get line of data
for line1 in f.readline():
line1 = f.readline().split(',')
route_id = line1[0].split(':')[1] # '210210109'
route_forward = line1[3]
route_backward = line1[5]
line_code = line1[1]
# process origin.csv and replace lines in-place
for line in fileinput.input(sys.argv[2], inplace=1):
line2 = d.readline().split(',')
num_route = line2[0].split(':')[0]
# prevent lines with same route_id but different line_code to be considered as the same line
if line.startswith(route_id) and (num_route == line_code):
if line.startswith(route_id):
newline = line.split(',')
if newline[4] == 0:
newline[3] = route_backward
else:
newline[3] = route_forward
print('\t'.join(newline),end="")
else:
print(line,end="")
But unfortunately, that doesn't push the right forward or backward_line_name in trip_headsign (always forward is used), the condition to compare patch.csv line_code to the end of route_id of origin.csv (after the ":") doesn't work, and the script finally triggers that error, before finishing parsing the file:
Traceback (most recent call last):
File "./GTFS_enhancer_headsigns.py", line 28, in
if newline[4] == 0:
IndexError: list index out of range
Could you please help me fixing these three problems?
Thanks for your help :)

You really should consider using the python csv module instead of split().
Out of experience , everything is much easier when working with csv files and the csv module.
This way you can iterate through the dataset in a structured way without the risk of getting index out of range errors.

Convert a Column oriented file to CSV output using shell

I have a file that come from map reduce output for the format below that needs conversion to CSV using shell script
25-MAY-15
04:20
Client
0000000010
127.0.0.1
PAY
ISO20022
PAIN000
100
1
CUST
API
ABF07
ABC03_LIFE.xml
AFF07/LIFE
100000
Standard Life
================================================
==================================================
AFF07-B000001
2000
ABC Corp
..
BE900000075000027
AFF07-B000002
2000
XYZ corp
..
BE900000075000027
AFF07-B000003
2000
3MM corp
..
BE900000075000027
I need the output like CSV format below where I want to repeat some of the values in the file and add the TRANSACTION ID as below format
25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,AFF07-B000001, 2000,ABC Corp,..,BE900000075000027
25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,AFF07-B000002,2000,XYZ Corp,..,BE900000075000027
TRANSACTION ID is AFF07-B000001,AFF07-B000002,AFF07-B000003 which have different values and I have put a marked line from where the Transaction ID starts . Before the demarkation , the values should be repeating and the transaction ID column needs to be added along with the repeating values before the line as given in above format
BASH shell script I may need and CentOS is the flavour of linux
I am getting error as below when I execute the code
Traceback (most recent call last):
File "abc.py", line 37, in <module>
main()
File "abc.py", line 36, in main
createTxns(fh)
File "abc.py", line 7, in createTxns
first17.append( fh.readLine().rstrip() )
AttributeError: 'file' object has no attribute 'readLine'
Can someone help me out

Is this a correct description of the input file and output format?
The input file consists of:
17 lines, followed by
groups of 10 lines each - each group holding one transaction id
Each output row consists of:
29 common fields, followed by
5 fields derived from each of the 10-line groups above
So we just translate this into some Python:
def createTxns(fh):
"""fh is the file handle of the input file"""
# 1. Read 17 lines from fh
first17 = []
for i in range(17):
first17.append( fh.readLine().rstrip() )
# 2. Form the common fields.
commonFields = first17 + first17[0:12]
# 3. Process the rest of the file in groups of ten lines.
while True:
# read 10 lines
group = []
for i in range(10):
x = fh.readline()
if x == '':
break
group.append( x.rstrip() )
if len(group) <> 10:
break # we've reached the end of the file
fields = commonFields + [ group[2], group[4], group[6], group[7[, group[9] ]
row = ",".join(fields)
print row
def main():
with open("input-file", "r") as fh:
createTxns(fh)
main()
This code shows how to:
open a file handle
read lines from a file handle
strip off the ending newline
check for end of input when reading from a file
concatenate lists together
join strings together

I would recommend you to read Input and Output if you are going for the python route.
You just have to break the problem down and try it. For the first 17 line use f.readline() and concat into the string. Then the replace method to get the begining of the string that you want in the csv.
str.replace("\n", ",")
Then use the split method and break them down into the list.
str.split("\n")
Then write the file out in the loop. Use a counter to make your life easier. First write out the header string
25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API
Then write the item in the list with a ",".
,AFF07-B000001, 2000,ABC Corp,..,BE900000075000027
At the count of 5 write the "\n" with the header again and don't forget to reset your counter so it can begin again.
\n25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API
Give it a try and let us know if you need more assistant. I assumed that you have some scripting background :) Good luck!!

Python parsing a text file and logical methods

I'm a bit stuck with python logic.
I'd like some some advice on how to tackle a problem I'm having with python and the methods to parsing data.
I've spent a bit of time reading the python reference documents and going through this site and I understand there are several ways to do what I'm trying to achieve and this is the path I've gone down.
I'm re-formating some text files with data generated from some satellite hardware to be uploaded into a MySQL database.
This is the raw data
TP N: 1
Frequency: 12288.635 Mhz
Symbol rate: 3000 KS
Polarization: Vertical
Spectrum: Inverted
Standard/Modulation: DVB-S2/QPSK
FEC: 1/2
RollOff: 0.20
Pilot: on
Coding mode: ACM/VCM
Short frame
Transport stream
Single input stream
RF-Level: -49 dBm
Signal/Noise: 6.3 dB
Carrier width: 3.600 Mhz
BitRate: 2.967 Mbit/s
The above section is repeated for each transponder TP N on the satellite
I'm using this script to extract the data I need
strings = ("Frequency", "Symbol", "Polar", "Mod", "FEC", "RF", "Signal", "Carrier", "BitRate")
sat_raw = open('/BLScan/reports/1520.txt', 'r')
sat_out = open('1520out.txt', 'w')
for line in sat_raw:
if any(s in line for s in strings):
for word in line.split():
if ':' in word:
sat_out.write(line.split(':')[-1])
sat_raw.close()
sat_out.close()
The output data is then formatted like this before its sent to the database
12288.635 Mhz
3000 KS
Vertical
DVB-S2/QPSK
1/2
-49 dBm
6.3 dB
3.600 Mhz
2.967 Mbit/s
This script is working fine but for some features I want to implement on MySQL I need to edit this further.
Remove the decimal point and 3 numbers after it and MHz on the first "frequency" line.
Remove all the trailing measurement references KS,dBm,dB, Mhz, Mbit.
Join the 9 fields into a comma delimited string so each transponders (approx 30 per file ) are on their own line
I'm unsure weather to continue down this path adding onto this existing script (which I'm stuck at the point where the output file is written). Or rethink my approach to the way I'm processing the raw file.

My solution is crude, might not work in corner cases, but it is a good start.
import re
import csv
strings = ("Frequency", "Symbol", "Polar", "Mod", "FEC", "RF", "Signal", "Carrier", "BitRate")
sat_raw = open('/BLScan/reports/1520.txt', 'r')
sat_out = open('1520out.txt', 'w')
csv_writer = csv.writer(sat_out)
csv_output = []
for line in sat_raw:
if any(s in line for s in strings):
try:
m = re.match(r'^.*:\s+(\S+)', line)
value = m.groups()[0]
# Attempt to convert to int, thus removing the decimal part
value = int(float(value))
except ValueError:
pass # Ignore conversion
except AttributeError:
pass # Ignore case when m is None (no match)
csv_output.append(value)
elif line.startswith('TP N'):
# Before we start a new set of values, write out the old set
if csv_output:
csv_writer.writerow(csv_output)
csv_output=[]
# If we reach the end of the file, don't miss the last set of values
if csv_output:
csv_writer.writerow(csv_output)
sat_raw.close()
sat_out.close()
Discussion
The csv package helps with CSV output
The re (regular expression) module helps parsing the line and extract the value from the line.
In the line that reads, value = int(...), We attempt to turn the string value into an integer, thus removing the dot and following digits.
When the code encounters a line that starts with 'TP N', which signals a new set of values. We write out the old set of value to the CSV file.

import math
strings = ("Frequency", "Symbol", "Polar", "Mod", "FEC", "RF", "Signal", "Carrier", "BitRate")
files=['/BLScan/reports/1520.txt']
sat_out = open('1520out.txt', 'w')
combineOutput=[]
for myfile in files:
sat_raw = open(myfile, 'r')
singleOutput=[]
for line in sat_raw:
if any(s in line for s in strings):
marker=line.split(':')[1]
try:
data=str(int(math.floor(float(marker.split()[0]))))
except:
data=marker.split()[0]
singleOutput.append(data)
combineOutput.append(",".join(singleOutput))
for rec in combineOutput:
sat_out.write("%s\n"%rec)
sat_raw.close()
sat_out.close()
Add all the files that you want to parse in files list. It will write the output of each file as a separate line and each field comma separated.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pythonic way to Compare two CSV files to track changes - python

Related

How to check txt file for content

non-standard characters cause program to end

Correct mistakes in a python program dealing with CSV

Convert a Column oriented file to CSV output using shell

Python parsing a text file and logical methods

Categories

Resources