I am trying to make an automated monthly cost calculator for my family. The idea is that whenever they shop, they take a picture of the receipt and send it to an e-mail address. A Python script downloads that picture and uses the Google Vision API to scan for the total amount, which then gets written into a .csv file for later use. (I have yet to write the CSV part, so for now the amounts are only saved into .txt files.)
This works because in my country all receipts look the same due to regulations. However, the Google Vision API returns the OCRed text line by line. What I am trying to do now is check the text line by line for the total amount, which is always in the format (Numbers space Currency), and then check whether the OCR messed something up, such as putting "Total amount" above or below the actual numbers.
My problem is that if I run this script on more than 3 .txt files of OCR data, it only gets the first 2 right, even though all of them look the same when I check them manually. If I run it on them one by one, it gets them right every time.
The OCR data looks like this:
Total amount:
1000 USD
or
1000 USD
Total amount:
My code so far:
import re
import os
import codecs

for files in os.listdir('texts/'):
    filedir="texts/"+str(files)
    with codecs.open(filedir,'rb','utf-8') as f:
        lines=f.readlines()
        lines=[l.strip() for l in lines]
    for index,line in enumerate(lines):
        match=re.search(r"(\d+) USD",line)
        if match:
            if lines[index+1].endswith("USD"):
                amount=re.sub(r'(\d)\s+(\d)',r'\1\2',lines[index])
                amount=amount.replace(" USD","")
                print(amount)
                with open('amount.txt',"a") as data:
                    data.write(amount)
                    data.write("\n")
            if lines[index-1].endswith("USD"):
                amount=re.sub(r'(\d)\s+(\d)',r'\1\2',lines[index])
                amount=amount.replace(" USD","")
                print(amount)
                with open('amount.txt',"a") as data:
                    data.write(amount)
                    data.write("\n")
Question: checking whether the line above or below equals a phrase
This can be simplified to the following.
Assumptions:
The amount line has the following format (Numbers space Currency).
The exact phrase "Total amount:" always exists in the other line.
The two lines are separated by a blank line.
FILE1 = u"""Total amount:

1000 USD
"""
FILE2 = u"""1000 USD

Total amount:"""

import io
import os
import codecs

total = []
#for files in os.listdir('texts/'):
for files in [FILE1, FILE2]:
    # filedir="texts/"+str(files)
    # with codecs.open(filedir,'rb','utf-8') as f:
    with io.StringIO(files) as f:
        v1 = next(f).rstrip()
        # eat empty line
        next(f)
        v2 = next(f).rstrip()

    if v1 == 'Total amount:':
        total.append(v2.split()[0])
    else:
        total.append(v1.split()[0])

print(total)
# csv_writer.writerows(total)
Output:
[u'1000', u'1000']
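A side note on the code in the question: at index 0, `lines[index-1]` silently wraps around to the last line of the file, and `lines[index+1]` can run past the end of the list, which may be related to the inconsistent results. Below is a minimal sketch that finds the label first and then checks the neighbouring lines with explicit bounds. The names `find_total` and `AMOUNT_RE` are made up for illustration, and it assumes the label is exactly "Total amount:" and the currency is USD, as in the sample data. Blank lines are dropped first, so it does not depend on the blank-line assumption.

```python
import re

# Digits (optionally grouped with spaces, e.g. "1 000"), one space, then the currency.
AMOUNT_RE = re.compile(r"^(\d[\d ]*) USD$")


def find_total(lines, label="Total amount:"):
    """Return the amount next to the label as a digit string, or None if not found."""
    # Strip whitespace and drop empty lines so neighbours are always real content.
    lines = [ln.strip() for ln in lines if ln.strip()]
    for i, line in enumerate(lines):
        if line != label:
            continue
        # Look below first, then above, never stepping outside the list.
        for j in (i + 1, i - 1):
            if 0 <= j < len(lines):
                m = AMOUNT_RE.match(lines[j])
                if m:
                    return m.group(1).replace(" ", "")
    return None


print(find_total(["Total amount:", "1000 USD"]))
print(find_total(["1 000 USD", "", "Total amount:"]))
```

Because the function returns `None` when nothing is found, a receipt that the OCR mangled can be logged and skipped instead of crashing the whole batch.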
I’m struggling to write a Python script to process a file and produce an output text file containing the tickets in a format that is ready for printing via a dot matrix printer. For reference I have also attached an example of what the resultant text file should look like.
ConcertTickets.txt and
ConcertTickets_result.txt
My major problem is architecting an approach to this problem. I can’t figure out how to print column by column. I was able to read the file, print row by row, do the validation and write the file with a new name. I’m not sure how to do the layout_name, columns, column_width, column_spacing, left_margin, row spacing and line_item, the best I could do was ljust() for the left margin between the tickets.
I don’t expect someone to do the work for me, but would greatly appreciate tips on architectural approaches with and without third party packages.
The input concert ticket file consists of a header containing formatting information and a body containing the actual tickets.
The header lines are as follows:
download_datetime - the date and time the file was downloaded
order_datetime - the date and time the order for the tickets was placed
layout_name - the name of the layout used for formatting the tickets
columns - the number of columns of tickets per page width
column_width - the width of each ticket column
column_spacing - the number of spaces between ticket columns
left_margin - the leading space to the left of the first ticket column
row_spacing - the number of horizontal lines between tickets
line_item - the line items represent how the ticket elements must appear in the ticket, e.g. the PIN at the top, followed by two empty lines, then the description, serial number and expiry date. Valid values for line items are: pin, description, serial_number, expiry_date and empty (space)
ticket_summary - each ticket summary contains the ticket description followed by the number of tickets of that type in the file and the total face value of those tickets, e.g. "Gold 10.00,10,100.00" means there are 10 Gold $10.00 tickets to the value of $100.00 in the file
ticket_fields - the ticket fields indicate the fields and their order that are present in the ticket data that follows. This is the last line of the header and all data that follows this line should be interpreted as body data, i.e. the actual tickets in a CSV type format
The script also needs to do some basic file validation by checking that the number of actual tickets in the body of the file matches the ticket summary values in the header of the file. If file validation fails, the program must exit with an appropriate error message.
The resultant output file name must be the same as the input file name, but with the word "_result" appended to it just before the file extension. E.g. if the input file name is ConcertTickets.txt then the output file name must be ConcertTickets_result.txt
I also need to develop a set of test cases for the script.
This is my code thus far
data = []
data_description = []
data_pin = []
data_serial_number = []
data_expiry_date = []
tickets_in_body = 0

# read file from line 19 and create two-dimensional array
result_f = open('ConcertTickets.txt')
for each_line in result_f.readlines()[18:]:
    (description, pin, serial_number, expiry_date) = each_line.split(',')
    data_description.append(description)
    data_pin.append(pin)
    data_serial_number.append(serial_number)
    data_expiry_date.append(expiry_date.replace("\r\n",""))
    tickets_in_body += 1
data = [data_description, data_pin, data_serial_number, data_expiry_date]

# ticket validation and writing to file
result_golden_summary = open('ConcertTickets.txt')
golden_summary = result_golden_summary.readlines()
(golden_description, golden_summary_amount, golden_summary_value) = (golden_summary[15 - 1]).split(',')
if int(golden_summary_amount) != tickets_in_body:
    print('The ticket summary in the header does not match the amount of tickets in body')
else:
    (filename, extension) = (result_f.name).split('.')
    result_f = open(filename + "_result.txt", 'w')
    for row in data:
        result_f.write("".join(str(item).ljust(25) for item in row))
    result_f.close()
Here's some code for you:
import math

result_f = open('ConcertTickets.txt')
all_lines_arr = []
for each_line in result_f.readlines()[18:]:
    (description, pin, serial_number, expiry_date) = each_line.split(',')
    line_dict = {}
    line_dict["description"] = description
    line_dict["pin"] = pin
    line_dict["serial_number"] = serial_number
    line_dict["expiry_date"] = expiry_date.strip()
    all_lines_arr.append(line_dict)

per_row = 5
line_space = 30
rows = math.ceil(len(all_lines_arr)/per_row)
for i in range(0, rows):
    row_val = (i*per_row)+per_row
    if (row_val > len(all_lines_arr)):
        row_val = row_val - (row_val-len(all_lines_arr))
    for j in range((i*per_row), row_val):
        print(all_lines_arr[j]["pin"] + (line_space-(len(all_lines_arr[j]["pin"]))%line_space)*" ", end="")
    print("\n"*2)
    for j in range((i*per_row), row_val):
        print(all_lines_arr[j]["description"] + (line_space-(len(all_lines_arr[j]["description"]))%line_space)*" ", end="")
    print()
    for j in range((i*per_row), row_val):
        print(all_lines_arr[j]["serial_number"] + (line_space-(len(all_lines_arr[j]["serial_number"]))%line_space)*" ", end="")
    print()
    for j in range((i*per_row), row_val):
        print(all_lines_arr[j]["expiry_date"] + (line_space-(len(all_lines_arr[j]["expiry_date"]))%line_space)*" ", end="")
    print("\n"*5)
First we read the lines, and put them into an array of dictionaries i.e. each array element is a dictionary, which has an addressable value e.g. description
Next, we use per_row to decide how many tickets to print per row (you can change this).
Then the code will print the dictionary values for each element in the array.
The key to the formatting is that it uses modulus % to print the correct number of spaces. I used 30 as the separator.
I stripped out a lot of your code in order to just do the print formatting for you. It will be up to you to modify this to print to file or do anything else you need it to.
It is a bit too hardcoded for my liking, but without knowing more about exactly what you need, it works for your simple case.
Hope this helps!
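If you want something less hardcoded, here is a rough sketch of how the header values from the question (columns, column_width, column_spacing, left_margin, row_spacing and the line_item list) could drive the layout. The function name `layout_tickets` and the dictionary-per-ticket shape are assumptions for illustration; parsing the header itself is left out, and values longer than column_width are simply cut off here, which may not be what your layout needs.

```python
def layout_tickets(tickets, line_items, columns, column_width,
                   column_spacing, left_margin, row_spacing):
    """Return the output lines for tickets laid out side by side.

    tickets    -- list of dicts with keys such as 'pin' and 'description'
    line_items -- order of the ticket lines, e.g. ['pin', 'empty', 'description']
    """
    out = []
    gap = " " * column_spacing
    # Take the tickets `columns` at a time; each slice is one row of tickets.
    for start in range(0, len(tickets), columns):
        group = tickets[start:start + columns]
        # Every line item becomes one printed line spanning all columns.
        for item in line_items:
            cells = ["" if item == "empty" else str(t[item]) for t in group]
            line = gap.join(c[:column_width].ljust(column_width) for c in cells)
            out.append((" " * left_margin + line).rstrip())
        # Blank lines between one row of tickets and the next.
        out.extend([""] * row_spacing)
    return out


demo = [
    {"pin": "111", "description": "Gold 10.00"},
    {"pin": "222", "description": "Gold 10.00"},
    {"pin": "333", "description": "Gold 10.00"},
]
for text in layout_tickets(demo, ["pin", "empty", "description"],
                           columns=2, column_width=12, column_spacing=2,
                           left_margin=1, row_spacing=1):
    print(text)
```

Returning a list of lines rather than printing inside the function makes it easy to write the result to a file and to check the layout in test cases.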
This is the recommended way to open (and close) a file:
# open the input file for reading ('r')
with open('ConcertTickets.txt', 'r') as file:
    for line in file.readlines()[18:]:
        pass  # your logic
# open the result file for writing ('w'); 'w' creates the file if it does not exist, '+' additionally allows reading
with open('ConcertTickets_result.txt', 'w+') as file:
    pass  # your logic
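For the output file name requirement in the question (the input name with "_result" inserted just before the extension), a small sketch using the standard `pathlib` module could look like this. The helper name `result_name` is made up for illustration.

```python
from pathlib import Path


def result_name(input_name):
    """Insert '_result' just before the file extension of input_name."""
    p = Path(input_name)
    # p.stem is the name without the last extension, p.suffix is that extension.
    return str(p.with_name(p.stem + "_result" + p.suffix))


print(result_name("ConcertTickets.txt"))
```

This avoids splitting on '.', which breaks as soon as the file name or a directory in the path contains more than one dot.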
I have converted a PDF bank statement to a txt file. Here is a snippet of the .txt file:
15 Apr 20DDOPEN 100.00DDBENNON WATER SRVCS29.00DDBG BUSINESS106.00BPC BOB PETROL MINISTRY78.03BPC BARBARA STREAMING DATA30.50CRPAYPAL Z4J22FR450.00CRPAYNAL AAWDL4Z4J22222KHMG30.0019,028.4917 Apr 20CRCASH IN AT HSBC BANK
What is the easiest way of rewriting the text file in Python to create a new line at certain points, i.e. after a number ‘xx.xx’, or where there is a new date such as ‘xx APR’?
For example the text to become:
15 Apr 20DDOPEN 100.00
BENNON WATER SRVCS29.00
DDBG BUSINESS106.00...(etc)
I am just trying to make a PDF more readable and useful when working amongst my other files.
If you know of another PDF to txt python converter which works better, I would also be interested.
Thanks for your help
The first step would be getting the text file into Python:
with open("file.txt") as file:
    data = file.read()
This next part I initially thought you wouldn't be able to do, but in your example each part contains a number of the form XX.XX. The important thing to notice here is that there is a '.' in each number.
Using Python's string find method, you can iteratively look for that '.' and add a newline character after the two digits that follow it. You can change my indices below to remove the DD as well if you want.
index = 0
while index != -1:
    index = data.find('.', index)
    if index != -1:
        data = data[:index+3] + '\n' + data[index+3:]
        # move past this number so the same '.' is not found again
        index += 3
Then you need to write the new data back to the file.
with open('file.txt', 'w') as file:
    file.write(data)
For the given input the following should work:
import re

counter = 0
l = "15 Apr 20DDOPEN 100.00DDBENNON WATER SRVCS29.00DDBG BUSINESS106.00BPC BOB PETROL MINISTRY78.03BPC BARBARA STREAMING DATA30.50CRPAYPAL Z4J22FR450.00CRPAYNAL AAWDL4Z4J22222KHMG30.0019,028.4917 Apr 20CRCASH IN AT HSBC BANK"
# raw string so the backslashes reach the regex engine unchanged
nums = re.finditer(r"[\d]+[\.][\d]+", l)
for elem in nums:
    # each newline already inserted shifts later match positions by one
    idx = elem.span()[1] + counter
    l = l[:idx] + '\n' + l[idx:]
    counter += 1
print(l)
The output is:
15 Apr 20DDOPEN 100.00
DDBENNON WATER SRVCS29.00
DDBG BUSINESS106.00
BPC BOB PETROL MINISTRY78.03
BPC BARBARA STREAMING DATA30.50
CRPAYPAL Z4J22FR450.00
CRPAYNAL AAWDL4Z4J22222KHMG30.0019
,028.4917
Apr 20CRCASH IN AT HSBC BANK
Then you should easily be able to write it line by line to a file.
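Note that in the output above the amount 30.00, the balance 19,028.49 and the date "17 Apr" run together, because the pattern takes every digit after the decimal point. A possible refinement, assuming every amount has exactly two decimals and that thousands are grouped with commas (which the snippet suggests but does not guarantee), is sketched below. The names `AMOUNT` and `split_statement` are made up for illustration.

```python
import re

# 1-3 digits, optional ",ddd" groups, then exactly two decimals.
AMOUNT = re.compile(r"\d{1,3}(?:,\d{3})*\.\d{2}")


def split_statement(text):
    """Insert a newline after every amount and return the resulting lines."""
    # \g<0> is the whole match, so each amount is kept and followed by a newline.
    return AMOUNT.sub(r"\g<0>\n", text).split("\n")


raw = ("15 Apr 20DDOPEN 100.00DDBENNON WATER SRVCS29.00DDBG BUSINESS106.00"
       "BPC BOB PETROL MINISTRY78.03BPC BARBARA STREAMING DATA30.50"
       "CRPAYPAL Z4J22FR450.00CRPAYNAL AAWDL4Z4J22222KHMG30.00"
       "19,028.4917 Apr 20CRCASH IN AT HSBC BANK")

for row in split_statement(raw):
    print(row)
```

With this pattern the balance and the following date each end up on their own line, though a reference number that happens to end in digits right before an amount could still confuse it.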
I'm struggling to get readline() and split() to work together as I was expecting. I'm trying to use .split(')') to cut down some data from a text file and write some of that data to a new text file.
I have tried writing everything from the line.
I have tried [cnt % 2] to get what I expected.
line = fp.readline()
fw = open('output.txt', "w+")
cnt = 1
while line:
    print("Line {}: {}".format(cnt, line.strip()))
    line = fp.readline()
    line = line.split(')')[0]
    fw.write(line + "\n")
    cnt += 1
An example from the text file I'm reading from:
WELD 190 Manufacturing I Introduction to MasterCAM (3)
1½ hours lecture - 4½ hours laboratory
Note: Cross listed as DT 190/ENGR 190/IT 190
This course will introduce the students to MasterCAM and 2D and basic 3D
modeling. Students will receive instructions and drawings of parts requiring
2- or 3-axis machining. Students will design, model, program, set-up and run
their parts on various machines, including plasma cutters, water jet cutters and
milling machines.
WELD 197 Welding Technology Topics (.5 - 3)
I'm very far off from actually effectively scraping this data but I'm trying to get a start.
My goal is to extract only class name and number and remove descriptions.
Thanks as always!
I believe to solve your current problem, if you're only attempting to parse one line, you will simply need to move your second line = fp.readline() line to the end of the while loop. Currently, you are actually starting the parsing from the second line, because you have already used a readline in the first line of your example code.
After the change it would look like this:
line = fp.readline() # read in the first line
fw = open('output.txt', "w+")
cnt = 1
while line:
    print("Line {}: {}".format(cnt, line.strip()))
    line = line.split(')')[0]
    fw.write(line + "\n")
    cnt += 1
    line = fp.readline() # read in next line after parsing done
Output for your example input text:
WELD 190 Manufacturing I Introduction to MasterCAM (3
Assuming your other class text blocks share the same structure as the one you showed, you might want to use a regular expression to extract the class name and class number.
In the following I assume that every text block contains the text "XX hours lecture" in the same position, where 'XX' stands for any kind of number (time frame). In the variable 'match_re' I define a regular expression that matches everything up to and including 'XX hours lecture'. By using 'match.group(2)' I restrict the result to the part within the innermost pair of parentheses.
The matching expression below probably won't be complete for you yet since I don't know your whole text file.
Below I extract the string: WELD 190 Manufacturing I Introduction to MasterCAM (3)
import re
string = "WELD 190 Manufacturing I Introduction to MasterCAM (3) 1½ hours lecture - 4½ hours laboratory Note: Cross listed as DT 190/ENGR 190/IT 190 This course will introduce the students to MasterCAM and 2D and basic 3D modeling. Students will receive instructions and drawings of parts requiring 2- or 3-axis machining. Students will design, model, program, set-up and run their parts on various machines, including plasma cutters, water jet cutters and milling machines. WELD 197 Welding Technology Topics (.5 - 3)"
# raw string so \d is passed to the regex engine unchanged
match_re = r"(^(.*)\d.* hours lecture)"
match = re.search(match_re,string)
if match:
    print(match.group(2))
else:
    print("No match")
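If the file is read line by line, another option is to keep only the lines that look like a course heading and skip everything else. The sketch below assumes a heading is an upper-case department code, a course number, a title, and the units in parentheses at the end of the line, which fits the two headings in the sample but may not fit the whole catalogue. `COURSE_RE` and `course_headers` are names made up for illustration.

```python
import re

# Department code, course number, title, then units in parentheses at line end.
COURSE_RE = re.compile(r"^([A-Z]{2,5}) (\d+[A-Z]?) (.+?) \(([\d.\s-]+)\)\s*$")


def course_headers(lines):
    """Return (department, number, title, units) for every heading-like line."""
    found = []
    for line in lines:
        m = COURSE_RE.match(line.strip())
        if m:
            found.append(m.groups())
    return found


sample = [
    "WELD 190 Manufacturing I Introduction to MasterCAM (3)",
    "1½ hours lecture - 4½ hours laboratory",
    "Note: Cross listed as DT 190/ENGR 190/IT 190",
    "2- or 3-axis machining. Students will design, model, program, set-up and run",
    "milling machines.",
    "WELD 197 Welding Technology Topics (.5 - 3)",
]
for dept, number, title, units in course_headers(sample):
    print(dept, number, title)
```

In the real script the list would come from iterating over the open file (`for line in fp:`), which also avoids the readline bookkeeping from the question.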
I have a file that comes from MapReduce output in the format below, and it needs to be converted to CSV using a shell script:
25-MAY-15
04:20
Client
0000000010
127.0.0.1
PAY
ISO20022
PAIN000
100
1
CUST
API
ABF07
ABC03_LIFE.xml
AFF07/LIFE
100000
Standard Life
================================================
==================================================
AFF07-B000001
2000
ABC Corp
..
BE900000075000027
AFF07-B000002
2000
XYZ corp
..
BE900000075000027
AFF07-B000003
2000
3MM corp
..
BE900000075000027
I need the output in CSV format as shown below, where some of the values in the file are repeated and the TRANSACTION ID is added:
25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,AFF07-B000001, 2000,ABC Corp,..,BE900000075000027
25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,AFF07-B000002,2000,XYZ Corp,..,BE900000075000027
The TRANSACTION IDs are AFF07-B000001, AFF07-B000002 and AFF07-B000003, each with different values, and I have put a marker line where the transaction IDs start. The values before the demarcation should be repeated on every row, and the transaction ID columns need to be appended to those repeated values, as in the format above.
I may need a Bash shell script; the Linux flavour is CentOS.
I am getting the error below when I execute the code:
Traceback (most recent call last):
  File "abc.py", line 37, in <module>
    main()
  File "abc.py", line 36, in main
    createTxns(fh)
  File "abc.py", line 7, in createTxns
    first17.append( fh.readLine().rstrip() )
AttributeError: 'file' object has no attribute 'readLine'
Can someone help me out?
Is this a correct description of the input file and output format?
The input file consists of:
17 lines, followed by
groups of 10 lines each - each group holding one transaction id
Each output row consists of:
29 common fields, followed by
5 fields derived from each of the 10-line groups above
So we just translate this into some Python:
def createTxns(fh):
    """fh is the file handle of the input file"""
    # 1. Read 17 lines from fh
    first17 = []
    for i in range(17):
        # the method is readline (lower-case l); readLine raises AttributeError
        first17.append( fh.readline().rstrip() )
    # 2. Form the common fields.
    commonFields = first17 + first17[0:12]
    # 3. Process the rest of the file in groups of ten lines.
    while True:
        # read 10 lines
        group = []
        for i in range(10):
            x = fh.readline()
            if x == '':
                break
            group.append( x.rstrip() )
        if len(group) != 10:
            break # we've reached the end of the file
        fields = commonFields + [ group[2], group[4], group[6], group[7], group[9] ]
        row = ",".join(fields)
        print(row)

def main():
    with open("input-file", "r") as fh:
        createTxns(fh)

main()
This code shows how to:
open a file handle
read lines from a file handle
strip off the ending newline
check for end of input when reading from a file
concatenate lists together
join strings together
I would recommend you to read Input and Output if you are going for the python route.
You just have to break the problem down and try it. For the first 17 lines, use f.readline() and concatenate them into one string. Then use the replace method to get the beginning of the string that you want in the CSV.
str.replace("\n", ",")
Then use the split method and break them down into the list.
str.split("\n")
Then write the file out in the loop. Use a counter to make your life easier. First write out the header string
25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API
Then write the item in the list with a ",".
,AFF07-B000001, 2000,ABC Corp,..,BE900000075000027
At the count of 5 write the "\n" with the header again and don't forget to reset your counter so it can begin again.
\n25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API,ABF07,ABC03_LIFE.xml,AFF07/LIFE,100000,Standard Life, 25-MAY-15,04:20,Client,0000000010,127.0.0.1,PAY,ISO2002,PAIN000,100,1,CUST,API
Give it a try and let us know if you need more assistance. I assumed that you have some scripting background :) Good luck!!
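To make the steps above a bit more concrete, here is a rough sketch. It assumes the input looks exactly like the sample as shown in the question: 17 header lines, separator lines made only of "=" characters, then groups of 5 lines per transaction (adjust `group_len` if your real file has a different group size or extra lines). The function name `build_rows` and its parameters are made up for illustration.

```python
def build_rows(lines, header_len=17, repeat_len=12, group_len=5):
    """Turn the one-value-per-line input into a list of CSV row strings."""
    # Drop blank lines and the "=====" separator lines.
    values = [ln.strip() for ln in lines
              if ln.strip() and set(ln.strip()) != {"="}]
    header = values[:header_len]
    # The expected output repeats the first 12 header values after the 17.
    common = header + header[:repeat_len]
    body = values[header_len:]
    rows = []
    # Walk the body one complete group at a time; a trailing partial group is ignored.
    for start in range(0, len(body) - group_len + 1, group_len):
        rows.append(",".join(common + body[start:start + group_len]))
    return rows
```

Usage would be along the lines of `with open("input-file") as fh: print("\n".join(build_rows(fh)))`. If any value can itself contain a comma, the standard `csv` module is a safer way to write the rows than a plain join.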
I'm a bit stuck with Python logic.
I'd like some advice on how to tackle a problem I'm having with Python and the methods for parsing data.
I've spent a bit of time reading the Python reference documents and going through this site, and I understand there are several ways to do what I'm trying to achieve; this is the path I've gone down.
I'm reformatting some text files with data generated from some satellite hardware to be uploaded into a MySQL database.
This is the raw data
TP N: 1
Frequency: 12288.635 Mhz
Symbol rate: 3000 KS
Polarization: Vertical
Spectrum: Inverted
Standard/Modulation: DVB-S2/QPSK
FEC: 1/2
RollOff: 0.20
Pilot: on
Coding mode: ACM/VCM
Short frame
Transport stream
Single input stream
RF-Level: -49 dBm
Signal/Noise: 6.3 dB
Carrier width: 3.600 Mhz
BitRate: 2.967 Mbit/s
The above section is repeated for each transponder TP N on the satellite
I'm using this script to extract the data I need
strings = ("Frequency", "Symbol", "Polar", "Mod", "FEC", "RF", "Signal", "Carrier", "BitRate")
sat_raw = open('/BLScan/reports/1520.txt', 'r')
sat_out = open('1520out.txt', 'w')
for line in sat_raw:
    if any(s in line for s in strings):
        for word in line.split():
            if ':' in word:
                sat_out.write(line.split(':')[-1])
sat_raw.close()
sat_out.close()
The output data is then formatted like this before its sent to the database
12288.635 Mhz
3000 KS
Vertical
DVB-S2/QPSK
1/2
-49 dBm
6.3 dB
3.600 Mhz
2.967 Mbit/s
This script is working fine but for some features I want to implement on MySQL I need to edit this further.
Remove the decimal point and 3 numbers after it and MHz on the first "frequency" line.
Remove all the trailing measurement references KS,dBm,dB, Mhz, Mbit.
Join the 9 fields into a comma-delimited string so that each transponder (approx. 30 per file) is on its own line.
I'm unsure whether to continue down this path and add onto the existing script (where I'm stuck at the point at which the output file is written), or to rethink the way I'm processing the raw file.
My solution is crude, might not work in corner cases, but it is a good start.
import re
import csv

strings = ("Frequency", "Symbol", "Polar", "Mod", "FEC", "RF", "Signal", "Carrier", "BitRate")
sat_raw = open('/BLScan/reports/1520.txt', 'r')
sat_out = open('1520out.txt', 'w')
csv_writer = csv.writer(sat_out)
csv_output = []
for line in sat_raw:
    if any(s in line for s in strings):
        try:
            m = re.match(r'^.*:\s+(\S+)', line)
            value = m.groups()[0]
            # Attempt to convert to int, thus removing the decimal part
            value = int(float(value))
        except ValueError:
            pass # Ignore conversion
        except AttributeError:
            pass # Ignore case when m is None (no match)
        csv_output.append(value)
    elif line.startswith('TP N'):
        # Before we start a new set of values, write out the old set
        if csv_output:
            csv_writer.writerow(csv_output)
            csv_output=[]

# If we reach the end of the file, don't miss the last set of values
if csv_output:
    csv_writer.writerow(csv_output)

sat_raw.close()
sat_out.close()
Discussion
The csv package helps with CSV output
The re (regular expression) module helps parsing the line and extract the value from the line.
In the line that reads value = int(...), we attempt to turn the string value into an integer, thus removing the dot and the digits that follow it.
When the code encounters a line that starts with 'TP N', which signals a new set of values, we write out the old set of values to the CSV file.
import math

strings = ("Frequency", "Symbol", "Polar", "Mod", "FEC", "RF", "Signal", "Carrier", "BitRate")
files=['/BLScan/reports/1520.txt']
sat_out = open('1520out.txt', 'w')
combineOutput=[]
for myfile in files:
    sat_raw = open(myfile, 'r')
    singleOutput=[]
    for line in sat_raw:
        if any(s in line for s in strings):
            marker=line.split(':')[1]
            try:
                data=str(int(math.floor(float(marker.split()[0]))))
            except:
                data=marker.split()[0]
            singleOutput.append(data)
    combineOutput.append(",".join(singleOutput))
for rec in combineOutput:
    sat_out.write("%s\n"%rec)
sat_raw.close()
sat_out.close()
Add all the files that you want to parse in files list. It will write the output of each file as a separate line and each field comma separated.
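For comparison, here is a sketch that follows the three requirements in the question more literally: only the frequency loses its decimal part, units are dropped by keeping the first token of each value, and every "TP N" block becomes one row. It assumes the labels appear exactly as in the sample data; the names `FIELDS` and `parse_transponders` are made up for illustration.

```python
# Labels to keep, spelled as they appear in the sample data.
FIELDS = ("Frequency", "Symbol rate", "Polarization", "Standard/Modulation",
          "FEC", "RF-Level", "Signal/Noise", "Carrier width", "BitRate")


def parse_transponders(lines):
    """Return one list of values per transponder block."""
    rows, current = [], None
    for line in lines:
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep:
            continue  # lines such as "Short frame" carry no "label: value" pair
        if label == "TP N":
            # A new transponder starts: store the previous one, if any.
            if current:
                rows.append(current)
            current = []
        elif label in FIELDS and current is not None:
            parts = rest.split()
            value = parts[0] if parts else ""  # first token, so the unit is dropped
            if label == "Frequency":
                value = value.split(".")[0]  # cut the decimal part, no rounding
            current.append(value)
    if current:
        rows.append(current)  # do not lose the last block
    return rows
```

The rows can then be written with `csv.writer(sat_out).writerows(parse_transponders(sat_raw))`, which gives one comma-delimited line per transponder.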