regex extract data from raw text - python

I work in a hotel. Here is a raw file from the reports I have. I need to extract data so that I end up with something like data['roomNumber'] = ('paxNumber', isbb).
Here is a sample that concerns only two rooms, 10 and 12, so the data I need should be BreakfastData = {'10': ['2', 'BB'], '12': ['1', 'BB']}
1) roomNumber: a line that starts and ends with a number, or starts with a number followed by one or more spaces and then a string
2) paxNumber: the two numbers just before the 'VA' string
3) isbb: defined by a 'BB' or 'HPDJ' occurrence found between two '/'. Sometimes the formatting is inconsistent, so it can be '/HPDJ/', '/ HPDJ /', '/ HPDJ/', etc.
10 PxxxxD,David,Mme, Mr T- EXPEDIA TRAVEL
08.05.17 12.05.17 TP
SUP DBL / HPDJ / DEBIT CB AGENCE - NR
2 0 VA
NR
12
LxxxxSH,Claudia,Mrs
08.05.17 19.05.17 TP
1 0 VA
NR BB
SUP SGL / BB / EN ATTENTE DE VIREMENT- EVITER LA 66 -
.... etc
Edit: latest code:
import re

data = {}
pax = ''
r = re.compile(r"(\d+)\W*(\d+)\W*VA")
r2 = re.compile(r"/\s*(BB|HPDJ)\s*/")
r3 = re.compile(r"\d+\n")
r4 = re.compile(r"\d+\s+\w")
PATH = "/home/ryms/regextest"

with open(PATH, 'rb') as raw:
    text = raw.read()

#roomNumber = re.search(r4, text).group()
#roomNumber2 = re.search(r3, text).group()
roomNumber = re.search(r4, text).group().split()[0]
roomNumber2 = re.search(r3, text).group().split()[0]
pax = re.findall(r, text)
adult = pax[0]; enfant = pax[1]
# if enfant is '0':
#     pax = adult
# else:
#     pax = (str(adult) + '+' + str(enfant))
bb = re.findall(r2, text)  # look for BB or HPDJ
data[roomNumber] = pax, bb
print(data)
print(roomNumber)
print(roomNumber2)
This returns:
{'10': ([('2', '2'), ('1', '1')], ['HPDJ', 'BB'])}
10
12
[Finished in 0.1s]
How can I get both room numbers into my result?
I have a lot of trouble with the \n issue and read(), readline(), readlines(). What is the trick?
Once I have all the raw data, how will I get the proper BreakfastData dict? Will I use zip()?
At the beginning I wanted to split the file and then parse it, but I tried so many things that I got lost. For that I need a regex that matches both room-number patterns.

In the first case, where you want to select the two numbers followed by 'VA', you can do it like this:
r = re.compile(r"(\d+)\W*(\d+)\W*VA")
In the second case you can get HPDJ or BB like this:
r = re.compile(r"/\s*(HPDJ|BB)\s*/")
This will handle all the cases you mentioned: '/HPDJ/', '/ HPDJ /', '/ HPDJ/'.
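To answer the follow-up about building BreakfastData: one way is to split the report into per-room chunks and apply the two patterns above to each chunk. Here is a minimal sketch, assuming each room entry starts on a line that is either a bare room number or a room number followed by a name (the file path is the one from your question, and whether 'HPDJ' should count as 'BB' is up to you):
import re

pax_re = re.compile(r"(\d+)\W*(\d+)\W*VA")
bb_re = re.compile(r"/\s*(BB|HPDJ)\s*/")
# a line that is only a number, or a number followed by spaces and a letter
room_re = re.compile(r"^(\d+)(?:\s*$|\s+[A-Za-z])", re.MULTILINE)

def build_breakfast_data(text):
    breakfast_data = {}
    starts = list(room_re.finditer(text))
    for i, match in enumerate(starts):
        room = match.group(1)
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        chunk = text[match.start():end]
        pax = pax_re.search(chunk)   # adults in group(1), children in group(2)
        bb = bb_re.search(chunk)     # 'BB' or 'HPDJ'
        if pax and bb:
            # map 'HPDJ' to 'BB' here if both should count as breakfast
            breakfast_data[room] = [pax.group(1), bb.group(1)]
    return breakfast_data

with open("/home/ryms/regextest") as raw:   # text mode, not 'rb'
    print(build_breakfast_data(raw.read()))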

The regex to get the text before the 'VA' is as follows:
r = re.compile(r"(.*) VA")
The "number" (which will be a string) will then be stored in the first group of the match object returned by the search.
I am not quite sure what the room number even is, because your description is a bit unclear, so I cannot help with that unless you clarify.
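For example, applied to one of the lines from the question's sample:
import re

line = "2 0 VA"
m = re.search(r"(.*) VA", line)
if m:
    print(m.group(1))   # prints '2 0'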

Related

Process lines with different sizes to csv

I'm trying to convert a PDF bank statement to CSV. I'm fairly new to Python, but I managed to extract the text from the PDF. I ended up with something similar to this:
AMAZON 23/12/2019 15:40 -R$ 100,00 R$ 400,00 credit
Some Restaurant 23/12/2019 14:00 -R$ 10,00 R$ 500 credit
Received from John Doe 22/12/2019 15:00 R$ 510 R$ 500,00
03 Games 22/12/2019 15:00 R$ 10 R$ 10,00 debit
I want this output:
AMAZON;23/12/2019;-100,00
Some Restaurant;23/12/2019;-10,00
Received from John Doe;22/12/2019;510
03 Games;22/12/2019;10
The first field has a different size on each line; I don't need the time or the currency formatting, and I don't need the last two fields.
I have this code so far (just extracting text from PDF):
import pdfplumber
import sys

url = sys.argv[1]
pdf = pdfplumber.open(url)
pdf_pages = len(pdf.pages)
for i in range(pdf_pages):
    page = pdf.pages[i]
    text = page.extract_text()
    print(text)
pdf.close()
Can anyone give some directions?
Try using the split method to break the text into lines, split each line into its parts, and then pick the parts you need.
The following link explains it very nicely.
https://www.w3schools.com/python/showpython.asp?filename=demo_ref_string_split
from typing import List

def get_date_index(entries: List[str]) -> int:
    # either use this length/"/" check or verify that the entry only contains digits and "/"
    for index, entry in enumerate(entries):
        if len(entry) != 10:
            continue
        if entry[2] != "/" or entry[5] != "/":
            continue
        # here you could also check that the other parts of the date are digits
        return index
    raise ValueError("No Date found")

lines: List[str] = text.split("\n")
for line in lines:
    entries: List[str] = line.split()
    if not entries:
        continue
    date_entry_index: int = get_date_index(entries)
    # everything before the date belongs to the name
    name = " ".join(entries[:date_entry_index])
    # the currency token carries the sign, e.g. "-R$" or "R$"
    sign = "-" if entries[date_entry_index + 2].startswith("-") else ""
    amount = entries[date_entry_index + 3]
    print(f"{name};{entries[date_entry_index]};{sign}{amount}")
That should print it.
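If you want to write the result to a file rather than print it, the csv module can produce the ';'-separated output; a minimal sketch (the output file name and the example rows are assumptions):
import csv

# rows as produced by the parsing loop above
rows = [
    ("AMAZON", "23/12/2019", "-100,00"),
    ("Some Restaurant", "23/12/2019", "-10,00"),
]

with open("statement.csv", "w", newline="") as out_file:
    writer = csv.writer(out_file, delimiter=";")
    writer.writerows(rows)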

Python - Printing in line from dict

I'm a beginner at python and I'm experiencing difficulties printing nicely.
I made a program that stores names and prices in dictionary.
(e.g. {"PERSON_1": "50", "PERSON_2": "75", "PERSON_WITH_EXTREMELY_LONG_NAME": "80"})
Now the problem is that I want to be able to print the keys and their supposed values in a nice scheme.
I used the code:
for i in eter.eters:
    print(i + "\t | \t" + str(eter.eters[i]))
with eter.eters being my dictionary.
The problem is that some names are a lot longer than others, so the tabs don't align.
My header, "Names" | "Price", should also be aligned with the information below.
I've already looked up some solutions, but I don't really understand the ones I found.
Desired outcome:
**********************************************************************
De mensen die blijven eten zijn:
**********************************************************************
Naam                            | bedrag
----------------------------------------------------------------------
PERSON 1                        | 50
PERSON 2                        | 75
PERSON WITH EXTREMELY LONG NAME | 80
**********************************************************************
Try this, given that eter.eters is your dictionary:
print('%-35s | %6s' % ('Names', 'Price'))  # align the name column to the left
for k in eter.eters:
    print('%-35s | %6s' % (k, eter.eters[k]))
or
print("{0:<35}".format('Name')+'|'+"{0:>6}".format('Price'))
for k in eter:
print("{0:<35}".format(k)+'|'+"{0:>6}".format(eter.eters[k]))
You can get all of the names and find the maximum length. Then print every name with padding instead of a tab (\t). This code should explain it:
>>> d={"Marius":"50","John":"75"}
>>> d
{'Marius': '50', 'John': '75'}
>>> for i in d:
...     print(i)
...
Marius
John
>>> d = {"Marius":"50","John":"75"}
>>> m = 0
>>> for i in d:
...     m = max(m, len(i))
...
>>> m
6  # now we know the Name column should be 6 characters wide
>>> for i in d:
...     print(i + (m-len(i))*' ', d[i])  # pad the name with spaces so it fills the 6-character column
...
Marius 50
John   75
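On Python 3.6+ the same idea can also be written with f-strings and a computed column width; a small sketch (the dictionary here stands in for eter.eters):
eters = {"PERSON_1": "50", "PERSON_2": "75", "PERSON_WITH_EXTREMELY_LONG_NAME": "80"}

# width of the name column: the longest key, but at least as wide as the header
width = max(max(len(name) for name in eters), len("Naam"))

print(f"{'Naam':<{width}} | bedrag")
print("-" * (width + 9))
for name, price in eters.items():
    print(f"{name:<{width}} | {price:>6}")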

Using the right python package to achieve result

I have a fixed-width text file that I must convert to a .csv where all numbers have to be converted to integers (no commas, dollar signs, quotes, etc.). I have currently parsed the text file using plain Python, but when it comes to using the right package I seem to be at an impasse.
With csv, I can use writer.writerows in place of my print statement to write the output into my CSV file, but the problem is that I have extra columns (such as the date and time) that I must add after these rows, and I cannot seem to do that with csv. I also cannot find a way to translate the blank columns in my text document into blank columns in the output; csv seems to write the fields strictly in order.
I was reading the documentation on xlsxwriter and I see how you can write to individual columns with a set formatting, but I am unsure whether it would work for my .csv requirement.
My input text has a series of random groupings throughout the 50k-line document, but follows the format below:
* START ******************************************************************************************************************** START *
* START ******************************************************************************************************************** START *
* START ******************************************************************************************************************** START *
1--------------------
1ANTECR09 CHEK DPCK_R_009
TRANSIT EXTRACT SUB-SYSTEM
CURRENT DATE = 08/03/2017 JOURNAL REPORT PAGE 1
PROCESS DATE =
ID = 022000046-MNT
FILE HEADER = H080320171115
+____________________________________________________________________________________________________________________________________
R T SEQUENCE CR BT A RSN ITEM ITEM CHN USER REASO
NBR NBR OR PIC NBR DB NBR NBR COD AMOUNT SERIAL IND .......FIELD.. DESCR
5,556 01 7450282689 C 538196640 9835177743 15 $9,064.81 00 CREDIT
5,557 01 7450282690 D 031301422 362313705 38 $592.35 43431 DR CR
5,558 01 7450282691 D 021309379 601298839 38 $1,491.04 44896 DR CR
5,559 01 7450282692 D 071108834 176885 38 $6,688.00 1454 DR CR
5,560 01 7450282693 D 031309123 1390001566241 38 $293.42 6878 DR CR
My code currently parses this document, pulls the date and time, and prints only the lines where the sequence number starts with 42 and the CR column is "C":
lines = []
a = 'PRINT DATE:'
b = 'ARCHIVE'
c = 'PRINT TIME:'
with open(r'textfile.txt') as in_file:
    for line in in_file:
        values = line.split()
        if 'PRINT DATE:' in line:
            dtevalue = line.split(a, 1)[-1].split(b)[0]
            lines.append(dtevalue)
        elif 'PRINT TIME:' in line:
            timevalue = line.split(c, 1)[-1].split(b)[0]
            lines.append(timevalue)
        elif (len(values) >= 4 and values[3] == 'C'
              and len(values[2]) >= 2 and values[2][:2] == '41'):
            print(line)
print(lines[0])
print(lines[1])
What would be the cleanest way to achieve this result, and am I headed in the right direction by writing out the parsing first or should I have just done everything within a package first?
Any help is appreciated
Edit:
The header block (between the 1-------------------- and +____________ lines) is repeated throughout the document, as are different-sized groupings separated by --------.
--------------------
34,615 207 4100223726 C 538196620 9866597322 10 $645.49 00 CREDIT
34,616 207 4100223727 D 022000046 8891636675 31 $645.49 111583 DR ON-
--------------------
34,617 208 4100223728 C 538196620 11701364 10 $756.19 00 CREDIT
34,618 208 4100223729 D 071923828 00 54 $305.31 11384597 BAD AC
34,619 208 4100223730 D 071923828 35110011 30 $450.88 10913052 6 DR SEL
--------------------
I would recommend slicing fixed width blocks to parse through the fixed width fields. Something like the following (incomplete) code:
data = """ 5,556 01 4250282689 C 538196640 9835177743 15 $9,064.81 00
CREDIT
5,557 01 7450282690 D 031301422 362313705 38 $592.35 43431
DR CR
5,558 01 7450282691 D 021309379 601298839 38 $1,491.04 44896
DR CR
"""
# list of data layout tuples (start_index, stop_index, field name)
# TODO add missing data layout tuples
data_layout = [(0, 12, 'r_nbr'), (12, 22, 't_nbr'), (22, 39, 'seq'), (39, 42, 'cr_db')]
for line in data.splitlines():
# skip "separator" lines
# NOTE this may be an iterative process to identify these
if line.startswith('-----'):
continue
record = {}
for start_index, stop_index, name in data_layout:
record[name] = line[start_index:stop_index].strip()
# your conditional (seems inconsistent with text)
if record['seq'].startswith('42') and record['cr_db'] == 'C':
# perform any special handling for each column
record['r_nbr'] = record['r_nbr'].replace(',', '')
# TODO other special handling (like $)
print('{r_nbr},{t_nbr},{seq},{cr_db},...'.format(**record))
Output is:
5556,01,4250282689,C,...
Update based on seemingly spurious values in undefined columns
Based on the new information provided about the "spurious" columns/fields (appear only rarely), this will likely be an iterative process.
My recommendation would be to narrow (appropriately!) the width of the desired fields. For example, if spurious data is in line[12:14] above, you could change the tuple for (12, 22, 't_nbr') to (14, 22, 't_nbr') to "skip" the spurious field.
An alternative is to add a "garbage" field in the list of tuples to handle those types of lines. Wherever the "spurious" fields appear, the "garbage" field would simply consume it.
If you do need these fields, the same general "garbage" field approach still applies, but you keep the data instead of discarding it.
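For example, a "garbage" entry in the layout might look like this (the indices are only illustrative):
data_layout = [(0, 12, 'r_nbr'), (12, 14, 'garbage'), (14, 22, 't_nbr'),
               (22, 39, 'seq'), (39, 42, 'cr_db')]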
Update based on random separators
If they are relatively consistent, I'd simply add some logic (as I did above) to "detect" the separators and skip over them.
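Coming back to the original CSV question, a minimal sketch of writing the matching records plus the report date and time with the csv module could look like this (the record values, column order, and file name are assumptions):
import csv

# hypothetical values pulled from the report header, as in the question's code
report_date = '08/03/2017'
report_time = '11:15'

# records as produced by the slicing loop above (shortened to a single example)
records = [
    {'r_nbr': '5556', 't_nbr': '01', 'seq': '4250282689', 'cr_db': 'C', 'amount': '$9,064.81'},
]

with open('output.csv', 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(['r_nbr', 't_nbr', 'seq', 'cr_db', 'amount', 'date', 'time'])
    for rec in records:
        # strip '$' and ',' so the amount is written as a plain number
        amount = rec['amount'].replace('$', '').replace(',', '')
        writer.writerow([rec['r_nbr'], rec['t_nbr'], rec['seq'], rec['cr_db'],
                         amount, report_date, report_time])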

re.sub python to gather height

I am writing a python program to parse some user data from a txt file.
One of the rows in the text file will contain the user's height.
I have specified an order that the user is expected to follow: the first line of the file should contain the name, the next line the date of birth, the third line the height, and so on.
I have also given the user a sample file which looks like this:
Name: First Name Last Name
DOB: 16.04.2000
Age: 16
Height: 5 feet 9 inch
When I read the file, I looked at each line and split it using ':' as a separator.
The first field is my column name like name, dob, age, height.
In some cases, users forget the ':' after Name or DOB, or they will simply send data like:
Height 5 feet 9 inch
5 feet 9 inch
5ft 9 in
5feet 9inches
The logic I have decided to use is:
Look for ':' on each line; if one is found, then I have my field.
Otherwise, try to find out what data it could be.
The logic for height is like this:
if any(heightword in file_line.upper() for heightword in ['FT', 'HEIGHT', 'FEET', 'INCH', 'CM'])
This if condition will look for words associated with height.
Once I have determined that the line from the file contains the height, I want to be able to convert that information to inches before I write it to the database.
Please can someone help me work out how to convert the following data to inches.
Height 5 feet 9 inch
5 feet 9 inch
5ft 9 in
5feet 9inches
I know that since I am trying to cater to a variety of user inputs, this list is not exhaustive; I am using these as examples to understand the approach, and I will keep adding code if and when I find new patterns.
pyparsing is a nice module for simple parsing situations like this, especially when trying to process less-than-predictable-but-still-fairly-structured human input. You can compose your parser using some friendly-named classes (Keyword, Optional, OneOrMore, and so on) and arithmetic operators ('+' for sequence, '|' for alternatives, etc.) to assemble smaller parsers into larger ones. Here is a parser built up from bits for your example (it also supports ' and " for feet and inches, and fractional feet and inch values). (This sample uses the latest version of pyparsing, version 2.1.4.)
samples = """\
Height 5 feet 9 inch
5 feet 9 inch
5ft 9 in
5feet 9inches
5'-9-1/2"
5' 9-1/2"
5' 9 1/2"
6'
3/4"
3ft-6-1/4 in
"""
from pyparsing import CaselessKeyword, pyparsing_common, Optional
CK = CaselessKeyword
feet_units = CK("feet") | CK("ft") | "'"
inch_units = CK("inches") | CK("inch") | CK("in") | '"'
# pyparsing_common.number will parse an integer or real, and convert to float
integer = pyparsing_common.number
fraction = integer + '/' + integer
fraction.addParseAction(lambda t: t[0]/t[-1])
qty = fraction | (integer + Optional(fraction)).addParseAction(lambda t:sum(t))
# define whole Height feet-inches expression
HEIGHT = CK("height") | CK("ht")
inch_qty = qty("inches")
feet_qty = qty("feet")
height_parser = Optional(HEIGHT) + (inch_qty + inch_units |
feet_qty + feet_units + Optional(inch_qty + inch_units))
# use parse-time callback to convert feet-and-inches to inches
height_parser.addParseAction(lambda t: t.get("feet", 0.0)*12 + t.get("inches", 0.0))
height_parser.ignore("-")
height_parser.runTests(samples)
# how to use the parser in normal code
height_value = height_parser.parseString(samples.splitlines()[0])[0]
print(height_value, type(height_value))
Prints:
Height 5 feet 9 inch
[69.0]
5 feet 9 inch
[69.0]
5ft 9 in
[69.0]
5feet 9inches
[69.0]
5'-9-1/2"
[69.5]
5' 9-1/2"
[69.5]
5' 9 1/2"
[69.5]
6'
[72.0]
3/4"
[0.75]
3ft-6-1/4 in
[42.25]
69.0 <type 'float'>
In JavaScript, there is an operation called "computed access", written object[key], where the property to read is determined by the result of an expression, as an alternative to the normal . operator. Personally, I mostly use it for iteration and for reading properties with hyphens in their names, but it can also be used to map an input string to an associated result.
So after an entire afternoon of Googling and figuring out Python syntax, I was able to write a short program to do this.
import re
import string

h = 0
r = re.compile(r'(\d+)\s*(\w+)\b')

def incr(m):
    h += m.group(1) * ({'in': 1, 'inches': 1, 'inch': 1, 'foot': 12, 'feet': 12,
                        'cm': 0.3937, 'centimeter': 0.3937, 'centimeters': 0.3937}[string.lower(m.group(2))] or 1)  # etc. etc.
    return ''

re.sub(r, incr, input)
print h
You may want to restrict the keywords usable to keep the dict from getting too big.
I tried out Stephen's code from the previous answer on Python 3.6 and had to tweak it to make it work for me:
import re

h = 0
input = '5 feet 9 inches'
r = re.compile(r'(\d+)\s*(\w+)\b')
measures = {'in': 1, 'inches': 1, 'inch': 1, 'foot': 12, 'feet': 12, 'ft': 12,
            'cm': 0.3937, 'centimeter': 0.3937, 'centimeters': 0.3937}

def incr(m):
    global h
    h += int(m.group(1)) * measures[m.group(2)]
    return ''

re.sub(r, incr, input)
print(h)
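If you want to stay with plain re rather than pyparsing, a minimal sketch along the same lines might look like this (the patterns and the helper name are my own, and cm handling is left out):
import re

# look for an optional feet part and an optional inches part separately
FEET_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:feet|foot|ft)\b", re.IGNORECASE)
INCH_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:inches|inch|in)\b", re.IGNORECASE)

def to_inches(text):
    feet_m = FEET_RE.search(text)
    inch_m = INCH_RE.search(text)
    if not feet_m and not inch_m:
        return None
    feet = float(feet_m.group(1)) if feet_m else 0.0
    inches = float(inch_m.group(1)) if inch_m else 0.0
    return feet * 12 + inches

for sample in ['Height 5 feet 9 inch', '5 feet 9 inch', '5ft 9 in', '5feet 9inches']:
    print(sample, '->', to_inches(sample))   # each of these prints 69.0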

reading file data mixing strings and numbers in python

I would like to read different files in one directory with the following structure:
# Mj = 1.60 ff = 7580.6 gg = 0.8325
I would like to read the numbers from each file and associate each one with a vector.
If we assume I have 3 files, I will have 3 components for the vector Mj, and likewise for ff and gg.
How can I do it in Python?
Thanks for your help.
I'd use a regular expression to take the line apart:
import re

lineRE = re.compile(r'''
    \#\s*
    Mj\s*=\s*(?P<Mj>[-+0-9eE.]+)\s*
    ff\s*=\s*(?P<ff>[-+0-9eE.]+)\s*
    gg\s*=\s*(?P<gg>[-+0-9eE.]+)
    ''', re.VERBOSE)

for filename in filenames:
    for line in open(filename, 'r'):
        m = lineRE.match(line)
        if not m:
            continue
        Mj = m.group('Mj')
        ff = m.group('ff')
        gg = m.group('gg')
        # Put them in whatever lists you want here.
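To build the three vectors the question asks for, you can convert the captured strings to float and append them to lists; a minimal sketch, assuming the files are matched with a glob pattern of my own choosing:
import glob
import re

lineRE = re.compile(
    r'#\s*Mj\s*=\s*(?P<Mj>[-+0-9eE.]+)\s*'
    r'ff\s*=\s*(?P<ff>[-+0-9eE.]+)\s*'
    r'gg\s*=\s*(?P<gg>[-+0-9eE.]+)')

Mj, ff, gg = [], [], []
for filename in glob.glob('data_dir/*.txt'):   # hypothetical file pattern
    with open(filename) as f:
        for line in f:
            m = lineRE.match(line)
            if m:
                Mj.append(float(m.group('Mj')))
                ff.append(float(m.group('ff')))
                gg.append(float(m.group('gg')))

print(Mj, ff, gg)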
Here's a pyparsing solution that might be easier to manage than a regex solution:
text = "# Mj = 1.60 ff = 7580.6 gg = 0.8325 "
from pyparsing import Word, nums, Literal
# subexpression for a real number, including conversion to float
realnum = Word(nums+"-+.E").setParseAction(lambda t:float(t[0]))
# overall expression for the full line of data
linepatt = (Literal("#") + "Mj" + "=" + realnum("Mj") +
"ff" + "=" + realnum("ff") +
"gg" + "=" + realnum("gg"))
# use '==' to test for matching line pattern
if text == linepatt:
res = linepatt.parseString(text)
# dump the matched tokens and all named results
print res.dump()
# access the Mj data field
print res.Mj
# use results names with string interpolation to print data fields
print "%(Mj)f %(ff)f %(gg)f" % res
Prints:
['#', 'Mj', '=', 1.6000000000000001, 'ff', '=', 7580.6000000000004, 'gg', '=', 0.83250000000000002]
- Mj: 1.6
- ff: 7580.6
- gg: 0.8325
1.6
1.600000 7580.600000 0.832500
