re.sub python to gather height - python

I am writing a python program to parse some user data from a txt file.
One of the rows in the text file will contain the user's height.
I have specified an order that the user is expected to follow:
the first line of the file should contain the name, the next line the date of birth,
the third line the height, and so on.
I have also given the user a sample file, which looks like this:
Name: First Name Last Name
DOB: 16.04.2000
Age: 16
Height: 5 feet 9 inch
When I read the file, I looked at each line and split it using ':' as a separator.
The first field is my column name like name, dob, age, height.
In some cases, users forget the ':' after Name or DOB, or they will simply send data like:
Height 5 feet 9 inch
5 feet 9 inch
5ft 9 in
5feet 9inches
The logic I have decided to use is:
Look for ':' on each line; if one is found, then I have my field.
Otherwise, try to find out what data it could be.
The logic for height is like this:
if any(heightword in file_line.upper() for heightword in ['FT', 'HEIGHT', 'FEET', 'INCH', 'CM']):
This if condition will look for words associated with height.
Once I have determined that the line from the file contains the height, I want to be able to convert that information to inches before I write it to the database.
Please can someone help me work out how to convert the following data to inches.
Height 5 feet 9 inch
5 feet 9 inch
5ft 9 in
5feet 9inches
I know that, since I am trying to cater to a variety of user inputs, this list is not exhaustive; I am using these as examples to understand the approach, and I will keep adding code if and when I find new patterns.

pyparsing is a nice module for simple parsing situations like this, especially when trying to process less-than-predictable-but-still-fairly-structured human input. You can compose your parser using some friendly-named classes (Keyword, Optional, OneOrMore, and so on) and arithmetic operators ('+' for sequence, '|' for alternatives, etc.) to assemble smaller parsers into larger ones. Here is a parser built up from bits for your example (it also supports ' and " for feet and inches, and fractional feet and inch values too). (This sample uses the latest version of pyparsing, version 2.1.4):
samples = """\
Height 5 feet 9 inch
5 feet 9 inch
5ft 9 in
5feet 9inches
5'-9-1/2"
5' 9-1/2"
5' 9 1/2"
6'
3/4"
3ft-6-1/4 in
"""
from pyparsing import CaselessKeyword, pyparsing_common, Optional
CK = CaselessKeyword
feet_units = CK("feet") | CK("ft") | "'"
inch_units = CK("inches") | CK("inch") | CK("in") | '"'
# pyparsing_common.number will parse an integer or real, and convert to float
integer = pyparsing_common.number
fraction = integer + '/' + integer
fraction.addParseAction(lambda t: t[0]/t[-1])
qty = fraction | (integer + Optional(fraction)).addParseAction(lambda t:sum(t))
# define whole Height feet-inches expression
HEIGHT = CK("height") | CK("ht")
inch_qty = qty("inches")
feet_qty = qty("feet")
height_parser = Optional(HEIGHT) + (inch_qty + inch_units |
feet_qty + feet_units + Optional(inch_qty + inch_units))
# use parse-time callback to convert feet-and-inches to inches
height_parser.addParseAction(lambda t: t.get("feet", 0.0)*12 + t.get("inches", 0.0))
height_parser.ignore("-")
height_parser.runTests(samples)
# how to use the parser in normal code
height_value = height_parser.parseString(samples.splitlines()[0])[0]
print(height_value, type(height_value))
Prints:
Height 5 feet 9 inch
[69.0]
5 feet 9 inch
[69.0]
5ft 9 in
[69.0]
5feet 9inches
[69.0]
5'-9-1/2"
[69.5]
5' 9-1/2"
[69.5]
5' 9 1/2"
[69.5]
6'
[72.0]
3/4"
[0.75]
3ft-6-1/4 in
[42.25]
69.0 <type 'float'>

In JavaScript, there is an operation called "computed access", done as object[key], where the object property read is determined through the result of a given expression, as an alternative to the normal . operator. Personally, I mostly use it for iteration and reading properties with hyphens and stuff, but it can also be used to get associated wanted results from an input string.
So after an entire afternoon of Googling and figuring out Python syntax, etc. I was able to write a short program to do this.
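import re
h = 0
input = '5 feet 9 inches'  # example input line
r = re.compile(r'(\d+)\s*(\w+)\b')
# inches per unit; a word that is not in the dict falls back to a factor of 1
measures = {'in':1,'inches':1,'inch':1,'foot':12,'feet':12,'cm':0.3937,'centimeter':0.3937,'centimeters':0.3937} # etc. etc.
def incr(m):
    global h
    h += int(m.group(1)) * measures.get(m.group(2).lower(), 1)
    return ''
re.sub(r, incr, input)
print(h)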
You may want to restrict the keywords usable to keep the dict from getting too big.

I tried out Stephen's code in the first comment on python 3.6 and had to tweak it to work for me:
import re
h = 0
input = '5 feet 9 inches'
# (\d+) rather than (\d) so that multi-digit values such as "10 inches" are read whole
r = re.compile(r'(\d+)\s*(\w+)\b')
measures ={'in':1,'inches':1,'inch':1,'foot':12,'feet':12,'ft':12,'cm':0.3937,'centimeter':0.3937,'centimeters':0.3937}
def incr(m):
    global h
    h+=int(m.group(1))*measures[m.group(2)]
    return ''
re.sub(r, incr, input)
print(h)
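The dictionary lookup used in the two answers above can also be wrapped in a function that avoids the global counter. This is only a minimal sketch: the names UNIT_TO_INCHES, QUANTITY_RE and height_to_inches are made up for illustration, and the list of unit spellings is an assumption that would need extending as new input patterns turn up.

```python
import re

# Inches per unit. The spellings listed here are assumptions, not an exhaustive list.
UNIT_TO_INCHES = {
    "ft": 12.0, "foot": 12.0, "feet": 12.0,
    "in": 1.0, "inch": 1.0, "inches": 1.0,
    "cm": 1 / 2.54, "centimeter": 1 / 2.54, "centimeters": 1 / 2.54,
}

# A number (integer or decimal), optional whitespace, then a run of letters.
QUANTITY_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([a-z]+)", re.IGNORECASE)


def height_to_inches(line):
    """Return the height in inches found in `line`, or None if no known unit appears."""
    total = 0.0
    found = False
    for number, unit in QUANTITY_RE.findall(line):
        factor = UNIT_TO_INCHES.get(unit.lower())
        if factor is None:
            continue  # a number followed by an unknown word; skip it
        total += float(number) * factor
        found = True
    return total if found else None


for sample in ["Height 5 feet 9 inch", "5 feet 9 inch", "5ft 9 in", "5feet 9inches"]:
    print(sample, "->", height_to_inches(sample))
```

Returning None when no unit is recognised lets the caller fall back to other field checks (name, date of birth, and so on) instead of storing a height of 0.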

Related

Regex in Python returns nothing (search parameters keywords for search for when using regex)

I'm not very sure how regex works, but I'm working on a project (not fully set up yet; I'm starting with the PDF indexing side of the code, using a test PDF) that analyzes a mark scheme PDF and then works with the useful data it contains.
The issue is that when I use my regex search pattern, it returns nothing from the PDF. I'm trying to iterate through each row that begins with 1-2 digits (the Question column) followed by A-D (the Answer column), using re.compile(r'\d{1} [A-D]') in the following code:
import re
import requests
import pdfplumber
import pandas as pd
def download_file(url):
    local_filename = url.split('/')[-1]
    with requests.get(url) as r:
        with open(local_filename, 'wb') as f:
            f.write(r.content)
    return local_filename
ap_url = 'https://papers.gceguide.com/A%20Levels/Biology%20(9700)/2019/9700_m19_ms_12.pdf'
ap = download_file(ap_url)
with pdfplumber.open(ap) as pdf:
    page = pdf.pages[1]
    text = page.extract_text()
    #print(text)
new_vend_re = re.compile(r'\d{1} [A-D]')
for line in text.split('\n'):
    if new_vend_re.match(line):
        print(line)
When I run the code, I do not get anything in return. Printing the text though will print the whole page.
Here is the PDF I'm trying to work with: https://papers.gceguide.com/A%20Levels/Biology%20(9700)/2019/9700_m19_ms_12.pdf
Your pattern matches exactly one space between the question number and the answer letter, but if you look at the output of text, there is more than one space between them.
'9700/12 Cambridge International AS/A Level – Mark Scheme March 2019\nPUBLISHED \n \nQuestion Answer Marks \n1 A 1\n2 C 1\n3 C 1\n4 A 1\n5 A 1\n6 C 1\n7 A 1\n8 D 1\n9 A 1\n10 C 1\n11 B 1\n12 D 1\n13 B 1\n...
Change your regex to the following to match not only one, but one or more spaces:
new_vend_re = re.compile(r'\d{1}\s+[A-D]')
See the answer by alexpdev to learn the difference between new_vend_re.match() and new_vend_re.search(). If you run this within your code, you will get the following output:
1 A 1
2 C 1
3 C 1
4 A 1
5 A 1
6 C 1
7 A 1
8 D 1
9 A 1
(You can also see here, that there are always two spaces instead of one).
//Edit: Fixed typo in regex
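The output above stops at question 9 because \d{1} followed by \s+ cannot match a two-digit question number when match() anchors at the start of the line. A small sketch of one way to include those rows is shown below; the sample text is a made-up excerpt in the shape of the extracted page (two spaces between columns is assumed), and row_re is a hypothetical name.

```python
import re

# A few lines in the shape of the extracted page text (sample data, assumed spacing).
text = "Question Answer Marks \n1  A  1\n9  A  1\n10  C  1\n11  B  1"

# One or two digits, one or more whitespace characters, then a letter A-D.
row_re = re.compile(r"\d{1,2}\s+[A-D]")

# match() only succeeds at the start of the line, so the header row is skipped.
rows = [line for line in text.split("\n") if row_re.match(line)]
print(rows)
```

Using match() rather than search() is deliberate here: search() would also accept a digit and letter pair appearing anywhere later in a line.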

How can I extract a floating point value from a string, in python 3?

string = "probability is 0.05"
How can I extract the float value 0.05 into a variable? There are many such strings in the file, and I need to find the average probability, so I used a 'for' loop.
My code:
fname = input("enter file name: ")
fh = open(fname)
count = 0
val = 0
for lx in fh:
    if lx.startswith("probability"):
        count = count + 1
        val = val + #here i need to get the only "float" value which is in string
print(val)
import re
string='probability is 1'
string2='probability is 1.03'
def FindProb(string):
    pattern=re.compile('[0-9]')
    result=pattern.search(string)
    result=result.span()[0]
    prob=string[result:]
    return(prob)
print(FindProb(string2))
Ok, so.
This is using the regular expression (aka Regex aka re) library
It basically sets up a pattern and then searches for it in a string.
This function takes in a string and finds the first number in the string, then returns the variable prob which would be the string from the first number to the end.
If you need to find the probability multiple times then this might do it:
import re
string='probability is 1'
string2='probability is 1.03 blah blah bllah probablity is 0.2 ugggggggggggggggg probablity is 1.0'
def FindProb(string):
    amount=string.count('.')
    prob=0
    for i in range(amount):
        pattern=re.compile('[0-9]+[.][0-9]+')
        result=pattern.search(string)
        start=result.span()[0]
        end=result.span()[1]
        prob+=float(string[start:end])
        string=string[end:]
    return(prob)
print(FindProb(string2))
The caveat is that every value must contain a period, so 1 would have to be written as 1.0, but that shouldn't be too much of a problem. If it is, let me know and I will try to find a way.
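One possible way around that caveat is to make the fractional part optional in the pattern, so that both "1" and "1.03" are picked up. The following is a rough sketch rather than a drop-in replacement; NUMBER_RE, average_probability and the sample lines are invented for illustration, and it assumes one value per line as in the question.

```python
import re

# Digits with an optional fractional part, so both "1" and "1.03" match.
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def average_probability(lines):
    """Average the first number on each line that starts with 'probability'."""
    values = []
    for line in lines:
        if line.startswith("probability"):
            match = NUMBER_RE.search(line)
            if match:
                values.append(float(match.group()))
    # None signals that no matching line was found, avoiding a division by zero.
    return sum(values) / len(values) if values else None


sample_lines = ["probability is 0.05\n", "something else 9\n", "probability is 1\n"]
print(average_probability(sample_lines))
```

An open file object can be passed in place of sample_lines, since iterating over a file yields its lines.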

Python send escpos command to thermal printer character size issue

I need to send ESC/POS commands to a thermal receipt printer. I am running into issues with specifying character size, which is described [https://reference.epson-biz.com/modules/ref_escpos/index.php?content_id=34]. In Python I write this command as
# ESC @ initializes the printer
string = b'\x1b\x40'
#GS ! command in the doc corresponding to 4 times character height and width
string = string + b'\x1d' + b'\x21' + b'\x30' + b'\x03'
string = string + b'hello world'
In the first line I initialize the printer with ESC @ (0x1b 0x40).
In the second line I wanted to specify the character size to be 4x height and width (see the links to the doc).
In the third line I print out the text.
The problem is that the text printed out has 4x width, but not 4 times height. I have also tried to write the character size as two commands
string = string + b'\x1d' + b'\x21' + b'\x30'
string = string + b'\x1d' + b'\x21' + b'\x03'
In this case, my text is printed out with 4 times height but not 4 times width. I am pretty sure that I have misread the doc, but I don't know how else I should write the command in order to achieve both 4 times height and width.
Examples also exist for the GS ! syntax in ESC/POS, and there it seems to be written as GS ! 0x11 to achieve both 2 times width and height. This doesn't seem to make sense from the table. I am aware that python-escpos exists; however, it doesn't work on Windows 10 for my USB printer.
From reading the docs, it seems to me that you would have to use
b'\x1d' + b'\x21' + b'\x33'
to get 4 times magnification in height as well as width. The two '3' indicate the magnifications minus one. The first is width, the second is height.
So the problem seems to be that you split width and height into two bytes. They should be collected into one byte.
So, in total:
# ESC @ initializes the printer
string = b'\x1b\x40'
#GS ! command in the doc corresponding to 4 times character height and width
string = string + b'\x1d' + b'\x21' + b'\x33'
string = string + b'hello world'
Or, in another way:
def initialize():
    # Code for initialization of the printer.
    return b'\x1b\x40'
def magnify(wm, hm):
    # Code for magnification of characters (GS ! n, i.e. 0x1d 0x21 n).
    # wm: Width magnification from 1 to 8. Normal width is 1, double is 2, etc.
    # hm: Height magnification from 1 to 8. Normal height is 1, double is 2, etc.
    return bytes([0x1d, 0x21, 16*(wm-1) + (hm-1)])
def text(t, encoding="ascii"):
    # Code for sending text.
    return bytes(t, encoding)
string = initialize() + magnify(4, 4) + text('hello world')
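To check the reading of the size byte described above (width magnification minus one in the high digit, height magnification minus one in the low digit), it can help to decode the values mentioned in the question. This is a small sketch; decode_size_byte is a hypothetical helper, and the bit layout is taken from the answer's reading of the linked documentation.

```python
def decode_size_byte(n):
    """Split the GS ! parameter byte into (width, height) magnification."""
    width = ((n >> 4) & 0x07) + 1   # bits 4-6 hold the width magnification minus one
    height = (n & 0x07) + 1         # bits 0-2 hold the height magnification minus one
    return width, height


# The bytes tried in the question, the 0x11 example, and the suggested 0x33.
for n in (0x30, 0x03, 0x11, 0x33):
    print(hex(n), decode_size_byte(n))
```

Under this reading, 0x30 is 4x width with normal height and 0x03 is normal width with 4x height, which matches what the printer did when the two values were sent separately.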

replace values in a float with a string- python

I want to replace the dot in a float with a string.
For example, if I have the float 15.444, I need to print something like: 15 eggs 444 chicken
To take a real example, here is the code:
name = "chris"
height_cm = 175
height = (height_cm * 1/2.54) * 1/12
weight = 79
print "He is %s" % name
print "His height is %.2f" %height, "inches"
print "His weight is %d" % weight
The second print line gives the output "His height is 5.7 inches". How do I replace the "." with a string here? In this case I need to replace the "." with the string "feet".
++++++++++++++++++output++++++++++++++++++++++++
He is chris
His height is 5.7 inches
His weight is 79
+++++++++++++++++++output+++++++++++++++++++++++
I don't code Python, but you may want to do something along the lines of this (Java):
float input = 15.444f;
String[] split = (input + "").split("\\.");
System.out.println(split[0] + " eggs and " + split[1] + " chickens");
But why would you ever need to do something like this? A float is a horrible way to store two integer values. Try using an array instead.
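In Python, the same split can be done on a formatted string. For the height example, though, the digits after the dot are a fraction of a foot rather than a count of inches, so divmod is probably closer to what is wanted. A minimal sketch, where cm_to_feet_inches is a made-up helper name:

```python
def cm_to_feet_inches(height_cm):
    """Convert centimetres to whole feet plus the remaining inches."""
    total_inches = height_cm / 2.54
    feet, inches = divmod(total_inches, 12)
    return int(feet), inches


# Literal replacement of the dot, as in the "eggs and chickens" example.
whole, frac = ("%.3f" % 15.444).split(".")
print("%s eggs %s chicken" % (whole, frac))

# Height example: keep feet and inches as two separate values.
feet, inches = cm_to_feet_inches(175)
print("His height is %d feet %.1f inches" % (feet, inches))
```

Keeping feet and inches as two numbers follows the advice above about not packing two values into one float.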

Python: Read large file in chunks

Hey there, I have a rather large file that I want to process using Python and I'm kind of stuck as to how to do it.
The format of my file is like this:
0 xxx xxxx xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
1 xxx xxxx xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
So I basically want to read in the chunk up from 0-1, do my processing on it, then move on to the chunk between 1 and 2.
So far I've tried using a regex to match the number and then keep iterating, but I'm sure there has to be a better way of going about this. Any suggestion/info would be greatly appreciated.
If they are all within the same line, that is there are no line breaks between "1." and "2." then you can iterate over the lines of the file like this:
for line in open("myfile.txt"):
    pass  # do stuff
The line will be disposed of and overwritten at each iteration meaning you can handle large file sizes with ease. If they're not on the same line:
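import re
parsed_line = ""
for line in open("myfile.txt"):
    # assumes a new chunk starts with a number and a space, as in the sample format
    if re.match(r"\d+ ", line):
        # process the previous parsed_line here before starting a new one
        parsed_line = line
    else:
        parsed_line += line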
and the rest of your code.
Why don't you just read the file char by char using file.read(1)?
Then, you could - in each iteration - check whether you arrived at the char 1. Then you have to make sure that storing the string is fast.
If the "N " can only start a line, then why not use the "simple" solution? (It sounds like this is already being done; I am trying to reinforce/support it ;-))
That is, just read a line at a time and build up the data representing the current N object. After, say, N=0 and N=1 are loaded, process them together, then move on to the next pair (N=2, N=3). The only thing that is even remotely tricky is making sure not to throw out a read line. (The line read that determined the end condition -- e.g. "N " -- also contains the data for the next N.)
Unless seeking is required (or IO caching is disabled or there is an absurd amount of data per item), there is really no reason not to use readline AFAIK.
Happy coding.
Here is some off-the-cuff code, which likely contains multiple errors. In any case, it shows the general idea using a minimized side-effect approach.
import re
# given an input and previous item data, return either
# [item_number, data, next_overflow] if another item is read
# or None if there are no more items
def read_item(inp, overflow):
    data = overflow or ""
    # this can be replaced with any method to "read the header"
    # the regex is just "the easiest". the contract is just:
    # given "N ....", return N. given anything else, return None
    def get_num(d):
        m = re.match(r"(\d+) ", d)
        return int(m.group(1)) if m else None
    for line in inp:
        if data and get_num(line) is not None:
            # already in an item (have data); current line "overflows".
            # item number is still at start of current data
            return [get_num(data), data, line]
        # not in item, or new item not found yet
        data += line
    # and end of input, with data. only returns above
    # if a "new" item was encountered; this covers case of
    # no more items (or no items at all)
    if data:
        return [get_num(data), data, None]
    else:
        return None
And usage might be akin to the following, where f represents an open file:
# check for error conditions (e.g. None returned)
# note feed-through of "overflow"
num1, data1, overflow = read_item(f, None)
num2, data2, overflow = read_item(f, overflow)
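The same line-at-a-time idea can also be written as a generator, which removes the need to pass the overflow line around by hand. This is a sketch under the assumption that every item starts on a line beginning with digits and a space; HEADER_RE and iter_items are illustrative names, and the sample list stands in for an open file.

```python
import re

# A new item starts on a line that begins with digits followed by a space.
HEADER_RE = re.compile(r"(\d+) ")


def iter_items(lines):
    """Yield (item_number, text) pairs, reading one line at a time."""
    number, parts = None, []
    for line in lines:
        header = HEADER_RE.match(line)
        if header:
            if parts:
                # the previous item is complete; hand it to the caller
                yield number, "".join(parts)
            number, parts = int(header.group(1)), []
        if header or parts:
            # lines before the first header are ignored
            parts.append(line)
    if parts:
        yield number, "".join(parts)


sample = ["0 aaa\n", "more a\n", "1 bbb\n", "2 ccc\n", "tail c\n"]
for number, chunk in iter_items(sample):
    print(number, repr(chunk))
```

Because only the current item is held in memory, this should cope with large files as long as a single item fits in memory.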
If the format is fixed, why not just read 3 lines at a time with readline()?
If the file is small, you could read the whole file in and split() on number digits (might want to use strip() to get rid of whitespace and newlines), then fold over the list to process each string in the list. You'll probably have to check that the resultant string you are processing on is not initially empty in case two digits were next to each other.
If the file's content can be loaded in memory, and that's what you answered, then the following code (needs to have filename defined) may be a solution.
import re
regx = re.compile(r'^((\d+).*?)(?=^\d|\Z)',re.DOTALL|re.MULTILINE)
with open(filename) as f:
    text = f.read()
def treat(inp,regx=regx):
    m1 = regx.search(inp)
    numb,chunk = m1.group(2,1)
    li = [chunk]
    for mat in regx.finditer(inp,m1.end()):
        n,ch = mat.group(2,1)
        # start a new chunk only when the number is the next one in sequence
        if int(n) == int(numb) + 1:
            yield ''.join(li)
            numb = n
            li = []
        li.append(ch)
        chunk = ch
    yield ''.join(li)
for y in treat(text):
    print(repr(y))
This code, run on a file containing :
1 mountain
orange 2
apple
produce
2 gas
solemn
enlightment
protectorate
3 grimace
song
4 snow
wheat
51 guludururu
kelemekinonoto
52asabi dabada
5 yellow
6 pink
music
air
7 guitar
blank 8
8 Canada
9 Rimini
produces:
'1 mountain\norange 2\napple\nproduce\n'
'2 gas\nsolemn\nenlightment\nprotectorate\n'
'3 grimace\nsong\n'
'4 snow\nwheat\n51 guludururu\nkelemekinonoto\n52asabi dabada\n'
'5 yellow\n'
'6 pink \nmusic\nair\n'
'7 guitar\nblank 8\n'
'8 Canada\n'
'9 Rimini'
