I'm not very familiar with how regex works. I'm working on a project (not fully set up yet; I'm starting with the PDF indexing code and a test PDF) that analyzes a mark scheme PDF and then does something with the useful data.
The issue is that when I apply my regex search, it returns nothing from the PDF. I'm trying to go through each row that begins with a 1-2 digit number (Question column) followed by A-D (Answer column), using re.compile(r'\d{1} [A-D]') in the following code:
import re
import requests
import pdfplumber
import pandas as pd
def download_file(url):
    local_filename = url.split('/')[-1]
    with requests.get(url) as r:
        with open(local_filename, 'wb') as f:
            f.write(r.content)
    return local_filename
ap_url = 'https://papers.gceguide.com/A%20Levels/Biology%20(9700)/2019/9700_m19_ms_12.pdf'
ap = download_file(ap_url)
with pdfplumber.open(ap) as pdf:
    page = pdf.pages[1]
    text = page.extract_text()
#print(text)
new_vend_re = re.compile(r'\d{1} [A-D]')
for line in text.split('\n'):
    if new_vend_re.match(line):
        print(line)
When I run the code, I do not get anything in return. Printing the text though will print the whole page.
Here is the PDF I'm trying to work with: https://papers.gceguide.com/A%20Levels/Biology%20(9700)/2019/9700_m19_ms_12.pdf
Your pattern matches exactly one space between the question number and the answer letter, but if you look at the output of text, there is more than one space between them.
'9700/12 Cambridge International AS/A Level – Mark Scheme March 2019\nPUBLISHED \n \nQuestion Answer Marks \n1 A 1\n2 C 1\n3 C 1\n4 A 1\n5 A 1\n6 C 1\n7 A 1\n8 D 1\n9 A 1\n10 C 1\n11 B 1\n12 D 1\n13 B 1\n...
Change your regex to the following to match not only one, but one or more spaces:
new_vend_re = re.compile(r'\d{1}\s+[A-D]')
See the answer by alexpdev for the difference between new_vend_re.match() and new_vend_re.search(). If you run this in your code, you will get the following output:
1 A 1
2 C 1
3 C 1
4 A 1
5 A 1
6 C 1
7 A 1
8 D 1
9 A 1
(You can also see here that there are always two spaces instead of one.)
//Edit: Fixed typo in regex
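Note that the output above stops at question 9: \d{1} matches exactly one digit, and match() only looks at the start of each line, so a row like "10  C  1" is skipped. A minimal sketch of one way to include two-digit question numbers, assuming the rows look like the extracted text above (sample_text, row_re and parse_rows are made-up names for illustration, not part of the original code):

```python
import re

# Sample lines shaped like the extracted page text (two spaces assumed between columns).
sample_text = "Question  Answer  Marks\n1  A  1\n9  A  1\n10  C  1\n13  B  1"

# One or two digits, one or more whitespace characters, then an answer letter A-D.
row_re = re.compile(r'(\d{1,2})\s+([A-D])')

def parse_rows(text):
    """Return (question_number, answer) pairs for lines that start with a row."""
    rows = []
    for line in text.split('\n'):
        m = row_re.match(line)  # match() anchors at the start of the line
        if m:
            rows.append((int(m.group(1)), m.group(2)))
    return rows

print(parse_rows(sample_text))
```

The capture groups make it easy to feed the pairs into something like a pandas DataFrame later, if that is where the project is heading.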
string = "probability is 0.05"
How can I extract the float value 0.05 into a variable? There are many such strings in the file, and I need to find the average probability, so I used a 'for' loop.
My code:
fname = input("enter file name: ")
fh = open(fname)
count = 0
val = 0
for lx in fh:
    if lx.startswith("probability"):
        count = count + 1
        val = val + #here i need to get the only "float" value which is in string
print(val)
import re
string='probability is 1'
string2='probability is 1.03'
def FindProb(string):
    pattern=re.compile('[0-9]')
    result=pattern.search(string)
    result=result.span()[0]
    prob=string[result:]
    return(prob)
print(FindProb(string2))
Ok, so.
This uses the regular expression (aka regex, aka re) library.
It basically sets up a pattern and then searches for it in a string.
This function takes a string, finds the first digit in it, and returns prob, the substring from that digit to the end of the string. Note that prob is still a string; wrap it in float() to get a number.
If you need to find the probability multiple times then this might do it:
import re
string='probability is 1'
string2='probability is 1.03 blah blah bllah probablity is 0.2 ugggggggggggggggg probablity is 1.0'
def FindProb(string):
    amount=string.count('.')
    prob=0
    for i in range(amount):
        pattern=re.compile('[0-9]+[.][0-9]+')
        result=pattern.search(string)
        start=result.span()[0]
        end=result.span()[1]
        prob+=float(string[start:end])
        string=string[end:]
    return(prob)
print(FindProb(string2))
The caveat is that every number has to contain a decimal point, so 1 would have to be written as 1.0, but that shouldn't be too much of a problem. If it is, let me know and I will try to find a way.
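One possible way around that caveat is to make the fractional part optional in the pattern. The following is only a sketch, assuming each relevant line starts with "probability" and contains a single non-negative number; NUMBER_RE, average_probability and sample_lines are names invented for this example:

```python
import re

# Matches integers ("1") as well as decimals ("0.05").
NUMBER_RE = re.compile(r'\d+(?:\.\d+)?')

def average_probability(lines):
    """Average the first number found on each line that starts with 'probability'."""
    total = 0.0
    count = 0
    for line in lines:
        if line.startswith("probability"):
            m = NUMBER_RE.search(line)
            if m:
                total += float(m.group())
                count += 1
    # Return None rather than dividing by zero when nothing matched.
    return total / count if count else None

sample_lines = ["probability is 0.05\n", "other line\n", "probability is 1\n"]
print(average_probability(sample_lines))
```

Since an open file iterates line by line, passing the file handle from the question (fh) instead of sample_lines should work the same way.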
I need to send ESC/POS commands to a thermal receipt printer. I am running into issues with specifying the character size, which is described at [https://reference.epson-biz.com/modules/ref_escpos/index.php?content_id=34]. In Python I write this command as
#ESC @ to initialize the printer
string = b'\x1b\x40'
#GS ! command in the doc corresponding to 4 times character height and width
string = string + b'\x1d' + b'\x21' + b'\x30' + b'\x03'
string = string + bytes('hello world', 'ascii')
In the first line I initialize the printer with ESC @ (0x40 is '@').
In the second line I wanted to set the character size to 4x height and width (see the link to the doc).
In the third line I print out the text.
The problem is that the printed text has 4x width, but not 4x height. I have also tried to write the character size as two commands
string = string + b'\x1d' + b'\x21' + b'\x30'
string = string + b'\x1d' + b'\x21' + b'\x03'
In this case, my text is printed with 4x height but not 4x width. I am pretty sure that I have misread the doc, but I don't know how else I should write the command in order to get both 4x height and 4x width.
Examples also exist for the GS ! syntax in ESC/POS, and there it seems to be written as GS ! 0x11 to get both 2x width and 2x height. This doesn't seem to make sense from the table. I am aware that python-escpos exists; however, it doesn't work on Windows 10 for my USB printer.
From reading the docs, it seems to me that you would have to use
b'\x1d' + b'\x21' + b'\x33'
to get 4 times magnification in height as well as width. The two '3' digits are the magnifications minus one: the first (high nibble) is width, the second (low nibble) is height.
So the problem seems to be that you split width and height into two bytes. They should be collected into one byte.
So, in total:
#ESC @ to initialize the printer
string = b'\x1b\x40'
#GS ! command in the doc corresponding to 4 times character height and width
string = string + b'\x1d' + b'\x21' + b'\x33'
string = string + bytes('hello world', 'ascii')
Or, in another way:
def initialize():
    # Code for initialization of the printer (ESC @).
    return b'\x1b\x40'
def magnify(wm, hm):
    # Code for magnification of characters (GS ! n).
    # wm: Width magnification from 1 to 8. Normal width is 1, double is 2, etc.
    # hm: Height magnification from 1 to 8. Normal height is 1, double is 2, etc.
    return bytes([0x1d, 0x21, 16*(wm-1) + (hm-1)])
def text(t, encoding="ascii"):
    # Code for sending text.
    return bytes(t, encoding)
string = initialize() + magnify(4, 4) + text('hello world')
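To see why GS ! 0x11 means 2x width and 2x height, and why the two attempts in the question behaved as they did, it can help to decode the parameter byte back into its two fields. This is just an illustrative sketch based on the bit layout described above (width in the high nibble, height in the low nibble); decode_size is a made-up helper, not part of any library:

```python
def decode_size(n):
    """Split a GS ! parameter byte into (width, height) magnifications."""
    width = ((n >> 4) & 0x07) + 1   # bits 4-6: width magnification minus one
    height = (n & 0x07) + 1         # bits 0-2: height magnification minus one
    return width, height

# The values that appear in the question and the answer.
for n in (0x11, 0x30, 0x03, 0x33):
    print(hex(n), decode_size(n))
```

So 0x30 alone is 4x width with normal height, 0x03 alone is normal width with 4x height, and sending them as two separate GS ! commands lets the second one override the first.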
I want to replace the dot in a float with a string.
For example, if I have the float 15.444, I need to print something like: 15 eggs 444 chicken
To take a real example, here is the code:
name = "chris"
height_cm = 175
height = (height_cm * 1/2.54) * 1/12
weight = 79
print("He is %s" % name)
print("His height is %.2f" % height, "inches")
print("His weight is %d" % weight)
The second print line gives the output "His height is 5.7 inches". How do I replace the "." with a string here? In this case I need to replace the "." with the string "feet".
++++++++++++++++++output++++++++++++++++++++++++
He is chris
His height is 5.7 inches
His weight is 79
+++++++++++++++++++output+++++++++++++++++++++++
I don't code in Python, but you may want to do something along the lines of this (Java):
float input = 15.444f;
String[] split = (input + "").split("\\.");
System.out.println(split[0] + " eggs and " + split[1] + " chickens");
But why would you ever need to do something like this? A float is a horrible way to store two integer values. Try using an array instead.
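For reference, a rough Python equivalent of the same idea, plus a variant for the height case that computes feet and inches directly instead of cutting the printed float at the dot. This is only a sketch: split_float and feet_and_inches are invented names, and splitting str(value) assumes the float prints in plain decimal form (not scientific notation):

```python
def split_float(value, labels=("eggs", "chicken")):
    """Format the parts before and after the decimal point with labels."""
    whole, frac = str(value).split(".")
    return f"{whole} {labels[0]} {frac} {labels[1]}"

def feet_and_inches(height_cm):
    """Convert centimetres to whole feet plus remaining inches."""
    total_inches = height_cm / 2.54
    feet, inches = divmod(total_inches, 12)
    return int(feet), inches

print(split_float(15.444))
feet, inches = feet_and_inches(175)
print("His height is %d feet %.1f inches" % (feet, inches))
```

The second function avoids the storage problem mentioned above, since the digits after the dot in a height given in feet are tenths of a foot, not inches.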
Hey there, I have a rather large file that I want to process using Python and I'm kind of stuck as to how to do it.
The format of my file is like this:
0 xxx xxxx xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
1 xxx xxxx xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
So I basically want to read in the chunk from 0 up to 1, do my processing on it, then move on to the chunk between 1 and 2.
So far I've tried using a regex to match the number and then keep iterating, but I'm sure there has to be a better way of going about this. Any suggestion/info would be greatly appreciated.
If each chunk is on a single line, that is, there are no line breaks between "1." and "2.", then you can iterate over the lines of the file like this:
for line in open("myfile.txt"):
    pass  # do stuff
The line is discarded and overwritten at each iteration, meaning you can handle large files with ease. If a chunk spans several lines:
import re
parsed_line = ""
for line in open("myfile.txt"):
    if re.match(r"\d+ ", line):  # regex to match start of new string (assumes "N " headers)
        # the previous chunk in parsed_line is complete here; process it
        parsed_line = line
    else:
        parsed_line += line
and the rest of your code.
Why don't you just read the file char by char using file.read(1)?
Then, in each iteration, you could check whether you have arrived at the char 1. You then have to make sure that storing the string is fast.
If the "N " can only start a line, then why not use the "simple" solution? (It sounds like this is already being done; I am just trying to reinforce/support it ;-))
That is, just read a line at a time and build up the data representing the current N object. After, say, N=0 and N=1 are loaded, process them together, then move on to the next pair (N=2, N=3). The only thing that is even remotely tricky is making sure not to throw away a line that has been read. (The line that determined the end condition, e.g. "N ", also contains the data for the next N.)
Unless seeking is required (or IO caching is disabled or there is an absurd amount of data per item), there is really no reason not to use readline AFAIK.
Happy coding.
Here is some off-the-cuff code, which likely contains multiple errors. In any case, it shows the general idea using a minimized side-effect approach.
import re
# given an input and previous item data, return either
# [item_number, data, next_overflow] if another item is read
# or None if there are no more items
def read_item(inp, overflow):
    data = overflow or ""
    # this can be replaced with any method to "read the header"
    # the regex is just "the easiest". the contract is just:
    # given "N ....", return N. given anything else, return None
    def get_num(d):
        m = re.match(r"(\d+) ", d)
        return int(m.group(1)) if m else None
    for line in inp:
        if data and get_num(line) is not None:
            # already in an item (have data); current line "overflows".
            # item number is still at start of current data
            return [get_num(data), data, line]
        # not in item, or new item not found yet
        data += line
    # at end of input, with data. only returns above
    # if a "new" item was encountered; this covers case of
    # no more items (or no items at all)
    if data:
        return [get_num(data), data, None]
    else:
        return None
And usage might be akin to the following, where f represents an open file:
# check for error conditions (e.g. None returned)
# note feed-through of "overflow"
num1, data1, overflow = read_item(f, None)
num2, data2, overflow = read_item(f, overflow)
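The same line-at-a-time idea can also be written as a generator, which hides the "overflow" bookkeeping from the caller. This is a sketch under the same assumption as above (each item starts on a line beginning with a number and a space); HEADER_RE, iter_items and sample are names made up for the example:

```python
import re

HEADER_RE = re.compile(r"(\d+) ")  # assumed header shape: a number, then a space

def iter_items(lines):
    """Yield (item_number, data) for each item in an iterable of lines."""
    number, parts = None, []
    for line in lines:
        m = HEADER_RE.match(line)
        if m:
            # A new header closes the item collected so far, if any.
            if parts:
                yield number, "".join(parts)
            number, parts = int(m.group(1)), [line]
        elif parts:
            # Continuation line of the current item; lines before the
            # first header are ignored.
            parts.append(line)
    if parts:
        yield number, "".join(parts)

sample = ["0 aaa\n", "bbb\n", "1 ccc\n", "2 ddd\n", "eee\n"]
for num, data in iter_items(sample):
    print(num, repr(data))
```

An open file can be passed in place of sample, so only one item is held in memory at a time.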
If the format is fixed, why not just read 3 lines at a time with readline()?
If the file is small, you could read the whole file in and split() on number digits (might want to use strip() to get rid of whitespace and newlines), then fold over the list to process each string in the list. You'll probably have to check that the resultant string you are processing on is not initially empty in case two digits were next to each other.
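A minimal sketch of the read-everything-and-split approach, assuming the file fits in memory, each chunk starts with a number followed by a space at the beginning of a line, and Python 3.7 or later (earlier versions of re.split do not split on empty matches); split_items and sample_text are invented for the example:

```python
import re

def split_items(text):
    """Split text into chunks, each starting at a line that begins with 'N '."""
    # (?m) makes ^ match at every line start; the lookahead keeps the header
    # inside the chunk that follows the split point.
    chunks = re.split(r"(?m)^(?=\d+ )", text)
    # Drop the empty piece produced by the split point at the very start.
    return [chunk for chunk in chunks if chunk.strip()]

sample_text = "0 aaa\nbbb\n1 ccc\n2 ddd\neee\n"
print(split_items(sample_text))
```

Filtering with strip() also covers the empty-string case mentioned above.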
If the file's content can be loaded into memory, which is what you answered, then the following code (it needs filename to be defined) may be a solution.
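import re
# group 1: the whole chunk, group 2: the leading number;
# a chunk ends right before the next line starting with a digit, or at the end of the text
regx = re.compile(r'^((\d+).*?)(?=^\d|\Z)', re.DOTALL | re.MULTILINE)
with open(filename) as f:
    text = f.read()
def treat(inp, regx=regx):
    m1 = regx.search(inp)
    numb, chunk = m1.group(2, 1)
    li = [chunk]
    for mat in regx.finditer(inp, m1.end()):
        n, ch = mat.group(2, 1)
        # only start a new item when the number is the expected next one
        if int(n) == int(numb) + 1:
            yield ''.join(li)
            numb = n
            li = []
        li.append(ch)
        chunk = ch
    yield ''.join(li)
for y in treat(text):
    print(repr(y))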
This code, run on a file containing:
1 mountain
orange 2
apple
produce
2 gas
solemn
enlightment
protectorate
3 grimace
song
4 snow
wheat
51 guludururu
kelemekinonoto
52asabi dabada
5 yellow
6 pink
music
air
7 guitar
blank 8
8 Canada
9 Rimini
produces:
'1 mountain\norange 2\napple\nproduce\n'
'2 gas\nsolemn\nenlightment\nprotectorate\n'
'3 grimace\nsong\n'
'4 snow\nwheat\n51 guludururu\nkelemekinonoto\n52asabi dabada\n'
'5 yellow\n'
'6 pink \nmusic\nair\n'
'7 guitar\nblank 8\n'
'8 Canada\n'
'9 Rimini'