How to do complex PDF extraction with regex - Python
I have a PDF file which contains lottery ticket winners, and I want to extract all winning tickets grouped by their prizes.
PDF file
I tried this:
import re
import pdfplumber

prize_re = re.compile(r"^\d[a-z]")
cons_prize_re = re.compile(r"^Cons")
ticket1_line_re = re.compile(r"^\d[)]")
ticket2_line_re = re.compile(r"^\d{4}")
ticket3_line_re = re.compile(r"[A-Z] \d{6}")

with pdfplumber.open("./test11.pdf") as pdf:
    for i in range(len(pdf.pages)):
        page_text = pdf.pages[i].extract_text()
        for line in page_text.split("\n"):
            if prize_re.match(line) or cons_prize_re.match(line) or ticket1_line_re.match(line) or ticket2_line_re.match(line) or ticket3_line_re.search(line):
                print(line)
This is what I got. I don't know how to assign each ticket to its prize. Also, the Cons Prize ticket numbers look a little strange and I don't know why (AN 867952AO 867952AP should be AN 867952 AO 867952 AP ...):
1st Prize Rs :7000000/- 1) AU 867952 (MANANTHAVADY)
Cons Prize-Rs :8000/- AN 867952AO 867952AP 867952 AR 867952AS 867952
AT 867952 AV 867952 AW 867952AX 867952AY 867952
AZ 867952
2nd Prize Rs :500000/- 1) AZ 499603 (ADOOR)
3rd Prize Rs :100000/- 1) AN 215264 (KOTTAYAM)
2) AO 852774 (PATTAMBI)
3) AP 953655 (KOTTAYAM)
4) AR 638904 (PAYYANUR)
5) AS 496774 (VAIKKOM)
6) AT 878990 (WAYANADU)
7) AU 703702 (PUNALUR)
8) AV 418446 (WAYANADU)
9) AW 994685 (KOZHIKKODE)
10) AX 317550 (PATTAMBI)
11) AY 854780 (CHITTUR)
12) AZ 899905 (KARUNAGAPALLY
...
Instead, I want to get:
[
{
"1st Prize Rs :7000000",
"tickets": [
"AU 867952"
]
},
{
"Cons Prize-Rs :8000",
"tickets": [
"AN 867952",
"AO 867952",
"AP 867952",
"AR 867952",
...
]
},
...
]
How can I achieve this?
You could first capture all the full prize sections from all the pages in capture groups.
Then you can post-process the third capture group to get the separate "tickets" and build the wanted data structure in a loop.
To split the text into sections, you can use a pattern that matches the start of every prize section and captures everything up to the next prize section:
^(\w+ Prize[-\s]Rs\s*):(\d+)/-(?:\s*\d+\))?\s*(.*(?:\n(?!\w+ Prize\b).*)*)
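As a minimal sketch (using two lines taken from the extracted text shown in the question, not the real PDF), the pattern yields one match per prize section with three groups:

import re

pattern = r"^(\w+ Prize[-\s]Rs\s*):(\d+)/-(?:\s*\d+\))?\s*(.*(?:\n(?!\w+ Prize\b).*)*)"
sample = (
    "1st Prize Rs :7000000/- 1) AU 867952 (MANANTHAVADY)\n"
    "2nd Prize Rs :500000/- 1) AZ 499603 (ADOOR)"
)
for m in re.finditer(pattern, sample, re.MULTILINE):
    print(m.groups())
# ('1st Prize Rs ', '7000000', 'AU 867952 (MANANTHAVADY)')
# ('2nd Prize Rs ', '500000', 'AZ 499603 (ADOOR)')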
For the post-processing, you can use a pattern for the ticket formats that matches either two uppercase characters, a space and six digits, or four or more digits surrounded by whitespace boundaries:
(?:[A-Z]{2} \d{6}(?!\d)|(?<!\S)\d{4,}(?!\S))
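As a minimal sketch, applying it to the fused "Cons Prize" line from the question splits the run-together values into separate tickets:

import re

ticket_re = r"(?:[A-Z]{2} \d{6}(?!\d)|(?<!\S)\d{4,}(?!\S))"
line = "AN 867952AO 867952AP 867952 AR 867952AS 867952"
print(re.findall(ticket_re, line))
# ['AN 867952', 'AO 867952', 'AP 867952', 'AR 867952', 'AS 867952']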
Example code using the pdf file from the question:
import re
import pdfplumber
import json

pattern = r"^(\w+ Prize[-\s]Rs\s*):(\d+)/-(?:\s*\d+\))?\s*(.*(?:\n(?!\w+ Prize\b).*)*)"

with pdfplumber.open("./test11.pdf") as pdf:
    # collect the text of all pages into one string
    all_text = ""
    for page in pdf.pages:
        all_text += '\n' + page.extract_text()

matches = re.finditer(pattern, all_text, re.MULTILINE)
coll = []
for match in matches:
    # group 1 is the prize label, group 2 the amount, group 3 the block of tickets
    dct = {}
    dct[match.group(1)] = match.group(2)
    dct["tickets"] = re.findall(r"(?:[A-Z]{2} \d{6}(?!\d)|(?<!\S)\d{4,}(?!\S))", match.group(3))
    coll.append(dct)

print(json.dumps(coll, indent=4))
Output
[
{
"1st Prize Rs ": "120000000",
"tickets": [
"XG 218582"
]
},
{
"Cons Prize-Rs ": "500000",
"tickets": [
"XA 218582",
"XB 218582",
"XC 218582",
"XD 218582",
"XE 218582"
]
},
{
"2nd Prize Rs ": "5000000",
"tickets": [
"XA 788417",
"XB 161796",
"XC 319503",
"XD 713832",
"XE 667708",
"XG 137764"
]
},
....
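If you want the result in a file rather than printed to the console, a small sketch (the winners.json file name is just an assumption, not something from the question):

with open("winners.json", "w", encoding="utf-8") as f:
    json.dump(coll, f, indent=4)  # same structure as printed above, written to disk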
Related
Get data with boundaries using regex
I would like to get the labels and data from this function using regex, I have tried using this: pattern = re.compile(r'/blabels: ],/b') print(pattern) result = soup.find("script", text=pattern) But I get None using boundaries This is the soup: <script> Chart.defaults.LineWithLine = Chart.defaults.line; new Chart(document.getElementById("chart-overall-mentions"), { type: 'LineWithLine', data: { labels: [1637005508000,1637006108000,1637006708000,1637007308000,1637007908000,1637008508000,1637009108000,1637009708000,1637010308000,1637010908000,1637011508000,1637012108000,1637012708000,1637013308000,1637013908000,1637014508000,1637015108000,1637015708000,1637016308000,1637016908000,1637017508000,1637018108000,1637018708000,1637019308000,1637019908000,1637020508000,1637021108000,1637021708000,1637022308000,1637022908000,1637023508000,1637024108000,1637024708000,1637025308000,1637025908000,1637026508000,1637027108000,1637027708000,1637028308000,1637028908000,1637029508000,1637030108000,1637030708000,1637031308000,1637031908000,1637032508000,1637033108000,1637033708000,1637034308000,1637034908000,1637035508000,1637036108000,1637036708000,1637037308000,1637037908000,1637038508000,1637039108000,1637039708000,1637040308000,1637040908000,1637041508000,1637042108000,1637042708000,1637043308000,1637043908000,1637044508000,1637045108000,1637045708000,1637046308000,1637046908000,1637047508000,1637048108000,1637048708000,1637049308000,1637049908000,1637050508000,1637051108000,1637051708000,1637052308000,1637052908000,1637053508000,1637054108000,1637054708000,1637055308000,1637055908000,1637056508000,1637057108000,1637057708000,1637058308000,1637058908000,1637059508000,1637060108000,1637060708000,1637061308000,1637061908000,1637062508000,1637063108000,1637063708000,1637064308000,1637064908000,1637065508000,1637066108000,1637066708000,1637067308000,1637067908000,1637068508000,1637069108000,1637069708000,1637070308000,1637070908000,1637071508000,1637072108000,1637072708000,1637073308000,1637073908000,1637074508000,1637075108000,1637075708000,1637076308000,1637076908000,1637077508000,1637078108000,1637078708000,1637079308000,1637079908000,1637080508000,1637081108000,1637081708000,1637082308000,1637082908000,1637083508000,1637084108000,1637084708000,1637085308000,1637085908000,1637086508000,1637087108000,1637087708000,1637088308000,1637088908000,1637089508000,1637090108000,1637090708000,1637091308000], datasets: [{ data: [13,10,20,26,21,23,24,21,24,35,25,31,42,24,24,20,23,22,17,23,30,11,16,20,9,10,22,10,19,16,15,16,17,19,10,20,24,14,19,15,13,9,13,17,20,16,15,21,18,25,15,14,16,15,16,14,14,21,10,9,5,9,9,13,14,9,9,18,15,11,11,6,12,14,19,17,16,11,20,14,21,13,15,12,14,10,20,16,25,17,17,11,23,11,13,11,19,10,17,19,10,20,22,19,19,27,28,18,20,22,18,16,17,18,14,17,19,18,20,11,13,20,15,15,18,14,13,14,14,11,19,14,14,11,11,15,26,12,15,15,11,4,3,6], pointRadius: 0, borderColor: "#666", fill: true, yAxisID:'yAxis1' }, ] }, options: { tooltips: { mode: 'index', bodyFontSize: 18, intersect: false, titleFontSize: 16, }, . . . </script>
Here is how you can do that: Get the script tag - you can use a regex, too, if that is the only way to obtain that node Then run a regex search against the node text/string to get your final output. You can use # Get the script node with text matching your pattern item = soup.find("script", text=re.compile(r'\blabels:\s*\[')) import re match = re.search(r'\blabels:\s*\[([^][]*)]', item.string) if match: labels = map(int, match.group(1).split(',')) Output: >>> print(list(labels)) [1637005508000, 1637006108000, 1637006708000, 1637007308000, 1637007908000, 1637008508000, 1637009108000, 1637009708000, 1637010308000, 1637010908000, 1637011508000, 1637012108000, 1637012708000, 1637013308000, 1637013908000, 1637014508000, 1637015108000, 1637015708000, 1637016308000, 1637016908000, 1637017508000, 1637018108000, 1637018708000, 1637019308000, 1637019908000, 1637020508000, 1637021108000, 1637021708000, 1637022308000, 1637022908000, 1637023508000, 1637024108000, 1637024708000, 1637025308000, 1637025908000, 1637026508000, 1637027108000, 1637027708000, 1637028308000, 1637028908000, 1637029508000, 1637030108000, 1637030708000, 1637031308000, 1637031908000, 1637032508000, 1637033108000, 1637033708000, 1637034308000, 1637034908000, 1637035508000, 1637036108000, 1637036708000, 1637037308000, 1637037908000, 1637038508000, 1637039108000, 1637039708000, 1637040308000, 1637040908000, 1637041508000, 1637042108000, 1637042708000, 1637043308000, 1637043908000, 1637044508000, 1637045108000, 1637045708000, 1637046308000, 1637046908000, 1637047508000, 1637048108000, 1637048708000, 1637049308000, 1637049908000, 1637050508000, 1637051108000, 1637051708000, 1637052308000, 1637052908000, 1637053508000, 1637054108000, 1637054708000, 1637055308000, 1637055908000, 1637056508000, 1637057108000, 1637057708000, 1637058308000, 1637058908000, 1637059508000, 1637060108000, 1637060708000, 1637061308000, 1637061908000, 1637062508000, 1637063108000, 1637063708000, 1637064308000, 1637064908000, 1637065508000, 1637066108000, 1637066708000, 1637067308000, 1637067908000, 1637068508000, 1637069108000, 1637069708000, 1637070308000, 1637070908000, 1637071508000, 1637072108000, 1637072708000, 1637073308000, 1637073908000, 1637074508000, 1637075108000, 1637075708000, 1637076308000, 1637076908000, 1637077508000, 1637078108000, 1637078708000, 1637079308000, 1637079908000, 1637080508000, 1637081108000, 1637081708000, 1637082308000, 1637082908000, 1637083508000, 1637084108000, 1637084708000, 1637085308000, 1637085908000, 1637086508000, 1637087108000, 1637087708000, 1637088308000, 1637088908000, 1637089508000, 1637090108000, 1637090708000, 1637091308000] Once the node is obtained the \blabels:\s*\[([^][]*)] regex searches for \b - a word boundary labels: - a fixed string \s* - zero or more whitespaces \[ - a [ char ([^][]*) - Group 1 (this is what you will need to split with a comma later): any zero or more chars other than ] and [ ] - a ] char.
How to convert text file into json file?
I am new to Python and I want to convert a text file into a JSON file. Here's how it looks:

#Q Three of these animals hibernate. Which one does not?
^ Sloth
A Mouse
B Sloth
C Frog
D Snake

#Q What is the literal translation of the Greek word Embioptera, which denotes an order of insects, also known as webspinners?
^ Lively wings
A Small wings
B None of these
C Yarn knitter
D Lively wings

#Q There is a separate species of scorpions which have two tails, with a venomous sting on each tail.
^ False
A True
B False

Contd . . . .

^ means Answer. I want it in JSON format as shown below. Example:

{
    "questionBank": [
        {
            "question": "Grand Central Terminal, Park Avenue, New York is the worlds",
            "a": "largest railway station",
            "b": "Longest railway station",
            "c": "highest railway station",
            "d": "busiest railway station",
            "answer": "largest railway station"
        },
        {
            "question": "Eritrea, which became the 182nd member of the UN in 1993, is in the continent of",
            "a": "Asia",
            "b": "Africa",
            "c": "Europe",
            "d": "Oceania",
            "answer": "Africa"
        },
        Contd.....
    ]
}

I came across a few similar posts and here's what I have tried:

dataset = "file.txt"
data = []
with open(dataset) as ds:
    for line in ds:
        line = line.strip().split(",")
        print(line)

To which the output is:

['']
['#Q What part of their body do the insects from order Archaeognatha use to spring up into the air?']
['^ Tail']
['A Antennae']
['B Front legs']
['C Hind legs']
['D Tail']
['']
['#Q What is the literal translation of the Greek word Embioptera', ' which denotes an order of insects', ' also known as webspinners?']
['^ Lively wings']
['A Small wings']
['B None of these']
['C Yarn knitter']
['D Lively wings']
['']
Contd....

The sentences containing commas get split across separate list elements. I tried to use .join but didn't get the results I was expecting. Please let me know how to approach this.
import json

dataset = "text.txt"
question_bank = []
question = {}

with open(dataset) as ds:
    for line in ds:
        line = line.strip("\n")
        if len(line) == 0:
            # a blank line ends the current question block
            if question:
                question_bank.append(question)
                question = {}
        elif line.startswith("#Q"):
            question = {"question": line}
        elif line.startswith("^"):
            question['answer'] = line.split(" ", 1)[1]
        else:
            key, val = line.split(" ", 1)
            question[key] = val
    if question:
        question_bank.append(question)

print({"questionBank": question_bank})

# for storing the JSON file in the local directory
final_output = {"questionBank": question_bank}
with open("output.json", "w") as outfile:
    outfile.write(json.dumps(final_output, indent=4))
Rather than handling the lines one at a time, I went with a regex pattern approach. This is also more reliable, as it will error out if the input data is in a bad format rather than silently ignoring a grouping which is missing a field.

import re
import json

PATTERN = re.compile(
    r"[#]Q (?P<question>.+)\n\^ (?P<answer>.+)\nA (?P<option_a>.+)\nB (?P<option_b>.+)\n(?:C (?P<option_c>.+)\n)?(?:D (?P<option_d>.+))?"
)

def parse_qa_group(qa_group):
    """
    Extract question, answer and 2 to 4 options from the input string and return them as a dict.
    """
    # "group" here is a set of question, answer and options.
    matches = PATTERN.search(qa_group)

    # "group" here is a regex group; named groups that did not take part
    # in the match come back as None.
    question = matches.group('question')
    answer = matches.group('answer')
    c = matches.group('option_c')
    d = matches.group('option_d')

    results = {
        "question": question,
        "answer": answer,
        "a": matches.group('option_a'),
        "b": matches.group('option_b'),
    }
    if c:
        results['c'] = c
    if d:
        results['d'] = d
    return results

# Split into groups using the blank line.
qa_groups = question_answer_str.split('\n\n')

# Process each group, building up a list of all results.
all_results = [parse_qa_group(qa_group) for qa_group in qa_groups]

print(json.dumps(all_results, indent=4))

Further details are in my gist. Read more on regex grouping. I left out reading the text and writing the JSON file.
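For completeness, a small sketch of the reading and writing steps left out above, assuming the questions live in file.txt and the result goes to questions.json (both file names are assumptions, not from the question):

with open("file.txt") as f:
    question_answer_str = f.read()

# Split on blank lines and skip empty groups before parsing.
qa_groups = question_answer_str.split('\n\n')
all_results = [parse_qa_group(group) for group in qa_groups if group.strip()]

with open("questions.json", "w") as f:
    json.dump({"questionBank": all_results}, f, indent=4)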
Looping through tree to create a dictionary_NLTK
I'm new to Python and trying to solve a problem looping through a tree in NLTK. I'm stuck on the final output; it is not entirely correct. I'm looking to create a dictionary with two variables, and if there is no quantity then add the value 1. This is the desired final output:

{quantity = 1, food = pizza}, {quantity = 1, food = coke}, {quantity = 2, food = beers}, {quantity = 1, food = sandwich}

Here is my code, any help is much appreciated!

import nltk
nltk.download()

grammar = r"""
Food: {<DT>?<VRB>?<NN.*>+}
      }<>+{
Quantity: {<CD>|<JJ>|<DT>}
"""
rp = nltk.RegexpParser(grammar)

def RegPar(menu):
    grammar = r"""
    Food: {<DT>?<VRB>?<NN.*>+}
          }<>+{
    Quantity: {<CD>|<JJ>|<DT>}
    """
    rp = nltk.RegexpParser(grammar)
    output = rp.parse(menu)
    return output

Sentences = ['A pizza margherita', 'one coke y 2 beers', 'Sandwich']
tagged_array = []
output_array = []
for s in Sentences:
    tokens = nltk.word_tokenize(s)
    tags = nltk.pos_tag(tokens)
    tagged_array.append(tags)
    output = rp.parse(tags)
    output_array.append(output)
    print(output)

dat = []
tree = RegPar(output_array)
for subtree in tree.subtrees():
    if subtree.label() == 'Food' or subtree.label() == 'Quantity':
        dat.append({(subtree.label(), subtree.leaves()[0][0])})
print(dat)
# [{('Food', 'A')}, {('Quantity', 'one')}, {('Food', 'coke')}, {('Quantity', '2')}, {('Food', 'beers')}, {('Food', 'Sandwich')}]
Determining most common name from web scraped birth name data
I have the task of web scraping this page: https://www.ssa.gov/cgi-bin/popularnames.cgi. There you can find a list of the most common birth names. Now I have to find the most common name that both girls and boys have for a given year (in other words, the exact same name is used for both genders), but I don't know how to do that. With the code below I solved the previous task of outputting the list for a given year, but I have no clue how to modify my code so I get the most common name that both girls and boys have.

import requests
import lxml.html as lh

url = 'https://www.ssa.gov/cgi-bin/popularnames.cgi'
string = input("Year: ")
r = requests.post(url, data=dict(year=string, top="1000", number="n"))
doc = lh.fromstring(r.content)
tr_elements = doc.xpath('//table[2]//td[2]//tr')

cols = []
for col in tr_elements[0]:
    name = col.text_content()
    number = col.text_content()
    cols.append((number, []))

count = 0
for row in tr_elements[1:]:
    i = 0
    for col in row:
        val = col.text_content()
        cols[i][1].append(val)
        i += 1
        if count < 4:
            print(val, end=' ')
            count += 1
        else:
            count = 0
            print(val)
Here's one approach. The first step is to group the data by name and record how many genders have used the name and their aggregate total. After that, we can filter the structure by names with more than one gender using it. Finally, we sort this multi-gender list by counts and take the 0-th element. This is our most popular multi-gender name for the year.

import requests
import lxml.html as lh

url = "https://www.ssa.gov/cgi-bin/popularnames.cgi"
year = input("Year: ")
response = requests.post(url, data=dict(year=year, top="1000", number="n"))
doc = lh.fromstring(response.content)
tr_elements = doc.xpath("//table[2]//td[2]//tr")
column_names = [col.text_content() for col in tr_elements[0]]

names = {}
most_common_shared_names_by_year = {}
for row in tr_elements[1:-1]:
    row = [cell.text_content() for cell in row]
    for i, gender in ((1, "male"), (3, "female")):
        if row[i] not in names:
            names[row[i]] = {"count": 0, "genders": set()}
        names[row[i]]["count"] += int(row[i+1].replace(",", ""))
        names[row[i]]["genders"].add(gender)

shared_names = [
    (name, data) for name, data in names.items() if len(data["genders"]) > 1
]
most_common_shared_names = sorted(shared_names, key=lambda x: -x[1]["count"])
print("%s => %s" % most_common_shared_names[0])

If you're curious, here are the results since 2000:

2000 => Tyler, 22187
2001 => Tyler, 19842
2002 => Tyler, 18788
2003 => Ryan, 20171
2004 => Madison, 20829
2005 => Ryan, 18661
2006 => Ryan, 17116
2007 => Jayden, 17287
2008 => Jayden, 19040
2009 => Jayden, 19053
2010 => Jayden, 18641
2011 => Jayden, 18064
2012 => Jayden, 16952
2013 => Jayden, 15462
2014 => Logan, 14478
2015 => Logan, 13753
2016 => Logan, 12099
2017 => Logan, 15117
Extracting numbers in text file
I have a text file which came from excel. I dont know how to take five digits after a specific character. I want to take only five digits after #ACA in a text file. my text is like: ERROR_MESSAGE (((#ACA16018)|(#ACA16019))&(#AQV71767='')&(#AQV71765='2'))?1:((#AQV71765='4')?1:((#AQV71767$'')?(((#AQV71765='1')|(#AQV71765='3'))?1:'Hasar veya Lehe Hukuk seçebilirsiniz'):'Rücu sıra numarasını yazıp Hasar veya Lehe Hukuk seçebilirsiniz')) Rücu Oranı Girilmesi Zorunludur...' #ACA17660 #ACA16560 #ACA15623 #ACA17804 BU ALANI BOŞ GEÇEMEZSİNİZ.EKSPER RAPORU GELMEDEN DY YE GERİ GÖNDEREMEZSİNİZ. PERT İHBARI VARSA PERT ÇALINMA OPERASYONU AKTİVİTESİ OLUŞTURULMALIDIR. (#TSC[T008UNSMAS;FIRM_CODE=2 AND UNIT_TYPE='SG' AND UNIT_NO=#AQV71830]>0)?1:'Girdiğiniz değer fihristte yoktur' #ACA17602 #ACA17604 #ACA56169 BU ALANI BOŞ GEÇEMEZSİNİZ #ACA17606 #ACA17608 (#AQV71835='')?'Boş geçilemez':1 Lütfen Gönderilecek Kişinin Mail Adresini Giriniz ! ' LÜTFEN RED NEDENİNİ GİRİNİZ. EKSİK BİLGİ / BELGE ALANINA GİRMİŞ OLDUĞUNUZ DEĞER YANLIŞ VEYA GEÇERŞİZDİR!!! LÜTFEN KONTROL EDİP TEKRAR DENEYİNİZ.' BU ALAN BOŞ GEÇİLEMEZ. ÖDEME YAPILMADAN EK ÖDEME SÜRECİNİ BAŞLATAMAZSINIZ. ONAYLANDI VE REDDEDİLDİ SEÇENEKLERİNİ KULLANAMAZSINIZ BU ALAN BOŞ GEÇİLEMEZ.EVRAKLARINIZI , VARSA EKSPER RAPORUNU VE MUALLAĞI KONTROL EDİNİZ. Muallak Tutarını kontrol ediniz. 'OTO BRANŞINDA REDDEDİLDİ NEDENİ SEÇMELİSİNİZ' 'OTODIŞI BRANŞINDA REDDEDİLDİ NEDENİ SEÇMELİSİNİZ' (#AQV70003$'')?((#TSC[T001HASIHB;FIRM_CODE=#FP10100 AND COMPANY_CODE=2 AND CLAIM_NO=#AQV70003]$0)?1:'Bu dosya sistemde bulunmamaktadır'):'Bu alan boş geçilemez' (#AQV70503='')?'Bu alan boş geçilemez.':((#ACA18635=1)?1:'Mağdura ait uygun kriterli ödeme kaydı mevcut değildir.') (#AQV71809=0)?'Boş geçilemez':1 (#FD101AQV71904_AFDS<0)?'Tarih bugünün tarihinden büyük olamaz I want to take every 5 digits which comes after #ACA, so: 16018, 16019, 17660, etc...
grep -oP '#ACA\K[0-9]{5}' file.txt

#ACA\K matches #ACA but does not print it as part of the output.
[0-9]{5} matches the five digits following #ACA.

If a variable number of digits is needed, use:

grep -oP '#ACA\K[0-9]+' file.txt
If you don't know or don't like regular expressions, you can do this, although the code is a bit longer:

if __name__ == '__main__':
    pattern = '#ACA'
    filename = 'yourfile.txt'
    res = list()
    with open(filename) as f:                  # open 'yourfile.txt' for reading
        for line in f:                         # for each line in the file
            for s in line.split(pattern)[1:]:  # split the line on '#ACA'
                try:
                    nb = int(s[:5])            # take the first 5 characters after it as an int
                    res.append(nb)             # add it to the list of numbers we found
                except ValueError:             # if conversion fails, that wasn't an int
                    pass
    print(res)                                 # if you want them in the same order as in the file
    print(sorted(res))                         # if you want them in ascending order
This should do it, if you have the whole text in the variable str_var:

import re
print(re.findall(r"#ACA(\d+)", str_var))

Output:

['16018', '16019', '17660', '16560', '15623', '17804', '17602', '17604', '56169', '17606', '17608', '18635']
re.findall(r'#ACA(\d{5})', str_var)
[x[:5] for x in content.split("#ACA")[1:]]
PowerShell solution:

$content = Get-Content -Raw 'your_file'
$match = [regex]::Matches($content, '#ACA(\d{5})')
$match | ForEach-Object { $_.Groups[1].Value }

Output:

16018
16019
17660
16560
15623
17804
17602
17604
56169
17606
17608
18635