How to do complex PDF extraction with regex - Python
I have a PDF file which contains lottery ticket winners, and I want to extract all winning tickets grouped by their prizes.
PDF file
I tried this:
import re
import pdfplumber

prize_re = re.compile(r"^\d[a-z]")
cons_prize_re = re.compile(r"^Cons")
ticket1_line_re = re.compile(r"^\d[)]")
ticket2_line_re = re.compile(r"^\d{4}")
ticket3_line_re = re.compile(r"[A-Z] \d{6}")

with pdfplumber.open("./test11.pdf") as pdf:
    for i in range(len(pdf.pages)):
        page_text = pdf.pages[i].extract_text()
        for line in page_text.split("\n"):
            if prize_re.match(line) or cons_prize_re.match(line) or ticket1_line_re.match(line) or ticket2_line_re.match(line) or ticket3_line_re.search(line):
                print(line)
This is what I got. I don't know how to assign each ticket to its prize. Also, the Cons Prize ticket numbers look a little strange and I don't know why (AN 867952AO 867952AP should be AN 867952 AO 867952 AP ...):
1st Prize Rs :7000000/- 1) AU 867952 (MANANTHAVADY)
Cons Prize-Rs :8000/- AN 867952AO 867952AP 867952 AR 867952AS 867952
AT 867952 AV 867952 AW 867952AX 867952AY 867952
AZ 867952
2nd Prize Rs :500000/- 1) AZ 499603 (ADOOR)
3rd Prize Rs :100000/- 1) AN 215264 (KOTTAYAM)
2) AO 852774 (PATTAMBI)
3) AP 953655 (KOTTAYAM)
4) AR 638904 (PAYYANUR)
5) AS 496774 (VAIKKOM)
6) AT 878990 (WAYANADU)
7) AU 703702 (PUNALUR)
8) AV 418446 (WAYANADU)
9) AW 994685 (KOZHIKKODE)
10) AX 317550 (PATTAMBI)
11) AY 854780 (CHITTUR)
12) AZ 899905 (KARUNAGAPALLY
...
Instead, I want to get:
[
{
"1st Prize Rs :7000000",
"tickets": [
"AU 867952"
]
},
{
"Cons Prize-Rs :8000",
"tickets": [
"AN 867952",
"AO 867952",
"AP 867952",
"AR 867952",
...
]
},
...
]
How can I achieve this?
You could first capture all the full prize sections from all the pages in capture groups.
Then you can post-process the third capture group to get the separate "tickets" and build the wanted data structure in a loop.
To split the text into sections, you can use a pattern that matches the start of every prize section and captures everything up to the next prize section:
^(\w+ Prize[-\s]Rs\s*):(\d+)/-(?:\s*\d+\))?\s*(.*(?:\n(?!\w+ Prize\b).*)*)
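As a minimal sketch (using two lines taken from the extracted text shown in the question, not the real PDF), the pattern yields one match per prize section with three groups:

import re

pattern = r"^(\w+ Prize[-\s]Rs\s*):(\d+)/-(?:\s*\d+\))?\s*(.*(?:\n(?!\w+ Prize\b).*)*)"
sample = (
    "1st Prize Rs :7000000/- 1) AU 867952 (MANANTHAVADY)\n"
    "2nd Prize Rs :500000/- 1) AZ 499603 (ADOOR)"
)
for m in re.finditer(pattern, sample, re.MULTILINE):
    print(m.groups())
# ('1st Prize Rs ', '7000000', 'AU 867952 (MANANTHAVADY)')
# ('2nd Prize Rs ', '500000', 'AZ 499603 (ADOOR)')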
For the post-processing, you can use a pattern for the ticket formats that matches either two uppercase characters, a space and six digits, or four or more digits surrounded by whitespace boundaries:
(?:[A-Z]{2} \d{6}(?!\d)|(?<!\S)\d{4,}(?!\S))
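As a minimal sketch, applying it to the fused "Cons Prize" line from the question splits the run-together values into separate tickets:

import re

ticket_re = r"(?:[A-Z]{2} \d{6}(?!\d)|(?<!\S)\d{4,}(?!\S))"
line = "AN 867952AO 867952AP 867952 AR 867952AS 867952"
print(re.findall(ticket_re, line))
# ['AN 867952', 'AO 867952', 'AP 867952', 'AR 867952', 'AS 867952']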
Example code using the pdf file from the question:
import re
import pdfplumber
import json

pattern = r"^(\w+ Prize[-\s]Rs\s*):(\d+)/-(?:\s*\d+\))?\s*(.*(?:\n(?!\w+ Prize\b).*)*)"

with pdfplumber.open("./test11.pdf") as pdf:
    # collect the text of all pages into one string
    all_text = ""
    for page in pdf.pages:
        all_text += '\n' + page.extract_text()

matches = re.finditer(pattern, all_text, re.MULTILINE)
coll = []
for match in matches:
    # group 1 is the prize label, group 2 the amount, group 3 the block of tickets
    dct = {}
    dct[match.group(1)] = match.group(2)
    dct["tickets"] = re.findall(r"(?:[A-Z]{2} \d{6}(?!\d)|(?<!\S)\d{4,}(?!\S))", match.group(3))
    coll.append(dct)

print(json.dumps(coll, indent=4))
Output
[
{
"1st Prize Rs ": "120000000",
"tickets": [
"XG 218582"
]
},
{
"Cons Prize-Rs ": "500000",
"tickets": [
"XA 218582",
"XB 218582",
"XC 218582",
"XD 218582",
"XE 218582"
]
},
{
"2nd Prize Rs ": "5000000",
"tickets": [
"XA 788417",
"XB 161796",
"XC 319503",
"XD 713832",
"XE 667708",
"XG 137764"
]
},
....
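If you want the result in a file rather than printed to the console, a small sketch (the winners.json file name is just an assumption, not something from the question):

with open("winners.json", "w", encoding="utf-8") as f:
    json.dump(coll, f, indent=4)  # same structure as printed above, written to disk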
Related
Get data with boundaries using regex
I would like to get the labels and data from this function using regex, I have tried using this: pattern = re.compile(r'/blabels: ],/b') print(pattern) result = soup.find("script", text=pattern) But I get None using boundaries This is the soup: <script> Chart.defaults.LineWithLine = Chart.defaults.line; new Chart(document.getElementById("chart-overall-mentions"), { type: 'LineWithLine', data: { labels: [1637005508000,1637006108000,1637006708000,1637007308000,1637007908000,1637008508000,1637009108000,1637009708000,1637010308000,1637010908000,1637011508000,1637012108000,1637012708000,1637013308000,1637013908000,1637014508000,1637015108000,1637015708000,1637016308000,1637016908000,1637017508000,1637018108000,1637018708000,1637019308000,1637019908000,1637020508000,1637021108000,1637021708000,1637022308000,1637022908000,1637023508000,1637024108000,1637024708000,1637025308000,1637025908000,1637026508000,1637027108000,1637027708000,1637028308000,1637028908000,1637029508000,1637030108000,1637030708000,1637031308000,1637031908000,1637032508000,1637033108000,1637033708000,1637034308000,1637034908000,1637035508000,1637036108000,1637036708000,1637037308000,1637037908000,1637038508000,1637039108000,1637039708000,1637040308000,1637040908000,1637041508000,1637042108000,1637042708000,1637043308000,1637043908000,1637044508000,1637045108000,1637045708000,1637046308000,1637046908000,1637047508000,1637048108000,1637048708000,1637049308000,1637049908000,1637050508000,1637051108000,1637051708000,1637052308000,1637052908000,1637053508000,1637054108000,1637054708000,1637055308000,1637055908000,1637056508000,1637057108000,1637057708000,1637058308000,1637058908000,1637059508000,1637060108000,1637060708000,1637061308000,1637061908000,1637062508000,1637063108000,1637063708000,1637064308000,1637064908000,1637065508000,1637066108000,1637066708000,1637067308000,1637067908000,1637068508000,1637069108000,1637069708000,1637070308000,1637070908000,1637071508000,1637072108000,1637072708000,1637073308000,1637073908000,1637074508000,1637075108000,1637075708000,1637076308000,1637076908000,1637077508000,1637078108000,1637078708000,1637079308000,1637079908000,1637080508000,1637081108000,1637081708000,1637082308000,1637082908000,1637083508000,1637084108000,1637084708000,1637085308000,1637085908000,1637086508000,1637087108000,1637087708000,1637088308000,1637088908000,1637089508000,1637090108000,1637090708000,1637091308000], datasets: [{ data: [13,10,20,26,21,23,24,21,24,35,25,31,42,24,24,20,23,22,17,23,30,11,16,20,9,10,22,10,19,16,15,16,17,19,10,20,24,14,19,15,13,9,13,17,20,16,15,21,18,25,15,14,16,15,16,14,14,21,10,9,5,9,9,13,14,9,9,18,15,11,11,6,12,14,19,17,16,11,20,14,21,13,15,12,14,10,20,16,25,17,17,11,23,11,13,11,19,10,17,19,10,20,22,19,19,27,28,18,20,22,18,16,17,18,14,17,19,18,20,11,13,20,15,15,18,14,13,14,14,11,19,14,14,11,11,15,26,12,15,15,11,4,3,6], pointRadius: 0, borderColor: "#666", fill: true, yAxisID:'yAxis1' }, ] }, options: { tooltips: { mode: 'index', bodyFontSize: 18, intersect: false, titleFontSize: 16, }, . . . </script>
Here is how you can do that: Get the script tag - you can use a regex, too, if that is the only way to obtain that node Then run a regex search against the node text/string to get your final output. You can use # Get the script node with text matching your pattern item = soup.find("script", text=re.compile(r'\blabels:\s*\[')) import re match = re.search(r'\blabels:\s*\[([^][]*)]', item.string) if match: labels = map(int, match.group(1).split(',')) Output: >>> print(list(labels)) [1637005508000, 1637006108000, 1637006708000, 1637007308000, 1637007908000, 1637008508000, 1637009108000, 1637009708000, 1637010308000, 1637010908000, 1637011508000, 1637012108000, 1637012708000, 1637013308000, 1637013908000, 1637014508000, 1637015108000, 1637015708000, 1637016308000, 1637016908000, 1637017508000, 1637018108000, 1637018708000, 1637019308000, 1637019908000, 1637020508000, 1637021108000, 1637021708000, 1637022308000, 1637022908000, 1637023508000, 1637024108000, 1637024708000, 1637025308000, 1637025908000, 1637026508000, 1637027108000, 1637027708000, 1637028308000, 1637028908000, 1637029508000, 1637030108000, 1637030708000, 1637031308000, 1637031908000, 1637032508000, 1637033108000, 1637033708000, 1637034308000, 1637034908000, 1637035508000, 1637036108000, 1637036708000, 1637037308000, 1637037908000, 1637038508000, 1637039108000, 1637039708000, 1637040308000, 1637040908000, 1637041508000, 1637042108000, 1637042708000, 1637043308000, 1637043908000, 1637044508000, 1637045108000, 1637045708000, 1637046308000, 1637046908000, 1637047508000, 1637048108000, 1637048708000, 1637049308000, 1637049908000, 1637050508000, 1637051108000, 1637051708000, 1637052308000, 1637052908000, 1637053508000, 1637054108000, 1637054708000, 1637055308000, 1637055908000, 1637056508000, 1637057108000, 1637057708000, 1637058308000, 1637058908000, 1637059508000, 1637060108000, 1637060708000, 1637061308000, 1637061908000, 1637062508000, 1637063108000, 1637063708000, 1637064308000, 1637064908000, 1637065508000, 1637066108000, 1637066708000, 1637067308000, 1637067908000, 1637068508000, 1637069108000, 1637069708000, 1637070308000, 1637070908000, 1637071508000, 1637072108000, 1637072708000, 1637073308000, 1637073908000, 1637074508000, 1637075108000, 1637075708000, 1637076308000, 1637076908000, 1637077508000, 1637078108000, 1637078708000, 1637079308000, 1637079908000, 1637080508000, 1637081108000, 1637081708000, 1637082308000, 1637082908000, 1637083508000, 1637084108000, 1637084708000, 1637085308000, 1637085908000, 1637086508000, 1637087108000, 1637087708000, 1637088308000, 1637088908000, 1637089508000, 1637090108000, 1637090708000, 1637091308000] Once the node is obtained the \blabels:\s*\[([^][]*)] regex searches for \b - a word boundary labels: - a fixed string \s* - zero or more whitespaces \[ - a [ char ([^][]*) - Group 1 (this is what you will need to split with a comma later): any zero or more chars other than ] and [ ] - a ] char.
How to convert text file into json file?
I am new to Python and I want to convert a text file into a JSON file. Here's how it looks:

#Q Three of these animals hibernate. Which one does not?
^ Sloth
A Mouse
B Sloth
C Frog
D Snake

#Q What is the literal translation of the Greek word Embioptera, which denotes an order of insects, also known as webspinners?
^ Lively wings
A Small wings
B None of these
C Yarn knitter
D Lively wings

#Q There is a separate species of scorpions which have two tails, with a venomous sting on each tail.
^ False
A True
B False

Contd . . . .

^ means Answer. I want it in JSON format as shown below. Example:

{
    "questionBank": [
        {
            "question": "Grand Central Terminal, Park Avenue, New York is the worlds",
            "a": "largest railway station",
            "b": "Longest railway station",
            "c": "highest railway station",
            "d": "busiest railway station",
            "answer": "largest railway station"
        },
        {
            "question": "Eritrea, which became the 182nd member of the UN in 1993, is in the continent of",
            "a": "Asia",
            "b": "Africa",
            "c": "Europe",
            "d": "Oceania",
            "answer": "Africa"
        },
        Contd.....
    ]
}

I came across a few similar posts and here's what I have tried:

dataset = "file.txt"
data = []
with open(dataset) as ds:
    for line in ds:
        line = line.strip().split(",")
        print(line)

To which the output is:

['']
['#Q What part of their body do the insects from order Archaeognatha use to spring up into the air?']
['^ Tail']
['A Antennae']
['B Front legs']
['C Hind legs']
['D Tail']
['']
['#Q What is the literal translation of the Greek word Embioptera', ' which denotes an order of insects', ' also known as webspinners?']
['^ Lively wings']
['A Small wings']
['B None of these']
['C Yarn knitter']
['D Lively wings']
['']
Contd....

The sentences containing commas get split across separate list elements. I tried to use .join but didn't get the results I was expecting. Please let me know how to approach this.
import json

dataset = "text.txt"
question_bank = []
question = {}

with open(dataset) as ds:
    for line in ds:
        line = line.strip("\n")
        if len(line) == 0:
            # a blank line ends the current question block
            if question:
                question_bank.append(question)
                question = {}
        elif line.startswith("#Q"):
            question = {"question": line}
        elif line.startswith("^"):
            question['answer'] = line.split(" ", 1)[1]
        else:
            key, val = line.split(" ", 1)
            question[key] = val
    if question:
        question_bank.append(question)

print({"questionBank": question_bank})

# for storing the JSON file in the local directory
final_output = {"questionBank": question_bank}
with open("output.json", "w") as outfile:
    outfile.write(json.dumps(final_output, indent=4))
Rather than handling the lines one at a time, I went with a regex pattern approach. This is also more reliable, as it will error out if the input data is in a bad format rather than silently ignoring a grouping which is missing a field.

import re
import json

PATTERN = re.compile(
    r"[#]Q (?P<question>.+)\n\^ (?P<answer>.+)\nA (?P<option_a>.+)\nB (?P<option_b>.+)\n(?:C (?P<option_c>.+)\n)?(?:D (?P<option_d>.+))?"
)

def parse_qa_group(qa_group):
    """
    Extract question, answer and 2 to 4 options from the input string and return them as a dict.
    """
    # "group" here is a set of question, answer and options.
    matches = PATTERN.search(qa_group)

    # "group" here is a regex group; named groups that did not take part
    # in the match come back as None.
    question = matches.group('question')
    answer = matches.group('answer')
    c = matches.group('option_c')
    d = matches.group('option_d')

    results = {
        "question": question,
        "answer": answer,
        "a": matches.group('option_a'),
        "b": matches.group('option_b'),
    }
    if c:
        results['c'] = c
    if d:
        results['d'] = d
    return results

# Split into groups using the blank line.
qa_groups = question_answer_str.split('\n\n')

# Process each group, building up a list of all results.
all_results = [parse_qa_group(qa_group) for qa_group in qa_groups]

print(json.dumps(all_results, indent=4))

Further details are in my gist. Read more on regex grouping. I left out reading the text and writing the JSON file.
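For completeness, a small sketch of the reading and writing steps left out above, assuming the questions live in file.txt and the result goes to questions.json (both file names are assumptions, not from the question):

with open("file.txt") as f:
    question_answer_str = f.read()

# Split on blank lines and skip empty groups before parsing.
qa_groups = question_answer_str.split('\n\n')
all_results = [parse_qa_group(group) for group in qa_groups if group.strip()]

with open("questions.json", "w") as f:
    json.dump({"questionBank": all_results}, f, indent=4)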
Looping through tree to create a dictionary_NLTK
I'm new to Python and trying to solve a problem looping through a tree in NLTK. I'm stuck on the final output; it is not entirely correct. I'm looking to create a dictionary with two variables, and if there is no quantity then add the value 1. This is the desired final output:

{quantity = 1, food = pizza}, {quantity = 1, food = coke}, {quantity = 2, food = beers}, {quantity = 1, food = sandwich}

Here is my code, any help is much appreciated!

import nltk
nltk.download()

grammar = r"""
Food: {<DT>?<VRB>?<NN.*>+}
      }<>+{
Quantity: {<CD>|<JJ>|<DT>}
"""
rp = nltk.RegexpParser(grammar)

def RegPar(menu):
    grammar = r"""
    Food: {<DT>?<VRB>?<NN.*>+}
          }<>+{
    Quantity: {<CD>|<JJ>|<DT>}
    """
    rp = nltk.RegexpParser(grammar)
    output = rp.parse(menu)
    return output

Sentences = ['A pizza margherita', 'one coke y 2 beers', 'Sandwich']
tagged_array = []
output_array = []
for s in Sentences:
    tokens = nltk.word_tokenize(s)
    tags = nltk.pos_tag(tokens)
    tagged_array.append(tags)
    output = rp.parse(tags)
    output_array.append(output)
    print(output)

dat = []
tree = RegPar(output_array)
for subtree in tree.subtrees():
    if subtree.label() == 'Food' or subtree.label() == 'Quantity':
        dat.append({(subtree.label(), subtree.leaves()[0][0])})
print(dat)
# [{('Food', 'A')}, {('Quantity', 'one')}, {('Food', 'coke')}, {('Quantity', '2')}, {('Food', 'beers')}, {('Food', 'Sandwich')}]
Determining most common name from web scraped birth name data
I have the task of web scraping this page: https://www.ssa.gov/cgi-bin/popularnames.cgi. There you can find a list of the most common birth names. Now I have to find the most common name that both girls and boys have for a given year (in other words, the exact same name is used for both genders), but I don't know how to do that. With the code below I solved the previous task of outputting the list for a given year, but I have no clue how to modify my code so I get the most common name that both girls and boys have.

import requests
import lxml.html as lh

url = 'https://www.ssa.gov/cgi-bin/popularnames.cgi'
string = input("Year: ")
r = requests.post(url, data=dict(year=string, top="1000", number="n"))
doc = lh.fromstring(r.content)
tr_elements = doc.xpath('//table[2]//td[2]//tr')

cols = []
for col in tr_elements[0]:
    name = col.text_content()
    number = col.text_content()
    cols.append((number, []))

count = 0
for row in tr_elements[1:]:
    i = 0
    for col in row:
        val = col.text_content()
        cols[i][1].append(val)
        i += 1
        if count < 4:
            print(val, end=' ')
            count += 1
        else:
            count = 0
            print(val)
Here's one approach. The first step is to group the data by name and record how many genders have used the name and their aggregate total. After that, we can filter the structure by names with more than one gender using it. Finally, we sort this multi-gender list by counts and take the 0-th element. This is our most popular multi-gender name for the year.

import requests
import lxml.html as lh

url = "https://www.ssa.gov/cgi-bin/popularnames.cgi"
year = input("Year: ")
response = requests.post(url, data=dict(year=year, top="1000", number="n"))
doc = lh.fromstring(response.content)
tr_elements = doc.xpath("//table[2]//td[2]//tr")
column_names = [col.text_content() for col in tr_elements[0]]

names = {}
most_common_shared_names_by_year = {}
for row in tr_elements[1:-1]:
    row = [cell.text_content() for cell in row]
    for i, gender in ((1, "male"), (3, "female")):
        if row[i] not in names:
            names[row[i]] = {"count": 0, "genders": set()}
        names[row[i]]["count"] += int(row[i+1].replace(",", ""))
        names[row[i]]["genders"].add(gender)

shared_names = [
    (name, data) for name, data in names.items() if len(data["genders"]) > 1
]
most_common_shared_names = sorted(shared_names, key=lambda x: -x[1]["count"])
print("%s => %s" % most_common_shared_names[0])

If you're curious, here are the results since 2000:

2000 => Tyler, 22187
2001 => Tyler, 19842
2002 => Tyler, 18788
2003 => Ryan, 20171
2004 => Madison, 20829
2005 => Ryan, 18661
2006 => Ryan, 17116
2007 => Jayden, 17287
2008 => Jayden, 19040
2009 => Jayden, 19053
2010 => Jayden, 18641
2011 => Jayden, 18064
2012 => Jayden, 16952
2013 => Jayden, 15462
2014 => Logan, 14478
2015 => Logan, 13753
2016 => Logan, 12099
2017 => Logan, 15117
Extracting numbers in text file
I have a text file which came from excel. I dont know how to take five digits after a specific character. I want to take only five digits after #ACA in a text file. my text is like: ERROR_MESSAGE (((#ACA16018)|(#ACA16019))&(#AQV71767='')&(#AQV71765='2'))?1:((#AQV71765='4')?1:((#AQV71767$'')?(((#AQV71765='1')|(#AQV71765='3'))?1:'Hasar veya Lehe Hukuk seçebilirsiniz'):'Rücu sıra numarasını yazıp Hasar veya Lehe Hukuk seçebilirsiniz')) Rücu Oranı Girilmesi Zorunludur...' #ACA17660 #ACA16560 #ACA15623 #ACA17804 BU ALANI BOŞ GEÇEMEZSİNİZ.EKSPER RAPORU GELMEDEN DY YE GERİ GÖNDEREMEZSİNİZ. PERT İHBARI VARSA PERT ÇALINMA OPERASYONU AKTİVİTESİ OLUŞTURULMALIDIR. (#TSC[T008UNSMAS;FIRM_CODE=2 AND UNIT_TYPE='SG' AND UNIT_NO=#AQV71830]>0)?1:'Girdiğiniz değer fihristte yoktur' #ACA17602 #ACA17604 #ACA56169 BU ALANI BOŞ GEÇEMEZSİNİZ #ACA17606 #ACA17608 (#AQV71835='')?'Boş geçilemez':1 Lütfen Gönderilecek Kişinin Mail Adresini Giriniz ! ' LÜTFEN RED NEDENİNİ GİRİNİZ. EKSİK BİLGİ / BELGE ALANINA GİRMİŞ OLDUĞUNUZ DEĞER YANLIŞ VEYA GEÇERŞİZDİR!!! LÜTFEN KONTROL EDİP TEKRAR DENEYİNİZ.' BU ALAN BOŞ GEÇİLEMEZ. ÖDEME YAPILMADAN EK ÖDEME SÜRECİNİ BAŞLATAMAZSINIZ. ONAYLANDI VE REDDEDİLDİ SEÇENEKLERİNİ KULLANAMAZSINIZ BU ALAN BOŞ GEÇİLEMEZ.EVRAKLARINIZI , VARSA EKSPER RAPORUNU VE MUALLAĞI KONTROL EDİNİZ. Muallak Tutarını kontrol ediniz. 'OTO BRANŞINDA REDDEDİLDİ NEDENİ SEÇMELİSİNİZ' 'OTODIŞI BRANŞINDA REDDEDİLDİ NEDENİ SEÇMELİSİNİZ' (#AQV70003$'')?((#TSC[T001HASIHB;FIRM_CODE=#FP10100 AND COMPANY_CODE=2 AND CLAIM_NO=#AQV70003]$0)?1:'Bu dosya sistemde bulunmamaktadır'):'Bu alan boş geçilemez' (#AQV70503='')?'Bu alan boş geçilemez.':((#ACA18635=1)?1:'Mağdura ait uygun kriterli ödeme kaydı mevcut değildir.') (#AQV71809=0)?'Boş geçilemez':1 (#FD101AQV71904_AFDS<0)?'Tarih bugünün tarihinden büyük olamaz I want to take every 5 digits which comes after #ACA, so: 16018, 16019, 17660, etc...
grep -oP '#ACA\K[0-9]{5}' file.txt

#ACA\K matches #ACA but does not print it as part of the output.
[0-9]{5} matches the five digits following #ACA.

If a variable number of digits is needed, use:

grep -oP '#ACA\K[0-9]+' file.txt
If you don't know or don't like regular expressions, you can do this, although the code is a bit longer:

if __name__ == '__main__':
    pattern = '#ACA'
    filename = 'yourfile.txt'
    res = list()
    with open(filename) as f:                  # open 'yourfile.txt' for reading
        for line in f:                         # for each line in the file
            for s in line.split(pattern)[1:]:  # split the line on '#ACA'
                try:
                    nb = int(s[:5])            # take the first 5 characters after it as an int
                    res.append(nb)             # add it to the list of numbers we found
                except ValueError:             # if conversion fails, that wasn't an int
                    pass
    print(res)                                 # if you want them in the same order as in the file
    print(sorted(res))                         # if you want them in ascending order
This should do it, if you have the whole text in the variable str_var:

import re
print(re.findall(r"#ACA(\d+)", str_var))

Output:

['16018', '16019', '17660', '16560', '15623', '17804', '17602', '17604', '56169', '17606', '17608', '18635']
re.findall(r'#ACA(\d{5})', str_var)
[x[:5] for x in content.split("#ACA")[1:]]
PowerShell solution:

$content = Get-Content -Raw 'your_file'
$match = [regex]::Matches($content, '#ACA(\d{5})')
$match | ForEach-Object { $_.Groups[1].Value }

Output:

16018
16019
17660
16560
15623
17804
17602
17604
56169
17606
17608
18635