I have a string produced by running OCR on an image, and I need a way to extract the human names from it. The OCR output is:
From: Al Amri, Salim <salim.amri#gmail.com>
Sent: 25 August 2021 17:20
To: Al Harthi, Mohammed <mohd4.king#rihal.om>
Ce: Al hajri, Malik <hajri990#ocaa.co.om>; Omar, Naif <nnnn49#apple.com>
Subject: Conference Rooms Booking Details
Dear Mohammed,
As per our last discussion these are the available conference rooms available for booking along
with their rates for full day:
Room: Luban, available on 26/09/2021. Rate: $4540
Room: Mazoon, available on 04/12/2021 and 13/02/2022. Rate: $3000
Room: Dhofar. Available on 11/11/2021. Rate: $2500
Room: Nizwa. Available on 13/12/2022. Rate: $1200
Please let me know which ones you are interested so we go through more details.
Best regards,
Salim Al Amri
There are 4 names in total in the header, and I need to get this output:
names = 'Al Hajri, Malik', 'Omar, Naif', 'Al Amri, Salim', 'Al Harthy, Mohammed' #desired output
but I have no idea how to extract the names. I have tried RegEx and came up with:
names = re.findall(r'(?i)([A-Z][a-z]+[A-Z][a-z][, ] [A-Z][a-z]+)', string) #regex to find names
which searches for a capitalized word, then a comma, then another word starting with a capital letter. It is close to the desired result, but it comes out as:
names = ['Amri, Salim', 'Harthi, Mohammed', 'hajri, Malik', 'Omar, Naif', 'Luban, available', 'Mazoon, available'] # actual result
I have thought of building a second list that holds the room names and excluding those from the result, but I have no idea how to implement that. I am new to regex, so any help will be appreciated. Thanks in advance.
Notwithstanding the excellent RE approach suggested by @JvdV, here's a step-by-step way in which you could achieve this:
OCR = """From: Al Amri, Salim <salim.amri#gmail.com>
Sent: 25 August 2021 17:20
To: Al Harthi, Mohammed <mohd4.king#rihal.om>
Ce: Al hajri, Malik <hajri990#ocaa.co.om>; Omar, Naif <nnnn49#apple.com>
Subject: Conference Rooms Booking Details
Dear Mohammed,
As per our last discussion these are the available conference rooms available for booking along
with their rates for full day:
Room: Luban, available on 26/09/2021. Rate: $4540
Room: Mazoon, available on 04/12/2021 and 13/02/2022. Rate: $3000
Room: Dhofar. Available on 11/11/2021. Rate: $2500
Room: Nizwa. Available on 13/12/2022. Rate: $1200
Please let me know which ones you are interested so we go through more details.
Best regards,
Salim Al Amri"""
names = []
for line in OCR.split('\n'):
    tokens = line.split()
    if tokens and tokens[0] in ['From:', 'To:', 'Ce:']: # Ce or Cc ???
        parts = line.split(';')
        for i, p in enumerate(parts):
            # i==0 is True (1) for the first part, which skips the 'From:'/'To:' label;
            # the -1 drops the trailing <email> token
            names.append(' '.join(p.split()[i==0:-1]))
print(names)
Depending on the contents of your email, a reasonable approach might be to use:
[:;]\s*(.+?)\s*<
See an online demo.
[:;] - A (semi-)colon;
\s* - 0+ (Greedy) whitespaces;
(.+?) - A 1st capture group of 1+ (Lazy) characters;
\s* - 0+ (Greedy) whitespaces;
< - A literal '<'.
Note that I specifically use (.+?) to capture names since names are notoriously hard to match.
import re
s = """From: Al Amri, Salim <salim.amri#gmail.com>
Sent: 25 August 2021 17:20
To: Al Harthi, Mohammed <mohd4.king#rihal.om>
Ce: Al hajri, Malik <hajri990#ocaa.co.om>; Omar, Naif <nnnn49#apple.com>
Subject: Conference Rooms Booking Details
Dear Mohammed,
As per our last discussion these are the available conference rooms available for booking along
with their rates for full day:
Room: Luban, available on 26/09/2021. Rate: $4540
Room: Mazoon, available on 04/12/2021 and 13/02/2022. Rate: $3000
Room: Dhofar. Available on 11/11/2021. Rate: $2500
Room: Nizwa. Available on 13/12/2022. Rate: $1200
Please let me know which ones you are interested so we go through more details.
Best regards,
Salim Al Amri"""
print(re.findall(r'[:;]\s*(.+?)\s*<', s))
Prints:
['Al Amri, Salim', 'Al Harthi, Mohammed', 'Al hajri, Malik', 'Omar, Naif']
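If the body of the email could itself contain something like `Room: X <...>`, one option is to combine both answers: look only at the address header lines, then pull the names out of those. The following is a minimal sketch under that assumption. The helper name `header_names` and the set of accepted labels (From, To, Cc, plus the OCR misreading Ce) are illustrative choices, not something taken from the answers above.

```python
import re

# Header lines that may carry "Name <address>" pairs (Ce is the OCR misreading of Cc).
HEADER = re.compile(r'^(?:From|To|C[ce]):\s*(.*)$', re.MULTILINE)
# A name is whatever precedes "<...>", up to (not including) a ';' separator.
NAME = re.compile(r'\s*([^;<]+?)\s*<[^>]*>')

def header_names(text):
    """Return the names found on From/To/Cc header lines only."""
    names = []
    for header in HEADER.finditer(text):
        names.extend(NAME.findall(header.group(1)))
    return names

sample = """From: Al Amri, Salim <salim.amri#gmail.com>
Sent: 25 August 2021 17:20
To: Al Harthi, Mohammed <mohd4.king#rihal.om>
Ce: Al hajri, Malik <hajri990#ocaa.co.om>; Omar, Naif <nnnn49#apple.com>
Room: Luban, available on 26/09/2021. Rate: $4540"""

print(header_names(sample))
# ['Al Amri, Salim', 'Al Harthi, Mohammed', 'Al hajri, Malik', 'Omar, Naif']
```

Because the room lines never start with one of the header labels, they are not searched at all, so no exclusion list is needed.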
I have a PDF file and use PyPDF2 to extract the text from it.
I'm looking for a regex to capture the numbered sentences in the file.
I tried a couple of regexes in the code below, but the output is not what I need: I want to capture each numbered point as one sentence, like this.
Expected output:
['1. Please admit that Plaintiff, JOSHUA PINK, received benefits from a collateral
source, as defined by §768.76, Florida Statutes, for medical bills alleged to have been incurred as
a result of the incident described in the Complaint.',2. please.....]
Instead, the two regexes I tried either don't capture the full sentence, or they capture it across multiple lines and treat every \n as the start of a new sentence.
Extracted TEXT
" \n IN THE CIRCUIT COURT, OF THE \nEIGHTEENTH JUDICIAL CIRCUIT, IN \nAND FOR SEMINOLE COUNTY, \nFLORIDA \n \nCASE NO: 2022 -CA-002235 \n \nJOSHUA PINK, \n \n Plaintiff, \nvs. \n \nMATHEW ZUMBRUM , \n \n Defendant. \n / \n \nDEFENDANT'S REQUEST FOR ADMISSIONS TO PLAINTIFF, JOSHUA PINK \n \n \nCOME NOW the Defendant , MATHEW ZUMBRUM , by and through the undersigned \nattorneys, and pursuant to Rule 1.370, Florida Rul es of Civil Procedure, requests the Plaintiff, \nJOSHUA PINK, admit in this action that each of the following statements are true: \n1. Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral \nsource, as defined by §768.76, Florida Statute s, for medical bills alleged to have been incurred as \na result of the incident described in the Complaint. \n2. Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral \nsource, as defined by §768.76, Florida Statutes, for loss of wages o r income alleged to have been \nsustained as a result of the incident described in the Complaint. \n3. Please admit that Plaintiff, JOSHUA PINK , received benefits under the Personal \nInjury Protection portion of an automobile policy for medical bills alleged to have been incurred \nas a result of the incident described in the Complaint. \n Filing # 162442429 E-Filed 12/06/2022 09:46:49 AM\n \n2 4. Please admit that Plaintiff, JOSHUA PINK , received benefits under the Personal \nInjury Protection portion of an automobile insurance policy for loss of wages or income alleged \nto have been sustained as a result of the incident described in the Complaint. \n5. Please admit that Plaintiff, JOSHUA PINK , received benefits under the medical \npayments provisions of an automobile insurance policy for medical bills alleged to have been \nincurred as a result of the incident described in the Complaint. \n6. 
Please admit that Plaintiff, JOSHUA PINK , is subject to a deductible under the \nPersonal Injury Protection portion of an automobile insurance policy. \n7. Please admit that Plaintiff, JOSHUA PINK received benefits pursuant to personal \nor group health insurance policy, for medical bills alleged to have been incurred as a result of the \nincident described in the Complaint. \n8. Please admit that Plaintiff, JOSHUA PINK , received benefits pursuant to a \npersonal or group wage continuation plan or policy, for loss of wages or income alleged to have \nbeen sustained as a result of the incident described in the Complaint. \n 9. Please admit that on the date of the accident alleged in your Complaint, Defendant, \nMATHEW ZUMBRUM , complied with and met the security requirements under Chapter \n627.730 - 627.7405, Florida Statutes. \n10. Please admit that Plaintiff, JOSHUA PINK , was partially responsible for the \nsubject accident. \n11. Please admit that Plaintiff, JOSHUA PINK , did NOT suffer a permanent injury as \na result of the subject accident. \nI HEREBY CERTIFY that on the 6th day of December, 2022 a true and correct copy of \nthe foregoing was electronically filed with the Florida Court s E-Filing Portal system which will \n \n3 send a notice of electronic filing to Michael R. Vaughn, Esq., Morgan & Morgan, P.A., 20 N. \nOrange Ave, 16th Floor, Orlando, FL 32801 at mvaughn#forthepeople.com; \njburnham#forthepeople.com; mserrano#forthepeople.com. \nAND REW J. GORMAN & ASSOCIATES \n \nBY: \n \n(Original signed electronically by Attorney.) \nLOURDES CALVO -PAQUETTE, ESQ. \nAttorney for Defendant, Zumbrum \n390 N. Orange Avenue, Suite 1700 \nOrlando, FL 32801 \nTelephone: (407) 872 -2498 \nFacsímile: (855) 369 -8989 \nFlorida Bar No. 0817295 \nE-mail for service (FL R. Jud. Admin. 2.516) : \nflor.law -mlslaw.172o19#statefarm.com \n \nAttorneys and Staff of Andrew J. 
Gorman & \nAssociates are Employees of the Law Department \nof State Farm Mutual Automobile Insurance \nCompany. \n \n \n\n"
sample output of regex2 (sentence is captured in 2 lines)
[('2022', 'CA-002235 '),
('1', 'Florida Rul es of Civil Procedure, requests the Plaintiff,'),
('1',
'Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral'),
('768',
'Florida Statute s, for medical bills alleged to have been incurred as'),...]
sample output of regex1 (not capturing full sentence)
['1. Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral ',
'2. Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral ',
'3. Please admit that Plaintiff, JOSHUA PINK , received benefits under the Personal ',
'2 4. Please admit that Plaintiff, JOSHUA PINK , received benefits under the Personal ',
'5. Please admit that Plaintiff, JOSHUA PINK , received benefits under the medical ',....]
code:
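import re
from PyPDF2 import PdfReader

def read_pdf(name):
    # PdfReader takes a path or stream; its second argument is `strict`, not a file mode
    reader = PdfReader(name)
    text = ""
    for page in reader.pages:
        text += page.extract_text() + "\n"
    #regex1 = r'(^[0-9].*)'
    regex2 = r'([\d]+).+?([a-zA-Z].+).'
    pat = re.compile(regex2, re.M)  # compile the regex that is actually defined above
    extracted_text = pat.findall(text)
    return text,extracted_text
text,pdf1 = read_pdf(names[0])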
I'll provide an answer to go over a couple of different patterns you can use to approach text items like that. Let's say you have a text that is structured like this:
test_str = """
Some preamble.
 1. Very
long
sentence.
 2. One-line sentence.
 3. Another
longer sentence.
A new paragraph.
"""
First scenario: you want to match items that begin with a number followed by a period at the beginning of a line (with optional leading space) and end with a period at the end of a line - irrespective of how many characters it takes, but as few as possible. That's what your question reads like. One pattern that describes this is ^[ \t]*\d+\.[\s\S]*?\.$. The heavy lifting here is done by [\s\S]*? which is a lazy class that just matches any character (by including all spaces and all non-spaces) as few times as possible.
regex1 = re.compile(r"^[ \t]*\d+\.[\s\S]*?\.$", re.MULTILINE)
print(re.findall(regex1, test_str))
Which returns:
[' 1. Very\nlong\nsentence.', ' 2. One-line sentence.', ' 3. Another\nlonger sentence.']
If you want to exclude leading space, you could add a capturing group ^[ \t]*(\d+\.[\s\S]*?\.)$ in which case findall() will only return the captured part. In Python:
regex2 = re.compile(r"^[ \t]*(\d+\.[\s\S]*?\.)$", re.MULTILINE)
print(re.findall(regex2, test_str))
Which returns:
['1. Very\nlong\nsentence.', '2. One-line sentence.', '3. Another\nlonger sentence.']
First scenario, alternative expression: after the leading number, express the match in terms of lines; always get the first line and add every following line as long as the preceding line does not end in a period: ^[ \t]*(\d+\..*(?:[^.]$\r?\n.*)*\.)$. This will be faster than the lazy class in the first example and returns the same as with regex2.
regex3 = re.compile(r"^[ \t]*(\d+\..*(?:[^.]$\r?\n.*)*\.)$", re.MULTILINE)
print(re.findall(regex3, test_str))
Second scenario: we don't care what the sentence(s) end in. Just get complete items, which we'll interpret as the leading number followed by all lines that do not start with another leading number or an entirely new paragraph: ^[ \t]*(\d+\..+$(?:\r?\n(?![ \t]*\d+\.|A new).*)*).
This makes use of a negative lookahead (?![ \t]*\d+\.|A new) to prevent matching lines that start either with a new item number or some non-item text and allows more control over what kind of lines may constitute an item. Return values are the same.
regex4 = re.compile(r"^[ \t]*(\d+\..+$(?:\r?\n(?![ \t]*\d+\.|A new).*)*)", re.MULTILINE)
print(re.findall(regex4, test_str))
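For text extracted from a PDF, each matched item still contains the original line breaks. A small variation on the second scenario, sketched below, collapses them so that every numbered item comes back as a single line. The names `ITEM` and `numbered_items` are made up for this example, and the pattern assumes that an item number is always followed by a space or tab; the last item will also swallow any unnumbered text that follows it.

```python
import re

# An item starts with "<number>. " at the beginning of a line and continues
# over every following line that does not itself start a new item.
ITEM = re.compile(
    r"^[ \t]*(\d+\.[ \t].*(?:\r?\n(?![ \t]*\d+\.[ \t]).*)*)",
    re.MULTILINE,
)

def numbered_items(text):
    """Return each numbered item as one line with whitespace collapsed."""
    return [" ".join(item.split()) for item in ITEM.findall(text)]

sample = (
    "Preamble.\n"
    "1. Very \nlong\nsentence. \n"
    "2. One-line sentence.\n"
    " 3. Another\nlonger sentence."
)

print(numbered_items(sample))
# ['1. Very long sentence.', '2. One-line sentence.', '3. Another longer sentence.']
```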
If you want to match sentences followed by a dot, you might use:
\b\d+\.[^\S\n][^.]*(?:\.(?=\S)[^.]*)*\.
Explanation
\b A word boundary to prevent a partial word match
\d+\.[^\S\n] Match 1+ digits, a dot and a space
[^.]*(?:\.(?=\S)[^.]*)* Optionally match any character except for dots, and then only match the dot when there is a non whitespace character following.
\. Match a dot
See a regex demo.
A pattern with more punctuation characters:
\b\d+\.[^\S\n][^.!?]*(?:[.!?](?=\S)[^.!?]*)*[.!?]
See another regex demo.
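As a rough illustration of how the first pattern behaves from Python, the sketch below runs it over a short made-up excerpt in the style of the extracted PDF text and then joins the wrapped lines. The sample text and the variable names are assumptions for this example only.

```python
import re

# A dot only ends the sentence when it is NOT followed by a non-whitespace
# character, so "§768.76" stays inside the match.
SENTENCE = re.compile(r"\b\d+\.[^\S\n][^.]*(?:\.(?=\S)[^.]*)*\.")

text = (
    "requests the Plaintiff admit that each statement is true: \n"
    "1. Please admit that Plaintiff received benefits from a collateral \n"
    "source, as defined by §768.76, Florida Statutes. \n"
    "2. Please admit that Plaintiff was partially responsible. \n"
)

# Collapse the line breaks inside each match so every item is one line.
sentences = [" ".join(m.split()) for m in SENTENCE.findall(text)]
for s in sentences:
    print(s)
```

Since the match is allowed to run across newlines, a wrapped item is returned whole; abbreviations such as "St. " followed by a space would still end the match early.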
Try this:
(\d+\.\s)(.|\n)*?(?=\d+\.\s|\z|\.\s)
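This will match from any number followed by a period and a space to the end of the sentence (period followed by a space) or until the next number followed by a period and a space or the end of the string. Note that Python's re module spells the end-of-string anchor \Z; older Python versions reject \z as a bad escape, so use \Z there.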
See example here
I recommend using the Punkt sentence tokenizer (from NLTK) or any other NLP package of your choice, as writing a general-purpose regex to detect sentences can be very tricky unless you have a strictly defined pattern with limited scope. For example, if you only take numbered sentences, then the following regex might work: "\d\.(.)+[a-z]\."gmi
I have a string like this,
my_str ='·in this match, dated may 1, 2013 (the "the match") is between brooklyn centenniel, resident of detroit, michigan ("champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").'
Now, I want to extract the current champion and the underdog using the keywords champion and underdog.
What is really challenging here is that both contenders' names appear before the keyword in parentheses. I want to use a regular expression to extract the information.
Following is what I did:
champion = re.findall(r'("champion"[^.]*.)', my_str)
print(champion)
>> ['"champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").']
underdog = re.findall(r'("underdog"[^.]*.)', my_str)
print(underdog)
>>['"underdog").']
However, I need the results, champion as:
brooklyn centenniel, resident of detroit, michigan
and the underdog as:
kamil kubaru, the challenger from alexandria, virginia
How can I do this using a regular expression? (I have been searching for a way to go back a couple of words from the keyword to get the result I want, but no luck yet.) Any help or suggestion would be appreciated.
You can use named capture groups to capture the desired results:
between\s+(?P<champion>.*?)\s+\("champion"\)\s+and\s+(?P<underdog>.*?)\s+\("underdog"\)
between\s+(?P<champion>.*?)\s+\("champion"\) matches the chunk from between to ("champion") and puts the desired portion in between into the named capture group champion
After that, \s+and\s+(?P<underdog>.*?)\s+\("underdog"\) matches the chunk up to ("underdog") and again puts the desired portion into the named capture group underdog
Example:
In [26]: my_str ='·in this match, dated may 1, 2013 (the "the match") is between brooklyn centenniel, resident of detroit, michigan ("champion") and kamil kubaru, the challenger from alexandria, virginia
...: ("underdog").'
In [27]: out = re.search(r'between\s+(?P<champion>.*?)\s+\("champion"\)\s+and\s+(?P<underdog>.*?)\s+\("underdog"\)', my_str)
In [28]: out.groupdict()
Out[28]:
{'champion': 'brooklyn centenniel, resident of detroit, michigan',
'underdog': 'kamil kubaru, the challenger from alexandria, virginia'}
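Outside an interactive session it helps to guard against the case where the sentence does not have this shape, because re.search() then returns None. Here is a minimal sketch of that; the function name `extract_parties` is hypothetical, and re.DOTALL is an added assumption so that a sentence wrapped over several lines still matches.

```python
import re

PARTIES = re.compile(
    r'between\s+(?P<champion>.*?)\s+\("champion"\)'
    r'\s+and\s+(?P<underdog>.*?)\s+\("underdog"\)',
    re.DOTALL,  # assumption: allow the sentence to span several lines
)

def extract_parties(text):
    """Return {'champion': ..., 'underdog': ...} or None when nothing matches."""
    match = PARTIES.search(text)
    return match.groupdict() if match else None

my_str = ('·in this match, dated may 1, 2013 (the "the match") is between '
          'brooklyn centenniel, resident of detroit, michigan ("champion") and '
          'kamil kubaru, the challenger from alexandria, virginia ("underdog").')

print(extract_parties(my_str))
print(extract_parties("no contenders mentioned here"))  # None
```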
There will be a better answer than this, and I don't know regex at all, but I'm bored, so here's my 2 cents.
Here's how I would go about it:
words = my_str.split()
index = words.index('("champion")')
champion = words[index - 6:index]
champion = " ".join(champion)
for the underdog, you will have to change the 6 to a 7, and '("champion")' to '("underdog").'
Not sure if this will solve your problem, but for this particular string, this worked when I tested it.
You could also use str.strip() to remove punctuation if that trailing period on underdog is a problem.
I am trying to remove \n and \t that show up in data scraped from a webpage.
I have used the strip() function, however it doesn't seem to work for some reason.
My output still shows up with all the \ns and \ts.
Here's my code :
import urllib.request
from bs4 import BeautifulSoup
import sys
all_comments = []
max_comments = 10
base_url = 'https://www.mygov.in/'
next_page = base_url + '/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/'
while next_page and len(all_comments) < max_comments :
    response = urllib.request.urlopen(next_page)
    srcode = response.read()
    soup = BeautifulSoup(srcode, "html.parser")
    all_comments_div=soup.find_all('div', class_="comment_body");
    for div in all_comments_div:
        data = div.find('p').text
        data = data.strip(' \t\n')#actual comment content
        data=''.join([ i for i in data if ord(i) < 128 ])
        all_comments.append(data)
    #getting the link of the stream for more comments
    next_page = soup.find('li', class_='pager-next first last')
    if next_page :
        next_page = base_url + next_page.find('a').get('href')
print('comments: {}'.format(len(all_comments)))
print(all_comments)
And here's the output I'm getting:
comments: 10
["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession, it will create a crisis in medical field.In foreign no body can dare to manhandle a doctor, nurse, ambulance worker else he will be behind bars for 14 years.", 'Hello\n Sir.... Mera AK idea hai Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE\n1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE...\n TAHNKYOU SIR..', 'Respect sir,\nI am Hindi teacher in one of the cbse school of Nagpur city.My question is that in 9th and10th STD. Why the subject HINDI is not compulsory. 
In the present pattern English language is Mandatory for students to learn but Our National Language HINDI is not .\nSir I request to update the pattern such that the Language hindi should be mandatory for the students of 9th and 10th.', 'Sir\nsuggestions AADHAR BASE SYSTEM\n1.Cash Less Education PAN India Centralised System\n2.Cash Less HEALTH POLICY for All & Centralised Rate MRP system\n3.All Private & Govt Hospitals must be CASH LESS\n4.All Toll Booth,Parking Etc CASHLESS Compulsory\n5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL\n6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT Mentioned\n7.Municipal Corporations/ZP must be CASH Less System\nAffordable Min Tax Housing\nCancel TDS', 'SIR KINDLY LOOK INTO MARITIME SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR.\nJAI HIND', ' ? ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 1.1 .2017 ?', ' 9 / ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1 .2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952, PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641 ', ' ,\n \n, ,\n\t Central Government and Central Autonomous Bodies pensioners/ family pensioners ? ']
strip() only removes characters (whitespace by default, or the ones you pass in) from the two ends of a string. To remove characters inside a string you need to use either replace or re.sub.
So change:
data = data.strip(' \t\n')
To:
import re
data = re.sub(r'[\t\n ]+', ' ', data).strip()
To remove the \t and \n characters.
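Use replace instead of strip (the examples below use the literal text '/n' and '/t' as stand-ins; real newline and tab characters are written '\n' and '\t'):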
div = "/n blablabla /t blablabla"
div = div.replace('/n', '')
div = div.replace('/t','')
print(div)
Result:
blablabla blablabla
Some explanation:
strip doesn't work in your case because it only removes the specified characters from the beginning or end of a string; it can't remove them from the middle.
Some examples:
div = "/nblablabla blablabla"
div = div.strip('/n')
#div = div.replace('/t','')
print(div)
Result:
blablabla blablabla
Character anywhere between start and end will not be removed:
div = "blablabla /n blablabla"
div = div.strip('/n')
#div = div.replace('/t','')
print(div)
result:
blablabla /n blablabla
You can split your text (which removes all whitespace, including tabs and newlines) and join the fragments back together, using a single space as the "glue":
data = " ".join(data.split())
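To make the difference concrete, here is a small side-by-side sketch of the three approaches mentioned in this thread. The sample string `raw` is made up for illustration.

```python
import re

raw = ' \n\tHello\n Sir....\tthanks \n'

# strip() only trims the two ends; the inner \n and \t survive.
stripped = raw.strip(' \t\n')

# split() with no arguments splits on any run of whitespace, so joining
# the pieces with one space collapses every \n and \t.
collapsed_split = ' '.join(raw.split())

# The regex version replaces each run of tab/newline/space with one space,
# then trims the single space that may remain at either end.
collapsed_regex = re.sub(r'[\t\n ]+', ' ', raw).strip()

print(repr(stripped))         # 'Hello\n Sir....\tthanks'
print(repr(collapsed_split))  # 'Hello Sir.... thanks'
print(repr(collapsed_regex))  # 'Hello Sir.... thanks'
```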
As others have mentioned, strip only removes characters from the start and end of a string. To remove specific characters anywhere in it (\t and \n in your case), you need something else.
With regex (re) it's easily possible. Specify the pattern (the characters you need to replace). The method you need is sub (substitute):
import re
data = re.sub(r'[\t\n ]+', ' ', data)
re.sub(<the pattern to replace>, <what to replace it with>, <the string>) - above we have set the pattern to [\t\n ]+ , where the + means one or more and [ ] specifies the character class.
To handle sub and strip in single statement:
data = re.sub(r'[\t\n ]+', ' ', data).strip()
The data: with \t and \n
["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession, it will create a crisis in medical field.In foreign no body can dare to manhandle a doctor, nurse, ambulance worker else he will be behind bars for 14 years.", 'Hello\n Sir.... Mera AK idea hai Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE\n1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE...\n TAHNKYOU SIR..', 'Respect sir,\nI am Hindi teacher in one of the cbse school of Nagpur city.My question is that in 9th and10th STD. Why the subject HINDI is not compulsory. 
In the present pattern English language is Mandatory for students to learn but Our National Language HINDI is not .\nSir I request to update the pattern such that the Language hindi should be mandatory for the students of 9th and 10th.', 'Sir\nsuggestions AADHAR BASE SYSTEM\n1.Cash Less Education PAN India Centralised System\n2.Cash Less HEALTH POLICY for All & Centralised Rate MRP system\n3.All Private & Govt Hospitals must be CASH LESS\n4.All Toll Booth,Parking Etc CASHLESS Compulsory\n5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL\n6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT Mentioned\n7.Municipal Corporations/ZP must be CASH Less System\nAffordable Min Tax Housing\nCancel TDS', 'SIR KINDLY LOOK INTO MARITIME SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR.\nJAI HIND', ' ? ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 1.1 .2017 ?', ' 9 / ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1 .2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952, PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641 ', ' ,\n \n, ,\n\t Central Government and Central Autonomous Bodies pensioners/ family pensioners ? ']
Test Run:
import re
data = ["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession, it will create a crisis in medical field.In foreign no body can dare to manhandle a doctor, nurse, ambulance worker else he will be behind bars for 14 years.", 'Hello\n Sir.... Mera AK idea hai Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE\n1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE...\n TAHNKYOU SIR..', 'Respect sir,\nI am Hindi teacher in one of the cbse school of Nagpur city.My question is that in 9th and10th STD. Why the subject HINDI is not compulsory. 
In the present pattern English language is Mandatory for students to learn but Our National Language HINDI is not .\nSir I request to update the pattern such that the Language hindi should be mandatory for the students of 9th and 10th.', 'Sir\nsuggestions AADHAR BASE SYSTEM\n1.Cash Less Education PAN India Centralised System\n2.Cash Less HEALTH POLICY for All & Centralised Rate MRP system\n3.All Private & Govt Hospitals must be CASH LESS\n4.All Toll Booth,Parking Etc CASHLESS Compulsory\n5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL\n6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT Mentioned\n7.Municipal Corporations/ZP must be CASH Less System\nAffordable Min Tax Housing\nCancel TDS', 'SIR KINDLY LOOK INTO MARITIME SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR.\nJAI HIND', ' ? ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 1.1 .2017 ?', ' 9 / ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1 .2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952, PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641 ', ' ,\n \n, ,\n\t Central Government and Central Autonomous Bodies pensioners/ family pensioners ? ']
data_out = []
for s in data:
    data_out.append(re.sub(r'[\t\n ]+', ' ', s).strip())
The output:
["Sir my humble submission is that please ask public not to man handle
doctors because they work in a very delicate situation, to save a
patient is not always in his hand. The incidents of manhandling
doctors is increasing day by day and it's becoming very difficult to
work in these situatons. Majority are not Opting for medical
profession, it will create a crisis in medical field.In foreign no
body can dare to manhandle a doctor, nurse, ambulance worker else he
will be behind bars for 14 years.", 'Hello Sir.... Mera AK idea hai
Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni
ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to
usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB
LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY
CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE
NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE 1. SAB LOG TRAFIC
STRIKLY FOLLOW KARENEGE... TAHNKYOU SIR..', 'Respect sir, I am Hindi
teacher in one of the cbse school of Nagpur city.My question is that
in 9th and10th STD. Why the subject HINDI is not compulsory. In the
present pattern English language is Mandatory for students to learn
but Our National Language HINDI is not . Sir I request to update the
pattern such that the Language hindi should be mandatory for the
students of 9th and 10th.', 'Sir suggestions AADHAR BASE SYSTEM 1.Cash
Less Education PAN India Centralised System 2.Cash Less HEALTH POLICY
for All & Centralised Rate MRP system 3.All Private & Govt Hospitals
must be CASH LESS 4.All Toll Booth,Parking Etc CASHLESS Compulsory
5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL 6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT
Mentioned 7.Municipal Corporations/ZP must be CASH Less System
Affordable Min Tax Housing Cancel TDS', 'SIR KINDLY LOOK INTO MARITIME
SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY
CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO
PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR. JAI HIND', '?', '9
Central Government and Central Autonomous Bodies pensioners/ family
pensioners 1 2016 , 1.1 .2017 ?', '9 /', '9 Central Government and
Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1
.2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952,
PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641', ', , , Central
Government and Central Autonomous Bodies pensioners/ family pensioners
?']
// using javascript
function removeLineBreaks(str = '') {
  return str.replace(/[\r\n\t]+/gm, '');
}
const data = removeLineBreaks('content \n some \t more \t content');
// data -> 'content  some  more  content' (the spaces around each removed character remain)
I'm new to Python. I need to retrieve a list of matches from a text.
For example, my text below is an email thread.
I need to extract every To, From, Sent, Subject and body from the thread.
The result should be a list per field, for example for From and To:
From(1) = Crandall, Sean
From(2) = Nettelton, Marcus
To(1)= Crandall, Sean; Badeer, Robert
To(2)= Meredith, Kevin
and likewise for Sent, Subject, etc.
"-----Original Message-----
From: Crandall, Sean
Sent: Wednesday, May 23, 2001 2:56 PM
To: Meredith, Kevin
Subject: RE: Spreads and Product long desc.
Kevin,
Is the SP and NP language in the spread language the same language we use when we transact SP15 or NP15 on eol?
-----Original Message-----
From: Meredith, Kevin
Sent: Wednesday, May 23, 2001 11:16 AM
To: Crandall, Sean; Badeer, Robert
Subject: FW: Spreads and Product long desc."
You can use re.findall() for this, see: https://docs.python.org/2/library/re.html#re.findall. E.g.
re.findall("From: (.*) ", input_string);
would return a list of the From-names (['Crandall, Sean', 'Meredith, Kevin']), assuming the whitespace around the names is always the same (the pattern expects a trailing space after each name).
If you want to get fancy, you could do several searches in the same expression: E.g.
re.findall("From: (.*) \nSent: (.*)", input_string);
would return [('Crandall, Sean', 'Wednesday, May 23, 2001 2:56 PM'), ('Meredith, Kevin', 'Wednesday, May 23, 2001 11:16 AM')]
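If the trailing whitespace is not reliable, anchoring on the field labels with re.MULTILINE avoids depending on it. The sketch below splits the thread on the "-----Original Message-----" separator and collects the four header fields of each message into a dict. The names `FIELD` and `parse_thread` are illustrative, and the sketch assumes that no body line starts with one of the labels.

```python
import re

# One header field per line: label, colon, value (surrounding blanks dropped).
FIELD = re.compile(r'^(From|Sent|To|Subject):[ \t]*(.*?)[ \t]*$', re.MULTILINE)

def parse_thread(thread):
    """Return one {field: value} dict per message in the thread."""
    messages = []
    for chunk in thread.split('-----Original Message-----'):
        fields = dict(FIELD.findall(chunk))
        if fields:  # skip the empty chunk before the first separator
            messages.append(fields)
    return messages

thread = """-----Original Message-----
From: Crandall, Sean
Sent: Wednesday, May 23, 2001 2:56 PM
To: Meredith, Kevin
Subject: RE: Spreads and Product long desc.
Kevin,
Is the SP and NP language in the spread language the same?
-----Original Message-----
From: Meredith, Kevin
Sent: Wednesday, May 23, 2001 11:16 AM
To: Crandall, Sean; Badeer, Robert
Subject: FW: Spreads and Product long desc."""

messages = parse_thread(thread)
for number, message in enumerate(messages, start=1):
    print('From({}) = {}'.format(number, message['From']))
    print('To({}) = {}'.format(number, message['To']))
```

Keeping the fields of one message together in a dict makes it harder for the From and To lists to drift out of step when a message lacks one of the headers.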
If you don't know how to use regex, and since your problem is not that tough, you may consider using the split() and replace() functions.
Here are some lines of code that might be a good start:
mails = """-----Original Message-----
From: Crandall, Sean
Sent: Wednesday, May 23, 2001 2:56 PM
To: Meredith, Kevin
Subject: RE: Spreads and Product long desc.
Kevin,
Is the SP and NP language in the spread language the same language we use when we transact SP15 or NP15 on eol?
-----Original Message-----
From: Meredith, Kevin
Sent: Wednesday, May 23, 2001 11:16 AM
To: Crandall, Sean; Badeer, Robert
Subject: FW: Spreads and Product long desc."""
mails_list = mails.split("-----Original Message-----\n")
mails_from = []
mails_sent = []
mails_to = []
mails_subject = []
mails_body = []
for mail in mails_list:
    if not mail:
        continue
    inter = mail.split("From: ")[1].split("\nSent: ")
    mails_from.append(inter[0])
    inter = inter[1].split("\nTo: ")
    mails_sent.append(inter[0])
    inter = inter[1].split("\nSubject: ")
    mails_to.append(inter[0])
    inter = inter[1].split("\n")
    mails_subject.append(inter[0])
    # everything after the subject line is the body
    mails_body.append("\n".join(inter[1:]))
Note how this only uses really basic concepts.
Here are some points that you might need to consider:
Try it yourself; you might need some adjustments.
With this approach the parsing is quite fragile: the format of the mails must match exactly.
There might be some whitespace that you want to remove, for example with the replace() method.
Hi, I have this string in Python:
'Every Wednesday and Friday, this market is perfect for lunch! Nestled in the Minna St. tunnel (at 5th St.), this location is great for escaping the fog or rain. Check out live music every Friday.\r\n\r\nLocation: 5th St. # Minna St.\r\nTime: 11:00am-2:00pm\r\n\r\nVendors:\r\nKasa Indian\r\nFiveten Burger\r\nHiyaaa\r\nThe Rib Whip\r\nMayo & Mustard\r\n\r\n\r\nCATERING NEEDS? Have OtG cater your next event! Get started by visiting offthegridsf.com/catering.'
I need to extract the following:
Location: 5th St. # Minna St.
Time: 11:00am-2:00pm
Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard
I tried to do this by using:
val = desc.split("\r\n")
and then val[2] gives the location, val[3] gives the time and val[6:11] gives the vendors. But I am sure there is a nicer, more efficient way to do this.
Any help will be highly appreciated.
If your input is always going to be formatted in exactly this way, using str.split() is preferable. If you want something slightly more resilient, here's a regex approach, using re.VERBOSE and re.DOTALL:
import re
desc_match = re.search(r'''(?sx)
(?P<loc>Location:.+?)[\n\r]
(?P<time>Time:.+?)[\n\r]
(?P<vends>Vendors:.+?)(?:\n\r?){2}''', desc)
if desc_match:
    for gname in ['loc', 'time', 'vends']:
        print(desc_match.group(gname))
Given your definition of desc, this prints out:
Location: 5th St. # Minna St.
Time: 11:00am-2:00pm
Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard
Efficiency really doesn't matter here because the time is going to be negligible either way (don't optimize unless there is a bottleneck.) And again, this is only "nicer" if it works more often than your solution using str.split() - that is, if there are any possible input strings for which your solution does not produce the correct result.
If you only want the values, just move the prefixes outside of the group definitions (a group is defined by (?P<group_name>...))
r'''(?sx)
Location: \s* (?P<loc>.+?) [\n\r]
Time: \s* (?P<time>.+?) [\n\r]
Vendors: \s* (?P<vends>.+?) (?:\n\r?){2}'''
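Building on the values-only pattern, the sketch below (Python 3) turns the match into a small dict with the vendors as a list. The function name `parse_market`, the dict keys, and the slightly adjusted pattern (using \r?\n for line ends and \s* to skip blank lines between fields) are assumptions made for this example.

```python
import re

PATTERN = re.compile(r'''(?sx)
    Location: \s* (?P<loc>.+?) \r?\n \s*
    Time: \s* (?P<time>.+?) \r?\n \s*
    Vendors: \s* (?P<vends>.+?) (?:\r?\n){2}''')

def parse_market(desc):
    """Return location, time and a list of vendors, or None if not found."""
    match = PATTERN.search(desc)
    if match is None:
        return None
    return {
        'location': match.group('loc').strip(),
        'time': match.group('time').strip(),
        # one vendor per line; splitlines() copes with \r\n and \n
        'vendors': [v.strip() for v in match.group('vends').splitlines() if v.strip()],
    }

desc = ('Every Wednesday and Friday, this market is perfect for lunch! '
        'Check out live music every Friday.\r\n\r\n'
        'Location: 5th St. # Minna St.\r\nTime: 11:00am-2:00pm\r\n\r\n'
        'Vendors:\r\nKasa Indian\r\nFiveten Burger\r\nHiyaaa\r\n'
        'The Rib Whip\r\nMayo & Mustard\r\n\r\n\r\n'
        'CATERING NEEDS? Have OtG cater your next event!')

result = parse_market(desc)
print(result['location'])  # 5th St. # Minna St.
print(result['time'])      # 11:00am-2:00pm
print(result['vendors'])
```

Like the original pattern, this relies on the vendor list being followed by an empty line; a description that ends right after the last vendor would not match.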
NLNL = "\r\n\r\n"
parts = s.split(NLNL)
result = NLNL.join(parts[1:3])
print(result)
which gives
Location: 5th St. # Minna St.
Time: 11:00am-2:00pm
Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard