I'm trying to use med7 and negspacy with spaCy, but they each seem to need a separate version of spaCy. How can I use both in the same script?
I'm using en_core_med7_lg to extract disease, drug, and other entities from text:
import spacy
import scispacy
from spacy import displacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from negspacy.negation import Negex
med7 = spacy.load("en_core_med7_lg")
# create distinct colours for labels
col_dict = {}
seven_colours = ['#e6194B', '#3cb44b', '#ffe119', '#ffd8b1', '#f58231', '#f032e6', '#42d4f4']
for label, colour in zip(med7.pipe_labels['ner'], seven_colours):
    col_dict[label] = colour
options = {'ents': med7.pipe_labels['ner'], 'colors':col_dict}
negex = Negex(
    med7,
    name="negex",
    ent_types=["DISEASE", "NEG_ENTITY"],
    neg_termset={
        'pseudo_negations': ['no further', 'not able to be', 'not certain if', 'not certain whether', 'not necessarily',
                             'without any further', 'without difficulty', 'without further', 'might not', 'not only',
                             'no increase', 'no significant change', 'no change', 'no definite change', 'not extend', 'not cause', 'neither', 'nor'],
        'preceding_negations': ['absence of', 'declined', 'denied', 'denies', 'denying', 'no sign of', 'no signs of', 'not',
                                'not demonstrate', 'symptoms atypical', 'doubt', 'negative for', 'no', 'versus', 'without',
                                "doesn't", 'doesnt', "don't", 'dont', "didn't", 'didnt', "wasn't", 'wasnt', "weren't", 'werent',
                                "isn't", 'isnt', "aren't", 'arent', 'cannot', "can't", 'cant', "couldn't", 'couldnt', 'never', 'neither', 'nor'],
        'following_negations': ['declined', 'unlikely', 'was not', 'were not', "wasn't", 'wasnt', "weren't", 'werent', 'neither', 'nor'],
        'termination': ['although', 'apart from', 'as there are', 'aside from', 'but', 'except', 'however', 'involving', 'nevertheless', 'still', 'though', 'which', 'yet', 'neither', 'nor']
    },
    extension_name='',
    chunk_prefix=[]
)
med7.add_pipe("negex", config={"ent_types":["DISEASE", "NEG_ENTITY"]})
text = """77 yo M pt here to discuss Heart rate and BP in office is 192/80 HR is 102 pt states at 3:15pm BP was 170/79 and HR was 114 Patient presented with wife C/o HTN, elevated HR, and Right flank pain Reported wife gave Lasix 20 mg BID in fear of water retention Pmhx CKD4 Sees Nephrology, last OV 5/2020, follow up scheduled 8/2020 Reported HR elevated 112 at home Reported unable to afford clobetasol cream to treat psoriasis on scalp and elbows PROBLEM LIST: Patient Active Problem List Diagnosis Diabetes mellitus Aortic atherosclerosis HTN (hypertension) Diabetic neuropathy Microalbuminuria Diabetic peripheral vascular disorder Diabetic nephropathy associated with type 2 diabetes mellitus CKD (chronic kidney disease) stage 4, GFR 15-29 ml/min Homocysteinemia Refill clinic medication management patient Anemia, chronic renal failure, stage 4 (severe) Chronic systolic congestive heart failure Anxiety Hyperparathyroidism due to renal insufficiency Hypertension associated with type 2 diabetes mellitus Moderate episode of recurrent major depressive disorder PVD (peripheral vascular disease) History of prostate cancer RUQ pain Psoriasis PAST MEDICAL HISTORY: Past Medical History: Diagnosis Date BPPV (benign paroxysmal positional vertigo) 1/9/2019 Bradycardia 6/11/2019 CKD (chronic kidney disease) stage 3, GFR 30-59 ml/min 1/24/2019 GFR on 9/14/2018: 28 GFR on 11/13/2018: 34 GFR on 1/14/2019: 34 GFR on 05/21/2019 :34 CKD (chronic kidney disease) stage 4, GFR 15-29 ml/min 9/24/2018 Hospital discharge follow-up 12/9/2019 Prostate cancer 6/18/2018 Refill clinic medication management patient 5/13/2019 PAST SURGICAL HISTORY: Past Surgical History: Procedure Laterality Date ANGIOPLASTY 08/02/2018 L leg ANGIOPLASTY 06/13/2018 R leg colonoscopy 05/26/2015 Internal hemorrhoids, otherwise normal egd 05/26/2015 normal, mild chronic gastritis FAMILY HISTORY: Family History Problem Relation Name Age of Onset Cancer Son testiular Heart Disease Sister Stroke Brother SOCIAL HISTORY: 
Social History Socioeconomic History Marital status: Married Spouse name: Not on file Number of children: 3 Years of education: Not on file Highest education level: Not on file Occupational History Not on file Social Needs Financial resource strain: Not on file Food insecurity: Worry: Not on file Inability: Not on file Transportation needs: Medical: Not on file Non-medical: Not on file Tobacco Use Smoking status: Never Smoker Smokeless tobacco: Never Used Substance and Sexual Activity Alcohol use: Yes Comment: Occasionally Drug use: No Sexual activity: Not on file Lifestyle Physical activity: Days per week: Not on file Minutes per session: Not on file Stress: Not on file Relationships Social connections: Talks on phone: Not on file Gets together: Not on file Attends religious service: Not on file Active member of club or organization: Not on file Attends meetings of clubs or organizations: Not on file Relationship status: Not on file Intimate partner violence: Fear of current or ex partner: Not on file Emotionally abused: Not on file Physically abused: Not on file Forced sexual activity: Not on file Other Topics Concern Military Service Not Asked Blood Transfusions Not Asked Caffeine Concern Yes Occupational Exposure Not Asked Hobby Hazards Not Asked Sleep Concern Not Asked Stress Concern Not Asked Weight Concern Not Asked Special Diet Not Asked Back Care Not Asked Exercises Regularly No Bike Helmet Use Not Asked Seat Belt Use Not Asked Performs Self-Exams Not Asked Social History Narrative Not on file Social History Tobacco Use Smoking Status Never Smoker Smokeless Tobacco Never Used Social History Substance and Sexual Activity Alcohol Use Yes Comment: Occasionally Social History Substance and Sexual Activity Drug Use No Immunization History Administered Date(s) Administered Influenza Vaccine (High Dose) >=65 Years 12/04/2018 Influenza Vaccine (Unspecified) 10/12/2012 Influenza Vaccine >=6 Months 10/12/2012 Pneumococcal 13 Vaccine (PREVNAR-13) 12/10/2015 
Pneumococcal 23 Vaccine (PNEUMOVAX-23) 11/04/2009, 02/07/2014, 11/12/2016 Td 11/04/2009 Tdap 11/04/2009 CURRENT MEDICATIONS: Current Outpatient Medications on File Prior to Visit Medication Sig Dispense Refill cilostazol (PLETAL) 100 MG tablet TAKE 1 TABLET BY MOUTH TWICE A DAY 180 tablet 3 [DISCONTINUED] citalopram (CELEXA) 10 MG tablet Take 2 tablets (20 mg) by mouth daily. 90 tablet 3 [DISCONTINUED] citalopram (CELEXA) 20 MG tablet TAKE 1 TABLET BY MOUTH ONCE DAILY FOR DEPRESSION 90 tablet 3 [DISCONTINUED] clobetasol propionate (TEMOVATE) 0.05 % cream Apply to affected area twice a day 45 g 1 cloNIDine (CATAPRES) 0.1 MG tablet Take 1 tablet (0.1 mg) by mouth 3 times daily as needed (if SBP > 160). 270 tablet 3 clopidogrel (PLAVIX) 75 MG tablet TAKE 1 TABLET BY MOUTH EVERY DAY 90 tablet 3 [DISCONTINUED] epoetin alfa (PROCRIT) 10000 UNIT/ML injection Procrit 10,000 unit/mL injection solution felodipine (PLENDIL) 10 MG tablet TAKE 1 TABLET BY MOUTH EVERY DAY 90 tablet 4 furosemide (LASIX) 20 MG tablet Take 1 tablet (20 mg) by mouth daily as needed (edema). 90 tablet 2 glipiZIDE (GLUCOTROL) 10 MG tablet TAKE 1 TABLET BY MOUTH TWICE DAILY FOR DIABETES 180 tablet 0 hydrALAZINE (APRESOLINE) 25 MG tablet TAKE 2 TABLETS BY MOUTH TWICE A DAY 360 tablet 1 [DISCONTINUED] hydrALAZINE (APRESOLINE) 25 MG tablet Take 25 mg by mouth 3 times daily. 4 Insulin Degludec (TRESIBA) 100 UNIT/ML SOLN 15 Units. 15 units [DISCONTINUED] Insulin Degludec 100 UNIT/ML SOPN 15 units daily 9 mL 2 losartan (COZAAR) 50 MG tablet Take 1 tablet (50 mg) by mouth 2 times daily. Hold dose for systolic blood pressure less than 100. 180 tablet 3 Misc. Devices MISC C-pap setting at 9 cm H2O Dx: Sleep Apnea 1 each 0 [DISCONTINUED] prazosin (MINIPRESS) 1 MG capsule Take 1 capsule (1 mg) by mouth 3 times daily. 90 capsule 11 simvastatin (ZOCOR) 20 MG tablet TAKE 1 TABLET BY MOUTH EVERY DAY 90 tablet 3 No current facility-administered medications on file prior to visit. 
Outpatient Medications Marked as Taking for the 7/23/20 encounter (Office Visit) with Diaz, Anarella, NP Medication Sig Dispense Refill Insulin Degludec (TRESIBA) 100 UNIT/ML SOLN 15 Units. 15 units ALLERGIES: Allergies Allergen Reactions Sulfa Drugs Swelling and Cough REVIEW OF SYSTEMS: Review of Systems All other systems reviewed and are negative. refer to HPI PHYSICAL EXAM: 07/23/20 1657 07/23/20 1718 BP: 192/80 160/80 Pulse: 102 86 Resp: 18 Temp: 99 F (37.2 C) SpO2: 96% Body mass index is 27.28 kg/m . Ht Readings from Last 1 Encounters: 07/23/20 5' 6" (1.676 m) Wt Readings from Last 1 Encounters: 07/23/20 76.7 kg (169 lb) Physical Exam Vitals signs and nursing note reviewed. Constitutional: Appearance: Normal appearance. He is well-developed. HENT: Head: Normocephalic and atraumatic. Eyes: Conjunctiva/sclera: Conjunctivae normal. Neck: Musculoskeletal: Normal range of motion. Cardiovascular: Rate and Rhythm: Normal rate and regular rhythm. Pulmonary: Effort: Pulmonary effort is normal. Breath sounds: Normal breath sounds. Abdominal: Palpations: Abdomen is soft. Tenderness: There is no tenderness. There is no right CVA tenderness. Musculoskeletal: Normal range of motion. Skin: General: Skin is warm. Neurological: Mental Status: He is alert and oriented to person, place, and time. Psychiatric: Mood and Affect: Mood normal. Behavior: Behavior normal. Thought Content: Thought content normal. Judgment: Judgment normal. ASSESSMENT & PLAN: Luis was seen today for other. Diagnoses and all orders for this visit: RUQ pain Assessment & Plan: Active Negative abd exam Labs ordered VSS Stay hydrated Cont to monitor Follow up prn results Orders: - CBC w/ Diff Lavender; Future - Comprehensive Metabolic Panel Green; Future Psoriasis Assessment & Plan: Active Ordered Clobetasol solution apply ad Avoid allergens Stay hydrated Cont to monitor Follow up prn Orders: - clobetasol propionate (TMOVATE) 0.05 % solution; Apply 1 mL topically 2 times daily. 
Use a small amount as directed ICD-10-CM ICD-9-CM 1. D71 pain R10.11 789.01 CBC w/ Diff Lavender Comprehensive Metabolic Panel Green 2. Psoriasis L40.9 696.1 clobetasol propionate (TMOVATE) 0.05 % solution """
doc = med7(text)
[(ent.text, ent.label_) for ent in doc.ents]
negated_concepts = []
for e in doc.ents:
    if e._.negex is True:
        negated_concepts.append(e.text.lower())
        print(e.text, e.label_)
print(negated_concepts)
There is no need to instantiate Negex yourself; just add your neg_termset to the pipeline config:
neg_termset = {
    'pseudo_negations': ['no further', 'not able to be', 'not certain if', 'not certain whether', 'not necessarily',
                         'without any further', 'without difficulty', 'without further', 'might not', 'not only',
                         'no increase', 'no significant change', 'no change', 'no definite change', 'not extend', 'not cause', 'neither', 'nor'],
    'preceding_negations': ['absence of', 'declined', 'denied', 'denies', 'denying', 'no sign of', 'no signs of', 'not',
                            'not demonstrate', 'symptoms atypical', 'doubt', 'negative for', 'no', 'versus', 'without',
                            "doesn't", 'doesnt', "don't", 'dont', "didn't", 'didnt', "wasn't", 'wasnt', "weren't", 'werent',
                            "isn't", 'isnt', "aren't", 'arent', 'cannot', "can't", 'cant', "couldn't", 'couldnt', 'never', 'neither', 'nor'],
    'following_negations': ['declined', 'unlikely', 'was not', 'were not', "wasn't", 'wasnt', "weren't", 'werent', 'neither', 'nor'],
    'termination': ['although', 'apart from', 'as there are', 'aside from', 'but', 'except', 'however', 'involving', 'nevertheless', 'still', 'though', 'which', 'yet', 'neither', 'nor']
}
med7.add_pipe("negex", config={"neg_termset": neg_termset, "ent_types":["DISEASE", "NEG_ENTITY"]})
import re
a = """COMPUTATION OF DAMAGES Plaintiff Maurice’s computation of damages to date includes all the above related medical specials, totaling $98,429.00. The Minimally Invasive Hand Institute $49,949.00 Interventional Pain & Spine Institute $1,190.00 Premier Physical Therapy $8,600.00 Clinical Neurology Specialist $3,090.00 Red Rock Surgery Center $34,510.00 DIMOPOULOS INJURY This is the bill of 1 2 3 4 5 6 7 8 DIMOPOULOS INJURY """
word1 = "COMPUTATION OF DAMAGES"
word2 = "DIMOPOULOS INJURY"
result = re.search(word1 + '(.*)' + word2, a)
print(result.group(1))
Required output: Plaintiff Maurice’s computation of damages to date includes all the above related medical specials, totaling $98,429.00. The Minimally Invasive Hand Institute $49,949.00 Interventional Pain & Spine Institute $1,190.00 Premier Physical Therapy $8,600.00 Clinical Neurology Specialist $3,090.00 Red Rock Surgery Center $34,510.00
How do I extract the text up to the first "DIMOPOULOS INJURY" keyword? Is there any solution?
You are very close; just make the group non-greedy by adding ?:
result = re.search(word1+'(.*?)'+word2, a)
The output will be:
"Plaintiff Maurice’s computation of damages to date includes all the above related medical specials, totaling $98,429.00. The Minimally Invasive Hand Institute $49,949.00 Interventional Pain & Spine Institute $1,190.00 Premier Physical Therapy $8,600.00 Clinical Neurology Specialist $3,090.00 Red Rock Surgery Center $34,510.00"
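To see why the ? matters, here is a minimal sketch with an illustrative sample string: greedy (.*) runs to the last occurrence of the end marker, while non-greedy (.*?) stops at the first one.

```python
import re

text = "START one END middle START two END"

# Greedy: (.*) captures everything up to the LAST "END"
greedy = re.search(r"START(.*)END", text).group(1)

# Non-greedy: (.*?) stops at the FIRST "END"
lazy = re.search(r"START(.*?)END", text).group(1)

print(greedy)  # ' one END middle START two '
print(lazy)    # ' one '
```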
I have this string and want to turn it into two arrays: one with the film titles and one with the years. Their positions in the arrays need to correspond with each other. Is there a way to do this?
films = ("""Endless Love (1981), Top Gun (1986), The Color of Money (1986), Rain Man (1988),
Born on the Fourth of July (1989), Interview with the Vampire: The Vampire Chronicles (1994),
Mission: Impossible (1996), Jerry Maguire (1996), The Matrix (1999), Mission: Impossible II (2000),
Vanilla Sky (2001), Cocktail (1988), A Few Good Men (1992), The Firm (1993), Eyes Wide Shut (1999),
Magnolia (1999), Minority Report (2002), Austin Powers in Goldmember (2002), Days of Thunder (1990),
The Powers of Matthew Star (1982), Cold Mountain (2003), The Talented Mr. Ripley (1999),
War of the Worlds (2005), The Oprah Winfrey Show (1986), Far and Away (1992), Taps (1981),
The Last Samurai (2003), Valkyrie (2008), Jack Reacher (2012), Edge of Tomorrow (2014),
Enemy of the State (1998), Mission: Impossible III (2006), Crimson Tide (1995), Reign Over Me (2007),
Batman Forever (1995), Batman Begins (2005), The Simpsons (1989), The Simpsons: Brother from the Same Planet (1993),
The Simpsons: When You Dish Upon a Star (1998), End of Days (1999), House of D (2004), The Indian Runner (1991),
Harry & Son (1984), Mission: Impossible - Ghost Protocol (2011), Aladdin (1992), Pacific Rim (2013),
Oblivion (2013), Knight and Day (2010),
""")
First split the input string on commas to generate a list, then use comprehensions to get the titles and years as separate lists. Note this needs import re, and the trailing comma in the string leaves an empty final element that must be filtered out:
import re

films_list = [x for x in re.split(r',\s*', films) if x.strip()]
titles = [re.split(r'\s+(?=\(\d+\))', x)[0] for x in films_list]
years = [re.split(r'\s+(?=\(\d+\))', x)[1] for x in films_list]
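A quick, self-contained sanity check of the split idea on a short sample; the sample string is illustrative, and a capture-group variant is used here to grab the title and year in one pass:

```python
import re

sample = "Top Gun (1986), Rain Man (1988), The Matrix (1999)"

films_list = re.split(r',\s*', sample)
# Capture the title and the 4-digit year separately for each entry
pairs = [re.match(r'(.*?)\s*\((\d{4})\)', x).groups() for x in films_list]
titles = [t for t, y in pairs]
years = [y for t, y in pairs]

print(titles)  # ['Top Gun', 'Rain Man', 'The Matrix']
print(years)   # ['1986', '1988', '1999']
```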
Tim's answer works well. I'll offer an alternative for anyone who would like to solve the problem without regex.
a = films.split(",")
years = []
for i in a:
    years.append(i[i.find("(")+1:i.find(")")])
The same approach can be applied for titles.
You can do something like this (no imports, extra modules, or regex complexity needed):
delimiter = ", "
movies_with_year = films.split(delimiter)
movies = []
years = []
for movie_with_year in movies_with_year:
    movie = movie_with_year[:-6]
    year = movie_with_year[-6:].replace("(", "").replace(")", "")
    movies.append(movie)
    years.append(year)
This script will result in something like this:
movies : ['Endless Love ', ...]
years : ['1981', ...]
You should clear all newlines (\n) and use try/except to get past the last-element issue.
films = ...  # the same films string from the question above
movies = []
years = []
for item in films.replace("\n", "").split("),"):
    try:
        movies.append(item.split(" (")[0])
        years.append(item.split(" (")[-1])
    except:
        ...
from os import listdir
from os.path import isfile, join
from datasets import load_dataset
from transformers import BertTokenizer
test_files = [join('./test/', f) for f in listdir('./test') if isfile(join('./test', f))]
dataset = load_dataset('json', data_files={"test": test_files}, cache_dir="./.cache_dir")
After running the code, here is the output of dataset["test"]["abstract"]:
[['eleven politicians from 7 parties made comments in letter to a newspaper .',
"said dpp alison saunders had ` damaged public confidence ' in justice .",
'ms saunders ruled lord janner unfit to stand trial over child abuse claims .',
'the cps has pursued at least 19 suspected paedophiles with dementia .'],
['an increasing number of surveys claim to reveal what makes us happiest .',
'but are these generic lists really of any use to us ?',
'janet street-porter makes her own list - of things making her unhappy !'],
["author of ` into the wild ' spoke to five rape victims in missoula , montana .",
"` missoula : rape and the justice system in a college town ' was released april 21 .",
"three of five victims profiled in the book sat down with abc 's nightline wednesday night .",
'kelsey belnap , allison huguet and hillary mclaughlin said they had been raped by university of montana football '
'players .',
"huguet and mclaughlin 's attacker , beau donaldson , pleaded guilty to rape in 2012 and was sentenced to 10 years .",
'belnap claimed four players gang-raped her in 2010 , but prosecutors never charged them citing lack of probable '
'cause .',
'mr krakauer wrote book after realizing close friend was a rape victim .'],
['tesco announced a record annual loss of £ 6.38 billion yesterday .',
'drop in sales , one-off costs and pensions blamed for financial loss .',
'supermarket giant now under pressure to close 200 stores nationwide .',
'here , retail industry veterans , plus mail writers , identify what went wrong .'],
...,
['snp leader said alex salmond did not field questions over his family .',
"said she was not ` moaning ' but also attacked criticism of women 's looks .",
'she made the remarks in latest programme profiling the main party leaders .',
'ms sturgeon also revealed her tv habits and recent image makeover .',
'she said she relaxed by eating steak and chips on a saturday night .']]
I would like each sentence to be tokenized with this structure. How can I do that using Hugging Face? I think I have to flatten each list of the above list to get a list of strings and then tokenize each string.
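Flattening a list of lists can be done with a nested comprehension before batch-tokenizing. A minimal sketch: the abstracts sample below is illustrative, and the tokenizer call is shown as a comment since it assumes transformers is installed and the model is downloaded:

```python
abstracts = [
    ['first sentence .', 'second sentence .'],
    ['third sentence .'],
]

# Flatten the list of lists into one flat list of sentence strings
sentences = [sent for doc in abstracts for sent in doc]

print(sentences)
# ['first sentence .', 'second sentence .', 'third sentence .']

# Then tokenize the flat list in one batch, e.g.:
# tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# encodings = tokenizer(sentences, padding=True, truncation=True)
```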
I am trying to remove \n and \t that show up in data scraped from a webpage.
I have used the strip() function, however it doesn't seem to work for some reason.
My output still shows up with all the \ns and \ts.
Here's my code :
import urllib.request
from bs4 import BeautifulSoup
import sys
all_comments = []
max_comments = 10
base_url = 'https://www.mygov.in/'
next_page = base_url + '/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/'
while next_page and len(all_comments) < max_comments:
    response = urllib.request.urlopen(next_page)
    srcode = response.read()
    soup = BeautifulSoup(srcode, "html.parser")
    all_comments_div = soup.find_all('div', class_="comment_body")
    for div in all_comments_div:
        data = div.find('p').text
        data = data.strip(' \t\n')  # actual comment content
        data = ''.join([i for i in data if ord(i) < 128])
        all_comments.append(data)
    # getting the link of the stream for more comments
    next_page = soup.find('li', class_='pager-next first last')
    if next_page:
        next_page = base_url + next_page.find('a').get('href')
print('comments: {}'.format(len(all_comments)))
print(all_comments)
And here's the output I'm getting:
comments: 10
["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession, it will create a crisis in medical field.In foreign no body can dare to manhandle a doctor, nurse, ambulance worker else he will be behind bars for 14 years.", 'Hello\n Sir.... Mera AK idea hai Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE\n1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE...\n TAHNKYOU SIR..', 'Respect sir,\nI am Hindi teacher in one of the cbse school of Nagpur city.My question is that in 9th and10th STD. Why the subject HINDI is not compulsory. 
In the present pattern English language is Mandatory for students to learn but Our National Language HINDI is not .\nSir I request to update the pattern such that the Language hindi should be mandatory for the students of 9th and 10th.', 'Sir\nsuggestions AADHAR BASE SYSTEM\n1.Cash Less Education PAN India Centralised System\n2.Cash Less HEALTH POLICY for All & Centralised Rate MRP system\n3.All Private & Govt Hospitals must be CASH LESS\n4.All Toll Booth,Parking Etc CASHLESS Compulsory\n5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL\n6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT Mentioned\n7.Municipal Corporations/ZP must be CASH Less System\nAffordable Min Tax Housing\nCancel TDS', 'SIR KINDLY LOOK INTO MARITIME SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR.\nJAI HIND', ' ? ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 1.1 .2017 ?', ' 9 / ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1 .2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952, PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641 ', ' ,\n \n, ,\n\t Central Government and Central Autonomous Bodies pensioners/ family pensioners ? ']
strip() only removes characters from the ends of a string. To remove characters inside a string you need to use either replace or re.sub.
So change:
data = data.strip(' \t\n')
To:
import re
data = re.sub(r'[\t\n ]+', ' ', data).strip()
To remove the \t and \n characters.
Use replace instead of strip:
div = "\n blablabla \t blablabla"
div = div.replace('\n', '')
div = div.replace('\t', '')
print(div)
Result:
 blablabla  blablabla
Some explanation:
strip doesn't work in your case because it only removes the specified characters from the beginning or end of a string; it can't remove them from the middle.
Some examples:
div = "\nblablabla blablabla"
div = div.strip('\n')
print(div)
Result:
blablabla blablabla
A character anywhere between the start and the end will not be removed:
div = "blablabla \n blablabla"
div = div.strip('\n')
print(div)
Result (the newline in the middle is untouched, so the output spans two lines):
blablabla 
 blablabla
You can split you text (which will remove all white spaces, including TABs) and join the fragments back again, using only one space as the "glue":
data = " ".join(data.split())
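For example (the sample string is illustrative): split() with no arguments breaks the string on any run of whitespace, including \n and \t, and join glues the pieces back with single spaces.

```python
data = "Hello\n Sir....\tmore\n\ncontent"

# split() discards all whitespace runs; " ".join re-glues with single spaces
cleaned = " ".join(data.split())

print(cleaned)  # 'Hello Sir.... more content'
```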
As others have mentioned, strip only removes characters from the start and end of a string, not specific characters (\t and \n in your case) from the middle.
With regex (re) it's easily possible. Specify a pattern to match the characters you need to replace; the method you need is sub (substitute):
import re
data = re.sub(r'[\t\n ]+', ' ', data)
re.sub(pattern, replacement, string): above, the pattern [\t\n ]+ uses + for "one or more" and [ ] to specify a character class.
To handle sub and strip in a single statement:
data = re.sub(r'[\t\n ]+', ' ', data).strip()
Test Run:
import re
data = ["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession, it will create a crisis in medical field.In foreign no body can dare to manhandle a doctor, nurse, ambulance worker else he will be behind bars for 14 years.", 'Hello\n Sir.... Mera AK idea hai Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE\n1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE...\n TAHNKYOU SIR..', 'Respect sir,\nI am Hindi teacher in one of the cbse school of Nagpur city.My question is that in 9th and10th STD. Why the subject HINDI is not compulsory. 
In the present pattern English language is Mandatory for students to learn but Our National Language HINDI is not .\nSir I request to update the pattern such that the Language hindi should be mandatory for the students of 9th and 10th.', 'Sir\nsuggestions AADHAR BASE SYSTEM\n1.Cash Less Education PAN India Centralised System\n2.Cash Less HEALTH POLICY for All & Centralised Rate MRP system\n3.All Private & Govt Hospitals must be CASH LESS\n4.All Toll Booth,Parking Etc CASHLESS Compulsory\n5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL\n6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT Mentioned\n7.Municipal Corporations/ZP must be CASH Less System\nAffordable Min Tax Housing\nCancel TDS', 'SIR KINDLY LOOK INTO MARITIME SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR.\nJAI HIND', ' ? ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 1.1 .2017 ?', ' 9 / ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1 .2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952, PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641 ', ' ,\n \n, ,\n\t Central Government and Central Autonomous Bodies pensioners/ family pensioners ? ']
data_out = []
for s in data:
    data_out.append(re.sub(r'[\t\n ]+', ' ', s).strip())
The output:
["Sir my humble submission is that please ask public not to man handle
doctors because they work in a very delicate situation, to save a
patient is not always in his hand. The incidents of manhandling
doctors is increasing day by day and it's becoming very difficult to
work in these situatons. Majority are not Opting for medical
profession, it will create a crisis in medical field.In foreign no
body can dare to manhandle a doctor, nurse, ambulance worker else he
will be behind bars for 14 years.", 'Hello Sir.... Mera AK idea hai
Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni
ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to
usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB
LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY
CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE
NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE 1. SAB LOG TRAFIC
STRIKLY FOLLOW KARENEGE... TAHNKYOU SIR..', 'Respect sir, I am Hindi
teacher in one of the cbse school of Nagpur city.My question is that
in 9th and10th STD. Why the subject HINDI is not compulsory. In the
present pattern English language is Mandatory for students to learn
but Our National Language HINDI is not . Sir I request to update the
pattern such that the Language hindi should be mandatory for the
students of 9th and 10th.', 'Sir suggestions AADHAR BASE SYSTEM 1.Cash
Less Education PAN India Centralised System 2.Cash Less HEALTH POLICY
for All & Centralised Rate MRP system 3.All Private & Govt Hospitals
must be CASH LESS 4.All Toll Booth,Parking Etc CASHLESS Compulsory
5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL 6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT
Mentioned 7.Municipal Corporations/ZP must be CASH Less System
Affordable Min Tax Housing Cancel TDS', 'SIR KINDLY LOOK INTO MARITIME
SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY
CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO
PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR. JAI HIND', '?', '9
Central Government and Central Autonomous Bodies pensioners/ family
pensioners 1 2016 , 1.1 .2017 ?', '9 /', '9 Central Government and
Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1
.2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952,
PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641', ', , , Central
Government and Central Autonomous Bodies pensioners/ family pensioners
?']
// using JavaScript: collapse all whitespace runs (including \r, \n, \t) to single spaces
function removeLineBreaks(str = '') {
  return str.replace(/[\r\n\t ]+/gm, ' ').trim();
}
const data = removeLineBreaks('content \n some \t more \t content');
// data -> 'content some more content'
I have documents in the tuple format ("topic", "doc"):
('grain',
'Thailand exported 84,960 tonnes of rice in the week ended February 24, '
'689,038 tonnes of rice between the beginning of January and February 24, '
'up from 556,874 tonnes during the same period last year. It has '
'commitments to export another 658,999 tonnes this year. REUTER '),
('soybean',
'The Tokyo Grain Exchange said it will raise the margin requirement on '
'the spot and nearby month for U.S. And Chinese soybeans and red beans, '
'effective March 2. Spot April U.S. Soybean contracts will increase to '
'90,000 yen per 15 tonne lot from 70,000 now. Other months will stay '
'will be set at 70,000 from March 2. The new margin for red bean spot '),.....
I've taken only 10 topics for the classification task.
Now my problem is: how do I classify anything apart from these 10 topics as "NA" (not from the 10 topics)? I'm using naive Bayes right now. Is there any other classifier better suited to "NA" topics? If yes, how do we set a threshold for "NA"?
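One common approach, sketched here without committing to any particular library: take the classifier's per-class probabilities (e.g. from scikit-learn's predict_proba for naive Bayes) and return "NA" whenever the top class falls below a confidence threshold. The probability dicts and the 0.5 threshold below are illustrative:

```python
def classify_with_na(probs, threshold=0.5):
    """Return the most likely topic, or 'NA' if the classifier
    is not confident enough about any of the known topics."""
    best_topic = max(probs, key=probs.get)
    if probs[best_topic] < threshold:
        return "NA"
    return best_topic

# Illustrative per-topic probabilities for two documents
confident = {"grain": 0.8, "soybean": 0.15, "wheat": 0.05}
uncertain = {"grain": 0.2, "soybean": 0.15, "wheat": 0.12}

print(classify_with_na(confident))  # grain
print(classify_with_na(uncertain))  # NA
```

The threshold has to be tuned on held-out data; naive Bayes probabilities tend to be poorly calibrated, so a calibrated classifier (or calibration wrapper) usually gives a more reliable cutoff.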