Classifying the "not applicable" category in document classification - python

I have documents in the tuple format ("topic", "doc"):
('grain',
'Thailand exported 84,960 tonnes of rice in the week ended February 24, '
'689,038 tonnes of rice between the beginning of January and February 24, '
'up from 556,874 tonnes during the same period last year. It has '
'commitments to export another 658,999 tonnes this year. REUTER '),
('soybean',
'The Tokyo Grain Exchange said it will raise the margin requirement on '
'the spot and nearby month for U.S. And Chinese soybeans and red beans, '
'effective March 2. Spot April U.S. Soybean contracts will increase to '
'90,000 yen per 15 tonne lot from 70,000 now. Other months will stay '
'will be set at 70,000 from March 2. The new margin for red bean spot '),.....
I've taken only 10 topics for the classification task.
Now my problem: how do I classify anything outside these 10 topics as "NA" (not one of the 10 topics)? I'm using naive Bayes right now. Is there another classifier better suited to handling "NA" topics? If so, how do we set a threshold for "NA"?
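One common approach is a reject option: keep the 10-class classifier, but return "NA" whenever the top predicted probability falls below a cutoff. Here is a minimal sketch assuming scikit-learn (the question does not say which naive Bayes implementation is used; the tiny training set and the 0.7 default threshold are illustrative only):

```python
# Sketch: reject option on top of MultinomialNB (assumes scikit-learn).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the Reuters-style (topic, doc) tuples in the question
train_docs = [
    "Thailand exported tonnes of rice this week",
    "rice exports up from last year",
    "soybean contracts will increase margin requirements",
]
train_topics = ["grain", "grain", "soybean"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_docs)
clf = MultinomialNB().fit(X, train_topics)

def classify_with_na(doc, threshold=0.7):
    """Return the best topic, or 'NA' if the top probability is below threshold."""
    probs = clf.predict_proba(vectorizer.transform([doc]))[0]
    best = int(np.argmax(probs))
    return clf.classes_[best] if probs[best] >= threshold else "NA"
```

Be aware that naive Bayes probabilities are notoriously overconfident, so the threshold should be tuned on a held-out set that actually contains off-topic documents; calibrating the classifier (e.g. `CalibratedClassifierCV`) or using a one-class/outlier detector for the "NA" decision are alternatives worth trying.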

Python - Extract paragraph from text [closed]

This is my first time posting so I apologize if I omit necessary information!
I am trying to extract a paragraph of text in Python that always follows a line starting with "Item 5.02". There is a line space between "Item 5.02" and the paragraph that I am trying to extract. I need the text between the "Item 5.02" line and the next section (in this case the next section starts at "Item 9.01"). Please let me know if I need to clarify anything. I have been tinkering with regular expressions but haven't had much luck. I'm pretty new to them. Thanks for the help!
I would like to extract the following:
On September 29, 2015, AAR CORP. (the Company) announced that Michael J. Sharp was elected Chief Financial Officer of the Company on September 28, 2015, with such election to be effective on October 5, 2015. Mr. Sharp will replace John C. Fortson, who is resigning effective October 5, 2015 to take a Chief Financial Officer position with a non-aviation company. Mr. Sharp, 53, is a 19-veteran of the Company and will continue to serve as the Companys Vice President and Chief Accounting Officer. Mr. Sharp previously served as interim Chief Financial Officer of the Company from October 2012 to July 2013. Prior to joining the Company, Mr. Sharp worked in management positions with Kraft Foods and KPMG, LLP. As Chief Financial Officer of the Company, Mr. Sharp will receive the following compensation for the fiscal year ending May 31, 2016: an annual base salary of $400,000; an annual cash bonus opportunity equal to 70% of his annual base salary if certain performance goals are met at a target level; and total stock awards valued at $500,000 on the date of grant. Mr. Sharp continues to be eligible for other benefits provided to executive officers of the Company as described in the Companys proxy statement filed with the Securities and Exchange Commission on August 28, 2015. Mr. Sharp has a severance and change in control agreement with the Company (see Exhibit 10.10 to the Companys annual report on Form 10-K for the fiscal year ended May 31, 2001). A copy of the Companys press release announcing Mr. Sharps appointment is attached hereto as Exhibit 99.1 and is incorporated herein by reference.
From the below text:
Item 5.02 Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers; Compensatory Arrangement of Certain Officers.
On September 29, 2015, AAR CORP. (the Company) announced that Michael J. Sharp was elected Chief Financial Officer of the Company on September 28, 2015, with such election to be effective on October 5, 2015. Mr. Sharp will replace John C. Fortson, who is resigning effective October 5, 2015 to take a Chief Financial Officer position with a non-aviation company.
Mr. Sharp, 53, is a 19-veteran of the Company and will continue to serve as the Companys Vice President and Chief Accounting Officer. Mr. Sharp previously served as interim Chief Financial Officer of the Company from October 2012 to July 2013. Prior to joining the Company, Mr. Sharp worked in management positions with Kraft Foods and KPMG, LLP.
As Chief Financial Officer of the Company, Mr. Sharp will receive the following compensation for the fiscal year ending May 31, 2016: an annual base salary of $400,000; an annual cash bonus opportunity equal to 70% of his annual base salary if certain performance goals are met at a target level; and total stock awards valued at $500,000 on the date of grant. Mr. Sharp continues to be eligible for other benefits provided to executive officers of the Company as described in the Companys proxy statement filed with the Securities and Exchange Commission on August 28, 2015. Mr. Sharp has a severance and change in control agreement with the Company (see Exhibit 10.10 to the Companys annual report on Form 10-K for the fiscal year ended May 31, 2001).
A copy of the Companys press release announcing Mr. Sharps appointment is attached hereto as Exhibit 99.1 and is incorporated herein by reference.
Item 9.01 Financial Statements and Exhibits.
You could split it by double newlines, find the piece which contains Item 5.02, then take the next one:
def extractPassage(text):
    lines = text.split("\n\n")
    for i, line in enumerate(lines):
        if line.startswith("Item 5.02"):
            return lines[i+1]
    raise Exception("No line found starting with Item 5.02")
I can't tell from the post formatting if there are any tabs or spaces before Item 5.02 on that line. If so, include them in the startswith call.
To get all text between 5.02 and 9.01, we can append lines to a string, starting after the one starting with 5.02, and ending when we see 9.01:
def extractPassage(text):
    lines = text.split("\n\n")
    output = ""
    for i, line in enumerate(lines):
        if line.startswith("Item 5.02"):
            j = i + 1
            take_line = lines[j]
            while not take_line.startswith("Item 9.01"):
                output += take_line
                j += 1
                take_line = lines[j]
            return output
    raise Exception("No line found starting with Item 5.02")
The following regex will match the word Item followed by a space, one digit, a period, and two more digits.
import re
re.split('Item \d\.\d\d', text)
To explain the regex: \d will match any number, and then to match a period we have to escape the period using \..
If you would rather accept either 1 or 2 digits after the period, you would use the regex 'Item \d\.\d{1,2}'
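To show the split in action, here is a short sketch on an abbreviated stand-in for the filing text (the real text is much longer; only the structure matters):

```python
import re

# Abbreviated stand-in for the filing text in the question
text = ("Item 5.02 Departure of Directors or Certain Officers.\n\n"
        "On September 29, 2015, AAR CORP. announced a new CFO.\n\n"
        "Item 9.01 Financial Statements and Exhibits.")

# Split at every section header: "Item", a space, one digit, a period, two digits
sections = re.split(r'Item \d\.\d\d', text)
# sections[0] is whatever precedes the first header (here, an empty string);
# sections[1] is everything between "Item 5.02" and "Item 9.01"
```

With a full filing you would pick out the piece that follows the "Item 5.02" header (here, `sections[1]`) and strip off the heading sentence if needed.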

Tokenizing sentences a special way

from os import listdir
from os.path import isfile, join
from datasets import load_dataset
from transformers import BertTokenizer
test_files = [join('./test/', f) for f in listdir('./test') if isfile(join('./test', f))]
dataset = load_dataset('json', data_files={"test": test_files}, cache_dir="./.cache_dir")
After running the code, here is the output of dataset["test"]["abstract"]:
[['eleven politicians from 7 parties made comments in letter to a newspaper .',
"said dpp alison saunders had ` damaged public confidence ' in justice .",
'ms saunders ruled lord janner unfit to stand trial over child abuse claims .',
'the cps has pursued at least 19 suspected paedophiles with dementia .'],
['an increasing number of surveys claim to reveal what makes us happiest .',
'but are these generic lists really of any use to us ?',
'janet street-porter makes her own list - of things making her unhappy !'],
["author of ` into the wild ' spoke to five rape victims in missoula , montana .",
"` missoula : rape and the justice system in a college town ' was released april 21 .",
"three of five victims profiled in the book sat down with abc 's nightline wednesday night .",
'kelsey belnap , allison huguet and hillary mclaughlin said they had been raped by university of montana football '
'players .',
"huguet and mclaughlin 's attacker , beau donaldson , pleaded guilty to rape in 2012 and was sentenced to 10 years .",
'belnap claimed four players gang-raped her in 2010 , but prosecutors never charged them citing lack of probable '
'cause .',
'mr krakauer wrote book after realizing close friend was a rape victim .'],
['tesco announced a record annual loss of £ 6.38 billion yesterday .',
'drop in sales , one-off costs and pensions blamed for financial loss .',
'supermarket giant now under pressure to close 200 stores nationwide .',
'here , retail industry veterans , plus mail writers , identify what went wrong .'],
...,
['snp leader said alex salmond did not field questions over his family .',
"said she was not ` moaning ' but also attacked criticism of women 's looks .",
'she made the remarks in latest programme profiling the main party leaders .',
'ms sturgeon also revealed her tv habits and recent image makeover .',
'she said she relaxed by eating steak and chips on a saturday night .']]
I would like each of these sentences to be tokenized. How can I do this using Hugging Face? I think I have to flatten each list of the above list to get a flat list of strings and then tokenize each string.
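Yes, flattening first is the straightforward route; `itertools.chain` does it in one step, and the tokenizer can then consume the whole flat list in a single batched call. A minimal sketch (the data below is a shortened stand-in for `dataset["test"]["abstract"]`, and the tokenizer call is left commented out because it downloads a model):

```python
from itertools import chain
# from transformers import BertTokenizer  # as imported in the question

# Shortened stand-in for dataset["test"]["abstract"]
abstracts = [
    ['eleven politicians from 7 parties made comments in letter to a newspaper .',
     'the cps has pursued at least 19 suspected paedophiles with dementia .'],
    ['an increasing number of surveys claim to reveal what makes us happiest .'],
]

# Flatten the list of lists into one flat list of sentence strings
sentences = list(chain.from_iterable(abstracts))

# Then tokenize the whole flat list in one batched call, e.g.:
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# encoded = tokenizer(sentences, padding=True, truncation=True)
```

If you need to keep track of which sentence came from which abstract, record the sublist lengths before flattening so you can regroup the encodings afterwards.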

How can I use Med7 and Negspacy simultaneously?

I'm trying to use med7 and negspacy, but they each need a different version of spaCy. How can I use both in the same script?
I'm using the en_core_med7_lg model to get disease, drug, and other entities out of text:
import spacy
import scispacy
from spacy import displacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from negspacy.negation import Negex
med7 = spacy.load("en_core_med7_lg")
# create distinct colours for labels
col_dict = {}
seven_colours = ['#e6194B', '#3cb44b', '#ffe119', '#ffd8b1', '#f58231', '#f032e6', '#42d4f4']
for label, colour in zip(med7.pipe_labels['ner'], seven_colours):
    col_dict[label] = colour
options = {'ents': med7.pipe_labels['ner'], 'colors':col_dict}
negex = Negex(med7, ent_types=["DISEASE", "NEG_ENTITY"], name="negex",
              neg_termset={
                  'pseudo_negations': ['no further', 'not able to be', 'not certain if', 'not certain whether',
                                       'not necessarily', 'without any further', 'without difficulty',
                                       'without further', 'might not', 'not only', 'no increase',
                                       'no significant change', 'no change', 'no definite change',
                                       'not extend', 'not cause', 'neither', 'nor'],
                  'preceding_negations': ['absence of', 'declined', 'denied', 'denies', 'denying', 'no sign of',
                                          'no signs of', 'not', 'not demonstrate', 'symptoms atypical', 'doubt',
                                          'negative for', 'no', 'versus', 'without', "doesn't", 'doesnt',
                                          "don't", 'dont', "didn't", 'didnt', "wasn't", 'wasnt', "weren't",
                                          'werent', "isn't", 'isnt', "aren't", 'arent', 'cannot', "can't",
                                          'cant', "couldn't", 'couldnt', 'never', 'neither', 'nor'],
                  'following_negations': ['declined', 'unlikely', 'was not', 'were not', "wasn't", 'wasnt',
                                          "weren't", 'werent', 'neither', 'nor'],
                  'termination': ['although', 'apart from', 'as there are', 'aside from', 'but', 'except',
                                  'however', 'involving', 'nevertheless', 'still', 'though', 'which', 'yet',
                                  'neither', 'nor']},
              extension_name='',
              chunk_prefix=[])
med7.add_pipe("negex", config={"ent_types":["DISEASE", "NEG_ENTITY"]})
text = """77 yo M pt here to discuss Heart rate and BP in office is 192/80 HR is 102 pt states at 3:15pm BP was 170/79 and HR was 114 Patient presented with wife C/o HTN, elevated HR, and Right flank pain Reported wife gave Lasix 20 mg BID in fear of water retention Pmhx CKD4 Sees Nephrology, last OV 5/2020, follow up scheduled 8/2020 Reported HR elevated 112 at home Reported unable to afford clobetasol cream to treat psoriasis on scalp and elbows PROBLEM LIST: Patient Active Problem List Diagnosis Diabetes mellitus Aortic atherosclerosis HTN (hypertension) Diabetic neuropathy Microalbuminuria Diabetic peripheral vascular disorder Diabetic nephropathy associated with type 2 diabetes mellitus CKD (chronic kidney disease) stage 4, GFR 15-29 ml/min Homocysteinemia Refill clinic medication management patient Anemia, chronic renal failure, stage 4 (severe) Chronic systolic congestive heart failure Anxiety Hyperparathyroidism due to renal insufficiency Hypertension associated with type 2 diabetes mellitus Moderate episode of recurrent major depressive disorder PVD (peripheral vascular disease) History of prostate cancer RUQ pain Psoriasis PAST MEDICAL HISTORY: Past Medical History: Diagnosis Date BPPV (benign paroxysmal positional vertigo) 1/9/2019 Bradycardia 6/11/2019 CKD (chronic kidney disease) stage 3, GFR 30-59 ml/min 1/24/2019 GFR on 9/14/2018: 28 GFR on 11/13/2018: 34 GFR on 1/14/2019: 34 GFR on 05/21/2019 :34 CKD (chronic kidney disease) stage 4, GFR 15-29 ml/min 9/24/2018 Hospital discharge follow-up 12/9/2019 Prostate cancer 6/18/2018 Refill clinic medication management patient 5/13/2019 PAST SURGICAL HISTORY: Past Surgical History: Procedure Laterality Date ANGIOPLASTY 08/02/2018 L leg ANGIOPLASTY 06/13/2018 R leg colonoscopy 05/26/2015 Internal hemorrhoids, otherwise normal egd 05/26/2015 normal, mild chronic gastritis FAMILY HISTORY: Family History Problem Relation Name Age of Onset Cancer Son testiular Heart Disease Sister Stroke Brother SOCIAL HISTORY: 
Social History Socioeconomic History Marital status: Married Spouse name: Not on file Number of children: 3 Years of education: Not on file Highest education level: Not on file Occupational History Not on file Social Needs Financial resource strain: Not on file Food insecurity: Worry: Not on file Inability: Not on file Transportation needs: Medical: Not on file Non-medical: Not on file Tobacco Use Smoking status: Never Smoker Smokeless tobacco: Never Used Substance and Sexual Activity Alcohol use: Yes Comment: Occasionally Drug use: No Sexual activity: Not on file Lifestyle Physical activity: Days per week: Not on file Minutes per session: Not on file Stress: Not on file Relationships Social connections: Talks on phone: Not on file Gets together: Not on file Attends religious service: Not on file Active member of club or organization: Not on file Attends meetings of clubs or organizations: Not on file Relationship status: Not on file Intimate partner violence: Fear of current or ex partner: Not on file Emotionally abused: Not on file Physically abused: Not on file Forced sexual activity: Not on file Other Topics Concern Military Service Not Asked Blood Transfusions Not Asked Caffeine Concern Yes Occupational Exposure Not Asked Hobby Hazards Not Asked Sleep Concern Not Asked Stress Concern Not Asked Weight Concern Not Asked Special Diet Not Asked Back Care Not Asked Exercises Regularly No Bike Helmet Use Not Asked Seat Belt Use Not Asked Performs Self-Exams Not Asked Social History Narrative Not on file Social History Tobacco Use Smoking Status Never Smoker Smokeless Tobacco Never Used Social History Substance and Sexual Activity Alcohol Use Yes Comment: Occasionally Social History Substance and Sexual Activity Drug Use No Immunization History Administered Date(s) Administered Influenza Vaccine (High Dose) >=65 Years 12/04/2018 Influenza Vaccine (Unspecified) 10/12/2012 Influenza Vaccine >=6 Months 10/12/2012 Pneumococcal 13 Vaccine (PREVNAR-13) 12/10/2015 
Pneumococcal 23 Vaccine (PNEUMOVAX-23) 11/04/2009, 02/07/2014, 11/12/2016 Td 11/04/2009 Tdap 11/04/2009 CURRENT MEDICATIONS: Current Outpatient Medications on File Prior to Visit Medication Sig Dispense Refill cilostazol (PLETAL) 100 MG tablet TAKE 1 TABLET BY MOUTH TWICE A DAY 180 tablet 3 [DISCONTINUED] citalopram (CELEXA) 10 MG tablet Take 2 tablets (20 mg) by mouth daily. 90 tablet 3 [DISCONTINUED] citalopram (CELEXA) 20 MG tablet TAKE 1 TABLET BY MOUTH ONCE DAILY FOR DEPRESSION 90 tablet 3 [DISCONTINUED] clobetasol propionate (TEMOVATE) 0.05 % cream Apply to affected area twice a day 45 g 1 cloNIDine (CATAPRES) 0.1 MG tablet Take 1 tablet (0.1 mg) by mouth 3 times daily as needed (if SBP > 160). 270 tablet 3 clopidogrel (PLAVIX) 75 MG tablet TAKE 1 TABLET BY MOUTH EVERY DAY 90 tablet 3 [DISCONTINUED] epoetin alfa (PROCRIT) 10000 UNIT/ML injection Procrit 10,000 unit/mL injection solution felodipine (PLENDIL) 10 MG tablet TAKE 1 TABLET BY MOUTH EVERY DAY 90 tablet 4 furosemide (LASIX) 20 MG tablet Take 1 tablet (20 mg) by mouth daily as needed (edema). 90 tablet 2 glipiZIDE (GLUCOTROL) 10 MG tablet TAKE 1 TABLET BY MOUTH TWICE DAILY FOR DIABETES 180 tablet 0 hydrALAZINE (APRESOLINE) 25 MG tablet TAKE 2 TABLETS BY MOUTH TWICE A DAY 360 tablet 1 [DISCONTINUED] hydrALAZINE (APRESOLINE) 25 MG tablet Take 25 mg by mouth 3 times daily. 4 Insulin Degludec (TRESIBA) 100 UNIT/ML SOLN 15 Units. 15 units [DISCONTINUED] Insulin Degludec 100 UNIT/ML SOPN 15 units daily 9 mL 2 losartan (COZAAR) 50 MG tablet Take 1 tablet (50 mg) by mouth 2 times daily. Hold dose for systolic blood pressure less than 100. 180 tablet 3 Misc. Devices MISC C-pap setting at 9 cm H2O Dx: Sleep Apnea 1 each 0 [DISCONTINUED] prazosin (MINIPRESS) 1 MG capsule Take 1 capsule (1 mg) by mouth 3 times daily. 90 capsule 11 simvastatin (ZOCOR) 20 MG tablet TAKE 1 TABLET BY MOUTH EVERY DAY 90 tablet 3 No current facility-administered medications on file prior to visit. 
Outpatient Medications Marked as Taking for the 7/23/20 encounter (Office Visit) with Diaz, Anarella, NP Medication Sig Dispense Refill Insulin Degludec (TRESIBA) 100 UNIT/ML SOLN 15 Units. 15 units ALLERGIES: Allergies Allergen Reactions Sulfa Drugs Swelling and Cough REVIEW OF SYSTEMS: Review of Systems All other systems reviewed and are negative. refer to HPI PHYSICAL EXAM: 07/23/20 1657 07/23/20 1718 BP: 192/80 160/80 Pulse: 102 86 Resp: 18 Temp: 99 F (37.2 C) SpO2: 96% Body mass index is 27.28 kg/m . Ht Readings from Last 1 Encounters: 07/23/20 5' 6" (1.676 m) Wt Readings from Last 1 Encounters: 07/23/20 76.7 kg (169 lb) Physical Exam Vitals signs and nursing note reviewed. Constitutional: Appearance: Normal appearance. He is well-developed. HENT: Head: Normocephalic and atraumatic. Eyes: Conjunctiva/sclera: Conjunctivae normal. Neck: Musculoskeletal: Normal range of motion. Cardiovascular: Rate and Rhythm: Normal rate and regular rhythm. Pulmonary: Effort: Pulmonary effort is normal. Breath sounds: Normal breath sounds. Abdominal: Palpations: Abdomen is soft. Tenderness: There is no tenderness. There is no right CVA tenderness. Musculoskeletal: Normal range of motion. Skin: General: Skin is warm. Neurological: Mental Status: He is alert and oriented to person, place, and time. Psychiatric: Mood and Affect: Mood normal. Behavior: Behavior normal. Thought Content: Thought content normal. Judgment: Judgment normal. ASSESSMENT & PLAN: Luis was seen today for other. Diagnoses and all orders for this visit: RUQ pain Assessment & Plan: Active Negative abd exam Labs ordered VSS Stay hydrated Cont to monitor Follow up prn results Orders: - CBC w/ Diff Lavender; Future - Comprehensive Metabolic Panel Green; Future Psoriasis Assessment & Plan: Active Ordered Clobetasol solution apply ad Avoid allergens Stay hydrated Cont to monitor Follow up prn Orders: - clobetasol propionate (TMOVATE) 0.05 % solution; Apply 1 mL topically 2 times daily. 
Use a small amount as directed ICD-10-CM ICD-9-CM 1. D71 pain R10.11 789.01 CBC w/ Diff Lavender Comprehensive Metabolic Panel Green 2. Psoriasis L40.9 696.1 clobetasol propionate (TMOVATE) 0.05 % solution """
doc = med7(text)
[(ent.text, ent.label_) for ent in doc.ents]
negated_concepts = []
for e in doc.ents:
    if e._.negex is True:
        negated_concepts.append(e.text.lower())
        print(e.text, e.label_)
print(negated_concepts)
There is no need to construct Negex yourself; just pass your neg_termset in the pipeline config:
neg_termset = {
    'pseudo_negations': ['no further', 'not able to be', 'not certain if', 'not certain whether', 'not necessarily',
                         'without any further', 'without difficulty', 'without further', 'might not', 'not only',
                         'no increase', 'no significant change', 'no change', 'no definite change', 'not extend',
                         'not cause', 'neither', 'nor'],
    'preceding_negations': ['absence of', 'declined', 'denied', 'denies', 'denying', 'no sign of', 'no signs of',
                            'not', 'not demonstrate', 'symptoms atypical', 'doubt', 'negative for', 'no', 'versus',
                            'without', "doesn't", 'doesnt', "don't", 'dont', "didn't", 'didnt', "wasn't", 'wasnt',
                            "weren't", 'werent', "isn't", 'isnt', "aren't", 'arent', 'cannot', "can't", 'cant',
                            "couldn't", 'couldnt', 'never', 'neither', 'nor'],
    'following_negations': ['declined', 'unlikely', 'was not', 'were not', "wasn't", 'wasnt', "weren't", 'werent',
                            'neither', 'nor'],
    'termination': ['although', 'apart from', 'as there are', 'aside from', 'but', 'except', 'however', 'involving',
                    'nevertheless', 'still', 'though', 'which', 'yet', 'neither', 'nor']}
med7.add_pipe("negex", config={"neg_termset": neg_termset, "ent_types":["DISEASE", "NEG_ENTITY"]})

Extract information from part of a URL in Python

I have a list of 200k urls, with the general format of:
http[s]://..../..../the-headline-of-the-article
OR
http[s]://..../..../the-headline-of-the-article/....
The number of / segments before and after the-headline-of-the-article varies.
Here is some sample data:
'http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision',
I want to extract the-headline-of-the-article only.
ie.
call-to-end-affordable-care-act-is-immoral-says-cha-president
global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429
correction-trump-investigations-sater-lawsuit-story
I am sure this is possible, but am relatively new to regex in Python.
In pseudocode, I was thinking:
split everything by /
keep only the chunk that contains -
replace all - with \s
Is this possible in python (I am a python n00b)?
urls = [...]
for url in urls:
    bits = url.split('/')  # Split each url at the '/'
    bits_with_hyphens = [bit.replace('-', ' ') for bit in bits if '-' in bit]  # [1]
    print(bits_with_hyphens)
[1] Note that your algorithm assumes that only one of the fragments after splitting the url will have a hyphen, which is not correct given your examples. So at [1], I'm keeping all the bits that do so.
Output:
['national news', 'call to end affordable care act is immoral says cha president']
['new website puts louisiana art on businesses walls']
['global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429']
['BP General+News', 'female music art to take center stage at swan day in new britain']
['Trump orders Treasury HUD to develop new plan 13721842.php']
['research delivers insight into the global business voip services market during the period 2018 2025']
['why mirza international limited nse 233259149.html']
['indian gaming industry grows in revenues.asp']
['facebook instagram banning pro white 210002719.html']
['press release', 'fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27']
['top firms decry religious exemption bills proposed in texas', 'article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html']
['correction trump investigations sater lawsuit story', 'article_ed20e441 de30 5b57 aafd b1f7d7929f71.html']
['weather channel sued 125 million over death storm chase collision']
PS. I think your algorithm could do with a bit of thought. Problems that I see:
more than one bit might contain a hyphen, where:
both only contain dictionary words (see first and fourth output)
one of them is "clearly" not a headline (see second and third from bottom)
spurious string fragments at the end of the real headline: eg "13721842.php", "revenues.asp", "210002719.html"
Need to substitute in a space for characters other than '/', (see fourth, "General+News")
Here's a slightly different variation which seems to produce good results from the samples you provided.
Out of the parts with dashes, we trim off any trailing hex strings and file name extension; then, we extract the one with the largest number of dashes from each URL, and finally replace the remaining dashes with spaces.
import re

regex = re.compile(r'(-[0-9a-f]+)*(\.[a-z]+)?$', re.IGNORECASE)
for url in urls:
    parts = url.split('/')
    trimmed = [regex.sub('', x) for x in parts if '-' in x]
    longest = sorted(trimmed, key=lambda x: -len(x.split('-')))[0]
    print(longest.replace('-', ' '))
Output:
call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan
research delivers insight into the global business voip services market during the period
why mirza international limited nse
indian gaming industry grows in revenues
facebook instagram banning pro white
fluence receives another aspiraltm bulk order with partner itest in china
top firms decry religious exemption bills proposed in texas
correction trump investigations sater lawsuit story
weather channel sued 125 million over death storm chase collision
My original attempt would clean out the numbers from the end of the URL only after extracting the longest, and it worked for your samples; but trimming off trailing numbers immediately when splitting is probably more robust against variations in these patterns.
Since the URLs do not follow a consistent pattern (the first and third URLs differ from the rest), we can handle the cases with rsplit():
s = ['http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision']
for url in s:
    url = url.replace("-", " ")
    if url.rsplit('/', 1)[1] == '':  # For the 1st and 3rd urls
        if url.rsplit('/', 2)[1].isdigit():  # For the 3rd url
            print(url.rsplit('/', 3)[1])
        else:
            print(url.rsplit('/', 2)[1])
    else:
        print(url.rsplit('/', 1)[1])  # All other urls
OUTPUT:
call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan 13721842.php
research delivers insight into the global business voip services market during the period 2018 2025
why mirza international limited nse 233259149.html
indian gaming industry grows in revenues.asp
facebook instagram banning pro white 210002719.html
fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27
article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html
article_ed20e441 de30 5b57 aafd b1f7d7929f71.html
weather channel sued 125 million over death storm chase collision

Splitting a string with no line breaks into a list of lines with a maximum column count

I have a long string (multiple paragraphs) which I need to split into a list of line strings. The determination of what makes a "line" is based on:
The number of characters in the line is less than or equal to X (where X is a fixed number of columns per line)
OR, there is a newline in the original string (that will force a new "line" to be created).
I know I can do this algorithmically but I was wondering if python has something that can handle this case. It's essentially word-wrapping a string.
And, by the way, the output lines must be broken on word boundaries, not character boundaries.
Here's an example of input and output:
Input:
"Within eight hours of Wilson's outburst, his Democratic opponent, former-Marine Rob Miller, had received nearly 3,000 individual contributions raising approximately $100,000, the Democratic Congressional Campaign Committee said.
Wilson, a conservative Republican who promotes a strong national defense and reining in the size of government, won a special election to the House in 2001, succeeding the late Rep. Floyd Spence, R-S.C. Wilson had worked on Spence's staff on Capitol Hill and also had served as an intern for Sen. Strom Thurmond, R-S.C."
Output:
"Within eight hours of Wilson's outburst, his"
"Democratic opponent, former-Marine Rob Miller,"
" had received nearly 3,000 individual "
"contributions raising approximately $100,000,"
" the Democratic Congressional Campaign Committee"
" said."
""
"Wilson, a conservative Republican who promotes a "
"strong national defense and reining in the size "
"of government, won a special election to the House"
" in 2001, succeeding the late Rep. Floyd Spence, "
"R-S.C. Wilson had worked on Spence's staff on "
"Capitol Hill and also had served as an intern"
" for Sen. Strom Thurmond, R-S.C."
EDIT
What you are looking for is textwrap, but that's only part of the solution, not the complete one. To take newlines into account you need to do this:
from textwrap import wrap
'\n'.join(['\n'.join(wrap(block, width=50)) for block in text.splitlines()])
>>> print('\n'.join(['\n'.join(wrap(block, width=50)) for block in text.splitlines()]))
Within eight hours of Wilson's outburst, his
Democratic opponent, former-Marine Rob Miller, had
received nearly 3,000 individual contributions
raising approximately $100,000, the Democratic
Congressional Campaign Committee said.
Wilson, a conservative Republican who promotes a
strong national defense and reining in the size of
government, won a special election to the House in
2001, succeeding the late Rep. Floyd Spence,
R-S.C. Wilson had worked on Spence's staff on
Capitol Hill and also had served as an intern for
Sen. Strom Thurmond
You probably want to use the textwrap module in the standard library:
http://docs.python.org/library/textwrap.html
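A minimal sketch of textwrap.wrap on the opening of the sample input (width 50; wrapping happens on word boundaries, as the question requires):

```python
import textwrap

# First sentence of the sample input from the question
text = ("Within eight hours of Wilson's outburst, his Democratic opponent, "
        "former-Marine Rob Miller, had received nearly 3,000 individual "
        "contributions raising approximately $100,000.")

# wrap() returns a list of lines, each at most `width` characters long
lines = textwrap.wrap(text, width=50)
for line in lines:
    print(line)
```

textwrap.wrap never exceeds the width unless a single word is itself longer than the width; textwrap.fill does the same but returns one newline-joined string instead of a list.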