How to extract text between two words [duplicate] - python

import re
a = """COMPUTATION OF DAMAGES Plaintiff Maurice’s computation of damages to date includes all the above related medical specials, totaling $98,429.00. The Minimally Invasive Hand Institute $49,949.00 Interventional Pain & Spine Institute $1,190.00 Premier Physical Therapy $8,600.00 Clinical Neurology Specialist $3,090.00 Red Rock Surgery Center $34,510.00 DIMOPOULOS INJURY This is the bill of 1 2 3 4 5 6 7 8 DIMOPOULOS INJURY """
word1 = "COMPUTATION OF DAMAGES"
word2 = "DIMOPOULOS INJURY"
result = re.search(word1 + '(.*)' + word2, a)
print(result.group(1))
Required op: Plaintiff Maurice’s computation of damages to date includes all the above related medical specials, totaling $98,429.00. The Minimally Invasive Hand Institute $49,949.00 Interventional Pain & Spine Institute $1,190.00 Premier Physical Therapy $8,600.00 Clinical Neurology Specialist $3,090.00 Red Rock Surgery Center $34,510.00
How do I extract the text up to the first "DIMOPOULOS INJURY" keyword? Is there any solution?

You are very close; just add ? to make the quantifier non-greedy:
result = re.search(word1 + '(.*?)' + word2, a)
The output will be:
"Plaintiff Maurice’s computation of damages to date includes all the above related medical specials, totaling $98,429.00. The Minimally Invasive Hand Institute $49,949.00 Interventional Pain & Spine Institute $1,190.00 Premier Physical Therapy $8,600.00 Clinical Neurology Specialist $3,090.00 Red Rock Surgery Center $34,510.00"

Related

Why is part of my code ruining the other part?

Hello guys, I'm trying to create a program that counts the total words and the total unique words in a file, but when I run the two parts of the code together only the unique-word counter works, and when I delete the unique-word counter part the normal word counter works fine again.
Here is the full code:
f = open('icuhistory.txt','r')
wordCount = 0
text = f.read()
for line in f:
    lin = line.rstrip()
    wds = line.split()
    wordCount += len(wds)  # this section alone works fine
text = text.lower()  # when I start writing this one the first one will stop working
words = text.split()
words = [word.strip('.,!;()[]') for word in words]
words = [word.replace("'s", '') for word in words]
unique = []
for word in words:
    if word not in unique:
        unique.append(word)
unique.sort()
print("number of words: ", wordCount)
print("number of unique words: ", len(unique))
Here is the content of the file:
in the fall of 1945 just weeks after the end
of world war ii a group of japanese christian educators
initiated a move to establish a university based on christian
principles the foreign missions conference of north america and the
us education mission both visiting japan at the time
gave their wholehearted support conveying this plan to people in
the us amidst the post-war yearning for reconciliation and
world peace americans supported this project with great enthusiasm in
1948 the japan international christian university foundation jicuf was
established in new york to coordinate fund-raising efforts in the
us people in japan also found hope in a
cause dedicated toworld peace organizations firms and individuals made donations
to this ambitious undertaking regardless of their religious orientation anddespite
the often destitute circumstances in the immediate post-war years bank
of japan governor hisato ichimada headed the supporting organization to
lead the national fund raising drive icu has been unique from
its inception with its endowment procured through good will transcending
national borders
on june 15 1949 japanese and north american christian leaders
convened at the gotemba ymca camp to establish international christian
university with the inauguration of the board of trustees and
the board of councillors the founding principles and a fundamental
educational plan were laid down establishing an interdenominational christian university
had been a dream of japanese and american christians for
half a century the gotemba conference had finally realized their
aspirations
in 1950 icu purchased a spacious site in mitaka city
on the outskirts of tokyo with the donations it received
within japan the campus was dedicated on april 29 1952
with the language institute set up in the first year
in march 1953 the japanese ministry of education authorized icu
as an incorporated educational institution the college of liberal arts
opening on april 1 as the first four-year liberal arts
college in japan
the university celebrated its 50th anniversary in 1999 with diverse
events and projects during the commemorative five year period leading to
march 2004 in 2003 the ministry of education culture sports
science and technology selected icu s research and education
for peace security and conviviality for the 21st century center
of excellence program and its liberal arts to nurture
responsible global citizens for the distinctive university education support program
good practice
in 2008 an academic reform was enforced in the college
of liberal arts which replaced the system of six divisions
with a new organization of the division of arts
and sciences and a system of academic majors as of
april 2008 all new students simply start as college of
liberal arts students making their choice of major from 31
areas by the end of their sophomore year students now
have more time to make a decision while they study
diverse subjects through general education and foundation courses mext chose
icu for its fiscal year 2007 distinctive university education support
program educational support for liberal arts to nurture international
learning from academic advising to academic planning in acknowledgement of
the university s efforts for educational improvement in 2010 the
graduate school also conducted a reform and integrated the four
divisions into a new school of arts and sciences
icu is continually working to reconfirm its responsibilities and fulfill
its mission for the changing times
The entire file content appears to be lowercase so it's as easy as this:
result = {}
with open('icuhistory.txt') as icu:
    for word in icu.read().split():
        word = word.strip('.,!;()[]').replace("'s", "")
        result[word] = result.get(word, 0) + 1
print(f'Number of words = {sum(result.values())}')
print(f'Number of unique words = {len(result)}')
Output:
Number of words = 547
Number of unique words = 273
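As a side note (not part of the original answer), collections.Counter from the standard library expresses the same counting in a single pass; a sketch assuming the same file and the same stripping rules:
from collections import Counter

with open('icuhistory.txt') as icu:
    # count every cleaned word; the keys are the unique words
    counts = Counter(word.strip('.,!;()[]').replace("'s", "") for word in icu.read().split())

print(f'Number of words = {sum(counts.values())}')
print(f'Number of unique words = {len(counts)}')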
Take a look at the text = f.read() line. Is it at the right place?
Also, the Python script you pasted does not have consistent indenting. Are you able to clean it up so that it looks just like the original?
Also curious if you have explored the set type in Python? It is a little like a list, but you may find it applicable in your scenario.
Explanation:
Behind files and open stands the concept of streaming; if you are more familiar with iterators, think of f = open('icuhistory.txt','r') as an iterator.
You can go through it only once (unless you tell it to reset).
text = f.read()
will go through it once; afterwards f is positioned at the end of the file.
for line in f:
now tries to continue where f currently is... at the end of the file.
So this loop will try to loop over the 0 lines left at the end.
As there is nothing left to iterate over, it will never enter the for loop body.
Solutions:
You could reset the file with f.seek(0); this tells the file object to go back to the start.
But it would be more efficient to either combine both actions in the loop (more memory-friendly) or work only with the text from text = f.read().
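A minimal sketch of the first option (an illustration, not part of the original answer), rewinding the file so the original line-by-line loop sees the lines again:
f = open('icuhistory.txt', 'r')
text = f.read()   # consumes the whole file; f is now at the end
f.seek(0)         # rewind so the loop below can iterate over the lines again

wordCount = 0
for line in f:
    wordCount += len(line.split())
f.close()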
There's no need to read line by line since you are counting words. Also avoid sorting unless it's needed, as it can be expensive. Converting a list to a set will remove duplicates, and you can chain string methods.
with open('icuhistory.txt','r') as f:
    text = f.read().lower()
words = [word.strip('.,!;()[]').replace("'s", '') for word in text.split()]
unique_words = set(words)
print("number of words: ", len(words))
print("number of unique words: ", len(unique_words))

How to highlight MULTIPLE matching sequences of words in two strings in Python?

I'm using the code below to highlight a single matching sequence. (Just copy-paste it into a new Colab notebook; it'll work perfectly.)
import textwrap
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
from difflib import SequenceMatcher
import nltk
nltk.download('punkt')
print('')
text1 = \
'''
commonly known as the United States (U.S. or US) or America, is a transcontinental country located primarily in North America.
'''
text2 = \
'''
The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a transcontinental country located primarily in North America. It consists of 50 states, a federal district, five major unincorporated territories, nine minor outlying islands,[j] and 326 Indian reservations. It is the third-largest country by both land and total area.[d] The United States shares land borders with Canada to its north and with Mexico to its south. It has maritime borders with the Bahamas, Cuba, Russia, and other nations.[k] With a population of over 331 million,[e] it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city and financial center is New York City.
Paleo-aboriginals migrated from Siberia to the North American mainland at least 12,000 years ago, and advanced cultures began to appear later on. These advanced cultures had almost completely declined by the time European colonists arrived during the 16th century. The United States emerged from the Thirteen British Colonies established along the East Coast when disputes with the British Crown over taxation and political representation led to the American Revolution (1765–1784), which established the nation's independence. In the late 18th century, the U.S. began expanding across North America, gradually obtaining new territories, sometimes through war, frequently displacing Native Americans, and admitting new states. By 1848, the United States spanned the continent from east to west. The controversy surrounding the practice of slavery culminated in the secession of the Confederate States of America, which fought the remaining states of the Union during the American Civil War (1861–1865). With the Union's victory and preservation, slavery was abolished by the Thirteenth Amendment.
'''
temp = SequenceMatcher(None, word_tokenize(text1), word_tokenize(text2))
print(temp.get_matching_blocks())
print('Similarity Score: ', temp.ratio())
print('')
search_length = len(text1)
total_length = len(text2)
matching_blocks = temp.get_matching_blocks()
beginning = matching_blocks[0][0]
start = matching_blocks[0][1]
stop = (matching_blocks[0][1] + matching_blocks[0][2])
end = matching_blocks[1][1]
tokenized = word_tokenize(text2)
before_match = TreebankWordDetokenizer().detokenize(tokenized[beginning:start])
match = TreebankWordDetokenizer().detokenize(tokenized[start:stop])
after_match = TreebankWordDetokenizer().detokenize(tokenized[stop:end])
print(textwrap.fill(before_match + '\x1b[0;30;42m' + match + '\x1b[0m' + after_match, 150))
print('')
print('Percentage Similarity: ' + str(round(((search_length/(total_length + search_length)) * 100), 2)) + '%')
Now when I try highlighting multiple sequences, the code breaks (it doesn't show the full text, and it doesn't highlight the second or any later sequence).
import textwrap
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
from difflib import SequenceMatcher
import nltk
nltk.download('punkt')
print('')
text1 = \
'''
commonly known as the United States (U.S. or US) or America, is a transcontinental country located primarily in North America. North American mainland at least 12,000 years ago, and advanced cultures began to appear later on.
'''
text2 = \
'''
The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a transcontinental country located primarily in North America. It consists of 50 states, a federal district, five major unincorporated territories, nine minor outlying islands,[j] and 326 Indian reservations. It is the third-largest country by both land and total area.[d] The United States shares land borders with Canada to its north and with Mexico to its south. It has maritime borders with the Bahamas, Cuba, Russia, and other nations.[k] With a population of over 331 million,[e] it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city and financial center is New York City.
Paleo-aboriginals migrated from Siberia to the North American mainland at least 12,000 years ago, and advanced cultures began to appear later on. These advanced cultures had almost completely declined by the time European colonists arrived during the 16th century. The United States emerged from the Thirteen British Colonies established along the East Coast when disputes with the British Crown over taxation and political representation led to the American Revolution (1765–1784), which established the nation's independence. In the late 18th century, the U.S. began expanding across North America, gradually obtaining new territories, sometimes through war, frequently displacing Native Americans, and admitting new states. By 1848, the United States spanned the continent from east to west. The controversy surrounding the practice of slavery culminated in the secession of the Confederate States of America, which fought the remaining states of the Union during the American Civil War (1861–1865). With the Union's victory and preservation, slavery was abolished by the Thirteenth Amendment.
'''
temp = SequenceMatcher(None, word_tokenize(text1), word_tokenize(text2))
print(temp.get_matching_blocks())
print('Similarity Score: ', temp.ratio())
print('')
search_length = len(text1)
total_length = len(text2)
matching_blocks = temp.get_matching_blocks()
beginning = matching_blocks[0][0]
start = matching_blocks[0][1]
stop = (matching_blocks[0][1] + matching_blocks[0][2])
end = matching_blocks[1][1]
tokenized = word_tokenize(text2)
before_match = TreebankWordDetokenizer().detokenize(tokenized[beginning:start])
match = TreebankWordDetokenizer().detokenize(tokenized[start:stop])
after_match = TreebankWordDetokenizer().detokenize(tokenized[stop:end])
print(textwrap.fill(before_match + '\x1b[0;30;42m' + match + '\x1b[0m' + after_match, 150))
print('')
print('Percentage Similarity: ' + str(round(((search_length/(total_length + search_length)) * 100), 2)) + '%')
I need to highlight at least 2 sequences. I'm trying to make some sort of if-else statement right now; maybe it'll work. Or is there a better library?
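One possible direction (a sketch, not a tested answer): SequenceMatcher.get_matching_blocks() already returns all matching blocks, with a final dummy block of size 0, so instead of slicing around only the first block you can walk through all of them. Reusing text1, text2 and the NLTK setup from the code above:
detok = TreebankWordDetokenizer().detokenize
tokens2 = word_tokenize(text2)
matcher = SequenceMatcher(None, word_tokenize(text1), tokens2)

pieces = []
prev_end = 0
for _, start, size in matcher.get_matching_blocks():  # the final block has size 0
    pieces.append(detok(tokens2[prev_end:start]))  # unmatched text before this block
    pieces.append('\x1b[0;30;42m' + detok(tokens2[start:start + size]) + '\x1b[0m')  # highlighted match
    prev_end = start + size

print(textwrap.fill(' '.join(pieces), 150))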

How do I turn this string into 2 arrays in Python?

I have this string and want to turn it into two arrays: one with the film titles and the other with the years. Their positions in the arrays need to correspond with each other. Is there a way to do this?
films = ("""Endless Love (1981), Top Gun (1986), The Color of Money (1986), Rain Man (1988),
Born on the Fourth of July (1989), Interview with the Vampire: The Vampire Chronicles (1994),
Mission: Impossible (1996), Jerry Maguire (1996), The Matrix (1999), Mission: Impossible II (2000),
Vanilla Sky (2001), Cocktail (1988), A Few Good Men (1992), The Firm (1993), Eyes Wide Shut (1999),
Magnolia (1999), Minority Report (2002), Austin Powers in Goldmember (2002), Days of Thunder (1990),
The Powers of Matthew Star (1982), Cold Mountain (2003), The Talented Mr. Ripley (1999),
War of the Worlds (2005), The Oprah Winfrey Show (1986), Far and Away (1992), Taps (1981),
The Last Samurai (2003), Valkyrie (2008), Jack Reacher (2012), Edge of Tomorrow (2014),
Enemy of the State (1998), Mission: Impossible III (2006), Crimson Tide (1995), Reign Over Me (2007),
Batman Forever (1995), Batman Begins (2005), The Simpsons (1989), The Simpsons: Brother from the Same Planet (1993),
The Simpsons: When You Dish Upon a Star (1998), End of Days (1999), House of D (2004), The Indian Runner (1991),
Harry & Son (1984), Mission: Impossible - Ghost Protocol (2011), Aladdin (1992), Pacific Rim (2013),
Oblivion (2013), Knight and Day (2010),
""")
First split the input string on commas to generate a list (dropping the empty element left by the trailing comma), then use comprehensions to get the title and year as separate lists.
import re

films_list = [x for x in re.split(r',\s*', films) if x.strip()]
titles = [re.split(r'\s*(?=\(\d+\))', x)[0] for x in films_list]
years = [re.split(r'\s*(?=\(\d+\))', x)[1] for x in films_list]
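With the sample string above, titles[0] would be 'Endless Love' and years[0] '(1981)'; if bare years are wanted, the parentheses can be removed by appending .strip('()') in the second comprehension.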
Tim's answer is good enough. I will write an alternative for anyone who would like to solve the problem without using regex.
a = films.split(",")
years = []
for i in a:
    years.append(i[i.find("(")+1:i.find(")")])
The same approach can be applied for the titles, for example as in the sketch below.
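A small sketch along the same lines (an addition, not part of the original answer), taking everything before the opening parenthesis as the title:
titles = []
for i in a:
    titles.append(i[:i.find("(")].strip())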
You can do something like this (without any import or extra module needed, and no regex complexity):
delimiter = ", "
movies_with_year = films.split(delimiter)
movies = []
years = []
for movie_with_year in movies_with_year:
    movie = movie_with_year[:-6]
    year = movie_with_year[-6:].replace("(", "").replace(")", "")
    movies.append(movie)
    years.append(year)
This script will result in something like this:
movies : ['Endless Love ', ...]
years : ['1981', ...]
You should clear all newlines ("\n") and use try/except to pass over the last-element issue.
films = ("""Endless Love (1981), Top Gun (1986), The Color of Money (1986), Rain Man (1988),
Born on the Fourth of July (1989), Interview with the Vampire: The Vampire Chronicles (1994),
Mission: Impossible (1996), Jerry Maguire (1996), The Matrix (1999), Mission: Impossible II (2000),
Vanilla Sky (2001), Cocktail (1988), A Few Good Men (1992), The Firm (1993), Eyes Wide Shut (1999),
Magnolia (1999), Minority Report (2002), Austin Powers in Goldmember (2002), Days of Thunder (1990),
The Powers of Matthew Star (1982), Cold Mountain (2003), The Talented Mr. Ripley (1999),
War of the Worlds (2005), The Oprah Winfrey Show (1986), Far and Away (1992), Taps (1981),
The Last Samurai (2003), Valkyrie (2008), Jack Reacher (2012), Edge of Tomorrow (2014),
Enemy of the State (1998), Mission: Impossible III (2006), Crimson Tide (1995), Reign Over Me (2007),
Batman Forever (1995), Batman Begins (2005), The Simpsons (1989), The Simpsons: Brother from the Same Planet (1993),
The Simpsons: When You Dish Upon a Star (1998), End of Days (1999), House of D (2004), The Indian Runner (1991),
Harry & Son (1984), Mission: Impossible - Ghost Protocol (2011), Aladdin (1992), Pacific Rim (2013),
Oblivion (2013), Knight and Day (2010),
""")
movies = []
years = []
for item in films.replace("\n", "").split("),"):
    try:
        movies.append(item.split(" (")[0])
        years.append(item.split(" (")[-1])
    except:
        ...

Create tuples of (lemma, NER type) in Python, NLP problem

I wrote the code below and made a dictionary for it, but I want to create tuples of (lemma, NER type) and collect counts over the tuples, and I don't know how to do it. Can you please help me? NER type means named entity recognition.
text = """
Seville.
Summers in the flamboyant Andalucían capital often nudge 40C, but spring is a delight, with the parks in bloom and the scent of orange blossom and jasmine in the air. And in Semana Santa (Holy Week, 14-20 April) the streets come alive with floats and processions. There is also the raucous annual Feria de Abril – a week-long fiesta of parades, flamenco and partying long into the night (4-11 May; expect higher hotel prices if you visit then).
Seville is a romantic and energetic place, with sights aplenty, from the Unesco-listed cathedral – the largest Gothic cathedral in the world – to the beautiful Alcázar royal palace. But days here are best spent simply wandering the medieval streets of Santa Cruz and along the river to La Real Maestranza, Spain’s most spectacular bullring.
Seville is the birthplace of tapas and perfect for a foodie break – join a tapas tour (try devoursevillefoodtours.com), or stop at the countless bars for a glass of sherry with local jamón ibérico (check out Bar Las Teresas in Santa Cruz or historic Casa Morales in Constitución). Great food markets include the Feria, the oldest, and the wooden, futuristic-looking Metropol Parasol.
Nightlife is, unsurprisingly, late and lively. For flamenco, try one of the peñas, or flamenco social clubs – Torres Macarena on C/Torrijano, perhaps – with bars open across town until the early hours.
Book it: In an atmospheric 18th-century house, the Hospes Casa del Rey de Baeza is a lovely place to stay in lively Santa Cruz. Doubles from £133 room only, hospes.com
Trieste.
"""
doc = nlp(text).ents
en = [(entity.text, entity.label_) for entity in doc]
en
#entities
#The list stored in variable entities has type list[list[tuple[str, str]]]
#from pprint import pprint
pprint(en)
sum(filter(None, entities), [])
from collections import defaultdict
type2entities = defaultdict(list)
for entity, entity_type in sum(filter(None, entities), []):
    type2entities[entity_type].append(entity)
from pprint import pprint
pprint(type2entities)
I hope the following code snippets solve your problem.
import spacy
# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")
text = ("Seville.
Summers in the flamboyant Andalucían capital often nudge 40C, but spring is a delight, with the parks in bloom and the scent of orange blossom and jasmine in the air. And in Semana Santa (Holy Week, 14-20 April) the streets come alive with floats and processions. There is also the raucous annual Feria de Abril – a week-long fiesta of parades, flamenco and partying long into the night (4-11 May; expect higher hotel prices if you visit then).
Seville is a romantic and energetic place, with sights aplenty, from the Unesco-listed cathedral – the largest Gothic cathedral in the world – to the beautiful Alcázar royal palace. But days here are best spent simply wandering the medieval streets of Santa Cruz and along the river to La Real Maestranza, Spain’s most spectacular bullring.
Seville is the birthplace of tapas and perfect for a foodie break – join a tapas tour (try devoursevillefoodtours.com), or stop at the countless bars for a glass of sherry with local jamón ibérico (check out Bar Las Teresas in Santa Cruz or historic Casa Morales in Constitución). Great food markets include the Feria, the oldest, and the wooden, futuristic-looking Metropol Parasol.
Nightlife is, unsurprisingly, late and lively. For flamenco, try one of the peñas, or flamenco social clubs – Torres Macarena on C/Torrijano, perhaps – with bars open across town until the early hours.
Book it: In an atmospheric 18th-century house, the Hospes Casa del Rey de Baeza is a lovely place to stay in lively Santa Cruz. Doubles from £133 room only, hospes.com
Trieste.")
doc = nlp(text)
lemma_ner_list = []
for entity in doc.ents:
    lemma_ner_list.append((entity.lemma_, entity.label_))
# print list of lemma ner tuples
print(lemma_ner_list)
# print count of tuples
print(len(lemma_ner_list))
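The question also asks to collect counts over the tuples. Assuming that means counting how often each (lemma, NER type) pair occurs, collections.Counter from the standard library does that directly (a sketch, not part of the original answer):
from collections import Counter

tuple_counts = Counter(lemma_ner_list)
for (lemma, ner_type), count in tuple_counts.most_common():
    print(lemma, ner_type, count)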

Extract information part of URL in Python

I have a list of 200k urls, with the general format of:
http[s]://..../..../the-headline-of-the-article
OR
http[s]://..../..../the-headline-of-the-article/....
The number of / before and after the-headline-of-the-article varies
Here is some sample data:
'http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision',
I want to extract the-headline-of-the-article only.
i.e.
call-to-end-affordable-care-act-is-immoral-says-cha-president
global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429
correction-trump-investigations-sater-lawsuit-story
I am sure this is possible, but I am relatively new to regex in Python.
In pseudocode, I was thinking:
split everything by /
keep only the chunk that contains -
replace all - with \s
Is this possible in python (I am a python n00b)?
urls = [...]
for url in urls:
    bits = url.split('/')  # Split each url at the '/'
    bits_with_hyphens = [bit.replace('-', ' ') for bit in bits if '-' in bit]  # [1]
    print(bits_with_hyphens)
[1] Note that your algorithm assumes that only one of the fragments after splitting the url will have a hyphen, which is not correct given your examples. So at [1], I'm keeping all the bits that do so.
Output:
['national news', 'call to end affordable care act is immoral says cha president']
['new website puts louisiana art on businesses walls']
['global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429']
['BP General+News', 'female music art to take center stage at swan day in new britain']
['Trump orders Treasury HUD to develop new plan 13721842.php']
['research delivers insight into the global business voip services market during the period 2018 2025']
['why mirza international limited nse 233259149.html']
['indian gaming industry grows in revenues.asp']
['facebook instagram banning pro white 210002719.html']
['press release', 'fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27']
['top firms decry religious exemption bills proposed in texas', 'article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html']
['correction trump investigations sater lawsuit story', 'article_ed20e441 de30 5b57 aafd b1f7d7929f71.html']
['weather channel sued 125 million over death storm chase collision']
PS. I think your algorithm could do with a bit of thought. Problems that I see:
more than one bit might contain a hyphen, where either both only contain dictionary words (see first and fourth output) or one of them is "clearly" not a headline (see second and third from bottom)
spurious string fragments at the end of the real headline: e.g. "13721842.php", "revenues.asp", "210002719.html"
you also need to substitute a space for characters other than '-' (see fourth, "General+News")
Here's a slightly different variation which seems to produce good results from the samples you provided.
Out of the parts with dashes, we trim off any trailing hex strings and file name extension; then, we extract the one with the largest number of dashes from each URL, and finally replace the remaining dashes with spaces.
import re
regex = re.compile(r'(-[0-9a-f]+)*(\.[a-z]+)?$', re.IGNORECASE)
for url in urls:
    parts = url.split('/')
    trimmed = [regex.sub('', x) for x in parts if '-' in x]
    longest = sorted(trimmed, key=lambda x: -len(x.split('-')))[0]
    print(longest.replace('-', ' '))
Output:
call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan
research delivers insight into the global business voip services market during the period
why mirza international limited nse
indian gaming industry grows in revenues
facebook instagram banning pro white
fluence receives another aspiraltm bulk order with partner itest in china
top firms decry religious exemption bills proposed in texas
correction trump investigations sater lawsuit story
weather channel sued 125 million over death storm chase collision
My original attempt would clean out the numbers from the end of the URL only after extracting the longest, and it worked for your samples; but trimming off trailing numbers immediately when splitting is probably more robust against variations in these patterns.
Since the URLs do not follow a consistent pattern (the first and the third URL are of a different pattern than the rest), using rsplit():
s = ['http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision']
for url in s:
    url = url.replace("-", " ")
    if url.rsplit('/', 1)[1] == '':  # For the 1st and 3rd urls
        if url.rsplit('/', 2)[1].isdigit():  # For the 3rd url
            print(url.rsplit('/', 3)[1])
        else:
            print(url.rsplit('/', 2)[1])
    else:
        print(url.rsplit('/', 1)[1])  # all urls except the 1st and 3rd
OUTPUT:
call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan 13721842.php
research delivers insight into the global business voip services market during the period 2018 2025
why mirza international limited nse 233259149.html
indian gaming industry grows in revenues.asp
facebook instagram banning pro white 210002719.html
fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27
article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html
article_ed20e441 de30 5b57 aafd b1f7d7929f71.html
weather channel sued 125 million over death storm chase collision
