Hi, I have this string in Python:
'Every Wednesday and Friday, this market is perfect for lunch! Nestled in the Minna St. tunnel (at 5th St.), this location is great for escaping the fog or rain. Check out live music every Friday.\r\n\r\nLocation: 5th St. # Minna St.\r\nTime: 11:00am-2:00pm\r\n\r\nVendors:\r\nKasa Indian\r\nFiveten Burger\r\nHiyaaa\r\nThe Rib Whip\r\nMayo & Mustard\r\n\r\n\r\nCATERING NEEDS? Have OtG cater your next event! Get started by visiting offthegridsf.com/catering.'
I need to extract the following:
Location: 5th St. # Minna St.
Time: 11:00am-2:00pm
Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard
I tried to do this by using:
val = desc.split("\r\n")
and then val[2] gives the location, val[3] gives the time and val[6:11] gives the vendors. But I am sure there is a nicer, more efficient way to do this.
Any help will be highly appreciated.
If your input is always going to be formatted in exactly this way, using str.split() is preferable. If you want something slightly more resilient, here's a regex approach, using re.VERBOSE and re.DOTALL:
import re

desc_match = re.search(r'''(?sx)
    (?P<loc>Location:.+?)[\n\r]
    (?P<time>Time:.+?)[\n\r]
    (?P<vends>Vendors:.+?)(?:\n\r?){2}''', desc)

if desc_match:
    for gname in ['loc', 'time', 'vends']:
        print(desc_match.group(gname))
Given your definition of desc, this prints out:
Location: 5th St. # Minna St.
Time: 11:00am-2:00pm
Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard
Efficiency really doesn't matter here because the time is going to be negligible either way (don't optimize unless there is a bottleneck). And again, this is only "nicer" if it works more often than your solution using str.split(), that is, if there are possible input strings for which your solution does not produce the correct result.
If you only want the values, just move the prefixes outside of the group definitions (a group is defined by (?P<group_name>...)):
r'''(?sx)
Location: \s* (?P<loc>.+?) [\n\r]
Time: \s* (?P<time>.+?) [\n\r]
Vendors: \s* (?P<vends>.+?) (?:\n\r?){2}'''
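For illustration, a minimal sketch of how that variant might be used, assuming the same desc string as above:
import re

pattern = r'''(?sx)
Location: \s* (?P<loc>.+?) [\n\r]
Time: \s* (?P<time>.+?) [\n\r]
Vendors: \s* (?P<vends>.+?) (?:\n\r?){2}'''

m = re.search(pattern, desc)
if m:
    # Only the captured values are printed, e.g. "5th St. # Minna St." for loc
    print(m.group('loc'))
    print(m.group('time'))
    print(m.group('vends'))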
NLNL = "\r\n\r\n"           # the blank-line separator used in the string
parts = s.split(NLNL)       # s is the description string from the question; parts[1] is Location/Time, parts[2] is Vendors
result = NLNL.join(parts[1:3])
print(result)
which gives
Location: 5th St. # Minna St.
Time: 11:00am-2:00pm
Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard
I am trying to get the span of the city name from some addresses, but I am struggling with the required regex. Examples of the address format are below.
flat 1, tower block, 34 long road, Major city
flat 1, tower block, 34 long road, town and parking space
34 short road, village on the river and carpark (7X3 8RG)
The expected text to be captured in each case is "Major city", "town" and "village on the river". The issue is that sometimes "and parking space" or a variant is included in the address. Using a regex such as "(?<=,\s)\w+" would return "town and parking space" in the case of example 2.
The city is always after the last comma of the address.
I have tried to re-work this question but have not successfully managed to exclude the "and parking space" section.
I have already created a regex that excludes the postcodes; it is only mentioned here because an answer would ideally allow that part of the regex to be bolted on at the end.
How would I create a regex that starts after the last comma and runs to the end of the address but stops at any "and parking" or postcodes?
You can capture these strings using any of the following patterns:
,\s*((?:(?!\sand\s)[^,])*)(?=[^,]*$)
,\s*([^,]*?)(?=(?:\sand\s[^,]*)?$)
.*,\s*((?:(?!\sand\s)[^,])*)
.*,\s*([^,]*?)(?=(?:\sand\s[^,]*)?$)
Details:
, - a comma
\s* - zero or more whitespaces
((?:(?!\sand\s)[^,])*) - Group 1: zero or more occurrences of any char other than a comma that does not start a whitespace + and + whitespace sequence
(?=[^,]*$) - there must be zero or more chars other than a comma until the end of the string.
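For illustration, a minimal sketch applying the first (lookahead) pattern to the second sample address; the .* variant used in the code below behaves the same on these samples:
import re

# Group 1 stops before " and ..." and must lie in the last comma-separated chunk
pattern = re.compile(r',\s*((?:(?!\sand\s)[^,])*)(?=[^,]*$)')

m = pattern.search('flat 1, tower block, 34 long road, town and parking space')
if m:
    print(m.group(1))  # town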
In Python, you would use
m = re.search(r'.*,\s*([^,]*?)(?=(?:\sand\s[^,]*)?$)', text)
if m:
    print(m.group(1))
See the demo:
import re

texts = ['flat 1, tower block, 34 long road, Major city',
         'flat 1, tower block, 34 long road, town and parking space',
         '34 short road, village on the river and carpark (7X3 8RG)']

rx = re.compile(r'.*,\s*([^,]*?)(?=(?:\sand\s[^,]*)?$)')
for text in texts:
    m = rx.search(text)
    if m:
        print(m.group(1))
Output:
Major city
town
village on the river
I would do:
import re

exp = ['flat 1, tower block, 34 long road, Major city',
       'flat 1, tower block, 34 long road, town and parking space',
       '34 short road, village on the river and carpark (7X3 8RG)']

for e in (re.split(r',\s*', x)[-1] for x in exp):
    print(re.sub(r'(?:\s+and car.*)|(?:\s+and parking.*)', '', e))
Prints:
Major city
town
village on the river
Works like this:
Split the string on ,\s* and take the last portion;
Remove anything from the end of that string that matches the specified pattern (?:\s+and car.*)|(?:\s+and parking.*).
You can easily add additional clauses to remove with this approach, as in the sketch below.
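For example, a minimal sketch of adding one more alternative; the "and garage" clause and the sample address are hypothetical, just to show where a new clause goes:
import re

address = '34 short road, village on the river and garage (7X3 8RG)'  # hypothetical sample

last_part = re.split(r',\s*', address)[-1]
# A third alternative, (?:\s+and garage.*), added alongside the original two
cleaned = re.sub(r'(?:\s+and car.*)|(?:\s+and parking.*)|(?:\s+and garage.*)', '', last_part)
print(cleaned)  # village on the river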
I have a list of 200k urls, with the general format of:
http[s]://..../..../the-headline-of-the-article
OR
http[s]://..../..../the-headline-of-the-article/....
The number of / before and after the-headline-of-the-article varies.
Here is some sample data:
'http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision',
I want to extract the-headline-of-the-article only.
i.e.
call-to-end-affordable-care-act-is-immoral-says-cha-president
global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429
correction-trump-investigations-sater-lawsuit-story
I am sure this is possible, but am relatively new with regex in python.
In pseudocode, I was thinking:
split everything by /
keep only the chunk that contains -
replace all - with \s
Is this possible in python (I am a python n00b)?
urls = [...]

for url in urls:
    bits = url.split('/')  # Split each url at the '/'
    bits_with_hyphens = [bit.replace('-', ' ') for bit in bits if '-' in bit]  # [1]
    print(bits_with_hyphens)
[1] Note that your algorithm assumes that only one of the fragments after splitting the url will have a hyphen, which is not correct given your examples. So at [1], I'm keeping all the bits that do so.
Output:
['national news', 'call to end affordable care act is immoral says cha president']
['new website puts louisiana art on businesses walls']
['global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429']
['BP General+News', 'female music art to take center stage at swan day in new britain']
['Trump orders Treasury HUD to develop new plan 13721842.php']
['research delivers insight into the global business voip services market during the period 2018 2025']
['why mirza international limited nse 233259149.html']
['indian gaming industry grows in revenues.asp']
['facebook instagram banning pro white 210002719.html']
['press release', 'fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27']
['top firms decry religious exemption bills proposed in texas', 'article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html']
['correction trump investigations sater lawsuit story', 'article_ed20e441 de30 5b57 aafd b1f7d7929f71.html']
['weather channel sued 125 million over death storm chase collision']
PS. I think your algorithm could do with a bit of thought. Problems that I see:
more than one bit might contain a hyphen, where:
both only contain dictionary words (see first and fourth output)
one of them is "clearly" not a headline (see second and third from bottom)
spurious string fragments at the end of the real headline, e.g. "13721842.php", "revenues.asp", "210002719.html"
some other characters besides '-' also need to be replaced with a space, e.g. the '+' (see the fourth output, "BP General+News")
Here's a slightly different variation which seems to produce good results from the samples you provided.
Out of the parts with dashes, we trim off any trailing hex strings and file name extension; then, we extract the one with the largest number of dashes from each URL, and finally replace the remaining dashes with spaces.
import re

regex = re.compile(r'(-[0-9a-f]+)*(\.[a-z]+)?$', re.IGNORECASE)

for url in urls:
    parts = url.split('/')
    trimmed = [regex.sub('', x) for x in parts if '-' in x]
    longest = sorted(trimmed, key=lambda x: -len(x.split('-')))[0]
    print(longest.replace('-', ' '))
Output:
call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan
research delivers insight into the global business voip services market during the period
why mirza international limited nse
indian gaming industry grows in revenues
facebook instagram banning pro white
fluence receives another aspiraltm bulk order with partner itest in china
top firms decry religious exemption bills proposed in texas
correction trump investigations sater lawsuit story
weather channel sued 125 million over death storm chase collision
My original attempt would clean out the numbers from the end of the URL only after extracting the longest, and it worked for your samples; but trimming off trailing numbers immediately when splitting is probably more robust against variations in these patterns.
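For comparison, a minimal sketch of what that earlier ordering might look like (reconstructed as an assumption, not the exact original code): pick the part with the most dashes first, then trim trailing hex fragments and file extensions.
import re

trailing = re.compile(r'(-[0-9a-f]+)*(\.[a-z]+)?$', re.IGNORECASE)

for url in urls:
    parts = [x for x in url.split('/') if '-' in x]
    # Pick the part with the most dashes first...
    longest = sorted(parts, key=lambda x: -len(x.split('-')))[0]
    # ...then trim trailing hex strings / extensions and print
    print(trailing.sub('', longest).replace('-', ' '))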
Since the URLs do not follow a consistent pattern (the first and the third URLs are shaped differently from the rest), this answer handles those cases separately.
Using str.rsplit():
s = ['http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision']
for url in s:
    url = url.replace("-", " ")
    if url.rsplit('/', 1)[1] == '':          # for the 1st and 3rd urls (trailing '/')
        if url.rsplit('/', 2)[1].isdigit():  # for the 3rd url
            print(url.rsplit('/', 3)[1])
        else:
            print(url.rsplit('/', 2)[1])
    else:
        print(url.rsplit('/', 1)[1])         # all other urls
OUTPUT:
call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan 13721842.php
research delivers insight into the global business voip services market during the period 2018 2025
why mirza international limited nse 233259149.html
indian gaming industry grows in revenues.asp
facebook instagram banning pro white 210002719.html
fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27
article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html
article_ed20e441 de30 5b57 aafd b1f7d7929f71.html
weather channel sued 125 million over death storm chase collision
I have a string and it has a number of substrings that I'd like to delete.
Each of the substrings starts with ApPle and ends with THE BEST PIE — STRAWBERRY.
I tried the suggestions on this post, but they didn't work.
Input
Cannoli (Italian pronunciation: [kanˈnɔːli]; Sicilian: cannula) are
Italian ApPle Sep 12 THE BEST PIE —
STRAWBERRY pastries that
originated on the island of Sicily and are today a staple of Sicilian
cuisine1[2] as well as Italian-American cuisine.
Cannoli consist of
tube-shaped shells of fried pastry dough, filled with a sweet, creamy
filling usually ApPle Aug 4 THE BEST PIE — STRAWBERRY containing
ricotta. They range in size from "cannulicchi", no bigger than a
finger, to the fist-sized proportions typically found south of
Palermo, Sicily, in Piana degli Albanesi.[2]
import re

array = []

# open the file and delete new lines
with open('canoli.txt', 'r') as myfile:
    file = myfile.readlines()
    array = [s.rstrip('\n') for s in file]
    text = ' '.join(array)

attempt1 = re.sub(r'/ApPle+THE.BEST.PIE.-.STRAWBERRY/', '', text)
attempt2 = re.sub(r'/ApPle:.*?:THE.BEST.PIE.-.STRAWBERRY/', '', text)

print(attempt1)
print(attempt2)
Desired Output
Cannoli (Italian pronunciation: [kanˈnɔːli]; Sicilian: cannula) are
Italian pastries that
originated on the island of Sicily and are today a staple of Sicilian
cuisine1[2] as well as Italian-American cuisine. Cannoli consist of
tube-shaped shells of fried pastry dough, filled with a sweet, creamy
filling usually containing
ricotta. They range in size from "cannulicchi", no bigger than a
finger, to the fist-sized proportions typically found south of
Palermo, Sicily, in Piana degli Albanesi.[2]
I think your regex should be: ApPle.*?THE\sBEST\sPIE\s—\sSTRAWBERRY
and you need to add the DOTALL flag so that . also matches newlines. Try this:
re.sub(r'ApPle.*?THE\sBEST\sPIE\s—\sSTRAWBERRY','',text, flags=re.DOTALL)
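Putting it together with the question's file-reading code (a minimal sketch; 'canoli.txt' is the file from the question), this removes both ApPle ... STRAWBERRY substrings:
import re

# Read the file and join the lines into one string, as in the question
with open('canoli.txt', 'r') as myfile:
    text = ' '.join(line.rstrip('\n') for line in myfile)

# Non-greedy .*? plus DOTALL keeps each deletion limited to one ApPle...STRAWBERRY block
cleaned = re.sub(r'ApPle.*?THE\sBEST\sPIE\s—\sSTRAWBERRY', '', text, flags=re.DOTALL)
print(cleaned)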
I am trying to use negative and positive lookaheads to capture a certain area of text but am struggling. I'm not sure if this is the best way to do this.
This is the exact text I am using regex for: Gold Coast area Partly cloudy.
I got it from web-scraping, and the "Partly cloudy" text changes every day, so I can't use regex to search for those exact words.
I want to retrieve the words "Partly cloudy" between "Gold Coast area" and the full stop that follows.
Thank you very much for your help.
If you know that a string always begins with Gold Coast area and ends with a full stop, you can just truncate the string without regex:
s = 'Gold Coast area Partly cloudy.'
new_s = s[16:-1]
print(new_s) # prints 'Partly cloudy'
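A slightly more self-documenting variation of the same idea (a sketch; str.removeprefix and str.removesuffix require Python 3.9+):
s = 'Gold Coast area Partly cloudy.'
# Drop the fixed prefix and the trailing full stop by name rather than by index
new_s = s.removeprefix('Gold Coast area ').removesuffix('.')
print(new_s)  # prints 'Partly cloudy'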
Try this:
([A-Za-z ]+?) area ([A-Za-z ]+)\.
It captures the area in the first capture group and the weather in the second one. If you are only interested in the Gold Coast area, replace the first capture group with the hard-coded "Gold Coast" string.
As a proof of concept:
import re

arr = ["Gold Coast area Partly cloudy.", "Gold Coast area clear skies.", "Some other area overcast."]

for s in arr:
    match = re.match(r"([A-Za-z ]+?) area ([A-Za-z ]+)\.", s)
    if match:
        print(match.group(1) + ": " + match.group(2))
Outputs:
Gold Coast: Partly cloudy
Gold Coast: clear skies
Some other: overcast
I have a string like this:
my_str ='·in this match, dated may 1, 2013 (the "the match") is between brooklyn centenniel, resident of detroit, michigan ("champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").'
Now, I want to extract the current champion and the underdog using the keywords champion and underdog.
What is really challenging here is that both contenders' names appear before the keyword inside parentheses. I want to use a regular expression to extract this information.
Following is what I did,
champion = re.findall(r'("champion"[^.]*.)', my_str)
print(champion)
>> ['"champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").']
underdog = re.findall(r'("underdog"[^.]*.)', my_str)
print(underdog)
>>['"underdog").']
However, I need the results, with champion as:
brooklyn centenniel, resident of detroit, michigan
and the underdog as:
kamil kubaru, the challenger from alexandria, virginia
How can I do this using regular expressions? (I have been searching for a way to go back a couple of words from the keyword to get the result I want, but no luck yet.) Any help or suggestion would be appreciated.
You can use named capture groups to capture the desired results:
between\s+(?P<champion>.*?)\s+\("champion"\)\s+and\s+(?P<underdog>.*?)\s+\("underdog"\)
between\s+(?P<champion>.*?)\s+\("champion"\) matches the chunk from between to ("champion") and puts the desired portion in the named capture group champion
After that, \s+and\s+(?P<underdog>.*?)\s+\("underdog"\) matches the chunk up to ("underdog") and again captures the desired portion, as the named group underdog
Example:
In [26]: my_str ='·in this match, dated may 1, 2013 (the "the match") is between brooklyn centenniel, resident of detroit, michigan ("champion") and kamil kubaru, the challenger from alexandria, virginia
...: ("underdog").'
In [27]: out = re.search(r'between\s+(?P<champion>.*?)\s+\("champion"\)\s+and\s+(?P<underdog>.*?)\s+\("underdog"\)', my_str)
In [28]: out.groupdict()
Out[28]:
{'champion': 'brooklyn centenniel, resident of detroit, michigan',
'underdog': 'kamil kubaru, the challenger from alexandria, virginia'}
There will be a better answer than this, and I don't know regex at all, but I'm bored, so here's my 2 cents.
Here's how I would go about it:
words = my_str.split()
index = words.index('("champion")')
champion = words[index - 6:index]
champion = " ".join(champion)
For the underdog, you will have to change the 6 to a 7, and '("champion")' to '("underdog").' (see the sketch below).
Not sure if this will solve your problem, but for this particular string, this worked when I tested it.
You could also use str.strip() to remove punctuation if that trailing period on underdog is a problem.
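Following that description, here is a minimal sketch of the underdog variant (note that the trailing full stop is part of the '("underdog").' token):
my_str = ('·in this match, dated may 1, 2013 (the "the match") is between '
          'brooklyn centenniel, resident of detroit, michigan ("champion") and '
          'kamil kubaru, the challenger from alexandria, virginia ("underdog").')

words = my_str.split()
index = words.index('("underdog").')        # the full stop is part of this token
underdog = " ".join(words[index - 7:index])
print(underdog)  # kamil kubaru, the challenger from alexandria, virginia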