Using Regex to get text only between specified characters - python

I am trying to use negative and positive lookaheads to capture a certain area of text but am struggling. I'm not sure if this is the best way to do this.
This is the exact text I am using regex for: Gold Coast area Partly cloudy.
I got it from web-scraping, and the "Partly cloudy" text changes every day, so I can't use regex to search for those exact words.
I want to retrieve the words "Partly cloudy" between "Gold Coast area" and the full stop after "Partly cloudy".
Thank you very much for your help.

If you know that a string always begins with Gold Coast area and ends with a full stop, you can just truncate the string without regex:
s = 'Gold Coast area Partly cloudy.'
new_s = s[16:-1]
print(new_s) # prints 'Partly cloudy'
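The hard-coded 16 above is just the length of 'Gold Coast area '. Here is a minimal sketch of the same slicing idea without the magic number, assuming the prefix and the trailing full stop are always present. The helper name extract_forecast is hypothetical, not something from the question:

```python
def extract_forecast(s, prefix="Gold Coast area ", suffix="."):
    """Return the text between prefix and suffix, or None if either is missing."""
    if s.startswith(prefix) and s.endswith(suffix):
        # Slice off the prefix at the front and the suffix at the back.
        return s[len(prefix):len(s) - len(suffix)]
    return None

forecast = extract_forecast("Gold Coast area Partly cloudy.")
print(forecast)
```

Returning None for unexpected input makes it easier to notice when the scraped page changes its layout.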

Try this:
/([A-Za-z ]+?) area ([A-Za-z ]+)\./
It captures the area in the first capture group and the weather in the second. If you are only interested in the Gold Coast area, replace the first capture group with the hard-coded "Gold Coast" string.
As a proof of concept:
import re
arr = ["Gold Coast area Partly cloudy.", "Gold Coast area clear skies.", "Some other area overcast."]
for s in arr:
    match = re.match(r"([A-Za-z ]+?) area ([A-Za-z ]+)\.", s)
    if match:
        print(match.group(1) + ": " + match.group(2))
Outputs:
Gold Coast: Partly cloudy
Gold Coast: clear skies
Some other: overcast
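Since the question asked about lookarounds specifically, here is a rough sketch of that approach as well. It assumes the prefix is exactly "Gold Coast area " (Python's re module needs a fixed-width lookbehind) and that the forecast itself contains no full stop:

```python
import re

s = "Gold Coast area Partly cloudy."
# (?<=...) requires the prefix just before the match;
# (?=\.) requires a full stop right after it. Neither is part of the match.
m = re.search(r"(?<=Gold Coast area )[^.]+(?=\.)", s)
weather = m.group(0) if m else None
print(weather)
```

The capture-group version above is usually easier to read; the lookaround version is mainly useful when you want the whole match to be the forecast text.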

Related

Need a Regex that adds a space after a period, but can account for abbreviations such as U.S. or D.C

Here is what I have so far:
text = re.sub(r'(?<=\.)(?=[A-Z])', ' ', text)
This already avoids numbers and it gets around non-capital letters, but I need it to account for the edge case where initials are separated by periods.
An example sentence where I wouldn't want to add a space would be:
The U.S. health care is more expensive than U.K health care.
Currently, my regex makes it like:
The U. S. health care is more expensive than U. K health care.
But I want it to look exactly like the first sentence without the spaces separating U.S and U.K
I'm not sure how to do this, any advice would be appreciated!
EDIT:
(?<=\.)(?=[A-Z][a-z]{1,})
makes it so that it avoids one word abbreviations.
I think that this does what you want. We find periods that are not preceded by a capital letter and not followed by whitespace, and insert a space after each of them.
import re
text="The U.S. health care is more expensive than U.K health care.The end."
text = re.sub(r'((?<![A-Z])\.(?!\s))',r'\1 ', text)
print('<',text,'>')
Output (with '<' and '>' to show the beginning and end of the text more clearly; note that the pattern also adds a space after the final period):
< The U.S. health care is more expensive than U.K health care. The end.  >
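One side effect of the substitution above is that a period at the very end of the text also gets a space appended, and a decimal such as 3.14 would be split. A possible variation is sketched below. The extra `\d` and `$` alternatives in the lookahead are my own assumption about what is wanted, and the function name add_space_after_period is hypothetical:

```python
import re

def add_space_after_period(text):
    # Match a period that is not preceded by a capital letter and is not
    # followed by whitespace, a digit, or the end of the string.
    return re.sub(r"(?<![A-Z])\.(?!\s|\d|$)", ". ", text)

sample = "The U.S. health care is more expensive than U.K health care.The end."
fixed = add_space_after_period(sample)
print(fixed)
```

This still relies on abbreviations being written in capitals, so something like "e.g.word" would be split.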

Remove all bracketed text except for percentages

I'm trying to write a regex that removes text within brackets, () or [], except where the brackets contain only a number with a percent symbol. Nested brackets should be removed up to the outermost closing bracket.
2.1.1. Berlin (/bɜːrˈlɪn/; German: [bɛʁˈliːn] (About this soundlisten)) is the capital and largest city of Germany by both area and population.[5][6] Its 3,769,495 (2019)[2] inhabitants make it the most populous city proper of the European Union. The two cities are at the center of the Berlin-Brandenburg capital region, which is, with about six million inhabitants. By 1700, approximately 30 percent (30%) of Berlin's residents were French, because of the Huguenot immigration.[40] Many other immigrants came from Bohemia, Poland, and Salzburg.
What I have now removes everything between the brackets, but it stops at the first closing bracket instead of the outermost one.
re.sub(r"[\(\[].*?[\)\]]", "", sentence).strip()
You can remove all (possibly nested) substrings in square brackets, and all substrings in parentheses except those that contain just a number and a percent sign, with:
import re
def remove_text_nested(text, pattern):
    n = 1  # run at least once
    while n:
        text, n = re.subn(pattern, '', text)  # remove non-nested/flat balanced parts
    return text
text = "Berlin (/bɜːrˈlɪn/; German: [bɛʁˈliːn] (About this soundlisten)) is the capital and largest city of Germany by both area and population.[5][6] Its 3,769,495 (2019)[2] inhabitants make it the most populous city proper of the European Union. The two cities are at the center of the Berlin-Brandenburg capital region, which is, with about six million inhabitants. By 1700, approximately 30 percent (30%) of Berlin's residents were French, because of the Huguenot immigration.[40] Many other immigrants came from Bohemia, Poland, and Salzburg."
text = remove_text_nested(text, r'\((?!\d+%\))[^()]*\)')
text = remove_text_nested(text, r'\[[^][]*]')
print(text)
Output:
Berlin is the capital and largest city of Germany by both area and population. Its 3,769,495 inhabitants make it the most populous city proper of the European Union. The two cities are at the center of the Berlin-Brandenburg capital region, which is, with about six million inhabitants. By 1700, approximately 30 percent (30%) of Berlin's residents were French, because of the Huguenot immigration. Many other immigrants came from Bohemia, Poland, and Salzburg.
Basically, the remove_text_nested function removes all matches in a loop until no more replacements occur.
The \((?!\d+%\))[^()]*\) pattern matches (, then fails the match if 1+ digits followed by %) come immediately to the right of the current location, then matches 0+ chars other than ( and ), and then matches ).
The \[[^][]*] pattern simply matches [, then 0 or more chars other than [ and ], and then a ].
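To see why the loop is needed, here is a small trace on a made-up string (the sample text is hypothetical). Each pass can only remove the innermost brackets, so nested groups disappear from the inside out. The final whitespace cleanup is an extra step not in the answer above:

```python
import re

def remove_text_nested(text, pattern):
    n = 1  # run at least once
    while n:
        # subn returns the new string and the number of replacements made.
        text, n = re.subn(pattern, '', text)
    return text

sample = "a (b (c)) d (50%) e [x [y]]"
# Pass 1 removes "(c)", pass 2 removes "(b )"; "(50%)" is protected by the lookahead.
step1 = remove_text_nested(sample, r'\((?!\d+%\))[^()]*\)')
# Same idea for square brackets: "[y]" first, then "[x ]".
step2 = remove_text_nested(step1, r'\[[^][]*]')
# Collapse the runs of spaces left behind by the removals.
clean = re.sub(r' {2,}', ' ', step2).strip()
print(clean)
```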

How to extract string that contains specific characters in Python

I'm trying to extract ONLY one string that contains $ character. The input based on output that I extracted using BeautifulSoup.
Code
price = [m.split() for m in re.findall(r"\w+/$(?:\s+\w+/$)*", soup_content.find('blockquote', { "class": "postcontent restore" }).text)]
Input
For Sale is my Tag Heuer Carrera Calibre 6 with box and papers and extras.
39mm
47 ish lug to lug
19mm in between lugs
Pretty thin but not sure exact height. Likely around 12mm (maybe less)
I've owned it for about 2 years. I absolutely love the case on this watch. It fits my wrist and sits better than any other watch I've ever owned. I'm selling because I need cash and other pieces have more sentimental value
I am the second owner, but the first barely wore it.
It comes with barely worn blue leather strap, extra suede strap that matches just about perfectly and I'll include a blue Barton Band Elite Silicone.
I also purchased an OEM bracelet that I personally think takes the watch to a new level. This model never came with a bracelet and it was several hundred $ to purchase after the fact.
The watch was worn in rotation and never dropped or knocked around.
The watch does have hairlines, but they nearly all superficial. A bit of time with a cape cod cloth would take care of a lot it them. The pics show the imperfections in at "worst" possible angle to show the nature of scratches.
The bracelet has a few desk diving marks, but all in all, the watch and bracelet are in very good shape.
Asking $2000 obo. PayPal shipped. CONUS.
It's a big hard to compare with others for sale as this one includes the bracelet.
The output should be like this.
2000
You don't need a regex. Instead, iterate over the lines and over each word, keep the words that start with '$', and strip the sign:
[word[1:] for line in s.split('\n') for word in line.split() if word.startswith('$') and len(word) > 1]
where s is your paragraph.
which outputs:
['2000']
Since this is very simple, you don't need a regex solution. This should suffice:
words = text.split()
words_with_dollar = [word for word in words if '$' in word]
print(words_with_dollar)
>>> ['$', '$2000']
If you don't want the dollar sign alone, you can add a filter like this:
words_with_dollar = [word for word in words if '$' in word and '$' != word]
print(words_with_dollar)
>>> ['$2000']
I would do something like this (provided text is the string you wrote above):
price_start = text.find('$')
price = text[price_start:].split(' ')[0]
This only works if there is a single occurrence of '$', as you said. In the sample above the first '$' is the standalone one in "several hundred $", so this returns just '$'.
Alternatively, you could use a regex like this:
price = re.findall(r'\S*\$\S*\d', text)[0]
price = price.replace('$', '')
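If you do want a regex, one option is to capture only the digits that directly follow a dollar sign, which skips the standalone "$" in the listing. This is a sketch on a shortened version of the input; allowing thousands separators and a decimal part is my assumption about what prices might look like:

```python
import re

text = ("it was several hundred $ to purchase after the fact.\n"
        "Asking $2000 obo. PayPal shipped. CONUS.")
# \$ matches a literal dollar sign; the group captures the number after it,
# optionally with comma separators and a decimal part.
prices = re.findall(r"\$(\d[\d,]*(?:\.\d+)?)", text)
print(prices)
```

findall returns a list, so an empty list means no price was found in the post.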

Python: Find and remove a string starting and ending with a specific substring in python

I have a string and it has a number of substrings that I'd like to delete.
Each of the substrings start with ApPle and end with THE BEST PIE — STRAWBERRY.
I tried the suggestions on this post, but they didn't work.
Input
Cannoli (Italian pronunciation: [kanˈnɔːli]; Sicilian: cannula) are
Italian ApPle Sep 12 THE BEST PIE —
STRAWBERRY pastries that
originated on the island of Sicily and are today a staple of Sicilian
cuisine1[2] as well as Italian-American cuisine.
Cannoli consist of
tube-shaped shells of fried pastry dough, filled with a sweet, creamy
filling usually ApPle Aug 4 THE BEST PIE — STRAWBERRY containing
ricotta. They range in size from "cannulicchi", no bigger than a
finger, to the fist-sized proportions typically found south of
Palermo, Sicily, in Piana degli Albanesi.[2]
import re
array = []
# open the file and delete new lines
with open('canoli.txt', 'r') as myfile:
    file = myfile.readlines()
array = [s.rstrip('\n') for s in file]
text = ' '.join(array)
attempt1 = re.sub(r'/ApPle+THE.BEST.PIE.-.STRAWBERRY/','',text)
attempt2 = re.sub(r'/ApPle:.*?:THE.BEST.PIE.-.STRAWBERRY/','',text)
print(attempt1)
print(attempt2)
Desired Output
Cannoli (Italian pronunciation: [kanˈnɔːli]; Sicilian: cannula) are
Italian pastries that
originated on the island of Sicily and are today a staple of Sicilian
cuisine1[2] as well as Italian-American cuisine. Cannoli consist of
tube-shaped shells of fried pastry dough, filled with a sweet, creamy
filling usually containing
ricotta. They range in size from "cannulicchi", no bigger than a
finger, to the fist-sized proportions typically found south of
Palermo, Sicily, in Piana degli Albanesi.[2]
I think your regex should be: ApPle.*?THE\sBEST\sPIE\s—\sSTRAWBERRY
and you need to add the re.DOTALL flag so that . also matches newlines. Try this:
re.sub(r'ApPle.*?THE\sBEST\sPIE\s—\sSTRAWBERRY','',text, flags=re.DOTALL)
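Here is a quick check of that pattern on a shortened version of the input. The leading `\s*` is my own addition (an assumption about the desired output) so that the whitespace in front of each removed chunk goes too and no double spaces are left behind:

```python
import re

text = ("Italian ApPle Sep 12 THE BEST PIE —\n"
        "STRAWBERRY pastries, usually ApPle Aug 4 THE BEST PIE — STRAWBERRY containing ricotta.")

# .*? is lazy, so each match stops at the nearest closing marker;
# \s between the words lets the marker span a line break.
pattern = re.compile(r"\s*ApPle.*?THE\sBEST\sPIE\s—\sSTRAWBERRY", re.DOTALL)
cleaned = pattern.sub("", text)
print(cleaned)
```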

parsing string - regex help in python

Hi, I have this string in Python:
'Every Wednesday and Friday, this market is perfect for lunch! Nestled in the Minna St. tunnel (at 5th St.), this location is great for escaping the fog or rain. Check out live music every Friday.\r\n\r\nLocation: 5th St. # Minna St.\r\nTime: 11:00am-2:00pm\r\n\r\nVendors:\r\nKasa Indian\r\nFiveten Burger\r\nHiyaaa\r\nThe Rib Whip\r\nMayo & Mustard\r\n\r\n\r\nCATERING NEEDS? Have OtG cater your next event! Get started by visiting offthegridsf.com/catering.'
I need to extract the following:
Location: 5th St. # Minna St.
Time: 11:00am-2:00pm
Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard
I tried to do this by using:
val = desc.split("\r\n")
and then val[2] gives the location, val[3] gives the time and val[6:11] gives the vendors. But I am sure there is a nicer, more efficient way to do this.
Any help will be highly appreciated.
If your input is always going to be formatted in exactly this way, using str.split() is preferable. If you want something slightly more resilient, here's a regex approach, using re.VERBOSE and re.DOTALL:
import re
desc_match = re.search(r'''(?sx)
    (?P<loc>Location:.+?)[\n\r]
    (?P<time>Time:.+?)[\n\r]
    (?P<vends>Vendors:.+?)(?:\n\r?){2}''', desc)
if desc_match:
    for gname in ['loc', 'time', 'vends']:
        print(desc_match.group(gname))
Given your definition of desc, this prints out:
Location: 5th St. # Minna St.
Time: 11:00am-2:00pm
Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard
Efficiency really doesn't matter here because the time is going to be negligible either way (don't optimize unless there is a bottleneck.) And again, this is only "nicer" if it works more often than your solution using str.split() - that is, if there are any possible input strings for which your solution does not produce the correct result.
If you only want the values, just move the prefixes outside of the group definitions (a group is defined by (?P<group_name>...))
r'''(?sx)
Location: \s* (?P<loc>.+?) [\n\r]
Time: \s* (?P<time>.+?) [\n\r]
Vendors: \s* (?P<vends>.+?) (?:\n\r?){2}'''
Alternatively, without a regex, split the string on blank lines (s is your string):
NLNL = "\r\n\r\n"
parts = s.split(NLNL)
result = NLNL.join(parts[1:3])
print(result)
which gives
Location: 5th St. # Minna St.
Time: 11:00am-2:00pm
Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard
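If the goal is to end up with separate values rather than printed text, the named-group idea from the regex answer can be pushed a little further. The sketch below runs on a shortened desc and assumes Windows-style \r\n line endings as in the question. The variable names and the use of splitlines() to get a vendor list are my own choices:

```python
import re

desc = ("Check out live music every Friday.\r\n\r\n"
        "Location: 5th St. # Minna St.\r\nTime: 11:00am-2:00pm\r\n\r\n"
        "Vendors:\r\nKasa Indian\r\nFiveten Burger\r\n\r\n\r\n"
        "CATERING NEEDS? Have OtG cater your next event!")

pattern = re.compile(
    r"Location:[ \t]*(?P<loc>[^\r\n]+)\s+"       # rest of the Location line
    r"Time:[ \t]*(?P<time>[^\r\n]+)\s+"          # rest of the Time line
    r"Vendors:\s*(?P<vends>.+?)(?:\r?\n){2}",    # everything up to a blank line
    re.DOTALL,
)

m = pattern.search(desc)
if m:
    location = m.group("loc")
    when = m.group("time")
    vendors = m.group("vends").splitlines()  # one list entry per vendor
    print(location, when, vendors)
```

Using [^\r\n]+ for the single-line fields keeps stray carriage returns out of the captured values.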
