Getting quantity and unit - python

I want to extract the quantity and unit (shown in bold in the original post) from each of the sentences below.
Examples:
SmellNice Coffee 450 gr
Clean 2 k Rice
LukaLuka 1,5lt cold drink
Jumbo 7 gutgut eggs 12'li
Espresso 5 Klasik 10 Ad
The expression below works well except for the last two examples.
\d+[.,]?\d*\s*[’']?\s*(gr|g|kg|k|adet|ad|lı|li|lu|lü|cc|cl|ml|lt|l|mm|cm|mt|m)
I added \s|$ to the end of the expression, thinking that if the unit is not the last word, it must be followed by a space. But it didn't work. In short, how can I capture all of the bold expressions?

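It works with brackets. Without them, the | in \s|$ has the lowest precedence and splits the whole expression into two alternatives (everything followed by \s, or just $):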
\d+[.,]?\d*\s*[’']?\s*(gr|g|kg|k|adet|ad|lı|li|lu|lü|cc|cl|ml|lt|l|mm|cm|mt|m)(\s+|$)

x2 = (
    r"\d+"  # digits
    r"[,'\s]"  # comma, apostrophe, or whitespace
    r"[\d*\s*]?"  # optional single digit or whitespace character
    r"((gr)|g|(kg)|k|(adet)|([Aa]d)|(lı)|(li)|(lu)|(lü)|(cc)|(cl)|(ml)|(lt)|l|(mm)|(cm)|(mt)|m)"  # all the units to look for
    r"(\s+|$)"  # must be followed by whitespace or by the end of the line
)
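As a rough check of the idea above, the sketch below runs the question's pattern on the five example sentences. A few things here are assumptions that do not appear in the original posts: the unit group is made non-capturing, the trailing (\s+|$) is swapped for a lookahead so the trailing space is not part of the match, re.IGNORECASE is added so that "Ad" matches "ad", and the helper name find_quantity is made up for illustration.

```python
import re

# Unit alternation copied from the question's pattern.
UNITS = r"gr|g|kg|k|adet|ad|lı|li|lu|lü|cc|cl|ml|lt|l|mm|cm|mt|m"

# Assumptions for this sketch: a non-capturing group, a lookahead in place of
# the consuming (\s+|$), and re.IGNORECASE so that "Ad" matches "ad".
QUANTITY = re.compile(
    r"\d+[.,]?\d*\s*[’']?\s*(?:" + UNITS + r")(?=\s|$)",
    re.IGNORECASE,
)


def find_quantity(text):
    """Return the first quantity-with-unit found in text, or None."""
    match = QUANTITY.search(text)
    return match.group(0) if match else None


samples = [
    "SmellNice Coffee 450 gr",
    "Clean 2 k Rice",
    "LukaLuka 1,5lt cold drink",
    "Jumbo 7 gutgut eggs 12'li",
    "Espresso 5 Klasik 10 Ad",
]

results = [find_quantity(sample) for sample in samples]
print(results)  # ['450 gr', '2 k', '1,5lt', "12'li", '10 Ad']
```

Because the unit must be followed by whitespace or the end of the string, "7 gutgut" and "5 Klasik" are no longer mistaken for "7 g" and "5 k".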

Related

Regex split the string at \n but skip the first one if it is \n\n

I want to split some strings in Python at \n and use the pieces in that format, but some of those strings contain unexpected newlines that I want to ignore.
TO CLARIFY: Both examples have only one string.
For example this is a regular string with no unexpected newlines:
Step 1
Cut peppers into strips.
Step 2
Heat a non-stick skillet over medium-high heat. Add peppers and cook on stove top for about 5 minutes.
Step 3
Toast the wheat bread and then spread hummus, flax seeds, and spinach on top
Step 4
Lastly add the peppers. Enjoy!
but some of them are like this:
Step 1
Using a fork, mash up the tuna really well until the consistency is even.
Step 2
Mix in the avocado until smooth.
Step 3
Add salt and pepper to taste. Enjoy!
I have to say I am new to regex, so if the solution is obvious, please forgive me.
Edit: Here is my code:
stepOrder = []
# STEPS
txtSteps = re.split("\n",directions.text)
listOfLists = [[] for i in range(len(txtSteps)) if i % 2 == 0]
for i in range(len(listOfLists)):
    listOfLists[i] = [txtSteps[i*2],txtSteps[i*2+1]]
recipe["steps"] = listOfLists
print(listOfLists)
directions.text is every one of these examples I gave. I can share what it is too, but I think it's irrelevant.
You can solve this problem by matching with the following regex:
(?<=\d\n).*
Basically, .* matches the rest of the line, provided it is immediately preceded by one digit (\d) and one newline character (\n).
Check the regex demo here.
Your whole Python snippet then becomes simplified as follows using the re.findall method:
# STEPS
steps = re.findall(r"(?<=\d\n).*", directions.text)
out = [[{'order':i+1, 'step': step}] for i, step in enumerate(steps)]
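To make the lookbehind approach concrete, here is a small self-contained sketch that applies it to one of the sample strings from the question. The variable names text and numbered are placeholders chosen for this example; in the original code the string comes from directions.text.

```python
import re

# Sample text taken from the question; in the original code this would be
# directions.text.
text = (
    "Step 1\n"
    "Using a fork, mash up the tuna really well until the consistency is even.\n"
    "Step 2\n"
    "Mix in the avocado until smooth.\n"
    "Step 3\n"
    "Add salt and pepper to taste. Enjoy!"
)

# Keep only the line that directly follows a "digit + newline" sequence.
steps = re.findall(r"(?<=\d\n).*", text)

# Pair each step with its 1-based position.
numbered = [{"order": i, "step": step} for i, step in enumerate(steps, start=1)]

for item in numbered:
    print(item["order"], item["step"])
```

Since . does not match a newline, each match stops at the end of its line, so only the first line after each "Step N" header is kept.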
f = open("your_file_name")
content = f.read()
f.close()
for line in content.split("\n"):
    if re.match("^&",line):
        continue
    print(line)

What Python RegEx can I use to indicate a pattern only in the end of an Excel cell

I am working with a dataset where I am separating the contents of one Excel column into 3 separate columns. A mock version of the data is as follows:
Movie Titles/Category/Rating
Wolf of Wall Street A-13 x 9
Django Unchained IMDB x 8
The EXPL Haunted House FEAR x 7
Silver Lining DC-23 x 8
This is what I want the results to look like:
Title | Category | Rating
Wolf of Wall Street | A-13 | 9
Django Unchained | IMDB | 8
The EXPL Haunted House | FEAR | 7
Silver Lining | DC-23 | 8
Here is the RegEx I used to successfully separate the cells:
For Rating, this RegEx worked:
data[['Movie Titles/Category/Rating', 'Rating']] = data['Movie Titles/Category/Rating'].str.split(' x ', expand = True)
However, to separate Category from movie titles, this RegEx doesn't work:
data['Category']=data['Movie Titles/Category/Rating'].str.extract('((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4}$))', expand = True)
Since the uppercase-letter pattern also appears in the middle of the third cell (EXPL, whereas I only want to move FEAR into a separate column), the regex pattern '\s[A-Z]{4}$' is not working. Is there a way to indicate in the RegEx pattern that I only want the uppercase text at the end of the table cell (FEAR) and not the text in the middle (EXPL)?
You can use
import pandas as pd
df = pd.DataFrame({'Movie Titles/Category/Rating':['Wolf of Wall Street A-13 x 9','Django Unchained IMDB x 8','The EXPL Haunted House FEAR x 7','Silver Lining DC-23 x 8']})
df2 = df['Movie Titles/Category/Rating'].str.extract(r'^(?P<Movie>.*?)\s+(?P<Category>\S+)\s+x\s+(?P<Rating>\d+)$', expand=True)
See the regex demo.
Details:
^ - start of string
(?P<Movie>.*?) - Group (Column) "Movie": any zero or more chars other than line break chars, as few as possible
\s+ - one or more whitespaces
(?P<Category>\S+) - Group "Category": one or more non-whitespace chars
\s+x\s+ - x enclosed with one or more whitespaces
(?P<Rating>\d+) - Group "Rating": one or more digits
$ - end of string.
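The breakdown above can be checked without pandas. The sketch below applies the same pattern with the standard re module to the four sample cells from the question; the names ROW, cells and rows are placeholders for this example only.

```python
import re

# Same pattern as in the answer, compiled with the standard library.
ROW = re.compile(r"^(?P<Movie>.*?)\s+(?P<Category>\S+)\s+x\s+(?P<Rating>\d+)$")

cells = [
    "Wolf of Wall Street A-13 x 9",
    "Django Unchained IMDB x 8",
    "The EXPL Haunted House FEAR x 7",
    "Silver Lining DC-23 x 8",
]

# groupdict() returns one dictionary per cell, keyed by the group names.
rows = [ROW.match(cell).groupdict() for cell in cells]

for row in rows:
    print(row["Movie"], "|", row["Category"], "|", row["Rating"])
```

The lazy .*? for the title keeps growing until the rest of the pattern fits, so EXPL stays in the title and only the last token before " x " becomes the category.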
Assuming there is always x between Category and Rating, and the Category has no spaces in it, then the following should get what you want:
(.*) (.*) x (\d+)
I think
'((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4})) x'
would work for you - to indicate that you want the part of the string that comes right before the x. (Assuming that pattern is always true for your data.)

Regex: Extract only number before the word

I have some sentences like
"Apartment at Chennai has 4 rooms with a swimming pool"
"this apartment has 2/3 room with a coridor"
How can I extract only the number before the word "room" or "rooms"?
The expected results are 4 and 2/3.
The code I tried:
room_found =re.findall(r"\d\s?\/?\d?\s?(?=(rooms)|(room))", str_arg)
print(room_found)
This prints [('', '4 room')] and [('', '2/3 room')], but I am expecting only 4 and 2/3 to be printed.
You can use
\d+(?:/\d+)?(?=\s?rooms?\b)
Explanation
\d+ Match 1+ digits (Or \d for a single digit)
(?:/\d+)? Optionally match / and 1+ digits
(?= Positive lookahead to assert what is directly at the right is
\s?rooms?\b Match an optional whitespace char followed by room or rooms
) Close the lookahead
Regex demo
import re
str_arg = "\"Apartment at Chennai has 4 rooms with a swimming pool\" \"this apartment has 2/3 room with a coridor\""
room_found =re.findall(r"\d+(?:/\d+)?(?=\s?rooms?\b)", str_arg)
print(room_found)
Output
['4', '2/3']
You can add a lookahead to find all the numbers before the words room and rooms.
You can also list all the valid characters inside square brackets, since you allow a forward slash in addition to digits.
[\d\/]+ => It matches digits and forward slashes.
(?=\srooms?) => It requires the match to be followed by a whitespace character and the word room or rooms.
[\d\/]+(?=\srooms?)
Please find demo here
You can use ([0-9]+) rooms. So your final code would look like:
import re
str_arg = "\"Apartment at Chennai has 4 rooms with a swimming pool\" \"this apartment has 2/3 room with a coridor\""
room_found = re.findall(r"([0-9]+) rooms", str_arg)
print(room_found)
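For a side-by-side view, the following sketch runs the three patterns suggested in the answers over the two sentences from the question. The dictionary keys are labels invented for this comparison, not names from the answers.

```python
import re

sentences = [
    "Apartment at Chennai has 4 rooms with a swimming pool",
    "this apartment has 2/3 room with a coridor",
]

# The three patterns proposed in the answers, under made-up labels.
patterns = {
    "lookahead": r"\d+(?:/\d+)?(?=\s?rooms?\b)",
    "char_class": r"[\d\/]+(?=\srooms?)",
    "capture_group": r"([0-9]+) rooms",
}

# Collect every match of each pattern across both sentences.
results = {
    name: [match for s in sentences for match in re.findall(pattern, s)]
    for name, pattern in patterns.items()
}

for name, found in results.items():
    print(name, found)
```

On these two sentences the lookahead variants return both 4 and 2/3, while the last pattern only returns 4, because it requires the literal word "rooms" and does not allow a slash.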

How to extract string that contains specific characters in Python

I'm trying to extract ONLY the one string that contains the $ character. The input is the text that I extracted using BeautifulSoup.
Code
price = [m.split() for m in re.findall(r"\w+/$(?:\s+\w+/$)*", soup_content.find('blockquote', { "class": "postcontent restore" }).text)]
Input
For Sale is my Tag Heuer Carrera Calibre 6 with box and papers and extras.
39mm
47 ish lug to lug
19mm in between lugs
Pretty thin but not sure exact height. Likely around 12mm (maybe less)
I've owned it for about 2 years. I absolutely love the case on this watch. It fits my wrist and sits better than any other watch I've ever owned. I'm selling because I need cash and other pieces have more sentimental value
I am the second owner, but the first barely wore it.
It comes with barely worn blue leather strap, extra suede strap that matches just about perfectly and I'll include a blue Barton Band Elite Silicone.
I also purchased an OEM bracelet that I personally think takes the watch to a new level. This model never came with a bracelet and it was several hundred $ to purchase after the fact.
The watch was worn in rotation and never dropped or knocked around.
The watch does have hairlines, but they nearly all superficial. A bit of time with a cape cod cloth would take care of a lot it them. The pics show the imperfections in at "worst" possible angle to show the nature of scratches.
The bracelet has a few desk diving marks, but all in all, the watch and bracelet are in very good shape.
Asking $2000 obo. PayPal shipped. CONUS.
It's a big hard to compare with others for sale as this one includes the bracelet.
The output should be like this.
2000
You don't need a regex. Instead you can iterate over lines and over each word to check for starting with '$' and extract the word:
[word[1:] for line in s.split('\n') for word in line.split() if word.startswith('$') and len(word) > 1]
where s is your paragraph.
which outputs:
['2000']
Since this is very simple, you don't need a regex solution; this should suffice:
words = text.split()
words_with_dollar = [word for word in words if '$' in word]
print(words_with_dollar)
>>> ['$', '$2000']
If you don't want the dollar sign alone, you can add a filter like this:
words_with_dollar = [word for word in words if '$' in word and '$' != word]
print(words_with_dollar)
>>> ['$2000']
I would do something like this (assuming input is the string you wrote above):
price_start = input.find('$')
price = input[price_start:].split(' ')[0]
This works only if there is a single occurrence of $, as you said.
Alternatively, you could use a regex like this:
price = re.findall(r'\S*\$\S*\d', input)[0]
price = price.replace('$', '')
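As a possible variation on the regex idea, a capture group can return the digits directly, without the follow-up replace. The pattern \$(\d+) below is an assumption of this sketch, not something from the answers, and the text is shortened to the two lines of the listing that contain a dollar sign.

```python
import re

# Two lines from the listing in the question that contain a dollar sign.
text = (
    "This model never came with a bracelet and it was several hundred $ "
    "to purchase after the fact.\n"
    "Asking $2000 obo. PayPal shipped. CONUS."
)

# Assumed pattern: a literal $ immediately followed by digits; only the
# digits are captured, so a bare "$" is skipped.
prices = re.findall(r"\$(\d+)", text)
print(prices)  # ['2000']

# For comparison: the str.find approach stops at the first "$" in the text.
first_dollar_word = text[text.find("$"):].split(" ")[0]
print(first_dollar_word)  # $
```

The sample text contains a bare "$" before the actual price, which is why the find-based approach only applies when there is a single occurrence.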

getting words between m and n characters

I am trying to get all names that start with a capital letter and end with a full stop on the same line, where the number of characters is between 3 and 5.
My text is as follows:
King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
Nor would we deigne him buriall of his men,
Till he disbursed, at Saint Colmes ynch,
Ten thousand Dollars, to our generall vse
King. No more that Thane of Cawdor shall deceiue
Our Bosome interest: Goe pronounce his present death,
And with his former Title greet Macbeth
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne.
I am testing it out on this link. I am trying to get all words between 3 and 5 characters long but haven't succeeded.
Does this produce your desired output?
import re
re.findall(r'[A-Z].{2,4}\.', text)
When text contains the text in your question it will produce this output:
['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']
The regex pattern matches an initial capital letter followed by any 2 to 4 characters and then a literal dot. You can tighten that up if required, e.g. using [a-z]: the pattern [A-Z][a-z]{2,4}\. would match an uppercase character followed by 2 to 4 lowercase characters and then a literal dot/period.
If you don't want duplicates you can use a set to get rid of them:
>>> set(re.findall(r'[A-Z].{2,4}\.', text))
{'Rosse.', 'King.'}
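The tightened pattern can be tried on part of the text from the question. In the sketch below, anchoring with ^ and re.MULTILINE is an added assumption (the speaker names in the sample all sit at the start of a line), and SPEAKER, names and unique_names are names made up for the example.

```python
import re

# A few lines from the text in the question.
text = """King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
King. No more that Thane of Cawdor shall deceiue
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne."""

# Assumption: names appear at the start of a line, hence ^ with re.MULTILINE.
SPEAKER = re.compile(r"^[A-Z][a-z]{2,4}\.", re.MULTILINE)

names = SPEAKER.findall(text)
unique_names = sorted(set(names))

print(names)         # ['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']
print(unique_names)  # ['King.', 'Rosse.']
```

Sorting the set gives a stable order, which a plain set does not guarantee.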
You may have your own reasons for wanting to use regexs here, but Python provides a rich set of string methods and (IMO) it's easier to understand the code using these:
matched_words = []
with open('text.txt') as f:
    for line in f:
        for word in line.split():
            # word ends with '.', so the name itself has len(word) - 1 characters
            if word[0].isupper() and word[-1] == '.' and 3 <= len(word) - 1 <= 5:
                matched_words.append(word)
print(matched_words)
