A regular expression to find the second occurrence in a DataFrame - python

I have a string column in a DataFrame from which I want to extract a rating, namely the last occurrence of digits separated by a forward slash.
After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP*
I want to get 14/10

Try this code to get a list of all the digits/digits tokens in this sentence:
import re
re.findall(r"\d+/\d+","After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP* ")
To get the second one, just select the second item in this list (index 1, or index -1 for the last).
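As a minimal sketch of that idea, the helper below (the name last_rating is made up for illustration) collects every digits/digits token and returns the last one, or None when the text has no such token. For a DataFrame column it could presumably be applied row by row, for example through Series.apply.

```python
import re

# Matches tokens such as 9/11 or 14/10.
RATING = re.compile(r"\d+/\d+")

def last_rating(text):
    """Return the last digits/digits token in text, or None if there is none."""
    matches = RATING.findall(text)
    return matches[-1] if matches else None

s = ("After so many requests, this is Bretagne. She was the last surviving "
     "9/11 search dog, and our second ever 14/10. RIP*")
print(last_rating(s))
```

Taking the last match rather than the second avoids an IndexError on rows that contain only one such token.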

This regex works:
>>> import re
>>> s = "After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP*"
>>> re.findall(r"\d+/(?=[^/]*$)\d+", s)
['14/10']

Using difflib.get_close_matches to replace word in string - Python

difflib.get_close_matches can return a single close match when I supply the sample string and a list of candidate words. How can I use that close match to replace the token it was found for in the string?
# difflibQuestion.py
import difflib
word = ['Summerdalerise', 'Winterstreamrise']
line = 'I went up to Winterstreamrose.'
result = difflib.get_close_matches(line,word,n=1)
print(result)
Output:
['Winterstreamrise']
I want to produce the line:
I went up to Winterstreamrise.
For many lines and words.
I have checked the docs:
I can't find any reference to the string index of the match found by difflib.get_close_matches;
the other module classes and functions return lists.
I Googled "python replace word in line using difflib" etc. I can't find any reference to anyone else asking/writing about it. It would seem a common scenario to me.
This example is of course a simplified version of my real-world scenario, which may be of help, since I am dealing with table data rather than single lines:
Surname, First names, Street Address, Town, Job Description
My 'words' are a large list of street base names, e.g. MAIN, EVERY, EASY, LOVERS (without the Road, Street, Lane), so difflib.get_close_matches could be used to substitute the string in column x (the 'line') with its closest matching 'word'.
However I would appreciate anyone suggesting an approach to either of these examples.
You could try something like this:
import difflib
possibilities = ['Summerdalerise', 'Winterstreamrise']
line = 'I went up to Winterstreamrose.'
newWords = []
for word in line.split():
    result = difflib.get_close_matches(word, possibilities, n=1)
    newWords.append(result[0] if result else word)
result = ' '.join(newWords)
print(result)
Output:
I went up to Winterstreamrise
Explanation:
The docs show a first argument named word, and there is no suggestion that get_close_matches() has any awareness of sub-words within this argument; rather, it reports on the closeness of a match between this word atomically and the list of possibilities supplied as the second argument.
We can add the awareness of words within line by splitting it into a list of such words which we iterate over, calling get_close_matches() for each word separately and modifying the word in our result only if there is a match.
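Note that splitting on whitespace passes 'Winterstreamrose.' with its full stop to get_close_matches(), so the punctuation is lost in the output. One possible variation, sketched here with a hypothetical helper name replace_close_matches, is to let re.sub() find the runs of word characters and replace only those, leaving everything between them untouched.

```python
import difflib
import re

def replace_close_matches(line, possibilities, cutoff=0.6):
    """Replace each word in line by its closest match, keeping punctuation."""
    def swap(match):
        word = match.group(0)
        close = difflib.get_close_matches(word, possibilities, n=1, cutoff=cutoff)
        # Keep the original word when nothing is close enough.
        return close[0] if close else word
    # \w+ matches only the word itself, so '.', ',' and spaces stay in place.
    return re.sub(r"\w+", swap, line)

possibilities = ['Summerdalerise', 'Winterstreamrise']
result = replace_close_matches('I went up to Winterstreamrose.', possibilities)
print(result)
```

The cutoff parameter is the same one get_close_matches() accepts; raising it makes replacements more conservative.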

I have a list and I want to print a specific string from it. How can I do that?

So far I have done this, but it returns the movie name. I want the year, e.g. 1995, in a separate list.
moviename=[]
for i in names:
    moviename.append(i.split(' (',1)[0])
One issue with your code is that you're taking the first element of the list returned by split, which is the movie title. You want the second element, split()[1].
That being said, this solution won't work very well for a couple of reasons:
You will still have the closing parenthesis in the year, "1995)".
It won't work if the title itself contains parentheses (e.g. for Shanghai Triad).
If the year is always at the end of each string, you could do something like this.
movie_years = []
for movie in names:
    movie_years.append(movie[-5:-1])
You could use a regular expression.
\(\d*\) will match an opening bracket, followed by any number of digit characters (0-9, possibly none), followed by a closing bracket.
Put only the \d+ part inside a capturing group to get only that part.
import re
year_regex = r'\((\d+)\)'
moviename=[]
for i in names:
    if re.search(year_regex, i):
        moviename.append(re.search(year_regex, i).group(1))
By the way, you can make this all more concise using a list comprehension:
year_regex = r'\((\d+)\)'
moviename = [re.search(year_regex, name_and_year).group(1)
             for name_and_year in names
             if re.search(year_regex, name_and_year)]
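If the year always comes last, a slightly stricter variant may be safer: anchor the pattern to the end of the string and require four digits, so parentheses inside the title are ignored. This is only a sketch; the names list below is hypothetical sample data, since the original list is not shown in the question.

```python
import re

# Hypothetical sample data; the original list is not shown in the question.
names = [
    'Toy Story (1995)',
    'Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)',
    'Untitled entry',
]

# Four digits in parentheses at the very end of the string.
year_regex = re.compile(r'\((\d{4})\)\s*$')

movie_years = []
for name in names:
    match = year_regex.search(name)
    # Keep a None placeholder so the list stays aligned with names.
    movie_years.append(match.group(1) if match else None)

print(movie_years)
```

Searching once per name and reusing the match object also avoids running the same regex twice.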

Python Regex: how to not select whitespace before last string?

I am a newbie struggling to separate a database into columns with re.findall().
I want to separate these Dutch street names into name and number.
Roemer Visscherstraat 15
Vondelstraat 102-huis
For the number I use
\S*$
Which works just fine. For the street name I use
^\S.+[^\S$]
Or: use everything but the last element, which may be a number or a combination of a number and something else.
Problem is: Python then also keeps the last whitespace after the last name, so I get:
'Roemer Visscherstraat '
Any way I can stop this from happening?
Also, findall returns a list consisting of the bit of the database I wanted and an empty string. How does this happen, and can I prevent it somehow?
Thanks so much in advance for your help.
You can rstrip() the name to remove any spaces at the end of it:
>>> 'Roemer Visscherstraat '.rstrip()
'Roemer Visscherstraat'
But if the input is similar to the one you posted, you can simply use split() instead of regex, for example:
st = 'Roemer Visscherstraat 15'
data = st.split()
num = data[-1]
name = ' '.join(data[:-1])
print('Name: {}, Number: {}'.format(name, num))
output:
Name: Roemer Visscherstraat, Number: 15
For the number you should use the following:
\S+$
Using a + instead of a * will ensure that you have at least one character in the match.
For the street name you can use the following:
^.+(?=\s\S+$)
This selects the text up to, but not including, the whitespace before the number.
However, what you may consider doing is using one regex match with capture groups instead. The following would work:
^(.+(?=\s\S+$))\s(\S+$)
In this case, the first capture group gives you the street name, and the second gives you the number.
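A small sketch of the single-regex approach in Python, using the two street names from the question. The helper name split_address is invented for illustration.

```python
import re

# One pattern, two capture groups: street name, then the trailing token.
ADDRESS = re.compile(r'^(.+(?=\s\S+$))\s(\S+$)')

def split_address(address):
    """Return (name, number), or None when the address has no whitespace."""
    match = ADDRESS.match(address)
    return match.groups() if match else None

for address in ['Roemer Visscherstraat 15', 'Vondelstraat 102-huis']:
    print(split_address(address))
```

Because the name group stops before the separating whitespace, no rstrip() is needed afterwards.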
([^\d]*)\s+(\d.*)
In this regex, the first group captures everything before the whitespace that precedes the number, and the second group captures the number itself.
My assumption is that the number begins with a digit and the name contains no digits.
Take a look at https://regex101.com/r/eW0UP2/1
Roemer Visscherstraat 15
Full match 0-24 `Roemer Visscherstraat 15`
Group 1. 0-21 `Roemer Visscherstraat`
Group 2. 22-24 `15`
Vondelstraat 102-huis
Full match 24-46 `Vondelstraat 102-huis`
Group 1. 24-37 `Vondelstraat`
Group 2. 38-46 `102-huis`

python regex boolean statement not working

My problem is that this simple regex with the alternation operator | only gives me the result I want when the alternative on the left side of | is present in the sentence. Could someone tell me why it isn't working for the other alternative as well?
import re
b = 'this is a good day to die hard'
jeff = re.search('good night (.+)hard|good day (.+)hard', b)
print(jeff.group(1))
You have two sets of capturing parentheses - therefore you have two numbered capturing groups. If the second branch matches, the group(1) will be set to None, and group(2) will contain that which was matched by the second group.
There are several ways to fix this. One would be to write so that there is just one group, for example
jeff = re.search('good (?:day|night) (.+)hard', b)
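To make the difference visible, here is a short sketch that runs both patterns on the sentence from the question. The variable names two_groups and one_group are only for illustration.

```python
import re

two_groups = re.compile(r'good night (.+)hard|good day (.+)hard')
one_group = re.compile(r'good (?:day|night) (.+)hard')

b = 'this is a good day to die hard'

# With two capturing groups, the branch that did not match yields None.
both = two_groups.search(b).groups()
print(both)

# With a single group, group(1) is filled whichever alternative matched.
single = one_group.search(b).group(1)
print(single)
```

The captured text keeps its trailing space because (.+) runs right up to the literal hard.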
The second (...) creates the second capturing group that you need to access with .group(2).
You may write a regex in which the first group captures day or night and the second group fetches the rest of the string, up to and including the last hard.
import re
b = 'this is a good day to die hard'
jeff = re.search('good (day|night) (.+)', b)
if jeff:
    print(jeff.group(1))
    print(jeff.group(2))
Output of the demo:
day
to die hard

Python: Replace all substring occurrences with regular expressions

I would like to replace all substring occurrences with regular expressions. The original sentences would be like:
mystring = "Carl's house is big. He is asking 1M for that(the house)."
Now let's suppose I have two substrings I would like to bold. I bold the words by adding ** at the beginning and at the end of the substring. The 2 substrings are:
substring1 = "house", so bolded it would be "**house**"
substring2 = "the house", so bolded it would be "**the house**"
At the end I want the original sentence like this:
mystring = "Carl's **house** is big. He is asking 1M for that(**the house**)."
The main problem is that as I have several substrings to replace, they can overlap words like the example above. If I analyze the longest substring at first, I am getting this:
Carl's **house** is big. He is asking 1M for that(**the **house****).
On the other hand, if I analyze the shortest substring first, I am getting this:
Carl's **house** is big. He is asking 1M for that(the **house**).
It seems I will need to replace from the longest substring to the shortest, but I wonder how to make sure that text already bolded in the first replacement is not matched again in the second. Also remember that a substring can appear several times in the string.
Note: Suppose the string ** will never occur in the original string, so we can use it to bold our words.
You can search for all of the strings at once, so that the fact that one is a substring of another doesn't matter:
re.sub(r"(house|the house)", r"**\1**", mystring)
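For an arbitrary list of substrings, the same single-pass idea could be generalised roughly as follows. This is a sketch, not a definitive implementation: the function name bold_substrings is invented here, the alternatives are escaped with re.escape, and they are sorted longest first so that a longer phrase wins over a shorter one starting at the same position.

```python
import re

def bold_substrings(text, substrings):
    """Wrap every occurrence of the given substrings in ** in one pass."""
    # Longest first: the regex engine tries alternatives left to right.
    ordered = sorted(substrings, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in ordered))
    # A function replacement avoids any backslash handling in the template.
    return pattern.sub(lambda m: "**" + m.group(0) + "**", text)

mystring = "Carl's house is big. He is asking 1M for that(the house)."
bolded = bold_substrings(mystring, ["house", "the house"])
print(bolded)
```

Since each character of the text is consumed by at most one match, already bolded text is never matched a second time.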
You could have a group that is not captured and is not required. If you look at the regex pattern (?P<repl>(?:the )?house), the (?:the )? part says that there might be a "the " in the string and, if it is present, includes it in the match. This way, you let the re library optimize the way it matches. Here is the complete example:
>>> data = "Carl's house is big. He is asking 1M for that(the house)."
>>> re.sub(r'(?P<repl>(?:the )?house)', r'**\g<repl>**', data)
"Carl's **house** is big. He is asking 1M for that(**the house**)."
Note: \g<repl> is used to get all the string matched by the group <repl>
You could do two passes:
First: Go through from longest to shortest and replace with something like:
'the house': 'AA_THE_HOUSE'
'house': 'BB_HOUSE'
Second: Go through replace like:
'AA_THE_HOUSE': '**the house**'
'BB_HOUSE': '**house**'
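Replace the strings with unique temporary values, then replace those values with the original strings enclosed in ** to make them bold. The temporary values must not contain any of the substrings themselves, otherwise a later replacement would match inside them.
For example, working from the longest substring to the shortest:
'the house' with 'TEMP_1'
'house' with 'TEMP_2'
then 'TEMP_2' with '**house**'
'TEMP_1' with '**the house**'
This should work fine. You can automate this by using two lists.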
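The two-pass approach could be automated along these lines. This is only a sketch under stated assumptions: the function name bold_two_pass is made up, and the placeholders are single characters from the Unicode Private Use Area, which are assumed never to occur in the text or in the substrings.

```python
def bold_two_pass(text, substrings):
    """Bold substrings via temporary placeholders, longest substring first."""
    ordered = sorted(substrings, key=len, reverse=True)
    # Assumption: Private Use Area characters do not appear in the input,
    # so no substring can ever match inside a placeholder.
    placeholders = {s: chr(0xE000 + i) for i, s in enumerate(ordered)}

    # First pass: hide every occurrence behind its placeholder.
    for s in ordered:
        text = text.replace(s, placeholders[s])

    # Second pass: swap each placeholder for the bolded original.
    for s in ordered:
        text = text.replace(placeholders[s], "**" + s + "**")
    return text

mystring = "Carl's house is big. He is asking 1M for that(the house)."
bolded = bold_two_pass(mystring, ["house", "the house"])
print(bolded)
```

Once 'the house' is hidden behind its placeholder, the later replacement of 'house' can no longer see it, which is what prevents the nested ** markers.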
