Extract a part of a string using Regex in Python Pandas [duplicate] - python

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 2 years ago.
I'm a student working on a data science project, and I need to extract part of a string from one column of my dataframe.
The column in question is annotation.
I want to extract the part HOTHOTVIDEO from a string like "HOTHOTVIDEOHOT0501005107FilmVidéoClub"
So I wrote this instruction using a regex:
facturation['annotation'] = facturation['annotation'].str.findall(r'([A-Z0-9]{3}\d+)').apply(''.join)
It extracts everything correctly, except when I have strings like "CTVCANALVODCTV0200052670CTV0200052670": it returns CTV0200052670CTV0200052670, but I only want the first occurrence, CTV0200052670.
Can someone help me fix this issue, please? :)

I think the problem is the combination of findall with apply + join: the pattern matches twice in your data, and join then concatenates both matches. findall returns a list, and you need only its first item, not all of them.
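A minimal sketch of keeping only the first match, assuming the wanted part is the three-character code followed by digits that the question's regex matches (for example HOT0501005107). It uses pandas' Series.str.extract, which returns the first match of the capture group for each row; the sample rows are the two strings quoted in the question.

```python
import pandas as pd

# Sample data built from the two strings quoted in the question.
facturation = pd.DataFrame({
    "annotation": [
        "HOTHOTVIDEOHOT0501005107FilmVidéoClub",
        "CTVCANALVODCTV0200052670CTV0200052670",
    ]
})

# str.extract keeps only the FIRST match of the capture group in each row;
# expand=False returns a Series instead of a one-column DataFrame.
first_code = facturation["annotation"].str.extract(
    r"([A-Z0-9]{3}\d+)", expand=False
)

print(first_code.tolist())
```

Rows where the pattern does not match at all come back as missing values, so they may need separate handling.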

Well, thanks to everyone who helped me :) I found the answer:
facturation['annotation'] = facturation['annotation'].str.findall(r'([A-Z0-9]{3}\d+)').apply(''.join)
facturation['annotation'] = facturation['annotation'].str.extract('(.{0,13})')

Related

Regular Expression in Python for (\xa0) and (<a).*(>).*(</a>) [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
I was reading through some code for pre-processing text data and came across these regexes, and I'm struggling to figure out what they mean.
ReviewText = ReviewText.str.replace('(<a).*(>).*(</a>)', '')
ReviewText = ReviewText.str.replace('(\xa0)', ' ')
Well, it looks like they are processing HTML with regexes. Folks generally frown on that, but since you are using the code rather than developing it, we'll ignore that issue for now.
Looks like the first line would take an HTML link (an <a ...> ... </a> element) with link text such as:
Visit W3Schools.com!
and replace it with nothing.
The second one takes each non-breaking space character (\xa0) and changes it to a regular space.
As the person above stated, you need both the regexp and some input to actually see what it does. Once you have both, I recommend experimenting with a regexp checker such as https://pythex.org/ (or an equivalent).
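To see both patterns in action, here is a small sketch using the standard re module on a made-up string (the sample text and the example.com URL are assumptions, not from the original code). Note that in recent pandas versions Series.str.replace treats the pattern as a literal string unless regex=True is passed, so the original lines may need that argument.

```python
import re

# Hypothetical review text: a non-breaking space plus one HTML link.
review = 'Nice\xa0read <a href="https://example.com">link</a>!'

# First pattern: remove everything from "<a" through "</a>".
without_link = re.sub(r"(<a).*(>).*(</a>)", "", review)

# Second pattern: turn each non-breaking space into a regular space.
cleaned = re.sub(r"(\xa0)", " ", without_link)
print(repr(cleaned))

# Caveat: .* is greedy, so with two links on one line the text between
# them is removed as well.
two_links = '<a href="#1">one</a> keep me <a href="#2">two</a>'
greedy_result = re.sub(r"(<a).*(>).*(</a>)", "", two_links)
print(repr(greedy_result))
```

The three capture groups in the first pattern are not needed for a plain removal; they only matter if the replacement refers back to them.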

findall string that starts with letter "CU" and return full string [duplicate]

This question already has answers here:
pandas select from Dataframe using startswith
(5 answers)
Closed 3 years ago.
It seems like a straightforward thing, but I could not find an appropriate SO answer.
I have a column called title which contains strings. I want to find the rows that start with the letters "CU".
I've tried using df.loc, but it gives me an IndexError.
Using regex, re.findall(r'^CU', string)
returns 'CU' instead of the full name, e.g. 'CU abcd'. How can I get the full name that starts with 'CU'?
EDIT: Sorry, I did not notice this was a duplicate question. The problem was solved by reading the duplicate question.
You can try:
string.startswith("CU")
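Applied to a whole DataFrame column, the same idea can be sketched with pandas' Series.str.startswith and boolean indexing. The column name title comes from the question; the sample rows are made up for illustration.

```python
import re

import pandas as pd

# Hypothetical data; only the column name "title" comes from the question.
df = pd.DataFrame({"title": ["CU abcd", "XY efgh", "CU ijkl", "ACU mnop"]})

# Boolean mask: True where the title starts with "CU".
mask = df["title"].str.startswith("CU")
cu_rows = df[mask]
print(cu_rows["title"].tolist())

# With plain re, adding .* after the prefix returns the full string
# instead of only the matched prefix.
full_match = re.findall(r"^CU.*", "CU abcd")
print(full_match)
```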

get strings between 2 delimiter in python [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 4 years ago.
I would like to get, from the following string "/path/to/%directory_1%/%directory_2%.csv",
the following list: [directory_1, directory_2]. I would like to avoid splitting my string on "%". I was hoping to find a regex that could help me, but I cannot find the correct one.
For now, I have the following:
re.findall('%(.*)%', dirty_arg)
which outputs ["directory_1%/%directory_2"]
Do you have any recommendation?
Thank you very much for your help.
Try this:
import re
regex = r"%(.*?)%"
dirty_arg = "/path/to/%directory_1%/%directory_2%.csv"
print(re.findall(regex, dirty_arg))
I've added ? to your regex, which makes .* non-greedy: it matches as few characters as possible, so each match stops at the next %. The output of this code is ['directory_1', 'directory_2']
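For comparison, a short sketch that runs the greedy pattern, the non-greedy pattern, and a negated character class side by side. The negated class [^%]* is an alternative not mentioned in the original answer; it gives the same result here because it can never run past a % sign.

```python
import re

dirty_arg = "/path/to/%directory_1%/%directory_2%.csv"

# Greedy: .* runs to the LAST % in the string.
greedy = re.findall(r"%(.*)%", dirty_arg)

# Non-greedy: .*? stops at the NEXT %.
lazy = re.findall(r"%(.*?)%", dirty_arg)

# Alternative: match any run of characters that are not %.
negated = re.findall(r"%([^%]*)%", dirty_arg)

print(greedy)
print(lazy)
print(negated)
```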

How to write the regex in Python to maintain the group content unchanged? [duplicate]

This question already has an answer here:
Python re.sub back reference not back referencing [duplicate]
(1 answer)
Closed 5 years ago.
I'm trying to use Python to run a regex replacement like the one below:
a = "%I'm a sentence.|"
re.sub(r"%(.*?)\|", "<\1>", a)
Then b = <\1>, but I want to get the result <I'm a sentence.>
How am I supposed to achieve this? I tried to capture I'm a sentence. as a group, but I must have done something wrong, because the result doesn't keep group 1.
If you have any ideas, please let me know. Thank you very much in advance!
Use a raw string for the replacement, otherwise \1 will be interpreted as an octal character code, not a back-reference.
And assign the result to b.
b = re.sub(r"%(.*?)\|", r"<\1>", a)
To refer to a capture group in the replacement, you can also use \g<1>:
a = "%I'm a sentence.|"
a = re.sub(r"%(.*?)\|", r"<\g<1>>", a)
# <I'm a sentence.>
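The difference between the three replacement strings can be checked with a small sketch like the one below. The last line is an extra illustration (not from the answers) of a case where \g<1> is convenient: when the group reference is followed directly by a digit.

```python
import re

a = "%I'm a sentence.|"
pattern = r"%(.*?)\|"

# Without the r prefix, Python turns "\1" into the character chr(1)
# before re.sub ever sees it, so no back-reference happens.
not_raw = re.sub(pattern, "<\1>", a)

# With a raw string, re.sub receives a backslash and a 1: a back-reference.
raw = re.sub(pattern, r"<\1>", a)

# \g<1> is the explicit form of the same back-reference.
explicit = re.sub(pattern, r"<\g<1>>", a)

# \g<1>0 means "group 1, then a literal 0"; r"\10" would mean group 10.
followed_by_digit = re.sub(r"(\d)", r"\g<1>0", "5")

print(repr(not_raw))
print(raw)
print(explicit)
print(followed_by_digit)
```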

Splitting an input in to smaller fragments [duplicate]

This question already has an answer here:
Splitting an input in to fragments (Python)
(1 answer)
Closed 9 years ago.
I need to somehow convert a mathematical input(str) to a number,
e.g.
4-3*2-1+5 = ((((4-3)*2)-1)+5).
Current code looks like this:
Answer = input ('Put your answer here: ')
4-3*2-1+5
Somehow, I need to break the string into smaller fragments so that it is read from left to right, and convert the numbers into integers, but I have no idea how to do it.
I tried doing
Answer.split('+','-','*','/')
But it says TypeError: split() takes at most 2 arguments (4 given)
I also tried adding the answer to a list to see if that helped at all:
li.append(Answer)
(li = ['4-3*2-1+5'])
But I don't see anything beneficial in that.
Please help!
(I'm new to Stack Overflow, so if any information is missing, please tell me what and I will try to correct it.)
What you need to write is a parser and a simple evaluator for simple expressions.
I would start by reading any of the following:
http://pyparsing.wikispaces.com/HowToUsePyparsing
http://pyparsing.wikispaces.com/Examples
http://kmkeen.com/funcparserlib/
There are many other parser libraries; these are just a couple.
You could also use the rply library; its PyPI page has an example that directly implements a simple expression parser and evaluator, just like what you're describing in your question.
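For the strictly left-to-right evaluation described in the question, a full parser library may not be needed. Below is a minimal sketch using only the standard library; it assumes well-formed input made of non-negative integers and the operators + - * /, does no validation, and deliberately ignores normal operator precedence. The function name evaluate_left_to_right is hypothetical.

```python
import operator
import re

# Map each operator symbol to the function that applies it.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def evaluate_left_to_right(expression):
    """Evaluate e.g. '4-3*2-1+5' as ((((4-3)*2)-1)+5)."""
    # Split into numbers and operator symbols, e.g. ['4', '-', '3', ...].
    tokens = re.findall(r"\d+|[-+*/]", expression)
    result = int(tokens[0])
    # Walk the remaining tokens in (operator, number) pairs.
    for symbol, number in zip(tokens[1::2], tokens[2::2]):
        result = OPS[symbol](result, int(number))
    return result


print(evaluate_left_to_right("4-3*2-1+5"))
```

re.findall is used here because str.split accepts only a single separator, which is why the attempt in the question raised a TypeError.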
