I have a file. I want to create a pattern. Then, I will calculate some values. I am using code below. How can I split these 3 values? Also, there white spaces in some columns as below pic.
title=movies[3].split(",")[1]
title
pattern=r"(\d+)[(\d+)-(\d+)]"
import re
re.findall(pattern,title)
You forgot to escape the characters [ and ]. Now it is seen as a definition of the defined set of characters between those brackets.
Change your regular expression to:
pattern = r"(\d+)\[(\d+)-(\d+)\]"
To add optional whitespaces to the regular expression you can use \s*. So the full regular expression would be
pattern = r"(\d+)\[\s*(\d+)-(\d+)\s*\]"
If you're looking specifically for the - character, you will need to escape it using \, as this will make the expression look for the character instead of using it for its other usage of defining a range.
This means your new pattern should look like: (\d+)[(\d+)\-(\d+)]
As a side-note, I can recommend using regex101 to double-check your patterns before using them!
Adding Matthias' answer onto here. If the square brackets are part of the title, you will also want to escape them too like so: (\d+)\[(\d+)\-(\d+)\]
This will look for two values separated by a hyphen, within square brackets.
This is a follow up to this SO post which gives a solution to replace text in a string column
How to replace text in a column of a Pandas dataframe?
df['range'] = df['range'].str.replace(',','-')
However, this doesn't seem to work with double periods or a question mark followed by a period
testList = ['this is a.. test stence', 'for which is ?. was a time']
testDf = pd.DataFrame(testList, columns=['strings'])
testDf['strings'].str.replace('..', '.').head()
results in
0 ...........e
1 .............
Name: strings, dtype: object
and
testDf['strings'].str.replace('?.', '?').head()
results in
error: nothing to repeat at position 0
Add regex=False parameter, because as you can see in the docs, regex it's by default True:
-regex bool, default True
Determines if assumes the passed-in pattern is a regular expression:
If True, assumes the passed-in pattern is a regular expression.
And ? . are special characters in regular expressions.
So, one way to do it without regex will be this double replacing:
testDf['strings'].str.replace('..', '.',regex=False).str.replace('?.', '?',regex=False)
Output:
strings
0 this is a. test stence
1 for which is ? was a time
Replace using regular expression. In this case, replace any sepcial character '.' followed immediately by white space. This is abit curly, I advice you go with #Mark Reed answer.
testDf.replace(regex=r'([.](?=\s))', value=r'')
strings
0 this is a. test stence
1 for which is ? was a time
str.replace() works with a Regex where . is a special character which denotes "any" character. If you want a literal dot, you need to escape it: "\.". Same for other special Regex characters like ?.
First, be aware that the Pandas replace method is different from the standard Python one, which operates only on fixed strings. The Pandas one can behave as either the regular string.replace or re.sub (the regular-expression substitute method), depending on the value of a flag, and the default is to act like re.sub. So you need to treat your first argument as a regular expression. That means you do have to change the string, but it also has the benefit of allowing you to do both substitutions in a single call.
A regular expression isn't a string to be searched for literally, but a pattern that acts as instructions telling Python what to look for. Most characters just ask Python to match themselves, but some are special, and both . and ? happen to be in the special category.
The easiest thing to do is to use a character class to match either . or ? followed by a period, and remember which one it was so that it can be included in the replacement, just without the following period. That looks like this:
testDF.replace(regex=r'([.?])\.', value=r'\1')
The [.?] means "match either a period or a question mark"; since they're inside the [...], those normally-special characters don't need to be escaped. The parentheses around the square brackets tell Python to remember which of those two characters is the one it actually found. The next thing that has to be there in order to match is the period you're trying to get rid of, which has to be escaped with a backslash because this one's not inside [...].
In the replacement, the special sequence \1 means "whatever you found that matched the pattern between the first set of parentheses", so that's either the period or question mark. Since that's the entire replacement, the following period is removed.
Now, you'll notice I used raw strings (r'...') for both; that keeps Python from doing its own interpretation of the backslashes before replace can. If the replacement were just '\1' without the r it would replace them with character code 1 (control-A) instead of the first matched group.
To replace both the ? and . at the same time you can separate by | (the regex OR operator).
testDf['strings'].str.replace('\?.|\..', '.')
Prefix the .. with a \, because you need to escape as . is a regex character:
testDf['strings'].str.replace('\..', '.')
You can do the same with the ?, which is another regex character.
testDf['strings'].str.replace('\?.', '.')
I'm working in Python and try to handle StatsModel's GLM output. I'm relatively new to regular expressions.
I have strings such as
string_1 = "C(State)[T.Kansas]"
string_2 = "C(State, Treatment('Alaska'))[T.Kansas]"
I wrote the following regex:
pattern = re.compile('C\((.+?)\)\[T\.(.+?)\]')
print(pattern.search(string_1).group(1))
#State
print(pattern.search(string_2).group(1))
#State, Treatment('Alaska')
So both of these strings match the pattern. But we want to get State in both cases. Basically we want to get read of everything after comma (including it) inside first brackets.
How can we distinguish the string_2 pattern from string_1's and extract only State without , Treatment?
You can add an optional non-capturing group instead of just allowing all characters:
pattern = re.compile('C\((.+?)(?:, .+?)?\)\[T\.(.+?)\]')
(?:...) groups the contents together without capturing it. The trailing ? makes the group optional.
You may use this regex using negative character classes:
C\((\w+)[^[]*\[T\.([^]]+)\]
RegEx Demo
I have a text column that looks like:
http://start.blabla.com/landing/fb603?&mkw...
I want to extract "start.blabla.com"
which is always between:
http://
and:
/landing/
namely:
start.blabla.com
I do:
df.col.str.extract('http://*?\/landing')
But it doesn't work.
What am I doing wrong?
Your regex matches http:/, then 0+ / symbols as few as possible and then /landing.
You need to match and capture the characters (The extract method accepts a regular expression with at least one capture group.) after http:// other than /, 1 or more times. It can be done with
http://([^/]+)/landing
^^^^^^^
where [^/]+ is a negated character class that matches 1+ occurrences of characters other than /.
See the regex demo
Just to answer a question you didn't ask, if you wanted to extract several portions of the string into separate columns, you'd do it this way:
df.col.str.extract('http://(?P<Site>.*?)/landing/(?P<RestUrl>.*)')
You'd get something along the lines of:
Site RestUrl
0 start.blabla.com fb603?&mkw...
To understand how this regex (and any other regex for that matter) is constructed I suggest you take a look at the excellent site regex101. I constructed a snippet where you can see the above regex in action here.
Give an string like '/apps/platform/app/app_name/etc', I can use
p = re.compile('/apps/(?P<p1>.*)/app/(?P<p2>.*)/')
to get two matched groups of platform and app_name, but how can I use re.sub function (or maybe better way) to replace those two groups with other string like windows and facebook? So the final string would like /apps/windows/app/facebook/etc.
Separate group replacement wouldn't be possible through regex. So i suggest you to do like this.
(?<=/apps/)(?P<p1>.*)(/app/)(?P<p2>.*)/
DEMO
Then replace the matched characters with windows\2facebook/ . And also i suggest you to define your regex as raw string. Lookbehind is used inorder to avoid extra capturing group.
>>> s = '/apps/platform/app/app_name/etc'
>>> re.sub(r'(?<=/apps/)(?P<p1>.*)(/app/)(?P<p2>.*)/', r'windows\2facebook/', s)
'/apps/windows/app/facebook/etc'