Regular expression match / split

Regular expression match / split - python

I am having some trouble trying to figure out how to use regular expressions in python. Ultimately I am trying to do what sscanf does for me in C.
I am trying to match given strings that look like so:
12345_arbitrarystring_2020_05_20_10_10_10.dat
I (seem) to be able to validate this format by calling match on the following regular expression
regex = re.compile('[0-9]{5}_.+_[0-9]{4}([-_])[0-9]{2}([-_])[0-9]{2}([-_])[0-9]{2}([:_])[0-9]{2}([:_])[0-9]{2}\\.dat')
(Note that I do allow for a few other separators then just '_')
I would like to split the given string on these separators so I do:
regex = re.compile('[_\\-:.]+')
parts = regex.split(given_string)
This is all fine .. the problem is that I would like my 'arbitrarystring' part to include '-' and '_' and the last split currently, well, splits them.
Other than manually cutting the timestamp and the first 5 digits off that given string, what can I do to get that arbitrarystring part?

You could use a capturing group to get the arbitrarystring part and omit the other capturing groups.
You could for example use a character class to match 1+ word characters or a hyphen using [\w-]+
If you still want to use split, you could add capturing groups for the first and the second part, and split only those groups.
^[0-9]{5}_([\w-]+)_[0-9]{4}[-_][0-9]{2}[-_][0-9]{2}[-_][0-9]{2}[:_][0-9]{2}[:_][0-9]{2}\.dat$
^^^^^^^^
Regex demo

It seems to be possible to cut down your regex to validate the whole pattern to:
^\d{5}_(.+?)_\d{4}[-_](?:\d{2}[-_]){2}(?:\d{2}[:_]){2}\d{2}\.dat$
Refer to group 1 for your arbitrary string.
Online demo
Quick reminder: You didn't seem to have used raw strings, but instead escaping with a double backslash. Python has raw strings which makes you don't have to escape backslashes nomore.

Related

Finding pattern in Python

I have a file. I want to create a pattern. Then, I will calculate some values. I am using code below. How can I split these 3 values? Also, there white spaces in some columns as below pic.
title=movies[3].split(",")[1]
title
pattern=r"(\d+)[(\d+)-(\d+)]"
import re
re.findall(pattern,title)

You forgot to escape the characters [ and ]. Now it is seen as a definition of the defined set of characters between those brackets.
Change your regular expression to:
pattern = r"(\d+)\[(\d+)-(\d+)\]"
To add optional whitespaces to the regular expression you can use \s*. So the full regular expression would be
pattern = r"(\d+)\[\s*(\d+)-(\d+)\s*\]"

If you're looking specifically for the - character, you will need to escape it using \, as this will make the expression look for the character instead of using it for its other usage of defining a range.
This means your new pattern should look like: (\d+)[(\d+)\-(\d+)]
As a side-note, I can recommend using regex101 to double-check your patterns before using them!
Adding Matthias' answer onto here. If the square brackets are part of the title, you will also want to escape them too like so: (\d+)\[(\d+)\-(\d+)\]
This will look for two values separated by a hyphen, within square brackets.

How to replace '..' and '?.' with single periods and question marks in pandas? df['column'].str.replace not working

This is a follow up to this SO post which gives a solution to replace text in a string column
How to replace text in a column of a Pandas dataframe?
df['range'] = df['range'].str.replace(',','-')
However, this doesn't seem to work with double periods or a question mark followed by a period
testList = ['this is a.. test stence', 'for which is ?. was a time']
testDf = pd.DataFrame(testList, columns=['strings'])
testDf['strings'].str.replace('..', '.').head()
results in
0 ...........e
1 .............
Name: strings, dtype: object
and
testDf['strings'].str.replace('?.', '?').head()
results in
error: nothing to repeat at position 0

Add regex=False parameter, because as you can see in the docs, regex it's by default True:
-regex bool, default True
Determines if assumes the passed-in pattern is a regular expression:
If True, assumes the passed-in pattern is a regular expression.
And ? . are special characters in regular expressions.
So, one way to do it without regex will be this double replacing:
testDf['strings'].str.replace('..', '.',regex=False).str.replace('?.', '?',regex=False)
Output:
strings
0 this is a. test stence
1 for which is ? was a time

Replace using regular expression. In this case, replace any sepcial character '.' followed immediately by white space. This is abit curly, I advice you go with #Mark Reed answer.
testDf.replace(regex=r'([.](?=\s))', value=r'')
strings
0 this is a. test stence
1 for which is ? was a time

str.replace() works with a Regex where . is a special character which denotes "any" character. If you want a literal dot, you need to escape it: "\.". Same for other special Regex characters like ?.

First, be aware that the Pandas replace method is different from the standard Python one, which operates only on fixed strings. The Pandas one can behave as either the regular string.replace or re.sub (the regular-expression substitute method), depending on the value of a flag, and the default is to act like re.sub. So you need to treat your first argument as a regular expression. That means you do have to change the string, but it also has the benefit of allowing you to do both substitutions in a single call.
A regular expression isn't a string to be searched for literally, but a pattern that acts as instructions telling Python what to look for. Most characters just ask Python to match themselves, but some are special, and both . and ? happen to be in the special category.
The easiest thing to do is to use a character class to match either . or ? followed by a period, and remember which one it was so that it can be included in the replacement, just without the following period. That looks like this:
testDF.replace(regex=r'([.?])\.', value=r'\1')
The [.?] means "match either a period or a question mark"; since they're inside the [...], those normally-special characters don't need to be escaped. The parentheses around the square brackets tell Python to remember which of those two characters is the one it actually found. The next thing that has to be there in order to match is the period you're trying to get rid of, which has to be escaped with a backslash because this one's not inside [...].
In the replacement, the special sequence \1 means "whatever you found that matched the pattern between the first set of parentheses", so that's either the period or question mark. Since that's the entire replacement, the following period is removed.
Now, you'll notice I used raw strings (r'...') for both; that keeps Python from doing its own interpretation of the backslashes before replace can. If the replacement were just '\1' without the r it would replace them with character code 1 (control-A) instead of the first matched group.

To replace both the ? and . at the same time you can separate by | (the regex OR operator).
testDf['strings'].str.replace('\?.|\..', '.')
Prefix the .. with a \, because you need to escape as . is a regex character:
testDf['strings'].str.replace('\..', '.')
You can do the same with the ?, which is another regex character.
testDf['strings'].str.replace('\?.', '.')

Regular expressions: distinguish strings including/excluding a given word

I'm working in Python and try to handle StatsModel's GLM output. I'm relatively new to regular expressions.
I have strings such as
string_1 = "C(State)[T.Kansas]"
string_2 = "C(State, Treatment('Alaska'))[T.Kansas]"
I wrote the following regex:
pattern = re.compile('C\((.+?)\)\[T\.(.+?)\]')
print(pattern.search(string_1).group(1))
#State
print(pattern.search(string_2).group(1))
#State, Treatment('Alaska')
So both of these strings match the pattern. But we want to get State in both cases. Basically we want to get read of everything after comma (including it) inside first brackets.
How can we distinguish the string_2 pattern from string_1's and extract only State without , Treatment?

You can add an optional non-capturing group instead of just allowing all characters:
pattern = re.compile('C\((.+?)(?:, .+?)?\)\[T\.(.+?)\]')
(?:...) groups the contents together without capturing it. The trailing ? makes the group optional.

You may use this regex using negative character classes:
C\((\w+)[^[]*\[T\.([^]]+)\]
RegEx Demo

extract string betwen two strings in pandas

I have a text column that looks like:
http://start.blabla.com/landing/fb603?&mkw...
I want to extract "start.blabla.com"
which is always between:
http://
and:
/landing/
namely:
start.blabla.com
I do:
df.col.str.extract('http://*?\/landing')
But it doesn't work.
What am I doing wrong?

Your regex matches http:/, then 0+ / symbols as few as possible and then /landing.
You need to match and capture the characters (The extract method accepts a regular expression with at least one capture group.) after http:// other than /, 1 or more times. It can be done with
http://([^/]+)/landing
^^^^^^^
where [^/]+ is a negated character class that matches 1+ occurrences of characters other than /.
See the regex demo

Just to answer a question you didn't ask, if you wanted to extract several portions of the string into separate columns, you'd do it this way:
df.col.str.extract('http://(?P<Site>.*?)/landing/(?P<RestUrl>.*)')
You'd get something along the lines of:
Site RestUrl
0 start.blabla.com fb603?&mkw...
To understand how this regex (and any other regex for that matter) is constructed I suggest you take a look at the excellent site regex101. I constructed a snippet where you can see the above regex in action here.

python how to replace string by regex group?

Give an string like '/apps/platform/app/app_name/etc', I can use
p = re.compile('/apps/(?P<p1>.*)/app/(?P<p2>.*)/')
to get two matched groups of platform and app_name, but how can I use re.sub function (or maybe better way) to replace those two groups with other string like windows and facebook? So the final string would like /apps/windows/app/facebook/etc.

Separate group replacement wouldn't be possible through regex. So i suggest you to do like this.
(?<=/apps/)(?P<p1>.*)(/app/)(?P<p2>.*)/
DEMO
Then replace the matched characters with windows\2facebook/ . And also i suggest you to define your regex as raw string. Lookbehind is used inorder to avoid extra capturing group.
>>> s = '/apps/platform/app/app_name/etc'
>>> re.sub(r'(?<=/apps/)(?P<p1>.*)(/app/)(?P<p2>.*)/', r'windows\2facebook/', s)
'/apps/windows/app/facebook/etc'

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.