Python regex to extract substring at start and end of string


I am looking for a regex that will extract everything up to the first . (period) in a string, and everything including and after the last . (period)
For example:
my_file.10.4.5.6.csv
myfile2.56.3.9.txt
Ideally the regex when run against these strings would return:
my_file.csv
myfile2.txt
The numeric stamp in the file will be different each time the script is run, so I am looking essentially to exclude it.
The following prints out the string up to the first . (period)
print re.search("^[^.]*", data_file).group(0)
I am having trouble, though, getting it to also return the last period and the string after it.
Sorry just to update this based upon feedback and comments below:
This does need to be a regex. The regex will be passed into the program from a configuration file. The user will not have access to the source code as it will be packaged.
The user may need to change the regex based upon some arbitrary criteria, so they will need to update the config file, rather than edit the application and re-build the package.
Thanks

You don’t need a regular expression!
parts = data_file.split(".")
print parts[0] + "." + parts[-1]

Instead of regular expressions, I would suggest using str.split. For example:
>>> data_file = 'my_file.10.4.5.6.csv'
>>> parts = data_file.split('.')
>>> print parts[0] + '.' + parts[-1]
my_file.csv
However if you insist on regular expressions, here is one approach:
>>> print re.sub(r'\..*\.', '.', data_file)
my_file.csv
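Since the question says the pattern has to come from a configuration file, here is a minimal Python 3 sketch of a capture-group variant. The name CONFIG_PATTERN and the helper compress_name are hypothetical, not from the original post; the pattern captures everything before the first period and everything from the last period onward, and the helper glues the two groups together.

```python
import re

# Hypothetical pattern, standing in for a value read from a config file.
# Group 1: everything before the first period.
# Group 2: the last period and everything after it.
CONFIG_PATTERN = r"^([^.]*)\..*(\.[^.]*)$"


def compress_name(name, pattern=CONFIG_PATTERN):
    """Join the captured groups; return the name unchanged if nothing matches."""
    match = re.match(pattern, name)
    if match is None:
        return name
    return "".join(match.groups())


print(compress_name("my_file.10.4.5.6.csv"))  # my_file.csv
print(compress_name("myfile2.56.3.9.txt"))    # myfile2.txt
```

A name with only one period (for example plain.csv) does not match this pattern, so the sketch returns it as-is.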

You don't need a regex.
tokens = expanded_name.split('.')
compressed_name = '.'.join((tokens[0], tokens[-1]))
If you are concerned about performance, you could use a length limit and rsplit() to only chop up the string as much as you need.
compressed_name = expanded_name.split('.', 1)[0] + '.' + expanded_name.rsplit('.', 1)[1]
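One thing to watch with the rsplit() one-liner is that it raises IndexError when the name contains no period at all. A small Python 3 sketch using str.partition and str.rpartition, which never raise, might look like this (shorten_name is an illustrative name, not from the answer):

```python
def shorten_name(expanded_name):
    """Keep the text before the first period and after the last one."""
    head, sep, _ = expanded_name.partition(".")
    if not sep:
        # No period at all: nothing to strip.
        return expanded_name
    tail = expanded_name.rpartition(".")[2]
    return head + "." + tail


print(shorten_name("my_file.10.4.5.6.csv"))  # my_file.csv
print(shorten_name("noext"))                 # noext
```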

Do you need a regex here?
>>> address = "my_file.10.4.5.6.csv"
>>> split_by_periods = address.split(".")
>>> "{}.{}".format(split_by_periods[0], split_by_periods[-1])
'my_file.csv'

Related

How to get everything after string x in python

I have a string:
s3://tester/test.pdf
I want to exclude s3://tester/, so that even if I have s3://tester/folder/anotherone/test.pdf I get the entire path after s3://tester/
I have attempted to use the split & partition methods but I can't seem to get it.
Currently I am trying:
string.partition('/')[3]
But I get an error saying that the index is out of range.
EDIT: I should have specified that the name of the bucket will not always be the same, so I want to make sure that it only grabs whatever comes after the 3rd '/'.

You can use str.split():
path = 's3://tester/test.pdf'
print(path.split('/', 3)[-1])
Output:
test.pdf
UPDATE: With regex:
import re
path = 's3://tester/test.pdf'
print(re.split('/',path,3)[-1])
Output:
test.pdf

Have you tried .replace?
You could do:
string = "s3://tester/test.pdf"
string = string.replace("s3://tester/", "")
print(string)
This will replace "s3://tester/" with the empty string ""
Alternatively, you could use .split rather than .partition
You could also try:
string = "s3://tester/test.pdf"
string = "/".join(string.split("/")[3:])
print(string)

To answer "How to get everything after x amount of characters in python"
string[x:]

PLEASE SEE UPDATE
ORIGINAL
Using the builtin re module.
p = re.search(r'(?<=s3:\/\/tester\/).+', s).group()
The pattern uses a lookbehind to skip over the part you wish to ignore and matches any and all characters following it until the entire string is consumed, returning the matched group to the p variable for further processing.
This code will work for any length path following the explicit s3://tester/ schema you provided in your question.
UPDATE
Just saw updates duh.
Got the wrong end of the stick on this one, my bad.
Below re method should work no matter S3 variable, returning all after third / in string.
p = ''.join(re.findall(r'\/[^\/]+', s)[1:])[1:]
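As an alternative to counting slashes, the standard library's URL parser can separate the bucket from the rest. This is a swapped-in technique rather than anything proposed in the answers above, and key_after_bucket is a hypothetical helper name. A Python 3 sketch:

```python
from urllib.parse import urlparse


def key_after_bucket(url):
    """Return everything after the bucket name, whatever the bucket is called."""
    # urlparse puts the bucket in .netloc and the remainder in .path.
    return urlparse(url).path.lstrip("/")


print(key_after_bucket("s3://tester/test.pdf"))
# test.pdf
print(key_after_bucket("s3://other-bucket/folder/anotherone/test.pdf"))
# folder/anotherone/test.pdf
```

One caveat: a key containing ? or # would be split off as a query or fragment, so the split('/', 3) approach is safer for such names.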

Python 3.6 Identifying a string and if X in Y

Newb programmer here working on my first project. I've searched this site and the python documentation, and either I'm not seeing the answer, or I'm not using the right terminology. I've read the regex and if sections, specifically, and followed links around to other parts that seemed relevant.
import re
keyphrase = '##' + '' + '##'
print(keyphrase)  # output is ####
j = input('> ')  # e.g. ##whatever##
if keyphrase in j:
    print('yay')
else:
    print('you still haven\'t figured it out...')
k = j.replace('#', '')
print(k)  # whatever
This is for a little reddit bot project. I want the bot to be called like ##whatever## and then be able to do things with the word(s) in between the ##'s. I've set up the above code to test if Python was reading it but I keep getting my "you still haven't figured it out..." quip.
I tried adding the REGEX \W in the middle of keyphrase, to no avail. Also weird combinations of \$\$ and quotes
So, my question, is how do I put a placeholder in keyphrase for user input?
For instance, if a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.

You could use the following regex r'##(.*?)##' to capture everything inside of the key phrase you've chosen.
Sample Output:
>>> import re
>>> f = lambda s: re.match(r'##(.*?)##', s).group(1)
>>> f("##whatever##")
'whatever'
>>> f = lambda s: re.findall(r'##(.*?)##', s)
>>> f("a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.")
['comment', 'this', 'I can grab']
How does it work? (1) The literal ## at the head and tail delimit capture group 1, the part between the parentheses (). (2) Inside the group, .*? matches any character lazily (non-greedy), so each capture stops at the nearest closing ## instead of running on to the last ## in the string.
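The difference the ? makes is easiest to see side by side. A short Python 3 sketch, using a made-up sample string:

```python
import re

text = "##this## and ##that##"

# Greedy: .* runs to the last ## in the string.
greedy = re.findall(r"##(.*)##", text)
# Lazy: .*? stops at the nearest closing ##.
lazy = re.findall(r"##(.*?)##", text)

print(greedy)  # ['this## and ##that']
print(lazy)    # ['this', 'that']
```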
Suggested Readings:
Introduction to Regex in Python - Jee Gikera

Something like this should work:
import re
keyphrase_regex = re.compile(r'##(.*)##')
user_input = input('> ')
keyphrase_match = keyphrase_regex.search(user_input)
# `search` returns `None` if regex didn't match anywhere in the string
keyphrase_content = keyphrase_match.group(1) if keyphrase_match else None
if keyphrase_content:
    print('yay! You submitted "', keyphrase_content, '" to the bot!')
else:
    # Bonus tip: Use double quotes to make a string containing apostrophe
    # without using a backslash escape
    print("you still haven't figured it out...")
# Use `keyphrase_content` for whatever down here
Regular expressions are kind of hard to wrap your head around, because they work differently than most programming constructs. It's a language to describe patterns.
Regex One is a fantastic beginner's guide.
Regex101 is an online sandbox that allows you to type a regular expression and some sample strings, then see what matches (and why) in real time.
The regex ##(.*)## basically means: "search through the string until you find two '#' signs. Right after those, start capturing zero-or-more of any character. Because .* is greedy, it grabs as much as it can and then backs off until what follows is '##', so the match ends at the last '##' in the string. Then stop looking at the string, return successfully, and hold onto the entire match (from first '#' to last '#'). Also, hold onto the captured characters in case the programmer asks you for just them."
EDIT: Props to @ospahiu for bringing up the ? lazy quantifier. A final solution, combining our approaches, would look like this:
# whatever_bot.py
import re
# Technically, Python >2.5 will compile and cache regexes automatically.
# For tiny projects, it shouldn't make a difference. I think it's better style, though.
# "Explicit is better than implicit"
keyphrase_regex = re.compile(r'##(.*?)##')
def parse_keyphrases(text):
    return keyphrase_regex.findall(text)
Lambdas are cool. I prefer them for one-off things, but the code above is something I'd rather put in a module. Personal preference.
You could even make the regex substitutable, using the '##' one by default
# whatever_bot.py
import re
keyphrase_double_hash = re.compile(r'##(.*?)##')
def parse_keyphrases(text, keyphrase_regex=keyphrase_double_hash):
    return keyphrase_regex.findall(text)
You could even go bonkers and write a function that generates a keyphrase regex from an arbitrary "tag" pattern! I'll leave that as an exercise for the reader ;) Just remember: Several characters have special regex meanings, like '*' and '?', so if you want to match that literal character, you'd need to escape them (e.g. '\?').
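For what it's worth, one possible take on that exercise is sketched below in Python 3. The function name make_keyphrase_regex is made up for illustration; re.escape handles the special characters so the tag is always matched literally.

```python
import re


def make_keyphrase_regex(tag):
    """Build a lazy 'tag ... tag' pattern from an arbitrary literal tag."""
    escaped = re.escape(tag)  # e.g. '**' becomes '\*\*'
    return re.compile(escaped + r"(.*?)" + escaped)


stars = make_keyphrase_regex("**")
print(stars.findall("a **bold** move and **another**"))
# ['bold', 'another']
```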

If you want to grab the content between the "#", then try this:
j = input("> ")
"".join(j.split("#"))

You're not getting any of the info between the #'s in your example because you're effectively looking for '####' in whatever input you give it. Unless you happen to put 4 #'s in a row, that check will never succeed.
What you want to do instead is something like
re.match(r'##\w+##', j)
which will look for 2 leading #s, then one or more word characters (\w+ matches letters, digits and underscores; the uppercase \W would match the opposite), then 2 trailing #s. Note that re.match only matches at the start of the string; use re.search if the keyphrase can appear anywhere. From there, your strip code looks fine and you should be able to grab it.
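A quick Python 3 check of how the character class and the choice of match versus search behave, using a made-up comment string:

```python
import re

comment = "hello ##whatever## there"

# re.match only tries the very start of the string, so this finds nothing.
at_start = re.match(r"##\w+##", comment)
# re.search scans the whole string.
anywhere = re.search(r"##(\w+)##", comment)
# Uppercase \W means NON-word characters, so it cannot match 'whatever'.
non_word = re.search(r"##\W+##", comment)

print(at_start)           # None
print(anywhere.group(1))  # whatever
print(non_word)           # None
```

Because \w does not match spaces, a phrase such as ##I can grab## would need a broader pattern like ##(.*?)##.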

In my date time value I want to use regex to strip out the slash and colon from time and replace it with underscore

I am using Python and WebDriver for my automated tests. In my scenario, on the Admin page of our website, I click the Add project button and enter a project name.
Project Name I enter is in the format of LADEMO_IE_05/20/1515:11:38
It is a date and time at the end.
What I would like to do is use a regex to find the / and : characters and replace them with an underscore _
I have worked out the regex expression:
[0-9]{2}[/][0-9]{2}[/][0-9]{4}:[0-9]{2}[:][0-9]{2}
This finds 2 digits then / followed by 2 digits then / and so on.
I would like to replace / and : with _.
Can I do this in Python using import re? I need some help with the syntax please.
My method which returns the date is:
def get_datetime_now(self):
    dateTime_now = datetime.datetime.now().strftime("%x%X")
    print dateTime_now  # prints e.g. 05/20/1515:11:38
    return dateTime_now
My code snippet for entering the project name into the text field is:
project_name_textfield.send_keys('LADEMO_IE_' + self.get_datetime_now())
The Output is e.g.
LADEMO_IE_05/20/1515:11:38
I would like the Output to be:
LADEMO_IE_05_20_1515_11_38

Just format the datetime using strftime() into the desired format:
>>> datetime.datetime.now().strftime("%m_%d_%y%H_%M_%S")
'05_20_1517_20_16'
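To see that this format string reproduces the layout requested in the question, here is a Python 3 sketch with a fixed timestamp (the date is taken from the question's example output; a real test would use datetime.datetime.now()):

```python
import datetime

# Fixed timestamp so the result is reproducible.
stamp = datetime.datetime(2015, 5, 20, 15, 11, 38)
project_name = "LADEMO_IE_" + stamp.strftime("%m_%d_%y%H_%M_%S")

print(project_name)  # LADEMO_IE_05_20_1515_11_38
```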

Another simple option is just using string replace :
s = "your time string"
s = s.replace("/", "_").replace(":", "_")

Two ways:
i) use strftime with the format:
strftime("%m_%d_%y_%H_%M_%S")
ii) simply use the replace() method of strings to replace '/' and ':' with '_'

Basically, you want to replace every unwanted character with an underscore. To do it, instead of using a regex, you could simply use the str.replace method. For example:
out_string = in_string.replace('/', '_').replace(':', '_')
In this example, the first replace returns a string with all the slashes replaced, and the second call replaces the colons. I think it's the simplest way to replace one or two characters. But if you want your program to be able to evolve, I advise using re.sub, as follows:
import re
# first we compile the regex, for speed's sake
# this regex matches any one of the bad characters, and it's modular:
# just add another alternative when you need one
bad_characters = re.compile(r'/|:')
# your code
# replacement
out_string = re.sub(bad_characters, '_', in_string)
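The same idea can be written with a character class instead of alternation, which some find easier to extend. A Python 3 sketch, where BAD_CHARACTERS and sanitize are illustrative names:

```python
import re

# Character class: any single '/' or ':'. Add more characters inside the brackets.
BAD_CHARACTERS = re.compile(r"[/:]")


def sanitize(name):
    """Replace every unwanted character with an underscore."""
    return BAD_CHARACTERS.sub("_", name)


print(sanitize("LADEMO_IE_05/20/1515:11:38"))  # LADEMO_IE_05_20_1515_11_38
```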

python regex on variable

Please help with my regex problem
Here is my string
source="http://www.amazon.com/ref=s9_hps_bw_g200_t2?pf_rd_m=ATVPDKIKX0DER&pf_rd_i=3421"
source_resource="pf_rd_m=ATVPDKIKX0DER"
The source_resource value occurs inside source and may be followed by either & or . (for example).
So far,
regex = re.compile("pf_rd_m=ATVPDKIKX0DER+[&.]")
regex.findall(source)
[u'pf_rd_m=ATVPDKIKX0DER&']
I have hard-coded the text here. Rather than hard-coding it, how can I use the source_resource variable together with & or . to find this?

If the goal is to extract the pf_rd_m value (which it apparently is, as you are using regex.findall), then I'm not sure a regex is the easiest solution here:
>>> import urlparse
>>> qs = urlparse.urlparse(source).query
>>> urlparse.parse_qs(qs)
{'pf_rd_m': ['ATVPDKIKX0DER'], 'pf_rd_i': ['3421']}
>>> urlparse.parse_qs(qs)['pf_rd_m']
['ATVPDKIKX0DER']
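The session above is Python 2. In Python 3 the same functions live in urllib.parse, so an equivalent sketch would be:

```python
from urllib.parse import urlparse, parse_qs

source = "http://www.amazon.com/ref=s9_hps_bw_g200_t2?pf_rd_m=ATVPDKIKX0DER&pf_rd_i=3421"

# parse_qs maps each query parameter to a list of its values.
params = parse_qs(urlparse(source).query)

print(params["pf_rd_m"])  # ['ATVPDKIKX0DER']
```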

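Build the pattern from the variable. Escaping the . is harmless here, although inside a character class it is already literal, so the escape is optional:
pattern = re.compile(source_resource + r'[&\.]')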

You can just build the string for the regular expression like a normal string, utilizing all string-formatting options available in Python:
import re
source_and="http://rads.stackoverflow.com/amzn/click/B0030DI8NA/pf_rd_m=ATVPDKIKX0DER&"
source_dot="http://rads.stackoverflow.com/amzn/click/B0030DI8NA/pf_rd_m=ATVPDKIKX0DER."
source_resource="pf_rd_m=ATVPDKIKX0DER"
regex_string = source_resource + r"[&\.]"
regex = re.compile(regex_string)
print regex.findall(source_and)
print regex.findall(source_dot)
Output:
['pf_rd_m=ATVPDKIKX0DER&']
['pf_rd_m=ATVPDKIKX0DER.']
I hope this is what you mean.
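Just take note that I modified your regular expression: I escaped the . (outside a character class it is a special symbol; inside [...] it is already literal, so the escape is optional there) and I dropped the +, which is also special and in your pattern only repeated the final R. I assumed the string will only occur once, which makes the + unnecessary.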
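If the variable can ever contain regex metacharacters, it may be safer to pass it through re.escape before concatenating. A minimal Python 3 sketch using the strings from the question:

```python
import re

source = "http://www.amazon.com/ref=s9_hps_bw_g200_t2?pf_rd_m=ATVPDKIKX0DER&pf_rd_i=3421"
source_resource = "pf_rd_m=ATVPDKIKX0DER"

# re.escape makes the variable match literally; [&.] then allows either terminator.
regex = re.compile(re.escape(source_resource) + r"[&.]")
matches = regex.findall(source)

print(matches)  # ['pf_rd_m=ATVPDKIKX0DER&']
```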

Python Regular Expression findall with variable

I am trying to use re.findall with look-behind and look-forward to extract data. The regular expression works fine when I am not using a raw_input variable, but I need users to be able to input a variety of different search terms.
Here is the current code:
me = re.findall(r"(?<='(.+)'+variable+'(.+)')(.*?)(?='(.+)+variable+(.+)')", raw)
As you can see, I am trying to pull out strings between one search term.
However, each time I use this type of formatting, I get a fixed width error. Is there anyway around this?
I have also tried the following formats with no success.
variable = raw_input('Term? ')
'.*' + variable + '.*'
and
'.*%s.*' % (variable, )
and
'.*{0}.*'.format(variable)
and
'.*{variable}.*'.format(variable=variable)

I'm not sure if this is what you mean, but it may get you started. As far as I understand your question, you don't need lookaheads or lookbehinds. This is for Python 2.x (won't work with Python 3):
>>> import re
>>> string_to_search = 'fish, hook, swallowed, reeled, boat, fish'
>>> entered_by_user = 'fish'
>>> search_regex = r"{0}(.+){0}".format(entered_by_user)
>>> match = re.search(search_regex, string_to_search)
>>> if match:
... print "result:", match.group(1).strip(' ,')
...
result: hook, swallowed, reeled, boat
If you really want the last 'fish' in the result as in your comment above, then just remove the second {0} from the format() string.
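A Python 3 version of the same idea is sketched below. It also passes the user's term through re.escape, which the session above does not do, so that input such as a.b is matched literally. The helper name between is made up for illustration.

```python
import re


def between(term, text):
    """Return the text between the first and last occurrence of term, or None."""
    escaped = re.escape(term)  # treat the user's input as literal text
    match = re.search(escaped + r"(.+)" + escaped, text)
    return match.group(1).strip(" ,") if match else None


print(between("fish", "fish, hook, swallowed, reeled, boat, fish"))
# hook, swallowed, reeled, boat
```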

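You can interpolate the variable with an rf-string (Python 3.6+). Note that this only fixes the interpolation: the (.+) inside the lookbehind is still variable-width, so the re module will still reject this particular pattern with the fixed-width error:
me = re.findall(rf"(?<='(.+)'+{variable}+'(.+)')(.*?)(?='(.+)+{variable}+(.+)')", raw)
You can also add as many different variables as you wish.
Prefix the regular expression with rf and put the desired variables between {}: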
import re
text = "regex is the best"
var1 = "is the"
var2 = "best"
yes = re.findall(rf"regex {var1} {var2}", text)
print(yes)
['regex is the best']

The way lookbehind is usually implemented (including its Python implementation) has an inherent limitation that you are unfortunately running into: lookbehinds cannot be variable-length. The "Important Notes About Lookbehind" section here explains why. I think you should be able to do the regex without a lookbehind, though.
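The limitation and one way around it can be checked directly. In this Python 3 sketch the helper compiles and the sample strings are made up; the workaround replaces the lookbehind with an ordinary match plus a capture group, and keeps a lookahead (which may be variable-length) so the closing term can start the next match.

```python
import re


def compiles(pattern):
    """Report whether the re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(compiles(r"(?<=fish.+)boat"))  # False: variable-width lookbehind
print(compiles(r"(?<=fish)boat"))    # True: fixed-width lookbehind

# Workaround: match the leading term normally and capture what follows it.
print(re.findall(r"fish(.+?)(?=fish)", "fish one fish two fish"))
# [' one ', ' two ']
```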
