How can I remove the part of a string after a character in Python? - python

I have some URLs, and for some of them I need to strip everything from the question mark (?) onward.
Example: https://www.yelp.com/biz/starbucks-san-leandro-4?large_photo=1
I need it to return https://www.yelp.com/biz/starbucks-san-leandro-4
How can I do that?

You can also use the .split() method.
The split() method splits a string into a list.
You can specify the separator; the default separator is any whitespace.
Syntax
string.split(separator, maxsplit)
data = 'https://www.yelp.com/biz/starbucks-san-leandro-4?large_photo=1'
print (data.split('?')[0])
output:
https://www.yelp.com/biz/starbucks-san-leandro-4

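You could use rfind and slice the string up to the returned index. This assumes the string contains a '?'; if it does not, rfind returns -1 and the slice drops the last character instead: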
s = 'https://www.yelp.com/biz/starbucks-san-leandro-4?large_photo=1'
s[:s.rfind('?')]
# 'https://www.yelp.com/biz/starbucks-san-leandro-4'
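When some URLs have no question mark at all, a slice based on rfind would cut off their last character, because rfind returns -1 in that case. A minimal sketch that handles both cases with str.partition follows; the helper name strip_query is made up for illustration. Note that it cuts at the first '?', whereas rfind cuts at the last one.

```python
def strip_query(url: str) -> str:
    """Return url without the first '?' and everything after it."""
    # partition returns the whole string as the head when '?' is absent
    head, _sep, _tail = url.partition('?')
    return head


with_query = strip_query('https://www.yelp.com/biz/starbucks-san-leandro-4?large_photo=1')
without_query = strip_query('https://www.yelp.com/biz/starbucks-san-leandro-4')
print(with_query)
print(without_query)
```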

Go for a regular expression:
import re
# .* (rather than .+) also removes a bare trailing '?'
new_string = re.sub(r'\?.*$', '', your_string)
See a demo on regex101.com.

I would parse the URL and then rebuild it from the parts that you want to keep. For example, you can use urllib.parse.
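The parse-and-rebuild idea could look roughly like the sketch below, using urlsplit and urlunsplit from urllib.parse. The function name drop_query is hypothetical, and dropping the fragment (the part after '#') along with the query is an assumption; keep it if you need it.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_query(url: str) -> str:
    """Rebuild url from its parts, leaving out the query string."""
    parts = urlsplit(url)
    # SplitResult is a named tuple, so _replace returns a modified copy.
    # Assumption: the fragment is unwanted too, so it is blanked as well.
    return urlunsplit(parts._replace(query='', fragment=''))


cleaned = drop_query('https://www.yelp.com/biz/starbucks-san-leandro-4?large_photo=1')
print(cleaned)
```

Compared with splitting on '?', this keeps working if the URL later gains a fragment or other components.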

Related

Python string.rstrip() doesn't strip specified characters

string = "hi())("
string = string.rstrip("abcdefghijklmnoprstuwxyz")
print(string)
I want to remove every letter from given string using rstrip method, however it does not change the string in the slightest.
Output:
'hi())('
What I want:
'())('
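I know that I can use regex, but I really don't understand why it doesn't work.
Note: It is part of the Valid Parentheses challenge on Codewars.
rstrip only removes characters from the right end of the string, and it stops at the first character that is not in the given set. Your string ends with "(", so nothing is removed. The letters are at the left end, so you have to use lstrip instead of rstrip: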
>>> string = "hi())("
>>> string = string.lstrip("abcdefghijklmnoprstuwxyz")
>>> string
'())('
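To make the difference concrete, here is a small sketch comparing rstrip, lstrip and a regex on similar inputs. The variable names are arbitrary, and the second input "a(b)c" is an invented example to show that the strip family never removes characters from the middle of a string.

```python
import re
from string import ascii_lowercase

text = "hi())("

right = text.rstrip(ascii_lowercase)  # unchanged: the last character is "("
left = text.lstrip(ascii_lowercase)   # leading letters are removed

# lstrip/rstrip/strip only work at the ends; re.sub removes letters anywhere
anywhere = re.sub(r"[a-z]", "", "a(b)c")

print(right, left, anywhere)
```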

Getting word from string

How can I get the word example from a string like this:
str = "http://test-example:123/wd/hub"
I wrote something like this:
print(str[10:str.rfind(':')])
but it doesn't work correctly if the string looks like
"http://tests-example:123/wd/hub"
You can use this regex to capture the value preceded by - and followed by : using lookarounds
(?<=-).+(?=:)
Python code,
import re
str = "http://test-example:123/wd/hub"
print(re.search(r'(?<=-).+(?=:)', str).group())
Outputs,
example
Non-regex way to get the same is using these two splits,
str = "http://test-example:123/wd/hub"
print(str.split(':')[1].split('-')[1])
Prints,
example
You can use the following non-regex approach, because you know example is a 7-letter word:
s.split('-')[1][:7]
For any arbitrary word, that would change to:
s.split('-')[1].split(':')[0]
There are many ways.
Using splitting:
example_str = str.split('-')[-1].split(':')[0]
This is fragile, and could break if there are more hyphens or colons in the string.
using regex:
import re
pattern = re.compile(r'-(.*):')
example_str = pattern.search(str).group(1)
This still expects a particular format, but is more easily adaptable (if you know how to write regexes).
I am not sure why you want to get a particular word from a string. I guess you wanted to check whether this word occurs in the given string.
If that is the case, the code below can be used.
import re
str1 = "http://tests-example:123/wd/hub"
matched = re.findall('example',str1)
Split on the -, and then on :
s = "http://test-example:123/wd/hub"
print(s.split('-')[1].split(':')[0])
#example
Using re:
import re
text = "http://test-example:123/wd/hub"
m = re.search('(?<=-).+(?=:)', text)
if m:
    print(m.group())
Python strings have a built-in find method:
a="http://test-example:123/wd/hub"
b="http://test-exaaaample:123/wd/hub"
print(a.find('example'))
print(b.find('example'))
will return:
12
-1
This is the index of the found substring. If it equals -1, the substring was not found in the string. You can also use the in keyword:
'example' in 'http://test-example:123/wd/hub'
True
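Since the strings in this question are URLs, one more option is to let urllib.parse isolate the host name first and only then split on the hyphen, which avoids depending on where the colons fall. This is only a sketch: the function name word_after_hyphen is made up, and it assumes the wanted word is whatever follows the last hyphen in the host name.

```python
from urllib.parse import urlsplit


def word_after_hyphen(url: str) -> str:
    """Return the part of the host name after its last hyphen."""
    # hostname excludes the scheme, port and path; it is None for odd input
    host = urlsplit(url).hostname or ''
    # rpartition yields ('', '', host) when there is no hyphen,
    # so the whole host name is returned in that case
    return host.rpartition('-')[2]


first = word_after_hyphen("http://test-example:123/wd/hub")
second = word_after_hyphen("http://tests-example:123/wd/hub")
print(first, second)
```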

How to escape null characters .i.e [' '] while using regex split function? [duplicate]

I have the following file names that exhibit this pattern:
000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...
I want to extract the middle two time stamp parts after the second underscore '_' and before '.txt'. So I used the following Python regex string split:
time_info = re.split('^[0-9]+_[LU]_|-|\.txt$', f)
But this gives me two extra empty strings in the returned list:
time_info=['', '20111007T084734', '20111008T023142', '']
How do I get only the two time stamp information? i.e. I want:
time_info=['20111007T084734', '20111008T023142']
I'm no Python expert, but maybe you could just remove the empty strings from your list?
str_list = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
time_info = list(filter(None, str_list))  # list() is needed in Python 3, where filter returns an iterator
Don't use re.split(), use the groups() method of regex Match/SRE_Match objects.
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')
You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
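A short sketch of the named-group variant mentioned above. The group names start and end are chosen here purely for illustration; any valid identifiers would do.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Same pattern as in the groups() answer, but with named capturing groups
pattern = re.compile(r'[LU]_(?P<start>\w+)-(?P<end>\w+)\.')

match = pattern.search(f)
# search returns None when nothing matches, so guard before calling groupdict
time_info = match.groupdict() if match else {}
print(time_info)
```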
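If the timestamps are always after the second _ then you can use str.split and str.strip. Note that strip(".txt") removes any of the characters '.', 't' and 'x' from both ends rather than the literal suffix; it is safe here only because the file name starts and ends (before the extension) with digits: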
>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs.strip(".txt").split("_",2)[-1].split("-")
['20111007T084734', '20111008T023142']
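Because strip treats its argument as a set of characters, a suffix-aware variant may be safer for other file names. The sketch below uses a hypothetical helper, drop_suffix, built on endswith and slicing; the file name "report.txt" is an invented example to show the pitfall.

```python
def drop_suffix(name: str, suffix: str = '.txt') -> str:
    """Remove suffix from the end of name only if it is actually there."""
    if suffix and name.endswith(suffix):
        return name[:-len(suffix)]
    return name


strs = "000014_L_20111007T084734-20111008T023142.txt"
time_info = drop_suffix(strs).split("_", 2)[-1].split("-")
print(time_info)

# The pitfall: strip removes trailing 't', 'x' and '.' characters one by one
stripped = "report.txt".strip(".txt")
suffix_removed = drop_suffix("report.txt")
print(stripped, suffix_removed)
```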
Since this came up on Google, and for completeness: try using re.findall as an alternative!
This does require a little rethinking, but it still returns a list of matches, just as split returns a list of pieces. That makes it a nice drop-in replacement in some existing code, and it gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.
Yes, this is a bit of a "you're asking the wrong question" answer, and it doesn't use re.split(). It does solve the underlying issue: your list of matches suddenly has zero-length strings in it and you don't want that.
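For the file names in the question, the re.findall approach might look like the sketch below. The pattern is an assumption based on the two sample names: eight digits, a literal T, then six digits.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Match the timestamps themselves instead of splitting on what surrounds them
time_info = re.findall(r'\d{8}T\d{6}', f)
print(time_info)
```

Because only the wanted text is matched, there are no empty strings to filter out afterwards.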
>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']
or, somewhat more general:
>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']

Python RE question - proper state initial formatting

I have a string that I need to edit. It looks something like this:
string = "Idaho Ave N,,Crystal,Mn,55427-1463,US,,610839124763,Expedited"
Notice that the state abbreviation "Mn" is not properly formatted. I'm trying to use a regular expression to change this:
re.sub("[A-Z][a-z],", "[A-Z][A-Z],", string)
However, re.sub treats the second part as a literal and will change Mn, to [A-Z][A-Z],. How would I use re.sub (or something similar and simple) to properly change Mn, to MN, in this string?
Thank you in advance!
Your re.sub might also modify parts of the string that you do not want to modify. Try to process the right element in your list explicitly:
input = "Idaho Ave N,,Crystal,Mn,55427-1463,US,,610839124763,Expedited"
elems = input.split(',')
elems[3] = elems[3].upper()
output = ','.join(elems)
returns
'Idaho Ave N,,Crystal,MN,55427-1463,US,,610839124763,Expedited'
You can pass a function as the replacement parameter to re.sub to generate the replacement string from the match object, e.g.:
import re
s = "Idaho Ave N,,Crystal,Mn,55427-1463,US,,610839124763,Expedited"
def upcase(match):
    # Return the matched text (e.g. "Mn,") in upper case
    return match.group().upper()
print(re.sub("[A-Z][a-z],", upcase, s))
(This is ignoring the concern of whether you're genuinely finding state initials with this method.)
The relevant documentation for re.sub:
sub(pattern, repl, string, count=0)
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl. repl can be either a string or a callable;
if a string, backslash escapes in it are processed. If it is
a callable, it's passed the match object and must return
a replacement string to be used.
re.sub("[A-Z][a-z]", lambda m: m.group(0).upper(), myString)
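I would avoid calling your variable string, since that is also the name of a standard library module.
You can create a group by surrounding it in parentheses within your regex and refer to it by its group number (\1) in the replacement. However, "\1,".upper() does not do what you want: .upper() is applied to the replacement template before any substitution happens, and without a raw string "\1" is the control character \x01 rather than a backreference. To upper-case the captured group, pass a callable instead:
re.sub(r"([A-Z][a-z]),", lambda m: m.group(1).upper() + ",", string)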

Regex to Split 1st Colon

I have a time in ISO 8601 ( 2009-11-19T19:55:00 ) which is also paired with a name commence. I'm trying to parse this into two. I'm currently up to here:
import re
sColon = re.compile('[:]')
aString = sColon.split("commence:2009-11-19T19:55:00")
Obviously this returns:
>>> aString
['commence','2009-11-19T19','55','00']
What I'd like it to return is this:
>>> aString
['commence','2009-11-19T19:55:00']
How would I go about doing this in the original creation of sColon? Also, do you recommend any regular expression links or books that you have found useful? I can see myself needing them in the future!
EDIT:
To clarify: I need a regular expression that splits only at the very first instance of :. Is this possible? The text (commence) before the colon can change, yes.
>>> first, colon, rest = "commence:2009-11-19T19:55:00".partition(':')
>>> print (first, colon, rest)
('commence', ':', '2009-11-19T19:55:00')
You could pass the maxsplit parameter to the split function:
>>> "commence:2009-11-19T19:55:00".split(":",1)
['commence', '2009-11-19T19:55:00']
Official Docs
S.split([sep [,maxsplit]]) -> list of strings
Return a list of the words in the string S, using sep as the
delimiter string. If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator and empty strings are removed
from the result.
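If a compiled regular expression is still wanted, as in the question, the pattern's own split method accepts a maxsplit argument too. A minimal sketch that reuses the question's variable names:

```python
import re

sColon = re.compile(':')

# maxsplit=1 stops after the first colon, so the timestamp stays intact
aString = sColon.split("commence:2009-11-19T19:55:00", maxsplit=1)
print(aString)
```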
Looks like you need .IndexOf(":"), then .Substring()?
@OP, don't do the unnecessary. A regex is not needed for what you are doing. Python has very good string manipulation methods that you can use. All you need is split() and slicing. Those are the very basics of Python.
>>> "commence:2009-11-19T19:55:00".split(":",1)
['commence', '2009-11-19T19:55:00']
>>>
