Finding text after a certain character - python

I want to find something after a certain character. I am aware of how to find something before using rfind, but not so sure of the syntax to find something after. Here is an example
text = 'Hello.world'
#to find something before
print(text[:text.rfind('.')])
# out :
Hello
# to find something after, I tried this, but of course it's incorrect
print(text[text:.rfind('.')])
Any ideas on the syntax for finding the text after the character?

print(text[text.rfind('.')+1:])
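A minimal sketch of the same slice wrapped in a helper, mainly to show the missing-separator case. When the separator is absent, str.rfind returns -1, so the one-liner above would slice from index 0 and silently return the whole string. The helper name after_last and the choice to return None are assumptions for illustration, not part of the original answer.

```python
def after_last(text, sep):
    """Return the part of text after the last occurrence of sep.

    Returns None when sep is absent (an assumption made for this sketch).
    """
    idx = text.rfind(sep)
    if idx == -1:
        return None
    # Skip over the separator itself, which may be longer than one character.
    return text[idx + len(sep):]


print(after_last('Hello.world', '.'))   # world
print(after_last('a.b.c', '.'))         # c
print(after_last('no dots here', '.'))  # None
```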

Two other methods you may try might include splitting the string, and also doing a regex substitution to isolate the substring you want:
import re
text = 'Hello.world'
print(text.split('.')[1])
print(re.sub(r'^.*\.', '', text))
Splitting would probably outperform re.sub here, so I recommend trying split() first.
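One caveat worth sketching: split('.')[1] returns the second field, which is only "everything after the dot" when the string contains exactly one dot. The sample string below is made up for illustration.

```python
text = 'archive.tar.gz'  # hypothetical input with more than one dot

second_field = text.split('.')[1]        # 'tar'
after_first_dot = text.split('.', 1)[1]  # 'tar.gz' (split at most once, from the left)
after_last_dot = text.rsplit('.', 1)[-1] # 'gz'     (split at most once, from the right)

print(second_field, after_first_dot, after_last_dot)
```

Passing a maxsplit of 1 makes the intent explicit and avoids splitting the rest of the string needlessly.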

print(text[text:.rfind('.')]) => print(text[text.find('.')+1:])
Here's another way to do it, str.partition and str.rpartition
def find(text, sep=' ', right=False):
    if not (text and sep) or sep not in text:
        return None
    return text.rpartition(sep)[2] if right else text.partition(sep)[0]
find('Hello.Word', '.') # 'Hello'
find('Hello.Word', '.', True) # 'Word'

Related

How to format long `if` statement

I want to test if certain characters are in a line of text. The condition is simple but characters to be tested are many.
Currently I am using \ for easy viewing, but it feels clumsy. What's the way to make the lines look nicer?
text = "Tel+971-2526-821     Fax:+971-2526-821"
if "971" in text or \
        "(84)" in text or \
        "+66" in text or \
        "(452)" in text or \
        "19 " in text:
    print("foreign")
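Why not extract the phone numbers from the string and test their prefixes?
text = "Tel:+971-2526-821 Fax:+971-2526-821"
tel, fax = text.split()
# Take the part after the colon, then the first dash-separated field, e.g. "+971".
tel_prefix, *_ = tel.split(':')[-1].split('-')
fax_prefix, *_ = fax.split(':')[-1].split('-')
# The extracted prefix keeps its leading "+", so strip it before comparing.
if tel_prefix.lstrip('+') in ("971", "(84)"):
    print("Foreigner")
For Python 2.x, which lacks extended unpacking: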
tel_prefix = tel.split(':')[-1].split('-')[0]
fax_prefix = fax.split(':')[-1].split('-')[0]
Enlightened by @Patrick Haugh in the comments, we can do:
text = "Tel+971-2526-821     Fax:+971-2526-821"
if any(x in text for x in ("971", "(84)", "+66", "(452)", "19 ")):
    print("foreign")
You can use the built-in any function to check whether any of the tokens exists in the text. If you would like to check whether all of the tokens exist in the string, replace any below with all. Cheers!
text = 'Hello your number is 19 '
tokens = ('971', '(84)', '+66', '(452)', '19 ')
if any(token in text for token in tokens):
    print('Foreign')
Output:
Foreign
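To make the any/all difference concrete, here is a small sketch using the same text and tokens. The variable names has_any, has_all, and matched are illustrative only.

```python
text = 'Hello your number is 19 '
tokens = ('971', '(84)', '+66', '(452)', '19 ')

# any() stops at the first token found; all() stops at the first token missing.
has_any = any(token in text for token in tokens)
has_all = all(token in text for token in tokens)

# A list comprehension shows which tokens actually matched.
matched = [token for token in tokens if token in text]

print(has_any, has_all, matched)  # True False ['19 ']
```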
Existing comments mention that you can't really chain multiple or clauses the way you intend, but with a generator expression and the any() function you can come up with a serviceable option, such as the snippet if any(x in text for x in ('971', '(84)', '+66', '(452)', '19 ')): that @Patrick Haugh recommended.
I would recommend using regular expressions instead as a more versatile and efficient way of solving the problem. You could either generate the pattern dynamically, or for the purpose of this problem, the following snippet would work (don't forget to escape parentheses):
import re
text = 'Tel:+971-2526-821 Fax:+971-2526-821'
pattern = r'(971|\(84\)|66|\(452\)|19)'
prog = re.compile(pattern)
if prog.search(text):
    print('foreign')
If you are searching many lines of text or large bodies of text for multiple possible substrings, this approach will be faster and more reusable. You only have to compile prog once, and then you can use it as often as you'd like.
As far as dynamic generation of a pattern is concerned, a naive implementation might do something like this:
match_list = ['971', '(84)', '66', '(452)', '19']
pattern = '|'.join(map(lambda s: s.replace('(', r'\(').replace(')', r'\)'), match_list)).join(['(', ')'])
The variable match_list could then be updated and modified as needed. There is a slight inefficiency in running two passes of replace(), and @Andrew Clark has a good trick for fixing that here, but I don't want this answer to be too long and cumbersome.
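As an alternative to escaping the parentheses by hand, the standard library's re.escape escapes every regex metacharacter in one call. This is a sketch of a swapped-in technique, not the approach the answer links to; the helper name is_foreign and the sample inputs are hypothetical.

```python
import re

match_list = ['971', '(84)', '66', '(452)', '19']

# re.escape turns '(84)' into a pattern that matches the literal text "(84)".
pattern = re.compile('|'.join(re.escape(s) for s in match_list))


def is_foreign(line):
    """Return True if any entry of match_list occurs in line."""
    return pattern.search(line) is not None


print(is_foreign('Tel:+971-2526-821 Fax:+971-2526-821'))  # True
print(is_foreign('Tel: 555-0100'))                         # False
```

Compiling once at module level and reusing the compiled pattern keeps the per-line cost to a single search.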
You can construct a lambda function that checks if a value is in the text, and then map this function to all of the values:
text = "Tel:+971-2526-821 Fax:+971-2526-821"
print(any(map(lambda x: x in text, ["971", "(84)", "+66", "(452)", "19 "])))
The result is True, which means at least one of the values is in text.

How to split a string and keep the pattern

This is how the string splitting works for me right now:
output = string.encode('UTF8').split('}/n}')[0]
output += '}\n}'
But I am wondering if there is a more pythonic way to do it.
The goal is to get everything before this '}/n}' including '}/n}'.
This might be a good use of str.partition.
string = '012za}/n}ddfsdfk'
parts = string.partition('}/n}')
# ('012za', '}/n}', 'ddfsdfk')
''.join(parts[:-1])
# 012za}/n}
Or, you can find it explicitly with str.index.
repl = '}/n}'
string[:string.index(repl) + len(repl)]
# 012za}/n}
This is probably better than using str.find since an exception will be raised if the substring isn't found, rather than producing nonsensical results.
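A short sketch of the difference, using a made-up input that does not contain the marker. With str.find the slice quietly produces a wrong prefix, whereas str.index raises ValueError, which can be handled explicitly.

```python
string = 'no marker here'  # hypothetical input without the marker
repl = '}/n}'

# str.find returns -1 here, so the slice end becomes -1 + len(repl) == 3.
wrong = string[:string.find(repl) + len(repl)]
print(repr(wrong))  # 'no '

# str.index raises instead, so the failure cannot go unnoticed.
try:
    result = string[:string.index(repl) + len(repl)]
except ValueError:
    result = None
print(result)  # None
```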
It seems like anything "more elegant" would require regular expressions.
import re
re.search('(.*?}/n})', string).group(0)
# 012za}/n}
It can be done with re.split(); the key is putting parentheses around the split pattern to preserve what you split on:
import re
output = "".join(re.split(r'(}/n})', string.encode('UTF8'))[:2])
However, I doubt that this is either the most efficient or the most Pythonic way to achieve what you want; I don't think this is naturally a split sort of problem. For example:
tag = '}/n}'
encoded = string.encode('UTF8')
output = encoded[:encoded.index(tag)] + tag
or if you insist on a one-liner:
output = (lambda string, tag: string[:string.index(tag)] + tag)(string.encode('UTF8'), '}/n}')
or returning to regex:
output = re.match(r".*}/n}", string.encode('UTF8')).group(0)
>>> string_to_split = 'first item{\n{second item'
>>> sep = '{\n{'
>>> output = [item + sep for item in string_to_split.split(sep)]
NOTE: output = ['first item{\n{', 'second item{\n{']
then you can use the result:
for item_with_delimiter in output:
    ...
It might be useful to look up os.linesep if you're not sure what the line ending will be. os.linesep is whatever the line ending is under your current OS, so '\r\n' under Windows or '\n' under Linux or Mac. It depends where input data is from, and how flexible your code needs to be across environments.
Adapted from Slice a string after a certain phrase?, you can combine find and slice to get the first part of the string and retain }/n}.
s = "012za}/n}ddfsdfk"  # avoid naming the variable str, which shadows the built-in
s[:s.find("}/n}") + 4]  # 4 == len("}/n}")
This will result in 012za}/n}

Checking and removing extra symbols

I'm interested in removing extra symbols from strings in Python.
What would be the most efficient and Pythonic way to do that? Is there some grammar module?
My first idea would be to locate the most deeply nested text and work outward to the left and the right, counting the opening and closing symbols. Then I would remove the last symbol of whichever kind the counters show there are too many of.
An example would be this string:
text = "(This (is an example)"
You can clearly see that the first parenthesis is not balanced by another one, so I want to delete it:
text = "This (is an example)"
The solution has to be independent of the position of the parentheses.
Another example could be:
text = "(This (is another example) )) (to) explain) the question"
That would become :
text = "(This (is another example) ) (to) explain the question"
Had to break this into an answer for formatting. Check out Python's regular expression module.
If I'm understanding what you are asking, look at re.sub. You can use a regular expression to find the characters you'd like to remove and replace them with an empty string.
Suppose we want to remove all instances of '.', '&', and '*'.
>>> import re
>>> s = "abc&def.ghi**jkl&"
>>> re.sub(r'[.&*]', '', s)
'abcdefghijkl'
If the pattern to be matched is larger, you can use re.compile and pass the compiled pattern as the first argument to sub.
>>> r = re.compile(r'[.&*]')
>>> re.sub(r, '', s)
'abcdefghijkl'
Hope this helps.
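For the unbalanced-parentheses examples in the question specifically, a stack of indices is one possible approach. This is a sketch under an assumption: an opening symbol counts as "extra" if it is never closed, and a closing symbol counts as "extra" if nothing is open when it appears. The question does not pin down which symbol to drop in ambiguous cases, and the function name remove_unbalanced is hypothetical.

```python
def remove_unbalanced(text, open_ch='(', close_ch=')'):
    """Drop opening and closing symbols that have no matching partner."""
    unmatched_open = []  # stack of indices of openers still waiting for a closer
    to_drop = set()      # indices of characters to remove

    for i, ch in enumerate(text):
        if ch == open_ch:
            unmatched_open.append(i)
        elif ch == close_ch:
            if unmatched_open:
                unmatched_open.pop()  # this closer matches the latest opener
            else:
                to_drop.add(i)        # closer with nothing open: extra

    # Any opener left on the stack was never closed.
    to_drop.update(unmatched_open)
    return ''.join(ch for i, ch in enumerate(text) if i not in to_drop)


print(remove_unbalanced("(This (is an example)"))
# This (is an example)
print(remove_unbalanced("(This (is another example) )) (to) explain) the question"))
# (This (is another example) ) (to) explain the question
```

The function makes a single pass over the string and works for any pair of symbols passed as open_ch and close_ch.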

Matching two almost similar string (python)

In a file I can have either of the following two string formats:
::WORD1::WORD2= ANYTHING
::WORD3::WORD4::WORD5= ANYTHING2
This is the regex I came up with:
::(\w+)(?:::(\w+))?::(\w+)=(.*)
regex.findall(..)
[(u'WORD1', u'', u'WORD2', u' ANYTHING'),
(u'WORD3', u'WORD4', u'WORD5', u' ANYTHING2')]
My first question: why do I get this empty u'' when matching the first string?
My second question: is there an easier way to write this regex? The two strings are very similar, except that sometimes I have this extra ::WORD5.
My last question: most of the time I have only word characters between the ::, so \w+ is enough, but sometimes I get things like 2-WORD2 or 3-2-WORD2, where a - appears. How can I add it to the \w+?
For the last question:
[\w\-]+
Explanation: \w matches any word character, and \- adds a literal hyphen to the character class.
Captured groups are always included in re.findall results, even if they don't match anything. That's why you get an empty string. If you just want to get what's between the delimiters, try split instead of findall:
a = '::WORD1::WORD2= ANYTHING'
b = '::WORD3::WORD4::WORD5= ANYTHING2'
print(re.split(r'::|= ', a)[1:])  # ['WORD1', 'WORD2', 'ANYTHING']
print(re.split(r'::|= ', b)[1:])  # ['WORD3', 'WORD4', 'WORD5', 'ANYTHING2']
In response to the comments, if "ANYTHING" could be well, anything, it's easier to use string functions rather than regexps:
x, y = a.split('= ', 1)
results = x.split('::')[1:] + [y]
Based on the answer of thg435, you can just split on the "=" and then do exactly the same, something like:
left,right = a.split('=', 1)
answer = left.split('::')[1:] + [right]
For your last question, you can use something like this, which accepts letters, numbers, and "-":
[a-zA-Z0-9\-]+
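Putting the pieces together, here is a sketch that keeps the question's regex, widens each \w+ to [\w-]+, and drops the optional group when it did not take part in the match. Note that Match.groups() reports an unmatched group as None, whereas re.findall reports it as an empty string. The function name parse and the sample lines are assumptions for illustration.

```python
import re

# Same shape as the regex in the question, with '-' allowed inside each word.
pattern = re.compile(r'::([\w-]+)(?:::([\w-]+))?::([\w-]+)=(.*)')


def parse(line):
    """Return the captured fields of line, skipping the unmatched optional group."""
    m = pattern.match(line)
    if m is None:
        return None
    return [g for g in m.groups() if g is not None]


print(parse('::WORD1::2-WORD2= ANYTHING'))
# ['WORD1', '2-WORD2', ' ANYTHING']
print(parse('::WORD3::WORD4::WORD5= ANYTHING2'))
# ['WORD3', 'WORD4', 'WORD5', ' ANYTHING2']
```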

finding and returning a string with a specified prefix

I am close, but I am not sure what to do with the resulting match object. If I do
p = re.search('[/#.* /]', str)
I'll get any words that start with # and end with a space. This is what I want. However, this returns a Match object that I don't know what to do with. What's the most computationally efficient way of finding and returning a string that is prefixed with a #?
For example,
"Hi there #guy"
After doing the proper calculations, I would get back
guy
The following regular expression does what you need:
import re
s = "Hi there #guy"
p = re.search(r'#(\w+)', s)
print(p.group(1))
It will also work for the following string formats:
s = "Hi there #guy " # notice the trailing space
s = "Hi there #guy," # notice the trailing comma
s = "Hi there #guy and" # notice the next word
s = "Hi there #guy22" # notice the trailing numbers
s = "Hi there #22guy" # notice the leading numbers
That regex does not do what you think it does.
s = "Hi there #guy"
p = re.search(r'#([^ ]+)', s) # this is the regex you described
print(p.group(1))  # first thing matched inside of ( .. )
But as usual with regex, there are tons of examples that break this. For example, if the text is s = "Hi there #guy, what's with the comma?", the result would be guy, (with the trailing comma).
So you really need to think about every possible thing you want and don't want to match. r'#([a-zA-Z]+)' might be a good starting point, it literally only matches letters (a .. z, no unicode etc).
With a pattern that has a capturing group, such as r'#(\w+)', p.group(1) returns guy, while p.group(0) returns the whole match, #guy. If you want to find out what attributes and methods an object has, call the built-in dir(p); it returns a list of the names available on that object.
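A minimal sketch of working with the match object, including the case where nothing matches and re.search returns None. The helper name first_tag is hypothetical, and the pattern r'#(\w+)' is borrowed from the other answers here.

```python
import re


def first_tag(text):
    """Return the word after the first '#', or None if there is none."""
    match = re.search(r'#(\w+)', text)
    if match is None:       # re.search returns None when nothing matches
        return None
    return match.group(1)   # group(1) is the first parenthesised group


match = re.search(r'#(\w+)', 'Hi there #guy')
whole = match.group(0)      # '#guy': the entire matched text
captured = match.group(1)   # 'guy': only what the parentheses captured

print(whole, captured)
print(first_tag('Hi there #guy'))  # guy
print(first_tag('no tag here'))    # None
```

Checking for None before calling group avoids an AttributeError on input that contains no #.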
As is evident from the answers so far, a regex is the most efficient solution to your problem. The answers differ slightly in what they allow to follow the #:
[^ ] anything but a space
\w in Python 2.x is equivalent to [A-Za-z0-9_] by default; in Python 3 it matches Unicode word characters for str patterns unless re.ASCII is given
If you have a better idea of which characters might be included in the user name, you can adjust your regex to reflect that; for example, only lowercase ASCII letters would be:
[a-z]
NB: I skipped quantifiers for simplicity.
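The effect of these choices shows up as soon as the name contains a non-ASCII letter. The sketch below assumes Python 3 and a made-up input; the accented letter is written as an escape so the snippet does not depend on the file encoding.

```python
import re

text = 'Hi #h\u00e9llo there'  # '#héllo', a hypothetical tag with an accented letter

# Python 3, str pattern: \w matches Unicode word characters by default.
unicode_word = re.search(r'#(\w+)', text).group(1)

# With re.ASCII, \w is restricted to [A-Za-z0-9_], so the match stops at the accent.
ascii_word = re.search(r'#(\w+)', text, re.ASCII).group(1)

# An explicit class of lowercase ASCII letters behaves the same way here.
lower_only = re.search(r'#([a-z]+)', text).group(1)

print(unicode_word, ascii_word, lower_only)
```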
(?<=#)\w+
will match a word if it's preceded by a # (without adding it to the match, a so-called positive lookbehind). This will match "words" that are composed of letters, numbers, and/or underscore; if you don't want those, use (?<=#)[^\W\d_]+
In Python:
>>> import re
>>> strg = "Hi there #guy!"
>>> p = re.search(r'(?<=#)\w+', strg)
>>> p.group()
'guy'
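Because the lookbehind keeps the # out of the match, the same pattern also works with re.findall to collect every tagged word at once. This is a small sketch with a made-up input string.

```python
import re

strg = "Hi #guy and #gal, meet #user_42!"  # hypothetical input with several tags

# Each result is the word itself; the '#' is checked by the lookbehind but not captured.
tags = re.findall(r'(?<=#)\w+', strg)
print(tags)  # ['guy', 'gal', 'user_42']
```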
You say: "If I do p = re.search('[/#.* /]', str) I'll get any words that start with # and end up with a space." But this is incorrect: that pattern is a character class, which matches ONE character from the set #, /, ., *, and space. Note: the second / in the pattern is redundant.
For example:
>>> re.findall('[/#.* /]', 'xxx#foo x/x.x*x xxxx')
['#', ' ', '/', '.', '*', ' ']
>>>
You say that you want "guy" returned from "Hi there #guy" but that conflicts with "and end up with a space".
Please edit your question to include what you really want/need to match.
