Slice a string after a certain phrase? - python

I've got a batch of strings that I need to cut down. They're basically a descriptor followed by codes. I only want to keep the descriptor.
'a descriptor dps 23 fd'
'another 23 fd'
'and another fd'
'and one without a code'
The codes above are dps, 23 and fd. They can come in any order, are unrelated to each other and might not exist at all (as in the last case).
The list of codes is fixed (or can be predicted, at least), so assuming a code is never used within a legitimate descriptor, how can I strip off the first instance of a code and everything after it?
I'm using Python.

The short answer, as @THC4K points out in a comment:
string.split(pattern, 1)[0]
where string is your original string, pattern is your "break" pattern, 1 indicates to split no more than 1 time, and [0] means take the first element returned by split.
In action:
>>> s = "a descriptor 23 fd"
>>> s.split("23", 1)[0]
'a descriptor '
>>> s.split("fdasfdsafdsa", 1)[0]
'a descriptor 23 fd'
This is a much shorter way of expressing what I had written earlier, which I will keep here anyway.
And if you need to remove multiple patterns, this is a great candidate for the reduce builtin:
>>> string = "a descriptor dps foo 23 bar fd quux"
>>> patterns = ["dps", "23", "fd"]
>>> reduce(lambda s, pat: s.split(pat, 1)[0], patterns, string)
'a descriptor '
>>> reduce(lambda s, pat: s.split(pat, 1)[0], patterns, "uiopuiopuiopuipouiop")
'uiopuiopuiopuipouiop'
This basically says: for each pat in patterns, apply s.split(pat, 1)[0] (as explained above) to the result of the previous step, starting from string. As you can see, if none of the patterns are in the string, the original string is still returned.
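On Python 3, reduce is no longer a builtin and has to be imported from functools. A minimal sketch of the same idea under that assumption (strip_codes is a hypothetical helper name, not part of the answer above):

```python
from functools import reduce

def strip_codes(text, codes):
    """Cut text at each code in turn, keeping only what precedes it."""
    # Each step keeps the part before the first occurrence of `code`;
    # if `code` is absent, split() returns the string unchanged.
    return reduce(lambda s, code: s.split(code, 1)[0], codes, text)

print(repr(strip_codes("a descriptor dps foo 23 bar fd quux", ["dps", "23", "fd"])))
print(repr(strip_codes("and one without a code", ["dps", "23", "fd"])))
```

Note that this matches plain substrings, so a code such as "fd" would also cut inside a longer word that happens to contain it.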
The simplest answer is a list/string slice combined with a string.find:
>>> s = "a descriptor 23 fd"
>>> s[:s.find("fd")]
'a descriptor 23 '
>>> s[:s.find("23")]
'a descriptor '
>>> s[:s.find("gggfdf")] # <-- look out! last character got cut off
'a descriptor 23 f'
A better approach (to avoid cutting off the last character in a missing pattern when s.find returns -1) might be to wrap in a simple function:
>>> def cutoff(string, pattern):
...     idx = string.find(pattern)
...     return string[:idx if idx != -1 else len(string)]
...
>>> cutoff(s, "23")
'a descriptor '
>>> cutoff(s, "asdfdsafdsa")
'a descriptor 23 fd'
The [:s.find(x)] syntax means take the part of the string from index 0 up to (but not including) the index on the right-hand side of the colon; in this case, the RHS is the result of s.find, which returns the index at which the substring you passed starts, or -1 if it is not found.
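As an alternative to guarding against the -1 case by hand, str.partition might be worth a look: it returns the whole string as its first element when the separator is missing. A small sketch (the helper name before is made up for illustration):

```python
def before(s, pattern):
    """Return the part of s that precedes the first occurrence of pattern."""
    # partition() returns (head, separator, tail); when pattern is not
    # found, head is the whole string and the other two are empty.
    head, _sep, _tail = s.partition(pattern)
    return head

s = "a descriptor 23 fd"
print(repr(before(s, "23")))
print(repr(before(s, "asdfdsafdsa")))
```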

You seem to be describing something like this:
def get_descriptor(text):
    codes = ('12', 'dps', '23')
    for c in codes:
        try:
            return text[:text.index(c)].rstrip()
        except ValueError:
            continue
    raise ValueError("No descriptor found in `%s'" % (text))
E.g.,
>>> get_descriptor('a descriptor dps 23 fd')
'a descriptor'

codes = ('12', 'dps', '23')
def get_descriptor(text):
    words = text.split()
    for c in codes:
        if c in words:
            i = words.index(c)
            return " ".join(words[:i])
    raise ValueError("No code found in `%s'" % (text))

I'd probably use a regular expression to do this:
>>> import re
>>> descriptors = ('foo x', 'foo y', 'bar $', 'baz', 'bat')
>>> data = ['foo x 123', 'foo y 123', 'bar $123', 'baz 123', 'bat 123', 'nothing']
>>> p = re.compile("(" + "|".join(map(re.escape, descriptors)) + ")")
>>> for s in data:
...     m = re.match(p, s)
...     if m: print m.groups()[0]
...
foo x
foo y
bar $
baz
bat
It wasn't entirely clear to me whether you want what you're extracting to include text that precedes the descriptors, or if you expect each line of text to start with a descriptor; the above deals with the latter. For the former, just change the pattern slightly to make it capture all characters before the first occurrence of a descriptor:
>>> p = re.compile("(.*?(" + "|".join(map(re.escape, descriptors)) + "))")
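Applied to the original question (cutting at a known code rather than matching a known descriptor), the same re.escape and join idea might look like the sketch below. It is written for Python 3, and CODES, CODE_RE and get_descriptor are illustrative names. It also assumes codes always appear as whole words, which the question implies but does not state.

```python
import re

CODES = ("dps", "23", "fd")  # assumed fixed list of codes from the question

# \b on both sides so that a code only matches as a whole word.
CODE_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, CODES)) + r")\b")

def get_descriptor(text):
    """Return text up to the earliest code, or all of it if no code occurs."""
    m = CODE_RE.search(text)
    if m is None:
        return text
    return text[:m.start()].rstrip()

for line in ("a descriptor dps 23 fd", "another 23 fd",
             "and another fd", "and one without a code"):
    print(repr(get_descriptor(line)))
```

Because search() scans left to right, the cut happens at whichever code comes first in the text, regardless of the order of CODES.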

Here's an answer that works for all codes rather than forcing you to call the function for each code, and is a bit simpler than some of the answers above. It also works for all of your examples.
strings = ('a descriptor dps 23 fd', 'another 23 fd', 'and another fd',
           'and one without a code')
codes = ('dps', '23', 'fd')
def strip(s):
    try:
        return s[:min(s.find(c) for c in codes if c in s)]
    except ValueError:
        return s
print map(strip, strings)
Output:
['a descriptor ', 'another ', 'and another ', 'and one without a code']
I believe this satisfies all of your criteria.
Edit: I quickly realized you could remove the try/except if you don't like relying on the exception:
def strip(s):
    if not any(c in s for c in codes):
        return s
    return s[:min(s.find(c) for c in codes if c in s)]

def crop_string(string, pattern):
    del_items = []
    for indx, val in enumerate(pattern):
        a = string.split(val, 1)
        del_items.append(a[indx])
    for del_item in del_items:
        string = string.replace(del_item, "")
    return string
Example:
I want to crop the string and keep only the array in it.
strin = "crop the array [1,2,3,4,5]"
pattern = ["[", "]"]
Usage:
a = crop_string(strin, pattern)
print a
# --- Prints "[1,2,3,4,5]"

Related

Capitalize text without capitalizing links in python

I need to capitalize a line of input, but if I just use the upper() function, link addresses get capitalized, thus making them unusable.
For example: "Cool Video www.youtube.com/watch?v=dQw4w9WgXcQ"
will turn to: "COOL VIDEO WWW.YOUTUBE.COM/WATCH?V=DQW4W9WGXCQ"
The link address has changed and won't work anymore. Is there any way to ignore links?
If I understand your goal correctly, you should first find the part of the string you want to uppercase, convert it, and then join it back with the rest of the original string, like this:
>>> import re
>>> s = "Cool Video -> www.youtube.com/watch?v=dQw4w9WgXcQ"
>>> #Look for the part of string you want to upper case
>>> m = re.search(r'^.*(?=\s+->)', s)
>>> m
<_sre.SRE_Match object; span=(0, 10), match='Cool Video'>
>>> #m.start() and m.end() will give you the start and end positions of the matched string.
>>> new_s = s[m.start():m.end()].upper() + s[m.end():]
>>> #remember that strings are immutable, so make new one
>>> new_s
'COOL VIDEO -> www.youtube.com/watch?v=dQw4w9WgXcQ'
>>> #OR
>>> new_s = m.group().upper() + s[m.end():]
>>> new_s
'COOL VIDEO -> www.youtube.com/watch?v=dQw4w9WgXcQ'
EDIT:
Another way is to look for the string preceding a link and then apply the upper method to it:
>>> s = "Cool Video www.youtube.com/watch?v=dQw4w9WgXcQ"
>>> m = re.search(r'(.*)(?=www.*)',s)
>>> s = m.group().upper() + s[m.end():]
>>> s
'COOL VIDEO www.youtube.com/watch?v=dQw4w9WgXcQ'
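If a line can contain several links, or text after a link, one possible generalization is to uppercase everything between link matches. The sketch below is Python 3 and rests on a loud assumption: a "link" is any whitespace-free run starting with http://, https:// or www. (LINK_RE and upper_except_links are hypothetical names, and real-world URL detection is messier than this).

```python
import re

# Assumption: links start with http://, https:// or www. and contain no spaces.
LINK_RE = re.compile(r"(?:https?://|www\.)\S+")

def upper_except_links(text):
    """Uppercase text, leaving anything that looks like a link untouched."""
    parts = []
    pos = 0
    for m in LINK_RE.finditer(text):
        parts.append(text[pos:m.start()].upper())  # plain text before the link
        parts.append(m.group())                    # the link itself, unchanged
        pos = m.end()
    parts.append(text[pos:].upper())               # trailing plain text
    return "".join(parts)

print(upper_except_links("Cool Video www.youtube.com/watch?v=dQw4w9WgXcQ"))
```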

Python Regular Expression with special characters

Having trouble writing a robust regular expression to grab information out of a string.
$ string1 = 'A_XYZ_THESE_WORDS'
$ string2 = 'A_ABC_THOSE_WORDS'
I would like a robust solution that pulls out 'THESE_WORDS' or 'THOSE_WORDS' from string1 or string2, respectively.
Basically, I need something that removes everything before the first two underscores (_), but the text before them will vary.
$ get_text = re.search('(?<=A_)\w+(_)',string1)
$ print get_text.group()
$ 'XYZ_THESE_'
Based on your problem statement:
I need something that removes everything before the first two underscores
you don't necessarily need a regular expression:
>>> string1 = 'A_XYZ_THESE_WORDS'
>>> string1.split("_", 2)[2]
'THESE_WORDS'
The second argument to str.split is the maximum number of times to split. This will split on the first two '_'s, then take the third item (the rest of the string) from the resulting list.
This will throw an IndexError if there are fewer than two underscores in the string - this lets you know that the string is not in a format you expect, but if this behaviour is not desirable, consider:
>>> string1 = 'A_XYZ_THESE_WORDS'
>>> string1.split("_", 2)[-1]
'THESE_WORDS'
Which takes the last item in the list from str.split, rather than assuming that there will be three. Comparison:
>>> "JUST_ONE".split("_", 2)[2]
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
"JUST_ONE".split("_", 2)[2]
IndexError: list index out of range
>>> "JUST_ONE".split("_", 2)[-1]
'ONE'
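If neither the IndexError nor the silent [-1] fallback is what you want, a third option is to check the number of pieces explicitly. A small sketch (after_second_underscore and its default parameter are made-up names for illustration):

```python
def after_second_underscore(s, default=None):
    """Return the text after the second underscore, or default if there is none."""
    parts = s.split("_", 2)  # at most two splits, so at most three pieces
    return parts[2] if len(parts) == 3 else default

print(after_second_underscore("A_XYZ_THESE_WORDS"))
print(after_second_underscore("JUST_ONE"))
```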
The regex below captures the text that comes just after the second underscore (_):
>>> import re
>>> string1 = 'A_XYZ_THESE_WORDS'
>>> string2 = 'A_ABC_THOSE_WORDS'
>>> m = re.search(r'^[^_]*_[^_]*_(.*)$', string1)
>>> m.group(1)
'THESE_WORDS'
>>> m = re.search(r'^[^_]*_[^_]*_(.*)$', string2)
>>> m.group(1)
'THOSE_WORDS'
In [21]: regex = re.compile(r'^([a-zA-Z]+_){2}(.*)$')
In [22]: m = regex.search(string1)
In [23]: m.groups()
Out[23]: ('XYZ_', 'THESE_WORDS')
In [24]: m = regex.search(string2)
In [25]: m.groups()
Out[25]: ('ABC_', 'THOSE_WORDS')

How to explain the behavior of 'Abc123P'.istitle() in Python?

I cannot understand the results returned by Python's istitle() method in the cases marked 'Incomprehensible' below:
>>> # Comprehensible
...
>>> print 'Abc123'.istitle()
True
>>> # Incomprehensible
...
>>> print 'Abc123P'.istitle()
True
>>> # Comprehensible
...
>>> print 'This is 27Python'.istitle()
False
>>> # Comprehensible
...
>>> print 'ABc123D'.istitle()
False
>>> # Incomprehensible
...
>>> print 'Abc1D'.istitle()
True
The documentation of this method is:
"i.e. uppercase characters may only follow uncased characters and lowercase characters only cased ones. Return False otherwise."
I thought it might be some special string behavior, say, treating '1D' as the decimal '1', but that doesn't seem to be the case when I print it out:
>>> # Check
...
>>> print 'Abc1D'
Abc1D
>>> l = []
>>> l.extend('Abc1D')
>>> print l
['A', 'b', 'c', '1', 'D']
I really cannot understand it. Is this a bug in Python?
I'm using Python 2.7 on Windows 7 Enterprise 64bit.
Take Abc123P as an example.
Uppercase characters: A and P. A follows nothing while P follows a decimal digit which is uncased.
Lowercase characters: b and c. b follows A which is cased; c follows b which is also cased.
Thus, Abc123P follows the definition of istitle().
The implementation is here.
It's pretty easy to read. The trick is to realize that digits are neither upper nor lower case, so they reset the previous_is_cased flag. The same goes for any other non-letter character: Abc&D -> True, ABc&D -> False.
For a simpler explanation, imagine your string with every non-letter character replaced by a space. Calling istitle() on that translated string gives the same result as calling it on the original.
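The rule described above can be sketched in plain Python. This is a simplified re-implementation for illustration only, not the actual CPython source. The name is_title is made up, and the variable previous_is_cased is borrowed from the answer above.

```python
def is_title(s):
    """Simplified model of str.istitle(), following the rule quoted above."""
    previous_is_cased = False
    seen_cased = False
    for ch in s:
        if ch.isupper() or ch.istitle():
            # An uppercase (or titlecase) character may only follow an uncased one.
            if previous_is_cased:
                return False
            previous_is_cased = True
            seen_cased = True
        elif ch.islower():
            # A lowercase character may only follow a cased one.
            if not previous_is_cased:
                return False
            previous_is_cased = True
            seen_cased = True
        else:
            # Digits, spaces and punctuation are uncased and reset the flag.
            previous_is_cased = False
    return seen_cased

for sample in ("Abc123", "Abc123P", "This is 27Python", "ABc123D", "Abc1D"):
    print(sample, is_title(sample), sample.istitle())
```

In 'Abc123P' the digits reset the flag, so the final P counts as the start of a new "word", which is why the result is True.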

Alternative to python string item assignment

What is the best/correct way to do item assignment on a Python string?
i.e. s = "ABCDEFGH"; s[1] = 'a'; s[-1] = 'b'
The normal way throws: 'str' object does not support item assignment
Strings are immutable. That means you can't assign to them at all. You could use formatting:
>>> s = 'abc{0}efg'.format('d')
>>> s
'abcdefg'
Or concatenation:
>>> s = 'abc' + 'd' + 'efg'
>>> s
'abcdefg'
Or replacement (thanks Odomontois for reminding me):
>>> s = 'abc0efg'
>>> s.replace('0', 'd')
'abcdefg'
But keep in mind that all of these methods create copies of the string, rather than modifying it in-place. If you want in-place modification, you could use a bytearray -- though that will only work for plain ascii strings, as alexis points out.
>>> b = bytearray('abc0efg')
>>> b[3] = 'd'
>>> b
bytearray(b'abcdefg')
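The session above is Python 2. On Python 3 the same approach needs two small changes: the bytearray has to be built from bytes (or from a str plus an encoding), and its items are integers. A brief sketch under that assumption:

```python
b = bytearray(b"abc0efg")   # Python 3: build from a bytes literal
b[3] = ord("d")             # items are ints, so assign the code point of 'd'
result = b.decode("ascii")  # convert back to str when done
print(result)
```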
Or you could create a list of characters and manipulate that. This is probably the most efficient and correct way to do frequent, large-scale string manipulation:
>>> l = list('abc0efg')
>>> l[3] = 'd'
>>> l
['a', 'b', 'c', 'd', 'e', 'f', 'g']
>>> ''.join(l)
'abcdefg'
And consider the re module for more complex operations.
String formatting and list manipulation are the two methods that are most likely to be correct and efficient IMO -- string formatting when only a few insertions are required, and list manipulation when you need to frequently update your string.
Since strings are "immutable", you get the effect of editing by constructing a modified version of the string and assigning it over the old value. If you want to replace or insert to a specific position in the string, the most array-like syntax is to use slices:
s = "ABCDEFGH"
s = s[:3] + 'd' + s[4:] # Change D to d at position 3
It's more likely that you want to replace a particular character or string with another. Do that with re, again collecting the result rather than modifying in place:
import re
s = "ABCDEFGH"
s = re.sub("DE", "--", s)
I guess this Object could help:
class Charray(list):
    def __init__(self, mapping=[]):
        "A character array."
        if type(mapping) in [int, float, long]:
            mapping = str(mapping)
        list.__init__(self, mapping)
    def __getslice__(self,i,j):
        return Charray(list.__getslice__(self,i,j))
    def __setitem__(self,i,x):
        if type(x) != str or len(x) > 1:
            raise TypeError
        else:
            list.__setitem__(self,i,x)
    def __repr__(self):
        return "charray['%s']" % self
    def __str__(self):
        return "".join(self)
For example:
>>> carray = Charray("Stack Overflow")
>>> carray
charray['Stack Overflow']
>>> carray[:5]
charray['Stack']
>>> carray[-8:]
charray['Overflow']
>>> str(carray)
'Stack Overflow'
>>> carray[6] = 'z'
>>> carray
charray['Stack zverflow']
For s = "ABCDEFGH" with s[1] = 'a' and s[-1] = 'b',
you can use slicing like this:
s=s[0:1]+'a'+s[2:]
This is much simpler than the more complex approaches.
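The slicing idea can be wrapped in a small helper that also handles negative indices such as the s[-1] case from the question. This is only a sketch, and replace_at is a hypothetical name:

```python
def replace_at(s, index, ch):
    """Return a copy of s with the character at index replaced by ch."""
    if index < 0:
        index += len(s)  # translate a negative index into a positive one
    if not 0 <= index < len(s):
        raise IndexError("string index out of range")
    return s[:index] + ch + s[index + 1:]

s = "ABCDEFGH"
s = replace_at(s, 1, "a")
s = replace_at(s, -1, "b")
print(s)
```

Each call builds a new string, so for many edits on a long string the list-of-characters approach is likely to be cheaper.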

Put a letter to the front of a string in python

Is there a quick way to place a string at the front of another string in Python? If so, how?
As an example, let's say that string = 'pple'. How would I put string_2 = 'a' at the start of string?
concatenate it:
string=char+string
>>> strg = 'pple'
>>> char = 'a'
>>> char + strg
'apple'
>>> strg = char + strg
>>> strg
'apple'
>>>
Here's a quick example:
string='a'+string
Lots of correct answers. You could also use "string interpolation" (or just "string formatting", when referring to str.format) if you are really looking to do some string manipulating (I only mention it because figuring out what the % is called can be frustrating):
>>> one_string = 'One string'
>>> two_string = 'two string'
>>> one_two = '%s, %s, red string, blue string' % (one_string, two_string)
>>> one_two
'One string, two string, red string, blue string'
I'll leave you to check it out if you like. See, for example, http://docs.python.org/library/stdtypes.html#string-formatting-operations
