String tokenization python

I wanted to split a string in python.
s= "ot.jpg/n"
I used str.split() but it gives me ['ot.jpg']. I want to get ot.jpg without brackets.

You want to use str.strip() to get rid of the newline. If you need to use split, it returns a list. To get the nth item from a list, index the list: ['foo', 'bar'][n].
Incidentally, naming your string str is a bad idea, since it shadows the built-in str type.
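As a minimal sketch of the difference, assuming the string actually ends with a real newline character ("\n") rather than the literal characters "/n" shown in the question:

```python
# Assumption: the file name is followed by a real newline ("\n").
s = "ot.jpg\n"

stripped = s.strip()   # removes leading/trailing whitespace, returns a str
parts = s.split()      # splits on whitespace, returns a list
first = parts[0]       # index the list to get the string back

print(stripped)
print(parts)
print(first)
```

Here strip() gives the string directly, while split() needs the extra indexing step to get rid of the brackets.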

The return value of the split() method is always a list -- in this case, it's given you a single-element list theList = ['ot.jpg']. Like any list, you get what you want out of it by indexing it:
myString = theList[0]

Sounds like you want replace:
s= "ot.jpg/n".replace("/n", "")
"ot.jpg"

Related

How to remove a substrings from a list of strings?

I have a list of strings, all of which have a common property, they all go like this "pp:actual_string". I do not know for sure what the substring "pp:" will be, basically : acts as a delimiter; everything before : shouldn't be included in the result.
I have solved the problem using the brute force approach, but I would like to see a clever method, maybe something like regex.
Note : Some strings might not have this "pp:string" format, and could be already a perfect string, i.e. without the delimiter.
This is my current solution:
ll = ["pp17:gaurav","pp17:sauarv","pp17:there","pp17:someone"]
res=[]
for i in ll:
    g=""
    for j in range(len(i)):
        if i[j] == ':':
            index=j+1
    res.append(i[index:len(i)])
print(res)
Is there a way that I can do it without creating an extra list ?

Whilst regex is an incredibly powerful tool with a lot of capabilities, using a "clever method" is not necessarily the best idea if you are unfamiliar with its principles.
Your problem is one that can be solved without regex by splitting on the : character using the str.split() method, and just returning the last part by using the [-1] index value to represent the last (or only) string that results from the split. This will work even if there isn't a :.
list_with_prefixes = ["pp:actual_string", "perfect_string", "frog:actual_string"]
cleaned_list = [x.split(':')[-1] for x in list_with_prefixes]
print(cleaned_list)
This is a list comprehension that takes each of the strings in turn (x) and splits it on the : character. The split returns a list containing the prefix (if it exists) and the suffix, and the comprehension builds a new list with only the suffix (i.e. item [-1] in the list that results from the split). In this example, it returns:
['actual_string', 'perfect_string', 'actual_string']
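If the goal is to avoid a second list, one possible sketch assigns each cleaned value back into the original list by index. The variable names strings and item are illustrative, not taken from the question:

```python
# Hypothetical input mixing prefixed and already-clean strings.
strings = ["pp17:gaurav", "perfect_string", "frog:actual_string"]

# Overwrite each element in place; the list length never changes,
# so modifying it while enumerating is safe here.
for i, item in enumerate(strings):
    strings[i] = item.split(":")[-1]

print(strings)
```

This keeps the same list object, which matters if other code holds a reference to it.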

Here are a few options, based upon different assumptions.
Most explicit
if s.startswith('pp:'):
    s = s[len('pp:'):] # aka 3
If you want to remove anything before the first :
s = s.split(':', 1)[-1]
Regular expressions (after import re):
Same as startswith
s = re.sub('^pp:', '', s)
Same as split, but more careful with 'pp:' and slower
s = re.match('(?:^pp:)?(.*)', s).group(1)
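To see where these options actually differ, here is a small comparison on made-up sample strings (the samples, including one with two colons and one with a different prefix, are assumptions chosen for illustration):

```python
import re

# Hypothetical samples: plain prefix, no prefix, two colons, other prefix.
samples = ["pp:actual_string", "perfect_string", "pp:a:b", "xx:keep"]

# Keep only what follows the LAST colon.
last_part = [s.split(":")[-1] for s in samples]

# Keep everything after the FIRST colon.
after_first = [s.split(":", 1)[-1] for s in samples]

# Remove only a literal leading "pp:"; other prefixes are left alone.
no_pp = [re.sub(r"^pp:", "", s) for s in samples]

print(last_part)
print(after_first)
print(no_pp)
```

Which one is right depends on whether the suffix may itself contain a colon and whether prefixes other than "pp" should be removed.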

Can you cast a string to upper case while formatting

Motivation
Suppose you have a string that is used twice in one string. However, in one case it is upper, and the other it is lower. If a dictionary is used, it would seem the solution would be to add a new element that is uppercase.
Suppose I have a python string ready to be formatted as:
string = "{a}_{b}_{c}_{a}"
With a desired output of:
HELLO_bye_hi_hello
I also have a dictionary ready as:
dictionary = {'a': "hello", 'b': "bye", 'c': "hi"}
Without interacting with the dictionary to set a new element d as being "HELLO" such as:
dictionary['d'] = dictionary['a'].upper()
string = "{d}_{b}_{c}_{a}"
print(string.format(**dictionary))
>>> HELLO_bye_hi_hello
Is there a way to set element a to always be uppercase in one case of the string? For example something like:
string= "{a:upper}_{b}_{c}_{a}"
print(string.format(**dictionary))
>>> HELLO_bye_hi_hello

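Yes, but only with an f-string (Python 3.6+), which can evaluate expressions such as method calls. It requires the values to be available as variables rather than dictionary keys; str.format() cannot call methods inside a replacement field:
string = f"{a.upper()}_{b.lower()}_{c.lower()}_{a.lower()}"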

Nope, you can't do that with str.format() directly.
In the simplest solution, you can write a lambda to uppercase the values before formatting. Or you can subclass string.Formatter if you really want to achieve your goal that way.
The following link can help if you are going for the harder method.
Python: Capitalize a word using string.format()
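A minimal sketch of the subclassing route, assuming we are free to invent our own format specs. The class name CaseFormatter and the "upper"/"lower" specs are hypothetical, not part of the standard library; only string.Formatter and its format_field hook are:

```python
import string


class CaseFormatter(string.Formatter):
    """Formatter that understands two made-up specs: 'upper' and 'lower'."""

    def format_field(self, value, format_spec):
        # Handle the custom specs ourselves...
        if format_spec == "upper":
            return str(value).upper()
        if format_spec == "lower":
            return str(value).lower()
        # ...and defer everything else to the normal behaviour.
        return super().format_field(value, format_spec)


values = {"a": "hello", "b": "bye", "c": "hi"}
result = CaseFormatter().format("{a:upper}_{b}_{c}_{a}", **values)
print(result)
```

The dictionary is left untouched; the casing lives entirely in the template string, which is what the question asked for.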

How to pass a multiple elements of the list to a re.split() function ?

f = open('sentences.txt')
lines = [line.lower() for line in f]
print lines[0:5]
words = re.split("\s+", lines[0:5])
With print it works perfectly well, but when I try to do the same inside re.split(), I get the error "TypeError: expected string or buffer".

I think you're searching for join, i.e.:
words = "".join(lines[0:5]).split()
Note:
No need for re module, split() is enough.

Why not just:
words = re.split(r"\s+", ''.join(lines))
The split function expects a string, which is then split into substrings based on the regex and returned as a list. Passing a list would not make a whole lot of sense. If you're expecting it to take your list of strings and split each string element individually and then return a list of lists of strings, you'll have to do that yourself:
lines_split = []
for line in lines:
    lines_split.append(re.split(r"\s+", line))

As you see, you are getting a TypeError in your function call, which means that you are passing the wrong parameter from what the function is expecting. So you need to think about what you are passing.
If you have a debugger or IDE you can step through and see what type your parameter has, or even use type to print it, via
print(type(lines[0:5]))
which returns
<class 'list'>
so you need to transform that into a string. Each element in your list is a string, so think of a way to get each row out of the list. An example would be
words = [re.split(r'\s+', line) for line in lines]
where I am using a list comprehension to step through lines and process each row individually.

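Your re.split(r'\s+', line) is nearly equivalent to line.split(); the difference is that line.split() also drops the empty strings that leading or trailing whitespace (such as the newline at the end of each line) would otherwise produce. So you can write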
words = [line.split() for line in lines]
See the documentation for str.split.
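A short sketch of how the two calls compare on a line that still carries its trailing newline, as lines read from a file usually do (the sample text is made up):

```python
import re

# Hypothetical line as it would come out of iterating over a file.
line = "hello  world\n"

# The regex split leaves an empty string after the trailing newline.
regex_words = re.split(r"\s+", line)

# str.split() with no arguments discards leading/trailing whitespace.
plain_words = line.split()

print(regex_words)
print(plain_words)
```

For simple whitespace tokenization, the plain split() is usually the more convenient of the two.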

Format string in python list

I have a list which should contain string with a particular format or character i.e. {{str(x), str(y)},{str(x), str(y)}}. I tried to do string concat like: "{{"+str(x), str(y)+"},{"+str(x), str(y)+"}}" and append to list, but it gets surrounded by brackets: [({{str(x), str(y)}),({str(x), str(y)}})]
How can I get rid of the brackets or, better still, is there a better approach to having a list without brackets, like this: [{{string, string},{string, string}}]

The parentheses are because you're creating a tuple of three items:
"{{"+str(x)
str(y)+"},{"+str(x)
str(y)+"}}"
Try replacing those bare commas between str(x) and str(y) with +","+:
"{{"+str(x)+","+str(y)+"},{"+str(x)+","+str(y)+"}}"

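As an alternative sketch for building the braced string, the pair can be assembled once and reused, or written with str.format(), where literal braces must be doubled. The helper name braced and the sample values are hypothetical:

```python
def braced(x, y):
    """Return '{{x,y},{x,y}}' as one string (illustrative helper)."""
    pair = "{" + str(x) + "," + str(y) + "}"
    return "{" + pair + "," + pair + "}"


x, y = 1, 2

# A bare comma between expressions builds a tuple, not a string.
accidental_tuple = "{{" + str(x), str(y) + "}}"

# Same result with str.format(): '{{' and '}}' are escaped braces.
formatted = "{{{{{0},{1}}},{{{0},{1}}}}}".format(x, y)

items = [braced(x, y)]
print(items)
print(accidental_tuple)
print(formatted)
```

The helper version is easier to read than the escaped template, which is the main reason to prefer it here.
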
Remove items in a sequence from a string Python

Okay so I'm trying to make a function that will take a string and a sequence of items (in the form of either a list, a tuple or a string) and remove all items from that list from the string.
So far my attempt looks like this:
def eliminate(s, bad_characters):
    for item in bad_characters:
        s = s.strip(item)
    return s
However, for some reason when I try this or variations of this, it only returns either the original string or a version with only the first item in bad_characters removed.
>>> eliminate("foobar",["o","b"])
'foobar'
Is there a way to remove all items in bad_characters from the given string?

The reason your solution doesn't work is because str.strip() only removes characters from the outsides of the string, i.e. characters on the leftmost or rightmost end of the string. So, in the case of 'foobar', str.strip() with a single character argument would only work if you wanted to remove the characters 'f' and 'r'.
You could eliminate more of the inner characters with strip, but you would need to include one of the outer characters as well.
>>> 'foobar'.strip('of')
'bar'
>>> 'foobar'.strip('o')
'foobar'
Here's how to do it by string-joining a generator expression:
def eliminate(s, bad_characters):
    bc = set(bad_characters)
    return ''.join(c for c in s if c not in bc)

Try to replace the bad characters as empty strings.
def eliminate(s, bad_characters):
    for item in bad_characters:
        s = s.replace(item, '')
    return s
strip() doesn't work because it only removes characters from the beginning and end of the original string.
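The character-filtering and replace-based approaches agree for single characters but differ once an item is longer than one character. A small sketch, with the function names eliminate_chars and eliminate_substrings chosen here only to tell the two apart:

```python
def eliminate_chars(s, bad_characters):
    """Drop every character of s that appears as an item in bad_characters."""
    bad = set(bad_characters)
    return "".join(c for c in s if c not in bad)


def eliminate_substrings(s, bad_items):
    """Remove every occurrence of each item, which may be multi-character."""
    for item in bad_items:
        s = s.replace(item, "")
    return s


# Single characters: both agree, for list, tuple or string input.
print(eliminate_chars("foobar", ["o", "b"]))
print(eliminate_chars("foobar", ("o", "b")))
print(eliminate_substrings("foobar", "ob"))

# A two-character item only matches in the replace-based version.
print(eliminate_chars("foobar", ["oo"]))
print(eliminate_substrings("foobar", ["oo"]))
```

If the sequence is guaranteed to hold single characters, the set-based filter makes one pass over the string; otherwise the replace loop is the safer choice.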

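strip is not the right choice for this task, as it only removes characters from the leading and trailing ends of the string. Instead you can use the str.translate method. In Python 2:
>>> s,l="foobar",["o","b"]
>>> s.translate(None,''.join(l))
'far'
In Python 3, build the deletion table with str.maketrans:
>>> s.translate(str.maketrans('', '', ''.join(l)))
'far'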

Try this; using recursion may be time consuming. Note that it empties the list passed as seq, and needs a list because it calls pop():
def eliminate(s, seq):
    if seq:
        return eliminate(s.replace(seq.pop(), ""), seq)
    return s
>>> eliminate("foobar", ["o", "b"])
'far'
