Splitting with Regular Expression in Python [duplicate] - python

This question already has an answer here:
Does '[ab]+' equal '(a|b)+' in python re module?
(1 answer)
Closed 5 years ago.
I am relatively new to Python, and I am trying to split a string using re.
I have researched a bit, and I have come across a few examples and I tried them out. They seem to work, but with limitation.
I am using a dictionary with a string key that is associated with an integer value. I'm trying to apply a weight to each word that depends on the integer value associated with the key string. My issue is that the string isn't formatted perfectly and I need to split it on underscores ( _ ) as well as whitespace and other various delimiters. From what I understand, this needs to be done with regular expressions. My bit of code is as follows:
for key, value in sorted_articles.items():
    wordList = print(re.split(r'(_|\s|:|)',key))
When I print this out, it splits everything fine, but it also prints out the delimiters rather than ignoring them in the list. For example, the string "Hello_how are you_" gets stored in the list as ['Hello', '_', 'how', ' ', 'are', ' ', 'you','_'].
I'm not sure why the delimiters would be added to the list and I can't figure out how to fix it. Thanks in advance for the help!

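The delimiters show up in the list because your pattern is wrapped in a capturing group: re.split returns the text of every captured group along with the pieces. You can instead split on \W+, which matches one or more non-word characters (anything other than letters, digits, and the underscore), and add |_ to also split on underscores, since the underscore counts as a word character:
for key, value in sorted_articles.items():
    wordList = re.split(r'\W+|_', key)
    print(wordList)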
For instance:
import re
s = "Hello_how are you_"
print(re.split(r"\W+|_", s))
Output:
['Hello', 'how', 'are', 'you', '']
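As a rough sketch of the difference between the question's pattern and the alternatives, the snippet below compares a capturing group, a non-capturing group, and a character class. The variable names are made up for illustration, the empty alternative from the question's pattern is left out, and dropping the empty strings is an assumption about what the asker wants.

```python
import re

s = "Hello_how are you_"

# Capturing group: the matched delimiters are kept in the result.
with_group = re.split(r"(_|\s|:)", s)

# Non-capturing group (?:...): same alternation, delimiters dropped.
without_group = re.split(r"(?:_|\s|:)", s)

# Character class covering non-word characters and the underscore;
# a run of mixed delimiters such as "_ " is treated as one separator.
words = [w for w in re.split(r"[\W_]+", s) if w]  # drop empty strings

print(with_group)
print(without_group)
print(words)
```

The trailing empty string in the first two results comes from the delimiter at the very end of the input; the list comprehension in the last line is one way to filter it out.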

Related

Why are there empty items after re.split()? [duplicate]

This question already has an answer here:
re.split() gives empty elements in list
(1 answer)
Closed 23 days ago.
I assume I misunderstand how re.split() works.
Here is a real and simple example.
>>> import re
>>> re.split('(abc)', 'abc')
['', 'abc', '']
I'm confused about the first and last empty ('') elements in the resulting list. I would expect this result:
['abc']
This was a very simplified example. Please let me give something more complex.
>>> re.split(r'\[\[(.+?)\]\[(.+?)\]\]', '[[one][two]]')
['', 'one', 'two', '']
Here I would expect:
['one', 'two']
This third example with words before and after works as expected.
>>> re.split(r'\[\[(.+?)\]\[(.+?)\]\]', 'zero [[one][two]] three')
['zero ', 'one', 'two', ' three']
My final goal is to split (tokenize) a string with a regex and get both the split parts and the separators (the regex matches) as results. That is why I can't handle this with re.findall().
If you use capturing groups in the re.split expression, the text matched by the separator (here abc) is also returned in the output. This can be very useful, e.g. for tokenization tasks.
With a single capturing group, every second item in the return value is the captured separator; e.g. if (a.c) is the splitter and dabcdagce is the string being split, you get ['d', 'abc', 'd', 'agc', 'e'].
In your first example, since the split expression matches the whole string, you get empty strings "on the sides".
My answer is based on that answer in a similar question.
The behavior is as specified in the docs:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:
That way, separator components are always found at the same relative indices within the result list.
The last sentence in particular explains why this behavior is useful.
In short: when using capturing groups, the user/developer can always identify the separators (the matches) in the resulting list, because they always appear at the same positions. With one capturing group, every second element in the result is the matched separator (the captured group).
With two capturing groups, as in my example, the relative positions change and you have to count in threes: index 0 is the split token, 1 is the first captured group, 2 is the second captured group, and then the cycle repeats.
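The fixed positions described above can be used to pull tokens and separators apart again. This is only a sketch: the helper name split_with_separators is hypothetical, and it relies on the groups attribute of a compiled pattern to work out the stride.

```python
import re

def split_with_separators(pattern, text):
    """Return (tokens, separators) from re.split with capturing groups."""
    compiled = re.compile(pattern)
    stride = compiled.groups + 1          # one token plus one slot per group
    parts = compiled.split(text)
    tokens = parts[::stride]              # indices 0, stride, 2*stride, ...
    # Each separator is the tuple of captured groups between two tokens.
    separators = [
        tuple(parts[i + 1:i + stride])
        for i in range(0, len(parts) - 1, stride)
    ]
    return tokens, separators

tokens, separators = split_with_separators(
    r"\[\[(.+?)\]\[(.+?)\]\]", "zero [[one][two]] three"
)
print(tokens)
print(separators)
```

If the empty strings at the edges are not wanted, they can be filtered out of tokens afterwards, for example with [t for t in tokens if t].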

Split the string every special character with regular expressions [duplicate]

This question already has answers here:
What are non-word boundary in regex (\B), compared to word-boundary?
(2 answers)
Closed 3 years ago.
I want to split my string into pieces at every occurrence of some text followed by a special character. I have a string:
str = "ImEmRe#b'aEmRe#b'testEmRe#b'string"
I want my string to be split at every occurrence of EmRe#b'. As you can see, it contains the ' character, and that's the problem.
I tried doing re.split(r"EmRe#b'\B", str), re.split(r"EmRe#b?='\B", str) and also I tried both of them but without the r before the pattern. How do I do it? I'm really new to regular expressions. I would even say I've never used them.
Firstly, change the name of your variable, since str is a built-in Python type and reusing the name shadows it.
If you named your variable word, you could get a list of elements split by your specified string by doing this:
>>> word = "ImEmRe#b'aEmRe#b'testEmRe#b'string"
>>> word
"ImEmRe#b'aEmRe#b'testEmRe#b'string"
>>> word.split("EmRe#b'")
['Im', 'a', 'test', 'string']
The result is a list, which you can use in more ways than a single string. It can be saved to a variable, of course:
>>> foo = word.split("EmRe#b'")
>>> foo
['Im', 'a', 'test', 'string']
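If a regex is still preferred, a minimal sketch along these lines should also work. The variable names are illustrative, and re.escape is used on the assumption that the separator is meant literally. The second call shows why the \B attempt from the question returns the string unsplit.

```python
import re

word = "ImEmRe#b'aEmRe#b'testEmRe#b'string"
separator = "EmRe#b'"

# re.escape guards against any regex metacharacters in the separator.
parts = re.split(re.escape(separator), word)

# With a trailing \B the pattern never matches here: each ' (a non-word
# character) is followed by a letter, so that position is a word boundary
# and \B (which requires a non-boundary) fails.
unsplit = re.split(r"EmRe#b'\B", word)

print(parts)
print(unsplit == [word])
```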

Python - Delete a character from list or string

If the verb ends in e, drop the e and add -ing.
I'm inputting a string (an English verb). My goal is to delete the last character of the word if it's "e", and then add three more characters: "i", "n" and "g".
I'd like to know how to delete the list object or if possible a string character. And how to switch a list into a string.
Currently I have:
if verb_list[-1] == "e": # verb_list is the input string converted to a list
    verb_list[-1] = "i"
    verb_list.append("n")
    verb_list.append("g")
This isn't a proper solution for me. I'd like to know how to delete, for example, the [-1] element from a list or from a string. Also, here I'm left with a list, and I want my output to be a string.
Thanks for any help!
You can use re.sub:
re.sub('e$', 'ing', s)
The $ in the regex matches the pattern only if it's at the end of a string.
Example usage:
import re
data = ['date', 'today', 'done', 'cereal']
print([re.sub('e$', 'ing', s) for s in data])
#['dating', 'today', 'doning', 'cereal']
I know the words in data aren't verbs but those were words off the top of my head.
This should suffice:
if verb[-1]=='e':
    verb = verb[:-1]+"ing"
For more about slicing in Python, see "Understanding slice notation".
Try this:
li=list(verb)
if li[-1]=='e':
    li[-1]='ing'
verb=''.join(li)
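The slicing approach can be wrapped in a small function. This is a sketch only: the name add_ing is made up, it uses str.endswith so that an empty string does not raise an IndexError, and leaving other verbs unchanged is an assumption taken from the single rule quoted in the question.

```python
def add_ing(verb: str) -> str:
    """Drop a trailing 'e' and append 'ing'; other input is returned as-is."""
    if verb.endswith("e"):
        # verb[:-1] is the string without its last character.
        return verb[:-1] + "ing"
    return verb

print(add_ing("date"))
print(add_ing("today"))
```

Strings are immutable, so "deleting" a character always means building a new string; the slice verb[:-1] does that without converting to a list first.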

How can I output a string excluding ALL whitespaces? [duplicate]

This question already has answers here:
How to strip all whitespace from string
(14 answers)
Closed 4 years ago.
Basically, I'm trying to write code in Python where a user inputs a sentence. However, I need my code to remove ALL whitespace (e.g. tabs, space, index, etc.) and print the result.
This is what I have so far:
def output_without_whitespace(text):
    newText = text.split("")
    print('String with no whitespaces: '.join(newText))
I'm aware that I'm doing a lot wrong here and I'm missing plenty, but I haven't been able to thoroughly go over splitting and joining strings yet, so it'd be great if someone explained it to me.
This is the whole code that I have so far:
text = input(str('Enter a sentence: '))
print(f'You entered: {text}')
def get_num_of_characters(text):
    result = 0
    for char in text:
        result += 1
    return result
print('Number of characters: ', get_num_of_characters(text))
def output_without_whitespace(text):
    newtext = "".join(text.split())
    print(f'String without whitespaces: {newtext}')
I FIGURED OUT MY PROBLEM!
I realize that this line of code:
print(f'String without whitespaces: {newtext}')
is supposed to be:
print('String without whitespaces: ', output_without_whitespace(text))
The reason the sentence without whitespace was not printed back to me was that I was never calling my function!
You have the right idea, but here's how to implement it with split and join:
def output_without_whitespace(text):
    return ''.join(text.split())
so that:
output_without_whitespace(' this\t is a\n test..\n ')
would return:
thisisatest..
A trivial solution is to just use split and rejoin (similar to what you are doing):
def output_without_whitespace(text):
    return ''.join(text.split())
First we split the initial string to a list of words, then we join them all together.
So to think about it a bit:
text.split()
will give us a list of words (split by any whitespace). So for example:
'hello world'.split() -> ['hello', 'world']
And finally
''.join(<result of text.split()>)
joins all of the words in the given list to a single string. So:
''.join(['hello', 'world']) -> 'helloworld'
See Remove all whitespace in a string in Python for more ways to do it.
Get input, split, join:
s = ''.join(input('Enter string: ').split())
print(s)
Enter string: vash the stampede
vashthestampede
There are a few different ways to do this, but this seems the most obvious one to me. It is simple and efficient.
>>> with_spaces = ' The quick brown fox '
>>> list_no_spaces = with_spaces.split()
>>> ''.join(list_no_spaces)
'Thequickbrownfox'
.split() with no parameter splits a string into a list wherever there are one or more whitespace characters, leaving out the whitespace.
''.join(list_no_spaces) joins the elements of the list into a string with nothing between the elements, which is what you want here: 'Thequickbrownfox'.
If you had used ','.join(list_no_spaces) you'd get 'The,quick,brown,fox'.
Experienced Python programmers tend to use regular expressions sparingly. Often it's better to use tools like .split() and .join() to do the work, and keep regular expressions for where there is no alternative.
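For comparison, here is a small sketch that puts the split/join approach next to a regular-expression version. The sample string is the one used in an earlier answer; the claim that the two agree is only checked for this input with ordinary spaces, tabs and newlines.

```python
import re

text = " this\t is a\n test..\n "

# split() with no argument breaks on runs of whitespace; join glues the
# remaining pieces back together with nothing in between.
joined = "".join(text.split())

# Regex alternative: delete every run of whitespace characters.
substituted = re.sub(r"\s+", "", text)

print(joined)
print(joined == substituted)
```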

how to ignore punctuation when counting characters in string in python

In my homework there is a question about writing a function words_of_length(N, s) that picks the unique words of a certain length from a string, ignoring punctuation.
What I am trying to do is:
def words_of_length(N, s): # N as integer, s as string
    # this line should remove punctuation from the string, but I don't know how to do it
    return [x for x in s if len(x) == N] # this line should return a list of unique words with a certain length
So my problem is that I don't know how to remove punctuation. I did look at "best way to remove punctuation from string" and related questions, but those look too difficult for my level, and my teacher also requires that the function contain no more than 2 lines of code.
Sorry that I can't format the code in my question properly; it's the first time I've asked a question here and there is much I need to learn, but please help me with this one. Thanks.
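Use string.strip(s[, chars]) (Python 2 only; in Python 3, use the str.strip method instead):
https://docs.python.org/2/library/string.html
In your function, replace x with x.strip('.,:;!?'). Note that chars must be a string, not a list, and that strip only removes characters from the start and end of each word.
Add more punctuation if needed.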
First of all, you need to create a new string without characters you want to ignore (take a look at string library, particularly string.punctuation), and then split() the resulting string (sentence) into substrings (words). Besides that, I suggest using type annotation, instead of comments like those.
def words_of_length(n: int, s: str) -> list:
    return [x for x in ''.join(char for char in s if char not in __import__('string').punctuation).split() if len(x) == n]
>>> words_of_length(3, 'Guido? van, rossum. is the best!')
['van', 'the']
['van', 'the']
Alternatively, instead of string.punctuation you can define a variable with the characters you want to ignore yourself.
You can remove punctuation by using string.punctuation.
>>> from string import punctuation
>>> text = "text,. has ;:some punctuation."
>>> text = ''.join(ch for ch in text if ch not in punctuation)
>>> text # with no punctuation
'text has some punctuation'
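Since the homework asks for unique words, one possible two-line body is sketched below. It swaps in str.translate for the character-by-character join, and uses dict.fromkeys to drop repeats while keeping first-seen order; both choices are assumptions about what the exercise accepts, not something the question specifies.

```python
import string

def words_of_length(n: int, s: str) -> list:
    # Delete every punctuation character, then split on whitespace.
    cleaned = s.translate(str.maketrans("", "", string.punctuation))
    # dict.fromkeys keeps the first occurrence of each word, in order.
    return list(dict.fromkeys(w for w in cleaned.split() if len(w) == n))

print(words_of_length(3, "Guido? van, rossum. is the best!"))
print(words_of_length(3, "the cat, the dog!"))
```

If order does not matter, wrapping the generator in set(...) would also remove duplicates.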
