Say I was given a string like so
text = "1234 I just ? shut * the door"
I want to use a regex with re.compile() such that when I split the string, all of the words come first.
I.e. it should look like this.
text = ["I", "just", "shut", "the", "door", "1234", "?", "*"]
How can I use re.compile() to split the string this way?
import re
r = re.compile('regex to split string so that words are first').split(text)
Please let me know if you need any more information.
Thank you for the help.
IIUC, you don't need re. Just use str.split with sorted:
sorted(text.split(), key=lambda x: not x.isalpha())
Output:
['I', 'just', 'shut', 'the', 'door', '1234', '?', '*']
You can use sorted with re.findall:
import re
text = "1234 I just ? shut * the door"
r = sorted(text.split(), key=lambda x: (x.isalpha(), x.isdigit(), bool(re.findall(r'^\W+$', x))), reverse=True)
Output:
['I', 'just', 'shut', 'the', 'door', '1234', '?', '*']
You can't do that with a single regex. You can write one regex to get all words, then another regex to get everything else.
import re
text = "1234 I just ? shut * the door"
r = re.compile(r'[a-zA-Z]+')
words = r.findall(text)
r = re.compile(r'[^a-zA-Z\s]+')
other = r.findall(text)
print(words + other) # ['I', 'just', 'shut', 'the', 'door', '1234', '?', '*']
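If a single compiled pattern is preferred, one possible sketch is to use alternation with named groups and sort each match into one of two lists depending on which group matched. The names TOKEN and words_first are made up for illustration, and this is still a scan plus a concatenation rather than a pure split.

```python
import re

# One pattern, two named alternatives: runs of letters, or runs of
# anything that is neither a letter nor whitespace.
TOKEN = re.compile(r"(?P<word>[a-zA-Z]+)|(?P<other>[^a-zA-Z\s]+)")

def words_first(text):
    words, other = [], []
    for match in TOKEN.finditer(text):
        # lastgroup is the name of the alternative that matched
        if match.lastgroup == "word":
            words.append(match.group())
        else:
            other.append(match.group())
    return words + other

print(words_first("1234 I just ? shut * the door"))
```

Both lists keep the order in which the tokens appear in the input.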
How would I delete an unknown character from a list of strings?
For example, my list is ['hi', 'h#w', 'are!', 'you;', '25'] and I want to delete all the characters that are not letters or numbers.
How would I do this?
Regex:
import re
s = ['hi', 'h#w', 'are!', 'you;', '25']
[re.sub(r'[^A-Za-z0-9 ]+', '', x) for x in s]
['hi', 'hw', 'are', 'you', '25']
Use re.sub:
from re import sub
lst = ['hi', 'h#w', 'are!', 'you;', '25']
lst = [sub(r'[^\w]', '', i) for i in lst]
print(lst)
Output:
['hi', 'hw', 'are', 'you', '25']
Explanation:
This line: sub(r'[^\w]', '', i) tells Python to replace every substring of the i string that matches the pattern "[^\w]" with an empty string, "", and return the result.
The pattern [^\w] matches every character that is not a letter, a digit, or an underscore.
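To make the difference between \w and an explicit character class concrete, here is a small sketch. The helper names and the sample list are assumptions for illustration only.

```python
import re

def strip_non_word(items):
    # \w keeps letters, digits and the underscore (Unicode-aware for str patterns)
    return [re.sub(r'[^\w]', '', s) for s in items]

def strip_non_ascii_alnum(items):
    # Explicit class: keeps only ASCII letters and digits
    return [re.sub(r'[^A-Za-z0-9]', '', s) for s in items]

sample = ['h#w', 'snake_case', 'još!']
print(strip_non_word(sample))         # ['hw', 'snake_case', 'još']
print(strip_non_ascii_alnum(sample))  # ['hw', 'snakecase', 'jo']
```

Which one is appropriate depends on whether underscores and non-ASCII letters should survive.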
I will be given a string, and I need to split it every time that it has an "|", "/", "." or "_"
How can I do this fast? I know how to use the command split, but is there any way to give more than 1 split condition to it? For example, if the input given was
Hello test|multiple|36.strings/just36/testing
I want the output to give:
"['Hello test', 'multiple', '36', 'strings', 'just36', 'testing']"
Use a regex with the re module:
>>> import re
>>> s='You/can_split|multiple'
>>> re.split(r'[/_|.]', s)
['You', 'can', 'split', 'multiple']
In this case, [/_|.] will split on any of those characters.
Or, you can use a list comprehension to insert a single (perhaps multiple character) delimiter and then split on that:
>>> ''.join(['-><-' if c in '/_|.' else c for c in s]).split('-><-')
['You', 'can', 'split', 'multiple']
With the added example:
>>> s2="Hello test|multiple|36.strings/just36/testing"
Method 1:
>>> re.split(r'[/_|.]', s2)
['Hello test', 'multiple', '36', 'strings', 'just36', 'testing']
Method 2:
>>> ''.join(['-><-' if c in '/_|.' else c for c in s2]).split('-><-')
['Hello test', 'multiple', '36', 'strings', 'just36', 'testing']
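One thing to watch with Method 1: adjacent or trailing delimiters produce empty strings in the result. A minimal sketch that handles this, assuming empty pieces are unwanted (split_on is a made-up helper name):

```python
import re

def split_on(text, delimiters="/_|."):
    # re.escape makes the characters safe inside the class;
    # the trailing + treats a run of delimiters as one separator.
    pattern = "[" + re.escape(delimiters) + "]+"
    # Drop the empty strings left by leading or trailing delimiters.
    return [part for part in re.split(pattern, text) if part]

print(split_on("Hello test|multiple|36.strings/just36/testing"))
print(split_on("a//b_.c|"))
```

If the empty strings carry meaning (for example, empty fields in a record), keep the plain re.split call instead.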
Use groupby:
from itertools import groupby
s = 'You/can_split|multiple'
separators = set('/_|.')
result = [''.join(group) for k, group in groupby(s, key=lambda x: x not in separators) if k]
print(result)
Output
['You', 'can', 'split', 'multiple']
I am trying to split the sentences in words.
words = content.lower().split()
this gives me the list of words like
'evening,', 'and', 'there', 'was', 'morning--the', 'first', 'day.'
and with this code:
def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:
        symbols = "~!##$%^&*()_+`{}|\"?><`-=\][';/.,']"
        for i in range(0, len(symbols)):
            word = word.replace(symbols[i], "")
        if len(word) > 0:
            clean_word_list.append(word)
I get something like:
'evening', 'and', 'there', 'was', 'morningthe', 'first', 'day'
Notice the word "morningthe" in the list: it used to have "--" between the two words. Is there any way I can split it into two words, "morning" and "the"?
I would suggest a regex-based solution:
import re
def to_words(text):
    return re.findall(r'\w+', text)
This finds all words, i.e. runs of word characters (letters, digits, and underscores), ignoring symbols, separators, and whitespace.
>>> to_words("The morning-the evening")
['The', 'morning', 'the', 'evening']
Note that if you're looping over the words, re.finditer, which returns an iterator of match objects, is probably better, as you don't have to store the whole list of words at once.
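A minimal sketch of the lazy variant, assuming the same \w+ pattern (iter_words is a made-up name):

```python
import re

WORD = re.compile(r'\w+')

def iter_words(text):
    # finditer yields match objects one at a time; group() is the matched text
    for match in WORD.finditer(text):
        yield match.group()

for word in iter_words("The morning-the evening"):
    print(word)
```

Wrapping the generator in list() gives the same result as re.findall.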
Alternatively, you may use itertools.groupby along with str.isalpha to extract alphabetic-only words from the string:
>>> from itertools import groupby
>>> sentence = 'evening, and there was morning--the first day.'
>>> [''.join(j) for i, j in groupby(sentence, str.isalpha) if i]
['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']
PS: The regex-based solution is much cleaner. I have mentioned this as a possible alternative.
Specific to OP: If all you want is to also split on -- in the resulting list, you can first replace hyphens '-' with spaces ' ' before calling split. Your code would then be:
words = content.lower().replace('-', ' ').split()
where words will hold the value you desire.
Trying to do this with regexes will drive you crazy, e.g.
>>> re.findall(r'\w+', "Don't read O'Rourke's books!")
['Don', 't', 'read', 'O', 'Rourke', 's', 'books']
Definitely look at the nltk package.
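For the specific apostrophe case above, a stopgap (not a replacement for a real tokenizer) is a pattern that allows apostrophes inside a word. This is only a sketch; the name to_words_keep_apostrophes is made up, and it will still get other edge cases wrong (hyphenated words, trailing possessives such as dogs', and so on).

```python
import re

# A word is a run of word characters, optionally followed by
# one or more apostrophe + word-character groups.
WORD_WITH_APOSTROPHE = re.compile(r"\w+(?:'\w+)*")

def to_words_keep_apostrophes(text):
    return WORD_WITH_APOSTROPHE.findall(text)

print(to_words_keep_apostrophes("Don't read O'Rourke's books!"))
```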
Besides the solutions given already, you could also improve your clean_up_list function to do a better job.
def clean_up_list(word_list):
    clean_word_list = []
    # Move the list out of loop so that it doesn't
    # have to be initiated every time.
    symbols = "~!##$%^&*()_+`{}|\"?><`-=\][';/.,']"
    for word in word_list:
        current_word = ''
        for index in range(len(word)):
            if word[index] in symbols:
                if current_word:
                    clean_word_list.append(current_word)
                    current_word = ''
            else:
                current_word += word[index]
        if current_word:
            # Append possible last current_word
            clean_word_list.append(current_word)
    return clean_word_list
Actually, you could apply the block in for word in word_list: to the whole sentence to get the same result.
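A sketch of that idea applied to the whole sentence, so that no prior split() is needed. It assumes whitespace is treated as a separator too, and it swaps the hand-written symbols string for string.punctuation; split_sentence and SEPARATORS are made-up names.

```python
import string

# Assumption: every ASCII punctuation or whitespace character ends a word.
SEPARATORS = set(string.punctuation) | set(string.whitespace)

def split_sentence(sentence):
    words = []
    current = []
    for ch in sentence:
        if ch in SEPARATORS:
            # A separator closes the word collected so far, if any
            if current:
                words.append(''.join(current))
                current = []
        else:
            current.append(ch)
    if current:
        # Append a possible last word
        words.append(''.join(current))
    return words

print(split_sentence("evening, and there was morning--the first day."))
```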
You could also do this:
import re
def word_list(text):
    return list(filter(None, re.split(r'\W+', text)))
print(word_list("Here we go round the mulberry-bush! And even---this and!!!this."))
Returns:
['Here', 'we', 'go', 'round', 'the', 'mulberry', 'bush', 'And', 'even', 'this', 'and', 'this']
I need to split a string into a list of words, separating on white spaces, and deleting all special characters except for '
For example:
page = "They're going up to the Stark's castle [More:...]"
needs to be turned into a list
["They're", 'going', 'up', 'to', 'the', "Stark's", 'castle', 'More']
right now I can only remove all special characters using
re.sub("[^\w]", " ", page).split()
or just split, keeping all special characters using
page.split()
Is there a way to specify which characters to remove, and which to keep?
Use str.split as normal, then filter the unwanted characters out of each word:
>>> page = "They're going up to the Stark's castle [More:...]"
>>> result = [''.join(c for c in word if c.isalpha() or c=="'") for word in page.split()]
>>> result
["They're", 'going', 'up', 'to', 'the', "Stark's", 'castle', 'More']
import re
page = "They're going up to the Stark's castle [More:...]"
s = re.sub(r"[^\w' ]", "", page).split()
out:
["They're", 'going', 'up', 'to', 'the', "Stark's", 'castle', 'More']
First use [\w' ] to match the characters you want to keep, then add ^ to negate the class and replace everything else with '' (nothing).
Here is a solution.
Replace all characters other than alphanumerics and single quotes with a space and remove any trailing spaces.
Then split the string using a space as the delimiter.
import re
page = "They're going up to the Stark's castle [More:...]"
page = re.sub("[^0-9a-zA-Z']+", ' ', page).rstrip()
print(page)
p=page.split(' ')
print(p)
Here is the output.
["They're", 'going', 'up', 'to', 'the', "Stark's", 'castle', 'More']
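One caveat with split(' '): if the text starts with special characters, the substitution leaves a leading space and the result gets an empty first element. A small variation, sketched here under the assumption that empty strings are never wanted, uses split() with no argument instead (words_keep_apostrophes is a made-up name):

```python
import re

def words_keep_apostrophes(page):
    # Collapse every run of unwanted characters into one space, then let
    # split() discard leading, trailing and repeated whitespace.
    return re.sub(r"[^0-9a-zA-Z']+", " ", page).split()

print(words_keep_apostrophes("They're going up to the Stark's castle [More:...]"))
print(words_keep_apostrophes("[More:...] They're here"))
```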
Using ''.join() and a nested list comprehension would be a simpler option in my opinion:
>>> page = "They're going up to the Stark's castle [More:...]"
>>> [''.join([c for c in w if c.isalpha() or c == "'"]) for w in page.split()]
["They're", 'going', 'up', 'to', 'the', "Stark's", 'castle', 'More']
common_words = set(['je', 'tek', 'u', 'još', 'a', 'i', 'bi',
's', 'sa', 'za', 'o', 'kojeg', 'koju', 'kojom', 'kojoj',
'kojega', 'kojemu', 'će', 'što', 'li', 'da', 'od', 'do',
'su', 'ali', 'nego', 'već', 'no', 'pri', 'se', 'li',
'ili', 'ako', 'iako', 'bismo', 'koji', 'što', 'da', 'nije',
'te', 'ovo', 'samo', 'ga', 'kako', 'će', 'dobro',
'to', 'sam', 'sve', 'smo', 'kao'])
all = []
for (item_content, item_title, item_url, fetch_date) in cursor:
    #text = "{}".format(item_content)
    text = item_content
    text = re.sub('[,.?";:\-!##$%^&*()]', '', text)
    text = text.lower()
    #text = [w for w in text if not w in common_words]
    all.append(text)
I want to delete certain words (stopwords) from either the variable "text" or, later, from the list "all" that collects the "text" values from each iteration.
I tried it like this, but this doesn't delete just the words: it also deletes those letters wherever they occur in other words, and the output looks like 'd', 'f' for every word. I want the format to stay the same; I just need the words in the common_words list deleted from the variable (or the list). How would I achieve that?
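A Pythonic way to remove the punctuation from a text is the str.translate method (Python 3 form shown; in Python 2 the call was text.translate(None, punctuation)):
>>> from string import punctuation
>>> "this is224$# a ths".translate(str.maketrans('', '', punctuation))
'this is224 a ths'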
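To remove the words, use re.sub. First build the regex by joining the words with the pipe (|), wrapping each one in \b word boundaries so that only whole words match:
reg = '|'.join(r'\b{}\b'.format(w) for w in common_words)
new_text = re.sub(reg, '', text)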
Example:
>>> s="this is224$# a ths"
>>> import re
>>> from string import punctuation
>>> w=['this','a']
>>> boundary_words=[r'\b{}\b'.format(i) for i in w]
>>> reg='|'.join(boundary_words)
>>> new_text=re.sub(reg,'',s).translate(str.maketrans('', '', punctuation))
>>> new_text
' is224  ths'
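An alternative that avoids the leftover spaces is to filter whole tokens instead of substituting inside the string. This is only a sketch for Python 3: remove_stopwords is a made-up name, string.punctuation covers ASCII punctuation only, and runs of whitespace are normalized to single spaces.

```python
from string import punctuation

def remove_stopwords(text, stopwords):
    # Strip ASCII punctuation and lowercase, as in the question's loop
    cleaned = text.translate(str.maketrans('', '', punctuation)).lower()
    # Compare whole tokens, so letters inside other words are untouched
    return ' '.join(word for word in cleaned.split() if word not in stopwords)

common_words = {'je', 'u', 'a', 'i', 'this'}
print(remove_stopwords("This is224$# a ths", common_words))  # is224 ths
```

Inside the question's loop this would be called as all.append(remove_stopwords(item_content, common_words)).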