Split string with whitespace and then do a count - python

The sample below strips punctuation from the text of a ranbo.txt file and converts it to lower case.
How do I split the result on whitespace?
import string

infile = open('ranbo.txt', 'r')
lowercased = infile.read().lower()
for c in string.punctuation:
    lowercased = lowercased.replace(c, "")
white_space_words = lowercased.split(?????????)
print white_space_words
After this split, how can I find out how many words are in the list?
Should I use the count or the len function?

white_space_words = lowercased.split()
With no argument, split() splits on any run of whitespace characters.
'a b \t cd\n ef'.split()
returns
['a', 'b', 'cd', 'ef']
You could also go the other way round and match the words instead of splitting on the gaps:
import re
words = re.findall(r'\w+', text)
returns a list of all "words" from text.
Get its length using len():
len(words)
and if you want to join them into a new string with newlines:
text = '\n'.join(words)
As a whole:
import re

with open('ranbo.txt', 'r') as f:
    lowercased = f.read().lower()
words = re.findall(r'\w+', lowercased)
number_of_words = len(words)
text = '\n'.join(words)
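To make the difference between the two approaches concrete, here is a minimal sketch that works on an in-memory string rather than ranbo.txt. The sample sentence and the count_words name are made up for illustration.

```python
import re
import string

def count_words(text):
    """Lowercase text, strip punctuation, split on whitespace, and count."""
    lowercased = text.lower()
    for c in string.punctuation:
        lowercased = lowercased.replace(c, "")
    words = lowercased.split()   # no argument: split on any run of whitespace
    return words, len(words)

sample = "Hello, world!  It's a\ttest.\n"   # hypothetical input
words, total = count_words(sample)
print(words)   # ['hello', 'world', 'its', 'a', 'test']
print(total)   # 5

# re.findall(r'\w+') treats the apostrophe as a separator instead
print(re.findall(r'\w+', sample.lower()))
# ['hello', 'world', 'it', 's', 'a', 'test']
```

So len() answers the counting question either way, but the two methods can disagree on what counts as a word.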

Related

Split string in Python while keeping the line break inside the generated list

As simple as it sounds, I can't think of a straightforward way of doing the following in Python.
my_string = "This is a test.\nAlso\tthis"
list_i_want = ["This", "is", "a", "test.", "\n", "Also", "this"]
I need the same behaviour as string.split(), i.e. any type and amount of whitespace is removed, except for line breaks (\n), each of which should appear as a standalone list item.
How could I do this?
Split String using Regex findall()
import re
my_string = "This is a test.\nAlso\tthis"
my_list = re.findall(r"\S+|\n", my_string)
print(my_list)
How it works:
"\S+": "\S" matches a non-whitespace character, and "+" is a greedy quantifier, so together they match any run of non-whitespace characters, i.e. a word
"|": OR logic
"\n": matches "\n", so each line break is returned in the list as well
Output:
['This', 'is', 'a', 'test.', '\n', 'Also', 'this']
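As a small sketch of how this pattern behaves beyond the example above, the same call can be wrapped in a function (the name split_keep_newlines is made up here) and tried on a string with a blank line:

```python
import re

def split_keep_newlines(text):
    """Return non-whitespace runs, with each newline as its own item."""
    return re.findall(r"\S+|\n", text)

print(split_keep_newlines("This is a test.\nAlso\tthis"))
# ['This', 'is', 'a', 'test.', '\n', 'Also', 'this']

# Each newline is matched separately, so a blank line yields two items.
print(split_keep_newlines("a\n\nb"))
# ['a', '\n', '\n', 'b']
```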
Here's some code that works but is definitely not efficient or Pythonic:
my_string = "This is a test.\nAlso\tthis"
l = my_string.splitlines()  # Splitting lines
list_i_want = []
for i in l:
    list_i_want.extend(i.split())  # Extending the list with the words of each line
    list_i_want.append('\n')  # Adding the newline character
list_i_want.pop()  # Removing the last newline character
print(list_i_want)
Output:
['This', 'is', 'a', 'test.', '\n', 'Also', 'this']

How can I extract hashtags from string?

I need to extract the hashtags (the words marked with "#") in a function that receives a string.
Here's what I've done:
def hashtag(str):
    lst = []
    for i in str.split():
        if i[0] == "#":
            lst.append(i[1:])
    return lst
My code works, but it does not separate hashtags that are joined together. For the example string "Python is #great #Computer#Science" it returns the list ['great', 'Computer#Science'] instead of ['great', 'Computer', 'Science'].
Without using RegEx please.
You can first find the first index where # occurs, and then split the slice after it on #:
text = 'Python is #great #Computer#Science'
text[text.find('#')+1:].split('#')
Out[214]: ['great ', 'Computer', 'Science']
You can then use strip to remove the leftover whitespace:
[tag.strip() for tag in text[text.find('#')+1:].split('#')]
Out[215]: ['great', 'Computer', 'Science']
Split into words, and then filter for the ones beginning with an octothorpe (hash).
[word for word in str.replace("#", " #").split()
 if word.startswith('#')]
The steps are
Insert a space in front of each hash, to make sure we separate on them
Split the string at spaces
Keep the words that start with a hash.
Result:
['#great', '#Computer', '#Science']
Split on #, take all tokens except the first one, and keep the first word of each (which also strips the spaces):
s = "Python is #great #Computer#Science"
out = [w.split()[0] for w in s.split('#')[1:]]
out
['great', 'Computer', 'Science']
When you split the string using the default separator (any whitespace), you get the following result:
['Python', 'is', '#great', '#Computer#Science']
You can add a space before each hashtag with replace() before splitting:
def hashtag(str):
    lst = []
    str = str.replace('#', ' #')
    for i in str.split():
        if i[0] == "#":
            lst.append(i[1:])
    return lst
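For completeness, here is a regex-free sketch that also tolerates a lone "#" or a doubled "##", cases the question does not mention. The function name extract_hashtags and the second test string are assumptions made for illustration.

```python
def extract_hashtags(text):
    """Return hashtag names without regex, skipping empty tags."""
    tags = []
    # Everything before the first '#' is not a tag, so drop chunk 0.
    for chunk in text.split('#')[1:]:
        # A tag must start right after the '#': skip '' and ' and ...'.
        if chunk and not chunk[0].isspace():
            tags.append(chunk.split()[0])   # the tag ends at the first whitespace
    return tags

print(extract_hashtags("Python is #great #Computer#Science"))
# ['great', 'Computer', 'Science']
print(extract_hashtags("trailing # and ##double"))
# ['double']
```

Like the answers above, this treats every "#" as the start of a tag, even in the middle of a word.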

Remove conjunction from file.txt and punctuation from user input

I want to remove punctuation and conjunctions from a string entered by the user. The conjunctions are stored in a text file (Stop Word.txt).
I already tried this code:
f = open("Stop Word.txt", "r")
def message(userInput):
    punctuation = "!##$%^&*()_+<>?:.,;/"
    words = userInput.lower().split()
    conjunction = f.read().split("\n")
    for char in words:
        punc = char.strip(punctuation)
        if punc in conjunction:
            words.remove(punc)
    print(words)
message(input("Pesan: "))
OUTPUT
When I input "Hello, how are you? and where are you?"
I expect the output to be [hello,how,are,you,where,are,you]
but the output is [hello,how,are,you?,where,are,you?]
or [hello,how,are,you?,and,where,are,you?]
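In your loop the stripped value punc is never stored back into words, so the punctuation stays in the list. Use a list comprehension to build the stripped words, and then keep only those that are not in your conjunction list:
f = open("Stop Word.txt", "r")
def message(userInput):
    punctuation = "!##$%^&*()_+<>?:.,;/"
    words = userInput.lower().split()
    conjunction = f.read().split("\n")
    # Strip the punctuation first so that e.g. "and," is also recognised
    stripped = [char.strip(punctuation) for char in words]
    return [word for word in stripped if word not in conjunction]
print(message("Hello, how are you? and where are you?"))
#['hello', 'how', 'are', 'you', 'where', 'are', 'you']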
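Since the result depends on the contents of Stop Word.txt, here is a self-contained sketch that takes the stop words as an argument instead. The clean_message name and the three stop words are assumptions standing in for the real file.

```python
def clean_message(user_input, stop_words, punctuation="!#$%^&*()_+<>?:.,;/"):
    """Lowercase, strip punctuation from each word, then drop stop words."""
    stripped = (word.strip(punctuation) for word in user_input.lower().split())
    # Skip empty strings left behind by tokens that were only punctuation.
    return [word for word in stripped if word and word not in stop_words]

stop_words = {"and", "or", "but"}   # hypothetical stand-in for Stop Word.txt
print(clean_message("Hello, how are you? and where are you?", stop_words))
# ['hello', 'how', 'are', 'you', 'where', 'are', 'you']
```

Passing the stop words in as a set keeps the membership test fast and means the function can be called more than once without re-reading a file handle that is already exhausted.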

Write a for loop to remove punctuation

I've been tasked with writing a for loop to remove some punctuation in a list of strings, storing the answers in a new list. I know how to do this with one string, but not in a loop.
For example: phrases = ['hi there!', 'thanks!'] etc.
import string
new_phrases = []
for i in phrases:
    if i not in string.punctuation
Then I get a bit stuck at this point. Do I append? I've tried yield and return, but realised that's for functions.
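You can either update your current list in place or append the new values to another list. Updating in place avoids allocating a second list, whereas appending builds a new list of O(n) items. Either way each cleaned string is a new object, because strings are immutable.
phrases = ['hi there!', 'thanks!']
for i, el in enumerate(phrases):
    # replace() returns a new string, so store it back into the list
    phrases[i] = el.replace("!", "")
print(phrases)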
will give output: ['hi there', 'thanks']
Give this a go:
import re
new_phrases = []
for word in phrases:
    new_phrases.append(re.sub(r'[^\w\s]', '', word))
This uses the re (regular expression) module to replace every character that is neither a word character nor whitespace with an empty string, which effectively removes the punctuation.
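One detail of the [^\w\s] pattern is worth checking: \w also matches the underscore, so underscores survive the substitution. A short sketch, using a made-up second phrase to show it:

```python
import re

phrases = ['hi there!', 'snake_case, please?']   # hypothetical input

# [^\w\s] leaves underscores alone because \w matches them.
print([re.sub(r'[^\w\s]', '', p) for p in phrases])
# ['hi there', 'snake_case please']

# Adding "_" as an alternative removes underscores too.
print([re.sub(r'[^\w\s]|_', '', p) for p in phrases])
# ['hi there', 'snakecase please']
```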
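You can use the re module and a list comprehension to do it in a single line. Passing string.punctuation through re.escape() keeps characters such as ], ^ and \ from being interpreted as part of the character-class syntax:
import string
import re
phrases = ['hi there!', 'thanks!']
new_phrases = [re.sub('[{}]'.format(re.escape(string.punctuation)), '', i) for i in phrases]
new_phrases
#['hi there', 'thanks']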
If a phrase contains any punctuation character, replace it with "" and then append the phrase to new_phrases:
import string
new_phrases = []
phrases = ['hi there!', 'thanks!']
for i in phrases:
    for pun in string.punctuation:
        if pun in i:
            i = i.replace(pun, "")
    new_phrases.append(i)
print(new_phrases)
OUTPUT
['hi there', 'thanks']
Following your line of thinking, I would do it like this:
import string
new_phrases = []
for word in phrases:  # for each phrase
    for punct in string.punctuation:  # for each punctuation character
        word = word.replace(punct, '')  # replace the punctuation character with nothing (remove it)
    new_phrases.append(word)  # add the punctuation-free text to your output
I suggest using the powerful translate() method on each string of your input list, which suits this task well. It gives the following code, which iterates over the input list with a list comprehension and is short and easily readable:
import string
phrases = ['hi there!', 'thanks!']
translationRule = str.maketrans({k:"" for k in string.punctuation})
new_phrases = [phrase.translate(translationRule) for phrase in phrases]
print(new_phrases)
# ['hi there', 'thanks']
Or to only allow spaces and letters:
phrases=[''.join(x for x in i if x.isalpha() or x==' ') for i in phrases]
Now:
print(phrases)
Is:
['hi there', 'thanks']
You should use a list comprehension, where process stands for whatever function cleans a single string:
new_list = [process(phrase) for phrase in phrases]
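The answer above leaves process undefined. One possible sketch, assuming the goal is to drop ASCII punctuation, builds it on str.translate with the three-argument form of str.maketrans. Both the helper body and the _REMOVE_PUNCTUATION name are illustrative, not part of the original answer.

```python
import string

# Built once; the three-argument form maps every listed character to None.
_REMOVE_PUNCTUATION = str.maketrans('', '', string.punctuation)

def process(phrase):
    """Hypothetical helper: return phrase with ASCII punctuation removed."""
    return phrase.translate(_REMOVE_PUNCTUATION)

phrases = ['hi there!', 'thanks!']
new_list = [process(phrase) for phrase in phrases]
print(new_list)   # ['hi there', 'thanks']
```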

Python line.split to include a whitespace

If I have a string and want to return a word that includes whitespace, how would I do it?
For example, I have:
line = 'This is a group of words that include #this and #that but not ME ME'
response = [ word for word in line.split() if word.startswith("#") or word.startswith('#') or word.startswith('ME ')]
print response
['#this', '#that', 'ME']
So ME ME does not get printed because of the whitespace.
Thanks
You could just keep it simple:
line = 'This is a group of words that include #this and #that but not ME ME'
words = line.split()
result = []
pos = 0
try:
    while True:
        if words[pos].startswith(('#', '#')):
            result.append(words[pos])
            pos += 1
        elif words[pos] == 'ME':
            # Join ME with the word that follows it
            result.append('ME ' + words[pos + 1])
            pos += 2
        else:
            pos += 1
except IndexError:
    # Running past the end of the list ends the loop
    pass
print(result)
Think about speed only if it proves to be too slow in practice.
From the Python documentation:
string.split(s[, sep[, maxsplit]]): Return a list of the words of the string s. If the optional second argument sep is absent or None, the words are separated by arbitrary strings of whitespace characters (space, tab, newline, return, formfeed).
So the problem starts with the call to split(): it breaks 'ME ME' into two separate words, so no single word can start with 'ME '.
print line.split()
['This', 'is', 'a', 'group', 'of', 'words', 'that', 'include', '#this', 'and', '#that', 'but', 'not', 'ME', 'ME']
I recommend using the re module to split the string: re.split(pattern, string, maxsplit=0, flags=0).
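The answer does not say which pattern to use. As one possible sketch, re.findall (swapped in here for re.split, since matching the wanted items is simpler than describing the gaps between them) can pick out the hashtags and the two-word ME item directly. The pattern is an assumption about what should count as a match.

```python
import re

line = 'This is a group of words that include #this and #that but not ME ME'

# Either a hashtag, or the word ME followed by one space and the next word.
pattern = r'#\w+|\bME \S+'
print(re.findall(pattern, line))
# ['#this', '#that', 'ME ME']
```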
