I want to strip punctuation and conjunctions from a string of user input. The conjunctions are stored in a text file (Stop Word.txt).
I have already tried this code:
f = open("Stop Word.txt", "r")

def message(userInput):
    punctuation = "!##$%^&*()_+<>?:.,;/"
    words = userInput.lower().split()
    conjunction = f.read().split("\n")
    for char in words:
        punc = char.strip(punctuation)
        if punc in conjunction:
            words.remove(punc)
    print(words)

message(input("Pesan: "))
Output:
When I input "Hello, how are you? and where are you?",
I expect the output to be [hello,how,are,you,where,are,you],
but the actual output is [hello,how,are,you?,where,are,you?]
or [hello,how,are,you?,and,where,are,you?].
Use a list comprehension to build the result: strip the punctuation from each word first, then keep the word only if it is not in your conjunction list:
with open("Stop Word.txt", "r") as f:
    conjunction = f.read().split("\n")

def message(userInput):
    punctuation = "!##$%^&*()_+<>?:.,;/"
    words = [word.strip(punctuation) for word in userInput.lower().split()]
    # Compare the stripped word, so that "and," is filtered as well as "and"
    return [word for word in words if word not in conjunction]

print(message("Hello, how are you? and where are you?"))
# ['hello', 'how', 'are', 'you', 'where', 'are', 'you']
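A slightly more defensive variant, offered only as a sketch: it takes the stop words as an argument instead of reading the file inside the function, uses string.punctuation rather than a hand-written character list, and drops tokens that become empty after stripping. The name clean_message and its stop_words parameter are assumptions made for this illustration, not part of the original code.

```python
import string

def clean_message(user_input, stop_words):
    """Lowercase the input, strip surrounding punctuation, drop stop words."""
    # Normalise the stop words once; a set gives fast membership tests
    stop_words = {w.strip().lower() for w in stop_words if w.strip()}
    cleaned = (token.strip(string.punctuation) for token in user_input.lower().split())
    # Skip tokens that were pure punctuation (empty after stripping)
    return [word for word in cleaned if word and word not in stop_words]

print(clean_message("Hello, how are you? and where are you?", ["and", "or"]))
# ['hello', 'how', 'are', 'you', 'where', 'are', 'you']
```

With a stop word file, the second argument could be the list of lines read from that file.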
I want to get the string that comes after a '%' symbol and ends before any other character (anything that is not a letter, such as a digit or a symbol).
For example:
string = 'Hi %how are %YOU786$ex doing'
It should return a list:
['how', 'you']
I tried:
string = text.split()
sample = []
for i in string:
    if '%' in i:
        sample.append(i[1:index].lower())
return sample
but I don't know how to trim 'you786$ex' down to 'you'.
EDIT: I don't want to import re.
You can use a regular expression.
>>> import re
>>>
>>> s = 'Hi %how are %YOU786$ex doing'
>>> re.findall('%([a-z]+)', s.lower())
['how', 'you']
This can be most easily done with re.findall():
import re
re.findall(r'%([a-z]+)', string.lower())
This returns:
['how', 'you']
Or you can use str.split() and iterate over the characters:
sample = []
for token in string.lower().split('%')[1:]:
    word = ''
    for char in token:
        if char.isalpha():
            word += char
        else:
            break
    sample.append(word)
sample would become:
['how', 'you']
Use a regular expression (regex).
First, create a regex pattern for your task. You can use online tools to test it; see the regex for this task: https://regex101.com/r/PMSvtK/1
Then use the pattern in Python:
import re

def parse_string(string):
    # '%' needs no escaping; a raw string avoids invalid-escape warnings
    return re.findall(r"%([a-zA-Z]+)", string)

print(parse_string('Hi %how are %YOU786$ex doing'))
Output:
['how', 'YOU']
The matches keep their original case; pass string.lower() instead if you need ['how', 'you'].
I am trying to match words that are not inside < >.
This is the regular expression for matching words inside < >:
text = " Hi <how> is <everything> going"
pattern_neg = r'<([A-Za-z0-9_\./\\-]*)>'
m = re.findall(pattern_neg, text)
# m is ['how', 'everything']
I want the result to be ['Hi', 'is', 'going'].
Using re.split:
import re
text = " Hi <how> is <everything> going"
[s.strip() for s in re.split(r'\s*<.*?>\s*', text)]
# ['Hi', 'is', 'going']
A regular expression approach:
>>> import re
>>> re.findall(r"\b(?<!<)\w+(?!>)\b", text)
['Hi', 'is', 'going']
Here \b marks a word boundary, (?<!<) is a negative lookbehind, (?!>) is a negative lookahead, and \w+ matches one or more word characters (letters, digits, or underscores).
A naive non-regex approach (split on whitespace, then keep each word that neither starts with < nor ends with >):
>>> [word for word in text.split() if not word.startswith("<") and not word.endswith(">")]
['Hi', 'is', 'going']
To also handle a case like <hello how> are you, we need something different:
>>> text = " Hi <how> is <everything> going"
>>> re.findall(r"(?:^|\s)(?!<)([\w\s]+)(?!>)(?:\s|$)", text)
[' Hi', 'is', 'going']
>>> text = "<hello how> are you"
>>> re.findall(r"(?:^|\s)(?!<)([\w\s]+)(?!>)(?:\s|$)", text)
['are you']
Note that 'are you' now has to be split to get the individual words.
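One way to avoid that extra splitting step, sketched here as an alternative rather than as the approach above: first delete every <...> span with re.sub, then split what is left on whitespace. The function name words_outside_brackets is made up for this example, and the sketch assumes the angle brackets are balanced and not nested.

```python
import re

def words_outside_brackets(text):
    """Return the words that are not inside <...>."""
    # Replace each <...> span with a space so neighbouring words do not merge
    without_tags = re.sub(r"<[^>]*>", " ", text)
    return without_tags.split()

print(words_outside_brackets(" Hi <how> is <everything> going"))  # ['Hi', 'is', 'going']
print(words_outside_brackets("<hello how> are you"))              # ['are', 'you']
```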
file_str = input("Enter poem: ")
my_file = open(file_str, "r")
words = file_str.split(',' or ';')
I have a file on my computer that contains a really long poem, and I want to see whether any words are duplicated within each line (hence splitting on punctuation).
That is as far as I have got. I don't want to use a module or Counter; I would prefer to use loops. Any ideas?
You can use sets to track seen items and duplicates:
>>> words = 'the fox jumped over the lazy dog and over the bear'.split()
>>> seen = set()
>>> dups = set()
>>> for word in words:
...     if word in seen:
...         if word not in dups:
...             print(word)
...             dups.add(word)
...     else:
...         seen.add(word)
...
the
over
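The same seen/dups idea can be applied to each line of the poem separately. The following is a minimal sketch that uses only loops and built-in string methods; the function name duplicates_per_line, the sample lines, and the choice to treat commas and semicolons as separators and to ignore case are assumptions made for illustration.

```python
def duplicates_per_line(lines):
    """Return, for each line, the words that occur more than once in it."""
    results = []
    for line in lines:
        # Treat commas and semicolons as separators, then split on whitespace
        for mark in ",;":
            line = line.replace(mark, " ")
        seen = set()
        dups = []  # a list keeps the order in which duplicates were found
        for word in line.lower().split():
            if word in seen:
                if word not in dups:
                    dups.append(word)
            else:
                seen.add(word)
        results.append(dups)
    return results

poem = ["the fox jumped over the lazy dog, and over the bear", "no repeats here"]
print(duplicates_per_line(poem))  # [['the', 'over'], []]
```

To run it on a file, the open file object can be passed directly, since iterating over it yields one line at a time.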
with open(r"specify the path of the file") as f:
    words = f.read().split()

# A word is a duplicate if it occurs more than once in the file
if any(words.count(word) > 1 for word in words):
    print("Duplicates found")
else:
    print("None")
Solved! Here is an explanation with a working program.
Contents of sam.txt:
Hello this is star hello
the data are Hello so you can move to the hello
file_content = []
resultant_list = []
repeated_element_list = []

with open(file="sam.txt", mode="r") as file_obj:
    file_content = file_obj.readlines()
print("\n debug the file content ", file_content)

for line in file_content:
    # Strip the trailing newline and split the line on spaces into a list
    temp = line.strip('\n').split(" ")
    for word in temp:
        resultant_list.append(word)
print("\n debug resultant_list", resultant_list)

# Main loop: compare each string with every string that follows it
for ii in range(0, len(resultant_list)):
    # Only strings that occur more than once can be duplicates
    is_repeated = resultant_list.count(resultant_list[ii])
    if is_repeated > 1:
        # Skip strings that have already been reported
        if resultant_list[ii] not in repeated_element_list:
            for2count = ii + 1
            # Scan the strings after position ii for a match
            for jj in range(for2count, len(resultant_list)):
                if resultant_list[ii] == resultant_list[jj]:
                    repeated_element_list.append(resultant_list[ii])
                    break  # one match is enough to report the string
print("The repeated strings are {}\n and total counts {}".format(repeated_element_list, len(repeated_element_list)))
Output:
debug the file content ['Hello this is star hello\n', 'the data are Hello so you can move to the hello']
debug resultant_list ['Hello', 'this', 'is', 'star', 'hello', 'the', 'data', 'are', 'Hello', 'so', 'you', 'can', 'move', 'to', 'the', 'hello']
The repeated strings are ['Hello', 'hello', 'the']
and total counts 3
Thanks
def Counter(text):
    d = {}
    for word in text.split():
        d[word] = d.get(word, 0) + 1
    return d

There you go, nothing but loops.
To split on punctuation, just use:
import re

# my_corpus is the text read from your file
matches = re.split("[!.?]", my_corpus)
for match in matches:
    print(Counter(match))
For a file like this:
A hearth came to us from your hearth
foreign hairs with hearth are same are hairs
this will check the whole poem:
lst = []
with open("coz.txt") as f:
    for line in f:
        for word in line.split():  # split on whitespace
            if word not in lst:
                lst.append(word)
            else:
                print(word)
Output:
>>>
hearth
hearth
are
hairs
>>>
As you can see, hearth is printed twice, because it appears three times in the whole poem.
To check line by line:
with open("coz.txt") as f:
    for line in f:
        lst = []   # words seen so far in this line
        lst2 = []  # words repeated within this line
        for word in line.split():
            if word not in lst:
                lst.append(word)
            elif word not in lst2:
                lst2.append(word)
        print(lst2)
>>>
['hearth']
['are', 'hairs']
>>>
The sample below strips punctuation from the text of a ranbo.txt file and converts it to lower case.
Please help me split the result on whitespace:
import string

infile = open('ranbo.txt', 'r')
lowercased = infile.read().lower()
for c in string.punctuation:
    lowercased = lowercased.replace(c, "")
white_space_words = lowercased.split(?????????)
print(white_space_words)
After this split, how can I find out how many words are in the list?
With count or with the len function?
white_space_words = lowercased.split()
splits on runs of whitespace characters of any length.
'a b \t cd\n ef'.split()
returns
['a', 'b', 'cd', 'ef']
But you could also go the other way round and extract the words directly:
import re
words = re.findall(r'\w+', text)
returns a list of all "words" from text.
Get its length using len():
len(words)
and if you want to join them into a new string with newlines:
text = '\n'.join(words)
As a whole:
import re

with open('ranbo.txt', 'r') as f:
    lowercased = f.read().lower()
words = re.findall(r'\w+', lowercased)
number_of_words = len(words)
text = '\n'.join(words)
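Note that the two approaches can differ on punctuation inside words: \w+ splits "it's" into 'it' and 's', whereas deleting punctuation first keeps it as 'its'. Here is a small sketch of the delete-then-split route using str.translate; the function name count_words and the sample sentence are assumptions for illustration, and only ASCII punctuation from string.punctuation is removed.

```python
import string

def count_words(text):
    """Lowercase text, delete ASCII punctuation, split on whitespace."""
    # A translation table that maps every punctuation character to None
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return words, len(words)

words, total = count_words("Hello, world! It's  a\tfine day.")
print(words)  # ['hello', 'world', 'its', 'a', 'fine', 'day']
print(total)  # 6
```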
If I have a string and want to return a word that includes whitespace, how would that be done?
For example, I have:
line = 'This is a group of words that include #this and #that but not ME ME'
response = [word for word in line.split() if word.startswith("#") or word.startswith('#') or word.startswith('ME ')]
print(response)
['#this', '#that', 'ME']
So ME ME does not get printed because of the whitespace.
Thanks
You could just keep it simple:
line = 'This is a group of words that include #this and #that but not ME ME'
words = line.split()
result = []
pos = 0
try:
    while True:
        if words[pos].startswith(('#', '#')):
            result.append(words[pos])
            pos += 1
        elif words[pos] == 'ME':
            # Join 'ME' with the word that follows it
            result.append('ME ' + words[pos + 1])
            pos += 2
        else:
            pos += 1
except IndexError:
    # Running past the end of the list ends the loop
    pass
print(result)
Think about speed only if it proves to be too slow in practice.
From the Python documentation:
string.split(s[, sep[, maxsplit]]): Return a list of the words of the string s. If the optional second argument sep is absent or None, the words are separated by arbitrary strings of whitespace characters (space, tab, newline, return, formfeed).
So your first error is in the call to split, which breaks 'ME ME' into two separate words:
print(line.split())
['This', 'is', 'a', 'group', 'of', 'words', 'that', 'include', '#this', 'and', '#that', 'but', 'not', 'ME', 'ME']
I recommend using re to split the string; see re.split(pattern, string, maxsplit=0, flags=0).
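As a related sketch, re.findall can pull the wanted items out directly instead of splitting first. The pattern below is an assumption about what counts as a match: either a word prefixed with '#', or the standalone word 'ME' followed by a space and the next word.

```python
import re

line = 'This is a group of words that include #this and #that but not ME ME'

# '#\w+' matches a '#'-prefixed word; '\bME \w+' matches 'ME' plus the next word
pattern = re.compile(r'#\w+|\bME \w+')
response = pattern.findall(line)
print(response)  # ['#this', '#that', 'ME ME']
```

The \b keeps the second alternative from matching the tail of a longer word such as 'NAME'.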