line = "english: while \tfrench: pendant que \tspanish: mientras \tgerman: während "
words = line.split('\t')
for each in words:
    each = each.rstrip()
print(words)
The string in 'line' is tab-delimited but also has a single space after each translated word, so while split returns the list I'm after, each word annoyingly has a trailing space.
In the loop I'm trying to go through the list and remove any trailing whitespace from the strings, but it doesn't seem to work. Suggestions?
Just line.split() with no argument would give you a list with the whitespace already stripped; note, however, that it splits on every run of whitespace, so a multi-word entry like 'pendant que' would be split apart too.
Rebinding each inside the loop makes no change to the words list; it only rebinds the loop variable.
It should be done like this:
for i in range(len(words)):
    words[i] = words[i].rstrip()
Or
words = list(map(str.rstrip, words))
(the list() call is needed on Python 3, where map returns an iterator). See the map docs for details on map.
Or a one-liner with a list comprehension
words = [x.rstrip() for x in line.split("\t")]
Or with regex .findall (note that [^\t]+ alone keeps the trailing space, so strip it as well)
words = [w.rstrip() for w in re.findall(r"[^\t]+", line)]
words = line.split('\t')
words = [i.rstrip() for i in words]
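For example, with the line from the question (tabs written out as \t escapes here), this two-step version yields the cleaned list:

```python
# The sample line from the question; tabs are written as \t escapes.
line = "english: while \tfrench: pendant que \tspanish: mientras \tgerman: während "

words = line.split('\t')             # split on tabs only
words = [i.rstrip() for i in words]  # drop the trailing space on each entry
print(words)
# ['english: while', 'french: pendant que', 'spanish: mientras', 'german: während']
```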
You can use a regular expression:
import re
words = re.split(r' *\t| +$', line)[:-1]
With this you define the possible delimiter sequence. Thanks to the * quantifier it also allows more than one space before the tab (or no space at all).
EDIT: Fixed after Roger Pate pointed out an error.
Suppose I have a expression
exp="\"OLS\".\"ORDER_ITEMS\".\"QUANTITY\" <50 and \"OLS\".\"PRODUCTS\".\"PRODUCT_NAME\" = 'Kingston' or \"OLS\".\"ORDER_ITEMS\".\"QUANTITY\" <20"
I want to split the expression by and , or so that my result will be
exp=['\"OLS\".\"ORDER_ITEMS\".\"QUANTITY\" <50', '\"OLS\".\"PRODUCTS\".\"PRODUCT_NAME\" = \'Kingston\'', '\"OLS\".\"ORDER_ITEMS\".\"QUANTITY\" <20']
This is what I have tried:
import re
res=re.split('and|or|',exp)
but it splits at every character. How can we make it split by whole words?
import itertools
exp = itertools.chain(*[y.split('or') for y in exp.split('and')])
exp = [x.strip() for x in exp]
Explanation: first split on 'and', then split each element obtained on 'or'. This creates a list of lists; itertools.chain flattens it into a single flat list, and strip removes the extra spaces from each element.
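Run against the expression from the question, this approach produces the three conditions (a quick sketch):

```python
import itertools

exp = ('"OLS"."ORDER_ITEMS"."QUANTITY" <50 and '
      '"OLS"."PRODUCTS"."PRODUCT_NAME" = \'Kingston\' or '
      '"OLS"."ORDER_ITEMS"."QUANTITY" <20')

# Split on 'and', then on 'or', flatten, and strip the whitespace.
parts = itertools.chain(*[y.split('or') for y in exp.split('and')])
parts = [x.strip() for x in parts]
print(parts)  # three cleaned-up conditions
```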
Your regex has three alternatives: "and", "or" or the empty string: and|or|
Omit the trailing | to split just by those two words.
import re
res = re.split('and|or', exp)
Note that this will not work reliably; it'll split on any instance of "and", even when it's in quotes or part of a word. You could make it split only on full words using \b, but that will still split on a product name like 'Black and Decker'. If you need it to be reliable and general, you'll have to parse the string using the full syntax (probably using an off-the-shelf parser, if it's standard SQL or similar).
You can do it in 2 steps: [ss for s in exp.split(" and ") for ss in s.split(' or ')]
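To illustrate the \b word-boundary idea mentioned above (a sketch only; as noted, it still breaks on values like 'Black and Decker'):

```python
import re

exp = ('"OLS"."ORDER_ITEMS"."QUANTITY" <50 and '
      '"OLS"."PRODUCTS"."PRODUCT_NAME" = \'Kingston\' or '
      '"OLS"."ORDER_ITEMS"."QUANTITY" <20')

# \b makes 'and'/'or' match only as whole words, so e.g. the
# 'OR' in ORDER_ITEMS (case-sensitive) or 'and' inside another
# word would not trigger a split.
parts = [p.strip() for p in re.split(r'\band\b|\bor\b', exp)]
print(parts)
```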
I have a string (text_string) from which I want to find words based on my so called key_words. I want to store the result in a list called expected_output.
The expected output is always the word after the keyword (the number of spaces between the keyword and the output word doesn't matter). The expected_output word is then all characters until the next space.
Please see the example below:
text_string = "happy yes_no!?. why coding without paus happy yes"
key_words = ["happy","coding"]
expected_output = ['yes_no!?.', 'without', 'yes']
expected_output explanation:
yes_no!?. (since it comes after happy; all characters are included up to the next space)
without (since it comes after coding; the number of spaces surrounding the word doesn't matter)
yes (since it comes after happy)
You can solve it using a regex, e.g. like this:
import re
expected_output = re.findall(r'(?:{0})\s+?([^\s]+)'.format('|'.join(key_words)), text_string)
Explanation
(?:{0}) takes your key_words list and creates a non-capturing group matching any of the words in it.
\s+? is a lazy quantifier that consumes the spaces after one of those occurrences, up to the next character which isn't a space.
([^\s]+) captures the text right after your key_words up to the next space.
Note: in case you're running this many times, e.g. inside a loop, you ought to use re.compile on the regex string beforehand in order to improve performance.
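For instance, compiling once outside the loop might look like this (using the data from the question):

```python
import re

key_words = ["happy", "coding"]
text_string = "happy yes_no!?. why coding without paus happy yes"

# Compile the pattern once, reuse it as often as needed.
pattern = re.compile(r'(?:{0})\s+?([^\s]+)'.format('|'.join(key_words)))
print(pattern.findall(text_string))  # ['yes_no!?.', 'without', 'yes']
```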
We will use re module of Python to split your strings based on whitespaces.
Then, the idea is to go over each word and check whether that word is one of your keywords. If it is, we set take_it to True, so that on the next iteration of the loop the word is appended to taken, which stores all the words you're looking for.
import re

def find_next_words(text, keywords):
    take_it = False
    taken = []
    for word in re.split(r'\s+', text):
        if take_it:
            taken.append(word)
        take_it = word in keywords
    return taken

print(find_next_words("happy yes_no!?. why coding without paus happy yes", ["happy", "coding"]))
results in ['yes_no!?.', 'without', 'yes']
I have some code where I extract bigrams from a large corpus and concatenate/merge them to get unigrams: 'may', 'be' --> maybe. The corpus of course contains a lot of punctuation, but I also discovered that it contains other characters such as emojis... My plan was to put the punctuation marks in a list and, if none of those characters occur in a line, print the line. Maybe I should change my approach and only print the lines containing ONLY letters and no other characters, since I don't know what kinds of characters are in the corpus. How can this be done? I do need to keep these other characters for the first part of the code, so that bigrams that don't actually exist are printed. The last lines of my code are at the moment:
counted = collections.Counter(grams)
for gram, count in sorted(counted.items()):
    s = ''
    print(s.join(gram))
And the output I get is:
!aku
!bet
!brå
!båda
These lines won't be of any use for me... Would really appreciate some help! :)
If you want to check that each string contains only letters you can probably use the isalpha() method.
>>> '!båda'.isalpha()
False
>>> 'båda'.isalpha()
True
As you can see from the example, this method recognizes any Unicode letter, not just ASCII.
To filter out strings that contain a non-letter character, the code can check for the existence of non-letter character in each string:
# coding=utf-8
import string
import unicodedata
source_strings = [u'aku', u'bet', u'brå', u'båda', u'!båda']
valid_chars = set(string.ascii_letters)
valid_strings = [s for s in source_strings if
                 set(unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')) <= valid_chars]
# valid_strings == [u'aku', u'bet', u'brå', u'båda']
# "!båda" was not included.
(Note: this snippet assumes Python 2, as the u'' literals suggest; on Python 3, encode() returns bytes, so the subset test would compare a set of ints against a set of characters and always fail.)
You can use the unicodedata module to classify the characters:
import unicodedata

unigram = ''.join(gram)
if all(unicodedata.category(char) == 'Ll' for char in unigram):
    print(unigram)
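For example, with some made-up gram tuples (the 'Ll' category means "lowercase letter", so punctuation, digits, and emoji are all rejected):

```python
import unicodedata

# Made-up bigrams for illustration.
grams = [('!', 'aku'), ('b', 'åda'), ('may', 'be')]
kept = []
for gram in grams:
    unigram = ''.join(gram)
    # Keep the unigram only if every character is a lowercase letter.
    if all(unicodedata.category(char) == 'Ll' for char in unigram):
        kept.append(unigram)
print(kept)  # ['båda', 'maybe']
```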
If you want to remove only certain characters from your lines, you can filter each word with a simple replace before processing it:
sourceList = ['!aku', '!bet', '!brå', '!båda']
newList = []
for word in sourceList:
    for special in ['!', '&', 'å']:
        word = word.replace(special, '')
    newList.append(word)
Then you can do what is needed for your bigram exercise. Hope this helps.
Second query: in case you have lots of different characters, you can always use isalpha() on the string:
sourceList = ['!aku', '!bet', 'nor mal alpha', '!brå', '!båda']
newList = [word for word in sourceList if word.isalpha()]
In this case you will only keep the purely alphabetic words. Hope this clarifies the second query.
I am dealing with a passage. I am required to sort the words in the passage alphabetically and then sort them by reverse frequency. When my word count function sorts the passage, it counts empty space too. I made some modifications and it still counts the empty string. I am wondering if there is any other way to do it. My code is:
def build_map( in_file, word_map ):
    for line in in_file:
        # Splits each line at blank space and turns it into
        # a list.
        word_list = line.split()
        for word in word_list:
            if word != '':
                # Within the word_list, we strip whitespace on both
                # sides of each word and also strip any punctuation
                # on both sides. Then, we lower-case each word to
                # avoid counting 'THE' and 'the' as two different
                # words.
                word = word.strip().strip(string.punctuation).lower()  # program revised
                add_word( word_map, word )
This should get you going in the right direction; you'll still need to process the words, probably by stripping periods and colons, and you may want to lower-case everything anyway.
passage = '''I am dealing with a passage. I am required to sort the words in the passage alphabetically and then sort them by reverse frequency. When my word count function sorts the passage, it counts empty space too. I did some modification and it still counts the empty spaces. I am wondering if there is any other way to do it. My codes are:'''
words = set(passage.split())
alpha_sort = sorted(words, key=str.lower)
frequency_sort = sorted(words, key=passage.count, reverse=True)
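One caveat: passage.count counts substring occurrences (so 'the' inside 'them' is counted too). If that matters, a collections.Counter over the split words gives true word frequencies; a sketch with a made-up passage:

```python
from collections import Counter

passage = "the cat and the hat sat on the mat"
counts = Counter(passage.split())  # word -> frequency

alpha_sort = sorted(counts, key=str.lower)
frequency_sort = sorted(counts, key=counts.get, reverse=True)
print(frequency_sort[0])  # the
```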
Maybe you're looking for str.isspace()
Instead of:
if word!='':
you should use:
if word.strip()!='':
because the first one only checks for zero-length strings, and you also want to eliminate strings made of spaces, which are not zero-length. Stripping a spaces-only string makes it zero-length.
To filter empty strings out of a list of strings, I would use:
my_list = filter(None, my_list)
(on Python 3, filter returns an iterator, so use list(filter(None, my_list)) if you need a list).
The purpose of the program is to count each word in a passage and record its frequency. Unfortunately, the program is also counting empty strings. My code is:
def build_map( in_file, word_map ):
    # Receives an input file and an empty dictionary
    for line in in_file:
        # Splits each line at blank space and turns it into
        # a list.
        word_list = line.split()
        for word in word_list:
            # Within the word_list, we strip whitespace and any
            # punctuation on both sides of each word, then
            # lower-case it to avoid counting 'THE' and 'the'
            # as two different words.
            word = word.strip().strip(string.punctuation).lower()  # program revised
            if word != '':
                add_word( word_map, word )
I would really appreciate it if someone could take a look at the code and explain why it was still counting empty strings. Everything else is working fine. Thanks (I modified the code and it is working fine now).
You're checking if the word is empty and then you're stripping the whitespace and punctuation. Reverse the order of these operations.
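A minimal illustration of why the order matters, using a made-up all-punctuation token:

```python
import string

token = '!?.'  # a "word" that is pure punctuation

# Check-then-strip: the non-empty check passes, but cleaning
# afterwards leaves an empty string that still gets counted.
if token != '':
    cleaned = token.strip().strip(string.punctuation).lower()
print(repr(cleaned))  # ''

# Strip-then-check: the empty result is filtered out before counting.
cleaned = token.strip().strip(string.punctuation).lower()
words = [cleaned] if cleaned != '' else []
print(words)  # []
```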