How can lines with other characters than letters be removed from output? - python

I have a code where I extract bigrams from a large corpus, and concatenate/merge them to get unigrams. 'may', 'be' --> maybe. The corpus contains, of course, a lot of punctuations, but I also discovered that it contains other characters such as emojis... My plan was to put punctuations in a list, and if those characters are not in a line, print the line. Maybe I should change my approach and only print the lines ONLY containing letters and no other characters, since I don't know what kinds of characters are in the corpus. How can this be done? I do need to keep these other characters for the first part of the code, so that bigrams that don't actually exist are printed. The last lines of my code are at the moment:
counted = collections.Counter(grams)
for gram, count in sorted(counted.items()):
s = ''
print (s.join(gram))
And the output I get is:
!aku
!bet
!brå
!båda
These lines won't be of any use for me... Would really appreciate some help! :)

If you want to check that each string contains only letters you can probably use the isalpha() method.
>>> '!båda'.isalpha()
False
>>> 'båda'.isalpha()
True
As you can see from the example, this method should recognize any unicode letter, not just ascii.

To filter out strings that contain a non-letter character, the code can check for the existence of non-letter character in each string:
# coding=utf-8
import string
import unicodedata
source_strings = [u'aku', u'bet', u'brå', u'båda', u'!båda']
valid_chars = (set(string.ascii_letters))
valid_strings = [s for s in source_strings if
set(unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')) <= valid_chars]
# valid_strings == [u'aku', u'bet', u'brå', u'båda']
# "båda" was not included.

You can use the unicodedata module to classify the characters:
import unicodedata
unigram= ''.join(gram)
if all(unicodedata.category(char)=='Ll' for char in unigram):
print(unigram)

If you want to remove from your lines only some characters, then you can filter with an easy replace your line before edit it:
sourceList = ['!aku', '!bet', '!brå', '!båda']
newList = []
for word in sourceList:
for special in ['!','&','å']:
word = word.replace(special,'')
newList.append(word)
Then you can do what is needed for your bigram exercise. Hope this help.
Second query: in case you have lots of characters then on your string you can use always the isalpha():
sourceList = ['!aku', '!bet', 'nor mal alpha', '!brå', '!båda']
newList = [word for word in sourceList if word.isalpha()]
In this case you will only check for characters. Hope this clarify second query.

Related

how to prevent regex matching substring of words?

I have a regex in python and I want to prevent matching substrings. I want to add '#' at the beginning some words with alphanumeric and _ character and 4 to 15 characters. But it matches substring of larger words. I have this method:
def add_atsign(sents):
for i, sent in enumerate(sents):
sents[i] = re.sub(r'([a-zA-Z0-9_]{4,15})', r'#\1', str(sent))
return sents
And the example is :
mylist = list()
mylist.append("ali_s ali_t ali_u aabs:/t.co/kMMALke2l9")
add_atsign(mylist)
And the answer is :
['#ali_s #ali_t #ali_u #aabs:/t.co/#kMMALke2l9']
As you can see, it puts '#' at the beginning of 'aabs' and 'kMMALke2l9'. That it is wrong.
I tried to edit the code as bellow :
def add_atsign(sents):
for i, sent in enumerate(sents):
sents[i] = re.sub(r'((^|\s)[a-zA-Z0-9_]{4,15}(\s|$))', r'#\1', str(sent))
return sents
But the result will become like this :
['#ali_s ali_t# ali_u aabs:/t.co/kMMALke2l9']
As you can see It has wrong replacements.
The correct result I expect is:
"#ali_s #ali_t #ali_u aabs:/t.co/kMMALke2l9"
Could anyone help?
Thanks
This is a pretty interesting question. If I understand correctly, the issue is that you want to divide the string by spaces, and then do the replacement only if the entire word matches, and not catch a substring.
I think the best way to do this is to first split by spaces, and then add assertions to your regex that catch only an entire string:
def add_atsign(sents):
new_list = []
for string in sents:
new_list.append(' '.join(re.sub(r'^([a-zA-Z0-9_]{4,15})$', r'#\1', w)
for w in string.split()))
return new_list
mylist = ["ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"]
add_atsign(mylist)
>
['#ali_s #ali_t #ali_u aabs:/t.co/kMMALke2l9']
ie, we split, then replace only if the entire word matches, then rejoin.
By the way, your regex can be simplified to r'^(\w{4,15})$':
def add_atsign(sents):
new_list = []
for string in sents:
new_list.append(' '.join(re.sub(r'^(\w{4,15})$', r'#\1', w)
for w in string.split()))
return new_list
You can separate words by spaces by adding (?<=\s) to the start and \s to the end of your first expression.
def add_atsign(sents):
for i, sent in enumerate(sents):
sents[i] = re.sub(r'((^|(?<=\s))[a-zA-Z0-9_]{4,15}\s)', r'#\1', str(sent))
return sents
The result will be like this:
['#ali_s #ali_t #ali_u aabs:/t.co/kMMALke2l9']
I am not sure what you are trying to accomplish, but the reason it puts the # at the wrong places is that as you added /s or ^ to the regex the whitespace becomes part of the match and it therefore puts the # before the whitespace.
you could try to split it to
check at beginning of string and put at first position and
check after every whitespace and put to second position
Im aware its not optimal, but maybe i can help if you clarify what the regex is supposed to match and what it shouldnt in a bit more detail

How would I remove the Arabic prefix "ال" from an arabic string?

I have tried things like this, but there is no change between the input and output:
def remove_al(text):
if text.startswith('ال'):
text.replace('ال','')
return text
text.replace returns the updated string but doesn't change it, you should change the code to
text = text.replace(...)
Note that in Python strings are "immutable"; there's no way to change even a single character of a string; you can only create a new string with the value you want.
If you want to only remove the prefix ال and not all of ال combinations in the string, I'd rather suggest to use:
def remove_prefix_al(text):
if text.startswith('ال'):
return text[2:]
return text
If you simply use text.replace('ال',''), this will replace all ال combinations:
Example
text = 'الاستقلال'
text.replace('ال','')
Output:
'استقل'
I would recommend the method str.lstrip instead of rolling your own in this case.
example text (alrashid) in Arabic: 'الرَشِيد'
text = 'الرَشِيد'
clean_text = text.lstrip('ال')
print(clean_text)
Note that even though arabic reads from right to left, lstrip strips the start of the string (which is visually to the right)
also, as user 6502 noted, the issue in your code is because python strings are immutable, thus the function was returning the input back
"ال" as prefix is quite complex in Arabic that you will need Regex to accurately separate it from its stem and other prefixes. The following code will help you isolate "ال" from most words:
import re
text = 'والشعر كالليل أسود'
words = text.split()
for word in words:
alx = re.search(r'''^
([وف])?
([بك])?
(لل)?
(ال)?
(.*)$''', word, re.X)
groups = [alx.group(1), alx.group(2), alx.group(3), alx.group(4), alx.group(5)]
groups = [x for x in groups if x]
print (word, groups)
Running that (in Jupyter) you will get:

Derive words from string based on key words

I have a string (text_string) from which I want to find words based on my so called key_words. I want to store the result in a list called expected_output.
The expected output is always the word after the keyword (the number of spaces between the keyword and the output word doesn't matter). The expected_output word is then all characters until the next space.
Please see the example below:
text_string = "happy yes_no!?. why coding without paus happy yes"
key_words = ["happy","coding"]
expected_output = ['yes_no!?.', 'without', 'yes']
expected_output explanation:
yes_no!?. (since it comes after happy. All signs are included until the next space.)
without (since it comes after coding. the number of spaces surronding the word doesn't matter)
yes (since it comes after happy)
You can solve it using regex. Like this e.g.
import re
expected_output = re.findall('(?:{0})\s+?([^\s]+)'.format('|'.join(key_words)), text_string)
Explanation
(?:{0}) Is getting your key_words list and creating a non-capturing group with all the words inside this list.
\s+? Add a lazy quantifier so it will get all spaces after any of the former occurrences up to the next character which isn't a space
([^\s]+) Will capture the text right after your key_words until a next space is found
Note: in case you're running this too many times, inside a loop i.e, you ought to use re.compile on the regex string before in order to improve performance.
We will use re module of Python to split your strings based on whitespaces.
Then, the idea is to go over each word, and look if that word is part of your keywords. If yes, we set take_it to True, so that next time the loop is processed, the word will be added to taken which stores all the words you're looking for.
import re
def find_next_words(text, keywords):
take_it = False
taken = []
for word in re.split(r'\s+', text):
if take_it == True:
taken.append(word)
take_it = word in keywords
return taken
print(find_next_words("happy yes_no!?. why coding without paus happy yes", ["happy", "coding"]))
results in ['yes_no!?.', 'without', 'yes']

how to ignore punctuation when counting characters in string in python

In my homework there is question about write a function words_of_length(N, s) that can pick unique words with certain length from a string, but ignore punctuations.
what I am trying to do is:
def words_of_length(N, s): #N as integer, s as string
#this line i should remove punctuation inside the string but i don't know how to do it
return [x for x in s if len(x) == N] #this line should return a list of unique words with certain length.
so my problem is that I don't know how to remove punctuation , I did view "best way to remove punctuation from string" and relevant questions, but those looks too difficult in my lvl and also because my teacher requires it should contain no more than 2 lines of code.
sorry that I can't edit my code in question properly, it's first time i ask question here, there much i need to learn, but pls help me with this one. thanks.
Use string.strip(s[, chars])
https://docs.python.org/2/library/string.html
In you function replace x with strip (x, ['.', ',', ':', ';', '!', '?']
Add more punctuation if needed
First of all, you need to create a new string without characters you want to ignore (take a look at string library, particularly string.punctuation), and then split() the resulting string (sentence) into substrings (words). Besides that, I suggest using type annotation, instead of comments like those.
def words_of_length(n: int, s: str) -> list:
return [x for x in ''.join(char for char in s if char not in __import__('string').punctuation).split() if len(x) == n]
>>> words_of_length(3, 'Guido? van, rossum. is the best!'))
['van', 'the']
Alternatively, instead of string.punctuation you can define a variable with the characters you want to ignore yourself.
You can remove punctuation by using string.punctuation.
>>> from string import punctuation
>>> text = "text,. has ;:some punctuation."
>>> text = ''.join(ch for ch in text if ch not in punctuation)
>>> text # with no punctuation
'text has some punctuation'

Separating between Hebrew and English strings

So I have this huge list of strings in Hebrew and English, and I want to extract from them only those in Hebrew, but couldn't find a regex example that works with Hebrew.
I have tried the stupid method of comparing every character:
import string
data = []
for s in slist:
found = False
for c in string.ascii_letters:
if c in s:
found = True
if not found:
data.append(s)
And it works, but it is of course very slow and my list is HUGE.
Instead of this, I tried comparing only the first letter of the string to string.ascii_letters which was much faster, but it only filters out those that start with an English letter, and leaves the "mixed" strings in there. I only want those that are "pure" Hebrew.
I'm sure this can be done much better... Help, anyone?
P.S: I prefer to do it within a python program, but a grep command that does the same would also help
To check if a string contains any ASCII letters (ie. non-Hebrew) use:
re.search('[' + string.ascii_letters + ']', s)
If this returns true, your string is not pure Hebrew.
This one should work:
import re
data = [s for s in slist if re.match('^[a-zA-Z ]+$', s)]
This will pick all the strings that consist of lowercase and uppercase English letters and spaces. If the strings are allowed to contain digits or punctuation marks, the allowed characters should be included into the regex.
Edit: Just noticed, it filters out the English-only strings, but you need it do do the other way round. You can try this instead:
data = [s for s in slist if not re.match('^.*[a-zA-Z].*$', s)]
This will discard any string that contains at least one English letter.
Python has extensive unicode support. It depends on what you're asking for. Is a hebrew word one that contains only hebrew characters and whitespace, or is it simply a word that contains no latin characters? Either way, you can do so directly. Just create the criteria set and test for membership.
Note that testing for membership in a set is much faster than iteration through string.ascii_letters.
Please note that I do not speak hebrew so I may have missed a letter or two of the alphabet.
def is_hebrew(word):
hebrew = set("א‎ב‎ג‎ד‎ה‎ו‎ז‎ח‎ט‎י‎כ‎ך‎ל‎מ‎נ‎ס‎ ע‎פ‎צ‎ק‎ר‎ש‎ת‎ם‎ן‎ף‎ץ"+string.whitespace)
for char in word:
if char not in hebrew:
return False
return True
def contains_latin(word):
return any(char in set("abcdefghijklmnopqrstuvwxyz") for char in word.lower())
# a generator expression like this is a terser way of expressing the
# above concept.
hebrew_words = [word for word in words if is_hebrew(word)]
non_latin words = [word for word in words if not contains_latin(word)]
Another option would be to create a dictionary of hebrew words:
hebrew_words = {...}
And then you iterate through the list of words and compare them against this dictionary ignoring case. This will work much faster than other approaches (O(n) where n is the length of your list of words).
The downside is that you need to get all or most of hebrew words somewhere. I think it's possible to find it on the web in csv or some other form. Parse it and put it into python dictionary.
However, it makes sense if you need to parse such lists of words very often and quite quickly. Another problem is that the dictionary may contain not all hebrew words which will not give a completely right answer.
Try this:
>>> import re
>>> filter(lambda x: re.match(r'^[^\w]+$',x),s)

Categories