This question already has answers here:
python-re: How do I match an alpha character
(3 answers)
Closed 4 years ago.
Ok so basically this is what I know, and it does work, using Python3:
color="Red1 and Blue2!"
color[2]=="d"
True
What I need is that when I call any position, (which inputs any single character Lower or Upper case in the comparison), into the brackets "color[ ]" and compare it to match only with "Lower or Upper case letters" excluding all numbers and characters (.*&^%$##!).
in order words something to the effects below:
color="Red1 and Blue2!"
if color[5]==[a-zA-z]:
doSomething
else:
doSomethingElse
Of course what I just listed above does not work. Perhaps my syntax is wrong, perhaps it just cant be done. If I only use a single letter on the "right" side of the equals, then all is well, But like I said I need whatever single letter is pulled into the left side, to match something on the right.
First off I wan't to make sure that its possible to do, what I'm trying to accomplish?
2nd, if it is indeed possible to do then have this accomplished "Without" importing anything other then "sys".
If the only way to accomplish this is by importing something else, then I will take a look at that suggestion, however I prefer not to import anything if at all possible.
I'v searched my books, and a whole other questions on this site and I can't seem to find anything that matches, thanks.
For the case of looking for letters, a simple .isalpha() check:
if color[5].isalpha():
will work.
For the general case where a specific check function doesn't exist, you can use in checks:
if color[5] in '13579': # Checks for existence in some random letter set
If the "random letter set" is large enough, you may want to preconvert to a frozenset for checking (frozenset membership tests are roughly O(1), vs. O(n) for str, but str tests are optimized enough that you'd need quite a long str before the frozenset makes sense; possibly larger than the one in the example):
CHARSET = frozenset('13579adgjlqetuozcbm')
if color[5] in CHARSET:
Alternatively, you can use regular expressions to get the character classes you were trying to get:
import re
# Do this once up front to avoid recompiling, then use repeatedly
islet = re.compile('^[a-zA-Z]$').match
...
if islet(color[5]):
This is where isalpha() is helpful.
color="Red1 and Blue2!"
if color[5].isalpha():
doSomething
else:
doSomethingElse
There's also isnumeric(), if you need numbers.
Not really sure why you'd require not importing anything from the standard libraries though.
import string
color="Red1 and Blue2!"
if color[5] in string.ascii_letters:
print("do something")
else:
print("do something else")
Related
def most_frequency_occ(chars,inputString):
count = 0
for ind_char in inputString:
ind_char = ind_char.lower()
if chars == ind_char:
count += 1
return count
def general(inputString):
maxOccurences = 0
for chars in inputString:
most_frequency_occ(chars, inputString)
This is my current code. I'm trying to find the most frequent occurring letter in general. I created another function called most_frequency_occ that finds a specific character in the string that occurs the most often, but how do I generalize it into finding the frequent letter in a string without specifying a specific character and only using loops, without any build in string functions either.
For example:
print(general('aqweasdaza'))
should print 4 as "a" occurs the most frequently, occurring 4 times.
If I got your task, I think that using a dictionary will be more comfortable for you.
# initializing string
str = "Hello world"
# initializing dict of freq
freq = {}
for i in str:
if i in freq:
freq[i] += 1
else:
freq[i] = 1
# Now, you have the count of every char in this string.
# If you want to extract the max, this step will do it for you:
max_freq_chr = max(stats.values())
There are multiple ways you find the most common letter in a string.
One easy to understand and cross-language way of doing this would be:
initialize an array of 26 integers set to 0.
go over each letter one by one of your string, if the first letter is an B (B=2), you can increment the second value of the array
Find the largest value in your array, return the corresponding letter.
Since you are using python, you could use dictionaries since it would be less work to implement.
A word of caution, it sounds like you are doing a school assignment. If your school has a plagiarism checker that checks the internet, you might be caught for academic dishonesty if you copy paste code from the internet.
The other answers have suggested alternative ways of counting the letters in a string, some of which may be better than what you've come up with on your own. But I think it may be worth answering your question about how to call your most_frequency_occ function from your general function even if the algorithm isn't great, since you'll need to understand how functions work in other contexts.
The thing to understand about function calls is that the call expression will be evaluated to the value returned by the function. In this case, that's the count. Often you may want to assign the return value to a variable so you can reference it multiple times. Here's what that might look like:
count = most_frequency_occ(chars, inputString)
Now you can do a comparsion between the count and the previously best count to see if you've just checked the most common letter so far:
maxOccurences = 0
for chars in inputString:
count = most_frequency_occ(chars, inputString)
if count > maxOccurences: # check if chars is more common than the previous best
maxOccurences = count
return maxOccurences
One final note: Some of your variable and function names are a bit misleading. That often happens when you're changing your code around from one design to another, but not changing the variable names at the same time. You may want to occasionally reread your code and double check to make sure that the variable names still match what you're doing with them. If not, you should "refactor" your code by renaming the variables to better match their actual uses.
To be specific, your most_frequency_occ function isn't actually finding the most frequent character itself, it's only doing a small step in that process, counting how often a single character occurs. So I'd call it count_char or something similar. The general function might be named something more descriptive like find_most_frequent_character.
And the variable chars (which exists in both functions) is also misleading since it represents a single character, but the name chars implies something plural (like a list or a string that contains several characters). Renaming it to char might be better, as that seems more like a singular name.
I am working on a python project in which I need to filter profane words, and I already have a filter in place. The only problem is that if a user switches a character with a visually similar character (e.g. hello and h311o), the filter does not pick it up. Is there some way that I could find detect these words without hard coding every combination in?
What about translating l331sp33ch to leetspeech and applying a simple levensthein distance? (you need to pip install editdistance first)
import editdistance
try:
from string import maketrans # python 2
except:
maketrans = str.maketrans # python 3
t = maketrans("01345", "oleas")
editdistance.eval("h3110".translate(t), 'hello')
results in 0
Maybe build a relationship between the visually similar characters and what they can represent i.e.
dict = {'3': 'e', '1': 'l', '0': 'o'} #etc....
and then you can use this to test against your database of forbidden words.
e.g.
input:he11
if any of the characters have an entry in dict,
dict['h'] #not exist
dict['e'] #not exist
dict['1'] = 'l'
dict['1'] = 'l'
Put this together to form a word and then search your forbidden list. I don't know if this is the fastest way of doing it, but it is "a" way.
I'm interested to see what others come up with.
*disclaimer: I've done a year or so of Perl and am starting out learning Python right now. When I get the time. Which is very hard to come by.
Linear Replacement
You will want something adaptable to innovative orthographers. For a start, pattern-match the alphabetic characters to your lexicon of banned words, using other characters as wild cards. For instance, your example would get translated to "h...o", which you would match to your proposed taboo word, "hello".
Next, you would compare the non-alpha characters to a dictionary of substitutions, allowing common wild-card chars to stand for anything. For instance, asterisk, hyphen, and period could stand for anything; '4' and '#' could stand for 'A', and so on. However, you'll do this checking from the strength of the taboo word, not from generating all possibilities: the translation goes the other way.
You will have a little ambiguity, as some characters stand for multiple letters. "#" can be used in place of 'O' of you're getting crafty. Also note that not all the letters will be in your usual set: you'll want to deal with moentary symbols (Euro, Yen, and Pound are all derived from letters), as well as foreign letters that happen to resemble Latin letters.
Multi-character replacements
That handles only the words that have the same length as the taboo word. Can you also handle abbreviations? There are a lot of combinations of the form "h-bomb", where the banned word appears as the first letter only: the effect is profane, but the match is more difficult, especially where the 'b's are replaced with a scharfes-S (German), the 'm' with a Hebrew or Cryllic character, and the 'o' with anything round form the entire font.
Context
There is also the problem that some words are perfectly legitimate in one context, but profane in a slang context. Are you also planning to match phrases, perhaps parsing a sentence for trigger words?
Training a solution
If you need a comprehensive solution, consider training a neural network with phrases and words you label as "okay" and "taboo", and let it run for a day. This can take a lot of the adaptation work off your shoulders, and enhancing the model isn't a difficult problem: add your new differentiating text and continue the training from the point where you left off.
Thank you to all who posted an answer to this question. More answers are welcome, as they may help others. I ended up going off of David Zemens' comment on the question.
I'd use a dictionary or list of common variants ("sh1t", etc.) which you could persist as a plain text file or json etc., and read in to memory. This would allow you to add new entries as needed, independently of the code itself. If you're only concerned about profanities, then the list should be reasonably small to maintain, and new variations unlikely. I've used a hard-coded dict to represent statistical t-table (with 1500 key/value pairs) in the past, seems like your problem would not require nearly that many keys.
While this still means that all there word will be hard coded, it will allow me to update the list more easily.
I'm trying to write an iterative LL(k) parser, and I've gotten strings down pretty well, because they have a start and end token, and so you can just "".join(tokenlist[string_start:string_end]).
Numbers, however, do not, and only consist of .0123456789. They can occur at any given point in a program, have any arbitrary length and are delimited purely by non-numerals.
Some examples, because that definition is pretty vague:
56 123.45/! is 56 and 123.45 followed by two other tokens
565.5345.345 % is 565.5345, 0.345 and two other tokens (incl. whitespace)
The problem I'm trying to solve is how the parser should figure out where a numeric literal ends. (Note that this is a context-free, self-modifying interpretive grammar thus there is no separate lexical analysis to be done.)
I could and have solved this with iteration:
def _next_notinst(self, atindex, subs = DIGITS):
"""return the next index of a char not in subs"""
for i, e in enumerate(self.toklist[atindex:]):
if e not in subs:
return i - len(self.toklist)
else:
break
return self.idx.v
(I don't think I need to clarify the variables, since it's an example and extremely straightforward.)
Great! That works, but there are at least two issues:
It's O(n) for a number with digit-length n. Not ideal.*
The parser class of which this method is a member is already using a while True: to cycle over arbitrary parts of the string, and I would prefer not having remotely nested loops when I don't need to.
From the previous bullet: since the parser uses arbitrary k lookahead and skipahead, parsing each individual token is absolutely not what I want.
I don't want to use RegEx mostly because I don't know it, and using it for this right now would make my code uncomprehendable to me, its creator.
There must be a simple, < O(n) solution to this, that simply collects the contiguous numerals in a string given a starting point, up until a non-numeral.
*Yes, I'm fully aware the parser itself is O(n), but we don't also need the number catenator to be > O(n). If you don't believe me, the string catenator is O(1) because it simply looks for the next unescaped " in the program and then joins all the chars up to that. Can't I do the same thing for numbers?
My other answer was actually erroneous due to lack of testing.
I decided to suck it up and learn a little bit of RegEx just because it's the only other way to solve this.
^([.\d]+[.\d]+|[.\d]) works for what I want, and matches these:
123.43.453""
.234234!/%
but not, for example:
"1233
I am trying to match a string with a regular expression but it is not working.
What I am trying to do is simple, it is the typical situation when an user intruduces a range of pages, or single pages. I am reading the string and checking if it is correct or not.
Expressions I am expecting, for a range of pages are like: 1-3, 5-6, 12-67
Expressions I am expecting, for single pages are like: 1,5,6,9,10,12
This is what I have done so far:
pagesOption1 = re.compile(r'\b\d\-\d{1,10}\b')
pagesOption2 = re.compile(r'\b\d\,{1,10}\b')
Seems like the first expression works, but not the second.
And, would it be possible to merge both of them in one single regular expression?, In a way that, if the user introduces either something like 1-2, 7-10 or something like 3,5,6,7 the expression will be recogniced as good.
Simpler is better
Matching the entire input isn't simple, as the proposed solutions show, at least it is not as simple as it could/should be. Will become read only very quickly and probably be scrapped by anyone that isn't regex savvy when they need to modify it with a simpler more explicit solution.
Simplest
First parse the entire string and .split(","); into individual data entries, you will need these anyway to process. You have to do this anyway to parse out the useable numbers.
Then the test becomes a very simple, test.
^(\d+)(?:-\(d+))?$
It says, that there the string must start with one or more digits and be followed by optionally a single - and one or more digits and then the string must end.
This makes your logic as simple and maintainable as possible. You also get the benefit of knowing exactly what part of the input is wrong and why so you can report it back to the user.
The capturing groups are there because you are going to need the input parsed out to actually use it anyway, this way you get the numbers if they match without having to add more code to parse them again anyway.
This regex should work -
^(?:(\d+\-\d+)|(\d+))(?:\,[ ]*(?:(\d+\-\d+)|(\d+)))*$
Demo here
Testing this -
>>> test_vals = [
'1-3, 5-6, 12-67',
'1,5,6,9,10,12',
'1-3,1,2,4',
'abcd',
]
>>> regex = re.compile(r'^(?:(\d+\-\d+)|(\d+))(?:\,[ ]*(?:(\d+\-\d+)|(\d+)))*$')
>>> for val in test_vals:
print val
if regex.match(val) == None:
print "Fail"
else:
print "Pass"
1-3, 5-6, 12-67
Pass
1,5,6,9,10,12
Pass
1-3,1,2,4.5
Fail
abcd
Fail
I have a string. I need to know if any of the following substrings appear in the string.
So, if I have:
thing_name = "VISA ASSESSMENTS"
I've been doing my searches with:
any((_ in thing_name for _ in ['ASSESSMENTS','KILOBYTE','INTERNATIONAL']))
I'm going through a long list of thing_name items, and I don't need to filter, exactly, just check for any number of substrings.
Is this the best way to do this? It feels wrong, but I can't think of a more efficient way to pull this off.
You can try re.search to see if that is faster. Something along the lines of
import re
pattern = re.compile('|'.join(['ASSESSMENTS','KILOBYTE','INTERNATIONAL']))
isMatch = (pattern.search(thing_name) != None)
If your list of substrings is small and the input is small, then using a for loop to do compares is fine.
Otherwise the fastest way I know to search a string for a (large) list of substrings is to construct a DAWG of the word list and then iterate through the input string, keeping a list of DAWG traversals and registering the substrings at each successful traverse.
Another way is to add all the substrings to a hashtable and then hash every possible substring (up to the length of the longest substring) as you traverse the input string.
It's been a while since I've worked in python, my memory of it is that it's slow to implement stuff in. To go the DAWG route, I would probably implement it as a native module and then use it from python (if possible). Otherwise, I'd do some speed checks to verify first but probably go the hashtable route since there are already high performance hashtables in python.