I am trying to create a way to proofread command console input and check to make sure that the string is an rgb hex string. (Ex: #FAF0E6) Currently I am working with a try: except: block.
def isbgcolor(bgcolor):
    #checks to see if bgcolor is binary
    try:
        float(bgcolor)
        return True
    except ValueError:
        return False
I tried also using a .startswith('#'). I have seen examples of how to write this function in Java but I'm still a beginner and Python's all I know. Help?
Normally, the best way to see if a string matches some simple format is to actually try to parse it. (Especially if you're only checking so you can then parse it if valid, or print an error if not.) So, let's do that.
The standard library is full of all kinds of useful things, so it's always worth searching. If you want to parse a hex string, the first thing that comes up is binascii.unhexlify. We want to unhexlify everything after the first # character. So:
import binascii
def parse_bgcolor(bgcolor):
    if not bgcolor.startswith('#'):
        raise ValueError('A bgcolor must start with a "#"')
    return binascii.unhexlify(bgcolor[1:])
def is_bgcolor(bgcolor):
    try:
        parse_bgcolor(bgcolor)
    except Exception:
        return False
    else:
        return True
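This accepts any even number of hex digits after the #, so 4- or 16-digit strings pass, and so does a bare #. It rejects the 3-digit shorthand (#FFF) that most data formats using #-prefixed hex RGB allow, because unhexlify requires an even number of digits. If you want to add a check for the length, you can add that (a 3-digit form would need expanding before it reaches unhexlify). Is the rule == 6 or in (3, 6) or % 3 == 0? I don't know, but presumably you do if you have a rule you want to add.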
If you start using parse_bgcolor, note what it returns. For a 6-digit string such as #FAF0E6 you get a bytes object with 3 values from 0-255, one per channel. For a 12-digit string you get 6 values from 0-255, when you probably wanted 3 values from 0-65535. You can combine them manually, or you can parse each channel's digits as a number (e.g., with int(pair, 16)), or you can feed the 6 bytes you already have into, say, struct.unpack('>HHH', data). Whatever you need to do is pretty easy once you know exactly what you want to do.
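As a rough sketch of the int(pair, 16) route, assuming the rule is "exactly six hex digits" (that rule and the name parse_rgb are assumptions for illustration, not part of the question):

```python
import string

def parse_rgb(bgcolor):
    """Return (r, g, b) as ints in 0-255 for a '#RRGGBB' string.

    Raises ValueError for anything else. The six-digit rule is an
    assumption; adjust it to whatever your format really requires.
    """
    if not bgcolor.startswith('#'):
        raise ValueError('A bgcolor must start with a "#"')
    digits = bgcolor[1:]
    if len(digits) != 6:
        raise ValueError('expected exactly 6 hex digits')
    # Check the characters first: int(text, 16) on its own would also
    # accept things like a leading space or a sign.
    if any(ch not in string.hexdigits for ch in digits):
        raise ValueError('non-hex character in bgcolor')
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))
```

With this, parse_rgb('#FAF0E6') gives (250, 240, 230), and an is_bgcolor-style check can simply wrap the call in try/except ValueError.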
Finally, if you're trying to parse CSS or HTML, things like red or rgb(1, 2, 3) are also valid colors. Do you need to handle those? If so, you'll need something a bit smarter than this. The first thing to do is look at the spec for what you're trying to parse, and work out the rules you need to turn into code. Then you can write the code.
The following would match a hex RGB string:
import re
_rgbstring = re.compile(r'#[a-fA-F0-9]{6}$')
def isrgbcolor(value):
    return bool(_rgbstring.match(value))
This only returns True if a string starting with # followed by exactly 6 hex digits is passed in.
Demo:
>>> isrgbcolor('#FAF0E6')
True
>>> isrgbcolor('#FAF0')
False
>>> isrgbcolor('FAF0E6')
False
>>> isrgbcolor('#NotRgb')
False
If you want to support the 3-digit CSS format as well, update the pattern:
_rgbstring = re.compile(r'#[a-fA-F0-9]{3}(?:[a-fA-F0-9]{3})?$')
This matches a hash followed by 3 hex digits, plus an optional 3 extra digits.
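One detail worth knowing: with match() and a trailing $, a string that ends in a newline (such as '#FAF0E6\n', which console input can easily produce) still matches. A possible variant, assuming Python 3.4 or later, uses fullmatch() instead. The names is_hex_color and expand_shorthand are made up for this sketch:

```python
import re

# Three hex digits, optionally followed by three more.
_HEX_COLOR = re.compile(r'#[0-9a-fA-F]{3}(?:[0-9a-fA-F]{3})?')

def is_hex_color(value):
    # fullmatch() requires the whole string to match, so a trailing
    # newline is rejected rather than silently ignored.
    return _HEX_COLOR.fullmatch(value) is not None

def expand_shorthand(value):
    """Turn a validated '#abc' into '#aabbcc'; leave 6-digit values alone."""
    if len(value) == 4:
        return '#' + ''.join(ch * 2 for ch in value[1:])
    return value
```

If you would rather keep the trailing newline acceptable, call .strip() on the input before validating it.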
This seems to be the simplest way. This regex will notice that the P doesn't belong in the hex string.
import re
from pprint import pprint
hex_string = '#f8Ed90P'
pprint(re.findall('[^#0-9a-fA-F]', hex_string))
So if there is anything in the result of re.findall, there's something wrong with your hex string.
This code resulted in:
macbook-pro:Desktop allendar$ python3 test.py
['P']
This code has the flaw that the hash sign can be anywhere, which of course isn't right.
You might want to check for the hash sign at the beginning of the string separately, so the regex is easier to read. Afterwards, just check that the remaining characters conform to the characters allowed in your regex.
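A minimal sketch of that two-step idea, without a regex at all. The function name bad_hex_chars is hypothetical, and the sketch deliberately does not check the length:

```python
import string

HEX_DIGITS = set(string.hexdigits)  # 0-9, a-f and A-F

def bad_hex_chars(value):
    """Return the characters after the leading '#' that are not hex digits.

    An empty list means every character is acceptable. A missing '#'
    is reported separately by raising ValueError.
    """
    if not value.startswith('#'):
        raise ValueError('missing leading "#"')
    return [ch for ch in value[1:] if ch not in HEX_DIGITS]
```

Because the leading # is stripped before the scan, a second # later in the string is now reported as a bad character.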
Programming in Python3.
I am having difficulty in controlling whether a string meets a specific format.
So, I know that Python does not have a .contain() method like Java but that we can use regex.
My code hence will probably look something like this, where lowpan_headers is a dictionary with a field that is a string that should meet a specific format.
So the code will probably be like this:
import re
lowpan_headers = self.converter.lowpan_string_to_headers(lowpan_string)
pattern = re.compile("^([A-Z][0-9]+)+$")
pattern.match(lowpan_headers[dest_addrS])
However, my issue is in the format and I have not been able to get it right.
The format should be like bbbb00000000000000170d0000306fb6, where the first 4 characters should be bbbb and all the rest, with that exact length, should be hexadecimal values (so from 0-9 and a-f).
So two questions:
(1) any easier way of doing this except through importing re
(2) If not, can you help me out with the regex?
As for the regex you're looking for I believe that
^bbbb[0-9a-f]{28}$
should validate correctly for your requirements.
As for whether there is an easier way than using the re module, I would say there isn't really one for the result you're looking for. The in keyword in Python works the way you would expect a contains method to work for a string, but what you actually want to know is whether a string is in the correct format. The best solution, as it is relatively simple, is a regular expression, and thus the re module.
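For illustration, the pattern above could be wrapped like this. The function name is_valid_dest_addr is an assumption, and fullmatch() (Python 3.4+) stands in for the ^ and $ anchors:

```python
import re

# 'bbbb' followed by exactly 28 lowercase hex digits, 32 characters in total.
_DEST_ADDR = re.compile(r'bbbb[0-9a-f]{28}')

def is_valid_dest_addr(addr):
    # fullmatch() only succeeds if the pattern covers the whole string.
    return _DEST_ADDR.fullmatch(addr) is not None
```

If uppercase digits should be allowed too, widen the character class to [0-9a-fA-F].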
Here is a solution that does not use regex:
lowpan_headers = 'bbbb00000000000000170d0000306fb6'
if lowpan_headers[:4] == 'bbbb' and len(lowpan_headers) == 32:
    try:
        int(lowpan_headers[4:], 16) # tries interpreting the last 28 characters as hexadecimal
        print('Input is valid!')
    except ValueError:
        print('Invalid Input') # hex test failed!
else:
    print('Invalid Input') # either length test or 'bbbb' prefix test failed!
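One caveat with the int() check: int(text, 16) is more lenient than "only hex digits". It also accepts a 0x prefix, a sign, and surrounding whitespace, so a few malformed strings of the right length slip through. A stricter regex-free sketch, where the helper name is_valid_lowpan_addr is hypothetical:

```python
HEX_LOWER = set('0123456789abcdef')

def is_valid_lowpan_addr(addr):
    """True for 'bbbb' followed by exactly 28 lowercase hex digits."""
    return (len(addr) == 32
            and addr.startswith('bbbb')
            and all(ch in HEX_LOWER for ch in addr[4:]))

# The int() test would accept this 32-character string, because
# '0x' followed by zeros is a valid base-16 literal:
lenient_case = 'bbbb0x' + '0' * 26
```

Here is_valid_lowpan_addr(lenient_case) is False, while int(lenient_case[4:], 16) succeeds.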
In fact, Python does have an equivalent to the .contains() method. You can use the in operator:
if 'substring' in long_string:
    return True
A similar question has already been answered here.
For your case, however, I'd still stick with regex, as you're indeed trying to validate a certain string format. To ensure that your string only has hexadecimal values, i.e. 0-9 and a-f, the following regex should do it: ^[a-fA-F0-9]+$. The additional "complication" is the four 'b' characters at the start of your string. I think an easy fix would be to include them as follows: ^(bbbb)?[a-fA-F0-9]+$. Note that b is itself a hexadecimal digit, so the optional group does not actually enforce the prefix; drop the ? if the four b characters are required.
>>> import re
>>> pattern = re.compile('^(bbbb)?[a-fA-F0-9]+$')
>>> test_1 = 'bbbb00000000000000170d0000306fb6'
>>> test_2 = 'bbbb00000000000000170d0000306fx6'
>>> pattern.match(test_1)
<_sre.SRE_Match object; span=(0, 32), match='bbbb00000000000000170d0000306fb6'>
>>> pattern.match(test_2)
>>>
The part that is currently missing is checking for the exact length of the string for which you could either use the string length method or extend the regex -- but I'm sure you can take it from here :-)
As I mentioned in the comment Python does have contains() equivalent.
if "blah" not in somestring:
    continue
(source) (PythonDocs)
If you would prefer to use a regex instead to validate your input, you can use this:
^b{4}[0-9a-f]{28}$ - Regex101 Demo with explanation
I want to replace all emoji with '' but my regex doesn't work. For example,
content= u'?\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u95f4\u53d1\u70ed\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633?'
and I want to replace all the forms like \U0001f633 with '' so I write the code:
print re.sub(ur'\\U[0-9a-fA-F]{8}','',content)
But it doesn't work.
Thanks a lot.
You won't be able to recognize properly decoded unicode codepoints that way (as strings containing \uXXXX, etc.). Properly decoded, by the time the regex parser gets to them, each is a single character.
Depending on whether your python was compiled with only 16-bit unicode code points or not, you'll want a pattern something like either:
# 16-bit codepoints
re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
# 32-bit* codepoints
re_strip = re.compile(u'[\U00010000-\U0010FFFF]')
And your code would look like:
import re
# Pick a pattern, adjust as necessary
#re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
re_strip = re.compile(u'[\U00010000-\U0010FFFF]')
content= u'[\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u95f4\u53d1\u70ed\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633]'
print(content)
stripped = re_strip.sub('', content)
print(stripped)
Both expressions reduce the number of characters in the stripped string to 26.
These expressions strip out the emojis you were after, but may also strip out other things you do want. It may be worth reviewing a unicode codepoint range listing (e.g. here) and adjusting them.
You can determine whether your python install will only recognize 16-bit codepoints by doing something like:
import sys
print(sys.maxunicode.bit_length())
If this displays 16, you'll need the first regex expression. If it displays something greater than 16 (for me it says 21), the second one is what you want.
Neither expression will work when used on a python install with the wrong sys.maxunicode.
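On Python 3.3 and later, sys.maxunicode is always 0x10FFFF, so only the second pattern applies there. A small sketch for that case; the function name strip_astral is made up here, and like the patterns above it removes every character outside the Basic Multilingual Plane, not just emoji:

```python
import re

# Every code point above U+FFFF. This covers most emoji, but also
# other supplementary-plane characters you may want to keep.
_ASTRAL = re.compile('[\U00010000-\U0010FFFF]')

def strip_astral(text):
    """Remove all characters outside the Basic Multilingual Plane."""
    return _ASTRAL.sub('', text)

content = '\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29'
stripped = strip_astral(content)
```

After this runs, stripped holds the same text with the \U0001f633 character removed and the Chinese characters untouched.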
I am trying to match a string with a regular expression but it is not working.
What I am trying to do is simple. It is the typical situation where a user enters a range of pages, or single pages. I am reading the string and checking if it is correct or not.
Expressions I am expecting, for a range of pages are like: 1-3, 5-6, 12-67
Expressions I am expecting, for single pages are like: 1,5,6,9,10,12
This is what I have done so far:
pagesOption1 = re.compile(r'\b\d\-\d{1,10}\b')
pagesOption2 = re.compile(r'\b\d\,{1,10}\b')
Seems like the first expression works, but not the second.
And would it be possible to merge both of them into one single regular expression? In a way that, if the user enters either something like 1-2, 7-10 or something like 3,5,6,7, the expression will be recognized as good.
Simpler is better
Matching the entire input with one expression isn't simple, as the proposed solutions show; at least it is not as simple as it could or should be. Such a regex becomes read-only very quickly, and anyone who isn't regex savvy will probably scrap it for a simpler, more explicit solution when they need to modify it.
Simplest
First split the entire string into individual entries with .split(","). You need these entries anyway to get at the usable numbers.
Then the test for each entry becomes very simple:
^(\d+)(?:-(\d+))?$
It says that the string must start with one or more digits, optionally followed by a single - and one or more digits, and then the string must end.
This makes your logic as simple and maintainable as possible. You also get the benefit of knowing exactly what part of the input is wrong and why so you can report it back to the user.
The capturing groups are there because you are going to need the input parsed out to actually use it anyway, this way you get the numbers if they match without having to add more code to parse them again anyway.
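A sketch of how the split-then-test approach might look in Python. The function name parse_pages, the whitespace stripping, and the choice to return (first, last) tuples are assumptions made for this example:

```python
import re

# One entry: a page number, optionally followed by '-' and a second number.
_ENTRY = re.compile(r'(\d+)(?:-(\d+))?')

def parse_pages(text):
    """Parse '1-3, 5, 12-67' into [(1, 3), (5, 5), (12, 67)].

    Raises ValueError naming the first entry that is malformed.
    """
    ranges = []
    for entry in text.split(','):
        entry = entry.strip()
        match = _ENTRY.fullmatch(entry)
        if match is None:
            raise ValueError('bad page entry: %r' % entry)
        first = int(match.group(1))
        # A single page is treated as a range of one page.
        last = int(match.group(2)) if match.group(2) else first
        ranges.append((first, last))
    return ranges
```

The error message carries the offending entry, which is the "knowing exactly what part of the input is wrong" benefit mentioned above. A further check such as first <= last could be added in the same loop if reversed ranges should be rejected.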
This regex should work -
^(?:(\d+\-\d+)|(\d+))(?:\,[ ]*(?:(\d+\-\d+)|(\d+)))*$
Demo here
Testing this -
>>> import re
>>> test_vals = [
    '1-3, 5-6, 12-67',
    '1,5,6,9,10,12',
    '1-3,1,2,4.5',
    'abcd',
]
>>> regex = re.compile(r'^(?:(\d+\-\d+)|(\d+))(?:\,[ ]*(?:(\d+\-\d+)|(\d+)))*$')
>>> for val in test_vals:
    print(val)
    if regex.match(val) is None:
        print("Fail")
    else:
        print("Pass")
1-3, 5-6, 12-67
Pass
1,5,6,9,10,12
Pass
1-3,1,2,4.5
Fail
abcd
Fail
Well, I need to compare two strings, or at least find a sequence of characters from one string in another string. The two strings contain MD5 hashes of files, which I must compare and say if I find a match.
my current code is:
def comparemd5():
    origmd5=getreferrerurl()
    dlmd5=md5_for_file(file_name)
    print "original md5 is",origmd5
    print "downloader file md5 is",dlmd5
    s = difflib.SequenceMatcher(None, origmd5, dlmd5)
    print "ratio is:",s.ratio()
the output i get is:
original md5 is ['0430f244a18146a0815aa1dd4012db46', '0430f244a18146a0815aa1dd40
12db46', '59739CCDA2F15D5AC16DB6695CAE3378']
downloader file md5 is 59739ccda2f15d5ac16db6695cae3378
ratio is : 0.0
Thus! there is a match from dlmd5 in origmd5 but somehow its not finding it...
I am doing something wrong somewhere...Please help me out :/
Basically, you want the idiom if test_string in list_of_strings. Looks like you don't need case sensitivity, so you might want
if test_string.lower() in (s.lower() for s in list_of_strings)
In your case:
>>> originals = ['0430f244a18146a0815aa1dd4012db46', '0430f244a18146a0815aa1dd40 12db46', '59739CCDA2F15D5AC16DB6695CAE3378']
>>> test = '59739ccda2f15d5ac16db6695cae3378'
>>> if test.lower() in (s.lower() for s in originals):
... print '%s is match, yeih!' % test
...
59739ccda2f15d5ac16db6695cae3378 is match, yeih!
Looks like you're having a problem since the case isn't matching on the letters. May want to try:
def comparemd5():
    origmd5=[item.lower() for item in getreferrerurl()]
    dlmd5=md5_for_file(file_name).lower()
    print "original md5 is",origmd5
    print "downloader file md5 is",dlmd5
    # SequenceMatcher would compare whole list items against single
    # characters of dlmd5, so test for membership instead
    print "match found:",dlmd5 in origmd5
Given the input:
original md5 is ['0430f244a18146a0815aa1dd4012db46', '0430f244a18146a0815aa1dd40 12db46', '59739CCDA2F15D5AC16DB6695CAE3378']
downloader file md5 is 59739ccda2f15d5ac16db6695cae3378
You have two problems.
First of all, that first one isn't just an MD5, but an MD5 and two other things.
To fix that: If you know that origmd5 will always be in this format, just use origmd5[2] instead of origmd5. If you have no idea what origmd5 is, except that one of the things in it is the actual MD5, you'll have to compare against all of the elements.
Second, the actual MD5 values are both hex strings representing the same binary data, but they're different hex strings (because one is in uppercase, the other in lowercase). You could fix this by just doing a case-insensitive comparison, but it's probably more robust to unhexlify them both and compare the binary values.
In fact, if you've copied and pasted the output correctly, at least one of those hex strings has a space in the middle of it, so you actually need to unhexlify hex strings with optional spaces between hex pairs. AFAIK, there is no stdlib function that does this, but you can write it yourself in one step:
import binascii
def unhexlify(s):
    return binascii.unhexlify(s.replace(' ', ''))
Meanwhile, I'm not sure why you're trying to use difflib.SequenceMatcher at all. Two slightly different MD5 hashes refer to completely different original sources; that's kind of the whole point of MD5, and crypto hash functions in general. There's no such thing as a 95% match; there's either a match, or a non-match.
So, if you know the 3rd value in origmd5 is the one you want, just do this:
s = unhexlify(origmd5[2]) == unhexlify(dlmd5)
Otherwise, do this:
s = any(unhexlify(origthingy) == unhexlify(dlmd5) for origthingy in origmd5)
Or, turning it around to make it simpler:
s = unhexlify(dlmd5) in map(unhexlify, origmd5)
Or whatever equivalent you find most readable.
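Put together, the comparison might look like the following Python 3 sketch. The names normalize_md5 and md5_matches are invented for the example, and it assumes every candidate really is a hex string (anything else makes binascii.unhexlify raise binascii.Error):

```python
import binascii

def normalize_md5(hex_digest):
    """Convert a hex digest to raw bytes, ignoring case and embedded spaces."""
    return binascii.unhexlify(hex_digest.replace(' ', ''))

def md5_matches(candidates, downloaded):
    """True if the downloaded digest equals any of the candidate digests."""
    target = normalize_md5(downloaded)
    return any(normalize_md5(candidate) == target for candidate in candidates)

originals = ['0430f244a18146a0815aa1dd4012db46',
             '0430f244a18146a0815aa1dd40 12db46',
             '59739CCDA2F15D5AC16DB6695CAE3378']
found = md5_matches(originals, '59739ccda2f15d5ac16db6695cae3378')
```

With the values from the question, found is True: the uppercase third entry and the lowercase downloaded digest decode to the same 16 bytes.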
I use this regex on some input,
[^a-zA-Z0-9##]
However this ends up removing lots of html special characters within the input, such as
#227;, #1606;, #1588; (i had to remove the & prefix so that it wouldn't
show up as the actual value..)
is there a way that I can convert them to their values so that it will satisfy the regexp expression? I also have no idea why the text decided to be so big.
Given that your text appears to have numeric-coded, not named, entities, you can first convert your byte string that includes xml entity defs (ampersand, hash, digits, semicolon) to unicode:
import re
xed_re = re.compile(r'&#(\d+);')
def usub(m): return unichr(int(m.group(1)))
s = '&#227;, &#1606;, &#1588;'
u = xed_re.sub(usub, s)
If your terminal emulator can display arbitrary unicode glyphs, a print u will then show
ã, ن, ش
In any case, you can now, if you wish, use your original RE and you won't accidentally "catch" the entities, only ascii letters, digits, and the couple of punctuation characters you listed. (I'm not sure that's what you really want -- why not accented letters but just ascii ones, for example? -- but, if it is what you want, it will work).
If you do have named entities in addition to the numeric-coded ones, you can also apply the htmlentitydefs standard library module recommended in another answer (it only deals with named entities which map to Latin-1 code points, however).
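The code above is Python 2 (unichr, htmlentitydefs). If you are on Python 3.4 or later, the standard library's html.unescape handles both numeric and named references in one call. A minimal sketch, where the wrapper name decode_entities is just for illustration:

```python
import html

def decode_entities(text):
    """Replace numeric and named HTML character references with their characters."""
    return html.unescape(text)

decoded = decode_entities('&#227;, &#1606;, &#1588;')
```

Unlike a hand-written substitution, html.unescape leaves references it does not recognize (for example &wat;) unchanged rather than replacing them.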
You can adapt the following script:
import htmlentitydefs
import re
def substitute_entity (match):
    name = match.group (1)
    if name in htmlentitydefs.name2codepoint:
        # named entity such as &laquo;
        return unichr (htmlentitydefs.name2codepoint[name])
    elif name.startswith ('#'):
        # numeric entity such as &#123;
        try:
            return unichr (int (name[1:]))
        except (ValueError, OverflowError):
            pass
    # unknown or malformed entity
    return '?'
print re.sub ('&(#?\\w+);', substitute_entity, 'x &laquo; y &wat; z &#123;')
Produces the following answer here:
x « y ? z {
EDIT: I understood the question as "how to get rid of HTML entities before further processing", hope I haven't wasted time on answering a wrong question ;)
Without knowing what the expression is being used for I can't tell exactly what you need.
This will match special characters or strings of characters excluding letters, digits, #, and #:
[^a-zA-Z0-9##]*|#[0-9A-Za-z]+;