Compare two strings in Python

I need to compare two strings, or at least find a sequence of characters from one string in another string. The two strings contain MD5 hashes of files, which I must compare and report whether I find a match.
My current code is:
def comparemd5():
    origmd5=getreferrerurl()
    dlmd5=md5_for_file(file_name)
    print "original md5 is",origmd5
    print "downloader file md5 is",dlmd5
    s = difflib.SequenceMatcher(None, origmd5, dlmd5)
    print "ratio is:",s.ratio()
The output I get is:
original md5 is ['0430f244a18146a0815aa1dd4012db46', '0430f244a18146a0815aa1dd40
12db46', '59739CCDA2F15D5AC16DB6695CAE3378']
downloader file md5 is 59739ccda2f15d5ac16db6695cae3378
ratio is : 0.0
So there is a match for dlmd5 in origmd5, but somehow it's not finding it.
I am doing something wrong somewhere. Please help me out.

Basically, you want the idiom if test_string in list_of_strings. It looks like you don't need case sensitivity, so you might want
if test_string.lower() in (s.lower() for s in list_of_strings)
In your case:
>>> originals = ['0430f244a18146a0815aa1dd4012db46', '0430f244a18146a0815aa1dd40 12db46', '59739CCDA2F15D5AC16DB6695CAE3378']
>>> test = '59739ccda2f15d5ac16db6695cae3378'
>>> if test.lower() in (s.lower() for s in originals):
...     print '%s is match, yeih!' % test
...
59739ccda2f15d5ac16db6695cae3378 is match, yeih!

Looks like you're having a problem since the case isn't matching on the letters. May want to try:
def comparemd5():
    origmd5=[item.lower() for item in getreferrerurl()]
    dlmd5=md5_for_file(file_name)
    print "original md5 is",origmd5
    print "downloader file md5 is",dlmd5
    s = difflib.SequenceMatcher(None, origmd5, dlmd5)
    print "ratio is:",s.ratio()
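One thing worth checking with this approach: SequenceMatcher compares the elements of its two sequences. When one argument is a list of digests and the other is a single digest string, the list's elements are whole strings and the string's elements are single characters, so the ratio is likely to stay at 0.0 even once the case matches. A minimal sketch of that behaviour, using shortened stand-in values for the data in the question (the variable names mirror the question, the values are illustrative):

```python
import difflib

# Stand-ins for the values printed in the question.
origmd5 = ["0430f244a18146a0815aa1dd4012db46",
           "59739ccda2f15d5ac16db6695cae3378"]
dlmd5 = "59739ccda2f15d5ac16db6695cae3378"

# List vs. string: whole digests are compared with single characters,
# so no element of one sequence ever equals an element of the other.
list_vs_string = difflib.SequenceMatcher(None, origmd5, dlmd5).ratio()

# String vs. string: two identical sequences of characters.
string_vs_string = difflib.SequenceMatcher(None, origmd5[1], dlmd5).ratio()

print(list_vs_string)    # 0.0
print(string_vs_string)  # 1.0
```

A plain membership test (dlmd5 in origmd5) avoids the issue entirely once both sides use the same case.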

Given the input:
original md5 is ['0430f244a18146a0815aa1dd4012db46', '0430f244a18146a0815aa1dd40 12db46', '59739CCDA2F15D5AC16DB6695CAE3378']
downloader file md5 is 59739ccda2f15d5ac16db6695cae3378
You have two problems.
First of all, that first one isn't just an MD5, but an MD5 and two other things.
To fix that: If you know that origmd5 will always be in this format, just use origmd5[2] instead of origmd5. If you have no idea what origmd5 is, except that one of the things in it is the actual MD5, you'll have to compare against all of the elements.
Second, the actual MD5 values are both hex strings representing the same binary data, but they're different hex strings (because one is in uppercase, the other in lowercase). You could fix this by just doing a case-insensitive comparison, but it's probably more robust to unhexlify them both and compare the binary values.
In fact, if you've copied and pasted the output correctly, at least one of those hex strings has a space in the middle of it, so you actually need to unhexlify hex strings with optional spaces between hex pairs. AFAIK, there is no stdlib function that does this, but you can write it yourself in one step:
import binascii
def unhexlify(s):
    return binascii.unhexlify(s.replace(' ', ''))
Meanwhile, I'm not sure why you're trying to use difflib.SequenceMatcher at all. Two slightly different MD5 hashes refer to completely different original sources; that's kind of the whole point of MD5, and crypto hash functions in general. There's no such thing as a 95% match; there's either a match, or a non-match.
So, if you know the 3rd value in origmd5 is the one you want, just do this:
s = unhexlify(origmd5[2]) == unhexlify(dlmd5)
Otherwise, do this:
s = any(unhexlify(origthingy) == unhexlify(dlmd5) for origthingy in origmd5)
Or, turning it around to make it simpler:
s = unhexlify(dlmd5) in map(unhexlify, origmd5)
Or whatever equivalent you find most readable.
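For readers on Python 3, the same idea can be sketched with bytes.fromhex instead of binascii. The function names normalize_digest and digest_matches are made up for this example, and the sample list is a shortened version of the one in the question:

```python
def normalize_digest(hex_digest):
    """Return the raw bytes for a hex digest, ignoring case and whitespace."""
    compact = "".join(hex_digest.split())  # drop spaces and line breaks
    return bytes.fromhex(compact)          # accepts upper- and lowercase hex


def digest_matches(candidate, known_digests):
    """Return True if candidate equals any digest in known_digests."""
    target = normalize_digest(candidate)
    return any(normalize_digest(d) == target for d in known_digests)


originals = [
    "0430f244a18146a0815aa1dd40 12db46",  # space in the middle, as in the question
    "59739CCDA2F15D5AC16DB6695CAE3378",
]
print(digest_matches("59739ccda2f15d5ac16db6695cae3378", originals))  # True
print(digest_matches("00000000000000000000000000000000", originals))  # False
```

bytes.fromhex raises ValueError for anything that is not valid hex, so malformed entries in the list surface as errors instead of silently failing to match.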

Related

How to use hash function in Python3 to transform an arbitrary string into a fixed-length sequence of alphanumeric symbols?

I have a large number of different sentences written in different languages (French, Ukrainian, English and so on). For each sentence I want to generate an audio file with the given sentence being pronounced by a text-to-speech program. Now I need to decide how to name those audio files (one file for each sentence). I thought that it would be elegant if I could infer the file name from the sentence. In other words, if I see the sentence, I should be able to compute (infer / derive) the name of the audio file in which this sentence is spoken.
I thought that I could use a hash function for that. I would apply a hash function to the string representing the sentence and, as a result, I would get a string (hash) that I can use as a name of the file.
Why not use the sentence itself as a name? Because sentences can be long and I do not want to have very long file names. Moreover, I do not want to have spaces and other punctuation symbols (as well as unusual alphabet symbols) in the names of the files. Finally, I expect that a hash will always have the same length, which looks nice.
Now my question is: How can I transform an arbitrary unicode string into a sequence of alphanumeric symbols that is a hash of the input string in Python3?
I also wonder if there is a danger of getting the same hash for different sentences.
ADDED:
I have just realized that applying the built-in hash function to the same string can give different results in different sessions. This is, obviously, something that I would like to avoid.
Sure. Use a cryptographic hash function such as SHA-256; they're available in hashlib. (As you've noticed, hash isn't stable between sessions due to PYTHONHASHSEED, nor necessarily between Python versions and interpreters.)
I also apply some normalization here, but that may or may not be what you want.
import hashlib
def get_filename(sentence: str) -> str:
    # assuming leading/trailing whitespace doesn't matter, nor does case
    sentence_norm = sentence.lower().strip()
    return hashlib.sha256(sentence_norm.encode("utf-8")).hexdigest()
>>> get_filename("Hello, mon ami!")
'c13c197526d17532bd6d9bf3c2ad34486ccb2fcdeadaf7b71c3c67c0f048ecb9'
>>> get_filename("hello, mon ami! ")
'c13c197526d17532bd6d9bf3c2ad34486ccb2fcdeadaf7b71c3c67c0f048ecb9'
>>>
I also wonder if there is a danger of getting the same hash for different sentences.
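Not in practice. Collisions exist in principle, since infinitely many inputs map to 2^256 possible outputs, but no SHA-256 collision is known, and if one is ever found, we're all in trouble anyway.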
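Since the sentences come from several languages, it may also be worth normalizing the Unicode form before hashing, so that the same visible text always maps to the same name. This is a sketch under assumptions: the function name audio_filename, the ".mp3" extension, and the choice of NFC normalization are not from the question.

```python
import hashlib
import unicodedata


def audio_filename(sentence, extension=".mp3"):
    # NFC normalization is an assumption: it makes sentences that look the
    # same but use different code point sequences hash to the same name.
    normalized = unicodedata.normalize("NFC", sentence.strip())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return digest + extension


name = audio_filename("Привіт, світе!")
# 'é' as one code point and as 'e' + combining accent give the same name.
same = audio_filename("\u00e9") == audio_filename("e\u0301")
```

The digest part is always 64 lowercase hexadecimal characters, so the result is safe to use as a file name on common file systems.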

Python: check if string meets specific format

Programming in Python3.
I am having difficulty checking whether a string meets a specific format.
I know that Python does not have a .contains() method like Java, but that we can use regex.
My code will probably look something like this, where lowpan_headers is a dictionary with a field that is a string that should meet a specific format:
import re
lowpan_headers = self.converter.lowpan_string_to_headers(lowpan_string)
pattern = re.compile("^([A-Z][0-9]+)+$")
pattern.match(lowpan_headers[dest_addrS])
However, my issue is in the format and I have not been able to get it right.
The format should be like bbbb00000000000000170d0000306fb6, where the first 4 characters should be bbbb and all the rest, with that exact length, should be hexadecimal values (so from 0-9 and a-f).
So two questions:
(1) any easier way of doing this except through importing re
(2) If not, can you help me out with the regex?
As for the regex you're looking for I believe that
^bbbb[0-9a-f]{28}$
should validate correctly for your requirements.
As for whether there is an easier way than using the re module, I would say there isn't really one for what you're trying to achieve. While the in keyword in Python works the way you would expect a contains method to work for a string, what you actually want to know is whether a string is in the correct format. As such, the best solution, and a relatively simple one, is to use a regular expression, and thus the re module.
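A minimal sketch of how that pattern might be wired up in Python 3. The function name matches_address_format is made up for this example; re.fullmatch is used so the anchors can be dropped from the pattern:

```python
import re

# 'bbbb' followed by exactly 28 lowercase hex digits (32 characters in total).
ADDRESS_PATTERN = re.compile(r"bbbb[0-9a-f]{28}")


def matches_address_format(value):
    # fullmatch only succeeds if the whole string matches the pattern,
    # so a trailing newline or extra characters are rejected.
    return ADDRESS_PATTERN.fullmatch(value) is not None


print(matches_address_format("bbbb" + "0" * 28))        # True
print(matches_address_format("bbbb" + "0" * 27 + "x"))  # False
print(matches_address_format("bbbb" + "0" * 27))        # False
```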
Here is a solution that does not use regex:
lowpan_headers = 'bbbb00000000000000170d0000306fb6'
if lowpan_headers[:4] == 'bbbb' and len(lowpan_headers) == 32:
    try:
        int(lowpan_headers[4:], 16) # tries interpreting the last 28 characters as hexadecimal
        print('Input is valid!')
    except ValueError:
        print('Invalid Input') # hex test failed!
else:
    print('Invalid Input') # either length test or 'bbbb' prefix test failed!
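One caveat with the int(..., 16) test: int accepts more than bare hex digits. A leading sign, surrounding whitespace, or a 0x prefix all parse successfully, so a string such as 'bbbb+000...' would pass. A stricter regex-free check could look like the sketch below, where the function name and its default parameters are assumptions made for the example:

```python
HEX_DIGITS = frozenset("0123456789abcdef")


def is_valid_address_no_regex(value, prefix="bbbb", total_length=32):
    # Check the length and the fixed prefix first.
    if len(value) != total_length or not value.startswith(prefix):
        return False
    # Then require every remaining character to be a lowercase hex digit.
    return all(ch in HEX_DIGITS for ch in value[len(prefix):])


print(is_valid_address_no_regex("bbbb" + "0" * 28))   # True
print(is_valid_address_no_regex("bbbb+" + "0" * 27))  # False
print(int("+" + "0" * 27, 16))                        # 0, so int() alone would accept it
```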
In fact, Python does have an equivalent to the .contains() method. You can use the in operator:
if 'substring' in long_string:
    return True
A similar question has already been answered here.
For your case, however, I'd still stick with regex, as you're indeed trying to validate a specific string format. To ensure that your string only has hexadecimal values, i.e. 0-9 and a-f, the following regex should do it: ^[a-fA-F0-9]+$. The additional "complication" is the four 'b' characters at the start of your string. An easy fix is to put them in front as a required prefix: ^bbbb[a-fA-F0-9]+$.
>>> import re
>>> pattern = re.compile('^bbbb[a-fA-F0-9]+$')
>>> test_1 = 'bbbb00000000000000170d0000306fb6'
>>> test_2 = 'bbbb00000000000000170d0000306fx6'
>>> pattern.match(test_1)
<_sre.SRE_Match object; span=(0, 32), match='bbbb00000000000000170d0000306fb6'>
>>> pattern.match(test_2)
>>>
The part that is currently missing is checking for the exact length of the string for which you could either use the string length method or extend the regex -- but I'm sure you can take it from here :-)
As I mentioned in the comment, Python does have an equivalent of contains().
if "blah" not in somestring:
    continue
If you would prefer to use a regex instead to validate your input, you can use this:
^b{4}[0-9a-f]{28}$

Python identify file with largest number as part of filename

I have files with a number appended at the end e.g:
file_01.csv
file_02.csv
file_03.csv
I am looking for a simple way of identifying the file with the largest number appended to it. Is there a moderately simple way of achieving this? ... I was thinking of importing all file names in the folder, extracting the last digits, converting to number, and then looking for max number, however that seems moderately complicated for what I assume is a relatively common task.
If the filenames really are formatted this consistently (same prefix, zero-padded numbers), then you can simply use max:
>>> max(['file_01.csv', 'file_02.csv', 'file_03.csv'])
'file_03.csv'
but note that:
>>> 'file_5.csv' > 'file_23.csv'
True
>>> 'my_file_01' > 'file_123'
True
>>> 'fyle_01' > 'file_42'
True
so you might want to add some kind of validation to your function, and/or use glob.glob:
>>> max(glob.glob('/tmp/file_??'))
'/tmp/file_03'
import re
x = ["file_01.csv", "file_02.csv", "file_03.csv"]
# compare the numeric part as an integer, so that file_23 ranks above file_5
print(max(x, key=lambda name: int(re.split(r"_|\.", name)[1])))
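A slightly more defensive variant, sketched for Python 3, extracts the digits just before the extension and compares them as integers, which also copes with numbers that are not zero-padded. The helper names file_number and highest_numbered are made up for this example, and treating names without a number as -1 is an assumption:

```python
import re

# Trailing run of digits just before the extension, e.g. '23' in 'file_23.csv'.
_NUMBER_BEFORE_EXT = re.compile(r"(\d+)\.[^.]+$")


def file_number(name):
    match = _NUMBER_BEFORE_EXT.search(name)
    # Names without a number get -1 so they never win the comparison.
    return int(match.group(1)) if match else -1


def highest_numbered(names):
    return max(names, key=file_number)


names = ["file_5.csv", "file_23.csv", "file_03.csv"]
print(highest_numbered(names))  # file_23.csv
print(max(names))               # file_5.csv (plain string comparison)
```

In a real directory, the list of names could come from glob.glob or os.listdir.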

How to check if a string is an rgb hex string

I am trying to create a way to validate console input and check that the string is an RGB hex string (e.g. #FAF0E6). Currently I am working with a try: except: block.
def isbgcolor(bgcolor):
    #checks to see if bgcolor is binary
    try:
        float(bgcolor)
        return True
    except ValueError:
        return False
I tried also using a .startswith('#'). I have seen examples of how to write this function in Java but I'm still a beginner and Python's all I know. Help?
Normally, the best way to see if a string matches some simple format is to actually try to parse it. (Especially if you're only checking so you can then parse it if valid, or print an error if not.) So, let's do that.
The standard library is full of all kinds of useful things, so it's always worth searching. If you want to parse a hex string, the first thing that comes up is binascii.unhexlify. We want to unhexlify everything after the first # character. So:
import binascii
def parse_bgcolor(bgcolor):
    if not bgcolor.startswith('#'):
        raise ValueError('A bgcolor must start with a "#"')
    return binascii.unhexlify(bgcolor[1:])
def is_bgcolor(bgcolor):
    try:
        parse_bgcolor(bgcolor)
    except Exception as e:
        return False
    else:
        return True
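This accepts any even number of hex digits after the #, so even a 16-character string passes, while the 3-character shorthand (which most data formats that use #-prefixed hex RGB allow) is rejected, because unhexlify needs an even-length string. If you want to add a check for the length, you can add that. Is the rule == 6 or in (3, 6) or % 3 == 0? I don't know, but presumably you do if you have a rule you want to add. (If 3-character strings should be valid, you would have to double each digit before unhexlifying.)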
If you start using parse_bgcolor, you'll discover that for a 6-character string it gives you a bytes with 3 values from 0-255, one per channel. For a 12-character string (16 bits per channel) it gives you a bytes with 6 values from 0-255, when you really wanted 3 values from 0-65535. You can combine them manually, or you can parse each group of digits as a number (e.g., with int(group, 16)), or you can feed the 6 bytes you already have into, say, struct.unpack('>HHH', ...). Whatever you need to do is pretty easy once you know exactly what you want to do.
Finally, if you're trying to parse CSS or HTML, things like red or rgb(1, 2, 3) are also valid colors. Do you need to handle those? If so, you'll need something a bit smarter than this. The first thing to do is look at the spec for what you're trying to parse, and work out the rules you need to turn into code. Then you can write the code.
The following would match a hex RGB string:
import re
_rgbstring = re.compile(r'#[a-fA-F0-9]{6}$')
def isrgbcolor(value):
    return bool(_rgbstring.match(value))
This only returns True if a string starting with # followed by exactly 6 hex digits is passed in.
Demo:
>>> isrgbcolor('#FAF0E6')
True
>>> isrgbcolor('#FAF0')
False
>>> isrgbcolor('FAF0E6')
False
>>> isrgbcolor('#NotRgb')
False
If you want to support the 3-digit CSS format as well, update the pattern:
_rgbstring = re.compile(r'#[a-fA-F0-9]{3}(?:[a-fA-F0-9]{3})?$')
This matches a hash followed by 3 hex digits, plus an optional 3 extra digits.
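If the next step is to use the color and not only validate it, validation and parsing can be combined. The following is a sketch for Python 3; the function name parse_hex_color is made up for this example, and expanding the 3-digit form by doubling each digit follows the CSS shorthand convention:

```python
import re

_HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})")


def parse_hex_color(value):
    """Return (red, green, blue) as ints from 0 to 255, or raise ValueError."""
    if not _HEX_COLOR.fullmatch(value):
        raise ValueError("not a #RGB or #RRGGBB color: %r" % (value,))
    digits = value[1:]
    if len(digits) == 3:
        # CSS shorthand: each digit is doubled, so #FA0 means #FFAA00.
        digits = "".join(ch * 2 for ch in digits)
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


print(parse_hex_color("#FAF0E6"))  # (250, 240, 230)
print(parse_hex_color("#FA0"))     # (255, 170, 0)
```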
This seems to be the simplest way. This regex will notice that the P doesn't belong in the hex string.
import re
from pprint import pprint
hex = '#f8Ed90P'
pprint(re.findall('[^#0-9a-fA-F]', hex))
So if there is anything in the result of re.findall, there's something wrong with the structure of your hex string.
This code resulted in:
macbook-pro:Desktop allendar$ python3 test.py
['P']
This code has the flaw that the hash sign can be anywhere in the string, which of course isn't right.
You might want to check for the hash sign at the beginning of the string separately, so the regex stays easy to read. After that, just check that the remaining characters are among those allowed by your regex.

How to compare unicode strings with entity ref to non-unicode string

I am evaluating hundreds of thousands of HTML files, looking for particular parts of the files. There can be small variations in the way the files were created.
For example, in one file I can have a section heading (after I converted it to upper case, then split and joined the text to get rid of possibly inconsistent white space):
u'KEY1A\x97RISKFACTORS'
In another file I could have:
'KEY1ARISKFACTORS'
I am trying to create a dictionary of possible responses, and I want to compare these two and conclude that they are equal. But every substitution I run on the first string to remove the '\x97' does not seem to work.
There are a fair number of variations of keys with various representations of entities, so I would really like to create a dictionary more or less automatically, so I have something like:
key_dict={u'KEY1A\x97RISKFACTORS':'KEY1ARISKFACTORS','KEY1ARISKFACTORS':'KEY1ARISKFACTORS',. . .}
I am assuming that since when I run
S1='A'
S2=u'A'
S1==S2
I get
True
I should be able to compare these once the html entities are handled
What I specifically tried to do is
new_string=u'KEY1A\x97RISKFACTORS'.replace('|','')
I got an error
Sorry, I have been at this since last night. SLott pointed out something and I see I used the wrong label. I hope this makes more sense.
You are correct that if S1='A' and S2 = u'A', then S1 == S2. Instead of assuming this though, you can do a simple test:
key_dict= {u'A':'Value1',
'A':'Value2'}
print key_dict
print u'A' == 'A'
This outputs:
{u'A': 'Value2'}
True
That resolved, let's look at:
new_string=u'KEY1A\x97DEMOGRAPHICRESPONSES'.replace('|','')
There's a problem here, \x97 is the value you're trying to replace in the target string. However, your search string is '|', which is hex value 0x7C (ascii and unicode) and clearly not the value you need to replace. Even if the target and search string were both ascii or unicode, you'd still not find the '\x97'. Second problem is that you are trying to search for a non-unicode string in a unicode string. The easiest solution, and one that makes the most sense is to simply search for u'\x97':
print u'KEY1A\x97DEMOGRAPHICRESPONSES'
print u'KEY1A\x97DEMOGRAPHICRESPONSES'.replace(u'\x97', u'')
Outputs:
KEY1A\x97DEMOGRAPHICRESPONSES
KEY1ADEMOGRAPHICRESPONSES
Why not the obvious .replace(u'\x97','')? Where does the idea of that '|' come from?
>>> s = u'KEY1A\x97DEMOGRAPHICRESPONSES'
>>> s.replace(u'\x97', '')
u'KEY1ADEMOGRAPHICRESPONSES'
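When many different stray characters and entity references can show up in the headings, replacing them one by one gets tedious. A possible alternative, sketched here for Python 3 (where every str is Unicode), is to reduce each heading to a canonical key before comparing. The function name normalize_heading is made up for this example, and dropping every character other than A-Z and 0-9 is an assumption that may merge headings differing only in punctuation:

```python
import html
import re


def normalize_heading(text):
    # Decode entity references such as '&amp;' or '&#151;' first.
    decoded = html.unescape(text)
    # Then uppercase and keep only ASCII letters and digits.
    return re.sub(r"[^A-Z0-9]", "", decoded.upper())


print(normalize_heading("KEY1A\x97RISKFACTORS"))        # KEY1ARISKFACTORS
print(normalize_heading("Key 1A&#151; Risk Factors"))   # KEY1ARISKFACTORS
```

With keys normalized this way, the variants compare equal directly, so a hand-built dictionary of every variation may not be needed.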
