newbie playing around with hashes here and not getting the result I am looking for. Trying to get a hash from a txt file from the web, then comparing that hash to a local hash.
For testing purposes I'm using SHA256.new(“10”).hexdigest() which is: 4a44dc15364204a80fe80e9039455cc1608281820fe2b24f1e5233ade6af1dd5
CODE:
import urllib2
from Crypto.Hash import SHA256
source = urllib2.urlopen("<xxURLxx>")
line1 = source.readline() # get first line of the txt file in source which is the hash
localHash = SHA256.new("10").hexdigest()
if localHash == line1: # I know, shouldnt use == to compare hashes but it is my first try.
print("it works!")
else:
print("it does not work...")
Printing the hashes I get from the web file and the local hash they return the same characters. But if I hash each hash one more time I get different results.
Any ideas?
Had a look around S.O. and found:
Compare result from hexdigest() to a string
but the issue there was the lack of .digest() which I have.
Thank you in advance for any help.
If I had to guess, I'd say that changing
line1 = source.readline()
to
line1 = source.readline().strip()
will fix the problem. strip() removes leading and trailing whitespace, including the newline ('\n') character that will almost certainly be at the end of the first line read by readline.
You can see whether there are "invisible" characters like that by using repr, which renders them explicitly using escape characters:
>>> print repr('\t')
'\t'
Related
Caveat emptor: I can spell p-y-t-h-o-n and that's pretty much all there is to my knowledge. I tried to take some online classes but after about 20 lectures learning not much, I gave up long time ago. So, what I am going to ask is very simple but I need help:
I have a file with the following structure:
object_name_here:
object_owner:
- me#my.email.com
- user#another.email.com
object_id: some_string_here
identification: some_other_string_here
And this block repeats itself hundreds of times in the same file.
Other than object_name_here being unique and required, all other lines may or may not be present, email addresses can be from none to 10+ different email addresses.
what I want to do is to export this information into a flat file, likes of /etc/passwd, with a twist
for instance, I want the block above to yield a line like this:
object_name_here:object_owner=me#my_email.com,user#another.email.com:objectid=some_string_here:identification=some_other_string_here
again, the number of fields or length of the content fields are not fixed by any means. I am sure this is pretty easy task to accomplish with python but how, I don't know. I don't even know where to start from.
Final Edit: Okay, I am able to write a shell script (bash, ksh etc.) to parse the information, but, when I asked this question originally, I was under the impression that, python had a simpler way of handling uniform or semi-uniform data structures as this one. My understanding was proven to be not very accurate. Sorry for wasting your time.
As jaypb points out, regular expressions are a good idea here. If you're interested in some python 101, I'll give you some simple code to get you started on your own solution.
The following code is a quick and dirty way to lump every six lines of a file into one line of a new file:
# open some files to read and write
oldfile = open("oldfilename","r")
newfile = open("newfilename","w")
# initiate variables and iterate over the input file
count = 0
outputLine = ""
for line in oldfile:
# we're going to append lines in the file to the variable outputLine
# file.readline() will return one line of a file as a string
# str.strip() will remove whitespace at the beginning and end of a string
outputLine = outputLine + oldfile.readline().strip()
# you know your interesting stuff is six lines long, so
# reset the output string and write it to file every six lines
if count%6 == 0:
newfile.write(outputLine + "\n")
outputLine = ""
# increment the counter
count = count + 1
# clean up
oldfile.close()
newfile.close()
This isn't exactly what you want to do but it gets you close. For instance, if you want to get rid of " - " from the beginning of the email addresses and replace it with "=", instead of just appending to outputLine you'd do something like
if some condition:
outputLine = outputLine + '=' + oldfile.readline()[3:]
that last bit is a python slice, [3:] means "give me everything after the third element," and it works for things like strings or lists.
That'll get you started. Use google and the python docs (for instance, googling "python strip" takes you to the built-in types page for python 2.7.10) to understand every line above, then change things around to get what you need.
Since you are replacing text substrings with different text substrings, this is a pretty natural place to use regular expressions.
Python, fortunately, has an excellent regular expressions library called re.
You will probably want to heavily utilize
re.sub(pattern, repl, string)
Look at the documentation here:
https://docs.python.org/3/library/re.html
Update: Here's an example of how to use the regular expression library:
#!/usr/bin/env python
import re
body = None
with open("sample.txt") as f:
body = f.read()
# Replace emails followed by other emails
body = re.sub(" * - ([a-zA-Z.#]*)\n * -", r"\1,", body)
# Replace declarations of object properties
body = re.sub(" +([a-zA-Z_]*): *[\n]*", r"\1=", body)
# Strip newlines
body = re.sub(":?\n", ":", body)
print (body)
Example output:
$ python example.py
object_name_here:object_owner=me#my.email.com, user#another.email.com:object_id=some_string_here:identification=some_other_string_here
I have been wrestling with decoding and encoding in Python, and I can't quite figure out how to resolve my problem. I am looping over xml text files (sample) that are apparently coded in utf-8, using Beautiful Soup to parse each file, then looking to see if any sentence in the file contains one or more words from two different list of words. Because the xml files are from the eighteenth century, I need to retain the em dashes that are in the xml. The code below does this just fine, but it also retains a pesky box character that I wish to remove. I believe the box character is this character.
(You can find an example of the character I wish to remove in line 3682 of the sample file above. On this webpage, the character looks like an 'or' pipe, but when I read the xml file in Komodo, it looks like a box. When I try to copy and paste the box into a search engine, it looks like an 'or' pipe. When I print to console, though, the character looks like an empty box.)
To sum up, the code below runs without errors, but it prints the empty box character that I would like to remove.
for work in glob.glob(pathtofiles):
openfile = open(work)
readfile = openfile.read()
stringfile = str(readfile)
decodefile = stringfile.decode('utf-8', 'strict') #is this the dodgy line?
soup = BeautifulSoup(decodefile)
textwithtags = soup.findAll('text')
textwithtagsasstring = str(textwithtags)
#this method strips everything between anglebrackets as it should
textwithouttags = stripTags(textwithtagsasstring)
#clean text
nonewlines = textwithouttags.replace("\n", " ")
noextrawhitespace = re.sub(' +',' ', nonewlines)
print noextrawhitespace #the boxes appear
I tried to remove the boxes by using
noboxes = noextrawhitespace.replace(u"\u2610", "")
But Python threw an error flag:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 280: ordinal not in range(128)
Does anyone know how I can remove the boxes from the xml files? I would be grateful for any help others can offer.
The problem is that you're mixing unicode and str. Whenever you do that, Python has to convert one to the other, which is does by using sys.getdefaultencoding(), which is usually ASCII, which is almost never what you want.*
If the exception comes from this line:
noboxes = noextrawhitespace.replace(u"\u2610", "")
… the fix is simple… except that you have to know whether noextrawhitespace is supposed to be a unicode object or a UTF-8-encoding str object). If the former, it's this:
noboxes = noextrawhitespace.replace(u"\u2610", u"")
If the latter, it's this:
noboxes = noextrawhitespace.replace(u"\u2610".encode('utf-8'), "")
But really, you have to get all of the strings consistent in your code; mixing the two up is going to cause problems in more places than this one.
Since I don't have your XML files to test, I wrote my own:
<xml>
<text>abc☐def</text>
</xml>
Then, I added these two lines to the bottom of your code (and a bit to the top to just open my file instead of globbing for whatever):
noboxes = noextrawhitespace.replace(u"\u2610".encode('utf-8'), "")
print noboxes
The output is now:
[<text>abc☐def</text>]
[<text>abc☐def</text>]
[<text>abcdef</text>]
So, I think that's what you want here.
* Sure sometimes you want ASCII… but those aren't usually the times when you have unicode objects…
Give this a try:
noextrawhitespace.replace("\\u2610", "")
I think you are just missing that extra '\'
This might also work.
print(noextrawhitespace.decode('unicode_escape').encode('ascii','ignore'))
Reading your sample, the following are the non-ASCII characters in the document:
0x2223 DIVIDES
0x2022 BULLET
0x3009 RIGHT ANGLE BRACKET
0x25aa BLACK SMALL SQUARE
0x25ca LOZENGE
0x3008 LEFT ANGLE BRACKET
0x2014 EM DASH
0x2026 HORIZONTAL ELLIPSIS
\u2223 is the actual character in question in line 3682, and it is being used as a soft hyphen. The others are used in markup for tagging illegible characters, such as:
<GAP DESC="illegible" RESP="oxf" EXTENT="4+ letters" DISP="\u2022\u2022\u2022\u2022\u2026"/>
Here's some code to do what your code is attempting. Make sure to process in Unicode:
from bs4 import BeautifulSoup
import re
with open('k000039.000.xml') as f:
soup = BeautifulSoup(f) # BS figures out the encoding
text = u''.join(soup.strings) # strings is a generator for just the text bits.
text = re.sub(ur'\s+',ur' ',text) # Simplify all white space.
text = text.replace(u'\u2223',u'') # Get rid of the DIVIDES character.
print text
Output:
[[truncated]] reckon my self a Bridegroom too. Buckle. I doubt Kickey won't find him such. [Aside.] Mrs. Sago. Well,—poor Keckky's bound to good Behaviour, or she had lost quite her Puddy's Favour. Shall I for this repine at Fortune?—No. I'm glad at Heart that I'm forgiven so. Some Neighbours Wives have but too lately shown, When Spouse had left 'em all their Friends were flown. Then all you Wives that wou'd avoid my Fate. Remain contented with your present State FINIS.
I would like to be able to search a dictionary in Python using user input wildcards.
I have found this:
import fnmatch
lst = ['this','is','just','a','test', 'thing']
filtered = fnmatch.filter(lst, 'th*')
This matches this and thing. Now if I try to input a whole file and search through
with open('testfilefolder/wssnt10.txt') as f:
file_contents = f.read().lower()
filtered = fnmatch.filter(file_contents, 'th*')
this doesn't match anything. The difference is that in the file that I am reading from I is a text file (Shakespeare play) so I have spaces and it is not a list. I can match things such as a single letter, so if I just have 't' then I get a bunch of t's. So this tells me that I am matching single letters - I however am wanting to match whole words - but even more, to preserve the wildcard structure.
Since what I would like to happen is that a user enters in text (including what will be a wildcard) that I can substitute it in to the place that 'th*' is. The wild card would do what it should still. That leads to the question, can I just stick in a variable holding the search text in for 'th*'? After some investigation I am wondering if I am somehow supposed to translate the 'th*' for example and have found something such as:
regex = fnmatch.translate('th*')
print(regex)
which outputs th.*\Z(?ms)
Is this the right way to go about doing this? I don't know if it is needed.
What would be the best way in going about "passing in regex formulas" as well as perhaps an idea of what I have wrong in the code as it is not operating on the string of incoming text in the second set of code as it does (correctly) in the first.
If the problem is just that you "have spaces and it is not a list," why not make it into a list?
with open('testfilefolder/wssnt10.txt') as f:
file_contents = f.read().lower().split(' ') # split line on spaces to make a list
filtered = fnmatch.filter(file_contents, 'th*')
I have a bunch of PDF files that I have to search for a set of keywords against. I have to extract the exact line where the keyword was found. I first used xpdf's pdf2text to convert the file to PDF. (Tried solr but had a tough time tailoring the output/schema to suit my requirement).
import sys
file_name = sys.argv[1]
searched_string = sys.argv[2]
result = [(line_number+1, line) for line_number, line in enumerate(open(file_name)) if searched_string.lower() in line.lower()]
#print result
for each in result:
print each[0], each[1]
ThinkCode:~$ python find_string.py sample.txt "String Extraction"
The problem I have with this is that for cases where search string is broken towards the end of the line :
If you are going to index large binary files, remember to change the
size limits. String
Extraction is a common problem
If I am searching for 'String Extraction', I will miss this keyword if I use the code presented above. What is the most efficient way of achieving this without making 2 copies of text file (one for searching the keyword to extract the line (number) and the other for removing line breaks and finding the keyword to eliminate the case where the keyword spans across 2 lines).
Much appreciated guys!
Note: Some considerations without any code, but I think they belong to an answer rather than to a comment.
My idea would be to search only for the first keyword; if a match is found, search for the second. This allows you to, if the match is found at the end of the line, take into consideration the next line and do line concatenation only if a match is found in first place*.
Edit:
Coded a simple example and ended up using a different algorithm; the basic idea behind it is this code snippet:
def iterwords(fh):
for number, line in enumerate(fh):
for word in re.split(r'\s+', line.strip()):
yield number, word
It iterates over the file handler and produces a (line_number, word) tuple for each word in the file.
The matching afterwards becomes pretty easy; you can find my implementation as a gist on github. It can be run as follows:
python search.py 'multi word search string' file.txt
There is one main concern with the linked code, I didn't code a workaround both for performance and complexity reasons. Can you figure it out? (Spoiler: try to search for a sentence whose first word appears two times in a row in the file)
* I didn't perform any testing on my own, but this article and the python wiki suggest that string concatenation is not that efficient in python (don't know how actual the information is).
There may be a better way of doing it, but my suggestion would be to start by taking in two lines (let's call them line1 and line2), concatenating them into line3 or something similar, and then search that resultant line.
Then you'd assign line2 to line1, get a new line2, and repeat the process.
Use the flag re.MULTILINE when compiling your expressions: http://docs.python.org/library/re.html#re.MULTILINE
Then use \s to represent all white space (including new lines).
Write a program that outputs the first number within a file specified by the user. It should behave like:
Enter a file name: l11-1.txt
The first number is 20.
You will need to use the file object method .read(1) to read 1 character at a time, and a string object method to check if it is a number. If there is no number, the expected behaviour is:
Enter a file name: l11-2.txt
There is no number in l11-2.txt.
Why is reading 1 character at a time a better algorithm than calling .read() once and then processing the resulting string using a loop?
I have the files and it does correspond to the answers above but im not sure how to make it output properly.
The code i have so far is below:
filenm = raw_input("Enter a file name: ")
datain=file(filenm,"r")
try:
c=datain.read(1)
result = []
while int(c) >= 0:
result.append(c)
c = datain.read(1)
except:
pass
if len(result) > 0:
print "The first number is",(" ".join(result))+" . "
else:
print "There is no number in" , filenm + "."
so far this opens the file and reads it but the output is always no number even if there is one. Can anyone help me ?
OK, you've been given some instructions:
read a string input from the user
open the file given by that string
.read(1) a character at a time until you get the first number or EOF
print the number
You've got the first and second parts here (although you should use open instead of file to open a file), what next? The first thing to do is to work out your algorithm: what do you want the computer to do?
Your last line starts looping over the lines in the file, which sounds like not what your teacher wants -- they want you to read a single character. File objects have a .read() method that lets you specify how many bytes to read, so:
c = datain.read(1)
will read a single character into a string. You can then call .isdigit() on that to determine if it's a digit or not:
c.isdigit()
It sounds like you're supposed to keep reading a digit until you run out, and then concatenate them all together; if the first thing you read isn't a digit (c.isdigit() is False) you should just error out
Your datain variable is a file object. Use its .read(1) method to read 1 character at a time. Take a look at the string methods and find one that will tell you if a string is a number.
Why is reading 1 character at a time a better algorithm than calling .read() once and then processing the resulting string using a loop?
Define "better".
In this case, it's "better" because it makes you think.
In some cases, it's "better" because it can save reading an entire line when reading the first few bytes is enough.
In some cases, it's "better" because the entire line may not be sitting around in the input buffer.
You could use regex like (searching for an integer or a float):
import re
with open(filename, 'r') as fd:
match = re.match('([-]?\d+(\.\d+|))', fd.read())
if match:
print 'My first number is', match.groups()[0]
This with with anything like: "Hello 111." => will output 111.