How to display the first few characters of a string in Python? - python

I just started learning Python but I'm sort of stuck right now.
I have a hash.txt file containing thousands of malware hashes. Each line holds an MD5, a Sha1 and a Sha5 hash, in that order, separated by a | delimiter. Below are two example lines I extracted from the file.
416d76b8811b0ddae2fdad8f4721ddbe|d4f656ee006e248f2f3a8a93a8aec5868788b927|12a5f648928f8e0b5376d2cc07de8e4cbf9f7ccbadb97d898373f85f0a75c47f
56a99a4205a4d6cab2dcae414a5670fd|612aeeeaa8aa432a7b96202847169ecae56b07ee|d17de7ca4c8f24ff49314f0f342dbe9243b10e9f3558c6193e2fd6bccb1be6d2
My intention is to display the first 32 characters (MD5 hash) so the output will look something like this:
416d76b8811b0ddae2fdad8f4721ddbe 56a99a4205a4d6cab2dcae414a5670fd
Any ideas?

You can 'slice' a string very easily, just like you'd pull items from a list:
a_string = 'This is a string'
To get the first 4 letters:
first_four_letters = a_string[:4]
>>> 'This'
Or the last 6:
last_six_letters = a_string[-6:]
>>> 'string'
So applying that logic to your problem:
the_string = '416d76b8811b0ddae2fdad8f4721ddbe|d4f656ee006e248f2f3a8a93a8aec5868788b927|12a5f648928f8e0b5376d2cc07de8e4cbf9f7ccbadb97d898373f85f0a75c47f '
first_32_chars = the_string[:32]
>>> 416d76b8811b0ddae2fdad8f4721ddbe
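Applied to the whole file, the same slice can be taken from every line. This is a minimal sketch, assuming each line starts with the 32-character MD5 hash as in the question. The helper name first_chars is made up for illustration, and the sample lines stand in for the contents of hash.txt:

```python
def first_chars(lines, n=32):
    """Return the first n characters of every non-empty line."""
    return [line[:n] for line in lines if line.strip()]


# Two sample lines copied from the question (stand-ins for hash.txt).
sample_lines = [
    "416d76b8811b0ddae2fdad8f4721ddbe|d4f656ee006e248f2f3a8a93a8aec5868788b927|12a5f648928f8e0b5376d2cc07de8e4cbf9f7ccbadb97d898373f85f0a75c47f\n",
    "56a99a4205a4d6cab2dcae414a5670fd|612aeeeaa8aa432a7b96202847169ecae56b07ee|d17de7ca4c8f24ff49314f0f342dbe9243b10e9f3558c6193e2fd6bccb1be6d2\n",
]

md5_hashes = first_chars(sample_lines)
print(" ".join(md5_hashes))

# With the real file, iterate over the file object instead:
# with open("hash.txt") as f:
#     print(" ".join(first_chars(f)))
```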

Since there is a delimiter, you should use that instead of worrying about how long the md5 is.
>>> s = "416d76b8811b0ddae2fdad8f4721ddbe|d4f656ee006e248f2f3a8a93a8aec5868788b927|12a5f648928f8e0b5376d2cc07de8e4cbf9f7ccbadb97d898373f85f0a75c47f"
>>> md5sum, delim, rest = s.partition('|')
>>> md5sum
'416d76b8811b0ddae2fdad8f4721ddbe'
Alternatively
>>> md5sum, sha1sum, sha5sum = s.split('|')
>>> md5sum
'416d76b8811b0ddae2fdad8f4721ddbe'
>>> sha1sum
'd4f656ee006e248f2f3a8a93a8aec5868788b927'
>>> sha5sum
'12a5f648928f8e0b5376d2cc07de8e4cbf9f7ccbadb97d898373f85f0a75c47f'
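The delimiter approach extends naturally to every line of the file. The following is only a sketch: the HashRecord name and its field names are invented for illustration (the third field keeps the question's "sha5" label), and it assumes every line has exactly three |-separated fields.

```python
from collections import namedtuple

# Hypothetical record type; the field names follow the question's wording.
HashRecord = namedtuple("HashRecord", ["md5", "sha1", "sha5"])


def parse_line(line):
    """Split one 'md5|sha1|sha5' line into a HashRecord."""
    fields = line.strip().split("|")
    if len(fields) != 3:
        raise ValueError("expected 3 fields, got %d" % len(fields))
    return HashRecord(*fields)


record = parse_line(
    "416d76b8811b0ddae2fdad8f4721ddbe|d4f656ee006e248f2f3a8a93a8aec5868788b927|12a5f648928f8e0b5376d2cc07de8e4cbf9f7ccbadb97d898373f85f0a75c47f\n"
)
print(record.md5)
```

Raising on a malformed line is a design choice for the sketch; silently skipping such lines would be an equally reasonable alternative.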

If you want the first 2 letters and the last 2 letters of a string, you can use the following code:
name = "India"
name[0:2]  # 'In'
name[-2:]  # 'ia'

Related

String separator

I have a program that takes text input and stores a list of words that were input, but without duplicates. I need to store these in a file so I convert it into a string and join it with a comma between each word.
However, if one of the words contains a comma, splitting the string back into words breaks. I therefore need a separator string for joining the items of the list that is not part of any of the items.
For example if an item was "dog" the string og couldn't be used so the program would know this and add another letter on to make it a unique set of letters.
I then concatenate these strings to recreate the inputted text but it only works if the string I'm splitting them with is not part of the words.
I use ## now as it is unlikely that will be in the inputted text but I would like it to be perfect.
Consider using an existing serialization library that will convert your objects to string for you, without you having to invent an algorithm yourself. For instance, json:
>>> import json
>>> my_strings = ["foo", 'ba"r', "ba,z", "qu'x", "zo##rt"]
>>> s = json.dumps(my_strings)
>>> s
'["foo", "ba\\"r", "ba,z", "qu\'x", "zo##rt"]'
>>> type(s)
<class 'str'>
>>> result = json.loads(s)
>>> result
['foo', 'ba"r', 'ba,z', "qu'x", 'zo##rt']
>>> type(result)
<class 'list'>
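Since the goal in the question is to store the words in a file, the same idea also works with json.dump and json.load, which write to and read from a file object directly. This is a small sketch; the temporary directory and the file name words.json are only there for the demonstration.

```python
import json
import os
import tempfile

words = ["foo", 'ba"r', "ba,z", "qu'x", "zo##rt"]

# Demonstration path only; any writable file path would do.
path = os.path.join(tempfile.mkdtemp(), "words.json")

# Write the list to the file as JSON.
with open(path, "w", encoding="utf-8") as f:
    json.dump(words, f)

# Read it back; the result is a list of strings again.
with open(path, encoding="utf-8") as f:
    restored = json.load(f)

print(restored == words)
```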

Unicode object to a list

I have a UTF-8 text corpus that I can read easily in Python 2.7:
sentence = codecs.open("D:\\Documents\\files\\sentence.txt", "r", encoding="utf8")
sentence = sentence.read()
> This is my sentence in the right format
However, when I pass this text corpus to a list (for example, for tokenizing):
tokens = sentence.tokenize()
and print it in the notebook, I get strange escaped characters, like:
(u'\ufeff\ufeffFaux,', u'Tunisie')
(u'Tunisie', u"l'\xc9gypte,")
Whereas I would like normal characters, just like in my original import.
So my question is: how can I pass unicode objects to a list without getting these strange escaped characters?
It's all in how you print. Python 2 displays lists using ASCII-only characters and substituting backslash escape codes for non-ASCII characters. This is to make it easy to see hidden characters that normal printing would make invisible, like the double byte-order-mark (BOM) \ufeff you see in your strings. Printing individual string items will display them correctly.
Some examples:
Original strings:
>>> s = (u'\ufeff\ufeffFaux,', u'Tunisie')
>>> t = (u'Tunisie', u"l'\xc9gypte,")
Displaying at the interactive prompt:
>>> s
(u'\ufeff\ufeffFaux,', u'Tunisie')
>>> t
(u'Tunisie', u"l'\xc9gypte,")
>>> print s
(u'\ufeff\ufeffFaux,', u'Tunisie')
>>> print t
(u'Tunisie', u"l'\xc9gypte,")
Printing individual strings from the tuples:
>>> print s[0]
Faux,
>>> print s[1]
Tunisie
>>> print t[0]
Tunisie
>>> print t[1]
l'Égypte,
>>> print ' '.join(s)
Faux, Tunisie
>>> print ' '.join(t)
Tunisie l'Égypte,
A way to print tuples without escape codes:
>>> print "('"+"', '".join(s)+"')"
('Faux,', 'Tunisie')
>>> print "('"+"', '".join(t)+"')"
('Tunisie', 'l'Égypte,')
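If the byte-order marks themselves are unwanted, they can be removed when the text is decoded. The following is a sketch in Python 3 syntax (the examples above are Python 2), with made-up sample bytes that mimic the doubled BOM from the question. The utf-8-sig codec drops one leading BOM, and lstrip removes any that remain.

```python
# Sample bytes: two UTF-8 BOMs followed by the text. This is an assumption
# that mirrors the doubled \ufeff seen in the question's output.
raw = b"\xef\xbb\xbf\xef\xbb\xbfFaux, Tunisie"

text = raw.decode("utf-8-sig")  # removes the first BOM only
text = text.lstrip("\ufeff")    # removes any BOMs still at the start

tokens = text.split()
print(tokens)

# ascii() shows the escaped form, similar to what Python 2 displays for
# unicode strings inside a list or tuple.
print(ascii("l'\xc9gypte,"))
```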
Hm, codecs.open(...) returns a "wrapped version of the underlying file object", and you then overwrite that variable with the result of calling the read method on that object. Brave and a little confusing, but OK ;-)
When you type, say, äöüß into your "notebook", does it show up as typed, or do you see \uxxxx escapes instead?
The default for codecs.open(...) is errors='strict', so a decoding problem would have raised an exception; if this is the same environment for all samples, reading should work.
As I understand it, when you write "print it" you are printing the list, which is different from printing the contents of the list.
Sample (a tab typed as \t in a normal "byte" string; this is Python 2.7.11):
>>> a="\t"
>>> print a # below is an expanded tab
>>> a
'\t'
>>> [a]
['\t']
>>> print [a]
['\t']
>>> for element in [a]:
...     print element
...
>>> # above is an expanded tab

python parsing a string

I have a list with strings.
list_of_strings
They look like that:
'/folder1/folder2/folder3/folder4/folder5/exp-*/exp-*/otherfolder/file'
I want to cut each string down to:
/folder1/folder2/folder3/folder4/folder5/exp-* and put the result into a new list.
I thought of doing something like this, but I am missing the right snippet to get what I want:
list_of_stringparts = []
for string in sorted(list_of_strings):
    part = string.split('/')[7]  # or whatever returns the first part of my string
    list_of_stringparts.append(part)
Does anyone have an idea? Do I need a regex?
You are using subscripting (indexing), which extracts a single element (the eighth). To get the first seven elements, you need a slice [N:M:S], like this:
>>> l = '/folder1/folder2/folder3/folder4/folder5/exp-*/exp-*/otherfolder/file'
>>> l.split('/')[:7]
['', 'folder1', 'folder2', 'folder3', 'folder4', 'folder5', 'exp-*']
In our case N is omitted (it defaults to 0) and S, the step, defaults to 1, so you get the elements at indices 0 through 6 of the split result.
To build the string back, use join():
>>> '/'.join(l.split('/')[:7])
'/folder1/folder2/folder3/folder4/folder5/exp-*'
I would do it like this:
>>> s = '/folder1/folder2/folder3/folder4/folder5/exp-*/exp-*/otherfolder/file'
>>> s.split('/')[:7]
['', 'folder1', 'folder2', 'folder3', 'folder4', 'folder5', 'exp-*']
>>> '/'.join(s.split('/')[:7])
'/folder1/folder2/folder3/folder4/folder5/exp-*'
Using re.match:
>>> import re
>>> s = '/folder1/folder2/folder3/folder4/folder5/exp-*/exp-*/otherfolder/file'
>>> re.match(r'.*?\*', s).group()
'/folder1/folder2/folder3/folder4/folder5/exp-*'
Your example suggests that you want to partition the strings at the first * character. This can be done with str.partition():
list_of_stringparts = []
list_of_strings = ['/folder1/folder2/folder3/folder4/folder5/exp-*/exp-*/otherfolder/file', '/folder1/exp-*/folder2/folder3/folder4/folder5/exp-*/exp-*/otherfolder/file', '/folder/blah/pow']
for s in sorted(list_of_strings):
    head, sep, tail = s.partition('*')
    list_of_stringparts.append(head + sep)
>>> list_of_stringparts
['/folder/blah/pow', '/folder1/exp-*', '/folder1/folder2/folder3/folder4/folder5/exp-*']
Or this equivalent list comprehension:
list_of_stringparts = [''.join(s.partition('*')[:2]) for s in sorted(list_of_strings)]
This will retain any string that does not contain a * - not sure from your question if that is desired.
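The two approaches above (a fixed number of path components, or everything up to the first *) can be wrapped in small helpers. This is a sketch; both function names are invented for illustration.

```python
def first_components(path, count):
    """Keep the first `count` '/'-separated pieces of `path`.

    For an absolute path the first piece is the empty string before the
    leading slash, so count=7 keeps six folder names.
    """
    return "/".join(path.split("/")[:count])


def through_first_star(path):
    """Keep everything up to and including the first '*', if there is one."""
    head, sep, _tail = path.partition("*")
    return head + sep


p = "/folder1/folder2/folder3/folder4/folder5/exp-*/exp-*/otherfolder/file"
print(first_components(p, 7))
print(through_first_star(p))
```

The first helper depends on the folder depth being the same for every string, while the second depends on the * marker being present, so which one fits depends on the data.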

Particular string search in python

if 'bl1' in open('/tmp/ch.py').read():
    print 'OK'
I have to search for the particular string "bl1".
Is there any way to do it?
I tried using ^bl1$, but it didn't work.
If you simply want to confirm that "bl1" appears somewhere in the text of the file, I think your statement is OK. I have tested it, and the if statement evaluated to True (given that ch.py contains words with 'bl1' inside):
if 'bl1' in open('/tmp/ch.py').read():
    print 'OK'
>>> if 'bl1' in open('/tmp/ch.py').read():
...     print("ok")
...
ok
>>> if 'bl1' in "dfdsaflj hjjfadsfbl1dafdsfd bl1llll bdasbl1aa":
...     print("ok")
...
ok
From your mention of '^bl1$', I assume you are trying to use a regular expression, but that pattern does not mean what you intended.
If you want to extract the words that contain the three consecutive characters 'bl1', you can use findall from the re module.
>>> import re
>>> match = re.findall(r'\w*bl1\w*', open('/tmp/ch.py').read())  # finds all occurrences
>>> match
['asfdbl1', 'bl123']  # all words containing 'bl1', returned as a list
In the form of your original code, it would look like this:
>>> if re.findall(r'\w*bl1\w*', open('/tmp/ch.py').read()):
...     print('OK')
...
OK
>>>
In a regular expression, ^ anchors the match at the start of the string and $ anchors it at the end, so '^bl1$' only matches when the whole string is exactly 'bl1' (optionally followed by a single trailing newline).
>>> match = re.findall('^bl1$', open('/tmp/ch.py').read())
>>> match
[]
>>> match = re.findall('^bl1$', 'bl1')  # exactly "bl1"
>>> match
['bl1']
>>> match = re.findall('^bl1$', 'bl12')  # not exactly "bl1"
>>> match
[]
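If the goal is to find bl1 only as a standalone word, or as a complete line of the file, two small variations on the pattern may help. This sketch assumes that is what the question meant; the sample text is made up.

```python
import re

# Made-up sample text standing in for the contents of the file.
text = "asfdbl1 bl123\nbl1\nx bl1 y\n"

# \b matches at word boundaries, so 'bl1' inside a longer word is skipped.
whole_words = re.findall(r"\bbl1\b", text)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
whole_lines = re.findall(r"^bl1$", text, flags=re.MULTILINE)

print(len(whole_words))
print(whole_lines)
```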
If you are interested in regular expression, I hope you can find what you like in the documentation of Python standard libraries - re : https://docs.python.org/3.4/library/re.html

Add decoded(from Hindi) params to a given url in python

I have this url = 'http://www.bhaskar.com/uttar_pradesh/lucknow/='. After the "=" sign comes one Hindi word, which is the word being searched for. I want to be able to add that word as a parameter to this URL, so that I only need to change the word each time and not the whole URL. I tried to use this:
>>> url = 'http://www.bhaskar.com/uttar_pradesh/lucknow/='
>>> word = 'word1'
>>> conj = url + word
But this gives me the Hindi word in unicode, like this:
>>> conj
'http://www.bhaskar.com/uttar_pradesh/lucknow/=\xe0\xa6\xb8\xe0\xa6\xb0'
Can anyone help?
"but this gives me the Bengali word in unicode"
No, it does not :)
When you type temp in the terminal, the interpreter displays an unambiguous representation of the string, which uses escape codes for the non-ASCII bytes. When you type print(temp), however, you get a more user-friendly rendering of the same string. Either way, the string that temp refers to is the same all the time; it is only presented in different ways. See, for example, what happens if you take the escaped form, put it in a variable and print it:
>>> temp2 = 'http://www.cfilt.iitb.ac.in/indowordnet/first?langno=3&queryword=\xe0\xa6\xb8\xe0\xa6\xb0'
>>> print(temp2)
http://www.cfilt.iitb.ac.in/indowordnet/first?langno=3&queryword=সর
Actually, you can create the string by using escaped values in all characters, not only the Bengali one:
>>> temp3 = '\x68\x74\x74\x70\x3a\x2f\x2f\x77\x77\x77\x2e\x63\x66\x69\x6c\x74\x2e\x69\x69\x74\x62\x2e\x61\x63\x2e\x69\x6e\x2f\x69\x6e\x64\x6f\x77\x6f\x72\x64\x6e\x65\x74\x2f\x66\x69\x72\x73\x74\x3f\x6c\x61\x6e\x67\x6e\x6f\x3d\x33\x26\x71\x75\x65\x72\x79\x77\x6f\x72\x64\x3d\xe0\xa6\xb8\xe0\xa6\xb0'
>>> print(temp3)
http://www.cfilt.iitb.ac.in/indowordnet/first?langno=3&queryword=সর
In the end, all these strings are the same:
>>> temp == temp2
True
>>> temp == temp3
True
So don't worry, you have the correct string in the variable. You would only have a problem if the escaped form were displayed somewhere else. Finish your program, run it to the end, and you should see no errors.
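A related point, offered as a sketch rather than as part of the answer above: if the URL is later handed to an HTTP client, non-ASCII characters are normally percent-encoded. In Python 3 (the snippets above behave like Python 2 byte strings), urllib.parse.quote does this. The base URL is the one from the question, and the word is the two-character string shown in the printed output above.

```python
from urllib.parse import quote, unquote

base_url = "http://www.bhaskar.com/uttar_pradesh/lucknow/="
word = "\u09b8\u09b0"  # same text as the UTF-8 bytes \xe0\xa6\xb8\xe0\xa6\xb0

# The escaped bytes and the readable text are the same data.
assert word.encode("utf-8") == b"\xe0\xa6\xb8\xe0\xa6\xb0"

# Percent-encode only the word, then append it to the base URL.
encoded_url = base_url + quote(word)
print(encoded_url)

# unquote reverses the encoding.
print(unquote(encoded_url) == base_url + word)
```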
