String separator - Python

I have a program that takes text input and stores a list of the words that were entered, without duplicates. I need to store these in a file, so I convert the list into a string by joining the words with a comma between each one.
If a word contains a comma, splitting the string again breaks. I therefore need a separator string that is not part of any of the items.
For example, if an item was "dog", the string "og" couldn't be used, so the program would detect this and add another letter to make a unique sequence of letters.
I later concatenate these strings to recreate the input text, but that only works if the separator I split on is not part of the words.
I currently use ## because it is unlikely to appear in the input text, but I would like it to be perfect.

Consider using an existing serialization library that will convert your objects to string for you, without you having to invent an algorithm yourself. For instance, json:
>>> import json
>>> my_strings = ["foo", 'ba"r', "ba,z", "qu'x", "zo##rt"]
>>> s = json.dumps(my_strings)
>>> s
'["foo", "ba\\"r", "ba,z", "qu\'x", "zo##rt"]'
>>> type(s)
<class 'str'>
>>> result = json.loads(s)
>>> result
['foo', 'ba"r', 'ba,z', "qu'x", 'zo##rt']
>>> type(result)
<class 'list'>
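Since the goal is to store the words in a file, the same round trip can go through json.dump and json.load. This is a minimal sketch, assuming a made-up file name (words.json) inside a temporary directory:

```python
import json
import os
import tempfile

words = ["foo", 'ba"r', "ba,z", "qu'x", "zo##rt"]

# Hypothetical path; a temporary directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "words.json")

# json.dump escapes whatever needs escaping, so no separator has to be chosen.
with open(path, "w", encoding="utf-8") as f:
    json.dump(words, f)

# json.load rebuilds the original list from the file contents.
with open(path, "r", encoding="utf-8") as f:
    restored = json.load(f)

print(restored == words)
```

Because the format handles quoting itself, words containing commas, quotes or ## need no special treatment.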


create a list by literally using a string

For some reason I have a text file describing lists of regular expressions:
RegexRemove = [ 'OC1.*','OC2.*','-UC.*','EG[0-9]{4,6}.*','_t[0-9]{0,2}\.[0-9]{0,2}$' ]
RegexReplace = [ ['LA.*','LA'],['IF.*', 'IF'],['BH.*', 'BH'],['DP.*', 'DP'] ]
I would like to read in the lines as strings and convert them to the lists described in the text file.
The lines look like source code defining the lists, but they are part of a bigger text file, which cannot be read and interpreted as Python.
I tried to convert them by replacing and splitting the string, but I always run into trouble, since commas are used as delimiters for split and are part of the regular expressions, too.
Can I read in only the lines containing "Regex" and convert them to the lists described there by using some fancy functions?
Extract the wanted lines (which you have obviously already done), split them on the '=' character, and then pass the second part to ast.literal_eval():
>>> import ast
>>> s = "[ 'OC1.*','OC2.*','-UC.*','EG[0-9]{4,6}.*','_t[0-9]{0,2}\.[0-9]{0,2}$' ]"
>>> ast.literal_eval(s)
['OC1.*', 'OC2.*', '-UC.*', 'EG[0-9]{4,6}.*', '_t[0-9]{0,2}\\.[0-9]{0,2}$']
>>>
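Put together, those steps might look like the sketch below. The function name parse_regex_lines and the sample text are made up for illustration, and it assumes every wanted line has the form Name = [ ... ] and contains "Regex":

```python
import ast

def parse_regex_lines(text):
    """Return {name: list} for every line that looks like 'Regex... = [...]'."""
    result = {}
    for line in text.splitlines():
        if "Regex" not in line or "=" not in line:
            continue
        # Split on the first '=' only, in case the list itself contains '='.
        name, _, value = line.partition("=")
        result[name.strip()] = ast.literal_eval(value.strip())
    return result

sample = "\n".join([
    "some unrelated line",
    "RegexRemove = [ 'OC1.*','OC2.*','-UC.*' ]",
    "RegexReplace = [ ['LA.*','LA'],['IF.*', 'IF'] ]",
])

parsed = parse_regex_lines(sample)
print(parsed["RegexRemove"])
print(parsed["RegexReplace"])
```

ast.literal_eval only accepts literals (strings, numbers, lists, tuples, dicts and so on), so a malformed or malicious line raises an exception instead of being executed.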
You can also use eval to parse a string as a Python object. Note that eval executes arbitrary code, so use it only on input you trust:
items = eval('[1,2,4]')
print(type(items),len(items), items) # output: <class 'list'> 3 [1, 2, 4]

Remove Characters from string with replace not working

I have a number of strings from which I am aiming to remove characters using replace. However, this doesn't seem to work. To give a simplified example, this code:
row = "b'James Bray,/citations?user=8IqSrdIAAAAJ&hl=en&oe=ASCII,1985,6020,188.12,42,1.31,76,2.38'"
row = row.replace("b'", "").replace("'", "").replace('b"', '').replace('"', '')
print(row.encode('ascii', errors='ignore'))
still outputs b'James Bray,/citations?user=8IqSrdIAAAAJ&hl=en&oe=ASCII,1985,6020,188.12,42,1.31,76,2.38' whereas I would like it to output James Bray,/citations?user=8IqSrdIAAAAJ&hl=en&oe=ASCII,1985,6020,188.12,42,1.31,76,2.38. How can I do this?
Edit: Updated the code with a better example.
You seem to be mistaking single quotes for double quotes. Simply replace 'b:
>>> row = "xyz'b"
>>> row.replace("'b", "")
'xyz'
As an alternative to str.replace, you can simply slice the string to remove the unwanted leading and trailing characters:
>>> row[2:-1]
'James Bray,/citations?user=8IqSrdIAAAAJ&hl=en&oe=ASCII,1985,6020,188.12,42,1.31,76,2.38'
In your first .replace, change b' to 'b. Hence your code should be:
>>> row = "xyz'b"
>>> row = row.replace("'b", "").replace("'", "").replace('b"', '').replace('"', '')
# ^ changed here
>>> print(row.encode('ascii', errors='ignore'))
xyz
I am assuming the rest of the conditions you have are part of other tasks/matches that you didn't mention here.
If all you want is to take the string before the first ', then you may just do:
row.split("'")[0]
Your chain of replacements doesn't include one that removes 'b:
.replace("'b", '')
import ast
row = "b'James Bray,/citations?user=8IqSrdIAAAAJ&hl=en&oe=ASCII,1985,6020,188.12,42,1.31,76,2.38'"
b_string = ast.literal_eval(row)
print(b_string)
u_string = b_string.decode('utf-8')
print(u_string)
out:
b_string:b'James Bray,/citations?user=8IqSrdIAAAAJ&hl=en&oe=ASCII,1985,6020,188.12,42,1.31,76,2.38'
u_string: James Bray,/citations?user=8IqSrdIAAAAJ&hl=en&oe=ASCII,1985,6020,188.12,42,1.31,76,2.38
The real question is how to convert a string into a Python object.
You have a string that contains a bytes literal. To convert it into a Python bytes object you could use eval(), but ast.literal_eval() is a safer way to do it.
Once you have a bytes object, you can convert it to a unicode string, which does not start with "b", by using decode().
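In Python 3 the effect is easy to reproduce: printing a bytes object always shows the b'...' wrapper, while printing the decoded str does not. A short sketch with a shortened sample value (the variable names are illustrative):

```python
import ast

row = "b'James Bray,1985,6020'"  # a str that *contains* a bytes literal

as_bytes = ast.literal_eval(row)     # safely evaluate the literal -> bytes
as_text = as_bytes.decode("utf-8")   # bytes -> str

print(type(as_bytes).__name__, type(as_text).__name__)
print(as_text)

# Encoding a str again yields bytes, which print with the b'...' wrapper.
print(repr(as_text.encode("ascii", errors="ignore")))
```

This suggests that, on Python 3, the final .encode(...) call in the question is what puts the b'...' back into the printed output.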

Unicode object to a list

I have a utf8 - text corpus I can read easily in Python 2.7 :
sentence = codecs.open("D:\\Documents\\files\\sentence.txt", "r", encoding="utf8")
sentence = sentence.read()
> This is my sentence in the right format
However, when I pass this text corpus to a list (for example, for tokenizing) :
tokens = sentence.tokenize()
and print it in the notebook, I obtain bit-like characters, like:
(u'\ufeff\ufeffFaux,', u'Tunisie')
(u'Tunisie', u"l'\xc9gypte,")
Whereas I would like normal characters just like in my original import.
So my question is : how can I pass unicode objects to a list without having strange bit/ASCII characters ?
It's all in how you print. Python 2 displays lists using ASCII-only characters and substituting backslash escape codes for non-ASCII characters. This is to make it easy to see hidden characters that normal printing would make invisible, like the double byte-order-mark (BOM) \ufeff you see in your strings. Printing individual string items will display them correctly.
Many examples
Original strings:
>>> s = (u'\ufeff\ufeffFaux,', u'Tunisie')
>>> t = (u'Tunisie', u"l'\xc9gypte,")
Displaying at the interactive prompt:
>>> s
(u'\ufeff\ufeffFaux,', u'Tunisie')
>>> t
(u'Tunisie', u"l'\xc9gypte,")
>>> print s
(u'\ufeff\ufeffFaux,', u'Tunisie')
>>> print t
(u'Tunisie', u"l'\xc9gypte,")
Printing individual strings from the tuples:
>>> print s[0]
Faux,
>>> print s[1]
Tunisie
>>> print t[0]
Tunisie
>>> print t[1]
l'Égypte,
>>> print ' '.join(s)
Faux, Tunisie
>>> print ' '.join(t)
Tunisie l'Égypte,
A way to print tuples without escape codes:
>>> print "('"+"', '".join(s)+"')"
('Faux,', 'Tunisie')
>>> print "('"+"', '".join(t)+"')"
('Tunisie', 'l'Égypte,')
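The \ufeff characters themselves are stray byte-order marks and can be dropped before tokenizing. Here is a minimal sketch, written so that it also runs on Python 3. Stripping with lstrip is one option among several and goes slightly beyond the answer above:

```python
s = (u'\ufeff\ufeffFaux,', u'Tunisie')

# lstrip removes every leading occurrence of the given character.
cleaned = tuple(item.lstrip(u'\ufeff') for item in s)

print(cleaned[0])
print(len(s[0]), len(cleaned[0]))
```

The lengths differ by two because each BOM counts as one (invisible) character.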
Hm, codecs.open(...) returns a "wrapped version of the underlying file object" then you overwrite this variable with the result from executing the read method on that object. Brave, irritating - but ok ;-)
When you type say an äöüß into your "notebook", does it show like "this" or do you see some \uxxxxx instead?
The default value for codecs.open(...) is errors=strict so if this is the same environment for all samples, this should work.
I understand that when you write "print it" you print the list, which is different from printing the contents of the list.
Sample (taking a tab typed as \t into a normal "byte" string - this is python 2.7.11):
>>> a="\t"
>>> print a # below is an expanded tab
>>> a
'\t'
>>> [a]
['\t']
>>> print [a]
['\t']
>>> for element in [a]:
...     print element
...
>>> # above is an expanded tab

decoding a chinese stopwords file and appending to a list

I am trying to read a chinese stopwords file and append the characters to a list. This is my code:
word_list=[]
with open("stop-words_chinese_1_zh.txt", "r") as f:
    for row in f:
        decoded=row.decode("utf-8")
        print decoded
        word_list.append(decoded)
print word_list[:10]
This is my output. Decoded looks fine, but after I append decoded to a list, it reverts back to the undecoded characters.
着
诸
自
[u'\u7684\r\n', u'\u4e00\r\n', u'\u4e0d\r\n', u'\u5728\r\n', u'\u4eba\r\n', u'\u6709\r\n', u'\u662f\r\n', u'\u4e3a\r\n', u'\u4ee5\r\n', u'\u4e8e\r\n']
The list hasn't reverted to the undecoded characters. If you print the type of the element in the list:
>>> print type(word_list[0])
You'd get:
<type 'unicode'>
So there isn't anything wrong with your list. Now we turn our attention to the print function. When you call print on an object, it prints whatever that object's str function returns. In the case of a list, however, its str function iteratively calls repr on each element, which returns the Python representation string of said element instead.
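The difference between str and repr is visible with a plain newline, and the same mechanism applies on Python 3. A small sketch:

```python
item = "a\n"
items = [item]

# str() of a string returns the string itself; repr() returns an escaped literal.
print(str(item) == item)
print(repr(item))

# str() of a list is built from the repr() of each element.
print(str(items) == "[" + repr(item) + "]")
```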
The behavior that you want here is to have str invoked instead of repr on each element in the list. There is one caveat here: in Python 2, str will attempt to encode a unicode object using the 'ascii' codec, which fails for any element containing non-ASCII characters, as all of yours do. For the purpose of displaying on screen, you likely want whatever sys.stdout.encoding is, which is usually 'UTF-8'.
Thus, to print a unicode list on screen:
>>> import sys
>>> print '[' + ','.join(w.encode(sys.stdout.encoding) for w in word_list) + ']'
Alternatively, we can pass in a unicode string and let print deal with the on-screen encoding:
>>> print u'[' + u','.join(word_list) + u']'
And one last thing: the elements in your word_list contain newline characters (\r\n) as well. You probably want to strip them, since you're building a list of stop words. Your final solution would be:
>>> print u'[' + u','.join(w.strip() for w in word_list) + u']'
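For reference, on Python 3 the decoding can be pushed into open itself. This is a sketch under that assumption. The temporary file stands in for stop-words_chinese_1_zh.txt and contains only three of the words from the output above:

```python
import os
import tempfile

# Stand-in for the real stop words file: three words, Windows line endings.
path = os.path.join(tempfile.mkdtemp(), "stop_words_sample.txt")
with open(path, "wb") as f:
    f.write(u"\u7684\r\n\u4e00\r\n\u4e0d\r\n".encode("utf-8"))

# encoding= makes open() decode each line; strip() drops the line ending.
with open(path, "r", encoding="utf-8") as f:
    word_list = [row.strip() for row in f]

print(len(word_list))
print(word_list == [u"\u7684", u"\u4e00", u"\u4e0d"])
```

On Python 3, print(word_list) on a UTF-8 terminal shows the characters themselves, because repr no longer escapes printable non-ASCII text.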

How to display the first few characters of a string in Python?

I just started learning Python but I'm sort of stuck right now.
I have a hash.txt file containing thousands of malware hashes in MD5, Sha1 and Sha5 format respectively, separated by delimiters in each line. Below are two example lines I extracted from the .txt file.
416d76b8811b0ddae2fdad8f4721ddbe|d4f656ee006e248f2f3a8a93a8aec5868788b927|12a5f648928f8e0b5376d2cc07de8e4cbf9f7ccbadb97d898373f85f0a75c47f
56a99a4205a4d6cab2dcae414a5670fd|612aeeeaa8aa432a7b96202847169ecae56b07ee|d17de7ca4c8f24ff49314f0f342dbe9243b10e9f3558c6193e2fd6bccb1be6d2
My intention is to display the first 32 characters (MD5 hash) so the output will look something like this:
416d76b8811b0ddae2fdad8f4721ddbe 56a99a4205a4d6cab2dcae414a5670fd
Any ideas?
You can 'slice' a string very easily, just like you'd pull items from a list:
a_string = 'This is a string'
To get the first 4 letters:
first_four_letters = a_string[:4]
# 'This'
Or the last 6:
last_six_letters = a_string[-6:]
# 'string'
So applying that logic to your problem:
the_string = '416d76b8811b0ddae2fdad8f4721ddbe|d4f656ee006e248f2f3a8a93a8aec5868788b927|12a5f648928f8e0b5376d2cc07de8e4cbf9f7ccbadb97d898373f85f0a75c47f'
first_32_chars = the_string[:32]
# '416d76b8811b0ddae2fdad8f4721ddbe'
Since there is a delimiter, you should use that instead of worrying about how long the md5 is.
>>> s = "416d76b8811b0ddae2fdad8f4721ddbe|d4f656ee006e248f2f3a8a93a8aec5868788b927|12a5f648928f8e0b5376d2cc07de8e4cbf9f7ccbadb97d898373f85f0a75c47f"
>>> md5sum, delim, rest = s.partition('|')
>>> md5sum
'416d76b8811b0ddae2fdad8f4721ddbe'
Alternatively
>>> md5sum, sha1sum, sha5sum = s.split('|')
>>> md5sum
'416d76b8811b0ddae2fdad8f4721ddbe'
>>> sha1sum
'd4f656ee006e248f2f3a8a93a8aec5868788b927'
>>> sha5sum
'12a5f648928f8e0b5376d2cc07de8e4cbf9f7ccbadb97d898373f85f0a75c47f'
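Applied to the whole file, the same split can run once per line. In this sketch the two sample lines from the question are held in a list in place of hash.txt, and the function name md5_column is made up for illustration:

```python
def md5_column(lines):
    """Return the first '|'-separated field of every non-empty line."""
    return [line.split("|", 1)[0].strip() for line in lines if line.strip()]

lines = [
    "416d76b8811b0ddae2fdad8f4721ddbe|d4f656ee006e248f2f3a8a93a8aec5868788b927|12a5f648928f8e0b5376d2cc07de8e4cbf9f7ccbadb97d898373f85f0a75c47f\n",
    "56a99a4205a4d6cab2dcae414a5670fd|612aeeeaa8aa432a7b96202847169ecae56b07ee|d17de7ca4c8f24ff49314f0f342dbe9243b10e9f3558c6193e2fd6bccb1be6d2\n",
]

md5s = md5_column(lines)
print(" ".join(md5s))
```

An open file object can be passed in place of the list, since iterating over a file also yields one line at a time.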
If you want the first 2 letters and the last 2 letters of a string, you can use the following code:
name = "India"
name[0:2]   # 'In'
name[-2:]   # 'ia'
