I was trying to parse a Fortinet log into a CSV file. Part of the code reads all 4 million lines in the log file and saves any word before an "=" sign as an element in a set(). These elements will become the CSV headers.
When parsing 500,000 lines, the set looks OK, but when I parse 1 million lines, some elements start to become long hexadecimal values.
Below is what the output looked like in the end.
C:\Python27>python.exe D:\parser\logtocsv.py
* Finding Column Headers 2018-07-24 08:59:23
completed 1517027 lines
set(['', 'shapersentname', 'tuple-num', 'bandwidth', 'totalsession', 'disk', 'hook', 'HTTP/1.1in.css?ver', 'group', 'HTTP/1.13&ip', 'to',
'\xe7\xbf\xbb\xe4\xba\x86\xe5\x8d\x8a\xe5\xa4\xa9\xe6\x89\x8d\xe6\x89\xbe\xe5\x88\xb0\xe4\xbd\xa0\xef\xbc\x8cJia\xe6\x88\x91\xe5\xb8\xb8\xe7\x
94\xa8\xe6\x89\xa3\xe5\x8f\xb7706772123\xe5\x88\xab\xe5\x86\x8d\xe5\x88\xa0\xe6\x88\x91\xe5\x95\xa6~', 'HTTP/1.1?cms_redirect', 'analyticscksu
m', 'devname', 'setuprate', 'appact', 'fazlograte', 'recipient', 'sentpkt', 'shaperrcvdname', 'level', 'subtype', 'attackid', 'appid', 'dir',
'profile', 'sentbyte', 'crscore', 'duration', 'analyticssubmit', 'subject', 'error', 'eventtype', 'dstcountry', 'countweb', 'filename', 'diskl
ograte', 'applist', 'fcni', 'ref', 'method', 'mem', 'incidentserialno', 'processtime', 'reason', 'dstintf', 'srcintf', 'countav', 'sender', 'v
irusid', 'logid', 'HTTP/1.1ver', 'act', 'action', 'carrier_ep', 'policyid', 'dstip', 'rcvdbyte', 'srccountry', 'dtype', 'app', 'utmaction', 's
rcip', '\xe7\xbf\xbb\xe4\xba\x86\xe5\x8d\x8a\xe5\xa4\xa9\xe6\x89\x8d\xe6\x89\xbe\xe5\x88\xb0\xe4\xbd\xa0\xef\xbc\x8cJia\xe6\x88\x91\xe5\xb8\xb
8\xe7\x94\xa8\xe6\x89\xa3\xe5\x8f\xb7715859168\xe5\x88\xab\xe5\x86\x8d\xe5\x88\xa0\xe6\x88\x91\xe5\x95\xa6~', 'crlevel', 'shaperdropsentbyte',
'rsso_key', 'from', 'log', 'service', 'fdni', 'devid', '\xe5\xbe\x88\xe5\xbc\x80\xe5\xbf\x83\xef\xbc\x8c\xe8\x83\xbd\xe6\x89\xbe\xe5\x88\xb0\
xe4\xbd\xa0\xef\xbc\x81Jia\xe4\xb8\x8b\xe6\x88\x91\xe5\xb8\xb8\xe7\x94\xa8q\xe5\x8f\xb7717598789\xe5\x88\xab\xe5\xbf\x98\xe4\xba\x86\xef\xbc\x
81', 'attack', 'filesize', 'logdesc', 'poluuid', 'msg', 'type', 'direction', 'authproto', 'sessionid', 'shaperdroprcvdbyte', 'countips', 'coun
t', 'datarange', 'cat', 'ui', 'countapp', 'rcvdpkt', 'quarskip', 'vd', 'craction', 'file', 'apprisk', 'severity', 'proto', 'hostname', 'new_st
atus', 'attachment', 'dstport', 'status', 'acct_stat', 'time', 'fsci', 'catdesc', 'virus', 'reporttype', 'user', 'reqtype', 'date', 'old_statu
s', 'countemail', 'url', 'appcat', 'srcport', 'command', 'trandisp', 'cpu'])
pause^A
Below is the part of my code that saves the elements in my set():
import csv
import time
import datetime
def findColumnHeaders(log_file_path):
    ts = time.time()
    st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
    print "* Finding Column Headers " + st
    f = open(log_file_path, "r")
    col_headers = set() # create empty set for all column headers
    col_headers_old = set()
    content = f.readlines()
    content = [x.strip() for x in content]# List of lines to iterate through
    current_col_header = ""
    start = False
    count = 0
    letter_chr = 0
    line_number = 0
    letter_number = 0
    for line in content:
        letter_number = 0
        line_number += 1
        for letter in line:
            letter_number += 1
            if letter == " ":
                # space means start taking in new col_value
                # data is in this structure with space prior to column names -> " column=col_value"
                start = True
                current_col_header = ""
            elif letter == "=":
                # when hits "=", means that prior word is a column col_value -> "column=col_value"
                col_headers.add(current_col_header)
                # reset current_col_header to empty string
                current_col_header = ""
                # only take in once another space has been encountered
                start = False
                continue
            elif start:
                current_col_header += letter
            # else, do nothing
    print "completed " + str(line_number) + " lines"
    print col_headers
    raw_input("pause")
    return col_headers
Your problem is that you're printing out UTF-8-encoded strings. For example, this:
'\xe7\xbf\xbb\xe4\xba\x86\xe5\x8d\x8a\xe5\xa4\xa9\xe6\x89\x8d\xe6\x89\xbe\xe5\x88\xb0\xe4\xbd\xa0\xef\xbc\x8cJia\xe6\x88\x91\xe5\xb8\xb8\xe7\x94\xa8\xe6\x89\xa3\xe5\x8f\xb7706772123\xe5\x88\xab\xe5\x86\x8d\xe5\x88\xa0\xe6\x88\x91\xe5\x95\xa6~'
… is the UTF-8 encoding of this string:
翻了半天才找到你,Jia我常用扣号706772123别再删我啦~
What you want to do is decode your UTF-8 str byte strings to unicode text strings.
The cleanest place to do this is as early as possible: tell the file object itself to decode things for you. You can use either codecs.open if you want to be compatible with Python 2.6 and earlier, or io.open if you want to be compatible with Python 3. Either way, you're also going to want to replace your string literals like "=" with unicode literals like u"=". So, for example:
f = io.open(log_file_path, "r", encoding="utf-8")
# ...
current_col_header = u""
# ...
if letter == u" ":
# etc.
The smallest change, on the other hand, is to do this as late as possible, manually decoding things only when you store them in the set:
col_headers.add(current_col_header.decode('utf-8'))
… or even later, when you print things out:
print {header.decode('utf-8') for header in col_headers}
The benefit of the first approach is that if you want to look for any non-ASCII characters, you can do that with unicode strings. For example, the first letter in the Chinese string as unicode is u'翻', so you can just do if letter == u'翻':; the first byte in the Chinese string as UTF-8 is '\xe7', so you can't do if letter == '翻': (and, while you can do if letter == '\xe7', that would be incorrect, because lots of other characters start with the same \xe7 byte).
But if that's never going to be an issue, you can do it either way.
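To make the byte-versus-character point concrete, here is a small sketch in Python 3 syntax (an assumption; the question uses Python 2.7). It uses two characters that both occur in the log output above.

```python
# Two different characters from the log output share the same first UTF-8 byte.
fan = "翻"    # U+7FFB
yong = "用"   # U+7528

fan_bytes = fan.encode("utf-8")
yong_bytes = yong.encode("utf-8")

print(fan_bytes)    # b'\xe7\xbf\xbb'
print(yong_bytes)   # b'\xe7\x94\xa8'

# Comparing only the first byte cannot tell the two characters apart...
same_first_byte = fan_bytes[:1] == yong_bytes[:1]

# ...whereas comparing decoded text compares whole characters.
same_character = fan_bytes.decode("utf-8") == yong_bytes.decode("utf-8")

print(same_first_byte, same_character)   # True False
```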
As a side note, because you're on Windows and using Python 2.7, trying to print non-ASCII strings to the console may just not work. There are workarounds, but they're all painful. The easy solution is to switch to Python 3. (In fact, in Python 3 most of this problem would not have arisen in the first place, because text-mode files are decoded for you and every string is a Unicode string; on Windows you may still need to pass encoding='utf-8' to open, since the default there is the locale encoding rather than UTF-8.) But if you can't upgrade for some reason, and you run into this problem, you will need one of those horrible workarounds.
I have an application.properties file containing key-value pairs. Depending on a condition, I use either a single tab or a double tab as the separator.
application.properties:
key1=\t
key2=\t\t
main.py
with open('application.properties', 'rt')
read and convert to key value pair
return props
str1 = 'abc xyz'
str2 = 'def jkl'
splitter1 = props['key1']
splitter2 = props['key2']
print(str1.split(splitter1)[1])
print(str1.split(splitter2)[1])
IndexError: list index out of range
print(type(splitter2) , splitter2)
<class 'str'> \t\t
Python processes \t in string literals, not string values. You'll have to replace the digraph \t with a tab character yourself. Something like
props = {}
with open('application.properties') as f:
    for line in f:
        name, value = line.strip().split('=')
        props[name] = value.replace(r'\t', '\t')
If there are other escape sequences that you expect to appear, you'll have to handle them yourself as well.
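If more escape sequences than \t can appear, one option is to let Python's unicode_escape codec interpret all of them at once. The following is a minimal sketch, assuming the values are plain ASCII; parse_properties is a hypothetical helper name, and it takes an iterable of lines so that it can be tried without a file.

```python
def parse_properties(lines):
    """Parse key=value lines and interpret backslash escapes in the values."""
    props = {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        name, value = line.split("=", 1)
        # Assumption: values are ASCII-only; unicode_escape would mangle other text.
        props[name] = value.encode("ascii").decode("unicode_escape")
    return props


props = parse_properties([r"key1=\t", r"key2=\t\t"])
print(repr(props["key1"]))                   # '\t'
print("abc\txyz".split(props["key1"])[1])    # xyz
```

With a real file, the same function could be called as parse_properties(open('application.properties')).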
I am reading and manipulating a file with:
base_file = open(path+'/'+base_name, "r")
lines = base_file.readlines()
After this I search and find the "raw_data" start of line.
if re.match("\s{0,100}raw_data: ",line):
    split_line = line.split("raw_data:")
    print(split_line)
    raw_string = split_line[1]
One example of raw_data is:
raw_data: "&\276!\300\307 =\277\"O\271\277vH9?j?\345?#\243\264=\350\034\345\277\260\345\033\300\023\017(#z|\273\277L\}\277\210\\031\300\213\263z\277\302\241\033\300\000\207\323\277\247Oh>j\354\215#\364\305\201\276\361+\202#t:\304\277\344\231\243#\225k\002\300vw\262\277\362\220j\300\"(\337\276\354b8\300\230\347H\300\201\320\204\300S;N\300Z0G\300>j\210\000#\034\014\220#\231\330J#\223\025\236#\006\332\230\276\227\273\n\277\353#,#\202\205\215\277\340\356\022\300/\223\035\277\331\277\362\276a\350\013#)\353\276\277v6\316\277K\326\207\300`2)\300\004\014Q\300\340\267\271\300MV\305\300\327\010\207\300j\346o\300\377\260\216\300[\332g\300\336\266\003\300\320S\272?6\300Y#\356\250\034\300\367\277&\300\335Uq>o\010&\300r\277\252\300U\314\243\300\253d\377\300"
And raw_string will be
print(raw_string)
"&\276!\300\307
=\277\"O\271\277vH9?j?\345?#\243\264=\350\034\345\277\260\345\033\300\023\017(#z|\273\277L\}\277\210\\031\300\213\263z\277\302\241\033\300\000\207\323\277\247Oh>j\354\215#\364\305\201\276\361+\202#t:\304\277\344\231\243#\225k\002\300vw\262\277\362\220j\300\"(\337\276\354b8\300\230\347H\300\201\320\204\300S;N\300Z0G\300>j\210\000#\034\014\220#\231\330J#\223\025\236#\006\332\230\276\227\273\n\277\353#,#\202\205\215\277\340\356\022\300/\223\035\277\331\277\362\276a\350\013#)\353\276\277v6\316\277K\326\207\300`2)\300\004\014Q\300\340\267\271\300MV\305\300\327\010\207\300j\346o\300\377\260\216\300[\332g\300\336\266\003\300\320S\272?6\300Y#\356\250\034\300\367\277&\300\335Uq>o\010&\300r\277\252\300U\314\243\300\253d\377\300"
If I read this file, I get one character for each character of the text, even for the escape sequences.
So, my question is how to transform this plain text into a string in which \300 is read as one character and not four.
I tried passing encoding="utf-8" to the open call, but it does not work.
I made the same example passing raw_data as a variable, and it works properly.
RAW_DATA = "&\276!\300\307 =\277\"O\271\277vH9?j?\345?#\243\264=\350\034\345\277\260\345\033\300\023\017(#z|\273\277L\\}\277\210\\\031\300\213\263z\277\302\241\033\300\000\207\323\277\247Oh>j\354\215#\364\305\201\276\361+\202#t:\304\277\344\231\243#\225k\002\300vw\262\277\362\220j\300\"(\337\276\354b8\300\230\347H\300\201\320\204\300S;N\300Z0G\300<I>>j\210\000#\034\014\220#\231\330J#\223\025\236#\006\332\230\276\227\273\n\277\353#,#\202\205\215\277\340\356\022\300/\223\035\277\331\277\362\276a\350\013#)\353\276\277v6\316\277K\326\207\300`2)\300\004\014Q\300\340\267\271\300MV\305\300\327\010\207\300j\346o\300\377\260\216\300[\332g\300\336\266\003\300\320S\272?6\300Y#\356\250\034\300\367\277&\300\335Uq>o\010&\300r\277\252\300U\314\243\300\253d\377\300"
print(f"Qnt -> {len(RAW_DATA)}") # Qnt -> 256
print(type(RAW_DATA))
at = 0
total = 0
while at < len(RAW_DATA):
    fin = at+4
    substrin = RAW_DATA[at:fin]
    resu = FourString_float(substrin)
    at = fin
For this example \300 is only one char.
Hope someone can help me.
The problem is that in the text read from the file each backslash arrives as a literal \ character, whereas in the string literal in your example Python evaluates it together with the digits that follow it, i.e., \276 is read as a single character.
If you run:
RAW_DATA = r"&\276!\300\307 =\277\"O\271\277vH9?j?\345?#\243\264=\350\034\345\277\260\345\033\300\023\017(#z|\273\277L\\}\277\210\\\031\300\213\263z\277\302\241\033\300\000\207\323\277\247Oh>j\354\215#\364\305\201\276\361+\202#t:\304\277\344\231\243#\225k\002\300vw\262\277\362\220j\300\"(\337\276\354b8\300\230\347H\300\201\320\204\300S;N\300Z0G\300<I>>j\210\000#\034\014\220#\231\330J#\223\025\236#\006\332\230\276\227\273\n\277\353#,#\202\205\215\277\340\356\022\300/\223\035\277\331\277\362\276a\350\013#)\353\276\277v6\316\277K\326\207\300`2)\300\004\014Q\300\340\267\271\300MV\305\300\327\010\207\300j\346o\300\377\260\216\300[\332g\300\336\266\003\300\320S\272?6\300Y#\356\250\034\300\367\277&\300\335Uq>o\010&\300r\277\252\300U\314\243\300\253d\377\300"
print(f"Qnt -> {len(RAW_DATA)}") # Qnt -> 256
print(type(RAW_DATA))
at = 0
total = 0
while at < len(RAW_DATA):
    fin = at+4
    substrin = RAW_DATA[at:fin]
    resu = FourString_float(substrin)
    at = fin
You should get the same error that you were getting originally. Notice that this uses a raw-string literal instead of a regular string literal, which ensures that the backslashes are not processed as escape sequences.
To get the original behavior back, you need to evaluate RAW_DATA so that the escape sequences are interpreted.
You can do something like RAW_DATA = eval(f'"{RAW_DATA}"') or
import ast
RAW_DATA = ast.literal_eval(f'"{RAW_DATA}"')
Note that the second option is a bit more secure than a straight eval, because ast.literal_eval only accepts Python literals and cannot execute arbitrary code.
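If the end goal is to turn each group of four characters into a float, it may be simpler to evaluate the escaped text as a bytes literal and hand the result to struct. This is only a sketch under assumptions: the file stores 32-bit little-endian floats, and the text between the quotes contains no unescaped double quote. decode_escaped and unpack_floats are hypothetical names standing in for the FourString_float helper from the question, whose code is not shown.

```python
import ast
import struct


def decode_escaped(raw_text):
    """Evaluate escaped text, as read from the file, into a bytes object."""
    return ast.literal_eval('b"' + raw_text + '"')


def unpack_floats(data):
    """Interpret consecutive 4-byte groups as little-endian 32-bit floats."""
    usable = len(data) - len(data) % 4   # ignore a trailing partial group
    return [struct.unpack("<f", data[i:i + 4])[0] for i in range(0, usable, 4)]


raw_text = r"\000\000\200?\000\000\000@"     # escaped text, as it appears in the file
data = decode_escaped(raw_text)              # 8 real bytes
print(len(raw_text), len(data))              # 26 8
print(unpack_floats(data))                   # [1.0, 2.0]
```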
I'm trying to read a null-terminated string, but I'm having issues when unpacking a char and putting it together with a string.
This is the code:
def readString(f):
    str = ''
    while True:
        char = readChar(f)
        str = str.join(char)
        if (hex(ord(char))) == '0x0':
            break
    return str
def readChar(f):
    char = unpack('c',f.read(1))[0]
    return char
Now this is giving me this error:
TypeError: sequence item 0: expected str instance, int found
I'm also trying the following:
char = unpack('c',f.read(1)).decode("ascii")
But it throws me:
AttributeError: 'tuple' object has no attribute 'decode'
I don't even know how to read the chars and add them to the string. Is there a proper way to do this?
Here's a version that (ab)uses the lesser-known "sentinel" argument of iter():
with open('file.txt', 'rb') as f:
    val = ''.join(iter(lambda: f.read(1).decode('ascii'), '\x00'))
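The one-liner above never terminates if the file ends before a null byte is found, because read then keeps returning an empty result. A slightly longer sketch of the same iter(callable, sentinel) idea, with an explicit end-of-file check, could look like this (Python 3; read_cstring is a hypothetical name, and io.BytesIO stands in for a real file opened in 'rb' mode):

```python
import io
from functools import partial


def read_cstring(f, encoding="ascii"):
    """Read bytes from f up to (and consuming) the next null byte, or to EOF."""
    chunks = []
    # iter(callable, sentinel) calls f.read(1) until it returns b"\x00".
    for byte in iter(partial(f.read, 1), b"\x00"):
        if byte == b"":      # end of file reached without a terminator
            break
        chunks.append(byte)
    return b"".join(chunks).decode(encoding)


f = io.BytesIO(b"sword\x00Sword_Wea_Dummy\x00\xcd\xcc")
first = read_cstring(f)
second = read_cstring(f)
print(first, second)   # sword Sword_Wea_Dummy
```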
How about:
myString = myNullTerminatedString.split("\x00")[0]
For example:
myNullTerminatedString = "hello world\x00\x00\x00\x00\x00\x00"
myString = myNullTerminatedString.split("\x00")[0]
print(myString) # "hello world"
This works by splitting the string on the null character. Since the string should terminate at the first null character, we simply grab the first item in the list after splitting. split will return a list of one item if the delimiter doesn't exist, so it still works even if there's no null terminator at all.
It also will work with byte strings:
myByteString = b'hello world\x00'
myStr = myByteString.split(b'\x00')[0].decode('ascii') # "hello world" as normal string
If you're reading from a file, you can do a relatively larger read - estimate how much you'll need to read to find your null string. This is a lot faster than reading byte-by-byte. For example:
resultingStr = b''
extraBytes = 0
while True:
    buf = f.read(512)
    if len(buf)==0: break # end of file, no null terminator found
    resultingStr += buf
    if (b"\x00" in resultingStr):
        nullPos = resultingStr.index(b"\x00")
        extraBytes = len(resultingStr) - nullPos # bytes read from the null byte onward
        resultingStr = resultingStr[:nullPos]
        break
# now "resultingStr" contains the string (as bytes)
if extraBytes:
    f.seek(0 - extraBytes,1) # seek backwards by the number of extra bytes, now the pointer will be on the null byte in the file
    # or f.seek(1 - extraBytes,1) to skip the null byte in the file
(edit version 2, added extra way at the end)
Maybe there are libraries out there that can help you with this, but as I don't know of any, let's attack the problem at hand with what we know.
In Python 2, bytes and str are basically the same thing. That changes in Python 3, where str is what Python 2 calls unicode and bytes is its own separate type. This means that in Python 2 you don't need to define a readChar function, as no extra work is required, so I don't think you need unpack for this particular case. With that in mind, let's define the new readString:
def readString(myfile):
    chars = []
    while True:
        c = myfile.read(1)
        # stop at the null terminator, or at end of file (read returns '')
        if c == chr(0) or not c:
            return "".join(chars)
        chars.append(c)
Just like your code, this reads one character at a time, but it saves the characters in a list instead. The reason is that strings are immutable, so doing str += char results in unnecessary copies. When the null character is found, the function returns the joined string. chr is the inverse of ord: it gives you the character for a given ASCII value. The result excludes the null character; if you need it, just move the append before the check.
Now let's test it with your sample file.
For instance, let's try to read "Sword_Wea_Dummy" from it:
with open("sword.blendscn","rb") as archi:
    #lets simulate that some prior processing was made by
    #moving the pointer of the file
    archi.seek(6)
    string=readString(archi)
    print "string repr:", repr(string)
    print "string:", string
    print ""
    #and the rest of the file is there waiting to be processed
    print "rest of the file: ", repr(archi.read())
and this is the output
string repr: 'Sword_Wea_Dummy'
string: Sword_Wea_Dummy
rest of the file: '\xcd\xcc\xcc=p=\x8a4:\xa66\xbfJ\x15\xc6=\x00\x00\x00\x00\xeaQ8?\x9e\x8d\x874$-i\xb3\x00\x00\x00\x00\x9b\xc6\xaa2K\x15\xc6=;\xa66?\x00\x00\x00\x00\xb8\x88\xbf#\x0e\xf3\xb1#ITuB\x00\x00\x80?\xcd\xcc\xcc=\x00\x00\x00\x00\xcd\xccL>'
other tests
>>> with open("sword.blendscn","rb") as archi:
        print readString(archi)
        print readString(archi)
        print readString(archi)
sword
Sword_Wea_Dummy
ÍÌÌ=p=Š4:¦6¿JÆ=
>>> with open("sword.blendscn","rb") as archi:
        print repr(readString(archi))
        print repr(readString(archi))
        print repr(readString(archi))
'sword'
'Sword_Wea_Dummy'
'\xcd\xcc\xcc=p=\x8a4:\xa66\xbfJ\x15\xc6='
>>>
Now that I think about it, you mention that the data portion is of fixed size. If that is true for all files and they all have the following structure
[unknown size data][known size data]
then that is a pattern we can exploit: we only need to know the size of the file and we can get both parts smoothly, as follows
import os
def getDataPair(filename,knowSize):
    size = os.path.getsize(filename)
    with open(filename, "rb") as archi:
        unknown = archi.read(size-knowSize)
        know = archi.read()
        return unknown, know
Knowing the size of the data portion (which I got by playing with the prior example), its use is simple:
>>> string_data, data = getDataPair("sword.blendscn", 80)
>>> string_data
'sword\x00Sword_Wea_Dummy\x00'
>>> data
'\xcd\xcc\xcc=p=\x8a4:\xa66\xbfJ\x15\xc6=\x00\x00\x00\x00\xeaQ8?\x9e\x8d\x874$-i\xb3\x00\x00\x00\x00\x9b\xc6\xaa2K\x15\xc6=;\xa66?\x00\x00\x00\x00\xb8\x88\xbf#\x0e\xf3\xb1#ITuB\x00\x00\x80?\xcd\xcc\xcc=\x00\x00\x00\x00\xcd\xccL>'
>>> string_data.split(chr(0))
['sword', 'Sword_Wea_Dummy', '']
>>>
Now a simple split suffices to get each string, and you can pass the rest of the file, contained in data, to the appropriate function to be processed.
Doing file I/O one character at a time is horribly slow.
Instead, use readline0, now on PyPI: https://pypi.org/project/readline0/ . Or something like it.
In 3.x, there's a "newline" argument to open, but it doesn't appear to be as flexible as readline0.
Here is my implementation:
import struct
def read_null_str(f):
    r_str = ""
    while 1:
        back_offset = f.tell()
        try:
            r_char = struct.unpack("c", f.read(1))[0].decode("utf8")
        except:
            f.seek(back_offset)
            temp_char = struct.unpack("<H", f.read(2))[0]
            r_char = chr(temp_char)
        if ord(r_char) == 0:
            return r_str
        else:
            r_str += r_char
What I'm trying to do is open a file, then find every instance of '[\x06I"' and '\x06;', then return whatever is between the two.
Since this is not a standard text file (it's map data from RPG maker) readline() will not work for my purposes, as the file is not at all formatted in such a way that the data I want is always neatly within one line by itself.
What I'm doing right now is loading the file into a list with read(), then simply deleting characters from the very beginning until I hit the string '[\x06I'. Then I scan ahead to find '\x06;', store what's between them as a string, append said string to a list, then resume at the character after the semicolon I found.
It works, and I ended up with pretty much exactly what I wanted, but I feel like that's the worst possible way to go about it. Is there a more efficient way?
My relevant code:
while eofget == 0:
    savor = 0
    while savor == 0 or eofget == 0:
        if line[0:4] == '[\x06I"':
            x = 4
            spork = 0
            while spork == 0:
                x += 1
                if line[x] == '\x06':
                    if line[x+1] == ';':
                        spork = x
                        savor = line[5:spork] + "\n"
                        line = line[x+1:]
                        linefinal[lineinc] = savor
                        lineinc += 1
                elif line[x:x+7] == '#widthi':
                    print("eof reached")
                    spork = 1
                    eofget = 1
                    savor = 0
        elif line[x:x+7] == '#widthi':
            print("finished map " + mapname)
            eofget = 1
            savor = 0
            break
        else:
            line = line[1:]
You can just ignore the variable names. I just name things the first thing that comes to mind when I'm doing one-offs like this. And yes, I am aware a few things in there don't make any sense, but I'm saving cleanup for when I finalize the code.
When eofget gets flipped on this subroutine terminates and the next map is loaded. Then it repeats. The '#widthi' check is basically there to save time, since it's present in every map and indicates the beginning of the map data, AKA data I don't care about.
I feel this is a natural case to use regular expressions. Using the findall method:
>>> s = 'testing[\x06I"text in between 1\x06;filler text[\x06I"text in between 2\x06;more filler[\x06I"text in between \n with some line breaks \n included in the text\x06;ending'
>>> import re
>>> p = re.compile('\[\x06I"(.+?)\x06;', re.DOTALL)
>>> print(p.findall(s))
['text in between 1', 'text in between 2', 'text in between \n with some line breaks \n included in the text']
The regex string '\[\x06I"(.+?)\x06;' can be interpreted as follows:
Match as little as possible (denoted by ?) of one or more arbitrary characters (denoted by .+) surrounded by '[\x06I"' and '\x06;', and only return the enclosed text (denoted by the parentheses around .+?).
Adding re.DOTALL in the compile call makes the . match line breaks as well, allowing multi-line text to be captured.
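Since the map file is binary rather than text, the same pattern can also be applied to bytes read in 'rb' mode. The sketch below assumes Python 3; extract_strings is a hypothetical name and the sample data is made up for illustration.

```python
import re

# Raw bytes pattern: the regex engine itself interprets the \x06 and \[ escapes.
PATTERN = re.compile(rb'\[\x06I"(.+?)\x06;', re.DOTALL)


def extract_strings(data):
    """Return every run of bytes enclosed by the two markers in data."""
    return PATTERN.findall(data)


sample = b'junk[\x06I"first\x06;more junk[\x06I"second\nline\x06;#widthi'
print(extract_strings(sample))   # [b'first', b'second\nline']
```

With a real map file, data would come from something like open(path, 'rb').read(), and each result could be decoded afterwards if text is needed.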
I would use split():
fulltext = 'adsfasgaseg[\x06I"thisiswhatyouneed\x06;sdfaesgaegegaadsf[\x06I"this is the second what you need \x06;asdfeagaeef'
parts = fulltext.split('[\x06I"') # split by first label
results = []
for part in parts:
    if '\x06;' in part: # if second label exists in part
        results.append(part.split('\x06;')[0]) # get the part until the second label
print results
I have a text file which contains entry like
70154::308933::3
UserId::ProductId::Score
I wrote this program to read:
(Sorry, the indentation is a bit messed up here)
def generateSyntheticData(fileName):
    dataDict = {}
    # rowDict = []
    innerDict = {}
    try:
        # for key in range(5):
        # count = 0
        myFile = open(fileName)
        c = 0
        #del innerDict[0:len(innerDict)]
        for line in myFile:
            c += 1
            #line = str(line)
            n = len(line)
            #print 'n: ',n
            if n is not 1:
                # if c%100 ==0: print "%d: "%c, " entries read so far"
                # words = line.replace(' ','_')
                words = line.replace('::',' ')
                words = words.strip().split()
                #print 'userid: ', words[0]
                userId = int( words[0]) # i get error here
                movieId = int (words[1])
                rating =float( words[2])
                print "userId: ", userId, " productId: ", movieId," :rating: ", rating
                #print words
                #words = words.replace('_', ' ')
                innerDict = dataDict.setdefault(userId,{})
                innerDict[movieId] = rating
                dataDict[userId] = (innerDict)
                innerDict = {}
    except IOError as (errno,strerror):
        print "I/O error({0}) :{1} ".format(errno,strerror)
    finally:
        myFile.close()
    print "total ratings read from file",fileName," :%d " %c
    return dataDict
But i get the error:
ValueError: invalid literal for int() with base 10: ''
The funny thing is that it works just fine when reading data in the same format from another file.
Actually, while posting this question, I noticed something weird.
In the entry 70154::308933::3
each character appears to have a space after it, like 7 space 0 space 1 space 5 space 4 space :: space 3...
But the text file looks fine; the spaces show up only when I copy and paste.
Anyway, any clue what's going on?
Thanks
The "spaces" that you are seeing appear to be NULs ("\x00"). There is a 99.9% chance that your file is encoded in UTF-16, UTF-16LE, or UTF-16BE. If this is a one-off file, just open it with Notepad and save as "ANSI", not "Unicode" and not "Unicode bigendian". If however you need to process it as is, you'll need to know/detect what the encoding is. To find out which, do this:
print repr(open("yourfile.txt", "rb").read(20))
and compare the start of the output with the following:
>>> ucode = u"70154:"
>>> for sfx in ["", "LE", "BE"]:
...     enc = "UTF-16" + sfx
...     print enc, repr(ucode.encode(enc))
...
UTF-16 '\xff\xfe7\x000\x001\x005\x004\x00:\x00'
UTF-16LE '7\x000\x001\x005\x004\x00:\x00'
UTF-16BE '\x007\x000\x001\x005\x004\x00:'
>>>
You can make a detector that's good enough for your purposes by inspecting the first 2 bytes:
[pseudocode]
if f2b in "\xff\xfe\xff": UTF-16
elif f2b[1] == "\x00": UTF-16LE
elif f2b[0] == "\x00": UTF-16BE
else: cp1252 or UTF-8 or whatever else is prevalent in your neck of the woods.
You could avoid hard-coding the fallback encoding:
>>> import locale
>>> locale.getpreferredencoding()
'cp1252'
Your line-reading code will look like this:
rawbytes = open(myFile, "rb").read()
enc = detect_encoding(rawbytes[:2])
for line in rawbytes.decode(enc).splitlines():
    # whatever
Oh, and the lines will be unicode objects ... if that gives you a problem, ask another question.
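The pseudocode above can be turned into a small function. The following is a sketch in Python 3 syntax (an assumption; the question uses Python 2), relying on the same heuristic that the file starts with an ASCII character, as the sample entry does. The sample bytes are made up for illustration.

```python
import locale


def detect_encoding(first_two_bytes):
    """Guess the encoding from the first two bytes of the file."""
    if first_two_bytes in (b"\xff\xfe", b"\xfe\xff"):
        return "utf-16"        # a BOM is present; the codec works out the byte order
    if first_two_bytes[1:2] == b"\x00":
        return "utf-16-le"     # ASCII character followed by NUL
    if first_two_bytes[0:1] == b"\x00":
        return "utf-16-be"     # NUL followed by ASCII character
    return locale.getpreferredencoding(False)   # fallback, e.g. cp1252


# Stand-in for open(path, "rb").read() on a BOM-less UTF-16LE file.
rawbytes = "70154::308933::3\r\n70155::308934::5\r\n".encode("utf-16-le")
enc = detect_encoding(rawbytes[:2])
rows = [line.split("::") for line in rawbytes.decode(enc).splitlines() if line]
print(enc, rows)
```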
Debugging 101: simply change the line:
words = words.strip().split()
to:
words = words.strip().split()
print words
and see what comes out.
I will mention a couple of things. If you have the literal UserId::... in the file and you try to process it, it won't take kindly to trying to convert that to an integer.
And the ... unusual line:
if n is not 1:
I would probably write as:
if n != 1:
If, as you indicate in your comment, you end up seeing:
['\x007\x000\x001\x005\x004\x00', '\x003\x000\x008\x009\x003\x003\x00', '3']
then I'd be checking your input file for binary (non-textual) data. You should never end up with that binary information if you're just reading text and trimming/splitting.
And because you state that the digits seem to have spaces between them, you should do a hex dump of the file to find out what's really in there. It may be a UTF-16 Unicode string, for example.