I have a log file that has lines that look like this:
"1","2546857-23541","f_last","user","4:19 P.M.","11/02/2009","START","27","27","3","c2546857-23541",""
Each line in the log has 12 double-quoted sections, and the 7th section contains whatever the user typed into the chat window:
"22","2546857-23541","f_last","john","4:38 P.M.","11/02/2009","
What's up","245","47","1","c2546857-23541",""
This entry also shows the issue I'm having: there are places in the chat log where the text the user typed ends up on a new line in the log file instead of on the same line, as in the first example.
So basically I want the lines in the second example to look like the first example.
I've tried using Find/Replace in Notepad++: I can find each "orphaned" line, but I couldn't make it join the line above it.
Then I thought of making a python file to automate it for me, but I'm kind of stuck about how to actually code it.
Python errors out on this line when running unutbu's code:
"1760","4746880-00129","bwhiteside","tom","11:47 A.M.","12/10/2009","I do not see ^"refresh your knowledge
^" on the screen","422","0","0","c4746871-00128",""
The csv module is smart enough to recognize when a quoted item is not finished (and thus must contain a newline character).
import csv

with open('data.log', 'r') as fin:
    with open('data2.log', 'w') as fout:
        # The reader treats ^" as a literal quote inside a field, and a quoted
        # field with no closing quote continues onto the next physical line.
        reader = csv.reader(fin, delimiter=',', quotechar='"', escapechar='^')
        # The writer needs the same escapechar: with doublequote=False and no
        # escape character set, csv.writer errors out on any field that still
        # contains a quote character.
        writer = csv.writer(fout, delimiter=',', quotechar='"', escapechar='^',
                            doublequote=False, quoting=csv.QUOTE_ALL)
        for row in reader:
            row[6] = row[6].replace('\n', ' ')  # fold the chat text onto one line
            writer.writerow(row)
If your data is valid CSV you can use Python's csv.reader class. It should work just fine with your sample data. It may not work correctly depending on what an embedded double quote looks like coming from the source system. See: http://docs.python.org/library/csv.html#module-contents.
Unless I'm misunderstanding the problem, you simply need to read in the file and remove any newline characters that occur between double-quote characters.
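For instance, here is a minimal stateful sketch (assuming the input file is data.log and ^ is the escape character, as in the samples above); a record is complete once it holds an even number of unescaped quotes:

with open('data.log') as fin, open('joined.log', 'w') as fout:
    buffer = ''
    for line in fin:
        buffer += line.rstrip('\n')
        # Count quotes, ignoring ones escaped as ^" in this log format.
        if (buffer.count('"') - buffer.count('^"')) % 2 == 0:
            fout.write(buffer + '\n')  # record is complete: emit it
            buffer = ''
        else:
            buffer += ' '              # record continues on the next line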
We receive a .tar.gz file from a client every day and I am rewriting our import process using SSIS. One of the first steps in my process is to unzip the .tar.gz file which I achieve via a Python script.
After unzipping we are left with a number of CSV files which I then import into SQL Server. As an aside, I am loading using the CozyRoc DataFlow Task Plus.
Most of my CSV files load without issue but I have five files which fail. By reading the log I can see that the process is reading the header and first line as though there were no header-row delimiter (i.e. it is trying to import the column header as ColumnHeader1ColumnValue1).
I took one of these CSVs, copied the top 5 rows into Excel, used Text-To-Columns to delimit the data then saved that as a new CSV file.
This version imported successfully.
That makes me think that somehow the original CSV isn't using {CR}{LF} as the row delimiter but I don't know how to check. Any suggestions?
I ended up using the suggestion commented by @vahdet because I already had Notepad++ installed. I can't find the same option in EmEditor, but it may exist.
For those who are curious, the files are using {LF} which is consistent with the other files. My investigation continues...
Seeing that you have EmEditor, you can use it to find the EOL character in two ways:
Use View > Character Code Value... at the end of a line to display a dialog box showing information about the character at the current position.
Go to View > Marks and turn on Newline Characters and CR and LF with Different Marks to show the EOL characters while editing. LF is displayed with a down arrow, while CRLF is a right angle.
Some other things you could try checking for are: file encoding, wrong type of data for a field and an inconsistent number of columns.
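If you'd rather check programmatically than in an editor, here is a small sketch that tallies the two line-ending styles (suspect.csv is a placeholder filename):

# Count CRLF vs. bare-LF line endings in a file.
with open('suspect.csv', 'rb') as f:
    data = f.read()
crlf = data.count(b'\r\n')
bare_lf = data.count(b'\n') - crlf
print('CRLF:', crlf, 'bare LF:', bare_lf)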
I'm trying to capture the text output of a query to an SSD (pulling a log page, similar to pulling SMART data). I'm then trying to write this text data out to a log file that I update periodically.
My problem happens when the log data for some drives has double double-quotes as a placeholder for a blank field. Here is a snippet of the input:
VER 0x10200
VID 0x15b7
BoardRev 0x0
BootLoadRev ""
When this gets written out (appended) to my own log file, the text gets replaced with several null characters, and when I then try to open the file, all the text editors tell me it's corrupted.
The "" characters are replaced by something like this on my Linux system:
BootLoadRev "\00\00\00\00"
Some fields are even longer with the \00 characters. If the "" is not there, things write out OK.
The code is similar to this:
f=open(fileName, 'w')
test_bench.send_command('get_log_page')
identify_data = test_bench.get_data_in()
f.write(identify_data)
f.close()
Is there a way to send this text to a file w/o these nulls causing problems?
Assuming that this is Python 2 (and that your content is thus what Python 3 would call a bytestring), and that your intended data format is raw ASCII, the trivial solution is simply to remove the NULs from your content before you write to disk:
f.write(identify_data.replace('\0', ''))
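In the context of the question's snippet (test_bench, fileName, and 'get_log_page' all come from the question; this is only a sketch):

test_bench.send_command('get_log_page')
identify_data = test_bench.get_data_in()
with open(fileName, 'w') as f:
    f.write(identify_data.replace('\0', ''))  # strip the NUL padding before writing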
I'm having a strange problem where some Python code that prints to a file is not inserting an EOF character. Basically, the Python script generates runscripts to later be submitted as jobs on a cluster. I essentially wrote the entire runscript between """'s, allowing for variables to be plugged in (to vary some parameters in my simulation). I write the runscripts using the
with open(file_name, 'w') as runscrpt:
    runscrpt.write("""ENTIRE_FILE_CONTENTS_HERE""")
syntax. I can give the actual code if necessary, but it's not much more than the above. Despite the script running fine and generating all of my runscripts, whenever I submitted them nothing happened. It took me a long time to figure out why, but it's because they're missing an EOF character. I can fix it by, for example, opening one, adding some trailing whitespace or a blank line somewhere in vim, and resaving the file.
Why isn't Python inserting the EOF character, and is there a better way to fix this than manually making trivial edits to all the files with vim?
Sounds like you mean there is no EOL (not EOF!) at the end, because that's what diff will typically tell you. Just add a newline at the end of the write (make sure there is a newline before the final """ terminator, or write a separate newline explicitly).
with open(file_name, 'w') as runscript:
    runscript.write("""ENTIRE_FILE_CONTENTS_HERE\n""")
(As a bonus, I added the missing vowel.)
I've got a csv file that has around 100 rows. Some of the cells in the 100 rows have filepaths like:
C:\\\\Users\\\Simon\\\\Desktop\\\\file.jpg
I want to open the csv file in python and change only the rows that have triple-slashes and convert them to a single backslash. Here is my code so far:
import csv

with open('myCsvFile', 'rb') as csvfile:
    SysIndexTwo = csv.reader(csvfile)

for allRows in SysIndexTwo:
    if '\\\\' in allRows:
        writer.writerows(allRows.replace('\\\\', '\\'))
Tried the suggestions and get the following error:
simon@ubuntu:~/Desktop$ python SIPHON2.py
Traceback (most recent call last):
  File "SIPHON2.py", line 7, in <module>
    for allRows in SysIndexTwo:
ValueError: I/O operation on closed file
This doesn't seem to work. Any ideas?
Thanks
You need to indent your actual processing. Right now, you drop out of the context manager (the with statement where you define your CSV reader) before you try to use it. Thus, you get the "I/O operation on closed file" error because the context manager closed the file when you left it.
You want this:
with open('myCsvFile', 'rb') as csvfile:
    reader = csv.reader(csvfile)  # Simple names are good, esp. in small scope!
    for row in reader:  # Indent me!
        pass  # Do stuff here.
The with statement is handy for automatically closing files (among other things) for you. However, this means that any work you do that requires the file you're using must be done before you leave the block, because once you leave, the file is closed!
The csv reader doesn't read the whole file when you initialize it: it reads it on demand. Thus, you need to still be inside the block when you read lines from the csv reader.
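If you genuinely need the rows after the block ends, one option (a sketch) is to materialize them into a list while the file is still open:

import csv

with open('myCsvFile', 'rb') as csvfile:
    rows = list(csv.reader(csvfile))  # reads everything while the file is open

for row in rows:  # safe: the list no longer needs the file
    pass  # process each row here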
Other Notes
You've got a bunch of other problems. You seem to be unsure whether you're trying to clean three or four backslashes--make sure you know what you're doing before you try to do it!
Your actual row replacement is broken, because as you've written it, allRows is a list, not a string, so you're probably not going to find the backslash pattern you're looking for. Instead, you need an inner loop to look through each cell in each row:
for row in reader:
    corrected = []
    for cell in row:
        corrected.append(cell.replace('\\\\\\', '\\'))  # Gross! See below.
    writer.writerow(corrected)
Note that I can't see where writer is defined, but it looks like it might be subject to the same problem as your reader, if it's defined in a context manager someplace else!
Finally, raw strings are your friends (though they may not help you much here). In general, anytime you want a literal backslash in your strings, put an r in front of the string to save yourself a lot of headache. However, replacing odd numbers of backslashes is still a problem, because even raw strings cannot end in an odd number of backslashes.
So, to replace \\\ with \ (replace three backslashes with one), you'll have to double up on the backslashes like I did in the example above. If you wanted to replace four backslashes with two, you could use raw strings to your advantage: cell.replace(r'\\\\', r'\\') works just fine.
For posterity: you could also do something just as ugly, but in a different way, by adding a space to the end of the pattern strings so they no longer end with backslashes, and then stripping off the extra space. The following line replaces three backslashes with one, but it's much hackier (and slower if you're doing it a whole lot):
s = r'This is a \\\ string with \\\ sets \ of \\ three backslash\\\es.'
print(s.replace(r'\\\ '.strip(), r'\ '.strip()))
The backslashes you're trying to match are being collapsed by Python's string-literal escaping: in an ordinary literal, '\\\\' denotes just two actual backslash characters, so you're searching for fewer backslashes than you think.
Try using raw strings, i.e. r'\\\\', for the pattern (note, though, that a raw string cannot end in a single backslash, so a lone-backslash replacement still has to be written '\\').
You could also double up the slashes, using \\ every time you want \, but that gets cumbersome very quickly.
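A quick interpreter check (throwaway names) makes the collapsing visible:

pattern = '\\\\'     # an ordinary literal: each \\ collapses to one backslash
print(len(pattern))  # 2
raw = r'\\\\'        # a raw literal keeps every backslash
print(len(raw))      # 4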
Try:
allRows.replace('\\\\\\', '\\')
Please note that the \ symbol needs to be escaped by doubling it.
>>> d
'C:\\\\\\Users\\\\\\Simon\\\\\\Desktop\\\\\\file.jpg\n'
>>> d.replace('\\\\\\', '\\')
'C:\\Users\\Simon\\Desktop\\file.jpg\n'
>>> print d.replace('\\\\\\', '\\')
C:\Users\Simon\Desktop\file.jpg
I am writing a simple Python script to reorganize some text data I grabbed from a website. I put the data in a .txt file and then want to use the "tail" command to get rid of the first 5 lines. I'm able to make this work for the simple filename shown below, but when I try to change the filename (to what I'd actually like it to be) I get an error. My code:
start = 2010
end = 2010
for i in range(start, end+1):
    year = str(i)
    # ...write data to a file called file...
    teamname = open(file).readline()  # want to use this in the new filename
    teamfname = teamname.replace(" ", "")  # getting rid of spaces
    file2 = "gotdata2_" + year + ".txt"
    os.system("tail -n +5 gotdata_" + year + ".txt > " + file2)
The above code works as intended, creating file, then creating file2 that excludes the first 5 lines of file. However, when I change the name of file2 to be:
file2 = teamfname+"_"+year+".txt"
I get the error:
sh: line 1: _2010.txt: command not found
It's as if the end of my file2 statement is getting chopped off and the .txt part isn't being recognized. In this case, my code outputs a file but is missing the _2010.txt at the end. I've double checked that both year and teamfname are strings. I've also tried it with and without spaces in the teamfname string. I get the same error when I try to include a os.system mv statement that would rename the file to what I want it to be, so there must be something wrong with my understanding of how to specify the string here.
Does anyone have any ideas about what causes this? I haven't been able to find a solution, but I've found this problem difficult to search for.
Without knowing what your actual strings are, it's impossible to be sure what the problem is. However, it's almost certainly something to do with failing to properly quote and/or escape arguments for the command line.
My first guess would be that you have a newline in the middle of your filename, and the shell is truncating the command at the newline. But I wouldn't bet too heavily on that. If you actually printed out the repr of the pathname, I could tell you for sure. But why go through all this headache?
The solution to almost any problem with os.system is to not use os.system.
If you look at the docs, they even tell you this:
The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function. See the Replacing Older Functions with the subprocess Module section in the subprocess documentation for some helpful recipes.
If you use subprocess instead of os.system, you can avoid the shell entirely. You can also pass arguments as a list instead of trying to figure out how to quote them and escape them properly. Which would completely avoid the exact problem you're having.
For example, if you do this:
file2 = "gotdata2_"+year+".txt"
with open(file2, 'wb') as f:
subprocess.check_call(['tail', '-n', '+5', "gotdata_"+year+".txt"], stdout=f)
Then, if you change that first line to this:
file2 = teamfname+"_"+year+".txt"
It will still work even if teamfname has a space or a quote or another special character in it.
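(As a bonus, check_call raises CalledProcessError if tail exits with a nonzero status, so failures can't slip by silently the way they do when an os.system return code is ignored.)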
That being said, I'm not sure why you want to use tail in the first place. You can skip the first 5 lines just as easily directly in Python.
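For instance, here is a sketch using only names from the question (year, the gotdata_ files, and the teamfname logic all come from there); it mirrors "tail -n +5", whose output starts at line 5, so the first four lines are dropped:

import itertools

with open("gotdata_" + year + ".txt") as fin:
    # .strip() removes the trailing newline that was breaking the shell command.
    teamfname = fin.readline().strip().replace(" ", "")
    fin.seek(0)  # rewind so islice counts from the top of the file
    with open(teamfname + "_" + year + ".txt", "w") as fout:
        fout.writelines(itertools.islice(fin, 4, None))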