Parse Specific Text File to CSV Format with Headers - python

I have a log file that is updated every few milliseconds; however, the information is currently saved with four (4) different delimiters. The log files contain millions of lines, so the chances of performing the action in Excel are null.
What I have left to work on resembles:
Sequence=3433;Status=true;Report=223313;Profile=xxxx;
Sequence=0323;Status=true;Header=The;Report=43838;Profile=xxxx;
Sequence=5323;Status=true;Report=6541998;Profile=xxxx;
I would like these converted to:
Sequence,Status,Report,Header,Profile
3433,true,223313,,xxxx
0323,true,43838,The,xxxx
5323,true,6541998,,xxxx
Meaning that I would need the header to be created from all portions with the equals "=" symbol following them. All of the other operations within the file are taken care of; this output will be used to perform a comparative check between files and to replace or append fields. As I am new to Python, I only need assistance with this portion of the program I am writing.
Thank you all in advance!

You can try this.
First of all, I used the csv library to reduce the job of putting in commas and quotes.
import csv
Then I made a function that takes a single line from your log file and outputs a dictionary with the fields passed in the header. If the current line doesn't have a particular field from the header, that field stays filled with an empty string.
def convert_to_dict(line, header):
    d = {}
    for cell in header:
        d[cell] = ''
    row = line.strip().split(';')
    for cell in row:
        if cell:
            key, value = cell.split('=')
            d[key] = value
    return d
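For example, calling it on the first sample line with the sorted header that extract_fields (below) produces returns a dict like this (key order may vary, since dicts here are unordered):
line = 'Sequence=3433;Status=true;Report=223313;Profile=xxxx;\n'
header = ['Header', 'Profile', 'Report', 'Sequence', 'Status']
convert_to_dict(line, header)
# -> {'Header': '', 'Profile': 'xxxx', 'Report': '223313',
#     'Sequence': '3433', 'Status': 'true'}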
Since the header and the number of fields can vary between your files, I made a function that extracts them. For this I employed a set, a collection of unique elements which is also unordered, so I used the sorted function to get an ordered list back. Don't forget that seek(0) call, to rewind the file!
def extract_fields(logfile):
    fields = set()
    for line in logfile:
        row = line.strip().split(';')
        for cell in row:
            if cell:
                key, value = cell.split('=')
                fields.add(key)
    logfile.seek(0)
    return sorted(fields)
Lastly, I made the main piece of code, which opens both the log file for reading and the csv file for writing. Then it extracts and writes the header, and writes each converted line.
if __name__ == '__main__':
    with open('report.log', 'r') as logfile:
        with open('report.csv', 'wb') as csvfile:
            csvwriter = csv.writer(csvfile)
            header = extract_fields(logfile)
            csvwriter.writerow(header)
            for line in logfile:
                d = convert_to_dict(line, header)
                csvwriter.writerow([d[cell] for cell in header])
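Note that the 'wb' mode is Python 2 specific. On Python 3, csv.writer expects a text-mode file opened with newline='', so the inner open would become (a small sketch of the variant):
# Python 3 variant: text mode with newline='' avoids extra blank lines on Windows
with open('report.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)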
These are the files I used as an example:
report.log
Sequence=3433;Status=true;Report=223313;Profile=xxxx;
Sequence=0323;Status=true;Header=The;Report=43838;Profile=xxxx;
Sequence=5323;Status=true;Report=6541998;Profile=xxxx;
report.csv
Header,Profile,Report,Sequence,Status
,xxxx,223313,3433,true
The,xxxx,43838,0323,true
,xxxx,6541998,5323,true
I hope it helps! :D
EDIT: I added support for different headers.

Related

update records in file2 with data found in file1

There is a large file with a fixed format, file1. Another CSV file, file2, has ids and values, using which specific portions of a record with the same id in file1 need to be updated. Here is my attempt. I really appreciate any help you can offer to make this work.
file2 (comma separated)
clr,code,type
Red,1001,1
Red,2001,2
Red,3001,3
blu,1002,1
blu,2002,2
blu,3002,3
file1 (fixed width format)
clrtyp1typ2typ3notes
red110121013101helloworld
blu110221023102helloworld2
the file1 need to be updated to the following
clrtyp1typ2typ3notes
red100120013001helloworld
blu100220023002helloworld2
Please note that both files are fairly large (multiple GB each). I am a Python noob; please excuse any gross mistakes. I'd really appreciate any help you could offer.
import shutil
#read both input files
file1=open("file1.txt",'r').read()
file2='file2.txt'
#make a copy of the input file to make edits to it.
file2Edit=file2+'.EDIT'
shutil.copy(file2, baseEdit)
baseEditFile = open(baseEdit,'w').read()
#go thru each line, pick clr from file1 and look for it in file2; if found, form a string to be replaced and replace the original line.
with open('file2.txt','w') as f:
    for line in f:
        base_clr = line[:3]
        findindex = file1.find(base_recid)
        if findindex != -1:
            for line2 in file1:
                #print(line)
                clr = line2.split(",")[0]
                code = line2.split(",")[1]
                type = line2.split(",")[2]
                if keytype = 1:
                    finalline=line[:15]+string.rjust(keyid, 15)+line[30:]
                    baseEditFile.write( replace(line,finalline)
                    baseEditFile.replace(line,finalline)
If I get you right, you need something like this:
# declare file names and necessary lists
file1 = "file1.txt"
file2 = "file2.txt"
file1_new = "file1.txt.EDIT"
clrs = {}

# read clrs to update
with open(file1, "r") as f:
    # skip header line
    f.next()
    for line in f:
        clrs[line[:3]] = []

# read the new codes
with open(file2, "r") as f:
    # skip header
    f.next()
    for line in f:
        current = line.strip().split(",")
        key = current[0].lower()
        if key in clrs:
            clrs[key].append(current[1])

# write the new lines (old codes replaced with the new ones) to new file
with open(file1, "r") as f_in:
    with open(file1_new, "w") as f_out:
        # write header
        f_out.write(f_in.next())
        for line in f_in:
            line_new = list(line)
            key = line[:3]
            # check if new codes were found for that key
            if key in clrs:
                # replace the old codes with the new ones
                line_new[3:15] = "".join(clrs[key])
            f_out.write("".join(line_new))
This works only for the given example. If your file has another format in real use, you have to adjust the indices used.
This little script first opens your file1, iterates over it, and adds the clr as a key to a dictionary. The value for that key is an empty list.
Then it opens file2, and iterates over every clr here. If the clr is in the dictionary, it appends the code to the list. So after running this part, the dictionary contains key, value pairs, where the keys are the clr's and the values are lists containing the codes (in the order that was given by the file).
And in the last part of the script, every line of file1.txt is written to file1.txt.EDIT. Before writing, the old codes are replaced by the new ones.
The codes saved in file2.txt have to be in the same order as they are saved in file1.txt. If the order can differ, or there is the possibility that file2.txt contains more codes than you need to replace in file1.txt, you need to add a check for the correct codes, as sketched below. That's no big deal, but this script will solve your problem for the files you gave us as an example.
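A minimal sketch of such a check, assuming the type column of file2.txt gives the 1-based slot each code belongs to in file1.txt (so the order of the rows in file2.txt no longer matters):
# map (clr, type) -> code, so every code lands in its own slot
codes = {}
with open("file2.txt", "r") as f:
    next(f)  # skip header
    for line in f:
        clr, code, typ = line.strip().split(",")
        codes[(clr.lower(), int(typ))] = code

# later, when rewriting a line of file1 for a given key (e.g. "red"):
# line_new[3:15] = codes[(key, 1)] + codes[(key, 2)] + codes[(key, 3)]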
If you have any questions or need more help, feel free to ask.
EDIT: Besides some syntactic mistakes and wrong method calls in your question's code, you shouldn't read all the data saved in a file at once, especially if you know the files can get very large. This consumes a lot of memory and may cause the program to run very slowly. That's why iterating line by line is much better. The example I provided reads only one line of the file at a time and writes it to the new file directly, instead of keeping both old files and the new file in memory and writing as the last step.

what file mode to create new when not exists and append new data when exists

I need to do the following:
create a csv file if it does not exist, and append data if it exists
when creating a new csv file, create it with a heading from dict1.
My code:
import csv

def main():
    list1 = ['DATE', 'DATASET', 'name1', 'name2', 'name3']
    dict1 = dict.fromkeys(list1, 0)
    with open('masterResult.csv', 'w+b') as csvFile:
        header = next(csv.reader(csvFile))
        dict_writer = csv.DictWriter(csvFile, header, 0)
        dict_writer.writerow(dict1)

if __name__ == '__main__':
    main()
I've written the sample code below, which you can refer to and adapt for your requirement. First of all, if you open the file in append mode, you can append if the file exists and write a new file if it does not. Now, coming to your header writing: you can check the size of the file beforehand. If the size is zero, then it is obviously a new file and you can write your header first. If the size is not zero, then you can append only data records, without writing the header. Below is my sample code. The first time you run it, it will create the file with a header. The next time you run it, it will append only the data records and not the header.
import os

header = 'Name,Age'
filename = 'sample.csv'
filesize = 0
if os.path.exists(filename) and os.path.isfile(filename):
    filesize = os.stat(filename).st_size
f = open(filename, 'a')
if filesize == 0:
    f.write('%s\n' % header)
f.write('%s\n' % 'name1,25')
f.close()
The w mode will overwrite an existing file. Instead, you need to use the a (append) mode:
with open('masterResult.csv','a+b') as csvFile:
# here -------------------^
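Combining both answers with the DictWriter from the question, a hedged sketch (assuming Python 2.7+, where csv.DictWriter has writeheader; the data row is hypothetical):
import csv
import os

fieldnames = ['DATE', 'DATASET', 'name1', 'name2', 'name3']
filename = 'masterResult.csv'

# the header is needed only when the file is missing or still empty
need_header = not os.path.isfile(filename) or os.stat(filename).st_size == 0

with open(filename, 'ab') as csvFile:
    dict_writer = csv.DictWriter(csvFile, fieldnames, restval=0)
    if need_header:
        dict_writer.writeheader()
    dict_writer.writerow({'DATE': '20150101', 'DATASET': 'run1'})  # hypothetical row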

Batch Appending matching rows to csv files using python

I have a set of csv files and another csv file, GroundTruth2010_edited_copy.csv, which contains information I'd like to append to the end of the rows of the Set of files. The files contain information describing geologic samples. For all the files, including GroundTruth2010_edited_copy.csv, each row has an identifying 'rockid' that identifies the sample and the remainder of the row describes various parameters of the sample. I want to append corresponding information from GroundTruth2010_edited_copy.csv to the Set of csv files. That is, if the rows have the same 'rockid,' I want to combine them into a new row in a new csv file. Hence, there is a new csv file for each original csv file in the Set. Here is my code.
import os
import csv

#read in ground truth data
csvfilename='GroundTruth/GroundTruth2010_edited_copy.csv'
with open(csvfilename) as csvfile:
    rocreader=csv.reader(csvfile)
    path=os.getcwd()
    filenames = os.listdir(path)
    for filename in filenames:
        if filename.endswith('.csv'):
            #read csv files
            r=csv.reader(open(filename))
            new_data = []
            for row in r:
                rockid=row[-1]
                for krow in rocreader:
                    entry=krow[0]
                    newentry=entry[:5] + entry[6:]  #remove extra '0' from middle of entry
                    if newentry==rockid:
                        print('Ok!')
                        #append ground truth data
                        new_data.append([row, krow[1], krow[2], krow[3], krow[4]])
            #write csv files
            newfilename = "".join(filename.split(".csv")) + "_GT.csv"
            with open(newfilename, "w") as f:
                writer = csv.writer(f)
                writer.writerows(new_data)
The code runs and makes my new csv files, but they are all empty. The problem seems to be that my second 'if' statement is never true: the console never prints 'Ok!' I've tried troubleshooting for a bit and been rather frustrated. Perhaps the most frustrating thing is that after the program finishes, if I enter
rockid==newentry
the console returns 'True', so it seems to me I should get at least one 'Ok!' for the final iteration. Can anyone help me find what's wrong?
Also, since my if statement is never true, there may also be a problem with the way I append to 'new_data'.
You only open rocreader once, so when you try to use it later in the loop, you'll only get rows from it the first time through; in the rest of the loop's runs, you're reading 0 rows (and of course getting no matches). To read it over and over, you would have to open and close it once for each time you need to use it.
But instead of re-scanning the Ground Truth file from disk (slow!) for every row of each of the other CSVs, you should read it once into a dictionary, so you can look up IDs in one step.
with open(csvfilename) as csvfile:
    rocreader = csv.reader(csvfile)
    rocindex = dict((row[-1], row) for row in rocreader)
Then for any key newentry, you can just check like this:
if newentry in rocindex:
    truth = rocindex[newentry]
    # Merge it with the row that has key `newentry`
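Putting the pieces together, a hedged sketch of the rewritten loop. It assumes, as in the question, that the ground-truth key is krow[0] with the extra '0' removed and that columns 1-4 of the matching ground-truth row get appended. Note that new_data.append([row, ...]) in the question nests the whole row into a single cell; concatenating lists avoids that:
import os
import csv

# build the lookup once, applying the same '0'-removal used in the question
with open('GroundTruth/GroundTruth2010_edited_copy.csv') as csvfile:
    rocindex = dict((krow[0][:5] + krow[0][6:], krow) for krow in csv.reader(csvfile))

for filename in os.listdir(os.getcwd()):
    if filename.endswith('.csv'):
        new_data = []
        with open(filename) as fin:
            for row in csv.reader(fin):
                rockid = row[-1]
                if rockid in rocindex:
                    krow = rocindex[rockid]
                    new_data.append(row + krow[1:5])  # flat row, not nested
        newfilename = "".join(filename.split(".csv")) + "_GT.csv"
        with open(newfilename, "w") as f:
            csv.writer(f).writerows(new_data)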

Why can't I repeat the 'for' loop for csv.Reader?

I am a beginner with Python. I am now trying to figure out why the second 'for' loop doesn't work in the following script. I mean that I get the result of the first 'for' loop, but nothing from the second one. I copied and pasted my script and the csv data below.
It would be helpful if you could tell me why it works this way and how to make the second 'for' loop work as well.
My SCRIPT:
import csv

file = "data.csv"
fh = open(file, 'rb')
read = csv.DictReader(fh)

for e in read:
    print(e['a'])

for e in read:
    print(e['b'])
"data.csv":
a,b,c
tree,bough,trunk
animal,leg,trunk
fish,fin,body
The csv reader is an iterator over the file. Once you go through it once, you read to the end of the file, so there is no more to read. If you need to go through it again, you can seek to the beginning of the file:
fh.seek(0)
This will reset the file to the beginning so you can read it again. Depending on the code, it may also be necessary to skip the field name header:
next(fh)
This is necessary for your code, since the DictReader consumed that line the first time around to determine the field names, and it's not going to do that again. It may not be necessary for other uses of csv.
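Applied to the script in the question, a small sketch of the fix:
fh.seek(0)   # rewind to the start of the file
next(fh)     # skip the header line; DictReader keeps its cached field names
for e in read:
    print(e['b'])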
If the file isn't too big and you need to do several things with the data, you could also just read the whole thing into a list:
data = list(read)
Then you can do what you want with data.
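For example, with the script from the question, both loops then work on the cached list:
data = list(read)
for e in data:
    print(e['a'])
for e in data:
    print(e['b'])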
I have created a small function which takes the path of a csv file, reads it, and returns a list of dicts at once; then you can loop through the list very easily.
import csv

def read_csv_data(path):
    """
    Reads CSV from the given path and returns a list of dicts with mapping
    """
    data = csv.reader(open(path))
    # Read the column names from the first line of the file
    fields = data.next()
    data_lines = []
    for row in data:
        items = dict(zip(fields, row))
        data_lines.append(items)
    return data_lines
Regards

How do I search a particular string out of file1, and update a csv file

I have two very large files:
File1 is formatted as such:
thisismy#email.com:20110708
thisisnotmy#email.com:20110908
thisisyour#email.com:20090807
...
File2 is a csv file that has the same email addresses in the row[0] field, and I need to put the date into the row[5] field.
I understand how to properly read and parse the csv, and I also understand how to read File1 and cut it properly.
What I need assistance with is how to properly search the CSV file for ANY instances of the email address and update the csv with the corresponding date.
Thanks for your assistance.
You may want to try the re module:
import re

emails = re.findall(r'^(.*\#.*?):', open('filename.csv').read(), re.M)
That will get you all the emails (the re.M flag makes the ^ anchor match at the start of every line, not just the first).
If the data you have to replace has a fixed size, which seems to be the case in your example, you can use seek(). While reading your file looking for your value, get the cursor position and write your replacement data at the desired position.
Cf: Writing in file's actual position in Python
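A minimal sketch of that seek() idea on a File1-style line ('email:YYYYMMDD'), assuming Python 2 and a replacement that is exactly the same size as the text it overwrites:
# overwrite the 8-byte date in place; old and new sizes must match exactly
with open('file1.txt', 'r+b') as f:
    pos = f.tell()
    line = f.readline()
    while line:
        if line.startswith('thisismy#email.com:'):  # hypothetical target
            f.seek(pos + line.index(':') + 1)  # jump to the date field
            f.write('20120101')                # hypothetical new date, 8 bytes
            f.seek(pos + len(line))            # reposition past this line
        pos = f.tell()
        line = f.readline()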
However, if you are dealing with extra huge files, using command line tools such as sed could save a lot of processing time.
Below example tested on Python 2.7:
import csv

# 'b' flag for binary is necessary if on Windows, otherwise crlf hilarity ensues
with open('/path/to/file1.txt', 'rb') as fin:
    csv_reader = csv.reader(fin, delimiter=":")
    # Header in line 1? Skip over. Otherwise no need for next line.
    csv_reader.next()
    # populate dict with email address as key and date as value
    # dictionary comprehensions supported in 2.7+
    # on a lower version? use: d = dict((line[0], line[1]) for line in csv_reader)
    email_address_dict = {line[0]: line[1] for line in csv_reader}

# there are ways to modify a file in-place
# but it's easier to write to a new file
with open('/path/to/file2.txt', 'rb') as fin, \
     open('/path/to/file3.txt', 'wb') as fou:
    # file2 is comma-separated, so the default delimiter is used here
    csv_reader = csv.reader(fin)
    csv_writer = csv.writer(fou)
    # Header in line 1? Skip over. Otherwise no need for next line.
    csv_writer.writerow(csv_reader.next())
    for line in csv_reader:
        # construct new line,
        # looking up the date value in the just-created dict;
        # the new date value replaces position 5 (zero-based)
        newline = line[0:5]
        newline.append(email_address_dict[line[0]])
        newline.extend(line[6:])
        csv_writer.writerow(newline)
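One caveat: if a file2 row's email has no match in file1, the dict lookup raises a KeyError. A hedged tweak keeps the row's existing date instead:
# fall back to the old value in field 5 when the email is not in file1
newline.append(email_address_dict.get(line[0], line[5]))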
