update records in file2 with data found in file1 - python

There is a large file in fixed-width format, file1. Another CSV file, file2, has ids and values; using these, specific portions of the record with the same id in file1 need to be updated. Here is my attempt. I really appreciate any help you can offer to make this work.
file2 (comma separated):
clr,code,type
Red,1001,1
Red,2001,2
Red,3001,3
blu,1002,1
blu,2002,2
blu,3002,3
file1 (fixed width format):
clrtyp1typ2typ3notes
red110121013101helloworld
blu110221023102helloworld2
file1 needs to be updated to the following:
clrtyp1typ2typ3notes
red100120013001helloworld
blu100220023002helloworld2
Please note that both files are fairly large (multiple GB each). I am a Python noob, so please excuse any gross mistakes. I'd really appreciate any help you could offer.
import shutil

#read both input files
file1 = open("file1.txt", 'r').read()
file2 = 'file2.txt'

#make a copy of the input file to make edits to it.
file2Edit = file2 + '.EDIT'
shutil.copy(file2, baseEdit)
baseEditFile = open(baseEdit, 'w').read()

#go thru each line, pick clr from file1 and look for it in file2; if found, form a string to be replaced and replace the original line.
with open('file2.txt', 'w') as f:
    for line in f:
        base_clr = line[:3]
        findindex = file1.find(base_recid)
        if findindex != -1:
            for line2 in file1:
                #print(line)
                clr = line2.split(",")[0]
                code = line2.split(",")[1]
                type = line2.split(",")[2]
                if keytype = 1:
                    finalline = line[:15] + string.rjust(keyid, 15) + line[30:]
                    baseEditFile.write( replace(line,finalline)
                    baseEditFile.replace(line, finalline)

If I get you right, you need something like this:
# declare file names and necessary lists
file1 = "file1.txt"
file2 = "file2.txt"
file1_new = "file1.txt.EDIT"
clrs = {}

# read clrs to update
with open(file1, "r") as f:
    # skip header line
    f.next()
    for line in f:
        clrs[line[:3]] = []

# read the new codes
with open(file2, "r") as f:
    # skip header
    f.next()
    for line in f:
        current = line.strip().split(",")
        key = current[0].lower()
        if key in clrs:
            clrs[key].append(current[1])

# write the new lines (old codes replaced with the new ones) to new file
with open(file1, "r") as f_in:
    with open(file1_new, "w") as f_out:
        # writes header
        f_out.write(f_in.next())
        for line in f_in:
            line_new = list(line)
            key = line[:3]
            # checks if new codes were found for that key
            if key in clrs:
                # replaces old codes by the new ones
                line_new[3:15] = "".join(clrs[key])
            f_out.write("".join(line_new))
This works only for the given example. If your file has another format in real use, you have to adjust the indices used.
This little script first opens your file1, iterates over it, and adds the clr as a key to a dictionary. The value for that key is an empty list.
Then it opens file2, and iterates over every clr here. If the clr is in the dictionary, it appends the code to the list. So after running this part, the dictionary contains key, value pairs, where the keys are the clr's and the values are lists containing the codes (in the order that was given by the file).
And in the last part of the script, every line of file1.txt is written to file1.txt.EDIT. Before writing, the old codes are replaced by the new ones.
The codes saved in file2.txt have to be in the same order as they are saved in file1.txt. If the order can be different, or there is the possibility that there are more codes in file2.txt than you need to replace in file1.txt, you need to add a check for the correct codes. That's no big business, but this script will solve your problem for the files you gave us as an example.
If you have any questions or need more help, feel free to ask.
EDIT: Besides some syntactic mistakes and wrong method calls in your question's code, you shouldn't read in the whole data saved in a file at once, especially if you know the files can get very large. This consumes a lot of memory and may cause the program to run very slowly. That's why iterating line by line is much better. The example I provided reads only one line of the file at a time and writes it to the new file directly, instead of keeping both old files and the new file in memory and writing everything as the last step.
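As a sketch of that extra check (assuming the type column in file2 tells you which of the three 4-character slots a code belongs to), you could key the codes by type instead of relying on the file order:

import csv

# build {clr: {type: code}} so the row order in file2 no longer matters
codes = {}
with open("file2.txt") as f:
    for row in csv.DictReader(f):  # header: clr,code,type
        codes.setdefault(row["clr"].lower(), {})[row["type"]] = row["code"]

with open("file1.txt") as f_in, open("file1.txt.EDIT", "w") as f_out:
    f_out.write(next(f_in))  # copy the header unchanged
    for line in f_in:
        key = line[:3]
        if key in codes:
            chars = list(line)
            # the slot for type "1" starts at offset 3; each slot is 4 chars wide
            for i, t in enumerate(("1", "2", "3")):
                if t in codes[key]:
                    chars[3 + 4 * i : 7 + 4 * i] = codes[key][t]
            line = "".join(chars)
        f_out.write(line)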

Related

When writing to a file, can you be specific about where you want to write?

I have a text file, which has the following:
20
15
10
And I have the following code:
test_file = open("test.txt","r")
n = 21
line1 = test_file.readline(1)
line2 = test_file.readline(2)
line3 = test_file.readline(3)
test_file.close()

line1 = int(line1)
line2 = int(line2)
line3 = int(line3)

test_file = open("test.txt","a")
if n > line1:
    test_file.write("\n")
    n = str(n)
    test_file.write(n)
test_file.close()
This code checks whether the variable n is bigger than line 1. What I wanted it to do is: if it is bigger than line 1, write it on a line before the previous line 1. However, this code writes it at the bottom of the file. Is there anything I can do to write something where I want to, and not at the bottom of the file?
Any help is appreciated.
You can put your whole data in a variable, edit that variable, then overwrite the information in the file.
with open('test.txt', 'r') as file:
    # read a list of lines into data
    data = file.readlines()

# now change the 2nd line; note that you have to add a newline
data[1] = "42\t\n"

# and write everything back
with open('test.txt', 'w') as file:
    file.writelines(data)
This is a short answer; implement your own algorithm on top of it to solve your own problem.
As correctly pointed out by Amadan in a comment, the only way to obtain this result is a complete rewrite of the file.
This is fairly inefficient, though clearly how much that matters depends on how strict your requirements are.
If you want to understand more about inefficiency just imagine the actions you would have to manually take to write a new 1st line in a physical notebook page.
Since the 1st line is already written you would have to turn the page, write the new first line, then copy again all the lines from the old page and, finally, tear the 1st page out and have your perfect notebook with a perfect page again.
You are writing with pen so there is no possibility to delete, only a new page will do the trick.
That is quite some work!
This is - well, more or less - what Python is doing behind the scenes when it is opening for reading (the 'r' part in my examples below) and then opening for writing (the 'w' part) the same file again.
As a general idea imagine that when you see for loops there is a lot of work to do.
I will clumsily over-simplify by saying that the more for loops there are, the slower the code (countless pages of paper have been written by brilliant minds on performance; I suggest diving deeper and searching for "Big O notation" with your preferred search engine. Here's an example: https://www.freecodecamp.org/news/all-you-need-to-know-about-big-o-notation-to-crack-your-next-coding-interview-9d575e7eec4/).
A better solution would be to change your data file and make sure that the last value is also the most recent one.
Rewriting the file is as easy as writing to an empty file; the code and the result are identical.
The trick here is that we have in memory (in the variables data and new_data) everything we need.
In data we store the whole content of the file before the change.
In new_data we can easily apply the needed modification because it is just a list containing a number and a newline (\n) for each list item.
Once new_data contains the data in the desired order all we need to do is write that list into a file.
Here's a possible solution, as close as possible to your code:
n = 21

with open('test.txt', 'r') as file:
    data = file.readlines()

first_entry = int(data[0])

if (n > first_entry):
    new_data = []
    new_value = str(n) + "\n"
    new_data.append(new_value)
    for item in data:
        new_data.append(item)
    with open('test.txt', 'w') as file:
        file.writelines(new_data)
Here's a more portable one:
def prepend_to_file_if_bigger_than_first_line(filename, value):
    """Checks if value is bigger than the one found in the 1st line of the specified file,
    if true prepends it to the file

    Args:
        filename (str): The file name to check.
        value (int): The value to check.
    """
    with open(filename, 'r') as file:
        data = file.readlines()

    first_entry = int(data[0])

    if (value > first_entry):
        new_value = "{}\n".format(value)
        new_data = []
        new_data.append(new_value)
        for old_value in data:
            new_data.append(old_value)
        with open(filename, 'w') as file:
            file.writelines(new_data)

prepend_to_file_if_bigger_than_first_line("test.txt", 301)
As a bonus, some food for thought and exercises to learn from:
What if, instead of rewriting everything, you just add a new line to the end of the page? Wouldn't that be more efficient and effective?
How would you re-implement my function above to just check the last line in the file and append a new value? (A sketch follows below.)
Try benchmarking the prepend and the append solutions: which one is best?
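For that second exercise, here is a minimal sketch (assuming every line in the file, including the last one, ends with a newline):

def append_to_file_if_bigger_than_last_line(filename, value):
    with open(filename, 'r') as file:
        last_entry = int(file.readlines()[-1])
    if value > last_entry:
        # appending only touches the end of the file;
        # no rewrite of the earlier lines is needed
        with open(filename, 'a') as file:
            file.write("{}\n".format(value))

append_to_file_if_bigger_than_last_line("test.txt", 301)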

How can I search for someones name then replace the number in that same line as the persons name?

I have the following data in a file called data.txt and would like to be able to add to the numbers at the end and replace them in the file without creating a new one:
Alfreda,art,2015,35
brook,biology,2015,3
charlie,chemistry,2015,140
dolly,Design,2015,120
Emilia,English,2015,150
Fiona,french,2015,40
Grace,Greek,2015,12
Hanna,history,2015,15
Here is the code I currently have:
with open("data.txt", "r") as f:
newline=[]
for word in f.line():
newline.append(word.replace(35,str(New))
with open("data.txt", "w") as f:
for line in newline :
f.writelines(line)
If you just want to add a string to each line and then update the file, this code can solve your problem, though it is not optimal.
with open("data.txt", "r") as myFile:
newline=[]
# Use the readlines method to get all the lines
for line in myFile.readlines():
# Remove the \n character with the rstrip method
line = line.rstrip('\n')
newline.append(line+",35\n") # Don't forget to add \n
# Test
print newline
myFile.close()
with open("data.txt", "w") as myFile:
for line in newline :
myFile.writelines(line)
If this is not your problem, try using the pickle module and work with objects; it will be easier.
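A rough sketch of the pickle idea (the data.pkl file name and the dict layout are just assumptions for illustration):

import pickle

# keep the records as a dict keyed by name
records = {"Alfreda": ["art", 2015, 35], "brook": ["biology", 2015, 3]}

with open("data.pkl", "wb") as f:
    pickle.dump(records, f)  # save the whole object in one call

with open("data.pkl", "rb") as f:
    records = pickle.load(f)  # load it back

records["Alfreda"][2] += 35  # update a score in memory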
I'm going to have to make some of your question up. If you have a file and you want to update it, the updates have to come from somewhere. The code in the question has a New variable but there is no indication of how New is supposed to get a value, or how the program is supposed to know which row to update.
I'm going to assume you have a file of updates called updates.txt that looks like this (and it is deliberately not in alphabetical order):
Emilia,45
Alfreda,35
So after your program runs the resulting file will have two rows different:
Alfreda,art,2015,70 ...this one
brook,biology,2015,3
charlie,chemistry,2015,140
dolly,Design,2015,120
Emilia,English,2015,195 ...and this one
Fiona,french,2015,40
Grace,Greek,2015,12
Hanna,history,2015,15
But the rest stays the same.
Since your sample data file is a .csv file I am using the Python csv module, rather than picking the data apart by hand. It doesn't matter much with simple data like this but it's a good module to know about.
import csv

marks = {}

# Read in existing data into a dictionary:
# key is name, value is a list [subject, year, score]
# like this: {"Alfreda": ["art",2015,35], ... }
# This is to make it easy to do random updates based on name
with open("data.txt", "r") as f:
    for row in csv.reader(f):
        name, subject, year, score = row
        marks[name] = [subject, int(year), int(score)]

# Read in updates and apply each line to the corresponding entry in marks
with open("updates.txt", "r") as f:
    for row in csv.reader(f):
        name, added_score = row
        try:
            marks[name][2] += int(added_score)  # for example marks["Alfreda"][2] += int("35")
        except KeyError:
            print(f"Name {name} not found to update, nothing done")

# Write out updated dictionary:
with open("data.txt", "w") as f:
    writer = csv.writer(f, lineterminator="\n")
    for name in sorted(marks.keys(), key=lambda n: n.lower()):
        row = [name] + marks[name]  # for example ["Alfreda"] + ["art",2015,70]
        writer.writerow(row)
This line:
for name in sorted(marks.keys(), key=lambda n: n.lower()):
looks complicated but it is needed because you obviously expect the names Alfreda brook charlie dolly Emilia Fiona Grace Hanna to be in that order. But just doing the obvious
for name in sorted(marks.keys()):
will put them in the order Alfreda Emilia Fiona Grace Hanna brook charlie dolly.
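You can see the difference in an interactive session; every uppercase letter sorts before every lowercase letter:

>>> sorted(["Alfreda", "brook", "Emilia"])
['Alfreda', 'Emilia', 'brook']
>>> sorted(["Alfreda", "brook", "Emilia"], key=lambda n: n.lower())
['Alfreda', 'brook', 'Emilia']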
In the interests of keeping the code simple and as close to your original as possible, it does no validity checks, so if this line
charlie,chemistry,2015,140
was wrongly entered as
charlie,chemistry,2015,14O
(with the letter O instead of a zero), the program will just fail. Ditto if the update file is missing a comma somewhere.
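If you wanted to guard against that, one option (a sketch, not part of the code above) is to wrap the conversion in a try/except; ValueError covers both a non-numeric field and a row with the wrong number of commas:

for row in csv.reader(f):
    try:
        name, subject, year, score = row
        marks[name] = [subject, int(year), int(score)]
    except ValueError:
        print(f"Skipping malformed row: {row!r}")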
This works and will do what I think you want. But...
There are issues with the design. Your program reads in the data from data.txt, then overwrites it with new data. But suppose your program fails just after this line:
with open("data.txt", "w") as f:
Then you won't have your original data (because the call to open() truncated it), and you won't have the new data either (because you haven't written it out yet). Or suppose you accidentally run the program twice. There will be no way to tell you have done that.
You can provide some insurance against this sort of mishap by using the fileinput module, like this:
import fileinput

# Read in existing data
with fileinput.input("data.txt", inplace=True, backup=".bkp") as f:
    for row in csv.reader(f):
        name, subject, year, score = row
        marks[name] = [subject, int(year), int(score)]
With this change, your updates will be in data.txt as before, but your original data will still be around, in a file called data.txt.bkp.
But that is just a fix. It avoids the real issue, which is that you really have a database application and you are trying to implement it using textfiles. The code above is all very well for an exercise, but it's not robust and it won't scale.
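For comparison, here is a minimal sketch of the same update done against a real database with Python's built-in sqlite3 module (the marks.db file name and the table layout are made up for illustration):

import sqlite3

con = sqlite3.connect("marks.db")
con.execute("""CREATE TABLE IF NOT EXISTS marks
               (name TEXT PRIMARY KEY, subject TEXT, year INTEGER, score INTEGER)""")
# add 35 to Alfreda's score; the update is atomic, so a crash
# part-way through cannot leave the data half-written
con.execute("UPDATE marks SET score = score + ? WHERE name = ?", (35, "Alfreda"))
con.commit()
con.close()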

Python removing duplicates and saving the result

I am trying to remove duplicates from a 3-column tab-delimited txt file; as long as the first two columns of two rows match, one of them should be removed, even if the two rows have different 3rd columns.
from operator import itemgetter
import sys

input = sys.argv[1]
output = sys.argv[2]

#Pass any column number you want, note that indexing starts at 0
ig = itemgetter(0,1)

seen = set()
data = []

for line in input.splitlines():
    key = ig(line.split())
    if key not in seen:
        data.append(line)
        seen.add(key)
    file = open(output, "w")
    file.write(data)
    file.close()
First, I get this error:
key = ig(line.split())
IndexError: list index out of range
Also, I can't see how to save the result to output.txt
People say saving to output.txt is a really basic matter. But no tutorial helped.
I tried methods that use codecs, methods that use with, and methods that use file.write(data), and none of them helped.
I could learn MatLab quite easily. The online tutorial was fantastic and a series of Googling always helped a lot.
But I can't find a helpful tutorial for Python yet. This is obviously because I am a complete novice. For complete novices like me, what would be the best tutorial with 1) comprehensiveness, 2) lots of examples, and 3) line-by-line explanations that don't leave any line unexplained?
And why is the above code causing error and not saving result?
I'm assuming that since you assign input to the first command line argument with input = sys.argv[1] and output to the second, you intend those to be your input and output file names. But you're never opening any file for the input data, so you're calling .splitlines() on a file name, not on file contents.
Next, splitlines() is the wrong approach here anyway. To iterate over a file line by line, simply use for line in f, where f is an open file. Those lines will include the newline at the end of the line, so it needs to be stripped if it's not supposed to be part of the third column's data.
Then you're opening and closing the file inside your loop, which means you'll try to write the entire contents of data to the file every iteration, effectively overwriting any data written to the file before. Therefore I moved that block out of the loop.
It's good practice to use the with statement for opening files. with open(out_fn, "w") as outfile will open the file named out_fn and assign the open file to outfile, and close it for you as soon as you exit that indented block.
input is a builtin function in Python. I therefore renamed your variables so no builtin names get shadowed.
You're trying to write data directly to the output file. This won't work, since data is a list of lines. You need to join those lines first, in order to turn them into a single string again before writing it to a file.
So here's your code with all those issues addressed:
from operator import itemgetter
import sys

in_fn = sys.argv[1]
out_fn = sys.argv[2]

getkey = itemgetter(0, 1)

seen = set()
data = []

with open(in_fn, 'r') as infile:
    for line in infile:
        line = line.strip()
        key = getkey(line.split())
        if key not in seen:
            data.append(line)
            seen.add(key)

with open(out_fn, "w") as outfile:
    outfile.write('\n'.join(data))
Why is the above code causing an error?
Because you haven't opened the file: you are trying to work with the string input.txt rather than with the file. Then, when you try to access your item, you get a list index out of range error, because line.split() returns ['input.txt'].
How to fix that: open the file and then work with it, not with its name.
For example, you can do (I tried to stay as close to your code as possible)
input = sys.argv[1]
infile = open(input, 'r')
(...)
lines = infile.readlines()
infile.close()
for line in lines:
    (...)
Why is this not saving result?
Because you are opening/closing the file inside the loop. What you need to do is write the data once you're out of the loop. Also, you cannot directly write a list to a file. Hence, you need to do something like this (outside of your loop):
outfile = open(output, "w")
for item in data:
    outfile.write(item)
outfile.close()
All together
There are other ways of reading/writing files, and it is pretty well documented on the internet, but I tried to stay close to your code so that you can better understand what was wrong with it.
from operator import itemgetter
import sys

input = sys.argv[1]
infile = open(input, 'r')
output = sys.argv[2]

#Pass any column number you want, note that indexing starts at 0
ig = itemgetter(0,1)

seen = set()
data = []

lines = infile.readlines()
infile.close()

for line in lines:
    print line
    key = ig(line.split())
    if key not in seen:
        data.append(line)
        seen.add(key)

print data

outfile = open(output, "w")
for item in data:
    outfile.write(item)
outfile.close()
PS: it seems to produce the result that you needed here: Python to remove duplicates using only some, not all, columns

Python: Extracting lines from a file using another file as key

I have a 'key' file that looks like this (MyKeyFile):
afdasdfa
ghjdfghd
wrtwertwt
asdf
I call these keys, and they are identical to the first word of the lines that I want to extract from a 'source' file. So the source file (MySourceFile) would look something like this (1st column = the key, following columns = data):
afdasdfa (several tab delimited columns)
.
.
ghjdfghd ( several tab delimited columns)
.
wrtwertwt
.
.
asdf
And the '.' would indicate lines of no interest currently.
I am an absolute novice in Python and this is how far I've come:
with open('MyKeyFile','r') as infile, \
     open('MyOutFile','w') as outfile:
    for line in infile:
        for runner in source:
            # pick up the first word of the line in source
            # if match, print the entire line to MyOutFile
            # here I need help
    outfile.close()
I realize there may be better ways to do this. All feedback is appreciated - along my way of solving it, or along more sophisticated ones.
Thanks
jd
I think that this would be a cleaner way of doing it, assuming that your "key" file is called "key_file.txt" and your main file is called "main_file.txt"
keys = []

my_file = open("key_file.txt","r")  # r is for reading files, w is for writing to them.
for line in my_file.readlines():
    keys.append(line.strip())  # strip the trailing newline, or no line will ever match
# now you have a list of strings called keys.
# take each line from the main text file and check to see if it contains any portion of a given key.
my_file.close()

new_file = open("main_file.txt","r")
for line in new_file.readlines():
    for key in keys:
        if line.find(key) > -1:
            print "I FOUND A LINE THAT CONTAINS THE TEXT OF SOME KEY", line
You can modify the print function or get rid of it to do what you want with the desired line that contains the text of some key. Let me know if this works
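For example (a sketch along the same lines), to write the matching lines to MyOutFile instead of printing them:

with open("MyOutFile", "w") as out_file:
    new_file = open("main_file.txt", "r")
    for line in new_file.readlines():
        for key in keys:
            if line.find(key) > -1:
                out_file.write(line)
                break  # stop after the first matching key so each line is written only once
    new_file.close()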
As I understood (correct me in the comments if I am wrong), you have 3 files:
MySourceFile
MyKeyFile
MyOutFile
And you want to:
Read keys from MyKeyFile
Read source from MySourceFile
Iterate over lines in the source
If line's first word is in keys: append that line to MyOutFile
Close MyOutFile
So here is the Code:
with open('MySourceFile', 'r') as sourcefile:
    source = sourcefile.read().splitlines()

with open('MyKeyFile', 'r') as keyfile:
    keys = keyfile.read().split()

with open('MyOutFile', 'w') as outfile:
    for line in source:
        if line.split():
            if line.split()[0] in keys:
                outfile.write(line + "\n")

Why can't I repeat the 'for' loop for csv.Reader?

I am a beginner in Python. I am now trying to figure out why the second 'for' loop doesn't work in the following script: I only get the result of the first 'for' loop, and nothing from the second one. I copied and pasted my script and the CSV data below.
It would be helpful if you could tell me why it behaves this way and how to make the second 'for' loop work as well.
My SCRIPT:
import csv

file = "data.csv"
fh = open(file, 'rb')
read = csv.DictReader(fh)

for e in read:
    print(e['a'])

for e in read:
    print(e['b'])
"data.csv":
a,b,c
tree,bough,trunk
animal,leg,trunk
fish,fin,body
The csv reader is an iterator over the file. Once you go through it once, you read to the end of the file, so there is no more to read. If you need to go through it again, you can seek to the beginning of the file:
fh.seek(0)
This will reset the file to the beginning so you can read it again. Depending on the code, it may also be necessary to skip the field name header:
next(fh)
This is necessary for your code, since the DictReader consumed that line the first time around to determine the field names, and it's not going to do that again. It may not be necessary for other uses of csv.
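Putting those two pieces together for the script in the question (a small sketch; fh and read are the objects defined there):

fh.seek(0)  # rewind the underlying file to the beginning
next(fh)    # skip the header row; read already knows the field names
for e in read:
    print(e['b'])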
If the file isn't too big and you need to do several things with the data, you could also just read the whole thing into a list:
data = list(read)
Then you can do what you want with data.
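For the script in the question, that would look like this (a sketch):

data = list(read)  # consume the reader once, keep all the rows

for e in data:
    print(e['a'])
for e in data:  # fine now: data is a plain list, not an exhausted iterator
    print(e['b'])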
I have created a small function which takes the path of a CSV file, reads it, and returns a list of dicts at once; then you can loop through the list very easily.
import csv

def read_csv_data(path):
    """
    Reads CSV from the given path and returns a list of dicts with the mapping
    """
    data = csv.reader(open(path))
    # Read the column names from the first line of the file
    fields = data.next()
    data_lines = []
    for row in data:
        items = dict(zip(fields, row))
        data_lines.append(items)
    return data_lines
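Hypothetical usage of the helper above (Python 2, to match data.next()):

rows = read_csv_data("data.csv")
for row in rows:
    print row['a']
for row in rows:
    print row['b']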
Regards
