Python: Extracting lines from a file using another file as key

I have a 'key' file that looks like this (MyKeyFile), with one key per line:
afdasdfa
ghjdfghd
wrtwertwt
asdf
I call these keys and they are identical to the first word of the lines that I want to extract from a 'source' file. So the source file (MySourceFile) would look something like this (again, bad formatting, but 1st column = the key, following columns = data):
afdasdfa (several tab delimited columns)
.
.
ghjdfghd ( several tab delimited columns)
.
wrtwertwt
.
.
asdf
And the '.' would indicate lines of no interest currently.
I am an absolute novice in Python and this is how far I've come:
with open('MyKeyFile','r') as infile, \
        open('MyOutFile','w') as outfile:
    for line in infile:
        for runner in source:
            # pick up the first word of the line in source
            # if match, print the entire line to MyOutFile
            # here I need help
    outfile.close()
I realize there may be better ways to do this. All feedback is appreciated - along my way of solving it, or along more sophisticated ones.
Thanks
jd

I think that this would be a cleaner way of doing it, assuming that your "key" file is called "key_file.txt" and your main file is called "main_file.txt"
keys = []
with open("key_file.txt", "r") as key_file:  # "r" is for reading files, "w" is for writing to them
    for line in key_file:
        key = line.strip()  # drop the trailing newline, otherwise no line will ever match
        if key:
            keys.append(key)
# now you have a list of strings called keys.
# take each line from the main text file and check whether it contains any of the keys.
with open("main_file.txt", "r") as main_file:
    for line in main_file:
        for key in keys:
            if key in line:
                print("I FOUND A LINE THAT CONTAINS THE TEXT OF SOME KEY", line)
You can replace the print call with whatever you want to do with the matching line. Note that this matches a key anywhere in the line, not only as the first word. Let me know if this works.

As I understood (correct me in the comments if I am wrong), you have 3 files:
MySourceFile
MyKeyFile
MyOutFile
And you want to:
Read keys from MyKeyFile
Read source from MySourceFile
Iterate over lines in the source
If line's first word is in keys: append that line to MyOutFile
Close MyOutFile
So here is the Code:
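with open('MySourceFile', 'r') as sourcefile:
    source = sourcefile.read().splitlines()
with open('MyKeyFile', 'r') as keyfile:
    keys = set(keyfile.read().split())  # a set makes the membership test fast
with open('MyOutFile', 'w') as outfile:
    for line in source:
        words = line.split()
        # skip blank lines, then compare the first word against the keys
        if words and words[0] in keys:
            outfile.write(line + "\n")
# the with block closes MyOutFile automatically, so no explicit close() is needed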
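The steps above can also be done without loading the whole source file into memory. This is a minimal sketch, not a definitive implementation: the function name `extract_lines` is hypothetical, and it assumes the key is the first whitespace-delimited word of each source line.

```python
def extract_lines(key_path, source_path, out_path):
    """Copy lines of source_path whose first word is a key listed in key_path."""
    with open(key_path, "r") as keyfile:
        keys = set(keyfile.read().split())  # a set gives fast membership tests

    matched = 0
    with open(source_path, "r") as source, open(out_path, "w") as outfile:
        for line in source:  # one line at a time, so large files are fine
            fields = line.split(None, 1)  # split off the first word only
            if fields and fields[0] in keys:
                outfile.write(line)  # the line still carries its own newline
                matched += 1
    return matched
```

Calling `extract_lines('MyKeyFile', 'MySourceFile', 'MyOutFile')` would write the matching lines and return how many were written.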

Related

Python: Inserting content of entire textdocument after a certain string in 2nd document

I'm pretty new to Python, but I've been trying to get into some programming in my free time. Currently, I'm dealing with the following problem:
I have 2 documents, 1 and 2. Both have text in them.
I want to search document 1 for a specific string. When I locate that string, I want to insert all the content of document 2 in a line after the specific string.
Before insertion:
Document 1 content:
text...
SpecificString
text...
After insertion:
Document 1 content:
text...
SpecificString
Document 2 content
text...
I've been trying different methods, but none are working; they keep deleting all the content of document 1 and replacing it. YouTube and Google haven't yielded any desirable results; maybe I'm just looking in the wrong places.
I tried different things; this is one example:
f1 = '/Users/Win10/Desktop/Pythonprojects/oldfile.txt'
f2 = '/Users/Win10/Desktop/Pythonprojects/newfile.txt'
searchString=str("<\module>")
with open(f1, "r") as moduleinfo, open(f2, "w") as newproject:
    new_contents = newproject.readlines()
    #Now prev_contents is a list of strings and you may add the new line to this list at any position
    if searchString in f1:
        new_contents.insert(0,"\n")
        new_contents.insert(0,moduleinfo)
        #new_file.write("\n".join(new_contents))
The code simply deleted the content of document 1.
You can find useful answers in these related questions: How do I write to the middle of a text file while reading its contents?, Can you write to the middle of a file in python?, and Adding lines after specific line.
One approach is to read the file first to find the index where the insert must go, and then overwrite the file with the combined content:
a) File2 = File2[:key_index] + File1 + File2[key_index:]
Another option, explained in Adding lines after specific line:
with open(file, "r") as in_file:
    buf = in_file.readlines()
with open(file, "w") as out_file:
    for line in buf:
        if line == "YOUR SEARCH\n":
            line = line + "Include below\n"
        out_file.write(line)
Please tell us your final approach.
KR,
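Your code opens the file in write mode ('w'), which truncates it as soon as it is opened; that is why the content disappears. Append mode keeps the existing content, but it only adds text at the end of the file, and on some systems every write goes to the end regardless of the current seek position.
You can enter append mode by replacing the 'w' with 'a'. To insert text in the middle of a file, though, read the file, build the new content in memory, and write it back.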
Thanks for your input, it put me on the right track. I ended up going with the following:
f2 = '/Users/Win10/Desktop/Pythonprojects/newfile.txt'
f1 = '/Users/Win10/Desktop/Pythonprojects/oldfile.txt'
with open(f2) as file:
    original = file.read()
with open(f1) as infile:  # renamed from "input", which shadows the built-in function
    myinsert = infile.read()
newfile = original.replace("</Module>", "</Module>\n"+myinsert)
with open(f2, "w") as replaced:
    replaced.write(newfile)
The text from oldfile is inserted into newfile on a new line, under the "</Module>" string. I'll follow up if I find better solutions. Again, thank you for your answers.
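For comparison, here is a line-based sketch of the same insertion. The function name `insert_after_marker` is hypothetical, and unlike the `replace` call above it inserts only after the first line that contains the marker and leaves the file untouched when the marker is missing.

```python
def insert_after_marker(target_path, insert_path, marker):
    """Insert the contents of insert_path after the first line containing marker."""
    with open(insert_path, "r") as f:
        payload = f.read()
    if payload and not payload.endswith("\n"):
        payload += "\n"  # keep the inserted text on its own lines

    with open(target_path, "r") as f:
        lines = f.readlines()

    for i, line in enumerate(lines):
        if marker in line:
            if not line.endswith("\n"):
                lines[i] = line + "\n"  # marker was on the last line, without a newline
            lines.insert(i + 1, payload)
            break
    else:
        return False  # marker not found, file left untouched

    # "w" truncates the file, which is safe here because we hold the full content
    with open(target_path, "w") as f:
        f.writelines(lines)
    return True
```

Reading everything before reopening the file for writing is what prevents the "content deleted" problem described in the question.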

How to match text in two different file and extract values

So I have two files: a YAML file that contains Tibetan words and their meanings (word: meaning), and a CSV file that contains only each word and its POS tag, as below:
yaml file :
ད་གདོད: ད་གཟོད་དང་དོན་འདྲ།
ད་ཆུ: དངུལ་ཆུ་ཡི་མིང་གཞན།
ད་ཕྲུག: དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན།
ད་བེར: སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།
ད་མེ་དུམ་མེ: དམ་དུམ་ལ་ལྟོས།
csv file :
ད་ཆུ PART
ད་གདོད DET
Desired output:
ད་ཆུ PART དངུལ་ཆུ་ཡི་མིང་གཞན།
ད་གདོད DET ད་གཟོད་དང་དོན་འདྲ།
Any idea on how to make text match from csv file to yaml file and extract its meaning in csv?
From a functional point of view, you have:
a dictionary, meaning here a key: value thing
a list of words to search in that dictionary, and that will produce a record
If everything can fit in memory, you can first read the yaml file to produce a Python dictionary, and then read the words file, one line at a time and use the above dictionary to generate the expected line. If the yaml file is too large, you could use the dbm (or shelve) module as an on disk dictionary.
As you have not shown any code, I cannot either... I can just say that you can simply process the second file as plain text and read it one line at a time. For the first one, you can either look for a YAML module on PyPI or, if the syntax is always as simple as the lines you have shown, process it as text one line at a time and use split to extract the key and the value.
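The plain-text option can be sketched as follows. This is not a YAML parser: it assumes every entry sits on one line in the form `word: meaning`, and the function name `load_simple_mapping` is hypothetical.

```python
def load_simple_mapping(path):
    """Parse lines of the form 'word: meaning' into a dict (not a full YAML parser)."""
    mapping = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            # partition splits on the first colon only, so colons in the meaning survive
            word, sep, meaning = line.partition(":")
            if sep:  # skip lines that contain no colon at all
                mapping[word.strip()] = meaning.strip()
    return mapping
```

The resulting dict can then be queried with `mapping.get(word)` for each word read from the second file.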
Assuming your files are called dict.yaml and input.csv.
You can start by turning the YAML file into a dictionary with
import yaml
with open('dict.yaml', 'r', encoding='utf-8') as file:
    trans_dict = yaml.safe_load(file)
Which should give you
>>> trans_dict
{'ད་གདོད': 'ད་གཟོད་དང་དོན་འདྲ།',
'ད་ཆུ': 'དངུལ་ཆུ་ཡི་མིང་གཞན།',
'ད་ཕྲུག': 'དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན།',
'ད་བེར': 'སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།',
'ད་མེ་དུམ་མེ': 'དམ་དུམ་ལ་ལྟོས།'}
Then, you can iterate over the lines in the CSV and use the dictionary to get the definition:
outputs = []
with open('input.csv', 'r', encoding='utf-8') as file:
    for line in file:
        term = line.strip()
        definition = trans_dict.get(term)
        outputs.append(
            term if definition is None
            else f"{term} {definition}"
        )
From here, if each line of the CSV holds only a word, your outputs variable should contain ['ད་ཆུ དངུལ་ཆུ་ཡི་མིང་གཞན།', 'ད་གདོད ད་གཟོད་དང་དོན་འདྲ།']. If you optionally wanted to write this out to a file, you could do
with open('output.txt', 'w', encoding='utf-8') as file:
    file.write('\n'.join(outputs))
If you had more tokens on each line of the CSV (unclear from your post), you could iterate over those tokens within a line, but you'd be able to apply basically the same approach.
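A sketch of that multi-token case, matching the word-plus-POS-tag lines in the question. It assumes the columns are separated by whitespace (adjust the split if the file uses commas), and the function name `annotate` is hypothetical.

```python
def annotate(lines, meanings):
    """Append the meaning of each line's first token, when one is known."""
    result = []
    for line in lines:
        row = line.strip()
        if not row:
            continue  # ignore blank lines
        word = row.split()[0]  # first token is the word; the rest (POS tag) is kept as-is
        meaning = meanings.get(word)
        result.append(row if meaning is None else f"{row} {meaning}")
    return result
```

With the dictionary loaded from the YAML file, `annotate(open('input.csv', encoding='utf-8'), trans_dict)` would yield lines in the word, tag, meaning order shown in the desired output.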
The easiest solution that came to my mind would be iterating over all lines in the YAML-file and checking if the word is inside the CSV-file:
YAML_LINES = "ད་གདོད: ད་གཟོད་དང་དོན་འདྲ།\nད་ཆུ: དངུལ་ཆུ་ཡི་མིང་གཞན\nད་ཕྲུག: དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན\nད་བེར: སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།\nད་མེ་དུམ་མེ: དམ་དུམ་ལ་ལྟོས།".split("\n")
CSV_LINES = "ད་ཆུ\nད་གདོད".split("\n")
for line in YAML_LINES:
    word, meaning = line.split(": ")
    if word in CSV_LINES:
        output = word + " " + meaning
        print(output)
The YAML_LINES and CSV_LINES lists are only to provide a quick and dirty example.

update records in file2 with data found in file1

There is a large fixed-width file, file1. Another file, file2, is a CSV with ids and values; these are used to update specific portions of the record with the same id in file1. Here is my attempt. I'd really appreciate any help you can offer to make this work.
file2 comma separated
clr,code,type
Red,1001,1
Red,2001,2
Red,3001,3
blu,1002,1
blu,2002,2
blu,3002,3
file1 (fixed width format)
clrtyp1typ2typ3notes
red110121013101helloworld
blu110221023102helloworld2
file1 needs to be updated to the following:
clrtyp1typ2typ3notes
red100120013001helloworld
blu100220023002helloworld2
Please note that both files are fairly large (multiple GB each). I am a Python noob, so please excuse any gross mistakes. I'd really appreciate any help you could offer.
import shutil
#read both input files
file1=open("file1.txt",'r').read()
file2='file2.txt'
#make a copy of the input file to make edits to it.
file2Edit=file2+'.EDIT'
shutil.copy(file2, baseEdit)
baseEditFile = open(baseEdit,'w').read()
#go thru eachline, pick clr from file1 and look for it in file2, if found, form a string to be replaced and replace the original line.
with open('file2.txt','w') as f:
    for line in f:
        base_clr = line[:3]
        findindex = file1.find(base_recid)
        if findindex != -1:
            for line2 in file1:
                #print(line)
                clr = line2.split(",")[0]
                code = line2.split(",")[1]
                type = line2.split(",")[2]
                if keytype = 1:
                    finalline=line[:15]+string.rjust(keyid, 15)+line[30:]
                    baseEditFile.write( replace(line,finalline)
                    baseEditFile.replace(line,finalline)
If I get you right, you need something like this:
# declare file names and necessary lists
file1 = "file1.txt"
file2 = "file2.txt"
file1_new = "file1.txt.EDIT"
clrs = {}
# read clrs to update
with open(file1, "r") as f:
    # skip header line
    next(f)
    for line in f:
        clrs[line[:3]] = []
# read the new codes
with open(file2, "r") as f:
    # skip header
    next(f)
    for line in f:
        current = line.strip().split(",")
        key = current[0].lower()
        if key in clrs:
            clrs[key].append(current[1])
# write the new lines (old codes replaced with the new ones) to new file
with open(file1, "r") as f_in:
    with open(file1_new, "w") as f_out:
        # writes header
        f_out.write(next(f_in))
        for line in f_in:
            line_new = list(line)
            key = line[:3]
            # checks if new codes were found for that key
            if clrs.get(key):
                # replaces old codes by the new codes
                line_new[3:15] = "".join(clrs[key])
            f_out.write("".join(line_new))
This works only for the given example. If your file has another format in real use, you have to adjust the indices used.
This little script first opens your file1, iterates over it, and adds the clr as a key to a dictionary. The value for that key is an empty list.
Then it opens file2, and iterates over every clr here. If the clr is in the dictionary, it appends the code to the list. So after running this part, the dictionary contains key, value pairs, where the keys are the clr's and the values are lists containing the codes (in the order that was given by the file).
And in the last part of the script, every line of file1.txt is written to file1.txt.EDIT. Before writing, the old codes are replaced by the new ones.
The codes saved in file2.txt have to be in the same order as they are saved in file1.txt. If the order can be different, or if there may be more codes in file2.txt than you need to replace in file1.txt, you need to add a check that picks the correct codes. That's no big deal, but this script will solve your problem for the files you gave us as an example.
If you have any questions or need more help, feel free to ask.
EDIT: Besides some syntactic mistakes and wrong method calls in your question's code, you shouldn't read the whole contents of a file at once, especially if you know the files can get very large. This consumes a lot of memory and may cause the program to run very slowly. That's why iterating line by line is much better. The example I provided reads only one line of the file at a time and writes it to the new file directly, instead of holding both old files and the new file in memory and writing everything as the last step.
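If the order of the codes in file2.txt cannot be relied on, one way to add that check is to key each code by its colour and type. This is a sketch under assumptions taken from the example data only: a 3-character clr followed by three 4-character codes, and a CSV header of clr,code,type. The names `load_codes` and `update_fixed_width` are hypothetical.

```python
import csv

def load_codes(csv_path):
    """Map (clr, type) -> code from a CSV with the header clr,code,type."""
    codes = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            codes[(row["clr"].lower(), int(row["type"]))] = row["code"]
    return codes

def update_fixed_width(src_path, dst_path, codes, width=4):
    """Rewrite each record, replacing the typ1..typ3 slots that have a new code."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.write(next(src, ""))  # copy the header line unchanged
        for line in src:
            clr = line[:3]
            out = [clr]
            for typ in (1, 2, 3):
                start = 3 + (typ - 1) * width
                old = line[start:start + width]
                # keep the old code when file2 has no entry for this clr and type
                out.append(codes.get((clr, typ), old))
            out.append(line[3 + 3 * width:])  # notes and the trailing newline
            dst.write("".join(out))
```

The lookup table still has to fit in memory, as in the script above; only the fixed-width file is streamed. New codes are assumed to be exactly `width` characters long, otherwise the columns shift.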

Writing to the end of specific line in python

I have a text file that contains key value pairs separated by a tab like this:
KEY\tVALUE
I have opened this file in append mode (a+) so I can both read and write. Now it may happen that a particular key has more than one value. In that case I want to go to that particular key and write the next value beside the original one, separated by some delimiter (such as ,).
Here is what I wish to do:
import io
ft = io.open("test.txt",'a+')
ft.seek(0)
for line in ft:
    if (line.split('\t')[0] == "querykey"):
        ft.write(unicode("nextvalue"));#Write the another key value beside the original one
Now there are two problems with it:
I will iterate through the file to see on which line the key is present(Is there a faster way?)
I will write a string to the end of that line.
I would be grateful if I can get help with the second point.
The write function always writes at the end of file. How should I write to the end of a specific line? I have searched and have not got very clear answers as to how to do that
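You can read the whole file, make your edit in memory, and write the edited content back to the file.
with open('test.txt') as f:
    lines = f.read().splitlines()  # lines without their trailing newlines
with open('test.txt', 'w') as f:  # reopen the file for writing
    for i, line in enumerate(lines):
        if line.split('\t')[0] == "querykey":
            lines[i] = line + ',newkey'  # update the list, not just the loop variable
    f.write('\n'.join(lines) + '\n')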
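On the question of a faster way to find the key: if the file fits in memory, loading it into a dict once avoids scanning for every update. The sketch below assumes tab-separated KEY and VALUE fields, values that never contain the delimiter, and Python 3.7+ so that the dict keeps the original key order. The names `load_pairs` and `save_pairs` are hypothetical.

```python
def load_pairs(path, sep=","):
    """Read KEY<TAB>VALUE lines into a dict mapping key -> list of values."""
    pairs = {}
    with open(path, "r") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # ignore blank lines
            key, _, values = line.partition("\t")
            pairs.setdefault(key, []).extend(v for v in values.split(sep) if v)
    return pairs

def save_pairs(path, pairs, sep=","):
    """Write the dict back, one KEY<TAB>v1,v2,... line per key."""
    with open(path, "w") as f:
        for key, values in pairs.items():
            f.write(f"{key}\t{sep.join(values)}\n")
```

Adding a value is then `pairs.setdefault("querykey", []).append("nextvalue")`, followed by one `save_pairs` call after all updates are done.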

Edit and save file

I need to edit my file and save it so that I can use it for another program. First I need to put "," between every word and add a word at the end of every line.
To put "," between every word, I used this:
for line in open('myfile','r+') :
    for word in line.split():
        new = ",".join(map(str,word))
        print new
I'm not sure how to overwrite the original file, or maybe create a new output file for the edited version. I tried something like this:
with open('myfile','r+') as f:
    for line in f:
        for word in line.split():
            new = ",".join(map(str,word))
            f.write(new)
The output is not what I wanted (it differs from the output of print new).
Second, I need to add a word at the end of every line. So, i tried this
source = open('myfile','r')
output = open('out','a')
output.write(source.read().replace("\n", "yes\n"))
The code to add a new word works perfectly. But I was thinking there should be an easier way to open a file, make both edits in one go, and save it; I'm just not sure how. I've spent a tremendous amount of time trying to figure out how to overwrite the file, and it's about time I asked for help.
Here you go:
source = open('myfile', 'r')
output = open('out','w')
output.write('yes\n'.join(','.join(line.split()) for line in source.read().split('\n')))
One-liner:
open('out', 'w').write('yes\n'.join(','.join(line.split()) for line in open('myfile', 'r').read().split('\n')))
Or more legibly:
source = open('myfile', 'r')
processed_lines = []
for line in source:
    # split() drops the trailing newline, so add it back together with 'yes'
    line = ','.join(line.split()) + 'yes\n'
    processed_lines.append(line)
output = open('out', 'w')
output.write(''.join(processed_lines))
output.close()
EDIT
Apparently I misread everything, lol.
# It looks like you are writing the word yes to all of the lines, then splitting
# each word into letters and listing each word's letters on their own line?
source = open('myfile','r')
output = open('out','w')
for line in source:
    for word in line.split():
        new = ",".join(word)
        print(new, file=output)
    print('y,e,s', file=output)
output.close()
How big is this file?
Maybe you could create a temporary list containing everything from the file you want to edit, with each element representing one line.
Editing a list of strings is pretty simple.
After your changes you can open your file again with
writable = open('configuration', 'w')
and then write each changed line to the file with
writable.write(currentLine + '\n')
Hope that helps, even a little bit. ;)
For the first problem, you could read all the lines in f before overwriting f, assuming f is opened in 'r+' mode. Append all the results into a string, then execute:
f.seek(0) # reset file pointer back to start of file
f.write(new) # new should contain all concatenated lines
f.truncate() # get rid of any extra stuff from the old file
f.close()
For the second problem, the solution is similar: Read the entire file, make your edits, call f.seek(0), write the contents, f.truncate() and f.close().
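Both edits can be combined in one pass with that read, seek, write, truncate sequence. This is a minimal sketch: the function name `rewrite_in_place` and the `suffix` parameter are hypothetical, and the whole file is assumed to fit in memory.

```python
def rewrite_in_place(path, suffix="yes"):
    """Comma-join the words of each line and append suffix, editing the file in place."""
    with open(path, "r+") as f:
        lines = f.read().splitlines()  # read everything before writing anything
        new = "".join(",".join(line.split()) + suffix + "\n" for line in lines)
        f.seek(0)     # reset the file pointer to the start of the file
        f.write(new)
        f.truncate()  # cut off leftover text if the new content is shorter
```

The with block closes the file, so no explicit `f.close()` is needed. Pass `suffix=",yes"` if the added word should be comma-separated like the others.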
