Searching strings from one file in another file and retrieving the line - python

I have two files named A and B
A file looks like this:
1_A
2_B
3_C
4_D
5_E
B file looks like this:
C
D
so I wrote a small script that uses file B to search file A for the corresponding lines that contain "C" and "D".
here is my code:
import re
f = open("fileA", "r")
t = open("fileB", "r")
for line1 in f:
    for line2 in t:
        if line2 in line1:
            print(line1)
But the result was blank, does anyone have any ideas? Many thanks!

After the first iteration, the file pointer in file B is at end of file, and you can't read anything more from it.
Trivial solutions involve rewinding file B or equivalently opening it inside the loop and closing it after each iteration. The I/O overhead is somewhat prohibitive, though.
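As a sketch (recreating the example files from the question, purely for illustration), the rewind approach looks like this:

```python
# Recreate the sample files from the question, purely for illustration.
with open("fileA", "w") as f:
    f.write("1_A\n2_B\n3_C\n4_D\n5_E\n")
with open("fileB", "w") as f:
    f.write("C\nD\n")

matches = []
with open("fileA") as f, open("fileB") as t:
    for line1 in f:
        for line2 in t:
            if line2.rstrip("\n") in line1:
                matches.append(line1.rstrip("\n"))
        t.seek(0)  # rewind fileB so the next outer iteration can re-read it
print(matches)  # ['3_C', '4_D']
```

Stripping the newline from each search term also avoids missing a match when fileB's last line has no trailing newline.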
On the other hand, reading both files into memory so you can compare them is not very scalable, especially if the files are big.
The usual compromise is to read the smaller file into memory, then process the bigger file one line at a time.
with open("fileB", "r") as t:
    terms = [x.rstrip('\n') for x in t]
with open("fileA", "r") as f:
    for line in f:
        if any(term in line for term in terms):
            print(line)
If the files are too big for this, you might want to split file B into smaller chunks and do multiple passes, or, if at least one of the files is fairly static, look at using a database.
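A chunked, multi-pass version might look like this sketch (the chunk size and file contents are invented for illustration):

```python
from itertools import islice

# Recreate the sample files from the question, purely for illustration.
with open("fileA", "w") as f:
    f.write("1_A\n2_B\n3_C\n4_D\n5_E\n")
with open("fileB", "w") as f:
    f.write("C\nD\n")

CHUNK = 1  # unrealistically small chunk size, just to force multiple passes
matches = set()
with open("fileB") as t:
    while True:
        terms = [x.rstrip("\n") for x in islice(t, CHUNK)]
        if not terms:
            break
        with open("fileA") as f:  # one full pass over fileA per chunk of fileB
            for line in f:
                if any(term in line for term in terms):
                    matches.add(line.rstrip("\n"))
print(sorted(matches))  # ['3_C', '4_D']
```

Each chunk of fileB costs one full pass over fileA, so pick the largest chunk that fits comfortably in memory.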

Use readlines() after opening the files:
f = open("FileA", "r").readlines()
t = open("FileB", "r").readlines()
for line1 in f:
    for line2 in t:
        if line2 in line1:
            print(line1)

You may try this with readlines():
a_lines = open('FileA.txt', 'r').readlines()
b_lines = open('FileB.txt', 'r').readlines()
[a_line.strip() for b_line in b_lines for a_line in a_lines if b_line in a_line]
# Returns ['3_C', '4_D']

You can open the second file and save its first column to a list, then open the first file and check whether the character at line[2] of each line appears in that list:
list_ = []
with open('second', 'r') as f:
    for line in f:
        list_.append(line.split()[0])
final_list = []
with open('first', 'r') as f:
    for line in f:
        if line[2] in list_:
            final_list.append(line.split()[0])
print(final_list)
output:
['3_C', '4_D']

To be in line with your code, you can try the following:
f = open("fileA", "r")
t = open("fileB", "r")
for line1 in f:
    for line2 in t:
        if line2 in line1:
            print(line1)
    t.seek(0)  # reset the file pointer after going through the entire file

Related

How to get first letter of each line in python?

Here is what I have. I open the txt file like this:
f = open('data.txt', 'r')
print(f.read())
The output shows ['Cat\n','Dog\n','Cat\n','Dog\n', ...]
But I would like to get this:
['C\n','D\n','C\n','D\n', ...]
First you'll want to open the file in read mode (r flag in open), then you can iterate through the file object with a for loop to read each line one at a time. Lastly, you want to access the first element of each line at index 0 to get the first letter.
first_letters = []
with open('data.txt', 'r') as f:
    for line in f:
        first_letters.append(line[0])
print(first_letters)
If you want to keep the newline character in each string, change the append call above to:
first_letters.append(line[0] + '\n')
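For what it's worth, the same idea fits in a list comprehension (data.txt is recreated here from the question, purely for illustration):

```python
# data.txt recreated from the question, purely for illustration.
with open("data.txt", "w") as f:
    f.write("Cat\nDog\nCat\nDog\n")

with open("data.txt") as f:
    first_letters = [line[0] for line in f]
print(first_letters)  # ['C', 'D', 'C', 'D']
```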
f = open("data.txt", "r")
for x in f:
    print(x[0])
f.close()

Comparing two lines from two text files according to a single part of the text file

I have two text files and I want to write out two new text files according to whether there is a common section to each line in the two original text files.
The format of the text files is as follows:
commontextinallcases uniquetext2 potentiallycommontext uniquetext4
There are more than 4 columns but you get the idea. I want to check the 'potentiallycommontext' part in each text file and if they are the same write out the whole line of each text file to a new text file for each with its own unique text still in place.
Splitting it is fairly easy just using the .split() method when reading it in. I have found the following code:
with open('some_file_1.txt', 'r') as file1:
    with open('some_file_2.txt', 'r') as file2:
        same = set(file1).intersection(file2)
same.discard('\n')
with open('some_output_file.txt', 'w') as file_out:
    for line in same:
        file_out.write(line)
But I am not sure this would work for my case where I need to split the lines. Is there a way to do this I am missing?
Thanks
I don't think this set approach is suitable for your case.
I'd try something like:
with open('some_file_1.txt', 'r') as file1, open('some_file_2.txt', 'r') as file2, open('some_output_file.txt', 'w') as file_out:
    for line1, line2 in zip(file1, file2):
        if line1.split()[2] == line2.split()[2]:
            file_out.write(line1)
            file_out.write(line2)
There might be shorter solutions but this should work:
PCT_IDX = _  # find which index of line.split() corresponds to potentiallycommontext

def lines(filename):
    with open(filename, 'r') as file:
        for line in file:
            yield line.rstrip('\n')

lines_1 = lines('some_file_1.txt')
lines_2 = lines('some_file_2.txt')

with open('some_output_file.txt', 'w') as file_out:
    for (line_1, line_2) in zip(lines_1, lines_2):
        maybe_cmn1 = line_1.split()[PCT_IDX]
        maybe_cmn2 = line_2.split()[PCT_IDX]
        if maybe_cmn1 == maybe_cmn2:
            file_out.write(line_1 + '\n')
            file_out.write(line_2 + '\n')
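A quick self-contained check of the zip-and-compare idea (the sample rows and column index 2 are invented for illustration):

```python
# Invented sample rows; column index 2 plays the potentiallycommontext field.
with open("some_file_1.txt", "w") as f:
    f.write("common u2a SAME u4a\ncommon u2b DIFF1 u4b\n")
with open("some_file_2.txt", "w") as f:
    f.write("common x2a SAME x4a\ncommon x2b DIFF2 x4b\n")

matched = []
with open("some_file_1.txt") as f1, open("some_file_2.txt") as f2:
    for line1, line2 in zip(f1, f2):
        # keep both lines whenever the shared column agrees
        if line1.split()[2] == line2.split()[2]:
            matched.append(line1.rstrip("\n"))
            matched.append(line2.rstrip("\n"))
print(matched)  # ['common u2a SAME u4a', 'common x2a SAME x4a']
```

Note this only compares line N of file 1 against line N of file 2; it assumes matching lines sit at the same position in both files.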

Python read .txt and split words after symbol #

I have a large 11 GB .txt file with email addresses. I would like to save only the part of each string before the # symbol, one per line. My output only generates the first line. I reused this code from an earlier project. I would like to save the output in a different .txt file. I hope someone can help me out.
my code:
import re

def get_html_string(file, start_string, end_string):
    answer = "nothing"
    with open(file, 'rb') as open_file:
        for line in open_file:
            line = line.rstrip()
            if re.search(start_string, line):
                answer = line
                break
    start = answer.find(start_string) + len(start_string)
    end = answer.find(end_string)
    #print(start, end, answer)
    return answer[start:end]

beginstr = ''
end = '#'
file = 'test.txt'
readstring = str(get_html_string(file, beginstr, end))
print(readstring)
Your file is quite big (11 GB), so you shouldn't keep all those strings in memory. Instead, process the file line by line, writing each result before reading the next line.
This should be efficient:
with open('test.txt', 'r') as input_file:
    with open('result.txt', 'w') as output_file:
        for line in input_file:
            prefix = line.split('#')[0]
            output_file.write(prefix + '\n')
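A quick check of this streaming approach with a tiny invented sample (file names from the answer):

```python
# Tiny invented sample; same logic as the streaming answer.
with open("test.txt", "w") as f:
    f.write("user#google.com\nuser2#jshds.com\n")

with open("test.txt") as input_file, open("result.txt", "w") as output_file:
    for line in input_file:
        # keep only the part before the '#'
        output_file.write(line.split("#")[0] + "\n")

with open("result.txt") as f:
    result = f.read()
print(result)
```

This holds only one line in memory at a time, so it works the same on an 11 GB input.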
If your file looks like this example:
user#google.com
user2#jshds.com
Useruser#jsnl.com
You can use this:
def get_email_name(file_name):
    with open(file_name) as file:
        lines = file.readlines()
    result = list()
    for line in lines:
        result.append(line.split('#')[0])
    return result

get_email_name('emails.txt')
get_email_name('emails.txt')
Out:
['user', 'user2', 'Useruser']

python: Open file, edit one line, save it as the same file

I want to open a file, search for a specific word, change the word and save the file again. Sounds really easy - but I just can't get it working... I know that I have to overwrite the whole file but only change this one word!
My Code:
from numpy import arange  # arange presumably comes from numpy

f = open('./myfile', 'r')
linelist = f.readlines()
f.close()
for line in linelist:
    i = 0
    if 'word' in line:
        for number in arange(0, 1, 0.1):
            myNumber = 2 - number
            myNumberasString = str(myNumber)
            myChangedLine = line.replace('word', myNumberasString)
            f2 = open('./myfile', 'w')
            f2.write(line)
            f2.close()
            # here I have to do some stuff with these files so there is a reason
            # why everything is in this for loop. And I know that it will
            # overwrite the file every loop and that is good so. I want that :)
If I make it like this, the 'new' myfile file contains only the changed line. But I want the whole file with the changed line... Can anyone help me?
**** EDIT ****
I fixed it! I just turned the loops around and now it works perfectly like this:
f = open('myfile', 'r')
text = f.readlines()
f.close()
i = 0
for number in arange(0, 1, 0.1):
    fw = open('mynewfile', 'w')
    myNumber = 2 - number
    myNumberasString = str(myNumber)
    for line in text:
        if 'word' in line:
            line = line.replace('word', myNumberasString)
        fw.write(line)
    fw.close()
    # do my stuff here where I need all these input files
You just need to write out all the other lines as you go. As I said in my comment, I don't know what you are really trying to do with your replace, but here's a slightly simplified version in which we're just replacing all occurrences of 'word' with 'new':
f = open('./myfile', 'r')
linelist = f.readlines()
f.close()

# Re-open file here
f2 = open('./myfile', 'w')
for line in linelist:
    line = line.replace('word', 'new')
    f2.write(line)
f2.close()
Or using contexts:
with open('./myfile', 'r') as f:
    lines = f.readlines()
with open('./myfile', 'w') as f:
    for line in lines:
        line = line.replace('word', 'new')
        f.write(line)
Use fileinput passing in whatever you want to replace:
import fileinput
for line in fileinput.input("in.txt", inplace=True):
    print(line.replace("whatever", "foo"), end="")
You don't seem to be doing anything special in your loop that cannot be calculated first outside the loop, so create the string you want to replace the word with and pass it to replace.
inplace=True will mean the original file is changed. If you want to verify everything looks ok then remove the inplace=True for the first run and you will actually see the replaced output instead of the lines being written to the file.
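A self-contained sketch of that workflow (the file name and contents are invented for illustration):

```python
import fileinput
import pathlib

# Invented input file, purely for illustration.
pathlib.Path("in.txt").write_text("replace whatever here\n")

# With inplace=True, print() output inside the loop is redirected
# into in.txt itself, replacing its contents.
for line in fileinput.input("in.txt", inplace=True):
    print(line.replace("whatever", "foo"), end="")

result = pathlib.Path("in.txt").read_text()
print(result)  # replace foo here
```

Note that nothing you print inside the loop reaches the terminal while inplace mode is active; stdout is restored once the iteration finishes.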
If you want to write to a temporary file, you can use a NamedTemporaryFile with shutil.move:
from tempfile import NamedTemporaryFile
from shutil import move

with open("in.txt") as f, NamedTemporaryFile("w", dir=".", delete=False) as out:
    for line in f:
        out.write(line.replace("whatever", "foo"))
move(out.name, "in.txt")  # replace the original with the temporary file
One problem you may encounter is replace matching substrings; if you know the word always appears surrounded by whitespace you could build that into the match, but if not you will need to split and check every word.
from tempfile import NamedTemporaryFile
from shutil import move
from string import punctuation

with open("in.txt") as f, NamedTemporaryFile("w", dir=".", delete=False) as out:
    for line in f:
        out.write(" ".join(word if word.strip(punctuation) != "whatever" else "foo"
                           for word in line.split()) + "\n")
move(out.name, "in.txt")  # replace the original with the temporary file
There are three issues with your current code. First, create the f2 file handle before starting the loop, otherwise you'll overwrite the file in each iteration. Second, you are writing an unmodified line in f2.write(line); I guess you meant f2.write(myChangedLine)? Third, you should add an else statement that writes unmodified lines to the file. So:
from numpy import arange  # arange presumably comes from numpy

f = open('./myfile', 'r')
linelist = f.readlines()
f.close()

f2 = open('./myfile', 'w')
for line in linelist:
    i = 0
    if 'word' in line:
        for number in arange(0, 1, 0.1):
            myNumber = 2 - number
            myNumberasString = str(myNumber)
            myChangedLine = line.replace('word', myNumberasString)
            f2.write(myChangedLine)
    else:
        f2.write(line)
f2.close()

How to get the same part of file content?

I have two files, and I want to display on screen the content that exists in both file 1 and file 2. But it seems nothing is displayed (it should display オレンジ). What is the problem?
thanks
File 1
リンゴ
バナナ
オレンジ
File 2
オレンジ
Here is my code
import sys

File1 = open(sys.argv[1], "r", encoding="UTF-8")
F1_Content = File1.readlines()
F1_Content = map(lambda e: e.rstrip("\n"), F1_Content)
File2 = open(sys.argv[2], "r", encoding="UTF-8")
F2_Content = File2.readlines()
F2_Content = map(lambda e: e.rstrip("\n"), F2_Content)
for line in F1_Content:
    print(repr(line))
    if line in F2_Content:
        print(line)
File1.close()
File2.close()
Output
'\ufeff
''
''
You probably have more whitespace in one of the files than just a newline. You could loop over either F1_Content or F2_Content, printing the representation of each line with print(repr(line)) or print(line.encode('unicode_escape')) to make it easier to spot how the lines differ.
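For example, a sketch with made-up lines shows how repr() exposes a trailing space and a BOM that look identical when printed normally:

```python
# Made-up lines: visually near-identical when printed, but different strings.
lines = ["オレンジ\n", "オレンジ \n", "\ufeffオレンジ\n"]
reprs = [repr(line) for line in lines]
for r in reprs:
    print(r)
# The trailing space and the BOM (\ufeff) only show up in the repr form.
```

The \ufeff in your own output suggests the first file starts with a byte order mark; opening it with encoding="utf-8-sig" would strip that.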
I'd strip the lines entirely. Also, use a set for the lines of one file; membership testing will be much more efficient:
with open(sys.argv[1], "r") as file1:
    f1_content = {line.strip() for line in file1}
with open(sys.argv[2], "r") as file2:
    for line in file2:
        if line.strip() in f1_content:
            print(line)
Looping directly over the file itself reads the file line by line, letting you handle file lines without having to read the whole file into memory.
Note also the use of with statements here; file objects are context managers, and when the context closes (the with block ends) the file object is automatically closed for you.
With Katakana, there is also the possibility that one of your files uses decomposition for the ZI character (ジ) while the other does not; you can express it either as \u30B8 or as \u30B7\u3099 (SI + COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK):
>>> print('\u30B8 != \u30B7\u3099:', '\u30B8' != '\u30B7\u3099')
ジ != ジ: True
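A minimal check that normalization makes the two forms compare equal (NFC shown here; NFKC, as used in the code below, composes them as well):

```python
from unicodedata import normalize

composed = "\u30B8"          # ジ as a single code point
decomposed = "\u30B7\u3099"  # シ + combining voiced sound mark
assert composed != decomposed
# normalize() takes the form name first, then the string.
assert normalize("NFC", decomposed) == composed
```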
You can use unicodedata.normalize() to normalize all your lines to either composed or decomposed form; note that the form name is the first argument. Here I force all data to use composed forms:
from unicodedata import normalize

with open(sys.argv[1], "r") as file1:
    f1_content = {normalize('NFKC', line.strip()) for line in file1}
with open(sys.argv[2], "r") as file2:
    for line in file2:
        if normalize('NFKC', line.strip()) in f1_content:
            print(line)
