I have two files, and I want to display on screen the content that exists in both file 1 and file 2. But it seems nothing is displayed (it should display オレンジ). What is the problem?
thanks
File 1
リンゴ
バナナ
オレンジ
File 2
オレンジ
Here is my code
import sys

File1 = open(sys.argv[1], "r", encoding="UTF-8")
F1_Content = File1.readlines()
F1_Content = map(lambda e: e.rstrip("\n"), F1_Content)

File2 = open(sys.argv[2], "r", encoding="UTF-8")
F2_Content = File2.readlines()
F2_Content = map(lambda e: e.rstrip("\n"), F2_Content)

for line in F1_Content:
    print(repr(line))
    if line in F2_Content:
        print(line)

File1.close()
File2.close()
Output
'\ufeffリンゴ'
'バナナ'
'オレンジ'
You probably have more whitespace in one of the files than just a newline. You could loop over either F1_Content or F2_Content, printing the representation of each line with print(repr(line)) or print(line.encode('unicode_escape')) to make it easier to spot how the lines differ.
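For instance, a hypothetical line with a UTF-8 BOM (the \ufeff visible in your output) and a stray trailing space shows its invisible characters under repr():

>>> line = '\ufeffオレンジ \n'  # hypothetical: BOM at the start, extra space at the end
>>> print(repr(line))
'\ufeffオレンジ \n'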
I'd strip the lines entirely. Also, use a set for the lines of one file; testing against it will be much more efficient (and, unlike the map() iterators in your code, a set can be tested repeatedly; a map object is exhausted after its first membership test):
with open(sys.argv[1], "r") as file1:
    f1_content = {line.strip() for line in file1}

with open(sys.argv[2], "r") as file2:
    for line in file2:
        if line.strip() in f1_content:
            print(line)
Looping directly over the file itself reads the file line by line, letting you handle file lines without having to read the whole file into memory.
Note also the use of with statements here; file objects are context managers, and when the context closes (the with block ends) the file object is automatically closed for you.
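As a rough sketch, the first with statement above is equivalent to this try/finally form:

file1 = open(sys.argv[1], "r")
try:
    f1_content = {line.strip() for line in file1}
finally:
    file1.close()  # runs even if an exception is raised mid-read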
With Katakana, there is also the possibility that one of your files uses decomposition for the ZI character while the other does not; it can be expressed either as \u30B8 or as \u30B7\u3099 (SI + COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK):
>>> print('\u30B8 != \u30B7\u3099:', '\u30B8' != '\u30B7\u3099')
ジ != ジ: True
You can use unicodedata.normalize() to normalize all your lines to either composed or decomposed forms. Here I force all data to use composed forms:
from unicodedata import normalize
with open(sys.argv[1], "r") as file1:
    f1_content = {normalize('NFKC', line.strip()) for line in file1}

with open(sys.argv[2], "r") as file2:
    for line in file2:
        if normalize('NFKC', line.strip()) in f1_content:
            print(line)
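A quick interactive check, building on the \u30B8 example above, that normalization makes the two spellings compare equal:

>>> from unicodedata import normalize
>>> normalize('NFKC', '\u30B7\u3099') == '\u30B8'
True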
Related
How do I remove lines which start with ">" from a txt file?
For example, the txt file has about 250k+ lines, and if I were to use the code below, it would take quite some time.
data = ""
with open(fileName) as f:
for line in f:
if ">" not in line:
line = line.replace("\n", "")
data += line
An example of the txt file is:
> version 1.0125 revision 0... # This is the line to be removed
some random line 1
some random line 2
> version 1.0126 revision 0... # This is the line to be removed
...
I have tried using data = f.read(); it is instant, but the data will contain the lines that start with ">".
Any help is appreciated. Thank you :)
Not knowing what you want to do with the data afterwards, this should be fast and correct:
with open(fileName) as f:
    data = "".join(line for line in f if not line.startswith(">"))
If you just want to remove these lines from the file, I would honestly not do it in Python, but in your shell directly, e.g. on Linux:
$ grep -v '^>' original_file.txt >fixed_file.txt
If you insist on Python, do it on a line-by-line basis:
with open(original_file) as f:
    with open(new_file, "w") as g:
        for line in f:
            if not line.startswith(">"):
                g.write(line)
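If the cleaned content should end up under the original name, one option is to swap the files afterwards; a sketch, assuming both paths are on the same filesystem (where os.replace is atomic):

import os

os.replace(new_file, original_file)  # replace the original with the filtered copy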
Use two files, one for reading and the second for appending:
with open(fileName, 'r') as f, open(fileName.replace('.txt', '_1.txt'), 'a+') as df:
    for line in f.readlines():
        if not line.startswith('>'):
            df.write(line)
I am trying to do what for many will be a very straightforward thing, but for me it is just infuriatingly difficult.
I am trying to search for a line in a file that contains certain words or phrases and modify that line... that's it.
I have been through the forum and the suggested similar questions and have found many hints, but none does quite what I want, or they are beyond my current ability to grasp.
This is the test file:
# 1st_word 2nd_word
# 3rd_word 4th_word
And this is my script so far:
############################################################
file = 'C:\lpthw\\text'
f1 = open(file, "r+")
f2 = open(file, "r+")
############################################################

def wrline():
    lines = f1.readlines()
    for line in lines:
        if "1st_word" in line and "2nd_word" in line:
            #f2.write(line.replace('#\t', '\t'))
            f2.write((line.replace('#\t', '\t')).rstrip())
    f1.seek(0)

wrline()
My problem is that the line below inserts a \n after the line every time and adds a blank line to the file.
f2.write(line.replace('#\t', '\t'))
The file becomes:
1st_word 2nd_word
# 3rd_word 4th_word
There is an extra blank line between the lines of text.
If I use the following:
f2.write((line.replace('#\t', '\t')).rstrip())
I get this:
1st_word 2nd_wordd
# 3rd_word 4th_word
No new blank line is inserted, but there is an extra "d" at the end instead.
What am I doing wrong?
Thanks
Your blank line is coming from the original blank line in the file. Writing a line with nothing in it still writes a newline to the file. Instead of writing an empty line, you have to skip that iteration completely, so the newline is never written. Here's what I suggest:
def wrline():
    lines = open('file.txt', 'r').readlines()
    f2 = open('file.txt', 'w')
    for line in lines:
        if '1st_word' in line and '2nd_word' in line:
            f2.write((line.replace('# ', ' ')).rstrip('\n'))
        else:
            if line != '\n':
                f2.write(line)
    f2.close()
I would keep read and write operations separate.
# read
with open(file, 'r') as f:
    lines = f.readlines()

# parse, change and write back
with open(file, 'w') as f:
    for line in lines:
        if line.startswith('#\t'):
            line = line[1:]
        f.write(line)
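For illustration, slicing off the first character keeps the tab that follows the '#' (a hypothetical line in the test file's format):

>>> line = '#\t1st_word 2nd_word\n'
>>> line[1:]
'\t1st_word 2nd_word\n'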
You have not closed the files, and there is no need for the \t.
Also, get rid of the rstrip().
Read in the file, replace the data, and write it back; open and close each time.
fn = 'example.txt'
new_data = []

# Read in the file
with open(fn, 'r+') as file:
    filedata = file.readlines()

# Replace the target string
for line in filedata:
    if "1st_word" in line and "2nd_word" in line:
        line = line.replace('#', '')
    new_data.append(line)

# Write the file out again
with open(fn, 'w+') as file:
    for line in new_data:
        file.write(line)
I have two text files and I want to write out two new text files according to whether there is a common section to each line in the two original text files.
The format of the text files is as follows:
commontextinallcases uniquetext2 potentiallycommontext uniquetext4
There are more than 4 columns, but you get the idea. I want to check the 'potentiallycommontext' part in each text file and, if they are the same, write out the whole line of each text file to its own new output file, with its unique text still in place.
Splitting it is fairly easy just using the .split() method when reading it in. I have found the following code:
with open('some_file_1.txt', 'r') as file1:
    with open('some_file_2.txt', 'r') as file2:
        same = set(file1).intersection(file2)

same.discard('\n')

with open('some_output_file.txt', 'w') as file_out:
    for line in same:
        file_out.write(line)
But I am not sure this would work for my case, where I need to split the lines. Is there a way to do this that I am missing?
Thanks
I don't think that this set approach is suitable for your case.
I'd try something like this:
with open('some_file_1.txt', 'r') as file1, open('some_file_2.txt', 'r') as file2, open('some_output_file.txt', 'w') as file_out:
    for line1, line2 in zip(file1, file2):
        if line1.split()[2] == line2.split()[2]:
            file_out.write(line1)
            file_out.write(line2)
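Note that zip() pairs the two files line by line, so this assumes corresponding records appear in the same order in both files. For example:

>>> list(zip(['a x\n', 'b y\n'], ['c x\n', 'd z\n']))
[('a x\n', 'c x\n'), ('b y\n', 'd z\n')]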
There might be shorter solutions, but this should work:
PCT_IDX = _  # find which index of line.split() corresponds to potentiallycommontext

def lines(filename):
    with open(filename, 'r') as file:
        for line in file:
            line = line.rstrip('\n')
            yield line

lines_1 = lines('some_file_1.txt')
lines_2 = lines('some_file_2.txt')

with open('some_output_file.txt', 'w') as file_out:
    for (line_1, line_2) in zip(lines_1, lines_2):
        maybe_cmn1 = line_1.split()[PCT_IDX]
        maybe_cmn2 = line_2.split()[PCT_IDX]
        if maybe_cmn1 == maybe_cmn2:
            file_out.write(line_1 + '\n')  # re-add the newline the generator stripped
            file_out.write(line_2 + '\n')
I have a large 11 GB .txt file with email addresses. I would like to save only the part of each address before the # symbol, one per line. My output only generates the first line. I have reused this code from an earlier project. I would like to save the output in a different .txt file. I hope someone can help me out.
My code:
import re

def get_html_string(file, start_string, end_string):
    answer = "nothing"
    with open(file, 'rb') as open_file:
        for line in open_file:
            line = line.rstrip()
            if re.search(start_string, line):
                answer = line
                break
    start = answer.find(start_string) + len(start_string)
    end = answer.find(end_string)
    #print(start, end, answer)
    return answer[start:end]

beginstr = ''
end = '#'
file = 'test.txt'

readstring = str(get_html_string(file, beginstr, end))
print readstring
Your file is quite big (11 GB), so you shouldn't keep all those strings in memory. Instead, process the file line by line, writing each result out before reading the next line.
This should be efficient:
with open('test.txt', 'r') as input_file:
    with open('result.txt', 'w') as output_file:
        for line in input_file:
            prefix = line.split('#')[0]
            output_file.write(prefix + '\n')
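One edge case worth knowing: on a line with no '#' at all, split('#')[0] returns the whole line, newline included:

>>> 'user#google.com\n'.split('#')[0]
'user'
>>> 'not-an-address\n'.split('#')[0]
'not-an-address\n'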
If your file looks like this example:
user#google.com
user2#jshds.com
Useruser#jsnl.com
You can use this:
def get_email_name(file_name):
    with open(file_name) as file:
        lines = file.readlines()
    result = list()
    for line in lines:
        result.append(line.split('#')[0])
    return result

get_email_name('emails.txt')
Out:
['user', 'user2', 'Useruser']
I have two files named A and B
A file looks like this:
1_A
2_B
3_C
4_D
5_E
B file looks like this:
C
D
so I wrote a small script using file B to search for the corresponding lines that contain "C" and "D".
Here is my code:
import re
f = open("fileA", "r")
t = open("fileB", "r")
for line1 in f:
for line2 in t:
if line2 in line1:
print(line1)
But the result was blank; does anyone have any ideas? Many thanks!
After the first iteration, the file pointer in file B is at end of file, and you can't read anything more from it.
Trivial solutions involve rewinding file B or equivalently opening it inside the loop and closing it after each iteration. The I/O overhead is somewhat prohibitive, though.
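A sketch of the reopen-inside-the-loop variant (correct, but fileB is re-read once per line of fileA):

with open("fileA", "r") as f:
    for line1 in f:
        with open("fileB", "r") as t:  # reopened for every line of fileA
            for line2 in t:
                if line2.rstrip('\n') in line1:
                    print(line1)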
On the other hand, reading both files into memory so you can compare them is not very scalable, especially if the files are big.
The usual compromise is to read the smaller file into memory, then process one line at a time from the bigger file:
with open("fileB", "r") as t:
terms = [x.rstrip('\n') for x in t]
with open("fileA", "r") as f:
for line in f:
if any([term in line for term in terms]):
print(line)
If the files are too big for this, you might want to split file B into smaller chunks and do multiple passes, or, if at least one of the files is fairly static, look at using a database.
Use readlines() after opening the files:
import re

f = open("FileA", "r").readlines()
t = open("FileB", "r").readlines()

for line1 in f:
    for line2 in t:
        if line2 in line1:
            print(line1)
You may try this with readlines:
a_lines = open('FileA.txt', 'r').readlines()
b_lines = open('FileB.txt', 'r').readlines()
[a_line.strip() for b_line in b_lines for a_line in a_lines if b_line in a_line]
# Returns ['3_C', '4_D']
You can open the second file and save its contents, then open the first file and check whether character line[2] of each line is in that list:
list_ = []
with open('second', 'r') as f:
    for line in f:
        list_.append(line.split()[0])

final_list = []
with open('first', 'r') as f:
    for line in f:
        if line[2] in list_:
            final_list.append(line.split()[0])

print(final_list)
output:
['3_C', '4_D']
To be in line with your code, you can try the following:
f = open("fileA", "r")
t = open("fileB", "r")
for line1 in f:
for line2 in t:
if line2 in line1:
print(line1)
t.seek(0) # reset the file pointer after going through the entire file