I need to scan two files in Python and report which words in file1 are also in file2. I build a list with all the words from file2 and then check whether each line from file1 is in that list.
This works, but with large files (around 500k) it can take 1h+, and I was wondering if there is a faster way.
Thanks in advance.
(var etc. and the files are defined earlier)
a = []
for line in var:
    a += [line]

teller = 0
for line1 in new_file:
    if line1 not in a:
        print(line1, file=filter, end='')
    else:
        teller += 1
        print(line1, file=bad, end='')
print('There were', teller, 'lines that were in the old file.')
A faster alternative is using sets (as long as you can keep the content of both files in memory):
with open('a.txt', 'r') as a, open('b.txt', 'r') as b:
    a_content = set(a)
    b_content = set(b)
    result = a_content.intersection(b_content)
If you're worried about speed, then you should be using your OS facilities, not Python loops. Typically, the fastest way to look for individual lines would be to sort both files and then do a simple file diff. If you insist on using Python, sorting both files first would likewise be a much quicker approach.
Your method will work, but it's very inefficient because you traverse file2 for every single word/line in file1. Try turning both file1 and file2 into sets and then compare the sets; Python's sets have an .intersection() method for exactly this.
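For example, a minimal sketch of that idea applied to the loop in the question (keeping the original variable names var, new_file, filter and bad, which are assumed to be already-opened files):

old_lines = set(var)   # O(1) membership tests instead of scanning a list

teller = 0
for line1 in new_file:
    if line1 not in old_lines:
        print(line1, file=filter, end='')
    else:
        teller += 1
        print(line1, file=bad, end='')
print('There were', teller, 'lines that were in the old file.')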
Ok, I couldn't really find an answer to this anywhere else, so I figured I'd ask.
I'm working with some .csv files that have about 74 million lines right now, and I'm trying to add columns from one file into another.
ex.
Week,Sales Depot,Sales Channel,Route,Client,Product,Units Sold,Sales,Units Returned,Returns,Adjusted Demand
3,1110,7,3301,15766,1212,3,25.14,0,0,3
3,1110,7,3301,15766,1216,4,33.52,0,0,4
combined with
Units_cat
0
1
so that
Week,Sales Depot,Sales Channel,Route,Client,Product,Units Sold,Units_cat,Sales,Units Returned,Returns,Adjusted Demand
3,1110,7,3301,15766,1212,3,0,25.14,0,0,3
3,1110,7,3301,15766,1216,4,1,33.52,0,0,4
I've been using pandas to read in and output the .csv files, but the issue I'm running into is that the program keeps crashing because creating the DataFrame overloads my memory. I've tried the csv library from Python, but I'm not sure how to merge the files the way I want (not just append).
Does anyone know a more memory-efficient method of combining these files?
Something like this might work for you:
Using csv.DictReader()
import csv
from itertools import izip

with open('file1.csv') as file1:
    with open('file2.csv') as file2:
        with open('result.csv', 'w') as result:
            file1 = csv.DictReader(file1)
            file2 = csv.DictReader(file2)

            # Get the field order correct here:
            fieldnames = file1.fieldnames
            index = fieldnames.index('Units Sold') + 1
            fieldnames = fieldnames[:index] + file2.fieldnames + fieldnames[index:]

            result = csv.DictWriter(result, fieldnames)

            def dict_merge(a, b):
                a.update(b)
                return a

            result.writeheader()
            result.writerows(dict_merge(a, b) for a, b in izip(file1, file2))
Using csv.reader()
import csv
from itertools import izip

with open('file1.csv') as file1:
    with open('file2.csv') as file2:
        with open('result.csv', 'w') as result:
            file1 = csv.reader(file1)
            file2 = csv.reader(file2)
            result = csv.writer(result)

            result.writerows(a[:7] + b + a[7:] for a, b in izip(file1, file2))
Notes:
This is for Python 2. You can use the normal zip() function in Python 3. If the two files are not of equivalent lengths, consider itertools.izip_longest().
The memory efficiency comes from passing a generator expression to .writerows() instead of a list. This way, only the current line is under consideration at any moment in time, not the entire file. If a generator expression isn't appropriate, you'll get the same benefit from a for loop: for a, b in izip(...): result.writerow(...)
The dict_merge function is not required starting from Python 3.5. In sufficiently new Pythons, try result.writerows({**a, **b} for a, b in zip(file1, file2)) (see this explanation).
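For anyone on Python 3, a rough sketch of the same DictReader approach (untested here; newline='' is what the csv docs recommend when opening files for the csv module):

import csv

with open('file1.csv', newline='') as f1, \
     open('file2.csv', newline='') as f2, \
     open('result.csv', 'w', newline='') as out:
    reader1 = csv.DictReader(f1)
    reader2 = csv.DictReader(f2)

    # Splice the second file's columns in right after 'Units Sold'
    fieldnames = reader1.fieldnames
    index = fieldnames.index('Units Sold') + 1
    fieldnames = fieldnames[:index] + reader2.fieldnames + fieldnames[index:]

    writer = csv.DictWriter(out, fieldnames)
    writer.writeheader()
    writer.writerows({**a, **b} for a, b in zip(reader1, reader2))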
I have 100-200 text files with different names in a folder, and I want to compare the text in these files with each other and keep the similar files in a group.
Note:
1. The files are not identical; they are similar in the sense that 2-3 lines in a paragraph are the same as in another file.
2. One file may be kept in more than one group.
Can anyone help me with this, as I am a beginner in Python?
I have tried the code below, but it doesn't work for me.
file1=open("F1.txt","r")
file2=open("F2.txt","r")
file3=open("F3.txt","r")
file4=open("F4.txt","r")
file5=open("F5.txt","r")
list1=file1.readlines()
list2=file2.readlines()
list3=file3.readlines()
list4=file4.readlines()
list5=file5.readlines()
for line1 in list1:
for line2 in list2:
for line3 in list3:
for line3 in list4:
for line4 in list5:
if line1.strip() in line2.strip() in line3.strip() in line4.strip() in line5.strip():
print line1
file3.write(line1)
You can use this code to check for similar lines between files:
import glob

_contents = dict()
for filename in glob.glob('*.csv'):   # adjust the pattern to match your files, e.g. '*.txt'
    file = open(filename, 'r')
    frd = file.readlines()
    _contents[filename] = frd

for key in _contents:
    for other_key in _contents:
        if key == other_key:
            pass
        else:
            print("Comparing files {0} and {1}".format(key, other_key))
            non_identical_contents = set(_contents[key]) - set(_contents[other_key])
            print(list(set(_contents[key]) - non_identical_contents))
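Note that the two subtractions in the last lines amount to a plain intersection, so (with the same _contents dictionary) the comparison could be written more directly as:

common = set(_contents[key]) & set(_contents[other_key])
print(list(common))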
If I understood your purpose right, you should iterate over all of the text files in the folder and compare each one with the others (in all possible combinations). The code should look something like this:
import glob, os

nl = []  # Name list (the names of all files in the directory)
fl = []  # File list (the content of each file: a list of its lines)

os.chdir("/libwithtextfiles")
for filename in glob.glob("*.txt"):  # Using glob to get all the files ending with '.txt'
    nl.append(filename)              # Append every filename in the directory to 'nl'
    f = open(filename, 'r')
    fl.append(f.readlines())         # Append each file's list of lines to 'fl'
    f.close()

for fname1 in nl:
    l1 = fl[nl.index(fname1)]
    if nl.index(fname1) == len(nl) - 1:  # We reached the last file
        break
    for fname2 in nl[nl.index(fname1) + 1:]:
        l2 = fl[nl.index(fname2)]
        # Here compare the number of identical lines, using a counter,
        # then print it, output it to a file, or do whatever you want with it.
        # e.g. (according to what I understood from your code):
        for f1line in l1:
            for f2line in l2:
                if f1line == f2line:  # Why 'in' and not '=='?
                    """
                    Increase some counter right here. A suggestion is a list of
                    lists, where each element is a list of integers: the first
                    integer is the number of identical lines between this file
                    (the index in list_of_lists corresponds to the name at that
                    index in 'nl') and the one following it (index + 1); the
                    next integer is the number of identical lines between the
                    same file and the one after that (+2 this time), etc.
                    Long story short: list_of_lists[i][j] is the number of
                    identical lines between the 'i'th file and the 'i+j'th one.
                    """
                    pass
Note that your code doesn't use loops where it should; you could have had a single list l instead of list1 through list5.
Aside from that, your code is quite unclear. I assume the missing indentation (for line2 in list2: should be indented, along with everything after it) and the repeated loop variable in for line3 in list3: for line3 in list4: are accidental and happened when copying the code to this site. Are you really comparing every line with every line in the other files?
You should, as my comment in the code suggests, keep a counter of how many files a line repeats in, using one for loop with a single nested loop, iterating over the lines and comparing just two files at a time rather than all five. Even with 5 files of 10 lines each, your approach iterates 100,000 times (10**5), whereas comparing pairs takes only about 1,000 iterations in that case, roughly 100 times more efficient.
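A rough sketch of that pairwise counting using sets of stripped lines; the *.txt pattern, the threshold of 2 shared lines, and the way results are reported are all assumptions you would adapt:

import glob
from itertools import combinations

contents = {}
for filename in glob.glob('*.txt'):
    with open(filename) as f:
        contents[filename] = set(line.strip() for line in f)

# compare every pair of files once
for name1, name2 in combinations(contents, 2):
    shared = contents[name1] & contents[name2]
    if len(shared) >= 2:   # "2-3 lines in a paragraph are the same"
        print('{0} and {1} share {2} line(s)'.format(name1, name2, len(shared)))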
I routinely use PowerShell to split larger text or csv files into smaller files for quicker processing. However, I have a few files that come over in an unusual format. These are basically print files dumped to a text file. Each record starts with a single line that begins with a 1 and has nothing else on it.
What I need to be able to do is split a file based on the number of statements. So, basically, if I want to split the file into chunks of 3000 statements, I would go down until I see the 3001st occurrence of a 1 in position 1 and copy everything before that to a new file. I can run this from Windows, Linux or OS X, so pretty much anything is open for the split.
Any ideas would be greatly appreciated.
Maybe try recognizing it by the fact that there is a '1' plus a new line?
with open(input_file, 'r') as f:
    my_string = f.read()

my_list = my_string.split('\n1\n')
This separates each record into a list element, assuming the file has the following format:
1
....
....
1
....
....
....
You can then output each element in the list to a separate file.
for x, record in enumerate(my_list):
    with open(str(x) + '.txt', 'w') as out:
        out.write(record)
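If you want the chunks of 3000 statements mentioned in the question rather than one file per record, here is a rough sketch building on my_list (the chunk size and output names are assumptions):

chunk_size = 3000  # statements per output file, as in the question
for n in range(0, len(my_list), chunk_size):
    chunk = my_list[n:n + chunk_size]
    body = '\n1\n'.join(chunk)
    # split() stripped the leading '1' line from every record except the
    # very first one in the file, so restore it for later chunks
    if not body.startswith('1\n'):
        body = '1\n' + body
    if not body.endswith('\n'):
        body += '\n'
    with open('chunk{0}.txt'.format(n // chunk_size), 'w') as out:
        out.write(body)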
To avoid loading the whole file in memory, you could define a function that generates records incrementally and then use itertools' grouper recipe to write each 3000 records to a new file:
#!/usr/bin/env python3
from itertools import zip_longest

with open('input.txt') as input_file:
    files = zip_longest(*[generate_records(input_file)]*3000, fillvalue=())
    for n, records in enumerate(files):
        with open('output{n}.txt'.format(n=n), 'w') as output_file:
            output_file.writelines(''.join(lines)
                                   for r in records for lines in r)
where generate_records() yields one record at a time where a record is also an iterator over lines in the input file:
from itertools import chain

def generate_records(input_file, start='1\n', eof=[]):
    def record(yield_start=True):
        if yield_start:
            yield start
        for line in input_file:
            if line == start:  # start new record
                break
            yield line
        else:  # EOF
            eof.append(True)

    # the first record may include lines before the first 1\n
    yield chain(record(yield_start=False), record())
    while not eof:
        yield record()
generate_records() is a generator that yields generators, like itertools.groupby() does.
For performance reasons, you could read/write chunks of multiple lines at once.
I'd like to read the contents from several files into unique lists that I can call later - ultimately, I want to convert these lists to sets and perform intersections and subtraction on them. This must be an incredibly naive question, but after poring over the iterators and loops sections of Lutz's "Learning Python," I can't seem to wrap my head around how to approach this. Here's what I've written:
#!/usr/bin/env python
import sys

OutFileName = 'test.txt'
OutFile = open(OutFileName, 'w')
FileList = sys.argv[1:]
Len = len(FileList)
print Len

for i in range(Len):
    sys.stderr.write("Processing file %s\n" % (i))
    FileNum = i

for InFileName in FileList:
    InFile = open(InFileName, 'r')
    PathwayList = InFile.readlines()
    print PathwayList
    InFile.close()
With a couple of simple test files, I get output like this:
Processing file 0
Processing file 1
['alg1\n', 'alg2\n', 'alg3\n', 'alg4\n', 'alg5\n', 'alg6']
['csr1\n', 'csr2\n', 'csr3\n', 'csr4\n', 'csr5\n', 'csr6\n', 'csr7\n', 'alg2\n', 'alg6']
These lists are correct, but how do I assign each one to a unique variable so that I can call them later (for example, by including the index # from range in the variable name)?
Thanks so much for pointing a complete programming beginner in the right direction!
#!/usr/bin/env python
import sys

FileList = sys.argv[1:]
PathwayList = []
for InFileName in FileList:
    sys.stderr.write("Processing file %s\n" % InFileName)
    InFile = open(InFileName, 'r')
    PathwayList.append(InFile.readlines())
    InFile.close()
Assuming you read in two files, the following will do a line-by-line comparison (it won't pick up any extra lines in the longer file, but then they wouldn't be the same anyway if one had more lines than the other):
for i, s in enumerate(zip(PathwayList[0], PathwayList[1]), 1):
    if s[0] == s[1]:
        print i, 'match', s[0]
    else:
        print i, 'non-match', s[0], '!=', s[1]
For what you're wanting to do, you might want to take a look at the difflib module in Python. For sorting, look at Mutable Sequence Types; someListVar.sort() will sort the contents of someListVar in place.
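Since you mention intersections and subtraction, here is a minimal sketch using the two lists read in above (assuming PathwayList[0] and PathwayList[1] hold the lines of your two test files):

set_a = set(PathwayList[0])
set_b = set(PathwayList[1])
print set_a & set_b   # intersection: lines present in both files
print set_a - set_b   # difference: lines only in the first file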
You could do it like this if you don't need to remember where the contents came from:
PathwayList = []
for InFileName in FileList:
    sys.stderr.write("Processing file %s\n" % InFileName)
    InFile = open(InFileName, 'r')
    PathwayList.append(InFile.readlines())
    InFile.close()

for contents in PathwayList:
    # do something with contents, which is a list of strings
    print contents
or, if you want to keep track of the file names, you could use a dictionary:
PathwayList = {}
for InFileName in FileList:
    sys.stderr.write("Processing file %s\n" % InFileName)
    InFile = open(InFileName, 'r')
    PathwayList[InFileName] = InFile.readlines()
    InFile.close()

for filename, contents in PathwayList.items():
    # do something with contents, which is a list of strings
    print filename, contents
You might want to check out Python's fileinput module, which is a part of the standard library and allows you to process multiple files at once.
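A small sketch of how that might look (the grouping into a dictionary is just an illustration, not something fileinput does for you):

import fileinput
import sys

contents = {}
for line in fileinput.input(sys.argv[1:]):
    # fileinput.filename() tells you which file the current line came from
    contents.setdefault(fileinput.filename(), []).append(line)

for name, lines in contents.items():
    print name, 'has', len(lines), 'lines'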
Essentially, you have a list of files and you want to turn it into a list of the lines of those files...
Several ways:
result = [ list(open(n)) for n in sys.argv[1:] ]
This would get you a result like [['alg1', 'alg2', 'alg3'], ['csr1', 'csr2', ...]]. Accessing result[0] would give you ['alg1', 'alg2', 'alg3']...
Somewhat better might be a dictionary:
result = dict( (n, list(open(n))) for n in sys.argv[1:] )
If you just want to concatenate everything, you would need to chain it:
import itertools
result = list(itertools.chain.from_iterable(open(n) for n in sys.argv[1:]))
# -> ['alg1', 'alg2', 'alg3', 'csr1', 'csr2'...
Not one-liners for a beginner... however, now it would be a good exercise to try to comprehend what's going on :)
You need to dynamically create the variable name for each file 'number' that you're reading. (I'm being deliberately vague here; knowing how to build variable names like this is quite valuable and more readily remembered if you discover it yourself.)
Something like this will give you a start.
You need a list which holds your PathwayList lists, that is a list of lists.
One remark: it is quite uncommon to use capitalized variable names. There is no strict rule for that, but by convention most people only use capitalized names for classes.
I'm new to Python programming and need some help with some basic file I/O and list manipulation.
Currently I have a list (s) that has these elements in it:
['taylor343', 'steven435', 'roger101\n']
What I need to do is print each element into its own new text file, with only that one element in the file, as shown below:
file1.txt
taylor343
file2.txt
steven435
file3.txt
roger101
I'm currently trying to do this with a loop, but I can only output into one text file:
for x in list:
    output.write(x + "\n")
How can I get it to write every single element of the list into a new text file (not just one)?
Thank you
You need to open each new file you want to write into. As a quick example:
items = ['taylor', 'steven', 'roger']
filenames = ['file1', 'file2', 'file3']

for item, filename in zip(items, filenames):
    with open(filename, 'w') as output:
        output.write(item + '\n')
@Joe Kington wrote an excellent answer that is very pythonic. A more verbose answer that might make it a little easier to understand what is going on would be something like this:
s = ['taylor343', 'steven435', 'roger101\n']
f = open("file1.txt","w")
f.write(s[0]+"\n")
f.close()
f = open("file2.txt","w")
f.write(s[1]+"\n")
f.close()
f = open("file3.txt","w")
f.write(s[2]) # s[2] already has the newline, for some reason
f.close()
If I were to make it a bit more general, I'd do this:
s = ['taylor343', 'steven435', 'roger101']  # no need for that last newline
for i, name in enumerate(s):
    f = open("file" + str(i+1) + ".txt", "w")
    f.write(name + "\n")
    f.close()