I'm new to Python.
My code should read two different .txt files into lists and compare them to find and delete duplicates.
Code:
import os

dir = os.listdir
T = "Albums"
if T not in dir():
    os.mkdir("Albums")

with open('list.txt','w+') as f:
    linesA = f.readlines()
    print(linesA) # output empty
with open('completed.txt','w+') as t:
    linesB = t.readlines()
    print(linesB) # output empty

for i in linesA[:]:
    if i in linesB:
        linesA.remove(i)

print(linesA)
print(linesB)
I tried the code above with the following inputs:
in list.txt I wrote (on separate lines) A, B and C.
in completed.txt I wrote (also on separate lines) A and B.
It should have first printed the contents of the two lists, but both came out empty for some reason.
Why are the read lists empty?
Does this help:
I suggest using not os.path.exists(entry) instead of not entry in os.listdir(); it's not relevant to the problem, but I'll point it out anyway. (Also, you overwrote the built-in dir function.)
I've split up the file content using split("\n").
I've changed the mode the files are opened with to r+, which, unlike w+, doesn't clear the file.
Please note that if you want to use readlines, you have to strip the trailing newline from each entry.
import os

with open('list.txt','w+') as file:
    file.write("Foo\n")
    file.write("Bar")
with open('completed.txt','w+') as file:
    file.write("Bar\n")
    file.write("Python")

T = "Albums"
if not os.path.exists(T):
    os.mkdir("Albums")

with open('list.txt','r+') as f:
    linesA = f.read().split("\n")
    print(linesA)
with open('completed.txt','r+') as t:
    linesB = t.read().split("\n")
    print(linesB)

for entry in list(linesA):
    if entry in linesB:
        linesA.remove(entry)

print(linesA)
print(linesB)
Output:
['Foo', 'Bar']
['Bar', 'Python']
['Foo']
['Bar', 'Python']
This makes little sense:
dir = os.listdir
You wanted to call os.listdir(). What you did was assign a reference to that function, without actually calling it. Better to dispense with dir (which also shadows the built-in dir function) and just phrase it this way:
if T not in os.listdir():
with open('list.txt','w+') as f:
    linesA = f.readlines()
...
with open('completed.txt','w+') as t:
    linesB = t.readlines()
You wanted to open those with 'r' read mode rather than write mode: 'w+' truncates the file to zero length as soon as it is opened, which is why both lists come back empty.
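A minimal corrected sketch, using the question's file names and its A/B/C example input (the first few lines only recreate that input so the sketch runs on its own):

```python
# Recreate the question's example input.
with open('list.txt', 'w') as f:
    f.write("A\nB\nC\n")
with open('completed.txt', 'w') as f:
    f.write("A\nB\n")

# Open for reading with 'r'; unlike 'w+', this leaves the contents intact.
with open('list.txt', 'r') as f:
    linesA = [line.rstrip('\n') for line in f]
with open('completed.txt', 'r') as t:
    linesB = [line.rstrip('\n') for line in t]

# Keep only the entries of linesA that are not in linesB.
remaining = [entry for entry in linesA if entry not in linesB]
print(remaining)  # ['C']
```

Stripping the trailing newlines up front means an entry like 'A\n' in one file still matches 'A' read from the other.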
Related
I currently have the below code in Python 3.x:
lst_exclusion_terms = ['bob', 'jenny', 'michael']
file_list = ['1.txt', '2.txt', '3.txt']

for f in file_list:
    with open(f, "r", encoding="utf-8") as file:
        content = file.read()
        if any(entry in content for entry in lst_exclusion_terms):
            print(content)
What I am aiming to do is review the content of each file in the list file_list. While reviewing it, I want to check whether any of the entries in the list lst_exclusion_terms appears in that content. If one does, I want to remove that file from the list.
So, if 'bob' is within the content of 2.txt, '2.txt' will be removed (popped) out of the list.
I am unsure how to replace my print(content) with code that identifies the index of the item currently being examined and then removes it.
Any suggestions? Thanks
You want to filter a list of files based on whether they contain some piece(s) of text.
There is a Python built-in function filter which can do that. filter takes a function that returns a boolean, and an iterable (e.g. a list), and returns an iterator over the elements from the original iterable for which the function returns True.
So first you can write that function:
def contains_terms(filepath, terms):
    with open(filepath) as f:
        content = f.read()
    return any(term in content for term in terms)
Then use it in filter, and construct a list from the result:
file_list = list(filter(lambda f: not contains_terms(f, lst_exclusion_terms), file_list))
Of course, the lambda is required because contains_terms takes 2 arguments, and returns True if the terms are in the file, which is sort of the opposite of what you want (but sort of makes more sense from the point of view of the function itself). You could specialise the function to your use case and remove the need for the lambda.
def is_included(filepath):
    with open(filepath) as f:
        content = f.read()
    return all(term not in content for term in lst_exclusion_terms)
With this function defined, the call to filter is more concise:
file_list = list(filter(is_included, file_list))
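Equivalently, since this is just filtering, a list comprehension does the same job without filter. A runnable sketch (the three sample files and their contents are made up here so the example stands alone):

```python
lst_exclusion_terms = ['bob', 'jenny', 'michael']

# Hypothetical sample files: only 3.txt contains an excluded term.
for name, text in [('1.txt', 'abcd'), ('2.txt', 'hello world'),
                   ('3.txt', 'bob was here')]:
    with open(name, 'w', encoding='utf-8') as f:
        f.write(text)

def is_included(filepath):
    with open(filepath, encoding='utf-8') as f:
        content = f.read()
    return all(term not in content for term in lst_exclusion_terms)

file_list = ['1.txt', '2.txt', '3.txt']
file_list = [f for f in file_list if is_included(f)]
print(file_list)  # ['1.txt', '2.txt']
```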
I've had a need like this before, where I wanted to delete a list item while iterating over the list. It is often suggested to just recreate a new list with the contents you want, as suggested here.
However, here is a quick and dirty approach that can remove the file from the list:
lst_exclusion_terms = ['bob','jenny', 'michael']
file_list = ['1.txt', '2.txt', '3.txt']

print("Before removing item:")
print(file_list)

flag = True
while flag:
    flag = False
    for i, f in enumerate(file_list):
        with open(f, "r", encoding="utf-8") as file:
            content = file.read()
        if any(entry in content for entry in lst_exclusion_terms):
            file_list.pop(i)
            flag = True  # the list changed, so rescan it from the start
            break

print("After removing item")
print(file_list)
In this case, file 3.txt was removed from the list since it matched an entry in lst_exclusion_terms.
The following were the contents used in each file:
#1.txt
abcd
#2.txt
5/12/2021
#3.txt
bob
jenny
michael
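If you do want to delete in place rather than rebuild the list, iterating over the indices in reverse sidesteps both the while/flag bookkeeping and the index-shifting problem in one go. A sketch using the same sample file contents as above:

```python
lst_exclusion_terms = ['bob', 'jenny', 'michael']
file_list = ['1.txt', '2.txt', '3.txt']

# Recreate the three sample files listed above.
for name, text in [('1.txt', 'abcd'), ('2.txt', '5/12/2021'),
                   ('3.txt', 'bob\njenny\nmichael')]:
    with open(name, 'w', encoding='utf-8') as f:
        f.write(text)

# Walk the list backwards so pop(i) never shifts an index we have yet to visit.
for i in range(len(file_list) - 1, -1, -1):
    with open(file_list[i], 'r', encoding='utf-8') as file:
        content = file.read()
    if any(entry in content for entry in lst_exclusion_terms):
        file_list.pop(i)

print(file_list)  # ['1.txt', '2.txt']
```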
I am trying to make a function which will compare two txt files. If it finds lines that are in one file but not in the other, it should add them to a list and also to the file that is missing those lines. It fails to do that. Here is my function. What am I doing wrong?
newLinks = []

def newer():
    with open('cnbcNewLinks.txt', 'w') as newL:
        for line in open('cnbcCleanedLinks.txt'):
            if line not in "cnbcNewLinks.txt":
                newLinks.append(line)
                newL.write(line)
            else:
                continue
    cleaned = ''.join(newLinks)
    print(cleaned)
I put into Python code what @Alex suggested.
See the docs for set.
I replaced your text file names with a.txt and b.txt for readability.
# First read the files and compare them using `set`
with open('a.txt', 'r') as newL, open('b.txt', 'r') as cleanL:
    a = set(newL)
    b = set(cleanL)

add_to_cleanL = list(a - b)  # lines in newL that are not in cleanL
add_to_newL = list(b - a)    # lines in cleanL that are not in newL

# Then open in append mode to add at the end of each file
with open('a.txt', 'a') as newL, open('b.txt', 'a') as cleanL:
    newL.write(''.join(add_to_newL))      # append the missing lines to newL
    cleanL.write(''.join(add_to_cleanL))  # append the missing lines to cleanL
If the files are not big, read each one into a list, convert both lists to sets, and take the set difference twice; then append each difference to the other file.
I have a number of files that I am reading in. I would like to have a list that contains the file contents. I read the whole content into a string.
I need a list that looks like this:
["Content of the first file", "content of the second file",...]
I have tried various ways like append, extend or insert, but they all expect a list as parameter and not a str so I end up getting this:
[["Content of the first file"], ["content of the second file"],...]
How can I get a list that contains strings and then add strings without turning it into a list of lists?
EDIT
Some more code
for file in os.listdir("neg"):
    with open("neg\\"+file, 'r', encoding="utf-8") as f:
        linesNeg.append(f.read().splitlines())
for file in os.listdir("pos"):
    with open("pos\\"+file, 'r', encoding="utf-8") as f:
        linesPos.append(f.read().splitlines())
listTotal = linesNeg + linesPos
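For the EDIT above: f.read().splitlines() returns a list, so appending it produces a list of lists. Appending f.read() itself (one string per file) gives the flat list you want. A self-contained sketch (the directory and file contents here are made up; os.path.join also avoids the Windows-only "neg\\" separator):

```python
import os

# Hypothetical sample data standing in for the neg/pos directories.
os.makedirs("neg", exist_ok=True)
os.makedirs("pos", exist_ok=True)
with open(os.path.join("neg", "a.txt"), "w", encoding="utf-8") as f:
    f.write("Content of the first file")
with open(os.path.join("pos", "b.txt"), "w", encoding="utf-8") as f:
    f.write("content of the second file")

def read_all(directory):
    contents = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name), encoding="utf-8") as f:
            contents.append(f.read())  # append one string per file, not a list
    return contents

listTotal = read_all("neg") + read_all("pos")
print(listTotal)  # ['Content of the first file', 'content of the second file']
```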
contents_list = []
for filename in filename_list:
    with open(filename) as f:
        contents_list.append(f.read())
There's definitely more than one way to do it. Assuming you have the opened files as file objects f1 and f2:
alist = []
alist.extend([f1.read(), f2.read()])
or
alist = [f.read() for f in (f1, f2)]
Personally I'd do something like this, but there's more than one way to skin this cat.
file_names = ['foo.txt', 'bar.txt']

def get_string(file_name):
    with open(file_name, 'r') as fh:
        contents = fh.read()
    return contents

strings = [get_string(f) for f in file_names]
I am writing a Python code and would like some more insight on how to approach this issue.
I am trying to read in, in order, multiple files whose names end with .log. From them, I hope to write specific values to a .csv file.
Within the text file, there are X/Y values that are extracted below:
Textfile.log:
X/Y = 5
X/Y = 6
Textfile.log.2:
X/Y = 7
X/Y = 8
DesiredOutput in the CSV file:
5
6
7
8
Here is the code I've come up with so far:
def readfile():
    import os
    i = 0
    for file in os.listdir("\mydir"):
        if file.endswith(".log"):
            return file

def main():
    import re
    list = []
    list = readfile()
    for line in readfile():
        x = re.search(r'(?<=X/Y = )\d+', line)
        if x:
            list.append(x.group())
        else:
            break
    f = csv.write(open(output, "wb"))
    while 1:
        if (i > len(list-1)):
            break
        else:
            f.writerow(list(i))
            i += 1

if __name__ == '__main__':
    main()
I'm confused on how to make it read the .log file, then the .log.2 file.
Is it possible to just have it automatically read all the files in 1 directory without typing them in individually?
Update: I'm using Windows 7 and Python V2.7
The simplest way to read files sequentially is to build a list and then loop over it. Something like:
for fname in list_of_files:
    with open(fname, 'r') as f:
        # Do all the stuff you do to each file
This way whatever you do to read each file will be repeated for every file in list_of_files. Since lists are ordered, the files will be processed in whatever order the list is sorted in.
Borrowing from @The2ndSon's answer, you can pick up the files with os.listdir(dir). This will simply list all files and directories within dir, in arbitrary order. From this you can pull out and order all of your files like this:
allFiles = os.listdir(some_dir)
logFiles = [fname for fname in allFiles if "log" in fname.split('.')]
logFiles.sort(key = lambda x: x.split('.')[-1])
logFiles[0], logFiles[-1] = logFiles[-1], logFiles[0]
The above code will work with file names like "somename.log", "somename.log.2" and so on. You can then take logFiles and plug it in as list_of_files. Note that the last line is only necessary if the first file is "somename.log" instead of "somename.log.1". If the first file has a number on the end, just leave out the last step.
Line By Line Explanation:
allFiles = os.listdir(some_dir)
This line takes all files and directories within some_dir and returns them as a list
logFiles = [fname for fname in allFiles if "log" in fname.split('.')]
Perform a list comprehension to gather all of the files with log in the name as part of the extension. "something.log.somethingelse" will be included, "log_something.somethingelse" will not.
logFiles.sort(key = lambda x: x.split('.')[-1])
Sort the list of log files in place by the last extension. x.split('.')[-1] splits the file name into a list of period delimited values and takes the last entry. If the name is "name.log.5", it will be sorted as "5". If the name is "name.log", it will be sorted as "log".
logFiles[0], logFiles[-1] = logFiles[-1], logFiles[0]
Swap the first and last entries of the list of log files. This is necessary because the sorting operation will put "name.log" as the last entry and "nane.log.1" as the first.
If you change the naming scheme for your log files you can easily return of list of files that have the ".log" extension. For example if you change the file names to Textfile1.log and Textfile2.log you can update readfile() to be:
import os

def readfile():
    my_list = []
    for file in os.listdir("."):
        if file.endswith(".log"):
            my_list.append(file)
    return my_list
Printing my_list will show ['Textfile1.log', 'Textfile2.log']. Using the word 'list' as a variable name is generally avoided, as it shadows the built-in list type in Python.
I'm new to Python programming and need some help with some basic file I/O and list manipulation.
Currently I have a list s that has these elements in it:
['taylor343', 'steven435', 'roger101\n']
What I need to do is print each element into its own new text file, one element per file, as shown below:
file1.txt
taylor343
file2.txt
steven435
file3.txt
roger101
I'm currently trying to do this with a loop, but I can only output into one text file:
for x in list:
    output.write(x+"\n")
How can I get it to write each element of the list into its own new text file (not just one)?
Thank you
You need to open each new file you want to write into. As a quick example:
items = ['taylor', 'steven', 'roger']
filenames = ['file1', 'file2', 'file3']

for item, filename in zip(items, filenames):
    with open(filename, 'w') as output:
        output.write(item + '\n')
@Joe Kington wrote an excellent answer that is very pythonic. A more verbose answer that might make what is going on a little easier to understand would be something like this:
s = ['taylor343', 'steven435', 'roger101\n']
f = open("file1.txt","w")
f.write(s[0]+"\n")
f.close()
f = open("file2.txt","w")
f.write(s[1]+"\n")
f.close()
f = open("file3.txt","w")
f.write(s[2]) # s[2] already has the newline, for some reason
f.close()
If I were to make it a bit more general, I'd do this:
s = ['taylor343', 'steven435', 'roger101'] # no need for that last newline
for i, name in enumerate(s):
    f = open("file"+str(i+1)+".txt", "w")
    f.write(name+"\n")
    f.close()
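The same loop can also be written with a with block (so each file closes itself) and enumerate's start parameter (so there is no i+1 arithmetic):

```python
s = ['taylor343', 'steven435', 'roger101']

for i, name in enumerate(s, start=1):
    # Each file is closed automatically when the with block ends.
    with open("file{}.txt".format(i), "w") as f:
        f.write(name + "\n")
```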