Remove duplicates in text file line by line - python

I'm trying to write a Python script that will remove duplicate strings in a text file. However, the de-duplication should only occur within each line.
For example, the text file might contain:
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;10 ABC\ABCD\ABCDE;þ
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;12 EFG\EFG;þ
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;09 XYZ\XYZ\XYZ;12 EFG\EFG;þ
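Thus, in the above example, the script should remove only the repeated strings: the second occurrence of 10 ABC\ABCD\ABCDE in the first line, and the second occurrence of 12 EFG\EFG in the second and third lines.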
I've searched Stack Overflow and elsewhere to try to find a solution, but haven't had much luck. There seem to be many solutions that will remove duplicate lines, but I'm trying to remove duplicates within a line, line-by-line.
Update: Just to clarify - þ is the delimiter for each field, and ; is the delimiter for each item within each field. Within each line, I'm attempting to remove any duplicate strings contained between semicolons.
Update 2: Example edited to reflect that the duplicate value may not always follow directly after the first instance of the value.

@Prune's answer gives the idea, but it needs to be modified like this:
# Raw string so the backslashes in the sample data are kept literally.
input_file = r"""þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;10 ABC\ABCD\ABCDE;þ
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;12 EFG\EFG;þ"""
lines = input_file.split("\n")
for line in lines:
    seen_item = []
    for item in line.split(";"):
        # Keep every þ delimiter; keep other items only the first time they appear.
        if item not in seen_item or item == "þ":
            seen_item.append(item)
    print(";".join(seen_item))

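import re
with open('file', 'r') as f:
    for line in f:
        # Collapse an item that is immediately followed by a copy of itself.
        print(re.sub(r'([^;]+;)\1', r'\1', line.rstrip('\n')))
Read the file line by line, then replace the duplicates using re.sub. Note that this pattern only removes a duplicate that directly follows the first occurrence.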
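Update 2 in the question says the duplicate may not directly follow the first occurrence, so an order-preserving pass with a set of already-seen items may be a better fit. The following is only a minimal sketch: it assumes that þ markers and empty items should always be kept, and the name dedupe_line is illustrative rather than taken from any answer.

```python
def dedupe_line(line, item_sep=";", field_marker="þ"):
    """Drop repeated items within one line, keeping first occurrences in order."""
    seen = set()
    kept = []
    for item in line.split(item_sep):
        # Field markers and empty items are structural, so always keep them.
        if item == field_marker or item == "" or item not in seen:
            kept.append(item)
            seen.add(item)
    return item_sep.join(kept)


sample = r"þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;09 XYZ\XYZ\XYZ;12 EFG\EFG;þ"
print(dedupe_line(sample))
```

For a real file, the same function could be applied to each line after removing its trailing newline.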

Related

Amend list from file - Correct syntax and file format?

I currently have a list hard coded into my python code. As it keeps expanding, I wanted to make it more dynamic by reading the list from a file. I have read through many articles about how to do this, but in practice I can't get this working. So firstly, here is an example of the existing hardcoded list:
serverlist = []
serverlist.append(("abc.com", "abc"))
serverlist.append(("def.com", "def"))
serverlist.append(("hji.com", "hji"))
When I enter the command 'print serverlist' the output is shown below and my list works perfectly when I access it:
[('abc.com', 'abc'), ('def.com', 'def'), ('hji.com', 'hji')]
Now I've replaced the above code with the following:
serverlist = []
with open('/server.list', 'r') as f:
    serverlist = [line.rstrip('\n') for line in f]
With the contents of server.list being:
'abc.com', 'abc'
'def.com', 'def'
'hji.com', 'hji'
When I now enter the command print serverlist, the output is shown below:
["'abc.com', 'abc'", "'def.com', 'def'", "'hji.com', 'hji'"]
And the list is not working correctly. So what exactly am I doing wrong? Am I reading the file incorrectly or am I formatting the file incorrectly? Or something else?
The contents of the file are not interpreted as Python code. When you read a line in f, it is a string; and the quotation marks, commas etc. in your file are just those characters as parts of a string.
If you want to create some other data structure from the string, you need to parse it. The program has no way to know that you want to turn the string "'abc.com', 'abc'" into the tuple ('abc.com', 'abc'), unless you instruct it to.
This is the point where the question becomes "too broad".
If you are in control of the file contents, then you can simplify the data format to make this more straightforward. For example, if you just have abc.com abc on the line of the file, so that your string ends up as 'abc.com abc', you can then just .split() that; this assumes that you don't need to represent whitespace inside either of the two items. You could instead split on another character (like the comma, in your case) if necessary (.split(',')). If you need a general-purpose hammer, you might want to look into JSON. There is also ast.literal_eval which can be used to treat text as simple Python literal expressions - in this case, you would need the lines of the file to include the enclosing parentheses as well.
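As a rough sketch of the ast.literal_eval route, the quoted format from the question can be parsed as follows. It assumes every non-blank line holds quoted strings separated by commas, and it adds the enclosing parentheses in code rather than in the file. The helper name parse_server_line is hypothetical, and io.StringIO stands in for the real file.

```python
import ast
import io


def parse_server_line(line):
    """Turn a line such as "'abc.com', 'abc'" into the tuple ('abc.com', 'abc')."""
    # Wrap the text in parentheses so it reads as a tuple literal.
    return ast.literal_eval("(" + line.strip() + ")")


# io.StringIO stands in for open('/server.list') in this sketch.
fake_file = io.StringIO("'abc.com', 'abc'\n'def.com', 'def'\n'hji.com', 'hji'\n")
serverlist = [parse_server_line(line) for line in fake_file if line.strip()]
print(serverlist)
```

Unlike eval, ast.literal_eval only accepts literals, so a malformed line raises an exception instead of running arbitrary code.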
If you are willing to let go of the quotes in your file and rewrite it as
abc.com, abc
def.com, def
hji.com, hji
the code to load can be reduced to a one-liner, using the fact that files are iterable:
with open('servers.list') as f:
    servers = [tuple(line.rstrip('\n').split(', ')) for line in f]
Remember that iterating over a file does not strip the newlines, so each line has to be stripped explicitly.
You can allow arbitrary whitespace by doing something like
servers = [tuple(word.strip() for word in line.split(',')) for line in f]
It might be easier to use something like regex to parse the original format. You could use an expression that captures the parts of the line you care about and matches but discards the rest:
import re
pattern = re.compile(r"'(.+)',\s*'(.+)'")
You could then extract the names from the matched groups:
with open('servers.list') as f:
    # Strip the trailing newline first, otherwise fullmatch() finds no match.
    servers = [pattern.fullmatch(line.strip()).groups() for line in f]
This is just a trivialized example. You can make it as complicated as you wish for your real file format.
Try this:
serverlist = []
with open('/server.list', 'r') as f:
    for line in f:
        serverlist.append(tuple(line.rstrip('\n').split(',')))
Explanation
You want an explicit for loop so you cycle through each line as expected.
You need list.append for each line to append to your list.
You need to use split(',') in order to split by commas.
Convert to tuple as this is your desired output.
List comprehension method
The for loop can be condensed as below:
with open('/server.list', 'r') as f:
    serverlist = [tuple(line.rstrip('\n').split(',')) for line in f]

Removing an imported text file (Python)

I'm trying to remove a couple of lines from a text file that I imported from my Kindle. The text looks like:
Shall I come to you?
Nicholls David, One Day, loc. 876-876
Dexter looked up at the window of the flat where Emma used to live.
Nicholls David, One Day, loc. 883-884
I want to grab the bin bag and do a forensics
Sophie Kinsella, I've Got Your Number, loc. 64-64
The complete file is longer; this is just an excerpt. The aim of my code is to remove all lines containing "loc. " so that just the extracts remain. Equivalently, my target is to remove each line that comes just before a blank line.
My code so far looks like this:
f = open('clippings_export.txt','r', encoding='utf-8')
message = f.read()
line=message[0:400]
f.close()
key=["l","o","c","."," "]
for i in range(0,len(line)-5):
    if line[i]==key[0]:
        if line[i+1]==key[1]:
            if line[i + 2]==key[2]:
                if line[i + 3]==key[3]:
                    if line[i + 4]==key[4]:
                        print(i)  # placeholder body: i is the index where "loc. " starts
The last if finds the exact position (index) of each "loc. " in the file. However, after this stage I do not know how to go back along the line to find where it starts, so that the whole line can be removed. What could I do next? Would you recommend another way to remove these lines?
Thanks in advance!
I think that the question might be a bit misleading!
Anyway, if you simply want to remove those lines, you need to check whether they contain the "loc." substring. Probably the easiest way is to use the in operator.
Instead of reading the whole file with read(), read it line by line (using readlines(), for example). You can then check whether each line contains your key and omit it if it does.
Since the result is now a list of strings, you might want to merge it with str.join().
Here I used another list to store the desired lines; you could also use the "more pythonic" filter() or a list comprehension (see the example in the similar question I mention below).
f = open('clippings_export.txt','r', encoding='utf-8')
lines = f.readlines()
f.close()
filtered_lines = []
for line in lines:
    if "loc." in line:
        continue
    else:
        filtered_lines.append(line)
result = ""
result = result.join(filtered_lines)
By the way, I thought it might be a duplicate: here is a question about the opposite (that is, wanting the lines which contain the key).
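The list-comprehension variant could look roughly like this. It is a sketch that assumes the text "loc. " only ever appears on the attribution lines; drop_location_lines is an illustrative name, and the list of strings stands in for the lines of the real file.

```python
def drop_location_lines(lines, marker="loc. "):
    """Return only the lines that do not contain the marker text."""
    return [line for line in lines if marker not in line]


clippings = [
    "Shall I come to you?\n",
    "Nicholls David, One Day, loc. 876-876\n",
    "Dexter looked up at the window of the flat where Emma used to live.\n",
    "Nicholls David, One Day, loc. 883-884\n",
]
result = "".join(drop_location_lines(clippings))
print(result)
```

Because an open file is itself an iterable of lines, the file object can be passed to the function directly instead of a list.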

Unexpected output from textfile - cleaning read in lines correctly

I am trying to use a very basic text file as a settings file. Three lines repeat in this order/format that govern some settings/input for my program. Text file is as follows:
Facebook
1#3#5#2
Header1#Header2#Header3#Header4
...
This is read in using the following Python code:
f = open('settings.txt', 'r')
for row in f:
    platform = f.readline()
    rows_to_keep = int(f.readline().split('#'))
    row_headers = f.readline().split('#')
    clean_output(rows_to_keep, row_headers, platform)
I would expect a single string to be read into platform, a list of ints into the second variable and a list of strings into the third. These are then passed to the function, and this is repeated numerous times.
However, the following three things are happening:
Int doesn't convert and I get a TypeError
First line in text file is ignored and I get rows to keep in platform
\n at the end of each line
I suspect these are related and so am only posting one question.
You cannot call int on a list; you need to do some kind of list comprehension like
rows_to_keep = [int(a) for a in f.readline().split('#')]
You're reading a line, then reading another line from the file. You should either do some kind of slicing (see Python how to read N number of lines at a time) or call a function with the three lines after every third iteration.
Use .strip() to remove end of lines and other whitespace.
Try this:
with open('settings.txt', 'r') as f:
    # Assumes the file holds exactly three lines.
    platform, rows_to_keep, row_headers = f.read().splitlines()
rows_to_keep = [int(x) for x in rows_to_keep.split('#')]
row_headers = row_headers.split('#')
clean_output(rows_to_keep, row_headers, platform)
There are several things going on here. First, when you do the split on the second line, you're trying to cast a list to type int. That won't work. You can, instead, use map (on Python 3, wrap it in list(), since map returns an iterator there).
rows_to_keep = list(map(int, f.readline().strip().split("#")))
Additionally, note the strip() method above. It removes leading and trailing whitespace characters from your line, e.g. \n.
Try that change, and also use strip() on each readline() call.
With as few changes as possible, I've attempted to solve your issues and show you where you went wrong. @Daniel's answer is how I would personally solve the issues.
f = open('settings.txt', 'r')
# See 1. We remove the unnecessary for loop
platform = f.readline().strip()
# See 4. We make sure there are no unwanted leading or trailing characters by stripping them out
rows_to_keep = f.readline().strip().split('#')
# See 3. The enumerate function yields pairs (index, value)
for row in enumerate(rows_to_keep):
    rows_to_keep[row[0]] = int(row[1])
row_headers = f.readline().strip().split('#')
# See 2. We close the file when we're done reading
f.close()
clean_output(rows_to_keep, row_headers, platform)
1. You don't need (and don't want) a for loop on f, as well as calls to readline. You should pick one or the other.
2. You need to close f with f.close().
3. You cannot convert a list to an int, you want to convert the elements in the list to int. This can be accomplished with a for loop.
4. You probably want to call .strip to get rid of trailing newlines.
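Because the question says the three-line group repeats, the file can also be consumed three lines at a time. This is a sketch under the assumption that blank lines can be ignored and that any incomplete trailing group should be skipped. The name read_settings is illustrative, io.StringIO stands in for settings.txt, and the asker's clean_output function is not called here.

```python
import io


def read_settings(f):
    """Yield (platform, rows_to_keep, row_headers) for each three-line group."""
    lines = [line.strip() for line in f if line.strip()]
    # Step through the cleaned lines in groups of three.
    for i in range(0, len(lines) - 2, 3):
        platform = lines[i]
        rows_to_keep = [int(x) for x in lines[i + 1].split("#")]
        row_headers = lines[i + 2].split("#")
        yield platform, rows_to_keep, row_headers


fake_file = io.StringIO(
    "Facebook\n1#3#5#2\nHeader1#Header2#Header3#Header4\n"
)
settings = list(read_settings(fake_file))
print(settings)
```

Each yielded tuple could then be unpacked and passed on, for example as clean_output(rows_to_keep, row_headers, platform).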

Python Script Not Finding String that is Clearly Present in a file

I am attempting to analyze a csv file containing a TCP stack trace. I'm checking conditionally if a line of the file contains a certain string, then adding it to the dictionary.
The strings I'm looking for are:
[SYN]
[SYN, ACK]
I have checked the file multiple times. Python can find the first string no problem, but cannot find the second. Here's the code that checks:
#variable declaration
synString = '[SYN]'
ackString = '[SYN, ACK]'
#some code
#iterate through csv:
with open('EECS325Hw3Lab3', newline = '') as captureFile:
    captureReader = csv.reader(captureFile, delimiter=' ')
    for row in captureReader:
        #code that doesn't work.
        if synString in row or ackString in row:
            serverDict[currentServer].append(row)
And I know this doesn't work because when I print serverDict, I only see the [SYN] expression. What is happening here?
When you read it with a csv.reader, each row is a list of strings, and with a space delimiter '[SYN,' and 'ACK]' end up as separate, consecutive elements of the list, so no single element equals '[SYN, ACK]'. Just iterate over the opened file instead:
with open('EECS325Hw3Lab3') as captureFile:
    for line in captureFile:
        if synString in line or ackString in line:
            serverDict[currentServer].append(line)
Each line will be a string, as you want it to be. You may want to strip the newlines, though.
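The difference between the two membership tests can be seen on a single line. This sketch uses a made-up capture line, since the real file's layout is not shown in the question:

```python
import csv
import io

# Hypothetical capture line; the real file's columns may differ.
text = "1 10.0.0.1 10.0.0.2 TCP 80 > 5000 [SYN, ACK] Seq=0\n"

row = next(csv.reader(io.StringIO(text), delimiter=" "))
# On a list, `in` compares whole elements, and the space delimiter has
# split the flag text into "[SYN," and "ACK]".
found_in_row = "[SYN, ACK]" in row

# On a string, `in` is a substring search, so the flag text is found.
found_in_line = "[SYN, ACK]" in text

print(found_in_row, found_in_line)
```

A plain [SYN] entry contains no space, so it survives the split as one element, which is why only that string was being found.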

How can I open a file and iterate through it, adding data from only certain lines?

I have the following code
my_file=open("test.stl","r+")
vertices=[]
for line in my_file:
    line=line.strip()
    line=line.split()
    if line.startswith('vertex'):
        vertices.append([[line[1],line[2],line[3]])
print vertices
my_file.close()
and right now it gives this error:
File "convert.py", line 10
vertices.append([[line[1],line[2],line[3]])
^
SyntaxError: invalid syntax
My file has a bunch of lines in it, a lot of them formatted as vertex 5.6354345 3.34344 7.345345, for example (an STL file). I want to add those three numbers to my array so that it will eventually hold [[v1,v2,v3],[v1,v2,v3],....], where all those v's come from the lines. Reading other similar questions, it looks like I may need to import sys, but I am not sure why.
Do the lines in your STL file have any leading whitespace?
If they do, you need to strip that off first.
line = line.strip()
Also: calling line.split() doesn't affect line. It produces a new list, and you're expected to give the new list a name and use it afterwards, like this:
fields = line.split()
vertices.append([fields[1], fields[2], fields[3]])
You're not assigning the result of line.split() to a variable, e.g.:
line_split = line.split()
vertices.append([line_split[1],line_split[2],line_split[3]])
Another way would be:
for line in my_file:
    line_split = line.split()
    # Skip blank lines, which split into an empty list.
    if line_split and line_split[0] == 'vertex':
        vertices.append([line_split[1],line_split[2],line_split[3]])
vertices.append([[line[1],line[2],line[3]])
^
SyntaxError: invalid syntax
Remove the first [ (there is missing ] otherwise) to fix the SyntaxError. There are other errors in your code.
To parse lines that have:
vertex 5.6354345 3.34344 7.345345
format into a list of 3D points with float coordinates:
with open("test.stl") as file:
    # A list comprehension gives a real list on both Python 2 and 3.
    vertices = [[float(x) for x in line.split()[1:4]]
                for line in file
                if line.lstrip().startswith('vertex')]
print(vertices)
Apart from what others have mentioned:
vertices.append([[line[1],line[2],line[3]])
One too many left brackets before line[1], should be:
vertices.append([line[1],line[2],line[3]])
print verticies
Your list is named vertices, not verticies.
str.split() does not modify the string; it returns an entirely new list.
Assign the result of line.split() to line: line = line.split()
Then proceed as normal.
http://www.tutorialspoint.com/python/string_split.htm
This won't solve the problem on its own, though: without the reassignment, line is still a string, so indexing it pulls out individual characters rather than whitespace-separated fields, because strings behave like sequences of characters (see below).
text = "cat"
print(text[1])
# prints: a
I suspect that Python never gets past the if line.startswith('vertex'): condition. So as others have said, the core issue probably involves leading space or the file itself.
Also, if you're only reading the file, there's no need to include the access mode "r+". my_file=open("test.stl") works just as well and is more pythonic.
Try to use:
for line in my_file.readlines():
readlines returns a list of all lines in the file.
You don't need to import sys in your case.
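Putting the fixes from the answers together, a Python 3 version might look like the sketch below. It assumes vertex lines carry exactly three numeric fields after the keyword; parse_vertices is an illustrative name and the sample lines are made up.

```python
def parse_vertices(lines):
    """Collect [x, y, z] float triples from lines that start with 'vertex'."""
    vertices = []
    for line in lines:
        fields = line.split()  # split() also discards leading/trailing whitespace
        if len(fields) == 4 and fields[0] == "vertex":
            vertices.append([float(value) for value in fields[1:4]])
    return vertices


sample_lines = [
    "facet normal 0 0 1\n",
    "    vertex 5.6354345 3.34344 7.345345\n",
    "    vertex 1.0 2.0 3.0\n",
    "endfacet\n",
]
vertices = parse_vertices(sample_lines)
print(vertices)
```

With a real file, the open file object can be passed in place of the list, since iterating over it yields the same kind of lines.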
