Removing lines from an imported text file (Python)

I'm trying to remove a couple of lines from a text file that I imported from my Kindle. The text looks like:
Shall I come to you?
Nicholls David, One Day, loc. 876-876
Dexter looked up at the window of the flat where Emma used to live.
Nicholls David, One Day, loc. 883-884
I want to grab the bin bag and do a forensics
Sophie Kinsella, I've Got Your Number, loc. 64-64
The complete file is longer; this is just a piece of the document. The aim of my code is to remove all lines where "loc. " is written, so that just the extracts remain. My target can also be seen as removing the line which is just before each blank line.
My code so far looks like this:
f = open('clippings_export.txt', 'r', encoding='utf-8')
message = f.read()
line = message[0:400]
f.close()
key = ["l", "o", "c", ".", " "]
for i in range(0, len(line) - 5):
    if line[i] == key[0]:
        if line[i + 1] == key[1]:
            if line[i + 2] == key[2]:
                if line[i + 3] == key[3]:
                    if line[i + 4] == key[4]:
The last if finds exactly the positions (indices) where each "loc. " is located in the file. Nevertheless, after this stage I do not know how to go back in the line so that the code catches where the line starts and the line can be completely removed. What could I do next? Do you recommend another way to remove this line?
Thanks in advance!

I think that the question might be a bit misleading!
Anyway, if you simply want to remove those lines, you need to check whether they contain the "loc." substring. Probably the easiest way is to use the in operator.
Instead of getting the whole file from the read() function, read the file line by line (using the readlines() function, for example). You can then check whether each line contains your key and omit it if it does.
Since the result is now a list of strings, you might want to merge it with str.join().
Here I used another list to store the desired lines; you can also use the "more pythonic" filter() or a list comprehension (there is an example in the similar question I mention below).
f = open('clippings_export.txt', 'r', encoding='utf-8')
lines = f.readlines()
f.close()

filtered_lines = []
for line in lines:
    if "loc." in line:
        continue
    else:
        filtered_lines.append(line)

result = "".join(filtered_lines)
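As mentioned, the same filtering can also be written as a single generator expression inside join. A minimal sketch, using an in-memory list of lines (mirroring the clippings file in the question) so it runs standalone:

```python
# Stand-in for f.readlines() on the clippings file
lines = [
    "Shall I come to you?\n",
    "Nicholls David, One Day, loc. 876-876\n",
    "Dexter looked up at the window of the flat where Emma used to live.\n",
]

# Keep only the lines that do not contain the "loc." marker
result = "".join(line for line in lines if "loc." not in line)
print(result)
```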
By the way, I thought this might be a duplicate - here's a question about the opposite (that is, wanting only the lines which contain the key).

Related

Remove duplicates in text file line by line

I'm trying to write a Python script that will remove duplicate strings in a text file. However, the de-duplication should only occur within each line.
For example, the text file might contain:
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;10 ABC\ABCD\ABCDE;þ
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;12 EFG\EFG;þ
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;09 XYZ\XYZ\XYZ;12 EFG\EFG;þ
Thus, in the above example, the script should remove only the duplicated strings (the repeated 10 ABC\ABCD\ABCDE and 12 EFG\EFG values, shown in bold in the original post).
I've searched Stack Overflow and elsewhere to try to find a solution, but haven't had much luck. There seem to be many solutions that will remove duplicate lines, but I'm trying to remove duplicates within a line, line-by-line.
Update: Just to clarify - þ is the delimiter for each field, and ; is the delimiter for each item within each field. Within each line, I'm attempting to remove any duplicate strings contained between semicolons.
Update 2: Example edited to reflect that the duplicate value may not always follow directly after the first instance of the value.
@Prune's answer gives the idea, but it needs to be modified like this:
input_file = """þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;10 ABC\ABCD\ABCDE;þ
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;12 EFG\EFG;þ"""
input = input_file.split("\n")
for line in input:
    seen_item = []
    for item in line.split(";"):
        if item not in seen_item or item == "þ":
            seen_item.append(item)
    print(";".join(seen_item))
import re

with open('file', 'r') as f:
    file = f.readlines()

for line in file:
    print(re.sub(r'([^;]+;)(\1)', r'\1', line))
Read the file by lines; then replace the duplicates using re.sub.

python remove info from readlines() that doesn't match a list

I am reading a file and saving its contents with readlines(). I then check whether any of the data from one of my lists is in those lines. The problem I am facing is removing all the information from the readlines() result that isn't in my list, so that it contains only the information that is in my list, if there are any matches. By a match, I mean any of the words found in any order. Could someone please point me in the right direction? Thank you. I am using Python 2.7 and am reading utf-8 files.
Edit: I am reading files and stores their information to readlines(), I then use my list to check and see if the file contains what I am looking for. If it does, then I want to remove all the data from readlines(), except the match found from my list. I save the matches to a text file. I hope this makes sense. If I am going about this the right way, please let me know.
Edit2: I am reading a file and then using readlines, which stores the data from that file in my readlines() variable. I know it would be helpful to share my code, but I am not allowed to do so.
Edit 3: Pseudo code
alist = ['hamburger', 'cow', 'meat']
openit = open.codecs('afile.html', 'utf-8-sig')
justreadit = openit.readlines()
for alist in justreadit:
    print "found matches"
comment: remove any data that is not in the list. (When I tried putting in the pound sign as a normal comment, it didn't work.)
edit4: I am looking for any of the words in the file in alist. No order, I just need to find the word and save it to a text file.
So let me see if I'm understanding this right.
You have a file that looks something like this:
I am a farmer
Sometimes, I farm chickens
I also have a cow
I like to eat hamburger meat
But not lamb
You want to grab the third and fourth lines out of this, because the third line has "cow", and the fourth line has both "hamburger" and "meat". If this is a correct understanding of your problem, here is code that will achieve that (assuming the above text is saved to afile.html in the current working directory).
word_list = ['hamburger', 'cow', 'meat']
with open('afile.html', encoding='utf-8-sig') as f:
    lines = f.readlines()

for line in lines:
    for word in word_list:
        if word in line:
            print(line)
            break
Result:
I also have a cow
I like to eat hamburger meat
Is this the result you wanted?
Note that there are many ways this could fail. For example, the line I LIKE COW would not be printed, because "COW" is not in the same case as "cow". The line "I like cows" would be printed, because the substring "cow" is found in that line (even though the word "cow" isn't). Because the specification in your question is unclear about these things, I have not tried to guess at which of these behaviors you do or do not want.
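If whole-word, case-insensitive matching is what you actually want, one option (not part of the original answer, just a sketch) is a regex built from the word list, using \b word boundaries:

```python
import re

word_list = ['hamburger', 'cow', 'meat']
# \b word boundaries make "cow" match the word itself but not "cows";
# re.IGNORECASE makes "COW" match too
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, word_list)) + r')\b',
                     re.IGNORECASE)

lines = ["I LIKE COW", "I like cows", "I like to eat hamburger meat"]
matches = [line for line in lines if pattern.search(line)]
print(matches)
```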
I'm pretty new at this, but file.readlines() returns a list, with each list entry being a line from the target file. To return only the matches, I would:
justreadit = openit.readlines()
matchlist = []
for i in justreadit:
    for h in alist:
        if h == i:
            matchlist.append(i)
return matchlist

Search for every blank line in document, and pop first line after that?

What I am trying to do is go through a document line by line, find each blank line, keep traversing until I hit the next line of text, and pop that line.
So for example, what I want to do is this:
Paragraph 1
This is a line.
This is another line.
Here is a line after a space, which I want to pop!
Here is the next line, which I want to keep.
Here is another line I want to pop.
So it will go through each number of blank lines until it hits the next sentence, and pops that sentence only, then continues on. I am thinking I should use re.split('\n') , but I am not sure.
I am sorry I have no code to post, but I really don't know where to start.
Any help would be much appreciated, thank you!
This is part of a larger piece of code, which I've worked on for days and days and have figured out up to this point, so I have done the bulk of the work.
If you do for line in filehandle: it will iterate over each line. If you have a flag that is true when the previous line is blank you can skip the next line then reset the flag.
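That flag-based approach might look like this (a sketch using an in-memory list of lines in place of a file handle):

```python
lines = ["Paragraph 1", "", "pop me", "keep me", "", "pop me too"]

kept = []
previous_blank = False  # True when the line we just saw was blank
for line in lines:
    if previous_blank and line != "":
        previous_blank = False  # drop the first text line after a blank run
        continue
    previous_blank = (line == "")
    kept.append(line)
print(kept)
```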
The easiest novice solution by far is probably the way Steve suggested: Just iterate the lines, and use a flag to keep track of whether the last line was a blank line.
But if you want a higher-level solution, you need to rethink the problem at a higher level. What you're actually trying to specify is the first line of every paragraph but the first, where "paragraphs" are things divided by empty lines. Right?
So, how could you do that? Well, you can split on '\n\n' just as easily as on \n. So:
paragraphs = document.split('\n\n')
first_lines = [paragraph.partition('\n')[0] for paragraph in paragraphs]
popped_lines = first_lines[1:]
(I used partition instead of split here both because it splits only at the first '\n', leaving the rest alone, and because it handles one-line paragraphs right—which paragraph.split('\n', 1) would not.)
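To illustrate that difference (the strings here are hypothetical, not from the question):

```python
two_lines = "first line\nrest of paragraph"
one_line = "only line"

# partition always returns a 3-tuple, even when the separator is absent
print(two_lines.partition('\n'))  # ('first line', '\n', 'rest of paragraph')
print(one_line.partition('\n'))   # ('only line', '', '')

# split with maxsplit=1 gives a 1-element list for a one-line paragraph,
# so indexing element [1] would raise IndexError
print(one_line.split('\n', 1))    # ['only line']
```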
But you don't want a list of the popped lines, you want a list of everything but the popped lines, right?
paragraphs = document.split('\n\n')
first, rest = paragraphs[0], paragraphs[1:]
rest_edited = [paragraph.partition('\n')[2] for paragraph in rest]
And if you want to turn that back into a document:
all_edited = [first] + rest_edited
document_edited = '\n\n'.join(all_edited)
You can shorten that a bit by using slice assignment, although I'm not sure it's quite as readable:
paragraphs = document.split('\n\n')
paragraphs[1:] = [paragraph.partition('\n')[2] for paragraph in paragraphs[1:]]
document_edited = '\n\n'.join(paragraphs)
As J.F. Sebastian points out, the question is a little ambiguous. Does "blank lines" mean "empty lines", or "lines with nothing but whitespace in them"? If it's the latter, things are a bit more complicated, and the easiest solution probably is a simple regex (r'\n\s*\n') for the splitting into paragraphs.
Meanwhile, if what you have is a sequence of lines (and note that a file is a sequence of lines!) rather than one big string, you can do this without split at all, in a few different ways.
For example, paragraphs are groups of non-blank lines, right? So you can use the groupby function to get them:
groups = itertools.groupby(lines, lambda line: not line)
Or, if "blank" doesn't mean "empty":
groups = itertools.groupby(lines, lambda line: not line.strip())
Note that this gives you (False, <sequence of lines>) for each paragraph, and (True, <sequence of blank lines>) for each blank line. If you want to preserve blank lines as-is, you can—but if you're happy just replacing each run of blank lines with a single empty line (which you obviously are if "blank" does mean "empty"), it's probably easier to throw away the blank paragraphs:
paragraphs = (group for (key, group) in groups if not key)
Then you can remove the first element from all but the first group, and finally chain the groups back together into one big sequence:
first = next(paragraphs)
edited_paragraphs = (itertools.islice(paragraph, 1, None) for paragraph in paragraphs)
edited_document = itertools.chain(first, *edited_paragraphs)
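Pieced together with lists instead of lazy iterators (easier to inspect while learning), the whole groupby pipeline might look like this sketch:

```python
import itertools

lines = ["Paragraph 1", "", "pop me", "keep me", "", "pop me too", "also keep"]

# Group consecutive lines by blankness; drop the blank groups
groups = itertools.groupby(lines, lambda line: not line.strip())
paragraphs = [list(group) for key, group in groups if not key]

# Keep the first paragraph whole; drop the first line of each later one
first, rest = paragraphs[0], paragraphs[1:]
edited = first + [line for para in rest for line in para[1:]]
print(edited)
```

Note that this version replaces each run of blank lines with nothing at all; re-inserting a single blank line between paragraphs is an easy extension.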
Finally, what if you have runs of multiple blank lines in a row? Well, first you have to decide how to deal with them. If you have two blank lines, do you remove the second? If so, do you remove the first line of the next paragraph (because it was originally after a blank line), or not (because the blank line it was after was already removed)? What if you have three in a row? Splitting on '\n\n' will do one thing, splitting on '\n\s*\n' a different thing, and groupby yet another… but until you know what you want, it's impossible to say which is "right" or how to "fix" the others, of course.
I assume the original poster (OP) wants to remove those lines in place, meaning removing them from the file. Here is a revised solution (my previous solution was off the mark; thank you J.F. Sebastian for telling me).
import fileinput

def remove_line_after_blank(filename, in_place_edit=False):
    previous_line = ''
    for line in fileinput.input(filename, inplace=in_place_edit):
        if not (previous_line == '\n' and line != '\n'):
            print line.rstrip()
        previous_line = line

if __name__ == '__main__':
    remove_line_after_blank('data.txt', in_place_edit=True)
Discussion
If you do not want to modify the original data file, remove , in_place_edit=True.
Use re.findall to match all occurrences in a string:
>>> text = """Paragraph 1
This is a line.
This is another line.
Here is a line after a space, which I want to pop!
Here is the next line, which I want to keep.
Here is another line I want to pop."""
>>> re.findall("\n\n+(.+)", text)
['Here is a line after a space, which I want to pop!', 'Here is another line I want to pop.']
>>> re.findall("\n\n+(.+)$", text, re.MULTILINE)
['Here is a line after a space, which I want to pop!', 'Here is another line I want to pop.']
The easiest way would be to split the text on newlines:
lines = your_string.split("\n")
That would break it up into an array (stored in lines), where each element of the array is a separate line of text. (As noted in the comments, if you have a file object already, you can just loop through that.)
Then you could go through each line of lines, checking for a newline. If you find one, you could "pop" out the next one. (I don't know what you mean by pop, so I just have the code printing out the lines you want.)
print_next_line = False  # becomes True right after a blank line
for line in lines:
    if print_next_line:
        print(line)
        print_next_line = False
    if line == "":
        print_next_line = True

How can I open a file and iterate through it, adding data from only certain lines?

I have the following code
my_file = open("test.stl", "r+")
vertices = []
for line in my_file:
    line = line.strip()
    line = line.split()
    if line.startswith('vertex'):
        vertices.append([[line[1],line[2],line[3]])
print vertices
my_file.close()
and right now it gives this error:
File "convert.py", line 10
vertices.append([[line[1],line[2],line[3]])
^
SyntaxError: invalid syntax
My file has a bunch of lines in it, a lot of them formatted as vertex 5.6354345 3.34344 7.345345, for example (an STL file). I want to add those three numbers to my array so that my array will eventually be [[v1,v2,v3],[v1,v2,v3],...] where all those v's come from the lines. Reading other similar questions, it looks like I may need to import sys, but I am not sure why.
Do the lines in your STL file have any leading whitespace?
If they do, you need to strip that off first.
line = line.strip()
Also: calling line.split() doesn't affect line. It produces a new list, and you're expected to give the new list a name and use it afterwards, like this:
fields = line.split()
vertices.append([fields[1], fields[2], fields[3]])
You're not assigning line.split() to a variable, e.g.:
line_split = line.split()
vertices.append([line_split[1], line_split[2], line_split[3]])
Another way would be:
for line in my_file:
    line_split = line.split()
    if line_split[0] == 'vertex':
        vertices.append([line_split[1], line_split[2], line_split[3]])
vertices.append([[line[1],line[2],line[3]])
^
SyntaxError: invalid syntax
Remove the first [ (otherwise there is a missing ]) to fix the SyntaxError. There are other errors in your code.
To parse lines that have:
vertex 5.6354345 3.34344 7.345345
format into a list of 3D points with float coordinates:
with open("test.stl") as file:
    vertices = [map(float, line.split()[1:4])
                for line in file
                if line.lstrip().startswith('vertex')]
    print vertices
Apart from what others have mentioned:
vertices.append([[line[1],line[2],line[3]])
One too many left brackets before line[1], should be:
vertices.append([line[1],line[2],line[3]])
print verticies
Your list is named vertices, not verticies.
str.split() does not modify the string; it produces an entirely new list.
Assign the result of line.split() to line: line = line.split()
Then proceed as normal.
http://www.tutorialspoint.com/python/string_split.htm
This won't solve the problem by itself, though: without the split you would still be pulling individual characters out of line (rather than whitespace-separated fields), because strings act as sequences of characters to begin with (see below).
text = "cat"
print(text[1])
>>> 'a'
I suspect that Python never gets past the if line.startswith('vertex'): condition. So as others have said, the core issue probably involves leading space or the file itself.
Also, if you're only reading the file, there's no need to include the access mode "r+". my_file=open("test.stl") works just as well and is more pythonic.
Try to use:
for line in my_file.readlines():
readlines returns a list of all lines in the file.
You don't need to import sys in your case.

Putting parts of a text file into a list

I have this text file and I need certain parts of it to be inserted into a list.
The file looks like:
blah blah
.........
item: A,B,C.....AA,BB,CC....
Other: ....
....
I only need to rip out the A,B,C.....AA,BB,CC..... parts and put them into a list. That is, everything after "Item:" and before "Other:"
This can be easily done with small input, but the problem is that it may contain a large number of items and text file may be pretty huge. Would using rfind and strip be as efficient for huge input as for small input, algorithmically speaking?
What would be an efficient way to do it?
I can see no need for rfind() or strip().
It looks like you're simply trying to do:
start = 'item: '
end = 'Other: '
should_append = False
the_list = []
for line in open('file').readlines():
    if line.startswith(start):
        data = line[len(start):]
        the_list.append(data)
        should_append = True
    elif line.startswith(end):
        should_append = False
        break
    elif should_append:
        the_list.append(line)
print the_list
This doesn't hold the whole file in memory, just the current line and the list of lines found between the start and the end patterns.
To answer the question about efficiency specifically, reading in the file and comparing it line by line will net O(n) average case performance.
Example by Code:
pattern = "item:"
with open("file.txt", 'r') as f:
    for line in f:
        if line.startswith(pattern):
            # You can do what you like with it; split it along whitespace
            # or a character, then put it into a list.
            pass
You're searching the entire file sequentially, and you have to compare some number of elements in the file before you come across the element you're looking for.
You have the option of building a search tree instead. While it costs O(n) to build, it would cost O(log_k n) time to search (resulting in O(n) time overall, again), where k is the number of starting characters you'd have in your list.
Though I usually jump at the chance to employ regular expressions, I feel like for a single occurrence in a large file, it would be much more work and too computationally expensive to use regex. So perhaps the straightforward answer (in python) would be most appropriate:
s = 'item:'
yourlist = next(line[len(s)+1:].split(',') for line in open("c:\zzz.txt") if line.startswith(s))
This, of course, assumes that 'item:' doesn't exist on any other lines that are NOT followed by 'other:', but in the event 'item:' exists only once and at the start of the line, this simple generator should work for your purposes.
This problem is simple enough that it really only has two states, so you could just use a Boolean variable to keep track of what you are doing. But the general case for problems like this is to write a state machine that transitions from one state to the next until it has worked its way through the problem.
I like to use enums for states; unfortunately Python doesn't really have a built-in enum. So I am using a class with some class variables to store the enums.
Using the standard Python idiom for line in f (where f is the open file object) you get one line at a time from the text file. This is an efficient way to process files in Python; your initial lines, which you are skipping, are simply discarded. Then when you collect items, you just keep the ones you want.
This answer is written to assume that "item:" and "Other:" never occur on the same line. If this can ever happen, you need to write code to handle that case.
EDIT: I made the start_code and stop_code into arguments to the function, instead of hard-coding the values from the example.
import sys

class States:
    pass

States.looking_for_item = 1
States.collecting_input = 2

def get_list_from_file(fname, start_code, stop_code):
    lst = []
    state = States.looking_for_item
    with open(fname, "rt") as f:
        for line in f:
            l = line.lstrip()
            # Don't collect anything until after we find "item:"
            if state == States.looking_for_item:
                if not l.startswith(start_code):
                    # Discard input line; stay in same state
                    continue
                else:
                    # Found item!  Advance state and start collecting stuff.
                    state = States.collecting_input
                    # chop out start_code
                    l = l[len(start_code):]
                    # Collect everything after "item":
                    # Split on commas to get strings.  Strip white-space from
                    # ends of strings.  Append to lst.
                    lst += [s.strip() for s in l.split(",")]
            elif state == States.collecting_input:
                if not l.startswith(stop_code):
                    # Continue collecting input; stay in same state.
                    # Split on commas to get strings.  Strip white-space from
                    # ends of strings.  Append to lst.
                    lst += [s.strip() for s in l.split(",")]
                else:
                    # We found our terminating condition!  Don't bother to
                    # update the state variable, just return lst and we
                    # are done.
                    return lst
            else:
                print("invalid state reached somehow! state: " + str(state))
                sys.exit(1)

lst = get_list_from_file(sys.argv[1], "item:", "Other:")
# do something with lst; for now, just print
print(lst)
I wrote an answer that assumes that the start code and stop code must occur at the start of a line. This answer also assumes that the lines in the file are reasonably short.
You could, instead, read the file in chunks, and check to see if the start code exists in the chunk. For this simple check, you could use if code in chunk (in other words, use the Python in operator to check for a string being contained within another string).
So, read a chunk, check for start code; if not present discard the chunk. If start code present, begin collecting chunks while searching for the stop code. In a recent Python version you can concatenate the blocks one at a time with reasonable performance. (In an old version of Python you should store the chunks in a list, then use the .join() method to join the chunks together.)
Once you have built a string that holds data from the start code to the end code, you can use .find() and .rfind() to find the start code and end code, and then cut out just the data you want.
If the start code and stop code can occur more than once in the file, wrap all of the above in a loop and loop until end of file is reached.
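A sketch of that chunked approach, with io.StringIO standing in for a real file handle so it runs standalone (the content, codes, and chunk size are purely illustrative):

```python
import io

# io.StringIO stands in here for a real file handle from open(filename)
f = io.StringIO("blah blah\nitem: A,B,C,AA,BB\nOther: stuff\nmore\n")

start_code, stop_code = "item:", "Other:"
CHUNK_SIZE = 8  # tiny, to show the chunking; a real value might be 64 KiB

buf = ""
collecting = False
while True:
    chunk = f.read(CHUNK_SIZE)
    if not chunk:
        break
    buf += chunk
    if not collecting:
        if start_code in buf:
            collecting = True
            buf = buf[buf.find(start_code):]  # discard everything before it
        else:
            # keep only a short tail so a start code split across a
            # chunk boundary is still found
            buf = buf[-(len(start_code) - 1):]
    if collecting and stop_code in buf:
        break

# cut out just the data between the two codes
data = buf[len(start_code):buf.find(stop_code)].strip()
print(data)
```

This handles one start/stop pair; as noted above, if the codes can occur more than once, the whole loop would itself need to be repeated until end of file.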
