program for comparisons in python - python

I'm really new to programming (really, really new) and need help with the basics. I'm trying to write a program with python that will compare the contents of two .txt files, one a reference and the other the source. The contents are a simple random listing of names, and I want it to print out if there are any names in the source that are not in the reference.
I've looked at other stuff on this site but every time I tried it, the terminal would never actually give a result, even if there was a print command in the program.
I also have a hard time reading the language of a program and ascertaining its exact function, so something with clear directions would be really appreciated.
What I have so far is:
ref = open("reference.txt")
sor = open("source.txt")
list1 = ref.read()
list2 = sor.read()
for i in list2:
    if i != list:
    print i
ref.close()
sor.close()
And when I try and run this, it says "expected an indented block"? at the 'print i' line. Why?
Please help me out, as I have to teach myself this stuff and am not doing too well.
Thanks.

If you are totally, completely new to programming then it will take you some time to be able to implement what you describe. Take a step back, pour yourself a beverage, and start here. Start at the beginning, and repeat each illustration until you understand.
http://docs.python.org/tutorial/

As previously mentioned, the body of your inner if statement needs to be indented, as in
for i in list2:
    if i != list:
        print i
This requires two indents because it is two nested blocks. As a basic rule of thumb, anywhere you're ending a line with a colon (:), you're starting a new code block, and should be indenting another level. This is so you can un-indent once to end the if block without ending the for block.
However, I doubt this will do what you want based on your description. It's likely you wanted something more like
referenceLines = set(ref.readlines())
for line in sor.readlines():
    if line not in referenceLines:
        print line

if blocks in Python have to be indented; add another level of indent for your print i statement:
for i in list2:
    if i != list:
        print i

These lines read the files as strings:
list1 = ref.read()
list2 = sor.read()
This loop iterates through the string one character at a time:
for i in list2:
This line compares the character to the list class:
if i != list:
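To make those three observations concrete, here is a small sketch in Python 3 syntax. The sample text is made up for illustration and stands in for what ref.read() would return.

```python
# Hypothetical file contents, standing in for what ref.read() would return.
text = "alice\nbob\n"

# Iterating over a string yields single characters, not lines.
chars = [c for c in text]

# splitlines() breaks the string into one entry per line.
names = text.splitlines()

print(chars[:3])   # the first three characters
print(names)       # the individual names

# A one-character string never equals the built-in list type itself,
# so `i != list` is True for every character and every character is printed.
print(all(c != list for c in text))
```

Splitting the text into lines first (or iterating over the open file directly) is what makes a name-by-name comparison possible.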

I'll answer your indentation error first: you need another 4 spaces before the print statement. In Python indentation is important and you need to indent any block and dedent to end that block.
For your problem I am going to not give you written out code in advance but more of a flow on how to do it:
Create 2 sets (http://docs.python.org/library/stdtypes.html#set-types-set-frozenset)
Read each file into a separate set (you can do this while iterating over the file and adding each line to the set).
Compare your two sets using the set1 - set2 syntax (see the link above) to show all items that are in set1 but not in set2.
Hope you can make it work from this.
Now for the code:
with open('file1.txt') as file1:
    set1 = set(line for line in file1)
with open('file2.txt') as file2:
    set2 = set(line for line in file2)
print set1 - set2
This uses some principles you are probably not familiar with (look up: list comprehensions, generator expressions and the previously noted link about sets, which are collections of unique items).
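The same idea can be sketched in Python 3 syntax with whitespace stripped from each line, so that a last line without a trailing newline still matches. The function name names_missing_from_reference and the sample data are hypothetical, chosen only for this illustration.

```python
def names_missing_from_reference(source_lines, reference_lines):
    """Return names found in source_lines but not in reference_lines."""
    # strip() removes the trailing newline so "bob\n" and "bob" compare equal;
    # blank lines are skipped.
    reference = {line.strip() for line in reference_lines if line.strip()}
    source = {line.strip() for line in source_lines if line.strip()}
    # Set difference: everything in source that is absent from reference.
    return source - reference


# Lists of lines stand in for open files here; an open file object
# can be passed the same way because it also yields one line at a time.
source = ["alice\n", "bob\n", "carol"]
reference = ["bob\n", "alice\n"]
print(sorted(names_missing_from_reference(source, reference)))
```

With real files, the two arguments would be the file objects from open("source.txt") and open("reference.txt"), ideally opened in a with block so they are closed automatically.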

Related

Removing an imported text file (Python)

I'm trying to remove a couple of lines from a text file that I imported from my Kindle. The text looks like:
Shall I come to you?
Nicholls David, One Day, loc. 876-876
Dexter looked up at the window of the flat where Emma used to live.
Nicholls David, One Day, loc. 883-884
I want to grab the bin bag and do a forensics
Sophie Kinsella, I've Got Your Number, loc. 64-64
The complete file is longer; this is just a piece of the document. The aim of my code is to remove all lines where "loc. " is written so that just the extracts remain. My target can also be seen as removing the line just before each blank line.
My code so far looks like this:
f = open('clippings_export.txt','r', encoding='utf-8')
message = f.read()
line=message[0:400]
f.close()
key=["l","o","c","."," "]
for i in range(0,len(line)-5):
    if line[i]==key[0]:
        if line[i+1]==key[1]:
            if line[i + 2]==key[2]:
                if line[i + 3]==key[3]:
                    if line[i + 4]==key[4]:
The last if finds exactly the position (index) where each "loc. " is located in the file. However, after this stage I do not know how to go back in the line so that the code finds where the line starts and the line can be completely removed. What could I do next? Would you recommend another way to remove this line?
Thanks in advance!
I think that the question might be a bit misleading!
Anyway, if you simply want to remove those lines, you need to check whether they contain the "loc." substring. Probably the easiest way is to use the in operator.
Instead of getting the whole file from the read() method, read the file line by line (using the readlines() method, for example). You can then check whether each line contains your key and omit it if it does.
Since the result is now a list of strings, you might want to merge it with str.join().
Here I used another list to store the desired lines; you can also use the "more pythonic" filter() or a list comprehension (example in similar question I mentioned below).
f = open('clippings_export.txt','r', encoding='utf-8')
lines = f.readlines()
f.close()
filtered_lines = []
for line in lines:
    if "loc." in line:
        continue
    else:
        filtered_lines.append(line)
result = ""
result = result.join(filtered_lines)
By the way, I thought it might be a duplicate - Here's question about the opposite (that is wanting lines which contain the key).
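The list comprehension variant mentioned above might look roughly like this. The function name drop_location_lines and its marker parameter are hypothetical, and the sample lines are copied from the question to stand in for f.readlines().

```python
def drop_location_lines(lines, marker="loc. "):
    """Return the lines that do not contain marker, joined into one string."""
    # Keep only the lines where the marker is absent.
    kept = [line for line in lines if marker not in line]
    return "".join(kept)


# Sample lines taken from the question, standing in for f.readlines().
sample = [
    "Shall I come to you?\n",
    "Nicholls David, One Day, loc. 876-876\n",
    "Dexter looked up at the window of the flat where Emma used to live.\n",
    "Nicholls David, One Day, loc. 883-884\n",
]
print(drop_location_lines(sample))
```

Note that this drops any line containing the marker, so an extract that itself happens to contain "loc. " would be removed as well.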

how to start printing from an alphabet while i'm getting a few new lines in the start

I have to print something by taking input from a file. The first few lines are empty. Therefore the output is turning out to be empty. It's like someone has pressed enter key 10 times before writing anything.
I want to ignore those inputs and consider only those which are not empty. What should I do?
Your problem can be solved by checking whether a line contains anything other than a newline character ("\n"):
fileObj = open(Filename)
for row in fileObj:
    if len(row.replace("\n", "")) > 0:
        print(row)
        # Do your operations
If you can edit your question to add material, that would be helpful, but here’s a few pointers for now.
Assuming you're taking the file in as a string (let's call it f), you can strip the leading empty lines with a while loop:
while f and f[0] == "\n":
    f = f[1:]
This chops off only the initial newlines while keeping any line breaks later on in the file.
Note that, depending on the system the file was written on, line endings may be stored as "\r\n", in which case you could easily alter this while loop to remove those characters too. Good luck!
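When the input is read line by line rather than as one string, the same "skip only the leading blanks" idea can be sketched with itertools.dropwhile from the standard library. The function name skip_leading_blank_lines is hypothetical, and io.StringIO stands in for an open file.

```python
import io
from itertools import dropwhile


def skip_leading_blank_lines(lines):
    """Yield lines starting from the first one that is not blank."""
    # strip() removes "\n", "\r\n" and spaces, so a blank line becomes "".
    return dropwhile(lambda line: line.strip() == "", lines)


# io.StringIO stands in for an open file in this illustration.
fake_file = io.StringIO("\n\n\nfirst\n\nsecond\n")
result = list(skip_leading_blank_lines(fake_file))
print(result)
```

Blank lines that appear after the first non-empty line are kept, which matches the behaviour of the while loop above.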

basic python file IO homework

I'm having trouble figuring out where I'm going wrong here.
The original file is:
python is a programming language that lets you WORK more quickly and integrate your systems more effectively.
you can learn to use python and see almost immediate gains in PRODUCTIVITY and lower maintenance COSTS.
it's very helpful for any field of study.
I'm trying to create a function that takes a file and reads it and then capitalizes the sentences, changes the caps lock to lower case and the "it's" to "this is". Then put the file back together and add a period after the sentences. Write the new file string into a .txt file named 'Edited.txt'.
My code is:
def edit(aFile):
    f = open(aFile, 'r')
    xs = f.readlines()
    f.close()
    g = open('happy.txt', 'w')
    for x in xs:
        x.capitalize()
        if x.isupper==1:
            x.lower()
        g.write(x)
    g.close()
The error I get is "File not found-happy.txt(Access is denied)". I tried to read the file and couldn't.
I am 100% positive that the file is there and the media path is set to the folder.
isupper is a method that returns True or False, so the line should read:
if x.isupper():
not
if x.isupper==1:
Not sure if this answers your question, but you should really post more about the error for us to answer properly.
Additionally, many of the python string methods, such as capitalize() and lower() create COPIES of the string, and don't actually modify the original string. So if:
x = "TEST"
then calling
y = x.lower()
will result in x still being "TEST" and y being "test".
This statement doesn't do anything as is:
x.capitalize()
It returns x with the first character capitalized, but you don't save the results anywhere. Also, x remains unchanged after this statement. If you want to capitalize the first char of x, do this:
x = x.capitalize()
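A short sketch in Python 3 syntax shows the difference between discarding and keeping the result. The sample sentence is adapted from the question's file; the variable names are only for illustration.

```python
original = "python lets you WORK more quickly"

# The return value is discarded, so this line has no visible effect.
original.capitalize()
unchanged = original

# Assigning the result keeps the new string.
# capitalize() upper-cases the first character and lower-cases the rest.
fixed = original.capitalize()

print(unchanged)
print(fixed)

# isupper must be called with parentheses to get True or False;
# without them, x.isupper is just a method object.
print("ABC".isupper(), "Abc".isupper())
```

Because capitalize() also lower-cases everything after the first character, it already takes care of words such as WORK in this sample.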
The first major mistake that I can see is that you are doing string methods without assigning them to anything. Strings are immutable, so x.capitalize() does nothing (as jh314 said).
In addition to what the others have said, your for x in xs line is saying "for every line in the file, do the following". Your file appears to only be one line, so you are trying to do everything on one line.
Try looking at the documentation on regular expressions and string methods.
http://docs.python.org/2/library/string.html
http://docs.python.org/2/library/re.html
They should be helpful for identifying the places within your line that you would like to modify.

Putting parts of a text file into a list

I have this text file and I need certain parts of it to be inserted into a list.
The file looks like:
blah blah
.........
item: A,B,C.....AA,BB,CC....
Other: ....
....
I only need to rip out the A,B,C.....AA,BB,CC..... parts and put them into a list. That is, everything after "Item:" and before "Other:"
This can be easily done with small input, but the problem is that it may contain a large number of items and text file may be pretty huge. Would using rfind and strip be as efficient for huge input as for small input, algorithmically speaking?
What would be an efficient way to do it?
I can see no need for rfind() nor strip().
It looks like you're simply trying to do:
start = 'item: '
end = 'Other: '
should_append = False
the_list = []
# Iterate over the file object directly so only one line is read at a time
for line in open('file'):
    if line.startswith(start):
        data = line[len(start):]
        the_list.append(data)
        should_append = True
    elif line.startswith(end):
        should_append = False
        break
    elif should_append:
        the_list.append(line)
print the_list
This doesn't hold the whole file in memory, just the current line and the list of lines found between the start and the end patterns.
To answer the question about efficiency specifically, reading in the file and comparing it line by line will net O(n) average case performance.
Example code:
pattern = "item:"
with open("file.txt", 'r') as f:
    for line in f:
        if line.startswith(pattern):
            # You can do what you like with it; split it along whitespace or a character, then put it into a list.
            pass
You're searching the entire file sequentially, and you have to compare some number of elements in the file before you come across the element you're looking for.
You have the option of building a search tree instead. While it costs O(n) to build, it would cost O(log_k n) time to search (resulting in O(n) time overall, again), where k is the number of starting characters you'd have in your list.
Though I usually jump at the chance to employ regular expressions, I feel like for a single occurrence in a large file, it would be much more work and too computationally expensive to use regex. So perhaps the straightforward answer (in python) would be most appropriate:
s = 'item:'
yourlist = next(line[len(s)+1:].split(',') for line in open(r"c:\zzz.txt") if line.startswith(s))
This, of course, assumes that 'item:' doesn't exist on any other lines that are NOT followed by 'other:', but in the event 'item:' exists only once and at the start of the line, this simple generator should work for your purposes.
This problem is simple enough that it really only has two states, so you could just use a Boolean variable to keep track of what you are doing. But the general case for problems like this is to write a state machine that transitions from one state to the next until it has worked its way through the problem.
I like to use enums for states; unfortunately Python doesn't really have a built-in enum. So I am using a class with some class variables to store the enums.
Using the standard Python idiom for line in f (where f is the open file object) you get one line at a time from the text file. This is an efficient way to process files in Python; your initial lines, which you are skipping, are simply discarded. Then when you collect items, you just keep the ones you want.
This answer is written to assume that "item:" and "Other:" never occur on the same line. If this can ever happen, you need to write code to handle that case.
EDIT: I made the start_code and stop_code into arguments to the function, instead of hard-coding the values from the example.
import sys
class States:
    pass
States.looking_for_item = 1
States.collecting_input = 2
def get_list_from_file(fname, start_code, stop_code):
    lst = []
    state = States.looking_for_item
    with open(fname, "rt") as f:
        for line in f:
            l = line.lstrip()
            # Don't collect anything until after we find "item:"
            if state == States.looking_for_item:
                if not l.startswith(start_code):
                    # Discard input line; stay in same state
                    continue
                else:
                    # Found item! Advance state and start collecting stuff.
                    state = States.collecting_input
                    # chop out start_code
                    l = l[len(start_code):]
                    # Collect everything after "item":
                    # Split on commas to get strings. Strip white-space from
                    # ends of strings. Append to lst.
                    lst += [s.strip() for s in l.split(",")]
            elif state == States.collecting_input:
                if not l.startswith(stop_code):
                    # Continue collecting input; stay in same state
                    # Split on commas to get strings. Strip white-space from
                    # ends of strings. Append to lst.
                    lst += [s.strip() for s in l.split(",")]
                else:
                    # We found our terminating condition! Don't bother to
                    # update the state variable, just return lst and we
                    # are done.
                    return lst
            else:
                print("invalid state reached somehow! state: " + str(state))
                sys.exit(1)
    # Reached end of file without seeing stop_code; return what was collected.
    return lst
lst = get_list_from_file(sys.argv[1], "item:", "Other:")
# do something with lst; for now, just print
print(lst)
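Since Python 3.4 the standard library includes an enum module, so the states can also be written as a real enumeration. The following is a compact sketch of the same state machine under that assumption. The names State and collect_items are hypothetical, the function takes any iterable of lines instead of a file name, and, unlike the code above, it drops empty pieces after splitting.

```python
from enum import Enum


class State(Enum):
    LOOKING_FOR_ITEM = 1
    COLLECTING_INPUT = 2


def collect_items(lines, start_code, stop_code):
    """Collect comma-separated items between start_code and stop_code."""
    items = []
    state = State.LOOKING_FOR_ITEM
    for line in lines:
        text = line.lstrip()
        if state is State.LOOKING_FOR_ITEM:
            if not text.startswith(start_code):
                continue  # still before the start marker
            state = State.COLLECTING_INPUT
            text = text[len(start_code):]
        elif text.startswith(stop_code):
            break  # reached the stop marker
        # Split on commas, strip whitespace, and drop empty pieces.
        items += [s.strip() for s in text.split(",") if s.strip()]
    return items


# A list of lines stands in for an open file in this illustration.
sample = ["blah blah\n", "item: A,B,C\n", "AA,BB\n", "Other: x\n"]
print(collect_items(sample, "item:", "Other:"))
```

An open file object can be passed as the first argument, since it also yields one line at a time.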
I wrote an answer that assumes that the start code and stop code must occur at the start of a line. This answer also assumes that the lines in the file are reasonably short.
You could, instead, read the file in chunks, and check to see if the start code exists in the chunk. For this simple check, you could use if code in chunk (in other words, use the Python in operator to check for a string being contained within another string).
So, read a chunk, check for start code; if not present discard the chunk. If start code present, begin collecting chunks while searching for the stop code. In a recent Python version you can concatenate the blocks one at a time with reasonable performance. (In an old version of Python you should store the chunks in a list, then use the .join() method to join the chunks together.)
Once you have built a string that holds data from the start code to the end code, you can use .find() and .rfind() to find the start code and end code, and then cut out just the data you want.
If the start code and stop code can occur more than once in the file, wrap all of the above in a loop and loop until end of file is reached.
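The chunked approach described above could be sketched as follows, in Python 3 syntax. This is only an illustration under stated assumptions: the function name extract_between and the chunk_size parameter are hypothetical, the start and stop codes each occur at most once, and a short tail of each discarded chunk is kept in case the start code straddles a chunk boundary.

```python
import io


def extract_between(stream, start_code, stop_code, chunk_size=4096):
    """Return the text between start_code and stop_code, or None if not found."""
    buffer = ""
    collecting = False
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return None  # end of input before both codes were seen
        buffer += chunk
        if not collecting:
            start = buffer.find(start_code)
            if start == -1:
                # Keep only a short tail in case start_code straddles two chunks.
                keep = len(start_code) - 1
                buffer = buffer[-keep:] if keep > 0 else ""
                continue
            # Discard everything up to and including the start code.
            buffer = buffer[start + len(start_code):]
            collecting = True
        stop = buffer.find(stop_code)
        if stop != -1:
            return buffer[:stop]


# io.StringIO stands in for an open file; a tiny chunk size exercises
# the case where the codes are split across chunks.
sample = io.StringIO("blah blah\nitem: A,B,C\nAA,BB\nOther: x\n")
data = extract_between(sample, "item:", "Other:", chunk_size=4)
print(repr(data))
```

This version searches the whole collected buffer after every chunk, which is simple but does repeated work when the data between the two codes is very long. Searching only the newly added part (plus a short overlap) would avoid that.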

Spell check program in python

Exercise problem: "given a word list and a text file, spell check the
contents of the text file and print all (unique) words which aren't
found in the word list."
I didn't get solutions to the problem so can somebody tell me how I went and what the correct answer should be?:
As a disclaimer none of this parses in my python console...
My attempt:
a=list[....,.....,....,whatever goes here,...]
data = open(C:\Documents and Settings\bhaa\Desktop\blablabla.txt).read()
#I'm aware that something is wrong here since I get an error when I use it.....when I just write blablabla.txt it says that it can't find the thing. Is this function only gonna work if I'm working off the online IVLE program where all those files are automatically linked to the console or how would I do things from python without logging into the online IVLE?
for words in data:
for words not in a
print words
wrong = words not in a
right = words in a
print="wrong spelling:" + "properly splled words:" + right
oh yeh...I'm very sure I've indented everything correctly but I don't know how to format my question here so that it doesn't come out as a block like it has. sorry.
What do you think?
There are many things wrong with this code - I'm going to mark some of them below, but I strongly recommend that you read up on Python control flow constructs, comparison operators, and built-in data types.
a=list[....,.....,....,whatever goes here,...]
data = open(C:\Documents and Settings\bhaa\Desktop\blablabla.txt).read()
# The filename needs to be a string value - put "C:\..." in quotes!
for words in data:
# data is a string - iterating over it will give you one letter
# per iteration, not one word
for words not in a
# aside from syntax (remember the colons!), remember what for means - it
# executes its body once for every item in a collection. "not in a" is not a
# collection of any kind!
print words
wrong = words not in a
# this does not say what you think it says - "not in" is an operator which
# takes an arbitrary value on the left, and some collection on the right,
# and returns a single boolean value
right = words in a
# same as the previous line
print="wrong spelling:" + "properly splled words:" + right
I don't know what you are trying to iterate over, but why don't you just first iterate over your words (which are in the variable a I guess?) and then for every word in a you iterate over the wordlist and check whether or not that word is in the wordslist.
I won't paste code since it seems like homework to me (if so, please add the homework tag).
Btw the first argument to open() should be a string.
It's simple really. Turn both lists into sets then take the difference. Should take like 10 lines of code. You just have to figure out the syntax on your own ;) You aren't going to learn anything by having us write it for you.
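For readers who want to check their own attempt, one possible shape of the set-difference approach is sketched below in Python 3 syntax. The function name misspelled_words, the sample word list and text, and the rule for what counts as a word (a run of letters or apostrophes, compared case-insensitively) are all assumptions made for this illustration.

```python
import re


def misspelled_words(text, word_list):
    """Return the unique words in text that are not in word_list."""
    known = {w.lower() for w in word_list}
    # Assumption: a "word" is a run of letters or apostrophes.
    words = set(re.findall(r"[a-z']+", text.lower()))
    # Set difference: words in the text that the word list does not contain.
    return words - known


# Made-up sample data standing in for the word list and the text file.
word_list = ["the", "cat", "sat", "on", "mat"]
text = "The cat sat on the matt. The catt sat."
print(sorted(misspelled_words(text, word_list)))
```

With a real file, text would come from something like open("blablabla.txt").read(), with the file name passed as a quoted string.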
