I've built a Python script that randomly creates sentences using data from the Princeton English WordNet, following diagrams provided in Gödel, Escher, Bach. Calling python GEB.py produces a list of nonsensical English sentences, such as:
resurgent inaesthetic cost.
the bryophytic fingernail.
aversive fortieth peach.
the asterismal hide.
the flour who translate gown which take_a_dare a punch through applewood whom the renewed request enfeoff.
an lobeliaceous freighter beside tuna.
And saves them to gibberish.txt. This script works fine.
Another script (translator.py) takes gibberish.txt and, through the py-googletrans Python module, tries to translate those random sentences into Portuguese:
from googletrans import Translator
import json

tradutor = Translator()

with open('data.json') as dataFile:
    data = json.load(dataFile)

def buscaLocal(keyword):
    if keyword in data:
        print(keyword + data[keyword])
    else:
        buscaAPI(keyword)

def buscaAPI(keyword):
    result = tradutor.translate(keyword, dest="pt")
    data.update({keyword: result.text})
    with open('data.json', 'w') as fp:
        json.dump(data, fp)
    print(keyword + result.text)

keyword = open('/home/user/gibberish.txt', 'r').readline()
buscaLocal(keyword)
Currently the second script outputs only the translation of the first sentence in gibberish.txt. Something like:
resurgent inaesthetic cost.
aumento de custos inestético.
I have tried to use readlines() instead of readline(), but I get the following error:
Traceback (most recent call last):
  File "main.py", line 28, in <module>
    buscaLocal(keyword)
  File "main.py", line 11, in buscaLocal
    if keyword in data:
TypeError: unhashable type: 'list'
I've read similar questions about this error here, but it is not clear to me what I should use in order to read the whole list of sentences contained in gibberish.txt (each new sentence begins on a new line).
How can I read the whole list of sentences contained in gibberish.txt? How should I adapt the code in translator.py to achieve that? I am sorry if the question is a bit confusing; I can edit it if necessary. I am a Python newbie and I would appreciate it if someone could help me out.
Let's start with what you're doing to the file object. You open a file, get a single line from it, and then don't close it. A better way to do it would be to process the entire file and then close it. This is generally done with a with block, which will close the file even if an error occurs:
with open('gibberish.txt') as f:
    # do stuff to f
Aside from the material benefits, this will make the interface clearer, since f is no longer a throwaway object. You have three easy options for processing the entire file:
Use readline in a loop since it will only read one line at a time. You will have to strip off the newline characters manually and terminate the loop when '' appears:
while True:
    line = f.readline()
    if not line:
        break
    keyword = line.rstrip()
    buscaLocal(keyword)
This loop can take many forms, one of which is shown here.
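Another form, available on Python 3.8+, folds the read and the test into the loop header with an assignment expression (my variant, not from the original answer, reusing buscaLocal from the question):

with open('gibberish.txt') as f:
    while (line := f.readline()):  # readline returns '' at EOF, which ends the loop
        buscaLocal(line.rstrip())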
Use readlines to read in all the lines in the file at once into a list of strings:
for line in f.readlines():
    keyword = line.rstrip()
    buscaLocal(keyword)
This is much cleaner than the previous option, since you don't need to check for loop termination manually, but it has the disadvantage of loading the entire file all at once, which the readline loop does not.
This brings us to the third option.
Python files are iterable objects. You can have the cleanliness of the readlines approach with the memory savings of readline:
for line in f:
    buscaLocal(line.rstrip())
This approach can be simulated using readline with the more arcane two-argument form of iter, which creates a similar iterator from a callable and a sentinel value:

for line in iter(f.readline, ''):
    buscaLocal(line.rstrip())
As a side point, I would make some modifications to your functions:
def buscaLocal(keyword):
    if keyword not in data:
        buscaAPI(keyword)
    print(keyword + data[keyword])

def buscaAPI(keyword):
    # Make your function do one thing. In this case, do a lookup.
    # Printing is not the task for this function.
    result = tradutor.translate(keyword, dest="pt")
    # No need to do a complicated update with a whole new
    # dict object when you can do a simple assignment.
    data[keyword] = result.text

...

# Avoid rewriting the file every time you get a new word.
# Do it once at the very end.
with open('data.json', 'w') as fp:
    json.dump(data, fp)
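Putting the pieces together, a minimal driver could look like this (my sketch, not part of the original answer; it assumes the two functions above and an existing data.json):

from googletrans import Translator
import json

tradutor = Translator()

with open('data.json') as dataFile:
    data = json.load(dataFile)

# buscaLocal and buscaAPI defined as above...

with open('gibberish.txt') as f:
    for line in f:
        buscaLocal(line.rstrip())

# Rewrite the cache once, at the very end.
with open('data.json', 'w') as fp:
    json.dump(data, fp)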
If you are using the readline() function, remember that it returns only a single line, so you have to use a loop to go through all of the lines in the text file. If you use readlines() instead, the whole file is read at once, but the lines are returned in a list. The list type is unhashable and cannot be used as a key in a dict, which is why the line if keyword in data: raises this error: keyword there is a list of all of the lines. A simple for loop solves the problem.
text_lines = open('/home/user/gibberish.txt', 'r').readlines()

for line in text_lines:
    buscaLocal(line)
This loop will iterate through all of the lines in the list, and there will be no error when accessing the dict, since each key will now be a string.
In Think Python by Allen Downey, Exercise 13-2 asks you to process any .txt file from gutenberg.org and skip the header information, which ends with something like "Produced by". This is the solution the author gives:
import string  # needed for string.punctuation

def process_file(filename, skip_header):
    """Makes a dict that contains the words from a file.

    box = temp storage unit to combine two following word in one string
    res = dict
    filename: string
    skip_header: boolean, whether to skip the Gutenberg header

    returns: map from string of two word from file to list of words that comes
    after them
    Last two word in text maps to None"""
    res = {}
    fp = open(filename)
    if skip_header:
        skip_gutenberg_header(fp)
    for line in fp:
        process_line(line, res)
    return res

def process_line(line, res):
    for word in line.split():
        word = word.lower().strip(string.punctuation)
        if word.isalpha():
            res[word] = res.get(word, 0) + 1

def skip_gutenberg_header(fp):
    """Reads from fp until it finds the line that ends the header.

    fp: open file object
    """
    for line in fp:
        if line.startswith('Produced by'):
            break
I really don't understand the flow of execution in this code. Once the code starts reading the file using skip_gutenberg_header(fp), which contains "for line in fp:", it finds the needed line and breaks. However, the next loop picks up right where the break statement left off. But why? My view of it is that there are two independent iterations here, both containing "for line in fp:", so shouldn't the second one start from the beginning?
No, it shouldn't re-start from the beginning. An open file object maintains a file position indicator, which gets moved as you read (or write) the file. You can also move the position indicator via the file's .seek method, and query it via the .tell method.
So if you break out of a for line in fp: loop you can continue reading where you left off with another for line in fp: loop.
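Here is a tiny demonstration of the position indicator (my own example; io.StringIO stands in for a real file so it runs without touching the disk, and real file objects behave the same way):

import io

f = io.StringIO("first\nsecond\nthird\n")
print(f.tell())  # 0 -- nothing has been read yet
print(next(f))   # 'first\n'
print(f.tell())  # 6 -- the position has moved past the first line
f.seek(0)        # rewind to the beginning
print(next(f))   # 'first\n' again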
BTW, this behaviour of files isn't specific to Python: all modern languages that inherit C's notion of streams and files work like this.
The .seek and .tell methods are mentioned briefly in the tutorial.
For a more in-depth treatment of file / stream handling in Python, please see the docs for the io module. There's a lot of info in that document, and some of that information is mainly intended for advanced coders. You will probably need to read it several times and write a few test programs to absorb what it says, so feel free to skim through it the first time you try to read... or the first few times. ;)
My vision of it is that there are two independent iterations here, both containing "for line in fp:", so shouldn't the second one start from the beginning?
If fp were a list, then of course they would. However it's not -- it's just an iterable. In this case it's a file-like object that has methods like seek, tell, and read. In the case of file-like objects, they keep state. When you read a line from them, it changes the position of the read pointer in the file, so the next read starts a line below.
This is commonly used to skip the header of tabular data (when you're not using a csv.reader, at least)
with open("/path/to/file") as f:
headers = next(f).strip() # first line
for line in f:
# iterate by-line for the rest of the file
...
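For comparison, the csv.reader flavour of the same idea looks like this (a sketch of mine; the path is a placeholder):

import csv

with open("/path/to/file", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)  # consume the header row; readers keep state too
    for row in reader:
        # each remaining row arrives already split into fields
        ...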
I am trying to loop through a text file and apply some logic but I am not able to loop through the text file. So currently I have a text file that is structured like this:
--- section1 ---
"a","b","c"
"d","e","f"
--- section2 ---
"1","2","3"
"4","5","6"
--- section3 ---
"12","12","12"
"11","11","11"
I am trying to filter out the first line, which contains '---', and convert the lines below into JSON until the next '---' line appears in the text document.
However I get a StopIteration error on the line fields1 = next(file).split(',').
with open(fileName, 'r') as file:
    for line in file:
        if line.startswith('-') and 'section1' in line:
            while '---' not in next(file):
                fields1 = next(file).split(',')
                for x in range(0, len(fields1)):
                    testarr.append({
                        config.get('test', 'test'): fields1[x]
                    })
                with open(test_dir, 'w') as test_file:
                    json.dump(testarr, test_file)
Any idea why my code is not working or how i can solve the error ?
The cause of your error is that you are misusing the file object's iterator by calling next on it twice as often as you think. Each call to next gets a line and returns it. Therefore, while '---' not in next(file): fields1 = next(file).split(',') gets a line, checks it for ---, then gets another line and tries to parse it. This means that you can skip right over a line containing --- when it comes up in the second next. In that case you reach the end of the file before finding the line you are looking for, and StopIteration is how iterators normally indicate that their input has been exhausted.
There are a couple of other issues you may want to address in your code:
Using next on a generator like a file when you are already inside a for loop may cause undefined behavior. You may be getting away with it this time, but it is not good practice in general. The main reason you are getting away with it, by the way, is possibly that you never actually return control to the for loop once the while is triggered, and not that files are particularly permissive in this regard.
The inner with that dumps your data to a file is inside your while loop. That means that the file you open with 'w' permissions will get truncated for every iteration of the while (i.e., each line in the file). As the array grows, the output will actually appear fine, but you probably want to move that out of the inner loop.
The simplest solution would be to rewrite the code in two loops: one to find the start of the part you care about, and the other to process it until the end is found.
Something like this:
testarr = []

with open(fileName, 'r') as file:
    for line in file:
        if line.startswith('---') and 'section1' in line:
            break
    for line in file:
        if '---' in line:
            break
        fields1 = line.split(',')
        for item in fields1:
            testarr.append({config.get('test', 'test'): item})

with open(test_dir, 'w') as test_file:
    json.dump(testarr, test_file)
EDIT: Taking @tripleee's advice, I have removed the regex check for the start line. While a regex gives great precision and flexibility for finding a specific pattern, it is really overkill for this example. I would like to point out that if you are looking for a section other than section1, or if section1 appears after some other lines with dashes, you will absolutely need this two-loop approach; the one-loop solutions in the other answers will not work in a non-trivial case. A generalized sketch follows.
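For instance, a reusable helper along those lines might look like this (my sketch; the name extract_section and the quote-stripping are my own additions):

def extract_section(filename, section_name):
    """Collect the rows between the named section marker and the next marker."""
    rows = []
    with open(filename) as f:
        for line in f:  # loop 1: find the start of the section
            if line.startswith('---') and section_name in line:
                break
        for line in f:  # loop 2: collect until the next marker
            if '---' in line:
                break
            rows.append([x.strip('"') for x in line.rstrip('\n').split(',')])
    return rows

# e.g. extract_section('data.txt', 'section2') -> [['1', '2', '3'], ['4', '5', '6']]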
Looks like you are overcomplicating matters massively. The next inside the inner while loop I imagine is tripping up the outer for loop, but that's just unnecessary anyway. You are already looping over lines; pick the ones you want, then quit when you're done.
with open(fileName, 'r') as inputfile:
    for line in inputfile:
        if line.startswith('-') and 'section1' in line:
            continue
        elif line.startswith('-'):
            break
        else:
            testarr.append({config.get('test', 'test'): x
                            for x in line.split(',')})

with open(test_dir, 'w') as test_file:
    json.dump(testarr, test_file)
I hope I got the append right, as I wanted to also show you how to map the split fields more elegantly, but I'm not sure I completely understand what your original code did. (I'm guessing you'll want to trim the \n off the end of the line before splitting it, actually. Also, I imagine you want to trim the quotes from around each value. x.strip('"') for x in line.rstrip('\n').split(','))
I also renamed file to inputfile to avoid shadowing the built-in name file.
If you want to write more files, basically, add more states in the loop and move the write snippet back inside the loop. I don't particularly want to explain how this is equivalent to a state machine, but it should not be hard to understand: with two states, you are either skipping or collecting; to extend this, add one more state for the boundary when flipping back, where you write out the collected data and reinitialize the collected lines to none. A sketch of that idea is below.
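Here is one shape that state machine could take (my sketch, reusing fileName and test_dir from the question; it collects every section into a dict keyed by section name rather than writing separate files):

import json

sections = {}   # section name -> list of rows
current = None  # state: None means skipping; a name means collecting

with open(fileName) as inputfile:  # fileName as in the question
    for line in inputfile:
        if line.startswith('-'):
            # boundary state: flip to collecting under the new section name
            current = line.strip('- \n')
            sections[current] = []
        elif current is not None:
            sections[current].append(
                [x.strip('"') for x in line.rstrip('\n').split(',')])

with open(test_dir, 'w') as test_file:  # test_dir as in the question
    json.dump(sections, test_file)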
next() raises a StopIteration exception when the iterator is exhausted. In other words, your code gets to the end of the file, and you call next() again, and there's nothing more for it to return, so it raises that exception.
As for how to solve your problem, I think this might be what you want:
with open(fileName, 'r') as file:
    for line in file:
        if line.startswith('---'):
            if 'section1' in line:
                continue
            else:
                break
        fields1 = line.split(',')
        for x in range(len(fields1)):
            testarr.append({
                config.get('test', 'test'): fields1[x]
            })

with open(test_dir, 'w') as test_file:
    json.dump(testarr, test_file)
So I keep receiving this error:
ValueError: Mixing iteration and read methods would lose data
And 1) I don't quite understand why I'm receiving it, and 2) people with similar problems seem to be doing things with their code which are much more complex than a beginner (like myself) can adapt to.
The idea of my code is to read a data_file.txt and convert each line into its own individual array.
so far I have this:
array = []  # declaring a list with name 'array'

with open('file.txt', 'r') as input_file:
    for line in input_file:
        line = input_file.readlines()
        array.append(line)
        print('done 1')  # for test purposes
return array
And I keep receiving an error.
"Value error: Mixing iteration and read methods would lose data "message while extracting numbers from a string from a .txt file using python
The above question seemed to be doing something similar, reading items into an array; however, that code was skipping lines and using a range to read in certain parts of the file, which I don't need. All I need is to read in all the lines and have them made into an array.
Python: Mixing files and loops
In this question, once again, something much more than I can understand was being asked. From what I understood, he just wanted a code that would restart after an error and continue, and the answers were about that part. Once again not what I'm looking for.
The error is pretty much self-explanatory (once you know what it is about), so here goes.
You start with the loop for line in input_file:. File objects are iterable in Python. They iterate over the lines in the file. This means that for each iteration of the loop, line will contain the next line in your file.
Next you read a line manually line = input_file.readlines(). This attempts to read a line from the file, but you are already iterating over the lines in the for loop.
Files are usually read sequentially, with no going backwards, so what you end up with is a conflict. If you read lines with a read method, the iterator in the loop would be forced to return a line far past the next one, since it cannot go back; however, it has promised to return the next line. The error is telling you that the read method knows there is an active iterator and that calling it would interfere with the loop.
If you take out line = input_file.readlines(), the loop will do what you expect it to.
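For example, the question's loop with the conflicting call removed (a sketch; the surrounding function and its return are assumed, as in the question):

array = []
with open('file.txt', 'r') as input_file:
    for line in input_file:
        array.append(line)  # the for loop alone advances through the file
print('done 1')  # for test purposes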
To make an array of the lines of the file, with one line per array element:
with open('file.txt', 'r') as input_file:
    array = input_file.readlines()
return array
since readlines will give you the whole file in one shot. Alternatively,
return list(open('file.txt','r'))
will do the same per the docs.
Very new to python and can't understand why this isn't working. I have a list of web addresses stored line by line in a text file. I want to store the first 10 in an array/list called bing, the next 10 in a list called yahoo, and the last 10 in a list called duckgo. I'm using the readlines function to read the data from the file into each array. The problem is nothing is being written to the lists. The count is incrementing like it should. Also, if I remove the loops altogether and just read the whole text file into one list it works perfectly. This leads me to believe that the loops are causing the problem. The code I am using is below. Would really appreciate some feedback.
count = 0
#Open the file
fo = open("results.txt", "r")
#read into each array
while (count < 30):
    if (count < 10):
        bing = fo.readlines()
        count += 1
        print bing
        print count
    elif (count >= 10 and count <= 19):
        yahoo = fo.readlines()
        count += 1
        print count
    elif (count >= 20 and count <= 29):
        duckgo = fo.readlines()
        count += 1
        print count

print bing
print yahoo
print duckgo
fo.close
You're using readlines to read the files. readlines reads all of the lines at once, so the very first time through your loop, you exhaust the entire file and store the result in bing. Then, every time through the loop, you overwrite bing, yahoo, or duckgo with the (empty) result of the next readlines call. So your lists all wind up being empty.
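You can see the exhaustion at the interactive prompt (my illustration; it assumes results.txt exists):

>>> fo = open("results.txt")
>>> lines = fo.readlines()  # the first call returns every line in the file
>>> fo.readlines()          # the file is now exhausted
[]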
There are lots of ways to fix this. Among other things, you should consider reading the file a line at a time, with readline (no 's'). Or better yet, you could iterate over the file, line by line, simply by using a for loop:
for line in fo:
    ...
To keep the structure of your current code you could use enumerate:
for line_number, line in enumerate(fo):
    if condition(line_number):
        ...
But frankly I think you should ditch your current system. A much simpler way would be to use readlines without a loop, and slice the resulting list!
lines = fo.readlines()
bing = lines[0:10]
yahoo = lines[10:20]
duckgo = lines[20:30]
There are many other ways to do this, and some might be better, but none are simpler!
readlines() reads all of the lines of the file. If you call it again, you get an empty list. So you are overwriting your lists with empty data as you iterate through your loop.
You should be using readline() instead of readlines()
readlines() reads the entire file in at once, whereas readline() reads a single line from the file.
I suggest you rewrite it like so:
bing = []
yahoo = []
duckgo = []

with open("results.txt", "r") as f:
    for i, line in enumerate(f):
        if i < 10:
            bing.append(line)
        elif i < 20:
            yahoo.append(line)
        elif i < 30:
            duckgo.append(line)
        else:
            raise RuntimeError, "too many lines in input file"
Note how we use enumerate() to get a running count of lines, rather than making our own count variable and needing to increment it ourselves. This is considered good style in Python.
But I think the best way to solve this problem would be to use itertools like so:
import itertools as it

with open("results.txt", "r") as f:
    bing = list(it.islice(f, 10))
    yahoo = list(it.islice(f, 10))
    duckgo = list(it.islice(f, 10))
    if list(it.islice(f, 1)):
        raise RuntimeError, "too many lines in input file"
itertools.islice() (or it.islice() since I did the import itertools as it) will pull a specified number of items from an iterator. Our open file-handle object f is an iterator that returns lines from the file, so it.islice(f, 10) pulls exactly 10 lines from the input file.
Because it.islice() returns an iterator, we must explicitly expand it out to a list by wrapping it in list().
I think this is the simplest way to do it. It perfectly expresses what we want: for each one, we want a list with 10 lines from the file. There is no need to keep a counter at all, just pull the 10 lines each time!
EDIT: The check for extra lines now uses it.islice(f, 1) so that it will only pull a single line. Even one extra line is enough to know that there are more than just the 30 expected lines, and this way if someone accidentally runs this code on a very large file, it won't try to slurp the whole file into memory.
I'm trying to take a text file and use only the first 30 lines of it in Python.
This is what I wrote:
text = open("myText.txt")
lines = text.readlines(30)
print lines
For some reason I get more than 150 lines when I print?
What am I doing wrong?
Use itertools.islice
import itertools

for line in itertools.islice(open("myText.txt"), 0, 30):
    print line
If you are going to process your lines individually, an alternative could be to use a loop:
file = open('myText.txt')

for i in range(30):
    line = file.readline()
    # do stuff with line here
EDIT: some of the comments below express concern about this method assuming there are at least 30 lines in the file. If that is an issue for your application, you can check the value of line before processing. readline() will return an empty string '' once EOF has been reached:
for i in range(30):
    line = file.readline()
    if line == '':  # note that an empty line will return '\n', not ''!
        break
    # do stuff with line here
The sizehint argument for readlines isn't what you think it is (bytes, not lines).
If you really want to use readlines, try text.readlines()[:30] instead.
Do note that this is inefficient for large files as it first creates a list containing the whole file before returning a slice of it.
A straight-forward solution would be to use readline within a loop (as shown in mac's answer).
To handle files of various sizes (more or less than 30), Andrew's answer provides a robust solution using itertools.islice(). To achieve similar results without itertools, consider:
output = [line for _, line in zip(range(30), open("yourfile.txt", "r"))]
or as a generator expression (Python >2.4):
output = (line for _, line in zip(range(30), open("yourfile.txt", "r")))

for line in output:
    # do something with line.
The argument to readlines is a size hint (in bytes), not a line count. Python rounds the read up to an internal buffer size, which is how a hint of 30 bytes can come back as 150+ lines.
Doing it with a for loop instead will give you proper results. Unfortunately, there doesn't seem to be a better built-in function for that.
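For example, a for-loop version might look like this (my sketch; unlike a bare 30-iteration loop, it also stops early on files shorter than 30 lines):

lines = []
with open("myText.txt") as f:
    for i, line in enumerate(f):
        if i == 30:  # stop after the first 30 lines
            break
        lines.append(line)

print(lines)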