Google search issue in Python

I have implemented a program in Python that performs a Google search and captures the top ten links from the results. I am using the 'pygoogle' library for the search. The first two or three times I run the program it gets proper hits and the entire project works fine, but after a certain number of links have been downloaded, it fails with the following error (gui_two.py is my program name):
Exception in Tkinter callback
Traceback (most recent call last):
File "/usr/lib/python2.7/lib-tk/Tkinter.py", line 1413, in __call__
return self.func(*args)
File "gui_two.py", line 113, in action
result = uc.utilcorpus(self.fn1,"")
File "/home/ci/Desktop/work/corpus/corpus.py", line 125, in utilcorpus
for url in g1.get_urls(): #this is key sentence based search loop
File "/home/ci/Desktop/work/corpus/pygoogle.py", line 132, in get_urls
for result in data['responseData']['results']:
TypeError: 'NoneType' object has no attribute '__getitem__'
I know this is one of the most common errors in Python, but since it occurs inside a library I am not sure what to do about it. I wonder whether my program is spamming Google, whether I need a custom Google Search API, or whether there is some other reason. Please give me precise information on how to perform the search without this issue. I would be grateful for your help.
Thanks.
Edit: My actual code is quite large; here is the small piece where the problem arises.
g1 = pygoogle(query)
g1.pages = 1
for url in g1.get_urls(): # error is in this line
    print "URL : ", url
It may work if you simply copy it into a standalone .py file, but after executing it many times the program raises the error.

Here's the culprit code from pygoogle.py (from http://pygoogle.googlecode.com/svn/trunk/pygoogle.py)
def get_urls(self):
    """Returns list of result URLs"""
    results = []
    search_results = self.__search__()
    if not search_results:
        self.logger.info('No results returned')
        return results
    for data in search_results:
        if data and data.has_key('responseData') and data['responseData']['results']:
            for result in data['responseData']['results']:
                if result:
                    results.append(urllib.unquote(result['unescapedUrl']))
    return results
Unlike every other place where data['responseData']['results'] is used, the two keys are not both checked for existence with has_key(): if data['responseData'] is None, subscripting it raises exactly the TypeError you see.
I suspect that your responseData is missing results, hence the for loop fails.
Since you have the source, you can edit this yourself.
Also, file an issue for the project - it is very similar to this one, in fact.

I fixed the issue by modifying the source code of the pygoogle.py library. The bug in this code is that it never checks whether an element holds data or is None. The modified code is:
def get_urls(self):
    """Returns list of result URLs"""
    results = []
    for data in self.__search__():
        # the following two lines were added to fix the issue
        if data['responseData'] is None or data['responseData']['results'] is None:
            break
        for result in data['responseData']['results']:
            if result:
                results.append(urllib.unquote(result['unescapedUrl']))
    return results
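As a side note, the same guard can be written with dict.get(), which never subscripts a value that may be None. This is only a sketch in Python 3 syntax of the defensive-access idea, not part of pygoogle itself:

```python
def extract_urls(search_results):
    """Collect 'unescapedUrl' values, tolerating None or missing fields."""
    results = []
    for data in search_results or []:
        # (x or {}) turns a None value into an empty dict before .get()
        response = (data or {}).get('responseData') or {}
        for result in response.get('results') or []:
            if result:
                results.append(result['unescapedUrl'])
    return results

# An entry whose 'responseData' is None is skipped instead of raising
# "TypeError: 'NoneType' object has no attribute '__getitem__'".
sample = [{'responseData': None},
          {'responseData': {'results': [{'unescapedUrl': 'http://example.com'}]}}]
print(extract_urls(sample))
```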

Related

RuntimeError: dictionary keys changed during iteration

I am trying to use the following script, which had some errors I needed to fix. I was able to get it running, but for most of the data it tries to process, the following error arises:
C:\Users\Alexa\OneDrive\Skrivbord\Database Samples\LIDC-IDRI-0001\1.3.6.1.4.1.14519.5.2.1.6279.6001.175012972118199124641098335511\1.3.6.1.4.1.14519.5.2.1.6279.6001.141365756818074696859567662357\068.xml
Traceback (most recent call last):
File "C:\Users\Alexa\OneDrive\Documents\Universitet\Nuvarande\KEX\LIDC-IDRI-processing-master\lidc_data_to_nifti.py", line 370, in <module>
parse_xml_file(xml_file)
File "C:\Users\Alexa\OneDrive\Documents\Universitet\Nuvarande\KEX\LIDC-IDRI-processing-master\lidc_data_to_nifti.py", line 311, in parse_xml_file
root=xmlHelper.create_xml_tree(file)
File "C:\Users\Alexa\OneDrive\Documents\Universitet\Nuvarande\KEX\LIDC-IDRI-processing-master\lidcXmlHelper.py", line 23, in create_xml_tree
for at in el.attrib.keys(): # strip namespaces of attributes too
RuntimeError: dictionary keys changed during iteration
This corresponds to the following code:
def create_xml_tree(filepath):
    """
    Method to ignore the namespaces if ElementTree is used.
    Necessary because ElementTree, by default, extends
    tag names by the namespace, but the namespaces used in the
    LIDC-IDRI dataset are not consistent.
    Solution based on https://stackoverflow.com/questions/13412496/python-elementtree-module-how-to-ignore-the-namespace-of-xml-files-to-locate-ma
    instead of ET.fromstring(xml)
    """
    it = ET.iterparse(filepath)
    for _, el in it:
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]  # strip all namespaces
        for at in el.attrib.keys():  # strip namespaces of attributes too
            if '}' in at:
                newat = at.split('}', 1)[1]
                el.attrib[newat] = el.attrib[at]
                del el.attrib[at]
    return it.root
I am not at all familiar with reading XML files in Python, and this problem has kept me stuck for the last two days. I tried reading threads both here and on other forums, but they did not give me significant insight. From what I understand, the problem arises because we are modifying the same dict we are iterating over, which is not allowed. I tried to fix this by making a copy and changing it based on what is read from the original, but I did not get it working properly.
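For what it's worth, the usual fix on Python 3 is exactly the copy idea: snapshot the keys with list() before mutating the dict. A minimal, self-contained sketch of the same namespace-stripping loop:

```python
import io
import xml.etree.ElementTree as ET

# Same namespace-stripping loop as in create_xml_tree, but iterating over a
# snapshot of the attribute keys (list(...)) so the dict itself can be
# mutated without "RuntimeError: dictionary keys changed during iteration".
xml = b'<root xmlns:ns="http://example.com/ns" ns:attr="value"/>'
it = ET.iterparse(io.BytesIO(xml))
for _, el in it:
    if '}' in el.tag:
        el.tag = el.tag.split('}', 1)[1]   # strip tag namespace
    for at in list(el.attrib.keys()):      # snapshot: safe to mutate below
        if '}' in at:
            el.attrib[at.split('}', 1)[1]] = el.attrib.pop(at)

print(it.root.tag, it.root.attrib)  # root {'attr': 'value'}
```

The only change needed in the original function is wrapping `el.attrib.keys()` in `list(...)`.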

Using Python, NLTK, to analyse German text

I am a beginner in Python, currently trying to use NLTK to analyze German text (extract the German nouns and their frequencies from German text) by following this tutorial: https://datascience.blog.wzb.eu/2016/07/13/accurate-part-of-speech-tagging-of-german-texts-with-nltk/
There are several issues that I faced during the process and I am not able to solve them.
When I follow the website and execute the code below:
import random
tagged_sents = list(corp.tagged_sents())
random.shuffle(tagged_sents)
split_perc = 0.1
split_size = int(len(tagged_sents) * split_perc)
train_sents, test_sents = tagged_sents[split_size:], tagged_sents[:split_size]
it fails with this traceback:
Traceback (most recent call last):
File "test2.py", line 7, in <module>
tagged_sents = list(corp.tagged_sents())
File "C:\Users\User\anaconda3\lib\site-packages\nltk\corpus\reader\conll.py", line 130, in tagged_sents
return LazyMap(get_tagged_words, self._grids(fileids))
File "C:\Users\User\anaconda3\lib\site-packages\nltk\corpus\reader\conll.py", line 215, in _grids
return concat(
File "C:\Users\User\anaconda3\lib\site-packages\nltk\corpus\reader\util.py", line 433, in concat
raise ValueError("concat() expects at least one object!")
ValueError: concat() expects at least one object!
Then I tried to fix it by following this solution, https://teamtreehouse.com/community/randomshuffle-crashes-when-passed-a-range-somenums-randomshufflerange5250
and altering
tagged_sents = list(corp.tagged_sents())
to
tagged_sents = list(range(5,250))
The ValueError no longer appeared, but I don't understand what (5,250) means, although I have read the explanation.
Then I continued with the following step:
from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger
tagger = ClassifierBasedGermanTagger(train=train_sents)
And it shows
Traceback (most recent call last):
File "test1.py", line 90, in <module>
from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger
ModuleNotFoundError: No module named 'ClassifierBasedGermanTagger'
I have already downloaded ClassifierBasedGermanTagger.py and __init__.py and put them in the folder linked in VS Code; I don't know if that is correct. As the passage says:
'Using his Python class ClassifierBasedGermanTagger (which you can download from the github page) we can create a tagger and train it with the data from the TIGER corpus:'
Please help me to fix these problems, thanks!
First of all, welcome to Stack Overflow! Before posting a question, please make sure you have done your own research; most of the time that alone solves the problem.
Secondly, range(start, end) is a very basic Python function that produces the integers from start up to (but not including) end, so range(5, 250) is just the numbers 5 through 249. I don't think using it the way you did will solve anything: you replaced your training data with a list of integers. I would suggest using print to see what kind of data is in corp and start debugging from there. Maybe corp is simply empty, and that is why you don't get any tagged_sents.
As for the import part, it is not clear to me where you put ClassifierBasedGermanTagger.py, but wherever it is, it is not visible to your code. Try putting your script (test2.py) and ClassifierBasedGermanTagger.py in the same directory. Read the link below for more details on how to properly import modules in Python.
https://docs.python.org/3/reference/import.html
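To illustrate what "visible to your code" means: Python only imports packages from directories on sys.path (which includes the script's own directory). The runnable sketch below builds a throwaway package on disk and imports it; the package and module names are stand-ins, and the same layout applies to the ClassifierBasedGermanTagger folder:

```python
import os
import sys
import tempfile

# Hypothetical layout this creates:
#   <tmpdir>/
#     demo_pkg/
#       __init__.py
#       demo_mod.py
base = tempfile.mkdtemp()
pkg = os.path.join(base, 'demo_pkg')
os.makedirs(pkg)
with open(os.path.join(pkg, '__init__.py'), 'w') as f:
    f.write('')
with open(os.path.join(pkg, 'demo_mod.py'), 'w') as f:
    f.write('ANSWER = 42\n')

# The parent directory of the package must be on sys.path for the
# import to succeed; a script's own directory is there by default.
sys.path.append(base)
from demo_pkg.demo_mod import ANSWER
print(ANSWER)  # 42
```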

Python file write error: a bytes-like object is required, not str

There are several questions about "a bytes-like object is required, not 'str'", but none of them relates to mine. I open the file using open(filename, "w"), not "wb". The code is about 150 lines long; the beginning assigns the command-line input to parser args.
args.result is an empty .txt file that I want to write my results to.
I open it using open().
I think the following code should be enough to illustrate my question. Before line 65, the code defines two functions used in the calculation, but I think that is irrelevant to the bug.
In the code, I manually create the file 'save/results/result.txt' in the command terminal. Then I open the file at line 132.
The remaining code is
An interesting thing is that line 158 runs OK: "begin training\n" is written to the file. Then at line 165, the first iteration of the loop is also OK and "aa\n" is written, but on the second iteration the program ends with the error "a bytes-like object is required, not 'str'". The error message is as follows.
Could anyone help with that?
Many thanks.
I've never had trouble with similar code, but if you want a quick fix, I'd bet that building a string named f_text, appending to it inside the for loop, and then writing all of it to the file after the last iteration would be a simple way to circumvent the problem.
I.e.:
f_text = ""
for epoch in range(your_range):
    # Do calculations
    print(the_stuff_you_wanted_to)
    f_text += the_stuff_you_wanted_to + "\n"
f.write(f_text)
I'm mostly posting this as a quick fix, though. I feel there's probably a better solution, and I could help more if you showed more of your code, such as where you actually initialize f.
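For reference, the reported TypeError is what Python 3 raises when a str is written to a file opened in binary mode. A minimal reproduction with two possible fixes follows; since the asker's full code is not shown, the actual cause (e.g. f being rebound to a binary-mode file somewhere in the loop) is only a guess:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'result.txt')

# Writing str to a binary-mode file raises the reported TypeError:
with open(path, 'wb') as f:
    try:
        f.write("begin training\n")
    except TypeError as e:
        print(e)  # a bytes-like object is required, not 'str'

# Fix 1: open in text mode and write str.
with open(path, 'w') as f:
    f.write("begin training\n")

# Fix 2: keep binary mode but encode explicitly.
with open(path, 'ab') as f:
    f.write("aa\n".encode('utf-8'))

with open(path) as f:
    print(f.read())
```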

loading Behave test results in JSON file in environment.py's after_all() throws error

I'm trying to send my Behave test results to an API Endpoint. I set the output file to be a new JSON file, run my test, and then in the Behave after_all() send the JSON result via the requests package.
I'm running my Behave test like so:
args = ['--outfile=/home/user/nathan/results/behave4.json',
        '--format=json.pretty']
from behave.__main__ import main as behave_main
behave_main(args)
In my environment.py's after_all(), I have:
def after_all(context):
    data = json.load(open('/home/user/myself/results/behave.json', 'r'))  # This line causes the error
    sendoff = {}
    sendoff['results'] = data
    r = requests.post(MyAPIEndpoint, json=sendoff)
I'm getting the following error when running my Behave test:
HOOK-ERROR in after_all: ValueError: Expecting object: line 124 column 1
(char 3796)
ABORTED: By user.
The reported error is here in my JSON file:
[
{
...
} <-- line 124, column 1
]
However, behave.json is written out after the run, and according to JSONLint it is valid JSON. I don't know the exact details of after_all(), but I think the issue is that the JSON file isn't finished being written by the time I try to open it in after_all(). If I json.load() the behave.json file a second time, after the run has finished, it loads without error and I can view my JSON at the endpoint.
Any better explanation as to why this is happening? Any solution or change in logic to get past this?
Yes, it seems the file is still being written when I try to access it in after_all(). I put a small delay before opening the file in my code, then manually viewed the behave.json file and saw that there was no closing ] after the last }.
That explains it. I will create a new question to find out how to get around this, or whether a change in logic is required.
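One way to sidestep this kind of race (not Behave-specific) is to poll until the file parses as valid JSON, with a timeout. Note the caveat: if the formatter only closes the file after after_all() returns, no amount of waiting inside the hook will help, and posting the results from a wrapper script after behave_main(args) returns avoids the race entirely. A sketch:

```python
import json
import os
import tempfile
import time

def load_json_when_ready(path, timeout=5.0, interval=0.1):
    """Poll a file until it parses as valid JSON, or re-raise after timeout.

    Useful when another process may still be writing the file at the
    moment we first try to read it.
    """
    deadline = time.time() + timeout
    while True:
        try:
            with open(path) as f:
                return json.load(f)
        except (ValueError, OSError):
            if time.time() >= deadline:
                raise
            time.sleep(interval)

# Demo with a file that is already complete:
path = os.path.join(tempfile.mkdtemp(), 'behave.json')
with open(path, 'w') as f:
    json.dump([{'status': 'passed'}], f)
print(load_json_when_ready(path))  # [{'status': 'passed'}]
```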

Python - Error Parsing HTML w/ BeautifulSoup

I'm developing an app that inputs data onto a webpage shopping cart and verifies the total. That works fine, however, I am having issue with parsing the HTML output.
A previous discussion, "retrieving essential data from a webpage using python", recommended using BeautifulSoup to solve that user's problem.
I've borrowed some of the Python code and got it working on macOS. However, when I copied the code over to an Ubuntu installation, I see a strange error.
The code (where I'm seeing the issue):
response = opener.open(req)
html = response.read()
doc = BeautifulSoup.BeautifulSoup(html)
table = doc.find('tr', {'id':'carttablerow0'})
dump = [cell.getText().strip() for cell in table.findAll('td')]
print "\n Catalog Number: %s \n Description: %s \n Price: %s\n" %(dump[0], dump[1], dump[5])
The error (on the Ubuntu server):
Traceback (most recent call last):
File "./shopping_cart_checker.py", line 49, in <module>
dump = [cell.getText().strip() for cell in table.findAll('td')]
TypeError: 'NoneType' object is not callable
I think I've narrowed it down to getText() being the culprit, but I'm not certain why this works on macOS and not Ubuntu.
Any suggestions?
Thank you.
Edit:
Thank you for the various suggestions. I've attempted most of them (incorporating the "if cell" check into the code), but it still isn't working.
@Ignacio Vazquez-Abrams - here's a copy of the HTML I'm attempting to parse:
http://pastebin.com/WdaeExnC
As to why it doesn't work on Ubutntu, no idea. However, you can try this:
dump = [(cell.getText() if cell.getText() else '').strip() for cell in table.findAll('td')]
It doesn't seem to be a problem with the code but with the HTML you are reading. What I would do is change your code to this:
dump = [cell.getText().strip() for cell in table.findAll('td') if cell]
That way, if cell is None it will not try to execute getText and will just skip that cell. You should debug if you can; I recommend pdb or ipdb (the one I like to use). Here is a tutorial; with that you can stop just before the line and print values, etc.
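If the difference between the two machines is simply an older BeautifulSoup release on Ubuntu that predates getText (one plausible explanation for the version-dependent behaviour), a defensive helper can tolerate both a missing method and a None cell. The FakeCell class below only stands in for a parsed tag so the sketch is self-contained; the getattr pattern is the point:

```python
def safe_text(cell):
    """Return stripped text from a parsed cell, tolerating None cells
    and tag objects that do not implement getText."""
    if cell is None:
        return ''
    getter = getattr(cell, 'getText', None)
    if callable(getter):
        return getter().strip()
    return ''

# Stand-in for a parsed <td> tag (BeautifulSoup itself is not imported here).
class FakeCell:
    def getText(self):
        return '  12345  '

print([safe_text(c) for c in [FakeCell(), None]])  # ['12345', '']
```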
