I am getting 10 results from a Google search.
My scenario is:
If any result (link) out of the 10 belongs to Wikipedia, use that result.
Else use the Google instant result (the result that appears at the top, before the links), if it exists.
Else use the descriptions of all 10 links.
Here is my code:
for contentIndex in self.search_response['links']:
    domain = self.search_response['links'][contentIndex]['domain']
    if "wikipedia.org" in domain:
        google_query = ''
        google_query = self.search_response['links'][contentIndex]['content']
        print "wiki link"
        break
    elif google_instant:
        google_query = ''
        google_query = google_instant
        print "\n \n Instant result : " + google_instant
        break
    else:
        google_query += self.search_response['links'][contentIndex]['content']
But this logic fails. For example, if the first link is not a wiki link and an instant result is present, it will not consider a wiki link further down, only the instant result.
You're breaking out of the loop on the google_instant condition. If this condition is met before you find a wikipedia link, then it will always use the google_instant link. What you actually need to do here is keep iterating through the results, then at the end check if there is a wikipedia or google instant link.
search_results = ''
wikipedia_result = None
google_instant_result = None
for contentIndex in self.search_response['links']:
    domain = self.search_response['links'][contentIndex]['domain']
    if "wikipedia.org" in domain:
        # remember the wiki content, but keep iterating
        wikipedia_result = self.search_response['links'][contentIndex]['content']
        print "wiki link"
    elif google_instant:
        google_instant_result = google_instant
        print "\n \n Instant result : " + google_instant
    else:
        search_results += self.search_response['links'][contentIndex]['content']
# pick by priority once all links have been seen
google_query = wikipedia_result or google_instant or search_results
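The same priority rule can also be written as three separate steps, which avoids mixing the checks inside one loop. This is only a sketch: the function name pick_query and the sample data are made up for illustration, and it assumes the links are stored in a dict whose values have 'domain' and 'content' keys, as in the question.

```python
def pick_query(links, google_instant=None):
    """Return the text to use: Wikipedia first, then instant result, then all descriptions."""
    # 1. Any Wikipedia link wins, wherever it appears in the results.
    for link in links.values():
        if "wikipedia.org" in link["domain"]:
            return link["content"]
    # 2. Otherwise fall back to the instant result, if there is one.
    if google_instant:
        return google_instant
    # 3. Otherwise concatenate the descriptions of all links.
    return "".join(link["content"] for link in links.values())


# Made-up sample data: the Wikipedia link is second and an instant result exists.
sample_links = {
    0: {"domain": "example.com", "content": "Example description. "},
    1: {"domain": "en.wikipedia.org", "content": "Wikipedia description. "},
}
print(pick_query(sample_links, google_instant="Instant answer"))
```

Because the Wikipedia check runs over all links before the instant result is even considered, the position of the wiki link in the results no longer matters.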
I am trying to locate the elements that carry the contact numbers on each site. I have a routine that extracts the contact number using regexes for the known formats, and the following code snippet to get the element:
contact_elem = browser.find_elements_by_xpath("//*[contains(text(), '" + phone_num + "')]")
Considering the example of https://www.cssfirm.com/, the contact number appears in 2 locations, the top header and the bottom footer
The element texts accompanying the contact number are as follows :
<h3>CALL US TODAY AT (855) 910-7824</h3> - Footer
<span>Call Us<br>Today</span> (855) 910-7824 - Header
The extracted phone number matches perfectly while printing it out. For some reason, the element from the header part is not being detected.
I tried by searching for elements and even by deleting the footer element from the browser before executing the rest of the code.
What could be the reason for it to go undetected?
P.S.: Below is the amateurish, uncorrected code. Efficiency edits and suggestions are welcome. The same code has been tested with various sites and works fine.
url = 'http://www.cssfirm.com/'
browser.get(url)
parsed = browser.find_element_by_tag_name('html').get_attribute('innerHTML')
s = BeautifulSoup(parsed, 'html.parser')
s = s.decode('utf-8')
phoneNumberRegex = '(\s*(?:\+?(\d{1,4}))?[-. (]*(\d{1,})[-. )]*(\d{3}|[A-Z0-9]+)[-. \/]*(\d{4}|[A-Z0-9]+)[-. \/]?(\d{4}|[A-Z0-9]+)?(?: *x(\d+))?\s*)'
custom_re = ['([0-9]{4,4} )([0-9]{3,3} )([0-9]{4,4})',
'([0-9]{3,3} )([0-9]{4,4} )([0-9]{4,4})',
'(\+[0-9]{2,2}-)([0-9]{4,4}-)([0-9]{4,4}-)(0)',
'(\([0-9]{3,3}\) )([0-9]{3,3}-)([0-9]{4,4})',
'(\+[0-9]{2,2} )(\(0\)[0-9]{4,4} )([0-9]{4,6})',
'([0-9]{5,5} )([0-9]{6,6})',
'(\+[0-9]{2,2}\(0\))([0-9]{4,4} )([0-9]{4,4})',
'(\+[0-9]{2,2} )([0-9]{3,3} )([0-9]{4,4} )([0-9]{3,3})',
'([0-9]{3,3}-)([0-9]{3,3}-)([0-9]{4,4})']
phones = []
phones = re.findall(phoneNumberRegex, s)
phone_num_list = ()
phone_num = ''
matched = 0
for phoneHeader in phones:
    #phoneHeader = phoneHeader.decode('utf-8')
    for ph_cnd in phoneHeader:
        for pttrn in custom_re:
            phones = re.findall(pttrn,ph_cnd)
            if(phones):
                phone_num_list = phones
                for x in phone_num_list:
                    phone_num = ''.join(x)
                    try:
                        contact_elem = browser.find_element_by_xpath("//*[contains(text(), '" + phone_num + "')]")
                        phone_num_txt = contact_elem.text
                        if(phone_num_txt):
                            matched = 1
                            break
                    except NoSuchElementException:
                        pass
            if(matched == 1):
                break
        if(matched == 1):
            break
    if(matched == 1):
        break
print("Phone number :",phone_num) <-- Perfect output
contact_elem <--empty for header or just the footer element
EDIT
Code updated. Forgot an important piece. Moreover, there is sleep time given in between to give time for the page to load. Considering it trivial, I haven't included them for a quick read.
I found a temporary solution by searching for the partial link text, as the number also comes on the link.
contact_elem2 = browser.find_element_by_partial_link_text(phone_num)
However, this does not answer the generic question as to why that text was ignored within the element.
This is part of the Udacity course Web Search Engine. The goal of this quiz is to write a program that extracts all links from a web page. The program should output only the links. But in my case it prints all the links and then "None" twice. I know the error is in the second part of the program, after the "while" and after the "else", but I don't know what I should write there.
def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        return None,0
    else:
        start_quote = page.find('"', start_link)
        endquo = page.find('"',start_quote + 1)
        url = page[(start_quote + 1) : endquo]
        return url,endquo
page = 'i know what you doing summer <a href="Udasity".i know what you doing summer <a href="Georgia" i know what you doing summer '
def ALLlink(page):
    url = 1
    while url != None:
        url,endquo = get_next_target(page)
        if url:
            print url
            page = page[endquo:]
else:
print ALLlink(page)
First, you can remove your else statement in your ALLlink() function since it's not doing anything.
Also, when comparing to None, you should use "is not" instead of "!=":
while url != None: # bad
while url is not None: # good
That said, I think your error is in your last line:
print ALLlink(page)
You basically have two print statements. The first is inside your function and the second is on the last line of your script. Really, you don't need the last print statement there because you're already printing in your ALLlink() function. So if you change the line to just ALLlink(page), I think it'll work.
If you do want to print there, you could modify your function to store the URLs in an array, and then print that array. Something like this:
def ALLlink(page):
    urls = []
    url = 1
    while url is not None:
        url, endquo = get_next_target(page)
        if url:
            urls.append(url)
            page = page[endquo:]
    return urls
print ALLlink(page)
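For reference, here is a self-contained version of the same idea that runs on Python 3. It is a sketch, not the course's official solution: the name all_links is made up, and the loop stops explicitly on None instead of relying on the "url = 1" starting value.

```python
def get_next_target(page):
    """Return (url, index_of_closing_quote) for the next link, or (None, 0)."""
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote


def all_links(page):
    """Collect every link target in page and return them as a list."""
    urls = []
    while True:
        url, end_quote = get_next_target(page)
        if url is None:
            break  # no more links
        urls.append(url)
        page = page[end_quote:]  # continue searching after this link
    return urls


page = ('i know what you doing summer <a href="Udasity".i know what you doing '
        'summer <a href="Georgia" i know what you doing summer ')
print(all_links(page))
```

The function only returns the list, so there is exactly one print and no stray "None" in the output.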
I've been using the example in this post to create a system that searches for and collects a large number of Tweets in a short time period. However, each time I switch to a new API key (make a new cursor), the search starts all over from the beginning and gives me repeated Tweets. How do I get each cursor to start where the previous one left off? What am I missing? Here's the code I am using:
currentAPI = 0
a = 0
currentCursor = tweepy.Cursor(apis[currentAPI].search, q = '%40deltaKshatriya')
c = currentCursor.items()
mentions = []
onlyMentions = []
while True:
    try:
        tweet = c.next()
        if a > 100000:
            break
        else:
            onlyMentions.append(tweet.text)
            for t in tTweets:
                if tweet.in_reply_to_status_id == t.id:
                    print str(a) + tweet.text
                    mentions.append(tweet.text)
            a = a + 1
    except tweepy.TweepError:
        print "Rate limit hit"
        if (currentAPI < 9):
            print "Switching to next sat in constellation"
            currentAPI = currentAPI + 1
            #currentCursor = c.iterator.next_cursor
            currentCursor = tweepy.Cursor(apis[currentAPI].search, q = '%40deltaKshatriya', cursor = currentCursor)
            c = currentCursor.items()
        else:
            print "All sats maxed out, waiting and will try again"
            currentAPI = 0
            currentCursor = tweepy.Cursor(apis[currentAPI].search, q = '%40deltaKshatriya', cursor = currentCursor)
            c = currentCursor.items()
            time.sleep(60 * 15)
        continue
    except StopIteration:
        break
I found a workaround that I think works, although I still encounter some issues. The idea is to add a max_id argument when creating the new cursor:
currentCursor = tweepy.Cursor(apis[currentAPI].search, q = '%40deltaKshatriya', cursor = currentCursor, max_id = max_id)
Here max_id is the id of the last tweet fetched before the rate limit was hit. The only issue I've encountered is StopIteration being raised really early (before I get the full 100,000 Tweets), but I think that is a different SO question.
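To make the resume logic concrete without depending on Twitter, here is a small simulation. Everything in it is hypothetical: FakeClient and RateLimitError are stand-ins, not tweepy classes, and the tweet ids are made up. It assumes that max_id is inclusive (results have an id less than or equal to max_id), which is why the next request uses the last id minus one; passing the last id itself would return that tweet again.

```python
class RateLimitError(Exception):
    """Stand-in for the error a real client raises when rate limited."""


class FakeClient:
    """Hypothetical client: serves `quota` pages, then raises RateLimitError."""

    def __init__(self, tweet_ids, quota, page_size=3):
        self.tweet_ids = sorted(tweet_ids, reverse=True)  # newest first
        self.quota = quota
        self.page_size = page_size

    def search(self, max_id=None):
        if self.quota == 0:
            raise RateLimitError()
        self.quota -= 1
        # max_id is treated as inclusive: keep ids <= max_id
        matching = [i for i in self.tweet_ids if max_id is None or i <= max_id]
        return matching[:self.page_size]


def fetch_all(clients):
    """Page through results, switching clients on rate limit without restarting."""
    collected = []
    max_id = None
    current = 0
    while current < len(clients):
        try:
            page = clients[current].search(max_id=max_id)
        except RateLimitError:
            current += 1  # switch client, but keep max_id so we resume
            continue
        if not page:
            break  # nothing older is left
        collected.extend(page)
        max_id = page[-1] - 1  # resume strictly below the oldest id seen
    return collected


ids = range(1, 11)
clients = [FakeClient(ids, quota=2), FakeClient(ids, quota=5)]
result = fetch_all(clients)
print(result)
```

The key point is that the position in the result set lives in max_id, not in the cursor object, so it survives the switch to another API key.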
I'm learning collective intelligence programming in Python. When I tried to reproduce the pydelicious-related code, I found that pydelicious.get_popular('programming') didn't return any valid URLs. The result was {'extended': '', 'description': u'something went wrong', 'tags': '', 'url': '', 'user': '', 'dt': ''}. So the field that is supposed to hold a URL is empty ('') and the description is "something went wrong". I installed pydelicious using sudo easy_install with the setup.py downloaded from Google Code, and I can successfully import the pydelicious module. I'm not sure what the problem is.
from pydelicious import get_popular,get_userposts,get_urlposts
def initializeUserDict(tag,count=5):
    user_dict={}
    # get the top count popular posts
    for p1 in get_popular(tag=tag)[0:count]:
        # find all users who posted this
        print p1
        for p2 in get_urlposts(p1['url']):
            user=p2['user']
            user_dict[user]={}
    return user_dict
user_dict=initializeUserDict('programming')
print user_dict
The problem comes from the Delicious API itself:
http://feeds.delicious.com/v2/rss/popular/starwars
Looking at the API documentation, it looks like this is no longer supported. But if you test the 'recent' tags, that fails as well.
I sent them an email about this possible bug, let's see...
You should modify __init__.py to:
rss = http_request('http://feeds.delicious.com/v2/rss').read()
I looked at the source code again.
Maybe this is wrong, because even if you edit the code, the result stays unchanged... I'm still studying it.
This is from d.hatena.ne.jp/seika_m/20150910:
I fixed 2 lines of "pydelicious.py".
DLCS_RSS = 'http://del.icio.us/rss/'
to
DLCS_RSS = 'http://feeds.delicious.com/v2/rss/'
and
def get_popular(tag = ""):
    return getrss(tag = tag, popular = 1)
to
def get_popular(tag = ""):
    return getrss(tag = tag, popular = 0)
The problem was solved.
Indeed. Worked for me.
Make changes to __init__.py.
Replace
elif popular == 0 and tag != '':
    # http://del.icio.us/rss/tag/apple
    # http://del.icio.us/rss/tag/web2.0
    url = DLCS_RSS + "tag/%s" % tag
elif popular == 1 and tag == '':
    url = DLCS_RSS + 'popular/'
elif popular == 1 and tag != '':
    url = DLCS_RSS + 'popular/%s' % tag
with
elif popular == 0 and tag != '':
    # http://del.icio.us/rss/tag/apple
    # http://del.icio.us/rss/tag/web2.0
    url = DLCS_RSS + "%s" % tag
elif popular == 1 and tag == '':
    url = DLCS_RSS + 'popular/'
elif popular == 1 and tag != '':
    url = DLCS_RSS + '%s' % tag
I'm getting a KeyError exception when I input a player name that is not in the records list. I can search for any valid name and get it back, but if I input anything else, I get a KeyError. I'm not really sure how to go about handling this, since it's already kind of confusing dealing with 3 sets of data created from parsing my file.
I know this code is bad. I'm new to Python, so please excuse the mess. Also note that this is a sort of test file to get this functionality working, which I will then write into functions in my real main file. Kind of a testbed, if that makes any sense.
This is what my data file, stats4.txt, has in it:
[00000] Cho'Gath - 12/16/3 - Loss - 2012-11-22
[00001] Fizz - 12/5/16 - Win - 2012-11-22
[00002] Caitlyn - 13/4/6 - Win - 2012-11-22
[00003] Sona - 4/5/9 - Loss - 2012-11-23
[00004] Sona - 2/1/20 - Win - 2012-11-23
[00005] Sona - 6/3/17 - Loss - 2012-11-23
[00006] Caitlyn - 14/2/16 - Win - 2012-11-24
[00007] Lux - 10/2/14 - Win - 2012-11-24
[00008] Sona - 8/1/22 - Win - 2012-11-27
Here's my code:
import re
info = {}
records = []
search = []
with open('stats4.txt') as data:
    for line in data:
        gameid = [item.strip('[') for item in line.split(']')]
        del gameid[-1]
        gameidstr = ''.join(gameid)
        gameid = gameidstr
        line = line[7:]
        player, stats, outcome, date = [item.strip() for item in line.split('-', 3)]
        stats = dict(zip(('kills', 'deaths', 'assists'), map(int, stats.split('/'))))
        date = tuple(map(int, date.split('-')))
        info[player] = dict(zip(('gameid', 'player', 'stats', 'outcome', 'date'), (gameid, player, stats, outcome, date)))
        records.append(tuple((gameid, info[player])))
print "\n\n", info, "\n\n" #print the info dictionary just to see
champ = raw_input() #get champion name
#print info[champ].get('stats').get('kills'), "\n\n"
#print "[%s] %s - %s/%s/%s - %s-%s-%s" % (info[champ].get('gameid'), champ, info[champ].get('stats').get('kills'), info[champ].get('stats').get('deaths'), info[champ].get('stats').get('assists'), info[champ].get('date')[0], info[champ].get('date')[1], info[champ].get('date')[2])
#print "\n\n"
#print info[champ].values()
i = 0
for item in records: #this prints out all records
    print "\n", "[%s] %s - %s/%s/%s - %s - %s-%s-%s" % (records[i][0], records[i][1]['player'], records[i][1]['stats']['kills'], records[i][1]['stats']['deaths'], records[i][1]['stats']['assists'], records[i][1]['outcome'], records[i][1]['date'][0], records[i][1]['date'][1], records[i][1]['date'][2])
    i = i + 1
print "\n" + "*" * 50
i = 0
for item in records:
    if champ in records[i][1]['player']:
        search.append(records[i][1])
    else:
        pass
    i = i + 1
s = 0
if not search:
    print "no availble records" #how can I get this to print even if nothing is inputted in raw_input above for champ?
print "****"
for item in search:
    print "\n[%s] %s - %s/%s/%s - %s - %s-%s-%s" % (search[s]['gameid'], search[s]['player'], search[s]['stats']['kills'], search[s]['stats']['deaths'], search[s]['stats']['assists'], search[s]['outcome'], search[s]['date'][0], search[s]['date'][1], search[s]['date'][2])
    s = s + 1
I tried setting up a try/except sort of thing, but I couldn't get any different result when entering an invalid player name. I think I could probably set something up with a function that returns different things depending on whether the name is present, but I think I've just gotten myself a bit confused. Also notice that "no match" does indeed print for the 8 records that aren't matches, though that's not quite how I want it to work. Basically I need to get something like that for any invalid input name, not just a valid input that happens not to match a record in the loop.
Valid input names for this data are:
Cho'Gath, Fizz, Caitlyn, Sona, or Lux. Anything else gives a KeyError. That's what I need to handle, so that it doesn't raise an error and instead just prints something like "no records available for that champion" (and prints that only once, rather than 8 times).
Thanks for any help!
[edit] I was finally able to update the code in the post (thank you martineau for getting it added in; for some reason backticks weren't working to block code and it was showing up as bold normal text when I pasted). Anyway, look at "if not search": how can I get that to print even if nothing is entered at all? When I just press return at raw_input, it currently prints all records after **** even though I didn't give it any champ to search for.
Where exactly is your error occurring?
I'm just assuming it is after champ = raw_input() #get champion name
and then at info[champ].
You can either check if the key exists first:
if champ not in info:
    print 'no records available'
or use get:
if info.get(champ):
    # do stuff
or you can just try to access the key:
try:
    info[champ]
    # do stuff
except KeyError:
    print 'no records available'
The more specific you can be in your question, the better. Although you explained your problem, you really didn't include any specifics. Please always include a traceback if available, and post the relevant code in your post, not behind a link.
Here's some modifications that I think address your problem. I also reformatted the code to make it a little more readable. In Python it's possible to continue long lines onto the next either by ending with a \ or just going to the next line if there's an unpaired '(' or '[' on the previous line.
Also, the way I put code in my questions or answer here is by cutting it out of my text editor and then pasting it into the edit window, after that I make sure it's all selected and then just use the {} tool at the top of edit window to format it all.
import re
from pprint import pprint
info = {}
records = []
with open('stats4.txt') as data:
    for line in data:
        gameid = [item.strip('[') for item in line.split(']')]
        del gameid[-1]
        gameidstr = ''.join(gameid)
        gameid = gameidstr
        line = line[7:]
        player, stats, outcome, date = [item.strip() for item in line.split('-', 3)]
        stats = dict(zip(('kills', 'deaths', 'assists'), map(int, stats.split('/'))))
        date = tuple(map(int, date.split('-')))
        info[player] = dict(zip(('gameid', 'player', 'stats', 'outcome', 'date'),
                                (gameid, player, stats, outcome, date)))
        records.append(tuple((gameid, info[player])))
#print "\n\n", info, "\n\n" #print the info dictionary just to see
pprint(info)
champ = raw_input("Champ's name: ") #get champion name
#print info[champ].get('stats').get('kills'), "\n\n"
#print "[%s] %s - %s/%s/%s - %s-%s-%s" % (
# info[champ].get('gameid'), champ, info[champ].get('stats').get('kills'),
# info[champ].get('stats').get('deaths'), info[champ].get('stats').get('assists'),
# info[champ].get('date')[0], info[champ].get('date')[1],
# info[champ].get('date')[2])
#print "\n\n"
#print info[champ].values()
i = 0
for item in records: #this prints out all records
    print "\n", "[%s] %s - %s/%s/%s - %s - %s-%s-%s" % (
        records[i][0], records[i][1]['player'], records[i][1]['stats']['kills'],
        records[i][1]['stats']['deaths'], records[i][1]['stats']['assists'],
        records[i][1]['outcome'], records[i][1]['date'][0],
        records[i][1]['date'][1], records[i][1]['date'][2])
    i = i + 1
print "\n" + "*" * 50
i = 0
search = []
for item in records:
    if champ in records[i][1]['player']:
        search.append(records[i][1])
    i = i + 1
if not search:
    print "no match"
    exit()
s = 0
for item in search:
    print "\n[%s] %s - %s/%s/%s - %s - %s-%s-%s" % (search[s]['gameid'],
        search[s]['player'], search[s]['stats']['kills'],
        search[s]['stats']['deaths'], search[s]['stats']['assists'],
        search[s]['outcome'], search[s]['date'][0], search[s]['date'][1],
        search[s]['date'][2])
    s = s + 1
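On the remaining question of why pressing return prints every record: "champ in records[i][1]['player']" is a substring test on a string, and the empty string is a substring of every string, so every record matches. One possible way around both that and the KeyError is to group the records by exact champion name and look names up with dict.get. The following is just a sketch for Python 3, using a few of the lines from stats4.txt inline; the names LINE_RE, parse and find_games are made up for illustration.

```python
import re
from collections import defaultdict

# A few lines in the same format as stats4.txt, inlined so the sketch is self-contained.
SAMPLE = """\
[00000] Cho'Gath - 12/16/3 - Loss - 2012-11-22
[00001] Fizz - 12/5/16 - Win - 2012-11-22
[00003] Sona - 4/5/9 - Loss - 2012-11-23
[00004] Sona - 2/1/20 - Win - 2012-11-23
"""

LINE_RE = re.compile(
    r"\[(\d+)\] (.+?) - (\d+)/(\d+)/(\d+) - (\w+) - (\d{4}-\d{2}-\d{2})")


def parse(text):
    """Group the parsed records by exact champion name."""
    by_champ = defaultdict(list)
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that do not look like a record
        gameid, player, kills, deaths, assists, outcome, date = m.groups()
        by_champ[player].append({
            'gameid': gameid, 'player': player, 'outcome': outcome, 'date': date,
            'stats': {'kills': int(kills), 'deaths': int(deaths),
                      'assists': int(assists)},
        })
    return dict(by_champ)  # plain dict, so lookups never create new keys


def find_games(by_champ, champ):
    """Return the games for champ, or an empty list for unknown or empty input."""
    return by_champ.get(champ.strip(), [])


games = parse(SAMPLE)
for name in ("Sona", "Teemo", ""):
    found = find_games(games, name)
    if not found:
        print("no records available for %r" % name)
    for g in found:
        print("[%s] %s - %s" % (g['gameid'], g['player'], g['outcome']))
```

With an exact-match lookup, an unknown name and an empty input both fall into the same "no records" branch, and the message is printed once rather than once per record.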