How to sum values from a for loop in BeautifulSoup - Python

Please check my code here:
Sample url: http://py4e-data.dr-chuck.net/comments_42.html
The sum of the numbers found at the following URL should be 2553.
I have tried to sum them up using several techniques, but I can't find one that works; use the URL provided at the top of the code. I need to convert the string values to numbers and sum them.
import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# To read the file from the url
url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

# To search for specific area of the file
tags = soup('span')
#print(tags)

sum = 0
# Filters your search further and prints the specific part as a string
for tag in tags:
    print(tag.contents[0])
    #ChangeToInt = int(tag.contents[0])
    #sum =+ ChangeToInt
    #print(sum)

A few pointers: sum is a Python built-in function for summing up lists of numbers, so it's best not to use it as a variable name. Also, the syntax for adding to a variable is +=, but in your code you have =+. Your code works with just that syntax change (I have also renamed the variable from sum to total and print only the total after the loop).
total = 0
for tag in tags:
    print(tag.contents[0])
    ChangeToInt = int(tag.contents[0])
    total += ChangeToInt
print(total)
Alternatively, you could write this using Python's built-in sum function and a list comprehension to generate the numbers.
total = sum([int(tag.contents[0]) for tag in tags])
print(total)
Additionally, you can check this question for the difference between += and =+.
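To see why =+ silently breaks the total, note that it parses as plain assignment of a unary-plus expression. A minimal illustration (not from the original answers):
total = 0
total =+ 5   # parsed as total = (+5); the running total is replaced, not increased
total =+ 3   # parsed as total = (+3); the earlier 5 is thrown away
print(total) # 3

total = 0
total += 5   # augmented assignment: total = total + 5
total += 3
print(total) # 8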

You simply have your increment syntax wrong:
sum =+ ChangeToInt
should instead be:
sum += ChangeToInt
Your code worked just fine for me after I fixed that.

Related

Can't isolate desired results out of crude ones

I've created a script in Python to get the names of neighborhoods from a webpage. I've used the requests library along with the re module to parse the content out of a script tag on that site. When I run the script, I get the neighborhood names correctly. However, the problem is that I've used the line if not item.startswith("NY:"): continue to get rid of unwanted results from that page. I don't want to rely on the hardcoded prefix NY: to do this trick.
website link
I've tried with:
import re
import json
import requests

link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start=1'
resp = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
data = json.loads(re.findall(r'data-hypernova-key[^{]+(.*)--></script>', resp.text)[0])
items = data['searchPageProps']['filterPanelProps']['filterInfoMap']
for item in items:
    if not item.startswith("NY:"):
        continue
    print(item)
Result I'm getting (desired result):
NY:New_York:Brooklyn:Mill_Basin
NY:New_York:Bronx:Edenwald
NY:New_York:Staten_Island:Stapleton
If I do not use this line if not item.startswith("NY:"):continue, the results are something like:
rating
NY:New_York:Brooklyn:Mill_Basin
NY:New_York:Bronx:Edenwald
NY:New_York:Staten_Island:Stapleton
NY:New_York:Staten_Island:Lighthouse_Hill
NY:New_York:Queens:Rochdale
NY:New_York:Queens:Pomonok
BusinessParking.validated
food_court
NY:New_York:Queens:Little_Neck
The bottom line is that I want everything starting with NY:New_York:. What I meant by unwanted results are rating, BusinessParking.validated, food_court, and so on.
How can I get the neighborhoods without hardcoding any portion of the search within the script?
I'm not certain what your complete data set looks like, but based on your sample,
you might use something like:
if ':' not in item:
    continue

# or perhaps:
if item.count(':') < 3:
    continue

# I'd prefer a list comprehension if I didn't need the other data
items = [x for x in data['searchPageProps']['filterPanelProps']['filterInfoMap'] if ':' in x]
If that doesn't work for what you're trying to achieve then you could just use a variable for the state.
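For instance, a minimal sketch of the variable approach (the state and city values here are assumptions you would supply from your search parameters, not something read off the page):
state = "NY"       # hypothetical: the state used in the search
city = "New_York"  # hypothetical: the city used in the search
prefix = state + ":" + city + ":"

# Keep only the filter keys that name a neighborhood in that city.
neighborhoods = [item for item in items if item.startswith(prefix)]
for item in neighborhoods:
    print(item)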
Another solution - using BeautifulSoup - which doesn't involve regex or hardcoding "NY:New_York" is below; it's convoluted, but mainly because Yelp buried its treasure several layers deep...
So for future reference:
from bs4 import BeautifulSoup as bs
import json
import requests

link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start=1'
resp = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
soup = bs(resp.text, 'html.parser')  # this line was missing: parse the page before using soup
target = soup.find_all('script')[14]
content = target.text.replace('<!--', '').replace('-->', '')
js_data = json.loads(content)
And now the fun of extracting NYC info from the json begins....
for a in js_data:
    if a == 'searchPageProps':
        level1 = js_data[a]
        for b in level1:
            if b == 'filterPanelProps':
                level2 = level1[b]
                for c in level2:
                    if c == 'filterSets':
                        level3 = level2[c][1]
                        for d in level3:
                            if d == 'moreFilters':
                                level4 = level3[d]
                                for e in range(len(level4)):
                                    print(level4[e]['title'])
                                    print(level4[e]['sectionFilters'])
                                    print('---------------')
The output is the name of each borough plus a list of all neighborhoods in that borough. For example:
Manhattan
['NY:New_York:Manhattan:Alphabet_City',
'NY:New_York:Manhattan:Battery_Park',
'NY:New_York:Manhattan:Central_Park', 'NY:New_York:Manhattan:Chelsea',
'...]
etc.
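Since each of those loops only ever matches a single known key, the traversal can also be collapsed into direct indexing. A sketch, assuming the key layout shown above holds for the page you fetched:
# Hypothetical simplification; relies on the same JSON structure as the loops above.
more_filters = js_data['searchPageProps']['filterPanelProps']['filterSets'][1]['moreFilters']
for borough in more_filters:
    print(borough['title'])
    print(borough['sectionFilters'])
    print('---------------')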

Scraping returning only one value

I wanted to scrape something as my first program, just to learn the basics, but I'm having trouble showing more than one result.
The premise is to go to a forum (http://blackhatworld.com), scrape all thread titles, and compare each with a string. If a title contains the word "free" it will be printed, otherwise it won't.
Here's the current code:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.blackhatworld.com/')
content = BeautifulSoup(page.content, 'html.parser')
threadtitles = content.find_all('a', class_='PreviewTooltip')
n = 0
for x in range(len(threadtitles)):
    test = list(threadtitles)[n]
    test2 = list(test)[0]
    if test2.find('free') == -1:
        n = n + 1
    else:
        print(test2)
        n = n + 1
This is the result of running the program:
https://i.gyazo.com/6cf1e135b16b04f0807963ce21b2b9be.png
As you can see it's checking for the word "free" and it works but it only shows first result while there are several more in the page.
By default, string comparison is case sensitive (FREE != free). To solve your problem, first you need to put test2 in lowercase:
test2 = list(test)[0].lower()
To solve your problem and simplify your code, try this:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.blackhatworld.com/')
content = BeautifulSoup(page.content, 'html.parser')
threadtitles = content.find_all('a', class_='PreviewTooltip')
count = 0
for title in threadtitles:
    if "free" in title.get_text().lower():
        print(title.get_text())
    else:
        count += 1
print(count)
Bonus: Print value of href:
for title in threadtitles:
    print(title["href"])
See also this.

Web scraping a Wikipedia page

In some Wikipedia pages, after the title of the article (appearing in bold), there is some text inside parentheses used to explain the pronunciation and phonetics of the words in the title. For example, on this page, after the bold title diglossia in the <p>, there is an open parenthesis. In order to find the corresponding close parenthesis, you have to iterate through the text nodes one by one, which is simple. What I'm trying to do is find the very next href link after it and store it.
The issue here is that (AFAIK) there isn't a way to uniquely identify the text node containing the close parenthesis and then get the following href. Is there any straightforward (not convoluted) way to get the first link outside of the initial parentheses?
EDIT
In the case of the link provided here, the href to be stored should be https://en.wikipedia.org/wiki/Dialects, since that is the first link outside of the parentheses.
Is this what you want?
import requests
from BeautifulSoup import BeautifulSoup  # this import was missing from the snippet

rs = requests.get('https://en.wikipedia.org/wiki/Diglossia', verify=False)
parsed_html = BeautifulSoup(rs.text)
print parsed_html.body.findAll('p')[0].findAll('a')[0]
This gives:
linguistics
If you want to extract the href, you can use this:
parsed_html.body.findAll('p')[0].findAll('a')[0].attrs[0][1]
UPDATE
It seems you want the href after the parentheses, not the one before them.
I have written a script for it. Try this:
import requests
from BeautifulSoup import BeautifulSoup

rs = requests.get('https://en.wikipedia.org/wiki/Diglossia', verify=False)
parsed_html = BeautifulSoup(rs.text)
temp = parsed_html.body.findAll('p')[0]
start_count = 0
started = False
found = False
while temp.next and found is False:
    temp = temp.next
    if '(' in temp:
        start_count += 1
        if started is False:
            started = True
    if ')' in temp and started and start_count > 1:
        start_count -= 1
    elif ')' in temp and started and start_count == 1:
        found = True
print temp.findNext('a').attrs[0][1]
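For Python 3 with bs4, the same parenthesis-counting idea can be expressed over the paragraph's descendants. A rough sketch, not from the original answer, which assumes the first <p> holds the bolded title and the pronunciation parenthetical:
import requests
from bs4 import BeautifulSoup, NavigableString, Tag

rs = requests.get('https://en.wikipedia.org/wiki/Diglossia')
soup = BeautifulSoup(rs.text, 'html.parser')
para = soup.body.find('p')

depth = 0          # current parenthesis nesting level
seen_open = False  # whether the initial '(' has been seen yet
for node in para.descendants:  # walks text nodes and tags in document order
    if isinstance(node, NavigableString):
        depth += node.count('(') - node.count(')')
        if '(' in node:
            seen_open = True
    elif isinstance(node, Tag) and node.name == 'a':
        # The first link after the initial parenthetical has fully closed.
        if seen_open and depth == 0:
            print(node.get('href'))
            break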

Use BeautifulSoup to loop through and retrieve specific URLs

I want to use BeautifulSoup to retrieve specific URLs at a specific position repeatedly. You may imagine that there are 4 different URL lists, each containing 100 different URL links.
I need to get and print the 3rd URL on every list, where that URL (e.g. the 3rd URL on the first list) leads to the 2nd list (on which I again need the 3rd URL, and so on until the 4th retrieval).
Yet my loop only achieves the first result (the 3rd URL on list 1), and I don't know how to feed the new URL back into the while loop to continue the process.
Here is my code:
import urllib.request
import json
import ssl
from bs4 import BeautifulSoup

num = int(input('enter count times: '))
position = int(input('enter position: '))
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
print(url)
count = 0
order = 0
while count < num:
    context = ssl._create_unverified_context()
    htm = urllib.request.urlopen(url, context=context).read()
    soup = BeautifulSoup(htm)
    for i in soup.find_all('a'):
        order += 1
        if order == position:
            x = i.get('href')
            print(x)
    count += 1
    url = x
print('done')
This is a good problem for recursion. Try calling a recursive function to do this:
def retrieve_urls_recur(url, position, index, deepness):
    if index >= deepness:
        return True
    else:
        plain_text = requests.get(url).text
        soup = BeautifulSoup(plain_text)
        links = soup.find_all('a')
        desired_link = links[position].get('href')
        print(desired_link)
        # pass position along so every level picks the same slot
        return retrieve_urls_recur(desired_link, position, index + 1, deepness)
and then call it with the desired parameters, in your case:
retrieve_urls_recur(url, 2, 0, 4)
2 is the URL index in the list of URLs (0-based, so 2 means the 3rd link), 0 is the starting counter, and 4 is how deep you want to go recursively.
PS: I am using requests instead of urllib, and I didn't test this, although I recently used a very similar function with success.
Just get the link from find_all() by index:
while count < num:
    context = ssl._create_unverified_context()
    htm = urllib.request.urlopen(url, context=context).read()
    soup = BeautifulSoup(htm)
    # position is 1-based in the prompt, so subtract 1 for list indexing
    url = soup.find_all('a')[position - 1].get('href')
    count += 1

"Expected string or buffer" error using Beautiful Soup

I'm trying out code that will pull numbers from a URL using Beautiful Soup, then sum these numbers, but I keep getting an error that looks like this:
Expected string or buffer
I think it's related to the regular expressions, but I can't pinpoint the problem.
import re
import urllib
from BeautifulSoup import *

htm1 = urllib.urlopen('https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/comments_42.html').read()
soup = BeautifulSoup(htm1)
tags = soup('span')
for tag in tags:
    y = re.findall('([0-9]+)', tag.txt)
    print sum(y)
I recommend bs4 instead of BeautifulSoup (which is the old version). You also need to change this line:
y = re.findall('([0-9]+)', tag.txt)
to something like this:
y = re.findall ('([0-9]+)',tag.text)
See if this gets you further:
sum = 0  # initialize the sum
for tag in tags:
    y = re.findall('([0-9]+)', tag.text)  # get the text from the tag
    print(y[0])  # y is a list; print the first element of the list
    sum += int(y[0])  # convert it to an integer and add it to the sum
print('the sum is: {}'.format(sum))
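For reference, since bs4 is the recommended library, here is a rough Python 3 equivalent (a sketch assuming the same page layout, with the numbers inside span tags):
import re
import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/comments_42.html').read()
soup = BeautifulSoup(html, 'html.parser')

# Pull every run of digits out of the span tags and total them.
numbers = [int(match) for tag in soup('span')
           for match in re.findall(r'[0-9]+', tag.text)]
print('the sum is: {}'.format(sum(numbers)))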
