Python Beautiful Soup

Python Beautiful Soup - python

I am learning Beautiful Soup for Python and trying to parse a website "https://www.twitteraudit.com/". When I enter a twitter id in the search bar, it returns the results for some id in a fraction of seconds, but some id takes about a minute to process the data. In this case, how can I parse the HTML after it gets loaded or the result is done? And I tried to loop it, but it doesn't work that way. But what I figured was if I open a browser and load the web link and once its done it is storing the cache in the computer and the next time when I run for the same id it works perfectly.
Can anyone help me out with this? I appreciate the help. I attach the code below>>
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import re
from re import sub
def HTML(myURL):
uClient = uReq(myURL)
pageHTML = uClient.read()
uClient.close()
pageSoup = soup(pageHTML, "html.parser")
return pageSoup
def fakecheck(usr):
myURLfc = "https://www.twitteraudit.com/" + usr
pgSoup = HTML(myURLfc)
foll = pgSoup.findAll("div",{"class":"audit"})
link = foll[0].div.a["href"]
real = foll[0].findAll("span",{"class":"real number"})[0]["data-value"]
fake = foll[0].findAll("span",{"class":"fake number"})[0]["data-value"]
scr = foll[0].findAll("div",{"class":"score"})[0].div
scoresent = scr["class"][1]
score = re.findall(r'\d{1,3}',str(scr))[0]
return [link, real, fake, scoresent, score]
lis = ["BarackObama","POTUS44","ObamaWhiteHouse","MichelleObama","ObamaFoundation","NSC44","ObamaNews","WhiteHouseCEQ44","IsThatBarrak","obama_barrak","theprezident","barrakubama","BarrakObama","banackkobama","YusssufferObama","barrakisdabomb_","BarrakObmma","fuzzyjellymasta","BarrakObama6","bannalover101","therealbarrak","ObamaBarrak666","barrak_obama"]
for u in lis:
link, real, fake, scoresent, score = fakecheck(u)
print ("link : " + link)
print ("Real : " + real)
print ("Fake : " + fake)
print ("Result : " + scoresent)
print ("Score : " + score)
print ("=================")

I think the problem is some of the Twitter ID's have not yet been audited, and so I was getting an IndexError. However, putting the call to fakecheck(u) in a while True: loop that catches that error will continually check the website until an audit has been performed on that ID.
I put this code after the lis definition:
def get_fake_check(n):
return fakecheck(n)
for u in lis:
while True:
try:
link, real, fake, scoresent, score = get_fake_check(u)
break
except:
pass
I'm not sure if there is a way to automate the audit request on the website, but when a query is waiting, I manually clicked the "Audit" button on the website for that ID, and once the audit was completed, the script continued as usual until all ID audits were processed.

Related

Beautifulsoup - Python For loop only runs 8 times then exits with code 0 in visual studio code

I've got a python script that scrapes the first page on an auction site. The page it's scraping is trademe.co.nz - similar to ebay/amazon etc. It's purpose is to scrape all listings on the first page - only if it's not in my database. It's working as expected with one caveat - it's only scraping the first 8 listings (regardless of trademe url) & then exits with code 0 in visual studio code. If I try to run it again it exits immediately as it thinks there are no new auction IDs. If a new listing gets added & I run the script again - it will add the new one.
from bs4 import BeautifulSoup
from time import sleep
import requests
import datetime
import sqlite3
# Standard for all scrapings
dateAdded = datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S")
def mechanicalKeyboards():
url = "https://www.trademe.co.nz/a/marketplace/computers/peripherals/keyboards/mechanical/search?condition=used&sort_order=expirydesc"
category = "Mechanical Keyboards"
dateAdded = datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S")
trademeLogo = "https://www.trademe.co.nz/images/frend/trademe-logo-no-tagline.png"
# getCode = requests.get(url).status_code
# print(getCode)
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
listingContainer = soup.select(".tm-marketplace-search-card__wrapper")
conn = sqlite3.connect('trademe.db')
c = conn.cursor()
c.execute('''SELECT ID FROM trademe ORDER BY DateAdded DESC ''')
allResult = str(c.fetchall())
for listing in listingContainer:
title = listing.select("#-title")
location = listing.select("#-region")
auctionID = listing['data-aria-id'].split("-").pop()
fullListingURL = "https://www.trademe.co.nz/a/" + auctionID
image = listing.select("picture img")
try:
buyNow = listing.select(".tm-marketplace-search-card__footer-pricing-row")[0].find(class_="tm-marketplace-search-card__price ng-star-inserted").text.strip()
except:
buyNow = "None"
try:
price = listing.select(".tm-marketplace-search-card__footer-pricing-row")[0].find(class_="tm-marketplace-search-card__price").text.strip()
except:
price = "None"
for t, l, i in zip(title, location, image):
if auctionID not in allResult:
print("Adding new data - " + t.text)
c.execute(''' INSERT INTO trademe VALUES(?,?,?,?)''', (auctionID, t.text, dateAdded, fullListingURL))
conn.commit()
sleep(5)
I thought perhaps I was getting rate-limited, but I get a 200 status code & changing URLs work for the first 8 listings again. I had a look at the elements & can't see any changes after the 8th listing. I'm hoping someone could assist, thanks so much.

When using requests.get(url) to scrape a website with lazy-loaded content, it only return the HTML with images for the first 8 listings, causing the zip(title, location, image) function to only yield 8 items since image variable is empty list after the 8th listing in listingContainer
To properly scrape this type of website, I would recommended using tools such as Playwright or Selenium.

Why Does My Code Scrape The First Record Only?

My code goes into a website, and clicks on records which causes drop downs.
My current code only prints the first drop down record, and not the others.
For example, the first record of the webpage when clicked, drops down 1 record. This record is shown attached. This is also the first and only dropdown record that gets printed as my output.
The code prints this
How do I get it to pull all drop down titles?
from selenium import webdriver
import time
driver = webdriver.Chrome()
for x in range (1,2):
driver.get(f'https://library.iaslc.org/conference-program?product_id=24&author=&category=&date=&session_type=&session=&presentation=&keyword=&available=&cme=&page={x}')
time.sleep(4)
productlist_length = len(driver.find_elements_by_xpath("//div[#class='accordin_title']"))
for i in range(1, productlist_length + 1):
product = driver.find_element_by_xpath("(//div[#class='accordin_title'])[" + str(i) + "]")
title = product.find_element_by_xpath('.//h4').text.strip()
print(title)
buttonToClick = product.find_element_by_xpath('.//div[#class="sign"]')
buttonToClick.click()
time.sleep(5)
subProduct=driver.find_element_by_xpath(".//li[#class='sub_accordin_presentation']")
otherTitle=subProduct.find_element_by_xpath('.//h4').text.strip()
print(otherTitle)

You don't need selenium at all. Not sure what all the info is that you are after but the following shows you that the content is available, from within those expand blocks, with the response from a simple requests.get().:
import requests
from bs4 import BeautifulSoup as bs
import re
r = requests.get('https://library.iaslc.org/conference-program?product_id=24&author=&category=&date=&session_type=&session=&presentation=&keyword=&available=&cme=&page=1')
soup = bs(r.text, 'lxml')
sessions = soup.select('#accordin > ul > li')
for session in sessions:
print(session.select_one('h4').text)
sub_session = session.select('.sub_accordin_presentation')
if sub_session:
print([re.sub(r'[\n\s]+', ' ', i.text) for i in sub_session])
print()
print()

Try:
productlist_length = len(driver.find_elements_by_xpath('//*[#class="jscroll-inner"]/ul/li'))
for product in productlist_length:
title = product.find_element_by_xpath('(.//*[#class="accordin_title"]/div)[3]/h4').text

Google news crawler flip pages

continuing on previous work to crawl all news result about query and to return title and url, I am refining the crawler to get all results from all pages in Google News. Current code seems can only return the 1st page Googel news search result. Would be grateful to know how to get all pages results. Many thanks!
my codes below:
import requests
from bs4 import BeautifulSoup
import time
import datetime
from random import randint
import numpy as np
import pandas as pd
query2Google = input("What do you want from Google News?\n")
def QGN(query2Google):
s = '"'+query2Google+'"' #Keywords for query
s = s.replace(" ","+")
date = str(datetime.datetime.now().date()) #timestamp
filename =query2Google+"_"+date+"_"+'SearchNews.csv' #csv filename
f = open(filename,"wb")
url = "http://www.google.com.sg/search?q="+s+"&tbm=nws&tbs=qdr:y" # URL for query of news results within one year and sort by date
#htmlpage = urllib2.urlopen(url).read()
time.sleep(randint(0, 2))#waiting
htmlpage = requests.get(url)
print("Status code: "+ str(htmlpage.status_code))
soup = BeautifulSoup(htmlpage.text,'lxml')
df = []
for result_table in soup.findAll("div", {"class": "g"}):
a_click = result_table.find("a")
#print ("-----Title----\n" + str(a_click.renderContents()))#Title
#print ("----URL----\n" + str(a_click.get("href"))) #URL
#print ("----Brief----\n" + str(result_table.find("div", {"class": "st"}).renderContents()))#Brief
#print ("Done")
df=np.append(df,[str(a_click.renderContents()).strip("b'"),str(a_click.get("href")).strip('/url?q='),str(result_table.find("div", {"class": "st"}).renderContents()).strip("b'")])
df = np.reshape(df,(-1,3))
df1 = pd.DataFrame(df,columns=['Title','URL','Brief'])
print("Search Crawl Done!")
df1.to_csv(filename, index=False,encoding='utf-8')
f.close()
return
QGN(query2Google)

There used to be an ajax api, but it's no longer avaliable .
Still , you can modify your script with a for loop if you want to get a number of pages , or a while loop if you want to get all pages .
Example :
url = "http://www.google.com.sg/search?q="+s+"&tbm=nws&tbs=qdr:y&start="
pages = 10 # the number of pages you want to crawl #
for next in range(0, pages*10, 10) :
page = url + str(next)
time.sleep(randint(1, 5)) # you may need longer than that #
htmlpage = requests.get(page) # you should add User-Agent and Referer #
print("Status code: " + str(htmlpage.status_code))
if htmlpage.status_code != 200 :
break # something went wrong #
soup = BeautifulSoup(htmlpage.text, 'lxml')
... process response here ...
next_page = soup.find('td', { 'class':'b', 'style':'text-align:left' })
if next_page is None or next_page.a is None :
break # there are no more pages #
Keep in mind that google doesn't like bots , you might get a ban .
You could add 'User-Agent' and 'Referer' in headers to simulate a web browser , and use time.sleep(random.uniform(2, 6)) to simulate a human ... or use selenium.

You can also add &num=25 to the end of your query and you'll get back a webpage with that number of results. In this example youll get back 25 google results back.

Having problems following links with webcrawler

I am trying to create a webcrawler that parses all the html on the page, grabs a specified (via raw_input) link, follows that link, and then repeats this process a specified number of times (once again via raw_input). I am able to grab the first link and successfully print it. However, I am having problems "looping" the whole process, and usually grab the wrong link. This is the first link
https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html
(Full disclosure, this questions pertains to an assignment for a Coursera course)
Here's my code
import urllib
from BeautifulSoup import *
url = raw_input('Enter - ')
rpt=raw_input('Enter Position')
rpt=int(rpt)
cnt=raw_input('Enter Count')
cnt=int(cnt)
count=0
counts=0
tags=list()
soup=None
while x==0:
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# Retrieve all of the anchor tags
tags=soup.findAll('a')
for tag in tags:
url= tag.get('href')
count=count + 1
if count== rpt:
break
counts=counts + 1
if counts==cnt:
x==1
else: continue
print url

Based on DJanssens' response, I found the solution;
url = tags[position-1].get('href')
did the trick for me!
Thanks for the assistance!

I also worked on that course, and help with a friend, I got this worked out:
import urllib
from bs4 import BeautifulSoup
url = "http://python-data.dr-chuck.net/known_by_Happy.html"
rpt=7
position=18
count=0
counts=0
tags=list()
soup=None
x=0
while x==0:
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html,"html.parser")
tags=soup.findAll('a')
url= tags[position-1].get('href')
count=count + 1
if count == rpt:
break
print url

I believe this is what you are looking for:
import urllib
from bs4 import *
url = raw_input('Enter - ')
position=int(raw_input('Enter Position'))
count=int(raw_input('Enter Count'))
#perform the loop "count" times.
for _ in xrange(0,count):
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
tags=soup.findAll('a')
for tag in tags:
url= tag.get('href')
tags=soup.findAll('a')
# if the link does not exist at that position, show error.
if not tags[position-1]:
print "A link does not exist at that position."
# if the link at that position exist, overwrite it so the next search will use it.
url = tags[position-1].get('href')
print url
The code will now loop the amount of times as specified in the input, each time it will take the href at the given position and replace it with the url, in that way each loop will look further in the tree structure.
I advice you to use full names for variables, which is a lot easier to understand. In addition you could cast them and read them in a single line, which makes your beginning easier to follow.

Here is my 2-cents:
import urllib
#import ssl
from bs4 import BeautifulSoup
#'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
url = raw_input('Enter URL : ')
position = int(raw_input('Enter position : '))
count = int(raw_input('Enter count : '))
print('Retrieving: ' + url)
soup = BeautifulSoup(urllib.urlopen(url).read())
for x in range(1, count + 1):
link = list()
for tag in soup('a'):
link.append(tag.get('href', None))
print('Retrieving: ' + link[position - 1])
soup = BeautifulSoup(urllib.urlopen(link[position - 1]).read())

How can I loop scraping data for multiple pages in a website using python and beautifulsoup4

I am trying to scrape data from the PGA.com website to get a table of all of the golf courses in the United States. In my CSV table I want to include the Name of the golf course ,Address ,Ownership ,Website , Phone number. With this data I would like to geocode it and place into a map and have a local copy on my computer
I utilized Python and Beautiful Soup4 to extract my data. I have reached as far to extract the data and import it into a CSV but I am now having a problem of scraping data from multiple pages on the PGA website. I want to extract ALL THE GOLF COURSES but my script is limited only to one page I want to loop it in away that it will capture all data for golf courses from all pages found in the PGA site. There are about 18000 gold courses and 900 pages to capture data
Attached below is my script. I need help on creating code that will capture ALL data from the PGA website and not just one site but multiple. In this manner it will provide me with all the data of gold courses in the United States.
Here is my script below:
import csv
import requests
from bs4 import BeautifulSoup
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
r = requests.get(url)
soup = BeautifulSoup(r.content)
g_data1=soup.find_all("div",{"class":"views-field-nothing-1"})
g_data2=soup.find_all("div",{"class":"views-field-nothing"})
courses_list=[]
for item in g_data2:
try:
name=item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
except:
name=''
try:
address1=item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
except:
address1=''
try:
address2=item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
except:
address2=''
try:
website=item.contents[1].find_all("div",{"class":"views-field-website"})[0].text
except:
website=''
try:
Phonenumber=item.contents[1].find_all("div",{"class":"views-field-work-phone"})[0].text
except:
Phonenumber=''
course=[name,address1,address2,website,Phonenumber]
courses_list.append(course)
with open ('filename5.csv','wb') as file:
writer=csv.writer(file)
for row in courses_list:
writer.writerow(row)
#for item in g_data1:
#try:
#print item.contents[1].find_all("div",{"class":"views-field-counter"})[0].text
#except:
#pass
#try:
#print item.contents[1].find_all("div",{"class":"views-field-course-type"})[0].text
#except:
#pass
#for item in g_data2:
#try:
#print item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
#except:
#pass
#try:
#print item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
#except:
#pass
#try:
#print item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
#except:
#pass
This script only captures 20 at a time and I want to capture all in one script which account for 18000 golf courses and 900 pages to scrape form.

The PGA website's search have multiple pages, the url follows the pattern:
http://www.pga.com/golf-courses/search?page=1 # Additional info after page parameter here
this means you can read the content of the page, then change the value of page by 1, and read the the next page.... and so on.
import csv
import requests
from bs4 import BeautifulSoup
for i in range(907): # Number of pages plus one
url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
r = requests.get(url)
soup = BeautifulSoup(r.content)
# Your code for each individual page here

if you still read this post , you can try this code too....
from urllib.request import urlopen
from bs4 import BeautifulSoup
file = "Details.csv"
f = open(file, "w")
Headers = "Name,Address,City,Phone,Website\n"
f.write(Headers)
for page in range(1,5):
url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(page)
html = urlopen(url)
soup = BeautifulSoup(html,"html.parser")
Title = soup.find_all("div", {"class":"views-field-nothing"})
for i in Title:
try:
name = i.find("div", {"class":"views-field-title"}).get_text()
address = i.find("div", {"class":"views-field-address"}).get_text()
city = i.find("div", {"class":"views-field-city-state-zip"}).get_text()
phone = i.find("div", {"class":"views-field-work-phone"}).get_text()
website = i.find("div", {"class":"views-field-website"}).get_text()
print(name, address, city, phone, website)
f.write("{}".format(name).replace(",","|")+ ",{}".format(address)+ ",{}".format(city).replace(",", " ")+ ",{}".format(phone) + ",{}".format(website) + "\n")
except: AttributeError
f.close()
where it is written range(1,5) just change that with 0,to the last page , and you will get all details in CSV, i tried very hard to get your data in proper format but it's hard:).

You're putting a link to a single page, it's not going to iterate through each one on its own.
Page 1:
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
Page 2:
http://www.pga.com/golf-courses/search?page=1&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0
Page 907:
http://www.pga.com/golf-courses/search?page=906&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0
Since you're running for page 1 you'll only get 20. You'll need to create a loop that'll run through each page.
You can start off by creating a function that does one page then iterate that function.
Right after the search? in the url, starting at page 2, page=1 begins increasing until page 907 where it's page=906.

I noticed that the first solution had a repetition of the first instance, that is because the 0 page and 1 page is the same page. This is resolved by specifying the start page in the range function. Example below...
for i in range(1, 907): #Number of pages plus one
url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
r = requests.get(url)
soup = BeautifulSoup(r.content, "html5lib") #Can use whichever parser you prefer
# Your code for each individual page here

Had this same exact problem and the solutions above did not work. I solved mine by accounting for cookies. A requests session helps. Create a session and it'll pull all the pages you need by inserting a cookie to all the numbered pages.
import csv
import requests
from bs4 import BeautifulSoup
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
s = requests.Session()
r = s.get(url)

The PGA website has changed this question has been asked.
It seems they organize all courses by: State > City > Course
In light of this change and the popularity of this question, here's how I'd solve this problem today.
Step 1 - Import everything we'll need:
import time
import random
from gazpacho import Soup # https://github.com/maxhumber/gazpacho
from tqdm import tqdm # to keep track of progress
Step 2 - Scrape all the state URL endpoints:
URL = "https://www.pga.com"
def get_state_urls():
soup = Soup.get(URL + "/play")
a_tags = soup.find("ul", {"data-cy": "states"}, mode="first").find("a")
state_urls = [URL + a.attrs['href'] for a in a_tags]
return state_urls
state_urls = get_state_urls()
Step 3 - Write a function to scrape all the city links:
def get_state_cities(state_url):
soup = Soup.get(state_url)
a_tags = soup.find("ul", {"data-cy": "city-list"}).find("a")
state_cities = [URL + a.attrs['href'] for a in a_tags]
return state_cities
state_url = state_urls[0]
city_links = get_state_cities(state_url)
Step 4 - Write a function to scrape all of the courses:
def get_courses(city_link):
soup = Soup.get(city_link)
courses = soup.find("div", {"class": "MuiGrid-root MuiGrid-item MuiGrid-grid-xs-12 MuiGrid-grid-md-6"}, mode="all")
return courses
city_link = city_links[0]
courses = get_courses(city_link)
Step 5 - Write a function to parse all the useful info about a course:
def parse_course(course):
return {
"name": course.find("h5", mode="first").text,
"address": course.find("div", {'class': "jss332"}, mode="first").strip(),
"url": course.find("a", mode="first").attrs["href"]
}
course = courses[0]
parse_course(course)
Step 6 - Loop through everything and save:
all_courses = []
for state_url in tqdm(state_urls):
city_links = get_state_cities(state_url)
time.sleep(random.uniform(1, 10) / 10)
for city_link in city_links:
courses = get_courses(city_link)
time.sleep(random.uniform(1, 10) / 10)
for course in courses:
info = parse_course(course)
all_courses.append(info)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Beautiful Soup - python

Related

Beautifulsoup - Python For loop only runs 8 times then exits with code 0 in visual studio code

Why Does My Code Scrape The First Record Only?

Google news crawler flip pages

Having problems following links with webcrawler

How can I loop scraping data for multiple pages in a website using python and beautifulsoup4

Categories

Resources