Use Beautiful Soup in scraping multiple websites - Python

I want to know why the lists all_links and all_titles don't receive any records from the lists titles and links. I have also tried the .extend() method and it didn't help.
import requests
from bs4 import BeautifulSoup

all_links = []
all_titles = []

def title_link(page_num):
    page = requests.get(
        'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/warszawa/page-%d/v%dc9073l3200008p%d'
        % (page_num, page_num, page_num))
    soup = BeautifulSoup(page.content, 'html.parser')
    links = ['https://www.gumtree.pl' + link.get('href')
             for link in soup.find_all('a', class_="href-link tile-title-text")]
    titles = [flat.next_element for flat in soup.find_all('a', class_="href-link tile-title-text")]
    print(titles)

for i in range(1, 5+1):
    title_link(i)
    all_links = all_links + links
    all_titles = all_titles + titles
    i += 1
print(all_links)

import pandas as pd
df = pd.DataFrame(data={'title': all_titles, 'link': all_links})
df.head(100)
#df.to_csv("./gumtree_page_1.csv", sep=';', index=False, encoding='utf-8')
#df.to_excel('./gumtree_page_1.xlsx')

When I ran your code, I got
NameError                                 Traceback (most recent call last)
<ipython-input-3-6fff0b33d73b> in <module>
     16 for i in range(1,5+1):
     17     title_link(i)
---> 18     all_links = all_links + links
     19     all_titles = all_titles + titles
     20     i+=1

NameError: name 'links' is not defined
That points to the problem: the variable named links is not defined in the global scope (where you add it to all_links). You can read about Python scopes here. You'd need to return links and titles from title_link, something similar to this:
def title_link(page_num):
    # your code here
    return links, titles

for i in range(1, 5+1):
    links, titles = title_link(i)
    all_links = all_links + links
    all_titles = all_titles + titles
print(all_links)

This code exhibits confusion about scoping. titles and links inside title_link are local to that function. When the function ends, that data disappears and cannot be accessed from another scope, such as the top level of the script. Use the return keyword to return values from functions. In this case, you'd need to return a pair of titles and links, like return titles, links.
Since functions should do one task only, having to return a pair reveals a possible design flaw. A function like title_link is overloaded and should probably be two separate functions, one to get titles and one to get links.
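For illustration, that split could look roughly like this (a sketch only, reusing the selectors from your code and assuming the parsed soup object is passed in):

def get_links(soup):
    # collect absolute listing urls from the anchor tags
    return ['https://www.gumtree.pl' + a.get('href')
            for a in soup.find_all('a', class_='href-link tile-title-text')]

def get_titles(soup):
    # collect the visible title text of each listing
    return [a.next_element
            for a in soup.find_all('a', class_='href-link tile-title-text')]

Each function now has a single, obvious job, and the caller decides how to combine the results.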
Having said that, the functions here seem like premature abstractions since the operations can be done directly.
Here's a suggested rewrite:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/warszawa/page-%d/v%dc9073l3200008p%d"
data = {"title": [], "link": []}

for i in range(1, 6):
    page = requests.get(url % (i, i, i))
    soup = BeautifulSoup(page.content, "html.parser")
    titles = soup.find_all("a", class_="href-link tile-title-text")
    data["title"].extend([x.next_element for x in titles])
    data["link"].extend("https://www.gumtree.pl" + x.get("href") for x in titles)

df = pd.DataFrame(data)
print(df.head(100))
Other remarks:
i+=1 is unnecessary; for loops move forward automatically in Python.
range(1, 5+1) is clearer as range(1, 6).
List comprehensions are great, but if they run multiple lines, consider writing them as normal loops or creating an intermediate variable or two.
Imports should go at the top of the file. See PEP 8.
list.extend(other_list) is preferable to list = list + other_list, which is slow and memory-intensive because it creates a whole new copy of the list (see the short example after this list).
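A quick illustration of the difference:

all_links = []
more = ['a', 'b']

all_links.extend(more)        # mutates the existing list in place
all_links = all_links + more  # builds a whole new list, then rebinds the name

Inside a function, the extend form also works on a list defined at module level without a global declaration, whereas the rebinding form would raise an UnboundLocalError.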

Try this:
import requests
from bs4 import BeautifulSoup

all_links = []
all_titles = []

def title_link(page_num):
    page = requests.get(
        'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/warszawa/page-%d/v%dc9073l3200008p%d'
        % (page_num, page_num, page_num))
    page.encoding = 'utf-8'
    soup = BeautifulSoup(page.content, 'html.parser', from_encoding='utf-8')
    links = ['https://www.gumtree.pl' + link.get('href')
             for link in soup.find_all('a', class_="href-link tile-title-text")]
    titles = [flat.next_element for flat in soup.find_all('a', class_="href-link tile-title-text")]
    print(titles)
    return links, titles

for i in range(1, 5+1):
    links, titles = title_link(i)
    all_links.extend(links)
    all_titles.extend(titles)
    # i+=1 not needed in Python
print(all_links)

import pandas as pd
df = pd.DataFrame(data={'title': all_titles, 'link': all_links})
df.head(100)
I think you just needed to get links and titles out of title_link(page_num).
Edit: removed the manual incrementing per comments
Edit: changed the all_links = all_links + links to all_links.extend(links)
Edit: the website is UTF-8 encoded, so added page.encoding = 'utf-8' and, as an extra (probably unnecessary) measure, from_encoding='utf-8' to the BeautifulSoup call

Related

web-scrape: get H4 attributes & href

I am trying to web-scrape a website, but I can't get access to the attributes of some fields.
Here is the code I used:
import urllib3
from bs4 import BeautifulSoup
import pandas as pd

scrap_list = pd.DataFrame()
for path in range(10): # scroll over the categories
    for path in range(10): # scroll over the pages
        url = 'https://www.samehgroup.com/index.php?route=product/category'+str(page)+'&'+'path='+ str(path)
        req = urllib3.PoolManager()
        res = req.request('GET', URL)
        soup = BeautifulSoup(res.data, 'html.parser')
        soup.findAll('h4', {'class': 'caption'})
        # extract names
        scrap_name = [i.text.strip() for i in soup.findAll('h2', {'class': 'caption'})]
        scrap_list['product_name'] = pd.DataFrame(scrap_name, columns=['Item_name'])
        # extract prices
        scrap_list['product_price'] = [i.text.strip() for i in soup.findAll('div', {'class': 'price'})]
        product_price = pd.DataFrame(scrap_price, columns=['Item_price'])
I want an output that provides me with each product and its price. I still can't get that right.
Any help would be very much appreciated.
I think the problem here was looping through the website pages. I got the code below working by first making a list of urls containing numbered 'paths' corresponding to pages on the website, then looping through this list and applying a page number to each url.
If you only wanted to get the products from a certain page, it can be selected from urlist by index (see the short example after the code below).
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time

urlist = [] # create list of usable urls to iterate through
for i in range(1, 10): # 9 paths equal to pages on website
    urlist.append('https://www.samehgroup.com/index.php?route=product/category&path=' + str(i))

namelist = []
newprice = []

for urlunf in urlist: # first loop to get 'path'
    for n in range(100): # second loop to get 'pages'; set at 100 to cover website max page at 93
        try: # try catches when pages containing products run out
            url = urlunf + '&page=' + str(n)
            page = requests.get(url).text
            soup = BeautifulSoup(page, 'html')
            products = soup.find_all('div', class_='caption')
            for prod in products: # loops over returned list of products for names and prices
                name = prod.find('h4').text
                newp = prod.find('p', class_='price').find('span', class_='price-new').text
                namelist.append(name) # append data to lists outside of loop
                newprice.append(newp)
            time.sleep(2)
        except AttributeError: # if there are no more products it will move to next page
            pass

df = pd.DataFrame() # create df and add scraped data
df['name'] = namelist
df['price'] = newprice
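As mentioned above, to scrape just one category you can pick its url from urlist by index instead of looping over all of them, for example:

single_category = urlist[2]  # the third category path only
# then run only the inner '&page=' loop against single_category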

Python issue for crawling multiple page titles

I am a marketer and want to conduct some basic market research using Python.
I wrote a simple script to crawl titles from multiple pages, but I can't get the title text into a list and transfer it into Excel format. How can I do this?
I tried to create a list and used the extend() method to put the looped titles into the list, but it did not work:
import requests
import pandas as pd
from bs4 import BeautifulSoup

def content_get(url):
    count = 0
    while count < 4: # this case was to crawl titles of 4 pages
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "html.parser")
        titles = soup.find(id="main-container").find_all("div", class_="r-ent")
        for title in titles:
            print([title.find('div', class_='title').text])
        nextpageurl = soup.find("a", string="‹ 上頁")["href"]
        url = "https://www.ptt.cc" + nextpageurl
        count += 1

firstpage = "https://www.ptt.cc/bbs/movie/index9002.html"
content_get(firstpage)
You need to add the titles to a list outside of the while loop:
def content_get(url):
    count = 0
    titles = []
    while count < 4:
        r = requests.get(url)
        soup = BeautifulSoup(r.text)
        title_page = [title.text.replace('\n', '') for title in soup.find_all('div', {'class': 'title'})]
        titles.extend(title_page)
        nextpageurl = soup.find("a", string="‹ 上頁")["href"]
        url = "https://www.ptt.cc" + nextpageurl
        count += 1
    return titles
If you don't want the list comprehension to get title_page, it can be replaced with a traditional for loop:
title_page = []
titles = soup.find_all('div', {'class': 'title'})
for title in titles:
    title_page.append(title.text.replace('\n', ''))
For the Excel file:
def to_excel(text):
    df = pd.DataFrame(text, columns=['Title'])
    return df.to_excel('output.xlsx')
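Putting the two together, a minimal usage sketch (assuming the firstpage url from the question):

firstpage = "https://www.ptt.cc/bbs/movie/index9002.html"
titles = content_get(firstpage)  # collects the titles from 4 pages
to_excel(titles)                 # writes them to output.xlsx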

Using startswith function to filter a list of urls

I have the following piece of code which extracts all links from a page and puts them in a list (links=[]), which is then passed to the function filter_links().
I wish to filter out any links that are not from the same domain as the starting link, aka the first link in the list. This is what I have:
import requests
from bs4 import BeautifulSoup
import re

start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

def filter_links(links):
    filtered_links = []
    for link in links:
        if link.startswith(links[0]):
            filtered_links.append(link)
        return filtered_links

print(filter_links(links))
I have used the built-in startswith function, but it's filtering out everything except the starting url.
Eventually I want to pass several different start urls through this program, so I need a generic way of filtering urls that are within the same domain as the starting url. I think I could use regex, but this function should work too?
Try this:
import requests
from bs4 import BeautifulSoup
import re
import tldextract

start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

def filter_links(links):
    ext = tldextract.extract(start_url)
    domain = ext.domain
    filtered_links = []
    for link in links:
        if domain in link:
            filtered_links.append(link)
    return filtered_links

print(filter_links(links))
Note:
You need to get that return statement out of the for loop. It returns after iterating over just the first element, so only the first item of the list gets returned.
Use the tldextract module to better extract the domain name from the urls (a short illustration follows). If you want to explicitly check whether the links start with links[0], that's up to you.
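For reference, tldextract splits the start url like this (a small illustrative snippet, separate from the script's output shown next; worth verifying against your installed version):

import tldextract

ext = tldextract.extract("http://www.enzymebiosystems.org/")
print(ext.subdomain, ext.domain, ext.suffix)  # www enzymebiosystems org
print(ext.domain)                             # 'enzymebiosystems' is what the filter uses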
Output:
['http://enzymebiosystems.org', 'http://enzymebiosystems.org/', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/recent-developments/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/contact-us/', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/investors-media/news/', 'http://enzymebiosystems.org/investors-media/investor-relations/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/investors-media/stock-information/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/contact-us']
Okay, so you made an indentation error in filter_links(links). The function should be like this:
def filter_links(links):
    filtered_links = []
    for link in links:
        if link.startswith(links[0]):
            filtered_links.append(link)
    return filtered_links
Notice that in your code, you kept the return statement inside the for loop, so the loop executes only once and then returns the list.
Hope this helps :)
Possible Solution
What if you kept all links which 'contain' the domain?
For example:
import pandas as pd

links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

all_links = pd.DataFrame(links, columns=["Links"])
enzyme_df = all_links[all_links.Links.str.contains("enzymebiosystems")]
# results in a dataframe with links containing "enzymebiosystems"
If you want to search multiple domains, see this answer; a rough sketch of the idea is below.
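A minimal sketch of that multi-domain idea (domains here is a hypothetical list of strings to keep; str.contains treats the pattern as a regex, so joining with | matches any of them):

domains = ["enzymebiosystems", "example"]  # hypothetical domains to keep
pattern = "|".join(domains)
multi_df = all_links[all_links.Links.str.contains(pattern)]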

Problems with web scraping (William Hill-UFC Odds)

I'm creating a web scraper that will let me get the odds of upcoming UFC fights on William Hill. I'm using Beautiful Soup but have not yet been able to successfully scrape the needed data. (https://sports.williamhill.com/betting/en-gb/ufc)
I need the fighters' names and their odds.
I've attempted a variety of methods to try to get the data, scraping different tags etc., but nothing happens.
def scrape_data():
    data = requests.get("https://sports.williamhill.com/betting/en-gb/ufc")
    soup = BeautifulSoup(data.text, 'html.parser')
    links = soup.find_all('a', {'class': 'btmarket__name btmarket__name--featured'}, href=True)
    for link in links:
        links.append(link.get('href'))
    for link in links:
        print(f"Now currently scraping link: {link}")
        data = requests.get(link)
        soup = BeautifulSoup(data.text, 'html.parser')
        time.sleep(1)
        fighters = soup.find_all('p', {'class': "btmarket__name"})
        c = fighters[0].text.strip()
        d = fighters[1].text.strip()
        f1.append(c)
        f2.append(d)
        odds = soup.find_all('span', {'class': "betbutton_odds"})
        a = odds[0].text.strip()
        b = odds[1].text.strip()
        f1_odds.append(a)
        f2_odds.append(b)
    return None
I would expect it to be exported to a CSV file. I'm currently using Morph.io to host and run the scraper, but it returns nothing.
If correct, it would output:
Fighter1Name:
Fighter2Name:
F1Odds:
F2Odds:
For every available fight.
Any help would be greatly appreciated.
The html returned has different attributes and values. You need to inspect the response.
For writing out to csv you will want to prepend "'" to the odds values to prevent them being treated as fractions or dates. See the commented-out alternatives in the code below.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get('https://sports.williamhill.com/betting/en-gb/ufc')
soup = bs(r.content, 'lxml')
results = []

for item in soup.select('.btmarket:has([data-odds])'):
    match_name = item.select_one('.btmarket__name[title]')['title']
    odds = [i['data-odds'] for i in item.select('[data-odds]')]
    row = {'event-starttime': item.select_one('[datetime]')['datetime']
           , 'match_name': match_name
           , 'home_name': match_name.split(' vs ')[0]
           #, 'home_odds': "'" + str(odds[0])
           , 'home_odds': odds[0]
           , 'away_name': match_name.split(' vs ')[1]
           , 'away_odds': odds[1]
           #, 'away_odds': "'" + str(odds[1])
           }
    results.append(row)

df = pd.DataFrame(results, columns=['event-starttime', 'match_name', 'home_name', 'home_odds', 'away_name', 'away_odds'])
print(df.head())

# write to csv
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig', index=False)

Web crawler - following links

Please bear with me. I am quite new at Python - but having a lot of fun. I am trying to code a web crawler that crawls through election results from the last referendum in Denmark. I have managed to extract all the relevant links from the main page. And now I want Python to follow each of the 92 links and gather 9 pieces of information from each of those pages. But I am so stuck. Hope you can give me a hint.
Here is my code:
import requests
import urllib2
from bs4 import BeautifulSoup

# This is the original url http://www.kmdvalg.dk/
soup = BeautifulSoup(urllib2.urlopen('http://www.kmdvalg.dk/').read())

my_list = []
all_links = soup.find_all("a")
for link in all_links:
    link2 = link["href"]
    my_list.append(link2)

for i in my_list[1:93]:
    print i

# The output shows all the links that I would like to follow and gather information from. How do I do that?
Here is my solution using lxml. It's similar to BeautifulSoup
import lxml
from lxml import html
import requests

page = requests.get('http://www.kmdvalg.dk/main')
tree = html.fromstring(page.content)
my_list = tree.xpath('//div[@class="LetterGroup"]//a/@href') # grab all links
print 'Length of all links = ', len(my_list)
my_list is a list consisting of all the links. Now you can use a for loop to scrape information inside each page.
We can loop through each link. Inside each page, you can extract information as in the example below. This is only for the top table.
table_information = []
for t in my_list:
    page_detail = requests.get(t)
    tree = html.fromstring(page_detail.content)
    table_key = tree.xpath('//td[@class="statusHeader"]/text()')
    table_value = tree.xpath('//td[@class="statusText"]/text()') + tree.xpath('//td[@class="statusText"]/a/text()')
    table_information.append(zip([t]*len(table_key), table_key, table_value))
For the table lower down the page:
table_information_below = []
for t in my_list:
    page_detail = requests.get(t)
    tree = html.fromstring(page_detail.content)
    l1 = tree.xpath('//tr[@class="tableRowPrimary"]/td[@class="StemmerNu"]/text()')
    l2 = tree.xpath('//tr[@class="tableRowSecondary"]/td[@class="StemmerNu"]/text()')
    table_information_below.append([t]+l1+l2)
Hope this helps!
A simple approach would be to iterate through your list of urls and parse them each individually:
for url in my_list:
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    # then parse each page individually here
Alternatively, you could speed things up significantly using Futures.
from requests_futures.sessions import FuturesSession

def my_parse_function(html):
    """Use this function to parse each page"""
    soup = BeautifulSoup(html)
    all_paragraphs = soup.find_all('p')
    return all_paragraphs

session = FuturesSession(max_workers=5)
futures = [session.get(url) for url in my_list]
# each future resolves to a requests Response; parse its text
page_results = [my_parse_function(future.result().text) for future in futures]
This would be my solution for your problem:
import requests
from bs4 import BeautifulSoup

def spider():
    url = "http://www.kmdvalg.dk/main"
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for link in soup.findAll('div', {'class': 'LetterGroup'}):
        anc = link.find('a')
        href = anc.get('href')
        print(anc.getText())
        print(href)
        # spider2(href) calls a second function from here that is similar to this one (making url equal to href)
        spider2(href)
        print("\n")

def spider2(linktofollow):
    url = linktofollow
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for link in soup.findAll('tr', {'class': 'tableRowPrimary'}):
        anc = link.find('td')
        print(anc.getText())
    print("\n")

spider()
It's not done... I only get a single element from the table, but you get the idea of how it's supposed to work.
Here is my final code that works smoothly. Please let me know if I could have done it smarter!
import urllib2
from bs4 import BeautifulSoup
import codecs

f = codecs.open("eu2015valg.txt", "w", encoding="iso-8859-1")
soup = BeautifulSoup(urllib2.urlopen('http://www.kmdvalg.dk/').read())

liste = []
alle_links = soup.find_all("a")
for link in alle_links:
    link2 = link["href"]
    liste.append(link2)

for url in liste[1:93]:
    soup = BeautifulSoup(urllib2.urlopen(url).read().decode('iso-8859-1'))
    tds = soup.findAll('td')
    stemmernu = soup.findAll('td', class_='StemmerNu')
    print >> f, tds[5].string,";",tds[12].string,";",tds[14].string,";",tds[16].string,";", stemmernu[0].string,";",stemmernu[1].string,";",stemmernu[2].string,";",stemmernu[3].string,";",stemmernu[6].string,";",stemmernu[8].string,";",'\r\n'

f.close()
