I'm doing scraping Indonesian news website from here. When I'm scraped the news articles from each news links, there is some HTML element on it. The output like this:
I want to remove the elements so the output is just the article. I already use .strip() but still doesn't affect the output. This is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
detik = requests.get('https://www.detik.com/terpopuler')
beautify = BeautifulSoup(detik.content, 'html5lib')
news = beautify.find_all('article', {'class','list-content__item'})
arti = []
for each in news:
try:
title = each.find('h3', {'class','media__title'}).text
lnk = each.a.get('href')
r = requests.get(lnk)
soup = BeautifulSoup(r.text, 'html5lib')
content = soup.find('div', {'class', 'detail__body-text itp_bodycontent'}).text.strip()
print(title)
print(lnk)
arti.append({
'Headline': title,
'Content':content,
'Link': lnk
})
except:
continue
df = pd.DataFrame(arti)
df.to_csv('detik.csv', index=False)
Any help would be appreciated
You might be dealing with invalid tags. This thread might be useful:
https://stackoverflow.com/a/8439761/6100602
Related
Just started learning python (3.8), building a scraper to get some football stats. Here's the code so far.
I originally wanted to pull a div with id = 'div_alphabet' which is clearly in the html tree on the website, but for some reason bs4 wasn't pulling it in. I investigated further and noticed that when I pull in the parent div 'all_alphabet' and then look for all child divs, 'div_alphabet' is missing. The only thing weird about the html structure is the long block comment that sits right above 'div_alphabet'. Is this a potential issue?
https://www.pro-football-reference.com/players
import requests
from bs4 import BeautifulSoup
URL = 'https://www.pro-football-reference.com/'
homepage = requests.get(URL)
home_soup = BeautifulSoup(homepage.content, 'html.parser')
players_nav_URL = home_soup.find(id='header_players').a['href']
players_directory_page = requests.get(URL + players_nav_URL)
players_directory_soup = BeautifulSoup(players_directory_page.content, 'html.parser')
alphabet_nav = players_directory_soup.find(id='all_alphabet')
all_letters = alphabet_nav.find_all('div')
print(all_letters)
links = [a['href'] for a in players_directory_soup.select('ul.page_index li div a')]
names = [a.get_text() for a in players_directory_soup.select('ul.page_index li div a')]
This gives you a list and names of all the relative links of alphabetised players.
I wouldn't concern yourself with the div_alphabet it doesn't have any useful information.
Here we are selecting the ul tag with class "page_index". But you'll get a list, so we need to do a for loop and grab the href attribute. The get_text() also gives you the names.
If you haven't come across list comprehensions then this would also be acceptable.
links = []
for a in players_directory_soup.select('ul.page_index li div a'):
links.append(a['href'])
names = []
for a in players_directory_soup.select('ul.page_index li div a'):
names.append(a.get_text())
Something like this cod will make it:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 '}
r = requests.get('https://www.pro-football-reference.com/players/', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
data = soup.select('ul.page_index li div')
for link in data:
print(*[f'{a.get("href")}\n' for a in link.select('a')])
A more useful way to do this is to make a DataFrame with pandas of it and save it as a csv or something:
import requests
from bs4 import BeautifulSoup
import pandas as pd
players = []
headers = {'User-Agent': 'Mozilla/5.0 '}
r = requests.get('https://www.pro-football-reference.com/players/', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
data = soup.select('ul.page_index li div a')
for link in data:
players.append([link.get_text(strip=True), 'https://www.pro-football-reference.com' + link.get('href')])
print(players[0])
df = pd.DataFrame(players, columns=['Player name', 'Url'])
print(df.head())
df.to_csv('players.csv', index=False)
I'm trying to scrape reviews from TrustPilot, but the code always return with blank sheets and the headers/categories I specified. Could someone help me with this?
from bs4 import BeautifulSoup, SoupStrainer
import pandas as pd
driver= webdriver.Chrome()
names=[] #List to store name of the product
headers=[] #List to store price of the product
bodies=[]
ratings=[] #List to store rating of the product
dates=[]
#driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://www.trustpilot.com/review/birchbox.com?page=2")
content = driver.page_source
soup = BeautifulSoup(content, "html.parser", parse_only=SoupStrainer('a'))
for a in soup.findAll('a', href=True, attrs={'class':'reviews-container'}):
name=a.find('div', attrs={'class':'consumer-information_name'})
header=a.find('div', attrs={'class':'review-content_title'})
body=a.find('div', attrs={'class':'review-content_text'})
rating=a.find('div', attrs={'class':'star-rating star-rating--medium'})
date=a.find('div', attrs={'class':'review-date--tooltip-target'})
names.append(name.text)
headers.append(header.text)
bodies.append(body.text)
ratings.append(rating.text)
dates.append(date.text)
print ('webpage, no errors')
df = pd.DataFrame({'User Name':names,'Header':headers,'Body':bodies,'Rating':ratings,'Date':dates})
df.to_csv('reviews02.csv', index=False, encoding='utf-8')
print ('csv made')```
The issue is soup.findAll('a', href=True, attrs={'class':'reviews-container'}) is not finding any results, so there are 0 iterations in the loop. Make sure you are using the correct tags and class names. Also you don't need to use a loop because BeautifulSoup has a find_all method. I used the requests module to open the web page, though it shouldn't make a difference.
from bs4 import BeautifulSoup
import requests
req = requests.get("https://www.trustpilot.com/review/birchbox.com?page=2")
content = req.content
soup = BeautifulSoup(content, "lxml")
names = soup.find_all('div', attrs={'class': 'consumer-information__name'})
headers = soup.find_all('h2', attrs={'class':'review-content__title'})
bodies = soup.find_all('p', attrs={'class':'review-content__text'})
ratings = soup.find_all('div', attrs={'class':'star-rating star-rating--medium'})
dates = soup.find_all('div', attrs={'class':'review-content-header__dates'})
And now each list has 20 entries.
when i run this code it gives me an empty bracket. Im new to web scraping so i dont know what im doing wrong.
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=laptop'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
container = soup.findAll('li', {'class': 's-result-item celwidget '})
#btw the space is also there in the html code
print(container)
results:
[]
What i tried is to grab the html code from the site, and to soup trough the li tags where all the information is stored so I can print out all the information in a for loop.
Also if someone wants to explain how to use BeautifulSoup we can always talk.
Thank you guys.
So a working code that grabs product and price would could look something like this.
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=laptop'
r = requests.get(url, headers={'User-Agent': 'Mozilla Firefox'})
soup = BeautifulSoup(r.text, 'html.parser')
container = soup.findAll('li', {'class': 's-result-item celwidget '})
for cont in container:
h2 = cont.h2.text.strip()
# Amazon lists prices in two ways. If one fails, use the other
try:
currency = cont.find('sup', {'class': 'sx-price-currency'}).text.strip()
price = currency + cont.find('span', {'class': 'sx-price-whole'}).text.strip()
except:
price = cont.find('span', {'class': 'a-size-base a-color-base'})
print('Product: {}, Price: {}'.format(h2, price))
Let me know if that helps you further...
I want to write a script to get a home page's links to social media (twitter / facebook mostly), and I'm completely stuck since I am fairly new to Python.
The task I want to accomplish is to parse the website, find the social media links, and save it in a new data frame where each column would contain the original URL, the twitter link, and the facebook link. Here's what I have so far of this code for the new york times website:
from bs4 import BeautifulSoup
import requests
url = "http://www.nytimes.com"
r = requests.get(url)
sm_sites = ['twitter.com','facebook.com']
soup = BeautifulSoup(r.content, 'html5lib')
all_links = soup.find_all('a', href = True)
for site in sm_sites:
if all(site in sm_sites for link in all_links):
print(site)
else:
print('no link')
I'm having some problems understanding what the loop is doing, or how to make it work for what I need it to. I also had tried to store the site instead of doing print(site) but that was not working... So I figured I'd ask for help. Before asking, I went through a bunch of responses here but none could get me to do what I needed to do.
the way this code works, you already have your links. Your homepage link is the starting url, so http://www.nytimes.com.
And you have the social media urls sm_sites = ['twitter.com','facebook.com'], all you're doing is confirming they exist on the main page. If you want to save the list of confirmed social media urls, then append them to a list
Here is one way to get the social media links off a page
import requests
from bs4 import BeautifulSoup
url = "https://stackoverflow.com/questions/tagged/python"
r = requests.get(url)
sm_sites = ['twitter.com','facebook.com']
sm_sites_present = []
soup = BeautifulSoup(r.content, 'html5lib')
all_links = soup.find_all('a', href = True)
for sm_site in sm_sites:
for link in all_links:
if sm_site in link.attrs['href']:
sm_sites_present.append(link.attrs['href'])
print(sm_sites_present)
output:
['https://twitter.com/stackoverflow', 'https://www.facebook.com/officialstackoverflow/']
Update
for a df of urls
import requests
import pandas as pd
from bs4 import BeautifulSoup
from IPython.display import display
urls = [
"https://stackoverflow.com/questions/tagged/python",
"https://www.nytimes.com/",
"https://en.wikipedia.org/"
]
sm_sites = ['twitter.com','facebook.com']
sm_sites_present = []
columns = ['url'] + sm_sites
df = pd.DataFrame(data={'url' : urls}, columns=columns)
def get_sm(row):
r = requests.get(row['url'])
output = pd.Series()
soup = BeautifulSoup(r.content, 'html5lib')
all_links = soup.find_all('a', href = True)
for sm_site in sm_sites:
for link in all_links:
if sm_site in link.attrs['href']:
output[sm_site] = link.attrs['href']
return output
sm_columns = df.apply(get_sm, axis=1)
df.update(sm_columns)
df.fillna(value='no link')
output
This will do what you want with regards to adding it to a DataFrame. You can iterate through a list of websites (urlsToSearch), adding a row to the dataframe for each one containing the base website, all facebook links, and all twitter links.
from bs4 import BeautifulSoup
import requests
import pandas as pd
df = pd.DataFrame(columns=["Website", "Facebook", "Twitter"])
urlsToSearch = ["http://www.nytimes.com","http://www.businessinsider.com/"]
for url in urlsToSearch:
r = requests.get(url)
tw_links = []
fb_links = []
soup = BeautifulSoup(r.text, 'html.parser')
all_links = [link['href'] for link in soup.find_all('a', href = True)] #only get href
for link in all_links:
if "twitter.com" in link:
tw_links.append(link)
elif "facebook.com" in link:
fb_links.append(link)
df.loc[df.shape[0]] = [url,fb_links,tw_links] #Add row to end of df
Please bear with me. I am quite new at Python - but having a lot of fun. I am trying to code a web crawler that crawls through election results from the last referendum in Denmark. I have managed to extract all the relevant links from the main page. And now I want Python to follow each of the 92 links and gather 9 pieces of information from each of those pages. But I am so stuck. Hope you can give me a hint.
Here is my code:
import requests
import urllib2
from bs4 import BeautifulSoup
# This is the original url http://www.kmdvalg.dk/
soup = BeautifulSoup(urllib2.urlopen('http://www.kmdvalg.dk/').read())
my_list = []
all_links = soup.find_all("a")
for link in all_links:
link2 = link["href"]
my_list.append(link2)
for i in my_list[1:93]:
print i
# The output shows all the links that I would like to follow and gather information from. How do I do that?
Here is my solution using lxml. It's similar to BeautifulSoup
import lxml
from lxml import html
import requests
page = requests.get('http://www.kmdvalg.dk/main')
tree = html.fromstring(page.content)
my_list = tree.xpath('//div[#class="LetterGroup"]//a/#href') # grab all link
print 'Length of all links = ', len(my_list)
my_list is a list consist of all links. And now you can use for loop to scrape information inside each page.
We can for loop through each links. Inside each page, you can extract information as example. This is only for the top table.
table_information = []
for t in my_list:
page_detail = requests.get(t)
tree = html.fromstring(page_detail.content)
table_key = tree.xpath('//td[#class="statusHeader"]/text()')
table_value = tree.xpath('//td[#class="statusText"]/text()') + tree.xpath('//td[#class="statusText"]/a/text()')
table_information.append(zip([t]*len(table_key), table_key, table_value))
For table below the page,
table_information_below = []
for t in my_list:
page_detail = requests.get(t)
tree = html.fromstring(page_detail.content)
l1 = tree.xpath('//tr[#class="tableRowPrimary"]/td[#class="StemmerNu"]/text()')
l2 = tree.xpath('//tr[#class="tableRowSecondary"]/td[#class="StemmerNu"]/text()')
table_information_below.append([t]+l1+l2)
Hope this help!
A simple approach would be to iterate through your list of urls and parse them each individually:
for url in my_list:
soup = BeautifulSoup(urllib2.urlopen(url).read())
# then parse each page individually here
Alternatively, you could speed things up significantly using Futures.
from requests_futures.sessions import FuturesSession
def my_parse_function(html):
"""Use this function to parse each page"""
soup = BeautifulSoup(html)
all_paragraphs = soup.find_all('p')
return all_paragraphs
session = FuturesSession(max_workers=5)
futures = [session.get(url) for url in my_list]
page_results = [my_parse_function(future.result()) for future in results]
This would be my solution for your problem
import requests
from bs4 import BeautifulSoup
def spider():
url = "http://www.kmdvalg.dk/main"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')
for link in soup.findAll('div', {'class': 'LetterGroup'}):
anc = link.find('a')
href = anc.get('href')
print(anc.getText())
print(href)
# spider2(href) call a second function from here that is similar to this one(making url = to herf)
spider2(href)
print("\n")
def spider2(linktofollow):
url = linktofollow
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')
for link in soup.findAll('tr', {'class': 'tableRowPrimary'}):
anc = link.find('td')
print(anc.getText())
print("\n")
spider()
its not done... i only get a simple element from the table but you get the idea and how its supposed to work.
Here is my final code that works smooth. Please let me know if I could have done it smarter!
import urllib2
from bs4 import BeautifulSoup
import codecs
f = codecs.open("eu2015valg.txt", "w", encoding="iso-8859-1")
soup = BeautifulSoup(urllib2.urlopen('http://www.kmdvalg.dk/').read())
liste = []
alle_links = soup.find_all("a")
for link in alle_links:
link2 = link["href"]
liste.append(link2)
for url in liste[1:93]:
soup = BeautifulSoup(urllib2.urlopen(url).read().decode('iso-8859-1'))
tds = soup.findAll('td')
stemmernu = soup.findAll('td', class_='StemmerNu')
print >> f, tds[5].string,";",tds[12].string,";",tds[14].string,";",tds[16].string,";", stemmernu[0].string,";",stemmernu[1].string,";",stemmernu[2].string,";",stemmernu[3].string,";",stemmernu[6].string,";",stemmernu[8].string,";",'\r\n'
f.close()