I'm trying to extract information from a single web page that contains multiple similarly structured records. The information is contained within div tags with different classes (I'm interested in the username, main text, and date). Here is the code I use:
import bs4 as bs
import urllib.request  # Python 3; in Python 2 this was urllib.urlopen
import pandas as pd

href = 'https://example.ru/'
sauce = urllib.request.urlopen(href).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

user = pd.Series(soup.find_all('div', class_='Username'))
main_text = pd.Series(soup.find_all('div', class_='MainText'))
date = pd.Series(soup.find_all('div', class_='Date'))

result = pd.concat([user, main_text, date], axis=1)
The problem is that I receive the information with all the tags, while I want only the text. Surprisingly, the .text attribute doesn't work on what find_all returns, so now I'm completely out of ideas.
Thank you for any help!
A list comprehension is the way to go. To get all the text within MainText, for example, try:
[elem.text for elem in soup.find_all('div', class_='MainText')]
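To expand on that: .text is an attribute of a single Tag, not of the ResultSet that find_all returns, which is why it has to be applied per element. A minimal sketch of your full script with that change (reusing the class names and URL from your question) could look like this:

import bs4 as bs
import urllib.request
import pandas as pd

href = 'https://example.ru/'
sauce = urllib.request.urlopen(href).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

# Extract the text per element inside each comprehension.
user = [elem.text for elem in soup.find_all('div', class_='Username')]
main_text = [elem.text for elem in soup.find_all('div', class_='MainText')]
date = [elem.text for elem in soup.find_all('div', class_='Date')]

# pd.concat on Series tolerates lists of unequal length, padding with NaN.
result = pd.concat([pd.Series(user, name='user'),
                    pd.Series(main_text, name='text'),
                    pd.Series(date, name='date')], axis=1)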
I'm trying to get a table that is located inside multiple nested tags. I'm new to BeautifulSoup and have practiced some simple examples.
The issue is that I can't understand why my code can't get the div tag that has the class "Explorer is-embed", because from that point I can go deeper to get to the tbody where all the data I want to scrape is located.
Thanks for your help in advance.
Below is my code:
url = "https://ourworldindata.org/covid-cases"
url_content = requests.get(url)
soup = BeautifulSoup(url_content.text, "lxml")
########################
div1 = soup3.body.find_all("div", attrs={"class":"content-wrapper"})
div2 = div1[0].find_all("div", attrs={"class":"offset-content"})
sections = div2[0].find_all('section')
figure = sections[1].find_all("figure")
div3 = figure[0].find_all("div")
div4 = div3[0].find_all("div")
Here is a snapshot of the "div" tag that I'm not getting.
The data is dynamically loaded, so it isn't present in the HTML that requests receives. Instead, grab the public source CSV (other formats are available):
https://ourworldindata.org/coronavirus-source-data
import pandas as pd
df = pd.read_csv('https://covid.ourworldindata.org/data/owid-covid-data.csv')
df.head()
The values you see in the Daily new confirmed COVID-19 cases (per 1M) table are calculated from the same data as in that file for the two dates being compared.
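For instance, a minimal sketch of looking up one of those per-1M figures from the CSV (this assumes the column names location, date, and new_cases_per_million, which the file has carried; verify with df.columns in case they change):

import pandas as pd

df = pd.read_csv('https://covid.ourworldindata.org/data/owid-covid-data.csv')

# Daily new confirmed cases per million for one country on one date.
# The column names used here are assumptions; check df.columns first.
mask = (df['location'] == 'Canada') & (df['date'] == '2021-03-01')
print(df.loc[mask, 'new_cases_per_million'])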
I am trying to scrape used car listing prices and names, excluding those posted by a dealership. I am having trouble because I would like to put this in a DataFrame using pandas, but can only do so once I get the right information. Here is the code.
from bs4 import BeautifulSoup as bs4
import requests
import csv
import pandas as pd
import numpy as np
pages_to_scrape = 2
pages = []
prices = []
names = []

for i in range(1, pages_to_scrape + 1):
    url = 'https://www.kijiji.ca/b-cars-trucks/ottawa/used/page-{}/c174l1700185a49'.format(i)
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = bs4(page.text, 'html.parser')
    for k in soup.findAll('div', class_='price'):
        if k.find(class_='dealer-logo'):
            continue
        else:
            price = k.getText()
            prices.append(price.strip())
My code up to here works as intended, since 'dealer-logo' is a child of 'price'. However, I am having trouble making this work for the names, as the 'title' class is within 'info-container', where 'price' is also found.
As such, abc = soup.find('a', {'class': 'title'}) returns only the first element of the page, when I want it to iterate through every listing that does not have 'dealer-logo' in it; findAll obviously wouldn't work, as it would give every element, and findNext gives me a NoneType.
for l in soup.findAll('div', class_='info-container'):
    if l.findAll(class_='dealer-logo'):
        continue
    else:
        abc = soup.find('a', {'class': 'title'})
        name = abc.getText()
        names.append(name.strip())

print(names)
print(prices)
Below is the code I am scraping. I want to ignore all instances where 'dealer-logo' is present, and get the price and title for the listing and add it to a list.
With bs4 4.7.1+ you can use the :not and :has CSS pseudo-classes to filter out the logo items. Select for a parent node, then select the two target child items in a comprehension, grouping them as tuples, and convert to a DataFrame with pandas.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
r = requests.get('https://www.kijiji.ca/b-cars-trucks/ottawa/used/c174l1700185a49')
soup = bs(r.content, 'lxml')
df = pd.DataFrame((i.select_one('a.title').text.strip(), i.select_one('.price').text.strip())
                  for i in soup.select('.info-container:not(:has(.dealer-logo))')
                  if 'wanted' not in i.select_one('a.title').text.lower())
df
N.B.
It seems that at times you get slightly more results than are shown on the page.
I think you can likely also filter out the wanted ads in the CSS, rather than with the if above:
df = pd.DataFrame((i.select_one('a.title').text.strip(), i.select_one('.price').text.strip())
                  for i in soup.select('div:not(div.regular-ad) > .info-container:not(:has(.dealer-logo))'))
I'm trying to scrape links with contextual information from the following page: https://www.reddit.com/r/anime/wiki/discussion_archive/2018. I'm able to get the links just fine using BS4 with Python, but ideally each link would also have the year, season, title, and episode associated with it.
I've started with the code below, but don't know how to loop through it to capture things in sections for each season/title:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
link = 'https://www.reddit.com/r/anime/wiki/discussion_archive/2018'
request_2018 = session.get(link, headers={'User-agent': 'Chrome'})
soup = BeautifulSoup(request_2018.content, 'lxml')
data_table = soup.find('div', class_='md wiki')
Is this something that's doable with BS4? Thanks for your help!
EDIT
criteria = {'class': 'md wiki'}  # so it can be reused later
data_soup = soup.find('div', criteria)
titles = data_soup.find_all('strong')
tables = data_soup.find_all('table')
Try the following:
titles = soup.find('div', {'class':'md wiki'}).find_all('strong')
data_tables = soup.find('div', {'class':'md wiki'}).find_all('table')
It's better to put the second argument of find into a dict; find_all will return all elements that match your search.
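To actually associate each season/title heading with its table, one approach is to pair the headings with the tables in document order. This is only a sketch: it assumes each strong heading in the wiki body is followed by exactly one table, which you'd want to verify against the page structure.

import requests
from bs4 import BeautifulSoup

session = requests.Session()
link = 'https://www.reddit.com/r/anime/wiki/discussion_archive/2018'
request_2018 = session.get(link, headers={'User-agent': 'Chrome'})
soup = BeautifulSoup(request_2018.content, 'lxml')

wiki = soup.find('div', {'class': 'md wiki'})
titles = wiki.find_all('strong')
tables = wiki.find_all('table')

# Assumes headings and tables appear in matching order; verify before use.
for title, table in zip(titles, tables):
    for a in table.find_all('a', href=True):
        print(title.text, a.text, a['href'])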
I'm using beautifulsoup4 to parse a webpage and collect all the href values using this code:
import requests
from bs4 import BeautifulSoup

# Collect links from 'new' page
pageRequest = requests.get('http://www.supremenewyork.com/shop/all/shirts')
soup = BeautifulSoup(pageRequest.content, "html.parser")
links = soup.select("div.turbolink_scroller a")

allProductInfo = soup.find_all("a", class_="name-link")
print(allProductInfo)

linksList1 = []
for href in allProductInfo:
    linksList1.append(href.get('href'))
print(linksList1)
linksList1 prints two of each link. I believe this is happening because it's taking the link from the title as well as from the item colour. I have tried a few things but cannot get BS to parse only the title link and give a list with one of each link instead of two. I imagine it's something really simple, but I'm missing it. Thanks in advance.
This code will give you the result without duplicates (also, using set() may be a good idea, as @Tarum Gupta suggested), but I changed the way you crawl:
import requests
from bs4 import BeautifulSoup
#Collect links from 'new' page
pageRequest = requests.get('http://www.supremenewyork.com/shop/all/shirts')
soup = BeautifulSoup(pageRequest.content, "html.parser")
links = soup.select("div.turbolink_scroller a")
# Gets all divs with class inner-article, then searches for an <a> with
# class name-link that is inside an h1 tag
allProductInfo = soup.select("div.inner-article h1 a.name-link")
# print (allProductInfo)
linksList1 = []
for href in allProductInfo:
linksList1.append(href.get('href'))
print(linksList1)
alldiv = soup.findAll("div", {"class": "inner-article"})
for div in alldiv:
    linksList1.append(div.h1.a['href'])

set(linksList1)        # use set() to remove duplicate links
list(set(linksList1))  # use list() to convert the set back to a list if you need one
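Note that set() does not preserve order. If the original page order matters, a common standard-library idiom (relying on dicts keeping insertion order, guaranteed since Python 3.7) is:

# De-duplicate while keeping the first occurrence of each link.
deduped = list(dict.fromkeys(linksList1))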
I am using Python. I have been trying to web-scrape the names, team images, and colleges of NBA draft prospects. However, when I scrape for the names of the colleges I get both the college page and the college name. How do I get it so that I only see the colleges? I have tried adding .string and .text to the end of anchor (anchor.string).
import urllib2
from BeautifulSoup import BeautifulSoup
# or if you're using BeautifulSoup4:
# from bs4 import BeautifulSoup
list = []
soup = BeautifulSoup(
    urllib2.urlopen('http://www.cbssports.com/nba/draft/mock-draft').read()
)
rows = soup.findAll("table",
attrs = {'class':'data borderTop'})[0].tbody.findAll("tr")[2:]
for row in rows:
fields = row.findAll("td")
if len(fields) >= 3:
anchor = row.findAll("td")[2].findAll("a")[1:]
if anchor:
print anchor
Instead of just:
print anchor
use:
print anchor[0].text
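If you want all the colleges collected in one list rather than printed row by row, a small sketch along the same lines as the question's loop (same assumptions about the table layout) might be:

colleges = []
for row in rows:
    fields = row.findAll("td")
    if len(fields) >= 3:
        anchor = row.findAll("td")[2].findAll("a")[1:]
        if anchor:
            # .text gives just the displayed text of the anchor, without the tag
            colleges.append(anchor[0].text)
print(colleges)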
The format of an anchor in HTML is <a href='web_address'>Text-that-is-displayed</a>, so unless there's already a fancy HTML parser library (I'd bet there is, I just don't know of any), you'll likely need some kind of regular expression to parse out the part of the anchor that you want.