I'm having difficulty scraping specific tables from Wikipedia. Here is my code.
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl = 'https://en.wikipedia.org/wiki/List_of_towns_in_India_by_population'
table_class = "wikitable sortable jquery-tablesorter"
response = requests.get(wikiurl)
print(response.status_code)
soup = BeautifulSoup(response.text, 'html.parser')
cities = soup.find('table', {"class":"wikitable sortable jquery-tablesorter"})
df = pd.read_html(str(cities))
df=pd.DataFrame(df[0])
print(df.to_string())
The class is taken from the attributes inside the table tag when you inspect the page (I'm using Edge as a browser). Changing the index (df[0]) causes it to say the index is out of range.
Is there a unique identifier in the Wikipedia source code for each table? I would like a solution, but I'd really like to know where I'm going wrong too, as I feel I'm close and understand this.
I think your main difficulty was in extracting the HTML that corresponds to your class. "wikitable sortable jquery-tablesorter" is actually three separate classes, not one class name. The third one, jquery-tablesorter, appears to be added by JavaScript in the browser, so it is not in the HTML that requests downloads (note the class="wikitable sortable" in the output below), and searching for the full three-class string finds nothing. Repeating the "class" key in a dictionary does not help either, because Python keeps only the last value for a duplicated key. The code below uses a CSS selector to match on the two classes that are present.
Hopefully this should help:
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl = 'https://en.wikipedia.org/wiki/List_of_towns_in_India_by_population'
response = requests.get(wikiurl)
print(response.status_code)
# 200
soup = BeautifulSoup(response.text, 'html.parser')
# match tables that carry both classes, in any order
cities = soup.select('table.wikitable.sortable')
print(cities[0])
# <table class="wikitable sortable">
# <tbody><tr>
# <th>Name of Town
# </th>
# <th>State
# ....
tables = pd.read_html(str(cities[0]))
print(tables[0])
# Name of Town State ... Population (2011) Ref
# 0 Achhnera Uttar Pradesh ... 22781 NaN
# 1 Adalaj Gujarat ... 11957 NaN
# 2 Adoor Kerala ... 29171 NaN
# ....
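To see how the class lookup behaves, here is a minimal sketch on a made-up HTML snippet (the three tables and the variable names are assumptions for illustration, not Wikipedia's actual markup). It shows that a dict literal with a repeated key keeps only the last value, and that chaining classes in a CSS selector requires all of them:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
html = """
<table class="wikitable sortable"><tr><td>towns</td></tr></table>
<table class="sortable"><tr><td>other</td></tr></table>
<table class="wikitable"><tr><td>plain</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# A repeated key in a dict literal silently keeps only the last value.
attrs = {"class": "wikitable", "class": "sortable"}
print(attrs)

# So this filter only checks for the "sortable" class.
by_dict = soup.find_all("table", attrs)
print(len(by_dict))

# Chained classes in a CSS selector require all of them, in any order.
by_selector = soup.select("table.wikitable.sortable")
print(len(by_selector))

# The full browser-side class string matches nothing in this markup.
missing = soup.find("table", {"class": "wikitable sortable jquery-tablesorter"})
print(missing)
```

In this sketch the dictionary filter returns two tables while the selector returns only the one that carries both classes.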
For a simpler solution, pandas alone is enough; there is no need to call requests and BeautifulSoup yourself (pandas still needs an HTML parser such as lxml installed).
import pandas as pd
wikiurl = 'https://en.wikipedia.org/wiki/List_of_towns_in_India_by_population'
tables = pd.read_html(wikiurl)
Here, tables is a list of DataFrames, one per table found on the page. Select the one you need by index, e.g. tables[0].
Don't parse the HTML directly. Use the provided API by MediaWiki as shown here: https://www.mediawiki.org/wiki/API:Get_the_contents_of_a_page
In your case, I would use Method 2 (the Parse API) with the following URL: https://en.wikipedia.org/w/api.php?action=parse&page=List_of_towns_in_India_by_population&prop=text&formatversion=2&format=json
Process the result accordingly. You might still need to use BeautifulSoup to extract the HTML table and its content.
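As a rough sketch of that approach, the snippet below builds the same Parse API URL and pulls the HTML string out of a decoded response. The helper names build_parse_url and extract_html are made up for this example, and the sample payload is a heavily shortened, hypothetical response; it assumes that with formatversion=2 the HTML is a plain string under parse.text:

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_parse_url(page):
    """Return the Parse API URL that asks for a page's rendered HTML as JSON."""
    params = {
        "action": "parse",
        "page": page,
        "prop": "text",
        "formatversion": 2,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def extract_html(payload):
    """Pull the HTML string out of a decoded Parse API response."""
    if "error" in payload:
        raise ValueError(payload["error"].get("info", "API error"))
    return payload["parse"]["text"]


url = build_parse_url("List_of_towns_in_India_by_population")
print(url)

# Hypothetical, shortened response body (what response.json() might hold).
sample = {
    "parse": {
        "title": "Example",
        "text": "<table class='wikitable sortable'></table>",
    }
}
html = extract_html(sample)
print(html)
```

In real use you would fetch url with requests.get(url).json() and hand the returned HTML to BeautifulSoup or pandas as in the other answers.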
I want to delete advertisement text from scraped data, but after I decompose it I get an error saying
list index out of range
I think it's because after decompose there is a blank space or something. Without decompose the loop works OK.
import requests
from bs4 import BeautifulSoup
url = 'https://www.marketbeat.com/insider-trades/ceo-share-buys-and-sales/'
companyName = 'title-area'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
table = soup.find_all('table')[0].tbody.find_all('tr')
# delete advertisement
soup.find("tr", class_="bottom-sort").decompose()
for el in table:
    print(el.find_all('td')[0].text)
You can use tag.extract() to delete the tag. Also, delete the tag before you find all <tr> tags:
import requests
from bs4 import BeautifulSoup
url = "https://www.marketbeat.com/insider-trades/ceo-share-buys-and-sales/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# delete advertisement
for tr in soup.select("tr.bottom-sort"):
    tr.extract()
table = soup.find_all("table")[0].tbody.find_all("tr")
for el in table:
    print(el.find_all("td")[0].text)
Prints:
...
TZOOTravelzoo
NEOGNeogen Co.
RKTRocket Companies, Inc.
FINWFinWise Bancorp
WMPNWilliam Penn Bancorporation
There is nothing wrong with using decompose(); you only have to pay attention to the order of the steps:
# first delete advertisement
soup.find("tr", class_="bottom-sort").decompose()
# then select the table rows
table = soup.find_all('table')[0].tbody.find_all('tr')
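A likely explanation for the error is that a row list collected before the removal still holds a reference to the destroyed advertisement row, which no longer has any td cells to index. Here is a small sketch of the safe order on made-up markup (the company names and the shape of the advertisement row are assumptions):

```python
from bs4 import BeautifulSoup

# Hypothetical table: two data rows followed by an advertisement row
# that has no <td> cells.
html = """
<table><tbody>
<tr><td>AAA Corp</td><td>Buy</td></tr>
<tr><td>BBB Inc</td><td>Sell</td></tr>
<tr class="bottom-sort"><th>Advertisement</th></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

# 1. Remove every advertisement row first.
for ad_row in soup.select("tr.bottom-sort"):
    ad_row.decompose()

# 2. Only then collect the rows that are left.
rows = soup.find_all("table")[0].tbody.find_all("tr")
names = [row.find_all("td")[0].text for row in rows]
print(names)
```

Looping over select() also avoids an AttributeError when a page has no advertisement row at all, since find() would return None in that case.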
I'm trying to scrape some data from here: https://www.reuters.com/companies/AMPF.MI/financials/income-statement-quarterly.
I'd like to get the dates in the first row (i.e. 31-Mar-21 31-Dec-20 30-Sep-20 30-Jun-20 31-Mar-20).
The problem comes when I try to get the date, with bs4 it outputs nothing. I wrote this code:
url = "https://www.reuters.com/companies/AMPF.MI/financials/income-statement-quarterly"
html_content = requests.get(url).text
soup = BeautifulSoup (html_content, "lxml")
a = soup.find('div', attrs = {"class": "tables-container"})
date = a.find("time").text;
When I execute it, I get nothing. Printing a shows that the time tag found by find() contains no date:
<th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg"></time>
Thanks.
The data is embedded within the page in JSON form. Here is an example of how to parse it:
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.reuters.com/companies/AMPF.MI/financials/income-statement-quarterly"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one("#__NEXT_DATA__").contents[0])
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
x = data["props"]["initialState"]["markets"]["financials"]["financial_tables"]
headers = x["income_interim_tables"][0]["headers"]
print(*headers, sep="\n")
Prints:
2021-03-31
2020-12-31
2020-09-30
2020-06-30
2020-03-31
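The same idea can be tried offline. The sketch below uses a made-up page in which the time tag is empty in the static HTML while the values sit in a JSON script block; the key names inside the JSON are assumptions and do not mirror the real Reuters structure:

```python
import json

from bs4 import BeautifulSoup

# Hypothetical page: the visible <time> tag is empty, the data lives in JSON.
html = """
<html><body>
<table><tr><th><time></time></th></tr></table>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"tables": [{"headers": ["2021-03-31", "2020-12-31"]}]}}
</script>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# The tag exists, but it has no text until JavaScript fills it in.
empty_text = soup.find("time").text
print(repr(empty_text))

# Read the embedded JSON instead of the rendered tags.
script = soup.select_one("#__NEXT_DATA__")
data = json.loads(script.string)
headers = data["props"]["tables"][0]["headers"]
print(headers)
```

When exploring a real page, printing json.dumps(data, indent=4) is a practical way to find the path to the values you need.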
As I do not have enough reputation to comment:
The problem is that the scraped HTML does not contain the dates. The time tags are empty.
You need a way to scrape while pre-rendering the JavaScript which fills in the dates. This is a different topic which requires some headless browser or other approaches, e.g. https://www.scrapingbee.com/blog/scrapy-javascript/
https://www.worldometers.info/coronavirus/#countries is the website I'm using. I'm trying to pull the table shown with the All tab selected from the HTML into my Jupyter notebook. The problem I seem to be having is that if I use class = 'table', it pulls all the continent tabs first and then the All table, which messes up how my data comes in when I look at the rows.
import requests
import lxml.html as lh
import pandas as pd
import csv
import requests
from bs4 import BeautifulSoup
url = 'https://www.worldometers.info/coronavirus/#countries'
page = requests.get(url)
print(page.status_code) #Checking the http response status code. Should be 200
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
all_tables=soup.find_all("table")
right_table = soup.find('table',{'class':'table'})
col_headers = [th.getText() for th in right_table.findAll('th')]
data = [[td.getText() for td in right_table.findAll('td')] for tr in right_table()]
When I try to combine the col_headers and data, it says 13 columns passed, data had 2990 columns. Any guidance would be appreciated.
You have "flattened" the table - created a list of all <td>s. What you need to do is to create a nested list:
data = [ [ td.text for td in tr.find_all("td") ] for tr in right_table.find_all("tr")]
df = pd.DataFrame(data, columns=col_headers)
print(df.shape) # (231, 13)
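To make the difference between the flattened and the nested list concrete, here is a small sketch on a made-up three-column table (the countries and numbers are placeholders, not worldometers data):

```python
from bs4 import BeautifulSoup

# Hypothetical table with one header row and two data rows.
html = """
<table class="table">
<thead><tr><th>Country</th><th>Cases</th><th>Deaths</th></tr></thead>
<tbody>
<tr><td>A</td><td>10</td><td>1</td></tr>
<tr><td>B</td><td>20</td><td>2</td></tr>
</tbody>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table", {"class": "table"})

col_headers = [th.get_text(strip=True) for th in table.find_all("th")]

# Flattened: one long list holding every cell of the table.
flat = [td.get_text(strip=True) for td in table.find_all("td")]

# Nested: one inner list per <tr>; the header row has no <td> and is skipped.
rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

print(len(flat))
print(rows)
```

Each inner list now has as many entries as there are headers, so pd.DataFrame(rows, columns=col_headers) lines up without the column-count error.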
I am trying to download the data on this website
https://coinmunity.co/
...in order to manipulate it later in Python or Pandas.
I tried to load it directly into Pandas via Requests, but it did not work. This is the code I used:
res = requests.get("https://coinmunity.co/")
soup = BeautifulSoup(res.content, 'lxml')
table = soup.find_all('table')[0]
dfm = pd.read_html(str(table), header = 0)
dfm = dfm[0].dropna(axis=0, thresh=4)
dfm.head()
In most of the things I tried, I could only get to the info in the headers, which seems to be the only table seen in this page by the code.
Seeing that this did not work, I tried to do the same scraping with Requests and BeautifulSoup, but it did not work either. This is my code:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://coinmunity.co/")
soup = BeautifulSoup(res.content, 'lxml')
#table = soup.find_all('table')[0]
#table = soup.find_all('div', {'class':'inner-container'})
#table = soup.find_all('tbody', {'class':'_ngcontent-c0'})
#table = soup.find_all('table')[0].findAll('tr')
#table = soup.find_all('table')[0].find('tbody')#.find_all('tbody _ngcontent-c3=""')
table = soup.find_all('p', {'class':'stats change positiveSubscribers'})
You can see in the lines commented, all the things I have tried, but nothing worked.
Is there any way to download that table for use in Pandas/Python in the tidiest, easiest and quickest possible way?
Thank you
Since the content is loaded dynamically after the initial request is made, you won't be able to scrape this data with requests alone. Here's what I would do instead:
from selenium import webdriver
import pandas as pd
import time
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
driver.implicitly_wait(10)
driver.get("https://coinmunity.co/")
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, 'lxml')
results = []
for row in soup.find_all('tr')[2:]:
    data = row.find_all('td')
    name = data[1].find('a').text
    value = data[2].find('p').text
    # get the rest of the data you need about each coin here, then add it to the dictionary that you append to results
    results.append({'name':name, 'value':value})
df = pd.DataFrame(results)
df.head()
name value
0 NULS 14,005
1 VEN 84,486
2 EDO 20,052
3 CLUB 1,996
4 HSR 8,433
You will need to make sure that geckodriver is installed and that it is in your PATH. I just scraped the name of each coin and the value but getting the rest of the information should be easy.
I am writing a Python script using BeautifulSoup to scrape values from this webpage: https://uk-air.defra.gov.uk/latest/currentlevels
I want to use soup.find() to get values for "Hourly mean Nitrogen dioxide" and "Last updated" from the table row where the "Monitoring site" is "Edinburgh St Leonards".
As I am new to web scraping I am having a bit of trouble so would be grateful for any help on this.
Scrape all the HTML tables into a list of tables.
The table layout may change, so you should not rely on a hard-coded row or column index.
Part of the following script looks up the column index of the data you are searching for. It also prints the header name, so you know which data you are getting.
from bs4 import BeautifulSoup
import urllib.request
import re
with urllib.request.urlopen('https://uk-air.defra.gov.uk/latest/currentlevels?view=region') as response:
    htmlData = response.read()
soup = BeautifulSoup(htmlData, 'html5lib')
tables = soup.find_all('table', attrs={'class':'current_levels_table'})
# what you want to check:
Iwant = ['nitrogen', 'update']
about = 'Edinburgh'
for table in tables:
    # maps column number -> header name for the columns we are looking for
    index = {}
    # read the header to find the wanted column numbers and their real names
    table_head = table.find('thead')
    headrows = table_head.find_all('tr')
    measures = headrows[1].find_all('th')
    for colnum, measure in enumerate(measures):
        index.update({colnum: measure.text.strip() for wanted in Iwant if re.search(wanted, measure.text, re.IGNORECASE)})
    # get table content and look for Edinburgh
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        cels = row.find_all('td')
        rowContent = [cel.text.strip().replace(u'\xa0', u' ').replace(u'\n Timeseries Graph', u'') for cel in cels if cel]
        # skip rows without cells, then match the site name case-insensitively
        if rowContent and re.search(about, rowContent[0], re.IGNORECASE):
            for indexwanted, measurewanted in index.items():
                print(measurewanted, ':', rowContent[indexwanted])
Making use of the suggestion from d2718nis, you can do it in this way. Of course, many other ways would work too.
First, find the link that has the 'Edinburgh St Leonards' text in it. Then find the grandparent of that link element, which is a tr element. Now identify the td elements in the tr. When you examine the table you see that the columns you want are the 4th and 7th. Get those from all of the td elements as the (0-relative) 3rd and 6th. Finally, display the crude texts of these elements.
You will need to do something clever to extract properly readable strings from these results.
>>> import requests
>>> import bs4
>>> page = requests.get('https://uk-air.defra.gov.uk/latest/currentlevels', headers={'User-Agent': 'Not blank'}).content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> Edinburgh_link = soup.find_all('a',string='Edinburgh St Leonards')[0]
>>> Edinburgh_link
Edinburgh St Leonards
>>> Edinburgh_row = Edinburgh_link.findParent('td').findParent('tr')
>>> Edinburgh_columns = Edinburgh_row.findAll('td')
>>> Edinburgh_columns[3]
<td class="center"><span class="bg_low1 bold">20 (1 Low)</span></td>
>>> Edinburgh_columns[6]
<td>05/08/2017<br/>14:00:00</td>
>>> Edinburgh_columns[3].text
'20\xa0(1\xa0Low)'
>>> Edinburgh_columns[6].text
'05/08/201714:00:00'
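One possible way to turn those crude texts into readable values is sketched below. It reuses the two cells shown in the session above (with &nbsp; standing in for the \xa0 characters) and assumes the date is in day/month/year order, which is a guess based on this being a UK site:

```python
from datetime import datetime

from bs4 import BeautifulSoup

# The two cells from the session above, wrapped in a minimal table.
row_html = """
<table><tr>
<td class="center"><span class="bg_low1 bold">20&nbsp;(1&nbsp;Low)</span></td>
<td>05/08/2017<br/>14:00:00</td>
</tr></table>
"""
level_cell, updated_cell = BeautifulSoup(row_html, "html.parser").find_all("td")

# Replace the non-breaking spaces with ordinary spaces.
level = level_cell.get_text(strip=True).replace("\xa0", " ")

# Join the text on either side of <br/> with a space instead of nothing.
updated_text = updated_cell.get_text(separator=" ", strip=True)

# Assumption: day/month/year order.
updated = datetime.strptime(updated_text, "%d/%m/%Y %H:%M:%S")

print(level)
print(updated_text)
print(updated.isoformat())
```

With the live page, Edinburgh_columns[3] and Edinburgh_columns[6] would take the place of level_cell and updated_cell.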
You can start with this:
import requests
from bs4 import BeautifulSoup
# Request the page, set headers to prevent 403 Forbidden
page = requests.get(
    url='https://uk-air.defra.gov.uk/latest/currentlevels',
    headers={'User-Agent': 'Not blank'})
# Get html from page
html = page.text
# BeautifulSoup object
soup = BeautifulSoup(html, 'html5lib')
for table in soup.find_all('table'):
    # Print all tables on the page
    print(table)