I want to parse data from a Wikipedia table and turn it into a pandas DataFrame.
https://en.wikipedia.org/wiki/MIUI
There is a table called 'Version history'.
So far I have written the following code, but I still can't get the data:
import requests
from bs4 import BeautifulSoup

wiki = 'https://en.wikipedia.org/wiki/MIUI'
table_class = 'wikitable sortable mw-collapsible mw-no-collapsible jquery-tablesorter mw-made-collapsible'
response = requests.get(wiki)
soup = BeautifulSoup(response.text, 'html.parser')
miui_v = soup.find('table', attrs={'class': table_class})
In the HTML I downloaded, the table you are searching for has a different class:
class="wikitable mw-collapsible mw-made-collapsible"
I guess it can change depending on the browser and its extensions. I recommend starting from an element that has an id, to guarantee a match. In your case you can do:
miui_v = soup.find("div", {"id": "mw-content-text"})
# the first child div holds the article body; the version-history table is its second table
my_table = miui_v.findChildren("div")[0].findChildren("table")[1]
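If the end goal is a DataFrame, you can hand the matched table straight to pandas. A minimal sketch, assuming the version-history table is still the second table in the article body (the [1] index below):

import requests
import pandas as pd
from bs4 import BeautifulSoup

wiki = 'https://en.wikipedia.org/wiki/MIUI'
soup = BeautifulSoup(requests.get(wiki).text, 'html.parser')
content = soup.find("div", {"id": "mw-content-text"})
table = content.findChildren("div")[0].findChildren("table")[1]

# pd.read_html parses the table's HTML into a list of DataFrames
df = pd.read_html(str(table))[0]
print(df.head())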
Related
So I am trying to scrape the following webpage: https://www.omscentral.com/
The main table there is my item of interest. I want to scrape the table and all of its content. When I inspect the page, the table is in a table tag, so I figured it would be easy to access with the code below.
import requests
from bs4 import BeautifulSoup

url = 'https://www.omscentral.com/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
soup.find_all('table')
However, that code only returns the table header. I saw a similar example here, but the solution of switching the parser did not work.
When I look at the soup object itself, it seems that requests does not expand the table and only captures the header. Not too sure what to do here; any advice would be much appreciated!
The content is stored in a script tag and rendered dynamically, so you have to extract the data from there:
data = json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['courses']
To display it as a DataFrame, simply use:
pd.DataFrame(data)
Example
import requests, json
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.omscentral.com/'
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')

data = json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['courses']

for item in data:
    print(item['name'], item.get('officialURL'))
Output
Introduction to Information Security https://omscs.gatech.edu/cs-6035-introduction-to-information-security
Computing for Good https://omscs.gatech.edu/cs-6150-computing-good
Introduction to Operating Systems https://omscs.gatech.edu/cs-6200-introduction-operating-systems
Advanced Operating Systems https://omscs.gatech.edu/cs-6210-advanced-operating-systems
Secure Computer Systems https://omscs.gatech.edu/cs-6238-secure-computer-systems
Computer Networks https://omscs.gatech.edu/cs-6250-computer-networks
...
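To work with the same data as a DataFrame rather than printed lines, a short sketch (the column names name and officialURL are taken from the output above; pick whichever fields you need):

import pandas as pd

df = pd.DataFrame(data)
print(df[['name', 'officialURL']].head())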
For my own interest, I want to crawl the table of properties from https://thinkimmo.com/search?noReset=true. After clicking on "TABELLE" (TABLE), you can see all properties listed in a table.
With the following code I am able to see the table:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://thinkimmo.com/search?noReset=true")
driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[2]/div/div[2]/div/div[2]/div/div/div/div[1]/div/div/button[2]/span[1]').click()
Now I am able to crawl some parts of the table with the following code:
soup = BeautifulSoup(driver.page_source, 'html.parser')
htmltable = soup.find('table', { 'class' : 'MuiTable-root' })
def tableDataText(table):
    rows = []
    trs = table.find_all('tr')
    headerow = [td.get_text(strip=True) for td in trs[0].find_all('th')]  # header row
    if headerow:  # if there is a header row, include it first
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs:  # for every table row
        rows.append([td.get_text(strip=True) for td in tr.find_all('td')])  # data row
    return rows

list_table = tableDataText(htmltable)
list_table
The result, however, is not what I expect: I only get the first 7 headings, and all other headings are not returned.
After a closer look at the HTML of the webpage, I am not sure how to get all headings and results of the table.
I am looking forward to solving the problem of getting only some parts of the heading, and more specifically I am interested in why I am failing.
What I see in the result of table = soup.find("table") is that the table closes after the 7th heading title.
Thanks in advance.
Steffen
The site uses a backend API whose query parameters you can edit to bulk-download the data:
import requests
import pandas as pd
results = 1000
url = f'https://api.thinkimmo.com/immo?active=true&type=APARTMENTBUY&sortBy=publishDate,desc&from=0&size={str(results)}&grossReturnAnd=false&allowUnknown=false&excludePlatforms=ebk,immowelt&favorite=false&noReset=true&excludedFields=true&geoSearches=[]&averageAggregation=buyingPrice%3BpricePerSqm%3BsquareMeter%3BconstructionYear%3BrentPrice%3BrentPricePerSqm%3BrentPricePerSqm%3BrunningTime&termsAggregation=platforms.name.keyword,60'
resp = requests.get(url).json()
df = pd.DataFrame(resp['results'])
df.to_csv('thinkimmo.csv',index=False)
print('Saved to thinkimmo.csv')
This is a lot of unstructured data, but it should help. If you want to inspect what is in this API call and only keep certain parts of the returned JSON, open your browser's developer tools, go to Network > Fetch/XHR, and reload the page to see all the backend requests fire. You are looking for one that starts with "immo?"; take a look at its Payload and Preview to see all the data. That's what we are scraping above.
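If you only want a few columns, you can trim the JSON before building the DataFrame. A minimal sketch; the field names in keep are assumptions based on the aggregation parameters in the URL, so check the Preview tab for the real keys:

import requests
import pandas as pd

url = 'https://api.thinkimmo.com/immo?active=true&type=APARTMENTBUY&sortBy=publishDate,desc&from=0&size=100'
resp = requests.get(url).json()

# keep only the fields of interest from each result (key names are assumptions; verify in devtools)
keep = ['buyingPrice', 'squareMeter', 'constructionYear', 'rentPrice']
rows = [{k: item.get(k) for k in keep} for item in resp['results']]
df = pd.DataFrame(rows)
print(df.head())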
I'm trying to get a table that is nested inside multiple elements.
I'm new to BeautifulSoup and I have practiced some simple examples.
The issue is that I can't understand why my code can't get the "div" tag that has the class "Explorer is-embed", because from that point I could go deeper to get to the tbody where all the data I want to scrape is located.
Thanks for your help in advance.
Below is my code:
url = "https://ourworldindata.org/covid-cases"
url_content = requests.get(url)
soup = BeautifulSoup(url_content.text, "lxml")
########################
div1 = soup3.body.find_all("div", attrs={"class":"content-wrapper"})
div2 = div1[0].find_all("div", attrs={"class":"offset-content"})
sections = div2[0].find_all('section')
figure = sections[1].find_all("figure")
div3 = figure[0].find_all("div")
div4 = div3[0].find_all("div")
Here is a snapshot of the "div" tag that I'm not getting:
[screenshot omitted]
Data is dynamically loaded. Instead, grab the public source CSV (other formats are available):
https://ourworldindata.org/coronavirus-source-data
import pandas as pd
df = pd.read_csv('https://covid.ourworldindata.org/data/owid-covid-data.csv')
df.head()
The values you see in the "Daily new confirmed COVID-19 cases (per 1M)" table are calculated from the same data as in that file, for the two dates being compared.
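As a rough illustration of reading that file with pandas (column names such as location and new_cases_per_million follow the CSV's published schema; a minimal sketch):

import pandas as pd

df = pd.read_csv('https://covid.ourworldindata.org/data/owid-covid-data.csv')

# daily new confirmed cases per million for one country
germany = df[df['location'] == 'Germany'][['date', 'new_cases', 'new_cases_per_million']]
print(germany.tail())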
I am trying to extract the table, but my code does not seem to be working, as it returns None.
I wanted to extract it with XPath, but I have no knowledge of XPath and am only a little familiar with BeautifulSoup. How can I extract this table and save it to CSV?
The website I am using is: https://training.gov.au/Organisation/Details/31102
import requests
from bs4 import BeautifulSoup
url = 'https://training.gov.au/Organisation/Details/31102'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find('tabel', id = 'ScopeQualification')
print(table)
First, note that soup.find('tabel', ...) searches for a <tabel> tag, a typo for 'table', so it will always return None. Beyond that, if you're trying to extract the values of that table, the easiest way is pandas.
Here's a cheat sheet for it so you can get exactly what you want:
https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
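A minimal sketch with pd.read_html, assuming the table is present in the static HTML and keeps the ScopeQualification id from your snippet (some sites also require a browser-like User-Agent):

import requests
import pandas as pd

url = 'https://training.gov.au/Organisation/Details/31102'
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text

# read_html returns a list of DataFrames, one per matching table
tables = pd.read_html(html, attrs={'id': 'ScopeQualification'})
tables[0].to_csv('scope.csv', index=False)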
I am trying to scrape tables from wikipedia. I wrote a table scraper that downloads a table and saves it as a pandas data frame.
This is the code
from bs4 import BeautifulSoup
import pandas as pd
import urllib2
headers = { 'User-Agent' : 'Mozilla/5.0' }
req = urllib2.Request('https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population', None, headers)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html, 'lxml') # Parse the HTML as a string
print soup
# Grab the first matching table
table = soup.find("table", {"class":"wikitable sortable jquery-tablesorter"})
print table
rank=[]
country=[]
pop=[]
date=[]
per=[]
source=[]
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    col1 = col[0].string.strip()
    rank.append(col1)
    col2 = col[1].string.strip()
    country.append(col2)
    col3 = col[2].string.strip()
    pop.append(col3)
    col4 = col[3].string.strip()
    date.append(col4)
    col5 = col[4].string.strip()
    per.append(col5)
    col6 = col[5].string.strip()
    source.append(col6)
columns={'Rank':rank,'Country':country,'Population':pop,'Date':date,'Percentage':per,'Source':source}
# Create a dataframe from the columns variable
df = pd.DataFrame(columns)
df
But it is not downloading the table. The problem is in this section:
table = soup.find("table", {"class":"wikitable sortable jquery-tablesorter"})
print table
where the output is None.
As far as I can see, there is no such element on that page. The main table has class="wikitable sortable" but not jquery-tablesorter.
Make sure you know which element you are trying to select, and check that your program sees the same elements you see in the browser; then write your selector.
The docs say you can specify multiple classes like so:
soup.find("table", class_="wikitable sortable jquery-tablesorter")
Note, though, that jquery-tablesorter is added by JavaScript in the browser, so the HTML you download will only carry class="wikitable sortable". Also, consider using requests instead of urllib2.
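Putting both answers together, a minimal Python 3 sketch (the CSS selector matches the two server-rendered classes regardless of order; the page's layout may have changed since):

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text
soup = BeautifulSoup(html, 'lxml')

# jquery-tablesorter is added client-side, so match only the classes present in the raw HTML
table = soup.select_one('table.wikitable.sortable')
df = pd.read_html(str(table))[0]
print(df.head())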