Python BeautifulSoup parsing / crawling a table

For my own interest, I want to crawl the table of properties from "https://thinkimmo.com/search?noReset=true". After clicking on "TABELLE" (TABLE), you can see all properties listed in a table.
With the following code I am able to open the table view:
driver.get("https://thinkimmo.com/search?noReset=true")
driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[2]/div/div[2]/div/div[2]/div/div/div/div[1]/div/div/button[2]/span[1]').click()
Now I am able to crawl some parts of the table with the following code:
soup = BeautifulSoup(driver.page_source, 'html.parser')
htmltable = soup.find('table', { 'class' : 'MuiTable-root' })
def tableDataText(table):
    rows = []
    trs = table.find_all('tr')
    headerow = [td.get_text(strip=True) for td in trs[0].find_all('th')]  # header row
    if headerow:  # if there is a header row, include it first
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs:  # for every table row
        rows.append([td.get_text(strip=True) for td in tr.find_all('td')])  # data row
    return rows
list_table = tableDataText(htmltable)
list_table
The result, however, is not what I expect: I only get the first 7 headings, and all the other headings are missing.
After taking a closer look at the HTML of the page, I am still not sure how to get all the headings and rows of the table.
I would like to solve the problem of only getting part of the header, and I am even more interested in why my approach fails.
What I see in the result of table = soup.find("table") is that the table closes right after the 7th heading.
Thanks in advance.
Steffen

The site uses a backend API whose query parameters you can edit to bulk-download data:
import requests
import pandas as pd
results = 1000
url = f'https://api.thinkimmo.com/immo?active=true&type=APARTMENTBUY&sortBy=publishDate,desc&from=0&size={str(results)}&grossReturnAnd=false&allowUnknown=false&excludePlatforms=ebk,immowelt&favorite=false&noReset=true&excludedFields=true&geoSearches=[]&averageAggregation=buyingPrice%3BpricePerSqm%3BsquareMeter%3BconstructionYear%3BrentPrice%3BrentPricePerSqm%3BrentPricePerSqm%3BrunningTime&termsAggregation=platforms.name.keyword,60'
resp = requests.get(url).json()
df = pd.DataFrame(resp['results'])
df.to_csv('thinkimmo.csv',index=False)
print('Saved to thinkimmo.csv')
This is a lot of loosely structured data, but it should help. If you want to inspect what this API call returns and only keep certain parts of the JSON, open your browser's Developer Tools - Network - Fetch/XHR and reload the page to see all the backend requests fire. You are looking for the one whose URL starts with "immo?"; look at its Payload and Preview tabs to see all the data. That is what we are scraping above.
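For example, once the response is in a DataFrame you can keep just the fields you care about. This is only a sketch; the column names below (title, buyingPrice, squareMeter) are assumptions, so check the Preview tab for the real keys:
import pandas as pd

# Sketch: keep only a few fields from the API response.
# NOTE: 'title', 'buyingPrice' and 'squareMeter' are assumed field names;
# check the Preview tab in DevTools for the real keys.
wanted = ['title', 'buyingPrice', 'squareMeter']
available = [c for c in wanted if c in df.columns]  # avoid a KeyError if a field is missing
df_small = df[available]
df_small.to_csv('thinkimmo_small.csv', index=False)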

Related

How do I get the data from a table of google spreadsheet using requests in python?

I am doing a project that asks me to obtain 3 databases from Google Sheets using the requests library (I do the processing afterwards). The problem is that when I send a GET request to the URL and apply ".text", ".json" or ".content", it gives me the content of the entire spreadsheet page, but I want the row and column values. Any ideas?
Here are the spreadsheets:
https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/edit#gid=1691373423
https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/edit
https://docs.google.com/spreadsheets/d/1udwn61l_FZsFsEuU8CMVkvU2SpwPW3Krt1OML3cYMYk/edit
The best way of getting the data from a Google spreadsheet in Python is by using gspread, which is a Python API for Google Sheets.
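A minimal gspread sketch, assuming you have created a Google service account, downloaded its key to credentials.json (a placeholder path), and shared the sheet with that account:
import gspread

# Sketch only: 'credentials.json' is a placeholder for your service-account key file,
# and the sheet must be shared with the service account's email address.
gc = gspread.service_account(filename='credentials.json')
sh = gc.open_by_url('https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/edit')
ws = sh.get_worksheet(0)      # first worksheet; use sh.worksheet('name') for a specific tab
rows = ws.get_all_values()    # list of rows, each a list of cell values as strings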
However, there are alternatives if you aren't the owner of the spreadsheet (or you just want to do it another way as an exercise). For instance, you can do it with the requests and bs4 modules, as shown in this answer.
Applied to your specific case, the code would look like this (for the "Datos Argentina - Salas de Cine" spreadsheet):
import typing
import requests
from bs4 import BeautifulSoup

def scrapeDataFromSpreadsheet() -> typing.List[typing.List[str]]:
    html = requests.get('https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/edit#gid=1691373423').text
    soup = BeautifulSoup(html, 'lxml')
    salas_cine = soup.find_all('table')[0]
    rows = [[td.text for td in row.find_all("td")] for row in salas_cine.find_all('tr')]
    return rows
Important note: with the link provided (and the code above) you will only be able to get the first 100 rows of data!
This can be fixed in more than one way. What I've tried is modifying the url of the spreadsheet to display the data as a simple html table (reference).
Old url: https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/edit#gid=1691373423
New url: (remove edit#gid=1691373423 and add gviz/tq?tqx=out:html&tq&gid=1) https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/gviz/tq?tqx=out:html&tq&gid=1
Now you are able to obtain all the rows that the spreadsheet contains:
def scrapeDataFromSpreadsheet() -> typing.List[typing.List[str]]:
    html = requests.get('https://docs.google.com/spreadsheets/u/0/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/gviz/tq?tqx=out:html&tq&gid=1').text
    soup = BeautifulSoup(html, 'lxml')
    salas_cine = soup.find_all('table')[0]
    rows = [[td.text for td in row.find_all("td")] for row in salas_cine.find_all('tr')]
    return rows

Parsing data from Wikipedia table

I want to parse data from a Wikipedia table and turn it into a pandas DataFrame.
https://en.wikipedia.org/wiki/MIUI
There is a table called 'Version history'.
So far I have written the following code, but I still can't get the data:
wiki='https://en.wikipedia.org/wiki/MIUI'
table_class = 'wikitable sortable mw-collapsible mw-no-collapsible jquery-tablesorter mw-made-collapsible'
response = requests.get(wiki)
soup = BeautifulSoup(response.text,'html.parser')
miui_v = soup.find('table', attrs={'class': table_class})
In the HTML I downloaded, the table you are searching for has a different class:
class="wikitable mw-collapsible mw-made-collapsible"
I guess it can change depending on the browser and its extensions; classes like jquery-tablesorter are added by JavaScript after the page loads, so they are not in the HTML that requests receives. I recommend starting from an element that has an id to guarantee a match. In your case you can do:
miui_v = soup.find("div", {"id": "mw-content-text"})
my_table = miui_v.findChildren("div")[0].findChildren("table")[1]
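If the goal is to end up with a pandas DataFrame, one possible follow-up (a sketch, not part of the original answer) is to pass the matched table to pandas.read_html, which parses the rows and headers for you:
import pandas as pd

# Sketch: convert the matched <table> element to a DataFrame.
# pd.read_html parses the HTML string and returns a list of DataFrames.
df = pd.read_html(str(my_table))[0]
print(df.head())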

BeautifulSoup - Can't get tbody

I'm trying to get a table that is nested several levels deep.
I'm new to BeautifulSoup and have only practiced some simple examples.
The issue is that I can't understand why my code can't get the "div" tag that has the class "Explorer is-embed".
From that point, I could go deeper to reach the tbody where all the data I want to scrape is located.
Thanks for your help in advance.
Below is my code:
url = "https://ourworldindata.org/covid-cases"
url_content = requests.get(url)
soup = BeautifulSoup(url_content.text, "lxml")
########################
div1 = soup.body.find_all("div", attrs={"class":"content-wrapper"})
div2 = div1[0].find_all("div", attrs={"class":"offset-content"})
sections = div2[0].find_all('section')
figure = sections[1].find_all("figure")
div3 = figure[0].find_all("div")
div4 = div3[0].find_all("div")
Here is a snapshot of the "div" tag that I'm not getting.
The data is dynamically loaded. Instead, grab the public source CSV (other formats are available):
https://ourworldindata.org/coronavirus-source-data
import pandas as pd
df = pd.read_csv('https://covid.ourworldindata.org/data/owid-covid-data.csv')
df.head()
The values you see in the "Daily new confirmed COVID-19 cases (per 1M)" table are calculated from the same data as in that file for the two dates being compared.
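For example, a rough sketch of that calculation from the CSV, assuming the columns location, date, new_cases and population (check the file's header row for the exact names):
import pandas as pd

# Sketch: approximate "daily new confirmed cases per 1M" from the source CSV.
# Assumes the columns 'location', 'date', 'new_cases' and 'population' exist.
df = pd.read_csv('https://covid.ourworldindata.org/data/owid-covid-data.csv')
germany = df[df['location'] == 'Germany'].copy()
germany['new_cases_per_million'] = germany['new_cases'] / germany['population'] * 1_000_000
print(germany[['date', 'new_cases', 'new_cases_per_million']].tail())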

Beautiful Soup AssertionError

I am trying to scrape this website into a .CSV and I am getting an error that says: AssertionError: 9 columns passed, passed data had 30 columns. My code is below; it is a little messy because I exported it from a Jupyter Notebook.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://apps.azsos.gov/apps/election/cfs/search/CandidateSearch.aspx'
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
soup = BeautifulSoup(html)
type(soup) # we see that soup is a BeautifulSoup object
column_headers = [th.getText() for th in
                  soup.findAll('tr', limit=2)[1].findAll('th')]
column_headers # our column headers
data_rows = soup.findAll('th')[2:] # skip the first 2 header rows
type(data_rows) # now we have a list of table rows
candidate_data = [[td.getText() for td in data_rows[i].findAll('td')]
                  for i in range(len(data_rows))]
df = pd.DataFrame(candidate_data, columns=column_headers)
df.head() # head() lets us see the 1st 5 rows of our DataFrame by default
df.to_csv(r'C:/Dev/Sheets/Candiate_Search.csv', encoding='utf-8', index=False)
The data on the page definitely has a table, and you parse out the column headers and pass them to your CSV. Visually that table has 8 columns, but you parse 9 headers. At this point you should probably check your data to see what you've found - it might not be what you expect. But okay, you go and check, you see that one of them is a spacer column that will be empty or garbage, and you proceed.
These lines:
data_rows = soup.findAll('th')[2:] # skip the first 2 header rows
type(data_rows) # now we have a list of table rows
candidate_data = [[td.getText() for td in data_rows[i].findAll('td')]
                  for i in range(len(data_rows))]
find every <th> instance in the page and then every <td> inside each <th>, and that's where it really goes off the rails. I am guessing you are not a web developer, but tables and their sub-elements (rows aka <tr>, headers aka <th>, and cells aka <td>) are used all over most pages for organizing tons of visual elements and also sometimes for organizing tabular data.
Guess what? You found a lot of tables that are not this visual table because you were searching the whole page for <th> elements.
I'd suggest you narrow your search from the entire soup by first finding the <table> or <div> that contains only the tabular data you're interested in, and then searching within that scope.
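A rough sketch of that scoping idea; the header text 'Candidate' is an assumption used only to locate the right table, so adjust it after inspecting the page:
# Sketch of scoping the search: find one candidate <table>, then only
# look at rows/cells inside it. The 'Candidate' header check is an assumption.
tables = soup.findAll('table')
results_table = None
for t in tables:
    headers = [th.getText(strip=True) for th in t.findAll('th')]
    if 'Candidate' in headers:   # assumed header text; adjust after inspecting the page
        results_table = t
        break

if results_table is not None:
    data = [[td.getText(strip=True) for td in tr.findAll('td')]
            for tr in results_table.findAll('tr')]
    data = [row for row in data if row]   # drop header/empty rows that have no <td>
    df = pd.DataFrame(data)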

Wikipedia table scraping using python

I am trying to scrape tables from Wikipedia. I wrote a table scraper that downloads a table and saves it as a pandas DataFrame.
This is the code:
from bs4 import BeautifulSoup
import pandas as pd
import urllib2
headers = { 'User-Agent' : 'Mozilla/5.0' }
req = urllib2.Request('https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population', None, headers)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html, 'lxml') # Parse the HTML as a string
print soup
# Create an object of the first object
table = soup.find("table", {"class":"wikitable sortable jquery-tablesorter"})
print table
rank=[]
country=[]
pop=[]
date=[]
per=[]
source=[]
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    col1 = col[0].string.strip()
    rank.append(col1)
    col2 = col[1].string.strip()
    country.append(col2)
    col3 = col[2].string.strip()
    pop.append(col3)
    col4 = col[3].string.strip()
    date.append(col4)
    col5 = col[4].string.strip()
    per.append(col5)
    col6 = col[5].string.strip()
    source.append(col6)
columns={'Rank':rank,'Country':country,'Population':pop,'Date':date,'Percentage':per,'Source':source}
# Create a dataframe from the columns variable
df = pd.DataFrame(columns)
df
But it is not downloading the table. The problem is in this section:
table = soup.find("table", {"class":"wikitable sortable jquery-tablesorter"})
print table
where the output is None.
As far as I can see, there is no such element on that page. The main table has class="wikitable sortable" but not jquery-tablesorter; that class is added by JavaScript in the browser, so it is not present in the HTML your program downloads.
Make sure you know what element you are trying to select, check that your program sees the same elements you see, and only then write your selector.
The docs say you can specify multiple classes like so:
soup.find("table", class_="wikitable sortable jquery-tablesorter")
Also, consider using requests instead of urllib2.
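Putting those suggestions together, a sketch of the scraper on Python 3 with requests; the class is matched as it appears in the downloaded HTML, and the use of pandas.read_html is an extra assumption, not part of the original code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
html = requests.get(url, headers=headers).text

soup = BeautifulSoup(html, 'lxml')
# Match the class as it exists in the raw HTML (no jquery-tablesorter).
table = soup.find("table", {"class": "wikitable"})

# Let pandas parse the matched table into a DataFrame.
df = pd.read_html(str(table))[0]
print(df.head())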
