I'm trying to loop over a few hundred pages of a site, grab Buddhist quotes, and save them into a dataframe. I've mostly got the code working, but I'm struggling to parse some of the text appropriately. Each page I'm scraping has 5 quotes and, from what I can tell in the HTML output, no obvious identifier for each one. So I've attempted to loop over what I scrape from each page, but it's either overwriting all previous quotes (i.e. quotes 1-4) or just grouping them all together into a single cell.
See set-up and code below:
# For data handling:
import pandas as pd
# Set Pandas output options
pd.set_option('display.max_colwidth', None)
# For the scrape:
from bs4 import BeautifulSoup as BShtml
import urllib.request as ur
# Make empty dataframe
emptydata = pd.DataFrame({"quote":[], "quote_date":[], "page_no":[]})
# Populate dataframe with quotes for first three pages
for i in range(1, 4):
    url = "https://www.sgi-usa.org/tag/to-my-friends/page/" + str(i)
    r = ur.urlopen(url).read()
    soup = BShtml(r, "html.parser")
    new_result = pd.DataFrame({
        "quote":[soup.find_all("div", class_="post-content")],
        "quote_date":[soup.find_all("div", class_="post-date")],
        "page_no": [str(i)]
    })
    emptydata = emptydata.append(new_result)
emptydata
As you can see from the attached image, this bundles all 5 quotes into a single cell and makes a new row for each page. Any thoughts on how I can split these up so I have one row per quote and date? I tried looping over soup.find_all("div", class_="post-content"), but I figure I must have been constructing the dataframe incorrectly, as that overwrote all but the last quote on each page.
[Image: what my dataframe currently looks like]
Thanks in advance! Chris
How to fix?
Add a second for loop that iterates over each quote container on the page, so that every quote gets its own row:
for post in soup.find_all("div", class_="quote-inner"):
    new_result = pd.DataFrame({
        "quote":[post.find("div", class_="post-content").get_text(strip=True)],
        "quote_date":[post.find_all("div", class_="post-date")[1].get_text()],
        "page_no": [str(i)]
    })
Example
# For data handling:
import pandas as pd
# Set Pandas output options
pd.set_option('display.max_colwidth', None)
# For the scrape:
from bs4 import BeautifulSoup as BShtml
import urllib.request as ur
# Make empty dataframe
emptydata = pd.DataFrame({"quote":[], "quote_date":[], "page_no":[]})
# Populate dataframe with quotes for first three pages
for i in range(1, 4):
    url = "https://www.sgi-usa.org/tag/to-my-friends/page/" + str(i)
    r = ur.urlopen(url).read()
    soup = BShtml(r, "html.parser")
    for post in soup.find_all("div", class_="quote-inner"):
        new_result = pd.DataFrame({
            "quote":[post.find("div", class_="post-content").get_text(strip=True)],
            "quote_date":[post.find_all("div", class_="post-date")[1].get_text()],
            "page_no": [str(i)]
        })
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        emptydata = pd.concat([emptydata, new_result], ignore_index=True)
emptydata
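As a variation on the same idea (a minimal sketch, not part of the original answer), you can collect one dictionary per quote in a plain list and build the dataframe once at the end, which avoids growing a dataframe inside the loop. SAMPLE_HTML and extract_rows are hypothetical stand-ins so the snippet runs without a network call; the class names are the ones used above, and the real page markup may differ.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for one scraped page (assumption: two "post-date"
# divs per quote, with the second one holding the date, as in the answer).
SAMPLE_HTML = """
<div class="quote-inner">
  <div class="post-date">label</div>
  <div class="post-content"> First quote </div>
  <div class="post-date">June 1, 2022</div>
</div>
<div class="quote-inner">
  <div class="post-date">label</div>
  <div class="post-content"> Second quote </div>
  <div class="post-date">June 2, 2022</div>
</div>
"""


def extract_rows(html, page_no):
    """Return one dict per quote container found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for post in soup.find_all("div", class_="quote-inner"):
        rows.append({
            "quote": post.find("div", class_="post-content").get_text(strip=True),
            "quote_date": post.find_all("div", class_="post-date")[1].get_text(strip=True),
            "page_no": str(page_no),
        })
    return rows


all_rows = []
for page_no in range(1, 3):
    # In a real scrape, fetch the page here instead of reusing SAMPLE_HTML.
    all_rows.extend(extract_rows(SAMPLE_HTML, page_no))

# Build the dataframe once, from the accumulated rows.
quotes = pd.DataFrame(all_rows, columns=["quote", "quote_date", "page_no"])
print(quotes)
```

Each quote ends up on its own row, and the page number is repeated for every quote found on that page.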
Related
I'm learning web scraping in Python and decided to test my skills on the HackerRank Leaderboard page. I wrote the code below, expecting it to run without errors before I added the country restriction to the tester function and exported my CSV file.
But then the Python console replied:
AttributeError: 'NoneType' object has no attribute 'find_all'
The error above corresponds to line 29 of my code (for i in table.find_all({'class':'ellipsis'}):), so I'm asking for assistance here. I'm afraid there could be more syntax or logic errors, so I'd appreciate feedback from experts.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
from time import sleep
from random import randint
pd.set_option('display.max_columns', None)
#Declaring a variable for looping over all the pages
pages = np.arange(1, 93, 1)
a = pd.DataFrame()
#loop cycle
for url in pages:
    #get html for each new page
    url ='https://www.hackerrank.com/leaderboard?page='+str(url)
    page = requests.get(url)
    sleep(randint(3,10))
    soup = BeautifulSoup(page.text, 'lxml')
    #get the table
    table = soup.find('header', {'class':'table-header flex'})
    headers = []
    #get the headers of the table and delete the "white space"
    for i in table.find_all({'class':'ellipsis'}):
        title = i.text.strip()
        headers.append(title)
    #set the headers to columns in a new dataframe
    df = pd.DataFrame(columns=headers)
    rows = soup.find('div', {'class':'table-body'})
    #get the rows of the table but omit the first row (which are headers)
    for row in rows.find_all('table-row-wrapper')[1:]:
        data = row.find_all('table-row-column ellipsis')
        row_data = [td.text.strip() for td in data]
        length = len(df)
        df.loc[length] = row_data
    #set the data of the Txn Count column to float
    Txn = df['SCORE'].values
    #combine all the data rows in one single dataframe
    a = a.append(pd.DataFrame(df))
def tester(mejora):
    mejora = mejora[(mejora['SCORE']>2250.0)]
    return mejora.to_csv('new_test_Score_Count.csv')
tester(a)
Do you guys have any ideas or suggestions that could fix the problem?
The error states that your table element is None. I'm guessing here, but you can't get the table from the page with bs4 because it is loaded afterwards with JavaScript. I would recommend using Selenium for this instead.
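To illustrate the None case, here is a minimal sketch that checks the result of soup.find before calling find_all on it. Both HTML strings and the extract_headers helper are hypothetical: STATIC_HTML imitates a page whose table is only built later by JavaScript, and RENDERED_HTML imitates the markup the question expected. The sketch also filters with class_=..., on the assumption that this is what the dictionary passed to find_all in the question was meant to do.

```python
from bs4 import BeautifulSoup

# Hypothetical page source as served: the table header is not in the HTML yet.
STATIC_HTML = "<html><body><div id='app'></div></body></html>"

# Hypothetical markup after rendering, shaped like the question expected.
RENDERED_HTML = (
    '<header class="table-header flex">'
    '<div class="ellipsis"> RANK </div>'
    '<div class="ellipsis">SCORE</div>'
    "</header>"
)


def extract_headers(html):
    """Return the header titles, or None when the header element is missing."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("header", class_="table-header")
    if table is None:
        # soup.find returns None when nothing matches, so calling
        # table.find_all here would raise the AttributeError from the question.
        return None
    return [cell.get_text(strip=True) for cell in table.find_all(class_="ellipsis")]


for name, html in [("static", STATIC_HTML), ("rendered", RENDERED_HTML)]:
    headers = extract_headers(html)
    if headers is None:
        print(name, "-> header not found; the content is probably loaded by JavaScript")
    else:
        print(name, "->", headers)
```

If the check fires on the real page, that supports the guess above that the table is not present in the HTML that requests downloads.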
I've been trying to pass the content of a pre tag to a pandas dataframe, but I haven't been able to. This is what I have so far:
import requests,pandas
from bs4 import BeautifulSoup
#url
url='http://weather.uwyo.edu/cgi-bin/sounding?region=samer&TYPE=TEXT%3ALIST&YEAR=2019&MONTH=09&FROM=2712&TO=2712&STNM=80222'
peticion=requests.get(url)
soup=BeautifulSoup(peticion.content,"html.parser")
#get only the pre content I want
all=soup.select("pre")[0]
#write the content in a text file
with open('sound','w') as f:
    f.write(all.text)
#read it
df = pandas.read_csv('sound')
df
I'm getting an unstructured dataframe, and since I have to do this with several URLs, I would rather pass the data directly after line 12 without needing to write a file.
[Image: this is the dataframe I get]
It is fixed-width text, so you need to generate the lines by splitting on '\n' and then the columns by slicing each line at a fixed width. You could use csv to save on overhead, but you wanted a dataframe.
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://weather.uwyo.edu/cgi-bin/sounding?region=samer&TYPE=TEXT%3ALIST&YEAR=2019&MONTH=09&FROM=2712&TO=2712&STNM=80222')
soup = bs(r.content, 'lxml')
pre = soup.select_one('pre').text
results = []
for line in pre.split('\n')[1:-1]:
    if '--' not in line:
        row = [line[i:i+7].strip() for i in range(0, len(line), 7)]
        results.append(row)
df = pd.DataFrame(results)
print(df)
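An alternative sketch, not from the answer above: pandas can parse fixed-width text itself with pd.read_fwf, and io.StringIO lets you hand it the text without writing a file. SAMPLE_PRE below is a made-up three-column excerpt in the same 7-characters-per-column layout assumed above; in practice you would pass the text of the pre tag instead.

```python
import io

import pandas as pd

# Hypothetical excerpt: a header line, a units line, a dashed separator,
# then data rows, every column exactly 7 characters wide.
SAMPLE_PRE = "\n".join([
    "   PRES   HGHT   TEMP",
    "    hPa      m      C",
    "---------------------",
    " 1000.0     94   25.0",
    "  925.0    766   20.4",
])

# Drop blank lines and dashed separators, as the loop above does.
lines = [line for line in SAMPLE_PRE.splitlines() if line.strip() and "--" not in line]
header, units, *body = lines
names = header.split()

# read_fwf slices each line at the given widths and infers numeric dtypes.
df = pd.read_fwf(io.StringIO("\n".join(body)), widths=[7] * len(names), names=names)
print(df)
```

Compared with manual slicing, this gives numeric columns directly instead of strings, at the cost of having to know the column widths up front.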
I have extracted a table from a site with the help of BeautifulSoup. Now I want to repeat this process in a loop over several different URLs. If possible, I would like to extract these tables into different Excel documents, or into different sheets within one document.
I have been trying to put the code in a loop and append each df:
from bs4 import BeautifulSoup
import requests
import pandas as pd
xl = pd.ExcelFile(r'path/to/file.xlsx')
link = xl.parse('Sheet1')
#this is what I can't figure out
for i in range(0,10):
    try:
        url = link['Link'][i]
        html = requests.get(url).content
        df_list = pd.read_html(html)
        soup = BeautifulSoup(html,'lxml')
        table = soup.select_one('table:contains("Fees Earned")')
        df = pd.read_html(str(table))
        list1.append(df)
    except ValueError:
        print('Value')
        pass
#Not as important
a = df[0]
writer = pd.ExcelWriter('mytables.xlsx')
a.to_excel(writer,'Sheet1')
writer.save()
I get a ValueError (no tables found) for the first nine tables, and only the last table is printed when I print mylist. However, when I run the links one at a time, without the for loop, it works.
I can't append the value of df[i] because it says 'index out of range'.
I got this code working for the first page; it needed the user agent, as it didn't work otherwise.
The problem is that the search only returns the first page. The second page has "page=2" in the URL, and so on, so I need to scrape all of the result pages, or as many as needed:
"https://www.vesselfinder.com/vessels?page=2&minDW=20000&maxDW=300000&type=4"
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
site= "https://www.vesselfinder.com/vessels?type=4&minDW=20000&maxDW=300000"
hdr = {'User-Agent': 'Chrome/70.0.3538.110'}
req = Request(site,headers=hdr)
page = urlopen(req)
import pandas as pd
import numpy as np
soup = BeautifulSoup(page, 'lxml')
type(soup)
rows = soup.find_all('tr')
print(rows[:10])
for row in rows:
    row_td = row.find_all('td')
print(row_td)
type(row_td)
str_cells = str(row_td)
cleantext = BeautifulSoup(str_cells, "lxml").get_text()
print(cleantext)
import re
list_rows = []
for row in rows:
    cells = row.find_all('td')
    str_cells = str(cells)
    clean = re.compile('<.*?>')
    clean2 = (re.sub(clean, '',str_cells))
    list_rows.append(clean2)
print(clean2)
type(clean2)
df = pd.DataFrame(list_rows)
df.head(10)
df1 = df[0].str.split(',', expand=True)
df1.head(10)
The output is a pandas DataFrame.
I need to scrape all the pages and output one large dataframe.
Okay, so this problem ended up getting stuck in my head, so I worked it out.
import pandas as pd
import requests
from io import StringIO
hdr={'User-Agent':'Chrome/70.0.3538.110'}
table_dfs={}
for page_number in range(951):
    http= "https://www.vesselfinder.com/vessels?page={}&minDW=20000&maxDW=300000&type=4".format(page_number+1)
    url= requests.get(http,headers=hdr)
    # Newer pandas versions deprecate passing a literal HTML string, so wrap it in StringIO
    table_dfs[page_number]= pd.read_html(StringIO(url.text))
It will return the first column (vessel) as NaN values. That's the column for the image; ignore it if you don't need it.
The next column will be called 'built'; it has the ship's name and the type of ship in it. You'll need to .split() it to separate them, and then you can replace the vessel column with the ship's name.
If it works for you I'd love to boost my reputation with a nice green check mark.
rows = soup.find_all('tr')
print(rows[:10])
for row in rows:
    row_td = row.find_all('td')
print(row_td)
type(row_td)
^ The code above does the same thing as:
urls=['some list of urls you want to scrape']
table_dfs= [pd.read_html(url) for url in urls]
You can crawl through the URLs you're looking for and apply that, and then if you want to do something with or to the tables, you can just go:
for table in table_dfs:
    pass  # replace this with whatever you want to do to each table
Note that the list comprehension stores table_dfs as a list, so you might not be able to tell which URL each table came from if the scrape is big enough. Pieca seemed to have a solution that iterates over the website's URLs and uses the page number as a dictionary key. Note that this solution may not apply to every website.
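from io import StringIO
url_list = {page_number: "https://www.vesselfinder.com/vessels?page={}&minDW=20000&maxDW=300000&type=4".format(page_number)
            for page_number in range(1, 953)}
table_dfs={}
for page_number, url in url_list.items():
    # read_html's `header` argument selects the header row, not HTTP headers,
    # so fetch the page with requests (which takes the user agent) and parse the text
    response = requests.get(url, headers=hdr)
    table_dfs[page_number] = pd.read_html(StringIO(response.text))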
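The idea of keying each page by its page number can be sketched with the standard library alone. This is only an illustration: build_page_urls is a hypothetical helper, and the query parameters are copied from the URL in the question. It builds the URLs but does not fetch anything.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.vesselfinder.com/vessels"


def build_page_urls(first_page, last_page):
    """Map each page number to its search URL (both ends inclusive)."""
    urls = {}
    for page_number in range(first_page, last_page + 1):
        query = urlencode({
            "page": page_number,
            "minDW": 20000,
            "maxDW": 300000,
            "type": 4,
        })
        urls[page_number] = f"{BASE_URL}?{query}"
    return urls


page_urls = build_page_urls(1, 3)
for page_number, url in page_urls.items():
    print(page_number, url)
```

Because the dictionary is keyed by page number, anything you later store under the same key (for example the tables parsed from that page) can be traced back to the URL it came from.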
I'm trying to get hold of the data under the columns whose codes look like "SVENYXX", where "XX" is the number that follows (e.g. 01, 02, etc.), on the site http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html using Python. I am currently using the method described at http://docs.python-guide.org/en/latest/scenarios/scrape/ . However, I don't know how to determine the divs for this page, so I'm unable to proceed and was hoping to get some help with this.
This is what I have so far:
from lxml import html
import requests
page = requests.get('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html')
tree = html.fromstring(page.text)
Thank You
Have you tried using BeautifulSoup? I'm a pretty big fan. Using that you can easily iterate through all of the info you want, searching by tag.
Here's something I threw together, that prints out the values in each column you are looking at. Not sure what you want to do with the data, but hopefully it helps.
from bs4 import BeautifulSoup
from urllib import request
page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read()
soup = BeautifulSoup(page)
desired_table = soup.find_all('table')[2]
# Find the columns you want data from
headers = desired_table.find_all('th')
desired_columns = []
for th in headers:
    if 'SVENY' in th.string:
        desired_columns.append(headers.index(th))
# Iterate through each row grabbing the data from the desired columns
rows = desired_table.find_all('tr')
for row in rows[1:]:
    cells = row.find_all('td')
    for column in desired_columns:
        print(cells[column].text)
In response to your second request:
from bs4 import BeautifulSoup
from urllib import request
page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read()
soup = BeautifulSoup(page)
desired_table = soup.find_all('table')[2]
data = {}
# Find the columns you want data from
headers = desired_table.find_all('th')
for th in headers:
    if 'SVENY' in th.string:
        data[th.string] = {'column': headers.index(th), 'data': []}
# Iterate through each row grabbing the data from the desired columns
rows = desired_table.find_all('tr')
for row in rows[1:]:
    date = row.find_all('th')[0].text
    cells = row.find_all('td')
    for header, info in data.items():
        column_number = info['column']
        cell_data = [date, cells[column_number].text]
        info['data'].append(cell_data)
This returns a dictionary where each key is the header for a column, and each value is another dictionary that has 1) the column it's in on the site, and 2) the actual data you want, in a list of lists.
As an example:
for year_number in data['SVENY01']['data']:
    print(year_number)
['2015-06-05', '0.3487']
['2015-06-04', '0.3124']
['2015-06-03', '0.3238']
['2015-06-02', '0.3040']
['2015-06-01', '0.3009']
['2015-05-29', '0.2957']
etc.
You can fiddle around with this to get the info how and where you want it, but hopefully this is helpful.
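If a dataframe is where you want the info to end up, here is one possible sketch of converting a dictionary shaped like data into a pandas DataFrame with one column per header and the dates as the index. The sample dictionary is hand-written for illustration: the SVENY01 values are the first two from the output above, while the 'column' numbers and the SVENY02 values are made up.

```python
import pandas as pd

# Hand-written sample in the same shape as `data` above.
# Assumption: 'column' numbers and all SVENY02 values are invented for the demo.
data = {
    "SVENY01": {"column": 1, "data": [["2015-06-05", "0.3487"], ["2015-06-04", "0.3124"]]},
    "SVENY02": {"column": 2, "data": [["2015-06-05", "0.7000"], ["2015-06-04", "0.6500"]]},
}

# Turn each header's list of [date, value] pairs into a float Series indexed by date.
series = {
    header: pd.Series(
        [float(value) for _, value in info["data"]],
        index=pd.to_datetime([date for date, _ in info["data"]]),
    )
    for header, info in data.items()
}

# Combine the Series column-wise; rows are aligned on the date index.
yields = pd.DataFrame(series).sort_index()
print(yields)
```

Converting the strings to floats and the dates to timestamps up front makes later filtering or plotting by date simpler.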