Web scraping from Website having multiple page results


I want a single table containing the results from every page of a website query that returns multiple pages of results.
I tried the code below:
import pandas as pd
dfs = []
while i<27:
    url = " "
    dframe = pd.read_html(url.str(i), header=1)
    dfs.append(dframe[0].dropna(thresh=3))
    i=i+1
I expect the dframe to hold the records of all 30 pages of results.
But I am unable to run this: it never finishes, even after hours of running.

import pandas as pd
import numpy as np
df2 = pd.DataFrame()
for i in np.arange(26):
    url = "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=2;page="+str(i)+";spanmin1=01+Jan+2007;spanval1=span;template=results;type=bowling"
    # read_html returns a list of tables; the results table is at index 2
    df = pd.read_html(url)[2]
    df2 = pd.concat([df2, df])
df2.drop(columns = 'Unnamed: 14', inplace = True)
This worked for me, although when I browsed the website I only saw 26 pages. I also inspected a single page, and the table you're looking for is at index [2] of the list that read_html returns. Unnamed: 14 is the column holding the arrow at the far right.

I've made a few additions and changes to your original code to make it work.
import pandas as pd
dfs = []
i = 0
while i < 26:
    url = (
        "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=2;page="
        + str(i)
        + ";spanmin1=01+Jan+2007;spanval1=span;template=results;type=bowling"
    )
    # Only parse tables with the engineTable class
    dframe = pd.read_html(url, attrs={"class": "engineTable"})
    dfs.append(dframe[2].drop(columns="Unnamed: 14"))
    i = i + 1
# Combine the per-page tables once, after the loop
result = pd.concat(dfs)
print(result)
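Both answers hard-code the number of pages. As a rough sketch of the same loop with the page reader passed in as a function, so that it can stop as soon as a page comes back empty: `collect_pages` and `read_page` are hypothetical names (not part of pandas), and the stop-on-empty rule is only an assumption about how the site behaves past its last page.

```python
import pandas as pd


def collect_pages(read_page, pages):
    """Call read_page for each page number and concatenate the results.

    read_page is a hypothetical callable that takes a page number and
    returns a DataFrame (for example, a wrapper around pd.read_html).
    """
    frames = []
    for page in pages:
        frame = read_page(page)
        if frame.empty:
            # Assumption: an empty table means we ran past the last page.
            break
        frames.append(frame)
    if not frames:
        return pd.DataFrame()
    # Concatenate once at the end instead of growing a DataFrame in the loop.
    return pd.concat(frames, ignore_index=True)
```

With the answers above, `read_page` could wrap the existing `pd.read_html(url)[2]` call, with the URL built from the page number.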

Related

How to fill cell by cell of an empty pandas dataframe which has zero columns with a loop?

I need to scrape hundreds of pages and instead of storing the whole json of each page, I want to just store several columns from each page into a pandas dataframe. However, at the beginning when the dataframe is empty, I have a problem. I need to fill an empty dataframe without any columns or rows. So the loop below is not working correctly:
import pandas as pd
import requests
cids = [4100,4101,4102,4103,4104]
df = pd.DataFrame()
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df['Customer_id'] = i
    df['Name'] = jdata['user']['profile']['Name']
    ...
In this case, what should I do?

You can solve this by using enumerate(), together with loc:
for index, i in enumerate(cids):
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    # Assigning to a new row label with loc adds that row to the dataframe
    df.loc[index, 'Customer_id'] = i
    df.loc[index, 'Name'] = jdata['user']['profile']['Name']

If you specify your column names when you create your empty dataframe, as follows:
df = pd.DataFrame(columns = ['Customer_id', 'Name'])
then you can add a row (including any other columns you populate) on each iteration of your for loop using:
df = df.append({'Customer_id' : i, 'Name' : jdata['user']['profile']['Name']}, ignore_index=True)
Note that DataFrame.append was removed in pandas 2.0, so this line only runs on older versions. The full loop looks like this:
import pandas as pd
import requests
cids = [4100,4101,4102,4103,4104]
df = pd.DataFrame(columns = ['Customer_id', 'Name'])
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    # Requires pandas < 2.0, where DataFrame.append still exists
    df = df.append({'Customer_id' : i, 'Name' : jdata['user']['profile']['Name']}, ignore_index=True)
Using append on a DataFrame in a loop is also usually inefficient, because every call copies the existing rows into a new DataFrame. A better way, which works on every pandas version, is to save your results as a list of lists (df_data) and then turn that into a DataFrame, as below:
cids = [4100,4101,4102,4103,4104]
df_data = []
for i in cids:
    url_info = requests.get(f'myurl/{i}/profile')
    jdata = url_info.json()
    df_data.append([i, jdata['user']['profile']['Name']])
# Build the dataframe once, after all rows have been collected
df = pd.DataFrame(df_data, columns = ['Customer_id', 'Name'])
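A variation on the list approach is to collect one dict per row, so each value stays next to its column name. This is only a sketch: `build_profile_table` and `fetch_profile` are hypothetical names, and `fetch_profile` stands in for the `requests.get(...).json()` call so the function can be tried without network access.

```python
import pandas as pd


def build_profile_table(cids, fetch_profile):
    """Build one DataFrame from per-customer JSON documents.

    fetch_profile is a hypothetical callable that takes a customer id and
    returns the parsed JSON as a dict (for example, requests.get(...).json()).
    """
    records = []
    for cid in cids:
        jdata = fetch_profile(cid)
        profile = jdata["user"]["profile"]
        # One dict per row; the keys become the column names.
        records.append({"Customer_id": cid, "Name": profile["Name"]})
    return pd.DataFrame(records, columns=["Customer_id", "Name"])
```

Passing `columns` explicitly keeps the column order fixed and still yields the two columns when `cids` is empty.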

How to get data from a link inside a webpage in Python?

I need to collect data from the website - https://webgate.ec.europa.eu/rasff-window/portal/?event=notificationsList&StartRow= and store it in a dataframe using pandas. For this I use the following code and get the data quite easily -
import pandas as pd
import requests
url = "https://webgate.ec.europa.eu/rasff-window/portal/?event=notificationsList&StartRow="
link = requests.get(url).text
df = pd.read_html(link)
df = df[-1]
However, every row of the table on the webpage also has a hyperlink named "Details" at the far right. I would also like to add the data behind that hyperlink to the corresponding row of the dataframe. How can I do that?

As suggested by Shi XiuFeng, BeautifulSoup is better suited to your problem, but if you still want to proceed with your current code, you would have to use regex to extract the URLs and add them as a column like this:
import re
import pandas as pd
import requests
url = "https://webgate.ec.europa.eu/rasff-window/portal/?event=notificationsList&StartRow="
link = requests.get(url)
# str() of the bytes turns real newlines into a literal backslash-n, so '.' can match across lines
link_content = str(link.content)
# Take the first <tbody>, then capture the href of every "Details" link inside it
res = re.findall(r'(<tbody.*?>.*?</tbody>)', link_content)[0]
res = re.findall(r'(<a href=\"(.*?)\">Details\<\/a\>)', res)
res = [i[1] for i in res]
link_text = link.text
df = pd.read_html(link_text)
df = df[-1]
# Assumes exactly one "Details" link per table row
df['links'] = res
print(df)
Hope that solves your problem.
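For comparison, here is a minimal sketch that collects the same links with an HTML parser instead of regex. It uses the standard library's html.parser rather than BeautifulSoup, and `LinkCollector` and `extract_links` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> element whose text equals link_text."""

    def __init__(self, link_text):
        super().__init__()
        self.link_text = link_text
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip() == self.link_text:
                self.links.append(self._href)
            self._href = None


def extract_links(html, link_text="Details"):
    """Return the hrefs of all links labelled link_text, in document order."""
    parser = LinkCollector(link_text)
    parser.feed(html)
    parser.close()
    return parser.links
```

Under the same assumption as the regex version (one "Details" link per table row), the result of `extract_links(link.text)` could be assigned to `df['links']`.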

Create Pandas Dataframe from WebScraping results of stock price

I'm trying to write a script that creates a Pandas DataFrame (df) and adds a stock price to it at a regular interval x. The data comes from web scraping.
This is my code, but I have no idea how to add new data to the df at each interval (e.g. every 1 min) without replacing the old data.
import bs4 as bs
import urllib.request
import time as t
import re
import pandas as pd
i = 1
while i == 1:
    # !! Scraping
    link = 'https://www.onvista.de/aktien/DELIVERY-HERO-SE-Aktie-DE000A2E4K43'
    parser = urllib.request.urlopen(link).read()
    soup = bs.BeautifulSoup(parser, 'lxml')
    stock_data = soup.find('ul', {'class': 'KURSDATEN'})
    stock_price_eur_temp = stock_data.find('li')
    stock_price_eur = stock_price_eur_temp.get_text()
    final_stock_price = re.sub('[EUR]','', stock_price_eur)
    print (final_stock_price)
    t.sleep(60)
    # !! Building a dataframe
    localtime = t.asctime(t.localtime(t.time()))
    stock_data_b = {
        'Price': [final_stock_price],
        'Time': [localtime],
    }
    df = pd.DataFrame(stock_data_b, columns=['Price', 'Time'])
I hope you can help me with an idea for this problem.

Because you create df inside the loop, you rebind that variable name on every pass, overwriting the data from the previous iteration. You want to initialize a dataframe before the loop, and then add to it each time.
Before your loop, add the line
df2 = pd.DataFrame()
which just creates an empty dataframe. At the end of the loop body in the code you posted, add
df2 = df2.append(df, ignore_index = True)
which will tack each new df on to the end of df2. On pandas 2.0 and later, where DataFrame.append has been removed, use df2 = pd.concat([df2, df], ignore_index=True) instead.
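Another way to keep the earlier samples is to collect plain rows in a list and build the DataFrame once, after sampling ends. The sketch below assumes the scraping code has been wrapped in a function; `record_prices` and `get_price` are hypothetical names, and the fixed sample count is an assumption made so that the loop terminates.

```python
import time

import pandas as pd


def record_prices(get_price, n_samples, interval_s=60.0):
    """Sample get_price() n_samples times and return one DataFrame.

    get_price is a hypothetical callable that returns the current price
    (for example, the scraping code from the question wrapped in a function).
    """
    rows = []
    for k in range(n_samples):
        rows.append({"Price": get_price(), "Time": time.asctime()})
        if k < n_samples - 1:
            time.sleep(interval_s)  # wait before taking the next sample
    return pd.DataFrame(rows, columns=["Price", "Time"])
```

Building the DataFrame once at the end avoids copying the accumulated rows on every sample.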

Web scraping golf data from ESPN. I am receiving 3 outputs of the same table and only want 1. How can I limit this?

I am new to Python and am stuck. I can't figure out how to output only one of the tables returned. The output contains the desired table, but in three versions: the first two are badly formatted, and the last one is the table I want.
I have tried running a for loop and counting so that only the third table is printed.
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
for df in dfs:
    print(df[0:])

Just use the index to print the table.
import pandas as pd
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
print(dfs[2])
or
print(dfs[-1])
Or, if you want to use a loop, try this:
import pandas as pd
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
for df in range(len(dfs)):
    if df==2:
        print(dfs[df])
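A fixed index works only while the page keeps the same number of tables. One hedged alternative is to pick the table by the columns it is expected to contain; `pick_table` is a hypothetical helper, and any column names passed to it are guesses about the page, not something confirmed by the thread.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required_columns."""
    required = set(required_columns)
    for table in tables:
        # Compare on string column labels so numeric headers do not break the check.
        if required.issubset(map(str, table.columns)):
            return table
    raise ValueError(f"no table has columns {sorted(required)}")
```

It would be called on the list that `pd.read_html` returns, with whatever column names the desired leaderboard table actually uses.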

python pd.read_html gives error

I've been using pandas and requests to pull some tables of NFL statistics. It's been going pretty well, and I've been able to pull tables from other sites, until I tried to get the NFL combine table from one particular site.
It gives me an error message after df_list = pd.read_html(html)
The error I get is:
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U1') dtype('<U1') dtype('<U1')
Here's the code I've been using on other sites, where it worked really well.
import requests
import pandas as pd
df = pd.DataFrame()
url = 'http://nflcombineresults.com/nflcombinedata_expanded.php?year=1987&pos=&college='
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
I've read a little bit about BeautifulSoup, but the simplicity of pd.read_html() is just so nice and compact. So I don't know if there's a quick fix that I am not aware of, or if I do need to dive into BeautifulSoup to get these tables from 1987 - 2017.

This isn't shorter, but may be more robust:
import requests
import pandas as pd
from bs4 import BeautifulSoup
A convenience function:
def souptable(table):
    # Yield the cell text of each table row as a list
    for row in table.find_all('tr'):
        yield [col.text for col in row.find_all('td')]
Return a DataFrame with data loaded for a given year:
def getyear(year):
    url = 'http://nflcombineresults.com/nflcombinedata_expanded.php?year=%d&pos=&college=' % year
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    data = list(souptable(soup.table))
    df = pd.DataFrame(data[1:], columns=data[0])
    df = df[pd.notnull(df['Name'])]
    return df.apply(pd.to_numeric, errors="ignore")
This function slices out the heading row when the DataFrame is created, uses the first row for column names, and filters out any rows with an empty Name value.
Finally, concatenate up as many years as you need into a single DataFrame:
dfs = pd.concat([getyear(year) for year in range(1987, 1990)])

OK, after doing some more research, it looks like the issue is that the last row is one merged cell, and that's possibly where the problem comes from. So I did end up using BeautifulSoup to pull the data. Here is my solution:
import requests
import pandas as pd
from bs4 import BeautifulSoup
I wanted to pull the data for each year from 1987 to 2017:
seasons = list(range(1987, 2018))
df = pd.DataFrame()
temp_df = pd.DataFrame()
The loop runs through each year, appending each cell as a new row. Since I know the last cell is a "blank", I eliminate that last row by redefining the dataframe as df[:-1] before it loops and appends the next year's data.
for i in seasons:
    df = df[:-1]
    url = 'http://nflcombineresults.com/nflcombinedata_expanded.php?year=%s&pos=&college=' % (i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    for tr in soup.table.find_all('tr'):
        row = [td.text for td in tr.find_all('td')]
        temp_df = row
        # Requires pandas < 2.0, where DataFrame.append still exists
        df = df.append(temp_df, ignore_index = True)
Finally, since there is no new year to append, I need to eliminate the last row. Then I reshape the dataframe into the 16 columns, rename the columns from the first row, and then eliminate the header rows within the dataframe.
df = df[:-1]
df = (pd.DataFrame(df.values.reshape(-1, 16)))
df.columns = df.iloc[0]
df = df[df.Name != 'Name']
I'm still learning Python, so any input, advice, or respectful constructive criticism is always welcome. Maybe there is a better, more appropriate solution?
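One possible simplification, offered only as a sketch: keep each table row as a list of cell texts and filter by row length, instead of appending single cells and reshaping. `rows_to_frame` is a hypothetical helper. It assumes the first row is the header and that the merged trailing row has a different number of cells from the header.

```python
import pandas as pd


def rows_to_frame(rows):
    """Turn scraped table rows (lists of cell text) into a DataFrame.

    The first row is taken as the header. Rows whose length differs from
    the header (such as a trailing merged cell) and repeated header rows
    are dropped.
    """
    header = rows[0]
    body = [r for r in rows[1:] if len(r) == len(header) and r != header]
    return pd.DataFrame(body, columns=header)
```

The rows from every season could be collected into one list and passed in together, which would remove the need for the df[:-1] and reshape(-1, 16) steps.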
