The code below pulls the numbers from every numeric tag on the page. Can I use a filter to extract them once per region?
For example, on https://opensignal.com/reports/2019/04/uk/mobile-network-experience I am interested only in the numbers under the regional analysis tab, for all regions.
import requests
from bs4 import BeautifulSoup
html=requests.get("https://opensignal.com/reports/2019/04/uk/mobile-network-experience").text
soup=BeautifulSoup(html,'html.parser')
items=soup.find_all('div',class_='c-ru-graph__rect')
for item in items:
    provider = item.find('span', class_='c-ru-graph__label').text
    prodvalue = item.find_next_sibling('span').find('span', class_='c-ru-graph__number').text
    print(provider + " : " + prodvalue)
I want a table or df as below
Eastern Region
                          O2  Vodafone     3    EE
4G Availability           82      76.9  73.0  89.2
Upload Speed Experience  5.6       5.9   6.8   9.5
Any pointers that can help in getting the result ?
Here is how I would do it for all regions. Requires bs4 4.7.1+ (for the :has() CSS selector). As far as I can see, you have to assume a consistent ordering of companies.
import requests
from bs4 import BeautifulSoup
import pandas as pd
r = requests.get("https://opensignal.com/reports/2019/04/uk/mobile-network-experience")
soup = BeautifulSoup(r.content,'lxml') #'html.parser' if lxml not installed
metrics = ['4g-availability', 'video-experience', 'download-speed', 'upload-speed', 'latency']
headers = ['O2', 'Vodafone', '3', 'EE']
results = []

for region in soup.select('.s-regional-analysis__region'):
    for metric in metrics:
        providers = [item.text for item in region.select('.c-ru-chart:has([data-metric="' + metric + '"]) .c-ru-graph__number')]
        row = {headers[i]: providers[i] for i in range(len(providers))}
        row['data-metric'] = metric
        row['region'] = region['id']
        results.append(row)

df = pd.DataFrame(results, columns=['region', 'data-metric', 'O2', 'Vodafone', '3', 'EE'])
print(df)
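The :has() pseudo-class is the part that requires bs4 4.7.1+ (soupsieve). A minimal, self-contained demonstration of that selector, using stand-in markup that mirrors the chart structure (the real page's markup may differ):

```python
from bs4 import BeautifulSoup

# Stand-in markup: two charts, each tagged with a data-metric attribute
html = """
<div class="c-ru-chart"><div data-metric="4g-availability"></div>
  <span class="c-ru-graph__number">82.0</span></div>
<div class="c-ru-chart"><div data-metric="upload-speed"></div>
  <span class="c-ru-graph__number">5.6</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# :has() keeps only the chart containing the wanted metric, then
# the descendant selector picks its number span
nums = [n.text for n in
        soup.select('.c-ru-chart:has([data-metric="4g-availability"]) '
                    '.c-ru-graph__number')]
print(nums)  # ['82.0']
```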
Assuming a fixed order of companies (which is indeed the case), you can simply reduce the content examined to only those divs containing the information you need.
import requests
from bs4 import BeautifulSoup
html = requests.get("https://opensignal.com/reports/2019/04/uk/mobile-network-experience").text
soup = BeautifulSoup(html,'html.parser')
res = soup.find_all('div', {'id':'eastern'})
aval = res[0].find_all('div', {'data-chart-name':'4g-availability'})
avalname = aval[0].find('span', {'class':'js-metric-name'}).text
upload = res[0].find_all('div', {'data-chart-name':'upload-speed'})
uploadname = upload[0].find('span', {'class':'js-metric-name'}).text
companies = [i.text for i in aval[0].find_all('span', class_='c-ru-graph__label')]
row1 = [i.text for i in aval[0].find_all('span', class_='c-ru-graph__number')]
row2 = [i.text for i in upload[0].find_all('span', class_='c-ru-graph__number')]
import pandas as pd
df = pd.DataFrame({avalname: row1,
                   uploadname: row2})
df.index = companies
df = df.T
Output:
O2 Vodafone 3 EE
4G Availability 82.0 76.9 73.0 89.2
Upload Speed Experience 5.6 5.9 6.8 9.5
I am writing a small program to fetch stock exchange data using Python. The sample code below makes a request to a URL and it should return the appropriate data. Here is the resource that I am using:
https://python.plainenglish.io/4-python-libraries-to-help-you-make-money-from-webscraping-57ba6d8ce56d
from xml.dom.minidom import Element
from selenium import webdriver
from bs4 import BeautifulSoup
import logging
from selenium.webdriver.common.by import By
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
url = "http://eoddata.com/stocklist/NASDAQ/A.htm"
driver = webdriver.Chrome(executable_path=r"C:\Program Files\Chrome\chromedriver")
page = driver.get(url)
# TODO: find element by CSS selector
stock_symbol = driver.find_elements(by=By.CSS_SELECTOR, value='#ctl00_cph1_divSymbols')
soup = BeautifulSoup(driver.page_source, features="html.parser")
elements = []
table = soup.find('div', {'id','ct100_cph1_divSymbols'})
logging.info(f"{table}")
I've added a todo for getting the element that I am trying to retrieve from the program.
Expected:
The proper data should be returned.
Actual:
Nothing is returned.
It is most common practice to scrape tables with pandas.read_html() to get their text, so I would also recommend it.
But to answer your question and follow your approach, select the <div> and <table> more specifically:
soup.select('#ctl00_cph1_divSymbols table')
To get and store the data you could iterate the rows and append the results to a list:
data = []
for row in soup.select('#ctl00_cph1_divSymbols table tr:has(td)'):
    d = dict(zip(soup.select_one('#ctl00_cph1_divSymbols table tr:has(th)').stripped_strings, row.stripped_strings))
    d.update({'url': 'https://eoddata.com' + row.a.get('href')})
    data.append(d)
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://eoddata.com/stocklist/NASDAQ/A.htm"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
data = []

for row in soup.select('#ctl00_cph1_divSymbols table tr:has(td)'):
    d = dict(zip(soup.select_one('#ctl00_cph1_divSymbols table tr:has(th)').stripped_strings, row.stripped_strings))
    d.update({'url': 'https://eoddata.com' + row.a.get('href')})
    data.append(d)

pd.DataFrame(data)
Output
   Code   Name                             High    Low     Close   Volume      Change   url
0  AACG   Ata Creativity Global ADR        1.390   1.360   1.380   8,900       0        https://eoddata.com/stockquote/NASDAQ/AACG.htm
1  AACI   Armada Acquisition Corp I        9.895   9.880   9.880   5,400       -0.001   https://eoddata.com/stockquote/NASDAQ/AACI.htm
2  AACIU  Armada Acquisition Corp I        9.960   9.960   9.960   300         -0.01    https://eoddata.com/stockquote/NASDAQ/AACIU.htm
3  AACIW  Armada Acquisition Corp I WT     0.1900  0.1699  0.1700  36,400      -0.0193  https://eoddata.com/stockquote/NASDAQ/AACIW.htm
4  AADI   Aadi Biosciences Inc             13.40   12.66   12.90   98,500      -0.05    https://eoddata.com/stockquote/NASDAQ/AADI.htm
5  AADR   Advisorshares Dorsey Wright ETF  47.49   46.82   47.49   1,100       0.3      https://eoddata.com/stockquote/NASDAQ/AADR.htm
6  AAL    American Airlines Gp             14.44   13.70   14.31   45,193,100  -0.46    https://eoddata.com/stockquote/NASDAQ/AAL.htm
...
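For comparison, the read_html route recommended at the top can be sketched like this. The id comes from the selectors above; the inline HTML is a small stand-in for the real page, and read_html needs one of pandas' HTML parser backends (lxml or html5lib) installed:

```python
import io
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in markup mirroring the page's div/table structure
html = """
<div id="ctl00_cph1_divSymbols">
  <table>
    <tr><th>Code</th><th>Name</th><th>Close</th></tr>
    <tr><td>AACG</td><td>Ata Creativity Global ADR</td><td>1.380</td></tr>
    <tr><td>AACI</td><td>Armada Acquisition Corp I</td><td>9.880</td></tr>
  </table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Feed only the relevant table to read_html instead of the whole page
table_html = str(soup.select_one("#ctl00_cph1_divSymbols table"))
df = pd.read_html(io.StringIO(table_html))[0]
print(df)
```

read_html infers the header from the th row and parses numeric columns for you, which is why it is usually less code than walking the rows by hand.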
Here's the link I'm scraping: https://stockanalysis.com/stocks/
I'm trying to get all the rows of the table (6000+ rows), but I only get the first 500 results. I guess it has to do with the setting for how many rows to display.
I tried almost everything I can think of. I'm also a beginner at web scraping.
My code:
# Importing libraries
import numpy as np # numerical computing library
import pandas as pd # panel data library
import requests # http requests library
from bs4 import BeautifulSoup
url = 'https://stockanalysis.com/stocks/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html')
league_table = soup.find('table', class_ = 'symbol-table index')
col_df = ['Symbol', 'Company_name', 'Industry', 'Market_Cap']
for team in league_table.find_all('tbody'):
    rows = team.find_all('tr')
    df = pd.DataFrame(np.zeros([len(rows), len(col_df)]))
    df.columns = col_df
    for i, row in enumerate(rows):
        s_symbol = row.find_all('td')[0].text
        s_company_name = row.find_all('td')[1].text
        s_industry = row.find_all('td')[2].text
        s_market_cap = row.find_all('td')[3].text
        df.iloc[i] = [s_symbol, s_company_name, s_industry, s_market_cap]
len(df)  # should be > 6000
What should I do?
Take a look at the bottom of the HTML and you will see this:
<script id="__NEXT_DATA__" type="application/json">
Try using bs4 to find this tag and load the data from inside it; I think it contains everything you need.
As stated, it's in the <script> tags. Pull it and read it in.
import requests
from bs4 import BeautifulSoup
import json
import re
import pandas as pd
url = 'https://stockanalysis.com/stocks/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
jsonStr = str(soup.find('script', {'id':'__NEXT_DATA__'}))
jsonStr = re.search('({.*})', jsonStr).group(0)
jsonData = json.loads(jsonStr)
df = pd.DataFrame(jsonData['props']['pageProps']['stocks'])
Output:
print(df)
s ... i
0 A ... Life Sciences Tools & Services
1 AA ... Metals & Mining
2 AAC ... Blank Check / SPAC
3 AACG ... Diversified Consumer Services
4 AACI ... Blank Check / SPAC
... ... ...
6033 ZWS ... Utilities-Regulated Water
6034 ZY ... Chemicals
6035 ZYME ... Biotechnology
6036 ZYNE ... Pharmaceuticals
6037 ZYXI ... Health Care Equipment & Supplies
[6038 rows x 4 columns]
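As a minimal illustration of the same idea (the markup below is a stand-in, not the real page), the embedded JSON can also be read via the script tag's .string, which avoids the regex step:

```python
import json
from bs4 import BeautifulSoup
import pandas as pd

# Stand-in page with a __NEXT_DATA__-style JSON payload
html = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"stocks": [{"s": "A", "i": "Life Sciences Tools"},
                                    {"s": "AA", "i": "Metals & Mining"}]}}}
</script>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# .string holds the raw text of the script tag; parse it as JSON directly
payload = json.loads(soup.find("script", {"id": "__NEXT_DATA__"}).string)
df = pd.DataFrame(payload["props"]["pageProps"]["stocks"])
print(df)
```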
I'm trying to scrape historical Bitcoin data from coinmarketcap.com in order to get close, volume, date, high, and low values from the beginning of the year until Sep 30, 2021. After going through threads and videos for hours (I'm new to scraping with Python), I don't know what my mistake is, or whether there is something about the website I'm not detecting. The following is my code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
closeList = []
volumeList = []
dateList = []
highList = []
lowList = []
website = 'https://coinmarketcap.com/currencies/bitcoin/historical-data/'
r = requests.get(website)
soup = BeautifulSoup(r.text, 'lxml')
tr = soup.find_all('tr')
FullData = []

for item in tr:
    closeList.append(item.find_all('td')[4].text)
    volumeList.append(item.find_all('td')[5].text)
    dateList.append(item.find('td', {'style': 'text-align: left;'}).text)
    highList.append(item.find_all('td')[2].text)
    lowList.append(item.find_all('td')[3].text)
    FullData.append([closeList, volumeList, dateList, highList, lowList])
df_columns = ["close", "volume", "date", "high", "low"]
df = pd.DataFrame(FullData, columns = df_columns)
print(df)
As a result I only get:
Empty DataFrame
Columns: [close, volume, date, high, low]
Index: []
The task obliges me to scrape with BeautifulSoup and then export to CSV (which obviously is then simply df.to_csv). Can somebody help me out? That would be highly appreciated.
Actually, the data is loaded dynamically by JavaScript from an API call's JSON response, so there is no table in the static HTML for BeautifulSoup to find. You can grab the data easily from the API as follows:
Code:
import requests
import pandas as pd

api_url = 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/historical?id=1&convertId=2781&timeStart=1632441600&timeEnd=1637712000'
r = requests.get(api_url)

data = []
for item in r.json()['data']['quotes']:
    close = item['quote']['close']
    volume = item['quote']['volume']
    date = item['quote']['timestamp']
    high = item['quote']['high']
    low = item['quote']['low']
    data.append([close, volume, date, high, low])

cols = ["close", "volume", "date", "high", "low"]
df = pd.DataFrame(data, columns=cols)
print(df)
#df.to_csv('info.csv', index=False)
Output:
close volume date high low
0 42839.751696 4.283935e+10 2021-09-24T23:59:59.999Z 45080.491063 40936.557169
1 42716.593147 3.160472e+10 2021-09-25T23:59:59.999Z 42996.259704 41759.920425
2 43208.539105 3.066122e+10 2021-09-26T23:59:59.999Z 43919.300970 40848.461660
3 42235.731847 3.098003e+10 2021-09-27T23:59:59.999Z 44313.245882 42190.632576
4 41034.544665 3.021494e+10 2021-09-28T23:59:59.999Z 42775.146142 40931.662500
.. ... ... ... ... ...
56 58119.576194 3.870241e+10 2021-11-19T23:59:59.999Z 58351.113266 55705.180685
57 59697.197134 3.062426e+10 2021-11-20T23:59:59.999Z 59859.880442 57469.725661
58 58730.476639 2.612345e+10 2021-11-21T23:59:59.999Z 60004.426383 58618.931432
59 56289.287323 3.503612e+10 2021-11-22T23:59:59.999Z 59266.358468 55679.840404
60 57569.074876 3.748580e+10 2021-11-23T23:59:59.999Z 57875.516397 55632.759912
[61 rows x 5 columns]
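A note on the URL parameters: timeStart and timeEnd are Unix timestamps, so to cover Jan 1 to Sep 30, 2021 as the question asks, you can compute them like this (to_unix is just a hypothetical helper name for building the query string):

```python
from datetime import datetime, timezone

def to_unix(year, month, day):
    # Midnight UTC for the given calendar date, as an integer Unix timestamp
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

time_start = to_unix(2021, 1, 1)
time_end = to_unix(2021, 9, 30)
print(time_start, time_end)  # 1609459200 1632960000
```

Substitute these values into the api_url above to change the date range.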
I'm trying to make an easy scraper for eBay with Python; the thing is, I can't use the DataFrame I make.
My code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from csv import reader
url = "https://www.ebay.es/sch/i.html?_from=R40&_nkw=iphone&_sacat=0&LH_TitleDesc=0&_fsrp=1&Modelo=Apple%2520iPhone%2520X&_dcat=9355"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
productslist = []
results = soup.find_all('div', {'class': 's-item__info clearfix'})
print(len(results))
for item in results:
    product = {
        'title': item.find('h3', {'class': 's-item__title'}),
        'soldprice': item.find('span', {'class': 's-item__price'})
    }
    productslist.append(product)
df = pd.DataFrame(productslist)
df
But the dataframe I get is like this:
Dataframe
I would like to be able to work with the price numbers, but I can't use them; I imagine it's because of dtype: object. I would like to know how to convert, for example, [359,00 EUR] into 359,00 so I can make graphics.
Thanks.
To remove the [...] wrapper around the price, use the .text attribute instead of keeping the Tag object. Also, make sure that the price is not None:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.ebay.es/sch/i.html?_from=R40&_nkw=iphone&_sacat=0&LH_TitleDesc=0&_fsrp=1&Modelo=Apple%2520iPhone%2520X&_dcat=9355"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
productslist = []
results = soup.find_all("div", {"class": "s-item__info clearfix"})
for item in results:
    title = item.find("h3", {"class": "s-item__title"})
    price = item.find("span", {"class": "s-item__price"})
    if price is None:
        continue
    productslist.append({"title": title.text, "soldprice": price.text})
df = pd.DataFrame(productslist)
print(df)
Output:
title soldprice
0 APPLE IPHONE X 64 GB A+LIBRE+FACTURA+8 ACCESOR... 359,00 EUR
1 Apple iPhone X - 64GB - Plata (Libre) 177,50 EUR
2 Apple iPhone X - 64GB - Blanco (Libre) 181,50 EUR
3 iphone x 64gb 240,50 EUR
4 Iphone x 256 gb 370,00 EUR
5 Apple iPhone X - 256GB - Space Gray (Libre) 400,00 EUR
6 Nuevo anuncioSMARTPHONE APPLE IPHONE X 64GB LI... 334,95 EUR
...
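To address the numeric part of the question: once soldprice is plain text, the European-style strings can be converted to floats with pandas string methods. A sketch with stand-in values (the thousands-separator handling assumes the "1.234,56 EUR" format eBay.es uses):

```python
import pandas as pd

df = pd.DataFrame({"soldprice": ["359,00 EUR", "1.177,50 EUR"]})

# Strip the currency suffix, drop the thousands dot, turn the
# decimal comma into a dot, then cast to float
df["price"] = (df["soldprice"]
               .str.replace(" EUR", "", regex=False)
               .str.replace(".", "", regex=False)   # thousands separator
               .str.replace(",", ".", regex=False)  # decimal comma
               .astype(float))
print(df["price"].tolist())  # [359.0, 1177.5]
```

With a numeric price column you can plot or aggregate normally.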
I'm trying to export the results of this code to a CSV file. I copied 2 of the results further down below after the code. There are 14 items for each stock and I'd like to write to a CSV file and have a column for each of the 14 items and one row for each stock.
import requests
from bs4 import BeautifulSoup
base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs = {'id':'screener-content'})
table = main_div.find('table')
sub = table.findAll('tr')
rows = sub[5].findAll('td')
for row in rows:
    link = row.a
    if link is not None:
        print(link.get_text())
This is the format of the results, 14 items/columns for each stock.
PTN
Palatin Technologies, Inc.
Healthcare
Diagnostic Substances
USA
240.46M
9.22
193.43M
2.23M
0.76
1.19
7.21%
1,703,285
3
LKM
Link Motion Inc.
Technology
Application Software
China
128.95M
-
50.40M
616.76K
1.73
1.30
16.07%
1,068,798
4
I tried this, but couldn't get it to work:
TextWriter x = File.OpenWrite ("my.csv", ....);
x.WriteLine("Column1,Column2"); // header
x.WriteLine(coups.Cells[0].Text + "," + coups.Cells[1].Text);
This should work:
import requests
from bs4 import BeautifulSoup
base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs = {'id':'screener-content'})
light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")
data = []
for rows_set in (light_rows, dark_rows):
    for row in rows_set:
        row_data = []
        for cell in row.find_all('td'):
            val = cell.a.get_text()
            row_data.append(val)
        data.append(row_data)
# sort rows to maintain original order
data.sort(key=lambda x: int(x[0]))
import pandas
pandas.DataFrame(data).to_csv("AAA.csv", header=False)
A few things here:
I used "table-[light|dark]-row-cp" because all rows of interest had one of those classes (and no other rows had them).
There are two separate parts: one is fetching the data in the correct structure, the other is writing the CSV file.
I used the pandas CSV writer because I'm familiar with it, but when you have rectangular data (named "data" here) you may use any other CSV writer.
You should avoid reusing common names like 'sub' or 'link' for variables : )
Hope that helps.
Why don't you use the built-in csv.writer?
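For completeness, a sketch of that csv.writer route with stand-in rows (the header names here are assumed for illustration, not taken from the site):

```python
import csv

# Stand-in for the rectangular `data` built in the answer above
data = [["1", "PTN", "Palatin Technologies, Inc."],
        ["2", "LKM", "Link Motion Inc."]]

with open("AAA.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["No", "Ticker", "Company"])  # assumed header row
    writer.writerows(data)  # fields with commas are quoted automatically
```

This avoids the pandas dependency entirely when all you need is to dump rows to disk.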