import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}
#put all item in this array
response = requests.get('http://smartcatalog.emo-milano.com/it/espositore/a-mannesmann-maschinenfabrik-gmbh')
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find_all('table', class_='expo-table general-color')
for row in table:
    for up in row.find_all('td'):
        text_list = [text for text in up.stripped_strings]
        print(text_list)
This code works and gets me the correct data, but it does not print the output in the format shown below. I want the output in this format; can you help me?
Indirizzo Bliedinghauserstrasse 27
Città Remscheid
Nazionalità Germania
Sito web www.amannesmann.de
Stand Pad. 3 E14 F11
Telefono +492191989-0
Fax +492191989-201
E-mail sales#mannesmann.de
Membro di Cecimo
Social
pandas has a builtin html table scraper, so you can run:
df = pd.read_html('http://smartcatalog.emo-milano.com/it/espositore/a-mannesmann-maschinenfabrik-gmbh')
This returns a list of all tables on the page as dataframes; you can access your data with df[0]:
             0                         1
0    Indirizzo  Bliedinghauserstrasse 27
1        Città                 Remscheid
2  Nazionalità                  Germania
3     Sito web        www.amannesmann.de
4        Stand            Pad. 3 E14 F11
5     Telefono              +492191989-0
6          Fax            +492191989-201
7       E-mail       sales#mannesmann.de
8    Membro di                       NaN
9       Social                       NaN
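If you want the exact "label value" lines from the question rather than a dataframe, you can join the two columns while printing; a minimal sketch (read_html needs lxml or html5lib installed):

import pandas as pd

tables = pd.read_html('http://smartcatalog.emo-milano.com/it/espositore/a-mannesmann-maschinenfabrik-gmbh')
for _, (label, value) in tables[0].iterrows():
    # empty cells come back as NaN; print just the label in that case
    print(label if pd.isna(value) else f'{label} {value}')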
You can use the .get_text() method to extract the text; strip=True removes the surrounding whitespace and separator puts a space between the cells:
# use soup.find() so `table` is a single Tag, not the ResultSet that find_all() returns
table = soup.find('table', class_='expo-table general-color')
data = table.find_all("tr")
for i in data:
    print(i.get_text(strip=True, separator=" "))
Output:
Indirizzo Bliedinghauserstrasse 27
Città Remscheid
...
Instead of selecting <td>, select <tr> and use .stripped_strings on it to get the row-wise data, then append the rows to a DataFrame.
Here is the code
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}
# collect all rows in this list
temp = []
response = requests.get('http://smartcatalog.emo-milano.com/it/espositore/a-mannesmann-maschinenfabrik-gmbh')
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find_all('table', class_='expo-table general-color')
for row in table:
    for up in row.find_all('tr'):
        temp.append([text for text in up.stripped_strings])
df = pd.DataFrame(temp)
print(df)
0 1
0 Indirizzo Bliedinghauserstrasse 27
1 Città Remscheid
2 Nazionalità Germania
3 Sito web www.amannesmann.de
4 Stand Pad. 3 E14 F11
5 Telefono +492191989-0
6 Fax +492191989-201
7 E-mail sales#mannesmann.de
8 Membro di None
9 Social None
I don't usually work with BeautifulSoup in Python, so I am struggling to find the value 8.133,00 that corresponds to the Ibex 35 on this page: https://es.investing.com/indices/indices-futures
So far I am getting the whole page, but I can't filter it down to that value:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

site = 'https://es.investing.com/indices/indices-futures'
hardware = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0'}
request = Request(site, headers=hardware)
page = urlopen(request)
soup = BeautifulSoup(page, 'html.parser')
print(soup)
I appreciate a hand to get that value.
Regards
Here is a way of getting that bit of information: a dataframe with everything in the table containing IBEX 35, DAX, and so on, which you can then slice as you wish.
import pandas as pd
from bs4 import BeautifulSoup as bs
import cloudscraper

scraper = cloudscraper.create_scraper(disableCloudflareV1=True)
headers = {'User-Agent': 'Mozilla/5.0'}  # any reasonable UA; the original snippet left this undefined
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

url = 'https://es.investing.com/indices/indices-futures'
r = scraper.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
table = soup.select_one('table[class="datatable_table__D_jso quotes-box_table__nndS2 datatable_table--mobile-basic__W2ilt"]')
df = pd.read_html(str(table))[0]
print(df)
Result in terminal:
0 1 2 3 4
0 IBEX 35derived 8.098,10 -3510 -0,43% NaN
1 US 500derived 3.991,90 355 +0,90% NaN
2 US Tech 100derived 11.802,20 1962 +1,69% NaN
3 Dow Jones 33.747,86 3249 +0,10% NaN
4 DAXderived 14.224,86 7877 +0,56% NaN
5 Índice dólarderived 106255 -1837 -1,70% NaN
6 Índice euroderived 11404 89 +0,79% NaN
See https://pypi.org/project/cloudscraper/
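To pull just the IBEX 35 quote out of that dataframe, a minimal sketch (the column positions are taken from the printout above and may shift if the page layout changes):

# select the IBEX 35 row by the text in column 0 and read the quote in column 1
ibex = df[df[0].str.contains('IBEX 35', na=False)]
print(ibex.iloc[0, 1])   # e.g. '8.098,10' at the time the page was fetched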
I am writing a small program to fetch stock exchange data using Python. The sample code below makes a request to a URL and should return the appropriate data. Here is the resource I am following:
https://python.plainenglish.io/4-python-libraries-to-help-you-make-money-from-webscraping-57ba6d8ce56d
from xml.dom.minidom import Element
from selenium import webdriver
from bs4 import BeautifulSoup
import logging
from selenium.webdriver.common.by import By
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
url = "http://eoddata.com/stocklist/NASDAQ/A.htm"
driver = webdriver.Chrome(executable_path="C:\Program Files\Chrome\chromedriver")
page = driver.get(url)
# TODO: find element by CSS selector
stock_symbol = driver.find_elements(by=By.CSS_SELECTOR, value='#ctl00_cph1_divSymbols')
soup = BeautifulSoup(driver.page_source, features="html.parser")
elements = []
table = soup.find('div', {'id','ct100_cph1_divSymbols'})
logging.info(f"{table}")
I've added a TODO at the point where I try to retrieve the element.
Expected:
The proper data should be returned.
Actual:
Nothing is returned.
It is most common practice to scrape tables with pandas.read_html() to get their text, so I would also recommend it.
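A minimal sketch of that route, assuming plain requests can fetch the page (read_html needs lxml or html5lib installed; it returns every <table> on the page, so you still have to pick out the symbol table):

import pandas as pd
import requests

html = requests.get("https://eoddata.com/stocklist/NASDAQ/A.htm").text
tables = pd.read_html(html)      # one DataFrame per <table> found on the page
for i, t in enumerate(tables):
    print(i, t.shape)            # the largest frame should be the stock list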
But to answer your question and follow your approach, select the <div> and the <table> more specifically:
soup.select('#ctl00_cph1_divSymbols table')
To get and store the data you could iterate the rows and append the results to a list:
data = []
for row in soup.select('#ctl00_cph1_divSymbols table tr:has(td)'):
    d = dict(zip(soup.select_one('#ctl00_cph1_divSymbols table tr:has(th)').stripped_strings, row.stripped_strings))
    d.update({'url': 'https://eoddata.com' + row.a.get('href')})
    data.append(d)
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://eoddata.com/stocklist/NASDAQ/A.htm"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")

data = []
for row in soup.select('#ctl00_cph1_divSymbols table tr:has(td)'):
    d = dict(zip(soup.select_one('#ctl00_cph1_divSymbols table tr:has(th)').stripped_strings, row.stripped_strings))
    d.update({'url': 'https://eoddata.com' + row.a.get('href')})
    data.append(d)

pd.DataFrame(data)
Output
    Code                             Name    High     Low   Close      Volume   Change                                              url
0   AACG        Ata Creativity Global ADR   1.390   1.360   1.380       8,900        0   https://eoddata.com/stockquote/NASDAQ/AACG.htm
1   AACI        Armada Acquisition Corp I   9.895   9.880   9.880       5,400   -0.001   https://eoddata.com/stockquote/NASDAQ/AACI.htm
2  AACIU        Armada Acquisition Corp I   9.960   9.960   9.960         300    -0.01  https://eoddata.com/stockquote/NASDAQ/AACIU.htm
3  AACIW     Armada Acquisition Corp I WT  0.1900  0.1699  0.1700      36,400  -0.0193  https://eoddata.com/stockquote/NASDAQ/AACIW.htm
4   AADI             Aadi Biosciences Inc   13.40   12.66   12.90      98,500    -0.05   https://eoddata.com/stockquote/NASDAQ/AADI.htm
5   AADR  Advisorshares Dorsey Wright ETF   47.49   46.82   47.49       1,100      0.3   https://eoddata.com/stockquote/NASDAQ/AADR.htm
6    AAL             American Airlines Gp   14.44   13.70   14.31  45,193,100    -0.46    https://eoddata.com/stockquote/NASDAQ/AAL.htm
...
I am trying to scrape the proxy list from this site, but I can't find the value inside the textarea tag.
Here is my code:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://openproxy.space/list/azneonYD26")
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find('section', class_='data')
rows =results.find('textarea')
print(rows)
Actually, you can scrape that <script> tag and extract all the proxy data (country, count, all the IPs) with a bit of regex magic and some chained replace().
Here's how:
import json
import re
import requests
from bs4 import BeautifulSoup
page = requests.get("https://openproxy.space/list/azneonYD26").text
scripts = BeautifulSoup(page, "html.parser").find_all("script")
proxy_script = re.search(r"LIST\",data:(.*),code", scripts[2].string).group(1)
proxy_data = json.loads(
    (
        re.sub(r":([a-z])", r':"\1"', proxy_script)
        .replace("code", '"code"')
        .replace("count", '"count"')
        .replace("items", '"items"')
        .replace("active", '"active"')
    )
)

for proxy in proxy_data:
    print(proxy["code"], proxy["count"], proxy["items"][0])
Output:
CN 122 222.129.37.240:57114
US 82 98.188.47.132:4145
DE 51 78.46.218.20:12855
IN 15 43.224.10.37:6667
FR 9 51.195.91.196:9095
AR 8 186.126.181.223:1080
RU 7 217.28.221.10:30005
GB g 46.101.24.42:1080
SG g 8.210.163.246:50001
NL f 188.166.34.137:9000
BD 3 103.85.232.20:1080
NO d 146.59.156.73:9095
CA d 204.101.61.82:4145
BR d 179.189.226.186:8080
HK b 119.28.128.211:1080
AU b 139.99.237.180:9095
VN b 123.16.56.161:1080
KR b 125.135.221.94:54398
TH b 101.108.25.227:9999
BG b 46.10.218.194:1080
AT b 195.144.21.185:1080
VE b 200.35.79.77:1080
IE b 52.214.159.193:9080
ES b 185.66.58.142:42647
JP b 139.162.78.109:1080
UA b 46.151.197.254:8080
PL b 147.135.208.13:9095
If you want to view everything just print out the proxy_data variable.
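To see why the re.sub and the replace() chain are needed: the embedded script data has bare keys and some single-letter values, which is not valid JSON on its own. Here is a tiny made-up fragment in the same shape (the real payload on the page will differ):

import json
import re

fragment = '[{code:b,count:122,items:["1.2.3.4:80"],active:b}]'   # invented sample, not real data
quoted = re.sub(r":([a-z])", r':"\1"', fragment)                   # :b  ->  :"b"
quoted = (quoted.replace("code", '"code"')
                .replace("count", '"count"')
                .replace("items", '"items"')
                .replace("active", '"active"'))
print(json.loads(quoted))
# [{'code': 'b', 'count': 122, 'items': ['1.2.3.4:80'], 'active': 'b'}]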
I'm trying to web scrape this URL: https://www.ventanillaunicaenfermeria.es/BuscarColegiados.php.
I need to gather the values of the "Nº cole." column and the "Nombre Colegiado" column.
I'm using BeautifulSoup, but I only get the values of the "Nº cole." column. How can I fix that?
Thanks!
This is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

page = requests.get('https://www.ventanillaunicaenfermeria.es/BuscarColegiados.php')
soup = BeautifulSoup(page.text, 'html.parser')
data = soup.find_all("span", {'class': 'colColegiado'})
numero_col = []
for i in data:
    data_num = i.text.strip()
    numero_col.append(data_num)
numero_col
['Nº cole.',
'6478',
'13107',
'7341',
'12110',
'5625',
'4877',
'4700',
'9126',
'8444',
'13120',
'5023',
'12235',
'7747',
'17701',
'17391',
'17944',
'17772',
'7230',
'11729',
'17275']
You're currently fetching the values from the wrong html elements - it should be from all <p>s with the resalto class.
import requests
from bs4 import BeautifulSoup
#import pandas as pd
#import numpy as np

page = requests.get('https://www.ventanillaunicaenfermeria.es/BuscarColegiados.php')
soup = BeautifulSoup(page.text, 'html.parser')
data = soup.find_all("p", {'class': 'resalto'})

schools = []
for result in data:
    data_num = result.contents[0].text.strip()
    #numero_col.append(data_num)
    data_name = str(result.contents[1])
    schools.append((data_num, data_name))
print(schools)
Instead of selecting all the <p> elements at once, you can loop over the paragraphs inside the table only. The following code takes a page number and saves the table to a CSV file.
import requests
from bs4 import BeautifulSoup
import pandas as pd

pageno = 1
res = requests.get(f'https://www.ventanillaunicaenfermeria.es/BuscarColegiados.php?nombre=&ap=&colegio=&col=&nif=&pagina={pageno}')
soup = BeautifulSoup(res.text, "html.parser")
header = soup.find("div", {"id": "contactaForm"}).find("h4")
cols = [header.find("span").get_text(), header.get_text().replace(header.find("span").get_text(), "")]

data = []
for p in soup.find("div", {"id": "contactaForm"}).find_all("p"):
    if len(p['class']) == 0 or p['class'][0] == "resalto":
        child = list(p.children)
        data.append([child[0].get_text(strip=True), child[1]])

df = pd.DataFrame(data, columns=cols)
df.to_csv("data.csv", index=False)
print(df)
Output:
Nº cole. Nombre colegiado
0 6478 GUADALUPE LAZARO LAZARO
1 13107 JOSE MARIA PIÑA MANZANO
2 7341 HEIKE ELFRIEDE BIRKHOLZ
3 12110 ESTHER TIZON ROLDAN
4 5625 MARIA DOLORES TOMAS GARCIA-VAQUERO
5 4877 MARIA CARMEN CASADO LLAVONA
6 4700 MANUEL GUILABERT ORTEGA-VILLAIZAN
7 9126 MARIA ESPERANZA ASENSIO ALMAZAN
8 8444 CONCEPCION VIALARD RODRIGUEZ
9 13120 NURIA VILLAESCUSA SANCHEZ
10 5023 ARTURO BONET BLANCO
11 12235 ALFONSO JIMENEZ LOPEZ
12 7747 JACOBUS PETRUS SINNIGE
13 17701 ANIA BRAVO FIGUEREDO
14 17391 LUSINE DAMIRCHYAN
15 17944 ISALKOU DJIL MERHBA
16 17772 CARLA DENISSE FIGUEROA PIEDRA
17 7230 MARIA ISABEL VISO CABAÑERO
18 11729 PILAR GARCIA SALAZAR
19 17275 MARIA LOURDES MALLEN LLUIS
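If you need more than one page, the same parsing can be wrapped in a small function and run for several values of pagina; a minimal sketch (the 1-3 page range is just an example, the site decides how many result pages exist):

import requests
from bs4 import BeautifulSoup
import pandas as pd

def fetch_page(pageno):
    # same parsing as above, wrapped in a function so it can be reused per page
    res = requests.get(f'https://www.ventanillaunicaenfermeria.es/BuscarColegiados.php?nombre=&ap=&colegio=&col=&nif=&pagina={pageno}')
    soup = BeautifulSoup(res.text, "html.parser")
    header = soup.find("div", {"id": "contactaForm"}).find("h4")
    cols = [header.find("span").get_text(), header.get_text().replace(header.find("span").get_text(), "")]
    rows = []
    for p in soup.find("div", {"id": "contactaForm"}).find_all("p"):
        if len(p['class']) == 0 or p['class'][0] == "resalto":
            child = list(p.children)
            rows.append([child[0].get_text(strip=True), child[1]])
    return pd.DataFrame(rows, columns=cols)

df = pd.concat([fetch_page(n) for n in range(1, 4)], ignore_index=True)   # pages 1-3 as an example
df.to_csv("data.csv", index=False)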
I have a dataset from which I want to extract some URLs. The problem is that when I add the extracted values back to the dataframe, the row index is not preserved, so the extracted values no longer line up with the rows they came from.
my_data
username date text extracted_url
0 sports 2018-05-08 13:20 something google.com [google.com]
1 sports 2018-05-08 12:34 two links google.com yahoo.com [google.com, yahoo.com]
2 sports 2018-05-08 12:34 some text without links
3 sports 2018-05-08 12:34 google.com [google.com]
Code
import pandas as pd
import requests
import urllib, urlparse
from urlparse import urlsplit

my_file = pd.read_csv('my_file.csv', sep=';', engine='python', error_bad_lines=False)
df = pd.DataFrame(my_file)
text = my_file['text'].str.extract('(https?://[^>]+)', expand=False).dropna()
print my_file

sep = ' :|\spic|#'
r = text.str.split(pat=sep, expand=False)
se = pd.Series(r)
links = []
item_ids = []
my_file['extracted_links'] = r

for index, row in r.iteritems():
    link = row[0].replace(" ", "")
    response = requests.get(link).url
    base_url = "{0.scheme}://{0.netloc}/".format(urlsplit(response))
    if base_url == "http://www.google.com/":
        item_id = response.rsplit('/', 1)
        links.append(response)
        item_ids.append(item_id[-1])
    else:
        links.append('nan')
        item_ids.append('nan')

df['links'] = pd.Series(links)
df['item_ids'] = pd.Series(item_ids)
df.to_csv('example.csv')
the output that I get
extracted_url links
0 [google.com] google.com
1 [google.com, yahoo.com] google.com
2 google.com
3 [google.com]
expected output:
extracted_url links
0 [google.com] google.com
1 [google.com, yahoo.com] google.com
2 nan nan
3 [google.com] google.com
It is working as expected with the following code now, although I am not sure if this is the most elegant solution:
for index, row in r.iteritems():
    link = row.replace(" ", "")
    response = requests.get(link).url
    base_url = "{0.scheme}://{0.netloc}/".format(urlsplit(response))
    if base_url == "http://www.sxc.com/":
        re = urllib.unquote(response.encode("ascii"))
        item_id = re.rsplit('/', 1)
        df['links'].loc[index] = re
        df['item_ids'].loc[index] = item_id[-1]
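A compact variant of the same idea is to write through df.loc[index, column], so pandas assigns straight into the original rows and the index stays aligned (this also avoids the chained-assignment warning). It reuses r and df from the snippets above; .items() is the current-pandas spelling of iteritems(), and the unquote step is left out for brevity:

import requests
from urllib.parse import urlsplit   # urlparse.urlsplit on Python 2

for index, row in r.items():
    link = row.replace(" ", "")
    final_url = requests.get(link).url
    base_url = "{0.scheme}://{0.netloc}/".format(urlsplit(final_url))
    if base_url == "http://www.sxc.com/":
        # .loc with a row label and a column name keeps the original index intact
        df.loc[index, 'links'] = final_url
        df.loc[index, 'item_ids'] = final_url.rsplit('/', 1)[-1]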