How to extract data from a site (corona) with BeautifulSoup? - python

I want to save the number of articles for each country (as country name, article count pairs) in a file for my research, scraped from the following site. I wrote the code below, but unfortunately it does not work.
http://corona.sid.ir/
!pip install bs4
from bs4 import BeautifulSoup  # this module helps in web scraping
import requests  # this module helps us to download a web page
url = 'http://corona.sid.ir/'
data = requests.get(url).text
soup = BeautifulSoup(data, "lxml")  # create a soup object using the variable 'data'
soup.find_all(attrs={"class":"value"})
Result:
[]

You are using the wrong URL; the per-country counts are served in the world.svg file. Try this:
from bs4 import BeautifulSoup  # this module helps in web scraping
import requests  # this module helps us to download a web page
import pandas as pd

url = 'http://corona.sid.ir/world.svg'
data = requests.get(url).text
soup = BeautifulSoup(data, "lxml")  # create a soup object using the variable 'data'

rows = []
for each in soup.find_all(attrs={"class": "value"}):
    row = {}
    row['country'] = each.text.split(':')[0]
    row['count'] = each.text.split(':')[1].strip()
    rows.append(row)

df = pd.DataFrame(rows)
Output:
print(df)
country count
0 Andorra 17
1 United Arab Emirates 987
2 Afghanistan 67
3 Albania 143
4 Armenia 49
.. ... ...
179 Yemen 54
180 Mayotte 0
181 South Africa 1938
182 Zambia 127
183 Zimbabwe 120
[184 rows x 2 columns]
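Since the goal is to save the country name and article count to a file for your research, the DataFrame can be written straight to CSV; the file name below is just a placeholder:
# Persist the scraped counts; 'corona_articles.csv' is a placeholder name.
df.to_csv('corona_articles.csv', index=False)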

Related

Code gives blank output: it works on only one particular website while both websites are exactly the same

I am scraping a district's school websites, which are all built the same way; every URL is identical except for the school name. The code I use only works on one school, and when I put in another school's name it gives blank output. Can anyone help me see where I am going wrong? Here is the working code:
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://fairfaxhs.fcps.edu/staff-directory?field_last_name_from=&field_last_name_to=&items_per_page=10&keywords=&page={page}'
data = []
for page in range(0, 2):
    soup = BeautifulSoup(requests.get(url.format(page=page)).text, 'lxml')
    try:
        for u in ['https://fairfaxhs.fcps.edu' + link.a.get('href') for link in
                  soup.table.select('tr td[class="views-field views-field-rendered-item"]')]:
            soup2 = BeautifulSoup(requests.get(u).text, 'lxml')
            d = {
                'Name': soup2.select_one('h1.node__title.fcps-color--dark11').get_text(strip=True),
                'Position': soup2.select_one('h1+div').get_text(strip=True),
                'contact_url': u
            }
            data.append(d)
    except:
        pass
df = pd.DataFrame(data).to_csv('fcps_school.csv', index=False)
print(df)
Here is the other URL I am trying to scrape:
https://aldrines.fcps.edu/staff-directory?keywords=&field_last_name_from=&field_last_name_to=&items_per_page=10&page=
https://aldrines.fcps.edu
I've scraped 10 pages as an example without changing anything in the existing code; it works fine and produces the same output in the CSV file.
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://fairfaxhs.fcps.edu/staff-directory?field_last_name_from=&field_last_name_to=&items_per_page=10&keywords=&page={page}'
data = []
for page in range(0, 10):
    soup = BeautifulSoup(requests.get(url.format(page=page)).text, 'lxml')
    try:
        for u in ['https://fairfaxhs.fcps.edu' + link.a.get('href') for link in
                  soup.table.select('tr td[class="views-field views-field-rendered-item"]')]:
            soup2 = BeautifulSoup(requests.get(u).text, 'lxml')
            d = {
                'Name': soup2.select_one('h1.node__title.fcps-color--dark11').get_text(strip=True),
                'Position': soup2.select_one('h1+div').get_text(strip=True),
                'contact_url': u
            }
            data.append(d)
    except:
        pass
df = pd.DataFrame(data)  # .to_csv('fcps_school.csv', index=False)
print(df)
Output
Name Position contact_url
0 Bouchera Abutaa Instructional Assistant https://fairfaxhs.fcps.edu/staff/bouchera-abutaa
1 Margaret Aderton Substitute Teacher - Regular Term https://fairfaxhs.fcps.edu/staff/margaret-aderton
2 Aja Adu-Gyamfi School Counselor, HS https://fairfaxhs.fcps.edu/staff/aja-adu-gyamfi
3 Paul Agyeman Custodian II https://fairfaxhs.fcps.edu/staff/paul-agyeman
4 Jin Ahn Food Services Worker https://fairfaxhs.fcps.edu/staff/jin-ahn
.. ... ... ...
95 Tiffany Haddock School Counselor, HS https://fairfaxhs.fcps.edu/staff/tiffany-haddock
96 Heather Hakes Learning Disabilities Teacher, MS/HS https://fairfaxhs.fcps.edu/staff/heather-hakes
97 Gabrielle Hall History & Social Studies Teacher, HS https://fairfaxhs.fcps.edu/staff/gabrielle-hall
98 Sydney Hamrick English Teacher, HS https://fairfaxhs.fcps.edu/staff/sydney-hamrick
99 Anne-Marie Hanapole Biology Teacher, HS https://fairfaxhs.fcps.edu/staff/anne-marie-ha...
[100 rows x 3 columns]
Update:
Actually, the success of web scraping doesn't only depend on good coding skills; about half of it depends on understanding the website well. The domain names
1. https://fairfaxhs.fcps.edu
2. https://aldrines.fcps.edu
aren't the same, and the h1 tag's class value differs slightly; otherwise both websites share the same structure.
Working code:
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://aldrines.fcps.edu/staff-directory?field_last_name_from=&field_last_name_to=&items_per_page=10&keywords=&page={page}'
data = []
for page in range(0, 10):
    soup = BeautifulSoup(requests.get(url.format(page=page)).text, 'lxml')
    try:
        for u in ['https://aldrines.fcps.edu' + link.a.get('href') for link in
                  soup.table.select('tr td[class="views-field views-field-rendered-item"]')]:
            soup2 = BeautifulSoup(requests.get(u).text, 'lxml')
            d = {
                'Name': soup2.select_one('h1.node__title.fcps-color--dark7').get_text(strip=True),
                'Position': soup2.select_one('h1+div').get_text(strip=True),
                'contact_url': u
            }
            data.append(d)
    except:
        pass
df = pd.DataFrame(data)  # .to_csv('fcps_school.csv', index=False)
print(df)
Output:
Name ... contact_url
0 Jamileh Abu-Ghannam ... https://aldrines.fcps.edu/staff/jamileh-abu-gh...
1 Linda Adgate ... https://aldrines.fcps.edu/staff/linda-adgate
2 Rehab Ahmed ... https://aldrines.fcps.edu/staff/rehab-ahmed
3 Richard Amernick ... https://aldrines.fcps.edu/staff/richard-amernick
4 Laura Arm ... https://aldrines.fcps.edu/staff/laura-arm
.. ... ... ...
95 Melissa Weinhaus ... https://aldrines.fcps.edu/staff/melissa-weinhaus
96 Kathryn Wheeler ... https://aldrines.fcps.edu/staff/kathryn-wheeler
97 Latoya Wilson ... https://aldrines.fcps.edu/staff/latoya-wilson
98 Shane Wolfe ... https://aldrines.fcps.edu/staff/shane-wolfe
99 Michael Woodring ... https://aldrines.fcps.edu/staff/michael-woodring
[100 rows x 3 columns]
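Since only the color-class suffix (dark11 vs dark7) differs between the two schools, dropping the suffix from the selector should let the same lookup work on both sites. A small untested sketch, assuming both pages keep the node__title class:
def extract_staff(soup2, u):
    # 'h1.node__title' matches on both fairfaxhs and aldrines pages, since
    # only the fcps-color--* suffix differs between the two sites.
    return {
        'Name': soup2.select_one('h1.node__title').get_text(strip=True),
        'Position': soup2.select_one('h1+div').get_text(strip=True),
        'contact_url': u,
    }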

InvalidSchema: No connection adapters were found for "link"?

I have a dataset with multiple links and I'm trying to get the text of all the links using the code below, but I'm getting an error message: InvalidSchema: No connection adapters were found for "'https://en.wikipedia.org/wiki/Wagner_Group'".
Dataset:
links
'https://en.wikipedia.org/wiki/Wagner_Group'
'https://en.wikipedia.org/wiki/Vladimir_Putin'
'https://en.wikipedia.org/wiki/Islam_in_Russia'
The code I'm using to web-scrape is:
def get_data(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    text = ""
    for paragraph in soup.find_all('p'):
        text += paragraph.text
    return text
#works fine
url = 'https://en.wikipedia.org/wiki/M142_HIMARS'
get_data(url)
#Doesn't work
df['links'].apply(get_data)
Error: InvalidSchema: No connection adapters were found for "'https://en.wikipedia.org/wiki/Wagner_Group'"
Thank you in advance.
It works just fine when I apply it to a single URL, but it doesn't work when I apply it to a dataframe.
The error message shows that the stored links contain literal quote characters ('https://...'), so requests cannot find a connection adapter for the quoted string; that is why a plain URL works but df['links'].apply(get_data) fails on this dataset.
One straightforward approach is to loop over clean URLs as follows:
Example:
import requests
from bs4 import BeautifulSoup
import pandas as pd

links = [
    'https://en.wikipedia.org/wiki/Wagner_Group',
    'https://en.wikipedia.org/wiki/Vladimir_Putin',
    'https://en.wikipedia.org/wiki/Islam_in_Russia']

data = []
for url in links:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')
    for pra in soup.select('div[class="mw-parser-output"] > table~p'):
        paragraph = pra.get_text(strip=True)
        data.append({
            'paragraph': paragraph
        })
# print(data)
df = pd.DataFrame(data)
print(df)
Output:
paragraph
0 TheWagner Group(Russian:Группа Вагнера,romaniz...
1 The group came to global prominence during the...
2 Because it often operates in support of Russia...
3 The Wagner Group first appeared in Ukraine in ...
4 The Wagner Group itself was first active in 20...
.. ...
440 A record 18,000 Russian Muslim pilgrims from a...
441 For centuries, theTatarsconstituted the only M...
442 A survey published in 2019 by thePew Research ...
443 Percentage of Muslims in Russia by region:
444 According to the 2010 Russian census, Moscow h...
[445 rows x 1 columns]
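Alternatively, if you prefer to keep the .apply call, stripping the literal quote characters out of the stored links should also work; a minimal sketch, assuming the links column really holds quoted strings as shown in the dataset above:
# Hypothetical fix for the original dataframe approach: remove the stray
# quotes before handing each URL to get_data.
df['text'] = df['links'].str.strip("'").apply(get_data)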

Trouble scraping sports table - python

I'm having a lot of trouble scraping results off a sporting table in Python.
I am new to scraping and have tried everything I can find online.
The website is https://www.nrl.com/stats/teams/?competition=111&season=2022&stat=38
and I'm just trying to get the team name and tries.
Any suggestions would be appreciated.
Make sure you have the required packages. If you are using Anaconda, go to the Anaconda command line and type:
> pip install beautifulsoup4
> pip install requests
Now you can use a scraping library called BeautifulSoup: you specify the element (for example, a div class) you want from the HTML source of the website, pass your link, and the library collects the matching data. Example:
import requests
from bs4 import BeautifulSoup

# create the variable page
page = requests.get('https://www.imdb.com/title/tt7286456/criticreviews?ref_=tt_ov_rt')
# create the variable soup by parsing our web page with BeautifulSoup
soup = BeautifulSoup(page.text, 'html.parser')
file = open("data.txt", "w")  # don't forget to create a data.txt file in the same directory as your file.py
# map the divs with class_="summary" using find_all()
for x in soup.find_all('div', class_='summary'):
    print(x)             # print the scraped data
    file.write(x.text)   # store the data in data.txt
    file.write("\n")
file.close()
More details in my scraping project: https://github.com/mehdimaaref7/Scrapping-Sentiment-Analysis/blob/master/big_data.py
The data.txt file is optional; you can just use the variable x with print(x) to display the data you need to scrape.
No need to iterate over the HTML elements. Just grab the JSON embedded in the page and let pandas parse it into a DataFrame:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json
url = 'https://www.nrl.com/stats/teams/?competition=111&season=2022&stat=38'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
data = soup.find('div', {'id':'vue-stats-detail'})['q-data']
jsonData = json.loads(data)
df = pd.DataFrame(jsonData['totalStats']['leaders'])
Output:
print(df)
teamNickName ... played
0 Storm ... 24
1 Roosters ... 24
2 Rabbitohs ... 24
3 Panthers ... 24
4 Eels ... 24
5 Cowboys ... 24
6 Sharks ... 24
7 Broncos ... 24
8 Sea Eagles ... 24
9 Raiders ... 24
10 Titans ... 24
11 Dragons ... 24
12 Knights ... 24
13 Warriors ... 24
14 Bulldogs ... 24
15 Wests Tigers ... 24
[16 rows x 5 columns]
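The question asks for just the team name and tries. The exact key that holds the tries count depends on the JSON payload, so inspect the columns first; the 'tries' name below is a hypothetical placeholder:
# List the available columns, then keep only the two of interest.
print(df.columns.tolist())
# df_tries = df[['teamNickName', 'tries']]  # replace 'tries' with the actual column name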

How to get all products from a beautifulsoup page

I want to get all the products on this page:
nike.com.br/snkrs#estoque
My python code is this:
produtos = []
def aviso():
    print("Started!")
    request = requests.get("https://www.nike.com.br/snkrs#estoque")
    soup = bs4(request.text, "html.parser")
    links = soup.find_all("a", class_="btn", text="Comprar")
    links_filtred = list(set(links))
    for link in links_filtred:
        if produto not in produtos:
            request = requests.get(f"{link['href']}")
            soup = bs4(request.text, "html.parser")
            produto = soup.find("div", class_="nome-preco-produto").get_text()
            if code_formated == "":
                code_formated = "\u200b"
            print(f"Nome: {produto} Link: {link['href']}\n")
            produtos.append(link["href"])
aviso()
This code gets products from the page, but not all of them. I suspect the content is dynamic. How can I get all of the products with requests and BeautifulSoup? I don't want to use Selenium or an automation library, and I don't want to change my code much because it's almost done. How do I do that?
DO NOT call requests.get repeatedly when you are hitting the same host.
Reason: a requests.Session reuses the underlying TCP connection, which is faster and kinder to the server.
import requests
from bs4 import BeautifulSoup
import pandas as pd

def main(url):
    allin = []
    with requests.Session() as req:
        for page in range(1, 6):
            params = {
                'p': page,
                'demanda': 'true'
            }
            r = req.get(url, params=params)
            soup = BeautifulSoup(r.text, 'lxml')
            goal = [(x.find_next('h2').get_text(strip=True, separator=" "), x['href'])
                    for x in soup.select('.aspect-radio-box')]
            allin.extend(goal)
    df = pd.DataFrame(allin, columns=['Title', 'Url'])
    print(df)

main('https://www.nike.com.br/Snkrs/Feed')
Output:
Title Url
0 Dunk High x Fragment design Black https://www.nike.com.br/dunk-high-x-fragment-d...
1 Dunk Low Infantil (16-26) City Market https://www.nike.com.br/dunk-low-infantil-16-2...
2 ISPA Flow 2020 Desert Sand https://www.nike.com.br/ispa-flow-2020-153-169...
3 ISPA Flow 2020 Pure Platinum https://www.nike.com.br/ispa-flow-2020-153-169...
4 Nike iSPA Men's Lightweight Packable Jacket https://www.nike.com.br/nike-ispa-153-169-211-...
.. ... ...
115 Air Jordan 1 Mid Hyper Royal https://www.nike.com.br/air-jordan-1-mid-153-1...
116 Dunk High Orange Blaze https://www.nike.com.br/dunk-high-153-169-211-...
117 Air Jordan 5 Stealth https://www.nike.com.br/air-jordan-5-153-169-2...
118 Air Jordan 3 Midnight Navy https://www.nike.com.br/air-jordan-3-153-169-2...
119 Air Max 90 Bacon https://www.nike.com.br/air-max-90-153-169-211...
[120 rows x 2 columns]
To get the data you can send a request to:
https://www.nike.com.br/Snkrs/Estoque?p=<PAGE>&demanda=true
providing a page number between 1 and 5 for p= in the URL.
For example, to print the links, you can try:
import requests
from bs4 import BeautifulSoup

url = "https://www.nike.com.br/Snkrs/Estoque?p={page}&demanda=true"
for page in range(1, 6):
    response = requests.get(url.format(page=page))
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.find_all("a", class_="btn", text="Comprar"))
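If you also want the product name behind each link, as in the original code, you can visit each href inside the same loop; a minimal sketch, assuming the nome-preco-produto class from the question still matches:
import requests
from bs4 import BeautifulSoup

url = "https://www.nike.com.br/Snkrs/Estoque?p={page}&demanda=true"
for page in range(1, 6):
    response = requests.get(url.format(page=page))
    soup = BeautifulSoup(response.content, "html.parser")
    for link in soup.find_all("a", class_="btn", text="Comprar"):
        product_page = requests.get(link["href"])                          # follow each product link
        product_soup = BeautifulSoup(product_page.text, "html.parser")
        name_tag = product_soup.find("div", class_="nome-preco-produto")   # class taken from the question
        if name_tag:
            print(name_tag.get_text(strip=True), link["href"])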

Web scraping in Python: for loop doesn't return expected data

I'm having an issue scraping the F1 website with BeautifulSoup: I have specified the data I need using a for loop, but I am only retrieving one result instead of all the results within the class.
Below is my following code
import requests
from bs4 import BeautifulSoup
from csv import writer

page = requests.get("https://www.formula1.com/")
soup = BeautifulSoup(page.content, 'html.parser')
data = soup.find_all("div", class_="race-list")
for container in data:
    countryname = container.find_all("span", class_="name")
    country = countryname[0].text
    racetype = container.find_all("span", class_="race-type")
    rtype = racetype[0].text
    racetime = container.find_all("time", class_="day")
    racetimename = racetime[0].text.replace("\n", "").strip()
    print(country)
My Current Output -
Australia
Expected Output -
Australia
Bahrain
China
etc
Thanks in advance!
The culprit:
country = countryname[0].text
The reason:
There are 21 countries, but you're only fetching the first one, at the zeroth index, i.e.
country = countryname[0].text
The answer:
Loop through countryname to reach all the elements:
import requests
from bs4 import BeautifulSoup
from csv import writer

page = requests.get("https://www.formula1.com/")
soup = BeautifulSoup(page.content, 'html.parser')
data = soup.find_all("div", class_="race-list")
# print(data)
for container in data:
    countryname = container.find_all("span", class_="name")
    racetype = container.find_all("span", class_="race-type")
    racetime = container.find_all("time", class_="day")
    # iterate over every race instead of only the first entry of each list
    for name, rt, day in zip(countryname, racetype, racetime):
        country = name.text
        rtype = rt.text
        racetimename = day.text.replace("\n", "").strip()
        print(country)
OUTPUT:
Australia
Bahrain
China
Azerbaijan
Spain
Monaco
Canada
France
Austria
Great Britain
Germany
Hungary
Belgium
Italy
Singapore
Russia
Japan
Mexico
United States
Brazil
Abu Dhabi
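Since the question already imports csv.writer but never uses it, here is a minimal sketch that writes the scraped rows to a CSV file; the file name and column layout are assumptions:
import csv
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.formula1.com/")
soup = BeautifulSoup(page.content, "html.parser")

with open("races.csv", "w", newline="") as f:  # hypothetical output file name
    w = csv.writer(f)
    w.writerow(["country", "race_type", "race_day"])
    for container in soup.find_all("div", class_="race-list"):
        countryname = container.find_all("span", class_="name")
        racetype = container.find_all("span", class_="race-type")
        racetime = container.find_all("time", class_="day")
        # write one row per race, pairing the three lists positionally
        for name, rt, day in zip(countryname, racetype, racetime):
            w.writerow([name.text, rt.text, day.text.replace("\n", "").strip()])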
