I have a dataset with multiple links and I'm trying to get the text of all the links using the code below, but I'm getting an error message: InvalidSchema: No connection adapters were found for "'https://en.wikipedia.org/wiki/Wagner_Group'".
Dataset:
links
'https://en.wikipedia.org/wiki/Wagner_Group'
'https://en.wikipedia.org/wiki/Vladimir_Putin'
'https://en.wikipedia.org/wiki/Islam_in_Russia'
The code I'm using to web-scrape is:
def get_data(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    text = ""
    for paragraph in soup.find_all('p'):
        text += paragraph.text
    return text
#works fine
url = 'https://en.wikipedia.org/wiki/M142_HIMARS'
get_data(url)
#Doesn't work
df['links'].apply(get_data)
Error: InvalidSchema: No connection adapters were found for "'https://en.wikipedia.org/wiki/Wagner_Group'"
Thank you in advance
It works just fine when I apply it to a single URL, but it doesn't work when I apply it to a dataframe.
df['links'].apply(get_data) is not the problem by itself; the error message shows that each value in the links column is wrapped in literal quote characters, so requests receives "'https://...'" rather than a plain URL. You can try one of the following approaches:
Example:
import requests
from bs4 import BeautifulSoup
import pandas as pd

links = [
    'https://en.wikipedia.org/wiki/Wagner_Group',
    'https://en.wikipedia.org/wiki/Vladimir_Putin',
    'https://en.wikipedia.org/wiki/Islam_in_Russia']

data = []
for url in links:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')
    for pra in soup.select('div[class="mw-parser-output"] > table~p'):
        paragraph = pra.get_text(strip=True)
        data.append({
            'paragraph': paragraph
        })

# print(data)
df = pd.DataFrame(data)
print(df)
Output:
paragraph
0 TheWagner Group(Russian:Группа Вагнера,romaniz...
1 The group came to global prominence during the...
2 Because it often operates in support of Russia...
3 The Wagner Group first appeared in Ukraine in ...
4 The Wagner Group itself was first active in 20...
.. ...
440 A record 18,000 Russian Muslim pilgrims from a...
441 For centuries, theTatarsconstituted the only M...
442 A survey published in 2019 by thePew Research ...
443 Percentage of Muslims in Russia by region:
444 According to the 2010 Russian census, Moscow h...
[445 rows x 1 columns]
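Alternatively, if the only issue is the literal quote characters wrapped around each URL in the dataframe (an assumption based on the error text, which shows "'https://...'" rather than a bare URL), stripping them first should let the original apply call work. A minimal sketch, assuming df and get_data are exactly as defined in the question:
# remove the surrounding ' characters from each link, then fetch each page as before
cleaned = df['links'].str.strip("'")
df['text'] = cleaned.apply(get_data)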
I'm trying to scrape a GitHub .md file. I have made a Python scraper but I'm kind of stuck on how to actually get the data I want. The page has a long list of job listings, all in separate li elements, and I want to get the a elements. After each a element there is plain text separated by |, which I want to scrape as well. I really want this to end up as a CSV file with the a tag as a column, the location text before the | as a column, and the remaining description text as a column.
Here's my code:
from bs4 import BeautifulSoup
import requests
import json

def getLinkData(link):
    return requests.get(link).content

content = getLinkData('https://github.com/poteto/hiring-without-whiteboards/blob/master/README.md')
soup = BeautifulSoup(content, 'html.parser')
ul = soup.find_all('ul')
li = soup.find_all("li")
data = []
for uls in ul:
    rows = uls.find_all('a')
    data.append(rows)
print(data)
When I run this I get the a tags, but obviously not the rest yet. There seem to be a few other ul elements that are included. I just want the one with all the job li elements, but neither the li elements nor the ul have any ids or classes. Any suggestions on how to accomplish what I want? Maybe add pandas into this (not sure how)?
import requests
import pandas as pd
import numpy as np

url = 'https://raw.githubusercontent.com/poteto/hiring-without-whiteboards/master/README.md'
res = requests.get(url).text

jobs = res.split('## A - C\n\n')[1].split('\n\n## Also see')[0]
jobs = [j[3:] for j in jobs.split('\n') if j.startswith('- [')]

df = pd.DataFrame(columns=['Company', 'URL', 'Location', 'Info'])
for i, job in enumerate(jobs):
    company, rest = job.split(']', 1)
    url, rest = rest[1:].split(')', 1)
    rest = rest.split(' | ')
    if len(rest) == 3:
        _, location, info = rest
    else:
        _, location = rest
        info = np.NaN
    df.loc[i, :] = (company, url, location, info)

df.to_csv('file.csv')
print(df.head())
prints (first five rows; each shown here as index, Company | URL | Location | Info):
0  Able | https://able.co/careers | Lima, PE / Remote | Coding interview, Technical interview (Backlog Refinement + System Design), Leadership interview (Behavioural)
1  Abstract | https://angel.co/abstract/jobs | San Francisco, CA | NaN
2  Accenture | https://www.accenture.com/us-en/careers | San Francisco, CA / Los Angeles, CA / New York, NY / Kuala Lumpur, Malaysia | Technical phone discussion with architecture manager, followed by behavioral interview focusing on soft skills
3  Accredible | https://www.accredible.com/careers | Cambridge, UK / San Francisco, CA / Remote | Take home project, then a pair-programming and discussion onsite / Skype round.
4  Acko | https://acko.com | Mumbai, India | Phone interview, followed by a small take home problem. Finally a F2F or skype pair programming session
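An alternative is to parse the rendered GitHub page itself with BeautifulSoup, selecting only the job lists that immediately follow the section headings and splitting the text that comes after each link on ' | ':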
import requests
from bs4 import BeautifulSoup
import pandas as pd
from itertools import chain

def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    goal = list(chain.from_iterable([[(
        i['href'],
        i.get_text(strip=True),
        *(i.next_sibling[3:].split(' | ', 1) if i.next_sibling else [''] * 2))
        for i in x.select('a')] for x in soup.select(
        'h2[dir=auto] + ul', limit=9)]))
    df = pd.DataFrame(goal)
    df.to_csv('data.csv', index=False)

main('https://github.com/poteto/hiring-without-whiteboards/blob/master/README.md')
I've been trying to format terminal output in line with this stack.
Yet for some reason, the only method that works is the use of columnar but that's limited to showing only 3 rows of text.
I've tried almost all of the methods and yet I almost always get an output that looks like this:
[Payne, Roberts and Davis, Vasquez-Davidson, Jackson, Chambers and Levy, Savage-Bradley, Ramirez Inc, Rogers-Yates, Kramer-Klein, Meyers-Johnson, Hughes-Williams, Jones, Williams and Villa, Garcia PLC, Gregory and Sons, Clark, Garcia and Sosa, Bush PLC, Salazar-Meyers, Parker, Murphy and Brooks, Cruz-Brown, Macdonald-Ferguson, Williams, Peterson and Rojas, Smith and Sons, Moss, Duncan]
I've been trying to learn how to scrape a website and display the output in a readable format.
import requests
from bs4 import BeautifulSoup
from columnar import columnar
import numpy as np

URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="ResultsContainer")
job_elements = results.find_all("div", class_="card-content")

title = []
company = []
location = []
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    formatted_title_element = title_element.text.strip()
    formatted_company_element = company_element.text.strip()
    formatted_location_element = location_element.text.strip()
    title.append(formatted_title_element)
    company.append(formatted_company_element)
    location.append(formatted_location_element)

data = []
data.append(title)
data.append(company)
data.append(location)

headers = ['Title', 'Company', 'Location']
table = columnar(data, headers, no_borders=True)
print(table)
The columnar solution above is the only one from that stack that doesn't just produce output like the example at the top of the question, but again, it only outputs 3 lines. columnar does have a head=x argument which is meant to show x rows, however when I use it I get the same output as in the example at the very top.
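A likely cause (an assumption, since columnar expects data to be a list of rows): passing the three column lists title, company and location makes each of them one row, which would explain seeing exactly three rows. Transposing them into one row per job should print the full table; a minimal sketch reusing the lists built above:
# zip the three parallel column lists into one [title, company, location] row per job
rows = [list(row) for row in zip(title, company, location)]
table = columnar(rows, headers, no_borders=True)
print(table)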
I have the following url https://www.gbgb.org.uk/greyhound-profile/?greyhoundId=517801 where the last 6 digits are a unique identifier for a specific runner. I want to find all of the 6-digit unique identifiers on this page.
I've tried to scrape all urls on the page (code shown below), but unfortunately I only get a high-level summary rather than an in-depth list, which should contain >5000 runners. I'm hoping to get a list/dataframe which shows:
https://www.gbgb.org.uk/greyhound-profile/?greyhoundId=517801
https://www.gbgb.org.uk/greyhound-profile/?greyhoundId=500000
https://www.gbgb.org.uk/greyhound-profile/?greyhoundId=500005
etc.
This is what I've been able to do so far. I appreciate any help!
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

req = Request("https://www.gbgb.org.uk//greyhound-profile//")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")

links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))

print(links)
Thanks for the help in advance!
The data is loaded dynamically from an external API URL. You can use the following example to load the data (with the IDs):
import json
import requests

api_url = "https://api.gbgb.org.uk/api/results/dog/517801"  # <-- 517801 is the ID from your URL in the question
params = {"page": "1", "itemsPerPage": "20", "race_type": "race"}

page = 1
while True:
    params["page"] = page
    data = requests.get(api_url, params=params).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    if not data["items"]:
        break

    for i in data["items"]:
        print(
            "{:<30} {}".format(
                i.get("winnerOr2ndName", ""), i.get("winnerOr2ndId", "")
            )
        )

    page += 1
Prints:
Ferndale Boom 534358
Laganore Mustang 543937
Tickity Kara 535237
Thor 511842
Ballyboughlewiss 519556
Beef Cakes 551323
Distant Millie 546674
Lissan Kels 525148
Rosstemple Marko 534276
Happy Harry 550042
Porthall Ella 550841
Southlodge Eden 531677
Effernogue Beef 547416
Faydas Truffle 528780
Johns Lass 538763
Faydas Truffle 528780
Toms Hero 543659
Affane Buzz 547555
Emkay Flyer 531456
Ballymac Tilly 492923
Kilcrea Duke 542178
Sporting Sultan 541880
Droopys Poet 542020
Shortwood Elle 527241
Rosstemple Marko 534276
Erics Bozo 541863
Swift Launch 536667
Longsearch 523017
Swift Launch 536667
Takemyhand 535023
Floral Print 527192
Rustys Aero 497270
Autumn Dapper 519528
Droopys Kiwi 511989
Deep Chest 520634
Newtack Henry 525511
Indian Nightmare 524636
Lady Mascara 528399
Tarsna Yankee 517373
Leathems Act 516918
Final Star 514015
Ascot Faye 500812
Ballymac Ernie 503569
You can convert the result content to a pandas DataFrame and then just use the winnerOr2ndName and winnerOr2ndId columns.
Example
import json
import requests
import pandas as pd

def get_items(dog_id):
    url = f"https://api.gbgb.org.uk/api/results/dog/{dog_id}?page=-1"
    params = {"page": "-1", "itemsPerPage": "20", "race_type": "race"}
    response = requests.get(url, params=params).json()
    MAX_PAGES = response["meta"]["pageCount"]
    result = pd.DataFrame(pd.DataFrame(response["items"]).loc[:, ['winnerOr2ndName', 'winnerOr2ndId']].dropna())
    result["winnerOr2ndId"] = result["winnerOr2ndId"].astype(int)
    while int(params.get("page")) < MAX_PAGES:
        params["page"] = str(int(params.get("page")) + 1)
        response = requests.get(url, params=params).json()
        new_items = pd.DataFrame(pd.DataFrame(response["items"]).loc[:, ['winnerOr2ndName', 'winnerOr2ndId']].dropna())
        new_items["winnerOr2ndId"] = new_items["winnerOr2ndId"].astype(int)
        result = pd.concat([result, new_items])
    return result.drop_duplicates()
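For example, calling it with the ID from the question (a usage sketch):
df = get_items(517801)
print(df.head())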
It would generate a dataframe looking like this:
For my research work, I want to save, for each country, the country name and the number of articles to a file, using data from the following site. To do this, I wrote the code below, which unfortunately does not work.
http://corona.sid.ir/
!pip install bs4
from bs4 import BeautifulSoup  # this module helps in web scraping.
import requests  # this module helps us to download a web page

url = 'http://corona.sid.ir/'
data = requests.get(url).text
soup = BeautifulSoup(data, "lxml")  # create a soup object using the variable 'data'
soup.find_all(attrs={"class": "value"})
Result: []
You are using the wrong url. Try this:
from bs4 import BeautifulSoup  # this module helps in web scraping.
import requests  # this module helps us to download a web page
import pandas as pd

url = 'http://corona.sid.ir/world.svg'
data = requests.get(url).text
soup = BeautifulSoup(data, "lxml")  # create a soup object using the variable 'data'

rows = []
for each in soup.find_all(attrs={"class": "value"}):
    row = {}
    row['country'] = each.text.split(':')[0]
    row['count'] = each.text.split(':')[1].strip()
    rows.append(row)

df = pd.DataFrame(rows)
Output:
print(df)
country count
0 Andorra 17
1 United Arab Emirates 987
2 Afghanistan 67
3 Albania 143
4 Armenia 49
.. ... ...
179 Yemen 54
180 Mayotte 0
181 South Africa 1938
182 Zambia 127
183 Zimbabwe 120
[184 rows x 2 columns]
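To write the result to a file, as the question asks, one extra line is enough (the filename is just an example):
# save the country/count table to a CSV file
df.to_csv('corona_article_counts.csv', index=False)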
sorry for the noobish question.
I'm learning to use BeautifulSoup, and I'm trying to extract a specific string of data within a table.
The website is https://airtmrates.com/ and the exact string I'm trying to get is:
VES Bolivar Soberano Bank Value Value Value
The table doesn't have any class so I have no idea how to find and parse that string.
I've been trying things more or less at random and failing miserably. Here's the last code I tried so you can have a laugh:
def airtm():
    # URLs and BS execution
    url = requests.get("https://airtmrates.com/")
    response = requests.get(url)
    html = response.content
    soup_ = soup(url, 'html.parser')
    columns = soup_.findAll('td', text=re.compile('VES'), attrs={'::before'})
    return columns
The page is dynamic, meaning you'll need the page to render before parsing. You can do that with either Selenium or Requests-HTML (a Requests-HTML sketch appears after the Selenium output below).
I'm not too familiar with Requests-HTML, but I have used Selenium in the past. This should get you going. Also, whenever I'm looking to pull a <table> tag, I like to use pandas to parse it. BeautifulSoup can still be used, it just takes a little more work to iterate through the table, tr, and td tags. pandas can do that work for you with .read_html():
from selenium import webdriver
import pandas as pd

def airtm(url):
    # URLs and BS execution
    driver = webdriver.Chrome("C:/chromedriver_win32/chromedriver.exe")
    driver.get(url)

    tables = pd.read_html(driver.page_source)
    df = tables[0]
    df = df[df['Code'] == 'VES']

    driver.close()
    return df

results = airtm('https://airtmrates.com/')
results = airtm('https://airtmrates.com/')
Output:
print (results)
Code Name Method Rate Buy Sell
120 VES Bolivar Soberano Bank 2526.7 2687.98 2383.68
143 VES Bolivar Soberano Mercado Pago 2526.7 2631.98 2429.52
264 VES Bolivar Soberano MoneyGram 2526.7 2776.59 2339.54
455 VES Bolivar Soberano Western Union 2526.7 2746.41 2383.68
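Since Requests-HTML was mentioned as the other option, here is a minimal sketch of that route (an assumption, not tested here; requests-html downloads a headless Chromium the first time render() is called):
from requests_html import HTMLSession
import pandas as pd

session = HTMLSession()
r = session.get('https://airtmrates.com/')
r.html.render()  # execute the page's JavaScript in a headless browser
tables = pd.read_html(r.html.html)  # parse the rendered HTML tables
df = tables[0]
print(df[df['Code'] == 'VES'])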