How to get data when web scraping in Python

https://mor.nlm.nih.gov/RxNav/search?searchBy=NDC&searchTerm=51079045120
How can I get the RxCUI number from this website in Python? I am unable to get it with the following:
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
cont = soup.findAll("div", {"id": "titleHolder"})
rx = cont.find("span", id = "rxcuiDecoration")
print(rx.text)

The site renders its content with JavaScript, so you have to use the API instead:
import requests
r = requests.get('https://rxnav.nlm.nih.gov/REST/ndcstatus.json?caller=RxNav&ndc=51079045120')
print(r.json()['ndcStatus']['rxcui'])
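If you need more than a one-off lookup, the call can be wrapped in a small helper with a timeout and error handling. This is just a sketch around the same ndcstatus endpoint; the function names are mine, not part of RxNav.

```python
import requests

ENDPOINT = "https://rxnav.nlm.nih.gov/REST/ndcstatus.json"

def build_params(ndc, caller="RxNav"):
    # Same query parameters as the request above.
    return {"caller": caller, "ndc": ndc}

def get_rxcui(ndc):
    """Return the RxCUI for an NDC, or None when the response lacks one."""
    r = requests.get(ENDPOINT, params=build_params(ndc), timeout=10)
    r.raise_for_status()  # surface HTTP errors instead of crashing on bad JSON
    return r.json().get("ndcStatus", {}).get("rxcui")
```

Using requests' params argument also takes care of URL encoding for you.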


Pulling p tags from multiple URLs

I've struggled with this for days and I'm not sure what the issue could be. Basically, I'm trying to extract the profile box data (picture below) from each link; going through the inspector, I thought I could pull the p tags to do so.
I'm new to this and trying to understand, but here's what I have thus far:
First, code that (somewhat) successfully pulls the info for ONE link:
import requests
from bs4 import BeautifulSoup
# getting html
url = 'https://basketball.realgm.com/player/Darius-Adams/Summary/28720'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'html.parser')
container = soup.find('div', attrs={'class', 'main-container'})
playerinfo = container.find_all('p')
print(playerinfo)
I also have code that pulls all of the href attributes from multiple links:
from bs4 import BeautifulSoup
import requests
def get_links(url):
    links = []
    website = requests.get(url)
    website_text = website.text
    soup = BeautifulSoup(website_text)
    for link in soup.find_all('a'):
        links.append(link.get('href'))
    for link in links:
        print(link)
    print(len(links))
get_links('https://basketball.realgm.com/dleague/players/2022')
get_links('https://basketball.realgm.com/dleague/players/2021')
get_links('https://basketball.realgm.com/dleague/players/2020')
So basically, my goal is to combine these two into one script that will pull all of the p tags from multiple URLs. I've been trying to do it, and I'm really not sure why this isn't working:
from bs4 import BeautifulSoup
import requests
def get_profile(url):
    profiles = []
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    container = soup.find('div', attrs={'class', 'main-container'})
    for profile in container.find_all('a'):
        profiles.append(profile.get('p'))
    for profile in profiles:
        print(profile)
get_profile('https://basketball.realgm.com/player/Darius-Adams/Summary/28720')
get_profile('https://basketball.realgm.com/player/Marial-Shayok/Summary/26697')
Again, I'm really new to web scraping with Python, but any advice would be greatly appreciated. Ultimately, my end goal is a tool that can scrape all this data cleanly in one go
(player name, current team, born, birthplace, etc.). Maybe I'm doing it entirely wrong, but any guidance is welcome!
You need to combine your two scripts and make a request for each player. Try the following approach, which searches for <td> tags that have the data-th="Player" attribute:
import requests
from bs4 import BeautifulSoup
def get_links(url):
    data = []
    req_url = requests.get(url)
    soup = BeautifulSoup(req_url.content, "html.parser")
    for td in soup.find_all('td', {'data-th': 'Player'}):
        a_tag = td.a
        name = a_tag.text
        player_url = a_tag['href']
        print(f"Getting {name}")
        req_player_url = requests.get(f"https://basketball.realgm.com{player_url}")
        soup_player = BeautifulSoup(req_player_url.content, "html.parser")
        div_profile_box = soup_player.find("div", class_="profile-box")
        row = {"Name": name, "URL": player_url}
        for p in div_profile_box.find_all("p"):
            try:
                key, value = p.get_text(strip=True).split(':', 1)
                row[key.strip()] = value.strip()
            except ValueError:  # not all entries have values
                pass
        data.append(row)
    return data

urls = [
    'https://basketball.realgm.com/dleague/players/2022',
    'https://basketball.realgm.com/dleague/players/2021',
    'https://basketball.realgm.com/dleague/players/2020',
]

for url in urls:
    print(f"Getting: {url}")
    data = get_links(url)
    for entry in data:
        print(entry)
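The key/value parsing inside that loop can be isolated and tested on its own. A minimal sketch with made-up profile lines shaped like the site's <p> tags (parse_profile_line is a hypothetical helper, not from the answer above):

```python
def parse_profile_line(text):
    """Split a 'Key: value' profile line into a (key, value) pair;
    return None when the line has no colon (i.e. no value)."""
    if ":" not in text:
        return None
    key, value = text.split(":", 1)
    return key.strip(), value.strip()

# Hypothetical profile lines in the same shape as the site's <p> tags.
lines = ["Born: Jun 12, 1989", "Nationality: United States", "NBA Champion"]
pairs = [parse_profile_line(line) for line in lines]
row = dict(p for p in pairs if p is not None)
# row now maps "Born" and "Nationality" to their stripped values;
# the colon-less line is skipped, mirroring the try/except above.
```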

Struggling to scrape a telegram public channel with BeautifulSoup

I'm practicing web scraping with BeautifulSoup, but I'm struggling to finish printing a dictionary of the items I've scraped.
The target can be any Telegram public channel (web version); I intend to collect into the dictionary the text message, timestamp, views, and image URL (if one is attached to the post).
I've inspected the code for the 4 elements, but the one related to the image URL has no class or span, so I've ended up scraping it via regex. The other 3 elements are easily retrievable.
Let's go through it in parts:
Importing modules
from bs4 import BeautifulSoup
import requests
import re
Function to get the images url from the public channel
def pictures(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    # converted to str in order to be able to apply regex
    link = str(soup.find_all('a', class_='tgme_widget_message_photo_wrap'))
    image_url = re.findall(r"https://cdn4.*.*.jpg", link)
    return image_url
Soup to get the text message, timestamp and views
url = "https://t.me/s/computer_science_and_programming"
picture_list = pictures(url)
channel = requests.get(url).text
soup = BeautifulSoup(channel, 'lxml')
tgpost = soup.find_all('div', class_='tgme_widget_message')
full_message = {}
for content in tgpost:
    full_message['views'] = content.find('span', class_='tgme_widget_message_views').text
    full_message['timestamp'] = content.find('time', class_='time').text
    full_message['text'] = content.find('div', class_='tgme_widget_message_text').text
    print(full_message)
I would really appreciate it if someone could help me. I'm new to Python and I don't know how to:
Check if the post contains an image and, if so, add it to the dictionary.
Print the dictionary, including image_url as a key and the URL as its value, for each post.
Thank you very much
I think you want something like this.
from bs4 import BeautifulSoup
import requests, re
url = "https://t.me/s/computer_science_and_programming"
channel = requests.get(url).text
soup = BeautifulSoup(channel, 'lxml')
tgpost = soup.find_all('div', class_ ='tgme_widget_message')
full_message = {}
for content in tgpost:
    full_message['views'] = content.find('span', class_='tgme_widget_message_views').text
    full_message['timestamp'] = content.find('time', class_='time').text
    full_message['text'] = content.find('div', class_='tgme_widget_message_text').text
    if content.find('a', class_='tgme_widget_message_photo_wrap') is not None:
        link = str(content.find('a', class_='tgme_widget_message_photo_wrap'))
        full_message['url_image'] = re.findall(r"https://cdn4.*.*.jpg", link)[0]
    elif 'url_image' in full_message:
        # drop the image URL carried over from the previous post
        full_message.pop('url_image')
    print(full_message)
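If you would rather collect all posts instead of reusing one dict, you can build a fresh dict per post and append it to a list. A sketch against a small made-up snippet in the same shape as the channel's markup (the class names come from the thread, but the HTML and CDN URL are invented for illustration), using a non-greedy regex so the match stops at the first .jpg:

```python
from bs4 import BeautifulSoup
import re

# Made-up snippet mirroring the channel's markup.
html = """
<div class="tgme_widget_message">
  <div class="tgme_widget_message_text">Hello world</div>
  <span class="tgme_widget_message_views">1.2K</span>
  <time class="time">12:00</time>
  <a class="tgme_widget_message_photo_wrap" style="background-image:url('https://cdn4.example/file/abc.jpg')"></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
posts = []
for content in soup.find_all("div", class_="tgme_widget_message"):
    message = {
        "views": content.find("span", class_="tgme_widget_message_views").text,
        "timestamp": content.find("time", class_="time").text,
        "text": content.find("div", class_="tgme_widget_message_text").text,
    }
    photo = content.find("a", class_="tgme_widget_message_photo_wrap")
    if photo is not None:
        # Non-greedy .*? stops at the first .jpg instead of the last one.
        m = re.search(r"https://cdn\d.*?\.jpg", str(photo))
        if m:
            message["url_image"] = m.group(0)
    posts.append(message)  # one fresh dict per post, no stale keys
```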

How to loop through URLs in Python with BeautifulSoup

So I have this code where I extract the number of goals scored from a certain website. I want to loop through it so that I can get the information for every player, for example https://www.futbin.com/20/player/1, then https://www.futbin.com/20/player/2, then https://www.futbin.com/20/player/3, up to 5000. How would I go about that? Here is my code for how I get the goals information.
import requests
from bs4 import BeautifulSoup

url = 'https://www.futbin.com/20/player/143/cristiano-ronaldo'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
googclose = soup.find_all(class_='ps4-pgp-data')
hi = soup.find_all('div', attrs={'class': 'ps4-pgp-data'})[4]
all_goals = []
for i in range(1, 51):
    url = 'https://www.futbin.com/20/player/{}'.format(i)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    goal = soup.find_all('div', attrs={'class': 'ps4-pgp-data'})[4].text.strip()
    all_goals.append(goal)
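For 5000 pages it may help to reuse one HTTP session, guard against pages that have fewer than five ps4-pgp-data divs, and pause between requests. A hedged sketch of that loop (the URL pattern and the index 4 come from the question; the helper names and delay value are mine):

```python
import time

import requests
from bs4 import BeautifulSoup

def player_url(player_id):
    # URL pattern taken from the question.
    return f"https://www.futbin.com/20/player/{player_id}"

def scrape_goals(player_id, session):
    """Return the goals cell for one player, or None when the page is short."""
    response = session.get(player_url(player_id), timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    cells = soup.find_all("div", attrs={"class": "ps4-pgp-data"})
    # Index 4 holds goals per the question; guard against missing players.
    return cells[4].text.strip() if len(cells) > 4 else None

def scrape_range(start, stop, delay=1.0):
    all_goals = []
    with requests.Session() as session:  # reuse one TCP connection
        for i in range(start, stop):
            all_goals.append(scrape_goals(i, session))
            time.sleep(delay)  # be polite to the server
    return all_goals
```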

How can I get the correct urls of ads?

I'm trying to scrape the urls of the ads on "Marktplaats" website (link is provided below).
As you can see, I'm looking for 30 URLs. These URLs are placed inside an href attribute and all start with "/a/auto-s/". Unfortunately, I only keep getting the first few URLs. I found out that on this site all the data is placed within "<li class="mp-Listing mp-Listing--list-item"> ... </li>". Does anyone have an idea how to fix it? (You can see that you won't find all the ad URLs when you run my code.)
Link:
https://www.marktplaats.nl/l/auto-s/#f:10882,10898|PriceCentsTo:350000|constructionYearFrom:2001|offeredSince:TODAY|searchInTitleAndDescription:true
My code:
import requests
from bs4 import BeautifulSoup
url = "https://www.marktplaats.nl/l/auto-s/#f:10882,10898|PriceCentsTo:350000|constructionYearFrom:2001|offeredSince:TODAY|searchInTitleAndDescription:true"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
url_list = soup.find_all(class_ = 'mp-Listing mp-Listing--list-item')
print(url_list)
You can try something like this:
import requests
from bs4 import BeautifulSoup
def parse_links(url):
    links = []
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for li in soup.find_all(class_="mp-Listing mp-Listing--list-item"):
        links.append(li.a.get('href'))
    return links
url = "https://www.marktplaats.nl/l/auto-s/#f:10882,10898|PriceCentsTo:350000|constructionYearFrom:2001|offeredSince:TODAY|searchInTitleAndDescription:true"
links = parse_links(url)
print('\n'.join(map(str, links)))
Output
/a/auto-s/oldtimers/a1302359148-allis-chalmers-ed40-1965.html
/a/auto-s/bestelauto-s/a1258166221-opel-movano-2-3-cdti-96kw-2018.html
/a/auto-s/oldtimers/a1302359184-chevrolet-biscayne-bel-air-1960.html
/a/auto-s/renault/a1240974413-ruim-aanbod-rolstoelauto-s-www-autoland-nl.html
/a/auto-s/volkswagen/m1457703674-golf-6-1-2tsi-comfortline-bluemotion-77kw-2de-eigenaar.html
/a/auto-s/peugeot/m1457564187-peugeot-208-1-6-e-hdi-68kw-92pk-5-d-2014-zwart.html
/a/auto-s/volkswagen/m1457124365-volkswagen-touareg-3-2-v6-177kw-4motion-aut-2004-grijs.html
/a/auto-s/volkswagen/m1456753596-volkswagen-golf-vii-2-0-tdi-highline-150pk-xenon-trekhaak.html
/a/auto-s/bestelauto-s/a1001658686-200-nw-en-gebruikte-bestelwagens-personenbusjes-pick-ups.html
/a/auto-s/bestelauto-s/m940111355-bus-verkopen-bestelauto-inkoop-bestelwagen-opkoper-rdw.html
/a/auto-s/volkswagen/m1456401063-volkswagen-golf-1-6-74kw-2000-zwart.html
/a/auto-s/renault/m1456242548-renault-espace-2-0-dci-110kw-e4-2006-zwart.html
/a/auto-s/nissan/m1448699345-nissan-qashqai-1-5-dci-connect-2011-grijs-panoramadak.html
/a/auto-s/bestelauto-s/a1212708374-70-x-kleine-bestelwagens-lage-km-scherpe-prijzen.html
/a/auto-s/bmw/m1452641019-bmw-5-serie-2-0-520d-touring-aut-2014-grijs.html
/a/auto-s/mercedes-benz/m1448671698-mercedes-benz-a-klasse-a250-amg-224pk-7g-dct-panoramadak-wid.html
/a/auto-s/bmw/m1455671862-bmw-3-serie-2-0-i-320-cabrio-aut-2007-bruin.html
/a/auto-s/bestelauto-s/m1455562699-volkswagen-transporter-kmstand-151-534-2-5-tdi-65kw-2002.html
/a/auto-s/bestelauto-s/a1295698562-35-x-renault-kangoo-2013-t-m-2015-v-a-25000-km.html
/a/auto-s/infiniti/m1458111256-infiniti-q50-3-5-hybrid-awd-2016-grijs.html
/a/auto-s/ford/m1458111166-ford-ka-1-3-i-44kw-2007-zwart.html
/a/auto-s/bestelauto-s/m1457499260-renault-master-l3h2-2018-airco-camera-cruise-laadruimte-12.html
/a/auto-s/land-rover/m1458110209-land-rover-discovery-4-3-0-tdv6-2010-grijs.html
/a/auto-s/dodge/a1279463634-5-jaar-ram-dealer-garantie-lage-bijtelling.html
/a/auto-s/bmw/m1455389317-bmw-320i-e46-sedan-bieden.html
/a/auto-s/ford/m1457306473-ford-galaxy-2-0-tdci-85kw-dpf-2011-blauw.html
/a/auto-s/peugeot/m1456912876-peugeot-407-2-0-16v-sw-2006-grijs.html
/a/auto-s/bestelauto-s/m1457161395-renault-master-t35-2-3-dci-l3h2-130-pk-navi-airco-camera-pdc.html
/a/auto-s/bestelauto-s/a1299134880-citroen-berlingo-1-6-hdi-2017-airco-sd-3-zits-v-a-179-p-m.html
/a/auto-s/hyundai/m1458105451-hyundai-atos-gezocht-hoge-prijs-tel-0653222206.html
/a/auto-s/volkswagen/m1458103618-volkswagen-polo-1-4-tsi-132kw-dsg-2012-wit.html
/a/auto-s/vrachtwagens/m1458101965-scania-torpedo.html
/a/auto-s/toyota/m1458101624-toyota-yaris-1-0-12v-vvt-i-aspiration-5dr-2012.html
/a/auto-s/dodge/a1279447576-5-jaar-ram-dealer-garantie-en-historie-bekijk-onze-website.html
You can also build the absolute URL of each page by prepending 'https://www.marktplaats.nl' to li.a.get('href'). So your whole code should look like this:
import requests
from bs4 import BeautifulSoup
def parse_links(url):
    links = []
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for li in soup.find_all(class_="mp-Listing mp-Listing--list-item"):
        links.append('https://www.marktplaats.nl' + li.a.get('href'))
    return links
url = "https://www.marktplaats.nl/l/auto-s/#f:10882,10898|PriceCentsTo:350000|constructionYearFrom:2001|offeredSince:TODAY|searchInTitleAndDescription:true"
links = parse_links(url)
print('\n'.join(map(str, links)))
It should produce the output like this:
https://www.marktplaats.nl/a/auto-s/renault/a1302508082-mooi-renault-megane-scenic-1-6-16v-aut-2005-2003-groen-airco.html
https://www.marktplaats.nl/a/auto-s/oldtimers/a1302359157-morris-minor-cabriolet-1970.html
https://www.marktplaats.nl/a/auto-s/oldtimers/a1302743902-online-veiling-oldtimers-en-classic-cars-zedelgem-vavato.html
https://www.marktplaats.nl/a/auto-s/oldtimers/a1302359138-mercedes-benz-g-500-guard-pantzer-1999.html
https://www.marktplaats.nl/a/auto-s/volkswagen/m1457703674-golf-6-1-2tsi-comfortline-bluemotion-77kw-2de-eigenaar.html
https://www.marktplaats.nl/a/auto-s/peugeot/m1457564187-peugeot-208-1-6-e-hdi-68kw-92pk-5-d-2014-zwart.html
https://www.marktplaats.nl/a/auto-s/volkswagen/m1457124365-volkswagen-touareg-3-2-v6-177kw-4motion-aut-2004-grijs.html
https://www.marktplaats.nl/a/auto-s/volkswagen/m1456753596-volkswagen-golf-vii-2-0-tdi-highline-150pk-xenon-trekhaak.html
https://www.marktplaats.nl/a/auto-s/volkswagen/a1279696849-vw-take-up-5-d-radio-airco-private-lease.html
https://www.marktplaats.nl/a/auto-s/bestelauto-s/m940111355-bus-verkopen-bestelauto-inkoop-bestelwagen-opkoper-rdw.html
https://www.marktplaats.nl/a/auto-s/volkswagen/m1456401063-volkswagen-golf-1-6-74kw-2000-zwart.html
https://www.marktplaats.nl/a/auto-s/renault/m1456242548-renault-espace-2-0-dci-110kw-e4-2006-zwart.html
https://www.marktplaats.nl/a/auto-s/nissan/m1448699345-nissan-qashqai-1-5-dci-connect-2011-grijs-panoramadak.html
https://www.marktplaats.nl/a/auto-s/citroen/a1277007710-citroen-c1-feel-5-d-airco-private-lease-vanaf-189-euro-mnd.html
https://www.marktplaats.nl/a/auto-s/bmw/m1452641019-bmw-5-serie-2-0-520d-touring-aut-2014-grijs.html
https://www.marktplaats.nl/a/auto-s/mercedes-benz/m1448671698-mercedes-benz-a-klasse-a250-amg-224pk-7g-dct-panoramadak-wid.html
https://www.marktplaats.nl/a/auto-s/bmw/m1455671862-bmw-3-serie-2-0-i-320-cabrio-aut-2007-bruin.html
https://www.marktplaats.nl/a/auto-s/bestelauto-s/m1455562699-volkswagen-transporter-kmstand-151-534-2-5-tdi-65kw-2002.html
https://www.marktplaats.nl/a/auto-s/peugeot/a1298813052-private-lease-occasion-outlet-prive-lease.html
https://www.marktplaats.nl/a/auto-s/audi/m1458114563-audi-a4-2-0-tfsi-132kw-avant-multitronic-nl-auto.html
https://www.marktplaats.nl/a/auto-s/mercedes-benz/m1452983872-mercedes-a-klasse-2-0-cdi-a200-5drs-aut-2007-grijs.html
https://www.marktplaats.nl/a/auto-s/bestelauto-s/m1457499260-renault-master-l3h2-2018-airco-camera-cruise-laadruimte-12.html
https://www.marktplaats.nl/a/auto-s/infiniti/m1458111256-infiniti-q50-3-5-hybrid-awd-2016-grijs.html
https://www.marktplaats.nl/a/auto-s/bestelauto-s/a1001658686-200-nw-en-gebruikte-bestelwagens-personenbusjes-pick-ups.html
https://www.marktplaats.nl/a/auto-s/ford/m1458111166-ford-ka-1-3-i-44kw-2007-zwart.html
https://www.marktplaats.nl/a/auto-s/land-rover/m1458110209-land-rover-discovery-4-3-0-tdv6-2010-grijs.html
https://www.marktplaats.nl/a/auto-s/bmw/m1455389317-bmw-320i-e46-sedan-bieden.html
https://www.marktplaats.nl/a/auto-s/bestelauto-s/m1457161395-renault-master-t35-2-3-dci-l3h2-130-pk-navi-airco-camera-pdc.html
https://www.marktplaats.nl/a/auto-s/renault/a1302508082-mooi-renault-megane-scenic-1-6-16v-aut-2005-2003-groen-airco.html
https://www.marktplaats.nl/a/auto-s/ford/m1457306473-ford-galaxy-2-0-tdci-85kw-dpf-2011-blauw.html
https://www.marktplaats.nl/a/auto-s/peugeot/m1456912876-peugeot-407-2-0-16v-sw-2006-grijs.html
https://www.marktplaats.nl/a/auto-s/hyundai/m1458105451-hyundai-atos-gezocht-hoge-prijs-tel-0653222206.html
https://www.marktplaats.nl/a/auto-s/volkswagen/m1458103618-volkswagen-polo-1-4-tsi-132kw-dsg-2012-wit.html
https://www.marktplaats.nl/a/auto-s/oldtimers/a1302743902-online-veiling-oldtimers-en-classic-cars-zedelgem-vavato.html
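As an aside, instead of concatenating strings you could use urllib.parse.urljoin from the standard library, which also copes with hrefs that are already absolute. A small sketch using two hrefs from the output above:

```python
from urllib.parse import urljoin

BASE = "https://www.marktplaats.nl"

# Relative hrefs as returned by li.a.get('href') above.
hrefs = [
    "/a/auto-s/oldtimers/a1302359148-allis-chalmers-ed40-1965.html",
    "/a/auto-s/bestelauto-s/a1258166221-opel-movano-2-3-cdti-96kw-2018.html",
]
full_urls = [urljoin(BASE, href) for href in hrefs]
```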
Good luck!

Using Beautiful Soup to scrape data from Indeed

I am trying to use BeautifulSoup to scrape resumes on Indeed, but I've run into some problems.
here is the sample site: https://www.indeed.com/resumes?q=java&l=&cb=jt
here is my code:
import requests
from bs4 import BeautifulSoup

URL = "https://www.indeed.com/resumes?q=java&l=&cb=jt"
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')

def scrape_job_title(soup):
    job = []
    for div in soup.find_all(name='li', attrs={'class': 'sre'}):
        for a in div.find_all(name='a', attrs={'class': 'app-link'}):
            job.append(a['title'])
    return job

scrape_job_title(soup)
It prints out nothing: []
As you can see in the picture, I want to grab the job title "Java developer".
The class is app_link, not app-link. Additionally, a['title'] doesn't do what you want. Use a.contents[0] instead.
URL = "https://www.indeed.com/resumes?q=java&l=&cb=jt"
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')

def scrape_job_title(soup):
    job = []
    for div in soup.find_all(name='li', attrs={'class': 'sre'}):
        for a in div.find_all(name='a', attrs={'class': 'app_link'}):
            job.append(a.contents[0])
    return job

scrape_job_title(soup)
Try this to get all the job titles:
import requests
from bs4 import BeautifulSoup
URL = "https://www.indeed.com/resumes?q=java&l=&cb=jt"
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html5lib')
for items in soup.select('.sre'):
    data = [item.text for item in items.select('.app_link')]
    print(data)
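To see why a.contents[0] works where a['title'] fails, here is the same extraction run against a tiny made-up snippet shaped like the resume results (the markup is invented; only the class names come from the thread):

```python
from bs4 import BeautifulSoup

# Invented markup in the same shape as the resume results: the job title
# is the tag's text content, not a title= attribute.
html = '<li class="sre"><a class="app_link" href="/r/abc">Java Developer</a></li>'
soup = BeautifulSoup(html, "html.parser")

jobs = [a.contents[0]
        for li in soup.find_all("li", class_="sre")
        for a in li.find_all("a", class_="app_link")]
```

Here a.contents[0] is the first child of the anchor, i.e. the "Java Developer" text node, while a['title'] would raise a KeyError because the tag has no title attribute.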
