Scrapy: how to get links to users?


I'm trying to get links to group members:
response.css('.text--ellipsisOneLine::attr(href)').getall()
Why isn't this working?
html:
<div class="flex flex--row flex--noGutters flex--alignCenter">
  <div class="flex-item _memberItem-module_name__BSx8i">
    <a href="/ru-RU/Connect-IT-Meetup-in-Chisinau/members/280162178/profile/?returnPage=1">
      <h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
    </a>
  </div>
</div>

Your selector isn't working because you are looking for an attribute (href) that this element doesn't have.
response.css('.text--ellipsisOneLine::attr(href)').getall()
This selector searches for an href attribute on elements with the class text--ellipsisOneLine. In your HTML snippet that class matches only this element:
<h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
As you can see, there is no href attribute on it; the href belongs to the parent a element. If you want the text inside this h4 element, you need to use the ::text pseudo-element.
response.css('.text--ellipsisOneLine::text').getall()
Read more here.

I realize that this isn't Scrapy, but for web scraping I personally use the requests module and BeautifulSoup4. The following snippet prints the names of the users with those modules:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://www.meetup.com/ru-RU/Connect-IT-Meetup-in-Chisinau/members/')
if response.status_code == 200:
    html_doc = response.text
    html_source = BeautifulSoup(html_doc, 'html.parser')
    # each member name sits inside an <h4> element
    users = html_source.find_all('h4')
    for user in users:
        print(user.text)

css:
response.css('.member-item .flex--alignCenter a::attr(href)').getall()

Related

How do I get href links under a specific class with BeautifulSoup

This is an example of the type of block of HTML source code I'm targeting with BeautifulSoup
<div class="fighter_list left">
<meta itemprop="image" content="/image_crop/44/44/_images/fighter/1406924569376_20140801011731_Picture17.JPG">
<img class="lazy" src="/image_crop/44/44/_images/fighter/1406924569376_20140801011731_Picture17.JPG" data-original="/image_crop/44/44/_images/fighter/1406924569376_20140801011731_Picture17.JPG" alt="Jason DeLucia" title="Jason DeLucia" />
<div class="fighter_result_data">
<a itemprop="url" href="/fighter/Jason-DeLucia-22"><span itemprop="name">Jason<br />DeLucia</span></a><br>
This is one of multiple blocks like this for each "fighter_list left" on the page.
I want to get all of the itemprop="url" href links that are in the "fighter_list left" class (i.e. /fighter/Jason-DeLucia-22)
When I try the below code I get nothing.
for link in html.find_all('a', class_="fighter_List left", itemprop="url"):
    print(link.get('href'))
The closest I can get is getting every itemprop=url link on the page when I omit the class_= part.
But I only want the ones under the fighter_list left class.
This is the website https://www.sherdog.com/events/UFC-1-The-Beginning-7

You can use a CSS selector for the task:
import requests
from bs4 import BeautifulSoup
url = "https://www.sherdog.com/events/UFC-1-The-Beginning-7"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for link in soup.select('.fighter_list.left [itemprop="url"]'):
    print(link["href"])
Prints:
/fighter/Jason-DeLucia-22
/fighter/Royce-Gracie-19
/fighter/Gerard-Gordeau-15
/fighter/Ken-Shamrock-4
/fighter/Royce-Gracie-19
/fighter/Kevin-Rosier-17
/fighter/Gerard-Gordeau-15
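
The original attempt returned nothing because the fighter_list left class sits on the outer div, not on the a element (and the class name was spelled fighter_List). Here is a small offline sketch of the same idea in two steps: first the containers, then the links inside them. The first block is a trimmed copy of the HTML from the question; the second div with class "other" is a made-up element added only to show that it gets skipped:

```python
from bs4 import BeautifulSoup

html = """
<div class="fighter_list left">
  <div class="fighter_result_data">
    <a itemprop="url" href="/fighter/Jason-DeLucia-22"><span itemprop="name">Jason<br />DeLucia</span></a>
  </div>
</div>
<div class="other">
  <a itemprop="url" href="/events/UFC-1-The-Beginning-7">hypothetical link outside the list</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: pick the containers that carry both classes.
# Step 2: inside each container, pick the links with itemprop="url".
hrefs = [
    link["href"]
    for block in soup.select("div.fighter_list.left")
    for link in block.find_all("a", itemprop="url")
]

print(hrefs)
```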

How to get only links from parsed html using python?

How can I get the links if the tag is in this form?
<div class="BNeawe vvjwJb AP7Wnd">Going Gourmet Catering (#goinggourmet) - Instagram</div></h3><div class="BNeawe UPmit AP7Wnd">www.instagram.com › goinggourmet</div>
I tried the code below and it gave me only the URLs, but the URLs come in this format:
/url?q=https://bespokecatering.sydney/&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAEQAg&usg=AOvVaw076QI0_4Yw4hNZ6iXHQZL-
/url?q=https://www.facebook.com/bespokecatering.sydney/videos/lockdown-does-not-mean-unfulfilled-cravings-order-our-weekly-favorites-order-her/892336708293067/%3Fextid%3DSEO----&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQtwJ6BAgEEAE&usg=AOvVaw2YQI1Bqwip72axc-Nh2_6e
/url?q=https://www.instagram.com/bespoke_catering/%3Fhl%3Den&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAoQAg&usg=AOvVaw1QUCWYmxfSLb6Jx20hyXIR
I need only the URLs from Facebook and Instagram, without any extra parameters. What I mean is that I want only the real link, not the redirect link.
I need something like this from the links above:
'https://www.facebook.com/bespokecatering.sydney'
'https://www.instagram.com/bespoke_catering'
div = soup.find_all('div',attrs={'class':'kCrYT'})
for w in div:
    for link in w.select('a'):
        urls = link['href']
        print(urls)
Any help is much appreciated.
I tried the code below, but it returns empty or different results:
div = soup.find_all('div',attrs={'class':'kCrYT'})
for w in div:
    for link in w.select('a'):
        urls = link['href']
        print(urls)
for url in urls:
    try:
        j=url.split('=')[1]
        k= '/'.join(j.split('/')[0:4])
        #print(k)
    except:
        k = ''

You have already selected the <a> elements. Just loop over the selection and print each result via ['href']:
div = soup.find_all('div',attrs={'class':'kCrYT'})
for w in div:
    for link in w.select('a'):
        print(link['href'])
If you improve your question and add the requested information, we can answer in more detail.
EDIT
Answering your additional question with a simple example (something you should provide in your question):
import requests
from bs4 import BeautifulSoup
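result = '''
<div class="kCrYT">
<a href="/url?q=https://bespokecatering.sydney/&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAEQAg&usg=AOvVaw076QI0_4Yw4hNZ6iXHQZL-"></a>
</div>
<div class="kCrYT">
<a href="/url?q=https://www.facebook.com/bespokecatering.sydney/videos/lockdown-does-not-mean-unfulfilled-cravings-order-our-weekly-favorites-order-her/892336708293067/%3Fextid%3DSEO----&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQtwJ6BAgEEAE&usg=AOvVaw2YQI1Bqwip72axc-Nh2_6e"></a>
</div>
<div class="kCrYT">
<a href="/url?q=https://www.instagram.com/bespoke_catering/%3Fhl%3Den&sa=U&ved=2ahUKEwjTv6ueseHyAhUHb30KHYTYABwQFnoECAoQAg&usg=AOvVaw1QUCWYmxfSLb6Jx20hyXIR"></a>
</div>
'''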
soup = BeautifulSoup(result, 'lxml')
div = soup.find_all('div',attrs={'class':'kCrYT'})
for w in div:
    for link in w.select('a'):
        print(dict(x.split('=') for x in requests.utils.urlparse(link['href']).query.split('&'))['q'].split('%3F')[0])
Result:
https://bespokecatering.sydney/
https://www.facebook.com/bespokecatering.sydney/videos/lockdown-does-not-mean-unfulfilled-cravings-order-our-weekly-favorites-order-her/892336708293067/
https://www.instagram.com/bespoke_catering/
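
If you also want to drop everything that is not Facebook or Instagram and trim each link down to the account, the standard library can do the query-string parsing for you. This is only a sketch: profile_url and WANTED_HOSTS are names made up for this example, the hrefs are shortened versions of the ones in the question, and the rule "keep only the first path segment" is an assumption based on the desired output shown there:

```python
from urllib.parse import parse_qs, urlsplit

WANTED_HOSTS = ("facebook.com", "instagram.com")  # hypothetical filter list


def profile_url(href):
    """Return the trimmed target of a '/url?q=...' redirect, or None if unwanted."""
    targets = parse_qs(urlsplit(href).query).get("q")
    if not targets:
        return None
    # parse_qs has already percent-decoded the value, so %3F is a real "?" here.
    target = urlsplit(targets[0])
    host = target.netloc.lower()
    if not any(host == h or host.endswith("." + h) for h in WANTED_HOSTS):
        return None
    # Assumption: the first path segment is the account name.
    account = target.path.strip("/").split("/")[0]
    return f"{target.scheme}://{target.netloc}/{account}"


# Shortened versions of the redirect links from the question.
hrefs = [
    "/url?q=https://bespokecatering.sydney/&sa=U",
    "/url?q=https://www.facebook.com/bespokecatering.sydney/videos/892336708293067/%3Fextid%3DSEO----&sa=U",
    "/url?q=https://www.instagram.com/bespoke_catering/%3Fhl%3Den&sa=U",
]

links = [url for url in map(profile_url, hrefs) if url]
print(links)
```

Compared with splitting on '=' and '&' by hand, parse_qs also copes with values that themselves contain an encoded '=' or '&'.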

Python BeautifulSoup: extract value from div class

I want to build a program that automatically gets the live price of the German index (DAX). For this I use a website whose prices are provided by FXCM.
In my code I use the beautifulsoup and requests packages. The div box where the current value is stored looks like this:
<div class="left" data-item="quoteContainer" data-bg_quotepush="133962:74:bid">
  <div class="wrapper cf">
    <div class="left">
      <span class="quote quote_standard" data-bg_quotepush="quote" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="quote" data-bg_quotepush_c="40">13.599,24</span>
      <span class="label" data-bg_quotepush="time" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="time" data-bg_quotepush_c="41">25.12.2020</span>
      <span class="label"> • </span>
      <span class="label" data-item="currency"></span>
    </div>
    <div class="right">
      <span class="percent up" data-bg_quotepush="percent" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="percent" data-bg_quotepush_c="42">+0,00<span>%</span></span>
      <span class="label up" data-bg_quotepush="change" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="change" data-bg_quotepush_c="43">0,00</span>
    </div>
  </div>
</div>
The value I want is the one in the span with data-bg_quotepush_c="40", which here is 13.599,24.
My Python code looks like this:
import requests as rq
from bs4 import BeautifulSoup as bs
url = "https://news.guidants.com/#Ticker/Profil/?i=133962&e=74"
response = rq.get(url)
soup = bs(response.text, "lxml")
price = soup.find_all("div", {"class":"left"})[0].find("span")
print(price["data-bg_quotepush_c"])
It returns the following error:
File "C:\Users\Felix\anaconda3\lib\site-packages\bs4\element.py", line 1406, in __getitem__
return self.attrs[key]
KeyError: 'data-bg_quotepush_c'

Use Selenium instead of requests if working with dynamically generated content
What is going on?
Requesting the website with requests only returns the initial content, which does not contain the dynamically generated information, so you cannot find what you are looking for.
To wait until the website has loaded completely, use Selenium with sleep() as a simple method, or Selenium waits as a more advanced one.
Avoiding the error
Use price.text to get the text of the element that looks like this:
<span class="quote quote_standard" data-bg_quotepush="quote" data-bg_quotepush_c="40" data-bg_quotepush_f="quote" data-bg_quotepush_i="133962:74:bid">13.599,24</span>
Example
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://news.guidants.com/#Ticker/Profil/?i=133962&e=74"
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
driver.get(url)
driver.implicitly_wait(3)
soup = BeautifulSoup(driver.page_source,"html5lib")
price = soup.find_all("div", {"class":"left"})[0].find("span")
print(price.text)
driver.close()
Output
13.599,24
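
Once the rendered HTML is available (for example from driver.page_source), you can also target the span by its attribute instead of relying on it being the first span in the first div. The sketch below runs on a trimmed copy of the HTML from the question rather than on the live page, and the conversion to float at the end is an extra step that assumes the German number format shown there:

```python
from bs4 import BeautifulSoup

# Two spans copied from the question's HTML (attributes trimmed for brevity).
html = """
<div class="left">
  <span class="quote quote_standard" data-bg_quotepush_c="40">13.599,24</span>
  <span class="label" data-bg_quotepush_c="41">25.12.2020</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Select the span by attribute value rather than by position.
span = soup.select_one('span[data-bg_quotepush_c="40"]')

# .get() returns None instead of raising KeyError when an attribute is absent.
print(span.get("data-bg_quotepush_c"))
print(span.get("no-such-attribute"))

# German number format: "." groups thousands and "," marks the decimals.
price = float(span.text.replace(".", "").replace(",", "."))
print(price)
```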

If you are scraping the value from a div class, try this example:
from selenium import webdriver
from bs4 import BeautifulSoup
# assumes chromedriver can be found on PATH; otherwise pass the path to your driver
driver = webdriver.Chrome()
# create a variable to store the URL string
url = 'https://news.guidants.com/#Ticker/Profil/?i=133962&e=74'
driver.get(url)
# scraping process
soup = BeautifulSoup(driver.page_source,"html5lib")
# parse
prices = soup.find_all("div", attrs={"class":"left"})
for price in prices:
    total_price = price.find('span')
    # some matching divs may not contain a span at all
    if total_price is not None:
        print(total_price.text)
# close the driver
driver.close()
If you are using the requests module, try a different parser.
You can install one with pip, for example html5lib:
pip install html5lib
Thanks.

Python / Beautifulsoup: HTML Path to the current element

For a class project, I'm working on extracting all links on a webpage. This is what I have so far.
from bs4 import BeautifulSoup, SoupStrainer
with open("input.htm") as inputFile:
    soup = BeautifulSoup(inputFile)
outputFile=open('output.txt', 'w')
for link in soup.find_all('a', href=True):
    outputFile.write(str(link)+'\n')
outputFile.close()
This works very well.
Here's the complication: for every <a> element, my project requires me to know the entire "tree structure" down to the current link. In other words, I'd like to know all the preceding elements, starting with the <body> element, and the class and id of each one along the way.
Like the navigation page on Windows explorer. Or the navigation panel on many browsers' element inspection tool.
For example, if you look at the Bible page on Wikipedia and a link to the Wikipedia page for the Talmud, the following "path" is what I'm looking for.
<body class="mediawiki ...>
<div id="content" class="mw-body" role="main">
<div id="bodyContent" class="mw-body-content">
<div id="mw-content-text" ...>
<div class="mw-parser-output">
<div role="navigation" ...>
<table class="nowraplinks ...>
<tbody>
<td class="navbox-list ...>
<div style="padding:0em 0.25em">
<ul>
<li>
<a href="/wiki/Talmud"
Thanks a bunch.
-Maureen

Try this code:
soup = BeautifulSoup(inputFile, 'html.parser')
Or use lxml:
soup = BeautifulSoup(inputFile, 'lxml')
If it is not installed:
pip install lxml

Here is a solution I just wrote. It works by finding the element and then navigating up the tree through each element's parent. For each element I keep just the opening tag and add it to a list, then reverse the list at the end. The result is a list that resembles the tree you requested.
I have written it for one element; you can modify it to work with your find_all.
from bs4 import BeautifulSoup
import requests
page = requests.get("https://en.wikipedia.org/wiki/Bible")
soup = BeautifulSoup(page.text, 'html.parser')
tree = []
hrefElement = soup.find('a', href=True)
hrefString = str(hrefElement).split(">")[0] + ">"
tree.append(hrefString)
hrefParent = hrefElement.find_parent()
while hrefParent.name != "html":
    hrefString = str(hrefParent).split(">")[0] + ">"
    tree.append(hrefString)
    hrefParent = hrefParent.find_parent()
tree.reverse()
print(tree)
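
The same walk up the parents can be wrapped in a function so that it works for every link found by find_all. This is a sketch rather than a finished tool: describe and path_to are names made up here, the HTML is a tiny stand-in for the Wikipedia page, and the "tag#id.class" label format is just one possible way to show the id and class along the way:

```python
from bs4 import BeautifulSoup


def describe(tag):
    """Build a short label such as div#content.mw-body for one element."""
    label = tag.name
    if tag.get("id"):
        label += "#" + tag["id"]
    if tag.get("class"):
        label += "." + ".".join(tag["class"])
    return label


def path_to(tag):
    """Return the chain of elements from <body> down to the given tag."""
    chain = [describe(tag)]
    for parent in tag.parents:
        # Stop above <body>; "[document]" is the BeautifulSoup object itself.
        if parent.name in ("html", "[document]"):
            break
        chain.append(describe(parent))
    return " > ".join(reversed(chain))


# Tiny stand-in for the real page.
html = (
    '<html><body class="mediawiki">'
    '<div id="content" class="mw-body"><ul><li>'
    '<a href="/wiki/Talmud">Talmud</a>'
    "</li></ul></div></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

paths = {link["href"]: path_to(link) for link in soup.find_all("a", href=True)}
for href, path in paths.items():
    print(href, "->", path)
```

Using tag.parents avoids converting every ancestor to a string, which gets slow on large pages because str() serializes the whole subtree each time.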

Scraping div with a data- attribute using Python and BeautifulSoup

I have to scrape a web page using BeautifulSoup in Python. To extract the complete div that has the relevant information and looks like the one below:
<div data-v-24a74549="" class="row row-mg-mod term-row">
I wrote soup.find('div',{'class':'row row-mg-mod term-row'}).
But it returns nothing. I guess it has something to do with this data-v value.
Can someone tell me the exact syntax for scraping this type of data?

Give this a try:
from bs4 import BeautifulSoup
content = """
<div data-v-24a74549="" class="row row-mg-mod term-row">"""
soup = BeautifulSoup(content,'html.parser')
for div in soup.find_all("div", {"class" : "row"}):
    print(div)
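
As far as I know, the data-v attribute does not get in the way of matching by class. The sketch below compares three ways of matching a div with several classes on a closed copy of the snippet; the inner text "term" is made up so the div has some content. If the lookup still returns nothing on the real page, one possible cause (an assumption, since the page is not shown) is that the div is generated by JavaScript and is not in the HTML you downloaded:

```python
from bs4 import BeautifulSoup

html = '<div data-v-24a74549="" class="row row-mg-mod term-row">term</div>'
soup = BeautifulSoup(html, "html.parser")

# A string with spaces is compared against the exact class attribute value.
exact = soup.find("div", {"class": "row row-mg-mod term-row"})

# The same classes in a different order no longer match that exact string.
reordered = soup.find("div", {"class": "term-row row row-mg-mod"})

# A CSS selector requires all three classes but ignores their order.
by_css = soup.select_one("div.term-row.row.row-mg-mod")

print(exact is not None, reordered is not None, by_css is not None)
```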
