I've noticed that my code doesn't return the full HTML. Here is the code:
import requests
from bs4 import BeautifulSoup
keys = "blabla" + " -filter:retweets AND -filter:replies"
query = "https://twitter.com/search?f=tweets&vertical=default&q=" + keys + "&src=typd&lang=es"
req = requests.get(query)
soup = BeautifulSoup(req.text, "html.parser")
for tweets in soup.findAll("li", {'class': 'js-stream-item stream-item stream-item'}):
    print(tweets.get("data-tweet-id"))
This doesn't print anything, and the "li" tag is not even in the soup object, nor is the "div class='stream'", even though the Twitter search page looks like this:
<div class="stream">
<ol class="stream-items js-navigable-stream" id="stream-items-id">
<li class="js-stream-item stream-item stream-item
" data-item-id="1211695349607174144"
id="stream-item-tweet-1211695349607174144"
data-item-type="tweet"
There are also a lot of other things that don't appear in my soup object.
I am trying to create a script to scrape price data from Udemy courses.
I'm struggling with navigating the HTML tree because the element I'm looking for is located inside multiple nested divs.
here's the structure of the HTML element I'm trying to access:
what I tried:
response = requests.get(url)
print(response)
doc = BeautifulSoup(response.text, "html.parser")
parent_div = doc.find(class_="sidebar-container--purchase-section--17KRp")
print(parent_div.find_all("span"))
and even:
response = requests.get(url)
print(response)
doc = BeautifulSoup(response.text, "html.parser")
main = doc.find(class_="generic-purchase-section--main-cta-container--3xxeM")
title = main.select_one("div span span span")
Here’s the URL: https://www.udemy.com/course/the-complete-web-development-bootcamp/
I tried searching all the spans in the HTML, and the specific span I'm looking for doesn't appear, maybe because it's nested inside a div?
I would appreciate a little guidance!
The price is loaded by JavaScript, so it is not possible to scrape it using BeautifulSoup alone.
The data is loaded from an API Endpoint which takes in the course-id of the course.
Course-id of this course: 1565838
You can directly get the info from that endpoint like this.
import requests
course_id = '1565838'
url = f'https://www.udemy.com/api-2.0/course-landing-components/{course_id}/me/?components=price_text'
response = requests.get(url)
x = response.json()
print(x['price_text']['data']['pricing_result']['price'])
Output:
{'amount': 455.0, 'currency': 'INR', 'price_string': '₹455', 'currency_symbol': '₹'}
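Since the endpoint's response is deeply nested and its shape can change, it can be worth pulling the price out defensively. A small sketch; the sample dict below just mirrors the response shown above, it is not fetched live:

```python
def extract_price(payload):
    """Walk the nested endpoint response, tolerating missing keys."""
    price = (payload.get('price_text', {})
                    .get('data', {})
                    .get('pricing_result', {})
                    .get('price', {}))
    return price.get('price_string', 'N/A')

# sample mirroring the endpoint's JSON shape
sample = {'price_text': {'data': {'pricing_result': {'price': {
    'amount': 455.0, 'currency': 'INR',
    'price_string': '₹455', 'currency_symbol': '₹'}}}}}

print(extract_price(sample))  # ₹455
print(extract_price({}))      # N/A (shape changed or request failed)
```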
I tried your first approach several times and it works more or less for me, although it has returned a different number of span elements on different attempts (10 is the usual number, but I have seen as few as 1 on one occasion):
import requests
from bs4 import BeautifulSoup
url = 'https://www.udemy.com/course/the-complete-web-development-bootcamp/'
response = requests.get(url)
doc = BeautifulSoup(response.text, "html.parser")
parent_div = doc.find(class_="sidebar-container--purchase-section--17KRp")
spans = parent_div.find_all("span")
print(len(spans))
for span in spans:
    print(span)
Prints:
10
<span data-checked="checked" data-name="u872-accordion--3" data-type="radio" id="u872-accordion-panel--4" style="display:none"></span>
<span class="purchase-options--option-radio--1zjJ_ udlite-fake-toggle-input udlite-fake-toggle-radio udlite-fake-toggle-radio-small"></span>
<span class="udlite-accordion-panel-title">Subscribe</span>
<span>Try it free for 7 days</span>
<span class="udlite-text-xs purchase-section-container--cta-subscript--349MH">$29.99 per month after trial</span>
<span class="purchase-section-container--line--159eG"></span>
<span data-checked="" data-name="u872-accordion--3" data-type="radio" id="u872-accordion-panel--6" style="display:none"></span>
<span class="purchase-options--option-radio--1zjJ_ udlite-fake-toggle-input udlite-fake-toggle-radio udlite-fake-toggle-radio-small"></span>
<span class="udlite-accordion-panel-title">Buy Course</span>
<span class="money-back">30-Day Money-Back Guarantee</span>
As far as your second approach goes, your main div does not have that many nested span elements, so it is bound to fail. Try just one span element:
import requests
from bs4 import BeautifulSoup
url = 'https://www.udemy.com/course/the-complete-web-development-bootcamp/'
response = requests.get(url)
doc = BeautifulSoup(response.text, "html.parser")
main = doc.find(class_="generic-purchase-section--main-cta-container--3xxeM")
title = main.select_one("div span")
print(title)
Prints:
<span class="money-back">30-Day Money-Back Guarantee</span>
import requests
from bs4 import BeautifulSoup
url = 'https://www.officialcharts.com/charts/singles-chart'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    print(link.get('href'))
def chart_spider(max_pages):
    page = 1
    while page >= max_pages:
        url = "https://www.officialcharts.com/charts/singles-chart"
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {"class": "title"}):
            href = "BAD HABITS" + link.title(href)
            print(href)
        page += 1

chart_spider(1)
Wondering how to make this print just the titles of the songs instead of the entire page. I want it to go through the top 100 charts and print all the titles for now. Thanks
Here is a possible solution, which modifies your code as little as possible:
#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup
URL = 'https://www.officialcharts.com/charts/singles-chart'
def chart_spider():
    source_code = requests.get(URL)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for title in soup.find_all('div', {"class": "title"}):
        print(title.contents[1].string)

chart_spider()
The result is a list of all the titles found in the page, one per line.
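For reference, `title.contents[1].string` works because `contents[0]` is the whitespace text node before the link. A minimal offline stand-in, assuming the chart page nests an <a> inside each title div (as it did when this was written):

```python
from bs4 import BeautifulSoup

# a stand-in for one chart entry
html = '<div class="title">\n<a href="/song">BAD HABITS</a>\n</div>'
soup = BeautifulSoup(html, 'html.parser')
title = soup.find('div', class_='title')
# contents[0] is the leading newline text node; contents[1] is the <a> tag
print(title.contents[1].string)  # BAD HABITS
```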
If all you want is the titles for each song on the top 100,
this code:
import requests
from bs4 import BeautifulSoup
url='https://www.officialcharts.com/charts/singles-chart/'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
titles = [i.text.replace('\n', '') for i in soup.find_all('div', class_="title")]
does what you are looking for.
You can do it like this.
The song title is inside a <div> tag whose class name is title.
Select all those <div> tags with .find_all(). This gives you a list of all matching <div> tags.
Iterate over the list and print the text of each div.
from bs4 import BeautifulSoup
import requests
url = 'https://www.officialcharts.com/charts/singles-chart/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
d = soup.find_all('div', class_='title')
for i in d:
    print(i.text.strip())
Sample Output:
BAD HABITS
STAY
REMEMBER
BLACK MAGIC
VISITING HOURS
HAPPIER THAN EVER
INDUSTRY BABY
WASTED
...
I'm trying to parse this page: https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/
The problem is, in this element: https://gyazo.com/e544be64a41a121bdb0c0f71aef50692 ,
I want the div that contains the price. If you inspect the page, you can see the html code for this part, shows like this:
<div class="price">
<div class="price">
"thePrice"
<sup>93</sup>
</div>
</div>
BUT, when using page_soup = soup(my_html_page, 'html.parser') or page_soup = soup(my_html_page, 'lxml') or page_soup = soup(my_html_page, 'html5lib') I only get this as the result for that part:
<div class="price"></div>
And that's it. I've been searching for hours on the internet to figure out why that inner div doesn't get parsed.
Three different parsers, and none seems to get past the fact that the inner child shares the same class name as its parent, if that is even the issue.
Hope this helps you.
from bs4 import BeautifulSoup
import requests
url = 'https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/'
html = BeautifulSoup(requests.get(url).content, 'html.parser')
prices = html.find_all("div", {"class": "price"})
for price in prices:
    print(price.text)
Print output:
561€95
169€94
165€95
1 165€94
7 599€95
267€95
259€94
599€95
511€94
1 042€94
2 572€94
783€95
2 479€94
2 699€95
499€94
386€95
169€94
2 343€95
783€95
499€94
499€94
259€94
267€95
165€95
169€94
2 399€95
561€95
2 699€95
2 699€95
6 059€95
7 589€95
10 991€95
9 619€94
2 479€94
3 135€95
7 589€95
511€94
1 042€94
386€95
599€95
1 165€94
2 572€94
783€95
2 479€94
2 699€95
499€94
169€94
2 343€95
2 699€95
3 135€95
6 816€95
7 589€95
561€95
267€95
To scrape all prices where class="price", see this example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# Select all the 'price' classes
for tag in soup.select('div.price'):
    print(tag.text)
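Because select takes a CSS selector, the nested same-class divs from the question can also be targeted directly. A self-contained sketch on markup shaped like the snippet in the question (not the live page):

```python
from bs4 import BeautifulSoup

html = '''
<div class="price"><div class="price">561<sup>95</sup></div></div>
<div class="price"><div class="price">169<sup>94</sup></div></div>
'''
soup = BeautifulSoup(html, 'html.parser')
# 'div.price > div.price' matches only the inner divs, skipping the wrappers
inner = [tag.get_text() for tag in soup.select('div.price > div.price')]
print(inner)  # ['56195', '16994']
```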
My goal is to retrieve the ids of tweets in a twitter search as they are being posted. My code so far looks like this:
import requests
from bs4 import BeautifulSoup
keys = some_key_words + " -filter:retweets AND -filter:replies"
query = "https://twitter.com/search?f=tweets&vertical=default&q=" + keys + "&src=typd&lang=es"
req = requests.get(query).text
soup = BeautifulSoup(req, "lxml")
for tweets in soup.findAll("li", {"class": "js-stream-item stream-item stream-item"}):
    print(tweets)
However, this doesn't return anything. Is there a problem with the code itself or am I looking at the wrong place of the source code? I understand that the ids should be stored here:
<div class="stream">
<ol class="stream-items js-navigable-stream" id="stream-items-id">
<li class="js-stream-item stream-item stream-item" data-item-id="1210306781806833664" id="stream-item-tweet-1210306781806833664" data-item-type="tweet">
from bs4 import BeautifulSoup
data = """
<div class="stream">
<ol class="stream-items js-navigable-stream" id="stream-items-id">
<li class="js-stream-item stream-item stream-item
" data-item-id="1210306781806833664"
id="stream-item-tweet-1210306781806833664"
data-item-type="tweet"
>
...
"""
soup = BeautifulSoup(data, 'html.parser')
for item in soup.findAll("li", {'class': 'js-stream-item stream-item stream-item'}):
    print(item.get("data-item-id"))
Output:
1210306781806833664
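Note that BeautifulSoup also matches a single class out of a multi-class attribute, which is less brittle than reproducing the exact class string. A sketch against the same kind of fragment:

```python
from bs4 import BeautifulSoup

data = '''
<li class="js-stream-item stream-item stream-item"
    data-item-id="1210306781806833664"
    id="stream-item-tweet-1210306781806833664"
    data-item-type="tweet"></li>
'''
soup = BeautifulSoup(data, 'html.parser')
# class_ matches any element carrying this one class, whatever else it has
ids = [item.get('data-item-id') for item in soup.find_all('li', class_='js-stream-item')]
print(ids)  # ['1210306781806833664']
```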
I am attempting to scrape HTML to create a dictionary that includes a pitcher's name and his handedness. The data tags are buried; so far I've only been able to collect the pitcher's name from the data set. The HTML output (for each player) is as follows:
<div class="pitcher players">
<input name="import-data" type="hidden" value="%5B%7B%22slate_id%22%3A20190%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210893103%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%2C%7B%22slate_id%22%3A20192%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210894893%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%2C%7B%22slate_id%22%3A20193%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210895115%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%5D"/>
<a class="player-popup" data-url="https://rotogrinders.com/players/johnny-cueto-11193?site=draftkings" href="https://rotogrinders.com/players/johnny-cueto-11193">Johnny Cueto</a>
<span class="meta stats">
<span class="stats">
R
</span>
<span class="salary" data-role="salary" data-salary="$11.8K">
$11.8K
</span>
<span class="fpts" data-fpts="14.96" data-product="56" data-role="authorize" title="Projected Points">14.96</span>
I've tinkered and keep coming up empty; I'm sure I'm overthinking this. Here is the code I have so far:
import requests
from bs4 import BeautifulSoup
url = "https://rotogrinders.com/lineups/mlb?site=draftkings"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
players_confirmed = {}
results = [soup.find_all("div", {'class': 'pitcher players'})]
What's the best way to loop through the results set for the more granular data tag information I need?
I need the name text from the <a class="player-popup"> tag, and the handed-ness from the <span class="stats"> tag.
Optimally, I would have a dictionary with the following:
{Johnny Cueto : R, Player 2 : L, ...}
import requests
from bs4 import BeautifulSoup
url = "https://rotogrinders.com/lineups/mlb?site=draftkings"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
players_confirmed = {}
results = soup.find_all("div", {'class': 'pitcher players'})
dicti = {}
for j in results:
    dicti[j.a.text] = j.select(".stats")[1].text.strip("\n").strip()
Just use the select or find function on the found element, and you will be able to iterate over the nested tags.
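Against the HTML fragment from the question, the same pattern can be exercised offline (a sketch using the markup shown above; the class names on the live page may have changed since):

```python
from bs4 import BeautifulSoup

html = '''
<div class="pitcher players">
  <a class="player-popup" href="https://rotogrinders.com/players/johnny-cueto-11193">Johnny Cueto</a>
  <span class="meta stats">
    <span class="stats">R</span>
    <span class="salary" data-role="salary" data-salary="$11.8K">$11.8K</span>
  </span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
handedness = {}
for div in soup.find_all('div', class_='pitcher players'):
    # .select('.stats') matches both <span class="meta stats"> and the
    # inner <span class="stats">; index 1 is the inner one holding R/L
    handedness[div.a.text] = div.select('.stats')[1].text.strip()
print(handedness)  # {'Johnny Cueto': 'R'}
```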