Can't access a tweet id with beautiful soup


My goal is to retrieve the ids of tweets in a twitter search as they are being posted. My code so far looks like this:
import requests
from bs4 import BeautifulSoup
keys = some_key_words + " -filter:retweets AND -filter:replies"
query = "https://twitter.com/search?f=tweets&vertical=default&q=" + keys + "&src=typd&lang=es"
req = requests.get(query).text
soup = BeautifulSoup(req, "lxml")
for tweets in soup.findAll("li",{"class":"js-stream-item stream-item stream-item"}):
    print(tweets)
However, this doesn't return anything. Is there a problem with the code itself, or am I looking in the wrong place in the page source? As I understand it, the ids should be stored here:
<div class="stream">
<ol class="stream-items js-navigable-stream" id="stream-items-id">
<li class="js-stream-item stream-item stream-item" data-item-id="1210306781806833664" id="stream-item-tweet-1210306781806833664" data-item-type="tweet">

from bs4 import BeautifulSoup
data = """
<div class="stream">
<ol class="stream-items js-navigable-stream" id="stream-items-id">
<li class="js-stream-item stream-item stream-item
" data-item-id="1210306781806833664"
id="stream-item-tweet-1210306781806833664"
data-item-type="tweet"
>
...
"""
soup = BeautifulSoup(data, 'html.parser')
# The attribute is named data-item-id; read it with .get() on each matching <li>.
for item in soup.findAll("li", {'class': 'js-stream-item stream-item stream-item'}):
    print(item.get("data-item-id"))
Output:
1210306781806833664
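A variation on the same idea, as a minimal sketch: select every li that carries a data-item-id attribute with a CSS attribute selector, so the match does not depend on the exact class string. The sample markup and the helper name extract_tweet_ids are assumptions made for illustration, and this only helps when the markup is actually present in the HTML you fetched.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the snippet in the question.
SAMPLE_HTML = """
<div class="stream">
  <ol class="stream-items js-navigable-stream" id="stream-items-id">
    <li class="js-stream-item stream-item stream-item"
        data-item-id="1210306781806833664"
        id="stream-item-tweet-1210306781806833664"
        data-item-type="tweet"></li>
    <li class="js-stream-item stream-item stream-item"
        data-item-id="1211695349607174144"
        id="stream-item-tweet-1211695349607174144"
        data-item-type="tweet"></li>
  </ol>
</div>
"""


def extract_tweet_ids(html):
    """Return the data-item-id of every <li> that has that attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [li["data-item-id"] for li in soup.select("li[data-item-id]")]


print(extract_tweet_ids(SAMPLE_HTML))
```

If the function returns an empty list for a real response, the li elements are probably not in the HTML that was downloaded.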

Related

Beautiful soup doesn't return full html

I've noticed that my code doesn't return the full html. Here is the code:
import requests
from bs4 import BeautifulSoup
keys = "blabla" + " -filter:retweets AND -filter:replies"
query = "https://twitter.com/search?f=tweets&vertical=default&q=" + keys + "&src=typd&lang=es"
req = requests.get(query)
soup = BeautifulSoup(req.text, "html.parser")
for tweets in soup.findAll("li", {'class': 'js-stream-item stream-item stream-item'}):
    print(tweets.get("data-tweet-id"))
This doesn't print anything and the "li" tag is not even in the soup object, nor the "div class='stream'", even though the twitter search page looks like this:
<div class="stream">
<ol class="stream-items js-navigable-stream" id="stream-items-id">
<li class="js-stream-item stream-item stream-item
" data-item-id="1211695349607174144"
id="stream-item-tweet-1211695349607174144"
data-item-type="tweet"
There are also many other elements that don't appear in my soup object.

How do I extract text after <i class> tag?

I am trying to print the text 'Dealer' from a div using BeautifulSoup, but I do not know how to extract it.
I tried printing the i element, but the text 'Dealer' did not come out:
url = 'https://www.carlist.my/used-cars-for-sale/proton/malaysia'
response = requests.get(url, params={'page_number': 1})
soup = BeautifulSoup(response.text, 'lxml')
articles = soup.find_all('article')[:25]
seller_type = articles[4].find('div', class_ = 'item push-quarter--ends listing__spec--dealer')
seller_type_text = articles[4].find('i', class_ = 'icon icon--secondary muted valign--top push-quarter--right icon--user-formal')
print(seller_type.prettify())
print()
print(seller_type_text)
This is the output that I got:
<div class="item push-quarter--ends listing__spec--dealer">
<i class="icon icon--secondary muted valign--top push-quarter--right icon--user-formal">
</i>
Dealer
<span class="flyout listing__badge listing__badge--trusted-seller inline--block valign--top push-quarter--left">
<i class="icon icon--thumb-up">
</i>
<span class="flyout__content flyout__content--tip visuallyhidden--portable">
This 'Trusted Dealer' has a proven track record of upholding the best car selling practices certified by Carlist.my
</span>
</span>
<!-- used car -->
<!-- BMW -->
</div>
<i class="icon icon--secondary muted valign--top push-quarter--right icon--user-formal"></i>
How do I print the word 'Dealer', which comes right after the i tag and before the span tag?
Can someone please help me?
Thanks a lot!

A faster way is to select the i element by one of its class names and then use next_sibling.
If you examine the HTML, you can see that "Dealer" belongs to the parent div of the i tag and directly follows the i tag, so you can grab the i tag and then use next_sibling:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://www.carlist.my/used-cars-for-sale/proton/malaysia')
soup = bs(r.content, 'lxml')
print(soup.select_one('.icon--user-formal').next_sibling)
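To try this without hitting the site, here is a small offline sketch that runs the same next_sibling lookup against markup trimmed from the output shown in the question. The SAMPLE_HTML string is an assumption about what the live page serves; the whitespace on the real page may differ, which is why the result is stripped.

```python
from bs4 import BeautifulSoup

# Markup trimmed from the prettified output in the question (assumed to
# resemble what the live page serves).
SAMPLE_HTML = """
<div class="item push-quarter--ends listing__spec--dealer">
<i class="icon icon--secondary muted valign--top push-quarter--right icon--user-formal">
</i>
Dealer
<span class="flyout listing__badge listing__badge--trusted-seller inline--block valign--top push-quarter--left">
<i class="icon icon--thumb-up">
</i>
</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
icon = soup.select_one(".icon--user-formal")

# next_sibling is the text node right after the <i> tag. It includes the
# surrounding newlines, so strip it before use.
seller_type = icon.next_sibling.strip()
print(seller_type)
```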

Take a look at the contents property of your seller_type. You'll see that Dealer is at seller_type.contents[2]. In other words,
import requests
from bs4 import BeautifulSoup
url = 'https://www.carlist.my/used-cars-for-sale/proton/malaysia?profile_type=Dealer'
response = requests.get(url, params={'page_number': 1})
soup = BeautifulSoup(response.text, 'lxml')
articles = soup.find_all('article')[:25]
seller_type = articles[4].find('div', class_ = 'item push-quarter--ends listing__spec--dealer')
print(seller_type.contents[2])
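The position of the text inside contents depends on how many whitespace text nodes come before it, so a fixed index can shift if the markup changes. As an alternative sketch (not the approach above, and using a hypothetical helper named direct_text), you can collect the non-empty text nodes that are direct children of the div and skip HTML comments. The sample markup is trimmed from the question's output.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Markup trimmed from the output in the question; one comment is kept to
# show that comments are filtered out.
SAMPLE_HTML = """
<div class="item push-quarter--ends listing__spec--dealer">
<i class="icon icon--secondary muted valign--top push-quarter--right icon--user-formal">
</i>
Dealer
<span class="flyout listing__badge listing__badge--trusted-seller inline--block valign--top push-quarter--left">
<i class="icon icon--thumb-up">
</i>
</span>
<!-- used car -->
</div>
"""


def direct_text(tag):
    """Return the non-empty text nodes that are direct children of tag."""
    return [
        child.strip()
        for child in tag.children
        if isinstance(child, NavigableString)
        and not isinstance(child, Comment)
        and child.strip()
    ]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
seller_type = soup.find("div", class_="listing__spec--dealer")
print(direct_text(seller_type))
```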


How to use BeautifulSoup to get real-time stock price on website?

I am working on a project to get the real-time stock price from http://www.jpmhkwarrants.com/en_hk/market-statistics/underlying/underlying-terms/code/1. I have searched online and tried several ways to get the price, but they all failed. Here is my code:
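from urllib.request import urlopen
from bs4 import BeautifulSoup

def getStockPrice():
    url = "http://www.jpmhkwarrants.com/zh_hk/market-statistics/underlying/underlying-terms/code/1"
    r = urlopen(url)
    soup = BeautifulSoup(r.read(), 'lxml')
    # Look up the container by id, then the span with class "price" inside it.
    price = soup.find(id="real_time_box").find("span", {"class": "price"})
    print(price)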
The output is "None". I know that the price is scripted in the function above but I have no idea how to get the price. Can this be solved with BeautifulSoup, or do I need another module?

If you view the page source, you will see HTML like this:
<div class="table detail">
.....
<div class="tl">即市走勢 <span class="description">前收市價</span>
.....
<td>買入價(延遲*)<span>82.15</span></td>
The span we want is the second match (index 1); select it with:
price = soup.select('.table.detail td span')[1]
print(price.text)
Demo:
from bs4 import BeautifulSoup
from urllib.request import urlopen

def getStockPrice():
    url = "http://www.jpmhkwarrants.com/zh_hk/market-statistics/underlying/underlying-terms/code/1"
    r = urlopen(url)
    soup = BeautifulSoup(r.read(), 'html.parser')
    price = soup.select('.table.detail td span')[1]
    print(price.text)

getStockPrice()
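A positional index like [1] breaks if the page adds or reorders cells. As a rough sketch of an alternative, you can look for the table cell whose text contains the label and read the span inside it. The sample markup below is hypothetical (only the 買入價 cell and the div.tl line come from the snippet above; the other row is invented for illustration), and find_price is a made-up helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the 賣出價 row is invented for illustration; the
# 買入價 cell and the div.tl line follow the snippet quoted in the answer.
SAMPLE_HTML = """
<div class="table detail">
  <div class="tl">即市走勢 <span class="description">前收市價</span></div>
  <table>
    <tr><td>賣出價(延遲*)<span>82.20</span></td></tr>
    <tr><td>買入價(延遲*)<span>82.15</span></td></tr>
  </table>
</div>
"""


def find_price(html, label):
    """Return the text of the span in the first cell whose text contains label."""
    soup = BeautifulSoup(html, "html.parser")
    for td in soup.select(".table.detail td"):
        if label in td.get_text() and td.span is not None:
            return td.span.get_text(strip=True)
    return None  # label not found


print(find_price(SAMPLE_HTML, "買入價"))
```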

Scraping multiple data tags from HTML using beautiful Soup

I am attempting to scrape HTML to create a dictionary that maps each pitcher's name to his handedness. The data tags are buried, and so far I've only been able to collect the pitcher's name from the data set. The HTML output (for each player) is as follows:
<div class="pitcher players">
<input name="import-data" type="hidden" value="%5B%7B%22slate_id%22%3A20190%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210893103%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%2C%7B%22slate_id%22%3A20192%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210894893%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%2C%7B%22slate_id%22%3A20193%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210895115%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%5D"/>
<a class="player-popup" data-url="https://rotogrinders.com/players/johnny-cueto-11193?site=draftkings" href="https://rotogrinders.com/players/johnny-cueto-11193">Johnny Cueto</a>
<span class="meta stats">
<span class="stats">
R
</span>
<span class="salary" data-role="salary" data-salary="$11.8K">
$11.8K
</span>
<span class="fpts" data-fpts="14.96" data-product="56" data-role="authorize" title="Projected Points">14.96</span>
I've tinkered and am coming up empty. I'm sure I'm overthinking this. Here is the code I have so far:
import requests
from bs4 import BeautifulSoup
url = "https://rotogrinders.com/lineups/mlb?site=draftkings"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
players_confirmed = {}
results = [soup.find_all("div", {'class': 'pitcher players'})]
What's the best way to loop through the results set for the more granular data tag information I need?
I need the player's name from the <a class="player-popup"> tag, and the handedness from the <span class="stats"> tag.
Optimally, I would have a dictionary with the following:
{Johnny Cueto : R, Player 2 : L, ...}

import requests
from bs4 import BeautifulSoup
url = "https://rotogrinders.com/lineups/mlb?site=draftkings"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
players_confirmed = {}
results = soup.find_all("div", {'class': 'pitcher players'})
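dicti = {}
for j in results:
    # j.a is the first <a> in the block (the player name); the second
    # ".stats" match is the inner span that holds the handedness.
    dicti[j.a.text] = j.select(".stats")[1].text.strip()
Just call select or find on each element you found, and you will be able to iterate over the nested tags.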
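Here is a self-contained sketch of the same loop that runs against static markup instead of the live page. The second player block and the helper name pitcher_handedness are assumptions for illustration. It uses the selector "span.meta span.stats" to pick the inner span directly rather than indexing into the list of ".stats" matches.

```python
from bs4 import BeautifulSoup

# First block trimmed from the question; the second block is hypothetical.
SAMPLE_HTML = """
<div class="pitcher players">
  <a class="player-popup" href="https://rotogrinders.com/players/johnny-cueto-11193">Johnny Cueto</a>
  <span class="meta stats">
    <span class="stats">
      R
    </span>
    <span class="salary" data-role="salary" data-salary="$11.8K">
      $11.8K
    </span>
  </span>
</div>
<div class="pitcher players">
  <a class="player-popup" href="#">Player 2</a>
  <span class="meta stats">
    <span class="stats">
      L
    </span>
  </span>
</div>
"""


def pitcher_handedness(html):
    """Map each pitcher's name to the text of the inner stats span."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for block in soup.select("div.pitcher.players"):
        name = block.select_one("a.player-popup")
        hand = block.select_one("span.meta span.stats")
        # Skip blocks that are missing either piece of data.
        if name is not None and hand is not None:
            result[name.get_text(strip=True)] = hand.get_text(strip=True)
    return result


print(pitcher_handedness(SAMPLE_HTML))
```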

wrong python html parsing

My code:
from bs4 import BeautifulSoup
import urllib.request
url = "http://yaz.tek.firat.edu.tr/tr/duyurular"
url_oku = urllib.request.urlopen(url)
soup = BeautifulSoup(url_oku, 'html.parser')
icerik = soup.find_all('div',attrs={'class':'views-row views-row-1 views-row-odd views-row-first'})
print(icerik)
My output:
[<div class="views-row views-row-1 views-row-odd views-row-first">
<span class="views-field views-field-title"> <span class="field-content">Grup-1, Grup-2, Grup-3, Grup-4 ve Grup-6 Öğrencileri İçin Staj Sunum Tarihleri</span> </span>
<span class="views-field views-field-created"> <span class="field-content"><i class="fa fa-calendar"></i> Salı, Aralık 5, 2017 - 09:58 </span> </span> </div>]
But I want to get just " Grup-1, Grup-2, Grup-3, Grup-4 ve Grup-6 Öğrencileri İçin Staj Sunum Tarihleri ". How can I achieve that?

You can call .text on a result from BeautifulSoup. It takes the textual content of the elements found, skipping the tags of the elements.
e.g.
from bs4 import BeautifulSoup
import urllib.request
url = "http://yaz.tek.firat.edu.tr/tr/duyurular"
url_oku = urllib.request.urlopen(url)
soup = BeautifulSoup(url_oku, 'html.parser')
icerik = soup.find_all('div',attrs={'class':'views-row views-row-1 views-row-odd views-row-first'})
for result in icerik:
    print(result.text)
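Note that .text on the whole div returns the text of every descendant, so it includes the date as well as the title. If only the title is wanted, one option is to narrow the selection to the title span first. This is a minimal sketch against the markup from the question's output, not a tested run against the live page.

```python
from bs4 import BeautifulSoup

# Markup copied from the output shown in the question.
SAMPLE_HTML = """
<div class="views-row views-row-1 views-row-odd views-row-first">
<span class="views-field views-field-title"> <span class="field-content">Grup-1, Grup-2, Grup-3, Grup-4 ve Grup-6 Öğrencileri İçin Staj Sunum Tarihleri</span> </span>
<span class="views-field views-field-created"> <span class="field-content"><i class="fa fa-calendar"></i> Salı, Aralık 5, 2017 - 09:58 </span> </span> </div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Select only the field-content span nested inside the title field.
title_span = soup.select_one(".views-field-title .field-content")
title = title_span.get_text(strip=True)
print(title)
```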

You can also try it like this to get the titles and links from that page. I used a CSS selector to get them:
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
url = "http://yaz.tek.firat.edu.tr/tr/duyurular"
res = requests.get(url)
soup = BeautifulSoup(res.text,'lxml')
for item in soup.select("#content .field-content a"):
    link = urljoin(url,item['href'])
    print("Title: {}\nLink: {}\n".format(item.text,link))
Partial output:
Title: 2017-2018 Güz Dönemi Final Sınav Programı (TASLAK)
Link: http://yaz.tek.firat.edu.tr/tr/node/481
Title: NETAŞ İşyeri Eğitimi Mülakatları Hakkında Duyuru
Link: http://yaz.tek.firat.edu.tr/tr/node/480
