I tried the Beautiful Soup code to scrape comments from Facebook from this link: https://python.gotrained.com/scraping-facebook-posts-comments/ Apart from the main complete code given on the website, to run it one needs to put a username and password in a JSON-structured credentials file, along with a list of public Facebook pages to scrape (examples of both are given on the link). I followed the instructions and ran the code, but got the following error:
INFO:root:[*] Logged in.
Traceback (most recent call last):
File "/Users/vivekrmk/Documents/Github_general/scrape_fb_beautiful_soup/facebook_scrapper_soup.py", line 215, in <module>
posts_data = crawl_profile(session, base_url, profile_url, 100)
File "/Users/vivekrmk/Documents/Github_general/scrape_fb_beautiful_soup/facebook_scrapper_soup.py", line 72, in crawl_profile
show_more_posts_url = profile_bs.find('div', id=posts_id).next_sibling.a['href']
AttributeError: 'NoneType' object has no attribute 'a'
When I commented out lines 70 to 76 in the main code:
# show_more_posts_url = None
# if not posts_completed(scraped_posts, post_limit):
#     show_more_posts_url = profile_bs.find('div', id=posts_id).next_sibling.a['href']
#     profile_bs = get_bs(session, base_url + show_more_posts_url)
#     time.sleep(3)
# else:
#     break
I was able to get the output as JSON with values in all the fields (i.e. post URL, post text and media_url) except the comments field, which was an empty list. I need help with the above so that I can scrape the comments as well. Thanks in advance!
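The AttributeError means that either profile_bs.find('div', id=posts_id) or its next_sibling came back as None, so there is no "show more" link on that page to follow. A minimal defensive sketch of that step, which stops paging instead of crashing; the html string and posts_id value here are stand-ins, not the real Facebook markup:

```python
from bs4 import BeautifulSoup

# Stand-in page: the posts container has no "show more" sibling.
html = '<div id="recent"><p>a post</p></div>'
profile_bs = BeautifulSoup(html, 'html.parser')
posts_id = 'recent'  # hypothetical id for illustration

show_more_posts_url = None
posts_div = profile_bs.find('div', id=posts_id)
sibling = posts_div.next_sibling if posts_div is not None else None
# guard each step of the chain instead of assuming the link exists
if sibling is not None and getattr(sibling, 'a', None) is not None:
    show_more_posts_url = sibling.a['href']
print(show_more_posts_url)  # None here: no further posts to load
```

When show_more_posts_url stays None, the crawl loop can simply break rather than request base_url + None.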
I'm subscribed to Skillshare, but it turns out Skillshare's UI is a huge mess and unproductive for learning, so I am looking for a way to mass-download a single course at once.
I found this GitHub repo:
https://github.com/crazygroot/skillsharedownloader
It also has a Google Colab link at the bottom:
https://colab.research.google.com/drive/1hUUPDDql0QLul7lB8NQNaEEq-bbayEdE#scrollTo=xunEYHutBEv%2F
I'm getting the below error:
Traceback (most recent call last):
File "/root/Skillsharedownloader/ss.py", line 11, in <module>
dl.download_course_by_url(course_url)
File "/root/Skillsharedownloader/downloader.py", line 34, in download_course_by_url
raise Exception('Failed to parse class ID from URL')
Exception: Failed to parse class ID from URL
This is the course link that I'm using:
https://www.skillshare.com/en/classes/React-JS-Learn-by-examples/1186943986/
I have encountered a similar issue. The problem is that the downloader requires the URL to match https://www.skillshare.com/classes/.*?/(\d+).
If you're copying the URL from the address bar, check it again and make sure it has that format. Your current one looks like https://www.skillshare.com/en/classes/xxxxx, so simply remove the /en.
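A small sketch of the ID check the downloader presumably performs (the function name parse_class_id and the locale-stripping step are my own illustration, not the repo's actual code): the class ID is the run of digits after the class slug, and an /en locale segment breaks a pattern anchored on /classes/.

```python
import re

def parse_class_id(url):
    # strip an optional two-letter locale segment such as /en before matching
    url = re.sub(r'skillshare\.com/[a-z]{2}/classes/',
                 'skillshare.com/classes/', url)
    match = re.search(r'skillshare\.com/classes/.*?/(\d+)', url)
    return match.group(1) if match else None

print(parse_class_id('https://www.skillshare.com/en/classes/React-JS-Learn-by-examples/1186943986/'))
# prints 1186943986 -- the same ID the non-/en form yields
```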
I'm trying to build an add-on which can look through all the notes in my Anki collection via a specific tag, and when it finds the tag, pull a word out of the focus field, search Jisho for that word, and then add the meaning from Jisho into the meanings field. I have tested the web scraper and it works, but I'm struggling to interact with Anki.
I have written the code below based on the Anki documentation:
from aqt import mw
from aqt.qt import QAction, qconnect
import requests
from bs4 import BeautifulSoup

def return_search(word):
    html = f"https://jisho.org/word/{word}"
    webpage = requests.get(html).content
    soup = BeautifulSoup(webpage, "html.parser")
    meanings_list = []
    meanings = soup.find_all(attrs={"class": "meaning-meaning"})
    for count, item in enumerate(meanings):
        meanings_list.append(f"{count+1}) {item.get_text()}")
    return '\n\n'.join(meanings_list)
def testFunction() -> None:
    ids = mw.col.find_cards("tag:jpzr")
    for _id in ids:
        note = mw.col.getNote(_id)
        meaning_list = return_search(note["Focus"])
        note["Meaning"] += meaning_list
        note.flush()

# create a new menu item, "test"
action = QAction("test", mw)
# set it to call testFunction when it's clicked
qconnect(action.triggered, testFunction)
# and add it to the tools menu
mw.form.menuTools.addAction(action)
I get an error on line 27, which is the line
note = mw.col.getNote(_id)
I don't know why it isn't accessing the notes correctly, and Anki's documentation is lacking. This is the error message I get:
Caught exception:
Traceback (most recent call last):
File "C:\Users\aaron\AppData\Roaming\Anki2\addons21\myaddon\__init__.py", line 33, in testFunction
note = mw.col.getNote(_id)
File "anki\collection.py", line 309, in getNote
File "anki\notes.py", line 34, in __init__
File "anki\notes.py", line 40, in load
File "anki\rsbackend_gen.py", line 350, in get_note
File "anki\rsbackend.py", line 267, in _run_command
anki.rsbackend.NotFoundError
You're passing a card ID to a function which searches through note IDs. Instead of
note = mw.col.getNote(_id)
it should be
note = mw.col.get_card(_id).note()
I'm also just starting out in Anki add-on writing, so this is not a comprehensive answer. The problem in the code above is that the integer _id is a card ID, not a note ID; the card is the thing you actually see in the app, and the note is the data entry which generates at least one card.
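Put together, the loop could look like this (an untested sketch, since it depends on the Anki runtime):

```python
def testFunction() -> None:
    ids = mw.col.find_cards("tag:jpzr")
    for _id in ids:
        # find_cards returns card IDs, so resolve the card to its note first
        note = mw.col.get_card(_id).note()
        note["Meaning"] += return_search(note["Focus"])
        note.flush()
```

Alternatively, search for notes directly with mw.col.find_notes("tag:jpzr") and mw.col.get_note(_id); that avoids the card-to-note hop and also avoids processing the same note once per card when a note generates several cards.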
So I'm learning using ATBWP (Automate the Boring Stuff with Python), and I'm now writing a program that opens the top 5 search results on a website.
It all works up until I have to get the href for each of the top results and open it. I get this error:
Traceback (most recent call last):
File "C:\Users\Asus\Desktop\pyhton\projects\emagSEARCH.py", line 33, in <module>
webbrowser.open(url)
File "C:\Users\Asus\AppData\Local\Programs\Python\Python38-32\lib\webbrowser.py", line 86, in open
if browser.open(url, new, autoraise):
File "C:\Users\Asus\AppData\Local\Programs\Python\Python38-32\lib\webbrowser.py", line 603, in open
os.startfile(url)
TypeError: startfile: filepath should be string, bytes or os.PathLike, not NoneType
This is how the HTML looks:
<a href="https://comenzi.farmaciatei.ro/ingrijire-personala/ingrijire-corp-si-fata/tratamente-/the-no-brainer-set-the-ordinary-deciem-p344003">The No-Brainer Set The Ordinary, Deciem</a>
And this is the part of my code which won't work for some reason:
Soup = bs4.BeautifulSoup(res.text, 'html.parser')
results = Soup.select('.item-title')
numberTabs = min(5, len(results))
print('Opening top ' + str(numberTabs) + ' results...')
for i in range(numberTabs):
    url = results[i].get('href')
    webbrowser.open(url)
It does what it should until the for loop. It looks pretty much exactly like the example program in the book, so I don't understand why it doesn't work. What am I doing wrong?
If you want to extract the href under the a tag, then use this:
html = '<a href="https://comenzi.farmaciatei.ro/ingrijire-personala/ingrijire-corp-si-fata/tratamente-/the-no-brainer-set-the-ordinary-deciem-p344003">The No-Brainer Set The Ordinary, Deciem</a>'
Soup = bs4.BeautifulSoup(html, 'html.parser')
url = Soup.find('a')['href']
print(url)
webbrowser.open(url)
Output:
https://comenzi.farmaciatei.ro/ingrijire-personala/ingrijire-corp-si-fata/tratamente-/the-no-brainer-set-the-ordinary-deciem-p344003
You can do the same for all a tags in order to get all the hrefs.
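A likely explanation for the original loop is that .item-title matched a wrapper element rather than the link itself, so .get('href') returned None. Selecting the nested a tags collects every href in one pass; the markup and URLs below are made-up placeholders for illustration:

```python
import bs4

# Hypothetical markup mirroring the structure described in the question.
html = '''
<div class="item-title"><a href="https://example.com/product-1">Product 1</a></div>
<div class="item-title"><a href="https://example.com/product-2">Product 2</a></div>
'''
Soup = bs4.BeautifulSoup(html, 'html.parser')
# select the <a> inside each .item-title, then read its href
urls = [a['href'] for a in Soup.select('.item-title a')]
print(urls)
```

Each entry in urls can then be passed to webbrowser.open() as in the original program.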
I am learning how to scrape web information. Below is a snippet of the actual code solution plus output from DataCamp.
On DataCamp this works perfectly fine, but when I try to run it in Spyder (on my own MacBook), it doesn't work...
This is because on DataCamp the URL has already been pre-loaded into a variable named response, whereas in Spyder the variable needs to be defined again.
So I first defined the response variable as response = requests.get('https://www.datacamp.com/courses/all') so that the code points to DataCamp's website.
My code looks like:
from scrapy.selector import Selector
import requests
response = requests.get('https://www.datacamp.com/courses/all')
this_url = response.url
this_title = response.xpath('/html/head/title/text()').extract_first()
print_url_title(this_url, this_title)
When I run this on Spyder, I got an error message
Traceback (most recent call last):
File "<ipython-input-30-6a8340fd3a71>", line 11, in <module>
this_title = response.xpath('/html/head/title/text()').extract_first()
AttributeError: 'Response' object has no attribute 'xpath'
Could someone please guide me? I would really like to know how to get this code working on Spyder.. thank you very much.
The value returned by requests.get('https://www.datacamp.com/courses/all') is a requests Response object, and this object has no attribute xpath, hence the error: AttributeError: 'Response' object has no attribute 'xpath'
I assume response in your tutorial source was assigned to another object (most likely the object returned by etree.HTML) and not the value returned by requests.get(url).
You can however do this:
from lxml import etree #import etree
response = requests.get('https://www.datacamp.com/courses/all') #get the Response object
tree = etree.HTML(response.text) #pass the page's source using the Response object
result = tree.xpath('/html/head/title/text()') #extract the value
print(response.url) #url
print(result) #findings
Hi there. I'm building a simple scraping tool. Here's the code that I have for it.
from bs4 import BeautifulSoup
import requests
from lxml import html
import gspread
from oauth2client.service_account import ServiceAccountCredentials
import datetime
scope = ['https://spreadsheets.google.com/feeds']
credentials = ServiceAccountCredentials.from_json_keyfile_name('Programming 4 Marketers-File-goes-here.json', scope)
site = 'http://nathanbarry.com/authority/'
hdr = {'User-Agent': 'Mozilla/5.0'}
req = requests.get(site, headers=hdr)
soup = BeautifulSoup(req.content, 'html.parser')
def getFullPrice(soup):
    divs = soup.find_all('div', id='complete-package')
    price = ""
    for i in divs:
        price = i.a
    completePrice = (str(price).split('$', 1)[1]).split('<', 1)[0]
    return completePrice

def getVideoPrice(soup):
    divs = soup.find_all('div', id='video-package')
    price = ""
    for i in divs:
        price = i.a
    videoPrice = (str(price).split('$', 1)[1]).split('<', 1)[0]
    return videoPrice
fullPrice = getFullPrice(soup)
videoPrice = getVideoPrice(soup)
date = datetime.date.today()
gc = gspread.authorize(credentials)
wks = gc.open("Authority Tracking").sheet1
row = len(wks.col_values(1))+1
wks.update_cell(row, 1, date)
wks.update_cell(row, 2, fullPrice)
wks.update_cell(row, 3, videoPrice)
This script runs fine on my local machine, but when I deploy it as part of an app to Heroku and try to run it, I get the following error:
Traceback (most recent call last):
File "/app/.heroku/python/lib/python3.6/site-packages/gspread/client.py", line 219, in put_feed
r = self.session.put(url, data, headers=headers)
File "/app/.heroku/python/lib/python3.6/site-packages/gspread/httpsession.py", line 82, in put
return self.request('PUT', url, params=params, data=data, **kwargs)
File "/app/.heroku/python/lib/python3.6/site-packages/gspread/httpsession.py", line 69, in request
response.status_code, response.content))
gspread.exceptions.RequestError: (400, "400: b'Invalid query parameter value for cell_id.'")
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "AuthorityScraper.py", line 44, in
wks.update_cell(row, 1, date)
File "/app/.heroku/python/lib/python3.6/site-packages/gspread/models.py", line 517, in update_cell
self.client.put_feed(uri, ElementTree.tostring(feed))
File "/app/.heroku/python/lib/python3.6/site-packages/gspread/client.py", line 221, in put_feed
if ex[0] == 403:
TypeError: 'RequestError' object does not support indexing
What do you think might be causing this error? Do you have any suggestions for how I can fix it?
There are a couple of things going on:
1) The Google Sheets API returned an error: "Invalid query parameter value for cell_id":
gspread.exceptions.RequestError: (400, "400: b'Invalid query parameter value for cell_id.'")
2) A bug in gspread caused an exception upon receipt of the error:
TypeError: 'RequestError' object does not support indexing
Python 3 removed __getitem__ from BaseException, which this gspread error-handling code relies on. It doesn't matter much in practice, because the handler would have raised an UpdateCellError exception anyway.
My guess is that you are passing an invalid row number to update_cell. It would be helpful to add some debug logging to your script to show, for example, which row it is trying to update.
It may be better to start with a worksheet with zero rows and use append_row instead. However there does seem to be an outstanding issue in gspread with append_row, and it may actually be the same issue you are running into.
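For example, the tail of the script could become something like this (an untested sketch; it requires the same credentials, variables, and sheet as the script above):

```python
import datetime

gc = gspread.authorize(credentials)
wks = gc.open("Authority Tracking").sheet1
# append_row writes after the last non-empty row, so the manual
# row-count arithmetic (a likely source of the bad cell_id) goes away
wks.append_row([str(datetime.date.today()), fullPrice, videoPrice])
```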
I encountered the same problem. BS4 works fine on a local machine, but for some reason it is far too slow on the Heroku server, resulting in the error.
I switched to lxml and it is working fine now.
Install it with:
pip install lxml
A sample code snippet is given below:
from lxml import html
import requests

getpage = requests.get("https://url_here")
gethtmlcontent = html.fromstring(getpage.content)
# this is a sample for fetching data from the dummy div
data = gethtmlcontent.xpath('//div[@class="class-name"]/text()')
data = data[0:n]  # slice as per your requirement
# now inject the data into the Django template
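As a quick offline check of the same XPath pattern (dummy HTML standing in for the real page, and note the attribute selector is @class, not #class):

```python
from lxml import html

# Dummy document with two matching divs.
doc = html.fromstring('<div><div class="class-name">first</div>'
                      '<div class="class-name">second</div></div>')
data = doc.xpath('//div[@class="class-name"]/text()')
print(data)  # ['first', 'second']
```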