I tried the BeautifulSoup code from this tutorial to scrape comments from Facebook: https://python.gotrained.com/scraping-facebook-posts-comments/ Apart from the main complete code given on the website, you need to place a username and password in a JSON-structured credentials file, plus a list of public Facebook pages to scrape (examples of both are given at the link). I followed the instructions and ran the code, but got the following error:
INFO:root:[*] Logged in.
Traceback (most recent call last):
File "/Users/vivekrmk/Documents/Github_general/scrape_fb_beautiful_soup/facebook_scrapper_soup.py", line 215, in <module>
posts_data = crawl_profile(session, base_url, profile_url, 100)
File "/Users/vivekrmk/Documents/Github_general/scrape_fb_beautiful_soup/facebook_scrapper_soup.py", line 72, in crawl_profile
show_more_posts_url = profile_bs.find('div', id=posts_id).next_sibling.a['href']
AttributeError: 'NoneType' object has no attribute 'a'
When I commented out lines 70 to 76 in the main code:
# show_more_posts_url = None
# if not posts_completed(scraped_posts, post_limit):
#     show_more_posts_url = profile_bs.find('div', id=posts_id).next_sibling.a['href']
#     profile_bs = get_bs(session, base_url+show_more_posts_url)
#     time.sleep(3)
# else:
#     break
I was able to get the output as JSON with values in all the fields (i.e. post URL, post text and media_url) except the comments field, which was a blank list. I need help with the above so that I can scrape the comments as well. Thanks in advance!
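For context, the AttributeError suggests that the div found by posts_id has no usable next_sibling anchor on that page, so a None ends up where a tag is expected. A minimal defensive guard, assuming the same profile_bs, posts_id, get_bs, base_url, session, scraped_posts and post_limit names used in the tutorial code (not the tutorial author's own fix), would be something like:

show_more_posts_url = None
if not posts_completed(scraped_posts, post_limit):
    posts_div = profile_bs.find('div', id=posts_id)
    next_div = posts_div.next_sibling if posts_div is not None else None
    if next_div is not None and next_div.a is not None:
        show_more_posts_url = next_div.a['href']
        profile_bs = get_bs(session, base_url + show_more_posts_url)
        time.sleep(3)
    else:
        # no "show more" link on this page, so stop paginating
        break
else:
    break

With a guard like this, the crawler simply stops paginating when the "show more" link is missing instead of raising.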
Hi there. I'm building a simple scraping tool. Here's the code that I have for it.
from bs4 import BeautifulSoup
import requests
from lxml import html
import gspread
from oauth2client.service_account import ServiceAccountCredentials
import datetime
scope = ['https://spreadsheets.google.com/feeds']
credentials = ServiceAccountCredentials.from_json_keyfile_name('Programming 4 Marketers-File-goes-here.json', scope)
site = 'http://nathanbarry.com/authority/'
hdr = {'User-Agent':'Mozilla/5.0'}
req = requests.get(site, headers=hdr)
soup = BeautifulSoup(req.content)
def getFullPrice(soup):
    divs = soup.find_all('div', id='complete-package')
    price = ""
    for i in divs:
        price = i.a
    completePrice = (str(price).split('$', 1)[1]).split('<', 1)[0]
    return completePrice

def getVideoPrice(soup):
    divs = soup.find_all('div', id='video-package')
    price = ""
    for i in divs:
        price = i.a
    videoPrice = (str(price).split('$', 1)[1]).split('<', 1)[0]
    return videoPrice
fullPrice = getFullPrice(soup)
videoPrice = getVideoPrice(soup)
date = datetime.date.today()
gc = gspread.authorize(credentials)
wks = gc.open("Authority Tracking").sheet1
row = len(wks.col_values(1))+1
wks.update_cell(row, 1, date)
wks.update_cell(row, 2, fullPrice)
wks.update_cell(row, 3, videoPrice)
This script runs fine on my local machine. But when I deploy it as part of an app to Heroku and try to run it, I get the following error:
Traceback (most recent call last):
File "/app/.heroku/python/lib/python3.6/site-packages/gspread/client.py", line 219, in put_feed
r = self.session.put(url, data, headers=headers)
File "/app/.heroku/python/lib/python3.6/site-packages/gspread/httpsession.py", line 82, in put
return self.request('PUT', url, params=params, data=data, **kwargs)
File "/app/.heroku/python/lib/python3.6/site-packages/gspread/httpsession.py", line 69, in request
response.status_code, response.content))
gspread.exceptions.RequestError: (400, "400: b'Invalid query parameter value for cell_id.'")
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "AuthorityScraper.py", line 44, in
wks.update_cell(row, 1, date)
File "/app/.heroku/python/lib/python3.6/site-packages/gspread/models.py", line 517, in update_cell
self.client.put_feed(uri, ElementTree.tostring(feed))
File "/app/.heroku/python/lib/python3.6/site-packages/gspread/client.py", line 221, in put_feed
if ex[0] == 403:
TypeError: 'RequestError' object does not support indexing
What do you think might be causing this error? Do you have any suggestions for how I can fix it?
There are a couple of things going on:
1) The Google Sheets API returned an error: "Invalid query parameter value for cell_id":
gspread.exceptions.RequestError: (400, "400: b'Invalid query parameter value for cell_id.'")
2) A bug in gspread caused an exception upon receipt of the error:
TypeError: 'RequestError' object does not support indexing
Python 3 removed __getitem__ from BaseException, which this gspread error handling relies on. This doesn't matter too much, because it would have raised an UpdateCellError exception anyway.
My guess is that you are passing an invalid row number to update_cell. It would be helpful to add some debug logging to your script to show, for example, which row it is trying to update.
It may be better to start with a worksheet with zero rows and use append_row instead. However, there does seem to be an outstanding issue in gspread with append_row, and it may actually be the same issue you are running into.
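As a rough sketch of both suggestions (assuming the wks, date, fullPrice and videoPrice names from your script; not necessarily the final fix):

row = len(wks.col_values(1)) + 1
print("Trying to update row:", row)  # debug logging to confirm the row number is sane
wks.update_cell(row, 1, str(date))

# or skip the row arithmetic entirely and let gspread append below the last filled row
wks.append_row([str(date), fullPrice, videoPrice])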
I encountered the same problem. BS4 works fine on a local machine; however, for some reason it is way too slow on the Heroku server, resulting in an error.
I switched to lxml and it is working fine now.
Install it with the command:
pip install lxml
A sample code snippet is given below:
from lxml import html
import requests
getpage = requests.get("https://url_here")
gethtmlcontent = html.fromstring(getpage.content)
data = gethtmlcontent.xpath('//div[@class="class-name"]/text()')
# this is a sample for fetching data from a dummy div
data = data[0:n]  # as per your requirement
# now inject the data into a Django template
So I'm working on a Python script to extract text from an email, and I am following these instructions to do so. This is the script thus far:
import imapclient
import pprint
import pyzmail
mymail = "my@email.com"
password = input("Password: ")
imapObj = imapclient.IMAPClient('imap.gmail.com' , ssl=True)
imapObj.login(mymail , password)
imapObj.select_folder('INBOX', readonly=False)
UIDs = imapObj.search(['SUBJECT Testing'])
rawMessages = imapObj.fetch([5484], ['BODY[]'])
message = pyzmail.PyzMessage.factory(rawMessages[5484]['BODY[]'])
However I'm getting this error:
message = pyzmail.PyzMessage.factory(rawMessages[5484]['BODY[]'])
KeyError: 5484
5484 being the ID for the email that the search function finds. I've also tried putting UIDs in instead of 5484, but that doesn't work either. Thanks in advance!
Thank you @Madalin Stroe.
I use Python 3.6.2 and pyzmail 1.0.3 on Windows 10.
When I run
message = pyzmail.PyzMessage.factory(rawMessages[4]['BODY[]'])
The error I get is:
Traceback (most recent call last):
File "PATH/TO/mySinaEmail.py", line 42, in <module>
message = pyzmail.PyzMessage.factory(rawMessages[4]['BODY[]'])
KeyError: 'BODY[]'
When I modified this to message = pyzmail.PyzMessage.factory(rawMessages[4][b'BODY[]']), it ran well.
Try replacing ['BODY[]'] with [b'BODY[]']
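For completeness, a minimal end-to-end sketch using the bytes key (assuming the Gmail account and 'Testing' subject from the question; imapclient returns the fetch response as a dict keyed by UID, with the inner dict keyed by bytes):

import imapclient
import pyzmail

mymail = "my@email.com"
password = input("Password: ")

imapObj = imapclient.IMAPClient('imap.gmail.com', ssl=True)
imapObj.login(mymail, password)
imapObj.select_folder('INBOX', readonly=False)

UIDs = imapObj.search(['SUBJECT', 'Testing'])   # list of matching message UIDs
rawMessages = imapObj.fetch(UIDs, ['BODY[]'])   # dict keyed by UID

for uid in UIDs:
    # the inner dict uses the bytes literal b'BODY[]', not the str 'BODY[]'
    message = pyzmail.PyzMessage.factory(rawMessages[uid][b'BODY[]'])
    print(message.get_subject())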
I am using mechanize in Python to log in to a webpage.
Python code:
br = mechanize.Browser()
br.open("https://example.com/page1/")
formcount = 0
for form in br.forms():
    if form.attrs['class'] == 'standardForm':
        br.select_form(nr=formcount)
        break
    formcount = formcount + 1
print form
br.form['username_or_email']='username'
br.form['password']='password'
Then I got a TypeError for the line br.form['username_or_email']='username', as shown below:
Traceback (most recent call last):
...
br.form['username_or_email']='username'
TypeError: 'NoneType' object does not support item assignment
From the print form line, we can see some form info, as below:
<POST https://www.example.com/login/?next=https%3A//example.com/page1/ application/x-www-form-urlencoded
<IgnoreControl(<None>=<None>)>
<IgnoreControl(<None>=<None>)>
<IgnoreControl(<None>=<None>)>
<TextControl(username_or_email=)>
<PasswordControl(password=)>
<SubmitButtonControl(<None>=) (readonly)>>
How can I provide the right values to the form?
Thanks
Try the code below.
br = mechanize.Browser()
br.open("https://example.com/page1/")
br.select_form("enter the login form name here")
br["enter user name field here"] = 'username'
br["enter password field name here"] = 'password'
br.submit()
I believe that the login form is not being assigned to the br object because the select_form call is not taking effect.
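If the form really has no name (as the printed dump suggests), you can also select it by a predicate on its class attribute rather than by name. A rough sketch, using the field names visible in the dump:

br = mechanize.Browser()
br.open("https://example.com/page1/")

# pick the form whose class attribute is 'standardForm' instead of selecting by name
br.select_form(predicate=lambda f: f.attrs.get('class') == 'standardForm')

br['username_or_email'] = 'username'   # field names taken from the printed form dump
br['password'] = 'password'
br.submit()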
I'm trying to input login details into Gmail with the below code:
from selenium import webdriver
import getpass
chromedriver = 'C:\Python34\Scripts\chromedriver'
driver = webdriver.Chrome(chromedriver)
driver.get('http://www.google.com/adwords/')
signin = driver.find_element_by_class_name('ignore-channel')
signin.click()
email = input('Enter your Email ID : ')
password = getpass.getpass('Password :')
email = driver.find_element_by_id('Email')
email.send_keys(email)
passwd = driver.find_element_by_id('Passwd')
passwd.send_keys(password)
submit = driver.find_element_by_id('signIn')
submit.click()
tools = driver.find_elements_by_partial_link_text('Keyword')
tools[0].click()
When I enter the login/password details, Python returns the error below:
Traceback (most recent call last):
File "C:/Python34/SEO.py", line 18, in <module>
email.send_keys(email)
AttributeError: 'str' object has no attribute 'send_keys'
Any idea where I might be wrong?
KJ
Check these lines from your code:
Line number 15
email = input('Enter your Email ID : ')
Line number 17
email = driver.find_element_by_id('Email')
Line number 18
email.send_keys(email)
So you are assigning your email string to the variable name 'email' and then reassigning the WebElement to the same variable name 'email'. So when the code tries to call send_keys at line 18, it does not work as intended.
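A rough sketch of the same steps with separate variable names (the element IDs Email, Passwd and signIn are taken from your script and assumed unchanged):

from selenium import webdriver
import getpass

driver = webdriver.Chrome('C:\\Python34\\Scripts\\chromedriver')
driver.get('http://www.google.com/adwords/')
driver.find_element_by_class_name('ignore-channel').click()

# keep the typed credentials and the page elements in separate variables
email_address = input('Enter your Email ID : ')
password = getpass.getpass('Password :')

email_field = driver.find_element_by_id('Email')
email_field.send_keys(email_address)

passwd_field = driver.find_element_by_id('Passwd')
passwd_field.send_keys(password)

driver.find_element_by_id('signIn').click()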