I'm looking at scraping some data from Facebook using Python 2.7. My code basically increments the Facebook profile ID by 1 and then captures the details returned by each page.
An example of the page I'm looking to capture the data from is graph.facebook.com/4.
Here's my code below:
import scraperwiki
import urlparse
import simplejson

source_url = "http://graph.facebook.com/"
profile_id = 1

while True:
    try:
        profile_id += 1
        profile_url = urlparse.urljoin(source_url, str(profile_id))
        results_json = simplejson.loads(scraperwiki.scrape(profile_url))
        for result in results_json['results']:
            print result
            data = {}
            data['id'] = result['id']
            data['name'] = result['name']
            data['first_name'] = result['first_name']
            data['last_name'] = result['last_name']
            data['link'] = result['link']
            data['username'] = result['username']
            data['gender'] = result['gender']
            data['locale'] = result['locale']
            print data['id'], data['name']
            scraperwiki.sqlite.save(unique_keys=['id'], data=data)
            #time.sleep(3)
    except:
        continue
    profile_id += 1
I am using the ScraperWiki site to run this check, but no data is printed back to the console despite the line print data['id'], data['name'] being there just to verify the code is working.
Any suggestions on what is wrong with this code? As said, for each returned profile the unique data should be captured and printed to screen as well as saved into the SQLite database.
Thanks
Any suggestions on what is wrong with this code?
Yes. You are swallowing all of your errors. There could be a huge number of things going wrong in the block under try. If anything goes wrong in that block, you move on without printing anything.
You should only ever use a try / except block when you are looking to handle a specific error.
Modify your code so that it looks like this:
while True:
    profile_id += 1
    profile_url = urlparse.urljoin(source_url, str(profile_id))
    results_json = simplejson.loads(scraperwiki.scrape(profile_url))
    for result in results_json['results']:
        print result
        data = {}
        # ... more ...
and then you will get detailed error messages when specific things go wrong.
As for your concern in the comments:
The reason I have the error handling is because, if you look for example at graph.facebook.com/3, this page contains no user data, and in that case I don't want to collate this info but skip to the next user, i.e. number 4, etc.
If you want to handle the case where there is no data, then find a way to handle that case specifically. It is bad practice to swallow all errors.
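For instance, here is a minimal sketch of handling that case specifically (the fetch_profile helper name is mine, and the exact shape of an empty Graph response is assumed from the question rather than verified):
import scraperwiki
import simplejson

def fetch_profile(profile_url):
    # Decode the page and return the profile dict, or None when the page
    # carries no user data (e.g. graph.facebook.com/3 as mentioned above).
    profile_json = simplejson.loads(scraperwiki.scrape(profile_url))
    if not isinstance(profile_json, dict) or 'id' not in profile_json:
        return None
    return profile_json
The caller can then do if profile is None: continue, which skips empty profiles without hiding unrelated errors such as network failures or typos.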
I'm on a blackbox penetration-testing training course. Last time I asked a question about SQL injection, which I'm making progress on: I was able to retrieve the database and the columns.
This time I need to find the admin login, so I used dirsearch for that. I checked each web directory from the dirsearch report, and sometimes a directory would show the same page as index.html.
So I'm trying to fix this by automating the process with a script:
import requests

url = "http://depedqc.ph"
webdirectory_path = "C:/PentestingLabs/Dirsearch/reports/depedqc.ph/scanned_webdirectory9-3-2022.txt"

index = requests.get(url)
same = index.content

for webdirectory in open(webdirectory_path, "r").readlines():
    webdirectory_split = webdirectory.split()
    result = [i for i in webdirectory_split if i.startswith(url)]
    result = ''.join(result)
    print(result)
    response = requests.get(result)
    if response.content == same:
        print("same content")
The only problem is, I get this error:
Invalid URL '': No scheme supplied. Perhaps you meant http://?
Even though the printed result is: http://depedqc.ph/html
What am I doing wrong here? I'd appreciate any feedback.
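For what it's worth, a likely explanation (an assumption, since only the last printed value is shown): on report lines that contain no entry starting with http://depedqc.ph, the list comprehension is empty, ''.join(result) yields '', and requests.get('') raises exactly that "No scheme supplied" error. A sketch of a guard against the empty case, reusing url, webdirectory_path and same from the script above:
for webdirectory in open(webdirectory_path, "r").readlines():
    matches = [i for i in webdirectory.split() if i.startswith(url)]
    if not matches:
        # No URL for the target host on this line; skip it rather than
        # calling requests.get('') and triggering the scheme error.
        continue
    result = ''.join(matches)
    print(result)
    response = requests.get(result)
    if response.content == same:
        print("same content")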
Hey guys, I was wondering if there is a way to make my bot skip invalid URLs after one try and continue with the for loop, but continue doesn't seem to work.
def check_valid(stripped_results):
    global vstripped_results
    vstripped_results = []
    for tag in stripped_results:
        conn = requests.head("https://" + tag)
        conn2 = requests.head("http://" + tag)
        status_code = conn.status_code
        website_is_up = status_code == 200
        if website_is_up:
            vstripped_results.append(tag)
        else:
            continue
stripped_results is a list containing an unknown number of domains and subdomains, which is why I have the 'https://' part, and to be honest I'm not even sure whether my if statement is effective or not.
Any help would be greatly appreciated. I don't want to get rate-limited by Discord anymore from sending so many invalid domains through. :(
This is easy. To check the validity of a URL there is a Python library, namely validators, which can be used to check whether a URL is well formed. Let's take it step by step.
Firstly, here is the documentation link for validators:
https://validators.readthedocs.io/en/latest/
How do you validate a link using validators?
It is simple. Let's work on the command line for a moment.
The module gives a boolean-style result for whether a string is a valid link or not.
For the link of this question it returned True; for an invalid link you get a falsy failure result instead of True.
You can validate a URL using this syntax:
validators.url('Add your URL variable here')
Remember that this gives a boolean-style value, so write your code accordingly.
So you can use it this way...
I won't implement it in your code, as I want you to try it yourself first. I will help you with this if you are unable to do it.
Thank you! :)
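For reference, a minimal sketch of how validators.url could slot into the check_valid function from the question (this only checks that each URL is well formed; it does not contact the site):
import validators

def check_valid(stripped_results):
    # Keep only the tags that form a syntactically valid URL.
    vstripped_results = []
    for tag in stripped_results:
        # validators.url returns True for a well-formed URL and a falsy
        # failure object otherwise, so it can be used directly in an if.
        if validators.url("https://" + tag):
            vstripped_results.append(tag)
    return vstripped_results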
Try this?
def check_valid(stripped_results):
    global vstripped_results
    vstripped_results = []
    for tag in stripped_results:
        conn = requests.head("https://" + tag)
        conn2 = requests.head("http://" + tag)
        status_code = conn.status_code
        website_is_up = status_code == 200
        if website_is_up:
            vstripped_results.append(tag)
        else:
            # Do the thing here
            continue
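One more point worth hedging here: for a truly invalid domain, requests.head typically raises requests.exceptions.ConnectionError before continue is ever reached, which may be why the loop appears to stop. Wrapping the request in a try/except (a sketch, with a timeout added as an assumption) lets the loop carry on:
import requests

def check_valid(stripped_results):
    vstripped_results = []
    for tag in stripped_results:
        try:
            # One attempt only; the timeout keeps dead hosts from hanging the loop.
            conn = requests.head("https://" + tag, timeout=5)
        except requests.exceptions.RequestException:
            # DNS failures, refused connections and timeouts all land here;
            # skip this domain and keep going.
            continue
        if conn.status_code == 200:
            vstripped_results.append(tag)
    return vstripped_results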
I am trying to collect data from a website using Python. On the page there are multiple listings of software, and within each listing my data sits in an h5 tag with a certain class ('price_software_details').
However, in some cases the tag, along with the data, is missing. I want to print 'NA' if the data and tag are missing; otherwise it should print the data.
I tried the code mentioned below, though it's not working.
Help please!
interest = soup.find(id='allsoftware')
for link in interest.findAll('h5'):
    if link.find(class_='price_software_details') == True:
        print(link.getText())
    else:
        print('NA')
Have you tried error handling (try, except)?
interest = soup.find(id='allsoftware')
for link in interest.findAll('h5'):
    try:
        item = link.find(class_='price_software_details')
        print(item.get_text())
    except:
        print('NA')
You need to know that soup.find() never returns True. It only returns a matched element or None.
interest = soup.find(id='allsoftware')
for link in interest.findAll('h5'):
    if link.find(class_='price_software_details'):
        print(link.getText())
    else:
        print('NA')
I'm new to Python and figured that the best way to learn is by practice, so this is my first project.
There is a fantasy football website. My goal is to create a script which logs in to the site, automatically creates a preselected team and submits it.
I have managed to get to the team-submission part.
When I add a team member, this data gets sent to the server:
https://i.gyazo.com/e7e6f82ca91e19a08d1522b93a55719b.png
When I press 'save this list', this data gets sent:
https://i.gyazo.com/546d49d1f132eabc5e6f659acf7c929e.png
Code:
import requests

with requests.Session() as c:
    gameurl = 'here is link where data is sent'
    BPL = ['5388', '5596', '5481', '5587',
           '5585', '5514', '5099', '5249', '5566', '5501', '5357']
    GID = '168'
    UDID = '0'
    ACT = 'draft'
    ACT2 = 'save_draft'
    SIGN = '18852c5f48a94bf3ee58057ff5c016af'
    c.get(gameurl)
    # eleven of these, each with a different BPL entry, since 11 players are needed:
    game_data = dict(player_id=BPL[0], action=ACT, id=GID)
    c.post(gameurl, data=game_data)
    # now I need to submit my list of selected players:
    game_data_save = dict(action=ACT2, id=GID, user_draft_id=UDID, sign=SIGN)
    c.post(gameurl, data=game_data_save)
This code works pretty well, but the problem is that 'SIGN' is unique for each individual game, and I have no idea how to get this value without using Chrome's inspect option.
How can I get this data simply by running Python code?
Because you said you can find it using devtools, I'm assuming SIGN is written somewhere in the DOM.
In that case you can use requests.get(url).text to get the HTML of the page and parse it with a tool like lxml or HTMLParser.
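As a rough sketch (the hidden-input markup and the field name 'sign' are guesses on my part, since the page source isn't shown), you could fetch the page inside the same session and pull the token out with a regular expression:
import re

# gameurl and the session object c come from the script in the question.
html = c.get(gameurl).text
# Assuming the token is embedded like <input type="hidden" name="sign" value="...">
match = re.search(r'name=["\']sign["\']\s+value=["\']([0-9a-f]+)["\']', html)
if match:
    SIGN = match.group(1)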
Solved it by posting all the data without 'SIGN'; in return, the response HTML contained the 'SIGN' value.
My question involves learning how to retrieve my entire list of friends using Facebook's Python API. The current call returns an object with a limited number of friends and a link to the 'next' page. How do I use this to fetch the next set of friends? (Please post links to possible duplicates.) Any help would be much appreciated. In general, I need to learn about the pagination involved in using the API.
import facebook
import json
ACCESS_TOKEN = "my_token"
g = facebook.GraphAPI(ACCESS_TOKEN)
print json.dumps(g.get_connections("me","friends"),indent=1)
Sadly the documentation of pagination has been an open issue for almost two years. You should be able to paginate like this (based on this example) using requests:
import facebook
import requests

ACCESS_TOKEN = "my_token"
graph = facebook.GraphAPI(ACCESS_TOKEN)
friends = graph.get_connections("me", "friends")

allfriends = []
# Wrap this block in a while loop so we can keep paginating requests until
# finished.
while True:
    try:
        for friend in friends['data']:
            allfriends.append(friend['name'].encode('utf-8'))
        # Attempt to make a request to the next page of data, if it exists.
        friends = requests.get(friends['paging']['next']).json()
    except KeyError:
        # When there are no more pages (['paging']['next']), break from the
        # loop and end the script.
        break
print allfriends
Update: There's a new generator method available which implements the above behavior and can be used to iterate over all friends like this:
for friend in graph.get_all_connections("me", "friends"):
    # Do something with this friend.
    pass
Meanwhile, while searching for an answer, I found a much better approach:
import facebook

access_token = ""
graph = facebook.GraphAPI(access_token=access_token)
totalFriends = []
friends = graph.get_connections("me", "/friends&summary=1")
while 'paging' in friends:
    for i in friends['data']:
        totalFriends.append(i['id'])
    friends = graph.get_connections("me", "/friends&summary=1&after=" + friends['paging']['cursors']['after'])
At the end you will get one response where 'data' is empty and there is no 'paging' key, so at that point the loop breaks and all the data will have been stored.
I couldn't find this anywhere. These answers seem super complicated, and there's no way I would even use an SDK if I had to do stuff like that, when paging from a simple POST is so easy to start with. However:
from facebook_business.api import FacebookAdsApi
from facebook_business.adobjects.adaccount import AdAccount

FacebookAdsApi.init(my_app_id, my_app_secret, my_access_token)
my_account = AdAccount('act_23423423423423423')

# In the call below, I set the limit to the max rows, 250.
# More importantly, paging: the SDK has a really sneaky way of doing this.
# Enclose the request in list() and the results end up the same, but this
# makes the script request new objects until there are no more.
# I tested this example and compared it to the Graph API; as of right now,
# 1/22 9:47AM, I get 81 results from Graph and 81 here.
fields = ['name']
params = {'limit': 250}
ads = list(my_account.get_ads(
    fields=fields,
    params=params,
))
Trick from the docs: "NOTE: We wrap the return value of get_ad_accounts with list() because get_ad_accounts returns an EdgeIterator object (located in facebook_business.adobjects) and we want to get the full list right away instead of having the iterator lazily loading accounts."
https://github.com/facebook/facebook-python-business-sdk
In this example you offset / paginate by one at a time. I think my while loop is simple, since it only looks for the pagination key "next" to be None; if it doesn't exist, it means we have finished looping, and you will have your results in a list.
In this example I am just searching for all the people called Jacob.
import requests
import facebook

token = "your token goes here"
fb = facebook.GraphAPI(access_token=token)

limit = 1
offset = 0
data = {"q": "jacob",
        "type": "user",
        "fields": "id",
        "limit": limit,
        "offset": offset}

req = fb.request('/search', args=data, method='GET')

users = []
for item in req['data']:
    users.append(item["id"])

pag = req['paging']
while pag.get("next") is not None:
    offset += limit
    data["offset"] = offset
    req = fb.request('/search', args=data, method='GET')
    for item in req['data']:
        users.append(item["id"])
    # Fall back to an empty dict so a missing 'paging' key ends the loop
    # instead of raising an AttributeError on None.
    pag = req.get('paging', {})

print users