How to retrieve all post comments/likes via Facebook OpenGraph - python

I am trying to retrieve comments and likes for specific posts through Facebook's opengraph API. While I do get some information back, it does not always match the comments/likes count mentioned in the post. I guess this can be attributed to the access permissions of the token I'm using. However, I have noticed that results vary depending on the request limit I use, and sometimes I also get duplicate entries between requests.
For example, post 10376464573_150423345118848 has about 14000 likes as of this writing, but I can only retrieve a maximum of around 5000. With the default limit of 25 I can get up to 3021 likes. A value of 100 gives 4501, while limits of 1000, 2000, 3000 and 5000 all return the same number of likes, 4959 (the absolute values don't make too much sense of course, they are just there for comparison).
I have noticed similar results on a smaller scale for comments.
I'm using a simple python script to fetch pages. It goes through the data following the pagination links provided by Facebook, writing each page retrieved to a separate file. Once an empty reply is encountered it stops.
With small limits (e.g. the default of 25), I notice that the number of results returned is monotonically decreasing as I go through the pagination links, which seems really odd.
Any thoughts on what could be causing this behavior and how to work around it?

If you are looking for a list of the names of each and every like / comment on a particular post I think you will run up against the API limit (even with pagination).
If you are merely looking for an aggregate number of likes, comments, shares, or link clicks, you'll want to simply use the summary=true param provided in the posts endpoint. Kind of like this:
import requests

def get_comment_summary(postid, apikey):
    try:
        endpoint = 'https://graph.facebook.com/v2.5/' + postid + '/comments?summary=true&access_token=' + apikey
        response = requests.get(endpoint)
        return response.json()
    except requests.RequestException:
        return None
You can also retrieve all of the posts of any particular page and their summary data points:
{page_id}/posts?fields=message,likes.limit(1).summary(true)
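For instance, a minimal sketch of that call (again with the requests library; page_id and apikey are placeholders to fill in):
import requests

def get_page_posts_with_like_summary(page_id, apikey):
    # likes.limit(1).summary(true) keeps the payload small while still
    # returning likes.summary.total_count for each post
    endpoint = ('https://graph.facebook.com/v2.5/' + page_id +
                '/posts?fields=message,likes.limit(1).summary(true)'
                '&access_token=' + apikey)
    return requests.get(endpoint).json()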

You can retrieve the comments, like count and other information for a particular post using the URL below, built here as a Python format string:
'https://graph.facebook.com/{0}/comments?access_token={1}&limit={2}&fields=from,message,message_tags,created_time,id,attachment,like_count,comment_count,parent&order=chronological&filter=stream'.format(post_id, access_token, limit)
As the order is specified as chronological, you also need to pass the after parameter in the same URL; its value can be found in the paging.cursors.after section of the previous response.
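A rough pagination sketch along those lines (assuming the requests library; post_id, access_token and limit are placeholders to fill in):
import requests

def fetch_all_comments(post_id, access_token, limit=100):
    url = 'https://graph.facebook.com/{0}/comments'.format(post_id)
    params = {
        'access_token': access_token,
        'limit': limit,
        'fields': 'from,message,created_time,id,like_count,comment_count',
        'order': 'chronological',
        'filter': 'stream',
    }
    comments = []
    while True:
        data = requests.get(url, params=params).json()
        comments.extend(data.get('data', []))
        after = data.get('paging', {}).get('cursors', {}).get('after')
        if not after or not data.get('data'):
            break
        params['after'] = after   # continue from the cursor of the last page
    return comments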

Related

Scraping ASPX after login with Python but every login gives you a different URL

I'm trying to get the exam result data from my college website for every Roll No. in my class.
Normally you can POST to a URL (www.example.com/login.aspx) with login information, and GET a fixed URL after login (www.example.com/home.aspx).
But the page I'm trying to get has a different URL for every Roll No. entered. The URL of the login page looks like this: "www.example.com/View.aspx". After login, the URL of the result page looks like: "www.example.com/ovengine.aspx?enc=BunchOfNumbersandAlphabets". And those numbers and letters are different for each roll number.
So I can't put a fixed URL in my code to get the final result. I don't know how to get the page that comes automatically after the login, without knowing its URL.
But the page I'm trying to get has a different URL for every Roll no. entered
No, it is the same URL, and the URL has a parameter. You see this in URLs all the time.
So, for a temperature site it might look like
www.TheWeatherSite.com/?City=Rome
So, the above URL is always the same, but the web site "city" parameter is set for the City of Rome. The code behind can thus grab and consume that parameter. That way we don't create a separate web page for the weather in EACH city.
So you create ONE page, and then PASS the web page a city value that the code behind can consume and use (say, to query temperature data from a database for city = above value).
And thus you have to know ahead of time what city you want the weather for. Of course this approach is great since you don't have to create a new web site page to just show/display the weather in a given city.
You are in effect passing a value to some code behind that will run, and use that passed value.
The same goes for your example URL. You note there is ONE parameter called "enc".
So, the web site code behind would:
Grab/get the user's ID. However, the user's ID would come from the security system and the authentication provider. Unless you are logged in as that particular user, you will not get that user ID.
So, both a user ID (internal to the code)
and the "enc" value passed as the parameter in the URL would be required.
So, note in the sql below, we VERY likely need both a studentID and ALSO the "enc" value that some OTHER code from another page gets/grabs from the database.
Now that funny "GUID" (please do google what a GUID is) would, from a programmer's point of view, be sufficient to pull this one row of data from the database, but what about ALSO using the user's internal logged-on id in the query?
Well, then only a given logged on user would be able to see their own set of values that belong to them.
In other words?
Only a drunken unemployed rodeo clown would require JUST that GUID for pulling out that data. If that were the case, any user could type in that GUID and see other people's marks. However, there is "some" security in using a GUID, since a user could never guess that value.
If they had used "city" like my first URL and parameter example? Then yes, you could guess and know the city value to type in. Or they could have used, say, the student name, or even the student number - those you COULD guess with relative ease.
But for such data, no doubt the developers adopted something MUCH more difficult than a starting value like a row number or PK id from a database. So, when the code added the results to that table, it also generated a GUID of some type and saved that in the row as well.
So you don't need JUST the GUID; that URL will ONLY work for a given pair of values (the student ID, which is ONLY internal to the code and pulled FROM the authentication provider). That is this line of code:
= Membership.GetUser.ProviderUserKey
So that value above is going to be the user's internal logon ID.
You need the externally exposed enc value passed as the URL parameter, and ALSO that internal logged-on value. So the code behind (ASP.NET) would look something like this:
Dim strSQL As String
strSQL = "SELECT * FROM tblStudentMarks WHERE StudentID = @pID " &
         " AND TestResultsGID = @GID"
' GetCon is the author's helper that returns an open SqlConnection
Dim cmdSQL As New SqlCommand(strSQL, GetCon)
' internal id of the logged-on user, from the membership provider
cmdSQL.Parameters.Add("@pID", SqlDbType.Int).Value = Membership.GetUser.ProviderUserKey
' the "enc" GUID passed as the URL parameter
cmdSQL.Parameters.Add("@GID", SqlDbType.VarChar).Value = Request.QueryString("enc")
Dim dReader As New SqlDataAdapter(cmdSQL)
Dim rstData As New DataTable
dReader.Fill(rstData)
Note the code:
Request.QueryString("enc")
That allows the code behind to grab the parameter (enc) from the URL. But, as I stated, it is highly unlikely that JUST the "enc" number is required here. It is possible that ONLY this value is required to pull the data from the row, but then that would be a security hole the size of an open barn door.
Think of your on-line banking.
www.mybank.com/?CustomerNumber=1234
Well, if we JUST use the above CustomerNumber as the means to pull bank data, then I could go to the site and type in YOUR number, or someone else's number.
So, for this to work?
You will need to obtain a list of enc values (that messy, funny long string). Without that value you will not be able to set the parameter in the URL.
However, as I stated, you very likely ALSO need some internal "user" logon id that is NOT included in the publicly exposed URL in order to grab that one row of data from the database.
And, even more important? Such web pages usually cannot be hit UNLESS you are logged in as an authenticated user. In other words, that web page will ONLY be dished out to logged-in users - if you are not logged in, the server security will automatically NOT dish out the web page.
So, for this to work, you need to contact the web site developers and obtain that list of "enc" values. Once you have that list, you can write some code to process it and insert the correct parameter in the URL, as sketched below. However, you also need to ask whether that URL and parameter value will work for JUST you as the logged-in user, or only for one given logged-in user. Without those values, and without knowing whether the URL and parameter will work for any user (which I doubt), just using a URL to get these values will not work.
It would be even BETTER to have the web site folks create a web service you can call that returns all of the data you need in one go, as opposed to having to send the "enc" value over and over - a value you don't have anyway.
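As a rough illustration only: assuming you did obtain such a list of enc values and have valid login credentials, a scraping loop might look like the sketch below. The form field names, URLs and enc strings are placeholders you would need to take from the real login page.
import requests

# hypothetical values - take the real field names from the login form's HTML
LOGIN_URL = "http://www.example.com/View.aspx"
RESULT_URL = "http://www.example.com/ovengine.aspx"
enc_values = ["EncValueFromDeveloperList1", "EncValueFromDeveloperList2"]

session = requests.Session()
# ASPX forms usually also require __VIEWSTATE/__EVENTVALIDATION fields scraped
# from the login page before posting; omitted here for brevity
session.post(LOGIN_URL, data={"username": "me", "password": "secret"})

for enc in enc_values:
    page = session.get(RESULT_URL, params={"enc": enc})
    print(enc, page.status_code, len(page.text))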

REST API url design

I have a REST API backed by a database table with two columns, product_id and server_id. It serves product_ids to the specific servers that request the data (based on the server_id in the table).
Let's say I have three servers with server_ids 1,2 and 3.
My design is like this: /products/server_id/1, and with a GET request I get a JSON list of product_ids with server_id = 1; similarly, /products/server_id/2 would output the list of product_ids for server_id = 2.
Should I remove these routes and instead require a POST request to the /products route only, with instructions to receive product_ids for a specific server_id?
For example, sending the payload {"server_id":1} would yield a list of product_ids for server_id = 1.
Should I remove these routes and instead require a POST request to the /products route only, with instructions to receive product_ids for a specific server_id?
Not usually, no.
GET communicates to general purpose components that the semantics of the request message are effectively read only (see "safe"). That affordance alone makes a number of things possible; for instance, spiders can crawl and index your API, just as they would for a web site. User agents can "pre-fetch" resources, and so on.
All of that goes right out the window when you decide to use POST.
Furthermore, the URI itself serves a number of useful purposes - caches use the URI as the primary key for matching a request. Therefore we can reduce the load on the origin server by re-using representations that have been stored under a specific identifier. We can also perform magic like sticking that URI into an email message, without the context of any specific HTTP request, and the receiver of the message will be able to GET that identifier and fetch the resource we intend.
Again, we lose all of that when the identifying information is in the request payload, rather than in the identifier metadata where it belongs.
That said, we sometimes do use the payload for identifying information, as a work around: for example, if we need so much identifying information that we start seeing 414 URI Too Long responses, then we may need to change our interaction protocol to use a POST request with the identifying information in the payload (losing, as above, the advantages of using GET).
An online example of this might be something like an HTML validator, that accepts a candidate document and returns a representation of the problems found. That's effectively a read only action, but in the general case an HTML document is too long to comfortably fit in the target-uri of an HTTP request.
So we punt.
In a hypermedia api, like those used on the world wide web, we can get away with it, because the HTTP method to use is provided by the server as part of the metadata of the form itself. You as the client don't need to know the server's preferred semantics, you just need to know how to process the form data.
For instance, as I type this answer into my browser, I don't need to know what the target URI is, or what HTTP method is going to be used, because the browser already knows what to do (based on the HTML and whatever scripts are running "on demand").
In REST APIs, POST requests should only be used to create a new resource, so to retrieve data from the server, the best practice is to perform a GET request.
If you want to load products 1,2,4,8 on server 9 for example, you can use this kind of request:
GET https://website/servers/9/products/1,2,4,8
On the server side, if the products value contains a comma-separated list, return an array with all matching results; if not, return an array with only one item, in order to keep consistency between calls.
In case you need to get all products, you can keep only the following url :
GET https://website/servers/9/products
As there is no id provided in the products parameter, the server should return all existing products for the requested server.
Note: in case of a large number of results, they should be paginated.
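As a small illustration of that GET-based design, here is a minimal sketch using Flask; the route shape follows the answer above, and the in-memory dictionary is a hypothetical stand-in for the real product_id/server_id table:
from flask import Flask, jsonify

app = Flask(__name__)

# hypothetical in-memory stand-in for the product_id/server_id table
PRODUCTS_BY_SERVER = {9: [1, 2, 4, 8, 16]}

@app.route("/servers/<int:server_id>/products")
@app.route("/servers/<int:server_id>/products/<id_list>")
def get_products(server_id, id_list=None):
    all_ids = PRODUCTS_BY_SERVER.get(server_id, [])
    if id_list is None:
        # no ids in the URL: return every product for this server
        return jsonify({"product_ids": all_ids})
    wanted = {int(i) for i in id_list.split(",")}
    return jsonify({"product_ids": [i for i in all_ids if i in wanted]})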

Youtube analytics API content owner-based queries not returning data

I'm trying to get ad revenue data from the Youtube analytics API. It seems that no queries I make with id=contentOwner==<CONTENT_OWNER_ID> return data: I get a 200 response back with all the column name information, but no rows and no actual data. This occurs even for metrics like comments, which does return data when I use id=channel==<CHANNEL_ID> (i.e., id=channel==<CHANNEL_ID>&metrics=comments&filters=video==<VIDEO_ID> returns the number of comments for that video; id=contentOwner==<CONTENT_OWNER_ID>&metrics=comments&filters=video==<VIDEO_ID> does not.) The problem occurs both in my Python code and in the query explorer (https://developers.google.com/youtube/analytics/v1/reference/reports/query#try-it).
Am I doing something wrong? Is it a secret permissions issue, even though I'm getting a 200 back? Is it a bug?
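For reference, a stripped-down sketch of the two queries being compared, using the requests library against the v1 reports endpoint; the date range below is a placeholder, and the real query also needs start-date and end-date:
import requests

BASE = 'https://www.googleapis.com/youtube/analytics/v1/reports'

def comments_report(ids_value, video_id, access_token):
    # ids_value is e.g. 'channel==<CHANNEL_ID>' or 'contentOwner==<CONTENT_OWNER_ID>'
    params = {
        'ids': ids_value,
        'metrics': 'comments',
        'filters': 'video==' + video_id,
        'start-date': '2015-01-01',   # placeholder date range
        'end-date': '2015-12-31',
        'access_token': access_token,
    }
    return requests.get(BASE, params=params).json()

# returns rows:    comments_report('channel==<CHANNEL_ID>', '<VIDEO_ID>', token)
# returns no rows: comments_report('contentOwner==<CONTENT_OWNER_ID>', '<VIDEO_ID>', token)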

How can I get the count of likes and dislikes of any video on Youtube?

I want to create a database of YouTube videos with counts of likes and dislikes, clustered by genre. So I need a data set covering each and every video on YouTube. So far the Data API supports queries fired only for a single URL. Further, I am not sure the Data API supports going through each and every video, which seems unfeasible. Is there any way I can get the task done? Should I try to crawl, even though I am not sure if that is legal?
Also I am using a web based architecture for it.
Thanks for any help.
YouTube imposes a soft limit of about 500 search results. There is no current way to get more than that through the API.
Full details: https://code.google.com/p/gdata-issues/issues/detail?id=4282
Relevant Excerpt:
"We can't provide more than ~500 search results for any arbitrary YouTube query via the API without the quality of the search results severely degrading (duplicates, etc.).
The v1/v2 GData API was updated back in November to limit the number of search results returned to 500. If you specify a start-index of 500 or more, you won't get back any results.
This was supposed to have also gone into effect for the v3 API (which uses a different method of paging through results) but it apparently was not pushed out, so it is still possible to retrieve up to 1000 search results in v3—the last 500 of which are usually of bad quality.
The change to limit v3 to 500 search results will be pushed out sometime in the near future. There will no longer be nextPageTokens returned once you hit 500 results.
I understand that the totalResults that are returned is much higher than 500 in all of these cases, but that is not the same thing as saying that we can effectively return all X million possible results. It's meant as an estimate of the total size of the set of videos that match a query and normally isn't very useful."
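For a single known video, though, the Data API v3 videos endpoint returns the counts directly. A minimal sketch (the video id and API key are placeholders, and dislike counts may not be exposed for every video):
import requests

def video_statistics(video_id, api_key):
    # videos.list with part=statistics returns viewCount, likeCount, etc.
    resp = requests.get(
        'https://www.googleapis.com/youtube/v3/videos',
        params={'part': 'statistics', 'id': video_id, 'key': api_key},
    )
    items = resp.json().get('items', [])
    return items[0]['statistics'] if items else None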

tweepy count limited to 200?

I'm currently trying to retrieve the followers of some big account with a lot of followers.
I'm using Tweepy and this piece of code (with cursor):
follower_cursors = tweepy.Cursor(api.followers, id=id_var, count=5000)
for friend in follower_cursors.items():
    print(friend.screen_name)  # process each follower
OK, if I don't specify count, it seems that by default only 20 results are shown per page, but since the Twitter API documentation says it can provide 5000 followers, I tried to set it to the maximum.
However this doesn't seem to be taken into account and each page contains a maximum of 200 entries, which is a real problem as you will trigger the rate limit much more easily.
What am I doing wrong? Is there a way to make Tweepy request pages of 5000 IDs, to minimize requests and override this default max value of 200?
Thanks!
You could use cursor for pages instead of items, and then process the items per page:
for page in tweepy.Cursor(api.user_timeline).pages():
    # page is a list of statuses
    process_page(page)
    # or iterate over the items in `page`
I don't see a limit in the tweepy Cursor for results returned, so it should return as many as it gets.
Previous answer:
The max per-page result is enforced by the Twitter API, not by tweepy. You're supposed to paginate over the list of 200-per-call results, which Cursor is already doing for you. If there were 5000 followers, then with the max 200 results per query, you're using only 25 calls. You'd still have 4975 calls left to do other things.
To exceed the 5000-per-hour rate limit, you'd need to be doing at least 83 calls per minute or 1.4 calls per second.
Note that 'read limits' are per-application but 'write limits' are per user. So you could split your task between two or more apps* if they are read intensive.
Consider using the Streaming API instead, if it's more appropriate for your needs.
*: Though I'm sure Twitter has controls in place to prevent abuse.
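If the goal is just follower IDs rather than full user objects, one workaround is to page through followers/ids, which allows up to 5000 IDs per request. A minimal sketch, assuming an older (pre-4.0) tweepy where api.followers_ids is available and api is an already-authenticated tweepy.API instance:
import tweepy

# `api` is assumed to be an already-authenticated tweepy.API instance
follower_ids = []
for page in tweepy.Cursor(api.followers_ids, id=id_var, count=5000).pages():
    follower_ids.extend(page)   # each page is a list of up to 5000 numeric IDs
print(len(follower_ids))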
