How to use mwapi library to get a wikipedia page? - python

I have been trying to figure out the documentation of the mwapi library (MediaWiki API) and I cannot figure out how to simply request a page based on a search query or keyword. I know I should use get(), but filling in the parameters with keywords yields errors. Does anyone know how to look up something like "Earth Wind and Fire"?
Documentation can be found here:
http://pythonhosted.org/mwapi
and here is the only example they have of get() being used:
import mwapi
session = mwapi.Session('https://en.wikipedia.org')
print(session.get(action='query', meta='userinfo'))
{'query': {'userinfo': {'anon': '', 'name': '75.72.203.28', 'id': 0}}, 'batchcomplete': ''}
print(session.get(action='query', prop='revisions', revids=32423425))
{'query': {'pages': {'1429626': {'ns': 0, 'revisions': [{'user': 'Wknight94', 'parentid': 32276615, 'comment': '/* References */ Removing less-specific cat', 'revid': 32423425, 'timestamp': '2005-12-23T00:07:17Z'}], 'title': 'Grigol Ordzhonikidze', 'pageid': 1429626}}}, 'batchcomplete': ''}

Maybe this code will help you understand the API:
import json # Used only to pretty-print dictionaries.
import mwapi
USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.6) Gecko/2009011913 Firefox'
session = mwapi.Session('https://en.wikipedia.org', user_agent=USER_AGENT)
query = session.get(action='query', titles='Earth Wind and Fire')
print('query returned:')
print(json.dumps(query, indent=4))
pages = query['query']['pages']
if pages:
    print('\npages:')
    for pageid in pages:
        data = session.get(action='parse', pageid=pageid, prop='text')
        print(json.dumps(data, indent=4))
Output:
query returned:
{
"batchcomplete": "",
"query": {
"pages": {
"313370": {
"pageid": 313370,
"ns": 0,
"title": "Earth Wind and Fire"
}
}
}
}
pages:
{
"parse": {
"title": "Earth Wind and Fire",
"pageid": 313370,
"text": {
"*": "<div class=\"redirectMsg\"><p>Redirect to:</p><ul class=\"redirectText\"><li>Earth, Wind & Fire</li></ul></div><div class=\"mw-parser-output\">\n\n<!-- \nNewPP limit report\nParsed by mw1279\nCached time: 20171121014700\nCache expiry: 1900800\nDynamic content: false\nCPU time usage: 0.000 seconds\nReal time usage: 0.001 seconds\nPreprocessor visited node count: 0/1000000\nPreprocessor generated node count: 0/1500000\nPost\u2010expand include size: 0/2097152 bytes\nTemplate argument size: 0/2097152 bytes\nHighest expansion depth: 0/40\nExpensive parser function count: 0/500\n-->\n<!--\nTransclusion expansion time report (%,ms,calls,template)\n100.00% 0.000 1 -total\n-->\n</div>\n<!-- Saved in parser cache with key enwiki:pcache:idhash:313370-0!canonical and timestamp 20171121014700 and revision id 16182229\n -->\n"
}
}
}
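Since the question asks about looking up a page from a keyword rather than an exact title, a search request may be closer to what is needed. The following is a minimal sketch, assuming the standard MediaWiki list=search module; the helper names (build_search_params, titles_from_response, search_wikipedia) and the user-agent string are made up for illustration, and the network call itself is untested here.

```python
def build_search_params(keyword, limit=5):
    """Build parameters for a MediaWiki full-text search (list=search)."""
    return {
        'action': 'query',
        'list': 'search',
        'srsearch': keyword,
        'srlimit': str(limit),
    }


def titles_from_response(response):
    """Pull the page titles out of a list=search response dict."""
    return [hit['title'] for hit in response.get('query', {}).get('search', [])]


def search_wikipedia(keyword, limit=5):
    """Run the search against English Wikipedia (needs mwapi and network access)."""
    import mwapi  # imported here so the helpers above work without mwapi installed
    session = mwapi.Session('https://en.wikipedia.org',
                            user_agent='example-script/0.1 (hypothetical contact)')
    return titles_from_response(session.get(**build_search_params(keyword, limit)))


# Offline check with a hand-written response in the shape the API returns
sample = {'query': {'search': [{'title': 'Earth, Wind & Fire', 'pageid': 1}]}}
print(titles_from_response(sample))
```

Calling search_wikipedia('Earth Wind and Fire') should then return a list of candidate titles, and each of those pages can be fetched with a parse request like the one in the answer above.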

Related

How to import JSON file with embedded array into Mongodb using Compass [duplicate]

I am working with MongoDB in Python [pymongo]. I want to insert an array of multiple fields into a document. For example, in the collection structure below, I want to insert an array of Places Visited into all documents. I do not know what this is called in the Mongo world, so I cannot look up how to insert it. How do I insert an array into a document? Can someone help?
collectionName
{
"_id" : "4564345343",
"name": "Bunty",
"Basic Intro": "A.B.C.D.",
"Places Visited": [
"1" : "Palace of Dob",
"2" : "Palace of Victoria",
"3" : "Sahara Desert"
]
}
{
"_id" : "45657865745",
"name": "Humty",
"Basic Intro": "B.C.D.",
"Places Visited": [
"1" : "Palace of Pakistan",
"2" : "Palace of U.S.A."
"3" : "White House"
]
}
This should give you an idea of how to do it:
import pymongo
client = pymongo.MongoClient('yourHost', 30000) # adjust to your needs
db = client.so
coll = db.yourcollection
# show initial data
for doc in coll.find():
    print(doc)
# update data
places_visited = [
    "Palace of Dob",
    "Palace of Victoria",
    "Sahara Desert"
]
# update_many replaces the old coll.update(..., multi=True), which was removed in PyMongo 4
coll.update_many({}, {"$set": {"Places Visited": places_visited}})
# show updated data
for doc in coll.find():
    print(doc)
which, for your sample data, should give output similar to this:
daxaholic$ python3 main.py
{'name': 'Bunty', 'Basic Intro': 'A.B.C.D.', '_id': '4564345343'}
{'name': 'Humty', 'Basic Intro': 'B.C.D.', '_id': '45657865745'}
{'name': 'Bunty', 'Places Visited': ['Palace of Dob', 'Palace of Victoria', 'Sahara Desert'], 'Basic Intro': 'A.B.C.D.', '_id': '4564345343'}
{'name': 'Humty', 'Places Visited': ['Palace of Dob', 'Palace of Victoria', 'Sahara Desert'], 'Basic Intro': 'B.C.D.', '_id': '45657865745'}
For further information see the docs about update
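If each document should get its own list rather than the same one, one option is to issue one update per _id. A minimal sketch, assuming a hypothetical mapping places_by_id from _id to that person's places; the helper name build_updates is also made up, and the call against the collection is shown only as a comment because it needs a running MongoDB.

```python
def build_updates(places_by_id):
    """Return (filter, update) pairs, one per document, for use with update_one."""
    return [
        ({'_id': doc_id}, {'$set': {'Places Visited': list(places)}})
        for doc_id, places in places_by_id.items()
    ]


# Hypothetical input, reusing the sample data from the question
places_by_id = {
    '4564345343': ['Palace of Dob', 'Palace of Victoria', 'Sahara Desert'],
    '45657865745': ['Palace of Pakistan', 'Palace of U.S.A.', 'White House'],
}

updates = build_updates(places_by_id)
for query_filter, update in updates:
    print(query_filter, update)
    # With a live collection this would be: coll.update_one(query_filter, update)
```

Note that the array holds plain values; the numbered "1": ..., "2": ... pairs in the question would need an embedded document instead of an array.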

Eve: how to use different endpoints to access the same collection with different filters

I have an Eve app publishing a simple read-only (GET) interface. It is interfacing a MongoDB collection called centroids, which has documents like:
[
{
"name":"kachina chasmata",
"location":{
"type":"Point",
"coordinates":[-116.65,-32.6]
},
"body":"ariel"
},
{
"name":"hokusai",
"location":{
"type":"Point",
"coordinates":[16.65,57.84]
},
"body":"mercury"
},
{
"name":"cañas",
"location":{
"type":"Point",
"coordinates":[89.86,-31.188]
},
"body":"mars"
},
{
"name":"anseris cavus",
"location":{
"type":"Point",
"coordinates":[95.5,-29.708]
},
"body":"mars"
}
]
Currently, (Eve) settings declare a DOMAIN as follows:
crater = {
'hateoas': False,
'item_title': 'crater centroid',
'url': 'centroid/<regex("[\w]+"):body>/<regex("[\w ]+"):name>',
'datasource': {
'projection': {'name': 1, 'body': 1, 'location.coordinates': 1}
}
}
DOMAIN = {
'centroids': crater,
}
This successfully answers requests of the form http://hostname/centroid/<body>/<name>. Inside MongoDB this represents a query like: db.centroids.find({body:<body>, name:<name>}).
What I would like to do also is to offer an endpoint for all the documents of a given body. I.e., a request to http://hostname/centroids/<body> would answer the list of all documents with body==<body>: db.centroids.find({body:<body>}).
How do I do that?
I gave it a shot by assigning a list of rules to the DOMAIN key centroids (the name of the database collection), like below,
crater = {
...
}
body = {
'item_title': 'body craters',
'url': 'centroids/<regex("[\w]+"):body>'
}
DOMAIN = {
'centroids': [crater, body],
}
but it didn't work:
AttributeError: 'list' object has no attribute 'setdefault'
Got it!
I was assuming the keys in the DOMAIN structure were directly tied to the collection Eve was querying. That is true for the default settings, but it can be adjusted inside the resource's datasource.
I figured that out while handling a situation analogous to the one in the question: I wanted to have an endpoint hostname/bodies listing all the (unique) values for body in the centroids collection. For that, I needed to set up an aggregation.
The following settings give me exactly that ;)
centroids = {
'item_title': 'centroid',
'url': 'centroid/<regex("[\w]+"):body>/<regex("[\w ]+"):name>',
'datasource': {
'source': 'centroids',
'projection': {'name': 1, 'body': 1, 'location.coordinates': 1}
}
}
bodies = {
'datasource': {
'source': 'centroids',
'aggregation': {
'pipeline': [
{"$group": {"_id": "$body"}},
]
},
}
}
DOMAIN = {
'centroids': centroids,
'bodies': bodies
}
The endpoint http://127.0.0.1:5000/centroid/mercury/hokusai, for example, gives me the name, body, and coordinates of mercury/hokusai.
And the endpoint http://127.0.0.1:5000/bodies gives me the list of unique values for body in centroids.
Beautiful. Thumbs up to Eve!
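The same 'source' setting should also cover the original question, an endpoint listing every document for a given body. The following is an untested sketch: the resource name body_centroids is made up, and it assumes Eve treats the <regex(...):body> part of a resource URL as a filter on the body field, as it does for sub-resources.

```python
# Two resources backed by the same MongoDB collection via 'source'
centroids = {
    'item_title': 'centroid',
    'url': r'centroid/<regex("[\w]+"):body>/<regex("[\w ]+"):name>',
    'datasource': {
        'source': 'centroids',
        'projection': {'name': 1, 'body': 1, 'location.coordinates': 1},
    },
}

# Hypothetical second resource: all centroids of one body
body_centroids = {
    'item_title': 'body craters',
    'url': r'centroids/<regex("[\w]+"):body>',
    'datasource': {
        'source': 'centroids',
        'projection': {'name': 1, 'body': 1, 'location.coordinates': 1},
    },
}

# DOMAIN keys are resource names, so each resource gets its own key
# (a list under a single key is what raised the AttributeError)
DOMAIN = {
    'centroids': centroids,
    'body_centroids': body_centroids,
}
```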

Use search function on a website with Python Requests(ebay)

I'm trying to create a Python program using the Requests library that searches eBay for an item that the user enters. Rather than hard-coding the URL, is it possible to use the requests library to perform an eBay search (or a search on any website)?
I believe what you want here is to input a text in a search element. According to realpython:
The requests library is the de facto standard for making HTTP requests in Python.
I would recommend using Selenium to drive the website itself, e.g. typing text into an element and pressing a button on the page.
However, if you still want to use requests, try to find the API endpoint that handles the search and use a POST request to get data from it.
resp = requests.post(url)
I created an eBay developer account to access the API, then wrote a small script to search eBay for historical pricing on an item. Save it as search.py and call it like this:
./search.py "ebay item you are looking for"
You can change the itemFilter to your liking; currently it is set to sold items since 2019-10-10. The complete list is here: https://developer.ebay.com/devzone/finding/callref/types/ItemFilterType.html
The comments at the bottom show the complete set of fields returned from eBay; you can pick the fields you like and add them to a print statement.
Also, this script will return more than the first page of items, and each page costs you one of your 5,000 developer queries for the day. I was unable to get it to work with the sandbox, no matter what I tried. I believe the eBay sandbox is broken.
#!/usr/local/bin/python3
from ebaysdk.finding import Connection
import sys
DEBUG = False
#search_keywords = "2019 Hot Wheels Dumbo"
search_keywords = sys.argv[1]
print ("Search Keywords: " + search_keywords)
# Function accepts keywords for query and pageNumber of search to pull
# Ebay will only return 100 items per search
def build_request(keywords, pageNumber):
    # Create a request structure
    # Item Filter List https://developer.ebay.com/devzone/finding/callref/types/ItemFilterType.html
    request = {
        'keywords': keywords,
        # Each filter must be its own dict; repeating 'name'/'value' keys in a
        # single dict would silently keep only the last pair
        'itemFilter': [
            {'name': 'condition', 'value': 'new'},
            {'name': 'SoldItemsOnly', 'value': True},
            {'name': 'EndTimeFrom', 'value': '2019-10-10T00:00:00.000Z'}
        ],
        'paginationInput': {
            'entriesPerPage': 100,  # EBay limits API Calls to 100 items per page
            'pageNumber': pageNumber
        },
        'sortOrder': 'PricePlusShippingLowest',
    }
    return request
# Connect using yaml file to EBAY-US production site
# put in __main__ just in case we turn this into a module later
if __name__ == '__main__':
    api = Connection(config_file='ebay.yaml', debug=False, siteid="EBAY-US")
    #api = Connection(config_file='ebay.yaml', debug=False, domain="api.sandbox.ebay.com", siteid="EBAY-US")
    # Run the request
    query = build_request(search_keywords, 1)
    query['paginationInput']['pageNumber'] = 1
    response = api.execute('findCompletedItems', query)
    if DEBUG:
        print(response.dict())  # Use this to see the dictionary structure
    # Display how many entries and results are returned
    print("API Call: findCompletedItems")
    print("----------------------------")
    print(f"totalEntries: {response.reply.paginationOutput.totalEntries}, totalPages: {response.reply.paginationOutput.totalPages}")
    maxpage = int(str(response.reply.paginationOutput.totalPages))
    # Display item information fields from the request, see below for all possible fields
    for item in response.reply.searchResult.item:
        print(f"Date: {item.listingInfo.endTime} Title: {item.title}, Price: {item.sellingStatus.currentPrice.value} Shipping: {item.shippingInfo.shippingServiceCost.value}")
    # Now run the request for each remaining page and change the page in the request each time
    # range() excludes its end value, so add 1 to include the last page
    for page in range(2, maxpage + 1):
        print("**** PAGE: " + str(page) + " of " + str(maxpage) + " ****")
        # Update the page number and run the request
        query['paginationInput']['pageNumber'] = page
        response = api.execute('findCompletedItems', query)
        # Display item information fields from the request, see below for all possible fields
        for item in response.reply.searchResult.item:
            print(f"Date: {item.listingInfo.endTime} Title: {item.title}, Price: {item.sellingStatus.currentPrice.value} Shipping: {item.shippingInfo.shippingServiceCost.value}")
#{'ack': 'Success', 'version': '1.13.0', 'timestamp': '2019-10-16T01:28:25.891Z',
#
#searchResult': {'item': [{'itemId': '123719989207', 'title': '2019 HOT WHEELS 2 SET CORVETTE STINGRAY SUPER CHROMES 5/5 TREASURE HUNT PAIR', 'globalId': 'EBAY-US', 'primaryCategory': {'categoryId': '180506', 'categoryName': 'Contemporary Manufacture'}, 'galleryURL': 'https://thumbs4.ebaystatic.com/m/mFuyRQgYjSutGli33dqsqcA/140.jpg', 'viewItemURL': 'https://www.ebay.com/itm/2019-HOT-WHEELS-2-SET-CORVETTE-STINGRAY-SUPER-CHROMES-5-5-TREASURE-HUNT-PAIR-/123719989207', 'paymentMethod': 'PayPal', 'autoPay': 'false', 'postalCode': '54650', 'location': 'Onalaska,WI,USA', 'country': 'US', 'shippingInfo': {'shippingServiceCost': {'_currencyId': 'USD', 'value': '6.0'}, 'shippingType': 'Flat', 'shipToLocations': 'Worldwide', 'expeditedShipping': 'false', 'oneDayShippingAvailable': 'false', 'handlingTime': '2'}, 'sellingStatus': {'currentPrice': {'_currencyId': 'USD', 'value': '9.0'}, 'convertedCurrentPrice': {'_currencyId': 'USD', 'value': '9.0'}, 'sellingState': 'Ended'}, 'listingInfo': {'bestOfferEnabled': 'false', 'buyItNowAvailable': 'false', 'startTime': '2019-04-02T22:14:03.000Z', 'endTime': '2019-10-02T18:44:49.000Z', 'listingType': 'StoreInventory', 'gift': 'false', 'watchCount': '2'}, 'returnsAccepted': 'false', 'condition': {'conditionId': '1000', 'conditionDisplayName': 'New'}, 'isMultiVariationListing': 'false', 'topRatedListing': 'false'},
#
#
#{'itemId': '153679182310', 'title': "Hot Wheels 2019 Super Treasure Hunt '68 Mercury Cougar Loose 1/64 STH Green", 'globalId': 'EBAY-US', 'primaryCategory': {'categoryId': '73252', 'categoryName': 'Collections & Lots'}, 'galleryURL': 'https://thumbs3.ebaystatic.com/m/mEN9EsbCJY0wb6WzXjO8hNg/140.jpg', 'viewItemURL': 'https://www.ebay.com/itm/Hot-Wheels-2019-Super-Treasure-Hunt-68-Mercury-Cougar-Loose-1-64-STH-Green-/153679182310', 'paymentMethod': 'PayPal', 'autoPay': 'false', 'location': 'Malaysia', 'country': 'MY', 'shippingInfo': {'shippingServiceCost': {'_currencyId': 'USD', 'value': '9.0'}, 'shippingType': 'Flat', 'shipToLocations': 'Worldwide', 'expeditedShipping': 'false', 'oneDayShippingAvailable': 'false', 'handlingTime': '15'}, 'sellingStatus': {'currentPrice': {'_currencyId': 'USD', 'value': '9.9'}, 'convertedCurrentPrice': {'_currencyId': 'USD', 'value': '9.9'}, 'bidCount': '1', 'sellingState': 'Ended'}, 'listingInfo': {'bestOfferEnabled': 'false', 'buyItNowAvailable': 'false', 'startTime': '2019-10-10T04:13:32.000Z', 'endTime': '2019-10-15T04:13:32.000Z', 'listingType': 'Auction', 'gift': 'false', 'watchCount': '1'}, 'returnsAccepted': 'false', 'condition': {'conditionId': '3000', 'conditionDisplayName': 'Used'}, 'isMultiVariationListing': 'false', 'topRatedListing': 'false'}],
#
#'_count': '100'}, 'paginationOutput': {'pageNumber': '3', 'entriesPerPage': '100', 'totalPages': '40', 'totalEntries': '3966'}}
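A note on itemFilter in a request like this: each filter needs to be its own dictionary. A Python dict literal that repeats a key keeps only the last value, so several name/value pairs packed into one pair of braces collapse into a single filter. A small demonstration with plain dicts (no eBay call involved):

```python
# Repeating keys inside one dict literal: only the last 'name'/'value' survive
merged = [
    {'name': 'condition', 'value': 'new',
     'name': 'SoldItemsOnly', 'value': True,
     'name': 'EndTimeFrom', 'value': '2019-10-10T00:00:00.000Z'}
]

# One dict per filter keeps all three
separate = [
    {'name': 'condition', 'value': 'new'},
    {'name': 'SoldItemsOnly', 'value': True},
    {'name': 'EndTimeFrom', 'value': '2019-10-10T00:00:00.000Z'},
]

print(merged)         # a single filter, EndTimeFrom
print(len(separate))  # 3
```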
You can scrape eBay using the BeautifulSoup web scraping library.
To avoid typing out the full request URL, you can put the request parameters, including the search query itself, in a params dict:
query = input('Your query is: ')
params = {
'_nkw': query, # search query
'_pgn': 1 # page number
#'LH_Sold': '1' # shows sold items
}
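To see what these parameters amount to, the sketch below builds the final search URL by hand with the standard library. requests does the equivalent encoding itself when given params=; the function name build_search_url is made up for illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.ebay.com/sch/i.html'


def build_search_url(query, page=1):
    """Return the eBay search URL for a query and page number."""
    params = {
        '_nkw': query,  # search query
        '_pgn': page,   # page number
    }
    return f'{BASE_URL}?{urlencode(params)}'


print(build_search_url('hot wheels'))
# https://www.ebay.com/sch/i.html?_nkw=hot+wheels&_pgn=1
```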
If you use the requests library, the request might be blocked because the default user-agent in requests is python-requests, so the website can tell that a bot or script is sending the request. Check what your user-agent is.
An additional step besides providing a browser user-agent could be to rotate user-agents, for example switching between PC, mobile, and tablet, as well as between browsers such as Chrome, Firefox, Safari, Edge and so on.
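A minimal sketch of user-agent rotation, assuming a small hand-picked pool; the second and third strings below are illustrative placeholders rather than verified current browser versions, and random_headers is a made-up helper name.

```python
import random

# Example pool; replace with user-agent strings you have checked yourself
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:105.0) Gecko/20100101 Firefox/105.0",
]


def random_headers():
    """Return request headers with a user-agent picked at random from the pool."""
    return {"User-Agent": random.choice(USER_AGENTS)}


headers = random_headers()
print(headers["User-Agent"])
```

Passing headers=random_headers() to each requests.get call would then vary the user-agent from request to request.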
from bs4 import BeautifulSoup
import requests, json, lxml
# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
}
query = input('Your query is: ')
params = {
'_nkw': query, # search query
'_pgn': 1 # page number
#'LH_Sold': '1' # shows sold items
}
data = []
while True:
    page = requests.get('https://www.ebay.com/sch/i.html', params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(page.text, 'lxml')
    print(f"Extracting page: {params['_pgn']}")
    print("-" * 10)
    for products in soup.select(".s-item__info"):
        title = products.select_one(".s-item__title span").text
        price = products.select_one(".s-item__price").text
        link = products.select_one(".s-item__link")["href"]
        data.append({
            "title" : title,
            "price" : price,
            "link" : link
        })
    if soup.select_one(".pagination__next"):
        params['_pgn'] += 1
    else:
        break
print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
Your query is: shirt # query entry example
Extracting page: 1
----------
[
{
"title": "Men's Polo Shirt 100% Cotton Knockout Jeans NVY WHT 220 Stripe MEDIUM Free Ship",
"price": "$11.99",
"link": "https://www.ebay.com/itm/133992813518?hash=item1f329813ce:g:tWMAAOSwXBxhTP7Q&amdata=enc%3AAQAHAAAAwJ9%2BDbqKGCoZye6JelYY1tJHQWotUalKHQJ%2FixwyplnvOC60SofXkLVsNgRfoX09uOZLerjkBtwcW%2FQQa1wmJ6%2BYVEEagzH1GAK6Bx4rX%2BRNnj9g6SlvB2WagWETpbmrLdiFHGTIRvAL2EvfXDRqPFnEGWZ2nk%2BM0zEkiGzp%2F4ADUbPslGui3zTDJsIgVpXjAHzL2EUH3s7tiOxtd3qVTXxaE095evq5YrBgkJFJu4KB5o%2F%2BCiCURfy7xR%2FbTU7mnQ%3D%3D%7Ctkp%3ABlBMUJavlrOEYQ"
},
{
"title": "5 Pack Oroblu Micromodal Perfect Line Round Neck Short Sleeve T-Shirt",
"price": "$192.00",
"link": "https://www.ebay.com/itm/275287531865?hash=item40186a6159:g:OtUAAOSweKFiZr2S&amdata=enc%3AAQAHAAAAsMRLg1VeYAIKHTiXXdD8xv56DpaeH6jc3EhFP26RJ66bqmlzXHQrMMxuo78x6S2i8DfxvuzjbXrpmYYdyRLhzgQCoaauMNvRwVNuhx11qorNlPoHrig%2BdIGG2RB4xHmXdB2fjOciLCsdYkL23jaH23ehXakQu%2BrBzER%2F2v94Sdg%2BkchjwWmRidsv0kPfLRcpiy%2BOeDBHEas4i9EQY%2F0VAzLGj2U%2FwLdcqjqSjgngj%2BRr%7Ctkp%3ABlBMUJavlrOEYQ"
},
# ...
]
As an alternative, you can use the eBay Organic Results API from SerpApi. It's a paid API with a free plan that handles blocks and parsing on their backend.
Example code that paginates through all pages with input query:
from serpapi import EbaySearch
import os, json
query = input('Your query is: ')
params = {
"api_key": os.getenv("API_KEY"), # serpapi api key
"engine": "ebay", # search engine
"ebay_domain": "ebay.com", # ebay domain
"_nkw": query, # search query
"_pgn": 1 # page number
#"LH_Sold": "1" # shows sold items
}
search = EbaySearch(params) # where data extraction happens
page_num = 0
data = []
while True:
    results = search.get_dict()     # JSON -> Python dict
    if "error" in results:
        print(results["error"])
        break
    for organic_result in results.get("organic_results", []):
        link = organic_result.get("link")
        price = organic_result.get("price")
        data.append({
            "price" : price,
            "link" : link
        })
    page_num += 1
    print(page_num)
    if "next" in results.get("pagination", {}):
        params['_pgn'] += 1
    else:
        break
print(json.dumps(data, indent=2))
Output:
[
{
"price": {
"raw": "$25.99",
"extracted": 25.99
},
"link": "https://www.ebay.com/itm/285018595898?hash=item425c6ea23a:g:mT0AAOSwBjljAFsl&amdata=enc%3AAQAHAAAAkI1P1C%2BE2boIutliCMWXCADm%2BXyUp2a6Q1qOjpifaAIo6%2FWD0yHCd8Mejyfc2jc%2BQ5zzVcITrcWM0XxIfiSUILMZFsMewB154skl5re5%2FS8W9kRrabjRdy%2BoC6aQoS%2FWGq%2F6A%2BZWQ1GQkcd5Tstamu%2FgzZKoL6VYfO4YpC4oO4Im23h0wiIfI0%2BxPG8uuFRMPw%3D%3D%7Ctkp%3ABk9SR_i1vbKEYQ"
},
{
"price": {
"raw": "$14.16",
"extracted": 14.16
},
"link": "https://www.ebay.com/itm/234347615312?hash=item369034d450:g:hvYAAOSwNspg0TAH&amdata=enc%3AAQAHAAAA0B1m3DPC4q0R4AQp6MO8rXnKt6qFIX2p%2BaypmySYXkIvi6XE3FHzpbtN%2B%2Bvd9P3TZPYu3fuQVl5kH0ZYDO5eqtnjh1EcZ%2Fb9rZMlMx6r6RcH%2B5wOY7X65bvRcmQ7OUmoaNGAMOZpOc4hg8vHj2afxCa%2FR7F3jDr1KjnHk%2BKnln3opoiqAVMFIoXv338f70KZw8CDd%2Fg9xU0jQlzgxDpDwSL6Y6OMz0oKxh4T%2BRUMKHj03VE5E9%2B8VKzPUMWAQ%2BZWuZyGMpWxwzn%2BomggywV5RhI%3D%7Ctkp%3ABk9SR_i1vbKEYQ"
},
# ...
]

Python/JSON - does order matter and why is this JSON post to a REST API failing?

I'm posting to the Office 365 REST API and am creating the dump as shown below:
def CreateEvent(auth, cal_id, subject, start_time, end_time, attendees, content):
    create_url = 'https://outlook.office365.com/api/v1.0/me/calendars/{0}/events'.format(cal_id)
    headers = {"Content-type": "application/json", "Accept": "application/json"}
    data = {"Subject":"","Attendees": [],"End": {},"Start": {},"Body": {}}
    data["Subject"] = subject
    data["StartTimeZone"] = "GMT Standard Time"
    data["Start"] = start_time
    data["EndTimeZone"] = "GMT Standard Time"
    data["End"] = end_time
    data["Attendees"] = attendees
    data["Body"]["ContentType"] = "Text"
    data["Body"]["Content"] = content
    content_data = json.dumps(data)
    #return data
    response = requests.post(create_url,data,headers=headers,auth=auth)
    return response
This produces an unordered dump, which I believe shouldn't cause any issues?
However, when I post manually using y, I get a 201 and the event is created; when I post using the function, which produces the dump below, I get a 400.
y="""
{
"Subject": "TESTTTT",
"Body": {
"ContentType": "HTML",
"Content": "I think it will meet our requirements!"
},
"Start": "2016-12-02T11:30:00Z",
"StartTimeZone": "GMT Standard Time",
"End": "2016-12-02T11:45:00Z",
"EndTimeZone": "GMT Standard Time",
"Attendees": [
{
"EmailAddress": {
"Name": "Alex ",
"Address": "alex@test.com"
},
"Type": "Required"
}
]
}
"""
What my function returns (and what gives a 400):
{
'Body': {
'Content': 'test data',
'ContentType': 'Text'
},
'End': '2016-12-02T06:00:00Z',
'StartTimeZone': 'GMT Standard Time',
'EndTimeZone': 'GMT Standard Time',
'Start': '2016-12-02T02:00:00Z',
'Attendees': [{
'EmailAddress': {
'Name': 'Alex ',
'Address': 'alex@test.com'
},
'Type': 'Required'
}],
'Subject': 'Maintenance: test'
}
At a glance, I believe you just need to change
response = requests.post(create_url,data,headers=headers,auth=auth)
to
response = requests.post(create_url,content_data,headers=headers,auth=auth)
You were correct in calling the json.dumps() method to serialize the dictionary. Just pass that string to the server instead.
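On the question in the title: key order in a JSON object does not matter, so the unordered dump is not the problem. The sketch below checks this with the standard library only; the payload values are shortened from the question. Passing the dict itself as data makes requests form-encode it instead of sending JSON, which is what the fix above addresses.

```python
import json

ordered = '{"Subject": "TESTTTT", "Start": "2016-12-02T11:30:00Z", "End": "2016-12-02T11:45:00Z"}'
shuffled = '{"End": "2016-12-02T11:45:00Z", "Subject": "TESTTTT", "Start": "2016-12-02T11:30:00Z"}'

# Both strings decode to the same object, whatever the key order
print(json.loads(ordered) == json.loads(shuffled))  # True

# Serializing the dict gives the JSON text the server expects in the request body
data = {"Subject": "TESTTTT", "Body": {"ContentType": "Text", "Content": "test data"}}
content_data = json.dumps(data)
print(content_data)
```

Recent versions of requests also accept json=data in requests.post, which serializes the dict and sets the Content-Type header for you.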

Simple Python social media scrape of Public information

I just want to grab public information from my accounts on two social media sites (Instagram and Twitter). My code returns info for Twitter, and I know the XPath is correct for Instagram, but for some reason I'm not getting data for it. I know the XPaths could be more specific, but I can fix that later. Both my accounts are public.
1) I thought maybe it didn't like the Python header, so I tried changing it and I still get nothing. That line is commented out but it's still there.
2) I heard something about an API on GitHub, but the lengthy code is very intimidating and way above my level of understanding. I don't understand more than half of what I'm reading there.
from lxml import html
import requests
import webbrowser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
#page = requests.get('https://www.instagram.com/<my account>/', headers=headers)
page = requests.get('https://www.instagram.com/<my account>/')
tree = html.fromstring(page.text)
pageTwo = requests.get('http://www.twitter.com/<my account>')
treeTwo = html.fromstring(pageTwo.text)
instaFollowers = tree.xpath("//span[@data-reactid='.0.1.0.0:0.1.3.1.0']/span[2]/text()")
instaFollowing = tree.xpath("//span[@data-reactid='.0.1.0.0:0.1.3.2.0']/span[2]/text()")
twitFollowers = treeTwo.xpath("//a[@data-nav='followers']/span[@class='ProfileNav-value']/text()")
twitFollowing = treeTwo.xpath("//a[@data-nav='following']/span[@class='ProfileNav-value']/text()")
print ''
print '--------------------'
print 'Social Media Checker'
print '--------------------'
print ''
print 'Instagram: ' + str(instaFollowers) + ' / ' + str(instaFollowing)
print ''
print 'Twitter: ' + str(twitFollowers) + ' / ' + str(twitFollowing)
As mentioned, Instagram's page source does not reflect its rendered source, because a JavaScript function is called to pass content from JSON data to the browser. Hence, what Python scrapes from the page source is not exactly what the browser renders to the screen. Welcome to the new world of dynamic web programming! Consider using Instagram's API or another web parser that can retrieve generated HTML content (not just the page source).
With that said, if you simply need the IG account data, you can still use Python's lxml to XPath the JSON content in the <script> tag (specifically the sixth occurrence, but adjust for the page you need). The example below parses Google's Instagram JSON data:
import lxml.etree as et
import urllib.request as rq
rqpage = rq.urlopen('https://instagram.com/google')
txtpage = rqpage.read()
tree = et.HTML(txtpage)
jsondata = tree.xpath("//script[@type='text/javascript' and position()=6]/text()")
for i in jsondata:
    print(i)
OUTPUT
window._sharedData = {"qs":"{\"shift\":10,\"header
\":\"n3bTdmHGHDgxvZYPN0KDFHqbkxd6zpTl\",\"edges\":100,\"blob
\":\"AQCq42rOTCnKOZcOxFn06L1J6_W8wY6ntAS1bX88VBClAjQD9PyJdefCzOwfSAbUdsBwHKb1QSndurPtjyN-
rHMOrZ_6ubE_Xpu908cyron9Zczkj4QMkAYUHIgnmmftuXG8rrFzq_Oq3BoXpQgovI9hefha-
6SAs1RLJMwMArrbMlFMLAwyd1TZhArcxQkk9bgRGT4MZK4Tk2VNt1YOKDN1pO3NJneFlUxdUJTdDX
zj3eY-stT7DnxF_GM_j6xwk1o\",\"iterations\":7,\"size\":42}","static_root":"
\/\/instagramstatic-a.akamaihd.net\/bluebar\/5829dff","entry_data":
{"ProfilePage":[{"__query_string":"?","__path":"\/google\/","__get_params":
{},"user":{"username":"google","has_blocked_viewer":false,"follows":
{"count":10},"requested_by_viewer":false,"followed_by":
{"count":977186},"country_block":null,"has_requested_viewer":false,"followed_
by_viewer":false,"follows_viewer":false,"profile_pic_url":"https:
\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xfp1\/t51.2885-19\/s150x150
\/11910217_933356470069152_115044571_a.jpg","is_private":false,"full_name":
"Google","media":{"count":180,"page_info":
{"has_previous_page":false,"start_cursor":"1126896719808871555","end_cursor":
"1092117490206686720","has_next_page":true},"nodes":[{"code":"-
jipiawryD","dimensions":{"width":640,"height":640},"owner":
{"id":"1067259270"},"comments":{"count":105},"caption":"Today's the day!
Your searches are served. Happy Thanksgiving \ud83c\udf57\ud83c\udf70
#GoogleTrends","likes":
{"count":11410},"date":1448556579.0,"thumbnail_src":"https:\/
\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\
/11848856_482502108621097_589421586_n.jpg","is_video":true,"id":"112689671980
8871555","display_src":"https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-
xat1\/t51.2885-15
...
JSON Pretty Print (extracting the window._sharedData variable from above)
See below where user (followers, following, etc.) data shows at beginning:
{
"qs": "{\"shift\":10,\"header\":\"n3bTdmHGHDgxvZYPN0KDFHqbkxd6zpTl\",\"edges\":100,\"blob\":\"AQCq42rOTCnKOZcOxFn06L1J6_W8wY6ntAS1bX88VBClAjQD9PyJdefCzOwfSAbUdsBwHKb1QSndurPtjyN-rHMOrZ_6ubE_Xpu908cyron9Zczkj4QMkAYUHIgnmmftuXG8rrFzq_Oq3BoXpQgovI9hefha-6SAs1RLJMwMArrbMlFMLAwyd1TZhArcxQkk9bgRGT4MZK4Tk2VNt1YOKDN1pO3NJneFlUxdUJTdDXzj3eY-stT7DnxF_GM_j6xwk1o\",\"iterations\":7,\"size\":42}",
"static_root": "\/\/instagramstatic-a.akamaihd.net\/bluebar\/5829dff",
"entry_data": {
"ProfilePage": [
{
"__query_string": "?",
"__path": "\/google\/",
"__get_params": {
},
"user": {
"username": "google",
"has_blocked_viewer": false,
"follows": {
"count": 10
},
"requested_by_viewer": false,
"followed_by": {
"count": 977186
},
"country_block": null,
"has_requested_viewer": false,
"followed_by_viewer": false,
"follows_viewer": false,
"profile_pic_url": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xfp1\/t51.2885-19\/s150x150\/11910217_933356470069152_115044571_a.jpg",
"is_private": false,
"full_name": "Google",
"media": {
"count": 180,
"page_info": {
"has_previous_page": false,
"start_cursor": "1126896719808871555",
"end_cursor": "1092117490206686720",
"has_next_page": true
},
"nodes": [
{
"code": "-jipiawryD",
"dimensions": {
"width": 640,
"height": 640
},
"owner": {
"id": "1067259270"
},
"comments": {
"count": 105
},
"caption": "Today's the day! Your searches are served. Happy Thanksgiving \ud83c\udf57\ud83c\udf70 #GoogleTrends",
"likes": {
"count": 11410
},
"date": 1448556579,
"thumbnail_src": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\/11848856_482502108621097_589421586_n.jpg",
"is_video": true,
"id": "1126896719808871555",
"display_src": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\/11848856_482502108621097_589421586_n.jpg"
},
{
"code": "-hwbf2wr0O",
"dimensions": {
"width": 640,
"height": 640
},
"owner": {
"id": "1067259270"
},
"comments": {
"count": 95
},
"caption": "Thanksgiving dinner is waiting. But first, the airport. \u2708\ufe0f #GoogleApp",
"likes": {
"count": 12621
},
...
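To get from that script text to the follower counts, one way is to cut off the window._sharedData = prefix and load the rest as JSON. A minimal sketch, using a shortened, hand-trimmed copy of the output above as input; the function name parse_shared_data is made up, and the key layout is the one shown above, which Instagram may well have changed since.

```python
import json

PREFIX = 'window._sharedData = '


def parse_shared_data(script_text):
    """Strip the JavaScript assignment around the JSON and decode it."""
    body = script_text.strip()
    if body.startswith(PREFIX):
        body = body[len(PREFIX):]
    return json.loads(body.rstrip(';'))


# Trimmed sample in the same shape as the output above
script_text = ('window._sharedData = {"entry_data": {"ProfilePage": [{"user": '
               '{"username": "google", "follows": {"count": 10}, '
               '"followed_by": {"count": 977186}}}]}};')

user = parse_shared_data(script_text)['entry_data']['ProfilePage'][0]['user']
print(user['username'], user['followed_by']['count'], user['follows']['count'])
```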
If anyone is still interested in this sort of thing, using Selenium solved my problems.
http://pastebin.com/5eHeDt3r
Is there a faster way ?
In case you want to find information about yourself and others without the hassle of writing code, try this piece of software. Apart from automatic scraping, it analyzes and visualizes the collected information in a PDF report, covering Facebook, Twitter, Instagram and the Google search engine.
P.S. I am the main developer and maintainer of this project.
