python referencing json array - python

all,
I have a script I am building to find all open pull requests and compare the sha hash, however I can't seem to find them....
for repo in g.get_user().get_repos():
print (repo.full_name)
json_pulls = requests.get('https://api.github.com/repos/' + repo.full_name + '/pulls?state=open+updated=<' + str(cutoff_date.date())+ '&sort=created&order=asc')
if (json_pulls.ok):
for item in json_pulls.json():
for c in item.items():
#print(c["0"]["title"])
#print (json.dumps(state))
print(c)
The code cycles through the existing repos and list the pull requests and I get the output:
But, I can't for the life of me figure out how to collect individual fields...
I tried using the references:
print(c['title']) - not defined is the error
print(c['0']['title']) -a tubular error
what I am looking for is a simple list for each request....
title
id
state
base / sha
head / sha
Can someone please point out what I am doing wrong in referencing the json items in my python script as it is driving me crazy.
The full code as is with your help of course ... :
# py -m pip install <module> to install the imported modules below.
#
#
# Import stuff
from github import Github
from datetime import datetime, timedelta
import requests
import json
import simplejson
#
#
#declare stuff
# set the past days to search in the range
PAST = 5
# get the cut off date for the repos 10 days ago
cutoff_date = datetime.now() - timedelta(days=PAST)
#print (cutoff_date.date())
# Repo oauth key for my repo
OAUTH_KEY = "(get from github personal keys)"
# set base URL for API query
BASE_URL = 'https://api.github.com/repos/'
#
#
# BEGIN STUFF
# First create a Github instance:
g = Github(login_or_token=OAUTH_KEY, per_page=100)
# get all repositories for my account that are open and updated in the last
no. of days....
for repo in g.get_user().get_repos():
print (repo.full_name)
json_pulls = requests.get('https://api.github.com/repos/' + repo.full_name
+ '/pulls?state=open+updated=<' + str(cutoff_date.date())+
'&sort=created&order=asc')
if (json_pulls.ok):
for item in json_pulls.json():
print(item['title'], item['id'], item['state'], item['base']['sha'],
item['head']['sha'])
The repo site is a simple site, with two repos, and 1 or 2 pull requests to play against.
The idea of the script, when it is done, it to cycle through all the repos, find the pull requests that are older than x days and open, locates the sha for the branch (and sha for the master branches, to skip..... ) remove the branches that are not master branches, thus removing old code and pull requests to keep the repos tidy....

json_pulls.json() returns a list of dictionaries, so you can just do:
for item in json_pulls.json():
print (item['title'], item['id'], item['state'], item['base']['sha'], item['head']['sha'])
There's no need to iterate over item.items().

for key, value in item.items():
if (key == 'title'):
print(value)
#Do stuff with the title here
To add: Json should return as a dict, and the for command in python will default search for a key->val pair when called on dictionary.items().

Related

Error accessing Youtube Data API using Python

I am following the tutorial found here https://www.geeksforgeeks.org/youtube-data-api-set-1/. After I run the below code, I am getting a "No module named 'apiclient'" error. I also tried using "from googleapiclient import discovery" but that gave an error as well. Does anyone have alternatives I can try out?
I have already imported pip install --upgrade google-api-python-client
Would appreciate any help/suggestions!
Here is the code:
from apiclient.discovery import build
# Arguments that need to passed to the build function
DEVELOPER_KEY = "your_API_Key"
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"
# creating Youtube Resource Object
youtube_object = build(YOUTUBE_API_SERVICE_NAME, YOUTUBE_API_VERSION,
developerKey = DEVELOPER_KEY)
def youtube_search_keyword(query, max_results):
# calling the search.list method to
# retrieve youtube search results
search_keyword = youtube_object.search().list(q = query, part = "id, snippet",
maxResults = max_results).execute()
# extracting the results from search response
results = search_keyword.get("items", [])
# empty list to store video,
# channel, playlist metadata
videos = []
playlists = []
channels = []
# extracting required info from each result object
for result in results:
# video result object
if result['id']['kind'] == "youtube# video":
videos.append("% s (% s) (% s) (% s)" % (result["snippet"]["title"],
result["id"]["videoId"], result['snippet']['description'],
result['snippet']['thumbnails']['default']['url']))
# playlist result object
elif result['id']['kind'] == "youtube# playlist":
playlists.append("% s (% s) (% s) (% s)" % (result["snippet"]["title"],
result["id"]["playlistId"],
result['snippet']['description'],
result['snippet']['thumbnails']['default']['url']))
# channel result object
elif result['id']['kind'] == "youtube# channel":
channels.append("% s (% s) (% s) (% s)" % (result["snippet"]["title"],
result["id"]["channelId"],
result['snippet']['description'],
result['snippet']['thumbnails']['default']['url']))
print("Videos:\n", "\n".join(videos), "\n")
print("Channels:\n", "\n".join(channels), "\n")
print("Playlists:\n", "\n".join(playlists), "\n")
if __name__ == "__main__":
youtube_search_keyword('Geeksforgeeks', max_results = 10)
With this information it's hard to say what is the problem. But sometimes I've been banging my head to wall when installing something with pip (Python2) and then trying to import module in Python3 or vice versa.
So if you are running your script with Python3, try install package by using pip3 install --upgrade google-api-python-client
Try the YouTube docs here:
https://developers.google.com/youtube/v3/code_samples
They worked for me on a recently updated Slackware_64 14.2
I use them with Python 3.8. Since there may also be a version 2 of Python installed, I make sure to use this in the Interpreter line:
!/usr/bin/python3.8
Likewise with pip, I use pip3.8 to install dependencies
I installed Python from source. python3.8 --version Python 3.8.2
You can also look at this video here:
https://www.youtube.com/watch?v=qDWtB2q_09g
It sort of explains how to use YouTube's API explorer. You can copy code samples directly from there. The video above covers Android, but the same concept applies to Python regarding using YouTube's API Explorer.
I concur with the previous answer regarding version control.

checking if download link works in python

Im currently using the following code to download gz file. The url of the gz file will be constructed from pieces of information provided by the user:
generalUrl = theWebsiteURL + "/" + packageName
So generalURl can contain something like: http://www.example.com/blah-0.1.0.tar.gz
res = requests.get(generalUrl)
res.raise_for_status()
The problem I have here is; I have a list of websites for the variable called theWebsiteURL. I need to check all of these websites to see which ones have the package in packageName available for download. I would prefer not to download the package during the confirmation.
Once the code goes through the list of websites to discover which ones have the package, I then want to pick the first website from the list of websites that were found to have the package and automatically download the package from it.
something like this:
#!/usr/bin/env python2.7
listOfWebsites = [ website1, website2, website3, website4, and so on ]
goodWebsites = []
for eachWebsite in listOfWebsites:
genURL = eachWebsite + "/" + packageName
res = requests.get(genUrl)
res.raise_for_status()
if raise_for_status == "200"
goodWebsites.append(genURL)
This is where my imagination stops. I need assistance completing this. Not even sure I'm going about it the right way.
You can try to send a HEAD request first in order to check that the URL is valid, and only then download the package via a GET request.
#!/usr/bin/env python2.7
listOfWebsites = [ website1, website2, website3, website4, and so on ]
goodWebsites = []
for eachWebsite in listOfWebsites:
genURL = eachWebsite + "/" + packageName
res = requests.head(genUrl)
if res.ok:
goodWebsites.append(genURL)

Incomplete HAR list using Python: Browsermobproxy, selenium, phantomJS

Fairly new to python, I learn by doing, so I thought I'd give this project a shot. Trying to create a script which finds the google analytics request for a certain website parses the request payload and does something with it.
Here are the requirements:
Ask user for 2 urls ( for comparing the payloads from 2 diff. HAR payloads)
Use selenium to open the two urls, use browsermobproxy/phantomJS to
get all HAR
Store the HAR as a list
From the list of all HAR files, find the google analytics request, including the payload
If Google Analytics tag found, then do things....like parse the payload, etc. compare the payload, etc.
Issue: Sometimes for a website that I know has google analytics, i.e. nytimes.com - the HAR that I get is incomplete, i.e. my prog. will say "GA Not found" but that's only because the complete HAR was not captured so when the regex ran to find the matching HAR it wasn't there. This issue in intermittent and does not happen all the time. Any ideas?
I'm thinking that due to some dependency or latency, the script moved on and that the complete HAR didn't get captured. I tried the "wait for traffic to stop" but maybe I didn't do something right.
Also, as a bonus, I would appreciate any help you can provide on how to make this script run fast, its fairly slow. As I mentioned, I'm new to python so go easy :)
This is what I've got thus far.
import browsermobproxy as mob
from selenium import webdriver
import re
import sys
import urlparse
import time
from datetime import datetime
def cleanup():
s.stop()
driver.quit()
proxy_path = '/Users/bob/Downloads/browsermob-proxy-2.1.4-bin/browsermob-proxy-2.1.4/bin/browsermob-proxy'
s = mob.Server(proxy_path)
s.start()
proxy = s.create_proxy()
proxy_address = "--proxy=127.0.0.1:%s" % proxy.port
service_args = [proxy_address, '--ignore-ssl-errors=yes', '--ssl-protocol=any'] # so that i can do https connections
driver = webdriver.PhantomJS(executable_path='/Users/bob/Downloads/phantomjs-2.1.1-windows/phantomjs-2.1.1-windows/bin/phantomjs', service_args=service_args)
driver.set_window_size(1400, 1050)
urlLists = []
collectTags = []
gaCollect = 0
varList = []
for x in range(0,2): # I want to ask the user for 2 inputs
url = raw_input("Enter a website to find GA on: ")
time.sleep(2.0)
urlLists.append(url)
if not url:
print "You need to type something in...here"
sys.exit()
#gets the two user url and stores in list
for urlList in urlLists:
print urlList, 'start 2nd loop' #printing for debug purpose, no need for this
if not urlList:
print 'Your Url list is empty'
sys.exit()
proxy.new_har()
driver.get(urlList)
#proxy.wait_for_traffic_to_stop(15, 30) #<-- tried this but did not do anything
for ent in proxy.har['log']['entries']:
gaCollect = (ent['request']['url'])
print gaCollect
if re.search(r'google-analytics.com/r\b', gaCollect):
print 'Found GA'
collectTags.append(gaCollect)
time.sleep(2.0)
break
else:
print 'No GA Found - Ending Prog.'
cleanup()
sys.exit()
cleanup()
This might be a stale question, but I found an answer that worked for me.
You need to change two things:
1 - Remove sys.exit() -- this causes your programme to stop after the first iteration through the ent list, so if what you want is not the first thing, it won't be found
2 - call new_har with the captureContent option enabled to get the payload of requests:
proxy.new_har(options={'captureHeaders':True, 'captureContent': True})
See if that helps.

Download entire history of a Wikipedia page

I'd like to download the entire revision history of a single article on Wikipedia, but am running into a roadblock.
It is very easy to download an entire Wikipedia article, or to grab pieces of its history using the Special:Export URL parameters:
curl -d "" 'https://en.wikipedia.org/w/index.php?title=Special:Export&pages=Stack_Overflow&limit=1000&offset=1' -o "StackOverflow.xml"
And of course I can download the entire site including all versions of every article from here, but that's many terabytes and way more data than I need.
Is there a pre-built method for doing this? (Seems like there must be.)
The example above only gets information about the revisions, not the actual contents themselves. Here's a short python script that downloads the full content and metadata history data of a page into individual json files:
import mwclient
import json
import time
site = mwclient.Site('en.wikipedia.org')
page = site.pages['Wikipedia']
for i, (info, content) in enumerate(zip(page.revisions(), page.revisions(prop='content'))):
info['timestamp'] = time.strftime("%Y-%m-%dT%H:%M:%S", info['timestamp'])
print(i, info['timestamp'])
open("%s.json" % info['timestamp'], "w").write(json.dumps(
{ 'info': info,
'content': content}, indent=4))
Wandering around aimlessly looking for clues to another question I have myself — my way of saying I know nothing substantial about this topic! — I just came upon this a moment after reading your question: http://mwclient.readthedocs.io/en/latest/reference/page.html. Have a look for the revisions method.
EDIT: I also see http://mwclient.readthedocs.io/en/latest/user/page-ops.html#listing-page-revisions.
Sample code using the mwclient module:
#!/usr/bin/env python3
import logging, mwclient, pickle, os
from mwclient import Site
from mwclient.page import Page
logging.root.setLevel(logging.DEBUG)
logging.debug('getting page...')
env_page = os.getenv("MEDIAWIKI_PAGE")
page_name = env_page is not None and env_page or 'Stack Overflow'
page_name = Page.normalize_title(env_page)
site = Site('en.wikipedia.org') # https by default. change w/`scheme=`
page = site.pages[page_name]
logging.debug('extracting revisions (may take a really long time, depending on the page)...')
revisions = []
for i, revision in enumerate(page.revisions()):
revisions.append(revision)
logging.debug('saving to file...')
with open('{}Revisions.mediawiki.pkl'.format(page_name), 'wb+') as f:
pickle.dump(revisions, f, protocol=0) # protocol allows backwards compatibility between machines

SyntaxError: Missing parentheses in call to 'print' [duplicate]

This question already has answers here:
What does "SyntaxError: Missing parentheses in call to 'print'" mean in Python?
(11 answers)
Closed 5 years ago.
I've been trying to scrape some twitter's data, but when ever I run this code I get the error SyntaxError: Missing parentheses in call to 'print'.
Can someone please help me out with this one?
Thanks for your time :)
"""
Use Twitter API to grab user information from list of organizations;
export text file
Uses Twython module to access Twitter API
"""
import sys
import string
import simplejson
from twython import Twython
#WE WILL USE THE VARIABLES DAY, MONTH, AND YEAR FOR OUR OUTPUT FILE NAME
import datetime
now = datetime.datetime.now()
day=int(now.day)
month=int(now.month)
year=int(now.year)
#FOR OAUTH AUTHENTICATION -- NEEDED TO ACCESS THE TWITTER API
t = Twython(app_key='APP_KEY', #REPLACE 'APP_KEY' WITH YOUR APP KEY, ETC., IN THE NEXT 4 LINES
app_secret='APP_SECRET',
oauth_token='OAUTH_TOKEN',
oauth_token_secret='OAUTH_TOKEN_SECRET')
#REPLACE WITH YOUR LIST OF TWITTER USER IDS
ids = "4816,9715012,13023422, 13393052, 14226882, 14235041, 14292458, 14335586, 14730894,\
15029174, 15474846, 15634728, 15689319, 15782399, 15946841, 16116519, 16148677, 16223542,\
16315120, 16566133, 16686673, 16801671, 41900627, 42645839, 42731742, 44157002, 44988185,\
48073289, 48827616, 49702654, 50310311, 50361094,"
#ACCESS THE LOOKUP_USER METHOD OF THE TWITTER API -- GRAB INFO ON UP TO 100 IDS WITH EACH API CALL
#THE VARIABLE USERS IS A JSON FILE WITH DATA ON THE 32 TWITTER USERS LISTED ABOVE
users = t.lookup_user(user_id = ids)
#NAME OUR OUTPUT FILE - %i WILL BE REPLACED BY CURRENT MONTH, DAY, AND YEAR
outfn = "twitter_user_data_%i.%i.%i.txt" % (now.month, now.day, now.year)
#NAMES FOR HEADER ROW IN OUTPUT FILE
fields = "id screen_name name created_at url followers_count friends_count statuses_count \
favourites_count listed_count \
contributors_enabled description protected location lang expanded_url".split()
#INITIALIZE OUTPUT FILE AND WRITE HEADER ROW
outfp = open(outfn, "w")
outfp.write(string.join(fields, "\t") + "\n") # header
#THE VARIABLE 'USERS' CONTAINS INFORMATION OF THE 32 TWITTER USER IDS LISTED ABOVE
#THIS BLOCK WILL LOOP OVER EACH OF THESE IDS, CREATE VARIABLES, AND OUTPUT TO FILE
for entry in users:
#CREATE EMPTY DICTIONARY
r = {}
for f in fields:
r[f] = ""
#ASSIGN VALUE OF 'ID' FIELD IN JSON TO 'ID' FIELD IN OUR DICTIONARY
r['id'] = entry['id']
#SAME WITH 'SCREEN_NAME' HERE, AND FOR REST OF THE VARIABLES
r['screen_name'] = entry['screen_name']
r['name'] = entry['name']
r['created_at'] = entry['created_at']
r['url'] = entry['url']
r['followers_count'] = entry['followers_count']
r['friends_count'] = entry['friends_count']
r['statuses_count'] = entry['statuses_count']
r['favourites_count'] = entry['favourites_count']
r['listed_count'] = entry['listed_count']
r['contributors_enabled'] = entry['contributors_enabled']
r['description'] = entry['description']
r['protected'] = entry['protected']
r['location'] = entry['location']
r['lang'] = entry['lang']
#NOT EVERY ID WILL HAVE A 'URL' KEY, SO CHECK FOR ITS EXISTENCE WITH IF CLAUSE
if 'url' in entry['entities']:
r['expanded_url'] = entry['entities']['url']['urls'][0]['expanded_url']
else:
r['expanded_url'] = ''
print r
#CREATE EMPTY LIST
lst = []
#ADD DATA FOR EACH VARIABLE
for f in fields:
lst.append(unicode(r[f]).replace("\/", "/"))
#WRITE ROW WITH DATA IN LIST
outfp.write(string.join(lst, "\t").encode("utf-8") + "\n")
outfp.close()
It seems like you are using python 3.x, however the code you are running here is python 2.x code. Two ways to solve this:
Download python 2.x on Python's website and use it to run your script
Add parentheses around your print call at the end by replacing print r by print(r) at the end (and keep using python 3)
But today, a growing majority of python programmers are using python 3, and the official python wiki states the following:
Python 2.x is legacy, Python 3.x is the present and future of the
language
If I were you, I'd go with the second option and keep using python 3.
Looks like you trying to run Python 2 code in Python 3, where print is function and required parentheses:
print(foo)
You just need to add pranethesis to your print statmetnt to convert it to a function, like the error says:
print expression -> print(expression)
In Python 2, print is a statement, but in Python 3, print is a function. So you could alternatively just run your code with Python 2. print(expression) is backwards compatible with Python 2.
Also, why are you capitalizing all your comments? It's annoying. Your code also violates PEP 8 in several ways. Get an editor like PyCharm (it's free) that can automatically detect errors like this.
You didn't leave a space between # and your comment
You didn't leave spaces between = and other tokens
Within python 2, print has been a statement, not a function. That means you can use it without parentheses. In python 3, that has changed. It is a function there and you need to use print(foo) instead of print foo.

Categories