I am currently trying to download a large number of NY Times articles using their API, with Python 2.7. To do so, I was able to reuse a piece of code I found online:
from nytimesarticle import articleAPI

api = articleAPI('...')

articles = api.search(q = 'Brazil',
    fq = {'headline':'Brazil', 'source':['Reuters','AP', 'The New York Times']},
    begin_date = '20090101')
def parse_articles(articles):
    '''
    This function takes in a response to the NYT api and parses
    the articles into a list of dictionaries
    '''
    news = []
    for i in articles['response']['docs']:
        dic = {}
        dic['id'] = i['_id']
        if i['abstract'] is not None:
            dic['abstract'] = i['abstract'].encode("utf8")
        dic['headline'] = i['headline']['main'].encode("utf8")
        dic['desk'] = i['news_desk']
        dic['date'] = i['pub_date'][0:10] # cutting time of day.
        dic['section'] = i['section_name']
        if i['snippet'] is not None:
            dic['snippet'] = i['snippet'].encode("utf8")
        dic['source'] = i['source']
        dic['type'] = i['type_of_material']
        dic['url'] = i['web_url']
        dic['word_count'] = i['word_count']
        # locations
        locations = []
        for x in range(0, len(i['keywords'])):
            if 'glocations' in i['keywords'][x]['name']:
                locations.append(i['keywords'][x]['value'])
        dic['locations'] = locations
        # subject
        subjects = []
        for x in range(0, len(i['keywords'])):
            if 'subject' in i['keywords'][x]['name']:
                subjects.append(i['keywords'][x]['value'])
        dic['subjects'] = subjects
        news.append(dic)
    return(news)
def get_articles(date, query):
    '''
    This function accepts a year in string format (e.g. '1980')
    and a query (e.g. 'Amnesty International') and it will
    return a list of parsed articles (in dictionaries)
    for that year.
    '''
    all_articles = []
    for i in range(0, 100): # NYT limits pager to first 100 pages. But rarely will you find over 100 pages of results anyway.
        articles = api.search(q = query,
            fq = {'headline':'Brazil', 'source':['Reuters','AP', 'The New York Times']},
            begin_date = date + '0101',
            end_date = date + '1231',
            page = str(i))
        articles = parse_articles(articles)
        all_articles = all_articles + articles
    return(all_articles)
Download_all = []
for i in range(2009, 2010):
    print 'Processing ' + str(i) + '...'
    Amnesty_year = get_articles(str(i), 'Brazil')
    Download_all = Download_all + Amnesty_year

import csv
keys = Download_all[0].keys()
with open('brazil-mentions.csv', 'wb') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(Download_all)
Without the last bit (starting with "... import csv"), this seems to be working fine. If I simply print my results ("print Download_all"), I can see them, although in a very unstructured way. Running the actual code, however, I get the message:
File "C:\Users\xxx.yyy\AppData\Local\Continuum\Anaconda2\lib\csv.py", line 148, in _dict_to_list
+ ", ".join([repr(x) for x in wrong_fields]))
ValueError: dict contains fields not in fieldnames: 'abstract'
Since I am quite a newbie at this, I would highly appreciate your help in guiding me on how to download the news articles into a CSV file in a structured way.
Thanks a lot in advance!
Best regards
Where you have:
keys = Download_all[0].keys()
This takes the column headers for the CSV from the dictionary for the first article. The problem is that the article dictionaries do not all have the same keys, so when you reach the first one that has the extra abstract key, it fails.
It looks like you'll have problems with abstract and snippet which are only added to the dictionary if they exist in the response.
You need to make keys equal to the superset of all possible keys:
keys = Download_all[0].keys() + ['abstract', 'snippet']
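If you'd rather not hard-code the optional fields, you can build that superset programmatically instead; this also avoids duplicate column names if the first article already happens to contain abstract or snippet. A minimal sketch, assuming Download_all is the full list of article dictionaries from your code:

keys = set()
for article in Download_all:
    keys.update(article.keys())   # collect every field that appears in any article
keys = sorted(keys)               # fix a column order for the CSV header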
Or, ensure that every dict has a value for every field:
def parse_articles(articles):
    ...
    if i['abstract'] is not None:
        dic['abstract'] = i['abstract'].encode("utf8")
    else:
        dic['abstract'] = ""
    ...
    if i['snippet'] is not None:
        dic['snippet'] = i['snippet'].encode("utf8")
    else:
        dic['snippet'] = ""
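Either way, csv.DictWriter can also be told how to handle mismatches itself: restval supplies a default value for any field a row is missing, and extrasaction='ignore' drops fields that aren't in fieldnames instead of raising. A sketch, assuming keys is the superset of field names from above:

import csv

with open('brazil-mentions.csv', 'wb') as output_file:   # 'wb' because this is Python 2.7
    dict_writer = csv.DictWriter(output_file, fieldnames=keys,
                                 restval='',              # written for any missing field
                                 extrasaction='ignore')   # drop unexpected extra fields
    dict_writer.writeheader()
    dict_writer.writerows(Download_all)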
Related
So this is somewhat of a continuation of a previous post of mine, except now I have API data to work with. I am trying to get the keys Type and Email as columns in a data frame to come up with a final number. My code:
jsp_full = []
for p in payloads:
    payload = {"payload": {"segmentId":p}}
    r = requests.post(url, headers = header, json = payload)
    #print(r, r.reason)
    time.sleep(r.elapsed.total_seconds())
    json_data = r.json() if r and r.status_code == 200 else None
    json_keys = json_data['payload']['supporters']
    json_package = []
    jsp_full.append(json_package)
    for row in json_keys:
        SID = row['supporterId']
        Handle = row['contacts']
        a_key = 'value'
        list_values = [a_list[a_key] for a_list in Handle]
        string = str(list_values).split(",")
        data = {
            'SupporterID' : SID,
            'Email' : strip_characters(string[-1]),
            'Type' : labels(p)
        }
        json_package.append(data)
    t2 = round(time.perf_counter(), 2)
    b_key = "Email"
    e = len([b_list[b_key] for b_list in json_package])
    t = str(labels(p))
    #print(json_package)
    print(f'There are {e} emails in the {t} segment')
    print(f'Finished in {t2 - t1} seconds')
    excel = pd.DataFrame(json_package)
    excel.to_excel(r'C:\Users\am\Desktop\email parsing\{0} segment {1}.xlsx'.format(t, str(today)), sheet_name=t)
This part works all well and good. Each payload in the API represents a different segment of people, so I split them out into different files. However, I am at a point where I need to combine all records into a single data frame, hence why I append out to jsp_full. This is a list of lists of dictionaries.
Once I have that, I would run the balance of my code, which is like this:
S= pd.DataFrame(jsp_full[0], index = {0})
Advocacy_Supporters = S.sort_values("Type").groupby("Type", as_index=False)["Email"].first()
print(Advocacy_Supporters['Email'].count())
print("The number of Unique Advocacy Supporters is :")
Advocacy_Supporters_Group = Advocacy_Supporters.groupby("Type")["Email"].nunique()
print(Advocacy_Supporters_Group)
Some sample data:
[{'SupporterID': '565f6a2f-c7fd-4f1b-bac2-e33976ef4306', 'Email': 'somebody#somewhere.edu', 'Type': 'd_Student Ambassadors'}, {'SupporterID': '7508dc12-7647-4e95-a8b8-bcb067861faf', 'Email': 'someoneelse#email.somewhere.edu', 'Type': 'd_Student Ambassadors'},...`
My desired output is a dataframe that looks like so:
SupporterID Email Type
565f6a2f-c7fd-4f1b-bac2-e33976ef4306 somebody#somewhere.edu d_Student Ambassadors
7508dc12-7647-4e95-a8b8-bcb067861faf someoneelse#email.somewhere.edu d_Student Ambassadors
Any help is greatly appreciated!!
So because this code creates an excel file for each segment, all I did was read back in the excels via a for loop like so:
filesnames = ['e_S Donors', 'b_Contributors', 'c_Activists', 'd_Student Ambassadors', 'a_Volunteers', 'f_Offline Action Takers']
S= pd.DataFrame()
for i in filesnames:
    data = pd.read_excel(r'C:\Users\am\Desktop\email parsing\{0} segment {1}.xlsx'.format(i, str(today)), sheet_name= i, engine = 'openpyxl')
    S = S.append(data)
This did the trick since it was in a format I already wanted.
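If you ever want to skip the Excel round-trip, you could also flatten jsp_full directly, since it is already a list of lists of dictionaries. A minimal sketch, assuming jsp_full is populated as in the question:

import itertools
import pandas as pd

# Chain the per-segment lists together and build one frame from all the record dicts.
all_records = list(itertools.chain.from_iterable(jsp_full))
S = pd.DataFrame(all_records)   # columns: SupporterID, Email, Type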
I am trying to extract some YouTube videos with their details, and when I created a dataframe from my dictionary, I faced this error. Can someone help me?
def youtube_search(q, max_results=50, order="relevance", token=None, location=None, location_radius=None):
    search_response = youtube.search().list(
        q=q,
        type="video",
        pageToken=token,
        order = order,
        part="id,snippet", # Part signifies the different types of data you want
        maxResults=max_results,
        location=location,
        locationRadius=location_radius).execute()
    title = []
    channelId = []
    channelTitle = []
    categoryId = []
    videoId = []
    viewCount = []
    likeCount = []
    dislikeCount = []
    commentCount = []
    category = []
    tags = []
    videos = []
    for search_result in search_response.get("items", []):
        if search_result["id"]["kind"] == "youtube#video":
            title.append(search_result['snippet']['title'])
            videoId.append(search_result['id']['videoId'])
            response = youtube.videos().list(
                part='statistics, snippet',
                id=search_result['id']['videoId']).execute()
            channelId.append(response['items'][0]['snippet']['channelId'])
            channelTitle.append(response['items'][0]['snippet']['channelTitle'])
            categoryId.append(response['items'][0]['snippet']['categoryId'])
            viewCount.append(response['items'][0]['statistics']['viewCount'])
            likeCount.append(response['items'][0]['statistics']['likeCount'])
            dislikeCount.append(response['items'][0]['statistics']['dislikeCount'])
            if 'commentCount' in response['items'][0]['statistics'].keys():
                commentCount.append(response['items'][0]['statistics']['commentCount'])
            else:
                commentCount.append([])
            if 'tags' in response['items'][0]['snippet'].keys():
                tags.append(response['items'][0]['snippet']['tags'])
            else:
                tags.append([])
            # Not every video has likes/dislikes enabled so they won't appear in JSON response
            try:
                likeCount.append(response['items'][0]['statistics']['likeCount'])
            except:
                # Good to be aware of Channels that turn off their Likes
                print("Video titled {0}, on Channel {1} Likes Count is not available".format(stats['items'][0]['snippet']['title'],
                      stats['items'][0]['snippet']['channelTitle']))
                print(response['items'][0]['statistics'].keys())
                # Appends "Not Available" to keep dictionary values aligned
                likeCount.append("Not available")
            try:
                dislikeCount.append(response['items'][0]['statistics']['dislikeCount'])
            except:
                # Good to be aware of Channels that turn off their Likes
                print("Video titled {0}, on Channel {1} Dislikes Count is not available".format(stats['items'][0]['snippet']['title'],
                      stats['items'][0]['snippet']['channelTitle']))
                print(response['items'][0]['statistics'].keys())
                dislikeCount.append("Not available")
    #youtube_dict = {'tags':tags,'channelId': channelId,'channelTitle': channelTitle,'categoryId':categoryId,'title':title,'videoId':videoId,'viewCount':viewCount,'likeCount':likeCount,'dislikeCount':dislikeCount,'commentCount':commentCount,'favoriteCount':favoriteCount}
    youtube_dict = {'tags':tags, 'channelTitle': channelTitle,
                    'title':title, 'videoId':videoId, 'viewCount':viewCount,
                    'likeCount':likeCount, 'dislikeCount':dislikeCount, 'commentCount':commentCount}
    return youtube_dict

q = "covid19 vaccine"
test = youtube_search(q, max_results=100, order="relevance", token=None, location=None, location_radius=None)

import pandas as pd
df = pd.DataFrame(data=test)
df.head()
This gives ValueError: arrays must all be same length. I tried to add
df = pd.DataFrame.from_dict(data=test, orient='index')
but it doesn't work either; I faced another error:
TypeError: __init__() got an unexpected keyword argument 'orient'
Any help would be much appreciated.
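One way to see what is going on, and to sidestep the unequal-lengths error while debugging, is to wrap each list in a pd.Series, which pads the shorter columns with NaN instead of raising. A rough sketch, assuming test is the dictionary returned by youtube_search above:

import pandas as pd

# Print each list's length to spot the column(s) that received extra appends.
for key, values in test.items():
    print(key, len(values))

# Building the frame from Series pads shorter columns with NaN instead of raising ValueError.
df = pd.DataFrame({key: pd.Series(values) for key, values in test.items()})
df.head()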
I'm trying to parse a JSON of a site's stock.
The JSON: https://www.ssense.com/en-us/men/sneakers.json
So I want to take some keywords from the user. Then I want to parse the JSON using these keywords to find the name of the item and (in this specific case) return the ID, SKU and the URL.
So for example:
If I input "Black Fennec", I want to parse the JSON and find the ID, SKU, and URL of the Black Fennec Sneakers (which have an ID of 3297299, a SKU of 191422M237006, and a URL of /men/product/ps-paul-smith/black-fennec-sneakers/3297299).
I have never attempted doing anything like this. Based on some guides that show how to parse JSON, I started out with this:
r = requests.Session()
stock = r.get("https://www.ssense.com/en-us/men/sneakers.json",headers = headers)
obj_json_data = json.loads(stock.text)
However, I am now confused. How do I find the product based on the keywords, and how do I get the ID, URL, and SKU for it?
There are a number of ways to handle the output; not sure what you want to do with it, but this should get you going.
EDIT 1:
import requests
r = requests.Session()
obj_json_data = r.get("https://www.ssense.com/en-us/men/sneakers.json").json()
products = obj_json_data['products']
keyword = input('Enter a keyword: ')
for product in products:
    if keyword.upper() in product['name'].upper():
        name = product['name']
        id_var = product['id']
        sku = product['sku']
        url = product['url']
        print ('Product: %s\nID: %s\nSKU: %s\nURL: %s' %(name, id_var, sku, url))
        # if you only want to return the first match, uncomment next line
        #break
I also have it set up to store the results in a dataframe and/or a list too, just to give some options of where to go with it.
import requests
import pandas as pd
r = requests.Session()
obj_json_data = r.get("https://www.ssense.com/en-us/men/sneakers.json").json()
products = obj_json_data['products']
keyword = input('Enter a keyword: ')
products_found = []
results = pd.DataFrame()
for product in products:
    if keyword.upper() in product['name'].upper():
        name = product['name']
        id_var = product['id']
        sku = product['sku']
        url = product['url']
        temp_df = pd.DataFrame([[name, id_var, sku, url]], columns=['name','id','sku','url'])
        results = results.append(temp_df)
        products_found.append(name)   # list.append mutates in place; don't reassign its None return value
        print ('Product: %s\nID: %s\nSKU: %s\nURL: %s' %(name, id_var, sku, url))
if products_found == []:
    print ('Nothing found')
EDIT 2: Here is another way to do it by converting the json to a dataframe, then filtering for the rows that have the keyword in the name (this is actually a better solution, in my opinion):
import requests
import pandas as pd
from pandas.io.json import json_normalize
r = requests.Session()
obj_json_data = r.get("https://www.ssense.com/en-us/men/sneakers.json").json()
products = obj_json_data['products']
products_df = json_normalize(products)
keyword = input('Enter a keyword: ')
products_found = []
results = pd.DataFrame()
results = products_df[products_df['name'].str.contains(keyword, case = False)]
#print (results[['name', 'id', 'sku', 'url']])
products_found = list(results['name'])
if products_found == []:
    print ('Nothing found')
else:
    print ('Found: ' + str(products_found))
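One caveat if you're on a newer pandas (1.0 and later): json_normalize has moved to the top-level namespace, and DataFrame.append (used in the loop above) was later deprecated and removed in favor of pd.concat. Roughly:

import pandas as pd

# replaces `from pandas.io.json import json_normalize` and `json_normalize(products)`
products_df = pd.json_normalize(products)

# replaces `results = results.append(temp_df)` inside the keyword-matching loop
results = pd.concat([results, temp_df], ignore_index=True)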
I am working with the Twitter Search API which returns a dictionary of dictionaries. My goal is to create a dataframe from a list of keys in the response dictionary.
Example of API response here: Example Response
I have a list of keys within the Statuses dictionary
keys = ["created_at", "text", "in_reply_to_screen_name", "source"]
I would like to loop through each key value returned in the Statuses dictionary and put them in a dataframe with the keys as the columns.
Currently I have code to loop through a single key individually, assign it to a list, then append it to a dataframe, but I want a way to do more than one key at a time. Current code below:
#w is the word to be queried
w = 'keyword'

#count of tweets to return
count = 1000

#API call
query = twitter.search.tweets(q = w, count = count)

def data_l2 (q, k1, k2):
    data = []
    for results in q[k1]:
        data.append(results[k2])
    return(data)

screen_names = data_l3(query, "statuses", "user", "screen_name")

data = {'screen_names':screen_names,
        'tweets':tweets}

frame = pd.DataFrame(data)
frame
I will share a more generic solution that I came up with, as I was working with the Twitter API. Let's say you have the IDs of the tweets that you want to fetch in a list called my_ids:
# Fetch tweets from the twitter API using the following loop:
list_of_tweets = []
# Tweets that can't be found are saved in the list below:
cant_find_tweets_for_those_ids = []
for each_id in my_ids:
    try:
        list_of_tweets.append(api.get_status(each_id))
    except Exception as e:
        cant_find_tweets_for_those_ids.append(each_id)
Then in this code block we isolate the json part of each tweepy status object that we have downloaded and we add them all into a list....
my_list_of_dicts = []
for each_json_tweet in list_of_tweets:
    my_list_of_dicts.append(each_json_tweet._json)
...and we write this list into a txt file:
with open('tweet_json.txt', 'w') as file:
    file.write(json.dumps(my_list_of_dicts, indent=4))
Now we are going to create a DataFrame from the tweet_json.txt file (I have added some keys that were relevant to my use case that I was working on, but you can add your specific keys instead):
my_demo_list = []
with open('tweet_json.txt', encoding='utf-8') as json_file:
    all_data = json.load(json_file)
    for each_dictionary in all_data:
        tweet_id = each_dictionary['id']
        whole_tweet = each_dictionary['text']
        only_url = whole_tweet[whole_tweet.find('https'):]
        favorite_count = each_dictionary['favorite_count']
        retweet_count = each_dictionary['retweet_count']
        created_at = each_dictionary['created_at']
        whole_source = each_dictionary['source']
        only_device = whole_source[whole_source.find('rel="nofollow">') + 15:-4]
        source = only_device
        retweeted_status = each_dictionary['retweeted_status'] = each_dictionary.get('retweeted_status', 'Original tweet')
        if retweeted_status == 'Original tweet':
            url = only_url
        else:
            retweeted_status = 'This is a retweet'
            url = 'This is a retweet'
        my_demo_list.append({'tweet_id': str(tweet_id),
                             'favorite_count': int(favorite_count),
                             'retweet_count': int(retweet_count),
                             'url': url,
                             'created_at': created_at,
                             'source': source,
                             'retweeted_status': retweeted_status,
                             })
tweet_json = pd.DataFrame(my_demo_list, columns = ['tweet_id', 'favorite_count',
                                                   'retweet_count', 'created_at',
                                                   'source', 'retweeted_status', 'url'])
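Going back to the keys list in the original question, a more direct variant is to build one row dictionary per status with a comprehension and hand the list straight to pandas. A sketch, assuming query is the result of twitter.search.tweets as in the question:

import pandas as pd

keys = ["created_at", "text", "in_reply_to_screen_name", "source"]

# One dict per tweet, restricted to the wanted keys (missing keys become None).
rows = [{k: status.get(k) for k in keys} for status in query["statuses"]]
frame = pd.DataFrame(rows, columns=keys)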
I'm using YouTube API with Python. I can already gather all the comments of a specific video, including the name of the authors, the date and the content of the comments.
I can also, with a separate piece of code, extract the personal information (age, gender, interests, ...) of a specific author.
But I cannot use them in one place, i.e. I need to gather all the comments of a video, with the names of the comments' authors, together with the personal information of all those authors.
Below is the code that I developed, but I get a 'RequestError' which I don't know how to handle, and I don't know where the problem is.
import gdata.youtube
import gdata.youtube.service

yt_service = gdata.youtube.service.YouTubeService()

f = open('test1.csv','w')
f.writelines(['UserName',',','Age',',','Date',',','Comment','\n'])

def GetAndPrintVideoFeed(string1):
    yt_service = gdata.youtube.service.YouTubeService()
    user_entry = yt_service.GetYouTubeUserEntry(username = string1)
    X = PrintentryEntry(user_entry)
    return X

def PrintentryEntry(entry):
    # print required fields where we know there will be information
    Y = entry.age.text
    return Y

def GetComment(next1):
    yt_service = gdata.youtube.service.YouTubeService()
    nextPageFeed = yt_service.GetYouTubeVideoCommentFeed(next1)
    for comment_entry in nextPageFeed.entry:
        string1 = comment_entry.author[0].name.text.split("/")[-1]
        Z = GetAndPrintVideoFeed(string1)
        string2 = comment_entry.updated.text.split("/")[-1]
        string3 = comment_entry.content.text.split("/")[-1]
        f.writelines([str(string1),',',Z,',',string2,',',string3,'\n'])
    next2 = nextPageFeed.GetNextLink().href
    GetComment(next2)

video_id = '8wxOVn99FTE'
comment_feed = yt_service.GetYouTubeVideoCommentFeed(video_id=video_id)

for comment_entry in comment_feed.entry:
    string1 = comment_entry.author[0].name.text.split("/")[-1]
    Z = GetAndPrintVideoFeed(string1)
    string2 = comment_entry.updated.text.split("/")[-1]
    string3 = comment_entry.content.text.split("/")[-1]
    f.writelines([str(string1),',',Z,',',string2,',',string3,'\n'])

next1 = comment_feed.GetNextLink().href
GetComment(next1)
I think you need a better understanding of the YouTube API and how everything relates together. I've written wrapper classes which can handle multiple types of Feeds or Entries and "fix" gdata's inconsistent parameter conventions.
Here are some snippets showing how the scraping/crawling can be generalized without too much difficulty.
I know this isn't directly answering your question, and it's more high-level design, but it's worth thinking about if you're going to be doing a large amount of YouTube/gdata data pulling.
def get_feed(thing=None, feed_type=api.GetYouTubeUserFeed):
    if feed_type == 'user':
        feed = api.GetYouTubeUserFeed(username=thing)
    if feed_type == 'related':
        feed = api.GetYouTubeRelatedFeed(video_id=thing)
    if feed_type == 'comments':
        feed = api.GetYouTubeVideoCommentFeed(video_id=thing)
    feeds = []
    entries = []
    while feed:
        feeds.append(feed)
        feed = api.GetNext(feed)
    [entries.extend(f.entry) for f in feeds]
    return entries
...
def myget(url, service=None):
    def myconverter(x):
        logfile = url.replace('/',':')+'.log'
        logfile = logfile[len('http://gdata.youtube.com/feeds/api/'):]
        my_logger.info("myget: %s" % url)
        if service == 'user_feed':
            return gdata.youtube.YouTubeUserFeedFromString(x)
        if service == 'comment_feed':
            return gdata.youtube.YouTubeVideoCommentFeedFromString(x)
        if service == 'comment_entry':
            return gdata.youtube.YouTubeVideoCommentEntryFromString(x)
        if service == 'video_feed':
            return gdata.youtube.YouTubeVideoFeedFromString(x)
        if service == 'video_entry':
            return gdata.youtube.YouTubeVideoEntryFromString(x)
    return api.GetWithRetries(url,
                              converter=myconverter,
                              num_retries=3,
                              delay=2,
                              backoff=5,
                              logger=my_logger)
mapper={}
mapper[api.GetYouTubeUserFeed]='user_feed'
mapper[api.GetYouTubeVideoFeed]='video_feed'
mapper[api.GetYouTubeVideoCommentFeed]='comment_feed'
https://gist.github.com/2303769 data/service.py (routing)
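For completeness, a minimal usage sketch of the get_feed helper above (assuming api is an authenticated gdata.youtube.service.YouTubeService instance, and reusing the video ID from the question):

# Pull every page of comments for the video; get_feed follows the "next" links itself.
comment_entries = get_feed(thing='8wxOVn99FTE', feed_type='comments')
for entry in comment_entries:
    author = entry.author[0].name.text
    text = entry.content.text
    # combine with the per-author lookups from the question as needed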