Python Praw ways to store data for calling later? - python

Is a dictionary the correct way to be doing this? Ideally this will be more than 5 levels deep. Sorry, my only language experience is PowerShell, where I would just make an array of objects. I'm not looking for someone to write the code; I just want to know if there is a better way.
Thanks
Cody
My PowerShell way:
[$title1,$title2,$title3]
$titleX.comment = "comment here"
$titleX.comment.author = "bob"
$titleX.comment.author.karma = "200"
$titleX.comment.reply = "Hey Bob love your comment."
$titleX.comment.reply.author = "Alex"
$titleX.comment.reply.reply = "I disagree"
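For comparison, a rough Python equivalent of that PowerShell sketch using nested dictionaries (the names and values are just the placeholders from the example above):
titles = {
    "titleX": {
        "comment": {
            "body": "comment here",
            "author": {"name": "bob", "karma": 200},
            "reply": {
                "body": "Hey Bob love your comment.",
                "author": {"name": "Alex"},
                "reply": {"body": "I disagree"},
            },
        },
    },
}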
Python code (broken):
import praw

d = {}
reddit = praw.Reddit(client_id='XXXX',
                     client_secret='XXXX',
                     user_agent='android:com.example.myredditapp:'
                                'v1.2.3 (by /u/XXX)')

for submission in reddit.subreddit('redditdev').hot(limit=2):
    d[submission.id] = {}
    d[submission.id]['comment'] = {}
    d[submission.id]['title'] = {}
    d[submission.id]['comment']['author'] = {}
    d[submission.id]['title'] = submission.title
    mySubmission = reddit.submission(id=submission.id)
    mySubmission.comments.replace_more(limit=0)
    for comment in mySubmission.comments.list():
        d[submission.id]['comment'] = comment.body
        d[submission.id]['comment']['author'] = comment.author.name
        print(submission.title)
        print(comment.body)
        print(comment.author.name)

print(d)
File "C:/git/tensorflow/Reddit/pull.py", line 23, in <module>
d[submission.id]['comment']['author'] = comment.author.name
TypeError: 'str' object does not support item assignment
{'6xg24v': {'comment': 'Locking this version. Please comment on the [original post](https://www.reddit.com/r/changelog/comments/6xfyfg/an_update_on_the_state_of_the_redditreddit_and/)!', 'title': 'An update on the state of the reddit/reddit and reddit/reddit-mobile repositories'}}

I think your approach using a dictionary is okay, but you might also solve this with a dedicated data structure for your posts. Instead of writing
d[submission.id] = {}
d[submission.id]['comment'] = {}
d[submission.id]['title']= {}
d[submission.id]['comment']['author']={}
d[submission.id]['title'] = submission.title
you could create a class Submission like this:
class Submission(object):
    def __init__(self, id, author, title, content):
        self.id = id
        self.author = author
        self.title = title
        self.content = content
        self.subSubmissions = {}

    def addSubSubmission(self, submission):
        self.subSubmissions[submission.id] = submission

    def getSubSubmission(self, id):
        return self.subSubmissions[id]
Using it, you could change your code to this:
submissions = {}
for sm in reddit.subreddit('redditdev').hot(limit=2):
    # selftext is the submission body in PRAW (there is no `content` attribute)
    submissions[sm.id] = Submission(sm.id, sm.author, sm.title, sm.selftext)
    # I am not quite sure what these lines are supposed to do, so you might be able to improve these, too
    mySubmission = reddit.submission(id=sm.id)
    mySubmission.comments.replace_more(limit=0)
    for cmt in mySubmission.comments.list():
        # comments have no title, so pass None for that slot
        submissions[sm.id].addSubSubmission(Submission(cmt.id, cmt.author, None, cmt.body))
With this approach you can also extract the code that reads the comments/subSubmissions into a separate function that calls itself recursively, so that you can read comments to arbitrary depth, as sketched below.
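A minimal sketch of that recursive readout, assuming the Submission class above; in PRAW each comment exposes its children via comment.replies:
def add_comment_tree(parent, comments):
    for cmt in comments:
        node = Submission(cmt.id, cmt.author, None, cmt.body)  # comments have no title
        parent.addSubSubmission(node)
        add_comment_tree(node, cmt.replies)  # recurse into this comment's replies

# usage, after replace_more(limit=0):
# add_comment_tree(submissions[sm.id], mySubmission.comments)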

Related

How to speed up writing in a database?

I have a function which searches for JSON files in a directory, parses each file and writes the data to a database. My problem is writing to the database, because it takes around 30 minutes. Any idea how I can speed up writing to the database? I have a few quite big files to parse, but parsing a file is not a problem; it takes around 3 minutes. Currently I am using SQLite, but in the future I will change it to PostgreSQL.
Here is my function:
def create_database():
    with transaction.atomic():
        directory = os.fsencode('data/web_files/unzip')
        for file in os.listdir(directory):
            filename = os.fsdecode(file)
            with open('data/web_files/unzip/{}'.format(filename.strip()), encoding="utf8") as f:
                data = json.load(f)
                cve_items = data['CVE_Items']
                for i in range(len(cve_items)):
                    database_object = DataNist()
                    try:
                        impact = cve_items[i]['impact']['baseMetricV2']
                        database_object.severity = impact['severity']
                        database_object.exp_score = impact['exploitabilityScore']
                        database_object.impact_score = impact['impactScore']
                        database_object.cvss_score = impact['cvssV2']['baseScore']
                    except KeyError:
                        database_object.severity = ''
                        database_object.exp_score = ''
                        database_object.impact_score = ''
                        database_object.cvss_score = ''
                    for vendor_data in cve_items[i]['cve']['affects']['vendor']['vendor_data']:
                        database_object.vendor_name = vendor_data['vendor_name']
                        for description_data in cve_items[i]['cve']['description']['description_data']:
                            database_object.description = description_data['value']
                        for product_data in vendor_data['product']['product_data']:
                            database_object.product_name = product_data['product_name']
                            database_object.save()
                            for version_data in product_data['version']['version_data']:
                                if version_data['version_value'] != '-':
                                    database_object.versions_set.create(version=version_data['version_value'])
My models.py:
class DataNist(models.Model):
    vendor_name = models.CharField(max_length=100)
    product_name = models.CharField(max_length=100)
    description = models.TextField()
    date = models.DateTimeField(default=timezone.now)
    severity = models.CharField(max_length=10)
    exp_score = models.IntegerField()
    impact_score = models.IntegerField()
    cvss_score = models.IntegerField()

    def __str__(self):
        return self.vendor_name + "-" + self.product_name


class Versions(models.Model):
    data = models.ForeignKey(DataNist, on_delete=models.CASCADE)
    version = models.CharField(max_length=50)

    def __str__(self):
        return self.version
I would appreciate any advice on how I can improve my code.
Okay, given the structure of the data, something like this might work for you.
This is standalone code aside from that .objects.bulk_create() call; as commented in the code, the two classes defined would actually be models within your Django app.
(By the way, you probably want to save the CVE ID as a unique field too.)
Your original code assumed that every "leaf entry" in the affected version data would have the same vendor, which may not be true. That's why the model structure here has a separate product-version model with vendor, product and version fields. (If you wanted to optimize things a little, you might deduplicate the AffectedProductVersions even across DataNists (which, as an aside, is not a perfect name for a model).)
And of course, as you had already done in your original code, the importing should be run within a transaction (transaction.atomic()); see the sketch after the code below.
Hope this helps.
import json
import os
import types


class DataNist(types.SimpleNamespace):  # this would actually be a model
    severity = ""
    exp_score = ""
    impact_score = ""
    cvss_score = ""

    def save(self):
        pass


class AffectedProductVersion(types.SimpleNamespace):  # this too
    # (foreign key to DataNist here)
    vendor_name = ""
    product_name = ""
    version_value = ""


def import_item(item):
    database_object = DataNist()
    try:
        impact = item["impact"]["baseMetricV2"]
    except KeyError:  # no impact object available
        pass
    else:
        database_object.severity = impact.get("severity", "")
        database_object.exp_score = impact.get("exploitabilityScore", "")
        database_object.impact_score = impact.get("impactScore", "")
        if "cvssV2" in impact:
            database_object.cvss_score = impact["cvssV2"]["baseScore"]
    for description_data in item["cve"]["description"]["description_data"]:
        database_object.description = description_data["value"]
        break  # only grab the first description
    database_object.save()  # save the base object
    affected_versions = []
    for vendor_data in item["cve"]["affects"]["vendor"]["vendor_data"]:
        for product_data in vendor_data["product"]["product_data"]:
            for version_data in product_data["version"]["version_data"]:
                affected_versions.append(
                    AffectedProductVersion(
                        data_nist=database_object,
                        vendor_name=vendor_data["vendor_name"],
                        product_name=product_data["product_name"],
                        version_value=version_data["version_value"],
                    )
                )
    AffectedProductVersion.objects.bulk_create(
        affected_versions
    )  # save all the version information
    return database_object  # in case the caller needs it


with open("nvdcve-1.0-2019.json") as infp:
    data = json.load(infp)

for item in data["CVE_Items"]:
    import_item(item)
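And a minimal sketch of running that import inside a single transaction, as mentioned above (this assumes the code lives inside your Django project, with import_item and the real models available; import_file is a hypothetical helper name):
from django.db import transaction

def import_file(path):
    with open(path) as infp:
        data = json.load(infp)
    with transaction.atomic():  # one transaction for the whole file
        for item in data["CVE_Items"]:
            import_item(item)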

Getting wrong result from JSON - Python 3

I'm working on a small project of retrieving information about books from the Google Books API using Python 3. For this I make a call to the API, read out the variables and store them in a list. For a search like "linkedin" this works perfectly. However, when I enter "Google", it reads the second title from the JSON input. How can this happen?
Please find my code below (Google_Results is the class I use to initialize the variables):
import requests


def Book_Search(search_term):
    parms = {"q": search_term, "maxResults": 3}
    r = requests.get(url="https://www.googleapis.com/books/v1/volumes", params=parms)
    print(r.url)
    results = r.json()

    i = 0
    for result in results["items"]:
        try:
            isbn13 = str(result["volumeInfo"]["industryIdentifiers"][0]["identifier"])
            isbn10 = str(result["volumeInfo"]["industryIdentifiers"][1]["identifier"])
            title = str(result["volumeInfo"]["title"])
            author = str(result["volumeInfo"]["authors"])[2:-2]
            publisher = str(result["volumeInfo"]["publisher"])
            published_date = str(result["volumeInfo"]["publishedDate"])
            description = str(result["volumeInfo"]["description"])
            pages = str(result["volumeInfo"]["pageCount"])
            genre = str(result["volumeInfo"]["categories"])[2:-2]
            language = str(result["volumeInfo"]["language"])
            image_link = str(result["volumeInfo"]["imageLinks"]["thumbnail"])
            dict = Google_Results(isbn13, isbn10, title, author, publisher, published_date, description, pages, genre,
                                  language, image_link)
            gr.append(dict)
            print(gr[i].title)
            i += 1
        except:
            pass
    return


gr = []
Book_Search("Linkedin")
gr = []
Book_Search("Linkedin")
I am a beginner to Python, so any help would be appreciated!
It does so because there is no publisher entry in volumeInfo of the first entry, thus it raises a KeyError and your except captures it. If you're going to work with fuzzy data you have to account for the fact that it will not always have the expected structure. For simple cases you can rely on dict.get() and its default argument to return a 'valid' default entry if an entry is missing.
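For example, a tiny illustration of dict.get() with a default (hypothetical data):
volume_info = {"title": "Some Book"}       # no "publisher" key
volume_info.get("publisher", "Unknown")    # returns "Unknown" instead of raising KeyError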
Also, there are a few conceptual problems with your function: it relies on a global gr, which is bad design; it shadows the built-in dict type; and it captures all exceptions, guaranteeing that you cannot exit your code even with a SIGINT... I'd suggest converting it to something a bit more sane:
def book_search(search_term, max_results=3):
    results = []  # a list to store the results
    parms = {"q": search_term, "maxResults": max_results}
    r = requests.get(url="https://www.googleapis.com/books/v1/volumes", params=parms)
    try:  # just in case the server doesn't return valid JSON
        for result in r.json().get("items", []):
            if "volumeInfo" not in result:  # invalid entry - missing volumeInfo
                continue
            result_dict = {}  # a dictionary to store our discovered fields
            result = result["volumeInfo"]  # all the data we're interested in is in volumeInfo
            isbns = result.get("industryIdentifiers", None)  # capture ISBNs
            if isinstance(isbns, list) and isbns:
                for i, t in enumerate(("isbn10", "isbn13")):
                    if len(isbns) > i and isinstance(isbns[i], dict):
                        result_dict[t] = isbns[i].get("identifier", None)
            result_dict["title"] = result.get("title", None)
            authors = result.get("authors", None)  # capture authors
            if isinstance(authors, list) and len(authors) > 2:  # you're slicing from 2
                result_dict["author"] = str(authors[2:-2])
            result_dict["publisher"] = result.get("publisher", None)
            result_dict["published_date"] = result.get("publishedDate", None)
            result_dict["description"] = result.get("description", None)
            result_dict["pages"] = result.get("pageCount", None)
            genres = result.get("categories", None)  # capture genres
            if isinstance(genres, list) and len(genres) > 2:  # since you're slicing from 2
                result_dict["genre"] = str(genres[2:-2])
            result_dict["language"] = result.get("language", None)
            result_dict["image_link"] = result.get("imageLinks", {}).get("thumbnail", None)
            # make sure Google_Results accepts keyword arguments like title, author...
            # and make them optional as they might not be in the returned result
            gr = Google_Results(**result_dict)
            results.append(gr)  # add it to the results list
    except ValueError:
        return None  # invalid response returned, you may raise an error instead
    return results  # return the results
Then you can easily retrieve as much info as possible for a term:
gr = book_search("Google")
And it will be far more tolerant of data omissions, provided that your Google_Results type makes most of the entries optional.
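For illustration, a hypothetical Google_Results that accepts optional keyword arguments, as the comments in the code above assume (your actual class is not shown in the question, so adjust this to match it):
class Google_Results:
    def __init__(self, isbn13=None, isbn10=None, title=None, author=None,
                 publisher=None, published_date=None, description=None,
                 pages=None, genre=None, language=None, image_link=None):
        self.isbn13 = isbn13
        self.isbn10 = isbn10
        self.title = title
        self.author = author
        self.publisher = publisher
        self.published_date = published_date
        self.description = description
        self.pages = pages
        self.genre = genre
        self.language = language
        self.image_link = image_link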
Following @Coldspeed's recommendation, it became clear that missing information in the JSON response caused an exception to be raised. Since I only had a "pass" statement there, it skipped the entire result. Therefore I will have to adapt the "try and except" statements so errors get handled properly.
Thanks for the help guys!

Why can't I pickle this list?

The purpose of this form is to let users enter a lot of places (comma separated) and have it retrieve the phone, name and website for each. I have it working in a Python IDE with no problem, but I'm having issues putting it into my web app.
I'm getting the error Exception Value: Can't pickle local object 'GetNums.<locals>.get_data' at the line where a is assigned. I checked the type of inputText and verified that it is indeed a list, so I'm not sure why it won't pickle.
def GetNums(request):
    form = GetNumsForm(request.POST or None)
    if form.is_valid():
        inputText = form.cleaned_data.get('getnums')
        # all experimental
        inputText = inputText.split(',')

        def get_data(i):
            # DON'T FORGET TO MOVE THE PRIMARY KEY LATER TO SETTINGS
            r1 = requests.get('https://maps.googleapis.com/maps/api/place/textsearch/json?query=' + i + '&key=GET_YOUR_OWN')
            a = r1.json()
            pid = a['results'][0]['place_id']
            r2 = requests.get('https://maps.googleapis.com/maps/api/place/details/json?placeid=' + pid + '&key=GET_YOUR_OWN')
            b = r2.json()
            phone = b['result']['formatted_phone_number']
            name = b['result']['name']
            try:
                website = b['result']['website']
            except:
                website = 'No website found'
            return ' '.join((phone, name, website))

        v = str(type(inputText))
        with Pool(5) as p:
            a = (p.map(get_data, inputText))
        # for line in p.map(get_data, inputText):
        #     print(line)
        # code assist by http://stackoverflow.com/a/34512870/5037442
        # end experimental
    return render(request, 'about.html', {'v': a})
It's actually barfing when trying to pickle get_data, which is a nested function/closure.
Move get_data out of GetNums (and, agh, rename it to snake_case, please) and it should work; a sketch is below.
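A minimal sketch of that fix, assuming the same view module, requests, multiprocessing.Pool and the GetNumsForm from the question (the snake_case names and the params-style requests calls are my own):
from multiprocessing import Pool

import requests
from django.shortcuts import render


def get_data(place):
    # Module-level function: multiprocessing can pickle it by its qualified name.
    r1 = requests.get('https://maps.googleapis.com/maps/api/place/textsearch/json',
                      params={'query': place, 'key': 'GET_YOUR_OWN'})
    pid = r1.json()['results'][0]['place_id']
    r2 = requests.get('https://maps.googleapis.com/maps/api/place/details/json',
                      params={'placeid': pid, 'key': 'GET_YOUR_OWN'})
    result = r2.json()['result']
    return ' '.join((result.get('formatted_phone_number', ''),
                     result.get('name', ''),
                     result.get('website', 'No website found')))


def get_nums(request):
    answer = []
    form = GetNumsForm(request.POST or None)
    if form.is_valid():
        places = form.cleaned_data.get('getnums').split(',')
        with Pool(5) as p:  # workers can now pickle and import get_data
            answer = p.map(get_data, places)
    return render(request, 'about.html', {'v': answer})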

Wikipedia Infobox parser with Multi-Language Support

I am trying to develop an Infobox parser in Python which supports all the languages of Wikipedia. The parser will get the infobox data and will return the data in a Dictionary.
The keys of the Dictionary will be the property which is described (e.g. Population, City name, etc...).
The problem is that Wikipedia has slightly different page contents for each language. But the most important thing is that the API response structure for each language can also be different.
For example, the API response for 'Paris' in English contains this Infobox:
{{Infobox French commune |name = Paris |commune status = [[Communes of France|Commune]] and [[Departments of France|department]] |image = <imagemap> File:Paris montage.jpg|275px|alt=Paris montage
and in Greek, the corresponding part for 'Παρίσι' is:
[...] {{Πόλη (Γαλλία) | Πόλη = Παρίσι | Έμβλημα =Blason paris 75.svg | Σημαία =Mairie De Paris (SVG).svg | Πλάτος Σημαίας =120px | Εικόνα =Paris - Eiffelturm und Marsfeld2.jpg [...]
In the second example, there isn't any 'Infobox' occurrence after the {{. Also, in the API response the name = Paris is not the exact translation for Πόλη = Παρίσι. (Πόλη means city, not name)
Because of such differences between the responses, my code fails.
Here is the code:
class WikipediaInfobox():
    # Class to get and parse the Wikipedia Infobox Data
    infoboxArrayUnprocessed = []  # Maintains the order which the data is displayed.
    infoboxDictUnprocessed = {}   # Still contains brackets and wikitext coding. Will be processed more later...
    language = "en"

    def getInfoboxDict(self, infoboxRaw):  # Get the Infobox in Dict and Array form (Unprocessed)
        if infoboxRaw.strip() == "":
            return {}
        boxLines = [line.strip().replace(" ", " ") for line in infoboxRaw.splitlines()]
        wikiObjectType = boxLines[0]
        infoboxData = [line[1:] for line in boxLines[1:]]
        toReturn = {"wiki_type": wikiObjectType}
        for i in infoboxData:
            key = i.split("=")[0].strip()
            value = ""
            if i.strip() != key + "=":
                value = i.split("=")[1].strip()
            self.infoboxArrayUnprocessed.append({key: value})
            toReturn[key] = value
        self.infoboxDictUnprocessed = toReturn
        return toReturn

    def getInfoboxRaw(self, pageTitle, followRedirect=False, resetOld=True):  # Get Infobox in Raw Text
        if resetOld:
            infoboxDict = {}
            infoboxDictUnprocessed = {}
            infoboxArray = []
            infoboxArrayUnprocessed = []
        params = {"format": "xml", "action": "query", "prop": "revisions", "rvprop": "timestamp|user|comment|content"}
        params["titles"] = "%s" % urllib.quote(pageTitle.encode("utf8"))
        qs = "&".join("%s=%s" % (k, v) for k, v in params.items())
        url = "http://" + self.language + ".wikipedia.org/w/api.php?%s" % qs
        tree = etree.parse(urllib.urlopen(url))
        revs = tree.xpath('//rev')
        if len(revs) == 0:
            return ""
        if "#REDIRECT" in revs[-1].text and followRedirect == True:
            redirectPage = revs[-1].text[revs[-1].text.find("[[") + 2:revs[-1].text.find("]]")]
            return self.getInfoboxRaw(redirectPage, followRedirect, resetOld)
        elif "#REDIRECT" in revs[-1].text and followRedirect == False:
            return ""
        infoboxRaw = ""
        if "{{Infobox" in revs[-1].text:  # -> No multi-language support:
            infoboxRaw = revs[-1].text.split("{{Infobox")[1].split("}}")[0]
        return infoboxRaw

    def __init__(self, pageTitle="", followRedirect=False):  # Constructor
        if pageTitle != "":
            self.language = guess_language.guessLanguage(pageTitle)
            if self.language == "UNKNOWN":
                self.language = "en"
            infoboxRaw = self.getInfoboxRaw(pageTitle, followRedirect)
            self.getInfoboxDict(infoboxRaw)  # Now the parsed data is in self.infoboxDictUnprocessed
Some parts of this code were found on this blog...
I don't want to reinvent the wheel, so maybe someone has a nice solution for multi-language support and neat parsing of the Infobox section of Wikipedia.
I have seen many alternatives, like DBpedia or some other parsers that MediaWiki recommends, but I haven't found anything that suits my needs yet. I also want to avoid scraping the page with BeautifulSoup, because it can fail in some cases, but if it is necessary it will do.
If something isn't clear enough, please ask. I want to help as much as I can.
Wikidata is definitely the first choice these days if you want structured data. Anyway, if in the future you need to parse data from Wikipedia articles, especially as you are using Python, I can recommend mwparserfromhell, a Python library aimed at parsing wikitext that has an option to extract templates and their attributes. That won't directly fix your issue, as the templates in multiple languages will definitely be different, but it might be useful if you keep trying to parse wikitext.
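For example, a minimal sketch of pulling template parameters with mwparserfromhell (pip install mwparserfromhell); here wikitext stands for the raw page text you already fetch in getInfoboxRaw, and infobox_params is a hypothetical helper name:
import mwparserfromhell

def infobox_params(wikitext):
    parsed = mwparserfromhell.parse(wikitext)
    for template in parsed.filter_templates():
        # No "Infobox" string match, so this also catches e.g. {{Πόλη (Γαλλία) ...}}
        params = {str(p.name).strip(): str(p.value).strip() for p in template.params}
        if params:
            return str(template.name).strip(), params
    return None, {}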

How to gather personal information (age,gender..) of all the authors of the comments on a specific video, with Python YouTube API

I'm using the YouTube API with Python. I can already gather all the comments of a specific video, including the names of the authors, the date and the content of the comments.
I can also, with a separate piece of code, extract the personal information (age, gender, interests, ...) of a specific author.
But I cannot use them in one place, i.e. I need to gather all the comments of a video, with the names of the comments' authors and the personal information of all those authors.
Below is the code that I developed, but I get a 'RequestError' which I don't know how to handle, and I don't know where the problem is.
import gdata.youtube
import gdata.youtube.service

yt_service = gdata.youtube.service.YouTubeService()
f = open('test1.csv', 'w')
f.writelines(['UserName', ',', 'Age', ',', 'Date', ',', 'Comment', '\n'])


def GetAndPrintVideoFeed(string1):
    yt_service = gdata.youtube.service.YouTubeService()
    user_entry = yt_service.GetYouTubeUserEntry(username=string1)
    X = PrintentryEntry(user_entry)
    return X


def PrintentryEntry(entry):
    # print required fields where we know there will be information
    Y = entry.age.text
    return Y


def GetComment(next1):
    yt_service = gdata.youtube.service.YouTubeService()
    nextPageFeed = yt_service.GetYouTubeVideoCommentFeed(next1)
    for comment_entry in nextPageFeed.entry:
        string1 = comment_entry.author[0].name.text.split("/")[-1]
        Z = GetAndPrintVideoFeed(string1)
        string2 = comment_entry.updated.text.split("/")[-1]
        string3 = comment_entry.content.text.split("/")[-1]
        f.writelines([str(string1), ',', Z, ',', string2, ',', string3, '\n'])
    next2 = nextPageFeed.GetNextLink().href
    GetComment(next2)


video_id = '8wxOVn99FTE'
comment_feed = yt_service.GetYouTubeVideoCommentFeed(video_id=video_id)
for comment_entry in comment_feed.entry:
    string1 = comment_entry.author[0].name.text.split("/")[-1]
    Z = GetAndPrintVideoFeed(string1)
    string2 = comment_entry.updated.text.split("/")[-1]
    string3 = comment_entry.content.text.split("/")[-1]
    f.writelines([str(string1), ',', Z, ',', string2, ',', string3, '\n'])

next1 = comment_feed.GetNextLink().href
GetComment(next1)
I think you need a better understanding of the YouTube API and how everything relates together. I've written wrapper classes which can handle multiple types of feeds or entries and "fix" gdata's inconsistent parameter conventions.
Here are some snippets showing how the scraping/crawling can be generalized without too much difficulty; a small usage example follows after the snippets.
I know this isn't directly answering your question; it's more high-level design, but it's worth thinking about if you're going to be doing a large amount of YouTube/gdata data pulling.
def get_feed(thing=None, feed_type=api.GetYouTubeUserFeed):
    if feed_type == 'user':
        feed = api.GetYouTubeUserFeed(username=thing)
    if feed_type == 'related':
        feed = api.GetYouTubeRelatedFeed(video_id=thing)
    if feed_type == 'comments':
        feed = api.GetYouTubeVideoCommentFeed(video_id=thing)
    feeds = []
    entries = []
    while feed:
        feeds.append(feed)
        feed = api.GetNext(feed)
    [entries.extend(f.entry) for f in feeds]
    return entries

...

def myget(url, service=None):
    def myconverter(x):
        logfile = url.replace('/', ':') + '.log'
        logfile = logfile[len('http://gdata.youtube.com/feeds/api/'):]
        my_logger.info("myget: %s" % url)
        if service == 'user_feed':
            return gdata.youtube.YouTubeUserFeedFromString(x)
        if service == 'comment_feed':
            return gdata.youtube.YouTubeVideoCommentFeedFromString(x)
        if service == 'comment_entry':
            return gdata.youtube.YouTubeVideoCommentEntryFromString(x)
        if service == 'video_feed':
            return gdata.youtube.YouTubeVideoFeedFromString(x)
        if service == 'video_entry':
            return gdata.youtube.YouTubeVideoEntryFromString(x)
    return api.GetWithRetries(url,
                              converter=myconverter,
                              num_retries=3,
                              delay=2,
                              backoff=5,
                              logger=my_logger
                              )

mapper = {}
mapper[api.GetYouTubeUserFeed] = 'user_feed'
mapper[api.GetYouTubeVideoFeed] = 'video_feed'
mapper[api.GetYouTubeVideoCommentFeed] = 'comment_feed'
https://gist.github.com/2303769 data/service.py (routing)
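A hypothetical usage of the get_feed() helper above, assuming api is an authenticated gdata.youtube.service.YouTubeService instance and using the video ID from the question:
comment_entries = get_feed('8wxOVn99FTE', feed_type='comments')
for entry in comment_entries:
    author = entry.author[0].name.text
    print(author, entry.content.text)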
