I am using Python and MySQL to query mediawiki database to get the current status of articles (i.e. whether the article is FA, GA, GAN etc.) but have been unable to do so.
I know the current status is stored in the old_text field of the text table. I was trying something like:
loc = select(locate('currentstatus', old_text))
query = "select substring(old_text, %s, 20) from wikidb where page_id = 1234" % loc
but unfortunately locate() returns the first occurrence of 'currentstatus', not the last, which is not very 'current' since the newest/latest status is at the bottom.
I am not sure how to fix it or if I am using the right approach.
For Wikipedia, it would be more to the point to examine the categories the article is in. Or if processing raw wikitext, look for the corresponding template:
Featured articles (FA) are in [[category:Featured articles]] and use {{featured article}}, which references [[template:featured article]]
Good articles (GA) are in [[category:Good articles]] and use {{good article}}, which references [[template:good article]]
Both those categories are hidden, so you would have to enable the preference for displaying hidden categories, or traverse the category contents to see if the article is there.
Other article classes (A, B, C, FL, Start, Stub, List, undefined) are assessed on the corresponding talk page using one or more WikiProject templates. There is no standard.
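If you are querying the wiki's MySQL database directly, as in the question, one way to check the FA/GA case is to look at the categorylinks table instead of parsing old_text. A minimal sketch, assuming a standard MediaWiki schema; pymysql is just one driver choice, and the credentials and page_id are placeholders:
import pymysql

conn = pymysql.connect(host="localhost", user="wiki", password="secret", db="wikidb")
with conn.cursor() as cur:
    cur.execute(
        "SELECT cl_to FROM categorylinks "
        "WHERE cl_from = %s AND cl_to IN ('Featured_articles', 'Good_articles')",
        (1234,),  # the article's page_id
    )
    cats = [c[0].decode() if isinstance(c[0], bytes) else c[0] for c in cur.fetchall()]

if "Featured_articles" in cats:
    print("FA")
elif "Good_articles" in cats:
    print("GA")
else:
    print("no FA/GA category found")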
I'm using MongoDB and want to work with it in Python, which is needed for my project. I wanted to extract only the value of a specific field with PyMongo. In my case, I tried returning the name of a charging station, which is saved in the database as a document with the attributes name, standard, location, charging capacity, and operator.
I only found a website which solved my problem in Mongosh by just using db.products.findOne().collectionname.
For a better understanding of my problem, please visit this website, which describes my problem pretty well: https://database.guide/how-to-return-just-the-value-in-mongodb/
So I naturally tried using this method. But it didn't work for me with Pymongo...
chargers = db.chargers
result = chargers.findOne().name
print(result)
I received an error in the terminal after running the .py file.
So my question is: Is there a method for Pymongo to return only the value of a field in a document?
E.g. the name of a product or in my case a charger.
OK, here is something you can try:
change:
result = chargers.findOne().name
to:
result = chargers.find_one({"name": "the value of the document name"}, {"name": 1})
change the "value of the document name" to the filter, like you have a document which has a field {name: "abc"}, so type abc in place of "the value of the document name".
then do this for clean results:
result = str(result["name"])
print(result)
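Putting it together, a minimal end-to-end sketch; the connection string and the database/collection names are assumptions, so adjust them to your setup:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["charging"]        # assumed database name
chargers = db["chargers"]      # assumed collection name

# Filter on the document you want, project only the "name" field,
# and exclude the automatically returned _id.
doc = chargers.find_one({"name": "abc"}, {"name": 1, "_id": 0})
if doc is not None:
    print(doc["name"])         # just the value, e.g. "abc"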
This question relates primarily to Alteryx; however, if it can be done in Python, or in R within an Alteryx workflow using the R tool, that would work as well.
I have two data sets.
Address (contains address information: Line1, Line2, City, State, Zip)
USPS (contains USPS abbreviations: Street to ST, Boulevard to BLVD, etc.)
Goal: Look at the Line1 string in the Address data set. If it contains one of the street types in the USPS data set, I want to replace that part of the string with its proper abbreviation, which is in a different column of the USPS data set.
Example: 123 Main Street would become 123 Main St.
What I have tried:
Imported the two data sets.
Union the two data sets with the instruction of Output All Fields for When Fields Differ.
Added a formula, but this is where I am getting stuck. So far it reads:
if [Addr1] Contains(String, Target)
Not sure how to have it look in the USPS for one of the values. I am also not certain if this sort of dynamic lookup can take place.
If this can be done in Python (I know very basic Python, so I don't have code for this yet because I do not know where to start other than importing the data), I can use Python within Alteryx.
Any assistance would be great. Please let me know if you need additional information.
Thank you in advance.
Use the Find Replace tool in Alteryx. This tool is akin to a lookup. Also, the Alteryx Community is a good go-to resource for these types of questions.
Input the Address dataset into the top anchor of the Find Replace tool and the USPS dataset into the bottom anchor. You'll want to find any part of the address field using the lookup field and replace it with the abbreviation field. If you need to do this across several fields in the Address dataset, then you could replicate this logic or you could use a Record ID tool, Transpose, run this logic on one field, and then Cross Tab back to the original schema. It's an advanced recipe that you'll want to master in Alteryx.
https://help.alteryx.com/current/FindReplace.htm
The overall logic that can be used is here: "Using str_detect (or some other function) and some way to loop through a list to essentially perform a vlookup".
However, in order to expand to Alteryx, you would need to add the Alteryx R tool. Also, some of the code would need to be changed to use the syntax that Alteryx likes.
read in the data with:
read.Alteryx('#Link Number', mode = 'data.frame')
From there, the question linked above provides the overall framework for the logic, reiterated here:
library(stringr)

## Treat every column of the USPS lookup as character
usps[] = lapply(usps, as.character)

## Copies the original address data to a new column that will
## be altered. Preserves the original formatting for rollback
## if necessary
vendorData$new_addr1 = as.character(vendorData$Addr1)

## Loops through the dictionary replacing all of the common names
## with their USPS approved abbreviations for the Addr1 field.
for(i in 1:nrow(usps)) {
  vendorData$new_addr1 = str_replace_all(
    vendorData$new_addr1,
    pattern = paste0("\\b", usps$Abbreviation[i], "\\b"),
    replacement = usps$USPS_Abbrv_updated[i]
  )
}
Finally, in order to be able to see the output, we would need to write a statement that will output it in one of the 5 output slots the R tool has. Here is the code for that:
write.Alteryx(data, #)
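If you would rather do the replacement in Python (standalone, or inside the Alteryx Python tool, where you would read the incoming connections instead of CSV files), here is a minimal pandas sketch of the same lookup-and-replace idea. The column names Addr1, Name and Abbreviation are assumptions; adjust them to your actual datasets:
import re
import pandas as pd

# Assumed inputs: the Address data with an "Addr1" column, and the USPS
# lookup with "Name" (e.g. "Street") and "Abbreviation" (e.g. "ST") columns.
address_df = pd.read_csv("addresses.csv")
usps_df = pd.read_csv("usps_abbreviations.csv")

def abbreviate(line, lookup):
    for name, abbrv in lookup:
        # \b keeps us from matching inside other words, e.g. "Streeter"
        line = re.sub(r"\b%s\b" % re.escape(name), abbrv, line, flags=re.IGNORECASE)
    return line

lookup = list(zip(usps_df["Name"], usps_df["Abbreviation"]))
address_df["Addr1_clean"] = address_df["Addr1"].apply(lambda s: abbreviate(s, lookup))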
I'm making an application in which a user can create categories to put items in. The items share some basic properties, but the rest are defined by the category they belong to. The problem is that both the category and its special properties are created by the user.
For instance, the user may create two categories: books and buttons. In the 'book' category he may create two properties: number of pages and author. In the buttons category he may create different properties: number of holes and color.
Initially, I placed these properties in a JsonProperty inside the Item. While this works, it means that I query the Datastore just by specifying the category that I am looking for and then I have to filter the results of the query in the code. For example, if I'm looking for all the books whose author is Carl Sagan, I would query the Item class with category == books and then loop through the results to keep only those that match the author.
While I don't really expect to have that many items per category (probably in the hundreds, unlikely to get to one thousand), this looks inefficient. So I tried to use ndb.Expando to make those special properties real properties that are indexed. I did this, adding the corresponding special properties to the item when putting it to the Datastore. So if the user creates an Item in the 'books' category and previously created in that category the special property 'author', an Item is saved with the special property expando_author = author in it. It worked as I expected until this point (dev server).
The real problem though became visible when I did some queries. While they worked in the dev server, they created composite indexes for each special/expando property, even if the query filters were equality only. And while each category can have at most five properties, it is evident that it can easily get out of control.
Example query:
items = Item.query()
for p in properties:
    items = items.filter(ndb.GenericProperty(p) == properties[p])
items.fetch()
Now, since I don't know in advance what the properties will be (though I will limit it to 5), I can't build the indexes before uploading the application, and even if I knew it would probably mean having more indexes that I'm comfortable with. Is Expando the wrong tool for what I'm trying to do? Should I just keep filtering the results in the code using the JsonProperty? I would greatly appreciate any advice I can get.
PS. To make this post shorter I omitted a few details about what I did; if you need to know something I may have left out, just ask in the comments.
Consider storing category's properties in a single list property prefixed with category property name.
Something like this (forgive me, I forgot the exact Python syntax since I switched to Go):
class Item(ndb.Model):
    category = ndb.StringProperty()
    props = ndb.StringProperty(repeated=True)

book = Item(category='book', props=['author:Carl Sagan'])
button = Item(category='button', props=['holes:5'])
Then you can have a single composite index on category + props and do queries like this:
def filter_items(category, propName, propValue):
    return Item.query(Item.category == category,
                      Item.props == propName + ':' + propValue)
And you would need a function on Item to get property values cleaned up from prop names.
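For instance, a small helper could split the prefix back out; this is only a sketch and the props_dict name is invented:
from google.appengine.ext import ndb

class Item(ndb.Model):
    category = ndb.StringProperty()
    props = ndb.StringProperty(repeated=True)

    def props_dict(self):
        # e.g. ['author:Carl Sagan', 'holes:5'] -> {'author': 'Carl Sagan', 'holes': '5'}
        return dict(p.split(':', 1) for p in self.props)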
I have a database with a bunch of regular documents that look something like this (example from wiki):
{
    "_id": "some_doc_id",
    "_rev": "D1C946B7",
    "Subject": "I like Plankton",
    "Author": "Rusty",
    "PostedDate": "2006-08-15T17:30:12-04:00",
    "Tags": ["plankton", "baseball", "decisions"],
    "Body": "I decided today that I don't like baseball. I like plankton."
}
I'm working in Python with couchdb-python and I want to know if it's possible to add a field to each document. For example, if I wanted to have a "Location" field or something like that.
Thanks!
Regarding IDs
Every document in couchdb has an id, whether you set it or not. Once the document is stored you can access it through the doc._id field.
If you want to set your own ids you'll have to assign the id value to doc._id. If you don't set it, then couchdb will assign a uuid.
If you want to update a document, then you need to make sure you have the same id and a valid revision. If say you are working from a blog post and the user adds the Location, then the url of the post may be a good id to use. You'd be able to instantly access the document in this case.
So what's a revision
In your code snippet above you have the doc._rev element. This is the identifier of the revision. If you save a document with an id that already exists, couchdb requires you to prove that you are updating the latest revision of the document and not accidentally overwriting someone else's changes.
So how do I update a document
If you have the id of your document, you can just access each document by using the db.get(id) function. You can then update the document like this:
doc = db.get(id)
doc['Location'] = "On a couch"
db.save(doc)
I have an example where I store weather forecast data and update the forecasts approximately every 2 hours. A separate process adds information that I get from a different provider, which looks at characteristics of tweets on the day.
That looks something like this:
doc = db.get(id)
doc_with_loc = GetLocationInformationFromOtherProvider(doc) # takes about 40 seconds.
doc_with_loc["_rev"] = doc["_rev"]
db.save(doc_with_loc) # This will fail if weather update has also updated the file.
If you have concurrent processes, the _rev can become invalid, so you need a failsafe; e.g. something like this could do:
doc = db.get(id)
doc_with_loc = GetLocationInformationFromAltProvider(doc)
update_outstanding = True
while update_outstanding:
    doc = db.get(id)  # re-retrieve to pick up the latest revision
    doc_with_loc["_rev"] = doc["_rev"]
    try:
        db.save(doc_with_loc)
        update_outstanding = False
    except couchdb.http.ResourceConflict:  # from the couchdb-python package
        pass  # another process saved first; retry with the new _rev
So how do I get the Ids?
One option suggested above is that you actively set the id so you can retrieve it later, i.e. if a user sets a given location that is attached to a URL, use the URL. But you may not know which document you want to update, or you may even want a process that finds all the documents that don't have a location and assigns one.
You'll most likely be using a view for this. Views have a mapper and a reducer; you'll use the first one and can ignore the second here. A view with a mapper does the following:
It returns a simplified/transformed way of looking at your data. You can return multiple values per document or skip some. It gives the data you emit a key, and if you use the include_docs option it will also give you the document (with _id and _rev alongside).
The simplest view is the default db.view('_all_docs'); this returns all documents, and you may not want to update all of them. Note that views themselves are stored as documents once you define them.
The next simple way is to have a view that only returns items of a certain document type. I tend to have a _type="article" field in my documents. Think of this as marking that a document belongs to a certain table, if you had stored them in a relational database.
Finally, you can filter out elements that already have a location, so you'd have a view you can iterate over to find all the docs that still need a location and handle them in a separate process. The best documentation on writing views can be found here.
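For example, here is a minimal sketch of a permanent view you could iterate over from couchdb-python to find and fix documents without a Location; the design document and view names are invented:
# Sketch: a view that emits only documents that have no Location yet.
design_doc = {
    "_id": "_design/maintenance",
    "views": {
        "missing_location": {
            "map": "function(doc) { if (!doc.Location) { emit(doc._id, null); } }"
        }
    }
}
db.save(design_doc)

# Iterate over those docs and fill in the field.
for row in db.view('maintenance/missing_location', include_docs=True):
    doc = row.doc
    doc['Location'] = "On a couch"  # or however you determine it
    db.save(doc)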
I'm developing a simple blogging/bookmarking platform and I'm trying to add a tags-explorer/drill-down feature à la Delicious to allow users to filter the posts by specifying a list of specific tags.
Posts are represented in the datastore with this simplified model:
class Post(db.Model):
    title = db.StringProperty(required = True)
    link = db.LinkProperty(required = True)
    description = db.StringProperty(required = True)
    tags = db.ListProperty(str)
    created = db.DateTimeProperty(required = True, auto_now_add = True)
Post's tags are stored in a ListProperty and, in order to retrieve the list of posts tagged with a specific list of tags, the Post model exposes the following static method:
@staticmethod
def get_posts(limit, offset, tags_filter = []):
    posts = Post.all()
    for tag in tags_filter:
        if tag:
            posts.filter('tags', tag)
    return posts.fetch(limit = limit, offset = offset)
This works well, although I've not stressed it too much.
The problem arises when I try to add a "sorting" order to the get_posts method to keep the result ordered by "-created" date:
@staticmethod
def get_posts(limit, offset, tags_filter = []):
    posts = Post.all()
    for tag in tags_filter:
        if tag:
            posts.filter('tags', tag)
    posts.order("-created")
    return posts.fetch(limit = limit, offset = offset)
The sorting order adds an index for each tag to filter, leading to the dreaded exploding indexes problem.
One last thing that makes this thing more complicated is that the get_posts method should provide some pagination mechanism.
Do you know any Strategy/Idea/Workaround/Hack to solve this problem?
Queries involving keys use indexes just like queries involving properties. Queries on keys require custom indexes in the same cases as with properties, with a couple of exceptions: inequality filters or an ascending sort order on key do not require a custom index, but a descending sort order on Entity.KEY_RESERVED_PROPERTY (__key__) does.
So use a sortable date string for the primary key of the entity:
import datetime
import time

class Post(db.Model):
    title = db.StringProperty(required = True)
    link = db.LinkProperty(required = True)
    description = db.StringProperty(required = True)
    tags = db.ListProperty(str)
    created = db.DateTimeProperty(required = True, auto_now_add = True)

    @classmethod
    def create(cls, *args, **kw):
        # disambig_chars() is optional; see the reliability note below
        kw.update(dict(key_name=inverse_microsecond_str() + disambig_chars()))
        return cls(*args, **kw)

...

def inverse_microsecond_str():
    # gives a string of 8 characters from ascii 23 to 'z'
    # which sorts in reverse temporal order
    t = datetime.datetime.now()
    inv_us = int(1e16 - (time.mktime(t.timetuple()) * 1e6 + t.microsecond))  # no y2k for >100 yrs
    base_100_chars = []
    while inv_us:
        digit, inv_us = inv_us % 100, inv_us // 100
        base_100_chars = [chr(23 + digit)] + base_100_chars
    return "".join(base_100_chars)
Now, you don't even have to include a sort order in your queries, although it won't hurt to explicitly sort by key.
Things to remember:
This won't work unless you use the "create" here for all your Posts.
You'll have to migrate old data
No ancestors allowed.
The key is stored once per index, so it is worthwhile to keep it short; that's why I'm doing the base-100 encoding above.
This is not 100% reliable because of the possibility of key collisions. The above code, without disambig_chars, nominally gives reliability of the number of microseconds between transactions, so if you had 10 posts per second at peak times, it would fail 1/100,000. However, I'd shave off a couple orders of magnitude for possible app engine clock tick issues, so I'd actually only trust it for 1/1000. If that's not good enough, add disambig_chars; and if you need 100% reliability, then you probably shouldn't be on app engine, but I guess you could include logic to handle key collisions on save().
What if you inverted the relationship? Instead of a post with a list of tags you would have a tag entity with a list of posts.
class Tag(db.Model):
    tag = db.StringProperty()
    posts = db.ListProperty(db.Key, indexed=False)
To search for tags you would do tags = Tag.all().filter('tag IN', ['python','blog','async'])
This would give you hopefully 3 or more Tag entities, each with a list of posts that are using that tag. You could then do post_union = set(tags[0].posts).intersection(tags[1].posts, tags[2].posts) to find the set of posts that have all tags.
Then you could fetch those posts and order them by created (I think): Post.all().filter('__key__ IN', list(post_union)).order("-created")
Note: This code is off the top of my head, I can't remember if you can manipulate sets like that.
Edit: @Yasser pointed out that you can only do IN queries for < 30 items.
Instead you could have the key name for each post start with the creation time. Then you could sort the keys you retrieved via the first query and just do Post.get(sorted_posts).
Don't know how this would scale to a system with millions of posts and/or tags.
Edit2: I meant set intersection, not union.
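Putting those pieces together, a rough sketch of the lookup; it assumes the old db API as in the question and does the sort in memory instead of the __key__ IN query, since IN is limited to 30 items:
def posts_with_all_tags(tag_names):
    tags = Tag.all().filter('tag IN', tag_names).fetch(len(tag_names))
    if len(tags) < len(tag_names):
        return []                     # at least one tag does not exist yet
    keys = set(tags[0].posts)
    for t in tags[1:]:
        keys &= set(t.posts)          # set intersection, as per Edit2
    posts = [p for p in db.get(list(keys)) if p]
    return sorted(posts, key=lambda p: p.created, reverse=True)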
This question sounds similar to:
Data Modelling Advice for Blog Tagging system on Google App Engine
Mapping Data for a Google App Engine Blog Application:
parent->child relationships in appengine python (bigtable)
As pointed out by Robert Kluin in the last one, you could also consider using a pattern similar to "Relation Index" as described in this Google I/O presentation.
# Model definitions
class Article(db.Model):
    title = db.StringProperty()
    content = db.StringProperty()

class TagIndex(db.Model):
    tags = db.StringListProperty()
# Tags are child entities of Articles
article1 = Article(title="foo", content="foo content")
article1.put()
TagIndex(parent=article1, tags=["hop"]).put()
# Get all articles for a given tag
tags = db.GqlQuery("SELECT __key__ FROM TagIndex WHERE tags = :1", "hop")
keys = [t.parent() for t in tags]
articles = db.get(keys)
Depending on how many posts you expect back from the tags query, sorting could either be done in memory or by making the date's string representation part of the Article key_name.
Updated with StringListProperty and sorting notes after comments from Robert Kluin and Wooble on the #appengine IRC channel.
One workaround could be this:
Sort and concatenate a post's tags with a delimiter like | and store them as a StringProperty when storing a post. When you receive the tags_filter, you can sort and concatenate them to create a single StringProperty filter for the posts. Obviously this would be an AND query and not an OR query, but that's what your current code seems to be doing as well.
EDIT: as rightly pointed out, this would only match the exact tag list, not a partial tag list, which is obviously not very useful.
EDIT: what if you model your Post model with boolean placeholders for tags, e.g. b1, b2, b3, etc.? When a new tag is defined, you can map it to the next available placeholder, e.g. blog=b1, python=b2, async=b3, and keep the mapping in a separate entity. When a tag is assigned to a post, you just switch its equivalent placeholder value to True.
This way when you receive a tag_filter set, you can construct your query from the map e.g.
Post.all().filter("b1",True).filter("b2",True).order('-created')
can give you all the posts which have tags python and blog.
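A rough illustration of that mapping idea; the TagSlot entity and the posts_for_tags helper are invented names, and it assumes Post has been given boolean properties b1, b2, ... as described:
class TagSlot(db.Model):
    tag = db.StringProperty()    # e.g. 'python'
    slot = db.StringProperty()   # e.g. 'b2', the Post property it maps to

def posts_for_tags(tag_names):
    query = Post.all()
    for mapping in TagSlot.all().filter('tag IN', tag_names):
        query = query.filter(mapping.slot + ' =', True)
    return query.order('-created')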