Extract template arguments in Python from MediaWiki's API wikitext - python

Is there a way to extract parts of the text from MediaWiki's API? For example, this link dumps all the content in XML format:
http://marvel.wikia.com/api.php?action=query&prop=revisions&titles=All-New%20X-Men%20Vol%201%201&rvprop=content&format=xml
But there isn't much structure to it, even in the JSON format.
I'd like to get the text of Writer1_1, Penciler1_1, etc. Perhaps I'm not setting my parameters correctly, and there are other options that would give more structured output.
You can see the content in a more user-readable way here.

I'm sure the regex and final splitting could be more efficient, but this gets the job done for what you asked.
import urllib2
import re
data = urllib2.urlopen('http://marvel.wikia.com/api.php?action=query&prop=revisions&titles=All-New%20X-Men%20Vol%201%201&rvprop=content')
regex = re.compile('(Writer1_1|Penciler1_1)')
for line in data.read().split('|'):
    if regex.search(line):
        # assume everything after = is the full name
        print ' '.join(line.split()[2:])
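A Python 3 variant of the same idea, as a minimal sketch: instead of splitting on '|', a multiline regex pulls every "| name = value" line of the template into a dict. The wikitext sample and its values are placeholders made up for illustration, and the sketch assumes one parameter per line with no nested templates or links containing '|'.

```python
import re

# Hypothetical wikitext excerpt; the parameter values are placeholders.
wikitext = """{{Comic Template
| Writer1_1            = Example Writer
| Penciler1_1          = Example Penciler
| Inker1_1             =
}}"""

# One "| name = value" pair per line; [ \t]* keeps the match on a single line
# so that an empty value does not swallow the following parameter.
PARAM_RE = re.compile(
    r'^\|[ \t]*(?P<name>[^=|\n]+?)[ \t]*=[ \t]*(?P<value>.*?)[ \t]*$',
    re.MULTILINE,
)

def template_params(text):
    """Return a dict mapping template parameter names to their values."""
    return {m.group('name'): m.group('value') for m in PARAM_RE.finditer(text)}

params = template_params(wikitext)
print(params['Writer1_1'])
print(params['Penciler1_1'])
```

For real pages with nested templates, a dedicated wikitext parser such as mwparserfromhell is likely to be more robust than a regex.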

Related

Printing out a specific part of text from requests

I've been trying to scrape data from a profile so that I can tell whether a particular piece of information has changed. Here's a snippet of what the overall code would probably look like:
import requests
response = requests.get('https://twitter.com/elonmusk')
print(response.text[30907:30957])
#need to print out "sensitive_media_settings_enabled":{"value":false}
I need to have "sensitive_media_settings_enabled":{"value":false} printed out in the shell, how can I do this?
Like Ali said in a comment, a better approach to this is to use a regular expression to find and extract the string you're looking for. When I tried this, the index start and stop were at 43539 and 43589 respectively.
Here's how you could do it with regex:
import re
import requests
response = requests.get('https://twitter.com/elonmusk')
reg_expression = r'"sensitive_media_settings_enabled":{"value":(true|false)}'
result = re.search(reg_expression, response.text)
print(result[0])
This prints "sensitive_media_settings_enabled":{"value":false}
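If what you need is the setting's value rather than the matched text, a capture group plus a None check may help. This is only a sketch: page_text is a made-up stand-in for response.text, and the braces are escaped so the pattern does not depend on how the regex engine treats a bare '{'.

```python
import re

# Hypothetical fragment standing in for response.text.
page_text = '{"a":1,"sensitive_media_settings_enabled":{"value":false},"b":2}'

pattern = re.compile(
    r'"sensitive_media_settings_enabled":\{"value":(true|false)\}'
)

match = pattern.search(page_text)
if match is None:
    # The page layout changed or the key is missing.
    enabled = None
else:
    # group(0) is the whole match, group(1) only the captured true/false.
    enabled = match.group(1) == 'true'

print(enabled)
```

Checking for None first avoids a TypeError on result[0] when the pattern is not found in the page.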

Python finding exact string in .html file

I have a .html file which gets filled dynamically depending on what actions are taken in the program. I am having an issue when searching for an exact string: although I know the file is not blank, the loop doesn't return anything and behaves as if the file were blank.
I have searched and read many other SO questions and tried many of the suggestions, including 'blah' in line, re.findall, and with open(), but they all return nothing. I'm thinking I need HTML parsing or similar?
Can anyone shed any light on this for me?
f = open(outApp + '_report.html', 'r+')
for line in f:
    # check that the for loop works
    self.progressBox.AppendText(line)
    if 'mystring' in line:
        pass  # do stuff
The string I wish to find is My country, which is wrapped in h2 tags.
This definitely shouldn't be done without a dedicated HTML parser.
Search for any Python HTML parser you like; for basic usage they are all easy, lxml for example. In pseudo-code your task would be:
from some_cool_lib import SomeCoolHTMLParser
parser = SomeCoolHTMLParser()
doc = parser.parse(path_to_my_html_file)
h2_elements = doc.findall('h2')
for h2 in h2_elements:
    if h2.text == 'My country':
        pass  # do stuff
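The pseudo-code above could be made concrete with the standard library's html.parser, for instance. This is a sketch, not the only way to do it: the class name H2Collector, the helper h2_headings, and the sample markup are invented for illustration, and in practice the HTML would come from reading the report file.

```python
from html.parser import HTMLParser

class H2Collector(HTMLParser):
    """Collect the text content of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._parts = None  # list of text chunks while inside an <h2>

    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'h2' and self._parts is not None:
            self.headings.append(''.join(self._parts).strip())
            self._parts = None

def h2_headings(html):
    collector = H2Collector()
    collector.feed(html)
    collector.close()
    return collector.headings

# Hypothetical report content.
sample = "<html><body><h2>My country</h2><p>text</p><h2>Other</h2></body></html>"
found = 'My country' in h2_headings(sample)
print(found)
```

Comparing against the collected heading text avoids depending on how the markup happens to be split across lines in the file.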

Extract information from a webpage in a particular format

I am trying to write a simple Python script to extract certain links from a webpage. I am able to extract the links successfully, but now I want to extract some more information given on that webpage, such as bitrate, size, and duration.
I am using the XPath below to extract the above-mentioned info:
>>> import lxml.html
>>> doc = lxml.html.parse('http://mp3skull.com/mp3/linkin_park_faint.html')
>>> info = doc.xpath(".//*[@id='song_html']/div[1]/text()")
>>> info[0:7]
['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t3.49 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41']
Now what I need is for the info for a particular link to be generated as a tuple like (bitrate, size, duration).
The XPath mentioned above returns the required info, but it is badly formatted; I cannot find any logic that turns it into the format I need.
So, is there any way to get the output in my format?
I think BeautifulSoup will do the job, it parses even badly formatted HTML:
http://www.crummy.com/software/BeautifulSoup/
Parsing is quite easy with BeautifulSoup, for example:
import bs4
import urllib
soup = bs4.BeautifulSoup(urllib.urlopen('http://mp3skull.com/mp3/linkin_park_faint.html').read())
print soup.find_all('a')
It also has quite good docs:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
You can actually strip everything out with XPath:
translate(.//*[@id='song_html']/div[1]/text(), "\n\t,'", '')
So for your additional question, either:
info[0:len(info)]
for everything together, or:
info.rfind(" ")
The translate leaves a space character, which you could replace with whatever you want.
Additional info found here
How are you with regular expressions and python's re module?
http://docs.python.org/library/re.html may be essential.
For getting the data out of the array, re.match(regex, info[n]) should suffice. As for the tuple of three values, Python's tuple syntax takes care of it: simply match against the members of your info array with re.match and put the results in a tuple.
import re
matching_re = '.*'  # this re matches whole strings, rather than what you need
incoming_value_1 = re.match(matching_re, info[1])
# etc. for incoming_value_2 and incoming_value_3
triple = (incoming_value_1, incoming_value_2, incoming_value_3)
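To make the regex idea concrete, here is a sketch that works on the info list shown in the question: it strips whitespace, labels each value by its shape, and groups the labelled values into (bitrate, size, duration) tuples. The grouping rule is an assumption, since the page layout is not visible here: a value of a kind already present in the current record is taken to start a new record. Missing fields come out as None.

```python
import re

# Sample copied from the info[0:7] output in the question.
info = ['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t',
        '\n\t\t\t\t3.49 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41']

PATTERNS = [
    ('bitrate', re.compile(r'^\d+\s*kbps$', re.IGNORECASE)),
    ('size', re.compile(r'^\d+(?:\.\d+)?\s*mb$', re.IGNORECASE)),
    ('duration', re.compile(r'^\d+:\d{2}(?::\d{2})?$')),
]

def classify(values):
    """Strip whitespace, drop empty strings, and label each recognised value."""
    labelled = []
    for raw in values:
        text = raw.strip()
        if not text:
            continue
        for kind, pattern in PATTERNS:
            if pattern.match(text):
                labelled.append((kind, text))
                break
    return labelled

def to_tuples(labelled):
    """Group labelled values into (bitrate, size, duration) tuples.

    Assumption: a repeated kind marks the start of the next record.
    """
    records, current = [], {}
    for kind, text in labelled:
        if kind in current:
            records.append(current)
            current = {}
        current[kind] = text
    if current:
        records.append(current)
    return [(r.get('bitrate'), r.get('size'), r.get('duration')) for r in records]

songs = to_tuples(classify(info))
print(songs)
```

If each song sits in its own element on the page, running the XPath per element and building one tuple per element would be more reliable than this positional grouping.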

Stripping (XML?) markup from a document using python

I have a file which contains the names of scientists in the following format:
<scientist_names>
<scientist>abc</scientist>
</scientist_names>
I want to use Python to extract the names of the scientists from the above format. How should I do it?
I would like to use regular expressions but don't know how to use them. Please help.
DO NOT USE REGULAR EXPRESSIONS! (all reasons well explained [here])
Use an xml/html parser, take a look at BeautifulSoup.
This is XML and you should use an XML parser like lxml instead of regular expressions (because XML is not a regular language).
Here is an example:
from lxml import etree
text = """<scientist_names> <scientist>abc</scientist> </scientist_names>"""
tree = etree.fromstring(text)
for scientist in tree.xpath("//scientist"):
    print scientist.text
As noted, this appears to be xml. In that case, you should use an xml parser to parse this document; I recommend lxml ( http://lxml.de ).
Given your requirements, you may find it more convenient to use SAX-style parsing rather than DOM-style. SAX parsing simply involves registering handlers that are called when the parser encounters a particular tag. It works well as long as the meaning of a tag does not depend on context, and it pays off most when you have more than one type of tag to process (which may not be the case here).
In case your input document may be incorrectly formed, you may wish to use Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing XML
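The SAX-style approach described above could look roughly like this with the standard library's xml.sax (rather than lxml). It is a sketch: the handler class name ScientistHandler is invented for illustration, and the input is the snippet from the question.

```python
import xml.sax

class ScientistHandler(xml.sax.ContentHandler):
    """Collect the text of every <scientist> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._buffer = None  # text chunks while inside <scientist>

    def startElement(self, name, attrs):
        if name == 'scientist':
            self._buffer = []

    def characters(self, content):
        # The parser may deliver text in several chunks, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == 'scientist' and self._buffer is not None:
            self.names.append(''.join(self._buffer).strip())
            self._buffer = None

xml_text = b"<scientist_names> <scientist>abc</scientist> </scientist_names>"
handler = ScientistHandler()
xml.sax.parseString(xml_text, handler)
print(handler.names)
```

For a large file, xml.sax.parse(path, handler) streams the document instead of loading it into memory first.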
Here is a simple example that should handle the XML tags for you:
#import library to do http requests:
import urllib2
#import easy to use xml parser called minidom:
from xml.dom.minidom import parseString
#all these imports are standard on most modern python implementations
#download the file if it's not on the same machine otherwise just use a path:
file = urllib2.urlopen('http://www.somedomain.com/somexmlfile.xml')
#convert to string:
data = file.read()
#close file because we dont need it anymore:
file.close()
#parse the xml you downloaded
dom = parseString(data)
#retrieve the first xml tag (<tag>data</tag>) that the parser finds with name tagName,
#in your case <scientist>:
xmlTag = dom.getElementsByTagName('scientist')[0].toxml()
#strip off the tag (<tag>data</tag> ---> data):
xmlData=xmlTag.replace('<scientist>','').replace('</scientist>','')
#print out the xml tag and data in this format: <tag>data</tag>
print xmlTag
#just print the data
print xmlData
If you find anything unclear just let me know

Python: Keyword to Links

I am building a blog on Google App Engine. I would like to convert some keywords in my blog posts to links, just like what you see in many WordPress blogs.
Here is one WP plugin which does the same thing: http://wordpress.org/extend/plugins/blog-mechanics-keyword-link-plugin-v01/
A plugin that allows you to define keyword/link pairs. The keywords are automatically linked in each of your posts.
I think this is more than a simple Python Replace. What I am dealing with is HTML code. It can be quite complex sometimes.
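Take the following code snippet as an example. I want to convert the word example into a link to http://example.com:
Here is an example link:<a href="http://example.com">example.com</a>
A simple Python replace function which replaces example with <a href="http://example.com">example</a> would output:
Here is an <a href="http://example.com">example</a> link:<a href="http://<a href="http://example.com">example</a>.com"><a href="http://example.com">example</a>.com</a>
but I want:
Here is an <a href="http://example.com">example</a> link:<a href="http://example.com">example.com</a>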
Is there any Python plugin that capable of this? Thanks a lot!
This is roughly what you could do using Beautifulsoup:
from BeautifulSoup import BeautifulSoup
html_body = """
Here is an example link:<a href='http://example.com'>example.com</a>
"""
soup = BeautifulSoup(html_body)
# protect existing link text by wrapping it in '|' markers
for link_tag in soup.findAll('a'):
    link_tag.string = "%s%s%s" % ('|', link_tag.string, '|')
# replace the keyword in every text node
for text in soup.findAll(text=True):
    text_formatted = ['<a href="http://example.com">example</a>'
                      if word == 'example' and not (word.startswith('|') and word.endswith('|'))
                      else word for word in text.split()]
    text.replaceWith(' '.join(text_formatted))
# remove the '|' markers again
for link_tag in soup.findAll('a'):
    link_tag.string = link_tag.string[1:-1]
print soup
Basically I'm going through all the text in the post body and replacing the word example with the given link, without touching the text of existing links, which is protected by the '|' characters during the parsing.
This is not 100% perfect, for example it does not work if the word you are trying to replace ends with a period; with some patience you could fix all the edge cases.
This would probably be better suited to client-side code. You could easily modify a word highlighter to get the desired results. By keeping this client-side, you can avoid having to expire page caches when your 'tags' change.
If you really need it to be processed server-side, then you need to look at using re.sub which lets you pass in a function, but unless you are operating on plain-text you will have to first parse the HTML using something like minidom to ensure you are not replacing something in the middle of any elements.
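A server-side sketch of that idea, using the standard library's html.parser in place of minidom so that the input does not have to be well-formed XML: text nodes are rewritten with re.sub and a replacement function, while anything inside an existing a element is copied through untouched. The class name KeywordLinker and the helper link_keywords are invented for illustration, and the sketch assumes a non-empty keyword-to-URL dict.

```python
import re
from html import escape
from html.parser import HTMLParser

class KeywordLinker(HTMLParser):
    """Re-emit HTML, linking keywords in text that is not already inside <a>."""

    def __init__(self, links):
        # convert_charrefs=False so entities pass through unchanged.
        super().__init__(convert_charrefs=False)
        self.links = links
        self.pattern = re.compile(r'\b(' + '|'.join(map(re.escape, links)) + r')\b')
        self.out = []
        self.anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.anchor_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'a' and self.anchor_depth:
            self.anchor_depth -= 1
        self.out.append('</%s>' % tag)

    def handle_data(self, data):
        if self.anchor_depth:
            self.out.append(data)  # existing link text stays as it is
        else:
            self.out.append(self.pattern.sub(self._link, data))

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

    def handle_comment(self, data):
        self.out.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.out.append('<!%s>' % decl)

    def _link(self, match):
        # Replacement function passed to re.sub.
        word = match.group(0)
        return '<a href="%s">%s</a>' % (escape(self.links[word], quote=True), word)

def link_keywords(html, links):
    parser = KeywordLinker(links)
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)

html_body = "Here is an example link:<a href='http://example.com'>example.com</a>"
linked = link_keywords(html_body, {'example': 'http://example.com'})
print(linked)
```

Because the word boundary check is done by the regex, a keyword followed by a period or comma is still matched, which a plain split on whitespace would miss.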
