I want to reset one of my groups (a class discussion), but I would like to retain the discussion for reference. There aren't many posts (maybe 50), and I could do it by hand, but is there a way to do it through Google Apps Script or Python?
I found a couple of possibilities, but neither is in a language I'm familiar with (though I might be able to translate them):
this link: http://saturnboy.com/2010/03/scraping-google-groups/
this Perl code:
#!/usr/bin/perl
# groups2csv.pl
# Google Groups results exported to CSV suitable for import into Excel.
# Usage: perl groups2csv.pl < groups.html > groups.csv
# The CSV Header.
print qq{"title","url","group","date","author","number of articles"\n};
# The base URL for Google Groups.
my $url = "http://groups.google.com";
# Rake in those results.
my($results) = (join '', <>);
# Perform a regular expression match to glean individual results.
while ( $results =~ m!<a href=(/groups[^\>]+?rnum=[0-9]+)>(.+?)</a>.*?
<br>(.+?)<br>.*?<a href="?/groups.+?class=a>(.+?)</a> - (.+?) by
(.+?)\s+.*?\(([0-9]+) article!mgis ) {
my($path, $title, $snippet, $group, $date, $author, $articles) =
($1||'',$2||'',$3||'',$4||'',$5||'',$6||'',$7||'');
$title =~ s!"!""!g; # double escape " marks
$title =~ s!<.+?>!!g; # drop all HTML tags
print qq{"$title","$url$path","$group","$date","$author","$articles"\n\n};
}
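For reference, here is a rough first attempt at translating that Perl script into Python. The same big caveat applies: the regex targets the 2010-era Google Groups result markup, so it will almost certainly need adjusting against whatever HTML the site serves now.
#!/usr/bin/env python
# groups2csv.py - a rough translation of groups2csv.pl
# Usage: python groups2csv.py < groups.html > groups.csv
import re
import sys

url = "http://groups.google.com"
results = sys.stdin.read()

# The CSV header
print('"title","url","group","date","author","number of articles"')

pattern = re.compile(
    r'<a href=(/groups[^>]+?rnum=[0-9]+)>(.+?)</a>.*?'
    r'<br>(.+?)<br>.*?<a href="?/groups.+?class=a>(.+?)</a> - (.+?) by\s*'
    r'(.+?)\s+.*?\(([0-9]+) article',
    re.IGNORECASE | re.DOTALL,
)

for m in pattern.finditer(results):
    path, title, snippet, group, date, author, articles = (g or '' for g in m.groups())
    title = title.replace('"', '""')      # double-escape " marks for CSV
    title = re.sub(r'<.+?>', '', title)   # drop all HTML tags
    print(f'"{title}","{url}{path}","{group}","{date}","{author}","{articles}"')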
Take a look at the HTTrack utility mentioned in this webapps question and in this forum discussion.
Note that I'm assuming you don't actually want to screen scrape and process the data, but merely want a copy of the discussion for future reference.
EDIT: If you actually want to screen scrape, you can do that too, but writing a script for it can be a significant time sink. Screen scraping is more about extracting specific pieces of data from an HTML document than grabbing the entire document. An example of where you might need to screen scrape would be if you were looking at the Jeopardy website and wanted to grab individual questions, their point values, who answered them correctly, which game they occurred in, etc., for insertion into a database.
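For illustration, a minimal sketch of that kind of targeted extraction in Python, using the third-party requests and BeautifulSoup packages. The URL and the CSS class names below are hypothetical placeholders, not a real archive's markup:
import requests
from bs4 import BeautifulSoup

# Hypothetical archive page; the URL and class names are placeholders
html = requests.get("http://example.com/archive/game-42", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Pull out just the pieces you care about, not the whole page
for clue in soup.select("td.clue"):
    question = clue.select_one(".clue_text").get_text(strip=True)
    value = clue.select_one(".clue_value").get_text(strip=True)
    print(value, question)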
We get biweekly software releases from a supplier who provides us with PDF release notes. The notes have a lot of irrelevant material in them, but ultimately we need to manually copy/paste information from these notes into a Confluence page.
Ideally I would like to be able to write a Python app to scrape certain sections out of the PDF. The structure is pretty much as follows (with the bold parts being the ones I want to extract):
Introduction
New Features
2.1. New Feature 1
description
2.2 New Feature 2
description
.
.
.
2.x) New Feature X
description
Defect fixes
description
table with defect descriptions
rest of the document is irrelevant in this case
I have managed to import the file and extract all of the text, but I really have no idea how to extract only the headings from section 2, and then, for section 3, take only the table and reformat it with pandas. Any suggestions on how to go about this?
import os
import fitz  # PyMuPDF

filename = os.path.expanduser(r'~/releasenotes.pdf')  # expand '~' to the home directory
doc = fitz.open(filename)
print(doc)  # Just to see what comes out
(And now what should I do next?)
A simple regex (regular expression) should do the trick here. I'm making some big assumptions about what the text looks like when it comes out of your PDF read; I have copied the text from your post and called it doc, per your question :)
import re  # regular expression library
import pandas as pd

doc = '''
Introduction
New Features
2.1. New Feature 1
description
2.2 New Feature 2
description
.
.
.
2.x) New Feature X description
'''

ds_features = pd.Series(re.findall(r'2\.[1-9].*\n', doc))
Let me unpack that last line:
re.findall produces a list of every substring in your document that matches the search pattern
r'2\.[1-9].*\n' finds every instance of a literal 2. (the \. keeps the dot from matching any character), followed by a digit from [1-9], followed by any number of characters .* until it reaches a line break \n.
Hope this fits the bill?
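For the section 3 part, here is a sketch of how you might carve out the defect table and hand it to pandas. Big assumptions again: that the heading literally reads "Defect fixes", and that each table row comes out of the PDF as a single "<id> <description>" line; both depend entirely on how your document is laid out.
import re
import fitz  # PyMuPDF
import pandas as pd

doc = fitz.open("releasenotes.pdf")
# Recent PyMuPDF spells this page.get_text(); older versions use page.getText()
text = "".join(page.get_text() for page in doc)

# Grab everything after the "Defect fixes" heading; tighten the end anchor
# once you know what actually follows the table in your document
match = re.search(r'Defect fixes\n(.*)', text, re.DOTALL)
if match:
    lines = [ln for ln in match.group(1).splitlines() if ln.strip()]
    # Assume each table row is "<id> <description>"; split on the first whitespace
    rows = [ln.split(None, 1) for ln in lines]
    rows = [r for r in rows if len(r) == 2]
    df = pd.DataFrame(rows, columns=["defect", "description"])
    print(df)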
I'm using the win32com.client module from the pywin32 extension for Python to build a Word document. I have tried a pretty good host of methods to generate a ToC, but all have failed.
I think what I want to do is call the ActiveDocument object and create one with something like this example from the MSDN page:
Set myRange = ActiveDocument.Range(Start:=0, End:=0)
ActiveDocument.TablesOfContents.Add Range:=myRange, _
UseFields:=False, UseHeadingStyles:=True, _
LowerHeadingLevel:=3, _
UpperHeadingLevel:=1
Except in Python it would be something like:
wordObject.ActiveDocument.TablesOfContents.Add(Range=???, UseFields=False, UseHeadingStyles=True, LowerHeadingLevel=3, UpperHeadingLevel=1)
I've built everything so far using the 'Selection' object (example below) and wish to add this ToC after the first page break.
Here's a sample of the code that builds the document:
objWord = win32com.client.Dispatch("Word.Application")
objDoc = objWord.Documents.Open('pathtotemplate.docx')
objSel = objWord.Selection

# These seem to work but I don't know why...
objWord.ActiveDocument.Sections(1).Footers(1).PageNumbers.Add(1, True)
objWord.ActiveDocument.Sections(1).Footers(1).PageNumbers.NumberStyle = 57

objSel.Style = objWord.ActiveDocument.Styles("Heading 1")
objSel.TypeText("TITLE PAGE AND STUFF")
objSel.InsertParagraph()
objSel.TypeText("Some data or another")
objSel.TypeParagraph()
objWord.Selection.InsertBreak()
####INSERT TOC HERE####
Any help would be greatly appreciated! In a perfect world I'd use the default first option that's available from the Word GUI, but that seems to point to a file and be harder to access (something about templates).
Thanks
Manually edit your template in Word: add the ToC (which will be empty initially), any intro material, headers/footers, etc., and then, at the point where you want your text content inserted (i.e. after the ToC), put a uniquely named bookmark. Then, in your code, create a new document based on the template (or open the template and save it under a different name), find the bookmark, and insert your content there. Save to a different filename.
This approach has all sorts of advantages: you can format your template in Word rather than by writing all the formatting code, and you can very easily edit your template to update styles. When someone says they want the Normal font to be bigger/smaller/pink, you can do it just by editing the template. Make sure to use styles in your code, and only apply direct formatting when it is specifically different from the default style.
I'm not sure how you make sure the ToC is actually generated; it might be updated automatically on every save, or you may need to update the document's fields explicitly, as in the sketch below.
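A minimal pywin32 sketch of that workflow. The template path and the bookmark name ("ContentStart") are placeholders for whatever you set up in your template, and Fields.Update() is one way to force the ToC to regenerate:
import win32com.client

objWord = win32com.client.Dispatch("Word.Application")
# The template already contains the ToC, headers/footers, and a bookmark
objDoc = objWord.Documents.Add(r'C:\path\to\template.dotx')

# Insert the generated content at the bookmark (placed after the ToC)
rng = objDoc.Bookmarks("ContentStart").Range
rng.InsertAfter("Body text generated from Python.\n")

# Refresh every field in the document so the ToC picks up the new headings
objDoc.Fields.Update()

objDoc.SaveAs(r'C:\path\to\output.docx')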
I'm trying to run some queries against PubMed's EUtils service. If I run them on the website I get a certain number of records returned, in this case 13126 (link to pubmed).
A while ago I bodged together a python script to build a query to do much the same thing, and the resultant url returns the same number of hits (link to Eutils result).
Of course, since I have no formal programming background, it was all a bit kludgy, so I'm trying to do the same thing using Biopython. I think the following code should do the same thing, but it returns a greater number of hits: 23303.
from Bio import Entrez
Entrez.email = "A.N.Other@example.com"
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]",datetype="pdat", mindate="2012", maxdate="2012")
record = Entrez.read(handle)
print(record["Count"])
I'm fairly sure it's just down to some subtlety in how the URL is being generated, but I can't work out how to see what URL Biopython is generating. Can anyone give me some pointers?
Thanks!
EDIT:
It's something to do with how the URL is being generated, as I can get back the original number of hits by modifying the code to include double quotes around the search term, thus:
handle = Entrez.esearch(db='pubmed', term='"stem+cell"[ALL]', datetype='pdat', mindate='2012', maxdate='2012')
I'm still interested in knowing what URL is being generated by Biopython, as it'll help me work out how I have to structure the search term when I want to do more complicated searches.
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]",datetype="pdat", mindate="2012", maxdate="2012")
print(handle.url)
You've solved this already (Entrez likes explicit double quoting round combined search terms), but currently the URL generated is not exposed via the API. The simplest trick would be to edit the Bio/Entrez/__init__.py file to add a print statement inside the _open function.
Update: Recent versions of Biopython now save the URL as an attribute of the returned handle, i.e. in this example try doing print(handle.url)
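Putting the two together, a quick sanity check (assuming a recent Biopython; note that Biopython URL-encodes the term for you, so the '+' between words shouldn't be needed):
from Bio import Entrez

Entrez.email = "A.N.Other@example.com"
handle = Entrez.esearch(db="pubmed", term='"stem cell"[All Fields]',
                        datetype="pdat", mindate="2012", maxdate="2012")
print(handle.url)  # the exact query Biopython sent to EUtils
record = Entrez.read(handle)
print(record["Count"])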
I'm trying to build a blog system. So I need to do things like transforming '\n' into <br /> and transforming http://example.com into <a href='http://example.com'>http://example.com</a>
The former is easy: just use the string replace() method.
The latter thing is more difficult, but I found solution here: Find Hyperlinks in Text using Python (twitter related)
But now I need to implement an "Edit Article" function, so I have to reverse that transformation.
So, how can I transform <a href='http://example.com'>http://example.com</a> back into http://example.com?
Thanks! And I'm sorry for my poor English.
Sounds like the wrong approach. Making round-trips work correctly is always challenging. Instead, store only the source text, and format it as HTML only when you need to display it. That way, alternate output formats/views (RSS, summaries, etc.) are easier to create, too.
Separately, we wonder whether this particular wheel needs to be reinvented again ...
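For illustration, a minimal sketch of that display-time rendering, reusing the linkifying regex idea from the question you linked (real code should also HTML-escape the source first):
import re

def render_html(source):
    # Linkify bare URLs first, then convert newlines to <br />
    html = re.sub(r'(http://[^\s]+)', r"<a href='\1'>\1</a>", source)
    return html.replace('\n', '<br />\n')

print(render_html("Check this out\nhttp://example.com"))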
Since you are using the answer from that other question, your links will always be in the same format, so it should be pretty easy to do with a regex. I don't know Python, but going by the answer to the last question:
import re

# The string now contains the anchor produced by the earlier transform
myString = "This is my tweet check it out <a href='http://tinyurl.com/blah'>http://tinyurl.com/blah</a>"
# Match the anchor and keep only the captured URL
r = re.compile(r"<a href='(http://[^']+)'>\1</a>")
print(r.sub(r'\1', myString))
Should work.
Does anybody know how to remove stars from articles starred in Google Reader using its unofficial API?
I found this one but it doesn't work:
http://www.niallkennedy.com/blog/2005/12/google-reader-api.html
Neither does the pyrfeed module in Python; I get an IOError exception every time.
Try using:
r=user%2F[user ID]%2Fstate%2Fcom.google%2Fstarred
instead of
a=user%2F[user ID]%2Fstate%2Fcom.google%2Fstarred
when invoking edit-tag.
I don't have Python code for this (I have Java), but the problem you're stumbling on is pretty much independent of the language you use, and it is always good to be able to see some code when you need to get the details right. You just need to make the same requests I do, and check the details I highlight to see whether one of them is your problem.
You can use this to remove the star for a given post (note that this service supports more than one item at the same time if you need that):
String authToken = getGoogleAuthKey();
// I use Jsoup for the requests, but you can use anything you
// like - for jsoup you usually just need to include a jar
// into your java project
Document doc = Jsoup.connect("http://www.google.com/reader/api/0/edit-tag")
// this is important for permission - more details on how to get this ahead in the text
.header("Authorization", _AUTHPARAMS + authToken)
.data(
// you don't need the userid, the '-' will suffice
// "r" means remove. you can also use "a" to add
// you have lots of other options besides starred. e.g: read
"r", "user/-/state/com.google/starred",
"async", "true",
// the feed, but don't forget the beginning: feed/
"s", "feed/http://www.gizmodo.com/index.xml",
// there are 2 id formats, easy to convert - more info ahead in the text
"i", "tag:google.com,2005:reader/item/1a68fb395bcb6947",
// another token - this one for allow editing - more details on how to get this ahead in the text
"T", "//wF1kyvFPIe6JiyITNnMWdA"
)
// I also send my API key, but I don't think this is mandatory
.userAgent("[YOUR_APP_ID_GOES_HERE].apps.googleusercontent.com")
.timeout(10000)
// VERY IMPORTANT - don't forget the post! (using get() will not work)
.post();
You can check my answer to this other question for some more implementation details (the ones referred to in the comments).
To list all the starred items inside a feed, you can use http://www.google.com/reader/api/0/stream/items/ids or http://www.google.com/reader/atom/user/-/state/com.google/starred . You can then pass these ids to the API mentioned above to remove the star.
These last two are a lot easier to use. You can find details on the API in these unofficial (but nicely structured) resources: http://www.chrisdadswell.co.uk/android-coding-example-authenticating-clientlogin-google-reader-api/ , http://code.google.com/p/pyrfeed/wiki/GoogleReaderAPI , http://blog.martindoms.com/2009/10/16/using-the-google-reader-api-part-2
Hope it helps!
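Since the question asked about Python, here is a rough requests-based translation of the Java call above. The ClientLogin-style Authorization header is an assumption (it mirrors what _AUTHPARAMS stands in for in the Java code), and auth_token and edit_token are the two tokens discussed above:
import requests

def remove_star(auth_token, edit_token, feed, item_id):
    # Same parameters as the Java/Jsoup call above; it must be a POST
    return requests.post(
        "http://www.google.com/reader/api/0/edit-tag",
        headers={"Authorization": "GoogleLogin auth=" + auth_token},
        data={
            "r": "user/-/state/com.google/starred",  # "r" removes, "a" adds
            "async": "true",
            "s": feed,      # e.g. "feed/http://www.gizmodo.com/index.xml"
            "i": item_id,   # e.g. "tag:google.com,2005:reader/item/1a68fb395bcb6947"
            "T": edit_token,
        },
        timeout=10,
    )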