Update a table in Confluence through Python

I created a Confluence page some time ago that contains one table. Periodically I need to add a new row and put some text in it, and I would like to automate this. In addition, in the last column I need to upload some files.
I wrote code that downloads the table from the Confluence page, but I am not sure how to write new information into a new row of that table in Confluence through Python. From the research I have done, I concluded that I should update the table I downloaded (as a DataFrame) in Python and then upload the new table back to Confluence. Is this idea correct?
I supposed that I needed to upload the modified table (DataFrame) from Python with
conf.update_page(page_id, page_content)
However, I get an error: "Object of type DataFrame is not JSON serializable". Could you help me, please? I do not know how to solve it; perhaps something in my approach is incorrect.
As I wrote above, I need to attach some documents in the last column of the table, and I do not understand how to do this at all. There are functions that can attach files to a Confluence page, but I need to attach them in the last column of the last (newly created) row. Do I need to do this in the DataFrame (modified table) in Python, or should I do it in Confluence after uploading the modified table there? If the latter, I do not understand how to tell Python to put the attachment exactly in the last column, since I only download the table into Python from Confluence.
Below is the code I used to get the table from Confluence.
from atlassian import Confluence
import pandas as pd

conf_site = 'https://confluence.company.com/'
conf_user = "login"
conf_pass = "password"
page_id = 0000000000  # placeholder page ID

conf = Confluence(url=conf_site, username=conf_user, password=conf_pass)

# Fetch the rendered page body and parse its first table into a DataFrame
page = conf.get_page_by_id(page_id, expand='body.view')
page_content = page['body']['view']['value']
table = pd.read_html(page_content)[0]
I opened this table in Python as a DataFrame, created a new row in it, and filled in the essential information. However, I do not understand how to push it back to the Confluence page; I get the error quoted above.
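One way around the serialization error, as a minimal sketch: update_page expects a string body, not a DataFrame, so the modified table can be converted back to HTML first (this assumes the page title stays unchanged):
# Convert the modified DataFrame back to an HTML table and upload it
body = table.to_html(index=False, na_rep='')
conf.update_page(page_id=page_id, title=page['title'], body=body)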

I've done a similar project, except I was creating a Confluence page instead. I found the easiest way to do it is to just create the HTML code yourself.
This plugin lets you view the source code of a Confluence page. It's an XML-based language, which makes it quite easy to work with in Python. I found it quite useful, so I suggest reading up on it:
https://marketplace.atlassian.com/apps/1210722/confluence-source-editor?tab=overview&hosting=server
This is how I was attaching files and then getting the link.
I then use BeautifulSoup4 to find the spot in the code to put the link (although I created a template for myself):
confluence.attach_file(spur_path, name=None, content_type=None, page_id=pid, title=None, space="spacename", comment="Automatically added")
#Adding plots to HTML code
conf_spur_plot = "http://wiki:8090/download/attachments/" + pid + "/" + spur_filename
soup.find("a", {"class": "spurplot"})["href"] = conf_spur_plot
A quick example of getting and updating a Confluence page:
# Log in to Confluence
confluence = Confluence(
    url='http://confluence',
    username='user',
    password='pass')

# Get the page id from the space and title
pid = confluence.get_page_id(space, title)

# Get the page from the page id
conf_page = confluence.get_page_by_id(expand='body.storage', page_id=pid)

# Check the contents
contents = conf_page["body"]["storage"]["value"]

# Stringify the contents.
# You can turn this into a BeautifulSoup4 object and can then use all the bs4
# methods, which can help a lot
conf_string = str(contents)

html_example = "<p>This is an example</p>"

# Add the html example to the contents
conf_string += html_example

# Update the page with the new body
confluence.update_page(
    page_id=pid,
    title="Example",
    body=conf_string)
The code below is just an example of BS4 usage with Confluence. It creates a BS4 object from the Confluence contents, parsed as XML, then finds the 3rd (index = 2) table and gets all the rows except the header row.
conf_html = BeautifulSoup(contents, 'xml')
child_soup = conf_html.find_all("table")[2]
all_rows = child_soup.find_all("tr")[1:]
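As a hedged sketch of where this can lead (the cell values here are illustrative), you could append a new row to that table and push the whole body back:
# Build a new <tr> with one <td> per column and append it to the table
new_row = conf_html.new_tag("tr")
for cell_text in ["new value 1", "new value 2"]:
    td = conf_html.new_tag("td")
    td.string = cell_text
    new_row.append(td)
(child_soup.find("tbody") or child_soup).append(new_row)

# Push the modified body back; note str(conf_html) may include an XML
# declaration, which you may need to strip before Confluence accepts it
confluence.update_page(page_id=pid, title="Example", body=str(conf_html))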
Edit: As an example:
<ac:structured-macro ac:name="html" ac:schema-version="1">
<ac:plain-text-body>
<![CDATA[{0}]]>
</ac:plain-text-body>
</ac:structured-macro>
The above is the Confluence XML for the HTML macro. For my program I simply do html_macro.format(example_data), and then I can add that to my Confluence page with update_page(), and the HTML macro with my example data is rendered.
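A minimal sketch of that formatting step (html_macro and example_data are illustrative names, not part of any API):
# The macro template with a placeholder inside the CDATA block
html_macro = """<ac:structured-macro ac:name="html" ac:schema-version="1">
<ac:plain-text-body><![CDATA[{0}]]></ac:plain-text-body>
</ac:structured-macro>"""

example_data = "<p>Hello from Python</p>"
confluence.update_page(page_id=pid, title="Example",
                       body=html_macro.format(example_data))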
<ac:layout>
<ac:layout-section ac:type="two_equal">
<ac:layout-cell>
<table>
<tbody>
<tr>
<th style="text-align: left;">Description</th>
<th>Test Report</th>
</tr>
<tr>
<th style="text-align: left;">PN</th>
<td>
<p>
<a class="link-here" href="">PN2</a>
</p>
</td>
</tr>
</tbody>
</table>
</ac:layout-cell>
Another example of a table in Confluence. After creating the Soup object:
soup.find("a", {"class": "link-here"})["href"] = link_to_plot
This line uses BS4 to find an a tag with class "link-here" and then change its href value.
I manually had to put the classes and IDs into my template XML file to do this. Classes and IDs are saved on Confluence, so if you add them once, it should work every time after that.
Hope this helps

Related

Scraping PFR Football Data with Python for a Beginner

Background: I'm trying to scrape some tables from this pro-football-reference page. I'm a complete newbie to Python, so a lot of the technical jargon ends up lost on me, and in trying to understand how to solve the issue, I can't figure it out.
Specific issue: because there are multiple tables on the page, I can't figure out how to get Python to target the one I want: the Defense & Fumbles table. The code below is what I've got so far; it's from this tutorial, which uses a page from the same site, but one that only has a single table.
Sample code:
# url we are scraping
url = "https://www.pro-football-reference.com/teams/nwe/2017.htm"
# html from the given url
html = urlopen(url)
# make soup object of html
soup = BeautifulSoup(html)
# we see that soup is a beautifulsoup object
type(soup)
column_headers = [th.getText() for th in
                  soup.findAll('table', {"id": "defense").findAll('th')]
column_headers  # our column headers
Attempts made: I realized that the tutorial's method would not work for me, so I attempted to change the soup.findAll portion to target the specific table, but I repeatedly get an error saying:
AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
When changing it to find, the error becomes:
AttributeError: 'NoneType' object has no attribute 'find'
I'll be absolutely honest: I have no idea what I'm doing or what these mean. I'd appreciate any help in figuring out how to target that data and then scrape it.
Thank you,
You're missing a "}" in the dict after the word "defense". Try the below and see if it works.
column_headers = [th.getText() for th in
                  soup.findAll('table', {"id": "defense"}).findAll('th')]
First off, you want to use soup.find('table', {"id": "defense"}).findAll('th') - find one table, then find all of its 'th' tags.
The other problem is that the table with id "defense" is commented out in the html on that page:
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
<div class="overthrow table_container" id="div_defense">
<table class="sortable stats_table" id="defense" data-cols-to-freeze=2><caption>Defense & Fumbles Table</caption>
<colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>
<thead>
etc. I assume that javascript is un-hiding it. BeautifulSoup doesn't parse the text of comments, so you'll need to find the text of all the comments on the page as in this answer, look for one with id="defense" in it, and then feed the text of that comment into BeautifulSoup.
Like this:
from bs4 import Comment
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
defenseComment = next(c for c in comments if 'id="defense"' in c)
defenseSoup = BeautifulSoup(str(defenseComment))
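From there the original header lookup works against defenseSoup; a short sketch:
# The commented-out table is now real parsed HTML, so find() succeeds
column_headers = [th.getText() for th in
                  defenseSoup.find('table', {'id': 'defense'}).findAll('th')]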

Python web scraping - how to get resources with beautiful soup when page loads contents via JS?

So I am trying to scrape a table from a specific website using BeautifulSoup and urllib. My goal is to create a single list from all the data in this table. I have tried using this same code using tables from other websites, and it works fine. However, while trying it with this website the table returns a NoneType object. Can someone help me with this? I've tried looking for other answers online but I'm not having much luck.
Here's the code:
import urllib.request
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib.request.urlopen("http://www.teamrankings.com/ncaa-basketball/stat/free-throw-pct").read())
table = soup.find("table", attrs={'class': 'sortable'})
data = []
rows = table.findAll("tr")
for tr in rows:
    cols = tr.findAll("td")
    for td in cols:
        text = ''.join(td.find(text=True))
        data.append(text)
print(data)
It looks like this data is loaded via an ajax call:
You should target that url instead: http://www.teamrankings.com/ajax/league/v3/stats_controller.php
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup

params = {
    "type": "team-detail",
    "league": "ncb",
    "stat_id": "3083",
    "season_id": "312",
    "cat_type": "2",
    "view": "stats_v1",
    "is_previous": "0",
    "date": "04/06/2015"
}
content = urllib.request.urlopen(
    "http://www.teamrankings.com/ajax/league/v3/stats_controller.php",
    data=urllib.parse.urlencode(params).encode('utf8')).read()
soup = BeautifulSoup(content)
table = soup.find("table", attrs={'class': 'sortable'})
data = []
rows = table.findAll("tr")
for tr in rows:
    cols = tr.findAll("td")
    for td in cols:
        text = ''.join(td.find(text=True))
        data.append(text)
print(data)
Using your web inspector you can also view the parameters that are passed along with the POST request.
Generally the server on the other end will check for these values and reject your request if you do not have some or all of them. The above code snippet ran fine for me. I used urllib here because I generally prefer that library.
If the data loads in your browser it is possible to scrape it. You just need to mimic the request your browser sends.
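For what it's worth, a minimal sketch of the same POST with the requests library (same URL and params as above):
import requests
from bs4 import BeautifulSoup

resp = requests.post(
    "http://www.teamrankings.com/ajax/league/v3/stats_controller.php",
    data=params)  # the params dict as defined above
soup = BeautifulSoup(resp.content)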
The table on that website is being created via javascript, and so does not exist when you simply throw the source code at BeautifulSoup.
Either you need to start poking around with your web inspector of choice, and find out where the javascript is getting the data from - or you should use something like selenium to run a complete browser instance.
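A minimal selenium sketch, assuming a Chrome driver is available:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("http://www.teamrankings.com/ncaa-basketball/stat/free-throw-pct")
html = driver.page_source  # the full DOM after javascript has run
driver.quit()
table = BeautifulSoup(html).find("table", attrs={'class': 'sortable'})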
Since the table data is loaded dynamically, there may be some lag in updating it, for reasons such as network delay. So you can wait by adding a delay before reading the data.
Check whether the table data (i.e. its length) is empty; if so, read the table data again after some delay. This should help.
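A literal sketch of that retry idea (note it only helps if the table eventually appears in the plain HTML response, not when it is rendered client-side):
import time
import urllib.request
from bs4 import BeautifulSoup

url = "http://www.teamrankings.com/ncaa-basketball/stat/free-throw-pct"
table = None
for attempt in range(5):
    soup = BeautifulSoup(urllib.request.urlopen(url).read())
    table = soup.find("table", attrs={'class': 'sortable'})
    if table is not None:
        break
    time.sleep(2)  # wait before retrying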
I looked at the URL you have used. Since you are using a class selector for the table, make sure that the same class is not present elsewhere in the HTML.

scraping a table and getting more info from a link

I am using Python and BeautifulSoup to scrape a table. I have a pretty good handle on getting most of the information I need. Here is a shortened version of the table I am trying to scrape:
<tr> <td>Joseph Carter Abbott</td> <td>1868–1872</td> <td>North Carolina</td> <td>Republican</td> </tr>
<tr> <td>James Abdnor</td> <td>1981–1987</td> <td>South Dakota</td> <td>Republican</td> </tr>
<tr> <td>Hazel Abel</td> <td>1954</td> <td>Nebraska</td> <td>Republican</td> </tr>
http://en.wikipedia.org/wiki/List_of_former_United_States_senators
I want Name, Description, Years, State, Party.
The Description is the first paragraph of text on each person's page. I know how to get this independently, but I have no idea how to integrate it with Name, Years, State, and Party, because I have to navigate to a different page.
Oh, and I need to write it to a CSV.
Thanks!
Just to expand on @anrosent's answer: sending a request mid-parsing is one of the best and most consistent ways of doing this. However, your function that gets the description has to behave properly as well, because if it returns a NoneType error, the whole process falls apart.
The way I did this on my end is as follows (note that I'm using the Requests library and not urllib or urllib2, as I'm more comfortable with it -- feel free to change that to your liking; the logic is the same anyway):
from bs4 import BeautifulSoup as bsoup
import requests as rq
import csv

ofile = open("presidents.csv", "wb")
f = csv.writer(ofile)
f.writerow(["Name","Description","Years","State","Party"])

base_url = "http://en.wikipedia.org/wiki/List_of_former_United_States_senators"
r = rq.get(base_url)
soup = bsoup(r.content)
all_tables = soup.find_all("table", class_="wikitable")

def get_description(url):
    r = rq.get(url)
    soup = bsoup(r.content)
    desc = soup.find_all("p")[0].get_text().strip().encode("utf-8")
    return desc

complete_list = []
for table in all_tables:
    trs = table.find_all("tr")[1:]  # Ignore the header row.
    for tr in trs:
        tds = tr.find_all("td")
        first = tds[0].find("a")
        name = first.get_text().encode("utf-8")
        desc = get_description("http://en.wikipedia.org%s" % first["href"])
        years = tds[1].get_text().encode("utf-8")
        state = tds[2].get_text().encode("utf-8")
        party = tds[3].get_text().encode("utf-8")
        f.writerow([name, desc, years, state, party])

ofile.close()
However, this attempt ends at the line just after David Barton. If you check the page, it may have something to do with him occupying two rows. This is up to you to fix. The traceback is as follows:
Traceback (most recent call last):
File "/home/nanashi/Documents/Python 2.7/Scrapers/presidents.py", line 25, in <module>
name = first.get_text().encode("utf-8")
AttributeError: 'NoneType' object has no attribute 'get_text'
Also, notice how my get_description function is defined before the main process. This is obviously because you have to define the function first. Finally, my get_description function is not nearly perfect, as it can fail if by some chance the first p tag on an individual page is not the one you want.
In the sample of results, pay attention to the erroneous lines, like Maryon Allen's description. This is for you to fix as well.
Hope this points you in the right direction.
If you're using BeautifulSoup, you won't be navigating to the other page in a stateful, browser-like sense; you'll just be making another request for the other page, with a URL like wiki/name. So your code might look like:
import urllib, csv

with open('out.csv','w') as f:
    csv_file = csv.writer(f)
    # loop through the rows of the table
    for row in senator_rows:
        name = get_name(row)
        ...  # extract the other data from the <tr> elt
        senator_page_url = get_url(row)
        # get description from HTML text of senator's page
        description = get_description(get_html(senator_page_url))
        # write this row to the CSV file
        csv_file.writerow([name, ..., description])

# quick way to get the HTML text as string for given url
def get_html(url):
    return urllib.urlopen(url).read()
Note that in python 3.x you'll be importing and using urllib.request instead of urllib, and you'll have to decode the bytes the read() call will return.
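A short sketch of that Python 3 variant:
# Python 3 version of get_html; read() returns bytes, which must be decoded
from urllib.request import urlopen

def get_html(url):
    return urlopen(url).read().decode('utf-8')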
It sounds like you know how to fill in the other get_* functions I left in there, so I hope this helps!

How do I draw out specific data from an opened url in Python using urllib2?

I'm new to Python and am playing around with making a very basic web crawler. For instance, I have made a simple function to load a page that shows the high scores for an online game. So I am able to get the source code of the html page, but I need to draw specific numbers from that page. For instance, the webpage looks like this:
http://hiscore.runescape.com/hiscorepersonal.ws?user1=bigdrizzle13
where 'bigdrizzle13' is the unique part of the link. The numbers on that page need to be drawn out and returned. Essentially, I want to build a program where all I would have to do is type in 'bigdrizzle13', and it would output those numbers.
As another poster mentioned, BeautifulSoup is a wonderful tool for this job.
Here's the entire, ostentatiously-commented program. It could use a lot more error tolerance, but as long as you enter a valid username, it will pull all the scores from the corresponding web page.
I tried to comment as well as I could. If you're fresh to BeautifulSoup, I highly recommend working through my example with the BeautifulSoup documentation handy.
The whole program...
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
import sys

URL = "http://hiscore.runescape.com/hiscorepersonal.ws?user1=" + sys.argv[1]

# Grab page html, create BeautifulSoup object
html = urlopen(URL).read()
soup = BeautifulSoup(html)

# Grab the <table id="mini_player"> element
scores = soup.find('table', {'id':'mini_player'})

# Get a list of all the <tr>s in the table, skip the header row
rows = scores.findAll('tr')[1:]

# Helper function to return concatenation of all character data in an element
def parse_string(el):
    text = ''.join(el.findAll(text=True))
    return text.strip()

for row in rows:
    # Get all the text from the <td>s
    data = map(parse_string, row.findAll('td'))
    # Skip the first td, which is an image
    data = data[1:]
    # Do something with the data...
    print data
And here's a test run.
> test.py bigdrizzle13
[u'Overall', u'87,417', u'1,784', u'78,772,017']
[u'Attack', u'140,903', u'88', u'4,509,031']
[u'Defence', u'123,057', u'85', u'3,449,751']
[u'Strength', u'325,883', u'84', u'3,057,628']
[u'Hitpoints', u'245,982', u'85', u'3,571,420']
[u'Ranged', u'583,645', u'71', u'856,428']
[u'Prayer', u'227,853', u'62', u'357,847']
[u'Magic', u'368,201', u'75', u'1,264,042']
[u'Cooking', u'34,754', u'99', u'13,192,745']
[u'Woodcutting', u'50,080', u'93', u'7,751,265']
[u'Fletching', u'53,269', u'99', u'13,051,939']
[u'Fishing', u'5,195', u'99', u'14,512,569']
[u'Firemaking', u'46,398', u'88', u'4,677,933']
[u'Crafting', u'328,268', u'62', u'343,143']
[u'Smithing', u'39,898', u'77', u'1,561,493']
[u'Mining', u'31,584', u'85', u'3,331,051']
[u'Herblore', u'247,149', u'52', u'135,215']
[u'Agility', u'225,869', u'60', u'276,753']
[u'Thieving', u'292,638', u'56', u'193,037']
[u'Slayer', u'113,245', u'73', u'998,607']
[u'Farming', u'204,608', u'51', u'115,507']
[u'Runecraft', u'38,369', u'71', u'880,789']
[u'Hunter', u'384,920', u'53', u'139,030']
[u'Construction', u'232,379', u'52', u'125,708']
[u'Summoning', u'87,236', u'64', u'419,086']
Voila :)
You can use Beautiful Soup to parse the HTML.
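If you are on the newer bs4 package rather than the BeautifulSoup 3 used above, the same lookup might start like this (a sketch):
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://hiscore.runescape.com/hiscorepersonal.ws?user1=bigdrizzle13").read()
soup = BeautifulSoup(html, 'html.parser')
scores = soup.find('table', {'id': 'mini_player'})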

How to fetch some data conditionally with Python and Beautiful Soup

Sorry if you feel this has been asked before; I have read the related questions, but being quite new to Python, I could not find how to write this request in a clean manner.
For now I have this minimal Python code:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import re
import urllib2

br = Browser()
br.open("http://www.atpworldtour.com/Rankings/Singles.aspx")

filename = "rankings.html"
FILE = open(filename, "w")
html = br.response().read()
soup = BeautifulSoup(html)

links = soup.findAll('a', href=re.compile("Players"))
for link in links:
    print link['href']

FILE.writelines(html)
It retrieves all the links whose href contains the word Players.
Now the HTML I need to parse looks something like this:
<tr>
<td>1</td>
<td>Federer, Roger (SUI)</td>
<td>10,550</td>
<td>0</td>
<td>19</td>
</tr>
The first cell (1) contains the rank of the player.
I would like to be able to retrieve this data in a dictionary:
rank
name of the player
link to the detailed page (here /Tennis/Players/Top-Players/Roger-Federer.aspx)
Could you give me some pointers, or if this is easy enough, help me build that piece of code? I am not sure how to formulate the request in Beautiful Soup.
Anthony
Searching for the players using your method will work, but will return 3 results per player. It is easier to search for the table itself and then iterate over the rows (except the header):
table = soup.find('table', 'bioTableAlt')
for row in table.findAll('tr')[1:]:
    cells = row.findAll('td')
    # retrieve data from cells...
To get the data you need:
rank = cells[0].string
player = cells[1].a.string
link = cells[1].a['href']
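Putting it together into the dictionary shape the question asks for (a sketch using the same table and cells as above):
players = []
for row in table.findAll('tr')[1:]:
    cells = row.findAll('td')
    players.append({
        'rank': cells[0].string,
        'name': cells[1].a.string,
        'link': cells[1].a['href'],  # e.g. /Tennis/Players/Top-Players/Roger-Federer.aspx
    })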
