How to fix complicated HTML encoding for URL in python script?

I have a nightmare situation on my hands (or maybe it is easy, I don't know)...So I have a small function that runs in a rather large python script...I have everything worked out in the larger script, and at the end the script will call our web map services and show the parcels in question...We have 20K parcels, and ONLY 10 of them have a '%' in the Deedholder name. So this works over 99% of the time, but there is always that 1% (or way less in this case)
The problem is, in the rare situation where there is a percent sign in the deedholder name, when I supply a url it cannot find the query. So I have tested a ton of names, and it only will not work when there is a percent sign in the name.
So the prefix will look like this:
'https://cedar.integritygis.com/default.aspx?ql=Parcel&qf=REALDATA_DEEDHOLDER&qv='
and the name is added to the end, which looks like this:
'COOPER MICHAEL A & DEBRA K'
My code can easily replace the spaces with '%20' and the & with '%26'...etc. But what do I do when THIS is the deedholder name:
'SIEBELS LAWRENCE J (75%) & LOUISE F TRUST (25%)'
I cannot successfully get this query to work. Here is my test code with just the function in question:
import webbrowser, time
def FixURL(string):
    ## string = string.replace('%','~')
    print string
    fix_dict = {' ':'%20','!':'%21','"':'%22','#':'%23','$':'%24',
                '&':'%26',"'":'%27','(':'%28',')':'%29',
                '*':'%2A','+':'%2b','.':'%2E','/':'%2F',':':'%3A',
                ';':'%3B','?':'%3F','@':'%40','{':'%7B','}':'%7D'}
    for k,v in fix_dict.iteritems():
        if k in string:
            string = string.replace(k,v)
    ## return string.replace('~','%25')
    return string
if __name__ == '__main__':
    # testing
    easy = FixURL('COOPER MICHAEL A & DEBRA K')
    prefix = 'https://cedar.integritygis.com/default.aspx?ql=Parcel&qf=REALDATA_DEEDHOLDER&qv='
    url = '{}{}'.format(prefix,easy)
    print easy
    webbrowser.open(url)
    time.sleep(15) # give it time to work
    hard = FixURL('SIEBELS LAWRENCE J (75%) & LOUISE F TRUST (25%)')
    print hard
    url = '{}{}'.format(prefix,hard)
    webbrowser.open(url)
I cannot figure out how to 'trick' it...You can see my unsuccessful attempts are commented out. Does anyone have a fix? One thing I am thinking of doing is removing the space from the dictionary and using '%20'.join(string.split()) and testing each item in the list for replacement values for the url...Does anyone have any ideas? It seems I have been squeezed by Python yet again. Thanks.
EDIT:
I have since scratched the entire function and am just using urllib.quote(). I ran this as a test:
import webbrowser, urllib, time
prefix = 'https://cedar.integritygis.com/default.aspx?ql=Parcel&qf=REALDATA_DEEDHOLDER&qv='
easy = urllib.quote('COOPER MICHAEL A & DEBRA K')
url = '{}{}'.format(prefix,easy)
print easy
webbrowser.open(url)
time.sleep(15) # give it time to work
hard = urllib.quote('SIEBELS LAWRENCE J (75%) & LOUISE F TRUST (25%)')
print hard
url = '{}{}'.format(prefix,hard)
webbrowser.open(url)
This is supposed to zoom to the parcels owned by the name supplied...The first one works, the second does not because of the % in the parentheses (I think). I get the 'ol query returned no results error.

You can use python's standard urllib to do this.
http://docs.python.org/2/library/urllib.html#utility-functions
Look at the utility functions. urllib.quote will probably do the job.
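For reference, here is a minimal sketch of the same idea on Python 3, where urllib.quote lives at urllib.parse.quote. The helper name build_parcel_url is hypothetical, and this sketch cannot confirm whether the map service accepts the resulting query; it only shows that quote() turns a literal % into %25 along with the other reserved characters.

```python
from urllib.parse import quote

PREFIX = ('https://cedar.integritygis.com/default.aspx'
          '?ql=Parcel&qf=REALDATA_DEEDHOLDER&qv=')

def build_parcel_url(deedholder):
    """Hypothetical helper: percent-encode the name and append it to the prefix."""
    # safe='' also encodes '/', which quote() leaves untouched by default.
    return PREFIX + quote(deedholder, safe='')

encoded = quote('SIEBELS LAWRENCE J (75%) & LOUISE F TRUST (25%)', safe='')
print(encoded)
print(build_parcel_url('COOPER MICHAEL A & DEBRA K'))
```

Because quote() handles the % itself, there is no need to order the replacements by hand the way a dictionary of substitutions would require.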

Related

Why isn't this function to scrape a table with Selenium working as intended?

I'm trying to scrape the table on this website with Selenium (you need to make a quick account and login to see it). To get the first column, I use the following code, which works:
first_column = driver.find_elements_by_xpath("//div[@class='top-line-table']//tbody//td[2]")
for i in first_column:
    print(i.text)
Prints:
Bernie Sanders
Joe Biden
Michael Bloomberg
Elizabeth Warren
However, when I try making it into a function that I can input the column I want into, it only returns the first value, "Bernie Sanders". Here's my code to define, call, and print the function:
def scrape_column(column):
    raw = driver.find_elements_by_xpath(f"//div[@class='top-line-table']//tbody//td[{column}]")
    for ii in raw:
        return [ii.text]
candidates = scrape_column("2")
print(candidates)
I don't know why it only returns the first value, I've tried a lot of things and it still doesn't work. Help is much appreciated!

The reason is that a return statement ends the function immediately. Your loop returns on its first iteration, so only the first value ever comes back. If you want to get all candidate names as a list, do something like this:
def scrape_column(column):
    names = []
    raw = driver.find_elements_by_xpath(f"//div[@class='top-line-table']//tbody//td[{column}]")
    for ii in raw:
        names.append(ii.text)
    return names
candidates = scrape_column("2")
print('\n'.join(candidates))
Look up str.join, it is pretty neat stuff.
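The difference can be reproduced without a browser. A small sketch, using a plain list of strings in place of the Selenium elements (the function names first_only and collect_all are made up for illustration):

```python
def first_only(cells):
    for cell in cells:
        return [cell]      # return exits the function on the first iteration

def collect_all(cells):
    names = []
    for cell in cells:
        names.append(cell)
    return names           # return only after the loop has finished

cells = ["Bernie Sanders", "Joe Biden", "Michael Bloomberg", "Elizabeth Warren"]
print(first_only(cells))
print('\n'.join(collect_all(cells)))
```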

Python: print/get first sentence of each paragraph

This is the code I have, but it prints the whole paragraph. How to print the first sentence only, up to the first dot?
from bs4 import BeautifulSoup
import urllib.request,time
article = 'https://www.theguardian.com/science/2012/\
oct/03/philosophy-artificial-intelligence'
req = urllib.request.Request(article, headers={'User-agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
soup = BeautifulSoup(html,'lxml')
def print_intro():
    if len(soup.find_all('p')[0].get_text()) > 100:
        print(soup.find_all('p')[0].get_text())
This code prints:
To state that the human brain has capabilities that are, in some
respects, far superior to those of all other known objects in the
cosmos would be uncontroversial. The brain is the only kind of object
capable of understanding that the cosmos is even there, or why there
are infinitely many prime numbers, or that apples fall because of the
curvature of space-time, or that obeying its own inborn instincts can
be morally wrong, or that it itself exists. Nor are its unique
abilities confined to such cerebral matters. The cold, physical fact
is that it is the only kind of object that can propel itself into
space and back without harm, or predict and prevent a meteor strike on
itself, or cool objects to a billionth of a degree above absolute
zero, or detect others of its kind across galactic distances.
BUT I ONLY want it to print:
To state that the human brain has capabilities that are, in some
respects, far superior to those of all other known objects in the
cosmos would be uncontroversial.
Thanks for help

Split the text on that dot; for a single split, using str.partition() is faster than str.split() with a limit:
text = soup.find_all('p')[0].get_text()
if len(text) > 100:
    text = text.partition('.')[0] + '.'
print(text)
If you only need to process the first <p> element, use soup.find() instead:
text = soup.find('p').get_text()
if len(text) > 100:
    text = text.partition('.')[0] + '.'
print(text)
For your given URL, however, the sample text is found as the second paragraph:
>>> soup.find_all('p')[1]
<p><span class="drop-cap"><span class="drop-cap__inner">T</span></span>o state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial. The brain is the only kind of object capable of understanding that the cosmos is even there, or why there are infinitely many prime numbers, or that apples fall because of the curvature of space-time, or that obeying its own inborn instincts can be morally wrong, or that it itself exists. Nor are its unique abilities confined to such cerebral matters. The cold, physical fact is that it is the only kind of object that can propel itself into space and back without harm, or predict and prevent a meteor strike on itself, or cool objects to a billionth of a degree above absolute zero, or detect others of its kind across galactic distances.</p>
>>> text = soup.find_all('p')[1].get_text()
>>> text.partition('.')[0] + '.'
'To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial.'

def print_intro():
    if len(soup.find_all('p')[0].get_text()) > 100:
        paragraph = soup.find_all('p')[0].get_text()
        phrase_list = paragraph.split('.')
        print(phrase_list[0])

Split the paragraph at the first period. The second argument, 1, specifies maxsplit and saves you from unnecessary extra splitting.
def print_intro():
    if len(soup.find_all('p')[0].get_text()) > 100:
        my_paragraph = soup.find_all('p')[0].get_text()
        my_list = my_paragraph.split('.', 1)
        print(my_list[0])

You can use find('.'); it returns the index of the first occurrence of what you're looking for.
So if the paragraph is stored in a variable called paragraph:
sentence_index = paragraph.find('.')
# add 1 so the slice includes the '.'
sentence_index += 1
print(paragraph[0: sentence_index])
Obviously this is missing the checks, such as whether the string in paragraph contains a '.' at all; find() returns -1 if it does not find the substring you're looking for.
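A minimal sketch of that approach with the missing check filled in. The function name first_sentence is hypothetical, and treating the first '.' as the end of the sentence is an assumption that breaks on abbreviations or decimals:

```python
def first_sentence(paragraph):
    """Return the text up to and including the first '.', or all of it if there is none."""
    end = paragraph.find('.')
    if end == -1:
        # No period found: fall back to the whole paragraph.
        return paragraph
    return paragraph[:end + 1]

print(first_sentence("To state the obvious would be uncontroversial. The brain is unique."))
print(first_sentence("No period here"))
```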

Using Steam WebAPI to get total time playing a game

I'm trying to get access to the amount of TF2 time played using the Steam API. I'm currently using:-
http://api.steampowered.com/ISteamUserStats/GetUserStatsForGame/v0002/?appid=440&key=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&steamid=xxxxxxxxxxxxxxxxx&format=xml
And then filtering through the XML and extracting the time played for each of the classes (e.g. pyro (Pyro.accum.iPlayTime), etc). This worked OK, but I think missing the MVM classes made my final value incorrect (my code, in Python, returned 977 when online sites say over 1600 hours). Adding the MVM classes (plus possibly others) may provide the correct result, but it's making the code very long-winded.
So I was wondering if there is a call in the Steam Web API that will just give me the total time played without having to go though all the extracting and adding?
I have looked through the Steam Web API Developer page, but can't find any reference to what I'm after.
Added Code:
if __name__ == '__main__':
    import urllib2
    import xml.etree.ElementTree as ET
    import datetime
    timeKISA = 0
    playerStatsKISA = urllib2.urlopen('http://api.steampowered.com/ISteamUserStats/GetUserStatsForGame/v0002/?appid=440&key=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&steamid=xxxxxxxxxxxxxxxxx&format=xml')
    statsKISA = playerStatsKISA.read()
    theStatsKISA = ET.fromstring(statsKISA)
    for stat in theStatsKISA.findall("./stats/stat"):
        if stat.find('name').text.startswith('Scout.accum.iPlayTime') or \
           stat.find('name').text.startswith('Soldier.accum.iPlayTime') or \
           stat.find('name').text.startswith('Engineer.accum.iPlayTime') or \
           stat.find('name').text.startswith('Medic.accum.iPlayTime') or \
           stat.find('name').text.startswith('Spy.accum.iPlayTime') or \
           stat.find('name').text.startswith('Sniper.accum.iPlayTime') or \
           stat.find('name').text.startswith('Demoman.accum.iPlayTime') or \
           stat.find('name').text.startswith('Heavy.accum.iPlayTime') or \
           stat.find('name').text.startswith('Pyro.accum.iPlayTime'):
            timeKISA = timeKISA + int(stat.find('value').text)
        finalTimeKISA = timeKISA / 60 / 60
        KISATime = ('KISA_Time=' + str(finalTimeKISA) + ' hours')
        print KISATime
Thank you.
Markus

Pulling my comments into an answer,
It is my understanding that the *.accum.iPlayTime fields are cumulative for your play time as that class, regardless of game mode or map. Based on my own stats (and a glance at a few others on my friends list), this matches exactly what Steam Community reports as play time. Additionally, the play time shown on your TF2 Achievements page matches these fields.
A couple of notes:
The summary page on a player's profile does not seem to match the actual stats that the achievements page shows. I'm unsure if this is a Steam Community issue or a summary of additional fields. However, the achievements page, which has detailed breakdowns of play time by class, longest life, etc., is using the same data as the GetUserStatsForGame API call.
You have a very minor formatting issue in your code. The very last print KISATime is indented one too many times and therefore prints the KISA_Time = line multiple times. If you pull it out of the for loop, you will only get the print line once.
If you change your finalTimeKISA = timeKISA / 60 / 60 to divide by 60.0 instead, you will get a decimal answer. Otherwise, as is, integer division truncates the result to a whole number of hours.
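The notes above can be sketched as follows, written for Python 3. The XML below is made-up sample data that assumes the layout implied by the question's ./stats/stat path and name/value fields; the root element name, the values and the function name total_hours are assumptions. Matching on the suffix picks up every class that has an .accum.iPlayTime field, which may or may not be what you want.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample response; the real one comes from GetUserStatsForGame.
SAMPLE = """<playerstats><stats>
<stat><name>Scout.accum.iPlayTime</name><value>3600</value></stat>
<stat><name>Pyro.accum.iPlayTime</name><value>5400</value></stat>
<stat><name>Scout.accum.iNumberOfKills</name><value>42</value></stat>
</stats></playerstats>"""

def total_hours(xml_text):
    """Sum every *.accum.iPlayTime value (seconds) and convert to hours."""
    root = ET.fromstring(xml_text)
    seconds = 0
    for stat in root.findall("./stats/stat"):
        if stat.find("name").text.endswith(".accum.iPlayTime"):
            seconds += int(stat.find("value").text)
    # Divide by a float once, after the loop, so the result keeps its fraction.
    return seconds / 3600.0

print("KISA_Time=%.1f hours" % total_hours(SAMPLE))
```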

Retrieving a lot url addresses

Edit: Just for clarification I am using python, and would like to do this within python.
I am in the middle of collecting data for a research project at our university. Basically I need to scrape a lot of information from a website that monitors the European Parliament. Here is an example of what the URL of one page looks like:
http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-0190&language=EN
The parts after reference= in the address refer to:
A7 = Parliament in session (previous parliaments are A6 etc.),
2010 = year,
0190 = number of the file.
What I want to do is to create a variable that has all the urls for different parliaments, so I can loop over this variable and scrape the information from the websites.
P.S: I have tried this:
number = range(1,190,1)
for i in number:
    search_url = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-" + str(number[i]) +"&language=EN"
    results = search_url
    print results
but this gives me the following error:
Traceback (most recent call last):
File "", line 7, in
IndexError: list index out of range

If I understand correctly, you just want to be able to loop over the parliaments?
i.e. you want A7, A6, A5...?
If that's what you want, a simple loop could handle it:
for p in xrange(7,0, -1):
    parliment = "A%d" % p
    print parliment
For the other values, similar loops would work just as well:
for year in xrange(2010, 2000, -1):
    print year
for filenum in xrange(100,200):
    fnum = "%.4d" % filenum
    print fnum
You could easily nest your loops in the proper order to generate the combination(s) you need. HTH!
Edit:
String formatting is super useful, and here's how you can do it with your example:
# Just create a string with the format specifier in it: %.4d - a [d]ecimal with a
# precision/width of 4 - so instead of 3 you'll get 0003
search_url = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-%.4d&language=EN"
# xrange produces the numbers lazily, one at a time, instead of building a
# full list, and you can iterate over it just like a collection.
# 1 is the default step, so no need for it in this case
for number in xrange(1,190):
    print search_url % number
String formatting takes a string with a variety of specifiers - you'll recognize them because they have % in them - followed by % and a tuple containing the arguments to the format string.
If you want to add the year and parliament, change the string to this:
search_url = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A%d-%d-%.4d&language=EN"
where the important changes are here:
reference=A%d-%d-%.4d&language=EN
That means you'll need to pass three integers (so parliment must be the number itself here, e.g. 7, not the string "A7"), like so:
print search_url % (parliment, year, number)
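Putting the pieces together, a minimal Python 3 sketch that builds the whole list of URLs up front. The function name build_urls is hypothetical, and the ranges passed in are placeholders: which years belong to which parliament and which file numbers actually exist are not known here, so some generated URLs may point at nothing.

```python
TEMPLATE = ("http://www.europarl.europa.eu/sides/getDoc.do"
            "?type=REPORT&mode=XML&reference=A%d-%d-%.4d&language=EN")

def build_urls(parliaments, years, numbers):
    """Return one URL per (parliament, year, file number) combination."""
    return [TEMPLATE % (p, y, n)
            for p in parliaments
            for y in years
            for n in numbers]

urls = build_urls([7], [2010], range(1, 190))
print(len(urls))
print(urls[0])
```

Iterating over range(1, 190) directly, rather than indexing back into it with number[i], is also what avoids the IndexError from the question.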

Can you use python and wget ? Loop through the sessions that exist, and create a string to give to wget? Or is that overkill?

Sorry I can't give this as a comment, but I don't have a high enough score yet.
Looking at the code you quoted in the comment above, your problem is you are trying to add a string and an integer. While some languages will do on the fly conversion (useful when it works but confusing when it doesn't), you have to explicitly convert it with str().
It should be something like:
"http://firstpartofurl" + str(number[i]) + "restofurl"
or, you can use string formatting (using % etc. as in Wayne's answer).

Use selenium. Since it controls a real browser, it can handle sites using complex javascript. Many language bindings are available, including python.

lists and sublists

I use this code to split some data into a list with three sublists.
It should split wherever there is a * or a -, but it also splits on the \n\n * sequences inside the text, and I don't know why.
I don't want those to be treated as separators. Can someone tell me what I'm doing wrong?
this is the data
*Quote of the Day
-Education is the ability to listen to almost anything without losing your temper or your self-confidence - Robert Frost
-Education is what survives when what has been learned has been forgotten - B. F. Skinner
*Fact of the Day
-Fractals, an important part of chaos theory, are very useful in studying a huge amount of areas. They are present throughout nature, and so can be used to help predict many things in nature. They can also help simulate nature, as in graphics design for movies (animating clouds etc), or predict the actions of nature.
-According to a recent survey by Just-Eat, not everyone in The United Kingdom actually knows what the Scottish delicacy, haggis is. Of the 1,623 British people polled:\n\n * 18% of Brits thought haggis was some sort of Scottish animal.\n\n * 15% thought it was a Scottish musical instrument.\n\n * 4% thought it was a character from Harry Potter.\n\n * 41% didn't even know what Scotland's national dish was.\n\nWhile a small number of Scots admitted not knowing what haggis was either, they also discovered that 68% of Scots would like to see Haggis delivered as takeaway.
-With the growing concerns involving Facebook and its ever changing privacy settings, a few software developers have now engineered a website that allows users to trawl through the status updates of anyone who does not have the correct privacy settings to prevent it.\n\nNamed Openbook, the ultimate aim of the site is to further expose the problems with Facebook and its privacy settings to the general public, and show people just how easy it is to access this type of information about complete strangers. The site works as a search engine so it is easy to search terms such as 'don't tell anyone' or 'I hate my boss', and searches can also be narrowed down by gender.
*Pet of the Day
-Scottish Terrier
-Land Shark
-Hamster
-Tse Tse Fly
END
i use this code:
contents = open("data.dat").read()
data = contents.split('*') #split the data at the '*'
newlist = [item.split("-") for item in data if item]
but the result is wrong; it is not the list I have to get.

The "\n\n" is part of the input data, so it's preserved in Python. Just add a strip() to remove it from each string in the sublists:
finallist = [[sub.strip() for sub in item] for item in newlist]
See the strip() docs: http://docs.python.org/library/stdtypes.html#str.strip
UPDATED FROM COMMENT:
finallist = [[sub.replace("\\n", "\n").strip() for sub in item] for item in newlist]

open("data.dat").read() reads every character in the file, not only those you want.
If you don't need the '\n' characters you can try contents.replace("\n",""), or read the file line by line (not the whole content) and strip the trailing '\n' from each line.

This is going to split on any asterisk you have in the text as well.
A better implementation would be to do something like:
lines = []
for line in open("data.dat"):
    if line.lstrip().startswith("*"):
        lines.append([line.strip()]) # append a list with your line
    elif line.lstrip().startswith("-"):
        lines[-1].append(line.strip())
For more homework, research what's happening when you use the open() function in this way.

The following solves your problem, I believe:
result = [ [subitem.replace(r'\n\n', '\n') for subitem in item.split('\n-')]
           for item in open('data.txt').read().split('\n*') ]
# now let's pretty print the result
for i in result:
    print '***', i[0], '***'
    for j in i[1:]:
        print '\t--', j
    print
Note that I split on newline + * or -, so it won't split on dashes inside the text. I also replace the literal character sequence \n\n (r'\n\n') with a newline character '\n'. The one-liner expression is a list comprehension, a way to construct a list in one go, without multiple .append() or + calls.
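For comparison, a line-by-line sketch in Python 3 that only treats * and - as markers when they start a line. It assumes, as the answers above do, that the \n\n sequences in the file are literal backslash-n text rather than real line breaks. The function name parse_sections and the shortened sample string are made up for illustration.

```python
def parse_sections(text):
    """Group '-' lines under the preceding '*' heading, one sublist per heading."""
    sections = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("*"):
            # A heading starts a new sublist.
            sections.append([line[1:]])
        elif line.startswith("-") and sections:
            # Turn the literal backslash-n text into real newlines.
            sections[-1].append(line[1:].replace("\\n", "\n"))
    return sections

sample = ("*Pet of the Day\n-Scottish Terrier\n-Land Shark\n"
          "*Fact of the Day\n-Of those polled:\\n\\n * 18% thought x - y\nEND")
for section in parse_sections(sample):
    print(section)
```

Lines such as END that start with neither marker are ignored, and the * and - characters inside an entry stay untouched because only the first character of each line is checked.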
