Retrieving a lot url addresses


Edit: Just for clarification I am using python, and would like to do this within python.
I am in the middle of collecting data for a research project at our university. Basically, I need to scrape a lot of information from a website that monitors the European Parliament. Here is an example of what the URL of one page looks like:
http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-0190&language=EN
The numbers after the reference part of the address refer to:
A7 = Parliament in session (previous parliaments are A6 etc.),
2010 = year,
0190 = number of the file.
What I want to do is to create a variable that has all the urls for different parliaments, so I can loop over this variable and scrape the information from the websites.
P.S: I have tried this:
number = range(1,190,1)
for i in number:
    search_url = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-" + str(number[i]) +"&language=EN"
    results = search_url
    print results
but this gives me the following error:
Traceback (most recent call last):
File "", line 7, in
IndexError: list index out of range

If I understand correctly, you just want to be able to loop over the parliaments,
i.e. you want A7, A6, A5...?
If that's what you want, a simple loop could handle it:
for p in xrange(7, 0, -1):
    parliment = "A%d" % p
    print parliment
For the other values, similar loops would work just as well:
for year in xrange(2010, 2000, -1):
    print year
for filenum in xrange(100, 200):
    fnum = "%.4d" % filenum
    print fnum
You could easily nest your loops in the proper order to generate the combination(s) you need. HTH!
Edit:
String formatting is super useful, and here's how you can do it with your example:
# Just create a string with the format specifier in it: %.4d - a [d]ecimal with a
# precision/width of 4 - so instead of 3 you'll get 0003
search_url = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-%.4d&language=EN"
# xrange produces its values lazily, one at a time, instead of building a list,
# and you can iterate over it just like a collection.
# 1 is the default step, so no need for it in this case
for number in xrange(1,190):
    print search_url % number
String formatting takes a string with a variety of specifiers - you'll recognize them because they have % in them - followed by % and a tuple containing the arguments to the format string.
If you want to add the year and parliament, change the string to this:
search_url = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A%d-%d-%.4d&language=EN"
where the important changes are here:
reference=A%d-%d-%.4d&language=EN
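That means you'll need to pass 3 integers. Note that the first one must be the parliament number itself (p in the loop above, e.g. 7), not the string "A7", because the A is already part of the format string:
print search_url % (p, year, number)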
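The nested loops and the format string described above can be combined into one small helper. The following is only a sketch, written for Python 3; the names BASE_URL and build_urls are made up for illustration, and many parliament/year/number combinations will not correspond to a real document. It iterates over the values directly instead of indexing with number[i], which is what raised the IndexError in the question (the list has 189 items, indices 0 to 188, while i runs up to 189).

```python
from itertools import product

# URL template taken from the question; the three fields are filled in per document.
BASE_URL = (
    "http://www.europarl.europa.eu/sides/getDoc.do"
    "?type=REPORT&mode=XML&reference=A{parliament}-{year}-{number:04d}&language=EN"
)


def build_urls(parliaments, years, numbers):
    """Return one URL per (parliament, year, number) combination."""
    return [
        BASE_URL.format(parliament=p, year=y, number=n)
        for p, y, n in product(parliaments, years, numbers)
    ]


# Same range as in the question: files 1 to 189 of parliament 7 in 2010.
urls = build_urls([7], [2010], range(1, 190))
print(len(urls))
print(urls[0])
```

The resulting list can then be looped over to fetch each page.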

Can you use Python and wget? Loop through the sessions that exist, and create a string to give to wget? Or is that overkill?

Sorry I can't give this as a comment, but I don't have a high enough score yet.
Looking at the code you quoted in the comment above, your problem is you are trying to add a string and an integer. While some languages will do on the fly conversion (useful when it works but confusing when it doesn't), you have to explicitly convert it with str().
It should be something like:
"http://firstpartofurl" + str(number[i]) + "restofurl"
or you can use string formatting (using % etc., as in Wayne's answer).

Use selenium. Since it controls a real browser, it can handle sites using complex JavaScript. Many language bindings are available, including Python.

Related

Extract information in a line of text with a format from user input

I am trying to make a program which takes song files and a format as input and writes metatags to the file. Here are a few examples of the call:
./parser '%n_-_%t.mp3' 01_-_Respect.mp3 gives me track=01; title=Respect
./parser '%b._%n.%t.mp3' The_Queen_of_Soul._01.Respect.mp3 gives me album=The_Queen_of_Soul; track=01; title=Respect
./parser '%a-%b._%n.%t.mp3' Aretha_Franklin-The_Queen_of_Soul._01.Respect.mp3 gives me artist=Aretha_Franklin; track=01; title=Respect
./parser '%a_-_%b_-_%n_-_%t.mp3' Aretha_Franklin_-_The_Queen_of_Soul_-_01_-_Respect.mp3 gives me artist=Aretha_Franklin; track=01; title=Respect
For a call on the file 01_-_Respect.mp3, I'd like to have one variable containing 01 and another containing Respect.
Here %n and %t represent the number and the title of the song, respectively. The problem is that I don't know how to extract this information in bash (or possibly in Python).
My biggest problem is that I don't know the format in advance!
Note: There is more information than this, for example %b for the album, %a for the artist etc.

Well, you can use the string method split to split the string by _-_.
To take the input from the command line, you can use sys.argv.
Here's an example:
import sys
number,title = sys.argv[1].split("_-_")
Update:
Surely you can pass the pattern as a first argument and the file as the second argument like that:
import sys
pattern = sys.argv[1]
number,title = sys.argv[2].split(pattern)
Now, if you need more complex and dynamic processing, then regex is your winning card!
To write a good regex, you have to understand your data and your problem, or you'll end up writing a glitchy regex.
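Since the format is only known at run time, one way to follow the regex suggestion is to translate the format string itself into a regular expression with one named group per specifier. The sketch below is written for Python 3; the FIELDS mapping and the function names are assumptions made for illustration, based on the specifiers mentioned in the question (%a, %b, %n, %t). Each field is matched lazily, so a filename whose fields contain the separator text may be split differently than intended.

```python
import re

# Assumed mapping from format specifiers to field names (from the question).
FIELDS = {"a": "artist", "b": "album", "n": "track", "t": "title"}


def format_to_regex(fmt):
    """Turn a format such as '%n_-_%t.mp3' into a compiled regex."""
    parts = []
    i = 0
    while i < len(fmt):
        if fmt[i] == "%" and i + 1 < len(fmt) and fmt[i + 1] in FIELDS:
            # A specifier becomes a lazy named group.
            parts.append("(?P<%s>.+?)" % FIELDS[fmt[i + 1]])
            i += 2
        else:
            # Everything else must match literally.
            parts.append(re.escape(fmt[i]))
            i += 1
    return re.compile("^" + "".join(parts) + "$")


def parse_filename(fmt, filename):
    """Return a dict of the extracted fields, or None if nothing matches."""
    match = format_to_regex(fmt).match(filename)
    return match.groupdict() if match else None


print(parse_filename("%n_-_%t.mp3", "01_-_Respect.mp3"))
print(parse_filename("%a-%b._%n.%t.mp3",
                     "Aretha_Franklin-The_Queen_of_Soul._01.Respect.mp3"))
```

On the command line, the format and the filename would come from sys.argv[1] and sys.argv[2], as in the example above.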

You can elaborate on this. It is a very simple example, though.
import re
# \d+ matches any run of digits, so track numbers such as 02 or 15 work too
p = re.compile(r'(\d+)_-_(.*)\.mp3')
title = '01_-_Respect.mp3'
p.findall(title)
Output
[('01', 'Respect')]
I use this page to play with regex.
Update
Since the format is given, go with string slicing. OK, this is pretty limited to the specific case.
>>> number = title[:title.find('_')]
>>> number
'01'
>>> track = title[len(number) + 3:len(title)-4]
>>> track
'Respect'

Try this code
(assuming the argument is given at runtime):
tmp=$1
num=${tmp%%_*}
title=$(echo "${tmp##*_}" | cut -d. -f1)
The variables num and title will store the parts of the argument.

Selenium - How to add integer variable in xpath using python

I'm new to selenium/python, and this is my problem:
I have a simple site with a couple of news items.
I am trying to write a script that iterates over all the news items, opens each one, does something, and goes back to the list.
All the news items have the same xpath except for the last symbol. I tried to put this symbol in a variable and loop over all the news items, incrementing the variable after every visited item:
x = len(driver.find_elements_by_class_name('cards-news-event'))
print (x)
for i in range(x):
    driver.find_element_by_xpath('/html/body/div[1]/div[1]/div/div/div/div[2]/div/div[3]/div/div[1]/div/a["'+i+'"]').click()
    do something
    i = i+1
Python returns the error: "Expected type 'str', got 'int' instead". I have been googling it for a couple of hours but really can't deal with it.
I would really appreciate any help.

You are trying to add a string and an int, which is why you get the exception. Use str(i) instead of i, or build the string with format:
xpath_string = '/html/body/div[1]/div[1]/div/div/div/div[2]/div/div[3]/div/div[1]/div/a[{0}]'.format(str(i))
driver.find_element_by_xpath(xpath_string).click()
In the above, the {0} is replaced with str(i). You can use .format to substitute multiple variables in a string by providing them as positional values; it is more elegant and easier to use than using + to concatenate strings.
Refer to: http://thepythonguru.com/python-string-formatting/
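One detail worth checking when looping: XPath positions start at 1, while range(x) starts at 0, so a[0] would not match any element. The sketch below (Python 3) only builds the XPath strings and does not start a browser; LINK_XPATH and news_xpaths are names invented for this illustration, and the path itself is copied from the question.

```python
# Path from the question, with a placeholder for the position of the link.
LINK_XPATH = (
    "/html/body/div[1]/div[1]/div/div/div/div[2]/div/div[3]/div/div[1]/div/a[{0}]"
)


def news_xpaths(count):
    """Return one XPath per news item; XPath positions start at 1, not 0."""
    return [LINK_XPATH.format(position) for position in range(1, count + 1)]


paths = news_xpaths(3)
for path in paths:
    print(path)
    # With a live session, the call from the answer would go here:
    # driver.find_element_by_xpath(path).click()
```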

Python Regex to pull multiple pieces of data out of a data structure

I need a regex to pull tidbits out of the following data structure. This data is in a javascript variable. I'm using BeautifulSoup and Mechanize to make the request and parse the page but I don't see how I can get what I need without a regex. More details follow below.
raw data:
var d = [[909.0546875,842.3125,32429,'TownID: 32429','GREY','circle_grey.png',970,'goldpimp\'s city','','N/A'],[1434.8890625,1365.41484375,32143,'TownID: 32143','GREY','circle_grey.png',899,'1..','','N/A'],[1553.92265625,1117.43046875,32326,'TownID: 32326','GREY','circle_grey.png',522,'Avacyns Pantheon','','N/A'],[1305.17265625,1328.6421875,28927,'TownID: 28927','GREY','circle_grey.png',3554,'Furiocity','','N/A'],...(cont.)
For example on the first line I need to pull TownID: 32429, 970, and goldpimp\'s city
I need to do this for the whole data structure to get each townID and associated information. Sorry for the newbie question but regex really racks my brain.

d is a list, and you can access list items by index. So why the regex? You don't need it.
To get your result:
for city in d:
    print "%s %s %s" % (city[3], city[6], city[7])
The print statement prints the text to the console. Each %s is replaced, in order, with a value from the tuple on the right (the first %s is replaced with city[3], the second with city[6] and the third with city[7]).
EDIT
OK, if d comes from a JavaScript source, you need to convert it to Python data using json.loads (which only accepts valid JSON, so the strings must be double-quoted), store the result in a variable, and access it with the earlier method (see the info about Python's json module here for 2.7 and here for 3.3).
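The sample in the question uses single-quoted strings with \' escapes, which json.loads rejects. One possible workaround, sketched here in Python 3, is ast.literal_eval, because this particular array also happens to be a valid Python literal. The sketch assumes that the array ends with ]; in the page source (the sample in the question is truncated) and that it contains only numbers and strings; JavaScript values such as true or null would not parse this way. The names js_source and parse_towns are made up for illustration.

```python
import ast
import re

# Shortened copy of the data from the question; the closing "];" is assumed.
js_source = r"""var d = [[909.0546875,842.3125,32429,'TownID: 32429','GREY','circle_grey.png',970,'goldpimp\'s city','','N/A'],[1434.8890625,1365.41484375,32143,'TownID: 32143','GREY','circle_grey.png',899,'1..','','N/A']];"""


def parse_towns(source):
    """Extract the array assigned to d and evaluate it as a Python literal."""
    match = re.search(r"var d = (\[.*\]);", source, re.DOTALL)
    if match is None:
        raise ValueError("no 'var d = [...];' assignment found")
    return ast.literal_eval(match.group(1))


towns = parse_towns(js_source)
# Same fields as in the answer: town id label, number, and name.
summary = [(town[3], town[6], town[7]) for town in towns]
for row in summary:
    print(row)
```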

Python splitting values from urllib in string

I'm trying to get IP location and other stuff from ipinfodb.com, but I'm stuck.
I want to split all of the values into new strings that I can format how I want later. What I wrote so far is:
resp = urllib2.urlopen('http://api.ipinfodb.com/v3/ip-city/?key=mykey&ip=someip').read()
out = resp.replace(";", " ")
print out
Before I replaced the separators, the output was:
OK;;someip;somecountry;somecountrycode;somecity;somecity;-;42.1975;23.3342;+05:00
So I made it show only
OK someip somecountry somecountrycode somecity somecity - 42.1975;23.3342 +05:00
But this is not very useful, because I want the values in separate strings rather than in one. Right now I print out and get the line above; what I want is to be able to write print country, print city and so on, and get the country, the city, etc. I checked their site; there is a class for this, but it is for a different API version (v2, mine is v3), so I can't use it. Does anyone have an idea how to do this?
PS. Sorry if the answer is obvious or I'm mistaken, I'm new with Python :s

You need to split the resp text by ;:
out = resp.split(';')
Now out is a list of values instead; use indexes to access the various items:
print 'Country: {}'.format(out[3])
Alternatively, add format=json to your query string and receive a JSON response from that API:
import json
resp = urllib2.urlopen('http://api.ipinfodb.com/v3/ip-city/?format=json&key=mykey&ip=someip')
data = json.load(resp)
print data['countryName']
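If you stay with the semicolon-separated response, pairing each value with a name gives you the print country / print city style asked for. This is only a sketch in Python 3: the FIELD_NAMES list is an assumption derived from the order of the sample output in the question, not from the API documentation, and parse_response is a made-up helper name.

```python
# Assumed field order, guessed from the sample response in the question.
FIELD_NAMES = [
    "status", "message", "ip", "country", "country_code",
    "region", "city", "zip", "latitude", "longitude", "timezone",
]


def parse_response(text):
    """Split a ';'-separated response and label each value."""
    values = text.strip().split(";")
    return dict(zip(FIELD_NAMES, values))


sample = "OK;;someip;somecountry;somecountrycode;somecity;somecity;-;42.1975;23.3342;+05:00"
record = parse_response(sample)
print(record["country"])
print(record["city"])
```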

Irregular String Parsing on Python

I'm new to python/django and I am trying to suss out more effective information from my scraper. Currently, the scraper takes a list of comic book titles and correctly divides them into a CSV list in three parts (Published Date, Original Date, and Title). I then pass the current date and title through to different parts of my database, which I do in my Loader script (convert mm/dd/yy into yyyy-mm-dd, save to "pub_date" column, title goes to "title" column).
A common string can look like this:
10/12/11|10/12/11|Stan Lee's Traveler #12 (10 Copy Incentive Cover)
I am successfully grabbing the date, but the title is trickier. In this instance, I'd ideally like to fill three different columns with the information after the second "|". The Title should go to "title", a charfield. the number 12 (after the '#') should go into the DecimalField "issue_num", and everything between the '()' 's should go into the "Special" charfield. I am not sure how to do this kind of rigorous parsing.
Sometimes, there are multiple #'s (one comic in particular is described as a bundle, "Containing issues #90-#95") and several have multiple '()' groups (such as "Betrayal Of The Planet Of The Apes #1 (Of 4)(25 Copy Incentive Cover)").
What would be a good road to start down to crack this problem? My knowledge of if/else statements quickly fell apart for the more complicated lines. How can I efficiently and (if possible) pythonically parse these lines and subdivide them so I can later slot them into the correct place in my database?

Use the regular expression module re. For example, if you have the third |-delimited field of your sample record in a variable s, then you can do
import re
match = re.match(r"^(?P<title>[^#]*) #(?P<num>[0-9]+) \((?P<special>.*)\)$", s)
title = match.group('title')
issue = match.group('num')
special = match.group('special')
If the string does not match (for example, because a field is missing), re.match returns None and the last three lines raise an AttributeError. Adapt the RE until it parses everything you want.

Parsing the title is the hard part, it sounds like you can handle the dates etc yourself. The problem is that there is not one rule that can parse every title but there are many rules and you can only guess which one works on a particular title.
I usually handle this by creating a list of rules, from most specific to general and try them out one by one until one matches.
To write such rules you can use the re module or even pyparsing.
The general idea goes like this:
class CantParse(Exception):
    pass
# one rule to parse one kind of title
import re
def title_with_special( title ):
    """ accepts only a title of the form
    <text> #<issue> (<special>) """
    m = re.match(r"[^#]*#(\d+) \(([^)]+)\)", title)
    if m:
        return m.group(1), m.group(2)
    else:
        raise CantParse(title)
def parse_extra(title, rules):
    """ tries to parse extra information from a title using the rules """
    for rule in rules:
        try:
            return rule(title)
        except CantParse:
            pass
    # nothing matched
    raise CantParse(title)
# lets try this out
rules = [title_with_special] # list of rules to apply, add more functions here
titles = ["Stan Lee's Traveler #12 (10 Copy Incentive Cover)",
    "Betrayal Of The Planet Of The Apes #1 (Of 4)(25 Copy Incentive Cover) )"]
for title in titles:
    try:
        issue, special = parse_extra(title, rules)
        print "Parsed", title, "to issue=%s special='%s'" % (issue, special)
    except CantParse:
        print "No matching rule for", title
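As you can see, the first title is parsed correctly, but not the second: the rule stops after the first parenthesised group, so it reports special='Of 4' and ignores the rest. You'll have to write a bunch of rules that account for every possible title format in your data.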
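As one example of an additional rule, the sketch below (Python 3) accepts any number of parenthesised groups after the issue number and returns them as a list. TITLE_RE and parse_title are hypothetical names, and the pattern only covers titles of the form text, #issue, then zero or more (...) groups; anything else, such as the "#90-#95" bundle, is reported as unparsed and would need its own rule.

```python
import re

# <text> #<issue> followed by zero or more "(...)" groups.
TITLE_RE = re.compile(
    r"^(?P<title>[^#]*?)\s*#(?P<num>\d+)\s*(?P<specials>(?:\([^)]*\)\s*)*)$"
)


def parse_title(raw):
    """Return title, issue number and a list of specials, or None."""
    match = TITLE_RE.match(raw)
    if match is None:
        return None
    specials = re.findall(r"\(([^)]*)\)", match.group("specials"))
    return {
        "title": match.group("title"),
        "issue_num": int(match.group("num")),
        "special": specials,
    }


for raw in [
    "Stan Lee's Traveler #12 (10 Copy Incentive Cover)",
    "Betrayal Of The Planet Of The Apes #1 (Of 4)(25 Copy Incentive Cover)",
    "Containing issues #90-#95",
]:
    print(raw, "->", parse_title(raw))
```

Returning the specials as a list leaves it to the loader to decide how to store several of them in the single "Special" field.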

Regular expressions are the way to go. But if you feel uncomfortable writing them, you can try a small parser that I wrote (https://github.com/hgrecco/stringparser). It translates a string format (PEP 3101) to a regular expression. In your case, you would do the following:
>>> from stringparser import Parser
>>> p = Parser(r"{date:s}\|{date2:s}\|{title:s}#{issue:d} \({special:s}\)")
>>> x = p("10/12/11|10/12/11|Stan Lee's Traveler #12 (10 Copy Incentive Cover)")
OrderedDict([('date', '10/12/11'), ('date2', '10/12/11'), ('title', "Stan Lee's Traveler "), ('issue', 12), ('special', '10 Copy Incentive Cover')])
>>> x.issue
12
The output in this case is an (ordered) dictionary. This will work for any simple cases and you might tweak it to catch multiple issues or multiple ()
One more thing: notice that in the current version you need to manually escape regex characters (i.e. if you want to find |, you need to type \|). I am planning to change this soon.
