Grouping list of similar urls in python

I have a large set of URLs. Some are similar to each other, i.e. they represent a similar set of pages.
For example,
http://example.com/product/1/
http://example.com/product/2/
http://example.com/product/40/
http://example.com/product/33/
are similar. Similarly,
http://example.com/showitem/apple/
http://example.com/showitem/banana/
http://example.com/showitem/grapes/
are also similar. So I need to represent them as http://example.com/product/(Integers)/
where (Integers) = 1, 2, 40, 33 and http://example.com/showitem/(strings)/ where (strings) = apple, banana, grapes, and so on.
Is there any built-in function or library in Python to find these similar URLs in a large set of mixed URLs? How can this be done efficiently? Please suggest. Thanks in advance.

Use a string to store the first part of the URL and just handle the IDs. For example:
In [1]: PRODUCT_URL='http://example.com/product/%(id)s/'
In [2]: _ids = '1 2 40 33'.split() # split string into list of IDs
In [3]: for id in _ids:
   ...:     print(PRODUCT_URL % {'id': id})
   ...:
http://example.com/product/1/
http://example.com/product/2/
http://example.com/product/40/
http://example.com/product/33/
The expression PRODUCT_URL % {'id': id} uses Python string formatting to fill the product URL with the value of the variable id.
UPDATE:
I see you've changed your question. The solution for your problem is quite domain-specific and depends on your data set. There are several approaches, some more manual than others. One such approach would be to get the top-level URLs i.e. to retrieve the domain name:
In [7]: _url = 'http://example.com/product/33/' # url we're testing with
In [8]: ('/').join(_url.split('/')[:3]) # get domain
Out[8]: 'http://example.com'
In [9]: ('/').join(_url.split('/')[:4]) # get domain + first URL sub-part
Out[9]: 'http://example.com/product'
[:3] and [:4] above are just slicing the list resulting from split('/')
You can use the result as a key in a dict in which you keep a count of how many times you encounter each URL part, and move on from there. Again, the solution depends on your data. If it gets more complex than the above, I suggest you look into regex, as the other answers suggest.
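As a rough sketch of the dict idea, the snippet below groups URLs by domain plus the first path segment and then labels the next segment as (Integers) or (strings), as in the question. The function name group_similar and the depth parameter are made up for illustration, and the sketch assumes every URL in a group varies in exactly one path segment.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_similar(urls, depth=1):
    """Group URLs by scheme + host + the first `depth` path segments."""
    buckets = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        prefix = "/".join([f"{parts.scheme}://{parts.netloc}"] + segments[:depth])
        # Remember the first segment after the prefix, if there is one.
        if len(segments) > depth:
            buckets[prefix].append(segments[depth])

    grouped = {}
    for prefix, values in buckets.items():
        # A bucket whose values are all digits is labelled (Integers).
        kind = "(Integers)" if all(v.isdigit() for v in values) else "(strings)"
        grouped[f"{prefix}/{kind}/"] = values
    return grouped


urls = [
    "http://example.com/product/1/",
    "http://example.com/showitem/apple/",
    "http://example.com/product/40/",
    "http://example.com/showitem/banana/",
]
for pattern, values in group_similar(urls).items():
    print(pattern, values)
```

Real data with deeper or irregular paths would probably need a different depth per site, or a comparison segment by segment.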

You can use regular expressions to handle these cases. See the Python documentation for how they work.
You can also look at how Django implements this in its routing system.
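For illustration only, here is what the regex approach might look like when the URL patterns are already known. The PATTERNS table and the classify function are hypothetical names, and the two patterns are simply the ones from the question. This does not discover patterns on its own.

```python
import re

# Hypothetical pattern table: label -> compiled regex with one capture group.
PATTERNS = {
    "http://example.com/product/(Integers)/":
        re.compile(r"^http://example\.com/product/(\d+)/$"),
    "http://example.com/showitem/(strings)/":
        re.compile(r"^http://example\.com/showitem/([A-Za-z]+)/$"),
}


def classify(urls):
    """Collect the captured part of each URL under the first pattern it matches."""
    groups = {label: [] for label in PATTERNS}
    for url in urls:
        for label, pattern in PATTERNS.items():
            match = pattern.match(url)
            if match:
                groups[label].append(match.group(1))
                break
    return groups


print(classify([
    "http://example.com/product/33/",
    "http://example.com/showitem/grapes/",
]))
```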

I'm not exactly sure what specifically you are looking for. It sounds to me that you are looking for something to match URLs. If this is indeed what you want then I suggest you use something that is built using regular expressions. One example can be found here.
I also suggest you take a look at Django and its routing system.

Not in Python, but I've created a Ruby Library (and an accompanying app) --
https://rubygems.org/gems/LinkGrouper
It works on all links (doesn't need to know any pattern).

Related

Output list of links grouped by extension or base URL - built on regex using python.

I've been working on this assignment for a while now. The regex is not particularly difficult, but I don't quite follow how to get the output they want.
Your program should:
Read the html of a webpage (which has been stored as textfile);
Extract all the domains referred to and list all the full http addresses related to these domains;
Extract all the resource types referred to and list all the full http addresses related to these resource types.
Please solve the task using regular expressions and re functions/methods. I suggest using ‘finditer’ and ‘groups’ (there might be other possibilities). Please do not use string functions where re is better suited.
The output is supposed to look like this:
www.fairfaxmedia.co.nz
http://www.fairfaxmedia.co.nz
www.essentialmums.co.nz
http://www.essentialmums.co.nz/
http://www.essentialmums.co.nz/
http://www.essentialmums.co.nz/
www.nzfishingnews.co.nz
http://www.nzfishingnews.co.nz/
www.nzlifeandleisure.co.nz
http://www.nzlifeandleisure.co.nz/
www.weatherzone.co.nz
http://www.weatherzone.co.nz/
www.azdirect.co.nz
http://www.azdirect.co.nz/
i.stuff.co.nz
http://i.stuff.co.nz/
ico
http://static.stuff.co.nz/781/3251781.ico
zip
http://static2.stuff.co.nz/1392867595/static/jwplayer/skin/Modieus.zip
mp4
http://file2.stuff.co.nz/1394587586/272/9819272.mp4
I really need help with how to filter stuff out so the output shows up like that?

Create a list of tuples (keyword, url).
Sort it by keyword.
Group it per keyword using itertools.groupby.
For each keyword, print the keyword and then all its URLs (printed indented).
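The steps above can be sketched roughly as follows. The regexes and the names URL_RE, EXT_RE, grouped_links and format_report are assumptions for illustration, not the assignment's official solution. The keyword is taken to be the file extension when the path ends in one, and the domain otherwise.

```python
import re
from itertools import groupby

# Groups: 1 = full URL, 2 = domain, 3 = path (may be missing).
URL_RE = re.compile(r'''(https?://([\w.-]+)(/[^\s"'<>]*)?)''')
# A trailing ".xyz" in the path is treated as the resource type.
EXT_RE = re.compile(r'\.(\w+)$')


def grouped_links(html):
    """Return [(keyword, [urls...]), ...] sorted by keyword."""
    pairs = []
    for match in URL_RE.finditer(html):
        url, domain, path = match.groups()
        ext = EXT_RE.search(path or "")
        pairs.append((ext.group(1) if ext else domain, url))
    # groupby only merges adjacent items, so sort by keyword first.
    pairs.sort(key=lambda pair: pair[0])
    return [(keyword, [url for _, url in group])
            for keyword, group in groupby(pairs, key=lambda pair: pair[0])]


def format_report(groups):
    """Keyword on its own line, followed by its URLs indented."""
    lines = []
    for keyword, urls in groups:
        lines.append(keyword)
        lines.extend("    " + url for url in urls)
    return "\n".join(lines)


sample = ('<a href="http://www.essentialmums.co.nz/">a</a> '
          '<link href="http://static.stuff.co.nz/781/3251781.ico">')
print(format_report(grouped_links(sample)))
```

Because the sort is stable, URLs within one keyword stay in the order they appear in the page.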

how do I modify a url that I pick at random in python

I have an app that will show images from reddit. Some images come like this: http://imgur.com/Cuv9oau, but I need them to look like this: http://i.imgur.com/Cuv9oau.jpg. I just need to add an i. at the beginning and .jpg at the end.

You can use a string replace:
s = "http://imgur.com/Cuv9oau"
s = s.replace("//imgur", "//i.imgur")+(".jpg" if not s.endswith(".jpg") else "")
This sets s to:
'http://i.imgur.com/Cuv9oau.jpg'

This function should do what you need. I expanded on @jh314's answer, made the code a little less compact, and added a check that the URL starts with http://imgur.com, since the original code would cause issues with other URLs, like the Google search I included. It also replaces only the first instance, since replacing every instance could cause issues.
def fixImgurLinks(url):
    if url.lower().startswith("http://imgur.com"):
        url = url.replace("http://imgur", "http://i.imgur", 1)  # Only replace the first instance.
        if not url.endswith(".jpg"):
            url += ".jpg"
    return url

for u in ["http://imgur.com/Cuv9oau", "http://www.google.com/search?q=http://imgur"]:
    print(fixImgurLinks(u))
Gives:
>>> http://i.imgur.com/Cuv9oau.jpg
>>> http://www.google.com/search?q=http://imgur

You should use Python's regular expressions to place the i. As for the .jpg you can just append it.
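A minimal sketch of the regex variant, assuming the links always have the form http://imgur.com/<id> with an optional .jpg suffix. The names IMGUR_RE and to_direct_link are invented for this example.

```python
import re

# Anchored at both ends so only plain imgur.com page links are rewritten.
IMGUR_RE = re.compile(r'^(https?://)imgur\.com/(\w+)(?:\.jpg)?$')


def to_direct_link(url):
    """Insert the i. subdomain and make sure the link ends in .jpg."""
    return IMGUR_RE.sub(r'\1i.imgur.com/\2.jpg', url)


for u in ["http://imgur.com/Cuv9oau",
          "http://www.google.com/search?q=http://imgur"]:
    print(to_direct_link(u))
```

URLs that do not match the pattern are returned unchanged, so already converted links and unrelated sites are left alone.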

How to transform hyperlink codes into normal URL strings?

I'm trying to build a blog system, so I need to do things like transforming '\n' into <br /> and http://example.com into <a href='http://example.com'>http://example.com</a>.
The former is easy: just use the string replace() method.
The latter is more difficult, but I found a solution here: Find Hyperlinks in Text using Python (twitter related)
But now I need to implement an "Edit Article" function, so I have to do the reverse.
So, how can I transform <a href='http://example.com'>http://example.com</a> into http://example.com?
Thanks! And I'm sorry for my poor English.

Sounds like the wrong approach. Making round-trips work correctly is always challenging. Instead, store the source text only, and only format it as HTML when you need to display it. That way, alternate output formats / views (RSS, summaries, etc) are easier to create, too.
Separately, we wonder whether this particular wheel needs to be reinvented again ...
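To illustrate the "store the source, format on display" idea, here is a small sketch. The function name render_html and the URL pattern are assumptions, and a real blog would want a more careful URL matcher. The stored article text is never modified, so "Edit Article" can simply show the stored text again.

```python
import html
import re

# Deliberately simple URL pattern, for illustration only.
URL_RE = re.compile(r'(https?://[^\s<]+)')


def render_html(source):
    """Turn stored plain text into HTML at display time."""
    escaped = html.escape(source)                        # neutralise <, >, &, quotes
    linked = URL_RE.sub(r'<a href="\1">\1</a>', escaped)  # wrap URLs in anchors
    return linked.replace("\n", "<br />")                # newlines become line breaks


article = "Hi\nhttp://example.com"
print(render_html(article))
```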

Since you are using the answer from that other question, your links will always be in the same format, so it should be pretty easy using regex. I don't know Python, but going by the answer from the last question:
import re
myString = "This is my tweet check it out <a href='http://tinyurl.com/blah'>http://tinyurl.com/blah</a>"
# Capture the href value and replace the whole anchor element with it.
r = re.compile(r'''<a href=["'](http://[^"' ]+)["']>[^<]*</a>''')
print(r.sub(r'\1', myString))
This should work.

Split in py. write in django

asd = "qweasdzxc"
qwen = list(asd)  # split("") raises ValueError; list() splits into characters
self.response.out.write(qwen[0])  # writes q
I can split the string this way in Python, but I want to do it in a Django template in my HTML document.
How can I do this?
Thanks for helping.

You might want to write a custom filter for Django for this. Here is a snippet for splitting.
There are also make_list and slice (see the same page as the last link), which accomplish the same goal.
Given the example, slice is probably what you want, e.g. {{ asd|slice:":1" }} to output the first character.

Python: Need to replace a series of different substrings in HTML template with additional HTML or database results

Situation:
I am writing a basic templating system in Python/mod_python that reads in a main HTML template and replaces instances of ":value:" throughout the document with additional HTML or db results and then returns it as a view to the user.
I am not trying to replace all instances of 1 substring. Values can vary. There is a finite list of what's acceptable. It is not unlimited. The syntax for the values is [colon]value[colon]. Examples might be ":gallery: , :related: , :comments:". The replacement may be additional static HTML or a call to a function. The functions may vary as well.
Question:
What's the most efficient way to read in the main HTML file and replace the unknown combination of values with their appropriate replacement?
Thanks in advance for any thoughts/solutions,
c

There are dozens of templating options that already exist. Consider genshi, mako, jinja2, django templates, or more.
You'll find that you're reinventing the wheel with little/no benefit.

If you can't use an existing templating system for whatever reason, your problem seems best tackled with regular expressions:
import re

valre = re.compile(r':\w+:')

def dosub(correspvals, correspfuns, lastditch):
    def f(value):
        v = value.group()[1:-1]
        if v in correspvals:
            return correspvals[v]
        if v in correspfuns:
            return correspfuns[v]()  # or whatever args you need
        # what if a value has neither a corresponding value to
        # substitute, NOR a function to call? Whatever...:
        return lastditch(v)
    return f

replacer = dosub(adict, another, somefun)
thehtml = valre.sub(replacer, thehtml)
Basically you'll need two dictionaries (one mapping values to corresponding values, another mapping values to corresponding functions to be called) and a function to be called as a last-ditch attempt for values that can't be found in either dictionary; the code above shows you how to put these things together (I'm using a closure, a class would of course do just as well) and how to apply them for the required replacement task.
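For a concrete feel, here is a self-contained variation of the same idea with sample data. The names render, values, functions and fallback are made up for this sketch, and the default fallback (leave unknown tokens untouched) is just one possible choice.

```python
import re

TOKEN_RE = re.compile(r':(\w+):')


def render(template, values, functions, fallback=lambda name: f":{name}:"):
    """Replace :name: tokens using a dict of strings and a dict of callables."""
    def replace(match):
        name = match.group(1)
        if name in values:
            return values[name]
        if name in functions:
            return functions[name]()
        # Unknown token: by default it is put back unchanged.
        return fallback(name)
    return TOKEN_RE.sub(replace, template)


page = "<div>:gallery:</div><p>:comments:</p>:unknown:"
print(render(
    page,
    values={"gallery": "<img src='a.png'>"},
    functions={"comments": lambda: "2 comments"},
))
```

Since the template is scanned once and each token costs a dict lookup, this should stay cheap even with many distinct values.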

This is probably a job for a templating engine, and for Python there are a number of choices. In this Stack Overflow question people have listed their favourites and some helpfully explain why: What is your single favorite Python templating engine?
