cannot extract ID correctly from URL using split operation

I was using Python's standard split operation to extract IDs from URLs. It works for
URLs of the form https://music.com/146, where I need to extract 146, but fails for URLs such as
https://music.com/144?i=150
where I need to extract the 150 that follows i=.
I use the standard
url.split("/")[-1]
Is there a better way to do it?

Python provides a few tools to make this process easier.
As @Barmar mentioned, you can use urlsplit to split the URL, which returns a named tuple:
>>> from urllib import parse as urlparse
>>> x = urlparse.urlsplit('https://music.com/144?i=150')
>>> x
SplitResult(scheme='https', netloc='music.com', path='/144', query='i=150', fragment='')
You can use the parse_qs function to convert the query string into a dictionary:
>>> urlparse.parse_qs(x.query)
{'i': ['150']}
Or in a single line:
>>> urlparse.parse_qs(urlparse.urlsplit('https://music.com/144?i=150').query)['i']
['150']
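Combining the two steps, a minimal sketch of a helper that prefers the i query parameter and falls back to the last path segment might look like this. The function name extract_id and the fallback rule are assumptions made for illustration; they are not part of the original answer.

```python
from urllib.parse import urlsplit, parse_qs


def extract_id(url, param="i"):
    """Return the ID from the query parameter if present, else the last path segment."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    if param in query:
        # parse_qs maps each name to a list of values; take the first one
        return query[param][0]
    # Fall back to the final non-empty path segment, e.g. '/146' -> '146'
    return parts.path.rstrip("/").rsplit("/", 1)[-1]


print(extract_id("https://music.com/146"))        # 146
print(extract_id("https://music.com/144?i=150"))  # 150
```

The result is a string in both cases; wrap the call in int() if a number is needed.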

A particularly useful tool for manipulating URLs in Python is furl, which provides an interface mimicking the convenience of Python's standard pathlib module.
Accessing a parameter in the query string (the part after the ? of the URL) is as simple as indexing the URL's args attribute with the name of the parameter you want:
>>> from furl import furl
>>> url = furl('https://music.com/144?i=150')
>>> url.args['i']
'150'
In my opinion, this is a lot easier than using urllib.

As @Barmar mentioned, you can fix your code to:
url.split("/")[-1].split("?i=")[-1]
Splitting https://music.com/144?i=150 on "/" and taking the last element gives 144?i=150. Splitting that on "?i=" gives 144 and 150, and the last element is 150. For a URL without ?i=, the second split leaves the value unchanged, so https://music.com/146 still yields 146.
If you need a number, wrap the whole expression in int: int(url.split("/")[-1].split("?i=")[-1])

You can use a regular expression:
import re
url = 'https://music.com/144?i=150'
match = re.search(r'(\d+)\?', url)
if match:
    value = match[1]  # '144'
If you need the 150:
match = re.search(r'i=(\d+)', url)
if match:
    value = match[1]  # '150'

Related

How to remove the sensitive information before @github.com to sanitize it correctly using Python 3.9 and/or regex?

I need to include a username and token in a github url to access a private repo on github.
After accessing it, I need to sanitize it to obtain the clean version.
The input pattern is https://{username}:{token}@github.com/{repo_owner}/{repo-name}
The output pattern I want is https://github.com/{repo_owner}/{repo-name}
For example, I am given this
https://usernameabc:token1234@github.com/abc/easy-as-123
I want this
https://github.com/abc/easy-as-123
How do I do this with Python? I am okay to use regex
What I use that works
I am using this
def sanitize_github_url(github_url_with_username_token):
    github_url_with_username_token = github_url_with_username_token.lower()
    index = github_url_with_username_token.find("github.com/", 0)
    suffix = github_url_with_username_token[index:]
    return f"https://{suffix}"
And it works for my purposes. Is there a better way to do this?

I'd prefer not to use regex in this scenario, and would instead use a URL manipulation library like furl.
For example:
from furl import furl
url = furl("https://usernameabc:token1234@github.com/abc/easy-as-123")
url.password = None
url.username = None
print(str(url))
output:
https://github.com/abc/easy-as-123
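If adding a third-party dependency is not an option, the same idea can be sketched with the standard library alone. The helper name strip_credentials is hypothetical, and the sketch assumes the input is a well-formed URL of the form https://user:token@host/path. Note that hostname is returned in lower case, while the path keeps its original case.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(url):
    """Return the URL without any 'user:password@' part in the network location."""
    parts = urlsplit(url)
    # hostname excludes the userinfo; re-attach the port if one was given
    netloc = parts.hostname or ""
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


print(strip_credentials("https://usernameabc:token1234@github.com/abc/easy-as-123"))
# https://github.com/abc/easy-as-123
```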

Use a regex with a lookbehind and a lookahead:
import re
raw = r'https://usernameabc:token1234@github.com/abc/easy-as-123'
print(re.sub(r"(?<=https://).*?(?=github\.com)", "", raw))  # https://github.com/abc/easy-as-123

Extracting data from hyperlink cell in CSV

When I am reading a cell with hyperlink from CSV file I am getting the following:
=HYPERLINK("http://google.com","google") #for example
Is there a way to extract only the "google" without the =hyperlink and the link?

As per @martineau's comment, there are two forms of HYPERLINK:
>>> s1 = '=HYPERLINK("http://google.com","google")'
Or
>>> s2 = '=HYPERLINK("http://google.com")'
You could split the string or use a regex, but both approaches are fragile (what if there is a comma in the URL, or an escaped quote in the name?).
There is a module called ast that parses Python expressions. We can use it because Excel's function call syntax is close to Python's. Here's a version that returns the friendly name if there is one, and the URL otherwise:
>>> import ast
>>> ast.parse(s1[1:]).body[0].value.args[-1].s
'google'
And:
>>> ast.parse(s2[1:]).body[0].value.args[-1].s
'http://google.com'
This is how it works: s1[1:] removes the = sign. Then we take the value of the expression:
>>> v = ast.parse(s1[1:]).body[0].value
>>> v
<_ast.Call object at ...>
It is easy to extract the function name:
>>> v.func.id
'HYPERLINK'
And the args:
>>> [arg.s for arg in v.args]
['http://google.com', 'google']
Just take the last arg (....args[-1].s) to get the friendly name if it exists, and the URL otherwise. You can also check len(v.args) to do one thing if there is one arg and something else if there are two.
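The steps above can be wrapped in a small function. This is a sketch rather than the answerer's code: the name hyperlink_label is hypothetical, and it reads each literal through ast.Constant's .value attribute (Python 3.8+), since .s is a deprecated alias on recent Python versions. It assumes every argument of the formula is a plain string literal.

```python
import ast


def hyperlink_label(cell):
    """Return the friendly name of a =HYPERLINK(...) cell, or the URL if there is none."""
    # Drop the leading '=' and parse the rest as a single Python expression
    call = ast.parse(cell[1:], mode="eval").body
    if not isinstance(call, ast.Call) or getattr(call.func, "id", None) != "HYPERLINK":
        raise ValueError(f"not a HYPERLINK formula: {cell!r}")
    # Each argument is assumed to be a string literal; ast.Constant stores it in .value
    args = [arg.value for arg in call.args]
    # The last argument is the friendly name when present, otherwise the URL
    return args[-1]


print(hyperlink_label('=HYPERLINK("http://google.com","google")'))  # google
print(hyperlink_label('=HYPERLINK("http://google.com")'))           # http://google.com
```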

How to extract 'RS4' element in Rpy2

I’m a Chinese reader of the book “Applied Spatial Data Analysis with R”, which is very useful for spatial analysis work. I tried to translate the R code into Python with rpy2, since Python can handle more complex data sources, but I have a problem with the rpy2 code. The NY8 data is attached with the email.
The R code is:
library(rgdal)
NY8<-readOGR(".","NY8_utm18")
Syracuse<-NY8[NY8$AREANAME=="Syracuse city",]
Sy2_nb<-poly2nb(Syracuse,queen=FALSE)
library(spdep)
Sy2_nb<-poly2nb(Syracuse,queen=FALSE)
Sy2_lw<-nb2listw(Sy2_nb)
moran.plot(NY$POP8,Sy2_lw)
When I translate it with Rpy2, the code is:
>>> import rpy2.robjects as robjects
>>> from rpy2.robjects.packages import importr
>>> utils = importr('utils')
>>> utils.install_packages('rgdal')
>>> rgdal=importr('rgdal')
>>> import os
>>> os.chdir("C:\\PYDATA\\NY")
>>> NY8=rgdal.readOGR(".","NY8_utm18")
>>> print(robjects.r['summary'](NY8))
When I try to translate the line “Syracuse<-NY8[NY8$AREANAME=="Syracuse city",]”, for example:
>>> Syracuse=NY8[NY8$AREANAME=="Syracuse city",]
I get the error: SyntaxError: invalid syntax
It seems I cannot extract “AREANAME” through “$”, because “$” is illegal in Python.

I could not get the accepted answer to work, so I wrote this function:
from rpy2.robjects import r
def subset_RS4(rs4, subset):
    # Define an anonymous R function that applies R's own "[" subsetting
    subset_func = r("""function(o, s){
        o[s]
    }
    """)
    return subset_func(rs4, subset)
Now you can call subset_RS4 with your object as the first argument and the subset as the second.
I am using it like this:
subset1 = r[">"](r["width"](peaks1), args.min_width)
print(subset_RS4(peaks1, subset1))

Use the method rx2 (https://rpy2.github.io/doc/latest/html/vector.html#extracting-r-style):
NY8.rx2("AREANAME")
If this is an S4 object (your comment suggests so), a simple way to proceed is to fetch the generic "$" and use it as a function.
from rpy2.robjects.packages import importr
base = importr("base")
# "$" is not a syntactically valid name for a Python function,
# so we fetch it from the instance's dictionary of attributes
dollar = base.__dict__["$"]
dollar(NY8, "AREANAME")

AttributeError: 'dict_values' object has no attribute 'rsplit'

I am trying to do a reverse split of a URL generated from a text file, and I get the above error when printing the split value. I have tried making a string from the URL and splitting that, but this causes the GUI to freeze completely without even producing an error message. My code is here:
a = URLS.rsplit('=', 1)
The code I used when attempting to resolve a string from the URL then split that is here:
urlstr = str(URLS)
a = urlstr.rsplit('=', 1)
print(a)
Can anyone tell me why I can't split the URL using the split method (the URLs were defined in a dictionary) and/or why creating a string and then splitting that is not working?
Thanks

The error suggests that URLS is not a string, but rather a dict_values object. I think that's what you get when you call the values method of a dictionary (in Python 3). A values view is an iterable object, so you probably want to loop over it, with something like:
for url in URLS:
    a = url.rsplit("=", 1)
    # do stuff with a here
Or if you want a list of the various a values, you could use a list comprehension:
a_lst = [url.rsplit("=", 1) for url in URLS]
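A short, self-contained illustration of the point follows. The dictionary contents and the name url_map are made up for the example, since the question does not show how URLS was built.

```python
# Hypothetical data standing in for the dictionary from the question
url_map = {
    "first": "http://example.com/page?id=1",
    "second": "http://example.com/page?id=2",
}

URLS = url_map.values()          # a dict_values view, not a string
print(hasattr(URLS, "rsplit"))   # False

# Split each URL once at the last '='
pieces = [url.rsplit("=", 1) for url in URLS]
print(pieces)
# [['http://example.com/page?id', '1'], ['http://example.com/page?id', '2']]
```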

A dict_values object is a view over a dictionary's values. It is iterable, but it is not a string and does not have an rsplit method, though the str objects it contains do.
Really though, instead of using rsplit, you probably should be using urllib.parse to extract information from your URLs.
For example,
>>> import urllib.parse as parse
>>> url = 'http://stackoverflow.com/questions?x=foo&y=bar'
>>> parse.urlsplit(url)
SplitResult(scheme='http', netloc='stackoverflow.com', path='/questions', query='x=foo&y=bar', fragment='')
>>> parse.urlsplit(url).query
'x=foo&y=bar'
>>> parse.parse_qs(parse.urlsplit(url).query)
{'x': ['foo'], 'y': ['bar']}
So, if URLS is a dict, then you can loop through the values and extract the parameter values using
>>> URLS = {'a': 'http://stackoverflow.com/questions?x=foo&y=bar'}
>>> for url in URLS.values():
...     print(parse.parse_qs(parse.urlsplit(url).query))
...
{'x': ['foo'], 'y': ['bar']}
Unlike rsplit, parse_qs will allow you to properly unquote percent-encoded query strings, and control the parsing of blank values.
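To make those two behaviours concrete, here is a small sketch. The URL is invented for the example; keep_blank_values is a documented parameter of parse_qs.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://example.com/search?q=hello%20world&tag=&page=2"
query = urlsplit(url).query

# Percent-encoded characters are decoded; blank values are dropped by default
default = parse_qs(query)
print(default)
# {'q': ['hello world'], 'page': ['2']}

# keep_blank_values=True retains parameters whose value is empty
with_blanks = parse_qs(query, keep_blank_values=True)
print(with_blanks)
# {'q': ['hello world'], 'tag': [''], 'page': ['2']}
```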

Manipulating Directory Paths in Python

Basically I've got this current url and this other key that I want to merge into a new url, but there are three different cases.
Suppose the current url is localhost:32401/A/B/foo
If key is bar, then I want to return localhost:32401/A/B/bar.
If key starts with a slash, for example /A/bar, then I want to return localhost:32401/A/bar.
Finally, if key is a complete URL of its own, I just want to return it unchanged: key = http://foo.com/bar -> http://foo.com/bar
I assume there is a way to do at least the first two cases without manipulating the strings manually, but nothing jumped out at me immediately in the os.path module.

Have you checked out the urlparse module (urllib.parse in Python 3)?
From the docs,
from urlparse import urljoin  # Python 2; in Python 3: from urllib.parse import urljoin
urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
Should help with your first case.
Obviously, you can always do basic string manipulation for the rest.
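In Python 3, urljoin appears to cover all three cases from the question, as the sketch below suggests. One assumption is worth stating loudly: the base URL here carries an explicit http:// scheme, which the question's localhost:32401/A/B/foo does not; without a scheme, urljoin does not resolve the key against the base.

```python
from urllib.parse import urljoin

# Assumption: the current URL includes a scheme (the question's example omits it)
current = "http://localhost:32401/A/B/foo"

print(urljoin(current, "bar"))                 # http://localhost:32401/A/B/bar
print(urljoin(current, "/A/bar"))              # http://localhost:32401/A/bar
print(urljoin(current, "http://foo.com/bar"))  # http://foo.com/bar
```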

I assume there is a way to do at least the first two cases without manipulating the strings manually, but nothing jumped out at me immediately in the os.path module.
That's because you want to use urllib.parse (for Python 3.x) or urlparse (for Python 2.x) instead.
I don't have much experience with it, though, so here's a snippet using str.split() and str.join().
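def merge(url, key):
    urlparts = url.split('/')
    if key.startswith('http://'):
        # key is already a full URL
        return key
    elif key.startswith('/'):
        # absolute path: keep only the host part of a scheme-less URL
        return urlparts[0] + key
    else:
        # relative name: replace the last path segment
        urlparts[-1] = key
        return '/'.join(urlparts)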

String objects in Python all have startswith and endswith methods that should be able to get you there. Something like this perhaps?
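def merge(current, key):
    if key.startswith('http'):
        return key
    if key.startswith('/'):
        # keep everything before the first slash (the host of a scheme-less URL)
        parts = current.partition('/')
        return parts[0] + key
    # otherwise replace everything after the last slash
    parts = current.rpartition('/')
    return parts[0] + '/' + key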
