When I read a cell containing a hyperlink from a CSV file, I get the following:
=HYPERLINK("http://google.com","google") #for example
Is there a way to extract only "google", without the =HYPERLINK part and the link?
As per @martineau's comment, you have two forms of HYPERLINK.
>>> s1 = '=HYPERLINK("http://google.com","google")'
Or
>>> s2 = '=HYPERLINK("http://google.com")'
You can split, use a regex, but these methods are tricky (what if you have a comma in the url? an escaped quote in the name?).
There is a module called ast that parses Python expressions. We can use it here because Excel's function-call syntax is close to Python's. Here's a version that returns the friendly name if there is one, and the URL otherwise:
>>> import ast
>>> ast.parse(s1[1:]).body[0].value.args[-1].s
'google'
And:
>>> ast.parse(s2[1:]).body[0].value.args[-1].s
'http://google.com'
This is how it works: s1[1:] removes the = sign. Then we take the value of the expression:
>>> v = ast.parse(s1[1:]).body[0].value
>>> v
<_ast.Call object at ...>
It is easy to extract the function name:
>>> v.func.id
'HYPERLINK'
And the args:
>>> [arg.s for arg in v.args]
['http://google.com', 'google']
Just take the last arg (...args[-1].s) to get the friendly name if it exists, and the URL otherwise. You can also check len(v.args) to do one thing if there is one arg, and something else if there are two.
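Putting the pieces together, a small helper could look like this (a sketch; hyperlink_name is my name for it, and I use the .value attribute, for which .s is a legacy alias):

```python
import ast

def hyperlink_name(cell):
    """Return the friendly name from an Excel HYPERLINK formula,
    or the URL when no friendly name is given."""
    # Drop the leading "=" and parse the remainder as a Python expression
    call = ast.parse(cell[1:]).body[0].value
    # Last argument: the friendly name if present, otherwise the URL
    return call.args[-1].value

print(hyperlink_name('=HYPERLINK("http://google.com","google")'))  # google
print(hyperlink_name('=HYPERLINK("http://google.com")'))           # http://google.com
```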
Related
I was using the standard split operation in python to extract ids from urls. It works for
urls of the form https://music.com/146 where I need to extract 146 but fails in these cases
https://music.com/144?i=150
from where I need to extract 150 after i
I use the standard
url.split("/")[-1]
Is there a better way to do it?
Python provides a few tools to make this process easier.
As @Barmar mentioned, you can use urlsplit to split the URL, which gets you a named tuple:
>>> from urllib import parse as urlparse
>>> x = urlparse.urlsplit('https://music.com/144?i=150')
>>> x
SplitResult(scheme='https', netloc='music.com', path='/144', query='i=150', fragment='')
You can use the parse_qs function to convert the query string into a dictionary:
>>> urlparse.parse_qs(x.query)
{'i': ['150']}
Or in a single line:
>>> urlparse.parse_qs(urlparse.urlsplit('https://music.com/144?i=150').query)['i']
['150']
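With those two pieces, a small helper can cover both URL shapes (a sketch; extract_id is my name for it, and it assumes the query parameter is called i):

```python
from urllib.parse import urlsplit, parse_qs

def extract_id(url, param='i'):
    """Return the id: the query parameter's value if present,
    otherwise the last path segment."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    if param in query:
        return query[param][0]            # e.g. '150' from ?i=150
    return parts.path.rsplit('/', 1)[-1]  # e.g. '146' from /146

print(extract_id('https://music.com/146'))        # 146
print(extract_id('https://music.com/144?i=150'))  # 150
```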
A particularly useful tool for manipulating URLs in Python is furl, which provides an interface mimicking the convenience of Python's standard pathlib module.
Accessing a parameter in the query string (the part after the ? of the URL) is as simple as indexing the URL's args attribute with the name of the parameter you want:
>>> from furl import furl
>>> url = furl('https://music.com/144?i=150')
>>> url.args['i']
'150'
In my opinion, this is a lot easier than using urllib.
As @Barmar mentioned, you can fix your code to:
url.split("/")[-1].split("?i=")[-1]
Basically you need to split https://music.com/144?i=150 into https://music.com and 144?i=150, take the second element 144?i=150, then split it into 144 and 150 and take the second.
If you need it to be a number, you can use int(url.split("/")[-1].split("?i=")[-1])
You can use a regexp:
import re
url = 'https://music.com/144?i=150'
match = re.search(r'(\d+)\?', url)
if match:
    value = match[1]  # 144
If you need the 150:
match = re.search(r'i=(\d+)', url)
if match:
    value = match[1]  # 150
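Both URL shapes can also be handled with one pattern, under the assumption that the id you want is always the last run of digits in the URL (a sketch; last_number is my name for it):

```python
import re

def last_number(url):
    """Return the last run of digits in the URL, or None if there is none."""
    numbers = re.findall(r'\d+', url)
    return numbers[-1] if numbers else None

print(last_number('https://music.com/146'))        # 146
print(last_number('https://music.com/144?i=150'))  # 150
```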
I have URIs specified in an xls file. I want to read that xls file, get each URI from it, parse it, replace variables (if present) with their corresponding values, and then make an API call to that URI.
For example:
These are a few URIs in the xls sheet:
https://api.something.com/v1/me
https://api.something.com/v1/{user_id}/account
(Where user_id is a variable that has to be replaced by an appropriate value.) Is there an easy way to parse the URI and check whether a variable is present? If yes, get the value of the variable, form a new string with that value, and use the resulting URI to make the API call; otherwise use the URI as is.
Field names can be discovered using stdlib string.Formatter:
>>> s = "https://api.something.com/v1/{user_id}/account"
>>> from string import Formatter
>>> parsed = Formatter().parse(s)
>>> field_names = []
>>> for literal_text, field_name, format_spec, conversion in parsed:
... if field_name is not None:
... field_names.append(field_name)
...
>>> field_names
['user_id']
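The loop can be condensed into a one-liner:

```python
from string import Formatter

s = "https://api.something.com/v1/{user_id}/account"
# Formatter().parse yields (literal_text, field_name, format_spec, conversion)
field_names = [name for _, name, _, _ in Formatter().parse(s) if name is not None]
print(field_names)  # ['user_id']
```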
Fortunately, Python has a built-in mechanism for handling this!
>>> 'https://api.something.com/v1/{user_id}/account'.format(user_id='my_id', unused_variable='xyzzy')
'https://api.something.com/v1/my_id/account'
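If you only know some of the variables, str.format_map with a dict subclass lets you substitute what you have and keep unknown placeholders intact (a sketch; the Partial class and the {section} placeholder are mine, and this only works for plain placeholders without format specs):

```python
class Partial(dict):
    # Missing keys are rendered back as "{key}" instead of raising KeyError
    def __missing__(self, key):
        return '{' + key + '}'

template = 'https://api.something.com/v1/{user_id}/{section}'
print(template.format_map(Partial(user_id='my_id')))
# https://api.something.com/v1/my_id/{section}
```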
I’m a Chinese reader of the book “Applied Spatial Data Analysis with R”, which is very useful for spatial analysis work. I tried to translate the R code into Python with rpy2, since Python can handle more complex data sources. But I have a problem with the rpy2 code. The NY8 data is attached with the email.
The R code is:
library(rgdal)
NY8<-readOGR(".","NY8_utm18")
Syracuse<-NY8[NY8$AREANAME=="Syracuse city",]
Sy2_nb<-poly2nb(Syracuse,queen=FALSE)
library(spdep)
Sy2_nb<-poly2nb(Syracuse,queen=FALSE)
Sy2_lw<-nb2listw(Sy2_nb)
moran.plot(NY$POP8,Sy2_lw)
When I translate it with Rpy2, the code is:
>>> from rpy2.robjects.packages import importr
>>> utils = importr('utils')
>>> utils.install_packages('rgdal')
>>> rgdal=importr('rgdal')
>>> import os
>>> os.chdir("C:\\PYDATA\\NY")
>>> NY8=rgdal.readOGR(".","NY8_utm18")
>>> print(robjects.r['summary'](NY8))
When I want to translate the code “Syracuse<-NY8[NY8$AREANAME=="Syracuse city",]”, for example:
>>> Syracuse=NY8[NY8$AREANAME=="Syracuse city",]
the error message shown is: SyntaxError: invalid syntax
It seems I cannot extract “AREANAME” through “$”, because “$” is illegal in Python.
Could not get the accepted answer to work, so I wrote this function:
def subset_RS4(rs4, subset):
    subset_func = r("""function(o, s){
        o[s]
    }""")
    return subset_func(rs4, subset)
Now you can call subset_RS4 with your object as the first arg and the subset as the second.
I am using it like this:
subset1 = r[">"](r["width"](peaks1), args.min_width)
print(subset_RS4(peaks1, subset1))
Use the method rx2 (https://rpy2.github.io/doc/latest/html/vector.html#extracting-r-style):
NY8.rx2("AREANAME")
If this is an S4 object (your comment suggests so), a simple way to proceed is to fetch the generic "$" and use it as a function.
base = importr("base")
# "$" is not a syntactically valid name for a Python function,
# so we fetch it from the instance's dictionary of attributes
dollar = base.__dict__["$"]
dollar(NY8, "AREANAME")
I am trying to do a reverse split of a URL generated from a text file and am getting the above error when printing the split value. I have tried making a string from the URL and splitting that, but this causes the GUI to freeze completely without even producing an error message. My code is here:
a = URLS.rsplit('=', 1)
The code I used when attempting to resolve a string from the URL then split that is here:
urlstr = str(URLS)
a = urlstr.rsplit('=', 1)
print(a)
Can anyone tell me why I can't split the URL using the split method (the URLS were defined in a dictionary), and/or why creating a string and then splitting that is not working?
Thanks
The error suggests that URLS is not a string, but rather a dict_values object. I think that's what you get when you call the values method of a dictionary (in Python 3). A values view is an iterable object, so you probably want to loop over it, with something like:
for url in URLS:
    a = url.rsplit("=", 1)
    # do stuff with a here
Or if you want a list of the various a values, you could use a list comprehension:
a_lst = [url.rsplit("=", 1) for url in URLS]
A dict_values object is a view of the dictionary's values, not a string; it does not have an rsplit method, though str objects do.
Really though, instead of using rsplit, you probably should be using urllib.parse to extract information from your URLs.
For example,
>>> import urllib.parse as parse
>>> url = 'http://stackoverflow.com/questions?x=foo&y=bar'
>>> parse.urlsplit(url)
SplitResult(scheme='http', netloc='stackoverflow.com', path='/questions', query='x=foo&y=bar', fragment='')
>>> parse.urlsplit(url).query
'x=foo&y=bar'
>>> parse.parse_qs(parse.urlsplit(url).query)
{'x': ['foo'], 'y': ['bar']}
So, if URLS is a dict, then you can loop through the values and extract the parameter values using
>>> URLS = {'a': 'http://stackoverflow.com/questions?x=foo&y=bar'}
>>> for url in URLS.values():
... print(parse.parse_qs(parse.urlsplit(url).query))
...
{'x': ['foo'], 'y': ['bar']}
Unlike rsplit, parse_qs will allow you to properly unquote percent-encoded query strings, and control the parsing of blank values.
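For example, percent-encoded and blank values are handled for you:

```python
from urllib.parse import parse_qs

query = 'name=John%20Doe&empty=&x=1'
print(parse_qs(query))
# {'name': ['John Doe'], 'x': ['1']}   (blank values dropped by default)
print(parse_qs(query, keep_blank_values=True))
# {'name': ['John Doe'], 'empty': [''], 'x': ['1']}
```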
I have a file in gran/config.py AND I cannot import this file (not an option).
Inside this config.py, there is the following code
...<more code>
animal = dict(
    bear = r'^bear4x',
    tiger = r'^.*\tiger\b.*$'
)
...<more code>
I want to be able to parse out r'^bear4x' or r'^.*\tiger\b.*$' based on bear or tiger.
I started out with
try:
    text = open('gran/config.py','r')
    tline = filter('not sure', text.readlines())
    text.close()
except IOError, str:
    pass
I was hoping to grab the whole animal dict by
grab = re.compile("^animal\s*=\s*('.*')") or something like that
and maybe change tline to tline = filter(grab.search,text.readlines())
but it only grabs animal = dict( and not the following lines of the dict.
How can I grab multiple lines? Look for animal, then confirm the first '(' and keep reading until the matching ')'?
Note: the size of the animal dict may change, so any static approach (like grabbing 4 extra lines after animal is found) wouldn't work.
Maybe you should try some AST hacks? With Python it is easy, just:
import ast
config = ast.parse(open('config.py').read())
So now you have your parsed module. You need to extract the assignment to animal and evaluate it. There is a safe ast.literal_eval function, but since the code makes a call to dict it won't work here. The idea is to traverse the whole module tree, leave only the assignments, and run them locally:
class OnlyAssigns(ast.NodeTransformer):
    def generic_visit(self, node):
        return None  # throw other things away
    def visit_Module(self, node):
        # We need to visit the Module and pass it through
        return ast.NodeTransformer.generic_visit(self, node)
    def visit_Assign(self, node):
        if node.targets[0].id == 'animal':  # the name you may want to change
            return node  # keep it
        return None  # throw away
config = OnlyAssigns().visit(config)
Compile it and run:
exec(compile(config, 'config.py', 'exec'))
print(animal)
If animal should end up in its own dictionary, pass one as the locals to exec:
data = {}
exec(compile(config, 'config.py', 'exec'), globals(), data)
print(data['animal'])
There is much more you can do with AST hacking, like visiting every If or For statement. Check the documentation for the details.
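Alternatively, since the values inside the dict(...) call are plain string literals, you can walk the tree and ast.literal_eval just those values (a sketch, assuming the keyword-argument dict(...) form shown in the question; the source string here stands in for the file's contents):

```python
import ast

source = r"""
animal = dict(
    bear = r'^bear4x',
    tiger = r'^.*\tiger\b.*$'
)
"""

animal = {}
for node in ast.walk(ast.parse(source)):
    # Find the assignment to `animal` whose value is a dict(...) call
    if (isinstance(node, ast.Assign)
            and isinstance(node.targets[0], ast.Name)
            and node.targets[0].id == 'animal'
            and isinstance(node.value, ast.Call)):
        for kw in node.value.keywords:
            # Each keyword value is a string literal, safe to literal_eval
            animal[kw.arg] = ast.literal_eval(kw.value)

print(animal['bear'])  # ^bear4x
```

Unlike exec, this never runs any code from the config file.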
If the only reason you can't import that file as-is is that some of its imports will fail, you can potentially hack your way around that rather than processing a perfectly good Python file as plain text.
For example, if I have a file named busted_import.py with:
import doesnotexist
foo = 'imported!'
And I try to import it, I will get an ImportError. But if I define what the doesnotexist module refers to using sys.modules before trying to import it, the import will succeed:
>>> import sys
>>> sys.modules['doesnotexist'] = ""
>>> import busted_import
>>> busted_import.foo
'imported!'
So if you can just isolate the imports that will fail in your Python file and redefine those prior to attempting an import, you can work around the ImportErrors
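A slightly cleaner stub than an empty string is a real module object, so attribute access on the fake module also behaves sensibly (a sketch; doesnotexist and some_attr are hypothetical names):

```python
import sys
import types

# Register a stand-in module object before the failing import happens
fake = types.ModuleType('doesnotexist')
fake.some_attr = None  # any attributes the real module is expected to have
sys.modules['doesnotexist'] = fake

import doesnotexist  # now succeeds: Python finds it in sys.modules
print(doesnotexist is fake)  # True
```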
I am not sure exactly what you are trying to do.
If you want to process each line with a regular expression: you have ^ in the regular expression re.compile("^animal\s*=\s*('.*')"). It matches only when animal is at the start of the line, not after some spaces. Also, of course, it does not match bear or tiger; use something like re.compile("^\s*([a-z]+)\s*=\s*('.*')").
If you want to process multiple lines with single regular expression,
read about re.DOTALL and re.MULTILINE and how they affect matching newline characters:
http://docs.python.org/2/library/re.html#re.MULTILINE
Also note that text.readlines() reads lines, so the function in filter('not sure', text.readlines()) is run on each line, not on the whole file. You cannot pass a regular expression to filter(<re here>, text.readlines()) and hope it will match multiple lines.
BTW, processing Python files (and HTML, XML, JSON... files) using regular expressions is not wise. For every regular expression you write there are cases where it will not work. Use a parser designed for the given format; for Python source code that's ast. But for your use case ast is too complex.
Maybe it would be better to use classic config files and configparser. More structured data like lists and dicts can be easily stored in JSON or YAML files.