Python urllib.urlencode: parse dictionary items into string

I need to encode a dictionary item like this:
data = OrderedDict([('mID', ['54a309ae1c61be23aba0da54', '54a309ae1c61be23aba0da63'])])
into a string formatted like this
mID=[54a309ae1c61be23aba0da54,54a309ae1c61be23aba0da63]
When I use url_values = urllib.urlencode(data)
I get mID=%5B%2754a309ae1c61be23aba0da54%27%2C+%2754a309ae1c61be23aba0da63%27%5D
What could I do?

Maybe:
"{}=[{}]".format("mID",",".join(data["mID"]))

With the urllib.parse module for Python 3.x:
import collections
from urllib import parse
data = collections.OrderedDict([('mID', ['54a309ae1c61be23aba0da54', '54a309ae1c61be23aba0da63'])])
urlenc_str = parse.unquote_plus(parse.urlencode(data))
urlenc_str = urlenc_str.replace("'", '').replace(' ', '')
print(urlenc_str)
The output:
mID=[54a309ae1c61be23aba0da54,54a309ae1c61be23aba0da63]
Checking type:
print(type(urlenc_str)) # <class 'str'>
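An alternative sketch for Python 3.5+ (where urlencode accepts a safe parameter): build the bracketed value yourself and tell urlencode not to percent-encode the brackets and comma, so no post-processing is needed.
from urllib.parse import urlencode

ids = ['54a309ae1c61be23aba0da54', '54a309ae1c61be23aba0da63']
# safe='[],' keeps the brackets and comma out of the percent-encoding
urlenc_str = urlencode({'mID': '[' + ','.join(ids) + ']'}, safe='[],')
print(urlenc_str)  # mID=[54a309ae1c61be23aba0da54,54a309ae1c61be23aba0da63]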

Related

Strip a specific part from a URL string in Python

I'm passing through some URLs and I'd like to strip a part that changes dynamically, so I don't know it beforehand.
An example url is:
https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2
And I'd like to strip the gid=lostchapter part without any of the rest.
How do I do that?
You can use urllib to convert the query string into a Python dict and access the desired item:
In [1]: from urllib import parse
In [2]: s = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
In [3]: q = parse.parse_qs(parse.urlsplit(s).query)
In [4]: q
Out[4]:
{'pid': ['2'],
'gid': ['lostchapter'],
'lang': ['en_GB'],
'practice': ['1'],
'channel': ['desktop'],
'demo': ['2']}
In [5]: q["gid"]
Out[5]: ['lostchapter']
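To actually remove gid from the URL rather than just read it, the remaining parameters can be re-encoded and spliced back in. A sketch continuing the session above (assuming a Python version where dicts keep insertion order):
In [6]: parts = parse.urlsplit(s)
In [7]: q.pop("gid", None)
Out[7]: ['lostchapter']
In [8]: parse.urlunsplit(parts._replace(query=parse.urlencode(q, doseq=True)))
Out[8]: 'https://...?pid=2&lang=en_GB&practice=1&channel=desktop&demo=2'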
Here is a simple way to strip them:
urls = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
# Import the `urlparse` and `urlunparse` methods
from urllib.parse import urlparse, urlunparse
# Parse the URL
url = urlparse(urls)
# Convert the `urlparse` object back into a URL string
url = urlunparse(url)
# Strip the string
url = url.split("?")[1]
url = url.split("&")[1]
# Print the new URL
print(url) # Prints "gid=lostchapter"
Method 1: Using urlparse
from urllib.parse import urlparse
p = urlparse('https://.../?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2')
param: list[str] = [i for i in p.query.split('&') if i.startswith('gid=')]
Output: ['gid=lostchapter']
Method 2: Using Regex
import re
param: str = re.search(r'gid=.*?&', 'https://.../?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2').group()[:-1]
You can change the regex pattern to a more specific one if needed; the non-greedy .*? makes the match stop at the first &, so only the value following gid= is extracted.
We can try doing a regex replacement:
import re
url = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
output = re.sub(r'(?<=[?&])gid=lostchapter&?', '', url)
print(output) # https://...?pid=2&lang=en_GB&practice=1&channel=desktop&demo=2
For a more generic replacement, match on the following regex pattern:
(?<=[?&])gid=\w+&?
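For instance, reusing the url variable from above (a small sketch of the generic pattern):
output = re.sub(r'(?<=[?&])gid=\w+&?', '', url)
print(output)  # https://...?pid=2&lang=en_GB&practice=1&channel=desktop&demo=2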
Using string slicing (I'm assuming there will be an '&' after gid=lostchapter)
url = r'https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2'
start = url.find('gid')
end = start + url[start:].find('&')
url = url[start:end]
print(url)
output
gid=lostchapter
What I'm trying to do here is:
find the index of the first occurrence of "gid"
find the first "&" after "gid"
slice the url from "gid" up to (but not including) that "&"

Extracting a value from a long string

I am trying to extract a string from a longer string in one of my columns.
Here is a sample of what I have tried:
df['Campaign'] = df.full_utm.str.extract('utm_campaign=([^&]*)')
and this is a sample of the string I am referring to:
?utm_source=Facebook&utm_medium=CPC&utm_campaign=April+Merchants+LAL+-+All+SA+-+CAP+250&utm_content=01noprice
The problem is that this only returns this:
A
The desired output in this context would be
April+Merchants+LAL+-+All+SA+-+CAP+250
Use urlparse
Ex:
import urllib.parse as urlparse
df['Campaign'] = df["full_utm"].apply(lambda x: urlparse.parse_qs(urlparse.urlparse(x).query)["utm_campaign"])
print(df)
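A self-contained sketch with a made-up one-row DataFrame (note that parse_qs decodes the + signs into spaces and returns a list per key, so [0] picks out the value):
import pandas as pd
import urllib.parse as urlparse

df = pd.DataFrame({"full_utm": [
    "?utm_source=Facebook&utm_medium=CPC&utm_campaign=April+Merchants+LAL+-+All+SA+-+CAP+250&utm_content=01noprice"
]})
# parse_qs returns {'utm_campaign': ['April Merchants LAL - All SA - CAP 250'], ...}
df["Campaign"] = df["full_utm"].apply(
    lambda x: urlparse.parse_qs(urlparse.urlparse(x).query)["utm_campaign"][0]
)
print(df["Campaign"].iloc[0])  # April Merchants LAL - All SA - CAP 250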

python 2.7 get rid of double backslashes

I have list with one string element, see below
>>> s
['{\\"SrcIP\\":\\"1.1.1.1\\",\\"DstIP\\":\\"2.2.2.2\\",\\"DstPort\\":\\"80\\"}']
I want to get rid of these '\\' and have a dict instead:
{"SrcIP":"1.1.1.1","DstIP":"2.2.2.2","DstPort":"80"}
It looks like a JSON object. You can load it into a dict with the json package, but first, to get rid of the list and the backslashes, you can call s[0].replace('\\', '')
import json
my_dict = json.loads(s[0].replace('\\', ''))
You can try this:
import re
import ast
s = ['{\\"SrcIP\\":\\"1.1.1.1\\",\\"DstIP\\":\\"2.2.2.2\\",\\"DstPort\\":\\"80\\"}']
final_response = [ast.literal_eval(re.sub('\\\\', '', i)) for i in s][0]
Output:
{'SrcIP': '1.1.1.1', 'DstIP': '2.2.2.2', 'DstPort': '80'}
Just use the string replace method:
list_1 = ['{\\"SrcIP\\":\\"1.1.1.1\\",\\"DstIP\\":\\"2.2.2.2\\",\\"DstPort\\":\\"80\\"}']
for i in list_1:
    print(str(i).replace("\\", ""))
Or you can do it in one line:
print(str(list_1[0]).replace("\\", ""))
output:
{"SrcIP":"1.1.1.1","DstIP":"2.2.2.2","DstPort":"80"}
s is a list with one text item; you can get your desired output as follows:
import ast
s = ['{\\"SrcIP\\":\\"1.1.1.1\\",\\"DstIP\\":\\"2.2.2.2\\",\\"DstPort\\":\\"80\\"}']
s_dict = ast.literal_eval(s[0].replace('\\', ''))
print s_dict
print s_dict['DstIP']
Giving you the following output:
{'SrcIP': '1.1.1.1', 'DstIP': '2.2.2.2', 'DstPort': '80'}
2.2.2.2
The Python function ast.literal_eval() can be used to safely convert a string into a Python object, in this case a dictionary.

How to search for matched string then extract the string after it and a colon

I'm new to Python and web scraping, so I apologize if the question is too basic!
I want to extract the "score" and "rate" (rating) from the following example BeautifulSoup object:
import bs4
import re
text = '<html><body>{"count":1,"results":[{"score":"2-1","MatchId":{"number":"889349"},"name":"Match","rating":{"rate":9.0}}],"performance":{"comment":{}}}</body></html>'
page = bs4.BeautifulSoup(text, "lxml")
print type(page)
I have tried these, but nothing showed up (just a blank []):
tmp = page.find_all(text=re.compile("score:(.*)"));
print(tmp)
tmp = page.findAll("score");
print(tmp)
I found this similar question, but it gave me an error:
tmp = page.findAll(text = lambda(x): x.lower.index('score') != -1)
print(tmp)
AttributeError: 'builtin_function_or_method' object has no attribute 'index'
What did I do wrong? Thanks in advance!
This is two thirds of the way to a turducken of protocols. You can use BeautifulSoup to find the body text and decode that with json. Then you have some Python dicts and lists to work through.
>>> import json
>>> import bs4
>>> import re
>>> text = '<html><body>{"count":1,"results":[{"score":"2-1","MatchId":{"number":"889349"},"name":"Match","rating":{"rate":9.0}}],"performance":{"comment":{}}}</body></html>'
>>> page = bs4.BeautifulSoup(text, "lxml")
>>>
>>> data = json.loads(page.find('body').text)
>>> for result in data["results"]:
... print(result["score"], result["rating"]["rate"])
...
2-1 9.0
>>>

Converting a string representation of a list in python to a list object

I have some data of the format
[[prod149090160, prod146340131, prod160860042, prod147040186, prod147860348, prod157590283, prod153940219, prod162460011, prod160410115, prod157370014], [prod162290002, prod151790213, prod159380278, prod154180602, prod160020244, prod161410007, prod155540059, prod152810207, prod152870263, prod159300061], [prod156900051, prod157590288, prod153540027, prod162940222, prod160330181, prod162680033, prod155370061, prod156970034, prod159310027, prod159410165]]
This is a list of lists in string format. Is there any simple way to convert this into a built-in Python list type?
Use PyYAML:
>>> import yaml
>>> s = '[[prod149090160, prod146340131, prod160860042, prod147040186, prod147860348, prod157590283, prod153940219, prod162460011, prod160410115, prod157370014], [prod162290002, prod151790213, prod159380278, prod154180602, prod160020244, prod161410007, prod155540059, prod152810207, prod152870263, prod159300061], [prod156900051, prod157590288, prod153540027, prod162940222, prod160330181, prod162680033, prod155370061, prod156970034, prod159310027, prod159410165]]'
>>> yaml.load(s)
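With current PyYAML, yaml.load without a Loader argument warns on 5.1+ and requires a Loader from 6.0, so yaml.safe_load is the safer drop-in here. A small sketch; the unquoted tokens simply come back as strings:
>>> yaml.safe_load(s)[0][:3]
['prod149090160', 'prod146340131', 'prod160860042']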
Use regular expressions:
>>> import re
>>> s = '[[prod149090160, prod146340131, prod160860042, prod147040186, prod147860348, prod157590283, prod153940219, prod162460011, prod160410115, prod157370014], [prod162290002, prod151790213, prod159380278, prod154180602, prod160020244, prod161410007, prod155540059, prod152810207, prod152870263, prod159300061], [prod156900051, prod157590288, prod153540027, prod162940222, prod160330181, prod162680033, prod155370061, prod156970034, prod159310027, prod159410165]]'
>>> groups = re.findall(r'\[([^\]]*)\]', s[1:-1])
>>> [re.findall(r'(prod\d+)', group) for group in groups]
[['prod149090160', 'prod146340131', 'prod160860042', 'prod147040186', 'prod147860348', 'prod157590283', 'prod153940219', 'prod162460011', 'prod160410115', 'prod157370014'], ['prod162290002', 'prod151790213', 'prod159380278', 'prod154180602', 'prod160020244', 'prod161410007', 'prod155540059', 'prod152810207', 'prod152870263', 'prod159300061'], ['prod156900051', 'prod157590288', 'prod153540027', 'prod162940222', 'prod160330181', 'prod162680033', 'prod155370061', 'prod156970034', 'prod159310027', 'prod159410165']]
This is what Bakuriu was talking about (note that the items have to be quoted for ast.literal_eval to accept them):
data = '''[["prod149090160", "prod146340131", "prod160860042", "prod147040186",
"prod147860348", "prod157590283", "prod153940219", "prod162460011",
"prod160410115", "prod157370014"],
["prod162290002", "prod151790213", "prod159380278", "prod154180602",
"prod160020244", "prod161410007", "prod155540059", "prod152810207",
"prod152870263", "prod159300061"],
["prod156900051", "prod157590288", "prod153540027", "prod162940222",
"prod160330181", "prod162680033", "prod155370061", "prod156970034",
"prod159310027", "prod159410165"]]'''
import ast
print ast.literal_eval(data)
Output:
[['prod149090160', 'prod146340131', 'prod160860042', 'prod147040186',
'prod147860348', 'prod157590283', 'prod153940219', 'prod162460011',
'prod160410115', 'prod157370014'],
['prod162290002', 'prod151790213', 'prod159380278', 'prod154180602',
'prod160020244', 'prod161410007', 'prod155540059', 'prod152810207',
'prod152870263', 'prod159300061'],
['prod156900051', 'prod157590288', 'prod153540027', 'prod162940222',
'prod160330181', 'prod162680033', 'prod155370061', 'prod156970034',
'prod159310027', 'prod159410165']]
The format shown would also be a legal JSON parse-able string:
import json
print json.loads(data)
If the items are not quoted (as in the original string s), you can quote them with a regex substitution first and then parse with json:
import json
import re
print json.loads(re.sub(r'([^\[\],\s+]+)', r'"\1"', s))
