repeat function to extract similar information - python

Here is how I use pandas to open and read a JSON file. I really appreciate the power of pandas :)
import pandas as pd

df = pd.read_json("https://datameetgeobk.s3.amazonaws.com/cftemplates/EyeOfCustomer.json")

def mytype(mydict):
    try:
        if mydict["Type"]:
            return mydict["Type"]
    except:
        pass

df["myParametersType"] = df.Parameters.apply(lambda x: mytype(x))
The problem is that I need "Description" and "Default" values also along with "Type" strings. I have already written a function to extract types as mentioned above. Do I really need to write 2 more functions as shown below?
def mydescription(mydict):
    try:
        if mydict["Description"]:
            return mydict["Description"]
    except:
        pass

def mydefault(mydict):
    try:
        if mydict["Default"]:
            return mydict["Default"]
    except:
        pass

df["myParametersDescription"] = df.Parameters.apply(lambda x: mydescription(x))
df["myParametersDefault"] = df.Parameters.apply(lambda x: mydefault(x))
And how will I handle it if the dictionary contains more than 3 keys?
The final table should look something like this...
df.iloc[:, -3:].dropna(how="all")
myParametersType myParametersDescription myParametersDefault
pInstanceKeyName AWS::EC2::KeyPair::KeyName The name of the private key to use for SSH acc... None
pTwitterTermList String List of terms for twitter to listen to 'your', 'search', 'terms', 'here'
pTwitterLanguages String List of languages to use for the twitter strea... 'en'
pTwitterAuthConsumerKey String Consumer key for access twitter None
pTwitterAuthConsumerSecret String Consumer Secret for access twitter None
pTwitterAuthToken String Access Token Secret for calling twitter None
pTwitterAuthTokenSecret String Access Token Secret for calling twitter None
pApplicationName String Name of the application deploying for the EyeO... EyeOfCustomer
pVpcCIDR String Please enter the IP range (CIDR notation) for ... 10.193.0.0/16
pPublicSubnet1CIDR String Please enter the IP range (CIDR notation) for ... 10.193.10.0/24

You can pass the key name as a parameter to a single function:
def func(mydict, val):
    # non-dict rows (NaN) raise TypeError, missing keys raise KeyError
    try:
        if mydict[val]:
            return mydict[val]
    except (KeyError, TypeError):
        pass

df["myParametersType"] = df.Parameters.apply(lambda x: func(x, 'Type'))
df["myParametersDescription"] = df.Parameters.apply(lambda x: func(x, 'Description'))
df["myParametersDefault"] = df.Parameters.apply(lambda x: func(x, 'Default'))
df = df.iloc[:, -3:].dropna(how="all")
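Since the key name is now just an argument, handling a dictionary with more than three keys is only a matter of looping over the names instead of writing one apply call per key. A minimal sketch, reusing func() from above (extend the key list as needed):

for key in ['Type', 'Description', 'Default']:
    df['myParameters' + key] = df.Parameters.apply(lambda x: func(x, key))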

By turning each row into a pd.Series, you can build a dataframe with a column for every key and value in each dictionary, like this:
get_df = df['Parameters'].apply(lambda x: pd.Series(x)).drop(0, axis=1)  # rows without parameters land in column 0 as NaN
get_df.columns = ['P_' + col for col in get_df.columns]  # prefix, since you already have a 'Description' column
get_df.head()
Then attach it back to the original frame (with concat or direct assignment):
df[get_df.columns] = get_df
df.head()
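On pandas 1.0+, pd.json_normalize gives the same expansion in one call. A sketch, assuming rows without parameters hold NaN:

expanded = pd.json_normalize(df['Parameters'].apply(lambda d: d if isinstance(d, dict) else {}).tolist())
expanded.index = df.index  # json_normalize returns a fresh RangeIndex
expanded.columns = ['P_' + col for col in expanded.columns]
df = pd.concat([df, expanded], axis=1)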

Related

How to add unique alphanumeric id for pandas dataframe?

I need a solution where I can generate a unique alphanumeric ID column for my dataframe. The IDs must remain the same even if I run the script again in the future.
Name
Sam
Pray
Brad
I can generate the IDs based on this post, but I need 5-character alphanumeric values that will always remain the same.
This is desired output:
Name ID
Sam X25TR
Peter WE558
Pepe TR589
One way would be to generate a hash of the name, by whatever hashing algorithm, and keep the first five characters of the hash. But you should keep in mind that with such a short hash this is likely to cause collisions (the same output for multiple different inputs) if you have enough data.
Something along these lines:
import hashlib

def get_id(name: str) -> str:
    hash = hashlib.md5(name.encode())
    return hash.hexdigest()[:5]
Now for a given input string, get_id returns an alphanumeric 5-character string which is always the same for the same input.
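Applied to the dataframe from the question (assuming the Name column shown above):

df['ID'] = df['Name'].map(get_id)

Note that hexdigest() only produces the characters 0-9 and a-f; if you want a wider character set you could, for example, base32-encode the digest instead.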
This function generates a random alphanumeric string of the given length:
import math
import secrets

def random_alphanum(length: int) -> str:
    text = secrets.token_hex(nbytes=math.ceil(length / 2))
    isEven = length % 2 == 0
    return text if isEven else text[1:]
df['ID'] = random_alphanum(5)  # note: this assigns the same ID to every row
To generate a distinct ID per row, apply it row-wise:
df2['ID'] = df2.apply(lambda x: random_alphanum(5), axis=1, result_type="expand")
Here's my attempt
import secrets

ids = []
while len(ids) < df.shape[0]:
    temp = secrets.token_hex(5)[:5]
    if temp not in ids:
        ids.append(temp)

df.merge(pd.DataFrame(ids).reset_index(), left_on=df.groupby(['Name']).ngroup(), right_on='index')
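The merge assigns one of the pre-generated unique IDs to each distinct Name via groupby().ngroup(). A more direct way to express the same mapping (a sketch under the same assumptions):

df['ID'] = df.groupby(['Name']).ngroup().map(dict(enumerate(ids)))

Keep in mind that secrets produces random tokens, so unlike the hash-based approach these IDs will differ between runs.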

Applying custom function to a column of lists in pandas, how to handle exceptions?

I have a data frame of 15000 records which has a text column (column name = clean) whose values are lists.
I need to find the minimum value in each row's list and add it as a new column called min.
def find_min(x):
    x = min(x)
    return x
I tried to pass the above function
df1['min'] = df1['clean'].map(find_min)
Getting below error
ValueError: min() arg is an empty sequence
It seems there is an empty list somewhere. How do I address this, given that I have 15000 records?
Please advise
Since we have a column of lists, let us handle the error inside your function with try/except (the EAFP pattern):
import numpy as np

def find_min(x):
    try:
        return min(x)
    except ValueError:
        return np.nan

df1['min'] = df1['clean'].map(find_min)
Another way is to skip the function and define this inline:
df1['min'] = df1['clean'].map(lambda x: min(x) if len(x) else np.nan)
You can also do this using a list comprehension, which is quite fast:
df1['min'] = [find_min(x) for x in df1['clean']]
or,
df1['min'] = [min(x) if len(x) else np.nan for x in df1['clean']]
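A quick check on a toy frame (hypothetical data mirroring the question) shows the empty list no longer raises:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'clean': [[3, 1, 2], [], [5]]})
df1['min'] = df1['clean'].map(lambda x: min(x) if len(x) else np.nan)
print(df1)
#        clean  min
# 0  [3, 1, 2]  1.0
# 1         []  NaN
# 2        [5]  5.0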

Split string list into dictionary keys in python

I have a string 'request.context.user_id' and I want to split the string by '.' and use each element in the list as a dictionary key. Is there a way to do this for lists of varying lengths without trying to hard code all the different possible list lengths after the split?
parts = string.split('.')

if len(parts) == 1:
    data = [x for x in logData if x[parts[0]] in listX]
elif len(parts) == 2:
    data = [x for x in logData if x[parts[0]][parts[1]] in listX]
else:
    print("Add more hard code")
listX is a list of string values that should be matched against the value retrieved by x[parts[0]][parts[1]].
logData is a list obtained by reading a JSON file; the list can then be read into a dataframe using json_normalize. The df portion is provided to give some context about its structure: a list of dicts.
import json
from pandas.io.json import json_normalize

with open(project_root + "filename") as f:
    logData = json.load(f)

df = json_normalize(logData)
If you want arbitrary counts, that means you need a loop. You can use get repeatedly to drill through layers of dictionaries.
parts = "request.context.user_id".split(".")
logData = [{"request": {"context": {"user_id": "jim"}}}]
listX = "jim"
def generate(logData, parts):
for x in logData:
ref = x
# ref will be, successively, x, then the 'request' dictionary, then the
# 'context' dictionary, then the 'user_id' value 'jim'.
for key in parts:
ref = ref[key]
if ref in listX:
yield x
data = list(generate(logData, parts))) # ['jim']
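If some records might be missing part of the path, a variant with dict.get avoids raising KeyError and simply skips those records (a sketch under the same setup):

def generate_safe(logData, parts):
    for x in logData:
        ref = x
        for key in parts:
            if not isinstance(ref, dict):
                ref = None
                break
            ref = ref.get(key)
        if ref is not None and ref in listX:
            yield x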
I just realized in the comments you said that you didn't want to create a new dictionary but access an existing one x via chaining up the parts in the list.
(3.b) Use a for loop to get/set the value at the key path
In case you only want to read the value at the end of the path:
import copy

def get_val(key_list, dict_):
    reduced = copy.deepcopy(dict_)
    for i in range(len(key_list)):
        reduced = reduced[key_list[i]]
    return reduced

# this solution isn't mine, see the link below
def set_val(dict_, key_list, value_):
    for key in key_list[:-1]:
        dict_ = dict_.setdefault(key, {})
    dict_[key_list[-1]] = value_
get_val(): key_list is the result of string.split('.') and dict_ is the x dictionary in your case.
You can leave out the copy.deepcopy() part; that's just for paranoid peeps like me. The reason is that a Python dict is not immutable, so working on a deepcopy (a separate but exact copy in memory) avoids mutating the original.
set_val(): as I said, it's not my idea; credit to @Bakuriu.
dict.setdefault(key, default_value) will take care of non-existing keys in x.
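Tying get_val() back to the original filtering problem, the hard-coded comprehension from the question becomes (a sketch; it assumes every record contains the full path, otherwise wrap the call in try/except):

parts = 'request.context.user_id'.split('.')
data = [x for x in logData if get_val(parts, x) in listX]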
(3) Evaluating a string as code with eval() and/or exec()
So here's an ugly unsafe solution:
def chainer(key_list):
    new_str = ''
    for key in key_list:
        new_str = "{}['{}']".format(new_str, key)
    return new_str
x = {'request': {'context': {'user_id': 'is this what you are looking for?'}}}
keys = 'request.context.user_id'.split('.')
chained_keys = chainer(keys)

# quite dirty, but you may use eval() to evaluate a string
print(eval("x{}".format(chained_keys)))
# will print
# is this what you are looking for?
which is the innermost value of the mockup x dict
I assume you could use this in your code like this
data = [x for x in logData if eval("x{}".format(chained_keys)) in listX]
# or in python 3.x with f-string
data = [x for x in logData if eval(f"x{chained_keys}") in listX]
...or something similar.
Similarly, you can use exec() to execute a string as code if you wanted to write to x, though it's just as dirty and unsafe.
exec("x{} = '...or this, maybe?'".format(chained_keys))
print(x)
# will print
{'request': {'context': {'user_id': '...or this, maybe?'}}}
(2) An actual solution could be a recursive function, like so:
def nester(key_list):
    if len(key_list) == 0:
        return 'value'  # can change this to whatever you like
    else:
        return {key_list.pop(0): nester(key_list)}

keys = 'request.context.user_id'.split('.')
# ['request', 'context', 'user_id']
data = nester(keys)
print(data)
# will result in
# {'request': {'context': {'user_id': 'value'}}}
(1) A solution with a list comprehension to split the string by '.' and use each element in the list as a dictionary key:
data = {}
parts = 'request.context.user_id'.split('.')

if parts:  # one or more items
    [data.update({part: 'value'}) for part in parts]

print(data)
# the result
# {'request': 'value', 'context': 'value', 'user_id': 'value'}
You can overwrite the values in data afterwards.

How to pass list to UserDefinedFunction (UDF) in pyspark

I need to pass a list as arguments for a certain UDF I have in pyspark. Example:
from pyspark.sql.functions import UserDefinedFunction

def cat(mine, mine2):
    if mine is not None and mine2 is not None:
        return "2_" + mine + "_" + mine2

udf_cat = UserDefinedFunction(cat, "string")
l = ["COLUMN1", "COLUMN2"]
df = df.withColumn("NEW_COLUMN", udf_cat(l))
But I always get an error.
After a while, I figured out that all I need is to unpack the list with the '*' operator before passing it. Example:
df = df.withColumn("NEW_COLUMN", udf_cat(*l))
That way, it will work.
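For context, a minimal end-to-end sketch (the data and column names are hypothetical; it assumes a running SparkSession):

from pyspark.sql import SparkSession
from pyspark.sql.functions import UserDefinedFunction

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('a', 'b')], ['COLUMN1', 'COLUMN2'])

def cat(mine, mine2):
    if mine is not None and mine2 is not None:
        return '2_' + mine + '_' + mine2

udf_cat = UserDefinedFunction(cat, 'string')
l = ['COLUMN1', 'COLUMN2']

# *l unpacks the list, so this is equivalent to udf_cat('COLUMN1', 'COLUMN2')
df = df.withColumn('NEW_COLUMN', udf_cat(*l))
df.show()  # NEW_COLUMN contains '2_a_b'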

Getting all the values matching a key pattern from redis through Lua

I am trying to find all the keys and their values matching a specific pattern using py-redis and Lua. Here is my code:
rc = redis.Redis(..)
rc.set('google:', 100)
rc.set('google:3', 200)
rc.set('google:2', 3400)
rc.set('google', 200)
rc.set('fb', 300)

get_script = """
local value = redis.call('GET', KEYS[1])
return value
"""

get_values = rc.register_script(get_script)
print(get_values(rc.keys(pattern='google:*')))
print(get_values(keys=['google:']))
print(get_values(keys=['google:*']))
The output that I am getting is
100
100
None
First of all, I do not get why I am getting None for the last print statement. My original purpose is to get all the keys (and their values) matching the pattern, but I am only getting the first key's value.
I think I have found what I was missing. GET treats 'google:*' as a literal key name (which does not exist, hence the None); it does no pattern matching. Instead of GET, I should first call KEYS with the pattern and then iterate over the resulting keys to get the values:
get_script = """
local keys = (redis.call('keys', ARGV[1]))
local values={}
for i,key in ipairs(keys) do
local val = redis.call('GET', key)
values[i]=val
i=i+1
end
return values
"""
get_values = rc.register_script(get_script)
print get_values(args=['google:*'])
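As a side note, KEYS scans the entire keyspace and blocks the server, so it is discouraged on large databases. The same result can be had from the client side without Lua, using redis-py's SCAN-based iterator (a sketch; unlike the script, this is not atomic):

keys = list(rc.scan_iter(match='google:*'))
values = rc.mget(keys) if keys else []
print(dict(zip(keys, values)))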
