I have a dataframe with 3 million rows. I need to transform the values in a column. The column contains strings joined together with ";". The transformation involves breaking up the string into its components and then choosing one of the strings based on some priority rules.
Here is the sample dataset and the function:
import pandas as pd
data = {'Name': ['X1', 'X2', 'X3', 'X4', 'X5','X6'], 'category': ['CatA;CatB', 'CatB', None, 'CatB;CatC;CatA', 'CatA;CatB', 'CatB;CatD;CatB;CatC;CatA']}
sample_dataframe = pd.DataFrame(data)
def cat_name(x):
    if x:
        x = pd.Series(x.split(";"))
        y = x[(x!='CatA') & x.notna()]
        custom_dict = {'CatC': 0, 'CatD':1, 'CatB': 2, 'CatE': 3}
        if x.count() == 1:
            return x.iloc[0]
        elif y.count() > 1:
            y = y.sort_values(key=lambda x: x.map(custom_dict))
            if y.count() > 2:
                return '3 or more'
            else:
                return y.iloc[0]+'+'
        elif y.count() == 1:
            return y.iloc[0]
        else:
            return None
    else:
        return None
I am using the apply method, test_data = sample_dataframe['category'].apply(cat_name), to run the function on the column. For my dataset of 3 million rows, the function takes almost 10 minutes to run.
How can I optimize the function to run faster?
Also, I have two sets of category rules, and the output categories need to be stored in two columns. Currently I am using the apply function twice. Kinda dumb and slow, I know, but it works.
Is there a way to run the function at the same time for a different priority dictionary and return two output values? I tried to use
test_data['CAT_NAME'], test_data['MAIN_CAT_NAME']=zip(*sample_dataframe['category'].apply(joint_cat_name)) with the function
def joint_cat_name(x):
    cat_string = x
    if cat_string:
        string_series = pd.Series(cat_string.split(";"))
        y = string_series[(string_series!='CatA') & string_series.notna()]
        custom_dict = {'CatB': 0, 'CatC':1, 'CatD': 2, 'CatE': 3}
        if string_series.count() == 1:
            return string_series.iloc[0], string_series.iloc[0]
        elif y.count() > 1:
            y = y.sort_values(key=lambda x: x.map(custom_dict))
            if y.count() > 2:
                return '3 or more', y.iloc[0]
            elif y.count() == 1:
                return y.iloc[0]+'+', y.iloc[0]
        elif y.count() == 1:
            return y.iloc[0], y.iloc[0]
        else:
            return None, None
    else:
        return None, None
But I got the error TypeError: 'NoneType' object is not iterable when the zip function encountered a tuple containing Nones, i.e. it threw an error when the output was (None, None).
Thanks a lot in advance.
Your function does a lot of unnecessary work. Even if you just reorder some conditionals it will run much faster.
custom_dict = {"CatC": 0, "CatD": 1, "CatB": 2, "CatE": 3}
def cat_name(x):
    if x is None:
        return x
    xs = x.split(";")
    if len(xs) == 1:
        return xs[0]
    ys = [x for x in xs if x != "CatA"]
    l = len(ys)
    if l == 0:
        return None
    if l == 1:
        return ys[0]
    if l == 2:
        return min(ys, key=lambda k: custom_dict[k]) + "+"
    if l > 2:
        return "3 or more"
Instead of running one Python function on each row, it might be faster to go through your dataframe multiple times, each time using an optimized pandas query. You'd have to rewrite your code something like this:
cats = sample_dataframe['category']
# select empty categories
no_cat = cats.isna()
# get number of categories (NaN where the category is missing)
num_cats = cats.str.count(";") + 1
# select category strings with only one category
single_cat = num_cats == 1
# count the "CatA" entries, because cat_name ignores them when counting
num_cat_A = cats.str.count(r"(?:^|;)CatA(?=;|$)")
has_cat_A = num_cat_A > 0
three_or_more = (num_cats - num_cat_A) > 2
# then also write these selected rows in a custom way
# (.loc writes into the dataframe itself and avoids chained assignment)
sample_dataframe["cat_name"] = None  # rows without a category keep None
sample_dataframe.loc[single_cat, "cat_name"] = cats[single_cat]
sample_dataframe.loc[three_or_more, "cat_name"] = "3 or more"
# continue with however complex you want to get to cover more cases, e.g.
two_cats_no_cat_A = (num_cats == 2) & ~has_cat_A
# then handle only the remaining cases with the apply
not_handled = ~no_cat & ~single_cat & ~three_or_more
sample_dataframe.loc[not_handled, "cat_name"] = cats[not_handled].apply(cat_name)
Running these queries on 3 million rows should be plenty fast, even if you have to do a few of them and combine them. If it's still too slow, you can handle more special cases from the apply in the same vectorized fashion.
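For the two-column part of the question, here is a minimal sketch of a plain-Python function that returns both values at once. It assumes a single priority dictionary (the one from joint_cat_name in the question) and, as an assumption of this sketch, ranks unknown categories last instead of raising. Note that the TypeError in the question appears to come from the branch with exactly two non-CatA categories, where the function falls through and returns a bare None rather than a tuple; a (None, None) tuple on its own is fine for zip.

```python
# Priority dictionary copied from joint_cat_name in the question.
PRIORITY = {"CatB": 0, "CatC": 1, "CatD": 2, "CatE": 3}

def joint_cat_name(x, priority=PRIORITY):
    """Return (label, main category); every branch returns a 2-tuple."""
    if not isinstance(x, str) or not x:
        return None, None
    parts = x.split(";")
    if len(parts) == 1:
        return parts[0], parts[0]
    rest = [p for p in parts if p != "CatA"]
    if not rest:
        return None, None
    # Assumption: categories missing from the dictionary rank last.
    main = min(rest, key=lambda k: priority.get(k, len(priority)))
    if len(rest) == 1:
        return main, main
    if len(rest) == 2:
        return main + "+", main
    return "3 or more", main

categories = ["CatA;CatB", "CatB", None, "CatB;CatC;CatA",
              "CatA;CatB", "CatB;CatD;CatB;CatC;CatA"]
cat_names, main_cats = zip(*(joint_cat_name(c) for c in categories))
print(cat_names)
print(main_cats)
```

With a dataframe, the same zip(*...) pattern over sample_dataframe['category'] yields two tuples that can be assigned to the two output columns.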
The code below was built based on a blog post about broadcasting a dictionary to a UDF. The broadcast dictionary should return a list of dictionaries, i.e. the expected return from the broadcast dictionary is {'key':[{'key':'value','key2':'value'},{'key':'value','key2':'value'}]}
When I try to access any key from a dictionary in the list, no value is returned. I know this because I get the error "local variable 'tmp' referenced before assignment". This variable should contain a calculated value resulting from the operations defined in each dictionary in the list. Can anyone tell me how to debug this, or why the dictionary keys are not returning values? I added print statements to the code that returns the broadcast key, but their output never appears in the output or the driver logs.
broadcast dictionary
calc dict {'name': 'Lvl1_Name_Score', 'operations': [{'operator': 'multi', 'value': 1.5}, {'operator': 'div', 'value': 1}]}
def apply_weights_a(calc_broadcasted,key):
    def apply_weights(col_name):
        calc_operations = calc_broadcasted.value.get(key)
        results = 0
        print(f'calc_operations {calc_operations}')
        for op in calc_operations:
            print(f'op {op}')
            # for i,op in enumerate(calc['dependencies']
            if op['operator'] == 'div': tmp = col_name / float(op['value'])
            elif op['operator'] == 'multi': tmp = col_name * float(op['value'])
            elif op['operator'] == 'sub':tmp = col_name - float(op['value'])
            elif op['operator'] == 'add': tmp = col_name + float(op['value'])
            # sdf = sdf['tmp'].apply(lambda x: 0 if x < 0 else x)
            results = results + tmp
        return results
    return udf(apply_weights)
def multi_apply_weights(col_names,weights):
    def inner(sdf):
        for col_name in col_names:
            #print(f"weights {weights}")
            size = len(col_name) #get col_name without _c
            calc = [mylist for mylist in weights if mylist['name'] == col_name[:size-2] ]
            #print(f"calc {calc}")
            calc_dict = calc[0]
            #print(f"calc dict {calc[0]}")
            b = spark.sparkContext.broadcast(calc_dict)
            sdf = sdf.withColumn(
                col_name,
                apply_weights_a(b,'operations')(col_name)
            )
        return sdf
    return inner
sdf = multi_apply_weights(weight_list,config['algorithmScores'])(sdf)
This spark code is the result of converting this pandas function which performs 1 to multiple different calculations on a set of columns...
def apply_weights(df, weights):
    for calc in weights:
        col = calc['name']
        for op in calc['operations']:
            if op['operator'] == 'div':
                df[col + '_w'] = df[col]/float(op['value'])
            elif op['operator'] == 'multi':
                df[col+ '_w'] = df[col]*float(op['value'])
            elif op['operator'] == 'sub':
                df[col+ '_w'] = df[col]-float(op['value'])
            elif op['operator'] == 'add':
                df[col+ '_w'] = df[col]+float(op['value'])
        df[col] = df[col].apply(lambda x: 0 if x < 0 else x)
    return df
I found that I was able to run the original Python code by making a minor change. This worked on 3 of my 4 functions. When I used this approach on my last function, it caused my cluster to crash with the error 'The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached'. I have opened a ticket with Microsoft to get additional help with this.
def apply_weights(kdf, weights) -> ks.DataFrame[zip(kdf.dtypes, kdf.columns)]:
I also resolved the issue with broadcasting the dictionary by adding a key before broadcasting. I changed the broadcast to
brdcst = dict({"rules":newcolrules})
b = spark.sparkContext.broadcast(brdcst)
calcs = b.value.get('rules') #test broadcasted dictionary
print(f"calcs : {calcs}") #test broadcasted dictionary
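The "local variable 'tmp' referenced before assignment" error is what Python raises when none of the if/elif branches assigns tmp, for example when an operator name does not match any of the expected strings. As a rough sketch in plain Python (no Spark involved; OPERATORS and apply_operations are hypothetical names, not part of the original code), a lookup table makes that case fail loudly with the offending value:

```python
import operator

# Hypothetical lookup table for the operator names used in the config.
OPERATORS = {
    "div": operator.truediv,
    "multi": operator.mul,
    "sub": operator.sub,
    "add": operator.add,
}

def apply_operations(value, operations):
    """Sum the result of every operation applied to value, like the UDF above."""
    total = 0
    for op in operations:
        name = op["operator"]
        if name not in OPERATORS:
            # Fail with the offending name instead of an unassigned variable.
            raise ValueError(f"unknown operator: {name!r}")
        total += OPERATORS[name](value, float(op["value"]))
    return total

operations = [{"operator": "multi", "value": 1.5}, {"operator": "div", "value": 1}]
print(apply_operations(10, operations))  # 10 * 1.5 + 10 / 1 = 25.0
```

The same function body could be wrapped in a UDF; whether that fits depends on the cluster setup described above.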
** I modified the entire question **
I have the example lists below, and I want to find out whether two values come from the same list and, if so, which list they both come from.
list1 = ['a','b','c','d','e']
list2 = ['f','g','h','i','j']
c = 'b'
d = 'e'
I used a for loop to check whether the values exist in the lists, but I am not sure how to find out which list a value actually comes from.
for x,y in zip(list1,list2):
    if c and d in x or y:
        print(True)
Please advise if there is any workaround.
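One possible approach, as a minimal sketch: note that `c and d in x or y` is parsed as `(c and (d in x)) or y`, and that zip hands the loop single elements rather than whole lists. Keeping the lists in a dictionary keyed by name (the names and the find_common_list helper are assumptions made for illustration) lets the check report which list holds both values:

```python
list1 = ['a', 'b', 'c', 'd', 'e']
list2 = ['f', 'g', 'h', 'i', 'j']

def find_common_list(named_lists, first, second):
    """Return the name of the first list containing both values, else None."""
    for name, values in named_lists.items():
        if first in values and second in values:
            return name
    return None

named_lists = {"list1": list1, "list2": list2}
print(find_common_list(named_lists, 'b', 'e'))  # list1
print(find_common_list(named_lists, 'b', 'f'))  # None
```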
First, you might want to inspect the distribution of the size values to see where you can improve the result with the least effort, like this:
df_inspect = df.copy()
df_inspect["size.value"] = df_inspect["size.value"].map(lambda x: ''.join(y.upper() for y in x if y.isalpha()))
df_inspect = df_inspect.groupby("size.value").size().sort_values(ascending=False)
Then create a solution for the most frequently occurring size category, here "Wide":
long = "adasda, 9.5 W US"
short = "9.5 Wide"
def get_intersection(s1, s2):
    res = ''
    l_s1 = len(s1)
    for i in range(l_s1):
        # j is an exclusive slice end, so it runs up to l_s1 to include the last character
        for j in range(i + 1, l_s1 + 1):
            t = s1[i:j]
            if t in s2 and len(t) > len(res):
                res = t
    return res
print(len(get_intersection(long, short)) / len(short) >= 0.6)
Then apply the solution to the dataframe, row by row (axis=1):
df["defective_attributes"] = df.apply(lambda x: len(get_intersection(x["item_name.value"], x["size.value"])) / len(x["size.value"]) >= 0.6, axis=1)
Basically, get_intersection searches for the longest common substring of the item name and the size. It then takes the length of that substring and says the row is not defective if at least 60% of the size value also appears in the item name.
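As an alternative to the nested loops, the standard library's difflib can find the longest common substring. This is a swapped-in technique, not the function above, and longest_common_substring is a name chosen for this sketch:

```python
from difflib import SequenceMatcher

def longest_common_substring(s1, s2):
    # autojunk=False keeps difflib from ignoring frequent characters in long strings
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    match = matcher.find_longest_match(0, len(s1), 0, len(s2))
    return s1[match.a:match.a + match.size]

long = "adasda, 9.5 W US"
short = "9.5 Wide"
common = longest_common_substring(long, short)
print(repr(common))                     # '9.5 W'
print(len(common) / len(short) >= 0.6)  # True
```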
I have a database of New York apartments which has thousands of rented apartments. What I'm trying to do is create another column, "pet_level". There are two other columns, 'dog_allowed' and 'cat_allowed', that hold a 0 or 1 depending on whether the pet is allowed.
I'm looking to create the 'pet_level' column on this:
0 if no pets are allowed
1 if cats_allowed
2 if dogs_allowed
3 if both are allowed
my initial approach at solving this was as follows:
df['pet_level'] = df.apply(lambda x: plev(0 = x[x['dog_allowed'] == 0 & x['cat_allowed'] == 0] ,1 = x[x['cat_allowed'] == 1], 2 = x[x['dog_allowed'] == 1], 3 = x[x['dog_allowed'] == 1 & x['cat_allowed'] == 1]))
Just because I've done smaller test datasets in a similar manner
I tried out a lambda function using the apply method but that doesn't seem to allow for that.
The approach that currently works is to define a function with the conditional statements needed.
def plvl(db):
    if db['cats_allowed'] == 0 and db['dogs_allowed'] == 0:
        val = 0
    elif db['cats_allowed'] == 1 and db['dogs_allowed'] == 0:
        val = 1
    elif db['cats_allowed'] == 0 and db['dogs_allowed'] == 1:
        val = 2
    elif db['cats_allowed'] == 1 and db['dogs_allowed'] == 1:
        val = 3
    return val
Then pass in that function by applying the function along the columns(axis=1) to create the desired column.
df['pet_level'] = df.apply(plvl, axis=1)
I'm not sure if this is the most efficient approach, but for testing purposes it currently works. I'm sure there are more Pythonic approaches that would be less demanding and equally helpful to know.
Instead of mapping, you can vectorize the operation. Cats count as 1 and dogs as 2, so that "both" adds up to 3:
df['pet_level'] = df['cat_allowed'] * 1 + df['dog_allowed'] * 2
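The weights follow from the table in the question: a cat contributes 1 and a dog contributes 2, so the sum covers all four levels. A small plain-Python check of that arithmetic (pet_level here is a hypothetical helper, not a pandas function):

```python
from itertools import product

def pet_level(cat_allowed, dog_allowed):
    # 0 = no pets, 1 = cats only, 2 = dogs only, 3 = both
    return cat_allowed * 1 + dog_allowed * 2

for cat, dog in product((0, 1), repeat=2):
    print(f"cat={cat} dog={dog} -> pet_level={pet_level(cat, dog)}")
```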
I am trying to do the following:
1) count how many numbers in the data list fall within a given range, e.g. there are three numbers between and including 10 and 20.
2) represent the count for each range with the same number of '#' characters, e.g. there are 3 numbers between 10 and 20 = ###.
Ideally ending in having the two values represented next to each other.
Unfortunately I really can't figure out step two and any help would really be appreciated.
My code is below:
def count_range_in_list(li, min, max):
    ctr = 0
    for x in li:
        if min <= x <= max:
            ctr += 1
    return ctr
def amountOfHashes(count_range_in_list,ctr):
    ctr = count_range_in_list()
    if ctr == 1:
        print ('#')
    elif ctr == 2:
        print ('##')
    elif ctr == 3:
        print ('###')
    elif ctr == 4:
        print ('####')
    elif ctr == 5:
        print ('#####')
    elif ctr == 6:
        print ('######')
    elif ctr == 7:
        print ('#######')
    elif ctr == 8:
        print ('########')
    elif ctr == 9:
        print ('#########')
    elif ctr == 10:
        print ('##########')
data = [90,30,13,67,85,87,50,45,51,72,64,69,59,17,22,23,44,25,16,67,85,87,50,45,51]
print(count_range_in_list(data, 0, 10),amountOfHashes)
print(count_range_in_list(data, 10, 20),amountOfHashes)
print(count_range_in_list(data, 20, 30),amountOfHashes)
print(count_range_in_list(data, 30, 40),amountOfHashes)
print(count_range_in_list(data, 40, 50),amountOfHashes)
print(count_range_in_list(data, 50, 60),amountOfHashes)
print(count_range_in_list(data, 60, 70),amountOfHashes)
print(count_range_in_list(data, 70, 80),amountOfHashes)
print(count_range_in_list(data, 80, 90),amountOfHashes)
print(count_range_in_list(data, 90, 100),amountOfHashes)
I'll start by clearing out some doubts you seem to have.
First, how to use the value of a function inside another one:
You don't need to pass the reference of a method to another here. What I mean is, in amountOfHashes(count_range_in_list,ctr) you can just drop count_range_in_list as a parameter, and just define it like amountOfHashes(ctr). Or better yet, use snake case in the method name instead of camel case, so you end up with amount_of_hashes(ctr). Even if you had to execute count_range_in_list inside amount_of_hashes, Python is smart enough to let you do that without having to pass the function reference, since both methods are inside the same file already.
And why do you only need ctr? Well, count_range_in_list already returns a counter, so that's all we need. One parameter, named ctr. In doing so, to "use the result from a function in a new one", we could:
def amount_of_hashes(ctr):
    ...
# now, passing the value of count_range_in_list in amount_of_hashes
amount_of_hashes(count_range_in_list(data, 10, 20))
You've figured out step 1) quite well already, so we can go to step 2) right away.
In Python it's good to think of iterative processes such as yours dynamically rather than in hard coded ways. That is, creating methods to check the same condition with a tiny difference between them, such as the ones in amountOfHashes, can be avoided in this fashion:
# Method name changed for preference. Use the name that best fits you
def counter_hashes(ctr):
    # A '#' for each item in a range with the length of our counter
    if ctr == 0:
        return 'N/A'
    return ''.join(['#' for each in range(ctr)])
But as noted by Roland Smith, you can take a string and multiply it by a number - that'll do exactly what you think: repeat the string multiple times.
>>> 3*'#'
'###'
So you don't even need my counter_hashes above, you can just ctr*'#' and that's it. But for consistency, I'll change counter_hashes with this new finding:
def counter_hashes(ctr):
    # will still return 'N/A' when ctr = 0
    return ctr*'#' or 'N/A'
For organization purposes, since you have a specific need (printing the hashes and the hash count), you may want to control the formatting of what gets printed. You could make a dedicated method for the printing that calls both counter_hashes and count_range_in_list and gives you a cleaner result:
def hash_range(data, min, max):
    ctr = count_range_in_list(data, min, max)
    hashes = counter_hashes(ctr)
    print(f'{hashes} | {ctr} items in range')
The use and output of this would then become:
>>> data = [90,30,13,67,85,87,50,45,51,72,64,69,59,17,22,23,44,25,16,67,85,87,50,45,51]
>>> hash_range(data, 0, 10)
N/A | 0 items in range
>>> hash_range(data, 10, 20)
### | 3 items in range
>>> hash_range(data, 20, 30)
#### | 4 items in range
And so on. If you just want to print things right away, without the hash_range method above, that is simpler, though more repetitive:
>>> ctr = count_range_in_list(data, 10, 20)
>>> print(counter_hashes(ctr), ctr)
### 3
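To cover all ten ranges from the question without repeating the call by hand, the ranges themselves can come from a loop. A minimal sketch, assuming 10-wide ranges from 0 to 100 and the same inclusive bounds as the question (histogram_lines is a name made up for this example):

```python
def count_range_in_list(li, low, high):
    # Same inclusive test as in the question: low <= x <= high
    return sum(1 for x in li if low <= x <= high)

def histogram_lines(data, start=0, stop=100, step=10):
    lines = []
    for low in range(start, stop, step):
        high = low + step
        ctr = count_range_in_list(data, low, high)
        lines.append(f"{low:>3}-{high:<3} | {ctr * '#' or 'N/A'} | {ctr}")
    return lines

data = [90,30,13,67,85,87,50,45,51,72,64,69,59,17,22,23,44,25,16,67,85,87,50,45,51]
for line in histogram_lines(data):
    print(line)
```

Because both bounds are inclusive, a boundary value such as 30 is counted in two neighbouring ranges, exactly as in the original calls.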
Why not just do it like this:
Python 3.x:
def amount_of_hashes(ctr):
    while ctr > 0:
        print('#', end = '')
        ctr = ctr-1
Python 2.x:
def amount_of_hashes(ctr):
    while ctr > 0:
        print '#',
        ctr = ctr-1
Counting the number in a list can be done like this:
def count_range_in_list(li, mini, maxi):
    return len([i for i in li if mini <= i <= maxi])
Then making a number of hashes is even simpler. Just multiply a string containing the hash sign by a number.
print(count_range_in_list(data, 0, 10)*'#')
Example in IPython:
In [1]: data = [90,30,13,67,85,87,50,45,51,72,64,69,59,17,22,23,44,25,16,67,85,87,50,45,51]
In [2]: def count_range_in_list(li, mini, maxi):
   ...:     return len([i for i in li if mini <= i <= maxi])
   ...:
In [3]: print(count_range_in_list(data, 0, 10)*'#')
In [4]: print(count_range_in_list(data, 10, 20)*'#')
###
In [5]: print(count_range_in_list(data, 20, 30)*'#')
####
There are many ways to do this. One way is to use a for loop with range:
# Most basic
def count_range_in_list(li, min, max):
    ctr = 0
    hashes = ""
    for x in li:
        if min <= x <= max:
            ctr += 1
            hashes += "#"
    print("There are {0} numbers = {1}".format(ctr, hashes))
# more declarative
def count_range_in_list(li, min, max):
    nums = [x for x in li if min <= x <= max]
    hashes = "".join(["#" for n in nums])
    print("There are {0} numbers = {1}".format(len(nums), hashes))
I normally implement switch/case for equal comparison using a dictionary.
dict = {0:'zero', 1:'one', 2:'two'};
a=1; res = dict[a]
instead of
if a == 0:
    res = 'zero'
elif a == 1:
    res = 'one'
elif a == 2:
    res = 'two'
Is there a strategy to implement similar approach for non-equal comparison?
if score <= 10:
    cat = 'A'
elif score > 10 and score <= 30:
    cat = 'B'
elif score > 30 and score <= 50:
    cat = 'C'
elif score > 50 and score <= 90:
    cat = 'D'
else:
    cat = 'E'
I know that may be tricky with the <, <=, >, >=, but is there any strategy to generalize that or generate automatic statements from let's say a list
{[10]:'A', [10,30]:'B', [30,50]:'C',[50,90]:'D',[90]:'E'}
and some flag to say if it's < or <=.
A dictionary can hold a lot of values. If your ranges aren't too broad, you could make a dictionary that is similar to the one you had for the equality conditions by expanding each range programmatically:
from collections import defaultdict
ranges = {(0,10):'A', (10,30):'B', (30,50):'C',(50,90):'D'}
valueMap = defaultdict(lambda:'E')
for r,letter in ranges.items():
    valueMap.update({ v:letter for v in range(*r) })
valueMap[701] # 'E'
valueMap[7] # 'A'
You could also just remove the redundant conditions from your if/elif statement and format it a little differently. That would almost look like a case statement:
if score < 10 : cat = 'A'
elif score < 30 : cat = 'B'
elif score < 50 : cat = 'C'
elif score < 90 : cat = 'D'
else : cat = 'E'
To avoid repeating score <, you could define a case function and use it with the value:
score = 43
case = lambda x: score < x
if case(10): cat = "A"
elif case(30): cat = "B"
elif case(50): cat = "C"
elif case(90): cat = "D"
else : cat = "E"
print (cat) # 'C'
You could generalize this by creating a switch function that returns a "case" function that applies to the test value with a generic comparison pattern:
def switch(value):
    def case(check,lessThan=None):
        if lessThan is not None:
            return (check is None or check <= value) and value < lessThan
        if type(value) == type(check): return value == check
        # a function passed as check is called with the value
        if isinstance(check,type(case)): return check(value)
        return value in check
    return case
This generic version allows all sorts of combinations:
score = 35
case = switch(score)
if   case(0,10)  : cat = "A"
elif case([10,11,12,13,14,15,16,17,18,19]):
    cat = "B"
elif score < 30  : cat = "B"
elif case(30) \
  or case(range(31,50)) : cat = 'C'
elif case(50,90) : cat = 'D'
else             : cat = "E"
print(cat) # 'C'
And there is yet another way using a lambda function when all you need to do is return a value:
score = 41
case = lambda x,v: v if score<x else None
cat = case(10,'A') or case(20,'B') or case(30,'C') or case(50,'D') or 'E'
print(cat) # "D"
This last one can also be expressed using a generator expression and a mapping table:
mapping = [(10,'A'),(30,'B'),(50,'C'),(90,'D')]
scoreCat = lambda s: next( (L for x,L in mapping if s<x),"E" )
score = 37
cat = scoreCat(score)
print(cat) #"C"
More specifically to the question, a generalized solution can be created using a setup function that returns a mapping function in accordance with your parameters:
def rangeMap(*breaks,inclusive=False):
    default = breaks[-1] if len(breaks)&1 else None
    breaks = list(zip(breaks[::2],breaks[1::2]))
    def mapValueLT(value):
        return next( (tag for tag,bound in breaks if value<bound), default)
    def mapValueLE(value):
        return next( (tag for tag,bound in breaks if value<=bound), default)
    return mapValueLE if inclusive else mapValueLT
scoreToCategory = rangeMap('A',10,'B',30,'C',50,'D',90,'E')
print(scoreToCategory(53)) # D
print(scoreToCategory(30)) # C
scoreToCategoryLE = rangeMap('A',10,'B',30,'C',50,'D',90,'E',inclusive=True)
print(scoreToCategoryLE(30)) # B
Note that with a little more work you can improve the performance of the returned function using the bisect module.
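A possible sketch of that bisect variant, keeping the same alternating tag/bound arguments as rangeMap. The name range_map_bisect is made up here, and the bounds are assumed to be given in ascending order:

```python
from bisect import bisect_left, bisect_right

def range_map_bisect(*breaks, inclusive=False):
    # A trailing tag without a bound is the default, as in rangeMap.
    default = breaks[-1] if len(breaks) & 1 else None
    bounds = list(breaks[1::2])
    tags = list(breaks[0:2 * len(bounds):2]) + [default]
    # bisect_left finds the first bound with value <= bound,
    # bisect_right finds the first bound with value < bound.
    find = bisect_left if inclusive else bisect_right
    def map_value(value):
        return tags[find(bounds, value)]
    return map_value

score_to_category = range_map_bisect('A', 10, 'B', 30, 'C', 50, 'D', 90, 'E')
print(score_to_category(53))  # D
print(score_to_category(30))  # C
score_to_category_le = range_map_bisect('A', 10, 'B', 30, 'C', 50, 'D', 90, 'E', inclusive=True)
print(score_to_category_le(30))  # B
```

The lookup is a binary search over the bounds instead of a linear scan, which only starts to matter when there are many breakpoints.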
Python 3.10 introduced match-case (basically switch) and you can use it as
def check_number(no):
    match no:
        case 0:
            return 'zero'
        case 1:
            return 'one'
        case 2:
            return 'two'
        case _:
            return "Invalid num"
This is something that I tried for an example.
The bisect module can be used for such categorization problem. In particular, the documentation offers an example which solves a problem very similar to yours.
Here is the same example adapted to your use case. The function returns two values: the letter grade and a bool flag which indicates if the match was exact.
from bisect import bisect_left
grades = "ABCDE"
breakpoints = [10, 30, 50, 90, 100]
def grade(score):
    index = bisect_left(breakpoints, score)
    exact = score == breakpoints[index]
    grade = grades[index]
    return grade, exact
grade(10) # 'A', True
grade(15) # 'B', False
In the above, I assumed that your last breakpoint was 100 for E. If you truly do not want an upper bound, notice that you can replace 100 by math.inf to keep the code working.
For your particular case an efficient approach to convert a score to a grade in O(1) time complexity would be to use 100 minus the score divided by 10 as a string index to obtain the letter grade:
def get_grade(score):
    return 'EDDDDCCBBAA'[(100 - score) // 10]
so that:
print(get_grade(100))
print(get_grade(91))
print(get_grade(90))
print(get_grade(50))
print(get_grade(30))
print(get_grade(10))
print(get_grade(0))
outputs:
E
E
D
C
B
A
A
Yes, there is a strategy, but not quite as clean as human thought patterns. First some notes:
There are other questions dealing with "Python switch"; I'll assume that you consulted them already and eliminated those solutions from consideration.
The structure you posted is not a list; it's an invalid attempt at a dict. Keys must be hashable; the lists you give are not valid keys.
You have two separate types of comparison here: exact match to the lower bound and range containment.
That said, I'll keep the concept of a look-up table, but we'll drop it to a low common denominator to make it easy to understand and alter for other considerations.
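low = [10, 30, 50, 90]
grade = "ABCDE"
for idx, bkpt in enumerate(low):
    if score <= bkpt:
        exact = (score == bkpt)
        break
else:
    # score is above the last breakpoint, so it falls into the final grade
    idx = len(low)
    exact = False
cat = grade[idx]
exact is the flag you requested.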
Python had no case statement until match was added in Python 3.10 (hat-tip Peter Mortensen). Just write a long series of elifs.
if score <= 10:
    return 'A'
if score <= 30:
    return 'B'
if score <= 50:
    return 'C'
if score <= 90:
    return 'D'
return 'E'
Things like looking it up in a dictionary sound great, but really, they are too slow. elifs beat them all.
low = [10,30,50,70,90]
gradE = "FEDCBA"
def grade(score):
    for i,b in enumerate(low):
        #if score < b: # 0--9F,10-29E,30-49D,50-69C,70-89B,90-100A Easy
        if score <= b: # 0-10F,11-30E,31-50D,51-70C,71-90B,91-100A Taff
            return gradE[i]
    else:
        return gradE[-1]
for score in range(0,101):
    print(score,grade(score))