Using plotnine in Python, I'd like to add dashed horizontal lines to my plot (a scatterplot, but preferably an answer compatible with other plot types) representing the mean for every color separately. I'd like to do so without manually computing the mean values myself or adapting other parts of the data (e.g. adding columns for color values, etc.).
Additionally, the original plot is generated via a function (make_plot below) and the mean lines are to be added afterwards, yet need to have the same color as the points from which they are derived.
Consider the following as a minimal example:
import pandas as pd
import numpy as np
from plotnine import *

df = pd.DataFrame({'MSE': [0.1, 0.7, 0.5, 0.2, 0.3, 0.4, 0.8, 0.9, 1.0, 0.4, 0.7, 0.9],
                   'Size': ['S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL'],
                   'Number': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]})

def make_plot(df, var_x, var_y, var_fill):
    plot = ggplot(df) + aes(x=var_x, y=var_y, fill=var_fill) + geom_point()
    return plot
plot = make_plot(df, 'Number', 'MSE', 'Size')
I'd like to add 4 lines, one for each Size. The exact same thing can be done in R using ggplot2, as shown by this question. However, adding geom_line(stat="hline", yintercept="mean", linetype="dashed") to the plot results in an error that I am unable to resolve: PlotnineError: "'stat_hline' Not in Registry. Make sure the module in which it is defined has been imported."
Answers that can resolve the aforementioned issue, or propose another working solution entirely, are greatly appreciated.
You can do it by first computing the means as a vector and then passing it to your function:
import pandas as pd
import numpy as np
from plotnine import *

df = pd.DataFrame({'MSE': [0.1, 0.7, 0.5, 0.2, 0.3, 0.4, 0.8, 0.9, 1.0, 0.4, 0.7, 0.9],
                   'Size': ['S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL'],
                   'Number': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]})

a = df.groupby(['Size'])['MSE'].mean()  # defining your means
a = list(a)

def make_plot(df, var_x, var_y, var_fill):
    plot = (ggplot(df) + aes(x=var_x, y=var_y, fill=var_fill)
            + geom_point()
            + geom_hline(yintercept=a, linetype="dashed"))
    return plot

plot = make_plot(df, 'Number', 'MSE', 'Size')
which gives the scatterplot with the four dashed mean lines. Note that two of the lines coincide (groupby sorts the Size levels alphabetically, so a is ordered L, M, S, XL):

a = [0.6666666666666666, 0.5, 0.4666666666666666, 0.6666666666666666]
To add different colors to each dashed line, you can do this:
import pandas as pd
import numpy as np
from plotnine import *
from random import randint

df = pd.DataFrame({'MSE': [0.1, 0.7, 0.5, 0.2, 0.3, 0.4, 0.8, 0.9, 1.0, 0.4, 0.7, 0.9],
                   'Size': ['S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL'],
                   'Number': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]})

### Generate a list of random colors of the same length as your categories (Sizes)
color = []
n = len(set(df.Size))
for i in range(n):
    color.append('#%06X' % randint(0, 0xFFFFFF))

def make_plot(df, var_x, var_y, var_fill):
    plot = (ggplot(df) + aes(x=var_x, y=var_y, fill=var_fill)
            + geom_point()
            + geom_hline(yintercept=list(df.groupby(['Size'])['MSE'].mean()),
                         linetype="dashed", color=color))
    return plot

plot = make_plot(df, 'Number', 'MSE', 'Size')
which returns the same plot with each dashed line drawn in a random color.
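Note that the random colors above will generally not match the fill colors of the points, which the question asks for. A hedged alternative (a sketch of my own; the means dataframe and its name are not from the question, but mapping color inside geom_hline's aes mirrors the usual ggplot2 idiom and is supported by plotnine): compute the per-group means as a small dataframe and map its Size column to color, so the default discrete palette should assign the lines the same colors as the points' fill:

# a minimal sketch, reusing the df and make_plot from the question
means = df.groupby('Size', as_index=False)['MSE'].mean()  # one row per Size

plot = make_plot(df, 'Number', 'MSE', 'Size') + geom_hline(
    data=means,
    mapping=aes(yintercept='MSE', color='Size'),
    linetype='dashed')

Since the lines are added to the returned plot object afterwards, this also satisfies the requirement that make_plot itself stays untouched.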
Here's an example of my dataframe:
import pandas as pd

d = {'group': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'd', 'd'],
     'round': [3, 3, 2, 1, 3, 1, 3, 3, 3, 2, 1],
     'score': [0.3, 0.1, 0.6, 0.8, 0.2, 0.5, 0.5, 0.6, 0.4, 0.9, 0.1]}
df = pd.DataFrame(d)
df
   group  round  score
0      a      3    0.3
1      a      3    0.1
2      a      2    0.6
3      b      1    0.8
4      b      3    0.2
5      b      1    0.5
6      b      3    0.5
7      b      3    0.6
8      c      3    0.4
9      d      2    0.9
10     d      1    0.1
My actual dataframe has 6 columns and > 1,000,000 rows. I'm trying to figure out the fastest way to do the following:
For each group, find the average of the scores for each of the 3 rounds and perform some calculation with it. If there are no scores, write 'NA'.
I'm not sure whether it would be faster to make a list of lists and then convert it into a dataframe, or to make a new dataframe and populate that, so I went with the list first:
def test_df(data):
    value_counts = data['group'].value_counts().to_dict()
    avgs = []
    for key, val in value_counts.items():
        row = data[data['group'] == key]
        x = [key]
        if val < 2:
            x.extend([10 * row['score'].values[0] + 1 if i == row['round'].values[0] else 'NA'
                      for i in range(1, 4)])
        else:
            x.extend([(10 * row[row['round'] == i]['score'].mean() + 1) if len(row[row['round'] == i]) > 0 else 'NA'
                      for i in range(1, 4)])
        avgs.append(x)
    return avgs
Here I created a separate case because about 80% of groups in my data only have one row, so I figured it might speed things up maybe?
This returns the correct results in the format [group, round 1, round 2, round 3]:
[['b', 7.5, 'NA', 5.333333333333333],
['a', 'NA', 7.0, 3.0],
['d', 2.0, 10.0, 'NA'],
['c', 'NA', 'NA', 5.0]]
but it's looking like it's going to take a really really long time on the actual dataframe...
Does anyone have any better ideas?
It looks to me like you're basically doing a groupby/mean and a pivot.
import pandas as pd
d = {'group': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'd', 'd'], \
'round': [3, 3, 2, 1, 3, 1, 3, 3, 3, 2, 1], \
'score': [0.3, 0.1, 0.6, 0.8, 0.2, 0.5, 0.5, 0.6, 0.4, 0.9, 0.1]}
df = pd.DataFrame(d)
df = (df.groupby(['group', 'round'])['score'].mean() * 10 + 1).reset_index()
df.pivot_table(index='group', columns='round', values='score', fill_value='NA').reset_index().values
Output
array([['a', 'NA', 7.0, 3.0],
       ['b', 7.5, 'NA', 5.333333333333333],
       ['c', 'NA', 'NA', 5.0],
       ['d', 2.0, 10.0, 'NA']], dtype=object)
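If you need the exact list-of-lists format from the question rather than a numpy array, calling .tolist() on the result is a small follow-up (a sketch, reusing df from above):

rows = (df.pivot_table(index='group', columns='round', values='score',
                       fill_value='NA')
          .reset_index()
          .values
          .tolist())
# [['a', 'NA', 7.0, 3.0], ['b', 7.5, 'NA', 5.333333333333333], ...]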
An imbalanced dataset may show different results, but I tested with the scripts below and found that even with a pandas dataframe the performance looks okay. However, you can always compare it against a native Python data structure.
import random
import datetime
import pandas as pd

def generate_data():  # augmentation
    data = {'group': [], 'round': [], 'score': []}
    for index in range(10 ** 6):  # sample size
        data['group'].append(random.choice(['a', 'b', 'c', 'd']))
        data['round'].append(random.randrange(1, 4))
        data['score'].append(round(random.random(), 1))
    return data

def calc_with_native_ds(data):  # native python data structure
    pass

def calc_with_pandas_df(df):  # pandas dataframe
    return df.groupby(['group', 'round']).mean()

if __name__ == '__main__':
    data = generate_data()
    df = pd.DataFrame(data)
    print(df.shape)

    start_datetime = datetime.datetime.now()
    # calc_with_native_ds(data)
    calc_with_pandas_df(df)
    end_datetime = datetime.datetime.now()

    elapsed_time = round((end_datetime - start_datetime).total_seconds(), 5)
    print(f"elapsed_time: {elapsed_time}")
Without using any imports:
# given
deps = {'W': ['R', 'S'], 'C': [], 'S': ['C'], 'R': ['C'], 'F': ['W']}
prob = {'C': [0.5], 'R': [0.2, 0.8], 'S': [0.5, 0.1], 'W': [0.01, 0.9, 0.9, 0.99], 'F': [0.4, 0.3]}
k = 'F'
# want to return: L = [[0.2, 0.8], [0.5, 0.1], [0.01, 0.9, 0.9, 0.99], [0.4, 0.3]]

# attempt
L = []
for i in deps[k]:
    s = i
    while deps[s] != []:
        L.append(prob[s])
        s = deps[s]
print(L)
I'm having trouble figuring this out. Given the two dictionaries, dependents (deps) and probabilities (prob), I wish to traverse from a selected point and collect every value; for the above example I chose 'F'.
It would first go into the deps of 'F' and find 'W', then check the deps of that, ['R', 'S']. Checking 'R', it sees that the dependent of 'R' is 'C', and 'C' does not have a dependent, so we stop at 'R' and append its probability to L.
[[0.2, 0.8]]
then we go into S and do the same thing
[[0.2, 0.8], [0.5, 0.1]]
then we're done with that and we're back at W
[[0.2, 0.8], [0.5, 0.1], [0.01, 0.9, 0.9, 0.99]]
and finally since we're done with W we get the prob dict of F
[[0.2, 0.8], [0.5, 0.1], [0.01, 0.9, 0.9, 0.99], [0.4, 0.3]]
My code fails when there's more than one dependent value, and I'm not sure how to wrap my head around that. I'm trying to write a function that will do this given deps, prob, and a value of k.
I would solve the problem with a while loop that keeps looking to see if you've used all the values you've recursively found. You can use a structure like:
deps = {'W': ['R', 'S'], 'C': [], 'S': ['C'], 'R': ['C'], 'F': ['W']}
# out = ['F', 'W', 'R', 'S']
prob = {'C': [0.5], 'R': [0.2, 0.8], 'S': [0.5, 0.1], 'W': [0.01, 0.9, 0.9, 0.99], 'F': [0.4, 0.3]}
k = 'F'

def get_values(dep_dictionary, prob_dict, start_key):
    used_keys = []
    keys_to_use = [start_key]
    probability = []
    # build a list of linked values from the deps dictionary
    while used_keys != keys_to_use:
        print('used: {}'.format(used_keys))
        print('to use: {}'.format(keys_to_use))
        for i in range(len(keys_to_use)):
            if keys_to_use[i] not in used_keys:
                new_keys = dep_dictionary[keys_to_use[i]]
                if len(new_keys):
                    for sub_key in new_keys:
                        if sub_key not in keys_to_use:
                            keys_to_use.append(sub_key)
                    used_keys.append(keys_to_use[i])
                else:
                    del keys_to_use[i]
    # at this point used_keys = ['F', 'W', 'R', 'S']
    for key in used_keys:
        probability.append(prob_dict[key])
    print(probability)

get_values(deps, prob, k)
Which outputs:
used: []
to use: ['F']
used: ['F']
to use: ['F', 'W']
used: ['F', 'W']
to use: ['F', 'W', 'R', 'S']
used: ['F', 'W', 'R', 'S']
to use: ['F', 'W', 'R', 'S', 'C']
[[0.4, 0.3], [0.01, 0.9, 0.9, 0.99], [0.2, 0.8], [0.5, 0.1]]
Where you can see the output is correct ([[0.4, 0.3], [0.01, 0.9, 0.9, 0.99], [0.2, 0.8], [0.5, 0.1]]), although not in exactly the same order; it doesn't sound like that should be a huge issue. If it is, you can always reshape it into a dictionary by adjusting the

for key in used_keys:
    probability.append(prob_dict[key])

bit so that probability is a dictionary as well. You can also take the print() statements out; they were just there to debug and show visually what is going on within the loop. You would also probably have the function return probability instead of printing it, but I'll leave that to your discretion!
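For instance, a minimal sketch of that adjustment (my own variant, not part of the original answer) replaces the final loop with a dict comprehension keyed by node:

# probability as a dict instead of a list
probability = {key: prob_dict[key] for key in used_keys}
# {'F': [0.4, 0.3], 'W': [0.01, 0.9, 0.9, 0.99], 'R': [0.2, 0.8], 'S': [0.5, 0.1]}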
Here is a solution that uses a stack-based depth-first search to traverse the dependency tree. It appends a node's probabilities only if the node has dependencies, and then simply reverses the accumulated list at the end.
def prob_list(root):
    nodes_to_visit = [root]
    collected = []
    while nodes_to_visit:
        curr = nodes_to_visit.pop()
        print(f"Visiting {curr}")
        if deps[curr]:
            collected.append(prob[curr])
            for dep in deps[curr]:
                nodes_to_visit.append(dep)
    return list(reversed(collected))

print(prob_list("F"))  # [[0.2, 0.8], [0.5, 0.1], [0.01, 0.9, 0.9, 0.99], [0.4, 0.3]]
Suppose I have an array:
[['a', 10, 1, 0.1],
['a', 10, 2, 0.2],
['a', 20, 2, 0.3],
['b', 10, 1, 0.4],
['b', 20, 2, 0.5]]
And I want a dict (or JSON):
{
    'a': {
        10: {1: 0.1, 2: 0.2},
        20: {2: 0.3}
    },
    'b': {
        10: {1: 0.4},
        20: {2: 0.5}
    }
}
Is there any good way, or some library, for this task?
In this example the array has just 4 columns, but my original array is more complicated (7 columns).
Currently I implement this naively:
import pandas as pd

df = pd.DataFrame(array)
grouped1 = df.groupby('column1')
for column1 in grouped1.groups:
    group1 = grouped1.get_group(column1)
    grouped2 = group1.groupby('column2')
    for column2 in grouped2.groups:
        group2 = grouped2.get_group(column2)
        ...
And defaultdict way:
from collections import defaultdict

d = defaultdict(lambda: defaultdict(lambda: defaultdict( ... )))
for row in array:
    d[row[0]][row[1]][row[2]]... = row[-1]
But I think neither is smart.
I would suggest this rather simple solution:
from functools import reduce

data = [['a', 10, 1, 0.1],
        ['a', 10, 2, 0.2],
        ['a', 20, 2, 0.3],
        ['b', 10, 1, 0.4],
        ['b', 20, 2, 0.5]]

result = dict()
for row in data:
    # walk/create the nested dicts for all key columns, then set the final value
    reduce(lambda v, k: v.setdefault(k, {}), row[:-2], result)[row[-2]] = row[-1]

print(result)
{'a': {10: {1: 0.1, 2: 0.2}, 20: {2: 0.3}}, 'b': {10: {1: 0.4}, 20: {2: 0.5}}}
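The reduce call walks (and, via setdefault, creates) one level of nesting per key column and returns the innermost dict; the trailing subscript assignment then sets the final value. An equivalent explicit loop (a sketch, same data as above) makes that clearer:

result = dict()
for row in data:
    node = result
    for key in row[:-2]:           # all key columns except the last one
        node = node.setdefault(key, {})
    node[row[-2]] = row[-1]        # last key column -> value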
An actual recursive solution would be something like this:
def add_to_group(keys: list, group: dict):
    if len(keys) == 2:
        group[keys[0]] = keys[1]
    else:
        add_to_group(keys[1:], group.setdefault(keys[0], dict()))

result = dict()
for row in data:
    add_to_group(row, result)

print(result)
Introduction
Here is a recursive solution. The base case is when you have a list of 2-element lists (or tuples), in which case, the dict will do what we want:
>>> dict([(1, 0.1), (2, 0.2)])
{1: 0.1, 2: 0.2}
For other cases, we will remove the first column and recurse down until we get to the base case.
The code:
from itertools import groupby

def rows2dict(rows):
    if len(rows[0]) == 2:
        # e.g. [(1, 0.1), (2, 0.2)] ==> {1: 0.1, 2: 0.2}
        return dict(rows)
    else:
        dict_object = dict()
        for column1, grouped_rows in groupby(rows, lambda x: x[0]):
            rows_without_first_column = [x[1:] for x in grouped_rows]
            dict_object[column1] = rows2dict(rows_without_first_column)
        return dict_object

if __name__ == '__main__':
    rows = [['a', 10, 1, 0.1],
            ['a', 10, 2, 0.2],
            ['a', 20, 2, 0.3],
            ['b', 10, 1, 0.4],
            ['b', 20, 2, 0.5]]
    dict_object = rows2dict(rows)
    print(dict_object)
Output
{'a': {10: {1: 0.1, 2: 0.2}, 20: {2: 0.3}}, 'b': {10: {1: 0.4}, 20: {2: 0.5}}}
Notes
We use the itertools.groupby generator to simplify grouping of similar rows based on the first column. Note that groupby only merges consecutive rows with equal keys, so unsorted input should be sorted by its leading columns first (see the sketch below).
For each group of rows, we remove the first column and recurse down.
This solution assumes that the rows variable has 2 or more columns. The result is unpredictable for rows that have 0 or 1 columns.
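For instance, a one-line pre-step along those lines (a sketch; it assumes values within each key column are mutually comparable) would be:

rows.sort(key=lambda row: row[:-1])  # group equal leading columns together for groupby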
I have a list of lists; each list contains four elements, which represent id, age, val1, and val2. I am manipulating each list in such a way that the val1 and val2 values of that list always depend on the most recent values seen in the previous lists. The previous lists for a list are those lists for which the age difference is not more than timeDelta. The list of lists is in sorted order by age.
My code works correctly but it is slow. I feel that the line marked ** generates too many lists of lists and could be avoided by deleting lists from the beginning once I know that the age difference of a list with the next list is more than timeDelta.
from functools import reduce

myList = [
    [1, 20, '', 'x'],
    [1, 25, 's', ''],
    [1, 26, '', 'e'],
    [1, 30, 'd', 's'],
    [1, 50, 'd', 'd'],
    [1, 52, 'f', 'g']
]

age_Idx = 1
timeDelta = 10

def collapseListTogether(li, age_Idx, currage, timeDelta):
    finalList = []
    for xl in reversed(li):
        oldage = float(xl[age_Idx])
        if (currage - timeDelta) <= oldage < currage:
            finalList.append(xl)
        else:
            break
    return [reduce(lambda a, b: b or a, tup) for tup in zip(*finalList[::-1])]

for i in range(1, len(myList)):
    newList = myList[:i+1]  # Subset of lists. #********
    respList = newList.pop(-1)
    currage = float(respList[age_Idx])
    retval = collapseListTogether(newList, age_Idx, currage, timeDelta)
    if len(retval) == 0:
        continue
    retval[0:2] = respList[0:2]
    print(retval)
Example
[1, 20, '', 'x'] ==> Not dependent on anything. Skip this list
[1, 25, 's', ''] == > [1, 25, '', 'x']
[1, 26, '', 'e'] ==> [1, 26, 's', 'x']
[1, 30, 'd', 's'] ==> [1, 30, 's', 'e']
[1, 50, 'd', 'd'] ==> Age difference (50-30 = 20) which is more than 10
[1, 52, 'f', 'g'] ==> [1, 52, 'd', 'd']
I'm just rewriting your data structure and your code:
from collections import namedtuple
from functools import reduce

Record = namedtuple('Record', ['id', 'age', 'val1', 'val2'])
myList = [
    Record._make([1, 20, '', 'x']),
    Record._make([1, 25, 's', '']),
    Record._make([1, 26, '', 'e']),
    Record._make([1, 30, 'd', 's']),
    Record._make([1, 50, 'd', 'd']),
    Record._make([1, 52, 'f', 'g'])
]
timeDelta = 10

def collapseListTogether(lst, age, tdelta):
    finalLst = [ele for ele in lst
                if age - float(ele.age) <= tdelta and age > float(ele.age)]
    return [reduce(lambda a, b: b or a, tup) for tup in zip(*finalLst[::-1])]

for i in range(1, len(myList)):
    subList = list(myList[:i+1])
    rec = subList.pop(-1)
    age = float(rec.age)
    retval = collapseListTogether(subList, age, timeDelta)
    if len(retval) == 0:
        continue
    retval[0:2] = [rec.id, rec.age]
    print(retval)
Your code was not easy for me to read. I did not change the logic, but just modified it in places for performance.
One way out is to replace your 4-element lists with tuples, or better with namedtuple, which is a well-known high-performance container in Python. Also, where possible one would use comprehensions instead of for-loops to enhance performance. Your list is not too large, so the time gained from efficient interpretation should outweigh what is lost by not breaking out of the loop early.
To me, your original code should not work, but I am not sure.
Assuming your example is correct, I see no reason you can't do this in a single pass, since they're sorted by age. If the last sublist you inspected has too great a difference, you know nothing earlier will count, so you should just leave the current sublist unmodified.
previous_age = None
previous_val1 = ''
previous_val2 = ''

for sublist in myList:
    age = sublist[1]
    latest_val1 = sublist[2]
    latest_val2 = sublist[3]
    if previous_age is not None and ((age - previous_age) <= timeDelta):
        # there is at least one previous list within timeDelta
        sublist[2] = previous_val1
        sublist[3] = previous_val2
    previous_age = age
    previous_val1 = latest_val1 or previous_val1
    previous_val2 = latest_val2 or previous_val2
When testing, that code produces this modified value for your initial myList:
[[1, 20, '', 'x'],
[1, 25, '', 'x'],
[1, 26, 's', 'x'],
[1, 30, 's', 'e'],
[1, 50, 'd', 'd'],
[1, 52, 'd', 'd']]
It's a straightforward modification to build a new list rather than edit one in place, or to entirely omit the skipped lines rather than just leave them unchanged.
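For instance, a sketch of the build-a-new-list variant (same inputs; leaves myList untouched):

new_list = []
previous_age = None
previous_val1 = ''
previous_val2 = ''

for sublist in myList:
    sub_id, age, latest_val1, latest_val2 = sublist
    if previous_age is not None and ((age - previous_age) <= timeDelta):
        new_list.append([sub_id, age, previous_val1, previous_val2])
    else:
        new_list.append(list(sublist))  # no recent predecessor: keep unchanged
    previous_age = age
    previous_val1 = latest_val1 or previous_val1
    previous_val2 = latest_val2 or previous_val2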
reduce and list comprehensions are powerful tools, but they're not right for all problems.