I have a dataframe with 3 columns, df = ["a", "b", "value"]. (Actually this is a snippet; the solution should be able to handle n variables, like "a", "b", "c", "d"...) In this case, the "value" column has been generated depending on the "a" and "b" values, doing something like:
for a in range(1, 10):
    for b in range(1, 10):
        generate_value(a, b)
The resulting data is similar to:
a b value
0 1 1 0.23
1 1 2 6.34
2 1 3 0.25
3 1 4 2.17
4 1 5 5.97
[...]
I want to know the statistically better combinations of "a" and "b", the ones that give me the biggest "value". So I want to draw some kind of histogram that shows me which values of "a" and "b" statistically generate a bigger "value". I tried something like:
fig = plot.figure()
ax = fig.add_subplot(111)
ax.hist(df["a"], bins=50, density=True)  # 'normed' was removed in newer matplotlib; density is the replacement
or:
plot.plot(df["a"].values, df["value"].values, "o")
But the results are not good. I think I should use some kind of histogram or Gaussian bell curve, but I'm not sure how to plot it.
So, how do I plot the statistically better "a" and "b" to get the maximum "value"?
Note: answer 1 is perfect for two variables a and b, but the correct answer needs to work for multiple variables: a, b, c, d...
Edit 1: Please note that although I'm asking about two variables, the solution can't be to bind "a" to the x axis and "b" to the y axis, as there may be more variables. So if we have "a", "b", "c", "d", "e", the solution should still be valid.
Edit 2: Trying to explain it better. Let's take the following dataframe:
a b c d value
0 1 6 9 7 0.23
1 5 2 3 5 11.34
2 6 7 8 4 0.25
3 1 4 9 3 2.17
4 1 5 9 1 4.97
5 6 6 4 7 25.9
6 3 5 5 2 10.37
7 1 5 1 2 7.87
8 2 5 3 3 8.12
9 1 5 2 1 2.97
10 7 5 4 9 5.97
11 3 5 2 3 9.92
[...]
Row 5 is clearly the winner, with a value of 25.9, so the supposedly better values of a, b, c, d are 6, 6, 4, 7. But statistically that is a strange result: it is the only one that high with those values of a, b, c, d, so it is very unlikely that choosing those values will give us a high value again in the future. Instead, it seems much safer to choose numbers that have generated a "value" between 8 and 11. Although a gain of 8 to 11 is less than 25.9, the probability that those values of a, b, c, d (like 5, 2, 3, 5) generate this higher "value" is bigger.
Edit 3: Although a, b, c, d are discrete, their combination/order will generate different results. I mean, there is a function that returns a value inside a small range, like value = func(a, b, c, d). That value depends not only on the values of a, b, c, d, but also on some random things. So, for instance, func(5, 2, 3, 5) could return a value of 11.34, but it could also return a similar value, like 10.8 or 9.5 (a value in the range 8 to 11). Also, func(1, 6, 9, 7) will return 0.23, or it could return 2.7, but it probably won't return 10.1, as that is very far from its range.
Following the example, I'm trying to get the numbers that will most probably generate something in the range of 8-11 (well, the maximum). The numbers I want to visualize somehow will probably be some combination of 3, 5 and 2. But there probably won't be any 6, 7 or 4, as those usually generate smaller "value" results.
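To make Edit 3 concrete, here is a toy sketch of the kind of generator described; the deterministic part is completely invented, only the shape (a fixed contribution per combination plus some noise) follows the description above:
import numpy as np

rng = np.random.default_rng(0)

def func(a, b, c, d):
    # invented deterministic part -- the real relationship behind the data is unknown
    base = 2.0 * b + 0.5 * a - 0.3 * c + 0.1 * d
    # plus some randomness, so repeated calls with the same arguments
    # land in a small range instead of on a single value
    return base + rng.normal(scale=1.0)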
I don't think there are any statistics involved here. You can plot the value as a function of a and b.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

A, B = np.meshgrid(np.arange(10), np.arange(10))
df = pd.DataFrame({"a": A.flatten(), "b": B.flatten(),
                   "value": np.random.rand(100)})

ax = df.plot.scatter(x="a", y="b", c=df["value"])
plt.colorbar(ax.collections[0])
plt.show()
The darker the dots, the higher the value.
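One way to extend this to more than two variables (a sketch, not part of the original answer: the data below is random and only the column names follow the question) is to melt the frame to long form and look at the distribution of value for every level of every variable:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical frame with several explanatory columns and a 'value' column
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 10, size=(200, 4)), columns=list("abcd"))
df["value"] = rng.random(200) * 10

# long form: one row per (variable, level, value) observation
melted = df.melt(id_vars="value", var_name="variable", value_name="level")
# one box per variable/level pair shows which levels tend to give high values
melted.boxplot(column="value", by=["variable", "level"], rot=90)
plt.show()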
This problem seems too complicated to solve with a single built-in function.
I think it should be solved in this way:
exclude outliers from data
select n largest values
summarize results with bar plot or any other
Clean data from outliers
We might choose any appropriate method for outlier detection, e.g. 3*sigma, 1.5*IQR, etc. I used 1.5*IQR in the example below.
from scipy import stats
cleaned_data = data[data['value'] < data['value'].quantile(0.75) + 1.5 * stats.iqr(data['value'])]  # upper fence: Q3 + 1.5*IQR
Select n largest values
Pandas provides the method nlargest, so you can use it to select the n largest values:
largest_values = cleaned_data.nlargest(5, 'value')
or you can use an interval of values:
largest_values = cleaned_data[cleaned_data['value'] > cleaned_data['value'].max() - 3]
Summarize results
Here we should count occurrences of values in each column and then plot this data.
# keep only the columns with the explanatory variables, e.g. ['a', 'b', 'c', 'd']
melted = pd.melt(largest_values[['a', 'b', 'c', 'd']])
table = pd.crosstab(melted['variable'], melted['value'])
table.plot.bar()
[example of resulting plot]
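Putting the three steps together on made-up data (a sketch only: the column names a, b, c, d come from the question, the values are random, and the cut-offs are arbitrary):
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

# random stand-in for the question's data: four explanatory columns plus 'value'
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.integers(1, 10, size=(200, 4)), columns=list("abcd"))
data["value"] = rng.random(200) * 10

# 1) drop outliers above the 1.5*IQR upper fence
fence = data["value"].quantile(0.75) + 1.5 * stats.iqr(data["value"])
cleaned = data[data["value"] < fence]

# 2) keep the rows with the largest remaining values
largest = cleaned.nlargest(10, "value")

# 3) count how often each level of each variable appears among the winners
melted = pd.melt(largest[["a", "b", "c", "d"]])
pd.crosstab(melted["variable"], melted["value"]).plot.bar()
plt.show()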
Related
I have an interesting problem and thought I would share it here for everyone. Let's assume we have a pandas DataFrame like this (dummy data):
Category   Samples
A,B,123    6
A,B,456    3
A,B,789    1
X,Y,123    18
X,Y,456    7
X,Y,789    2
P,Q,123    1
P,Q,456    2
P,Q,789    2
L,M,123    1
L,M,456    3
S,T,123    5
S,T,456    5
S,T,789    3
The values in Category are basically hierarchical in nature. Think of A as the country, B as the state, and 123 as the zip code. What I want is to greedily find each category that has fewer than 5 samples and merge it with the nearest one. The final example DataFrame should look like:
Category   Samples
A,B,123    10
X,Y,123    18
X,Y,456    9
P,Q,456    5
L,M,456    4
S,T,123    8
S,T,456    5
These are the possible rules I see that will be needed:
Case A,B: sub-categories 456 and 789 have fewer than 5 samples, so we merge them, but the merged one only has 4, which is still less than 5, so it gets merged further with 123, and thus finally we get A,B,123 with 10.
Case X,Y: sub-category 789 is the only one with fewer than 5, so it merges with 456 (the one closest to 5 samples) to become X,Y,456 with 9, while X,Y,123 always had more than 5, so it remains as is.
Case P,Q: here all the sub-categories have fewer than 5, but the idea is to merge one at a time, and it has nothing to do with the sequence. 123 has one sample, so it merges with 789 to give a sample size of 3, which is still less than 5, so that merges with 456 to form P,Q,456 with a sample size of 5 (but it could also be P,Q,789 with 5; either is fine).
Case L,M: only two sub-categories, and even merged they stay below 5, but that's the best we can have, so it should be L,M,456 with 4.
Case S,T: only 789 has fewer than 5, so it can go with either 123 or 456 (both have the same number of samples), but not both. So the answer should be either S,T,123 with 5 and S,T,456 with 8, or S,T,123 with 8 and S,T,456 with 5.
What happens if there is a third column with values and, based on the logic, we want them merged too: added up if it's an integer, concatenated if it's a string, based on whatever condition we use on those columns?
I have been trying to split the Category column and then work with the samples to add them up, but so far no luck. Any help is greatly appreciated.
Very tricky question, especially with the structure of your data (because your grouper, which is really the parts "A,B", "X,Y", etc., is not in a separate column). But I think you can do:
import functools  # needed for functools.reduce below
import pandas as pd

df.sort_values(by='Samples', inplace=True, ignore_index=True)
# grouper containing the groupby keys ('A,B', 'X,Y', etc.)
g = df['Category'].str.extract("(.*),+")[0]
# create a column that keeps the sample count and the category together
df['sample_category'] = list(zip(df['Samples'], df['Category']))
Then use functools.reduce to fold the list, grabbing the next tuple as long as the accumulated sample count is less than 5:
df2 = df.groupby(g, as_index=False).agg(
    {'sample_category': lambda s: functools.reduce(
        lambda x, y: (x[0] + y[0], y[1]) if x[0] < 5 else (x, y), s)})
Then do some munging to modify the elements to a list type:
# a fully merged group is a single (samples, category) tuple and gets wrapped in a list;
# a partially merged group is already a tuple of tuples and just becomes a list
df2['sample_category'] = df2['sample_category'].apply(
    lambda x: [x] if not isinstance(x[0], tuple) else list(x))
Then explode, extract the columns, and finally drop the intermediate column 'sample_category':
df2 = df2.explode('sample_category', ignore_index=True)
df2['Sample'] = df2['sample_category'].str[0]
df2['Category'] = df2['sample_category'].str[1]
df2.drop('sample_category', axis=1, inplace=True)
print(df2) gives:
Sample Category
0 10 A,B,123
1 4 L,M,456
2 5 P,Q,789
3 8 S,T,123
4 5 S,T,456
5 9 X,Y,456
6 18 X,Y,123
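For comparison, here is a more explicit sketch of the same greedy idea that avoids the nested tuples produced by reduce: within each prefix group, keep folding the smallest bucket into the next-smallest one until the smallest holds at least 5 samples or only one row is left. The frame below just reproduces the question's dummy data.
import pandas as pd

df = pd.DataFrame({
    "Category": ["A,B,123", "A,B,456", "A,B,789", "X,Y,123", "X,Y,456", "X,Y,789",
                 "P,Q,123", "P,Q,456", "P,Q,789", "L,M,123", "L,M,456",
                 "S,T,123", "S,T,456", "S,T,789"],
    "Samples": [6, 3, 1, 18, 7, 2, 1, 2, 2, 1, 3, 5, 5, 3],
})

# same grouping key as above: everything before the last comma ("A,B", "X,Y", ...)
group_key = df["Category"].str.extract("(.*),+")[0]

def merge_group(g):
    # work on (samples, category) pairs sorted from smallest to largest
    pairs = sorted(zip(g["Samples"], g["Category"]))
    while len(pairs) > 1 and pairs[0][0] < 5:
        smallest = pairs.pop(0)
        nxt = pairs.pop(0)
        # fold the smallest bucket into the next one and keep that one's label
        pairs.insert(0, (smallest[0] + nxt[0], nxt[1]))
        pairs.sort()
    return pd.DataFrame(pairs, columns=["Samples", "Category"])

result = pd.concat([merge_group(g) for _, g in df.groupby(group_key)], ignore_index=True)
print(result)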
I am doing some computing on a dataset using loops. Then, based on a random event, I compute some float number (this means that I don't know in advance how many floats I am going to retrieve). I want to save these numbers (results) in some kind of list and then save them to a dataframe column. I want these results for each iteration of my loop, saved in a column so I can compare them; meaning, each iteration will produce a "list" of results that will be registered in a df column.
example:
for y in range(1, 10):
    for x in range(1, 100):
        if x > random_number and x < y:
            result = 2 * x
I want to save all the results in dataframe columns, one column per combination of x and y. For example, the results for x=1, y=2 in one column, then x=2, y=2 in another column, etc. The results are not all of the same size, so I guess I'll use fillna.
Now I know that I can create an empty dataframe with max index and then fill it result by result, but I think there's a better way to do it!
Thanks in advance.
You want to take advantage of the efficiency that numpy and pandas give you. If you use numpy.where, you can set the value to nan when the if statement is False, and otherwise you can execute your formula:
import numpy as np
import pandas as pd
np.random.seed(0) # so you can reproduce my result, you can remove this in practice
x = list(range(10))
y = list(range(1, 11))
random_nums = 10 * np.random.random(10)
df = pd.DataFrame({'x' : x, 'y': y})
# the first argument is your if condition
df['new_col'] = np.where((df['x'] > random_nums) & (df['x'] < df['y']), 2*df['x'], np.nan)
print(df)
Here, random_nums generates an entire np.ndarray of random numbers to compare with. This gives
x y new_col
0 0 1 NaN
1 1 2 NaN
2 2 3 NaN
3 3 4 NaN
4 4 5 NaN
5 5 6 NaN
6 6 7 12.0
7 7 8 NaN
8 8 9 NaN
9 9 10 18.0
Compared to a Python loop this is much faster, especially if your formula (here, 2*x) is relatively quick to compute.
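On the other part of the question, collecting a variable number of results per (x, y) combination: one option (a sketch with made-up column labels) is to gather the lists in a dict and build the frame from pd.Series objects, so pandas pads the shorter columns with NaN and no explicit fillna is needed:
import pandas as pd

# hypothetical results keyed by (x, y) combination; the lists have different lengths
results = {"x=1,y=2": [2, 4, 6], "x=2,y=2": [8], "x=3,y=2": [10, 12]}

# building Series first lets pandas align on the longest index and pad with NaN
df = pd.DataFrame({key: pd.Series(vals) for key, vals in results.items()})
print(df)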
The Situation
I'm classifying the rows in a DataFrame using a certain classifier based on the values in a particular column. My goal is to append the results to one new column or another depending on certain conditions. The code, as it stands, looks something like this:
df = pd.DataFrame({'A': [list with classifier ids],       # Only 3 ids, one-word strings
                   'B': [list of text to be classified],  # Millions of unique rows, lines of text around 5-25 words long
                   'C': [list of the old classes]})       # Hundreds of possible classes, four-digit integers stored as strings

df.sort_values('A', inplace=True)
new_col1, new_col2 = [], []
for name, group in df.groupby('A', sort=False):
    classifier = classy_dict[name]
    vectors = vectorize(group.B.values)
    preds = classifier.predict(vectors)
    scores = classifier.decision_function(vectors)
    for tup in zip(preds, scores, group.C.values):
        if tup[2] == tup[0]:
            new_col1.append(np.nan)
            new_col2.append(tup[2])
        else:
            new_col1.append(str(classifier.classes_[tup[1].argsort()[-5:]]))
            new_col2.append(np.nan)
df['D'] = new_col1
df['E'] = new_col2
The Issue
I am concerned that groupby will not iterate in a top-down, order-of-appearance manner as I expect. Iteration order when sort=False is not covered in the docs.
My Expectations
All I'm looking for here is some affirmation that groupby('col', sort=False) does iterate in the top-down order-of-appearance way that I expect. If there is a better way to make all of this work, suggestions are appreciated.
Here is the code I used to test my theory on sort=False iteration order:
from numpy.random import randint
import pandas as pd
from string import ascii_lowercase as lowers

df = pd.DataFrame({'A': [lowers[randint(3)] for _ in range(100)],
                   'B': randint(10, size=100)})

print(df.A.unique())  # unique values in order of appearance per the docs
for name, group in df.groupby('A', sort=False):
    print(name)
Edit: The above code makes it appear as though it acts in the manner that I expect, but I would like some more undeniable proof, if it is available.
Yes, when you pass sort=False the order of first appearance is preserved. The groupby source code is a little opaque, but there is one function groupby.ngroup which fully answers this question, as it directly tells you the order in which iteration occurs.
def ngroup(self, ascending=True):
    """
    Number each group from 0 to the number of groups - 1.

    This is the enumerative complement of cumcount. Note that the
    numbers given to the groups match the order in which the groups
    would be seen when iterating over the groupby object, not the
    order they are first observed.
    """
Data from #coldspeed
df['sort=False'] = df.groupby('col', sort=False).ngroup()
df['sort=True'] = df.groupby('col', sort=True).ngroup()
Output:
col sort=False sort=True
0 16 0 7
1 1 1 0
2 10 2 5
3 20 3 8
4 3 4 2
5 13 5 6
6 2 6 1
7 5 7 3
8 7 8 4
When sort=False you iterate based on first appearance; when sort=True, the groups are sorted first and then iterated over.
Let's do a little empirical test. You can iterate over groupby and see the order in which groups are iterated over.
df
col
0 16
1 1
2 10
3 20
4 3
5 13
6 2
7 5
8 7
for c, g in df.groupby('col', sort=False):
    print(c)
16
1
10
20
3
13
2
5
7
It appears that the order is preserved.
I have a data set where clients answer a question, and clients belong to a certain category. The category is ordinal. I want to visualize the change in percentages as a proportional stacked barplot. Here is some test data:
answer | categ
1 1
2 1
3 2
1 2
2 3
3 3
1 1
2 1
3 2
1 2
2 3
3 3
1 3
2 2
3 1
Here is how you can generate it:
pd.DataFrame({'answer':[1,2,3]*5, 'categ':[1,1,2,2,3,3]*2+[3,2,1]})
Using some convoluted code, which can probably be written much nicer and more efficiently, I got to percentages within each answer.
test = pd.DataFrame({'answer': [1, 2, 3] * 5, 'categ': [1, 1, 2, 2, 3, 3] * 2 + [3, 2, 1]})
rel_data = pd.merge(
    pd.DataFrame(test.groupby(['answer', 'categ']).size()).reset_index(),
    pd.DataFrame(test.groupby('answer').size()).reset_index(),
    how='left', on='answer')
rel_data.columns = ['answer', 'categ', 'number_combination', 'number_answer']
rel_data['perc'] = rel_data['number_combination'] / rel_data['number_answer']
rel_data[['answer', 'categ', 'perc']]
This results in:
answer | categ | perc
1 1 0.4
1 2 0.4
1 3 0.2
2 1 0.4
2 2 0.2
2 3 0.4
3 1 0.2
3 2 0.4
3 3 0.4
How do I get this into a stacked bar plot with per answer a bar and colored areas per category?
Once I had the last dataframe, I could get it fairly easily by doing this:
rel_data = rel_data.groupby(['answer', 'categ']).\
    perc.sum().unstack().plot(kind='bar', stacked=True, ylim=(0, 1))
It's again dirty, but at least it got the job done. perc.sum() turns it into one value per group (even though it already was that), unstack() turns it into a DataFrame with the categories as columns and the answers as rows, and plot() turns this into a proportional stacked barplot. The ylim is there because of a tiny rounding error where the percentages could add up to 1.00001, which added a whole new tick.
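For what it's worth, the percentage table can also be built in a single call with pd.crosstab and normalize='index', which plugs straight into the same stacked plot (a sketch on the question's test data, not part of the original answer):
import pandas as pd
import matplotlib.pyplot as plt

test = pd.DataFrame({'answer': [1, 2, 3] * 5, 'categ': [1, 1, 2, 2, 3, 3] * 2 + [3, 2, 1]})

# rows are answers, columns are categories, each row normalized so it sums to 1
pd.crosstab(test['answer'], test['categ'], normalize='index').plot(kind='bar', stacked=True, ylim=(0, 1))
plt.show()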
This is by no means perfect but it's a start:
import numpy as np
import matplotlib.pyplot as plt

# df here is the percentages frame (rel_data in the question)
colors = ["r", "g", "b", "y", "m"]  # one colour per category ("o" is not a valid colour code)
for i in set(df.categ):
    if i == 1:
        x = np.zeros(len(set(df.answer)))
    else:
        x += df[df.categ == i - 1].perc.values  # .as_matrix() was removed in newer pandas
    plt.bar(df[df.categ == i].answer, df[df.categ == i].perc, bottom=x, color=colors[i - 1])
plt.xticks(list(set(df.answer)))
plt.show()
The approach is to group the data by category and then iterate over each category to get the answers, which become the individual bars. We check whether it is the first iteration with the i == 1 check; that creates a zero array used as the base when stacking. Then we draw the first bars, and on each further iteration we add the heights of the previous bars to the variable x.
The colors array is there just so you can differentiate the bars a bit better.
Hope this helps.
You can make a barplot with the matplotlib library. Follow this tutorial: http://matplotlib.org/examples/api/barchart_demo.html
Say that I have a dataframe (df) with lots of values, including two columns, X and Y. I want to create a stacked histogram where each bin is a categorical value in X (say A and B), and inside each bin are stacks by values in Y (say a, b, c, ...).
I can run df.groupby(["X","Y"]).size() to get output like below, but how can I make the stacked histogram from this?
A a 14
b 41
c 4
d 2
e 2
f 15
g 1
h 3
B a 18
b 37
c 1
d 3
e 1
f 17
g 2
So, I think I figured this out. First one needs to reshape the data using .unstack(level=-1).
This will turn it into an n by m array-like structure, where n is the number of X entries and m is the number of Y entries. From this form you can follow the outline given here:
http://pandas.pydata.org/pandas-docs/stable/visualization.html
So in total the command will be:
df.groupby(["X","Y"]).size().unstack(level=-1).plot(kind='bar',stacked=True)
Kinda unwieldy looking though!
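A minimal end-to-end run of that one-liner on made-up data (column names X and Y as in the question):
import pandas as pd
import matplotlib.pyplot as plt

# toy frame with the two categorical columns from the question
df = pd.DataFrame({"X": ["A", "A", "A", "A", "B", "B", "B"],
                   "Y": ["a", "a", "b", "c", "a", "b", "c"]})

# one bar per X value, stacked by Y; the bar heights are the group sizes
df.groupby(["X", "Y"]).size().unstack(level=-1).plot(kind="bar", stacked=True)
plt.show()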