Stacked Histograms of Grouped Data In pandas - python

Say that I have a dataframe (df) with lots of values, including two columns, X and Y. I want to create a stacked histogram where each bin is a categorical value in X (say A and B), and inside each bin are stacks by values in Y (say a,b,c,...).
I can run df.groupby(["X","Y"]).size() to get output like below, but how can I make the stacked histogram from this?
A a 14
b 41
c 4
d 2
e 2
f 15
g 1
h 3
B a 18
b 37
c 1
d 3
e 1
f 17
g 2

So, I think I figured this out. First one needs to unstack the data using .unstack(level=-1)
This will turn it into an n by m array-like structure where n is number of X entries and m is the number of Y entries. From this form you can follow the outline given here:
http://pandas.pydata.org/pandas-docs/stable/visualization.html
So in total the command will be:
df.groupby(["X","Y"]).size().unstack(level=-1).plot(kind='bar',stacked=True)
Kinda unwieldy looking though!
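For readers who want to try the one-liner step by step, here is a minimal sketch on a small made-up dataframe (the data and the `fill_value=0` argument are assumptions for illustration, not part of the original answer):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import pandas as pd

# Hypothetical data: X picks the bar, Y picks the segment inside the bar.
df = pd.DataFrame({
    "X": ["A", "A", "A", "B", "B", "B", "B"],
    "Y": ["a", "b", "b", "a", "a", "c", "c"],
})

# Count rows per (X, Y) pair, then move the Y level into the columns.
# fill_value=0 gives missing combinations (e.g. A/c) a zero count instead of NaN.
counts = df.groupby(["X", "Y"]).size().unstack(level=-1, fill_value=0)
print(counts)

# One bar per X value, stacked by Y.
ax = counts.plot(kind="bar", stacked=True)
ax.set_ylabel("count")
```

Splitting the chain into a named `counts` table makes it easier to inspect the intermediate result before plotting.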

Related

Convert columns of number into percentages

I have 4 columns. The 4th column is sum of values of 3 columns.
It is like this.
A B C D
4 3 3 10
I want to convert above expression into this.
E F G H
40% 30% 30% 100%
How can I do this in Python?
You can use numpy arrays for this operation. Treat each column as a vector, so the first 3 columns are represented by 3 vectors.
import numpy
A = numpy.array([4])
B = numpy.array([3])
C = numpy.array([3])
Then you can add them as normal vectors (in your case columns)
D = A + B + C
The numerical values will all work as expected, but as far as I know the letters themselves cannot be added the way you describe. If we consider
A = 1
B = 2
C = 3
then the sum would be 6, or F, not D; the same goes for the second set.
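Since the question asks for percentages, a possible way to get from the A B C D row to the E F G H row is sketched below with pandas. This is an assumption about what the asker wants (each column divided by the total column D), not something the answer above states:

```python
import pandas as pd

# The row from the question; D is the sum of A, B and C.
df = pd.DataFrame({"A": [4], "B": [3], "C": [3], "D": [10]})

# Multiply by 100 first, then divide each column by D row by row.
pct = df.mul(100).div(df["D"], axis=0)
pct.columns = ["E", "F", "G", "H"]

# Optional: format the numbers as strings with a percent sign.
formatted = pct.round().astype(int).astype(str) + "%"
print(formatted)
```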

Plot a CDF from a frequency table in Python

I have some frequency data:
Rank Count
A 34
B 1
C 1
D 2
E 1
F 4
G 112
H 1
...
in a dictionary:
d = {"A":34,"B":1,"C":1,"D":2,"E":1,"F":4,"G":112,"H":1,.......}
The letters represent a rank from highest to lowest (A to Z), and the numbers are how many times I observed each rank in the dataset.
How can I plot the cumulative distribution function given that I already have the frequencies of my observations in the dictionary? I want to be able to see the general ranking of the observations. For example: 50% of my observations have a rank lower than E.
I have been searching for info about this, but I only find ways to plot the CDF from the raw observations, not from the counts.
Thanks in advance.
Maybe you want to plot a bar plot with the rank on the x axis and the cdf on the y axis?
u = u"""Rank Count
A 34
B 1
C 1
D 2
E 1
F 4
G 112
H 1"""
import io
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(io.StringIO(u), sep=r"\s+")
df["Cum"] = df.Count.cumsum()/df.Count.sum()
df.plot.bar(x="Rank", y="Cum")
plt.show()
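If the counts are already in a dictionary, as in the question, the text-parsing step can be skipped. The sketch below uses only the eight ranks shown in the question (the rest of the dictionary is elided there) and draws a step plot instead of bars; both choices are assumptions for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

# Only the ranks visible in the question; the real dictionary is longer.
d = {"A": 34, "B": 1, "C": 1, "D": 2, "E": 1, "F": 4, "G": 112, "H": 1}

# Sort by rank so the cumulative sum runs from A towards Z.
counts = pd.Series(d).sort_index()
cdf = counts.cumsum() / counts.sum()

# First rank at which at least half of the observations are covered.
median_rank = cdf[cdf >= 0.5].index[0]
print(median_rank)

# Step plot of the empirical CDF over the ranks.
positions = range(len(cdf))
plt.step(positions, cdf.to_numpy(), where="post")
plt.xticks(positions, cdf.index)
plt.ylabel("cumulative fraction")
```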

how to plot histogram of maximum values of a dataframe

I have a dataframe with 3 columns df=["a", "b", "value"]. (Actually this is a snippet, the solution should be able to handle n variables, like "a", "b", "c", "d"...) In this case, the "value" column has been generated depending on the "a" and "b" value, doing something like:
for a in range(1, 10):
    for b in range(1, 10):
        generate_value(a, b)
The resulting data is similar to:
a b value
0 1 1 0.23
1 1 2 6.34
2 1 3 0.25
3 1 4 2.17
4 1 5 5.97
[...]
I want to know which combinations of "a" and "b" statistically give me the biggest "value". So I want to draw some kind of histogram that shows which values of "a" and "b" statistically generate a bigger "value". I tried something like:
fig = plot.figure()
ax=fig.add_subplot(111)
ax.hist(df["a"], bins=50, density=True)
or:
plot.plot(df["a"].values, df["value"].values, "o")
But the results are not good. I think that I should use some kind of histogram or gauss bell curve, but I'm not sure how to plot it.
So, how to plot the statistically better "a" and "b" to get maximum "value"?
Note: the answer 1 is perfect for two variables a and b, but the problem is that the correct answer would need to work for multiple variables, a, b, c, d...
Edit 1: Please note that although I'm asking about two variables, the solution can't be to bound "a" to axis x and "b" to axis y, as there may be more variables. So if we have "a", "b", "c", "d", "e", the solution should be valid
Edit 2: Trying to explain it better: Lets take the following dataframe:
a b c d value
0 1 6 9 7 0.23
1 5 2 3 5 11.34
2 6 7 8 4 0.25
3 1 4 9 3 2.17
4 1 5 9 1 4.97
5 6 6 4 7 25.9
6 3 5 5 2 10.37
7 1 5 1 2 7.87
8 2 5 3 3 8.12
9 1 5 2 1 2.97
10 7 5 4 9 5.97
11 3 5 2 3 9.92
[...]
Row 5 is clearly the winner, with a value of 25.9, so the supposedly best values of a,b,c,d are: 6 6 4 7. But statistically it is a strange result: it is the only row with such a high value for those values of a,b,c,d, so it is very unlikely that choosing them will give a high value in the future. Instead, it seems much safer to choose numbers that have generated a "value" between 8 and 11. Although a gain of 8 to 11 is less than 25.9, the probability that the values of a,b,c,d (5,2,3,3) generate this higher "value" is bigger
Edit 3: Although a,b,c,d are discrete, the combination/order of them will generate different results. I mean, there is a function that will return a value inside a small range, like: value=func(a,b,c,d). That value will depend not only on the values of a,b,c,d, but also on some random things. So, for instance, func(5,2,3,5) could return a value of 11.34, but it also could return a similar value, like 10.8, 9.5 or something like that (a range value between 8 and 11). Also, func(1,6,9,7) will return 0.23, or it could return 2.7, but probably it won't return 10.1 as it is also very far from its range.
Following the example, I'm trying to get the numbers that will most probably generate something in the range of 8-11 (well, the maximum). The numbers I want to visualize will probably be some combination of the numbers 3, 5 and 2. But there probably won't be any 6, 7 or 4, as they usually generate smaller "value" results
I don't think there are any statistics involved here. You can plot the value as a function of a and b.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
A,B = np.meshgrid(np.arange(10),np.arange(10))
df = pd.DataFrame({"a" : A.flatten(), "b" : B.flatten(),
"value" : np.random.rand(100)})
ax = df.plot.scatter(x="a",y="b", c=df["value"])
plt.colorbar(ax.collections[0])
plt.show()
The darker the dots, the higher the value.
This problem seems too complicated to solve with a single built-in function.
I think it should be solved this way:
1. exclude outliers from the data
2. select the n largest values
3. summarize the results with a bar plot or similar
Clean data from outliers
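We can choose any appropriate outlier-detection method, e.g. 3*sigma or 1.5*IQR. I used the 1.5*IQR rule in the example below, dropping values above Q3 + 1.5*IQR.
q1, q3 = data['value'].quantile([0.25, 0.75])
cleaned_data = data[data['value'] <= q3 + 1.5 * (q3 - q1)]  # keep values at or below the upper IQR fence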
Select n largest values
Pandas provides method nlargest, so you can use it to select n largest values:
largest_values = cleaned_data.nlargest(5, 'value')
or you can use interval of values
largest_values = cleaned_data[cleaned_data['value'] > cleaned_data['value'].max() - 3]
Summarize results
Here we count the occurrences of each value in each column and then plot the counts.
melted = pd.melt(largest_values[explanatory_columns])  # explanatory_columns: list of the explanatory variable columns, e.g. ['a', 'b', 'c', 'd']
table = pd.crosstab(melted['variable'], melted['value'])
table.plot.bar()
[Figure: example of the resulting plot]
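The three steps can be put together in one runnable sketch. The random data, the column names a to d, the choice of 20 rows and the upper fence Q3 + 1.5*IQR are all assumptions made for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import numpy as np
import pandas as pd

# Hypothetical data: four discrete explanatory variables and a noisy value.
rng = np.random.default_rng(0)
explanatory = list("abcd")
data = pd.DataFrame(rng.integers(1, 10, size=(200, 4)), columns=explanatory)
data["value"] = rng.normal(5, 2, size=200)

# Step 1: drop high outliers using the upper fence Q3 + 1.5 * IQR.
q1, q3 = data["value"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
cleaned = data[data["value"] <= upper_fence]

# Step 2: keep the rows with the n largest remaining values.
largest = cleaned.nlargest(20, "value")

# Step 3: count how often each setting of each variable appears in those rows.
melted = pd.melt(largest[explanatory])  # columns: "variable", "value"
table = pd.crosstab(melted["variable"], melted["value"])
print(table)

# One group of bars per variable, one bar per setting.
ax = table.plot.bar()
ax.set_ylabel("occurrences among top rows")
```

Settings that show up often among the top rows are the candidates for "statistically better" choices; with more variables, only the `explanatory` list needs to change.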

Convert (origin, destination, distance) to a distance matrix

I have a matrix with three columns (origin, destination, distance) and I want to convert this to an origin/destination matrix with Pandas. Is there a fast way (lambda, map) to do this without for loops?
For example (what I have):
a b 10
a c 20
b c 30
What I need:
a b c
a 0 10 20
b 10 0 30
c 20 30 0
Here is an option, firstly duplicate the data frame information by concatenating the original data frame values and the origin-destination swapped values and then do a pivot:
import numpy as np
pd.DataFrame(np.concatenate([df.values, df.values[:, [1,0,2]]])).pivot(index=0, columns=1, values=2).fillna(0)
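The same idea (append the swapped pairs, then pivot) can be sketched with named columns, which keeps the numeric dtype of the distances. The column names origin, destination and distance are assumptions for illustration:

```python
import pandas as pd

# The three rows from the question, with hypothetical column names.
edges = pd.DataFrame({
    "origin": ["a", "a", "b"],
    "destination": ["b", "c", "c"],
    "distance": [10, 20, 30],
})

# Swap origin and destination so that every pair exists in both directions.
swapped = edges.rename(columns={"origin": "destination", "destination": "origin"})
both = pd.concat([edges, swapped], ignore_index=True)

# Pivot into a square matrix; the missing diagonal becomes 0.
matrix = (
    both.pivot(index="origin", columns="destination", values="distance")
    .fillna(0)
    .astype(int)
)
print(matrix)
```

Note that `pivot` raises an error if the same (origin, destination) pair appears twice, so duplicates would need to be aggregated first.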

Re-shaping pandas data frame using shape or pivot_table (stack each row)

I have an almost embarrassingly simple question, which I cannot figure out for myself.
Here's a toy example to demonstrate what I want to do, suppose I have this simple data frame:
df = pd.DataFrame([[1,2,3,4,5,6],[7,8,9,10,11,12]],index=range(2),columns=list('abcdef'))
a b c d e f
0 1 2 3 4 5 6
1 7 8 9 10 11 12
What I want is to stack it so that it takes the following form, where the columns identifiers have been changed (to X and Y) so that they are the same for all re-stacked values:
X Y
0 1 2
3 4
5 6
1 7 8
9 10
11 12
I am pretty sure you can do it with stack() or pd.pivot_table(), but I have read the documentation and cannot figure out how. Instead of appending each column to the end of the previous one, I just want to append pairs (or actually triplets) of values from each row.
Just to add some more flesh to the bones of what I want to do;
df = pd.DataFrame(np.random.randn(3,6),index=range(3),columns=list('abcdef'))
a b c d e f
0 -0.168636 -1.878447 -0.985152 -0.101049 1.244617 1.256772
1 0.395110 -0.237559 0.034890 -1.244669 -0.721756 0.473696
2 -0.973043 1.784627 0.601250 -1.718324 0.145479 -0.099530
I want this to re-stacked into this form (where column labels have been changed again, to the same for all values):
X Y Z
0 -0.168636 -1.878447 -0.985152
-0.101049 1.244617 1.256772
1 0.395110 -0.237559 0.034890
-1.244669 -0.721756 0.473696
2 -0.973043 1.784627 0.601250
-1.718324 0.145479 -0.099530
Yes, one could just make a for-loop with the following logic operating on each row:
row.values.reshape(df.shape[1] // 3, 3)  # row is one row of df, split into triplets
But then you would have to compute each row individually and my actual data has tens of thousands of rows.
So I want to stack each individual row selectively (e.g. by pairs of values or triplets), and then stack that row-stack, for the entire data frame, basically. Preferably done on the entire data frame at once (if possible).
Apologies for such a trivial question.
Use numpy.reshape to reshape the underlying data in the DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3,6),index=range(3),columns=list('abcdef'))
print(df)
# a b c d e f
# 0 -0.889810 1.348811 -1.071198 0.091841 -0.781704 -1.672864
# 1 0.398858 0.004976 1.280942 1.185749 1.260551 0.858973
# 2 1.279742 0.946470 -1.122450 -0.355737 1.457966 0.034319
result = pd.DataFrame(df.values.reshape(-1,3),
index=df.index.repeat(2), columns=list('XYZ'))
print(result)
yields
X Y Z
0 -0.889810 1.348811 -1.071198
0 0.091841 -0.781704 -1.672864
1 0.398858 0.004976 1.280942
1 1.185749 1.260551 0.858973
2 1.279742 0.946470 -1.122450
2 -0.355737 1.457966 0.034319
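The reshape above hard-codes triplets and a repeat count of 2. A small generalisation for any group size is sketched below; the helper name `restack` is hypothetical, and it assumes the number of columns is a multiple of the group size:

```python
import pandas as pd

def restack(df, labels):
    """Split each row of df into consecutive groups of len(labels) values."""
    k = len(labels)
    if df.shape[1] % k != 0:
        raise ValueError("number of columns must be a multiple of the group size")
    groups_per_row = df.shape[1] // k
    return pd.DataFrame(
        df.to_numpy().reshape(-1, k),            # row-major, so row order is preserved
        index=df.index.repeat(groups_per_row),   # each original label once per group
        columns=list(labels),
    )

# The toy frame from the question.
df = pd.DataFrame([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]],
                  index=range(2), columns=list("abcdef"))

pairs = restack(df, "XY")      # 6 rows, 2 columns
triplets = restack(df, "XYZ")  # 4 rows, 3 columns
print(pairs)
print(triplets)
```

Because the whole array is reshaped in one call, no per-row loop is needed, which matters for frames with tens of thousands of rows.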
