Axis bug on Pandas groupby boxplots - python

The head of my Pandas dataframe, df, is shown below:
   count1  count2  totalcount  season
0       3      13          16       1
1       8      32          40       1
2       5      27          32       1
3       3      10          13       1
4       0       1           1       1
I'd like to make boxplots of count1, count2, and totalcount, grouped by season (there are 4 seasons), and have each set of boxplots appear in its own subplot within a single figure.
When I do this with only two of the columns, say count1 and count2, everything looks great.
df.boxplot(['count1', 'count2'], by='season')
But when I add totalcount to the mix, the axis limits go haywire.
df.boxplot(['count1', 'count2', 'totalcount'], by='season')
This happens regardless of the order of the columns. I realize there are several ways around this problem, but it would be much more convenient if this worked properly.
Am I missing something? Is this a known bug in Pandas? I wasn't able to find anything in my first pass of the Pandas bug reports.
I'm using Pandas 0.14.0 and matplotlib 1.3.1.

Did you try upgrading your pandas/matplotlib packages?
I'm using Pandas 0.13.1 + Matplotlib 1.2.1, and the plot renders correctly for me with the following data:
In [31]: df
Out[31]:
    count1  count2  totalcount  season
0        3      13          16       1
1        8      32          40       1
2        5      27          32       1
3        3      10          13       1
4        0       1           1       1
5        3      13          16       2
6        8      32          40       2
7        5      27          32       3
8        3      10          13       3
9        0       1           1       4
10       3      10          13       4
11       3      13          16       4

[12 rows x 4 columns]


Looping through pandas and finding row column pairs

I can't wrap my head around the best way to accomplish this.
The visualization below shows what I would like to accomplish. I'm not sure what you would call it exactly, but essentially I want to iterate through rows and columns and build a new dataframe with the x, y, and the intersecting value, just reshaping the dataframe without losing any values.
I'm doing this just to learn Pandas, so any help in the right direction on how to think about this, or the best way to solve it, would be greatly appreciated.
   1   2   3
1  10  15  20
2  11  16  21
3  12  17  22
x  y   z
1  1  10
1  2  15
1  3  20
2  1  11
2  2  16
2  3  21
3  1  12
3  2  17
3  3  22
Given your dataset and expected output, you are looking for pandas melt. Iterating over a dataframe can be slow and inefficient; since you are learning, I strongly suggest you look for more efficient ways of working (for example vectorizing an operation, or pivoting) instead of using for loops. The final line of the proposed solution merely reorders the columns to match your desired output. Kindly try the following:
df = pd.DataFrame({1: [10, 11, 12],
                   2: [15, 16, 17],
                   3: [20, 21, 22]})
df = df.melt(value_vars=[1, 2, 3], var_name='y', value_name='z')
df['x'] = df.groupby('y').cumcount() + 1
df = df.sort_values(['x', 'y']).reset_index(drop=True)
df[['x', 'y', 'z']]
Outputs:
   x  y   z
0  1  1  10
1  1  2  15
2  1  3  20
3  2  1  11
4  2  2  16
5  2  3  21
6  3  1  12
7  3  2  17
8  3  3  22
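For completeness, the same reshaping can also be reached in one step with stack, which moves the column labels into an inner index level (a sketch using the question's 3x3 frame; the x, y, z names in rename_axis/reset_index are only chosen to match the desired output):

```python
import pandas as pd

# The question's 3x3 frame: row labels 1-3, column labels 1-3
df = pd.DataFrame([[10, 15, 20],
                   [11, 16, 21],
                   [12, 17, 22]],
                  index=[1, 2, 3], columns=[1, 2, 3])

# stack() yields one (row, column) -> value entry per cell;
# name the index levels and turn them back into columns
out = df.stack().rename_axis(['x', 'y']).reset_index(name='z')
print(out)
```

Because stack walks the frame row by row, the result already comes out in the x-then-y order the question asks for, with no extra sort.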

Filling data in Pandas

I'm using pandas and I have data like this:
4   1
5   8
6  25
7  33
8  24
9   4
and I want to fill in the missing parts, so it looks like this:
1    0
2    0
3    0
4    1
5    8
6   25
7   33
8   24
9    4
10   0
It's going to be used as a list, like this: [0, 0, 0, 1, 8, 25, 33, 24, 4, 0]. I looked for a solution but couldn't find any. Any idea?
Try reindex (note the range starts at 1, to match your desired output):
l = s.reindex(range(1, 11), fill_value=0).tolist()
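Spelled out end to end, assuming the data lives in a Series s indexed 4-9 as in the question (a sketch):

```python
import pandas as pd

# The question's data: values at index labels 4..9
s = pd.Series([1, 8, 25, 33, 24, 4], index=range(4, 10))

# reindex to the full 1..10 label range; labels that were
# missing are filled with 0 instead of NaN
l = s.reindex(range(1, 11), fill_value=0).tolist()
print(l)  # [0, 0, 0, 1, 8, 25, 33, 24, 4, 0]
```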

In Python, using iloc, how would you retrieve the last 12 values of a specific column in a dataframe?

So the problem is that I want to access the data in a dataframe, but only the last twelve rows of every column. I have a dataframe:
index   A   B   C
20      1   2   3
21      2   5   6
22      7   8   9
23     10   1   2
24      3   1   2
25      4   9   0
26     10  11  12
27      1   2   3
28      2   1   5
29      6   7   8
30      8   4   5
31      1   3   4
32      1   2   3
33      5   6   7
34      1   3   4
The values inside A, B, C are not important; they are just an example.
Currently I am using
df1 = df2.iloc[23:35]
but perhaps there is an easier way, because I have to do this for around 20 different dataframes of different sizes. I know that if I use
df1 = df2.iloc[-1]
it will return the last row, but I don't know how to extend that to the last twelve. Any help would be appreciated.
You can get the last n rows of a DataFrame with:
df.tail(n)
or, equivalently:
df.iloc[-n:]
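As a quick check that the two spellings agree (a sketch with made-up data; any frame works, since both simply take the trailing rows):

```python
import pandas as pd

# Stand-in frame with 15 rows, indexed 20..34 like the question's
df2 = pd.DataFrame({'A': range(15), 'B': range(15, 30)},
                   index=range(20, 35))

last12_tail = df2.tail(12)
last12_iloc = df2.iloc[-12:]

# Both return the same trailing 12 rows
print(last12_tail.equals(last12_iloc))  # True
```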

Group rows by overlapping ranges

I have a dataframe where the left column is the left-most location of an object and the right column is its right-most location. I need to group the objects if they overlap, or if they overlap objects that overlap (transitively).
So, for example, if this is my dataframe:
   left  right
0     0      4
1     5      8
2    10     13
3     3      7
4    12     19
5    18     23
6    31     35
So lines 0 and 3 overlap and should be in the same group, and line 1 also overlaps line 3, so it joins the group.
So, for this example the output should be something like that:
   left  right  group
0     0      4      0
1     5      8      0
2    10     13      1
3     3      7      0
4    12     19      1
5    18     23      1
6    31     35      2
I have thought in various directions but haven't figured it out (without an ugly for loop).
Any help will be appreciated!
I found the accepted solution (update: now deleted) to be misleading, because it fails to generalize to similar cases, e.g. the following example:
df = pd.DataFrame({'left':  [0, 5, 10, 3, 12, 13, 18, 31],
                   'right': [4, 8, 13, 7, 19, 16, 23, 35]})
df
The suggested aggregate function put 18-23 in the wrong group (it should be in group 1, along with 12-19).
One solution is using the following approach (based on a method for combining intervals posted by @CentAu):
# Union intervals, by @CentAu
from sympy import Interval, Union

def union(data):
    """Union of a list of intervals, e.g. [(1,2), (3,4)]"""
    intervals = [Interval(begin, end) for (begin, end) in data]
    u = Union(*intervals)
    return [u] if isinstance(u, Interval) else list(u.args)

# Create a list of intervals
df['left_right'] = df[['left', 'right']].apply(list, axis=1)
intervals = union(df.left_right)

# Add a group column
df['group'] = df['left'].apply(
    lambda x: [g for g, l in enumerate(intervals) if l.contains(x)][0])
...which outputs the desired grouping.
Can you try this: use rolling max and rolling min to find the intersection of the ranges:
df = df.sort_values(['left', 'right'])
df['Group'] = ((df.right.rolling(window=2, min_periods=1).min()
                - df.left.rolling(window=2, min_periods=1).max()) < 0).cumsum()
df.sort_index()
Out[331]:
   left  right  Group
0     0      4      0
1     5      8      0
2    10     13      1
3     3      7      0
4    12     19      1
5    18     23      1
6    31     35      2
For example, take (1,3) and (2,4). To find the intersection:
min(3,4) - max(1,2) = 1; since 1 is not less than 0, the two intervals intersect.
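Run end to end on the question's data, the rolling approach looks like this (a sketch; the assignment happens on the sorted frame, so sort_index restores the original row order at the end):

```python
import pandas as pd

# The question's data
df = pd.DataFrame({'left':  [0, 5, 10, 3, 12, 18, 31],
                   'right': [4, 8, 13, 7, 19, 23, 35]})

# Sort so neighbouring rows are the only overlap candidates, then
# compare each adjacent pair: when the smaller 'right' ends before
# the larger 'left' begins, the gap is negative and a new group starts
df = df.sort_values(['left', 'right'])
gap = (df.right.rolling(window=2, min_periods=1).min()
       - df.left.rolling(window=2, min_periods=1).max()) < 0
df['Group'] = gap.cumsum()
df = df.sort_index()
print(df['Group'].tolist())  # [0, 0, 1, 0, 1, 1, 2]
```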
You can sort the samples and utilize the cumulative functions cummax and cumsum. Let's take your example:
   left  right
0     0      4
3     3      7
1     5      8
2    10     13
4    12     19
5    13     16
6    18     23
7    31     35
First you need to sort values so that longer ranges come first:
df = df.sort_values(['left', 'right'], ascending=[True, False])
Result:
   left  right
0     0      4
3     3      7
1     5      8
2    10     13
4    12     19
5    13     16
6    18     23
7    31     35
Then you can find overlapping groups through comparing 'left' with previous 'right' values:
df['group'] = (df['right'].cummax().shift() <= df['left']).cumsum()
df.sort_index(inplace=True)
Result:
left right group
0 0 4 0
1 5 8 0
2 10 13 1
3 3 7 0
4 12 19 1
5 13 16 1
6 18 23 1
7 31 35 2
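The sort/cummax/cumsum steps can also be condensed into a single chained expression; because assignment aligns on the index, each group label lands back on its original row (a sketch using the extended example above):

```python
import pandas as pd

df = pd.DataFrame({'left':  [0, 5, 10, 3, 12, 13, 18, 31],
                   'right': [4, 8, 13, 7, 19, 16, 23, 35]})

# Sort longest-first within equal lefts, flag every row whose 'left'
# clears all previous 'right' values, and count the flags cumulatively
df['group'] = (df.sort_values(['left', 'right'], ascending=[True, False])
                 .pipe(lambda s: (s['right'].cummax().shift() <= s['left']).cumsum()))
print(df['group'].tolist())  # [0, 0, 1, 0, 1, 1, 1, 2]
```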

Pandas Random Weighted Choice

I would like to randomly select a value, taking weightings into account, using Pandas.
df:
    0  1   2   3   4   5
0  40  5  20  10  35  25
1  24  3  12   6  21  15
2  72  9  36  18  63  45
3   8  1   4   2   7   5
4  16  2   8   4  14  10
5  48  6  24  12  42  30
I am aware of using np.random.choice, e.g:
x = np.random.choice(
    ['0-0', '0-1', etc.],
    1,
    p=[0.4, 0.24, etc.]
)
And so, I would like to get a similar output to np.random.choice from df, but using Pandas, and more efficiently than manually inserting the values as I have done above.
Using np.random.choice, I am aware that all probabilities must add up to 1. I'm not sure how to go about solving this, nor how to randomly select a value based on weightings using Pandas.
Regarding the output: if the randomly selected weight were, for example, 40, then the output would be 0-0, since it is located at column 0, row 0, and so on.
Stack the DataFrame:
stacked = df.stack()
Normalize the weights (so that they add up to 1):
weights = stacked / stacked.sum()
# As GeoMatt22 pointed out, this part is not necessary. See the other comment.
And then use sample:
stacked.sample(1, weights=weights)
Out:
1  2    12
dtype: int64
# Or without normalization, stacked.sample(1, weights=stacked)
The DataFrame.sample method allows you to sample either rows or columns. Consider this:
df.sample(1, weights=[0.4, 0.3, 0.1, 0.1, 0.05, 0.05])
Out:
    0  1   2  3   4   5
1  24  3  12  6  21  15
It selects one row (the first row with 40% chance, the second with 30% chance etc.)
This is also possible:
df.sample(1, weights=[0.4, 0.3, 0.1, 0.1, 0.05, 0.05], axis=1)
Out:
   1
0  5
1  3
2  9
3  1
4  2
5  6
Same process but 40% chance is associated with the first column and we are selecting from columns. However, your question seems to imply that you don't want to select rows or columns - you want to select the cells inside. Therefore, I changed the dimension from 2D to 1D.
df.stack()
Out:
0  0    40
   1     5
   2    20
   3    10
   4    35
   5    25
1  0    24
   1     3
   2    12
   3     6
   4    21
   5    15
2  0    72
   1     9
   2    36
   3    18
   4    63
   5    45
3  0     8
   1     1
   2     4
   3     2
   4     7
   5     5
4  0    16
   1     2
   2     8
   3     4
   4    14
   5    10
5  0    48
   1     6
   2    24
   3    12
   4    42
   5    30
dtype: int64
So if I now sample from this, I will both sample a row and a column. For example:
df.stack().sample()
Out:
1  0    24
dtype: int64
selects row 1 and column 0.
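Putting the pieces together, the sampled MultiIndex entry can be formatted back into the '0-0'-style label the question asks for (a sketch; the random_state is fixed only to make the draw repeatable):

```python
import pandas as pd

# The question's 6x6 frame of weights
df = pd.DataFrame([[40, 5, 20, 10, 35, 25],
                   [24, 3, 12,  6, 21, 15],
                   [72, 9, 36, 18, 63, 45],
                   [ 8, 1,  4,  2,  7,  5],
                   [16, 2,  8,  4, 14, 10],
                   [48, 6, 24, 12, 42, 30]])

# Flatten to 1D so each cell is one candidate, weighted by its own value;
# pandas normalizes the weights internally, so no manual division is needed
stacked = df.stack()
cell = stacked.sample(1, weights=stacked, random_state=0)

# The MultiIndex entry is the (row, column) pair of the chosen cell
row, col = cell.index[0]
label = f'{row}-{col}'
print(label)
```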
