Compute histogram values with a groupby on a pandas DataFrame - python

I want to group data from a dataframe using groupby and then compute the histogram of the grouped data:
This is my dataframe :
indicator
key
14 1
14 2
14 3
15 1
16 2
16 5
16 6
17 1
18 3
And I want to get this result using groupby :
indicator
key
14 1,2,3
15 1
16 2,5,6
17 1
18 3
and then compute the histogram of every key

numpy.histogram cannot handle nested arrays (an array inside an array), so the data needs to be in a flat format like this:
import numpy as np
import pandas as pd
dataf = pd.DataFrame()
dataf['key'] = range(14,25)
dataf['indicator'] = [1,1,2,1,3,4,7,15,23,43,67]
dataf.loc[11] = [14,2]
dataf.loc[12] = [14,3]
dataf.loc[13] = [16,5]
dataf.loc[14] = [16,6]
Since no raw data was provided, I am assuming it can be reformatted like this:
In [30]: dataf
Out[30]:
key indicator
0 14 1
1 15 1
2 16 2
3 17 1
4 18 3
5 19 4
6 20 7
7 21 15
8 22 23
9 23 43
10 24 67
11 14 2
12 14 3
13 16 5
14 16 6
numpy.histogram already counts how many values fall into each bin, which covers the grouping step, so you don't need to call groupby on the DataFrame.
You just need to do np.histogram(dataf['indicator'])
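As a minimal sketch of what that call returns, the snippet below uses the nine rows from the question rather than the answer's longer sample. The explicit `bins` list is my own assumption, chosen so that each integer value gets its own bin.

```python
import numpy as np
import pandas as pd

# The nine rows from the question, in flat (long) format.
dataf = pd.DataFrame({
    "key":       [14, 14, 14, 15, 16, 16, 16, 17, 18],
    "indicator": [1, 2, 3, 1, 2, 5, 6, 1, 3],
})

# Assumed bin edges: one bin per integer value; the last bin [6, 7] is closed.
bins = [1, 2, 3, 4, 5, 6, 7]

# np.histogram returns a (counts, bin_edges) pair.
counts, bin_edges = np.histogram(dataf["indicator"], bins=bins)
print(counts)      # number of values falling into each bin
print(bin_edges)   # the edges that were used
```

Without `bins`, np.histogram falls back to 10 equal-width bins over the data range, which may not line up with integer values.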
FYI, if you want to plot a histogram, you can also use DataFrame.hist()
import matplotlib.pyplot as plt
dataf.indicator.hist()
plt.savefig('test.png')
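If the grouped result from the question is really what is needed, here is a rough sketch under my own assumptions: `key` is treated as a regular column, and the bin edges are picked by hand. It builds the comma-joined values per key and then one histogram per key.

```python
import numpy as np
import pandas as pd

# The nine rows from the question; 'key' is assumed to be a regular column.
df = pd.DataFrame({
    "key":       [14, 14, 14, 15, 16, 16, 16, 17, 18],
    "indicator": [1, 2, 3, 1, 2, 5, 6, 1, 3],
})

# Join the indicator values of each key into one comma-separated string.
grouped = df.groupby("key")["indicator"].agg(lambda s: ",".join(map(str, s)))
print(grouped)

# Assumed bin edges shared by all keys, so the histograms are comparable.
bins = [1, 2, 3, 4, 5, 6, 7]

# One histogram (counts only) per key.
per_key_hist = {
    key: np.histogram(values, bins=bins)[0]
    for key, values in df.groupby("key")["indicator"]
}
print(per_key_hist[16])
```

Sharing one set of bin edges across keys is a design choice; letting np.histogram pick edges per group would give counts that cannot be compared between keys.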

Related

Merging rows in a dataframe based on reoccurring values

I have the following dataframe with each row containing two values.
print(x)
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17
6 16 18
7 16 19
8 17 18
9 17 19
10 18 19
11 20 21
I want to merge these values if one or both values of a particular row reoccur in another row. The principle can be explained as follows: if A and B are together in one row and B and C are together in another row, then A, B and C should be together. The outcome I want for the dataframe above is:
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17 18 19
6 20 21
I tried creating a loop with df.duplicated that would create such an outcome, but it hasn't worked out yet.
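This looks like a graph theory problem: finding connected components. You can use the networkx library (this assumes the two columns of the dataframe are named 'a' and 'b'):
import networkx as nx
import pandas as pd
g = nx.from_pandas_edgelist(df, 'a', 'b')
# one row per connected component: smallest member in 'a', the rest joined in 'b'
pd.concat([pd.Series([sorted(i)[0],
' '.join(map(str, sorted(i)[1:]))],
index=['a', 'b'])
for i in nx.connected_components(g)], axis=1).T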
Output:
a b
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17 18 19
6 20 21
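If adding networkx as a dependency is not an option, the same connected-components idea can be sketched with a small union-find in plain Python. This is a swapped-in technique, not what the answer above uses, and `merge_connected` is a hypothetical helper name.

```python
def merge_connected(pairs):
    """Group values that are linked, directly or transitively, by any pair."""
    parent = {}

    def find(v):
        # Return the representative of v's group, compressing the path as we go.
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    def union(a, b):
        # Merge the groups of a and b, keeping the smaller value as representative.
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    for a, b in pairs:
        union(a, b)

    groups = {}
    for v in list(parent):
        groups.setdefault(find(v), []).append(v)
    return [sorted(members) for _, members in sorted(groups.items())]


# The rows from the question as (first value, second value) pairs.
pairs = [(0, 1), (4, 5), (8, 9), (10, 11), (14, 15), (16, 17),
         (16, 18), (16, 19), (17, 18), (17, 19), (18, 19), (20, 21)]
merged = merge_connected(pairs)
print(merged)
```

With a DataFrame, the pairs could be taken from its two columns, for example via `itertuples(index=False, name=None)`.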

Extract indices of a grouped elements in Pandas

The objective is to extract the index numbers of randomly selected rows from each group in Pandas.
Specifically, given a df
nval
0 4
1 4
2 0
...
23 0
24 4
...
29 4
30 4
31 0
I would like to extract 5 random indices each for the elements 0 and 4.
For example, the 5 randomly selected indices for 0 can be 3, 11, 15, 16, 22 and for 4 can be 6, 9, 7, 29, 27.
Currently, the code below achieves this:
import numpy as np
import numpy.random
import pandas as pd
np.random.seed(0)
dval=[4,4,0,0,0,0,4,4,0,4,0,0,4,4,0,0,0,0,4,
4,0,0,0,0,4,0,4,4,4,4,4,0,]
df = pd.DataFrame (dict(nval=dval))
cgroup=5
df=df.reset_index()
all_df=[]
for idx in [0,4]:
x=df[df['nval']==idx].reset_index(drop=True)
ids = np.random.choice(len(x), size=cgroup, replace=False).tolist()
all_df.append(x.iloc[ids].reset_index(drop=True))
df=pd.concat(all_df).reset_index(drop=True).sort_values(by=['index'])
sel_index=df[['index']]
Which produced
index
0 3
1 6
2 7
3 9
4 11
5 15
6 16
7 22
8 27
9 29
However, I wonder whether there is a more compact way of doing this using pandas or numpy?
How about this:
import numpy as np
import numpy.random
import pandas as pd
np.random.seed(0)
dval=[4,4,0,0,0,0,4,4,0,4,0,0,4,4,0,0,0,0,4,4,0,0,0,0,4,0,4,4,4,4,4,0,]
df = pd.DataFrame (dict(nval=dval))
df2 = df.groupby('nval').sample(5).reset_index()
print(df2)
output:
index nval
0 31 0
1 22 0
2 14 0
3 8 0
4 17 0
5 29 4
6 13 4
7 1 4
8 19 4
9 12 4
IIUC, you can use
pd.DataFrame({'index': df.groupby('nval').sample(5).index.sort_values()})
I'd just keep the result as an index, so it simplifies to
df.groupby('nval').sample(5).index.sort_values()
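One thing to watch: `np.random.seed(0)` may not make the groupby sample repeatable in the way one would expect, so here is a minimal sketch that passes `random_state` directly. The seed value 0 is arbitrary, and `groupby(...).sample` needs pandas 1.1 or later.

```python
import pandas as pd

dval = [4, 4, 0, 0, 0, 0, 4, 4, 0, 4, 0, 0, 4, 4, 0, 0, 0, 0, 4,
        4, 0, 0, 0, 0, 4, 0, 4, 4, 4, 4, 4, 0]
df = pd.DataFrame({"nval": dval})

# Draw 5 rows per group without replacement; random_state makes it repeatable.
sel_index = df.groupby("nval").sample(n=5, random_state=0).index.sort_values()
print(sel_index)

# Sanity check: the selected rows contain exactly 5 of each value.
print(df.loc[sel_index, "nval"].value_counts())
```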

How to exclude some string patterns when using filter on pandas?

dataframe
df.columns=['ipo_date','l2y_gg_date','l1k_kk_date']
Goal
return dataframe with columns name containing _date except for ipo_date.
Try
df.filter(regex='_date&^ipo_date')
Try a negative lookbehind:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(1, 21).reshape((5, 4)),
columns=['ipo_date', 'l2y_gg_date', 'l1k_kk_date', 'other'])
filtered = df.filter(regex=r'(?<!ipo)_date')
print(filtered)
Sample df:
ipo_date l2y_gg_date l1k_kk_date other
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
4 17 18 19 20
filtered:
l2y_gg_date l1k_kk_date
0 2 3
1 6 7
2 10 11
3 14 15
4 18 19
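Note that the lookbehind also drops any other column whose name ends in `ipo_date`. If only the exact name `ipo_date` should be excluded, an anchored negative lookahead is one possible alternative. In the sketch below, `pre_ipo_date` is a hypothetical extra column added only to show the difference.

```python
import numpy as np
import pandas as pd

# 'pre_ipo_date' is a hypothetical column, added to show how the patterns differ.
df = pd.DataFrame(
    np.arange(1, 26).reshape((5, 5)),
    columns=["ipo_date", "l2y_gg_date", "l1k_kk_date", "other", "pre_ipo_date"],
)

# Lookbehind: rejects every '_date' that is directly preceded by 'ipo'.
lookbehind = df.filter(regex=r"(?<!ipo)_date")

# Anchored lookahead: rejects only the exact column name 'ipo_date'.
lookahead = df.filter(regex=r"^(?!ipo_date$).*_date$")

print(list(lookbehind.columns))
print(list(lookahead.columns))
```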

How to subset pandas dataframe columns with idxmax output?

I have a pandas dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,40,size=(10,4)), columns=range(4), index = range(10))
df.head()
0 1 2 3
0 27 10 13 21
1 25 12 23 8
2 2 24 24 34
3 10 11 11 10
4 0 15 0 27
I'm using the idxmax function to get the columns that contain the maximum value.
df_max = df.idxmax(1)
df_max.head()
0 0
1 0
2 3
3 1
4 3
How can I use df_max along with df, to create a time-series of values corresponding to the maximum value in each row of df? This is the output I want:
0 27
1 25
2 34
3 11
4 27
5 37
6 35
7 32
8 20
9 38
I know I can achieve this using df.max(1), but I want to know how to arrive at this same output by using df_max, since I want to be able to apply df_max to other matrices (not df) which share the same columns and indices as df (but not the same values).
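You may try df.lookup (note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this only works on older versions)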
df.lookup(df_max.index, df_max)
Out[628]: array([27, 25, 34, 11, 27], dtype=int64)
If you want Series/DataFrame, you pass the output to the Series/DataFrame constructor
pd.Series(df.lookup(df_max.index, df_max), index=df_max.index)
Out[630]:
0 27
1 25
2 34
3 11
4 27
dtype: int64
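On pandas versions without `df.lookup`, the same result can be had with NumPy integer indexing. This is a minimal sketch using the first five rows shown in the question; `other` is a hypothetical second frame with the same index and columns.

```python
import numpy as np
import pandas as pd

# First five rows of the question's dataframe.
df = pd.DataFrame(
    [[27, 10, 13, 21],
     [25, 12, 23, 8],
     [2, 24, 24, 34],
     [10, 11, 11, 10],
     [0, 15, 0, 27]],
    columns=range(4),
)
df_max = df.idxmax(axis=1)

# Translate the labels in df_max into positional row/column indices.
rows = np.arange(len(df))
cols = df.columns.get_indexer(df_max)

# Pick one value per row from df.
result = pd.Series(df.to_numpy()[rows, cols], index=df.index)
print(result)

# The same positions can be reused on another frame of the same shape
# ('other' is a hypothetical example frame).
other = df * 10
other_result = pd.Series(other.to_numpy()[rows, cols], index=other.index)
print(other_result)
```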

Python: how to create a colorbar based on a list of values?

I have to fill some patches with certain color based on given values. I have a data frame that looks like the following:
df:
Val Patch
0 12 0
1 13 1
2 16 2
3 18 3
4 19 4
5 24 5
6 31 6
7 33 7
8 34 8
9 35 9
I would like to know how to associate each value with the "right" color in a colorbar (of reds, for instance).
cmap = plt.cm.Reds
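One possible approach, sketched under my own assumptions, is to normalize `Val` to the 0-1 range and pass it through the colormap. The patches are drawn here as simple bars because the real patch geometry is not given in the question, and the Agg backend is chosen only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

df = pd.DataFrame({
    "Val":   [12, 13, 16, 18, 19, 24, 31, 33, 34, 35],
    "Patch": range(10),
})

cmap = plt.cm.Reds
# Map the smallest Val to 0 and the largest to 1.
norm = Normalize(vmin=df["Val"].min(), vmax=df["Val"].max())
# One RGBA color per row.
colors = cmap(norm(df["Val"].to_numpy()))

fig, ax = plt.subplots()
# Stand-in for the real patches: one unit-height bar per patch.
ax.bar(df["Patch"], 1, color=colors, width=1.0)
# The colorbar is built from the same norm and cmap, so it matches the fills.
fig.colorbar(ScalarMappable(norm=norm, cmap=cmap), ax=ax, label="Val")
```

Using the same `norm` and `cmap` for both the patch colors and the colorbar is what keeps them consistent; `fig.savefig(...)` can be added to write the figure to disk.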
