I have the following dataframe with each row containing two values.
print(x)
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17
6 16 18
7 16 19
8 17 18
9 17 19
10 18 19
11 20 21
I want to merge these values if one or both values of a particular row reoccur in another row. The principal can be explained as follows: if A and B are together in one row and B and C are together in another row, then it means that A, B and C should be together. What I want as an outcome looking at the dataframe above is:
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17 18 19
6 20 21
I tried creating a loop with df.duplicated that would create such an outcome, but it hasn't worked out yet.
This seems like graph theory problem dealing with connected components. You can use the networkx library:
import networkx as nx
g = nx.from_pandas_edgelist(df, 'a', 'b')
pd.concat([pd.Series([list(i)[0],
' '.join(map(str, list(i)[1:]))],
index=['a', 'b'])
for i in list(nx.connected_components(g))], axis=1).T
Output:
a b
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17 18 19
6 20 21
The objective is to extract the index number of a randomly selected grouped rows in Pandas.
Specifically, given a df
nval
0 4
1 4
2 0
...
23 0
24 4
...
29 4
30 4
31 0
I would like to extract each 5 random index of the element 0 and 4.
For example, the 5 randomly selected value for
0
can be
3,11,15,16,22
and
4
can be
6 9 7 29 27
Currently, the code below answer the above objective
import numpy as np
import numpy.random
import pandas as pd
np.random.seed(0)
dval=[4,4,0,0,0,0,4,4,0,4,0,0,4,4,0,0,0,0,4,
4,0,0,0,0,4,0,4,4,4,4,4,0,]
df = pd.DataFrame (dict(nval=dval))
cgroup=5
df=df.reset_index()
all_df=[]
for idx in [0,4]:
x=df[df['nval']==idx].reset_index(drop=True)
ids = np.random.choice(len(x), size=cgroup, replace=False).tolist()
all_df.append(x.iloc[ids].reset_index(drop=True))
df=pd.concat(all_df).reset_index(drop=True).sort_values(by=['index'])
sel_index=df[['index']]
Which produced
index
0 3
1 6
2 7
3 9
4 11
5 15
6 16
7 22
8 27
9 29
However, I wonder there is compact way of doing this using pandas or numpy?
How about this:
import numpy as np
import numpy.random
import pandas as pd
np.random.seed(0)
dval=[4,4,0,0,0,0,4,4,0,4,0,0,4,4,0,0,0,0,4,4,0,0,0,0,4,0,4,4,4,4,4,0,]
df = pd.DataFrame (dict(nval=dval))
df2 = df.groupby('nval').sample(5).reset_index()
print(df2)
output:
index nval
0 31 0
1 22 0
2 14 0
3 8 0
4 17 0
5 29 4
6 13 4
7 1 4
8 19 4
9 12 4
IIUC, you can use
pd.DataFrame({'index': df.groupby('nval').sample(5).index.sort_values()})
I'd just keep the result as an index, so it simplifies to
df.groupby('nval').sample(5).index.sort_values()
dataframe
df.columns=['ipo_date','l2y_gg_date','l1k_kk_date']
Goal
return dataframe with columns name containing _date except for ipo_date.
Try
df.filter(regex='_date&^ipo_date')
Try a negative lookbehind:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(1, 21).reshape((5, 4)),
columns=['ipo_date', 'l2y_gg_date', 'l1k_kk_date', 'other'])
filtered = df.filter(regex=r'(?<!ipo)_date')
print(filtered)
Sample df:
ipo_date l2y_gg_date l1k_kk_date other
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
4 17 18 19 20
filtered:
l2y_gg_date l1k_kk_date
0 2 3
1 6 7
2 10 11
3 14 15
4 18 19
I have a pandas dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,40,size=(10,4)), columns=range(4), index = range(10))
df.head()
0 1 2 3
0 27 10 13 21
1 25 12 23 8
2 2 24 24 34
3 10 11 11 10
4 0 15 0 27
I'm using the idxmax function to get the columns that contain the maximum value.
df_max = df.idxmax(1)
df_max.head()
0 0
1 0
2 3
3 1
4 3
How can I use df_max along with df, to create a time-series of values corresponding to the maximum value in each row of df? This is the output I want:
0 27
1 25
2 34
3 11
4 27
5 37
6 35
7 32
8 20
9 38
I know I can achieve this using df.max(1), but I want to know how to arrive at this same output by using df_max, since I want to be able to apply df_max to other matrices (not df) which share the same columns and indices as df (but not the same values).
You may try df.lookup
df.lookup(df_max.index, df_max)
Out[628]: array([27, 25, 34, 11, 27], dtype=int64)
If you want Series/DataFrame, you pass the output to the Series/DataFrame constructor
pd.Series(df.lookup(df_max.index, df_max), index=df_max.index)
Out[630]:
0 27
1 25
2 34
3 11
4 27
dtype: int64
I have to fill some patches with certain color based on given values. I have a data frame that looks like the following:
df:
Val Patch
0 12 0
1 13 1
2 16 2
3 18 3
4 19 4
5 24 5
6 31 6
7 33 7
8 34 8
9 35 9
I would like to know how to create associate to each values the "right" color in a colorbar (of reds for instance).
cmap = plt.cm.Reds