How to combine rows into separate dataframe python pandas - python

I have the following dataset:
A B C D E F
154.6175111 148.0112337 155.7859835 1 1 x
255 253.960131 242.5382584 1 1 x
251.9665958 235.1105659 185.9121703 1 1 x
137.9974994 225.3985177 254.4420772 1 1 x
85.74722877 116.7060415 158.4608395 1 1 x
123.6969939 140.0524405 132.6798037 1 1 x
133.3251695 80.08976196 38.81201612 1 1 y
118.0718812 243.5927927 255 1 1 y
189.5557302 139.9046713 91.90519519 1 1 y
172.3117291 188.000268 129.8155501 1 1 y
48.07634611 21.9183119 25.99669279 1 1 y
23.40525987 8.395857933 25.62371342 1 1 y
228.753009 164.0697727 172.6624107 1 1 z
203.3405006 173.9368303 189.8103708 1 1 z
184.9801932 117.1591341 87.94739034 1 1 z
29.55251224 46.03945452 70.7433477 1 1 z
143.6159623 120.6170926 155.0736604 1 1 z
142.5421179 128.8916843 169.6013111 1 1 z
I want to combine the x, y and z rows into another dataframe like this:
A B C D E F
154.6175111 148.0112337 155.7859835 1 1 x ->first x value
133.3251695 80.08976196 38.81201612 1 1 y ->first y value
228.753009 164.0697727 172.6624107 1 1 z ->first z value
I want one such dataframe for each position within x, y and z: the first rows, the second rows, the third rows and so on.
How can I select and combine them?
Desired output:
A B C D E F
154.6175111 148.0112337 155.7859835 1 1 x
133.3251695 80.08976196 38.81201612 1 1 y
228.753009 164.0697727 172.6624107 1 1 z
A B C D E F
255 253.960131 242.5382584 1 1 x
118.0718812 243.5927927 255 1 1 y
203.3405006 173.9368303 189.8103708 1 1 z
A B C D E F
251.9665958 235.1105659 185.9121703 1 1 x
189.5557302 139.9046713 91.90519519 1 1 y
184.9801932 117.1591341 87.94739034 1 1 z
A B C D E F
137.9974994 225.3985177 254.4420772 1 1 x
172.3117291 188.000268 129.8155501 1 1 y
29.55251224 46.03945452 70.7433477 1 1 z
A B C D E F
85.74722877 116.7060415 158.4608395 1 1 x
48.07634611 21.9183119 25.99669279 1 1 y
143.6159623 120.6170926 155.0736604 1 1 z
A B C D E F
123.6969939 140.0524405 132.6798037 1 1 x
23.40525987 8.395857933 25.62371342 1 1 y
142.5421179 128.8916843 169.6013111 1 1 z

Use GroupBy.cumcount to number the rows within each F group, then loop over a second groupby on that counter:
counter = df.groupby('F').cumcount()
for i, part in df.groupby(counter):
    print(part)
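If the pieces are needed as separate objects rather than printed, the same counter can be used to build a dictionary of DataFrames. This is a minimal sketch on a small stand-in dataset (the values and the names `position` and `frames` are made up for illustration, not taken from the question):

```python
import pandas as pd

# Small stand-in for the question's data: two rows per group in column F.
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "F": ["x", "x", "y", "y", "z", "z"],
})

# Position of each row inside its F group: 0 for the first x/y/z row, 1 for the second, ...
position = df.groupby("F").cumcount()

# One DataFrame per position, keyed by that position.
frames = {pos: part for pos, part in df.groupby(position)}

print(frames[0])  # the first x, first y and first z rows
print(frames[1])  # the second x, second y and second z rows
```

Each value in `frames` keeps the original row index, so `reset_index(drop=True)` can be applied if a fresh index is preferred.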

Related

Randomly select cells in df pandas

From this pandas df
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1
samples_indices = df.sample(frac=0.5, replace=False).index
df.loc[samples_indices] = 'X'
will assign 'X' to all columns in randomly selected rows corresponding to 50% of df, like so:
X X X X
1 1 1 1
X X X X
1 1 1 1
But how do I assign 'X' to 50% randomly selected cells in the df?
For example like this:
X X X 1
1 X 1 1
X X X 1
1 1 1 X
Use numpy and boolean indexing, for an efficient solution:
import numpy as np
df[np.random.choice([True, False], size=df.shape)] = 'X'
# with a custom probability:
N = 0.5
df[np.random.choice([True, False], size=df.shape, p=[N, 1-N])] = 'X'
Example output:
0 1 2 3
0 X 1 X X
1 X X 1 X
2 X X X 1
3 X X 1 X
If you need an exact proportion, you can use:
frac = 0.5
df[np.random.permutation(df.size).reshape(df.shape) < df.size*frac] = 'X'
Example:
0 1 2 3
0 X 1 X 1
1 X 1 X 1
2 1 1 X 1
3 X X 1 X
@mozway's answer sets cells to 'X' with a certain probability. But let's say you want exactly 50% of your data to be marked as 'X'. This is how you can do it:
import numpy as np
df[np.random.permutation(np.hstack([np.ones(df.size // 2), np.zeros(df.size // 2)])).astype(bool).reshape(df.shape)] = 'X'
Example output:
X X X 1
1 X 1 1
X X X 1
1 1 1 X
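Note that stacking `df.size // 2` ones and `df.size // 2` zeros only yields `df.size` elements when the number of cells is even, so the reshape fails for an odd-sized frame. A sketch that works for any size and any fraction is to draw the exact number of flat positions without replacement. The seed, the rounding rule and the use of `DataFrame.mask` (instead of in-place assignment) are choices made for this example:

```python
import numpy as np
import pandas as pd

# 3x3 frame (9 cells, an odd count); object dtype so strings can be mixed in.
df = pd.DataFrame(np.ones((3, 3), dtype=int)).astype(object)

rng = np.random.default_rng(0)   # seed chosen only for reproducibility
frac = 0.5
k = int(round(df.size * frac))   # number of cells to mark (rounding is an assumption)

# Pick k distinct flat positions, then shape them into a boolean mask like df.
mask = np.zeros(df.size, dtype=bool)
mask[rng.choice(df.size, size=k, replace=False)] = True
mask = mask.reshape(df.shape)

# Replace the selected cells and keep the rest.
marked = df.mask(mask, "X")
print(marked)
```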
Create a MultiIndex Series with DataFrame.stack, sample the fraction of values to keep with Series.sample, and finally fill the removed values with X in Series.unstack:
N = 0.5
df = (df.stack().sample(frac=1-N).unstack(fill_value='X')
        .reindex(index=df.index, columns=df.columns, fill_value='X'))
print(df)
0 1 2 3
0 X X 1 1
1 X 1 X 1
2 1 X X X
3 1 1 1 X

Get the middle value from a column according to a criteria

I have a dataframe with 3 columns. I need to get the values of columns A and B from the middle row among the rows where C = 1. If the number of rows with C = 1 is even, I want the first of the two middle rows.
For example, this one has an odd number of rows with C = 1:
A B C
w y 0
c v 0
t o 1
e p 1
t b 1
u e 0
q p 0
The row in the middle when C = 1 is
A B C
e p 1
Therefore, it should return
df_return
A B C
e p 1
When we have an even number of rows with C = 1:
df
A B C
w y 0
c v 0
t o 1
e p 1
t b 1
u e 1
r e 1
u f 1
q p 0
The ones in the middle when C = 1 are
A B C
t b 1
u e 1
However, I want only 1 of them, and it should be the first one. So
df_return
A B C
t b 1
How can I do it?
One thing you should know is that A and B are ordered.
Focus on the relevant part, discarding the rows where C is 0:
df = df[df.C == 1]
Now it's simple. Just find the midpoint, based on len(df) or .shape[0].
if len(df) > 0:
    mid = (len(df) - 1) // 2
    # double brackets keep the result as a one-row DataFrame
    df_return = df.iloc[[mid]]
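The midpoint rule can be wrapped in a small function and checked against both examples from the question. This is a sketch; the function name `middle_row` and its parameters are hypothetical:

```python
import pandas as pd

def middle_row(df, col="C", value=1):
    """Return the middle row among rows where df[col] == value.

    For an even number of matches, the first of the two middle rows is returned.
    """
    hits = df[df[col] == value]
    if hits.empty:
        return hits                   # no match: return the empty frame
    mid = (len(hits) - 1) // 2        # integer division picks the lower middle
    return hits.iloc[[mid]]           # double brackets keep a one-row DataFrame

# The odd example from the question (three rows with C == 1).
odd = pd.DataFrame({"A": list("wctetuq"),
                    "B": list("yvopbep"),
                    "C": [0, 0, 1, 1, 1, 0, 0]})

# The even example from the question (six rows with C == 1).
even = pd.DataFrame({"A": list("wcteturuq"),
                     "B": list("yvopbeefp"),
                     "C": [0, 0, 1, 1, 1, 1, 1, 1, 0]})

print(middle_row(odd))
print(middle_row(even))
```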

How to print the row and columns of the value you're looking for in dataframe

So I made this dataframe:
import numpy as np
import pandas as pd
alp = "abcdefghijklmnopqrstuvwxyz0123456789"
s = "carl"
for i in s:
    alp = alp.replace(i, "")
jaa = s + alp
x = list(jaa)
array = np.array(x)
re = np.reshape(array,(6,6))
dt = pd.DataFrame(re)
dt.columns = [1,2,3,4,5,6]
dt.index = [1,2,3,4,5,6]
dt
1 2 3 4 5 6
1 c a r l b d
2 e f g h i j
3 k m n o p q
4 s t u v w x
5 y z 0 1 2 3
6 4 5 6 7 8 9
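I want to search for a value and print its row (index) and column.
For example, for 'h' the output I want is 2,4.
Is there any way to get that output?
np.where returns the positional row and column indices of the matching cells; translate them to labels with dt.index and dt.columns:
row, col = np.where(dt == "h")
print(dt.index[row[0]], dt.columns[col[0]])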
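The two lines above report only the first match, and `row[0]` raises an IndexError when the value is absent. A sketch of a helper that returns every match as label pairs, and an empty list when there is none, could look like this (the name `locate` is hypothetical):

```python
import numpy as np
import pandas as pd

# Rebuild the 6x6 frame from the question: "carl" followed by the remaining characters.
letters = list("carlbdefghijkmnopqstuvwxyz0123456789")
dt = pd.DataFrame(np.reshape(letters, (6, 6)),
                  index=range(1, 7), columns=range(1, 7))

def locate(frame, value):
    """Return (row label, column label) pairs for every cell equal to value."""
    rows, cols = np.where((frame == value).to_numpy())
    return [(frame.index[r], frame.columns[c]) for r, c in zip(rows, cols)]

print(locate(dt, "h"))   # the single position of 'h'
print(locate(dt, "?"))   # no match, so the list is empty
```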

How to change a string matrix to an integer matrix

I have a voting dataset like this:
republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
All the values are strings, so I want to convert them to an integer matrix and compute some statistics:
hou_dat = pd.read_csv("house.data", header=None)
for i in range(0, hou_dat.shape[0]):
    for j in range(0, hou_dat.shape[1]):
        if hou_dat[i, j] == "republican":
            hou_dat[i, j] = 2
        if hou_dat[i, j] == "democrat":
            hou_dat[i, j] = 3
        if hou_dat[i, j] == "y":
            hou_dat[i, j] = 1
        if hou_dat[i, j] == "n":
            hou_dat[i, j] = 0
        if hou_dat[i, j] == "?":
            hou_dat[i, j] = -1
hou_sta = hou_dat.apply(pd.value_counts)
print(hou_sta)
However, it raises this error. How can I solve it?
Exception has occurred: KeyError
(0, 0)
IIUC, you need stack, map and unstack (df here is the question's hou_dat):
map_dict = {'republican' : 2,
            'democrat' : 3,
            'y' : 1,
            'n' : 0,
            '?' : -1}
df1 = df.stack().map(map_dict).unstack()
print(df1)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 2 0 1 0 1 1 1 0 0 0 1 -1 1 1 1 0 1
1 2 0 1 0 1 1 1 0 0 0 0 0 1 1 1 0 -1
2 3 -1 1 1 -1 1 1 0 0 0 0 1 0 1 1 0 0
3 3 0 1 1 0 -1 1 0 0 0 0 1 0 1 0 0 1
If you're dealing with data from a CSV file, it is better to use pandas' own methods.
In this case, the replace method does exactly what you asked for:
hou_dat.replace(to_replace={'republican':2, 'democrat':3, 'y':1, 'n':0, '?':-1}, inplace=True)
You can read more about it in the pandas documentation for DataFrame.replace.
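As for the error itself: `hou_dat[i, j]` asks the DataFrame for a column labelled `(i, j)`, which is why the KeyError shows `(0, 0)`; element access by position would go through `hou_dat.iloc[i, j]`. The following sketch avoids the loop altogether by mapping each column through the dictionary and then counting the codes per column. The tiny inline dataset and the names `codes`, `hou_num` and `hou_sta` are stand-ins for illustration, not the real house.data file:

```python
import pandas as pd

# Tiny stand-in for the voting data (the real file has 17 columns).
hou_dat = pd.DataFrame([
    ["republican", "n", "y", "?"],
    ["democrat", "?", "y", "n"],
])

codes = {"republican": 2, "democrat": 3, "y": 1, "n": 0, "?": -1}

# Map every column through the dictionary; the result holds integers only.
hou_num = hou_dat.apply(lambda col: col.map(codes))

# Count how often each code occurs in each column; codes missing from a column count as 0.
hou_sta = hou_num.apply(lambda col: col.value_counts()).fillna(0).astype(int)

print(hou_num)
print(hou_sta)
```

In `hou_sta` the row labels are the integer codes and the column labels are the original column numbers.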

Getting list of column names and coeff in ols.param

I'm using OLS with respect to two dataframes:
gab = ols(formula= 'only_volume ~ all_but_volume', data=data_p ).fit()
where,
only_volume = data_p.iloc[:,0] # only the first column
all_but_volume = data_p.iloc[:, 1:data_p.shape[1]] # all but the first column
When I try to extract something, say parameter or pvals, I get something like this:
In [3]: gab.params
Out[3]:
Intercept 2.687598e+06
all_but_volume[0] 5.500544e+01
all_but_volume[1] 2.696902e+02
all_but_volume[2] 3.389568e+04
all_but_volume[3] -2.385838e+04
all_but_volume[4] 5.419860e+02
all_but_volume[5] 3.815161e+02
all_but_volume[6] -2.281344e+04
all_but_volume[7] 1.794128e+04
...
all_but_volume[22] 1.374321e+00
Since gab.params gives 23 values labelled all_but_volume[i] and all_but_volume has 23 columns, I was hoping there was a way to get a list/zip of the params with the column names, instead of the params with all_but_volume[i].
Like,
TMC 9.801195e+01
TAC 2.214464e+02
...
What I've tried:
removing all_but_volume and simply using data_p.iloc[:, 1:data_p.shape[1]]
Didn't work:
...
data_p.iloc[:, 1:data_p.shape[1]][21] 2.918531e+04
data_p.iloc[:, 1:data_p.shape[1]][22] 1.395342e+00
Edit:
Sample Data:
data_p.iloc[1:5,:]
Out[31]:
Volume A B C\
1 569886.171878 759.089217 272.446022 4.163908
2 561695.886128 701.165406 330.301260 4.136530
3 627221.486089 377.746089 656.838394 4.130720
4 625181.750625 361.489041 670.575110 4.134467
D E F G H I \
1 1.000842 12993.06 3371.28 236.90 4.92 6.13
2 0.981514 13005.44 3378.69 236.94 4.92 6.13
3 0.836920 13017.22 3384.47 236.98 4.93 6.13
4 0.810541 13028.56 3388.85 237.01 4.94 6.13
J K L M N \
1 ... 0 0 0 0 0
2 ... 0 0 0 0 0
3 ... 0 0 0 0 0
4 ... 0 0 0 0 0
O P Q R S
1 0 0 0 1 9202.171648
2 0 0 0 0 4381.373520
3 0 0 0 0 -13982.443554
4 0 0 0 0 -22878.843149
only_volume is the first column 'volume'
all_but_volume is all columns except 'volume'
You can use the DataFrame constructor or rename, because gab.params is a Series:
Sample:
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm
np.random.seed(2018)
data_p = pd.DataFrame(np.random.rand(10, 5), columns=['Volume','A','B','C','D'])
print (data_p)
Volume A B C D
0 0.882349 0.104328 0.907009 0.306399 0.446409
1 0.589985 0.837111 0.697801 0.802803 0.107215
2 0.757093 0.999671 0.725931 0.141448 0.356721
3 0.942704 0.610162 0.227577 0.668732 0.692905
4 0.416863 0.171810 0.976891 0.330224 0.629044
5 0.160611 0.089953 0.970822 0.816578 0.571366
6 0.345853 0.403744 0.137383 0.900934 0.933936
7 0.047377 0.671507 0.034832 0.252691 0.557125
8 0.525823 0.352968 0.092983 0.304509 0.862430
9 0.716937 0.964071 0.539702 0.950540 0.667982
only_volume = data_p.iloc[:,0] # only the first column
all_but_volume = data_p.iloc[:, 1:data_p.shape[1]] #All but first column
gab = sm.ols(formula= 'only_volume ~ all_but_volume', data=data_p ).fit()
print (gab.params)
Intercept 0.077570
all_but_volume[0] 0.395072
all_but_volume[1] 0.313150
all_but_volume[2] -0.100752
all_but_volume[3] 0.247532
dtype: float64
print (type(gab.params))
<class 'pandas.core.series.Series'>
df = pd.DataFrame({'cols':data_p.columns[1:], 'par': gab.params.values[1:]})
print (df)
cols par
0 A 0.395072
1 B 0.313150
2 C -0.100752
3 D 0.247532
If you want a Series instead (note that this relabels Intercept as Volume):
s = gab.params.rename(dict(zip(gab.params.index, data_p.columns)))
print (s)
Volume 0.077570
A 0.395072
B 0.313150
C -0.100752
D 0.247532
dtype: float64
Series without the first value (the intercept):
s = gab.params.iloc[1:].rename(dict(zip(gab.params.index, data_p.columns)))
print (s)
A 0.395072
B 0.313150
C -0.100752
D 0.247532
dtype: float64
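If the intercept should stay in the result under its own name, one option is to build the rename mapping from the `all_but_volume[i]` labels only. The sketch below uses a hand-built Series that mimics the `gab.params` output printed above, so it runs without fitting a model; the names `params`, `predictors` and `mapping` are hypothetical:

```python
import pandas as pd

# Stand-in for gab.params from the sample above (values copied from its printed output).
params = pd.Series(
    [0.077570, 0.395072, 0.313150, -0.100752, 0.247532],
    index=["Intercept"] + [f"all_but_volume[{i}]" for i in range(4)],
)

# Column names of the predictors, i.e. data_p.columns[1:] in the sample.
predictors = ["A", "B", "C", "D"]

# Map only the all_but_volume[i] labels; "Intercept" is not in the mapping and keeps its name.
mapping = {f"all_but_volume[{i}]": name for i, name in enumerate(predictors)}
named = params.rename(mapping)

print(named)
```

With a real fit, `params` would be `gab.params` and `predictors` would be `data_p.columns[1:]`.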
