I have a DataFrame with 100 columns, and I want to multiply the columns at positions 6 to 74 by the value of one column ('Count'), row by row. How can I do that? I have been trying:
df = df.ix[0, 6:74].multiply(df["Count"], axis="index")
df = df[df.columns[6:74]]*df["Count"]
Neither of them works.
The resulting DataFrame should still have all 100 original columns, with columns 6 to 74 holding the multiplied values in every row.
Assuming the same DataFrame provided by @MaxU.
Not easier, but a perspective on how to use other API elements:
pd.DataFrame.update and pd.DataFrame.mul
df.update(df.iloc[:, 3:7].mul(df.Count, 0))
df
0 1 2 3 4 5 6 7 8 9 Count
0 89 38 89 15.366436 1.355862 7.231264 4.971494 12 70 69 0.225977
1 49 1 38 1.004190 1.095480 2.829990 0.273870 57 93 64 0.030430
2 2 53 49 49.749460 50.379200 54.157640 16.373240 22 31 41 0.629740
3 38 44 23 28.437516 73.545300 41.185368 73.545300 19 99 57 0.980604
4 45 2 60 10.093230 4.773825 10.502415 6.274170 43 63 55 0.136395
5 65 97 15 10.375760 57.066680 38.260615 14.915155 68 5 21 0.648485
6 95 90 45 52.776000 16.888320 22.517760 50.664960 76 32 75 0.703680
7 60 31 65 63.242210 2.976104 26.784936 38.689352 72 73 94 0.744026
8 64 96 96 7.505370 37.526850 11.007876 10.007160 68 56 39 0.500358
9 78 54 74 8.409275 25.227825 16.528575 9.569175 97 63 37 0.289975
Demo:
Sample DF:
In [6]: df = pd.DataFrame(np.random.randint(100,size=(10,10))) \
.assign(Count=np.random.rand(10))
In [7]: df
Out[7]:
0 1 2 3 4 5 6 7 8 9 Count
0 89 38 89 68 6 32 22 12 70 69 0.225977
1 49 1 38 33 36 93 9 57 93 64 0.030430
2 2 53 49 79 80 86 26 22 31 41 0.629740
3 38 44 23 29 75 42 75 19 99 57 0.980604
4 45 2 60 74 35 77 46 43 63 55 0.136395
5 65 97 15 16 88 59 23 68 5 21 0.648485
6 95 90 45 75 24 32 72 76 32 75 0.703680
7 60 31 65 85 4 36 52 72 73 94 0.744026
8 64 96 96 15 75 22 20 68 56 39 0.500358
9 78 54 74 29 87 57 33 97 63 37 0.289975
Let's multiply columns 3-6 by df['Count']:
In [8]: df.iloc[:, 3:6+1]
Out[8]:
3 4 5 6
0 68 6 32 22
1 33 36 93 9
2 79 80 86 26
3 29 75 42 75
4 74 35 77 46
5 16 88 59 23
6 75 24 32 72
7 85 4 36 52
8 15 75 22 20
9 29 87 57 33
Pass axis=0 so that df['Count'] is aligned with the rows; a plain *= would align it with the column labels instead:
In [9]: df[df.columns[3:6+1]] = df.iloc[:, 3:6+1].mul(df['Count'], axis=0)
In [10]: df
Out[10]:
0 1 2 3 4 5 6 7 8 9 Count
0 89 38 89 15.366436 1.355862 7.231264 4.971494 12 70 69 0.225977
1 49 1 38 1.004190 1.095480 2.829990 0.273870 57 93 64 0.030430
2 2 53 49 49.749460 50.379200 54.157640 16.373240 22 31 41 0.629740
3 38 44 23 28.437516 73.545300 41.185368 73.545300 19 99 57 0.980604
4 45 2 60 10.093230 4.773825 10.502415 6.274170 43 63 55 0.136395
5 65 97 15 10.375760 57.066680 38.260615 14.915155 68 5 21 0.648485
6 95 90 45 52.776000 16.888320 22.517760 50.664960 76 32 75 0.703680
7 60 31 65 63.242210 2.976104 26.784936 38.689352 72 73 94 0.744026
8 64 96 96 7.505370 37.526850 11.007876 10.007160 68 56 39 0.500358
9 78 54 74 8.409275 25.227825 16.528575 9.569175 97 63 37 0.289975
The easiest thing to do here would be to extract the values, multiply, and then assign.
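u = df.iloc[:, 6:74].values        # all rows of the target columns, shape (n, 68)
v = df[['Count']].values           # shape (n, 1), broadcasts across the columns
df[df.columns[6:74]] = u * v       # assign back into the original frame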
By using combine_first
df.iloc[:, 3:6+1].mul(df['Count'],axis=0).combine_first(df)
You need to concatenate the data frame resulting from multiplication with the remaining columns:
df = pd.concat([df.iloc[:, :6], df.iloc[:, 6:74+1].multiply(df['Count'], axis=0), df.iloc[:, 75:]], axis=1)
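For anyone who wants to try this end to end, here is a minimal sketch on a made-up frame. The toy data, the column names 'a' to 'f', and the position range 2 to 4 (standing in for 6 to 74) are all assumptions for illustration, not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: six value columns plus 'Count'
df = pd.DataFrame(np.arange(18).reshape(3, 6), columns=list('abcdef'))
df['Count'] = [1.0, 0.5, 2.0]

start, stop = 2, 4                   # positions to scale, inclusive
cols = df.columns[start:stop + 1]    # labels found at those positions

# axis=0 aligns 'Count' with the row index, so each row is scaled by its own Count
df[cols] = df[cols].mul(df['Count'], axis=0)
print(df)
```

Assigning back through the column labels keeps every original column in place and in the same order; only the selected columns change.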
Related
I have a DataFrame and I need to create a new column which contains the second largest value of each row in the original Dataframe.
Sample:
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
Desired output:
0 1 2 3 4 5 6 7 8 9 penultimate
0 52 69 62 7 20 69 38 10 57 17 62
1 52 94 49 63 1 90 14 76 20 84 90
2 78 37 58 7 27 41 27 26 48 51 58
3 6 39 99 36 62 90 47 25 60 84 90
4 37 36 91 93 76 69 86 95 69 6 93
5 5 54 73 61 22 29 99 27 46 24 73
6 71 65 45 9 63 46 4 93 36 18 71
7 85 7 76 46 65 97 64 52 28 80 85
How can this be done in as little code as possible?
You could use NumPy for this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
df['penultimate'] = np.sort(df.values, 1)[:, -2]
print(df)
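Using NumPy is typically faster than a row-wise apply. Note that np.sort keeps duplicates, so a row whose maximum appears twice (such as row 0 of the desired output, with two 69s) yields that maximum again: 69 rather than 62.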
Here is a simple lambda function!
# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
# Output
out = df.apply(lambda x: x.sort_values().unique()[-2], axis=1)
df['penultimate'] = out
print(df)
Cheers!
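If tied maxima should be skipped, as in row 0 of the desired output (two 69s, expected 62), a vectorized variant is possible. This is only a sketch on made-up rows: it overwrites every copy of the row maximum with the row minimum and then takes the maximum of what is left. If all values in a row are equal, it simply returns that value.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the first has a tied maximum (69 appears twice)
df = pd.DataFrame([[52, 69, 62, 7, 69],
                   [5, 54, 73, 61, 22]])

vals = df.to_numpy()
row_max = vals.max(axis=1, keepdims=True)
row_min = vals.min(axis=1, keepdims=True)

# Overwrite every copy of the row maximum with the row minimum;
# the largest remaining value is the second largest distinct one
below_max = np.where(vals < row_max, vals, row_min)
df['penultimate'] = below_max.max(axis=1)
print(df)
```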
I have two data frames:
df1:
A B C D E F
0 63 9 56 23 41 0
1 40 35 69 98 47 45
2 51 95 55 36 10 34
3 25 11 67 83 49 89
4 91 10 43 73 96 95
5 2 47 8 30 46 9
6 37 10 33 8 45 20
7 40 88 6 29 46 79
8 75 87 49 76 0 69
9 92 21 86 91 46 41
df2:
A B C D E F
0 0 0 0 1 1 0
I want to delete columns in df1 based on the values in df2 (a lookup table): wherever df2 has a 1, the corresponding column in df1 should be deleted.
So my final output should look like this:
A B C F
0 63 9 56 0
1 40 35 69 45
2 51 95 55 34
3 25 11 67 89
4 91 10 43 95
5 2 47 8 9
6 37 10 33 20
7 40 88 6 79
8 75 87 49 69
9 92 21 86 41
Assuming len(df1.columns) == len(df2.columns):
df1.loc[:, ~df2.loc[0].astype(bool).values]
A B C F
0 63 9 56 0
1 40 35 69 45
2 51 95 55 34
3 25 11 67 89
4 91 10 43 95
5 2 47 8 9
6 37 10 33 20
7 40 88 6 79
8 75 87 49 69
9 92 21 86 41
If the columns aren't the same, but df2 has a subset of columns in df1, then
df1.reindex(df2.columns[~df2.loc[0].astype(bool)], axis=1)
Or with drop, similar to @student's method:
df1.drop(df2.columns[df2.loc[0].astype(bool)], axis=1)
A B C F
0 63 9 56 0
1 40 35 69 45
2 51 95 55 34
3 25 11 67 89
4 91 10 43 95
5 2 47 8 9
6 37 10 33 20
7 40 88 6 79
8 75 87 49 69
9 92 21 86 41
The column Index also supports intersection:
df1[df1.columns.intersection(df2.columns[~df2.iloc[0].astype(bool)])]
Out[354]:
A B C F
0 63 9 56 0
1 40 35 69 45
2 51 95 55 34
3 25 11 67 89
4 91 10 43 95
5 2 47 8 9
6 37 10 33 20
7 40 88 6 79
8 75 87 49 69
9 92 21 86 41
You can try with drop to drop the columns:
remove_col = df2.columns[(df2 == 1).any()] # get columns with any value 1
df1.drop(remove_col, axis=1, inplace=True) # drop the columns in original dataframe
Or, in one line as:
df1.drop(df2.columns[(df2 == 1).any()], axis=1, inplace=True)
The following may be an easier solution to understand:
df1.loc[:,df2.loc[0]!=1]
Output:
A B C F
0 63 9 56 0
1 40 35 69 45
2 51 95 55 34
3 25 11 67 89
4 91 10 43 95
5 2 47 8 9
6 37 10 33 20
7 40 88 6 79
8 75 87 49 69
9 92 21 86 41
loc can be used to select rows or columns with a boolean or conditional lookup: https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/
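To make the boolean-mask idea concrete, here is a small runnable sketch using the first two rows of df1 from the question. The names keep and result are introduced here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [63, 40], 'B': [9, 35], 'C': [56, 69],
                    'D': [23, 98], 'E': [41, 47], 'F': [0, 45]})
df2 = pd.DataFrame([[0, 0, 0, 1, 1, 0]], columns=list('ABCDEF'))

# Boolean Series indexed by column name: True where the column should stay
keep = df2.iloc[0] != 1

# .loc matches the mask to df1's columns by label
result = df1.loc[:, keep]
print(result)
```

Unlike drop with inplace=True, this leaves df1 untouched and returns the filtered frame.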
Below I am using pandas to read my csv file in the following format:
dataframe = pandas.read_csv("test.csv", header=None, usecols=range(2,62), skiprows=1)
dataset = dataframe.values
How can I delete the first value in the very last column in the dataframe and then delete the last row in the dataframe?
Any ideas?
You can shift the last column up to get rid of its first value, then drop the last row.
df.assign(E=df.E.shift(-1)).drop(df.index[-1])
MCVE:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, (10, 5)), columns=list('ABCDE'))
Output (random, so your values will differ):
A B C D E
0 91 83 40 17 94
1 61 5 43 87 48
2 3 69 73 15 85
3 99 53 18 95 45
4 67 30 69 91 28
5 25 89 14 39 64
6 54 99 49 44 73
7 70 41 96 51 68
8 36 3 15 94 61
9 51 4 31 39 0
df.assign(E=df.E.shift(-1)).drop(df.index[-1]).astype(int)
Output:
A B C D E
0 91 83 40 17 48
1 61 5 43 87 85
2 3 69 73 15 45
3 99 53 18 95 28
4 67 30 69 91 64
5 25 89 14 39 73
6 54 99 49 44 68
7 70 41 96 51 61
8 36 3 15 94 0
or in two steps:
df[df.columns[-1]] = df[df.columns[-1]].shift(-1)
df = df[:-1]
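Applied to a frame read with header=None, where the columns have integer labels rather than names, the same two steps might look like the sketch below. The small frame is a hypothetical stand-in for the contents of test.csv.

```python
import pandas as pd

# Hypothetical stand-in for the frame read from test.csv (integer column labels)
dataframe = pd.DataFrame({2: [1, 2, 3, 4], 3: [10, 20, 30, 40]})

last = dataframe.columns[-1]                  # label of the last column
dataframe[last] = dataframe[last].shift(-1)   # first value gone, NaN at the end
dataframe = dataframe.iloc[:-1]               # drop the last row, which holds the NaN
dataframe = dataframe.astype({last: int})     # shift made the column float; cast back
dataset = dataframe.values
print(dataset)
```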
There are two pandas DataFrames, say dfx and dfy, of the same shape and with exactly the same column and row indices. I want to apply a function to the corresponding rows of these two DataFrames.
In other words, suppose we have a function as follows:
def fun(row_x, row_y):
    ...  # a function of the corresponding rows
Let index be the common index of dfx and dfy. I want to compute in pandas the following list/Series:
[fun(dfx[i], dfy[i]) for i in index] (pseudo-code)
With the following code, I build a grouped, two-level indexed DataFrame, but I do not know how to apply agg properly.
dfxy = pd.concat({'dfx':dfx, 'dfy':dfy})
dfxy = dfxy.swaplevel(0,1,axis=0).sort_index(level=0)
grouped=dfxy.groupby(level=0)
In [19]:
dfx = pd.DataFrame(data = np.random.randint(0 , 100 , 50).reshape(10 ,-1) , columns=list('abcde'))
dfx
Out[19]:
a b c d e
3 44 8 55 95
26 5 18 34 10
20 20 91 15 8
83 7 50 47 27
97 65 10 94 93
44 6 70 60 4
38 64 8 67 92
44 21 42 6 12
30 98 34 7 79
76 7 14 58 5
In [4]:
dfy = pd.DataFrame(data = np.random.randint(0 , 100 , 50).reshape(10 ,-1) , columns=list('fghij'))
dfy
Out[4]:
f g h i j
82 48 29 54 78
7 31 78 38 30
90 91 43 8 40
52 88 13 87 39
41 88 90 51 91
55 4 94 62 98
31 23 4 59 93
87 12 33 77 0
25 99 39 23 1
7 50 46 39 66
In [13]:
dfxy = pd.concat({'dfx':dfx, 'dfy':dfy} , axis = 1)
dfxy
Out[13]:
dfx dfy
a b c d e f g h i j
20 76 5 98 38 82 48 29 54 78
39 36 9 3 74 7 31 78 38 30
43 12 50 72 14 90 91 43 8 40
89 41 95 91 86 52 88 13 87 39
33 30 55 64 94 41 88 90 51 91
89 84 48 1 60 55 4 94 62 98
68 40 27 10 63 31 23 4 59 93
33 10 86 89 67 87 12 33 77 0
56 89 0 70 67 25 99 39 23 1
48 58 98 18 24 7 50 46 39 66
def f(x, y):
    return pd.Series(data=[np.mean(x), np.mean(y)], index=['x_mean', 'y_mean'])
In [17]:
dfxy.apply( lambda x : f(x['dfx'] , x['dfy']) , axis = 1)
Out[17]:
x_mean y_mean
0 47.4 58.2
1 32.2 36.8
2 38.2 54.4
3 80.4 55.8
4 55.2 72.2
5 56.4 62.6
6 41.6 42.0
7 57.0 41.8
8 56.4 37.4
9 49.2 41.6
Could this be what you are looking for?
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: dfx = pd.DataFrame(data=np.random.randint(0,100,50).reshape(10,-1),
columns=['index', 'a', 'b', 'c', 'd'])
In [4]: dfy = pd.DataFrame(data=np.random.randint(0,100,50).reshape(10,-1),
columns=['index', 'a', 'b', 'c', 'd'])
In [5]: dfy['index'] = dfx['index']
In [6]: print(dfx)
index a b c d
0 25 41 46 18 98
1 0 21 9 20 29
2 18 78 63 94 70
3 86 71 71 95 64
4 23 33 19 34 29
5 69 10 91 19 42
6 92 68 60 12 58
7 74 49 22 74 1
8 47 35 56 41 80
9 93 20 44 16 49
In [7]: print(dfy)
index a b c d
0 25 28 35 96 89
1 0 44 94 50 43
2 18 18 39 75 45
3 86 18 87 72 88
4 23 2 28 24 4
5 69 53 55 55 40
6 92 0 52 54 91
7 74 8 1 96 59
8 47 74 21 7 7
9 93 42 83 42 60
In [8]: print(dfx.merge(dfy, on='index'))
index a_x b_x c_x d_x a_y b_y c_y d_y
0 25 41 46 18 98 28 35 96 89
1 0 21 9 20 29 44 94 50 43
2 18 78 63 94 70 18 39 75 45
3 86 71 71 95 64 18 87 72 88
4 23 33 19 34 29 2 28 24 4
5 69 10 91 19 42 53 55 55 40
6 92 68 60 12 58 0 52 54 91
7 74 49 22 74 1 8 1 96 59
8 47 35 56 41 80 74 21 7 7
9 93 20 44 16 49 42 83 42 60
In [9]: def my_function(x):
   ...:     return sum(x)
   ...:
In [10]: print(dfx.merge(dfy, on='index').drop('index', axis=1).apply(my_function, axis=1))
0 451
1 310
2 482
3 566
4 173
5 365
6 395
7 310
8 321
9 356
dtype: int64
In [11]: print(pd.DataFrame(
{
'my_function':
dfx.merge(dfy, on='index').\
drop('index', axis=1).apply(my_function, axis=1),
'index':
dfx['index']
}))
index my_function
0 25 451
1 0 310
2 18 482
3 86 566
4 23 173
5 69 365
6 92 395
7 74 310
8 47 321
9 93 356
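A more literal take on the pseudo-code in the question is to walk both frames in step. This is a sketch that assumes both frames share the same index in the same order; the toy data and the body of fun (a dot product) are made up for illustration.

```python
import pandas as pd

dfx = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['r0', 'r1'])
dfy = pd.DataFrame({'a': [10, 20], 'b': [30, 40]}, index=['r0', 'r1'])

def fun(row_x, row_y):
    # Hypothetical example function: dot product of the two rows
    return (row_x * row_y).sum()

# iterrows yields (index, row Series) pairs; zip pairs up the corresponding rows
result = pd.Series(
    [fun(rx, ry) for (_, rx), (_, ry) in zip(dfx.iterrows(), dfy.iterrows())],
    index=dfx.index,
)
print(result)
```

This is not vectorized, so it will be slow on large frames, but it works with any fun that takes two row Series.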
The documentation suggests:
You can also specify the axis argument to .loc to interpret the passed
slicers on a single axis.
However I get an error trying to slice along the column index.
import pandas as pd
import numpy as np
cols= [(yr,m) for yr in [2014,2015] for m in [7,8,9,10]]
df = pd.DataFrame(np.random.randint(1,100,(10,8)),index=tuple('ABCDEFGHIJ'))
df.columns =pd.MultiIndex.from_tuples(cols)
print(df.head())
2014 2015
7 8 9 10 7 8 9 10
A 68 51 6 48 24 3 4 85
B 79 75 68 62 19 40 63 45
C 60 15 32 32 37 95 56 38
D 4 54 81 50 13 64 65 13
E 78 21 84 1 83 18 39 57
#This does not work as expected
print(df.loc(axis=1)[(2014,9):(2015,8)])
AssertionError: Start slice bound is non-scalar
#but an arbitrary transpose and changing axis works!
df = df.T
print(df.loc(axis=0)[(2014,9):(2015,8)])
A B C D E F G H I J
2014 9 6 68 32 81 84 60 83 39 94 93
10 48 62 32 50 1 84 18 14 92 33
2015 7 24 19 37 13 83 69 31 91 69 90
8 3 40 95 64 18 8 32 93 16 25
So I could always assign the slice and transpose back.
That feels like a hack, though; the axis=1 setting should have worked.
df = df.loc(axis=0)[(2014,9):(2015,8)]
df = df.T
print(df)
2014 2015
9 10 7 8
A 64 98 99 87
B 43 36 22 84
C 32 78 86 66
D 67 8 34 73
E 83 54 96 33
F 18 83 36 71
G 13 25 76 8
H 69 4 99 84
I 3 52 50 62
J 67 60 9 49
That might be a bug; please post an issue on GitHub. The canonical way to select things is to fully specify all the axes.
In [6]: df.loc[:,(2014,9):(2015,8)]
Out[6]:
2014 2015
9 10 7 8
A 26 2 44 69
B 41 7 5 1
C 8 27 23 22
D 54 72 81 93
E 18 23 54 7
F 11 81 37 83
G 60 38 59 29
H 3 95 89 96
I 6 9 77 9
J 90 92 10 32
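A self-contained version of the fully specified form, written for Python 3, might look like this. The deterministic toy values and the name subset are assumptions for illustration; the column MultiIndex is built in sorted order, which tuple slicing relies on.

```python
import numpy as np
import pandas as pd

# Same column layout as in the question, built already sorted
cols = pd.MultiIndex.from_product([[2014, 2015], [7, 8, 9, 10]])
df = pd.DataFrame(np.arange(16).reshape(2, 8), index=list('AB'), columns=cols)

# Slice the column MultiIndex from (2014, 9) through (2015, 8), both ends inclusive
subset = df.loc[:, (2014, 9):(2015, 8)]
print(subset)
```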