Assume this is my df. I am looking for a way to add columns together pairwise.
For example, except for column a, every column should be added to its right-hand neighbour:
colb + colc, cold + cole, colf + colg, colh + coli
           a    b    c     d     e    f    g     h     i
group
A       0.15  0.1  0.1  0.15  0.15  0.1  0.1  0.10  0.05
B       0.13  NaN  NaN   NaN  0.40  0.2  NaN  0.13  0.06
desired output:
           a    b     d    f     h
group
A       0.15  0.2  0.30  0.2  0.15
B       0.13  NaN  0.40  0.2  0.19
I know I can add the columns manually, but I'm looking for an easier way, or an apply function, to achieve this output.
I couldn't figure one out. Any help?
Use DataFrame.add with a shifted copy of the DataFrame whose first column has been excluded (selected via iloc, or removed by drop), then filter by the list of wanted column names:
cols = ['a', 'b', 'd', 'f', 'h']
df = df.add(df.iloc[:, 1:].shift(-1, axis=1), fill_value=0)[cols]
Alternative:
df = df.add(df.drop('a', axis=1).shift(-1, axis=1), fill_value=0)[cols]
print(df)
           a    b    d    f     h
group
A       0.15  0.2  0.3  0.2  0.15
B       0.13  NaN  0.4  0.2  0.19
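To see why this works, here is a minimal, self-contained sketch (reconstructing the example frame above) that prints the intermediate shifted frame:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'a': [0.15, 0.13], 'b': [0.1, np.nan], 'c': [0.1, np.nan],
     'd': [0.15, np.nan], 'e': [0.15, 0.40], 'f': [0.1, 0.2],
     'g': [0.1, np.nan], 'h': [0.10, 0.13], 'i': [0.05, 0.06]},
    index=pd.Index(['A', 'B'], name='group'))

# After excluding column a, shift(-1, axis=1) moves every value one column
# to the left, so column b now holds c's values, d holds e's values, etc.
shifted = df.iloc[:, 1:].shift(-1, axis=1)
print(shifted)

# Adding the shifted frame back puts b+c in column b, d+e in column d, and
# so on; fill_value=0 treats a NaN on only one side of the addition as 0.
print(df.add(shifted, fill_value=0)[['a', 'b', 'd', 'f', 'h']])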
I have a data frame that looks like this:
import pandas as pd

data = [['A', 0.20], ['B', 0.25], ['C', 0.11], ['D', 0.30], ['E', 0.29]]
df = pd.DataFrame(data, columns=['col1', 'col2'])
col1 is a primary key (each row has a unique value).
The max of col2 is 1 and the min is 0. I want to find the number of datapoints in the ranges 0-0.30 (with both 0 and 0.30 included), 0-0.29, 0-0.28, and so on down to 0-0.01. I could use pd.cut, but its lower bin edge is not fixed, whereas my lower limit is always 0.
Can someone help?
One option using numpy broadcasting:
import numpy as np

step = 0.01
up = np.arange(0, 0.3 + step, step)  # upper bounds 0.00, 0.01, ..., 0.30
# compare every value against every upper bound, then count, per bound,
# how many values fall at or below it
out = pd.Series((df['col2'].to_numpy()[:, None] <= up).sum(axis=0), index=up)
Output:
0.00 0
0.01 0
0.02 0
0.03 0
0.04 0
0.05 0
0.06 0
0.07 0
0.08 0
0.09 0
0.10 0
0.11 1
0.12 1
0.13 1
0.14 1
0.15 1
0.16 1
0.17 1
0.18 1
0.19 1
0.20 2
0.21 2
0.22 2
0.23 2
0.24 2
0.25 3
0.26 3
0.27 3
0.28 3
0.29 4
0.30 5
dtype: int64
With pandas.cut and cumsum (passing include_lowest=True so that values exactly equal to 0 fall into the first bin):
step = 0.01
up = np.arange(0, 0.3 + step, step)
(pd.cut(df['col2'], up, labels=up[1:].round(2), include_lowest=True)
   .value_counts(sort=False)
   .cumsum()
)
Output:
0.01 0
0.02 0
0.03 0
0.04 0
0.05 0
0.06 0
0.07 0
0.08 0
0.09 0
0.1 0
0.11 1
0.12 1
0.13 1
0.14 1
0.15 1
0.16 1
0.17 1
0.18 1
0.19 1
0.2 2
0.21 2
0.22 2
0.23 2
0.24 2
0.25 3
0.26 3
0.27 3
0.28 3
0.29 4
0.3 5
Name: col2, dtype: int64
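For larger inputs, here is a sketch of an equivalent approach (assuming the same df, step and up as above) that avoids building the full comparison matrix: sort the values once and binary-search each upper bound with np.searchsorted.

# side='right' makes the count include values exactly equal to each bound
vals = np.sort(df['col2'].to_numpy())
out = pd.Series(np.searchsorted(vals, up, side='right'), index=up)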
I have the following DataFrame:
import pandas as pd

df = pd.DataFrame({'a': [0.28, 0, 0.25, 0.85, 0.1],
                   'b': [0.5, 0.5, 0, 0.75, 0.1],
                   'c': [0.33, 0.7, 0.25, 0.2, 0.5],
                   'd': [0, 0.25, 0.2, 0.66, 0.1]})
Output:
      a     b     c     d
0  0.28  0.50  0.33  0.00
1  0.00  0.50  0.70  0.25
2  0.25  0.00  0.25  0.20
3  0.85  0.75  0.20  0.66
4  0.10  0.10  0.50  0.10
For each column, I want to sum its values in the rows where that column holds the row max.
For example, column b holds the row max only in row 0, so its sum is just that single value, 0.50 -- but column c holds the row max in three rows (1, 2, and 4; in row 2 it ties with column a), so its values 0.70, 0.25 and 0.50 are summed.
Expected output:
           a     b     c     d
0       0.28  0.50  0.33  0.00
1       0.00  0.50  0.70  0.25
2       0.25  0.00  0.25  0.20
3       0.85  0.75  0.20  0.66
4       0.10  0.10  0.50  0.10
count   1.10  0.50  1.45  0.00
You can use where:
df.append(
    df.where(                # keep only the values that are the max of their row
        df.eq(               # compare the row max to every value in the row,
                             # in case the max appears more than once
            df.max(axis=1),  # the row-wise max values
            axis=0
        )
    ).sum().rename('count')  # column-wise sum of the surviving values
)
           a     b     c     d
0       0.28  0.50  0.33  0.00
1       0.00  0.50  0.70  0.25
2       0.25  0.00  0.25  0.20
3       0.85  0.75  0.20  0.66
4       0.10  0.10  0.50  0.10
count   1.10  0.50  1.45  0.00
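Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On current versions the same result can be produced with pd.concat; a minimal sketch:

counts = df.where(df.eq(df.max(axis=1), axis=0)).sum().rename('count')
out = pd.concat([df, counts.to_frame().T])
print(out)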
The fastest way to get the row-wise maximum itself is the .max() method with the axis argument:
df.max(axis=1)
If you want it as another column:
df['column_name'] = df.max(axis=1)
I didn't read the question that well!
I have a pandas dataframe called df_initial with two columns 'a' and 'b' and N rows.
I would like to halve the number of rows by deleting, from each consecutive pair, the row whose value of 'b' is lower.
Thus between rows 0 and 1 I keep row 1, between rows 2 and 3 I keep row 3, and so on.
This is the result that I would like to obtain:
print(df_initial)
       a     b
0   0.04  0.01
1   0.05  0.22
2   0.06  0.34
3   0.07  0.49
4   0.08  0.71
5   0.09  0.09
6   0.10  0.98
7   0.11  0.42
8   0.12  1.32
9   0.13  0.39
10  0.14  0.97
11  0.15  0.05
12  0.16  0.36
13  0.17  1.72
....
print(df_reduced)
      a     b
0  0.05  0.22
1  0.07  0.49
2  0.08  0.71
3  0.10  0.98
4  0.12  1.32
5  0.14  0.97
6  0.17  1.72
....
Is there a Pandas function to do this?
I saw that there is a resample function, DataFrame.resample(), but it only works with a DatetimeIndex, TimedeltaIndex or PeriodIndex, so it does not apply here.
Thanks to anyone who can help.
You can groupby every two rows (a simple way of doing so is taking the floor division of the index) and take the idxmax of column b to index the dataframe:
df.loc[df.groupby(df.index//2).b.idxmax(), :]
       a     b
1   0.05  0.22
3   0.07  0.49
4   0.08  0.71
6   0.10  0.98
8   0.12  1.32
10  0.14  0.97
13  0.17  1.72
The original index labels are kept; append .reset_index(drop=True) if you want them renumbered 0 to 6.
Or using DataFrame.rolling to find, for each pair, the index of the row with the larger b:
idx = df.b.rolling(2).apply(lambda s: s.idxmax(), raw=False)[1::2].astype(int)
df.loc[idx]
This is a simple worked example of a loop-based filter (it drops every row where b is lower than a); you can adapt the condition to your own data.
import numpy as np
import pandas as pd

ar = np.array([[1.1, 1.0], [3.3, 0.2], [2.7, 10], [5.4, 7], [5.3, 9], [1.5, 15]])
df = pd.DataFrame(ar, columns=['a', 'b'])

for i in range(len(df)):
    if df['b'][i] < df['a'][i]:
        df = df.drop(index=i)
print(df)
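For what it's worth, the same filter can be written without an explicit loop as a boolean mask, which is the more idiomatic (and faster) pandas style:

df = df[df['b'] >= df['a']]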
I am running Python 3.6 and Pandas 0.19.2 in PyCharm Community Edition 2016.3.2 and am trying to ensure that a set of columns in each row of my dataframe adds up to 1.
Initially my dataframe looks as follows:
hello  world  label0  label1  label2
abc    def      1.00    0.00    0.00
why    not      0.33    0.34    0.33
hello  you      0.33    0.38    0.15
I proceed as follows:
# get the list of label columns (all column headers that contain the string 'label')
label_list = df.filter(like='label').columns

# ensure every row adds up to 1
if (df[label_list].sum(axis=1) != 1).any():
    print('ERROR')
Unfortunately this code does not work for me. What seems to be happening is that instead of summing my rows, I just get the value of the first column in my filtered data. In other words, df[label_list].sum(axis=1) returns:
0    1.00
1    0.33
2    0.33
This should be trivial, but I just can't figure out what I'm doing wrong. Thanks in advance for the help!
UPDATE:
This is an excerpt from my original data after I have filtered for label columns:
   label0  label1  label2  label3  label4  label5  label6  label7  label8
1    0.34     0.1     0.1     0.1     0.2     0.4     0.1     0.1     1.2
2    0.34     0.1     0.1     0.1     0.2     0.4     0.1     0.1     1.2
3    0.34     0.1     0.1     0.1     0.2     0.4     0.1     0.1     1.2
4    0.34     0.1     0.1     0.1     0.2     0.4     0.1     0.1     1.2
5    0.34     0.1     0.1     0.1     0.2     0.4     0.1     0.1     1.2
6    0.34     0.1     0.1     0.1     0.2     0.4     0.1     0.1     1.2
7    0.34     0.1     0.1     0.1     0.2     0.4     0.1     0.1     1.2
8    0.34     0.1     0.1     0.1     0.2     0.4     0.1     0.1     1.2
9    0.34     0.1     0.1     0.1     0.2     0.4     0.1     0.1     1.2
My code from above still does not work, and I still have absolutely no idea why. When I run my code in python console everything works perfectly fine, but when I run my code in Pycharm 2016.3.2, label_data.sum(axis=1) just returns the values of the first column.
With your sample data it works for me. Try to reproduce the issue by adding a new check column to verify the sum:
In [3]: df
Out[3]:
   hello  world  label0  label1  label2
0    abc    def    1.00    0.00    0.00
1    why    not    0.33    0.34    0.33
2  hello    you    0.33    0.38    0.15

In [4]: df['check'] = df.sum(axis=1)

In [5]: df
Out[5]:
   hello  world  label0  label1  label2  check
0    abc    def    1.00    0.00    0.00   1.00
1    why    not    0.33    0.34    0.33   1.00
2  hello    you    0.33    0.38    0.15   0.86

In [6]: label_list = df.filter(like='label').columns

In [7]: label_list
Out[7]: Index([u'label0', u'label1', u'label2'], dtype='object')

In [8]: df[label_list].sum(axis=1)
Out[8]:
0    1.00
1    1.00
2    0.86
dtype: float64

In [9]: if (df[label_list].sum(axis=1) != 1).any():
   ...:     print('ERROR')
   ...:
ERROR
Turns out my data type was not consistent. I used astype(float) and things worked out.
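A minimal sketch of that fix (assuming label_list from above): cast the label columns to a consistent numeric dtype before summing.

# force a float dtype; if stray non-numeric strings remain, pd.to_numeric
# with errors='coerce' is the safer alternative
df[label_list] = df[label_list].astype(float)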
I have a large dataframe, ~1 million rows and 9 columns, with some rows missing data in a few of the columns.
dat = pd.read_table('file path', delimiter=';')
I       z  Sp      S     B   B/T     r    gf     k
0  0.0303   2  0.606  0.31  0.04  0.23  0.03  0.38
1  0.0779   2         0.00  0.00  0.05  0.01  0.00
The first few columns are being read in as strings, and the last few as NaN, even when there is a numeric value there. When I pass dtype='float64' I get:
ValueError: could not convert string to float:
Any help in fixing this?
You can use replace with a regex that maps one or more whitespace characters to NaN, then cast to float (truly empty fields are already converted to NaN by read_table; whitespace-only fields are not):
import numpy as np

df = df.replace({r'\s+': np.nan}, regex=True).astype(float)
print(df)
     I       z   Sp      S     B   B/T     r    gf     k
0  0.0  0.0303  2.0  0.606  0.31  0.04  0.23  0.03  0.38
1  1.0  0.0779  2.0    NaN  0.00  0.00  0.05  0.01  0.00
If the data contains other strings that need to be converted to NaN, it is possible to use to_numeric with apply:
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
print(df)
   I       z  Sp      S     B   B/T     r    gf     k
0  0  0.0303   2  0.606  0.31  0.04  0.23  0.03  0.38
1  1  0.0779   2    NaN  0.00  0.00  0.05  0.01  0.00
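Equivalently, apply passes each column as a Series and forwards keyword arguments, so the lambda can be dropped:

df = df.apply(pd.to_numeric, errors='coerce')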