Pandas Pivot table nearest neighbor - python

SOLUTION
df = pd.read_csv('data.txt')
# Within each (a, b, d) group the rows run in increasing c, so shifting z down/up
# yields the value at the previous/next c; the (a, b, c) groups do the same for d.
df['z[c-1]'] = df.groupby(['a', 'b', 'd'])['z'].shift(1)
df['z[c+1]'] = df.groupby(['a', 'b', 'd'])['z'].shift(-1)
df['z[d-1]'] = df.groupby(['a', 'b', 'c'])['z'].shift(1)
df['z[d+1]'] = df.groupby(['a', 'b', 'c'])['z'].shift(-1)
QUESTION
I have a CSV which is sorted by a few indexes. There is one index in particular I am interested in, and I want to keep the table the same. All I want to do is add extra columns which are a function of the table. So, let's say "v" is the column of interest. I want to take the "z" column and add more "z" columns from other places in the table where "c" is "c+1" and "c-1" and "d" is "d+1" and "d-1", and just join those on the end. In the end I want the same number of rows, but with the "z" column expanded to columns "Z.C-1.D", "Z.C.D", "Z.C+1.D", "Z.C.D-1", "Z.C.D+1", if that makes any sense. I'm having difficulties: I've tried the pivot_table method, and it brought me somewhere while also adding confusion.
If this helps: think of it like a point in a matrix, where I have an independent variable and a dependent variable. I want to extract the neighboring values for every location where I have an observation.
Here is my example csv:
a b c d v z
10 1 15 42 0.90 5460
10 2 15 42 0.97 6500
10 1 16 42 1.04 7540
10 2 16 42 1.11 8580
10 1 15 43 1.18 9620
10 2 15 43 0.98 10660
10 1 16 43 1.32 3452
10 2 16 43 1.39 4561
11 1 15 42 0.54 5670
11 2 15 42 1.53 6779
11 1 16 42 1.60 7888
11 2 16 42 1.67 8997
11 1 15 43 1.74 10106
11 2 15 43 1.81 11215
11 1 16 43 1.88 12324
11 2 16 43 1.95 13433
And my desired output:
a b c d v z z[c-1] z[c+1] z[d-1] z[d+1]
10 1 15 42 0.90 5460 Nan 7540 Nan 9620
10 2 15 42 0.97 6500 Nan 8580 Nan 10660
10 1 16 42 1.04 7540 5460 Nan Nan 3452
10 2 16 42 1.11 8580 6500 Nan Nan 4561
10 1 15 43 1.18 9620 Nan 3452 5460 Nan
10 2 15 43 0.98 10660 Nan 4561 6500 Nan
10 1 16 43 1.32 3452 9620 Nan 7540 Nan
10 2 16 43 1.39 4561 10660 Nan 8580 Nan
11 1 15 42 0.54 5670 Nan 7888 Nan 10106
11 2 15 42 1.53 6779 Nan 8997 Nan 11215
11 1 16 42 1.60 7888 5670 Nan Nan 12324
11 2 16 42 1.67 8997 6779 Nan Nan 13433
11 1 15 43 1.74 10106 Nan 12324 5670 Nan
11 2 15 43 1.81 11215 Nan 13433 6779 Nan
11 1 16 43 1.88 12324 10106 Nan 7888 Nan
11 2 16 43 1.95 13433 11215 Nan 8997 Nan

Don't know if I understood you, but you can use the shift() method to add shifted columns, like:
df['z-1'] = df.groupby('a')['z'].transform(lambda x:x.shift(-1))
Update
If you want selection by values, you can use apply():
def lkp_data(c, d, v):
    res = df[(df['c'] == c) & (df['d'] == d) & (df['v'] == v)]['z']
    return None if len(res) == 0 else res.values[0]
df['z[c-1]'] = df.apply(lambda x: lkp_data(x['c'] - 1, x['d'], x['v']), axis=1)
df['z[c+1]'] = df.apply(lambda x: lkp_data(x['c'] + 1, x['d'], x['v']), axis=1)
df['z[d-1]'] = df.apply(lambda x: lkp_data(x['c'], x['d'] - 1, x['v']), axis=1)
df['z[d+1]'] = df.apply(lambda x: lkp_data(x['c'], x['d'] + 1, x['v']), axis=1)
c d z v z[c-1] z[c+1] z[d-1] z[d+1]
0 15 42 5460 1 NaN 7540 NaN 9620
1 15 42 6500 2 NaN 8580 NaN 10660
2 16 42 7540 1 5460 NaN NaN 3452
3 16 42 8580 2 6500 NaN NaN 4561
4 15 43 9620 1 NaN 3452 5460 NaN
5 15 43 10660 2 NaN 4561 6500 NaN
6 16 43 3452 1 9620 NaN 7540 NaN
7 16 43 4561 2 10660 NaN 8580 NaN
But I think this one would be really inefficient.
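For completeness, a vectorized sketch of the same exact-neighbour lookup (not from the original answers; it assumes the same data file as in the solution above, that "neighbour" means the exact values c±1 and d±1, and that (a, b, c, d) uniquely identifies a row):
import pandas as pd

df = pd.read_csv('data.txt')  # same hypothetical file as in the solution above

# One Series holding z, keyed by the full position (a, b, c, d).
base = df.set_index(['a', 'b', 'c', 'd'])['z']

# Look up z at the four neighbouring positions for every row; positions that
# do not exist in the data come back as NaN via reindex.
offsets = {'z[c-1]': (-1, 0), 'z[c+1]': (1, 0), 'z[d-1]': (0, -1), 'z[d+1]': (0, 1)}
for col, (dc, dd) in offsets.items():
    keys = pd.MultiIndex.from_arrays([df['a'], df['b'], df['c'] + dc, df['d'] + dd])
    df[col] = base.reindex(keys).to_numpy()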

Calculate the mean values of individual rows, based the value of other columns and subtract from other rows

I have the following dataframe:
Book_No Replicate Sample Smell Taste Odour Volatility Notes
0 12, 43 1 control 0.3 10.0 71 1 NaN
1 12, 43 2 control 0.4 8.0 63 3 NaN
2 12, 43 3 control 0.1 3.0 22 2 NaN
3 19, 21 1 control 1.1 2.0 80 3 NaN
4 19, 21 2 control 0.4 8.0 0 4 NaN
5 19, 21 3 control 0.9 3.0 4 6 NaN
6 19, 21 4 control 2.1 6.0 50 4 NaN
7 11, 22 1 control 3.4 3.0 23 3 NaN
8 12, 43 1 Sample A 1.1 11.2 75 7 NaN
9 12, 43 2 Sample A 1.4 3.3 87 6 Temperature was too hot
10 12, 43 3 Sample A 0.7 7.4 91 5 NaN
11 19, 21 1 Sample B 2.1 3.2 99 7 NaN
12 19, 21 2 Sample B 2.2 11.3 76 8 NaN
13 19, 21 3 Sample B 1.9 9.3 89 9 sample spilt by user
14 19, 21 1 Sample C 3.2 4.0 112 10 NaN
15 19, 21 2 Sample C 2.1 5.0 96 15 NaN
16 19, 21 3 Sample C 2.7 7.0 105 13 Was too cold
17 11, 22 1 Sample C 2.4 3.0 121 19 NaN
I'd like to do two separate things. Firstly, I'd like to calculate the mean values of the 'Smell', 'Volatility', 'Taste' and 'Odour' columns for the control rows that share the same 'Book_No'. Then, subtract those mean values from the individual Sample A, Sample B and Sample C rows whose 'Book_No' matches that of the control. The resulting dataframe should look something like this:
Book_No Replicate Sample Smell Taste Odour Volatility Notes
0 12, 43 1 control 0.300000 10.00 71.0 1.00 NaN
1 12, 43 2 control 0.400000 8.00 63.0 3.00 NaN
2 12, 43 3 control 0.100000 3.00 22.0 2.00 NaN
3 19, 21 1 control 1.100000 2.00 80.0 3.00 NaN
4 19, 21 2 control 0.400000 8.00 0.0 4.00 NaN
5 19, 21 3 control 0.900000 3.00 4.0 6.00 NaN
6 19, 21 4 control 2.100000 6.00 50.0 4.00 NaN
7 11, 22 1 control 3.400000 3.00 23.0 3.00 NaN
8 12, 43 1 Sample A 0.833333 4.20 23.0 5.00 NaN
9 12, 43 2 Sample A 1.133333 -3.70 35.0 4.00 Temperature was too hot
10 12, 43 3 Sample A 0.433333 0.40 39.0 3.00 NaN
11 19, 21 1 Sample B 0.975000 -1.55 65.5 2.75 NaN
12 19, 21 2 Sample B 1.075000 6.55 42.5 3.75 NaN
13 19, 21 3 Sample B 0.775000 4.55 55.5 4.75 sample spilt by user
14 19, 21 1 Sample C -0.200000 1.00 89.0 7.00 NaN
15 19, 21 2 Sample C -1.300000 2.00 73.0 12.00 NaN
16 19, 21 3 Sample C -0.700000 4.00 82.0 10.00 Was too cold
17 11, 22 1 Sample C -1.000000 0.00 98.0 16.00 NaN
I've tried the following code, but neither attempt seems to give me what I need; plus, I'd need to copy and paste the code and change the column name for each column I'd like to apply it to:
df['Smell'] = df['Smell'] - df.groupby(['Book_No', 'Sample'])['Smell'].transform('mean')
and I've tried to apply a mask:
mask = df['Book_No'].unique()
df.loc[~mask, 'Smell'] = (df['Smell'] - df['Smell'].where(mask).groupby([df['Book_No'],df['Sample']]).transform('mean'))
Then, separately, I'd like to subtract the control values from the sample values, when the Book_No and replicate values match. The resulting dataframe should look something like this:
Book_No Replicate Sample Smell Taste Odour Volatility Unnamed: 7
0 12, 43 1 control 0.3 10.0 71 1 NaN
1 12, 43 2 control 0.4 8.0 63 3 NaN
2 12, 43 3 control 0.1 3.0 22 2 NaN
3 19, 21 1 control 1.1 2.0 80 3 NaN
4 19, 21 2 control 0.4 8.0 0 4 NaN
5 19, 21 3 control 0.9 3.0 4 6 NaN
6 19, 21 4 control 2.1 6.0 50 4 NaN
7 11, 22 1 control 3.4 3.0 23 3 NaN
8 12, 43 1 Sample A 0.8 1.2 4 6 NaN
9 12, 43 2 Sample A 1.0 -4.7 24 3 Temperature was too hot
10 12, 43 3 Sample A 0.6 4.4 69 3 NaN
11 19, 21 1 Sample B 1.0 1.2 19 4 NaN
12 19, 21 2 Sample B 1.8 3.3 76 4 NaN
13 19, 21 3 Sample B 1.0 6.3 85 3 sample spilt by user
14 19, 21 1 Sample C 2.1 2.0 32 7 NaN
15 19, 21 2 Sample C 1.7 -3.0 96 11 NaN
16 19, 21 3 Sample C 1.8 4.0 101 7 Was too cold
17 11, 22 1 Sample C -1.0 0.0 98 16 NaN
Could anyone kindly offer their assistance to help with these two scenarios?
Thank you in advance for any help
Splitting into different columns and reordering:
# This may be useful to you in the future, plus, ints are better than strings:
df[['Book', 'No']] = df.Book_No.str.split(', ', expand=True).astype(int)
cols = df.columns.tolist()
df = df[cols[-2:] + cols[1:-2]]
You should only focus on one problem at a time in your questions, so I'll help with the first part.
# Set some vars so we don't have to type these over and over:
cols = ['Smell', 'Volatility', 'Taste', 'Odour']
mask = df.Sample.eq('control')
group = ['Book', 'No']
# Find your control values:
ctrl_means = df[mask].groupby(group)[cols].mean()
# Apply your desired change:
df.loc[~mask, cols] = (df[~mask].groupby(group)[cols]
                       .apply(lambda x: x.sub(ctrl_means.loc[x.name])))
print(df)
Output:
Book No Replicate Sample Smell Taste Odour Volatility Notes
0 12 43 1 control 0.300000 10.00 71.0 1.00 NaN
1 12 43 2 control 0.400000 8.00 63.0 3.00 NaN
2 12 43 3 control 0.100000 3.00 22.0 2.00 NaN
3 19 21 1 control 1.100000 2.00 80.0 3.00 NaN
4 19 21 2 control 0.400000 8.00 0.0 4.00 NaN
5 19 21 3 control 0.900000 3.00 4.0 6.00 NaN
6 19 21 4 control 2.100000 6.00 50.0 4.00 NaN
7 11 22 1 control 3.400000 3.00 23.0 3.00 NaN
8 12 43 1 Sample A 0.833333 4.20 23.0 5.00 NaN
9 12 43 2 Sample A 1.133333 -3.70 35.0 4.00 Temperature was too hot
10 12 43 3 Sample A 0.433333 0.40 39.0 3.00 NaN
11 19 21 1 Sample B 0.975000 -1.55 65.5 2.75 NaN
12 19 21 2 Sample B 1.075000 6.55 42.5 3.75 NaN
13 19 21 3 Sample B 0.775000 4.55 55.5 4.75 sample spilt by user
14 19 21 1 Sample C 2.075000 -0.75 78.5 5.75 NaN
15 19 21 2 Sample C 0.975000 0.25 62.5 10.75 NaN
16 19 21 3 Sample C 1.575000 2.25 71.5 8.75 Was too cold
17 11 22 1 Sample C -1.000000 0.00 98.0 16.00 NaN
First we get the mean of the control samples:
cols = ['Smell', 'Taste', 'Odour', 'Volatility']
control_means = df[df.Sample.eq('control')].groupby(['Book_No'])[cols].mean()
Then subtract it from the remaining samples to get the fixed sample data. To utilize pandas' automatic index alignment, we need to temporarily set the index:
new_idx = ['Book_No', df.index]
fixed_samples = (df.set_index(new_idx).loc[df.set_index(new_idx).Sample.ne('control'), cols]
                 - control_means).droplevel(0)
Finally simply assign them back into the dataframe:
df.loc[df.Sample.ne('control'), cols] = fixed_samples
Result:
Book_No Replicate Sample Smell Taste Odour Volatility Notes
0 12, 43 1 control 0.300000 10.00 71.0 1.00 NaN
1 12, 43 2 control 0.400000 8.00 63.0 3.00 NaN
2 12, 43 3 control 0.100000 3.00 22.0 2.00 NaN
3 19, 21 1 control 1.100000 2.00 80.0 3.00 NaN
4 19, 21 2 control 0.400000 8.00 0.0 4.00 NaN
5 19, 21 3 control 0.900000 3.00 4.0 6.00 NaN
6 19, 21 4 control 2.100000 6.00 50.0 4.00 NaN
7 11, 22 1 control 3.400000 3.00 23.0 3.00 NaN
8 12, 43 1 Sample A 0.833333 4.20 23.0 5.00 NaN
9 12, 43 2 Sample A 1.133333 -3.70 35.0 4.00 Temperature was too hot
10 12, 43 3 Sample A 0.433333 0.40 39.0 3.00 NaN
11 19, 21 1 Sample B 0.975000 -1.55 65.5 2.75 NaN
12 19, 21 2 Sample B 1.075000 6.55 42.5 3.75 NaN
13 19, 21 3 Sample B 0.775000 4.55 55.5 4.75 sample spilt by user
14 19, 21 1 Sample C 2.075000 -0.75 78.5 5.75 NaN
15 19, 21 2 Sample C 0.975000 0.25 62.5 10.75 NaN
16 19, 21 3 Sample C 1.575000 2.25 71.5 8.75 Was too cold
17 11, 22 1 Sample C -1.000000 0.00 98.0 16.00 NaN
If you want, you can squeeze it into a one-liner, but this is hardly comprehensible:
cols = ['Smell', 'Taste', 'Odour', 'Volatility']
new_idx = ['Book_No', df.index]
df.loc[df.Sample.ne('control'), cols] = (
    df.set_index(new_idx).loc[df.set_index(new_idx).Sample.ne('control'), cols]
    - df[df.Sample.eq('control')].groupby(['Book_No'])[cols].mean()
).droplevel(0)
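The second scenario from the question (subtracting the control row whose Book_No and Replicate both match each sample row) is not covered above. A possible sketch, assuming there is at most one control row per (Book_No, Replicate) pair:
import pandas as pd

cols = ['Smell', 'Taste', 'Odour', 'Volatility']

# Control values keyed by the columns that have to match.
ctrl = df[df.Sample.eq('control')].set_index(['Book_No', 'Replicate'])[cols]

# Align every sample row with its matching control row and subtract.
sample_mask = df.Sample.ne('control')
keys = pd.MultiIndex.from_frame(df.loc[sample_mask, ['Book_No', 'Replicate']])
df.loc[sample_mask, cols] = df.loc[sample_mask, cols].to_numpy() - ctrl.reindex(keys).to_numpy()
On the sample data this reproduces the second expected table, e.g. row 8 becomes 1.1 - 0.3 = 0.8 for Smell.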

Pandas merge 2 dataframes

I am trying to merge 2 dataframes.
df1
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
df2
Date B
07.01.2021 14
08.01.2021 27
09.01.2021 28
10.01.2021 29
11.01.2021 30
12.01.2021 31
13.01.2021 32
Both dataframes have one row in common (although there could be several overlapping rows).
So I want to get df3 that looks as follows:
df3
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
08.01.2021 Nan 27 Nan
09.01.2021 Nan 28 Nan
10.01.2021 Nan 29 Nan
11.01.2021 Nan 30 Nan
12.01.2021 Nan 31 Nan
13.01.2021 Nan 32 Nan
I've tried
df3 = df1.merge(df2, on='Date', how='outer')
but it gives extra suffixed columns. Could you give me some idea how to get df3?
Thanks a lot.
Merge with how='outer' without specifying on (by default, on is the intersection of the columns of the two DataFrames, in this case ['Date', 'B']):
df3 = df1.merge(df2, how='outer')
df3:
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
7 08.01.2021 NaN 27 NaN
8 09.01.2021 NaN 28 NaN
9 10.01.2021 NaN 29 NaN
10 11.01.2021 NaN 30 NaN
11 12.01.2021 NaN 31 NaN
12 13.01.2021 NaN 32 NaN
Assuming you always want to keep the first full version, you can concat df2 onto the end of df1 and drop duplicates on the Date column.
pd.concat([df1,df2]).drop_duplicates(subset='Date')
Output
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
1 08.01.2021 NaN 27 NaN
2 09.01.2021 NaN 28 NaN
3 10.01.2021 NaN 29 NaN
4 11.01.2021 NaN 30 NaN
5 12.01.2021 NaN 31 NaN
6 13.01.2021 NaN 32 NaN
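As a small follow-up (a sketch, not part of the original answer): the concatenated result keeps the old row labels, so you may want to reset the index; parsing the day-first dates also lets you sort chronologically if more overlaps appear:
df3 = (pd.concat([df1, df2])
         .drop_duplicates(subset='Date')
         .reset_index(drop=True))

# Optional: treat Date as a real datetime (dd.mm.yyyy) and sort by it.
df3['Date'] = pd.to_datetime(df3['Date'], format='%d.%m.%Y')
df3 = df3.sort_values('Date', ignore_index=True)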

Python fillna based on a condition

I have the following dataframe grouped by datafile and I want to fillna(method ='bfill') only for those 'groups' that contain more than half of the data.
df.groupby('datafile').count()
datafile column1 column2 column3 column4
datafile1 5 5 3 4
datafile2 5 5 4 5
datafile3 5 5 5 5
datafile4 5 5 0 0
datafile5 5 5 1 1
As you can see in the df above, I'd like to fill the groups that contain most of the information, but not those that have little or no information. So I was thinking of a condition along the lines of: fill the groups that have more than half of the counts, and leave the rest (those with less than half) unfilled.
I'm struggling with how to set up my condition, since it involves working with the result of a groupby and the original df.
Help is appreciated.
example df:
index datafile column1 column2 column3 column4
0 datafile1 5 5 NaN 20
1 datafile1 6 6 NaN 21
2 datafile1 7 7 9 NaN
3 datafile1 8 8 10 23
4 datafile1 9 9 11 24
5 datafile2 3 3 2 7
6 datafile2 4 4 3 8
7 datafile2 5 5 4 9
8 datafile2 6 6 NaN 10
9 datafile2 7 7 6 11
10 datafile3 10 10 24 4
11 datafile3 11 11 25 5
12 datafile3 12 12 26 6
13 datafile3 13 13 27 7
14 datafile3 14 14 28 8
15 datafile4 4 4 NaN NaN
16 datafile4 5 5 NaN NaN
17 datafile4 6 6 NaN NaN
18 datafile4 7 7 NaN NaN
19 datafile4 8 8 NaN NaN
19 datafile4 9 9 NaN NaN
20 datafile5 7 7 1 3
21 datafile5 8 8 NaN NaN
22 datafile5 9 9 NaN NaN
23 datafile5 10 10 NaN NaN
24 datafile5 11 1 NaN NaN
expected output df:
index datafile column1 column2 column3 column4
0 datafile1 5 5 9 20
1 datafile1 6 6 9 21
2 datafile1 7 7 9 23
3 datafile1 8 8 10 23
4 datafile1 9 9 11 24
5 datafile2 3 3 2 7
6 datafile2 4 4 3 8
7 datafile2 5 5 4 9
8 datafile2 6 6 6 10
9 datafile2 7 7 6 11
10 datafile3 10 10 24 4
11 datafile3 11 11 25 5
12 datafile3 12 12 26 6
13 datafile3 13 13 27 7
14 datafile3 14 14 28 8
15 datafile4 4 4 NaN NaN
16 datafile4 5 5 NaN NaN
17 datafile4 6 6 NaN NaN
18 datafile4 7 7 NaN NaN
19 datafile4 8 8 NaN NaN
19 datafile4 9 9 NaN NaN
20 datafile5 7 7 1 3
21 datafile5 8 8 NaN NaN
22 datafile5 9 9 NaN NaN
23 datafile5 10 10 NaN NaN
24 datafile5 11 1 NaN NaN
If the proportion of non-null values in a column within a group is greater than or equal to 0.5, that column is filled with the bfill method:
rate = 0.5
not_na = df.notna()
g = not_na.groupby(df['datafile'])
df_fill = (
    df.bfill()
      .where(
          g.transform('sum')
           .div(g['datafile'].transform('size'), axis=0)
           .ge(rate) |
          not_na
      )
)
print(df_fill)
index datafile column1 column2 column3 column4
0 0 datafile1 5 5 9.0 20.0
1 1 datafile1 6 6 9.0 21.0
2 2 datafile1 7 7 9.0 23.0
3 3 datafile1 8 8 10.0 23.0
4 4 datafile1 9 9 11.0 24.0
5 5 datafile2 3 3 2.0 7.0
6 6 datafile2 4 4 3.0 8.0
7 7 datafile2 5 5 4.0 9.0
8 8 datafile2 6 6 6.0 10.0
9 9 datafile2 7 7 6.0 11.0
10 10 datafile3 10 10 24.0 4.0
11 11 datafile3 11 11 25.0 5.0
12 12 datafile3 12 12 26.0 6.0
13 13 datafile3 13 13 27.0 7.0
14 14 datafile3 14 14 28.0 8.0
15 15 datafile4 4 4 NaN NaN
16 16 datafile4 5 5 NaN NaN
17 17 datafile4 6 6 NaN NaN
18 18 datafile4 7 7 NaN NaN
19 19 datafile4 8 8 NaN NaN
20 19 datafile4 9 9 NaN NaN
21 20 datafile5 7 7 1.0 3.0
22 21 datafile5 8 8 NaN NaN
23 22 datafile5 9 9 NaN NaN
24 23 datafile5 10 10 NaN NaN
25 24 datafile5 11 1 NaN NaN
Also we can use:
m = (not_na.groupby(df['datafile'], sort=False)
           .sum()
           .div(df['datafile'].value_counts(), axis=0)
           .ge(rate)
           .reindex(df['datafile']).reset_index(drop=True))
df.bfill().where(m | not_na)
Both methods return similar results for the sample dataframe:
%%timeit
rate = 0.5
not_na = df.notna()
m = (not_na.groupby(df['datafile'], sort=False)
           .sum()
           .div(df['datafile'].value_counts(), axis=0)
           .ge(rate)
           .reindex(df['datafile']).reset_index(drop=True))
df.bfill().where(m | not_na)
11.1 ms ± 53.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
rate = 0.5
not_na = df.notna()
g = not_na.groupby(df['datafile'])
df_fill = (df.bfill()
             .where(g.transform('sum').div(g['datafile'].transform('size'),
                                           axis=0).ge(rate) |
                    not_na))
12.9 ms ± 225 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Use pandas.groupby.filter
def most_not_null(x):
    return x.isnull().sum().sum() < (x.notnull().sum().sum() // 2)
filtered_groups = df.groupby('datafile').filter(most_not_null)
df.loc[filtered_groups.index] = filtered_groups.bfill()
Output
>>> df
index datafile column1 column2 column3 column4
0 0 datafile1 5 5 9.0 20.0
1 1 datafile1 6 6 9.0 21.0
2 2 datafile1 7 7 9.0 23.0
3 3 datafile1 8 8 10.0 23.0
4 4 datafile1 9 9 11.0 24.0
5 5 datafile2 3 3 2.0 7.0
6 6 datafile2 4 4 3.0 8.0
7 7 datafile2 5 5 4.0 9.0
8 8 datafile2 6 6 6.0 10.0
9 9 datafile2 7 7 6.0 11.0
10 10 datafile3 10 10 24.0 4.0
11 11 datafile3 11 11 25.0 5.0
12 12 datafile3 12 12 26.0 6.0
13 13 datafile3 13 13 27.0 7.0
14 14 datafile3 14 14 28.0 8.0
15 15 datafile4 4 4 NaN NaN
16 16 datafile4 5 5 NaN NaN
17 17 datafile4 6 6 NaN NaN
18 18 datafile4 7 7 NaN NaN
19 19 datafile4 8 8 NaN NaN
20 19 datafile4 9 9 NaN NaN
21 20 datafile5 7 7 1.0 3.0
22 21 datafile5 8 8 NaN NaN
23 22 datafile5 9 9 NaN NaN
24 23 datafile5 10 10 NaN NaN
25 24 datafile5 11 1 NaN NaN
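A simpler (though not necessarily faster) per-group sketch of the same idea, assuming the df from the question; each column of each group is backfilled only when at least half of its values are present:
rate = 0.5

def fill_if_mostly_present(s):
    # Backfill this group's column only if at least half of its values are non-null.
    return s.bfill() if s.notna().mean() >= rate else s

filled = df.groupby('datafile').transform(fill_if_mostly_present)
df[filled.columns] = filled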

Convert dataframe of floats to integers in pandas?

How do I convert every numeric element of my pandas dataframe to an integer? I have not seen any documentation online for how to do so, which is surprising given Pandas is so popular...
If your data frame contains only numeric columns, simply use astype directly.
df.astype(int)
If not, use select_dtypes first to select numeric columns.
df.select_dtypes(np.number).astype(int)
df = pd.DataFrame({'col1': [1.,2.,3.,4.], 'col2': [10.,20.,30.,40.]})
col1 col2
0 1.0 10.0
1 2.0 20.0
2 3.0 30.0
3 4.0 40.0
>>> df.astype(int)
col1 col2
0 1 10
1 2 20
2 3 30
3 4 40
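A small follow-up sketch (an assumption about typical usage, not from the original answer): astype and select_dtypes both return a new frame, so to convert the numeric columns in place you can assign them back by name:
import numpy as np

num_cols = df.select_dtypes(np.number).columns
df[num_cols] = df[num_cols].astype(int)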
You can use apply for this purpose:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.arange(1.0, 20.0), 'B':np.arange(101.0, 120.0)})
print(df)
A B
0 1.0 101.0
1 2.0 102.0
2 3.0 103.0
3 4.0 104.0
4 5.0 105.0
5 6.0 106.0
6 7.0 107.0
7 8.0 108.0
8 9.0 109.0
9 10.0 110.0
10 11.0 111.0
11 12.0 112.0
12 13.0 113.0
13 14.0 114.0
14 15.0 115.0
15 16.0 116.0
16 17.0 117.0
17 18.0 118.0
18 19.0 119.0
df2 = df.apply(lambda a: [int(b) for b in a])
print(df2)
A B
0 1 101
1 2 102
2 3 103
3 4 104
4 5 105
5 6 106
6 7 107
7 8 108
8 9 109
9 10 110
10 11 111
11 12 112
12 13 113
13 14 114
14 15 115
15 16 116
16 17 117
17 18 118
18 19 119
A better approach is to change the type at the level of series:
for col in df.columns:
    if df[col].dtype == np.float64:
        df[col] = df[col].astype('int')
print(df)
A B
0 1 101
1 2 102
2 3 103
3 4 104
4 5 105
5 6 106
6 7 107
7 8 108
8 9 109
9 10 110
10 11 111
11 12 112
12 13 113
13 14 114
14 15 115
15 16 116
16 17 117
17 18 118
18 19 119
Try this:
column_types = dict(df.dtypes)
for column in df.columns:
    if column_types[column] == 'float64':
        df[column] = df[column].astype('int')
        # equivalently: df[column] = df[column].apply(lambda x: int(x))

Applying cumulative correction factor across dataframe

I'm fairly new to Pandas so please forgive me if the answer to my question is rather obvious. I've got a dataset like this
Data Correction
0 100 Nan
1 104 Nan
2 108 Nan
3 112 Nan
4 116 Nan
5 120 0.5
6 124 Nan
7 128 Nan
8 132 Nan
9 136 0.4
10 140 Nan
11 144 Nan
12 148 Nan
13 152 0.3
14 156 Nan
15 160 Nan
What I want to do is calculate the correction factor for the data, which accumulates upwards.
By that I mean that rows 13 and below should have the factor 0.3 applied, with rows 9 and below getting 0.3*0.4 and rows 5 and below 0.3*0.4*0.5.
So the final correction column should look like this
Data Correction Factor
0 100 Nan 0.06
1 104 Nan 0.06
2 108 Nan 0.06
3 112 Nan 0.06
4 116 Nan 0.06
5 120 0.5 0.06
6 124 Nan 0.12
7 128 Nan 0.12
8 132 Nan 0.12
9 136 0.4 0.12
10 140 Nan 0.3
11 144 Nan 0.3
12 148 Nan 0.3
13 152 0.3 0.3
14 156 Nan 1
15 160 Nan 1
How can I do this?
I think you are looking for cumprod() after reversing the Correction column:
df = df.assign(Factor=df.Correction[::-1].cumprod().ffill().fillna(1))
Data Correction Factor
0 100 NaN 0.06
1 104 NaN 0.06
2 108 NaN 0.06
3 112 NaN 0.06
4 116 NaN 0.06
5 120 0.5 0.06
6 124 NaN 0.12
7 128 NaN 0.12
8 132 NaN 0.12
9 136 0.4 0.12
10 140 NaN 0.30
11 144 NaN 0.30
12 148 NaN 0.30
13 152 0.3 0.30
14 156 NaN 1.00
15 160 NaN 1.00
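A step-by-step sketch of the same chain, with each stage spelled out (assumes the df from the question):
rev = df['Correction'][::-1]   # walk from the bottom row up
factor = rev.cumprod()         # 0.3, then 0.3*0.4, then 0.3*0.4*0.5 at the marked rows
factor = factor.ffill()        # propagate each product upwards (forward fill in reversed order)
factor = factor.fillna(1)      # rows below the last correction keep a factor of 1
df['Factor'] = factor          # assignment aligns on the original index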
I can't think of a good pandas function that does this; however, you can use a for loop to multiply an array by the correction values and then add it as a column.
import numpy as np
import pandas as pd
lst = [np.nan,np.nan,np.nan,np.nan,np.nan,0.5,np.nan,np.nan,np.nan,np.nan,0.4,np.nan,np.nan,np.nan,0.3,np.nan,np.nan]
lst1 = [i + 100 for i in range(len(lst))]
newcol= [1.0 for i in range(len(lst))]
newcol = np.asarray(newcol)
df = pd.DataFrame({'Data' : lst1,'Correction' : lst})
for i in range(len(df['Correction'])):
    if ~np.isnan(df.Correction[i]):
        print(df.Correction[i])
        newcol[0:i+1] = newcol[0:i+1] * df.Correction[i]
df['Factor'] = newcol
print(df)
This code prints
Data Correction Factor
0 100 NaN 0.06
1 101 NaN 0.06
2 102 NaN 0.06
3 103 NaN 0.06
4 104 NaN 0.06
5 105 0.5 0.06
6 106 NaN 0.12
7 107 NaN 0.12
8 108 NaN 0.12
9 109 NaN 0.12
10 110 0.4 0.12
11 111 NaN 0.30
12 112 NaN 0.30
13 113 NaN 0.30
14 114 0.3 0.30
15 115 NaN 1.00
16 116 NaN 1.00
