How do I convert every numeric element of my pandas DataFrame to an integer? I haven't found any documentation on how to do this, which is surprising given how popular pandas is...
If every column in your data frame is numeric, simply use astype directly.
df.astype(int)
If not, use select_dtypes first to select numeric columns.
df.select_dtypes(np.number).astype(int)
>>> df = pd.DataFrame({'col1': [1.,2.,3.,4.], 'col2': [10.,20.,30.,40.]})
>>> df
col1 col2
0 1.0 10.0
1 2.0 20.0
2 3.0 30.0
3 4.0 40.0
>>> df.astype(int)
col1 col2
0 1 10
1 2 20
2 3 30
3 4 40
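Note that select_dtypes returns a new DataFrame, so the one-liner above does not modify df in place. One way to write the converted values back (a sketch) is to grab the numeric column names first:

import numpy as np

num_cols = df.select_dtypes(np.number).columns
df[num_cols] = df[num_cols].astype(int)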
You can use apply for this purpose:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.arange(1.0, 20.0), 'B':np.arange(101.0, 120.0)})
print(df)
A B
0 1.0 101.0
1 2.0 102.0
2 3.0 103.0
3 4.0 104.0
4 5.0 105.0
5 6.0 106.0
6 7.0 107.0
7 8.0 108.0
8 9.0 109.0
9 10.0 110.0
10 11.0 111.0
11 12.0 112.0
12 13.0 113.0
13 14.0 114.0
14 15.0 115.0
15 16.0 116.0
16 17.0 117.0
17 18.0 118.0
18 19.0 119.0
df2 = df.apply(lambda a: [int(b) for b in a])
print(df2)
A B
0 1 101
1 2 102
2 3 103
3 4 104
4 5 105
5 6 106
6 7 107
7 8 108
8 9 109
9 10 110
10 11 111
11 12 112
12 13 113
13 14 114
14 15 115
15 16 116
16 17 117
17 18 118
18 19 119
A better approach is to change the type at the level of the Series, i.e. column by column:
for col in df.columns:
    if df[col].dtype == np.float64:
        df[col] = df[col].astype('int')
print(df)
A B
0 1 101
1 2 102
2 3 103
3 4 104
4 5 105
5 6 106
6 7 107
7 8 108
8 9 109
9 10 110
10 11 111
11 12 112
12 13 113
13 14 114
14 15 115
15 16 116
16 17 117
17 18 118
18 19 119
Try this:
column_types = dict(df.dtypes)
for column in df.columns:
    if column_types[column] == 'float64':
        # astype is enough on its own; a per-element apply(int) afterwards
        # would just redo the same conversion more slowly
        df[column] = df[column].astype('int')
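One caveat that applies to all of the approaches above: astype(int) raises on a column containing NaN, because NaN has no integer representation. If that is a concern, pandas' nullable Int64 dtype (capital I) converts while keeping the missing values; a minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1.0, np.nan, 3.0]})
# plain astype(int) would raise ValueError here because of the NaN
df['col1'] = df['col1'].astype('Int64')
print(df)  # 1 and 3 become integers, the NaN survives as <NA>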
Let's say I have the following df with two columns:
col1 col2
NaN NaN
11 100
12 110
15 115
NaN NaN
NaN NaN
NaN NaN
9 142
12 144
NaN NaN
NaN NaN
NaN NaN
6 155
9 156
7 161
NaN NaN
NaN NaN
I'd like to fill the NaN values with the median of the preceding run of non-NaN values. For example, the median of 11, 12, 15 in 'col1' is 12, so the NaN values that follow should be filled with 12 until the next non-NaN values appear in the column, and so on down the column. See below for the expected df:
col1 col2
NaN NaN
11 100
12 110
15 115
12 110
12 110
12 110
9 142
12 144
10.5 143
10.5 143
10.5 143
6 155
9 156
7 161
7 156
7 156
Try:
# run id: increments every time the column flips between NaN and non-NaN,
# so each consecutive run of values (or of NaNs) gets its own group label
m1 = (df.col1.isna() != df.col1.isna().shift(1)).cumsum()
m2 = (df.col2.isna() != df.col2.isna().shift(1)).cumsum()
# the median of an all-NaN run is NaN, so ffill() carries the preceding
# run's median forward into it, and fillna() writes that into the gaps
df["col1"] = df["col1"].fillna(
    df.groupby(m1)["col1"].transform("median").ffill()
)
df["col2"] = df["col2"].fillna(
    df.groupby(m2)["col2"].transform("median").ffill()
)
print(df)
Prints:
col1 col2
0 NaN NaN
1 11.0 100.0
2 12.0 110.0
3 15.0 115.0
4 12.0 110.0
5 12.0 110.0
6 12.0 110.0
7 9.0 142.0
8 12.0 144.0
9 10.5 143.0
10 10.5 143.0
11 10.5 143.0
12 6.0 155.0
13 9.0 156.0
14 7.0 161.0
15 7.0 156.0
16 7.0 156.0
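To see why this works: comparing isna() against its shifted self marks the first row of each NaN/non-NaN run, and cumsum turns those marks into run ids, so groupby(m1) groups consecutive runs. A minimal sketch on a short series:

import pandas as pd
import numpy as np

s = pd.Series([np.nan, 11, 12, 15, np.nan, np.nan])
runs = (s.isna() != s.isna().shift(1)).cumsum()
print(runs.tolist())  # [1, 2, 2, 2, 3, 3] -- one id per consecutive run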
IIUC, if we fill the null values like so:

1. Fill with the median of the last 3 non-null items.
2. Fill with the median of the last 2 non-null items (this catches gaps preceded by only two non-null values, like the 9/12 pair).
3. Forward fill the remaining values.

we'll get what you're looking for.
out = (df.combine_first(df.rolling(4,3).median())
.combine_first(df.rolling(3,2).median())
.ffill())
print(out)
Output:
col1 col2
0 NaN NaN
1 11.0 100.0
2 12.0 110.0
3 15.0 115.0
4 12.0 110.0
5 12.0 110.0
6 12.0 110.0
7 9.0 142.0
8 12.0 144.0
9 10.5 143.0
10 10.5 143.0
11 10.5 143.0
12 6.0 155.0
13 9.0 156.0
14 7.0 161.0
15 7.0 156.0
16 7.0 156.0
I want to take the surface-weighted average of the columns in my dataframe. I have two surface columns and two U-value columns. I want to create an extra column 'U_av' (the surface-weighted average U-value), where U_av = (A1*U1 + A2*U2) / (A1+A2). If NaN occurs in one of the columns, NaN should be returned.
Initial df:
ID A1 A2 U1 U2
0 14 2 1.0 10.0 11
1 16 2 2.0 12.0 12
2 18 2 3.0 24.0 13
3 20 2 NaN 8.0 14
4 22 4 5.0 84.0 15
5 24 4 6.0 84.0 16
Desired Output:
ID A1 A2 U1 U2 U_av
0 14 2 1.0 10.0 11 10.33
1 16 2 2.0 12.0 12 12
2 18 2 3.0 24.0 13 17.4
3 20 2 NaN 8.0 14 NaN
4 22 4 5.0 84.0 15 45.66
5 24 4 6.0 84.0 16 43.2
Code:
import numpy as np
import pandas as pd
df = pd.DataFrame({"ID": [14,16,18,20,22,24],
                   "A1": [2,2,2,2,4,4],
                   "U1": [10,12,24,8,84,84],
                   "A2": [1,2,3,np.nan,5,6],
                   "U2": [11,12,13,14,15,16]})
print(df)
#the mean of two columns U1 and U2, dropping NaN, is easy ((U1+U2)/2 in this case)
#but what to do for the surface-weighted mean (U_av = (A1*U1 + A2*U2) / (A1+A2))?
df.loc[:,'Umean'] = df[['U1','U2']].dropna().mean(axis=1)
EDIT:
adding to the solutions below:
df["U_av"] = (df.A1.mul(df.U1) + df.A2.mul(df.U2)).div(df[['A1','A2']].sum(axis=1))
Hope I got you correct:
df['U_av'] = (df['A1']*df['U1'] + df['A2']*df['U2']) / (df['A1']+df['A2'])
df
ID A1 U1 A2 U2 U_av
0 14 2 10 1.0 11 10.333333
1 16 2 12 2.0 12 12.000000
2 18 2 24 3.0 13 17.400000
3 20 2 8 NaN 14 NaN
4 22 4 84 5.0 15 45.666667
5 24 4 84 6.0 16 43.200000
Try this code:
numerator = df.A1.mul(df.U1) + (df.A2.mul(df.U2))
denominator = df.A1.add(df.A2)
df["U_av"] = numerator.div(denominator)
df
ID A1 A2 U1 U2 U_av
0 14 2 1.0 10.0 11 10.333333
1 16 2 2.0 12.0 12 12.000000
2 18 2 3.0 24.0 13 17.400000
3 20 2 NaN 8.0 14 NaN
4 22 4 5.0 84.0 15 45.666667
5 24 4 6.0 84.0 16 43.200000
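For more than two surface/U pairs, the same weighted average can be computed in one shot over arrays; a sketch under the example's A*/U* column naming (note that NumPy sums propagate NaN, which matches the requirement that any NaN input yields NaN):

import numpy as np

areas = df[['A1', 'A2']].to_numpy()
uvals = df[['U1', 'U2']].to_numpy()
# (sum of A_i * U_i) / (sum of A_i), row by row; NaN anywhere stays NaN
df['U_av'] = (areas * uvals).sum(axis=1) / areas.sum(axis=1)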
I have Python code which computes the sum of 14 consecutive elements in a column (4 in the simplified example below), starting from different elements, and writes this sum to another column. Does anyone know a way to do this without loops?
import pandas as pd
import numpy as np
a = pd.DataFrame({"A": [i for i in range(25)]})
b = pd.DataFrame({"B": [np.nan for i in range(25)]})
for i in range(4, len(b)):
    cumsum = 0
    for k in range(i - 4, i):
        cumsum += a.A[k]
    b.B[k] = cumsum
pd.concat([a,b], axis=1)
IIUC you are looking for rolling(4) + sum():
In [83]: a['new'] = a.A.rolling(4).sum()
In [84]: a
Out[84]:
A new
0 0 NaN
1 1 NaN
2 2 NaN
3 3 6.0
4 4 10.0
5 5 14.0
6 6 18.0
7 7 22.0
8 8 26.0
9 9 30.0
10 10 34.0
11 11 38.0
12 12 42.0
13 13 46.0
14 14 50.0
15 15 54.0
16 16 58.0
17 17 62.0
18 18 66.0
19 19 70.0
20 20 74.0
21 21 78.0
22 22 82.0
23 23 86.0
24 24 90.0
check:
In [86]: pd.concat([a,b], axis=1)
Out[86]:
A new B
0 0 NaN NaN
1 1 NaN NaN
2 2 NaN NaN
3 3 6.0 6.0
4 4 10.0 10.0
5 5 14.0 14.0
6 6 18.0 18.0
7 7 22.0 22.0
8 8 26.0 26.0
9 9 30.0 30.0
10 10 34.0 34.0
11 11 38.0 38.0
12 12 42.0 42.0
13 13 46.0 46.0
14 14 50.0 50.0
15 15 54.0 54.0
16 16 58.0 58.0
17 17 62.0 62.0
18 18 66.0 66.0
19 19 70.0 70.0
20 20 74.0 74.0
21 21 78.0 78.0
22 22 82.0 82.0
23 23 86.0 86.0
24 24 90.0 NaN
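For the 14-element window the question actually mentions, the same one-liner applies with a wider window:

a['new14'] = a.A.rolling(14).sum()

(Note the loop version never writes the last row, which is why B ends with NaN at index 24 in the check above while new has 90.0.)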
I have the following pandas DataFrame.
import pandas as pd
df = pd.read_csv('filename.csv')
print(df)
time Group blocks
0 1 A 4
1 2 A 7
2 3 A 12
3 4 A 17
4 5 A 21
5 6 A 26
6 7 A 33
7 8 A 39
8 9 A 48
9 10 A 59
.... .... ....
36 35 A 231
37 1 B 1
38 2 B 1.5
39 3 B 3
40 4 B 5
41 5 B 6
.... .... ....
911 35 Z 349
This is a dataframe with multiple time-series-like data, from min=1 to max=35. Each Group has a relationship over the range time=1 to time=35.
I would like to segment this dataframe into columns Group A, Group B, Group C, etc.
How does one "unconcatenate" this dataframe?
Is that what you want?
In [84]: df.pivot_table(index='time', columns='Group')
Out[84]:
blocks
Group A B
time
1 4.0 1.0
2 7.0 1.5
3 12.0 3.0
4 17.0 5.0
5 21.0 6.0
6 26.0 NaN
7 33.0 NaN
8 39.0 NaN
9 48.0 NaN
10 59.0 NaN
35 231.0 NaN
data:
In [86]: df
Out[86]:
time Group blocks
0 1 A 4.0
1 2 A 7.0
2 3 A 12.0
3 4 A 17.0
4 5 A 21.0
5 6 A 26.0
6 7 A 33.0
7 8 A 39.0
8 9 A 48.0
9 10 A 59.0
36 35 A 231.0
37 1 B 1.0
38 2 B 1.5
39 3 B 3.0
40 4 B 5.0
41 5 B 6.0
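If you would rather end up with flat column labels A, B, ... instead of the ('blocks', 'A') MultiIndex that pivot_table produces above, passing an explicit values column to pivot is one option (a sketch):

out = df.pivot(index='time', columns='Group', values='blocks')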
SOLUTION
df = pd.read_csv('data.txt')
# within each (a, b, d) group the rows are ordered by c, so shift(+1) pulls
# z from the row with the next-smaller c and shift(-1) from the next-larger c;
# grouping by (a, b, c) does the same along d
df['z-C+1'] = df.groupby(['a','b','d'])['z'].transform(lambda x: x.shift(+1))
df['z-C-1'] = df.groupby(['a','b','d'])['z'].transform(lambda x: x.shift(-1))
df['z-D+1'] = df.groupby(['a','b','c'])['z'].transform(lambda x: x.shift(+1))
df['z-D-1'] = df.groupby(['a','b','c'])['z'].transform(lambda x: x.shift(-1))
QUESTION
I have a CSV which is sorted by a few indexes. There is one index in particular I am interested in, and I want to keep the table the same. All I want to do is add extra columns which are a function of the table. So, let's say "v" is the column of interest. I want to take the "z" column and add more "z" columns from other places in the table where "c" = "c+1" and "c-1", and "d+1" and "d-1", and just join those on the end. In the end I want the same number of rows, but with the "z" column expanded into columns "Z.C-1.D", "Z.C.D", "Z.C+1.D", "Z.C.D-1", "Z.C.D+1", if that makes any sense. I'm having difficulties. I've tried the pivot_table method, and that brought me somewhere, while also adding confusion.
If this helps: think of it like a point in a matrix, where I have an independent variable and a dependent variable. I want to extract the neighboring independent variables for every location where I have an observation.
Here is my example csv:
a b c d v z
10 1 15 42 0.90 5460
10 2 15 42 0.97 6500
10 1 16 42 1.04 7540
10 2 16 42 1.11 8580
10 1 15 43 1.18 9620
10 2 15 43 0.98 10660
10 1 16 43 1.32 3452
10 2 16 43 1.39 4561
11 1 15 42 0.54 5670
11 2 15 42 1.53 6779
11 1 16 42 1.60 7888
11 2 16 42 1.67 8997
11 1 15 43 1.74 10106
11 2 15 43 1.81 11215
11 1 16 43 1.88 12324
11 2 16 43 1.95 13433
And my desired output:
a b c d v z z[c-1] z[c+1] z[d-1] z[d+1]
10 1 15 42 0.90 5460 NaN 7540 NaN 9620
10 2 15 42 0.97 6500 NaN 8580 NaN 10660
10 1 16 42 1.04 7540 5460 NaN NaN 3452
10 2 16 42 1.11 8580 6500 NaN NaN 4561
10 1 15 43 1.18 9620 NaN 3452 5460 NaN
10 2 15 43 0.98 10660 NaN 4561 6500 NaN
10 1 16 43 1.32 3452 9620 NaN 7540 NaN
10 2 16 43 1.39 4561 10660 NaN 8580 NaN
11 1 15 42 0.54 5670 NaN 7888 NaN 10106
11 2 15 42 1.53 6779 NaN 8997 NaN 11215
11 1 16 42 1.60 7888 5670 NaN NaN 12324
11 2 16 42 1.67 8997 6779 NaN NaN 13433
11 1 15 43 1.74 10106 NaN 12324 5670 NaN
11 2 15 43 1.81 11215 NaN 13433 6779 NaN
11 1 16 43 1.88 12324 10106 NaN 7888 NaN
11 2 16 43 1.95 13433 11215 NaN 8997 NaN
Don't know if I understood you, but you can use the shift() method to add shifted columns, like:
df['z-1'] = df.groupby('a')['z'].transform(lambda x:x.shift(-1))
update
If you want selection by values, you can use apply():
def lkp_data(c, d, v):
    # find the z value of the row matching (c, d, v); the result is named
    # `match` rather than `d` to avoid shadowing the parameter
    match = df[(df['c'] == c) & (df['d'] == d) & (df['v'] == v)]['z']
    return None if len(match) == 0 else match.values[0]

df['z[c-1]'] = df.apply(lambda x: lkp_data(x['c'] - 1, x['d'], x['v']), axis=1)
df['z[c+1]'] = df.apply(lambda x: lkp_data(x['c'] + 1, x['d'], x['v']), axis=1)
df['z[d-1]'] = df.apply(lambda x: lkp_data(x['c'], x['d'] - 1, x['v']), axis=1)
df['z[d+1]'] = df.apply(lambda x: lkp_data(x['c'], x['d'] + 1, x['v']), axis=1)
c d z v z[c-1] z[c+1] z[d-1] z[d+1]
0 15 42 5460 1 NaN 7540 NaN 9620
1 15 42 6500 2 NaN 8580 NaN 10660
2 16 42 7540 1 5460 NaN NaN 3452
3 16 42 8580 2 6500 NaN NaN 4561
4 15 43 9620 1 NaN 3452 5460 NaN
5 15 43 10660 2 NaN 4561 6500 NaN
6 16 43 3452 1 9620 NaN 7540 NaN
7 16 43 4561 2 10660 NaN 8580 NaN
But I think this one would be really inefficient.
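A vectorized alternative (a sketch, assuming the (c, d, v) triple uniquely identifies a row, as it does in this data) is to build a lookup Series indexed by those keys and reindex it with the shifted keys:

lookup = df.set_index(['c', 'd', 'v'])['z']
df['z[c-1]'] = lookup.reindex(
    pd.MultiIndex.from_arrays([df['c'] - 1, df['d'], df['v']])
).to_numpy()
# the other three columns follow the same pattern with c + 1, d - 1, d + 1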