Python: how to subset a DataFrame with just the column indices?

I have a huge DataFrame with 282 columns and 500K rows, and I want to remove a list of columns from it by their column indices. The code below works for sequential columns:
df1 = df.ix[:, 0:2]
The problem is that my column indices are not sequential.
For example, I want to remove columns 0,1,2 and 5 from df. I tried the following code:
df1 = df.ix[:,[0:2,5]]
I am getting the following error:
SyntaxError: invalid syntax
Any suggestions?

Select the columns other than 0, 1, 2 and 5 with (`.iloc` here, since `.ix` has long been deprecated):
df.iloc[:, [3, 4] + list(range(6, 282))]
Or, a little more dynamically:
df.iloc[:, [3, 4] + list(range(6, df.shape[1]))]
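Since the actual goal is to remove columns by position, here is a minimal sketch of two equivalent ways, using `drop` and `iloc` on a small stand-in frame (the 8-column frame below is illustrative, not the asker's 282-column one):

```python
import numpy as np
import pandas as pd

# small example frame standing in for the 282-column original
df = pd.DataFrame(np.arange(40).reshape(5, 8))

# remove columns 0, 1, 2 and 5 by position
df1 = df.drop(columns=df.columns[[0, 1, 2, 5]])

# equivalently, select the complement of those positions with .iloc
keep = [i for i in range(df.shape[1]) if i not in {0, 1, 2, 5}]
df2 = df.iloc[:, keep]
```

Both produce a frame containing only columns 3, 4, 6 and 7.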

Is it a pandas DataFrame or a NumPy array you've got? For a DataFrame, try
df1 = df.iloc[:, [0, 1, 2, 5]]
or, for a plain NumPy array,
data[:, [0, 1, 2, 5]]

Use np.r_[...] for concatenating slices along the first axis
DF:
In [98]: df = pd.DataFrame(np.random.randint(10, size=(5, 12)))
In [99]: df
Out[99]:
0 1 2 3 4 5 6 7 8 9 10 11
0 0 7 2 9 9 0 7 3 5 8 8 1
1 4 9 0 4 0 2 4 8 8 7 1 9
2 2 1 1 2 7 4 4 6 1 2 9 8
3 1 5 0 8 2 2 4 1 1 4 8 4
4 4 6 3 2 2 4 1 6 2 6 9 0
Solution:
In [107]: df.iloc[:, np.r_[3:5, 6:df.shape[1]]]
Out[107]:
3 4 6 7 8 9 10 11
0 9 9 7 3 5 8 8 1
1 4 0 4 8 8 7 1 9
2 2 7 4 6 1 2 9 8
3 8 2 4 1 1 4 8 4
4 2 2 1 6 2 6 9 0
In [108]: np.r_[3:5, 6:df.shape[1]]
Out[108]: array([ 3, 4, 6, 7, 8, 9, 10, 11])
or
In [110]: df.columns.difference([0,1,2,5])
Out[110]: Int64Index([3, 4, 6, 7, 8, 9, 10, 11], dtype='int64')
In [111]: df[df.columns.difference([0,1,2,5])]
Out[111]:
3 4 6 7 8 9 10 11
0 9 9 7 3 5 8 8 1
1 4 0 4 8 8 7 1 9
2 2 7 4 6 1 2 9 8
3 8 2 4 1 1 4 8 4
4 2 2 1 6 2 6 9 0
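Another way to build the complement of a set of positions is a sketch using `np.setdiff1d` rather than the answer's `np.r_` (not from the answer above, but equivalent for sorted positional output):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(60).reshape(5, 12))

# all column positions except 0, 1, 2 and 5
keep = np.setdiff1d(np.arange(df.shape[1]), [0, 1, 2, 5])
out = df.iloc[:, keep]
```

`np.setdiff1d` returns the sorted unique values in the first array that are not in the second, so `keep` matches the `np.r_[3:5, 6:12]` construction above.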

Related

Iteratively pop and append to generate new lists using pandas

I have a list of elements mylist = [1, 2, 3, 4, 5, 6, 7, 8] and would like to iteratively:
copy the list
pop the first element of the copied list
and append it to the end of the copied list
repeat this for the next row, etc.
Desired output:
index A B C D E F G H
0 1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 1
2 3 4 5 6 7 8 1 2
3 4 5 6 7 8 1 2 3
4 5 6 7 8 1 2 3 4
5 6 7 8 1 2 3 4 5
6 7 8 1 2 3 4 5 6
7 8 1 2 3 4 5 6 7
I suspect a for loop is needed but am having trouble iteratively generating rows based on the prior row.
I think slicing (Understanding slicing) is what you are looking for:
next_iteration = my_list[1:] + [my_list[0]]
and the full loop:
output = []
for i in range(len(my_list)):
    output.append(my_list[i:] + my_list[:i])
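Wrapping those slices into the desired DataFrame (a sketch using the column names from the question):

```python
import pandas as pd

my_list = [1, 2, 3, 4, 5, 6, 7, 8]

# each row is the list rotated left by i positions
output = [my_list[i:] + my_list[:i] for i in range(len(my_list))]
df = pd.DataFrame(output, columns=list('ABCDEFGH'))
```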
Use this NumPy solution, with roll offsets created by np.arange:
mylist = [1, 2, 3, 4, 5, 6, 7, 8]
a = np.array(mylist)
rolls = np.arange(0, -8, -1)
print (rolls)
[ 0 -1 -2 -3 -4 -5 -6 -7]
df = pd.DataFrame(a[(np.arange(len(a))[:, None] - rolls) % len(a)],
                  columns=list('ABCDEFGH'))
print (df)
A B C D E F G H
0 1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 1
2 3 4 5 6 7 8 1 2
3 4 5 6 7 8 1 2 3
4 5 6 7 8 1 2 3 4
5 6 7 8 1 2 3 4 5
6 7 8 1 2 3 4 5 6
7 8 1 2 3 4 5 6 7
If a loop solution is needed (slower), it is possible to use numpy.roll:
mylist = [1, 2, 3, 4, 5, 6, 7, 8]
rolls = np.arange(0, -8, -1)
df = pd.DataFrame([np.roll(mylist, i) for i in rolls],
                  columns=list('ABCDEFGH'))
print (df)
A B C D E F G H
0 1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 1
2 3 4 5 6 7 8 1 2
3 4 5 6 7 8 1 2 3
4 5 6 7 8 1 2 3 4
5 6 7 8 1 2 3 4 5
6 7 8 1 2 3 4 5 6
7 8 1 2 3 4 5 6 7
Try this (it uses the walrus operator, so Python 3.8+):
mylist = [1, 2, 3, 4, 5, 6, 7, 8]
ar = np.roll(np.array(mylist), 1)
data = [ar := np.roll(ar, -1) for _ in range(ar.size)]
df = pd.DataFrame(data, columns=[*'ABCDEFGH'])
print(df)
>>>
A B C D E F G H
0 1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 1
2 3 4 5 6 7 8 1 2
3 4 5 6 7 8 1 2 3
4 5 6 7 8 1 2 3 4
5 6 7 8 1 2 3 4 5
6 7 8 1 2 3 4 5 6
7 8 1 2 3 4 5 6 7
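For completeness, a standard-library sketch of the same rotation using `collections.deque.rotate` (not one of the answers above):

```python
from collections import deque

import pandas as pd

mylist = [1, 2, 3, 4, 5, 6, 7, 8]
d = deque(mylist)

rows = []
for _ in range(len(mylist)):
    rows.append(list(d))
    d.rotate(-1)  # rotate left: the first element moves to the end

df = pd.DataFrame(rows, columns=list('ABCDEFGH'))
```

`rotate(-1)` mutates the deque in place, so each iteration records the current state before shifting it one step left.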

Filtering two DataFrames to keep only rows with the same index

I have two data frames, X_oos_top_10 and y_oos_top_10, and I need to filter both by X_oos_top_10["comm"] == 1.
It works for one of them:
X_oos_top_10_comm1 = X_oos_top_10[X_oos_top_10["comm"] == 1]
But for another I have the problem: IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
y_oos_top_10_comm1 = y_oos_top_10[X_oos_top_10["comm"] == 1]
I have no idea how to do this.
Assuming X and y have the same length, you can use indexing.
Set up a minimal reproducible example:
X_oos_top_10 = pd.DataFrame({'comm': np.random.randint(1, 10, 10)})
y_oos_top_10 = pd.DataFrame(np.random.randint(1, 10, (10, 4)), columns=list('ABCD'))
print(X_oos_top_10)
# Output:
comm
0 5
1 6
2 2
3 6
4 1
5 6
6 1
7 4
8 5
9 8
print(y_oos_top_10)
# Output:
A B C D
0 2 9 1 6
1 9 8 5 4
2 1 6 7 6
3 6 3 6 5
4 2 6 8 3
5 2 6 6 5
6 4 4 3 5
7 6 3 7 5
8 2 8 8 7
9 4 9 1 4
1st method
idx = X_oos_top_10[X_oos_top_10["comm"] == 1].index
out = y_oos_top_10.loc[idx]
print(out)
# Output:
A B C D
4 2 6 8 3
6 4 4 3 5
2nd method
Xy_oos_top_10 = X_oos_top_10.join(y_oos_top_10)
out = Xy_oos_top_10[Xy_oos_top_10['comm'] == 1]
print(out)
# Output:
comm A B C D
4 1 2 6 8 3
6 1 4 4 3 5
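The error itself comes from index alignment: the boolean Series carries X's index, which pandas tries to align with y's. Assuming the two frames are positionally aligned, a third option is to strip the index off the mask before using it (a sketch on the same sample data as above):

```python
import numpy as np
import pandas as pd

X_oos_top_10 = pd.DataFrame({'comm': [5, 6, 2, 6, 1, 6, 1, 4, 5, 8]})
y_oos_top_10 = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list('ABCD'))

# convert the mask to a plain array so no index alignment is attempted
mask = (X_oos_top_10['comm'] == 1).to_numpy()
out = y_oos_top_10[mask]
```

A plain NumPy boolean array selects rows purely by position, which sidesteps the IndexingError entirely.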

How to replace value in specific index in each row with corresponding value in numpy array

My dataframe looks like this:
datetime1 datetime2 datetime3 datetime4
id
1 5 6 5 5
2 7 2 3 5
3 4 2 3 2
4 6 4 4 7
5 7 3 8 9
and I have a numpy array like this:
index_arr = [3, 2, 0, 1, 2]
This numpy array refers to the index in each row, respectively, that I want to replace. The values I want to use in the replacement are in another numpy array:
replace_arr = [14, 12, 23, 17, 15]
so that the updated dataframe looks like this:
datetime1 datetime2 datetime3 datetime4
id
1 5 6 5 14
2 7 2 12 5
3 23 2 3 2
4 6 17 4 7
5 7 3 15 9
What is the best way to go about doing this replacement quickly? I've tried using enumerate and iterrows but couldn't get the syntax to work. Would appreciate any help - thank you
Here's one way with np.put_along_axis -
In [50]: df
Out[50]:
datetime1 datetime2 datetime3 datetime4
1 5 6 5 5
2 7 2 3 5
3 4 2 3 2
4 6 4 4 7
5 7 3 8 9
In [51]: index_arr = np.array([3, 2, 0, 1, 2])
In [52]: replace_arr = np.array([14, 12, 23, 17, 15])
In [53]: np.put_along_axis(df.to_numpy(),index_arr[:,None],replace_arr[:,None],axis=1)
In [54]: df
Out[54]:
datetime1 datetime2 datetime3 datetime4
1 5 6 5 14
2 7 2 12 5
3 23 2 3 2
4 6 17 4 7
5 7 3 15 9
IIUC, you can just assign to .values (or .to_numpy(copy=False)):
# <= 0.23
df.values[np.arange(len(df)), index_arr] = replace_arr
# 0.24+
df.to_numpy(copy=False)[np.arange(len(df)), index_arr] = replace_arr
df
datetime1 datetime2 datetime3 datetime4
id
1 5 6 5 14
2 7 2 12 5
3 23 2 3 2
4 6 17 4 7
5 7 3 15 9
Ended up using .iat:
for x, y, z in zip(np.arange(len(df)), index_arr, replace_arr):
    df.iat[x, y] = z
df
Out[657]:
datetime1 datetime2 datetime3 datetime4
id
1 5 6 5 14
2 7 2 12 5
3 23 2 3 2
4 6 17 4 7
5 7 3 15 9
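One caveat worth noting: writing through `df.values` or `df.to_numpy()` only reaches the DataFrame when the returned array is a view of its data, which generally holds for a single-dtype frame. A sketch that avoids relying on that detail writes into an explicit copy and rebuilds the frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[5, 6, 5, 5],
                   [7, 2, 3, 5],
                   [4, 2, 3, 2],
                   [6, 4, 4, 7],
                   [7, 3, 8, 9]],
                  columns=['datetime1', 'datetime2', 'datetime3', 'datetime4'],
                  index=pd.Index([1, 2, 3, 4, 5], name='id'))

index_arr = np.array([3, 2, 0, 1, 2])
replace_arr = np.array([14, 12, 23, 17, 15])

arr = df.to_numpy().copy()  # explicit copy: safe regardless of dtype layout
arr[np.arange(len(arr)), index_arr] = replace_arr  # one fancy-indexed write per row
df = pd.DataFrame(arr, columns=df.columns, index=df.index)
```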

Slicing Pandas data frame into two parts

Actually I thought this should be very easy. I have a pandas data frame with, let's say, 100 columns, and I want a subset containing columns 0:30 and 77:99.
What I've done so far is:
df_1 = df.iloc[:,0:30]
df_2 = df.iloc[:,77:99]
df2 = pd.concat([df_1, df_2], axis=1, join_axes=[df_1.index])
Is there an easier way?
Use numpy.r_ to concatenate the slice indices:
df2 = df.iloc[:, np.r_[0:30, 77:99]]
Sample:
df = pd.DataFrame(np.random.randint(10, size=(5,15)))
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 6 2 9 5 4 6 9 9 7 9 6 6 1 0 6
1 5 6 7 0 7 8 7 9 4 8 1 2 0 8 5
2 5 6 1 6 7 6 1 5 5 4 6 3 2 3 0
3 4 3 1 3 3 8 3 6 7 1 8 6 2 1 8
4 3 8 2 3 7 3 6 4 4 6 2 6 9 4 9
df2 = df.iloc[:, np.r_[0:3, 7:9]]
print (df2)
0 1 2 7 8
0 6 2 9 9 7
1 5 6 7 9 4
2 5 6 1 5 5
3 4 3 1 6 7
4 3 8 2 4 4
The concat approach for comparison (note that join_axes was removed in pandas 1.0; a plain concat suffices when the indices already match):
df_1 = df.iloc[:, 0:3]
df_2 = df.iloc[:, 7:9]
df2 = pd.concat([df_1, df_2], axis=1)
print (df2)
0 1 2 7 8
0 6 2 9 9 7
1 5 6 7 9 4
2 5 6 1 5 5
3 4 3 1 6 7
4 3 8 2 4 4

Pandas: How to get max and min values and write for every row?

I have data like this:
>> df
A B C
0 1 5 1
1 1 7 1
2 1 6 1
3 1 7 1
4 2 5 1
5 2 8 1
6 2 6 1
7 3 7 1
8 3 9 1
9 4 6 1
10 4 7 1
11 4 1 1
I want to take the max and min values of column B for each distinct value in column A, and write the results onto every row of the original table. My code is:
df1 = df.groupby(['A']).B.transform(max)
df1 = df1.rename(columns={'B':'B_max'})
df2 = df.groupby(['A']).B.transform(min)
df2 = df2.rename(columns={'B':'B_min'})
df3 = df.join(df1['B_max']).join(df2['B_min'])
This is the result.
A B C B_max B_min
0 1 5 1
1 1 7 1 7
2 1 6 1
3 1 4 1 4
4 2 5 1
5 2 8 1 8
6 2 6 1 6
7 3 7 1 7
8 3 9 1 9
9 4 6 1
10 4 7 1 7
11 4 1 1 1
But I want to table look like this;
A B C B_max B_min
0 1 5 1 7 4
1 1 7 1 7 4
2 1 6 1 7 4
3 1 4 1 7 4
4 2 5 1 8 6
5 2 8 1 8 6
6 2 6 1 8 6
7 3 7 1 9 7
8 3 9 1 9 7
9 4 6 1 7 1
10 4 7 1 7 1
11 4 1 1 7 1
How can I adapt the code so the result looks like this?
I think you only need to assign the values to new columns, because transform returns a Series with the same length as df:
df = pd.DataFrame({
    'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
    'B': [5, 7, 6, 7, 5, 8, 6, 7, 9, 6, 7, 1],
    'C': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]})
print (df)
A B C
0 1 5 1
1 1 7 1
2 1 6 1
3 1 7 1
4 2 5 1
5 2 8 1
6 2 6 1
7 3 7 1
8 3 9 1
9 4 6 1
10 4 7 1
11 4 1 1
df['B_max'] = df.groupby(['A']).B.transform(max)
df['B_min'] = df.groupby(['A']).B.transform(min)
print (df)
A B C B_max B_min
0 1 5 1 7 5
1 1 7 1 7 5
2 1 6 1 7 5
3 1 7 1 7 5
4 2 5 1 8 5
5 2 8 1 8 5
6 2 6 1 8 5
7 3 7 1 9 7
8 3 9 1 9 7
9 4 6 1 7 1
10 4 7 1 7 1
11 4 1 1 7 1
g = df.groupby('A').B
df['B_max'] = g.transform(max)
df['B_min'] = g.transform(min)
print (df)
A B C B_max B_min
0 1 5 1 7 5
1 1 7 1 7 5
2 1 6 1 7 5
3 1 7 1 7 5
4 2 5 1 8 5
5 2 8 1 8 5
6 2 6 1 8 5
7 3 7 1 9 7
8 3 9 1 9 7
9 4 6 1 7 1
10 4 7 1 7 1
11 4 1 1 7 1
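An alternative sketch that computes both statistics in one groupby pass and broadcasts them back with a merge (not from the answer above; the named-aggregation syntax needs pandas 0.25+):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
    'B': [5, 7, 6, 7, 5, 8, 6, 7, 9, 6, 7, 1],
    'C': [1] * 12})

# one pass over the groups, then broadcast back by merging on A
stats = df.groupby('A')['B'].agg(B_max='max', B_min='min').reset_index()
out = df.merge(stats, on='A')
```

`transform` is the more direct tool here, but the agg-plus-merge form is handy when several statistics are needed at once.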
