Python: how to subset a DataFrame with just the column indices?

I have a huge DataFrame with 282 columns and 500K rows, and I want to remove a list of columns from it by their column indices. The code below works for sequential columns:
df1 = df.ix[:, 0:2]
The problem is that my column indices are not sequential.
For example, I want to remove columns 0,1,2 and 5 from df. I tried the following code:
df1 = df.ix[:,[0:2,5]]
I am getting the following error:
SyntaxError: invalid syntax
Any suggestions?

Select the columns other than 0, 1, 2 and 5 with (`.iloc` here, since `.ix` has long been deprecated):
df.iloc[:, [3, 4] + list(range(6, 282))]
Or, a little more dynamically:
df.iloc[:, [3, 4] + list(range(6, df.shape[1]))]
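Since the actual goal is to remove columns by position, here is a minimal sketch of two equivalent ways, using `drop` and `iloc` on a small stand-in frame (the 8-column frame below is illustrative, not the asker's 282-column one):

```python
import numpy as np
import pandas as pd

# small example frame standing in for the 282-column original
df = pd.DataFrame(np.arange(40).reshape(5, 8))

# remove columns 0, 1, 2 and 5 by position
df1 = df.drop(columns=df.columns[[0, 1, 2, 5]])

# equivalently, select the complement of those positions with .iloc
keep = [i for i in range(df.shape[1]) if i not in {0, 1, 2, 5}]
df2 = df.iloc[:, keep]
```

Both produce a frame containing only columns 3, 4, 6 and 7.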

Is it a pandas DataFrame or a NumPy array you've got? For a DataFrame, try
df1 = df.iloc[:, [0, 1, 2, 5]]
or, for a plain NumPy array,
data[:, [0, 1, 2, 5]]

Use np.r_[...] for concatenating slices along the first axis
DF:
In [98]: df = pd.DataFrame(np.random.randint(10, size=(5, 12)))
In [99]: df
Out[99]:
0 1 2 3 4 5 6 7 8 9 10 11
0 0 7 2 9 9 0 7 3 5 8 8 1
1 4 9 0 4 0 2 4 8 8 7 1 9
2 2 1 1 2 7 4 4 6 1 2 9 8
3 1 5 0 8 2 2 4 1 1 4 8 4
4 4 6 3 2 2 4 1 6 2 6 9 0
Solution:
In [107]: df.iloc[:, np.r_[3:5, 6:df.shape[1]]]
Out[107]:
3 4 6 7 8 9 10 11
0 9 9 7 3 5 8 8 1
1 4 0 4 8 8 7 1 9
2 2 7 4 6 1 2 9 8
3 8 2 4 1 1 4 8 4
4 2 2 1 6 2 6 9 0
In [108]: np.r_[3:5, 6:df.shape[1]]
Out[108]: array([ 3, 4, 6, 7, 8, 9, 10, 11])
or
In [110]: df.columns.difference([0,1,2,5])
Out[110]: Int64Index([3, 4, 6, 7, 8, 9, 10, 11], dtype='int64')
In [111]: df[df.columns.difference([0,1,2,5])]
Out[111]:
3 4 6 7 8 9 10 11
0 9 9 7 3 5 8 8 1
1 4 0 4 8 8 7 1 9
2 2 7 4 6 1 2 9 8
3 8 2 4 1 1 4 8 4
4 2 2 1 6 2 6 9 0
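Another way to build the complement of a set of positions is a sketch using `np.setdiff1d` rather than the answer's `np.r_` (not from the answer above, but equivalent for sorted positional output):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(60).reshape(5, 12))

# all column positions except 0, 1, 2 and 5
keep = np.setdiff1d(np.arange(df.shape[1]), [0, 1, 2, 5])
out = df.iloc[:, keep]
```

`np.setdiff1d` returns the sorted unique values in the first array that are not in the second, so `keep` matches the `np.r_[3:5, 6:12]` construction above.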

Related

Iteratively pop and append to generate new lists using pandas

I have a list of elements mylist = [1, 2, 3, 4, 5, 6, 7, 8] and would like to iteratively:
copy the list
pop the first element of the copied list
and append it to the end of the copied list
repeat this for the next row, etc.
Desired output:
index A B C D E F G H
0 1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 1
2 3 4 5 6 7 8 1 2
3 4 5 6 7 8 1 2 3
4 5 6 7 8 1 2 3 4
5 6 7 8 1 2 3 4 5
6 7 8 1 2 3 4 5 6
7 8 1 2 3 4 5 6 7
I suspect a for loop is needed but am having trouble iteratively generating rows based on the prior row.
I think slicing (Understanding slicing) is what you are looking for:
next_iteration = my_list[1:] + [my_list[0]]
and the full loop:
output = []
for i in range(len(my_list)):
    output.append(my_list[i:] + my_list[:i])
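Wrapping those slices into the desired DataFrame (a sketch using the column names from the question):

```python
import pandas as pd

my_list = [1, 2, 3, 4, 5, 6, 7, 8]

# each row is the list rotated left by i positions
output = [my_list[i:] + my_list[:i] for i in range(len(my_list))]
df = pd.DataFrame(output, columns=list('ABCDEFGH'))
```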
Use this NumPy solution, with roll offsets created by np.arange:
mylist = [1, 2, 3, 4, 5, 6, 7, 8]
a = np.array(mylist)
rolls = np.arange(0, -8, -1)
print (rolls)
[ 0 -1 -2 -3 -4 -5 -6 -7]
df = pd.DataFrame(a[(np.arange(len(a))[:, None] - rolls) % len(a)],
                  columns=list('ABCDEFGH'))
print (df)
A B C D E F G H
0 1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 1
2 3 4 5 6 7 8 1 2
3 4 5 6 7 8 1 2 3
4 5 6 7 8 1 2 3 4
5 6 7 8 1 2 3 4 5
6 7 8 1 2 3 4 5 6
7 8 1 2 3 4 5 6 7
If a loop solution is needed (slower), it is possible to use numpy.roll:
mylist = [1, 2, 3, 4, 5, 6, 7, 8]
rolls = np.arange(0, -8, -1)
df = pd.DataFrame([np.roll(mylist, i) for i in rolls],
                  columns=list('ABCDEFGH'))
print (df)
A B C D E F G H
0 1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 1
2 3 4 5 6 7 8 1 2
3 4 5 6 7 8 1 2 3
4 5 6 7 8 1 2 3 4
5 6 7 8 1 2 3 4 5
6 7 8 1 2 3 4 5 6
7 8 1 2 3 4 5 6 7
Try this (it uses the walrus operator, so Python 3.8+):
mylist = [1, 2, 3, 4, 5, 6, 7, 8]
ar = np.roll(np.array(mylist), 1)
data = [ar := np.roll(ar, -1) for _ in range(ar.size)]
df = pd.DataFrame(data, columns=[*'ABCDEFGH'])
print(df)
>>>
A B C D E F G H
0 1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 1
2 3 4 5 6 7 8 1 2
3 4 5 6 7 8 1 2 3
4 5 6 7 8 1 2 3 4
5 6 7 8 1 2 3 4 5
6 7 8 1 2 3 4 5 6
7 8 1 2 3 4 5 6 7
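For completeness, a standard-library sketch of the same rotation using `collections.deque.rotate` (not one of the answers above):

```python
from collections import deque

import pandas as pd

mylist = [1, 2, 3, 4, 5, 6, 7, 8]
d = deque(mylist)

rows = []
for _ in range(len(mylist)):
    rows.append(list(d))
    d.rotate(-1)  # rotate left: the first element moves to the end

df = pd.DataFrame(rows, columns=list('ABCDEFGH'))
```

`rotate(-1)` mutates the deque in place, so each iteration records the current state before shifting it one step left.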

Filtering two DataFrames to keep only rows with the same index

I have two data frames, X_oos_top_10 and y_oos_top_10, and I need to filter both by X_oos_top_10["comm"] == 1.
It works for one of them:
X_oos_top_10_comm1 = X_oos_top_10[X_oos_top_10["comm"] == 1]
But for another I have the problem: IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
y_oos_top_10_comm1 = y_oos_top_10[X_oos_top_10["comm"] == 1]
I have no idea how to do this.
Assuming X and y have the same length, you can use indexing.
Set up a minimal reproducible example:
X_oos_top_10 = pd.DataFrame({'comm': np.random.randint(1, 10, 10)})
y_oos_top_10 = pd.DataFrame(np.random.randint(1, 10, (10, 4)), columns=list('ABCD'))
print(X_oos_top_10)
# Output:
comm
0 5
1 6
2 2
3 6
4 1
5 6
6 1
7 4
8 5
9 8
print(y_oos_top_10)
# Output:
A B C D
0 2 9 1 6
1 9 8 5 4
2 1 6 7 6
3 6 3 6 5
4 2 6 8 3
5 2 6 6 5
6 4 4 3 5
7 6 3 7 5
8 2 8 8 7
9 4 9 1 4
1st method
idx = X_oos_top_10[X_oos_top_10["comm"] == 1].index
out = y_oos_top_10.loc[idx]
print(out)
# Output:
A B C D
4 2 6 8 3
6 4 4 3 5
2nd method
Xy_oos_top_10 = X_oos_top_10.join(y_oos_top_10)
out = Xy_oos_top_10[Xy_oos_top_10['comm'] == 1]
print(out)
# Output:
comm A B C D
4 1 2 6 8 3
6 1 4 4 3 5
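The error itself comes from index alignment: the boolean Series carries X's index, which pandas tries to align with y's. Assuming the two frames are positionally aligned, a third option is to strip the index off the mask before using it (a sketch on the same sample data as above):

```python
import numpy as np
import pandas as pd

X_oos_top_10 = pd.DataFrame({'comm': [5, 6, 2, 6, 1, 6, 1, 4, 5, 8]})
y_oos_top_10 = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list('ABCD'))

# convert the mask to a plain array so no index alignment is attempted
mask = (X_oos_top_10['comm'] == 1).to_numpy()
out = y_oos_top_10[mask]
```

A plain NumPy boolean array selects rows purely by position, which sidesteps the IndexingError entirely.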

How to replace value in specific index in each row with corresponding value in numpy array

My dataframe looks like this:
datetime1 datetime2 datetime3 datetime4
id
1 5 6 5 5
2 7 2 3 5
3 4 2 3 2
4 6 4 4 7
5 7 3 8 9
and I have a numpy array like this:
index_arr = [3, 2, 0, 1, 2]
This numpy array refers to the index in each row, respectively, that I want to replace. The values I want to use in the replacement are in another numpy array:
replace_arr = [14, 12, 23, 17, 15]
so that the updated dataframe looks like this:
datetime1 datetime2 datetime3 datetime4
id
1 5 6 5 14
2 7 2 12 5
3 23 2 3 2
4 6 17 4 7
5 7 3 15 9
What is the best way to go about doing this replacement quickly? I've tried using enumerate and iterrows but couldn't get the syntax to work. Would appreciate any help - thank you
Here's one way with np.put_along_axis -
In [50]: df
Out[50]:
datetime1 datetime2 datetime3 datetime4
1 5 6 5 5
2 7 2 3 5
3 4 2 3 2
4 6 4 4 7
5 7 3 8 9
In [51]: index_arr = np.array([3, 2, 0, 1, 2])
In [52]: replace_arr = np.array([14, 12, 23, 17, 15])
In [53]: np.put_along_axis(df.to_numpy(),index_arr[:,None],replace_arr[:,None],axis=1)
In [54]: df
Out[54]:
datetime1 datetime2 datetime3 datetime4
1 5 6 5 14
2 7 2 12 5
3 23 2 3 2
4 6 17 4 7
5 7 3 15 9
IIUC, you can just assign to .values (or .to_numpy(copy=False)):
# <= 0.23
df.values[np.arange(len(df)), index_arr] = replace_arr
# 0.24+
df.to_numpy(copy=False)[np.arange(len(df)), index_arr] = replace_arr
df
datetime1 datetime2 datetime3 datetime4
id
1 5 6 5 14
2 7 2 12 5
3 23 2 3 2
4 6 17 4 7
5 7 3 15 9
Ended up using .iat:
for x, y, z in zip(np.arange(len(df)), index_arr, replace_arr):
    df.iat[x, y] = z
df
Out[657]:
datetime1 datetime2 datetime3 datetime4
id
1 5 6 5 14
2 7 2 12 5
3 23 2 3 2
4 6 17 4 7
5 7 3 15 9
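One caveat worth noting: writing through `df.values` or `df.to_numpy()` only reaches the DataFrame when the returned array is a view of its data, which generally holds for a single-dtype frame. A sketch that avoids relying on that detail writes into an explicit copy and rebuilds the frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[5, 6, 5, 5],
                   [7, 2, 3, 5],
                   [4, 2, 3, 2],
                   [6, 4, 4, 7],
                   [7, 3, 8, 9]],
                  columns=['datetime1', 'datetime2', 'datetime3', 'datetime4'],
                  index=pd.Index([1, 2, 3, 4, 5], name='id'))

index_arr = np.array([3, 2, 0, 1, 2])
replace_arr = np.array([14, 12, 23, 17, 15])

arr = df.to_numpy().copy()  # explicit copy: safe regardless of dtype layout
arr[np.arange(len(arr)), index_arr] = replace_arr  # one fancy-indexed write per row
df = pd.DataFrame(arr, columns=df.columns, index=df.index)
```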

Slicing Pandas data frame into two parts

Actually I thought this should be very easy. I have a pandas data frame with, let's say, 100 columns, and I want a subset containing columns 0:30 and 77:99.
What I've done so far is:
df_1 = df.iloc[:,0:30]
df_2 = df.iloc[:,77:99]
df2 = pd.concat([df_1, df_2], axis=1, join_axes=[df_1.index])
Is there an easier way?
Use numpy.r_ to concatenate the slice indices:
df2 = df.iloc[:, np.r_[0:30, 77:99]]
Sample:
df = pd.DataFrame(np.random.randint(10, size=(5,15)))
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 6 2 9 5 4 6 9 9 7 9 6 6 1 0 6
1 5 6 7 0 7 8 7 9 4 8 1 2 0 8 5
2 5 6 1 6 7 6 1 5 5 4 6 3 2 3 0
3 4 3 1 3 3 8 3 6 7 1 8 6 2 1 8
4 3 8 2 3 7 3 6 4 4 6 2 6 9 4 9
df2 = df.iloc[:, np.r_[0:3, 7:9]]
print (df2)
0 1 2 7 8
0 6 2 9 9 7
1 5 6 7 9 4
2 5 6 1 5 5
3 4 3 1 6 7
4 3 8 2 4 4
The concat approach for comparison (note that join_axes was removed in pandas 1.0; a plain concat suffices when the indices already match):
df_1 = df.iloc[:, 0:3]
df_2 = df.iloc[:, 7:9]
df2 = pd.concat([df_1, df_2], axis=1)
print (df2)
0 1 2 7 8
0 6 2 9 9 7
1 5 6 7 9 4
2 5 6 1 5 5
3 4 3 1 6 7
4 3 8 2 4 4

Pandas: How to get max and min values and write for every row?

I have data like this:
>> df
A B C
0 1 5 1
1 1 7 1
2 1 6 1
3 1 7 1
4 2 5 1
5 2 8 1
6 2 6 1
7 3 7 1
8 3 9 1
9 4 6 1
10 4 7 1
11 4 1 1
I want to take the max and min values of column B for each distinct value in column A, and write the results onto every row of the original table. My code is:
df1 = df.groupby(['A']).B.transform(max)
df1 = df1.rename(columns={'B':'B_max'})
df2 = df.groupby(['A']).B.transform(min)
df2 = df2.rename(columns={'B':'B_min'})
df3 = df.join(df1['B_max']).join(df2['B_min'])
This is the result.
A B C B_max B_min
0 1 5 1
1 1 7 1 7
2 1 6 1
3 1 4 1 4
4 2 5 1
5 2 8 1 8
6 2 6 1 6
7 3 7 1 7
8 3 9 1 9
9 4 6 1
10 4 7 1 7
11 4 1 1 1
But I want to table look like this;
A B C B_max B_min
0 1 5 1 7 4
1 1 7 1 7 4
2 1 6 1 7 4
3 1 4 1 7 4
4 2 5 1 8 6
5 2 8 1 8 6
6 2 6 1 8 6
7 3 7 1 9 7
8 3 9 1 9 7
9 4 6 1 7 1
10 4 7 1 7 1
11 4 1 1 7 1
How can I adapt the code so the result looks like this?
I think you only need to assign the values to new columns, because transform returns a Series with the same length as df:
df = pd.DataFrame({
    'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
    'B': [5, 7, 6, 7, 5, 8, 6, 7, 9, 6, 7, 1],
    'C': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]})
print (df)
A B C
0 1 5 1
1 1 7 1
2 1 6 1
3 1 7 1
4 2 5 1
5 2 8 1
6 2 6 1
7 3 7 1
8 3 9 1
9 4 6 1
10 4 7 1
11 4 1 1
df['B_max'] = df.groupby(['A']).B.transform(max)
df['B_min'] = df.groupby(['A']).B.transform(min)
print (df)
A B C B_max B_min
0 1 5 1 7 5
1 1 7 1 7 5
2 1 6 1 7 5
3 1 7 1 7 5
4 2 5 1 8 5
5 2 8 1 8 5
6 2 6 1 8 5
7 3 7 1 9 7
8 3 9 1 9 7
9 4 6 1 7 1
10 4 7 1 7 1
11 4 1 1 7 1
g = df.groupby('A').B
df['B_max'] = g.transform(max)
df['B_min'] = g.transform(min)
print (df)
A B C B_max B_min
0 1 5 1 7 5
1 1 7 1 7 5
2 1 6 1 7 5
3 1 7 1 7 5
4 2 5 1 8 5
5 2 8 1 8 5
6 2 6 1 8 5
7 3 7 1 9 7
8 3 9 1 9 7
9 4 6 1 7 1
10 4 7 1 7 1
11 4 1 1 7 1
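An alternative sketch that computes both statistics in one groupby pass and broadcasts them back with a merge (not from the answer above; the named-aggregation syntax needs pandas 0.25+):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
    'B': [5, 7, 6, 7, 5, 8, 6, 7, 9, 6, 7, 1],
    'C': [1] * 12})

# one pass over the groups, then broadcast back by merging on A
stats = df.groupby('A')['B'].agg(B_max='max', B_min='min').reset_index()
out = df.merge(stats, on='A')
```

`transform` is the more direct tool here, but the agg-plus-merge form is handy when several statistics are needed at once.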
