Actually I thought this would be very easy. I have a pandas DataFrame with, let's say, 100 columns, and I want a subset containing columns 0:30 and 77:99.
What I've done so far is:
df_1 = df.iloc[:, 0:30]
df_2 = df.iloc[:, 77:99]
df2 = pd.concat([df_1, df_2], axis=1, join_axes=[df_1.index])
Is there an easier way?
Use numpy.r_ to concatenate the index slices:
df2 = df.iloc[:, np.r_[0:30, 77:99]]
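np.r_ translates the slice notation into one flat array of integer positions, which iloc accepts directly:

import numpy as np

print(np.r_[0:3, 7:9])
# [0 1 2 7 8]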
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(5, 15)))
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 6 2 9 5 4 6 9 9 7 9 6 6 1 0 6
1 5 6 7 0 7 8 7 9 4 8 1 2 0 8 5
2 5 6 1 6 7 6 1 5 5 4 6 3 2 3 0
3 4 3 1 3 3 8 3 6 7 1 8 6 2 1 8
4 3 8 2 3 7 3 6 4 4 6 2 6 9 4 9
df2 = df.iloc[:, np.r_[0:3, 7:9]]
print (df2)
0 1 2 7 8
0 6 2 9 9 7
1 5 6 7 9 4
2 5 6 1 5 5
3 4 3 1 6 7
4 3 8 2 4 4
df_1 = df.iloc[:, 0:3]
df_2 = df.iloc[:, 7:9]
# join_axes was removed in pandas 1.0; a plain concat aligns on the shared index
df2 = pd.concat([df_1, df_2], axis=1)
print (df2)
0 1 2 7 8
0 6 2 9 9 7
1 5 6 7 9 4
2 5 6 1 5 5
3 4 3 1 6 7
4 3 8 2 4 4
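np.r_ also accepts step slices and scalars in the same expression, so irregular column selections stay compact:

print(np.r_[0:6:2, 10, 14:16])
# [ 0  2  4 10 14 15]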
I have a dataframe and a dictionary whose keys are some of the dataframe's columns. I want to update the dataframe based on the dictionary values, picking the higher value.
>>> df1
a b c d e f
0 4 2 6 2 8 1
1 3 6 7 7 8 5
2 2 1 1 6 8 7
3 1 2 7 3 3 1
4 1 7 2 6 7 6
5 4 8 8 2 2 1
and the dictionary is
compare = {'a':4, 'c':7, 'e':3}
So I want to check the values in columns ['a','c','e'] and replace each one with the dictionary value if the dictionary value is higher.
What I have tried is this:
comp = pd.DataFrame(pd.Series(compare).reindex(df1.columns).fillna(0)).T
df1[df1.columns] = df1.apply(lambda x: np.where(x > comp, x, comp)[0], axis=1)
Expected output:
>>>df1
a b c d e f
0 4 2 7 2 8 1
1 4 6 7 7 8 5
2 4 1 7 6 8 7
3 4 2 7 3 3 1
4 4 7 7 6 7 6
5 4 8 8 2 3 1
Another possible solution, based on numpy:
cols = list(compare.keys())
df1[cols] = np.maximum(df1[cols].values, np.array(list(compare.values())))
Output:
a b c d e f
0 4 2 7 2 8 1
1 4 6 7 7 8 5
2 4 1 7 6 8 7
3 4 2 7 3 3 1
4 4 7 7 6 7 6
5 4 8 8 2 3 1
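A pandas-only variant of the same idea is DataFrame.clip, sketched here relying on clip aligning a Series of lower bounds with the columns when axis=1 is given:

cols = list(compare)
# values below their per-column limit are raised up to the limit
df1[cols] = df1[cols].clip(lower=pd.Series(compare), axis=1)

Raising every value to at least its limit is exactly "pick the higher value".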
limits = df1.columns.map(compare).to_series(index=df1.columns)
new = df1.mask(df1 < limits, limits, axis=1)
First obtain a Series whose index is the columns of df1 and whose values come from the dictionary; columns missing from the dictionary map to NaN, and since any comparison with NaN is False, those columns pass through unchanged.
Then check whether the frame's values are less than these "limits": where they are, take the limit; otherwise keep the value as is, to get
>>> new
a b c d e f
0 4 2 7 2 8 1
1 4 6 7 7 8 5
2 4 1 7 6 8 7
3 4 2 7 3 3 1
4 4 7 7 6 7 6
5 4 8 8 2 3 1
I have two pandas dataframes that I am attempting to join. Both have the same length and index.
df1.index
RangeIndex(start=0, stop=1857, step=1)
df2.index
RangeIndex(start=0, stop=1857, step=1)
I do the following to only join columns that don't overlap.
cols = df2.columns.difference(df1.columns)
cols
df_merged = pd.merge(df1, df2[cols], left_index=True, right_index=True)
While df_merged has the 1857 rows I expect, several rows contain NaNs, even though df1 has no NaN rows.
What am I missing here? How do I merge dataframes based on their indices?
Your merge code itself is fine: a one-to-one merge on two identical RangeIndexes cannot introduce NaNs, so the NaNs most likely already exist in df2[cols]. That said, df.join or pd.concat are simpler for index-aligned joins:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(1, 10, (10, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randint(1, 10, (10, 4)), columns=list('CDEF'))
cols = df2.columns.difference(df1.columns)
df_merged = pd.merge(df1, df2[cols], left_index=True, right_index=True)
Output:
>>> df1.index
RangeIndex(start=0, stop=10, step=1)
>>> df2.index
RangeIndex(start=0, stop=10, step=1)
>>> cols
Index(['E', 'F'], dtype='object')
>>> df_merged
A B C D E F
0 8 3 4 3 6 3
1 2 5 3 4 5 9
2 4 1 7 7 5 1
3 2 4 6 7 7 8
4 6 6 4 8 5 8
5 8 6 8 4 4 5
6 7 9 7 7 6 6
7 8 4 2 3 7 1
8 5 7 1 1 8 5
9 8 2 5 8 5 9
Join
>>> df1.join(df2[cols])
A B C D E F
0 8 3 4 3 6 3
1 2 5 3 4 5 9
2 4 1 7 7 5 1
3 2 4 6 7 7 8
4 6 6 4 8 5 8
5 8 6 8 4 4 5
6 7 9 7 7 6 6
7 8 4 2 3 7 1
8 5 7 1 1 8 5
9 8 2 5 8 5 9
Concat
>>> pd.concat([df1, df2[cols]], axis=1)
A B C D E F
0 8 3 4 3 6 3
1 2 5 3 4 5 9
2 4 1 7 7 5 1
3 2 4 6 7 7 8
4 6 6 4 8 5 8
5 8 6 8 4 4 5
6 7 9 7 7 6 6
7 8 4 2 3 7 1
8 5 7 1 1 8 5
9 8 2 5 8 5 9
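As a side note on the original question: since an index-aligned join of identical RangeIndexes cannot create NaNs, a quick diagnostic (hypothetical, reusing the question's variable names) is to look up the NaN rows of df_merged in df2:

# rows whose newly added columns contain NaNs after the merge
nan_idx = df_merged.index[df_merged[cols].isna().any(axis=1)]
# inspect the corresponding source rows of df2
print(df2.loc[nan_idx, cols])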
My df looks as follows:
import pandas as pd
d = {'col1': [1, 2, 3, 3, 1, 2, 2, 3, 4, 1, 1, 2]}
df = pd.DataFrame(data=d)
Now I want to add a new column following this scheme:
col1  new_col
1     1
2     2
3     3
3     3
1     4
2     5
2     5
3     6
4     7
1     8
1     8
2     9
Every time the value changes (e.g. once the sequence starts again at 1), the counter should just keep counting up.
At the moment I have only got as far as adding a difference column:
df['diff'] = df['col1'].diff()
How can I extend this approach?
Try with
df.col1.diff().ne(0).cumsum()
Out[94]:
0 1
1 2
2 3
3 3
4 4
5 5
6 5
7 6
8 7
9 8
10 8
11 9
Name: col1, dtype: int32
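Note that the first element of diff() is NaN, and NaN.ne(0) evaluates to True, so the counter correctly starts at 1:

import pandas as pd

s = pd.Series([1, 2, 3, 3, 1])
print(s.diff().tolist())        # [nan, 1.0, 1.0, 0.0, -2.0]
print(s.diff().ne(0).tolist())  # [True, True, True, False, True]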
Try:
df["new_col"] = df["col1"].ne(df["col1"].shift()).cumsum()
>>> df
col1 new_col
0 1 1
1 2 2
2 3 3
3 3 3
4 1 4
5 2 5
6 2 5
7 3 6
8 4 7
9 1 8
10 1 8
11 2 9
I have a dataframe as below
A B C D E F G H I
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
I want to multiply every 3rd column, starting from the 2nd column (i.e. B, E, H), in the last 2 rows by 5, to get the output below.
How can I accomplish this?
A B C D E F G H I
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 10 3 4 25 6 7 40 9
1 10 3 4 25 6 7 40 9
I am able to select the cells I need with df.iloc[-2:, 1::3], which results in the df below, but I am not able to proceed further.
B E H
2 5 8
2 5 8
I know that I can select the same cells with loc instead of iloc, and then the calculation is straightforward, but I am not able to figure it out.
The column names and cell values cannot be used, since they change (the df here is just dummy data).
You can assign back to the same selection of rows/columns:
df.iloc[-2:,1::3] = df.iloc[-2:,1::3].mul(5)
#alternative
#df.iloc[-2:,1::3] = df.iloc[-2:,1::3] * 5
print (df)
A B C D E F G H I
0 1 2 3 4 5 6 7 8 9
1 1 2 3 4 5 6 7 8 9
2 1 2 3 4 5 6 7 8 9
3 1 2 3 4 5 6 7 8 9
4 1 2 3 4 5 6 7 8 9
5 1 10 3 4 25 6 7 40 9
6 1 10 3 4 25 6 7 40 9
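A shorter equivalent is augmented assignment on the same positional selection:

df.iloc[-2:, 1::3] *= 5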
I have a dataset where I want to add a suffix to column names based on their positions. For example, the 1st to 4th columns should be named 'abc_1', the 5th to 8th columns 'abc_2', and so on.
I have tried using DataFrame.rename, but it is a time-consuming process. What would be the most efficient way to achieve this?
A good choice here is to create a MultiIndex, to avoid duplicated column names - build the first level by floor-dividing the column positions by 4 and adding the prefix with f-strings:
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(5, 10)))
df.columns = [[f'abc_{i+1}' for i in df.columns // 4], df.columns]
print (df)
abc_1 abc_2 abc_3
0 1 2 3 4 5 6 7 8 9
0 2 2 6 1 3 9 6 1 0 1
1 9 0 0 9 3 4 0 0 4 1
2 7 3 2 4 7 2 4 8 0 7
3 9 3 4 6 1 5 6 2 1 8
4 3 5 0 2 6 2 4 4 6 3
A more general solution, if the column names are not a RangeIndex:
cols = [f'abc_{i+1}' for i in np.arange(len(df.columns)) // 4]
df.columns = [cols, df.columns]
print (df)
abc_1 abc_2 abc_3
0 1 2 3 4 5 6 7 8 9
0 2 2 6 1 3 9 6 1 0 1
1 9 0 0 9 3 4 0 0 4 1
2 7 3 2 4 7 2 4 8 0 7
3 9 3 4 6 1 5 6 2 1 8
4 3 5 0 2 6 2 4 4 6 3
It is also possible to specify the MultiIndex level names with MultiIndex.from_arrays:
df.columns = pd.MultiIndex.from_arrays([cols, df.columns], names=('level0','level1'))
print (df)
level0 abc_1 abc_2 abc_3
level1 0 1 2 3 4 5 6 7 8 9
0 2 2 6 1 3 9 6 1 0 1
1 9 0 0 9 3 4 0 0 4 1
2 7 3 2 4 7 2 4 8 0 7
3 9 3 4 6 1 5 6 2 1 8
4 3 5 0 2 6 2 4 4 6 3
Each first-level group can then be selected with xs:
print (df.xs('abc_2', axis=1))
4 5 6 7
0 3 9 6 1
1 3 4 0 0
2 7 2 4 8
3 1 5 6 2
4 6 2 4 4
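If duplicated column names are acceptable instead of a MultiIndex, a flat rename is possible as well - a sketch, starting again from the original single-level columns; note that selecting a duplicated label then returns all matching columns at once:

df.columns = [f'abc_{i // 4 + 1}' for i in range(len(df.columns))]
print(df['abc_2'])  # all four columns labelled 'abc_2'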