I have two pandas dataframes that I am attempting to join. Both have the same length and index.
df1.index
RangeIndex(start=0, stop=1857, step=1)
df2.index
RangeIndex(start=0, stop=1857, step=1)
I do the following to only join columns that don't overlap.
cols = df2.columns.difference(df1.columns)
cols
df_merged = pd.merge(df1, df2[cols], left_index=True, right_index=True)
While df_merged has the expected length (1857 rows), several rows contain NaNs, even though df1 has no NaN rows.
What am I missing here? How do I merge dataframes based on their indices?
Although your code is correct in principle, try df.join or pd.concat:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(1, 10, (10, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randint(1, 10, (10, 4)), columns=list('CDEF'))
cols = df2.columns.difference(df1.columns)
df_merged = pd.merge(df1, df2[cols], left_index=True, right_index=True)
Output:
>>> df1.index
RangeIndex(start=0, stop=10, step=1)
>>> df2.index
RangeIndex(start=0, stop=10, step=1)
>>> cols
Index(['E', 'F'], dtype='object')
>>> df_merged
A B C D E F
0 8 3 4 3 6 3
1 2 5 3 4 5 9
2 4 1 7 7 5 1
3 2 4 6 7 7 8
4 6 6 4 8 5 8
5 8 6 8 4 4 5
6 7 9 7 7 6 6
7 8 4 2 3 7 1
8 5 7 1 1 8 5
9 8 2 5 8 5 9
Join
>>> df1.join(df2[cols])
A B C D E F
0 8 3 4 3 6 3
1 2 5 3 4 5 9
2 4 1 7 7 5 1
3 2 4 6 7 7 8
4 6 6 4 8 5 8
5 8 6 8 4 4 5
6 7 9 7 7 6 6
7 8 4 2 3 7 1
8 5 7 1 1 8 5
9 8 2 5 8 5 9
Concat
>>> pd.concat([df1, df2[cols]], axis=1)
A B C D E F
0 8 3 4 3 6 3
1 2 5 3 4 5 9
2 4 1 7 7 5 1
3 2 4 6 7 7 8
4 6 6 4 8 5 8
5 8 6 8 4 4 5
6 7 9 7 7 6 6
7 8 4 2 3 7 1
8 5 7 1 1 8 5
9 8 2 5 8 5 9
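As a quick sanity check that the three approaches really agree when both frames share an aligned RangeIndex, the following sketch builds the frames with a fixed seed and compares the results:

```python
import numpy as np
import pandas as pd

# Reproducible toy frames with overlapping columns C and D
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.integers(1, 10, (10, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(rng.integers(1, 10, (10, 4)), columns=list('CDEF'))

# Only the columns of df2 that df1 doesn't already have
cols = df2.columns.difference(df1.columns)

merged = pd.merge(df1, df2[cols], left_index=True, right_index=True)
joined = df1.join(df2[cols])
concatenated = pd.concat([df1, df2[cols]], axis=1)
```

All three produce identical frames here; they only diverge when the indices differ, which is when join/concat semantics (outer alignment, NaN fill) start to matter.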
Related
I have a dataframe and a dictionary whose keys are some of the dataframe's columns and whose values are numbers. I want to update the dataframe based on the dictionary, keeping the higher value in each case.
>>> df1
a b c d e f
0 4 2 6 2 8 1
1 3 6 7 7 8 5
2 2 1 1 6 8 7
3 1 2 7 3 3 1
4 1 7 2 6 7 6
5 4 8 8 2 2 1
and the dictionary is
compare = {'a':4, 'c':7, 'e':3}
So I want to check the values in columns ['a','c','e'] and replace with the value in the dictionary, if it is higher.
What I have tried is this:
comp = pd.DataFrame(pd.Series(compare).reindex(df1.columns).fillna(0)).T
df1[df1.columns] = df1.apply(lambda x: np.where(x>comp, x, comp)[0] ,axis=1)
Expected output:
>>> df1
a b c d e f
0 4 2 7 2 8 1
1 4 6 7 7 8 5
2 4 1 7 6 8 7
3 4 2 7 3 3 1
4 4 7 7 6 7 6
5 4 8 8 2 3 1
Another possible solution, based on NumPy:
cols = list(compare.keys())
df1[cols] = np.maximum(df1[cols].values, np.array(list(compare.values())))
Output:
a b c d e f
0 4 2 7 2 8 1
1 4 6 7 7 8 5
2 4 1 7 6 8 7
3 4 2 7 3 3 1
4 4 7 7 6 7 6
5 4 8 8 2 3 1
First obtain a Series whose index is the columns of df1 and whose values come from the dictionary (columns missing from the dictionary become NaN):
limits = df1.columns.map(compare).to_series(index=df1.columns)
Then check whether the frame's values are less than the "limits"; where they are, substitute the limit, otherwise keep the value as is:
new = df1.mask(df1 < limits, limits, axis=1)
to get
>>> new
a b c d e f
0 4 2 7 2 8 1
1 4 6 7 7 8 5
2 4 1 7 6 8 7
3 4 2 7 3 3 1
4 4 7 7 6 7 6
5 4 8 8 2 3 1
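The same "take the higher of value and threshold" operation can also be written as a single pandas call with DataFrame.clip, using the dictionary as a per-column lower bound (a minimal sketch reproducing the question's df1; the threshold Series aligns with the column subset on axis=1):

```python
import pandas as pd

# The df1 from the question
df1 = pd.DataFrame({'a': [4, 3, 2, 1, 1, 4],
                    'b': [2, 6, 1, 2, 7, 8],
                    'c': [6, 7, 1, 7, 2, 8],
                    'd': [2, 7, 6, 3, 6, 2],
                    'e': [8, 8, 8, 3, 7, 2],
                    'f': [1, 5, 7, 1, 6, 1]})
compare = {'a': 4, 'c': 7, 'e': 3}

cols = list(compare)                 # ['a', 'c', 'e']
lower = pd.Series(compare)           # per-column lower bounds
# clip aligns `lower` with the columns of df1[cols] via axis=1
df1[cols] = df1[cols].clip(lower=lower, axis=1)
```

This leaves the untouched columns (b, d, f) exactly as they were and raises every value in a, c, e up to its threshold.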
I have two data frames: X_oos_top_10 and y_oos_top_10. I need to filter both by X_oos_top_10["comm"] == 1.
I do it for one:
X_oos_top_10_comm1 = X_oos_top_10[X_oos_top_10["comm"] == 1]
But for the other one I get: IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
y_oos_top_10_comm1 = y_oos_top_10[X_oos_top_10["comm"] == 1]
I have no idea how to do it.
Assuming X and y have the same length, you can use indexing.
Set up a minimal reproducible example:
import numpy as np
import pandas as pd

X_oos_top_10 = pd.DataFrame({'comm': np.random.randint(1, 10, 10)})
y_oos_top_10 = pd.DataFrame(np.random.randint(1, 10, (10, 4)), columns=list('ABCD'))
print(X_oos_top_10)
# Output:
comm
0 5
1 6
2 2
3 6
4 1
5 6
6 1
7 4
8 5
9 8
print(y_oos_top_10)
# Output:
A B C D
0 2 9 1 6
1 9 8 5 4
2 1 6 7 6
3 6 3 6 5
4 2 6 8 3
5 2 6 6 5
6 4 4 3 5
7 6 3 7 5
8 2 8 8 7
9 4 9 1 4
1st method
idx = X_oos_top_10[X_oos_top_10["comm"] == 1].index
out = y_oos_top_10.loc[idx]
print(out)
# Output:
A B C D
4 2 6 8 3
6 4 4 3 5
2nd method
Xy_oos_top_10 = X_oos_top_10.join(y_oos_top_10)
out = Xy_oos_top_10[Xy_oos_top_10['comm'] == 1]
print(out)
# Output:
comm A B C D
4 1 2 6 8 3
6 1 4 4 3 5
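The IndexingError itself arises because pandas aligns the boolean Series' index with y's index before filtering; when the two indices differ, alignment fails. Converting the mask to a plain NumPy array makes the filter purely positional and sidesteps alignment entirely (a sketch with toy frames whose indices deliberately differ):

```python
import pandas as pd

# Deliberately mismatched indices to trigger the alignment problem
X_oos_top_10 = pd.DataFrame({'comm': [5, 1, 2, 1]})              # RangeIndex 0..3
y_oos_top_10 = pd.DataFrame({'y': [10, 20, 30, 40]},
                            index=[100, 101, 102, 103])          # different index

# .to_numpy() strips the index, so filtering is by position only
mask = (X_oos_top_10['comm'] == 1).to_numpy()
out = y_oos_top_10[mask]
```

This keeps rows of y at the same positions where X has comm == 1, regardless of index labels.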
If I have a Pandas Dataframe like this:
A
1 8
2 9
3 7
4 2
How do I repeat it x number of times? For example, if I wanted to repeat it 3 times I would get something like this:
A B C D
1 8 8 8 8
2 9 9 9 9
3 7 7 7 7
4 2 2 2 2
Use concat:
n = 3
pd.concat([df] * (n+1), axis=1, ignore_index=True)
0 1 2 3
1 8 8 8 8
2 9 9 9 9
3 7 7 7 7
4 2 2 2 2
If you want the columns renamed, use rename:
(pd.concat([df] * (n+1), axis=1, ignore_index=True)
.rename(lambda x: chr(ord('A')+x), axis=1))
A B C D
1 8 8 8 8
2 9 9 9 9
3 7 7 7 7
4 2 2 2 2
You can use NumPy to repeat the values and reconstruct the dataframe.
n = 3
pd.DataFrame(np.tile(df.values, n + 1), columns=df.columns.tolist() + list('BCD'))
A B C D
0 8 8 8 8
1 9 9 9 9
2 7 7 7 7
3 2 2 2 2
You can use concat like @coldspeed did.
Or you can set them manually.
df['B'] = df.A
df['C'] = df.A
df['D'] = df.A
print(df)
A B C D
1 8 8 8 8
2 9 9 9 9
3 7 7 7 7
4 2 2 2 2
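The manual assignments above generalize to any repeat count with a small loop (a sketch; the letter column names are generated here just for illustration):

```python
import string
import pandas as pd

# The single-column frame from the question
df = pd.DataFrame({'A': [8, 9, 7, 2]}, index=[1, 2, 3, 4])

n = 3  # number of extra copies of column A
for name in string.ascii_uppercase[1:n + 1]:  # 'B', 'C', 'D'
    df[name] = df['A']
```

Unlike the np.tile approach, this preserves the original index.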
Actually I thought this would be very easy. I have a pandas dataframe with, let's say, 100 columns, and I want a subset containing columns 0:30 and 77:99.
What I've done so far is:
df_1 = df.iloc[:,0:30]
df_2 = df.iloc[:,77:99]
df2 = pd.concat([df_1, df_2], axis=1, join_axes=[df_1.index])
Is there an easier way?
Use numpy.r_ to concatenate indices:
df2 = df.iloc[:, np.r_[0:30, 77:99]]
Sample:
df = pd.DataFrame(np.random.randint(10, size=(5,15)))
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 6 2 9 5 4 6 9 9 7 9 6 6 1 0 6
1 5 6 7 0 7 8 7 9 4 8 1 2 0 8 5
2 5 6 1 6 7 6 1 5 5 4 6 3 2 3 0
3 4 3 1 3 3 8 3 6 7 1 8 6 2 1 8
4 3 8 2 3 7 3 6 4 4 6 2 6 9 4 9
df2 = df.iloc[:, np.r_[0:3, 7:9]]
print (df2)
0 1 2 7 8
0 6 2 9 9 7
1 5 6 7 9 4
2 5 6 1 5 5
3 4 3 1 6 7
4 3 8 2 4 4
df_1 = df.iloc[:, 0:3]
df_2 = df.iloc[:, 7:9]
df2 = pd.concat([df_1, df_2], axis=1)  # join_axes was removed in pandas 1.0
print (df2)
0 1 2 7 8
0 6 2 9 9 7
1 5 6 7 9 4
2 5 6 1 5 5
3 4 3 1 6 7
4 3 8 2 4 4
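Under the hood, np.r_ simply concatenates the slice ranges into one integer array, which is exactly the kind of indexer .iloc accepts (a quick sketch):

```python
import numpy as np

# np.r_ expands each slice and concatenates the results
idx = np.r_[0:3, 7:9]
```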
I have a huge dataframe with 282 columns and 500K rows. I wish to remove a list of columns from the dataframe using the column indices. The below code works for sequential columns.
df1 = df.ix[:,[0:2]]
The problem is that my column indices are not sequential.
For example, I want to remove columns 0,1,2 and 5 from df. I tried the following code:
df1 = df.ix[:,[0:2,5]]
I am getting the following error:
SyntaxError: invalid syntax
Any suggestions?
Select columns other than 0, 1, 2 and 5 with (.ix has been removed from pandas; use .iloc):
df.iloc[:, [3, 4] + list(range(6, 282))]
Or a little more dynamically:
df.iloc[:, [3, 4] + list(range(6, df.shape[1]))]
For a DataFrame (again, .ix is gone in modern pandas; use .iloc), try
df1 = df.iloc[:, [0, 1, 2, 5]]
or, if it's a NumPy array,
data[:, [0, 1, 2] + [5]]
Use np.r_[...] for concatenating slices along the first axis
DF:
In [98]: df = pd.DataFrame(np.random.randint(10, size=(5, 12)))
In [99]: df
Out[99]:
0 1 2 3 4 5 6 7 8 9 10 11
0 0 7 2 9 9 0 7 3 5 8 8 1
1 4 9 0 4 0 2 4 8 8 7 1 9
2 2 1 1 2 7 4 4 6 1 2 9 8
3 1 5 0 8 2 2 4 1 1 4 8 4
4 4 6 3 2 2 4 1 6 2 6 9 0
Solution:
In [107]: df.iloc[:, np.r_[3:5, 6:df.shape[1]]]
Out[107]:
3 4 6 7 8 9 10 11
0 9 9 7 3 5 8 8 1
1 4 0 4 8 8 7 1 9
2 2 7 4 6 1 2 9 8
3 8 2 4 1 1 4 8 4
4 2 2 1 6 2 6 9 0
In [108]: np.r_[3:5, 6:df.shape[1]]
Out[108]: array([ 3, 4, 6, 7, 8, 9, 10, 11])
or
In [110]: df.columns.difference([0,1,2,5])
Out[110]: Int64Index([3, 4, 6, 7, 8, 9, 10, 11], dtype='int64')
In [111]: df[df.columns.difference([0,1,2,5])]
Out[111]:
3 4 6 7 8 9 10 11
0 9 9 7 3 5 8 8 1
1 4 0 4 8 8 7 1 9
2 2 7 4 6 1 2 9 8
3 8 2 4 1 1 4 8 4
4 2 2 1 6 2 6 9 0
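Since the question is about removing columns by position, in modern pandas (where .ix has been removed) this reads most directly with DataFrame.drop (a sketch on a small frame):

```python
import numpy as np
import pandas as pd

# Small frame with integer column labels 0..5
df = pd.DataFrame(np.arange(30).reshape(5, 6))

# Translate positions to labels, then drop them
out = df.drop(columns=df.columns[[0, 1, 2, 5]])
```

This works regardless of whether the column labels are integers or strings, because the positions are translated to labels via df.columns first.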