As per the pandas 0.19.2 documentation, I can provide a keys argument to create a resulting multi-index DataFrame. An example (from the pandas documentation) is:
result = pd.concat(frames, keys=['x', 'y', 'z'])
How would I concat the dataframes so that the keys apply at the column level instead of the index level?
What I basically need is for df1 and df2 to sit side by side after the concat, each under its own key at the column level.
This is supported by the keys parameter of pd.concat when specifying axis=1:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.random((4, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.random((4, 3)), columns=list('BDF'), index=[2, 3, 6, 7])
df = pd.concat([df1, df2], keys=['X', 'Y'], axis=1)
The resulting output:
X Y
A B C D B D F
0 0.654406 0.495906 0.601100 0.309276 NaN NaN NaN
1 0.020527 0.814065 0.907590 0.924307 NaN NaN NaN
2 0.239598 0.089270 0.033585 0.870829 0.882028 0.626650 0.622856
3 0.983942 0.103573 0.370121 0.070442 0.986487 0.848203 0.089874
6 NaN NaN NaN NaN 0.664507 0.319789 0.868133
7 NaN NaN NaN NaN 0.341145 0.308469 0.884074
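As a usage note (my addition, not part of the original answer): the keys become the top level of a column MultiIndex, so each original frame can be recovered by selecting its key:

df['X']                    # the df1 block, reindexed on the union of both indexes
df.loc[2:3, ('Y', 'D')]    # column D of the df2 block, rows 2 and 3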
I have two dataframes, df1 and df2.
In df1 I have 4 columns (A, B, C, D) and two rows, and in df2 I have the same 4 columns (A, B, C, D) and two rows.
Now I want to subtract the two dataframes column-wise, like df1['A'] - df2['A'] and so on, but I don't know how to do it.
Just do the subtraction, but keep the indexes in mind. For example (this answer uses dask.dataframe, which follows the same alignment rules as pandas), say I have df1 and df2 with the same columns but different indexes:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df1 = dd.from_array(np.arange(8).reshape(2, 4), columns=['A', 'B', 'C', 'D'])
df2 = dd.from_pandas(pd.DataFrame(
    np.arange(8).reshape(2, 4),
    columns=['A', 'B', 'C', 'D'],
    index=[1, 2]
), npartitions=1)
Then:
(df1 - df2).compute()
# A B C D
# 0 NaN NaN NaN NaN
# 1 4.0 4.0 4.0 4.0
# 2 NaN NaN NaN NaN
On the other hand, we can match df2's index to df1's and then subtract:
df2 = df2.assign(idx=1)
df2 = df2.set_index(df2.idx.cumsum() - 1)
df2 = df2.drop(columns=['idx'])
(df1 - df2).compute()
# A B C D
# 0 0 0 0 0
# 1 0 0 0 0
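For comparison (my addition, not part of the original answer), the same alignment behaviour holds in plain pandas, and the index mismatch can be fixed more simply with reset_index; pdf1 and pdf2 here are hypothetical stand-ins for the asker's frames:

import numpy as np
import pandas as pd

pdf1 = pd.DataFrame(np.arange(8).reshape(2, 4), columns=['A', 'B', 'C', 'D'])
pdf2 = pd.DataFrame(np.arange(8).reshape(2, 4),
                    columns=['A', 'B', 'C', 'D'], index=[1, 2])

# Drop pdf2's index so the rows align positionally before subtracting
pdf1 - pdf2.reset_index(drop=True)
#    A  B  C  D
# 0  0  0  0  0
# 1  0  0  0  0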
I'm trying to take a slice of a column from one pandas dataframe, transpose the slice, and insert it into a similarly sized row slice in a different dataframe. Labels and indexes in the two dataframes are different. With large dataframes, I'm currently running a for loop to copy each individual value, cell by cell as it were, which is incredibly inefficient.
Other than the for loop, I have tried .loc and .iloc with transpose, but with no success. pivot, pivot_table, and melt don't seem to be applicable here, or I cannot get my head around how to apply them to this seemingly simple problem.
# Two dataframes here
import pandas as pd
import numpy as np
numRng = np.arange(20).reshape((5, 4))
df1 = pd.DataFrame(numRng)
newCols = ('A', 'B', 'C', 'D', 'E', 'F')
for newCol in newCols:
    df1[newCol] = np.nan
numRng2 = np.arange(1000,976,-1).reshape((6, 4))
df2 = pd.DataFrame(numRng2)
df2.columns = ['M', 'N', 'O', 'P']
df1
df2
# Trying to copy a column slice from df2, transpose it, and insert it
# into a df1 row slice; it has no effect (the values stay NaN)
df1.loc[1, 'B':'E'] = df2.loc[1:4, 'M'].transpose()
df1
# 'Manual' implementation to produce desired df1 geometry
df1.loc[1, 'B'] = 996
df1.loc[1, 'C'] = 992
df1.loc[1, 'D'] = 988
df1.loc[1, 'E'] = 984
df1
In the example dfs above, the goal is for df1 row 1, columns B, C, D, and E to show the numbers 996, 992, 988 and 984 in the row slice.
How do I extract the slice, transpose it, and insert it without for-looping over every value?
Convert the values to a NumPy array to avoid data alignment: pandas tries to match the index and columns on assignment, and where the match fails it creates missing values or assigns nothing:
# pandas 0.24+ (transpose on a Series is a no-op, so it is dropped here)
df1.loc[1, 'B':'E'] = df2.loc[1:4, 'M'].to_numpy()
# pandas below 0.24
# df1.loc[1, 'B':'E'] = df2.loc[1:4, 'M'].values
print(df1)
0 1 2 3 A B C D E F
0 0 1 2 3 NaN NaN NaN NaN NaN NaN
1 4 5 6 7 NaN 996.0 992.0 988.0 984.0 NaN
2 8 9 10 11 NaN NaN NaN NaN NaN NaN
3 12 13 14 15 NaN NaN NaN NaN NaN NaN
4 16 17 18 19 NaN NaN NaN NaN NaN NaN
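An alternative sketch (my addition, not from the original answer): assigning a plain Python list also bypasses alignment, since only pandas objects carry an index:

df1.loc[1, 'B':'E'] = df2.loc[1:4, 'M'].to_list()   # .tolist() on older pandas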
I'm trying to stack two 3-column dataframes using either concat, append, or merge. The result is a 5-column dataframe where the original columns have a different order in places. Here are some of the things I've tried:
dfTrain = pd.read_csv("agr_hi_train.csv")
dfTrain2 = pd.read_csv("english/agr_en_train.csv")
dfTrain2.reset_index()  # note: the result is not assigned back, so this line has no effect
frames = [dfTrain, dfTrain2]
test = dfTrain2.append(dfTrain, ignore_index=True)
test2 = dfTrain2.append(dfTrain)
test3 = pd.concat(frames, axis=0, ignore_index=True)
test4 = pd.merge(dfTrain,dfTrain2, right_index=True, left_index=True)
With the following results:
print(dfTrain.shape)
print(dfTrain2.shape)
print(test.shape)
print(test2.shape)
print(test3.shape)
print(test4.shape)
Output is:
(20198, 5)
(20198, 5)
(11998, 6)
(8200, 6)
(8200, 3)
(11998, 3)
I want the result to be:
(20198, 3) # i.e. the two dataframes stacked on top of each other...
Any ideas why I'm getting the extra columns, etc.?
If you have different column names, then your append will separate the columns. For example:
import numpy as np
import pandas as pd

dfTrain = pd.DataFrame(np.random.rand(8200, 3), columns=['A', 'B', 'C'])
dfTrain2 = pd.DataFrame(np.random.rand(11998, 3), columns=['D', 'E', 'F'])
test = dfTrain.append(dfTrain2)
print(test)
has the output:
A B C D E F
0 0.617294 0.507264 0.330792 NaN NaN NaN
1 0.439806 0.355340 0.757864 NaN NaN NaN
2 0.740674 0.332794 0.530613 NaN NaN NaN
...
20195 NaN NaN NaN 0.295392 0.621741 0.255251
20196 NaN NaN NaN 0.096586 0.841174 0.392839
20197 NaN NaN NaN 0.071756 0.998280 0.451681
If you rename the columns in both dataframes to match, then it'll line up.
dfTrain2.columns = ['A','B','C']
test2 = dfTrain.append(dfTrain2)
print(test2)
A B C
0 0.545936 0.103332 0.939721
1 0.258807 0.274423 0.262293
2 0.374780 0.458810 0.955040
...
[20198 rows x 3 columns]
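As a side note (my addition, not from the original answer): DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same stacking is done with pd.concat, aligning the column names first, e.g. with set_axis:

test2 = pd.concat(
    [dfTrain, dfTrain2.set_axis(dfTrain.columns, axis=1)],
    ignore_index=True,
)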
As per the title, here's a reproducible example:
import numpy as np
import pandas as pd

raw_data = {'x': ['this', 'that', 'this', 'that', 'this'],
            np.nan: [np.nan, np.nan, np.nan, np.nan, np.nan],
            'y': [np.nan, np.nan, np.nan, np.nan, np.nan]}
# Passing np.nan twice in columns= duplicates the NaN-named column
df = pd.DataFrame(raw_data, columns=['x', np.nan, 'y', np.nan])
df
df
x NaN y NaN
0 this NaN NaN NaN
1 that NaN NaN NaN
2 this NaN NaN NaN
3 that NaN NaN NaN
4 this NaN NaN NaN
The aim is to drop only the columns with nan as the column name (so keep column y). dropna() doesn't work, as it conditions on the nan values in the column, not on nan as the column name.
df.drop(np.nan, axis=1, inplace=True) works if there's a single column in the data with nan as the column name, but not with multiple such columns, as in my data.
So how do I drop multiple columns where the column name is nan?
In [218]: df = df.loc[:, df.columns.notna()]
In [219]: df
Out[219]:
x y
0 this NaN
1 that NaN
2 this NaN
3 that NaN
4 this NaN
You can try:
df.columns = df.columns.fillna('to_drop')
df.drop('to_drop', axis = 1, inplace = True)
As of pandas 1.4.0, df.drop is the simplest solution, as it now handles multiple NaN headers properly:
df = df.drop(columns=np.nan)
# x y
# 0 this NaN
# 1 that NaN
# 2 this NaN
# 3 that NaN
# 4 this NaN
Or the equivalent axis syntax:
df = df.drop(np.nan, axis=1)
Note that it's possible to use inplace instead of assigning back to df, but inplace is not recommended and will eventually be deprecated.
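Another equivalent one-liner (my addition, not from the original answers), which also works on pandas versions older than 1.4.0: select only the non-NaN column labels directly:

df = df[df.columns.dropna()]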
My dataframe looks something like this, only much larger.
import pandas as pd

d = {'Col_1': pd.Series(['A', 'B']),
     'Col_2': pd.Series(['B', 'A', 'C']),
     'Col_3': pd.Series(['B', 'A']),
     'Col_4': pd.Series(['C', 'A', 'B', 'D']),
     'Col_5': pd.Series(['A', 'C'])}
df = pd.DataFrame(d)
Col_1 Col_2 Col_3 Col_4 Col_5
A B B C A
B A A A C
NaN C NaN B NaN
NaN NaN NaN D NaN
First, I'm trying to sort each column individually. I've tried playing around with something like df.sort([lambda x: x in df.columns], axis=1, ascending=True, inplace=True), but have only ended up with errors. How do I sort each column individually to end up with something like this:
Col_1 Col_2 Col_3 Col_4 Col_5
A A A A A
B B B B C
NaN C NaN C NaN
NaN NaN NaN D NaN
Second, I'm looking to concatenate the rows within the columns:
df = pd.concat([df,pd.DataFrame(df.sum(axis=0),columns=['Concatenation']).T])
I can combine everything with the line above after replacing np.nan with '', but the result comes out smashed together ('AB') and would require an additional step to clean it up (into something like 'A:B').
pandas.Series.order has been deprecated since pandas 0.17. Instead, use sort_values as follows:
for col in df:
    df[col] = df[col].sort_values(ignore_index=True)
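Equivalently (my addition), the loop can be collapsed into a single apply over the columns:

df = df.apply(lambda col: col.sort_values(ignore_index=True))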
Here is one way:
>>> pd.concat([df[col].sort_values().reset_index(drop=True) for col in df], axis=1, ignore_index=True)
   0    1    2    3    4
0 A A A A A
1 B B B B C
2 NaN C NaN C NaN
3 NaN NaN NaN D NaN
[4 rows x 5 columns]
However, what you're doing is somewhat strange. DataFrames aren't just collections of unrelated columns. In a DataFrame, each row represents a record, so the value in one column is semantically linked to the values in other columns in that same row. By sorting the columns independently, you're discarding this information, so the rows are now meaningless. That's why the reset_index is needed in my example. Also, because of this, there's no way to do this in-place, which your example suggests you want.
I don't know if this is any better, but here are a couple of other ways to do it.
# key=str makes NaN sort after the letters (str(nan) == 'nan'), and
# .items() replaces the Python-2-only .iteritems()
pd.DataFrame({key: sorted(value.values(), key=str)
              for key, value in df.to_dict().items()})

pd.DataFrame({key: sorted(values, key=str)
              for key, values in df.transpose().iterrows()})
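For the second part of the question, which the answers above leave open (this sketch is my addition): joining each column's non-null values with a separator gives the 'A:B' form directly, instead of the smashed-together 'AB':

# Build the 'Concatenation' row by joining non-null values with ':'
concat_row = df.apply(lambda col: ':'.join(col.dropna()))
df = pd.concat([df, pd.DataFrame([concat_row], index=['Concatenation'])])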