pandas stacking dataframe reshapes data - python

I'm trying to stack two 3-column data frames using concat, append, or merge. The result is a 5-column dataframe where the original columns have a different order in places. Here are some of the things I've tried:
dfTrain = pd.read_csv("agr_hi_train.csv")
dfTrain2 = pd.read_csv("english/agr_en_train.csv")
dfTrain2.reset_index()
frames = [dfTrain, dfTrain2]
test = dfTrain2.append(dfTrain, ignore_index=True)
test2 = dfTrain2.append(dfTrain)
test3 = pd.concat(frames, axis=0, ignore_index=True)
test4 = pd.merge(dfTrain,dfTrain2, right_index=True, left_index=True)
With the following results:
print(dfTrain.shape)
print(dfTrain2.shape)
print(test.shape)
print(test2.shape)
print(test3.shape)
print(test4.shape)
Output is:
(20198, 5)
(20198, 5)
(11998, 6)
(8200, 6)
(8200, 3)
(11998, 3)
I want the result to be:
(20198, 3) # i.e. the last two stacked on top of each other...
Any ideas why I'm getting the extra columns, etc.?

If you have different column names, then your append will separate the columns. For example:
import numpy as np
import pandas as pd

dfTrain = pd.DataFrame(np.random.rand(8200, 3), columns=['A', 'B', 'C'])
dfTrain2 = pd.DataFrame(np.random.rand(11998, 3), columns=['D', 'E', 'F'])
test = dfTrain.append(dfTrain2, ignore_index=True)
print(test)
has the output:
              A         B         C         D         E         F
0      0.617294  0.507264  0.330792       NaN       NaN       NaN
1      0.439806  0.355340  0.757864       NaN       NaN       NaN
2      0.740674  0.332794  0.530613       NaN       NaN       NaN
...
20195       NaN       NaN       NaN  0.295392  0.621741  0.255251
20196       NaN       NaN       NaN  0.096586  0.841174  0.392839
20197       NaN       NaN       NaN  0.071756  0.998280  0.451681
If you rename the columns so that both dataframes match, everything will line up:
dfTrain2.columns = ['A','B','C']
test2 = dfTrain.append(dfTrain2)
print(test2)
          A         B         C
0  0.545936  0.103332  0.939721
1  0.258807  0.274423  0.262293
2  0.374780  0.458810  0.955040
...
[20198 rows x 3 columns]
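Worth noting: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current pandas the same stacking is written with pd.concat. A minimal sketch, assuming the frames share column names:
import numpy as np
import pandas as pd

# Two frames with identical columns stack cleanly into (20198, 3).
dfTrain = pd.DataFrame(np.random.rand(8200, 3), columns=['A', 'B', 'C'])
dfTrain2 = pd.DataFrame(np.random.rand(11998, 3), columns=['A', 'B', 'C'])
stacked = pd.concat([dfTrain, dfTrain2], ignore_index=True)
print(stacked.shape)  # (20198, 3)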

Related

Why isn't transpose working to allow me to concat dataframe with df.loc["Y"] row

I have 2 dataframes:
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]},
                  index=['X', 'Y', 'Z'])
and
df1 = pd.DataFrame({'M': [10, 20, 30],
                    'N': [40, 50, 60]},
                   index=['S', 'T', 'U'])
I want to append a row from the df dataframe to df1.
I use the following code to extract the row:
row = df.loc['Y']
When I print this I get:
A 2
B 5
Name: Y, dtype: int64
A and B are key values or column heading names, so I transpose this with
row_25 = row.transpose()
I print row_25 and get:
A 2
B 5
Name: Y, dtype: int64
This is the same as row, so it seems the transpose didn't happen.
I then add this code to append the row to df1:
result = pd.concat([df1, row_25], axis=0, ignore_index=False)
print(result)
When I print the result I get:
      M     N    0
S  10.0  40.0  NaN
T  20.0  50.0  NaN
U  30.0  60.0  NaN
A   NaN   NaN  2.0
B   NaN   NaN  5.0
I want A and B to be column headings (key values) and the name of the row (Y) to be the row index.
What am I doing wrong?
Try
pd.concat([df1, df.loc[['Y']]])
It generates:
      M     N    A    B
S  10.0  40.0  NaN  NaN
T  20.0  50.0  NaN  NaN
U  30.0  60.0  NaN  NaN
Y   NaN   NaN  2.0  5.0
Not sure if this is what you want.
To exclude column names 'M' and 'N' from the result you can rename the columns beforehand:
>>> df1.columns = ['A', 'B']
>>> pd.concat([df1, df.loc[['Y']]])
    A   B
S  10  40
T  20  50
U  30  60
Y   2   5
The reason why you need double square brackets is that single square brackets return a 1D Series, which cannot be transposed. Double square brackets return a 2D DataFrame (in general, double brackets are used to reference several rows or columns, like df1.loc[['X', 'Y']]; this is called 'fancy indexing' in NumPy).
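A quick way to see the difference, using the df defined above:
>>> type(df.loc['Y'])    # single brackets: a 1D Series
<class 'pandas.core.series.Series'>
>>> type(df.loc[['Y']])  # double brackets: a one-row DataFrame
<class 'pandas.core.frame.DataFrame'>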
If you are allergic to double brackets, use
pd.concat([df1.rename(columns={'M': 'A', 'N': 'B'}),
           df.filter(items=['Y'], axis=0)])
Finally, if you really want to transpose something, you can convert the series to a frame and transpose it:
>>> df.loc['Y'].to_frame().T
   A  B
Y  2  5

How to subtract two dataframe column in dask dataframe?

I have two dataframes, df1 and df2.
In df1 I have 4 columns (A, B, C, D) and two rows;
in df2 I have 4 columns (A, B, C, D) and two rows.
Now I want to subtract the two dataframes, like df1['A'] - df2['A'] and so on, but I don't know how to do it.
Just do the subtraction, but keep the indexes in mind. For example, let's say I have df1 and df2 with the same columns but a different index:
import dask.dataframe as dd
import numpy as np
import pandas as pd

df1 = dd.from_array(np.arange(8).reshape(2, 4), columns=['A', 'B', 'C', 'D'])
df2 = dd.from_pandas(pd.DataFrame(
    np.arange(8).reshape(2, 4),
    columns=['A', 'B', 'C', 'D'],
    index=[1, 2]
), npartitions=1)
Then:
(df1 - df2).compute()
#      A    B    C    D
# 0  NaN  NaN  NaN  NaN
# 1  4.0  4.0  4.0  4.0
# 2  NaN  NaN  NaN  NaN
On the other hand, we can match df2's index to df1's and subtract:
df2 = df2.assign(idx=1)                    # constant helper column
df2 = df2.set_index(df2.idx.cumsum() - 1)  # running count 0, 1, ... lines up with df1's index
df2 = df2.drop(columns=['idx'])
(df1 - df2).compute()
#    A  B  C  D
# 0  0  0  0  0
# 1  0  0  0  0
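For the asker's own case, where both frames already share the same four columns and the same two row labels, no index surgery is needed and the elementwise subtraction works directly. A minimal sketch built with from_pandas (the values here are made up):
import dask.dataframe as dd
import numpy as np
import pandas as pd

pdf1 = pd.DataFrame(np.arange(8).reshape(2, 4), columns=['A', 'B', 'C', 'D'])
pdf2 = pd.DataFrame(np.ones((2, 4), dtype=int), columns=['A', 'B', 'C', 'D'])
ddf1 = dd.from_pandas(pdf1, npartitions=1)
ddf2 = dd.from_pandas(pdf2, npartitions=1)

# Indexes (0, 1) and columns match, so this is a plain elementwise subtraction.
print((ddf1 - ddf2).compute())
#    A  B  C  D
# 0 -1  0  1  2
# 1  3  4  5  6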

Iterate through two data frames and update a column of the first data frame with a column of the second data frame in pandas

I am converting a piece of code written in R to Python. df1 and df2 are the dataframes; id, case, feature, and feature_value are column names. The code in R is:
for(i in 1:dim(df1)[1]){
  temp = subset(df2, df2$id == df1$case[i], select = df1$feature[i])
  df1$feature_value[i] = temp[, df1$feature[i]]
}
My code in python is as follows.
for i in range(0, len(df1)):
    temp = np.where(df1['case'].iloc[i] == df2['id']), df1['feature'].iloc[i]
    df1['feature_value'].iloc[i] = temp[:, df1['feature'].iloc[i]]
but it gives
TypeError: tuple indices must be integers or slices, not tuple
How to rectify this error? Appreciate any help.
Unfortunately, R and Pandas handle dataframes pretty differently. If you'll be using Pandas a lot, it would probably be worth going through a tutorial on it.
I'm not too familiar with R so this is what I think you want to do:
Find rows in df1 where the 'case' matches an 'id' in df2. If such a row is found, add the "feature" in df1 to a new df1 column called "feature_value."
If so, you can do this with the following:
#create a sample df1 and df2
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5]})
>>> df1
   case  feature
0     1        3
1     2        4
2     3        5
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39]})
>>> df2
   id  age
0   1   45
1   3   63
2   7   39
#create a list with all the "id" values of df2
>>> df2_list = df2['id'].to_list()
>>> df2_list
[1, 3, 7]
#lambda allows small functions; in this case, the value of df1['feature_value']
#for each row is assigned df1['feature'] if df1['case'] is in df2_list,
#and otherwise it is assigned np.nan.
>>> df1['feature_value'] = df1.apply(lambda x: x['feature'] if x['case'] in df2_list else np.nan, axis=1)
>>> df1
   case  feature  feature_value
0     1        3            3.0
1     2        4            NaN
2     3        5            5.0
Instead of a lambda, a full function can be created, which may be easier to understand:
def get_feature_values(df, id_list):
    if df['case'] in id_list:
        feature_value = df['feature']
    else:
        feature_value = np.nan
    return feature_value
df1['feature_value'] = df1.apply(get_feature_values, id_list=df2_list, axis=1)
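For larger frames the same result can be computed without a row-wise apply at all, using Series.isin and Series.where; a minimal sketch, assuming the df1/df2 built above:
# Keep 'feature' where 'case' appears among df2's ids, NaN elsewhere.
df1['feature_value'] = df1['feature'].where(df1['case'].isin(df2['id']))
This stays vectorized inside pandas, which matters once the frames get large.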
Another way of going about this would involve merging df1 and df2 to find rows where the "case" value in df1 matches an "id" value in df2 (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
===================
To address the follow-up question in the comments:
You can do this by merging the dataframes and then creating a function.
#create example dataframes
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5], 'names': ['a', 'b', 'c']})
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39], 'a': [30, 31, 32], 'b': [40, 41, 42], 'c': [50, 51, 52]})
#merge the dataframes
>>> df1 = df1.merge(df2, how='left', left_on='case', right_on='id')
>>> df1
   case  feature names   id   age     a     b     c
0     1        3     a  1.0  45.0  30.0  40.0  50.0
1     2        4     b  NaN   NaN   NaN   NaN   NaN
2     3        5     c  3.0  63.0  31.0  41.0  51.0
Then you can create the following function:
def get_feature_values_2(df):
    if pd.notnull(df['id']):
        feature_value = df['feature']
        column_of_interest = df['names']
        feature_extended_value = df[column_of_interest]
    else:
        feature_value = np.nan
        feature_extended_value = np.nan
    return feature_value, feature_extended_value
# "result_type='expand'" allows multiple values to be returned from the function
df1[['feature_value', 'feature_extended_value']] = df1.apply(get_feature_values_2, result_type='expand', axis=1)
#This results in the following dataframe:
   case  feature names   id   age     a     b     c  feature_value  \
0     1        3     a  1.0  45.0  30.0  40.0  50.0            3.0
1     2        4     b  NaN   NaN   NaN   NaN   NaN            NaN
2     3        5     c  3.0  63.0  31.0  41.0  51.0            5.0

   feature_extended_value
0                    30.0
1                     NaN
2                    51.0
#To keep only a subset of the columns:
#First create a copy-pasteable list of the column names
list(df1.columns)
['case', 'feature', 'names', 'id', 'age', 'a', 'b', 'c', 'feature_value', 'feature_extended_value']
#Choose the subset of columns you would like to keep
df1 = df1[['case', 'feature', 'names', 'feature_value', 'feature_extended_value']]
df1
   case  feature names  feature_value  feature_extended_value
0     1        3     a            3.0                    30.0
1     2        4     b            NaN                     NaN
2     3        5     c            5.0                    51.0

How to transpose and insert a Pandas column slice into a row slice?

Trying to take a slice of a column from one Pandas dataframe, transpose the slice, and insert it into a similarly sized row slice in a different dataframe. Labels and indexes in the two dataframes are different. With large dataframes I am currently running a for loop to copy each individual value, cell by cell as it were, which is incredibly inefficient.
Other than the for loop, I have tried .loc and .iloc with transpose, but with no success. pivot, pivot_table, and melt don't seem to be applicable here, or I cannot get my head around how to apply them to this seemingly simple problem.
# Two dataframes here
import pandas as pd
import numpy as np
numRng = np.arange(20).reshape((5, 4))
df1 = pd.DataFrame(numRng)
newCols = ('A', 'B', 'C', 'D', 'E', 'F')
for newCol in newCols:
    df1[newCol] = np.nan
numRng2 = np.arange(1000,976,-1).reshape((6, 4))
df2 = pd.DataFrame(numRng2)
df2.columns = ['M', 'N', 'O', 'P']
df1
df2
# From df1, trying to copy a column-slice, transpose it, and insert it
# into df2 row-slice, has no effect
df1.loc[1, 'B':'E'] = df2.loc[1:4, 'M'].transpose()
df1
# 'Manual' implementation to produce desired df1 geometry
df1.loc[1, 'B'] = 996
df1.loc[1, 'C'] = 992
df1.loc[1, 'D'] = 988
df1.loc[1, 'E'] = 984
df1
In the example dfs above, df1 row 1, columns B through E, should show the numbers 996, 992, 988 and 984 in the row slice.
How can I extract the slice, transpose it, and insert it without for-looping over every value?
Convert the values to a NumPy array to avoid data alignment: pandas tries to match the index and columns to each other, and where that fails it creates missing values or does not assign at all:
# pandas 0.24+
df1.loc[1, 'B':'E'] = df2.loc[1:4, 'M'].transpose().to_numpy()
# pandas below 0.24
# df1.loc[1, 'B':'E'] = df2.loc[1:4, 'M'].transpose().values
print (df1)
    0   1   2   3   A      B      C      D      E   F
0   0   1   2   3 NaN    NaN    NaN    NaN    NaN NaN
1   4   5   6   7 NaN  996.0  992.0  988.0  984.0 NaN
2   8   9  10  11 NaN    NaN    NaN    NaN    NaN NaN
3  12  13  14  15 NaN    NaN    NaN    NaN    NaN NaN
4  16  17  18  19 NaN    NaN    NaN    NaN    NaN NaN
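As an aside, transposing a Series is a no-op (a one-dimensional object has nothing to transpose), so the .transpose() call can be dropped and the assignment reduces to:
df1.loc[1, 'B':'E'] = df2.loc[1:4, 'M'].to_numpy()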

Pandas add keys while concatenating dataframes at column level

As per the pandas 0.19.2 documentation, I can provide a keys argument to create a resulting multi-index DataFrame. An example (from the pandas documentation) is:
result = pd.concat(frames, keys=['x', 'y', 'z'])
How would I concat the dataframes so that I can provide the keys at the column level instead of the index level?
What I basically need is something like this (shown as an image in the original post), where df1 and df2 are the frames to be concatenated.
This is supported by the keys parameter of pd.concat when specifying axis=1:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.random((4, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.random((4, 3)), columns=list('BDF'), index=[2, 3, 6, 7])
df = pd.concat([df1, df2], keys=['X', 'Y'], axis=1)
The resulting output:
          X                                       Y
          A         B         C         D         B         D         F
0  0.654406  0.495906  0.601100  0.309276       NaN       NaN       NaN
1  0.020527  0.814065  0.907590  0.924307       NaN       NaN       NaN
2  0.239598  0.089270  0.033585  0.870829  0.882028  0.626650  0.622856
3  0.983942  0.103573  0.370121  0.070442  0.986487  0.848203  0.089874
6       NaN       NaN       NaN       NaN  0.664507  0.319789  0.868133
7       NaN       NaN       NaN       NaN  0.341145  0.308469  0.884074
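Once the keys are in place, each original frame can be pulled back out by its top-level label; a quick usage sketch with the df built above:
# Select the df1 block back by its key; rows 6 and 7 come back as all-NaN.
print(df['X'].shape)   # (6, 4): the union of both indexes, df1's four columns
# A single column inside a key is addressed with a tuple:
print(df[('Y', 'F')])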
