Create dataframe with values from other dataframe's indices and columns - python

I have a large dataframe df1 that looks like:
      0    1    2
0   NaN  1.0  5.0
1  0.50  NaN  1.0
2  1.25  3.0  NaN
And I want to create another dataframe df2 with three columns where the values for the first two columns correspond to the df1 columns and indices, and the third column is the cell value.
So df2 would look like:
   src  dst  cost
0    0    1   0.5
1    0    2  1.25
2    1    0     5
3    1    2     3
How can I do this?
Thanks

I'm sure there's probably a clever way to do this with pd.pivot or pd.melt but this works:
df2 = (
    # reorganize the data to be row-wise with a multi-index
    df1.stack()
    # drop missing values
    .dropna()
    # name the axes
    .rename_axis(['src', 'dst'])
    # name the values
    .to_frame('cost')
    # return src and dst to columns
    .reset_index(drop=False)
)
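For reference, running this chain on the sample df1 above should give:

   src  dst  cost
0    0    1  1.00
1    0    2  5.00
2    1    0  0.50
3    1    2  1.00
4    2    0  1.25
5    2    1  3.00

Note that stack emits the row label first, so src here is df1's index; if you want src to be the column label instead (as in the expected output above), add a .swaplevel() right after the .dropna().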

Related

How to replicate number of rows in dataframe1 to match n rows in dataframe 2 in pandas

I started learning Python a few months ago and I'm new to Stack Overflow as well, so please bear with me:
We have the two data frames:
df1:
0.1,0.2,0.3,0.4
1.0,2.0,3.0,4.0
6.0,7.0,8.0,9.0
df2:
Sequence, dataset_ID
1,1
2,4
10,5
I am using pandas' iterrows function to transpose df1:
for ind, row in df1.iterrows():
    # write each row as index,value pairs; mode='a' appends so later rows
    # don't overwrite earlier ones, and header=False suppresses a header line
    row.to_csv(path + r'\df1Transposed', mode='a', header=False)
df1Transposed:
0.1,1.0
0.2,2.0
0.3,3.0
0.4,4.0
0.1,6.0
0.2,7.0
0.3,8.0
0.4,9.0
I am trying to find a good way to group/replicate each row in df2 to match the number of rows in df1Transposed. For example, one header-and-data row pair in df1 creates 4 rows and two columns in df1Transposed (0.1-0.4), and this repeats for the next row in df1. So the first row in df2 should repeat 4 times, and then the second row should repeat another 4 times.
dfout:
Sequence, dataset_ID,V,I
1,1,0.1,1.0
1,1,0.2,2.0
1,1,0.3,3.0
1,1,0.4,4.0
2,4,0.1,6.0
2,4,0.2,7.0
2,4,0.3,8.0
2,4,0.4,9.0
You can use a combination of numpy's repeat and arange to build the positional index, then concatenate the two dataframes horizontally.
First, get the transpose thanks to @sammywemmy's handy one-liner:
df1_T = pd.concat([df1.iloc[:2].T,
                   df1.iloc[::2].T.set_axis([0, 1], axis=1)],
                  ignore_index=True)
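To unpack the one-liner (this assumes df1 was read with header=None, so the voltage row 0.1-0.4 is data row 0):

# rows 0 and 1, transposed: voltages paired with the first current sweep
df1.iloc[:2].T                            # columns 0 (V) and 1 (I)
# rows 0 and 2, transposed: voltages paired with the second sweep
df1.iloc[::2].T.set_axis([0, 1], axis=1)  # relabel columns so the blocks align

Concatenating the two blocks with ignore_index=True stacks them into the 8-row df1_T.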
Second, get the length of the transposed dataframe, choose how many rows of df2 to include, and use the functions mentioned above:
import numpy as np

df_1_l = df1_T.shape[0]
no_rows_from_df2 = 2
index = np.repeat(np.arange(no_rows_from_df2), df_1_l // no_rows_from_df2)
df3 = pd.concat([df1_T.reset_index(drop=True),
                 df2.iloc[index].reset_index(drop=True)], axis=1)
df3
#      0    1  Sequence  dataset_ID
# 0  0.1  1.0         1           1
# 1  0.2  2.0         1           1
# 2  0.3  3.0         1           1
# 3  0.4  4.0         1           1
# 4  0.1  6.0         2           4
# 5  0.2  7.0         2           4
# 6  0.3  8.0         2           4
# 7  0.4  9.0         2           4
A few things: this works because the length of df1_T is a multiple of the selected number of rows from df2. If, for example, you would like to repeat rows 0, 1 and 2, then the length of df1_T should be 3, 6, 9, 12, ...
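If you'd rather not rely on that multiple, a sketch of a more general variation (my own, not from the original answer) uses Index.repeat, where reps can also be an array giving a different count per df2 row:

reps = df_1_l // no_rows_from_df2  # rows of df1_T covered by each df2 row
sel = df2.head(no_rows_from_df2)
df3 = pd.concat([df1_T.reset_index(drop=True),
                 sel.loc[sel.index.repeat(reps)].reset_index(drop=True)],
                axis=1)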

pandas reset index after performing groupby and retain selective columns

I want to take a pandas dataframe, do a count of unique elements by a column and retain 2 of the columns. But I get a multi-index dataframe after groupby which I am unable to (1) flatten (2) select only relevant columns. Here is my code:
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 1],
    'Ticker': ['AA', 'BB', 'CC', 'DD', 'CC', 'BB'],
    'Amount': [10, 20, 30, 40, 50, 60],
    'Date_1': ['1/12/2018', '1/14/2018', '1/12/2018', '1/14/2018', '2/1/2018', '1/12/2018'],
    'Random_data': ['ax', '', 'nan', '', 'by', 'cz'],
    'Count': [23, 1, 4, 56, 34, 53]
})
df2 = df.groupby(['Ticker']).agg(['nunique'])
df2.reset_index()
print(df2)
df2 still comes out with two levels of index. And has all the columns: Amount, Count, Date_1, ID, Random_data.
How do I reduce it to one level of index?
And retain only ID and Random_data columns?
Try this instead:
1) Select only the relevant columns (['ID', 'Random_data']).
2) Don't pass a list to .agg - just 'nunique' - the list is what causes the multi-index behaviour.
df2 = df.groupby(['Ticker'])[['ID', 'Random_data']].agg('nunique')
df2.reset_index()
(Note the double brackets when selecting the columns; the bare comma-separated form was removed in newer pandas.)
Ticker ID Random_data
0 AA 1 1
1 BB 2 2
2 CC 2 2
3 DD 1 1
Use DataFrameGroupBy.nunique and pass the columns to keep as a list after groupby:
df2 = df.groupby('Ticker')[['Date_1', 'Count', 'ID']].nunique().reset_index()
print(df2)
Ticker Date_1 Count ID
0 AA 1 1 1
1 BB 2 2 2
2 CC 2 2 2
3 DD 1 1 1
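On pandas 0.25 or newer, named aggregation is another option (a sketch, not from the answers above); it produces flat columns directly and lets you name them:

df2 = df.groupby('Ticker').agg(
    # output column = (source column, aggregation function)
    ID=('ID', 'nunique'),
    Random_data=('Random_data', 'nunique'),
).reset_index()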

How to compare two dataframes and filter rows and columns where a difference is found

I am testing dataframes for equality.
df_diff=(df1!=df2)
I get df_diff, which is the same shape as df1 and df2 and contains boolean True/False values.
Now I would like to keep only the columns and rows of df1 where there was at least a different value.
If I simply do
df1[df_diff.any(axis=1)]
I get all the rows where there was at least one True in df_diff, but the result still includes lots of columns that contained only False.
As a second step, I would like then to be able to replace all the values (element-wise in the dataframe) which were equal (where df_diff==False) with NaNs.
example:
df1=pd.DataFrame(data=[[1,2,3],[4,5,6],[7,8,9]])
df2=pd.DataFrame(data=[[1,99,3],[4,5,99],[7,8,9]])
I would like to get from df1
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
to
1 2
0 2 NaN
1 NaN 6
I think you need DataFrame.any to check for at least one True per row or per column:
df = df_diff[df_diff.any(axis=1)]
It is possible to filter both of the original dataframes like so:
df11 = df1[df_diff.any(axis=1)]
df22 = df2[df_diff.any(axis=1)]
If you want to filter both rows and columns:
df = df_diff.loc[df_diff.any(axis=1), df_diff.any()]
EDIT: Filter df1 and add NaNs with where:
df_diff=(df1!=df2)
m1 = df_diff.any(axis=1)
m2 = df_diff.any()
out = df1.loc[m1, m2].where(df_diff.loc[m1, m2])
print (out)
1 2
0 2.0 NaN
1 NaN 6.0
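On pandas 1.1 or newer, DataFrame.compare gets you most of the way in one call (an alternative I'm adding here, not part of the original answers):

out = df1.compare(df2)
# keeps only the differing rows and columns; for the example above, row 2 and
# column 0 drop out, and each remaining cell is split into 'self'/'other' values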

Python adding column to dataframe causes NaN

I have a series and df
s = pd.Series([1,2,3,5])
df = pd.DataFrame()
When I add columns to df like this
df.loc[:, "0-2"] = s.iloc[0:3]
df.loc[:, "1-3"] = s.iloc[1:4]
I get df
0-2 1-3
0 1 NaN
1 2 2.0
2 3 3.0
Why am I getting NaN? I tried creating a new series with the correct indices, but adding it to df still causes NaN.
What I want is
0-2 1-3
0 1 2
1 2 3
2 3 5
Try either of the following lines.
df.loc[:, "1-3"] = s.iloc[1:4].values
# -OR-
df.loc[:, "1-3"] = s.iloc[1:4].reset_index(drop=True)
Your original code is trying unsuccessfully to match the index of the data frame df to the index of the subset series s.iloc[1:4]. When it can't find the 0 index in the series, it places a NaN value in df at that location. You can get around this by only keeping the values so it doesn't try to match on the index or resetting the index on the subset series.
>>> s.iloc[1:4]
1 2
2 3
3 5
dtype: int64
Notice the index values since the original, unsubset series is the following.
>>> s
0 1
1 2
2 3
3 5
dtype: int64
The index of the first row in df is 0. By dropping the indices with the values call, you bypass the index matching which is producing the NaN. By resetting the index in the second option, you make the indices the same.
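A minimal demo of that label alignment (df3 and "y" are hypothetical names for illustration):

df3 = pd.DataFrame(index=[0, 1, 2])
df3["y"] = pd.Series({1: 2, 2: 3, 3: 5})
# assignment matches by label: label 0 is missing from the series, so df3 gets
# NaN there; label 3 has no matching row in df3, so that value is dropped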

Pandas Multiple Column Division

I am trying to do a division of column 0 by columns 1 and 2. From the below, I would like to return a dataframe of 10 rows, 3 columns. The first column should all be 1's. Instead I get a 10x10 dataframe. What am I doing wrong?
import numpy as np
import pandas as pd

data = np.random.randn(10, 3)
df = pd.DataFrame(data)
df[0] / df
First create a 10 by 3 DataFrame with all columns equal to the first column, then divide it by your DataFrame:
df[[0, 0, 0]] / df.values
or, if you want to keep the original column names:
df[[0, 0, 0]].values / df
(I use .values to avoid reindexing, which would fail because of the duplicate column labels.)
df[0] / df aligns the Series' index (0-9) with the DataFrame's columns (0, 1, 2), which is why you end up with a 10x10 result. You need to match the Series against the rows of the DataFrame instead. There are a few ways to do this, but I like to use transposes.
data = np.random.randn(10,3)
df = pd.DataFrame(data)
(df[0] / df.T).T
0 1 2
0 1 -0.568096 -0.248052
1 1 -0.792876 -3.539075
2 1 -25.452247 1.434969
3 1 -0.685193 -0.540092
4 1 0.451879 -0.217639
5 1 -2.691260 -3.208036
6 1 0.351231 -1.467990
7 1 0.249589 -0.714330
8 1 0.033477 -0.004391
9 1 -0.958395 -1.530424
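For completeness, pandas also has a built-in for this kind of broadcasting (an addition on my part, not from either answer): div with axis=0 matches the Series against the rows directly, so no transposing or .values is needed:

df.div(df[0], axis=0)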
