Rearranging a non-consecutive order of columns in pandas dataframe - python

I have a pandas DataFrame (result1) with a variable number of columns, which I generated by merging two other DataFrames:
result1 = df1.merge(df2, on='ID', how='left')
result1 is expected to have a variable number of columns (this is part of a larger script). I want to arrange the columns so that the last two columns become the second and third, followed by all the remaining columns, while the first column stays first. If result1 were known to have 6 columns, I could use:
result2=result1.iloc[:,[0,4,5,1,2,3]] #this works fine.
But I need the 1,2,3 part expressed as a range, since it is not practical to type out every column number for each DataFrame. So I thought of using:
result2=result1.iloc[:,[0,len(result1.columns), len(result1.columns)-1, 1:len(result1.columns-2]]
#Assuming 6 columns : 0, 5 , 4 , 1, 2, 3
That would be the ideal way, but it raises a syntax error. Any suggestions for fixing this?

Instead of using slicing syntax, I'd just build a list and use that:
>>> df
0 1 2 3 4 5
0 0 1 2 3 4 5
1 0 1 2 3 4 5
2 0 1 2 3 4 5
3 0 1 2 3 4 5
4 0 1 2 3 4 5
>>> ncol = len(df.columns)
>>> df.iloc[:,[0, ncol-1, ncol-2] + list(range(1,ncol-2))]
0 5 4 1 2 3
0 0 5 4 1 2 3
1 0 5 4 1 2 3
2 0 5 4 1 2 3
3 0 5 4 1 2 3
4 0 5 4 1 2 3
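The list-building approach above can be wrapped in a small function so it works for any number of columns. This is only a sketch: the name move_last_two_after_first and the guard for frames with fewer than three columns are assumptions added for illustration, not part of the original answer.

```python
import pandas as pd

def move_last_two_after_first(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: first column, last, second-to-last, then the rest."""
    ncol = frame.shape[1]
    if ncol < 3:
        # Assumption: with fewer than three columns there is nothing to reorder.
        return frame.copy()
    # Same positional list as in the answer: [0, ncol-1, ncol-2, 1, 2, ..., ncol-3]
    order = [0, ncol - 1, ncol - 2] + list(range(1, ncol - 2))
    return frame.iloc[:, order]

demo = pd.DataFrame([[0, 1, 2, 3, 4, 5]] * 2, columns=list("abcdef"))
reordered = move_last_two_after_first(demo)
print(list(reordered.columns))  # ['a', 'f', 'e', 'b', 'c', 'd']
```

Swapping ncol - 1 and ncol - 2 in the list gives the 0, 4, 5, 1, 2, 3 order from the question instead.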

Related

pop rows from dataframe based on conditions

From the dataframe
import pandas as pd
df1 = pd.DataFrame({'A':[1,1,1,1,2,2,2,2],'B':[1,2,3,4,5,6,7,8]})
print(df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
4 2 5
5 2 6
6 2 7
7 2 8
I want to pop 2 rows where 'A' == 2, preferably in a single statement like
df2 = df1.somepopfunction(...)
to generate the following result:
print(df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
4 2 7
5 2 8
print(df2)
A B
0 2 5
1 2 6
The pandas pop function sounds promising, but it only pops complete columns.
What statement can replace the pseudocode
df2 = df1.somepopfunction(...)
to generate the desired results?
pandas has no pop function for removing rows. You need to filter the rows first and then drop the filtered rows from df1:
df2 = df1[df1.A.eq(2)].head(2)
print (df2)
A B
4 2 5
5 2 6
df1 = df1.drop(df2.index)
print (df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
6 2 7
7 2 8
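If a single call is preferred, the filter-then-drop steps can be wrapped in a function. The sketch below assumes a unique index (drop works by label) and resets the index to match the output requested in the question; pop_rows is a hypothetical name, not a pandas API.

```python
import pandas as pd

def pop_rows(frame, mask, n):
    """Hypothetical helper: return (popped, remaining) for the first n rows where mask is True."""
    popped = frame[mask].head(n)
    # Assumption: index labels are unique, so dropping by label removes exactly these rows.
    remaining = frame.drop(popped.index)
    return popped.reset_index(drop=True), remaining.reset_index(drop=True)

df1 = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 2], 'B': [1, 2, 3, 4, 5, 6, 7, 8]})
df2, df1 = pop_rows(df1, df1['A'].eq(2), 2)
print(df2)
print(df1)
```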

How to compare the column values of two DataFrames and assign the value of a third column in Python [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 10 months ago.
I have two different Dataframes, a long dataframe (df1) and a short dataframe (df2). Both data frames contain the columns A and B:
Data1 = {'A': [2,2,2,1,2,1,1,2], 'B': [1,2,1,2,1,1,1,2]}
Data2 ={'A': [1,1,2,2],'B': [1,2,1,2],'X': [9,5,7,3]}
df1 = pd.DataFrame(Data1)
df2 = pd.DataFrame(Data2)
print(df1)
print(df2)
A B
0 2 1
1 2 2
2 2 1
3 1 2
4 2 1
5 1 1
6 1 1
7 2 2
A B X
0 1 1 9
1 1 2 5
2 2 1 7
3 2 2 3
The df2 data frame contains a new column called X that contains some values.
I need to create a third data frame (df3), that includes a new ‘X’ column in the data frame df1. This X column should show the values assigned in df2 depending on A and B column values. This should be the result:
df3:
A B X
0 2 1 7
1 2 2 3
2 2 1 7
3 1 2 5
4 2 1 7
5 1 1 9
6 1 1 9
7 2 2 3
I have tried different ways of merging the dataframes without success. Any help would be greatly appreciated.
Just merge the two on A and B. I've added how='left' so that every row of df1 is kept, in df1's original order (if it is removed, the merge still works but the rows may come back in a different order).
df1.merge(df2, on=['A', 'B'], how='left')
Output:
A B X
0 2 1 7
1 2 2 3
2 2 1 7
3 1 2 5
4 2 1 7
5 1 1 9
6 1 1 9
7 2 2 3
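A slightly more defensive variant of the same merge is sketched below. It uses two existing DataFrame.merge parameters: validate='many_to_one' raises an error if an (A, B) pair appears more than once in df2, and indicator=True adds a _merge column that shows which rows of df1 found no match. The variable name unmatched is an illustrative choice.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [2, 2, 2, 1, 2, 1, 1, 2], 'B': [1, 2, 1, 2, 1, 1, 1, 2]})
df2 = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 1, 2], 'X': [9, 5, 7, 3]})

# Left merge on both key columns; fail loudly if df2 has duplicate (A, B) keys.
df3 = df1.merge(df2, on=['A', 'B'], how='left', validate='many_to_one', indicator=True)

# Rows of df1 whose (A, B) pair does not exist in df2 would have X = NaN.
unmatched = df3[df3['_merge'] == 'left_only']
df3 = df3.drop(columns='_merge')
print(df3)
```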

pandas get original dataframe after vertical concatenation

Let us take a sample dataframe
df = pd.DataFrame(np.arange(10).reshape((5,2)))
df
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
and concatenate the two columns into a single column
temp = pd.concat([df[0], df[1]]).to_frame()
temp
0
0 0
1 2
2 4
3 6
4 8
0 1
1 3
2 5
3 7
4 9
What would be the most efficient way to get the original dataframe, i.e. df, back from temp?
The following way using groupby works. But is there any more efficient way (like without groupby-apply, pivot) to do this whole task from concatenation (and then doing some operation) and then reverting back to the original dataframe?
pd.DataFrame(temp.groupby(level=0)[0]
.apply(list)
.to_numpy().tolist())
I think we can use pivot after assigning a column of cumcount values (note that the value column is the integer 0, not the string '0'):
check = temp.assign(c=temp.groupby(level=0).cumcount()).pivot(columns='c', values=0)
Out[66]:
c 0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
You can use groupby + cumcount to create a sequential counter per level=0 group then append it to the index of the dataframe and use unstack to reshape:
temp.set_index(temp.groupby(level=0).cumcount(), append=True)[0].unstack()
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
You can try this:
In [1267]: temp['g'] = temp.groupby(level=0)[0].cumcount()
In [1273]: temp.pivot(columns='g', values=0)
Out[1279]:
g 0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
OR:
In [1281]: temp['g'] = (temp.index == 0).cumsum() - 1
In [1282]: temp.pivot(columns='g', values=0)
Out[1282]:
g 0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
df = pd.DataFrame(np.arange(10).reshape((5,2)))
temp = pd.concat([df[0], df[1]]).to_frame()
duplicated_index = temp.index.duplicated()
pd.concat([temp[~duplicated_index], temp[duplicated_index]], axis=1)
This works for this specific case, but as pointed out in the comments, it will fail if you have more than one set of duplicate index values, so I don't think it's a better solution.
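For comparison, here is a sketch that avoids groupby and pivot entirely by reshaping the underlying NumPy array. This is a different technique from the answers above, and it rests on two assumptions: the number of original columns (n_cols) is known, and the columns were concatenated in order with equal lengths.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape((5, 2)))
temp = pd.concat([df[0], df[1]]).to_frame()

n_cols = 2  # assumption: the number of concatenated columns is known
# Each block of len(temp) // n_cols values is one original column,
# so reshape to (n_cols, n_rows) and transpose back to (n_rows, n_cols).
values = temp[0].to_numpy().reshape(n_cols, -1).T
restored = pd.DataFrame(values, index=temp.index[: values.shape[0]])
print(restored)
```

The original column labels are not stored in temp, so restored gets default integer column labels.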

Setting values in DataFrames using .loc

I have a situation where it would be extremely useful to modify a value in a dataframe using a combination of loc and iloc.
df = pd.DataFrame([[1,2],[1,3],[1,4],[2,6],[2,5],[2,7]],columns=['A','B'])
Basically, in the dataframe above, I would like to take only the rows where column A equals some value (e.g. A == 2), which would give:
A B
3 2 6
4 2 5
5 2 7
Then I would like to modify the value of B in the second of those rows (which is actually index 4 in this case).
I can access the value I want using this command:
df.loc[df['A'] == 2,'B'].iat[1]
(or .iloc instead of .iat, but I heard that iat is faster when changing many single values)
It returns: 5
However, I cannot seem to modify it using the same command:
df.loc[df['A'] == 2,'B'].iat[1] = 0
The dataframe is left unchanged:
A B
0 1 2
1 1 3
2 1 4
3 2 6
4 2 5
5 2 7
I would like to get this :
A B
0 1 2
1 1 3
2 1 4
3 2 6
4 2 0
5 2 7
Thank you !
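We should not chain .loc with .iloc (or .iat / .at): the chained assignment writes to a temporary copy rather than to df. Look up the index label first, then assign with a single .loc call: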
df.loc[df.index[df.A==2][1],'B']=0
df
A B
0 1 2
1 1 3
2 1 4
3 2 6
4 2 0
5 2 7
You can work around this with cumsum, which counts the matches so far:
s = df['A'].eq(2)
df.loc[s & s.cumsum().eq(2), 'B'] = 0
Output:
A B
0 1 2
1 1 3
2 1 4
3 2 6
4 2 0
5 2 7
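Since the question mentions .iat, a purely positional variant is sketched below as an alternative to the answers above. It finds the integer position of the second matching row with np.flatnonzero and writes through a single df.iat call, so no intermediate copy is involved. The names pos and col are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [1, 4], [2, 6], [2, 5], [2, 7]], columns=['A', 'B'])

# Integer positions of the rows where A == 2, then pick the second one.
pos = np.flatnonzero(df['A'].to_numpy() == 2)[1]
col = df.columns.get_loc('B')  # integer position of column B

# A single indexer on df itself, so the assignment modifies df.
df.iat[pos, col] = 0
print(df)
```

Because it works on positions rather than labels, this also behaves predictably when the index contains duplicate labels.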

multiple rows for row in pandas dataframe python

For a column in a pandas DataFrame with several rows I want to create a new column that has a specified number of rows that form sub-levels of the rows of the previous column. I'm trying this in order to create a large data matrix containing ranges of values as an input for a model later on.
As an example I have a small DataFrame as follows:
df:
A
1 1
2 2
3 3
. ..
To this DataFrame I would like to add 3 rows per row in the 'A' column of the DataFrame, forming a new column named 'B'. The result should be something like this:
df:
A B
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
. .. ..
I have tried various things. The most logical approach seemed to be a list comprehension combined with an if statement, using something like iterrows() to iterate over the rows of the DataFrame and then appending the new rows, but I cannot get it to work, especially the duplication of the rows of the 'A' column.
Does anyone know how to do this?
Any suggestion is appreciated, many thanks in advance
I think you need numpy.repeat and numpy.tile with DataFrame constructor:
df = pd.DataFrame({'A':np.repeat(df['A'].values, 3),
'B':np.tile(df['A'].values, 3)})
print (df)
A B
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 3 1
7 3 2
8 3 3
In [28]: pd.DataFrame({'A':np.repeat(df.A.values, 3), 'B':np.tile(df.A.values,3)})
Out[28]:
A B
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 3 1
7 3 2
8 3 3
Here's another NumPy way, using np.repeat to create one column and then reusing it for the other -
In [282]: df.A
Out[282]:
1 4
2 9
3 5
Name: A, dtype: int64
In [288]: r = np.repeat(df.A.values[:,None],3,axis=1)
In [289]: pd.DataFrame(np.c_[r.ravel(), r.T.ravel()], columns=['A','B'])
Out[289]:
A B
0 4 4
1 4 9
2 4 5
3 9 4
4 9 9
5 9 5
6 5 4
7 5 9
8 5 5
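Note that the answers above fill B by tiling the values of A itself, which reproduces the 1, 2, 3 sub-levels from the question only when A happens to contain 1, 2, 3. A sketch that numbers the sub-levels independently of A's values follows; n_sub and the 1..n_sub numbering are assumptions taken from the question's example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [4, 9, 5]})
n_sub = 3  # assumption: three sub-levels per row, numbered 1..3

expanded = pd.DataFrame({
    # Repeat each value of A n_sub times: 4, 4, 4, 9, 9, 9, ...
    'A': np.repeat(df['A'].to_numpy(), n_sub),
    # Tile the sub-level numbers once per original row: 1, 2, 3, 1, 2, 3, ...
    'B': np.tile(np.arange(1, n_sub + 1), len(df)),
})
print(expanded)
```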
