Removing columns in Pandas - python

I work with a big pandas DataFrame and noticed that some columns have the same values in every row, but different names.
Also, some values are text or time-series data.
Is there an easy way to get rid of these duplicate columns and keep the first of each?
Many thanks

Let's create a dummy DataFrame in which two columns with different names are duplicates.
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 'b', 5, 6],
    'col2': [11, 'a', 13, 14, 15, 16],
    'col3': [1, 2, 3, 'b', 5, 6],
})
  col1 col2 col3
0    1   11    1
1    2    a    2
2    3   13    3
3    b   14    b
4    5   15    5
5    6   16    6
To remove the duplicate columns, first take the transpose, then apply drop_duplicates, and take the transpose again:
df.T.drop_duplicates().T
Result:
  col1 col2
0    1   11
1    2    a
2    3   13
3    b   14
4    5   15
5    6   16
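Note that the double transpose copies the whole frame and can disturb dtypes on numeric data. A minimal sketch of an alternative that selects the columns directly, giving the same result here (first of each duplicate group kept):
# duplicated() on the transpose flags every column that repeats an
# earlier column's values; keep only the unflagged columns
dupes = df.T.duplicated()
df_clean = df.loc[:, ~dupes]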

Related

Pandas merge multiple value columns into a value and type column

I have a pandas dataframe where there are multiple integer value columns denoting a count. I want to transform this dataframe such that the value columns are merged into one column but another column is created denoting the column the value was taken from.
Input
   a  b   c
0  2  5   8
1  3  6   9
2  4  7  10
Output
   count type
0      2    a
1      3    a
2      4    a
3      5    b
4      6    b
5      7    b
6      8    c
7      9    c
8     10    c
I'm sure this is possible by looping over the entries and creating however many rows for each original row, but I'm sure there is a pandas way to achieve this, and I would like to know what it is called.
You could do that with the following:
pd.melt(df, value_vars=['a','b','c'], value_name='count', var_name='type')
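For completeness, a minimal runnable version (the DataFrame construction is assumed from the Input table above; note that melt emits the type column before count, so a reorder matches the desired output exactly):
import pandas as pd

df = pd.DataFrame({'a': [2, 3, 4], 'b': [5, 6, 7], 'c': [8, 9, 10]})

# melt stacks the value columns into one 'count' column and records
# the originating column name in 'type'
out = pd.melt(df, value_vars=['a', 'b', 'c'],
              value_name='count', var_name='type')[['count', 'type']]
print(out)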

Mapping column values with index of List

I have the below dataframe:
Df1
Col1
3
5
6
7
9
and I have the below list:
Mapping_list = ["Sales", "Pre-Sales", "Marketing", "Digital-Banking", "Payments", "telecom", "Core-Banking", "Infra", "Cards", "Commercial-Banking"]
I want to map the column values to the list by index, like below:
Col1  Values
3     Digital-Banking
5     telecom
6     Core-Banking
7     Infra
9     Commercial-Banking
I could have done this if I needed to map against another dataframe's index instead of a list, but with a list I am facing an issue.
You can convert your list to a Series and map it, since map accepts a Series:
df['Values'] = df.Col1.map(pd.Series(Mapping_list))
Prints:
   Col1              Values
0     3     Digital-Banking
1     5             telecom
2     6        Core-Banking
3     7               Infra
4     9  Commercial-Banking
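An equivalent sketch using an explicit dictionary, in case you prefer that style (enumerate pairs each list position with its label; this is a stylistic alternative, not a different result):
# map() also accepts a dict: {0: 'Sales', 1: 'Pre-Sales', ...}
df['Values'] = df['Col1'].map(dict(enumerate(Mapping_list)))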

How to prevent pandas from assigning a value from one df to a column of another for only one row?

I have a df that looks like this:
id  col1  col2
 1     2     3
 4     5     6
 7     8     9
When I go to add a new column and assign a value like this:
df['new_col'] = old_df['email']
The assignment only applies the value to the first row, like so:
id  col1  col2  new_col
 1     2     3  a#a.com
 4     5     6      NaN
 7     8     9      NaN
How do I make the assignment apply to all rows, like so:
id  col1  col2  new_col
 1     2     3  a#a.com
 4     5     6  a#a.com
 7     8     9  a#a.com
Edit:
old_df:
id  col3  col4  email
 1     2     3  a#a.com
Pandas Series assignment works by index alignment. Since old_df only contains index 0, only index 0 (the first row) of df is updated.
For your particular problem, you can use iat and assign a scalar to a series:
df['new_col'] = old_df['email'].iat[0]
This works because Pandas broadcasts scalars to the whole series irrespective of index.
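A minimal runnable sketch of the fix, with the frames reconstructed from the tables above:
import pandas as pd

df = pd.DataFrame({'id': [1, 4, 7], 'col1': [2, 5, 8], 'col2': [3, 6, 9]})
old_df = pd.DataFrame({'id': [1], 'col3': [2], 'col4': [3],
                       'email': ['a#a.com']})

# .iat[0] pulls out the scalar value; assigning a scalar broadcasts it
# to every row regardless of index alignment
df['new_col'] = old_df['email'].iat[0]
print(df)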

Pandas: update a column based on index in another dataframe

I want to update a couple of columns in a dataframe using a multiplying factor from another df (both dfs have a 'KEY' column). Though I was able to achieve this, it takes a lot of processing time since I have a few million records. I am looking for a more optimal solution, if any.
Let me explain my scenario using dummy dfs. I have a dataframe df1 as below
In [8]: df1
Out[8]:
   KEY  col2  col3  col4
0    1     1    10     5
1    2     7    13     8
2    1    12    15    12
3    4     3    23     1
4    3    14     5     6
Now I want to change col2 and col3 by a factor that I fetch from the below df2 dataframe based on the KEY.
In [11]: df2
Out[11]:
     FACTOR
KEY
1       100
2      3000
3      1000
4       200
5        50
I'm using the below for loop to achieve what I need.
In [12]: for index, row in df2.iterrows():
    ...:     df1.loc[(df1['KEY']==index), ['col2', 'col3']] *= df2.loc[index]['FACTOR']
In [13]: df1
Out[13]:
   KEY   col2   col3  col4
0    1    100   1000     5
1    2  21000  39000     8
2    1   1200   1500    12
3    4    600   4600     1
4    3  14000   5000     6
This does the job, but my actual data has a few million records that arrive in real time, and each batch takes about 15 seconds to process. I am looking for a better solution, since the for loop appears to run in O(n) time.
You should use a merge:
c = df1.merge(df2, on="KEY")
The c dataframe will now contain the "FACTOR" column alongside each row, which is the result you want to achieve.
If one of the fields to merge on is the index, you can use:
c = df1.merge(df2, left_on="KEY", right_index=True)
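A sketch of how the factor can then be applied in a single vectorised step (this uses map as an alternative to the merge, under the assumption that df2 is indexed by KEY as in the question):
import pandas as pd

df1 = pd.DataFrame({'KEY': [1, 2, 1, 4, 3],
                    'col2': [1, 7, 12, 3, 14],
                    'col3': [10, 13, 15, 23, 5],
                    'col4': [5, 8, 12, 1, 6]})
df2 = pd.DataFrame({'FACTOR': [100, 3000, 1000, 200, 50]},
                   index=pd.Index([1, 2, 3, 4, 5], name='KEY'))

# map() looks up each row's FACTOR by KEY; mul(..., axis=0) multiplies
# both columns row-wise, with no Python-level loop over the keys
factor = df1['KEY'].map(df2['FACTOR'])
df1[['col2', 'col3']] = df1[['col2', 'col3']].mul(factor, axis=0)
print(df1)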

How to merge two DataFrames of unequal size based on row value

I have two pandas DataFrames of the type
DataFrame 1
Index  Name       Property1  Property2
0      ("a","b")  1          2
1      ("c","d")  3          4
2      ("e","f")  5          6
And the second one has common values, but not at the same index (which I don't care about).
DataFrame 2
Index  Name       Property3  Property4
0      ("g","h")  7          8
1      ("i","j")  9          10
2      ("k","l")  11         12
3      ("a","b")  13         14
4      ("c","d")  15         16
5      ("e","f")  17         18
Is there a way to combine these such that the resulting DataFrame contains the common rows, matched on the Name shared between the tables?
i.e. the result of the pandas operation should be:
Result Frame
Index  Name       Property1  Property2  Property3  Property4
0      ("a","b")  1          2          13         14
1      ("c","d")  3          4          15         16
2      ("e","f")  5          6          17         18
Sorry I am not giving you actual pandas code to create the DataFrames above, but I want to conceptually understand how to join two unequally sized DataFrames with different "indexes" based on a column name. I tried merge, concat, and join, but I don't get the result I want.
A default merge works fine here, assuming your index actually is your index:
In [22]:
df1.merge(df2)
Out[22]:
        Name  Property1  Property2  Property3  Property4
0  ("a","b")          1          2         13         14
1  ("c","d")          3          4         15         16
2  ("e","f")          5          6         17         18
Here the merge looks for common columns and performs an inner merge on those columns.
You can be explicit and specify that you want to merge on the 'Name' column:
df1.merge(df2, on='Name')
but in this case it's not necessary because the only common column is 'Name' anyway.
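Since the question didn't include construction code, here is a hypothetical reconstruction of the two frames to make the answer runnable (the Name values are assumed to be plain strings; tuples would behave the same way):
import pandas as pd

df1 = pd.DataFrame({'Name': ['("a","b")', '("c","d")', '("e","f")'],
                    'Property1': [1, 3, 5],
                    'Property2': [2, 4, 6]})
df2 = pd.DataFrame({'Name': ['("g","h")', '("i","j")', '("k","l")',
                             '("a","b")', '("c","d")', '("e","f")'],
                    'Property3': [7, 9, 11, 13, 15, 17],
                    'Property4': [8, 10, 12, 14, 16, 18]})

# The inner merge keeps only the Names present in both frames
print(df1.merge(df2, on='Name'))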
