I have a pandas dataframe and a dictionary whose values are lists. The list values are unique across all the keys. I want to add a new column to my dataframe that holds, for each row, the dictionary key whose list contains that row's value. E.g. suppose I have a dataframe like this
import pandas as pd
df = {'a':1, 'b':2, 'c':2, 'd':4, 'e':7}
df = pd.DataFrame.from_dict(df, orient='index', columns = ['col2'])
df = df.reset_index().rename(columns={'index':'col1'})
df
col1 col2
0 a 1
1 b 2
2 c 2
3 d 4
4 e 7
Now I also have a dictionary like this
my_dict = {'x':['a', 'c'], 'y':['b'], 'z':['d', 'e']}
I want the output like this
col1 col2 col3
0 a 1 x
1 b 2 y
2 c 2 x
3 d 4 z
4 e 7 z
Presently I am doing this by reversing the dictionary first, i.e. like this
my_dict_rev = {value:key for key in my_dict for value in my_dict[key]}
df['col3']= df['col1'].map(my_dict_rev)
df
But I am sure that there must be some direct method.
I know this is an old question but here are two other ways to do the same job. First convert my_dict to a Series object, then explode it. Then reverse the mapping and use map:
tmp = pd.Series(my_dict).explode()
df['col3'] = df['col1'].map(pd.Series(tmp.index, tmp))
Another option (starts similar to above) but instead of map, merge:
df = df.merge(pd.Series(my_dict, name='col1').explode().rename_axis('col3').reset_index())
Output:
col1 col2 col3
0 a 1 x
1 b 2 y
2 c 2 x
3 d 4 z
4 e 7 z
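One difference between the two approaches may be worth checking before picking one: map keeps every row and leaves NaN where col1 has no entry in the dictionary, while merge with its default how='inner' drops such rows. A minimal sketch, assuming a made-up extra row 'f' that is not in my_dict (the names lookup, mapped, inner and left are only for illustration):

```python
import pandas as pd

# Same data as above, plus a hypothetical unmatched value 'f'
df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'col2': [1, 2, 2, 4, 7, 9]})
my_dict = {'x': ['a', 'c'], 'y': ['b'], 'z': ['d', 'e']}

# Long form: the index holds the dictionary keys, the values hold the list items
tmp = pd.Series(my_dict).explode()

# map: build the reversed lookup (list item -> key); unmatched rows become NaN
lookup = pd.Series(tmp.index, index=tmp.values)
mapped = df.assign(col3=df['col1'].map(lookup))

# merge: the default inner join drops 'f', how='left' keeps it with NaN
right = tmp.rename('col1').rename_axis('col3').reset_index()
inner = df.merge(right)
left = df.merge(right, how='left')

print(mapped)
print(len(inner), len(left))
```

If every value in col1 is guaranteed to appear in the dictionary, the two approaches give the same result.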
I basically want the solution from this question to be applied to 2 blank rows.
Insert Blank Row In Python Data frame when value in column changes?
I've messed around with the solution but don't understand the code enough to alter it correctly.
You can do:
num_empty_rows = 2
# DataFrame.append was removed in pandas 2.0, so the pieces are joined with pd.concat
blank = pd.DataFrame(data=[['']*len(df.columns)]*num_empty_rows, columns=df.columns)
df = (pd.concat([pd.concat([g, blank]) for _, g in df.groupby('Col1')])
.reset_index(drop=True).iloc[:-num_empty_rows])
As you can see, a dataframe of num_empty_rows blank rows is appended after each group, and reset_index is performed at the end. The last iloc[:-num_empty_rows] is optional; it removes the empty rows at the end.
Example input:
df = pd.DataFrame({'Col1': ['A', 'A', 'A', 'B', 'C'],
'Col2':['s','s','b','b','l'],
'Col3':['b','j','d','a','k'],
'Col4':['d','k','q','d','p']
})
Output:
Col1 Col2 Col3 Col4
0 A s b d
1 A s j k
2 A b d q
3
4
5 B b a d
6
7
8 C l k p
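The linked question asks for a blank row whenever the value in the column changes, which is not quite the same as grouping: groupby('Col1') sorts the keys and merges non-adjacent rows that share a value. Here is a rough sketch that splits on consecutive runs instead. It uses pd.concat because DataFrame.append no longer exists in pandas 2.0 and later. The helper name insert_blank_rows and its parameters are made up for illustration.

```python
import pandas as pd

def insert_blank_rows(df, col, n=2):
    """Return a copy of df with n empty-string rows after each run of equal values in col."""
    if df.empty:
        return df.copy()
    blank = pd.DataFrame([[''] * len(df.columns)] * n, columns=df.columns)
    # A new run starts wherever the value differs from the previous row
    run_id = (df[col] != df[col].shift()).cumsum()
    pieces = []
    for _, g in df.groupby(run_id, sort=False):
        pieces.extend([g, blank])
    # Leave out the trailing blank block, then renumber the index
    return pd.concat(pieces[:-1], ignore_index=True)

df = pd.DataFrame({'Col1': ['A', 'A', 'A', 'B', 'C'],
                   'Col2': ['s', 's', 'b', 'b', 'l'],
                   'Col3': ['b', 'j', 'd', 'a', 'k'],
                   'Col4': ['d', 'k', 'q', 'd', 'p']})
out = insert_blank_rows(df, 'Col1', n=2)
print(out)
```

For the example input the runs coincide with the groups, so the result matches the output shown above.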
I have DataFrame df = pd.DataFrame({'col1': ["a","b","c","d","e"], 'col2': [1,3,3,2,6]}) that looks like
Input:
col1 col2
0 a 1
1 b 3
2 c 3
3 d 2
4 e 6
I would like to remove rows from "col1" that share a common value in "col2". The expected output would look something like...
Output:
col1 col2
0 a 1
3 d 2
4 e 6
What would be the process of doing this?
This short snippet should do the trick:
df.drop_duplicates(subset=['col2'], keep=False)
Explanation
We use drop_duplicates to drop the duplicates, and we pass subset=['col2'] so that only col2 is compared, as you requested. To drop all occurrences of a duplicated value (rather than keeping the first occurrence of each duplicate, for example), we use keep=False. Note that drop_duplicates returns a new dataframe, so assign the result if you want to keep it.
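To make the effect of keep concrete, here is a small sketch on the question's data comparing the default keep='first' with keep=False, together with an equivalent boolean mask built from duplicated. The variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': ["a", "b", "c", "d", "e"], 'col2': [1, 3, 3, 2, 6]})

# Default keep='first': row 1 (the first 3) survives, row 2 is dropped
keep_first = df.drop_duplicates(subset=['col2'])

# keep=False: every row whose col2 value occurs more than once is dropped
keep_none = df.drop_duplicates(subset=['col2'], keep=False)

# The same selection written as a boolean mask
mask = ~df.duplicated(subset=['col2'], keep=False)
masked = df[mask]

print(keep_first)
print(keep_none)
```

The original index labels (0, 3, 4) are preserved; chain .reset_index(drop=True) if a fresh index is wanted.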
This will do the trick
from collections import Counter
df = pd.DataFrame({'col1': ["a","b","c","d","e"], 'col2': [1,3,3,2,6]})
c = Counter(df['col2'])
ls = [k for k,v in c.items() if v==1]
_fltr = df['col2'].isin(ls)
df.loc[_fltr,:]
According to this thread, we could use map or replace to remap values of a Dataframe using a defined dictionary. I have tried this and it did correctly remap the values, but the output result only produces the column I performed the operation on (of type series) instead of the full Dataframe.
How can I perform the mapping but keep the other columns (with 'last') in the new data3 ?
data3 = data['last'].map(my_dict)
I think what you are trying to do is this:
data['last'] = data['last'].map(my_dict)
Updating based on comment with relation to the link:
In [1]: di = {1: "A", 2: "B"}
In [5]: from numpy import nan; from pandas import DataFrame
In [6]: df = DataFrame({'col1':['w', 1, 2], 'col2': ['a', 2, nan]})
In [7]: df
Out[7]:
col1 col2
0 w a
1 1 2
2 2 NaN
In [8]: df['col1'].map(di)
Out[8]:
0 NaN
1 A
2 B
Name: col1, dtype: object
In [9]: df
Out[9]:
col1 col2
0 w a
1 1 2
2 2 NaN
In [10]: df['col1'] = df['col1'].map(di)
In [11]: df
Out[11]:
col1 col2
0 NaN a
1 A 2
2 B NaN
If you want this to happen in data3 instead of data then you could assign the Series result of the map to a column in data3.
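One way to get the mapped column into a separate data3 while leaving data untouched is DataFrame.assign, which returns a new dataframe. The sketch below uses made-up stand-ins for data and my_dict, since the question does not show them. The second variant keeps the original value where the dictionary has no matching key, instead of NaN.

```python
import pandas as pd

# Hypothetical stand-ins for the question's `data` and `my_dict`
data = pd.DataFrame({'first': ['Ann', 'Bob', 'Cy'], 'last': [1, 2, 3]})
my_dict = {1: 'A', 2: 'B'}

# New dataframe with all columns; unmapped values (3 here) become NaN
data3 = data.assign(last=data['last'].map(my_dict))

# Variant: fall back to the original value when the key is missing
data3_keep = data.assign(last=data['last'].map(lambda v: my_dict.get(v, v)))

print(data3)
print(data3_keep)
```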
I was trying to clean up column names in a dataframe but only a part of the columns.
It doesn't work when trying to replace column names on a slice of the dataframe somehow, why is that?
Let's say we have the following dataframe.
Note: copyable code to reproduce the data is at the bottom.
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
I want to clean up the column names (expected output):
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Approach 1:
I can get the clean column names like this:
df.iloc[:, 1:].columns.str[:4]
Index(['ColA', 'ColB', 'ColC'], dtype='object')
Or
Approach 2:
s = df.iloc[:, 1:].columns
[col[:4] for col in s]
['ColA', 'ColB', 'ColC']
But when I try to overwrite the column names, nothing happens:
df.iloc[:, 1:].columns = df.iloc[:, 1:].columns.str[:4]
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Same for the second approach:
s = df.iloc[:, 1:].columns
cols = [col[:4] for col in s]
df.iloc[:, 1:].columns = cols
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
This does work, but you have to manually concat the name of the first column, which is not ideal:
df.columns = ['Value'] + df.iloc[:, 1:].columns.str[:4].tolist()
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Is there an easier way to achieve this? Am I missing something?
Dataframe for reproduction:
df = pd.DataFrame({'Value':[1,2,3,4],
'ColAfjkj':['a', 'b', 'c', 'd'],
'ColBhuqwa':['e', 'f', 'g', 'h'],
'ColCouiqw':['i', 'j', 'k', 'l']})
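This happens because df.iloc[:, 1:] returns a new DataFrame, so assigning to its columns only renames that temporary object and leaves df untouched. You also cannot change part of df.columns in place, because pandas' index is immutable. If you check the documentation for class pandas.Index, you'll see that it is defined as: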
Immutable ndarray implementing an ordered, sliceable set
So in order to modify it you'll have to create a new list of column names, for instance with:
df.columns = [df.columns[0]] + list(df.iloc[:, 1:].columns.str[:4])
Another option is to use rename with a dictionary containing the columns to replace:
df = df.rename(columns=dict(zip(df.columns[1:], df.columns[1:].str[:4])))
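A short sketch of both points, using the question's dataframe: assigning to the columns of a slice leaves df as it was, and rename also accepts a function, which avoids building the dictionary by hand. The rule "shorten names that start with 'Col'" is an assumption made for this example.

```python
import pandas as pd

df = pd.DataFrame({'Value': [1, 2, 3, 4],
                   'ColAfjkj': ['a', 'b', 'c', 'd'],
                   'ColBhuqwa': ['e', 'f', 'g', 'h'],
                   'ColCouiqw': ['i', 'j', 'k', 'l']})

# The slice is a separate DataFrame object; renaming its columns does not touch df
sub = df.iloc[:, 1:]
sub.columns = sub.columns.str[:4]
print(list(df.columns))   # still the long names

# rename with a callable: only names starting with 'Col' are shortened
renamed = df.rename(columns=lambda c: c[:4] if c.startswith('Col') else c)
print(list(renamed.columns))
```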
To overwrite column names you can use the .rename() method.
So, it will look like:
df.rename(columns={'ColAfjkj':'ColA',
'ColBhuqwa':'ColB',
'ColCouiqw':'ColC'}
, inplace=True)
More info regarding rename here in docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
I had this problem as well and came up with this solution:
First, create a mask of the columns you want to rename
mask = df.iloc[:,1:4].columns
Then, use list comprehension and a conditional to rename just the columns you want
df.columns = [x[:4] if x in mask else x for x in df.columns]
I would like to map the values in df2['col2'] to df['col1']:
df col1 col2
0 w a
1 1 2
2 2 3
I would like to use a column from the dataframe as a dictionary to get:
col1 col2
0 w a
1 A 2
2 B 3
However the data dictionary is just a column in df2, which looks like
df2 col1 col2
1 1 A
2 2 B
I have tried using this:
di = {df2['col1']: df2['col2']}
final = df1.replace({"df2['col2']": di})
But get an error: TypeError: 'Series' objects are mutable, thus they cannot be hashed
I have about 200,000 rows. Any help would be appreciated.
Edit:
The sample dictionary would look like di = {1: "A", 2: "B"}, but is in df2['col1']: df2['col2']. I have 200k+ rows, can I convert df2['col1']: df2['col2'] to a tuple, etc?
You can build a lookup dictionary based on the col1:col2 of df2 and then use that to replace the values in df1.col1.
import pandas as pd
df1 = pd.DataFrame({'col1':['w',1,2],'col2':['a',2,3]})
df2 = pd.DataFrame({'col1':[1,2],'col2':['A','B']})
print(df1)
# col1 col2
#0 w a
#1 1 2
#2 2 3
print(df2)
# col1 col2
#0 1 A
#1 2 B
dataLookUpDict = {row[1]:row[2] for row in df2[['col1','col2']].itertuples()}
final = df1.replace({'col1': dataLookUpDict})
print(final)
# col1 col2
#0 w a
#1 A 2
#2 B 3
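With 200k+ rows it may also be worth building the lookup with zip rather than itertuples, since zip pairs the two columns directly without creating a tuple object per row. This is a sketch under that assumption, not a benchmarked claim. It maps with a fallback so values without a key (such as 'w') are left as they are.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['w', 1, 2], 'col2': ['a', 2, 3]})
df2 = pd.DataFrame({'col1': [1, 2], 'col2': ['A', 'B']})

# Pair df2.col1 with df2.col2 directly: {1: 'A', 2: 'B'}
lookup = dict(zip(df2['col1'], df2['col2']))

# Replace values that have a key; keep everything else unchanged
final = df1.assign(col1=df1['col1'].map(lambda v: lookup.get(v, v)))
print(final)
```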