I have the dataframe -
df = pd.DataFrame({'colA':['a', 'a', 'a', 'b' ,'b'], 'colB':['a', 'b', 'a', 'c', 'b'], 'colC':['x', 'x', 'y', 'y', 'y']})
I would like to write a function to replace each value with its frequency count in that column. For example, colA will become [3, 3, 3, 2, 2].
I have attempted to do this by creating a dictionary of values and their frequency counts, assigning that dictionary to a variable freq, then mapping the column values to freq. I have written the following function:
def LabelEncode_method1(col):
    freq = col.value_counts().to_dict()
    col = col.map(freq)
    return col.head()
When I run LabelEncode_method1(df.colA), I get the result 3, 3, 3, 2, 2. However, when I then inspect the dataframe df, the values in colA are still 'a', 'a', 'a', 'b', 'b'.
What am I doing wrong, and how do I fix my function?
Also, how do I write another function that loops through all columns and maps the values to freq, instead of calling the function separately for each of the 3 columns?
You can use groupby + transform:
df['new'] = df.groupby('colA')['colA'].transform('count')
You can use map + value_counts (which you have already found; you just need to assign the result back to your DataFrame):
df['colA'].map(df['colA'].value_counts())
0 3
1 3
2 3
3 2
4 2
Name: colA, dtype: int64
For all columns, which will create a new DataFrame:
pd.concat([
    df[col].map(df[col].value_counts()) for col in df
], axis=1)
colA colB colC
0 3 2 2
1 3 2 2
2 3 2 3
3 2 1 3
4 2 2 3
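To answer the original two questions directly: the function only returns a preview with .head() and never assigns the mapped result back to the dataframe. A minimal sketch of a fixed version, looped over every column (assuming you want to overwrite each column in place):

```python
import pandas as pd

df = pd.DataFrame({'colA': ['a', 'a', 'a', 'b', 'b'],
                   'colB': ['a', 'b', 'a', 'c', 'b'],
                   'colC': ['x', 'x', 'y', 'y', 'y']})

def label_encode(col):
    # Map each value to its frequency count in the column.
    return col.map(col.value_counts())

# Loop over all columns instead of calling the function per column;
# assigning the result back is what actually changes df.
for c in df.columns:
    df[c] = label_encode(df[c])

print(df)
```

The key point is the assignment `df[c] = ...`; calling the function alone only produces a new Series and leaves df untouched.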
Related
I have a pandas dataframe and I want to create a new dummy variable based on if the values of a variable in my dataframe equal values in a list.
df = pd.DataFrame({'variable1':[1,2,3,4,5,6,7,8],
'variable2':['a', 'r', 'b', 'w', 'c', 'p', 'l', 'a']})
my_list = ['a', 'b', 'c', 'd', 'e']
How can I create a new dummy variable for the dataframe, called variable3, that equals 1 if variable2 is present in the list and 0 if not?
I tried this using:
df['variable3'] = np.where(
    df['variable2'] in my_list,
    1, 0)
However, this throws a ValueError: The truth value of a Series is ambiguous.
I've been looking for an answer for this for a long time but none were sufficient for this problem.
Do you have any suggestions?
You're almost there. When you want to check if the value of a dataframe column matches some list or another dataframe column, you can use df.isin.
df['variable3'] = np.where(
df['variable2'].isin(my_list),
1, 0)
df
Out[16]:
variable1 variable2 variable3
0 1 a 1
1 2 r 0
2 3 b 1
3 4 w 0
4 5 c 1
5 6 p 0
6 7 l 0
7 8 a 1
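Since .isin() already returns a boolean Series, a slightly shorter variant (an alternative to np.where with the same result) is to cast the mask to int:

```python
import pandas as pd

df = pd.DataFrame({'variable1': [1, 2, 3, 4, 5, 6, 7, 8],
                   'variable2': ['a', 'r', 'b', 'w', 'c', 'p', 'l', 'a']})
my_list = ['a', 'b', 'c', 'd', 'e']

# isin returns a boolean Series; astype(int) turns True/False into 1/0.
df['variable3'] = df['variable2'].isin(my_list).astype(int)

print(df)
```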
In my application I am multiplying two Pandas Series which both have multiple index levels. Sometimes, a level contains only a single unique value, in which case I don't get all the index levels from both Series in my result.
To illustrate the problem, let's take two series:
s1 = pd.Series(np.random.randn(4), index=[[1, 1, 1, 1], [1,2,3,4]])
s1.index.names = ['A', 'B']
A B
1 1 -2.155463
2 -0.411068
3 1.041838
4 0.016690
s2 = pd.Series(np.random.randn(4), index=[['a', 'a', 'a', 'a'], [1,2,3,4]])
s2.index.names = ['C', 'B']
C B
a 1 0.043064
2 -1.456251
3 0.024657
4 0.912114
Now, if I multiply them, I get the following:
s1.mul(s2)
A B
1 1 -0.092822
2 0.598618
3 0.025689
4 0.015223
While my desired result would be
A C B
1 a 1 -0.092822
2 0.598618
3 0.025689
4 0.015223
How can I keep index level C in the multiplication?
I have so far been able to get the right result as shown below, but would much prefer a neater solution that keeps my code simpler and more readable.
s3 = s2.mul(s1).to_frame()
s3['C'] = 'a'
s3.set_index('C', append=True, inplace=True)
You can use Series.unstack with DataFrame.stack:
s = s2.unstack(level=0).mul(s1, level=1, axis=0).stack().reorder_levels(['A','C','B'])
print (s)
A C B
1 a 1 0.827482
2 -0.476929
3 -0.473209
4 -0.520207
dtype: float64
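If the unstack/stack chain feels opaque, one alternative sketch (shown here with fixed values instead of random data so the numbers are checkable) is to align the two Series explicitly with a merge on the shared level B and then rebuild all three index levels:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0, 4.0],
               index=pd.MultiIndex.from_arrays([[1, 1, 1, 1], [1, 2, 3, 4]],
                                               names=['A', 'B']))
s2 = pd.Series([10.0, 20.0, 30.0, 40.0],
               index=pd.MultiIndex.from_arrays([['a'] * 4, [1, 2, 3, 4]],
                                               names=['C', 'B']))

# Merge on the shared level B, multiply, then restore all three levels.
merged = (s1.rename('v1').reset_index()
            .merge(s2.rename('v2').reset_index(), on='B'))
result = (merged.assign(prod=merged['v1'] * merged['v2'])
                .set_index(['A', 'C', 'B'])['prod'])

print(result)
```

This is more verbose than the unstack/stack one-liner, but each step is explicit about which levels are being matched.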
I am using .size() on a groupby result in order to count how many items are in each group.
I would like the result to be saved to a new column name without manually editing the column names array, how can it be done?
This is what I have tried:
grpd = df.groupby(['A','B'])
grpd['size'] = grpd.size()
grpd
and the error I got:
TypeError: 'DataFrameGroupBy' object does not support item assignment
(on the second line)
The .size() built-in method of DataFrameGroupBy objects actually returns a Series object with the group sizes and not a DataFrame. If you want a DataFrame whose column is the group sizes, indexed by the groups, with a custom name, you can use the .to_frame() method and use the desired column name as its argument.
grpd = df.groupby(['A','B']).size().to_frame('size')
If you wanted the groups to be columns again you could add a .reset_index() at the end.
You need transform with 'size' (the length of df stays the same as before):
Notice:
Here it is necessary to select one column after the groupby, else you get an error. Because GroupBy.size counts NaNs too, it does not matter which column is selected; all columns work the same.
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'x', 'y', 'y'],
                   'B': ['a', 'c', 'c', 'b', 'b']})
print (df)
A B
0 x a
1 x c
2 x c
3 y b
4 y b
df['size'] = df.groupby(['A', 'B'])['A'].transform('size')
print (df)
A B size
0 x a 1
1 x c 2
2 x c 2
3 y b 2
4 y b 2
If you need to set a column name in an aggregated df (whose length is obviously NOT the same as before):
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'x', 'y', 'y'],
                   'B': ['a', 'c', 'c', 'b', 'b']})
print (df)
A B
0 x a
1 x c
2 x c
3 y b
4 y b
df = df.groupby(['A', 'B']).size().reset_index(name='Size')
print (df)
A B Size
0 x a 1
1 x c 2
2 y b 2
The result of df.groupby(...) is not a DataFrame. To get a DataFrame back, you have to apply a function to each group, transform each element of a group, or filter the groups.
It seems like you want a DataFrame that contains (1) all your original data in df and (2) the count of how much data is in each group. These things have different lengths, so if they need to go into the same DataFrame, you'll need to list the size redundantly, i.e., for each row in each group.
df['size'] = df.groupby(['A', 'B'])['A'].transform('size')
(Aside: It's helpful if you can show succinct sample input and expected results.)
You can set the as_index parameter in groupby to False to get a DataFrame instead of a Series:
df = pd.DataFrame({'A': ['a', 'a', 'b', 'b'], 'B': [1, 2, 2, 2]})
df.groupby(['A', 'B'], as_index=False).size()
Output:
A B size
0 a 1 1
1 a 2 1
2 b 2 2
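On pandas >= 0.25 you can also pick the output column name directly with named aggregation, a sketch of yet another route to the same aggregated result:

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'x', 'y', 'y'],
                   'B': ['a', 'c', 'c', 'b', 'b']})

# Named aggregation: the keyword (here 'Size') becomes the column name.
out = df.groupby(['A', 'B']).agg(Size=('A', 'size')).reset_index()

print(out)
```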
Let's say n is the name of the dataframe and cst is the column whose repeated items are being counted. The code below gives the count in the next column:
from collections import Counter

cstn = Counter(n.cst)
cstlist = pd.DataFrame.from_dict(cstn, orient='index').reset_index()
cstlist.columns = ['name', 'cnt']
n['cnt'] = n['cst'].map(cstlist.set_index('name')['cnt'].to_dict())
Hope this will work.
I have a dictionary that I would like to map onto a current dataframe and create a new column. I have keys in a tuple, which map onto two different columns in my dataframe.
dct = {('County', 'State'):'CountyType'}
df = pd.DataFrame(data=['County','State'])
I would like to create a new column, CountyType, by using dct to map onto the two columns in df. However, doing the following gives me an error. How else could this be done?
df['CountyType'] = (list(zip(df.County,df.State)))
df = df.replace({'CountyType': county_type_dict})
You can create a MultiIndex from two series and then map. Data from #ALollz.
df['CountyType'] = df.set_index(['County', 'State']).index.map(dct.get)
print(df)
County State CountyType
0 A 1 One
1 A 2 None
2 B 1 None
3 B 2 Two
4 B 3 Three
If you have the following dictionary with tuples as keys and a DataFrame with columns corresponding to the tuple values:
import pandas as pd
dct = {('A', 1): 'One', ('B', 2): 'Two', ('B', 3): 'Three'}
df = pd.DataFrame({'County': ['A', 'A', 'B', 'B', 'B'],
'State': [1, 2, 1, 2, 3]})
You can create a Series of the tuples from your df and then just use .map()
df['CountyType'] = pd.Series(list(zip(df.County, df.State))).map(dct)
Results in
County State CountyType
0 A 1 One
1 A 2 NaN
2 B 1 NaN
3 B 2 Two
4 B 3 Three
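One more variant of the same idea: building the tuples row-wise with apply keeps df's own index, so the assignment aligns correctly even when df does not have a default 0..n-1 index (the zip-based Series above assumes it does):

```python
import pandas as pd

dct = {('A', 1): 'One', ('B', 2): 'Two', ('B', 3): 'Three'}
df = pd.DataFrame({'County': ['A', 'A', 'B', 'B', 'B'],
                   'State': [1, 2, 1, 2, 3]})

# apply(tuple, axis=1) produces a Series of (County, State) tuples
# indexed like df, which then maps through the dict.
df['CountyType'] = df[['County', 'State']].apply(tuple, axis=1).map(dct)

print(df)
```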
I have a pandas dataframe plus a pandas series of identifiers, and would like to filter the rows from the dataframe that correspond to the identifiers in the series. To get the identifiers from the dataframe, I need to concatenate its first two columns. I have tried various things to filter, but none seem to work so far. Here is what I have tried:
1) I tried adding a column of booleans to the data frame, being true if that row corresponds to one of the identifiers, and false otherwise (hoping to be able to do filtering afterwards using the new column):
df["isInAcids"] = (df["AcNo"] + df["Sortcode"]) in acids
where
acids
is the series containing the identifiers.
However, this gives me a
TypeError: unhashable type
2) I tried filtering using the apply function:
df[df.apply(lambda x: x["AcNo"] + x["Sortcode"] in acids, axis = 1)]
This doesn't give me an error, but the length of the data frame remains unchanged, so it doesn't appear to filter anything.
3) I have added a new column, containing the concatenated strings/identifiers, and then try to filter afterwards (see Filter dataframe rows if value in column is in a set list of values):
df["ACIDS"] = df["AcNo"] + df["Sortcode"]
df[df["ACIDS"].isin(acids)]
But again, the dataframe doesn't change.
I hope this makes sense...
Any suggestions where I might be going wrong?
Thanks,
Anne
I think you're asking for something like the following:
In [1]: other_ids = pd.Series(['a', 'b', 'c', 'c'])
In [2]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'c', 'f']})
In [3]: df
Out[3]:
ids vals
0 a 1
1 b 2
2 c 3
3 f 4
In [4]: other_ids
Out[4]:
0 a
1 b
2 c
3 c
dtype: object
In this case, the series other_ids would be like your series acids. We want to select just those rows of df whose id is in the series other_ids. To do that we'll use the Series method .isin().
In [5]: df.ids.isin(other_ids)
Out[5]:
0 True
1 True
2 True
3 False
Name: ids, dtype: bool
This gives a column of bools that we can index into:
In [6]: df[df.ids.isin(other_ids)]
Out[6]:
ids vals
0 a 1
1 b 2
2 c 3
This is close to what you're doing with your 3rd attempt. Once you post a sample of your dataframe I can edit this answer, if it doesn't work already.
Reading a bit more, you may be having trouble because you have two columns in df that are your ids? We can combine the membership checks on the two columns with something like:
In [26]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'f'],
'ids2': ['e', 'f', 'c', 'f']})
In [27]: df
Out[27]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
3 f f 4
In [28]: df.ids.isin(other_ids) + df.ids2.isin(other_ids)
Out[28]:
0 True
1 True
2 True
3 False
dtype: bool
True is like 1 and False is like 0, so adding the two boolean Series from the two isin() calls acts like an OR operation (the idiomatic spelling is the | operator). Then, like before, we can index into this boolean series:
In [29]: new = df.loc[df.ids.isin(other_ids) | df.ids2.isin(other_ids)]
In [30]: new
Out[30]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
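Coming back to the original setup, where AcNo + Sortcode together form the identifier: attempt 3 was on the right track, but note that df[mask] returns a new DataFrame rather than modifying df, so the result has to be assigned somewhere. A minimal sketch with hypothetical data in place of the real AcNo/Sortcode values:

```python
import pandas as pd

# Hypothetical columns mirroring the question's AcNo/Sortcode setup.
df = pd.DataFrame({'AcNo': ['12', '34', '56'],
                   'Sortcode': ['AB', 'CD', 'EF'],
                   'val': [1, 2, 3]})
acids = pd.Series(['12AB', '56EF'])

# Concatenate the two id columns, test membership, and keep the result.
filtered = df[(df['AcNo'] + df['Sortcode']).isin(acids)]

print(filtered)
```

If the real data "doesn't change", check that the concatenated strings really match the identifiers exactly (stray whitespace or mismatched dtypes will make isin return all False).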