This question is an extension of an earlier question (reproduced near the bottom of this post). Consider the pandas DataFrame visualized in the table below.
  respondent brand engine country aware aware_2 aware_3 age tesst set
0 a volvo p swe 1 0 1 23 set set
1 b volvo None swe 0 0 1 45 set set
2 c bmw p us 0 0 1 56 test test
3 d bmw p us 0 1 1 43 test test
4 e bmw d germany 1 0 1 34 set set
5 f audi d germany 1 0 1 59 set set
6 g volvo d swe 1 0 0 65 test set
7 h audi d swe 1 0 0 78 test set
8 i volvo d us 1 1 1 32 set set
To convert a column with string entries, one can create a mapping and then use pandas.DataFrame.replace(). For example:
mapping = {'set': 1, 'test': 2}
df.replace({'set': mapping, 'tesst': mapping})
This would lead to the following DataFrame (table):
  respondent brand engine country aware aware_2 aware_3 age tesst set
0 a volvo p swe 1 0 1 23 1 1
1 b volvo None swe 0 0 1 45 1 1
2 c bmw p us 0 0 1 56 2 2
3 d bmw p us 0 1 1 43 2 2
4 e bmw d germany 1 0 1 34 1 1
5 f audi d germany 1 0 1 59 1 1
6 g volvo d swe 1 0 0 65 2 1
7 h audi d swe 1 0 0 78 2 1
8 i volvo d us 1 1 1 32 1 1
As seen above, the strings in the last two columns have been replaced with numbers representing those strings.
The question is then: Is there a faster and not so hands-on approach to replace all the strings into a number? Can one automatically create a mapping (and output it somewhere for human reference)?
Something that makes the DataFrame end up like:
  respondent brand engine country aware aware_2 aware_3 age tesst set
0 1 1 1 1 1 0 1 23 1 1
1 2 1 2 1 0 0 1 45 1 1
2 3 2 1 2 0 0 1 56 2 2
3 4 2 1 2 0 1 1 43 2 2
4 5 2 3 3 1 0 1 34 1 1
5 6 3 3 3 1 0 1 59 1 1
6 7 1 3 1 1 0 0 65 2 1
7 8 3 3 1 1 0 0 78 2 1
8 9 1 3 2 1 1 1 32 1 1
Also output:
[{'volvo': 1, 'bmw': 2, 'audi': 3}, {'p': 1, 'None': 2, 'd': 3}, {'swe': 1, 'us': 2, 'germany': 3}]
Note that the output list of maps (dicts) should not be hard-coded but instead produced by the code.
You can adapt the code given in this response: https://stackoverflow.com/a/39989896/15320403 (inside the post you linked) to generate a mapping for each column of your choice, and apply replace as you suggested:
all_brands = df.brand.unique()
brand_dic = dict(zip(all_brands, range(len(all_brands))))
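For instance, a minimal sketch extending this to every string column at once (assuming df is the DataFrame above; note the generated codes start at 0 rather than 1, unlike the example output):
import pandas as pd

# Build one mapping per string column, then replace everything in one pass
mappings = {}
for col in df.select_dtypes(include='object').columns:
    uniques = df[col].unique()
    mappings[col] = dict(zip(uniques, range(len(uniques))))

df = df.replace(mappings)  # numeric codes everywhere
print(mappings)            # keep this as the human-readable reference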
You will need to first change the type of the columns to Categorical and then create a new column or overwrite the existing column with codes:
df['brand'] = pd.Categorical(df['brand'])
df['brand_codes'] = df['brand'].cat.codes
If you need the mapping:
dict(enumerate(df['brand'].cat.categories))  # this will work only after you've converted the column to categorical
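As a quick sanity check with the sample data above (a sketch; pandas sorts the categories lexicographically by default):
df['brand'] = pd.Categorical(df['brand'])
print(dict(enumerate(df['brand'].cat.categories)))
# expected: {0: 'audi', 1: 'bmw', 2: 'volvo'}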
From the other answers, I've written this function to solve the problem:
import pandas as pd

def convertStringColumnsToNum(data):
    columns = data.columns
    columns_dtypes = data.dtypes
    maps = []
    for col_idx in range(0, len(columns)):
        # don't change columns that already consist of numbers
        if columns_dtypes.iloc[col_idx] == 'int64':  # can be extended to more dtypes
            continue
        # inspired by Shivam Roy's answer
        col = columns[col_idx]
        tmp = pd.Categorical(data[col])
        data[col] = tmp.codes
        maps.append(tmp.categories)
    return maps
This function returns the maps used to replace strings with a numeric code; the code is the index at which a string resides inside the returned list. The function works, yet it triggers a SettingWithCopyWarning.
If it ain't broke, don't fix it, right? ;)
*But if anyone has a way to adapt this function so that the warning is no longer shown, feel free to comment on it. Yet it works* *shrugs*
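One way to make the warning go away (a sketch, not the original function) is to work on an explicit copy inside the function and return it along with the maps:
import pandas as pd

def convert_string_columns_to_num(data):
    # Operate on an explicit copy so the assignments below never write
    # into a view of another DataFrame (the cause of the warning)
    data = data.copy()
    maps = []
    for col in data.select_dtypes(exclude='number').columns:
        tmp = pd.Categorical(data[col])
        data[col] = tmp.codes
        maps.append(tmp.categories)
    return data, maps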
Related
Currently I'm working with weekly data for different subjects, but it might have some long streaks without data, so what I want to do is keep only the longest streak of consecutive weeks for every id. My data looks like this:
id week
1 8
1 15
1 60
1 61
1 62
2 10
2 11
2 12
2 13
2 25
2 26
My expected output would be:
id week
1 60
1 61
1 62
2 10
2 11
2 12
2 13
I got a bit close, trying to mark with a 1 when week==week.shift()+1. The problem is this approach doesn't mark the first occurrence in a streak, and also I can't filter the longest one:
df.loc[ (df['id'] == df['id'].shift())&(df['week'] == df['week'].shift()+1),'streak']=1
This, according to my example, would bring this:
id week streak
1 8 nan
1 15 nan
1 60 nan
1 61 1
1 62 1
2 10 nan
2 11 1
2 12 1
2 13 1
2 25 nan
2 26 1
Any ideas on how to achieve what I want?
Try this:
df['consec'] = df.groupby(['id', df['week'].diff(-1).ne(-1).shift().bfill().cumsum()])['week'].transform('count')
df[df.groupby('id')['consec'].transform('max') == df.consec]
Output:
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
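The grouper inside that groupby is dense; here is a step-by-step sketch of what it builds (the names are illustrative):
gaps = df['week'].diff(-1).ne(-1)          # True where the NEXT row is not week + 1
streak_id = gaps.shift().bfill().cumsum()  # shift so each break starts a new group,
                                           # then cumsum labels every streak
# Counting rows within each ('id', streak_id) pair gives the length of each
# run of consecutive weeks, which is what 'consec' holds above.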
Not as concise as @ScottBoston's answer, but I like this approach:
import numpy as np
import pandas as pd

def max_streak(s):
    a = s.values  # Let's deal with an array
    # I need to know where the differences are not `1`.
    # Also, because I plan to use `diff` again, I'll wrap
    # the boolean array with `True` to make things cleaner
    b = np.concatenate([[True], np.diff(a) != 1, [True]])
    # Tell the locations of the breaks in streak
    c = np.flatnonzero(b)
    # `diff` again tells me the length of the streaks
    d = np.diff(c)
    # `argmax` will tell me the location of the largest streak
    e = d.argmax()
    return c[e], d[e]

def make_thing(df):
    start, length = max_streak(df.week)
    return df.iloc[start:start + length].assign(consec=length)

pd.concat([
    make_thing(g) for _, g in df.groupby('id')
])
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
I have a df as
Id Event SeqNo
1 A 1
1 B 2
1 C 3
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
1 D 9
1 B 10
1 ABD 11
1 D 12
1 B 13
1 CDE 14
1 A 15
I am looking for a pattern "ABD followed by CDE without having event B in between them "
For example, The output of this df will be :
Id Event SeqNo
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
This pattern can be followed multiple times for a single ID, and I want to find the list of all those IDs and their respective counts (if possible).
Here's a vectorized one with some scaling trickery and leveraging convolution to find the required pattern -
import numpy as np

# Get the col in context and scale it to the three strings to form an ID array
a = df['Event']
id_ar = (a=='ABD') + 2*(a=='B') + 3*(a=='CDE')
# Mask of those specific strings and hence extract the corresponding masked df
mask = id_ar>0
df1 = df[mask].copy()  # copy so the assignment below doesn't warn
# Get pattern col with 1s at places with the pattern found, 0s elsewhere
df1['Pattern'] = (np.convolve(id_ar[mask],[9,1],'same')==28).astype(int)
# Groupby Id col and sum the pattern col for final output
out = df1.groupby(['Id'])['Pattern'].sum()
That convolution part might be a bit tricky. The idea there is to use id_ar, which has values of 1, 2 and 3 corresponding to the strings 'ABD', 'B' and 'CDE'. We are looking for 1 followed by 3, so convolving with the kernel [9,1] yields 1*1 + 3*9 = 28 as the convolution sum for a window that has 'ABD' and then 'CDE'. Hence, we look for a convolution sum of 28 for a match. For the case of 'ABD' followed by 'B' and then 'CDE', the convolution sum would be different, hence it would be filtered out.
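To see the convolution trick in isolation (a toy sketch using the same 1/2/3 encoding):
import numpy as np

seq = np.array([1, 3, 1, 2, 3])           # ABD,CDE then ABD,B,CDE
print(np.convolve(seq, [9, 1], 'same'))   # [ 9 28 12 19 29]
# 28 lands exactly on the 'CDE' that directly follows 'ABD'; the 'CDE'
# preceded by 'B' scores 29 instead, so the ==28 check filters it out.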
Sample run -
1) Input dataframe :
In [377]: df
Out[377]:
Id Event SeqNo
0 1 A 1
1 1 B 2
2 1 C 3
3 1 ABD 4
4 1 B 5
5 1 C 6
6 1 A 7
7 1 CDE 8
8 1 D 9
9 1 B 10
10 1 ABD 11
11 1 D 12
12 1 B 13
13 2 A 1
14 2 B 2
15 2 C 3
16 2 ABD 4
17 2 A 5
18 2 C 6
19 2 A 7
20 2 CDE 8
21 2 D 9
22 2 B 10
23 2 ABD 11
24 2 D 12
25 2 B 13
26 2 CDE 14
27 2 A 15
2) Intermediate filtered output (look at column Pattern for the presence of the required pattern):
In [380]: df1
Out[380]:
Id Event SeqNo Pattern
1 1 B 2 0
3 1 ABD 4 0
4 1 B 5 0
7 1 CDE 8 0
9 1 B 10 0
10 1 ABD 11 0
12 1 B 13 0
14 2 B 2 0
16 2 ABD 4 0
20 2 CDE 8 1
22 2 B 10 0
23 2 ABD 11 0
25 2 B 13 0
26 2 CDE 14 0
3) Final output:
In [381]: out
Out[381]:
Id
1 0
2 1
Name: Pattern, dtype: int64
I used a solution based on the assumption that anything other than ABD, CDE and B is irrelevant to our solution, so I get rid of those rows first with a filtering operation.
Then, what I want to know is whether there is an ABD followed by a CDE without a B in between. I shift the Event column back by one (note this doesn't have to be one step in units of SeqNo).
Then I check, for every row of the new df, whether Event == ABD and Event_1_Step == CDE, meaning there wasn't a B in between, but possibly other stuff like A or C or even nothing. This gets me a boolean for every occurrence of such a sequence; if I sum them up, I get the count.
Finally, I have to make sure these are all done at Id level, so I use .groupby.
IMPORTANT: This solution assumes that your df is sorted by Id first and then by SeqNo. If it is not, please sort it first.
import pandas as pd

df = pd.read_csv("path/to/file.csv")
df2 = df[df["Event"].isin(["ABD", "CDE", "B"])].copy()  # copy avoids SettingWithCopyWarning below
df2.loc[:, "Event_1_Step"] = df2["Event"].shift(-1)
df2.loc[:, "SeqNo_1_Step"] = df2["SeqNo"].shift(-1)

for id, id_df in df2.groupby("Id"):
    print(id)  # set a counter object here per Id to track the count per id
    id_df = id_df[id_df.apply(lambda x: x["Event"] == "ABD" and x["Event_1_Step"] == "CDE", axis=1)]
    for row_id, row in id_df.iterrows():
        print(df[(df["Id"] == id) & df["SeqNo"].between(row["SeqNo"], row["SeqNo_1_Step"])])
You could use this:
s = (pd.Series(np.select([df['Event'] == 'ABD',
                          df['Event'] == 'B',
                          df['Id'] != df['Id'].shift()],
                         [True, False, False], default=np.nan))
       .ffill()
       .fillna(False)
       .astype(bool))
corr = (df['Event'] == "CDE") & s
corr.groupby(df['Id']).max()
Using np.select to create a series which is True where Event == 'ABD', and False for 'B' or at the start of a new Id. By forward filling with ffill, every row then knows whether 'ABD' or 'B' came last. You can then check whether it is True on the rows where the value is 'CDE'. Finally, you can use GroupBy to check whether it is True for any value per Id.
Which for
Id Event SeqNo
1 A 1
1 B 2
1 C 3
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
1 D 9
1 B 10
1 ABD 11
1 D 12
1 B 13
1 CDE 14
1 A 15
2 B 16
3 ABD 17
3 B 18
3 CDE 19
4 ABD 20
4 CDE 21
5 CDE 22
Outputs:
Id
1 True
2 False
3 False
4 True
5 False
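Since the question also asked for a count per Id, the same mask can be summed instead of maxed (a small extension of the answer above):
# Each True in corr marks one completed ABD -> CDE pattern, so summing counts them
counts = corr.groupby(df['Id']).sum()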
I have a dataframe (edata) as given below
Domestic Catsize Type Count
1 0 1 1
1 1 1 8
1 0 2 11
0 1 3 14
1 1 4 21
0 1 4 31
From this dataframe I want to calculate the sum of all counts where the logical AND of both variables (Domestic and Catsize) results in zero (0), per this truth table:
Domestic Catsize AND
1 0 0
0 1 0
0 0 0
The code I use to perform the process is
g = edata.groupby('Type')
q3 = g.apply(lambda x: x[((x['Domestic']==0) & (x['Catsize']==0) |
                          (x['Domestic']==0) & (x['Catsize']==1) |
                          (x['Domestic']==1) & (x['Catsize']==0))]['Count'].sum())
q3
Type
1 1
2 11
3 14
4 31
This code works fine. However, if the number of variables in the dataframe increases, the number of conditions grows rapidly. So, is there a smart way to write a condition that says: if ANDing two (or more) variables results in zero, then perform the sum()?
You can filter first using pd.DataFrame.all negated:
cols = ['Domestic', 'Catsize']
res = df[~df[cols].all(1)].groupby('Type')['Count'].sum()
print(res)
# Type
# 1 1
# 2 11
# 3 14
# 4 31
# Name: Count, dtype: int64
Use np.logical_and.reduce to generalise.
columns = ['Domestic', 'Catsize']
df[~np.logical_and.reduce(df[columns], axis=1)].groupby('Type')['Count'].sum()
Type
1 1
2 11
3 14
4 31
Name: Count, dtype: int64
Before adding it back, use map to broadcast:
u = df[~np.logical_and.reduce(df[columns], axis=1)].groupby('Type')['Count'].sum()
df['NewCol'] = df.Type.map(u)
df
Domestic Catsize Type Count NewCol
0 1 0 1 1 1
1 1 1 1 8 1
2 1 0 2 11 11
3 0 1 3 14 14
4 1 1 4 21 31
5 0 1 4 31 31
How about:
columns = ['Domestic', 'Catsize']
df.loc[~df[columns].prod(axis=1).astype(bool), 'Count']
and then do with it whatever you want.
For logical AND, the product does the trick nicely.
For logical OR, you can use sum(axis=1) with proper negation in advance.
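For example, the OR variant might look like this (a sketch of that suggestion):
# A row's flags OR to True exactly when their sum is nonzero
any_set = df[columns].sum(axis=1).astype(bool)  # Domestic OR Catsize
df.loc[~any_set, 'Count']                       # rows where neither flag is set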
I have a dataframe like so,
ID,CLASS_ID,ACTIVE
1,123,0
2,123,0
3,456,1
4,123,0
5,456,1
11,123,1
18,123,0
7,456,0
19,123,0
8,456,1
I'm trying to get the cumulative counts of CLASS_ID having the same value for ACTIVE. In the dataframe given above, CLASS_ID 123 continuously has ACTIVE as 0 until the 4th record, after which its next value is 1, so up until the 4th record the count should be 3. This process has to be continued, and the count has to be reset every time the value of ACTIVE changes for a CLASS_ID. The expected output is as follows:
ID,CLASS_ID,ACTIVE,ACTIVE_COUNT
1,123,0,3
2,123,0,3
3,456,1,2
4,123,0,3
5,456,1,2
11,123,1,1
18,123,0,2
7,456,0,1
19,123,0,2
8,456,1,1
I tried using df.groupby(..).transform(..) but it's not working out for me. Could someone help me out a bit?
You can do this with groupby:
ind = df.groupby('CLASS_ID').ACTIVE.apply(
lambda x: x.ne(x.shift()).cumsum()
)
df['ACTIVE_COUNT'] = df.groupby(['CLASS_ID', ind]).ACTIVE.transform('count')
df
ID CLASS_ID ACTIVE ACTIVE_COUNT
0 1 123 0 3
1 2 123 0 3
2 3 456 1 2
3 4 123 0 3
4 5 456 1 2
5 11 123 1 1
6 18 123 0 2
7 7 456 0 1
8 19 123 0 2
9 8 456 1 1
Details
First, create an indicator column marking rows with the same value per group:
ind = df.groupby('CLASS_ID').ACTIVE.apply(
lambda x: x.ne(x.shift()).cumsum()
)
ind
0 1
1 1
2 1
3 1
4 1
5 2
6 3
7 2
8 3
9 3
Name: ACTIVE, dtype: int64
We then use ind as a grouper argument to df.groupby along with "CLASS_ID", and then compute the size of each group using transform.
df.groupby(['CLASS_ID', ind]).ACTIVE.transform('count')
0 3
1 3
2 2
3 3
4 2
5 1
6 2
7 1
8 2
9 1
Name: ACTIVE, dtype: int64
Is there any way to use the mapping function or something better to replace values in an entire dataframe?
I only know how to perform the mapping on series.
I would like to replace the strings in the 'tesst' and 'set' columns with a number, for example set = 1, test = 2.
Here is an example of my dataset (the original dataset is very large):
ds_r
respondent brand engine country aware aware_2 aware_3 age tesst set
0 a volvo p swe 1 0 1 23 set set
1 b volvo None swe 0 0 1 45 set set
2 c bmw p us 0 0 1 56 test test
3 d bmw p us 0 1 1 43 test test
4 e bmw d germany 1 0 1 34 set set
5 f audi d germany 1 0 1 59 set set
6 g volvo d swe 1 0 0 65 test set
7 h audi d swe 1 0 0 78 test set
8 i volvo d us 1 1 1 32 set set
Final result should be
ds_r
respondent brand engine country aware aware_2 aware_3 age tesst set
0 a volvo p swe 1 0 1 23 1 1
1 b volvo None swe 0 0 1 45 1 1
2 c bmw p us 0 0 1 56 2 2
3 d bmw p us 0 1 1 43 2 2
4 e bmw d germany 1 0 1 34 1 1
5 f audi d germany 1 0 1 59 1 1
6 g volvo d swe 1 0 0 65 2 1
7 h audi d swe 1 0 0 78 2 1
8 i volvo d us 1 1 1 32 1 1
What about DataFrame.replace?
In [9]: mapping = {'set': 1, 'test': 2}
In [10]: df.replace({'set': mapping, 'tesst': mapping})
Out[10]:
Unnamed: 0 respondent brand engine country aware aware_2 aware_3 age \
0 0 a volvo p swe 1 0 1 23
1 1 b volvo None swe 0 0 1 45
2 2 c bmw p us 0 0 1 56
3 3 d bmw p us 0 1 1 43
4 4 e bmw d germany 1 0 1 34
5 5 f audi d germany 1 0 1 59
6 6 g volvo d swe 1 0 0 65
7 7 h audi d swe 1 0 0 78
8 8 i volvo d us 1 1 1 32
   tesst  set
0      1    1
1      1    1
2      2    2
3      2    2
4      1    1
5      1    1
6      2    1
7      2    1
8      1    1
As @Jeff pointed out in the comments, in pandas versions < 0.11.1 you need to manually tack .convert_objects() onto the end to properly convert tesst and set to int64 columns, in case that matters in subsequent operations.
I know this is old, but I'm adding it for those searching as I was. Given a DataFrame in pandas, df in this code:
ip_addresses = df.source_ip.unique()
ip_dict = dict(zip(ip_addresses, range(len(ip_addresses))))
That will give you a dictionary map of the ip addresses without having to write it out.
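To apply it back to the column (a small follow-up sketch; source_ip is the column from the snippet above):
# Replace the raw addresses with their generated integer codes
df['source_ip'] = df['source_ip'].map(ip_dict)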
You can use the applymap DataFrame function to do this:
In [26]: df = DataFrame({"A": [1,2,3,4,5], "B": ['a','b','c','d','e'],
"C": ['b','a','c','c','d'], "D": ['a','c',7,9,2]})
In [27]: df
Out[27]:
A B C D
0 1 a b a
1 2 b a c
2 3 c c 7
3 4 d c 9
4 5 e d 2
In [28]: mymap = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}
In [29]: df.applymap(lambda s: mymap.get(s) if s in mymap else s)
Out[29]:
A B C D
0 1 1 2 1
1 2 2 1 3
2 3 3 3 7
3 4 4 3 9
4 5 5 4 2
To convert strings like 'volvo' and 'bmw' into numeric (one-hot/dummy) columns, first read the data into a DataFrame, then pass it to pandas.get_dummies():
df = pd.read_csv("myFile.csv", index_col=0)  # DataFrame.from_csv was removed in pandas 1.0
df_transform = pd.get_dummies(df)
print(df_transform)
A better alternative: passing a dictionary to map() on a pandas Series (df.myCol), specifying the column (brand, for example):
df.brand = df.brand.map( {'volvo':0 , 'bmw':1, 'audi':2} )
The simplest way to replace any value in the dataframe:
df=df.replace(to_replace="set",value="1")
df=df.replace(to_replace="test",value="2")
Hope this will help.
You can also do this with pandas rename_categories. You would first need to define the column as dtype="category" e.g.
In [66]: s = pd.Series(["a","b","c","a"], dtype="category")
In [67]: s
Out[67]:
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a, b, c]
and then rename them:
In [70]: s.cat.rename_categories([1,2,3])
Out[70]:
0 1
1 2
2 3
3 1
dtype: category
Categories (3, int64): [1, 2, 3]
You can also pass a dict-like object to map the renaming, e.g.:
In [72]: s.cat.rename_categories({1: 'x', 2: 'y', 3: 'z'})
When the number of features is small:
mymap = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}
df.applymap(lambda s: mymap.get(s) if s in mymap else s)
When it's not feasible to do manually:
# create a temporary dataframe mapping each unique value to a new code
temp_df2 = pd.DataFrame({'data': data.data.unique(),
                         'data_new': range(len(data.data.unique()))})
# now merge it, assigning a different number to each distinct string
data = data.merge(temp_df2, on='data', how='left')
You can build a dictionary from the column values themselves and map it, like below:
x = df['Item_Type'].value_counts()
item_type_mapping = {}
item_list = x.index
for i in range(0, len(item_list)):
    item_type_mapping[item_list[i]] = i
df['Item_Type'] = df['Item_Type'].map(lambda x: item_type_mapping[x])
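The same mapping can be built more compactly (an equivalent sketch, keeping the same frequency-based ordering):
# dict comprehension over the frequency-ordered uniques
item_type_mapping = {item: i for i, item in enumerate(df['Item_Type'].value_counts().index)}
df['Item_Type'] = df['Item_Type'].map(item_type_mapping)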
df.replace(to_replace=['set', 'test'], value=[1, 2]), from @Ishnark's comment on the accepted answer.
pandas.factorize() does exactly this.
>>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
>>> codes
array([0, 0, 1, 2, 0]...)
>>> uniques
array(['b', 'a', 'c'], dtype=object)
With a DataFrame:
df["tesst"], tesst_key = pandas.factorize(df["tesst"])