Replacing multiple string values in a column with numbers in pandas - python

I am currently working on a data frame in pandas named df. One column contains
multiple labels (more than 100, to be exact).
I know how to replace values when there are only a few of them.
For instance, in the typical Titanic example:
titanic.Sex.replace({'male': 0,'female': 1}, inplace=True)
Of course, doing so for 100+ values would be extremely time-consuming. I have seen similar questions, but all the answers involve typing out the mapping by hand. Is there a faster way to do this?

I think you're looking for factorize:
df = pd.DataFrame({'col': list('ABCDEBJZACA')})
df['factor'] = df['col'].factorize()[0]
output:
col factor
0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
5 B 1
6 J 5
7 Z 6
8 A 0
9 C 2
10 A 0
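As an aside (my addition, not part of the original answer): factorize also returns the unique labels in first-seen order, so you can keep the code-to-label mapping around, e.g. to decode the numbers later:

import pandas as pd

df = pd.DataFrame({'col': list('ABCDEBJZACA')})

codes, uniques = df['col'].factorize()     # codes: int array, uniques: labels
df['factor'] = codes
mapping = dict(enumerate(uniques))         # {0: 'A', 1: 'B', 2: 'C', ...}
df['decoded'] = df['factor'].map(mapping)  # round-trips back to 'col'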

Related

Split a pandas data frame into multiples of 4 rows

I have a dataset of 100 rows, and I want to split it into groups of 4 and then perform an operation on each group, i.e., first on the first four rows, then on the next four rows, and so on.
Note: the rows are independent of each other.
I don't know how to do it. Can somebody please help me?
I will divide the DataFrame into groups of 2 rows (a simple example)
and build a list of sub-DataFrames, dfs.
Example
df = pd.DataFrame(list('ABCDE'), columns=['value'])
df
value
0 A
1 B
2 C
3 D
4 E
Code
Build a grouper that assigns the same group number to every 2 rows:
grouper = pd.Series(range(0, len(df))) // 2
grouper
0 0
1 0
2 1
3 1
4 2
dtype: int64
Split into a list of DataFrames:
g = df.groupby(grouper)
dfs = [g.get_group(x) for x in g.groups]
Result (dfs):
[ value
0 A
1 B,
value
2 C
3 D,
value
4 E]
Check
dfs[0]
output:
value
0 A
1 B
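For the 4-row case in the question, a minimal adaptation of the same idea (my addition, not from the original answer) is a plain positional slice with iloc:

import pandas as pd

df = pd.DataFrame({'value': range(100)})  # stand-in for the 100-row dataset

# Take every 4 rows; the last chunk is simply shorter if len(df) % 4 != 0.
dfs = [df.iloc[i:i + 4] for i in range(0, len(df), 4)]

for chunk in dfs:
    pass  # perform the per-group operation on each chunk here

numpy.array_split offers a similar one-liner if you prefer splitting by number of chunks rather than chunk size.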

Sliding minimum value in a pandas column

I am working with a pandas dataframe where I have the following two columns: "personID" and "points". I would like to create a third column ("localMin") which stores, for each personID, the minimum value of the "points" column seen so far, i.e. compared with all previous values in the "points" column for that personID.
Does anyone have an idea how to achieve this most efficiently? I have approached this problem using shift() with different period sizes, but of course, shift is sensitive to variations in the sequence and doesn't always produce the output I would expect.
Thank you in advance!
Use groupby.cummin:
df['localMin'] = df.groupby('personID')['points'].cummin()
Example:
df = pd.DataFrame({'personID': list('AAAAAABBBBBB'),
                   'points': [3, 4, 2, 6, 1, 2, 4, 3, 1, 2, 6, 1]})
df['localMin'] = df.groupby('personID')['points'].cummin()
output:
personID points localMin
0 A 3 3
1 A 4 3
2 A 2 2
3 A 6 2
4 A 1 1
5 A 2 1
6 B 4 4
7 B 3 3
8 B 1 1
9 B 2 1
10 B 6 1
11 B 1 1
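If you ever need a running statistic that has no cum* shortcut, the same grouping works with expanding windows. A sketch (my addition; 'localMedian' is a hypothetical column, and expanding() is noticeably slower than cummin):

# groupby(...).expanding() yields a (personID, original index) MultiIndex,
# so drop the group level to align the result back with df.
df['localMedian'] = (df.groupby('personID')['points']
                       .expanding().median()
                       .reset_index(level=0, drop=True))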

Pandas: How to merge columns containing the same name within a single data frame?

I have a dataframe extracted from an Excel file which I have manipulated into the following form (there are multiple rows, but this is reduced to make my question as clear as possible):
|A|B|C|A|B|C|
index 0: 1 2 3 4 5 6
As you can see there are repetitions of the column names. I would like to merge this dataframe to look like the following:
|A|B|C|
index 0: 1 2 3
index 1: 4 5 6
I have tried to use the melt function but have not had any success thus far.
import pandas as pd
df = pd.DataFrame([[1,2,3,4,5,6]], columns = ['A', 'B','C','A', 'B','C'])
df
A B C A B C
0 1 2 3 4 5 6
pd.concat(x for _, x in df.groupby(df.columns.duplicated(), axis=1))
A B C
0 1 2 3
0 4 5 6
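Two caveats worth noting (my additions, not from the original answer): concat keeps the duplicated index 0 unless you pass ignore_index=True, and groupby(axis=1) is deprecated in recent pandas. A version-proof sketch, assuming each column name appears exactly twice:

import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=['A', 'B', 'C', 'A', 'B', 'C'])

dup = df.columns.duplicated()   # False for first occurrences, True for repeats
out = pd.concat([df.loc[:, ~dup], df.loc[:, dup]], ignore_index=True)
print(out)
#    A  B  C
# 0  1  2  3
# 1  4  5  6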

Pandas Sum & Count Across Only Certain Columns

I have just started learning pandas, and this is a very basic question. Believe me, I have searched for an answer, but can't find one.
Can you please run this python code?
import pandas as pd
df = pd.DataFrame({'A':[1,0], 'B':[2,4], 'C':[4,4], 'D':[1,4],'count__4s_abc':[1,2],'sum__abc':[7,8]})
df
How do I create column 'count__4s_abc' in which I want to count how many times the number 4 appears in just columns A-C? (While ignoring column D.)
How do I create column 'sum__abc' in which I want to sum the amounts in just columns A-C? (While ignoring column D.)
Thanks much for any help!
Using drop
df.assign(
    count__4s_abc=df.drop(columns='D').eq(4).sum(axis=1),
    sum__abc=df.drop(columns='D').sum(axis=1)
)
Or explicitly choosing the 3 columns.
df.assign(
    count__4s_abc=df[['A', 'B', 'C']].eq(4).sum(axis=1),
    sum__abc=df[['A', 'B', 'C']].sum(axis=1)
)
Or using iloc to get first 3 columns.
df.assign(
    count__4s_abc=df.iloc[:, :3].eq(4).sum(axis=1),
    sum__abc=df.iloc[:, :3].sum(axis=1)
)
All give
A B C D count__4s_abc sum__abc
0 1 2 4 1 1 7
1 0 4 4 4 2 8
One additional option:
In [158]: formulas = """
...: new_count__4s_abc = (A==4)*1 + (B==4)*1 + (C==4)*1
...: new_sum__abc = A + B + C
...: """
In [159]: df.eval(formulas)
Out[159]:
A B C D count__4s_abc sum__abc new_count__4s_abc new_sum__abc
0 1 2 4 1 1 7 1 7
1 0 4 4 4 2 8 2 8
The DataFrame.eval() method can be (though is not always) faster than regular pandas arithmetic.
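One more spelling (my addition), assuming columns A through C sit next to each other in the frame, is a label-based column slice, which is inclusive on both ends:

cols = df.loc[:, 'A':'C']    # columns A through C, inclusive
df.assign(count__4s_abc=cols.eq(4).sum(axis=1),
          sum__abc=cols.sum(axis=1))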

Efficient way to call previous row in python

I want to substitute the previous row's value whenever a 0 is found in a column of the dataframe in Python. I used the following code:
if not a[j]:
    a[j] = a[j-1]
and also
if a[j] == 0:
    a[j] = a[j-1]
Update:
Complete code updated:
for i in pd.unique(r.a):
    sub = r[r.vehicle_id == i]
    sub = DataFrame(sub, columns=['a', 'b', 'c', 'd', 'e'])
    sub = sub.drop_duplicates(["a", "b", "c", "d"])
    sub['c'] = pd.to_datetime(sub['c'], unit='s')
    for j in range(1, len(sub[1:])):
        if not sub.d[j]:
            sub.d[j] = sub.d[j-1]
        if not sub.e[j]:
            sub.e[j] = sub.e[j-1]
    sub = sub.drop_duplicates(["lash_angle", "lash_check_count"])
This is the start of my code; the sub.d[j] assignment is the line causing the delay.
Both of these work well with integer values. However, one of the columns contains decimal values, and when I use this code on that column, the statement takes a huge amount of time (nearly 15-20 seconds) to complete. I am looping through nearly 10,000 ids, and losing 15 seconds at this step makes my entire code inefficient. Is there a better way to do this for float (decimal) values so that it runs much faster?
Thanks
Assuming that by "column of the dataframe" you mean you're actually talking about a column (Series) of a pandas DataFrame, then one trick is to replace the 0 by nan and then forward-fill. For example:
>>> df = pd.DataFrame(np.random.randint(0,4, 10**6))
>>> df.head(10)
0
0 0
1 3
2 3
3 0
4 1
5 2
6 3
7 2
8 0
9 3
>>> df[0] = df[0].replace(0, np.nan).ffill()
>>> df.head(10)
0
0 NaN
1 3
2 3
3 3
4 1
5 2
6 3
7 2
8 2
9 3
where you can decide for yourself how you want to handle the case of a 0 at the start, where you have no value to fill. This assumes that there aren't already NaN values you want to leave alone, but if there are, you can just use a mask with .loc to select only the ones you want to change.
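A minimal sketch of those two caveats (my addition, not from the original answer): preserve pre-existing NaNs with a mask, and back-fill just the slots a leading 0 leaves empty:

import numpy as np
import pandas as pd

s = pd.Series([0, 3.0, 0, 1.5, np.nan, 2.0, 0])

orig_nan = s.isna()               # remember NaNs that were already there
filled = s.mask(s == 0).ffill()   # 0 -> NaN, then forward-fill
filled[orig_nan] = np.nan         # put the pre-existing NaNs back

# A leading 0 has no previous value; one option is to back-fill those slots only.
leading = filled.isna() & ~orig_nan
filled[leading] = filled.bfill()[leading]
print(filled)  # 3.0, 3.0, 3.0, 1.5, NaN, 2.0, 2.0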
