Pandas Sum & Count Across Only Certain Columns - python

I have just started learning pandas, and this is a very basic question. Believe me, I have searched for an answer, but can't find one.
Can you please run this python code?
import pandas as pd
df = pd.DataFrame({'A':[1,0], 'B':[2,4], 'C':[4,4], 'D':[1,4],'count__4s_abc':[1,2],'sum__abc':[7,8]})
df
How do I create column 'count__4s_abc' in which I want to count how many times the number 4 appears in just columns A-C? (While ignoring column D.)
How do I create column 'sum__abc' in which I want to sum the amounts in just columns A-C? (While ignoring column D.)
Thanks much for any help!

Using drop (this assumes df holds only the original columns A-D):
df.assign(
count__4s_abc=df.drop(columns='D').eq(4).sum(axis=1),
sum__abc=df.drop(columns='D').sum(axis=1)
)
Or explicitly choosing the 3 columns.
df.assign(
count__4s_abc=df[['A', 'B', 'C']].eq(4).sum(1),
sum__abc=df[['A', 'B', 'C']].sum(1)
)
Or using iloc to get first 3 columns.
df.assign(
count__4s_abc=df.iloc[:, :3].eq(4).sum(1),
sum__abc=df.iloc[:, :3].sum(1)
)
All give
A B C D count__4s_abc sum__abc
0 1 2 4 1 1 7
1 0 4 4 4 2 8
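The approaches above share one pattern: select the columns of interest, then reduce across each row. A minimal sketch of that pattern, assuming the frame starts with only the raw columns A-D (the variable name cols is introduced here for illustration and is not part of the original answer):

```python
import pandas as pd

# Start from the raw data only (columns A-D), before any derived columns exist.
df = pd.DataFrame({'A': [1, 0], 'B': [2, 4], 'C': [4, 4], 'D': [1, 4]})

cols = ['A', 'B', 'C']  # hypothetical name: the columns to count and sum over

result = df.assign(
    # eq(4) gives a boolean frame; summing along axis=1 counts the True values per row
    count__4s_abc=df[cols].eq(4).sum(axis=1),
    # plain row-wise sum over the same three columns, ignoring D
    sum__abc=df[cols].sum(axis=1),
)
print(result)
```

Keeping the column list in one variable means the count and the sum cannot drift apart if the selection changes later.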

One additional option:
In [158]: formulas = """
...: new_count__4s_abc = (A==4)*1 + (B==4)*1 + (C==4)*1
...: new_sum__abc = A + B + C
...: """
In [159]: df.eval(formulas)
Out[159]:
A B C D count__4s_abc sum__abc new_count__4s_abc new_sum__abc
0 1 2 4 1 1 7 1 7
1 0 4 4 4 2 8 2 8
DataFrame.eval() can be faster than regular pandas arithmetic, though not always.

Related

split pandas data frame into multiple of 4 rows

I have a dataset of 100 rows. I want to split it into groups of 4 rows and then perform an operation on each group, i.e., first on the first four rows, then on the next four rows, and so on.
Note: Rows are independent of each other.
I don't know how to do it. Can somebody please help me? I would be extremely thankful.
As a simple example, I will split df into groups of 2 rows
and collect the pieces in a list, dfs.
Example
df = pd.DataFrame(list('ABCDE'), columns=['value'])
df
value
0 A
1 B
2 C
3 D
4 E
Code
Build a grouper that assigns each row to its group:
grouper = pd.Series(range(0, len(df))) // 2
grouper
0 0
1 0
2 1
3 1
4 2
dtype: int64
Split into a list of DataFrames:
g = df.groupby(grouper)
dfs = [g.get_group(x) for x in g.groups]
Result (dfs):
[ value
0 A
1 B,
value
2 C
3 D,
value
4 E]
Check
dfs[0]
output:
value
0 A
1 B
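The same idea can be applied to the 4-row groups the question asks about. The following is a sketch only: the helper name split_every_n_rows is hypothetical, and it groups by row position rather than by index label, so it should also behave sensibly when the index is not 0..n-1.

```python
import numpy as np
import pandas as pd

def split_every_n_rows(frame, n):
    """Return a list of consecutive n-row pieces of frame (the last may be shorter)."""
    # Integer-divide each row position by n: 0,0,0,0,1,1,1,1,... for n=4
    positions = np.arange(len(frame)) // n
    # Iterating a groupby yields (key, sub-frame) pairs in key order
    return [piece for _, piece in frame.groupby(positions)]

df = pd.DataFrame({'value': range(10)})
chunks = split_every_n_rows(df, 4)
sizes = [len(c) for c in chunks]
print(sizes)
```

If the per-group operation is an aggregation, calling it directly on frame.groupby(positions) avoids building the list at all.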

How can I swap half of two columns in a pandas dataframe in Python?

I am trying to create a machine learning model and teaching myself as I go. I will be working with a large dataset, but before I get to that, I am practicing with a smaller dataset to make sure everything is working as expected. I will need to swap half of the rows of two columns in my dataset, and I am not sure how to accomplish this.
Say I have a dataframe like the below:
index number letter
0 1 A
1 2 B
2 3 C
3 4 D
4 5 E
5 6 F
I want to randomly swap half of the rows of the number and letter columns, so one output could look like this:
index number letter
0 1 A
1 B 2
2 3 C
3 D 4
4 5 E
5 F 6
Is there a way to do this in python?
edit: thank you for all of your answers, I greatly appreciate it! :)
Here's one way to implement this.
import pandas as pd
from random import sample
df = pd.DataFrame({'index':range(6),'number':range(1,7),'letter':[*'ABCDEF']}).set_index('index')
n = len(df)
idx = sample(range(n),k=n//2) # randomly select which rows to switch
df.iloc[idx,:] = df.iloc[idx,::-1].values # switch those rows (reverse the column order for the selected rows)
An example result is
number letter
index
0 1 A
1 2 B
2 C 3
3 4 D
4 E 5
5 F 6
Update
To select rows at random, use np.random.choice:
import numpy as np
idx = np.random.choice(df.index, len(df) // 2, replace=False)
df.loc[idx, ['letter', 'number']] = df.loc[idx, ['number', 'letter']].to_numpy()
print(df)
# Output
number letter
0 1 A
1 2 B
2 3 C
3 D 4
4 E 5
5 F 6
Old answer
You can try:
df.loc[df.index % 2 == 1, ['letter', 'number']] = \
df.loc[df.index % 2 == 1, ['number', 'letter']].to_numpy()
print(df)
# Output
number letter
0 1 A
1 B 2
2 3 C
3 D 4
4 5 E
5 F 6
For more readability, use an intermediate variable as a boolean mask:
mask = df.index % 2 == 1
df.loc[mask, ['letter', 'number']] = df.loc[mask, ['number', 'letter']].to_numpy()
You can create a copy of your original data, sample it, and then update the copy in place, converting to a NumPy ndarray to prevent index alignment from occurring.
swapped_df = df.copy()
sample = swapped_df.sample(frac=0.5, random_state=0)
swapped_df.loc[sample.index, ['number', 'letter']] = sample[['letter', 'number']].to_numpy()
print(swapped_df)
number letter
index
0 1 A
1 B 2
2 C 3
3 4 D
4 E 5
5 6 F
Similar to previous answers but slightly more readable (in my opinion) if you are trying to build your sense for basic pandas operations:
rows_to_change = df.sample(frac=0.5)
rows_to_change = rows_to_change.rename(columns={'number':'letter', 'letter':'number'})
df.loc[rows_to_change.index] = rows_to_change
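For reference, the answers above can be folded into one reusable function. This is a sketch under stated assumptions, not any answerer's exact method: the name swap_random_half and its seed parameter are hypothetical, and the two columns are cast to object first so that mixed integers and strings fit in either column regardless of how strictly the installed pandas version treats dtype changes.

```python
import numpy as np
import pandas as pd

def swap_random_half(frame, col_a, col_b, seed=None):
    """Return (copy with col_a/col_b swapped in a random half of the rows, chosen labels)."""
    # astype returns a new frame; object dtype lets both columns hold ints and strings
    out = frame.astype({col_a: object, col_b: object})
    rng = np.random.default_rng(seed)
    # Pick half of the index labels without repetition
    idx = rng.choice(out.index.to_numpy(), size=len(out) // 2, replace=False)
    # to_numpy() drops the column labels, so values are assigned by position (no alignment)
    out.loc[idx, [col_a, col_b]] = out.loc[idx, [col_b, col_a]].to_numpy()
    return out, idx

df = pd.DataFrame({'number': range(1, 7), 'letter': list('ABCDEF')})
swapped, idx = swap_random_half(df, 'number', 'letter', seed=0)
print(swapped)
```

Passing a seed makes the selection repeatable, which helps when checking a model pipeline on a small practice dataset.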

Replacing multiple string values in a column with numbers in pandas

I am currently working on a data frame in pandas named df. One column contains
multiple labels (more than 100, to be exact).
I know how to replace values when there are only a small number of them.
For instance, in the typical Titanic example:
titanic.Sex.replace({'male': 0,'female': 1}, inplace=True)
Of course, doing so for 100+ values would be extremely time-consuming. I have seen similar questions, but all answers involve typing the data. Is there a faster way to do this?
I think you're looking for factorize:
df = pd.DataFrame({'col': list('ABCDEBJZACA')})
df['factor'] = df['col'].factorize()[0]
output:
col factor
0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
5 B 1
6 J 5
7 Z 6
8 A 0
9 C 2
10 A 0
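factorize returns two things, the integer codes and the unique labels in order of first appearance, so the label-to-number mapping can be recovered without typing it out. A small sketch (the name label_to_code is introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'col': list('ABCDEBJZACA')})

# codes: one integer per row; uniques: the distinct labels, in order of first appearance
codes, uniques = pd.factorize(df['col'])
df['factor'] = codes

# Rebuild the mapping that was applied, e.g. to reuse it on another column later
label_to_code = {label: code for code, label in enumerate(uniques)}
print(label_to_code)
```

The mapping can then be reused with Series.map on new data that shares the same labels.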

How to assign values to multiple non existing columns in a pandas dataframe?

So what I want to do is to add columns to a dataframe and fill them (all rows respectively) with a single value.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1,2],[3,4]]), columns = ["A","B"])
arr = np.array([7,8])
# this is what I would like to do
df[["C","D"]] = arr
# and this is what I want to achieve
# A B C D
# 0 1 2 7 8
# 1 3 4 7 8
# but sadly it raises a KeyError:
# KeyError: "['C' 'D'] not in index"
I do know about the assign-functionality and how I would tackle this issue if I only were to add one column at once. I just want to know whether there is a clean and simple way to do this with multiple new columns as I was not able to find one.
This works for me:
df[["C","D"]] = pd.DataFrame([arr], index=df.index)
Or join:
df = df.join(pd.DataFrame([arr], columns=['C','D'], index=df.index))
Or assign:
df = df.assign(**pd.Series(arr, index=['C','D']))
print (df)
A B C D
0 1 2 7 8
1 3 4 7 8
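You can also use assign and pass it a dict. Note that this version writes the whole of arr down each new column, so C and D both become [7, 8]: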
df.assign(**dict(zip(['C','D'],[arr.tolist()]*2)))
Out[755]:
A B C D
0 1 2 7 7
1 3 4 8 8
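To get the layout the question asks for (7 in every row of C, 8 in every row of D), each new column name can be paired with a single scalar, which assign broadcasts down the column. A minimal sketch, with new_cols as a hypothetical name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['A', 'B'])
arr = np.array([7, 8])
new_cols = ['C', 'D']  # hypothetical name for the columns to create

# zip pairs 'C' with 7 and 'D' with 8; each scalar is broadcast to every row
df = df.assign(**dict(zip(new_cols, arr)))
print(df)
```

This relies on the new column names being valid Python keyword names; for other names, a loop over the pairs with df[name] = value does the same thing.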

Rename Dataframe column based on column index

Is there a built in function to rename a pandas dataframe by index?
I thought I knew the name of my column headers, but it turns out the second column has some hexadecimal characters in it. I will likely come across this issue with column 2 in the future based on the way I receive my data, so I cannot hard code those specific hex characters into a dataframe.rename() call.
Is there a function that would be appropriately named rename_col_by_index() that I have not been able to find?
Ex:
>>> df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
>>> df.rename_col_by_index(1, 'new_name')
>>> df
a new_name
0 1 3
1 2 4
@MaxU's answer is better
df.rename(columns={"col1": "New name"})
More in docs
UPDATE: thanks to @Vincenzzzochi:
In [138]: df.rename(columns={df.columns[1]: 'new'})
Out[138]:
a new c
0 1 3 5
1 2 4 6
In [140]: df
Out[140]:
a b c
0 1 3 5
1 2 4 6
Or, a bit more flexibly:
In [141]: mapping = {df.columns[0]:'new0', df.columns[1]: 'new1'}
In [142]: df.rename(columns=mapping)
Out[142]:
new0 new1 c
0 1 3 5
1 2 4 6
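pandas has no rename_col_by_index method as far as the answers here indicate, but a small helper with that behavior is easy to write. The sketch below is hypothetical (the function name comes from the question, not from pandas) and replaces the label by position, so it touches only the one column even if the frame contains duplicate column names.

```python
import pandas as pd

def rename_col_by_index(frame, position, new_name):
    """Return a copy of frame with the column at integer position renamed."""
    columns = list(frame.columns)
    columns[position] = new_name   # replace exactly one label, by position
    out = frame.copy()
    out.columns = columns          # the original frame is left unchanged
    return out

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
renamed = rename_col_by_index(df, 1, 'new_name')
print(renamed)
```

Because the old label is never spelled out, unexpected characters in the incoming header do not matter.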
