How to change a text column to values using pandas - python

With pandas DataFrames, how can I change a generic text column ['A'] into an integer value column ['A'], so that each word becomes a specific value? I'm not asking to calculate a value; I'm asking to replace some text with some number.
Table before:        Table after:

     A                    A
0    w               0    2
1    q               1   11
2   st               2    1
3    R               3    7
4  Prt               4    6

Replace the text so that R becomes 7, st becomes 1, and so on.
Pseudo code:
df['A'] = df.convert('w'=2, 'q'=11, 'st'=1 )

You can use replace with a dictionary indicating how the replacement should be done.
import pandas as pd
df_before = pd.DataFrame({'A':['w','q','st','R','Prt']})
d = {'w':2, 'q':11, 'st':1, 'R': 7, 'Prt':6}
df_after = df_before.replace(d)
print(df_after)
Output:
A
0 2
1 11
2 1
3 7
4 6
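If only column A needs mapping, Series.map with the same dictionary is a common alternative; a minimal sketch (note that map returns NaN for values missing from the dictionary, whereas replace leaves them unchanged):
import pandas as pd
df = pd.DataFrame({'A': ['w', 'q', 'st', 'R', 'Prt']})
d = {'w': 2, 'q': 11, 'st': 1, 'R': 7, 'Prt': 6}
# map applies the dictionary element-wise to the single column
df['A'] = df['A'].map(d)
print(df)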

Related

How can I swap half of two columns in a pandas dataframe in Python?

I am trying to create a machine learning model and am teaching myself as I go. I will be working with a large dataset, but before I get to that, I am practicing with a smaller dataset to make sure everything is working as expected. I need to swap half of the rows of two columns in my dataset, and I am not sure how to accomplish this.
Say I have a dataframe like the below:
index  number  letter
0      1       A
1      2       B
2      3       C
3      4       D
4      5       E
5      6       F
I want to randomly swap half of the rows of the number and letter columns, so one output could look like this:
index  number  letter
0      1       A
1      B       2
2      3       C
3      D       4
4      5       E
5      F       6
Is there a way to do this in python?
edit: thank you for all of your answers, I greatly appreciate it! :)
Here's one way to implement this.
import pandas as pd
from random import sample
df = pd.DataFrame({'index':range(6),'number':range(1,7),'letter':[*'ABCDEF']}).set_index('index')
n = len(df)
idx = sample(range(n),k=n//2) # randomly select which rows to switch
df.iloc[idx,:] = df.iloc[idx,::-1].values # switch those rows
An example result is
number letter
index
0 1 A
1 2 B
2 C 3
3 4 D
4 E 5
5 F 6
Update
To select rows randomly, use np.random.choice:
import numpy as np
idx = np.random.choice(df.index, len(df) // 2, replace=False)
df.loc[idx, ['letter', 'number']] = df.loc[idx, ['number', 'letter']].to_numpy()
print(df)
# Output
number letter
0 1 A
1 2 B
2 3 C
3 D 4
4 E 5
5 F 6
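On newer NumPy versions, the Generator API is generally preferred over the legacy np.random functions; a sketch of the same row selection (the seed is an assumption added here for reproducibility):
import numpy as np
rng = np.random.default_rng(0)  # hypothetical seed, for a repeatable pick
idx = rng.choice(df.index, len(df) // 2, replace=False)
df.loc[idx, ['letter', 'number']] = df.loc[idx, ['number', 'letter']].to_numpy()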
Old answer
You can try:
df.loc[df.index % 2 == 1, ['letter', 'number']] = \
    df.loc[df.index % 2 == 1, ['number', 'letter']].to_numpy()
print(df)
# Output
number letter
0 1 A
1 B 2
2 3 C
3 D 4
4 5 E
5 F 6
For more readability, use an intermediate variable as a boolean mask:
mask = df.index % 2 == 1
df.loc[mask, ['letter', 'number']] = df.loc[mask, ['number', 'letter']].to_numpy()
You can create a copy of your original data, sample it, and then update it in place, converting to a NumPy ndarray to prevent index alignment from occurring.
swapped_df = df.copy()
sample = swapped_df.sample(frac=0.5, random_state=0)
swapped_df.loc[sample.index, ['number', 'letter']] = sample[['letter', 'number']].to_numpy()
print(swapped_df)
number letter
index
0 1 A
1 B 2
2 C 3
3 4 D
4 E 5
5 6 F
Similar to previous answers, but slightly more readable (in my opinion) if you are trying to build your intuition for basic pandas operations:
rows_to_change = df.sample(frac=0.5)
rows_to_change = rows_to_change.rename(columns={'number':'letter', 'letter':'number'})
df.loc[rows_to_change.index] = rows_to_change
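The rename trick works because .loc assignment aligns by column label: after the rename, values that sat in number are written back into letter and vice versa. A minimal end-to-end sketch (the random_state is an assumption added so the result is repeatable):
import pandas as pd
df = pd.DataFrame({'number': range(1, 7), 'letter': [*'ABCDEF']})
rows_to_change = df.sample(frac=0.5, random_state=42)  # hypothetical seed
rows_to_change = rows_to_change.rename(columns={'number': 'letter', 'letter': 'number'})
df.loc[rows_to_change.index] = rows_to_change  # aligned by column name, so the values swap
print(df)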

How can I insert a row in between every other row in a dataframe?

I have a data frame that looks like:
   a  b
1  1  2
2  1  2
3  1  2
and a row that looks like: [1,2]
How can I insert this row in between rows 1 & 2, 2 & 3, and so on?
In other words, how do I insert a row every other row in a dataframe?
If you just want to insert [1,2] into a table whose rows all contain 1,2, you can simply repeat those values:
df=df.reindex(df.index.repeat(2)).reset_index(drop=True)
Otherwise, if the row to insert has different values, you can try:
df.index = [x for x in range(len(df)*2) if x % 2 != 0]  # move existing rows to odd positions
for x in range(2, (len(df)*2) + 2):
    if x % 2 == 0:
        df.loc[x] = [2, 3]  # place the new row at each even position
df = df.sort_index()
output of df:
a b
1 1 2
2 2 3
3 1 2
4 2 3
5 1 2
6 2 3
This reminds me of a mathematical problem about a hotel with an infinite number of rooms (Hilbert's hotel).
Here is the solution: we multiply the index by 2, concatenate a new dataframe placed at the odd indexes, and then sort by index.
import pandas as pd
from io import StringIO

rows = [[3, 4]]
df = pd.read_csv(StringIO(
"""a b
1 2
1 2
1 2"""), sep=r"\s+")
nrows = df.shape[0] - 1              # one new row per gap between existing rows
df.index = df.index * 2              # existing rows take the even indexes
new_df = pd.DataFrame(rows * nrows, columns=["a", "b"])
new_df.index = new_df.index * 2 + 1  # new rows take the odd indexes
>>> pd.concat([df, new_df]).sort_index()
a b
0 1 2
1 3 4
2 1 2
3 3 4
4 1 2
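If you need this repeatedly, the index arithmetic generalizes to a small helper; a sketch (interleave_row is a hypothetical name, not a pandas function):
import pandas as pd

def interleave_row(df, row):
    # insert `row` between every pair of consecutive rows of df
    out = df.reset_index(drop=True)
    out.index = out.index * 2  # existing rows -> even slots
    filler = pd.DataFrame([row] * (len(out) - 1), columns=out.columns)
    filler.index = filler.index * 2 + 1  # new rows -> odd slots
    return pd.concat([out, filler]).sort_index().reset_index(drop=True)

df = pd.DataFrame({'a': [1, 1, 1], 'b': [2, 2, 2]})
print(interleave_row(df, [3, 4]))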

Iterate over a groupby dataframe to operate on each row

I have a DataFrame like this:
subject trial attended
0 1 1 1
1 1 3 0
2 1 4 1
3 1 7 0
4 1 8 1
5 2 1 1
6 2 2 1
7 2 6 1
8 2 8 0
9 2 9 1
10 2 11 1
11 2 12 1
12 2 13 1
13 2 14 1
14 2 15 1
I would like to group by 'subject',
then iterate over each row of every group.
If a row has 'attended' == 1, increase a variable sum_reactive by 1.
When sum_reactive reaches 4, record in a dictionary the 'subject' and the 'trial' at which it reached 4.
I was trying to define a function for this, but it doesn't work:
def count_attended():
    sum_reactive = 0
    dict_attended = {}
    for i, g in reactive.groupby(['subject']):
        for row in g:
            if g['attended'][row] == 1:
                sum_reactive += 1
            if sum_reactive == 4:
                dict_attended.update({g['subject']: g['trial'][row]})
                return dict_attended
    return dict_attended
I don't think I'm clear on how to iterate inside each GroupBy group. I'm quite new to pandas.
IIUC try,
df = df.query('attended == 1')
df.loc[df.groupby('subject')['attended'].cumsum() == 4, ['subject', 'trial']].to_dict(orient='records')
Output:
[{'subject': 2, 'trial': 9}]
Using groupby with cumsum counts the attended rows; checking where this count equals 4 then creates a boolean series. You can use this boolean series as a boolean index to filter your dataframe to the matching rows. Lastly, with loc and column filtering, select subject and trial.
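If the goal is a plain {subject: trial} dictionary, as in the question, a loop-free variant using cumcount is one option; a sketch under the assumption that each subject should appear at most once:
import pandas as pd

reactive = pd.DataFrame({
    'subject':  [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    'trial':    [1, 3, 4, 7, 8, 1, 2, 6, 8, 9, 11, 12, 13, 14, 15],
    'attended': [1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
})

hits = reactive[reactive['attended'] == 1]
# cumcount numbers the attended rows 0, 1, 2, ... within each subject; 3 marks the 4th one
fourth = hits[hits.groupby('subject').cumcount() == 3]
dict_attended = dict(zip(fourth['subject'], fourth['trial']))
print(dict_attended)  # {2: 9} for this data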

Selecting Column values based on dictionary keys

I have a dictionary that I want to use to decide which column's value to choose, sort of like an if condition driven by a dictionary.
import pandas as pd
dictname = {'A': 'Select1', 'B':'Select2','C':'Select3'}
df = pd.DataFrame([['A',1,2,3,4],['B',1,2,3,4],['B',1,3,4,5],['C',1,5,6,7]],
                  columns=['Name','Score','Select1','Select2','Select3'])
So I want to create a new column called ChosenValue which selects values based on the row value in the column 'Name', e.g. ChosenValue should equal column 'Select1''s value if the row value in 'Name' = 'A', ChosenValue should equal column 'Select2''s value if the row value in 'Name' = 'B', and so forth. I really want something that links it to the dictionary 'dictname'.
Use Index.get_indexer to get a list of indices. After that, you can just index into the underlying numpy array.
import numpy as np

idx = df.columns.get_indexer(df.Name.map(dictname))
df['ChosenValue'] = df.values[np.arange(len(df)), idx]
df
Name Score Select1 Select2 Select3 ChosenValue
0 A 1 2 3 4 2
1 B 1 2 3 4 3
2 B 1 3 4 5 4
3 C 1 5 6 7 7
If you know that every Name is in the dictionary, you could use lookup:
In [104]: df["ChosenValue"] = df.lookup(df.index, df.Name.map(dictname))
In [105]: df
Out[105]:
Name Score Select1 Select2 Select3 ChosenValue
0 A 1 2 3 4 2
1 B 1 2 3 4 3
2 B 1 3 4 5 4
3 C 1 5 6 7 7
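Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so on recent versions the get_indexer approach above (or the equivalent NumPy indexing) is the way to go; a minimal self-contained sketch:
import numpy as np
import pandas as pd

dictname = {'A': 'Select1', 'B': 'Select2', 'C': 'Select3'}
df = pd.DataFrame([['A',1,2,3,4],['B',1,2,3,4],['B',1,3,4,5],['C',1,5,6,7]],
                  columns=['Name','Score','Select1','Select2','Select3'])

# index the underlying array by (row position, mapped column position)
cols = df.columns.get_indexer(df['Name'].map(dictname))
df['ChosenValue'] = df.to_numpy()[np.arange(len(df)), cols]
print(df)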

cumsum pandas up to specific value - python pandas

Cumsum until the value exceeds a certain number:
Say that we have two DataFrames A and B that look like this:
A = pd.DataFrame({"type":['a','b','c'], "value":[100, 50, 30]})
B = pd.DataFrame({"type": ['a','a','a','a','b','b','b','c','c','c','c','c'], "value": [10,50,45,10,45,10,5,6,6,8,12,10]})
The two data frames would look like this.
>>> A
type value
0 a 100
1 b 50
2 c 30
>>> B
type value
0 a 10
1 a 50
2 a 45
3 a 10
4 b 45
5 b 10
6 b 5
7 c 6
8 c 6
9 c 8
10 c 12
11 c 10
For each group in "type" in data frame A, I would like to add up the column value in B until it reaches the number specified in the column value in A. I would also like to count the number of rows in B that were added. I've been trying to use cumsum(), but I don't know exactly how to stop the sum once the value is reached.
The output should be:
type value
0 a 3
1 b 2
2 c 4
Thank you,
Merging the two data frames beforehand should help:
import pandas as pd
df = pd.merge(B, A, on = 'type')                       # value_x comes from B, value_y from A
df['cumsum'] = df.groupby('type')['value_x'].cumsum()  # running total within each type
B[(df.groupby('type')['cumsum'].shift().fillna(0) < df['value_y'])].groupby('type').count()
# type value
# a 3
# b 2
# c 4
Assuming B['type'] to be sorted as with the sample case, here's a NumPy based solution -
import numpy as np

IDs = np.searchsorted(A['type'], B['type'])           # group id of each B row (A['type'] must be sorted)
count_cumsum = np.bincount(IDs, B['value']).cumsum()  # total value per group, accumulated across groups
upper_bound = A['value'] + np.append(0, count_cumsum[:-1])  # stopping threshold in global-cumsum terms
Bv_cumsum = np.cumsum(B['value'])                     # global running total over B
grp_start = np.unique(IDs, return_index=True)[1]      # position where each group starts in B
A['output'] = np.searchsorted(Bv_cumsum, upper_bound) - grp_start + 1
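As a quick sanity check, either approach reproduces the expected counts on the sample frames; a sketch running the merge-based version end to end:
import pandas as pd

A = pd.DataFrame({"type": ['a', 'b', 'c'], "value": [100, 50, 30]})
B = pd.DataFrame({"type": ['a','a','a','a','b','b','b','c','c','c','c','c'],
                  "value": [10,50,45,10,45,10,5,6,6,8,12,10]})

merged = pd.merge(B, A, on='type')  # value_x from B, value_y from A
merged['cumsum'] = merged.groupby('type')['value_x'].cumsum()
# keep B rows whose group's running total had not yet reached the limit
counts = B[merged.groupby('type')['cumsum'].shift().fillna(0) < merged['value_y']].groupby('type').count()
print(counts)  # expected: a -> 3, b -> 2, c -> 4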
