With Panda dataframes, how can I change a generic text column ['A'] into an integer value column ['A'], so that some word becomes some value. I'm not asking to calculate a value, I'm asking to replace some text by some number.
table before      table after

   A                 A
0  w              0   2
1  q              1  11
2  st             2   1
3  R              3   7
4  Prt            4   6
Replace the text so that R becomes 7, st becomes 1, and so on.
Pseudo code:
df['A'] = df.convert('w'=2, 'q'=11, 'st'=1 )
You can use replace with a dictionary that specifies how each value should be replaced.
import pandas as pd
df_before = pd.DataFrame({'A':['w','q','st','R','Prt']})
d = {'w':2, 'q':11, 'st':1, 'R': 7, 'Prt':6}
df_after = df_before.replace(d)
print(df_after)
Output:
A
0 2
1 11
2 1
3 7
4 6
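If every value in the column is covered by the dictionary, Series.map is a common alternative; a minimal sketch reusing d from above (map sends values missing from the dictionary to NaN, whereas replace leaves them unchanged):
df_after = df_before.copy()
# map looks each value up in the dictionary; missing keys become NaN
df_after['A'] = df_after['A'].map(d)
print(df_after)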
Related
I am trying to create a machine learning model and teaching myself as I go. I will be working with a large dataset, but before I get to that, I am practicing with a smaller dataset to make sure everything is working as expected. I will need to swap half of the rows of two columns in my dataset, and I am not sure how to accomplish this.
Say I have a dataframe like the below:
index  number  letter
0      1       A
1      2       B
2      3       C
3      4       D
4      5       E
5      6       F
I want to randomly swap half of the rows of the number and letter columns, so one output could look like this:
index  number  letter
0      1       A
1      B       2
2      3       C
3      D       4
4      5       E
5      F       6
Is there a way to do this in python?
edit: thank you for all of your answers, I greatly appreciate it! :)
Here's one way to implement this.
import pandas as pd
from random import sample
df = pd.DataFrame({'index':range(6),'number':range(1,7),'letter':[*'ABCDEF']}).set_index('index')
n = len(df)
idx = sample(range(n),k=n//2) # randomly select which rows to switch
df.iloc[idx,:] = df.iloc[idx,::-1].values # switch those rows (reversing the column order swaps the two columns)
An example result is
number letter
index
0 1 A
1 2 B
2 C 3
3 4 D
4 E 5
5 F 6
Update
To select rows randomly, use np.random.choice:
import numpy as np
idx = np.random.choice(df.index, len(df) // 2, replace=False)
df.loc[idx, ['letter', 'number']] = df.loc[idx, ['number', 'letter']].to_numpy()
print(df)
# Output
number letter
0 1 A
1 2 B
2 3 C
3 D 4
4 E 5
5 F 6
Old answer
You can try:
df.loc[df.index % 2 == 1, ['letter', 'number']] = \
df.loc[df.index % 2 == 1, ['number', 'letter']].to_numpy()
print(df)
# Output
number letter
0 1 A
1 B 2
2 3 C
3 D 4
4 5 E
5 F 6
For more readability, use an intermediate variable as a boolean mask:
mask = df.index % 2 == 1
df.loc[mask, ['letter', 'number']] = df.loc[mask, ['number', 'letter']].to_numpy()
You can create a copy of your original data, sample it, and then update it in place, converting to a NumPy ndarray to prevent index alignment from occurring.
swapped_df = df.copy()
sample = swapped_df.sample(frac=0.5, random_state=0)
swapped_df.loc[sample.index, ['number', 'letter']] = sample[['letter', 'number']].to_numpy()
print(swapped_df)
number letter
index
0 1 A
1 B 2
2 C 3
3 4 D
4 E 5
5 6 F
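The .to_numpy() is what makes the swap work: a .loc assignment with a DataFrame on the right aligns on column labels, so without it each column would be written back to the column of the same name and nothing would change. A minimal sketch of that failure mode, reusing the frames above:
# Columns align by label here, so each column is written back to itself:
# a silent no-op instead of a swap.
swapped_df.loc[sample.index, ['number', 'letter']] = sample[['letter', 'number']]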
Similar to the previous answers, but slightly more readable (in my opinion) if you are trying to build your sense for basic pandas operations:
rows_to_change = df.sample(frac=0.5)
rows_to_change = rows_to_change.rename(columns={'number':'letter', 'letter':'number'})
df.loc[rows_to_change.index] = rows_to_change
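A quick check of this approach on the question's sample frame; this is a sketch where the columns are created with object dtype so the mixed letter/number values assign without dtype warnings, and random_state is fixed only to make the run repeatable:
import pandas as pd

df = pd.DataFrame({'number': list(range(1, 7)),
                   'letter': [*'ABCDEF']}, dtype=object)

rows_to_change = df.sample(frac=0.5, random_state=0)
rows_to_change = rows_to_change.rename(columns={'number': 'letter', 'letter': 'number'})
# .loc aligns on index and column labels, so the renamed columns land swapped
df.loc[rows_to_change.index] = rows_to_change
print(df)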
I have a data frame that looks like:
   a  b
1  1  2
2  1  2
3  1  2
and a row that looks like: [1,2]
How can I insert this row in between rows 1 & 2, 2 & 3, and so on?
In other words, how do I insert a row every other row in a dataframe?
If you just want to add [1,2] to a table whose rows all contain 1,2, then you can simply repeat those values:
df=df.reindex(df.index.repeat(2)).reset_index(drop=True)
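For instance, on a three-row frame of 1,2 this should give (a quick check):
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1], 'b': [2, 2, 2]})
# repeat each index label twice, then reindex to duplicate every row
df = df.reindex(df.index.repeat(2)).reset_index(drop=True)
print(df)
#    a  b
# 0  1  2
# 1  1  2
# 2  1  2
# 3  1  2
# 4  1  2
# 5  1  2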
Otherwise, if the rows hold different values, you can try:
# give the existing rows the odd indices 1, 3, 5, ...
df.index = [x for x in range(len(df) * 2) if x % 2 != 0]
# insert the new row (here [2,3]) at every even index
for x in range(2, (len(df) * 2) + 2):
    if x % 2 == 0:
        df.loc[x] = [2, 3]
df = df.sort_index()
output of df:
a b
1 1 2
2 2 3
3 1 2
4 2 3
5 1 2
6 2 3
This reminds me of the mathematical problem about a hotel with an infinite number of rooms.
Here is the solution: we multiply the existing index by 2, concatenate a new dataframe with odd indices, and then sort by index.
import pandas as pd
from io import StringIO

rows = [[3, 4]]  # the row to insert between existing rows
df = pd.read_csv(StringIO(
"""a b
1 2
1 2
1 2"""), sep=r"\s+")
nrows = df.shape[0] - 1  # number of gaps between existing rows
df.index = df.index * 2  # existing rows take even indices
new_df = pd.DataFrame(rows * nrows, columns=["a", "b"])
new_df.index = new_df.index * 2 + 1  # inserted rows take odd indices
print(pd.concat([df, new_df]).sort_index())
a b
0 1 2
1 3 4
2 1 2
3 3 4
4 1 2
I have a DataFrame like this:
subject trial attended
0 1 1 1
1 1 3 0
2 1 4 1
3 1 7 0
4 1 8 1
5 2 1 1
6 2 2 1
7 2 6 1
8 2 8 0
9 2 9 1
10 2 11 1
11 2 12 1
12 2 13 1
13 2 14 1
14 2 15 1
I would like to GroupBy subject.
Then iterate in each row of the GroupBy dataframe.
If for a row 'attended' == 1, then increase a variable sum_reactive by 1.
When sum_reactive reaches 4, add to a dictionary the 'subject' and the 'trial' at which sum_reactive reached 4.
I was trying to define a function for this, but it doesn't work:
def count_attended():
    sum_reactive = 0
    dict_attended = {}
    for i, g in reactive.groupby(['subject']):
        for row in g:
            if g['attended'][row] == 1:
                sum_reactive += 1
            if sum_reactive == 4:
                dict_attended.update({g['subject']: g['trial'][row]})
                return dict_attended
    return dict_attended
I don't think I have a clear idea of how to iterate inside each GroupBy dataframe. I'm quite new to pandas.
IIUC try,
df = df.query('attended == 1')
df.loc[df.groupby('subject')['attended'].cumsum() == 4, ['subject', 'trial']].to_dict(orient='records')
Output:
[{'subject': 2, 'trial': 9}]
Using groupby with cumsum does the counting of attended; then check where this running count equals 4 to create a boolean series. You can use this boolean series as a boolean index to filter your dataframe to those rows. Lastly, with loc and column filtering, select subject and trial.
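Put together as a self-contained sketch on the question's sample data:
import pandas as pd

reactive = pd.DataFrame({
    'subject':  [1]*5 + [2]*10,
    'trial':    [1, 3, 4, 7, 8, 1, 2, 6, 8, 9, 11, 12, 13, 14, 15],
    'attended': [1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
})

attended = reactive.query('attended == 1')
fourth = attended.groupby('subject')['attended'].cumsum() == 4  # fourth attended trial
print(attended.loc[fourth, ['subject', 'trial']].to_dict(orient='records'))
# [{'subject': 2, 'trial': 9}]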
I have a dictionary from which I want to decide which column's value to choose, sort of like an if-condition driven by a dictionary.
import pandas as pd
dictname = {'A': 'Select1', 'B':'Select2','C':'Select3'}
df = pd.DataFrame([['A',1,2,3,4],['B',1,2,3,4],['B',1,3,4,5],['C',1,5,6,7]], columns=['Name','Score','Select1','Select2','Select3'])
So I want to create a new column called ChosenValue which selects values based on the row value in the column 'Name', e.g. ChosenValue should equal column 'Select1''s value if the row value in 'Name' is 'A', ChosenValue should equal 'Select2''s value if the row value in 'Name' is 'B', and so forth. I really want something that links it to the dictionary 'dictname'.
Use Index.get_indexer to get a list of indices. After that, you can just index into the underlying numpy array.
import numpy as np

idx = df.columns.get_indexer(df.Name.map(dictname))
df['ChosenValue'] = df.values[np.arange(len(df)), idx]
df
Name Score Select1 Select2 Select3 ChosenValue
0 A 1 2 3 4 2
1 B 1 2 3 4 3
2 B 1 3 4 5 4
3 C 1 5 6 7 7
If you know that every Name is in the dictionary, you could use lookup:
In [104]: df["ChosenValue"] = df.lookup(df.index, df.Name.map(dictname))
In [105]: df
Out[105]:
Name Score Select1 Select2 Select3 ChosenValue
0 A 1 2 3 4 2
1 B 1 2 3 4 3
2 B 1 3 4 5 4
3 C 1 5 6 7 7
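Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on recent versions you need the NumPy-indexing approach from the previous answer; a minimal equivalent sketch:
import numpy as np

cols = df.columns.get_indexer(df.Name.map(dictname))
df['ChosenValue'] = df.to_numpy()[np.arange(len(df)), cols]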
Cumsum until the value exceeds a certain number:
Say that we have two Data frames A,B that look like this:
A = pd.DataFrame({"type":['a','b','c'], "value":[100, 50, 30]})
B = pd.DataFrame({"type": ['a','a','a','a','b','b','b','c','c','c','c','c'], "value": [10,50,45,10,45,10,5,6,6,8,12,10]})
The two data frames would look like this.
>>> A
type value
0 a 100
1 b 50
2 c 30
>>> B
type value
0 a 10
1 a 50
2 a 45
3 a 10
4 b 45
5 b 10
6 b 5
7 c 6
8 c 6
9 c 8
10 c 12
11 c 10
For each group in "type" in data frame A, I would like to add up the value column in B until it reaches the number specified in the value column of A. I would also like to count the number of rows in B that were added. I've been trying to use cumsum(), but I don't know exactly how to stop the sum when the value is reached.
The output should be:
type value
0 a 3
1 b 2
2 c 4
Thank you,
Merging the two data frames beforehand should help:
import pandas as pd
df = pd.merge(B, A, on = 'type')  # value_x is B's value, value_y is A's cap
df['cumsum'] = df.groupby('type')['value_x'].cumsum()
# keep each row whose running total, before that row, is still under the cap
B[(df.groupby('type')['cumsum'].shift().fillna(0) < df['value_y'])].groupby('type').count()
# type value
# a 3
# b 2
# c 4
Assuming B['type'] is sorted, as in the sample case, here's a NumPy-based solution:
import numpy as np

# map each row of B to its group's position in A (requires A['type'] sorted)
IDs = np.searchsorted(A['type'], B['type'])
# cumulative total of the per-group sums, laid end to end
count_cumsum = np.bincount(IDs, B['value']).cumsum()
# global cumulative-sum level at which each group's cap is hit
upper_bound = A['value'] + np.append(0, count_cumsum[:-1])
Bv_cumsum = np.cumsum(B['value'])
# index of the first row of each group in B
grp_start = np.unique(IDs, return_index=True)[1]
A['output'] = np.searchsorted(Bv_cumsum, upper_bound) - grp_start + 1
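With the sample frames A and B from the question, this should yield:
print(A)
#   type  value  output
# 0    a    100       3
# 1    b     50       2
# 2    c     30       4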