I have a pandas DataFrame like so:
import pandas as pd
df = pd.DataFrame({
'cohort': [1, 1, 1, 1, 2, 2, 2],
'age': [-1, 0, 1, 2, 0, 1, 2],
'bal': [100, 1000, 1400, 1500, 1000, 1200, 1300]
})
For each cohort, where applicable, I want to add the bal values where age is less than 0 to the bal value where age is zero. Ultimately I want df to look like this:
df
   cohort  age   bal
1       1   -1   100
2       1    0  1100
3       1    1  1400
4       1    2  1500
5       2    0  1000
6       2    1  1200
7       2    2  1300
This is how pandas can achieve it. Note that the positional .values alignment requires each cohort to contain exactly one age < 0 row, so the demonstration below uses a df where cohort 2 also has an age -1 row (with bal 120):
df.loc[df.age == 0, 'bal'] = df.loc[df.age == 0, 'bal'] + df.loc[df.age < 0, 'bal'].values
df
Out[339]:
   age   bal  cohort
0   -1   100       1
1    0  1100       1
2    1  1400       1
3    2  1500       1
4   -1   120       2
5    0  1120       2
6    1  1200       2
7    2  1300       2
Update: this also handles cohorts with no negative-age rows, like cohort 2 in the original df:
df.loc[df.age == 0, 'bal'] = df.loc[df.age <= 0].groupby('cohort').bal.sum().values
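If the cohort order coming out of the groupby cannot be assumed to match the row order of the age == 0 slice, here is a sketch that aligns by cohort label instead of by position (run against the original df):
# sum each cohort's balances over age <= 0, then map the totals back
# by cohort label; assignment aligns on the index, not on position
totals = df.loc[df.age <= 0].groupby('cohort')['bal'].sum()
df.loc[df.age == 0, 'bal'] = df.loc[df.age == 0, 'cohort'].map(totals)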
I have a dataframe with three columns. The first column specifies a group into which each row is classified. Each group normally consists of 3 data points (rows), but it is possible for the last group to be "cut off," and contain fewer than three data points. In the real world, this could be due to the experiment or data collection process being cut off prematurely. In the below example, group 3 is cut off and only contains one data point.
import pandas as pd
data = {
"group_id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3],
"valueA": [420, 380, 390, 500, 270, 220, 150, 400, 330, 170],
"valueB": [50, 40, 45, 22, 20, 50, 10, 60, 90, 10]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
I also have two lists with additional values.
x_list = [1, 3, 5]
y_list = [2, 4, 6]
I want to add these lists to my dataframe as new columns, and have the values repeat for each group. In other words, I want my output to look like this:
group_id valueA valueB x y
0 0 420 50 1 2
1 0 380 40 3 4
2 0 390 45 5 6
3 1 500 22 1 2
4 1 270 20 3 4
5 1 220 50 5 6
6 2 150 10 1 2
7 2 400 60 3 4
8 2 330 90 5 6
9 3 170 10 1 2
Notice that even though the length of a column is not divisible by the length of the shorter lists, the number of rows in the dataframe does not change.
How do I achieve this without losing dataframe rows or adding new rows with NaN values?
You can use GroupBy.cumcount to generate an indexer, then use it to duplicate the values in order of the groups:
new = pd.DataFrame({'x': x_list, 'y': y_list})
idx = df.groupby('group_id').cumcount()
df[['x', 'y']] = new.reindex(idx).to_numpy()
Output:
group_id valueA valueB x y
0 0 420 50 1 2
1 0 380 40 3 4
2 0 390 45 5 6
3 1 500 22 1 2
4 1 270 20 3 4
5 1 220 50 5 6
6 2 150 10 1 2
7 2 400 60 3 4
8 2 330 90 5 6
9 3 170 10 1 2
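For reference, the intermediate idx holds each row's position within its group, which is exactly what reindex uses to repeat the rows of new:
# within-group positions for groups of sizes 3, 3, 3 and 1
print(idx.tolist())
# [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]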
As your lists have the same length (matching the full group size of 3), you can use:
df[['x', 'y']] = (pd.DataFrame({'x': x_list, 'y': y_list})
.reindex(df.groupby('group_id').cumcount().mod(3)).values)
print(df)
# Output
group_id valueA valueB x y
0 0 420 50 1 2
1 0 380 40 3 4
2 0 390 45 5 6
3 1 500 22 1 2
4 1 270 20 3 4
5 1 220 50 5 6
6 2 150 10 1 2
7 2 400 60 3 4
8 2 330 90 5 6
9 3 170 10 1 2
Let's use np.resize:
import pandas as pd
import numpy as np
data = {
"group_id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3],
"valueA": [420, 380, 390, 500, 270, 220, 150, 400, 330, 170],
"valueB": [50, 40, 45, 22, 20, 50, 10, 60, 90, 10]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
df['x'] = np.resize(x_list, len(df))
df['y'] = np.resize(y_list, len(df))
df
Output:
group_id valueA valueB x y
0 0 420 50 1 2
1 0 380 40 3 4
2 0 390 45 5 6
3 1 500 22 1 2
4 1 270 20 3 4
5 1 220 50 5 6
6 2 150 10 1 2
7 2 400 60 3 4
8 2 330 90 5 6
9 3 170 10 1 2
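For reference, np.resize fills the result with repeated copies of the input whenever the target length is longer:
# the input [1, 3, 5] is tiled and truncated to the requested length
print(np.resize(x_list, 7))
# [1 3 5 1 3 5 1]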
An alternative in case the lists have different sizes:
def tile_to_length(values, length):
    # repeat the list enough times to cover `length`, then truncate
    return (values * (length // len(values) + 1))[:length]

df['x'] = tile_to_length(x_list, len(df))
df['y'] = tile_to_length(y_list, len(df))
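A similar effect is possible with itertools; a sketch where cycle plus islice avoids building the oversized intermediate list:
from itertools import cycle, islice

# cycle repeats the list lazily; islice stops at the DataFrame length
df['x'] = list(islice(cycle(x_list), len(df)))
df['y'] = list(islice(cycle(y_list), len(df)))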
I have a df looking something like this:
df = pd.DataFrame({
'Time' : [1,2,7,10,15,16,77,98,999,1000,1121,1245,1373,1490,1555],
'Act_cat' : [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4],
'Count' : [6, 2, 4, 1, 2, 1, 8, 4, 3, 1, 4, 13, 3, 1, 2],
'Moving': [1,0,1,0,1,0,1,0,1,0,1,0,1,0,1]})
I would like to group by consecutive identical values in "Act_cat" together with "Moving" == 1, take the mean of the "Count" column for these groups, and map it back onto df.
I have tried the code below, but it averages all rows of the "Count" column, not only the ones where "Moving" == 1.
group1 = (df['Moving'].eq(1) & (df['Act_cat'].diff().abs() > 0)).cumsum()
mean_values = df.groupby(group1)["Count"].mean()
df['newcol'] = group1.map(mean_values)
Please let me know how I could solve this!
IIUC, filter to the Moving == 1 rows before taking the group means:
group1 = (df['Moving'].eq(1) & (df['Act_cat'].diff().abs() > 0)).cumsum()
mean_values = df[df['Moving'].eq(1)].groupby(group1)["Count"].mean()
df['newcol'] = group1.map(mean_values)
Alternative solution:
group1 = (df['Moving'].eq(1) & (df['Act_cat'].diff().abs() > 0)).cumsum()
df['newcol'] = df['Count'].where(df['Moving'].eq(1)).groupby(group1).transform('mean')
print (df)
Time Act_cat Count Moving newcol
0 1 1 6 1 4.6
1 2 1 2 0 4.6
2 7 1 4 1 4.6
3 10 1 1 0 4.6
4 15 1 2 1 4.6
5 16 2 1 0 4.6
6 77 2 8 1 4.6
7 98 2 4 0 4.6
8 999 2 3 1 4.6
9 1000 2 1 0 4.6
10 1121 4 4 1 3.0
11 1245 4 13 0 3.0
12 1373 4 3 1 3.0
13 1490 4 1 0 3.0
14 1555 4 2 1 3.0
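As a quick sanity check against the printed output: rows 0-9 form the first group, because the Act_cat change at row 5 happens where Moving is 0, and only the Moving == 1 rows enter the mean:
# rows 0, 2, 4, 6, 8 have Moving == 1 with Count 6, 4, 2, 8, 3
print(df.loc[(df.index < 10) & df['Moving'].eq(1), 'Count'].mean())
# (6 + 4 + 2 + 8 + 3) / 5 = 4.6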
Consider looping through my DataFrame:
import pandas as pd
df = pd.DataFrame({
'Price': [1000, 1000, 1000, 2000, 2000, 2000, 2000, 1400, 1400],
'Count': [0, 0, 0, 0, 0, 0, 0, 0, 0]
})
for idx in df.index:
    if df['Price'].iloc[idx] > 1500:
        if idx > 0:
            # assign via .loc to avoid chained-assignment issues
            df.loc[idx, 'Count'] = df.loc[idx - 1, 'Count'] + 1
Resulting in:
   Price  Count
0   1000      0
1   1000      0
2   1000      0
3   2000      1
4   2000      2
5   2000      3
6   2000      4
7   1400      0
8   1400      0
Is there a more efficient way to do this?
Create pseudo-groups using Series.cumsum, then use groupby.cumcount to generate the within-group counts:
groups = df.Price.le(1500).cumsum()
df['Count'] = df.Price.gt(1500).groupby(groups).cumcount()
# Price Count
# 0 1000 0
# 1 1000 0
# 2 1000 0
# 3 2000 1
# 4 2000 2
# 5 2000 3
# 6 2000 4
# 7 1400 0
# 8 1400 0
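For illustration, these are the pseudo-group labels: each price at or below 1500 closes a group, so every run of prices above 1500 shares a label with the boundary row just before it:
# le(1500) marks boundary rows; cumsum turns them into group labels
print(groups.tolist())
# [1, 2, 3, 3, 3, 3, 3, 4, 5]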
Use mask to hide the rows where the price is at or below 1500, then use cumsum to create the counter:
df['Count'] = df.mask(df['Price'] <= 1500)['Count'].add(1).cumsum().fillna(0).astype(int)
print(df)
# Output:
Price Count
0 1000 0
1 1000 0
2 1000 0
3 2000 1
4 2000 2
5 2000 3
6 2000 4
7 1400 0
8 1400 0
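Because cumsum never resets, this counter would keep climbing if a second run of prices above 1500 appeared later. A sketch of a grouped variant that does reset between runs (it produces the same result on this df):
# each price <= 1500 starts a new pseudo-group; the cumulative sum of
# the boolean flag then restarts from zero inside every group
above = df['Price'].gt(1500)
df['Count'] = above.astype(int).groupby((~above).cumsum()).cumsum()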
I've got the following DataFrame:
example_df = pd.DataFrame({'id': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'seq_start': {0: 0.0, 1: 2800.0, 2: 6400.0, 3: 8400.0, 4: 9800.0},
'seq_end': {0: 1400.0, 1: 4700.0, 2: 8400.0, 3: 9800.0, 4: 11400.0}})
I'd like to obtain a DataFrame that has sequences of values running from example_df['seq_start'] to example_df['seq_end'], so that I can later use the newly created column in a join.
So the expected output would look like below:
out_df = pd.DataFrame({'id': np.concatenate([[0] * 15, [1] * 20, [2] * 21]),
'expected_output': np.concatenate([np.arange(0, 1500, 100),
np.arange(2800, 4800, 100),
np.arange(6400, 8500, 100)])})
id expected_output
0 0 0
1 0 100
2 0 200
3 0 300
4 0 400
5 0 500
...
12 0 1200
13 0 1300
14 0 1400
15 1 2800
16 1 2900
17 1 3000
...
31 1 4400
32 1 4500
33 1 4600
34 1 4700
35 2 6400
36 2 6500
37 2 6600
...
54 2 8300
55 2 8400
How can I approach this?
Using pandas.DataFrame.explode:
def listify(x, step=100, right_closed=True):
    # sorted() makes the column order (seq_end, seq_start) irrelevant
    lower, upper = sorted(x)
    return range(lower, upper + step * right_closed, step)

example_df['expected'] = example_df[['seq_end', 'seq_start']].astype(int).apply(listify, axis=1)
new_df = example_df[['id', 'expected']].explode('expected')
print(new_df)
Output:
id expected
0 0 0
0 0 100
0 0 200
0 0 300
0 0 400
.. .. ...
4 4 11000
4 4 11100
4 4 11200
4 4 11300
4 4 11400
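Note that explode leaves the new column with object dtype and repeats the source index, as visible above; a sketch of the cleanup typically wanted before a join:
# cast the exploded values back to int and rebuild a unique index
new_df = new_df.astype({'expected': int}).reset_index(drop=True)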
I am learning machine learning and generated a pandas DataFrame containing the columns Id, Category, Cost_price and Sold. The shape of the DataFrame is (100000, 4).
Here the target variable is the Sold column (1 = sold, 0 = not sold). But no machine learning algorithm is able to reach a good enough accuracy, because all the columns in the DataFrame are essentially random. To introduce a pattern into the DataFrame, I am trying to manipulate some of the values in the Sold column.
What I want to do is change 6000 of the Sold values to 1 where the Cost_price is less than 800, but I am not able to do that.
I am new to machine learning and Python. Please help me.
Thanks in advance.
Use:
import numpy as np

df.loc[np.random.choice(df.index[df['cost_price'] < 800], 6000, replace=False), 'Sold'] = 1
Sample:
df = pd.DataFrame({
'Sold':[1,0,0,1,1,0] * 3,
'cost_price':[500,300,6000,900,100,400] * 3,
})
print (df)
Sold cost_price
0 1 500
1 0 300
2 0 6000
3 1 900
4 1 100
5 0 400
6 1 500
7 0 300
8 0 6000
9 1 900
10 1 100
11 0 400
12 1 500
13 0 300
14 0 6000
15 1 900
16 1 100
17 0 400
df.loc[np.random.choice(df.index[df['cost_price'] < 800], 10, replace=False), 'Sold'] = 1
print (df)
Sold cost_price
0 1 500
1 1 300
2 0 6000
3 1 900
4 1 100
5 1 400
6 1 500
7 1 300
8 0 6000
9 1 900
10 1 100
11 1 400
12 1 500
13 1 300
14 0 6000
15 1 900
16 1 100
17 1 400
Explanation:
First filter index values by condition with boolean indexing:
print (df.index[df['cost_price'] < 800])
Int64Index([0, 1, 4, 5, 6, 7, 10, 11, 12, 13, 16, 17], dtype='int64')
Then select N random values by numpy.random.choice:
print (np.random.choice(df.index[df['cost_price'] < 800], 10, replace=False))
[16 1 7 13 17 12 10 6 5 11]
And lastly, set 1 at those index values with DataFrame.loc.
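If the manipulation should be reproducible between runs, a sketch with a seeded generator (the seed 42 is arbitrary):
# seeding default_rng makes the same rows get picked every run
rng = np.random.default_rng(42)
df.loc[rng.choice(df.index[df['cost_price'] < 800], 10, replace=False), 'Sold'] = 1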
I will assume you want to randomly choose those 6000 rows.
import random

# take the row labels (not the Sold values) of the eligible rows
idx = df.index[df.Cost_price < 800].tolist()
r = random.sample(idx, 6000)
df.loc[r, 'Sold'] = 1
IIUC use DataFrame.loc (DataFrame.at accepts only a single label pair, so it cannot take a whole index of rows):
df.loc[df.Sold[df.cost_price < 800][:6000].index, 'Sold'] = 1
If you randomly choose the rows, use .sample:
df.loc[df[df.cost_price < 800].sample(6000).index, 'Sold'] = 1
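One caveat: if fewer than 6000 rows satisfy the condition, .sample(6000) raises a ValueError. A sketch of a guard that caps the draw at what is available:
# never ask .sample for more rows than actually match the condition
eligible = df[df.cost_price < 800]
n = min(6000, len(eligible))
df.loc[eligible.sample(n).index, 'Sold'] = 1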