How to update the original DataFrame with groupby in Python

I have a dataset and I am iterating over each group; based on each group, I want to update the original DataFrame:
import pandas as pd
import numpy as np
arr = np.array([1, 2, 4, 7, 11, 16, 22, 29, 37, 46])
df = pd.DataFrame({'grain': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']})
df["target"] = arr
for group_name, group in df.groupby("grain"):
    if group_name == "A":
        pass  # do some processing
    if group_name == "B":
        pass  # do another processing
I expect the original df to be updated. Is there any way to do it?

Here is a way to change the original data; note that this example requires a non-duplicate index. I am not sure what the benefit of this approach would be compared to classical pandas operations.
import pandas as pd
import numpy as np
arr = np.array([1, 2, 4, 7, 11, 16, 22, 29, 37, 46])
df = pd.DataFrame({'grain': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']})
df["target"] = arr
for g_name, g_df in df.groupby("grain"):
    if g_name == "A":
        df.loc[g_df.index, 'target'] *= 10
    if g_name == "B":
        df.loc[g_df.index, 'target'] *= -1
Output:
>>> df
   grain  target
0      A      10
1      B      -2
2      A      40
3      B      -7
4      A     110
5      B     -16
6      A     220
7      B     -29
8      A     370
9      B     -46
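For comparison, here is a minimal sketch of the same update done with "classical" vectorised pandas operations instead of a loop over groups (assuming the same multiply-by-10 / negate logic as above):
import pandas as pd
import numpy as np

arr = np.array([1, 2, 4, 7, 11, 16, 22, 29, 37, 46])
df = pd.DataFrame({'grain': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']})
df["target"] = arr

# Multiply group A rows by 10 and negate group B rows, without an explicit loop
df["target"] = np.where(df["grain"].eq("A"), df["target"] * 10, df["target"] * -1)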

Related

How to export heterogeneous arrays to CSV

Basically what I'm trying to do is export np.arrays to a .csv file where my data can be stored neatly. I've tried this code, but it doesn't work as I would like:
import numpy as np
import pandas as pd
T11= np.array([1, 2, 3])
T21= np.array([4, 5, 6])
T31= np.array([7, 8, 9, 10])
T41= np.array([11, 12, 13, 14])
T51= np.array([15])
Tx=[T11,T21,T31,T41,T51]
Tx1=pd.DataFrame.transpose(Tx)
df2 = pd.DataFrame(np.array(Tx1),columns=['a', 'b', 'c', 'e', 'f'])
df.to_csv('file.csv', index=False, header=False)
The code gives me these errors, but I don't know how to fix them:
'list' object has no attribute 'dtypes'
ValueError: Shape of passed values is (5, 1), indices imply (5, 5)
I'd really like my output to be something like:
a  b  c   d   e
1  4  7   11  15
2  5  8   12
3  6  9   13
      10  14
Try this:
df = pd.DataFrame([T11,T21,T31,T41,T51], index=['a', 'b', 'c', 'e', 'f']).T
df.to_csv('file.csv', index=False, header=True)
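One caveat worth noting: because the arrays have different lengths, pandas pads the shorter columns with NaN, which also turns the values into floats. If you prefer whole numbers in the CSV, one option (a sketch, assuming pandas' nullable Int64 dtype is acceptable) is to convert before writing:
import numpy as np
import pandas as pd

T11 = np.array([1, 2, 3])
T21 = np.array([4, 5, 6])
T31 = np.array([7, 8, 9, 10])
T41 = np.array([11, 12, 13, 14])
T51 = np.array([15])

df = pd.DataFrame([T11, T21, T31, T41, T51], index=['a', 'b', 'c', 'e', 'f']).T
# Nullable integers keep whole numbers; missing entries become <NA>
# and are written as empty cells by to_csv
df = df.astype('Int64')
df.to_csv('file.csv', index=False, header=True)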

More efficient way to search through Pandas groups

I want to get the list of values from col2 that belong to the same groupId, given a corresponding value in col1. col1 values can belong to multiple groups, and in that case only the top-most group should be considered (group 2 but not group 3 in my example). col1 values are always identical within the same groupId.
groupId  col1  col2
2        a     10
1        b     20
2        a     30
1        b     40
3        a     50
3        a     60
1        b     70
My current solution takes over 30s for a df with 2000 rows and 32 values to search for in col1 ('a' in this case):
group_id_groups = df.groupby('groupId')
for group_id, group in group_id_groups:
    col2_values = list(group[group['col1'] == 'a']['col2'])
    if col2_values:
        print(col2_values)
        break
result: [10, 30]
The sort parameter of groupby defaults to True, which means the first group will be the top-most one by default. You can change col_to_search to 'b' to get the other answer.
import pandas as pd

df = pd.DataFrame({'groupId': [2, 1, 2, 1, 3, 3, 1],
                   'col1': ['a', 'b', 'a', 'b', 'a', 'a', 'b'],
                   'col2': [10, 20, 30, 40, 50, 60, 70]})
col_to_search = 'a'
(
    df.loc[df['col1'].eq(col_to_search)]
    .groupby('groupId')['col2']
    .apply(list)
    .iloc[0]
)
Output
[10, 30]
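As a quick usage check (assuming the data above is the full frame), searching for 'b' instead picks up groupId 1:
col_to_search = 'b'
(
    df.loc[df['col1'].eq(col_to_search)]
    .groupby('groupId')['col2']
    .apply(list)
    .iloc[0]
)
# [20, 40, 70]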
I am still not sure what you want. Does this help you? I am sure that pandas.DataFrame.groupby() is your friend here.
Full code
#!/usr/bin/env python3
import pandas as pd
# initial data
df = pd.DataFrame({
    'groupId': [2, 1, 2, 1, 3, 3, 1],
    'col1': list('ababaab'),
    'col2': range(10, 80, 10)
})
print(df)
g = df.groupby(['groupId', 'col1']).agg(list)
print(g)
result = g.loc[(2, 'a')]
print(result)
Step by step
Your initial data in df looks like this
   groupId col1  col2
0        2    a    10
1        1    b    20
2        2    a    30
3        1    b    40
4        3    a    50
5        3    a    60
6        1    b    70
Then you simply group your data by your two "search columns". The result per group is stored as a list.
g = df.groupby(['groupId', 'col1']).agg(list)
The result:
                      col2
groupId col1
1       b     [20, 40, 70]
2       a         [10, 30]
3       a         [50, 60]
Now you can do your search:
result = g.loc[(2, 'a')]
That gives you
col2 [10, 30]
Name: (2, a), dtype: object
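If the top-most groupId is not known up front, a possible lookup on the same grouped frame (a sketch, assuming "top-most" means the first groupId in the sorted group order, as in the other answers) is:
# groupIds that contain 'a', in sorted order; take the first one
first_gid = g.xs('a', level='col1').index[0]
result = g.loc[(first_gid, 'a'), 'col2']
print(result)  # [10, 30]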
It seems to me that you mostly need to create a mask without using a groupby.
import pandas as pd

# data
data = {'groupId': {0: '2', 1: '1', 2: '2', 3: '1', 4: '3', 5: '3', 6: '1'},
        'col1': {0: 'a', 1: 'b', 2: 'a', 3: 'b', 4: 'a', 5: 'a', 6: 'b'},
        'col2': {0: 10, 1: 20, 2: 30, 3: 40, 4: 50, 5: 60, 6: 70}}
df = pd.DataFrame(data)

# First group where the condition is satisfied
first_group = df[df["col1"].eq("a")].iloc[0]["groupId"]

# Output
df[df["col1"].eq("a") &
   df["groupId"].eq(first_group)]["col2"].to_list()
And the output is [10, 30] as expected.
You can use pandas.groupby with agg(list), then search for what you want with .loc and return the first match.
>>> grp = df.groupby(['groupId', 'col1']).agg(list).reset_index()
>>> grp.loc[grp['col1'].eq('a'), 'col2'].to_list()[0]
[10, 30]
>>> grp.loc[grp['col1'].eq('a'), 'col2']
1    [10, 30]
2    [50, 60]
Name: col2, dtype: object

How to recognize changed values when applying transform() in pandas

I have a DataFrame like the one below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'category': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'size': [20, 0, 10, 30, 30, 0, 0, 10],
                   'price': [5, 0, 2, 10, 10, 0, 0, 3],
                   'flag': [0, 0, 0, 0, 0, 0, 0, 0]
                   })
I would like to change the 0 values in the 'size' column into the max value of the category, so:
df['size'] = np.where(df['size'].eq(0), df.groupby('category')['size'].transform('max'), df['size'])
df['price'] = np.where(df['price'].eq(0), df.groupby('category')['price'].transform('max'), df['price'])
And the output would be like:
df = pd.DataFrame({'category': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'size': [20, 20, 10, 30, 30, 30, 10, 10],
                   'price': [5, 5, 2, 10, 10, 10, 3, 3],
                   'flag': [0, 0, 0, 0, 0, 0, 0, 0]
                   })
(process so far confirmed)
But now I would like to know which rows have been changed, so I added a 'flag' column and would like to change its value from 0 to 1 when any other value in the same row has been changed.
So the desired output would be as below:
df = pd.DataFrame({'category': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'size': [20, 20, 10, 30, 30, 30, 10, 10],
                   'price': [5, 5, 2, 10, 10, 10, 3, 3],
                   'flag': [0, 1, 0, 0, 0, 1, 1, 0]
                   })
Is there any way I can do this in one line with the transform statement? Or any other good way?
Can you just label what will be changed before applying your operations? i.e. find the places where size == 0:
df['flag'] = (df['size'] == 0).astype(int)
# then do
df['size'] = np.where(df['size'].eq(0), df.groupby('category')['size'].transform('max'), df['size'])
df['price'] = np.where(df['price'].eq(0), df.groupby('category')['price'].transform('max'), df['price'])
Or for either price or size:
df['flag'] = ((df['size'] == 0) | (df['price'] == 0)).astype(int)
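Putting the two steps together on the question's data (a sketch; the flag is computed before the zeros are replaced):
import numpy as np
import pandas as pd

df = pd.DataFrame({'category': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'size': [20, 0, 10, 30, 30, 0, 0, 10],
                   'price': [5, 0, 2, 10, 10, 0, 0, 3]})

# Flag first, then fill the zeros with the per-category max
df['flag'] = ((df['size'] == 0) | (df['price'] == 0)).astype(int)
df['size'] = np.where(df['size'].eq(0), df.groupby('category')['size'].transform('max'), df['size'])
df['price'] = np.where(df['price'].eq(0), df.groupby('category')['price'].transform('max'), df['price'])
# flag is now [0, 1, 0, 0, 0, 1, 1, 0], matching the desired output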
Check here: I add an additional condition because it is possible that every price and size value in a group is 0.
cond1 = df.groupby('category')['size'].transform('max')
cond2 = df.groupby('category')['price'].transform('max')
df['changed'] = ((df['size'].ne(cond1)&df['size'].eq(0)) | (df['price'].ne(cond2)&df['price'].eq(0))) .astype(int)
df['size'] = np.where(df['size'].eq(0), cond1, df['size'])
df['price'] = np.where(df['price'].eq(0), cond2, df['price'])
Out[406]:
  category  size  price  flag  changed
0        A    20      5     0        0
1        A    20      5     0        1
2        A    10      2     0        0
3        B    30     10     0        0
4        B    30     10     0        0
5        B    30     10     0        1
6        C    10      3     0        1
7        C    10      3     0        0
Or, if the max value is always more than 0:
df['changes'] = df[['size', 'price']].eq(0).any(axis=1).astype(int)

Pandas sample by filter criteria

I have a data frame like the one below
d = {'var1': [1, 2, 3, 4], 'var2': [5, 6, 7, 8], 'class': ['a', 'a', 'c', 'b']}
df = pd.DataFrame(data=d)
df
   var1  var2 class
0     1     5     a
1     2     6     a
2     3     7     c
3     4     8     b
I would like to be able to change the proportion of the class column. For example, I would like to down-sample the 'a' class at random by 50% but keep the number of rows for the other classes the same. The result would be:
df
   var1  var2 class
0     1     5     a
1     3     7     c
2     4     8     b
How would this be done?
I used the approach to split the DataFrame into df_selection and df_remaining first.
I then reduced df_selection by REMOVE_PERCENTAGE and merged the resulting DataFrame with df_remaining again.
import numpy as np
import pandas as pd
d = {'var1': [1, 2, 3, 4], 'var2': [5, 6, 7, 8], 'class': ['a', 'a', 'c', 'b']}
df = pd.DataFrame(data=d)
REMOVE_PERCENTAGE = 0.5 # between 0 and 1
df = df.set_index(['class'])
df_selection = df.loc['a'] \
                 .reset_index()
df_remaining = df.drop('a') \
                 .reset_index()
rows_to_remove = int(REMOVE_PERCENTAGE * len(df_selection.index))
drop_indices = np.random.choice(df_selection.index, rows_to_remove, replace=False)
df_selection_reduced = df_selection.drop(drop_indices)
df_result = pd.concat([df_selection_reduced, df_remaining]) \
              .reset_index(drop=True)
print(df_result)
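For reference, a more compact alternative sketch (assuming it is acceptable to pick the rows with DataFrame.sample instead of numpy):
import pandas as pd

d = {'var1': [1, 2, 3, 4], 'var2': [5, 6, 7, 8], 'class': ['a', 'a', 'c', 'b']}
df = pd.DataFrame(data=d)

mask = df['class'].eq('a')
# Keep a random 50% of the 'a' rows and all rows of the other classes
df_result = pd.concat([df[mask].sample(frac=0.5), df[~mask]]).sort_index()
print(df_result)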

Replicating rows in pandas dataframe by column value and add a new column with repetition index

My question is similar to one asked here. I have a dataframe and I want to repeat each row of the dataframe k times. Along with that, I also want to create a column with values 0 to k-1. So:
import pandas as pd
df = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n':  [1, 2, 3],
    'v':  [10, 13, 8]
})
what_i_want = pd.DataFrame(data={
    'id': ['A', 'B', 'B', 'C', 'C', 'C'],
    'n':  [1, 2, 2, 3, 3, 3],
    'v':  [10, 13, 13, 8, 8, 8],
    'repeat_id': [0, 0, 1, 0, 1, 2]
})
The command below does half of the job. I am looking for a pandas way of adding the repeat_id column.
df.loc[df.index.repeat(df.n)]
Use GroupBy.cumcount, and copy to avoid SettingWithCopyWarning:
If you modify values in df1 later, you will find that the modifications do not propagate back to the original data (df), and that pandas warns you about it.
df1 = df.loc[df.index.repeat(df.n)].copy()
df1['repeat_id'] = df1.groupby(level=0).cumcount()
df1 = df1.reset_index(drop=True)
print (df1)
  id  n   v  repeat_id
0  A  1  10          0
1  B  2  13          0
2  B  2  13          1
3  C  3   8          0
4  C  3   8          1
5  C  3   8          2
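The same idea can also be written as a single chain (a sketch using assign, equivalent to the lines above):
df1 = (df.loc[df.index.repeat(df.n)]
         .assign(repeat_id=lambda x: x.groupby(level=0).cumcount())
         .reset_index(drop=True))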
