Pandas create new column based on division of two other columns - python

Hi, I have the following df in which I want the new column to be the result of B/A, unless B == 0, in which case take the average of C and D and divide by A, i.e. ((C+D)/2)/A.
I know how to do df["New Column"] = df["B"]/df["A"], but I am not sure how to add the condition. Do I need to iterate through each row of the df and use conditional if statements?
A B C D New Column Desired Column
5 3 2 4 0.6 0.6
6 2 2 3 0.333 0.333333333
8 4 3 4 0.5 0.5
9 0 3 4 0 0.388888889
14 3 3 4 0.214 0.214285714
5 0 2 4 0 0.6

Here you go:
import numpy as np

# where B is non-zero use B/A, otherwise fall back to ((C+D)/2)/A
df["New Column"] = np.where(df["B"] != 0, df["B"] / df["A"], (df["C"] + df["D"]) / 2 / df["A"])

Related

I want to add sub-index in python with pandas [duplicate]

When using groupby(), how can I create a DataFrame with a new column containing an index of the group number, similar to dplyr::group_indices in R? For example, if I have
>>> df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
>>> df
a b
0 1 1
1 1 1
2 1 2
3 2 1
4 2 1
5 2 2
How can I get a DataFrame like
a b idx
0 1 1 1
1 1 1 1
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 4
(the order of the idx indexes doesn't matter)
Here is the solution using ngroup (available as of pandas 0.20.2) from a comment above by Constantino.
import pandas as pd
df = pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
df['idx'] = df.groupby(['a', 'b']).ngroup()
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
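Note that groupby sorts the group keys by default, so the numbers from ngroup() follow sorted key order. If you would rather number the groups in order of first appearance, this variation (not from the original answer) should do it:
df['idx'] = df.groupby(['a', 'b'], sort=False).ngroup()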
Here's a concise way using drop_duplicates and merge to get a unique identifier.
group_vars = ['a', 'b']
df.merge(df.drop_duplicates(group_vars).reset_index(), on=group_vars)
a b index
0 1 1 0
1 1 1 0
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 5
The identifier in this case goes 0, 2, 3, 5 (just a residue of the original index), but it could easily be changed to 0, 1, 2, 3 with an additional reset_index(drop=True) on the deduplicated frame.
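For instance, a sketch of that relabelled variant:
group_vars = ['a', 'b']
df.merge(df.drop_duplicates(group_vars)
           .reset_index(drop=True)  # discard the residual original index
           .reset_index(),          # generate a dense 0..k-1 'index' column
         on=group_vars)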
Update: Newer versions of pandas (0.20.2) offer a simpler way to do this with the ngroup method, as noted in a comment to the question above by @Constantino and a subsequent answer by @CalumYou. I'll leave this here as an alternate approach, but ngroup seems like the better way to do this in most cases.
A simple way to do that would be to concatenate your grouping columns (so that each combination of their values represents a uniquely distinct element), then convert it to a pandas Categorical and keep only its labels:
df['idx'] = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
Edit: changed labels to codes, as the former seems to be deprecated
Edit 2: added a separator, as suggested by Authman Apatira
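The separator matters because without it, distinct pairs can collide after concatenation. A quick illustration with hypothetical values (not from the question):
pd.Categorical(pd.Series(['1', '11']) + pd.Series(['11', '1'])).codes        # both rows become '111' -> codes [0, 0]
pd.Categorical(pd.Series(['1', '11']) + '_' + pd.Series(['11', '1'])).codes  # '1_11' vs '11_1' -> codes [0, 1]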
Definitely not the most straightforward solution, but here is what I would do (comments in the code):
df = pd.DataFrame({'a': [1,1,1,2,2,2], 'b': [1,1,2,1,1,2]})
# create a dummy grouper id by just joining the desired columns
df["idx"] = df[["a", "b"]].astype(str).apply(lambda x: "".join(x), axis=1)
print(df)
That would generate a unique idx for each combination of a and b.
a b idx
0 1 1 11
1 1 1 11
2 1 2 12
3 2 1 21
4 2 1 21
5 2 2 22
But this is still a rather silly index (think about some more complex values in columns a and b). So let's clean up the index:
# create a dictionary of dummy group_ids and their index-wise representation
dict_idx = dict(enumerate(set(df["idx"])))
# switch keys and values, so you can use the dict in the .replace method
dict_idx = {y: x for x, y in dict_idx.items()}
# replace values with the generated dict
df["idx"].replace(dict_idx, inplace=True)
print(df)
That would produce the desired output:
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
A way that I believe is faster than the current accepted answer by about an order of magnitude (timing results below):
def create_index_usingduplicated(df, grouping_cols=['a', 'b']):
    df.sort_values(grouping_cols, inplace=True)
    # You could do the following three lines in one, I just thought
    # this would be clearer as an explanation of what's going on:
    duplicated = df.duplicated(subset=grouping_cols, keep='first')
    new_group = ~duplicated  # rows that start a new group
    return new_group.cumsum()
Timing results:
a = np.random.randint(0, 1000, size=int(1e5))
b = np.random.randint(0, 1000, size=int(1e5))
df = pd.DataFrame({'a': a, 'b': b})
In [6]: %timeit df['idx'] = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes
1 loop, best of 3: 375 ms per loop
In [7]: %timeit df['idx'] = create_index_usingduplicated(df, grouping_cols=['a', 'b'])
100 loops, best of 3: 17.7 ms per loop
I'm not sure this is such a trivial problem. Here is a somewhat convoluted solution that first sorts the grouping columns, then checks whether each row is different from the previous row, and if so accumulates by 1. Check further below for an answer with string data.
df.sort_values(['a', 'b']).diff().fillna(0).ne(0).any(1).cumsum().add(1)
Output
0 1
1 1
2 2
3 3
4 3
5 4
dtype: int64
So breaking this up into steps, let's look at the output of df.sort_values(['a', 'b']).diff().fillna(0), which checks whether each row differs from the previous row. Any non-zero entry indicates a new group.
a b
0 0.0 0.0
1 0.0 0.0
2 0.0 1.0
3 1.0 -1.0
4 0.0 0.0
5 0.0 1.0
A new group only needs to have a single column different, so this is what .ne(0).any(1) checks - not equal to 0 for any of the columns. Then a cumulative sum keeps track of the groups.
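For reference, the intermediate boolean mask (True marks a row that starts a new group), derived from the table above:
df.sort_values(['a', 'b']).diff().fillna(0).ne(0).any(axis=1)
0    False
1    False
2     True
3     True
4    False
5     True
dtype: bool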
Answer for columns as strings
#create fake data and sort it
df=pd.DataFrame({'a':list('aabbaccdc'),'b':list('aabaacddd')})
df1 = df.sort_values(['a', 'b'])
output of df1
a b
0 a a
1 a a
4 a a
3 b a
2 b b
5 c c
6 c d
8 c d
7 d d
Take a similar approach by checking whether the group has changed:
df1.ne(df1.shift().bfill()).any(1).cumsum().add(1)
0 1
1 1
4 1
3 2
2 3
5 4
6 5
8 5
7 6
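If you want this as a column on the original (unsorted) df, assignment aligns on the index, so something like this sketch should work:
df['idx'] = df1.ne(df1.shift().bfill()).any(axis=1).cumsum().add(1)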

Allocate lowest value over n rows to n rows in DataFrame

I need to take the lowest value over n rows and add it to these n rows in a new column of the dataframe. For example:
n=3
Column 1 Column 2
5 3
3 3
4 3
7 2
8 2
2 2
5 4
4 4
9 4
8 2
2 2
3 2
5 2
Please note that if the number of rows is not divisible by n, the last values are incorporated into the last group. So in this example the final group has n=4 rows.
Thanking you in advance!
I do not know of any straightforward way to do this, but here is a working example (not elegant, but working...).
If you do not worry about the number of rows being divisible by n, you could use .groupby():
import pandas as pd

d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3
# group every n consecutive rows by integer-dividing the index
df['new_col'] = df.groupby(df.index // n).transform('min')
which yields:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 4
7 6 4
8 4 4
9 1 1
10 2 1
However, we can see that the last 2 rows are grouped together, instead of them being grouped with the 3 previous values in this case.
A way around this is to look at the .count() of elements in each group generated by groupby, and check the last one:
import pandas as pd

d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3
# Temporary dataframe
A = df.groupby(df.index // n).transform('min')
# The min value of each group in a second dataframe
min_df = df.groupby(df.index // n).min()
# The size of the last group
last_batch = df.groupby(df.index // n).count()[-1:]
# If the last size is not equal to n
if last_batch.values[0][0] != n:
    last_group = last_batch + n
    A[-last_group.values[0][0]:] = min_df[-2:].min()
# Assign the temporary modified dataframe to df
df['new_col'] = A
which yields the expected result:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 1
7 6 1
8 4 1
9 1 1
10 2 1
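A more compact alternative (a sketch, assuming a default RangeIndex): clip the group id so that a trailing partial group folds into the last full group:
import numpy as np

group_id = np.minimum(df.index // n, len(df) // n - 1)
df['new_col'] = df.groupby(group_id)['col1'].transform('min')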

Separation of the dataframes by row values

I want to split my dataframe based on the first row to generate four separate dataframes (for subgroup analysis). I have a 172x106 Excel file, where the first row consists of either a 1, 2, 3, or 4. The other 171 lines are radiomic features, which I want to copy to the ''new'' dataset. The columns do not have header names. My data looks like the following:
{0: [4.0, 0.65555056, 0.370511262, 16.5876203, 44.76954415, 48.0, 32.984845, 49.47726751, 49.47726751, 13133.33333, 29.34869973, 0.725907513, 3708.396349, 0.282365204, 13696.0, 2.122884402, 3.039611259, 1419.058749, 1.605529827, 0.488297449], 1: [2.0, 0.82581372, 0.33201741, 20.65753167, 62.21821817, 50.59644256, 62.60990337, 55.56977596, 77.35631842, 23890.66667, 51.38065822, 0.521666786, 7689.706847, 0.321870752, 25152.0, 1.022813615, 1.360453239, 548.2156387, 0.314035581, 0.181204079]}
I wanted to use groupby, but since the column headers have no name, it is hard to implement. This is my current code:
import numpy as np
import pandas as pd

df = pd.read_excel(r'H:\Documenten\MATLAB\sample_file.xlsx', header=None)
Class_1 = df.groupby(df.T.loc[:, 0])
df_new = Class_1.get_group("1")
print(df_new)
The error I get is the following:
Traceback (most recent call last):
File "H:/PycharmProjects/RadiomicsPipeline/spearman_subgroups.py", line 5, in <module>
df_new = Class_1.get_group("1")
File "C:\Users\cpullen\AppData\Roaming\Python\Python37\site-packages\pandas\core\groupby\groupby.py", line 754, in get_group
raise KeyError(name)
KeyError: '1'
How do I implement the separation of the dataframes by row values?
I am not sure that this is the result you want. If not, please clearly show the desired output.
You can simply achieve what you want by transposing your dataframe.
import numpy as np
import pandas as pd
np.random.seed(0)
n = 5
data = np.stack((np.random.choice([1, 2, 3, 4], n), np.random.rand(n), np.random.rand(n), np.random.rand(n)), axis=0)
df = pd.DataFrame(data)
df.head():
0 1 2 3 4
0 1.000000 4.000000 2.000000 1.000000 4.000000
1 0.857946 0.847252 0.623564 0.384382 0.297535
2 0.056713 0.272656 0.477665 0.812169 0.479977
3 0.392785 0.836079 0.337396 0.648172 0.368242
df_transpose = df.transpose()
df_transpose.columns = ['group'] + list(df_transpose.columns.values[1:])
df_transpose.head():
group 1 2 3
0 1.0 0.857946 0.056713 0.392785
1 4.0 0.847252 0.272656 0.836079
2 2.0 0.623564 0.477665 0.337396
3 1.0 0.384382 0.812169 0.648172
4 4.0 0.297535 0.479977 0.368242
list_df = dict()
for group in set(df_transpose['group']):
    list_df[group] = df_transpose.loc[df_transpose['group'] == group]
list_df:
{
1.0:
group 1 2 3
0 1.0 0.857946 0.056713 0.392785
3 1.0 0.384382 0.812169 0.648172,
2.0:
group 1 2 3
2 2.0 0.623564 0.477665 0.337396,
4.0:
group 1 2 3
1 4.0 0.847252 0.272656 0.836079
4 4.0 0.297535 0.479977 0.368242
}
Individual dataframes:
list_df[1.0], list_df[2.0], list_df[3.0] (which does not exist in this example), list_df[4.0]
(BTW, I picked a misleading variable name; it should be dict_df, not list_df, since it is a dictionary.)
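Applied to the original Excel data, the same idea might look like this (an untested sketch; note the group keys are the numeric values from the first row, e.g. 1.0, not the string "1", which is also why get_group("1") raised the KeyError):
import pandas as pd

df = pd.read_excel(r'H:\Documenten\MATLAB\sample_file.xlsx', header=None)
df_t = df.T  # one row per case; column 0 now holds the 1/2/3/4 labels
dict_df = {group: frame for group, frame in df_t.groupby(0)}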
First of all, after importing the dataframe, sort the columns by the values in the first row:
df
Out[26]:
0 1 2 3
0 0 1 0 1
1 1 2 5 8
2 2 3 6 9
3 3 4 7 0
df = df.sort_values(by = 0, axis = 1)
Out[30]:
0 2 1 3
0 0 0 1 1
1 1 5 2 8
2 2 6 3 9
3 3 7 4 0
You should now have a dataframe whose first-row values are in order. After that, you can use df.iloc to set the column names, then drop the first row.
df.columns = df.iloc[0]
0 0 0 1 1
0 0 0 1 1
1 1 5 2 8
2 2 6 3 9
3 3 7 4 0
df = df.drop(0)
df
Out[51]:
0 0 0 1 1
1 1 5 2 8
2 2 6 3 9
3 3 7 4 0
Finally, you can slice based on the column name (with duplicate column names, df[1] returns every column labeled 1).
df_1 = df[1]
df_1
Out[56]:
0 1 1
1 2 8
2 3 9
3 4 0

Multiply every five rows in python

I have two dataframes in pandas of the following form:
df1 df2
column factor
0 2 0 0.0
1 4 1 0.25
2 12 2 0.50
3 5 3 0.99
4 4 4 1.00
5 15
6 32
The task is to compute the sumproduct of every 5 rows in df1 with df2 and put the results in df3 (my actual data has about 500 rows in df1). The results should look like this:
df3
   results    (description - no need to add this column)
0  15.95      df1.iloc[0:5, 0].dot(df2)
1  24.46      df1.iloc[1:6, 0].dot(df2)
2  50.10      df1.iloc[2:7, 0].dot(df2)
Try:
import numpy as np

mx = df2.to_numpy().ravel()  # flatten to 1-D so each dot product is a scalar
df1.rolling(5).apply(lambda x: np.dot(x, mx), raw=True).iloc[4:]
Outputs:
column
4 15.95
5 24.46
6 50.10
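An alternative sketch without rolling, assuming NumPy >= 1.20 for sliding_window_view:
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

windows = sliding_window_view(df1['column'].to_numpy(), 5)  # shape (len(df1) - 4, 5)
df3 = pd.DataFrame({'results': windows @ df2['factor'].to_numpy()})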

Assign same random value to A-B, B-A pairs in python DataFrame

I have a Dataframe like
Sou Des
1 3
1 4
2 3
2 4
3 1
3 2
4 1
4 2
I need to assign a random value between 0 and 1 to each pair, but similar pairs like "1-3" and "3-1" have to get the same value. I'm expecting a result dataframe like
Sou Des Val
1 3 0.1
1 4 0.6
2 3 0.9
2 4 0.5
3 1 0.1
3 2 0.9
4 1 0.6
4 2 0.5
How do I assign the same random value to similar pairs like "A-B" and "B-A" in pandas?
Let's first create a helper DF, sorted along axis=1:
In [304]: x = pd.DataFrame(np.sort(df, axis=1), df.index, df.columns)
In [305]: x
Out[305]:
Sou Des
0 1 3
1 1 4
2 2 3
3 2 4
4 1 3
5 2 3
6 1 4
7 2 4
now we can group by its columns:
In [306]: df['Val'] = (x.assign(c=1)
                        .groupby(x.columns.tolist())
                        .transform(lambda x: np.random.rand(1)))
In [307]: df
Out[307]:
Sou Des Val
0 1 3 0.989035
1 1 4 0.918397
2 2 3 0.463653
3 2 4 0.313669
4 3 1 0.989035
5 3 2 0.463653
6 4 1 0.918397
7 4 2 0.313669
This is a new way, using crosstab and a symmetric random matrix:
s = pd.crosstab(df.Sou, df.Des)
b = np.random.randint(-2000, 2001, size=(len(s), len(s)))  # random_integers is deprecated
sy = (b + b.T) / 2  # symmetrize so (A, B) and (B, A) get the same value
s.mul(sy).replace(0, np.nan).stack().reset_index()
Out[292]:
Sou Des 0
0 1 3 -60.0
1 1 4 -867.0
2 2 3 269.0
3 2 4 1152.0
4 3 1 -60.0
5 3 2 269.0
6 4 1 -867.0
7 4 2 1152.0
The trick here is to do a bit of work away from the dataframe. You can break this down into three steps:
assemble a list of all tuples (a,b)
assign a random value to each pair so that (a,b) and (b,a) have the same value
fill in the new column
Assuming your dataframe is called df, we can make a list of all the pairs, ordered so that a <= b. I think this will be easier than trying to keep track of both (a, b) and (b, a).
pairs = set((a, b) if a <= b else (b, a)
            for a, b in df.itertuples(index=False, name=None))
It's simple enough to assign a random number to each of these pairs and store it in a dictionary, so I'll leave that to you. Call it pair_dict.
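For instance, a minimal sketch of that step (pair_dict is just our name for it):
import random

pair_dict = {pair: random.random() for pair in pairs}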
Now, we just have to lookup the values. We'll ultimately want to write
df['Val'] = df.apply(<some function>, axis=1)
where our function looks up the appropriate value in pair_dict.
Rather than try to cram it into a lambda (though we could), let's write it separately.
def func(row):
    # order the pair so (a, b) and (b, a) share the same key
    if row['Sou'] <= row['Des']:
        key = (row['Sou'], row['Des'])
    else:
        key = (row['Des'], row['Sou'])
    return pair_dict[key]
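Putting it together:
df['Val'] = df.apply(func, axis=1)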
If you are OK with the "random" value coming from the hash() method, you can achieve this with frozenset():
df = pd.DataFrame([[1, 1, 2, 2, 3, 3, 4, 4], [3, 4, 3, 4, 1, 2, 1, 2]]).T
df.columns = ['Sou', 'Des']
# frozenset({a, b}) == frozenset({b, a}), so both orderings hash identically
df['Val'] = df.apply(lambda x: hash(frozenset([x["Sou"], x["Des"]])), axis=1)
print(df)
which gives:
Sou Des Val
0 1 3 1580307032
1 1 4 -1736016661
2 2 3 741508915
3 2 4 -1930135584
4 3 1 1580307032
5 3 2 741508915
6 4 1 -1736016661
7 4 2 -1930135584
reference:
Why aren't Python sets hashable?
