Pandas DataFrame: subtract and divide two rows' values based on a column condition - python

I have DataFrame:
A     B     C     D
1   0.1   0.2   0.3
2   0.4   0.7   0.2
3   0     4.25  100
3  -2.5   4.20   70
3  -2.5   3.5    80
4   0     3.6    81
4  -5     3.5    77
4  -5     3.4    75
4  -5     3.1    74
5   0     3.2    75
5   0.1   3.3    73
Now, I want to skip the first two rows. After that, for group '3' in column 'A', I want to subtract the first row's value from the last row's value in column 'C', then divide by the same subtraction in column 'B'. Basically: |3.5 - 4.25| / |-2.5 - 0| = 0.3.
I tried, and skipped the first two rows with:
cols = ['A']
df[cols] = df[df[cols] > 2][cols]
df = df.dropna()
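For reference, a plain boolean mask would drop those rows in one step (a minimal sketch, assuming 'A' holds the group numbers):
df = df[df['A'] > 2]  # keep only the rows where A is 3, 4 or 5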
Expected output:
NEW_COL  Result
1        0.3
2        0.1
3        1
Could you please help me?

IIUC, you can compute the last-first per group, then divide C/B and drop NAs:
out = (df.groupby('A')
         .agg(lambda g: abs(g.iloc[-1] - g.iloc[0]))
         .eval('C/B')
         .dropna()
         .reset_index(drop=True)
       )
output:
0 0.3
1 0.1
2 1.0
dtype: float64
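For readers who prefer to avoid eval, an equivalent spelled out step by step (a sketch, assuming the same df):
# absolute last-minus-first difference per group in A
diffs = df.groupby('A').agg(lambda g: abs(g.iloc[-1] - g.iloc[0]))
# ratio of the C and B differences; groups 1 and 2 yield 0/0 -> NaN and drop out
out = (diffs['C'] / diffs['B']).dropna().reset_index(drop=True)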

Related

assign values of one dataframe column to another dataframe column based on condition

I am trying to compare two dataframes based on different columns and assign a value to one dataframe based on the result.
df1 :
date value1 value2
4/1/2021 A 1
4/2/2021 B 2
4/6/2021 C 3
4/4/2021 D 4
4/5/2021 E 5
4/6/2021 F 6
4/2/2021 G 7
df2:
Date percent
4/1/2021 0.1
4/2/2021 0.2
4/6/2021 0.6
output:
date value1 value2 per
4/1/2021 A 1 0.1
4/2/2021 B 2 0.2
4/6/2021 C 3 0.6
4/4/2021 D 4 0
4/5/2021 E 5 0
4/6/2021 F 6 0
4/2/2021 G 7 0.2
Code1:
df1['per'] = np.where(df1['date']==df2['Date'], df2['per'], 0)
error:
ValueError: Can only compare identically-labeled Series objects
Note: I changed the column name df2['Date'] to df2['date'] and then tried merging.
Code2:
new = pd.merge(df1, df2, on=['date'], how='inner')
error:
ValueError: You are trying to merge on object and datetime64[ns] columns. If you wish to proceed you should use pd.concat
You can build a date-to-percent mapping from df2 and use Series.map, filling non-matches with 0:
df1['per'] = df1['date'].map(dict(zip(df2['Date'], df2['percent']))).fillna(0)
date value1 value2 per
0 4/1/2021 A 1 0.1
1 4/2/2021 B 2 0.2
2 4/6/2021 C 3 0.6
3 4/4/2021 D 4 0.0
4 4/5/2021 E 5 0.0
5 4/6/2021 F 6 0.6
6 4/2/2021 G 7 0.2
You could use pd.merge and perform a left join to keep all the rows from df1 and bring over all the date matching rows from df2:
pd.merge(df1,df2,left_on='date',right_on='Date', how='left').fillna(0).drop('Date',axis=1)
Prints:
date value1 value2 percent
0 04/01/2021 A 1 0.1
1 04/02/2021 B 2 0.2
2 04/06/2021 C 3 0.6
3 04/04/2021 D 4 0.0
4 04/05/2021 E 5 0.0
5 04/06/2021 F 6 0.6
6 04/02/2021 G 7 0.2
*I think there's a typo on your penultimate row. percent should be 0.6 IIUC.
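As a side note on the ValueError from code2: it means df1['date'] is a plain string column while df2['Date'] is datetime64. Parsing both as datetimes before merging should also work (a sketch, assuming the column names shown above):
import pandas as pd

df1['date'] = pd.to_datetime(df1['date'])
df2['Date'] = pd.to_datetime(df2['Date'])
new = df1.merge(df2, left_on='date', right_on='Date', how='left')
new['percent'] = new['percent'].fillna(0)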

Pivot Tables with Pandas

I have the following data saved as a pandas dataframe
Animal Day age Food kg
1 1 3 17 0.1
1 1 3 22 0.7
1 2 3 17 0.8
2 2 7 15 0.1
With pivot I get the following:
output = df.pivot(["Animal", "Food"], "Day", "kg") \
           .add_prefix("Day") \
           .reset_index() \
           .rename_axis(None, axis=1)
>>> output
Animal Food Day1 Day2
0 1 17 0.1 0.8
1 1 22 0.7 NaN
2 2 15 NaN 0.1
However, I would like the age column (and other columns) to still be included.
It is also possible that for a given animal the age value is not always the same; in that case it doesn't matter which age value is taken.
Animal Food Age Day1 Day2
0 1 17 3 0.1 0.8
1 1 22 3 0.7 NaN
2 2 15 7 NaN 0.1
How do I need to change the code above?
IIUC, what you want is to pivot the weight, but to aggregate the age.
To my knowledge, you need to do both operations separately: one with pivot, the other with groupby (here I used first as the example aggregation, but it could be anything), then join:
(df.pivot(index=["Animal", "Food"],
          columns="Day",
          values="kg",
          )
   .add_prefix('Day')
   .join(df.groupby(['Animal', 'Food'])['age'].first())
   .reset_index()
)
I am adding an unambiguous example (here the age of Animal 1 changes on Day2).
Input:
Animal Day age Food kg
0 1 1 3 17 0.1
1 1 1 3 22 0.7
2 1 2 4 17 0.8
3 2 2 7 15 0.1
output:
Animal Food Day1 Day2 age
0 1 17 0.1 0.8 3
1 1 22 0.7 NaN 3
2 2 15 NaN 0.1 7
Use pivot, add other columns to index:
>>> df.pivot(df.columns[~df.columns.isin(['Day', 'kg'])], 'Day', 'kg') \
...       .add_prefix('Day').reset_index().rename_axis(columns=None)
Animal age Food Day1 Day2
0 1 3 17 0.1 0.8
1 1 3 22 0.7 NaN
2 2 7 15 NaN 0.1
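A related option (a sketch, not taken from the answers above): pd.pivot_table aggregates duplicate Animal/Food/Day combinations instead of raising the way pivot does, so it also tolerates repeated measurements:
out = (df.pivot_table(index=['Animal', 'Food'], columns='Day',
                      values='kg', aggfunc='first')
         .add_prefix('Day')
         .join(df.groupby(['Animal', 'Food'])['age'].first())
         .reset_index())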

Select value based on time interval in pandas [duplicate]

df1 = pd.DataFrame({'Chr': ['1', '1', '2', '2', '3', '3', '4'],
                    'position': [50, 500, 1030, 2005, 3575, 50, 250]})
df2 = pd.DataFrame({'Chr': ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2',
                            '3', '3', '3', '3', '3'],
                    'start': [0, 100, 1000, 2000, 3000, 0, 100, 1000, 2000, 3000,
                              0, 100, 1000, 2000, 3000],
                    'end': [100, 1000, 2000, 3000, 4000, 100, 1000, 2000, 3000, 4000,
                            100, 1000, 2000, 3000, 4000],
                    'logr': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18],
                    'seg': [0.2, 0.5, 0.2, 0.1, 0.5, 0.5, 0.2, 0.2, 0.1, 0.2,
                            0.1, 0.5, 0.5, 0.9, 0.3]})
I want to match each 'Chr' and 'position' in df1 against 'Chr' and the intervals in df2 (i.e. where the position in df1 falls between 'start' and 'end'), then add the 'logr' and 'seg' columns to df1.
My desired output is:
df3 = pd.DataFrame({'Chr': ['1', '1', '2', '2', '3', '3', '4'],
                    'position': [50, 500, 1030, 2005, 3575, 50, 250],
                    'logr': [3, 4, 10, 11, 18, 13, "NA"],
                    'seg': [0.2, 0.5, 0.2, 0.1, 0.3, 0.1, "NA"]})
Thank you in advance.
Use DataFrame.merge with an outer join for all combinations, then filter with Series.between and boolean indexing, using DataFrame.pop to extract the helper columns, and finally a left join to add back the missing rows:
df3 = df1.merge(df2, on='Chr', how='outer')
# between is inclusive by default (>=, <=); pass inclusive=False for (>, <)
df3 = df3[df3['position'].between(df3.pop('start'), df3.pop('end'))]
# if one side should be inclusive and the other not (e.g. >, <=):
# df3 = df3[(df3['position'] > df3.pop('start')) & (df3['position'] <= df3.pop('end'))]
df3 = df1.merge(df3, how='left')
print(df3)
Chr position logr seg
0 1 50 3.0 0.2
1 1 500 4.0 0.5
2 2 1030 10.0 0.2
3 2 2005 11.0 0.1
4 3 3575 18.0 0.3
5 3 50 13.0 0.1
6 4 250 NaN NaN
Another solution:
df3 = df1.merge(df2, on='Chr', how='outer')
s = df3.pop('start')
e = df3.pop('end')
df3 = df3[df3['position'].between(s, e) | s.isna() | e.isna()]
# if different closed intervals are needed (e.g. >, <=):
# df3 = df3[(df3['position'] > s) & (df3['position'] <= e) | s.isna() | e.isna()]
print(df3)
Chr position logr seg
0 1 50 3.0 0.2
6 1 500 4.0 0.5
12 2 1030 10.0 0.2
18 2 2005 11.0 0.1
24 3 3575 18.0 0.3
25 3 50 13.0 0.1
30 4 250 NaN NaN
Try using pd.merge() and np.where():
import pandas as pd
import numpy as np

res_df = pd.merge(df1, df2, on=['Chr'], how='outer')
res_df['check_between'] = np.where((res_df['position'] >= res_df['start']) & (res_df['position'] <= res_df['end']), True, False)
df3 = res_df[(res_df['check_between'] == True) |
             (res_df['start'].isnull()) |
             (res_df['end'].isnull())]
df3 = df3.drop(['check_between', 'start', 'end'], axis=1)
Chr position logr seg
0 1 50 3.0 0.2
6 1 500 4.0 0.5
12 2 1030 10.0 0.2
18 2 2005 11.0 0.1
24 3 3575 18.0 0.3
25 3 50 13.0 0.1
30 4 250 NaN NaN
Do a left merge with indicator=True. Next, query keeps the rows where position lies between start and end, or where the _merge value is left_only. Finally, drop the unwanted columns:
df1.merge(df2, 'left', indicator=True).query('(start<=position<=end) | _merge.eq("left_only")') \
   .drop(['start', 'end', '_merge'], axis=1)
Out[364]:
Chr position logr seg
0 1 50 3.0 0.2
6 1 500 4.0 0.5
12 2 1030 10.0 0.2
18 2 2005 11.0 0.1
24 3 3575 18.0 0.3
25 3 50 13.0 0.1
30 4 250 NaN NaN
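One more angle (a sketch, not from the answers above): pd.merge_asof matches each position to the closest interval start at or below it, per Chr, and avoids the row explosion of an outer merge; it assumes the intervals within each Chr do not overlap:
import numpy as np
import pandas as pd

left = df1.sort_values('position')   # merge_asof requires sorted keys
right = df2.sort_values('start')
m = pd.merge_asof(left, right, left_on='position', right_on='start', by='Chr')
# discard matches where the position overshoots the interval's end
m.loc[~(m['position'] <= m['end']), ['logr', 'seg']] = np.nan
df3 = m.drop(columns=['start', 'end']).sort_index()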

Split pandas frames when a specific column reaches a given value

I want to use pd.cut() with two different bin sizes for two specific parts of a dataframe. I believe the easiest way to do that is to split my dataframe in two, so I can apply pd.cut() to the two independent dataframes with two independent sets of bins.
I understand I could use df.head(), but I have to keep changing the dataframes and they don't always have the same size. For example, with the following dataframe:
A B
1 0.1 0.423655
2 0.2 0.645894
3 0.3 0.437587
4 0.31 0.891773
5 0.4 0.1773
6 0.43 0.91773
7 0.5 0.891773
I want two dataframes: one where the value of A is lower than 0.4, and one where it is higher than or equal to 0.4.
So I would have df2:
A B
1 0.1 0.423655
2 0.2 0.645894
3 0.3 0.437587
4 0.31 0.891773
and df3:
A B
1 0.4 0.1773
2 0.43 0.91773
3 0.5 0.891773
Again, df.head(4) or df.tail(3) won't work.
df2 = df[df["A"] < 0.4]
df3 = df[df["A"] >= 0.4]
This should work:
import pandas as pd
data = {'A': [0.1,0.2,0.1,0.2,5,6,7,8], 'B': [5,0.2,4,8,11,9,10,14]}
df = pd.DataFrame(data)
df2 = df[df.A >= 0.4]
print(df2)
# A B
#4 5.0 11.0
#5 6.0 9.0
#6 7.0 10.0
#7 8.0 14.0
df3 = df[df.A < 0.4]
print(df3)
# A B
#0 0.1 5.0
#1 0.2 0.2
#2 0.1 4.0
#3 0.2 8.0
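A small variant (a sketch): compute the mask once and reuse its negation, which guarantees the two parts partition the frame exactly:
mask = df['A'] >= 0.4
df3, df2 = df[mask], df[~mask]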
I added some fictitious data as an example:
data = {'A': [1,2,3,4,5,6,7,8], 'B': [5,8,9,10,11,12,13,14]}
df = pd.DataFrame(data)
df1 = df[df.A > 4]
df2 = df[df.A < 13]
print(df1)
print(df2)
Output
>>> print(df1)
A B
4 5 11
5 6 12
6 7 13
7 8 14
>>> print(df2)
A B
0 1 5
1 2 8
2 3 9
3 4 10
4 5 11
5 6 12
6 7 13
7 8 14

Create multiple dataframes based on the original dataframe columns number

I've searched for quite a while, but I haven't found any similar question. If there is one, please let me know!
I am currently trying to divide one dataframe into n dataframes, where n equals the number of columns of the original dataframe. All the resulting dataframes must keep the first column of the original dataframe. As an extra, it would be nice to gather them all together in a list, for example, for further access.
To visualize my intention, here is a brief example:
>> original df
GeneID A B C D E
1 0.3 0.2 0.6 0.4 0.8
2 0.5 0.3 0.1 0.2 0.6
3 0.4 0.1 0.5 0.1 0.3
4 0.9 0.7 0.1 0.6 0.7
5 0.1 0.4 0.7 0.2 0.5
My desired output would be something like this:
>> df1
GeneID A
1 0.3
2 0.5
3 0.4
4 0.9
5 0.1
>> df2
GeneID B
1 0.2
2 0.3
3 0.1
4 0.7
5 0.4
....
And so on, until all the columns from the original dataframe be covered.
What would be the best solution?
You can use df.columns to get all column names and then create sub-dataframes:
outdflist = []
# for each column beyond the first:
for col in oridf.columns[1:]:
    # create a sub-dataframe with the desired columns:
    subdf = oridf[['GeneID', col]]
    # append subdf to the list of dataframes:
    outdflist.append(subdf)
# to view all dataframes created:
for df in outdflist:
    print(df)
Output:
GeneID A
0 1 0.3
1 2 0.5
2 3 0.4
3 4 0.9
4 5 0.1
GeneID B
0 1 0.2
1 2 0.3
2 3 0.1
3 4 0.7
4 5 0.4
GeneID C
0 1 0.6
1 2 0.1
2 3 0.5
3 4 0.1
4 5 0.7
GeneID D
0 1 0.4
1 2 0.2
2 3 0.1
3 4 0.6
4 5 0.2
GeneID E
0 1 0.8
1 2 0.6
2 3 0.3
3 4 0.7
4 5 0.5
The above for loop can also be written more simply as a list comprehension:
outdflist = [oridf[['GeneID', col]]
             for col in oridf.columns[1:]]
You can do it with groupby:
d = {'df' + str(x): y for x, y in df.groupby(level=0, axis=1)}
d
Out[989]:
{'dfA': A
0 0.3
1 0.5
2 0.4
3 0.9
4 0.1, 'dfB': B
0 0.2
1 0.3
2 0.1
3 0.7
4 0.4, 'dfC': C
0 0.6
1 0.1
2 0.5
3 0.1
4 0.7, 'dfD': D
0 0.4
1 0.2
2 0.1
3 0.6
4 0.2, 'dfE': E
0 0.8
1 0.6
2 0.3
3 0.7
4 0.5, 'dfGeneID': GeneID
0 1
1 2
2 3
3 4
4 5}
You can create a list of column names, and manually loop through and create a new DataFrame each loop.
>>> import pandas as pd
>>> d = {'col1':[1,2,3], 'col2':[3,4,5], 'col3':[6,7,8]}
>>> df = pd.DataFrame(data=d)
>>> df
col1 col2 col3
0 1 3 6
1 2 4 7
2 3 5 8
>>> newstuff = []
>>> columns = list(df)
>>> for column in columns:
...     newstuff.append(pd.DataFrame(data=df[column]))
Unless your dataframe is unreasonably massive, the above code should do the job.
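Combining the ideas above, a dict comprehension keyed by column name keeps the first column in every piece and gives named access (a sketch, assuming 'GeneID' is the first column):
dfs = {col: oridf[['GeneID', col]] for col in oridf.columns[1:]}
dfs['B']   # the GeneID + B sub-frame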
