I am trying to code a scenario where I need to iterate over a record column-wise, but I could not generalize it. Below is an example scenario:
input data
The output should be: if a column value in a record is not null and less than 0, the code should iterate over the following columns in the same row until it finds a positive or null value, and then replace the negative value with that column value.
output data
So basically, in a particular record, if any cell contains a negative value, the code should check the next cell in the same record, keep checking until it finds a null or positive value, and replace the negative value with that value.
Hope the scenario is clear. I have attached input and expected output data for reference. I can write code using iterrows and if/else conditions, but I want to generalize the code for n columns; it is not feasible to write 50 or 100 if/else conditions.
Thanks in advance for the help.
I think the easiest way is to transpose. Assume d is your dataframe:
dummy = abs(max(d.max())) + 1   # sentinel value larger than any real value
d = d.fillna(dummy)             # protect the original NaNs with the dummy value
d[d < 0] = None                 # replace negative values with NaN
d = d.transpose()
d = d.bfill()                   # backfill NaNs from the next row (the next column in the original dataframe)
d = d.transpose()               # transpose back
d[d == dummy] = None            # turn the dummy value back into NaN
(fillna(method="bfill") is deprecated in recent pandas; bfill() does the same thing.)
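As a quick illustration of those steps on a made-up 3×4 frame (the column names c1–c4 are arbitrary): the negatives in each row get replaced by the next non-negative value to their right, while the original NaNs survive untouched.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({
    "c1": [1.0, -2.0, 5.0],
    "c2": [-3.0, 4.0, np.nan],
    "c3": [7.0, np.nan, -1.0],
    "c4": [2.0, 9.0, 6.0],
})

dummy = abs(max(d.max())) + 1       # 10.0 here: larger than every real value
d = d.fillna(dummy)                 # protect the original NaNs
d[d < 0] = None                     # negatives become NaN so backfill can replace them
d = d.transpose().bfill().transpose()
d[d == dummy] = None                # restore the original NaNs

print(d)
```

Row 0's -3.0 becomes 7.0 (the next positive value), row 1's -2.0 becomes 4.0, and row 1's original NaN in c3 stays NaN.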
In Python, I am trying to check, by ID and column by column (i.e. from one day to the next), that the values, if they are not all equal to zero, are correctly incremented by one, or, if at some point a value goes back to 0, that the next day it is either still zero or incremented by one.
I have a dataframe with multiple columns; the first column is named "ID". All the other columns contain integers, and their names represent consecutive days, like this:
I need to check, for each ID:
if all the columns (i.e. all of the days) are equal to 0, then create a new column named "CHECK" equal to 0, meaning there is no error;
if not, then look column after column and check whether the value in the next column is greater than the previous column's value (i.e. the day before) and incremented by exactly 1 (e.g. from 14 to 15, not from 14 to 16); if so, "CHECK" equals 0;
if these conditions aren't satisfied, it means the next column is either equal to the previous column or lower (but not equal to zero); in both cases it is an error and "CHECK" equals 1;
but if the next column's value is lower and equal to 0, then the value after it must be either still 0 or incremented by 1. Every time the count comes back to zero, the following columns must either stay at zero or increase by 1.
If I explained everything correctly, then in this example the first two IDs are correct and their "CHECK" variable must equal 0, but the next ID should have a "CHECK" value of 1.
I hope this is not confusing. Thanks.
I tried this, but I would like to refer to the column by its index/position rather than its name. The code is not finished.
df['check'] = np.where((df['20110531']<=df['20110530']) & ([df.columns!="ID"] != 0),1,0)
You could write a simple function to go along each row and check for your condition. In the example code below, I first set the index to ID.
df = df.set_index('ID')
def func(r):
    start = r.iloc[0]                  # first day's value
    for j, i in enumerate(r):
        if j == 0:
            continue
        if i == 0 or i == start + 1:   # reset to zero, or incremented by exactly 1
            start = i
        else:
            return 1                   # error found
    return 0                           # whole row is consistent
df['check'] = df.apply(func, axis=1)
If you want to keep the original index, then don't set it and instead use df['check'] = df.iloc[:, 1:].apply(func, axis=1).
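A quick sanity check on a made-up frame (column names are arbitrary, mirroring the dated columns in the question; the function is repeated so the snippet is self-contained): the first two IDs count correctly, the third jumps from 14 to 16 and gets flagged.

```python
import pandas as pd

def func(r):
    start = r.iloc[0]
    for j, i in enumerate(r):
        if j == 0:
            continue
        if i == 0 or i == start + 1:   # reset to zero, or +1 step
            start = i
        else:
            return 1
    return 0

df = pd.DataFrame({
    'ID': ['A', 'B', 'C'],
    '20110530': [0, 14, 14],
    '20110531': [0, 15, 16],   # ID 'C' jumps from 14 to 16: error
    '20110601': [0, 16, 17],
})

# keep the original index: skip the ID column instead of setting it as index
df['check'] = df.iloc[:, 1:].apply(func, axis=1)
print(df[['ID', 'check']])
```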
I am trying to assign an object (it could be a list, tuple, or string) to a specific cell in a dataframe, but it does not work. I am filtering first and then trying to assign the value.
I am using df.loc[df['name']=='aagicampus'].reset_index(drop=True).at[0,'words']='test'
The expected result is something like
It works if I create a copy of the dataframe, but I must keep the original dataframe to iterate later over a list and perform this procedure many times.
Thanks for your help.
You can do it by first getting the indices of the row(s) that you want to change, and then setting cells at one of those locations to the desired value.
This code gets the locations of the rows that satisfy your condition of df['name'] == 'aagicampus':
locations = df.index[df['name'] == 'aagicampus']
then you just use .loc with locations[0] to change the first row that satisfies the condition. Here it is all together:
df = pd.DataFrame({'name':['something','aagicampus','something'], 'words':['unchanged', 'unchanged', 'unchanged'] })
locations = df.index[df['name'] == 'aagicampus']
df.loc[locations[0], 'words'] = 'CHANGED'  # single .loc call avoids chained-assignment warnings
df.head()
this will return a table:
name words
0 something unchanged
1 aagicampus CHANGED
2 something unchanged
I am trying to populate the column motooutstandingbalance by subtracting the previous row's actualmotordeductionfortheweek from the previous row's motooutstandingbalance. I am using the pandas shift command but am currently not getting the desired output, which should be a consistent week-by-week reduction in motooutstandingbalance.
Final result should look like this
Here is my code
x['motooutstandingbalance']=np.where(x.salesrepid == x.shift(1).salesrepid, x.shift(1).motooutstandingbalance - x.shift(1).actualmotordeductionfortheweek, x.motooutstandingbalance)
Any ideas on how to achieve this?
This works:
start_value = 468300.0
deductions = -df['actualmotordeductionfortheweek'][::-1]   # negate and reverse
with_start = pd.concat([deductions, pd.Series([start_value], index=[-1])])
df['motooutstandingbalance'] = with_start[::-1].cumsum().reset_index(drop=True)
(Series.append was removed in pandas 2.0, so pd.concat joins the start value on instead.)
Basically, what I'm doing is:
Taking the actualmotordeductionfortheweek column, negating it (all the values become negative), and reversing it
Adding the start value (which is positive, as opposed to all the other values, which are negative) at index -1 (which here sorts before 0, not at the very end as is usual in Python)
Reversing it back, so that the new -1 entry goes to the very beginning
Using cumsum() to add up all the values of the column. This actually works out to subtracting each deduction from the start value, because the first value is positive and the rest are negative (x + (-y) = x - y)
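On a small made-up frame (a start value of 100 and three weekly deductions of 10, 20, 30), the steps above produce the running balance 100, 90, 70:

```python
import pandas as pd

df = pd.DataFrame({'salesrepid': [1, 1, 1],
                   'actualmotordeductionfortheweek': [10.0, 20.0, 30.0]})

start_value = 100.0
deductions = -df['actualmotordeductionfortheweek'][::-1]   # negate and reverse
with_start = pd.concat([deductions, pd.Series([start_value], index=[-1])])
# reverse back so the start value is first, then cumulatively subtract
df['motooutstandingbalance'] = with_start[::-1].cumsum().reset_index(drop=True)
print(df['motooutstandingbalance'].tolist())
```

The extra trailing value of the cumulative sum (here 40) is silently dropped when the series is aligned back to the frame's shorter index.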
I am pretty new to pandas and trying to learn it. So, any advice would be appreciated :)
This is just a small part of my whole dataframe DF2:
   Chromosome_Name Sequence_Source Sequence_Feature   Start     End Strand            Gene_ID        Gene_Name
0                1  ensembl_havana             gene   14363   34806      -  "ENSG00000227232"         "WASH7P"
1                1          havana             gene   89295  138566      -  "ENSG00000238009"   "RP11-34P13.7"
2                1          havana             gene  141474  178862      -  "ENSG00000241860"  "RP11-34P13.13"
3                1          havana             gene  227615  272253      -  "ENSG00000228463"     "AP006222.2"
4                1  ensembl_havana             gene  312720  453948      +  "ENSG00000237094"  "RP4-669L17.10"
These are my conditions:
Condition 1: Reference row's "Start" value <= Other row's "End" value.
Condition 2: Reference row's "End" value >= Other row's "Start" value.
This is what I have done so far:
chromosome_list = ["1","2","3","4","5","6","7","8","9","10","11","12",
                   "13","14","15","16","17","18","19","20","21","22","X","Y"]
dataFrame = DF2.groupby(["Chromosome_Name"])
for chromosome in chromosome_list:
    CHR = dataFrame.get_group(chromosome)
    for i in range(0, len(CHR)-1):
        for j in range(i+1, len(CHR)):
            Overlap_index = DF2[(DF2.loc[i, ["Chromosome_Name"] == chromosome]) & (DF2.loc[i, ["Start"]] <= DF2.loc[j, ["End"]]) & (DF2.loc[i, ["End"]] >= DF2.loc[j, ["Start"]])].index
            DF2 = DF2.drop(Overlap_index)
The chromosome_list is all the unique values of column "Chromosome_Name".
Mainly, I want to check for each row whether the "Start" and "End" column values satisfy the conditions above. I believe I need to compare a single (reference) row against the relevant rows of the data frame. However, to achieve this I need to take the value of the first column, "Chromosome_Name", into account.
More specifically, every row in DF2 should be checked according to the conditions stated above, but, for example, a row with Chromosome_Name = 5 shouldn't be checked against a row with Chromosome_Name = 12. Therefore I first thought I should split the dataframe with pd.groupby() according to Chromosome_Name and then, using those sub-dataframes' indexes, drop the given rows from DF2. However, it did not work :)
P.S. After DF2 is split into sub-dataframes (one per unique Chromosome_Name), each sub-dataframe has a different size; e.g. there are 641 rows for Chromosome_Name = X but 19342 rows for Chromosome_Name = 1.
If you know how to correct my code or provide me another solution, I would be glad.
Thanks in advance.
I am new to pandas too, so I do not want to give you wrong insights and advice, but have you ever thought of converting the Start and End columns to lists? That way you can use plain if statements if you are not comfortable with pandas and your task is urgent. However, I am aware that converting a dataframe into lists is somewhat opposite to the point of using pandas.
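One hedged sketch of the group-wise idea from the question, staying in pandas: group by Chromosome_Name so rows are only compared within the same chromosome, apply the two overlap conditions pairwise, and drop the later row of each overlapping pair. The toy DF2 below and the keep-the-first policy are assumptions for illustration, not the asker's exact intent.

```python
import pandas as pd

# toy stand-in for DF2: rows 0 and 1 overlap on chromosome "1"
DF2 = pd.DataFrame({
    "Chromosome_Name": ["1", "1", "1", "2"],
    "Start": [100, 150, 500, 100],
    "End":   [200, 250, 600, 200],
})

drop_idx = []
for _, group in DF2.groupby("Chromosome_Name"):
    rows = list(group.itertuples())
    for i in range(len(rows) - 1):
        for j in range(i + 1, len(rows)):
            # two intervals overlap when each one starts before the other ends
            if rows[i].Start <= rows[j].End and rows[i].End >= rows[j].Start:
                drop_idx.append(rows[j].Index)   # drop the later row of the pair

DF2 = DF2.drop(sorted(set(drop_idx)))
print(DF2)
```

Collecting the labels first and dropping once at the end avoids mutating DF2 while iterating over it, which is one likely reason the original loop misbehaved.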
I am trying to compare all rows within a group to check whether a condition is fulfilled. If the condition is not fulfilled, I set the new column to True, otherwise False. The issue I am having is finding a neat way to compare all rows within each group. I have something that works, but it will not scale to groups with many rows.
for i in range(8):
    n = -i-1
    cond = (((df['age']-df['age'].shift(n))*(df['weight']-df['weight'].shift(n)))<0)&(df['ref']==df['ref'].shift(n))&(df['age']<7)&(df['age'].shift(n)<7)
    df['x'+str(i)] = cond.groupby(df['ref']).transform('any')   # str(i): column names must be strings
df.loc[:,'WFA'] = 0
df.loc[(df['x0']==False)&(df['x1']==False)&(df['x2']==False)&(df['x3']==False)&(df['x4']==False)&(df['x5']==False)&(df['x6']==False)&(df['x7']==False),'WFA'] = 1
To iterate through each row, I have created a loop that compares adjacent rows (using shift). Each loop represents the next adjacent row. In effect, I am able to compare all rows within a group where the number of rows within a group is 8 or less. As you can imagine, it becomes pretty cumbersome as the number of rows grows large.
Instead of creating of column for each period in shift, I want to see if any row matches the condition with any other row. Then set the new column 'WFA' True or False.
If anyone is interested, I posted the answer to my own question here (although it is very slow):
df.loc[:,'WFA'] = 0
for ref, gref in df.groupby('ref'):
    count = 0
    for r_idx, row in gref.iterrows():
        cond = ((((row['age']-gref.loc[gref['age']<7, 'age'])*(row['weight']-gref.loc[gref['age']<7, 'weight']))<0).any())&(row['age']<7)
        if cond == False:
            count += 1
    if count == len(gref):
        df.loc[df['ref']==ref, 'WFA'] = 1
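A quick run of that loop on a made-up two-group frame (column names from the question, data invented): in group 'a', weight falls while age rises, so a row satisfies the opposite-direction condition and WFA stays 0; in group 'b', age and weight move together, no row matches, and WFA becomes 1.

```python
import pandas as pd

df = pd.DataFrame({
    'ref':    ['a', 'a', 'b', 'b'],
    'age':    [2, 3, 2, 3],
    'weight': [10, 8, 10, 12],   # group 'a': weight falls as age rises
})

df.loc[:, 'WFA'] = 0
for ref, gref in df.groupby('ref'):
    count = 0
    for r_idx, row in gref.iterrows():
        # any under-7 row in the group where age and weight move in opposite directions?
        cond = ((((row['age'] - gref.loc[gref['age'] < 7, 'age'])
                  * (row['weight'] - gref.loc[gref['age'] < 7, 'weight'])) < 0).any()) & (row['age'] < 7)
        if cond == False:
            count += 1
    if count == len(gref):      # no row in the group matched: flag it
        df.loc[df['ref'] == ref, 'WFA'] = 1

print(df['WFA'].tolist())
```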