Join Two DataFrames While Preserving Index - python

Consider data frame:
asdf = pd.DataFrame({'a1': [True, True, True, False, False],
                     'a2': [False, False, False, True, True],
                     'p':  ['p1', 'p2', 'p3', 'p2', 'p3']})
a1 a2 p
0 True False p1
1 True False p2
2 True False p3
3 False True p2
4 False True p3
I wish to create the data frame
a1 a2
p
p1 True False
p2 True True
p3 True True
Attempt at a solution: I have tried various permutations of "+", "pd.concat", and "pd.merge" (all attempts not shown for the sake of clarity):
print(asdf)
print('\n\n\n')
tmp1=asdf[asdf['a1']==True]
tmp1.set_index('p',inplace=True)
tmp2=asdf[asdf['a2']==True]
tmp2.set_index('p',inplace=True)
print(tmp1)
print(tmp2)
print()
print(tmp1+tmp2)
print()
print(pd.concat([tmp1,tmp2],axis=1,join='outer'))
print()
print(pd.merge(tmp1,tmp2,how='outer',on='p'))
However, this does not give the required result; there are multiple columns with the same key values. Any suggestions or resources to consult? Thank you!

You can use GroupBy.max() to aggregate the True/False results of the same p, as follows:
asdf.groupby('p').max()
Because max() treats True as greater than False, GroupBy.max() on the booleans is effectively a logical OR per group:
max(False, False) = False
max(True, False) = True
max(False, True) = True
max(True, True) = True
Result:
a1 a2
p
p1 True False
p2 True True
p3 True True

From the documentation of any:
Return whether any element is True, potentially over an axis.
Returns False unless there is at least one element within a series or
along a Dataframe axis that is True or equivalent (e.g. non-zero or non-empty).
>>> asdf.groupby('p').any()
a1 a2
p
p1 True False
p2 True True
p3 True True
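For completeness, a minimal runnable sketch (rebuilding the question's `asdf` frame) showing that `max()` and `any()` give the same result here, since both reduce each group's booleans to a logical OR:

```python
import pandas as pd

asdf = pd.DataFrame({'a1': [True, True, True, False, False],
                     'a2': [False, False, False, True, True],
                     'p':  ['p1', 'p2', 'p3', 'p2', 'p3']})

by_max = asdf.groupby('p').max()   # max over booleans acts as a logical OR
by_any = asdf.groupby('p').any()   # explicit logical OR per group
print(by_max)
```

`by_max.equals(by_any)` is True, so either spelling works; `any()` just states the intent more directly.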

Related

For loop from multiple columns with an if elif else statement - or Numpy - Python

I have a table looking kind of like this (the last column I added manually to explain the wanted outcome):
[image of table setup]
I want to create one column, breed, combining the columns p1-p3: if p1_dog is true, take the p1 value; if p1_dog is false but p2_dog is true, take the p2 value; otherwise check p3_dog and, if true, take the p3 value; else write 'not a dog'.
I have tried using a for loop and also played around with np.select. Perhaps np.where would work, but I don't know how:
breeds = []
for row in df_copy:
    if ['p1_dog']:
        breeds.append(df_copy['p1'])
    elif ['p2_dog']:
        breeds.append(df_copy['p2'])
    elif ['p3_dog']:
        breeds.append(df_copy['p3'])
    else:
        breeds.append('not a dog')
This only gives me the full column of values each time.
Using np.select, which I found here: pandas if else conditions on multiple columns. This gives me a list of booleans:
df_copy['breed'] = np.select([df_copy.p1_dog == True , df_copy.p2_dog == True], [df_copy.p1_dog, df_copy.p2_dog], default=df_copy.p3_dog)
Definitely not the prettiest and not nicely maintainable, but in this case it would work:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "p1": ["Chihuahua", "paper_towel", "basset", "Irish_terrier", "Pembroke"],
    "p1_dog": [True, False, True, True, False],
    "p2": ["malamute", "Labrador_retriever", "English_springer", "Irish_setter", "Cardigan"],
    "p2_dog": [True, True, True, True, False],
    "p3": ["kelpie", "spatula", "German_short-hared_pointer", "Chesapeake_Bay_Retriever", "Chihuahua"],
    "p3_dog": [True, False, True, True, True],
})
df['Wanted result'] = np.where(df["p1_dog"], df["p1"],
                               np.where(df["p2_dog"], df["p2"],
                                        np.where(df["p3_dog"], df["p3"], "not a dog")))
print(df.to_string())
Basically chaining np.where() where if the condition for column 1 is True, it outputs the p1 name, but if False, checks the next boolean column, and so on.
Output:
p1 p1_dog p2 p2_dog p3 p3_dog Wanted result
0 Chihuahua True malamute True kelpie True Chihuahua
1 paper_towel False Labrador_retriever True spatula False Labrador_retriever
2 basset True English_springer True German_short-hared_pointer True basset
3 Irish_terrier True Irish_setter True Chesapeake_Bay_Retriever True Irish_terrier
4 Pembroke False Cardigan False Chihuahua True Chihuahua
Also, please post your dataframe as raw text and not as an image. I had to copy the dataframe word for word, and some names were even cut off.
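Since the question already reaches for `np.select`, here is a hedged sketch of the same precedence expressed with it. The OP's attempt passed the boolean columns as the choices, but the choices need to be the name columns (a trimmed two-row frame is used for brevity):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"p1": ["Chihuahua", "paper_towel"],
                   "p1_dog": [True, False],
                   "p2": ["malamute", "Labrador_retriever"],
                   "p2_dog": [True, True],
                   "p3": ["kelpie", "spatula"],
                   "p3_dog": [True, False]})

# Conditions are evaluated in order and the first match wins,
# so the p1 -> p2 -> p3 precedence falls out naturally.
df["breed"] = np.select(
    [df["p1_dog"], df["p2_dog"], df["p3_dog"]],
    [df["p1"], df["p2"], df["p3"]],
    default="not a dog",
)
print(df["breed"].tolist())
```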

How to define a column of a dataframe using conditions based on its own previous values?

Say I have a data frame, df, as below, that has a 'Value' column which I'd like to apply some boolean analysis to.
date Value
10/11 0.798
11/11 1.235
12/11 0.890
13/11 0.756
14/11 0.501
...
Essentially, I'd like to create a new column that switches to TRUE when the value is greater than 1, and remains true unless the value drops below 0.75. For example, it would look like the below using df:
column
FALSE
TRUE
TRUE
TRUE
FALSE
I am struggling to find an appropriate way to reference the previous value of a column I am defining without running into some error. The logic I want to use is as below:
df['column'] = (df['value'] >= 1) | ((df['column'].shift(1) == True) & (df['value'] >= 0.75))
Is there a way that I can achieve this without over-complicating things?
A possible solution:
val1, val2 = 1, 0.75
out = (df.assign(
    new=df.Value.where(df.Value.gt(val1) | df.Value.lt(val2))
          .ffill()
          .gt(val1)))
print(out)
Output:
date Value new
0 10/11 0.798 False
1 11/11 1.235 True
2 12/11 0.890 True
3 13/11 0.756 True
4 14/11 0.501 False
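To see why the `where`/`ffill` trick implements the hysteresis, a small sketch on the sample column: values in the "dead zone" between 0.75 and 1 are masked to NaN, forward-filled from the last decisive value, and the filled series is then compared against 1:

```python
import pandas as pd

df = pd.DataFrame({'Value': [0.798, 1.235, 0.890, 0.756, 0.501]})

# Keep only decisive values (>1 switches on, <0.75 switches off);
# everything in between inherits the previous decisive value.
decisive = df['Value'].where(df['Value'].gt(1) | df['Value'].lt(0.75))
df['new'] = decisive.ffill().gt(1)
print(df['new'].tolist())
```

(A leading run of in-between values stays NaN after the `ffill` and compares as False, which matches the sample's first row.)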
Alternatively, calling a function with apply might help, with some "remembering" logic:
res = True

def CheckRow(row):
    global res
    if res == True:
        if row['value'] > 1.0:
            res = False  # next time check for < 0.75
            return True
        else:
            return False
    else:  # res == False
        if row['value'] < 0.75:
            res = True  # next time check for above 1.0
            return False
        else:
            return True

df['column'] = df.apply(lambda x: CheckRow(x), axis=1)

Get index number when condition is true in 3 columns

I've a question regarding some code in python. I'm trying to extract the index of the first row when the condition TRUE is satisfied in 3 different columns. This is the data I'm using:
0 1 2 3 4
0 TRUE TRUE TRUE 0.41871395 0.492517879
1 TRUE TRUE TRUE 0.409863582 0.519425031
2 TRUE TRUE TRUE 0.390077415 0.593127232
3 FALSE FALSE FALSE 0.372020631 0.704367199
4 FALSE FALSE FALSE 0.373546556 0.810876797
5 FALSE FALSE FALSE 0.398876919 0.86855678
6 FALSE FALSE FALSE 0.432142094 0.875576037
7 FALSE FALSE FALSE 0.454115421 0.863063448
8 FALSE TRUE FALSE 0.460676901 0.855739006
9 FALSE TRUE FALSE 0.458693197 0.855128636
10 FALSE FALSE FALSE 0.459201839 0.856451104
11 FALSE FALSE FALSE 0.458693197 0.855739006
12 FALSE FALSE FALSE 0.458082827 0.856349376
13 FALSE FALSE FALSE 0.456556902 0.856959746
14 TRUE TRUE TRUE 0.455946532 0.858180486
15 TRUE TRUE TRUE 0.455030976 0.858790857
16 TRUE TRUE TRUE 0.454725791 0.858485672
17 FALSE FALSE FALSE 0.454420606 0.857875301
18 FALSE FALSE FALSE 0.454725791 0.858383943
19 FALSE TRUE FALSE 0.453199866 0.856654561
20 FALSE FALSE FALSE 0.451979125 0.856349376
21 FALSE FALSE FALSE 0.45167394 0.856959746
22 FALSE FALSE FALSE 0.451775669 0.857570116
23 FALSE FALSE FALSE 0.45106357 0.857264931
24 TRUE TRUE TRUE 0.450758385 0.856654561
25 TRUE TRUE TRUE 0.4504532 0.856044191
26 TRUE TRUE TRUE 0.449232459 0.856349376
27 TRUE TRUE TRUE 0.448316904 0.855535549
and I need to get the index number only when there are 3 'True' conditions:
0
14
24
Thank you!
I guess everyone missed the "extract the index of the first row" part. One way would be to remove consecutive duplicates first and then obtain the indices where all three are True, so that you only get the first row of each streak:
df = df[['0', '1', '2']]
df = df[df.shift() != df].dropna().all(axis=1)
print(df[df].index.tolist())
OUTPUT:
[0, 14, 24]
I tried this on a demo dataframe and it seems to work for me.
df = pd.DataFrame(data={'A': [True, True, True, True, True, False, True, True],
                        'B': [True, True, False, True, True, False, True, True],
                        'C': [True, False, True, True, True, False, True, True]})
i = df[(df['A'] == True) & (df['B'] == True) & (df['C'] == True)].index.to_list()
i = [x for x in i if x - 1 not in i]
EDIT 2: I have a new answer in response to some clarifications.
You're looking for each row that has TRUE in all of columns 0, 1, and 2, BUT you'd like to ignore such rows that are not the first in a streak of them. The first part of my answer is still the same: I think you should create a mask that selects your TRUE triplet rows:
condition = df[[0, 1, 2]].all(axis='columns')
But now I present a possible way to filter out the rows you want to ignore. To be not-first in a streak of TRUE triplet rows means that the previous row also satisfies condition.
idx = df[condition].index
ignore = idx.isin(idx + 1)
result = idx[~ignore]
In other words, ignore rows where the index value is the successor of an index value satisfying condition.
Hope this helps!
Keeping my original answer for record keeping:
I think you'll end up with the most readable solution by breaking this out into two steps:
First, find out which rows have the value True for all of the columns you're interested in:
condition = df[[0, 1, 2]].all(axis='columns')
Then, the index values you're interested in are simply df[condition].index.
EDIT: if, as Benoit points out may be the case, TRUE and FALSE are strings, that's fine, you just need a minor tweak to the first step:
condition = (df[[0, 1, 2]] == 'TRUE').all(axis='columns')
If the TRUE and FALSE in your DataFrame are actually the boolean values True and False then,
# Look at the first 3 columns and return True if all are True, else False:
step1 = [all(q) for q in df[[0, 1, 2]].values]
id = []
cnt = 0
temp_cnt = 0
# This loop finds where the value is True and checks if the next 2 are also True;
# it then appends count-2 to a list named id (the -2 compensates for the index).
for q in step1:
    if q:
        cnt += 1
        if cnt == 3:
            id.append(temp_cnt - 2)
    else:
        cnt = 0
    temp_cnt += 1
# Printing "id" then returns the first index where AT LEAST 3 True values occur in sequence.
id
Out[108]: [0, 14, 24]
I think this could do the trick. As a general advice though, it always helps to name the columns in pandas.
Say that your pandas data frame is named data:
data[(data[0] == True) & (data[1] == True) & (data[2] == True)].index.values
or
list(data[(data[0] == True) & (data[1] == True) & (data[2] == True)].index.values)
Based on the answer here, something like this will provide a list of indices for the rows that meet all conditions:
df[(df[0]==True) & (df[1]==True) & (df[2]==True)].index.tolist()
The following will work regardless of the position of the 3 columns you wish to check for True values, and gives you back a list indicating which rows have 3 True values present:
Edit:
Now updated to better align with the OP's original request:
#df.iloc[:,:3] = df.iloc[:,:3].apply(lambda x: str(x) == "TRUE") # If necessary
s = (df == True).apply(sum, axis=1) == 3
s = s[s.shift() != s]
s.index[s].tolist()
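A compact runnable sketch of the shift-based idea shared by several of these answers, on a toy frame whose streaks start at the same indices as the question's data (assuming boolean dtype, not the strings Benoit mentions):

```python
import pandas as pd

# Column-wise identical flags are enough to demonstrate
# first-of-streak extraction: streaks begin at 0, 14, 24.
flags = [True] * 3 + [False] * 11 + [True] * 3 + [False] * 7 + [True] * 4
df = pd.DataFrame({0: flags, 1: flags, 2: flags})

cond = df[[0, 1, 2]].all(axis='columns')      # all three columns True?
first = cond & ~cond.shift(fill_value=False)  # True only at streak starts
print(first[first].index.tolist())
```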

Incrementing counter within for loop

I am having trouble properly incrementing a counter variable within a for loop.
I have a variable called flag, and I want to create a new variable called num based on the values in flag.
Input:
'flag'
FALSE
TRUE
TRUE
TRUE
FALSE
TRUE
FALSE
TRUE
TRUE
Desired output:
'flag' 'num'
FALSE 1
TRUE 1
TRUE 1
TRUE 1
FALSE 2
TRUE 2
FALSE 3
TRUE 3
TRUE 3
The goal is to start the num counter at 1 and keep that counter at 1 until another False instance occurs in the flag column. This would be continued until the end of the df.
My code:
def num(z):
    i = 0
    for x in range(0, len(z)):
        if z['flg'] == False:
            return i + 1
        else:
            return i

df['num'] = df.apply(num, axis=1)
I have tried incrementing the i counter in numerous places, but no luck.
df['num'] = (~df['flag']).cumsum()
This uses the fact that False == 0 and True == 1.
If df['flag'] is not a column of True and False values but a column of strings, I suggest you change that, or change the test to df['num'] = (df['flag'] == 'FALSE').cumsum()
Here is your code in corrected form, but the solution by Maarten Fabré is nicer:
def num(z):
    i = 0
    numcol = []
    for flag in z:
        if not flag:
            i += 1
        numcol.append(i)
    return numcol

df['num'] = num(df['flag'])
At first I wanted to do it with a generator, but I did not get it to work with the pandas dataframe.
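A runnable sketch of the cumsum answer on the question's sample column, assuming real booleans: each False opens a new group, and the cumulative count of Falses is exactly the group number.

```python
import pandas as pd

df = pd.DataFrame({'flag': [False, True, True, True, False,
                            True, False, True, True]})

# ~flag is True exactly at each False, so its running sum
# increments once per group and holds steady in between.
df['num'] = (~df['flag']).cumsum()
print(df['num'].tolist())
```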

pandas filtering consecutive rows

I got a DataFrame with a Matrix column like this:
11034-A
11034-B
1120-A
1121-A
112570-A
113-A
113.558
113.787-A
113.787-B
114-A
11691-A
11691-B
117-A RRS
12 X R
12-476-AT-A
12-476-AT-B
I'd like to keep only matrix values ending in A or B when they form a consecutive pair, so in the example above: 11034-A and 11034-B, 113.787-A and 113.787-B, 11691-A and 11691-B, 12-476-AT-A and 12-476-AT-B.
I wrote a function that compares two strings and returns True or False; the problem is I fail to see how to apply / applymap it to consecutive rows:
def isAB(stringA, stringB):
    if stringA.endswith('A') and stringB.endswith('B') and stringA[:-1] == stringB[:-1]:
        return True
    else:
        return False
I tried df['result'] = isAB(df['Matrix'].str, df['Matrix'].shift().str) to no avail.
I seem to lack something in the way I designed this.
Edit:
I think this works; looks like I overcomplicated it at first:
df['t'] = (df['Matrix'].str.endswith('A') & df['Matrix'].shift(-1).str.endswith('B')) | (df['Matrix'].str.endswith('B') & df['Matrix'].shift(1).str.endswith('A'))
df['p'] = (df['Matrix'].str[:-1] == df['Matrix'].shift(-1).str[:-1]) | (df['Matrix'].str[:-1] == df['Matrix'].shift(1).str[:-1])
df['e'] = df['p'] & df['t']
final = df[df['e']]
Here is how I would do it.
df['ShiftUp'] = df['matrix'].shift(-1)
df['ShiftDown'] = df['matrix'].shift()

def check_matrix(x):
    if pd.isnull(x.ShiftUp) == False and x.matrix[:-1] == x.ShiftUp[:-1]:
        return True
    elif pd.isnull(x.ShiftDown) == False and x.matrix[:-1] == x.ShiftDown[:-1]:
        return True
    else:
        return False

df['new'] = df.apply(check_matrix, axis=1)
df = df.drop(['ShiftUp', 'ShiftDown'], axis=1)
print(df)
prints
matrix new
0 11034-A True
1 11034-B True
2 1120-A False
3 1121-A False
4 112570-A False
5 113-A False
6 113.558 False
7 113.787-A True
8 113.787-B True
9 114-A False
10 11691-A True
11 11691-B True
12 117-A RRS False
13 12 X R False
14 12-476-AT-A True
15 12-476-AT-B True
Here's my solution, it requires a bit of work.
The strategy is the following: obtain a new column that has the same values as the current column but shifted one position.
Then, it's just a matter to check whether one column is A or B and the other one B or A.
Say your matrix colum is called "column_name".
Then:
myl = ['11034-A',
'11034-B',
'1120-A',
'1121-A',
'112570-A',
'113-A',
'113.558',
'113.787-A',
'113.787-B',
'114-A',
'11691-A',
'11691-B',
'117-A RRS',
'12 X R',
'12-476-AT-A',
'12-476-AT-B']
#toy data frame
mydf = pd.DataFrame.from_dict({'column_name':myl})
# get a new series which is the same as the original,
# but with "nothing" as the first entry
new_series = pd.Series(['nothing'] + mydf['column_name'][:-1].values.tolist())
# add it to the original dataframe
mydf['new_col'] = new_series
You then define a simple function:
def do_i_want_this_row(x, y):
    left_char = x[-1]
    right_char = y[-1]
    return ((left_char == 'A') & (right_char == 'B')) or ((left_char == 'B') & (right_char == 'A'))
and voila:
print(mydf[mydf.apply(lambda x: do_i_want_this_row(x.column_name, x.new_col), axis=1)])
column_name new_col
1 11034-B 11034-A
2 1120-A 11034-B
8 113.787-B 113.787-A
9 114-A 113.787-B
11 11691-B 11691-A
15 12-476-AT-B 12-476-AT-A
There is still the question of the last element, but I'm sure you can think of what to do with it if you decide to follow this strategy ;)
You can delete rows from a DataFrame using DataFrame.drop(labels, axis). To get a list of labels to delete, I would first get a list of pairs that match your criterion. With the labels from above in a list labels and your isAB function,
pairs = zip(labels[:-1], labels[1:])
# Note: filter(isAB, pairs) would pass each pair as a single tuple argument,
# so unpack the pair explicitly:
delete_pairs = [pair for pair in pairs if isAB(*pair)]
delete_labels = []
for a, b in delete_pairs:
    delete_labels.append(a)
    delete_labels.append(b)
Examine delete_labels to make sure you've put it together correctly,
print(delete_labels)
And finally, delete the rows. With the DataFrame in question as x,
x.drop(delete_labels) # or x.drop(delete_labels, axis) if appropriate
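Pulling the thread together, a vectorized sketch along the lines of the OP's own edit: a row is kept when its stripped prefix matches a neighbour's and the A/B suffixes line up (`na=False` handles the frame's edges, where the shifted values are NaN):

```python
import pandas as pd

matrix = ['11034-A', '11034-B', '1120-A', '1121-A', '112570-A', '113-A',
          '113.558', '113.787-A', '113.787-B', '114-A', '11691-A',
          '11691-B', '117-A RRS', '12 X R', '12-476-AT-A', '12-476-AT-B']
df = pd.DataFrame({'Matrix': matrix})

s = df['Matrix']
# An 'A' row is kept if the next row is its 'B' twin, and vice versa.
pair_down = (s.str.endswith('A')
             & s.shift(-1).str.endswith('B', na=False)
             & (s.str[:-1] == s.shift(-1).str[:-1]))
pair_up = (s.str.endswith('B')
           & s.shift(1).str.endswith('A', na=False)
           & (s.str[:-1] == s.shift(1).str[:-1]))
final = df[pair_down | pair_up]
print(final['Matrix'].tolist())
```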
