Given the following DataFrame:
import pandas as pd

d = pd.DataFrame({0: [10, 20, 30, 40], 1: [20, 45, 10, 35], 2: [34, 24, 54, 22],
                  '0 to 1': [1, 1, 1, 0], '0 to 2': [1, 0, 1, 1], '1 to 2': [1, 1, 1, 1]})
d = d[[0, 1, 2, '0 to 1', '0 to 2', '1 to 2']]
d
0 1 2 0 to 1 0 to 2 1 to 2
0 10 20 34 1 1 1
1 20 45 24 1 0 1
2 30 10 54 1 1 1
3 40 35 22 0 1 1
I'd like to produce 3 new columns, one for each of the 3 value columns on the left, using the following criteria:
Include the original value.
If the original value is greater than the value being compared, and if there is a 1 in the comparison column (the columns with 'to'), list the other column(s) separated by commas.
For example:
Column 1, row 0 has a value of 20, which is greater than its corresponding value in column 0 (10). The comparison column between columns 0 and 1 is '0 to 1', and its value in row 0 is 1. There is also a column comparing column 1 to column 2 ('1 to 2'), but the value for column 2, row 0 is 34; since 34 is greater than 20, the 1 in '1 to 2' is ignored.
So the final value would be '20 (0)'.
Here is the desired resulting data frame:
0 1 2 0 to 1 0 to 2 1 to 2 0 Final 1 Final 2 Final
0 10 20 34 1 1 1 10 20 (0) 34 (0,1)
1 20 45 24 1 0 1 20 45 (0,2) 24
2 30 10 54 1 1 1 30 (1) 10 54 (0,1)
3 40 35 22 0 1 1 40 (2) 35 (2) 22
Thanks in advance!
Note: Because my real data will have varying numbers of columns on the left (e.g. 0, 1, 2, 3, 4) and correspondingly more comparisons, I need an approach that finds all conditions that apply. So, for a particular value, find all cases where the comparison column value is 1 and the value is higher than the one being compared.
Update
To clarify:
'0 to 1' compares column 0 to column 1. If there is a significant difference between them, the value is 1, else 0. So for '0 Final', if 0 is larger than 1 and '0 to 1' is 1, there would be a (1) after the value to signify that 0 is significantly larger than 1 for that row.
Here's what I have so far:
d['0 Final'] = d[0].astype(str)
d['1 Final'] = d[1].astype(str)
d['2 Final'] = d[2].astype(str)
d.loc[((d[0] > d[1]) & (d['0 to 1'] == 1)) | ((d['0 to 2'] == 1) & (d[0] > d[2])), '0 Final'] = d['0 Final'] + ' '
d.loc[((d[1] > d[0]) & (d['0 to 1'] == 1)) | ((d['1 to 2'] == 1) & (d[1] > d[2])), '1 Final'] = d['1 Final'] + ' '
d.loc[((d[2] > d[0]) & (d['0 to 2'] == 1)) | ((d['1 to 2'] == 1) & (d[2] > d[1])), '2 Final'] = d['2 Final'] + ' '
d.loc[(d['0 to 1'] == 1) & (d[0] > d[1]), '0 Final'] = d['0 Final'] + '1'
d.loc[(d['0 to 2'] == 1) & (d[0] > d[2]), '0 Final'] = d['0 Final'] + '2'
d.loc[(d['0 to 1'] == 1) & (d[1] > d[0]), '1 Final'] = d['1 Final'] + '0'
d.loc[(d['1 to 2'] == 1) & (d[1] > d[2]), '1 Final'] = d['1 Final'] + '2'
d.loc[(d['0 to 2'] == 1) & (d[2] > d[0]), '2 Final'] = d['2 Final'] + '0'
d.loc[(d['1 to 2'] == 1) & (d[2] > d[1]), '2 Final'] = d['2 Final'] + '1'
d.loc[d['0 Final'].str.contains(' '), '0 Final'] = d[0].astype(str) + ' (' + d['0 Final'].str.split(' ').str[1] + ')'
d.loc[d['1 Final'].str.contains(' '), '1 Final'] = d[1].astype(str) + ' (' + d['1 Final'].str.split(' ').str[1] + ')'
d.loc[d['2 Final'].str.contains(' '), '2 Final'] = d[2].astype(str) + ' (' + d['2 Final'].str.split(' ').str[1] + ')'
0 1 2 0 to 1 0 to 2 1 to 2 0 Final 1 Final 2 Final
0 10 20 34 1 1 1 10 20 (0) 34 (01)
1 20 45 24 1 0 1 20 45 (02) 24
2 30 10 54 1 1 1 30 (1) 10 54 (01)
3 40 35 22 0 1 1 40 (2) 35 (2) 22
It has 2 shortcomings:
I cannot predict how many columns I will have to compare, so the first 3 .loc lines will need to account for this somehow, assuming this approach can scale and is even the best one.
I still need to figure out how to get a comma and space between each number in parentheses if there is more than 1.
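For reference, one possible generalized sketch might look like this (an assumption, not the attempt above: it presumes the value columns are the integer-labelled ones and the comparison columns follow the '<low> to <high>' naming):
value_cols = [c for c in d.columns if isinstance(c, int)]

for col in value_cols:
    others = [o for o in value_cols if o != col]
    # one boolean Series per "other" column: flag is 1 and this value is larger
    hits_per_other = [
        d[f'{min(col, o)} to {max(col, o)}'].eq(1) & (d[col] > d[o])
        for o in others
    ]
    # join the qualifying column labels per row (use ', ' for comma-and-space)
    d[f'{col} Final'] = [
        f'{v} ({",".join(str(o) for o, hit in zip(others, hits) if hit)})'
        if any(hits) else str(v)
        for v, *hits in zip(d[col], *hits_per_other)
    ]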
I have a question that extends from Pandas: conditional rolling count. I would like to create a new column in a dataframe that reflects the cumulative count of rows that meet several criteria.
Using the following example and code from Stack Overflow question 25119524:
import pandas as pd
l1 =["1", "1", "1", "2", "2", "2", "2", "2"]
l2 =[1, 2, 2, 2, 2, 2, 2, 3]
l3 =[45, 25, 28, 70, 95, 98, 120, 80]
cowmast = pd.DataFrame(list(zip(l1, l2, l3)))
cowmast.columns =['Cow', 'Lact', 'DIM']
def rolling_count(val):
    if val == rolling_count.previous:
        rolling_count.count += 1
    else:
        rolling_count.previous = val
        rolling_count.count = 1
    return rolling_count.count

rolling_count.count = 0        # static variable
rolling_count.previous = None  # static variable

cowmast['xmast'] = cowmast['Cow'].apply(rolling_count)  # new column in dataframe
cowmast
The output is xmast (the number of mastitis events) for each cow:
Cow Lact DIM xmast
0 1 1 45 1
1 1 2 25 2
2 1 2 28 3
3 2 2 70 1
4 2 2 95 2
5 2 2 98 3
6 2 2 120 4
7 2 3 80 5
What I would like to do is restart the count for each cow (cow) lactation (Lact) and only increment the count when the number of days (DIM) between rows is more than 7.
To incorporate more than one condition, so that the count resets for each cow's lactation (Lact), I used the following code.
def count_consecutive_items_n_cols(df, col_name_list, output_col):
    cum_sum_list = [
        (df[col_name] != df[col_name].shift(1)).cumsum().tolist()
        for col_name in col_name_list
    ]
    df[output_col] = df.groupby(
        ["_".join(map(str, x)) for x in zip(*cum_sum_list)]
    ).cumcount() + 1
    return df

count_consecutive_items_n_cols(cowmast, ['Cow', 'Lact'], 'Lxmast')
That produces the following output
Cow Lact DIM xmast Lxmast
0 1 1 45 1 1
1 1 2 25 2 1
2 1 2 28 3 2
3 2 2 70 1 1
4 2 2 95 2 2
5 2 2 98 3 3
6 2 2 120 4 4
7 2 3 80 5 1
I would appreciate insight as to how to add another condition in the cumulative count that takes into consideration the time between mastitis events (difference in DIM between rows for cows within the same Lact). If the difference in DIM between rows for the same cow and lactation is less than 7 then the count should not increment.
The output I am looking for is called "Adjusted" in the table below.
Cow Lact DIM xmast Lxmast Adjusted
0 1 1 45 1 1 1
1 1 2 25 2 1 1
2 1 2 28 3 2 1
3 2 2 70 1 1 1
4 2 2 95 2 2 2
5 2 2 98 3 3 2
6 2 2 120 4 4 3
7 2 3 80 5 1 1
In the example above for cow 1 lact 2 the count is not incremented when the DIM goes from 25 to 28, as the difference between the two events is less than 7 days. Same for cow 2 lact 2 when it goes from 95 to 98. For the larger increments, 70 to 95 and 98 to 120, the count is increased.
Thank you for your help
John
Actually, your code to set up xmast and Lxmast could be much simplified if you used the solution with the highest upvotes in the referenced question.
Renaming your dataframe cowmast to df, you can set up xmast as follows:
df['xmast'] = df.groupby((df['Cow'] != df['Cow'].shift(1)).cumsum()).cumcount()+1
Similarly, to set up Lxmast, you can use:
df['Lxmast'] = (df.groupby([(df['Cow'] != df['Cow'].shift(1)).cumsum(),
(df['Lact'] != df['Lact'].shift()).cumsum()])
.cumcount()+1
)
Data Input
l1 =["1", "1", "1", "2", "2", "2", "2", "2"]
l2 =[1, 2, 2, 2, 2, 2, 2, 3]
l3 =[45, 25, 28, 70, 95, 98, 120, 80]
cowmast = pd.DataFrame(list(zip(l1, l2, l3)))
cowmast.columns =['Cow', 'Lact', 'DIM']
df = cowmast
Output
print(df)
Cow Lact DIM xmast Lxmast
0 1 1 45 1 1
1 1 2 25 2 1
2 1 2 28 3 2
3 2 2 70 1 1
4 2 2 95 2 2
5 2 2 98 3 3
6 2 2 120 4 4
7 2 3 80 5 1
Now, continue with the last part of your requirement, quoted below:
What I would like to do is restart the count for each cow (cow)
lactation (Lact) and only increment the count when the number of days
(DIM) between rows is more than 7.
we can do it as follows:
To make the code more readable, let's define 2 grouping sequences for the code we have so far:
m_Cow = (df['Cow'] != df['Cow'].shift()).cumsum()
m_Lact = (df['Lact'] != df['Lact'].shift()).cumsum()
Then, we can rewrite the code that sets up Lxmast in a more readable format, as follows:
df['Lxmast'] = df.groupby([m_Cow, m_Lact]).cumcount()+1
Now, turn to the main work here. Let's create another new column Adjusted for it:
df['Adjusted'] = (df.groupby([m_Cow, m_Lact])
['DIM'].diff().abs().gt(7)
.groupby([m_Cow, m_Lact])
.cumsum()+1
)
Result:
print(df)
Cow Lact DIM xmast Lxmast Adjusted
0 1 1 45 1 1 1
1 1 2 25 2 1 1
2 1 2 28 3 2 1
3 2 2 70 1 1 1
4 2 2 95 2 2 2
5 2 2 98 3 3 2
6 2 2 120 4 4 3
7 2 3 80 5 1 1
Here, after df.groupby([m_Cow, m_Lact]), we take the column DIM and check each row's difference from the previous row with .diff(), take the absolute value with .abs(), then check whether it is > 7 with .gt(7), giving the code fragment ['DIM'].diff().abs().gt(7). We then group by the same keys again with .groupby([m_Cow, m_Lact]), since this third condition applies within the grouping of the first two. In the final step we use .cumsum() on the third condition, so that the count increments only when the third condition is true.
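To see what that chain produces on this sample (illustrative only), the intermediate boolean mask can be inspected on its own:
mask = df.groupby([m_Cow, m_Lact])['DIM'].diff().abs().gt(7)
print(mask.tolist())
# [False, False, False, False, True, False, True, False]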
Just in case you want to increment the count only when the DIM is increased by more than 7 (e.g. 70 to 78) and exclude the case of a decrease by more than 7 (e.g. 78 to 70), you can remove the .abs() part in the code above:
df['Adjusted'] = (df.groupby([m_Cow, m_Lact])
['DIM'].diff().gt(7)
.groupby([m_Cow, m_Lact])
.cumsum()+1
)
Edit (Possible simplification depending on your data sequence)
As your sample data has the main grouping keys Cow and Lact already in sorted order, there is room to simplify the code further.
Different from the sample data from the referenced question, where:
col count
0 B 1
1 B 2
2 A 1 # Value does not match previous row => reset counter to 1
3 A 2
4 A 3
5 B 1 # Value does not match previous row => reset counter to 1
Here, the B in the last row is separated from the other B's, and its count must reset to 1 rather than continue from the previous B's last count of 2 (to become 3). Hence, the grouping needs to compare the current row with the previous row to get the correct run boundaries. Otherwise, if we used .groupby() on the raw values, all the B's would be grouped together during processing and the count would not reset to 1 for the last entry.
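A small sketch of the difference, on data shaped like the referenced question's (hypothetical):
s = pd.Series(['B', 'B', 'A', 'A', 'A', 'B'])
# grouping by the raw values lumps all B's together: the last row counts 3
print((s.groupby(s).cumcount() + 1).tolist())                          # [1, 2, 1, 2, 3, 3]
# grouping by the shift-compare run id resets the counter: the last row counts 1
print((s.groupby((s != s.shift()).cumsum()).cumcount() + 1).tolist())  # [1, 2, 1, 2, 3, 1]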
If your data for the main grouping keys Cow and Lact is already naturally sorted during data construction, or has been sorted with an instruction such as:
df = df.sort_values(['Cow', 'Lact'])
Then we can simplify the code as follows (when the data is already sorted by [Cow, Lact]):
df['xmast'] = df.groupby('Cow').cumcount()+1
df['Lxmast'] = df.groupby(['Cow', 'Lact']).cumcount()+1
df['Adjusted'] = (df.groupby(['Cow', 'Lact'])
['DIM'].diff().abs().gt(7)
.groupby([df['Cow'], df['Lact']])
.cumsum()+1
)
This gives the same result and output values in the three columns xmast, Lxmast and Adjusted.
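As a quick sanity check (illustrative only), the two formulations can be compared directly:
# both groupings agree once the data is sorted by ['Cow', 'Lact']
pd.testing.assert_series_equal(
    df.groupby(['Cow', 'Lact']).cumcount() + 1,
    df.groupby([m_Cow, m_Lact]).cumcount() + 1,
)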
This is what my data looks like:
Day Price A Price B Price C
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 64503 43692 79982
6 86664 69990 53468
7 77924 62998 68911
8 66600 68830 94396
9 82664 89972 49614
10 59741 48904 49528
11 34030 98074 72993
12 74400 85547 37715
13 51031 50031 85345
14 74700 59932 73935
15 62290 98130 88818
I have a small Python script that outputs a sum for each column. I need to input an n value (the number of days), and the summing should run and output the values.
However, given for example n=5 (days), I want to output only the Price A/B/C rows starting from the next day (day 6). Hence, the row for Day 5 should be '0'.
How can I produce this logic in pandas?
My idea is to use the n input value to truncate the values in the rows up to that day. But how can I do this in code?
if dataframe['Day'] == n:
    dataframe['Price A'] == 0 & dataframe['Price B'] == 0 & dataframe['Price C'] == 0
You can filter rows by condition and set all columns except the first with iloc[mask, 1:]; to shift the condition to the next row, add Series.shift:
n = 5
df.iloc[(df['Day'].shift() <= n).values, 1:] = 0
print (df)
Day Price A Price B Price C
0 1 0 0 0
1 2 0 0 0
2 3 0 0 0
3 4 0 0 0
4 5 0 0 0
5 6 0 0 0
6 7 77924 62998 68911
7 8 66600 68830 94396
8 9 82664 89972 49614
9 10 59741 48904 49528
10 11 34030 98074 72993
11 12 74400 85547 37715
12 13 51031 50031 85345
13 14 74700 59932 73935
14 15 62290 98130 88818
Pseudo code:
Make sure to sort by day
Shift the price columns by n and fill in with 0
Sum accordingly
All of that can be done in one line as well (see the sketch below)
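A hedged sketch of those steps (assuming the frame is named df with the columns from the question, and n is the day cut-off):
cols = ['Price A', 'Price B', 'Price C']
df = df.sort_values('Day')                  # 1. make sure to sort by day
shifted = df[cols].shift(-n, fill_value=0)  # 2. shift by n, filling with 0
totals = shifted.sum()                      # 3. sum accordingly

# or on one line:
totals = df.sort_values('Day')[cols].shift(-n, fill_value=0).sum()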
It is simply
dataframe.iloc[:n+1] = 0
This sets the values of all columns for the first n days to 0
# Sample output
dataframe
a b
0 1 2
1 2 3
2 3 4
3 4 2
4 5 3
n = 1
dataframe.iloc[:n+1] = 0
dataframe
a b
0 0 0
1 0 0
2 3 4
3 4 2
4 5 3
This truncates the values for all the previous days. If you want to truncate only the nth day:
dataframe.iloc[n] = 0
I have a large time-series df (2.5 million rows) that contains 0 values in some rows, some of which are legitimate. However, if there are repeated consecutive zero values, I would like to remove them from my df.
Example:
Col. A contains [1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9]. I would like to remove the [0,0,0] and [0,0,0,0] runs from the middle and leave the remaining single 0s, making a new df: [1,2,3,0,4,5,1,2,3,0,8,8,9].
The length of the zero run that triggers deletion should be a settable parameter; in this case, runs longer than 2.
Is there a clever way to do this in pandas?
It looks like you want to remove a row if it is 0 and either the previous or the next row in the same column is 0. You can use shift to look at the previous and next values and compare them with the current value, as below:
mask = ((df.ColA.shift(-1) == 0) & (df.ColA == 0)) | ((df.ColA.shift(1) == 0) & (df.ColA == 0))
result_df = df[~mask]
print(result_df)
Result:
ColA
0 1
1 2
2 3
3 0
4 4
5 5
9 1
10 2
11 3
12 0
13 8
14 8
19 9
Update for more than 2 consecutive zeros
Following the example in the answer linked below, add a new column to track consecutive occurrences and then filter on it:
# https://stackoverflow.com/a/37934721/5916727
df['consecutive'] = df.ColA.groupby((df.ColA != df.ColA.shift()).cumsum()).transform('size')
df[~((df.consecutive > 2) & (df.ColA == 0))]
We need to build a new grouping key here, then use drop_duplicates:
df['New']=df.A.eq(0).astype(int).diff().ne(0).cumsum()
s=pd.concat([df.loc[df.A.ne(0),:],df.loc[df.A.eq(0),:].drop_duplicates(keep=False)]).sort_index()
s
Out[190]:
A New
0 1 1
1 2 1
2 3 1
3 0 2
4 4 3
5 5 3
9 1 5
10 2 5
11 3 5
12 0 6
13 8 7
14 8 7
19 9 9
Explanation:
# df.A.eq(0) finds the values equal to 0
# .diff().ne(0).cumsum() numbers the runs: the counter advances only when the
#   0/non-0 state changes, so consecutive equal values fall into the same group
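A hedged generalization of the same run-id idea, with the run length as the settable parameter from the question:
k = 2  # remove zero runs longer than k
run_id = df.A.eq(0).astype(int).diff().ne(0).cumsum()
run_len = df.groupby(run_id)['A'].transform('size')
s = df[~(df.A.eq(0) & (run_len > k))]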
Given a DataFrame with a hierarchical index containing three levels (experiment, trial, slot) and a second DataFrame with a hierarchical index containing two levels (experiment, trial), how do I drop all the rows in the first DataFrame whose (experiment, trial) are not contained in the second dataframe?
Example data:
from io import StringIO
import pandas as pd
df1_data = StringIO(u',experiment,trial,slot,token\n0,btn144a10_p_RDT,0,0,4.0\n1,btn144a10_p_RDT,0,1,14.0\n2,btn144a10_p_RDT,1,0,12.0\n3,btn144a10_p_RDT,1,1,14.0\n4,btn145a07_p_RDT,0,0,6.0\n5,btn145a07_p_RDT,0,1,19.0\n6,btn145a07_p_RDT,1,0,17.0\n7,btn145a07_p_RDT,1,1,13.0\n8,chn004b06_p_RDT,0,0,6.0\n9,chn004b06_p_RDT,0,1,8.0\n10,chn004b06_p_RDT,1,0,2.0\n11,chn004b06_p_RDT,1,1,5.0\n12,chn008a06_p_RDT,0,0,12.0\n13,chn008a06_p_RDT,0,1,14.0\n14,chn008a06_p_RDT,1,0,6.0\n15,chn008a06_p_RDT,1,1,4.0\n16,chn008b06_p_RDT,0,0,3.0\n17,chn008b06_p_RDT,0,1,13.0\n18,chn008b06_p_RDT,1,0,12.0\n19,chn008b06_p_RDT,1,1,19.0\n20,chn008c04_p_RDT,0,0,17.0\n21,chn008c04_p_RDT,0,1,2.0\n22,chn008c04_p_RDT,1,0,1.0\n23,chn008c04_p_RDT,1,1,6.0\n')
df1 = pd.read_csv(df1_data, index_col=0).set_index(['experiment', 'trial', 'slot'])
df2_data = StringIO(u',experiment,trial,target\n0,btn145a07_p_RDT,1,13\n1,chn004b06_p_RDT,1,9\n2,chn008a06_p_RDT,0,15\n3,chn008a06_p_RDT,1,15\n4,chn008b06_p_RDT,1,1\n5,chn008c04_p_RDT,1,12\n')
df2 = pd.read_csv(df2_data, index_col=0).set_index(['experiment', 'trial'])
The first dataframe looks like:
token
experiment trial slot
btn144a10_p_RDT 0 0 4
1 14
1 0 12
1 14
btn145a07_p_RDT 0 0 6
1 19
1 0 17
1 13
chn004b06_p_RDT 0 0 6
1 8
1 0 2
1 5
chn008a06_p_RDT 0 0 12
1 14
1 0 6
1 4
chn008b06_p_RDT 0 0 3
1 13
1 0 12
1 19
chn008c04_p_RDT 0 0 17
1 2
1 0 1
1 6
The second dataframe looks like:
target
experiment trial
btn145a07_p_RDT 1 13
chn004b06_p_RDT 1 9
chn008a06_p_RDT 0 15
1 15
chn008b06_p_RDT 1 1
chn008c04_p_RDT 1 12
The result I want:
token
experiment trial slot
btn145a07_p_RDT 1 0 17
1 13
chn004b06_p_RDT 1 0 2
1 5
chn008a06_p_RDT 0 0 12
1 14
1 0 6
1 4
chn008b06_p_RDT 1 0 12
1 19
chn008c04_p_RDT 1 0 1
1 6
One way to do it would be by using merge:
merged = pd.merge(
df2.reset_index(),
df1.reset_index(),
left_on=['experiment', 'trial'],
right_on=['experiment', 'trial'],
how='left')
You just need to reindex merged to whatever you like (I could not tell exactly from the question).
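For example, to match the desired frame shown in the question (a hedged follow-up, assuming the merge above):
# restore the 3-level index and keep only the 'token' column
result = merged.set_index(['experiment', 'trial', 'slot'])[['token']]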
What should work is
df1.loc[df2.index]
but multi indexing still has some problems. What does work is
df1.reset_index(2).loc[df2.index].set_index('slot', append=True)
which is a bit of a hack around this problem. Note that
df1.loc[df2.index[:1]]
gives garbage while
df1.loc[df2.index[0]]
gives what you would expect. So passing multiple values from an m-level index to an n-level index where n > m ≥ 2 doesn't work, though it should.
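On recent pandas versions, one hedged alternative is to build a boolean mask from the first two index levels instead:
# drop the 'slot' level, then test membership against df2's 2-level index
mask = df1.index.droplevel('slot').isin(df2.index)
result = df1[mask]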
I am new to using pandas but want to learn it better. I am currently facing a problem. I have a DataFrame looking like this:
0 1 2
0 chr2L 1 4
1 chr2L 9 12
2 chr2L 17 20
3 chr2L 23 23
4 chr2L 26 27
5 chr2L 30 40
6 chr2L 45 47
7 chr2L 52 53
8 chr2L 56 56
9 chr2L 61 62
10 chr2L 66 80
I want to get something like this:
0 1 2 3
0 chr2L 0 1 0
1 chr2L 1 2 1
2 chr2L 2 3 1
3 chr2L 3 4 1
4 chr2L 4 5 0
5 chr2L 5 6 0
6 chr2L 6 7 0
7 chr2L 7 8 0
8 chr2L 8 9 0
9 chr2L 9 10 1
10 chr2L 10 11 1
11 chr2L 11 12 1
12 chr2L 12 13 0
And so on...
So: split all the data into length-1 intervals, fill the missing intervals with zeros, and mark the present intervals with ones. (If there is an easy way to also mark "boundary" positions, i.e. the borders of the intervals in the initial data, as 0.5 at the same time, that might be helpful too.)
In the data there are multiple string values in column 0, and this should be done for each of them separately. They require different lengths of final data (the last value that should get a 0 or a 1 differs). I would appreciate your help with handling this in pandas.
This works for most of your first paragraph and some of the second. Left as an exercise: finish inserting insideness=0 rows (see end):
import pandas as pd

# dummied-up version of your data, but with column headers for readability:
df = pd.DataFrame({'n': ['a']*4 + ['b']*2,
                   'a': [1, 6, 8, 5, 1, 5],
                   'b': [4, 7, 10, 5, 3, 7]})

# splitting up a range, translated into df row terms:
def onebyone(dfrow):
    a = dfrow[1].a
    b = dfrow[1].b
    n = dfrow[1].n
    count = b - a
    if count >= 2:
        interior = [0.5] + [1]*(count - 2) + [0.5]
    elif count == 1:
        interior = [0.5]
    else:  # count == 0
        interior = []
    return {'n': [n]*count,
            'a': range(a, a + count),
            'b': range(a + 1, a + count + 1),
            'insideness': interior}
Edited to use pd.concat() to combine the intermediate results:
# Into a new dataframe:
intermediate = []
for label in set(df.n):
    for row in df[df.n == label].iterrows():
        intermediate.append(pd.DataFrame(onebyone(row)))

df_onebyone = pd.concat(intermediate)
df_onebyone.index = range(len(df_onebyone))
And finally a sketch of identifying the missing rows, which you can edit to match the above for-loop in adding rows to a final dataframe:
# for times in the overall range describing 'a'
for i in range(int(df_onebyone[df_onebyone.n == 'a'].a.min()),
               int(df_onebyone[df_onebyone.n == 'a'].a.max())):
    # if a time isn't in an existing 0.5-1-0.5 range:
    if i not in df_onebyone[df_onebyone.n == 'a'].a.values:
        # these are the values to fill in a 0-row
        print('%d, %d, 0' % (i, i + 1))
Or, if you know the a column will be sorted for each n, you could keep track of the last end-value handled by onebyone() and insert extra 0-rows to catch up to the next start value you pass to onebyone().
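A hedged sketch of that bookkeeping idea (assuming the a column is sorted within each n group, and reusing df and df_onebyone from above):
# build 0-rows for the gaps between consecutive intervals of each label
gap_rows = []
for label, grp in df.groupby('n'):
    prev_end = None
    for _, r in grp.sort_values('a').iterrows():
        if prev_end is not None and r.a > prev_end:
            for i in range(int(prev_end), int(r.a)):
                gap_rows.append({'n': label, 'a': i, 'b': i + 1, 'insideness': 0})
        prev_end = r.b if prev_end is None else max(prev_end, r.b)

result = (pd.concat([df_onebyone, pd.DataFrame(gap_rows)])
            .sort_values(['n', 'a'])
            .reset_index(drop=True))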