I have a pandas dataframe like this:
col
0 3
1 5
2 9
3 5
4 6
5 6
6 11
7 6
8 2
9 10
that could be created in Python with the code:
import pandas as pd
df = pd.DataFrame(
{
'col': [3, 5, 9, 5, 6, 6, 11, 6, 2, 10]
}
)
I want to find the rows that have a value greater than 8, and also there is at least one row before them that has a value less than 4.
So the output should be:
col
2 9
9 10
You can see that index 0 has a value equal to 3 (less than 4) and then index 2 has a value greater than 8. So we add index 2 to the output and continue checking the next rows. From that point on, indexes 0, 1, and 2 are no longer considered and the search starts over.
Index 6 has a value equal to 11, but none of the indexes 3, 4, 5 has a value less than 4, so we don't add index 6 to the output.
Index 8 has a value equal to 2 (less than 4) and index 9 has a value equal to 10 (greater than 8), so index 9 is added to the output.
It's my priority not to use any for-loops for the code.
Have you any idea about this?
Boolean indexing to the rescue:
# value > 8
m1 = df['col'].gt(8)
# get previous value <4
# check if any occurred previously
m2 = df['col'].shift().lt(4).groupby(m1[::-1].cumsum()).cummax()
df[m1&m2]
Output:
col
2 9
9 10
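To make the grouping trick in the answer above easier to follow, here is a minimal sketch that prints the intermediate masks. The names group, prev_lt4 and steps are only for illustration and are not part of the original answer; the data is the question's example.

```python
import pandas as pd

df = pd.DataFrame({'col': [3, 5, 9, 5, 6, 6, 11, 6, 2, 10]})

# rows whose value is greater than 8
m1 = df['col'].gt(8)

# reversed cumulative sum: every block of rows that ends at a "> 8" row
# shares one label (sort_index only restores the original row order)
group = m1[::-1].cumsum().sort_index()

# True where the row just before holds a value below 4
prev_lt4 = df['col'].shift().lt(4)

# within each block, stays True once a "< 4" row has been seen
m2 = prev_lt4.groupby(group).cummax()

steps = pd.DataFrame({'col': df['col'], 'm1': m1, 'group': group,
                      'prev_lt4': prev_lt4, 'm2': m2})
print(steps)

result = df[m1 & m2]
print(result)
```

Because each block ends at a row with a value above 8, the search effectively restarts after every such row, which is the reset behaviour asked for in the question.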
Check the code below, which uses cumsum and groupby:
import numpy as np
df['val'] = np.where(df['col']>8, True, False).cumsum()
df['val'] = np.where(df['col']>8, df['val']-1, df['val'])
df.assign(min_value = df.groupby('val')['col'].transform('min')).\
query('col>8 and min_value<4')[['col']]
I have a pandas dataframe like:
one  two  three
  1    3      4
  2    4      6
  1    3      4
 10    3      4
  2    4      5
  0    3      4
-10    3      4
Looking at the first column (labeled 'one'), I would like to find the rows where the value is bigger than, say, 9 (in this case it would be the fourth).
Ideally, I would also like to find the rows where the absolute value is bigger than, say, 9 (so that would be the fourth and seventh).
How can I do this? So far I have only converted the columns into Series, and even into Series of True and False values, but my problem is that my dataframe is huge and I cannot visually inspect it. I need to get the row numbers automatically.
you can apply abs and compare and filter by loc:
out = df.loc[df['one'].abs() > 9]
Output:
one two three
3 10 3 4
6 -10 3 4
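Since the question asks for the row numbers rather than the rows themselves, here is a small sketch that pulls the index labels out of the same boolean mask. The dataframe is rebuilt from the question's table; rows_gt and rows_abs are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 1, 10, 2, 0, -10],
                   'two': [3, 4, 3, 3, 4, 3, 3],
                   'three': [4, 6, 4, 4, 5, 4, 4]})

# index labels of the rows where 'one' is greater than 9
rows_gt = df.index[df['one'] > 9].tolist()

# index labels of the rows where the absolute value of 'one' is greater than 9
rows_abs = df.index[df['one'].abs() > 9].tolist()

print(rows_gt)   # [3]
print(rows_abs)  # [3, 6]
```

The labels are zero-based here, so 3 and 6 correspond to the fourth and seventh rows mentioned in the question.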
You could use abs():
df = pd.DataFrame({
'a': [1, 4, -6, 3, 7],
'b': [2, 3, 5, 3, 1],
'c': [4, 2, 7, 1, 3]
})
df[df.a.abs() > 5]
This returns the two rows with index 2 and 4.
# map each column name to the index labels where |value| >= 9
row = {}
for column in df:
    index = df.loc[df[column].abs() >= 9].index
    row[column] = list(index)
Consider the following data frame:
import pandas as pd
df = pd.DataFrame([[1, -2, 3, -5, 4 ,2 ,7 ,-8 ,2], [2, -4, 6, 7, -8, 9, 5, 3, 2], [2, 4, 6, 7, 8, 9, 5, 3, 2], [1, 2, 3, 4, 5, 6, 7, 8, 9]]).transpose()
df.columns = ["A", "B", "C", "D"]
A B C D
0 1 2 2 1
1 -2 -4 4 2
2 3 6 6 3
3 -5 7 7 4
4 4 -8 8 5
5 2 9 9 6
6 7 5 5 7
7 -8 3 3 8
8 2 2 2 9
I want to append "pos" to the column name if the column contains only positive values. What I tried is:
pos_idx = df.loc[:, (df>0).all()].columns
df[pos_idx].columns = df[pos_idx].columns + "pos"
However, this does not seem to work: it raises no error, but it does not change the column names. Interestingly, this code:
df.columns = df.columns + "anything"
does add the word "anything" to the column names. Could you please explain why this happens (it works in the general case but not on the selected columns), and how to do this correctly?
You are saving the new column names onto a copy of the dataframe. The statement below does not overwrite the column names of df, only those of the slice df[pos_idx]:
df[pos_idx].columns = df[pos_idx].columns + "pos"
Your second code example directly accesses df, which is why that one works.
How to make it work? --> Define the "full columns list" (separately). Afterwards write it into df directly.
How to define the "full list"? Add "pos" as a suffix to all cols which don't have any occurrence of values that are <=0.
my_col_list = [col+(count==0)*"_pos" for col, count in (df <= 0).sum().to_dict().items()]
df.columns = my_col_list
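A quick way to see the copy behaviour described above is to hold the slice in a variable and compare both objects afterwards. This is only a sketch on a reduced, made-up two-column frame; subset is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, -2, 3], 'D': [1, 2, 3]})
pos_idx = df.loc[:, (df > 0).all()].columns   # only 'D' is strictly positive

# df[pos_idx] builds a new DataFrame; renaming its columns touches only that object
subset = df[pos_idx]
subset.columns = subset.columns + 'pos'
print(subset.columns.tolist())  # ['Dpos']
print(df.columns.tolist())      # ['A', 'D']  (unchanged)

# renaming on df itself (or assigning to df.columns) is what persists
df = df.rename(columns={c: c + 'pos' for c in pos_idx})
print(df.columns.tolist())      # ['A', 'Dpos']
```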
First of all, use the .rename() method to change the name of a column.
To add 'pos' to columns with non-negative values (note that >= 0 also accepts columns containing zeros) you can use this:
renamed_columns = {i:i+' pos' for i in df.columns if df[i].min()>=0}
df.rename(columns=renamed_columns,inplace=True)
Here is my simplified example dataframe:
timestamp A B
1422404668 1 1
1422404670 2 2
1422404672 -3 3
1422404674 -4 4
1422404676 5 5
1422404678 -6 6
1422404680 -7 7
1422404680 8 8
Is there a way to group consecutive positive and negative values in column A and get the first value of each group in column A and the sum of the values of column B, as in the output below?
Expected output:
timestamp A B
1422404668 1 3
1422404672 -3 7
1422404676 5 5
1422404678 -6 13
1422404680 8 8
Data:
{'timestamp': [1422404668, 1422404670, 1422404672, 1422404674,
1422404676, 1422404678, 1422404680, 1422404680],
'A': [1, 2, -3, -4, 5, -6, -7, 8], 'B': [1, 2, 3, 4, 5, 6, 7, 8]}
IIUC, you could drop consecutively duplicate signed "A"s (so like, the row with 2 in column "A" is dropped because it has the same sign as 1, the immediate previous value in column "A"):
out = df[df['A'].ge(0).astype(int).diff()!=0]
It turns out you don't need to convert to int (thanks @Corralien):
out = df[df['A'].ge(0).diff()!=0]
Output:
timestamp A
0 1422404668 1
2 1422404672 -3
4 1422404676 5
5 1422404678 -6
7 1422404680 8
Edit:
Given OP's edit, we could use cumsum on the mask to create group numbers and groupby it and use agg to call different methods on different columns:
out = df.groupby(df['A'].ge(0).diff().ne(0).cumsum()).agg({'timestamp':'first', 'A':'first', 'B':'sum'}).reset_index(drop=True)
Output:
timestamp A B
0 1422404668 1 3
1 1422404672 -3 7
2 1422404676 5 5
3 1422404678 -6 13
4 1422404680 8 8
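For readers who want to see what the grouping key in the edit above looks like, here is a small sketch that splits the one-liner into steps. sign_change and group_id are illustrative names; the data is the one posted in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'timestamp': [1422404668, 1422404670, 1422404672, 1422404674,
                  1422404676, 1422404678, 1422404680, 1422404680],
    'A': [1, 2, -3, -4, 5, -6, -7, 8],
    'B': [1, 2, 3, 4, 5, 6, 7, 8],
})

# True on the first row and wherever the sign of A differs from the row before
sign_change = df['A'].ge(0).diff().ne(0)

# running count of sign changes: one id per run of same-signed values
group_id = sign_change.cumsum()
print(group_id.tolist())  # [1, 1, 2, 2, 3, 4, 4, 5]

out = (df.groupby(group_id)
         .agg({'timestamp': 'first', 'A': 'first', 'B': 'sum'})
         .reset_index(drop=True))
print(out)
```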
Something like this?
I made two frames, one with the negative values from column A and one with the positive values.
Then find the first occurrence for negative and positive and concat the frames into out.
df_positive = df[df['A'] > 0]
df_negative = df[df['A'] < 0]
df_positive = df_positive.groupby('A').first().reset_index()
df_negative = df_negative.groupby('A').first().reset_index()
out = pd.concat([df_positive,df_negative ])[['timestamp', 'A']]
index [0, 1, 2, 3, 4, 5]
part_1 [4, 5, 6, 4, 8, 4]
part_2 [11, 12, 10, 12, 14, 13]
new [6, 4, 8, 8, na, na]
I'm a beginner in python & pandas asking for support. In a simple dataframe, I want to create a new column that gives me the last row of a cumulative sum that satisfies the condition
df.part_1.cumsum() > df.part_2
So e.g. for the new column at index 0 I would get the value 6 as (4+5+6) > 11.
Thanks!
IIUC, here is a NumPy-based approach. The idea is to build an upper triangular matrix in which row i holds the values of part_1 from position i onward (and zeros before it). By taking the cumulative sum along each row and comparing it against part_2, we can use argmax to find the first index where the cumulative sum reaches the part_2 value of the corresponding row:
import numpy as np
a = df.to_numpy()
cs = np.triu(a[:,1]).cumsum(1)
ix = (cs >= a[:,2,None]).argmax(1)
# array([2, 3, 3, 4, 6, 7, 7, 0], dtype=int64)
df['first_ix'] = a[ix,1,None]
print(df)
index part_1 part_2 first_ix
0 0 4 11 6
1 1 5 12 4
2 2 6 10 4
3 3 4 12 8
4 4 8 14 6
5 5 4 13 8
6 6 6 11 8
7 7 8 10 4
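One thing to keep in mind with the argmax approach above: when the condition is never met for a row, argmax returns 0, so the first value of part_1 is picked up instead of a missing value (row 7 in the output above). Here is a minimal sketch on a made-up three-row example that shows the intermediate matrices and masks those rows; the arrays and the names hit, found and result are assumptions for illustration only.

```python
import numpy as np

part_1 = np.array([4, 5, 6])
part_2 = np.array([11, 12, 5])

# row i keeps part_1[i:], everything before position i is zeroed
upper = np.triu(part_1)
# [[4 5 6]
#  [0 5 6]
#  [0 0 6]]

# running sums that start at each row's own position
cs = upper.cumsum(axis=1)
# [[ 4  9 15]
#  [ 0  5 11]
#  [ 0  0  6]]

hit = cs >= part_2[:, None]   # where the running sum reaches part_2
ix = hit.argmax(axis=1)       # first hit per row (0 when there is none)
found = hit.any(axis=1)       # rows that actually have a hit

# use NaN for rows whose running sum never reaches part_2
result = np.where(found, part_1[ix], np.nan)
print(ix)      # [2 0 2]
print(result)  # [ 6. nan  6.]
```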
I am using Excel for this task now, but I was wondering if any of you know a way to find and insert missing sequence numbers in python.
Say I have a dataframe:
import pandas as pd
data = {'Sequence': [1, 2, 4, 6, 7, 9, 10],
'Value': ["x", "x", "x", "x", "x", "x", "x"]
}
df = pd.DataFrame (data, columns = ['Sequence','Value'])
And now I want to use some code here to find missing sequence numbers in the column 'Sequence', and leave blank spaces at the column 'Values' for the rows of missing sequence numbers. To get the following output:
print(df)
Sequence Value
0 1 x
1 2 x
2 3
3 4 x
4 5
5 6 x
6 7 x
7 8
8 9 x
9 10 x
Even better would be a solution in which you can also define the start and end of the sequence. For example when the sequence starts with 3 but you want it to start from 1 and end at 12. But a solution for only the first part will already help a lot. Thanks in advance!!
You can set_index and reindex using a range from the first to the last value of Sequence (these are the min and max when the column is sorted):
(df.set_index('Sequence')
.reindex(range(df.Sequence.iat[0],df.Sequence.iat[-1]+1), fill_value='')
.reset_index())
Sequence Value
0 1 x
1 2 x
2 3
3 4 x
4 5
5 6 x
6 7 x
7 8
8 9 x
9 10 x
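For the second part of the question (choosing the start and end yourself), the same reindex idea should work with an explicit range. A minimal sketch, assuming a sequence that starts at 3 and hypothetical bounds start=1 and end=12 taken from the question's example:

```python
import pandas as pd

df = pd.DataFrame({'Sequence': [3, 4, 6, 7, 9, 10],
                   'Value': ['x'] * 6})

start, end = 1, 12  # hypothetical bounds chosen by the user

out = (df.set_index('Sequence')
         .reindex(range(start, end + 1), fill_value='')
         .rename_axis('Sequence')   # keep the column name explicit
         .reset_index())
print(out)
```

Rows for 1, 2, 5, 8, 11 and 12 are added with an empty string in Value.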
Or do it by merging DataFrames:
seq = [1, 2, 4, 6, 7, 9, 10]
dfs0 = pd.DataFrame.from_dict({'Sequence': seq, 'Value': ['x']*len(seq)})
dfseq = (pd.DataFrame.from_dict({'Sequence': range( min(seq), max(seq)+1 )})
         .merge(dfs0, on='Sequence', how='outer').fillna(''))
print(dfseq)
Sequence Value
0 1 x
1 2 x
2 3
3 4 x
4 5
5 6 x
6 7 x
7 8
8 9 x
9 10 x
You can try this:
import numpy as np
import pandas as pd
Sequence = [1, 2, 4, 6, 7, 9, 10]
df = pd.DataFrame(np.arange(1,12), columns=["Sequence"])
df.loc[df.Sequence.isin(Sequence), 'Value'] = 'x'
df = df.fillna('')
First you create your DataFrame with the given range of values you want it to have for sequence.
Then you set 'Value' to 'x' for the rows where 'Sequence' is in your Sequence list. Finally you fill the missing values with ''.