My input:
first = pd.Series([0, 1680, 5000, 14999, 17000])
last = pd.Series([4999, 7501, 10000, 16777, 21387])
dd = pd.concat([first, last], axis=1)
I'm trying to check whether each value in the first column falls in the previous row's "range", i.e. from the previous row's first-column value to its second-column value. For example, the second value in the first column (1680) falls in the previous row's range 0 to 4999, and the third value in the first column (5000) falls in the previous row's range 1680 to 7501, but the other values (e.g. 14999, 17000) are not in the range of their previous rows.
My expected output is something like this:
[1680], [5000], i.e. show only the values that satisfy my condition.
I tried diff() with dd[0].diff().gt(dd[1]), and also reshape/shift, but without real success.
Use shift and between to compare a row with the previous one:
>>> dd[0].loc[dd[0].between(dd[0].shift(), dd[1].shift())]
1    1680
2    5000
Name: 0, dtype: int64
Details of shift:
>>> pd.concat([dd[0], dd.shift()], axis=1)
       0        0        1
0      0      NaN      NaN
1   1680      0.0   4999.0
2   5000   1680.0   7501.0
3  14999   5000.0  10000.0
4  17000  14999.0  16777.0
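Putting it together, a runnable sketch of the whole approach, using the variable names from the question:

```python
import pandas as pd

# Rebuild the frame from the question
first = pd.Series([0, 1680, 5000, 14999, 17000])
last = pd.Series([4999, 7501, 10000, 16777, 21387])
dd = pd.concat([first, last], axis=1)

# shift() moves the previous row's bounds onto the current row;
# between() then checks column 0 against that [prev col 0, prev col 1] range
mask = dd[0].between(dd[0].shift(), dd[1].shift())
result = dd[0].loc[mask]
print(result.tolist())  # [1680, 5000]
```

The first row can never match, since its shifted bounds are NaN and `between` is False for NaN bounds.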
data = [['BAL', 'BAL', 'NO', 'DAL'], ['DAL', 'DAL', 'TEN', 'SF']]
df = pd.DataFrame(data)
For each row, I want to count how many times the value in the first column occurs across that row.
In this example, the number of times "BAL" appears in the first row, "DAL" in the second row, etc.
Then assign that count to a new column df['Count'].
You could do something like this:
df.assign(count=df.eq(df.iloc[:,0], axis=0).sum(axis=1))
Create a Series with iloc from the first column of your DataFrame, compare values using pd.DataFrame.eq with axis=0, then sum along axis=1.
Output:
     0    1    2    3  count
0  BAL  BAL   NO  DAL      2
1  DAL  DAL  TEN   SF      2
We can compare the first column to all the remaining columns with DataFrame.eq then sum across the rows to add up the number of True values (matches):
df['count'] = df.iloc[:, 1:].eq(df.iloc[:, 0], axis=0).sum(axis=1)
df:
     0    1    2    3  count
0  BAL  BAL   NO  DAL      1
1  DAL  DAL  TEN   SF      1
Note: this output differs slightly from the accepted answer in that it does not include the column containing the reference value in the row count.
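Both variants can be compared side by side in a short sketch; the only difference is whether column 0 itself participates in the comparison:

```python
import pandas as pd

data = [['BAL', 'BAL', 'NO', 'DAL'], ['DAL', 'DAL', 'TEN', 'SF']]
df = pd.DataFrame(data)

# Compare every column to column 0: the reference column always matches itself
incl = df.eq(df.iloc[:, 0], axis=0).sum(axis=1)

# Exclude column 0 from the comparison, so only the other columns count
excl = df.iloc[:, 1:].eq(df.iloc[:, 0], axis=0).sum(axis=1)

print(incl.tolist())  # [2, 2]
print(excl.tolist())  # [1, 1]
```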
I would like to loop through a DataFrame column, and when a specific value is reached, save the value from another column in the same row (a column I created and called "index"). Here is an example df:
index value
0 a
2 b
3 c
9 a
23 d
I tried to code it like this:
for value in df["value"]:
    if value == "a":
        current_index =  # get the value of "index" in the current row
I cannot simply save all indices of rows where the value is "a" because the rest of my code wouldn't work then.
I think this should be pretty easy but somehow I cannot find the solution.
Thank you all for your support!
IIUC:
Try boolean masking with the loc accessor:
out = df.loc[df['value'].eq('a'), 'index'].tolist()
output of out:
[0, 9]
OR
If you want to create a column, then:
df['newcol'] = df['index'].where(df['value'].eq('a'))
output of df:
   index value  newcol
0      0     a     0.0
1      2     b     NaN
2      3     c     NaN
3      9     a     9.0
4     23     d     NaN
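If the value is really needed inside a loop, as in the question's pseudocode, a minimal sketch could zip the two columns together; the loop below is just one hypothetical way to read both cells per row:

```python
import pandas as pd

df = pd.DataFrame({'index': [0, 2, 3, 9, 23],
                   'value': ['a', 'b', 'c', 'a', 'd']})

# Vectorised version: all "index" values where value == 'a'
matches = df.loc[df['value'].eq('a'), 'index'].tolist()

# Loop version, mirroring the question's pseudocode
found = []
for current_index, value in zip(df['index'], df['value']):
    if value == 'a':
        found.append(current_index)

print(matches)  # [0, 9]
print(found)    # [0, 9]
```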
This is my pandas DataFrame
>>> df
grades
0 69.233627
1 70.130900
2 83.357011
3 88.206387
4 74.342212
Sorting it gives this:
df.sort_values(by=['grades'])
grades
0 69.233627
1 70.130900
4 74.342212
2 83.357011
3 88.206387
I'm trying to add a new column, difference, where each row's value is the result of subtracting the sorted column from the original one.
However, this code doesn't work:
df['difference'] = df - df.sort_values(by=['grades'])
giving me
grades
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
What am I missing?
This is expected: pandas aligns by index values by default, so before subtracting, the sorted Series is reordered back to the original df.index and the result is all zeros. To prevent this, convert the values to a NumPy array so the subtraction is purely positional:
df['difference'] = df['grades'] - df['grades'].sort_values().to_numpy()
If the original DataFrame has a default index, you can also reset the sorted Series to a RangeIndex:
df['difference'] = df['grades'] - df['grades'].sort_values().reset_index(drop=True)
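A short sketch showing both the aligned (all-zero) result and the positional fix:

```python
import pandas as pd

df = pd.DataFrame({'grades': [69.233627, 70.130900, 83.357011,
                              88.206387, 74.342212]})

# Direct subtraction aligns the sorted Series back onto df.index -> all zeros
aligned = df['grades'] - df['grades'].sort_values()

# Dropping the index (NumPy array) keeps the sorted, positional order
df['difference'] = df['grades'] - df['grades'].sort_values().to_numpy()
print(df['difference'].round(6).tolist())
```

Here `.to_numpy()` is the modern spelling of `.values`; `.reset_index(drop=True)` achieves the same thing by making the indices line up instead of removing them.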
I can't figure out how (DataFrame - Groupby) works.
Specifically, given the following dataframe:
df = pd.DataFrame([['usera',1,100],['usera',5,130],['userc',1,100],['userd',5,100]])
df.columns = ['id','date','sum']
id date sum
0 usera 1 100
1 usera 5 130
2 userc 1 100
3 userd 5 100
Running the code below returns:
df['shift'] = df['date']-df.groupby(['id'])['date'].shift(1)
      id  date  sum  shift
0  usera     1  100    NaN
1  usera     5  130    4.0
2  userc     1  100    NaN
3  userd     5  100    NaN
How did pandas know that I meant for it to match by the id column?
id doesn't even appear in df['date'].
Let us dissect the command df['shift'] = df['date']-df.groupby(['id'])['date'].shift(1).
df['shift'] appends a new column "shift" in the dataframe.
df['date'] returns Series using date column from the dataframe.
0 1
1 5
2 1
3 5
Name: date, dtype: int64
df.groupby(['id'])['date'].shift(1): groupby(['id']) creates a groupby object.
From that groupby object, we select the date column and shift it by one, so each row gets the previous value within its own group. This is also a Series.
df.groupby(['id'])['date'].shift(1)
0 NaN
1 1.0
2 NaN
3 NaN
Name: date, dtype: float64
The Series obtained in step 3 is subtracted element-wise from the Series obtained in step 2 (aligned on the shared row index). The result is assigned to the df['shift'] column.
df['date']-df.groupby(['id'])['date'].shift(1)
0 NaN
1 4.0
2 NaN
3 NaN
Name: date, dtype: float64
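The whole pipeline, assembled into one runnable sketch:

```python
import pandas as pd

df = pd.DataFrame([['usera', 1, 100], ['usera', 5, 130],
                   ['userc', 1, 100], ['userd', 5, 100]],
                  columns=['id', 'date', 'sum'])

# Previous date *within each id group*; the first row of every group gets NaN
prev_date = df.groupby(['id'])['date'].shift(1)

# The subtraction aligns on the shared row index, not on 'id'
df['shift'] = df['date'] - prev_date
print(df['shift'].tolist())  # only row 1 has a same-id predecessor
```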
I don't know exactly what you are trying to do, but the groupby() method is useful when you have several identical values in a column (like your usera) and you want to calculate, for example, the sum(), mean(), max(), etc. of all columns or of one specific column.
e.g. df.groupby(['id'])['sum'].sum() groups the usera rows, selects just the sum column, and adds it up over all usera rows, giving 230. Using .mean() instead would give 115. It does the same for every other unique id in your id column, so the example above outputs one value per unique id: three rows (usera, userc, userd).
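The aggregations mentioned above, as a quick sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame([['usera', 1, 100], ['usera', 5, 130],
                   ['userc', 1, 100], ['userd', 5, 100]],
                  columns=['id', 'date', 'sum'])

# One aggregated value per unique id
totals = df.groupby(['id'])['sum'].sum()
means = df.groupby(['id'])['sum'].mean()
print(totals.to_dict())  # {'usera': 230, 'userc': 100, 'userd': 100}
print(means['usera'])    # 115.0
```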
I have a Series and a DataFrame:
s = pd.Series([1,2,3,5])
df = pd.DataFrame()
When I add columns to df like this:
df.loc[:, "0-2"] = s.iloc[0:3]
df.loc[:, "1-3"] = s.iloc[1:4]
I get df
0-2 1-3
0 1 NaN
1 2 2.0
2 3 3.0
Why am I getting NaN? I tried creating a new Series with the correct indices, but adding it to df still produces NaN.
What I want is
0-2 1-3
0 1 2
1 2 3
2 3 5
Try either of the following lines.
df.loc[:, "1-3"] = s.iloc[1:4].values
# -OR-
df.loc[:, "1-3"] = s.iloc[1:4].reset_index(drop=True)
Your original code is trying, unsuccessfully, to match the index of the DataFrame df to the index of the subset Series s.iloc[1:4]. When it can't find index 0 in the Series, it places a NaN value in df at that location. You can get around this by keeping only the values, so there is no index to match on, or by resetting the index of the subset Series.
>>> s.iloc[1:4]
1 2
2 3
3 5
dtype: int64
Notice the index values; the original, unsubset Series is the following.
>>> s
0 1
1 2
2 3
3 5
dtype: int64
The index of the first row in df is 0. By dropping the indices with the .values call, you bypass the index matching that produces the NaN. By resetting the index in the second option, you make the indices match.
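Both behaviours in one runnable sketch (the first column is built via the constructor to keep the example short, and `.to_numpy()` stands in for `.values`):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 5])
df = pd.DataFrame({"0-2": s.iloc[0:3]})  # index 0, 1, 2

# Index alignment: s.iloc[1:4] carries labels 1..3, so label 0 gets NaN
df["aligned"] = s.iloc[1:4]

# Positional assignment: drop the index before assigning
df["1-3"] = s.iloc[1:4].to_numpy()

print(df["1-3"].tolist())  # [2, 3, 5]
```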