Python: use dataframe column value in iloc (or shift)

Although my previous question was answered here (Python dataframe new column with value based on value in other row), I still want to know how to use a column value in iloc (or shift, rolling, etc.).
I have a dataframe with two columns A and B, how do I use the value of column B in iloc? Or shift()?
d = {'A': [8, 2, 4, 5, 6, 4, 3, 5, 5, 3], 'B': [2, -1, 4, 5, 0, -3, 8, 2, 6, -1]}
df = pd.DataFrame(data=d)
Using iloc, I get this error:
df['C'] = df['A'] * df['A'].iloc[df['B']]
ValueError: cannot reindex from a duplicate axis
Using shift(), I get another one:
df['C'] = df['A'] * df['A'].shift(df['B'])
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Is what I want to do possible? If yes, how? If no, why not?

Use numpy indexing:
print (df['A'].to_numpy()[df['B'].to_numpy()])
[4 3 6 4 8 5 5 4 3 3]
df['C'] = df['A'] * df['A'].to_numpy()[df['B'].to_numpy()]
print (df)
   A  B   C
0  8  2  32
1  2 -1   6
2  4  4  24
3  5  5  20
4  6  0  48
5  4 -3  20
6  3  8  15
7  5  2  20
8  5  6  15
9  3 -1   9
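For context on the two errors: df['A'].iloc[df['B']] is valid on its own (iloc even accepts the negative positions), but the result keeps the index labels of the picked rows, and those labels contain duplicates, so multiplying it with df['A'] triggers an index alignment that fails. shift() expects a scalar number of periods, which is why passing a whole Series trips the ambiguous-truth-value check. A minimal sketch of the alignment problem (the variable names are mine):
import pandas as pd

d = {'A': [8, 2, 4, 5, 6, 4, 3, 5, 5, 3],
     'B': [2, -1, 4, 5, 0, -3, 8, 2, 6, -1]}
df = pd.DataFrame(data=d)

picked = df['A'].iloc[df['B']]  # works on its own
print(picked.index.tolist())    # [2, 9, 4, 5, 0, 7, 8, 2, 6, 9] - duplicates!

# df['A'] * picked would align on that duplicated index and raise
# "cannot reindex from a duplicate axis"; stripping the index avoids it:
df['C'] = df['A'] * picked.to_numpy()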

Numpy indexing is the fastest way, I agree, but you can use a list comprehension + iloc too:
d = {'A': [8, 2, 4, 5, 6, 4, 3, 5, 5, 3], 'B': [2, -1, 4, 5, 0, -3, 8, 2, 6, -1]}
df = pd.DataFrame(data=d)
df['C'] = df['A'] * [df['A'].iloc[i] for i in df['B']]
   A  B   C
0  8  2  32
1  2 -1   6
2  4  4  24
3  5  5  20
4  6  0  48
5  4 -3  20
6  3  8  15
7  5  2  20
8  5  6  15
9  3 -1   9
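This works where the vectorised iloc call failed because each i in the comprehension is a plain scalar: every lookup returns a single value, so no index alignment happens. iloc also takes negative scalar positions, counting from the end:
df['A'].iloc[-1]  # 3, the last element of A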

Related

Pandas dataframe with N columns

I need to use Python with Pandas to write a DataFrame with N columns. This is a simplified version of what I have:
Ind=[[1, 2, 3],[4, 5, 6],[7, 8, 9],[10, 11, 12]]
DAT = pd.DataFrame([Ind[0],Ind[1],Ind[2],Ind[3]], index=None).T
DAT.head()
Out:
   0  1  2   3
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
This is the result that I want, but my real Ind has 121 sets of points and I really don't want to write each one in the DataFrame's argument. Is there a way to write this easily? I tried using a for loop, but that didn't work out.
You can just pass the list directly:
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
df = pd.DataFrame(data, index=None).T
df.head()
Outputs:
   0  1  2   3
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
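Since the real Ind has 121 sub-lists, the same call scales without spelling each one out; a quick sketch with made-up stand-in data (the generated points are mine, purely for illustration):
import pandas as pd

# 121 made-up sub-lists of 3 points each, standing in for the real Ind
Ind = [[i, i + 1, i + 2] for i in range(0, 363, 3)]

DAT = pd.DataFrame(Ind).T  # one column per sub-list
print(DAT.shape)           # (3, 121)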

Dataframe groupby or filter

Here is my simplified example dataframe:
 timestamp  A  B
1422404668  1  1
1422404670  2  2
1422404672 -3  3
1422404674 -4  4
1422404676  5  5
1422404678 -6  6
1422404680 -7  7
1422404680  8  8
Is there a way to groupby/filter consecutive positive and negative values and get the first value of each group in column A and the sum of column B, as in the output below?
Expected output:
 timestamp  A   B
1422404668  1   3
1422404672 -3   7
1422404676  5   5
1422404678 -6  13
1422404680  8   8
Data:
{'timestamp': [1422404668, 1422404670, 1422404672, 1422404674,
1422404676, 1422404678, 1422404680, 1422404680],
'A': [1, 2, -3, -4, 5, -6, -7, 8], 'B': [1, 2, 3, 4, 5, 6, 7, 8]}
IIUC, you could drop rows whose "A" has the same sign as the immediately preceding "A" (so, for example, the row with 2 in column "A" is dropped because it has the same sign as 1, the previous value in column "A"):
out = df[df['A'].ge(0).astype(int).diff()!=0]
It turns out you don't need to convert to int (thanks @Corralien):
out = df[df['A'].ge(0).diff()!=0]
Output:
    timestamp  A
0  1422404668  1
2  1422404672 -3
4  1422404676  5
5  1422404678 -6
7  1422404680  8
Edit:
Given the OP's edit, we can use cumsum on the mask to create group numbers, group by them, and use agg to apply a different method to each column:
out = (df.groupby(df['A'].ge(0).diff().ne(0).cumsum())
         .agg({'timestamp': 'first', 'A': 'first', 'B': 'sum'})
         .reset_index(drop=True))
Output:
    timestamp  A   B
0  1422404668  1   3
1  1422404672 -3   7
2  1422404676  5   5
3  1422404678 -6  13
4  1422404680  8   8
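To see why the grouper works, it can help to print the intermediate steps; a small sketch (the helper names are mine):
import pandas as pd

df = pd.DataFrame({'timestamp': [1422404668, 1422404670, 1422404672, 1422404674,
                                 1422404676, 1422404678, 1422404680, 1422404680],
                   'A': [1, 2, -3, -4, 5, -6, -7, 8],
                   'B': [1, 2, 3, 4, 5, 6, 7, 8]})

sign = df['A'].ge(0)         # True for non-negative rows
new_run = sign.diff().ne(0)  # True where the sign flips (and on the first row)
group = new_run.cumsum()     # [1, 1, 2, 2, 3, 4, 4, 5] - one id per consecutive run
print(group.tolist())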
Something like this? I made two frames, one with the negative values from column A and one with the positive values, then found the first occurrence for each and concatenated the frames into out.
df_positive = df[df['A'] > 0]
df_negative = df[df['A'] < 0]
df_positive = df_positive.groupby('A').first().reset_index()
df_negative = df_negative.groupby('A').first().reset_index()
out = pd.concat([df_positive, df_negative])[['timestamp', 'A']]

Replace part of df column with values defined in Series/dictionary

I have a column in a DataFrame that often has repeat indexes. Some indexes have exceptions and need to be changed based on another Series I've made, while the rest of the indices are fine as is. The Series indices are unique.
Here are a couple of variables to illustrate:
df = pd.DataFrame(data={'hi':[1, 2, 3, 4, 5, 6, 7]}, index=[1, 1, 1, 2, 2, 3, 4])
Out[52]:
   hi
1   1
1   2
1   3
2   4
2   5
3   6
4   7
exceptions = pd.Series(data=[90, 95], index=[2, 4])
Out[36]:
2 90
4 95
I would like to set the df to ...
   hi
1   1
1   2
1   3
2  90
2  90
3   6
4  95
What's a clean way to do this? I'm a bit new to Pandas; my first thought was just to loop, but I don't think that's the proper way to solve this.
Assuming that the index of exceptions is guaranteed to be a subset of df's index, we can use loc and Series.index to assign the values:
df.loc[exceptions.index, 'hi'] = exceptions
We can use Index.intersection if there are extra values in exceptions that do not or should not align with df:
exceptions = pd.Series(data=[90, 95, 100], index=[2, 4, 5])
df.loc[exceptions.index.intersection(df.index, sort=False), 'hi'] = exceptions
df:
   hi
1   1
1   2
1   3
2  90
2  90
3   6
4  95
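The title also mentions a dictionary; if the exceptions come as a plain dict, one option (a sketch, not part of the original answer) is to wrap it in a Series first, so that label alignment still broadcasts across the duplicated indices:
import pandas as pd

df = pd.DataFrame(data={'hi': [1, 2, 3, 4, 5, 6, 7]}, index=[1, 1, 1, 2, 2, 3, 4])
exc = {2: 90, 4: 95}  # dictionary version of `exceptions`

s = pd.Series(exc)
df.loc[s.index, 'hi'] = s  # 90 is broadcast to both rows labelled 2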

In pandas dataframe - returning last value of cumulative sum that satisfies condition

index   [0, 1, 2, 3, 4, 5]
part_1  [4, 5, 6, 4, 8, 4]
part_2  [11, 12, 10, 12, 14, 13]
new     [6, 4, 8, 8, na, na]
I'm a beginner in Python & pandas asking for support. In a simple dataframe, I want to create a new column that gives me the value at the last row of a cumulative sum that satisfies the condition
df.part_1.cumsum() > df.part_2
So e.g. for the new column at index 0 I would get the value 6 as (4+5+6) > 11.
Thanks!
IIUC, here is a NumPy-based approach (note that the example below extends the sample data to eight rows). The idea is to build an upper triangular matrix with shifted versions of part_1 in each row. By taking the cumulative sum along each row and comparing against part_2, we can use argmax to find the first index at which a cumulative sequence reaches part_2 for the corresponding row:
import numpy as np

a = df.to_numpy()
cs = np.triu(a[:, 1]).cumsum(1)
ix = (cs >= a[:, 2, None]).argmax(1)
# array([2, 3, 3, 4, 6, 7, 7, 0], dtype=int64)
df['first_ix'] = a[ix, 1]
print(df)
   index  part_1  part_2  first_ix
0      0       4      11         6
1      1       5      12         4
2      2       6      10         4
3      3       4      12         8
4      4       8      14         6
5      5       4      13         8
6      6       6      11         8
7      7       8      10         4
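The np.triu call is the non-obvious part: applied to a 1-D array, it broadcasts the array into a square matrix and zeroes everything below the diagonal, so row i holds the values from position i onwards. A small sketch with the question's six part_1 values:
import numpy as np

part_1 = np.array([4, 5, 6, 4, 8, 4])
print(np.triu(part_1))
# [[4 5 6 4 8 4]
#  [0 5 6 4 8 4]
#  [0 0 6 4 8 4]
#  [0 0 0 4 8 4]
#  [0 0 0 0 8 4]
#  [0 0 0 0 0 4]]
print(np.triu(part_1).cumsum(1))  # row i = cumulative sums starting at index i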

Extract data from a dataframe

I have a list based upon which I want to retrieve data from a dataset.
Here is the list:
packed = [1, 5, 8, 2, 3, 3, 7, 3, 7, 7, 4, 6, 3]
and here is the dataset (shown only as a screenshot in the original post).
There are two items with multiple quantities, i.e. 3 and 7.
I want to extract the rows which are not in the packed list. In this case it's two times 7 (the other three 7s are already in the list).
How can I do that? I tried this, but it doesn't work:
new_df = data[~data["Pid"].isin(packed)].reset_index(drop=True)
Use GroupBy.cumcount with a helper DataFrame, merge with a left join and indicator=True, and last filter by boolean indexing:
packed = [1, 5, 8, 2, 3, 3, 7, 3, 7, 7, 4, 6, 3]
df1 = pd.DataFrame({'Pid':packed})
df1['g'] = df1.groupby('Pid').cumcount()
print (df1)
    Pid  g
0     1  0
1     5  0
2     8  0
3     2  0
4     3  0
5     3  1
6     7  0
7     3  2
8     7  1
9     7  2
10    4  0
11    6  0
12    3  3
data['g'] = data.groupby('Pid').cumcount()
new_df = data[data.merge(df1, indicator=True, how='left')['_merge'].eq('left_only')]
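The dataset itself only appears as a screenshot in the original post, so here is a hypothetical stand-in consistent with the description (four 3s and five 7s) to make the snippet runnable end to end:
import pandas as pd

packed = [1, 5, 8, 2, 3, 3, 7, 3, 7, 7, 4, 6, 3]
# hypothetical reconstruction of the screenshot dataset
data = pd.DataFrame({'Pid': [1, 5, 8, 2, 3, 3, 3, 3, 7, 7, 7, 7, 7, 4, 6]})

df1 = pd.DataFrame({'Pid': packed})
df1['g'] = df1.groupby('Pid').cumcount()
data['g'] = data.groupby('Pid').cumcount()

new_df = data[data.merge(df1, indicator=True, how='left')['_merge'].eq('left_only')]
print(new_df)  # the two surplus 7s (occurrence numbers 3 and 4)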
