Extract data from a DataFrame - Python

I have a list based on which I want to retrieve data from a dataset.
Here is the list:
packed = [1, 5, 8, 2, 3, 3, 7, 3, 7, 7, 4, 6, 3]
and here is the dataset.
Two items occur with multiple quantities, namely 3 and 7.
I want to extract the rows that are not in the packed list. In this case that is two occurrences of 7 (the other three are already in the list).
How can I do that? I tried this, but it doesn't work:
new_df = data[~data["Pid"].isin(packed)].reset_index(drop=True)

Use GroupBy.cumcount with a helper DataFrame, merge with a left join and indicator=True, and finally filter by boolean indexing:
import pandas as pd

packed = [1, 5, 8, 2, 3, 3, 7, 3, 7, 7, 4, 6, 3]
df1 = pd.DataFrame({'Pid': packed})
df1['g'] = df1.groupby('Pid').cumcount()
print(df1)
    Pid  g
0     1  0
1     5  0
2     8  0
3     2  0
4     3  0
5     3  1
6     7  0
7     3  2
8     7  1
9     7  2
10    4  0
11    6  0
12    3  3
data['g'] = data.groupby('Pid').cumcount()
new_df = data[data.merge(df1, indicator=True, how='left')['_merge'].eq('left_only')]
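The question's `data` dataframe is not reproduced above (it was posted as an image), so here is a self-contained sketch of the same technique with an assumed Pid table, where 3 appears four times and 7 five times; the two surplus 7s are the rows that should come out:

```python
import pandas as pd

# Hypothetical stand-in for the question's unshown `data`: one row per item.
data = pd.DataFrame({'Pid': [1, 5, 8, 2, 3, 3, 3, 3, 7, 7, 7, 7, 7, 4, 6]})

packed = [1, 5, 8, 2, 3, 3, 7, 3, 7, 7, 4, 6, 3]
df1 = pd.DataFrame({'Pid': packed})

# Number duplicate Pids 0, 1, 2, ... on both sides so each physical item
# gets a unique (Pid, g) key.
df1['g'] = df1.groupby('Pid').cumcount()
data['g'] = data.groupby('Pid').cumcount()

# Left join on (Pid, g); rows found only on the left are not in `packed`.
merged = data.merge(df1, indicator=True, how='left')
new_df = data[merged['_merge'].eq('left_only')].drop(columns='g')
```

With this data the result is the two extra rows with Pid 7.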

Related

Pandas dataframe with N columns

I need to use Python with Pandas to write a DataFrame with N columns. This is a simplified version of what I have:
Ind = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
DAT = pd.DataFrame([Ind[0], Ind[1], Ind[2], Ind[3]], index=None).T
DAT.head()
Out:
   0  1  2   3
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
This is the result that I want, but my real Ind has 121 sets of points and I really don't want to write each one in the DataFrame's argument. Is there a way to write this easily? I tried using a for loop, but that didn't work out.
You can just pass the list directly:
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
df = pd.DataFrame(data, index=None).T
df.head()
Output:
   0  1  2   3
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
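Since the real Ind has 121 sub-lists, it may help to see that the same one-liner scales without writing each sub-list out; a quick sketch with made-up data (not the OP's real points):

```python
import pandas as pd

# 121 illustrative sub-lists of 3 points each.
ind = [[i, i + 1, i + 2] for i in range(121)]

# Passing the list of lists directly makes each sub-list a row;
# .T turns the sub-lists into columns, as in the question.
df = pd.DataFrame(ind).T
```

The resulting frame has 3 rows and 121 columns, one column per original sub-list.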

Call pandas explode on one column, and divide the other columns accordingly

I have a dataframe like the one below
d = {"to_explode": [[1, 2, 3], [4, 5], [6, 7, 8, 9]], "numbers": [3, 2, 4]}
df = pd.DataFrame(data=d)
     to_explode  numbers
0     [1, 2, 3]        3
1        [4, 5]        2
2  [6, 7, 8, 9]        4
I want to call pd.explode on the list-like column, but I want to divide the data in the other column accordingly.
In this example, the values in the numbers column for the first row would be replaced with 1 - i.e. 3 / 3 (the corresponding number of items in the to_explode column).
How would I do this please?
You need to perform the computation (get the list length with str.len), then explode:
out = (df
       .assign(numbers=df['numbers'].div(df['to_explode'].str.len()))
       .explode('to_explode')
       )
Output:
  to_explode  numbers
0          1      1.0
0          2      1.0
0          3      1.0
1          4      1.0
1          5      1.0
2          6      1.0
2          7      1.0
2          8      1.0
2          9      1.0

Python use dataframe column value in iloc (or shift)

Although my previous question was answered here: Python dataframe new column with value based on value in other row, I still want to know how to use a column value in iloc (or shift, or rolling, etc.).
I have a dataframe with two columns A and B, how do I use the value of column B in iloc? Or shift()?
d = {'A': [8, 2, 4, 5, 6, 4, 3, 5, 5, 3], 'B': [2, -1, 4, 5, 0, -3, 8, 2, 6, -1]}
df = pd.DataFrame(data=d)
Using iloc I get this error:
df['C'] = df['A'] * df['A'].iloc[df['B']]
ValueError: cannot reindex from a duplicate axis
Using shift() I get another:
df['C'] = df['A'] * df['A'].shift(df['B'])
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Is it possible what I want to do? If yes, how? If no, why not?
Use numpy indexing:
print (df['A'].to_numpy()[df['B'].to_numpy()])
[4 3 6 4 8 5 5 4 3 3]
df['C'] = df['A'] * df['A'].to_numpy()[df['B'].to_numpy()]
print (df)
   A  B   C
0  8  2  32
1  2 -1   6
2  4  4  24
3  5  5  20
4  6  0  48
5  4 -3  20
6  3  8  15
7  5  2  20
8  5  6  15
9  3 -1   9
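One detail worth flagging (my note, not part of the original answer): B contains negative values, and NumPy integer indexing treats those as offsets from the end of the array, which is what the output above reflects; for example, B == -1 picks the last element of A:

```python
import pandas as pd

df = pd.DataFrame({'A': [8, 2, 4, 5, 6, 4, 3, 5, 5, 3],
                   'B': [2, -1, 4, 5, 0, -3, 8, 2, 6, -1]})

# NumPy fancy indexing: each value of B selects a position in A,
# with negative positions counting back from the end (B == -1 -> A's last
# element, 3), matching iloc's negative-index semantics.
picked = df['A'].to_numpy()[df['B'].to_numpy()]
df['C'] = df['A'] * picked
```

So row 1 becomes 2 * 3 = 6, and row 5 becomes 4 * A[-3] = 4 * 5 = 20.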
NumPy indexing is the fastest way, I agree, but you can use a list comprehension + iloc too:
d = {'A': [8, 2, 4, 5, 6, 4, 3, 5, 5, 3], 'B': [2, -1, 4, 5, 0, -3, 8, 2, 6, -1]}
df = pd.DataFrame(data=d)
df['C'] = df['A'] * [df['A'].iloc[i] for i in df['B']]
   A  B   C
0  8  2  32
1  2 -1   6
2  4  4  24
3  5  5  20
4  6  0  48
5  4 -3  20
6  3  8  15
7  5  2  20
8  5  6  15
9  3 -1   9

In pandas dataframe - returning last value of cumulative sum that satisfies condition

index   [0, 1, 2, 3, 4, 5]
part_1  [4, 5, 6, 4, 8, 4]
part_2  [11, 12, 10, 12, 14, 13]
new     [6, 4, 8, 8, na, na]
I'm a beginner in Python and pandas asking for support. In a simple dataframe, I want to create a new column that gives me the part_1 value of the row where the cumulative sum (started at the current row) first satisfies the condition
df.part_1.cumsum() > df.part_2
So e.g. for the new column at index 0 I would get the value 6, as (4+5+6) > 11.
Thanks!
IIUC, here is a NumPy-based approach. The idea is to build an upper-triangular matrix with shifted versions of the input array in each row. By taking the cumulative sum of each row and comparing it against the part_2 column, argmax finds the first index at which the cumulative sequence reaches the part_2 value for the corresponding row:
a = df.to_numpy()
cs = np.triu(a[:, 1]).cumsum(1)
ix = (cs >= a[:, 2, None]).argmax(1)
# array([2, 3, 3, 4, 6, 7, 7, 0], dtype=int64)
df['first_ix'] = a[ix, 1]
print(df)
   index  part_1  part_2  first_ix
0      0       4      11         6
1      1       5      12         4
2      2       6      10         4
3      3       4      12         8
4      4       8      14         6
5      5       4      13         8
6      6       6      11         8
7      7       8      10         4
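One caveat the answer above leaves open: the question expects na for rows where the running sum never satisfies the condition, but argmax returns 0 for an all-False row, silently picking index 0 (visible in the last row above). A sketch of my variant on the question's own 6-row data, using the question's strict > condition (the answer used >=, which treats exact ties differently) and masking the not-found rows with NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'part_1': [4, 5, 6, 4, 8, 4],
                   'part_2': [11, 12, 10, 12, 14, 13]})

a = df['part_1'].to_numpy()
# Row i of the upper-triangular matrix holds [0, ..., 0, a[i], ..., a[-1]],
# so its cumulative sum is the running sum of part_1 starting at row i.
cs = np.triu(a).cumsum(axis=1)
hit = cs > df['part_2'].to_numpy()[:, None]
ix = hit.argmax(axis=1)
# argmax is 0 when a row has no True at all; mask those rows with NaN.
df['new'] = np.where(hit.any(axis=1), a[ix].astype(float), np.nan)
```

Rows 4 and 5 correctly come out as NaN, since their running sums never exceed part_2.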

Creating a derived column using pandas operations

I'm trying to create a column containing a cumulative sum of the number of entries, tid, grouped by unique values of (rid, tid). The cumulative sum should increment by the number of entries in the grouping, as shown in the df3 dataframe below, rather than one at a time.
import pandas as pd
df1 = pd.DataFrame({
'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3]})
    rid  tid
0     1    1
1     1    2
2     1    2
3     2    1
4     2    1
5     2    3
6     3    1
7     3    4
8     4    5
9     5    1
10    5    1
11    5    1
12    5    3
Giving after the required operation:
df3 = pd.DataFrame({
    'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
    'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3],
    'groupentries': [1, 2, 2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 1],
    'cumulativeentries': [1, 2, 2, 3, 3, 1, 4, 1, 1, 7, 7, 7, 2]})
    rid  tid  groupentries  cumulativeentries
0     1    1             1                  1
1     1    2             2                  2
2     1    2             2                  2
3     2    1             2                  3
4     2    1             2                  3
5     2    3             1                  1
6     3    1             1                  4
7     3    4             1                  1
8     4    5             1                  1
9     5    1             3                  7
10    5    1             3                  7
11    5    1             3                  7
12    5    3             1                  2
The derived column that I'm after is the cumulativeentries column although I've only figured out how to generate the intermediate column groupentries using pandas:
df1.groupby(["rid", "tid"]).size()
Values in cumulativeentries are actually a kind of running count.
The task is to count occurrences of the current tid in "source area" of
tid column:
from the beginning of the DataFrame,
up to (including) the end of the current group.
To compute values of both required values for each group, I defined
the following function:
def fn(grp):
    lastRow = grp.iloc[-1]  # last row of the current group
    lastId = lastRow.name   # index of this row
    tids = df1.truncate(after=lastId).tid
    return [grp.index.size, tids[tids == lastRow.tid].size]
To get the "source area" mentioned above, I used the truncate function.
In my opinion this is a very intuitive solution, based on the notion of the
"source area".
The function returns a list containing both required values:
the size of the current group,
how many tids equal to the current tid are in the
truncated tid column.
To apply this function, run:
df2 = df1.groupby(['rid', 'tid']).apply(fn).apply(pd.Series)\
.rename(columns={0: 'groupentries', 1: 'cumulativeentries'})
Details:
apply(fn) generates a Series containing 2-element lists.
apply(pd.Series) converts it to a DataFrame (with default column names).
rename sets the target column names.
And the last thing to do is to join this table to df1:
df1.join(df2, on=['rid', 'tid'])
For the first column, use GroupBy.transform with DataFrameGroupBy.size. For the second, use a custom function that takes all values of the tid column up to the last index of the group, compares them with the group's last value, and counts the matches with sum:
f = lambda x: (df1['tid'].iloc[:x.index[-1]+1] == x.iat[-1]).sum()
df1['groupentries'] = df1.groupby(["rid", "tid"])['rid'].transform('size')
df1['cumulativeentries'] = df1.groupby(["rid", "tid"])['tid'].transform(f)
print (df1)
    rid  tid  groupentries  cumulativeentries
0     1    1             1                  1
1     1    2             2                  2
2     1    2             2                  2
3     2    1             2                  3
4     2    1             2                  3
5     2    3             1                  1
6     3    1             1                  4
7     3    4             1                  1
8     4    5             1                  1
9     5    1             3                  7
10    5    1             3                  7
11    5    1             3                  7
12    5    3             1                  2
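For readers who find the lambda above dense, here is an equivalent, commented sketch of the same idea (my naming; it assumes df1 keeps its default RangeIndex, so index labels equal positions):

```python
import pandas as pd

df1 = pd.DataFrame({
    'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
    'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3]})

def cumulative_entries(group):
    # Position just past the last row of this (rid, tid) group...
    stop = group.index[-1] + 1
    # ...then count how many rows from the top of the frame through the end
    # of the group carry this group's tid (group.iat[-1]).
    return (df1['tid'].iloc[:stop] == group.iat[-1]).sum()

# transform broadcasts each group's scalar result back to the group's rows.
df1['groupentries'] = df1.groupby(['rid', 'tid'])['rid'].transform('size')
df1['cumulativeentries'] = df1.groupby(['rid', 'tid'])['tid'].transform(cumulative_entries)
```

The result matches the df3 dataframe from the question.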
