I have a dataset containing a contributor's id and a contributor_message. I want to retrieve all samples with the same message, say, contributor_message == 'I support this proposal because...'.
I use data.loc[data.contributor_message == 'I support this proposal because...'].index, which gives the DataFrame indices of all rows with that message, say 1, 2, 50, 9350, 30678, ...
Then I tried data.iloc[[1, 2, 50]], and this gives the correct answer, i.e. the returned rows match those DataFrame indices.
However, when I use data.iloc[9350] or higher indices, I do NOT get the row with the corresponding DataFrame index. Say I got the row labeled 15047 this time.
Can anyone advise how to fix this problem?
This occurs when your index labels are not aligned with their integer locations.
Note that pd.DataFrame.loc slices by index label, whereas pd.DataFrame.iloc slices by integer location.
Below is a minimal example.
df = pd.DataFrame({'A': [1, 2, 1, 1, 5]}, index=[0, 1, 2, 4, 5])
idx = df[df['A'] == 1].index
print(idx) # Int64Index([0, 2, 4], dtype='int64')
res1 = df.loc[idx]
res2 = df.iloc[idx]
print(res1)
# A
# 0 1
# 2 1
# 4 1
print(res2)
# A
# 0 1
# 2 1
# 5 5
You have 2 options to resolve this problem.
Option 1
Use pd.DataFrame.loc to slice by index, as above.
Option 2
Reset index and use pd.DataFrame.iloc:
df = df.reset_index(drop=True)
idx = df[df['A'] == 1].index
res2 = df.iloc[idx]
print(res2)
# A
# 0 1
# 2 1
# 3 1
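Applied to the original question, Option 1 means slicing with .loc, or you can skip the index round-trip entirely and filter with the boolean mask directly (a sketch assuming the DataFrame is named data, as in the question):

# Both return every row with the given message, regardless of how the index is labeled
res = data.loc[data.contributor_message == 'I support this proposal because...']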
Python newbie here with a challenge I'm working to solve...
My goal is to iterate through a data frame and return what changed line by line. Here's what I have so far:
pseudo code (may not be correct method)
step 1: set row 0 to an initial value
step 2: compare row 1 to row 0, add changes to a list and record row number
step 3: set current row to new initial
step 4: compare row 2 to row 1, add changes to a list and record row number
step 5: iterate through all rows
step 6: return a table with changes and row index where change occurred
d = {
    'col1' : [1, 1, 2, 2, 3],
    'col2' : [1, 2, 2, 2, 2],
    'col3' : [1, 1, 2, 2, 2]
}

df = pd.DataFrame(data=d)

def delta():
    changes = []
    initial = df.loc[0]
    for row in df:
        if row[i] != initial:
            changes.append[i]

delta()
Changes I expect to see:
index 1: col2 changed from 1 to 2, so 2 should be added to the changes list
index 2: col1 and col3 changed from 1 to 2, so both 2s should be added to the changes list
index 4: col1 changed from 2 to 3, so 3 should be added to the changes list
You can check where each of the columns has changed using the shift method, then use a mask to keep only the values that changed:
df.loc[:, 'col1_changed'] = df['col1'].mask(df['col1'].eq(df['col1'].shift()))
df.loc[:, 'col2_changed'] = df['col2'].mask(df['col2'].eq(df['col2'].shift()))
df.loc[:, 'col3_changed'] = df['col3'].mask(df['col3'].eq(df['col3'].shift()))
Once you have identified the changes, you can agg them together:
import numpy as np  # needed for np.nan

# We don't consider the first row
df.loc[0, ['col1_changed', 'col2_changed', 'col3_changed']] = [np.nan] * 3
df[['col1_changed', 'col2_changed', 'col3_changed']].astype('str').agg(','.join, axis=1).str.replace('nan', 'no change')
#0 no change,no change,no change
#1 no change,2.0,no change
#2 2.0,no change,2.0
#3 no change,no change,no change
#4 3.0,no change,no change
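As a side note, the three per-column lines above can be collapsed into one vectorized call on the whole frame; a minimal sketch, run on the original three-column df from the question before the *_changed columns are added:

# NaN marks 'no change'; the surviving value itself marks a change
cols = ['col1', 'col2', 'col3']
changed = df[cols].mask(df[cols].eq(df[cols].shift()))
changed.iloc[0] = np.nan  # as above, the first row is not considered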
You can use the pandas function diff(), which already provides the increment compared to the previous row:
import pandas as pd
d = {
    'col1' : [1, 1, 2, 2, 3],
    'col2' : [1, 2, 2, 2, 2],
    'col3' : [1, 1, 2, 2, 2]
}
df = pd.DataFrame(data=d)
def delta(df):
    deltas = df.diff()                 # converts to float, needed to hold the NaNs in the first row
    deltas.iloc[0] = df.iloc[0]        # replace the NaNs in the first row with the original first-row data
    deltas = deltas.astype(df.dtypes)  # restore the dtypes of the input data
    mask = (deltas != 0).any(axis=1)   # keep only rows where at least one value changed
    mask.iloc[0] = True                # always include the first row, even if it held only zeros
    return deltas.loc[mask]

print(delta(df))
This prints:
   col1  col2  col3
0     1     1     1
1     0     1     0
2     1     0     1
4     1     0     0
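If the table from the question, i.e. which columns changed at which row index, is wanted explicitly, the nonzero deltas can be mapped to column names; a small sketch, skipping the first row since it holds the original data rather than changes:

deltas = delta(df).iloc[1:]
changed_cols = deltas.apply(lambda r: list(deltas.columns[r != 0]), axis=1)
print(changed_cols)
# 1          [col2]
# 2    [col1, col3]
# 4          [col1]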
I have the following DataFrame:
df = pd.DataFrame({'A': [0, 1, 0],
                   'B': [1, 1, 1]},
                  index=['2020-01-01', '2020-02-01', '2020-03-01'])
I'm trying to replace every cell where a 1 is present with an increasing number, because I want to use that number as an ID. I'm looking for something like:
df.replace(1, value=3)
which works great, except that instead of the constant 3 I need the number to keep increasing, as in:
number += 1
If I join those together, it doesn't work (or at least I'm not able to find the correct syntax). I'd like to obtain the following result:
df = pd.DataFrame({'A': [0, 2, 0],
                   'B': [1, 3, 4]},
                  index=['2020-01-01', '2020-02-01', '2020-03-01'])
Note: I cannot use any command that relies on specific column or row names, because the table has 2600 columns and 5000 rows.
Element-wise assignment on a copy of df.values can work.
More specifically, a range from 1 to the number of 1's (inclusive) is assigned at the locations of the 1 elements in the value array, and the array is then written back into the original dataframe.
Code
(Data as given)
1. Row-first ordering (what the OP wants)
arr = df.values
mask = (arr > 0)
arr[mask] = range(1, mask.sum() + 1)
for i, col in enumerate(df.columns):
    df[col] = arr[:, i]
# Result
print(df)
            A  B
2020-01-01  0  1
2020-02-01  2  3
2020-03-01  0  4
2. Column-first ordering (another possibility)
arr_tr = df.values.transpose()
mask_tr = (arr_tr > 0)
arr_tr[mask_tr] = range(1, mask_tr.sum() + 1)
for i, col in enumerate(df.columns):
    df[col] = arr_tr[i, :]
# Result
print(df)
            A  B
2020-01-01  0  2
2020-02-01  1  3
2020-03-01  0  4
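For what it's worth, the row-first numbering can also be produced without the explicit write-back loop, using a cumulative sum over the flattened mask; a sketch, starting again from the original df:

mask = df.values > 0
# cumsum over the flattened boolean mask numbers the True cells in row-first order;
# multiplying by the mask zeroes everything else
df[:] = mask.flatten().cumsum().reshape(df.shape) * mask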
I'm trying to union several pd.DataFrames by stacking them along the index, using the index to remove duplicates (A and B are from the same source "table" filtered by different predicates, and I'm trying to recombine them).
A = pd.DataFrame({"values": [1, 2]}, pd.MultiIndex.from_tuples([(1,1),(1,2)], names=('l1', 'l2')))
B = pd.DataFrame({"values": [2, 3, 2]}, pd.MultiIndex.from_tuples([(1,2),(2,1),(2,2)], names=('l1', 'l2')))
pd.concat([A,B]).drop_duplicates() fails since it ignores the index and de-dups on the values, so it removes the index item (2, 2).
pd.concat([A.reset_index(),B.reset_index()]).drop_duplicates(subset=('l1', 'l2')).set_index(['l1', 'l2']) does what I want, but I feel like there should be a better way.
You can do a simple concat and filter out the dups using index.duplicated:
df1 = pd.concat([A,B])
df1[~df1.index.duplicated()]
Out[123]:
       values
l1 l2
1  1        1
   2        2
2  1        3
   2        2
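index.duplicated keeps the first occurrence by default; passing keep='last' would retain B's version of the shared (1, 2) row instead:

df1[~df1.index.duplicated(keep='last')]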
I'm trying to loop through the rows of a DataFrame with a function that calculates the most frequent element in a series. The function works perfectly when I manually supply a series to it:
# Create DataFrame
df = pd.DataFrame({'a' : [1, 2, 1, 2, 1, 2, 1, 1],
                   'b' : [1, 1, 2, 1, 1, 1, 2, 2],
                   'c' : [1, 2, 2, 1, 2, 2, 2, 1]})
# Create function calculating most frequent element
from collections import Counter
def freq_value(series):
    return Counter(series).most_common()[0][0]
# Test function on one row
freq_value(df.iloc[1])
# Another test
freq_value((df.iloc[1, 0], df.iloc[1, 1], df.iloc[1, 2]))
With both tests I get the desired result. However, when I try to apply this function in a loop over the DataFrame rows and save the result into a new column, I get the error "'Series' object is not callable", 'occurred at index 0'. The line producing the error is as follows:
# Loop through the rows of the dataframe and write the result into a new column
df['result'] = df.apply(lambda row: freq_value((row('a'), row('b'), row('c'))), axis = 1)
How exactly does row() work inside apply()? Shouldn't it supply the values from columns 'a', 'b' and 'c' to my freq_value() function?
#jpp's answer addresses how to apply your custom function, but you can also get the desired result using df.mode, with axis=1. This will avoid the use of apply, and will still give you a column of the most common value for each row.
df['result'] = df.mode(1)
>>> df
   a  b  c  result
0  1  1  1       1
1  2  1  2       2
2  1  2  2       2
3  2  1  1       1
4  1  1  2       1
5  2  1  2       2
6  1  2  2       2
7  1  2  1       1
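One caveat: if a row has tied modes, df.mode(1) returns more than one column and the plain single-column assignment above can fail. Taking the first column resolves ties in favor of the smallest value, since the modes come back sorted:

df['result'] = df.mode(1)[0]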
row is not a function within your lambda, so parentheses are not appropriate. Instead, you should use the __getitem__ method or the loc accessor to access values. The syntactic sugar for the former is []:
df['result'] = df.apply(lambda row: freq_value((row['a'], row['b'], row['c'])), axis=1)
Using the loc alternative:
def freq_value_calc(row):
    return freq_value((row.loc['a'], row.loc['b'], row.loc['c']))
To understand exactly why this is the case, it helps to rewrite your lambda as a named function:
def freq_value_calc(row):
    print(type(row))  # useful for debugging
    return freq_value((row['a'], row['b'], row['c']))

df['result'] = df.apply(freq_value_calc, axis=1)
Running this, you'll find that row is of type <class 'pandas.core.series.Series'>, i.e. a series indexed by column labels if you use axis=1. To access the value in a series for a given label, you can either use __getitem__ / [] syntax or loc.
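You can see this in isolation with the df from the question:

row = df.iloc[0]  # a Series indexed by 'a', 'b', 'c'
row['a']          # 1 -- label lookup works
row('a')          # TypeError: 'Series' object is not callable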
df['CommonValue'] = df.apply(lambda x: x.mode()[0], axis=1)
I have a pandas dataframe following the form in the example below:
data = {'id': [1,1,1,1,2,2,2,2,3,3,3], 'a': [-1,1,1,0,0,0,-1,1,-1,0,0], 'b': [1,0,0,-1,0,1,1,-1,-1,1,0]}
df = pd.DataFrame(data)
Now, what I want to do is create a pivot table such that for each of the columns except the id, I will have 3 new columns corresponding to the values. That is, for column a, I will create a_neg, a_zero and a_pos. Similarly, for b, I will create b_neg, b_zero and b_pos. The values for these new columns would correspond to the number of times those values appear in the original a and b column. The final dataframe should look like this:
result = {'id': [1, 2, 3], 'a_neg': [1, 1, 1],
          'a_zero': [1, 2, 2], 'a_pos': [2, 1, 0],
          'b_neg': [1, 1, 1], 'b_zero': [2, 1, 1], 'b_pos': [1, 2, 1]}
df_result = pd.DataFrame(result)
Now, to do this, I can do the following steps and arrive at my final answer:
by_a = df.groupby(['id', 'a']).count().reset_index().pivot('id', 'a', 'b').fillna(0).astype(int)
by_a.columns = ['a_neg', 'a_zero', 'a_pos']
by_b = df.groupby(['id', 'b']).count().reset_index().pivot('id', 'b', 'a').fillna(0).astype(int)
by_b.columns = ['b_neg', 'b_zero', 'b_pos']
df_result = by_a.join(by_b).reset_index()
However, I believe that method is not optimal, especially if I have a lot of original columns aside from a and b. Is there a shorter and/or more efficient way to achieve what I want here? Thanks.
A shorter solution, though still quite inefficient:
In [11]: df1 = df.set_index("id")
In [12]: g = df1.groupby(level=0)
In [13]: g.apply(lambda x: x.apply(lambda x: x.value_counts())).fillna(0).astype(int).unstack(1)
Out[13]:
    a        b
   -1  0  1 -1  0  1
id
1   1  1  2  1  2  1
2   1  2  1  1  1  2
3   1  2  0  1  1  1
Note: I think you should be aiming for the multi-index columns.
I'm reasonably sure I've seen a trick to remove the apply/value_counts/fillna with something cleaner and more efficient, but at the moment it eludes me...
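One cleaner candidate (a sketch, not benchmarked): melt the value columns into long form and let pd.crosstab do the counting, which also keeps the multi-index columns:

m = df.melt(id_vars='id')
res = pd.crosstab(m['id'], [m['variable'], m['value']])
print(res)
# variable   a        b
# value     -1  0  1 -1  0  1
# id
# 1          1  1  2  1  2  1
# 2          1  2  1  1  1  2
# 3          1  2  0  1  1  1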