How do I set the values of a pandas dataframe slice, where the rows are chosen by a boolean expression and the columns are chosen by position?
I have done it in the following way so far:
>>> vals = [5,7]
>>> df = pd.DataFrame({'a':[1,2,3,4], 'b':[5,5,7,7]})
>>> df
a b
0 1 5
1 2 5
2 3 7
3 4 7
>>> df.iloc[:,1][df.iloc[:,1] == vals[0]] = 0
>>> df
a b
0 1 0
1 2 0
2 3 7
3 4 7
This works as expected on this small sample, but gives me the following warning on my real-life dataframe:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
What is the recommended way to achieve this?
Use DataFrame.columns and DataFrame.loc:
col = df.columns[1]
df.loc[df.loc[:,col] == vals[0], col] = 0
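For completeness, a minimal runnable version against the sample frame above; doing the selection and assignment in a single loc call avoids the chained indexing that triggers the warning:
import pandas as pd

vals = [5, 7]
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 5, 7, 7]})

col = df.columns[1]                   # resolve position 1 to its label, 'b'
df.loc[df[col] == vals[0], col] = 0   # one loc call: no intermediate copy, no warning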
One way is to use index of column header and loc (label based indexing):
df.loc[df.iloc[:, 1] == vals[0], df.columns[1]] = 0
Another way is to use np.where with iloc (integer position indexing); np.where returns a tuple of arrays holding the index positions where the condition is True, so for a 1-D condition the row positions sit in its first (and only) element:
df.iloc[np.where(df.iloc[:, 1] == vals[0])[0], 1] = 0
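A quick sketch of what np.where returns here (on the sample frame above, before any assignment), which is why the [0] is needed:
import numpy as np

np.where(df.iloc[:, 1] == vals[0])
# (array([0, 1]),)  -- a 1-tuple; element 0 holds the matching row positions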
Note that combining loc and iloc in a chain, such as
df.loc[df.iloc[:,1] == vals[0]].iloc[:, 1] = 0
does not work reliably: it is chained indexing again, so the assignment lands on a temporary copy and triggers the same SettingWithCopyWarning.
I have read about DataFrame.loc, but I could not understand why the length of the dataframe (indexPD) is being supplied to loc as the first argument. Basically, what does this loc indicate?
tp_DataFrame = pd.DataFrame(columns=list(props_file_data["PART_HEADER"].split("|")))
indexPD = len(tp_DataFrame)
tp_DataFrame.loc[indexPD, 'item_id'] = something
The first argument of loc selects row labels. Here len(tp_DataFrame) is a label that does not exist yet (with a default RangeIndex the labels run from 0 to len - 1), so pandas enlarges the dataframe: it appends a new row with that label and sets 'item_id' on it. Consider this pandas DataFrame:
df = pd.DataFrame(zip([1,2,3], [4,5,6]), columns=['a', 'b'])
a b
0 1 4
1 2 5
2 3 6
Your assignment df.loc[len(df), 'b'] = -1 targets the row label 3 (len(df) is 3, while the existing labels are 0, 1, 2). That label is not in the index, so rather than modifying existing rows, pandas enlarges the frame and appends a new row, filling the other columns with NaN:
a b
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
3 NaN -1.0
(column a necessarily becomes float to hold the NaN; exact dtypes can vary by pandas version). It is not equivalent to df.loc[:, 'b'] = -1, which would overwrite the whole column.
The purpose of the first argument is to specify which row labels in that column are affected. For instance, if you only want the first two rows to change, you can specify them like this:
df.loc[[0,1], 'b'] = -1
a b
0 1 -1
1 2 -1
2 3 6
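For context, this enlargement behaviour is why df.loc[len(df)] is a common row-append idiom; a minimal sketch, assuming a default RangeIndex (the column names here are hypothetical):
import pandas as pd

df = pd.DataFrame(columns=['item_id', 'qty'])
for item, qty in [('a', 1), ('b', 2)]:
    # len(df) is always one past the last label, so each assignment appends a row
    df.loc[len(df)] = [item, qty]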
I'm looking for a way (using a built-in pandas function) to scan a column of a DataFrame, comparing its own values at different indices.
Here is an example using a for loop. I have a dataframe with a single column col_1, and I want to create a column col_2 with True/False in this way.
df["col_2"] = "False"
N=5
for idx in range(0,len(df)-N):
for i in range (idx+1,idx+N+1):
if(df["col_1"].iloc[idx]==df["col_1"].iloc[i]):
df["col_2"].iloc[idx]=True
What I'm trying to do is to compare the value of col_1 at the i-th index with the next N indices.
I'd like to do the same operation without using a for loop. I've already tried shift and df.loc, but the computational time was similar.
Have you tried doing something like
df["col_1_shifted"] = df["col_1"].shift(N)
df["col_2"] = (df["col_1"] == df["col_1_shifted"])
Update: looking more carefully at your double loop, it seems you want to flag every duplicate except its last occurrence. That's done by just changing the keep argument to 'last' instead of the default 'first'. (One caveat: duplicated() scans the whole column, not just the next N rows, so it matches your loop only when duplicates are at most N rows apart.)
As suggested by @QuangHoang in the comments, duplicated() works nicely for this:
newdf = df.assign(col_2=df.duplicated(subset='col_1', keep='last'))
Example:
df = pd.DataFrame(np.random.randint(0, 5, 10), columns=['col_1'])
newdf = df.assign(col_2=df.duplicated(subset='col_1', keep='last'))
>>> newdf
col_1 col_2
0 2 False
1 0 True
2 1 True
3 0 True
4 0 False
5 3 False
6 1 True
7 1 False
8 4 True
9 4 False
I have a dataframe with a multiindex, where one of the columns represents multiple values, separated by a "|", like this:
value
left right
x a|b 2
y b|c|d -1
I want to duplicate the rows based on the "right" column, to get something like this:
value
left right
x a 2
x b 2
y b -1
y c -1
y d -1
The solution I have to this feels wrong and runs slowly, because it's based on iteration:
df2 = df.iloc[:0]
for index, row in df.iterrows():
    stgs = index[1].split("|")
    for s in stgs:
        row.name = (index[0], s)
        df2 = df2.append(row)
Is there a more vectorized way to do this?
Pandas Series have a dedicated string method, str.split, to perform this operation.
str.split works on a Series, so first isolate the column you want:
SO = df['right']
Now three steps at once: split returns a Series of lists, apply(pd.Series, 1) converts each list into columns, and stack stacks those columns into a single column:
S1 = SO.str.split('|').apply(pd.Series, 1).stack()
The only issue is that you now have an extra index level, so just drop the level you don't need:
S1.index = S1.index.droplevel(-1)
Full example
SO = pd.Series(data=["a,b", "b,c,d"])
S1 = SO.str.split(',').apply(pd.Series, 1).stack()
S1
Out[4]:
0 0 a
1 b
1 0 b
1 c
2 d
S1.index = S1.index.droplevel(-1)
S1
Out[5]:
0 a
0 b
1 b
1 c
1 d
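On pandas 0.25 or newer, the same split-and-stack can be written more directly with Series.explode; a minimal sketch on the same comma-separated sample:
SO = pd.Series(data=["a,b", "b,c,d"])
S1 = SO.str.split(',').explode()   # one row per list element, original index repeated
Because explode repeats the original index, there is no extra level to drop.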
Building upon @xNoK's answer, I am adding here the additional step needed to include the result back in the original DataFrame.
We have this data:
arrays = [['x', 'y'], ['a|b', 'b|c|d']]
midx = pd.MultiIndex.from_arrays(arrays, names=['left', 'right'])
df = pd.DataFrame(index=midx, data=[2, -1], columns=['value'])
df
Out[17]:
value
left right
x a|b 2
y b|c|d -1
First, let's generate the values for the right index as @xNoK suggested: take the index level we want to work on with index.levels[1], convert it to a Series so that we can call str.split(), and finally stack() it to get the result we want.
new_multi_idx_val = df.index.levels[1].to_series().str.split('|').apply(pd.Series).stack()
new_multi_idx_val
Out[18]:
right
a|b 0 a
1 b
b|c|d 0 b
1 c
2 d
dtype: object
Now we want to put these values into the original DataFrame df. To do that, let's change its shape so that the result generated in the previous step can be copied in.
We can repeat the rows (including their indexes) by the number of '|'-separated parts in the right level of the multi-index. df.index.levels[1].to_series().str.split('|').apply(lambda x: len(x)) gives the number of times each row (including its index) should be repeated. We pass this to index.repeat() and fetch the values at those indexes to create a new DataFrame df_repeted. (Note: index.levels[1] holds the unique level values; if the level contained duplicates or was reordered, df.index.get_level_values(1) would be the safer choice.)
df_repeted = df.loc[df.index.repeat(df.index.levels[1].to_series().str.split('|').apply(lambda x: len(x)))]
df_repeted
Out[19]:
value
left right
x a|b 2
a|b 2
y b|c|d -1
b|c|d -1
b|c|d -1
Now the df_repeted DataFrame is in a shape where we can change the index to get the answer we want.
Replace the index of df_repeted with the desired values as follows:
df_repeted.index = [df_repeted.index.droplevel(1), new_multi_idx_val]
df_repeted.index.rename(names=['left', 'right'], inplace=True)
df_repeted
Out[20]:
value
left right
x a 2
b 2
y b -1
c -1
d -1
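For reference, on pandas 0.25+ the whole reshape can also be done without repeating indexes by hand; a sketch, assuming the same df as above:
out = (df.reset_index()
         .assign(right=lambda d: d['right'].str.split('|'))
         .explode('right')
         .set_index(['left', 'right']))
This yields the desired frame shown at the top of the question.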
I have two files. One contains the metadata/labels, the other contains the actual count data that has a label corresponding to the metadata file.
I went through the metadata file, sliced out the labels I wanted using Pandas, and exported them to a list.
How can I take that list of labels and use that to slice a Pandas DataFrame by column label?
I've done something similar with row labels, but that was using Pandas .isin() function, which can't be used on columns.
Edit:
When I'm slicing out rows based on whether the name of the row is found in a list I use a one-liner similar to this
row_list = ['row_name1', 'row_name2', 'row_name3']
sliced_rows = df[df['row_names'].isin(row_list)]
df =
row_names 1 2 3 4
row_name1 0 2 0 6
row_name5 0 0 1 0
row_name2 0 0 0 0
row_name17 0 5 6 5
So here I'd get row_name1 & row_name2.
I'm trying to do the same thing, but when the names label the columns instead of the rows.
So the matrix would look something like this.
label column_name1 column_name2 column_name3 column_name4
1 0 2 0 6
2 0 0 1 0
3 0 0 0 0
4 0 5 6 5
And I'd like to select columns from the entire dataframe based on whether or not the column's name is in the list.
Actually you can use isin:
In [34]:
df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))
df
Out[34]:
A B C D
0 0.540783 0.206722 0.627336 0.865066
1 0.204596 1.317936 0.624362 -0.573012
2 0.124457 1.052614 -0.152633 -0.021625
3 0.415278 1.469842 0.581196 0.143085
4 0.043743 -1.191018 -0.202574 0.479122
In [37]:
col_list=['A','D']
df[df.columns[df.columns.isin(col_list)]]
Out[37]:
A D
0 0.540783 0.865066
1 0.204596 -0.573012
2 0.124457 -0.021625
3 0.415278 0.143085
4 0.043743 0.479122
So what you can do is call isin and pass your list; this produces a boolean array:
In [38]:
df.columns.isin(col_list)
Out[38]:
array([ True, False, False, True], dtype=bool)
You then use the boolean mask to mask your columns:
In [39]:
df.columns[df.columns.isin(col_list)]
Out[39]:
Index(['A', 'D'], dtype='object')
You now have an Index of matching columns that you can use to subset the df.
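As an aside, two equivalent spellings do the same subsetting in one step; both are standard pandas API:
df.loc[:, df.columns.isin(col_list)]   # boolean mask passed straight to loc
df.filter(items=col_list)              # keeps only the listed labels that exist in df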
If I have a dataframe and want to drop any rows where the value in one column is not an integer, how would I do this?
The alternative is to drop rows if the value is not within the range 0-2, but since I am not sure how to do either of them, I was hoping someone else might be.
Here is what I tried, but it didn't work and I'm not sure why:
df = df[(df['entrytype'] != 0) | (df['entrytype'] !=1) | (df['entrytype'] != 2)].all(1)
There are 2 approaches I propose:
In [212]:
df = pd.DataFrame({'entrytype':[0,1,np.nan, 'asdas',2]})
df
Out[212]:
entrytype
0 0
1 1
2 NaN
3 asdas
4 2
If the range of values is as restricted as you say then using isin will be the fastest method:
In [216]:
df[df['entrytype'].isin([0,1,2])]
Out[216]:
entrytype
0 0
1 1
4 2
Otherwise we could cast to a str and then call .isdigit()
In [215]:
df[df['entrytype'].apply(lambda x: str(x).isdigit())]
Out[215]:
entrytype
0 0
1 1
4 2
str("-1").isdigit() is False
str("-1").lstrip("-").isdigit() works but is not nice.
df.loc[df['Feature'].str.match('^[+-]?\d+$')]
for your question the reverse set
df.loc[ ~(df['Feature'].str.match('^[+-]?\d+$')) ]
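Another route, sketched here on the assumption that the column may hold mixed types (using the 'entrytype' column from the question): coerce to numeric with pd.to_numeric and filter on the result.
s = pd.to_numeric(df['entrytype'], errors='coerce')   # non-numbers become NaN
df_int = df[s.notna() & (s % 1 == 0)]                 # keep only integer-valued rows
df_range = df[s.between(0, 2)]                        # or: keep only values in the range 0-2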
There are multiple ways to do the same thing, but I found these methods easy and efficient.
Quick examples:
# Using drop() to delete rows based on a column value
df.drop(df[df['Fee'] >= 24000].index, inplace=True)

# Keep only the rows where Fee >= 24000
df2 = df[df.Fee >= 24000]

# If the column name contains a space,
# specify it with bracket notation and quotes
df2 = df[df['column name'] >= 24000]

# Using loc
df2 = df.loc[df["Fee"] >= 24000]

# Select rows based on multiple column values
df2 = df[(df['Fee'] >= 22000) & (df['Discount'] == 2300)]

# Drop rows where Discount is None/NaN
df2 = df[df.Discount.notnull()]
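To try the snippets above, here is a hypothetical sample frame; the column names match the examples, but the values are made up for illustration:
import pandas as pd

df = pd.DataFrame({'Fee': [20000, 22000, 25000, 24000],
                   'Discount': [1000, 2300, None, 2500]})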