I have two files. One contains the metadata/labels, the other contains the actual count data that has a label corresponding to the metadata file.
I went through the metadata file, sliced out the labels I wanted using Pandas, and exported them into a list.
How can I take that list of labels and use that to slice a Pandas DataFrame by column label?
I've done something similar with row labels, but that was using Pandas' .isin() function, which I didn't think could be used on columns.
Edit:
When I'm slicing out rows based on whether the name of the row is found in a list, I use a one-liner similar to this:
row_list = ['row_name1', 'row_name2', 'row_name3']
sliced_rows = df[df['row_names'].isin(row_list)]
df =
row_names    1  2  3  4
row_name1    0  2  0  6
row_name5    0  0  1  0
row_name2    0  0  0  0
row_name17   0  5  6  5
So here I'd get row_name1 and row_name2.
I'm trying to do the same thing, but where those names label the columns instead of the rows.
So the matrix would look something like this.
label  column_name1  column_name2  column_name3  column_name4
1      0             2             0             6
2      0             0             1             0
3      0             0             0             0
4      0             5             6             5
And I'd select columns for the entire dataframe based on whether or not each column's name is in the list.
Actually you can use isin:
In [34]:
df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))
df
Out[34]:
A B C D
0 0.540783 0.206722 0.627336 0.865066
1 0.204596 1.317936 0.624362 -0.573012
2 0.124457 1.052614 -0.152633 -0.021625
3 0.415278 1.469842 0.581196 0.143085
4 0.043743 -1.191018 -0.202574 0.479122
In [37]:
col_list=['A','D']
df[df.columns[df.columns.isin(col_list)]]
Out[37]:
A D
0 0.540783 0.865066
1 0.204596 -0.573012
2 0.124457 -0.021625
3 0.415278 0.143085
4 0.043743 0.479122
So what you can do is call isin on the columns and pass your list; this will produce a boolean array:
In [38]:
df.columns.isin(col_list)
Out[38]:
array([ True, False, False, True], dtype=bool)
You then use the boolean mask to mask your columns:
In [39]:
df.columns[df.columns.isin(col_list)]
Out[39]:
Index(['A', 'D'], dtype='object')
You now have an Index of column labels you can use to subset the df.
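For completeness, the same selection can be written without building the mask by hand. A minimal sketch (df.filter silently ignores labels that are missing from the frame, whereas plain df[col_list] raises a KeyError if a label is absent):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))
col_list = ['A', 'D']

# Boolean-mask the columns directly in one step
subset = df.loc[:, df.columns.isin(col_list)]

# Or let pandas match the labels for you; missing labels are silently dropped
subset = df.filter(items=col_list)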
I have read about DataFrame .loc, but I could not understand why the length of the dataframe (indexPD) is being supplied to loc as the first argument. Basically, what does this loc indicate?
tp_DataFrame = pd.DataFrame(columns=list(props_file_data["PART_HEADER"].split("|")))
indexPD = len(tp_DataFrame)
tp_DataFrame.loc[indexPD, 'item_id'] = something
The first argument to .loc selects row labels and the second selects column labels. Here indexPD is len(tp_DataFrame), a label that does not exist in the frame yet, so the assignment enlarges the DataFrame: pandas appends a new row with that label, sets its 'item_id' value, and fills the other columns with NaN. Consider this pandas DataFrame:
df = pd.DataFrame(list(zip([1, 2, 3], [4, 5, 6])), columns=['a', 'b'])
a b
0 1 4
1 2 5
2 3 6
Since len(df) is 3 and no row is labelled 3 yet, df.loc[len(df), 'b'] = -1 does not touch the existing rows; it appends a new row labelled 3, with b set to -1 and a filled with NaN (the columns are upcast to float because of the introduced NaN):
     a    b
0  1.0  4.0
1  2.0  5.0
2  3.0  6.0
3  NaN -1.0
So the purpose of the first argument is to specify which index labels the assignment applies to. For instance, starting from the original frame, if you only want the first 2 rows to be modified, you can pass their labels explicitly:
df.loc[[0,1], 'b'] = -1
a b
0 1 -1
1 2 -1
2 3 6
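For reference, here is the enlargement pattern from the question as a small runnable sketch (the 'item_name' column and the sample rows are made up for illustration; only 'item_id' comes from the original snippet):

import pandas as pd

# 'item_name' and the sample values below are hypothetical, for illustration only
tp_DataFrame = pd.DataFrame(columns=['item_id', 'item_name'])

for item_id, item_name in [(1, 'bolt'), (2, 'nut')]:
    index = len(tp_DataFrame)                         # label of the row about to be created
    tp_DataFrame.loc[index, 'item_id'] = item_id      # enlarges the frame by one row
    tp_DataFrame.loc[index, 'item_name'] = item_name

print(tp_DataFrame)   # two rows, labelled 0 and 1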
How do I set the values of a pandas dataframe slice, where the rows are chosen by a boolean expression and the columns are chosen by position?
I have done it in the following way so far:
>>> vals = [5,7]
>>> df = pd.DataFrame({'a':[1,2,3,4], 'b':[5,5,7,7]})
>>> df
a b
0 1 5
1 2 5
2 3 7
3 4 7
>>> df.iloc[:,1][df.iloc[:,1] == vals[0]] = 0
>>> df
a b
0 1 0
1 2 0
2 3 7
3 4 7
This works as expected on this small sample, but gives me the following warning on my real life dataframe:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
What is the recommended way to achieve this?
Use DataFrame.columns and DataFrame.loc:
col = df.columns[1]
df.loc[df.loc[:,col] == vals[0], col] = 0
One way is to use the column header (from df.columns) together with loc (label-based indexing):
df.loc[df.iloc[:, 1] == vals[0], df.columns[1]] = 0
Another way is to use np.where with iloc (integer position indexing); np.where returns a tuple of arrays holding the positions where the condition is True (here just row positions, since the condition is one-dimensional):
df.iloc[np.where(df.iloc[:, 1] == vals[0])[0], 1] = 0
Note that a combination of loc and iloc like df.loc[df.iloc[:, 1] == vals[0]].iloc[:, 1] = 0 does not work reliably: the .loc[...] call returns a copy, so this is again chained assignment and raises the same SettingWithCopyWarning without necessarily modifying df.
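To make the recommended approach concrete, here is the example from the question run end to end as a self-contained sketch:

import pandas as pd

vals = [5, 7]
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 5, 7, 7]})

# A single .loc call with a boolean row mask and the column label,
# so the assignment happens on the original frame (no chained indexing).
col = df.columns[1]
df.loc[df[col] == vals[0], col] = 0

print(df)   # column 'b' is now [0, 0, 7, 7]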
I have a table where one of the columns is an array of binary features; a feature is listed when it is present.
I'd like to train a logistic model on these rows, but I can't get the data into the required format, where each feature value is its own column with a 1 or 0 value.
Example:
id feature values
1 ['HasPaws', 'DoesBark', 'CanFetch']
2 ['HasPaws', 'CanClimb', 'DoesMeow']
I'd like to get it to the format of
id HasPaws DoesBark CanFetch CanClimb DoesMeow
1 1 1 1 0 0
2 1 0 0 1 0
It seems like there would be some functionality built in to accomplish this, but I can't think of what this transformation is called in order to do a better search on my own.
You can first convert the lists to columns and then use the get_dummies() method:
In [12]: df
Out[12]:
id feature_values
0 1 [HasPaws, DoesBark, CanFetch]
1 2 [HasPaws, CanClimb, DoesMeow]
In [13]: (pd.get_dummies(df.set_index('id').feature_values.apply(pd.Series),
...: prefix='', prefix_sep='')
...: .reset_index()
...: )
Out[13]:
id HasPaws CanClimb DoesBark CanFetch DoesMeow
0 1 1 0 1 1 0
1 2 1 1 0 0 1
Another option is to apply over the feature values column, constructing a Series from each cell with the list values as the index. Pandas will then expand these Series into a data frame whose column headers are the feature names:
pd.concat([df['id'],
           (df['feature values'].apply(lambda lst: pd.Series([1] * len(lst), index=lst))
              .fillna(0))],
          axis=1)
method 1
pd.concat([df['id'], df['feature values'].apply(pd.value_counts)], axis=1).fillna(0)
method 2
df.set_index('id').squeeze().apply(pd.value_counts).reset_index().fillna(0)
method 3
pd.concat([pd.Series(1, f, name=i) for _, (i, f) in df.iterrows()],
axis=1).T.fillna(0).rename_axis('id').reset_index()
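On pandas 0.25 or newer there is also a fairly direct idiom using explode; a sketch assuming the list column is literally named 'feature values' as in the question:

import pandas as pd

df = pd.DataFrame({'id': [1, 2],
                   'feature values': [['HasPaws', 'DoesBark', 'CanFetch'],
                                      ['HasPaws', 'CanClimb', 'DoesMeow']]})

# Give each feature its own row, then cross-tabulate id against feature name
# to get the 0/1 indicator matrix.
exploded = df.explode('feature values')
dummies = pd.crosstab(exploded['id'], exploded['feature values']).reset_index()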
I have a dataframe where some column labels occur multiple times (i.e., some columns have the same label). This is causing me problems -- I may post more about this separately, because some of the behavior seems a little strange, but here I just wanted to ask about deleting some of these columns. That is, for each column label that occurs multiple times, I would like to delete all but the first column it heads. Here's an example:
In [5]: arr = np.array([[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]])
In [6]: df = pd.DataFrame(data=arr, columns=['A', 'C', 'E', 'A'])
In [7]: df
Out[7]:
A C E A
0 0 1 2 3
1 4 5 6 7
If I drop columns using the label, all columns headed by that label are dropped:
In [9]: df.drop('A', axis=1)
Out[9]:
C E
0 1 2
1 5 6
So I thought I'd try dropping by the column index, but that also deletes all the columns headed by that label:
In [12]: df.drop(df.columns[3], axis=1)
Out[12]:
C E
0 1 2
1 5 6
How can I do what I want, that is, for each such label, delete all but one of the columns? For the above example, I'd want to end up with:
A C E
0 0 1 2
1 4 5 6
For now I've relabeled the columns, as follows:
columns = {}
new_columns = []
duplicate_num = 0
for n in df.columns:
    if n in columns:
        new_columns.append("duplicate%d" % (duplicate_num))
        duplicate_num += 1
    else:
        columns[n] = 1
        new_columns.append(n)
df.columns = new_columns
This works fine for my needs, but it doesn't seem like the best/cleanest solution. Thanks.
Edit: I don't see how this is a duplicate of the other question. For one thing, that deals with duplicate columns, not duplicate column labels. For another, the suggested solution there involved transposing the dataframe (twice), but as mentioned there, transposing large dataframes is inefficient, and in fact I am dealing with large dataframes.
In [18]:
df.loc[:, ~df.columns.duplicated()]
Out[18]:
A C E
0 0 1 2
1 4 5 6
Explanation
In [19]:
~df.columns.duplicated()
Out[19]:
array([ True, True, True, False], dtype=bool)
As you can see, you first need to check whether each column name is duplicated or not; notice the ~ at the beginning, which negates the mask.
Then you can slice the columns using the non-duplicated values.
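Put together as a runnable snippet, with keep='last' shown as well in case you want to keep the last occurrence of each label instead of the first:

import numpy as np
import pandas as pd

arr = np.array([[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]])
df = pd.DataFrame(data=arr, columns=['A', 'C', 'E', 'A'])

# Keep the first occurrence of each duplicated column label
first = df.loc[:, ~df.columns.duplicated()]

# Or keep the last occurrence instead
last = df.loc[:, ~df.columns.duplicated(keep='last')]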
I have some confusion about how pandas uses filtered rows.
Say we have this market data dataframe 'df':
Time Open High Low Close Volume
31.12.2003 23:00:00.000 82440 83150 82440 82880 47686.32
01.01.2004 23:00:00.000 82830 83100 82350 83100 37571.04
02.01.2004 23:00:00.000 83100 83100 83100 83100 0.00
Now we filter the rows to get a df containing only the days the markets are open (Volume > 0):
df=df[df['Volume']>0]
Because of the way we filtered the dataframe, there are empty rows that still have indexes and values, but they are not used in calculations. For instance, if we do:
df.mean()
The values of the filtered-out rows won't appear in the calculation.
The confusing part comes here:
How could we take the average of the last 2 values counting back from row 3, using only the unfiltered values?
Meaning, if row 2 was filtered out, it should take the mean of rows 3 and 1.
----------- EDIT --------------
Hey, thanks for the comment, trying to be more clear:
Say we have this example dataframe:
Index Volume
0 1
1 0
2 1
3 1
Then we filter it:
df=df[df['Volume']>0]
If we send the dataframe to numpy in order to plot or iterate through it, it will also send the rows that we don't want.
If we iterate over that data, it will also iterate over (and consider) the indexes that we are ruling out.
So, how can we get a copy of the dataframe that excludes the ruled-out rows, to avoid those two problems?
I think you're running into a pretty common problem with boolean indexing. When you're trying to filter a DataFrame with a DataFrame of booleans, you need to specify how to handle cases where things are True for some columns/rows but False for other columns/rows. Do you want items where things are True everywhere, or anywhere?
It's especially tricky in this case since your DataFrame is 1-d, so you'd expect things to work like a Series, where there's no ambiguity: with a Series a row is either True or False; it can't be True in some columns and False in others.
To resolve the ambiguity with DataFrames, use the any() or all() methods:
In [36]: df
Out[36]:
Volume
Index
0 1
1 0
2 1
3 1
[4 rows x 1 columns]
In [37]: df[(df > 0).all(1)]
Out[37]:
Volume
Index
0 1
2 1
3 1
[3 rows x 1 columns]
The 1 inside all() just says to evaluate across axis 1 (the columns), so a row is kept only if it is True in every column.
Here's a 2-d example that might help clear things up:
In [39]: df = pd.DataFrame({"A": ['a', 'b', 'c', 'd'], "B": ['e', 'f', 'g', 'h']})
In [40]: df
Out[40]:
A B
0 a e
1 b f
2 c g
3 d h
[4 rows x 2 columns]
In [41]: bf = pd.DataFrame({"A": [True, True, False, False], "B": [True, False, True, False]})
In [42]: bf
Out[42]:
A B
0 True True
1 True False
2 False True
3 False False
[4 rows x 2 columns]
First, the "wrong" way, with the ambiguity unresolved. It's unclear what to do with (1, 'B') since it's False in bf, but there is a 1 row and a B column, so a NaN is filled:
In [43]: df[bf]
Out[43]:
A B
0 a e
1 b NaN
2 NaN g
3 NaN NaN
[4 rows x 2 columns]
all matches only the first row, since that's the only one where both columns are True:
In [44]: df[bf.all(1)]
Out[44]:
A B
0 a e
[1 rows x 2 columns]
any matches all but the last row, since that's the only one that is False in both columns:
In [45]: df[bf.any(1)]
Out[45]:
A B
0 a e
1 b f
2 c g
[3 rows x 2 columns]
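Coming back to the edit in the question: the boolean filter already returns a new frame containing only the kept rows; if the old row labels get in the way when exporting to numpy or iterating, reset_index(drop=True) discards them. A minimal sketch:

import pandas as pd

df = pd.DataFrame({'Volume': [1, 0, 1, 1]})

# Filtering returns a copy with only the kept rows; dropping the old labels
# leaves nothing that refers to the filtered-out row.
open_days = df[df['Volume'] > 0].reset_index(drop=True)
arr = open_days.to_numpy()   # shape (3, 1): the Volume == 0 row is gone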