custom sorting pandas dataframe - python

I have a (very large) table using pandas.DataFrame. It contains wordcounts from texts; the index is the wordlist:
          one.txt  third.txt  two.txt
a               1          1        0
i               0          0        1
is              1          1        1
no              0          0        1
not             0          1        0
really          1          0        0
sentence        1          1        1
short           2          0        0
think           0          0        1
I want to sort the wordlist by the total frequency of the words across all texts. I can easily create a Series containing the frequency sum for each word (using the words as index), but how can I sort the DataFrame on this Series?
One easy way would be to add the sums to the DataFrame as a column, sort on it, and then delete it. For performance reasons I would like to avoid this.
Two other ways are described here, but one of them duplicates the DataFrame, which is a problem because of its size, and the other creates a new index, while I need the word information further down the line.

You could compute the row sums and use sort_values to find the desired order of the index. Then use df.loc[order.index] to reorder the original DataFrame:
order = df.sum(axis=1).sort_values(ascending=False)
result = df.loc[order.index]
For example,
import pandas as pd

df = pd.DataFrame({
    'one.txt':   [1, 0, 1, 0, 0, 1, 1, 2, 0],
    'third.txt': [1, 0, 1, 0, 1, 0, 1, 0, 0],
    'two.txt':   [0, 1, 1, 1, 0, 0, 1, 0, 1]},
    index=['a', 'i', 'is', 'no', 'not', 'really', 'sentence', 'short', 'think'])

order = df.sum(axis=1).sort_values(ascending=False)
print(df.loc[order.index])
yields
          one.txt  third.txt  two.txt
sentence        1          1        1
is              1          1        1
short           2          0        0
a               1          1        0
think           0          0        1
really          1          0        0
not             0          1        0
no              0          0        1
i               0          0        1
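The intermediate order variable is not strictly needed; the same idea as a one-liner (a minimal sketch, assuming a pandas version that provides Series.sort_values):
result = df.loc[df.sum(axis=1).sort_values(ascending=False).index]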

Related

Dataframe column: to find local maxima

In the DataFrame below, the column "CumRetperTrade" consists of a few vertical vectors (sequences of numbers) separated by zeros (these vectors correspond to the non-zero elements of column "Portfolio").
I would like to find the local maximum of every non-zero vector contained in column "CumRetperTrade".
To be precise, I would like to transform (using vectorization or other methods) column "CumRetperTrade" into the column "PeakCumRet" (desired result), which holds, for every vector contained in "CumRetperTrade", its local maximum value. A numeric example is below. Thanks in advance.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Portfolio":      [1, 1, 1, 0, 0, 0, 1, 1, 1],
                    "CumRetperTrade": [3, 2, 1, 0, 0, 0, 4, 2, 1],
                    "PeakCumRet":     [3, 3, 3, 0, 0, 0, 4, 4, 4]})
df1
Portfolio  CumRetperTrade  PeakCumRet
        1               3           3
        1               2           3
        1               1           3
        0               0           0
        0               0           0
        0               0           0
        1               4           4
        1               2           4
        1               1           4
You can use:
df1['PeakCumRet'] = (df1.groupby(df1['Portfolio'].ne(df1['Portfolio'].shift()).cumsum())
                        ['CumRetperTrade'].transform('max'))
Output:
   Portfolio  CumRetperTrade  PeakCumRet
0          1               3           3
1          1               2           3
2          1               1           3
3          0               0           0
4          0               0           0
5          0               0           0
6          1               4           4
7          1               2           4
8          1               1           4
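To see how the consecutive runs are grouped, you can inspect the intermediate key built by the ne/shift/cumsum idiom on its own (a small illustrative sketch using the df1 defined above):
group_id = df1['Portfolio'].ne(df1['Portfolio'].shift()).cumsum()
print(group_id.tolist())  # [1, 1, 1, 2, 2, 2, 3, 3, 3]
Each run of identical Portfolio values gets its own label, so transform('max') broadcasts the per-run maximum back to every row of that run.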

Find the rows that share the value

I need to find the rows where A, B and C all have the value 1, and then create a new column that holds the result.
My idea is to use np.where() with some condition, but I don't know the correct way to approach this. From what I have read, I'm not supposed to iterate through a DataFrame but use one of pandas' built-in vectorized methods instead?
import pandas as pd

df1 = pd.DataFrame({'A': [0, 1, 1, 0],
                    'B': [1, 1, 0, 1],
                    'C': [0, 1, 1, 1]},
                   index=[0, 1, 2, 4])
print(df1)
what I am after is this:
   A  B  C  TRUE
0  0  1  0     0
1  1  1  1     1  <----
2  1  0  1     0
4  0  1  1     0
If the data is always 0/1, you can simply take the product per row:
df1['TRUE'] = df1.prod(axis=1)
output:
   A  B  C  TRUE
0  0  1  0     0
1  1  1  1     1
2  1  0  1     0
4  0  1  1     0
This is what you are looking for:
df1["TRUE"] = (df1==1).all(axis=1).astype(int)

Index-based access to rows in pandas.DataFrame with Sparse columns

Due to memory limitations I have to use sparse columns in a pandas.DataFrame (pandas version 1.0.5).
Unfortunately, with index-based access to rows (using .loc[]), I am running into the following issue:
import pandas as pd
import scipy.sparse

df = pd.DataFrame.sparse.from_spmatrix(
    scipy.sparse.csr_matrix([[0, 0, 0, 1],
                             [1, 0, 0, 0],
                             [0, 1, 0, 0]])
)
df
Output:
   0  1  2  3
0  0  0  0  1
1  1  0  0  0
2  0  1  0  0
If using .loc:
df.loc[[0,1]]
Output:
   0  1    2  3
0  0  0  NaN  1
1  1  0  NaN  0
Ideally, I would be expecting 0s for column two as well. My hypothesis of what's happening here is that the internal csc-matrix representation and the fact that I am accessing values in rows of a column that does not contain any non-zero values originally messes with the fill-value. The dtypes sort of speak against this:
df.loc[[0,1]].dtypes
Output:
0      Sparse[int32, 0]
1      Sparse[int32, 0]
2    Sparse[float64, 0]
3      Sparse[int32, 0]
(note that the fill-value is still given as 0, even though the view's dtype for column 2 has changed from Sparse[int32, 0] to Sparse[float64, 0]).
Can anyone tell me whether all NaNs occurring in a row-sliced pd.DataFrame with sparse columns indeed refer to the respective zero fill value and will not "hide" any actual non-zero entries? Is there a "safe" way to use index-based row access on pd.DataFrames with sparse columns?
So this indeed turned out to be a bug in pandas that has been fixed in version 1.1.0 (see GitHub for an issue description and the changelog for 1.1.0).
In 1.1.0 the minimal example works:
df = pd.DataFrame.sparse.from_spmatrix(
    scipy.sparse.csr_matrix([[0, 0, 0, 1],
                             [1, 0, 0, 0],
                             [0, 1, 0, 0]])
)
df.loc[[0, 1]]
Output:
   0  1  2  3
0  0  0  0  1
1  1  0  0  0
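For anyone who cannot upgrade yet, one small sanity check (an illustrative sketch that is only feasible when the frame is small enough to densify) is to compare the sparse row slice against its dense equivalent:
# Densify the whole (small) example frame first, then slice;
# the values should match the sparse slice apart from the NaN placeholders.
dense_slice = df.sparse.to_dense().loc[[0, 1]]
print(dense_slice)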

Python: combine boolean columns in Pandas dataframes

I have the following data
attr1_A  attr1_B  attr1_C  attr1_D  attr2_A  attr2_B  attr2_C
      1        0        0        1        1        0        0
      0        1        1        0        0        0        1
      0        0        0        0        0        1        0
      1        1        1        0        1        1        0
I want to retain attr1_A, attr1_B and combine attr1_C and attr1_D into attr1_others. As long as attr1_C and/or attr1_D is 1, then attr1_others will be 1. Similarly, I want to keep attr2_A but combine the remaining attr2_* into attr2_others. Like this:
attr1_A  attr1_B  attr1_others  attr2_A  attr2_others
      1        0             1        1             0
      0        1             1        0             1
      0        0             0        0             1
      1        1             1        1             1
In other words, for any group of attr, I want to retain a few known columns and combine the remaining ones (I don't know in advance how many remaining columns of the same group there are).
I am thinking of handling each group separately: processing all attr1_* and then all attr2_*, because there is a limited number of groups in my dataset but many columns under each group.
What I can think of right now is to retrieve the "others" columns like:
# for group 1
df[[x for x in df.columns if "A" not in x and "B" not in x and "attr1_" in x]]
# for group 2
df[[x for x in df.columns if "A" not in x and "attr2_" in x]]
And to combine them, I am thinking of using the any() function, but I can't come up with the syntax. Could you help?
Updated attempt:
I tried this
# for group 1
df['attr1_others'] = df[df[[x for x in list(df.columns)
                            if "attr1_" in x
                            and "A" not in x
                            and "B" not in x]].any(axis = 'column')]
but got the below error:
ValueError: No axis named column for object type <class 'pandas.core.frame.DataFrame'>
DataFrames make it easy to manipulate data in place without writing complex Python logic.
To create your attr1_others and attr2_others columns, you can combine the source columns with an element-wise OR:
df['attr1_others'] = df['attr1_C'] | df['attr1_D']
df['attr2_others'] = df['attr2_B'] | df['attr2_C']
If instead you wanted an AND condition, you could use:
df['attr1_others'] = df['attr1_C'] & df['attr1_D']
df['attr2_others'] = df['attr2_B'] & df['attr2_C']
You can then delete the lingering original values using del:
del df['attr1_C']
del df['attr1_D']
del df['attr2_B']
del df['attr2_C']
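Equivalently (a minor stylistic variant, not part of the original answer), the four del statements can be collapsed into a single drop call:
# Drop all combined source columns in one go.
df = df.drop(columns=['attr1_C', 'attr1_D', 'attr2_B', 'attr2_C'])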
Create a list of kept columns. Drop those columns and assign the left-over columns to a new DataFrame df1. Group df1's columns by their split prefixes (groupby with axis=1), call any, add_suffix '_others', and assign the result to df2. Finally, join and sort_index.
keep_cols = ['attr1_A', 'attr1_B', 'attr2_A']
df1 = df.drop(keep_cols, axis=1)
df2 = (df1.groupby(df1.columns.str.split('_').str[0], axis=1)
          .any().add_suffix('_others').astype(int))
Out[512]:
   attr1_others  attr2_others
0             1             0
1             1             1
2             0             1
3             1             1
df_final = df[keep_cols].join(df2).sort_index(axis=1)
Out[514]:
   attr1_A  attr1_B  attr1_others  attr2_A  attr2_others
0        1        0             1        1             0
1        0        1             1        0             1
2        0        0             0        0             1
3        1        1             1        1             1
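Note that groupby(..., axis=1) has been deprecated in recent pandas releases; if that is a concern, an equivalent sketch (an alternative formulation, not the answer's original code) groups the transposed frame instead:
# Group the transposed frame by column-name prefix, reduce with any(),
# then transpose back; the result matches the axis=1 groupby above.
prefixes = df1.columns.str.split('_').str[0]
df2 = df1.T.groupby(prefixes).any().T.add_suffix('_others').astype(int)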
You can use a custom list to select the columns, and then .any() with the axis=1 parameter. To convert to integer, use .astype(int).
For example:
import pandas as pd

df = pd.DataFrame({
    'attr1_A': [1, 0, 0, 1],
    'attr1_B': [0, 1, 0, 1],
    'attr1_C': [0, 1, 0, 1],
    'attr1_D': [1, 0, 0, 0],
    'attr2_A': [1, 0, 0, 1],
    'attr2_B': [0, 0, 1, 1],
    'attr2_C': [0, 1, 0, 0]})

cols = [col for col in df.columns.values if col.startswith('attr1') and col.split('_')[1] not in ('A', 'B')]
df['attr1_others'] = df[cols].any(axis=1).astype(int)
df.drop(cols, axis=1, inplace=True)

cols = [col for col in df.columns.values if col.startswith('attr2') and col.split('_')[1] not in ('A', )]
df['attr2_others'] = df[cols].any(axis=1).astype(int)
df.drop(cols, axis=1, inplace=True)

print(df)
Prints:
   attr1_A  attr1_B  attr2_A  attr1_others  attr2_others
0        1        0        1             1             0
1        0        1        0             1             1
2        0        0        0             0             1
3        1        1        1             1             1

pandas DataFrame set non-contiguous sections

I have a DataFrame like the one below and would like column B to be 1 for the n rows after a 1 in column A (below, n = 2):
index  A  B
    0  0  0
    1  1  0
    2  0  1
    3  0  1
    4  1  0
    5  0  1
    6  0  1
    7  0  0
    8  1  0
    9  0  1
I think I can do it using .ix, similar to this example, but I'm not sure how. I'd like to do it in a single pandas-style selection command if possible. (Ideally not using rolling_apply.)
Modifying a subset of rows in a pandas dataframe
EDIT: the application is that a 1 in column A is "ignored" if it falls within n rows of the previous 1. As per the comments, for n = 2 these examples give:
A = [1, 0, 1, 0, 1]: B should be [0, 1, 1, 0, 0]
A = [1, 1, 0, 0]: B should be [0, 1, 1, 0]
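As an illustration of the rule described in the edit above (a plain loop-based sketch, not a vectorized solution; mark_following_rows is just a made-up helper name):
def mark_following_rows(a, n=2):
    # Set B = 1 for the n rows after each "effective" 1 in A.
    # A 1 is ignored if it falls within n rows of the previous effective 1.
    b = [0] * len(a)
    last = -(n + 1)  # position of the last effective 1 (none yet)
    for i, val in enumerate(a):
        if val == 1 and i - last > n:
            last = i
            for j in range(i + 1, min(i + 1 + n, len(a))):
                b[j] = 1
    return b

print(mark_following_rows([1, 0, 1, 0, 1]))  # [0, 1, 1, 0, 0]
print(mark_following_rows([1, 1, 0, 0]))     # [0, 1, 1, 0]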
