Selecting rows based on Boolean values in a non-dangerous way - python

This is an easy question since it is so fundamental. See, in R, when you want to slice rows from a dataframe based on some condition, you just write the condition and it selects the corresponding rows. For example, if only the third row in the dataframe meets the condition, it returns the third row. Easy peasy.
In Python, you have to use loc. IF the index matches the row numbers, then everything is great. IF you have been removing rows or re-ordering them for any reason, you have to remember that, because loc is based on the INDEX, NOT the ROW POSITION. So if the third row of your current dataframe is the one that matches your boolean conditional in the loc statement, loc will retrieve the row whose index label is 3 - which could be the 50th row, rather than your current third row. This seems like an incredibly dangerous way to select rows, so I know I am doing something wrong.
So what is the best-practice method of ensuring you select the nth row based on a boolean conditional? Is it just to use loc and "always remember to use reset_index - otherwise, if you miss it even once, your entire dataframe is wrecked"? This can't be it.

Use iloc instead of loc for integer-based (positional) indexing:
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data, index=[1, 2, 3])
df
Dataset:
A B C
1 1 4 7
2 2 5 8
3 3 6 9
Label-based indexing:
df.loc[1]
Results:
A 1
B 4
C 7
Integer-position based:
df.iloc[1]
Results:
A 2
B 5
C 8
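For the boolean-selection concern in the question itself, note that a mask built from the dataframe you are filtering lines up with that dataframe's rows regardless of the index labels, so plain boolean filtering stays safe after dropping or reordering rows. A minimal sketch, reusing the df defined above:
matches = df[df['A'] > 1]          # boolean mask: safe whatever the index labels are

# If you then need "the nth matching row" by position, use iloc,
# which ignores index labels entirely.
first_match = matches.iloc[0]      # the row where A == 2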

Related

Quick sum of all rows that fill a condition in DataFrame

I have a pandas dataframe that looks something like this:
df = pd.DataFrame(np.array([[1, 1, 0], [5, 1, 4], [7, 8, 9]]), columns=['a', 'b', 'c'])
a b c
0 1 1 0
1 5 1 4
2 7 8 9
I want to find the first column in which the majority of elements in that column are equal to 1.0.
I currently have the following code, which works, but in practice my dataframes usually have thousands of columns, and this code sits in a performance-critical part of my application, so I wanted to know if there is a way to do this faster.
for col in df.columns:
    amount_votes = len(df[df[col] == 1.0])
    if amount_votes > len(df) / 2:
        return col
In this case, the code should return 'b', since that is the first column in which the majority of elements are equal to 1.0
Try:
print((df.eq(1).sum() > len(df) // 2).idxmax())
Prints:
b
Find columns with more than half of values equal to 1.0
cols = df.eq(1.0).sum().gt(len(df)/2)
Get first one:
cols[cols].head(1)
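One caveat worth noting (my addition, not part of either approach above): idxmax on an all-False mask still returns the first label, so if it is possible that no column has a majority of 1.0 values, add a small guard. A sketch using the same df:
mask = df.eq(1.0).sum().gt(len(df) / 2)
# idxmax would return 'a' even if no column qualified, so check any() first.
first_col = mask.idxmax() if mask.any() else None
print(first_col)   # 'b' for the example data, None if nothing qualifies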

Question about drop=True in pd.dataframe.reset_index()

In a Pandas dataframe, it's possible to reset the index using the reset_index() method. One optional argument is drop=True which according to the documentation:
drop : bool, default False
Do not try to insert index into dataframe columns.
This resets the index to the default integer index.
My question is, what does the first sentence mean? Will it try to convert an integer index to a new column in my df if I leave it False?
Also, will my row order be preserved or should I also sort to ensure proper ordering?
As you can see below, df.reset_index() will move the index into the dataframe as a column. If the index was just a generic numerical index, you probably don't care about it and can just discard it. Below is a simple dataframe, but I dropped the first row just to have differing values in the index.
df = pd.DataFrame([['a', 10], ['b', 20], ['c', 30], ['d', 40]], columns=['letter','number'])
df = df[df.number > 10]
print(df)
# letter number
# 1 b 20
# 2 c 30
# 3 d 40
Default behavior now shows a column named index which was the previous index. You can see that df['index'] matches the index from above, but the index has been renumbered starting from 0.
print(df.reset_index())
# index letter number
# 0 1 b 20
# 1 2 c 30
# 2 3 d 40
drop=True doesn't pretend the old index was important; it simply discards it and gives you a new default index.
print(df.reset_index(drop=True))
# letter number
# 0 b 20
# 1 c 30
# 2 d 40
Regarding row order, reset_index preserves it - it only replaces the index labels and does not reorder the rows. If you are performing an aggregate function that depends on order, you probably still want to make sure the data is sorted properly before the aggregation.
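If you do want to keep the old index but under a meaningful column name rather than the generic index, one option (a small sketch, not part of the original answer, using the same filtered df from above) is to name the index with rename_axis before resetting it:
# Give the index a name first, so reset_index creates a column
# called 'old_pos' instead of 'index'.
kept = df.rename_axis('old_pos').reset_index()
print(kept)
#    old_pos letter  number
# 0        1      b      20
# 1        2      c      30
# 2        3      d      40
Newer pandas versions (1.5+) also accept a names= argument to reset_index for the same effect.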

Preserving Cases while Column Matching from List in Pandas

I have a dataframe with strings with varying cases for column names as well as a list of lowercase strings.
a = [5, 5, 4, 6]
b = [4, 4, 4, 4]
c = [6, 8, 2, 3]
d = [8, 6, 4, 3]
df = pd.DataFrame({'a': a,
'B_B': b,
'CC_Cc': c,
'Dd_DdDd': d})
cols = ['b_b', 'cc_cc', 'dd_dddd']
I want to select the columns in df that match the strings in cols while preserving the cases of the columns in df. I've been able to match the column names by making them all lowercase, but I'm not sure how to keep the original cases of the dataframe columns.
In this case I would want to create a new dataframe with only the columns of df that match cols, but with their original cases. How would I go about doing this?
Desired output:
B_B CC_Cc Dd_DdDd
0 4 6 8
1 4 8 6
2 4 2 4
3 4 3 3
You can use str.lower() to convert the column names to lower case, then build a boolean mask with the isin method to select the columns; the column names themselves are not altered this way:
df.loc[:, df.columns.str.lower().isin(cols)]
An alternative is the filter method, specifying the (?i) modifier in the regex so case is ignored:
df.filter(regex="(?i)" + "|".join(cols))
Note that this regex method also matches column names that merely contain one of the patterns in cols; if you want an exact match ignoring case, add word boundaries:
df.filter(regex="(?i)\\b" + "\\b|\\b".join(cols) + "\\b")
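If you also want the result to follow the order of cols rather than the column order of df, another option (a minimal sketch using the same df and cols as above) is a lowercase-to-original name mapping:
# Map lowercased names back to the original, case-preserving names.
lower_to_orig = {c.lower(): c for c in df.columns}
keep = [lower_to_orig[c] for c in cols if c in lower_to_orig]

result = df[keep]
print(list(result.columns))   # ['B_B', 'CC_Cc', 'Dd_DdDd']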

Python pandas - select by row

I am trying to select rows in a pandas data frame based on its values matching those of another data frame. Crucially, I only want to match values within rows, not throughout the whole series. For example:
df1 = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})
df2 = pd.DataFrame({'a':[3, 2, 1], 'b':[4, 5, 6]})
I want to select rows where both 'a' and 'b' values from df1 match any row in df2. I have tried:
df1[(df1['a'].isin(df2['a'])) & (df1['b'].isin(df2['b']))]
This of course returns all rows, as all the values are present somewhere in df2, just not necessarily in the same row. How can I limit this so the values tested for 'b' are only those in rows where the 'a' value was also found? So with the example above, I am expecting only row index 1 ([2, 5]) to be returned.
Note that data frames may be of different shapes, and contain multiple matching rows.
Similar to this post, here's one using broadcasting -
df1[(df1.values == df2.values[:,None]).all(-1).any(0)]
The idea is:
1) Use np.all for the "both" part in "both 'a' and 'b' values".
2) Use np.any for the "any" part in "from df1 match any row in df2".
3) Use broadcasting to do all of this in a vectorized fashion, extending dimensions with None/np.newaxis.
Sample run -
In [41]: df1
Out[41]:
a b
0 1 4
1 2 5
2 3 6
In [42]: df2 # Modified to add another row : [1,4] for variety
Out[42]:
a b
0 3 4
1 2 5
2 1 6
3 1 4
In [43]: df1[(df1.values == df2.values[:,None]).all(-1).any(0)]
Out[43]:
a b
0 1 4
1 2 5
Use NumPy broadcasting. This builds a boolean cross-table whose (i, j) entry says whether row i of df1 equals row j of df2; reducing it with any along the df2 axis then gives the row mask for df1:
pd.DataFrame((df1.values[:, None] == df2.values).all(2),
             pd.Index(df1.index, name='df1'),
             pd.Index(df2.index, name='df2'))
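As an alternative to the broadcasting approaches (not from either answer above), an inner merge on all shared columns also keeps only the rows of df1 that appear as complete rows in df2. A sketch using the original df1 and df2 from the question (before the extra [1, 4] row was added):
# With no 'on' argument, merge joins on all common columns ('a' and 'b'),
# so only full-row matches survive the inner join. drop_duplicates guards
# against duplicate rows in df2 multiplying the matches.
matched = df1.merge(df2.drop_duplicates(), how='inner')
print(matched)
#    a  b
# 0  2  5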

Pandas combine row element in one

I have a Pandas DataFrame that is populated from a CSV; after that I read the columns and iterate element by element in each row (for each element in a column) and write that element to a file. My problem is that I have elements in a row that I want joined into one element.
Say I have columns A through Z, and let's say their elements are 1 to 26. Let's say I want to join the numbers 9 and 10 (columns I and J) into one element only (columns I and J become one, and its values become [9, 10]).
How do I achieve that using pandas (while iterating)?
My code is long but you can find it here. I've tried groupby, but I think it only works with booleans and ints (correct me if I'm wrong).
Also, I'm pretty new to Python; any advice on my code would be much appreciated!
Here is an example. It adds a new column where each entry is a list of the values from two other columns. I hope it helps!
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4))
df[4] = [[df[2][x], df[3][x]] for x in range(df.shape[0])]
You can concat the columns, then convert to a list using numpy's tolist():
In [56]: df = pd.DataFrame(dict(A=[1,1,1], I=[9,9,9], J=[10,10,10]))
In [57]: df
Out[57]:
A I J
0 1 9 10
1 1 9 10
2 1 9 10
In [58]: df["IJ"] = pd.concat((df.I, df.J), axis=1).values.tolist()
In [59]: df.drop(["I","J"], axis=1)
Out[59]:
A IJ
0 1 [9, 10]
1 1 [9, 10]
2 1 [9, 10]
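A slightly shorter equivalent (a sketch using the same df as above): slice the two columns, turn each row of the resulting array into a list, and drop the originals.
df["IJ"] = df[["I", "J"]].values.tolist()   # each row becomes [9, 10]
df = df.drop(columns=["I", "J"])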
