Preserving Cases while Column Matching from List in Pandas - python

I have a dataframe whose column names are strings with varying cases, as well as a list of lowercase strings.
import pandas as pd
a = [5, 5, 4, 6]
b = [4, 4, 4, 4]
c = [6, 8, 2, 3]
d = [8, 6, 4, 3]
df = pd.DataFrame({'a': a,
                   'B_B': b,
                   'CC_Cc': c,
                   'Dd_DdDd': d})
cols = ['b_b', 'cc_cc', 'dd_dddd']
I want to select the columns in df that match the strings in cols while preserving the original cases of the columns in df. I've been able to match the column names by making them all lowercase, but I'm not sure how to save the original cases of the dataframe columns.
In this case I would want to create a new dataframe containing only the columns of df that match cols, but with their original cases. How would I go about doing this?
Desired output:
   B_B  CC_Cc  Dd_DdDd
0    4      6        8
1    4      8        6
2    4      2        4
3    4      3        3

You can use str.lower() to convert the column names to lower case, then build a boolean mask with the isin method to select the columns; the column names themselves are not altered this way:
df.loc[:, df.columns.str.lower().isin(cols)]
An alternative is to use the filter method, adding the inline modifier (?i) to the regex so that case is ignored:
df.filter(regex="(?i)" + "|".join(cols))
Note that this regex method also matches column names that merely contain one of the patterns in cols; if you want an exact match (ignoring case), add word boundaries:
df.filter(regex="(?i)\\b" + "\\b|\\b".join(cols) + "\\b")
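If you would rather look the columns up explicitly (for example, to keep the order given in cols), another option is a small mapping from lowercase names back to the originals; a minimal sketch using the df and cols defined above:
lower_to_orig = {c.lower(): c for c in df.columns}
keep = [lower_to_orig[c] for c in cols if c in lower_to_orig]
new_df = df[keep]   # columns keep their original cases, in the order given by cols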

Related

How to slice a dataframe with list comprehension by using a condition that compares two consecutive rows

I have a dataframe:
import pandas as pd
dfx = pd.DataFrame()
dfx['Exercise'] = ['squats', 'squats', 'rest', 'rest', 'squats', 'squats', 'rest', 'situps']
dfx['Score'] = [8, 7, 6, 5, 4, 3, 2, 1]
By using a list comprehension (or any technique other than looping), I want to create a list list_ex that contains slices of dfx in which consecutive rows have the same exercise. As soon as the exercise in the next row differs from the one in the current row, the next dataframe slice begins.
With respect to the example, that means that list_ex should contain 5 dataframes:
The first df contains two rows: ('squats', 8) and ('squats', 7)
The second df contains two rows: ('rest', 6) and ('rest', 5)
The third df contains two rows: ('squats', 4) and ('squats', 3)
The fourth df contains one row: ('rest', 2)
The fifth df contains one row: ('situps', 1)
Each dataframe should have the same header as dfx.
Sorry for explaining this in such a way. I was not able to produce code for the desired list of dataframes.
I tried using a list comprehension but could not work out how to include the comparison between the current row and the next one. How can I reach the desired result?
Is this what you want? Comparing each exercise with the previous row (shift) and taking a cumulative sum builds a group id that increases every time the exercise changes; grouping by it gives one sub-frame per consecutive run:
group = dfx['Exercise'].ne(dfx['Exercise'].shift()).cumsum()
list_ex = [g for _, g in dfx.groupby(group)]
output:
[  Exercise  Score
 0   squats      8
 1   squats      7,
    Exercise  Score
 2     rest      6
 3     rest      5,
    Exercise  Score
 4   squats      4
 5   squats      3,
    Exercise  Score
 6     rest      2,
    Exercise  Score
 7   situps      1]
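If you specifically want the splitting expressed as a list comprehension, a rough equivalent (same change-point idea, assuming dfx as defined above) is to slice between the positions where the exercise changes:
import numpy as np
# positions where the exercise differs from the previous row, plus the end of the frame
breaks = np.flatnonzero(dfx['Exercise'].ne(dfx['Exercise'].shift()).to_numpy()).tolist() + [len(dfx)]
list_ex = [dfx.iloc[start:stop] for start, stop in zip(breaks, breaks[1:])]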

Selecting rows based on Boolean values in a non dangerous way

This is an easy question since it is so fundamental. In R, when you want to slice rows from a dataframe based on some condition, you just write the condition and it selects the corresponding rows. For example, if only the third row meets the condition, it returns the third row. Easy peasy.
In Python, you have to use loc. If the index matches the row numbers, everything is great. If you have been removing rows or reordering them for any reason, you have to remember that loc selects by INDEX, NOT ROW POSITION. So if the third row of your current dataframe is the one that matches your condition, asking loc for label 3 may retrieve the row whose index is 3 - which could be the 50th row, rather than your current third row. This seems like an incredibly dangerous way to select rows, so I know I am doing something wrong.
So what is the best-practice way of ensuring you select the nth row based on a boolean condition? Is it just to use loc and "always remember to reset_index - otherwise, if you miss it even once, your entire dataframe is wrecked"? This can't be it.
Use iloc instead of loc for integer based indexing:
import pandas as pd
data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data, index=[1, 2, 3])
df
Dataset:
   A  B  C
1  1  4  7
2  2  5  8
3  3  6  9
Label-based indexing:
df.loc[1]
Results:
A 1
B 4
C 7
Integer-based (positional) indexing:
df.iloc[1]
Results:
A 2
B 5
C 8
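If the condition is a boolean mask computed from the same dataframe (rather than a hard-coded integer), plain boolean indexing is already safe whatever the index labels are, because the mask carries the dataframe's own index. A minimal sketch, reusing the df above:
# the mask is aligned on df's index, so no reset_index is needed
df[df['A'] == 2]                  # the row where A == 2, whatever its index label is
# for "the nth matching row by position", combine a numpy mask with iloc
mask = (df['A'] > 1).to_numpy()
df.iloc[mask.nonzero()[0][0]]     # first matching row by position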

pandas.DataFrame.value_counts : How to sort results based upon separate list?

I have a list of labels, labels = ['a', 'b', 'c', 'd'], and a dataframe in which each row consists of an index followed by one of these letters. Some letters are repeated, whereas others may not appear at all. I'm looking for the best way to produce a list of four numbers, where each entry counts the occurrences of the corresponding letter and the ordering is prescribed by labels.
import pandas as pd
labels = ['a', 'b', 'c', 'd']
occurrences = ['d', 'd', 'b', 'c', 'b', 'd', 'b', 'c', 'c', 'd', 'b', 'd']
# Observe that 'a' never appears.
df = pd.DataFrame(occurrences, columns=['occurences'])
counts = df['occurences'].value_counts()
Here, counts is a Series whose index is d, b, c with values 5, 4, 3. What I want is the list
[0, 4, 3, 5]
You can reindex the Series returned by value_counts:
In [337]: counts.reindex(labels).fillna(0).astype(int)
Out[337]:
a 0
b 4
c 3
d 5
Name: occurences, dtype: int32
If you just want the values, you can cast it to a list:
In [339]: list(counts.reindex(labels).fillna(0).astype(int))
Out[339]: [0, 4, 3, 5]
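In recent pandas versions reindex also accepts a fill_value, which folds the fillna/astype step into a single call; if you just need the list:
counts.reindex(labels, fill_value=0).tolist()
# [0, 4, 3, 5]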

Replace the values of multiple rows with the values of another row based on a condition in Pandas

I want to replace the values of the columns A, B, C, D with the values from the row where region = '' for a given value of the unique column in the year 2011. For example, for the unique value 1 in 2011, its 3, 4, 9, 8 values would be replaced with 6, 6, 6, 6; this approach would then be applied to the unique values 2 and 3. Afterwards the rows where region = '' would be dropped.
Other questions related to this don't have the answers I am looking for. I have tried using loc but to no avail.
I think you need this:
df = df.groupby(['unique', 'year']).agg('last').reset_index()

Python pandas - select by row

I am trying to select rows in a pandas dataframe based on its values matching those of another dataframe. Crucially, I only want to match values row by row, not anywhere in the whole series. For example:
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [3, 2, 1], 'b': [4, 5, 6]})
I want to select rows where both 'a' and 'b' values from df1 match any row in df2. I have tried:
df1[(df1['a'].isin(df2['a'])) & (df1['b'].isin(df2['b']))]
This of course returns all rows, as all the values are present in df2 at some point, though not necessarily in the same row. How can I limit this so that the values tested for 'b' are only those in rows where the 'a' value was found? With the example above, I am expecting only row index 1 ([2, 5]) to be returned.
Note that data frames may be of different shapes, and contain multiple matching rows.
Similar to this post, here's one using broadcasting -
df1[(df1.values == df2.values[:,None]).all(-1).any(0)]
The idea is:
1) Use np.all for the "both" part in "both 'a' and 'b' values".
2) Use np.any for the "any" part in "from df1 match any row in df2".
3) Use broadcasting to do all of this in a vectorized fashion, extending dimensions with None/np.newaxis.
Sample run -
In [41]: df1
Out[41]:
   a  b
0  1  4
1  2  5
2  3  6

In [42]: df2  # Modified to add another row : [1,4] for variety
Out[42]:
   a  b
0  3  4
1  2  5
2  1  6
3  1  4

In [43]: df1[(df1.values == df2.values[:,None]).all(-1).any(0)]
Out[43]:
   a  b
0  1  4
1  2  5
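If the rows in each frame are unique, a simpler (though less general) alternative is an inner merge on the shared columns; note that merge resets the index and would duplicate rows if either frame contained repeats:
df1.merge(df2)   # inner join on the common columns 'a' and 'b'; keeps only rows present in both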
Use numpy broadcasting to build a boolean matrix showing which rows of df1 match which rows of df2:
pd.DataFrame((df1.values[:, None] == df2.values).all(2),
             pd.Index(df1.index, name='df1'),
             pd.Index(df2.index, name='df2'))
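To turn that match matrix back into a row selection from df1, one option (a sketch, using the match DataFrame built above) is to reduce it across the columns:
match = pd.DataFrame((df1.values[:, None] == df2.values).all(2),
                     pd.Index(df1.index, name='df1'),
                     pd.Index(df2.index, name='df2'))
df1[match.any(axis=1)]   # keep df1 rows that match at least one row of df2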
