I have a dataframe that has names such as these for its columns:
column_names = ["c_12_2_heart",
                "c_29_4_lung",
                "c_21_21_stomach",
                "c_2_25_bladder",
                "c_40_1_kidney"]
In Python, how can I return a list of only the dataframe columns where the number after the first underscore is greater than 20?
We can use a list comprehension with basic string splitting logic:
column_names = ["c_12_2_heart", "c_29_4_lung", "c_21_21_stomach", "c_2_25_bladder", "c_40_1_kidney"]
output = [x for x in column_names if int(x.split("_")[1]) > 20]
print(output) # ['c_29_4_lung', 'c_21_21_stomach', 'c_40_1_kidney']
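If the names live on an actual DataFrame rather than a plain list, the same comprehension works directly over df.columns. A minimal sketch, assuming a frame whose columns follow the c_&lt;n&gt;_&lt;m&gt;_&lt;organ&gt; pattern from the question:

```python
import pandas as pd

# Hypothetical empty DataFrame carrying the column names from the question
df = pd.DataFrame(columns=["c_12_2_heart", "c_29_4_lung", "c_21_21_stomach",
                           "c_2_25_bladder", "c_40_1_kidney"])

# Keep only columns whose number after the first underscore exceeds 20
selected = [c for c in df.columns if int(c.split("_")[1]) > 20]
print(selected)  # ['c_29_4_lung', 'c_21_21_stomach', 'c_40_1_kidney']
```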
Alternatively to what @Tim Biegeleisen wrote, you can use numpy.where after splitting the column strings.
import numpy as np
numbers = np.array([int(c.split('_')[1]) for c in column_names])
inds = np.where(numbers > 20)[0]
column_names_filt = [column_names[i] for i in inds]
print(column_names_filt)
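Since numbers is already a NumPy array, the np.where step can also be skipped by boolean-indexing an array of the names directly. A small variation on the snippet above, using the same data:

```python
import numpy as np

column_names = ["c_12_2_heart", "c_29_4_lung", "c_21_21_stomach",
                "c_2_25_bladder", "c_40_1_kidney"]
numbers = np.array([int(c.split('_')[1]) for c in column_names])

# A boolean mask selects the matching names in one step
filtered = np.array(column_names)[numbers > 20].tolist()
print(filtered)  # ['c_29_4_lung', 'c_21_21_stomach', 'c_40_1_kidney']
```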
I have a DataFrame that contains two columns, 'A_List' and 'B_List', which are of the string dtype. I have converted these to lists and I would like to now perform element wise addition of the elements in the lists at specific indices. I have attached an example of the csv file I'm using. When I do the following, I am getting an output that is joining the elements at the specified indices as opposed to finding their sum. What may I try differently to achieve the sum instead?
For example, when I do row["A_List"][0] + row["B_List"][3], the desired output would be 0.16 (since 0.1+0.06 = 0.16). Instead, I am getting 0.10.06 as my answer.
import pandas as pd
df = pd.read_csv('Example.csv')
# Get rid of the brackets []
df["A_List"] = df["A_List"].apply(lambda x: x.strip("[]"))
df["B_List"] = df["B_List"].apply(lambda x: x.strip("[]"))
# Convert the string dtype of values into a list
df["A_List"] = df["A_List"].apply(lambda x: x.split())
df["B_List"] = df["B_List"].apply(lambda x: x.split())
for i, row in df.iterrows():
    print(row["A_List"][0] + row["B_List"][3])
You need to turn the individual values into floats when parsing your string lists.
In one step, you can do this with DataFrame.applymap, which applies a given function to every element of the frame, passing a lambda that wraps str.strip and str.split in a list comprehension.
import pandas as pd
df = pd.DataFrame(
    {
        "A_List": ["[0.1 0.2 0.3]", "[1.1 1.2 1.3]"],
        "B_List": ["[0.9 0.8 0.7]", "[0.4 0.3 0.2]"],
    }
)
df[["A_List", "B_List"]] = df[["A_List", "B_List"]].applymap(
    lambda x: [float(v) for v in x.strip("[]").split()]
)
for i, row in df.iterrows():
    print(row["A_List"][0] + row["B_List"][2])
prints
0.7999999999999999
1.3000000000000003
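Note that pandas 2.1 renamed DataFrame.applymap to DataFrame.map and deprecated the old name. If your version warns about the deprecation, the same parsing can be written version-agnostically; a sketch using the same illustrative data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A_List": ["[0.1 0.2 0.3]", "[1.1 1.2 1.3]"],
        "B_List": ["[0.9 0.8 0.7]", "[0.4 0.3 0.2]"],
    }
)

parse = lambda x: [float(v) for v in x.strip("[]").split()]
try:
    # pandas >= 2.1: elementwise mapping lives on DataFrame.map
    df[["A_List", "B_List"]] = df[["A_List", "B_List"]].map(parse)
except AttributeError:
    # older pandas: fall back to applymap
    df[["A_List", "B_List"]] = df[["A_List", "B_List"]].applymap(parse)

print(df.loc[0, "A_List"][0] + df.loc[0, "B_List"][2])  # 0.7999999999999999
```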
How do I drop row number i of a DataFrame?
I tried the following, but it is not working:
DF = DF.drop(i)
What am I missing?
You must pass a label to drop. Here drop tries to use i as a label and fails (with a KeyError) as your index probably has other values. Worse, if the index were composed of integers in random order, you might drop an incorrect row without noticing it.
Use:
df.drop(df.index[i])
Example:
df = pd.DataFrame({'col': range(4)}, index=list('ABCD'))
out = df.drop(df.index[2])
output:
col
A 0
B 1
D 3
pitfall
In case of duplicated indices, you might remove unwanted rows!
df = pd.DataFrame({'col': range(4)}, index=list('ABAD'))
out = df.drop(df.index[2])
output (A is incorrectly dropped!):
col
B 1
D 3
workaround:
import numpy as np
out = df[np.arange(len(df)) != i]
drop several indices by position:
import numpy as np
out = df[~np.isin(np.arange(len(df)), [i, j])]
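If positional dropping comes up often, another option (an alternative sketch, not part of the answer above) is to reset to a fresh 0..n-1 integer index first; with a clean RangeIndex, dropping by label and by position coincide, which also sidesteps the duplicated-index pitfall:

```python
import pandas as pd

df = pd.DataFrame({'col': range(4)}, index=list('ABAD'))
i = 2  # position of the row to remove

# With a fresh RangeIndex, labels equal positions, so drop(i) is unambiguous
out = df.reset_index(drop=True).drop(i)
print(out['col'].tolist())  # [0, 1, 3]
```

The trade-off is that reset_index(drop=True) discards the original labels, so only use this when you do not need to keep them.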
You need to add square brackets:
df = df.drop([i])
Try This:
df.drop(df.index[i])
I have a dataframe with 20 columns, and 3 of those columns (always the same) may contain one or more of these strings ["fraction", "fractional", "1/x", "one fifth"].
I want to add a new column that says whether or not each row is "fractional" (in other words, contains one of those words). This column could have Y or N to indicate this.
I've tried to do it with iterrows, like so:
list_of_fractional_widgets = []
for index, row in df.iterrows():
    fractional_keywords = ["fraction", "fractional", "1/x", "one fifth", "Fraction"]
    # use str to remove offending nan values
    xx = str(row["HeaderX"])
    yy = str(row["HeaderY"])
    zz = str(row["HeaderZ"])
    widget_data = [xx, yy, zz]
    for word in fractional_keywords:
        found = [True for x in widget_data if word in x]
        if len(found) > 0:
            list_of_fractional_widgets.append('Y')
            break
    if len(found) == 0:
        list_of_fractional_widgets.append('N')
df['Fractional?'] = list_of_fractional_widgets
however, I'm trying to understand if there is a more pandas / numpy efficient way to do so. Something like:
np.where(df['HeaderX'].str.contains(fractional_keywords?)), True)
as described in this SO question, but using a list and different headers.
Create a single pattern by joining all the words with '|'. Then we check the condition in each column separately using Series.str.contains and create a single mask using np.logical_or.reduce.
Sample Data
import pandas as pd
import numpy as np
keywords = ["fraction", "fractional", "1/x", "one fifth", "Fraction"]
np.random.seed(45)
df = pd.DataFrame(np.random.choice(keywords+list('abcdefghijklm'), (4,3)),
columns=['HeaderX', 'HeaderY', 'HeaderZ'])
Code
pat = '|'.join(keywords)
df['Fractional?'] = np.logical_or.reduce(
    [df[col].str.contains(pat) for col in ['HeaderX', 'HeaderY', 'HeaderZ']])
HeaderX HeaderY HeaderZ Fractional?
0 g one fifth fraction True
1 one fifth Fraction k True
2 fractional j d True
3 j d h False
As a bonus, Series.str.contains can accept a case=False argument to ignore case when matching so there is no need to separately specify both 'fraction' and 'Fraction' (or any arbitrary capitalization like 'FracTIOn').
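The per-column loop plus np.logical_or.reduce can also be expressed with DataFrame.apply and any(axis=1), which reads naturally if you prefer to stay inside pandas. A sketch with hand-written sample rows (the na=False argument guards against missing values):

```python
import pandas as pd
import numpy as np

keywords = ["fraction", "fractional", "1/x", "one fifth"]
pat = '|'.join(keywords)

df = pd.DataFrame({
    'HeaderX': ['g', 'one fifth', 'fractional', 'j'],
    'HeaderY': ['one fifth', 'Fraction', 'j', 'd'],
    'HeaderZ': ['fraction', 'k', 'd', 'h'],
})

cols = ['HeaderX', 'HeaderY', 'HeaderZ']
# One boolean column per header, then collapse row-wise with any()
mask = df[cols].apply(lambda s: s.str.contains(pat, case=False, na=False)).any(axis=1)
df['Fractional?'] = np.where(mask, 'Y', 'N')
print(df['Fractional?'].tolist())  # ['Y', 'Y', 'Y', 'N']
```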
In a Pandas data frame column, I want to convert each character in a string to an integer (as is done with ord()) and add 100 to the left. I know how to do this with a regular string:
st = "JOHNSMITH4817001141979"
a = [ord(x) for x in st]
b = []
for x in a:
    b.append('{:03}'.format(x))  # Add leading zero, ensuring 3 digits
b = ['100'] + b
b = ''.join(b)
b = int(b)
b
Result: 100074079072078083077073084072052056049055048048049049052049057055057
But what if I wanted to perform this operation on every cell of a column in a Pandas data frame like this one?
import pandas as pd
df = pd.DataFrame({'string':['JOHNSMITH4817001141979','JOHNSMYTHE4817001141979']})
df
string
0 JOHNSMITH4817001141979
1 JOHNSMYTHE4817001141979
I just need a separate column with the result as an integer for each cell in 'string'.
Thanks in advance!
First, you transform your processing chain into a function such as:
def get_it(a):
    a = [ord(x) for x in a]  # note: use the argument, not the global st
    b = []
    for x in a:
        b.append('{:03}'.format(x))  # Add leading zero, ensuring 3 digits
    b = ['100'] + b
    b = ''.join(b)
    return int(b)
and then you call it iteratively for each element in the column and make this list the new column
df['result'] = [get_it(i) for i in df['string']]
Although this does work, you can likely find a better solution by optimizing the get_it function itself.
Also, you can do the following:
def get_it(a):
    a = [ord(x) for x in a]  # note: use the argument, not the global st
    b = []
    for x in a:
        b.append('{:03}'.format(x))  # Add leading zero, ensuring 3 digits
    b = ['100'] + b
    b = ''.join(b)
    return int(b)
df['result'] = df['string'].apply(get_it)
If you want a one-liner (Python 3.6+):
import pandas as pd
df = pd.DataFrame({'string':['JOHNSMITH4817001141979','JOHNSMYTHE4817001141979']})
df['string'].apply(lambda x:''.join(['100']+[f'{ord(i):03}' for i in x])).astype(int)
For Python < 3.6, replace the f-string with '{:03}'.format(ord(i)). What I have done is transform your function into a lambda expression and apply it to the column.
I need to make a pandas DataFrame that has a column filled with hyphenated numbers. The only way I could think of to do this was to use strings. This all worked fine, until I needed to sort them to get them back into order after a regrouping. The problem is that strings sort like this:
['100-200','1000-1100','1100-1200','200-300']
This is clearly not how I want it sorted. I want it sorted numerically. How can I get this to work? I am willing to change anything; keeping the hyphenated string as an integer or float would be best, but I am unsure how to do that.
You could try something like this:
>>> t = ['100-200','1000-1100','1100-1200','200-300']
>>> t.sort(key=lambda x: [int(y) for y in x.split('-')])
>>> t
['100-200', '200-300', '1000-1100', '1100-1200']
This would allow you to sort on integers, and if a hyphen exists, it will sort first by the first integer in the key list and then by the second. If no hyphen exists, you will sort just on the integer equivalent of the string:
>>> t = ['100-200','1000-1100','1100-1200','200-300', '100']
>>> t.sort(key=lambda x: [int(y) for y in x.split('-')])
>>> t
['100', '100-200', '200-300', '1000-1100', '1100-1200']
If you have any float equivalents in any strings, simply change int to float like this:
>>> t = ['100-200.3','1000.5-1100','1100.76-1200','200-300.75', '100.35']
>>> t.sort(key=lambda x: [float(y) for y in x.split('-')])
>>> t
['100-200.3', '100.35', '200-300.75', '1000.5-1100', '1100.76-1200']
You could use sorted to construct a new ordering for the index, and then perform the sort (reordering) using df.take:
import pandas as pd
df = pd.DataFrame({'foo':['100-200','1000-1100','1100-1200','200-300']})
order = sorted(range(len(df)),
               key=lambda idx: [int(x) for x in df['foo'].iloc[idx].split('-')])
df = df.take(order)
print(df)
yields
foo
0 100-200
3 200-300
1 1000-1100
2 1100-1200
This is similar to @275365's solution, but note that the sorting is done on range(len(df)), not on the strings themselves. The strings are only used in the key parameter to determine the order in which range(len(df)) should be rearranged.
Using sorted works fine if the DataFrame is small. You can get better performance when the DataFrame is of moderate size (for example, a few hundred rows on my machine), by using numpy.argsort instead:
import pandas as pd
import numpy as np
df = pd.DataFrame({'foo':['100-200','1000-1100','1100-1200','200-300']*100})
arr = df['foo'].map(lambda item: [int(x) for x in item.split('-')]).values
order = np.argsort(arr)
df = df.take(order)
Alternatively, you could split your string column into two integer-valued columns, and then sort with df.sort_values:
import pandas as pd
df = pd.DataFrame({'foo': ['100-200', '1000-1100', '1100-1200', '200-300']})
df[['start', 'end']] = df['foo'].apply(lambda val: pd.Series([int(x) for x in val.split('-')]))
df.sort_values(['start', 'end'], inplace=True)
print(df)
yields
foo start end
0 100-200 100 200
3 200-300 200 300
1 1000-1100 1000 1100
2 1100-1200 1100 1200
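In pandas 1.1.0 and later, DataFrame.sort_values also accepts a key callable, so the split into helper columns can be avoided entirely. A sketch (the key function receives the whole Series and must return a like-indexed Series, here the integer before the hyphen):

```python
import pandas as pd

df = pd.DataFrame({'foo': ['100-200', '1000-1100', '1100-1200', '200-300']})

# Sort by the integer before the hyphen; ties could additionally
# consider the second number if the data required it
out = df.sort_values('foo', key=lambda s: s.str.split('-').str[0].astype(int))
print(out['foo'].tolist())  # ['100-200', '200-300', '1000-1100', '1100-1200']
```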