Check if terms are in columns and remove - python

Originally I wanted to filter only for specific terms; however, I've found the pattern matches any column name containing the terms, not just exact names, e.g.:
import re
possibilities = ['temp', 'degc']
temp = (df.filter(regex='|'.join(re.escape(x) for x in possibilities))
          .columns.to_list())
The output does find the correct columns, but unfortunately it also returns columns like temp_uncalibrated, which I do not want.
So to solve this, so far I define and remove the unwanted columns before filtering, i.e.:
if 'temp_uncalibrated' in df.columns:
    df = df.drop('temp_uncalibrated', axis=1)
else:
    pass
However, I keep finding more of these unwanted columns, and now the code looks messy and hard to read with all the terms. Is there a way to do this more succinctly? I tried putting the terms in a list and dropping them that way, but it does not work, i.e.:
if list in df.columns:
    df = df.drop(list, axis=1)
else:
    pass
I thought maybe a def function might be a better way to do it, but not really sure where to start.
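One succinct option (a sketch, not from the original thread; the column names below are hypothetical): df.drop accepts a whole list at once, and errors='ignore' skips names that aren't present, so no per-column membership check is needed. Anchoring the regex also keeps names like temp_uncalibrated out of the filter result:
import re
import pandas as pd

df = pd.DataFrame(columns=['temp', 'degc', 'temp_uncalibrated', 'degc_raw'])

# Drop every unwanted column in one call; errors='ignore' silently
# skips any name that is not actually in df.columns.
unwanted = ['temp_uncalibrated', 'degc_raw']  # hypothetical names
df = df.drop(columns=unwanted, errors='ignore')

# Alternatively, anchor the pattern so only exact names match:
possibilities = ['temp', 'degc']
pattern = '^(?:' + '|'.join(re.escape(x) for x in possibilities) + ')$'
exact_cols = df.filter(regex=pattern).columns.to_list()  # ['temp', 'degc']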


Using Multiple Wildcards in Python Pandas

Thanks for helping out. Greatly appreciated. I have looked through S.O. and couldn't quite get the answer I was hoping for.
I have a data frame with columns that I would like to sum, but I would like to exclude some based on a wildcard
(so I am hoping to include based on a wildcard but also exclude based on a wildcard).
My columns include:
"dose_1", "dose_2", "dose_3"... "new_dose" + "infusion_dose_1" + "infusion_dose_2" + many more similarly
I understand that if I want to sum using a wildcard, I can do
df['new_column'] = df.filter(regex = 'dose').sum(axis = 1)
but what if I want to exclude columns that contain the string "infusion"?
Appreciate it!
Regex is probably the wrong tool for this job. Excluding based on a match is overly complicated; see Regular expression to match a line that doesn't contain a word. Just use a list comprehension to select the labels:
import pandas as pd

df = pd.DataFrame(columns=["dose_1", "dose_2", "dose_3", "new_dose",
                           "infusion_dose_1", "infusion_dose_2", "foobar"])
cols = [x for x in df.columns if 'dose' in x and 'infusion' not in x]
# ['dose_1', 'dose_2', 'dose_3', 'new_dose']
df['new_column'] = df[cols].sum(axis=1)
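For completeness, the exclusion is possible in the regex itself with a negative lookahead, though it is harder to read (a sketch, reusing the df defined above):
# (?!.*infusion) rejects any label containing 'infusion' before
# requiring 'dose' somewhere in the label.
cols_re = df.filter(regex=r'^(?!.*infusion).*dose').columns.to_list()
# ['dose_1', 'dose_2', 'dose_3', 'new_dose']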

Define variable number of columns in for loop

I am new to pandas and I am creating new columns based on conditions from other existing columns using the following code:
df.loc[(df.item1_existing=='NO') & (df.item1_sold=='YES'),'unit_item1']=1
df.loc[(df.item2_existing=='NO') & (df.item2_sold=='YES'),'unit_item2']=1
df.loc[(df.item3_existing=='NO') & (df.item3_sold=='YES'),'unit_item3']=1
Basically, what this means is that if the item is NOT existing ('NO') and the item IS sold ('YES'), then give me a 1. This works to create 3 new columns, but I am thinking there is a better way. As you can see, there is a repeated string in the names of the columns: '_existing' and '_sold'. I am trying to create a for loop that will look for the name of the column that ends with that specific word and concatenate the beginning, something like this:
unit_cols = ['item1', 'item2', 'item3']
for i in unit_cols:
    df.loc[('df.'+i+'_existing'=='NO') & ('df'+i+'_sold'=='YES'), 'unit_'+i] = 1
but of course, it doesn't work. As I said, I am able to make it work with the initial example, but I would like to have fewer lines of code instead of repeating the same code, because I need to create several columns this way, not just three. Is there a way to make this easier? Is the for loop the best option? Thank you.
You can use Boolean series, i.e. True / False depending on whether your condition is met. Coupled with pd.Series.eq and f-strings (PEP498, Python 3.6+), and using __getitem__ (or its syntactic sugar []) to allow string inputs, you can write your logic more readably:
unit_cols = ['item1', 'item2', 'item3']
for i in unit_cols:
    df[f'unit_{i}'] = df[f'{i}_existing'].eq('NO') & df[f'{i}_sold'].eq('YES')
If you need integers (1 / 0) instead of Boolean values, you can convert via astype:
    df[f'unit_{i}'] = df[f'unit_{i}'].astype(int)
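A minimal runnable version of the above, with toy data of my own (not from the thread):
import pandas as pd

df = pd.DataFrame({'item1_existing': ['NO', 'YES'],
                   'item1_sold':     ['YES', 'YES']})

for i in ['item1']:
    # Boolean mask, then cast to 1/0 integers as described above.
    df[f'unit_{i}'] = (df[f'{i}_existing'].eq('NO')
                       & df[f'{i}_sold'].eq('YES')).astype(int)

print(df['unit_item1'].tolist())  # [1, 0]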

Value Error with indexing string despite string present

I want to use .index() to search a column of a 2D list and return the location of that line so I can then alter data at that location. I've been trying to solve a smaller version of this below.
data_test = [["2016-12-14T07:39:00.000000Z",0],["2016-12-14T07:40:00.000000Z",1],\
["2016-12-14T07:41:00.000000Z",2], ["2016-12-14T07:42:00.000000Z",3]]
string = "2016-12-14T07:39:00.000000Z"
if data_test[0][0] == string:
    print('works')
else:
    print("does not work")
print(data_test.index(string))
The string comparison works, so there is nothing wrong there, but the index call below it returns:
ValueError: '2016-12-14T07:39:00.000000Z' is not in list
In full operation I will be checking a list of thousands of rows, so I'm trying to avoid just looping through and doing a string comparison at each level. Any alternatives and help would be highly appreciated.
data_test = [["2016-12-14T07:39:00.000000Z",0],["2016-12-14T07:40:00.000000Z",1],
             ["2016-12-14T07:41:00.000000Z",2], ["2016-12-14T07:42:00.000000Z",3]]
string = "2016-12-14T07:39:00.000000Z"
for i, data in enumerate(data_test):
    if data[0] == string:
        print("works, index {}".format(i))
    else:
        pass
I'm not sure what you want to do with the index, so this could be a vastly inefficient way of going through the list. If you want to transform the list, it might be better to rebuild the list as you iterate through it once and make the transforms as you go.
EDIT:
Since data_test is a nested list, the \ was unnecessary: you can split the data structure over multiple lines anyway. You can also omit the else clause entirely if you're not rebuilding the list; I just assume that you are.
The string you're looking for is not in data_test but in data_test[0], so obviously data_test.index cannot find it.
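If lookups like this will happen repeatedly across thousands of rows, one way to avoid a linear scan per lookup (a sketch, not from the thread) is to build a dict mapping each timestamp to its row index once:
# O(n) to build once, then O(1) per lookup.
positions = {row[0]: i for i, row in enumerate(data_test)}
print(positions["2016-12-14T07:39:00.000000Z"])  # 0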

Python Pandas replace string based on format

Please, is there any way to replace "x-y" with "x,x+1,x+2,...,y" in every row of a data frame? (Where x and y are integers.)
For example, I want to replace every row like this:
"1-3,7" by "1,2,3,7"
"1,4,6-9,11-13,5" by "1,4,6,7,8,9,11,12,13,5"
etc.
I know that by looping through lines and using regular expressions we can do that. But the table is quite big and it takes quite some time, so I think using pandas might be faster.
Thanks a lot
In pandas you can use apply to apply any function to either rows or columns in a DataFrame. The function can be passed with a lambda, or defined separately.
(Side remark: your example does not make entirely clear whether you actually have a 2-D DataFrame or just a 1-D Series. Either way, apply can be used.)
The next step is to find the right function. Here's a rough version (without regular expressions):
def make_list(s):
    # Split on commas; expand each "x-y" part into the inclusive range x..y.
    newlst = []
    for part in s.split(','):
        if "-" in part:
            start, end = (int(j) for j in part.split("-"))
            newlst.extend(range(start, end + 1))  # end + 1 so y is included
        else:
            newlst.append(int(part))
    return newlst
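Applied to a column, it would look something like this (a sketch; the column name 'ranges' is an assumption):
import pandas as pd

df = pd.DataFrame({'ranges': ["1-3,7", "1,4,6-9,11-13,5"]})
# 'ranges' is a hypothetical column name; expand each row and join the
# integers back into a comma-separated string.
df['expanded'] = df['ranges'].apply(lambda s: ",".join(map(str, make_list(s))))
print(df['expanded'].tolist())
# ['1,2,3,7', '1,4,6,7,8,9,11,12,13,5']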

Accessing Data using df['foo'] missing data for pattern searching python

So I have this function which takes in one row from a dataframe, matches the pattern, and adds the result to the data. Since the pattern search needs its input to be a string, I am forcing it with str(). However, if I do that, it cuts off my url after a certain point.
I figured out that if I force it using the ix function
str(data.ix[0,'url'])
it does not cut off anything and gets me what I want. Also, if I use str(data.ix[:'url']),
it also cuts off after some point.
The problem is I cannot specify the index position inside the ix function, as I plan to iterate by row using the apply function. Any suggestion?
import re

def foo(data):
    url = str(data['url'])
    m = re.search(r"model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)", url)
    if m:
        data['make'] = m.group("make")
        data['model'] = m.group("model")
    return data
Iterating row-by-row is a last resort. It's almost always slower, less readable, and less idiomatic.
Fortunately, there is an easy way to do what you want to do. Check out the Series.str.extract method, added in version 0.13 of pandas.
Something like this...
pattern = r'model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)'
extracted_data = data['url'].str.extract(pattern)
The result, extracted_data, will be a new DataFrame with columns named 'model' and 'make', inferred from the named groups in your regex pattern.
Join it to your original DataFrame, and you're done.
data = data.join(extracted_data)
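Put together, a minimal runnable sketch (the sample url is an assumption):
import pandas as pd

data = pd.DataFrame({'url': ['model=corolla&id=42&make=toyota']})  # hypothetical data

pattern = r'model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)'
extracted_data = data['url'].str.extract(pattern)

data = data.join(extracted_data)
print(data[['make', 'model']])  # make: toyota, model: corolla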
