Python - Select row in NumPy array where multiple conditions are met

My program contains many different NumPy arrays, with various data inside each of them. An example of an array is:
x = [[5, 'ADC01', 'Input1', 25000],   # columns: [TypeID, Type, Input, Counts]
     [5, 'ADC01', 'Input2', 40000]]
From separate arrays I can retrieve the value of Type and Input. I then need to say
Counts = x[0,3] where Type = 'ADC01' and Input = 'Input2'
Obviously it would not be written like this. For the times when I have only needed to satisfy one condition, I have used:
InstType_ID = int(InstInv_Data[InstInv_Data[:,glo.inv_InstanceName] == Instrument_Type_L][0,glo.inv_TypeID])
Here, it looks in the InstInv_Data array at the 'InstanceName' column and finds a match for Instrument_Type. It then assigns the 'TypeID' column of that row to InstType_ID. I basically want to add an AND condition so it also looks for another matching piece of data in another column.
Edit: I just thought that I could try to do this in two separate steps: returning both the Input and Counts columns where the Type column matches Type. However, I am unsure how to actually return two columns instead of a specific one. Something like this:
Intermediate_Counts = InstDef_Data[InstDef_Data[:, glo.i_Type] == Instrument_Type_L][0, [glo.i_Input, glo.i_Counts]]

You could use a & b to perform element-wise AND for two boolean arrays a, b:
selected_rows = x[(x[:,1] == 'ADC01') & (x[:,2] == 'Input2')]
Similarly, use a | b for OR and ~a for NOT.
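For example, here is a minimal, self-contained sketch using a small object-dtype array shaped like the one in the question (the values are made up):
import numpy as np
x = np.array([[5, 'ADC01', 'Input1', 25000],
              [5, 'ADC01', 'Input2', 40000]], dtype=object)
# Each comparison yields a boolean array with one entry per row;
# & combines them element-wise, and the mask then selects the matching rows.
mask = (x[:, 1] == 'ADC01') & (x[:, 2] == 'Input2')
selected_rows = x[mask]
Counts = selected_rows[0, 3]   # 40000
Note that the parentheses around each comparison are required, because & binds more tightly than ==.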

Related

Select all rows in Python pandas

I have a function that prints the sum along a column of a pandas DataFrame after filtering on some rows (to be defined), and the percentage this quantity makes up of the same sum without any filter:
def my_function(df, filter_to_apply, col):
    my_sum = np.sum(df[filter_to_apply][col])
    print(my_sum)
    print(my_sum / np.sum(df[col]))
Now I am wondering if there is any way to have a filter_to_apply that actually doesn't do any filter (i.e. keeps all rows), to keep using my function (that is actually a bit more complex and convenient) even when I don't want any filter.
So, some filter_f1 such that df[filter_f1] gives back df unchanged, and which could be combined with other filters: filter_f1 & filter_f2.
One possible answer is: df.index.isin(df.index) but I am wondering if there is anything easier to understand (e.g. I tried to use just True but it didn't work).
A Python slice object, i.e. slice(None), acts as an object that selects all indexes in an indexable object. So df[slice(None)] would select all rows in the DataFrame (it is equivalent to df[:]). You can store that in a variable as an initial value which you can further refine in your logic:
filter_to_apply = slice(None)  # initialize to select all rows
... # logic that may set `filter_to_apply` to something more restrictive
my_function(df, filter_to_apply, col)
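A small sketch of how that can be used (made-up data; note that a slice object cannot be combined with & later, so an all-True boolean mask is an alternative if that is needed):
import numpy as np
import pandas as pd

def my_function(df, filter_to_apply, col):
    my_sum = np.sum(df[filter_to_apply][col])
    print(my_sum)
    print(my_sum / np.sum(df[col]))

df = pd.DataFrame({'price': [100, 600, 700]})

my_function(df, slice(None), 'price')            # no filtering: 1400 and 1.0

# If the "keep everything" filter must also combine with other filters via &,
# a boolean Series of all True values works where a slice object would not:
keep_all = pd.Series(True, index=df.index)
expensive = df['price'] > 500
my_function(df, keep_all & expensive, 'price')   # 1300 and ~0.93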
This is a way to select all rows:
df.iloc[range(0, len(df))]
and so is this:
df[:]
But I haven't figured out a way to pass : itself as an argument.
There's an indexer called loc in pandas that filters rows. You could do something like this:
df2 = df.loc[<Filter here>]
# The filter can be something like df['price'] > 500 or df['name'] == 'Brian',
# basically something that returns a boolean for each row.
total = df2['ColumnToSum'].sum()
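A quick illustration with made-up column names and data:
import pandas as pd

df = pd.DataFrame({'name': ['Brian', 'Ana'], 'price': [600, 300]})
df2 = df.loc[df['price'] > 500]   # the boolean mask keeps only rows where the condition holds
total = df2['price'].sum()
print(total)                      # 600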

Taking first 2 values from a list

The first column of the file I am looking at contains a list of 7 comma-separated values, like so:
alex,43,37,12,1,2,5
There are 2000 more rows that are set up the same way.
I am only interested in the first 2; the rest are unimportant to me. I am trying to assign the first two values to separate columns in my dataframe, like so:
typesx = (line.split(",") for line in df['firstcolumn'])
ty = ((type[0], type[1]) for type in typesx)
for z in range(len(df)):
    df['firstplaceholder'][z] = type(0)
    df['secondplaceholder'][z] = type(1)
However, this only places the type of each value into the respective columns (i.e. int, string), go figure.
I am confused, because if I choose to print type[0] and type[1], it prints what I am looking for (in this case alex, 43).
Like so:
for a, b in ty:
    print(a, b)
However, I am unsure how to copy these values into the separate columns I have created (firstplaceholder, secondplaceholder).
It seems to me that you are making a mistake when retrieving the values. Try the code below; note that ty has to be built as a list rather than a generator so that it can be indexed:
ty = [(t[0], t[1]) for t in typesx]
for z in range(len(df)):
    df['firstplaceholder'][z] = ty[z][0]
    df['secondplaceholder'][z] = ty[z][1]
df = pd.DataFrame(data={"1": ["a,bb,c,d"], "2": ["n,g,r,h,l"]}, index=['firstcolumn']).T
df['secondcolumn'] = df['firstcolumn'].apply(lambda s: s.split(',')[1])
df['firstcolumn'] = df['firstcolumn'].apply(lambda s: s.split(',')[0])
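A related sketch using pandas' vectorized string split (my own addition; the second data row here is made up):
import pandas as pd

df = pd.DataFrame({'firstcolumn': ['alex,43,37,12,1,2,5', 'beth,50,1,2,3,4,5']})

# Split every row once and keep only the first two pieces.
parts = df['firstcolumn'].str.split(',', expand=True)
df['firstplaceholder'] = parts[0]
df['secondplaceholder'] = parts[1]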

Drop Pandas DataFrame rows according to a GroupBy property

I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('Group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group only the first group should be left, and in my_df2_Group only the second group should remain.
As I don't know how to derive my_df1_Group and my_df2_Group from my_df1 and my_df2 programmatically in Python (I know other languages where it would simply be name+"_Group" with name looping over [my_df1, my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max = 16
Bad = []
for Sample in SampleList:
    for n in Sample[1]['Group']:
        # This is inelegant, but iterating over Sample[1] directly in the for didn't work
        df = Sample[0].loc[Sample[0]['Group'] == n]
        if df['Value'].max() > my_max:
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it doesn't add the column Bad_Row to my DataFrame, nor does it modify it (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a DataFrame (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your aggregated data frame, it subsets your initial data and uses boolean logic to see whether any of the 'Values' in your input data meet your specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'

my_df1 = pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]], columns=['Group','Value'])
my_df1_Group = pd.DataFrame([[1,57],[2,63]], columns=['Group','Group_Value'])

grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x, grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)
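For reference, a more compact sketch of the same filtering idea (my own variation, not taken from the answers above): compute each group's maximum once, then keep only the groups whose maximum stays within the limit.
import pandas as pd

my_max = 16
my_df1 = pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]], columns=['Group','Value'])
my_df1_Group = pd.DataFrame([[1,57],[2,63]], columns=['Group','Group_Value'])

# Series indexed by Group: True where the group's largest Value is within the limit.
good_groups = my_df1.groupby('Group')['Value'].max() <= my_max
my_df1_Group = my_df1_Group[my_df1_Group['Group'].map(good_groups)]
# Only group 1 remains.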

Pandas convert columns type from list to np.array

I'm trying to apply a function to a pandas dataframe; the function requires two np.array objects as input and fits them using a well-defined model.
The point is that I'm not able to apply this function starting from the selected columns, since their "rows" contain lists read from a JSON file and not np.array objects.
Now, I've tried different solutions:
#Here is where I discover the problem
train_df['result'] = train_df.apply(my_function(train_df['col1'],train_df['col2']))
#so I've tried to cast the Series before passing them to the function in both these ways:
X_col1_casted = train_df['col1'].dtype(np.array)
X_col2_casted = train_df['col2'].dtype(np.array)
doesn't work.
X_col1_casted = train_df['col1'].astype(np.array)
X_col2_casted = train_df['col2'].astype(np.array)
doesn't work.
What I'm thinking of doing now is a long procedure like:
starting from the uncast column Series, convert them into lists, iterate over them, apply the function to np.array() versions of the single elements, and append the results to a temporary list. Once done, I would convert this list into a new column. (Clearly, I don't know if it will work.)
Does anyone know how to help me?
EDIT:
I add one example to be clear:
The function expects two np.arrays as input. At the moment it gets two lists, since they are retrieved from a JSON file. The situation is this:
col1       col2       result
[1,2,3]    [4,5,6]    [5,7,9]
[0,0,0]    [1,2,3]    [1,2,3]
Clearly the real function is not a sum but my own function. For a moment, assume that this sum can work only on arrays and not on lists; what should I do?
Thanks in advance
Use apply to convert each element to its equivalent array:
df['col1'] = df['col1'].apply(lambda x: np.array(x))
type(df['col1'].iloc[0])
numpy.ndarray
Data:
df = pd.DataFrame({'col1': [[1,2,3],[0,0,0]]})
df
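Putting the pieces together, a minimal sketch (the element-wise sum stands in for the real model-fitting function, which is not shown in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [[1, 2, 3], [0, 0, 0]],
                   'col2': [[4, 5, 6], [1, 2, 3]]})

def my_function(a, b):   # placeholder for the real function that expects two np.arrays
    return a + b

df['col1'] = df['col1'].apply(np.array)
df['col2'] = df['col2'].apply(np.array)
df['result'] = df.apply(lambda row: my_function(row['col1'], row['col2']), axis=1)
print(df['result'].tolist())   # [array([5, 7, 9]), array([1, 2, 3])]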

Selecting dataframe rows based on multiple columns, where new functions should be created to handle conditions in some columns

I have a dataframe that consists of multiple columns. I want to select rows based on conditions in multiple columns. Assuming that I have four columns in a dataframe:
import pandas as pd

di = {"A": [1, 2, 3, 4, 5],
      "B": ['Tokyo', 'Madrid', 'Professor', 'helsinki', 'Tokyo Oliveira'],
      "C": ['250', '200//250', '250//250//200', '12', '200//300'],
      "D": ['Left', 'Right', 'Left', 'Right', 'Right']}
data = pd.DataFrame(di)
I want to select rows with Tokyo in column B, a value above 200 in column C, and Left in column D; with that, only the first row will be selected. I have to create a function to handle column C, since I need to check only the first value when the row contains a //-separated list.
To handle this, I assume this can be done through the following:
def check_200(thecolumn):
    thelist = []
    for i in thecolumn:
        f = i
        if "//" in f:
            # split based on //
            z = f.split("//")
            f = z[0]
        f = float(f)
        if f > 200.00:
            thelist.append(True)
        else:
            thelist.append(False)
    return thelist
Then, I will create the multiple conditions:
selecteddata = data[(data.B.str.contains("Tokyo")) &
                    (data.D.str.contains("Left")) & (check_200(data.C))]
Is this the best way to do that, or is there an easier pandas function that can handle such requirements?
I don't think there is a most pythonic way to do this, but I think this is what you want:
bool_idx = ((data.B.str.contains("Tokyo")) &
            (data.D.str.contains("Left")) &
            (data.C.str.split("//").str[0].astype(float) > 200.00))
selecteddata = data[bool_idx]
Bruno's answer does the job, and I agree that boolean masking is the way to go. This answer keeps the code a little closer to the requested format.
import numpy as np

def col_condition(col):
    col = col.apply(lambda x: float(x.split('//')[0]) > 200)
    return col

data = data[(data.B.str.contains('Tokyo')) & (data.D.str.contains("Left")) &
            col_condition(data.C)]
The function reads in a Series, and converts each element to True or False, depending on the condition. It then returns this mask.
