Select rows on multiple conditions in pandas - Python

I have a DataFrame with many rows and columns, and I also have a specific list of conditions.
An example DataFrame is below:
index      fruit  recipe   size  price
    2      apple    burn    big    100
    3     banana     fry  small    100
    5      apple   slice    big    100
    7      apple     fry  small    100
   11  pineapple     fry  small    100
   13      mango     fry  small    100
and a condition list like
order = [("apple", "fry", "big"), ("apple", "fry", "big"), ...]
isin does not work with multiple conditions like this.
I want to pick from the DataFrame only the combinations that appear in order, without using iterrows.

You can try NumPy broadcasting to compare the values:
import numpy as np

order = [("apple", "fry", "big"), ("apple", "fry", "small")]

# Compare every row's (fruit, recipe, size) against every tuple in order,
# then keep rows where all three fields match at least one tuple.
mask = ((df[['fruit', 'recipe', 'size']].to_numpy()[:, None] == np.array(order))
        .all(axis=-1)
        .any(axis=-1))
out = df[mask]
print(mask)
[False False False True False False]
print(out)
fruit recipe size price
3 apple fry small 100
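If you prefer to stay within pandas, an inner merge against the order tuples is another option; a minimal sketch, assuming the three key columns fully describe a match (note the merge result gets a fresh index):

import pandas as pd

# Build a small lookup frame from the order tuples and inner-merge on the keys;
# only rows whose (fruit, recipe, size) appears in order survive the merge.
order_df = pd.DataFrame(order, columns=['fruit', 'recipe', 'size']).drop_duplicates()
out = df.merge(order_df, on=['fruit', 'recipe', 'size'], how='inner')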

Assuming each element of the list order is a 3-tuple, you could do the following without much of a performance hit:
df.loc[pd.Series(zip(df['fruit'], df['recipe'], df['size'])).isin(order)]
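A related trick, sketched here in case it fits your data, is to build a MultiIndex from the three columns and call isin with the list of tuples:

import pandas as pd

# MultiIndex.isin accepts an iterable of tuples, so this checks each row's
# (fruit, recipe, size) triple against order in one vectorised call.
mask = pd.MultiIndex.from_frame(df[['fruit', 'recipe', 'size']]).isin(order)
out = df[mask]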

Related

filter pandas df with list comprehension instead of multiple ampersands

Let's say I have a dataframe with the columns dataset, inflation, cat, region, and some other values.
I want to filter my df according to the values of a list (or tuple), where the list holds the values of those columns that I want to select on.
My filter list will look like
filterlist=['abc', 'real', 'food', 'central']
What I'm doing now is
DF.loc[(DF['dataset']==filterlist[0]) & (DF['inflation']==filterlist[1]) & (DF['cat']==filterlist[2]) & (DF['region']==filterlist[3])]
I'd like to do something more like this:
id_cols=['dataset', 'inflation', 'cat', 'region']
DF.loc[[DF[x]==y for x,y in zip(id_cols, filterlist)]]
I hacked this together, which technically works but isn't very readable, just looks clunky, and has terrible %%timeit performance (1.18 ms vs 654 µs):
DF.iloc[list(set.intersection(*[set(DF[DF[id_cols[x]]==filterlist[x]].index) for x in range(len(id_cols))])),:]
I also did
DF.query(' and '.join([f"{x} == '{y}'" for x,y in zip(id_cols, filterlist)]))
This has the benefit of being fairly readable but is even worse performance.
There's got to be a better way!!
The answer is np.logical_and.reduce
Thanks to this answer
I found that I can do:
DF[np.logical_and.reduce([DF[x]==y for x,y in zip(id_cols, filterlist)])]
The %%timeit on this one is just 470 µs, which is even better than the parenthesis-laden example.
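For reference, here is a minimal, self-contained sketch of this approach using made-up sample data (the values below are placeholders, not from the question):

import numpy as np
import pandas as pd

DF = pd.DataFrame({'dataset': ['abc', 'abc', 'abc'],
                   'inflation': ['real', 'X', 'real'],
                   'cat': ['food', 'food', 'X'],
                   'region': ['central', 'central', 'central'],
                   'value': [1, 2, 3]})
id_cols = ['dataset', 'inflation', 'cat', 'region']
filterlist = ['abc', 'real', 'food', 'central']

# reduce folds the list of boolean Series into a single combined mask
mask = np.logical_and.reduce([DF[x] == y for x, y in zip(id_cols, filterlist)])
print(DF[mask])  # only the row matching all four values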
Use 2D comparison:
cols = ['dataset', 'inflation', 'cat', 'region']
out = df[df[cols].eq(filterlist).all(1)]
output:
dataset inflation cat region
0 abc real food central
example input df:
dataset inflation cat region
0 abc real food central
1 abc X food central
2 abc real X central
intermediates:
df[cols].eq(filterlist)
dataset inflation cat region
0 True True True True
1 True False True True
2 True True False True
df[cols].eq(filterlist).all(1)
0 True
1 False
2 False
dtype: bool
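One thing to keep in mind with this approach is that eq aligns filterlist with cols positionally. If you'd rather make the pairing explicit, a dict-based variant of the same idea (illustrative sketch, reusing df, cols, and filterlist from above) also works:

import pandas as pd

# Pair each column with its target value explicitly, then compare and reduce.
criteria = dict(zip(cols, filterlist))
out = df.loc[df[list(criteria)].eq(pd.Series(criteria)).all(axis=1)]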

How to create a DataFrame from one Series, when it is not as simple as transposing the object?

I've seen many similar questions here, but none of them applies to the case I need to solve. It happens that I have a products Series in which the names of the "future" columns end with the string [edit] and are mixed with the values that are going to be joined under them. Something like this:
Index Values
0 Soda [edit]
1 Coke
2 Sprite
3 Ice Cream [edit]
4 Nestle
5 Snacks [edit]
6 Lays
7 Act II
8 Nachos
I need to turn this into a DataFrame, to get something like:
Soda Ice Cream Snacks
0 Coke Nestle Lays
1 Sprite NaN Act II
2 NaN NaN Nachos
I made a Series called cols_index, which saves the index of the columns as in the first series:
Index Values
0 Soda [edit]
3 Ice Cream [edit]
5 Snacks [edit]
However, from here I don't know how to pass the values into the columns. As I'm new to pandas, I thought of iterating with a for loop that generates ranges referring to the elements' indexes ([1,2], [4], [6:8]), but that wouldn't be a pandorable way to do things.
How can I do this? Thanks in advance.
=========================================================
EDIT: I solved it, here's how I did it.
After reviewing the problem with a colleague, we concluded that there's no pandorable way to do it, and therefore I had to treat the data as a list and use for and if loops:
import pandas as pd

products = pd.read_csv("products_file.txt", delimiter='\n', header=None, squeeze=True)
product_list = products.values.tolist()
cols = products[products.str.contains(r'\[edit\]', case=False)].values.tolist()  # List of elements to be columns

df = []
category = product_list[0]
for item in product_list:
    if item in cols:
        category = item[:-6]  # Removes '[edit]'
    else:
        df.append((category, item))
df = pd.DataFrame(df, columns=['Category', 'Product'])
We use isin to find the column-name rows, then build the pivot keys with cumsum and cumcount, then use crosstab:
s = df1.Values.isin(df2.Values)
df = pd.crosstab(index=s.cumsum(),
                 columns=s.groupby(s.cumsum()).cumcount(),
                 values=df1.Values,
                 aggfunc='first').set_index(0).T
0 Soda IceCream Snacks
col_0
1 Coke Nestle Lays
2 Sprite NaN ActII
3 NaN NaN Nachos
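For what it's worth, here is a somewhat more pandorable sketch of the same idea (the variable names and regex are assumptions, with the raw data in a Series s): mark the '[edit]' rows, forward-fill them as the category, then pivot the remaining rows using a per-category cumcount:

import pandas as pd

s = pd.Series(['Soda [edit]', 'Coke', 'Sprite', 'Ice Cream [edit]',
               'Nestle', 'Snacks [edit]', 'Lays', 'Act II', 'Nachos'])

is_header = s.str.contains(r'\[edit\]')
category = s.where(is_header).ffill().str.replace(r'\s*\[edit\]', '', regex=True)

long = pd.DataFrame({'Category': category[~is_header], 'Product': s[~is_header]})
long['pos'] = long.groupby('Category').cumcount()
wide = long.pivot(index='pos', columns='Category', values='Product')
# Note: pivot sorts the columns alphabetically, so the column order may
# differ from the order the categories appeared in the original Series.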

Replicate a row based on a conditional

I've read through at least 10 answers to very similar questions, but none of them work and/or are quite what I need. I have a large-ish dataframe in which I need to find a particular row and create a copy of the entire row. So for example:
Before:
index price quantity flavor
0 1.45 6 vanilla
1 1.85 3 berry
2 2.25 2 double chocolate
After:
index price quantity flavor
0 1.45 6 vanilla
1 1.85 3 berry
2 2.25 2 double chocolate
3 1.85 3 berry
What would seem to work based on my knowledge of pandas and python is this:
df.loc[df.index.max() + 1,:] = df.loc[df['flavor'] == 'berry'].values
However I get this error:
ValueError: setting an array element with a sequence.
Bear in mind that I have no idea where in the database "berry" might be (other than it will be in the "flavor" column). (edit to add) Also there may be more than one "berry" so it would need to find them all.
Thoughts?
So, this is probably what you want:
import pandas as pd
df = pd.DataFrame({"quantity":[6, 3, 2], "flavor":["vanilla", "berry", "double chocolate"], "price":[1.45, 1.85, 2.25]})
df = df.append(df.loc[df['flavor'] == 'berry']).reset_index(drop=True)
df
#output
flavor price quantity
0 vanilla 1.45 6
1 berry 1.85 3
2 double chocolate 2.25 2
3 berry 1.85 3
Just using append and resetting index should do it.
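Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on newer versions the same idea can be written with pd.concat; a quick sketch:

import pandas as pd

df = pd.DataFrame({"quantity": [6, 3, 2],
                   "flavor": ["vanilla", "berry", "double chocolate"],
                   "price": [1.45, 1.85, 2.25]})

# Concatenate the original frame with every row whose flavor is 'berry'.
df = pd.concat([df, df.loc[df['flavor'] == 'berry']], ignore_index=True)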
I came up with a slightly different answer than what @user2906838 suggested. Because it is possible that there is more than one 'berry' in the dataframe, I created a new dataframe and then concatenated them:
import pandas as pd

df = pd.DataFrame({'quantity': [6, 3, 2], 'flavor': ['vanilla', 'berry', 'double chocolate'], 'price': [1.45, 1.85, 2.25]})
df_flavor = pd.DataFrame()
df_flavor = df_flavor.append(df.loc[df['flavor'] == 'berry'], sort=False)
df = pd.concat([df, df_flavor], sort=False, ignore_index=True)
This worked fine, but would love to hear if there are other solutions!

Create random.randint with condition in a group by?

I have a column called cars and want to create another called persons using random.randint(), which I have:
dat['persons']=np.random.randint(1,5,len(dat))
This is so I can store the number of persons who use each car, but I'd like to know how to add a condition so that, for example, only numbers from 4 to 9 are generated for the suv category.
cars | persons
suv 4
sedan 2
truck 2
suv 1
suv 5
You can create a boolean mask for your series, where matching rows have True and everything else has False. You can then assign to the rows matching that mask using loc[] to select them, generating just the number of values needed for those selected rows:
m = dat['cars'] == 'suv'                                   # True for the suv rows
dat.loc[m, 'persons'] = np.random.randint(4, 9, m.sum())   # one draw per matching row (upper bound is exclusive)
You could also use apply on the cars series to create the new column, creating a new random value in each call:
dat['persons'] = dat.cars.apply(
    lambda c: random.randint(4, 9) if c == 'suv' else random.randint(1, 5))
But this has to make a separate function call for each row. Using a mask will be more efficient.
Option 1
So, you're generating random numbers between 1 and 5, whereas numbers in the SUV category should be between 4 and 9. That just means you can generate a random number, and then add 4 to all random numbers belonging to the SUV category?
df = df.assign(persons=np.random.randint(1,5, len(df)))
df.loc[df.cars == 'suv', 'persons'] += 4
df
cars persons
0 suv 7
1 sedan 3
2 truck 1
3 suv 8
4 suv 8
Option 2
Another alternative would be using np.where -
df.persons = np.where(df.cars == 'suv',
                      np.random.randint(5, 9, len(df)),
                      np.random.randint(1, 5, len(df)))
df
cars persons
0 suv 8
1 sedan 1
2 truck 2
3 suv 5
4 suv 6
There may be a way to do this with something like a groupby that's more clever than I am, but my approach would be to build a function and apply it to your cars column. This is pretty flexible - it will be easy to build in more complicated logic if you want something different for each car:
def get_persons(car):
    if car == 'suv':
        return np.random.randint(4, 9)
    else:
        return np.random.randint(1, 5)

dat['persons'] = dat['cars'].apply(get_persons)
or, in a slicker but less flexible way:
dat['persons'] = dat['cars'].apply(lambda car: np.random.randint(4, 9) if car == 'suv' else np.random.randint(1, 5))
I had a similar problem. I'll describe what I did generally because the application may vary. For smaller frames it won't matter, so the above methods might work, but for larger frames like mine (i.e., hundreds of thousands to millions of rows) I would do this (a rough sketch follows the list):
1. Sort dat by 'cars'.
2. Get a unique list of cars.
3. Create a temporary list for the random numbers.
4. Loop over that list of cars, populate the temporary list of random numbers, and extend a new list with the temp list.
5. Add the new list to the 'persons' column.
6. If order matters, maintain and re-sort by the index.
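A rough sketch of that approach, assuming a per-category ranges mapping (the bounds below are placeholders) and the dat frame from the question:

import numpy as np

# Assumed (low, high) bounds per category; np.random.randint's high is exclusive.
ranges = {'suv': (4, 10), 'sedan': (1, 6), 'truck': (1, 6)}

dat = dat.sort_values('cars')
persons = []
for car in dat['cars'].unique():
    low, high = ranges.get(car, (1, 6))
    n = (dat['cars'] == car).sum()             # rows in this category
    persons.extend(np.random.randint(low, high, n))
dat['persons'] = persons                       # assigned positionally to the sorted frame
dat = dat.sort_index()                         # restore the original row order if it matters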

Remove row from data frame A, when column value from data frame B is not similar

My Target
I'm trying to create a new data frame that is formed by comparing columns from different data frames.
More specifically, when a column value from ColumnA isn't similar to or present in ColumnB, that whole row is disregarded and not included in new_df.
Data Frames
>>> df
ColumnA Stats
0 Cake 872
1 Cheese Cake 912
2 Egg 62
3 Raspb Jam 091
4 Bacon 123
5 Bread 425
>>> df1
ColumnB
0 Cake
1 Cheese Cake
3 Raspberry Jam
4 Bacon
My Attempt
Since I'm not sure how to achieve this, I have done my best to produce the following, although I know it probably won't achieve my expected output:
new_df = df[df['ColumnA'].str.strip() in df1['ColumnB'].str.split()]
Error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Expected Output
As you can see, the rows whose column values aren't present in df1 are erased from df. In this case, Bread and Egg are both not present; consequently, new_df doesn't contain their rows:
>>> new_df
ColumnA Stats
0 Cake 872
1 Cheese Cake 912
3 Raspberry Jam 091
4 Bacon 123
EDIT:
Raspb Jam is also kept in the new DF because it is SIMILAR to Raspberry Jam at a very basic level.
You can use the map function to provide an explicit lookup:
import pandas as pd

df = pd.DataFrame({'ColumnA': ['Cake', 'Cheese Cake', 'Egg', 'Raspb Jam', 'Bacon', 'Bread'],
                   'Value': [872, 912, 62, 91, 123, 425]})
df1 = pd.DataFrame(['Cake', 'Cheese Cake', 'Raspberry Jam', 'Bacon'], columns=['ColumnB'])

value_map = {'Raspberry Jam': 'Raspb Jam'}
df1.ColumnB = df1.ColumnB.map(lambda x: value_map.get(x, x))
df1.rename(columns={'ColumnB': 'ColumnA'}, inplace=True)
df.merge(df1)
ColumnA Value
0 Cake 872
1 Cheese Cake 912
2 Raspb Jam 91
3 Bacon 123
Alternatively, use the left_on and right_on parameters to specify the column name(s) to merge on:
df.merge(df1,how='inner',left_on='ColumnA',right_on='ColumnB')[['ColumnA','Value']]
I didn't have the energy to take care of all the edge cases, but you may find this method helpful. If not, no worries. The idea:
Use set and <= to test whether the characters in df are in df1, as a measure of similarity.
Leverage NumPy's broadcasting to help out.
a = df.ColumnA.apply(set).values
b = df1.ColumnB.apply(set).values
print(df[(a[:, None] <= b).any(1)])
ColumnA Stats
0 Cake 872
1 Cheese Cake 912
3 Raspb Jam 91
4 Bacon 123
Response to comments
You can force the columns to be str with
a = df.ColumnA.astype(str).apply(set).values
b = df1.ColumnB.astype(str).apply(set).values
Explanation
a[:, None] reshapes the single-dimensional a array into a 2-D array. This enables me to perform NumPy broadcasting.
set objects use <= to perform issubset checks. Since a and b are all sets, we do a[:, None] <= b to perform every pairwise comparison of whether a[i] is a subset of b[j], for all i, j.
(a[:, None] <= b).any(1) checks whether a[i] was a subset of b[j] for any j. Meaning: did I find at least one element in b that a[i] was a subset of?
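A tiny illustration of why this counts 'Raspb Jam' as similar to 'Raspberry Jam' while dropping 'Egg':

a = set('Raspb Jam')        # characters of the short form
b = set('Raspberry Jam')    # characters of the long form

print(a <= b)               # True: every character of 'Raspb Jam' occurs in 'Raspberry Jam'
print(set('Egg') <= b)      # False: 'E' and 'g' are missing, so the Egg row is dropped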
