Pandas Dataframe Filter Multiple Conditions - python

I am looking to filter a dataframe to only include values that are equal to a certain value, or greater than another value.
Example dataframe:
   0  1   2
0  0  1  23
1  0  2  43
2  1  3  54
3  2  3  77
From here, I want to pull all values from column 0, where column 2 is either equal to 23, or greater than 50 (so it should return 0, 1 and 2). Here is the code I have so far:
df = df[(df[2]==23) & (df[2]>50)]
This returns nothing. However, when I split these apart and run them individually (df = df[df[2]==23] and df = df[df[2]>50]), I do get results back. Does anyone have any insight into how to get this to work?

As you said, it's or: |, not and: &
df = df[(df[2]==23) | (df[2]>50)]
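As a quick check, a minimal sketch reproducing the example dataframe above (the column labels are the integers 0, 1 and 2, as in the question):
import pandas as pd
df = pd.DataFrame({0: [0, 0, 1, 2], 1: [1, 2, 3, 3], 2: [23, 43, 54, 77]})
# keep rows where column 2 equals 23 or is greater than 50, then take column 0
print(df.loc[(df[2] == 23) | (df[2] > 50), 0].tolist())  # [0, 1, 2]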

Related

How can I create a column target based on two different columns?

I have the following DataFrame with the columns low_scarcity and high_scarcity (each value is in either the high or the low scarcity column):
id  low_scarcity       high_scarcity
0   When I was five..
1                      I worked a lot...
2                      I went to parties...
3   1 week ago
4   2 months ago
5                      another story..
I want to create another column 'target': when there's an entry in the low_scarcity column the value should be 0, and when there's an entry in the high_scarcity column the value should be 1. Just like this:
id  low_scarcity       high_scarcity          target
0   When I was five..                         0
1                      I worked a lot...      1
2                      I went to parties...   1
3   1 week ago                                0
4   2 months ago                              0
5                      another story..        1
I tried first replacing the entries with no value with 0 and then creating a boolean condition; however, I can't use .replace('', 0) because the cells that look empty don't actually appear as empty values.
Supposing your dataframe is called df and that a value is either in high or low scarcity, the following line of code does it:
import numpy as np
df['target'] = 1*np.array(df['high_scarcity']!="")
in which the 1* performs an integer conversion of the boolean values.
If that is not the case, then a more complex approach should be taken
res = np.array(["" for i in range(df.shape[0])])  # start with empty placeholders
res[df['high_scarcity'] != ""] = 1                # entry in high_scarcity -> 1
res[df['low_scarcity'] != ""] = 0                 # entry in low_scarcity -> 0
df['target'] = res
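If the "empty" cells are actually NaN rather than empty strings (which would explain why .replace('', 0) did nothing), a minimal sketch of the same idea uses notna() instead of the != "" comparison:
df['target'] = df['high_scarcity'].notna().astype(int)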

changing index of 1 row in pandas

I have the below df, built from a pivot of a larger df. In this table 'week' is the index (dtype = object) and I need to show week 53 as the first row instead of the last.
Can someone advise please? I tried reindex and custom sorting but can't find the way.
Thanks!
Here is the table:
Since you can't insert the row and push the others back directly, a clever trick you can use is to create a new ordering:
# add a new column, "new", holding the original order
df['new'] = range(1, len(df) + 1)
# set the row whose index is 53 to 0 in the new column;
# note that this comparison must match the index type,
# so if weeks are stored as object you should compare df.index == '53'
df.loc[df.index == 53, 'new'] = 0
# sort by the new column and drop it
df = df.sort_values("new").drop('new', axis=1)
Before:
         numbers
weeks
1      181519.23
2       18507.58
3       11342.63
4        6064.06
53       4597.90
After:
         numbers
weeks
53       4597.90
1      181519.23
2       18507.58
3       11342.63
4        6064.06
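For reference, the same reordering can also be done directly with reindex, which the question mentions trying (a minimal sketch, assuming the weeks index shown above; if the index dtype is object, use '53' instead of 53):
new_order = [53] + [w for w in df.index if w != 53]
df = df.reindex(new_order)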
One way of doing this would be:
import pandas as pd
df = pd.DataFrame(range(10))
new_df = df.loc[[df.index[-1]] + list(df.index[:-1])]
output:
   0
9  9
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
Alternate method:
new_df = pd.concat([df[df["Year week"]==52], df[~(df["Year week"]==52)]])
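Applied to the weeks index from the question rather than a "Year week" column, the same concat idea would be (a sketch; again use '53' if the index dtype is object):
import pandas as pd
new_df = pd.concat([df.loc[[53]], df.drop(53)])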

Is there a way to use a custom function in a pandas aggregation?

I want to apply a custom function to a DataFrame.
e.g. DataFrame:
   index City  Age
0      1    A   50
1      2    A   24
2      3    B   65
3      4    A   40
4      5    B   68
5      6    B   48
Function to apply
def count_people_above_60(age):
    # I don't know if age can or can't be passed as a Series or list
    # to perform an operation on later
    return count_people_above_60
expecting to do something like
df.groupby(['City']).agg({"Age": ["mean", count_people_above_60]})
expected output
City    Mean  People_Above_60
A         38                0
B      60.33                2
If performance is important, create a new column filled with the compared values converted to integers, so the count can be done with the sum aggregation:
df = (df.assign(new=df['Age'].gt(60).astype(int))
        .groupby(['City'])
        .agg(Mean=('Age', 'mean'), People_Above_60=('new', 'sum')))
print(df)
           Mean  People_Above_60
City
A     38.000000                0
B     60.333333                2
Your solution can be changed to compare the values and sum them, but it is slow if there are many groups or a large DataFrame:
def count_people_above_60(age):
    return (age > 60).sum()

df = (df.groupby(['City']).agg(Mean=('Age', 'mean'),
                               People_Above_60=('Age', count_people_above_60)))
print(df)
           Mean  People_Above_60
City
A     38.000000                0
B     60.333333                2
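The same custom count can also be written inline as a lambda in named aggregation (a sketch, assuming the original df with City and Age columns and pandas 0.25+):
out = df.groupby('City').agg(Mean=('Age', 'mean'),
                             People_Above_60=('Age', lambda s: (s > 60).sum()))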

Use a loop on a dataframe by taking a specific column

I am new to pandas and python. Here I have a DataFrame:
DID  feature
0    1
0    1
0    2
0    22
0    22
0    33
1    11
1    13
1    14
1    2
1    33
2    1
2    22
2    33
2    13
2    14
In this dataframe there are two columns. DID is a document ID and feature is a feature of that document.
Now, I am trying to use a for loop based on the document IDs.
I am trying to call a function inside the loop which will receive the data of that DID only, i.e. only the features of that DID.
so
for i in df1:
    call_process("here only the values of i")  # i is the document ID, which will be 0 first
    call_process(df1['feature'].values)
like this?
Is there any way to do this?
The expected output is that while calling the method it should receive the data of that document ID only:
call_process([1, 1, 2, 22, 22, 33])
If I understood you correctly, here is a simple function to get you the features for a DID:
def get_features(did):
    feats = []  # to load the matching features
    for d, idx in zip(df['DID'], range(len(df))):  # get DID and index of DID
        if d == did:
            feats.append(df['feature'][idx])
    return feats  # return the features in a list
Then you call the function with the DID value you want, say DID 0:
get_features(0)
And it returns:
[1, 1, 2, 22, 22, 33]
I don't fully understand your purpose, but you can do it with a for loop over a groupby object:
for _, g in df1.groupby('DID'):
    call_process(g['feature'].values)
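For example, assuming the sample data above and a hypothetical call_process, the first iteration passes the features of DID 0, i.e. [1, 1, 2, 22, 22, 33]. A common variant is to collect everything into a dict keyed by DID instead of calling a function:
features_by_did = {did: g['feature'].tolist() for did, g in df1.groupby('DID')}
# features_by_did[0] -> [1, 1, 2, 22, 22, 33]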

Pandas Multi-Column Boolean Indexing/Selection with Dict Generator

Let's imagine you have a DataFrame df with a large number of columns, say 50, and df does not have any indexes (i.e. index_col=None). You would like to select a subset of the columns as defined by a required_columns_list, but would like to only return those rows meeting multiple criteria as defined by various boolean indexes. Is there a way to concisely generate the selection statement using a dict generator?
As an example:
df = pd.DataFrame(np.random.randn(100,50),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# df.columns = Index[u'Col001', u'Col002', ..., u'Col050']
required_columns_list = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
Now let's imagine that I define:
boolean_index_dict = {'Col001':"MyAccount", 'Col002':"Summary", 'Col005':"Total"}
I would like to select out using a dict generator to construct the multiple boolean indices:
df.loc[GENERATOR_USING_boolean_index_dict, required_columns_list].values
The above generator boolean method would be the equivalent of:
df.loc[(df['Col001']=="MyAccount") & (df['Col002']=="Summary") & (df['Col005']=="Total"), ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']].values
Hopefully, you can see that this would be a really useful 'template' for operating on large DataFrames, where the boolean indexing can then be defined in boolean_index_dict. I would greatly appreciate it if you could let me know whether this is possible in Pandas and how to construct the GENERATOR_USING_boolean_index_dict.
Many thanks and kind regards,
Bertie
p.s. If you would like to test this out, you will need to populate some of df's columns with text. The definition of df using random numbers was simply given as a starter if required for testing...
Suppose this is your df:
df = pd.DataFrame(np.random.randint(0,4,(100,50)),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# the first five cols and rows:
df.iloc[:5,:5]
   Col001  Col002  Col003  Col004  Col005
0       2       0       2       3       1
1       0       1       0       1       3
2       0       1       1       0       3
3       3       1       0       2       1
4       1       2       3       1       0
Compared to your example all columns are filled with ints of 0,1,2 or 3.
Let's define the criteria:
req = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
filt = {'Col001': 2, 'Col002': 2, 'Col005': 2}
So we want some columns, where some other columns all contain the value 2.
You can then get the result with:
df.loc[df[list(filt.keys())].apply(lambda x: x.tolist() == list(filt.values()), axis=1), req]
In my case this is the result:
    Col002  Col012  Col025  Col032  Col033
43       2       2       1       3       3
98       2       1       1       1       2
Let's check the filter columns for those rows:
df[list(filt.keys())].iloc[[43, 98]]
    Col005  Col001  Col002
43       2       2       2
98       2       2       2
And some other (non-matching) rows:
df[list(filt.keys())].iloc[[44, 99]]
    Col005  Col001  Col002
44       3       0       3
99       1       0       0
I'm starting to like Pandas more and more.
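For completeness, a minimal sketch of building the combined mask directly from boolean_index_dict with a generator-style expression, closer to what the question asks for (the toy data here is hypothetical, with string-valued filter columns as described in the question):
import numpy as np
import pandas as pd

# hypothetical toy data matching the question's description
df = pd.DataFrame({'Col001': ['MyAccount', 'Other'],
                   'Col002': ['Summary', 'Summary'],
                   'Col005': ['Total', 'Total'],
                   'Col012': [1, 2], 'Col025': [3, 4],
                   'Col032': [5, 6], 'Col033': [7, 8]})
boolean_index_dict = {'Col001': 'MyAccount', 'Col002': 'Summary', 'Col005': 'Total'}
required_columns_list = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']

# AND together one boolean Series per (column, value) pair taken from the dict
mask = np.logical_and.reduce([df[col] == val for col, val in boolean_index_dict.items()])
result = df.loc[mask, required_columns_list].values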
