Is it possible to do full text search in pandas dataframe - python

Currently, I'm using pandas DataFrame.filter to filter the records of the dataset. If I give one word, I get all the records that match that word. But if I give two words that are both present in the dataset, just not in the same record, I get an empty set. Is there any way, in pandas or another Python module, to search for multiple words that do not all appear in one record?
With a Python list comprehension we can build a full-text search by mapping, while pandas DataFrame.filter uses indexing. Is there any difference between mapping and indexing? If yes, what is it, and which gives better performance?

CustomerID Genre Age AnnualIncome (k$) SpendingScore (1-100)
1 Male 19 15 39
2 Male 21 15 81
3 Female 20 16 6
4 Female 23 16 77
5 Female 31 17 40
To match several exact values at once, use isin (the IDs are numeric, so pass integers rather than strings):
pokemon[pokemon['CustomerID'].isin([200, 5])]
Output:
CustomerID Genre Age AnnualIncome (k$) SpendingScore (1-100)
5 Female 31 17 40
200 Male 30 137 83

Name Qty.
0 Apple 3
1 Orange 4
2 Cake 5
Considering the above dataframe, if you want to find quantities of Apples and Oranges, you can do it like this:
result = df[df['Name'].isin(['Apple','Orange'])]
print(result)
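isin only helps when you know the full values to match. For the original multi-word question, a real full-text search where any of several words may appear anywhere in a text column, one option is a regex alternation with str.contains. A minimal sketch, reusing the fruit dataframe above (adapt the column name 'Name' to your data):
import pandas as pd
df = pd.DataFrame({'Name': ['Apple', 'Orange', 'Cake'], 'Qty.': [3, 4, 5]})
words = ['apple', 'cake']
pattern = '|'.join(words)  # regex alternation: 'apple|cake' (re.escape the words if they may contain regex metacharacters)
result = df[df['Name'].str.contains(pattern, case=False, na=False)]
print(result)
This returns every row that matches at least one of the words, even though no single record contains them all.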


Add a column in pandas based on sum of the subgroup values in another column [duplicate]

Here is a simplified version of my dataframe (the number of persons in my dataframe is way more than 3):
import pandas as pd
df = pd.DataFrame({'Person': ['John', 'David', 'Mary', 'John', 'David', 'Mary'],
                   'Sales': [10, 15, 20, 11, 12, 18]})
Person Sales
0 John 10
1 David 15
2 Mary 20
3 John 11
4 David 12
5 Mary 18
I would like to add a column "Total" to this dataframe, holding the total sales per person:
Person Sales Total
0 John 10 21
1 David 15 27
2 Mary 20 38
3 John 11 21
4 David 12 27
5 Mary 18 38
What would be the easiest way to achieve this?
I have tried
df.groupby('Person').sum()
but the shape of the output is not congruent with the shape of df.
Sales
Person
David 27
John 21
Mary 38
What you want is the transform method which can apply a function on each group:
df['Total'] = df.groupby('Person')['Sales'].transform('sum')
It gives as expected:
Person Sales Total
0 John 10 21
1 David 15 27
2 Mary 20 38
3 John 11 21
4 David 12 27
5 Mary 18 38
Another way is to compute the per-person totals with groupby and sum, then map them back onto the 'Person' column (assigning the grouped sum directly would misalign the indices and produce NaN):
df['Total'] = df['Person'].map(df.groupby('Person')['Sales'].sum())
This adds a column to the dataframe with the total sales per person.
Your 'Person' column in the dataframe contains repeated values, so the result of groupby cannot be assigned back directly as a new column. One option is to build a separate dataframe holding the sales sums. The code below will do that:
newDf = df.groupby('Person')['Sales'].sum().reset_index()
This creates a new dataframe with 'Person' and 'Sales' as columns.
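If you do want that summary attached back onto the original rows, a merge gets you there as well; a short sketch building on the newDf above:
df = df.merge(newDf.rename(columns={'Sales': 'Total'}), on='Person')
After the merge, each row carries its person's total alongside the individual sale.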

lookup within filtered range

I have a dataframe with data from an ecommerce panel.
It has orders and returns mixed together.
Each row has orderID - it's the same number for normal orders and for corresponding returns that come back from customers.
My data looks like this:
orderID Shop Revenue Note
44      0    -32     Return
45      0    -100    Return
44      1    14
45      3    20      Something else
46      2    50
47      1    80      Something
48      2    222
For each return I want to find a 'Shop' column value that corresponds to original order.
For example : 'orderID' == 44 comes twice: once as return (with 'Shop' == 0) and once as normal order (with 'Shop' == 1).
I want to replace all the 0 values in the 'Shop' column with the values from the corresponding normal orders.
My desired output looks like this:
orderID Shop Revenue Note
44      1    -32     Return
45      3    -100    Return
44      1    14
45      3    20      Something else
46      2    50
47      1    80      Something
48      2    222
I know how to do this in Google Sheets (first I filter the table to remove the 'Shop' == 0 rows, then I VLOOKUP the order numbers in this filtered range).
I know how to filter this table using pandas, but I don't know how to write the lookup.
I assume that I will need a temporary column first, where I store both types of values: just copied for normal orders, and looked up for returns.
The original dataframe has 1,000,000+ rows.
My data in .csv is available here:
https://docs.google.com/spreadsheets/d/e/2PACX-1vQAJ4tMc_Bcvv-4FsUy3E7sG0m9hm-nLTVLj-LwlSEns-YJ1pbq6gSKp5mj5lZqRI2EgHOsOutwnn1I/pub?gid=0&single=true&output=csv
Thank you for any advice!
IIUC, using map:
m = df.query('Shop != 0').set_index('orderID')['Shop']
df['Shop'] = df['orderID'].map(m)
print(df)
Output:
orderID Shop Revenue Note
0 44 1 -32 Return
1 45 3 -100 Return
2 44 1 14 NaN
3 45 3 20 Something else
4 46 2 50 NaN
5 47 1 80 Something
6 48 2 222 NaN
Create a pd.Series by using query to filter out zero shops, then set_index and map the shops to orderID.
This works if there is a 1-to-1 shop-to-order mapping. If you have multiple shops per order, you'll need logic to determine which shop is valid. If you have duplicate orders to the same shop, you need to drop_duplicates first.
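A minimal sketch of that drop_duplicates variant, in case an order appears several times with a non-zero shop (keeping the first occurrence is an assumption here; adjust as needed):
m = (df.query('Shop != 0')
       .drop_duplicates('orderID')  # keep the first non-zero shop per order
       .set_index('orderID')['Shop'])
df['Shop'] = df['orderID'].map(m)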

A question about how to do calculations after a pandas groupby

I'm working with a DataFrame with categorical values; my input DataFrame is below:
df
Age Gender Smoke
18 Female Yes
24 Female No
18 Female Yes
34 Male Yes
34 Male No
I want to group my DataFrame by the columns "Age" and "Gender", where an "Occurrence" column counts the frequency of each combination, and then create two other columns: "Smoke Yes", the share of smokers in the group, and "Smoke No", the share of non-smokers:
Age Gender Occurrence Smoke Yes Smoke No
18 Woman 2 1.00 0.00
24 Woman 1 0 1
34 Man 2 0.5 0.5
In order to do that, I used the following code
#Group and sort
df1=df.groupby(['Age', 'Gender']).size().reset_index(name='Frequency').sort_values('Frequency', ascending=False)
#Delete index
df1.reset_index(drop=True,inplace=True)
However, the df['Smoke'] column has disappeared, so I can't continue my calculation. Does anyone have an idea what I can do to obtain the output DataFrame above?
You can use groupby and value_counts with normalize=True to return the percentage share, then unstack. Using a dictionary, you can also replace the Gender column values to match the desired output.
d = {"Female":"Woman","Male":"Man"}
u = (df.groupby(['Age','Gender'])['Smoke'].value_counts(normalize=True)
.unstack().fillna(0))
s = df.groupby("Age")['Gender'].value_counts()
u.columns = u.columns.name+"_"+u.columns
out=u.rename_axis(None,axis=1).assign(Occurance=s).reset_index().replace({"Gender":d})
print(out)
Age Gender Smoke_No Smoke_Yes Occurance
0 18 Woman 0.0 1.0 2
1 24 Woman 1.0 0.0 1
2 34 Man 0.5 0.5 2
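The same table can also be produced with pd.crosstab, which handles the normalization directly; a minimal sketch using the sample data from the question:
import pandas as pd
df = pd.DataFrame({'Age': [18, 24, 18, 34, 34],
                   'Gender': ['Female', 'Female', 'Female', 'Male', 'Male'],
                   'Smoke': ['Yes', 'No', 'Yes', 'Yes', 'No']})
# Row-normalized share of Yes/No per (Age, Gender) group.
ct = pd.crosstab([df['Age'], df['Gender']], df['Smoke'], normalize='index')
ct.columns = 'Smoke_' + ct.columns
ct['Occurrence'] = df.groupby(['Age', 'Gender']).size()
out = ct.reset_index().replace({'Gender': {'Female': 'Woman', 'Male': 'Man'}})
print(out)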

How to summarize only certain columns of dataframe (python pandas)

I want to get a new dataframe in which I see the sum of certain columns for rows that have the same values in the 'index' columns (campaign_id and group_name in my example).
This is sample (example) of my dataframe:
campaign_id group_name clicks conversions cost label city_id
101 blue 40 15 100 foo 15
102 red 20 5 50 bar 12
102 red 7 3 25 bar 12
102 brown 5 0 18 bar 12
this is what I want to get:
campaign_id group_name clicks conversions cost label city_id
101 blue 40 15 100 foo 15
102 red 27 8 75 bar 12
102 brown 5 0 18 bar 12
I tried:
df = df.groupby(['campaign_id', 'group_name'])[['clicks', 'conversions', 'cost']].sum().reset_index()
but this gives me only the mentioned (summarized) columns and the index, like this:
campaign_id group_name clicks conversions cost
101 blue 40 15 100
102 red 27 8 75
102 brown 5 0 18
I can try to add the leftover columns back after this operation, but I'm not sure that is the optimal or adequate way to solve the problem.
Is there a simple way to summarize certain columns and leave the other columns untouched? (I don't care if they differ, because in my data all leftover columns have the same values for rows with the same 'index' columns, campaign_id and group_name.)
When I finished my post I saw the answer right away: since all columns except the ones I want to summarize have matching values, I can simply include all of those columns in the multi-index for this operation, like this:
df = df.groupby(['campaign_id', 'group_name', 'label', 'city_id'])[['clicks', 'conversions', 'cost']].sum().reset_index()
In this case I got exactly what I wanted.
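An alternative that keeps the leftover columns out of the grouping key is to aggregate them with 'first', which is safe here because they are constant within each group; a sketch, with column names taken from the example above:
agg_map = {'clicks': 'sum', 'conversions': 'sum', 'cost': 'sum',
           'label': 'first', 'city_id': 'first'}
df = df.groupby(['campaign_id', 'group_name'], as_index=False).agg(agg_map)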

Problem using groupby on a list column

I'm using the MovieLens 1M dataset to learn pandas, and I want to get some data based on the genres column.
A few rows of the dataframe I get look like this:
movieid title genres rating userid gender age occupation zipcode timestamp
1000204 2198 Modulations (1998) [Documentary] 5 5949 M 18 17 47901 958846401
1000205 2703 Broken Vessels (1998) [Drama] 3 5675 M 35 14 30030 976029116
1000206 2845 White Boys (1999) [Drama] 1 5780 M 18 17 92886 958153068
1000207 3607 One Little Indian (1973) [Comedy, Drama, Western] 5 5851 F 18 20 55410 957756608
1000208 2909 Five Wives, Three Secretaries and Me (1998) [Documentary] 4 5938 M 25 1 35401 957273353
I want to use df.groupby('genres') to group the dataframe and then get the count of movies in each genre and the mean rating of each genre.
However, when I use df.groupby('genres').mean(), I get an error:
"TypeError: unhashable type: 'list'"
Please tell me why this error happened and how I can use groupby on a column whose values are lists.
Thanks very much!
The problem is not the argument form; groupby keys must be hashable, and Python lists are not, which is why grouping directly on the list-valued genres column fails. Flatten the lists first, for example with df.explode('genres'), and then group on the result.
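A minimal sketch of that explode approach, assuming 'genres' holds Python lists and 'rating' is numeric (df.explode requires pandas 0.25+):
exploded = df.explode('genres')  # one row per (movie, genre) pair
per_genre = exploded.groupby('genres')['rating'].agg(['count', 'mean'])
print(per_genre)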
