I have a dataframe df which looks like:
id colour response
1 blue curent
2 red loaning
3 yellow current
4 green loan
5 red currret
6 green loan
You can see the values in the response column are not uniform, and I would like to get them to snap to a standardized set of responses.
I also have a validation list validate which looks like:
validate
current
loan
transfer
I would like to standardise the response column in df by matching the first three characters of each entry against the validate list.
So the eventual output would look like:
id colour response
1 blue current
2 red loan
3 yellow current
4 green loan
5 red current
6 green loan
I have tried to use fnmatch:
pattern = 'cur*'
fnmatch.filter(df, pattern) = 'current'
but can't change the values in the df.
If anyone could offer assistance it would be appreciated.
Thanks
You could use map
In [3664]: mapping = dict(zip(s.str[:3], s))
In [3665]: df.response.str[:3].map(mapping)
Out[3665]:
0 current
1 loan
2 current
3 loan
4 current
5 loan
Name: response, dtype: object
In [3666]: df['response2'] = df.response.str[:3].map(mapping)
In [3667]: df
Out[3667]:
id colour response response2
0 1 blue curent current
1 2 red loaning loan
2 3 yellow current current
3 4 green loan loan
4 5 red currret current
5 6 green loan loan
Where s is a series of the validation values.
In [3650]: s
Out[3650]:
0 current
1 loan
2 transfer
Name: validate, dtype: object
Details
In [3652]: mapping
Out[3652]: {'cur': 'current', 'loa': 'loan', 'tra': 'transfer'}
mapping can be a series too (indexed by the three-character prefix, so it works with .map):
In [3678]: pd.Series(s.values, index=s.str[:3].values)
Out[3678]:
cur     current
loa        loan
tra    transfer
dtype: object
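For reference, here is a minimal end-to-end sketch of this prefix-mapping approach, assuming the validation values are available as a plain list; the fillna step (leaving unmatched prefixes untouched) is an assumption, not part of the original answer:
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'colour': ['blue', 'red', 'yellow', 'green', 'red', 'green'],
    'response': ['curent', 'loaning', 'current', 'loan', 'currret', 'loan'],
})
validate = ['current', 'loan', 'transfer']

# build a prefix -> canonical value lookup from the validation list
mapping = {v[:3]: v for v in validate}

# map each response via its first three characters; keep the original
# value wherever the prefix is not in the validation list
df['response'] = df['response'].str[:3].map(mapping).fillna(df['response'])
print(df)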
Fuzzy match?
from fuzzywuzzy import process

# val is a DataFrame holding the validation values in a column named 'validate'
a = []
for x in df.response:
    # process.extract returns the best matches; take the string of the top one
    a.append(process.extract(x, val.validate, limit=1)[0][0])

df['response2'] = a
df
Out[867]:
id colour response response2
0 1 blue curent current
1 2 red loaning loan
2 3 yellow current current
3 4 green loan loan
4 5 red currret current
5 6 green loan loan
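If the responses can also contain values that should not be forced onto the validation list, a score threshold avoids snapping garbage onto the nearest entry. A hedged sketch, reusing val from above; the cutoff of 70 is an arbitrary assumption:
from fuzzywuzzy import process

def standardise(value, choices, cutoff=70):
    # extractOne returns None when nothing scores at or above the cutoff
    best = process.extractOne(value, choices, score_cutoff=cutoff)
    return best[0] if best else value

df['response2'] = [standardise(x, val.validate) for x in df.response]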
I have a dataframe with three columns:
Colour
Person
Number of times worn
There are three colours and multiple names, and the number column indicates how many times a specific name had a particular colour. The problem is, the same colour occurs multiple times for the same person. I am trying to do a groupby which sums up the total number, per colour per name. Any idea how I can perform a groupby which aggregates in this manner? Sorry if this is too vague!
I attach the sample data below for clarity.
Any help on how to neatly aggregate by colour would be great!
Colour Person Number of times worn
0 Red Tom 1
1 Red Tom 2
2 Red Tom 5
3 Blue Tom 7
4 Blue Tom 8
5 Green Tom 9
6 Red John 9
7 Red John 6
8 Green John 0
9 Green John 0
10 Orange John 5
11 Red John 4
12 Red Stanley 2
13 Orange Stanley 4
14 Green Stanley 5
15 Green Stanley 0
16 Green Stanley 6
17 Green Stanley 7
Thanks
You can also write it this way:
df.groupby(["Person", "Colour"])["n"].sum().reset_index(drop=True)
Or this works like a charm as well:
df.groupby(["Person", "Colour"]).agg({"n": "sum"}).reset_index(drop=True)
Only use reset_index(drop=True) if you plan to modify the original dataframe; otherwise don't pass drop=True and just store the result in a variable.
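Applied to the column names from the question, a small sketch (assuming the data shown above is already loaded in df):
# sum the wear counts for every (Person, Colour) pair
totals = (
    df.groupby(["Person", "Colour"])["Number of times worn"]
      .sum()
      .reset_index()
)
print(totals)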
You can groupby multiple columns at the same time like this.
df = pd.DataFrame({
    'Colour': ['red', 'red', 'red', 'red', 'blue', 'blue'],
    'Person': ['Tom', 'Tom', 'Tom', 'John', 'John', 'John'],
    'n': [1, 2, 4, 5, 6, 7]
})
df.groupby(['Person','Colour']).sum().reset_index()
Output:
Person Colour n
0 John blue 13
1 John red 5
2 Tom red 7
I have created a dataframe from a dictionary as follows:
my_dict = {'VehicleType': ['Truck', 'Car', 'Truck', 'Car', 'Car'],
           'Colour': ['Green', 'Green', 'Black', 'Yellow', 'Green'],
           'Year': [2002, 2014, 1975, 1987, 1987],
           'Frequency': [0, 0, 0, 0, 0]}
df = pd.DataFrame(my_dict)
So my dataframe df currently looks like this:
VehicleType Colour Year Frequency
0 Truck Green 2002 0
1 Car Green 2014 0
2 Truck Black 1975 0
3 Car Yellow 1987 0
4 Car Green 1987 0
I'd like it to look like this:
VehicleType Colour Year Frequency
0 Truck Green 2002 1
1 Car Green 2014 2
2 Truck Black 1975 1
3 Car Yellow 1987 1
4 Car Green 1987 2
i.e., the Frequency column should represent the totals of VehicleType AND Colour combinations (but leaving out the Year column). So in row 4 for example, the 2 in the Frequency column tells you that there are a total of 2 rows with the combination of 'Car' and 'Green'.
This is essentially a 'Count' with 'Group By' calculation, and Pandas provides a way to do the calculation as follows:
grp_by_series = df.groupby(['VehicleType', 'Colour']).size()
grp_by_series
VehicleType Colour
Car Green 2
Yellow 1
Truck Black 1
Green 1
dtype: int64
What I'd like to do next is to extract the calculated group_by values from the Panda series and put them into the Frequency column of the Pandas dataframe. I've tried various approaches but without success.
The example I've given is hugely simplified - the dataframes I'm using are derived from genomic data and have hundreds of millions of rows, and will have several frequency columns based on various combinations of other columns, so ideally I need a solution which is fast and scales well.
Thanks for any help!
You are on a good path. You can continue like this:
grp_by_series = grp_by_series.reset_index()
res = df[['VehicleType', 'Colour']].merge(grp_by_series, how='left')
df['Frequency'] = res[0]   # the size() counts end up in a column labelled 0 after reset_index()
print(df)
Output:
VehicleType Colour Year Frequency
0 Truck Green 2002 1
1 Car Green 2014 2
2 Truck Black 1975 1
3 Car Yellow 1987 1
4 Car Green 1987 2
I think a .transform() does what you want:
df['Frequency'] = df.groupby(['VehicleType', 'Colour'])['Year'].transform('count')
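For completeness, a runnable sketch of the transform approach on the question's data (counting 'Year' is arbitrary; any column without missing values in it would do):
import pandas as pd

my_dict = {'VehicleType': ['Truck', 'Car', 'Truck', 'Car', 'Car'],
           'Colour': ['Green', 'Green', 'Black', 'Yellow', 'Green'],
           'Year': [2002, 2014, 1975, 1987, 1987],
           'Frequency': [0, 0, 0, 0, 0]}
df = pd.DataFrame(my_dict)

# broadcast the per-group row count back onto every row of its group
df['Frequency'] = df.groupby(['VehicleType', 'Colour'])['Year'].transform('count')
print(df)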
I have been using pandas.groupby to pivot data and create descriptive charts and tables for my data. While doing a groupby on three variables, I keep running into a "DataError: No numeric types to aggregate" error while working with the cancelled column.
To describe my data: Year and Month contain yearly and monthly data (multiple years, all months), Type contains the type of order item (Clothes, Appliances, etc.), and cancelled contains yes or no string values indicating whether an order was cancelled or not.
I am hoping to plot a graph and show a table of what the cancellation rate (and success rate) is by order item. The following is what I'm using so far:
df.groupby(['Year', 'Month', 'Type'])['cancelled'].mean()
But this doesn't seem to be working.
Sample
Year Month Type cancelled
2012 1 electronics yes
2012 10 fiber yes
2012 9 clothes no
2013 4 vegetables yes
2013 5 appliances no
2016 3 fiber no
2017 1 clothes yes
Use:
df = pd.DataFrame({
    'Year': [2020] * 6,
    'Month': [7, 8, 7, 8, 7, 8],
    'cancelled': ['yes', 'no'] * 3,
    'Type': list('aaaaba')
})
print (df)
Get counts per Year, Month, Type columns:
df1 = df.groupby(['Year', 'Month', 'Type','cancelled']).size().unstack(fill_value=0)
print (df1)
cancelled no yes
Year Month Type
2020 7 a 0 2
b 0 1
8 a 3 0
And then divide by the column sums and multiply by 100 to get percentages:
df2 = df1.div(df1.sum()).mul(100)
print (df2)
cancelled no yes
Year Month Type
2020 7 a 0.0 66.666667
b 0.0 33.333333
8 a 100.0 0.000000
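If the goal is the cancellation rate within each (Year, Month, Type) group rather than each group's share of all cancellations, the division can be done row-wise instead; a small sketch building on df1 above:
# divide each row by its own total so 'no' and 'yes' sum to 100 per group
df3 = df1.div(df1.sum(axis=1), axis=0).mul(100)
print(df3)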
It's possible I have misunderstood what you want your output to look like, but to find the cancellation rate for each item type, you could do something like this:
# change 'cancelled' to numeric values
df.loc[df['cancelled'] == 'yes', 'cancelled'] = 1
df.loc[df['cancelled'] == 'no', 'cancelled'] = 0
# get the mean of 'cancelled' for each item type
res = {}
for t in df['Type'].unique():
    res[t] = df.loc[df['Type'] == t, 'cancelled'].mean()
# if desired, put it into a dataframe
results = pd.DataFrame([res], index=['Rate']).T
Output:
Rate
electronics 1.0
fiber 0.5
clothes 0.5
vegetables 1.0
appliances 0.0
Note: If you want to specify specific years or months, you can do that with loc as well, but given that your example data did not have any repeats within a given year or month, this would return your original dataframe for your given example.
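A more vectorized variant of the same idea, offered as a hedged sketch: convert the yes/no strings to booleans and let groupby take the mean (this works on the original yes/no strings, so the conversion step above is not needed; add Year and Month to the grouping keys if you want the rate broken down by time):
# the mean of a boolean column is the share of True values, i.e. the cancellation rate
rates = df['cancelled'].eq('yes').groupby(df['Type']).mean().rename('Rate')
print(rates)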
By grouped data I mean the following: assume we have a data set which is grouped by a single feature, e.g. customer data grouped by the individual customer:
Customer | Purchase Nr | Item | Paid Amount ($)
1 1 TShirt 15
1 2 Trousers 25
1 3 Scarf 10
2 1 Underwear 5
2 2 Dress 35
2 3 Trousers 30
2 4 TShirt 10
3 1 TShirt 8
3 2 Socks 5
4 1 Shorts 13
I want to find clusters in such a way that a customer's purchases are all in one single cluster; in other words, a customer should not appear in two clusters.
I thought about grouping the data set by the customer with a groupby, though it is difficult to express all the information of the columns for one customer in only one column. Further, the order of purchases is important to me, e.g. whether a T-Shirt was bought first or second.
Is there any cluster algorithm which includes information about groups like this?
Thank you!
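For illustration only, one way to express the groupby idea described in the question is to pivot each customer's purchase sequence into a single row, which a clustering algorithm could then consume as a feature vector. A hedged sketch on the sample data, not an answer to the clustering-algorithm question itself (the item columns would still need encoding before clustering):
import pandas as pd

df = pd.DataFrame({
    'Customer':        [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
    'Purchase Nr':     [1, 2, 3, 1, 2, 3, 4, 1, 2, 1],
    'Item':            ['TShirt', 'Trousers', 'Scarf', 'Underwear', 'Dress',
                        'Trousers', 'TShirt', 'TShirt', 'Socks', 'Shorts'],
    'Paid Amount ($)': [15, 25, 10, 5, 35, 30, 10, 8, 5, 13],
})

# one row per customer: the item bought at each purchase position,
# so the purchase order is preserved as separate feature columns
features = df.pivot(index='Customer', columns='Purchase Nr', values='Item')
print(features)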
I have two data frames. I want to match partial strings by using the str.contains function and then merge them.
Here is an example:
data1
email is_mane name id
hi#amal.com 1 there is rain 10
hi2#amal.com 1 here is the food 9
hi3#amal.com 1 let's go together 8
hi4#amal.com 1 today is my birthday 6
data2
id name
1 the rain is beautiful
1 the food
2 together
4 my birthday
3 your birthday
And here is the code I wrote:
data1.loc[data1.name.str.contains('|'.join(data2.name)), :]
and the output:
email is_mane name id
hi2#amal.com 1 here is the food 9
hi3#amal.com 1 let's go together 8
hi4#amal.com 1 today is my birthday 6
As you can see it did not return "there is rain", even though the word rain is contained in data2: could it be because of spaces?
Also, I want to merge data1 with data2, which will help me to know which email has a match.
I would like to have the following output:
email is_mane name id id2 name2
hi2#amal.com 1 here is the food 9 1 the food
hi3#amal.com 1 let's go together 8 2 together
hi4#amal.com 1 today is my birthday 6 4 my birthday
hi4#amal.com 1 today is my birthday 6 3 your birthday
Is there any way to do it?
If you're good with matching only full words, you can do the following (so e.g. dog and dogs won't match):
# split each name into individual words and use them as merge keys
data1["key"] = data1["name"].str.split(r"\W+")
data2["key"] = data2["name"].str.split(r"\W+")

data3 = (
    data1.explode("key")
         .merge(data2.explode("key"), on="key", suffixes=["", "2"])
         .drop("key", axis=1)
         .drop_duplicates()
)
Otherwise it's a matter of doing a cross join and applying str.contains(...) to filter out the rows which aren't matching.
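A hedged sketch of that cross-join variant, assuming data1 and data2 are as shown in the question; the helper column _key and the check "data2's name is a substring of data1's name" are assumptions (note that a plain substring test will not reproduce partial word-overlap matches such as 'your birthday'; the full-word explode approach above handles those):
# cross join via a constant helper key, then keep rows where data2's
# name appears as a substring of data1's name
cross = (
    data1.assign(_key=1)
         .merge(data2.assign(_key=1), on="_key", suffixes=["", "2"])
         .drop("_key", axis=1)
)
matches = cross[[n2.lower() in n.lower() for n, n2 in zip(cross["name"], cross["name2"])]]
print(matches)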