I have a dataframe in the format below, where each transaction's items are stored in a single "Items ordered" cell.
I wonder how I can split the Items ordered column into multiple rows, like the expected output further down.
Thanks in advance!
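The dataframe itself was not included as text, so here is a minimal reconstruction inferred from the expected output; the semicolon-separated format with a trailing semicolon is an assumption based on the rstrip in the answer below:

import pandas as pd

# Assumed reconstruction of the pictured input dataframe
df = pd.DataFrame({
    'Transaction ID': [1, 2, 3],
    'Client Name': ['Sam', 'Peter', 'Han'],
    'Items ordered': ['Fruit; Water; Coffee;',
                      'Fruit; Soup; Sandwich;',
                      'Fruit; Coffee; Ice Cream;'],
})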
You can strip the trailing semicolon, split each cell on ';' plus any following whitespace, and explode the resulting lists into rows:
>>> (df.assign(**{'Items ordered': lambda x: x['Items ordered'].str.rstrip(';').str.split(r';\s*')})
        .explode('Items ordered', ignore_index=True))
Transaction ID Client Name Items ordered
0 1 Sam Fruit
1 1 Sam Water
2 1 Sam Coffee
3 2 Peter Fruit
4 2 Peter Soup
5 2 Peter Sandwich
6 3 Han Fruit
7 3 Han Coffee
8 3 Han Ice Cream
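A small compatibility note (my addition, not part of the original answer): explode only gained the ignore_index argument in pandas 1.1, so on older versions you can reset the index yourself for the same result:

>>> (df.assign(**{'Items ordered': lambda x: x['Items ordered'].str.rstrip(';').str.split(r';\s*')})
        .explode('Items ordered')
        .reset_index(drop=True))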
The Problem
I had a hard time phrasing this question, but essentially I have a series of X columns that represent weights at specific points in time, and another set of X columns that represent the names of the people that were measured.
That table looks like this (there are more than two columns; this is just a toy example):
a_weight  b_weight  a_name  b_name
10        5         John    Michael
1         2         Jake    Michelle
21        3         Alice   Bob
2         1         Ashley  Brian
What I Want
I want to have two columns with the maximum weight and the matching name at each point in time. I want this to be vectorized because there is a lot of data; I can do it using a for loop or an .apply(lambda row: row[col]), but it is very slow.
So the final table would look something like this:
a_weight  b_weight  a_name  b_name    max_weight  max_name
10        5         John    Michael   a_weight    John
1         2         Jake    Michelle  b_weight    Michelle
21        3         Alice   Bob       a_weight    Alice
2         1         Ashley  Brian     a_weight    Ashley
What I've Tried
I've been able to create a mirror df_subset with just the weights, then use the idxmax function to make a max_weight column:
df_subset = df[[c for c in df.columns if "weight" in c]]
max_weight_col = df_subset.idxmax(axis="columns")
This returns a column that is the max_weight column in the section above. Now I run:
df["max_name_col"] = max_weight_col.str.replace("_weight","_name")
and I have this:
a_weight  b_weight  a_name  b_name    max_weight  max_name_col
10        5         John    Michael   a_weight    a_name
1         2         Jake    Michelle  b_weight    b_name
21        3         Alice   Bob       a_weight    a_name
2         1         Ashley  Brian     a_weight    a_name
I basically want to run code similar to the pseudocode below, but without a for-loop:
df["max_name"] = [row[row["max_name_col"]] for row in df]
How do I move on from here? I feel like I'm so close but I'm stuck. Any help? I'm also open to throwing away the entire code and doing something else if there's a faster way.
You can do that for sure; just pass it through numpy's argmax:
v1 = df.filter(like='weight').values  # weight columns as a 2-D numpy array
v2 = df.filter(like='name').values    # name columns, in the same column order
df['max_weight'] = v1[df.index, v1.argmax(1)]  # row-wise maximum weight
df['max_name'] = v2[df.index, v1.argmax(1)]    # name taken from the winning column
df
Out[921]:
a_weight b_weight a_name b_name max_weight max_name
0 10 5 John Michael 10 John
1 1 2 Jake Michelle 2 Michelle
2 21 3 Alice Bob 21 Alice
3 2 1 Ashley Brian 2 Ashley
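One caveat worth adding (my note, not part of the original answer): v1[df.index, ...] only works because df has the default RangeIndex 0..n-1. If the index is anything else, positional row numbers are the safe choice:

import numpy as np

rows = np.arange(len(df))  # positional row numbers, valid for any index
df['max_weight'] = v1[rows, v1.argmax(1)]
df['max_name'] = v2[rows, v1.argmax(1)]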
This would do the trick assuming you only have 2 weight columns:
df["max_weight"] = df[["a_weight", "b_weight"]].idxmax(axis=1)
mask = df["max_weight"] == "a_weight"
df.loc[mask, "max_name"] = df[mask]["a_name"]
df.loc[~mask, "max_name"] = df[~mask]["b_name"]
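Since there are only two candidate columns, the same selection also fits in a single np.where call; a sketch, not part of the original answer:

import numpy as np

df["max_weight"] = df[["a_weight", "b_weight"]].idxmax(axis=1)
# where a_weight won, take a_name; otherwise take b_name
df["max_name"] = np.where(df["max_weight"] == "a_weight", df["a_name"], df["b_name"])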
We could use idxmax to find the column names, then use factorize + numpy advanced indexing to get the names:
import numpy as np

df['max_weight'] = df.loc[:, df.columns.str.contains('weight')].idxmax(axis=1)
# sort=True ties the codes to alphabetical order ('a_weight' -> 0, 'b_weight' -> 1),
# matching the name columns' order; the default order-of-appearance codes would
# break if the first row's winner were not 'a_weight'
df['max_name'] = (df.loc[:, df.columns.str.contains('name')].to_numpy()
                  [np.arange(len(df)), df['max_weight'].factorize(sort=True)[0]])
Output:
a_weight b_weight a_name b_name max_weight max_name
0 10 5 John Michael a_weight John
1 1 2 Jake Michelle b_weight Michelle
2 21 3 Alice Bob a_weight Alice
3 2 1 Ashley Brian a_weight Ashley
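If you'd rather not rely on factorize's code order at all, a more defensive sketch (my variant, replacing the max_name assignment above) maps each winning weight column to its matching name column position with Index.get_indexer:

import numpy as np

name_cols = df.columns[df.columns.str.endswith('_name')]  # run before 'max_name' exists
# translate 'a_weight' -> 'a_name' and look up its position among the name columns
pos = name_cols.get_indexer(df['max_weight'].str.replace('_weight', '_name'))
df['max_name'] = df[name_cols].to_numpy()[np.arange(len(df)), pos]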
I have this dataframe where the type column shows the type of an item:
id type total
1 shoes 2
1 sandal 1
1 vest 2
1 tshirt 2
1 345 3
1 345 2
Based on the type column, I want to categorize the type and create a new column called category.
The rules are as follows:
- shoes and sandal as shoes
- vest, we keep it as vest
- tshirt, we also keep it as tshirt
- other than that, I want to keep it as other
So the desired result will be as follows:
id type total category
1 shoes 2 shoes
1 sandal 1 shoes
1 vest 2 vest
1 tshirt 2 tshirt
1 345 3 other
1 345 2 other
How can I do this with Python?
Thanks in advance!
Try using map on a dictionary with fillna:
>>> df['category'] = df['type'].map({'sandal': 'shoes', 'shoes': 'shoes', 'vest': 'vest', 'tshirt': 'tshirt'}).fillna('other')
>>> df
id type total category
0 1 shoes 2 shoes
1 1 sandal 1 shoes
2 1 vest 2 vest
3 1 tshirt 2 tshirt
4 1 345 3 other
5 1 345 2 other
As mentioned in the documentation of both methods, map replaces the values in the Series that appear as keys of the dictionary with the corresponding dictionary values, while values that aren't in the dictionary become NaN; fillna then replaces those NaNs with 'other'.
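To make that intermediate step visible, here is a small sketch over the same dataframe:

mapping = {'sandal': 'shoes', 'shoes': 'shoes', 'vest': 'vest', 'tshirt': 'tshirt'}
mapped = df['type'].map(mapping)         # unmapped values such as 345 become NaN
df['category'] = mapped.fillna('other')  # NaN -> 'other'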
I'm trying to find a method of inserting a zero into a pandas dataframe where the result of the .count() aggregate function is < 1. I've tried putting in a condition that looks for null/None values, and using a simple < 1 operator. So far I can only count instances where a categorical variable exists. Below is some example code to demonstrate my issue:
import pandas as pd

data = {'Person': ['Jim', 'Jim', 'Jim', 'Jim', 'Jim', 'Bob', 'Bob', 'Bob', 'Bob', 'Bob'],
        'Result': ['Good', 'Good', 'Good', 'Good', 'Good', 'Good', 'Bad', 'Good', 'Bad', 'Bad']}
dtf = pd.DataFrame.from_dict(data)
names = ['Jim', 'Bob']
append = []
for i in names:
    good = dtf[dtf['Person'] == i]
    good = good[good['Result'] == 'Good']
    if good['Result'].count() > 0:
        good.insert(2, "Count", good['Result'].count())
    elif good['Result'].count() < 1:
        good.insert(2, "Count", 0)
    bad = dtf[dtf['Person'] == i]
    bad = bad[bad['Result'] == 'Bad']
    if bad['Result'].count() > 0:
        bad.insert(2, "Count", bad['Result'].count())
    elif bad['Result'].count() < 1:
        bad.insert(2, "Count", 0)
    res = [good, bad]
    res = pd.concat(res)
    append.append(res)
    print(res)
The current output is:
Person Result Count
0 Jim Good 5
1 Jim Good 5
2 Jim Good 5
3 Jim Good 5
4 Jim Good 5
Person Result Count
5 Bob Good 2
7 Bob Good 2
6 Bob Bad 3
8 Bob Bad 3
9 Bob Bad 3
What I am trying to achieve is a zero count for Jim for the 'Bad' value in the dtf['Result'] column. Like this:
Person Result Count
0 Jim Good 5
1 Jim Good 5
2 Jim Good 5
3 Jim Good 5
4 Jim Good 5
5 Jim Bad 0
Person Result Count
6 Bob Good 2
7 Bob Good 2
8 Bob Bad 3
9 Bob Bad 3
10 Bob Bad 3
I hope this makes sense. Vive la Resistance! └[∵┌]└[ ∵ ]┘[┐∵]┘
First create a MultiIndex mi from the product of Person and Result, so that combinations missing from dtf are kept. Then count (size) all groups and reindex by the multiindex. Finally, merge the two dataframes using the union of keys from both:
mi = pd.MultiIndex.from_product([dtf["Person"].unique(),
                                 dtf["Result"].unique()],
                                names=["Person", "Result"])

out = dtf.groupby(["Person", "Result"]) \
         .size() \
         .reindex(mi, fill_value=0) \
         .rename("Count") \
         .reset_index()

out = out.merge(dtf, on=["Person", "Result"], how="outer")
>>> out
Person Result Count
0 Jim Good 5
1 Jim Good 5
2 Jim Good 5
3 Jim Good 5
4 Jim Good 5
5 Jim Bad 0
6 Bob Good 2
7 Bob Good 2
8 Bob Bad 3
9 Bob Bad 3
10 Bob Bad 3
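An alternative sketch for the counting step (my addition) that avoids building the MultiIndex by hand: unstack the pair counts with fill_value=0, then stack back; this assumes pandas >= 1.1 for DataFrame.value_counts:

counts = (dtf.value_counts(['Person', 'Result'])  # count each Person/Result pair
             .unstack(fill_value=0)               # missing pairs become 0
             .stack()                             # back to one row per pair
             .rename('Count')
             .reset_index())
out = counts.merge(dtf, on=['Person', 'Result'], how='outer')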
If you still need the names and append variables from your loop, they can be rebuilt from out:
names, append = list(zip(*out.groupby("Person")))
>>> names
('Bob', 'Jim')
>>> append
( Person Result Count
6 Bob Good 2
7 Bob Good 2
8 Bob Bad 3
9 Bob Bad 3
10 Bob Bad 3,
Person Result Count
0 Jim Good 5
1 Jim Good 5
2 Jim Good 5
3 Jim Good 5
4 Jim Good 5
5 Jim Bad 0)
Dataframe:

name   Location      Rating  Frequency
Ali    Nasi Kandar   1 star  1
Ali    Baskin Robin  4 star  3
Ali    Nasi Ayam     3 star  1
Ali    Burgergrill   2 star  2
Lee    Fries         1 star  3
Abu    Mcdonald      3 star  3
Abu    KFC           3 star  1
Ahmad  Nandos        3 star  2
Ahmad  Burgerdhil    2 star  3
Ahmad  Kebab         1 star  10
Here is the sample data set. The logic would be:
1st condition: if the name has duplicate values, check the frequency and drop the row with the lower frequency
2nd condition: if there is no duplicate name (e.g. Lee), keep the row
3rd condition: if the Rating is the same (e.g. Abu), keep the first value
Desired Output:

name   Location      Rating  Frequency
Ali    Baskin Robin  4 star  3
Lee    Fries         1 star  3
Abu    KFC           3 star  1
Ahmad  Kebab         1 star  10
Do any of you guys know how I can do this in Python pandas or PySpark?
I got into trouble checking for duplicates and also applying the "if conditions" to this dataframe.
PySpark solution. You can use row_number over an appropriately partitioned and ordered window, and keep only the rows with a row number of 1.
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'rn',
    F.row_number().over(Window.partitionBy('name').orderBy(F.desc('frequency')))
).filter('rn = 1').drop('rn')
df2.show()
+-----+------------+------+---------+
| name| Location|Rating|Frequency|
+-----+------------+------+---------+
|Ahmad| Kebab|1 star| 10|
| Abu| Mcdonald|3 star| 3|
| Lee| Fries|1 star| 3|
| Ali|Baskin Robin|4 star| 3|
+-----+------------+------+---------+
Use DataFrame.sort_values with DataFrame.drop_duplicates, and finally sort the index:
df = (df.sort_values(['Frequency', 'Rating'],
                     ascending=[False, True])
        .drop_duplicates('name')
        .sort_index())
print (df)
name Location Rating Frequency
1 Ali Baskin Robin 4 star 3
4 Lee Fries 1 star 3
5 Abu Mcdonald 3 star 3
9 Ahmad Kebab 1 star 10
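Another pandas option along the same lines (my sketch, applied to the original df, with first-occurrence tie-breaking rather than the Rating tie-break above):

# keep, for each name, the first row holding that name's maximum Frequency,
# then restore the original row order
out = df.loc[df.groupby('name')['Frequency'].idxmax()].sort_index()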
I have a dataframe logging exercises completed, with a two column multiindex: Day and Person. Each day, each person logs which exercises they did (if they exercised). I would like to add another column which sequentially counts the entries made into this log, as shown below. So for each unique pair of day and person, count up by 1.
Day Person Exercise EntryNumber
1 Joe Curls 1
1 Joe Squats 1
1 Sandy Sprints 2
1 Sandy Bench 2
2 Joe Curls 3
2 Sandy Squats 4
3 Bob Pushups 5
Here is the code to generate that above dataframe.
import pandas as pd
df = pd.DataFrame({'Day':[1,1,1,1,2,2,3],
'Person':['Joe','Joe','Sandy','Sandy','Joe','Sandy','Bob'],
'Exercise':['Curls','Squats','Sprints','Bench','Curls','Squats','Pushups']})
df = df.set_index(['Day','Person'])
How would I go about creating the EntryNumber column? I've tried all manner of groupby and cumcount but have not yet figured it out.
Thanks!
Maybe you can try groupby followed by ngroup():
#Generating df from above
import pandas as pd
df = pd.DataFrame({'Day':[1,1,1,1,2,2,3],
'Person':['Joe','Joe','Sandy','Sandy','Joe','Sandy','Bob'],
'Exercise':['Curls','Squats','Sprints','Bench','Curls','Squats','Pushups']})
df = df.set_index(['Day','Person'])
# applying reset index and ngroup
df.reset_index(inplace=True)
df['Entry Number'] = df.groupby(['Day','Person']).ngroup() +1
df
Result:
Day Person Exercise Entry Number
0 1 Joe Curls 1
1 1 Joe Squats 1
2 1 Sandy Sprints 2
3 1 Sandy Bench 2
4 2 Joe Curls 3
5 2 Sandy Squats 4
6 3 Bob Pushups 5
Another way is to factorize the index, without having to group:
df['EntryNumber'] = df.index.factorize()[0]+1
#df = df.reset_index() -> if you want to reset the index
print(df)
Exercise EntryNumber
Day Person
1 Joe Curls 1
Joe Squats 1
Sandy Sprints 2
Sandy Bench 2
2 Joe Curls 3
Sandy Squats 4
3 Bob Pushups 5