I have a problem where I need to update a value if two people were at the same table.
import pandas as pd
data = {"p1": ['Jen', 'Mark', 'Carrie'],
        "p2": ['John', 'Jason', 'Rob'],
        "value": [10, 20, 40]}
df = pd.DataFrame(data, columns=["p1", 'p2', 'value'])
meeting = {'person': ['Jen', 'Mark', 'Carrie', 'John', 'Jason', 'Rob'],
           'table': [1, 2, 3, 1, 2, 3]}
meeting = pd.DataFrame(meeting, columns=['person', 'table'])
df is a relationship table and value is the field I need to update. So if two people were at the same table in the meeting dataframe, then update the corresponding df row.
For example: Jen and John were both at table 1, so I need to update the row in df that has Jen and John and set their value to value + 100, i.e. 110.
I thought about doing a self join on meeting to get the format to match that of df, but I'm not sure if this is the easiest or fastest approach.
IIUC, you could set person as the index of the meeting dataframe and use its table values to replace the names in df. Then, where both mappings give the same value (table), replace with df.value + 100:
m = df[['p1','p2']].replace(meeting.set_index('person').table).eval('p1==p2')
df['value'] = df.value.mask(m, df.value+100)
print(df)
p1 p2 value
0 Jen John 110
1 Mark Jason 120
2 Carrie Rob 140
This could be an approach, using df.to_records():
groups = meeting.groupby('table').agg(set)['person'].to_list()
df['value'] = [row[-1] + 100 if set(list(row)[1:3]) in groups else row[-1]
               for row in df.to_records()]
Output:
df
p1 p2 value
0 Jen John 110
1 Mark Jason 120
2 Carrie Rob 140
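The self join the question mentions is also workable: look up each person's table and compare the two columns. A minimal sketch, self-contained with the sample data from the question (using Series.map rather than an actual merge):

```python
import pandas as pd

df = pd.DataFrame({"p1": ["Jen", "Mark", "Carrie"],
                   "p2": ["John", "Jason", "Rob"],
                   "value": [10, 20, 40]})
meeting = pd.DataFrame({"person": ["Jen", "Mark", "Carrie", "John", "Jason", "Rob"],
                        "table": [1, 2, 3, 1, 2, 3]})

# Look up each person's table number with a person -> table Series.
tables = meeting.set_index("person")["table"]
same_table = df["p1"].map(tables) == df["p2"].map(tables)

# Add 100 to the rows where both people sat at the same table.
df.loc[same_table, "value"] += 100
```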
I have a dataframe that looks like the one below.
id name tag location date
1 John 34 FL 01/12/1990
1 Peter 32 NC 01/12/1990
1 Dave 66 SC 11/25/1990
1 Mary 12 CA 03/09/1990
1 Sue 29 NY 07/10/1990
1 Eve 89 MA 06/12/1990
: : : : :
n John 34 FL 01/12/2000
n Peter 32 NC 01/12/2000
n Dave 66 SC 11/25/1999
n Mary 12 CA 03/09/1999
n Sue 29 NY 07/10/1998
n Eve 89 MA 06/12/1997
I need to find the location information based on the id column, but with one condition: I only need the earliest date. For example, the earliest date for the id=1 group is 01/12/1990, which means the locations are FL and NC. Then I apply this to all the different id groups to get the top 3 locations. I have written code that does this for me.
# Get the earliest date based on id group
df_ear = df.loc[df.groupby('id')['date'].idxmin()]
# Count the occurrences of the location
df_ear['location'].value_counts()
The code works perfectly fine, but it cannot return more than one location (using my first line of code) if they share the same earliest date; for example, the id=1 group will only return FL instead of FL and NC. I am wondering how I can fix my code to handle the case where more than one row has the earliest date.
Thanks!
Use GroupBy.transform to broadcast the minimum date per group back onto every row as a Series, which can then be compared against the date column with boolean indexing (keeping ties):
df['date'] = pd.to_datetime(df['date'])
df_ear = df[df.groupby('id')['date'].transform('min').eq(df['date'])]
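With ties kept, the rest of the original code carries over unchanged: value_counts already sorts descending, so .head(3) gives the top 3 locations. A quick end-to-end check on made-up sample data (illustrative only, not the asker's actual dataset):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 2, 2],
                   "location": ["FL", "NC", "SC", "FL", "CA"],
                   "date": ["01/12/1990", "01/12/1990", "11/25/1990",
                            "03/09/1991", "05/01/1992"]})
df["date"] = pd.to_datetime(df["date"])

# Keep every row matching its group's minimum date, ties included.
df_ear = df[df.groupby("id")["date"].transform("min").eq(df["date"])]

# Top 3 most frequent earliest-date locations across all id groups.
top3 = df_ear["location"].value_counts().head(3)
```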
I have a dataset with three columns:
Name Customer Value
Johnny Mike 1
Christopher Luke 0
Christopher Mike 0
Carl Marilyn 1
Carl Stephen 1
I need to create a new dataset with two columns: one holding the unique values from the Name and Customer columns, and the Value column. Values in the Value column were assigned by Name (multiple rows with the same Name share the same value: Carl has value 1, Christopher has value 0, and Johnny has value 1), so Customer-only entries should have empty values in the Value column of the new dataset.
My expected output is
All Value
Johnny 1
Christopher 0
Carl 1
Mike
Luke
Marilyn
Stephen
For the unique values in the All column I take unique().tolist() from both Name and Customer:
name = file['Name'].unique().tolist()
customer = file['Customer'].unique().tolist()
all_with_dupl = name + customer
customers = list(dict.fromkeys(all_with_dupl))  # preserves order while removing duplicates
df = pd.DataFrame(columns=['All', 'Value'])
df['All'] = customers
I do not know how to assign the values in the new dataset after creating the list of all names and customers without duplicates.
Any help would be great.
Split the columns, use .drop_duplicates on each piece to remove duplicates, and then concatenate the customers back (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
pd.concat([
    df.drop(columns='Customer')
      .drop_duplicates()
      .rename(columns={'Name': 'All'}),
    df[['Customer']].rename(columns={'Customer': 'All'})
      .drop_duplicates()
], ignore_index=True)
All Value
0 Johnny 1.0
1 Christopher 0.0
2 Carl 1.0
3 Mike NaN
4 Luke NaN
5 Marilyn NaN
6 Stephen NaN
Or to split the steps up:
names = df.drop(columns='Customer').drop_duplicates().rename(columns={'Name': 'All'})
customers = df[['Customer']].drop_duplicates().rename(columns={'Customer': 'All'})
pd.concat([names, customers], ignore_index=True)
Another way, assuming the two names were read in as a single space-separated 'Name Customer' column:
d = dict(zip(df['Name Customer'].str.split(r'\s').str[0], df['Value']))  # Create dict from name to value
df['Name Customer'] = df['Name Customer'].str.split(r'\s')
df = df.explode('Name Customer').drop_duplicates(keep='first').assign(Value='')  # Explode dataframe and drop duplicates
df['Value'] = df['Name Customer'].map(d).fillna('')  # Map values back
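If Name and Customer are separate columns as in the question (rather than one combined column), the same map-back idea can be sketched with pd.concat; the column names below follow the question's sample:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Johnny", "Christopher", "Christopher", "Carl", "Carl"],
                   "Customer": ["Mike", "Luke", "Mike", "Marilyn", "Stephen"],
                   "Value": [1, 0, 0, 1, 1]})

# Stack names first, then customers, keeping first occurrences only.
all_names = pd.concat([df["Name"], df["Customer"]]).drop_duplicates()

# Values belong to names; customers fall back to an empty string.
value_map = dict(zip(df["Name"], df["Value"]))
out = pd.DataFrame({"All": all_names}).reset_index(drop=True)
out["Value"] = out["All"].map(value_map).fillna("")
```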
Dataframe with duplicate Shop IDs where some Shop IDs occurred twice and some occurred thrice:
I only want to keep unique Shop IDs based on the shortest Shop Distance assigned to its Area.
Area Shop Name Shop Distance Shop ID
0 AAA Ly 86 5d87790c46a77300
1 AAA Hi 230 5ce5522012138400
2 BBB Hi 780 5ce5522012138400
3 CCC Ly 450 5d87790c46a77300
...
91 MMM Ju 43 4f76d0c0e4b01af7
92 MMM Hi 1150 5ce5522012138400
...
Using pandas drop_duplicates drops the duplicate rows, but the condition is based on the first/last occurring Shop ID, which does not let me pick by distance:
shops_df = shops_df.drop_duplicates(subset='Shop ID', keep='first')
I also tried to group by Shop ID and then sort, but sort returns an error: Duplicates
bbtshops_new['C'] = bbtshops_new.groupby('Shop ID')['Shop ID'].cumcount()
bbtshops_new.sort_values(by=['C'], axis=1)
So far I have tried going up to this stage:
# filter all the duplicates into a new df
df_toclean = shops_df[shops_df['Shop ID'].duplicated(keep=False)]
# create a mask for all unique Shop ID
mask = df_toclean['Shop ID'].value_counts()
# create a mask for the Shop ID that occurred 2 times
shop_2 = mask[mask==2].index
# create a mask for the Shop ID that occurred 3 times
shop_3 = mask[mask==3].index
# create a mask for the Shops that are under radius 750
dist_1 = df_toclean['Shop Distance']<=750
# returns results for all the Shop IDs that appeared twice and under radius 750
bbtshops_2 = df_toclean[dist_1 & df_toclean['Shop ID'].isin(shop_2)]
* if I use df_toclean['Shop Distance'].min() instead of dist_1, it returns 0 results
I think I'm doing it the long way and still haven't figured out dropping the duplicates. Does anyone know how to solve this in a shorter way? I'm new to Python, thanks for helping out!
Try first sorting the dataframe by distance, then dropping the duplicate shops.
df = shops_df.sort_values('Shop Distance')
df = df[~df['Shop ID'].duplicated()]  # The tilde (~) inverts the boolean mask.
Or just as one chained expression (per comment from #chmielcode).
df = (
    shops_df
    .sort_values('Shop Distance')
    .drop_duplicates(subset='Shop ID', keep='first')
    .reset_index(drop=True)  # Optional.
)
You can use idxmin:
df.loc[df.groupby('Area')['Shop Distance'].idxmin()]
Area Shop Name Shop Distance Shop ID
0 AAA Ly 86 5d87790c46a77300
2 BBB Hi 780 5ce5522012138400
3 CCC Ly 450 5d87790c46a77300
91 MMM Ju 43 4f76d0c0e4b01af7
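Note that the question asks for unique Shop IDs, while grouping on Area can still leave a Shop ID appearing twice (e.g. the AAA and CCC rows share an ID). If one row per shop is the goal, the same idxmin trick works grouped on 'Shop ID' instead; a sketch on a subset of the sample data:

```python
import pandas as pd

shops_df = pd.DataFrame({
    "Area": ["AAA", "AAA", "BBB", "CCC", "MMM", "MMM"],
    "Shop Name": ["Ly", "Hi", "Hi", "Ly", "Ju", "Hi"],
    "Shop Distance": [86, 230, 780, 450, 43, 1150],
    "Shop ID": ["5d87790c46a77300", "5ce5522012138400", "5ce5522012138400",
                "5d87790c46a77300", "4f76d0c0e4b01af7", "5ce5522012138400"]})

# One row per Shop ID, keeping the row with the shortest distance.
unique_shops = shops_df.loc[shops_df.groupby("Shop ID")["Shop Distance"].idxmin()]
```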
I have a dataframe like this:
name time session1 session2 session3
Alex 135 10 3 5
Lee 136 2 6 4
I want to make multiple dataframes based on each session. For example, dataframe one has name, time, and session1; dataframe two has name, time, and session2; and dataframe three has name, time, and session3. I want to use a for loop, or any better way, but I don't know how to choose columns 1, 2, 3 at one time, then columns 1, 2, 4, and so on. Does anyone have an idea about that? The data is saved in a pandas dataframe. I just don't know how to type it here on Stack Overflow. Thank you
I don't think you need to create a new dictionary for that.
Just directly slice your data frame whenever needed.
df[['name', 'time', 'session1']]
If you think the following design can help you, you can set name and time as the index (df.set_index(['name', 'time'])) and then simply use
df['session1']
Organize it into a dictionary of dataframes:
dict_of_dfs = {f'df {i}':df[['name','time', i]] for i in df.columns[2:]}
Then you can access each dataframe as you would any other dictionary values:
>>> dict_of_dfs['df session1']
name time session1
0 Alex 135 10
1 Lee 136 2
>>> dict_of_dfs['df session2']
name time session2
0 Alex 135 3
1 Lee 136 6
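If the for loop mentioned in the question is preferred, the same slicing works one session column at a time; this sketch assumes the session columns are everything after the first two:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alex", "Lee"],
                   "time": [135, 136],
                   "session1": [10, 2],
                   "session2": [3, 6],
                   "session3": [5, 4]})

# One sub-dataframe per session column, each keeping name and time.
sub_dfs = []
for col in df.columns[2:]:
    sub_dfs.append(df[["name", "time", col]])
```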
This is a continuation of a question I asked before, and I got a suitable answer at the time. Now, however, my problem is different and the given answers no longer (completely) apply.
I have a big collection of Twitter messages and I want to do some statistical analysis on it. Part of the data frame looks as follows:
user.id user.screen_name user.followers_count text
Jim JimTHEbest 14 blahbla
Jim JIMisCOOL 15 blebla
Sarah Sarah123 33 blaat
Sarah Sarah123 33 bla
Peter PeterOnline 9 blabla
user.id never changes and is an identifier of a Twitter account.
user.screen_name The name given to the Twitter account; can change over time.
user.followers_count How many followers the account has; can change over time.
text The Twitter message; each row represents one Twitter message and its metadata.
What I would like to do is count the frequency of tweets by each Twitter user in my data frame and combine it with the data that I already have, so that I get something like this:
user.id user.screen_name user.followers_count count
Jim JIMisCOOL 15 2
Sarah Sarah123 33 2
Peter PeterOnline 9 1
A data frame with one row for each Twitter user in my data set that shows their tweet count and the latest screen_name and followers_count.
What I think I should do is first do the 'count' operation and then pd.merge that outcome with part of my original data frame. Trying merge with the help of the pandas documentation didn't get me very far, mostly endlessly repeating rows of duplicate data. Any help would be greatly appreciated!
The count part I do as follows:
df[['user.id', 'text']].groupby(['user.id']).size().reset_index(name='count')
# df being the original dataframe, taking the last row of each unique user.id and ignoring the 'text' column
output_df = df.drop_duplicates(subset='user.id', keep='last')[['user.id', 'user.screen_name', 'user.followers_count']]
# adding the 'count' column
output_df['count'] = df['user.id'].apply(lambda x: len(df[df['user.id'] == x]))
output_df.reset_index(inplace=True, drop=True)
print(output_df)
>> user.id user.screen_name user.followers_count count
0 Jim JIMisCOOL 15 2
1 Sarah Sarah123 33 2
2 Peter PeterOnline 9 1
You group on user.id, and then use agg to apply a custom aggregation function to each column. In this case, we use a lambda expression and then use iloc to take the last member of each group. We then use count on the text column.
result = df.groupby('user.id').agg({'user.screen_name': lambda group: group.iloc[-1],
'user.followers_count': lambda group: group.iloc[-1],
'text': 'count'})
result.rename(columns={'text': 'count'}, inplace=True)
>>> result[['user.screen_name', 'user.followers_count', 'count']]
user.screen_name user.followers_count count
user.id
Jim JIMisCOOL 15 2
Peter PeterOnline 9 1
Sarah Sarah123 33 2
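As a side note, the lambdas can be replaced by the built-in 'last' and 'count' aggregation names, which tend to be faster and read more clearly; a sketch reusing the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "user.id": ["Jim", "Jim", "Sarah", "Sarah", "Peter"],
    "user.screen_name": ["JimTHEbest", "JIMisCOOL", "Sarah123", "Sarah123", "PeterOnline"],
    "user.followers_count": [14, 15, 33, 33, 9],
    "text": ["blahbla", "blebla", "blaat", "bla", "blabla"]})

# 'last' keeps each user's most recent metadata; 'count' tallies tweets.
result = (df.groupby("user.id")
            .agg({"user.screen_name": "last",
                  "user.followers_count": "last",
                  "text": "count"})
            .rename(columns={"text": "count"})
            .reset_index())
```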
This is how I did it myself, but I'm also going to look into the other answers; they are probably different for a reason :).
df2 = df[['user.id', 'text']].groupby(['user.id']).size().reset_index(name='count')
df = df.set_index('user.id')
df2 = df2.set_index('user.id')
# pd.concat's join_axes argument has been removed; joining df2 onto df's index does the same
result = df.join(df2)
result = result.reset_index()
result = result.drop_duplicates(['user.id'], keep='last')
result = result[['user.id', 'user.screen_name', 'user.followers_count', 'count']]
result
user.id user.screen_name user.followers_count count
1 Jim JIMisCOOL 15 2
3 Sarah Sarah123 33 2
4 Peter PeterOnline 9 1