Finding the earliest date after a groupby on a specific column - python

I have a dataframe that looks like the one below.
id name tag location date
1 John 34 FL 01/12/1990
1 Peter 32 NC 01/12/1990
1 Dave 66 SC 11/25/1990
1 Mary 12 CA 03/09/1990
1 Sue 29 NY 07/10/1990
1 Eve 89 MA 06/12/1990
: : : : :
n John 34 FL 01/12/2000
n Peter 32 NC 01/12/2000
n Dave 66 SC 11/25/1999
n Mary 12 CA 03/09/1999
n Sue 29 NY 07/10/1998
n Eve 89 MA 06/12/1997
I need to find the location information based on the id column, with one condition: I only need the earliest date. For example, the earliest date for the id=1 group is 01/12/1990, which means the locations are FL and NC. I then apply this to every id group to get the top 3 locations. I have written the code to do this:
# Get the earliest date based on the id group
df_ear = df.loc[df.groupby('id')['date'].idxmin()]
# Count the occurrences of each location
df_ear['location'].value_counts()
The code works perfectly fine, but the first line cannot return more than one location when several rows share the same earliest date; for example, the id=1 group only returns FL instead of FL and NC. How can I fix my code to handle the case where more than one row has the earliest date?
Thanks!

Use GroupBy.transform to get a Series with the minimal date per group, so you can compare it against the date column in boolean indexing:
df['date'] = pd.to_datetime(df['date'])
df_ear = df[df.groupby('id')['date'].transform('min').eq(df['date'])]
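With ties kept, the value_counts step from the question then gives the top 3 locations directly; a minimal sketch:
# count how often each location appears on an id's earliest date,
# then keep the 3 most frequent (value_counts sorts descending by count)
top3 = df_ear['location'].value_counts().head(3)
print(top3)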

Related

Add a column in python with row numbers

I have a dataset like this:
state  sex  birth  player
QLD    m    1993   Dave
QLD    m    1992   Rob
Now I would like to create an additional column, ID, which is equal to the row number plus 1:
df = df.assign(ID=range(len(df)))
But unfortunately the first ID is zero. How can I make the first ID begin with 1, and so on?
I want this output:
state  sex  birth  player  ID
QLD    m    1993   Dave    1
QLD    m    1992   Rob     2
but I got this:
state  sex  birth  player  ID
QLD    m    1993   Dave    0
QLD    m    1992   Rob     1
How can I add an additional column that starts at one and gives every row a unique number, so 2 for the second row, 3 for the third, and so on?
You can try this:
import pandas as pd
df['ID'] = pd.Series(range(len(df))) + 1
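Note that pd.Series(range(len(df))) relies on index alignment, which works here because df has the default RangeIndex. A sketch that assigns positionally and sidesteps alignment entirely (using numpy, an extra assumption):
import numpy as np
# 1-based IDs assigned by position, regardless of the existing index
df['ID'] = np.arange(1, len(df) + 1)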

How to combine the first 2 columns in pandas/python with n/a values

I have a question about combining the first 2 columns in pandas/python when one of them contains n/a values.
Long story short: I need to read an Excel file and make changes to it. I cannot change anything in Excel itself, so every change has to be done in Python.
The first column of the Excel input uses merged cells, so once the file is read in, only the first row of each block has a value and the rest of the rows are all NaN, such as below:
Year   number  2016
Month          Jan
Month          2016-01
Grade  1       100
NaN    2       99
NaN    3       98
NaN    4       96
NaN    5       92
NaN    Total   485
Is there any function that can easily help me combine the first two columns and make it look like this:
Year         2016
Month        Jan
Month        2016-01
Grade 1      100
Grade 2      99
Grade 3      98
Grade 4      96
Grade 5      92
Grade Total  485
Anything would be really appreciated. I searched and googled the keywords for a long time but did not find any answer that fits my situation.
import pandas as pd
from io import StringIO

# reproduce the data as it comes in from Excel
d = '''
Year,number,2016
Month,,Jan
Month,,2016-01
Grade,1, 100
NaN,2, 99
NaN,3, 98
NaN,4, 96
NaN,5, 92
NaN,Total,485
'''
df = pd.read_csv(StringIO(d))
df
# forward-fill the merged first column, then join it with the second
df['Year'] = df.Year.ffill()
df = df.fillna('')  # skip this step if your data from Excel does not have NaN in col 2
df['Year'] = df.Year + ' ' + df.number.astype('str')
df = df.drop('number', axis=1)
df
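If you read straight from Excel rather than the CSV reproduction above, the same forward-fill idea applies. A sketch, where the file name is a placeholder and we assume the merged cells in the first column come in as NaN:
# 'input.xlsx' is hypothetical; merged cells below the first row read as NaN
df = pd.read_excel('input.xlsx')
df.iloc[:, 0] = df.iloc[:, 0].ffill()  # forward-fill the merged first column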

How to count Pandas df elements with dynamic condition per row (=countif)

I am trying to do the equivalent of COUNTIF in Pandas. I have been trying to get my head around doing it with groupby, but I am struggling because my logical grouping condition is dynamic.
Say I have a list of customers and the day on which they visited. I want to identify new customers based on 2 logical conditions:
They must be the same customer (same Guest ID)
They must have been there on the previous day
If both conditions are met, they are a returning customer. If not, they are new (hence newby = 1 - ... to identify new customers).
I managed to do this with a for loop, but obviously performance is terrible and this goes pretty much against the logic of Pandas.
How can I wrap the following code into something smarter than a loop?
for i in range(0, len(df)):
    newby = 1 - np.sum((df["Day"] == df.iloc[i]["Day"] - 1) & (df["Guest ID"] == df.iloc[i]["Guest ID"]))
This post does not help, as its condition is static. I would like to avoid introducing "dummy columns", such as by transposing the df, because I will have many categories (many customer names) and would like to build more complex logical statements. I do not want to run the risk of ending up with many auxiliary columns.
I have the following input
df
Day Guest ID
0 3230 Tom
1 3230 Peter
2 3231 Tom
3 3232 Peter
4 3232 Peter
and expect this output
df
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
Note that elements 3 and 4 are not necessarily duplicates, since there might be additional, varying columns (such as their order).
Do:
# ensure the df is sorted by date
df = df.sort_values('Day')
# group by customer and find the diff within each group
df['newby'] = (df.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
print(df)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
UPDATE
If multiple visits are allowed per day, you could do:
# only keep unique visits per day
uniques = df.drop_duplicates()
# ensure the df is sorted by date
uniques = uniques.sort_values('Day')
# group by customer and find the diff within each group
uniques['newby'] = (uniques.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
# merge the uniques visits back into the original df
res = df.merge(uniques, on=['Day', 'Guest ID'])
print(res)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
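To see why fillna(2) flags first-time guests, look at the intermediate per-guest differences on the deduplicated frame; a sketch:
# NaN -> the guest's first visit (fillna(2) turns it into 2, i.e. > 1, so new)
# 1.0 -> the guest was there the previous day (returning)
# >1  -> a gap since the last visit (treated as new again)
print(uniques.groupby('Guest ID')['Day'].transform('diff'))
# 0    NaN
# 1    NaN
# 2    1.0
# 3    2.0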
As an alternative, without sorting or merging, you could do:
lookup = {(day + 1, guest) for day, guest in df[['Day', 'Guest ID']].value_counts().to_dict()}
df['newby'] = (~pd.MultiIndex.from_arrays([df['Day'], df['Guest ID']]).isin(lookup)).astype(int)
print(df)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1

Update a value based on another dataframe pairing

I have a problem where I need to update a value if people were at the same table.
import pandas as pd

data = {"p1": ['Jen', 'Mark', 'Carrie'],
        "p2": ['John', 'Jason', 'Rob'],
        "value": [10, 20, 40]}
df = pd.DataFrame(data, columns=['p1', 'p2', 'value'])

meeting = {'person': ['Jen', 'Mark', 'Carrie', 'John', 'Jason', 'Rob'],
           'table': [1, 2, 3, 1, 2, 3]}
meeting = pd.DataFrame(meeting, columns=['person', 'table'])
df is a relationship table and value is the field I need to update. So if two people were at the same table in the meeting dataframe, then update the corresponding df row.
For example: Jen and John were both at table 1, so I need to update the row in df that has Jen and John and set their value to value + 100, so 110.
I thought about maybe doing a self join on meeting to get the format to match that of df, but I am not sure if this is the easiest or fastest approach.
IIUC you could set the person as index in the meeting dataframe, and use its table values to replace the names in df. Then if both mappings have the same value (table), replace with df.value+100:
m = df[['p1','p2']].replace(meeting.set_index('person').table).eval('p1==p2')
df['value'] = df.value.mask(m, df.value+100)
print(df)
p1 p2 value
0 Jen John 110
1 Mark Jason 120
2 Carrie Rob 140
This could be an approach, using df.to_records():
groups = meeting.groupby('table').agg(set)['person'].to_list()
df['value'] = [row[-1] + 100 if set(list(row)[1:3]) in groups else row[-1]
               for row in df.to_records()]
Output:
df
p1 p2 value
0 Jen John 110
1 Mark Jason 120
2 Carrie Rob 140
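The self-join the asker was considering can also be expressed with Series.map, which keeps everything vectorized; a sketch of that idea:
# map each person to their table number, then flag pairs sharing a table
tables = meeting.set_index('person')['table']
same_table = df['p1'].map(tables).eq(df['p2'].map(tables))
df.loc[same_table, 'value'] += 100
print(df)  # Jen/John -> 110, Mark/Jason -> 120, Carrie/Rob -> 140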

How to add rows to a DataFrame within a for loop?

I want to add a row in an existing data frame, where I don't have a matching regex value.
For example,
import pandas as pd
import numpy as np
import re

lst = ['Sarah Kim', 'Added on January 21']
df = pd.DataFrame(lst)
df.columns = ['Info']

name_pat = r"^[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+"
date_pat = r"\b(\w*Added on\w*)\b"
title_pat = r"\b(\w*at\w*)\b"

for index, row in df.iterrows():
    if re.findall(name_pat, str(row['Info'])):
        print("Name matched")
    elif re.findall(title_pat, str(row['Info'])):
        print("Title matched")
        if not re.findall(title_pat, str(row['Info'])):
            pass  # add a row here in the dataframe
    elif re.findall(date_pat, str(row['Info'])):
        print("Date matched")
        if not re.findall(date_pat, str(row['Info'])):
            pass  # add a row here in the dataframe
So here in my dataframe df, I do not have a title, just a Name and a Date. While looping over df, I want to add an empty row for the title.
The output is:
Info
0 Sarah Kim
1 Added on January 21
My expected output is:
Info
0 Sarah Kim
1 None
2 Added on January 21
Is there any way that I can add an empty row, or is there a better way?
UPDATE
The dataset I'm working with is just one column with many rows. The rows have a repeating structure of "name, title, date". For example,
Info
0 Sarah Kim
1 Added on January 21
2 Jesus A. Moore
3 Marketer
4 Added on May 30
5 Bobbie J. Garcia
6 CEO
7 Anita Jobe
8 Designer
9 Added on January 3
...
998 Michael B. Reedy
999 Salesman
1000 Added on December 13
I have sliced the data frame so that I can extract sections that look like this:
Info
0 Sarah Kim
1 Added on January 21
And I'm trying to run a loop over each section; if a date or title is missing, I fill it in with an empty row, so that in the end I will have:
Info
0 Sarah Kim
1 **NULL**
2 Added on January 21
3 Jesus A. Moore
4 Marketer
5 Added on May 30
6 Bobbie J. Garcia
7 CEO
8 **NULL**
9 Anita Jobe
10 Designer
11 Added on January 3
...
998 Michael B. Reedy
999 Salesman
1000 Added on December 13
I see you have a long dataframe with information, and each set of information is different. I think your goal is to have a dataframe with 3 columns: Name, Title and Date.
Here is a way I would approach this problem, with some code samples. I would take advantage of the df.shift method to tie the information together, using your existing dataframe to create a new one.
I am also making some assumptions based on what you have listed above. First, I will assume that only the Title and Date fields can be missing. Second, I will assume that the order of the fields is Name, Title and Date, as you mentioned above.
import re
import pandas as pd

# first step: create test data
test_list = ['Sarah Kim', 'Added on January 21', 'Jesus A. Moore', 'Marketer',
             'Added on May 30', 'Bobbie J. Garcia', 'CEO', 'Anita Jobe',
             'Designer', 'Added on January 3']
test_df = pd.DataFrame(test_list, columns=['Info'])

# second step: use your regex to classify each Info value
name_pat = r"^[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+"
date_pat = r"\b(\w*Added on\w*)\b"
title_pat = r"\b(\w*at\w*)\b"
test_df['Col'] = test_df['Info'].apply(
    lambda x: 'Name' if re.findall(name_pat, x)
    else ('Date' if re.findall(date_pat, x) else 'Title'))

# third step: look ahead at the next values using df.shift
test_df['Next_col'] = test_df['Col'].shift(-1)
test_df['Next_col2'] = test_df['Col'].shift(-2)
test_df['Next_val1'] = test_df['Info'].shift(-1)
test_df['Next_val2'] = test_df['Info'].shift(-2)

# now filter to only the names and apply a function to get name, title and date
new_df = test_df[test_df['Col'] == 'Name']

def apply_func(row):
    name = row['Info']
    title = None
    date = None
    if row['Next_col'] == 'Title':
        title = row['Next_val1']
    elif row['Next_col'] == 'Date':
        date = row['Next_val1']
    if row['Next_col2'] == 'Date':
        date = row['Next_val2']
    row['Name'] = name
    row['Title'] = title
    row['date'] = date
    return row

final_df = new_df.apply(apply_func, axis=1)[['Name', 'Title', 'date']].reset_index(drop=True)
print(final_df)
Name Title date
0 Sarah Kim None Added on January 21
1 Jesus A. Moore Marketer Added on May 30
2 Bobbie J. Garcia CEO None
3 Anita Jobe Designer Added on January 3
There is probably a way to do this in fewer lines of code; I welcome anyone who can make it more efficient, but I believe this should work. Also, if you want to flatten this back into a single column:
flattened_df = pd.DataFrame(final_df.values.flatten(), columns=['Info'])
print(flattened_df)
Info
0 Sarah Kim
1 None
2 Added on January 21
3 Jesus A. Moore
4 Marketer
5 Added on May 30
6 Bobbie J. Garcia
7 CEO
8 None
9 Anita Jobe
10 Designer
11 Added on January 3
