I have the table below in a Pandas dataframe:
name birth
jack 1989-11-17
joe 1988-09-10
ben 1980-10-20
kate 1985-05-15
nichos 1986-07-05
john 1989-11-12
tom 1980-10-25
jason 1985-05-21
eron 1985-07-10
yun 1989-11-05
kung 1986-07-01
I want to do some aggregation by the month of birth. The results should look like this:
month cnt
1989-11 3
1988-09 1
1986-07 2
1985-07 1
1985-05 2
1980-10 2
Is there a convenient way of doing this?
Many thanks
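Make your data into a time series (a Series indexed by the birth dates) and then call resample. Later pandas versions removed the how= argument, so chain the aggregation instead (recent releases also spell the month-end alias "ME"). Note that resample returns every empty month in between as well, with a count of 0:
s.resample("M").count()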
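If only the months that actually occur are wanted, grouping on the year-month of each date is another option. This is a minimal sketch, not the answerer's method: it rebuilds the sample table by hand and assumes the column names name and birth from the question; the names month and cnt are taken from the desired output.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["jack", "joe", "ben", "kate", "nichos", "john",
             "tom", "jason", "eron", "yun", "kung"],
    "birth": ["1989-11-17", "1988-09-10", "1980-10-20", "1985-05-15",
              "1986-07-05", "1989-11-12", "1980-10-25", "1985-05-21",
              "1985-07-10", "1989-11-05", "1986-07-01"],
})
df["birth"] = pd.to_datetime(df["birth"])

# Convert each date to a monthly period, then count the rows per period.
month = df["birth"].dt.to_period("M")
cnt = (
    df.groupby(month)
      .size()
      .sort_index(ascending=False)   # newest month first, as in the question
      .rename("cnt")
      .rename_axis("month")
)
print(cnt)
```

Unlike resample, this only lists months that have at least one row.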
If I have a pandas DataFrame like this
date person_active
22/2 John
22/2 Marie
22/2 Mark
23/2 John
24/2 Mark
24/2 Marie
How do I count the unique values in person_active over a time-based rolling window? For example, with a 2-day rolling window it should end up like this:
date person_active people_active
22/2 John 3
22/2 Marie 3
22/2 Mark 3
23/2 John 3
24/2 Mark 3
24/2 Marie 3
The main issue here is that each date appears several times (once per active person), so a simple df.rolling('2d', on='date').count() won't do the job.
EDIT: Please consider how the solution scales on a big dataset. Ideally it should be usable in a real-world environment, so if it takes too long to compute it's not that useful.
IIUC, try:
#convert to datetime if needed
df["date"] = pd.to_datetime(df["date"], format="%d/%m")
#convert string names to categorical codes for numerical aggregation
df["people"] = pd.Categorical(df["person_active"]).codes
#compute the rolling unique count
df["people_active"] = (df.rolling("2D", on="date")["people"]
                         .agg(lambda x: x.nunique())
                         .groupby(df["date"])
                         .transform("max")
                       )
#drop the unnecessary column
df = df.drop("people", axis=1)
>>> df
date person_active people_active
0 1900-02-22 John 3.0
1 1900-02-22 Marie 3.0
2 1900-02-22 Mark 3.0
3 1900-02-23 John 3.0
4 1900-02-24 Mark 3.0
5 1900-02-24 Marie 3.0
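Group by date, count the unique values per day, and take a rolling sum. Note that this adds up the per-day counts, so a person who is active on both days of a window is counted twice (23/2 gives 4 rather than 3 on the sample data):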
df.groupby('date').nunique().rolling('2d').sum()
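To count each person only once per window, one option is to work on a date-by-person presence table instead of per-day totals. The following is a rough sketch under stated assumptions rather than a tuned solution: the year 2021 is made up (the question gives only day and month), and the names presence, active_in_window and per_date are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["22/2", "22/2", "22/2", "23/2", "24/2", "24/2"],
    "person_active": ["John", "Marie", "Mark", "John", "Mark", "Marie"],
})
# Assumption: the source has no year, so an arbitrary one is appended.
df["date"] = pd.to_datetime(df["date"] + "/2021", format="%d/%m/%Y")

# One row per date, one column per person: 1 if active that day, else 0.
presence = pd.crosstab(df["date"], df["person_active"]).gt(0).astype(int)

# Rolling max per column: 1 if the person was active anywhere in the window.
active_in_window = presence.rolling("2D").max()

# Summing across the person columns gives the distinct count per date.
per_date = active_in_window.sum(axis=1).astype(int)

# Map the per-date result back onto the original (duplicated) rows.
df["people_active"] = df["date"].map(per_date)
print(df)
```

The presence table has one column per distinct person, so memory grows with the number of people; for a very large population a sparse representation may be needed.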
Could someone help me resolve this? I want to label user data using Python pandas. My dataset has two columns, author and retweeted_screen_name. I want to add a label with this criterion: users in the author column that have the same value in the retweeted_screen_name column get 1, and the others, which do not share a value, get 0.
Author RT_Screen_Name Label
Alice John 1
Sandy John 1
Lisa Mario 0
Luna Mark 0
Luna John 1
Luke Anthony 0
df['Label']=0
df.loc[df["RT_Screen_Name"]=="John", ["Label"]] = 1
It is unclear what condition decides the label, but once you have settled on it you can swap it in for the conditional statement in this code. If you edit your question to clarify the condition, notify me and I will adjust my answer.
IIUC, try with groupby:
df["Label"] = (df.groupby("RT_Screen_Name")["Author"].transform("count")>1).astype(int)
>>> df
Author RT_Screen_Name Label
0 Alice John 1
1 Sandy John 1
2 Lisa Mario 0
3 Luna Mark 0
4 Luna John 1
5 Luke Anthony 0
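A self-contained variant of the groupby approach is sketched below. It swaps "count" for "nunique", which is an assumption about the intent: with "nunique", a screen name only counts as shared when at least two different authors retweeted it, so repeated rows from a single author do not trigger the label.

```python
import pandas as pd

# Sample data copied from the question (without the Label column).
df = pd.DataFrame({
    "Author": ["Alice", "Sandy", "Lisa", "Luna", "Luna", "Luke"],
    "RT_Screen_Name": ["John", "John", "Mario", "Mark", "John", "Anthony"],
})

# For every row, the number of distinct authors that share its RT_Screen_Name.
distinct_authors = df.groupby("RT_Screen_Name")["Author"].transform("nunique")

# 1 when more than one distinct author shares the screen name, else 0.
df["Label"] = (distinct_authors > 1).astype(int)
print(df)
```

On this sample both "count" and "nunique" give the same labels; they differ only when one author retweets the same screen name more than once.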
I wasn't sure how to title this.
Assume the following Pandas DataFrame:
Student ID Class
1 John 99124 Biology
2 John 99124 History
3 John 99124 Geometry
4 Sarah 74323 Physics
5 Sarah 74323 Geography
6 Sarah 74323 Algebra
7 Alex 80045 Trigonometry
8 Alex 80045 Economics
9 Alex 80045 French
I'd like to reduce the number of rows in this DataFrame by collecting the classes that each student is taking into a list and putting that list in the "Class" column. Here's my desired output:
Student ID Class
1 John 99124 ["Biology","History","Geometry"]
2 Sarah 74323 ["Physics","Geography","Algebra"]
3 Alex 80045 ["Trigonometry","Economics","French"]
I am working with a large DataFrame that is not as nicely organized as this example. Any help is appreciated.
You need to group by Student and ID and then use agg:
df.groupby(['Student', 'ID'], as_index=False).agg({'Class': list})
Output:
Student ID Class
0 Alex 80045 [Trigonometry, Economics, French]
1 John 99124 [Biology, History, Geometry]
2 Sarah 74323 [Physics, Geography, Algebra]
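The output above is sorted by the group keys, whereas the desired output keeps the students in their original order. A minimal sketch that preserves first-appearance order is shown here; it uses sort=False and named aggregation, which are my additions and not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Student": ["John"] * 3 + ["Sarah"] * 3 + ["Alex"] * 3,
    "ID": [99124] * 3 + [74323] * 3 + [80045] * 3,
    "Class": ["Biology", "History", "Geometry",
              "Physics", "Geography", "Algebra",
              "Trigonometry", "Economics", "French"],
})

# sort=False keeps groups in order of first appearance;
# Class=("Class", list) collects each group's classes into a Python list.
out = (
    df.groupby(["Student", "ID"], as_index=False, sort=False)
      .agg(Class=("Class", list))
)
print(out)
```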
df.groupby('ID')['Class'].apply(list)
Let's see. With some help from "Apply multiple functions to multiple groupby columns", you could write something like:
df = df.groupby('Student').agg({'ID': 'max', 'Class': lambda x: x.tolist()})
Hope it helps, giulio
Try the following, which joins the quoted class names into one comma-separated string per student:
df.groupby(['Student', 'ID'],as_index=False).agg(lambda x:','.join('"'+x+'"'))
I have a dataframe df which looks like this:
data = [['Alex','Japan'],['Joe','Japan, India']]
df = pd.DataFrame(data,columns=['Name','Countries'])
Name Countries
Alex Japan
Joe Japan, India
So I want to modify df in such a way that when I run df['Countries'].value_counts(), I get
Japan 2
India 1
So I thought that I should convert those strings in df['Countries'] into a list using this:
df['Countries']= df['Countries'].str[0:].str.split(',').tolist()
Name Countries
0 Alex [Japan]
1 Joe [Japan, India]
But now when I run df['Countries'].value_counts(), I get the following error:
TypeError: unhashable type: 'list'
All I want is to get 2 for Japan and 1 for India when I run .value_counts(). Please see if you can help me with this. Thank you!
Use Series.str.split with expand=True, reshape the result to a Series with DataFrame.stack, and then value_counts can be used:
s = df['Countries'].str.split(', ', expand=True).stack().value_counts()
print (s)
Japan 2
India 1
dtype: int64
Another way, using Series.str.get_dummies() (split on ', ' so that no label keeps a leading space):
df.Countries.str.get_dummies(', ').sum()
India 1
Japan 2
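A third possibility, sketched here as an alternative to the two answers above, is to keep the lists but turn each element into its own row with Series.explode before counting. Splitting on ',' and stripping afterwards is my choice so that inconsistent spacing after the comma does not matter.

```python
import pandas as pd

data = [["Alex", "Japan"], ["Joe", "Japan, India"]]
df = pd.DataFrame(data, columns=["Name", "Countries"])

counts = (
    df["Countries"]
      .str.split(",")   # each cell becomes a list of country strings
      .explode()        # one row per list element
      .str.strip()      # drop stray spaces around the names
      .value_counts()
)
print(counts)
```

Lists are unhashable, which is why value_counts fails on the list column directly; after explode every value is a plain string again.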
Given a pandas DataFrame with two string columns that looks like this:
d = {'SCHOOL' : ['Yale', 'Yale', 'LBS', 'Harvard','UCLA', 'Harvard', 'HEC'],
'NAME' : ['John', 'Marc', 'Alex', 'Will', 'Will','Miller', 'Tom']}
df = pd.DataFrame(d)
Notice that the relationship between NAME and SCHOOL is n to 1.
I want to get the last school in case one person has gone to two different schools (see "Will" case).
So far I got:
df = df.groupby('NAME')['SCHOOL'].unique().reset_index()
Return:
NAME SCHOOL
0 Alex [LBS]
1 John [Yale]
2 Marc [Yale]
3 Miller [Harvard]
4 Tom [HEC]
5 Will [Harvard, UCLA]
PROBLEMS:
unique() returns both schools, not only the last school.
This line returns the SCHOOL column as a np.array instead of a string, which makes it very difficult to work further with this df.
Both problems were solved based on @IanS's comments.
Using last() instead of unique():
df = df.groupby('NAME')['SCHOOL'].last().reset_index()
Return:
NAME SCHOOL
0 Alex LBS
1 John Yale
2 Marc Yale
3 Miller Harvard
4 Tom HEC
5 Will UCLA
Use drop_duplicates with keep='last', specifying the column to check for duplicates:
df = df.drop_duplicates('NAME', keep='last')
print (df)
NAME SCHOOL
0 John Yale
1 Marc Yale
2 Alex LBS
4 Will UCLA
5 Miller Harvard
6 Tom HEC
If you also need sorting, add sort_values:
df = df.drop_duplicates('NAME', keep='last').sort_values('NAME')
print (df)
NAME SCHOOL
2 Alex LBS
0 John Yale
1 Marc Yale
5 Miller Harvard
6 Tom HEC
4 Will UCLA
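The two approaches can be compared side by side on the sample data. This is only a small check, with the variable names by_group and by_dedup made up for illustration. One difference worth knowing: groupby(...).last() returns the last non-null value per group, while drop_duplicates keeps the last row even if its SCHOOL is missing, so the results can diverge when the data contains NaN.

```python
import pandas as pd

d = {"SCHOOL": ["Yale", "Yale", "LBS", "Harvard", "UCLA", "Harvard", "HEC"],
     "NAME": ["John", "Marc", "Alex", "Will", "Will", "Miller", "Tom"]}
df = pd.DataFrame(d)

# Approach 1: last SCHOOL per NAME via groupby (result is sorted by NAME).
by_group = df.groupby("NAME")["SCHOOL"].last()

# Approach 2: keep only the last row per NAME, then index by NAME.
by_dedup = (
    df.drop_duplicates("NAME", keep="last")
      .set_index("NAME")["SCHOOL"]
      .sort_index()
)

# On this sample, without missing values, both give the same mapping.
print(by_group.to_dict() == by_dedup.to_dict())
```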