I'm trying to programmatically detect the column in a dataframe that contains dates, and to convert the date values in that column to a single format.
My logic is to find the column name that contains the word 'Date', either as a whole word or as a sub-word (using contains()), and then work on the dates in that column.
My code:
from dateutil.parser import parse
import re
from datetime import datetime
import calendar
import pandas as pd
def date_fun(filepath):
    lst_to_ser = pd.Series(filepath.columns.values)
    date_col_search = lst_to_ser.str.contains(pat='date')
    #print(date_col_search.columns.values)
    for i in date_col_search:
        if i is True:
            formatted_dates = pd.to_datetime(date_col_search[i], errors='coerce')
            print(formatted_dates)
main_path = pd.read_csv('C:/Data_Cleansing/random_dateset.csv')
fpath=main_path.copy()
date_fun(fpath)
The retrieved column names are stored in an array, and since contains() works only on a Series, I converted the array to a Series.
This is what 'date_col_search' variable contains:
0 False
1 True
2 False
dtype: bool
I want to access the column corresponding to the 'True' value. But I'm getting the following error at the line formatted_dates=......:
Exception has occurred: KeyError
True
How should I access the 'True' column?
My dataframe:
random joiningdate branch
1 25.09.2019 rev
8 9/16/2015 pop
98 10.12.2017 switch
65 02.12.2014 high
45 08-Mar-18 aim
2 08-12-2016 docker
0 26.04.2016 grit
9 05-03-2016 trevor
56 24.12.2016 woll
4 10-Aug-19 qerty
78 abc yak
54 05-06-2015 water
42 12-2012-18 rance
43 24-02-2010 stream
38 2008,13,02 verge
78 16-09-2015 atom
I would use:
def mixed_datetime(s):
    # this is just an example; adapt this function to your needs
    return (pd.to_datetime(s, yearfirst=False, dayfirst=True, errors='coerce')
            .fillna(
                pd.to_datetime(s, yearfirst=True, dayfirst=False, errors='coerce')
            )
            )
cols = df.columns.str.contains('date', case=False)
df.loc[:, cols] = df.loc[:, cols].apply(mixed_datetime)
Updated DataFrame:
random joiningdate branch
0 1 2019-09-25 rev
1 8 2015-09-16 pop
2 98 2017-12-10 switch
3 65 2014-12-02 high
4 45 2018-03-08 aim
5 2 2016-12-08 docker
6 0 2016-04-26 grit
7 9 2016-03-05 trevor
8 56 2016-12-24 woll
9 4 2019-08-10 qerty
10 78 NaT yak
11 54 2015-06-05 water
12 42 NaT rance
13 43 2010-02-24 stream
14 38 2008-02-01 verge
15 78 2015-09-16 atom
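As for the KeyError itself: indexing a Series with True, as in date_col_search[i], does a label lookup rather than boolean masking, and the index (0, 1, 2) contains no label True. A minimal sketch of the mask-based selection, reusing the question's own frame and pd.to_datetime call (the dayfirst=True choice is an assumption about the data):
# build a boolean mask over the column names, then select with it
mask = fpath.columns.str.contains('date', case=False)
for col in fpath.columns[mask]:
    fpath[col] = pd.to_datetime(fpath[col], dayfirst=True, errors='coerce')  # dayfirst is an assumption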
I want to pull out all values from a series which have a value found in the n smallest values, as I may have many values with a zero value, but nsmallest(5) only returns 5.
I got this to work, but am wondering if there is a more pythonic way of doing it, like using a lambda, or using a basic in statement?
alcohol[[True if a in alcohol.nsmallest(5).values else False for a in alcohol]] # works, but best way?
IIUC:
>>> alcohol[alcohol <= alcohol.nsmallest(5).max()]
1 46
5 19
6 25
7 17
9 42
dtype: int64
Setup
import numpy as np
import pandas as pd

np.random.seed(2022)
alcohol = pd.Series(np.random.randint(1, 100, 10))
print(alcohol)
# Output
0 93
1 46
2 50
3 56
4 89
5 19
6 25
7 17
8 54
9 42
dtype: int64
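One nuance with the <= comparison: it keeps every row tied with the fifth-smallest value, so it can return more than 5 rows. If that tie behaviour is what you are after, nsmallest can also express it directly via its keep parameter (same idea, one call, though it returns the values sorted):
alcohol.nsmallest(5, keep='all')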
I have a data frame which looks like this:
student_id  session_id  reading_level_id  st_week  end_week
         1        3334                 3        3         3
         1        3335                 2        4         4
         2        3335                 2        2         2
         2        3336                 2        2         3
         2        3337                 2        3         3
         2        3339                 2        3         4
...
There are multiple session_ids, st_weeks and end_weeks for every student_id. I'm trying to group the data by 'student_id', and I want to calculate the difference between the maximum (end_week) and the minimum (st_week) for each student.
Aiming for an output that would look something like this:
Student_id  Diff
         1     1
         2     2
...
I am relatively new to Python as well as Stack Overflow and have been trying to find an appropriate solution - any help is appreciated.
Using the data you shared, a simpler solution is possible:
Group by student_id, and pass False to the as_index parameter (this works on a dataframe and returns a dataframe);
Next, use a named aggregation to get the max week for end_week and the min week for st_week for each group;
Get the difference between max_wk and min_wk;
Finally, keep only the required columns.
(
    df.groupby("student_id", as_index=False)
    .agg(max_wk=("end_week", "max"), min_wk=("st_week", "min"))
    .assign(Diff=lambda x: x["max_wk"] - x["min_wk"])
    .loc[:, ["student_id", "Diff"]]
)
student_id Diff
0 1 1
1 2 2
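A small design note: as_index=False keeps student_id as a regular column throughout the chain. With the default as_index=True, the same pipeline would need a reset_index() before the final selection, along these lines:
(
    df.groupby("student_id")
    .agg(max_wk=("end_week", "max"), min_wk=("st_week", "min"))
    .assign(Diff=lambda x: x["max_wk"] - x["min_wk"])
    .reset_index()
    .loc[:, ["student_id", "Diff"]]
)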
There's probably a more efficient way to do this, but I broke it into separate steps: grouping to get the max and min values for each id, then creating a new column for the difference. I used numpy's randint() in this example because I didn't have access to a sample dataframe.
import pandas as pd
import numpy as np
# generate dataframe
df = pd.DataFrame(np.random.randint(0,100,size=(1200, 4)), columns=['student_id', 'session_id', 'st_week', 'end_week'])
# use groupby to get max and min for each student_id
max_vals = df.groupby(['student_id'], sort=False)['end_week'].max().to_frame()
min_vals = df.groupby(['student_id'], sort=False)['st_week'].min().to_frame()
# use join to put max and min back together in one dataframe
merged = min_vals.join(max_vals)
# use assign() to calculate difference as new column
merged = merged.assign(difference=lambda x: x.end_week - x.st_week).reset_index()
merged
student_id st_week end_week difference
0 40 2 99 97
1 23 5 74 69
2 78 9 93 84
3 11 1 97 96
4 97 24 88 64
... ... ... ... ...
95 54 0 96 96
96 18 0 99 99
97 8 18 97 79
98 75 21 97 76
99 33 14 93 79
You can create a custom function and apply it to a group-by over students:
def week_diff(g):
    return g.end_week.max() - g.st_week.min()

df.groupby("student_id").apply(week_diff)
Result:
student_id
1 1
2 2
dtype: int64
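The apply version returns a Series indexed by student_id; if you want the same two-column frame as the first answer, a reset_index converts it (the column label name="Diff" is my choice here):
df.groupby("student_id").apply(week_diff).reset_index(name="Diff")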
As part of my ongoing quest to get my head around pandas I am confronted by a surprise series. I don't understand how and why the output is a series - I was expecting a dataframe. If someone could explain what is happening here it would be much appreciated.
ta, Andrew
Some data:
hash email date subject subject_length
0 65319af6e jbrockmendel#gmail.com 2020-11-28 REF-IntervalIndex._assert_can_do_setop-38112 44
1 0bf58d8a9 simonjayhawkins#gmail.com 2020-11-28 DOC-add-contibutors-to-1.2.0-release-notes-38132 48
2 d16df293c 45562402+rhshadrach#users.noreply.github.com 2020-11-28 TYP-Add-cast-to-ABC-Index-like-types-38043 42
...
Some Code:
def my_function(row):
    output = row['email'].value_counts().sort_values(ascending=False).head(3)
    return output

top_three = dataframe.groupby(pd.Grouper(key='date', freq='1M')).apply(my_function)
Some Output:
date
2020-01-31 jbrockmendel#gmail.com 159
50263213+MomIsBestFriend#users.noreply.github.com 44
TomAugspurger#users.noreply.github.com 41
...
2020-10-31 jbrockmendel#gmail.com 170
2658661+dsaxton#users.noreply.github.com 23
61934744+phofl#users.noreply.github.com 21
2020-11-30 jbrockmendel#gmail.com 134
61934744+phofl#users.noreply.github.com 36
41443370+ivanovmg#users.noreply.github.com 19
Name: email, dtype: int64
It depends on what your Groupby is returning.
In your case, you are applying a function on row['email'] and returning a single value_counts, while all the other columns in your data become part of the index. In other words, you are returning a single-column, multi-indexed output after the groupby, which pandas returns as a Series instead of a DataFrame. A reset_index() would therefore give you what you need.
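For instance, a sketch against the top_three Series from the question (the exact column labels after the reset depend on your pandas version and the index level names):
top_three.reset_index()  # 'date' and the email index level become ordinary columns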
For more clarity on which data structure is returned, we can do a toy experiment.
For example, in the first case the apply function is applied to groups where each group contains a dataframe (check [i for i in df.groupby(['a'])] to see what each group contains).
df = pd.DataFrame({'a':[1,1,2,2,3], 'b':[4,5,6,7,8]})
print(df.groupby(['a']).apply(lambda x:x**2))
#dataframe
a b
0 1 16
1 1 25
2 4 36
3 4 49
4 9 64
For the second case, the lambda function is applied to a Series object, so only a single Series is returned. In this case the result is a Series rather than a DataFrame.
print(df.groupby(['a'])['b'].apply(lambda x:x**2))
#series
0 16
1 25
2 36
3 49
4 64
Name: b, dtype: int64
This can be solved simply by selecting b with a list ([['b']]), which keeps each group as a one-column DataFrame:
print(df.groupby(['a'])[['b']].apply(lambda x:x**2))
#dataframe
    b
0  16
1  25
2  36
3  49
4  64
I have a table that looks like this:
temp = [['K98R', 'AB',34,'2010-07-27', '2013-08-17', '2008-03-01', '2011-05-02', 44],['S33T','ES',55, '2009-07-23', '2012-03-12', '2010-09-17', '', 76]]
Data = pd.DataFrame(temp,columns=['ID','Initials','Age', 'Entry','Exit','Event1','Event2','Weight'])
What you see in the table above is that there are entry and exit dates, with dates for events 1 and 2; there is also a missing date for event 2 for the second patient because the event didn't happen. Also note that event 1 for the first patient happened before the entry date.
What I am trying to achieve is three-fold:
1. Split the time between the entry and exit dates into years
2. Convert the wide format to a long one, with one row per year
3. Check whether events 1 and 2 occurred during the time period covered by each row
To explain further, here is the output I am trying to get.
ID Initial Age Entry Exit Event1 Event2 Weight
K89R AB 34 27/07/2010 31/12/2010 1 0 44
K89R AB 35 1/01/2011 31/12/2011 1 1 44
K89R AB 36 1/01/2012 31/12/2012 1 1 44
K89R AB 37 1/01/2013 17/08/2013 1 1 44
S33T ES 55 23/07/2009 31/12/2009 0 0 76
S33T ES 56 1/01/2010 31/12/2010 1 0 76
S33T ES 57 1/01/2011 31/12/2011 1 0 76
S33T ES 58 1/01/2012 12/03/2012 1 0 76
What you notice here is that the entry-to-exit period is split into individual rows per patient, each representing a year. The event columns are now coded as 0 (the event has not yet happened) or 1 (the event happened), and the 1 is carried over to later years because the event has already happened.
The age increases in every row per patient as time progresses.
The patient ID and initials remain the same, as does the weight.
Could anyone please help with this, thank you
Begin by getting the number of years between Entry and Exit:
# Convert to datetime
df.Entry = pd.to_datetime(df.Entry)
df.Exit = pd.to_datetime(df.Exit)
df.Event1 = pd.to_datetime(df.Event1)
df.Event2 = pd.to_datetime(df.Event2)
# Round up, to include the upper years
import math
df['Years_Between'] = (df.Exit - df.Entry).apply(lambda x: math.ceil(x.days/365))
# printing the df will provide the following:
ID Initials Age Entry Exit Event1 Event2 Weight Years_Between
0 K98R AB 34 2010-07-27 2013-08-17 2008-03-01 2011-05-02 44 4
1 S33T ES 55 2009-07-23 2012-03-12 2010-09-17 NaT 76 3
Loop through your data and create a new row for each year:
new_data = []
for idx, row in df.iterrows():
    year = row['Entry'].year
    new_entry = pd.to_datetime(year, format='%Y')
    for y in range(row['Years_Between']):
        new_entry = new_entry + pd.DateOffset(years=1)
        new_exit = new_entry + pd.DateOffset(years=1) - pd.DateOffset(days=1)
        record = {'Entry': new_entry, 'Exit': new_exit}
        if row['Entry'] > new_entry:
            record['Entry'] = row['Entry']
        if row['Exit'] < new_exit:
            record['Exit'] = row['Exit']
        for col in ['ID', 'Initials', 'Age', 'Event1', 'Event2', 'Weight']:
            record[col] = row[col]
        new_data.append(record)
Create a new DataFrame, then compare the dates:
df_new = pd.DataFrame(new_data, columns = ['ID','Initials','Age', 'Entry','Exit','Event1','Event2','Weight'])
df_new['Event1'] = (df_new.Event1 <= df_new.Exit).astype(int)
df_new['Event2'] = (df_new.Event2 <= df_new.Exit).astype(int)
# printing df_new will provide:
ID Initials Age Entry Exit Event1 Event2 Weight
0 K98R AB 34 2011-01-01 2011-12-31 1 1 44
1 K98R AB 34 2012-01-01 2012-12-31 1 1 44
2 K98R AB 34 2013-01-01 2013-08-17 1 1 44
3 K98R AB 34 2014-01-01 2013-08-17 1 1 44
4 S33T ES 55 2010-01-01 2010-12-31 1 0 76
5 S33T ES 55 2011-01-01 2011-12-31 1 0 76
6 S33T ES 55 2012-01-01 2012-03-12 1 0 76
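One detail worth noting: the missing Event2 date was coerced to NaT, and comparisons against NaT evaluate to False, so (df_new.Event2 <= df_new.Exit).astype(int) codes the absent event as 0 with no special handling required.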
I have the following sample data frame:
id category time
43 S 8
22 I 10
15 T 350
18 L 46
I want to apply the following logic:
1) if category value equals "T" then create new column called "time_2" where "time" value is divided by 24.
2) if category value equals "L" then create new column called "time_2" where "time" value is divided by 3.5.
3) otherwise take existing "time" value from categories S or I
Below is my desired output table:
id category time time_2
43 S 8 8
22 I 10 10
15 T 350 14.58333333
18 L 46 13.14285714
I've tried using pd.np.where to get the above to work, but am confused about the syntax.
You can use map for the rules:
In [1066]: df['time_2'] = df.time / df.category.map({'T': 24, 'L': 3.5}).fillna(1)
In [1067]: df
Out[1067]:
id category time time_2
0 43 S 8 8.000000
1 22 I 10 10.000000
2 15 T 350 14.583333
3 18 L 46 13.142857
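What the map call does: it builds a Series of divisors (24 where category is 'T', 3.5 where it is 'L', NaN elsewhere), and fillna(1) turns the NaNs into a neutral divisor so the S and I rows keep their original time:
df.category.map({'T': 24, 'L': 3.5}).fillna(1)  # 1.0, 1.0, 24.0, 3.5 for the sample rows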
You can use np.select. This is a good alternative to nested np.where logic.
import numpy as np

conditions = [df['category'] == 'T', df['category'] == 'L']
values = [df['time'] / 24, df['time'] / 3.5]
df['time_2'] = np.select(conditions, values, df['time'])
print(df)
id category time time_2
0 43 S 8 8.000000
1 22 I 10 10.000000
2 15 T 350 14.583333
3 18 L 46 13.142857
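For comparison, the nested np.where version the question was reaching for looks like this; it gives the same result but is harder to extend to more conditions:
df['time_2'] = np.where(df['category'] == 'T', df['time'] / 24,
                        np.where(df['category'] == 'L', df['time'] / 3.5, df['time']))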