I have a string type column called 'datetimes' that contains multiple dates with their timestamps, and I'm trying to extract the earliest and last dates (without the timestamps) into new columns called 'earliest_date' and 'last date'.
The problem, however, is that the dates are not in order, so it's not as straightforward as using a str.split() method to get the first and last dates in the string. I need to order them first in ascending order.
Here's an example of an entry for one of the rows: 2022-04-13 04:47:00,2022-04-07 01:58:00,2022-03-31 02:32:00,2022-03-25 11:59:00,2022-04-12 05:07:00,2022-03-29 01:46:00,2022-03-31 05:52:00,
As you can see, the order is randomized. I would like to first remove the timestamps, which are conveniently delimited by a whitespace and a comma, then order the dates in ascending order, and finally get the max and min dates into two separate columns.
Can anyone please help me? Thanks in advance :)
df['Campaign Interaction Dates'] = df['Campaign Interaction Dates'].str.replace('/', '-')
def normalise(d):
    # Put a date into YYYY-MM-DD form; reverse DD-MM-YYYY, and fall back to a
    # sentinel that sorts last for anything that isn't a date.
    if len(t := d.split('-')) == 3:
        return d if len(t[0]) == 4 else '-'.join(reversed(t))
    return '9999-99-99'
def earliest_and_last(s):
    out = sorted(normalise(t.strip()[:10]) for t in s.split(',') if t.strip())
    return pd.Series([out[0], out[-1]], index=['earliest_date', 'last date'])
df[['earliest_date', 'last date']] = df['Campaign Interaction Dates'].apply(earliest_and_last)
display(df[df['Number of Campaign Codes registered'] == 3])
You can use the following code if you are not sure that the date format will always be YYYY-MM-DD:
import datetime
string = "2022-04-13 04:47:00,2022-04-07 01:58:00,2022-03-31 02:32:00,2022-03-25 11:59:00,2022-04-12 05:07:00,2022-03-29 01:46:00,2022-03-31 05:52:00"
# Keep only the date part (first 10 characters) of each entry.
dates_list = [date[:10] for date in string.split(',')]
# Sort chronologically by parsing with the known format.
dates_list.sort(key=lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))
min_date, max_date = dates_list[0], dates_list[-1]
You can easily change the date format string here if your data uses a different layout.
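To get this into the two new columns the question asks for, the same logic can be applied row by row; a minimal sketch, assuming the column is named 'datetimes' as in the question:
import datetime
import pandas as pd
def min_max_dates(s):
    # Strip timestamps, then sort the dates chronologically.
    dates = [d.strip()[:10] for d in s.split(',') if d.strip()]
    dates.sort(key=lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))
    return pd.Series([dates[0], dates[-1]], index=['earliest_date', 'last date'])
df[['earliest_date', 'last date']] = df['datetimes'].apply(min_max_dates)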
string = "2022-04-13 04:47:00,2022-04-07 01:58:00,2022-03-31 02:32:00,2022-03-25 11:59:00,2022-04-12 05:07:00,2022-03-29 01:46:00,2022-03-31 05:52:00"
split_string = string.split(",")
split_string.sort()
new_list = []
for i in split_string:
temp_list = i.split()
new_list.append(temp_list[0])
max_date = new_list[-1]
min_date = new_list[0]
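Since ISO-format dates also compare correctly as plain strings, min and max work directly without any sorting; a minimal sketch of the same idea, reusing the string variable above:
dates = [d.split()[0] for d in string.split(',') if d.strip()]
min_date, max_date = min(dates), max(dates)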
I have a .csv file with 29 columns and 1692 rows.
The columns D_INT_1 and D_INT_2 are just dates.
I want to check, for these two columns, whether there are dates between "2022-03-01" and "2024-12-31" (inclusive).
And, if a value is found, I want to display the found date plus the value of the "NAME" column on the same row.
This is what I have right now, but it only grabs the dates, not the adjacent 'NAME' value.
# importing module
import pandas as pd
# reading CSV file
df = pd.read_csv("scratch_2.csv")
# converting column data to list
D_INT_1 = df['D_INT_1'].tolist()
D_INT_2 = df['D_INT_2'].tolist()
ext = []
ext = [i for i in D_INT_1 + D_INT_2 if i >= "2022-03-01" and i <= "2024-12-31"]
print(*ext, sep="\n")
This is what I would like to get:
Example of DF:
NAME, ADDRESS, D_INT_1, D_INT_2
Mark, H4N1V8, 2023-01-02, 2019-01-01
Expected output:
MARK, 2023-01-02
Lots of times the compact [for ... in ...] comprehension syntax can be used efficiently for simple code, but in this case I wouldn't recommend it. I suggest you use a normal for loop. Here's an example:
# importing module
import pandas as pd
# reading CSV file
df = pd.read_csv("scratch_2.csv")
# converting column data to lists
D_INT_1 = df['D_INT_1'].tolist()
D_INT_2 = df['D_INT_2'].tolist()
NAMES = df['NAME'].tolist()
# loop over every row in the data
# (i will start at 0 and increase by 1 every iteration)
for i in range(0, len(D_INT_1)):
    if D_INT_1[i] >= "2022-03-01" and D_INT_1[i] <= "2024-12-31":
        print(NAMES[i], D_INT_1[i])
    if D_INT_2[i] >= "2022-03-01" and D_INT_2[i] <= "2024-12-31":
        print(NAMES[i], D_INT_2[i])
First, for performance, don't use loops; there are vectorized alternatives: unpivot with DataFrame.melt and filter with Series.between and DataFrame.loc:
df = df.melt(id_vars='NAME', value_vars=['D_INT_1','D_INT_2'], value_name='Date')
df1 = df.loc[df['Date'].between("2022-03-01","2024-12-31"), ['NAME','Date']]
print (df1)
NAME Date
0 Mark 2023-01-02
Or filter the original DataFrame and then join the pieces with concat:
df1 = df.loc[df['D_INT_1'].between("2022-03-01","2024-12-31"), ['NAME','D_INT_1']]
df2 = df.loc[df['D_INT_2'].between("2022-03-01","2024-12-31"), ['NAME','D_INT_2']]
df = pd.concat([df1.rename(columns={'D_INT_1':'date'}),
df2.rename(columns={'D_INT_2':'date'})])
print (df)
NAME date
0 Mark 2023-01-02
Last, if you need to loop over the output and print it:
for i in df1.itertuples():
    print(i.NAME, i.Date)
Mark 2023-01-02
So there are a few things to note here:
In this case, you are probably better off with a normal for loop, since the logic is a bit more complicated.
To do what you want, first:
Load the names:
D_INT_1 = df['D_INT_1'].tolist()
D_INT_2 = df['D_INT_2'].tolist()
NAMES = df['NAME'].tolist()
Use enumerate, since we know all the lists are aligned the same. Keep in mind that enumerate yields both the index and the value, but I am looking up the values manually just for cleaner (and clearer) code:
ext = []
for i, _ in enumerate(D_INT_1):
    if D_INT_1[i] >= "2022-03-01" and D_INT_1[i] <= "2024-12-31":
        ext.append((D_INT_1[i], NAMES[i]))
    if D_INT_2[i] >= "2022-03-01" and D_INT_2[i] <= "2024-12-31":
        ext.append((D_INT_2[i], NAMES[i]))
Of course, you can use a list comprehension (or in this case, two), but this form should be easier to understand for this answer.
To do so, you will still need to load the names as in the first step, then use enumerate in the list comprehension, adding the name after i in a tuple. Note that the concatenated list is twice as long as NAMES, so the index has to wrap around. Perhaps something like this:
ext = [(i, NAMES[ind % len(NAMES)]) for ind, i in enumerate(D_INT_1 + D_INT_2) if i >= "2022-03-01" and i <= "2024-12-31"]
Keep in mind that I didn't test the above code since I have no access to the original csv, but it should be a good starting point.
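If you prefer to avoid index bookkeeping entirely, zip is a common alternative; a minimal sketch of the same loop, using the same lists loaded above:
ext = []
for name, d1, d2 in zip(NAMES, D_INT_1, D_INT_2):
    # Check both date columns for the same row in one pass.
    for d in (d1, d2):
        if "2022-03-01" <= d <= "2024-12-31":
            ext.append((d, name))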
The data in test.csv looks like this:
device_id,upload_time,latitude,longitude,mileage,other_vals,speed,upload_time_add_8hour,upload_time_year_month,car_id,car_type,car_num,marketer_name
1101,2020-09-30 16:03:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:03:41,202010,18,1,,
1101,2020-09-30 16:08:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:08:41,202010,18,1,,
1101,2020-09-30 16:13:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:13:41,202010,18,1,,
1101,2020-09-30 16:18:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:18:41,202010,18,1,,
1101,2020-10-02 08:19:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:19:41,202010,18,1,,
1101,2020-10-02 08:24:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:24:41,202010,18,1,,
1101,2020-10-02 08:29:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:29:41,202010,18,1,,
1101,2020-10-02 08:34:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:34:41,202010,18,1,,
1101,2020-10-02 08:39:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:39:41,202010,18,1,,
1101,2020-10-02 08:44:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:44:41,202010,18,1,,
1101,2020-10-02 08:49:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:49:41,202010,18,1,,
1101,2020-10-06 11:11:10+00:00,46.7245,131.14015,0.1,,2.1,2020/10/6 19:11:10,202010,18,1,,
1101,2020-10-06 11:16:10+00:00,46.7245,131.14015,0.1,,2.2,2020/10/6 19:16:10,202010,18,1,,
1101,2020-10-06 11:21:10+00:00,46.7245,131.14015,0.1,,3.84,2020/10/6 19:21:10,202010,18,1,,
1101,2020-10-06 16:46:10+00:00,46.7245,131.14015,0,,0,2020/10/7 0:46:10,202010,18,1,,
1101,2020-10-07 04:44:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:44:27,202010,18,1,,
1101,2020-10-07 04:49:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:49:27,202010,18,1,,
1101,2020-10-07 04:54:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:54:27,202010,18,1,,
1101,2020-10-07 04:59:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:59,202010,18,1,,
1101,2020-10-07 05:04:27+00:00,46.724366,131.1402,1,,0,2020/10/7 13:04:27,202010,18,1,,
I use this code to select the rows where speed is 0, and then group the dataframe by latitude, longitude, year, month and day.
After grouping, I get the first and the last upload_time_add_8hour of each group. If the difference between the first and the last upload_time_add_8hour is more than 5 minutes, I take the first row of the group, and finally I save these rows to a csv.
I think my code is not concise enough.
I use df_first_row = sub_df.iloc[0:1,:] to get the first row of the dataframe, and upload_time_add_8hour_first = sub_df['upload_time_add_8hour'].iloc[0] and upload_time_add_8hour_last = sub_df['upload_time_add_8hour'].iloc[-1] to get the first and the last element of a specific column.
Is there any more suitable way?
My code:
import pandas as pd
device_csv_name = r'E:/test.csv'
df = pd.read_csv(device_csv_name, parse_dates=[7], encoding='utf-8', low_memory=False)
df['upload_time_year_month_day'] = df['upload_time_add_8hour'].dt.strftime('%Y%m%d')
df['upload_time_year_month_day'] = df['upload_time_year_month_day'].astype(str)
df_speed0 = df[df['speed'].astype(float) == 0.0]  # get data where speed is 0.0
gb = df_speed0.groupby(['latitude', 'longitude', 'upload_time_year_month_day'])
sub_dataframe_list = []
for i in gb.indices:
    sub_df = pd.DataFrame(gb.get_group(i))
    sub_df = sub_df.sort_values(by=['upload_time_add_8hour'])
    count_row = sub_df.shape[0]  # get row count
    if count_row > 1:  # each group must have more than 1 row
        upload_time_add_8hour_first = sub_df['upload_time_add_8hour'].iloc[0]  # get first upload_time_add_8hour
        upload_time_add_8hour_last = sub_df['upload_time_add_8hour'].iloc[-1]  # get last upload_time_add_8hour
        minutes_diff = (upload_time_add_8hour_last - upload_time_add_8hour_first).total_seconds() / 60.0
        if minutes_diff >= 5:  # if minutes_diff >= 5, append the first row of the group
            df_first_row = sub_df.iloc[0:1, :]
            sub_dataframe_list.append(df_first_row)
if sub_dataframe_list:
    result = pd.concat(sub_dataframe_list, ignore_index=True)
    result = result.sort_values(by=['upload_time'])
    result.to_csv(r'E:/for_test.csv', index=False, mode='w', header=True, encoding='utf-8')
To get the first and last elements of the column, your option is already the most efficient/correct way. If you're interested in this topic, I can recommend reading this other Stack Overflow answer: https://stackoverflow.com/a/25254087/8294752
To get the first row, I personally prefer DataFrame.head(1), so for your code something like this:
df_first_row = sub_df.head(1)
I didn't look into how the head() method is defined in pandas or its performance implications, but in my opinion it improves readability and avoids some potential confusion with indexes.
In other examples you might also find sub_df.iloc[0], but that option returns a pandas.Series whose index is the DataFrame's column names.
sub_df.head(1) will return a 1-row DataFrame instead, which is the same result as sub_df.iloc[0:1,:]
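To illustrate the difference, a tiny sketch with a throwaway frame:
import pandas as pd
sub_df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(sub_df.iloc[0])    # a Series indexed by the column names
print(sub_df.head(1))    # a 1-row DataFrame, same as sub_df.iloc[0:1, :]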
Your way out is either groupby().agg or df.agg.
If you need it per device, you can:
#sub_df.groupby('device_id')['upload_time_add_8hour'].agg(['first','last'])
sub_df.groupby('device_id')['upload_time_add_8hour'].agg(
    [('upload_time_add_8hour_first', 'first'),
     ('upload_time_add_8hour_last', 'last')]).reset_index()
device_id upload_time_add_8hour_first upload_time_add_8hour_last
0 1101 10/1/2020 0:03 10/7/2020 13:04
If you do not want it per device, maybe try:
sub_df['upload_time_add_8hour'].agg({'upload_time_add_8hour_first': lambda x: x.head(1),'upload_time_add_8hour_last': lambda x: x.tail(1)})
upload_time_add_8hour_first 0 10/1/2020 0:03
upload_time_add_8hour_last 19 10/7/2020 13:04
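Putting the pieces together, the whole loop from the question could be condensed with groupby plus transform; a sketch under the same column names (paths shortened), not tested against the full file:
import pandas as pd
df = pd.read_csv('test.csv', parse_dates=[7])
df_speed0 = df[df['speed'].astype(float) == 0.0].copy()
df_speed0['day'] = df_speed0['upload_time_add_8hour'].dt.strftime('%Y%m%d')
df_speed0 = df_speed0.sort_values('upload_time_add_8hour')
g = df_speed0.groupby(['latitude', 'longitude', 'day'])['upload_time_add_8hour']
# Span between each group's first and last timestamp, broadcast back to every row;
# a span >= 5 minutes also implies the group has more than one row.
span = g.transform('last') - g.transform('first')
result = (df_speed0[span >= pd.Timedelta(minutes=5)]
          .groupby(['latitude', 'longitude', 'day'])
          .head(1)
          .sort_values('upload_time'))
result.to_csv('for_test.csv', index=False)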
I have a pandas dataframe with two columns: the first with just a single date ('action_date') and the second with a list of dates ('verification_date'). I am trying to calculate the time difference between the date in 'action_date' and each of the dates in the list in the corresponding 'verification_date' column, and then fill two new df columns with the number of dates in verification_date that differ by over or under 360 days.
Here is my code:
df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
df = df.groupby(pd.TimeGrouper(freq='2D'))['verification_date'].apply(list).reset_index()  # pd.TimeGrouper became pd.Grouper in later pandas versions
def make_columns(df):
    for i in range(len(df)):
        over_360 = []
        under_360 = []
        for w in [(df['action_date'][i] - x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
    return df
make_columns(df)
This kinda works, EXCEPT the df ends up with the same values in every row, which is wrong since the dates differ. For example, in the first row of the dataframe there IS a difference of over 360 days between the action_date and both of the items in the verification_date list, so the over_360 column should be populated with 2. However, it is empty, and instead the under_360 column is populated with 1, which is accurate only for the second row of 'action_date'.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all help!
Your problem was that you were overwriting the whole column with the value of the last calculation in these lines:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
What you want instead is to set the value for each row's calculation individually; you can do this by replacing the lines above with these:
df.set_value(i,'over_360',len(over_360))
df.set_value(i,'under_360',len(under_360))
What it does is set a value at row i and column over_360 or under_360.
you can learn more about it here.
If you don't like using set_value, you can also use this:
df.ix[i,'over_360'] = len(over_360)
df.ix[i,'under_360'] = len(under_360)
you can check dataframe.ix here.
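Note: both set_value and .ix have since been removed in modern pandas (1.0+); if you are on a current version, the equivalent label-based setters would be:
df.at[i, 'over_360'] = len(over_360)     # fast scalar setter by label
df.loc[i, 'under_360'] = len(under_360)  # general label-based setter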
You might want to try this:
df['over_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days >360) for i in x['verification_date']]) , axis=1)
df['under_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days <360) for i in x['verification_date']]) , axis=1)
I believe it should be a bit faster.
You didn't specify what to do if the difference is exactly 360, so you can just change > or < into >= or <=.