Python: Convert columns into date format and extract order

I am asking for help in transforming values into date format.
I have the following data structure:
ID ACT1 ACT2 ACT3 ACT4
1 154438.0 154104.0 155321.0 155321.0
2 154042.0 154073.0 154104.0 154104.0
...
The numbers in columns ACT1-ACT4 need to be converted. Some rows contain NaN values.
I found that the following snippet helps me get a Gregorian date:
from datetime import datetime, timedelta
gregorian = datetime.strptime('1582/10/15', "%Y/%m/%d")
modified_date = gregorian + timedelta(days=154438)
datetime.strftime(modified_date, "%Y/%m/%d")
It would be great to know how I can apply this transformation to all columns except for "ID" and whether the approach is correct (or could be improved).
After the transformation is applied, I need to extract the order of column items, sorted by date in ascending order. For instance
ID ORDER
1 ACT1, ACT3, ACT4, ACT2
2 ACT2, ACT1, ACT3, ACT4
Thank you!

It sounds like you have two questions here.
1) To change to datetime:
cols = [col for col in df.columns if col != 'ID']
df.loc[:, cols] = df.loc[:, cols].applymap(lambda x: datetime.strptime('1582/10/15', "%Y/%m/%d") + timedelta(days=x) if np.isfinite(x) else x)
2) To get the sorted column names:
df['ORDER'] = df.loc[:, cols].apply(lambda dr: ','.join(dr.dropna().sort_values().index), axis=1)
Note: the dropna above will omit columns with NaT values from the order string.
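For reference, here is a minimal self-contained sketch of the two steps above; the frame is rebuilt from the sample data in the question, and pandas/numpy are assumed to be installed:
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
df = pd.DataFrame({'ID': [1, 2],
                   'ACT1': [154438.0, 154042.0],
                   'ACT2': [154104.0, 154073.0],
                   'ACT3': [155321.0, 154104.0],
                   'ACT4': [155321.0, 154104.0]})
gregorian = datetime.strptime('1582/10/15', "%Y/%m/%d")
cols = [col for col in df.columns if col != 'ID']
# 1) day counts -> datetimes; NaN values pass through unchanged
df[cols] = df[cols].applymap(lambda x: gregorian + timedelta(days=x) if np.isfinite(x) else x)
# 2) per row, join the column names in ascending date order (NaNs dropped)
df['ORDER'] = df[cols].apply(lambda r: ','.join(r.dropna().sort_values().index), axis=1)
print(df[['ID', 'ORDER']])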

First I would make the input comma-separated so that it's much easier to handle, of the form:
ID,ACT1,ACT2,ACT3,ACT4
1,154438.0,154104.0,155321.0,155321.0
2,154042.0,154073.0,154104.0,154104.0
Then you can read each line using a CSV reader, extracting key/value pairs that have your column names as keys. You then pop 'ID' off that dictionary to get its value (1, 2, etc.), and reorder the remaining items according to the value, which is the date. The code is below:
#!/usr/bin/env python3
import csv
from operator import itemgetter

idAndTuple = {}
with open('time.txt') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        myID = row.pop('ID', None)
        reorderedList = sorted(row.items(), key=itemgetter(1))
        idAndTuple[myID] = reorderedList
        print(myID, reorderedList)
The result when you run this is:
1 [('ACT2', '154104.0'), ('ACT1', '154438.0'), ('ACT3', '155321.0'), ('ACT4', '155321.0')]
2 [('ACT1', '154042.0'), ('ACT2', '154073.0'), ('ACT3', '154104.0'), ('ACT4', '154104.0')]
which I think is what you are looking for.
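If you also want the serial numbers shown as real Gregorian dates in that output, here is a sketch building on the same reader; it reuses the epoch from the question's snippet and assumes the same time.txt file:
import csv
from datetime import datetime, timedelta
from operator import itemgetter
EPOCH = datetime(1582, 10, 15)  # epoch taken from the question
with open('time.txt') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        myID = row.pop('ID', None)
        # convert each day count to a date; skip empty (NaN) cells
        dated = {k: EPOCH + timedelta(days=float(v)) for k, v in row.items() if v}
        reordered = sorted(dated.items(), key=itemgetter(1))
        print(myID, [(k, d.strftime('%Y/%m/%d')) for k, d in reordered])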

Related

How do I reorder a long string of concatenated date and timestamps separated by commas using Python?

I have a string type column called 'datetimes' that contains multiple dates with their timestamps, and I'm trying to extract the earliest and last dates (without the timestamps) into new columns called 'earliest_date' and 'last date'.
The problem, however, is that the dates are not in order, so it's not as straightforward as using a str.split() method to get the first and last dates in the string. I need to order them first in ascending order.
Here's an example of an entry for one of the rows: 2022-04-13 04:47:00,2022-04-07 01:58:00,2022-03-31 02:32:00,2022-03-25 11:59:00,2022-04-12 05:07:00,2022-03-29 01:46:00,2022-03-31 05:52:00,
As you can see, the order is randomized. I would like to first remove the timestamps, which are fortunately between a whitespace and a comma, then order the dates in ascending order, and finally get the max and min dates into two separate columns.
Can anyone please help me? Thanks in advance :)
df['Campaign Interaction Dates'] = df['Campaign Interaction Dates'].str.replace('/', '-')
def normalise(d):
    if len(t := d.split('-')) == 3:
        return d if len(t[0]) == 4 else '-'.join(reversed(t))
    return '9999-99-99'
out = sorted(normalise(t[:10]) for t in str(df[df['Campaign Interaction Dates']]).split(',') if t)
df['out'] = out[1]
print(display(df[df['Number of Campaign Codes registered']==3]))
You can use the following code if you are not sure that the date format will always be YYYY-MM-DD:
import datetime
string= "2022-04-13 04:47:00,2022-04-07 01:58:00,2022-03-31 02:32:00,2022-03-25 11:59:00,2022-04-12 05:07:00,2022-03-29 01:46:00,2022-03-31 05:52:00"
dates_list = [date[:10] for date in string.split(',')]
dates_list.sort(key=lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))
min_date, max_date = dates_list[0], dates_list[-1]
You can easily replace the date format here.
string = "2022-04-13 04:47:00,2022-04-07 01:58:00,2022-03-31 02:32:00,2022-03-25 11:59:00,2022-04-12 05:07:00,2022-03-29 01:46:00,2022-03-31 05:52:00"
split_string = string.split(",")
split_string.sort()
new_list = []
for i in split_string:
    temp_list = i.split()
    new_list.append(temp_list[0])
max_date = new_list[-1]
min_date = new_list[0]
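To put the results into the two new columns the question asks for, here is a sketch that applies the same idea row by row; the column name 'datetimes' comes from the question, while the one-row frame below is made up for illustration:
import pandas as pd
def earliest_and_last(s):
    # keep only the date part (first 10 characters), drop empty trailing items, sort
    dates = sorted(d[:10] for d in s.split(',') if d.strip())
    return pd.Series({'earliest_date': dates[0], 'last_date': dates[-1]})
df = pd.DataFrame({'datetimes': [
    "2022-04-13 04:47:00,2022-04-07 01:58:00,2022-03-31 02:32:00,2022-03-25 11:59:00,"
    "2022-04-12 05:07:00,2022-03-29 01:46:00,2022-03-31 05:52:00,"]})
df[['earliest_date', 'last_date']] = df['datetimes'].apply(earliest_and_last)
print(df[['earliest_date', 'last_date']])
This relies on the dates being in ISO YYYY-MM-DD form, which sorts correctly as plain strings.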

Extract several values from a row when a certain value is found using a list

I have a .csv file with 29 columns and 1692 rows.
The columns D_INT_1 and D_INT_2 are just dates.
I want to check, for these 2 columns, whether there are dates between "2022-03-01" and "2024-12-31" (inclusive).
And, if a value is found, I want to display the date found plus the value of the "NAME" column located on the same row as the found value.
This is what I have right now, but it only grabs the dates and not the adjacent value ('NAME').
# importing module
import pandas as pd
# reading CSV file
df = pd.read_csv("scratch_2.csv")
# converting column data to list
D_INT_1 = df['D_INT_1'].tolist()
D_INT_2 = df['D_INT_2'].tolist()
ext = []
ext = [i for i in D_INT_1 + D_INT_2 if i >= "2022-03-01" and i <= "2024-12-31"]
print(*ext, sep="\n")
This is what I would like to get:
Example of DF:
NAME, ADDRESS, D_INT_1, D_INT_2
Mark, H4N1V8, 2023-01-02, 2019-01-01
Expected output:
MARK, 2023-01-02
Lots of times the compact [for in] syntax can be used efficiently for simple code, but in this case I wouldn't recommend it. I suggest you use a normal for loop. Here's an example:
# importing module
import pandas as pd
# reading CSV file
df = pd.read_csv("scratch_2.csv")
# converting column data to list
D_INT_1 = df['D_INT_1'].tolist()
D_INT_2 = df['D_INT_2'].tolist()
NAMES = df['NAME'].tolist()
# loop for every row in the data
# (i will start as 0 and increase by 1 every iteration)
for i in range(0, len(D_INT_1)):
    if D_INT_1[i] >= "2022-03-01" and D_INT_1[i] <= "2024-12-31":
        print(NAMES[i], D_INT_1[i])
    if D_INT_2[i] >= "2022-03-01" and D_INT_2[i] <= "2024-12-31":
        print(NAMES[i], D_INT_2[i])
First, for performance, avoid loops: vectorized alternatives exist. Unpivot with DataFrame.melt and filter with Series.between and DataFrame.loc:
df = df.melt(id_vars='NAME', value_vars=['D_INT_1','D_INT_2'], value_name='Date')
df1 = df.loc[df['Date'].between("2022-03-01","2024-12-31"), ['NAME','Date']]
print (df1)
NAME Date
0 Mark 2023-01-02
Or filter the original DataFrame twice and then join the results with concat:
df1 = df.loc[df['D_INT_1'].between("2022-03-01","2024-12-31"), ['NAME','D_INT_1']]
df2 = df.loc[df['D_INT_2'].between("2022-03-01","2024-12-31"), ['NAME','D_INT_2']]
df = pd.concat([df1.rename(columns={'D_INT_1':'date'}),
                df2.rename(columns={'D_INT_2':'date'})])
print (df)
NAME date
0 Mark 2023-01-02
Finally, if you need to loop over the output with print:
for i in df.itertuples():
    print(i.NAME, i.Date)
Mark 2023-01-02 00:00:00
Mark 2019-01-01 00:00:00
So there are a few things to note here:
In this case, you are probably better off with a normal for loop, since the logic is a bit more complicated.
To do what you want, first:
Load the names:
D_INT_1 = df['D_INT_1'].tolist()
D_INT_2 = df['D_INT_2'].tolist()
NAMES = df['NAME'].tolist()
Then use enumerate, since we know all the lists are aligned the same. Keep in mind that enumerate yields both the index and the value, but I am getting the value manually just for cleaner (and clearer) code:
ext = []
for i, _ in enumerate(D_INT_1):
    if D_INT_1[i] >= "2022-03-01" and D_INT_1[i] <= "2024-12-31":
        ext.append((D_INT_1[i], NAMES[i]))
    if D_INT_2[i] >= "2022-03-01" and D_INT_2[i] <= "2024-12-31":
        ext.append((D_INT_2[i], NAMES[i]))
Of course, you can use a list comprehension (or in this case, two), but this form should be easier to understand for this answer.
To do so, you will still need to load the names as in the first step, then use enumerate in the list comprehension, adding the name after i in a tuple, perhaps something like this:
ext = [(i, NAMES[ind % len(NAMES)]) for ind, i in enumerate(D_INT_1 + D_INT_2) if i >= "2022-03-01" and i <= "2024-12-31"]  # the modulo maps indices from the concatenated list back into NAMES
Keep in mind that I didn't test the above code since I have no access to the original csv, but it should be a good starting point.

What is the correct way to get the first row of a dataframe?

The data in test.csv looks like this:
device_id,upload_time,latitude,longitude,mileage,other_vals,speed,upload_time_add_8hour,upload_time_year_month,car_id,car_type,car_num,marketer_name
1101,2020-09-30 16:03:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:03:41,202010,18,1,,
1101,2020-09-30 16:08:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:08:41,202010,18,1,,
1101,2020-09-30 16:13:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:13:41,202010,18,1,,
1101,2020-09-30 16:18:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:18:41,202010,18,1,,
1101,2020-10-02 08:19:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:19:41,202010,18,1,,
1101,2020-10-02 08:24:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:24:41,202010,18,1,,
1101,2020-10-02 08:29:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:29:41,202010,18,1,,
1101,2020-10-02 08:34:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:34:41,202010,18,1,,
1101,2020-10-02 08:39:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:39:41,202010,18,1,,
1101,2020-10-02 08:44:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:44:41,202010,18,1,,
1101,2020-10-02 08:49:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:49:41,202010,18,1,,
1101,2020-10-06 11:11:10+00:00,46.7245,131.14015,0.1,,2.1,2020/10/6 19:11:10,202010,18,1,,
1101,2020-10-06 11:16:10+00:00,46.7245,131.14015,0.1,,2.2,2020/10/6 19:16:10,202010,18,1,,
1101,2020-10-06 11:21:10+00:00,46.7245,131.14015,0.1,,3.84,2020/10/6 19:21:10,202010,18,1,,
1101,2020-10-06 16:46:10+00:00,46.7245,131.14015,0,,0,2020/10/7 0:46:10,202010,18,1,,
1101,2020-10-07 04:44:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:44:27,202010,18,1,,
1101,2020-10-07 04:49:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:49:27,202010,18,1,,
1101,2020-10-07 04:54:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:54:27,202010,18,1,,
1101,2020-10-07 04:59:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:59,202010,18,1,,
1101,2020-10-07 05:04:27+00:00,46.724366,131.1402,1,,0,2020/10/7 13:04:27,202010,18,1,,
I use this code to get the rows where the speed is 0 in the dataframe, and then group the dataframe by latitude, longitude, year, month and day.
After grouping, get the first upload_time_add_8hour and the last upload_time_add_8hour of each group. If the difference between the first and the last upload_time_add_8hour is more than 5 minutes, get the first row of data for that group, and finally save these rows to csv.
I think my code is not concise enough.
I use df_first_row = sub_df.iloc[0:1,:] to get the first row in the dataframe, I use upload_time_add_8hour_first = sub_df['upload_time_add_8hour'].iloc[0] and upload_time_add_8hour_last = sub_df['upload_time_add_8hour'].iloc[-1] to get the first element and the last element of a specific column.
Is there any more suitable way?
My code:
import pandas as pd
device_csv_name = r'E:/test.csv'
df = pd.read_csv(device_csv_name, parse_dates=[7], encoding='utf-8', low_memory=False)
df['upload_time_year_month_day'] = df['upload_time_add_8hour'].dt.strftime('%Y%m%d')
df['upload_time_year_month_day'] = df['upload_time_year_month_day'].astype(str)
df_speed0 = df[df['speed'].astype(float) == 0.0] #Get data with speed is 0.0
gb = df_speed0.groupby(['latitude', 'longitude', 'upload_time_year_month_day'])
sub_dataframe_list = []
for i in gb.indices:
    sub_df = pd.DataFrame(gb.get_group(i))
    sub_df = sub_df.sort_values(by=['upload_time_add_8hour'])
    count_row = sub_df.shape[0]  # get row count
    if count_row > 1:  # each group must have more than 1 row
        upload_time_add_8hour_first = sub_df['upload_time_add_8hour'].iloc[0]  # get first upload_time_add_8hour
        upload_time_add_8hour_last = sub_df['upload_time_add_8hour'].iloc[-1]  # get last upload_time_add_8hour
        minutes_diff = (upload_time_add_8hour_last - upload_time_add_8hour_first).total_seconds() / 60.0
        if minutes_diff >= 5:  # if minutes_diff >= 5, append the first row of the dataframe to sub_dataframe_list
            df_first_row = sub_df.iloc[0:1, :]
            sub_dataframe_list.append(df_first_row)
if sub_dataframe_list:
    result = pd.concat(sub_dataframe_list, ignore_index=True)
    result = result.sort_values(by=['upload_time'])
    result.to_csv(r'E:/for_test.csv', index=False, mode='w', header=True, encoding='utf-8')
To get the first and last element of the column, your option is already the most efficient/correct way. If you're interested in this topic, I can recommend reading this other Stack Overflow answer: https://stackoverflow.com/a/25254087/8294752
To get the first row, I personally prefer to use DataFrame.head(1), therefore for your code something like this:
df_first_row = sub_df.head(1)
I didn't look into how the head() method is defined in Pandas and its performance implications, but in my opinion it improves readability and reduces some potential confusion with indexes.
In other examples you might also find sub_df.iloc[0], but this option will return a pandas.Series which has as indexes the DataFrame column names.
sub_df.head(1) will return a 1-row DataFrame instead, which is the same result as sub_df.iloc[0:1,:]
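A tiny illustration of that difference, using a throwaway frame rather than the question's data:
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(type(df.iloc[0]))       # <class 'pandas.core.series.Series'>
print(type(df.head(1)))       # <class 'pandas.core.frame.DataFrame'>
print(type(df.iloc[0:1, :]))  # <class 'pandas.core.frame.DataFrame'>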
Your way out is either groupby().agg or df.agg.
If you need it per device you can:
#sub_df.groupby('device_id')['upload_time_add_8hour'].agg(['first','last'])
sub_df.groupby('device_id')['upload_time_add_8hour'].agg([('upload_time_add_8hour_first','first'),('upload_time_add_8hour_last','last')]).reset_index()
device_id upload_time_add_8hour_first upload_time_add_8hour_last
0 1101 10/1/2020 0:03 10/7/2020 13:04
If you do not want it per device, maybe try:
sub_df['upload_time_add_8hour'].agg({'upload_time_add_8hour_first': lambda x: x.head(1),'upload_time_add_8hour_last': lambda x: x.tail(1)})
upload_time_add_8hour_first 0 10/1/2020 0:03
upload_time_add_8hour_last 19 10/7/2020 13:04
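As a variant, the same per-device result can also be written with named aggregation (available since pandas 0.25), which avoids the list of tuples; a sketch, not tested against the original data:
out = (sub_df.groupby('device_id')
             .agg(upload_time_add_8hour_first=('upload_time_add_8hour', 'first'),
                  upload_time_add_8hour_last=('upload_time_add_8hour', 'last'))
             .reset_index())
print(out)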

How to find and add missing dates in a dataframe of sorted dates (descending order)?

In Python, I have a DataFrame with column 'Date' (format e.g. 2020-06-26). This column is sorted in descending order: 2020-06-26, 2020-06-25, 2020-06-24...
The other column 'Reviews' is made of text reviews of a website. My data can have multiple reviews on a given date or no reviews on another date. I want to find which dates are missing from column 'Date'. Then, for each missing date, add one row with the date in format='%Y-%m-%d' and an empty review in 'Reviews', so that I can plot them. How should I do this?
from datetime import date, timedelta
d = data['Date']
print(d[0])
print(d[-1])
date_set = set(d[-1] + timedelta(x) for x in range((d[0] - d[-1]).days))
missing = sorted(date_set - set(d))
missing = pd.to_datetime(missing, format='%Y-%m-%d')
idx = pd.date_range(start=min(data.Date), end=max(data.Date), freq='D')
#tried this
data = data.reindex(idx, fill_value=0)
data.head()
#Got TypeError: 'fill_value' ('0') is not in this Categorical's categories.
#also tried this
df2 = (pd.DataFrame(data.set_index('Date'), index=idx).fillna(0) + data.set_index('Date')).ffill().stack()
df2.head()
#Got ValueError: cannot reindex from a duplicate axis
This is my code:
for i in range(len(df)):
    if i > 0:
        prev = df.loc[i-1]["Date"]
        current = df.loc[i]["Date"]
        for a in range((prev - current).days):
            if a > 0:
                df.loc[df["Date"].count()] = [prev - timedelta(days=a), None]
df = df.sort_values("Date", ascending=False)
print(df)
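Another option, closer to the date_range idea in the question: compute only the genuinely missing dates and append one empty-review row per missing date. A sketch, assuming the frame is called data with columns 'Date' and 'Reviews' as in the question:
import pandas as pd
# parse to datetimes so min/max and set operations work
data['Date'] = pd.to_datetime(data['Date'], format='%Y-%m-%d')
full = pd.date_range(data['Date'].min(), data['Date'].max(), freq='D')
missing = full.difference(data['Date'])  # dates with no review at all
filler = pd.DataFrame({'Date': missing, 'Reviews': ''})
data = (pd.concat([data, filler], ignore_index=True)
          .sort_values('Date', ascending=False)  # keep the descending order
          .reset_index(drop=True))
print(data.head())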

Pandas Columns Operations with List

I have a pandas dataframe with two columns: the first with a single date ('action_date') and the second with a list of dates ('verification_date'). I am trying to calculate the time difference between the date in 'action_date' and each of the dates in the corresponding 'verification_date' list, and then fill new columns with the number of dates in 'verification_date' that differ by either more or less than 360 days.
Here is my code:
df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
df = df.groupby(pd.TimeGrouper(freq='2D'))['verification_date'].apply(list).reset_index()
def make_columns(df):
    for i in range(len(df)):
        over_360 = []
        under_360 = []
        for w in [(df['action_date'][i] - x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
    return df
make_columns(df)
This kinda works EXCEPT the df has the same values for each row, which is not true as the dates are different. For example, in the first row of the dataframe, there IS a difference of over 360 days between the action_date and both of the items in the list in the verification_date column, so the over_360 column should be populated with 2. However, it is empty and instead the under_360 column is populated with 1, which is accurate only for the second row in 'action_date'.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all help!
Your problem was that you were always updating the whole column with the value of the last calculation with these lines:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
What you want to do instead is set the value for each row's calculation; you can do this by replacing the above lines with these:
df.set_value(i,'over_360',len(over_360))
df.set_value(i,'under_360',len(under_360))
What it does is set a value at row i and column over_360 or under_360.
you can learn more about it here.
If you don't like using set_value you can also use this:
df.ix[i,'over_360'] = len(over_360)
df.ix[i,'under_360'] = len(under_360)
you can check dataframe.ix here.
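Note that set_value and .ix have since been removed from pandas; on current versions the same per-cell assignment inside the loop can be written with .at (or .loc), for example:
# drop-in replacement for the two lines inside the loop
df.at[i, 'over_360'] = len(over_360)
df.at[i, 'under_360'] = len(under_360)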
you might want to try this:
df['over_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days >360) for i in x['verification_date']]) , axis=1)
df['under_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days <360) for i in x['verification_date']]) , axis=1)
I believe it should be a bit faster.
You didn't specify what to do if == 360, so you can just change > or < into >= or <=.
