I have a pandas dataframe with a column CreatedDate in it. Currently the values of the column look like this:
id CreatedDate
123 1586362930000
124 1586555550000
Desired output is:
id CreatedDate
123 2020-04-08T15:50:00Z
124 2020-04-08T15:45:00Z
I have tried the following:
# Change the column type from int to datetime64[ns]
df['CreatedDate'] = pd.to_datetime(df['CreatedDate'])
new_df = df['CreatedDate'].dt.strftime("%Y-%m-%d"+"T"+"%H:%M:%S"+"Z")
The output is this:
id CreatedDate
123 1970-01-01 00:26:26.362930
124 1970-01-01 00:26:26.365487
Which is not what I have expected, I know for a fact that those days should be April 8th.
I have tested dt.strftime("%Y-%m-%d"+"T"+"%H:%M:%S"+"Z") with just a string and it returns the desired output, however, when I apply it to the dataframe it doesn't work properly
This is a Unix timestamp in milliseconds, so pass unit='ms' to pd.to_datetime:
pd.to_datetime(df.CreatedDate, unit='ms').dt.strftime("%Y-%m-%dT%H:%M:%SZ")
0 2020-04-08T16:22:10Z
1 2020-04-10T21:52:30Z
Name: CreatedDate, dtype: object
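If you want to keep the formatted strings on the frame itself, a minimal sketch (assuming the epoch values really are milliseconds; adjust unit= if they are seconds or nanoseconds):
# overwrite the column with ISO-8601 strings in UTC
df['CreatedDate'] = pd.to_datetime(df['CreatedDate'], unit='ms', utc=True).dt.strftime("%Y-%m-%dT%H:%M:%SZ")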
I'm trying to filter a pandas dataframe so that I'm able to get the most recent data point for each account number in the dataframe.
Here is an example of what the data looks like.
I'm looking for an output of one instance of an account with the product and most recent date.
account_number product sale_date
0 123 rental 2021-12-01
1 423 rental 2021-10-01
2 513 sale 2021-11-02
3 123 sale 2022-01-01
4 513 sale 2021-11-30
I was trying to use groupby and idxmax(), but it doesn't seem to work with dates. I also wanted to move the dtype away from datetime.
data_grouped = data.groupby('account_number')['sale_date'].max().idxmax()
Any ideas would be awesome.
To retain a subsetted data frame, consider sorting by account number and descending sale date, then calling DataFrame.groupby().head (which, unlike DataFrame.groupby().first, keeps NaNs if they appear in the first row of a group):
data_grouped = (
    data.sort_values(
        ["account_number", "sale_date"], ascending=[True, False]
    )
    .reset_index(drop=True)
    .groupby("account_number")
    .head(1)
)
It seems the sale_date column has strings. If you convert it to datetime dtype, then you can use groupby + idxmax:
df['sale_date'] = pd.to_datetime(df['sale_date'])
out = df.loc[df.groupby('account_number')['sale_date'].idxmax()]
Output:
account_number product sale_date
3 123 sale 2022-01-01
1 423 rental 2021-10-01
4 513 sale 2021-11-30
Would the keyword 'first' work? That would be:
data.groupby('account_number')['sale_date'].first()
You want the last keyword rather than first to get the most recent date after grouping (this assumes the rows within each account are already in chronological order), like this:
df.groupby(by=["account_number"])["sale_date"].last()
which will provide this output:
account_number
123 2022-01-01
423 2021-10-01
513 2021-11-30
Name: sale_date, dtype: datetime64[ns]
It is unclear why you want to transition away from using the datetime dtype, but you need it in order to correctly sort for the value you are looking for. Consider doing this as an intermediate step, then reformatting the column after processing.
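A minimal sketch of that round trip (my own illustration, assuming the column is named sale_date and the final format should be plain YYYY-MM-DD strings):
# convert to datetime only for the comparison, then format back to strings
df['sale_date'] = pd.to_datetime(df['sale_date'])
out = df.loc[df.groupby('account_number')['sale_date'].idxmax()].copy()
out['sale_date'] = out['sale_date'].dt.strftime('%Y-%m-%d')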
I'll change my answer to use @Daniel Weigelbut's approach... and also add that you can apply .nth(n) to find the nth value for the general case (.nth(-1) for the most recent date).
new_data = data.groupby('account_number')['sale_date'].nth(-1)
My previous suggestion of creating a sorted multi index with
data.set_index(['account_number', 'sale_date'], inplace = True)
data_sorted = data.sort_index(level = [0, 1])
still works and might be more useful for more complex sorting. As others have said, make sure your date strings are converted to datetime objects if you sort like this.
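From that sorted multi-index, one way to pull the most recent row per account (a sketch of my own, not part of the original answer):
# after the sort above, the last row per account_number is the most recent sale_date
most_recent = data_sorted.groupby(level='account_number').tail(1)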
Basically this is the challenge I have:
I have a data set with a date and a unique ID, and I need to find whether an ID is duplicated within a date range.
123 transaction 1/1/2021
345 transaction 1/1/2021
123 transaction 1/2/2021
123 transaction 1/20/2021
I want to return 1 for ID 123 because the duplicate transaction falls within a range of 7 days.
I can do this in Excel, and I added some extra date ranges depending on the weekday, for example a range of up to 6 days for Wednesday, 5 days for Thursday, and 4 days for Friday. But I have no idea how to accomplish this with pandas.
The reason I want to do this with pandas is that each data set has up to 1M rows, which takes forever in Excel, and on top of that I need to split by category, so it's a pain to do all that manual work.
Are there any recommendations or ideas on how to accomplish this task?
The df:
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(
"""id,trans_date
123,1/1/2021
345,1/1/2021
123,1/2/2021
123,1/20/2021
345,1/3/2021
"""
)) # added extra record for demo
df
id trans_date
0 123 1/1/2021
1 345 1/1/2021
2 123 1/2/2021
3 123 1/20/2021
4 345 1/3/2021
df['trans_date'] = pd.to_datetime(df['trans_date'])
As you have to look at each id separately, you can group by id, take the maximum and minimum dates, and mark the id as 1 if the difference is greater than 7 days, otherwise 0.
result = df.groupby('id')['trans_date'].apply(
    lambda x: (x.max() - x.min()).days > 7)
result
id
123 True
345 False
Name: trans_date, dtype: bool
If you just need the required ids, then
result.index[result].values
array([123])
The context and data you've provided about your situation are scanty, but you can probably do something like this:
>>> df
id type date
0 123 transaction 2021-01-01
1 345 transaction 2021-01-01
2 123 transaction 2021-01-02
3 123 transaction 2021-01-20
>>> dupes = df.groupby(pd.Grouper(key='date', freq='W'))['id'].apply(pd.Series.duplicated)
>>> dupes
0 False
1 False
2 True
3 False
Name: id, dtype: bool
There, item 2 (the third item) is True because 123 already occurred in the past week.
As far as I can understand the question, I think this is what you need.
from datetime import datetime
import pandas as pd
df = pd.DataFrame({
    "id": [123, 345, 123, 123],
    "name": ["transaction", "transaction", "transaction", "transaction"],
    "date": ["01/01/2021", "01/01/2021", "01/02/2021", "01/10/2021"]
})
def dates_in_range(dates):
    num_days_frame = 6
    processed_dates = sorted([datetime.strptime(date, "%m/%d/%Y") for date in dates])
    difference_in_range = any(
        abs(processed_dates[i] - processed_dates[i - 1]).days < num_days_frame
        for i in range(1, len(processed_dates))
    )
    return 1 if difference_in_range else 0
group = df.groupby("id")
df_new = group.apply(lambda x: dates_in_range(x["date"]))
print(df_new)
"""
print(df_new)
id
123 1
345 0
"""
Here you first group by id so that all the dates for a particular id end up together.
Then a function is applied to each group's dates: they are sorted first, and then consecutive items are compared. The sorting ensures that checking consecutive differences really does tell you whether any two dates are close together.
Finally, if any pair of consecutive sorted dates differs by less than num_days_frame (6) days, we return 1, otherwise 0.
That said, this might not be very performant, since each group's dates are sorted separately. One way to avoid that is to sort the entire df first and then apply the group operation, so the dates arrive already sorted; a sketch of that idea follows.
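For example, a minimal vectorized sketch of that idea (my own illustration, not part of the answer above; it reuses the num_days_frame of 6 and flags ids whose consecutive transactions fall inside the window):
# sort once, then compare each transaction to the previous one within the same id
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
df = df.sort_values(['id', 'date'])
within_window = df.groupby('id')['date'].diff().dt.days.lt(6)
flagged_ids = df.loc[within_window, 'id'].unique()   # ids with a duplicate inside the window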
I have a dataframe, df, where I want to insert a new column named Date in a specific format.
df:
Name ID
Kelly A
John B
Desired output:
Date Name ID
2019-10-01 Kelly A
2019-10-01 John B
This is what I am doing:
df['2019-10-01'] = date
I am still researching this. Any insight is helpful
Try with
df['date'] = '2019-10-01'
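Alternatively, if you want the column to be named Date and appear first, as in the desired output, a small sketch (assuming the frame is called df):
# insert at position 0 so Date becomes the first column
df.insert(0, 'Date', '2019-10-01')
# optionally make it a real datetime instead of a string
df['Date'] = pd.to_datetime(df['Date'])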
main_df:
Name Age Id DOB
0 Tom 20 A4565 22-07-1993
1 nick 21 G4562 11-09-1996
2 krish AKL F4561 15-03-1997
3 636A 18 L5624 06-07-1995
4 mak 20 K5465 03-09-1997
5 nits 55 56541 45aBc
6 444 66 NIT 09031992
column_info_df:
Column_Name Column_Type
0 Name string
1 Age integer
2 Id string
3 DOB Date
How can I find the data-type error values in the main df? For example, from column_info_df we can see that 'Name' is a string column, so in main_df the 'Name' column should contain strings or alphanumeric values; anything else is an error. I need to collect those data-type error values in a separate df.
error output df:
Column_Name Current_Value Exp_Dtype Index_No.
0 Name 444 string 6
1 Age AKL integer 2
2 Id 56541 string 5
3 DOB 45aBc Date 5
4 DOB 09031992 Date 6
I tried this:
for i, r in column_info_df.iterrows():
    if r['Column_Type'] == 'string':
        main_df[r['Column_Name']].loc[main_df[r['Column_Name']].str.match(r'[^a-z|A-Z]+')]
    elif r['Column_Type'] == 'integer':
        main_df[r['Column_Name']].loc[main_df[r['Column_Name']].str.match(r'[^0-9]+')]
    elif r['Column_Type'] == 'Date':
I am stuck here because this regex is not catching every error, and I don't know how to go further.
Here is one way of using df.eval().
Note: this checks values against patterns and returns the non-matching ones, but it cannot validate actual types. For example, if the date column has an entry that looks like a date but is an invalid date, this will not catch it:
d={"string":".str.contains(r'[a-z|A-Z]')","integer":".str.contains('^[0-9]*$')",
"Date":".str.contains('\d\d-\d\d-\d\d\d\d')"}
m=df.eval([f"~{a}{b}"
for a,b in zip(column_info_df['Column_Name'],column_info_df['Column_Type'].map(d))]).T
final=(pd.DataFrame(np.where(m,df,np.nan),columns=df.columns)
.reset_index().melt('index',var_name='Column_Name',
value_name='Current_Value').dropna())
final['Expected_dtype']=(final['Column_Name']
.map(column_info_df.set_index('Column_Name')['Column_Type']))
print(final)
Output:
index Column_Name Current_Value Expected_dtype
6 6 Name 444 string
9 2 Age AKL integer
19 5 Id 56541 string
26 5 DOB 45aBc Date
27 6 DOB 09031992 Date
I agree there can be better regex patterns for this job, but the idea should be the same.
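As an alternative, a minimal loop-based sketch of my own (not part of the answer above; it assumes the DOB format is DD-MM-YYYY and uses pd.to_datetime with errors='coerce' for the date check):
import pandas as pd

# per-type checks: each returns a boolean Series that is True where the value is an error
checks = {
    'string': lambda s: ~s.astype(str).str.contains(r'[A-Za-z]'),   # no letters at all
    'integer': lambda s: ~s.astype(str).str.fullmatch(r'\d+'),      # not purely digits
    'Date': lambda s: pd.to_datetime(s, format='%d-%m-%Y', errors='coerce').isna(),  # unparseable date
}

rows = []
for _, r in column_info_df.iterrows():
    bad = checks[r['Column_Type']](main_df[r['Column_Name']])
    for idx in main_df.index[bad]:
        rows.append({'Column_Name': r['Column_Name'],
                     'Current_Value': main_df.at[idx, r['Column_Name']],
                     'Exp_Dtype': r['Column_Type'],
                     'Index_No.': idx})
error_df = pd.DataFrame(rows)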
If I understood correctly, you created a separate dataframe that contains information about your main one.
What I would suggest instead is to use the built-in methods pandas offers for inspecting dataframes.
For instance, if you have a dataframe main, then:
main.info()
will give you the dtype of each column. Note that a column has a single dtype, as it is a Series, which is itself backed by an ndarray.
So your Name column cannot contain anything but strings that you might have missed; what you can have instead are NaN values. You can check for them with the help of
main.describe()
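For a quick per-column check, a small sketch (assuming the frame is called main as above):
main.info()        # dtypes and non-null counts per column
main.isna().sum()  # number of missing (NaN) values per column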
I hope that helped :-)
I have the table below in a pandas dataframe:
date user_id val1 val2
01/01/2014 00:00:00 1 1790 12
01/02/2014 00:00:00 3 364 15
01/03/2014 00:00:00 2 280 10
02/04/2000 00:00:00 5 259 24
05/05/2003 00:00:00 4 201 39
02/05/2001 00:00:00 5 559 54
05/03/2003 00:00:00 4 231 69
..
The table was extracted from a .csv file using the following code:
import pandas as pd
newnames = ['date','user_id', 'val1', 'val2']
df = pd.read_csv('expenses.csv', names = newnames, index_col = 'date')
I have to analyse the profile of each user and/or of the whole dataset.
For this purpose, I would like to know how, at this stage, I can store all user_id values (without duplicates) in another dataframe df_user_id, which I could then use in a loop to display the results for each user id.
I'm confused about your big-picture goal, but if you want to store all the unique user IDs, that probably should not be a DataFrame. (What would the index mean? And why would there need to be multiple columns?) A simple numpy array would suffice -- or a Series if you have some reason to need pandas' methods.
To get a numpy array of the unique user ids:
user_ids = df['user_id'].unique()
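If the end goal is to loop over users and look at each one's rows, a small sketch (assuming the frame is called df as above); groupby gives you the per-user slices directly, so the separate id array is optional:
# iterate over each user's sub-frame without building a separate id list
for user_id, user_df in df.groupby('user_id'):
    print(user_id, user_df[['val1', 'val2']].describe())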