Non Uniform Date columns in Pandas dataframe - python

my dataframe looks like:
a = DataFrame({'clicks': {0: 4020, 1: 3718, 2: 2700, 3: 3867, 4: 4018, 5:
4760, 6: 4029},'date': {0: '23-02-2016', 1: '24-02-2016', 2: '11/2/2016',
3: '12/2/2016', 4: '13-02-2016', 5: '14-02-2016', 6: '15-02-2016'}})
The rows use two different date formats.
The format I need is:
a = DataFrame({'clicks': {0: 4020, 1: 3718, 2: 2700, 3: 3867, 4: 4018,
5: 4760, 6: 4029}, 'date': {0: '2/23/2016',1: '2/24/2016', 2: '2/11/2016',
3: '2/12/2016', 4: '2/13/2016', 5: '2/14/2016', 6: '2/15/2016'}})
So far I have managed to open the CSV in Excel as UTF-8 text data and choose an MDY format for the date column. Then I apply:
a['date'] = a['date'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y'))
How can I do this efficiently in pandas?

You can convert to datetime using to_datetime and then call dt.strftime to get it in the format you want. Since the ambiguous dates such as '11/2/2016' are day-first here, pass dayfirst=True (on pandas 2.x you may also need format='mixed', because the column mixes two formats):
In [21]:
a['date'] = pd.to_datetime(a['date'], dayfirst=True).dt.strftime('%m/%d/%Y')
a
Out[21]:
clicks date
0 4020 02/23/2016
1 3718 02/24/2016
2 2700 02/11/2016
3 3867 02/12/2016
4 4018 02/13/2016
5 4760 02/14/2016
6 4029 02/15/2016
If the column is already of datetime dtype, you can skip the to_datetime step.
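If you prefer to be explicit about the two formats instead of relying on inference, a minimal sketch, assuming only the '%d-%m-%Y' and '%d/%m/%Y' formats occur: parse each format separately with errors='coerce' and combine the results.
import pandas as pd

a = pd.DataFrame({'clicks': [4020, 3718, 2700, 3867, 4018, 4760, 4029],
                  'date': ['23-02-2016', '24-02-2016', '11/2/2016',
                           '12/2/2016', '13-02-2016', '14-02-2016', '15-02-2016']})

# Each parse leaves NaT where the row does not match its format.
dashed = pd.to_datetime(a['date'], format='%d-%m-%Y', errors='coerce')
slashed = pd.to_datetime(a['date'], format='%d/%m/%Y', errors='coerce')

# Take whichever parse succeeded, then render as MDY strings.
a['date'] = dashed.fillna(slashed).dt.strftime('%m/%d/%Y')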

Related

Plot count of unique values in Python

I have a data frame that is similar to the following:
Time Account_ID Device_ID Zip_Code
0 2011-02-02 12:02:19 ABC123 A12345 83420
1 2011-02-02 13:35:12 EFG456 B98765 37865
2 2011-02-02 13:54:57 EFG456 B98765 37865
3 2011-02-02 14:45:20 EFG456 C24568 37865
4 2011-02-02 15:08:58 ABC123 A12345 83420
5 2011-02-02 15:25:17 HIJ789 G97352 97452
How do I make a plot with the count of unique account IDs on the y-axis and the number of unique device IDs associated with a single account ID on the x-axis?
So in this instance the "1" bin on the x-axis would have a height of 2, since accounts "ABC123" and "HIJ789" each have only 1 unique device ID, and the "2" bin would have a height of 1, since account "EFG456" has two unique device IDs associated with it.
EDIT
This is the output I got from trying
df.groupby("Account_ID")["Device_ID"].nunique().value_counts().plot.bar()
You can combine groupby nunique and value_counts like this:
df.groupby("Account_ID")["Device_ID"].nunique().value_counts().plot.bar()
Edit:
Code used to recreate your data:
df = pd.DataFrame({'Time': {0: '2011-02-02 12:02:19', 1: '2011-02-02 13:35:12', 2: '2011-02-02 13:54:57',
3: '2011-02-02 14:45:20', 4: '2011-02-02 15:08:58', 5: '2011-02-02 15:25:17'},
'Account_ID': {0: 'ABC123', 1: 'EFG456', 2: 'EFG456', 3: 'EFG456', 4: 'ABC123', 5: 'HIJ789'},
'Device_ID': {0: 'A12345', 1: 'B98765', 2: 'B98765', 3: 'C24568', 4: 'A12345', 5: 'G97352'},
'Zip_Code': {0: 83420, 1: 37865, 2: 37865, 3: 37865, 4: 83420, 5: 97452}})
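For clarity, a minimal sketch showing what each step of the chain produces on this data (Time and Zip_Code omitted, since they don't affect the result):
import pandas as pd

df = pd.DataFrame({
    'Account_ID': ['ABC123', 'EFG456', 'EFG456', 'EFG456', 'ABC123', 'HIJ789'],
    'Device_ID': ['A12345', 'B98765', 'B98765', 'C24568', 'A12345', 'G97352'],
})

# Unique devices per account: ABC123 -> 1, EFG456 -> 2, HIJ789 -> 1
per_account = df.groupby('Account_ID')['Device_ID'].nunique()

# How many accounts have each device count: {1: 2, 2: 1}
distribution = per_account.value_counts()

# Bar chart: x = number of unique devices, y = number of accounts
distribution.plot.bar()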

Dataframe sort and remove on date

I have the following data frame
import pandas as pd
from pandas import Timestamp
df=pd.DataFrame({
'Tech en Innovation Fonds': {0: '63.57', 1: '63.57', 2: '63.57', 3: '63.57', 4: '61.03', 5: '61.03', 6: 61.03},
'Aandelen Index Fonds': {0: '80.22', 1: '80.22', 2: '80.22', 3: '80.22', 4: '79.85', 5: '79.85', 6: 79.85},
'Behoudend Mix Fonds': {0: '44.80', 1: '44.8', 2: '44.8', 3: '44.8', 4: '44.8', 5: '44.8', 6: 44.8},
'Neutraal Mix Fonds': {0: '50.43', 1: '50.43', 2: '50.43', 3: '50.43', 4: '50.37', 5: '50.37', 6: 50.37},
'Dynamisch Mix Fonds': {0: '70.20', 1: '70.2', 2: '70.2', 3: '70.2', 4: '70.04', 5: '70.04', 6: 70.04},
'Risicomijdende Strategie': {0: '46.03', 1: '46.03', 2: '46.03', 3: '46.03', 4: '46.08', 5: '46.08', 6: 46.08},
'Tactische Strategie': {0: '48.69', 1: '48.69', 2: '48.69', 3: '48.69', 4: '48.62', 5: '48.62', 6: 48.62},
'Aandelen Groei Strategie': {0: '52.91', 1: '52.91', 2: '52.91', 3: '52.91', 4: '52.77', 5: '52.77', 6: 52.77},
'Datum': {0: Timestamp('2022-07-08 18:00:00'), 1: Timestamp('2022-07-11 19:42:55'), 2: Timestamp('2022-07-12 09:12:09'), 3: Timestamp('2022-07-12 09:29:53'), 4: Timestamp('2022-07-12 15:24:46'), 5: Timestamp('2022-07-12 15:30:02'), 6: Timestamp('2022-07-12 15:59:31')}})
I scrape these from a website several times a day.
I am looking for a way to clean the dataframe, so that for each day only the latest entry is kept.
For this dataframe, 2022-07-12 has 5 entries, but I want to keep only the last one, i.e. 2022-07-12 15:59:31.
The entries for the previous days were already fixed manually :-(
I intend to do this once a month, so each day will have several entries.
I already tried
dfclean=df.sort_values('Datum').drop_duplicates('Datum', keep='last')
But that gives me all the records back, because the times differ.
Does anyone have an idea how to do this?
If the data is sorted by date, use groupby.last:
df.groupby(df['Datum'].dt.date, as_index=False).last()
Otherwise, take the row with the latest timestamp per day via idxmax:
df.loc[df.groupby(df['Datum'].dt.date)['Datum'].idxmax()]
output:
Tech en Innovation Fonds Aandelen Index Fonds Behoudend Mix Fonds \
0 63.57 80.22 44.80
1 63.57 80.22 44.8
2 61.03 79.85 44.8
Neutraal Mix Fonds Dynamisch Mix Fonds Risicomijdende Strategie \
0 50.43 70.20 46.03
1 50.43 70.2 46.03
2 50.37 70.04 46.08
Tactische Strategie Aandelen Groei Strategie Datum
0 48.69 52.91 2022-07-08 18:00:00
1 48.69 52.91 2022-07-11 19:42:55
2 48.62 52.77 2022-07-12 15:59:31
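If you are not sure the rows are already time-ordered, a sketch that makes the groupby.last approach safe by sorting on the full timestamp first (using the df defined above):
df_sorted = df.sort_values('Datum')
dfclean = df_sorted.groupby(df_sorted['Datum'].dt.date, as_index=False).last()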
You can use .max() on the datetime column. This keeps every row from days before the latest date, and only the single latest entry of the latest date (matching your note that earlier days are already cleaned up):
dfclean = df.loc[
    (df['Datum'].dt.date < df['Datum'].max().date()) |
    (df['Datum'] == df['Datum'].max())
]
Output:
Tech en Innovation Fonds Aandelen Index Fonds Behoudend Mix Fonds \
0 63.57 80.22 44.80
1 63.57 80.22 44.8
6 61.03 79.85 44.8
Neutraal Mix Fonds Dynamisch Mix Fonds Risicomijdende Strategie \
0 50.43 70.20 46.03
1 50.43 70.2 46.03
6 50.37 70.04 46.08
Tactische Strategie Aandelen Groei Strategie Datum
0 48.69 52.91 2022-07-08 18:00:00
1 48.69 52.91 2022-07-11 19:42:55
6 48.62 52.77 2022-07-12 15:59:31
Below is a working example, where I keep only the date part of the timestamp as a helper column. Sorting on the full timestamp guarantees that keep='last' really keeps the latest entry of each day:
df['Datum_Date'] = df['Datum'].dt.date
dfclean = df.sort_values('Datum').drop_duplicates('Datum_Date', keep='last')
dfclean = dfclean.drop(columns='Datum_Date')
Does this get you what you need? Note that dt.day is just the day of the month, so this assumes the data covers a single month; taking the idxmax of the timestamp itself picks the latest row per day:
df['Day'] = df['Datum'].dt.day
df.loc[df.groupby('Day')['Datum'].idxmax()]

Create a 2nd column based on the maximum date by month in 1st column

I would like to create a 2nd column based on the maximum date by month in the 1st column, but I'm having trouble identifying the maximum date by month (the first step below).
I'm trying to do a groupby but I'm getting a ValueError: Cannot index with multidimensional key.
I believe the steps are:
1. Within the datadate column, identify the maximum date in each month, e.g. 1/29/1993, 2/11/1993, 3/29/1993, etc.
2. For the datadate row that equals the maximum date in its month, put the maximum possible date in a new column called last_day_in_month, e.g. 1/31/1993, 2/28/1993, 3/31/1993, etc. For all other rows, where datadate != maximum date in the month, put False.
Sample Data and Ideal Output:
{'tic': {0: 'SPY', 1: 'SPY', 2: 'SPY', 3: 'SPY', 4: 'SPY', 5: 'SPY', 6: 'SPY', 7: 'SPY', 8: 'SPY', 9: 'SPY'},
 'cusip': {0: '78462F103', 1: '78462F103', 2: '78462F103', 3: '78462F103', 4: '78462F103', 5: '78462F103', 6: '78462F103', 7: '78462F103', 8: '78462F103', 9: '78462F103'},
 'datadate': {0: '1993-01-29', 1: '1993-02-01', 2: '1993-02-02', 3: '1993-02-03', 4: '1993-02-04', 5: '1993-02-05', 6: '1993-02-08', 7: '1993-02-09', 8: '1993-02-10', 9: '1993-02-11'},
 'prccd': {0: 43.938, 1: 44.25, 2: 44.34375, 3: 44.8125, 4: 45.0, 5: 44.96875, 6: 44.96875, 7: 44.65625, 8: 44.71875, 9: 44.9375},
 'next_year': {0: '1994-01-25', 1: '1994-01-26', 2: '1994-01-27', 3: '1994-01-28', 4: '1994-01-31', 5: '1994-02-01', 6: '1994-02-02', 7: '1994-02-03', 8: '1994-02-04', 9: '1994-02-07'},
 'next_year_px': {0: 47.1875, 1: 47.3125, 2: 47.75, 3: 47.875, 4: 48.21875, 5: 47.96875, 6: 48.28125, 7: 48.0625, 8: 46.96875, 9: 47.1875},
 'one_yr_chg': {0: 0.073956484136738, 1: 0.0692090395480226, 2: 0.076814658210007, 3: 0.0683403068340306, 4: 0.0715277777777777, 5: 0.0667129951355107, 6: 0.0736622654621264, 7: 0.0762771168649405, 8: 0.050314465408805, 9: 0.0500695410292072},
 'daily_chg': {0: nan, 1: 0.0071009149255769, 2: 0.0021186440677967, 3: 0.0105708245243127, 4: 0.0041841004184099, 5: -0.0006944444444444, 6: 0.0, 7: -0.0069492703266157, 8: 0.0013995801259623, 9: 0.004891684136967},
 'last_day_in_month': {0: '1993-01-31', 1: 'False', 2: 'False', 3: 'False', 4: 'False', 5: 'False', 6: 'False', 7: 'False', 8: 'False', 9: '1993-02-28'}}
Group by month and use idxmax to find the maximum date in each month; use to_period and to_timestamp to get the last day of each month.
dates = pd.to_datetime(df.datadate)  # renamed from `datetime` to avoid shadowing the module
max_day_indx = dates.groupby(dates.dt.month).idxmax()  # index of the latest date in each month
df['last_day_in_month'] = False
df.loc[max_day_indx, 'last_day_in_month'] = dates[max_day_indx].dt.to_period('M').dt.to_timestamp('M').dt.strftime('%Y-%m-%d')
print(df)
tic cusip datadate prccd next_year next_year_px one_yr_chg \
0 SPY 78462F103 1993-01-29 43.93800 1994-01-25 47.18750 0.073956
1 SPY 78462F103 1993-02-01 44.25000 1994-01-26 47.31250 0.069209
2 SPY 78462F103 1993-02-02 44.34375 1994-01-27 47.75000 0.076815
3 SPY 78462F103 1993-02-03 44.81250 1994-01-28 47.87500 0.068340
4 SPY 78462F103 1993-02-04 45.00000 1994-01-31 48.21875 0.071528
5 SPY 78462F103 1993-02-05 44.96875 1994-02-01 47.96875 0.066713
6 SPY 78462F103 1993-02-08 44.96875 1994-02-02 48.28125 0.073662
7 SPY 78462F103 1993-02-09 44.65625 1994-02-03 48.06250 0.076277
8 SPY 78462F103 1993-02-10 44.71875 1994-02-04 46.96875 0.050314
9 SPY 78462F103 1993-02-11 44.93750 1994-02-07 47.18750 0.050070
daily_chg last_day_in_month
0 NaN 1993-01-31
1 0.007101 False
2 0.002119 False
3 0.010571 False
4 0.004184 False
5 -0.000694 False
6 0.000000 False
7 -0.006949 False
8 0.001400 False
9 0.004892 1993-02-28
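One caveat, flagged as an assumption about the data: grouping on dt.month alone lumps the same calendar month of different years together. If the data can span multiple years, group on dt.to_period('M') instead, a sketch:
dates = pd.to_datetime(df.datadate)
# Period('1993-01') and Period('1994-01') are distinct keys,
# whereas dt.month would map both Januaries to 1.
max_day_indx = dates.groupby(dates.dt.to_period('M')).idxmax()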

Create a ranking of data based on the dates and categories of another column

I have the following dataframe:
account_id contract_id type date activated
0 1 AAA Downgrade 2021-01-05
1 1 ADS Original 2020-12-12
2 1 ADGD Upgrade 2021-02-03
3 1 BB Winback 2021-05-08
4 1 CC Upgrade 2021-06-01
5 2 HHA Original 2021-03-05
6 2 HAKD Downgrade 2021-03-06
7 3 HADSA Original 2021-05-01
I want the following output:
account_id contract_id type date activated Renewal Order
0 1 ADS Original 2020-12-12 Original
1 1 AAA Downgrade 2021-01-05 1st
2 1 ADGD Upgrade 2021-02-03 2nd
3 1 BB Winback 2021-05-08 Original
4 1 CC Upgrade 2021-06-01 1st
5 2 HHA Original 2021-03-05 Original
6 2 HAKD Downgrade 2021-03-06 1st
7 3 HADSA Original 2021-05-01 Original
The column I want to create is "Renewal Order". Each account can have multiple contracts. The logic depends on the account (account_id), the type (only "Original" and "Winback" matter), and the order in which the contracts were activated (date activated). The first contract (tagged "Original" under the type column) is identified as "Original", and the succeeding contracts as "1st", "2nd", and so on. The numbering resets when a contract is tagged "Winback" under the type column, i.e. that contract is identified as "Original" and the succeeding contracts as "1st", "2nd", and so on (refer to contract_id BB).
I tried the following code, but it does not handle the "Winback" condition:
def format_order(n):
    if n == 0:
        return 'Original'
    suffix = ['th', 'st', 'nd', 'rd', 'th'][min(n % 10, 4)]
    if 11 <= (n % 100) <= 13:
        suffix = 'th'
    return str(n) + suffix

df = df.sort_values(['account_id', 'date activated']).reset_index(drop=True)
# apply
df['Renewal Order'] = df.groupby('account_id').cumcount().apply(format_order)
Here's the dictionary of the original dataframe:
{'account_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 3},
'contract_id': {0: 'AAA',
1: 'ADS',
2: 'ADGD',
3: 'BB',
4: 'CC',
5: 'HHA',
6: 'HAKD',
7: 'HADSA'},
'type': {0: 'Downgrade',
1: 'Original',
2: 'Upgrade',
3: 'Winback',
4: 'Upgrade',
5: 'Original',
6: 'Downgrade',
7: 'Original'},
'date activated': {0: Timestamp('2021-01-05 00:00:00'),
1: Timestamp('2020-12-12 00:00:00'),
2: Timestamp('2021-02-03 00:00:00'),
3: Timestamp('2021-05-08 00:00:00'),
4: Timestamp('2021-06-01 00:00:00'),
5: Timestamp('2021-03-05 00:00:00'),
6: Timestamp('2021-03-06 00:00:00'),
7: Timestamp('2021-05-01 00:00:00')}}
Here's the dictionary for the result:
{'account_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 3},
'contract_id': {0: 'ADS',
1: 'AAA',
2: 'ADGD',
3: 'BB',
4: 'CC',
5: 'HHA',
6: 'HAKD',
7: 'HADSA'},
'type': {0: 'Original',
1: 'Downgrade',
2: 'Upgrade',
3: 'Winback',
4: 'Upgrade',
5: 'Original',
6: 'Downgrade',
7: 'Original'},
'date activated': {0: Timestamp('2020-12-12 00:00:00'),
1: Timestamp('2021-01-05 00:00:00'),
2: Timestamp('2021-02-03 00:00:00'),
3: Timestamp('2021-05-08 00:00:00'),
4: Timestamp('2021-06-01 00:00:00'),
5: Timestamp('2021-03-05 00:00:00'),
6: Timestamp('2021-03-06 00:00:00'),
7: Timestamp('2021-05-01 00:00:00')},
'Renewal Order': {0: 'Original',
1: '1st',
2: '2nd',
3: 'Original',
4: '1st',
5: 'Original',
6: '1st',
7: 'Original'}}
Let us just change the cumcount result (note: this zeroes the Winback row itself, but the rows after it keep counting, which the next snippet fixes):
s = df.groupby('account_id').cumcount()
s[df.type == 'Winback'] = 0
df['Renewal Order'] = s.apply(format_order)
Building on @BENY's solution, also group on the cumulative Winback count, so the numbering restarts after each Winback:
df = df.sort_values(['account_id', 'date activated']).reset_index(drop=True)
s = df.groupby(['account_id',
                (df['type'] == 'Winback').cumsum()
                ]).cumcount()
df['Renewal Order'] = s.apply(format_order)
Output:
account_id contract_id type date activated Renewal Order
0 1 ADS Original 2020-12-12 Original
1 1 AAA Downgrade 2021-01-05 1st
2 1 ADGD Upgrade 2021-02-03 2nd
3 1 BB Winback 2021-05-08 Original
4 1 CC Upgrade 2021-06-01 1st
5 2 HHA Original 2021-03-05 Original
6 2 HAKD Downgrade 2021-03-06 1st
7 3 HADSA Original 2021-05-01 Original
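The key step is the (df['type'] == 'Winback').cumsum() grouper: the boolean series becomes True at each Winback row, so the running sum increases by one there and every Winback opens a new sub-group, making cumcount restart at 0. A small sketch to inspect it on the sorted sample data:
df = df.sort_values(['account_id', 'date activated']).reset_index(drop=True)
print((df['type'] == 'Winback').cumsum().tolist())
# [0, 0, 0, 1, 1, 1, 1, 1] -> BB's Winback starts sub-group 1; paired with
# account_id, cumcount then yields [0, 1, 2, 0, 1, 0, 1, 0]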

Annotating line chart with data values

I have a plot with multiple line charts. I would like to annotate each point of each line with its value from my pandas dataframe, but I am having difficulty annotating these data points.
This is a sample of the code I have written to try and solve this issue:
ax = weekdays.plot(marker='o', label='Conversion rates between signup states')
ax.set_xticklabels(['original', 'sunday', 'monday',
                    'tuesday', 'wednesday', 'thursday',
                    'friday', 'saturday'])
for i in weekdays.values:
    ax.text(str(i), xy=i)
Here is a sample of my data (from weekdays dataframe). I returned it as a dictionary for ease of reading:
{'filter': {0: 'original',
1: 'sunday',
2: 'monday',
3: 'tuesday',
4: 'wednesday',
5: 'thursday',
6: 'friday',
7: 'saturday'},
'session_to_leads': {0: 16.28,
1: 13.88,
2: 13.63,
3: 15.110000000000001,
4: 13.469999999999999,
5: 13.54,
6: 12.58,
7: 12.82},
'leads_to_opps': {0: 9.47,
1: 6.279999999999999,
2: 7.62,
3: 8.6,
4: 7.5600000000000005,
5: 7.9,
6: 7.08,
7: 5.7299999999999995},
'opps_to_complete': {0: 1.92,
1: 0.86,
2: 1.3599999999999999,
3: 1.69,
4: 1.3599999999999999,
5: 1.48,
6: 1.51,
7: 0.88}}
You can try it in a different way with plotly. First generate a new dataframe with 3 columns using the following code:
import numpy as np
import pandas as pd

values = weekdays.T.values[1:].ravel()
idx = weekdays.T.values[0].ravel().tolist() * 3
cols = weekdays.columns[1:]
cols_ = []
for col in cols:
    cols_.append([col] * 8)
cols_ = np.array(cols_).ravel()
weekdays_ = pd.DataFrame({'days': idx, 'values': values, 'cols': cols_})
The output looks like this:
days values cols
0 original 16.28 session_to_leads
1 sunday 13.88 session_to_leads
2 monday 13.63 session_to_leads
3 tuesday 15.11 session_to_leads
4 wednesday 13.47 session_to_leads
5 thursday 13.54 session_to_leads
6 friday 12.58 session_to_leads
7 saturday 12.82 session_to_leads
8 original 9.47 leads_to_opps
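As an aside, the same reshape can be written more compactly with melt; a sketch, assuming 'filter' is the first column of weekdays as in the question:
weekdays_ = (weekdays.melt(id_vars='filter', var_name='cols', value_name='values')
                     .rename(columns={'filter': 'days'}))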
Now use the following to get the plot:
import plotly_express as px # or import plotly.express as px
px.line(weekdays_, x='days', y='values', color='cols')
which produces an interactive line plot (hovering over a point shows its value).
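If you would rather stay with matplotlib and label each point with its value, which is what the original code attempted, a minimal sketch assuming the weekdays frame from the question:
import matplotlib.pyplot as plt

value_cols = ['session_to_leads', 'leads_to_opps', 'opps_to_complete']
ax = weekdays.plot(x='filter', y=value_cols, marker='o')
for col in value_cols:
    for x, y in enumerate(weekdays[col]):
        # place each label slightly above its marker
        ax.annotate(f'{y:.2f}', (x, y), textcoords='offset points',
                    xytext=(0, 5), ha='center', fontsize=8)
plt.show()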
