Plot count of unique values in Python

I have a data frame that is similar to the following:
                  Time Account_ID Device_ID  Zip_Code
0  2011-02-02 12:02:19     ABC123    A12345     83420
1  2011-02-02 13:35:12     EFG456    B98765     37865
2  2011-02-02 13:54:57     EFG456    B98765     37865
3  2011-02-02 14:45:20     EFG456    C24568     37865
4  2011-02-02 15:08:58     ABC123    A12345     83420
5  2011-02-02 15:25:17     HIJ789    G97352     97452
How do I make a plot with the count of unique account IDs on the y-axis and the number of unique device IDs associated with a single account ID on the x-axis?
So in this instance the "1" bin on the x-axis would have a height of 2, since accounts "ABC123" and "HIJ789" each have only 1 unique device ID, and the "2" bin would have a height of 1, since account "EFG456" has two unique device IDs associated with it.
EDIT
This is the output I got from trying
df.groupby("Account_ID")["Device_ID"].nunique().value_counts().plot.bar()

You can combine groupby, nunique, and value_counts like this:
df.groupby("Account_ID")["Device_ID"].nunique().value_counts().plot.bar()
Edit:
Code used to recreate your data:
df = pd.DataFrame({'Time': {0: '2011-02-02 12:02:19', 1: '2011-02-02 13:35:12', 2: '2011-02-02 13:54:57',
                            3: '2011-02-02 14:45:20', 4: '2011-02-02 15:08:58', 5: '2011-02-02 15:25:17'},
                   'Account_ID': {0: 'ABC123', 1: 'EFG456', 2: 'EFG456', 3: 'EFG456', 4: 'ABC123', 5: 'HIJ789'},
                   'Device_ID': {0: 'A12345', 1: 'B98765', 2: 'B98765', 3: 'C24568', 4: 'A12345', 5: 'G97352'},
                   'Zip_Code': {0: 83420, 1: 37865, 2: 37865, 3: 37865, 4: 83420, 5: 97452}})
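For clarity, here is a minimal sketch (using the recreated df above; the matplotlib axis labels are an optional extra) that shows what each step of the chain produces before plotting:
import pandas as pd
import matplotlib.pyplot as plt

# distinct devices seen for each account
devices_per_account = df.groupby("Account_ID")["Device_ID"].nunique()
# Account_ID
# ABC123    1
# EFG456    2
# HIJ789    1

# how many accounts fall into each device-count bin
counts = devices_per_account.value_counts().sort_index()
# 1    2
# 2    1

counts.plot.bar()
plt.xlabel("Unique devices per account")
plt.ylabel("Number of accounts")
plt.show()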

Related

Dataframe sort and remove on date

I have the following data frame
import pandas as pd
from pandas import Timestamp
df = pd.DataFrame({
    'Tech en Innovation Fonds': {0: '63.57', 1: '63.57', 2: '63.57', 3: '63.57', 4: '61.03', 5: '61.03', 6: 61.03},
    'Aandelen Index Fonds': {0: '80.22', 1: '80.22', 2: '80.22', 3: '80.22', 4: '79.85', 5: '79.85', 6: 79.85},
    'Behoudend Mix Fonds': {0: '44.80', 1: '44.8', 2: '44.8', 3: '44.8', 4: '44.8', 5: '44.8', 6: 44.8},
    'Neutraal Mix Fonds': {0: '50.43', 1: '50.43', 2: '50.43', 3: '50.43', 4: '50.37', 5: '50.37', 6: 50.37},
    'Dynamisch Mix Fonds': {0: '70.20', 1: '70.2', 2: '70.2', 3: '70.2', 4: '70.04', 5: '70.04', 6: 70.04},
    'Risicomijdende Strategie': {0: '46.03', 1: '46.03', 2: '46.03', 3: '46.03', 4: '46.08', 5: '46.08', 6: 46.08},
    'Tactische Strategie': {0: '48.69', 1: '48.69', 2: '48.69', 3: '48.69', 4: '48.62', 5: '48.62', 6: 48.62},
    'Aandelen Groei Strategie': {0: '52.91', 1: '52.91', 2: '52.91', 3: '52.91', 4: '52.77', 5: '52.77', 6: 52.77},
    'Datum': {0: Timestamp('2022-07-08 18:00:00'), 1: Timestamp('2022-07-11 19:42:55'), 2: Timestamp('2022-07-12 09:12:09'),
              3: Timestamp('2022-07-12 09:29:53'), 4: Timestamp('2022-07-12 15:24:46'), 5: Timestamp('2022-07-12 15:30:02'),
              6: Timestamp('2022-07-12 15:59:31')}})
I scrape these from a website several times a day.
I am looking for a way to clean the dataframe so that for each day only the latest entry is kept.
So for this dataframe, 2022-07-12 has 5 entries, but I want to keep only the last one, i.e. 2022-07-12 15:59:31.
The entries for the previous days have already been cleaned up manually :-(
I intend to do this once a month, so each day will have several entries.
I already tried
dfclean = df.sort_values('Datum').drop_duplicates('Datum', keep='last')
But that gives me all the records back because the times are different.
Anyone have an idea how to do this?
If the data is sorted by date, use a groupby.last:
df.groupby(df['Datum'].dt.date, as_index=False).last()
else:
df.loc[df.groupby(df['Datum'].dt.date)['Datum'].idxmax()]
output:
Tech en Innovation Fonds Aandelen Index Fonds Behoudend Mix Fonds \
0 63.57 80.22 44.80
1 63.57 80.22 44.8
2 61.03 79.85 44.8
Neutraal Mix Fonds Dynamisch Mix Fonds Risicomijdende Strategie \
0 50.43 70.20 46.03
1 50.43 70.2 46.03
2 50.37 70.04 46.08
Tactische Strategie Aandelen Groei Strategie Datum
0 48.69 52.91 2022-07-08 18:00:00
1 48.69 52.91 2022-07-11 19:42:55
2 48.62 52.77 2022-07-12 15:59:31
You can use .max() with datetime columns like this (it keeps every row from the earlier days, and only the single latest row of the most recent day):
dfclean = df.loc[
(df['Datum'].dt.date < df['Datum'].max().date()) |
(df['Datum'] == df['Datum'].max())
]
Output:
Tech en Innovation Fonds Aandelen Index Fonds Behoudend Mix Fonds \
0 63.57 80.22 44.80
1 63.57 80.22 44.8
6 61.03 79.85 44.8
Neutraal Mix Fonds Dynamisch Mix Fonds Risicomijdende Strategie \
0 50.43 70.20 46.03
1 50.43 70.2 46.03
6 50.37 70.04 46.08
Tactische Strategie Aandelen Groei Strategie Datum
0 48.69 52.91 2022-07-08 18:00:00
1 48.69 52.91 2022-07-11 19:42:55
6 48.62 52.77 2022-07-12 15:59:31
Below a working example, where I keep only the date part of the timestamp to filter the dataframe:
df['Datum_Date'] = df['Datum'].dt.date
dfclean = df.sort_values('Datum_Date').drop_duplicates('Datum_Date', keep='last')
dfclean = dfclean.drop(columns='Datum_Date')
Does this get you what you need?
df['Day'] = df['Datum'].dt.day
df.loc[df.groupby('Day')['Datum'].idxmax()]
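Note that dt.day only extracts the day of the month, so the 12th of July and the 12th of August would land in the same group once the data spans multiple months. A small variation of the same idea (a sketch, grouping on the full date instead):
df.loc[df.groupby(df['Datum'].dt.date)['Datum'].idxmax()]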

Create a trip report with end latitude and longitude

Please help, I have a data set structured like below:
import pandas as pd

ss = {'ride_id': {0: 'ride1', 1: 'ride1', 2: 'ride1', 3: 'ride2', 4: 'ride2',
                  5: 'ride2', 6: 'ride2', 7: 'ride3', 8: 'ride3', 9: 'ride3', 10: 'ride3'},
      'lat': {0: 5.616526, 1: 5.623686, 2: 5.616555, 3: 5.616556, 4: 5.613834, 5: 5.612899,
              6: 5.610804, 7: 5.616614, 8: 5.644431, 9: 5.650771, 10: 5.610828},
      'long': {0: -0.231901, 1: -0.227248, 2: -0.23192, 3: -0.23168, 4: -0.223812,
               5: -0.22869, 6: -0.226193, 7: -0.231461, 8: -0.237549, 9: -0.271337, 10: -0.226157},
      'distance': {0: 0.0, 1: 90.021, 2: 138.0751, 3: 0.0, 4: 90.0041, 5: 180.0293, 6: 180.562,
                   7: 0.0, 8: 90.004, 9: 180.0209, 10: 189.0702}}
df = pd.DataFrame(ss)
The ride_id column indicates the trips taken in a window that make up the ride.
For example, ride1 consists of 2 trips: the first trip starts at index 0 and ends at index 1, then trip 2 starts at index 1 and ends at index 2.
I want to create a new data frame of trip reports, where each row has the start coordinates (lat, long), the trip end coordinates (end_lat, end_long) taken from the next row, and the distance. The result should look like the data frame below:
sf = {'ride_id': {0: 'ride1', 1: 'ride1', 2: 'ride2', 3: 'ride2', 4: 'ride2'},
      'lat': {0: 5.616526, 1: 5.623686, 2: 5.616556, 3: 5.613834, 4: 5.612899},
      'long': {0: -0.231901, 1: -0.227248, 2: -0.23168, 3: -0.223812, 4: -0.22869},
      'end_lat': {0: 5.623686, 1: 5.616555, 2: 5.613834, 3: 5.612899, 4: 5.610804},
      'end_long': {0: -0.227248, 1: -0.23192, 2: -0.223812, 3: -0.22869, 4: -0.226193},
      'distance': {0: 90.02100, 1: 138.07510, 2: 90.00410, 3: 180.02930, 4: 180.5621}}
df_s = pd.DataFrame(sf)
df_s
OUT:
  ride_id       lat      long   end_lat  end_long  distance
0   ride1  5.616526 -0.231901  5.623686 -0.227248   90.0210
1   ride1  5.623686 -0.227248  5.616555 -0.231920  138.0751
2   ride2  5.616556 -0.231680  5.613834 -0.223812   90.0041
3   ride2  5.613834 -0.223812  5.612899 -0.228690  180.0293
4   ride2  5.612899 -0.228690  5.610804 -0.226193  180.5621
I tried to group the data frame by ride_id to isolate each ride, but I'm stuck; any ideas are warmly welcomed.
We can do groupby with shift, then dropna:
df['start_lat'] = df.groupby('ride_id')['lat'].shift()
df['start_long'] = df.groupby('ride_id')['long'].shift()
df = df.dropna()
df
Out[480]:
ride_id lat long distance start_lat start_long
1 ride1 5.623686 -0.227248 90.0210 5.616526 -0.231901
2 ride1 5.616555 -0.231920 138.0751 5.623686 -0.227248
4 ride2 5.613834 -0.223812 90.0041 5.616556 -0.231680
5 ride2 5.612899 -0.228690 180.0293 5.613834 -0.223812
6 ride2 5.610804 -0.226193 180.5620 5.612899 -0.228690
8 ride3 5.644431 -0.237549 90.0040 5.616614 -0.231461
9 ride3 5.650771 -0.271337 180.0209 5.644431 -0.237549
10 ride3 5.610828 -0.226157 189.0702 5.650771 -0.271337
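If you want the exact column layout of the expected output (end_lat/end_long taken from the next row of the same ride), a minimal alternative sketch is to shift backwards within each ride and drop the last point of every ride:
g = df.groupby('ride_id')
out = df.assign(end_lat=g['lat'].shift(-1),
                end_long=g['long'].shift(-1),
                distance=g['distance'].shift(-1))  # the distance belongs to the segment ending at the next row
out = out.dropna(subset=['end_lat']).reset_index(drop=True)
out = out[['ride_id', 'lat', 'long', 'end_lat', 'end_long', 'distance']]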

Similarity between time series groups

I have a dataset like the one below: multiple groups, with complete values, and over 200 columns (denoting days).
Input

Series         1     2     3     4     5     6     7   GROUP
01/08/2021  100%   75%   60%   50%   40%   30%    0%   A
08/08/2021  100%   95%   80%   60%   30%   10%    0%   A
15/08/2021  100%   85%   60%   40%   20%   10%    5%   A
01/08/2021  100%   70%   65%   55%   45%   35%    0%   B
08/08/2021  100%   90%   80%   60%   30%   10%    0%   B
15/08/2021  100%   95%   60%   40%   30%   20%    5%   B
Now, I have an incomplete dataset like the one below. I would like to compute a similarity metric for each group and state which series is most similar.
For the similarity, I am using CORREL in Excel at the moment, and in case of a tie I take the latest series. For comparison, only values that are complete in both series are compared (i.e. the missing values in the expected output are not used for the similarity calculation).
This is a VBA macro which I am porting to Python (either pandas or pyspark).
I am unsure how best to proceed. Any other similarity metric can be tried too. Thanks.
Expected Output

Series         1     2     3     4     5     6     7   Similarity_Score  Similarity_Week  Group
01/09/2021                         39%   28%    0%                 0.99       01/08/2021      A
08/09/2021               62%   44%   21%   12%    7%               0.99       15/08/2021      A
15/09/2021                                8%    0%                 1.00       08/08/2021      A
15/09/2021                         30%   19%    0%                 1.00       15/08/2021      B
This solution iterates over each group, takes the matching subset of each dataframe, and forms the product of the two dataframes' rows so that each row of one can be compared against every row of the other.
We can use some nested zip/filter/reverse trickery to keep only the columns that are filled out. Putting that in a list together with the dates from both dfs and the group, we can create a dataframe, sort, group, and keep the top score from each.
Joining this back to the second df should give you the output you want.
import pandas as pd
import numpy as np
from itertools import product
df = pd.DataFrame({'Series': {0: '01/08/2021',
1: '08/08/2021',
2: '15/08/2021',
3: '01/08/2021',
4: '08/08/2021',
5: '15/08/2021'},
'1': {0: '100%', 1: '100%', 2: '100%', 3: '100%', 4: '100%', 5: '100%'},
'2': {0: '75%', 1: '95%', 2: '85%', 3: '70%', 4: '90%', 5: '95%'},
'3': {0: '60%', 1: '80%', 2: '60%', 3: '65%', 4: '80%', 5: '60%'},
'4': {0: '50%', 1: '60%', 2: '40%', 3: '55%', 4: '60%', 5: '40%'},
'5': {0: '40%', 1: '30%', 2: '20%', 3: '45%', 4: '30%', 5: '30%'},
'6': {0: '30%', 1: '10%', 2: '10%', 3: '35%', 4: '10%', 5: '20%'},
'7': {0: '0%', 1: '0%', 2: '5%', 3: '0%', 4: '0%', 5: '5%'},
'GROUP': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B'}})
df2 = pd.DataFrame({'Series': {0: '01/09/2021',
1: '08/09/2021',
2: '15/09/2021',
3: '15/09/2021'},
'1': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
'2': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
'3': {0: np.nan, 1: '62%', 2: np.nan, 3: np.nan},
'4': {0: np.nan, 1: '44%', 2: np.nan, 3: np.nan},
'5': {0: '39%', 1: '21%', 2: np.nan, 3: '30%'},
'6': {0: '28%', 1: '12%', 2: '8%', 3: '19%'},
'7': {0: '0%', 1: '7%', 2: '0%', 3: '0%'},
'Similarity_Score': {0: 0.99, 1: 0.99, 2: 1.0, 3: 1.0},
'Similarity_Week': {0: '01/08/2021',
1: '15/08/2021',
2: '08/08/2021',
3: '15/08/2021'},
'Group': {0: 'A', 1: 'A', 2: 'A', 3: 'B'}}
)
df2.drop(columns=['Similarity_Score', 'Similarity_Week'], inplace=True)

l = []
for g, data in df.groupby('GROUP'):
    x = df2.loc[df2['Group'] == g]
    for c in product(data.values, x.values):
        a = c[0][1:-1]
        b = c[1][1:-1]
        a, b = list(zip(*(zip(reversed(a), list(filter(lambda v: v == v, b))))))
        a = [int(x.replace('%', '')) / 100 for x in a]
        b = list(reversed([int(x.replace('%', '')) / 100 for x in b]))
        l.append([g, c[0][0], c[1][0], np.corrcoef(a, b)[1, 0]])

out = df2.merge(pd.DataFrame(l, columns=['Group', 'Similarity_Week', 'Series', 'Similarity_Score'])
                  .sort_values(by=['Similarity_Score', 'Similarity_Week'], ascending=False)
                  .groupby(['Group', 'Series'])
                  .head(1),
                on=['Group', 'Series'])
Output
Series 1 2 3 4 5 6 7 Group Similarity_Week \
0 01/09/2021 NaN NaN NaN NaN 39% 28% 0% A 01/08/2021
1 08/09/2021 NaN NaN 62% 44% 21% 12% 7% A 15/08/2021
2 15/09/2021 NaN NaN NaN NaN NaN 8% 0% A 01/08/2021
3 15/09/2021 NaN NaN NaN NaN 30% 19% 0% B 15/08/2021
Similarity_Score
0 0.999405
1 0.999005
2 1.000000
3 0.999286
I believe the scores are very similar for 15/09/2021 in group A, such that if you were to round the scores you would get a different most recent date. You can validate this by checking:
[x for x in l if x[2]=='15/09/2021' and x[0]=='A']
Yields
[['A', '01/08/2021', '15/09/2021', 1.0],
['A', '08/08/2021', '15/09/2021', 0.9999999999999998],
['A', '15/08/2021', '15/09/2021', 0.9999999999999998]]
So in theory 15/08/2021 would be the date if you were rounding to a few decimal places, which you could do by putting round() around the np.corrcoef call.
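For example (a hypothetical tweak to the append line inside the loop above):
l.append([g, c[0][0], c[1][0], round(np.corrcoef(a, b)[1, 0], 6)])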
If you prefer a solution without for loops, you could merge the two data frames on Group and use groupby to apply the similarity metric.
Building on the data frames constructed by @Chris:
df.rename(columns={"GROUP": "Group"}, inplace=True)

def similarity(arr1, arr2):
    """Similarity between two arrays of percent strings, nans ignored"""
    df = pd.DataFrame({"arr1": arr1, "arr2": arr2}).dropna() \
           .apply(lambda s: s.str.strip("%").astype(float) / 100)
    return df.arr1.corr(df.arr2)

# Convert data columns to array in each row.
df_xformed = df.set_index(["Series", "Group"]).apply(pd.Series.to_numpy, axis=1) \
               .reset_index().rename(columns={"Series": "df_Series", 0: "df"})
df2_xformed = df2.set_index(["Series", "Group"]).apply(pd.Series.to_numpy, axis=1) \
                 .reset_index().rename(columns={"Series": "df2_Series", 0: "df2"})

# Merge on Group and calculate similarities.
df_combined = df_xformed.merge(df2_xformed, on="Group")
df_combined["similarity"] = df_combined.apply(
    lambda row: similarity(row["df"], row["df2"]), axis=1)

# Find max similarity of each df2_Series within its Group.
df_combined["df2_sim_max"] = df_combined.groupby(["df2_Series", "Group"])["similarity"] \
                                        .transform(max)
idx = df_combined["similarity"] == df_combined["df2_sim_max"]
result = df_combined[idx][["df2_Series", "Group", "df2", "df_Series", "similarity"]]
result
# df2_Series Group ... df_Series similarity
# 0 01/09/2021 A ... 01/08/2021 0.999405
# 2 15/09/2021 A ... 01/08/2021 1.000000
# 7 08/09/2021 A ... 15/08/2021 0.999005
# 11 15/09/2021 B ... 15/08/2021 0.999286

Create a ranking of data based on the dates and categories of another column

I have the following dataframe:
account_id contract_id type date activated
0 1 AAA Downgrade 2021-01-05
1 1 ADS Original 2020-12-12
2 1 ADGD Upgrade 2021-02-03
3 1 BB Winback 2021-05-08
4 1 CC Upgrade 2021-06-01
5 2 HHA Original 2021-03-05
6 2 HAKD Downgrade 2021-03-06
7 3 HADSA Original 2021-05-01
I want the following output:
account_id contract_id type date activated Renewal Order
0 1 ADS Original 2020-12-12 Original
1 1 AAA Downgrade 2021-01-05 1st
2 1 ADGD Upgrade 2021-02-03 2nd
3 1 BB Winback 2021-05-08 Original
4 1 CC Upgrade 2021-06-01 1st
5 2 HHA Original 2021-03-05 Original
6 2 HAKD Downgrade 2021-03-06 1st
7 3 HADSA Original 2021-05-01 Original
The column I want to create is "Renewal Order". Each account can have multiple contracts. The logic is based on each account (account_id), the type (only "Original" or "Winback" matter), and the order in which the contracts are activated (date activated). The first contract (tagged as "Original" under the "type" column) is identified as "Original", and the succeeding contracts as "1st", "2nd", and so on. The order resets when a contract is tagged as "Winback" under the "type" column, i.e. that contract is again identified as "Original" and its succeeding contracts as "1st", "2nd", and so on (refer to contract_id BB).
I tried the following code, but it does not handle the "Winback" condition:
def format_order(n):
    if n == 0:
        return 'Original'
    suffix = ['th', 'st', 'nd', 'rd', 'th'][min(n % 10, 4)]
    if 11 <= (n % 100) <= 13:
        suffix = 'th'
    return str(n) + suffix

df = df.sort_values(['account_id', 'date activated']).reset_index(drop=True)

# apply
df['Renewal Order'] = df.groupby('account_id').cumcount().apply(format_order)
Here's the dictionary of the original dataframe:
{'account_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 3},
'contract_id': {0: 'AAA',
1: 'ADS',
2: 'ADGD',
3: 'BB',
4: 'CC',
5: 'HHA',
6: 'HAKD',
7: 'HADSA'},
'type': {0: 'Downgrade',
1: 'Original',
2: 'Upgrade',
3: 'Winback',
4: 'Upgrade',
5: 'Original',
6: 'Downgrade',
7: 'Original'},
'date activated': {0: Timestamp('2021-01-05 00:00:00'),
1: Timestamp('2020-12-12 00:00:00'),
2: Timestamp('2021-02-03 00:00:00'),
3: Timestamp('2021-05-08 00:00:00'),
4: Timestamp('2021-06-01 00:00:00'),
5: Timestamp('2021-03-05 00:00:00'),
6: Timestamp('2021-03-06 00:00:00'),
7: Timestamp('2021-05-01 00:00:00')}}
Here's the dictionary for the result:
{'account_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 3},
'contract_id': {0: 'ADS',
1: 'AAA',
2: 'ADGD',
3: 'BB',
4: 'CC',
5: 'HHA',
6: 'HAKD',
7: 'HADSA'},
'type': {0: 'Original',
1: 'Downgrade',
2: 'Upgrade',
3: 'Winback',
4: 'Upgrade',
5: 'Original',
6: 'Downgrade',
7: 'Original'},
'date activated': {0: Timestamp('2020-12-12 00:00:00'),
1: Timestamp('2021-01-05 00:00:00'),
2: Timestamp('2021-02-03 00:00:00'),
3: Timestamp('2021-05-08 00:00:00'),
4: Timestamp('2021-06-01 00:00:00'),
5: Timestamp('2021-03-05 00:00:00'),
6: Timestamp('2021-03-06 00:00:00'),
7: Timestamp('2021-05-01 00:00:00')},
'Renewal Order': {0: 'Original',
1: '1st',
2: '2nd',
3: 'Original',
4: '1st',
5: 'Original',
6: '1st',
7: 'Original'}}
Let us just change the cumcount result
s = df.groupby('account_id').cumcount()
s[df.type=='Winback'] = 0
df['Renewal Order'] = s.apply(format_order)
Using @BENY's solution:
df = df.sort_values(['account_id', 'date activated']).reset_index(drop=True)
s = df.groupby(['account_id',
                (df['type'] == 'Winback').cumsum()]).cumcount()
df['Renewal Order'] = s.apply(format_order)
Output:
account_id contract_id type date activated Renewal Order
0 1 ADS Original 2020-12-12 Original
1 1 AAA Downgrade 2021-01-05 1st
2 1 ADGD Upgrade 2021-02-03 2nd
3 1 BB Winback 2021-05-08 Original
4 1 CC Upgrade 2021-06-01 1st
5 2 HHA Original 2021-03-05 Original
6 2 HAKD Downgrade 2021-03-06 1st
7 3 HADSA Original 2021-05-01 Original
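To see why the combined grouping works, here is a quick sketch of the intermediate key ('segment' is a hypothetical helper name used only for illustration):
# Each Winback starts a new counting segment within the account
df['segment'] = (df['type'] == 'Winback').cumsum()   # hypothetical helper column
print(df[['account_id', 'contract_id', 'type', 'segment']])
# cumcount() within (account_id, segment) restarts at 0 on every Winback row,
# which is exactly what resets the numbering back to 'Original'.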

Non Uniform Date columns in Pandas dataframe

My dataframe looks like:
from pandas import DataFrame

a = DataFrame({'clicks': {0: 4020, 1: 3718, 2: 2700, 3: 3867, 4: 4018, 5: 4760, 6: 4029},
               'date': {0: '23-02-2016', 1: '24-02-2016', 2: '11/2/2016',
                        3: '12/2/2016', 4: '13-02-2016', 5: '14-02-2016', 6: '15-02-2016'}})
The rows have two different formats.
The format I need is:
a = DataFrame({'clicks': {0: 4020, 1: 3718, 2: 2700, 3: 3867, 4: 4018, 5: 4760, 6: 4029},
               'date': {0: '2/23/2016', 1: '2/24/2016', 2: '2/11/2016',
                        3: '2/12/2016', 4: '2/13/2016', 5: '2/14/2016', 6: '2/15/2016'}})
So far I managed to open the csv in Excel as text data in UTF-8 format and then choose an MDY format for the date column. Moreover, I apply:
from datetime import datetime

a['date'] = a['date'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y'))
How can I efficiently do that in Pandas?
You can convert to datetime using to_datetime and then call dt.strftime to get it in the format you want:
In [21]:
a['date'] = pd.to_datetime(a['date']).dt.strftime('%m/%d/%Y')
a
Out[21]:
clicks date
0 4020 02/23/2016
1 3718 02/24/2016
2 2700 02/11/2016
3 3867 02/12/2016
4 4018 02/13/2016
5 4760 02/14/2016
6 4029 02/15/2016
If the column is already of datetime dtype, you can skip the to_datetime step.
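Note that newer pandas versions (2.x) infer the format from the first value and may raise on mixed formats. In that case (a sketch, assuming pandas >= 2.0 and that both source formats here are day-first) you can be explicit:
a['date'] = pd.to_datetime(a['date'], format='mixed', dayfirst=True).dt.strftime('%m/%d/%Y')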
