I have a dataset like the one below: multiple groups, complete values, and over 200 columns (each denoting a day).
Input
Series      1     2     3     4     5     6     7     GROUP
01/08/2021  100%  75%   60%   50%   40%   30%   0%    A
08/08/2021  100%  95%   80%   60%   30%   10%   0%    A
15/08/2021  100%  85%   60%   40%   20%   10%   5%    A
01/08/2021  100%  70%   65%   55%   45%   35%   0%    B
08/08/2021  100%  90%   80%   60%   30%   10%   0%    B
15/08/2021  100%  95%   60%   40%   30%   20%   5%    B
Now, I have an incomplete dataset like the one below. I would like to compute a similarity metric within each group and state which series each row is most similar to.
For the similarity metric I am currently using CORREL in Excel, and in case of a tie I take the latest series. Only values that are present in both series are compared (i.e. missing values in the expected output are not used for the similarity calculation).
This is a VBA macro which I am shifting to Python (either pandas or pyspark).
I am unsure how best to proceed. Other similarity metrics can be tried out too. Thanks
Expected Output
Series      1     2     3     4     5     6     7     Similarity_Score  Similarity_Week  Group
01/09/2021                          39%   28%   0%    0.99              01/08/2021       A
08/09/2021              62%   44%   21%   12%   7%    0.99              15/08/2021       A
15/09/2021                                8%    0%    1.00              08/08/2021       A
15/09/2021                          30%   19%   0%    1.00              15/08/2021       B
This solution involves iterating over each group, taking a subset of each dataframe, and taking the product of the two dataframes' rows so that every complete row can be compared against every incomplete row.
We can use some nested zip/filter/reverse trickery to keep only the columns that are filled in. Putting the results in a list along with the dates from both dfs and the group, we can create a dataframe, sort, group, and keep the top score from each.
Joining this back to the second df should give you the output you want.
import pandas as pd
import numpy as np
from itertools import product
df = pd.DataFrame({'Series': {0: '01/08/2021',
1: '08/08/2021',
2: '15/08/2021',
3: '01/08/2021',
4: '08/08/2021',
5: '15/08/2021'},
'1': {0: '100%', 1: '100%', 2: '100%', 3: '100%', 4: '100%', 5: '100%'},
'2': {0: '75%', 1: '95%', 2: '85%', 3: '70%', 4: '90%', 5: '95%'},
'3': {0: '60%', 1: '80%', 2: '60%', 3: '65%', 4: '80%', 5: '60%'},
'4': {0: '50%', 1: '60%', 2: '40%', 3: '55%', 4: '60%', 5: '40%'},
'5': {0: '40%', 1: '30%', 2: '20%', 3: '45%', 4: '30%', 5: '30%'},
'6': {0: '30%', 1: '10%', 2: '10%', 3: '35%', 4: '10%', 5: '20%'},
'7': {0: '0%', 1: '0%', 2: '5%', 3: '0%', 4: '0%', 5: '5%'},
'GROUP': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B'}})
df2 = pd.DataFrame({'Series': {0: '01/09/2021',
1: '08/09/2021',
2: '15/09/2021',
3: '15/09/2021'},
'1': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
'2': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
'3': {0: np.nan, 1: '62%', 2: np.nan, 3: np.nan},
'4': {0: np.nan, 1: '44%', 2: np.nan, 3: np.nan},
'5': {0: '39%', 1: '21%', 2: np.nan, 3: '30%'},
'6': {0: '28%', 1: '12%', 2: '8%', 3: '19%'},
'7': {0: '0%', 1: '7%', 2: '0%', 3: '0%'},
'Similarity_Score': {0: 0.99, 1: 0.99, 2: 1.0, 3: 1.0},
'Similarity_Week': {0: '01/08/2021',
1: '15/08/2021',
2: '08/08/2021',
3: '15/08/2021'},
'Group': {0: 'A', 1: 'A', 2: 'A', 3: 'B'}}
)
df2.drop(columns=['Similarity_Score','Similarity_Week'], inplace=True)
l = []
for g, data in df.groupby('GROUP'):
    x = df2.loc[df2['Group']==g]
    # compare every complete row against every incomplete row within the group
    for c in product(data.values, x.values):
        a = c[0][1:-1]
        b = c[1][1:-1]
        # keep only the trailing values of a that line up with the non-missing values of b
        a, b = list(zip(*(zip(reversed(a), list(filter(lambda v: v==v, b))))))
        a = [int(x.replace('%',''))/100 for x in a]
        b = list(reversed([int(x.replace('%',''))/100 for x in b]))
        l.append([g, c[0][0], c[1][0], np.corrcoef(a,b)[1,0]])

out = df2.merge(
    pd.DataFrame(l, columns=['Group','Similarity_Week','Series','Similarity_Score'])
      .sort_values(by=['Similarity_Score','Similarity_Week'], ascending=False)
      .groupby(['Group','Series'])
      .head(1),
    on=['Group','Series'])
Output
Series 1 2 3 4 5 6 7 Group Similarity_Week \
0 01/09/2021 NaN NaN NaN NaN 39% 28% 0% A 01/08/2021
1 08/09/2021 NaN NaN 62% 44% 21% 12% 7% A 15/08/2021
2 15/09/2021 NaN NaN NaN NaN NaN 8% 0% A 01/08/2021
3 15/09/2021 NaN NaN NaN NaN 30% 19% 0% B 15/08/2021
Similarity_Score
0 0.999405
1 0.999005
2 1.000000
3 0.999286
The scores for 15/09/2021 in group A are nearly identical, so if you were to round them, the tie-break would pick a different, more recent date. You can validate this by checking
[x for x in l if x[2]=='15/09/2021' and x[0]=='A']
Yields
[['A', '01/08/2021', '15/09/2021', 1.0],
['A', '08/08/2021', '15/09/2021', 0.9999999999999998],
['A', '15/08/2021', '15/09/2021', 0.9999999999999998]]
So in theory 15/08/2021 would be the matching date if you rounded to a few decimal places, which you could do by wrapping the np.corrcoef call in round().
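A minimal sketch of that tweak inside the loop, rounding to (say) 4 decimal places so near-ties collapse before ranking:
# hypothetical variant of the append line above
l.append([g, c[0][0], c[1][0], round(np.corrcoef(a, b)[1, 0], 4)])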
If you prefer a solution without for loops, you could merge the two data frames on Group and use groupby to apply the similarity metric.
Building on the data frames constructed by #Chris:
df.rename(columns={"GROUP":"Group"}, inplace=True)
def similarity(arr1, arr2):
    """Similarity between two arrays of percent strings, nans ignored"""
    df = pd.DataFrame({"arr1": arr1, "arr2": arr2}).dropna() \
           .apply(lambda s: s.str.strip("%").astype(float)/100)
    return df.arr1.corr(df.arr2)
# Convert data columns to array in each row.
df_xformed = df.set_index(["Series", "Group"]).apply(pd.Series.to_numpy, axis=1) \
.reset_index().rename(columns={"Series":"df_Series", 0:"df"})
df2_xformed = df2.set_index(["Series", "Group"]).apply(pd.Series.to_numpy, axis=1) \
.reset_index().rename(columns={"Series":"df2_Series", 0:"df2"})
# Merge on Group and calculate similarities.
df_combined = df_xformed.merge(df2_xformed, on="Group")
df_combined["similarity"] = df_combined.apply(
lambda row: similarity(row["df"], row["df2"]), axis=1)
# Find max similarity of each df2_Series within its Group.
df_combined["df2_sim_max"] = df_combined.groupby(\
["df2_Series", "Group"])["similarity"] \
.transform(max)
idx = df_combined["similarity"] == df_combined["df2_sim_max"]
result = df_combined[idx][["df2_Series", "Group", "df2", "df_Series", "similarity"]]
result
# df2_Series Group ... df_Series similarity
# 0 01/09/2021 A ... 01/08/2021 0.999405
# 2 15/09/2021 A ... 01/08/2021 1.000000
# 7 08/09/2021 A ... 15/08/2021 0.999005
# 11 15/09/2021 B ... 15/08/2021 0.999286
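If you also want the output in the same shape as the expected output above, a sketch of merging these winners back onto df2 (renaming the columns to match; this assumes at most one max-similarity row per df2_Series):
best = result.rename(columns={"df2_Series": "Series",
                              "df_Series": "Similarity_Week",
                              "similarity": "Similarity_Score"})
out2 = df2.merge(best[["Series", "Group", "Similarity_Week", "Similarity_Score"]],
                 on=["Series", "Group"])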
Related
I have a data frame that is similar to the following:
Time Account_ID Device_ID Zip_Code
0 2011-02-02 12:02:19 ABC123 A12345 83420
1 2011-02-02 13:35:12 EFG456 B98765 37865
2 2011-02-02 13:54:57 EFG456 B98765 37865
3 2011-02-02 14:45:20 EFG456 C24568 37865
4 2011-02-02 15:08:58 ABC123 A12345 83420
5 2011-02-02 15:25:17 HIJ789 G97352 97452
How do I make a plot with the count of unique account IDs on the y-axis and the number of unique device IDs associated with a single account ID on the x-axis?
So in this instance the "1" bin on the x-axis would have a height of 2, since accounts "ABC123" and "HIJ789" each have only 1 unique device ID, and the "2" bin would have a height of 1, since account "EFG456" has two unique device IDs associated with it.
EDIT
This is the output I got from trying
df.groupby("Account_ID")["Device_ID"].nunique().value_counts().plot.bar()
You can combine groupby, nunique and value_counts like this:
df.groupby("Account_ID")["Device_ID"].nunique().value_counts().plot.bar()
Edit:
Code used to recreate your data:
df = pd.DataFrame({'Time': {0: '2011-02-02 12:02:19', 1: '2011-02-02 13:35:12', 2: '2011-02-02 13:54:57',
3: '2011-02-02 14:45:20', 4: '2011-02-02 15:08:58', 5: '2011-02-02 15:25:17'},
'Account_ID': {0: 'ABC123', 1: 'EFG456', 2: 'EFG456', 3: 'EFG456', 4: 'ABC123', 5: 'HIJ789'},
'Device_ID': {0: 'A12345', 1: 'B98765', 2: 'B98765', 3: 'C24568', 4: 'A12345', 5: 'G97352'},
'Zip_Code': {0: 83420, 1: 37865, 2: 37865, 3: 37865, 4: 83420, 5: 97452}})
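To see why this gives the bar heights described in the question, the intermediate steps on the sample data look like this:
# unique devices per account
df.groupby("Account_ID")["Device_ID"].nunique()
# Account_ID
# ABC123    1
# EFG456    2
# HIJ789    1

# number of accounts per device count -- these become the bar heights
df.groupby("Account_ID")["Device_ID"].nunique().value_counts()
# 1    2
# 2    1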
I am working on a school project, so please no exact answers.
I have a pandas dataframe that has numerators and denominators rating images of dogs out of 10. When there are multiple dogs in the image, the rating is out of number of dogs * 10. I am trying to adjust it so that for example... if there are 5 dogs, and the rating is 40/50, then the new numerator/denominator is 8/10.
Here is an example of my code. I am aware that the syntax does not work in line 3, but I believe it accurately represents what I am trying to accomplish. twitter_archive is the dataframe.
twitter_archive['new_denom'] = 10
twitter_archive['new_numer'] = 0
for numer, denom in twitter_archive['rating_numerator','rating_denominator']:
    if (denom > 10) & (denom % 10 == 0):
        num_denom = denom / 10
        new_numer = numer / num_denom
        twitter_archive['new_numer'] = new_numer
So basically I am checking whether the denominator is above 10 and, if so, whether it is divisible by 10. If it is, I find how many times 10 goes into it and then divide the numerator by that value to get a new numerator. I think my logic for that works fine, but the issue I have is grabbing that row and then writing the new value into the new column I created, in that row.
edit: added df head
       tweet_id                  timestamp  \
0  8.924206e+17  2017-08-01 16:23:56+00:00
1  8.921774e+17  2017-08-01 00:17:27+00:00
2  8.918152e+17  2017-07-31 00:18:03+00:00
3  8.916896e+17  2017-07-30 15:58:51+00:00
4  8.913276e+17  2017-07-29 16:00:24+00:00

                                                 text  rating_numerator  \
0  This is Phineas. He's a mystical boy. Only eve...              13.0
1  This is Tilly. She's just checking pup on you....              13.0
2  This is Archie. He is a rare Norwegian Pouncin...              12.0
3  This is Darla. She commenced a snooze mid meal...              13.0
4  This is Franklin. He would like you to stop ca...              12.0

   rating_denominator      name doggo floofer pupper puppo  avg_numerator  \
0                10.0   phineas  None    None   None  None            0.0
1                10.0     tilly  None    None   None  None            0.0
2                10.0    archie  None    None   None  None            0.0
3                10.0     darla  None    None   None  None            0.0
4                10.0  franklin  None    None   None  None            0.0

   avg_denom  avg_numer
0         10          0
1         10          0
2         10          0
3         10          0
4         10          0
copy/paste head below:
{'tweet_id': {0: 8.924206435553362e+17,
1: 8.921774213063434e+17,
2: 8.918151813780849e+17,
3: 8.916895572798587e+17,
4: 8.913275589266883e+17},
'timestamp': {0: Timestamp('2017-08-01 16:23:56+0000', tz='UTC'),
1: Timestamp('2017-08-01 00:17:27+0000', tz='UTC'),
2: Timestamp('2017-07-31 00:18:03+0000', tz='UTC'),
3: Timestamp('2017-07-30 15:58:51+0000', tz='UTC'),
4: Timestamp('2017-07-29 16:00:24+0000', tz='UTC')},
'text': {0: "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 ",
1: "This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 ",
2: 'This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 ',
3: 'This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us ',
4: 'This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek '},
'rating_numerator': {0: 13.0, 1: 13.0, 2: 12.0, 3: 13.0, 4: 12.0},
'rating_denominator': {0: 10.0, 1: 10.0, 2: 10.0, 3: 10.0, 4: 10.0},
'name': {0: 'phineas', 1: 'tilly', 2: 'archie', 3: 'darla', 4: 'franklin'},
'doggo': {0: 'None', 1: 'None', 2: 'None', 3: 'None', 4: 'None'},
'floofer': {0: 'None', 1: 'None', 2: 'None', 3: 'None', 4: 'None'},
'pupper': {0: 'None', 1: 'None', 2: 'None', 3: 'None', 4: 'None'},
'puppo': {0: 'None', 1: 'None', 2: 'None', 3: 'None', 4: 'None'}}
If you want to use a for loop to get row values, you can use the iterrows() function:
for idx, row in twitter_archive.iterrows():
    denom = row['rating_denominator']
    numer = row['rating_numerator']
    # You can collect the new values in a list and concat it with the df
A faster way to iterate over the df is itertuples(), where the columns are available as attributes:
for row in twitter_archive.itertuples():
    denom = row.rating_denominator
    numer = row.rating_numerator
But I think the best way to create a new column from old ones is to use the pandas apply function:
df = pd.DataFrame(data={'a' : [1,2], 'b': [3,5]})
df['c'] = df.apply(lambda x: 'sum_is_odd' if (x['a'] + x['b']) % 2 == 1 else 'sum_is_even', axis=1)
In this case, 'c' is a new column whose value is calculated from the 'a' and 'b' columns.
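Since the question asks for hints rather than an exact answer, here is only a sketch of the same apply pattern with a conditional rescaling, on a toy frame with hypothetical numer/denom columns:
toy = pd.DataFrame({'numer': [40.0, 13.0], 'denom': [50.0, 10.0]})
# rescale to a /10 rating only when the denominator is a multiple of 10 above 10
toy['new_numer'] = toy.apply(
    lambda r: r['numer'] / (r['denom'] / 10) if (r['denom'] > 10) and (r['denom'] % 10 == 0) else r['numer'],
    axis=1)
# toy['new_numer'] -> [8.0, 13.0]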
I wrote a little script that loops through constraints to filter a dataframe. An example and a follow-up explaining the issue are below.
constraints = [['stand','==','L'],['zone','<','20']]
for x in constraints:
    vari = x[2]
    df = df.query("{0} {1} @vari".format(x[0], x[1]))
   zone stand  speed type
0     2     L   83.7   CH
1     7     L   95.9   SI
2    14     L   94.9   FS
3    11     L   93.3   FS
4    13     L   86.9   CH
5     7     L   96.4   SI
6    13     L   82.6   SL
I can't figure out a way to filter when there is an OR condition. For example, in the table above I'd like to return a dataframe using the constraints in the code example along with any rows that contain SI or CH in the type column. Does anyone have ideas on how to accomplish this? Any help would be greatly appreciated.
This seems to have gotten the job done but there is probably a much better way of going about it.
for x in constraints:
    vari = x[2]
    if isinstance(vari, list):
        frame = frame[frame[x[0]].isin(vari)]
    else:
        frame = frame.query("{0} {1} @vari".format(x[0], x[1]))
IIUC (see my question in the comments) you can do it like this.
I made a slightly different df to show you the result (I am guessing the table you show is already filtered):
df = pd.DataFrame(
{'zone': {0: 2, 1: 11, 2: 25, 3: 11, 4: 23, 5: 7, 6: 13},
'stand': {0: 'L', 1: 'L', 2: 'L', 3: 'C', 4: 'L', 5: 'K', 6: 'L'},
'speed': {0: 83.7, 1: 95.9, 2: 94.9, 3: 93.3, 4: 86.9, 5: 96.4, 6: 82.6},
'type': {0: 'CH', 1: 'SI', 2: 'FS', 3: 'FS', 4: 'CH', 5: 'SI', 6: 'SL'}})
print(df)
zone stand speed type
0 2 L 83.7 CH
1 11 L 95.9 SI
2 25 L 94.9 FS
3 11 C 93.3 FS
4 23 L 86.9 CH
5 7 K 96.4 SI
6 13 L 82.6 SL
res = df.loc[ ( (df['type']=='SI') | (df['type']=='CH') ) & ( (df['zone']<20) & (df['stand']=='L') ) ]
print(res)
zone stand speed type
0 2 L 83.7 CH
1 11 L 95.9 SI
Let me know if that is what you are searching for.
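Another option that stays inside the constraint-driven loop from the question: pandas .query also understands the in operator, so a list constraint written as ['type', 'in', ['SI', 'CH']] (a hypothetical format, not from the original constraints, with a numeric threshold for zone) can go through the same call:
constraints = [['stand', '==', 'L'], ['zone', '<', 20], ['type', 'in', ['SI', 'CH']]]
for col, op, vari in constraints:
    # "@vari" works for scalars and lists alike, e.g. "type in @vari"
    df = df.query("{0} {1} @vari".format(col, op))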
I have a column like this
User time Column
User1 time1 44 db
User1 time2 55 db
User1 time3 43 db
User1 time4 no_available
How do I calculate the average, min, and max for each user, taking just 44, 55, 43 (without 'db') and ignoring values like 'no_available' and 'no_power'?
Bonus: how do I also take the last value of the day if a user has, for example, 10 values at 10 different times?
Regards,
thank you.
If the values are all integers, you can use str.extract() to pull out the numbers, then take the mean, max, etc.:
df = pd.DataFrame({'User': {0: 'User1', 1: 'User1', 2: 'User1', 3: 'User1'},
'time': {0: 'time1', 1: 'time2', 2: 'time3', 3: 'time4'},
'Column': {0: '44 db', 1: '55 db', 2: '43 db', 3: 'no_available'}})
df['Numbers'] = df['Column'].str.extract(r'(\d+)').astype(float)
print(df['Numbers'].mean(), df['Numbers'].max())
Out [1]:
47.333333333333336 55.0
Example with -, ., or , in the number:
import pandas as pd
df = pd.DataFrame({'User': {0: 'User1', 1: 'User1', 2: 'User1', 3: 'User1'},
'time': {0: 'time1', 1: 'time2', 2: 'time3', 3: 'time4'},
'Column': {0: '44 db', 1: '-45.32 db', 2: '4,452.03 db', 3: 'no_available'}})
df['Numbers'] = df['Column'].str.replace(',','').str.extract(r'(-?\d+\.?\d+)').astype(float)
print(df['Numbers'])
0 44.00
1 -45.32
2 4452.03
3 NaN
Name: Numbers, dtype: float64
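For the per-user statistics and the bonus, a sketch building on the extracted Numbers column (assuming the rows are already ordered in time within each user, since the sample has no real timestamps):
# mean / min / max per user; NaNs from non-numeric entries are ignored automatically
stats = df.groupby('User')['Numbers'].agg(['mean', 'min', 'max'])

# last numeric reading per user (drop non-numeric rows first so the last real value is taken)
last_per_user = df.dropna(subset=['Numbers']).groupby('User')['Numbers'].last()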
Here is another way to get the MAX and MIN for that column:
import pandas as pd
a = ["user1","user1","user1","user1"]
a2 = ["time1","time2","time3","time4"]
a3 = ['45 db','55 db','43 db','no_available']
a = pd.DataFrame(a, columns=["user"])
a2 = pd.DataFrame(a2, columns=["time"])
a3 = pd.DataFrame(a3, columns=["column"])
data = pd.concat([a,a2,a3], axis=1)
data1 = list(data["column"])
h = []
for i in data1:
    try:
        # keep the value when its first two characters parse as an integer
        if int(i[0:2]):
            h.append(int(i[0:2]))
    except:
        print(i)
max(h)
min(h)
I have a pandas df that looks like so:
df = pd.DataFrame({'index': {0: 34, 1: 35, 2: 36, 3: 37, 4: 38},
'lane': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'project': {0: 'default',
1: 'default',
2: 'default',
3: 'default',
4: 'default'},
'sample': {0: 'None-BORD1778',
1: 'None-BORD1779',
2: 'None-BORD1780',
3: 'None-BORD1782',
4: 'None-BORD1783'},
'barcode_sequence': {0: 'AACCTACG',
1: 'TTGCGAGA',
2: 'TTGCTTGG',
3: 'TACACACG',
4: 'TTCGGCTA'},
'pf_clusters': {0: '"1,018,468"',
1: '"750,563"',
2: '"752,191"',
3: '"876,957"',
4: '"695,347"'},
'%_of_the_lane': {0: 0.28, 1: 0.21, 2: 0.21, 3: 0.24, 4: 0.19},
'%_perfect_barcode': {0: 100.0, 1: 100.0, 2: 100.0, 3: 100.0, 4: 100.0},
'yield_(mbases)': {0: '511', 1: '377', 2: '378', 3: '440', 4: '349'},
'%_pf_clusters': {0: 100.0, 1: 100.0, 2: 100.0, 3: 100.0, 4: 100.0},
'%_>=_q30_bases': {0: 89.74, 1: 89.9, 2: 89.0, 3: 89.31, 4: 88.69},
'mean_quality_score': {0: 35.13, 1: 35.15, 2: 34.98, 3: 35.04, 4: 34.92}})
I am now trying to do the following. For each of the values under the column barcode_sequence, I want to compare, character by character, how similar they are to all of the other values under that same column.
For that I have defined the following function:
def compare(s1,s2):
    return len([x for x in range(len(s1)) if s1[x] == s2[x]])/len(s1)
Now I want to apply this function to each value under df['barcode_sequence']. This means that, in my first iteration (where s1 is AACCTACG) I would apply the function compare to all other values under the same column i.e. AACCTACG with TTGCGAGA, TTGCTTGG, TACACACG and TTCGGCTA. Then I would do the same for the second row TTGCGAGA (which is now my new value of s1), and so on, until I reach the final entry under df['barcode_sequence'].
So far I have got the number of iterations that I need for each entry under df['barcode_sequence'], which can be achieved with a combination of a nested for loop with the iterrows() method. So if I do:
for index, row in df.iterrows():
    for sample in list(range(len(df.index))):
        print(index, row['sample'], row['barcode_sequence'])
I get at least which string I am comparing (my s1 in compare) and the number of comparisons I will do for each s1.
However, I am stuck at extracting all the s2 values for each s1.
Here's a way to do it using a cross-join format (no explicit for loops required):
# do a cross join
df1 = df[['barcode_sequence']].copy()
df1['barcode_un'] = [df1['barcode_sequence'].unique().tolist() for _ in range(df1.shape[0])]
# remove duplicate rows
df1 = df1.explode('barcode_un').query("barcode_sequence != barcode_un").reset_index(drop=True)
# calculate the score
df1['score'] = df1.apply(lambda x: compare(x['barcode_sequence'], x['barcode_un']), 1)
print(df1)
barcode_sequence barcode_un score
0 AACCTACG TTGCGAGA 0.250
1 AACCTACG TTGCTTGG 0.375
2 AACCTACG TACACACG 0.625
3 AACCTACG TTCGGCTA 0.125
4 TTGCGAGA AACCTACG 0.250
5 TTGCGAGA TTGCTTGG 0.625
6 TTGCGAGA TACACACG 0.250
7 TTGCGAGA TTCGGCTA 0.500
8 TTGCTTGG AACCTACG 0.375
9 TTGCTTGG TTGCGAGA 0.625
10 TTGCTTGG TACACACG 0.250
11 TTGCTTGG TTCGGCTA 0.250
12 TACACACG AACCTACG 0.625
13 TACACACG TTGCGAGA 0.250
14 TACACACG TTGCTTGG 0.250
15 TACACACG TTCGGCTA 0.250
16 TTCGGCTA AACCTACG 0.125
17 TTCGGCTA TTGCGAGA 0.500
18 TTCGGCTA TTGCTTGG 0.250
19 TTCGGCTA TACACACG 0.250
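If you then want only the single closest barcode for each sequence, one way (a sketch on top of df1) is to keep the highest-scoring row per barcode_sequence:
# for each barcode, keep the comparison row with the highest score
closest = df1.loc[df1.groupby('barcode_sequence')['score'].idxmax()]
print(closest)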