Pandas compare each row to reference row - certain columns only - python

I have the following pandas DataFrame**** in Python.
Temp_Fact Oscillops_read A B C D E F G H I J
0 A Today 0.710213 0.222015 0.814710 0.597732 0.634099 0.338913 0.452534 0.698082 0.706486 0.433162
1 B Today 0.653489 0.452543 0.618755 0.555629 0.490342 0.280299 0.026055 0.138876 0.053148 0.899734
2 A Aactl 0.129211 0.579690 0.641324 0.615772 0.927384 0.199651 0.652395 0.249467 0.262301 0.049795
3 A DFE 0.743794 0.355085 0.637794 0.633634 0.810033 0.509244 0.470418 0.972145 0.647222 0.610636
4 C Real_Mt_Olv 0.724282 0.332965 0.063078 0.004550 0.585398 0.869376 0.232148 0.630162 0.102206 0.232981
5 E Q_Mont 0.221685 0.224834 0.110734 0.397999 0.814153 0.552924 0.981098 0.536750 0.251941 0.383994
6 D DFE 0.655386 0.561297 0.305310 0.140998 0.433054 0.118187 0.479206 0.556546 0.556017 0.025070
7 F Bryo 0.257884 0.228650 0.413149 0.285651 0.814095 0.275627 0.775620 0.392448 0.827725 0.935581
8 C Aactl 0.017388 0.133848 0.939049 0.159416 0.923788 0.375638 0.331078 0.939089 0.098718 0.785569
9 C Today 0.197419 0.595253 0.574718 0.373899 0.363200 0.289378 0.698455 0.252657 0.357485 0.020484
10 C Pars 0.037771 0.683799 0.184114 0.545062 0.857000 0.295918 0.733196 0.613165 0.180642 0.254839
11 B Pars 0.637346 0.090000 0.848710 0.596883 0.027026 0.792180 0.843743 0.461608 0.552165 0.215250
12 B Pars 0.768422 0.017828 0.090141 0.108061 0.456734 0.803175 0.454479 0.501713 0.687016 0.625260
13 E Tomorrow 0.860112 0.532859 0.091641 0.768896 0.635966 0.007211 0.656367 0.053136 0.482367 0.680557
14 D DFE 0.801734 0.365921 0.243407 0.826373 0.904416 0.062448 0.801726 0.049983 0.433135 0.351150
15 F Q_Mont 0.360710 0.330745 0.598830 0.582379 0.828019 0.467044 0.287276 0.470980 0.355386 0.404299
16 D Last_Week 0.867126 0.600093 0.813257 0.005423 0.617543 0.657219 0.635255 0.314910 0.016516 0.689257
17 E Last_Week 0.551499 0.724981 0.821087 0.175279 0.301397 0.304105 0.379553 0.971244 0.558719 0.154240
18 F Bryo 0.511370 0.208831 0.260223 0.089106 0.121442 0.120513 0.099722 0.750769 0.860541 0.838855
19 E Bryo 0.323441 0.663328 0.951847 0.782042 0.909736 0.512978 0.999549 0.225423 0.789240 0.155898
20 C Tomorrow 0.267086 0.357918 0.562190 0.700404 0.961047 0.513091 0.779268 0.030190 0.460805 0.315814
21 B Tomorrow 0.951356 0.570077 0.867533 0.365708 0.791373 0.232377 0.478656 0.003857 0.805882 0.989754
22 F Today 0.963750 0.118826 0.264858 0.571066 0.761669 0.967419 0.565773 0.468971 0.466120 0.174815
23 B Last_Week 0.291186 0.126748 0.154725 0.527029 0.021485 0.224272 0.259218 0.052286 0.205569 0.617701
24 F Aactl 0.269308 0.655920 0.595518 0.404817 0.290342 0.447246 0.627082 0.306856 0.868357 0.979879
I also have a Series of values for each column:
df_base = df[df['Oscillops_read'] == 'Last_Week']
df_base_val = df_base.mean(axis=0)
As you can see, this is a Pandas Series and it is the average, for each column, over rows where Oscillops_read == 'Last_Week'. Here is the Series:
[0.56993702256121603, 0.48394061768804786, 0.59635616273775061, 0.23591030688019868, 0.31347492150330231, 0.39519847430740507, 0.42467546792253791, 0.4461465888887961, 0.26026797943899194, 0.48706569569369912]
I also have 2 lists:
1.
range_name_list = ['Base','Curnt','Prediction','Graph','Swg','Barometer_Output','Test_Cntr']
This list gives values that must be added to the dataframe df under certain conditions (described below).
2.
col_1 = list('DFA')
col_2 = list('ACEF')
col_3 = list('CEF')
col_4 = list('ABDF')
col_5 = list('DEF')
col_6 = list('AC')
col_7 = list('ABCDE')
These are lists of column names. These columns from df must be compared to the average Series above. So for example, for the 6th list col_6, columns A and C from each row of the dataframe df must be compared to columns A and C of the Series.
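For one list, that per-row check can be sketched as follows (using df and df_base_val from above):
col_6 = list('AC')
row_in_range = (df.loc[0, col_6] > df_base_val[col_6]).all()   # True if row 0 exceeds the Series in both A and C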
Problem:
As I mentioned above, I need to compare specific columns from the dataframe df to the base Series df_base_val. The columns to be compared are listed in col_1, col_2, col_3, ..., col_7. Here is what I need to do:
if a row is greater than the base Series df_base_val in all of the dataframe columns listed in col_n (e.g. for col_6 that means columns A and C), then for that row, in a new column Range, enter the n-th value from the list range_name_list.
Example:
eg. use col_6 - this is the 6th list and it has column names A and C.
Step 1: For row 1 of df, columns A and C are greater than
df_base_val[A] and df_base_val[C] respectively.
Step 2: Thus, for row 1, in a new column Range, enter the 6th element from the list range_name_list - the 6th element is Barometer_Output.
Example Output:
After doing this, the 1st row becomes:
0 A Today 0.710213 0.222015 0.814710 0.597732 0.634099 0.338913 0.452534 0.698082 0.706486 0.433162 'Barometer_Output'
Now, if this row were NOT greater than the Series in columns A and C, and also not greater than the Series in the columns from col_1, col_2, etc., then the Range column must be assigned the value 'Not_in_Range'. In that case, this row would become:
0 A Today 0.710213 0.222015 0.814710 0.597732 0.634099 0.338913 0.452534 0.698082 0.706486 0.433162 'Not_in_Range'
Simplification and Question:
In this example:
I only compared the 1st row to the base series. I need to compare
all rows of df individually to the base series and add an appropriate value.
I only used the 6th list of columns - this was col_6. Similarly, I need to go through each list of column names - col_1, col_2, ...., col_7.
If the row being compared is not greater than the base Series in the columns given by any of the lists col_1 to col_7, then the column Range must be assigned the value 'Not_in_Range'.
Is there a way to do this? Maybe using loops?
**** To create the above dataframe, select it above and copy it. Then use the following code:
import pandas as pd
df = pd.read_clipboard()
print(df)
EDIT:
If multiple conditions are met, I need them all to be listed. I.e. if the row belongs to 'Swg' and 'Curnt', then I need to list both of these in the Range column, or to create separate Range columns (or just Python lists) for each matching result: column Range1 would list 'Swg' and column Range2 would list 'Curnt', etc.

For starters I would create a dictionary with your condition sets where the keys can be used as indices for your range_name_list list:
conditions = {0: list('DFA'),
              1: list('ACEF'),
              2: list('CEF'),
              3: list('ABDF'),
              4: list('DEF'),
              5: list('AC'),
              6: list('ABCDE')}
The following code will then accomplish what I understand to be your task:
# Create your Range column to be filled in later.
df['Range'] = '|'
for index, row in df.iterrows():
    for ix, cols in conditions.items():
        # Create a list of the outcomes of checking whether the
        # value for each condition column is greater than the
        # df_base_val average.
        truths = [row[column] > df_base_val[column] for column in cols]
        # See if all checks evaluated to True
        if all(truths):
            # If so, append the appropriate 'range_name' to the
            # 'Range' column's value for the current row.
            df.loc[index, 'Range'] = df.loc[index, 'Range'] + range_name_list[ix] + "|"
# Fill in all rows where no conditions were met with 'Not_in_Range'
df.loc[df['Range'] == '|', 'Range'] = 'Not_in_Range'
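As a quick sanity check against the example in the question: row 0 exceeds the Last_Week means only in columns A and C, so only the col_6 / 'Barometer_Output' condition should fire for it:
print(df.loc[0, 'Range'])   # expected: '|Barometer_Output|'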

Try this code:
import pandas as pd
from io import BytesIO

# txt is assumed to hold the table from the question as bytes
df = pd.read_csv(BytesIO(txt), delim_whitespace=True)
df_base = df[df['Oscillops_read'] == 'Last_Week']
df_base_val = df_base.mean(axis=0)

columns = ['DFA', 'ACEF', 'CEF', 'ABDF', 'DEF', 'AC', 'ABCDE']
range_name_list = ['Base','Curnt','Prediction','Graph','Swg','Barometer_Output','Test_Cntr']

ranges = pd.Series(["NOT_IN_RANGE" for _ in range(df.shape[0])], index=df.index)
for name, cols in zip(range_name_list, columns):
    cols = list(cols)
    idx = df.index[(df[cols] > df_base_val[cols]).all(axis=1)]
    ranges[idx] = name
print(ranges)
But I don't know what you want if multiple ranges match one row.
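One possible way to honour the EDIT in the question would be to collect every matching range name per row instead of overwriting (a sketch building on the loop above):
matches = pd.Series([[] for _ in range(df.shape[0])], index=df.index)
for name, cols in zip(range_name_list, columns):
    cols = list(cols)
    hit = (df[cols] > df_base_val[cols]).all(axis=1)
    for i in df.index[hit]:
        matches[i].append(name)
# Join the collected names, falling back to 'Not_in_Range' when nothing matched.
df['Range'] = matches.apply(lambda m: '|'.join(m) if m else 'Not_in_Range')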


Run values of one dataframe through another and find the index of similar value from dataframe

I have two dataframes, both consisting of a single column with 62 values each:
Distance_obs = [
0.0
0.9084
2.1931
2.85815
3.3903
3.84815
4.2565
4.6287
4.97295
5.29475
5.598
5.8856
6.15975
6.4222
6.67435
6.9173
7.152
7.37925
7.5997
7.8139
8.02235
8.22555
8.42385
8.61755
8.807
8.99245
9.17415
9.35235
9.5272
9.6989
9.86765
10.0335
10.1967
10.3574
10.5156
10.6714
10.825
10.9765
11.1259
11.2732
11.4187
11.5622
11.7041
11.8442
11.9827
12.1197
12.2552
12.3891
12.5216
12.6527
12.7825
12.9109
13.0381
13.1641
13.2889
13.4126
13.5351
13.6565
13.7768
13.8961
14.0144
14.0733
]
and
Cell_mid = [0.814993
1.96757
2.56418
3.04159
3.45236
3.8187
4.15258
4.46142
4.75013
5.02221
5.28026
5.52624
5.76172
5.98792
6.20588
6.41642
6.62027
6.81802
7.01019
7.19723
7.37952
7.55742
7.73122
7.9012
8.0676
8.23063
8.39049
8.54736
8.70141
8.85277
9.00159
9.14798
9.29207
9.43396
9.57374
9.71152
9.84736
9.98136
10.1136
10.2441
10.373
10.5003
10.626
10.7503
10.8732
10.9947
11.1149
11.2337
11.3514
11.4678
11.5831
11.6972
11.8102
11.9222
12.0331
12.143
12.2519
12.3599
12.4669
12.573
12.6782
12.7826
]
I want to run each element of Distance_obs through the values in Cell_mid and find the index of the nearest matching value.
I have been trying the following:
for i in Distance_obs:
    Required_value = (np.abs(Cell_mid - i)).idxmin()
But I get
error: ufunc 'subtract' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U32')) -> dtype('<U32')
One way to do this could be as follows:
Use pd.merge_asof, passing "nearest" to the direction parameter.
Now, from the merged result select column Cell_mid, and use Series.map with a pd.Series where the values and index of its original df (here: df2) are swapped.
df['Cell_mid_index'] = pd.merge_asof(df, df2,
                                     left_on='Distance_obs',
                                     right_on='Cell_mid',
                                     direction='nearest')\
                       ['Cell_mid'].map(pd.Series(df2['Cell_mid'].index.values,
                                                  index=df2['Cell_mid']))
print(df.head())
Distance_obs Cell_mid_index
0 0.00000 0
1 0.90840 0
2 2.19310 1
3 2.85815 3
4 3.39030 4
So, at the intermediate step, we had a merged df like this:
print(pd.merge_asof(df, df2, left_on='Distance_obs',
                    right_on='Cell_mid', direction='nearest').head())
Distance_obs Cell_mid
0 0.00000 0.814993
1 0.90840 0.814993
2 2.19310 1.967570
3 2.85815 3.041590
4 3.39030 3.452360
And then with .map we are retrieving the appropriate index values from df2.
Data used
import pandas as pd
Distance_obs = [0.0, 0.9084, 2.1931, 2.85815, 3.3903, 3.84815, 4.2565,
4.6287, 4.97295, 5.29475, 5.598, 5.8856, 6.15975, 6.4222,
6.67435, 6.9173, 7.152, 7.37925, 7.5997, 7.8139, 8.02235,
8.22555, 8.42385, 8.61755, 8.807, 8.99245, 9.17415, 9.35235,
9.5272, 9.6989, 9.86765, 10.0335, 10.1967, 10.3574, 10.5156,
10.6714, 10.825, 10.9765, 11.1259, 11.2732, 11.4187, 11.5622,
11.7041, 11.8442, 11.9827, 12.1197, 12.2552, 12.3891, 12.5216,
12.6527, 12.7825, 12.9109, 13.0381, 13.1641, 13.2889, 13.4126,
13.5351, 13.6565, 13.7768, 13.8961, 14.0144, 14.0733]
df = pd.DataFrame(Distance_obs, columns=['Distance_obs'])
Cell_mid = [0.814993, 1.96757, 2.56418, 3.04159, 3.45236, 3.8187, 4.15258,
4.46142, 4.75013, 5.02221, 5.28026, 5.52624, 5.76172, 5.98792,
6.20588, 6.41642, 6.62027, 6.81802, 7.01019, 7.19723, 7.37952,
7.55742, 7.73122, 7.9012, 8.0676, 8.23063, 8.39049, 8.54736,
8.70141, 8.85277, 9.00159, 9.14798, 9.29207, 9.43396, 9.57374,
9.71152, 9.84736, 9.98136, 10.1136, 10.2441, 10.373, 10.5003,
10.626, 10.7503, 10.8732, 10.9947, 11.1149, 11.2337, 11.3514,
11.4678, 11.5831, 11.6972, 11.8102, 11.9222, 12.0331, 12.143,
12.2519, 12.3599, 12.4669, 12.573, 12.6782, 12.7826]
df2 = pd.DataFrame(Cell_mid, columns=['Cell_mid'])
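For reference, a NumPy alternative could look like this (a sketch, assuming both columns hold numeric floats rather than strings, which is what the original ufunc error points to):
import numpy as np

dist = df['Distance_obs'].to_numpy()
mid = df2['Cell_mid'].to_numpy()
# For every observed distance, take the index of the closest cell midpoint.
df['Cell_mid_index'] = np.abs(dist[:, None] - mid[None, :]).argmin(axis=1)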

Compare two dataframe columns for matching strings or are substrings then count in pandas (Need for speed edition)

I have two dataframes (A and B). I want to compare each string in A and find either an exact match in B or a string in B that contains it. Then count the number of times A was matched or contained in B.
Dataframe A
0 "4012, 4065, 4682"
1 "4712, 2339, 5652, 10007"
2 "4618, 8987"
3 "7447, 4615, 4012"
4 "6515"
5 "4065, 2339, 4012"
Dataframe B
0 "6515, 4012, 4618, 8987" <- matches (DF A, Index 2 & 4) (2: 4618, 8987), (4: 6515)
1 "4065, 5116, 2339, 8757, 4012" <- matches (DF A, Index 5) (4065, 2339, 4012)
2 "1101"
3 "6515" <- matches (DF A, Index 4) (6515)
4 "4012, 4615, 7447" <- matches (DF A, Index 3) (7447, 4615, 4012)
5 "7447, 6515, 4012, 4615" <- matches (DF A, Index 3 & 4) (3: 7447, 4615, 4012 ), (4: 6515)
Desired Output:
Itemset Count
2 4618, 8987 1
3 7447, 4615, 4012 2
4 6515 3
5 4065, 2339, 4012 1
Basically, I want to count when there is a direct match of A in B (whether in the same order or not) or when A is partially contained in B (in order or not). My goal is to count how many times A is validated by B. These are all strings, by the way.
EDIT Need for speed edition:
This is a redo question from my previous post:
Compare two dataframe columns for matching strings or are substrings then count in pandas
I have millions of rows in both dfA and dfB to make these comparisons against.
In my previous post, the following code got the job done:
import pandas as pd

dfA = pd.DataFrame(["4012, 4065, 4682",
                    "4712, 2339, 5652, 10007",
                    "4618, 8987",
                    "7447, 4615, 4012",
                    "6515",
                    "4065, 2339, 4012"],
                   columns=['values'])
dfB = pd.DataFrame(["6515, 4012, 4618, 8987",
                    "4065, 5116, 2339, 8757, 4012",
                    "1101",
                    "6515",
                    "4012, 4615, 7447",
                    "7447, 6515, 4012, 4615"],
                   columns=['values'])

dfA['values_list'] = dfA['values'].str.split(', ')
dfB['values_list'] = dfB['values'].str.split(', ')

dfA['overlap_A'] = [sum(all(val in cell for val in row)
                        for cell in dfB['values_list'])
                    for row in dfA['values_list']]
However, with the total number of rows to check, I am experiencing a performance issue and need another way to get the frequency / counts. It seems like NumPy is needed in this case, but the snippet below is about the extent of my NumPy knowledge, as I work primarily in pandas. Does anyone have suggestions to make this faster?
dfA_array = dfA['values_list'].to_numpy()
dfB_array = dfB['values_list'].to_numpy()
Give this a try. Your algorithm is O(N^2 * K): the square of the row count times the words per line. The code below should improve that to O(N*K):
from collections import defaultdict
from functools import reduce

# Build an inverted index: value -> set of dfB row indices containing it.
d = defaultdict(set)
for i, t in enumerate(dfB['values']):
    for s in t.split(', '):
        d[s].add(i)

# For each dfA row, intersect the index sets of its values; the size of the
# intersection is the number of dfB rows containing all of them.
dfA['count'] = dfA['values'].apply(
    lambda x: len(reduce(lambda a, b: a.intersection(b),
                         [d[s] for s in x.split(', ')])))
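Running this on the sample frames above should reproduce the counts from the desired output:
print(dfA[['values', 'count']])
# Expected counts per row: 0, 0, 1, 2, 3, 1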

concatenating and saving multiple pair of CSV in pandas

I am a beginner in Python. I have a hundred pairs of CSV files. The files look like this:
25_13oct_speed_0.csv
26_13oct_speed_0.csv
25_13oct_speed_0.1.csv
26_13oct_speed_0.1.csv
25_13oct_speed_0.2.csv
26_13oct_speed_0.2.csv
and others
I want to concatenate each pair of files, one 25 file with its matching 26 file. Each pair has a speed threshold (Speed_0, 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0) which is labeled in the file name. These files have the same data structure.
Mac Annotation X Y
A first 0 0
A last 0 0
B first 0 0
B last 0 0
Therefore, a simple concatenation is enough to join each pair. I use this method:
df1 = pd.read_csv('25_13oct_speed_0.csv')
df2 = pd.read_csv('26_13oct_speed_0.csv')
frames = [df1, df2]
result = pd.concat(frames)
for each pair of files, but it takes time and is not an elegant way. Is there a good way to combine the pairs automatically and save the results as I go?
The idea is to create a DataFrame from the list of file names and add 2 new columns by splitting on the first _ with Series.str.split:
print (files)
['25_13oct_speed_0.csv', '26_13oct_speed_0.csv',
'25_13oct_speed_0.1.csv', '26_13oct_speed_0.1.csv',
'25_13oct_speed_0.2.csv', '26_13oct_speed_0.2.csv']
df1 = pd.DataFrame({'files': files})
df1[['g','names']] = df1['files'].str.split('_', n=1, expand=True)
print (df1)
files g names
0 25_13oct_speed_0.csv 25 13oct_speed_0.csv
1 26_13oct_speed_0.csv 26 13oct_speed_0.csv
2 25_13oct_speed_0.1.csv 25 13oct_speed_0.1.csv
3 26_13oct_speed_0.1.csv 26 13oct_speed_0.1.csv
4 25_13oct_speed_0.2.csv 25 13oct_speed_0.2.csv
5 26_13oct_speed_0.2.csv 26 13oct_speed_0.2.csv
Then group by the column names, loop over each group's rows with DataFrame.itertuples, and create a new DataFrame with read_csv (adding a column filled with the values from g if necessary), append it to a list, concat, and finally save to a new file named after the names column:
for i, g in df1.groupby('names'):
    out = []
    for n in g.itertuples():
        df = pd.read_csv(n.files).assign(source=n.g)
        out.append(df)
    dfbig = pd.concat(out, ignore_index=True)
    print (dfbig)
    dfbig.to_csv(g['names'].iat[0])
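If the files list is not already at hand, it could be collected with glob (a sketch; the pattern below is an assumption based on the naming shown in the question):
import glob

# Assumes the paired CSVs sit in the current directory and follow the
# "<25|26>_13oct_speed_<threshold>.csv" naming from the question.
files = sorted(glob.glob('*_13oct_speed_*.csv'))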

Extending the column name in pandas DataFrame

I have a data frame which contains 34 rows and 10 columns. I called the data frame "comp", and then I did "invcomp = 1/comp", so the values changed but the column names stayed the same. I want to rename my columns: suppose the name of my first column was "Cbm_m" in "comp"; I want it to become "Cbm_m_inv" in "invcomp", i.e. add an extra term at the end.
Use 'add_suffix':
invcomp = invcomp.add_suffix('_inv')
Setup:
import numpy as np
invcomp = pd.DataFrame(np.random.rand(5, 5), columns=list('ABCDE'))
invcomp = invcomp.add_suffix('_inv')
Output:
A_inv B_inv C_inv D_inv E_inv
0 0.111604 0.016181 0.384071 0.608118 0.944439
1 0.523085 0.139200 0.495815 0.007926 0.183498
2 0.090169 0.357117 0.381938 0.222261 0.788706
3 0.802219 0.002049 0.173157 0.716345 0.182829
4 0.260781 0.376730 0.646595 0.324361 0.345097
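Applied to the frames from the question, this would be something like (a sketch, assuming comp is the original DataFrame):
invcomp = (1 / comp).add_suffix('_inv')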

Pandas joining based on date

I'm trying to join two dataframes with dates that don't perfectly match up. For a given group/date in the left dataframe, I want to join the corresponding record from the right dataframe with a date just before that of the left dataframe. Probably easiest to show with an example.
df1:
group date teacher
a 1/10/00 1
a 2/27/00 1
b 1/7/00 1
b 4/5/00 1
c 2/9/00 2
c 9/12/00 2
df2:
teacher date hair length
1 1/1/00 4
1 1/5/00 8
1 1/30/00 20
1 3/20/00 100
2 1/1/00 0
2 8/10/00 50
Gives us:
group date teacher hair length
a 1/10/00 1 8
a 2/27/00 1 20
b 1/7/00 1 8
b 4/5/00 1 100
c 2/9/00 2 0
c 9/12/00 2 50
Edit 1:
Hacked together a way to do this. Basically I iterate through every row in df1 and pick out the most recent corresponding entry in df2. It is insanely slow, surely there must be a better way.
One way to do this is to create a new column in the left data frame, which will (for a given row's date) determine the value that is closest and earlier:
df1['join_date'] = df1.date.map(lambda x: df2.date[df2.date <= x].max())
then a regular join or merge between 'join_date' on the left and 'date' on the right will work. You may need to tweak the function to handle Null values or other corner cases.
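That follow-up merge could be sketched roughly as follows (assuming the date columns have been parsed with pd.to_datetime; as noted, corner cases such as ties across teachers may need extra handling):
merged = pd.merge(df1, df2, left_on='join_date', right_on='date', how='left')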
This is not very efficient (you are searching the right-hand dates over and over). A more efficient approach is to sort both data frames by the dates, iterate through the left-hand data frame, and consume entries from the right hand data frame just until the date is larger:
# Assuming df1 and df2 are sorted by the dates
df1['hair length'] = 0  # initialize
r_generator = df2.iterrows()
_, cur_r_row = next(r_generator)
for i, l_row in df1.iterrows():
    cur_hair_length = 0  # Assume 0 works when df1 has a date earlier than df2
    while cur_r_row['date'] <= l_row['date']:
        cur_hair_length = cur_r_row['hair length']
        try:
            _, cur_r_row = next(r_generator)
        except StopIteration:
            break
    df1.loc[i, 'hair length'] = cur_hair_length
Seems like the quickest way to do this is using sqlite via pysqldf:
def partial_versioned_join(tablea, tableb, tablea_keys, tableb_keys):
    try:
        tablea_group, tablea_date = tablea_keys
        tableb_group, tableb_date = tableb_keys
    except ValueError as e:
        raise ValueError('Need to pass in both a group and date key '
                         'for both tables') from e

    # Note: can't actually use group here as a field name due to sqlite
    # (base_id is assumed to be defined in the enclosing scope)
    statement = """SELECT a.group, a.{date_a} AS {temp_date}, b.*
                   FROM (SELECT tablea.group, tablea.{date_a}, tablea.{group_a},
                                MAX(tableb.{date_b}) AS tdate
                         FROM tablea
                         JOIN tableb
                           ON tablea.{group_a}=tableb.{group_b}
                          AND tablea.{date_a}>=tableb.{date_b}
                         GROUP BY tablea.{base_id}, tablea.{date_a}, tablea.{group_a}
                        ) AS a
                   JOIN tableb b
                     ON a.{group_a}=b.{group_b}
                    AND a.tdate=b.{date_b};
                """.format(group_a=tablea_group, date_a=tablea_date,
                           group_b=tableb_group, date_b=tableb_date,
                           temp_date='join_date', base_id=base_id)

    # Note: you lose types here for tableb so you may want to save them
    pre_join_tableb = sqldf(statement, locals())
    return pd.merge(tablea, pre_join_tableb, how='inner',
                    left_on=['group'] + tablea_keys,
                    right_on=['group', tableb_group, 'join_date'])
