I have 2 CSV files as below, and I want to find whether an individual's performance (in df1) is above or below the class average (in df2) using some compare function after finding their values.
df1:
Name Class Test1 Test2 Test3
John 9A 75 83 77
David 9B 65 67 55
Peter 9A 85 90 88
Tom 9C 74 92 78
df2:
Class Test1 Test2 Test3
9A 80 82 84
9B 84 75 77
9C 75 78 80
Here's my method; feel free to correct/guide me if I'm wrong. I first find the Class of an individual in df1 (e.g., John is 9A), then look up the other columns such as Test1 or Test2 in df2 based on 9A:
target_class = df1.loc[df1['Name'] == 'John', 'Class']
print(target_class)
>>>>9A
Test1_avg = df2.loc[df2['Class'] == target_class, 'Test1']
# ideally it should return 80
And I got this ValueError: Can only compare identically-labeled Series objects
Or simply, how would you compare John's Test1 in df1 vs Class 9A's Test1 in df2? Is there any easier method than mine? Thanks for your help!
Update: I'll then use a compare function like this to return a score if it fulfills the criteria
def comparison(a, b):
    return 2 if a > b else 1 if a == b else -1
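The ValueError occurs because target_class is a one-element Series rather than a scalar, so df2['Class'] == target_class tries to align two differently-labeled Series. A minimal fix for your original lookup is to extract the scalar first:
# .iloc[0] (or .item()) turns the one-element Series into a plain scalar
target_class = df1.loc[df1['Name'] == 'John', 'Class'].iloc[0]       # '9A'
Test1_avg = df2.loc[df2['Class'] == target_class, 'Test1'].iloc[0]   # 80
john_test1 = df1.loc[df1['Name'] == 'John', 'Test1'].iloc[0]         # 75
print(comparison(john_test1, Test1_avg))  # -1 (below average)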
This is one way via pandas.merge.
# rename df2 columns
df2 = df2.rename(columns={'Test'+str(x): 'AvgTest'+str(x) for x in range(1, 4)})
# left merge df1 on df2
res = pd.merge(df1, df2, how='left', on=['Class'])
# calculate comparison results
comparison = pd.DataFrame(res.loc[:, res.columns.str.startswith('Test')].values >= \
res.loc[:, res.columns.str.startswith('AvgTest')].values,
columns=['Comp'+str(x) for x in range(1, 4)])
# join results to dataframe
res = res.join(comparison)
print(res)
# Name Class Test1 Test2 Test3 AvgTest1 AvgTest2 AvgTest3 Comp1 \
# 0 John 9A 75 83 77 80 82 84 False
# 1 David 9B 65 67 55 84 75 77 False
# 2 Peter 9A 85 90 88 80 82 84 True
# 3 Tom 9C 74 92 78 75 78 80 False
# Comp2 Comp3
# 0 True False
# 1 False False
# 2 True True
# 3 True False
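If you want the 2/1/-1 scores from your comparison function rather than plain booleans, here is a sketch reusing the merged res from above (the Score column names are my own), with a vectorised np.select:
import numpy as np

tests = res.loc[:, res.columns.str.startswith('Test')].values
avgs = res.loc[:, res.columns.str.startswith('AvgTest')].values

# vectorised comparison(): 2 if above average, 1 if equal, -1 if below
scores = np.select([tests > avgs, tests == avgs], [2, 1], default=-1)
res = res.join(pd.DataFrame(scores, columns=['Score1', 'Score2', 'Score3']))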
I have this dataframe:
x y z parameter
0 26 24 25 Age
1 35 37 36 Age
2 57 52 54.5 Age
3 160 164 162 Hgt
4 182 163 172.5 Hgt
5 175 167 171 Hgt
6 95 71 83 Wgt
7 110 68 89 Wgt
8 89 65 77 Wgt
I'm using pandas to get this final result:
x y parameter
0 160 164 Hgt
1 182 163 Hgt
2 175 167 Hgt
I'm using groupby() to extract and isolate rows with the same parameter (Hgt) from the original dataframe.
First, I added a column to set it as an index:
df = df.insert(0,'index', [count for count in range(df.shape[0])], True)
And the dataframe came out like this:
index x y z parameter
0 0 26 24 25 Age
1 1 35 37 36 Age
2 2 57 52 54.5 Age
3 3 160 164 162 Hgt
4 4 182 163 172.5 Hgt
5 5 175 167 171 Hgt
6 6 95 71 83 Wgt
7 7 110 68 89 Wgt
8 8 89 65 77 Wgt
Then, I used the following code to group based on index and extract the columns I need:
df1 = df.groupby('index')[['x', 'y','parameter']]
And the output was:
x y parameter
0 26 24 Age
1 35 37 Age
2 57 52 Age
3 160 164 Hgt
4 182 163 Hgt
5 175 167 Hgt
6 95 71 Wgt
7 110 68 Wgt
8 89 65 Wgt
After that, I used the following code to isolate only Hgt values:
df2 = df1[df1['parameter'] == 'Hgt']
When I ran df2, I got an error saying:
IndexError: Column(s) ['x', 'y', 'parameter'] already selected
Am I missing something here? What to do to get the final result?
Because you asked what you did wrong, let me point out the problematic code.
Without any judgement (this is just to help you improve future code), almost everything here is incorrect; it reads like a succession of complicated ways to do unnecessary things. Let me give some details:
df = df.insert(0,'index', [count for count in range(df.shape[0])], True)
This seems like a very convoluted way to do df.reset_index(). Even [count for count in range(df.shape[0])] could have been simplified by using range(df.shape[0]) directly. Note also that DataFrame.insert works in place and returns None, so df = df.insert(...) would actually leave df as None.
But this step is not even needed for a groupby as you can group by index level:
df.groupby(level=0)
But... the groupby is useless anyway, as you only have single-member groups.
Also, when you do:
df1 = df.groupby('index')[['x', 'y','parameter']]
df1 is not a DataFrame but a DataFrameGroupBy object. Storing one in a variable is very useful when you know what you're doing; here, however, it causes the error because you treated it as a DataFrame. You need to apply an aggregation or transformation method of the DataFrameGroupBy object to get back a DataFrame, which you didn't (likely because, as seen above, there isn't much interesting to do with single-member groups).
So when you run:
df1[df1['parameter'] == 'Hgt']
again, all is wrong: df1['parameter'] is equivalent to df.groupby('index')[['x', 'y','parameter']]['parameter'], which is the cause of the error, since you select 'parameter' twice. Even if you removed this error, the equality comparison would give a single True/False (you still have a DataFrameGroupBy, not a DataFrame), and this would incorrectly try to subselect a nonexistent column of the DataFrameGroupBy.
I hope it helped!
Do you really need groupby?
>>> df.loc[df['parameter'] == 'Hgt', ['x', 'y', 'parameter']].reset_index(drop=True)
x y parameter
0 160 164 Hgt
1 182 163 Hgt
2 175 167 Hgt
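If you do want groupby in the picture, a minimal sketch (assuming the same df) is to group on the parameter column itself and pull out a single group with get_group:
# each group holds the rows sharing one 'parameter' value;
# get_group('Hgt') returns that subset as a regular DataFrame
df2 = (df.groupby('parameter')
         .get_group('Hgt')[['x', 'y', 'parameter']]
         .reset_index(drop=True))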
I have a CSV file which has 255 columns and 16,000 rows of data, and I want to add a list which contains 16,000 entries as the first column of my CSV file.
The code I tried to use is
import os
import pandas as pd

# Append the name of each file to a list
path = 'C:/Users/User/Desktop/Guanlin_CNN1D/CNN1D/0.3 15 and 105 circle cropped'
list = os.listdir(path)
List = []
for a in list:
    List.append(str(a))

## Load the to-be-added CSV file
data = pd.read_csv('C:/Users/User/Desktop/Guanlin_CNN1D/CNN1D/0.3 15 and 105 for toolpath recreatation.csv', sep=',', engine='python', header=None)
tempdata = pd.DataFrame(data)
features = tempdata.values[:, 1:]
file_num = tempdata.values[:, 0]
# add the List to first columns of CSV file
Temp = {List, file_num, features}
temp = pd.DataFrame(Temp)
temp
The result shows
TypeError: unhashable type: 'list'
How to rewrite the code?
Thanks in advance!
I think you simply need to use the DataFrame insert method. It looks like you are trying to create a new dataframe, but I think that is not necessary. The example below inserts a new column at the zeroth position. It also looks like you were trying to make a new dataframe from a dict; there are easy ways to populate a dataframe from lists and dicts, and I think the number of rows and columns should not be a concern for you in this case.
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)), columns=list('ABCDE'))
print(df)
df.insert(0, column='newcol', value=np.random.randint(0, 100, size=5))
print()
print(df)
df.to_csv(r'data.csv', index=False, header=True)
will produce this output
A B C D E
0 44 47 64 67 67
1 9 83 21 36 87
2 70 88 88 12 58
3 65 39 87 46 88
4 81 37 25 77 72
newcol A B C D E
0 9 44 47 64 67 67
1 20 9 83 21 36 87
2 80 70 88 88 12 58
3 69 65 39 87 46 88
4 79 81 37 25 77 72
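Applied to your own variables, a minimal sketch would be (the 'filename' column name and the output path are placeholders, and it assumes List has one entry per row of tempdata):
# insert the filename list as the first column, then write the result out
tempdata.insert(0, column='filename', value=List)
tempdata.to_csv(r'data_with_names.csv', index=False, header=False)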
I created a dataframe as shown below. I have to add another name to the index and update scores; how do I append it to the existing data? I have to add 'Pandey' to the index with Test1 = 56 and Test2 = 76.
test_score = pd.DataFrame(
    {'Test1': [82, 75, 83, 92, 85],
     'Test2': [85, 81, 75, 85, 91]},
    index=['Sachin', 'Dravid', 'virat', 'Rohith', 'Dhawan'])
My result should be
Test1 Test2
Sachin 82 85
Dravid 75 81
virat 83 75
Rohith 92 85
Dhawan 85 91
Pandey 56 76
# use name= (not index=) to label the new row; capital 'P' matches the expected output
row = pd.Series({'Test1': 56, 'Test2': 76}, name='Pandey')
# DataFrame.append was removed in pandas 2.0, so concatenate the row instead
test_score = pd.concat([test_score, row.to_frame().T])
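A shorter equivalent, since the label 'Pandey' is not in the index yet, is label-based assignment with .loc:
# assigning to a new index label appends the row in place
test_score.loc['Pandey'] = [56, 76]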
I have a dataframe as follows:
year,value
1970,2.0729729191557147
1971,1.0184197388632872
1972,2.574009084167593
1973,1.4986879160266255
1974,3.0246498975934464
1975,1.7876222478238608
1976,2.5631745148930913
1977,2.444014336917563
1978,2.619502688172043
1979,2.268273809523809
1980,2.6086169818316645
1981,0.8452720174091145
1982,1.3158922171018947
1983,-0.12695212493599603
1984,1.4374230626622169
1985,2.389290834613415
1986,2.3489311315924217
1987,2.6002265745007676
1988,1.2623717711036955
1989,1.1793426779313878
I would like to subtract a constant from each of the values in the second column. This is the code I have tried:
df = pd.read_csv(f1, sep=",", header=0)
df2 = df["value"].subtract(1)
However when I do this, df2 becomes this:
70 1.072973
71 0.018420
72 1.574009
73 0.498688
74 2.024650
75 0.787622
76 1.563175
77 1.444014
78 1.619503
79 1.268274
80 1.608617
81 -0.154728
82 0.315892
83 -1.126952
84 0.437423
85 1.389291
86 1.348931
87 1.600227
88 0.262372
89 0.179343
The year becomes only the last two digits. How can I retain all of the digits of the year?
I think the year column is not modified; you only need to assign the subtracted values back:
df["value"] = df["value"].subtract(1)
I want to train a binary classification ML model with some data that I have; something like this:
df
y ch1_g1 ch2_g1 ch3_g1 ch1_g2 ch2_g2 ch3_g2
0 20 89 62 23 3 74
1 51 64 19 2 83 0
0 14 58 2 71 31 48
1 32 28 2 30 92 91
1 51 36 51 66 15 14
...
My target (y) depends on three characteristics from two groups; however, I have an imbalance in my data: a value count of my y target reveals that I have more zeros than ones, in a ratio of about 2.68. I correct this by looping over each row and randomly swapping values from group 1 to group 2 and vice versa, like this:
for index, row in df.iterrows():
    choice = np.random.choice([0, 1])
    if row['y'] != choice:
        df.loc[index, 'y'] = choice
        for column in df.columns[1:]:
            key = column.replace('g1', 'g2') if 'g1' in column else column.replace('g2', 'g1')
            df.loc[index, column] = row[key]
Doing this reduces the ratio to no more than 1.3, so I was wondering if there is a more direct approach using pandas methods. Does anyone have an idea how to accomplish this?
Setting aside whether swapping columns actually solves class imbalance, I would swap the whole data set and randomly choose between the original and the swapped:
# Step 1: swap the columns. The character class [^(_g1)]$ keeps every column
# whose last character is not one of '(', '_', 'g', '1', ')', i.e. 'y' and
# the *_g2 columns; the *_g1 columns are then appended after them.
df1 = pd.concat((df.filter(regex='[^(_g1)]$'),
                 df.filter(regex='_g1$')),
                axis=1)
# Step 2: rename the columns
df1.columns = df.columns
# random choice
np.random.seed(1)
is_original = np.random.choice([True,False], size=len(df))
# concat to make new dataset
pd.concat((df[is_original],df1[~is_original]))
Output:
y ch1_g1 ch2_g1 ch3_g1 ch1_g2 ch2_g2 ch3_g2
2 0 14 58 2 71 31 48
3 1 32 28 2 30 92 91
0 0 23 3 74 20 89 62
1 1 2 83 0 51 64 19
4 1 66 15 14 51 36 51
Notice that the rows with indexes 0, 1 and 4 have g1 swapped with g2.
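If you want the result back in the original row order, a small follow-up to the last line (res is just a name for the concatenated result) is to sort on the index:
# restore the original row order after concatenating the two halves
res = pd.concat((df[is_original], df1[~is_original])).sort_index()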