Calculating value from a group of cells with Pandas - python

I am trying to read a csv file of horse track information.
I am attempting to compute, for the post positions (col 3) in race 1, the max value of the field qpts (col 210). I have spent days researching this and can find no clear answer on the web or YouTube.
When I run the code below, I get "The truth value of a Series is ambiguous....."
import pandas as pd
import numpy as np
pd.set_option('display.max_columns',100)
df = pd.read_csv('track.csv', header=None, na_values=['.'])
index = list(range(0,200,1))
columns = list(range(0,1484,1))
if df.ix[2] == 1:
    qpts = df.max([210])
    print(qpts)

The problem is with
if df.ix[2] == 1. The expression df.ix[2] == 1 returns a pd.Series of truth values. By putting an if in front of it, you are attempting to evaluate an entire Series as either True or False, which is what throws the error.
There are several ways to produce a Series whose value is 210 and whose indices are those where df.ix[2] == 1.
This is one way
pd.Series(210, df.index[df.ix[2] == 1])
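For reference, here is a minimal repro of the error itself, with made-up data:
import pandas as pd

s = pd.Series([1, 2, 1])
print(s == 1)    # a boolean Series: True, False, True
# bool(s == 1)   # would raise: ValueError: The truth value of a Series is ambiguous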

Here df.ix[2] == 1 is going to return a Series. You need a reduction such as .any() or .all() to combine the Series into a single value that a truth statement can be applied to. For example,
import pandas as pd
import numpy as np
pd.set_option('display.max_columns',100)
df = pd.read_csv('track.csv', header=None, na_values=['.'])
index = list(range(0,200,1))
columns = list(range(0,1484,1))
if (df.ix[2] == 1).any():
    qpts = df.max([210])
    print(qpts)
In the case above we are checking whether any of the Series elements are equal to 1. If so, the body of the if statement runs. If we do not reduce the Series to a single value first, we could have a situation as follows:
print(df)
Out[1]:
1 3
2 7
3 1
4 5
5 6
print(df.ix[2]== 1)
Out[2]:
1 False
2 False
3 True
4 False
5 False
The comparison returns a Series containing both True and False values, so there is no single truth value for the if statement to act on.
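As an aside, .ix has since been removed from pandas. A minimal sketch of the task as described (max qpts in column 210 among rows whose post position, column 3, equals 1), using a boolean mask with .loc and assuming those column positions are correct:
import pandas as pd

df = pd.read_csv('track.csv', header=None, na_values=['.'])

# Boolean mask over rows: post position (column 3) equals 1.
mask = df[3] == 1

# Max qpts (column 210) among the masked rows.
qpts = df.loc[mask, 210].max()
print(qpts)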

Related

Create new dataframe from another dataframe

I've created a dataframe. I'd like to create a new dataframe based on conditions applied to the current one. My Python code is as follows:
df = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10],'B':[10,20,30,40,50,60,70,80,90,100]})
df
A B
0 1 10
1 2 20
2 3 30
3 4 40
4 5 50
5 6 60
6 7 70
7 8 80
8 9 90
9 10 100
import pywt
import numpy as np
import scipy.signal as signal
import matplotlib.pyplot as plt
from skimage.restoration import denoise_wavelet
wavelet_type='db6'
def new_df(df):
    df0 = pd.DataFrame()
    if (df.iloc[:,0]>=1) & (df.iloc[:,0]<=3):
        df0['B'] = denoise_wavelet(df.loc[(df.iloc[:,0]>=1) & (df.iloc[:,0]<=3),'B'], method='BayesShrink', mode='soft', wavelet_levels=3, wavelet='sym8', rescale_sigma='True')
    elif (df.iloc[:,0]>=4) & (df.iloc[:,0]<=6):
        df0['B'] = denoise_wavelet(df.loc[(df.iloc[:,0]>=4) & (df.iloc[:,0]<=6),'B'], method='BayesShrink', mode='soft', wavelet_levels=3, wavelet='sym8', rescale_sigma='True')
    else:
        df0['B'] = df.iloc[:,1]
    return df0
I want a new dataframe that denoises the values in column B that meet the conditions, but leaves the remaining values alone and keeps them in the new dataframe. My code gives me the error message: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). Could you please help me?
My desired output should look like:
A B
0 1 15*
1 2 25*
2 3 35*
3 4 45*
4 5 55*
5 6 65*
6 7 70
7 8 80
8 9 90
9 10 100
#* represents new values may be different when you get the result.
#this is just for a demo.
Maybe my approach is wrong. Could you please help me?
(df.iloc[:,0]>=1) will return a pandas Series of boolean values indicating which elements in the first column of df are greater than or equal to 1.
In the line
if (df.iloc[:,0]>=1) & (df.iloc[:,0]<=3):
the & itself is fine (it combines the two boolean Series element-wise), but the if then has to evaluate the resulting Series as a single True or False, which is ambiguous.
Pandas gives you a hint in the error message as to what might solve the problem:
e.g. if you wanted to check whether any element in df.iloc[:,0] was greater than or equal to one, you could use (df.iloc[:,0]>=1).any(), which returns a single bool that you could then combine with the result of (df.iloc[:,0]<=3).any().
Without more context to the problem or what you're trying to do, it is hard to help you further.
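A quick illustration of those reductions, with made-up data:
import pandas as pd

s = pd.Series([1, 2, 3])
print((s >= 1).any())   # True: at least one element satisfies the condition
print((s >= 1).all())   # True: every element satisfies the condition
print((s >= 2).all())   # False: the first element does not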
Note that since you are filtering the data while passing it to denoise_wavelet, you don't really need the if statements; you just need to assign the returned values back to the same rows of the DataFrame. Here is my approach: it first copies the original DataFrame and then replaces the desired rows with the "denoised" data.
import numpy as np
import pandas as pd
import scipy.signal as signal
import matplotlib.pyplot as plt
from skimage.restoration import denoise_wavelet
wavelet_type='db6'
df = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10],'B':[10,20,30,40,50,60,70,80,90,100]})
def new_df(df):
    df0 = df.copy()
    df0.loc[(df.iloc[:,0]>=1) & (df.iloc[:,0]<=3),'B'] = denoise_wavelet(df.loc[(df.iloc[:,0]>=1) & (df.iloc[:,0]<=3),'B'].values, method='BayesShrink', mode='soft', wavelet_levels=3, wavelet='sym8', rescale_sigma='True')
    df0.loc[(df.iloc[:,0]>=4) & (df.iloc[:,0]<=6),'B'] = denoise_wavelet(df.loc[(df.iloc[:,0]>=4) & (df.iloc[:,0]<=6),'B'].values, method='BayesShrink', mode='soft', wavelet_levels=3, wavelet='sym8', rescale_sigma='True')
    return df0
new_df(df)
However, I don't really know how denoise_wavelet works, so I can't say whether the result is correct, but the values from index 6 to 9 are left unchanged.
Updated
For applying for 2 or more columns:
df = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10],
'B1':[10,20,30,40,50,60,70,80,90,100],
'B2':[10,20,30,40,50,60,70,80,90,100],
'B3':[10,20,30,40,50,60,70,80,90,100]})
def apply_denoise(col):
    col.loc[1:3] = denoise_wavelet(col.loc[1:3], method='BayesShrink', mode='soft', wavelet_levels=3, wavelet='sym8', rescale_sigma='True')
    col.loc[4:6] = denoise_wavelet(col.loc[4:6], method='BayesShrink', mode='soft', wavelet_levels=3, wavelet='sym8', rescale_sigma='True')
    return col
new_df = df.set_index('A').apply(apply_denoise)
new_df
Note that since you are always conditioning on column 'A', you can convert it to the index and use label slicing to implement the condition. Then, using apply, you can call apply_denoise on each column, and it will return a new DataFrame with the resulting columns; a looped variant is sketched below.
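If more ranges ever need the same treatment, the two calls can be folded into a loop. A minimal sketch of the same function, where the (lo, hi) pairs are inclusive label slices on the 'A' index:
def apply_denoise(col):
    # Denoise each inclusive index range in turn; rows outside stay untouched.
    for lo, hi in [(1, 3), (4, 6)]:
        col.loc[lo:hi] = denoise_wavelet(col.loc[lo:hi], method='BayesShrink',
                                         mode='soft', wavelet_levels=3,
                                         wavelet='sym8', rescale_sigma=True)
    return col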

Get amount of a certain unique value within a column [duplicate]

I am trying to find the number of times a certain value appears in one column.
I have made the dataframe with data = pd.DataFrame.from_csv('data/DataSet2.csv')
and now I want to find the number of times something appears in a column. How is this done?
I thought it was the below, where I am looking in the education column and counting the number of times '?' occurs.
The code below shows that I am trying to find the number of times 9th appears, and the error is what I am getting when I run the code.
Code
missing2 = df.education.value_counts()['9th']
print(missing2)
Error
KeyError: '9th'
You can create a subset of the data with your condition and then use shape or len:
print df
col1 education
0 a 9th
1 b 9th
2 c 8th
print df.education == '9th'
0 True
1 True
2 False
Name: education, dtype: bool
print df[df.education == '9th']
col1 education
0 a 9th
1 b 9th
print df[df.education == '9th'].shape[0]
2
print len(df[df['education'] == '9th'])
2
Performance is interesting; the fastest solution is to compare against the numpy array and sum:
Code:
import perfplot, string
import numpy as np
import pandas as pd

np.random.seed(123)

def shape(df):
    return df[df.education == 'a'].shape[0]

def len_df(df):
    return len(df[df['education'] == 'a'])

def query_count(df):
    return df.query('education == "a"').education.count()

def sum_mask(df):
    return (df.education == 'a').sum()

def sum_mask_numpy(df):
    return (df.education.values == 'a').sum()

def make_df(n):
    L = list(string.ascii_letters)
    df = pd.DataFrame(np.random.choice(L, size=n), columns=['education'])
    return df

perfplot.show(
    setup=make_df,
    kernels=[shape, len_df, query_count, sum_mask, sum_mask_numpy],
    n_range=[2**k for k in range(2, 25)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
A couple of ways using count or sum:
In [338]: df
Out[338]:
col1 education
0 a 9th
1 b 9th
2 c 8th
In [335]: df.loc[df.education == '9th', 'education'].count()
Out[335]: 2
In [336]: (df.education == '9th').sum()
Out[336]: 2
In [337]: df.query('education == "9th"').education.count()
Out[337]: 2
An elegant way to count the occurrences of '?' (or any symbol) in any column is to use the DataFrame's built-in isin method.
Suppose that we have loaded the 'Automobile' dataset into the df object.
We do not know which columns contain the missing-value marker ('?'), so let's do:
df.isin(['?']).sum(axis=0)
The official documentation for DataFrame.isin(values) says:
it returns a boolean DataFrame showing whether each element in the DataFrame is contained in values
Note that isin accepts an iterable as input, thus we need to pass a list containing the target symbol to this function. df.isin(['?']) will return a boolean dataframe as follows.
symboling normalized-losses make fuel-type aspiration-ratio ...
0 False True False False False
1 False True False False False
2 False True False False False
3 False False False False False
4 False False False False False
5 False True False False False
...
To count the number of occurrence of the target symbol in each column, let's take sum over all the rows of the above dataframe by indicating axis=0.
The final (truncated) result shows what we expect:
symboling 0
normalized-losses 41
...
bore 4
stroke 4
compression-ratio 0
horsepower 2
peak-rpm 2
city-mpg 0
highway-mpg 0
price 4
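The same idea works on a single column, since Series also has isin; a small sketch using one of the columns above:
df['normalized-losses'].isin(['?']).sum()   # count of '?' in that one column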
Try this:
(df['education'] == '9th').sum()
Easy, but not efficient:
list(df.education).count('9th')
Simple example to count occurrences (unique values) in a column in Pandas data frame:
import pandas as pd
# URL to .csv file
data_url = 'https://yoursite.com/Arrests.csv'
# Reading the data
df = pd.read_csv(data_url, index_col=0)
# pandas count distinct values in column
df['education'].value_counts()
Outputs:
Education 47516
9th 41164
8th 25510
7th 25198
6th 25047
...
3rd 2
2nd 2
1st 2
Name: education, Length: 190, dtype: int64
For finding the count of a specific value in a column, whichever variant you prefer, you can use
df.col_name.value_counts().Value_you_are_looking_for
Take the example of the Titanic dataset:
df.Sex.value_counts().male
This gives a count of all males on the ship.
Note that attribute access like this only works when the value is a valid Python identifier; for numeric values you have to use bracket indexing instead, e.g. df.Survived.value_counts()[1].
A second method is to filter the frame and count the rows:
#this is an example of counting on a data frame
df[(df['Survived']==1)&(df['Sex']=='male')].shape[0]
This is not as efficient as value_counts(), but it helps when you want to count rows under several conditions at once.
Hope this helps.
EDIT --
If you want to look up a value with a space in it, bracket indexing works there too:
df.country.value_counts()['united states']
I believe this should solve the problem.
I think this could be an easier solution. Suppose you have the following data frame.
DATE LANG POSTS
2008-07-01 c# 3
2008-08-01 assembly 8
2008-08-01 javascript 2
2008-08-01 c 85
2008-08-01 python 11
2008-07-01 c# 3
2008-08-01 assembly 8
2008-08-01 javascript 62
2008-08-01 c 85
2008-08-01 python 14
you can aggregate per LANG like this
df.groupby('LANG').sum()
and you will have the sum of POSTS for each individual language.
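If what you want is how often each language occurs rather than the sum of POSTS, a minimal sketch on the same frame:
# Number of rows per language, i.e. how often each LANG value occurs.
df.groupby('LANG').size()

# Equivalent count, directly on the column.
df['LANG'].value_counts()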

Creating a new column depending on the equality of two other columns [duplicate]

This question already has an answer here:
python pandas : compare two columns for equality and result in third dataframe
(1 answer)
Closed last month.
I want to compare the values of two columns and create a new column bin_crnn. I want 1 if they are equal, or 0 if not.
# coding: utf-8
import pandas as pd
df = pd.read_csv('file.csv',sep=',')
if df['crnn_pred']==df['manual_raw_value']:
    df['bin_crnn']=1
else:
    df['bin_crnn']=0
I got the following error:
if df['crnn_pred']==df['manual_raw_value']:
File "/home/ahmed/anaconda3/envs/cv/lib/python2.7/site-packages/pandas/core/generic.py", line 917, in __nonzero__
.format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
You need to cast the boolean mask to int with astype:
df['bin_crnn'] = (df['crnn_pred']==df['manual_raw_value']).astype(int)
Sample:
df = pd.DataFrame({'crnn_pred':[1,2,5], 'manual_raw_value':[1,8,5]})
print (df)
crnn_pred manual_raw_value
0 1 1
1 2 8
2 5 5
print (df['crnn_pred']==df['manual_raw_value'])
0 True
1 False
2 True
dtype: bool
df['bin_crnn'] = (df['crnn_pred']==df['manual_raw_value']).astype(int)
print (df)
crnn_pred manual_raw_value bin_crnn
0 1 1 1
1 2 8 0
2 5 5 1
You get the error because comparing two columns outputs not a scalar but a Series (array) of True and False values.
So you need all or any to reduce it to a scalar True or False.
I think this answer explains it better.
One fast approach is to use np.where.
import numpy as np
df['test'] = np.where(df['crnn_pred']==df['manual_raw_value'], 1, 0)
No need for a loop or if statement; just set the new column using a boolean mask with .loc (chained indexing like df['bin_crnn'].loc[...] = 1 would fail here, since the column does not exist yet):
df['bin_crnn'] = 0
df.loc[df['crnn_pred'] == df['manual_raw_value'], 'bin_crnn'] = 1
Another way, using just Pandas and not Numpy (concise, though it applies a Python lambda per row, so it is slower than the vectorized options above), is
df['columns_are_equal'] = df.apply(lambda x: int(x['column_a'] == x['column_b']), axis=1)
You are comparing 2 columns; try this:
bin_crnn = []
for index, row in df.iterrows():
    # Row-by-row comparison: slow on large frames, but explicit.
    if row['crnn_pred'] == row['manual_raw_value']:
        bin_crnn.append(1)
    else:
        bin_crnn.append(0)
df['bin_crnn'] = bin_crnn

Getting previous row values from within pandas apply() function

import pandas as pd
def greater_or_less(d):
    if d['current'] > d['previous']:
        d['result'] = "Greater"
    elif d['current'] < d['previous']:
        d['result'] = "Less"
    elif d['current'] == d['previous']:
        d['result'] = "Equal"
    else:
        pass
    return d
df=pd.DataFrame({'current':[1,2,2,8,7]})
# Duplicate the column with shifted values
df['previous']=df['current'].shift(1)
df['result']=""
df=df.apply(greater_or_less,axis=1)
The result is:
current previous result
1 NaN
2 1 Greater
2 2 Equal
8 2 Greater
7 8 Less
I'd then drop the previous column as it's no longer needed. Ending up with:
current result
1
2 Greater
2 Equal
8 Greater
7 Less
How can I do this without adding the extra column?
What I'd like to know is how to reference the previous row's value from within the greater_or_less function.
Use the diff() method:
import pandas as pd
import numpy as np
df=pd.DataFrame({'current':[1,2,2,8,7]})
np.sign(df.current.diff()).map({1:"Greater", 0:"Equal", -1:"Less"})
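For completeness, a sketch that assigns the mapped result back and fills the first row (which has no previous value) with an empty string, matching the desired output above:
df['result'] = np.sign(df['current'].diff()).map(
    {1: "Greater", 0: "Equal", -1: "Less"}).fillna("")
print(df)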

Python Pandas: Resolving "List Object has no Attribute 'Loc'"

I import a CSV as a DataFrame using:
import numpy as np
import pandas as pd
df = pd.read_csv("test.csv")
Then I'm trying to do a simple replace based on IDs:
df.loc[df.ID == 103, ['fname', 'lname']] = 'Michael', 'Johnson'
I get the following error:
AttributeError: 'list' object has no attribute 'loc'
Note, when I print pd.__version__ I get 0.12.0, so it's not a problem (at least as far as I understand) with having a pre-0.11 version. Any ideas?
To pick up from the comment: "I was doing this:"
df = [df.hc== 2]
What you create there is a Python list containing a boolean "mask": a Series of booleans that says which rows fulfilled your condition. That is why df is suddenly a list rather than a DataFrame.
To filter your dataframe on your condition you want to do this:
df = df[df.hc == 2]
A bit more explicit is this:
mask = df.hc == 2
df = df[mask]
If you want to keep the entire dataframe and only replace specific values, there are methods such as replace: Python pandas equivalent for replace. Another (performance-wise great) method is creating a separate DataFrame with the from/to values as columns and using pd.merge to combine it into the existing DataFrame. Using the mask with .loc to set values is also possible (note that the chained form df[mask]['fname'] = ... would only modify a copy):
df.loc[mask, 'fname'] = 'Johnson'
But for a larger set of replaces you would want to use one of the two other methods, or use apply with a lambda function (for value transformations). Last but not least: you can use .fillna('bla') to rapidly fill up NA values; a small sketch of these follows.
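A small sketch of the replace and fillna methods just mentioned; the column values here are hypothetical:
# Map old values to new ones in a column.
df['fname'] = df['fname'].replace({'Mike': 'Michael'})

# Fill missing values in a column with a default.
df['lname'] = df['lname'].fillna('Unknown')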
The traceback indicates that df is a list, not a DataFrame, as expected by your line of code.
It means that somewhere between df = pd.read_csv("test.csv") and df.loc[df.ID == 103, ['fname', 'lname']] = 'Michael', 'Johnson' there are other lines of code that assign a list object to df. Review that piece of code to find your bug.
@Boud's answer is correct. .loc assignment works fine if the right-hand-side list matches the number of elements being replaced:
In [56]: df = DataFrame(dict(A =[1,2,3], B = [4,5,6], C = [7,8,9]))
In [57]: df
Out[57]:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
In [58]: df.loc[1,['A','B']] = -1,-2
In [59]: df
Out[59]:
A B C
0 1 4 7
1 -1 -2 8
2 3 6 9
