Let's say I have a fairly simple code such as
import pandas
df_import=pandas.read_excel("dataframe.xlsx")
df_import['Company'].str.contains('value',na=False,case=False)
So this obviously imports pandas, creates a dataframe from an Excel document, and then searches the column titled Company for some value, returning a boolean Series saying whether each cell contains that value (True or False).
However, I want to test 3 cases: case A, no results were found (all False); case B, only 1 result was found (exactly 1 True); and case C, more than 1 result was found (# of True > 1).
My thought is that I could set up a for loop, iterating through the column, and if the value of a cell is True, I add 1 to a variable (let's call it count). Then at the end, I have an if/elif/else statement based on the value of count, whether it is 0, 1, or >1.
Now, maybe there is a better way to check this but if not, I figured the for loop would look something like
for i in range(len(df_import.index)):
    if df_import.iloc[i, 0].str.contains('value', na=False, case=False):
        count += 1
First of all, I'm not sure if I should use .iloc or .iat but both give me the error
AttributeError: 'str' object has no attribute 'str'
and I wasn't able to find a correction for this.
Your current code is not going to work because iloc[i, 0] returns a scalar value, and of course, those don't have str accessor methods associated with them.
A quick and easy fix would be to just call sum on the Series-level str.contains call.
count = df_import['Company'].str.contains('value', na=False, case=False).sum()
Now, count contains the number of matches in that column.
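If you then want to branch on your three cases, a minimal sketch building on that one-liner (assuming the same column and search term from your question) could look like this:

import pandas as pd

df_import = pd.read_excel("dataframe.xlsx")
count = df_import['Company'].str.contains('value', na=False, case=False).sum()

if count == 0:
    print("case A: no matches")
elif count == 1:
    print("case B: exactly one match")
else:
    print("case C: more than one match")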
I am trying to sum a column based on whether the unique identifier is within another list I have defined. (The list is a subset of all of the unique identifiers.) So I am trying to do it like this:
sum = data.loc[data['unique_identifier'] in somelist, 'number'].sum()
But I get back a TypeError: 'Series' objects are mutable, thus they cannot be hashed. I know that this works below:
sum = data.loc[data['unique_identifier'] > 100, 'number'].sum()
Any ideas would be helpful and let me know if I need to clarify anything else.
You are probably looking for .isin(values):
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html
sum = data.loc[data['unique_identifier'].isin(somelist), 'number'].sum()
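For illustration, here is a small self-contained sketch with made-up data (not the asker's) showing how .isin() builds the boolean mask that .loc consumes:

import pandas as pd

data = pd.DataFrame({
    'unique_identifier': [101, 102, 103, 104],
    'number': [10, 20, 30, 40],
})
somelist = [102, 104]

# isin() returns a boolean Series: True where the identifier is in somelist
mask = data['unique_identifier'].isin(somelist)
total = data.loc[mask, 'number'].sum()
print(total)  # 60

As a side note, binding the result to a name like total avoids shadowing the built-in sum.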
After a lot of errors, exceptions and high blood pressure, I finally came up with this solution that works for what I needed it to do: basically I need to count all the column values that meet a specific condition.
So, let's say I got a list of strings just like
vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']
I want to count which values appear more than 2 times.
Consider that the column name of the dataframe based upon the list is 'veh'.
So, this piece of code works:
df['veh'].value_counts()[df['veh'].value_counts() > 2]
The question is: why does the [df['veh'].value_counts() > 2] part come right after the "()" of value_counts()? There is no "." or any other linking sign that could mean something.
If I use the code
df['classi'].value_counts() > 1
(which would be the logical syntax that my limited brain can abstract), it returns boolean values.
Can someone please help me understand the logic behind pandas?
I am pretty sure that pandas is awesome and the problem lies on my side, but I really want to understand it. I've read a lot of material (documentation included), but could not fill this gap of mine.
Thank you in advance!
The following line of code
df['veh'].value_counts()
returns a pandas Series with the unique values as keys (the index) and their number of occurrences as values.
Everything between square brackets [] is a filter on the keys of a pandas Series. So
df['veh'].value_counts()['car']
should return the number of occurrences of the word 'car' in column 'veh', which is equivalent to the corresponding value for key 'car' in the Series df['veh'].value_counts().
A pandas series also accept lists of keys as indices, So
df['veh'].value_counts()[['car','boat']]
should return the number of occurrences of the words 'car' and 'boat', respectively.
Furthermore, the Series accepts a list of booleans as a key, if it is of the same length as the Series. That is, it accepts a boolean mask.
When you write
df['veh'].value_counts() > 2
you make a comparison between each value in df['veh'].value_counts() and the number 2. This returns a boolean for each value, that is, a boolean mask.
So you can use the boolean mask as a filter on the series you created. Thus
df['veh'].value_counts()[df['veh'].value_counts() > 2]
returns all the occurrences for the keys whose counts are greater than 2.
The logic is that you can slice a series with a boolean series of the same size:
s[bool_series]
or equivalently
s.loc[bool_series]
This is also referred to as boolean indexing.
Now, your code is equivalent to:
s = df['veh'].value_counts()
bool_series = s > 2
and then either of the first two forms, e.g. s[s > 2].
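Putting both answers together with the list from the question, a small runnable sketch:

import pandas as pd

vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']
df = pd.DataFrame({'veh': vehicle})

s = df['veh'].value_counts()   # counts per value: car 3, tank 2, ...
bool_series = s > 2            # boolean mask over those counts
print(s[bool_series])          # only 'car' appears more than 2 times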
I'm trying to assign group numbers to products for items that have been out of stock multiple days in a row. Whenever there is a break in consecutive days out of stock, I need to assign a new group number. I've worked out the SQL so that if an item number/day combination is consecutive, it is assigned a 1, else 0 (group number iterates at 0's).
I've written the following simple function so that the variable group_num iterates +1 if the counter is 0, otherwise returns group_num as is:
def add_groups():
    group_num = 1
    for c in df['counter']:
        if c == 0:
            group_num += 1
        else:
            group_num += 0
    return group_num

df.apply(add_groups(), axis=1)
I keep getting the error 'int' object is not callable, 'occurred at index 0' and I have no idea why.
You can check whether you've reused a name in two different places, such as a method name and a function name. When the two names collide, the program ends up calling an int instead of the function, but an int object is not callable, which is what raises the error. To fix it, either change the variable name or change the method name. I hope this is helpful.
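As a side thought, if the goal is simply a group number that goes up by one every time counter is 0, a vectorized sketch (assuming a counter column as described in the question, and that this is the intended numbering) avoids apply entirely:

# Each 0 in 'counter' starts a new group, so a cumulative sum of the
# (counter == 0) mask yields a running group number per row.
df['group_num'] = (df['counter'] == 0).cumsum()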
I'm trying to use a background color in the style; not sure if this is the right way.
I think I should use an if/elif statement, but it gives me errors as well. I think I have to use loc or iloc for the particular column I am interested in, because there are different columns.
The main error with this code is:
ValueError: Function <function flt_cat_style_function_1 at 0x7f08ea52b830> returned the wrong shape.
Result has shape: (1,)
Expected shape: (11, 1)
a = df['flt_cat']

def flt_cat_style_function_1(a):
    df['flt_cat'].str.contains(r'VLIFR', 'background-color: #9400D3')
    df['flt_cat'].str.contains(r'LIFR', 'background-color: #FFA500')
    df['flt_cat'].str.contains(r'IFR', 'background-color: #FF0000')
    df['flt_cat'].str.contains(r'MVFR', 'background-color: #FFFF00')
    df['flt_cat'].str.contains(r'VFR', 'background-color: #00b050')

highlighted = df.style.apply(flt_cat_style_function_1, subset='flt_cat').render()
0 VLIFR
1 LIFR
2 LIFR
3 LIFR
4 IFR
5 IFR
6 MVFR
7 MVFR
8 MVFR
9 MVFR
10 VFR
Name: flt_cat, dtype: object
with open('shtml.html','w') as f:
f.write(highlighted)
Your code has a couple issues:
First, the second parameter of Series.str.contains() is case, a boolean which decides whether the matching should be case-sensitive or not. In your code, you put your background-color strings there; they evaluate to True but don't actually do what you want. You should take a look at the function's documentation.
Second, Series.str.contains() returns a Series of booleans that indicates which cells contain the string, but it doesn't modify the Series in place. So your function flt_cat_style_function_1() actually does nothing.
Third, since the function also has no return statement, it will default to returning None. However, df.style.apply() expects a function that returns an array-like containing exactly 11 values (the amount of rows in df). This is why you are seeing the ValueError.
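To see the second point concretely, here is a small illustration (using made-up values in the style of the question's column) of what str.contains() actually returns:

import pandas as pd

s = pd.Series(['VLIFR', 'LIFR', 'VFR'])
mask = s.str.contains('LIFR')
print(mask.tolist())  # [True, True, False] -- a boolean Series; s itself is unchanged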
I would suggest the following changes:
First, put your mapping of values to background-colors into a dictionary:
cell_bg_colors = {
'VLIFR': '#9400D3',
'LIFR': '#FFA500',
'IFR': '#FF0000',
'MVFR': '#FFFF00',
'VFR': '#00b050',
}
Create a function that maps one cell to its corresponding style:
def color_background(cell):
for value, color in cell_bg_colors.items():
if value in cell:
return "background-color: {}".format(color)
return "" # default: do nothing
Then, use Styler.applymap to apply this function to each individual cell:
highlighted = df.style.applymap(color_background, subset="flt_cat").render()
Finally, you can save highlighted to your file.
Side note
This code is only guaranteed to work correctly in Python 3.7+, as earlier versions don't guarantee that dictionary order is preserved (although Python 3.6 already keeps the order intact). For your example, this could mean that the IFR color gets applied to VLIFR or LIFR cells as well in earlier Python versions.
I have been working with Python for a couple of months. Now, I have to perform min-max normalization for a column of my dataset (.csv file), for which I get the above-mentioned TypeError. I have tried a lot, but it still persists. Correct values are retrieved by the min and max functions, but the types of the results are list rather than float/integer.
This is the line that causes error
for i in range(num):
normalized[i]=(krr[i]-min(krr)/(max(krr)-min(krr))
where krr is the column retrieved from the dataset. Please help.
I have a function "normal" which does the min-max normalization.
I have taken the column values using eval, as shown in the code:
def normal(self, arr, num):
    print("------------------->entered Normalisation block----------------->")
    for i in range(num):
        # trr=eval(str(arr[i]))[0:-31]
        self.krr[i] = map(float, eval(str(arr[i]))[0:-31])  # extracting one particular column
        # mn=min(self.krr)
        # mx=max(self.krr)
    print(self.krr)
    ls = min(self.krr)
    hs = max(self.krr)
    diff = hs - ls
    for i in range(num):
        normalized[i] = (self.krr[i] - ls) / diff
OK, so the key issue here is that you are working on a list of sublists, with each sublist containing one number.
If you look at your formula:
(krr[i]-min(krr)/(max(krr)-min(krr))
As you mention, Python can deal with the max and min - it will return the sublist that contains the biggest/smallest number. (Though note that getting a list containing one number is very different from getting just the one number.) However, subtraction and division between lists are not supported, hence your error message. So sooner or later, you need to get the values out of the sublists.
My recommendation is that immediately after you finish constructing krr, you add the following line to your code:
krr = [element[0] for element in krr]
which converts krr from a list of sublists, to a list of the first element of each sublist.
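For illustration, a tiny made-up example of how the flattened list then normalizes cleanly:

krr = [[3.0], [7.0], [5.0]]                  # list of one-element sublists
krr = [element[0] for element in krr]        # -> [3.0, 7.0, 5.0]

lo, hi = min(krr), max(krr)
normalized = [(x - lo) / (hi - lo) for x in krr]
print(normalized)                            # [0.0, 1.0, 0.5]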
Edit:
An alternative that I think will work, and is more efficient, is to change
def normal(self, arr, num):
    print("------------------->entered Normalisation block----------------->")
    for i in range(num):
        # trr=eval(str(arr[i]))[0:-31]
        self.krr[i] = map(float, eval(str(arr[i]))[0:-31])  # This row
into this:
self.krr[i] = float(eval(str(arr[i]))[0:-31][0])
map applies float to each element of the following list, and creates a new list. Instead, we're asking for the first element of that list, and applying float directly to it. That float is assigned to the index in krr.
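To illustrate the difference on a made-up one-element sublist (hypothetical values, not the asker's data):

row = ['3.5']                     # a one-element sublist, like the slice produces
as_list = list(map(float, row))   # [3.5] -- still wrapped in a list
as_float = float(row[0])          # 3.5   -- a bare number, ready for arithmetic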
PS eval(str(arr[i]))[0:-31] looks rather scary - does eval really need to be invoked here?