The pandas documentation of .loc clearly states:
.loc is strictly label based, will raise KeyError when the items are
not found, allowed inputs are:
A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label
of the index. This use is not an integer position along the index)
Contrary to that, this surprisingly works for pd.Series, not for pd.DataFrame:
import numpy as np
a = np.array([1,3,1,2])
import pandas as pd
s = pd.Series(a, index=["a", "b", "c", "d"])
s.loc["a"] # yields 1
s.loc[0] # should be strictly label-based, but it works and also yields 1
Do you know why?
This is driving me crazy! I want to replace all negative values in columns containing the string "_p" with the value multiplied by -0.5. Here is the code, where Tdf is a dataframe.
L=list(Tdf.filter(regex='_p').columns)
Tdf[L]=Tdf[L].astype(float)
Tdf[Tdf[L]<0]= Tdf[Tdf[L]<0]*-.5
I get the following error:
"TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value"
I verified that all columns in Tdf[L] are of type float64.
Even more confusing is that when I run code that is essentially the same, except that it loops through multiple dataframes, it works:
csv_names=['Full','Missing','MinusMissing']
for DF in csv_names:
    L = list(vars()[DF].iloc[:, 1:])
    vars()[DF][L] = vars()[DF][L].astype(float)
    vars()[DF][vars()[DF][L] < 0] = vars()[DF][vars()[DF][L] < 0] * -.5
What am I missing?
Please clarify your question. If your question is about the error,
Tdf[Tdf[L]<0]= Tdf[Tdf[L]<0]*-.5
likely fails due to non-np.nan null values, as the error describes.
If your question is instead, "How do I multiply negative values by -0.5 in columns with "_p" in the column name?", try:
for col in Tdf.filter(regex='_p').columns.tolist():
    Tdf[col] = Tdf.apply(lambda row: row[col] * -.5 if row[col] < 0 else row[col], axis=1)
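As an alternative to the row-wise apply, a vectorized DataFrame.mask does the same thing in one step; a minimal sketch on a small hypothetical frame (the column names here are made up for illustration):
import pandas as pd

# hypothetical frame with two "_p" columns
Tdf = pd.DataFrame({"name": ["a", "b"], "a_p": [-1.0, 2.0], "b_p": [3.0, -4.0]})

L = list(Tdf.filter(regex='_p').columns)
# mask() replaces values where the condition is True, here with the value * -0.5
Tdf[L] = Tdf[L].mask(Tdf[L] < 0, Tdf[L] * -0.5)
print(Tdf)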
Updated: You just need to filter out the string column, which is of type object, and then work on the data that is left. You can disable the slice warning if you want.
import pandas as pd
import numpy as np
Tdf = pd.DataFrame(columns=["name", "a_p", "b_p", "c"],
                   data=[["a", -1, -2, -3],
                         ["b", 1, -2, 3],
                         ["c", 1, np.NaN, 3]])
# get only non object columns
sub_Tdf = Tdf.select_dtypes(exclude='object')
# then work on slice
L = list(sub_Tdf.filter(regex='_p').columns)
sub_Tdf[L] = sub_Tdf[L].astype(float)
sub_Tdf[sub_Tdf[L] < 0] = sub_Tdf[sub_Tdf[L] < 0] * -.5
# see results were applied correctly
print(Tdf)
Output:
  name  a_p  b_p  c
0    a   -1 -2.0 -3
1    b    1 -2.0  3
2    c    1  NaN  3
This will trigger the SettingWithCopyWarning. You can disable it with:
import pandas as pd
pd.options.mode.chained_assignment = None
df['CRIM'].sort_values()[-10:] = df['CRIM'].sort_values()[-10:-9]
I want to change the top 10 values of CRIM to the 10th value of CRIM, but I get the error: cannot set using a slice indexer with a different length than the value.
Sorry, I'm not good at English.
Try selecting a value instead of a slice with .iloc:
import pandas as pd
df = pd.DataFrame({"a": range(10)})
print(df)
df["a"][-5:] = df["a"].iloc[-5]
print(df)
Your code errors because the slices on either side of the equals sign are not equal in length. The slice on the left has length 10; the slice on the right has length 1. So you have one value that you're trying to set on 10 positions. This only works when the right-hand side is a single scalar value, such as an integer or a float.
len(df['CRIM'].sort_values()[-10:])
>>> 10
len(df['CRIM'].sort_values()[-10:-9])
>>> 1
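Applied to the original question, one way (a sketch, assuming df has a numeric CRIM column) is to take the index labels of the 10 largest CRIM values with nlargest and assign the 10th-largest value to those rows via .loc:
import numpy as np
import pandas as pd

# hypothetical frame standing in for the questioner's data
df = pd.DataFrame({"CRIM": np.random.rand(20)})

top10 = df["CRIM"].nlargest(10)
tenth_value = top10.iloc[-1]                 # the 10th-largest value
df.loc[top10.index, "CRIM"] = tenth_value    # scalar assignment, so lengths don't have to match
print(df["CRIM"].sort_values(ascending=False).head(12))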
I am trying to apply the argrelextrema function to dataframe df, but I am unable to apply it correctly. Below is my code:
import numpy as np
import pandas as pd
from scipy.signal import argrelextrema

np.random.seed(42)

def maxloc(data):
    loc_opt_ind = argrelextrema(df.values, np.greater)
    loc_max = np.zeros(len(data))
    loc_max[loc_opt_ind] = 1
    data['loc_max'] = loc_max
    return data

values = np.random.rand(23000)
df = pd.DataFrame({'value': values})
np.all(maxloc(df).loc_max)
It gives me the error at loc_max[loc_opt_ind] = 1:
IndexError: too many indices for array
A pandas DataFrame is two-dimensional. That is, df.values is two-dimensional, even when it has only one column. As a result, loc_opt_ind contains both row and column indices (a tuple of two arrays; just print loc_opt_ind to see), which can't be used to index the one-dimensional loc_max. You probably want to use either df['value'].values (i.e. <Series>.values, which is one-dimensional) or np.squeeze(df.values) as input. Note that argrelextrema still returns a tuple in that case, just a one-element one, so you may need loc_opt_ind[0] (np.where has similar behaviour).
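For example, a sketch of the question's function rewritten along those lines, using the question's 'value' column:
import numpy as np
import pandas as pd
from scipy.signal import argrelextrema

def maxloc(data):
    # pass the 1-D column values rather than the 2-D data.values
    loc_opt_ind = argrelextrema(data['value'].values, np.greater)
    loc_max = np.zeros(len(data))
    loc_max[loc_opt_ind[0]] = 1   # argrelextrema returns a one-element tuple here
    data['loc_max'] = loc_max
    return data

np.random.seed(42)
df = pd.DataFrame({'value': np.random.rand(23000)})
print(maxloc(df)['loc_max'].sum())  # number of local maxima found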
I have a pandas dataframe with two columns: x and value.
I want to find all the rows where x == 10, and for all these rows set value = 1,000. I tried the code below but I get the warning that
A value is trying to be set on a copy of a slice from a DataFrame.
I understand I can avoid this by using .loc or .ix, but I would first need to find the location or the indices of all the rows which meet my condition of x ==10. Is there a more direct way?
Thanks!
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['x'] = np.arange(10, 14)
df['value'] = np.arange(200, 204)
print(df)
df[df['x'] == 10]['value'] = 1000  # this doesn't work
print(df)
You should use .loc so that the assignment is applied to the original DataFrame rather than to a copy; for your example, the following will work and will not raise a warning:
df.loc[df['x'] == 10, 'value'] = 1000
So the general form is:
df.loc[<mask or index label values>, <optional column>] = < new scalar value or array like>
The docs highlight these errors, and there is an intro to indexing as well; granted, some of the function docs are sparse, so feel free to submit improvements.
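For instance, both a scalar and an array-like of matching length are accepted on the right-hand side; a small sketch based on the question's data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(10, 14), 'value': np.arange(200, 204)})

# scalar assignment to every row matching the mask
df.loc[df['x'] == 10, 'value'] = 1000

# array-like assignment also works, as long as its length matches the number of selected rows
df.loc[df['x'] > 11, 'value'] = [998, 999]
print(df)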
I'm trying to find a better way to assert the column data type in Python/Pandas of a given dataframe.
For example:
import pandas as pd
t = pd.DataFrame({'a':[1,2,3], 'b':[2,6,0.75], 'c':['foo','bar','beer']})
I would like to assert that specific columns in the data frame are numeric. Here's what I have:
numeric_cols = ['a', 'b'] # These will be given
assert [x in ['int64','float'] for x in [t[y].dtype for y in numeric_cols]]
This last assert line doesn't feel very pythonic. Maybe it is, and I'm just cramming it all into one hard-to-read line. Is there a better way? I would like to write something like:
assert t[numeric_cols].dtype.isnumeric()
I can't seem to find something like that though.
You could use ptypes.is_numeric_dtype to identify numeric columns, ptypes.is_string_dtype to identify string-like columns, and ptypes.is_datetime64_any_dtype to identify datetime64 columns:
import pandas as pd
import pandas.api.types as ptypes
t = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 6, 0.75], 'c': ['foo', 'bar', 'beer'],
                  'd': pd.date_range('2000-1-1', periods=3)})
cols_to_check = ['a', 'b']
assert all(ptypes.is_numeric_dtype(t[col]) for col in cols_to_check)
# True
assert ptypes.is_string_dtype(t['c'])
# True
assert ptypes.is_datetime64_any_dtype(t['d'])
# True
The pandas.api.types module (which I aliased to ptypes) has both an is_datetime64_any_dtype and an is_datetime64_dtype function. The difference is in how they treat timezone-aware array-likes:
In [239]: ptypes.is_datetime64_any_dtype(pd.DatetimeIndex([1, 2, 3], tz="US/Eastern"))
Out[239]: True
In [240]: ptypes.is_datetime64_dtype(pd.DatetimeIndex([1, 2, 3], tz="US/Eastern"))
Out[240]: False
You can do this:
import numpy as np
numeric_dtypes = [np.dtype('int64'), np.dtype('float64')]
# or whatever types you want
assert t[numeric_cols].apply(lambda c: c.dtype).isin(numeric_dtypes).all()
An example of how to simply do a Python isinstance check of a column's pandas dtype, where the column is a numpy datetime64:
isinstance(dfe.dt_column_name.dtype, type(np.dtype('datetime64')))
Note: the dtype could also be checked against a list/tuple passed as the second argument.
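A small self-contained sketch (with a hypothetical dfe frame) illustrating the check and the tuple form:
import numpy as np
import pandas as pd

# hypothetical frame with a datetime64 column
dfe = pd.DataFrame({"dt_column_name": pd.date_range("2000-01-01", periods=3)})

print(isinstance(dfe.dt_column_name.dtype, type(np.dtype('datetime64'))))  # True
# a tuple of types works as the second argument too
print(isinstance(dfe.dt_column_name.dtype, (type(np.dtype('datetime64')), type(np.dtype('int64')))))  # True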
If you're interested in checking a column's data type consistency over rows, then @ely's answer using apply could be a better choice.