I have constructed a condition that extracts exactly one row from my data frame:
d2 = df[(df['l_ext']==l_ext) & (df['item']==item) & (df['wn']==wn) & (df['wd']==1)]
Now I would like to take a value from a particular column:
val = d2['col_name']
But as a result, I get a data frame that contains one row and one column (i.e., one cell). It is not what I need. I need one value (one float number). How can I do it in pandas?
If you have a DataFrame with only one row, then access the first (only) row as a Series using iloc, and then the value using the column name:
In [3]: sub_df
Out[3]:
A B
2 -0.133653 -0.030854
In [4]: sub_df.iloc[0]
Out[4]:
A -0.133653
B -0.030854
Name: 2, dtype: float64
In [5]: sub_df.iloc[0]['A']
Out[5]: -0.13365288513107493
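If you'd rather not chain two lookups, the same scalar can be read in one step (a small sketch using the sub_df above):

sub_df['A'].iloc[0]              # column first, then position
sub_df.at[sub_df.index[0], 'A']  # label-based scalar access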
These are fast access methods for scalars:
In [15]: df = pandas.DataFrame(numpy.random.randn(5, 3), columns=list('ABC'))
In [16]: df
Out[16]:
A B C
0 -0.074172 -0.090626 0.038272
1 -0.128545 0.762088 -0.714816
2 0.201498 -0.734963 0.558397
3 1.563307 -1.186415 0.848246
4 0.205171 0.962514 0.037709
In [17]: df.iat[0, 0]
Out[17]: -0.074171888537611502
In [18]: df.at[0, 'A']
Out[18]: -0.074171888537611502
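The label/position distinction matters once the index is not the default 0..n-1; a quick sketch (continuing the df above, with a hypothetical relabeled index):

df2 = df.set_index(pandas.Index([10, 11, 12, 13, 14]))
df2.at[10, 'A']   # .at looks up the row label 10
df2.iat[0, 0]     # .iat looks up the first row by position; same cell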
You can turn your 1x1 dataframe into a NumPy array, then access the first and only value of that array:
val = d2['col_name'].values[0]
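On newer pandas, .to_numpy() is the recommended spelling of .values; either way, note that [0] raises an IndexError if the filter matched no rows:

val = d2['col_name'].to_numpy()[0]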
Most answers are using iloc, which is good for selection by position.
If you need selection by label, loc would be more convenient.
For getting a value explicitly (equivalent to the deprecated df.get_value('a', 'A')):
# This is also equivalent to df1.at['a', 'A']
In [55]: df1.loc['a', 'A']
Out[55]: 0.13200317033032932
It doesn't need to be complicated:
val = df.loc[df.wd==1, 'col_name'].values[0]
I needed the value of one cell, selected by column and index names.
This solution worked for me:
original_conversion_frequency.loc[1,:].values[0]
It looks like the behavior changed somewhere between pandas 0.10.1 and 0.13.1.
I upgraded from 0.10.1 to 0.13.1; before, iloc was not available.
Now with 0.13.1, iloc[0]['label'] gets a single-value Series rather than a scalar.
Like this:
lastprice = stock.iloc[-1]['Close']
Output:
date
2014-02-26 118.2
Name: Close, dtype: float64
The quickest and easiest options I have found are the following. 501 represents the row index.
df.at[501, 'column_name']
df.get_value(501, 'column_name')
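(Note: get_value was deprecated in pandas 0.21 and removed in later versions; df.at[501, 'column_name'] is its modern replacement.)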
In later versions, you can fix it by simply doing:
val = float(d2['col_name'].iloc[0])
df_gdp.columns
Index([u'Country', u'Country Code', u'Indicator Name', u'Indicator Code',
u'1960', u'1961', u'1962', u'1963', u'1964', u'1965', u'1966', u'1967',
u'1968', u'1969', u'1970', u'1971', u'1972', u'1973', u'1974', u'1975',
u'1976', u'1977', u'1978', u'1979', u'1980', u'1981', u'1982', u'1983',
u'1984', u'1985', u'1986', u'1987', u'1988', u'1989', u'1990', u'1991',
u'1992', u'1993', u'1994', u'1995', u'1996', u'1997', u'1998', u'1999',
u'2000', u'2001', u'2002', u'2003', u'2004', u'2005', u'2006', u'2007',
u'2008', u'2009', u'2010', u'2011', u'2012', u'2013', u'2014', u'2015',
u'2016'],
dtype='object')
df_gdp[df_gdp["Country Code"] == "USA"]["1996"].values[0]
8100000000000.0
I am not sure if this is good practice, but I noticed I can also get just the value by casting the Series to float.
E.g.,
rate
3 0.042679
Name: Unemployment_rate, dtype: float64
float(rate)
0.0426789
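Be aware that newer pandas versions deprecate calling float() on a single-element Series; .item() (or .iloc[0]) is the future-proof spelling:

rate.item()   # 0.0426789; raises ValueError unless the Series has exactly one element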
I've run across this when using dataframes with MultiIndexes and found squeeze useful.
From the documentation:
Squeeze 1 dimensional axis objects into scalars.
Series or DataFrames with a single element are squeezed to a scalar.
DataFrames with a single column or a single row are squeezed to a
Series. Otherwise the object is unchanged.
# Example for a dataframe with MultiIndex
> import pandas as pd
> df = pd.DataFrame(
[
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
],
index=pd.MultiIndex.from_tuples( [('i', 1), ('ii', 2), ('iii', 3)] ),
columns=pd.MultiIndex.from_tuples( [('A', 'a'), ('B', 'b'), ('C', 'c')] )
)
> df
A B C
a b c
i 1 1 2 3
ii 2 4 5 6
iii 3 7 8 9
> df.loc['ii', 'B']
b
2 5
> df.loc['ii', 'B'].squeeze()
5
Note that while df.at[] also works (if you don't need conditionals), AFAIK you then still need to specify all levels of the MultiIndex.
Example:
> df.at[('ii', 2), ('B', 'b')]
5
I have a dataframe with a six-level index and two-level columns, so only having to specify the outer level is quite helpful.
For pandas 0.10, where iloc is unavailable, filter a DF and get the first-row data for the column VALUE:
df_filt = df[(df['C1'] == C1val) & (df['C2'] == C2val)]
result = df_filt.get_value(df_filt.index[0], 'VALUE')
If more than one row passes the filter, this obtains the first row's value. It will raise an exception if the filter results in an empty data frame.
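On modern pandas, where get_value has been removed, the same idea can be written as:

result = df_filt['VALUE'].iloc[0]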
Converting it to an integer worked for me (note this only works if the selection contains a single element):
int(sub_df.iloc[0])
Using .item() returns a scalar (not a Series), and it only works if there is a single element selected. It's much safer than .values[0] which will return the first element regardless of how many are selected.
>>> df = pd.DataFrame({'a': [1,2,2], 'b': [4,5,6]})
>>> df[df['a'] == 1]['a'] # Returns a Series
0 1
Name: a, dtype: int64
>>> df[df['a'] == 1]['a'].item()
1
>>> df2 = df[df['a'] == 2]
>>> df2['b']
1 5
2 6
Name: b, dtype: int64
>>> df2['b'].values[0]
5
>>> df2['b'].item()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3/dist-packages/pandas/core/base.py", line 331, in item
raise ValueError("can only convert an array of size 1 to a Python scalar")
ValueError: can only convert an array of size 1 to a Python scalar
To get the full row's value as JSON (instead of a Series):
row = df.iloc[0]
Use the to_json method like below:
row.to_json()
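A minimal sketch of what that returns (hypothetical tiny frame):

import pandas as pd
df = pd.DataFrame({'a': [1], 'b': [4]})
df.iloc[0].to_json()   # '{"a":1,"b":4}'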
I discovered some unexpected behavior when creating or assigning a new column in Pandas. When I filter or sort the pd.DataFrame (thus mixing up the indexes) and then create a new column from a pd.Series, Pandas reorders the series to map to the DataFrame index. For example:
df = pd.DataFrame({'a': ['alpha', 'beta', 'gamma']},
index=[2, 0, 1])
df['b'] = pd.Series(['alpha', 'beta', 'gamma'])
index  a      b
2      alpha  gamma
0      beta   alpha
1      gamma  beta
I think this is happening because the pd.Series has an index [0, 1, 2] which is getting mapped to the pd.DataFrame index. But I wanted to create the new column with values in the correct "order" ignoring index:
index  a      b
2      alpha  alpha
0      beta   beta
1      gamma  gamma
Here's a convoluted example showing how unexpected this behavior is:
df = pd.DataFrame({'num': [1, 2, 3]}, index=[2, 0, 1]) \
.assign(num_times_two=lambda x: pd.Series(list(x['num']*2)))
index  num  num_times_two
2      1    6
0      2    2
1      3    4
If I use any function that strips the index off the original pd.Series and then returns a new pd.Series, the values get out of order.
Is this a bug in Pandas or intentional behavior? Is there any way to force Pandas to ignore the index when I create a new column from a pd.Series?
If you don't want dtype conversions between pandas and numpy (for example, with datetimes), you can set the index of the Series to match the index of the DataFrame before assigning it to a column:
either with .set_axis()
The original Series will have its index preserved - by default this operation is not in place:
ser = pd.Series(['alpha', 'beta', 'gamma'])
df['b'] = ser.set_axis(df.index)
or you can change the index of the original Series:
ser.index = df.index  # ser = ser.set_axis(df.index) is an equivalent alternative (the inplace= option was later removed from set_axis)
df['b'] = ser
OR:
Use a numpy array instead of a Series. It doesn't have indices, so there is nothing to be aligned by.
Any Series can be converted to a numpy array with .to_numpy():
df['b'] = ser.to_numpy()
Any other array-like also can be used, for example, a list.
I don't know if it is intentional, but new-column assignment aligns on the index. Do you need to maintain the old indexes?
If the answer is no, you can simply reset the index before adding the new column (note that reset_index is not in place by default):
df = df.reset_index(drop=True)
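After that, the assignment lines up positionally (a short sketch with the df from the question):

df['b'] = pd.Series(['alpha', 'beta', 'gamma'])   # now aligns with the rows in order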
In your example, I don't see any reason to make it a new Series; anything that strips the index, like converting to a list, works:
df = pd.DataFrame({'num': [1, 2, 3]}, index=[2, 0, 1]) \
.assign(num_times_two=lambda x: list(x['num']*2))
print(df)
Output:
num num_times_two
2 1 2
0 2 4
1 3 6
I am stuck with my pandas script.
I am working with two CSV files (one input and one output file).
I want to take all the rows of two columns, do a calculation on them, and then copy the result into another dataframe (the output file).
The columns are as follows:
'lat', 'long', 'PHCount', 'latOffset_1', 'longOffset_1', 'PH_Lat_1', 'PH_Long_1', 'latOffset_2', 'longOffset_2', 'PH_Lat_2', 'PH_Long_2', 'latOffset_3', 'longOffset_3', 'PH_Lat_3', 'PH_Long_3', 'latOffset_4', 'longOffset_4', 'PH_Lat_4', 'PH_Long_4'
I want to take the columns 'lat' and 'latOffset_1', do some calculation, and put the result in a new column ('PH_Lat_1') which I have already created.
My function is:
def calculate_latoffset(latoffset): #Calculating Lat offset.
a=(df2['lat']-(2*latoffset))
return a
The main code:
for i in range(1,5):
print(i)
a='PH_lat_%d' % i
print (a)
b='latOffset_%d' % i
print (b)
df2.a = df2.apply(lambda x: calculate_latoffset(x[b]), axis=1)
Since the column names differ only by their suffix (1, 2, 3, 4), I want to call the function calculate_latoffset and fill all the rows of all the columns (PH_Lat_1, PH_Lat_2, PH_Lat_3, PH_Lat_4) in one go.
When using the above code, I am getting this error:
basic_conversion.py:46: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
df2.a = df2.apply(lambda x: calculate_latoffset(x[b]), axis=1)
Is it possible?
Please kindly help.
Simply use df2['a'] instead of df2.a.
This is a warning, not an error, so your code can still run through; it just probably doesn't do what you intend.
Short answer: To create a new column for a DataFrame, never use attribute access; the correct way is to use either [] or .loc indexing:
>>> df
a b
0 7 6
1 5 8
>>> df['c'] = df.a + df.b
>>> # OR
>>> df.loc[:, 'c'] = df.a + df.b
>>> df # c is a newly added column
a b c
0 7 6 13
1 5 8 13
More explanation: Series and DataFrame are the core classes and data structures in pandas, and of course they are Python classes too, so there are some minor distinctions involving attribute access between a pandas DataFrame and a normal Python object. But it's well documented and easily understood. Just a few points to note:
In Python, users may dynamically add data attributes of their own to an instance object using attribute access.
>>> class Dog(object):
... pass
>>> dog = Dog()
>>> vars(dog)
{}
>>> superdog = Dog()
>>> vars(superdog)
{}
>>> dog.legs = 'I can run.'
>>> superdog.wings = 'I can fly.'
>>> vars(dog)
{'legs': 'I can run.'}
>>> vars(superdog)
{'wings': 'I can fly.'}
In pandas, index and columns are closely tied to the data structure; you may access an index on a Series, or a column on a DataFrame, as an attribute.
>>> import pandas as pd
>>> import numpy as np
>>> data = np.random.randint(low=0, high=10, size=(2,2))
>>> df = pd.DataFrame(data, columns=['a', 'b'])
>>> df
a b
0 7 6
1 5 8
>>> vars(df)
{'_is_copy': None,
'_data': BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64,
'_item_cache': {}}
But pandas attribute access is mainly a convenience for reading from and modifying an existing element of a Series or column of a DataFrame.
>>> df.a
0 7
1 5
Name: a, dtype: int64
>>> df.b = [1, 1]
>>> df
a b
0 7 1
1 5 1
And the convenience is a tradeoff for full functionality. E.g., you can create a DataFrame object with column names ['space bar', '1', 'loc', 'min', 'index'], but you can't access them as attributes, because they are either not valid Python identifiers ('1', 'space bar') or conflict with an existing method name.
>>> data = np.random.randint(0, 10, size=(2, 5))
>>> df_special_col_names = pd.DataFrame(data, columns=['space bar', '1', 'loc', 'min', 'index'])
>>> df_special_col_names
space bar 1 loc min index
0 4 4 4 8 9
1 3 0 1 2 3
In these cases, the .loc, .iloc and [] indexing is the defined way to fully access and operate on the index and columns of Series and DataFrame objects.
>>> df_special_col_names['space bar']
0 4
1 3
Name: space bar, dtype: int64
>>> df_special_col_names.loc[:, 'min']
0 8
1 2
Name: min, dtype: int64
>>> df_special_col_names.iloc[:, 1]
0 4
1 0
Name: 1, dtype: int64
As to the topic: to create a new column for a DataFrame, as you can see, df.c = df.a + df.b just created a new attribute alongside the core data structure, so starting from version 0.21.0, this behavior raises a UserWarning (silent no more).
>>> df
a b
0 7 1
1 5 1
>>> df.c = df.a + df.b
__main__:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
>>> df['d'] = df.a + df.b
>>> df
a b d
0 7 1 8
1 5 1 6
>>> df.c
0 8
1 6
dtype: int64
>>> vars(df)
{'_is_copy': None,
'_data':
BlockManager
Items: Index(['a', 'b', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64
IntBlock: slice(2, 3, 1), 1 x 2, dtype: int64,
'_item_cache': {},
'c': 0 8
1 6
dtype: int64}
Finally, back to the Short answer.
The solution I can think of is to use .loc to get the column: try df.loc[:, a] instead of df.a.
Pandas DataFrame columns cannot be created using dot (attribute) notation, to avoid potential conflicts with the DataFrame's existing attributes. Hope this helps.
Although all the other answers are likely better solutions, I figured that it does no harm to just ignore the warning and move on.
import warnings
warnings.filterwarnings("ignore","Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access", UserWarning)
Using the code above, the script will simply disregard the warning and continue.
In df2.apply(lambda x: calculate_latoffset(x[b]), axis=1) you are creating a multi-column dataframe (calculate_latoffset returns a whole column for every row) while trying to assign it to a single field. Using df2[a] = calculate_latoffset(df2[b]) instead should deliver the desired output.
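For reference, a vectorized sketch of the whole loop (assuming the PH_Lat_… casing matches your actual columns; the question mixes PH_lat and PH_Lat):

for i in range(1, 5):
    a = 'PH_Lat_%d' % i
    b = 'latOffset_%d' % i
    df2[a] = df2['lat'] - 2 * df2[b]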
I have a Pandas series, for example like this
s = pandas.Series(data = [1,2,3], index = ['A', 'B', 'C'])
How can I change the order of the index, so that s becomes
B 2
A 1
C 3
I have tried
s['B','A','C']
but that will give me a key error. (In this particular example I could presumably take care of the order of the index while constructing the series, but I would like to have a way to do this after the series has been created.)
Use reindex:
In [52]:
s = s.reindex(index = ['B','A','C'])
s
Out[52]:
B 2
A 1
C 3
dtype: int64
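Selecting with a list of labels gives the same result:

s.loc[['B', 'A', 'C']]
# B    2
# A    1
# C    3
# dtype: int64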
I have a pandas dataframe plus a pandas series of identifiers, and would like to filter the rows from the dataframe that correspond to the identifiers in the series. To get the identifiers from the dataframe, I need to concatenate its first two columns. I have tried various things to filter, but none seem to work so far. Here is what I have tried:
1) I tried adding a column of booleans to the data frame, being true if that row corresponds to one of the identifiers, and false otherwise (hoping to be able to do filtering afterwards using the new column):
df["isInAcids"] = (df["AcNo"] + df["Sortcode"]) in acids
where
acids
is the series containing the identifiers.
However, this gives me a
TypeError: unhashable type
2) I tried filtering using the apply function:
df[df.apply(lambda x: x["AcNo"] + x["Sortcode"] in acids, axis = 1)]
This doesn't give me an error, but the length of the data frame remains unchanged, so it doesn't appear to filter anything.
3) I have added a new column, containing the concatenated strings/identifiers, and then try to filter afterwards (see Filter dataframe rows if value in column is in a set list of values):
df["ACIDS"] = df["AcNo"] + df["Sortcode"]
df[df["ACIDS"].isin(acids)]
But again, the dataframe doesn't change.
I hope this makes sense...
Any suggestions where I might be going wrong?
Thanks,
Anne
I think you're asking for something like the following:
In [1]: other_ids = pd.Series(['a', 'b', 'c', 'c'])
In [2]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'c', 'f']})
In [3]: df
Out[3]:
ids vals
0 a 1
1 b 2
2 c 3
3 f 4
In [4]: other_ids
Out[4]:
0 a
1 b
2 c
3 c
dtype: object
In this case, the series other_ids plays the role of your series acids. We want to select just those rows of df whose id is in the series other_ids. To do that we'll use the .isin() method on the ids column.
In [5]: df.ids.isin(other_ids)
Out[5]:
0 True
1 True
2 True
3 False
Name: ids, dtype: bool
This gives a column of bools that we can index into:
In [6]: df[df.ids.isin(other_ids)]
Out[6]:
ids vals
0 a 1
1 b 2
2 c 3
This is close to what you're doing with your 3rd attempt. Once you post a sample of your dataframe I can edit this answer, if it doesn't work already.
Reading a bit more, you may be having trouble because you have two columns in df that make up your ids? DataFrame didn't have an isin method at the time (newer pandas versions do), but we can get around that with something like:
In [26]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'f'],
'ids2': ['e', 'f', 'c', 'f']})
In [27]: df
Out[27]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
3 f f 4
In [28]: df.ids.isin(other_ids) + df.ids2.isin(other_ids)
Out[28]:
0 True
1 True
2 True
3 False
dtype: bool
True is like 1 and False is like 0, so adding the two boolean series from the two isin() calls gives something like an OR operation. Then, as before, we can index into this boolean series:
In [29]: new = df.loc[df.ids.isin(other_ids) + df.ids2.isin(other_ids)]
In [30]: new
Out[30]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
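A note on idiom: the usual boolean OR for masks is the | operator, which reads a bit more clearly than adding the series:

mask = df.ids.isin(other_ids) | df.ids2.isin(other_ids)
new = df.loc[mask]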
I have a pandas DataFrame created through a MySQL call which returns the data as object dtype.
The data is mostly numeric, with some 'na' values.
How can I cast the type of the dataFrame so the numeric values are appropriately typed (floats) and the 'na' values are represented as numpy NaN values?
Use the replace method on dataframes:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'k1': ['na'] * 3 + ['two'] * 4,
    'k2': [1, 'na', 2, 'na', 3, 4, 4]})
print(df)
df = df.replace('na', np.nan)
print(df)
I think it's helpful to point out that df.replace('na', np.nan) by itself won't work. You must assign it back to the existing dataframe.
df = df.convert_objects(convert_numeric=True) will work in most cases.
I should note that this copies the data. It would be preferable to get it to a numeric type on the initial read. If you post your code and a small example, someone might be able to help you with that.
This is what Tom suggested, and it is correct:
In [134]: s = pd.Series(['1','2.','na'])
In [135]: s.convert_objects(convert_numeric=True)
Out[135]:
0 1
1 2
2 NaN
dtype: float64
As Andy points out, this doesn't work directly (I think that's a bug), so convert all elements to strings first, then convert:
In [136]: s2 = pd.Series(['1','2.','na',5])
In [138]: s2.astype(str).convert_objects(convert_numeric=True)
Out[138]:
0 1
1 2
2 NaN
3 5
dtype: float64
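convert_objects has since been deprecated and removed; on modern pandas the equivalent (a sketch using the s2 above) is pd.to_numeric with errors='coerce':

pd.to_numeric(s2, errors='coerce')
# 0    1.0
# 1    2.0
# 2    NaN
# 3    5.0
# dtype: float64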