drop_duplicates doesn't work on multiple identical row instances - Python

I'm trying to concatenate two Pandas DataFrames and then drop the duplicates, but for some reason drop_duplicates doesn't work for most of the identical rows (only a few of them are dropped). For instance, these two rows are identical (at least to my eyes), yet they still show up: Identical rows here
This is the code I have tried. The result, with or without the subset argument, varies but still doesn't give me what I want; it tends to over-delete or under-delete as I play with the arguments (for instance, adding or removing columns):
bigdata = pd.concat([df_q, df_q_temp]).drop_duplicates(subset=['Date', 'Value'], keep='first').reset_index(drop=True)
Can anyone point me in the right direction?
Thanks

Expanding on my comment, here is a way to make the differences explicit and normalize your df to drop near-duplicates:
Part 1: show differences
def eq_nan(a, b):
    return (a == b) | ((a != a) & (b != b))  # treat NaN as equal
Let's try with some data:
import numpy as np
import pandas as pd

df = pd.DataFrame([
    ['foo\u00a0bar', 1, np.nan, None, 4.00000000001, pd.Timestamp('2021-01-01')],
    ['foo bar', 1, np.nan, None, 4, '2021-01-01 00:00:00']],
    columns=list('uvwxyz'),
)
df.loc[1, 'z'] = str(df.loc[1, 'z'])  # the constructor above coerced the second date (a str) into a Timestamp; convert it back to str
>>> df.dtypes
u object
v int64
w float64
x object
y float64
z object
dtype: object
>>> df.drop_duplicates()
u v w x y z
0 foo bar 1 NaN None 4.0 2021-01-01 00:00:00
1 foo bar 1 NaN None 4.0 2021-01-01 00:00:00
Find what elements among those two rows are different:
a = df.loc[0]
b = df.loc[1]
diff = ~eq_nan(a, b)
for (col, x), y in zip(a[diff].items(), b[diff]):
    print(f'{col}:\t{x!r} != {y!r}')
# output:
u: 'foo\xa0bar' != 'foo bar'
y: 4.00000000001 != 4.0
z: Timestamp('2021-01-01 00:00:00') != '2021-01-01 00:00:00'
Side note: alternatively, if you have cells containing complex types, e.g. list, dict, etc., you may use pytest (outside of testing) to get some nice verbose explanation of exactly how the values differ:
from _pytest.assertion.util import _compare_eq_verbose
for (col, x), y in zip(a[diff].items(), b[diff]):
    da, db = _compare_eq_verbose(x, y)
    print(f'{col}:\t{da} != {db}')
# Output:
u: +'foo\xa0bar' != -'foo bar'
y: +4.00000000001 != -4.0
z: +Timestamp('2021-01-01 00:00:00') != -'2021-01-01 00:00:00'
Part 2: example of normalization to help drop duplicates
We use Pandas' own Series formatter to convert each row into a string representation:
def normalize_row(r):
    vals = r.to_string(header=False, index=False, name=False).splitlines()
    vals = [
        ' '.join(s.strip().split())  # transform any whitespace (e.g. unicode non-breaking space) into ' '
        for s in vals
    ]
    return tuple(vals)  # a hashable tuple, so drop_duplicates can be applied to the result
Example for the first row above:
>>> normalize_row(df.iloc[0])
('foo bar', '1', 'NaN', 'NaN', '4.0', '2021-01-01 00:00:00')
Usage to drop visually identical duplicates:
newdf = df.loc[df.apply(normalize_row, axis=1).drop_duplicates().index]
>>> newdf
u v w x y z
0 foo bar 1 NaN None 4.0 2021-01-01 00:00:00
>>> newdf.dtypes
u object
v int64
w float64
x object
y float64
z object
dtype: object
Note: the rows that make it through this filter are copied exactly into newdf (not the normalized string tuples that were used for near-duplicate detection).
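Alternatively, once Part 1 has shown which kinds of invisible differences you have, a lighter per-column cleanup before drop_duplicates may be enough. The following is a minimal sketch under the assumption that the differences are whitespace variants and tiny float noise; the function name is illustrative and you should adapt it to whatever the diff actually shows:
import numpy as np

def normalize_for_dedup(df):
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            # collapse any whitespace (incl. non-breaking spaces) into single spaces
            out[col] = out[col].map(lambda v: ' '.join(v.split()) if isinstance(v, str) else v)
        elif np.issubdtype(out[col].dtype, np.floating):
            out[col] = out[col].round(6)  # tame tiny floating-point noise
    return out

# keep the original rows whose normalized form is unique
# bigdata = pd.concat([df_q, df_q_temp], ignore_index=True)
# bigdata = bigdata.loc[normalize_for_dedup(bigdata).drop_duplicates().index]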

Take care of string columns with no values.
Make sure that in the dataframe the cells without a value are read as None. Especially in object-typed columns, there may be cells containing only spaces, which are different from None even though there is actually no value there.
For example:
import pandas as pd
df = pd.DataFrame({'Col_1': ['one', ' ', 'two', None],
                   'Col_2': [1, 2, 3, 2],
                   'Col_3': ['one', None, 'two', ' ']})
df
Col_1 Col_2 Col_3
0 one 1 one
1 2 None
2 two 3 two
3 None 2
As you can see, rows 1 and 3 have no value in Col_1 and Col_3. But since two of those cells are None and the other two are spaces, the rows are considered different.
I had the same problem and struggled a lot with the code until I found this. I solved it by replacing the None values with spaces:
df = df.fillna(' ')
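With the example frame above, the effect on drop_duplicates looks like this (a minimal illustration):
import pandas as pd

df = pd.DataFrame({'Col_1': ['one', ' ', 'two', None],
                   'Col_2': [1, 2, 3, 2],
                   'Col_3': ['one', None, 'two', ' ']})
df = df.fillna(' ')          # None/NaN become ' ', matching the blank strings already present
print(df.drop_duplicates())  # row 3 is now recognized as a duplicate of row 1 and dropped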

Related

Is there a pandas aggregate function that combines features of 'any' and 'unique'?

I have a large dataset with similar data:
>>> df = pd.DataFrame(
... {'A': ['one', 'two', 'two', 'one', 'one', 'three'],
... 'B': ['a', 'b', 'c', 'a', 'a', np.nan]})
>>> df
A B
0 one a
1 two b
2 two c
3 one a
4 one a
5 three NaN
There are two aggregation functions 'any' and 'unique':
>>> df.groupby('A')['B'].any()
A
one True
three False
two True
Name: B, dtype: bool
>>> df.groupby('A')['B'].unique()
A
one [a]
three [nan]
two [b, c]
Name: B, dtype: object
but I want to get the following result (or something close to it):
A
one a
three False
two True
I can do it with some complex code, but I would prefer to find an appropriate function in an existing Python package, or the easiest way to solve the problem. I'd be grateful if you could help me with that.
You can aggregate with Series.nunique for one column, and with the unique values (after removing possible missing values) for the other column:
df1 = df.groupby('A').agg(count=('B', 'nunique'),
                          uniq_without_NaNs=('B', lambda x: x.dropna().unique()))
print (df1)
count uniq_without_NaNs
A
one 1 [a]
three 0 []
two 2 [b, c]
Then create a mask where count is greater than 1, and where count equals 1 replace the value with the single element of uniq_without_NaNs:
out = df1['count'].gt(1).mask(df1['count'].eq(1), df1['uniq_without_NaNs'].str[0])
print (out)
A
one a
three False
two True
Name: count, dtype: object
>>> g = df.groupby("A")["B"].agg
>>> nun = g("nunique")
>>> pd.Series(np.select([nun > 1, nun == 1],
...                     [True, g("unique").str[0]],
...                     default=False),
...           index=nun.index)
A
one a
three False
two True
dtype: object
get a hold of the group aggregator
count the number of uniques
if > 1, i.e., more than 1 unique, put True
if == 1, i.e., only 1 unique, put that unique value
else, i.e., no uniques (all NaNs), put False
You can combine groupby with agg and use boolean mask to choose the correct output:
# Your code
agg = df.groupby('A')['B'].agg(['any', 'unique'])
# Boolean mask to choose between 'any' and 'unique' column
m = agg['unique'].str.len().eq(1) & agg['unique'].str[0].notna()
# Final output
out = agg['any'].mask(m, other=agg['unique'].str[0])
Output:
>>> out
A
one a
three False
two True
>>> agg
any unique
A
one True [a]
three False [nan]
two True [b, c]
>>> m
A
one True # choose 'unique' column
three False # choose 'any' column
two False # choose 'any' column
new_df = df.groupby('A')['B'].apply(lambda x: x.notna().any())
new_df = new_df.reset_index()
new_df.columns = ['A', 'B']
this will give you:
A B
0 one True
1 three False
2 two True
now if we want to find the values we can do:
df.groupby('A')['B'].apply(lambda x: x[x.notna()].unique()[0] if x.notna().any() else np.nan)
which gives:
A
one a
three NaN
two b
The expression
series = df.groupby('A')['B'].agg(lambda x: pd.Series(x.unique()))
will give the following result:
one a
three NaN
two [b, c]
where a simple value can be identified by its type:
series[series.apply(type) == str]
I think it is easy enough to use, but it is probably not the optimal solution.

Replace list based on column condition [duplicate]

I have a list 'abc' and a dataframe 'df':
abc = ['foo', 'bar']
df =
A B
0 12 NaN
1 23 NaN
I want to insert the list into cell 1B, so I want this result:
A B
0 12 NaN
1 23 ['foo', 'bar']
How can I do that?
1) If I use this:
df.ix[1,'B'] = abc
I get the following error message:
ValueError: Must have equal len keys and value when setting with an iterable
because it tries to insert the list (which has two elements) into a row or column, not into a single cell.
2) If I use this:
df.ix[1,'B'] = [abc]
then it inserts a list that has only one element that is the 'abc' list ( [['foo', 'bar']] ).
3) If I use this:
df.ix[1,'B'] = ', '.join(abc)
then it inserts a string: ( foo, bar ) but not a list.
4) If I use this:
df.ix[1,'B'] = [', '.join(abc)]
then it inserts a list but it has only one element ( ['foo, bar'] ) but not two as I want ( ['foo', 'bar'] ).
Thanks for help!
EDIT
My new dataframe and the old list:
abc = ['foo', 'bar']
df2 =
A B C
0 12 NaN 'bla'
1 23 NaN 'bla bla'
Another dataframe:
df3 =
A B C D
0 12 NaN 'bla' ['item1', 'item2']
1 23 NaN 'bla bla' [11, 12, 13]
I want to insert the 'abc' list into df2.loc[1,'B'] and/or df3.loc[1,'B'].
If the dataframe's columns contain only integer values and/or NaN values and/or list values, then inserting a list into a cell works perfectly. The same holds if the columns contain only string values and/or NaN values and/or list values. But if the dataframe has columns with both integer and string values as well as other columns, then the error message appears when I use df2.loc[1,'B'] = abc or df3.loc[1,'B'] = abc.
Another dataframe:
df4 =
A B
0 'bla' NaN
1 'bla bla' NaN
These inserts work perfectly: df.loc[1,'B'] = abc or df4.loc[1,'B'] = abc.
Since set_value has been deprecated since version 0.21.0, you should now use at. It can insert a list into a cell without raising a ValueError as loc does. I think this is because at always refers to a single value, while loc can refer to values as well as rows and columns.
df = pd.DataFrame(data={'A': [1, 2, 3], 'B': ['x', 'y', 'z']})
df.at[1, 'B'] = ['m', 'n']
df =
A B
0 1 x
1 2 [m, n]
2 3 z
You also need to make sure the column you are inserting into has dtype=object. For example
>>> df = pd.DataFrame(data={'A': [1, 2, 3], 'B': [1,2,3]})
>>> df.dtypes
A int64
B int64
dtype: object
>>> df.at[1, 'B'] = [1, 2, 3]
ValueError: setting an array element with a sequence
>>> df['B'] = df['B'].astype('object')
>>> df.at[1, 'B'] = [1, 2, 3]
>>> df
A B
0 1 1
1 2 [1, 2, 3]
2 3 3
Pandas >= 0.21
set_value has been deprecated. You can now use DataFrame.at to set by label, and DataFrame.iat to set by integer position.
Setting Cell Values with at/iat
# Setup
>>> df = pd.DataFrame({'A': [12, 23], 'B': [['a', 'b'], ['c', 'd']]})
>>> df
A B
0 12 [a, b]
1 23 [c, d]
>>> df.dtypes
A int64
B object
dtype: object
If you want to set a value in second row of the "B" column to some new list, use DataFrame.at:
>>> df.at[1, 'B'] = ['m', 'n']
>>> df
A B
0 12 [a, b]
1 23 [m, n]
You can also set by integer position using DataFrame.iat
>>> df.iat[1, df.columns.get_loc('B')] = ['m', 'n']
>>> df
A B
0 12 [a, b]
1 23 [m, n]
What if I get ValueError: setting an array element with a sequence?
I'll try to reproduce this with:
>>> df
A B
0 12 NaN
1 23 NaN
>>> df.dtypes
A int64
B float64
dtype: object
>>> df.at[1, 'B'] = ['m', 'n']
# ValueError: setting an array element with a sequence.
This is because the column is of float64 dtype, whereas lists are objects, so there's a mismatch there. What you would have to do in this situation is convert the column to object first.
>>> df['B'] = df['B'].astype(object)
>>> df.dtypes
A int64
B object
dtype: object
Then, it works:
>>> df.at[1, 'B'] = ['m', 'n']
>>> df
A B
0 12 NaN
1 23 [m, n]
Possible, But Hacky
Even more wacky, I've found that you can hack through DataFrame.loc to achieve something similar if you pass nested lists.
>>> df.loc[1, 'B'] = [['m'], ['n'], ['o'], ['p']]
>>> df
A B
0 12 [a, b]
1 23 [m, n, o, p]
You can read more about why this works here.
df3.set_value(1, 'B', abc) works for any dataframe. Take care of the data type of column 'B': for example, a list cannot be inserted into a float column; in that case, df['B'] = df['B'].astype(object) can help.
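Note that set_value was deprecated in 0.21 and removed in pandas 1.0; a minimal sketch of the equivalent with .at (using the question's df3 and abc) would be:
import numpy as np
import pandas as pd

abc = ['foo', 'bar']
df3 = pd.DataFrame({'A': [12, 23], 'B': [np.nan, np.nan],
                    'C': ['bla', 'bla bla'],
                    'D': [['item1', 'item2'], [11, 12, 13]]})

df3['B'] = df3['B'].astype(object)  # a float column cannot hold a list
df3.at[1, 'B'] = abc                # .at replaces the removed set_value for single cells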
Quick work around
Simply enclose the list within a new list, as done for col2 in the data frame below. The reason this works is that Python takes the outer list (of lists) and converts it into a column as if it contained normal scalar items, which in our case happen to be lists rather than normal scalars.
mydict={'col1':[1,2,3],'col2':[[1, 4], [2, 5], [3, 6]]}
data=pd.DataFrame(mydict)
data
col1 col2
0 1 [1, 4]
1 2 [2, 5]
2 3 [3, 6]
I was also getting
ValueError: Must have equal len keys and value when setting with an iterable
Using .at rather than .loc did not make any difference in my case, but enforcing the datatype of the dataframe column did the trick:
df['B'] = df['B'].astype(object)
Then I could set lists, NumPy arrays and all sorts of things as single cell values in my dataframes.
As mentioned in this post, pandas: how to store a list in a dataframe?, the dtypes in the dataframe may influence the results, as may whether or not you assign the result back to the dataframe.
I've got a solution that's pretty simple to implement.
Make a temporary class just to wrap the list object and later call the value from the class.
Here's a practical example:
Let's say you want to insert list object into the dataframe.
df = pd.DataFrame([
    {'a': 1},
    {'a': 2},
    {'a': 3},
])

df.loc[:, 'b'] = [
    [1, 2, 4, 2],
    [1, 2],
    [4, 5, 6],
]  # This works, because the list has the same length as the rows of the dataframe

df.loc[:, 'c'] = [1, 2, 4, 5, 3]  # This does not work.
>>> ValueError: Must have equal len keys and value when setting with an iterable

## To force pandas to have a list as the value in each cell, wrap the list with a temporary class.
class Fake(object):
    def __init__(self, li_obj):
        self.obj = li_obj

df.loc[:, 'c'] = Fake([1, 2, 5, 3, 5, 7])  # This works.
df.c = df.c.apply(lambda x: x.obj)  # Now extract the value from the class. This works.
Creating a fake class to do this might look like a hassle, but it can have practical applications. For example, you can use it with apply when the return value is a list, as sketched below.
Pandas would normally refuse to insert a list into a cell, but with this method you can force the insert.
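Here is a minimal sketch of that apply use case, assuming you want each cell of a new column to hold a list built from the row's values (the column names are illustrative):
import pandas as pd

class Fake(object):
    def __init__(self, li_obj):
        self.obj = li_obj

df = pd.DataFrame({'a': [1, 2, 3]})
# wrap the list returned for each row so pandas stores one object per cell
df['b'] = df.apply(lambda row: Fake([row['a'], row['a'] * 2]), axis=1)
df['b'] = df['b'].apply(lambda w: w.obj)  # unwrap back to plain lists
print(df)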
I prefer .at and .loc. It is important to note that the target column needs dtype object, which can handle the list.
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'A': [0, 1, 2, 3],
    'B': np.array([np.nan]*3 + [[3, 33]], dtype=object),
})
print('df to start with:', df, '\ndtypes:', df.dtypes, sep='\n')
df.at[0, 'B'] = [0, 100]        # at assigns a single element
df.loc[1, 'B'] = [[ [1, 11] ]]  # loc expects 2d input
print('df modified:', df, '\ndtypes:', df.dtypes, sep='\n')
output
df to start with:
A B
0 0 NaN
1 1 NaN
2 2 NaN
3 3 [3, 33]
dtypes:
A int64
B object
dtype: object
df modified:
A B
0 0 [0, 100]
1 1 [[1, 11]]
2 2 NaN
3 3 [3, 33]
dtypes:
A int64
B object
dtype: object
First set the cell to blank, then use at to assign the abc list to the cell at 1, 'B':
abc = ['foo', 'bar']
df =pd.DataFrame({'A':[12,23],'B':[np.nan,np.nan]})
df.loc[1,'B']=''
df.at[1,'B']=abc
print(df)

UserWarning: Pandas doesn't allow columns to be created via a new attribute name

I am stuck with my pandas script.
I am working with two CSV files (one input and one output file).
I want to copy all the rows of two columns, do a calculation, and then copy the result to another dataframe (the output file).
The columns are as follows :
'lat', 'long','PHCount', 'latOffset_1', 'longOffset_1','PH_Lat_1', 'PH_Long_1', 'latOffset_2', 'longOffset_2', 'PH_Lat_2', 'PH_Long_2', 'latOffset_3', 'longOffset_3','PH_Lat_3', 'PH_Long_3', 'latOffset_4', 'longOffset_4','PH_Lat_4', 'PH_Long_4'.
I want to take the columns 'lat' and 'latOffset_1', do some calculation, and put the result in another new column ('PH_Lat_1') which I have already created.
My function is :
def calculate_latoffset(latoffset):  # Calculating Lat offset.
    a = (df2['lat'] - (2*latoffset))
    return a
The main code :
for i in range(1,5):
    print(i)
    a = 'PH_lat_%d' % i
    print(a)
    b = 'latOffset_%d' % i
    print(b)
    df2.a = df2.apply(lambda x: calculate_latoffset(x[b]), axis=1)
Since the column names differ only by the number (1, 2, 3, 4), I want to call calculate_latoffset and compute all the rows of all the columns (PH_Lat_1, PH_Lat_2, PH_Lat_3, PH_Lat_4) in one go.
When using the above code i am getting this error :
basic_conversion.py:46: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
df2.a = df2.apply(lambda x: calculate_latoffset(x[b]), axis=1)
Is it possible?
Please kindly help.
Simply use df2['a'] instead of df2.a.
This is a Warning, not an Error, so your code can still run through, but probably not according to your intention.
Short answer: To create a new column for DataFrame, never use attribute access, the correct way is to use either [] or .loc indexing:
>>> df
a b
0 7 6
1 5 8
>>> df['c'] = df.a + df.b
>>> # OR
>>> df.loc[:, 'c'] = df.a + df.b
>>> df # c is a newly added column
a b c
0 7 6 13
1 5 8 13
More explanation: Series and DataFrame are core classes and data structures in pandas, and of course they are Python classes too, so there are some minor distinctions in attribute access between a pandas DataFrame and a normal Python object. But it's well documented and can be easily understood. Just a few points to note:
In Python, users may dynamically add data attributes of their own to an instance object using attribute access.
>>> class Dog(object):
... pass
>>> dog = Dog()
>>> vars(dog)
{}
>>> superdog = Dog()
>>> vars(superdog)
{}
>>> dog.legs = 'I can run.'
>>> superdog.wings = 'I can fly.'
>>> vars(dog)
{'legs': 'I can run.'}
>>> vars(superdog)
{'wings': 'I can fly.'}
In pandas, index and column are closely related to the data structure, you may access an index on a Series, column on a DataFrame as an attribute.
>>> import pandas as pd
>>> import numpy as np
>>> data = np.random.randint(low=0, high=10, size=(2,2))
>>> df = pd.DataFrame(data, columns=['a', 'b'])
>>> df
a b
0 7 6
1 5 8
>>> vars(df)
{'_is_copy': None,
'_data': BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64,
'_item_cache': {}}
But pandas attribute access is mainly a convenience for reading from and modifying an existing element of a Series or an existing column of a DataFrame.
>>> df.a
0 7
1 5
Name: a, dtype: int64
>>> df.b = [1, 1]
>>> df
a b
0 7 1
1 5 1
And the convenience is a tradeoff for full functionality. E.g. you can create a DataFrame object with column names ['space bar', '1', 'loc', 'min', 'index'], but you can't access them as attributes, because they are either not valid Python identifiers ('1', 'space bar') or they conflict with an existing method name.
>>> data = np.random.randint(0, 10, size=(2, 5))
>>> df_special_col_names = pd.DataFrame(data, columns=['space bar', '1', 'loc', 'min', 'index'])
>>> df_special_col_names
space bar 1 loc min index
0 4 4 4 8 9
1 3 0 1 2 3
In these cases, .loc, .iloc and [] indexing are the defined ways to fully access/operate on the index and columns of Series and DataFrame objects.
>>> df_special_col_names['space bar']
0 4
1 3
Name: space bar, dtype: int64
>>> df_special_col_names.loc[:, 'min']
0 8
1 2
Name: min, dtype: int64
>>> df_special_col_names.iloc[:, 1]
0 4
1 0
Name: 1, dtype: int64
As to the topic: when creating a new column for a DataFrame, as you can see, df.c = df.a + df.b just creates a new attribute alongside the core data structure, so starting from version 0.21.0 this behavior raises a UserWarning (silent no more).
>>> df
a b
0 7 1
1 5 1
>>> df.c = df.a + df.b
__main__:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
>>> df['d'] = df.a + df.b
>>> df
a b d
0 7 1 8
1 5 1 6
>>> df.c
0 8
1 6
dtype: int64
>>> vars(df)
{'_is_copy': None,
'_data':
BlockManager
Items: Index(['a', 'b', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64
IntBlock: slice(2, 3, 1), 1 x 2, dtype: int64,
'_item_cache': {},
'c': 0 8
1 6
dtype: int64}
Finally, back to the Short answer.
The solution I can think of is to use .loc to get the column. You can try df.loc[:, a] instead of df.a.
Pandas DataFrame columns cannot be created using the dot notation, in order to avoid potential conflicts with the DataFrame's attributes. Hope this helps.
Although all the other answers are likely better solutions, I figured that it does no harm to just ignore the warning and move on.
import warnings
warnings.filterwarnings("ignore","Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access", UserWarning)
using the code above, the script will just disregard the warning and move on.
In df2.apply(lambda x: calculate_latoffset(x[b]), axis=1) you are creating a 5-column dataframe and trying to assign it to a single field. Doing df2[a] = calculate_latoffset(df2[b]) instead should deliver the desired output.
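Applied to the asker's loop, a minimal sketch of that fix might look like this (the 'PH_Lat_%d' capitalization is assumed to match the actual column names in the file):
def calculate_latoffset(latoffset):  # Calculating Lat offset.
    return df2['lat'] - (2 * latoffset)

for i in range(1, 5):
    a = 'PH_Lat_%d' % i      # target column
    b = 'latOffset_%d' % i   # source column
    df2[a] = calculate_latoffset(df2[b])  # [] indexing avoids the attribute-name warning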

Passing row and column name to get value [duplicate]

I have constructed a condition that extracts exactly one row from my data frame:
d2 = df[(df['l_ext']==l_ext) & (df['item']==item) & (df['wn']==wn) & (df['wd']==1)]
Now I would like to take a value from a particular column:
val = d2['col_name']
But as a result, I get a data frame that contains one row and one column (i.e., one cell). It is not what I need. I need one value (one float number). How can I do it in pandas?
If you have a DataFrame with only one row, then access the first (only) row as a Series using iloc, and then the value using the column name:
In [3]: sub_df
Out[3]:
A B
2 -0.133653 -0.030854
In [4]: sub_df.iloc[0]
Out[4]:
A -0.133653
B -0.030854
Name: 2, dtype: float64
In [5]: sub_df.iloc[0]['A']
Out[5]: -0.13365288513107493
These are fast access methods for scalars:
In [15]: df = pandas.DataFrame(numpy.random.randn(5, 3), columns=list('ABC'))
In [16]: df
Out[16]:
A B C
0 -0.074172 -0.090626 0.038272
1 -0.128545 0.762088 -0.714816
2 0.201498 -0.734963 0.558397
3 1.563307 -1.186415 0.848246
4 0.205171 0.962514 0.037709
In [17]: df.iat[0, 0]
Out[17]: -0.074171888537611502
In [18]: df.at[0, 'A']
Out[18]: -0.074171888537611502
You can turn your 1x1 dataframe into a NumPy array, then access the first and only value of that array:
val = d2['col_name'].values[0]
Most answers are using iloc which is good for selection by position.
If you need selection-by-label, loc would be more convenient.
For getting a value explicitly (equiv to deprecated
df.get_value('a','A'))
# This is also equivalent to df1.at['a','A']
In [55]: df1.loc['a', 'A']
Out[55]: 0.13200317033032932
It doesn't need to be complicated:
val = df.loc[df.wd==1, 'col_name'].values[0]
I needed the value of one cell, selected by column and index names.
This solution worked for me:
original_conversion_frequency.loc[1,:].values[0]
It looks like behaviour changed after pandas 0.10.1 / 0.13.1.
I upgraded from 0.10.1 to 0.13.1; before, iloc was not available.
Now with 0.13.1, iloc[0]['label'] gets a single-value array rather than a scalar.
Like this:
lastprice = stock.iloc[-1]['Close']
Output:
date
2014-02-26 118.2
Name: Close, dtype: float64
The quickest and easiest options I have found are the following. 501 represents the row index.
df.at[501, 'column_name']
df.get_value(501, 'column_name')
In later versions, you can fix it by simply doing:
val = float(d2['col_name'].iloc[0])
df_gdp.columns
Index([u'Country', u'Country Code', u'Indicator Name', u'Indicator Code',
u'1960', u'1961', u'1962', u'1963', u'1964', u'1965', u'1966', u'1967',
u'1968', u'1969', u'1970', u'1971', u'1972', u'1973', u'1974', u'1975',
u'1976', u'1977', u'1978', u'1979', u'1980', u'1981', u'1982', u'1983',
u'1984', u'1985', u'1986', u'1987', u'1988', u'1989', u'1990', u'1991',
u'1992', u'1993', u'1994', u'1995', u'1996', u'1997', u'1998', u'1999',
u'2000', u'2001', u'2002', u'2003', u'2004', u'2005', u'2006', u'2007',
u'2008', u'2009', u'2010', u'2011', u'2012', u'2013', u'2014', u'2015',
u'2016'],
dtype='object')
df_gdp[df_gdp["Country Code"] == "USA"]["1996"].values[0]
8100000000000.0
I am not sure if this is good practice, but I noticed I can also get just the value by casting the Series to float.
E.g.,
rate
3 0.042679
Name: Unemployment_rate, dtype: float64
float(rate)
0.0426789
I've run across this when using dataframes with MultiIndexes and found squeeze useful.
From the documentation:
Squeeze 1 dimensional axis objects into scalars.
Series or DataFrames with a single element are squeezed to a scalar.
DataFrames with a single column or a single row are squeezed to a
Series. Otherwise the object is unchanged.
# Example for a dataframe with MultiIndex
> import pandas as pd
> df = pd.DataFrame(
[
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
],
index=pd.MultiIndex.from_tuples( [('i', 1), ('ii', 2), ('iii', 3)] ),
columns=pd.MultiIndex.from_tuples( [('A', 'a'), ('B', 'b'), ('C', 'c')] )
)
> df
A B C
a b c
i 1 1 2 3
ii 2 4 5 6
iii 3 7 8 9
> df.loc['ii', 'B']
b
2 5
> df.loc['ii', 'B'].squeeze()
5
Note that while df.at[] also works (if you don't need to use conditionals), you then still, AFAIK, need to specify all levels of the MultiIndex.
Example:
> df.at[('ii', 2), ('B', 'b')]
5
I have a dataframe with a six-level index and two-level columns, so only having to specify the outer level is quite helpful.
For pandas 0.10, where iloc is unavailable, filter a DF and get the first row data for the column VALUE:
df_filt = df[(df['C1'] == C1val) & (df['C2'] == C2val)]
result = df_filt.get_value(df_filt.index[0],'VALUE')
If there is more than one row filtered, obtain the first row value. There will be an exception if the filter results in an empty data frame.
Converting it to integer worked for me:
int(sub_df.iloc[0])
Using .item() returns a scalar (not a Series), and it only works if there is a single element selected. It's much safer than .values[0] which will return the first element regardless of how many are selected.
>>> df = pd.DataFrame({'a': [1,2,2], 'b': [4,5,6]})
>>> df[df['a'] == 1]['a'] # Returns a Series
0 1
Name: a, dtype: int64
>>> df[df['a'] == 1]['a'].item()
1
>>> df2 = df[df['a'] == 2]
>>> df2['b']
1 5
2 6
Name: b, dtype: int64
>>> df2['b'].values[0]
5
>>> df2['b'].item()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3/dist-packages/pandas/core/base.py", line 331, in item
raise ValueError("can only convert an array of size 1 to a Python scalar")
ValueError: can only convert an array of size 1 to a Python scalar
To get the full row's value as JSON (instead of a Series):
row = df.iloc[0]
Use the to_json method like below:
row.to_json()
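For instance, with a small illustrative frame:
import pandas as pd

df = pd.DataFrame({'A': [12, 23], 'B': ['x', 'y']})
row = df.iloc[0]
print(row.to_json())  # {"A":12,"B":"x"}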

Convert all numeric columns of dataframe to absolute value

I want to convert all numeric columns in a dataframe to their absolute values and am doing this:
df = df.abs()
However, it gives the error:
*** TypeError: bad operand type for abs(): 'unicode'
How can I fix this? I would really prefer not to have to specify the column names manually.
You could use np.issubdtype to check whether the dtype of each column is np.number or not, with apply. Using @Amy Tavory's example:
df = pd.DataFrame({'a': ['-1', '2'], 'b': [-1, 2]})
res = df.apply(lambda x: x.abs() if np.issubdtype(x.dtype, np.number) else x)
In [14]: res
Out[14]:
a b
0 -1 1
1 2 2
Or you could use np.dtype.kind to check whether your dtype is numeric:
res1 = df.apply(lambda x: x.abs() if x.dtype.kind in 'iufc' else x)
In [20]: res1
Out[20]:
a b
0 -1 1
1 2 2
Note: You may be also interested in NumPy dtype hierarchy
Borrowing from an answer to this question, how about selecting the columns that are numeric?
Say you start with
df = pd.DataFrame({'a': ['-1', '2'], 'b': [-1, 2]})
>>> df
a b
0 -1 -1
1 2 2
Then just do
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
for c in [c for c in df.columns if df[c].dtype in numerics]:
    df[c] = df[c].abs()
>>> df
a b
0 -1 1
1 2 2
Faster than the existing answers and more to the point:
df.update(df.select_dtypes(include=[np.number]).abs())
(Careful: I noticed that the update sometimes doesn't do anything when df has a non-trivial multi-index. I'll update this answer once I figure out where the problem is. This definitely works fine for trivial range-indices)
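A minimal usage sketch with the small frame from the earlier answer (string column 'a', numeric column 'b'):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['-1', '2'], 'b': [-1, 2]})
df.update(df.select_dtypes(include=[np.number]).abs())  # only the numeric column is touched
print(df)  # 'a' keeps its strings; 'b' becomes 1 and 2 (update may upcast it to float)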
If you know the columns you want to change to absolute value use this:
df.iloc[:,2:7] = df.iloc[:,2:7].abs()
which changes all values from the third to the seventh column (inclusive) to their absolute values.
If you don't, you can create a list of the column names whose values are not objects:
col_list = [col for col in df.columns if df[col].dtype != object]
Then use .loc instead
df.loc[:,col_list] = df.loc[:,col_list].abs()
I know it is wordy but I think it avoids the slow nature of apply or lambda
