I have a Series, like this:
series = pd.Series({'a': 1, 'b': 2, 'c': 3})
I want to convert it to a dataframe like this:
a b c
0 1 2 3
pd.Series.to_frame() doesn't work; it gives a result like:
0
a 1
b 2
c 3
How can I construct a DataFrame from Series, with index of Series as columns?
You can also try this:
df = pd.DataFrame(series).transpose()
Using the transpose() function, you can interchange the index and the columns.
The output looks like this:
a b c
0 1 2 3
You don't need the transposition step; just wrap your Series inside a list and pass it to the DataFrame constructor:
pd.DataFrame([series])
a b c
0 1 2 3
Alternatively, call Series.to_frame, then transpose using the shortcut .T:
series.to_frame().T
a b c
0 1 2 3
You can also try this:
a = pd.Series.to_frame(series)
a['id'] = list(a.index)
Explanation:
The 1st line converts the series into a single-column DataFrame.
The 2nd line adds a column to this DataFrame with the same values as the index.
Try reset_index. It will convert your index into a column in your dataframe.
df = series.to_frame().reset_index()
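That long two-column frame is not yet the wide one-row layout asked for, so, as a hedged follow-up (not part of the answer above), one extra pivot-style step is still needed:
import pandas as pd

series = pd.Series({'a': 1, 'b': 2, 'c': 3})
long_df = series.to_frame().reset_index()   # columns: 'index' and 0
wide_df = long_df.set_index('index').T.reset_index(drop=True).rename_axis(None, axis=1)
print(wide_df)
#    a  b  c
# 0  1  2  3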
This
pd.DataFrame([series]) #method 1
produces a slightly different result than
series.to_frame().T #method 2
With method 1, the elements in the resulting DataFrame retain the same type, e.g. an int64 in the series will be kept as an int64.
With method 2, the elements in the resulting DataFrame become objects IF there is an object-type element anywhere in the series, e.g. an int64 in the series will become an object type.
This difference may cause different behaviors in your subsequent operations depending on the version of pandas.
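A minimal check of the difference described above, using a made-up mixed-type Series (one string alongside integers, so the series' own dtype is already object):
import pandas as pd

s = pd.Series({'a': 1, 'b': 2, 'c': 'x'})   # mixed types -> object dtype
df1 = pd.DataFrame([s])   # method 1
df2 = s.to_frame().T      # method 2
print(df1.dtypes)  # per the answer above, a and b should stay int64, only c is object
print(df2.dtypes)  # per the answer above, every column comes back as object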
I have 5 columns in a DataFrame, called 'A', 'B', 'C', 'D', 'E'. I want to filter the DataFrame to rows where the values of columns 'A', 'C' and 'E' are equal.
I have done the following:
OutputDF = DF[DF['A']==DF['C']==DF['E']]
It's giving an error as follows:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
You can compare all the listed columns against the first column of the list with DataFrame.eq, then test whether all values per row are True with DataFrame.all:
print (df)
A B C D E
0 1 2 3 4 5
1 1 2 1 4 1
2 2 2 2 4 2
L = ['A','C','E']
df = df[df[L].eq(df[L[0]], axis=0).all(axis=1)]
print (df)
A B C D E
1 1 2 1 4 1
2 2 2 2 4 2
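An equivalent one-liner (an alternative sketch, not from the answer above) uses the same df and L: it counts the distinct values per row across the listed columns and keeps rows where that count is 1:
df[df[L].nunique(axis=1).eq(1)]   # same rows 1 and 2 as the .eq/.all approach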
To address why this happens:
import pandas as pd
DF = pd.DataFrame({"A": [1, 2],
                   "C": [1, 2],
                   "E": [1, 2],
                   })
OutputDF = DF[DF['A']==DF['C']==DF['E']]
#ValueError: The truth value of a Series is ambiguous.
The issue is that, due to how operator chaining works, DF['A']==DF['C']==DF['E'] is being interpreted as
DF['A']==DF['C'] and DF['C']==DF['E']
Essentially, we are attempting a boolean and between two Series, and that is why we see the error. A Series holds multiple values, while and expects a single truth value on each side of the operator, so there is ambiguity about how to reduce the Series on either side to a single value.
If you wanted to write the condition correctly, you could use bitwise and (&) instead as follows (the brackets are important with bitwise &):
OutputDF = DF[(DF['A']==DF['C']) & (DF['C']==DF['E'])]
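A quick check of the corrected condition on a small made-up frame (the .eq/.all answer above generalizes this to any list of columns):
import pandas as pd

DF = pd.DataFrame({"A": [1, 2], "C": [1, 2], "E": [1, 3]})
OutputDF = DF[(DF['A'] == DF['C']) & (DF['C'] == DF['E'])]
print(OutputDF)  # keeps only row 0, where A, C and E all match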
I am using pandas and I have a column that has numbers but when I check for datatype I get the column is an object. I think one of the rows in that column is actually a string. How can I find out which row is the string? For example:
Name A B
John 0 1
Rich 1 0
Jim O 1
Jim has the letter "O" instead of zero in column A. What can I use in pandas to find which row has the string instead of the number if I have thousands of rows? In this example I used the letter O, but it could be any letter, really.
The dtype object means that the column holds generic Python-typed values.
Those values can be any type Python knows—an int, a str, a list of sets of some custom namedtuple type that you created, whatever.
And you can just call normal Python functions or methods on those objects (e.g., by accessing them directly, or via Pandas' apply) the same way you do with any other Python variables.
And that includes the type function, the isinstance function, etc.:
>>> df = pd.DataFrame({'A': [0, 1, 'O'], 'B': [1, 0, 1]})
>>> df.A
0 0
1 1
2 O
Name: A, dtype: object
>>> df.A.apply(type)
0 <class 'int'>
1 <class 'int'>
2 <class 'str'>
Name: A, dtype: object
>>> df.A.apply(lambda x: isinstance(x, str))
0 False
1 False
2 True
Name: A, dtype: bool
>>> df.A.apply(repr)
0 0
1 1
2 'O'
Name: A, dtype: object
… and so on.
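Putting the isinstance check to work, a short (hedged) follow-up using the df built above pulls out just the offending rows:
bad_rows = df[df.A.apply(lambda x: isinstance(x, str))]
print(bad_rows)
#    A  B
# 2  O  1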
You can use pandas.to_numeric to see what doesn't get converted to a number. Then with .isnull() you can subset your original df to see exactly which rows are the problematic ones.
import pandas as pd
df[pd.to_numeric(df.A, errors='coerce').isnull()]
# Name A B
#2 Jim O 1
If you're not sure which column is problematic, you could do something like this (assuming you want to check everything other than the 1st name column):
df2 = pd.DataFrame()
for col in df.columns[1::]:
    df2[col] = pd.to_numeric(df[col], errors='coerce')
df[df2.isnull().sum(axis=1).astype(bool)]
# Name A B
#2 Jim O 1
I'd like to add another very short and concise solution, which is a combination of ALollz's and abarnert's approaches.
First, find all columns of type object with cols = (df.dtypes == 'object').nonzero()[0]. Then select those columns with iloc and apply pd.to_numeric (skipping the name column with the slice cols[1:]). Finally, check for NA values and, where any are present row-wise (.any(1)), return the index of that row.
Full example:
import pandas as pd
data = '''\
Name A B C
John 0 1 O
Rich 1 0 2
Jim O 1 O'''
df = pd.read_csv(pd.compat.StringIO(data), sep='\s+')
cols = (df.dtypes == 'object').nonzero()[0]
rows = df.iloc[:,cols[1:]].apply(pd.to_numeric, errors='coerce').isna().any(1).nonzero()[0]
print(rows)
Returns:
[0 2] # <-- Means that row 0 and 2 contain N/A-values in at least 1 column
This answers your question (what can I use in pandas to find which row has the string instead of the number), but for all columns, treating values as strings when they can't be converted to numbers with pd.to_numeric.
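On recent pandas versions pd.compat.StringIO and Series.nonzero() are no longer available, so here is a hedged re-run of the same idea using io.StringIO and numpy.flatnonzero (a sketch, not the original answer's code):
import io
import numpy as np
import pandas as pd

data = '''\
Name A B C
John 0 1 O
Rich 1 0 2
Jim O 1 O'''

df = pd.read_csv(io.StringIO(data), sep=r'\s+')

cols = np.flatnonzero(df.dtypes == 'object')[1:]   # object columns, skipping Name
bad = df.iloc[:, cols].apply(pd.to_numeric, errors='coerce').isna().any(axis=1)
print(np.flatnonzero(bad))   # [0 2] -> rows 0 and 2 hold non-numeric values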
types = list(df['A'].apply(lambda x: type(x)))
names = list(df['Name'])
d = dict(zip(names, types))
This will give you a dictionary of {name:type} so you know which name has a string value in column A. Alternatively, if you just want to find the row the string is on, use this:
types = list(df['A'].apply(lambda x: type(x)))
rows = df.index.tolist()
d = dict(zip(rows, types))
# to get only the rows that have string values in column A
d = {k:v for k,v in d.items() if v == str}
For programming purpose, I want .iloc to consistently return a data frame, even when the resulting data frame has only one row. How to accomplish this?
Currently, .iloc returns a Series when the result only has one row. Example:
In [1]: df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
In [2]: df
Out[2]:
a b
0 1 3
1 2 4
In [3]: type(df.iloc[0, :])
Out[3]: pandas.core.series.Series
This behavior is poor for 2 reasons:
- Depending on the number of chosen rows, .iloc can return either a Series or a DataFrame, forcing me to manually check for this in my code
- .loc, on the other hand, always returns a DataFrame, making pandas inconsistent within itself (wrong info, as pointed out in the comment)
For R users, this can be accomplished with drop = FALSE, or by using tidyverse's tibble, which always returns a data frame by default.
Use double brackets,
df.iloc[[0]]
Output:
a b
0 1 3
print(type(df.iloc[[0]]))
<class 'pandas.core.frame.DataFrame'>
Short for df.iloc[[0],:]
Accessing row(s) by label: loc
# Setup
df = pd.DataFrame({'X': [1, 2, 3], 'Y':[4, 5, 6]}, index=['a', 'b', 'c'])
df
X Y
a 1 4
b 2 5
c 3 6
To get a DataFrame instead of a Series, pass a list of indices of length 1,
df.loc[['a']]
# Same as
df.loc[['a'], :] # selects all columns
X Y
a 1 4
To select multiple specific rows, use
df.loc[['a', 'c']]
X Y
a 1 4
c 3 6
To select a contiguous range of rows, use
df.loc['b':'c']
X Y
b 2 5
c 3 6
Accessing row(s) by position: iloc
Specify a list of indices of length 1,
i = 1
df.iloc[[i]]
X Y
b 2 5
Or, specify a slice of length 1:
df.iloc[i:i+1]
X Y
b 2 5
To select multiple rows or a contiguous slice you'd use a similar syntax as with loc.
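For instance, a brief sketch mirroring the loc examples, still using the three-row setup df from above:
df.iloc[[0, 2]]   # rows at positions 0 and 2 -> DataFrame
df.iloc[0:2]      # rows at positions 0 and 1 -> DataFrame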
The double-bracket approach doesn't always work for me (e.g. when I use a conditional to select a timestamped row with loc).
You can, however, just add to_frame() to your operation.
>>> df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
>>> df2 = df.iloc[0, :].to_frame().transpose()
>>> type(df2)
<class 'pandas.core.frame.DataFrame'>
Please use one of the options below:
df1 = df.iloc[[0],:]
#type(df1)
df1
or
df1 = df.iloc[0:1,:]
#type(df1)
df1
To extract a single row from a DataFrame as a DataFrame, use:
df_name.iloc[index, :].to_frame().transpose()
A positional slice also keeps the DataFrame type:
single_Sample1 = df.iloc[7:10]
single_Sample1
Consider the following dataframe:
columns = ['A', 'B', 'C', 'D']
records = [
['foo', 'one', 0.162003, 0.087469],
['bar', 'one', -1.156319, -1.5262719999999999],
['foo', 'two', 0.833892, -1.666304],
['bar', 'three', -2.026673, -0.32205700000000004],
['foo', 'two', 0.41145200000000004, -0.9543709999999999],
['bar', 'two', 0.765878, -0.095968],
['foo', 'one', -0.65489, 0.678091],
['foo', 'three', -1.789842, -1.130922]
]
df = pd.DataFrame.from_records(records, columns=columns)
"""
A B C D
0 foo one 0.162003 0.087469
1 bar one -1.156319 -1.526272
2 foo two 0.833892 -1.666304
3 bar three -2.026673 -0.322057
4 foo two 0.411452 -0.954371
5 bar two 0.765878 -0.095968
6 foo one -0.654890 0.678091
7 foo three -1.789842 -1.130922
"""
The following commands work:
df.groupby('A').apply(lambda x: (x['C'] - x['D']))
df.groupby('A').apply(lambda x: (x['C'] - x['D']).mean())
but none of the following work:
df.groupby('A').transform(lambda x: (x['C'] - x['D']))
# KeyError or ValueError: could not broadcast input array from shape (5) into shape (5,3)
df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
# KeyError or TypeError: cannot concatenate a non-NDFrame object
Why? The example in the documentation seems to suggest that calling transform on a group allows one to do row-wise operations:
# Note that the following suggests row-wise operation (x.mean is the column mean)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)
In other words, I thought that transform is essentially a specific type of apply (the one that does not aggregate). Where am I wrong?
For reference, below is the construction of the original dataframe above:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': randn(8), 'D': randn(8)})
Two major differences between apply and transform
There are two major differences between the transform and apply groupby methods.
Input:
apply implicitly passes all the columns for each group as a DataFrame to the custom function, while transform passes each column for each group individually as a Series to the custom function.
Output:
The custom function passed to apply can return a scalar, or a Series or DataFrame (or numpy array or even list).
The custom function passed to transform must return a sequence (a one dimensional Series, array or list) the same length as the group.
So, transform works on just one Series at a time and apply works on the entire DataFrame at once.
Inspecting the custom function
It can help quite a bit to inspect the input to your custom function passed to apply or transform.
Examples
Let's create some sample data and inspect the groups so that you can see what I am talking about:
import pandas as pd
import numpy as np
df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})
State a b
0 Texas 4 6
1 Texas 5 10
2 Florida 1 3
3 Florida 3 11
Let's create a simple custom function that prints out the type of the implicitly passed object and then raises an exception so that execution can be stopped.
def inspect(x):
    print(type(x))
    raise
Now let's pass this function to both the groupby apply and transform methods to see what object is passed to it:
df.groupby('State').apply(inspect)
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
RuntimeError
As you can see, a DataFrame is passed into the inspect function. You might be wondering why the type, DataFrame, got printed out twice. Pandas runs the first group twice. It does this to determine if there is a fast way to complete the computation or not. This is a minor detail that you shouldn't worry about.
Now, let's do the same thing with transform
df.groupby('State').transform(inspect)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
RuntimeError
It is passed a Series - a totally different Pandas object.
So, transform is only allowed to work with a single Series at a time. It is impossible for it to act on two columns at the same time. So, if we try and subtract column a from b inside of our custom function we would get an error with transform. See below:
def subtract_two(x):
    return x['a'] - x['b']
df.groupby('State').transform(subtract_two)
KeyError: ('a', 'occurred at index a')
We get a KeyError as pandas is attempting to find the Series index a which does not exist. You can complete this operation with apply as it has the entire DataFrame:
df.groupby('State').apply(subtract_two)
State
Florida 2 -2
3 -8
Texas 0 -2
1 -5
dtype: int64
The output is a Series and a little confusing as the original index is kept, but we have access to all columns.
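If that two-level index gets in the way, one hedged cleanup (using the same df and subtract_two as above) is to drop the group level so the result lines up with the original row index again:
flat = df.groupby('State').apply(subtract_two).reset_index(level=0, drop=True).sort_index()
print(flat)
# 0   -2
# 1   -5
# 2   -2
# 3   -8
# dtype: int64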
Displaying the passed pandas object
It can help even more to display the entire pandas object within the custom function, so you can see exactly what you are operating on. You can use print statements, but I like to use the display function from the IPython.display module so that the DataFrames get nicely rendered as HTML in a Jupyter notebook:
from IPython.display import display
def subtract_two(x):
    display(x)
    return x['a'] - x['b']
Transform must return a single dimensional sequence the same size as the group
The other difference is that transform must return a single dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not then an error is raised:
def return_three(x):
    return np.array([1, 2, 3])
df.groupby('State').transform(return_three)
ValueError: transform must return a scalar value for each group
The error message is not really descriptive of the problem. You must return a sequence the same length as the group. So, a function like this would work:
def rand_group_len(x):
    return np.random.rand(len(x))
df.groupby('State').transform(rand_group_len)
a b
0 0.962070 0.151440
1 0.440956 0.782176
2 0.642218 0.483257
3 0.056047 0.238208
Returning a single scalar object also works for transform
If you return just a single scalar from your custom function, then transform will use it for each of the rows in the group:
def group_sum(x):
    return x.sum()
df.groupby('State').transform(group_sum)
a b
0 9 16
1 9 16
2 4 14
3 4 14
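As a hedged aside tying this back to the original question: transform cannot see two columns at once, but you can build the difference as a standalone Series first and then transform that single Series, grouping it by the key column (illustrated here with this answer's State/a/b frame):
diff_mean = (df['a'] - df['b']).groupby(df['State']).transform('mean')
print(diff_mean)
# 0   -3.5
# 1   -3.5
# 2   -5.0
# 3   -5.0
# dtype: float64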
As I felt similarly confused about the .transform operation vs. .apply, I found a few answers shedding some light on the issue. This answer, for example, was very helpful.
My takeaway so far is that .transform works with (or deals with) Series (columns) in isolation from each other. What this means is that in your last two calls:
df.groupby('A').transform(lambda x: (x['C'] - x['D']))
df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
You asked .transform to take values from two columns, but 'it' does not actually 'see' both of them at the same time (so to speak). transform looks at the DataFrame columns one by one and returns a Series (or group of Series) 'made' of scalars which are repeated len(input_column) times.
So this scalar, which .transform uses to make the Series, is the result of some reduction function applied to an input Series (and only to ONE Series/column at a time).
Consider this example (on your dataframe):
zscore = lambda x: (x - x.mean()) / x.std() # Note that it does not reference anything outside of 'x' and for transform 'x' is one column.
df.groupby('A').transform(zscore)
will yield:
C D
0 0.989 0.128
1 -0.478 0.489
2 0.889 -0.589
3 -0.671 -1.150
4 0.034 -0.285
5 1.149 0.662
6 -1.404 -0.907
7 -0.509 1.653
Which is exactly the same as if you used it on only one column at a time:
df.groupby('A')['C'].transform(zscore)
yielding:
0 0.989
1 -0.478
2 0.889
3 -0.671
4 0.034
5 1.149
6 -1.404
7 -0.509
Note that .apply in the last example (df.groupby('A')['C'].apply(zscore)) would work in exactly the same way, but it would fail if you tried using it on a dataframe:
df.groupby('A').apply(zscore)
gives error:
ValueError: operands could not be broadcast together with shapes (6,) (2,)
So where else is .transform useful? The simplest case is trying to assign results of reduction function back to original dataframe.
df['sum_C'] = df.groupby('A')['C'].transform(sum)
df.sort_values('A')  # to clearly see that the scalar ('sum') applies to the whole column of the group
yielding:
A B C D sum_C
1 bar one 1.998 0.593 3.973
3 bar three 1.287 -0.639 3.973
5 bar two 0.687 -1.027 3.973
4 foo two 0.205 1.274 4.373
2 foo two 0.128 0.924 4.373
6 foo one 2.113 -0.516 4.373
7 foo three 0.657 -1.179 4.373
0 foo one 1.270 0.201 4.373
Trying the same with .apply would give NaNs in sum_C, because .apply returns a reduced Series, which it does not know how to broadcast back:
df.groupby('A')['C'].apply(sum)
giving:
A
bar 3.973
foo 4.373
There are also cases when .transform is used to filter the data:
df[df.groupby(['B'])['D'].transform(sum) < -1]
A B C D
3 bar three 1.287 -0.639
7 foo three 0.657 -1.179
I hope this adds a bit more clarity.
I am going to use a very simple snippet to illustrate the difference:
test = pd.DataFrame({'id':[1,2,3,1,2,3,1,2,3], 'price':[1,2,3,2,3,1,3,1,2]})
grouping = test.groupby('id')['price']
The DataFrame looks like this:
id price
0 1 1
1 2 2
2 3 3
3 1 2
4 2 3
5 3 1
6 1 3
7 2 1
8 3 2
There are 3 customer IDs in this table; each customer made three transactions and paid 1, 2 and 3 dollars.
Now, I want to find the minimum payment made by each customer. There are two ways of doing it:
Using apply:
grouping.min()
The return looks like this:
id
1 1
2 1
3 1
Name: price, dtype: int64
pandas.core.series.Series  # return type
Int64Index([1, 2, 3], dtype='int64', name='id')  # the returned Series' index
# length is 3
Using transform:
grouping.transform(min)
The return looks like this:
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
Name: price, dtype: int64
pandas.core.series.Series  # return type
RangeIndex(start=0, stop=9, step=1)  # the returned Series' index
# length is 9
Both methods return a Series object, but the length of the first one is 3 and the length of the second one is 9.
If you want to answer "What is the minimum price paid by each customer?", then the apply method is the more suitable one to choose.
If you want to answer "What is the difference between the amount paid for each transaction vs the minimum payment?", then you want to use transform, because:
test['minimum'] = grouping.transform(min) # creates an extra column filled with the minimum payment
test.price - test.minimum # returns the difference for each row
Apply does not work here simply because it returns a Series of size 3, but the original df's length is 9. You cannot integrate it back to the original df easily.
tmp = df.groupby(['A'])['c'].transform('mean')
is like
tmp1 = df.groupby(['A']).agg({'c':'mean'})
tmp = df['A'].map(tmp1['c'])
or
tmp1 = df.groupby(['A'])['c'].mean()
tmp = df['A'].map(tmp1)
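A small runnable check of that equivalence (the frame and the column names 'A' and 'c' are made up to match the snippet, not taken from the question):
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'], 'c': [1.0, 3.0, 5.0]})

tmp_transform = df.groupby(['A'])['c'].transform('mean')
tmp_map = df['A'].map(df.groupby(['A'])['c'].mean())
print((tmp_transform == tmp_map).all())   # True: both broadcast the group mean to every row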
You can use zscore to analyze the data in columns C and D for outliers, where the z-score is (series - series.mean()) / series.std(). Use apply to create a user-defined function for the difference between C and D, creating a new resulting DataFrame. apply uses the group result set.
from scipy.stats import zscore
columns = ['A', 'B', 'C', 'D']
records = [
['foo', 'one', 0.162003, 0.087469],
['bar', 'one', -1.156319, -1.5262719999999999],
['foo', 'two', 0.833892, -1.666304],
['bar', 'three', -2.026673, -0.32205700000000004],
['foo', 'two', 0.41145200000000004, -0.9543709999999999],
['bar', 'two', 0.765878, -0.095968],
['foo', 'one', -0.65489, 0.678091],
['foo', 'three', -1.789842, -1.130922]
]
df = pd.DataFrame.from_records(records, columns=columns)
print(df)
standardize = df.groupby('A')[['C', 'D']].transform(zscore)
print(standardize)
outliersC = (standardize['C'] < -1.1) | (standardize['C'] > 1.1)
outliersD = (standardize['D'] < -1.1) | (standardize['D'] > 1.1)
results = df[outliersC | outliersD]
print(results)
#Dataframe results
A B C D
0 foo one 0.162003 0.087469
1 bar one -1.156319 -1.526272
2 foo two 0.833892 -1.666304
3 bar three -2.026673 -0.322057
4 foo two 0.411452 -0.954371
5 bar two 0.765878 -0.095968
6 foo one -0.654890 0.678091
7 foo three -1.789842 -1.130922
#C and D transformed Z score
C D
0 0.398046 0.801292
1 -0.300518 -1.398845
2 1.121882 -1.251188
3 -1.046514 0.519353
4 0.666781 -0.417997
5 1.347032 0.879491
6 -0.482004 1.492511
7 -1.704704 -0.624618
#filtering using the arbitrary thresholds -1.1 and 1.1 for the z-score
A B C D
1 bar one -1.156319 -1.526272
2 foo two 0.833892 -1.666304
5 bar two 0.765878 -0.095968
6 foo one -0.654890 0.678091
7 foo three -1.789842 -1.130922
Part 2
splitting = df.groupby('A')
#look at how the data is grouped
for group_name, group in splitting:
    print(group_name)

def column_difference(gr):
    return gr['C'] - gr['D']

grouped = splitting.apply(column_difference)
print(grouped)
A
bar 1 0.369953
3 -1.704616
5 0.861846
foo 0 0.074534
2 2.500196
4 1.365823
6 -1.332981
7 -0.658920