I am trying to create a pandas data frame using two lists and the output is erroneous for a given length of the lists.(this is not due to varying lengths)
Here I have two cases, one that works as expected and one that doesn't(commented out):
import string
d = dict.fromkeys(string.ascii_lowercase, 0).keys()
groups = sorted(d)[:3]
numList = range(0,4)
# groups = sorted(d)[:20]
# numList = range(0,25)
df = DataFrame({'Number':sorted(numList)*len(groups), 'Group':sorted(groups)*len(numList)})
df.sort_values(['Group', 'Number'])
Expected Output: every item in groups, to correspond to all items in numList
Group Number
a 0
a 1
a 2
a 3
b 0
b 1
b 2
b 3
c 0
c 1
c 2
c 3
Actual Results: Works for case in which lists are sized 3 and 4 but not 20 , and 25 (I have commented out that case in the above code)
Why is that? and how to fix that?
If I understand this correctly, you want to make dataframe which will have all pairs of groups and numbers. That operation is called cartesian product.
If the difference in lengths betweens those two arrays is exactly 1, it works with your approach, but this is more by pure accident. For general case, you want to do this.
df1 = DataFrame({'Number': sorted(numList)})
df2 = DataFrame({'Group': sorted(groups)})
df = df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', 1)
And just note about dataframes sorting: You need to remember that in pandas, most of DataFrame operations return new DataFrame by default, don't modify the old one, unless you pass the inplace=True parameter.
So you should do
df = df.sort_values(['Group', 'Number'])
or
df.sort_values(['Group', 'Number'], inplace=True)
and it should work now.
Related
I am trying to convert a list of 2d-dataframes into one large dataframe. Lets assume I have the following example, where I create a set of dataframes, each one having the same columns / index:
import pandas as pd
import numpy as np
frames = []
names = []
frame_columns = ['DataPoint1', 'DataPoint2']
for i in range(5):
names.append("DataSet{0}".format(i))
frames.append(pd.DataFrame(np.random.randn(3, 2), columns=frame_columns))
I would like to convert this set of dataframes into one dataframe df which I can access using df['DataSet0']['DataPoint1'].
This dataset would have to have a multi-index consisting of the product of ['DataPoint1', 'DataPoint2'] and the index of the individual dataframes (which is of course the same for all individual frames).
Conversely, the columns would be given as the product of ['Dataset0', ...] and ['DataPoint1', 'DataPoint2'].
In either case, I can create a corresponding MultiIndex and derive an (empty) dataframe based on that:
mux = pd.MultiIndex.from_product([names, frames[0].columns])
frame = pd.DataFrame(index=mux).T
However, I would like to have the contents of the dataframes present rather than having to then add them.
Note that a similar question has been asked here. However, the answers seem to revolve around the Panel class, which is, as of now, deprecated.
Similarly, this thread suggests a join, which is not really what I need.
You can use concat with keys:
total_frame = pd.concat(frames, keys=names)
Output:
DataPoint1 DataPoint2
DataSet0 0 -0.656758 1.776027
1 -0.940759 1.355495
2 0.173670 0.274525
DataSet1 0 -0.744456 -1.057482
1 0.186901 0.806281
2 0.148567 -1.065477
DataSet2 0 -0.980312 -0.487479
1 2.117227 -0.511628
2 0.093718 -0.514379
DataSet3 0 0.046963 -0.563041
1 -0.663800 -1.130751
2 -1.446891 0.879479
DataSet4 0 1.586213 1.552048
1 0.196841 1.933362
2 -0.545256 0.387289
Then you can extract Dataset0 by:
total_frame.loc['DataSet0']
If you really want to use MultiIndex columns instead, you can add axis=1 to concat:
total_frame = pd.concat(frames, axis=1, keys=names)
I am having some issue with copying a dataframe. Basically, I want to replicate a dataframe with another variable but with the columns being numerical instead of categorical. Below I have function that returns dataframe mean_df when I print it out I see that the rows are categorical. I then create a new dataframe (mean_df_num) which is equal to mean_df. Then I convert the rows to index values (for mean_df_num) instead of the categorical letters. However, when I print my mean_df after I see that it has also changed indices to be numerical. Why does this happen and is there a way around this?
mean_df = mean_funct(train_df_cat)
print(mean_df)
mean_df_num = mean_df
mean_df_num.index = range(len(mean_df_num)) #Convert df to numerical indices
print(mean_df)
Output:
m00 mu02 mu11
a 1.00162 0.357137 -0.245608
c 0.766659 0.354217 0.244405
e 0.929145 0.422447 0.0602329
m 1.61799 2.85194 -1.80078
n 1.03976 0.700674 -1.0011
o 0.97873 0.754065 0.172753
r 0.623244 0.11065 1.52705
s 0.789545 0.177259 -0.154744
x 1.0039 0.404982 -1.51634
z 0.919228 0.3578 0.42973
m00 mu02 mu11
0 1.00162 0.357137 -0.245608
1 0.766659 0.354217 0.244405
2 0.929145 0.422447 0.0602329
3 1.61799 2.85194 -1.80078
4 1.03976 0.700674 -1.0011
5 0.97873 0.754065 0.172753
6 0.623244 0.11065 1.52705
7 0.789545 0.177259 -0.154744
8 1.0039 0.404982 -1.51634
9 0.919228 0.3578 0.42973
Pandas dataframe is essentially a pointer. That meas when you do mean_df_num=mean_df, then mean_df_num and mean_df point to the same object. You change one, you change the other. The way around this is .copy(), i.e. mean_df_num=mean_df.copy().
Actually, for your purpose, it's better just do mean_df_num=mean_df.reset_index(drop=True). It does both at the same time: copy the data and set index as range index.
I'm not used to pandas at all, thus the several question on my problem.
I have a function computing computing a list called solutions. This list can either be made of tuples of 3 values (a, b, c) or empty.
solutions = [(a,b,c), (d,e,f), (g,h,i)]
To save it, I first turn it into a numpy array, and then I save it with pandas after naming the columns.
solutions = np.asarray(solutions)
df = pd.DataFrame(solutions)
df.columns = ["Name1", "Name2", "Name3"]
df.to_pickle(path)
My issue is that I sometimes have a empty solutions list: solutions = []. Thus, the line df.columns raises an error. To bypass it, I currently check the size of solutions, and if it is empty, I do:
pickle.dump([], path, "wb")
I would like to be a more consistent between my data type, and to save the SAME format between both scenario.
=> If the list is empty, I would like to save the 3 columns name with an empty data frame. Ultimate goal, is to reopen the file with pd.read_pickle() and to access easily the data in it.
Second issue, I would like to reopen the files pickled, and to add a column. Could you show me the right way to do so?
And third question, how can I select a part of the dataframe. For instance, I want all lines in which the column Name1 value % 0.25 == 0.
Thanks
Create your dataframe using:
df = pandas.DataFrame(data=solutions, columns=['name1', 'name2', 'name3'])
If solutions is empty, it will nevertheless create a dataframe with 3 columns and 0 row.
In [2]: pd.DataFrame(data=[(1,2,3), (4,5,6)], columns=['a','b','c'])
Out[2]:
a b c
0 1 2 3
1 4 5 6
In [3]: pd.DataFrame(data=[], columns=['a','b','c'])
Out[3]:
Empty DataFrame
Columns: [a, b, c]
Index: []
For your third question:
df["Name1"] % 0.25 == 0
computes a series of booleans which are true where the value in the first column can be divided by 0.25. You can use it to select the rows of your dataframe:
df[ df["Name1"] % 0.25 == 0 ]
To pass multiple variables to a normal python function you can just write something like:
def a_function(date,string,float):
do something....
convert string to int,
date = date + (float * int) days
return date
When using Pandas DataFrames I know you can create a new column based on the contents of one like so:
df['new_col']) = df['column_A'].map(a_function)
# This might return the year from a date column
# return date.year
What I'm wondering is in the same way you can pass multiple pieces of data to a single function (as seen in the first example above), can you use multiple columns in the creation of a new pandas DataFrame column?
For example combining three separate parts of a date Y - M - D into one field.
df['whole_date']) = df['Year','Month','Day'].map(a_function)
I get a key error with the following test.
def combine(one,two,three):
return one + two + three
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4],'c': [4,5,6]})
df['d'] = df['a','b','b'].map(combine)
Is there a way of creating a new column in a pandas DataFrame using .map or something else which takes as input three columns and returns a single column?
-> Example input: 1, 2, 3
-> Example output: 1*2*3
Likewise is there also a way of having a function take in one argument, a date and return three new pandas DataFrame columns; one for the year, month and day?
Is there a way of creating a new column in a pandas dataframe using .MAP or something else which takes as input three columns and returns a single column. For example input would be 1, 2, 3 and output would be 1*2*3
To do that, you can use apply with axis=1. However, instead of being called with three separate arguments (one for each column) your specified function will then be called with a single argument for each row, and that argument will be a Series containing the data for that row. You can either account for this in your function:
def combine(row):
return row['a'] + row['b'] + row['c']
>>> df.apply(combine, axis=1)
0 7
1 10
2 13
Or you can pass a lambda which unpacks the Series into separate arguments:
def combine(one,two,three):
return one + two + three
>>> df.apply(lambda x: combine(*x), axis=1)
0 7
1 10
2 13
If you want to pass only specific rows, you need to select them by indexing on the DataFrame with a list:
>>> df[['a', 'b', 'c']].apply(lambda x: combine(*x), axis=1)
0 7
1 10
2 13
Note the double brackets. (This doesn't really have anything to do with apply; indexing with a list is the normal way to access multiple columns from a DataFrame.)
However, it's important to note that in many cases you don't need to use apply, because you can just use vectorized operations on the columns themselves. The combine function above can simply be called with the DataFrame columns themselves as the arguments:
>>> combine(df.a, df.b, df.c)
0 7
1 10
2 13
This is typically much more efficient when the "combining" operation is vectorizable.
Likewise is there also a way of having a function take in one argument, a date and return three new pandas dataframe columns; one for the year, month and day?
As above, there are two basic ways to do this: a general but non-vectorized way using apply, and a faster vectorized way. Suppose you have a DataFrame like this:
>>> df = pandas.DataFrame({'date': pandas.date_range('2015/05/01', '2015/05/03')})
>>> df
date
0 2015-05-01
1 2015-05-02
2 2015-05-03
You can define a function that returns a Series for each value, and then apply it to the column:
def dateComponents(date):
return pandas.Series([date.year, date.month, date.day], index=["Year", "Month", "Day"])
>>> df.date.apply(dateComponents)
11: Year Month Day
0 2015 5 1
1 2015 5 2
2 2015 5 3
In this situation, this is the only option, since there is no vectorized way to access the individual date components. However, in some cases you can use vectorized operations:
>>> df = pandas.DataFrame({'a': ["Hello", "There", "Pal"]})
>>> df
a
0 Hello
1 There
2 Pal
>>> pandas.DataFrame({'FirstChar': df.a.str[0], 'Length': df.a.str.len()})
FirstChar Length
0 H 5
1 T 5
2 P 3
Here again the operation is vectorized by operating directly on the values instead of applying a function elementwise. In this case, we have two vectorized operations (getting first character and getting the string length), and then we wrap the results in another call to DataFrame to create separate columns for each of the two kinds of results.
I normally use apply for this kind of thing; it's basically the DataFrame version of map (the axis parameter lets you decide whether to apply your function to rows or columns):
df.apply(lambda row: row.a*row.b*row.c, axis =1)
or
df.apply(np.prod, axis=1)
0 8
1 30
2 72
How can I find the row for which the value of a specific column is maximal?
df.max() will give me the maximal value for each column, I don't know how to get the corresponding row.
Use the pandas idxmax function. It's straightforward:
>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
A B C
0 1.232853 -1.979459 -0.573626
1 0.140767 0.394940 1.068890
2 0.742023 1.343977 -0.579745
3 2.125299 -0.649328 -0.211692
4 -0.187253 1.908618 -1.862934
>>> df['A'].idxmax()
3
>>> df['B'].idxmax()
4
>>> df['C'].idxmax()
1
Alternatively you could also use numpy.argmax, such as numpy.argmax(df['A']) -- it provides the same thing, and appears at least as fast as idxmax in cursory observations.
idxmax() returns indices labels, not integers.
Example': if you have string values as your index labels, like rows 'a' through 'e', you might want to know that the max occurs in row 4 (not row 'd').
if you want the integer position of that label within the Index you have to get it manually (which can be tricky now that duplicate row labels are allowed).
HISTORICAL NOTES:
idxmax() used to be called argmax() prior to 0.11
argmax was deprecated prior to 1.0.0 and removed entirely in 1.0.0
back as of Pandas 0.16, argmax used to exist and perform the same function (though appeared to run more slowly than idxmax).
argmax function returned the integer position within the index of the row location of the maximum element.
pandas moved to using row labels instead of integer indices. Positional integer indices used to be very common, more common than labels, especially in applications where duplicate row labels are common.
For example, consider this toy DataFrame with a duplicate row label:
In [19]: dfrm
Out[19]:
A B C
a 0.143693 0.653810 0.586007
b 0.623582 0.312903 0.919076
c 0.165438 0.889809 0.000967
d 0.308245 0.787776 0.571195
e 0.870068 0.935626 0.606911
f 0.037602 0.855193 0.728495
g 0.605366 0.338105 0.696460
h 0.000000 0.090814 0.963927
i 0.688343 0.188468 0.352213
i 0.879000 0.105039 0.900260
In [20]: dfrm['A'].idxmax()
Out[20]: 'i'
In [21]: dfrm.iloc[dfrm['A'].idxmax()] # .ix instead of .iloc in older versions of pandas
Out[21]:
A B C
i 0.688343 0.188468 0.352213
i 0.879000 0.105039 0.900260
So here a naive use of idxmax is not sufficient, whereas the old form of argmax would correctly provide the positional location of the max row (in this case, position 9).
This is exactly one of those nasty kinds of bug-prone behaviors in dynamically typed languages that makes this sort of thing so unfortunate, and worth beating a dead horse over. If you are writing systems code and your system suddenly gets used on some data sets that are not cleaned properly before being joined, it's very easy to end up with duplicate row labels, especially string labels like a CUSIP or SEDOL identifier for financial assets. You can't easily use the type system to help you out, and you may not be able to enforce uniqueness on the index without running into unexpectedly missing data.
So you're left with hoping that your unit tests covered everything (they didn't, or more likely no one wrote any tests) -- otherwise (most likely) you're just left waiting to see if you happen to smack into this error at runtime, in which case you probably have to go drop many hours worth of work from the database you were outputting results to, bang your head against the wall in IPython trying to manually reproduce the problem, finally figuring out that it's because idxmax can only report the label of the max row, and then being disappointed that no standard function automatically gets the positions of the max row for you, writing a buggy implementation yourself, editing the code, and praying you don't run into the problem again.
You might also try idxmax:
In [5]: df = pandas.DataFrame(np.random.randn(10,3),columns=['A','B','C'])
In [6]: df
Out[6]:
A B C
0 2.001289 0.482561 1.579985
1 -0.991646 -0.387835 1.320236
2 0.143826 -1.096889 1.486508
3 -0.193056 -0.499020 1.536540
4 -2.083647 -3.074591 0.175772
5 -0.186138 -1.949731 0.287432
6 -0.480790 -1.771560 -0.930234
7 0.227383 -0.278253 2.102004
8 -0.002592 1.434192 -1.624915
9 0.404911 -2.167599 -0.452900
In [7]: df.idxmax()
Out[7]:
A 0
B 8
C 7
e.g.
In [8]: df.loc[df['A'].idxmax()]
Out[8]:
A 2.001289
B 0.482561
C 1.579985
Both above answers would only return one index if there are multiple rows that take the maximum value. If you want all the rows, there does not seem to have a function.
But it is not hard to do. Below is an example for Series; the same can be done for DataFrame:
In [1]: from pandas import Series, DataFrame
In [2]: s=Series([2,4,4,3],index=['a','b','c','d'])
In [3]: s.idxmax()
Out[3]: 'b'
In [4]: s[s==s.max()]
Out[4]:
b 4
c 4
dtype: int64
df.iloc[df['columnX'].argmax()]
argmax() would provide the index corresponding to the max value for the columnX. iloc can be used to get the row of the DataFrame df for this index.
A more compact and readable solution using query() is like this:
import pandas as pd
df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
print(df)
# find row with maximum A
df.query('A == A.max()')
It also returns a DataFrame instead of Series, which would be handy for some use cases.
Very simple: we have df as below and we want to print a row with max value in C:
A B C
x 1 4
y 2 10
z 5 9
In:
df.loc[df['C'] == df['C'].max()] # condition check
Out:
A B C
y 2 10
If you want the entire row instead of just the id, you can use df.nlargest and pass in how many 'top' rows you want and you can also pass in for which column/columns you want it for.
df.nlargest(2,['A'])
will give you the rows corresponding to the top 2 values of A.
use df.nsmallest for min values.
The direct ".argmax()" solution does not work for me.
The previous example provided by #ely
>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
A B C
0 1.232853 -1.979459 -0.573626
1 0.140767 0.394940 1.068890
2 0.742023 1.343977 -0.579745
3 2.125299 -0.649328 -0.211692
4 -0.187253 1.908618 -1.862934
>>> df['A'].argmax()
3
>>> df['B'].argmax()
4
>>> df['C'].argmax()
1
returns the following message :
FutureWarning: 'argmax' is deprecated, use 'idxmax' instead. The behavior of 'argmax'
will be corrected to return the positional maximum in the future.
Use 'series.values.argmax' to get the position of the maximum now.
So that my solution is :
df['A'].values.argmax()
mx.iloc[0].idxmax()
This one line of code will give you how to find the maximum value from a row in dataframe, here mx is the dataframe and iloc[0] indicates the 0th index.
Considering this dataframe
[In]: df = pd.DataFrame(np.random.randn(4,3),columns=['A','B','C'])
[Out]:
A B C
0 -0.253233 0.226313 1.223688
1 0.472606 1.017674 1.520032
2 1.454875 1.066637 0.381890
3 -0.054181 0.234305 -0.557915
Assuming one want to know the rows where column "C" is max, the following will do the work
[In]: df[df['C']==df['C'].max()])
[Out]:
A B C
1 0.472606 1.017674 1.520032
The idmax of the DataFrame returns the label index of the row with the maximum value and the behavior of argmax depends on version of pandas (right now it returns a warning). If you want to use the positional index, you can do the following:
max_row = df['A'].values.argmax()
or
import numpy as np
max_row = np.argmax(df['A'].values)
Note that if you use np.argmax(df['A']) behaves the same as df['A'].argmax().
Use:
data.iloc[data['A'].idxmax()]
data['A'].idxmax() -finds max value location in terms of row
data.iloc() - returns the row
If there are ties in the maximum values, then idxmax returns the index of only the first max value. For example, in the following DataFrame:
A B C
0 1 0 1
1 0 0 1
2 0 0 0
3 0 1 1
4 1 0 0
idxmax returns
A 0
B 3
C 0
dtype: int64
Now, if we want all indices corresponding to max values, then we could use max + eq to create a boolean DataFrame, then use it on df.index to filter out indexes:
out = df.eq(df.max()).apply(lambda x: df.index[x].tolist())
Output:
A [0, 4]
B [3]
C [0, 1, 3]
dtype: object
what worked for me is:
df[df['colX'] == df['colX'].max()
You then get the row in your df with the maximum value of colX.
Then if you just want the index you can add .index at the end of the query.