Saving dataframe as another value in python


I am having an issue with copying a dataframe. Basically, I want to replicate a dataframe under another variable, but with the row labels being numerical instead of categorical. Below, a function returns the dataframe mean_df; when I print it, I see that the rows are labeled with categorical letters. I then create a new dataframe (mean_df_num) which is equal to mean_df, and replace its row labels with integer index values. However, when I print mean_df afterwards, I see that its index has also become numerical. Why does this happen, and is there a way around it?
mean_df = mean_funct(train_df_cat)
print(mean_df)
mean_df_num = mean_df
mean_df_num.index = range(len(mean_df_num)) #Convert df to numerical indices
print(mean_df)
Output:
m00 mu02 mu11
a 1.00162 0.357137 -0.245608
c 0.766659 0.354217 0.244405
e 0.929145 0.422447 0.0602329
m 1.61799 2.85194 -1.80078
n 1.03976 0.700674 -1.0011
o 0.97873 0.754065 0.172753
r 0.623244 0.11065 1.52705
s 0.789545 0.177259 -0.154744
x 1.0039 0.404982 -1.51634
z 0.919228 0.3578 0.42973
m00 mu02 mu11
0 1.00162 0.357137 -0.245608
1 0.766659 0.354217 0.244405
2 0.929145 0.422447 0.0602329
3 1.61799 2.85194 -1.80078
4 1.03976 0.700674 -1.0011
5 0.97873 0.754065 0.172753
6 0.623244 0.11065 1.52705
7 0.789545 0.177259 -0.154744
8 1.0039 0.404982 -1.51634
9 0.919228 0.3578 0.42973

A variable holding a pandas DataFrame is essentially a reference to the object. That means when you do mean_df_num=mean_df, mean_df_num and mean_df point to the same object: change one and you change the other. The way around this is .copy(), i.e. mean_df_num=mean_df.copy().
Actually, for your purpose, it's better to just do mean_df_num=mean_df.reset_index(drop=True). It does both at once: it returns a new DataFrame and sets a range index on it.
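To make the difference concrete, here is a minimal sketch with a made-up three-row frame (the values are illustrative, not the asker's data). It contrasts plain assignment, .copy() and reset_index(drop=True):

```python
import pandas as pd

# Small stand-in for mean_df; values are illustrative only.
mean_df = pd.DataFrame({"m00": [1.0, 0.77, 0.93]}, index=["a", "c", "e"])

alias = mean_df                               # same object, nothing is copied
independent = mean_df.copy()                  # separate object with its own index
renumbered = mean_df.reset_index(drop=True)   # new frame with a 0..n-1 index

# Changing the copy's index leaves the original untouched.
independent.index = range(len(independent))

print(alias is mean_df)         # True
print(list(mean_df.index))      # ['a', 'c', 'e']
print(list(renumbered.index))   # [0, 1, 2]
```

Only the plain assignment shares state with mean_df; the other two can be modified freely.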

Related

How to retrieve cells from a dataframe based on condition from another dataframe

We have two dataframes, first one contains some float values (which mean average speed).
0 1 2
1 15.610826 19.182879 6.678087
2 13.740250 15.666897 17.640749
3 2.379010 2.889702 2.955097
4 20.540628 9.661226 9.479921
And another dataframe with geographical coordinates, where the average speed takes place.
0 1 2
1 [52.2399255, 21.0654495] [52.23893150000001, 21.06087] [52.23800850000001,21.056779]
2 [52.2449705, 21.0755175] [52.2452905, 21.075118000000003] [52.245557500000004, 21.0748175]
3 [52.2401885, 21.012981500000002] [52.239134, 21.009432] [52.238420500000004, 21.007080000000002]
4 [52.221506500000004, 20.9665085] [52.222458, 20.968952] [52.224409, 20.969248999999998]
Now I want to create a list with coordinates where average speed is above 18, in this case this would be
list_above_18=[[52.23893150000001, 21.06087] , [52.221506500000004, 20.9665085]]
How can I select values from a dataframe based on values in another dataframe?

You can zip the dataframes' values and work on the elements separately. See below (A, B are your dataframes, in the same order you provided them):
list_above_18=[]
for speeds, coords in zip(A.values, B.values):  # walk both frames row by row
    for k in range(len(speeds)):
        if speeds[k]>18:
            list_above_18.append(coords[k])
Output:
>>>print(list_above_18)
[[52.23893150000001, 21.06087] , [52.221506500000004, 20.9665085]]

Considering the shape of the Average Speed dataset will remain same as the coordinates dataset, you can try the below
coord_df[data_df.iloc[:,:] > 18].T.stack().values
Here,
coord_df = DataFrame with coordinate values
data_df = Average Speed values
This would return a numpy array with just the coordinate values where the Average speed is greater than 18
How this works :
data_df.iloc[:,:] > 18
Creates a boolean mask dataframe in which every value of 18 or less is marked False and the rest True
coord_df[data_df.iloc[:,:] > 18]
Passes the mask in the Target Dataframe i.e. coordinate dataframe which then results in a dataframe which shows coordinate values only for those cells where the mask has True i.e. where the average speed was above 18
.T.stack().values
This then retrieves only the non-null values from the resultant dataframe and returns a numpy array
References I took :
Get non-null elements in a pandas DataFrame --- To get only the non null values from a dataframe (.T.stack().values)

Let the first df be df1 and second df be df2
output_array = df2[df1>18].values.flatten() # df1>18 would create the mask
output_array = [val for val in output_array if type(val) == list] # removing the nan values. We can't use np.isnan as it would not work for list
Sample Input:
df1
df2
output_array
[[15.1, 20.5], [91.5, 95.8]]
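As a self-contained illustration of the masking idea used in the answers above, here is a sketch with two small made-up frames (speeds and coords are hypothetical names, and the numbers are shortened). It applies the boolean mask to the underlying NumPy arrays, which keeps the list-valued cells intact and returns them in row-major order:

```python
import pandas as pd

# Hypothetical miniature versions of the two frames from the question.
speeds = pd.DataFrame({0: [15.6, 20.5], 1: [19.2, 9.7]})
coords = pd.DataFrame({
    0: [[52.24, 21.07], [52.22, 20.97]],
    1: [[52.239, 21.061], [52.222, 20.969]],
})

# Boolean array with the same shape as both frames.
mask = speeds.to_numpy() > 18

# Index the object array of coordinate lists with the mask.
selected = coords.to_numpy()[mask].tolist()
print(selected)   # [[52.239, 21.061], [52.22, 20.97]]
```

This assumes both frames have the same shape and the same row and column order, as in the question.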

Pandas Create DataFrame with two lists behaving differently

I am trying to create a pandas data frame using two lists and the output is erroneous for a given length of the lists.(this is not due to varying lengths)
Here I have two cases, one that works as expected and one that doesn't(commented out):
import string
from pandas import DataFrame
d = dict.fromkeys(string.ascii_lowercase, 0).keys()
groups = sorted(d)[:3]
numList = range(0,4)
# groups = sorted(d)[:20]
# numList = range(0,25)
df = DataFrame({'Number':sorted(numList)*len(groups), 'Group':sorted(groups)*len(numList)})
df.sort_values(['Group', 'Number'])
Expected Output: every item in groups, to correspond to all items in numList
Group Number
a 0
a 1
a 2
a 3
b 0
b 1
b 2
b 3
c 0
c 1
c 2
c 3
Actual Results: Works for case in which lists are sized 3 and 4 but not 20 , and 25 (I have commented out that case in the above code)
Why is that? and how to fix that?

If I understand this correctly, you want to make a dataframe that contains all pairs of groups and numbers. That operation is called a cartesian product.
Your approach only produces every pair when the two lengths share no common factor (3 and 4 are coprime; 20 and 25 are both divisible by 5), so it works for the small case more or less by accident. For the general case, you want to do this:
df1 = DataFrame({'Number': sorted(numList)})
df2 = DataFrame({'Group': sorted(groups)})
df = df1.assign(key=1).merge(df2.assign(key=1), on='key').drop(columns='key')
And a note about sorting dataframes: remember that in pandas, most DataFrame operations return a new DataFrame by default and don't modify the old one, unless you pass the inplace=True parameter.
So you should do
df = df.sort_values(['Group', 'Number'])
or
df.sort_values(['Group', 'Number'], inplace=True)
and it should work now.
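For reference, newer pandas versions can build the cartesian product without the helper key column. The sketch below assumes pandas 1.2 or later, where merge accepts how="cross"; the variable names are illustrative:

```python
import string
import pandas as pd

groups = list(string.ascii_lowercase[:20])   # 20 group labels
numbers = list(range(25))                    # 25 numbers

# Cross join: every group paired with every number (assumes pandas >= 1.2).
pairs = (
    pd.DataFrame({"Group": groups})
    .merge(pd.DataFrame({"Number": numbers}), how="cross")
    .sort_values(["Group", "Number"], ignore_index=True)
)
print(pairs.shape)   # (500, 2)
```

The 20 x 25 case that failed in the question yields 500 distinct rows here.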

how do you pass multiple variables to pandas dataframe to use them with .map to create a new column

To pass multiple variables to a normal python function you can just write something like:
def a_function(date,string,float):
    # do something, e.g.
    # convert string to int,
    # date = date + (float * int) days
    return date
When using Pandas DataFrames I know you can create a new column based on the contents of one like so:
df['new_col'] = df['column_A'].map(a_function)
# This might return the year from a date column
# return date.year
What I'm wondering is in the same way you can pass multiple pieces of data to a single function (as seen in the first example above), can you use multiple columns in the creation of a new pandas DataFrame column?
For example combining three separate parts of a date Y - M - D into one field.
df['whole_date'] = df['Year','Month','Day'].map(a_function)
I get a key error with the following test.
def combine(one,two,three):
    return one + two + three
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4],'c': [4,5,6]})
df['d'] = df['a','b','b'].map(combine)
Is there a way of creating a new column in a pandas DataFrame using .map or something else which takes as input three columns and returns a single column?
-> Example input: 1, 2, 3
-> Example output: 1*2*3
Likewise is there also a way of having a function take in one argument, a date and return three new pandas DataFrame columns; one for the year, month and day?

Is there a way of creating a new column in a pandas dataframe using .MAP or something else which takes as input three columns and returns a single column. For example input would be 1, 2, 3 and output would be 1*2*3
To do that, you can use apply with axis=1. However, instead of being called with three separate arguments (one for each column) your specified function will then be called with a single argument for each row, and that argument will be a Series containing the data for that row. You can either account for this in your function:
def combine(row):
    return row['a'] + row['b'] + row['c']
>>> df.apply(combine, axis=1)
0 7
1 10
2 13
Or you can pass a lambda which unpacks the Series into separate arguments:
def combine(one,two,three):
    return one + two + three
>>> df.apply(lambda x: combine(*x), axis=1)
0 7
1 10
2 13
If you want to pass only specific columns, you need to select them by indexing on the DataFrame with a list:
>>> df[['a', 'b', 'c']].apply(lambda x: combine(*x), axis=1)
0 7
1 10
2 13
Note the double brackets. (This doesn't really have anything to do with apply; indexing with a list is the normal way to access multiple columns from a DataFrame.)
However, it's important to note that in many cases you don't need to use apply, because you can just use vectorized operations on the columns themselves. The combine function above can simply be called with the DataFrame columns themselves as the arguments:
>>> combine(df.a, df.b, df.c)
0 7
1 10
2 13
This is typically much more efficient when the "combining" operation is vectorizable.
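The question's own example asks for 1*2*3 rather than a sum, so here is a short sketch of both routes for a product, using the same three-column frame. The new column names row_wise and vectorized are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4], "c": [4, 5, 6]})

# Row-wise: the function receives one row (a Series) at a time.
df["row_wise"] = df[["a", "b", "c"]].apply(
    lambda row: row["a"] * row["b"] * row["c"], axis=1
)

# Vectorized: whole columns are multiplied element-wise in one step.
df["vectorized"] = df["a"] * df["b"] * df["c"]

print(df["vectorized"].tolist())   # [8, 30, 72]
```

Both columns hold the same values; the vectorized form avoids calling a Python function once per row.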
Likewise is there also a way of having a function take in one argument, a date and return three new pandas dataframe columns; one for the year, month and day?
As above, there are two basic ways to do this: a general but non-vectorized way using apply, and a faster vectorized way. Suppose you have a DataFrame like this:
>>> df = pandas.DataFrame({'date': pandas.date_range('2015/05/01', '2015/05/03')})
>>> df
date
0 2015-05-01
1 2015-05-02
2 2015-05-03
You can define a function that returns a Series for each value, and then apply it to the column:
def dateComponents(date):
    return pandas.Series([date.year, date.month, date.day], index=["Year", "Month", "Day"])
>>> df.date.apply(dateComponents)
Year Month Day
0 2015 5 1
1 2015 5 2
2 2015 5 3
When this answer was written this was the only option, since there was no vectorized way to access the individual date components; pandas has since added the .dt accessor (e.g. df.date.dt.year) for exactly this. For strings, vectorized operations are available through .str:
>>> df = pandas.DataFrame({'a': ["Hello", "There", "Pal"]})
>>> df
a
0 Hello
1 There
2 Pal
>>> pandas.DataFrame({'FirstChar': df.a.str[0], 'Length': df.a.str.len()})
FirstChar Length
0 H 5
1 T 5
2 P 3
Here again the operation is vectorized by operating directly on the values instead of applying a function elementwise. In this case, we have two vectorized operations (getting first character and getting the string length), and then we wrap the results in another call to DataFrame to create separate columns for each of the two kinds of results.
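The same wrap-the-results-in-a-DataFrame pattern also works for dates in current pandas versions through the .dt accessor on datetime columns. A minimal sketch on the three-day frame from above (parts is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2015-05-01", "2015-05-03")})

# Each .dt attribute is computed for the whole column at once.
parts = pd.DataFrame({
    "Year": df["date"].dt.year,
    "Month": df["date"].dt.month,
    "Day": df["date"].dt.day,
})
print(parts)
```

This gives the same three columns as the dateComponents approach without calling a function per value.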

I normally use apply for this kind of thing; it's basically the DataFrame version of map (the axis parameter lets you decide whether to apply your function to rows or columns):
df.apply(lambda row: row.a*row.b*row.c, axis =1)
or
df.apply(np.prod, axis=1)
0 8
1 30
2 72

Adding calculated column(s) to a dataframe in pandas

I have an OHLC price data set, that I have parsed from CSV into a Pandas dataframe and resampled to 15 min bars:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 500047 entries, 1998-05-04 04:45:00 to 2012-08-07 00:15:00
Freq: 15T
Data columns:
Close 363152 non-null values
High 363152 non-null values
Low 363152 non-null values
Open 363152 non-null values
dtypes: float64(4)
I would like to add various calculated columns, starting with simple ones such as period Range (H-L) and then booleans to indicate the occurrence of price patterns that I will define - e.g. a hammer candle pattern, for which a sample definition:
def closed_in_top_half_of_range(h,l,c):
    return c > l + (h-l)/2
def lower_wick(o,l,c):
    return min(o,c)-l
def real_body(o,c):
    return abs(c-o)
def lower_wick_at_least_twice_real_body(o,l,c):
    return lower_wick(o,l,c) >= 2 * real_body(o,c)
def is_hammer(row):
    return lower_wick_at_least_twice_real_body(row["Open"],row["Low"],row["Close"]) \
        and closed_in_top_half_of_range(row["High"],row["Low"],row["Close"])
Basic problem: how do I map the function to the column, specifically where I would like to reference more than one other column or the whole row or whatever?
This post deals with adding two calculated columns off of a single source column, which is close, but not quite it.
And slightly more advanced: for price patterns that are determined with reference to more than a single bar (T), how can I reference different rows (e.g. T-1, T-2 etc.) from within the function definition?

The exact code will vary for each of the columns you want to do, but it's likely you'll want to use the map and apply functions. In some cases you can just compute using the existing columns directly, since the columns are Pandas Series objects, which also work as Numpy arrays, which automatically work element-wise for usual mathematical operations.
>>> d
A B C
0 11 13 5
1 6 7 4
2 8 3 6
3 4 8 7
4 0 1 7
>>> (d.A + d.B) / d.C
0 4.800000
1 3.250000
2 1.833333
3 1.714286
4 0.142857
>>> d.A > d.C
0 True
1 True
2 True
3 False
4 False
If you need to use operations like max and min within a row, you can use apply with axis=1 to apply any function you like to each row. Here's an example that computes min(A, B)-C, which seems to be like your "lower wick":
>>> d.apply(lambda row: min([row['A'], row['B']])-row['C'], axis=1)
0 6
1 2
2 -3
3 -3
4 -7
Hopefully that gives you some idea of how to proceed.
Edit: to compare rows against neighboring rows, the simplest approach is to slice the columns you want to compare, leaving off the beginning/end, and then compare the resulting slices. Take .values first so the comparison is positional; pandas would otherwise try to align the two slices on their index labels. For instance, this will tell you for which rows the element in column A is less than the next row's element in column C:
d['A'][:-1].values < d['C'][1:].values
and this does it the other way, telling you which rows have A less than the preceding row's C:
d['A'][1:].values < d['C'][:-1].values
Doing ['A'][:-1] slices off the last element of column A, and doing ['C'][1:] slices off the first element of column C, so when you line these two up and compare them, you're comparing each element in A with the C from the following row.

You could have is_hammer in terms of row["Open"] etc. as follows
def is_hammer(rOpen,rLow,rClose,rHigh):
    return lower_wick_at_least_twice_real_body(rOpen,rLow,rClose) \
        and closed_in_top_half_of_range(rHigh,rLow,rClose)
Then you can use map (wrapped in list(), since map returns an iterator in Python 3):
df["isHammer"] = list(map(is_hammer, df["Open"], df["Low"], df["Close"], df["High"]))

For the second part of your question, you can also use shift, for example:
df['t-1'] = df['t'].shift(1)
t-1 would then contain the values from t one row above.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html
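A short sketch of shift on the five-row A/C data used in an earlier answer, comparing each row's A with the C of the next and of the previous row. The result names are illustrative; note that comparisons against the NaN introduced by shift evaluate to False:

```python
import pandas as pd

d = pd.DataFrame({"A": [11, 6, 8, 4, 0], "C": [5, 4, 6, 7, 7]})

# shift(-1) pulls the next row's value up; shift(1) pushes the previous row's value down.
a_lt_next_c = d["A"] < d["C"].shift(-1)
a_lt_prev_c = d["A"] < d["C"].shift(1)

print(a_lt_next_c.tolist())   # [False, False, False, True, False]
print(a_lt_prev_c.tolist())   # [False, False, False, True, True]
```

Because shift keeps the original index, the results line up with the frame and can be assigned straight back as new columns.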

The first four functions you list will work on vectors as well, with the exception that lower_wick needs to be adapted. Something like this,
import numpy
def lower_wick_vec(o, l, c):
    min_oc = numpy.where(o > c, c, o)
    return min_oc - l
where o, l and c are vectors.
You could instead do it this way, which just takes the df as input and avoids using numpy directly, although it will be much slower:
def lower_wick_df(df):
    min_oc = df[['Open', 'Close']].min(axis=1)
    return min_oc - df['Low']
The other three will work on columns or vectors just as they are. Then you can finish off with
def is_hammer(df):
    lw = lower_wick_at_least_twice_real_body(df["Open"], df["Low"], df["Close"])
    cl = closed_in_top_half_of_range(df["High"], df["Low"], df["Close"])
    return cl & lw
Bit operators can perform set logic on boolean vectors, & for and, | for or etc. This is enough to completely vectorize the sample calculations you gave and should be relatively fast. You could probably speed up even more by temporarily working with the numpy arrays underlying the data while performing these calculations.
For the second part, I would recommend introducing a column indicating the pattern for each row and writing a family of functions which deal with each pattern. Then groupby the pattern and apply the appropriate function to each group.
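Putting the vectorized pieces together, here is a compact sketch of the hammer test on a two-bar frame. The price values and the intermediate names body, lower_wick and midpoint are made up for illustration; the conditions are the ones from the question:

```python
import pandas as pd

# Two illustrative bars: the first satisfies both conditions, the second closes too low.
bars = pd.DataFrame({
    "Open":  [10.0, 10.0],
    "High":  [10.5, 12.0],
    "Low":   [8.0, 9.5],
    "Close": [10.2, 10.1],
})

body = (bars["Close"] - bars["Open"]).abs()                      # real body
lower_wick = bars[["Open", "Close"]].min(axis=1) - bars["Low"]   # min(o, c) - l
midpoint = bars["Low"] + (bars["High"] - bars["Low"]) / 2        # middle of the range

# & combines the two boolean columns element-wise.
bars["isHammer"] = (lower_wick >= 2 * body) & (bars["Close"] > midpoint)
print(bars["isHammer"].tolist())   # [True, False]
```

Every step operates on whole columns, so no per-row Python function call is needed.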

Find row where values for column is maximal in a pandas DataFrame

How can I find the row for which the value of a specific column is maximal?
df.max() will give me the maximal value for each column, I don't know how to get the corresponding row.

Use the pandas idxmax function. It's straightforward:
>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
A B C
0 1.232853 -1.979459 -0.573626
1 0.140767 0.394940 1.068890
2 0.742023 1.343977 -0.579745
3 2.125299 -0.649328 -0.211692
4 -0.187253 1.908618 -1.862934
>>> df['A'].idxmax()
3
>>> df['B'].idxmax()
4
>>> df['C'].idxmax()
1
Alternatively you could also use numpy.argmax, such as numpy.argmax(df['A']). With the default integer index used here it gives the same answer, and it appears at least as fast as idxmax in cursory observations.
idxmax() returns index labels, not integers.
Example: if you have string values as your index labels, like rows 'a' through 'e', you might want to know that the max occurs in row 4 (not row 'd').
If you want the integer position of that label within the Index, you have to get it manually (which can be tricky now that duplicate row labels are allowed).
HISTORICAL NOTES:
idxmax() used to be called argmax() prior to 0.11
the label-returning behavior of argmax was deprecated prior to 1.0.0 and removed in 1.0.0; since then Series.argmax returns the integer position
back as of pandas 0.16, argmax still existed and performed the same function as idxmax (though it appeared to run more slowly than idxmax).
argmax function returned the integer position within the index of the row location of the maximum element.
pandas moved to using row labels instead of integer indices. Positional integer indices used to be very common, more common than labels, especially in applications where duplicate row labels are common.
For example, consider this toy DataFrame with a duplicate row label:
In [19]: dfrm
Out[19]:
A B C
a 0.143693 0.653810 0.586007
b 0.623582 0.312903 0.919076
c 0.165438 0.889809 0.000967
d 0.308245 0.787776 0.571195
e 0.870068 0.935626 0.606911
f 0.037602 0.855193 0.728495
g 0.605366 0.338105 0.696460
h 0.000000 0.090814 0.963927
i 0.688343 0.188468 0.352213
i 0.879000 0.105039 0.900260
In [20]: dfrm['A'].idxmax()
Out[20]: 'i'
In [21]: dfrm.loc[dfrm['A'].idxmax()] # .ix instead of .loc in older versions of pandas
Out[21]:
A B C
i 0.688343 0.188468 0.352213
i 0.879000 0.105039 0.900260
So here a naive use of idxmax is not sufficient, whereas the old form of argmax would correctly provide the positional location of the max row (in this case, position 9).
This is exactly one of those nasty kinds of bug-prone behaviors in dynamically typed languages that makes this sort of thing so unfortunate, and worth beating a dead horse over. If you are writing systems code and your system suddenly gets used on some data sets that are not cleaned properly before being joined, it's very easy to end up with duplicate row labels, especially string labels like a CUSIP or SEDOL identifier for financial assets. You can't easily use the type system to help you out, and you may not be able to enforce uniqueness on the index without running into unexpectedly missing data.
So you're left with hoping that your unit tests covered everything (they didn't, or more likely no one wrote any tests) -- otherwise (most likely) you're just left waiting to see if you happen to smack into this error at runtime, in which case you probably have to go drop many hours worth of work from the database you were outputting results to, bang your head against the wall in IPython trying to manually reproduce the problem, finally figuring out that it's because idxmax can only report the label of the max row, and then being disappointed that no standard function automatically gets the positions of the max row for you, writing a buggy implementation yourself, editing the code, and praying you don't run into the problem again.
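One way to get the position rather than the label, sketched on a made-up three-row frame with a duplicated label, is to call argmax on the underlying NumPy array and feed the result to iloc. The names label, position and row are illustrative:

```python
import pandas as pd

# Illustrative frame with a duplicated row label 'i'.
dfrm = pd.DataFrame({"A": [0.14, 0.69, 0.88]}, index=["h", "i", "i"])

label = dfrm["A"].idxmax()                   # 'i', which matches two rows
position = dfrm["A"].to_numpy().argmax()     # 2, unambiguous

row = dfrm.iloc[position]                    # exactly one row
print(label, position, row["A"])
```

Selecting with .loc[label] here would return both 'i' rows, while the positional lookup returns only the maximal one.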

You might also try idxmax:
In [5]: df = pandas.DataFrame(np.random.randn(10,3),columns=['A','B','C'])
In [6]: df
Out[6]:
A B C
0 2.001289 0.482561 1.579985
1 -0.991646 -0.387835 1.320236
2 0.143826 -1.096889 1.486508
3 -0.193056 -0.499020 1.536540
4 -2.083647 -3.074591 0.175772
5 -0.186138 -1.949731 0.287432
6 -0.480790 -1.771560 -0.930234
7 0.227383 -0.278253 2.102004
8 -0.002592 1.434192 -1.624915
9 0.404911 -2.167599 -0.452900
In [7]: df.idxmax()
Out[7]:
A 0
B 8
C 7
e.g.
In [8]: df.loc[df['A'].idxmax()]
Out[8]:
A 2.001289
B 0.482561
C 1.579985

Both answers above return only one index if multiple rows take the maximum value. If you want all the rows, there does not seem to be a built-in function.
But it is not hard to do. Below is an example for Series; the same can be done for DataFrame:
In [1]: from pandas import Series, DataFrame
In [2]: s=Series([2,4,4,3],index=['a','b','c','d'])
In [3]: s.idxmax()
Out[3]: 'b'
In [4]: s[s==s.max()]
Out[4]:
b 4
c 4
dtype: int64

df.iloc[df['columnX'].argmax()]
argmax() returns the integer position of the max value of columnX (in pandas 1.0 and later). iloc can then be used to get the row of the DataFrame df at that position.

A more compact and readable solution using query() is like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
print(df)
# find row with maximum A
df.query('A == A.max()')
It also returns a DataFrame instead of Series, which would be handy for some use cases.

Very simple: we have df as below and we want to print a row with max value in C:
A B C
x 1 4
y 2 10
z 5 9
In:
df.loc[df['C'] == df['C'].max()] # condition check
Out:
A B C
y 2 10

If you want the entire row instead of just the id, you can use df.nlargest and pass in how many 'top' rows you want and you can also pass in for which column/columns you want it for.
df.nlargest(2,['A'])
will give you the rows corresponding to the top 2 values of A.
use df.nsmallest for min values.
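When the maximum is tied, nlargest keeps only the first match by default. The sketch below assumes a pandas version that supports keep="all" on nlargest (0.24 or later, as far as I know), which returns every tied row; the frame is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 0, 0, 0, 1], "C": [1, 1, 0, 1, 0]})

top_one = df.nlargest(1, "C")               # first row holding the max of C
top_all = df.nlargest(1, "C", keep="all")   # every row tied for the max of C

print(top_one.index.tolist())           # [0]
print(sorted(top_all.index.tolist()))   # [0, 1, 3]
```

With keep="all" the result can contain more rows than the n that was requested.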

The direct ".argmax()" solution does not work for me.
The previous example provided by @ely
>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
A B C
0 1.232853 -1.979459 -0.573626
1 0.140767 0.394940 1.068890
2 0.742023 1.343977 -0.579745
3 2.125299 -0.649328 -0.211692
4 -0.187253 1.908618 -1.862934
>>> df['A'].argmax()
3
>>> df['B'].argmax()
4
>>> df['C'].argmax()
1
returns the following message :
FutureWarning: 'argmax' is deprecated, use 'idxmax' instead. The behavior of 'argmax'
will be corrected to return the positional maximum in the future.
Use 'series.values.argmax' to get the position of the maximum now.
So that my solution is :
df['A'].values.argmax()

mx.iloc[0].idxmax()
This one line of code returns the column label of the maximum value in a row of the dataframe; here mx is the dataframe and iloc[0] selects the first row (position 0).

Considering this dataframe
[In]: df = pd.DataFrame(np.random.randn(4,3),columns=['A','B','C'])
[Out]:
A B C
0 -0.253233 0.226313 1.223688
1 0.472606 1.017674 1.520032
2 1.454875 1.066637 0.381890
3 -0.054181 0.234305 -0.557915
Assuming one wants to know the rows where column "C" is max, the following will do the work
[In]: df[df['C']==df['C'].max()]
[Out]:
A B C
1 0.472606 1.017674 1.520032

The idxmax of the DataFrame returns the label index of the row with the maximum value, and the behavior of argmax depends on the version of pandas (in versions before 1.0 it emits a deprecation warning). If you want to use the positional index, you can do the following:
max_row = df['A'].values.argmax()
or
import numpy as np
max_row = np.argmax(df['A'].values)
Note that np.argmax(df['A']) behaves the same as df['A'].argmax().

Use:
data.loc[data['A'].idxmax()]
data['A'].idxmax() - finds the row label of the max value
data.loc[] - returns the row with that label

If there are ties in the maximum values, then idxmax returns the index of only the first max value. For example, in the following DataFrame:
A B C
0 1 0 1
1 0 0 1
2 0 0 0
3 0 1 1
4 1 0 0
idxmax returns
A 0
B 3
C 0
dtype: int64
Now, if we want all indices corresponding to max values, then we could use max + eq to create a boolean DataFrame, then use it on df.index to filter out indexes:
out = df.eq(df.max()).apply(lambda x: df.index[x].tolist())
Output:
A [0, 4]
B [3]
C [0, 1, 3]
dtype: object

what worked for me is:
df[df['colX'] == df['colX'].max()]
You then get the row in your df with the maximum value of colX.
Then if you just want the index you can add .index at the end of the query.
