I'm trying to loop through a pandas dataframe and for every row add a new column called upper, whose value should be set according to a simple condition based on the values of two other columns of the same row.
I tried to do that using a list comprehension:
df['upper'] = [df['Close'][i] if df['Close'][i] > df['Open'][i] else df['Open'][i] for i in df]
But this line of code gives me the following error:
raise KeyError(key) from err KeyError: 'Date'
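Where Date is just another column of the dataframe that isn't even involved in that line of code. What am I doing wrong here? Is there a better way to do this? Thanks in advance!
The KeyError arises because `for i in df` iterates over the column labels, so the first value of i is 'Date' rather than a row index. pandas is built around vectorized operations, and looping over a DataFrame row by row is usually bad practice. Take the row-wise maximum of the two columns instead: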
df['upper'] = df[['Close', 'Open']].max(axis=1)
import numpy as np
df['upper'] = np.maximum(df['Close'], df['Open'])
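To illustrate the answer above, here is a minimal sketch. The sample values are made up; only the column names Date, Open, and Close come from the question. It shows what iterating over a DataFrame actually yields, then computes the new column without a loop.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "Date": ["2021-01-04", "2021-01-05", "2021-01-06"],
    "Open": [10.0, 12.0, 11.0],
    "Close": [11.5, 11.0, 11.0],
})

# Iterating over a DataFrame yields column labels, not row positions,
# which is why df['Close'][i] fails with KeyError: 'Date' on the first step.
labels = [i for i in df]

# Vectorized element-wise maximum of the two columns.
df["upper"] = np.maximum(df["Close"], df["Open"])
upper = df["upper"].tolist()
```

If a loop over rows is ever unavoidable, iterating over df.index instead of df would at least supply row labels.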
Is there a better, more idiomatic way to do this?
import pandas as pd
df : pd.DataFrame
try:
row = df.loc[key]
except KeyError:
row = None
If I wanted a column instead of a row I could just use df.get(key). I tried df.transpose().get(key), which does work, but the transpose is not just a view, it physically transposes the data so it is quite slow when used for this purpose.
You can use reindex, but for a missing key it returns a Series of NaN rather than None:
df.reindex([key]).loc[key]
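As a rough sketch of the two behaviours discussed here: the helper get_row below is hypothetical (it is not a pandas function) and assumes the index labels are unique. It checks membership in the index before calling loc, and is contrasted with the reindex approach.

```python
import pandas as pd


def get_row(df, key):
    """Hypothetical helper: return the row labelled `key`, or None if absent.

    Assumes unique index labels; with duplicates, df.loc[key] returns a DataFrame.
    """
    return df.loc[key] if key in df.index else None


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

found = get_row(df, "x")               # Series holding the row labelled "x"
missing = get_row(df, "z")             # None, because "z" is not in the index
nan_row = df.reindex(["z"]).loc["z"]   # Series of NaN instead of None
```

The membership test avoids both the try/except block and the transpose.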
The objective of the code below is to create another identical pandas dataframe, where all values are replaced with zero.
import numpy as np
import pandas as pd
#Given preexisting dataframe
len(df) #Returns 1502
def zeroCreator(data):
zeroFrame = pd.DataFrame(np.zeros(len(data),1))
return zeroFrame
print(zeroCreator(df)) #Returns a TypeError: data type not understood
How do I work around this TypeError?
Edit: Thank you for all your clarifications, it appears that I hadn't entered the dataframe parameters correctly into np.zeros (missing a pair of parentheses), although a simpler solution does exist.
Just clone a new df and assign 0 to it
zero_df = df.copy()
zero_df[:] = 0
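For reference, a small sketch of both fixes mentioned above, using a made-up three-row frame in place of the asker's data. np.zeros expects the shape as a single tuple; a second positional argument is interpreted as the dtype, which is where the TypeError comes from. The last line is an alternative I am adding as an assumption about what is wanted: a zero frame with the same index and columns, built directly.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the preexisting dataframe.
df = pd.DataFrame({"a": [1.5, 2.5, 3.5], "b": [4, 5, 6]})


def zero_creator(data):
    # The shape goes in as one tuple: (rows, columns).
    return pd.DataFrame(np.zeros((len(data), 1)))


single_column = zero_creator(df)

# Same labels as df, every value zero, without copying the original data.
zero_df = pd.DataFrame(0, index=df.index, columns=df.columns)
```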
I wanted to use an if condition and df.loc[..] to compare two values in the same column.
If the previous value is higher than the next one, I want to delete the complete row.
This is what I tried, with my example:
import pandas as pd
data = [('cycle',[1,1,2,2,3,3,4,4]),
('A',[0.1,0.5,0.2,0.6,0.15,0.43,0.13,0.59]),
('B',[ 500, 600, 510,580,512,575,499,598]),
('time',[0.0,0.2,0.5,0.4,0.6,0.7,0.5,0.8]),]
df = pd.DataFrame.from_items(data)
df = df.drop(df.loc[i,'time']<df.loc[i-1,'time'].index)
print(df)
and I got the following error:
TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'
Help is very much appreciated.
Try this:
df.drop(df.loc[df.time< df.time.shift()].index, inplace=True)
One problem is you are applying .index on the second df, before the comparison. You might try something like this:
df = df.drop((df.loc[i,'time'] < df.loc[i-1,'time']).index)
Try using pd.DataFrame.shift
Using shift:
df[df.time > df.time.shift()]
df.time.shift() returns the series with its values moved down one position (the index itself is unchanged), so each value can be compared with the one immediately before it. The first position has no predecessor and is filled with NaN by default; the fill_value parameter lets you choose a different value for it:
df[df.time > df.time.shift(fill_value=0)]
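Here is a sketch of the shift approach on the question's data. The frame is built from a dict because DataFrame.from_items was removed in pandas 1.0. The variable names previous, decreasing, and result are mine. Negating the "less than previous" mask keeps the first row, whereas a plain "greater than previous" filter drops it, since any comparison with NaN is False.

```python
import pandas as pd

df = pd.DataFrame({
    "cycle": [1, 1, 2, 2, 3, 3, 4, 4],
    "A": [0.1, 0.5, 0.2, 0.6, 0.15, 0.43, 0.13, 0.59],
    "B": [500, 600, 510, 580, 512, 575, 499, 598],
    "time": [0.0, 0.2, 0.5, 0.4, 0.6, 0.7, 0.5, 0.8],
})

# Value from the row above; NaN for the first row.
previous = df["time"].shift()

# True where time went down compared with the previous row.
decreasing = df["time"] < previous

# Keep every row that did not decrease (the first row is kept too).
result = df[~decreasing]
```

Note that this is a single pass: each row is compared with its neighbour in the original frame, not with the last row that survived the filter.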
Can anyone tell me how I should select one column with 'loc' in a dataframe using Dask?
As a side note, when I load the dataframe using dd.read_csv with header=None, the column names run from 0 to 131094. I am trying to select the last column, whose name is 131094, and I get an error.
Code:
import dask.dataframe as dd
df = dd.read_csv('filename.csv', header=None)
y = df.loc['131094']
error:
File "/usr/local/dask-2018-08-22/lib/python2.7/site-packages/dask-0.5.0-py2.7.egg/dask/dataframe/core.py", line 180, in _loc
"Can not use loc on DataFrame without known divisions")
ValueError: Can not use loc on DataFrame without known divisions
Based on this guideline http://dask.pydata.org/en/latest/dataframe-indexing.html#positional-indexing, my code should work, but I don't know what causes the problem.
If you have a named column, then use: df.loc[:,'col_name']
But if you have a positional column, like in your example where you want the last column, then use the answer by @user1717828.
I tried this on a dummy csv and it worked. I can't help you for sure without seeing the file giving you problems. That said, you might be picking rows, not columns.
Instead, try this.
import dask.dataframe as dd
df = dd.read_csv('filename.csv', header=None)
y = df[df.columns[-1]]
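The rows-versus-columns point can be sketched with plain pandas, used here only as a stand-in so the example runs without a Dask installation; that Dask labels columns the same way is an assumption based on the question's description. With header=None the column labels are integers, and a single argument to loc addresses rows, so a string such as '131094' matches neither.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for 'filename.csv' (hypothetical contents).
csv_text = "1,2,3\n4,5,6\n"
df = pd.read_csv(io.StringIO(csv_text), header=None)

# With header=None the columns are labelled with integers 0, 1, 2, ...
column_labels = list(df.columns)

# Column selection by label needs the row slice first: loc[rows, columns].
last_by_label = df.loc[:, 2]

# Column selection by position, as in the answer above.
last_by_position = df[df.columns[-1]]
```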
I'm using Pandas with Python 3. I have a dataframe with a bunch of columns, but I only want to change the data type of all the values in one of the columns and leave the others alone. The only way I could find to accomplish this is to edit the column, remove the original column and then merge the edited one back. I would like to edit the column without having to remove and merge, leaving the rest of the dataframe unaffected. Is this possible?
Here is my solution now:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
def make_float(var):
var = float(var)
return var
#create a new dataframe with the value types I want
df2 = df1['column'].apply(make_float)
#remove the original column
df3 = df1.drop('column', axis=1)
#merge the dataframes
df1 = pd.concat([df3,df2],axis=1)
It also doesn't work to apply the function to the dataframe directly. For example:
df1['column'].apply(make_float)
print(type(df1.iloc[1]['column']))
yields:
<class 'str'>
df1['column'] = df1['column'].astype(float)
It will raise an error if conversion fails for some row.
apply does not work in place; it returns a new Series, which you discard in this line:
df1['column'].apply(make_float)
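Apart from Yakym's solution, if the column already holds numbers (for example integers), you can also do this. It will not convert strings, since adding a float to a str raises a TypeError: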
df['column'] += 0.0
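Pulling the answers above together, here is a minimal sketch with made-up data. It shows that the unassigned apply leaves the frame untouched and that assigning the result of astype back changes only that one column. The pd.to_numeric call with errors="coerce" is an addition of mine, not something the answers mention; it is one way to get NaN instead of an error when some values cannot be parsed.

```python
import pandas as pd

# Hypothetical frame: one string column to convert, one to leave alone.
df1 = pd.DataFrame({"column": ["1.5", "2", "3.25"], "other": ["a", "b", "c"]})

# The result is discarded, so df1 is unchanged.
df1["column"].apply(float)
dtype_before = df1["column"].dtype

# Assigning back replaces just this column.
df1["column"] = df1["column"].astype(float)
dtype_after = df1["column"].dtype

# Unparseable entries become NaN instead of raising.
messy = pd.Series(["1.5", "oops"])
coerced = pd.to_numeric(messy, errors="coerce")
```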