Is there a better, more idiomatic way to do this?
import pandas as pd

df: pd.DataFrame

try:
    row = df.loc[key]
except KeyError:
    row = None
If I wanted a column instead of a row, I could just use df.get(key). I tried df.transpose().get(key), which does work, but the transpose is not just a view: it physically rearranges the data, so it is quite slow when used for this purpose.
You could use reindex, but for a missing key that returns a row of NaN rather than None:
df.reindex([key]).loc[key]
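One simpler alternative, assuming key is a single label, is to test membership in the index first; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])
key = 'z'

# no try/except needed: None when the label is absent
row = df.loc[key] if key in df.index else None
print(row)  # None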
I'm trying to loop through a pandas dataframe and add a new column called upper, whose value for each row should be set according to a simple condition based on the values of two other columns of the same row.
I tried to do that using list comprehension:
df['upper'] = [df['Close'][i] if df['Close'][i] > df['Open'][i] else df['Open'][i] for i in df]
But this line of code gives me the following error:
raise KeyError(key) from err
KeyError: 'Date'
Where Date is just another column of the dataframe that isn't even involved in that line of code. What am I doing wrong here? Is there a better way to do this? Thanks in advance!
pandas is a vectorized library, and looping over a DataFrame row by row is bad practice. (The KeyError comes from the loop itself: iterating over a DataFrame yields its column names, so df['Close'][i] ends up looking for a row labelled 'Date'.) Use a vectorized operation instead:
df['upper'] = df[['Close', 'Open']].max(axis=1)
Alternatively, with NumPy:
import numpy as np
df['upper'] = np.maximum(df['Close'], df['Open'])
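For conditions that don't reduce to a simple maximum, np.where expresses the same row-wise if/else in vectorized form; a sketch using the df from the question:

import numpy as np

# vectorized if/else: take 'Close' where the condition holds, otherwise 'Open'
df['upper'] = np.where(df['Close'] > df['Open'], df['Close'], df['Open'])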
I'm accessing a fairly large series of json files and storing them in a pandas series that is part of a larger dataframe. There are several fields I want from each json, some of which are nested; I've been extracting them with json_normalize. The goal is to then merge these new fields into my original dataframe.
My problem is that when I do so, instead of getting a dataframe with J rows and K columns, I get a J-length series in which each element is a 1xK dataframe. I'm wondering if there is an efficient, vectorized way to turn this nested series of dataframes into a regular dataframe, or a way to get a regular dataframe from the start.
I've used map/lambda to create my nested series. Right now I'm unnesting with iteritems/append, but there has to be a more efficient way.
from pandas import json_normalize

url_base = 'http://foo.bar='
df['http'] = df['id'].map(lambda x: url_base + x)
df['json'] = df['http'].map(lambda x: nf.get_json(x))
nest_ser = df['json'].map(lambda x: json_normalize(x))

df = pd.DataFrame()
for index, item in nest_ser.iteritems():
    df = df.append(item)
json_normalize produces:
pd.Series([pd.DataFrame([col1, col2, ...]), pd.DataFrame([col1, col2, ...]), pd.DataFrame([col1, col2, ...])])
instead of
pd.DataFrame([col1,col2...])
Suppose the output series from json_normalize is named sr; you can concatenate all of the one-row frames in a single call:
pd.concat(sr.tolist())
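A minimal sketch with hypothetical one-row frames standing in for what json_normalize yields per record:

import pandas as pd

sr = pd.Series([pd.DataFrame({'a': [1], 'b': [2]}),
                pd.DataFrame({'a': [3], 'b': [4]})])

# one concat over the whole list is far cheaper than appending in a loop
flat = pd.concat(sr.tolist(), ignore_index=True)
print(flat)
#    a  b
# 0  1  2
# 1  3  4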
So I want to create a function in which part of the code modifies an existing pandas dataframe df, and under some conditions the df will be modified to empty. The challenge is that this function is not allowed to return the dataframe itself; it can only modify the df by handling the alias. An example is the following function:
import pandas as pd
import random

def random_df_modifier(df):
    letter_lst = list('abc')
    message_lst = [f'random {i}' for i in range(len(letter_lst) - 1)] + ['BOOM']
    chosen_tup = random.choice(list(zip(letter_lst, message_lst)))
    df[chosen_tup[0]] = chosen_tup[1]
    if chosen_tup[0] == letter_lst[-1]:
        print('Game over')
        df = pd.DataFrame()  # <-- this line won't work as intended
    return chosen_tup

testing_df = pd.DataFrame({'col1': [True, False]})
print(random_df_modifier(testing_df))
I am aware that the reason df = pd.DataFrame() won't work is that the local name df is rebound to the new pd.DataFrame() instead of mutating the object behind the input alias. So is there any way to change the df in place to an empty dataframe?
Thank you in advance.
EDIT1: df.drop(df.index, inplace=True) seems to work as intended, but I am not sure about its efficiency, because df.drop() may suffer from performance issues when the dataframe is big enough (by big enough I mean 1mil+ total entries).
df = pd.DataFrame(columns=df.columns)
will empty a dataframe in pandas (and be way faster than using the drop method). I believe that is what you're asking. Note, though, that inside your function this assignment rebinds the local name exactly like df = pd.DataFrame() did; to empty the caller's dataframe through the alias you still need an in-place call such as the df.drop(df.index, inplace=True) from your edit.
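A minimal sketch of emptying through the alias with the in-place drop from EDIT1:

import pandas as pd

def empty_inplace(df):
    # mutates the shared object, so the caller's variable sees the change
    df.drop(df.index, inplace=True)

testing_df = pd.DataFrame({'col1': [True, False]})
empty_inplace(testing_df)
print(testing_df.empty)  # True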
I'm using Pandas with Python 3. I have a dataframe with a bunch of columns, but I only want to change the data type of all the values in one of the columns and leave the others alone. The only way I could find to accomplish this is to edit the column, remove the original column, and then merge the edited one back. I would like to edit the column without having to remove and merge, leaving the rest of the dataframe unaffected. Is this possible?
Here is my solution now:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

def make_float(var):
    var = float(var)
    return var

# create a new dataframe with the value types I want
df2 = df1['column'].apply(make_float)

# remove the original column
df3 = df1.drop('column', 1)

# merge the dataframes
df1 = pd.concat([df3, df2], axis=1)
It also doesn't work to apply the function to the dataframe directly. For example:
df1['column'].apply(make_float)
print(type(df1.iloc[1]['column']))
yields:
<class 'str'>
Use astype and assign the result back to the column:
df1['column'] = df1['column'].astype(float)
It will raise an error if the conversion fails for some row.
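If some rows may not parse cleanly, pd.to_numeric with errors='coerce' turns the failures into NaN instead of raising; a small sketch:

import pandas as pd

df1 = pd.DataFrame({'column': ['1.5', 'oops', '3']})
df1['column'] = pd.to_numeric(df1['column'], errors='coerce')
print(df1['column'].tolist())  # [1.5, nan, 3.0]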
apply does not work in place; it returns a new series, which you discard in this line:
df1['column'].apply(make_float)
Apart from Yakym's solution, you can also coerce a numeric column to float with arithmetic:
df['column'] += 0.0
I need to filter data to specific hours of the day. The DataFrame method between_time seems to be the proper way to do that; however, it only works on the index of the dataframe, and I need the data in its original shape (e.g. pivot tables expect the datetime column under its proper name, not as the index).
This means that each filter looks something like this:
df.set_index(keys='my_datetime_field').between_time('8:00','21:00').reset_index()
This implies two reindexing operations every time such a filter is run.
Is this a good practice or is there a more appropriate way to do the same thing?
Create a DatetimeIndex, but store it in a variable rather than setting it on the DataFrame. Then call its indexer_between_time method. This returns an integer array that can be used to select rows from df with iloc:
import pandas as pd
import numpy as np
N = 100
df = pd.DataFrame({
    'date': pd.date_range('2000-1-1', periods=N, freq='H'),
    'value': np.random.random(N),
})

index = pd.DatetimeIndex(df['date'])
df.iloc[index.indexer_between_time('8:00', '21:00')]
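This keeps df untouched: the DatetimeIndex lives in its own variable, so the datetime column stays a regular column and downstream consumers such as pivot tables still see it under its original name, without the set_index/reset_index round trip.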