How to unpack the columns of a pandas DataFrame to multiple variables - python

Lists or numpy arrays can be unpacked to multiple variables if the dimensions match. For a 3xN array, the following will work:
import numpy as np
a,b = [[1,2,3],[4,5,6]]
a,b = np.array([[1,2,3],[4,5,6]])
# result: a=[1,2,3], b=[4,5,6]
How can I achieve a similar behaviour for the columns of a pandas DataFrame? Extending the above example:
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6]])
df.columns = ['A','B','C'] # Rename cols and
df.index = ['i', 'ii'] # rows for clarity
The following does not work as expected:
a,b = df.T
# result: a='i', b='ii'
a,b,c = df
# result: a='A', b='B', c='C'
However, what I would like to get is the following:
a,b,c = unpack(df)
result: a=df['A'], b=df['B'], c=df['C']
Is the function unpack already available in pandas? Or can it be mimicked in an easy way?

I just figured that the following works, which is already close to what I try to achieve:
a,b,c = df.T.values # Common
a,b,c = df.T.to_numpy() # Recommended
# a,b,c = df.T.as_matrix() # Deprecated
Details: As always, things are a little more complicated than one thinks. Note that a pd.DataFrame stores columns separately in Series. Calling df.values (or better: df.to_numpy()) is potentially expensive, as it combines the columns in a single ndarray, which likely involves copying actions and type conversions. Also, the resulting container has a single dtype able to accommodate all data in the data frame.
In summary, the above approach loses the per-column dtype information and is potentially expensive. It is technically cleaner to iterate the columns in one of the following ways (there are more options):
# The following alternatives create VIEWS!
a,b,c = (v for _,v in df.items()) # returns pd.Series
a,b,c = (df[c] for c in df) # returns pd.Series
Note that the above creates views! Modifying the data likely will trigger a SettingWithCopyWarning.
a.iloc[0] = "blabla" # raises SettingWithCopyWarning
If you want to modify the unpacked variables, you have to copy the columns.
# The following alternatives create COPIES!
a,b,c = (v.copy() for _,v in df.items()) # returns pd.Series
a,b,c = (df[c].copy() for c in df) # returns pd.Series
a,b,c = (df[c].to_numpy() for c in df) # returns np.ndarray
While this is cleaner, it requires more characters. I personally do not recommend the above approach for production code. But to avoid typing (e.g., in interactive shell sessions), it is still a fair option...
# More verbose and explicit alternatives
a,b,c = df["the first col"], df["the second col"], df["the third col"]
a,b,c = df.iloc[:,0], df.iloc[:,1], df.iloc[:,2]

The dataframe.values shown method is indeed a good solution, but it involves building a numpy array.
In the case you want to access pandas series methods after unpacking, I personally use a different approach.
For the people like me that use a lot of chained methods, I have a solution by adding a custom unpacking method to pandas. Note that this may not be very good for production pipelines, but it is very handy in ad-hoc data analyses.
df = pd.DataFrame({
"lat": [30, 40],
"lon": [0, 1],
})
This approach involves returning a generator on a .unpack() call.
from typing import Tuple
def unpack(self: pd.DataFrame) -> Tuple[pd.Series]:
return (
self[col]
for col in self.columns
)
pd.DataFrame.unpack = unpack
This can be used in two major ways.
Either directly as a solution to your problem:
lat, lon = df.unpack()
Or, can be used in a method chaining.
Imagine a geo function which has to take a latitude serie in the first arg and a longitude in the second arg, named do_something_geographical(lat, lon)
df_result = (
df
.(...some method chaining...)
.assign(
geographic_result=lambda dataframe: do_something_geographical(dataframe[["lat", "lon"]].unpack())
)
.(...some method chaining...)
)

Related

Use ast.literal_eval on all columns of a Pandas Dataframe

I have a data frame that looks like the following:
Category Class
==========================
['org1', 'org2'] A
['org2', 'org3'] B
org1 C
['org3', 'org4'] A
org2 A
['org2', 'org4'] B
...
When I read in this data using Pandas, the lists are read in as strings (e.g., dat['Category][0][0] returns [ rather than returning org1). I have several columns like this. I want every categorical column that already contains at least one list to have all records be a list. For example, the above data frame should look like the following:
Category Class
==========================
['org1', 'org2'] A
['org2', 'org3'] B
['org1'] C
['org3', 'org4'] A
['org2'] A
['org2', 'org4'] B
...
Notice how the singular values in the Category column are now contained in lists. When I reference dat['Category][0][0], I'd like org1 to be returned.
What is the best way to accomplish this? I was thinking of using ast.literal_eval with an apply and lambda function, but I'd like to try and use best-practices if possible. Thanks in advance!
You could create a boolean mask of the values that need to changed. If there are no lists, no change is needed. If there are lists, you can apply literal_eval or a list creation lambda to subsets of the data.
import ast
import pandas as pd
def normalize_category(df):
is_list = df['Category'].str.startswith('[')
if is_list.any():
df.loc[is_list,'Category'] = df.loc[is_list, 'Category'].apply(ast.literal_eval)
df.loc[~is_list,'Category'] = df.loc[~is_list]['Category'].apply(lambda val: [val])
df = pd.DataFrame({"Category":["['org1', 'org2']", "org1"], "Class":["A", "B"]})
normalize_category(df)
print(df)
df = pd.DataFrame({"Category":["org2", "org1"], "Class":["A", "B"]})
normalize_category(df)
print(df)
You can do it like this:
df['Category'] = df['Category'].apply(lambda x: literal_eval(x) if x.startswith('[') else [x])

How to iterate over very big dataframes in python?

I have a code and my dataframe contains almost 800k rows and therefore it is impossible to iterate over it by using standard methods. I searched a little bit and see a method of iterrows() but i couldn't understand how to use. Basicly this is my code and can you help me how to update it for iterrows()?
**
for i in range(len(x["Value"])):
if x.loc[i ,"PP_Name"] in ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay'] :
x.loc[i,"Santral_Type"] = "HES"
elif x.loc[i ,"PP_Name"] in ['BND','BND2','TFB','TFB3','TFB4','KNT']:
x.loc[i,"Santral_Type"] = "TERMIK"
elif x.loc[i ,"PP_Name"] in ['BRS','ÇKL','DPZ']:
x.loc[i,"Santral_Type"] = "RES"
else : x.loc[i,"Santral_Type"] = "SOLAR"
**
How to iterate over very big dataframes -- In general, you don't. You should use some sort of vectorize operation to the column as a whole. For example, your case can be map and fillna:
map_dict = {
'HES' : ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay'],
'TERMIK' : ['BND','BND2','TFB','TFB3','TFB4','KNT'],
'RES' : ['BRS','ÇKL','DPZ']
}
inv_map_dict = {x:k for k,v in map_dict.items() for x in v}
df['Santral_Type'] = df['PP_Name'].map(inv_map_dict).fillna('SOLAR')
It is not advised to iterate through DataFrames for these things. Here is one possible way of doing it, applied to all rows of the DataFrame x at once:
# Default value
x["Santral_Type"] = "SOLAR"
x.loc[x.PP_Name.isin(['BRS','ÇKL','DPZ']), 'Santral_Type'] = "RES"
x.loc[x.PP_Name.isin(['BND','BND2','TFB','TFB3','TFB4','KNT']), 'Santral_Type'] = "TERMIK"
hes_list = ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']
x.loc[x.PP_Name.isin(hes_list), 'Santral_Type'] = "HES"
Note that 800k can not be considered a large table when using standard pandas methods.
I would advise strongly against using iterrows and for loops when you have vectorised solutions available which take advantage of the pandas api.
this is your code adapted with numpy which should run much faster than your current method.
import numpy as np
col = 'PP_Name'
conditions = [
x[col].isin(
['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']
),
x[col].isin(["BND", "BND2", "TFB", "TFB3", "TFB4", "KNT"]),
x[col].isin(["BRS", "ÇKL", "DPZ"])]
outcomes = ["HES", "TERMIK", "RES"]
x["Santral_Type"] = np.select(conditions, outcomes, default='SOLAR')
df.iterrows() according to documentation returns a tuple (index, Series).
You can use it like this:
for row in df.iterrows():
if row[1]['PP_Name'] in ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']:
df['Santral_Type] = "HES"
# and so on
By the way, I must say, using iterrows is going to be very slow, and looking at your sample code it's clear you can use simple pandas selection techniques to do this without explicit loops.
Better to do it as #mcsoini suggested
the simplest method could be .values, example:
def f(x0,...xn):
return('hello or some complicated operation')
df['newColumn']=[f(r[0],r[1],...,r[n]) for r in df.values]
the drawbacks of this method as far as i know is that you cannot refer to the column values by name but just by position and there is no info about the index of the df.
Advantage is faster than iterrows, itertuples and apply methods.
hope it helps

How to insert a multidimensional numpy array to pandas column?

I have some numpy array, whose number of rows (axis=0) is the same as a pandas dataframe's number of rows.
I want to create a new column in the dataframe, for which each entry would be a numpy array of a lesser dimension.
Code:
some_df = pd.DataFrame(columns=['A'])
for i in range(10):
some_df.loc[i] = [np.random.rand(4, 6, 8)
data = np.stack(some_df['A'].values) #shape (10, 4, 6, 8)
processed = np.max(data, axis=1) # shape (10, 6, 8)
some_df['B'] = processed # This fails
I want the new column 'B' to contain numpy arrays of shape (6, 8)
How can this be done?
This is not recommended, it is pain, slow and later processing is not easy.
One possible solution is use list comprehension:
some_df['B'] = [x for x in processed]
Or convert to list and assign:
some_df['B'] = processed.tolist()
Coming back to this after 2 years, here is a much better practice:
from itertools import product, chain
import pandas as pd
import numpy as np
from typing import Dict
def calc_col_names(named_shape):
*prefix, shape = named_shape
names = [map(str, range(i)) for i in shape]
return map('_'.join, product(prefix, *names))
def create_flat_columns_df_from_dict_of_numpy(
named_np: Dict[str, np.array],
n_samples_per_np: int,
):
named_np_correct_lenth = {k: v for k, v in named_np.items() if len(v) == n_samples_per_np}
flat_nps = [a.reshape(n_samples_per_np, -1) for a in named_np_correct_lenth.values()]
stacked_nps = np.column_stack(flat_nps)
named_shapes = [(name, arr.shape[1:]) for name, arr in named_np_correct_lenth.items()]
col_names = [*chain.from_iterable(calc_col_names(named_shape) for named_shape in named_shapes)]
df = pd.DataFrame(stacked_nps, columns=col_names)
df = df.convert_dtypes()
return df
def parse_series_into_np(df, col_name, shp):
# can parse the shape from the col names
n_samples = len(df)
col_names = sorted(c for c in df.columns if col_name in c)
col_names = list(filter(lambda c: c.startswith(col_name + "_") or len(col_names) == 1, col_names))
col_as_np = df[col_names].astype(np.float).values.reshape((n_samples, *shp))
return col_as_np
usage to put a ndarray into a Dataframe:
full_rate_df = create_flat_columns_df_from_dict_of_numpy(
named_np={name: np.array(d[name]) for name in ["name1", "name2"]},
n_samples_per_np=d["name1"].shape[0]
)
where d is a dict of nd arrays of the same shape[0], hashed by ["name1", "name2"].
The reverse operation can be obtained by parse_series_into_np.
The accepted answer remains, as it answers the original question, but this one is a much better practice.
I know this question already has an answer to it, but I would like to add a much more scalable way of doing this. As mentioned in the comments above it is in general not recommended to store arrays as "field"-values in a pandas-Dataframe column (I actually do not know why?). Nevertheless, in my day to day work this is an extermely important functionality when working with time-series data and a bunch of related meta-data.
In general I organize my experimantal time-series in form of pandas dataframes with one column holding same-length numpy arrays and the other columns containing information on meta-data with respect to certain measurement conditions etc.
The proposed solution by jezrael works very well, and I used this for the last 4 years on a regular basis. But this method potentially encounters huge memory problems. In my case I came across these problems working with dataframes beyond 5 Million rows and time-series with approx. 100 data points.
The solution to these problems is extremely simple, since I did not find it anywhere I just wanted to share it here: Simply transform your 2D array to a pandas-Series object and assign this to a column of your dataframe:
df["new_list_column"] = pd.Series(list(numpy_array_2D))

Pandas convert columns type from list to np.array

I'm trying to apply a function to a pandas dataframe, such a function required two np.array as input and it fit them using a well defined model.
The point is that I'm not able to apply this function starting from the selected columns since their "rows" contain list read from a JSON file and not np.array.
Now, I've tried different solutions:
#Here is where I discover the problem
train_df['result'] = train_df.apply(my_function(train_df['col1'],train_df['col2']))
#so I've tried to cast the Series before passing them to the function in both these ways:
X_col1_casted = trai_df['col1'].dtype(np.array)
X_col2_casted = trai_df['col2'].dtype(np.array)
doesn't work.
X_col1_casted = trai_df['col1'].astype(np.array)
X_col2_casted = trai_df['col2'].astype(np.array)
doesn't work.
X_col1_casted = trai_df['col1'].dtype(np.array)
X_col2_casted = trai_df['col2'].dtype(np.array)
does'nt work.
What I'm thinking to do now is a long procedure like:
starting from the uncasted column-series, convert them into list(), iterate on them apply the function to the np.array() single elements, and append the results into a temporary list. Once done I will convert this list into a new column. ( clearly, I don't know if it will work )
Does anyone of you know how to help me ?
EDIT:
I add one example to be clear:
The function assume to have as input two np.arrays. Now it has two lists since they are retrieved form a json file. The situation is this one:
col1 col2 result
[1,2,3] [4,5,6] [5,7,9]
[0,0,0] [1,2,3] [1,2,3]
Clearly the function is not the sum one, but a own function. For a moment assume that this sum can work only starting from arrays and not form lists, what should I do ?
Thanks in advance
Use apply to convert each element to it's equivalent array:
df['col1'] = df['col1'].apply(lambda x: np.array(x))
type(df['col1'].iloc[0])
numpy.ndarray
Data:
df = pd.DataFrame({'col1': [[1,2,3],[0,0,0]]})
df

Should pandas dataframes be nested?

I am creating a python script that drives an old fortran code to locate earthquakes. I want to vary the input parameters to the fortran code in the python script and record the results, as well as the values that produced them, in a dataframe. The results from each run are also convenient to put in a dataframe, leading me to a situation where I have a nested dataframe (IE a dataframe assigned to an element of a data frame). So for example:
import pandas as pd
import numpy as np
def some_operation(row):
results = np.random.rand(50, 3) * row['p1'] / row['p2']
res = pd.DataFrame(results, columns=['foo', 'bar', 'rms'])
return res
# Init master df
df_master = pd.DataFrame(columns=['p1', 'p2', 'results'], index=range(3))
df_master['p1'] = np.random.rand(len(df_master))
df_master['p2'] = np.random.rand(len(df_master))
df_master = df_master.astype(object) # make sure generic types can be used
# loop over each row, call some_operation and store results DataFrame
for ind, row in df_master.iterrows():
df_master.loc[ind, "results"] = some_operation(row)
Which raises this exception:
ValueError: Incompatible indexer with DataFrame
It works as expected, however, if I change the last line to this:
df_master["results"][ind] = some_operation(row)
I have a few questions:
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc., it seems to work fine.
Should the DataFrame be used in this way? I know that dtype object can be ultra slow for sorting and whatnot, but I am really just using the dataframe a convenient container because the column/index notation is quite slick. If DataFrames should not be used in this way is there similar alternative? I was looking at the Panel class but I am not sure if it is the proper solution for my application. I would hate forge ahead and apply the hack shown above to some code and then have it not supported in future releases of pandas.
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc. it seems to work fine.
This is a strange little corner case of the code. It stems from the fact that if the item being assigned is a DataFrame, loc and ix assume that you want to fill the given indices with the content of the DataFrame. For example:
>>> df1 = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})
>>> df2 = pd.DataFrame({'a':[100], 'b':[200]})
>>> df1.loc[[0], ['a', 'b']] = df2
>>> df1
a b
0 100 200
1 2 5
2 3 6
If this syntax also allowed storing a DataFrame as an object, it's not hard to imagine a situation where the user's intent would be ambiguous, and ambiguity does not make a good API.
Should the DataFrame be used in this way?
As long as you know the performance drawbacks of the method (and it sounds like you do) I think this is a perfectly suitable way to use a DataFrame. For example, I've seen a similar strategy used to store the trained scikit-learn estimators in cross-validation across a large grid of parameters (though I can't recall the exact context of this at the moment...)

Categories