Convert Array of DataFrames to Single DataFrame - python

TLDR: I don't know how to take an array of DataFrames and build a MultiIndex around it.
TLDR 2: From my research, it seems better to work with the returned result than to try to work off of a global/single/shared DataFrame in Pool().map(). If somebody has a way to share one, I'm all for it.
I am trying to merge an array of MultiIndex pandas DataFrames that have been returned by a Pool().map() call.
p = Pool()
results = p.map(run_experiment, experiment_collection)
Pool().map() returns a list. Let's say the process returns a list of 1000 DataFrames, each with a first index of [0:5] and a second of [0:50].
What I want is a single final DataFrame that keeps each experiment separate, so [0:1000] / [0:5] / [0:50].
I know how to create the MultiIndex using np.zeros_like and then fill the DataFrame, but I don't know how to take an array of DataFrames and build a MultiIndex around it.
rounds = range(0, 1000)
levels = [... some set of levels ...]
labels = [... some set of labels ...]
iterables = [rounds, labels, levels]
names = ['round', 'label', 'values']
index = pd.MultiIndex.from_product(iterables, names=names)

index_names = [... some set of index names ...]
empty_df = pd.DataFrame(
    np.zeros_like(np.random.randn(5, 50000)),
    index=index_names,
    columns=index
).sort_index().sort_index(axis=1)
In my first example, results is a list of MultiIndex DataFrames of levels / labels. What I am trying to do is create a final DataFrame in which the top level (the list holding all these DataFrames) simply becomes a new index level.
When I try
p = Pool()
results = pd.DataFrame(
    p.map(run_experiment, experiment_collection),
    index=index_names,
    columns=index
).sort_index().sort_index(axis=1)
I am getting ValueError: Shape of passed values is (1, 1000), indices imply (shape of intended index), which makes sense because it is an array of 1000 DataFrames.
If I concatenate (which feels like the better way to go)
results = pd.concat(p.map(run_experiment, experiment_collection))
I get a DataFrame with levels / labels, but no round.
iterables = [rounds, labels, levels]
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
keys=None, levels=None, names=None, verify_integrity=False)
I'm not sure which of the options (keys, levels, names) I should be manipulating here to get my rounds back into the DataFrame.
results = pd.concat(
    p.map(run_experiment, experiment_collection),
    levels=iterables,
    names=names,
    axis=1
)
Gets me really close to the format I want, but no round.
I can get more specific if I need to, but not entirely sure what else would be helpful in getting to the answer.

Lacking a better answer, I am recreating my original DataFrame and iterating through the result returned from Pool().map(), inserting each list position into the DataFrame. There has to be a better way, but I can't think of it.
p = Pool()
results = p.map(run_experiment, experiment_collection)

final_df = pd.DataFrame(
    np.zeros_like(np.random.randn(5, 50000)),
    index=index_names,
    columns=index
).sort_index().sort_index(axis=1)

for increment_value, result in enumerate(results):
    final_df[increment_value] = result

There are several good ways to do this:
1) If you are starting with a bunch of Series objects:
Set each Series object's name attribute to a tuple, then use pd.concat(list_of_series, axis=1).
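A minimal sketch of that first idea with made-up data: when every Series name is a tuple, pd.concat builds MultiIndex columns from those tuples.
import pandas as pd

s1 = pd.Series([1, 2, 3], name=("exp0", "label_a"))
s2 = pd.Series([4, 5, 6], name=("exp0", "label_b"))

# The tuple names become two-level MultiIndex columns.
wide = pd.concat([s1, s2], axis=1)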
2) If you have a single level map of dataframes, you can use the fact that pd.concat can accept a dict as its first argument. E.g.
pd.concat({'A': df1, 'B': df2}, axis=1)
will create a MultiIndex with 'A', 'B' as the top level and the columns of each df as the second level. Although you cannot nest dicts, you can do this multiple times to build an index of arbitrary depth.
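For the question above, a hedged sketch of this dict approach, reusing the question's p, run_experiment and experiment_collection (the enumerate keying is my own assumption): the dict keys become the new outermost column level, which supplies the missing round.
import pandas as pd

results = p.map(run_experiment, experiment_collection)

# Each dict key becomes the outermost column level (the round number).
final_df = pd.concat(
    {rnd: df for rnd, df in enumerate(results)},
    axis=1
).sort_index().sort_index(axis=1)

# Optionally name the levels to match the question's `names` list,
# assuming each returned frame already has two-level label/values columns.
final_df.columns.names = ['round', 'label', 'values']
Passing the list directly with keys=list(range(len(results))) is an equivalent spelling.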
3) You can use the DataFrame constructor on a df but pass a list of tuples as the column names. E.g. if you have a df with columns A, B, then df_new = pd.DataFrame(df, columns=[("Foo","A"), ("Foo","B")]) will create a new df with a MultiIndex on the columns. You can do this for your DataFrames individually and then concatenate them; pandas will correctly concatenate two DataFrames whose indexes have the same number of levels.
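A hedged sketch of that third idea with hypothetical frames; one reliable way to get tuple-labelled columns is to assign a MultiIndex built from tuples and then concatenate:
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

# Relabel each frame's columns with tuples -> two-level MultiIndex columns.
df1.columns = pd.MultiIndex.from_tuples([("Foo", "A"), ("Foo", "B")])
df2.columns = pd.MultiIndex.from_tuples([("Bar", "A"), ("Bar", "B")])

combined = pd.concat([df1, df2], axis=1)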

Related

Dropping column in dataframe with assignment not working in a loop

I have two dataframes (df_train and df_test) containing a column ('Date') that I want to drop.
As far as I understood, I could do it in two ways, i.e. either by using inplace or by assigning the dataframe to itself, like:
if 'Date' in df_train.columns:
    df_train.drop(['Date'], axis=1, inplace=True)
OR
if 'Date' in df_train.columns:
    df_train = df_train.drop(['Date'], axis=1)
Both methods work on a single dataframe, but the former should be more memory friendly, since with the assignment a copy of the dataframe is created.
The weird thing is, I have to do it for both the dataframes, so I tried to do the same within a loop:
for data in [df_train, df_test]:
    if 'Date' in data.columns:
        data.drop(['Date'], axis=1, inplace=True)
and
for data in [df_train, df_test]:
    if 'Date' in data.columns:
        data = data.drop(['Date'], axis=1)
and the weird thing is that, in this case, only the first way (using inplace) works. If I use the second way, the 'Date' columns aren't dropped.
Why is that?
It doesn't work because reassigning data inside the loop only rebinds the loop variable to a new DataFrame; it never changes the objects in the list, nor the original df_train and df_test. You should try:
lst = []
for data in [df_train, df_test]:
    if 'Date' in data.columns:
        lst.append(data.drop(['Date'], axis=1))
print(lst)
Now lst contains all the dataframes.
It's better to use a list comprehension:
res = [data.drop(['Date'], axis=1) for data in [df_train, df_test] if 'Date' in data.columns]
Here, you will get a copy of both dataframes after columns are dropped.
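If you want to keep working under the original variable names, a small variation on the comprehension above (my own sketch, not part of the answer) is to unpack it back into them; errors='ignore' just skips frames without a 'Date' column.
df_train, df_test = [
    df.drop(columns='Date', errors='ignore') for df in (df_train, df_test)
]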

Sum columns from multiples DataFrames

I have N DataFrames, named data1, data2, ... etc.
Each dataframe has two columns, 'X' and 'Y'. The length of each dataframe is not the same.
I need a new dataframe consisting of the sum of the 'X' columns.
I just tried something like:
dataframesum = pd.DataFrame(0, index=np.arange(Some_number), columns=['X'])
for i in range(N):
    dataframesum.add(globals()['Data%s' % i]['X'], fill_value=0)
but it doesn't work (I'm not sure what the value of Some_number should be) and I am getting the following error:
NotImplementedError: fill_value 0 not supported
You should use a dictionary to store an arbitrary number of variables.
So let's assume you have dataframes stored in dfs = {1: df1, 2: df2, 3: df3...}.
You can then concatenate them via pd.concat:
df_concat = pd.concat(list(dfs.values()))
Finally, you can sum columns via pd.DataFrame.sum:
sums = df_concat.sum()
To take advantage of vectorised operations, you should avoid a manual for loop. In addition, use of globals() is poor practice, and can be avoided by using dict or list to store your dataframes.
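A hedged end-to-end sketch of that advice, with hypothetical file names: store the N frames in a dict instead of data1, data2, ... globals, then concatenate once and sum, as above.
import pandas as pd

# Hypothetical: N csv files, each with 'X' and 'Y' columns.
dfs = {i: pd.read_csv(f"data{i}.csv") for i in range(1, N + 1)}

sums = pd.concat(list(dfs.values())).sum()
total_x = sums['X']   # grand total of all 'X' columns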

Concat pandas dataframes without following a certain sequence

I have data files which are converted to pandas dataframes. Some share column names while others share the time-series index, and I wish to combine them all into one dataframe, matching on both columns and index whenever possible. Since there is no sequence in the naming, they arrive in random order for concatenation. If two dataframes with different columns are concatenated along axis=1 it works well, but if the resulting dataframe is then combined with a new df whose column name matches one of the earlier merged dataframes, the concat fails. For example, with these data files:
import pandas as pd
df1 = pd.read_csv('0.csv', index_col=0, parse_dates=True, infer_datetime_format=True)
df2 = pd.read_csv('1.csv', index_col=0, parse_dates=True, infer_datetime_format=True)
df3 = pd.read_csv('2.csv', index_col=0, parse_dates=True, infer_datetime_format=True)
data1 = pd.DataFrame()
file_list = [df1, df2, df3] # fails
# file_list = [df2, df3,df1] # works
for fn in file_list:
    if data1.empty or fn.columns[1] in data1.columns:
        data1 = pd.concat([data1, fn])
    else:
        data1 = pd.concat([data1, fn], axis=1)
I get ValueError: Plan shapes are not aligned when I try to do that. In my case there is no way to first load all the DataFrames and check their column names; if I could, I would first combine all dfs with the same column names and only then concat the resulting dataframes with different column names along axis=1, which I know always works. A solution that requires preloading all the DataFrames and rearranging the concatenation order is not possible in my case (it was only done for the working example above). I need the flexibility to concatenate the information into the larger dataframe data1 in whatever order it arrives. Please let me know if you have a suitable approach.
If you go through the loop step by step, you can see that in the first iteration it takes the if branch, so data1 becomes df1. In the second iteration it takes the else branch, since data1 is not empty and 'Temperature product barrel ValueY' is not in data1.columns.
After that else branch, data1 has some duplicated column names, and in every row of those duplicated columns one of the two values is NaN while the other is a float. This is why the next pd.concat() fails.
You can aggregate the duplicated columns before concatenating to get rid of them:
import numpy as np

for fn in file_list:
    if data1.empty or fn.columns[1] in data1.columns:
        # new: collapse duplicated column names before concatenating
        data1 = data1.groupby(data1.columns, axis=1).agg(np.nansum)
        data1 = pd.concat([data1, fn])
    else:
        data1 = pd.concat([data1, fn], axis=1)
After that, you would get
data1.shape
(30, 23)

How do you effectively use pd.DataFrame.apply on rows with duplicate values?

The function that I'm applying is a little expensive; as such, I want it to calculate the value only once for each unique value.
The only solution I've been able to come up with has been as follows:
This first step is needed because apply doesn't work on arrays, so I have to convert the unique values into a Series.
new_vals = pd.Series(data['column'].unique()).apply(function)
This one because .merge has to be used on dataframes.
new_dataframe = pd.DataFrame( index = data['column'].unique(), data = new_vals.values)
Finally, merging the results:
yet_another = pd.merge(data, new_dataframe, right_index=True, left_on='column')
data['calculated_column'] = yet_another[0]
So basically I had to convert my values to a Series, apply the function, convert to a DataFrame, merge the results, and use that column to create my new column.
I'm wondering if there is some one-line solution that isn't as messy. Something pythonic that doesn't involve re-casting object types multiple times. I've tried grouping by but I just can't figure out how to do it.
My best guess would have been to do something along these lines
data[calculated_column] = dataframe.groupby(column).index.apply(function)
but that isn't right either.
This is an operation that I do often enough to want to learn a better way to do, but not often enough that I can easily find the last time I used it, so I end up re-figuring a bunch of things again and again.
If there is no good solution I guess I could just add this function to my library of common tools that I hedonistically > from me_tools import *
def apply_unique(data, column, function):
    new_vals = pd.Series(data[column].unique()).apply(function)
    new_dataframe = pd.DataFrame(data=new_vals.values,
                                 index=data[column].unique())
    result = pd.merge(data, new_dataframe, right_index=True, left_on=column)
    return result[0]
I would do something like this:
def apply_unique(df, orig_col, new_col, func):
    return df.merge(df[[orig_col]]
                    .drop_duplicates()
                    .assign(**{new_col: lambda x: x[orig_col].apply(func)}),
                    how='inner', on=orig_col)
This will return the same DataFrame as performing:
df[new_col] = df[orig_col].apply(func)
but will be much more performant when there are many duplicates.
How it works:
We join the original DataFrame (calling) to another DataFrame (passed) that contains two columns; the original column and the new column transformed from the original column.
The new column in the passed DataFrame is assigned using .assign and a lambda function, making it possible to apply the function to the DataFrame that has already had .drop_duplicates() performed on it.
A dict is used here for convenience only, as it allows a column name to be passed in as a str.
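As a common alternative to the merge (not from this answer, just a sketch reusing the question's data, 'column' and function): compute a lookup dict once per unique value and broadcast it back with Series.map.
# Apply the expensive function once per unique value, then map the results back.
lookup = {val: function(val) for val in data['column'].unique()}
data['calculated_column'] = data['column'].map(lookup)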
Edit:
As an aside: best to drop new_col if it already exists, otherwise the merge will append suffixes to each new_col
if new_col in df:
    df = df.drop(new_col, axis='columns')

How do I combine two dataframes?

I have an initial dataframe D. I extract two data frames from it like this:
A = D[D.label == k]
B = D[D.label != k]
I want to combine A and B into one DataFrame. The order of the data is not important. However, when we sample A and B from D, they retain their indexes from D.
DEPRECATED: DataFrame.append and Series.append were deprecated in v1.4.0 and removed in pandas 2.0.
Use append:
df_merged = df1.append(df2, ignore_index=True)
And to keep their indexes, set ignore_index=False.
Use pd.concat to join multiple dataframes:
df_merged = pd.concat([df1, df2], ignore_index=True, sort=False)
Merge across rows:
df_row_merged = pd.concat([df_a, df_b], ignore_index=True)
Merge across columns:
df_col_merged = pd.concat([df_a, df_b], axis=1)
If you're working with big data and need to concatenate multiple datasets, calling concat many times can get expensive.
If you don't want to create a new df each time, you can instead aggregate the changes and call concat only once:
frames = [df_A, df_B] # Or perform operations on the DFs
result = pd.concat(frames)
This is pointed out in the pandas docs, under "concatenating objects", at the bottom of the section:
Note: It is worth noting however, that concat (and therefore append)
makes a full copy of the data, and that constantly reusing this
function can create a significant performance hit. If you need to use
the operation over several datasets, use a list comprehension.
If you want to update/replace the values of the first dataframe df1 with the values of the second dataframe df2, you can do it with the following steps —
Step 1: Set the index of the first dataframe (df1)
df1 = df1.set_index('id')
Step 2: Set the index of the second dataframe (df2)
df2 = df2.set_index('id')
and finally update the dataframe using the following snippet —
df1.update(df2)
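A minimal self-contained sketch of those steps with made-up data, assuming both frames share an 'id' column (note that set_index returns a new frame, hence the reassignments):
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'val': ['a', 'b', 'c']})
df2 = pd.DataFrame({'id': [2, 3], 'val': ['B', 'C']})

df1 = df1.set_index('id')
df2 = df2.set_index('id')

df1.update(df2)   # rows with id 2 and 3 now hold 'B' and 'C'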
To join 2 pandas dataframes by column, using their indices as the join key, you can do this:
both = a.join(b)
And if you want to join multiple DataFrames, Series, or a mixture of them, by their index, just put them in a list, e.g.,:
everything = a.join([b, c, d])
See the pandas docs for DataFrame.join().
# collect excel content into list of dataframes
data = []
for excel_file in excel_files:
    data.append(pd.read_excel(excel_file, engine="openpyxl"))
# concatenate dataframes horizontally
df = pd.concat(data, axis=1)
# save combined data to excel
df.to_excel(excelAutoNamed, index=False)
You can try the above when you are appending horizontally! Hope this helps someone.
Use this code to attach two Pandas Data Frames horizontally:
df3 = pd.concat([df1, df2],axis=1, ignore_index=True, sort=False)
You must specify the axis along which you intend to concatenate the two frames.
