I have N DataFrames, named data1, data2, etc.
Each DataFrame has two columns, 'X' and 'Y'. The length of each DataFrame is not the same.
I need a new DataFrame consisting of the sum of the 'X' columns.
I just tried something like:
dataframesum = pd.DataFrame(0, index=np.arange(Some_number), columns=['X'])
for i in range(N):
    dataframesum.add(globals()['data%s' % i]['X'], fill_value=0)
but it doesn't work (I'm not sure what the value of Some_number should be), and I am getting the following error:
NotImplementedError: fill_value 0 not supported
You should use a dictionary to store an arbitrary number of variables.
So let's assume you have dataframes stored in dfs = {1: df1, 2: df2, 3: df3...}.
You can then concatenate them via pd.concat:
df_concat = pd.concat(list(dfs.values()))
Finally, you can sum columns via pd.DataFrame.sum:
sums = df_concat.sum()
To take advantage of vectorised operations, you should avoid a manual for loop. In addition, use of globals() is poor practice and can be avoided by using a dict or list to store your DataFrames.
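For example, a minimal end-to-end sketch with three small made-up DataFrames of different lengths (the dict replaces the data1, data2, ... globals):

import pandas as pd

dfs = {
    1: pd.DataFrame({'X': [1, 2, 3], 'Y': [0, 0, 0]}),
    2: pd.DataFrame({'X': [10, 20], 'Y': [0, 0]}),
    3: pd.DataFrame({'X': [100], 'Y': [0]}),
}
df_concat = pd.concat(list(dfs.values()))
sums = df_concat.sum()          # per-column totals across all DataFrames
x_total = df_concat['X'].sum()  # just the 'X' column: 136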
I need a smart and fast algorithm to delete all the rows of a Pandas DataFrame [10000:37] for which I observe the Boolean value False in at least one column of a twin dictionary to the DataFrame (the dictionary has keys equal to the names of the DataFrame's columns, and the value of each key is a list of 9999 Boolean values).
I would like this operation to remain easy to apply as the program grows and changes, avoiding separate operations on the different series of values.
I should say that I am not a professional programmer. Can anyone recommend an appropriate approach?
I will assume here that the dictionary and the dataframe have different values but share same indices. Said differently, I assume that the index of the dataframe is a RangeIndex(start=0, stop=10000, step=1).
In that case, I would build a DataFrame from the twin dictionary and use np.all to flag the rows where every column is True; the remaining rows, those with at least one False, are the ones to drop.
Calling the DataFrame df and the twin dictionary twin, the code could be:
df_twin = pd.DataFrame(twin)
df_twin['keep'] = np.all(df_twin, axis=1)              # True only where every column is True
df_clean = df.drop(df_twin.loc[~df_twin.keep].index)   # drop rows with at least one False
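A tiny made-up example of the above (not from the question), where the twin dictionary shares the DataFrame's column names:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [1, 2, 3]})
twin = {'A': [True, False, True], 'B': [True, True, True]}

df_twin = pd.DataFrame(twin)
df_twin['keep'] = np.all(df_twin, axis=1)
df_clean = df.drop(df_twin.loc[~df_twin.keep].index)   # drops row 1, keeps rows 0 and 2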
Using this as an example dataframe:
test_df = pd.DataFrame({ 'A': [True,True,True], 'B': [False,True,True], 'C' : [True,False,True], 'D' : [True,True,True]})
We only want the 3rd row which has True in each column:
mask = test_df.all(axis=1)
keep_df = test_df[mask]
If you only want to check columns that are keys in your dictionary:
d = {'A': [1, 2, 3], 'C': [4, 5, 6]}
mask = test_df[list(d)].all(axis=1)   # select only the columns named by the dict's keys
keep_df = test_df[mask]
The function that I'm applying is a little expensive, so I want it to calculate the value only once for each unique value.
The only solution I've been able to come up with has been as follows:
This step is needed because apply doesn't work on arrays, so I have to convert the unique values into a Series.
new_vals = pd.Series(data['column'].unique()).apply(function)
This one is needed because .merge can only be used on DataFrames.
new_dataframe = pd.DataFrame(index=data['column'].unique(), data=new_vals.values)
Finally, merging the results:
yet_another = pd.merge(data, new_dataframe, right_index=True, left_on='column')
data['calculated_column'] = yet_another[0]
So basically I had to convert my values to a Series, apply the function, convert to a DataFrame, merge the results, and use that column to create my new column.
I'm wondering if there is some one-line solution that isn't as messy. Something pythonic that doesn't involve re-casting object types multiple times. I've tried grouping by but I just can't figure out how to do it.
My best guess would have been to do something along these lines
data[calculated_column] = dataframe.groupby(column).index.apply(function)
but that isn't right either.
This is an operation that I do often enough to want a better way to do it, but not often enough that I can easily find the last time I used it, so I end up re-figuring a bunch of things again and again.
If there is no good solution, I guess I could just add this function to my library of common tools, which I hedonistically pull in with from me_tools import *
def apply_unique(data, column, function):
    new_vals = pd.Series(data[column].unique()).apply(function)
    new_dataframe = pd.DataFrame(data=new_vals.values, index=data[column].unique())
    result = pd.merge(data, new_dataframe, right_index=True, left_on=column)
    return result[0]
I would do something like this:
def apply_unique(df, orig_col, new_col, func):
    return df.merge(df[[orig_col]]
                    .drop_duplicates()
                    .assign(**{new_col: lambda x: x[orig_col].apply(func)}),
                    how='inner', on=orig_col)
This will return the same DataFrame as performing:
df[new_col] = df[orig_col].apply(func)
but will be much more performant when there are many duplicates.
How it works:
We join the original DataFrame (calling) to another DataFrame (passed) that contains two columns: the original column and the new column transformed from the original column.
The new column in the passed DataFrame is assigned using .assign and a lambda function, making it possible to apply the function to the DataFrame that has already had .drop_duplicates() performed on it.
A dict is used here for convenience only, as it allows a column name to be passed in as a str.
Edit:
As an aside: best to drop new_col if it already exists, otherwise the merge will append suffixes to each new_col
if new_col in df:
    df = df.drop(new_col, axis='columns')
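A quick usage sketch with made-up names, assuming a toy DataFrame and a deliberately slow function standing in for the expensive one:

import time
import pandas as pd

def slow_square(x):
    time.sleep(0.1)   # stand-in for an expensive computation
    return x ** 2

df = pd.DataFrame({'val': [1, 2, 2, 3, 3, 3]})
out = apply_unique(df, orig_col='val', new_col='val_sq', func=slow_square)
# slow_square runs 3 times (once per unique value) instead of 6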
I have two DataFrames, df1 and df2. df1 has columns A, B, C, D, E, F and df2 has columns A, B, J, D, E, K. I want to update the second DataFrame with values from the first, but only when the first two columns have the same value in both DataFrames. For each row where the following two conditions are true:
df1.A = df2.A
df1.B = df2.B
then update accordingly:
df2.D = df1.D
df2.E = df1.E
My dataframes have different number of rows.
When I tried this code I got a TypeError: cannot do positional indexing with these indexers of type 'str'.
for a in df1:
    for t in df2:
        if df1.iloc[a]['A'] == df2.iloc[t]['A'] and df1.iloc[a]['B'] == df2.iloc[t]['B']:
            df2.iloc[t]['D'] = df1.iloc[a]['D']
            df2.iloc[t]['E'] = df1.iloc[a]['E']
The Question:
You'd be better served merging the dataframes than doing nested iteration.
df2 = df2.merge(df1[['A', 'B', 'D', 'E']], on=['A', 'B'], how='left', suffixes=['_old', ''])
df2['D'] = df2['D'].fillna(df2['D_old'])
df2['E'] = df2['E'].fillna(df2['E_old'])
del df2['D_old']
del df2['E_old']
The first line attaches columns to df2 with values for columns D and E from corresponding rows of df1, and renames the old columns.
The next two lines fill in the rows for which df1 had no matching row, and the next two delete the initial, now outdated versions of the columns.
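As a quick illustration with made-up data (not from the question):

df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y'], 'D': [10, 20], 'E': [100, 200]})
df2 = pd.DataFrame({'A': [1, 3], 'B': ['x', 'z'], 'D': [0, 0], 'E': [0, 0]})
# After the merge/fillna steps above, df2's first row (A=1, B='x') picks up
# D=10 and E=100 from df1; its second row (A=3, B='z') has no match in df1
# and keeps its original D=0, E=0.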
The Error:
Your TypeError happened because for a in df1: iterates over the columns of a dataframe, which are strings here, while .iloc only takes integers. Additionally, though you didn't get to this point, to set a value you'd need both index and column contained within the brackets.
So if you did need to set values by row, you'd want something more like
for a, _ in df1.iterrows():
    for t, _ in df2.iterrows():
        if df1.loc[a, 'A'] == ...
Though I'd strongly caution against doing that. If you find yourself thinking about it, there's probably either a much faster, less painful way to do it in pandas, or you're better off using another tool less focused on tabular data.
When calling a function using groupby + apply, I want to go from a DataFrame to a Series groupby object, apply a function to each group that takes a Series as input and returns a Series as output, and then assign the output from the groupby + apply call as a field in the DataFrame.
The default behavior is to have the output from groupby + apply indexed by the grouping fields, which prevents me from assigning it back to the DataFrame cleanly. I'd prefer to have the function I call with apply take a Series as input and return a Series as output; I think it's a bit cleaner than DataFrame to DataFrame. (This isn't the best way of getting to the result for this example; the real application is pretty different.)
import pandas as pd
df = pd.DataFrame({
'A': [999, 999, 111, 111],
'B': [1, 2, 3, 4],
'C': [1, 3, 1, 3]
})
def less_than_two(series):
    # Intended for series of length 1 in this case
    # But not intended for many-to-one generally
    return series.iloc[0] < 2
output = df.groupby(['A', 'B'])['C'].apply(less_than_two)
I want the index on output to be the same as on df; otherwise I can't assign it to df (cleanly):
df['Less_Than_Two'] = output
Something like output.index = df.index seems too ugly, and using the group_keys argument doesn't seem to work:
output = df.groupby(['A', 'B'], group_keys = False)['C'].apply(less_than_two)
df['Less_Than_Two'] = output
transform returns the results with the original index, just as you've asked for. It will broadcast the same result across all elements of a group. One caveat: the dtype may be inferred to be something else, so you may have to cast it yourself.
In this case, in order to add another column, I'd use assign
df.assign(
    Less_Than_Two=df.groupby(['A', 'B'])['C'].transform(less_than_two).astype(bool))
     A  B  C  Less_Than_Two
0  999  1  1           True
1  999  2  3          False
2  111  3  1           True
3  111  4  3          False
Assuming your groupby is necessary (and the resulting groupby object will have fewer rows than your DataFrame -- this isn't the case with the example data), then assigning the Series to the 'Is.Even' column will result in NaN values (since the index to output will be shorter than the index to df).
Instead, based on the example data, the simplest approach will be to merge output -- as a DataFrame -- with df, like so:
output = df.groupby(['A', 'B'])['C'].agg({'C': is_even}).reset_index()  # reset_index restores 'A' and 'B' from the index to columns
output.columns = ['A', 'B', 'Is_Even']  # rename the target column prior to merging
df.merge(output, how='left', on=['A', 'B'])  # supports a many-to-one relationship between combinations of 'A' & 'B' and 'Is_Even',
                                             # and thus properly maps aggregated values back to the unaggregated rows
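Note that passing a dict of functions to .agg on a SeriesGroupBy, as above, has since been removed from pandas; a roughly equivalent sketch using the question's less_than_two function in place of is_even would be:

output = (df.groupby(['A', 'B'])['C']
            .agg(less_than_two)                   # one value per (A, B) group
            .reset_index(name='Less_Than_Two'))   # back to plain columns
df = df.merge(output, how='left', on=['A', 'B'])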
Also, I should note that you're better off using underscores than dots in names; unlike in R, the dot in Python is the operator for accessing object attributes, so dots in names block attribute-style access and create confusion.
TLDR: I don't know how to take an array of DataFrames and build a MultiIndex around it.
TLDR 2: From my research, it is recommended to work with the returned results rather than to try to work off a global/single/shared DataFrame in Pool().map(). If somebody has a way to share one, I'm all for it.
I am trying to merge a list of MultiIndex pandas DataFrames that have been returned by a Pool().map() call.
p = Pool()
results = p.map(run_experiment, experiment_collection)
Pool().map() returns a list. Let's say the process returns a list of 1000 DataFrames, each with a first index of [0:5] and a second of [0:50].
What I want is a final output that is a single DataFrame separating each experiment, so [0:1000] / [0:5] / [0:50].
I know how to create the MultiIndex using np.zeros_like and then fill the DataFrame, but I don't know how to take an array of DataFrames and build a MultiIndex around it.
rounds = range(0,1000)
levels = [... some set of levels ...]
labels = [... some set of labels ...]
iterables = [rounds, labels, levels]
names = ['round', 'label', 'values']
index = pd.MultiIndex.from_product(iterables, names=names)
index_names = [... some set of index names...]
empty_df = pd.DataFrame(
    np.zeros_like(np.random.randn(5, 50000)),
    index=index_names,
    columns=index
).sort_index().sort_index(axis=1)
In my first example, results is a list of MultiIndex DataFrames with levels / labels. What I am trying to do is create a final DataFrame where the outer list position (the list holding all these DataFrames) simply becomes a new index level.
When I try
p = Pool()
results = pd.DataFrame(
    p.map(run_experiment, experiment_collection),
    index=index_names,
    columns=index
).sort_index().sort_index(axis=1)
I am getting ValueError: Shape of passed values is (1, 1000), indices imply (shape of intended index), which makes sense because it is an array of 1000 DataFrames.
If I concatenate (which feels like the better way to go)
results = pd.concat(p.map(run_experiment, experiment_collection))
I get a DataFrame with levels / labels, but no round.
iterables = [rounds, labels, levels]
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
keys=None, levels=None, names=None, verify_integrity=False)
I'm not sure which of the options (keys, levels, names) I should be manipulating here to get my rounds back into the DataFrame.
results = pd.concat(
    p.map(run_experiment, experiment_collection),
    levels=iterables,
    names=names,
    axis=1
)
Gets me really close to the format I want, but no round.
I can get more specific if I need to, but not entirely sure what else would be helpful in getting to the answer.
Lacking a better answer, I am recreating my original DataFrame and iterating through the result returned from Pool().map(), inserting each list position into the DataFrame. It seems there has to be a better way, but I can't think of it.
p = Pool()
results = p.map(run_experiment, experiment_collection)
final_df = pd.DataFrame(
    np.zeros_like(np.random.randn(5, 50000)),
    index=index_names,
    columns=index
).sort_index().sort_index(axis=1)

for result in results:
    final_df[increment_value] = result
There are several good ways to do this:
1) If you are starting with a bunch of Series objects: set each Series object's name attribute to a tuple, then use pd.concat([series1, series2, ...], axis=1).
2) If you have a single-level map of DataFrames, you can use the fact that pd.concat accepts a dict as its first argument. E.g.
pd.concat({'A': df1, 'B': df2}, axis=1)
will create a multilevel column index with 'A', 'B' as the top level and the columns of your DataFrames as the second level. Although you cannot nest dicts, you can do this multiple times to build an index of arbitrary depth.
3) You can use the DataFrame constructor on a DataFrame but pass a list of tuples as the column names. E.g. if you have a df with columns A and B, df_new = pd.DataFrame(df, columns=[("Foo", "A"), ("Foo", "B")]) will create a new df with a multilevel column index. You can do this for your DataFrames individually and then concatenate them; pandas will appropriately concatenate two DataFrames whose indexes have the same number of levels.
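For the case in the question, a minimal sketch of approach 2 (assuming results is the list of per-round DataFrames returned by p.map(), and that, as in your empty_df, the round/label/values hierarchy lives on the columns) might look like:

final_df = pd.concat(
    {round_id: df for round_id, df in enumerate(results)},  # dict keys become the new top 'round' level
    axis=1,
    names=['round']
).sort_index(axis=1)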