I have a list of lists of dataframes, each dataframe containing common columns A and B to be used as indices, as follows:
df_list = [[df1], [df2], [df3], [df4], [df5, df6]]
I would like to merge the dataframes into a single dataframe based on the common columns A and B.
I have tried pd.concat(df_list). This doesn't work and yields an error:
list indices must be integers or slices, not list
Is there a way to accomplish this?
You need to pass a flat list (or other flat structure) to pd.concat. Your
df_list = [[df1], [df2], [df3], [df4], [df5, df6]]
is nested. If you do not control how df_list gets filled, you need to flatten it first, for example using itertools.chain as follows:
import itertools
flat_df_list = list(itertools.chain(*df_list))
and then pass flat_df_list, or exploit the fact that pd.concat also works with iterators:
total_df = pd.concat(itertools.chain(*df_list))
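For illustration, here is a minimal, self-contained sketch of this flatten-and-concatenate approach; the tiny dataframes and their columns are invented for the example:
import itertools
import pandas as pd

# Invented example dataframes sharing the common columns A and B
df1 = pd.DataFrame({"A": [1], "B": [10], "x": ["a"]})
df2 = pd.DataFrame({"A": [2], "B": [20], "y": ["b"]})
df_list = [[df1], [df2]]  # nested, as in the question

# Flatten the nested list and concatenate in one pass
total_df = pd.concat(itertools.chain(*df_list), ignore_index=True)
print(total_df)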
Try simply using something like:
dfs = [df1, df2, df3, ... etc]
print(pd.concat(dfs))
When storing the dfs in a list, you shouldn't keep them in a second, nested list; note that pd.concat takes a flat list of dfs.
The solution is as follows:
flat_list = [item for sublist in df_list for item in sublist]
merged_df = pd.concat(flat_list)
Related
I need to find all the columns that are in all of 5 different Pandas dataframes. Currently I'm using code like this:
dfs = [df0, df1, df2, df3, df4]
cols = dfs[0].columns
for df in dfs[1:]:
    cols &= df.columns
Assuming this code is correct, I'm wondering if it's possible to do this in a list comprehension, and if not, whether there's a more efficient or less verbose way of getting the same result.
If you really want to work with regular Python sets, you can pass in a lambda function that returns the intersection of two sets to the reduce function:
from functools import reduce
dfs = [df0, df1, df2, df3, df4] # list of pandas DataFrames
columns_in_common = reduce( # a Python set
    lambda s1, s2: s1.intersection(s2),
    (set(df) for df in dfs)
)
Although I'd agree that pd.Index.intersection makes more sense when you're dealing with pandas DataFrames.
Edit:
For completeness' sake, here's a way to do it with an actual list comprehension - although we're severely misusing the list comprehension by relying on its "side effect" of iterating over a sequence in a single line. I repeat: this is not what list comprehensions are built for; they should be used for one thing and one thing only, that is, building a list.
result = []
_ = [
    result.append(colname)
    if (
        colname not in result
        and all(
            colname in df.columns
            for df in dfs
        )
    )
    else None
    for df in dfs
    for colname in df.columns
]
Here, the list _ (the outcome of the list comprehension) is not important. What matters is that values have been appended to the result list, based on whether they appear in all DataFrame columns, and whether they don't already exist in the result list itself.
Besides being needlessly complex, the list comprehension is also slower than the set version (likely due to the triple for-loop inside it, and maybe also because searches in Python lists are slower than in Python sets). Here's a simple test case:
Sets
>>> timeit.timeit("import functools;lists=[[1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 6]];functools.reduce(lambda s1, s2: s1.intersection(s2), (set(s) for s in lists))")
1.8992446570046013
List Comprehension
>>> timeit.timeit("mylist=[];lists=[[1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 6]];[mylist.append(e) if e not in mylist and all(e in sublist for sublist in lists) else None for slist in lists for e in slist]")
6.89644429100008
The real answer to "why can't we simply append new column names to the same list being created in the list comprehension?" is that we don't have a reference to it - not before the list comprehension itself finishes, anyway. We need that reference to be able to create a list from a list comprehension with logic more complex than a simple if/else statement.
tl;dr: you can't really use list comprehensions to create new lists on the spot if your list-creation logic isn't a simple if/else test.
We have reduce to handle this, which can be used with Index.intersection, pretty much like set.intersection:
from functools import reduce
reduce(pd.Index.intersection,[i.columns for i in dfs])
Dummy example:
df1 = pd.DataFrame(columns=list('ABCDE'))
df2 = pd.DataFrame(columns=list('ABDE'))
from functools import reduce
dfs = [df1,df2]
reduce(pd.Index.intersection,[i.columns for i in dfs])
#Index(['A', 'B', 'D', 'E'], dtype='object')
I would like to yield multiple empty dataframes from a function in Python.
import pandas as pd
df_list = []
def create_multiple_df(num):
    for i in range(num):
        df = pd.DataFrame()
        df_name = "df_" + str(num)
        exec(df_name + " = df ")
        df_list.append(eval(df_name))
    for i in df_list:
        yield i
e.g. when I call create_multiple_df(3), I would like to have df_1, df_2 and df_3 returned.
However, it didn't work.
I have two questions:
How to store multiple dataframes in a list (i.e. without evaluating the contents of the dataframes)?
How to yield multiple variable elements from a list?
Thanks!
It's very likely that you do not want to have df_1, df_2, df_3 ... etc. This is often a design pursued by beginners for some reason, but trust me that a dictionary or simply a list will do the trick without the need to hold different variables.
Here, it sounds like you simply want a list comprehension:
dfs = [pd.DataFrame() for _ in range(n)]
This will create n empty dataframes and store them in a list. To retrieve or modify them, you can simply access them by position. This means that instead of having a dataframe saved in a variable df_1, you have it in the list dfs and use dfs[1] to get or edit it.
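A quick sketch of that positional access (the user_id column is invented for the example):
import pandas as pd

n = 3
dfs = [pd.DataFrame() for _ in range(n)]

# Edit the second dataframe in place via its list position
dfs[1]["user_id"] = [1, 2, 3]
print(dfs[1])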
Another option is a dictionary comprehension:
dfs = {i: pd.DataFrame() for i in range(n)}
It works in a similar fashion: you can access the dataframes with dfs[0] or dfs[1] (or even use real names, e.g. {genre: pd.DataFrame() for genre in ['romance', 'action', 'thriller']}; here, you could do dfs['romance'] or dfs['thriller'] to retrieve the corresponding df).
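And the keyed equivalent (the genre names and the title column are placeholders):
import pandas as pd

dfs = {genre: pd.DataFrame() for genre in ["romance", "action", "thriller"]}

# Retrieve and edit a dataframe by its key instead of its position
dfs["romance"]["title"] = ["Pride and Prejudice"]
print(dfs["romance"])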
I know I can use isin to filter a dataframe.
My question is that, when I'm doing this for many dataframes, the code looks a bit repetitive.
For example, below is how I filter some datasets to limit only to specific user datasets.
## filter data
df_order_filled = df_order_filled[df_order_filled.user_id.isin(df_user.user_id)]
df_liquidate_order = df_liquidate_order[df_liquidate_order.user_id.isin(df_user.user_id)]
df_fee_discount_ = df_fee_discount_[df_fee_discount_.user_id.isin(df_user.user_id)]
df_dep_wit = df_dep_wit[df_dep_wit.user_id.isin(df_user.user_id)]
The name of each dataframe is repeated 3 times, which is kind of unnecessary.
How can I simplify my code?
Thanks!
Use list comprehension with list of DataFrames:
dfs = [df_order_filled, df_liquidate_order, df_fee_discount_, df_dep_wit]
dfs1 = [x[x.user_id.isin(df_user.user_id)] for x in dfs]
Output is another list with filtered DataFrames.
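If you need the filtered frames back under their original names afterwards, one option (a sketch reusing the dataframe names from the question) is to unpack the result list:
dfs = [df_order_filled, df_liquidate_order, df_fee_discount_, df_dep_wit]
dfs1 = [x[x.user_id.isin(df_user.user_id)] for x in dfs]

# Unpack the filtered frames back into the original variable names
df_order_filled, df_liquidate_order, df_fee_discount_, df_dep_wit = dfs1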
Another similar idea is to use a dictionary:
dict1 = {'df_order_filled': df_order_filled,
         'df_liquidate_order': df_liquidate_order,
         'df_fee_discount': df_fee_discount_,
         'df_dep_wit': df_dep_wit}
dict2 = {k: x[x.user_id.isin(df_user.user_id)] for k, x in dict1.items()}
I have a list of dataframes with different shapes, which is why I can't put the data into a 3-dimensional df or array.
Now I want to get some specific dataframes from that list into a new list containing only the needed dfs.
list_of_df = [df1, df2, df3, ...]
index = [0,3,7,9,29,11,18,77,1009]
new_list = list_of_df[index]  # TypeError: list indices must be integers or slices, not list
The only working way I can think of is very unsexy:
new_list = []
for i in index:
    new_list.append(list_of_df[i])
Is there a better solution, or in general a more convenient way to store and access thousands of different dataframes?
You can use a list comprehension:
new_list = [df for (i, df) in enumerate(list_of_df) if i in index]
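A slight variant (not from the original answer): indexing directly by position avoids scanning index once per dataframe, and returns the dfs in the order given by index:
new_list = [list_of_df[i] for i in index]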
The answer by @Quang Hoang solves the issue:
import numpy as np
new_list = np.array(list_of_df)[index]
I have a problem with a for loop, like below:
list = [1,2,3,4]
for index in list:
    new_df_name = "user_" + index
    new_df_name = origin_df1.join(origin_df2, 'id', 'left')
but the "new_df_name" is just a Variable and String type.
how to realize these?
I assume what you really need is a list of dataframes (which do not necessarily have any specific names) that you then union together.
dataframes = [df1, df2, df3, etc... ]
res_df, tail_dfs = dataframes[0], dataframes[1:]
for df in tail_dfs:
    res_df = res_df.unionAll(df)
Update: an even better option for the union is described in the comments.
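For completeness, a common way to write that union fold more compactly (a sketch, assuming PySpark DataFrames with matching schemas; this may or may not be what the comment suggested) is functools.reduce:
from functools import reduce

from pyspark.sql import DataFrame

# Fold the whole list into a single dataframe without an explicit loop
res_df = reduce(DataFrame.unionAll, dataframes)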