I have a list of dataframes with different shapes, which is why I can't put the data into a 3-dimensional dataframe or array.
Now I want to pull some specific dataframes from that list into a new list containing only the needed dfs:
list_of_df = [df1, df2, df3, ...]
index = [0,3,7,9,29,11,18,77,1009]
new_list = list_of_df[index]
The only way I can think of doing this is very unsexy:
new_list = []
for i in index:
    new_list.append(list_of_df[i])
Is there a better solution, or in general a more convenient way to store and access thousands of different dataframes?
You can use a list comprehension:
new_list = [df for (i, df) in enumerate(list_of_df) if i in index]
The answer of @Quang Hoang solves the issue:
import numpy as np
new_list = np.array(list_of_df)[index]
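For completeness, a plain list comprehension over the index list (or operator.itemgetter) also works without NumPy; a minimal sketch, assuming every position in index exists in list_of_df:
from operator import itemgetter

# Pick the dataframes at the given positions, preserving the order of `index`
new_list = [list_of_df[i] for i in index]

# Equivalent, using itemgetter (returns a tuple for multiple indices)
new_list = list(itemgetter(*index)(list_of_df))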
I would like to yield multiple empty dataframes from a function in Python.
import pandas as pd

df_list = []

def create_multiple_df(num):
    for i in range(num):
        df = pd.DataFrame()
        df_name = "df_" + str(num)
        exec(df_name + " = df ")
        df_list.append(eval(df_name))
    for i in df_list:
        yield i
E.g. when I call create_multiple_df(3), I would like to have df_1, df_2 and df_3 returned.
However, it didn't work.
I have two questions,
How to store multiple dataframes in a list (i.e. without evaluating the contents of the dataframes)?
How to yield multiple variable elements from a list?
Thanks!
It's very likely that you do not want to have df_1, df_2, df_3 ... etc. This is often a design pursued by beginners for some reason, but trust me that a dictionary or simply a list will do the trick without the need to hold different variables.
Here, it sounds like you simply want a list comprehension:
dfs = [pd.DataFrame() for _ in range(n)]
This will create n empty dataframes and store them in a list. To retrieve or modify them, you can simply access them by position. This means that instead of having a dataframe saved in a variable df_1, you can keep it in the list dfs and use dfs[1] to get/edit it.
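A tiny usage sketch of the list approach:
import pandas as pd

n = 3
dfs = [pd.DataFrame() for _ in range(n)]

# Access and edit the second dataframe by position
dfs[1]["col"] = [1, 2, 3]
print(dfs[1])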
Another option is a dictionary comprehension:
dfs = {i: pd.DataFrame() for i in range(n)}
It works in a similar fashion: you can access the dataframes by dfs[0] or dfs[1], or even use real names, e.g. {genre: pd.DataFrame() for genre in ['romance', 'action', 'thriller']}. Here, you could do dfs['romance'] or dfs['thriller'] to retrieve the corresponding df.
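If you really do want a generator for the second question, here is a minimal sketch that yields num empty dataframes one by one, without exec/eval:
import pandas as pd

def create_multiple_df(num):
    """Yield `num` empty dataframes one at a time."""
    for _ in range(num):
        yield pd.DataFrame()

# Collect into a list, or unpack into separate names when num is known
df_1, df_2, df_3 = create_multiple_df(3)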
I have a list of lists of dataframes, each dataframe containing common columns A, B to be used as indices, as follows:
df_list = [[df1], [df2], [df3], [df4], [df5, df6]]
I would like to merge the dataframes into a single dataframe based on the common columns A, B in all dataframes.
I have tried pd.concat(df_list). This doesn't work and yields the error
list indices must be integers or slices, not list
Is there a way to accomplish this?
You need to deliver a flat list (or other flat structure) to pd.concat; your
df_list=[[df1], [df2],[df3], [df4], [df5, df6]]
is nested. If you do not have control over how df_list is filled, you need to flatten it first, for example using itertools.chain as follows:
import itertools
flat_df_list = list(itertools.chain(*df_list))
and then pass flat_df_list, or harness the fact that pd.concat also works with iterators:
total_df = pd.concat(itertools.chain(*df_list))
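A minimal self-contained sketch of the flatten-then-concat approach, with tiny made-up dataframes standing in for df1..df6:
import itertools
import pandas as pd

# Hypothetical stand-ins for df1..df6
dfs = [pd.DataFrame({"A": [i], "B": [10 * i], "val": [i ** 2]}) for i in range(1, 7)]
df_list = [[dfs[0]], [dfs[1]], [dfs[2]], [dfs[3]], [dfs[4], dfs[5]]]

# Flatten the nested list and concatenate everything row-wise
total_df = pd.concat(itertools.chain(*df_list), ignore_index=True)
print(total_df)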
Try simply using something like:
dfs = [df1, df2, df3, ... etc]
print(pd.concat(dfs))
When storing the dfs in a list, you shouldn't nest them inside a second list; note that pd.concat takes a flat list of dfs.
The solution is as follows:
flat_list = [item for sublist in df_list for item in sublist]
#L3 = pd.DataFrame.from_dict(map(dict,D))
L3 = pd.concat(flat_list)
I know I can use isin to filter a dataframe.
My question is that when I'm doing this for many dataframes, the code looks a bit repetitive.
For example, below is how I filter some datasets to limit them to specific users:
## filter data
df_order_filled = df_order_filled[df_order_filled.user_id.isin(df_user.user_id)]
df_liquidate_order = df_liquidate_order[df_liquidate_order.user_id.isin(df_user.user_id)]
df_fee_discount_ = df_fee_discount_[df_fee_discount_.user_id.isin(df_user.user_id)]
df_dep_wit = df_dep_wit[df_dep_wit.user_id.isin(df_user.user_id)]
The name of the dataframe is repeated 3 times for each df, which is kind of unnecessary.
How can I simplify my code?
Thanks!
Use a list comprehension with a list of DataFrames:
dfs = [df_order_filled, df_liquidate_order, df_fee_discount_, df_dep_wit]
dfs1 = [x[x.user_id.isin(df_user.user_id)] for x in dfs]
Output is another list with filtered DataFrames.
Another similar idea is to use a dictionary:
dict1 = {'df_order_filled': df_order_filled,
         'df_liquidate_order': df_liquidate_order,
         'df_fee_discount_': df_fee_discount_,
         'df_dep_wit': df_dep_wit}
dict2 = {k: x[x.user_id.isin(df_user.user_id)] for k, x in dict1.items()}
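Note that both versions leave the original variables untouched; to rebind the original names, you could unpack the filtered list, as in this small sketch (relying on the order of dfs above):
# Rebind the filtered frames to the original names (order matches `dfs`)
df_order_filled, df_liquidate_order, df_fee_discount_, df_dep_wit = dfs1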
I have some difficulties creating multiple lists from a list of multiple pandas dataframes:
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
...
dfN = pd.read_csv('dfN.csv')
dfs = [df1, df2, ..., dfN]
So far, I am able to convert each dataframe into a list by df1 = df1.values.tolist(). Since I have multiple data frames, I would like to convert each dataframe into a list with a loop.
Appreciate any suggestions!
Use a list comprehension:
dfs = [i.values.tolist() for i in dfs]
The same way as you are storing the dataframes?
lists = []
for df in dfs:
    temp_list = df.values.tolist()
    lists.append(temp_list)
This will give you a list of lists. Each list within will be values from a dataframe. Or did I understand the question incorrectly?
Edit: If you wish to name each list, you can use a dictionary instead. That would be better than trying to create thousands of variables dynamically.
dict_of_lists = {}
for index, df in enumerate(dfs):
listname = "list" + str(index)
dict_of_lists[listname] = df.values.tolist()
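A quick usage sketch for the dictionary version (the key names follow the "list" + index scheme above):
# Retrieve the converted values of the first dataframe by its generated name
first_values = dict_of_lists["list0"]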
Use pd.concat to join all dataframes into one big dataframe:
df_all = pd.concat(dfs, axis=1)
df_all.values.tolist()
I have a problem with a for loop program, like below:
list = [1,2,3,4]
for index in list:
new_df_name = "user_" + index
new_df_name = origin_df1.join(origin_df2,'id','left')
but the "new_df_name" is just a Variable and String type.
how to realize these?
I assume what you really need is to have a list of dataframes (which do not necessarily have any specific names) and then union them all together.
dataframes = [df1, df2, df3, etc... ]
res_df, tail_dfs = dataframes[0], dataframes[1:]
for df in tail_dfs:
    res_df = res_df.unionAll(df)
Upd.: an even better option for the union is described in the comments.
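For reference, one compact way to fold the union over the whole list, as a sketch assuming PySpark dataframes with matching schemas:
from functools import reduce

# Union every dataframe in the list into a single result
res_df = reduce(lambda left, right: left.unionAll(right), dataframes)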