How to loop over multiple DataFrames and produce multiple lists? - python

I am having some difficulty creating multiple lists from a list of dataframes using pandas:
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
...
dfN = pd.read_csv('dfN.csv')
dfs = [df1, df2, ..., dfN]
So far, I am able to convert each dataframe into a list by df1 = df1.values.tolist(). Since I have multiple data frames, I would like to convert each dataframe into a list with a loop.
Appreciate any suggestions!

Use list comprehensions:
dfs = [i.values.tolist() for i in dfs]
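For example, a minimal sketch with two small stand-in frames:

```python
import pandas as pd

# two tiny frames standing in for df1..dfN
dfs = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3, 4]})]

# convert every dataframe to a list of row-lists in one pass
lists = [df.values.tolist() for df in dfs]
```

Each entry of `lists` is one dataframe's rows as plain Python lists.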

You can store the lists the same way you are storing the dataframes:
lists = []
for df in dfs:
    temp_list = df.values.tolist()
    lists.append(temp_list)
This will give you a list of lists. Each list within will be values from a dataframe. Or did I understand the question incorrectly?
Edit: if you wish to name each list, you can use a dictionary instead. That would be better than trying to create thousands of variables dynamically.
dict_of_lists = {}
for index, df in enumerate(dfs):
    listname = "list" + str(index)
    dict_of_lists[listname] = df.values.tolist()
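A quick sketch of how you would then look a list up by its generated name:

```python
import pandas as pd

dfs = [pd.DataFrame({"a": [1]}), pd.DataFrame({"a": [2]})]
dict_of_lists = {}
for index, df in enumerate(dfs):
    dict_of_lists["list" + str(index)] = df.values.tolist()

# look up by generated name instead of a dynamically created variable
first = dict_of_lists["list0"]
```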

Use pd.concat to join all the dataframes into one big dataframe:
df_all = pd.concat(dfs, axis=1)
df_all.values.tolist()
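A rough sketch of that approach (note that axis=1 concatenates column-wise, aligning rows on the index; use axis=0 to stack rows instead):

```python
import pandas as pd

df_a = pd.DataFrame({"a": [1, 2]})
df_b = pd.DataFrame({"b": [3, 4]})

# side-by-side concatenation: rows are matched up by index
df_all = pd.concat([df_a, df_b], axis=1)
rows = df_all.values.tolist()
```

This gives one combined list of rows rather than one list per dataframe, so it only fits if that is what you want.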

Related

Using a for loop to rename columns of a list of data frames

I have a list of dataframes I have saved in a variable x.
x=[df_1963,df_1974,df_1985,df_1996,df_2007,df_2018]
I wish to change all the headers to lowercase, but nothing happens after running the code below.
for df in x:
    for column in df.columns:
        df = df.withColumnRenamed(column, '_'.join(column.split()).lower())
You can do it with a list comprehension plus a dict comprehension. Try:
renamed_x = [a.rename(columns={y:y.lower() for y in a}) for a in x]
The inner dict comprehension generates a dictionary with the columns' original values and its lowercase, so that is how each dataframe's columns will be renamed in lowercase; while the outer list comprehension iterates over all the dfs.
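A small sketch with two made-up frames standing in for the list x (note that rename returns new frames, so the originals are untouched):

```python
import pandas as pd

# stand-ins for df_1963, df_1974, ...
x = [pd.DataFrame(columns=["Year", "GDP"]), pd.DataFrame(columns=["ID", "Name"])]

# iterating over a DataFrame yields its column names,
# so the inner dict maps each original name to its lowercase form
renamed_x = [a.rename(columns={y: y.lower() for y in a}) for a in x]
```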
You can assign to df.columns to set each df's columns directly instead of looping through them one by one:
for df in x:
    df.columns = df.columns.str.lower()
You could use Index.map on the columns as well:
for df in x:
    df.columns = df.columns.map(str.lower)
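Note that withColumnRenamed is a PySpark method, not pandas, and rebinding df inside the loop never modifies the list, which is why nothing appeared to happen. If you also want the underscore-joining from the original attempt, a comprehension over the columns works; a small sketch with a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame(columns=["Year", "GDP Total"])

# pandas equivalent of the asker's withColumnRenamed intent:
# join whitespace-separated words with "_" and lowercase the result
df.columns = ["_".join(c.split()).lower() for c in df.columns]
```

Because this assigns to df.columns in place, the change sticks even inside a loop over the list.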

What's the most efficient way to export multiple pandas dataframes to csv files?

I have multiple pandas dataframes:
df1
df2
df3
And I want to export them all to csv files.
df1.to_csv('df1.csv', index = False)
df2.to_csv('df2.csv', index = False)
df3.to_csv('df3.csv', index = False)
What's the most efficient way to do this?
def printToCSV(number):
    num = str(number)
    csvTemp = "df99.csv"
    csv = csvTemp.replace('99', num)
    dfTemp = "df99"
    dfString = dfTemp.replace('99', num)
    # i know i cant use the .to_csv function on a string
    # but i dont know how to iterate through different dataframes
    dfString.to_csv(csv, index=False)

for i in range(1, 4):
    printToCSV(i)
How can I call the different dataframes to export them?
You can add them all to a list with:
list_of_dataframes = [df1, df2, df3]
Then you can iterate through the list using enumerate to keep a running count of which dataframe you are writing, using f-strings for the filenames:
for count, dataframe in enumerate(list_of_dataframes):
    dataframe.to_csv(f"dataframe_{count}.csv", index=False)
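A small sketch of that loop; here enumerate is started at 1 so the filenames match the df1, df2, ... naming from the question:

```python
import pandas as pd

dfs = [pd.DataFrame({"a": [1]}), pd.DataFrame({"a": [2]})]

# start=1 makes the counter 1-based, giving df1.csv, df2.csv, ...
for count, df in enumerate(dfs, start=1):
    df.to_csv(f"df{count}.csv", index=False)
```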
When you create the dataframes, store them in a suitable data structure such as a list; then, when you want to write the csv files, you can iterate over it in the same way:
import pandas as pd

dfs = []
for i in range(10):
    dfs.append(pd.DataFrame())

result = [dfs[i].to_csv(f'df{i}') for i in range(10)]
If you want to stick with your function approach, use:
df_dict = {"df1": df1, "df2": df2, "df3": df3}

def printToCSV(num):
    df_dict[f"df{num}"].to_csv(f"df{num}.csv", index=False)

printToCSV(1)  # will create "df1.csv"
However, if you want to increase efficiency, I'd propose using a list of dataframes (as #vtasca proposed as well):
df_list = [df1, df2, df3]
for num, df in enumerate(df_list):  # gives a number (starting from 0) and the df in each loop
    df.to_csv(f"df{num}.csv", index=False)
Or working with a dict:
df_dict = {"df1": df1, "df2": df2, "df3": df3}
for name, df in df_dict.items():
    df.to_csv(f"{name}.csv", index=False)
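A small sketch of the dict approach with made-up names, which keeps the filename tied to a human-readable key:

```python
import pandas as pd

# hypothetical names; use whatever labels fit your data
df_dict = {
    "orders": pd.DataFrame({"a": [1]}),
    "users": pd.DataFrame({"b": [2]}),
}

for name, df in df_dict.items():
    df.to_csv(f"{name}.csv", index=False)
```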

Yield multiple empty dataframes by a function in Python

I would like to yield multiple empty dataframes by a function in Python.
import pandas as pd
df_list = []
def create_multiple_df(num):
    for i in range(num):
        df = pd.DataFrame()
        df_name = "df_" + str(num)
        exec(df_name + " = df")
        df_list.append(eval(df_name))
    for i in df_list:
        yield i
e.g. when I call create_multiple_df(3), I would like to have df_1, df_2 and df_3 returned.
However, it didn't work.
I have two questions,
How to store multiple dataframes in a list (i.e. without evaluating the contents of the dataframes)?
How to yield multiple variable elements from a list?
Thanks!
It's very likely that you do not actually want df_1, df_2, df_3 ... etc. This is a design often pursued by beginners for some reason, but trust me: a dictionary or simply a list will do the trick without the need for separate variables.
Here, it sounds like you simply want a list comprehension:
dfs = [pd.DataFrame() for _ in range(n)]
This will create n empty dataframes and store them in a list. To retrieve or modify them, you can simply access their position. This means instead of having a dataframe saved in a variable df_1, you can have that in the list dfs and use dfs[1] to get/edit it.
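A quick illustrative sketch of the list version:

```python
import pandas as pd

n = 3
dfs = [pd.DataFrame() for _ in range(n)]

# edit the second frame in place by position
dfs[1]["col"] = [1, 2]
```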
Another option is a dictionary comprehension:
dfs = {i: pd.DataFrame() for i in range(n)}
It works in a similar fashion: you can access entries with dfs[0] or dfs[1] (or even use real names, e.g. {genre: pd.DataFrame() for genre in ['romance', 'action', 'thriller']}. Here, you could do dfs['romance'] or dfs['thriller'] to retrieve the corresponding df).
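For instance, a hypothetical genre-keyed sketch:

```python
import pandas as pd

# keys are illustrative names, not anything prescribed by pandas
dfs = {genre: pd.DataFrame() for genre in ["romance", "action", "thriller"]}

# fill one frame by name; the others stay empty
dfs["romance"]["title"] = ["a", "b"]
```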

Python & Pandas: how to elegantly filter many dataframes?

I know I can use isin to filter a dataframe.
My question is that when I do this for many dataframes, the code gets repetitive.
For example, below is how I filter some datasets to limit only to specific user datasets.
## filter data
df_order_filled = df_order_filled[df_order_filled.user_id.isin(df_user.user_id)]
df_liquidate_order = df_liquidate_order[df_liquidate_order.user_id.isin(df_user.user_id)]
df_fee_discount_ = df_fee_discount_[df_fee_discount_.user_id.isin(df_user.user_id)]
df_dep_wit = df_dep_wit[df_dep_wit.user_id.isin(df_user.user_id)]
The name of each dataframe is repeated 3 times, which seems unnecessary.
How can I simplify my code?
Thanks!
Use list comprehension with list of DataFrames:
dfs = [df_order_filled, df_liquidate_order, df_fee_discount_, df_dep_wit]
dfs1 = [x[x.user_id.isin(df_user.user_id)] for x in dfs]
Output is another list with filtered DataFrames.
Another similar idea is to use a dictionary:
dict1 = {'df_order_filled': df_order_filled,
         'df_liquidate_order': df_liquidate_order,
         'df_fee_discount_': df_fee_discount_,
         'df_dep_wit': df_dep_wit}

dict2 = {k: x[x.user_id.isin(df_user.user_id)] for k, x in dict1.items()}
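A self-contained sketch of the list-comprehension filtering with tiny made-up frames:

```python
import pandas as pd

df_user = pd.DataFrame({"user_id": [1, 2]})
df_a = pd.DataFrame({"user_id": [1, 3], "v": [10, 30]})
df_b = pd.DataFrame({"user_id": [2, 4], "v": [20, 40]})

# keep only rows whose user_id appears in df_user
filtered = [x[x.user_id.isin(df_user.user_id)] for x in [df_a, df_b]]
```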

How to fetch very particular elements from a list?

I have a list of dataframes with different shapes, which is why I can't put the data into a 3-dimensional df or array.
Now I want to pull some specific dataframes from that list into a new list containing only the needed dfs.
list_of_df = [df1, df2, df3, ...]
index = [0,3,7,9,29,11,18,77,1009]
new_list = list_of_df[index]
the only way I can think of this is very unsexy:
new_list = []
for i in index:
    new_list.append(list_of_df[i])
is there some better solution or in general a more convenient way to store and access thousands of different dataframes?
You can use a list comprehension:
new_list = [df for (i, df) in enumerate(list_of_df) if i in index]
The answer of @Quang Hoang solves the issue with NumPy fancy indexing (dtype=object keeps NumPy from trying to coerce the dataframes):
import numpy as np

new_list = np.array(list_of_df, dtype=object)[index]
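Another standard-library option is operator.itemgetter, which, unlike filtering with `if i in index`, preserves the order given in index (shown here with strings standing in for the dataframes; note itemgetter returns a single item rather than a tuple when index has length 1):

```python
from operator import itemgetter

list_of_df = ["df0", "df1", "df2", "df3"]  # stand-ins for dataframes
index = [0, 3, 1]

# itemgetter(0, 3, 1) picks those positions in the order given
new_list = list(itemgetter(*index)(list_of_df))
```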
