calculating length of several files in pandas using for loop


I have five data frames (df1, df2, df3, df4, df5), and I want to print the length of each using the following code:
df1 = pd.read_excel("/Users/us/Desktop/cymbalta_rated_1.xlsx")
df2 = pd.read_excel("/Users/us/Desktop/cymbalta_rated_2.xlsx")
df3 = pd.read_excel("/Users/us/Desktop/cymbalta_rated_3.xlsx")
df4 = pd.read_excel("/Users/us/Desktop/cymbalta_rated_4.xlsx")
df5 = pd.read_excel("/Users/us/Desktop/cymbalta_rated_5.xlsx")
for i in [1,2,3,4,5]:
    print(len(dfi.index))
But it throws the following error:
"name 'dfi' is not defined"
I also tried this:
for i in [1,2,3,4,5]:
    print(len(df[i].index))
But that did not work.
This code works:
print(len(df1.index))
But I have to change the variable name each time.
What is the problem, and how can I solve it?

Python does not substitute the value of i into a variable name, so dfi refers to a variable literally called dfi. It doesn't become df1 just because i is 1 (or something else). For the same reason, df[i] looks up a variable called df, which was never defined.
In your case you could simply iterate over a sequence of the dataframes:
df1 = pd.read_excel("/Users/us/Desktop/cymbalta_rated_1.xlsx")
df2 = pd.read_excel("/Users/us/Desktop/cymbalta_rated_2.xlsx")
df3 = pd.read_excel("/Users/us/Desktop/cymbalta_rated_3.xlsx")
df4 = pd.read_excel("/Users/us/Desktop/cymbalta_rated_4.xlsx")
df5 = pd.read_excel("/Users/us/Desktop/cymbalta_rated_5.xlsx")
for dfi in (df1, df2, df3, df4, df5): # explicitly defines the variable "dfi"!
    print(len(dfi.index))
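If you would rather keep the df[i] spelling from the question, one option is to store the frames in a dict keyed by the file number. This is only a minimal sketch: the name dfs and the tiny in-memory frames are assumptions standing in for the Excel files, which would normally be loaded with pd.read_excel.

```python
import pandas as pd

# Stand-in data for illustration only. In the real case each value would be
# pd.read_excel(f"/Users/us/Desktop/cymbalta_rated_{i}.xlsx").
dfs = {i: pd.DataFrame({"rating": list(range(i))}) for i in [1, 2, 3, 4, 5]}

lengths = {}
for i in [1, 2, 3, 4, 5]:
    # dfs[i] is an ordinary dict lookup, so the loop variable really is used
    lengths[i] = len(dfs[i].index)
    print(i, lengths[i])
```

Keeping the frames in one container also means that adding a sixth file only changes the list of numbers, not the loop body.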

Related

With a PANDAS dataframe, how do I print out the variable name along with the data inside it

Using the Python package PANDAS, I have a simple loop where I want to print some data frame information
for df in (df1, df2, df3, df4, df5, df6, df7):
print(df.keys())
However, I also want to label each dataframe with its variable name (ex: df1 or df2). Is there a way to print the variable name?
Looking at "How do I print the variable name holding an object?", there doesn't seem to be a good answer for Java, but I'm not sure about Python. Any tips? Thanks.

Use zip() to iterate over two lists simultaneously.
dfs=[df1, df2, df3, df4, df5, df6, df7]
dfs_names=['df1', 'df2', 'df3', 'df4', 'df5', 'df6', 'df7']
for name, df in zip(dfs_names, dfs):
    print(name, df.keys())
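An alternative to two parallel lists is a single dict that maps each label to its frame, so a name and its data cannot drift out of step. The sketch below uses made-up frames and column names purely for illustration; the labels still have to be typed by hand, because a DataFrame does not know which variable names refer to it.

```python
import pandas as pd

# Illustrative frames; the names and columns are invented for this sketch.
dfs = {
    "df1": pd.DataFrame({"a": [1, 2], "b": [3, 4]}),
    "df2": pd.DataFrame({"c": [5, 6]}),
}

labelled = {}
for name, df in dfs.items():
    # Record and print each label next to the column names of its frame
    labelled[name] = list(df.columns)
    print(name, labelled[name])
```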

How to use magic store commands within a python function?

I have the following python functions:
def fun1():
    df1 = pd.read_csv("test1.csv")
    df2 = pd.read_csv("test2.csv")
    return df1, df2
def fun2(df1, df2):
    df3 = df1
    return df1, df2, df3
def fun3(df1, df2, df3):
    df3.to_csv("df3.csv", index=False)
    %store df3 #here is the problem
    return df1, df2, df3
When I run the above, I get the error message shown below:
UsageError: Unknown variable 'df3'
However, I would like to store df3 using the magic command so that I can use it in other Jupyter notebooks.
Can someone help me avoid this error?

Could I define multiple dataframe using pd.DataFrame()?

Is it possible to create multiple dataframes at once with pandas?
The following code is what I have tried, but it doesn't work:
df1, df2 = pd.DataFrame()

You could do something like this:
df1, df2 = (pd.DataFrame(),) * 2
Or, more explicitly:
df1, df2 = pd.DataFrame(), pd.DataFrame()
Or even:
df1 = df2 = pd.DataFrame()
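Note that the first and third forms bind both names to the same DataFrame object, so an in-place change made through df1 also shows up in df2; only the second form creates two independent frames.
See this answer for a great explanation.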
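The difference between these forms is easy to check with the is operator. The following is a small sketch (the variable names a, b, c, d and the column "x" are arbitrary) comparing chained assignment with two separate constructor calls:

```python
import pandas as pd

# Chained assignment binds both names to ONE DataFrame object.
a = b = pd.DataFrame()
same_object = a is b

# Two separate constructor calls give two independent objects.
c, d = pd.DataFrame(), pd.DataFrame()
c["x"] = [1, 2, 3]  # in-place change to c only
independent = (c is not d) and ("x" not in d.columns)

a["x"] = [1, 2, 3]  # in-place change, visible through b as well
shared_change = "x" in b.columns

print(same_object, independent, shared_change)
```

If the frames are meant to be filled separately, the form with two explicit pd.DataFrame() calls is the safer choice.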

Concatenate multiple dataframe and columns names

I have a list of data frames
liste = [df1, df2, df3, df4]
sharing the same index, called "date". I concatenate them as follows:
pd.concat((dd for dd in liste), axis=1, join='inner')
But the columns all have the same name. I can override the column names manually, but I wonder whether the columns can take the names of the corresponding data frames instead, in this case "df1", "df2".

You can merge them as follows:
import pandas as pd
from functools import reduce
liste = [df1, df2, df3, df4]
df_final = reduce(lambda left,right: pd.merge(left,right,on='name'), liste)
Or:
... code snippet ...
df1.merge(df2,on='col_name').merge(df3,on='col_name').merge(df4,on='col_name')
Update based on comment:
To collect the column names of each data frame automatically, you could adapt the code below to your liking (I assume each frame has a single column):
colnames = {}
for i in range(len(liste)):
    name = liste[i].columns
    colnames[i+1] = name
... merge with code above ...

You could use merge:
df = liste[0]
for data_frame in liste[1:]:
    df = df.merge(data_frame, left_index=True, right_index=True)
By default, overlapping column names get the suffixes _x and _y, which become ambiguous after repeated merges. You can control this with suffixes=, so perhaps use the position from an enumerate in the loop?
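Another option, closer to the pd.concat call in the question, is the keys= argument of pd.concat, which labels the columns of each frame with a name you supply. A minimal sketch follows; the sample data, the "value" column, and the names list are assumptions made for illustration, and the labels still have to be written out by hand.

```python
import pandas as pd

# Illustrative data: two frames sharing a "date" index and a column name.
idx = pd.Index(["2020-01-01", "2020-01-02"], name="date")
df1 = pd.DataFrame({"value": [1, 2]}, index=idx)
df2 = pd.DataFrame({"value": [3, 4]}, index=idx)

liste = [df1, df2]
names = ["df1", "df2"]  # one label per frame, supplied manually

# keys= adds an outer column level holding the label of each frame
combined = pd.concat(liste, axis=1, join="inner", keys=names)

# Optionally flatten the two-level columns into "df1_value", "df2_value"
flat = combined.copy()
flat.columns = [f"{frame}_{col}" for frame, col in combined.columns]
print(flat)
```

With the two-level columns, combined["df1"] returns the columns that came from the first frame, which can be handier than flattened names when each frame has several columns.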

Variable instead of dataframe name in pandas function

I have something like
df3 = pd.merge(df1, df2, how='inner', left_on='x', right_on='y')
But I would like the two dataframes to be represented by variables instead:
df3 = pd.merge(df_var, df_var2, how='inner', left_on='x', right_on='y')
I get this error: ValueError: can not merge DataFrame with instance of type
I'm stuck on how to get pandas to recognize the variable as the name of the dataframe. thanks!

How about storing the dataframes in a dict and referencing them by key? Here var and var2 are assumed to hold the names as strings:
var, var2 = 'df1', 'df2'
df_dict = {var: df1, var2: df2}
df3 = pd.merge(df_dict[var], df_dict[var2], how='inner', left_on='x', right_on='y')

If the DataFrames have been converted to strings, you need to convert them back to DataFrames before merging. Here's one way you could do it:
from io import StringIO
# to strings
df_var = df1.to_string(index=False)
df_var2 = df2.to_string(index=False)
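# back to DataFrames (to_string output is whitespace-aligned, not comma-separated,
# so this only round-trips cleanly when the values contain no spaces)
df_var = pd.read_csv(StringIO(df_var), sep=r'\s+')
df_var2 = pd.read_csv(StringIO(df_var2), sep=r'\s+')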
df3 = pd.merge(df_var, df_var2, how='inner', left_on='x', right_on='y')
