How to append multiple dataframes with the same prefix in Python

I have multiple sequentially named dataframes like this:
import pandas as pd

df1 = pd.DataFrame([['tom', 10], ['nick', 15], ['juli', 14]], columns=['Name', 'Age'])
df2 = pd.DataFrame([['tom', 10], ['nick', 15], ['juli', 14]], columns=['Name', 'Age'])
df3 = pd.DataFrame([['tom', 10], ['nick', 15], ['juli', 14]], columns=['Name', 'Age'])
df4 = pd.DataFrame([['tom', 10], ['nick', 15], ['juli', 14]], columns=['Name', 'Age'])
I need to create a for loop to append them and get a new dataframe.
I tried the code below, but it doesn't work, because Python treats "df" + str(i) as a string rather than the dataframe of that name.
tempdf = df1
for i in range(2, 5):
    tempdf = tempdf.append("df" + str(i))
print(tempdf)
How do I get python to recognise them as dataframe objects I created?

First, I should highlight that having to do this suggests a problem in the way the source dataframes were generated, and you should look into fixing that.
With Python, there are ways to do almost anything you want. Whether it is desirable to make use of such power is another question altogether.
In this case, the safest way would probably be to use globals():
n_dataframes = 4
g = globals()
dataframes = [g[f'df{i}'] for i in range(1, n_dataframes + 1)]
result_df = pd.concat(dataframes)
print(result_df)
Output:
Name Age
0 tom 10
1 nick 15
2 juli 14
0 tom 10
1 nick 15
2 juli 14
0 tom 10
1 nick 15
2 juli 14
0 tom 10
1 nick 15
2 juli 14
You can perform further processing on the result, such as calling reset_index.
Another alternative is to use eval, which veers firmly into "you shouldn't do this unless you really know what you're doing" territory, because it allows execution of arbitrary code:
dataframes = [eval(f'df{i}') for i in range(1, n_dataframes + 1)]
Note that the above code uses f-strings, which were introduced in Python 3.6. If your Python version is older, replace f'df{i}' with 'df{}'.format(i).
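As noted at the top, the cleaner fix is to avoid numbered variables entirely and collect the frames in a list as they are created. A minimal sketch, assuming the frames come from some generation loop:

```python
import pandas as pd

# assumption: the frames are produced in a loop rather than as df1..df4 variables
dataframes = []
for _ in range(4):
    dataframes.append(
        pd.DataFrame([['tom', 10], ['nick', 15], ['juli', 14]],
                     columns=['Name', 'Age'])
    )

# a list can be concatenated directly; no name lookup needed
result_df = pd.concat(dataframes, ignore_index=True)
print(result_df)
```

With a list there is nothing to look up by name, so neither globals() nor eval is needed.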

You were proceeding in the correct direction; just use eval:
tempdf = df1
for i in range(2, 5):
    tempdf = tempdf.append(eval("df" + str(i)))
print(tempdf)
Note: eval can run arbitrary code and is considered bad practice; prefer other approaches where possible. Also note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions use pd.concat instead.

Related

Pandas-InnerJoin- Multiplication of Rows

I have two sets of data, with one common column. Some rows have repetitions so I created a similar small example.
Here are my dataframes:
#Dataframe1
import pandas as pd
data = [['tom', 10], ['tom', 11], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
#Dataframe2
data2 = [['tom', 'LA'], ['tom', 'AU'], ['nick', 'NY'], ['juli', 'London']]
df2 = pd.DataFrame(data2, columns = ['Name', 'City'])
#InnerJoin
a = pd.merge(df, df2, how= 'inner', on = 'Name')
a
The result is:
   Name  Age    City
0   tom   10      LA
1   tom   10      AU
2   tom   11      LA
3   tom   11      AU
4  nick   15      NY
5  juli   14  London
So, instead of 2 rows with Tom, we have 4 rows. How can I solve this issue?
Thank you,
Create a temporary key for duplicate names, so that the first Tom in df joins to the first Tom in df2, the second Tom to the second Tom, and so on.
df = df.assign(name_key = df.groupby('Name').cumcount())
df2 = df2.assign(name_key = df2.groupby('Name').cumcount())
df.merge(df2, how='inner', on=['Name', 'name_key'])
Output:
Name Age name_key City
0 tom 10 0 LA
1 tom 11 1 AU
2 nick 15 0 NY
3 juli 14 0 London
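Putting the pieces together, a runnable sketch (note the second cumcount must be computed on df2, not df), with the temporary key dropped at the end:

```python
import pandas as pd

df = pd.DataFrame([['tom', 10], ['tom', 11], ['nick', 15], ['juli', 14]],
                  columns=['Name', 'Age'])
df2 = pd.DataFrame([['tom', 'LA'], ['tom', 'AU'], ['nick', 'NY'], ['juli', 'London']],
                   columns=['Name', 'City'])

# cumcount numbers repeated names 0, 1, 2, ... within each group,
# so merging on ['Name', 'name_key'] pairs duplicates positionally
df = df.assign(name_key=df.groupby('Name').cumcount())
df2 = df2.assign(name_key=df2.groupby('Name').cumcount())

merged = df.merge(df2, how='inner', on=['Name', 'name_key']).drop(columns='name_key')
print(merged)
```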

How to sum same columns (differentiated by suffix) in pandas?

I have a dataframe that looks like this:
total_customers total_customer_2021-03-31 total_purchases total_purchases_2021-03-31
1 10 4 6
3 14 3 2
Now, I want to sum up, row-wise, the columns that are the same except for the suffix. I.e. the expected output is:
total_customers total_purchases
11 10
17 5
I cannot do this manually because there are 100+ column pairs, so I need an efficient way to do this. Also, the order of the columns is not predictable. What do you recommend?
Thanks!
We need to build an Index of column labels in which each pair of columns shares the same name; then we can groupby-sum on axis=1:
cols = pd.Index(['total_customers', 'total_customers',
'total_purchases', 'total_purchases'])
result_df = df.groupby(cols, axis=1).sum()
With the shown example, we can use str.replace to replace an optional s, followed by an underscore, followed by the date format (four digits-two digits-two digits), with a single s. This pattern may need to be modified depending on the actual column names:
cols = df.columns.str.replace(r's?_\d{4}-\d{2}-\d{2}$', 's', regex=True)
result_df = df.groupby(cols, axis=1).sum()
result_df:
total_customers total_purchases
0 11 10
1 17 5
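Note that groupby with axis=1 is deprecated in recent pandas (2.1+). An equivalent sketch, using the same regex to clean the labels, transposes, groups on the row index, and transposes back:

```python
import pandas as pd

df = pd.DataFrame({
    'total_customers': [1, 3],
    'total_customer_2021-03-31': [10, 14],
    'total_purchases': [4, 3],
    'total_purchases_2021-03-31': [6, 2],
})

# collapse the dated suffixes so each pair of columns shares one label
cols = df.columns.str.replace(r's?_\d{4}-\d{2}-\d{2}$', 's', regex=True)

# transpose, group rows by the cleaned label, sum, transpose back
result_df = df.set_axis(cols, axis=1).T.groupby(level=0).sum().T
print(result_df)
```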
Setup and imports:
import pandas as pd
df = pd.DataFrame({
'total_customers': [1, 3],
'total_customer_2021-03-31': [10, 14],
'total_purchases': [4, 3],
'total_purchases_2021-03-31': [6, 2]
})
Assuming that your dataframe is called df, you can sum each pair directly (note the quotes around the column names):
sum_customers = df['total_customers'] + df['total_customer_2021-03-31']
sum_purchases = df['total_purchases'] + df['total_purchases_2021-03-31']
df_total = pd.DataFrame({'total_customers': sum_customers, 'total_purchases': sum_purchases})
and that will give you the output you want
import pandas as pd
data = {"total_customers": [1, 3], "total_customer_2021-03-31": [10, 14], "total_purchases": [4, 3], "total_purchases_2021-03-31": [6, 2]}
df = pd.DataFrame(data=data)
final_df = pd.DataFrame()
final_df["total_customers"] = df.filter(regex='total_customers*').sum(1)
final_df["total_purchases"] = df.filter(regex='total_purchases*').sum(1)
Output:
final_df
total_customers total_purchases
0 11 10
1 17 5
Using #HenryEcker's sample data, and building off of the example in the docs, you can create a function and groupby on the column axis:
def get_column(column):
    if column.startswith('total_customer'):
        return 'total_customers'
    return 'total_purchases'

df.groupby(get_column, axis=1).sum()
total_customers total_purchases
0 11 10
1 17 5
I changed the headings while coding to make them shorter, just for information.
data = {"total_c" : [1,3], "total_c_2021" :[10,14],
"total_p": [4,3], "total_p_2021": [6,2]}
df = pd.DataFrame(data)
df["total_customers"] = df["total_c"] + df["total_c_2021"]
df["total_purchases"] = df["total_p"] + df["total_p_2021"]
If you don't want to see the other columns you can drop them:
df = df.loc[:, ['total_customers', 'total_purchases']]
NEW PART
So I might have found a starting point for your solution! I don't know the column names, but the following code can be adapted if your column names follow a pattern (dates, prefixes, etc.). Can you change the column names with a loop?
df['total_customer'] = df[[col for col in df.columns if col.startswith('total_c')]].sum(axis=1)
This approach might be helpful for you with some alterations.

Adding column in a dataframe with 0,1 values based on another column values

In the example dataframe created below:
Name Age
0 tom 10
1 nick 15
2 juli 14
I want to add another column 'Checks' with values 0 or 1, depending on whether the Name appears in the list check, where check = ['nick'].
I have tried the below code:
import numpy as np
import pandas as pd
# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]
check = ['nick']
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])
df['Checks'] = np.where(df['Name']== check[], 1, 0)
#print dataframe.
print(df)
print(check)
You can use str.contains:
phrase = ['tom', 'nick']
df['check'] = df['Name'].str.contains('|'.join(phrase))
You can use pandas.Series.isin:
check = ['nick']
df['check'] = df['Name'].isin(check).astype(int)
Output:
Name Age check
0 tom 10 0
1 nick 15 1
2 juli 14 0
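For completeness, the np.where attempt from the question works once the membership test is fixed; isin handles check lists of any length. A sketch of that correction:

```python
import numpy as np
import pandas as pd

data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
check = ['nick']

# isin tests membership in the whole list; check[0] would only cover one name
df['Checks'] = np.where(df['Name'].isin(check), 1, 0)
print(df)
```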

Frequency of values in a column in multiple pandas data frames

I have multiple pandas data frames (more than 70), each with the same columns. Let's say there are only 10 rows in each data frame. I want to count the occurrences of each value in a column (say 'Name') across the data frames and list them. Example:
# Import pandas library
import pandas as pd
# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])
data = [['sam', 12], ['nick', 15], ['juli', 14]]
df2 = pd.DataFrame(data, columns = ['Name', 'Age'])
I am expecting the output as
Name Age
tom 1
sam 1
nick 2
juli 2
You can do the following:
from collections import Counter
d = {'df1': df1, 'df2': df2, ..., 'df70': df70}
l = [list(d[i]['Name']) for i in d]
m = sum(l, [])
result = Counter(m)
print(result)
Do you want value counts of Name column across all dataframes?
main = pd.concat([df,df2])
main["Name"].value_counts()
juli 2
nick 2
sam 1
tom 1
Name: Name, dtype: int64
This can work if your data frames are not costly to concat:
pd.concat([x['Name'] for x in [df,df2]]).value_counts()
nick 2
juli 2
tom 1
sam 1
You can try this:
df = pd.concat([df, df2]).groupby('Name', as_index=False).count()
df.rename(columns={'Age': 'Count'}, inplace=True)
print(df)
Name Count
0 juli 2
1 nick 2
2 sam 1
3 tom 1
You can try this:
df = pd.concat([df1,df2])
df = df.groupby(['Name'])['Age'].count().to_frame().reset_index()
df = df.rename(columns={"Age": "Count"})
print(df)
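With 70+ frames, keeping them in a list avoids spelling out df1 … df70 by hand. A minimal sketch, using two small frames as stand-ins for the full set:

```python
import pandas as pd

# assumption: the frames are collected in a list as they are created
frames = [
    pd.DataFrame([['tom', 10], ['nick', 15], ['juli', 14]], columns=['Name', 'Age']),
    pd.DataFrame([['sam', 12], ['nick', 15], ['juli', 14]], columns=['Name', 'Age']),
]

# concatenate once, then count values in the column of interest
counts = pd.concat(frames)['Name'].value_counts()
print(counts)
```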

String increment of characters for a column

I've tried researching but didn't get any leads, so I'm posting a question.
I have a df and I want each character of the string column values to be shifted up by 3, based on its ASCII value.
import pandas as pd

data = [['Tom', 10], ['Nick', 15], ['Juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
Name Age
0 Tom 10
1 Nick 15
2 Juli 14
The final answer should look like this, with each Name shifted by 3 ASCII code points:
Name Age
0 Wrp 10
1 Qlfn 15
2 Myol 14
This action has to be carried out on a df with 32,000 rows. Please suggest how to achieve this result.
Here's one way using python's built-in chr and ord (it seems like you want an increment of 3 not 2):
df['Name'] = [''.join(chr(ord(s)+3) for s in i) for i in df.Name]
print(df)
Name Age
0 Wrp 10
1 Qlfn 15
2 Mxol 14
Try the code below,
data = [['Tom', 10], ['Nick', 15], ['Juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

def fn(inp_str):
    return ''.join([chr(ord(i) + 3) for i in inp_str])

df['Name'] = df['Name'].apply(fn)
df
Output is
Name Age
0 Wrp 10
1 Qlfn 15
2 Mxol 14
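Since this runs on 32,000 rows, a quick sanity check is that the shift is reversible: shifting back by 3 must recover the original names. A minimal sketch (the shift helper is a hypothetical name, not from the question):

```python
import pandas as pd

df = pd.DataFrame([['Tom', 10], ['Nick', 15], ['Juli', 14]], columns=['Name', 'Age'])

def shift(s, k):
    # shift every character's code point by k
    return ''.join(chr(ord(c) + k) for c in s)

original = df['Name'].copy()
df['Name'] = df['Name'].apply(shift, k=3)    # encode; kwargs pass through to shift
restored = df['Name'].apply(shift, k=-3)     # decode

print(df['Name'].tolist())                   # shifted names
print(restored.equals(original))             # round trip check
```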
