Pandas Inner Join - Multiplication of Rows - Python

I have two sets of data with one common column. Some rows have repeated values, so I created a small, similar example.
Here are my dataframes:
#Dataframe1
import pandas as pd
data = [['tom', 10], ['tom', 11], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
#Dataframe2
data2 = [['tom', 'LA'], ['tom', 'AU'], ['nick', 'NY'], ['juli', 'London']]
df2 = pd.DataFrame(data2, columns = ['Name', 'City'])
#InnerJoin
a = pd.merge(df, df2, how='inner', on='Name')
a
The result is:
   Name  Age    City
0   tom   10      LA
1   tom   10      AU
2   tom   11      LA
3   tom   11      AU
4  nick   15      NY
5  juli   14  London
So, instead of 2 rows with Tom, we have 4 rows. How can I solve this issue?
Thank you,

This happens because an inner join produces one row for every pairing of matching keys, so the 2 Tom rows in df match both Tom rows in df2, giving 2 × 2 = 4 rows. Create a temporary key that numbers duplicate names in order, so that the first Tom in df joins to the first Tom in df2, the 2nd Tom joins to the 2nd Tom in df2, etc.
df = df.assign(name_key = df.groupby('Name').cumcount())
df2 = df2.assign(name_key = df2.groupby('Name').cumcount())
df.merge(df2, how='inner', on=['Name', 'name_key'])
Output:
Name Age name_key City
0 tom 10 0 LA
1 tom 11 1 AU
2 nick 15 0 NY
3 juli 14 0 London
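If the helper column is not wanted in the final result, it can be dropped right after the merge (a small follow-up sketch reusing the name_key column created above):
a = df.merge(df2, how='inner', on=['Name', 'name_key']).drop(columns='name_key')
print(a)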

Related

Adding a column in a dataframe with 0/1 values based on another column's values

In the example dataframe created below:
Name Age
0 tom 10
1 nick 15
2 juli 14
I want to add another column 'Checks' whose values are 1 or 0 depending on whether the Name is contained in the list check = ['nick'].
I have tried the below code:
import numpy as np
import pandas as pd
# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]
check = ['nick']
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])
df['Checks'] = np.where(df['Name'] == check[0], 1, 0)
#print dataframe.
print(df)
print(check)
You can use str.contains:
phrase = ['tom', 'nick']
df['check'] = df['Name'].str.contains('|'.join(phrase))
You can use pandas.Series.isin:
check = ['nick']
df['check'] = df['Name'].isin(check).astype(int)
output:
Name Age check
0 tom 10 0
1 nick 15 1
2 juli 14 0
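Note the difference between the two approaches: str.contains does substring matching, while isin does exact matching. A minimal sketch (the extra name 'nickolas' is purely hypothetical, just to show the contrast):
s = pd.Series(['nick', 'nickolas', 'tom'])
print(s.str.contains('nick'))  # True, True, False  (substring match)
print(s.isin(['nick']))        # True, False, False (exact match)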

Frequency of values in a column across multiple pandas data frames

I have multiple pandas data frames (more than 70), each having the same columns. Let's say there are only 10 rows in each data frame. I want to count the occurrences of each value of a column ('Name' in the example) across all the data frames and list them. Example:
# Import pandas library
import pandas as pd
# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])
data = [['sam', 12], ['nick', 15], ['juli', 14]]
df2 = pd.DataFrame(data, columns = ['Name', 'Age'])
I am expecting the output as
Name Count
tom 1
sam 1
nick 2
juli 2
You can do the following:
from collections import Counter
d = {'df1': df1, 'df2': df2, ..., 'df70': df70}
l = [list(d[i]['Name']) for i in d]
m = sum(l, [])
result = Counter(m)
print(result)
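If you then want the result as a data frame like the expected output, the Counter can be converted directly (a sketch assuming result is the Counter built above):
counts = pd.DataFrame(list(result.items()), columns=['Name', 'Count'])
print(counts)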
Do you want value counts of Name column across all dataframes?
main = pd.concat([df,df2])
main["Name"].value_counts()
juli 2
nick 2
sam 1
tom 1
Name: Name, dtype: int64
This can work if your data frames are not costly to concat:
pd.concat([x['Name'] for x in [df,df2]]).value_counts()
nick 2
juli 2
tom 1
sam 1
You can try this:
df = pd.concat([df, df2]).groupby('Name', as_index=False).count()
df.rename(columns={'Age': 'Count'}, inplace=True)
print(df)
Name Count
0 juli 2
1 nick 2
2 sam 1
3 tom 1
You can try this:
df = pd.concat([df, df2])
df = df.groupby(['Name'])['Age'].count().to_frame().reset_index()
df = df.rename(columns={"Age": "Count"})
print(df)
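The same idea generalises to any number of frames by first collecting them in a list. A sketch, starting again from the original df and df2 of the example; frames is a hypothetical list that in the real case would hold all 70+ dataframes:
frames = [df, df2]  # in the real case this would contain all 70+ frames
counts = (pd.concat(frames)
            .groupby('Name', as_index=False)['Age'].count()
            .rename(columns={'Age': 'Count'}))
print(counts)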

Finding the difference between two data frames

I have two data frames, say df1 and df2, each having two columns ['Name', 'Marks'].
I want to find the difference between the two dfs for corresponding Name values.
Eg:
df = pd.DataFrame([["Shivi",70],["Alex",40]],columns=['Names', 'Value'])
df2 = pd.DataFrame([["Shivi",40],["Andrew",40]],columns=['Names', 'Value'])
For df1-df2 I want
pd.DataFrame([["Shivi",30],["Alex",40],["Andrew",40]],columns=['Names', 'Value'])
You can use:
diff = df1.set_index("Name").subtract(df2.set_index("Name"), fill_value=0)
So a complete program will look like this:
import pandas as pd
data1 = {'Name': ["Ashley", "Tom"], 'Marks': [40, 50]}
data2 = {'Name': ["Ashley", "Stan"], 'Marks': [80, 90]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
diff = df1.set_index("Name").subtract(df2.set_index("Name"), fill_value=0)
print(diff)
Output:
Marks
Name
Ashley -40.0
Stan -90.0
Tom 50.0
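If, as in the expected output of the question, the differences should be reported without signs (so a name present in only one frame keeps its value), you can take the absolute value and reset the index afterwards; a small follow-up sketch using the diff frame computed above:
diff = diff.abs().reset_index()
print(diff)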

String increment of characters for a column

I've tried researching but didn't get any leads, so I'm posting a question.
I have a df and I want each character of the string column values to be incremented by 3, based on its ASCII value:
data = [['Tom', 10], ['Nick', 15], ['Juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
Name Age
0 Tom 10
1 Nick 15
2 Juli 14
The final answer should look like this, with each character of Name shifted up by 3 ASCII positions:
Name Age
0 Wrp 10
1 Qlfn 15
2 Mxol 14
This action has to be carried out on a df with 32,000 rows. Please suggest how to achieve this result.
Here's one way using Python's built-in chr and ord:
df['Name'] = [''.join(chr(ord(s)+3) for s in i) for i in df.Name]
print(df)
Name Age
0 Wrp 10
1 Qlfn 15
2 Mxol 14
Try the code below:
data = [['Tom', 10], ['Nick', 15], ['Juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
def fn(inp_str):
    return ''.join([chr(ord(i) + 3) for i in inp_str])
df['Name'] = df['Name'].apply(fn)
df
Output is
Name Age
0 Wrp 10
1 Qlfn 15
2 Mxol 14
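Both answers loop over each string in plain Python, which is usually fine even for 32,000 rows, but a vectorised variant with Series.str.translate may be faster. A minimal sketch, starting from the original df of the question and assuming every character that occurs in the column should simply be shifted by 3:
# build a translation table for the characters that actually occur,
# then apply it with the vectorised Series.str.translate
chars = set(''.join(df['Name']))
table = str.maketrans({c: chr(ord(c) + 3) for c in chars})
df['Name'] = df['Name'].str.translate(table)
print(df)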

Pandas: Converting Columns to Rows based on ID

I am new to pandas.
I have the following dataframe:
df = pd.DataFrame([[1, 'name', 'peter'], [1, 'age', 23], [1, 'height', '185cm']], columns=['id', 'column','value'])
id column value
0 1 name peter
1 1 age 23
2 1 height 185cm
I need to create a single row for each ID. Like so:
id name age height
0 1 peter 23 185cm
Any help is greatly appreciated, thank you.
You can use pivot_table with ','.join as the aggregation function:
df = pd.DataFrame([[1, 'name', 'peter'],
                   [1, 'age', 23],
                   [1, 'height', '185cm'],
                   [1, 'age', 25]], columns=['id', 'column', 'value'])
print (df)
id column value
0 1 name peter
1 1 age 23
2 1 height 185cm
3 1 age 25
df1 = df.astype(str).pivot_table(index="id",columns="column",values="value",aggfunc=','.join)
print (df1)
column age height name
id
1 23,25 185cm peter
Another solution with groupby + apply join and unstack:
df1 = df.astype(str).groupby(["id","column"])["value"].apply(','.join).unstack(fill_value=0)
print (df1)
column age height name
id
1 23,25 185cm peter
Assuming your dataframe is named "df", the line below would help (using the column names from the question):
df.pivot(index="id", columns="column", values="value")
