Iterate over different dataframes - python

I am trying to iterate over three data frames to find the differences between them. I have a master data frame which contains everything, and two other data frames which each contain a subset of the master. I am trying to write Python code to identify what is missing in the other two files. The master file looks like the following:
ID Name
1 Mike
2 Dani
3 Scott
4 Josh
5 Nate
6 Sandy
The second data frame looks like the following:
ID Name
1 Mike
2 Dani
3 Scott
6 Sandy
The third data frame looks like the following:
ID Name
1 Mike
2 Dani
3 Scott
4 Josh
5 Nate
So there will be two output data frames. The desired output for the second data frame looks like the following:
ID Name
4 Josh
5 Nate
The desired output for the third data frame looks like the following:
ID Name
6 Sandy
I didn't find anything similar on Google. I tried this:
for i in second['ID'], third['ID']:
    if i not in master['ID']:
        print(i)
It returns all the data in the master file.
Also, if I try this code:
import pandas as pd
names = ["Mike", "Dani", "Scott", "Josh", "Nate", "Sandy"]
ids = [1, 2, 3, 4, 5, 6]
master = pd.DataFrame({"ID": ids, "Name": names})
# print(master)
names_second = ["Mike", "Dani", "Scott", "Sandy"]
ids_second = [1, 2, 3, 6]
second = pd.DataFrame({"ID": ids_second, "Name": names_second})
# print(second)
names_third = ["Mike", "Dani", "Scott", "Josh", "Nate"]
ids_third = [1, 2, 3, 4, 5]
third = pd.DataFrame({"ID": ids_third, "Name": names_third})
# print(third)
for i in master['ID']:
    if i not in second["ID"]:
        print("NOT IN SECOND", i)
    if i not in third["ID"]:
        print("NOT IN THIRD", i)
Output:
NOT IN SECOND 4
NOT IN SECOND 5
NOT IN THIRD 5
NOT IN SECOND 6
NOT IN THIRD 6
Why does it say NOT IN SECOND 6 and NOT IN THIRD 5?
Any suggestions? Thanks in advance.

You can use .isin with ~ (negation) to filter the dataframes. To compare with second you can use master[~master.ID.isin(second.ID)], and similarly for third:
cmp_master_second, cmp_master_third = master[~master.ID.isin(second.ID)], master[~master.ID.isin(third.ID)]
print(cmp_master_second)
print('\n-------- Separate dataframes -----------\n')
print(cmp_master_third)
Result:
   ID  Name
3   4  Josh
4   5  Nate

-------- Separate dataframes -----------

   ID  Name
5   6  Sandy
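Incidentally, the surprising output in the question comes from how `in` works on a pandas Series: it tests the index labels, not the values. That is why i=6 is reported as "NOT IN SECOND" (second's index is 0-3) while the value 6 is clearly in the column. A quick demonstration with the same toy data:

```python
import pandas as pd

second = pd.DataFrame({"ID": [1, 2, 3, 6], "Name": ["Mike", "Dani", "Scott", "Sandy"]})

# `in` on a Series checks the index labels (0..3 here), not the values
print(6 in second["ID"])         # False: 6 is not an index label
print(6 in second["ID"].values)  # True: 6 is a value in the column
```

Using `.values` (or `.isin` as above) compares against the actual column values.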

You could take a set difference between the Name columns of the master and the other DataFrames:
In [315]: set(master['Name']) - set(second['Name'])
Out[315]: {'Josh', 'Nate'}
In [316]: set(master['Name']) - set(third['Name'])
Out[316]: {'Sandy'}

Related

How to convert the values in rows with the same id to one list? (Pandas in python)

I uploaded the csv file:
#Open the first dataset
train=pd.read_csv("order_products__train.csv",index_col="order_id")
The data looks like:
          product_id
order_id
1                  1
1                  2
1                  3
1                  4
2                  1
2                  2
2                  3
2                  4
2                  5
2                  6
What I want is a data frame that looks like this:
order_id  product_id
1         1,2,3,4
2         1,2,3,4,5,6
Since I want to generate a list like
[[1,2,3,4],[1,2,3,4,5,6]]
Could anyone help?
You can use the function .groupby() to do that:
train = train.groupby(['order_id'])['product_id'].apply(list)
That would give you the expected output:
order_id
1          [1, 2, 3, 4]
2    [1, 2, 3, 4, 5, 6]
Name: product_id, dtype: object
Finally, you can cast this to a DataFrame or directly to a list to get what you want:
train = train.to_frame()  # To pd.DataFrame
# Or
train = train.to_list()  # To nested lists [[1,2,3,4],[1,2,3,4,5,6]]
There must be better ways, but I guess you can simply do the following. Note that order_id is the index here (the file was read with index_col="order_id"), so iterate over the index rather than a column:
list_product = []
for i in train.index.unique():
    list_product.append(train.loc[i, "product_id"].to_list())
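A self-contained sketch of the groupby approach, with made-up rows standing in for the CSV (the real file isn't available here):

```python
import pandas as pd

# stand-in for order_products__train.csv read with index_col="order_id"
train = pd.DataFrame({"order_id": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                      "product_id": [1, 2, 3, 4, 1, 2, 3, 4, 5, 6]}).set_index("order_id")

# groupby accepts the index level name, so this works even though
# order_id is the index rather than a column
grouped = train.groupby("order_id")["product_id"].apply(list)
print(grouped.to_list())  # [[1, 2, 3, 4], [1, 2, 3, 4, 5, 6]]
```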

duplicate and update a row in pandas given its index

I am trying to link patients' IDs with patient images. One patient could have more than one image attached to them. I have added a new column, image_ID, to my dataframe that already has patient_ID.
The code I've written below only keeps the last image_ID of a patient. How can I duplicate and add rows, knowing their indices (the index that corresponds to the patient ID), so that I can duplicate all the other information of the same patient for each of its images?
Since my shuffled_balanced data frame initially doesn't have the image_ID column, I have created it and set it to None. Please note that if row['patient_ID'] in sample is due to the fact that patient_ID is part of image_ID.
I am also open to other ways of approaching this.
import os

shuffled_balanced['image_ID'] = 'None'
for dirpath, dirname, filename in os.walk('/SeaExpNFS/images'):
    if dirpath.endswith('20.0'):
        splits = dirpath.split('/')
        sample = splits[-2][:-6]
        for index, row in shuffled_balanced.iterrows():
            if row['patient_ID'] in sample:
                shuffled_balanced.at[index, 'image_ID'] = sample
I think you're looking for merge. Say you have two dataframes that look something like this:
import pandas as pd

patient_df = pd.DataFrame({"patient_id": [1, 2, 3, 4, 5],
                           "patient_name": ["Penny",
                                            "Leonard",
                                            "Amy",
                                            "Sheldon",
                                            "Rajesh"]})
img_df = pd.DataFrame({"patient_id": [2, 3, 4, 4, 1],
                       "img_file": ["leonard.jpg",
                                    "amy.jpg",
                                    "sheldon.jpg",
                                    "sheldon2.jpg",
                                    "penny.jpg"]})
>>> patient_df
   patient_id patient_name
0           1        Penny
1           2      Leonard
2           3          Amy
3           4      Sheldon
4           5       Rajesh
>>> img_df
   patient_id      img_file
0           2   leonard.jpg
1           3       amy.jpg
2           4   sheldon.jpg
3           4  sheldon2.jpg
4           1     penny.jpg
You can merge them like so:
>>> patient_df.merge(img_df, on="patient_id", how="outer")
   patient_id patient_name      img_file
0           1        Penny     penny.jpg
1           2      Leonard   leonard.jpg
2           3          Amy       amy.jpg
3           4      Sheldon   sheldon.jpg
4           4      Sheldon  sheldon2.jpg
5           5       Rajesh           NaN
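Since the question mentions that patient_ID is part of image_ID, one way to get a mergeable patient_id column is Series.str.extract. This is a sketch with hypothetical image IDs; the leading-digits pattern is an assumption about the real naming scheme:

```python
import pandas as pd

# hypothetical image IDs that embed the patient id as a leading number
img_df = pd.DataFrame({"image_ID": ["2_scan_a", "3_scan_a", "4_scan_a",
                                    "4_scan_b", "1_scan_a"]})

# pull the leading digits out as the patient id (the regex is an assumption)
img_df["patient_id"] = img_df["image_ID"].str.extract(r"^(\d+)", expand=False).astype(int)
print(img_df["patient_id"].tolist())  # [2, 3, 4, 4, 1]
```

The resulting frame can then be merged on patient_id as shown, which duplicates each patient row once per image.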

Keep strings present in a list from a column in pandas

I have a problem similar to this question, but with the opposite challenge. Instead of a removal list, I have a keep list - a list of strings I'd like to keep. My question is how to use the keep list to filter out the unwanted strings and retain the wanted ones in the column.
import pandas as pd
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "name": [
            "Mitty, Kitty",
            "Kandy, Puppy",
            "Judy, Micky, Loudy",
            "Cindy, Judy",
            "Kitty, Wicky",
        ],
    }
)
   ID                name
0   1        Mitty, Kitty
1   2        Kandy, Puppy
2   3  Judy, Micky, Loudy
3   4         Cindy, Judy
4   5        Kitty, Wicky
To_keep_lst = ["Kitty", "Kandy", "Micky", "Loudy", "Wicky"]
Use Series.str.findall with Series.str.join:
To_keep_lst = ["Kitty", "Kandy", "Micky", "Loudy", "Wicky"]
df['name'] = df['name'].str.findall('|'.join(To_keep_lst)).str.join(', ')
print(df)
   ID          name
0   1         Kitty
1   2         Kandy
2   3  Micky, Loudy
3   4
4   5  Kitty, Wicky
Use a comprehension to keep only the names in the keep list:
keep_names = lambda x: ', '.join([n for n in x.split(', ') if n in To_keep_lst])
df['name'] = df['name'].apply(keep_names)
print(df)
# Output:
   ID          name
0   1         Kitty
1   2         Kandy
2   3  Micky, Loudy
3   4
4   5  Kitty, Wicky
Note: the answer from @jezrael is much faster than mine.
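One caveat with the findall pattern: a plain alternation matches substrings, so a keep name such as "Micky" would also match inside a longer name like "Mickyson". Wrapping the alternation in word boundaries avoids this. A small sketch on two of the rows above:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2], "name": ["Mitty, Kitty", "Kandy, Puppy"]})
To_keep_lst = ["Kitty", "Kandy"]

# \b word boundaries stop the pattern matching inside longer names
pattern = r"\b(?:{})\b".format("|".join(To_keep_lst))
df["name"] = df["name"].str.findall(pattern).str.join(", ")
print(df["name"].tolist())  # ['Kitty', 'Kandy']
```

If the keep names can contain regex metacharacters, escape them first with re.escape.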

Format data in Pandas for multi-level Sankey in Plotly: source and target columns

I have data on the sequence of courses taken by students and I would like to represent the flows between classes using a Sankey diagram. My data is in a Pandas dataframe in a long format, where each step that someone took has a row and the order of those steps is specified by a column order:
student   course  order
Jerry     A       1
Jerry     B       2
Jerry     C       NaN
Jessy     C       1
Jessy     A       2
Jessy     B       3
Raphael   A       1
Raphael   C       2
Raphael   C       3
Raphael   B       4
Sally     A       1
Sally     B       2
Sally     C       NaN
I pivoted this table to aggregate it into sequences with the count of each sequence:
course1  course2  course3  course4  count
A        B        End      End      2
A        C        C        B        1
C        A        B        End      1
Note that I want to retain the End value, but if this causes problems, I am happy to abandon this and just have people stop at a step.
Building a Sankey in Plotly requires a data format with the source and target. Here is the example on the Plotly docs.
source = [0, 1, 0, 2, 3, 3],
target = [2, 3, 3, 4, 4, 5],
value = [8, 4, 2, 8, 4, 2]
I need to get my data into the format above, but for my entire dataframe.
If I were dealing with a small dataset, like the toy one above, I could create this manually. However, I have a dataframe with thousands of rows and no idea how to do this in Pandas. It looks like some sort of window calculation.
I am also kinda confused about how to create the coding scheme because my failed attempts create a viz that doesn't have a sequence of four courses.
Any help is much appreciated.
Toy data:
import numpy as np
import pandas as pd

student = ['Jerry','Jerry','Jerry','Jessy','Jessy','Jessy','Raphael','Raphael','Raphael','Raphael','Sally','Sally','Sally']
course = ['A','B','C','C','A','B','A','C','C','B','A','B','C']
order = [1,2,np.nan,1,2,3,1,2,3,4,1,2,np.nan]
df = pd.DataFrame({'student': student, 'course': course, 'order': order})
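One way to build the source/target/value lists (a sketch, not a definitive answer): drop the unordered rows, tag each course with its step so the diagram keeps a left-to-right sequence, pair each node with the next one via groupby + shift, then count transitions. The step-tagged node labels and the 'End' sentinel are my own choices:

```python
import numpy as np
import pandas as pd

student = ['Jerry','Jerry','Jerry','Jessy','Jessy','Jessy',
           'Raphael','Raphael','Raphael','Raphael','Sally','Sally','Sally']
course = ['A','B','C','C','A','B','A','C','C','B','A','B','C']
order = [1,2,np.nan,1,2,3,1,2,3,4,1,2,np.nan]
df = pd.DataFrame({'student': student, 'course': course, 'order': order})

# rows without an order are dropped, matching the pivoted table above
df = df.dropna(subset=['order']).sort_values(['student', 'order'])

# tag each course with its step so "A as course1" and "A as course2" are distinct nodes
df['step'] = df.groupby('student').cumcount() + 1
df['node'] = df['course'] + ' (step ' + df['step'].astype(str) + ')'

# the next node in each student's sequence; last step flows to an 'End' sentinel
df['next_node'] = df.groupby('student')['node'].shift(-1).fillna('End')

# count each (node, next_node) transition to get the link weights
links = df.groupby(['node', 'next_node']).size().reset_index(name='value')

# encode node labels as integers, as Plotly expects
labels = pd.Index(pd.unique(links[['node', 'next_node']].values.ravel()))
source = labels.get_indexer(links['node']).tolist()
target = labels.get_indexer(links['next_node']).tolist()
value = links['value'].tolist()
```

These lists can then be passed to plotly.graph_objects.Sankey(node=dict(label=list(labels)), link=dict(source=source, target=target, value=value)).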

Make index first row in group in pandas dataframe

I was wondering if it is possible to insert, as the first row of each group when grouping by index, the name of that group's index. Suppose we have a df like this:
dic = {'index_col': ['a','a','a','b','b','b'], 'col1': [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(dic).set_index('index_col')
Is it possible to transform the dataframe above so that the index is reset and, for every group, the first row is the index name?
The result is a pandas.Series:
import pandas

df_list = []
for label, group in df.groupby('index_col'):
    df_list.append(pandas.concat([pandas.Series([label]), group['col1']]))
df_result = pandas.concat(df_list).reset_index(drop=True)
Output:
0    a
1    1
2    2
3    3
4    b
5    4
6    5
7    6
dtype: object
Call df_result.to_frame() if you want a data-frame.
