I am trying to link patient IDs with patient images. One patient could have more than one image attached to them. I have added a new column, image_ID, to my dataframe, which already has patient_ID.
The code I've written below only adds the last image_ID of a patient. How can I duplicate and add rows, knowing their indices (the index that corresponds to the patient ID), so that I can duplicate all the other information of the same patient for all of its images?
Since my shuffled_balanced data frame initially doesn't have the image_ID column, I have created it and set it to None. Please note that the check if row['patient_ID'] in sample is a substring test, since patient_ID is part of image_ID.
I am also open to other ways of approaching this.
shuffled_balanced['image_ID'] = 'None'
for dirpath, dirname, filename in os.walk('/SeaExpNFS/images'):
    if dirpath.endswith('20.0'):
        splits = dirpath.split('/')
        sample = splits[-2][:-6]
        for index, row in shuffled_balanced.iterrows():
            if row['patient_ID'] in sample:
                shuffled_balanced.at[index, 'image_ID'] = sample
I think you're looking for merge. Say you have two dataframes that look something like this:
import pandas as pd
patient_df = pd.DataFrame({"patient_id": [1, 2, 3, 4, 5],
                           "patient_name": ["Penny",
                                            "Leonard",
                                            "Amy",
                                            "Sheldon",
                                            "Rajesh"]})
img_df = pd.DataFrame({"patient_id": [2, 3, 4, 4, 1],
                       "img_file": ["leonard.jpg",
                                    "amy.jpg",
                                    "sheldon.jpg",
                                    "sheldon2.jpg",
                                    "penny.jpg"]})
>>> patient_df
patient_id patient_name
0 1 Penny
1 2 Leonard
2 3 Amy
3 4 Sheldon
4 5 Rajesh
>>> img_df
patient_id img_file
0 2 leonard.jpg
1 3 amy.jpg
2 4 sheldon.jpg
3 4 sheldon2.jpg
4 1 penny.jpg
You can merge them like so:
>>> patient_df.merge(img_df, on="patient_id", how="outer")
patient_id patient_name img_file
0 1 Penny penny.jpg
1 2 Leonard leonard.jpg
2 3 Amy amy.jpg
3 4 Sheldon sheldon.jpg
4 4 Sheldon sheldon2.jpg
5 5 Rajesh NaN
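Applied to your case, a rough sketch might look like the following; I'm assuming, as in your loop, that each patient_ID occurs as a substring of the directory-derived sample name:
import os
import pandas as pd

# collect every image_ID found under the images directory
records = []
for dirpath, dirnames, filenames in os.walk('/SeaExpNFS/images'):
    if dirpath.endswith('20.0'):
        sample = dirpath.split('/')[-2][:-6]
        # match the sample back to every patient whose ID it contains
        for pid in shuffled_balanced['patient_ID'].unique():
            if pid in sample:
                records.append({'patient_ID': pid, 'image_ID': sample})
img_df = pd.DataFrame(records)

# one row per (patient, image); patient rows are duplicated automatically
shuffled_balanced = shuffled_balanced.merge(img_df, on='patient_ID', how='left')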
I am developing a clinical bioinformatics application, and the input this application gets is a data frame that looks like this:
df = pd.DataFrame({'store': ['Blank_A09', 'Control_4p', '13_MEG3', '04_GRB10', '02_PLAGL1', 'Control_21q', '01_PLAGL1', '11_KCNQ10T1', '16_SNRPN', '09_H19', 'Control_6p', '06_MEST'],
                   'quarter': [1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2],
                   'employee': ['Blank_A09', 'Control_4p', '13_MEG3', '04_GRB10', '02_PLAGL1', 'Control_21q', '01_PLAGL1', '11_KCNQ10T1', '16_SNRPN', '09_H19', 'Control_6p', '06_MEST'],
                   'foo': [1, 1, 2, 2, 1, 1, 9, 2, 2, 4, 2, 2],
                   'columnX': ['Blank_A09', 'Control_4p', '13_MEG3', '04_GRB10', '02_PLAGL1', 'Control_21q', '01_PLAGL1', '11_KCNQ10T1', '16_SNRPN', '09_H19', 'Control_6p', '06_MEST']})
print(df)
store quarter employee foo columnX
0 Blank_A09 1 Blank_A09 1 Blank_A09
1 Control_4p 1 Control_4p 1 Control_4p
2 13_MEG3 2 13_MEG3 2 13_MEG3
3 04_GRB10 2 04_GRB10 2 04_GRB10
4 02_PLAGL1 1 02_PLAGL1 1 02_PLAGL1
5 Control_21q 1 Control_21q 1 Control_21q
6 01_PLAGL1 2 01_PLAGL1 9 01_PLAGL1
7 11_KCNQ10T1 2 11_KCNQ10T1 2 11_KCNQ10T1
8 16_SNRPN 2 16_SNRPN 2 16_SNRPN
9 09_H19 2 09_H19 4 09_H19
10 Control_6p 2 Control_6p 2 Control_6p
11 06_MEST 2 06_MEST 2 06_MEST
This is a minimal reproducible example, but the real one has an uncertain number of columns, in which the 1st, the 3rd, the 5th, the 7th, etc. "should" be exactly the same.
And this is what I want to check: I want to ensure that these columns have their values in the same order.
I know how to check whether 2 columns are exactly the same, but I don't know how to extend this check across the whole data frame.
EDIT:
The column names change; the ones in my example are just examples.
Refer to How to check if 3 columns are same and add a new column with the value if the values are same?
Here is some code that checks whether several columns hold the same values and returns the indices of the rows in which they all match (foo_test is a column from the linked question; substitute your own):
import numpy as np
arr = df[['quarter', 'foo_test', 'foo']].values  # you can add as many columns as you wish
np.where((arr == arr[:, [0]]).all(axis=1))
You need to tweak it for your usage
Edit
columns_to_check = [x for x in range(0, len(df.columns), 2)]  # the 1st, 3rd, 5th, ... columns
arr = df.iloc[:, columns_to_check].values
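Putting the pieces together, a minimal sketch of the full check could look like this (assuming, as in your example, that every other column starting from the first should be identical):
import numpy as np
import pandas as pd

# columns at positions 0, 2, 4, ... are the ones that should match
arr = df.iloc[:, ::2].values
# rows in which all selected columns agree with the first of them
matching_rows = np.where((arr == arr[:, [0]]).all(axis=1))[0]
# the columns are identical, element-wise and in order, iff every row matches
print(len(matching_rows) == len(df))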
If you want an efficient method you can hash the Series using pandas.util.hash_pandas_object, making the operation O(n):
pd.util.hash_pandas_object(df.T, index=False)
We clearly see that store/employee/columnX have the same hash:
store 18266754969677227875
quarter 11367719614658692759
employee 18266754969677227875
foo 92544834319824418
columnX 18266754969677227875
dtype: uint64
You can further use groupby to identify the identical values:
df.columns.groupby(pd.util.hash_pandas_object(df.T, index=False))
output:
{ 92544834319824418: ['foo'],
11367719614658692759: ['quarter'],
18266754969677227875: ['store', 'employee', 'columnX']}
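If you want to pull out just the groups of duplicated columns programmatically, here is a small sketch building on the groupby result above:
groups = df.columns.groupby(pd.util.hash_pandas_object(df.T, index=False))
duplicated = [list(cols) for cols in groups.values() if len(cols) > 1]
print(duplicated)  # [['store', 'employee', 'columnX']]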
I loaded the CSV file:
# Open the first dataset
train = pd.read_csv("order_products__train.csv", index_col="order_id")
The data looks like:
          product_id
order_id
1                  1
1                  2
1                  3
1                  4
2                  1
2                  2
2                  3
2                  4
2                  5
2                  6
What I want is a data frame that looks like this,
order_id  product_id
1         1,2,3,4
2         1,2,3,4,5,6
since I want to generate a nested list like
[[1,2,3,4],[1,2,3,4,5,6]]
Could anyone help?
You can use the .groupby() function to do that:
train = train.groupby(['order_id'])['product_id'].apply(list)
That would give you the expected output:
order_id
1          [1, 2, 3, 4]
2    [1, 2, 3, 4, 5, 6]
Name: product_id, dtype: object
Finally, you can cast this to a DataFrame or directly to a list to get what you want:
train = train.to_frame()  # to pd.DataFrame
# or
train = train.to_list()  # to nested lists [[1,2,3,4],[1,2,3,4,5,6]]
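If you also want order_id back as a regular column, matching the desired two-column layout exactly, you can chain reset_index onto the groupby (a small variation, starting again from the original frame):
train = train.groupby('order_id')['product_id'].apply(list).reset_index()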
There must be better ways, but I guess you can simply do the following:
# order_id was read in as the index, so bring it back as a column first
train = train.reset_index()
list_product = []
for i in train["order_id"].unique():
    tmp = train[train["order_id"] == i]
    list_product.append(tmp["product_id"].to_list())
I have two dataframes. In the first one I have the customers and a column with a list of every restaurant he/she visited.
In [1]: df_customers
Out[1]:
Document Restaurants
0 '000000984 [20504916171, 20504916171, 20499859164]
1 '000010076 [20505918674, 20505918674, 20505918674]
2 '000010319 [20253346711, 20524403863, 20508246677]
3 '000018468 [20253346711, 20538456226, 20505918674]
4 '000024409 [20553255881, 20553596441, 20553255881]
5 '000025944 [20492255719, 20600654226]
6 '000031162 [20600351398, 20408462399, 20499859164]
7 '000055177 [20524403863, 20524403863]
8 '000058303 [20600997239, 20524403863, 20600997239]
9 '000074791 [20517920178, 20517920178, 20517920178]
In my other dataframe I have a column with the restaurants and another with a given value for each
In [2]: df_rest
Out[2]:
Restaurant Points
0 10026575473 1
1 10037003331 1
2 10072208299 1
3 10179698400 2
4 10214262750 1
I need to create a column in my customers dataframe with the sum of the points given to each restaurant he/she visited.
I tried something like this:
df_customers["Sum"]=df_rest.loc[df_rest["Restaurant"].isin(df_customers["Restaurants"]),"Points"].sum()
But I'm getting this error:
TypeError: unhashable type: 'list'
I'm trying to avoid iterating over my customers dataframe; it takes too long. Any help?
Aim not to use lists within a pandas Series: using lists removes the possibility of vectorised operations. It is more efficient to expand your jagged array of restaurant lists into a single dataframe, then map to points via a dictionary and sum.
Here's a minimal example:
df1 = pd.DataFrame({'Document': [1, 2],
                    'Restaurants': [[20504916171, 20504916171, 20499859164],
                                    [20505918674, 20505918674]]})
df2 = pd.DataFrame({'Restaurant': [20504916171, 20504916171, 20499859164,
                                   20505918674, 20505918674],
                    'Points': [1, 2, 1, 3, 2]})
ratmap = df2.set_index('Restaurant')['Points'].to_dict()
df1['score'] = pd.DataFrame(df1['Restaurants'].values.tolist())\
                 .applymap(ratmap.get).fillna(0).sum(1).astype(int)
print(df1)
Document Restaurants score
0 1 [20504916171, 20504916171, 20499859164] 5
1 2 [20505918674, 20505918674] 4
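On newer pandas (0.25+), DataFrame.explode offers an arguably cleaner route to the same result; a minimal sketch, reusing ratmap from above:
exploded = df1.explode('Restaurants')
df1['score'] = (exploded['Restaurants'].map(ratmap)
                .fillna(0)
                .groupby(level=0).sum()
                .astype(int))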
I would first expand the df into:
d = {c: df_customers[c].values.repeat(df_customers.Restaurants.str.len(), axis=0) for c in df_customers.columns}
d['Restaurants'] = [i for sub in df_customers.Restaurants for i in sub]
df3 = pd.DataFrame(d)
Document Restaurants
0 000000984 20504916171
1 000000984 20504916171
2 000000984 20499859164
3 000010076 20505918674
4 000010076 20505918674
5 000010076 20505918674
6 000010319 20253346711
7 000010319 20524403863
Then map
df3['Point'] = df3.Restaurants.map(df_rest.set_index('Restaurant').Points).fillna(0)
Document Restaurants Point
0 000000984 20504916171 1
1 000000984 20504916171 1
2 000000984 20499859164 0
3 000010076 20505918674 0
4 000010076 20505918674 0
5 000010076 20505918674 0
Then group by Document and sum:
df3.groupby('Document').sum()
Restaurants Point
Document
000000984 61509691506 2.0
000010076 61517756022 0.0
000010319 61285997251 0.0
000018468 61297721611 0.0
Values are mocked, because no restaurant id from your df_customers is present in your df_rest in the example you provided.
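One small refinement you may want: sum only the Point column, so the restaurant IDs themselves don't get added up (a minor variation on the groupby above):
result = df3.groupby('Document', as_index=False)['Point'].sum()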
I am trying to iterate over three data frames to find the differences between them. I have a master data frame which contains everything, and two other data frames which each contain part of the master data frame. I am trying to write Python code to identify what is missing in the other two files. The master file looks like the following:
ID Name
1 Mike
2 Dani
3 Scott
4 Josh
5 Nate
6 Sandy
The second data frame looks like the following:
ID Name
1 Mike
2 Dani
3 Scott
6 Sandy
The third data frame looks like the following:
ID Name
1 Mike
2 Dani
3 Scott
4 Josh
5 Nate
So there will be two output data frames. The desired output for the second data frame looks like the following:
ID Name
4 Josh
5 Nate
The desired output for the third data frame looks like the following:
ID Name
6 Sandy
I didn't find anything similar on Google. I tried this:
for i in second['ID'], third['ID']:
    if i not in master['ID']:
        print(i)
It returns all the data in master file.
Also if I try this code :
import pandas as pd
names = ["Mike", "Dani", "Scott", "Josh", "Nate", "Sandy"]
ids = [1, 2, 3, 4, 5, 6]
master = pd.DataFrame({"ID": ids, "Name": names})
# print(master)
names_second = ["Mike", "Dani", "Scott", "Sandy"]
ids_second = [1, 2, 3, 6]
second = pd.DataFrame({"ID": ids_second, "Name": names_second})
# print(second)
names_third = ["Mike", "Dani", "Scott", "Josh", "Nate"]
ids_third = [1, 2, 3, 4, 5]
third = pd.DataFrame({"ID": ids_third, "Name": names_third})
# print(third)
for i in master['ID']:
    if i not in second["ID"]:
        print("NOT IN SECOND", i)
    if i not in third["ID"]:
        print("NOT IN THIRD", i)
OUTPUT ::
NOT IN SECOND 4
NOT IN SECOND 5
NOT IN THIRD 5
NOT IN SECOND 6
NOT IN THIRD 6
Why does it say NOT IN SECOND 6 and NOT IN THIRD 5?
Any suggestion? Thanks in advance.
Your loop misbehaves because in on a pandas Series tests membership in the index labels, not the values, which is why IDs such as 6 appear "missing" even though they are present. Instead, you can use .isin with ~ to filter dataframes. To compare with second you can use master[~master.ID.isin(second.ID)], and similarly for third:
cmp_master_second, cmp_master_third = master[~master.ID.isin(second.ID)], master[~master.ID.isin(third.ID)]
print(cmp_master_second)
print('\n-------- Separate dataframes -----------\n')
print(cmp_master_third)
Result:
   ID  Name
3   4  Josh
4   5  Nate

-------- Separate dataframes -----------

   ID  Name
5   6  Sandy
You could do a set difference on the Name columns of the master and the other DataFrames:
In [315]: set(master['Name']) - set(second['Name'])
Out[315]: {'Josh', 'Nate'}
In [316]: set(master['Name']) - set(third['Name'])
Out[316]: {'Sandy'}
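If you want the full missing rows rather than just the names, another option is a left merge with indicator=True; a minimal sketch using the frames from the question:
missing_from_second = (master.merge(second, on=['ID', 'Name'], how='left', indicator=True)
                             .query("_merge == 'left_only'")
                             .drop(columns='_merge'))
print(missing_from_second)  # rows 4 Josh and 5 Nate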
I have two CSV files. Depending upon the value of a cell in CSV file 1, I should be able to search for that value in a column of CSV file 2 and get the corresponding value from another column in CSV file 2.
I am sorry if this is very confusing. It will probably become clear with an illustration:
CSV file 1:
Car  Mileage
A    8
B    6
C    10
CSV file 2:
Score  Mileage(Min)  Mileage(Max)
1      1             3
2      4             6
3      7             9
4      10            12
5      13            15
And my desired output CSV file is something like this:
Car  Mileage  Score
A    8        3
B    6        2
C    10       4
Car A is given a score of 3 because its mileage of 8 falls into the 7-9 range in CSV file 2, and the corresponding score for that range is 3.
Any help will be appreciated.
Thanks in advance.
As of writing this, the current stable release is v0.21.
To read your files, use pd.read_csv -
df0 = pd.read_csv('file1.csv')
df1 = pd.read_csv('file2.csv')
df0
Car Mileage
0 A 8
1 B 6
2 C 10
df1
Score Mileage(Min) Mileage(Max)
0 1 1 3
1 2 4 6
2 3 7 9
3 4 10 12
4 5 13 15
To find the Score, use pd.IntervalIndex by calling IntervalIndex.from_tuples. This should be really fast -
v = df1.loc[:, 'Mileage(Min)':'Mileage(Max)'].apply(tuple, 1).tolist()
idx = pd.IntervalIndex.from_tuples(v, closed='both')  # you can also use `from_arrays`
df0['Score'] = df1['Score'].iloc[idx.get_indexer(df0.Mileage.values)].values
df0
Car Mileage Score
0 A 8 3
1 B 6 2
2 C 10 4
Other methods of creating an IntervalIndex are outlined here.
To write your result, use pd.DataFrame.to_csv -
df0.to_csv('file3.csv')
Here's a high-level outline of what I've done here.
First, read in your CSV files.
Use pd.IntervalIndex to build an interval tree, so searching is now logarithmic in complexity.
Use idx.get_indexer to find the index of each value in the tree.
Use the index to locate the Score value in df1, and assign this back to df0. Note that I call .values; otherwise, the values will be misaligned when assigning back.
Write your result back to CSV.
For more information on IntervalIndex, take a look at this SO Q/A - Finding matching interval(s) in pandas IntervalIndex
Note that IntervalIndex is new in v0.20, so if you have an older version, make sure you update your version with
pip install --upgrade pandas
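One caveat worth noting: idx.get_indexer returns -1 for any value that falls in no interval, and positional indexing with -1 would silently pick the last row. Here is a quick sketch of a guard, assuming you'd rather have NaN for such mileages:
import numpy as np
pos = idx.get_indexer(df0['Mileage'].values)
scores = df1['Score'].values[pos].astype(float)
scores[pos == -1] = np.nan  # mileage outside every range
df0['Score'] = scores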
You can use IntervalIndex, new in version 0.20.0+:
First create DataFrames by read_csv:
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
Create IntervalIndex by from_arrays:
s = pd.IntervalIndex.from_arrays(df2['Mileage(Min)'], df2['Mileage(Max)'], 'both')
print (s)
IntervalIndex([[1, 3], [4, 6], [7, 9], [10, 12], [13, 15]],
              closed='both',
              dtype='interval[int64]')
Look up the Mileage values via the IntervalIndex and set the Score column from the underlying array (created by values), because otherwise the indices are not aligned and you get:
TypeError: incompatible index of inserted column with frame index
df1['Score'] = df2.set_index(s).loc[df1['Mileage'], 'Score'].values
print (df1)
Car Mileage Score
0 A 8 3
1 B 6 2
2 C 10 4
And last, write to a file with to_csv:
df1.to_csv('file3.csv', index=False)
Setup
data = [(1,1,3), (2,4,6), (3,7,9), (4,10,12), (5,13,15)]
df = pd.DataFrame(data, columns=['Score','MMin','MMax'])
car_data = [('A', 8), ('B', 6), ('C', 10)]
car = pd.DataFrame(car_data, columns=['Car','Mileage'])
def find_score(x, df):
    result = -99
    for idx, row in df.iterrows():
        if x >= row.MMin and x <= row.MMax:
            result = row.Score
    return result

car['Score'] = car.Mileage.apply(lambda x: find_score(x, df))
Which yields
In [58]: car
Out[58]:
Car Mileage Score
0 A 8 3
1 B 6 2
2 C 10 4
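Since find_score scans every range for every car (iterrows is O(n*m)), a vectorised alternative worth considering for contiguous integer ranges like these is pd.cut; a minimal sketch, assuming the ranges tile 1-15 with no gaps:
# bin edges chosen so each (left, right] bin matches one [MMin, MMax] range
bins = [0, 3, 6, 9, 12, 15]
car['Score'] = pd.cut(car['Mileage'], bins=bins, labels=df['Score']).astype(int)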