Compare two values for different rows in Pandas dataframe - python

I have a dataset of submission records with different submission times, grouped by id and sub_id. Each id can have several submissions with different sub_id values, indicating sub-events of the original event. For instance:
id  sub_id     submission_time       valuation_time        amend_time
G1  Original   2021-05-13T00:11:05Z  2021-05-13T00:12:05Z
G1  Valuation  2021-05-13T06:11:05Z  2021-05-13T06:12:10Z
G1  Amend      2021-05-14T08:09:01Z  2021-05-14T09:09:05Z  2021-05-18T19:19:15Z
G2  Original   2021-04-12T00:11:05Z  2021-04-12T00:12:05Z
G2  Valuation  2021-04-12T06:11:05Z  2021-04-12T06:12:10Z
...
I would like to go through the dataset and check whether the valuation_time of the sub_id == "Valuation" row is after the submission_time of the sub_id == "Original" row under the same id. If it is, I would like to add a new column and set it to "Pass" for the Valuation row, otherwise "Fail".
I would really appreciate your help on this, as I have no clue how to approach this challenge. Thank you so much.

Please try this:
import datetime
import pandas as pd

df = pd.read_excel(r'C:\MyCodes\samplepython.xlsx')
df['Status'] = ''
rows = []
for index, row in df.iterrows():
    sub_time = datetime.datetime.strptime(row['submission_time'], "%Y-%m-%dT%H:%M:%SZ")
    val_time = datetime.datetime.strptime(row['valuation_time'], "%Y-%m-%dT%H:%M:%SZ")
    if row['sub_id'] == 'Valuation' and val_time > sub_time:
        row['Status'] = 'Pass'
    elif row['sub_id'] == 'Valuation' and val_time <= sub_time:
        row['Status'] = 'Fail'
    rows.append(row)
df_new = pd.DataFrame(rows)  # DataFrame.append is deprecated; collect rows and build once

Code:
import datetime
import pandas as pd

list_values = [['G1', 'Original',
                datetime.datetime.strptime('2021-05-13T00:11:05Z', "%Y-%m-%dT%H:%M:%SZ"),
                datetime.datetime.strptime('2021-05-13T00:12:05Z', "%Y-%m-%dT%H:%M:%SZ")],
               # <please load the other rows here>
               ['G2', 'Valuation',
                datetime.datetime.strptime('2021-04-12T06:11:05Z', "%Y-%m-%dT%H:%M:%SZ"),
                datetime.datetime.strptime('2021-04-12T06:12:10Z', "%Y-%m-%dT%H:%M:%SZ")]]
df = pd.DataFrame(list_values, columns=['id', 'sub_id', 'submission_time', 'valuation_time'])
df = df.sort_values(by=['id', 'sub_id'])  # sort_values returns a copy, so assign it back
status = []
level = 0
for index, row in df.iterrows():
    if level == 0 and row['sub_id'] == 'Original':
        sub_time = row['submission_time']
        status.append('')
        level += 1
    elif level == 1 and row['sub_id'] == 'Valuation':
        val_time = row['valuation_time']
        if sub_time > val_time:
            status.append('Fail')
        else:
            status.append('Pass')
        level = 0
    else:
        level = 0
        status.append('')
df["Status"] = status
print(df)
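For larger datasets, a vectorized alternative avoids iterrows() entirely. This is a minimal sketch, assuming every id has exactly one Original row: it maps each id's Original submission_time onto its Valuation row and compares directly.

# Map each id to the submission_time of its Original row
orig_times = df[df['sub_id'] == 'Original'].set_index('id')['submission_time']

# Compare only on Valuation rows; all other rows keep an empty Status
is_val = df['sub_id'] == 'Valuation'
df['Status'] = ''
df.loc[is_val, 'Status'] = (
    df.loc[is_val, 'valuation_time'] > df.loc[is_val, 'id'].map(orig_times)
).map({True: 'Pass', False: 'Fail'})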

Related

KeyError: "['China'] not in index"

I am trying to select tourists whose nationality is China. This is my code snippet:
import pandas as pd
China = ses[(ses["S1Country"] == 1)]
List_China = China[['Case','S1Country']]
List_China
This is what I had before the error. Here I am trying to select certain data, using the code snippet most sources use to perform it:
import pandas as pd
Ranking1 = ses[(ses["Q7Infor1"] == 1)]
List_Ranking1 = Ranking1[['China','Q7Infor1']]
List_Ranking1
Then I wrote this code and it reported back:
KeyError: "['China'] not in index"
How do I solve it?
Thanks for checking in!
Sample of the data: (attached as an image in the original post)
Assuming that you are trying to filter the column Q7Infor1 by the value China, then you can use df[df['col'] == value].
Thus, your original code:
List_Ranking1 = Ranking1[['China','Q7Infor1']]
Becomes this:
List_Ranking1 = Ranking1[Ranking1['Q7Infor1'] == 'China']
Check this answer for the different ways you can filter a column by a row value.
Update, after seeing OP's dataset:
'China' is not a valid value in the column Q7Infor1. So, assuming that China = 1, we can filter by the value 1:
China = 1
List_Ranking1 = Ranking1[Ranking1['Q7Infor1'] == China]
To count the number of rows:
print(len(List_Ranking1))
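If you would rather filter by the label 'China' than by a bare 1, one option is to map the numeric codes to names first. A sketch, where the code-to-country mapping is hypothetical and should come from the survey's codebook:

# Hypothetical codebook mapping; replace with the real code-to-country table
country_labels = {1: 'China', 2: 'Japan', 3: 'Korea'}
ses['S1CountryName'] = ses['S1Country'].map(country_labels)
List_China = ses[ses['S1CountryName'] == 'China'][['Case', 'S1Country']]
print(len(List_China))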

Return a matching value from 2 dataframes (1 dataframe with single value in cell, 1 with a list in a cell) into 1 dataframe

I have 2 dataframes:
df1
ID Type
2456-AA Coolant
2457-AA Elec
df2
ID Task
[2456-AA, 5656-BB] Check AC
[2456-AA, 2457-AA] Check Equip.
I'm trying to return the matched IDs' 'Type' from df1 to df2, with the result looking something like this:
df2
ID Task Type
[2456-AA, 5656-BB] Check AC [Coolant]
[2456-AA, 2457-AA] Check Equip. [Coolant , Elec]
I tried the following for loop. I understand it isn't the fastest, but I'm struggling to work out a faster alternative:
def type_identifier(type):
    df = df1.copy()
    device_type = []
    for value in df1.ID:
        for x in type:
            if x == value:
                device_type.append(df1.Type.tolist())
            else:
                None
    return device_type

df2['test'] = df2['ID'].apply(lambda x: type_identifier(x))
Could somebody help me out, and also refer me to material that could help me better approach problems like these?
Thank you,
Use pandas' to_dict to convert df1 to a dictionary, so we can efficiently translate ID to Type. Then apply a lambda that converts each list of IDs in df2 to the corresponding types, and assign the result to the test column as you wished.
import pandas as pd

df1 = pd.DataFrame({'ID': ['2456-AA', '2457-AA'],
                    'Type': ['Coolant', 'Elec']})
df2 = pd.DataFrame({'ID': [['2456-AA', '5656-BB'], ['2456-AA', '2457-AA']],
                    'Task': ['Check AC', 'Check Equip.']})

# Use to_dict to convert df1 IDs to Types
id_to_type = df1.set_index('ID').to_dict()['Type']
# {'2456-AA': 'Coolant', '2457-AA': 'Elec'}
print(id_to_type)

# Apply a lambda that converts each `ID` list in `df2` to the right types
df2['test'] = df2['ID'].apply(lambda x: [id_to_type[t] for t in x if t in id_to_type])
print(df2)
Output:
ID Task test
0 [2456-AA, 5656-BB] Check AC [Coolant]
1 [2456-AA, 2457-AA] Check Equip. [Coolant, Elec]
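Note that the `if t in id_to_type` guard silently drops IDs that have no entry in df1 (5656-BB above disappears from the output list). If you would rather keep a placeholder for unmatched IDs, a small variant of the same lambda uses dict.get with a default:

# Keep a marker for IDs missing from df1 instead of dropping them
df2['test'] = df2['ID'].apply(lambda x: [id_to_type.get(t, 'Unknown') for t in x])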

Using Pandas, update column values based on a list of IDs and new values

I have a df with ID and Sell columns. I want to update the Sell column using a list of new Sell values (not all rows need to be updated, just some of them). In all the examples I have seen, the value is always the same or comes from another column. In my case, the value is dynamic.
This is what I would like:
import os
import pandas as pd

file = 'something.csv'  # Has 300 rows
IDList = ['453164259', '453106168', '453163869', '453164463']  # IDs
SellList = [120, 270, 350, 410]  # Sell values
path = os.path.join(os.getcwd(), file)
df = pd.read_csv(path)
df.loc[df['Id'].isin(IDList[x]), 'Sell'] = SellList[x]  # Update the rows with the corresponding Sell value of the ID
df.to_csv(file)
Any ideas?
Thanks in advance
Assuming 'id' is a string (as suggested by IDList) and is not the index of your df:
IDList = ['453164259', '453106168', '453163869', '453164463']  # IDs
SellList = [120, 270, 350, 410]
id_dict = {x: y for x, y in zip(IDList, SellList)}
for index, row in df.iterrows():
    if row['id'] in IDList:
        df.loc[index, 'Sell'] = id_dict[row['id']]
If id is the index:
IDList = ['453164259', '453106168', '453163869', '453164463']  # IDs
SellList = [120, 270, 350, 410]
id_dict = {x: y for x, y in zip(IDList, SellList)}
for index, row in df.iterrows():
    if index in IDList:
        df.loc[index, 'Sell'] = id_dict[index]
What I did is create a dictionary from IDList and SellList and then loop over the df using iterrows().
df = pd.read_csv('something.csv')
IDList= ['453164259','453106168','453163869','453164463']
SellList=[120,270,350,410]
This will work efficiently, especially for large files:
df.set_index('id', inplace=True)
df.loc[IDList, 'Sell'] = SellList
df = df.reset_index()  # not mandatory, just in case you need 'id' back as a column
df.to_csv(file)
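Another vectorized option, as a sketch (assuming the ID column is named 'id' and is read as strings), builds the mapping once and uses Series.map, leaving every row whose id is not in the list untouched:

update_map = dict(zip(IDList, SellList))
# map returns NaN for ids not in the list; fillna keeps their old Sell value
df['Sell'] = df['id'].map(update_map).fillna(df['Sell'])
df.to_csv(file, index=False)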

Pandas DataFrame - Creating a new column from a comparison

I'm trying to create a column called 'city_code' with values from the 'code' column. But to do this I need to check whether the 'ds_city' and 'city' values are equal.
Here is a table sample:
https://i.imgur.com/093GJF1.png
I've tried this:
def find_code(data):
    if data['ds_city'] == data['city']:
        return data['code']
    else:
        return 'UNKNOWN'

df['code_city'] = df.apply(find_code, axis=1)
But since there are duplicates in the 'ds_city' column, this is the result:
https://i.imgur.com/geHyVUA.png
Here is a image of the expected result:
https://i.imgur.com/HqxMJ5z.png
How can I work around this?
You can use pandas merge:
df = pd.merge(df, df[['code', 'city']], how='left',
              left_on='ds_city', right_on='city',
              suffixes=('', '_right')).drop(columns='city_right')
# output:
# code city ds_city code_right
# 0 1500107 ABAETETUBA ABAETETUBA 1500107
# 1 2900207 ABARE ABAETETUBA 1500107
# 2 2100055 ACAILANDIA ABAETETUBA 1500107
# 3 2300309 ACOPIARA ABAETETUBA 1500107
# 4 5200134 ACREUNA ABARE 2900207
Here's pandas.merge's documentation. It takes the input dataframe and left joins its own code and city columns wherever ds_city equals city.
The above code fills code_right with NaN when the city is not found. You can then do the following to fill those with 'UNKNOWN':
df['code_right'] = df['code_right'].fillna('UNKNOWN')
This is more a job for np.where:
import numpy as np
df['code_city'] = np.where(df['ds_city'] == df['city'], df['code'], 'UNKNOWN')
You could try this out:
# Begin with a column of only 'UNKNOWN' values.
data['code_city'] = "UNKNOWN"
# Iterate through the cities in the ds_city column.
for i, lookup_city in enumerate(data['ds_city']):
    # Note the row which contains the corresponding city name in the city column.
    row = data['city'].tolist().index(lookup_city)
    # Reassign this row's code_city to the code from the row we found in the last step
    # (.loc avoids the chained assignment that data['code_city'][i] = ... would trigger).
    data.loc[i, 'code_city'] = data['code'][row]
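A loop-free variant of the same idea, as a sketch (assuming each city appears only once in the 'city' column, so the lookup index is unique):

# Build a city -> code lookup, then map every ds_city through it
city_to_code = df.set_index('city')['code']
df['code_city'] = df['ds_city'].map(city_to_code).fillna('UNKNOWN')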

Pandas dataframe cross-referenced query

I have a dataframe containing an id column, a linked id column, and a value column. The linked id is "optional" and refers to a different row in the same dataframe (with -1 denoting no link). What I want to do is select rows that have a valid link where value is equal to the value in the row given by the linked id:
import pandas as pd
df = pd.DataFrame({"id": [0,1,2,3,4,5], "linkid": [-1,3,-1,0,5,-1], "value": [10, 20, 30, 20, 40, 50]})
print(df)
# should match row 1 (only): id 1 has value 20 and linkid 3 also has value 20
# should not match row 4: id 4 has value 40 but linkid 5 has value 50
matched = df.loc[df.value == df.loc[df.id == df.linkid].value]
# ValueError: Can only compare identically-labeled Series objects
My attempt above results in an error. I suspect my attempt is pretty far from the mark but not sure how to proceed. I want to avoid loops for performance reasons. Any help gratefully received
I thought it was clear enough but as per the comment in the code, my required output in this example is row 1 from the original dataframe:
   id  linkid  value
1   1       3   20.0
I think you can try this:
new_df = df.merge(df[['id','value']].rename(columns={'id':'linkid'}),how='left',on="linkid")
new_df[new_df.value_x == new_df.value_y]
Create another column, value_link, that holds for each row the value of the row whose id equals its linkid, as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({"id": [0,1,2,3,4,5], "linkid": [-1,3,-1,0,5,-1], "value": [10, 20, 30, 20, 40, 50]})
df['value_link'] = df.linkid.apply(lambda x: df[df['id'] == x].value.values[0] if x != -1 else np.nan)
matched = df[df.value == df.value_link]
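If you want to avoid apply as well, a fully vectorized sketch (assuming id values are unique) looks up each linkid with Series.map; a linkid of -1 simply finds no match and yields NaN, which never compares equal:

# Look up the value of the row each linkid points at
link_values = df['linkid'].map(df.set_index('id')['value'])
matched = df[df['value'] == link_values]
print(matched)  # row 1 only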
