Pandas dataframe cross-referenced query - python

I have a dataframe containing an id column, a linked id column, and a value column. The linked id is "optional" and refers to a different row in the same dataframe (with -1 denoting no link). What I want to do is select rows that have a valid link and whose value equals the value in the row given by the linked id:
import pandas as pd
df = pd.DataFrame({"id": [0,1,2,3,4,5], "linkid": [-1,3,-1,0,5,-1], "value": [10, 20, 30, 20, 40, 50]})
print(df)
# should match row 1 (only): id 1 has value 20 and linkid 3 also has value 20
# should not match row 3 (id 0 has value 10) or row 4 (id 5 has value 50)
matched = df.loc[df.value == df.loc[df.id == df.linkid].value]
# ValueError: Can only compare identically-labeled Series objects
My attempt above results in an error. I suspect my attempt is pretty far from the mark, but I'm not sure how to proceed. I want to avoid loops for performance reasons. Any help gratefully received.
I thought it was clear enough but as per the comment in the code, my required output in this example is row 1 from the original dataframe:
id linkid value
1 3 20.0

I think you can try this:
# join each row's linkid against the id column to pull in the linked row's value
new_df = df.merge(df[['id', 'value']].rename(columns={'id': 'linkid'}),
                  how='left', on='linkid')
new_df[new_df.value_x == new_df.value_y]

Create another column value_link holding, for each row, the value of the row whose id equals linkid. As follows:
import pandas as pd
import numpy as np

df = pd.DataFrame({"id": [0, 1, 2, 3, 4, 5],
                   "linkid": [-1, 3, -1, 0, 5, -1],
                   "value": [10, 20, 30, 20, 40, 50]})

# for each linkid, look up the value of the row whose id matches (NaN when there is no link)
df['value_link'] = df.linkid.apply(
    lambda x: df[df['id'] == x].value.values[0] if x != -1 else np.nan)
matched = df[df.value == df.value_link]
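If the per-row apply above proves slow on larger frames, a vectorized sketch using Series.map (assuming the ids are unique) does the same lookup without Python-level loops:

# build an id -> value lookup Series, then map each linkid through it;
# linkid -1 finds no id and maps to NaN, so those rows never compare equal
linked_value = df['linkid'].map(df.set_index('id')['value'])
matched = df[df['value'] == linked_value]
print(matched)
#    id  linkid  value  value_link
# 1   1       3     20        20.0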


new column(s) in a pandas dataframe based on values calculated from rest of the columns

I am struggling to figure out how to return a conditional value to a column based on values in selected columns of the same index row (see the attached picture).
I have a df where the "<0" column is supposed to count the number of instances in the previous 18 columns where values are less than 0.
I also need to count the total number of columns excluding NaN for each row.
Any suggestions?
You can use:
s1 = df.lt(0).sum(axis=1)
s2 = df.notna().sum(axis=1)
df['<0'] = s1
df['TotCount (ex Nan)'] = s2
Or:
cols = df.columns
df['<0'] = df[cols].lt(0).sum(axis=1)
df['TotCount (ex Nan)'] = df[cols].notna().sum(axis=1)
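Since the question counts only the previous 18 columns, it may be safer to restrict both counts to those columns explicitly. A sketch, where the first-18 slice is an assumption about the layout:

# hypothetical: assume the 18 measurement columns come first in the frame
value_cols = df.columns[:18]
df['<0'] = df[value_cols].lt(0).sum(axis=1)
df['TotCount (ex Nan)'] = df[value_cols].notna().sum(axis=1)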
import pandas as pd

data = {
    "1": [420, -3, 390],
    "2": [50, None, 45],
    "3": [-2, None, 4]
}
# load data into a DataFrame object:
df = pd.DataFrame(data)

# capture the original columns before the counter columns are added,
# so later iterations don't count the new columns' NaNs
value_cols = df.columns
# iterate over the rows (not the columns) and count per row
for i in range(len(df)):
    df.loc[i, "Na_Counter"] = df.loc[i, value_cols].isnull().sum()
    df.loc[i, "Negative_Counter"] = (df.loc[i, value_cols] < 0).sum()
I hope it works.

Inserting rows in df based on groupby using value of previous row

I need to insert rows based on the week column, grouped by type. In some cases I have missing weeks in the middle of the dataframe at different positions, and I want to insert rows to fill in the missing rows as copies of the last existing row; in this case, copies of week 7 to fill in weeks 8 and 9, and copies of week 11 to fill in rows for weeks 12, 13 and 14. On this table you can see the jump from week 7 to 10 and from 11 to 15.
The perfect output would be the final table with incremental values in the week column, the correct way.
Below is the code I have; it inserts only one row and I'm confused why:
def middle_values(final: DataFrame) -> DataFrame:
    finaltemp = pd.DataFrame()
    out = pd.DataFrame()
    for i in range(0, len(final)):
        for f in range(1, 52, 1):
            if final.iat[i, 8] == f and final.iat[i - 1, 8] != f - 1:
                if final.iat[i, 8] > final.iat[i - 1, 8] and final.iat[i, 8] != (final.iat[i - 1, 8] - 1):
                    line = final.iloc[i - 1]
                    c1 = final[0:i]
                    c2 = final[i:]
                    c1.loc[i] = line
                    concatinated = pd.concat([c1, c2])
                    concatinated.reset_index(inplace=True)
                    concatinated.iat[i, 11] = concatinated.iat[i - 1, 11]
                    concatinated.iat[i, 9] = f - 1
                    finaltemp = finaltemp.append(concatinated)
    if 'type' in finaltemp.columns:
        for name, groups in finaltemp.groupby(["type"]):
            weeks = range(groups['week'].min(), groups['week'].max() + 1)
            out = out.append(pd.merge(finaltemp, pd.Series(weeks, name='week'), how='right').ffill())
        out.drop_duplicates(subset=['project', 'week'], keep='first', inplace=True)
        out.drop_duplicates(inplace=True)
        out.sort_values(["Budget: Budget Name", "Budget Week"], ascending=(False, True), inplace=True)
        out.drop(['level_0'], axis=1, inplace=True)
        out.reset_index(inplace=True)
        out.drop(['level_0'], axis=1, inplace=True)
        return out
    else:
        return final
For the first part of your question, suppose we have a dataframe like the following:
import pandas as pd
import numpy as np

df = pd.DataFrame({"project": [1, 1, 1, 2, 2, 2], "week": [1, 3, 4, 1, 2, 4], "value": [12, 22, 18, 17, 18, 23]})
We can create a new multi index to get the additional rows that we need
new_index = pd.MultiIndex.from_arrays(
    [sorted([i for i in df['project'].unique()] * 52),
     [i for i in np.arange(1, 53, 1)] * df['project'].unique().shape[0]],
    names=['project', 'week'])
We can then apply this index to get the new dataframe that you need with blanks in the new rows
df = df.set_index(['project', 'week']).reindex(new_index).reset_index().sort_values(['project', 'week'])
You would then need to apply a forward fill (using ffill) or a back fill (using bfill) with groupby and transform to get the required values in the rows that you need.
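A minimal sketch of that last step, assuming the example frame's value column:

# propagate the last known value forward within each project group
df['value'] = df.groupby('project')['value'].ffill()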

Compare two values for different rows in Pandas dataframe

I have a dataset of submission records with different submission times that are grouped by id and sub_id. There will be several submissions with different sub_id under the id to indicate they are the sub-events of the original event. For instance:
id  sub_id     submission_time       valuation_time        amend_time
G1  Original   2021-05-13T00:11:05Z  2021-05-13T00:12:05Z
G1  Valuation  2021-05-13T06:11:05Z  2021-05-13T06:12:10Z
G1  Amend      2021-05-14T08:09:01Z  2021-05-14T09:09:05Z  2021-05-18T19:19:15Z
G2  Original   2021-04-12T00:11:05Z  2021-04-12T00:12:05Z
G2  Valuation  2021-04-12T06:11:05Z  2021-04-12T06:12:10Z
...
I would like to go through the dataset and examine whether the valuation_time of the sub_id == "Valuation" row is after the submission_time of the sub_id == "Original" row under the same id. If that is true, I would like to add a new column and mark the sub_id == "Valuation" row as pass, otherwise fail.
I would really appreciate your help on this, as I have no clue on this challenge. Thank you so much.
Please try this:
import datetime
import pandas as pd

df = pd.read_excel(r'C:\MyCodes\samplepython.xlsx')
df['Status'] = ''
df_new = pd.DataFrame()
for index, row in df.iterrows():
    sub_time = datetime.datetime.strptime(row['submission_time'], "%Y-%m-%dT%H:%M:%SZ")
    val_time = datetime.datetime.strptime(row['valuation_time'], "%Y-%m-%dT%H:%M:%SZ")
    if row['sub_id'] == 'Valuation' and val_time > sub_time:
        row['Status'] = 'Pass'
    elif row['sub_id'] == 'Valuation' and val_time <= sub_time:
        row['Status'] = 'Fail'
    # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
    df_new = df_new.append(row)
Code:
import datetime
import pandas as pd

list_values = [['G1', 'Original',
                datetime.datetime.strptime('2021-05-13T00:11:05Z', "%Y-%m-%dT%H:%M:%SZ"),
                datetime.datetime.strptime('2021-05-13T00:12:05Z', "%Y-%m-%dT%H:%M:%SZ")],
               [< please load other values>],
               ['G2', 'Valuation',
                datetime.datetime.strptime('2021-04-12T06:11:05Z', "%Y-%m-%dT%H:%M:%SZ"),
                datetime.datetime.strptime('2021-04-12T06:12:10Z', "%Y-%m-%dT%H:%M:%SZ")]]
df = pd.DataFrame(list_values, columns=['id', 'sub_id', 'submission_time', 'valuation_time'])
df = df.sort_values(by=['id', 'sub_id'])  # assign the result: sort_values does not sort in place

status = []
level = 0
for index, row in df.iterrows():
    if level == 0 and row['sub_id'] == 'Original':
        sub_time = row['submission_time']
        status.append('')
        level += 1
    elif level == 1 and row['sub_id'] == 'Valuation':
        val_time = row['valuation_time']
        if sub_time > val_time:
            status.append('Fail')
        else:
            status.append('Pass')
        level = 0
    else:
        level = 0
        status.append('')
df["Status"] = status
print(df)
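For completeness, a vectorized sketch of the comparison the question asks for, matching each Valuation row against the Original row with the same id (assuming the time columns are already parsed as datetimes, as in the answers above):

# map each id to the submission_time of its Original row
orig_sub = df[df['sub_id'] == 'Original'].set_index('id')['submission_time']

is_val = df['sub_id'] == 'Valuation'
df.loc[is_val, 'Status'] = (
    df.loc[is_val, 'valuation_time'] > df.loc[is_val, 'id'].map(orig_sub)
).map({True: 'Pass', False: 'Fail'})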

Pandas iterate over each row of a column and change its value

I have a pandas dataframe which looks like this:
Name Age
0 tom 10
1 nick 15
2 juli 14
I am trying to iterate over each name --> connect to a mysql database --> match the name with a column in the database --> fetch the id for the name --> and replace the id in the place of name
in the above data frame. The desired output is as follows:
Name Age
0 1 10
1 2 15
2 4 14
The following is the code that I have tried:
import pandas as pd
import MySQLdb
from sqlalchemy import create_engine

engine = create_engine("mysql+mysqldb://root:Abc#123def@localhost/aivu")
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)
for index, rows in df.iterrows():
    cquery = 'select id from students where studentsName="' + rows['Name'] + '"'
    sid = pd.read_sql(cquery, con=engine)
    df['Name'] = sid['id'].iloc[0]
    print(df[['Name', 'Age']])
The above code prints the following output:
Name Age
0 1 10
1 1 15
2 1 14
Name Age
0 2 10
1 2 15
2 2 14
Name Age
0 4 10
1 4 15
2 4 14
I understand it iterates through the entire table for each matched name and prints it. How do I get each value replaced only once?
Here is a slight rewrite of your code; if you want to apply a transformation to a dataframe in general, this is a better way to go about it:
import pandas as pd
import MySQLdb
from sqlalchemy import create_engine

engine = create_engine("mysql+mysqldb://root:Abc#123def@localhost/aivu")
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

def replace_name(name: str) -> int:
    cquery = "select id from students where studentsName='{}'".format(name)
    sid = pd.read_sql(cquery, con=engine)
    return sid['id'].iloc[0]

df['Name'] = df['Name'].apply(replace_name)
This should perform the transformation you're looking for.
The problem in your code as written is the line:
df['Name'] = sid['id'].iloc[0]
This sets every value in the Name column to the first id entry in your query result.
To accomplish what you want, you want something like:
df.loc[index, 'Name'] = sid['id'].iloc[0]
This will set the value at index location index in the Name column to the first id entry in your query result.
This will accomplish what you want to do, and you can stop reading here if you're in a hurry. If you're not in a hurry, and you'd like to become wiser, I encourage you to read on.
It is generally a mistake to loop over the rows in a dataframe. It's also generally a mistake to iterate through a list carrying out a single query on each item in the list. Both of these are slow and error-prone.
A more idiomatic (and faster) way of doing this would be to get all the relevant rows from the database in one query, merge them with your current dataframe, and then drop the column you no longer want. Something like the following:
names = df['Name'].tolist()
quoted_names = ','.join(f"'{name}'" for name in names)
query = f"select id, studentsName as Name from students where studentsName in ({quoted_names})"
student_ids = pd.read_sql(query, con=engine)
df_2 = df.merge(student_ids, on='Name', how='left')
df_with_ids = df_2[['id', 'Age']]
One query executed, no loops to worry about. Let the database engine and Pandas do the work for you.
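One caveat on the sketch above: interpolating names straight into SQL is injection-prone. With SQLAlchemy 1.4+, an expanding bind parameter renders the IN list safely; a sketch against the same hypothetical students table:

from sqlalchemy import bindparam, text

# :names expands to a properly quoted list at execution time
stmt = text("select id, studentsName as Name from students "
            "where studentsName in :names").bindparams(
                bindparam('names', expanding=True))
student_ids = pd.read_sql(stmt, con=engine, params={'names': names})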
You can do this kind of operation the following way; please follow the comments and feel free to ask questions:
import pandas as pd

# create frame
x = pd.DataFrame(
    {
        "name": ["A", "B", "C"],
        "age": [1, 2, 3]
    }
)

# create some kind of db
mock_database = {"A": 10, "B": 20, "C": 30}

x["id"] = None  # add empty column
print(x)

# change values in the new column (.loc avoids chained assignment)
for i in range(len(x["name"])):
    x.loc[i, "id"] = mock_database.get(x.loc[i, "name"])
print("*" * 100)
print(x)
A good way to do that would be:
import pandas as pd
import MySQLdb
from sqlalchemy import create_engine

engine = create_engine("mysql+mysqldb://root:Abc#123def@localhost/aivu")
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

name_ids = []
for student_name in df['Name']:
    cquery = "select id from students where studentsName='{}'".format(student_name)
    sid = pd.read_sql(cquery, con=engine)
    # extract the scalar id; fall back to None when the name is not found
    name_ids.append(sid['id'].iloc[0] if not sid.empty else None)
# DEBUGGED WITH name_ids = [1,2,3]
df['Name'] = name_ids
print(df)
I checked with an example list of ids and it works; I guess if the query format is correct this will work.
Performance-wise I could not think of a better solution, since you will have to do a lot of queries (one for each student), but there is probably some way to get all the ids with fewer queries.

dataframe generating own column names

For a project, I want to create a script that allows the user to enter values (like a value in centimetres) multiple times. I had a While-loop in mind for this.
The values need to be stored in a dataframe, which will be used to generate a graph of the values.
Also, there is no maximum number of entries that the user can enter, so the names of the variables that hold the values have to be generated with each entry (such as M1, M2, M3…Mn). However, the dataframe will only consist of one row (only for the specific case that the user is entering values for).
So, my question boils down to this:
How do I create a dataframe (with pandas) where the script generates its own column name for a measurement, like M1, M2, M3, …Mn, so that all the values are stored.
I can't access my code right now, but I have created a while loop that allows the user to enter values; I'm stuck on the dataframe and columns part.
Any help would be greatly appreciated!
I agree with @mischi; without additional context, pandas seems overkill, but here is an alternate method to create what you describe...
This code proposes a method to collect the values using a while loop and input() (your while loop is probably similar).
colnames = []
inputs = []
counter = 0
while True:
    value = input('Add a value: ')
    if value == 'q':  # provides a way to leave the loop
        break
    else:
        key = 'M' + str(counter)
        counter += 1
        colnames.append(key)
        inputs.append(value)

from pandas import DataFrame

df = DataFrame(inputs, colnames)  # this creates a DataFrame with a single
                                  # column and an index using the colnames
df = df.T                         # this transposes the DataFrame so the
                                  # indexes become the colnames
df.index = ['values']             # sets the name of your row
print(df)
The output of this script looks like this...
Add a value: 1
Add a value: 2
Add a value: 3
Add a value: 4
Add a value: q
M0 M1 M2 M3
values 1 2 3 4
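A more direct construction (a sketch) builds the same single-row frame in one call once the loop has finished, using the colnames and inputs lists collected above:

import pandas as pd

# one row of data, column names from the loop, a named index for the row
df = pd.DataFrame([inputs], columns=colnames, index=['values'])
print(df)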
pandas seems a bit of an overkill, but to answer your question: assuming you collect numerical values from your users and store them in a list:
import numpy as np
import pandas as pd

# np.random.random_integers is deprecated; randint's upper bound is exclusive
values = np.random.randint(0, 11, 10)
print(values)
array([1, 5, 0, 1, 1, 1, 4, 1, 9, 6])
columns = {}
column_base_name = 'Column'
for i, value in enumerate(values):
    columns['{:s}{:d}'.format(column_base_name, i)] = value
print(columns)
{'Column0': 1,
'Column1': 5,
'Column2': 0,
'Column3': 1,
'Column4': 1,
'Column5': 1,
'Column6': 4,
'Column7': 1,
'Column8': 9,
'Column9': 6}
df = pd.DataFrame(data=columns, index=[0])
print(df)
Column0 Column1 Column2 Column3 Column4 Column5 Column6 Column7 \
0 1 5 0 1 1 1 4 1
Column8 Column9
0 9 6
