How can we match DataFrames using loops in Python?

I am trying to write Python code to match two tables, puit_1 and puit_2, in order to retrieve the identifier id from puit_2 by creating a new id column in puit_1 whenever a name in puit_1 matches a name in puit_2.
Below is the loop:
puits_sig['id_credo'] = pd.Series([])
for j in range(len(puits_credo['noms'])):
    id_pcredo = []
    for i in range(len(puits_sig['sigle_puits'])):
        if puits_credo['noms'][j] == puits_credo['noms'][j]:
            if str(puits_sig['sigle_puits'][i]) in puits_credo['noms'][j]:
                id_pcredo.append(puits_credo['ID_Objet'][j])
    print(id_pcredo)
    puits_sig['id_credo'] = id_pcredo
ValueError: length of values 1 does not match length of index 1212
This code gives me an error that I couldn't solve (for information, I'm a beginner in programming).
Can anyone help? Below are some extracts from the tables.
Extract from table 1:

sigle_puit | categorie
BNY2D      | ACTIF
BRM2       | INACTIF

Extract from table 2:

Nom   | ID Object
BLY23 | 89231
BRM1  | 12175

What I believe you're looking for is an inner merge in pandas. This creates a new dataframe containing the 'name' column that puit_1 and puit_2 have in common, plus all the other columns from both dataframes.
merged_df = puit_1.merge(puit_2, how = 'inner', on = ['name'])
See this question for more:
Merge DataFrames with Matching Values From Two Different Columns - Pandas
EDIT: If you want to keep rows whose name exists in only one dataframe, you might want to use how='outer' in your merge operation.
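For illustration, here is a minimal, self-contained sketch; the data and the shared 'name' column are made up for the example, not taken from your tables:

import pandas as pd

# Hypothetical example data; your real tables use different column names
puit_1 = pd.DataFrame({'name': ['BNY2D', 'BRM2'], 'categorie': ['ACTIF', 'INACTIF']})
puit_2 = pd.DataFrame({'name': ['BNY2D', 'BLY23'], 'id': [101, 89231]})

# Inner merge: keeps only names present in both tables
merged_df = puit_1.merge(puit_2, how='inner', on=['name'])

# Outer merge: keeps every name from both tables; missing ids become NaN
merged_outer = puit_1.merge(puit_2, how='outer', on=['name'])

print(merged_df)
print(merged_outer)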

Related

Merging dataframe with "in" operator like approach

I want to merge two dataframes. One consists of a "hit" dataframe where I have a number of protein ids. The second is a database dataframe, which contains all known proteins and their functions. See examples below.
The goal is to associate my protein hits with the respective row in the database dataframe. What complicates this is that some of my hits have multiple proteins in one row (Protein_3A; Protein_3B below). Is there a way to merge these dataframes using an in operation so that the row for protein 3 will be matched as shown in the df below, despite only one of the subtypes (Protein_3B) being present in the database dataframe?
Example starting dataframes
hit_df = pd.DataFrame({"Hits": ["Protein_1", "Protein_2", "Protein_3A; Protein_3B", "Protein_8", "Protein_5"]})
database_df = pd.DataFrame({"Proteins": ["Protein_1", "Protein_2", "Protein_3B", "Protein_4", "Protein_9", "Protein10"],"Function": ["FuncX", "FuncY", "FuncZ", "Unknown", "Unkwown", "FuncA"]})
Desired result dataframe
matched_results_df = pd.DataFrame({"Hits":["Protein_1", "Protein_2", "Protein_3B"], "Function":["FuncX", "FuncY", "FuncZ"]})
First, split each entry on the "; " separator so that each protein ends up on its own row.
hit_df = hit_df["Hits"].str.split("; ", expand=False).explode().to_frame()
print(hit_df)
Hits
0 Protein_1
1 Protein_2
2 Protein_3A
2 Protein_3B
3 Protein_8
4 Protein_5
Then use the isin function to keep only the matching values.
matched_results_df = database_df[database_df['Proteins'].isin(hit_df['Hits'])]
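If you also want the output to use the Hits column name shown in the desired result, one optional follow-up (just a small sketch) is to rename the column afterwards:

# Rename 'Proteins' to 'Hits' so the result matches the desired layout
matched_results_df = (
    database_df[database_df['Proteins'].isin(hit_df['Hits'])]
    .rename(columns={'Proteins': 'Hits'})
    .reset_index(drop=True)
)
print(matched_results_df)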

Attempting to compare different tables and cut out rows based on comparison in Pandas

So I have two different tables right now. Each table contains several columns of information, including a column with a specific date.
Example:
[Table 1]
Unique Identifier (Primary Key) / Date / Piece of Information
0001 / December 1, 2020 / Apples
[Table 2]
Unique Identifier (Primary Key) / Date / Piece of Information
0001 / December 5, 2020 / Oranges
I am trying to compare the two tables: if the second table has a date that is AFTER the first table's date (for the same unique identifier), I would like to write that row to a new table. There are a lot of rows in these tables, and I need to keep going through them; however, I can't seem to get this to work. This is what I am doing:
import pandas as pd
from pyspark.sql.functions import desc
from pyspark.sql import functions as F

def fluvoxamine_covid_hospital_progression(fluvoxamine_covids_that_were_also_in_hospital_at_any_time, fluvoxamine_for_hospitalization_analysis_only_outpatients_need_to_dbl_chk):
    df_fluvoxamine_covid_outpatients = pd.DataFrame(fluvoxamine_for_hospitalization_analysis_only_outpatients_need_to_dbl_chk)
    df_fluvoxamine_covid_outpatients.dropDuplicates(['visit_occurrence_id'])
    df_fluvoxamine_covid_outpatients.sort(desc('visit_start_date'))
    df_fluvoxamine_converted_hospital = pd.DataFrame(fluvoxamine_covids_that_were_also_in_hospital_at_any_time)
    df_fluvoxamine_converted_hospital.dropDuplicates(['visit_occurrence_id'])
    df_fluvoxamine_converted_hospital.sort(desc('visit_start_date'))
    i = 0
    if df_fluvoxamine_covid_outpatients.sort('visit_start_date') < df_fluvoxamine_converted_hospital.sort('visit_start_date'):
        i = i + 1
Try to break it down into steps. I renamed your variables for readability.
# renamed variables
converted = fluvoxamine_covids_that_were_also_in_hospital_at_any_time
outpatients = fluvoxamine_for_hospitalization_analysis_only_outpatients_need_to_dbl_chk
For the first step, keep the first few lines roughly as you wrote them; note that on a pandas DataFrame the method is drop_duplicates rather than PySpark's dropDuplicates.
# Load and clean the data
covid_outpatients = pd.DataFrame(outpatients)
converted_hospital = pd.DataFrame(converted)
covid_outpatients = covid_outpatients.drop_duplicates(['visit_occurrence_id'])
converted_hospital = converted_hospital.drop_duplicates(['visit_occurrence_id'])
Next, join the data using the unique identifier column.
all_data = covid_outpatients.set_index('Unique Identifier (Primary Key)').join(
    converted_hospital.set_index('Unique Identifier (Primary Key)'),
    lsuffix='_outpatients', rsuffix='_converted')
Reset the index with the unique identifier column.
all_data['Unique Identifier (Primary Key)'] = all_data.index
all_data.reset_index(drop=True, inplace=True)
Generate a mask based on the date comparison. A mask is a series of boolean values with the same size/shape as the DataFrame. In this case, the mask is True if the outpatients date is less than the converted date, otherwise, the value in the series is False.
filtered_data = all_data[all_data['visit_start_date_outpatients'] < all_data['visit_start_date_converted']]
Note, if your data is not in a date format, it might need to be converted or cast for the mask to work properly.
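As a small sketch of that conversion, assuming the suffixed column names produced by the join above:

# Cast the joined date columns to datetime so the comparison is chronological
all_data['visit_start_date_outpatients'] = pd.to_datetime(all_data['visit_start_date_outpatients'])
all_data['visit_start_date_converted'] = pd.to_datetime(all_data['visit_start_date_converted'])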
Lastly, save the output as a comma-separated-value CSV file.
# Generate an output. For example, it's easy to save it as a CSV file.
filtered_data.to_csv('outpatient_dates_less_than_converted_dates.csv')
In addition to the official documentation for pandas dataframes, the website https://towardsdatascience.com/ has many good tips. I hope this helps!

How to divide a pandas dataframe into different dataframes based on unique values from one column and iterate over them?

I have a dataframe with three columns.
The first column has 3 unique values. I used the code below to create a separate dataframe for each unique value; however, I am unable to iterate over those dataframes and am not sure how to do so.
df = pd.read_excel("input.xlsx")
unique_groups = list(df.iloc[:, 0].unique())  # let's assume the unique values are 0, 1, 2
mtlist = []
for index, value in enumerate(unique_groups):
    globals()['df%s' % index] = df[df.iloc[:, 0] == value]
    mtlist.append('df%s' % index)
print(mtlist)
O/P
['df0', 'df1', 'df2']
For example, let's say I want to find out the length of the first unique dataframe.
If I manually type the name of the DF, I get the correct output:
len(df0)
O/P
35
But I am trying to automate the code, so technically I want to find the length and iterate over that dataframe normally, just as I would by typing its name.
What I'm looking for is
if I try the below code
len('df%s' % 0)
I want to get the actual length of the dataframe instead of the length of the string.
Could someone please guide me how to do this?
I have also tried to create a dictionary using the code below, but I can't figure out how to iterate over the dictionary when the DF has more than two columns, where the key would be the unique group and the value contains the remaining columns of the same row.
df = pd.read_excel("input.xlsx")
unique_groups = list(df["Assignment Group"].unique())
length_of_unique_groups = len(unique_groups)
mtlist = []
df_dict = {name: df.loc[df['Assignment Group'] == name] for name in unique_groups}
Can someone please provide a better solution?
UPDATE
SAMPLE DATA
Assignment_group Description Document
Group A Text to be updated on the ticket 1 doc1.pdf
Group B Text to be updated on the ticket 2 doc2.pdf
Group A Text to be updated on the ticket 3 doc3.pdf
Group B Text to be updated on the ticket 4 doc4.pdf
Group A Text to be updated on the ticket 5 doc5.pdf
Group B Text to be updated on the ticket 6 doc6.pdf
Group C Text to be updated on the ticket 7 doc7.pdf
Group C Text to be updated on the ticket 8 doc8.pdf
Lets assume there are 100 rows of data
I'm trying to automate ServiceNow ticket creation with the above data.
So my end goal is that Group A tickets should go to one group; however, for each description a unique task has to be created, but we can club 10 tasks together and submit them as one request. So if I divide the df into different dfs based on Assignment_group it would be easier to iterate over (that's the only idea I could think of).
For example lets say we have REQUEST001
within that request it will have multiple sub tasks such as STASK001,STASK002 ... STASK010.
hope this helps
Your problem is easily solved by groupby, one of the most useful tools in pandas:
length_of_unique_groups = df.groupby('Assignment Group').size()
You can do all kind of operations (sum, count, std, etc) on your remaining columns, like getting the mean value of price for each group if that was a column.
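If you also need to iterate over each per-group dataframe (instead of creating dynamically named variables), groupby gives you that directly. A small sketch, using the 'Assignment Group' column and the batches of 10 mentioned in the question:

# Iterate over each group as its own DataFrame
for group_name, group_df in df.groupby('Assignment Group'):
    print(group_name, len(group_df))
    # Optionally process the rows in chunks of 10, one request per chunk
    for start in range(0, len(group_df), 10):
        batch = group_df.iloc[start:start + 10]
        print(f"  one request with {len(batch)} sub-tasks")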
I think you want to try something like len(eval('df%s' % 0))

python pandas how to merge/join two tables based on substring?

Let's say I have two dataframes, and the column names for both are:
table 1 columns:
[ShipNumber, TrackNumber, Comment, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]
I want to merge the two tables when either 'ShipNumber' or 'TrackNumber' from table 2 can be found in 'Comment' from table 1.
Also, I'll explain why
merged = pd.merge(df1,df2,how='left',left_on='Comment',right_on='ShipNumber')
does not work in this case.
"Comment" column is a block of texts that can contain anything, so I cannot do an exact match like tab2.ShipNumber == tab1.Comment, because tab2.ShipNumber or tab2.TrackNumber can be found as a substring in tab1.Comment.
The desired output table should have all the unique columns from two tables:
output table column names:
[ShipNumber, TrackNumber, Comment, ShipDate, Quantity, Weight, AmountReceived]
I hope my question makes sense...
Any help is really really appreciated!
note
The ultimate goal is to merge two sets with (shipnumber==shipnumber |tracknumber == tracknumber | shipnumber in comments | tracknumber in comments), but I've created two subsets for the first two conditions, and now I'm working on the 3rd and 4th conditions.
Why not do something like
Count = 0

def MergeFunction(rowElement):
    global Count
    df2_row = df2.iloc[Count]
    if (str(df2_row['ShipNumber']) in rowElement['Comments']
            or str(df2_row['TrackNumber']) in rowElement['Comments']):
        rowElement['Amount'] = df2_row['AmountReceived']
    Count += 1
    return rowElement

df1['Amount'] = 0  # fill with zeros
new_df = df1.apply(MergeFunction, axis=1)
Here is an example based on some made up data. Ignore the complete nonsense I've put in the dataframes, I was just typing in random stuff to get a sample df to play with.
import pandas as pd
import re
x = pd.DataFrame({'Location': ['Chicago', 'Houston', 'Los Angeles', 'Boston', 'NYC', 'blah'],
                  'Comments': ['chicago is winter', 'la is summer', 'boston is winter',
                               'dallas is spring', 'NYC is spring', 'seattle foo'],
                  'Dir': ['N', 'S', 'E', 'W', 'S', 'E']})
y = pd.DataFrame({'Location': ['Miami', 'Dallas'],
                  'Season': ['Spring', 'Fall']})

def findval(row):
    comment, location, season = map(lambda x: str(x).lower(), row)
    return location in comment or season in comment
merged = pd.concat([x,y])
merged['Helper'] = merged[['Comments','Location','Season']].apply(findval,axis=1)
print(merged)
filtered = merged[merged['Helper'] == True]
print(filtered)
Rather than joining, you can concatenate the dataframes and then create a helper column that checks whether the string from one column is found in another. Once you have that helper column, just keep the rows where it is True.
You could index the comments field using a library like Whoosh and then do a text search for each shipment number that you want to search by.

pandas update multiple fields

I am trying to add and update multiple columns in a pandas dataframe using a second dataframe. The problem is that when the number of columns I want to add doesn't match the number of columns in the base dataframe, I get the following error: "Shape of passed values is (2, 3), indices imply (2, 2)".
A simplified version of the problem is below
from pandas import DataFrame

tst = DataFrame({"One": [1, 2], "Two": [2, 4]})

def square(row):
    """
    for each row in the table return multiple calculated values
    """
    a = row["One"]
    b = row["Two"]
    return a ** 2, b ** 2, b ** 3

# create three new fields from the data
tst[["One^2", "Two^2", "Two^3"]] = tst.apply(square, axis=1)
If the number of fields being added matches the number already in the table, the operation works as expected.
tst = DataFrame({"One": [1, 2], "Two": [2, 4]})

def square(row):
    """
    for each row in the table return multiple calculated values
    """
    a = row["One"]
    b = row["Two"]
    return a ** 2, b ** 2

# create two new fields from the data
tst[["One^2", "Two^2"]] = tst.apply(square, axis=1)
I realise I could do each field separately, but in the actual problem I am trying to solve I perform a join between the table being updated and an external table within the "updater" (i.e. square), and I want to be able to grab all the required information at once.
Below is how I would do it in SQL. Unfortunately the two dataframes contain data from different database technologies, hence why I have to perform the operation in pandas.
update tu
set tu.a_field = upd.the_field_i_want,
    tu.another_field = upd.the_second_required_field
from to_update tu
inner join the_updater upd
    on tu.item_id = upd.item_id
    and tu.date between upd.date_from and upd.date_to
Here you can see the exact details of what I am trying to do. I have a table "to_update" that contains point-in-time information against an item_id. The other table "the_updater" contains date range information against the item_id. For example a particular item_id may sit with customer_1 from DateA to DateB and with customer_2 between DateB and DateC etc. I want to be able to align information from the table containing the date ranges against the point-in-time table.
Please note a merge won't work due to problems with the data (this is actually being written as part of a data-quality test). I really need to be able to replicate the functionality of the update statement above.
I could obviously do it as a loop but I was hoping to use the pandas framework where possible.
Declare an empty column in the dataframe and assign it to zero:
tst["Two^3"] = 0
Then do the respective operations for that column, along with other columns
tst[["One^2", "Two^2", "Two^3"]] = tst.apply(square, axis=1)
Try printing it:
print(tst.head(5))
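On more recent pandas versions, another option (a sketch, not part of the original answer) is to let apply expand the returned tuple into columns directly, which avoids pre-declaring the extra column:

# result_type='expand' turns each returned tuple into its own set of columns
tst[["One^2", "Two^2", "Two^3"]] = tst.apply(square, axis=1, result_type='expand')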
