Sorting dataframe by multiple changing values - python

I have a DataFrame that is the result of a large SQL query. I am trying to split it into 2 separate DataFrames, NVI and Main, both of which are lists of repairs to trucks. The split is based on whether a row has the specific profile id 7055; those rows go into the NVI DataFrame.
When such a row is encountered, I need to grab the values from its "RO", "Unit Number" and "Repair Date" columns. I then need to take those values and search the DataFrame again, grabbing any rows that have a matching RO and Unit Number, or a matching Unit Number and a Repair Date that is equal to or earlier than the date in the row where the 7055 was found. Those rows also need to go into the NVI df. Any remaining rows that do not match go into the Main df.
The only static value is the profile id of 7055; the RO, Unit Number and Repair Date will all be different.
class nvi_dict(dict):
    def __setitem__(self, key, value):
        key = key.profile()
        super().__setitem__(key, value)

nvisort = pd.DataFrame()

def sort_nvi_dict(row, component):
    if row['PROFILE_ID'] in cfg[component]['nvi']:
        nvi_ro = nvi_dict()
        nvi_ro['RO'] = row['RO']
        nvi_ro['UnitNum'] = row['VFUNIT']
        nvi_ro['date'] = row['REPAIR_DATE']

nvisort = nvidf.apply(lambda x: sort_nvi_dict(x, 'nvi_ro'), axis=1, result_type='expand')
I thought about using a class to create a temporary dict object to store the values from RO, UnitNum and Date, which I could then use while iterating over the df again looking for matching values.
I am using a .yml file to store dictionaries that I use to further sort each of the NVI and Main df's after they have been split, because each then needs to be sorted by truck manufacturer.

I think this might work, unable to test without the test data though...
df1 = nvisort[nvisort['profile_id'] == 7055]
df2 = pd.merge(nvisort, df1[['RO','Unit Number']], on=['RO','Unit Number'], how='right')
df3 = pd.merge(nvisort, df1[['Unit Number','Repair Date']], on='Unit Number', how='right')
df3 = df3[df3['Repair Date_x'] <= df3['Repair Date_y']]
df3 = df3.drop(columns='Repair Date_y')
df3 = df3.rename(columns={'Repair Date_x':'Repair Date'})
NVI = pd.concat([df1,df2,df3])
Main = pd.concat([NVI,nvisort]).drop_duplicates(keep=False)
I'm assuming that your original/starting dataframe here is the nvisort, and then we filter that just to get profile_id of 7055 and call that df1
Then we are going to get your two different pieces of criteria into df2 and df3.
df2 is just a filter on the original dataframe where RO and Unit Number match, so we can use pd.merge() to effectively get that filter.
df3 is a more complicated filter since the condition is less than or equal, not just equal. So first we do the merge to filter on matching Unit Numbers, but we also bring over the Repair Date from both tables into df3; these get _x and _y appended to the column names. We then keep the rows where the _x date is less than or equal to the _y date, and clean it up.
Last, you get Main by finding everything from the original nvisort that is not in NVI. Since NVI is a subset of nvisort, you can just concat them and drop all duplicates, leaving only data that exists in one of the dataframes.
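One caveat, using the column names above: a row can satisfy more than one of the criteria (the 7055 rows themselves also match on RO and Unit Number, for example), so NVI may contain duplicate rows after the concat. A drop_duplicates on NVI is a cheap safeguard and does not change the Main calculation, since duplicated rows get dropped from that final concat either way:
NVI = NVI.drop_duplicates()  # run this before computing Main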

From what I understand of your question, you want to divide a dataframe into 2 based on certain conditions?
df1 = df[<condition>]
where the condition can be something like df['PROFILE_ID'] == 7055 combined with a check that the unit appears in your set of matching units.
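For example, a rough sketch of that two-way split with a boolean mask (the column names and the all_units collection are assumptions based on the question, and this only covers the profile/unit part of the matching logic, not the date comparison):
all_units = df.loc[df['PROFILE_ID'] == 7055, 'Unit Number'].unique()   # units seen on the 7055 rows
mask = (df['PROFILE_ID'] == 7055) | df['Unit Number'].isin(all_units)  # 7055 rows plus rows sharing a unit
NVI = df[mask]
Main = df[~mask]  # everything that did not match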

Related

How to use duplicated, sort values, and pivot table to group the data w.r.t. cell value and keep the cell value that have occurred more than one time

I want to group the data from the table by the 'name' column values and keep only the rows where the 'name' value occurs more than once in the table. The following code works fine for the given small data table.
import pandas as pd
data = {'Name': ['Danny','Damny','Monny','Quony','Dimny','Danny'],
        'Email': ['danny#gmail.com','danny#gmail.com','monny#gmail.com','quony#gmail.com','danny#gmail.com','danny#gmail.com'],
        'IBAN': ['NLAMRO123456789','NLINGB126656723','BGFFEO128856754','NLAMRO123896763','DUDMRO567456722','NLRABO123456712']}  # data with three columns
df = pd.DataFrame(data)  # creation of dataframe
df['No Dutch Bank'] = None  # extra column for analysis
df.loc[df['IBAN'].str.find('NL') == -1, 'No Dutch Bank'] = 'ja'  # flag rows that contain non-Dutch bank numbers
df_filt = df[['Name', 'Email', 'IBAN']]  # filter to the columns needed in the final result
df_gb = df_filt[df_filt.duplicated(subset=['Name'], keep=False)].sort_values(by='Name', ascending=False).reset_index(drop=True)  # keep only names that occur more than once
piv_tab = pd.pivot_table(df_gb, index=['Name', 'Email', 'IBAN'])  # applying pivot table
piv_tab
This works fine for the original data of three columns and six rows. In practice I have data with thirty columns and thirty thousand (30000) rows. When I select three columns (Name, Email and IBAN) and run the same code, it does not filter out the rows whose name occurs only once in the table.
Why?
Everything except for the pivot logic is correct and should work for a larger dataset as well.
Another approach would be to calculate the count of names, filter by count > 1, and merge the names back to this table via a left join.
This way is more explicit about the filter by count > 1, and is more flexible if you later need to filter by count > 2, etc.
import pandas as pd
# count names
df_name = df_filt.groupby('Name')['Email'].count()
# get names with count > 1
df_name = df_name[df_name > 1].reset_index()
# merge filtered names back to original df to get filtered df
df_gb = pd.merge(df_name['Name'], df_filt, on=['Name'], how='left')
# sorting etc.
df_gb = df_gb.sort_values(by='Name', ascending=False).reset_index(drop=True)
# and some pivot stuff
...

How can I keep the original index when doing an outer merge and dropping rows?

I have a big df (rates) that contains all the information, and a second dataframe (aig_df) that contains a couple of rows of the first one.
I need to get a third dataframe that is basically the big one (rates) without the rows from the second one (aig_df), but keeping the corresponding indices of the rows that remain.
With the code I have now, I can get the third dataframe with all the information needed, but with an integer index, and I need the index corresponding to each row (Index = Stock Ticker).
rates = pd.read_sql("SELECT Ticker, Carrier, Product, Name, CDSC,StrategyTerm,ParRate,Spread,Fee,Cap FROM ProductRates ", conn).set_index('Ticker')
aig_df = rates.query('Product == "X5 Advantage AnnuitySM"')
competitors_df = pd.merge(rates, aig_df[['Carrier', 'Product', 'Name', 'CDSC', 'StrategyTerm', 'ParRate', 'Spread', 'Fee', 'Cap']],
                          indicator=True, how='outer').query('_merge == "left_only"').drop('_merge', axis=1)
Is there any way to do what I need?
Thanks for your attention
In your specific case, you don't need a merge to do what you want:
result = rates[rates["Product"] != "X5 Advantage AnnuitySM"]
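If you do want the more general pattern (remove one frame's rows from another while keeping the original index), a hedged alternative, assuming each Ticker identifies a single row, is to drop by index label, since aig_df is a row-subset of rates and shares its Ticker index:
competitors_df = rates.drop(aig_df.index)  # the remaining rows keep their Ticker index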

How Can I Segregate data in pandas From my Timestamp Column

I am working with an Excel sheet in pandas, analysing some data from it.
Inside the Excel sheet I have 8 columns: one is Timestamp, another is City, and so on (Domain, State, etc.).
I want to analyse the City and Timestamp columns only.
I have selected the City and Timestamp columns from the Excel sheet into a DataFrame. I found the city count (how many rows contain the same city) using cities_df['Count'] = df['City_Town_Village '].value_counts()
After finding the city count, I calculated the percentage for each city using cities_df['PctCnt'] =(cities_df['Count']/sum(cities_df['Count'])*100).apply("{0:.2f}".format)
Now my question: as I compute the city count, the number of rows in my dataframe decreases. My df has 238 rows, but after the count there are only 128; no issue so far, they decrease just because of the count.
I also have the Timestamp column inside my df. Say for the city Delhi some people registered on 28-May-2021 and some on 29-May-2021, and so on. But after finding the city count, my df only shows the timestamp for the earliest date, i.e. 28-May.
I don't know why this is happening. Ultimately I want to segregate the data into two weeks and plot a graph week-wise, and also show the city percentages.
Here is my Excel file
This is the code I'm using:
import pandas as pd
df = pd.read_excel('PCS_NWR_Sheet.xlsx')
df.head()
pd.set_option('display.max_rows', 300)
cities_df = pd.DataFrame()
cities_df['Count'] = df['City_Town_Village '].value_counts()
cities_df.index.names = ['City']
cities_df.reset_index(inplace=True)
cities_df['Timestamp'] = df['Timestamp']
column_names = ['Timestamp', 'City', 'Count']
cities_df = cities_df.reindex(columns=column_names)
cities_df['PctCnt'] = (cities_df['Count'] / sum(cities_df['Count']) * 100).apply("{0:.2f}".format)
Metro_list = ['Hyderabad', 'Kolkata', 'Delhi', 'Pune', 'Bengaluru', 'Noida', 'Kanpur', 'Gurgaon']
top_metro = cities_df[cities_df['City'].isin(Metro_list)]
top_metro
.value_counts() will return a series whose size equals the number of unique elements in what you are counting, so you are getting fewer rows because it is grouping those values.
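As a quick toy illustration (not your data):
pd.Series(['Delhi', 'Delhi', 'Pune']).value_counts()
# Delhi    2
# Pune     1
# dtype: int64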
I can think of two ways to solve the question (if I understand it right).
Do .value_counts() on both the date column and the city column.
df[['City_Town_Village','Date']].value_counts()
If you don't currently have your timestamps as dates, you'll need to make a date column first (you probably won't be able to group on full datetimes, since the times will vary). This will give a series where the row count equals the number of existing combinations of the two columns.
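A minimal sketch of that date column, assuming the Timestamp column parses cleanly with pd.to_datetime (column names follow the question; untested against the actual sheet):
df['Date'] = pd.to_datetime(df['Timestamp']).dt.date  # strip the time component
df[['City_Town_Village', 'Date']].value_counts()      # one row per existing city/date combination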
Make a separate dataframe with the value_counts of the town column, and merge them. That is, if you want a column in your main dataframe holding the number of times the town appears anywhere in the data, that count series is a different size (as we said), so you'll store it somewhere else and bring it back in as needed.
df2 = pd.DataFrame(df['City_Town_Village'].value_counts())
df2.reset_index(inplace=True) # by default, making the df from value_counts() turns your city/town into the index; this restores a normal index
df2.columns = ['City_Town_Village','Count'] #rename the columns
df = df.merge(df2,how='left',on='City_Town_Village')
This adds the Count column to df, holding the number of times that City/Town appears in the original dataset.

How to get rows from one dataframe based on another dataframe

I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'A' in dataframe DT is a subset of the column 'A' in dataframe MR; the two frames are only similar (not equal) in this ID column, and the rest of the columns are different, as is the number of rows.
How can I get the rows of MR whose 'ID' is equal to an 'ID' in DT, knowing that values in 'ID' can appear several times in the same column?
DT has 1538 rows and MR has 2060 rows.
I tried some of the lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods they proposed (and the goal is a little different).
Thanks!
Take a look at the pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or if you want to just get a new dataframe of combined records for the same ID you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.

Replicating Excel VLOOKUP in Python

So I have 2 tables, Table 1 and Table 2; Table 2 is sorted by date, from recent to old. In Excel, when I do a lookup from Table 1 against Table 2, it only picks the first matching value from Table 2 and does not continue searching for further occurrences of the same value.
So I tried replicating it in Python with the merge function, but found that it repeats the value as many times as it appears in the second table.
pd.merge(Table1, Table2, left_on='Country', right_on='Country', how='left', indicator='indicator_column')
[Images: TABLE1, TABLE2, the merge result, and the expected result (Excel VLOOKUP)]
Is there any way this could be achieved with the merge function or any other python function?
Typing this blind, as you are including your data as images, not text.
# The index is a very important element in a DataFrame
# We will see that in a bit
result = table1.set_index('Country')
# For each country, only keep the first row
tmp = table2.drop_duplicates(subset='Country').set_index('Country')
# When you assign one or more columns of a DataFrame to one or more columns of
# another DataFrame, the assignment is aligned based on the index of the two
# frames. This is the equivalence of VLOOKUP
result.loc[:, ['Age', 'Date']] = tmp[['Age', 'Date']]
result.reset_index(inplace=True)
Edit: Since you want a straight up Vlookup, just use join. It appears to find the very first one.
table1.join(table2, rsuffix='r', lsuffix='l')
The docs seem to indicate it performs similarly to a vlookup: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html
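If you specifically want merge() itself to mimic VLOOKUP's first-match behaviour, one hedged sketch (untested, since the data is only available as images) is to drop duplicate countries from the lookup table before merging:
first_match = Table2.drop_duplicates(subset='Country')            # keep only the first row per Country, as VLOOKUP would
result = pd.merge(Table1, first_match, on='Country', how='left')  # one row back per Table1 row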
I'd recommend approaching this more like a SQL join than a Vlookup. Vlookup finds the first matching row, from top to bottom, which could be completely arbitrary depending on how you sort your table/array in excel. "True" database systems and their related functions are more detailed than this, for good reason.
In order to join only one row from the right table onto each row of the left table, you'll need some kind of aggregation or selection, so in your case that'd be either MAX or MIN.
The question is, which column is more important? The date or age?
import pandas as pd

df1 = pd.DataFrame({
    'Country': ['GERM','LIB','ARG','BNG','LITH','GHAN'],
    'Name': ['Dave','Mike','Pete','Shirval','Kwasi','Delali']
})
df2 = pd.DataFrame({
    'Country': ['GERM','LIB','ARG','BNG','LITH','GHAN','LIB','ARG','BNG'],
    'Age': [35,40,27,87,90,30,61,18,45],
    'Date': ['7/10/2020','7/9/2020','7/8/2020','7/7/2020','7/6/2020','7/5/2020','7/4/2020','7/3/2020','7/2/2020']
})

df1.set_index('Country') \
   .join(df2.groupby('Country')
            .agg({'Age': 'max', 'Date': 'max'}),
         how='left', lsuffix='l', rsuffix='r')
