I have looked at many similar questions, yet I still cannot get pandas to rename the rows of a df from a list of values from another df. What am I doing wrong?
def calculate_liabilities(stakes_df):
    if not stakes_df.empty:
        liabilities_df = pd.DataFrame(decimal_odds_lay.values * stakes_df.values)  # makes df with stakes rows, decimal odds columns
        stakes_list = stakes_df.to_dict()
        print(stakes_list)
        liabilities_df = liabilities_df.rename(stakes_list)
        return liabilities_df
    else:
        print("Failure to calculate liabilities")
stakes_list = stakes_df.to_dict() gives the following dict:
{'Stakes': {0: 3.7400000000000002, 1: 5.5999999999999996, 2: 7.0700000000000003}}
I want the rows of liabilities_df to be renamed 3.7400000000000002, 5.5999999999999996 and 7.0700000000000003 respectively.
If you want to rename liabilities_df's row labels (the index) to stakes_df's values, you need to pass a dict, not a dict of dicts:
liabilities_df = liabilities_df.rename(stakes_list['Stakes'])
example:
df= pd.DataFrame([1,2,3])
0
0 1
1 2
2 3
df.rename({0: 3.7400000000000002, 1: 5.5999999999999996, 2: 7.0700000000000003})
0
3.74 1
5.60 2
7.07 3
You can rename the rows with a plain dict; here you have a dict of dicts, and that's why it fails.
It would be better if you gave us the data, but you don't actually have to build a dict from stakes_df at all: rename accepts any dict-like mapping, including a Series, as the sketch below shows.
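A minimal sketch, assuming stakes_df has a single 'Stakes' column as in the question (a Series is dict-like, so rename accepts it directly as the index mapping):
# a Series is dict-like, so it can serve as the index mapping itself
liabilities_df = liabilities_df.rename(index=stakes_df['Stakes'])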
I want to compare two very similar DataFrames: one is loaded from a JSON file and resampled, the second is loaded from a CSV file in a somewhat more involved use case.
Those are the first values of df1:
page
logging_time
2021-07-04 18:14:47.000 748.0
2021-07-04 18:14:47.100 0.0
2021-07-04 18:14:47.200 0.0
2021-07-04 18:14:47.300 3.0
2021-07-04 18:14:47.400 4.0
[5 rows x 1 columns]
And these are the first values of df2:
#timestamp per 100 milliseconds Sum of page
0 2021-04-07 18:14:47.000 748.0
1 2021-04-07 18:14:47.100 0.0
2 2021-04-07 18:14:47.200 0.0
3 2021-04-07 18:14:47.300 3.0
4 2021-04-07 18:14:47.400 4.0
[5 rows x 2 columns]
I'm comparing them with pandas.testing.assert_frame_equal and trying to transform the data so that the frames compare equal; I'd like some help with that.
The first column should be removed and the label names should be ignored.
I want to do that in the most pandas-native way, and not compare only the values.
Any help would be appreciated
This is a lot of code, but it is an almost comprehensive comparison of two data frames given a join key and column(s) to ignore. Its current weakness is that it does not compare/analyze the values that may not exist in each of the data sets.
Also note that this script writes out .csv files of the rows that differ, indexed by the join key and containing only the column values from the two data sets. (Comment out that portion if you don't want those files written.)
Here is a link on GitHub if you prefer the Jupyter notebook view: https://github.com/marckeelingiv/MyPyNotebooks/blob/master/Test-Prod%20Compare.ipynb
# Imports
import pandas as pd
# Set target data sets
test_csv_location = 'test.csv'
prod_csv_location = 'prod.csv'
# Set what columns to join on and what columns to remove
join_columns = ['ORIGINAL_IID','CLAIM_IID','CLAIM_LINE','EDIT_MNEMONIC']
columns_to_remove = ['Original Clean']
# Peek at the data to get a list of the column names
test_df = pd.read_csv(test_csv_location, nrows=10)
prod_df = pd.read_csv(prod_csv_location, nrows=10)
# Create a dictionary to set all columns to strings
all_columns = set()
for c in test_df.columns.values:
    all_columns.add(c)
for c in prod_df.columns.values:
    all_columns.add(c)
dtypes = {}
for c in all_columns:
    dtypes[c] = str
# Perform the full import, setting data types and specifying the index
test_df = pd.read_csv(test_csv_location, dtype=dtypes, index_col=join_columns)
prod_df = pd.read_csv(prod_csv_location, dtype=dtypes, index_col=join_columns)
# Drop desired columns
for c in columns_to_remove:
    try:
        del test_df[c]
    except KeyError:
        pass
    try:
        del prod_df[c]
    except KeyError:
        pass
# Join data frames to prepare for comparing
compare_df = test_df.join(
    prod_df,
    how='outer',
    lsuffix='_test', rsuffix='_prod'
).fillna('')
# Create list of columns to compare
columns_to_compare = []
for c in all_columns:
    if c not in columns_to_remove and c not in join_columns:
        columns_to_compare.append(c)
# Show the difference in columns for each data set
list_of_different_columns = []
for column in columns_to_compare:
    are_different = ~(compare_df[f'{column}_test'] == compare_df[f'{column}_prod'])
    differences = are_different.sum()
    test_not_nulls = ~(compare_df[f'{column}_test'] == '')
    prod_not_nulls = ~(compare_df[f'{column}_prod'] == '')
    temp_df = compare_df[are_different & test_not_nulls & prod_not_nulls]
    if len(temp_df) > 0:
        print(f'{differences} differences in {column}')
        # count the blanks (NaNs were filled with '' above)
        print(f'\t{(~test_not_nulls).sum()} nulls in test')
        print(f'\t{(~prod_not_nulls).sum()} nulls in prod')
        to_file = temp_df[[f'{column}_test', f'{column}_prod']].copy()
        to_file.to_csv(path_or_buf=f'{column}_Test.csv')
        list_of_different_columns.append(column)
        del to_file
    del temp_df, prod_not_nulls, test_not_nulls, differences, are_different
# Functions to show/analyze differences
def return_delta_df(column):
    mask = ~(compare_df[f'{column}_test'] == compare_df[f'{column}_prod'])
    mask2 = ~(compare_df[f'{column}_test'] == '')
    mask3 = ~(compare_df[f'{column}_prod'] == '')
    df = compare_df[mask & mask2 & mask3][[f'{column}_test', f'{column}_prod']].copy()
    try:
        df['Delta'] = df[f'{column}_prod'].astype(float) - df[f'{column}_test'].astype(float)
        df.sort_values(by='Delta', ascending=False, inplace=True)
    except ValueError:  # non-numeric columns get no Delta
        pass
    return df
def show_count_of_differences(column):
    df = return_delta_df(column)
    return pd.DataFrame(
        df.groupby(by=[f'{column}_test', f'{column}_prod']).size(),
        columns=['Count']
    ).sort_values('Count', ascending=False).copy()
# ### Code to run to see differences
# Copy the resulting code into individual Jupyter notebook cells to dig into the differences
for c in list_of_different_columns:
    print(f"## {c}")
    print(f"return_delta_df('{c}')")
    print(f"show_count_of_differences('{c}')")
You can use the equals function to compare the dataframes. The catch is that column names must match:
data = [
["2021-07-04 18:14:47.000", 748.0],
["2021-07-04 18:14:47.100", 0.0],
["2021-07-04 18:14:47.200", 0.0],
["2021-07-04 18:14:47.300", 3.0],
["2021-07-04 18:14:47.400", 4.0],
]
df1 = pd.DataFrame(data, columns = ["logging_time", "page"])
df1.set_index("logging_time", inplace=True)
df2 = pd.DataFrame(data, columns = ["#timestamp per 100 milliseconds", "Sum of page"])
df2.columns = df1.reset_index().columns  # the catch: align the column names with df1's
print(df1.reset_index().equals(df2))
Output:
True
from pandas.testing import assert_frame_equal
Dataframes used by me:
df1=pd.DataFrame({'page': {'2021-07-04 18:14:47.000': 748.0,
'2021-07-04 18:14:47.100': 0.0,
'2021-07-04 18:14:47.200': 0.0,
'2021-07-04 18:14:47.300': 3.0,
'2021-07-04 18:14:47.400': 4.0}})
df1.index.names=['logging_time']
df2=pd.DataFrame({'#timestamp per 100 milliseconds': {0: '2021-07-04 18:14:47.000',
1: '2021-07-04 18:14:47.100',
2: '2021-07-04 18:14:47.200',
3: '2021-07-04 18:14:47.300',
4: '2021-07-04 18:14:47.400'},
'Sum of page': {0: 748.0, 1: 0.0, 2: 0.0, 3: 3.0, 4: 4.0}})
Solution:
df1 = df1.reset_index()
# resetting the index of df1
df2.columns = df1.columns
# renaming the columns of df2 so that they become the same as df1's
print((df1.dtypes == df2.dtypes).all())
# if the above prints True, the dtypes already match;
# if it prints False, check the output of print(df1.dtypes == df2.dtypes)
# and change the dtypes of one df (either df1 or df2) accordingly
# finally:
print(assert_frame_equal(df1, df2))
# this prints None if the frames are equal,
# otherwise it raises an AssertionError
Thanks for your answer,
but df2.columns = df1.columns
fails with this error: ValueError: Length mismatch: Expected axis has 3 elements, new values have 1 elements
Printing those columns gives:
print(df2.columns)
print(df1.columns)
Index(['index', '#timestamp per 100 milliseconds', 'Sum of page'], dtype='object')
Index(['page'], dtype='object')
And no possible change to the columns worked; how can I compare them?
Thanks very much for the help!
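Judging from the printed columns, df2 still carries a leftover 'index' column while df1 was never reset; a minimal sketch of the alignment, assuming the frames printed above:
df1 = df1.reset_index()            # columns become ['logging_time', 'page']
df2 = df2.drop(columns=['index'])  # drop the stray column so the lengths match
df2.columns = df1.columns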
I am trying to read a column in Python and create a new column from it.
import pandas as pd
df = pd.read_csv (r'C:\Users\User\Documents\Research\seqadv.csv')
print (df)
df = pd.DataFrame(data={'WT_RESIDUE':['']})
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df['WT_RESIDUE'].replace(codes)
df.to_csv (r'C:\Users\User\Documents\Research\output.csv')
I tried this, but it will not create a new column no matter what I do.
It seems like you made a silly mistake
import pandas as pd
df = pd.read_csv (r'C:\Users\User\Documents\Research\seqadv.csv')
print (df)
df = pd.DataFrame(data={'WT_RESIDUE':['']}) # Why do you have this line?
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df['WT_RESIDUE'].replace(codes)
df.to_csv (r'C:\Users\User\Documents\Research\output.csv')
Try removing the line with the comment. AFAIK, it is reinitializing your DataFrame and thus the WT_RESIDUE column becomes empty.
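With that line removed, the script would be (a sketch that keeps the paths from the question):
import pandas as pd

df = pd.read_csv(r'C:\Users\User\Documents\Research\seqadv.csv')
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df['WT_RESIDUE'].replace(codes)
df.to_csv(r'C:\Users\User\Documents\Research\output.csv')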
Considering a sample from the provided input, we can use the map function to map the dict's keys onto the existing column and store the corresponding values in a new column.
df = pd.DataFrame({
'WT_RESIDUE':['ALA', "REMARK", 'VAL', "LYS"]
})
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df.WT_RESIDUE.map(codes)
Input
WT_RESIDUE
0 ALA
1 REMARK
2 VAL
3 LYS
Output
WT_RESIDUE MUTATION_CODE
0 ALA A
1 REMARK NaN
2 VAL V
3 LYS K
I am trying to use pandas to read a column in an excel file and print a new column using my input. I am trying to convert 3-letter code to 1-letter code. So far, I've written this code, but when I run it, it will not print anything in the last column.
import pandas as pd
df = pd.read_csv (r'C:\Users\User\Documents\Research\seqadv.csv')
print (df)
codes = []
for i in df['WT_RESIDUE']:
    if i == 'ALA':
        codes.append('A')
    if i == 'ARG':
        codes.append('R')
    if i == 'ASN':
        codes.append('N')
    if i == 'ASP':
        codes.append('D')
    if i == 'CYS':
        codes.append('C')
    if i == 'GLU':
        codes.append('E')
print(codes)
codes = df['MUTATION_CODE']
df.to_csv(r'C:\Users\User\Documents\Research\seqadv3.csv')
The way to do this is to define a dictionary with your replacement values, and then use either map() or replace() on your existing column to create your new column. The difference between the two is that
replace() will not change values not in the dictionary keys
map() will replace any values not in the dictionary keys with the dictionary's default value (if it has one) or with NaN (if the dictionary doesn't have a default value)
df = pd.DataFrame(data={'WT_RESIDUE':['ALA', 'REMARK', 'VAL', 'CYS', 'GLU']})
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E'}
df['code_m'] = df['WT_RESIDUE'].map(codes)
df['code_r'] = df['WT_RESIDUE'].replace(codes)
In: df
Out:
WT_RESIDUE code_m code_r
0 ALA A A
1 REMARK NaN REMARK
2 VAL NaN VAL
3 CYS C C
4 GLU E E
More detailed information is here: Remap values in pandas column with a dict
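As a sketch of the "default value" case mentioned above (the 'X' placeholder for unknown residues is an assumption, not from the question):
from collections import defaultdict

# map() honors a dict's __missing__ hook, so a defaultdict fills
# unmatched residues with 'X' instead of NaN ('X' is an assumed choice)
codes_with_default = defaultdict(lambda: 'X', codes)
df['code_d'] = df['WT_RESIDUE'].map(codes_with_default)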
Write:
df['MUTATION_CODE'] = codes
instead of codes = df['MUTATION_CODE']; your last line has the assignment reversed.
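Note that the list must have one entry per row for that assignment to work; a minimal sketch of the loop with the lengths kept aligned (appending None for unmatched residues is an assumed choice):
codes = []
for i in df['WT_RESIDUE']:
    if i == 'ALA':
        codes.append('A')
    elif i == 'ARG':
        codes.append('R')
    else:
        codes.append(None)  # placeholder so len(codes) == len(df)
df['MUTATION_CODE'] = codes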
I am trying to have pandas compare two DataFrames with each other. In DataFrame 1 I have two columns (AC-Cat and Origin). I am trying to compare the AC-Cat column with the contents of DataFrame 2. If a match is found between the DataFrame 1 value being studied and one of the columns of DataFrame 2, I want pandas to copy the header of the DataFrame 2 column in which the match was found into a new column in DataFrame 1.
DF1:
f = {'AC-Cat': pd.Series(['B737', 'A320', 'MD11']),
'Origin': pd.Series(['AJD', 'JFK', 'LRO'])}
Flight_df = pd.DataFrame(f)
DF2:
w = {'CAT-C': pd.Series(['DC85', 'IL76', 'MD11', 'TU22', 'TU95']),
'CAT-D': pd.Series(['A320', 'A321', 'AN12', 'B736', 'B737'])}
WCat_df = pd.DataFrame(w)
I imported pandas as pd and numpy as np and tried to define a function to compare these columns.
def get_wake_cat(AC_cat):
    try:
        Wcat = [WCat_df.columns.values[0]][WCat_df.iloc[:,1]==AC_cat].values[0]
    except:
        Wcat = np.NAN
    return Wcat

Flight_df.loc[:,'CAT'] = Flight_df.loc[:,'AC-Cat'].apply(lambda CT: get_wake_cat(CT))
However, the function does not result in the desired outputs. For example: Take the B737 AC-Cat value. I want Python Pandas to then find this value in DF2 in the column CAT-D and copy this header to the new column of DF 1. This does not happen. Can someone help me find out why my code is not giving the desired results?
Not pretty, but I think I got it working. Part of the error was that the function did not receive WCat_df as an argument. I also changed the indexing into two steps:
def get_wake_cat(AC_cat, WCat_df):
    try:
        d = WCat_df[WCat_df.columns.values][WCat_df.iloc[:] == AC_cat]
        Wcat = d.columns[(d == AC_cat).any()][0]
    except IndexError:  # no column contains the value
        Wcat = np.nan
    return Wcat
Then you need to change your next line to:
Flight_df.loc[:,'CAT'] = Flight_df.loc[:,'AC-Cat'].apply(lambda CT: get_wake_cat(CT, WCat_df))
AC-Cat Origin CAT
0 B737 AJD CAT-D
1 A320 JFK CAT-D
2 MD11 LRO CAT-C
Hope that solves the problem
This will give you two new columns with the name(s) of the match(es) found:
Flight_df['CAT1'] = Flight_df['AC-Cat'].map(lambda x: 'CAT-C' if x in list(WCat_df['CAT-C']) else '')
Flight_df['CAT2'] = Flight_df['AC-Cat'].map(lambda x: 'CAT-D' if x in list(WCat_df['CAT-D']) else '')
Flight_df.loc[Flight_df['CAT1'] == '', 'CAT1'] = Flight_df['CAT2']
Flight_df.loc[Flight_df['CAT1'] == Flight_df['CAT2'], 'CAT2'] = ''
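For the sample frames above, this should leave something like:
  AC-Cat Origin   CAT1 CAT2
0   B737    AJD  CAT-D
1   A320    JFK  CAT-D
2   MD11    LRO  CAT-C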
IIUC, you can do a stack and merge:
final=(Flight_df.merge(WCat_df.stack().reset_index(1,name='AC-Cat'),on='AC-Cat',how='left')
.rename(columns={'level_1':'New'}))
print(final)
Or with melt:
final=Flight_df.merge(WCat_df.melt(var_name='New',value_name='AC-Cat'),
on='AC-Cat',how='left')
AC-Cat Origin New
0 B737 AJD CAT-D
1 A320 JFK CAT-D
2 MD11 LRO CAT-C
I have a DataFrame that consists of one column ('Vals') whose values are dictionaries. The DataFrame looks more or less like this:
In[215]: fff
Out[213]:
Vals
0 {u'TradeId': u'JP32767', u'TradeSourceNam...
1 {u'TradeId': u'UUJ2X16', u'TradeSourceNam...
2 {u'TradeId': u'JJ35A12', u'TradeSourceNam...
When looking at an individual row the dictionary looks like this:
In[220]: fff['Vals'][100]
Out[218]:
{u'BrdsTraderBookCode': u'dffH',
u'Measures': [{u'AssetName': u'Ie0',
u'DefinitionId': u'6dbb',
u'MeasureValues': [{u'Amount': -18.64}],
u'ReportingCurrency': u'USD',
u'ValuationId': u'669bb'}],
u'SnapshotId': 12739,
u'TradeId': u'17304M',
u'TradeLegId': u'31827',
u'TradeSourceName': u'xxxeee',
u'TradeVersion': 1}
How can I split the columns and create a new DataFrame, so that I get one column with TradeId and another with MeasureValues?
try this:
l = []
for idx, row in df['Vals'].items():  # .iteritems() was removed in pandas 2.0; .items() is equivalent
    temp_df = pd.DataFrame(row['Measures'][0]['MeasureValues'])
    temp_df['TradeId'] = row['TradeId']
    l.append(temp_df)
pd.concat(l, axis=0)
Here's a way to get TradeId and MeasureValues (using your sample row twice to illustrate the iteration):
new_df = pd.DataFrame()
for id, data in fff.iterrows():
    d = {'TradeId': data.iloc[0]['TradeId']}  # .ix was removed from pandas; .iloc[0] grabs the 'Vals' dict
    d.update(data.iloc[0]['Measures'][0]['MeasureValues'][0])
    new_df = pd.concat([new_df, pd.DataFrame.from_dict(d, orient='index').T])
Amount TradeId
0 -18.64 17304M
0 -18.64 17304M
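On newer pandas (>= 1.0), pd.json_normalize offers a more direct route; a sketch, assuming every dict in 'Vals' has the same nested shape as the sample row above:
# walk Measures -> MeasureValues and carry TradeId along as metadata
flat = pd.json_normalize(
    fff['Vals'].tolist(),
    record_path=['Measures', 'MeasureValues'],
    meta=['TradeId'],
)
print(flat)  # columns: Amount, TradeId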