Convert column to date format - python

I am trying to convert the dates to a proper date format. I have tested some of the possibilities I have read in the forum, but I still don't know how to tackle this issue.
After importing:
df = pd.read_excel(r'/path/df_datetime.xlsb', sheet_name="12FEB22", engine='pyxlsb')
I get the following df:
{'Unnamed: 0': {0: 'Administrative ID',
1: '000002191',
2: '000002382',
3: '000002434',
4: '000002728',
5: '000002826',
6: '000003265',
7: '000004106',
8: '000004333'},
'Unnamed: 1': {0: 'Service',
1: 'generic',
2: 'generic',
3: 'generic',
4: 'generic',
5: 'generic',
6: 'generic',
7: 'generic',
8: 'generic'},
'Unnamed: 2': {0: 'Movement type',
1: 'New',
2: 'New',
3: 'New',
4: 'Modify',
5: 'New',
6: 'New',
7: 'New',
8: 'New'},
'Unnamed: 3': {0: 'Date',
1: 37503,
2: 37475,
3: 37453,
4: 44186,
5: 37711,
6: 37658,
7: 37770,
8: 37820},
'Unnamed: 4': {0: 'Contract Term',
1: '12',
2: '12',
3: '12',
4: '12',
5: '12',
6: '12',
7: '12',
8: '12'}}
However, even though I have tried to convert the 'Date' column (or 'Unnamed: 3', because the original dataset has no header row, so I have to set the header afterwards) during the import, it has been unsuccessful.
Is there any other option I can try?
Thanks!

try this:
from xlrd import xldate_as_datetime

def trans_date(x):
    if isinstance(x, int):
        return xldate_as_datetime(x, 0).date()
    else:
        return x

print(df['Unnamed: 3'].apply(trans_date))
>>>
0 Date
1 2002-09-04
2 2002-08-07
3 2002-07-16
4 2020-12-21
5 2003-03-31
6 2003-02-06
7 2003-05-29
8 2003-07-18
Name: Unnamed: 3, dtype: object
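An alternative that stays within pandas, in case xlrd is not available: Excel stores these dates as serial day counts with origin 1899-12-30, so pd.to_datetime can convert the integers directly. A minimal sketch, assuming df is the frame shown above (the stray 'Date' header cell becomes NaT):

import pandas as pd

# Coerce non-numeric cells (e.g. the 'Date' header text) to NaN, then
# interpret the remaining integers as Excel serial days.
serial = pd.to_numeric(df['Unnamed: 3'], errors='coerce')
df['Unnamed: 3'] = pd.to_datetime(serial, unit='D', origin='1899-12-30')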


Pandas - Group by multiple columns and datetime

I have a df of tennis results and I would like to be able to see how many days it's been since each player last won a game.
This is what my df looks like
Player 1  Player 2  Date        p1_win  p2_win
Murray    Nadal     2022-05-16  1       0
Nadal     Murray    2022-05-25  1       0
and this is what I want it to look like
Player 1  Player 2  Date        p1_win  p2_win  p1_lastwin  p2_lastwin
Murray    Nadal     2022-05-16  1       0       na          na
Nadal     Murray    2022-05-25  1       0       na          9
The results will have to include the days since the last win, whether the player was Player 1 or Player 2, using groupby I think. It would also be good to have a win percentage for the year, if possible.
Any help is much appreciated.
edit - here is the dict
{'Player 1': {0: 'Murray',
1: 'Nadal',
2: 'Murray',
3: 'Nadal',
4: 'Murray',
5: 'Nadal',
6: 'Murray',
7: 'Nadal',
8: 'Murray',
9: 'Nadal',
10: 'Murray'},
'Player 2': {0: 'Nadal',
1: 'Murray',
2: 'Nadal',
3: 'Murray',
4: 'Nadal',
5: 'Murray',
6: 'Nadal',
7: 'Murray',
8: 'Nadal',
9: 'Murray',
10: 'Nadal'},
'Date': {0: '2022-05-16',
1: '2022-05-26',
2: '2022-05-27',
3: '2022-05-28',
4: '2022-05-29',
5: '2022-06-01',
6: '2022-06-02',
7: '2022-06-05',
8: '2022-06-09',
9: '2022-06-13',
10: '2022-06-17'},
'p1_win': {0: '1',
1: '1',
2: '0',
3: '1',
4: '0',
5: '0',
6: '1',
7: '0',
8: '1',
9: '0',
10: '1'},
'p2_win': {0: '0',
1: '0',
2: '1',
3: '0',
4: '1',
5: '1',
6: '0',
7: '1',
8: '0',
9: '1',
10: '0'}}
Thanks :)
I leveraged pd.merge_asof to find the latest win, and then did a merge to the relevant index.
import pandas as pd

df = pd.DataFrame({'Player 1': {0: 'Murray', 1: 'Nadal', 2: 'Murray', 3: 'Nadal', 4: 'Murray', 5: 'Nadal', 6: 'Murray'}, 'Player 2': {0: 'Nadal', 1: 'Murray', 2: 'Nadal', 3: 'Murray', 4: 'Nadal', 5: 'Murray', 6: 'Nadal'}, 'Date': {0: '2022-05-16', 1: '2022-05-26', 2: '2022-05-27', 3: '2022-05-28', 4: '2022-05-29', 5: '2022-06-01', 6: '2022-06-02'}, 'p1_win': {0: '1', 1: '1', 2: '0', 3: '1', 4: '0', 5: '0', 6: '1'}, 'p2_win': {0: '0', 1: '0', 2: '1', 3: '0', 4: '1', 5: '1', 6: '0'}})
df['p1_win']=df.p1_win.astype(int)
df['p2_win']=df.p2_win.astype(int)
df['Date'] = pd.to_datetime(df['Date'])
df['match'] = [x+'_'+y if x>y else y+'_'+x for x,y in zip(df['Player 1'],df['Player 2'])]
# df['winner'] = np.where(df.p1_win==1,df['Player 1'],df['Player 2'])
# df['looser'] = np.where(df.p1_win==0,df['Player 1'],df['Player 2'])
df = df.reset_index()
df = df.sort_values(by='Date')
df = pd.merge_asof(df,df[df.p1_win==1][['match','Date','index']],by=['match'],on='Date',suffixes=('','_latest_win_p1'),allow_exact_matches=False,direction='backward')
df = pd.merge_asof(df,df[df.p2_win==1][['match','Date','index']],by=['match'],on='Date',suffixes=('','_latest_win_p2'),allow_exact_matches=False,direction='backward')
# df = df[['index','Date','Player 1','Player 2','p1_win','p2_win','match','winner','looser','index_latest_win_p2','index_latest_win_p1']]
df = df.merge(df[['Date','index','match']],how='left',left_on=['index_latest_win_p1','match'],right_on=['index','match'],suffixes=('','_latest_win_winner'))
df = df.merge(df[['Date','index','match']],how='left',left_on=['index_latest_win_p2','match'],right_on=['index','match'],suffixes=('','_latest_win_looser'))
df['days_since_last_win_winner'] = (df['Date']-df.Date_latest_win_winner).dt.days
df['days_since_last_win_looser'] = (df['Date']-df.Date_latest_win_looser).dt.days
Not sure that this is exactly what you meant, but let me know if you need anything else.
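The yearly win percentage the question also mentions is not covered above. A minimal sketch, assuming df holds the posted dict with its original column names: reshape to one row per player per match, then group by player and year.

import pandas as pd

# Stack Player 1 and Player 2 into a single 'player' column with their win flags.
long_form = pd.concat([
    df.rename(columns={'Player 1': 'player', 'p1_win': 'win'})[['player', 'Date', 'win']],
    df.rename(columns={'Player 2': 'player', 'p2_win': 'win'})[['player', 'Date', 'win']],
])
long_form['win'] = long_form['win'].astype(int)
long_form['year'] = pd.to_datetime(long_form['Date']).dt.year
# Mean of the 0/1 win flags per player and year, expressed as a percentage.
win_pct = long_form.groupby(['player', 'year'])['win'].mean().mul(100).round(1)
print(win_pct)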

Start looking at a column position based on a column name and return the next value

Do you know how I can start looking for a specific text, starting at a column whose name is given by the values found in another column?
Let me explain better with an example. Col 8 contains the column names where I have to start looking for the text 'Office'. So, for the first row, Col 8 indicates that I have to start at Col 2. Then I have to find the NEXT 'Office' text and return the value of the next column (always in the same row). Once I get it, I will create a DataFrame containing that next value, Col 4 in this example.
{'Col 1': {0: 3.4, 1: 4.6, 2: 7.6, 3: 3.7, 4: 5.9, 5: 2.5, 6: 2.6},
'Col 2': {0: 'LTE', 1: 'LTE', 2: 'LTE', 3: 'LTE', 4: 'LTE', 5: 'LTE', 6: 'LTE'},
'Col 3': {0: 'Office', 1: 'Office', 2: nan, 3: 'Office', 4: nan, 5: nan, 6: nan},
'Col 4': {0: 1.2, 1: 3.1, 2: 23.0, 3: 11.0, 4: 34.0, 5: 12.0, 6: 123.0},
'Col 5': {0: 'LTE', 1: 'LTE', 2: 'LTE', 3: 'LTE', 4: 'LTE', 5: 'LTE', 6: 'LTE'},
'Col 6': {0: 'Office', 1: nan, 2: 'Office', 3: 'Office', 4: 'Office', 5: 'Office', 6: 'Office'},
'Col 7': {0: 1.2, 1: 6.7, 2: 12.0, 3: 143.0, 4: 674.0, 5: 354.0, 6: 134.0},
'Col 8': {0: 'Col 2', 1: 'Col 2', 2: 'Col 6', 3: 'Col 2', 4: 'Col 6', 5: 'Col 6', 6: 'Col 6'}}
Any ideas on how to deal with this problem?
Output expected:
{'Col1': {0: '3.4', 1: '4.6', 2: '7.6', 3: '3.7', 4: '5.9', 5: '2.5', 6: '2.6'},
'Col 4': {0: 1.2, 1: 3.1, 2: 12.0, 3: 11.0, 4: 674.0, 5: 354.0, 6: 134.0}}
which looks like:
Col1 Col 4
0 3.4 1.2
1 4.6 3.1
2 7.6 12.0
3 3.7 11.0
4 5.9 674.0
5 2.5 354.0
6 2.6 134.0
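One possible row-wise sketch (not vectorized), assuming the posted dict is loaded as df = pd.DataFrame(data); the result column is named 'Col 4' only to match the expected output above:

import numpy as np
import pandas as pd

df = pd.DataFrame(data)  # data is the dict posted above
cols = list(df.columns)

def next_office_value(row):
    # Start at the column named in 'Col 8', scan to the right for the next
    # 'Office' cell, and return the value one column further right.
    start = cols.index(row['Col 8'])
    for i in range(start, len(cols) - 1):
        if row[cols[i]] == 'Office':
            return row[cols[i + 1]]
    return np.nan

result = pd.DataFrame({'Col 1': df['Col 1'], 'Col 4': df.apply(next_office_value, axis=1)})
print(result)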

Python conditional lookup

I have a transactional table and a lookup table as below. I need to add the val field from df_lkp to df_txn by lookup.
For each record of df_txn, I need to loop through df_lkp. If the grp field value is a, then compare only field a in both tables to get a match. If the grp value is ab, then compare fields a and b in both tables. If it is abc, then fields a, b and c should be compared to fetch the val field, and so on. Is there a way this could be done in pandas without a for-loop?
df_txn = pd.DataFrame({'id': {0: '1', 1: '2', 2: '3', 3: '4', 4: '5', 5: '6', 6: '7'},
'amt': {0: 100, 1: 200, 2: 300, 3: 400, 4: 500, 5: 600, 6: 700},
'a': {0: '226', 1: '227', 2: '248', 3: '236', 4: '248', 5: '236', 6: '236'},
'b': {0: '0E31', 1: '0E32', 2: '0E40', 3: '0E35', 4: '0E40', 5: '0E40', 6: '0E33'},
'c': {0: '3014', 1: '3015', 2: '3016', 3: '3016', 4: '3016', 5: '3016', 6: '3016'}})
df_lkp = pd.DataFrame({'a': {0: '226', 1: '227', 2: '236', 3: '237', 4: '248'},
'b': {0: '0E31', 1: '0E32', 2: '0E33', 3: '0E35', 4: '0E40'},
'c': {0: '3014', 1: '3015', 2: '3016', 3: '3018', 4: '3019'},
'grp': {0: 'a', 1: 'ab', 2: 'abc', 3: 'b', 4: 'bc'},
'val': {0: 'KE00CH0004', 1: 'KE00CH0003', 2: 'KE67593065', 3: 'KE67593262', 4: 'KE00CH0003'}})
The expected output:
df_tx2 = pd.DataFrame({'id': {0: '1', 1: '2', 2: '3', 3: '4', 4: '5', 5: '6', 6: '7'},
'amt': {0: 100, 1: 200, 2: 300, 3: 400, 4: 500, 5: 600, 6: 700},
'a': {0: '226', 1: '227', 2: '248', 3: '236', 4: '248', 5: '236', 6: '236'},
'b': {0: '0E31', 1: '0E32', 2: '0E40', 3: '0E35', 4: '0E40', 5: '0E40', 6: '0E33'},
'c': {0: '3014', 1: '3015', 2: '3016', 3: '3016', 4: '3016', 5: '3016', 6: '3016'},
'val': {0: 'KE00CH0004', 1: 'KE00CH0003', 2: '', 3: '', 4: '', 5: '', 6: 'KE67593065'}
})
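One possible way to avoid a row-by-row loop is to merge once per distinct grp pattern, using the letters of grp as the join keys. A sketch, assuming df_txn and df_lkp as defined above and that each grp group has unique key combinations; note that this also matches the grp 'b' row against id 4, which the posted expected output leaves blank, so that rule may need adjusting:

import pandas as pd

result = df_txn.copy()
result['val'] = ''
for grp, lkp in df_lkp.groupby('grp'):
    keys = list(grp)  # e.g. 'ab' -> ['a', 'b']
    merged = df_txn.merge(lkp[keys + ['val']], on=keys, how='left')
    fill = merged['val'].fillna('')
    # Keep a val found by an earlier pattern, otherwise take this pattern's match.
    result['val'] = result['val'].where(result['val'] != '', fill)
print(result)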

How to export a pandas DataFrame with multi-level columns to Excel

I'm stuck exporting a multi-index DataFrame to Excel in the layout I'm looking for.
This is what I'm looking for in Excel.
I know I have to add an extra index level on the left for the rows SRR (%) and Traction (-), but how?
My code so far.
import pandas as pd
import matplotlib.pyplot as plt
data = {'Step 1': {'Step Typ': 'Traction', 'SRR (%)': {1: 8.384, 2: 9.815, 3: 7.531, 4: 10.209, 5: 7.989, 6: 7.331, 7: 5.008, 8: 2.716, 9: 9.6, 10: 7.911}, 'Traction (-)': {1: 5.602, 2: 6.04, 3: 2.631, 4: 2.952, 5: 8.162, 6: 9.312, 7: 4.994, 8: 2.959, 9: 10.075, 10: 5.498}, 'Temperature': 30, 'Load': 40}, 'Step 3': {'Step Typ': 'Traction', 'SRR (%)': {1: 2.909, 2: 5.552, 3: 5.656, 4: 9.043, 5: 3.424, 6: 7.382, 7: 3.916, 8: 2.665, 9: 4.832, 10: 3.993}, 'Traction (-)': {1: 9.158, 2: 6.721, 3: 7.787, 4: 7.491, 5: 8.267, 6: 2.985, 7: 5.882, 8: 3.591, 9: 6.334, 10: 10.43}, 'Temperature': 80, 'Load': 40}, 'Step 5': {'Step Typ': 'Traction', 'SRR (%)': {1: 4.765, 2: 9.293, 3: 7.608, 4: 7.371, 5: 4.87, 6: 4.832, 7: 6.244, 8: 6.488, 9: 5.04, 10: 2.962}, 'Traction (-)': {1: 6.656, 2: 7.872, 3: 8.799, 4: 7.9, 5: 4.22, 6: 6.288, 7: 7.439, 8: 7.77, 9: 5.977, 10: 9.395}, 'Temperature': 30, 'Load': 70}, 'Step 7': {'Step Typ': 'Traction', 'SRR (%)': {1: 9.46, 2: 2.83, 3: 3.249, 4: 9.273, 5: 8.792, 6: 9.673, 7: 6.784, 8: 3.838, 9: 8.779, 10: 4.82}, 'Traction (-)': {1: 5.245, 2: 8.491, 3: 10.088, 4: 9.988, 5: 4.886, 6: 4.168, 7: 8.628, 8: 5.038, 9: 7.712, 10: 3.961}, 'Temperature': 80, 'Load': 70} }
df = pd.DataFrame(data)
items = list()
series = list()
for item, d in data.items():
    items.append(item)
    series.append(pd.DataFrame.from_dict(d))
df = pd.concat(series, keys=items)
df.set_index(['Step Typ', 'Load', 'Temperature']).T.to_excel('testfile.xlsx')
The picture below shows df.set_index(['Step Typ', 'Load', 'Temperature']).T as a DataFrame (somewhat close, but not exactly what I'm looking for):
Edit 1:
Found a good solution; not the exact one I was looking for, but it's still worth using.
df.reset_index().drop(["level_0","level_1"], axis=1).pivot(columns=["Step Typ", "Load", "Temperature"], values=["SRR (%)", "Traction (-)"]).apply(lambda x: pd.Series(x.dropna().values)).to_excel("solution.xlsx")
Can you explain clearly and show the output you are looking for?
To export a table to Excel use df.to_excel('path', index=True/False), where index=True or False controls whether the index column is written to the file.
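For instance, a minimal illustration (the example frame and file names are hypothetical):

import pandas as pd

example = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
example.to_excel('with_index.xlsx', index=True)       # writes the row index as the first column
example.to_excel('without_index.xlsx', index=False)   # omits the row index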

In pandas, how to ignore invalid values when converting columns from hex to decimal?

When I use:
df[["Type 2", "Type 4"]].applymap(lambda n: int(n, 16))
it stops with an error because of invalid values in the Type 2 column (negative values, NaN, strings...) that cannot be converted from hex. How can I ignore this error or mark the invalid values as zero?
{'Type 1': {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: 'NaN',
3: '55',
4: '3.14',
5: '-96',
6: 'String',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
You can create a custom function that handles this exception and use it in your lambda. For example:
def lambda_int(n):
    try:
        return int(n, 16)
    except ValueError:
        return 0

df[["Type 2", "Type 4"]] = df[["Type 2", "Type 4"]].applymap(lambda n: lambda_int(n))
Please go through this; I reconstructed your question and gave steps to follow.
1. The first dictionary you provided does not have a NaN value, it has the string "NaN":
data = {'Type 1': {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: 'NaN',
3: '55',
4: '3.14',
5: '-96',
6: 'String',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
import pandas as pd
df = pd.DataFrame(data)
df.head()
To check for NaN values in your df and remove them:
columns_with_na = df.isna().sum()
#filter starting from 1 missing value
columns_with_na = columns_with_na[columns_with_na != 0]
print(len(columns_with_na))
print(len(columns_with_na.sort_values(ascending = False))) #print them in descending order
Prints 0 and 0 because there is no nan
Reconstructed your data to include a nan by using numpy.nan
import numpy as np
#recreated a dataset and included a nan value : np.nan at Type 2
data = {'Type 1': {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: np.nan,
3: '55',
4: '3.14',
5: '-96',
6: 'String',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
df2 = pd.DataFrame(data)
df2.head()
#sum up number of columns with nan
columns_with_na = df2.isna().sum()
#filter starting from 1 missing value
columns_with_na = columns_with_na[columns_with_na != 0]
print(len(columns_with_na))
print(len(columns_with_na.sort_values(ascending = False)))
Prints 1 and 1 because there is a NaN in the Type 2 column
#drop nan values
df2 = df2.dropna(how = 'any')
#sum up number of columns with nan
columns_with_na = df2.isna().sum()
#filter starting from 1 missing value
columns_with_na = columns_with_na[columns_with_na != 0]
print(len(columns_with_na))
#prints 0 because I dropped all the nan values
df2.head()
To fill nan in df with 0 use:
df2.fillna(0, inplace = True)
Fill in nan with 0 in df2['Type 2'] only:
#if you don't want to change the original dataframe, set inplace to False
df2['Type 2'].fillna(0, inplace = True) #inplace is set to True to change the original df
