Adding Leading Zeros to a field with MM:SS time data - python

I have the following data:
data shows a race time finish and pace:
As you can see, the data doesn't show the hour format for people who finish before the hour mark and in order to do some analysis, i need to convert into a time format but pandas doesn't recognize just the MM:SS format. how can I pad '0:' in front of the rows where hour is missing?
i'm sorry, this is my first time posting.

Considering your data is in csv format.
# reading in the data file
df = pd.read_csv('data_file.csv')
# replacing spaces with '_' in column names
df.columns = [c.replace(' ', '_') for c in df.columns]
for i, row in df.iterrows():
val_inital = str(row.Gun_time)
val_final = val_inital.replace(':','')
if len(val_final)<5:
val_final = "0:" + val_inital
df.at[i, 'Gun_time'] = val_final
# saving newly edited csv file
df.to_csv('new_data_file.csv')
Before:
Gun time
0 28:48
1 29:11
2 1:01:51
3 55:01
4 2:08:11
After:
Gun_time
0 0:28:48
1 0:29:11
2 1:01:51
3 0:55:01
4 2:08:11

You can try to apply the following function to the columns you want to change then maybe change it to timedelta
df['Gun time'].apply(lambda x: '0:' + x if len(x) == 5 \
else ('0:0' + x if len(x) == 4 else x))
df['Gun time'] = pd.to_timedelta(df['Gun Time'])

Related

Multi-part manipulation post str.split() Pandas

I have a subset of data (single column) we'll call ID:
ID
0 07-1401469
1 07-89556629
2 07-12187595
3 07-381962
4 07-99999085
The current format is (usually) YY-[up to 8-character ID].
The desired output format is a more uniformed YYYY-xxxxxxxx:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
Knowing that I've done padding in the past, the thought process was to combine
df['id'].str.split('-').str[0].apply(lambda x: '{0:20>4}'.format(x))
df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x))
However I ran into a few problems:
The '20' in '{0:20>4}' must be a singular value and not a string
Trying to do something like the below just results in df['id'] taking the properties of the last lambda & trying any other way to combine multiple apply/lambdas just didn't work. I started going down the pad left/right route but that seemed to be taking be backwards.
df['id'] = (df['id'].str.split('-').str[0].apply(lambda x: '{0:X>4}'.format(x)).str[1].apply(lambda x: '{0:0>8}'.format(x)))
The current solution I have (but HATE because its long, messy, and just not clean IMO) is:
df['idyear'] = df['id'].str.split('-').str[0].apply(lambda x: '{:X>4}'.format(x)) # Split on '-' and pad with X
df['idyear'] = df['idyear'].str.replace('XX', '20') # Replace XX with 20 to conform to YYYY
df['idnum'] = df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x)) # Pad 0s up to 8 digits
df['id'] = df['idyear'].map(str) + "-" + df['idnum'] # Merge idyear and idnum to remake id
del df['idnum'] # delete extra
del df['idyear'] # delete extra
Which does work
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
But my questions are
Is there a way to run multiple apply() functions in a single line so I'm not making temp variables
Is there a better way than replacing 'XX' for '20'
I feel like this entire code block can be compress to 1 or 2 lines I just don't know how. Everything I've seen on SO and Pandas documentation on highlights/relates to singular manipulation so far.
One option is to split; then use str.zfill to pad '0's. Also prepend '20's before splitting, since you seem to need it anyway:
tmp = df['ID'].radd('20').str.split('-')
df['ID'] = tmp.str[0] + '-'+ tmp.str[1].str.zfill(8)
Output:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
I'd do it in two steps, using .str.replace:
df["ID"] = df["ID"].str.replace(r"^(\d{2})-", r"20\1-", regex=True)
df["ID"] = df["ID"].str.replace(r"-(\d+)", lambda g: f"-{g[1]:0>8}", regex=True)
print(df)
Prints:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085

python create multiple year daterange with specific end date for every year

Hello need some help with this problem
a = pd.date_range(start="2001-01-01", freq="T", periods=520000)
This creates the date-range i need for 1 year. I want to do the same for the next 80 years. The end result should be a date range for 80year but every year ends after 520000min. Then i add the date range to my dataset.
# this is the data
ALL_Data = pd.DataFrame({"Lebensverbrauch_Min": LebensverbrauchMIN,
"HPT": Heisspunkttemperatur_Sim1,
"Innentemperatur": StartS,
"Verlustleistung": V_Leistung,
"SolarEintrag": SolarEintrag,
"Lastfaktor": K_Load_Faktor
})
# How many minutes are left in the year
DatenJahr = len(pd.date_range(start=str(xx) + "-01-01", freq="T", periods=520000))
VollesJahr = len(pd.date_range(start=str(xx) + "-01-01", freq="T", end=str(xx + 1) + "-01-01"))
GG = (VollesJahr - DatenJahr)
d = pd.DataFrame(np.zeros((GG, 6)), columns=['Lebensverbrauch_Min', 'HPT', 'Innentemperatur','Verlustleistung',
'SolarEintrag', 'Lastfaktor',])
#combine Data with 0
ALL_Data = pd.concat([ALL_Data, d])
seems to work but the complete code needs 4h to run so we will see

Iterate through df rows faster

I am trying to iterate through rows of a Pandas df to get data from one column of the row, and using that data to add new columns. The code is listed below but it is VERY slow. Is there any way to do what I am trying to do without iterating thru the individual rows of the dataframe?
ctqparam = []
wwy = []
ww = []
for index, row in df.iterrows():
date = str(row['Event_Start_Time'])
day = int(date[8] + date[9])
month = int(date[5] + date[6])
total = 0
for i in range(0, month-1):
total += months[i]
total += day
out = total // 7
ww += [out]
wwy += [str(date[0] + date[1] + date[2] + date[3])]
val = str(row['TPRev'])
out = ""
for letter in val:
if letter != '.':
out += letter
df.replace(to_replace=row['TPRev'], value=str(out), inplace = True)
val = str(row['Subtest'])
if val in ctqparam_dict.keys():
ctqparam += [ctqparam_dict[val]]
# add WWY column, WW column, and correct data format of Test_Tape column
df.insert(0, column='Work_Week_Year', value = wwy)
df.insert(3, column='Work_Week', value = ww)
df.insert(4, column='ctqparam', value = ctqparam)
It's hard to say exactly what your trying to do. However, if you're looping through rows chances are that there is a better way to do it.
For example, given a csv file that looks like this..
Event_Start_Time,TPRev,Subtest
4/12/19 06:00,"this. string. has dots.. in it.",{'A_Dict':'maybe?'}
6/10/19 04:27,"another stri.ng wi.th d.ots.",{'A_Dict':'aVal'}
You may want to:
Format Event_Start_Time as datetime.
Get the week number from Event_Start_Time.
Remove all the dots (.) from the strings in column TPRev.
Expand a dictionary contained in Subtest to its own column.
Without looping through the rows, consider doing thing by columns. Like doing it to the first 'cell' of the column and it replicates all the way down.
Code:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
Event_Start_Time TPRev Subtest
0 4/12/19 06:00 this. string. has dots.. in it. {'A_Dict':'maybe?'}
1 6/10/19 04:27 another stri.ng wi.th d.ots. {'A_Dict':'aVal'}
# format 'Event_Start_Time' as as datetime
df['Event_Start_Time'] = pd.to_datetime(df['Event_Start_Time'], format='%d/%m/%y %H:%M')
# get the week number from 'Event_Start_Time'
df['Week_Number'] = df['Event_Start_Time'].dt.isocalendar().week
# replace all '.' (periods) in the 'TPRev' column
df['TPRev'] = df['TPRev'].str.replace('.', '', regex=False)
# get a dictionary string out of column 'Subtest' and put into a new column
df = pd.concat([df.drop(['Subtest'], axis=1), df['Subtest'].map(eval).apply(pd.Series)], axis=1)
print(df)
Event_Start_Time TPRev Week_Number A_Dict
0 2019-12-04 06:00:00 this string has dots in it 49 maybe?
1 2019-10-06 04:27:00 another string with dots 40 aVal
print(df.info())
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Event_Start_Time 2 non-null datetime64[ns]
1 TPRev 2 non-null object
2 Week_Number 2 non-null UInt32
3 A_Dict 2 non-null object
dtypes: UInt32(1), datetime64[ns](1), object(2)
So you end up with a dataframe like this...
Event_Start_Time TPRev Week_Number A_Dict
0 2019-12-04 06:00:00 this string has dots in it 49 maybe?
1 2019-10-06 04:27:00 another string with dots 40 aVa
Obviously you'll probably want to do other things. Look at your data. Make a list of what you want to do to each column or what new columns you need. Don't mention how right now as chances are it's possible and has been done before - you just need to find the existing method.
You may write down get the difference in days from the current row and the row beneath etc.). Finally search out how to do the formatting or calculation you require. Break the problem down.

How to not set value to slice of copy [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 2 years ago.
I am trying to replace string values in a column without creating a copy. I have looked at the docs provided in the warning and also this question. I have also tried using .replace() with the same results. What am I not understanding?
Code:
import pandas as pd
from datetime import timedelta
# set csv file as constant
TRADER_READER = pd.read_csv('TastyTrades.csv')
TRADER_READER['Strategy'] = ''
def iron_condor():
TRADER_READER['Date'] = pd.to_datetime(TRADER_READER['Date'], format="%Y-%m-%d %H:%M:%S")
a = 0
b = 1
c = 2
d = 3
for row in TRADER_READER.index:
start_time = TRADER_READER['Date'][a]
end_time = start_time + timedelta(seconds=5)
e = TRADER_READER.iloc[a]
f = TRADER_READER.iloc[b]
g = TRADER_READER.iloc[c]
h = TRADER_READER.iloc[d]
if start_time <= f['Date'] <= end_time and f['Underlying Symbol'] == e['Underlying Symbol']:
if start_time <= g['Date'] <= end_time and g['Underlying Symbol'] == e['Underlying Symbol']:
if start_time <= h['Date'] <= end_time and h['Underlying Symbol'] == e['Underlying Symbol']:
e.loc[e['Strategy']] = 'Iron Condor'
f.loc[f['Strategy']] = 'Iron Condor'
g.loc[g['Strategy']] = 'Iron Condor'
h.loc[h['Strategy']] = 'Iron Condor'
print(e, f, g, h)
if (d + 1) > int(TRADER_READER.index[-1]):
break
else:
a += 1
b += 1
c += 1
d += 1
iron_condor()
Warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_with_indexer(indexer, value)
Hopefully this satisfies the data needed to replicate:
,Date,Type,Action,Symbol,Instrument Type,Description,Value,Quantity,Average Price,Commissions,Fees,Multiplier,Underlying Symbol,Expiration Date,Strike Price,Call or Put
36,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103P00206500,Equity Option,Bought 1 QQQ 01/03/20 Put 206.50 # 0.07,-7,1,-7,-1.0,-0.14,100.0,QQQ,1/3/2020,206.5,PUT
37,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103C00217500,Equity Option,Bought 1 QQQ 01/03/20 Call 217.50 # 0.03,-3,1,-3,-1.0,-0.14,100.0,QQQ,1/3/2020,217.5,CALL
38,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103P00209000,Equity Option,Sold 1 QQQ 01/03/20 Put 209.00 # 0.14,14,1,14,-1.0,-0.15,100.0,QQQ,1/3/2020,209.0,PUT
39,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103C00214500,Equity Option,Sold 1 QQQ 01/03/20 Call 214.50 # 0.30,30,1,30,-1.0,-0.15,100.0,QQQ,1/3/2020,214.5,CALL
40,2020-01-03 16:08:13,Trade,BUY_TO_CLOSE,QQQ 200103C00214500,Equity Option,Bought 1 QQQ 01/03/20 Call 214.50 # 0.07,-7,1,-7,0.0,-0.14,100.0,QQQ,1/3/2020,214.5,CALL
Expected result:
,Date,Type,Action,Symbol,Instrument Type,Description,Value,Quantity,Average Price,Commissions,Fees,Multiplier,Underlying Symbol,Expiration Date,Strike Price,Call or Put
36,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103P00206500,Equity Option,Bought 1 QQQ 01/03/20 Put 206.50 # 0.07,-7,1,-7,-1.0,-0.14,100.0,QQQ,1/3/2020,206.5,PUT,Iron Condor
37,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103C00217500,Equity Option,Bought 1 QQQ 01/03/20 Call 217.50 # 0.03,-3,1,-3,-1.0,-0.14,100.0,QQQ,1/3/2020,217.5,CALL,Iron Condor
38,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103P00209000,Equity Option,Sold 1 QQQ 01/03/20 Put 209.00 # 0.14,14,1,14,-1.0,-0.15,100.0,QQQ,1/3/2020,209.0,PUT,Iron Condor
39,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103C00214500,Equity Option,Sold 1 QQQ 01/03/20 Call 214.50 # 0.30,30,1,30,-1.0,-0.15,100.0,QQQ,1/3/2020,214.5,CALL,Iron Condor
40,2020-01-03 16:08:13,Trade,BUY_TO_CLOSE,QQQ 200103C00214500,Equity Option,Bought 1 QQQ 01/03/20 Call 214.50 # 0.07,-7,1,-7,0.0,-0.14,100.0,QQQ,1/3/2020,214.5,CALL,
Let's start from some improvements in the initial part of your code:
The leftmost column of your input file is apparently the index column,
so it should be read as the index. The consequence is some different approach
to the way to access rows (details later).
The Date column can be converted to datetime64 as early as at the reading time.
So the initial part of your code can be:
TRADER_READER = pd.read_csv('Input.csv', index_col=0, parse_dates=['Date'])
TRADER_READER['Strategy'] = ''
Then I decided to organize the loop other way:
indStart is the integer index of the index column.
As you process your file in "overlapping" couples of 4 consecutive rows,
a more natural way to organize the loop is to stop on 4-th row from the end.
So the loop is over the range(TRADER_READER.index.size - 3).
Indices of 4 rows of interest can be read from the respective slice of the
index, i.e. [indStart : indStart + 4]
Check of particular row can be performed with a nested function.
To avoid your warning, setting of values in Strategy column should be
performed using loc on the original DataFrame, with row parameter for
the respective row and column parameter for Strategy.
The whole update (for the current couple of 4 rows) can be performed in
a single instruction, specifying row parameter as a slice,
from a thru d.
So the code can be something like below:
def iron_condor():
def rowCheck(row):
return start_time <= row.Date <= end_time and row['Underlying Symbol'] == undSymb
for indStart in range(TRADER_READER.index.size - 3):
a, b, c, d = TRADER_READER.index[indStart : indStart + 4]
e = TRADER_READER.loc[a]
undSymb = e['Underlying Symbol']
start_time = e.Date
end_time = start_time + pd.Timedelta('5S')
if rowCheck(TRADER_READER.loc[b]) and rowCheck(TRADER_READER.loc[c]) and rowCheck(TRADER_READER.loc[d]):
TRADER_READER.loc[a:d, 'Strategy'] = 'Iron Condor'
print('New values:')
print(TRADER_READER.loc[a:d])
No need to increment a, b, c and d. Neither break is needed.
Edit
If for some reason you have to do other updates on the rows in question,
you can change my code accordingly.
But I don't understand "this csv file will make a new column" in your
comment. For now anything you do is performed on the DataFrame
in memory. Only after that you can save the DataFrame back to the
original file. But note that even your code changes the type of Date
column, so I assume you do it once and then the type of this column
is just datetime64.
So you probably should change the type of Date column as a separate
operation and then (possibly many times) update thie DataFrame and save
the updated content back to the source file.
Edit following the comment as of 21:22:46Z
re.search('.*TO_OPEN$', row['Action']) returns a re.Match object if
a match has been found, otherwise None.
So can not compare this result with the string searched. If you wanted to get
the string matched, you should run e.g.:
mtch = re.search('.*TO_OPEN$', row['Action'])
textFound = None
if mtch:
textFound = mtch.group(0)
But you actually don't need to do it. It is enough to check whether
a match has been found, so the condition can be:
found = bool(re.search('.*TO_OPEN$', row['Action']))
(note that None cast to bool returns False and any non-Null object
returns True).
Yet another (probably simpler and quicker) solution is that you run just:
row.Action.endswith('TO_OPEN')
without invoking any regex fuction.
Here is a quite elaborating post that can not only answer your question but also explain in details why things are the case.
Deal with SettingWithCopyWarning
In short if you want to set the value of the original df, either use .replace(inplace=True) or df.loc[condition, theColtoBeSet] = new_val

How do you drop all "£" and "," in all the columns at once, not one at a time?

Need to remove all "£" and "," from an excel file in my jupyter notebook using python. I know how to do 1 column at a time but there are many columns and I want to drop all the "£" and "," at once and resave it.
I have tried a bunch of different things from here but they are only showing a list or one column at a time I need to drop all the "£" and "," at once as it is too time consuming.
df.head()
total refunded tax
£43,622.61 £4,528.86 £16,975.51
£133,873.05 £37,093.26 £3,561.38
£19,253.19 £14,253.07 £2,544.06
£153,947.84 £11,993.01 £2,779.59
£227,844.55 £29,682.94 £52,195.53
Here is one way replace
df=df.replace({'£':'',',':''},regex=True)
df
total refunded tax
0 43622.61 4528.86 16975.51
1 133873.05 37093.26 3561.38
2 19253.19 14253.07 2544.06
3 153947.84 11993.01 2779.59
4 227844.55 29682.94 52195.53
You can use the apply method to apply a function to all columns of a pandas dataframe.
import pandas as pd
import io
data = io.StringIO("""
total refunded tax
£43,622.61 £4,528.86 £16,975.51
£133,873.05 £37,093.26 £3,561.38
£19,253.19 £14,253.07 £2,544.06
£153,947.84 £11,993.01 £2,779.59
£227,844.55 £29,682.94 £52,195.53
""")
df = pd.read_csv(data, sep=' ')
def cleaner(x):
cleaned = [''.join([c for c in s if c not in {'£', ','}]) for s in x]
return cleaned
df = df.apply(cleaner)
print(df.to_string())
# total refunded tax
# 0 43622.61 4528.86 16975.51
# 1 133873.05 37093.26 3561.38
# 2 19253.19 14253.07 2544.06
# 3 153947.84 11993.01 2779.59
# 4 227844.55 29682.94 52195.53

Categories