I have the following sample data:
   date     value
0  2021/05     50
1  2021/06     60
2  2021/07     70
3  2021/08     80
4  2021/09     90
5  2021/10    100
I want to update the data in the 'date' column so that, for example, '2021/05' becomes '05/10/2021', '2021/06' becomes '06/12/2021', and so on (I have to choose the new date manually for every row).
Is there a better/more clever way to do it instead of:
for i in df.index:
    if df['date'][i] == '2021/05':
        df['date'][i] = '05/10/2021'
    elif df['date'][i] == '2021/06':
        df['date'][i] = '06/12/2021'
The problem is that more than a hundred rows have to be updated, so the code above would become tremendously long.
We can use NumPy's select function like so:
import numpy as np
condlist = [df['date'] == '2021/05',
            df['date'] == '2021/06']
choicelist = ['05/10/2021',
              '06/12/2021']
# default=df['date'] leaves rows that match no condition unchanged
df['date'] = np.select(condlist, choicelist, default=df['date'])
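With more than a hundred replacements, a lookup dictionary may be easier to maintain than parallel condition and choice lists. The following is a minimal sketch, assuming the column holds plain strings; the `new_dates` dictionary and the shortened sample frame are made up for illustration.

```python
import pandas as pd

# Shortened sample frame (illustrative, not the full data set)
df = pd.DataFrame({
    "date": ["2021/05", "2021/06", "2021/07"],
    "value": [50, 60, 70],
})

# Hypothetical lookup table: old value -> manually chosen new date
new_dates = {
    "2021/05": "05/10/2021",
    "2021/06": "06/12/2021",
}

# map() yields NaN for keys missing from the dict;
# fillna() then restores the original value for those rows
df["date"] = df["date"].map(new_dates).fillna(df["date"])
print(df)
```

Rows without an entry in the dictionary ('2021/07' here) keep their original value, so the table can be filled in gradually.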
I would use an interactive approach, saving the amended DataFrame to a file at the end:
import pandas as pd
dt = pd.DataFrame({"date":["2021/05", "2021/06", "2021/07", "2021/08", "2021/09", "2021/10"], "value": [50, 60, 70, 80, 90, 100]})
for n, i in enumerate(dt.loc[:, "date"]):
    to_be_parsed = True
    # Keep asking until the answer forms a valid date
    while to_be_parsed:
        day = input("What is the day for {:s}? ".format(i))
        date_str = "{:s}/{:0>2s}".format(i, day)
        try:
            # Parse "YYYY/MM/DD" and store it as "MM/DD/YYYY"
            dt.loc[n, "date"] = pd.to_datetime(date_str, format="%Y/%m/%d").strftime("%m/%d/%Y")
            to_be_parsed = False
        except ValueError:
            print("Invalid date: {:s}. Try again".format(date_str))
output_path = input("Save amended dataframe to path (no input to skip): ")
if len(output_path) > 0:
    dt.to_csv(output_path, index=False)
My current code is extremely slow because of its nested for loops. I would like to speed it up, and I assume the solution is vectorization with Pandas or NumPy, but I do not know how to convert my current code to that form.
I have created example code below.
import pandas as pd
import numpy as np
balance = 10000
raw_data = [[1,2,4,1,3],[2,3,7,2,4],[3,4,5,3,4],[4,4,9,1,5],[5,5,6,4,5]]
raw_df = pd.DataFrame(raw_data, columns=['D','O','H','L','C'])
history_data = [[1,1,5,np.nan,4],[0,1,3,np.nan,4],[1,0,4,2,3],[1,0,1,6,0],[0,1,7,np.nan,8]]
history_df = pd.DataFrame(history_data, columns=['TY','ST','OP','CL','SL'])
for n in raw_df.index:
    for p in history_df.index:
        if history_df['ST'][p] == 1 and history_df['TY'][p] == 1 and history_df['SL'][p] >= raw_df['L'][n]:
            history_df['CL'][p] = raw_df['L'][n]
            history_df['ST'][p] = 0
            balance = balance + 20
    if raw_df['C'][n] > 4:
        history_df = history_df.append({'TY':0,'ST':1,'OP':5,'CL':np.nan,'SL':9,},ignore_index = True)
Check out this example and see if it helps:
import numpy as np
import pandas as pd
# Build a boolean mask that checks row i of history_df against row i of raw_df.
# This is an element-wise comparison, so both frames must share the same index;
# it is not the all-pairs comparison that the nested loop performs.
mask = (history_df['ST'] == 1) & (history_df['TY'] == 1) & (history_df['SL'] >= raw_df['L'])
history_df.loc[mask, 'CL'] = raw_df.loc[mask, 'L']
history_df.loc[mask, 'ST'] = 0
# Calculate the balance change
balance_change = 20 * int(mask.sum())
balance += balance_change
# Append one new row to history_df for every raw_df row where C > 4
n_new = int((raw_df['C'] > 4).sum())
new_rows = pd.DataFrame({'TY': 0, 'ST': 1, 'OP': 5, 'CL': np.nan, 'SL': 9}, index=range(n_new))
# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
history_df = pd.concat([history_df, new_rows], ignore_index=True)
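If the all-pairs behaviour of the original nested loop has to be preserved, one option is to keep the outer loop over `raw_df` and vectorize only the inner loop over `history_df`. The sketch below assumes that reading of the question; the function name `apply_raw_rows` is hypothetical, and the data is the sample from the question.

```python
import numpy as np
import pandas as pd

def apply_raw_rows(raw_df, history_df, balance):
    """Loop over raw_df rows, but update all matching history rows at once."""
    history_df = history_df.copy()
    for low, close in zip(raw_df['L'].to_numpy(), raw_df['C'].to_numpy()):
        # Vectorized replacement for the inner loop over history_df
        hit = (history_df['ST'] == 1) & (history_df['TY'] == 1) & (history_df['SL'] >= low)
        history_df.loc[hit, 'CL'] = low
        history_df.loc[hit, 'ST'] = 0
        balance += 20 * int(hit.sum())
        if close > 4:
            # Same row that the original code appends
            new_row = pd.DataFrame([{'TY': 0, 'ST': 1, 'OP': 5, 'CL': np.nan, 'SL': 9}])
            history_df = pd.concat([history_df, new_row], ignore_index=True)
    return history_df, balance

raw_data = [[1,2,4,1,3],[2,3,7,2,4],[3,4,5,3,4],[4,4,9,1,5],[5,5,6,4,5]]
raw_df = pd.DataFrame(raw_data, columns=['D','O','H','L','C'])
history_data = [[1,1,5,np.nan,4],[0,1,3,np.nan,4],[1,0,4,2,3],[1,0,1,6,0],[0,1,7,np.nan,8]]
history_df = pd.DataFrame(history_data, columns=['TY','ST','OP','CL','SL'])

history_df, balance = apply_raw_rows(raw_df, history_df, 10000)
print(balance)
print(history_df)
```

This still runs one Python-level iteration per `raw_df` row, so the gain depends on `history_df` being the larger of the two frames.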
import pandas as pd
l1 = ["2021-11-15","2021-11-13","2021-11-10","2021-05-28","2021-06-02","2021-06-02","2021-11-02"]
l2 = ["2021-11-11","2021-03-02","2021-11-05","2021-05-20","2021-05-01","2021-06-01","2021-04-08"]
#convert to dt
l1=pd.to_datetime(l1)
l2= pd.to_datetime(l2)
#put in df
df1=pd.DataFrame(l1)
df2=pd.DataFrame(l2)
df1.columns = ['0']
df2.columns = ['0']
df1=df1.set_index('0')
df2=df2.set_index('0')
#sort asc
df1=df1.sort_index()
df2=df2.sort_index()
How can I get a COUNT from each dataframe based on the number of rows that are within the last 7 days?
You can slice between two timestamps and then get the number of rows with .shape[0]:
def get_count_last_7_days(df):
    # The window ends at the most recent date in the index
    stop = df.index.max()
    start = stop - pd.Timedelta('7D')
    return df.loc[start:stop].shape[0]
count1 = get_count_last_7_days(df1)
count2 = get_count_last_7_days(df2)
import pandas as pd
import numpy as np
from datetime import date, timedelta
x = (date.today() - timedelta(days=100))
y = (date.today() - timedelta(days=7))
z = date.today()
dates = pd.date_range(x, periods=100)
d = np.arange(1, 101)
df = pd.DataFrame(data=d, index=pd.DatetimeIndex(dates))
df = df.sort_index()
last_seven_days = df.loc[y:z]
print(last_seven_days.count())
Your "last 7 days" is ambiguous; I assume it is counted back from the current time:
from datetime import datetime, timedelta
today = datetime.today()
week_ago = today - timedelta(days=7)
Since you already set the date as the index, you can use .loc directly, or you can use a boolean mask:
df = df1.loc[week_ago:today]
# or
df = df1[(df1.index > week_ago) & (df1.index < today)]
To get the row count, you can use the shape attribute or sum the boolean mask:
count = df1.loc[week_ago:today].shape[0]
# or
count = ((df1.index > week_ago) & (df1.index < today)).sum()
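The two variants are not quite interchangeable: label slicing with .loc includes both endpoints, while a mask built with > and < excludes them. A small sketch with made-up dates and a fixed reference date (both are assumptions for illustration) shows the difference when rows fall exactly on a boundary.

```python
import pandas as pd

# Illustrative dates; two of them fall exactly on the window boundaries
idx = pd.to_datetime(["2021-11-01", "2021-11-08", "2021-11-10", "2021-11-15"])
df1 = pd.DataFrame({"n": range(len(idx))}, index=idx).sort_index()

stop = pd.Timestamp("2021-11-15")     # hypothetical reference date
start = stop - pd.Timedelta(days=7)   # 2021-11-08

# .loc slicing on a sorted DatetimeIndex includes both endpoints
count_loc = df1.loc[start:stop].shape[0]
# Strict comparisons exclude both endpoints
count_mask = int(((df1.index > start) & (df1.index < stop)).sum())

print(count_loc, count_mask)  # 3 1
```

Using >= or <= in the mask gives control over which boundary is included.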
I am new to Pandas, and I'm trying to avoid iterating over a DataFrame by using vectorisation instead. I am not able to get the results I want; I need help with the more complicated masking and selection statements.
This is my code:
import random
from datetime import datetime, timedelta
import pandas as pd
dates = []
temp = []
press = []
vel = []
fmt = '%Y-%m-%d %H:%M:%S'
stime = datetime.strptime('2020-01-06 10:28:16', fmt)
etime = datetime.strptime('2020-04-10 03:43:12', fmt)
td = etime - stime
# random.sample() needs a sequence, so convert the set to a list
l = list(set(random.random() for x in range(0, 1000)))
dates = [((td * x) + stime) for x in random.sample(l, 100)]
for i in range(100):
    press.append(random.uniform(14,95.5))
    temp.append(random.uniform(-15,45))
    vel.append(random.uniform(50,153))
measurements = {
    'date' : dates,
    'pressure' : press,
    'velocity' : vel,
    'temperature': temp
}
df = pd.DataFrame(measurements)
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
df = df.sort_index()
df2 = pd.DataFrame()
# if temp increased from previous row, set flag
df2['temp_inc'] = df['temperature'] - df.shift(1)['temperature'] > 0
df2['temp_inc'] = df2['temp_inc'].replace({True: 1, False: 0})
# need to fetch velocity where pressure has increased from previous row, else 0
press_up_mask = df.where( (df['pressure'] - df.shift(1)['pressure']) > 0)
#df2['press_spike_velocity'] = df[press_up_mask]['velocity']
# Need to perform calc based on 'temp_inc' column: if 'temp_inc' column is 1: calculate pressure * velocity, else 0
temp_inc_mask = df2['temp_inc'] == 1
df2['boyle_fact'] = df[temp_inc_mask]['pressure'] * df[temp_inc_mask]['velocity']
# Get some stats
df2['short_max_temp'] = df['temperature'].rolling(3).max()
df2['long_min_pressure'] = df['pressure'].rolling(30).min()
print(df.head())
print(df2.head())
How do I correctly calculate columns 'press_spike_velocity' and 'boyle_fact' ?
Starting from the computations:
# if temp increased from previous row, set flag
df2['temp_inc'] = df['temperature'] - df.shift(1)['temperature'] > 0
# setting int type instead of replace
df2['temp_inc'] = df2['temp_inc'].astype(int)
# need to fetch velocity where pressure has increased from previous row, else 0
# (a plain comparison gives a boolean Series; df.where() would return a DataFrame, not a mask)
press_up_mask = (df['pressure'] - df['pressure'].shift(1)) > 0
# set column to velocity then mask in zeros via assignment
df2['press_spike_velocity'] = df['velocity'].copy()
df2.loc[~press_up_mask, 'press_spike_velocity'] = 0
# Need to perform calc based on 'temp_inc' column: if 'temp_inc' column is 1: calculate pressure * velocity, else 0
temp_inc_mask = df2['temp_inc'] == 1
# same masking approach as above
df2['boyle_fact'] = df['pressure'] * df['velocity']
df2.loc[~temp_inc_mask, 'boyle_fact'] = 0
This is the simplest way to solve your problem with minimal changes to the code itself. If you dig into pandas more you could probably find methods to do this in 1-2 fewer lines via inplace operations but I don't know how much performance or readability you would gain from that.
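As one possible shorter form, `Series.where` keeps values where a condition holds and substitutes a fallback elsewhere, so each column can be built in a single statement. This is a sketch on a small made-up frame, not the random data from the question.

```python
import pandas as pd

# Small illustrative frame (values are made up)
df = pd.DataFrame({
    "pressure": [10.0, 12.0, 11.0, 15.0],
    "velocity": [100.0, 110.0, 120.0, 130.0],
    "temperature": [20.0, 19.0, 21.0, 22.0],
})

# diff() is NaN on the first row, so the comparison is False there
press_up = df["pressure"].diff() > 0
temp_up = df["temperature"].diff() > 0

df2 = pd.DataFrame(index=df.index)
# Keep velocity where pressure rose, otherwise 0
df2["press_spike_velocity"] = df["velocity"].where(press_up, 0)
# Keep pressure * velocity where temperature rose, otherwise 0
df2["boyle_fact"] = (df["pressure"] * df["velocity"]).where(temp_up, 0)
print(df2)
```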
I'm following a tutorial on web scraping and I'm stuck on one part.
I only get errors when I try to run the following code:
df7['Time2'] = df7['Time'].str.split(':').apply(lambda x: float(x[0]) * 60 + float(x[1]) + float(x[2])/60)
I get the error:
IndexError: list index out of range
Also tried the following:
time_mins = []
for i in time_list:
    h, m, s = i.split(':')
    math = (int(h) * 3600 + int(m) * 60 + int(s))/60
    time_mins.append(math)
Again, it didn't work.
My cell is like:
The result that I want is like:
Any help would be appreciated.
Thanks in advance.
Create Sample Dataframe:
# Import packages
import pandas as pd
# Create sample dataframe
time = ['1:38:17','1:38:31','1:38:32']
gender = ['M','F','M']
data = pd.DataFrame({
    'Time':time,
    'Gender':gender
})
data
Out[]:
      Time Gender
0  1:38:17      M
1  1:38:31      F
2  1:38:32      M
Convert column into timedelta format:
# Time conversion
data['Time'] = pd.to_timedelta(data['Time'])
# Time in days
data = data.assign(Time_in_days = [x.days for x in data['Time']])
# Time in hour
data = data.assign(Time_in_hour = [(x.seconds)/(60.0*60.0) for x in data['Time']] )
# Time in minutes
data = data.assign(Time_in_minutes = [(x.seconds)/60.0 for x in data['Time']])
# Time in seconds
data = data.assign(Time_in_seconds = [x.seconds * 1.0 for x in data['Time']] )
print(data)
      Time Gender  Time_in_days  Time_in_hour  Time_in_minutes  Time_in_seconds
0 01:38:17      M             0      1.638056        98.283333           5897.0
1 01:38:31      F             0      1.641944        98.516667           5911.0
2 01:38:32      M             0      1.642222        98.533333           5912.0
data['Time2'] = data['Time'].apply(lambda x: sum([a*b for a,b in zip(list(map(int,x.split(':')))[::-1],[1/60,1,60])]))
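The line above assumes that data['Time'] holds strings such as '1:38:17'. If the column holds something else, x.split will fail; converting it first with data['Time'].astype(str) is one option, provided the resulting strings still have the h:m:s form.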
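The IndexError in the question suggests that some cells do not contain three colon-separated parts, although that is only a guess without seeing the data. A sketch that avoids splitting altogether is to parse the column with `pd.to_timedelta` and `errors="coerce"`, which turns unparseable cells into NaT instead of raising. The sample values below, including the deliberately bad ones, are made up.

```python
import pandas as pd

# Illustrative data with one malformed and one missing entry
data = pd.DataFrame({"Time": ["1:38:17", "1:38:31", "bad", None]})

# Unparseable or missing cells become NaT instead of raising
td = pd.to_timedelta(data["Time"], errors="coerce")

# Total duration in minutes; NaT rows become NaN
data["Time2"] = td.dt.total_seconds() / 60
print(data)
```

The NaN rows can then be inspected with `data[data["Time2"].isna()]` to find the cells that caused the original error.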
I want to create a function that reads a series of time values from a file and returns exactly 200 days of data at a time, letting me move through the entire data length (say 10000 days) like a rolling window. The problem is that the sampling has gaps.
I am not sure how to code it. Can I add a statement that accumulates the difference between consecutive values of the time variable (x axis) until it reaches exactly 200 days?
Or can I somehow write a function that finds a starting value, say t0, and then finds the element of the array that is closest to t0 + 200 days (the interval)?
What I have so far is:
f = open(reading the file from directory)
lines = f.readlines()
print(len(lines))
tx = np.array([]) # times
y= np.array([])
interval = 200 # days
for li in lines:
    col = li.split()
    t0 = np.array([])
    t1 = np.array([])
    tx = np.append(tx, float(col[0]))
    y = np.append(y, float(col[1]))
t0 = np.append(t0, np.max(tx))
t1 = np.append(t1, tx[np.argmin(tx)])
print(t0,t1)
days = [t1 + dt.timedelta(days = float(x)) for x in days]
#y = np.random.randn(len(days))
# use pandas for convenient rolling function:
df = pd.DataFrame({"day":tx, "value": y}).set_index("day")
def closest_value(s):
    if s.shape[0]<2:
        return np.nan
    X = np.empty((s.shape[0]-1, 2))
    X[:, 0] = s[:-1]
    X[:, 1] = np.fabs(s[:-1]-s[-1])
    min_diff = np.min(X[:, 1])
    return X[X[:, 1]==min_diff, 0][0]
df['closest_value'] = df.rolling(window=dt.timedelta(days=200))['value'].apply(closest_value, raw=True)
print(df.tail(5))
Output error:
TypeError: float() argument must be a string or a number, not 'datetime.datetime'
Additionally,
First 10 tx and ty values respectively:
0 0.003372722575018
0.015239999629557 0.003366515509113
0.045829999726266 0.003385171061055
0.075369999743998 0.003385171061055
0.993219999596477 0.003366515509113
1.022699999623 0.003378941085299
1.05217999964952 0.003369617612836
1.08166999975219 0.003397665493594
3.0025899996981 0.003378941085299
3.04120999993756 0.003394537568711
import numpy as np
import pandas as pd
import datetime as dt
# load data in days and y arrays
# ... or generate them:
N = 1000 # number of days
day_min = dt.datetime.strptime('2000-01-01', '%Y-%m-%d')
day_max = 2000
days = np.sort(np.unique(np.random.uniform(low=0, high=day_max, size=N).astype(int)))
days = [day_min + dt.timedelta(days = int(x)) for x in days]
y = np.random.randn(len(days))
# use pandas for convenient rolling function:
df = pd.DataFrame({"day":days, "value": y}).set_index("day")
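def closest_value(s):
    # Need at least one earlier value in the window to compare against
    if s.shape[0]<2:
        return np.nan
    X = np.empty((s.shape[0]-1, 2))
    X[:, 0] = s[:-1]                   # earlier values in the window
    X[:, 1] = np.fabs(s[:-1]-s[-1])    # distance to the latest value
    min_diff = np.min(X[:, 1])
    return X[X[:, 1]==min_diff, 0][0]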
df['closest_value'] = df.rolling(window=dt.timedelta(days=200))['value'].apply(closest_value, raw=True)
print(df.tail(5))
Output:
value closest_value
day
2005-06-15 1.668638 1.591505
2005-06-16 0.316645 0.304382
2005-06-17 0.458580 0.445592
2005-06-18 -0.846174 -0.847854
2005-06-22 -0.151687 -0.166404
You could use pandas, set a datetime range and create a while loop to process the data in batches.
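import pandas as pd
from datetime import timedelta
# Load data into pandas dataframe
df = pd.read_csv(filepath)
# Name columns
df.columns = ['dates', 'num_value']
# Convert strings to datetime
df['dates'] = pd.to_datetime(df['dates'], format='%d/%m/%Y')
# Sort by date so that each window is a contiguous block of rows
df = df.sort_values('dates').reset_index(drop=True)
# Print dates within a 200 day interval and move on to the next interval
i = 0
while i < len(df):
    start = df['dates'][i]
    end = start + timedelta(days=200)
    window = df['dates'][(df['dates'] >= start) & (df['dates'] < end)]
    print(window)
    # Advance by the rows just printed, not by a fixed 200 rows:
    # with gaps in the sampling, 200 days is not 200 rows
    i += len(window)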
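If the file is whitespace-separated and has a header row, like the sample below, skip that row with skiprows=1 and pass header=None so that the first data row is not consumed as column names. If the file has no header row, omit skiprows:
dates num_value
2004-7-1 1
2004-7-2 5
2004-7-4 8
2004-7-5 11
2004-7-6 17
df = pd.read_table(filepath, sep=r"\s+", skiprows=1, header=None)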
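For time values stored as plain floats in days, as in the question's sample, the end of each 200-day window can also be located with `np.searchsorted` on the sorted time array. This is a minimal sketch under that assumption; the helper name `window_bounds` and the sample array are made up.

```python
import numpy as np

def window_bounds(t, start_idx, interval=200.0):
    """Return (start_idx, end_idx) so that t[start_idx:end_idx] covers
    the half-open range [t0, t0 + interval). t must be sorted ascending."""
    t0 = t[start_idx]
    # Index of the first sample at or beyond t0 + interval
    end_idx = int(np.searchsorted(t, t0 + interval, side="left"))
    return start_idx, end_idx

# Illustrative sample times in days, with irregular gaps
t = np.array([0.0, 0.5, 1.0, 3.0, 150.0, 199.9, 200.0, 250.0, 401.0])

windows = []
i = 0
while i < len(t):
    lo, hi = window_bounds(t, i)
    windows.append((lo, hi))
    i = hi  # the next window starts at the first sample outside this one

print(windows)  # [(0, 6), (6, 8), (8, 9)]
```

Each window always contains at least its own starting sample, so the loop advances even across long gaps.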