I have written the code given below. There are two pandas DataFrames: df contains the columns timestamp_milli and pressure, and df2 contains the columns timestamp_milli and acceleration_z. Both DataFrames have around 100'000 rows. In the code shown below, for each timestamp in df I search df2 for the rows whose time difference to that timestamp lies within a given range and is minimal.
Unfortunately the code is extremely slow. Moreover, I'm getting the following message originating from the line df_temp["timestamp_milli"] = df_temp["timestamp_milli"] - row["timestamp_milli"]:
SettingWithCopyWarning: A value is trying to be set on a copy of a
slice from a DataFrame. Try using .loc[row_indexer,col_indexer] =
value instead
How can I speed up the code and get rid of the warning?
acceleration = []
pressure = []

for index, row in df.iterrows():
    mask = (df2["timestamp_milli"] >= (row["timestamp_milli"] - 5)) & (df2["timestamp_milli"] <= (row["timestamp_milli"] + 5))
    df_temp = df2[mask]

    # Select closest point
    if len(df_temp) > 0:
        df_temp["timestamp_milli"] = df_temp["timestamp_milli"] - row["timestamp_milli"]
        df_temp["timestamp_milli"] = df_temp["timestamp_milli"].abs()
        df_temp = df_temp.loc[df_temp["timestamp_milli"] == df_temp["timestamp_milli"].min()]

        for index2, row2 in df_temp.iterrows():
            pressure.append(row["pressure"])
            acc = row2["acceleration_z"]
            acceleration.append(acc)
I have faced a similar problem; using itertuples instead of iterrows gave a significant reduction in run time, because iterrows builds a full Series object for every row while itertuples returns lightweight namedtuples. A sketch of that rewrite follows below.
Hope this helps.
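A minimal sketch of that rewrite, assuming the column names from the question (timestamp_milli, pressure, acceleration_z) and integer millisecond timestamps; the explicit .copy() also silences the SettingWithCopyWarning:

acceleration = []
pressure = []
for row in df.itertuples(index=False):
    mask = (df2["timestamp_milli"] >= row.timestamp_milli - 5) & \
           (df2["timestamp_milli"] <= row.timestamp_milli + 5)
    df_temp = df2[mask].copy()  # copy, so we no longer modify a slice of df2
    if len(df_temp) > 0:
        # Pick the df2 row whose timestamp is closest to the current df timestamp
        diff = (df_temp["timestamp_milli"] - row.timestamp_milli).abs()
        closest = df_temp.loc[diff.idxmin()]
        pressure.append(row.pressure)
        acceleration.append(closest["acceleration_z"])

If you can go beyond itertuples, pandas also offers merge_asof, which performs a nearest-match join within a tolerance in a single vectorized call (both frames must be sorted on the key). Note the slightly different semantics: rows of df with no df2 timestamp within ±5 ms are kept with NaN instead of being skipped.

merged = pd.merge_asof(df.sort_values("timestamp_milli"),
                       df2.sort_values("timestamp_milli"),
                       on="timestamp_milli",
                       direction="nearest",
                       tolerance=5)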
While processing some medical training data to train a classifier for different medical tests, I got the SettingWithCopyWarning from pandas. I have already read about it and understand that it comes from chained indexing of a DataFrame, but I can't find the point in my code below where I use chained indexing.
# Turn the 12 measurement result rows of each patient into one single row for each patient.
# The different test results are named: result, result_1, ..., result_11 (for each result).
# The pid and age column is kept only once while all other column fields of the 12 measurement rows are concatenated
# into one single row; the time field also exists 12 times per row.
imputed_features.sort_values(by=['pid', 'Time'], inplace=True)
sorted_features = train_features.sort_values(by=['pid', 'Time'])

measurements = []
columns = []
for i in range(12):
    measurements.append(imputed_features.groupby(['pid'], as_index=False).nth(i))
    measurements[i].reset_index(drop=True, inplace=True)
    if i == 0:
        columns = [i for i in measurements[i].columns]
    else:
        measurements[i].drop(['pid', 'Age'], axis=1, inplace=True)
        for j in measurements[i].columns:
            columns.append(f'{j}_{i}')

# The resulting aggregated_features DataFrame
aggregated_features = pd.concat(measurements[0:12], axis=1, ignore_index=True)
aggregated_features.columns = columns
aggregated_features.to_csv('aggregated_features.csv', index=False)
I think it's because you're working on row slices taken out of the original DataFrame. Try passing the list directly:
pd.concat(measurements, ....)
If you still get a warning, maybe copying the 'measurements' before merging will improve it:
measures = [m.copy() for m in measurements]
pd.concat(measures[0:12], ...)
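If the warning still appears, it usually comes from the in-place reset_index/drop calls on the slices returned by nth(i); taking an explicit copy when each slice is created is the standard way to silence it. A minimal sketch of that idea, reusing the question's names (only the .copy() is new; the column bookkeeping stays unchanged):

measurements = []
for i in range(12):
    # .copy() makes each slice an independent DataFrame, so the in-place
    # reset_index/drop below no longer operate on a view of imputed_features
    m = imputed_features.groupby(['pid'], as_index=False).nth(i).copy()
    m.reset_index(drop=True, inplace=True)
    if i > 0:
        m.drop(['pid', 'Age'], axis=1, inplace=True)
    measurements.append(m)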
I have been running the code for about 45 minutes now and it is still going. Can someone please suggest how I can make it faster?
df4 is a pandas DataFrame. df4.head() looks like this:
df4 = pd.DataFrame({
    'hashtag': np.random.randn(3000000),
    'sentiment_score': np.random.choice([0, 1], 3000000),
    'user_id': np.random.choice(['11', '12', '13'], 3000000),
})
What I am aiming to have is a new column called rating.
len(df4.index) is 3,037,321.
ratings = []
for index in df4.index:
    rowUserID = df4['user_id'][index]
    rowTrackID = df4['track_id'][index]
    rowSentimentScore = df4['sentiment_score'][index]

    condition = ((df4['user_id'] == rowUserID) & (df4['sentiment_score'] == rowSentimentScore))
    allRows = df4[condition]
    totalSongListendForContext = len(allRows.index)

    rows = df4[(condition & (df4['track_id'] == rowTrackID))]
    songListendForContext = len(rows.index)

    rating = songListendForContext / totalSongListendForContext
    ratings.append(rating)
Globally, you'll need groupby. You can either:
Use two groupby calls with transform to get the size of what you called condition and the size of condition & (df4['track_id'] == rowTrackID), then divide the second by the first:
df4['ratings'] = (df4.groupby(['user_id', 'sentiment_score', 'track_id'])['track_id'].transform('size')
                  / df4.groupby(['user_id', 'sentiment_score'])['track_id'].transform('size'))
Or use groupby with value_counts with the parameter normalize=True and merge the result with df4:
df4 = df4.merge(df4.groupby(['user_id', 'sentiment_score'])['track_id']
                   .value_counts(normalize=True)
                   .rename('ratings').reset_index(),
                how='left')
In both cases, you will get the same result as your list ratings (which I assume you wanted as a column). I would say the second option is faster, but it depends on the number of groups you have in your real data.
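A small self-contained check of the two formulations above on a toy frame (the values here are made up purely for illustration):

import pandas as pd

toy = pd.DataFrame({
    'user_id':         ['11', '11', '11', '12'],
    'sentiment_score': [0, 0, 0, 1],
    'track_id':        ['a', 'a', 'b', 'a'],
})

# Option 1: two transform('size') calls, divided element-wise
opt1 = (toy.groupby(['user_id', 'sentiment_score', 'track_id'])['track_id'].transform('size')
        / toy.groupby(['user_id', 'sentiment_score'])['track_id'].transform('size'))

# Option 2: normalized value_counts merged back onto the frame
opt2 = toy.merge(toy.groupby(['user_id', 'sentiment_score'])['track_id']
                    .value_counts(normalize=True)
                    .rename('ratings').reset_index(),
                 how='left')['ratings']

print(opt1.tolist())  # both lists are [2/3, 2/3, 1/3, 1.0]
print(opt2.tolist())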
I have a pandas DataFrame of 62 rows x 10 columns. Each row contains numbers, and if any of the numbers are within a certain range, a string should be written to the last column.
I have tried, without success, to use the .apply method with a function to make the assessment. I have also tried to import the data as a Series, but then the .apply method causes problems because it is a list.
df = pd.read_csv(results)
For example, in the image attached, if any value from Base 2019 to FY26 Load is between 0.95 and 1.05, then return 'Acceptable' in the last column; otherwise return 'Not Acceptable'.
Any help, even a start would be much appreciated.
This should perform as expected:
results = "input.csv"
df = pd.read_csv(results)
low = 0.95
high = 1.05
# The columns to check
cols = df.columns[2:]
df['Acceptable?'] = ((df[cols] > low) & (df[cols] < high)).any(axis=1)
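If the string labels asked for in the question are wanted instead of a boolean, the same mask can be mapped with np.where; this sketch reuses the low/high/cols names from the answer above:

import numpy as np

in_range = ((df[cols] > low) & (df[cols] < high)).any(axis=1)
df['Acceptable?'] = np.where(in_range, 'Acceptable', 'Not Acceptable')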
FOUND SOLUTION: I needed to change the datatype of the columns in the DataFrame. This:
for p in periods:
    df['Probability{}'.format(p)] = 0
had to become:
for p in periods:
    df['Probability{}'.format(p)] = float(0)
Alternatively, do as in the approved answer below.
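For anyone wondering why the first form loses the decimals: initializing a column with 0 gives it an integer dtype, while float(0) (or np.nan) gives a float dtype. A tiny illustration with made-up column names:

import pandas as pd

demo = pd.DataFrame({'Close': [1.0, -2.0, 3.0]})

demo['ProbInt'] = 0        # creates an int64 column
demo['ProbFloat'] = 0.0    # float(0) / 0.0 creates a float64 column
print(demo.dtypes)
# Close        float64
# ProbInt        int64
# ProbFloat    float64

Exactly what happens when a float is later written into the int64 column (silent truncation, upcasting, or a warning) depends on the pandas version, which is why creating the columns as float in the first place avoids the problem altogether.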
I am assigning new values to cells as floats, but they are stored as integers and I don't get why.
It is a part of a data mining project, which contains nested loops.
I am using Python 3.
I tried different ways of writing into a cell with pandas:
df.at[index, col] = float(val),
df.set_value(index, col, float(val)), and
df[col][index] = float(val), but none of them solved the problem. The output I got was:
In: print(df[col][index])
Out: 0
In: print(val)
Out: 0.4774410939826658
Here is a simplified version of the loop:
periods = [7, 30, 90, 180]

for p in periods:
    df['Probability{}'.format(p)] = 0

for i in range(len(df.index)):
    for p in periods:
        if i >= p - 1:
            # Getting relevant data and computing value
            vals = [df['Close'][j] for j in range(i - p, i)]
            probability = len([j for j in vals if j > 0]) / len(vals)
            # Assigning value to cell in pd.DataFrame
            df.at[df.index[i], 'Probability{}'.format(p)] = float(probability)
I don't get why the pandas DataFrame changes the floats to integers and rounds them up or down. When I assigned values to cells directly in the console, I did not experience any problems.
Are there any workarounds or solutions to this problem?
I had no problems before nesting a for loop over periods to avoid hard-coding a lot of trivial code.
NB: It also seems that if I scale the value, e.g. new_val = 100 * val, only the rounded number gets scaled. So 100 * val gives new_val = 0, because the value has already been rounded down to 0.
I also tried to change the datatype of the DataFrame:
df = df.apply(pd.to_numeric)
All the best.
Seems like a problem with incorrect data types in your DataFrame. Your last attempt at converting the whole df was probably very close. Try using:
df['Close'] = pd.to_numeric(df['Close'], downcast="float")
I am quite new to pandas and I have a pandas DataFrame of about 500,000 rows filled with numbers. I am using Python 2.x and am currently defining and calling the method shown below on it. It sets a predicted value equal to the corresponding value in series 'B' if two adjacent values in series 'A' are the same. However, it is running extremely slowly: about 5 rows are processed per second, and I want to find a way to accomplish the same result more quickly.
def myModel(df):
    A_series = df['A']
    B_series = df['B']
    seriesLength = A_series.size

    # Make a new empty column in the dataframe to hold the predicted values
    df['predicted_series'] = np.nan
    # Make a new empty column to store whether or not
    # the prediction matches B
    df['wrong_prediction'] = np.nan

    prev_B = B_series[0]
    for x in range(1, seriesLength):
        prev_A = A_series[x-1]
        prev_B = B_series[x-1]
        # Set the predicted value to equal B if A has two equal values in a row
        if A_series[x] == prev_A:
            if df['predicted_series'][x] > 0:
                df['predicted_series'][x] = df['predicted_series'][x-1]
            else:
                df['predicted_series'][x] = B_series[x-1]
Is there a way to vectorize this or to just make it run faster? Under the current circumstances, it is projected to take many hours. Should it really be taking this long? It doesn't seem like 500,000 rows should give my program that much trouble.
Something like this should work as you described:
df['predicted_series'] = np.where(A_series.shift() == A_series, B_series, df['predicted_series'])
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
This will get rid of the for loop and set predicted_series to the value of B when A is equal to previous A.
Edit:
Per your comment, change your initialization of predicted_series to be all NaN and then forward fill the values:
df['predicted_series'] = np.nan
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
df.predicted_series = df.predicted_series.fillna(method='ffill')
For the fastest speed, modifying ayhan's answer a bit will perform best:
df['predicted_series'] = np.where(df.A.shift() == df.A, df.B, df['predicted_series'].shift())
That will give you your forward-filled values and run faster than my original recommendation.
Solution
df.loc[df.A == df.A.shift(), 'predicted_series'] = df.B.shift()
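A quick sanity check of the vectorized fill on a toy frame (values invented for the demo), including the forward fill suggested in the edited answer above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3],
                   'B': [10, 20, 30, 40, 50]})

df['predicted_series'] = np.nan
df.loc[df.A == df.A.shift(), 'predicted_series'] = df.B.shift()
df['predicted_series'] = df['predicted_series'].ffill()
print(df['predicted_series'].tolist())
# [nan, 10.0, 10.0, 30.0, 30.0]
# Rows where A repeats the previous value get the previous row's B;
# the remaining rows are forward filled from the last prediction.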