Slicing in Pandas Dataframes and comparing elements - python

Morning. Recently I have been trying to use pandas to build large data tables for machine learning (I'm trying to move away from numpy as best I can).
However, I'm running into some issues, namely with slicing pandas dataframes.
Specifically, I'd like to return the rows I specify, and reference and compare particular elements with those in other arrays. Here's a small amount of code I've implemented and some outline:
import pandas as pd
import csv
import math
import random as nd
import numpy

# create the pandas dataframe from my csv; the csv is entirely numerical data,
# with the exception of the first row, which holds the column labels
df = pd.read_csv(r"C:\Users\Python\Downloads\Data for Brent - Secondattampatatdfrandomsample.csv")

# use pandas functionality to return a random sample of the data
# (a 10-row subset of the frame)
df_sample = df.sample(n=10)
It's at this point that I want to compare the first element of each row vector to the original data. Specifically, the first element in any row contains an id number.
Where the elements of the original dataframe and the sample frame match up, I'd like to compute 3- and 6-month averages of the associated column elements for that id number.
To be clear, I'm comfortable moving back to numpy and away from pandas, but I hear a ton of good things about the model-training workflows built around pandas (my training is on the mathematics side of things and less so program development). Thanks for the input!
Edit: here is sample input for the first rows of the dataframe (columns: id, year, month, x, y, z):
id year month x y z
0 2 2016 2 1130 343.627538 163660.060200
1 2 2016 4 859 913.314513 360633.159400
2 2 2016 5 931 858.548056 93608.190030
3 2 2016 6 489 548.314860 39925.669950
4 2 2016 7 537 684.441725 80270.240060
5 2 2016 8 618 673.887072 124041.560000
6 2 2016 9 1030 644.749493 88975.429980
7 2 2016 10 1001 543.312870 54874.599830
8 2 2016 11 1194 689.053707 79930.230000
9 2 2016 12 673 483.644736 27567.749940
10 2 2017 1 912 657.716386 54590.460070
11 2 2017 2 671 682.007537 52514.580380
Here is how the sample data is returned (same n-tuple format as before). I used native pandas functions to return a randomly generated subset of 10 row vectors out of almost 9000 entries:
2 2016 1 633 877.9282175 75890.97027
5185 2774 2016 4 184 399.418719 9974.375000
9441 4974 2017 2 239 135.520851 0.000000
5134 2745 2017 2 187 217.220657 7711.333333
8561 4063 2017 1 103 505.714286 18880.000000
3328 2033 2016 11 118 452.152542 7622.000000
3503 2157 2016 3 287 446.668831 8092.588235
5228 2791 2016 2 243 400.166008 12655.250000
9380 4708 2017 2 210 402.690583 5282.352941
1631 1178 2016 10 56 563.716667 16911.500000
2700 1766 2016 1 97 486.764151 6449.625000

I'd like to identify the appropriate positions in the sample array, search for identical elements in the original array, and compute averages (and eventually more rigorous statistical models) of their associated numerical data.
# restrict the original data to the ids that appear in the sample, then average
sample_ids = df_sample['id'].unique()
df[df['id'].isin(sample_ids)].groupby('id').mean()[['x', 'y', 'z']].reset_index()
I'm not sure if this is exactly what you want, but I'll walk through it to see if it gives you ideas. I take the unique ids in the sample, keep only the rows of the original dataframe with those ids, and group them by id (all rows with id == 2 are smushed together). Taking the mean of the resulting pandas GroupBy object averages the smushed-together rows for every column not named in the groupby call. Since this averages your month and year as well, and all I think you care about is x, y, and z, I selected those columns and then, for aesthetic purposes, reset the index.
Alternatively, if you wanted the average for that id for each year in the original df, you could do
df.groupby(['id', 'year']).mean()[['x', 'y', 'z']].reset_index()
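The question also asks about 3- and 6-month averages for the matching ids. Here is a minimal sketch of one way to get them, assuming each id has at most one row per (year, month), that a trailing rolling mean over the last 3 or 6 rows is what's wanted, and that the new column names (x_3mo_avg, x_6mo_avg) are purely illustrative:
# sort chronologically within each id, then take trailing rolling means of x
df = df.sort_values(['id', 'year', 'month'])
df['x_3mo_avg'] = df.groupby('id')['x'].transform(lambda s: s.rolling(3, min_periods=1).mean())
df['x_6mo_avg'] = df.groupby('id')['x'].transform(lambda s: s.rolling(6, min_periods=1).mean())

# keep only the rows whose id appears in the random sample
sample_ids = df_sample['id'].unique()
df[df['id'].isin(sample_ids)][['id', 'year', 'month', 'x', 'x_3mo_avg', 'x_6mo_avg']]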

Related

Incorrect output with np.random.choice

I am trying to randomly select records from a 17-million-row dataframe using np.random.choice, since it runs faster than other methods, but I am getting incorrect values in the output for some records. Example below:
import numpy as np
import pandas as pd

data = {
    "calories": [420, 380, 390, 500, 200, 100],
    "duration": [50, 40, 45, 600, 450, 210],
    "id": [1, 1, 2, 3, 2, 3],
    "yr": [2003, 2003, 2009, 2003, 2012, 2003],
    "mth": [3, 6, 9, 12, 3, 6],
}
df = pd.DataFrame(data)
df2 = df.groupby(['id', 'yr'], as_index=False).agg(np.random.choice)
Output:
id yr calories duration mth
1 2003 420 50 6
2 2009 390 45 9
2 2012 200 450 3
3 2003 500 210 6
The problem in the output is for id 3: alongside calories 500, duration and mth should be 600 and 12 instead of 210 and 6. Can anyone please explain why it is choosing values from a different row?
Expected output:
Values from the same row should be retained after the random selection.
This doesn't work because pandas applies the aggregate to each column independently. Try putting a print statement in, e.g.:
def fn(x):
    print(x)
    return np.random.choice(x)

df.groupby(['id', 'yr'], as_index=False).agg(fn)
would let you see when the function was called and what it was called with.
I'm not an expert in Pandas, but using GroupBy.apply seems to be the easiest way I've found of keeping rows together.
Something like the following:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "calories": [420, 380, 390, 500, 200, 100],
    "duration": [50, 40, 45, 600, 450, 210],
    "id": [1, 1, 2, 3, 2, 3],
    "yr": [2003, 2003, 2009, 2003, 2012, 2003],
    "mth": [3, 6, 9, 12, 3, 6],
})
df.groupby(['id', 'yr'], as_index=False).apply(lambda x: x.sample(1))
produces:
calories duration id yr mth
0 1 380 40 1 2003 6
1 2 390 45 2 2009 9
2 4 200 450 2 2012 3
3 5 100 210 3 2003 6
The two numbers at the beginning of each row are there because you end up with a MultiIndex. If you want to know where the rows were selected from, this contains useful information; otherwise you can discard the index.
Note that there are warnings in the docs that this approach might not be very performant, but I don't know the details.
Update: I've just had more of a read of the docs, and noticed that there's a GroupBy.sample method, so you could instead just do:
df.groupby(['id', 'yr']).sample(1)
which would presumably be performant as well as being much shorter!
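If you specifically want to keep np.random.choice in the mix (the question mentions using it for speed), a hedged sketch of another way to keep whole rows together is to choose one random index label per group and then select those rows; this is just a sketch, not a benchmarked recommendation:
import numpy as np

# pick one random index label per (id, yr) group, then select those whole rows
chosen = df.groupby(['id', 'yr']).apply(lambda g: np.random.choice(g.index))
df.loc[chosen].reset_index(drop=True)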

Python Pandas: Best way to find local maximums in large DF

I have a large dataframe that consists of many cycles; each cycle has 2 maximum peak values inside it that I need to capture into another dataframe.
I have created a sample data frame that mimics the data I am seeing:
import pandas as pd
data = {'Cycle':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3], 'Pressure':[100,110,140,180,185,160,120,110,189,183,103,115,140,180,200,162,125,110,196,183,100,110,140,180,185,160,120,180,201,190]}
df = pd.DataFrame(data)
As you can see, in each cycle there are two maxima, but the part I was having trouble with is that the 2nd peak is usually higher than the first peak, so there can be rows of numbers technically higher than the other peak's max in the cycle. The results should look something like this:
data2 = {'Cycle':[1,1,2,2,3,3], 'Peak Maxs': [185,189,200,196,185,201]}
df2= pd.DataFrame(data2)
I have tried a couple of methods, including .nlargest(2) per cycle, but the problem is that since one of the peaks is usually higher, it will pull the 2nd-highest number in the data, which isn't necessarily the other peak.
This graph shows the peak pressures from each cycle that I would like to be able to find.
Thanks for any help.
Using argrelextrema from scipy:
import numpy as np
from scipy.signal import argrelextrema

out = df.groupby('Cycle')['Pressure'].apply(lambda x: x.iloc[argrelextrema(x.values, np.greater)])
Out[124]:
Cycle
1 4 185
8 189
2 14 200
18 196
3 24 185
28 201
Name: Pressure, dtype: int64
out = out.sort_values().groupby(level=0).tail(2).sort_index()
out
Out[138]:
Cycle
1 4 185
8 189
2 14 200
18 196
3 24 185
28 201
Name: Pressure, dtype: int64
Use groupby().shift() to get the neighboring values, then compare:
g = df.groupby('Cycle')
local_maxes = (df['Pressure'].gt(g['Pressure'].shift())      # greater than previous row
               & df['Pressure'].gt(g['Pressure'].shift(-1))  # greater than next row
              )
df[local_maxes]
Output:
Cycle Pressure
4 1 185
8 1 189
14 2 200
18 2 196
24 3 185
28 3 201
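If a noisy cycle ever produces more than two flagged local maxima with this mask, one hedged follow-up (reusing local_maxes and the column names above) would be to keep only the two largest flagged pressures per cycle:
# keep at most the two largest flagged local maxima in each cycle
df[local_maxes].groupby('Cycle')['Pressure'].nlargest(2)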

Calculate column values in pandas based on previous rows of data in another column

Let's say I have a table with two columns: Date and Amount. The number of rows is not more than 3000.
Row Date Amount
1 15/05/2021 248
2 16/05/2021 115
3 17/05/2021 387
4 18/05/2021 214
5 19/05/2021 678
6 20/05/2021 489
7 21/05/2021 875
8 22/05/2021 123
................
I need to add a third column which will calculate the trim mean values based on the Amount column.
I will be using this function: my_table['TrimMean'] = stats.trim_mean(my_table['Amount'], 0.1), but adapted for my problem.
The problem is that this is not a fixed range but a dynamic one, following this logic: for each row in my table, the trimmed mean value should be calculated from the previous 90 values of the Amount column, starting from the row above the current row. If there are fewer than 90 values, then calculate it with however many rows are available.
e.g. TrimMean[1000] = stats.trim_mean(values of Amount from rows 910 to 999), and TrimMean[12] = stats.trim_mean(values of Amount from rows 1 to 11).
Hope that makes sense.
Is there any way I can calculate this in a simple way, without going through row by row iteration?
We can calculate the trimmed mean by applying trim_mean over a rolling window of size 90 with min_periods=1, then shifting the result down one row so that each value is computed only from the rows above it:
from scipy.stats import trim_mean
df['Amount'].rolling(90, min_periods=1).apply(trim_mean, args=(0.1, )).shift()
0 NaN
1 248.000000
2 181.500000
3 250.000000
4 241.000000
5 328.400000
6 355.166667
7 429.428571
Name: Amount, dtype: float64
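To tie this back to the column name used in the question, a minimal sketch that stores the result, assuming my_table is the dataframe holding the Amount column:
from scipy.stats import trim_mean

# trailing trimmed mean of the previous (up to) 90 rows, excluding the current row
my_table['TrimMean'] = (
    my_table['Amount']
    .rolling(90, min_periods=1)
    .apply(trim_mean, args=(0.1,))
    .shift()
)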

Python: How to find the number of items in each point on scatterplot and produce list?

Right now I have a dataset of 1206 participants who have each endorsed a certain number of traumatic experiences and a number of symptoms associated with the trauma.
This is part of my dataframe (full dataframe is 1206 rows long):
SubjectID  PTSD_Symptom_Sum  PTSD_Trauma_Sum
1223       3                 5
1224       4                 2
1225       2                 6
1226       0                 3
I have two issues that I am trying to figure out:
I was able to create a scatter plot, but I can't tell from this plot how many participants are in each data point. Is there any easy way to see the number of subjects in each data point?
I used this code to create the scatterplot:
import matplotlib.pyplot as plt

plt.scatter(PTSD['PTSD_Symptom_SUM'], PTSD['PTSD_Trauma_SUM'])
plt.title('Trauma Sum vs. Symptoms')
plt.xlabel('Symptoms')
plt.ylabel('Trauma Sum')
I haven't been able to successfully produce a list of the number of people endorsing each pair of items (symptoms and trauma number). I am able to run this code to create the counts for the number of people in each category:
count_sum= PTSD['PTSD_SUM'].value_counts()
count_symptom_sum= PTSD['PTSD_symptom_SUM'].value_counts()
print(count_sum)
print(count_symptom_sum)
Which produces this output:
0 379
1 371
2 248
3 130
4 47
5 17
6 11
8 2
7 1
Name: PTSD_SUM, dtype: int64
0 437
1 418
2 247
3 74
4 23
5 4
6 3
Name: PTSD_symptom_SUM, dtype: int64
Is it possible to alter the code to count the number of people endorsing each pair of items (symptom number and trauma number)? If not, are there any functions that would allow me to do this?
You could create a new dataset with the counts of each pair 'PTSD_SUM', 'PTSD_Symptom_SUM' with:
counts = PTSD.groupby(by=['PTSD_symptom_SUM', 'PTSD_SUM']).size().to_frame('size').reset_index()
and then use Seaborn like this:
import seaborn as sns
sns.scatterplot(data=counts, x="PTSD_symptom_SUM", y="PTSD_SUM", hue="size", size="size")
This produces a scatter plot in which each point's hue and size reflect how many participants share that pair of values.
If I understood properly, your dataframe is:
SubjectID TraumaSum Symptoms
1 1 5
2 3 4
...
So you just need:
dataset.groupby(by=['PTSD_SUM', 'PTSD_Symptom_SUM']).count()
This line returns the count for each unique (PTSD_SUM, PTSD_Symptom_SUM) pair.
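For reference, a hedged alternative along the same lines, assuming pandas 1.1+ and the column names from the question's own code:
# how many participants endorse each (symptom sum, trauma sum) pair
pair_counts = (
    PTSD.value_counts(subset=['PTSD_symptom_SUM', 'PTSD_SUM'])
        .reset_index(name='count')
)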

Numpy 2D array in Python 3.4

I have this code:
import pandas as pd
data = pd.read_csv("test.csv", sep=",")
The problem is that I can't split the data array by columns, like this:
week = data[:,1]
It should split the second column into the week, but it doesn't do it:
TypeError: unhashable type: 'slice'
How should I do this to make it work?
I am also wondering what this code does exactly (I don't really understand the np.newaxis part):
week = data['1'][:, np.newaxis]
There are a few issues here.
First, read_csv uses a comma as a separator by default, so you don't need to specify that.
Second, the pandas csv reader by default uses the first row to get column headings. That doesn't appear to be what you want, so you need to use the header=None argument.
Third, it looks like your first column is the row number. You can use index_col=0 to use that column as the index.
Fourth, for pandas, the first index is the column, not the row. Further, using the standard data[ind] notation is indexing by column name, rather than column number. And you can't use a comma to index both row and column at the same time (you need to use data.loc[row, col] to do that).
So for your case, all you need to do to get the second column is data[2], or, if you use the first column as the row number, the second column becomes column 1, so you would do data[1]. This returns a pandas Series, which is the 1D equivalent of a 2D DataFrame.
So the whole thing should look like this:
import pandas as pd
data = pd.read_csv('test.csv', header=None, index_col=0)
week = data[1]
data looks like this:
1 2 3 4
0
1 10 2 100 12
2 15 5 150 15
3 25 7 240 20
4 22 12 350 20
5 51 13 552 20
6 134 20 880 36
7 150 22 900 38
8 200 29 1020 44
9 212 31 1100 46
10 199 23 1089 45
11 220 32 1145 60
The '0' row doesn't exist; it is just there for informational purposes (it is the name of the index column).
week looks like this:
0
1 10
2 15
3 25
4 22
5 51
6 134
7 150
8 200
9 212
10 199
11 220
Name: 1, dtype: int64
However, you can give columns (and rows) meaningful names in pandas, and then access them by those names. I don't know the column names, so I just made some up:
import pandas as pd
data = pd.read_csv('test.csv', header=None, index_col=0, names=['week', 'spam', 'eggs', 'grail'])
week = data['week']
In this case, data looks like this:
week spam eggs grail
1 10 2 100 12
2 15 5 150 15
3 25 7 240 20
4 33 12 350 20
5 51 13 552 20
6 134 20 880 36
7 150 22 900 38
8 200 29 1020 44
9 212 31 1100 46
10 199 23 1089 45
11 220 32 1145 50
And week looks like this:
1 10
2 15
3 25
4 33
5 51
6 134
7 150
8 200
9 212
10 199
11 220
Name: week, dtype: int64
For np.newaxis, what that does is add one dimension to the array. So if you have a 1D array (a vector), using np.newaxis on it would turn it into a 2D array. It would turn a 2D array into a 3D array, 3D into 4D, and so on. Depending on where you put it (such as [:, np.newaxis] vs. [np.newaxis, :]), you determine which dimension to add. So np.arange(10)[np.newaxis, :] (or just np.arange(10)[np.newaxis]) gives you a shape (1, 10) 2D array, while np.arange(10)[:, np.newaxis] gives you a shape (10, 1) 2D array.
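A quick illustration of those shapes in plain numpy:
import numpy as np

a = np.arange(10)
print(a.shape)                 # (10,)   -- a 1D vector
print(a[np.newaxis, :].shape)  # (1, 10) -- new dimension added in front
print(a[:, np.newaxis].shape)  # (10, 1) -- new dimension added at the end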
In your case, what the line is doing is getting the column with the name 1, which is a 1D pandas Series, then adding a new dimension to it. However, instead of turning it back into a DataFrame, it instead converts it into a 1D numpy array, then adds one dimension to make it a 2D numpy array.
This, however, is dangerous in the long term. There is no guarantee that this sort of silent conversion won't be changed at some point. To convert a pandas object to a numpy one, you should use an explicit conversion with the values attribute, so in your case data.values or data['1'].values.
However, you don't really need a numpy array. A series is fine. If you really want a 2D object, you can convert a Series into a DataFrame by using something like data['1'].to_frame().
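As a hedged illustration of those explicit conversions (using the integer column label from the example above; the variable names are just illustrative):
import numpy as np

week_series = data[1]                    # 1D pandas Series
week_frame = data[1].to_frame()          # 2D DataFrame with a single column
week_2d = data[1].values[:, np.newaxis]  # explicit numpy array of shape (n, 1)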
