Below is my data set.
I want to calculate the average temperature of each station. Ideally I would also like to remove the zero (noise) readings.
How can I do it?
I have no idea how to start.
I think you can use the standard csv reader to iterate over the rows, collect each station's temperatures into a list t, and compute sum(t) / len(t).
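A minimal sketch of that approach (the file name is hypothetical; the ID and temp column names match the sample data shown in the next answer), skipping zero readings as noise:

import csv
from collections import defaultdict

temps = defaultdict(list)
with open('data.csv', newline='') as f:   # hypothetical file name
    for row in csv.DictReader(f):
        t = float(row['temp'])
        if t != 0:                        # skip zero (noise) readings
            temps[row['ID']].append(t)

for station, t in temps.items():
    print(station, sum(t) / len(t))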
Just turn the data into a pandas.DataFrame and use the pandas.DataFrame.groupby method:
import pandas as pd
file = 'your_file.csv'  # path to your CSV file
df = pd.read_csv(file)
df = df.drop(df[df['temp'] == 0].index)
print(df.groupby('ID')[['temp']].mean())
Gives:
temp
ID
1 20.5
2 32.1
3 14.4
Note: the file I used looks like...
ID,stuff,temp
1,3,20
1,6,20.1
1,7,21.4
2,1,30.2
2,3,0
2,2,34
3,7,0
3,6,0
3,2,14.4
If you want to attach those means as a column, you can build a dictionary from the group means and map it onto the ID column (using Series.replace below) to create a new column in the DataFrame:
mean = df.groupby('ID')[['temp']].mean() # Store this into a variable
groups = {}
for i in mean.itertuples(): # Iterate over the (ID, mean) tuples
    groups[i[0]] = i[1]
df['avg_temp'] = df['ID'].replace(groups) # Create a new column
print(df)
Gives:
ID stuff temp avg_temp
0 1 3 20.0 20.5
1 1 6 20.1 20.5
2 1 7 21.4 20.5
3 2 1 30.2 32.1
5 2 2 34.0 32.1
8 3 2 14.4 14.4
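As a side note, a more direct way to get the same column (assuming the same df) is groupby with transform, which skips building the dictionary:

df['avg_temp'] = df.groupby('ID')['temp'].transform('mean')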
I have a large pandas DataFrame and I want to average the first 12 rows, then the next 12 rows, and so on. I wrote a for loop for this task:
df_list = []
for i in range(0, len(df), 12):
    print(i, i + 12)
    df_list.append(df.iloc[i:i+12].mean())
pd.concat(df_list, axis=1).T
Is there an efficient way to do this without a for loop?
You can floor-divide the index by N (12 in your case), group the dataframe by the quotient, and finally call mean on these groups:
# Random dataframe of shape (120, 4)
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randint(10, 100, (120, 4)), columns=list('ABCD'))
>>> df.groupby(df.index//12).mean()
A B C D
0 49.416667 52.583333 63.833333 47.833333
1 60.166667 61.666667 53.750000 34.583333
2 49.916667 54.500000 50.583333 64.750000
3 51.333333 51.333333 56.333333 60.916667
4 51.250000 51.166667 50.750000 50.333333
5 56.333333 50.916667 51.416667 59.750000
6 53.750000 57.000000 45.916667 59.250000
7 48.583333 59.750000 49.250000 50.750000
8 53.750000 48.750000 51.583333 68.000000
9 54.916667 48.916667 57.833333 43.333333
I believe you want to split your dataframe into separate chunks of 12 rows. You can use np.arange inside groupby to take the mean of each chunk:
df.groupby(np.arange(len(df)) // 12).mean()
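To see what the grouping key looks like: if len(df) is not a multiple of 12, the leftover rows simply form a final, smaller group (the same holds for the df.index // 12 approach above). A small illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(30)})      # 30 rows -> chunks of 12, 12 and 6
labels = np.arange(len(df)) // 12        # [0]*12 + [1]*12 + [2]*6
print(df.groupby(labels).size())         # 12, 12, 6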
I have an ASCII file like the following:
7.00000000 5.61921453
18.00000000 9.75818253
13.00000000 37.94074631
18.00000000 29.54162407
10.00000000 18.82115364
13.00000000 15.00485802
16.00000000 19.24893761
20.00000000 22.59035683
17.00000000 59.69598007
17.00000000 34.07574844
18.00000000 24.17820358
13.00000000 24.70093536
11.00000000 23.37569046
14.00000000 34.14352036
13.00000000 33.33922577
16.00000000 36.64311981
20.00000000 60.21446609
20.00000000 33.54150391
18.00000000 40.84828949
21.00000000 40.31245041
34.00000000 91.71004486
40.00000000 93.24317169
42.00000000 43.94712067
12.00000000 32.73310471
7.00000000 25.25534248
9.00000000 23.14623833
I want to calculate (for each column separately) the mean of the first 10 rows, then the next 11 rows, then the next 5 rows, so as to get the following output:
14.9 25.2296802
18 40.2734046
22 43.6649956
How could I do that in Python with pandas? If the groups had a fixed size (e.g. 10 rows each) I would do the following:
df = pd.read_csv(i,sep='\t',header=None)
df_mean=df.groupby(np.arange(len(df))//10).mean()
Use numpy.repeat to craft groups (here a/b/c) with arbitrary lengths:
import numpy as np
means = df.groupby(np.repeat(['a', 'b', 'c'], [10, 11, 5])).mean()
output:
0 1
a 14.9 25.229680
b 18.0 40.273405
c 22.0 43.664996
If you don't care about group names:
groups = [10, 11, 5]
means = df.groupby(np.repeat(np.arange(len(groups)), groups)).mean()
output:
0 1
0 14.9 25.229680
1 18.0 40.273405
2 22.0 43.664996
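One caveat: the chunk lengths must sum to len(df); otherwise np.repeat yields a label array of the wrong length and groupby raises a length-mismatch error. A quick sanity check:

sizes = [10, 11, 5]
assert sum(sizes) == len(df), "chunk sizes must cover every row exactly once"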
Is there an easy way to sum the values of all the rows above the current row into an adjacent column?
Text explanation: I'm trying to create a chart where column B is either the running sum or the percent of total of all the rows in A above it. That way I can quickly see where the quartiles, thirds, etc. fall in the dataframe. I'm familiar with the percentile function
How to calculate 1st and 3rd quartiles?
but I'm not sure I can get it to do exactly what I want to do.
Text Version
1--1%
1--2%
4--6%
4--10%
2--12%
...
and so on to 100 percent.
Do I need to write a for loop to do this?
You can use cumsum for this:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=dict(x=[13,22,34,21,33,41,87,24,41,22,18,12,13]))
df["percent"] = (100*df.x.cumsum()/df.x.sum()).round(1)
output:
x percent
0 13 3.4
1 22 9.2
2 34 18.1
3 21 23.6
4 33 32.3
5 41 43.0
6 87 65.9
7 24 72.2
8 41 82.9
9 22 88.7
10 18 93.4
11 12 96.6
12 13 100.0
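If you want the running total itself rather than the percent of total (the question asks for either), the cumulative sum alone gives it:

df["running_total"] = df.x.cumsum()  # running sum of column x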
I have two dataframes. I will explain my requirement in the form of a loop, because that is how I visualize the problem.
I realize there may be another solution, so if this can be done differently, please feel free to share! I am new to Pandas, so I'm struggling with this. Thank you in advance for looking at my question!
I have 2 dataframes that have 3 columns: ID, ODO, ODOLength. ODOLength is the running difference for each ODO record, which I got using: abs(Df1['Odo'] - Df1['Odo'].shift(-1))
OldDataSet = {'id' : [10,20,30,40,50,60,70,80,90,100,110,120,130,140],'Odo': [-1.09,1.02,26.12,43.12,46.81,56.23,111.07,166.38,191.27,196.41,207.74,231.61,235.84,240.04], 'OdoLength':[2.11,25.1,17,3.69,9.42,54.84,55.31,24.89,5.14,11.33,23.87,4.23,4.2,4.09]}
NewDataSet = {'id' : [1000,2000,3000,4000,5000,6000,7000,8000,9000,10000,11000,12000,13000,14000],'Odo': [1.51,2.68,4.72,25.03,42,45.74,55.15,110.05,165.41,170.48,172.39,190.35,195.44,206.78], 'OdoLength':[1.17,2.04,20.31,16.97,3.74,9.41,54.9,55.36,5.07,1.91,17.96,5.09,11.34,23.89]}
FinalResultDataSet = {'DFOneId':[10,20,30,40,50,60,70,80,90,100,110], 'DFTwoID' : [1000,3000,4000,5000,6000,7000,8000,11000,12000,13000,14000], 'OdoDiff': [2.6,3.7,1.09,1.12,1.07,1.08,1.02,6.01,0.92,0.97,0.96], 'OdoLengthDiff':[0.94,4.79,0.03,0.05,0.01,0.06,0.05,6.93,0.05,0.01,0.02], 'OdoAndLengthDiff':[1.66,1.09,1.06,1.07,1.06,1.02,0.97,0.92,0.87,0.96,0.94]}
df1= pd.DataFrame(OldDataSet)
df2 = pd.DataFrame(NewDataSet)
FinalDf = pd.DataFrame(FinalResultDataSet)
The logic behind how to get the FinalDf is as follows: take Odo and OdoLength from each df1 row and subtract them from the Odo and OdoLength columns of df2, then match the df1 row to the df2 row that gives the lowest difference. For the next comparison, begin with the first df2 record that does not yet have a match. If a df2 record is not the minimum for the df1 row currently being compared, that df2 record is not included in the final dataset. For example, Df1 ID 20 compared to Df2 ID 2000 gives 21.4 ((Df1.Odo: 1.02 - Df2.Odo: 2.68) - (Df1.OdoLength: 25.1 - Df2.OdoLength: 2.04) = 21.4), but Df1 ID 20 compared to Df2 ID 3000 gives 1.09 ((Df1.Odo: 1.02 - Df2.Odo: 4.72) - (Df1.OdoLength: 25.1 - Df2.OdoLength: 20.31) = 1.09). In this case Df2 ID 3000 is matched to Df1 ID 20, and Df2 ID 2000 is dropped because its difference was larger; from that point on, Df2 ID 2000 is not considered for any other matches. So the next Df1 comparison would start at Df2 ID 4000, the next value without a match.
As I said, I am open to all suggestions!
Thanks!
You can use merge_asof.
Step 1: combine the dataframes
df1['match']=df1.Odo+df1.OdoLength
df2['match']=df2.Odo+df2.OdoLength
out=pd.merge_asof(df1,df2,on='match',direction='nearest')
out.drop_duplicates(['id_y'])
Out[728]:
Odo_x OdoLength_x id_x match Odo_y OdoLength_y id_y
0 -1.09 2.11 10 1.02 1.51 1.17 1000
1 1.02 25.10 20 26.12 4.72 20.31 3000
2 26.12 17.00 30 43.12 25.03 16.97 4000
3 43.12 3.69 40 46.81 42.00 3.74 5000
4 46.81 9.42 50 56.23 45.74 9.41 6000
5 56.23 54.84 60 111.07 55.15 54.90 7000
6 111.07 55.31 70 166.38 110.05 55.36 8000
7 166.38 24.89 80 191.27 172.39 17.96 11000
8 191.27 5.14 90 196.41 190.35 5.09 12000
9 196.41 11.33 100 207.74 195.44 11.34 13000
10 207.74 23.87 110 231.61 206.78 23.89 14000
Step 2
Then you can do something like the following to get your new column:
out['OdoAndLengthDiff']=out.OdoLength_x-out.OdoLength_y+out.Odo_x-out.Odo_y
By the way, I did not drop the helper match column; once you have all the new values, you can drop it with out = out.drop(columns=['match']).
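Putting both steps together, a sketch assuming df1 and df2 are built from the question's OldDataSet and NewDataSet (note that merge_asof requires the key column to be sorted in both frames, which these values already are):

import pandas as pd

# df1 and df2 as constructed from OldDataSet / NewDataSet above
df1['match'] = df1.Odo + df1.OdoLength
df2['match'] = df2.Odo + df2.OdoLength

# match each df1 row to the df2 row with the nearest 'match' value
out = pd.merge_asof(df1, df2, on='match', direction='nearest')
out = out.drop_duplicates(['id_y'])      # keep only the first df1 row matched to each df2 record
out['OdoAndLengthDiff'] = (out.Odo_x - out.Odo_y) + (out.OdoLength_x - out.OdoLength_y)
out = out.drop(columns=['match'])        # drop the helper key when done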