I have this dataframe:
SRC Coup Vint Bal Mar Apr May Jun Jul BondSec
0 JPM 1.5 2021 43.9 5.6 4.9 4.9 5.2 4.4 FNCL
1 JPM 1.5 2020 41.6 6.2 6.0 5.6 5.8 4.8 FNCL
2 JPM 2.0 2021 503.9 7.1 6.3 5.8 6.0 4.9 FNCL
3 JPM 2.0 2020 308.3 9.3 7.8 7.5 7.9 6.6 FNCL
4 JPM 2.5 2021 345.0 8.6 7.8 6.9 6.8 5.6 FNCL
5 JPM 4.5 2010 5.7 21.3 20.0 18.0 17.7 14.6 G2SF
6 JPM 5.0 2019 2.8 39.1 37.6 34.6 30.8 24.2 G2SF
7 JPM 5.0 2018 7.3 39.8 37.1 33.4 30.1 24.2 G2SF
8 JPM 5.0 2010 3.9 23.3 20.0 18.6 17.9 14.6 G2SF
9 JPM 5.0 2009 4.2 22.8 21.2 19.5 18.6 15.4 G2SF
I want to duplicate all the rows that have FNCL as the BondSec, and rename the value of BondSec in those new duplicate rows to FGLMC. I'm able to accomplish half of that with the following code:
if "FGLMC" not in jpm['BondSec'].values:
    is_FNCL = jpm['BondSec'] == "FNCL"
    FNCL_try = jpm[is_FNCL]
    jpm = jpm.append([FNCL_try]*1, ignore_index=True)
But if I instead try to implement the change to the BondSec value in the same line as below:
jpm.append(([FNCL_try]*1).assign(**{'BondSecurity': 'FGLMC'}),ignore_index=True)
I get the following error:
AttributeError: 'list' object has no attribute 'assign'
Additionally, I would like to insert the duplicated rows based on an index condition, not just at the bottom as additional rows. The condition cannot simply be a row position, because this will have to work on future files with different numbers of rows. So I would like to insert the duplicated rows at the position where the BondSec column values change from FNCL to FNCI (FNCI is not shown here, but the insertion point would be right below the last row with FNCL). I'm assuming this could be done with an np.where call, but I'm not sure how to implement it.
I'll also eventually want to do this same exact process with rows with FNCI as the BondSec value (duplicating them and transforming the BondSec value to FGCI, and inserting at the index position right below the last row with FNCI as the value).
I'd suggest a helper function to handle all your duplications:
def duplicate_and_rename(df, target, value):
    return pd.concat([df, df[df["BondSec"] == target].assign(BondSec=value)])
Then
for target, value in (("FNCL", "FGLMC"), ("FNCI", "FGCI")):
    df = duplicate_and_rename(df, target, value)
Then after all that, you can convert the BondSec column to a categorical with a custom order and sort on it, which places each duplicated block right below the block it was copied from:
ordering = ["FNCL", "FGLMC", "FNCI", "FGCI", "G2SF"]
df["BondSec"] = pd.Categorical(df["BondSec"], ordering)
df = df.sort_values("BondSec", kind="stable").reset_index(drop=True)
Alternatively, you can use a dictionary for your ordering, as explained in this answer.
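Putting it together, a minimal runnable sketch (using a toy frame with made-up values in place of your jpm data):

```python
import pandas as pd

def duplicate_and_rename(df, target, value):
    # copy the matching rows, overwrite BondSec, and append them
    return pd.concat([df, df[df["BondSec"] == target].assign(BondSec=value)])

# toy frame standing in for the real jpm data
df = pd.DataFrame({"Coup": [1.5, 2.0, 5.0],
                   "BondSec": ["FNCL", "FNCI", "G2SF"]})

for target, value in (("FNCL", "FGLMC"), ("FNCI", "FGCI")):
    df = duplicate_and_rename(df, target, value)

# categorize with a custom order, then sort stably so each duplicated
# block lands directly below the block it was copied from
ordering = ["FNCL", "FGLMC", "FNCI", "FGCI", "G2SF"]
df["BondSec"] = pd.Categorical(df["BondSec"], ordering)
df = df.sort_values("BondSec", kind="stable").reset_index(drop=True)
```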
I wish to create a DataFrame where each row is one day, and the columns provide the date, hourly data, and the maximum and minimum of the day's data. Here is an example (I provide the input data further down in the question):
Date_time 00:00 01:00 02:00 03:00 04:00 05:00 06:00 07:00 08:00 09:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00 Max Min
0 2019-02-03 18.6 18.6 18.2 18.0 18.0 18.3 18.7 20.1 21.7 23.3 23.7 24.6 25.1 24.5 23.9 19.6 19.2 19.8 19.6 19.3 19.2 19.3 18.8 19.0 25.7 17.9
1 2019-02-04 18.9 18.8 18.6 18.4 18.7 18.8 19.0 19.7 21.4 23.5 NaN NaN NaN 25.8 25.4 22.1 21.8 21.0 18.9 18.8 18.9 18.8 18.8 18.9 27.8 18.1
My input DataFrame has a row for each hour, with the date & time, mean, max, and min for each hour as its columns.
I wish to iterate through each day in the input DataFrame and do the following:
Check that there is a row for each hour of the day
Check that there is both maximum and minimum data for each hour of the day
If the conditions above are met, I wish to:
Add a row to the output DataFrame for the given date
Use the date to fill the 'Date_time' cell for the row
Transpose the hourly data to the hourly cells
Find the max of the hourly max data, and use it to fill the max cell for the row
Find the min of the hourly min data, and use it to fill the min cell for the row
Example daily input data follows.
Example 1
All hours for day available
Max & min available for each hour
Proceed to create row in output DataFrame
Date_time Mean_temp Max_temp Min_temp
0 2019-02-03 00:00:00 18.6 18.7 18.5
1 2019-02-03 01:00:00 18.6 18.7 18.5
2 2019-02-03 02:00:00 18.2 18.5 18.0
3 2019-02-03 03:00:00 18.0 18.0 17.9
4 2019-02-03 04:00:00 18.0 18.1 17.9
5 2019-02-03 05:00:00 18.3 18.4 18.1
6 2019-02-03 06:00:00 18.7 19.1 18.4
7 2019-02-03 07:00:00 20.1 21.3 19.1
8 2019-02-03 08:00:00 21.7 22.9 21.0
9 2019-02-03 09:00:00 23.2 23.9 22.8
10 2019-02-03 10:00:00 23.7 24.1 23.3
11 2019-02-03 11:00:00 24.6 25.5 24.0
12 2019-02-03 12:00:00 25.1 25.7 24.7
13 2019-02-03 13:00:00 24.5 25.0 24.2
14 2019-02-03 14:00:00 23.9 25.3 21.2
15 2019-02-03 15:00:00 19.6 21.2 18.8
16 2019-02-03 16:00:00 19.2 19.5 18.7
17 2019-02-03 17:00:00 19.8 19.9 19.4
18 2019-02-03 18:00:00 19.6 19.8 19.5
19 2019-02-03 19:00:00 19.3 19.4 19.1
20 2019-02-03 20:00:00 19.2 19.4 19.1
21 2019-02-03 21:00:00 19.3 19.4 18.9
22 2019-02-03 22:00:00 18.8 19.0 18.7
23 2019-02-03 23:00:00 19.0 19.1 18.9
Example 2
All hours for day available
Max & min available for each hour
NaN values for some Mean_temp entries
Proceed to create row in output DataFrame
Date_time Mean_temp Max_temp Min_temp
24 2019-02-04 00:00:00 18.9 19.0 18.9
25 2019-02-04 01:00:00 18.8 18.9 18.7
26 2019-02-04 02:00:00 18.6 18.8 18.4
27 2019-02-04 03:00:00 18.4 18.6 18.1
28 2019-02-04 04:00:00 18.7 18.9 18.4
29 2019-02-04 05:00:00 18.8 18.8 18.7
30 2019-02-04 06:00:00 19.0 19.3 18.8
31 2019-02-04 07:00:00 19.7 20.4 19.3
32 2019-02-04 08:00:00 21.4 22.8 20.3
33 2019-02-04 09:00:00 23.5 23.9 22.8
34 2019-02-04 10:00:00 25.7 23.6
35 2019-02-04 11:00:00 26.5 25.4
36 2019-02-04 12:00:00 27.1 26.1
37 2019-02-04 13:00:00 25.8 26.8 24.8
38 2019-02-04 14:00:00 25.4 27.8 23.7
39 2019-02-04 15:00:00 22.1 24.1 20.2
40 2019-02-04 16:00:00 21.8 22.6 20.2
41 2019-02-04 17:00:00 20.9 22.4 19.6
42 2019-02-04 18:00:00 18.9 19.6 18.6
43 2019-02-04 19:00:00 18.8 18.9 18.6
44 2019-02-04 20:00:00 18.9 19.0 18.8
45 2019-02-04 21:00:00 18.8 18.9 18.7
46 2019-02-04 22:00:00 18.8 18.9 18.7
47 2019-02-04 23:00:00 18.9 19.2 18.7
Example 3
Not all hours of the day are available
Do not create row in output DataFrame
Date_time Mean_temp Max_temp Min_temp
48 2019-02-05 00:00:00 19.2 19.3 19.0
49 2019-02-05 01:00:00 19.3 19.4 19.3
50 2019-02-05 02:00:00 19.3 19.4 19.2
51 2019-02-05 03:00:00 19.4 19.5 19.4
52 2019-02-05 04:00:00 19.5 19.6 19.3
53 2019-02-05 05:00:00 19.3 19.5 19.1
54 2019-02-05 06:00:00 20.1 20.6 19.2
55 2019-02-05 07:00:00 21.1 21.7 20.6
56 2019-02-05 08:00:00 22.3 23.2 21.7
57 2019-02-05 15:00:00 25.3 25.8 25.0
58 2019-02-05 16:00:00 25.8 26.0 25.2
59 2019-02-05 17:00:00 24.3 25.2 23.3
60 2019-02-05 18:00:00 22.5 23.3 22.1
61 2019-02-05 19:00:00 21.6 22.1 21.1
62 2019-02-05 20:00:00 21.1 21.3 20.9
63 2019-02-05 21:00:00 21.2 21.3 20.9
64 2019-02-05 22:00:00 20.9 21.0 20.6
65 2019-02-05 23:00:00 19.9 20.6 19.7
Example 4
All hours of the day are available
Max and/or min have at least one NaN value
Do not create row in output DataFrame
Date_time Mean_temp Max_temp Min_temp
66 2019-02-06 00:00:00 19.7 19.8 19.7
67 2019-02-06 01:00:00 19.6 19.7 19.3
68 2019-02-06 02:00:00 19.0 19.3 18.6
69 2019-02-06 03:00:00 18.5 18.6 18.4
70 2019-02-06 04:00:00 18.6 18.7 18.4
71 2019-02-06 05:00:00 18.5 18.6
72 2019-02-06 06:00:00 19.0 19.6 18.5
73 2019-02-06 07:00:00 20.3 21.2 19.6
74 2019-02-06 08:00:00 21.5 21.7 21.2
75 2019-02-06 09:00:00 21.4 22.3 20.9
76 2019-02-06 10:00:00 23.5 24.4 22.3
77 2019-02-06 11:00:00 24.7 25.4 24.3
78 2019-02-06 12:00:00 24.9 25.5 23.9
79 2019-02-06 13:00:00 23.4 24.0 22.9
80 2019-02-06 14:00:00 23.3 23.8 22.9
81 2019-02-06 15:00:00 24.4 23.7
82 2019-02-06 16:00:00 24.9 25.1 24.7
83 2019-02-06 17:00:00 24.4 24.9 23.8
84 2019-02-06 18:00:00 22.5 23.8 21.7
85 2019-02-06 19:00:00 20.8 21.8 19.6
86 2019-02-06 20:00:00 19.1 19.6 18.9
87 2019-02-06 21:00:00 19.0 19.1 18.9
88 2019-02-06 22:00:00 19.1 19.1 19.0
89 2019-02-06 23:00:00 19.1 19.1 19.0
Just to recap, the above inputs would create the following output:
Date_time 00:00 01:00 02:00 03:00 04:00 05:00 06:00 07:00 08:00 09:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00 Max Min
0 2019-02-03 18.6 18.6 18.2 18.0 18.0 18.3 18.7 20.1 21.7 23.3 23.7 24.6 25.1 24.5 23.9 19.6 19.2 19.8 19.6 19.3 19.2 19.3 18.8 19.0 25.7 17.9
1 2019-02-04 18.9 18.8 18.6 18.4 18.7 18.8 19.0 19.7 21.4 23.5 NaN NaN NaN 25.8 25.4 22.1 21.8 21.0 18.9 18.8 18.9 18.8 18.8 18.9 27.8 18.1
I've had a really good think about this, and I can only come up with a horrible set of if statements that I know will be terribly slow and will take ages to write (apologies, this is due to me being bad at coding)!
Does anyone have any pointers to Pandas functions that could begin to deal with this problem efficiently?
You can use a groupby on the day of the Date_time column and build each row of final_df from each group, skipping any group that has missing values in the Max_temp or Min_temp columns or fewer than 24 rows.
Note that I'm assuming your Date_time column is of type datetime64[ns]. If it isn't, first run: df['Date_time'] = pd.to_datetime(df['Date_time'])
all_hours = list(pd.date_range(start='1/1/22 00:00:00', end='1/1/22 23:00:00', freq='h').strftime('%H:%M'))
final_df = pd.DataFrame(columns=['Date_time'] + all_hours + ['Max', 'Min'])

## construct final_df by using a groupby on the day of the 'Date_time' column
for group, df_group in df.groupby(df['Date_time'].dt.date):
    ## only keep days with no NaN in 'Max_temp' or 'Min_temp' and a full 24 hours
    if (df_group[['Max_temp', 'Min_temp']].isnull().sum().sum() == 0) and (len(df_group) == 24):
        ## create a dictionary for the new row of the final_df
        new_df_data = {'Date_time': group}
        new_df_data.update(dict(zip(all_hours, [[val] for val in df_group['Mean_temp']])))
        new_df_data['Max'], new_df_data['Min'] = df_group['Max_temp'].max(), df_group['Min_temp'].min()
        final_df = pd.concat([final_df, pd.DataFrame(new_df_data)])
Output:
>>> final_df
Date_time 00:00 01:00 02:00 03:00 04:00 05:00 06:00 07:00 08:00 09:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00 Max Min
0 2019-02-03 18.6 18.6 18.2 18.0 18.0 18.3 18.7 20.1 21.7 23.2 23.7 24.6 25.1 24.5 23.9 19.6 19.2 19.8 19.6 19.3 19.2 19.3 18.8 19.0 25.7 17.9
0 2019-02-04 18.9 18.8 18.6 18.4 18.7 18.8 19.0 19.7 21.4 23.5 NaN NaN NaN 25.8 25.4 22.1 21.8 20.9 18.9 18.8 18.9 18.8 18.8 18.9 27.8 18.1
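If you prefer to avoid growing final_df inside a loop, a non-looping variant is possible: drop incomplete days with groupby(...).filter, then pivot the hourly means into columns. A sketch, assuming the same column names as above (the helper name summarize_days is my own):

```python
import pandas as pd

def summarize_days(df):
    # keep only days that have all 24 hours and no missing max/min values
    day = df['Date_time'].dt.date
    complete = df.groupby(day).filter(
        lambda g: len(g) == 24
        and g[['Max_temp', 'Min_temp']].notna().all().all()
    )
    # one row per day, hourly means spread into '00:00'..'23:00' columns
    hourly = complete.pivot_table(index=complete['Date_time'].dt.date,
                                  columns=complete['Date_time'].dt.strftime('%H:%M'),
                                  values='Mean_temp', dropna=False)
    # daily max of the hourly maxima, daily min of the hourly minima
    extremes = complete.groupby(complete['Date_time'].dt.date).agg(
        Max=('Max_temp', 'max'), Min=('Min_temp', 'min'))
    return hourly.join(extremes).reset_index()
```

Days with NaN in Mean_temp still produce a row (with NaN in the affected hour columns), matching the behaviour of the loop above.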
time a b
2021-05-23 22:06:54 10.4 70.1
2021-05-23 22:21:41 10.7 68.3
2021-05-23 22:36:28 10.4 69.4
2021-05-23 22:51:15 9.9 71.7
2021-05-23 23:06:02 9.5 73.1
... ... ...
2021-11-19 08:18:31 19.8 43.0
2021-11-19 08:20:04 21.0 42.0
2021-11-19 08:21:25 35.5 20.0
2021-11-19 08:21:32 19.8 43.0
2021-11-19 08:23:05 21.0 42.0
Here, time is in the index, not a column.
When I did df.between_time("2021-11-17 08:15:00","2021-11-19 08:00:00")
it threw the error ValueError: Cannot convert arg ['2021-11-17 08:15:00'] to a time
The data frame does not have a proper timestamp.
What I want to do: when I pass a time range or date range, I want to get all the data between the given times.
Thanks
Use truncate:
>>> df.truncate("2021-05-23 23:00:00", "2021-11-19 08:20:00")
a b
time
2021-05-23 23:06:02 9.5 73.1
2021-11-19 08:18:31 19.8 43.0
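Note that truncate assumes a sorted index. With a sorted DatetimeIndex, label-based .loc slicing gives the same result; a small sketch with made-up values standing in for your data:

```python
import pandas as pd

# unevenly spaced DatetimeIndex, like the question's
idx = pd.to_datetime(['2021-05-23 22:06:54', '2021-05-23 23:06:02',
                      '2021-11-19 08:18:31', '2021-11-19 08:21:25'])
df = pd.DataFrame({'a': [10.4, 9.5, 19.8, 35.5]}, index=idx).sort_index()

# .loc slicing keeps everything between the two endpoints, even when
# the endpoint labels themselves are not present in the index
sub = df.loc['2021-05-23 23:00:00':'2021-11-19 08:20:00']
```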
To calculate a volume weighted moving average (VWMA) I am collecting a sum(price*volume) and dividing it by the sum(volume).
I need a faster way to get a value from the previous row and add it to a value on the current row.
I have the following dataframe:
import pandas as pd
from itertools import repeat
df = pd.DataFrame({'dtime': ['16:00', '15:00', '14:00', '13:00', '12:00', '11:00', '10:00', '09:00', '08:00', '07:00', '06:00', '05:00', '04:00', '03:00', '02:00', '01:00'],
'time': [1800, 1740, 1680, 1620, 1560, 1500, 1440, 1380, 1320, 1260, 1200, 1140, 1080, 1020, 960, 900],
'price': [100.1, 102.7, 108.5, 105.3, 107.1, 103.4, 101.8, 102.7, 101.6, 99.8, 100.2, 97.7, 99.3, 100.1, 102.5, 103.9],
'volume': [6.0, 6.5, 5.4, 6.3, 6.4, 7.1, 6.7, 6.2, 5.7, 1.2, 2.4, 3.9, 5.2, 8.9, 7.2, 6.5]
}, columns = ['dtime', 'time', 'price', 'volume']).set_index('dtime')
df.insert(df.shape[1], "PV", df['price']*df['volume'])
df.insert(df.shape[1], "flag", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "PVsum_2", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "Vsum_2", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "VWMA_2", list(repeat(0.0,len(df))))
Which is
df =
time price volume PV flag PVsum_2 Vsum_2 VWMA_2
dtime
16:00 1800 100.1 6.0 600.60 0.0 0.0 0.0 0.0
15:00 1740 102.7 6.5 667.55 0.0 0.0 0.0 0.0
14:00 1680 108.5 5.4 585.90 0.0 0.0 0.0 0.0
13:00 1620 105.3 6.3 663.39 0.0 0.0 0.0 0.0
12:00 1560 107.1 6.4 685.44 0.0 0.0 0.0 0.0
11:00 1500 103.4 7.1 734.14 0.0 0.0 0.0 0.0
10:00 1440 101.8 6.7 682.06 0.0 0.0 0.0 0.0
09:00 1380 102.7 6.2 636.74 0.0 0.0 0.0 0.0
08:00 1320 101.6 5.7 579.12 0.0 0.0 0.0 0.0
07:00 1260 99.8 1.2 119.76 0.0 0.0 0.0 0.0
06:00 1200 100.2 2.4 240.48 0.0 0.0 0.0 0.0
05:00 1140 97.7 3.9 381.03 0.0 0.0 0.0 0.0
04:00 1080 99.3 5.2 516.36 0.0 0.0 0.0 0.0
03:00 1020 100.1 8.9 890.89 0.0 0.0 0.0 0.0
02:00 960 102.5 7.2 738.00 0.0 0.0 0.0 0.0
01:00 900 103.9 6.5 675.35 0.0 0.0 0.0 0.0
Right now I am using a for loop to check each row if 'flag' is set.
#----pseudo code----
#for each row in df (from bottom to top, excluding the very bottom row)
# if flag[row] is not set:
# PVsum_2[row] = PV[row] + PV[row + 1]
# Vsum_2[row] = volume[row] + volume[row + 1]
# VWMA_2[row] = PVsum_2[row] / Vsum_2[row]
# flag[row] = 1.0
#----pseudo code----
my_dict = {'dtime'  : 0,
           'time'   : 1,
           'price'  : 2,
           'volume' : 3,
           'PV'     : 4,
           'flag'   : 5,
           'PVsum_2': 6,
           'Vsum_2' : 7,
           'VWMA_2' : 8}

for row in reversed(range(len(df)-1)):
    # if flag value is not set (i.e. flag == 0)
    if not df['flag'][row]:
        # sum of current and previous PV (price*volume) values
        a = df['PV'][row] + df['PV'][row+1]
        df.iloc[row, my_dict['PVsum_2']-1] = a
        # sum of current and previous volumes
        b = df['volume'][row] + df['volume'][row+1]
        df.iloc[row, my_dict['Vsum_2']-1] = b
        # PVsum_2 / Vsum_2
        c = (a / b) if b != 0.0 else 0.0
        df.iloc[row, my_dict['VWMA_2']-1] = c
        # set flag value to 1.0
        df.iloc[row, my_dict['flag']-1] = 1.0
but this takes too long on large sets of data (500+ rows)
I'm looking for something faster and more elegant.
The dataframe should look like this when it is done (notice the bottom row has not been altered):
df =
time price volume PV flag PVsum_2 Vsum_2 VWMA_2
dtime
16:00 1800 100.1 6.0 600.60 1.0 1268.15 12.5 101.452000
15:00 1740 102.7 6.5 667.55 1.0 1253.45 11.9 105.331933
14:00 1680 108.5 5.4 585.90 1.0 1249.29 11.7 106.776923
13:00 1620 105.3 6.3 663.39 1.0 1348.83 12.7 106.207087
12:00 1560 107.1 6.4 685.44 1.0 1419.58 13.5 105.154074
11:00 1500 103.4 7.1 734.14 1.0 1416.20 13.8 102.623188
10:00 1440 101.8 6.7 682.06 1.0 1318.80 12.9 102.232558
09:00 1380 102.7 6.2 636.74 1.0 1215.86 11.9 102.173109
08:00 1320 101.6 5.7 579.12 1.0 698.88 6.9 101.286957
07:00 1260 99.8 1.2 119.76 1.0 360.24 3.6 100.066667
06:00 1200 100.2 2.4 240.48 1.0 621.51 6.3 98.652381
05:00 1140 97.7 3.9 381.03 1.0 897.39 9.1 98.614286
04:00 1080 99.3 5.2 516.36 1.0 1407.25 14.1 99.804965
03:00 1020 100.1 8.9 890.89 1.0 1628.89 16.1 101.173292
02:00 960 102.5 7.2 738.00 1.0 1413.35 13.7 103.164234
01:00 900 103.9 6.5 675.35 0.0 0.00 0.0 0.000000
Eventually new data will be added to the top of the data frame as seen below, and will need to be updated again.
df =
time price volume PV flag PVsum_2 Vsum_2 VWMA_2
dtime
19:00 1980 100.1 6.0 600.60 0.0 0.0 0.0 0.0
18:00 1920 102.7 6.5 667.55 0.0 0.0 0.0 0.0
17:00 1860 108.5 5.4 585.90 0.0 0.0 0.0 0.0
16:00 1800 100.1 6.0 600.60 1.0 1268.15 12.5 101.452000
15:00 1740 102.7 6.5 667.55 1.0 1253.45 11.9 105.331933
14:00 1680 108.5 5.4 585.90 1.0 1249.29 11.7 106.776923
13:00 1620 105.3 6.3 663.39 1.0 1348.83 12.7 106.207087
12:00 1560 107.1 6.4 685.44 1.0 1419.58 13.5 105.154074
11:00 1500 103.4 7.1 734.14 1.0 1416.20 13.8 102.623188
10:00 1440 101.8 6.7 682.06 1.0 1318.80 12.9 102.232558
09:00 1380 102.7 6.2 636.74 1.0 1215.86 11.9 102.173109
08:00 1320 101.6 5.7 579.12 1.0 698.88 6.9 101.286957
07:00 1260 99.8 1.2 119.76 1.0 360.24 3.6 100.066667
06:00 1200 100.2 2.4 240.48 1.0 621.51 6.3 98.652381
05:00 1140 97.7 3.9 381.03 1.0 897.39 9.1 98.614286
04:00 1080 99.3 5.2 516.36 1.0 1407.25 14.1 99.804965
03:00 1020 100.1 8.9 890.89 1.0 1628.89 16.1 101.173292
02:00 960 102.5 7.2 738.00 1.0 1413.35 13.7 103.164234
01:00 900 103.9 6.5 675.35 0.0 0.00 0.0 0.000000
It looks like you're not using pandas in the right way. I'd recommend taking a quick look at a tutorial.
For starters, the following lines
df.insert(df.shape[1], "flag", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "PVsum_2", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "Vsum_2", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "VWMA_2", list(repeat(0.0,len(df))))
can be much easier written as:
df['flag'] = 0
df['PVsum_2'] = 0
df['Vsum_2'] = 0
df['VWMA_2'] = 0
But it seems you don't really need to initialise those columns at all.
You also don't need the for loop: you can align two dataframes, your original and a copy with all rows shifted. For example:
df_shift = df.shift(-1)
You can then use normal vectorised calculations to achieve what you want, e.g.:
df['PVsum_2'] = df['PV'] + df_shift['PV']
df['Vsum_2'] = df['volume'] + df_shift['volume']
idx = df['Vsum_2'] != 0  # check whether the denominator is non-zero
df.loc[idx, 'VWMA_2'] = df.loc[idx, 'PVsum_2'] / df.loc[idx, 'Vsum_2']  # only calculate VWMA_2 where Vsum_2 is non-zero
Hopefully you get the idea and can make small adjustments to make it work exactly as you want.
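For a fixed two-row window like this, rolling sums achieve the same thing without shifting a whole copy of the frame. A sketch using a few made-up rows, newest first as in the question (the shift(-1) re-labels each rolling sum onto the newer of its two rows):

```python
import pandas as pd

# a few made-up bars, newest row first as in the question
df = pd.DataFrame({'price': [100.1, 102.7, 108.5],
                   'volume': [6.0, 6.5, 5.4]})
df['PV'] = df['price'] * df['volume']

# rolling(2).sum() at row i covers rows i-1 and i; shifting by -1
# turns that into "current row plus the older row below it"
pv_sum = df['PV'].rolling(2).sum().shift(-1)
v_sum = df['volume'].rolling(2).sum().shift(-1)
df['VWMA_2'] = (pv_sum / v_sum).fillna(0.0)  # oldest row has no pair, stays 0
```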
df
Out[1]:
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
0 978.0 345 17.0 16.5 97 12.22 0 0 292.0 326.8 294.1
1 977.0 354 17.8 16.7 93 12.39 1 0 292.9 328.3 295.1
2 970.0 416 23.4 15.4 61 11.47 4 2 299.1 332.9 301.2
3 963.0 479 24.0 14.0 54 10.54 8 3 300.4 331.6 302.3
4 948.7 610 23.0 13.4 55 10.28 15 6 300.7 331.2 302.5
5 925.0 830 21.4 12.4 56 9.87 20 5 301.2 330.6 303.0
6 916.0 914 20.7 11.7 56 9.51 20 4 301.3 329.7 303.0
7 884.0 1219 18.2 9.2 56 8.31 60 4 301.8 326.7 303.3
8 853.1 1524 15.7 6.7 55 7.24 35 3 302.2 324.1 303.5
9 850.0 1555 15.4 6.4 55 7.14 20 2 302.3 323.9 303.6
10 822.8 1829 13.3 5.6 60 6.98 300 4 302.9 324.0 304.1
How do I interpolate the values of all the columns on specified PRES (pressure) values at say PRES=[950, 900, 875]? Is there an elegant pandas type of way to do this?
The only way I can think of is to first insert empty (NaN) rows for each specified PRES value in a loop, then set PRES as the index and use the pandas native interpolate option:
df.interpolate(method='index', inplace=True)
Is there a more elegant solution?
Use your solution without a loop - reindex by the union of the original index values with the PRES list, interpolate while the index is still ascending (method='index' assumes an increasing index), then sort descending. Note this works only if all index values are unique:
PRES=[950, 900, 875]
df = df.set_index('PRES')
df = df.reindex(df.index.union(PRES)).interpolate(method='index').sort_index(ascending=False)
print (df)
HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
978.0 345.0 17.0 16.5 97.0 12.22 0.0 0.0 292.0 326.8 294.1
977.0 354.0 17.8 16.7 93.0 12.39 1.0 0.0 292.9 328.3 295.1
970.0 416.0 23.4 15.4 61.0 11.47 4.0 2.0 299.1 332.9 301.2
963.0 479.0 24.0 14.0 54.0 10.54 8.0 3.0 300.4 331.6 302.3
950.0 598.1 23.1 13.5 54.9 10.30 14.4 5.7 300.7 331.2 302.5
948.7 610.0 23.0 13.4 55.0 10.28 15.0 6.0 300.7 331.2 302.5
925.0 830.0 21.4 12.4 56.0 9.87 20.0 5.0 301.2 330.6 303.0
916.0 914.0 20.7 11.7 56.0 9.51 20.0 4.0 301.3 329.7 303.0
900.0 1066.5 19.5 10.5 56.0 8.91 40.0 4.0 301.6 328.2 303.2
884.0 1219.0 18.2 9.2 56.0 8.31 60.0 4.0 301.8 326.7 303.3
875.0 1307.8 17.5 8.5 55.7 8.00 52.7 3.7 301.9 325.9 303.4
853.1 1524.0 15.7 6.7 55.0 7.24 35.0 3.0 302.2 324.1 303.5
850.0 1555.0 15.4 6.4 55.0 7.14 20.0 2.0 302.3 323.9 303.6
822.8 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
If the PRES column can contain duplicate values, use concat with sort_index instead:
PRES=[950, 900, 875]
df = df.set_index('PRES')
df = (pd.concat([df, pd.DataFrame(index=PRES)])
        .sort_index()
        .interpolate(method='index')
        .sort_index(ascending=False))
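As an alternative that leaves the original frame untouched, np.interp can evaluate every column at the requested pressures. A sketch, assuming PRES is strictly decreasing as in your data (the helper name interp_at is my own, not a pandas built-in):

```python
import numpy as np
import pandas as pd

def interp_at(df, pres_values):
    # np.interp needs ascending x, and PRES is descending, so reverse both
    xp = df['PRES'].to_numpy()[::-1]
    cols = {c: np.interp(pres_values, xp, df[c].to_numpy()[::-1])
            for c in df.columns if c != 'PRES'}
    return pd.DataFrame({'PRES': pres_values, **cols})
```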