Parse CSV in 2D Python Object - python

I am trying to do analysis on a CSV file which looks like this:
timestamp,value
1594512094.39,51
1594512094.74,76
1594512098.07,50.9
1594512099.59,76.80000305
1594512101.76,50.9
I am using pandas to import each column:
dataFrame = pandas.read_csv('iot_telemetry_data.csv')
graphDataHumidity: object = dataFrame.loc[:, "humidity"]
graphTime: object = dataFrame.loc[:, "ts"]
My problem is that I need to pair up both columns so that I can filter the values for a specific time range. For example, given a timestampBegin of 1594512109.13668 and a timestampEnd of 1594512129.37415, I want the corresponding values so that I can compute, for instance, their mean over that time range.
I didn't find any solutions to this online, and I don't know of any libraries that solve this problem.

You can first filter the rows whose timestamp values are between start and end, then do the calculation on the filtered rows, as follows.
(Note: in the sample data there is no row whose timestamp falls between 1594512109.13668 and 1594512129.37415, so edit the range values to whatever you need.)
import pandas as pd
df = pd.read_csv('iot_telemetry_data.csv')
start = 1594512109.13668
end = 1594512129.37415
df = df[(df['timestamp'] >= start) & (df['timestamp'] <= end)]
average = df['value'].mean()
print(average)
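If the two columns are already in one dataframe, Series.between is an inclusive shorthand for the same pair of comparisons. A self-contained sketch using the sample rows from the question (the range values here are chosen to actually cover those rows):

```python
import pandas as pd

# Sample data from the question, rebuilt inline instead of read from CSV
df = pd.DataFrame({
    'timestamp': [1594512094.39, 1594512094.74, 1594512098.07, 1594512099.59, 1594512101.76],
    'value': [51, 76, 50.9, 76.80000305, 50.9],
})
start, end = 1594512094.0, 1594512100.0  # example range covering the first four rows

# between(start, end) is equivalent to (ts >= start) & (ts <= end)
mask = df['timestamp'].between(start, end)
print(df.loc[mask, 'value'].mean())
```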

Related

Apply function over portion of dataframe multiple times

I have a dataframe with 4 columns: "Date" (in string format), "Hour" (in string format), "Energia_Attiva_Ingresso_Delta" and "Energia_Attiva_Uscita_Delta".
Obviously, for every date there are multiple hours. I'd like to calculate a column for the overall dataframe, but on a daily basis: the function's operation must be computed for every single date.
So, I thought to iterate over the values of the date column, filter the dataframe with .loc, then pass the filtered df to the function. In the function I have to re-filter the df with .loc (for the purpose of the calculation).
Here's the code I wrote; as you can see, in the function I need to operate iteratively on the row with the maximum value of 'Energia_Ingresso_Delta', and to do so I use the .loc function again:
#function
def optimize(df):
    min_index = np.argmin(df.Margine)
    max_index = np.argmax(df.Margine)
    Energia_Prelevata_Da_Rete = df[df.Margine < 0]['Margine'].sum().round(1)
    Energia_In_Eccesso = df[df.Margine > 0]['Margine'].sum().round(1)
    carico_medio = (Energia_In_Eccesso / df[df['Margine'] < 0]['Margine'].count()).round(1)
    while Energia_In_Eccesso != 0:
        max_index = np.argmax(df.Energia_Ingresso_Delta)
        df.loc[max_index, 'Energia_Attiva_Ingresso_Delta'] = df.loc[max_index, 'Energia_Attiva_Ingresso_Delta'] + carico_medio
        Energia_In_Eccesso = (Energia_In_Eccesso - carico_medio).round(1)

#Call function with "partial dataframe". The dataframe is called "prova"
for items in list(prova.Data.unique()):
    optimize(prova.loc[[items]])
But I keep getting this error:
"None of [Index(['2021-05-01'], dtype='object')] are in the [index]"
Can someone help me? :)
Thanks in advance
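For reference, the error comes from prova.loc[[items]], which looks each date up in the row index rather than in the Data column. A minimal sketch (toy data, using the column names from the question) of the boolean-mask selection the loop needs:

```python
import pandas as pd

# Toy frame; 'Data' and 'Margine' are the column names from the question
prova = pd.DataFrame({
    'Data': ['2021-05-01', '2021-05-01', '2021-05-02'],
    'Margine': [-1.0, 2.0, 3.0],
})

for date in prova['Data'].unique():
    # A boolean mask selects rows by column value; prova.loc[[date]] would
    # look the date up in the row index and raise the KeyError shown above
    daily = prova.loc[prova['Data'] == date]
    print(date, len(daily))
```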

How to assess all values of a row in a pandas dataframe and write into a new column

I have a pandas dataframe of 62 rows x 10 columns. Each row contains numbers; if any of those numbers are within a certain range, a string should be written to the last column.
I have unsuccessfully tried the .apply method with a function to make the assessment. I have also tried to import the data as a series, but then the .apply method causes problems because it is a list.
df = pd.read_csv(results)
For example, in the image attached, if any value from Base 2019 to FY26 Load is between 0.95 and 1.05 then return 'Acceptable' into the last column otherwise return 'Not Acceptable'.
Any help, even a start would be much appreciated.
This should perform as expected:
results = "input.csv"
df = pd.read_csv(results)

low = 0.95
high = 1.05

# The columns to check
cols = df.columns[2:]

# Combine the two bounds per cell first, then ask whether ANY cell in the row passes
in_range = ((df[cols] > low) & (df[cols] < high)).any(axis=1)
df['Acceptable?'] = in_range.map({True: 'Acceptable', False: 'Not Acceptable'})
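A runnable sketch of that range check on made-up data (the column names and values here are hypothetical stand-ins for the CSV): combining the bounds per cell before .any(axis=1) is what makes a row qualify when any single value falls inside the range.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: two label columns, then the numeric ones
df = pd.DataFrame({
    'Name': ['a', 'b'],
    'Case': ['x', 'y'],
    'Base 2019': [1.00, 0.50],
    'FY26 Load': [1.20, 0.90],
})
low, high = 0.95, 1.05

cols = df.columns[2:]
# Per-cell range test, then row-wise any()
in_range = ((df[cols] > low) & (df[cols] < high)).any(axis=1)
df['Acceptable?'] = in_range.map({True: 'Acceptable', False: 'Not Acceptable'})
print(df['Acceptable?'].tolist())  # row a has 1.00 in range; row b has nothing in range
```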

Monte carlo simulation in python - problem with looping

I am running a simple python script for MC. Basically it reads through every row in the dataframe and selects the max and min of two variables. Then the simulation is run 1000 times, selecting a random value between the min and max, computing the product, and writing the P50 value back to the data table.
Somehow the P50 output is the same for all rows. Any help on where I am going wrong?
import pandas as pd
import random
import numpy as np

data = [[0.075, 0.085, 120, 150], [0.055, 0.075, 150, 350], [0.045, 0.055, 175, 400]]
df = pd.DataFrame(data, columns=['P_min', 'P_max', 'H_min', 'H_max'])
NumSim = 1000

for index, row in df.iterrows():
    outdata = np.zeros(shape=(NumSim,), dtype=float)
    for k in range(NumSim):
        phi = row['P_min'] + (row['P_max'] - row['P_min']) * random.uniform(0, 1)
        ht = row['H_min'] + (row['H_max'] - row['H_min']) * random.uniform(0, 1)
        outdata[k] = phi * ht
    df['out_p50'] = np.percentile(outdata, 50)

print(df)
By df['out_p50'] = np.percentile(outdata,50) you are saying that you want the whole column to be set to given value, not a specific row of the column. Therefore, the numbers are generated and saved but they are saved to the whole column and in the end, you see the last generated number in every row.
Instead, use df.loc[index, 'out_p50'] = np.percentile(outdata,50) to specify the specific row you want to set.
Yup -- you're writing a scalar value to the entire column, and you overwrite that value on each iteration. If you want, you can simply specify the row with df.loc for a quick fix. Also consider using np.median(outdata) instead of np.percentile(outdata, 50).
Perhaps the most important feature of pandas is its built-in support for vectorization: you work with entire columns of data rather than looping through the data frame row by row, much like a list comprehension without the explicit for row in df loop.
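As a sketch of that vectorized style (using numpy's Generator API in place of the random module), all NumSim draws for every row can be produced as one matrix per variable, leaving no Python loop at all:

```python
import numpy as np
import pandas as pd

data = [[0.075, 0.085, 120, 150], [0.055, 0.075, 150, 350], [0.045, 0.055, 175, 400]]
df = pd.DataFrame(data, columns=['P_min', 'P_max', 'H_min', 'H_max'])
NumSim = 1000

rng = np.random.default_rng(42)
# Broadcast each row's bounds to shape (n_rows, NumSim): one matrix per variable
phi = rng.uniform(df['P_min'].to_numpy()[:, None], df['P_max'].to_numpy()[:, None],
                  size=(len(df), NumSim))
ht = rng.uniform(df['H_min'].to_numpy()[:, None], df['H_max'].to_numpy()[:, None],
                 size=(len(df), NumSim))

# Row-wise median of the products: axis=1 yields one P50 per row
df['out_p50'] = np.median(phi * ht, axis=1)
print(df)
```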

Taking a proportion of a dataframe based on column values

I have a Pandas dataframe with ~50,000 rows and I want to randomly select a proportion of rows from that dataframe based on a number of conditions. Specifically, I have a column called 'type of use' and, for each field in that column, I want to select a different proportion of rows.
For instance:
df[df['type of use'] == 'housing'].sample(frac=0.2)
This code returns 20% of all the rows which have 'housing' as their 'type of use'. The problem is I do not know how to do this for the remaining fields in a way that is 'idiomatic'. I also do not know how I could take the result from this sampling to form a new dataframe.
You can make a list of all unique values in the column with list(df['type of use'].unique()) and iterate as below:
for i in list(df['type of use'].unique()):
    print(df[df['type of use'] == i].sample(frac=0.2))
or
i = 0
while i < len(list(df['type of use'].unique())):
    df1 = df[df['type of use'] == list(df['type of use'].unique())[i]].sample(frac=0.2)
    print(df1.head())
    i = i + 1
For storing the samples you can create a dictionary:
dfs = ['df' + str(x) for x in list(df['type of use'].unique())]
dicdf = dict()
i = 0
while i < len(dfs):
    dicdf[dfs[i]] = df[df['type of use'] == list(df['type of use'].unique())[i]].sample(frac=0.2)
    i = i + 1

print(dicdf)
This will print a dictionary of the dataframes.
You can print whichever one you like to see, for example the housing sample: print(dicdf['dfhousing'])
Sorry this is coming in 2+ years late, but I think you can do this without iterating, based on help I received to a similar question here. Applying it to your data:
import pandas as pd
import math
percentage_to_flag = 0.2 #I'm assuming you want the same %age for all 'types of use'?
#First, create a new 'helper' dataframe:
random_state = 41 # Change to get different random values.
df_sample = df.groupby("type of use").apply(lambda x: x.sample(n=(math.ceil(percentage_to_flag * len(x))),random_state=random_state))
df_sample = df_sample.reset_index(level=0, drop=True) #may need this to simplify multi-index dataframe
# Now, mark the random sample in a new column in the original dataframe:
df["marked"] = False
df.loc[df_sample.index, "marked"] = True
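For what it's worth, newer pandas (1.1+) can do the grouped sampling in a single call with GroupBy.sample, which avoids both the loop and the helper dataframe (toy data below):

```python
import pandas as pd

df = pd.DataFrame({
    'type of use': ['housing'] * 5 + ['retail'] * 5,
    'value': range(10),
})

# One call draws frac of each group; requires pandas >= 1.1
sampled = df.groupby('type of use').sample(frac=0.2, random_state=41)
print(sampled)
```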

creating and naming pandas series in for loop

Goal: pass a list of N ints to a function and use those ints to 1) create and name N columns in a pandas dataframe and 2) calculate the rolling mean using those ints as the lookback period.
here is the code for the function (with a data pull for reproducibility):
import pandas as pd
import pandas_datareader as web

test_df = web.DataReader('GDP', data_source='fred')

def sma(df, sma_lookbacks=[1, 2]):
    df = pd.DataFrame(df)
    df = df.dropna()
    for lookback in sma_lookbacks:
        df['SMA' + str(lookback)] = df.rolling(window=lookback).mean()
    return df.tail()

sma(test_df)
Error received:
ValueError: Wrong number of items passed 2, placement implies 1
Do I have a logic problem here? I believe in the for loop it should be passing the ints in sequence not at once, so I do not quite understand how it is passing more than one value at a time. As a result, I'm not sure how to troubleshoot.
According to this post, this error is thrown when you are simultaneously passing multiple values to a container that can only take one value. Shouldn't the for loop address that?
ValueError: Wrong number of items passed - Meaning and suggestions?
I think pandas searches for the column name before assigning the values returned from a function applied to the dataframe. So initialize the column with a scalar at the beginning, before assigning the series returned from the function to that column, i.e.
import pandas as pd
import pandas_datareader as web

test_df = web.DataReader('GDP', data_source='fred')

def sma(df, sma_lookbacks=[1, 2]):
    df = pd.DataFrame(df)
    df = df.dropna()
    for lookback in sma_lookbacks:
        df['SMA' + str(lookback)] = 0
        df['SMA' + str(lookback)] = df.rolling(window=lookback).mean()
    return df.tail()

Output:
GDP SMA1 SMA2
DATE
2016-04-01 18538.0 18538.0 18431.60
2016-07-01 18729.1 18729.1 18633.55
2016-10-01 18905.5 18905.5 18817.30
2017-01-01 19057.7 19057.7 18981.60
2017-04-01 19250.0 19250.0 19153.85
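The root cause of the original ValueError is that df.rolling(...).mean() returns the rolling mean of every column, so on the second pass a two-column result is assigned to a single column. A minimal sketch (with a toy frame standing in for the FRED download) that restricts the rolling window to the original column:

```python
import pandas as pd

def sma(df, sma_lookbacks=[1, 2]):
    df = df.dropna().copy()
    for lookback in sma_lookbacks:
        # Roll only the first (original) column, so a Series is assigned, not a DataFrame
        df['SMA' + str(lookback)] = df.iloc[:, 0].rolling(window=lookback).mean()
    return df.tail()

# Toy frame standing in for the FRED GDP download, using the figures shown above
quarterly = pd.DataFrame({'GDP': [18538.0, 18729.1, 18905.5, 19057.7, 19250.0]})
print(sma(quarterly))
```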
