Randomly Modify Entries in CSV using Pandas - python

I have a CSV file containing data like this:
DateTime, product_x, product_y, product_z
2018-01-02 00:00:00,945,1318,17.12
2018-01-03 00:00:00,958,1322,17.25
...
I want to use Python and Pandas to modify the values of product_x, product_y and product_z by some random amount, say by adding a random value from -3 to +3 to each, and then write the result back to a CSV.
EDIT: I need each cell shifted by a different amount (except for random coincidences).
How do I do this please?

Use np.random.randint with a list of the column names to generate a 2d array, and add it to the original columns selected with the same list. Note that randint's upper bound is exclusive, so use 4 to include +3:
cols = ['product_x','product_y','product_z']
# dynamic column names
#cols = df.filter(like='product').columns
df[cols] += np.random.randint(-3, 4, size=(len(df.index), len(cols)))
print (df)
DateTime product_x product_y product_z
0 2018-01-02 00:00:00 947 1320 17.12
1 2018-01-03 00:00:00 958 1323 17.25
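For the full round trip described in the question (read the CSV, shift the values, write it back), a minimal sketch; the file names data.csv and data_modified.csv are just placeholders for illustration:
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')

cols = ['product_x', 'product_y', 'product_z']
# randint's upper bound is exclusive, so 4 is needed to include +3
df[cols] += np.random.randint(-3, 4, size=(len(df.index), len(cols)))

# write the modified data back out without the integer index
df.to_csv('data_modified.csv', index=False)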

Related

Split dataframe into many sub-dataframes based on timestamp

I have a large csv with the following format:
timestamp,name,age
2020-03-01 00:00:01,nick
2020-03-01 00:00:01,john
2020-03-01 00:00:02,nick
2020-03-01 00:00:02,john
2020-03-01 00:00:04,peter
2020-03-01 00:00:05,john
2020-03-01 00:00:10,nick
2020-03-01 00:00:12,john
2020-03-01 00:00:54,hank
2020-03-01 00:01:03,peter
I load csv into a dataframe with:
df = pd.read_csv("/home/test.csv")
and then I want to create multiple dataframes every 2 seconds. For example:
df1 contains:
2020-03-01 00:00:01,nick
2020-03-01 00:00:01,john
2020-03-01 00:00:02,nick
2020-03-01 00:00:02,john
df2 contains :
2020-03-01 00:00:04,peter
2020-03-01 00:00:05,john
and so on.
I managed to generate the timestamp bins with the command below:
full_idx = pd.date_range(start=df['timestamp'].min(), end = df['timestamp'].max(), freq ='0.2T')
but how can I store these split dataframes? How can I split a dataset into multiple dataframes based on timestamps?
Probably that question can help us: Pandas: Timestamp index rounding to the nearest 5th minute
import numpy as np
import pandas as pd

df = pd.read_csv("test.csv")
df['timestamp'] = pd.to_datetime(df['timestamp'])

ns_per_sec = 1_000_000_000  # nanoseconds in one second
# round each timestamp down to the nearest 2 seconds
timestamp_sec = df['timestamp'].astype(np.int64) // ns_per_sec
df['full_idx'] = pd.to_datetime((timestamp_sec - timestamp_sec % 2) * ns_per_sec)

# store one sub-dataframe for each unique value of the rounded index
store_array = []
for value in df['full_idx'].unique():
    store_array.append(df[df['full_idx'] == value][['timestamp', 'name', 'age']])
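A more compact alternative sketch (equivalent in effect, not the code above): pandas can do the same rounding with dt.floor and group directly on the result:
df['timestamp'] = pd.to_datetime(df['timestamp'])
# one sub-dataframe per 2-second bin, rounded down
store_array = [chunk for _, chunk in df.groupby(df['timestamp'].dt.floor('2s'))]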
How about .resample()?
#first loading your data
>>> import pandas as pd
>>>
>>> df = pd.read_csv('dates.csv', index_col='timestamp', parse_dates=True)
>>> df.head()
name age
timestamp
2020-03-01 00:00:01 nick NaN
2020-03-01 00:00:01 john NaN
2020-03-01 00:00:02 nick NaN
2020-03-01 00:00:02 john NaN
2020-03-01 00:00:04 peter NaN
#resampling it at a frequency of 2 seconds
>>> resampled = df.resample('2s')
>>> type(resampled)
<class 'pandas.core.resample.DatetimeIndexResampler'>
#iterating over the resampler object and storing the sliced dfs in a dictionary
>>> df_dict = {}
>>> for i, (timestamp, df) in enumerate(resampled):
...     df_dict[i] = df
...
>>> df_dict[0]
name age
timestamp
2020-03-01 00:00:01 nick NaN
2020-03-01 00:00:01 john NaN
Now for some explanation...
resample() is great for rebinning DataFrames based on time (I use it often for downsampling time series data), but it can also be used simply to cut up the DataFrame, as you want to do. Iterating over the resampler object produced by df.resample() yields tuples of (name of the bin, df corresponding to that bin): e.g. the first tuple is (timestamp marking the start of the first bin, data for the first 2 seconds). So to get the DataFrames out, we can loop over this object and store them somewhere, like a dict.
Note that this will produce every 2-second interval from the start to the end of the data, so many will be empty given your data. But you can add a step to filter those out if needed.
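A small sketch of that filtering step, building on the resampled object from above and keeping only the non-empty bins:
>>> df_dict = {}
>>> for timestamp, chunk in resampled:
...     if not chunk.empty:
...         df_dict[timestamp] = chunk
...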
Additionally, you could manually assign each sliced DataFrame to its own variable, but this would be cumbersome (you would probably need a line for each 2-second bin, rather than a single small loop). With a dictionary, you can still look each DataFrame up by name. You could also use an OrderedDict, a list, or whatever collection suits you.
A couple of points on your script:
Setting freq to "0.2T" gives 12 seconds (0.2 * 60), not 2 seconds; use freq="2s" instead.
Your example df1 and df2 are "out of phase": one is binned starting on an odd second (1-2 seconds) while the other starts on an even second (4-5 seconds). So the date_range you mention would not create those bins; it would create dfs covering either 0-1s, 2-3s, 4-5s, ... or 1-2s, 3-4s, 5-6s, ..., depending on which timestamp it started on.
For the latter point, you can use the base argument of .resample() to set the "phase" of the resampling. So in the case above, base=0 would start bins on even numbers, and base=1 would start bins on odds.
This is assuming you are okay with that type of binning - if you really want 1-2 seconds and 4-5 seconds to be in different bins, you would have to do something more complicated I believe.
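For illustration, a sketch of both spellings (base was deprecated in pandas 1.1 in favour of the offset argument, so which one applies depends on your version):
>>> resampled = df.resample('2s', base=1)       # older pandas: bins start on odd seconds
>>> resampled = df.resample('2s', offset='1s')  # pandas >= 1.1 equivalent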

Python / Pandas: How to create an empty multi-index DataFrame, and then start filling it?

I would like to store a summary of a local set of DataFrames in a "meta DataFrame" using pd.MultiIndex.
Basically, the row axis has two levels, and so does the column axis.
In the class managing the set of DataFrames, I define this "meta DataFrame" as a class variable.
import pandas as pd
row_axis = pd.MultiIndex(levels=[[],[]], codes=[[],[]], names=['Data', 'Period'])
column_axis = pd.MultiIndex(levels=[[],[]], codes=[[],[]], names=['Data', 'Extrema'])
MD = pd.DataFrame(index=row_axis, columns=column_axis)
It seems to work.
MD.index
>>> MultiIndex([], names=['Data', 'Period'])
MD.columns
>>> MultiIndex([], names=['Data', 'Extrema'])
Now, each time I process an individual DataFrame id, I want to update this "Meta DataFrame" accordingly. id has a DateTimeIndex with period '5m'.
id.index[0]
>>> Timestamp('2020-01-01 08:00:00')
id.index[-1]
>>> Timestamp('2020-01-02 08:00:00')
I want to keep in MD its first and last index values for instance.
MD.loc[[('id', '5m')],[('Timestamp', 'First')]] = id.index[0]
MD.loc[[('id', '5m')],[('Timestamp', 'Last')]] = id.index[-1]
This doesn't work, I get following error message:
TypeError: unhashable type: 'list'
In the end, I would like MD to hold the following type of info (I have other id DataFrames with different periods):
Timestamp
First Last
id 5m 2020-01-01 08:00:00 2020-01-02 08:00:00
10m 2020-01-05 08:00:00 2020-01-06 18:00:00
Ultimately, I will also keep min and max of some columns in id.
For instance if id has a column 'Temperature'.
Timestamp Temperature
First Last Min Max
id 5m 2020-01-01 08:00:00 2020-01-02 08:00:00 -2.5 10
10m 2020-01-05 08:00:00 2020-01-06 18:00:00 4 15
These values will be recorded when I record id.
I am aware that initializing a DataFrame cell by cell is not time efficient, but it will not be done that often.
Besides, I don't see how I can manage this organization of information in a Dict, which is why I am considering doing it with a multi-level DataFrame.
I will then dump it in a csv file to store these "meta data".
Please, what is the right way to initialize each of these values in MD?
I thank you for your help!
Bests,
Instead of filling an empty DataFrame, you can store the data in a dict of dicts. A MultiIndex uses tuples as its index values, so we make the keys of each dictionary tuples.
The outer dictionary uses the column MultiIndex tuples as keys; each value is another dictionary whose keys are the row MultiIndex tuples and whose values are the cell contents.
d = {('Score', 'Min'):       {('id1', '5m'): 72, ('id1', '10m'): -18},
     ('Timestamp', 'First'): {('id1', '5m'): 1,  ('id1', '10m'): 2},
     ('Timestamp', 'Last'):  {('id1', '5m'): 10, ('id1', '10m'): 20}}
#    ^ column MultiIndex     ^ row MultiIndex    ^ cell value
#      label                   label

pd.DataFrame(d)
Score Timestamp
Min First Last
id1 5m 72 1 10
10m -18 2 20
Creating that dict will depend on how you get the values; you can extend a dict incrementally with update.
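For instance, a minimal sketch of building the dict while each DataFrame id is processed (the row key ('id1', '5m') is purely illustrative):
d = {}
# setdefault creates the inner dict the first time a column label is seen
d.setdefault(('Timestamp', 'First'), {})[('id1', '5m')] = id.index[0]
d.setdefault(('Timestamp', 'Last'), {})[('id1', '5m')] = id.index[-1]

MD = pd.DataFrame(d)
MD.index.names = ['Data', 'Period']
MD.columns.names = ['Data', 'Extrema']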

Apply function to all columns in pd dataframe using variables in other dataframes

I have a series of dataframes; some hold static values, some are time series.
I have been able to add
I want to transpose the values from one time series to a new time series, applying a function which draws values from both the original time series dataframe and the dataframe which holds static values.
A snip of the time series and static dataframes are below.
Time series dataframe (Irradiance)
Datetime Date Time GHI DIF flagR SE SA TEMP
2017-07-01 00:11:00 01.07.2017 00:11 0 0 0 -9.39 -179.97 11.1
2017-07-01 00:26:00 01.07.2017 00:26 0 0 0 -9.33 -176.47 11.0
2017-07-01 00:41:00 01.07.2017 00:41 0 0 0 -9.14 -172.98 10.9
2017-07-01 00:56:00 01.07.2017 00:56 0 0 0 -8.83 -169.51 10.9
2017-07-01 01:11:00 01.07.2017 01:11 0 0 0 -8.40 -166.04 11.0
Static dataframe (Properties)
Bearing (degrees) Inclination (degrees)
Home ID
151631 244 29
151632 244 29
151633 184 44
I have written a function which I want to use to populate a new dataframe using values from both of these.
def dif(DIF, Inclination, GHI):
    global Albedo
    return DIF * (1 + math.cos(math.radians(Inclination)) / 2) + (GHI * Albedo * (1 - math.cos(math.radians(Inclination)) / 2))
When I have tried to do something similar within a single dataframe, I have used the NumPy vectorize function, so I thought I would be able to iterate over each column of the new dataframe using the following code.
for column in DIF:
    DIF[column] = np.vectorize(dif)(irradiance['DIF'], properties.iloc['Inclination (degrees)'][column], irradiance['GHI'])
Instead this throws the following error.
TypeError: cannot do positional indexing on <class 'pandas.core.indexes.numeric.Int64Index'> with these indexers [Inclination (degrees)] of <class 'str'>
I've checked the dtype of the Inclination (degrees) values, and it is returned as Int64, not str, so I'm not sure why this error is being generated.
I'm obviously missing something critical here. Are there alternative methods that would work better, or at all? Any help would be much appreciated.
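One likely culprit: .iloc only accepts integer positions, while 'Inclination (degrees)' is a label, so a label-based lookup with .loc is probably what was intended. A hedged sketch of that change, assuming the columns of DIF are the Home ID values from the Properties dataframe:
for column in DIF:
    DIF[column] = np.vectorize(dif)(
        irradiance['DIF'],
        properties.loc[column, 'Inclination (degrees)'],
        irradiance['GHI'],
    )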

Moving average on pandas.groupby object that respects time

Given a pandas dataframe in the following format:
toy = pd.DataFrame({
    'id': [1, 2, 3,
           1, 2, 3,
           1, 2, 3],
    'date': ['2015-05-13', '2015-05-13', '2015-05-13',
             '2016-02-12', '2016-02-12', '2016-02-12',
             '2018-07-23', '2018-07-23', '2018-07-23'],
    'my_metric': [395, 634, 165,
                  144, 305, 293,
                  23, 395, 242]
})
# Make sure 'date' has datetime format
toy.date = pd.to_datetime(toy.date)
The my_metric column contains some (random) metric I wish to compute a time-dependent moving average of, conditional on the column id and within some time interval that I specify myself. I will refer to this time interval as the "lookback time", which could be 5 minutes or 2 years. To determine which observations are included in the lookback calculation, we use the date column (which could be the index if you prefer).
To my frustration, I have discovered that such a procedure is not easily performed using pandas built-ins, since the calculation needs to be done conditionally on id while only including observations within the lookback time (checked using the date column). Hence, the output dataframe should consist of one row for each id-date combination, with the my_metric column now being the average of all observations contained within the lookback time (e.g. 2 years, including the current date).
For clarity, I have included a figure with the desired output format (apologies for the oversized figure) when using a 2-year lookback time:
I have a solution, but it does not make use of specific pandas built-in functions and is likely sub-optimal (a combination of a list comprehension and a single for-loop). The solution I am looking for should not make use of a for-loop, and would thus be more scalable/efficient/fast.
Thank you!
Calculating the lookback time (current date - 2 years):
from dateutil.relativedelta import relativedelta
from dateutil import parser
import datetime
In [1691]: dt = '2018-01-01'
In [1695]: dt = parser.parse(dt)
In [1696]: lookback_time = dt - relativedelta(years=2)
Now, filter the dataframe on lookback time and calculate rolling average
In [1722]: toy['new_metric'] = ((toy.my_metric + toy[toy.date > lookback_time].groupby('id')['my_metric'].shift(1))/2).fillna(toy.my_metric)
In [1674]: toy.sort_values('id')
Out[1674]:
date id my_metric new_metric
0 2015-05-13 1 395 395.0
3 2016-02-12 1 144 144.0
6 2018-07-23 1 23 83.5
1 2015-05-13 2 634 634.0
4 2016-02-12 2 305 305.0
7 2018-07-23 2 395 350.0
2 2015-05-13 3 165 165.0
5 2016-02-12 3 293 293.0
8 2018-07-23 3 242 267.5
So, after some tinkering I found an answer that will generalize adequately. I used a slightly different 'toy' dataframe (slightly more relevant to my case). For completeness sake, here is the data:
Consider now the following code:
# Define a custom function which groups by time (using the index)
def rolling_average(x, dt):
    xt = x.sort_index().groupby(lambda x: x.time()).rolling(window=dt).mean()
    xt.index = xt.index.droplevel(0)
    return xt
dt='730D' # rolling average window: 730 days = 2 years
# Group by the 'id' column
g = toy.groupby('id')
# Apply the custom function
df = g.apply(rolling_average, dt=dt)
# Massage the data to appropriate format
df.index = df.index.droplevel(0)
df = df.reset_index().drop_duplicates(keep='last', subset=['id', 'date'])
The result is as expected:
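A possible built-in alternative (a sketch, not the code above): pandas rolling windows accept a time offset when the index is a DatetimeIndex, so the per-id 2-year lookback can also be written as:
out = (toy.set_index('date')
          .sort_index()
          .groupby('id')['my_metric']
          .rolling('730D')   # 2-year lookback window
          .mean()
          .reset_index())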

How to combine daily data with monthly data in Pandas?

I have daily data, and also monthly numbers. I would like to normalize the daily data by the monthly number - so for example the first 31 days of 2017 are all divided by the number corresponding to January 2017 from another data set.
import pandas as pd
import datetime as dt
N=100
start=dt.datetime(2017,1,1)
df_daily=pd.DataFrame({"a":range(N)}, index=pd.date_range(start, start+dt.timedelta(N-1)))
df_monthly=pd.Series([1, 2, 3], index=pd.PeriodIndex(["2017-1", "2017-2", "2017-3"], freq="M"))
df_daily["a"] / df_monthly # ???
I was hoping the time series data would align in a one-to-many fashion and do the required operation, but instead I get a lot of NaN.
How would you do this one-to-many data alignment correctly in Pandas?
I might also want to concat the data, in which case I expect the monthly data to duplicate values within one month.
You can extract the information with to_period('M') and then use map.
df_daily["month"] = df_daily.index.to_period('M')
df_daily['a'] / df_daily["month"].map(df_monthly)
Without creating the month column, you can use
df_daily['a'] / df_daily.index.to_period('M').to_series().map(df_monthly)
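For the concat case mentioned in the question (keeping the monthly value as a column next to the daily data), a small sketch along the same lines:
df_daily['monthly'] = df_daily.index.to_period('M').map(df_monthly)
df_daily['normalized'] = df_daily['a'] / df_daily['monthly']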
You can create a temporary key from the index's month, then merge both dataframes on that key, i.e.
df_monthly = df_monthly.to_frame().assign(key=df_monthly.index.month)
df_daily = df_daily.assign(key=df_daily.index.month)
df_new = df_daily.merge(df_monthly, how='left').set_index(df_daily.index).drop('key', axis=1)
a 0
2017-01-01 0 1.0
2017-01-02 1 1.0
2017-01-03 2 1.0
2017-01-04 3 1.0
2017-01-05 4 1.0
For the division you can then simply do:
df_new['b'] = df_new['a'] / df_new[0]
