Creating and naming pandas Series in a for loop - Python

Goal: pass a list of N ints to a function and use those ints to (1) create and name N columns in a pandas DataFrame, and (2) calculate the rolling mean using those ints as the lookback periods.
Here is the code for the function (with a data pull for reproducibility):
import pandas as pd
import pandas_datareader as web

test_df = web.DataReader('GDP', data_source='fred')

def sma(df, sma_lookbacks=[1, 2]):
    import pandas as pd
    df = pd.DataFrame(df)
    df = df.dropna()
    for lookback in sma_lookbacks:
        df[str('SMA' + str(lookback))] = df.rolling(window=lookback).mean()
    return df.tail()

sma(test_df)
Error received:
ValueError: Wrong number of items passed 2, placement implies 1
Do I have a logic problem here? I believe the for loop should be passing the ints in sequence, not all at once, so I do not quite understand how more than one value is being passed at a time. As a result, I'm not sure how to troubleshoot.
According to this post, this error is thrown when you are simultaneously passing multiple values to a container that can only take one value. Shouldn't the for loop address that?
ValueError: Wrong number of items passed - Meaning and suggestions?
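The loop does pass one lookback at a time; the trouble is what df.rolling(...).mean() returns on each pass. A minimal sketch of the mechanism with toy data (not the FRED pull):
import pandas as pd

df = pd.DataFrame({'GDP': [1.0, 2.0, 3.0, 4.0]})
df['SMA1'] = df['GDP'].rolling(window=1).mean()  # one Series in, one column out: fine

# After the first pass the frame has two columns, so rolling over the whole
# frame now yields two columns of output:
print(df.rolling(window=2).mean().shape)  # (4, 2)

# Assigning that two-column result to the single new column 'SMA2' is what
# triggers the "Wrong number of items passed 2, placement implies 1" error
# reported above (the exact wording varies with pandas version).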

I think pandas searches for the column name before assigning the values returned from the function applied to the dataframe. So initialize the column with some scalar at the beginning, before assigning the series returned from the function to that column, i.e.:
import pandas as pd
import pandas_datareader as web

test_df = web.DataReader('GDP', data_source='fred')

def sma(df, sma_lookbacks=[1, 2]):
    df = pd.DataFrame(df)
    df = df.dropna()
    for lookback in sma_lookbacks:
        df['SMA' + str(lookback)] = 0  # initialize the column first
        df['SMA' + str(lookback)] = df.rolling(window=lookback).mean()
    return df.tail()
GDP SMA1 SMA2
DATE
2016-04-01 18538.0 18538.0 18431.60
2016-07-01 18729.1 18729.1 18633.55
2016-10-01 18905.5 18905.5 18817.30
2017-01-01 19057.7 19057.7 18981.60
2017-04-01 19250.0 19250.0 19153.85
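For what it's worth, a sketch of an alternative that sidesteps the problem by rolling over the source column only, so each assignment is a single Series (this reuses test_df from above; the column selection is my own, not part of the answer):
def sma(df, sma_lookbacks=(1, 2)):
    df = pd.DataFrame(df).dropna()
    source = df.columns[0]  # 'GDP' for the FRED pull above
    for lookback in sma_lookbacks:
        # rolling over one Series returns one Series, so the assignment is unambiguous
        df['SMA' + str(lookback)] = df[source].rolling(window=lookback).mean()
    return df.tail()

sma(test_df)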

Related

Parse CSV in 2D Python Object

I am trying to do analysis on a CSV file which looks like this:
timestamp        value
1594512094.39    51
1594512094.74    76
1594512098.07    50.9
1594512099.59    76.80000305
1594512101.76    50.9
I am using pandas to import each column:
dataFrame = pandas.read_csv('iot_telemetry_data.csv')
graphDataHumidity: object = dataFrame.loc[:, "humidity"]
graphTime: object = dataFrame.loc[:, "ts"]
My problem is that I need to make a tuple of both columns so I can filter the values for a specific time range. For example, I have a timestampBeginn of "1594512109.13668" and a timestampEnd of "1594512129.37415", and I want the corresponding values so I can, for example, compute the mean value over that specific time range.
I didn't find any solutions to this online, and I don't know of any libraries that solve this problem.
You can first filter the rows whose timestamp values lie between start and end, then calculate the mean of the filtered rows, as follows.
(In the sample data, however, there seems to be no row whose timestamp falls between 1594512109.13668 and 1594512129.37415; edit the range values as you need.)
import pandas as pd
df = pd.read_csv('iot_telemetry_data.csv')
start = 1594512109.13668
end = 1594512129.37415
df = df[(df['timestamp'] >= start) & (df['timestamp'] <= end)]
average = df['value'].mean()
print(average)
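The same filter can also be written with Series.between, which is inclusive of both endpoints by default; a small variation on the answer above, assuming the same start and end values:
df = pd.read_csv('iot_telemetry_data.csv')
mask = df['timestamp'].between(start, end)
average = df.loc[mask, 'value'].mean()
print(average)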

Taking first value in a rolling window that is not numeric

This question follows one I previously asked here, and that was answered for numeric values.
I now raise this second one for data of Period type.
While the example given below looks simple, my actual windows are of variable size. Since I am interested in the first row of each window, I am looking for a technique that makes use of this definition.
import pandas as pd
from random import seed, randint
# DataFrame
pi1h = pd.period_range(start='2020-01-01 00:00+00:00', end='2020-01-02 00:00+00:00', freq='1h')
seed(1)
values = [randint(0, 10) for ts in pi1h]
df = pd.DataFrame({'Values' : values, 'Period' : pi1h}, index=pi1h)
# This works (numeric type)
df['first'] = df['Values'].rolling(3).agg(lambda rows: rows[0])
# This doesn't (Period type)
df['OpeningPeriod'] = df['Period'].rolling(3).agg(lambda rows: rows[0])
Result of the 2nd command:
DataError: No numeric types to aggregate
Any ideas? Thanks for any help!
The first row of a rolling window of size 3 is the row 2 positions above the current one, so just use pd.Series.shift(2):
df['OpeningPeriod'] = df['Period'].shift(2)
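As a quick sanity check on the numeric column, the two approaches line up (a small sketch assuming the df built in the question):
check = pd.DataFrame({
    'rolling_first': df['Values'].rolling(3).agg(lambda rows: rows[0]),
    'shift2': df['Values'].shift(2),
})
print(check.head())  # the two columns agree row for row (NaN in the first two rows)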
For the variable size (for the sake of example, I took the Values column as the variable window size):
import numpy as np

# position of the row 'Values' places above the current one
x = np.arange(len(df)) - df['Values']
df['OpeningPeriod'] = np.where(x.ge(0), df.loc[df.index[x.tolist()], 'Period'], np.nan)
Convert your period[H] to a float
# convert to float
df['Period1'] = df['Period'].dt.to_timestamp().values.astype(float)
# rolling and convert back to period
df['OpeningPeriod'] = pd.to_datetime(df['Period1'].rolling(3)\
.agg(lambda rows: rows[0])).dt.to_period('1h')
# drop column
df = df.drop(columns='Period1')

pandas time series slicing: between_time returns NaN when used inside np.where

I'm trying to label a certain period of time that occurs repeatedly in a time series data set. I'm using between_time() inside of a np.where(). It returns a NaN value.
What am I missing?
import pandas as pd
import numpy as np
data_df = pd.read_csv("data.csv")
data_df['Datetime'] = pd.to_datetime(data_df['Date'] + ' ' + data_df['Time'])
data_df = data_df.set_index('Datetime')
data_df['label'] = pd.Series(np.where(data_df['Time'].between_time('16:00', '9:00'), "time1", "time2"))
data_df.head()
There is an alternative solution using apply, which may be a little tricky but does the job:
df['Label'] = df['time'].apply(lambda x : 'time1' if x < '09:00:00' and x > '06:00:00' else 'time2')
print(df[df['Label']=='time1'].head()) #show time1 values
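For completeness: between_time selects rows rather than returning a boolean mask, and the bare pd.Series built around np.where in the question gets a fresh RangeIndex that cannot align with the DatetimeIndex, which is where the NaNs come from. A sketch of an index-based variant (assuming data_df is indexed by 'Datetime' as in the question):
# rows whose clock time falls between 16:00 and 09:00 (wrapping past midnight)
in_range = data_df.index.isin(data_df.between_time('16:00', '09:00').index)
data_df['label'] = np.where(in_range, 'time1', 'time2')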

Pandas Dataframe filtering groups based on dates frequency

I have a pandas data frame with the following columns:
User_id (numeric) | Day (DateTime) | Data (numeric)
What I want is to group by user_id so that I keep just those users for whom I have Data over a period of 15 consecutive days.
Say, if I had data from 01-05 (dd-mm) to 16-05 (dd-mm), the rows referring to that user would be kept.
Ex.:
df1 = pd.DataFrame([['13-01-2018', 1], ['14-01-2018', 2], ['15-01-2018', 3],
                    ['13-02-2018', 1], ['14-02-2018', 2], ['15-02-2018', 3]])
#Apply solution to extract data of first N consecutive dates with N = 3
result.head()
0 1
0 13-01-2018 1
1 14-01-2018 2
2 15-01-2018 3
Don't be afraid to ask for further details! Sorry I couldn't be more specific.
I finally managed to solve it.
At first I thought my solution wouldn't be optimal and would take too much time, but it turned out to work pretty well.
I defined a function that, given an n_days time window, traverses the data looking for a delta that fits the window size (n_days). In other words, it looks for n_days consecutive dates.
def consecutive(x, n_days):
    # x: array of unique dates for one user; returns the first window of
    # n_days consecutive dates, or None if there is no such window
    result = None
    if len(x) >= n_days:
        stop = False
        i = n_days
        while stop == False and i < len(x) + 1:
            window = x[i - n_days:i]
            delta = (window[-1] - window[0]).days  # span of the window in days
            if delta == n_days - 1:
                stop = True
                result = window
            i = i + 1
    return result
Then call it using apply:
b = a.groupby('user_id')['day'].unique().apply(lambda x: consecutive(x, 15))
df = b.loc[b.apply(lambda x: x is not None)].reset_index()
The next step is to transform the dataframe into one with a row per returned date.
import itertools
import pandas as pd
import numpy as np

def melt_series(s):
    # expand a Series of list-likes into one row per element, repeating the index
    lengths = s.str.len().values
    flat = [i for i in itertools.chain.from_iterable(s.values.tolist())]
    idx = np.repeat(s.index.values, lengths)
    return pd.Series(flat, idx, name=s.name)

df = melt_series(df.day).to_frame().join(df.drop(columns='day')).reindex(columns=df.columns)
And merge it with the actual dataframe:
final = pd.merge(a, df[['user_id', 'day']], on=['user_id', 'day'])
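On pandas 0.25 or later, the melt helper can be replaced with Series.explode; a minimal sketch, assuming b is the per-user Series of date arrays produced above:
df = (b.dropna()    # drop users for whom consecutive() returned None
       .explode()   # one row per date in each returned window
       .rename('day')
       .reset_index())
final = pd.merge(a, df[['user_id', 'day']], on=['user_id', 'day'])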

Python Pandas 'apply' returns series; can't convert to dataframe

OK, I'm at half-wit's end. I'm geocoding a dataframe with geopy. I've written a simple function to take an input - country name - and return the latitude and longitude. I use apply to run the function and it returns a Pandas series object. I can't seem to convert it to a dataframe. I'm sure I'm missing something obvious, but I'm new to python and still RTFMing. BTW, the geocoder function works great.
# Import libraries
import os
import pandas as pd
import numpy as np
from geopy.geocoders import Nominatim

def locate(x):
    geolocator = Nominatim()
    # print(x)  # debug
    try:
        # Get geocode
        location = geolocator.geocode(x, timeout=8, exactly_one=True)
        lat = location.latitude
        lon = location.longitude
    except:
        # didn't work for some reason that I really don't care about
        lat = np.nan
        lon = np.nan
    # print(lat, lon)  # debug
    return lat, lon  # Note: also tried return {'LAT': lat, 'LON': lon}

df_geo_in = df_addr.drop_duplicates(['COUNTRY']).reset_index()  # works perfectly
df_geo_in['LAT'], df_geo_in['LON'] = df_geo_in.applymap(locate)
# error: returns more than 2 values - default index + column with results
I also tried
df_geo_in['LAT','LON'] = df_geo_in.applymap(locate)
I get a single dataframe with no index and a single column with the series in it.
I've tried a number of other methods, including 'applymap':
source_cols = ['LAT','LON']
new_cols = [str(x) for x in source_cols]
df_geo_in = df_addr.drop_duplicates(['COUNTRY']).set_index(['COUNTRY'])
df_geo_in[new_cols] = df_geo_in.applymap(locate)
which returned an error after a long time:
ValueError: Columns must be same length as key
I've also tried manually converting the series to a dataframe using the df.from_dict(df_geo_in) method without success.
The goal is to geocode 166 unique countries, then join it back to the 188K addresses in df_addr. I'm trying to be pandas-y in my code and not write loops if possible. But I haven't found the magic to convert series into dataframes and this is the first time I've tried to use apply.
Thanks in advance - ancient C programmer
I'm assuming that df_geo is a df with a single column so I believe the following should work:
change:
return lat, lon
to
return pd.Series([lat, lon])
then you should be able to assign like so:
df_geo_in[['LAT', 'LON']] = df_geo_in.apply(locate)
What you tried to do was assign the result of applymap to 2 new columns, which is incorrect here: applymap is designed to work on every element in a df, so unless the left-hand side has the same expected shape this won't give the desired result.
Your latter method is also incorrect because you drop the duplicate countries and then expect this to assign every country's geolocation back, but the shapes are different.
For large dfs it is probably quicker to create a non-duplicated geolocation df and then merge it back into your larger df, like so:
geo_lookup = df_addr.drop_duplicates(['COUNTRY'])
geo_lookup[['LAT','LNG']] = geo_lookup['COUNTRY'].apply(locate)
df_geo_in.merge(geo_lookup, left_on='COUNTRY', right_on='COUNTRY', how='left')
This creates a df of non-duplicated countries with their geolocations, and then we perform a left merge back to the master df.
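A minimal, self-contained sketch of the return-a-Series pattern, with a stubbed locate and made-up coordinates so it runs offline (naming the Series index after the target columns also keeps the assignment unambiguous):
import numpy as np
import pandas as pd

def locate(country):
    # stub standing in for the geopy call; the coordinates are invented
    fake_coords = {'France': (46.2, 2.2), 'Japan': (36.2, 138.3)}
    lat, lon = fake_coords.get(country, (np.nan, np.nan))
    return pd.Series([lat, lon], index=['LAT', 'LNG'])

df_addr = pd.DataFrame({'COUNTRY': ['France', 'Japan', 'France']})
geo_lookup = df_addr.drop_duplicates(['COUNTRY']).copy()
geo_lookup[['LAT', 'LNG']] = geo_lookup['COUNTRY'].apply(locate)
print(df_addr.merge(geo_lookup, on='COUNTRY', how='left'))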
It's always easier to test with some sample data, but please try the following zip approach to see if it works:
df_geo_in['LAT_LON'] = df_geo_in.applymap(locate)
df_geo_in['LAT'], df_geo_in['LON'] = zip(*df_geo_in.LAT_LON)
