Python Pandas - Cumulative time series data, calculate change?

I'm pulling data from an API at intervals. Each item I pull has a "start date" for the ad campaign (t1), and every subsequent point carries an increasing t2 value (the time of the pull). It's for a learning project, as I'm relatively new to data science.
The values, such as revenue, cost, clicks, conversions etc., are cumulative. To find the change from one data point to the next, I'd have to subtract n - (n-1), since n contains the data from (n-1).
I pull the data into a dataframe using the following (the database is SQLite for now):
SQL = """SELECT
MAX(a.t2) as "Snapshot time",
a.volid AS "Camp ID",
a.tsid AS "TS ID",
a.placement as "Source ID",
a.clicks AS "Clicks tracker",
a.visits AS "Visits tracker",
a.conversions AS "Conversion",
a.revenue AS "Revenue USD",
b.cost AS "Cost USD" ,
b.clicks AS "ts Clicks",
from tracker a JOIN ts b ON a.placement = b.placement AND a.tsid =
b.campaignid AND a.t2 = b.t2
GROUP BY a.voli, a.tsid, a.placement"""
df = pd.read_sql_query(SQL, conn)
df_t2['snapshot'] = pd.to_datetime(df_t2['snapshot'], format='%Y-%m-%dT%H:%M:%S.%fZ')
# Generate time value for the second sql query, for n-x
t1 = df_t2['snapshot'].max() - dt.timedelta(hours=offset)
t1 = t1.strftime('%Y-%m-%dT%H:%M:%S.%fZ')
This gives me the latest snapshot (t0 to tn). My initial thought was to build a similar dataframe for (t0 to t(n-1)) and subtract the two, which is where the t1 variable in the code above comes in.
But I tried this and couldn't get it to work. I also tried to handle it as a time series in Pandas, but I'm not sure whether my data structure is suitable for that.
The expected behaviour would be a dataframe containing only the data for tn - t(n-1). Even better would be to generate all the n - (n-1) increments for the entire series, so that each record is an increment rather than a cumulative value.
Any input would be much appreciated. Thanks in advance.
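A hedged sketch (not the original poster's code) of the grouped-diff route: assuming all snapshots are pulled into one dataframe (i.e. dropping the MAX()/GROUP BY aggregation so every t2 row survives), the per-interval increments fall out of a groupby().diff():
key_cols = ['Camp ID', 'TS ID', 'Source ID']
value_cols = ['Clicks tracker', 'Visits tracker', 'Conversion',
              'Revenue USD', 'Cost USD', 'ts Clicks']

increments = df.sort_values('Snapshot time').copy()
# diff() within each campaign/source gives n - (n-1); the first snapshot of each
# group has no previous row, so keep its cumulative value as the first increment
increments[value_cols] = (increments.groupby(key_cols)[value_cols]
                          .diff()
                          .fillna(increments[value_cols]))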

Related

Performing batch join in pyspark from timestamp and id using watermarking

I have two dataframes, with on-equipment and equipment-surrounding information. A row from the on-equipment dataframe has the following schema:
on_equipment_df.head()
# Output of notebook:
Row(
on_eq_equipid='1',
on_eq_timestamp=datetime.datetime(2020, 10, 7, 15, 27, 42, 866098),
on_eq_charge=3.917107463423725,
on_eq_electric_consumption=102.02754516792204,
on_eq_force=10.551710736897613,
on_eq_humidity=22.663245200558457,
on_eq_pressure=10.813417893943944,
on_eq_temperature=29.80448721128125,
on_eq_vibration=4.376662536641158,
measurement_status='Bad')
And a row from the equipment-surrounding dataframe looks like:
equipment_surrounding_df.head()
# Output of notebook
Row(
eq_surrounding_equipid='1',
eq_surrounding_timestamp=datetime.datetime(2020, 10, 7, 15, 27, 42, 903198),
eq_surrounding_dust=24.0885168316774,
eq_surrounding_humidity=16.949569353381793,
eq_surrounding_noise=12.256649392702574,
eq_surrounding_temperature=8.141877435145844,
measurement_status='Good')
Notice that both tables have ids identifying the equipment and a timestamp recording when the measurement was taken.
Problem: I want to perform a join between these two dataframes based on the equipment id and the timestamps. The issue is that the timestamps are recorded with very high precision, which makes a direct join on the timestamp impossible (unless I round the timestamps, which I would like to leave as a last resort). The on-equipment and equipment-surrounding readings are recorded at different frequencies. I therefore want to perform a join based only on the equipment id, but for windows between certain timestamp values. This is similar to what is done in structured streaming using watermarking.
To do this I tried to use the equivalent operation from structured streaming mentioned above, called watermarking. Here is the code:
# Add watermark defined by the timestamp column of each df
watermarked_equipment_surr = equipment_surrounding_df.withWatermark("eq_surr_timestamp", "0.2 seconds")
watermarked_on_equipment = on_equipment_df.withWatermark("on_eq_timestamp", "0.2 seconds")
# Define new equipment dataframe based on the watermarking conditions described
equipment_df = watermarked_on_equipment.join(
watermarked_equipment_surr,
expr("""
on_eq_equipid = eq_surr_equipid AND
on_eq_timestamp >= eq_surrounding_timestamp AND
on_eq_timestamp <= eq_surrounding_timestamp + interval 0.2 seconds
"""))
# Perform a count (error appears here)
print("Equipment size:", equipment_df.count())
I get an error when performing this action. Based on this, I have two questions:
Is this the right way to solve such a use case / problem?
If so, why do I get an error in my code?
Thank you in advance
UPDATE:
I believe I found half the solution, inspired by:
Joining two spark dataframes on time (TimestampType) in python
Essentially, the solution goes through creating two columns in one of the dataframes which represent an upper and a lower limit on the timestamp, using UDFs. The code for that:
from datetime import timedelta

from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

def lower_range_func(x, offset_milli=250):
    """
    From a timestamp and an offset, get the timestamp obtained from subtracting the offset.
    """
    return x - timedelta(seconds=offset_milli/1000)

def upper_range_func(x, offset_milli=250):
    """
    From a timestamp and an offset, get the timestamp obtained from adding the offset.
    """
    return x + timedelta(seconds=offset_milli/1000)

# Wrap the two range functions as UDFs returning timestamps
lower_range = udf(lower_range_func, TimestampType())
upper_range = udf(upper_range_func, TimestampType())
# Add these columns to the equipment_surrounding dataframe (the filtered join
# below references lower/upper_eq_surr_timestamp, so the bounds live on this side)
equipment_surrounding_df = equipment_surrounding_df\
    .withColumn('lower_eq_surr_timestamp', lower_range(equipment_surrounding_df["eq_surrounding_timestamp"]))\
    .withColumn('upper_eq_surr_timestamp', upper_range(equipment_surrounding_df["eq_surrounding_timestamp"]))
Once we have those columns, we can perform a filtered join using these new columns.
# Join dataframes based on a filtered join
equipment_df = on_equipment_df.join(equipment_surrounding_df)\
.filter(on_equipment_df.on_eq_timestamp > equipment_surrounding_df.lower_eq_surr_timestamp)\
.filter(on_equipment_df.on_eq_timestamp < equipment_surrounding_df.upper_eq_surr_timestamp)
The problem is that as soon as I also try to join on the equipment id, like so:
# Join dataframes based on a filtered join
equipment_df = on_equipment_df.join(
equipment_surrounding_df, on_equipment_df.on_eq_equipid == equipment_surrounding_df.eq_surr_equipid)\
.filter(on_equipment_df.on_eq_timestamp > equipment_surrounding_df.lower_eq_surr_timestamp)\
.filter(on_equipment_df.on_eq_timestamp < equipment_surrounding_df.upper_eq_surr_timestamp)
I get an error. Any thoughts on this approach?
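For comparison, a hedged sketch (my own illustration, not the poster's code) of the same windowed join written as a single non-equi join on the batch dataframes, with the ±250 ms window expressed as interval arithmetic instead of UDF-generated bound columns (column names follow the Row schemas shown above):
from pyspark.sql.functions import expr

equipment_df = on_equipment_df.alias("on_eq").join(
    equipment_surrounding_df.alias("surr"),
    expr("""
        on_eq.on_eq_equipid = surr.eq_surrounding_equipid AND
        on_eq.on_eq_timestamp >= surr.eq_surrounding_timestamp - INTERVAL 250 MILLISECONDS AND
        on_eq.on_eq_timestamp <= surr.eq_surrounding_timestamp + INTERVAL 250 MILLISECONDS
    """),
    "inner",
)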

How to deal with a table having more than 2000 columns, to calculate the avg of all the numeric columns, in PySpark?

I am working on a table with more than 2000 columns. I want to compute stats for all the numeric columns of the table, so I created a dataframe as
df = sqlContext.sql("select * from table")
I tried using describe() on the dataframe, as df.describe(), but the process seems to run forever: around 5-6 hours with no response.
Could anyone please help me out with a workaround using PySpark? Thanks in advance.
P.S.: In Scala there is a function called sliding that can be used as
allColumns.sliding(200), which slides over the columns 200 at a time so we can compute the avg of those columns.
I also need to collect all the parts, i.e. P1 -> columns 1-200, P2 -> columns 201-400, etc., and join them to get the data collectively.
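A hedged sketch of that chunking idea in PySpark (the numeric-type filter and chunk size are assumptions; df is the dataframe created above):
from pyspark.sql.functions import avg

chunk_size = 200
numeric_cols = [f.name for f in df.schema.fields
                if f.dataType.typeName() in ("integer", "long", "float", "double", "decimal")]

averages = {}
for i in range(0, len(numeric_cols), chunk_size):
    chunk = numeric_cols[i:i + chunk_size]
    # one aggregation job per chunk of 200 columns, merged into a single dict
    row = df.agg(*[avg(c).alias(c) for c in chunk]).collect()[0]
    averages.update(row.asDict())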
You could directly compute your average in the SQL request, like:
# collect the column names into a list
colname = sqlContext.sql("describe table").select("col_name").collect()
colname = [x.col_name for x in colname]
# build the query (average of each column)
df = sqlContext.sql("select " + ",".join(["AVG({0}) AS avg_{0}".format(x) for x in colname]) + " from table")
# As a result you will get the average for each column (as a Row)
df.collect()
Not sure this answers your question; don't hesitate to comment.

Dividing two columns of different DataFrames

I am using Spark to do exploratory data analysis on a user log file. One of the analyses I am doing is the average number of requests per host on a daily basis. So in order to figure out the average, I need to divide the total requests count column of one DataFrame by the unique hosts count column of the other DataFrame.
total_req_per_day_df = logs_df.select('host',dayofmonth('time').alias('day')).groupby('day').count()
avg_daily_req_per_host_df = total_req_per_day_df.select("day",(total_req_per_day_df["count"] / daily_hosts_df["count"]).alias("count"))
This is what I have written using PySpark to determine the average, and here is the error that I get:
AnalysisException: u'resolved attribute(s) count#1993L missing from day#3628,count#3629L in operator !Project [day#3628,(cast(count#3629L as double) / cast(count#1993L as double)) AS count#3630];
Note: daily_hosts_df and logs_df are cached in memory. How do you divide the count columns of the two data frames?
It is not possible to reference a column from another table. If you want to combine the data, you'll have to join first, using something similar to this:
from pyspark.sql.functions import col
(total_req_per_day_df.alias("total")
.join(daily_hosts_df.alias("host"), ["day"])
.select(col("day"), (col("total.count") / col("host.count")).alias("count")))
This is a question from an edX Spark course assignment. Since the solution is public now, I'll take the opportunity to share another, slower one, and ask whether its performance could be improved or whether it is totally anti-Spark.
import numpy as np

daily_hosts_list = (daily_hosts_df.map(lambda r: (r[0], r[1])).take(30))
days_with_hosts, hosts = zip(*daily_hosts_list)
requests = (total_req_per_day_df.map(lambda r: (r[1])).take(30))
average_requests = [(days_with_hosts[n], float(l))
                    for n, l in enumerate(list(np.array(requests, dtype=float) / np.array(hosts)))]
avg_daily_req_per_host_df = sqlContext.createDataFrame(average_requests, ('day', 'avg_reqs_per_host_per_day'))
Join the two data frames on the day column, and then select the day and the ratio of the count columns.
total_req_per_day_df = logs_df.select(dayofmonth('time')
.alias('day')
).groupBy('day').count()
avg_daily_req_per_host_df = (
total_req_per_day_df.join(daily_hosts_df,
total_req_per_day_df.day == daily_hosts_df.day
)
.select(daily_hosts_df['day'],
(total_req_per_day_df['count']/daily_hosts_df['count'])
.alias('avg_reqs_per_host_per_day')
)
.cache()
)
A solution based on zero323's answer, but which correctly works as an OUTER join:
avg_daily_req_per_host_df = (
total_req_per_day_df.join(
daily_hosts_df, daily_hosts_df['day'] == total_req_per_day_df['day'], 'outer'
).select(
total_req_per_day_df['day'],
(total_req_per_day_df['count']/daily_hosts_df['count']).alias('avg_reqs_per_host_per_day')
)
).cache()
Without the 'outer' param you lose data for days missing from one of the dataframes. This is not critical for the PySpark Lab2 task, because both dataframes contain the same dates, but it can create some pain in other tasks :)

Pandas Split-Apply-Combine

I have a dataset with UserIDs, Tweets and CreatedDates. Each UserID will have multiple tweets created at different dates. I want to find the frequency of tweets, and I've written a small calculation which gives me the number of tweets per hour per UserID. I used groupby to do this; the code is as follows:
twitterDataFrame = twitterDataFrame.set_index(['CreatedAt'])
tweetsByEachUser = twitterDataFrame.groupby('UserID')
numberOfHoursBetweenFirstAndLastTweet = (tweetsByEachUser['CreatedAtForCalculations'].first() - tweetsByEachUser['CreatedAtForCalculations'].last()).astype('timedelta64[h]')
numberOfTweetsByTheUser = tweetsByEachUser.size()
frequency = numberOfTweetsByTheUser / numberOfHoursBetweenFirstAndLastTweet
When printing the value of frequency I get:
UserID
807095 5.629630
28785486 2.250000
134758540 8.333333
Now I need to go back into my big data frame (twitterDataFrame) and add these values alongside the correct UserIDs. How can I possibly do that? I'd like to say
twitterDataFrame['frequency'] = the frequency corresponding to the UserID, e.g. look up twitterDataFrame['UserID'] and assign the frequency value we got for it above.
However, I am not sure how to do this. Would anyone know how I can achieve this?
You can use a join operation on the frequency object you created, or do it in one stage:
# tweets per hour for each user: tweet count divided by hours between first and last tweet
get_freq = lambda ts: len(ts) / ((ts.max() - ts.min()).total_seconds() / 3600)
twitterDataFrame['frequency'] = twitterDataFrame.groupby('UserID')['CreatedAtForCalculations'].transform(get_freq)
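For the join route, a minimal alternative sketch (assuming the frequency Series computed in the question is still available, indexed by UserID):
# map each row's UserID to the per-user frequency computed above
twitterDataFrame['frequency'] = twitterDataFrame['UserID'].map(frequency)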

Python library for dealing with time associated data?

I've got some data (NOAA-provided weather forecasts) I'm trying to work with. There are various data series (temperature, humidity, etc.), each of which contains a series of data points that index into an array of datetimes, on various time scales (some series are hourly, others 3-hourly, some daily). Is there any sort of library for dealing with data like this and accessing it in a user-friendly way?
Ideal usage would be something like:
db = TimeData()
db.set_val('2010-12-01 12:00','temp',34)
db.set_val('2010-12-01 15:00','temp',37)
db.set_val('2010-12-01 12:00','wind',5)
db.set_val('2010-12-01 13:00','wind',6)
db.query('2010-12-01 13:00') # {'wind':6, 'temp':34}
Basically the query would return the most recent value of each series.
I looked at scikits.timeseries, but it isn't very amenable to this use case, due to the amount of pre-computation involved (it expects all the data in one shot, no random-access setting).
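For illustration, a minimal sketch of that interface (my own toy implementation, not an existing library), backed by one pandas Series per data series and using Series.asof() for the "most recent value at or before this time" lookup:
import pandas as pd

class TimeData:
    def __init__(self):
        self.series = {}                      # series name -> Series indexed by timestamp

    def set_val(self, when, name, value):
        point = pd.Series([value], index=[pd.Timestamp(when)])
        existing = self.series.get(name)
        self.series[name] = point if existing is None else pd.concat([existing, point]).sort_index()

    def query(self, when):
        ts = pd.Timestamp(when)
        # Series.asof() returns the last value at or before the given timestamp
        return {name: s.asof(ts) for name, s in self.series.items()}

db = TimeData()
db.set_val('2010-12-01 12:00', 'temp', 34)
db.set_val('2010-12-01 15:00', 'temp', 37)
db.set_val('2010-12-01 12:00', 'wind', 5)
db.set_val('2010-12-01 13:00', 'wind', 6)
print(db.query('2010-12-01 13:00'))           # -> {'temp': 34, 'wind': 6}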
If your data is sorted, you can use the bisect module to quickly get the entry with the greatest time less than or equal to the specified time.
Something like:
from bisect import bisect_right

i = bisect_right(times, time)
# times[j] <= time for j < i
# times[j] >  time for j >= i
if i and times[i - 1] == time:
    # exact match
    value = values[i - 1]
else:
    # no exact match: interpolate between the neighbouring samples
    value = (values[i - 1] + values[i]) / 2
SQLite has a date type. You can also convert all the times to seconds since epoch (by going through time.gmtime() or time.localtime()), which makes comparisons trivial.
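A small illustration of the epoch-seconds route (the format string and helper name are my own; calendar.timegm is the inverse of time.gmtime for UTC times):
import calendar
import time

def to_epoch(s, fmt='%Y-%m-%d %H:%M'):
    # parse the timestamp string as UTC and return seconds since the epoch
    return calendar.timegm(time.strptime(s, fmt))

print(to_epoch('2010-12-01 13:00') - to_epoch('2010-12-01 12:00'))   # 3600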
This is a classic rows-to-columns problem; in a good SQL DBMS you can use unions:
SELECT MAX(d_t) AS d_t, SUM(temp) AS temp, SUM(wind) AS wind, ... FROM (
SELECT d_t, 0 AS temp, value AS wind FROM table
WHERE type='wind' AND d_t >= some_date
ORDER BY d_t DESC LIMIT 1
UNION
SELECT d_t, value, 0 FROM table
WHERE type='temp' AND d_t >= some_date
ORDER BY d_t DESC LIMIT 1
UNION
...
) q1;
The trick is to make a subquery for each dimension while providing placeholder columns for the other dimensions. In Python you can use SQLAlchemy to dynamically generate a query like this.
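A hedged sketch of generating such a query from Python, using plain string building rather than SQLAlchemy for brevity (the table name "readings" and the series names are assumptions matching the SQL shape above):
def latest_values_query(series_names, some_date):
    parts = []
    for name in series_names:
        # real value for this series, zero placeholders for the others
        cols = ", ".join("value AS {0}".format(s) if s == name else "0 AS {0}".format(s)
                         for s in series_names)
        parts.append(
            ("SELECT d_t, {0} FROM readings "
             "WHERE type = '{1}' AND d_t >= '{2}' "
             "ORDER BY d_t DESC LIMIT 1").format(cols, name, some_date)
        )
    outer_cols = ", ".join("SUM({0}) AS {0}".format(s) for s in series_names)
    return ("SELECT MAX(d_t) AS d_t, " + outer_cols +
            " FROM (" + " UNION ".join(parts) + ") q1")

print(latest_values_query(['temp', 'wind'], '2010-12-01 00:00'))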
