I am using Spark to do exploratory data analysis on a user log file. One of the analysis that I am doing is average requests on daily basis per host. So in order to figure out the average, I need to divide the total request column of the DataFrame by number unique Request column of the DataFrame.
total_req_per_day_df = logs_df.select('host',dayofmonth('time').alias('day')).groupby('day').count()
avg_daily_req_per_host_df = total_req_per_day_df.select("day",(total_req_per_day_df["count"] / daily_hosts_df["count"]).alias("count"))
This is what I have written using the PySpark to determine the average. And here is the log of error that I get
AnalysisException: u'resolved attribute(s) count#1993L missing from day#3628,count#3629L in operator !Project [day#3628,(cast(count#3629L as double) / cast(count#1993L as double)) AS count#3630];
Note: daily_hosts_df and logs_df is cached in the memory. How do you divide the count column of both data frames?
It is not possible to reference column from another table. If you want to combine data you'll have to join first using something similar to this:
from pyspark.sql.functions import col
(total_req_per_day_df.alias("total")
.join(daily_hosts_df.alias("host"), ["day"])
.select(col("day"), (col("total.count") / col("host.count")).alias("count")))
It's a question from an edX Spark course assignment. Since the solution is public now I take the opportunity to share another, slower one and ask whether the performance of it could be improved or is totally anti-Spark?
daily_hosts_list = (daily_hosts_df.map(lambda r: (r[0], r[1])).take(30))
days_with_hosts, hosts = zip(*daily_hosts_list)
requests = (total_req_per_day_df.map(lambda r: (r[1])).take(30))
average_requests = [(days_with_hosts[n], float(l)) for n, l in enumerate(list(np.array(requests, dtype=float) / np.array(hosts)))]
avg_daily_req_per_host_df = sqlContext.createDataFrame(average_requests, ('day', 'avg_reqs_per_host_per_day'))
Join the two data frames on column day, and then select the day and ratio of the count columns.
total_req_per_day_df = logs_df.select(dayofmonth('time')
.alias('day')
).groupBy('day').count()
avg_daily_req_per_host_df = (
total_req_per_day_df.join(daily_hosts_df,
total_req_per_day_df.day == daily_hosts_df.day
)
.select(daily_hosts_df['day'],
(total_req_per_day_df['count']/daily_hosts_df['count'])
.alias('avg_reqs_per_host_per_day')
)
.cache()
)
Solution, based on zero323 answer, but correctly works as OUTER join.
avg_daily_req_per_host_df = (
total_req_per_day_df.join(
daily_hosts_df, daily_hosts_df['day'] == total_req_per_day_df['day'], 'outer'
).select(
total_req_per_day_df['day'],
(total_req_per_day_df['count']/daily_hosts_df['count']).alias('avg_reqs_per_host_per_day')
)
).cache()
Without 'outer' param you lost data for days missings in one of dataframes. This is not critical for PySpark Lab2 task, becouse both dataframes contains same dates. But can create some pain in another tasks :)
Related
I have a Dataframe with rows which will be saved to different target tables. Right now, I'm finding the unique combination of parameters to determine the target table, iterating over the Dataframe and filtering, then writing.
Something similar to this:
df = spark.load.json(directory).repartition('client', 'region')
unique_clients_regions = [(group.client, group.region) for group in df.select('client', 'region').distinct().collect()]
for client, region in unique_clients_regions:
(df
.filter(f"client = '{client}' and region = '{region}'")
.select(
...
)
.write.mode("append")
.saveAsTable(f"{client}_{region}_data")
)
Is there a way to map the write operation to different groupBy groups instead of having to iterate over the distinct set? I made sure to repartition by client and region to try and speed up performance of the filter.
I cannot, in good conscience , advice you anything using this solution. Actually that's a really bad data architecture.
You should have only one table and partition by client and region. That will create different folders for each couple client/region. And you only need one write in the end and no loop nor collect.
spark.read.json(directory).write.saveAsTable(
"data",
mode="append",
partitionBy=['client', 'region']
)
Broadly I have the Smart Meters dataset from Kaggle and I'm trying to get a count of the first and last measure by house, then trying to aggregate that to see how many houses began (or ended) reporting on a given day. I'm open to methods totally different than the line I pursue below.
In SQL, when exploring data I often used something like following:
SELECT Max_DT, COUNT(House_ID) AS HouseCount
FROM
(
SELECT House_ID, MAX(Date_Time) AS Max_DT
FROM ElectricGrid GROUP BY HouseID
) MeasureMax
GROUP BY Max_DT
I'm trying to replicate this logic in Pandas and failing. I can get the initial aggregation like:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
However I'm failing to get the outer query. Specifically I don't know what the aggregated column is called. If I do a describe() it shows as Date_Time in the example above. I tried renaming the columns:
house_max.columns = ['House_Id','Max_Date_Time']
I found a StackOverflow discussion about renaming the results of aggregation and attempted to apply it:
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
I still find that a describe() returns Date_Time as the column name.
start_end_collate = house_max.groupby('Date_Time_max')['House_Id'].size()
In the rename example my second query fails to find Date_Time or Max_Date_Time. In the later case, the Ravel code it appears to not find House_Id when I run it.
That's seems weird, I would think your code would not be able to find the House_Id field. After you perform your groupby on House_Id it becomes an index which you cannot reference as a column.
This should work:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
start_end_collate = house_max.groupby('Date_Time_max').size()
Alternatively you can just drop the multilevel column:
house_max.columns = house_max.columns.droplevel(0)
start_end_collate = house_max.groupby('max').size()
I am new to Python and I'm trying to produce a similar result of Excel's IndexMatch function with Python & Pandas, though I'm struggling to get it working.
Basically, I have 2 separate DataFrames:
The first DataFrame ('market') has 7 columns, though I only need 3 of those columns for this exercise ('symbol', 'date', 'close'). This df has 13,948,340 rows.
The second DataFrame ('transactions') has 14 columns, though only I only need 2 of those columns ('i_symbol', 'acceptance_date'). This df has 1,428,026 rows.
My logic is: If i_symbol is equal to symbol and acceptance_date is equal to date: print symbol, date & close. This should be easy.
I have achieved it with iterrows() but because of the size of the dataset, it returns a single result every 3 minutes - which means I would have to run the script for 1,190 hours to get the final result.
Based on what I have read online, itertuples should be a faster approach, but I am currently getting an error:
ValueError: too many values to unpack (expected 2)
This is the code I have written (which currently produces the above ValueError):
for i_symbol, acceptance_date in transactions.itertuples(index=False):
for symbol, date in market.itertuples(index=False):
if i_symbol == symbol and acceptance_date == date:
print(market.symbol + market.date + market.close)
2 questions:
Is itertuples() the best/fastest approach? If so, how can I get the above working?
Does anyone know a better way? Would indexing work? Should I use an external db (e.g. mysql) instead?
Thanks, Matt
Regarding question 1: pandas.itertuples() yields one namedtuple for each row. You can either unpack these like standard tuples or access the tuple elements by name:
for t in transactions.itertuples(index=False):
for m in market.itertuples(index=False):
if t.i_symbol == m.symbol and t.acceptance_date == m.date:
print(m.symbol + m.date + m.close)
(I did not test this with data frames of your size but I'm pretty sure it's still painfully slow)
Regarding question 2: You can simply merge both data frames on symbol and date.
Rename your "transactions" DataFrame so that it also has columns named "symbol" and "date":
transactions = transactions[['i_symbol', 'acceptance_date']]
transactions.columns = ['symbol','date']
Then merge both DataFrames on symbol and date:
result = pd.merge(market, transactions, on=['symbol','date'])
The result DataFrame consists of one row for each symbol/date combination which exists in both DataFrames. The operation only takes a few seconds on my machine with DataFrames of your size.
#Parfait provided the best answer below as a comment. Very clean, worked incredibly fast - thank you.
pd.merge(market[['symbol', 'date', 'close']], transactions[['i_symbol',
'acceptance_date']], left_on=['symbol', 'date'], right_on=['i_symbol',
'acceptance_date']).
No need for looping.
I have daily csv's that are automatically created for work that average about 1000 rows and exactly 630 columns. I've been trying to work with pandas to create a summary report that I can write to a new txt.file each day.
The problem that I'm facing is that I don't know how to group the data by 'provider', while also performing my own calculations based on the unique values within that group.
After 'Start', the rest of the columns(-2000 to 300000) are profit/loss data based on time(milliseconds). The file is usually between 700-1000 lines and I usually don't use any data past column heading '20000' (not shown).
I am trying to make an output text file that will summarize the csv file by 'provider'(there are usually 5-15 unique providers per file and they are different each day). The calculations I would like to perform are:
Provider = df.group('providers')
total tickets = sum of 'filled' (filled column: 1=filled, 0=reject)
share % = a providers total tickets / sum of all filled tickets in file
fill rate = sum of filled / (sum of filled + sum of rejected)
Size = Sum of 'fill_size'
1s Loss = (count how many times column '1000' < $0) / total_tickets
1s Avg = average of column '1000'
10s Loss = (count how many times MIN of range ('1000':'10000') < $0) / total_tickets
10s Avg = average of range ('1000':'10000')
Ideally, my output file will have these headings transposed across the top and the 5-15 unique providers underneath
While I still don't understand the proper format to write all of these custom calculations, my biggest hurdle is referencing one of my calculations in the new dataframe (ie. total_tickets) and applying it to the next calculation (ie. 1s loss)
I'm looking for someone to tell me the best way to perform these calculations and maybe provide an example of at least 2 or 3 of my metrics. I think that if I have the proper format, I'll be able to run with the rest of this project.
Thanks for the help.
The function you want is DataFrame.groupby, with more examples in the documentation here.
Usage is fairly straightforward.
You have a field called 'provider' in your dataframe, so to create groups, you simple call grouped = df.groupby('provider'). Note that this does no calculations, just tells pandas how to find groups.
To apply functions to this object, you can do a few things:
If it's an existing function (like sum), tell the grouped object which columns you want and then call .sum(), e.g., grouped['filled'].sum() will give the sum of 'filled' for each group. If you want the sum of every column, grouped.sum() will do that. For your second example, you could divide this resulting series by df['filled'].sum() to get your percentages.
If you want to pass a custom function, you can call grouped.apply(func) to apply that function to each group.
To store your values (e.g., for total tickets), you can just assign them to a variable, to total_tickets = df['filled'].sum(), and tickets_by_provider = grouped['filled'].sum(). You can then use these in other calculations.
Update:
For one second loss (and for the other losses), you need two things:
The number of times for each provider df['1000'] < 0
The total number of records for each provider
These both fit within groupby.
For the first, you can use grouped.apply with a lambda function. It could look like this:
_1s_loss_freq = grouped.apply(lambda x: x['fill'][x['1000'] < 0].sum())
For group totals, you just need to pick a column and get counts. This is done with the count() function.
records_per_group = grouped['1000'].count()
Then, because pandas aligns on indices, you can get your percentages with _1s_loss_freq / records_per_group.
This analogizes to the 10s Loss question.
The last question about the average over a range of columns relies on pandas understanding of how it should apply functions. If you take a dataframe and call dataframe.mean(), pandas returns the mean of each column. There's a default argument in mean() that is axis=0. If you change that to axis=1, pandas will instead take the mean of each row.
For your last question, 10s Avg, I'm assuming you've aggregated to the provider level already, so that each provider has one row. I'll do that with sum() below but any aggregation will do. Assuming the columns you want the mean over are stored in a list called cols, you want:
one_rec_per_provider = grouped[cols].sum()
provider_means_over_cols = one_rec_per_provider.mean(axis=1)
I wrote a lambda function that should be fast, but this is taking a very long time. Is there a better way to write this?
fn = lambda x: shape(df[df.CustomerCard_Num == x.CustomerCard_Num])[0]
df['tottrans'] = df.apply(fn, axis = 1)
Basically, I have a big database of transactions (rows). A set of rows might correspond to different customers (Customer card number if a column in df, multiple rows might have the same df.CustomerCard_Num.)
I am trying to count the number of rows for each customer with this lambda function. But it does not seem to work quickly. Should I be using groupby?
There is a built in way:
df.CustomerCard_Num.value_counts()
See the docs