I have a DataFrame whose rows need to be saved to different target tables. Right now, I find the unique combinations of parameters that determine the target table, then iterate over them, filtering the DataFrame and writing each subset.
Something similar to this:
df = spark.read.json(directory).repartition('client', 'region')
unique_clients_regions = [(group.client, group.region) for group in df.select('client', 'region').distinct().collect()]
for client, region in unique_clients_regions:
    (df
     .filter(f"client = '{client}' and region = '{region}'")
     .select(
         ...
     )
     .write.mode("append")
     .saveAsTable(f"{client}_{region}_data")
    )
Is there a way to map the write operation to different groupBy groups instead of having to iterate over the distinct set? I made sure to repartition by client and region to try and speed up performance of the filter.
I cannot, in good conscience, advise anything built on this solution. Frankly, that's a really bad data architecture.
You should have only one table, partitioned by client and region. That creates a separate folder for each client/region pair, and in the end you need only a single write, with no loop and no collect.
spark.read.json(directory).write.saveAsTable(
    "data",
    mode="append",
    partitionBy=['client', 'region']
)
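As a hedged follow-up: once the table is partitioned this way, pulling out a single client/region pair is just a filter on the partition columns, and Spark only reads the matching folder (partition pruning). A minimal sketch; the client and region values below are placeholders:
# Filtering on the partition columns only scans the matching client/region folder
subset = spark.table("data").filter("client = 'client_a' and region = 'region_1'")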
I have grouped the number of customers by region and year joined, using groupby in Python. However, I want to remove several regions from the grouping.
I know that to exclude one group from a groupby you can use the following code:
grouped = df.groupby(['Region'])
df1 = df.drop(grouped.get_group(('Southwest')).index)
Therefore I initially tried the following:
grouped = df.groupby(['Region'])
df1 = df.drop(grouped.get_group(('Southwest','Northwest')).index)
However, that gave me an error referencing ('Southwest','Northwest').
Now I am wondering if there is a way to drop several groups at once, instead of having to type out the above code repeatedly for each region I want to remove.
I expect the output of the final query to be similar to the image shown below; however, information regarding the Northwest and Southwest regions should be removed.
It's not df1 = df.drop(grouped.get_group(('Southwest','Northwest')).index). grouped.get_group takes a single group name as its argument. If you want to drop more than one group, you can combine the two indexes, e.g. df1 = df.drop(grouped.get_group('Southwest').index.union(grouped.get_group('Northwest').index)), since drop accepts list-like input.
As a side note, ('Southwest') evaluates to 'Southwest' (i.e. it's not a tuple). If you want to make a tuple of size 1, it's ('Southwest', )
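If the end goal is just to exclude rows from certain regions, a boolean mask avoids groupby entirely. A minimal sketch, assuming 'Region' is an ordinary column of df:
# Keep only the rows whose Region is not in the excluded set
df1 = df[~df['Region'].isin(['Southwest', 'Northwest'])]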
Broadly, I have the Smart Meters dataset from Kaggle, and I'm trying to get the first and last measurement per house, then aggregate that to see how many houses began (or ended) reporting on a given day. I'm open to methods totally different from the approach I pursue below.
In SQL, when exploring data I often used something like following:
SELECT Max_DT, COUNT(House_ID) AS HouseCount
FROM
(
SELECT House_ID, MAX(Date_Time) AS Max_DT
FROM ElectricGrid GROUP BY House_ID
) MeasureMax
GROUP BY Max_DT
I'm trying to replicate this logic in Pandas and failing. I can get the initial aggregation like:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
However, I'm failing to get the outer query to work. Specifically, I don't know what the aggregated column is called; describe() shows it as Date_Time in the example above. I tried renaming the columns:
house_max.columns = ['House_Id','Max_Date_Time']
I found a StackOverflow discussion about renaming the results of aggregation and attempted to apply it:
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
I still find that a describe() returns Date_Time as the column name.
start_end_collate = house_max.groupby('Date_Time_max')['House_Id'].size()
With the rename approach, my second query fails to find Date_Time or Max_Date_Time. With the ravel approach, it appears not to find House_Id when I run it.
That seems odd at first, but I would expect your code not to be able to find the House_Id field: after you perform your groupby on House_Id, it becomes the index, which you cannot reference as a column.
This should work:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
start_end_collate = house_max.groupby('Date_Time_max').size()
Alternatively you can just drop the multilevel column:
house_max.columns = house_max.columns.droplevel(0)
start_end_collate = house_max.groupby('max').size()
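For what it's worth, the whole pipeline can also be written without multi-level columns, which sidesteps the renaming issue entirely. A sketch using the same column names as the question:
# Last reading timestamp per house, as a flat Series indexed by House_Id
house_max = house_info.groupby('House_Id')['Date_Time'].max()
# Number of houses whose last reading falls on each timestamp
start_end_collate = house_max.value_counts().sort_index()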
I am using Spark to do exploratory data analysis on a user log file. One of the analyses I am doing is the average number of requests per host on a daily basis. To figure out the average, I need to divide the total request count column of one DataFrame by the unique host count column of the other.
total_req_per_day_df = logs_df.select('host',dayofmonth('time').alias('day')).groupby('day').count()
avg_daily_req_per_host_df = total_req_per_day_df.select("day",(total_req_per_day_df["count"] / daily_hosts_df["count"]).alias("count"))
This is what I have written in PySpark to compute the average, and here is the error I get:
AnalysisException: u'resolved attribute(s) count#1993L missing from day#3628,count#3629L in operator !Project [day#3628,(cast(count#3629L as double) / cast(count#1993L as double)) AS count#3630];
Note: daily_hosts_df and logs_df are cached in memory. How do I divide the count columns of the two DataFrames?
It is not possible to reference a column from another DataFrame. If you want to combine the data you'll have to join first, using something similar to this:
from pyspark.sql.functions import col
(total_req_per_day_df.alias("total")
    .join(daily_hosts_df.alias("host"), ["day"])
    .select(col("day"), (col("total.count") / col("host.count")).alias("count")))
This is a question from an edX Spark course assignment. Since the solution is public now, I'll take the opportunity to share another, slower one and ask whether its performance could be improved, or whether it is completely anti-Spark.
import numpy as np

# Pull (day, host_count) pairs and the per-day request counts to the driver
daily_hosts_list = (daily_hosts_df.map(lambda r: (r[0], r[1])).take(30))
days_with_hosts, hosts = zip(*daily_hosts_list)
requests = (total_req_per_day_df.map(lambda r: (r[1])).take(30))
# Divide element-wise with NumPy and rebuild a DataFrame on the driver
average_requests = [(days_with_hosts[n], float(l)) for n, l in enumerate(list(np.array(requests, dtype=float) / np.array(hosts)))]
avg_daily_req_per_host_df = sqlContext.createDataFrame(average_requests, ('day', 'avg_reqs_per_host_per_day'))
Join the two data frames on column day, and then select the day and ratio of the count columns.
total_req_per_day_df = (logs_df
    .select(dayofmonth('time').alias('day'))
    .groupBy('day')
    .count())

avg_daily_req_per_host_df = (
    total_req_per_day_df.join(
        daily_hosts_df,
        total_req_per_day_df.day == daily_hosts_df.day
    )
    .select(
        daily_hosts_df['day'],
        (total_req_per_day_df['count'] / daily_hosts_df['count'])
        .alias('avg_reqs_per_host_per_day')
    )
    .cache()
)
A solution based on zero323's answer, but one that correctly works as an OUTER join.
avg_daily_req_per_host_df = (
    total_req_per_day_df.join(
        daily_hosts_df, daily_hosts_df['day'] == total_req_per_day_df['day'], 'outer'
    ).select(
        total_req_per_day_df['day'],
        (total_req_per_day_df['count'] / daily_hosts_df['count']).alias('avg_reqs_per_host_per_day')
    )
).cache()
Without the 'outer' parameter you lose data for days that are missing from one of the DataFrames. This is not critical for the PySpark Lab2 task, because both DataFrames contain the same dates, but it can cause some pain in other tasks :)
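One hedged follow-up, under the same names: with the outer join, days present in only one DataFrame end up with a null ratio (and the day column selected above comes from total_req_per_day_df, so it can be null as well). If nulls are not acceptable downstream, they can be filled afterwards, for example:
# Replace null ratios from the outer join with 0, if that suits the analysis
avg_daily_req_per_host_df = avg_daily_req_per_host_df.na.fill({'avg_reqs_per_host_per_day': 0})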
I have some data that I want to analyze. I group my data by the relevant group variables (here, 'test_condition' and 'region') and analyze the measure variable ('rt') with a function I wrote:
grouped = data.groupby(['test_condition', 'region'])['rt'].apply(summarize)
That works fine. The output looks like this (fake data):
                               ci1         ci2        mean
test_condition      region
Test Condition Name And    0   295.055978  338.857066  316.956522
                    Spill1 0   296.210167  357.036210  326.623188
                    Spill2 0   292.955327  329.435977  311.195652
The problem is that 'test_condition' and 'region' are not actual columns; I can't index into them. I just want columns with the names of the group variables. This seems so simple (it's done automatically by R's ddply), but after lots of googling I have come up with nothing. Does anyone have a simple solution?
By default, the grouping variables are turned into an index. You can change the index to columns with grouped.reset_index().
My second suggestion, specifying as_index=False in the groupby call, seems not to work as desired in this case with apply (but it does work when using aggregate).
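A minimal sketch of the reset_index route, with the filter value taken from the sample output above:
# Turn the MultiIndex levels back into ordinary columns
summary = grouped.reset_index()
# 'test_condition' and 'region' are now regular columns you can index into
subset = summary[summary['region'] == 'Spill1']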
I wrote a lambda function that should be fast, but this is taking a very long time. Is there a better way to write this?
fn = lambda x: df[df.CustomerCard_Num == x.CustomerCard_Num].shape[0]
df['tottrans'] = df.apply(fn, axis=1)
Basically, I have a big database of transactions (rows). A set of rows might correspond to different customers (CustomerCard_Num is a column in df; multiple rows might have the same df.CustomerCard_Num).
I am trying to count the number of rows for each customer with this lambda function. But it does not seem to work quickly. Should I be using groupby?
There is a built-in way:
df.CustomerCard_Num.value_counts()
See the docs
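If the goal is the 'tottrans' column from the question (the per-customer count attached to every row), a groupby transform is one vectorized option; a sketch assuming the same column names:
# Number of transactions per customer, broadcast back onto every row
df['tottrans'] = df.groupby('CustomerCard_Num')['CustomerCard_Num'].transform('count')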