How does the result change when using .distinct() in Spark? - python

I was working with an Apache log file and created an RDD of (day, host) tuples from each log line. The next step was to group by host and then display the result.
I used distinct() after mapping the first RDD into (day, host) tuples. When I don't use distinct() I get a different result than when I do. So how does the result change when using distinct() in Spark?

distinct() removes duplicate entries from the RDD. Your count should decrease or stay the same after applying distinct().
http://spark.apache.org/docs/0.7.3/api/pyspark/pyspark.rdd.RDD-class.html#distinct

I think when you only apply the map transformation to FIRST_RDD (the logs), you get SECOND_RDD, and the count of this new SECOND_RDD will equal the count of FIRST_RDD. But if you use distinct() on SECOND_RDD, the count will decrease to the number of distinct tuples present in SECOND_RDD.
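For illustration, here is a minimal PySpark sketch; the (day, host) values are made up and sc is assumed to be an active SparkContext:
day_host = sc.parallelize([('Mon', 'host1'), ('Mon', 'host1'), ('Mon', 'host2')])
day_host.count()             # 3 -- duplicate (day, host) tuples are counted
day_host.distinct().count()  # 2 -- duplicate tuples are removed first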

Related

How to compute multiple counts with different conditions on a pyspark DataFrame, fast?

Let's say I have this pyspark Dataframe:
data = spark.createDataFrame(schema=['Country'], data=[('AT',), ('BE',), ('France',), ('Latvia',)])
And let's say I want to collect various statistics about this data. For example, I might want to know how many rows use a 2-character country code and how many use longer country names:
count_short = data.where(F.length(F.col('Country')) == 2).count()
count_long = data.where(F.length(F.col('Country')) > 2).count()
This works, but when I want to collect many different counts based on different conditions, it becomes very slow even for tiny datasets. In Azure Synapse Studio, where I am working, every count takes 1-2 seconds to compute.
I need to do 100+ counts, and it takes multiple minutes to compute for a dataset of 10 rows. And before somebody asks, the conditions for those counts are more complex than in my example. I cannot group by length or do other tricks like that.
I am looking for a general way to do multiple counts on arbitrary conditions, fast.
I am guessing that the reason for the slow performance is that for every count call, my pyspark notebook starts some Spark processes that have significant overhead. So I assume that if there was some way to collect these counts in a single query, my performance problems would be solved.
One possible solution I thought of is to build a temporary column that indicates which of my conditions have been matched, and then call countDistinct on it. But then I would have individual counts for all combinations of condition matches. I also noticed that depending on the situation, the performance is a bit better when I do data = data.localCheckpoint() before computing my statistics, but the general problem still persists.
Is there a better way?
Function "count" can be replaced by "sum" with condition (Scala):
data.select(
  sum(
    when(length(col("Country")) === 2, 1).otherwise(0)
  ).alias("two_characters"),
  sum(
    when(length(col("Country")) > 2, 1).otherwise(0)
  ).alias("more_than_two_characters")
)
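A rough PySpark equivalent, as a sketch assuming `from pyspark.sql import functions as F` and the same `data` dataframe as in the question:
from pyspark.sql import functions as F

counts = data.select(
    F.sum(F.when(F.length(F.col('Country')) == 2, 1).otherwise(0)).alias('two_characters'),
    F.sum(F.when(F.length(F.col('Country')) > 2, 1).otherwise(0)).alias('more_than_two_characters'),
)
counts.show()  # all conditional sums are computed in one Spark job instead of one job per count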
While one way is to combine multiple queries into one, another is to cache the dataframe that is being queried again and again.
By caching the dataframe, we avoid re-evaluating it each time count() is invoked.
data.cache()
A few things to keep in mind: if you are applying multiple actions to your dataframe, it involves a lot of transformations, and you are reading the data from an external source, then you should definitely cache that dataframe before applying any action to it.
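As a minimal sketch of that pattern (assuming `from pyspark.sql import functions as F` and the `data` dataframe from the question):
data.cache()  # materialized on the first action, reused by the later ones
count_short = data.where(F.length(F.col('Country')) == 2).count()
count_long = data.where(F.length(F.col('Country')) > 2).count()
data.unpersist()  # release the cached data once the counts are done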
The answer provided by @pasha701 works, but you would have to keep adding columns for every country-code length you want to analyse.
You can use the code below to get the counts for the different country-code lengths in one single dataframe.
# import statements
from pyspark.sql.functions import *
# sample dataframe
data = spark.createDataFrame(schema=['Country'], data=[('AT',), ('ACE',), ('BE',), ('France',), ('Latvia',)])
# add a column that gives the length of each country code
data1 = data.withColumn("CountryLength", length(col('Country')))
# column names for the final output
outputcolumns = ["CountryLength", "RecordsCount"]
# select the CountryLength column, convert to an RDD and do a map/reduce
# to count the occurrences of each length
countrieslength = data1.select("CountryLength").rdd \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .toDF(outputcolumns) \
    .select("CountryLength.CountryLength", "RecordsCount")
# now you can do display or show on the dataframe to see the output
display(countrieslength)
The output will show each CountryLength alongside the RecordsCount for that length.
If you want to apply multiple filter conditions to this dataframe, you can cache it and then get counts for different combinations of records based on the country-code length.

SQLAlchemy query count all records in database according to month [duplicate]

I want:
DBSession.query(Article).group_by(Article.created.month).all()
but this query doesn't work. How do I do this using SQLAlchemy?
Assuming your db engine actually supports functions like MONTH(), you can try:
import sqlalchemy as sa
DBSession.query(Article).group_by(sa.func.year(Article.created), sa.func.month(Article.created)).all()
Otherwise, you can group in Python, for example:
from itertools import groupby

def grouper(item):
    return item.created.year, item.created.month

for (year, month), items in groupby(query_result, grouper):
    for item in items:
        pass  # do stuff
I know this question is ancient, but for the benefit of anyone searching for solutions, here's another strategy for databases that don't support functions like MONTH():
db.session.query(sa.func.count(Article.id)).\
group_by(sa.func.strftime("%Y-%m-%d", Article.created)).all()
Essentially this is turning the timestamps into truncated strings that can then be grouped.
If you just want the most recent entries, you can add, for example:
order_by(Article.created.desc()).limit(7)
Following this strategy, you can easily create groupings such as day-of-week simply by omitting the year and month.
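For instance, a day-of-week grouping could look like the sketch below, assuming a backend such as SQLite where strftime("%w") yields the weekday number:
db.session.query(sa.func.strftime("%w", Article.created), sa.func.count(Article.id)).\
    group_by(sa.func.strftime("%w", Article.created)).all()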
THC4k's answer works, but I just want to add that query_result needs to be already sorted for itertools.groupby to work the way you want:
query_result = DBSession.query(Article).order_by(Article.created).all()
Here is the explanation in the itertools.groupby docs:
The operation of groupby() is similar to the uniq filter in Unix. It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function). That behavior differs from SQL’s GROUP BY which aggregates common elements regardless of their input order.

Pandas: fix typos in keys within a dataframe

So, I have a large data frame with customer names. I used the phone number and email combined to create a unique ID key for each customer. But sometimes there is a typo in the email, so two keys get created for the same customer.
Like so:
Key                          | Order #
555261andymiller@gmail.com   | 901345
555261andymller@gmail.com    | 901345
I'm thinking of combining all the keys based on the phone number (partial string) and then assigning all the keys within each group to the first key in every group. How would I go about doing this in Pandas? I've tried iterating over the rows and I've also tried the groupby method by partial string, but I can't seem to assign new values using this method.
If you really don't care what the new ID is, you can group by the first characters of the string (which represent the phone number).
For example:
df.groupby(df.Key.str[:6]).first()
This will result in a dataframe whose index is the first six characters of the key and whose rows hold the first entry of each customer record. This assumes the phone number is always correct, though it sounds like that should not be an issue.
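If you would rather keep every row and just overwrite the typo'd keys in place, a minimal sketch (with hypothetical sample data mirroring the question) could use groupby with transform:
import pandas as pd

df = pd.DataFrame({
    'Key': ['555261andymiller@gmail.com', '555261andymller@gmail.com'],
    'Order #': [901345, 901345],
})

# Group rows by the phone-number prefix (first 6 characters of the key) and
# replace every key in a group with that group's first key, collapsing typos.
df['Key'] = df.groupby(df['Key'].str[:6])['Key'].transform('first')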

How does the collectAsMap() function work in the Spark API

I am trying to understand what happens when we run the collectAsMap() function in Spark. As per the PySpark docs, it says:
collectAsMap(self)
Return the key-value pairs in this RDD to the master as a dictionary.
and for core Spark it says:
def collectAsMap(): Map[K, V]
Return the key-value pairs in this RDD to the master as a Map.
When I run sample code in PySpark for a list, I get this result:
and for Scala I get this result:
I am a little confused as to why it is not returning all the elements of the list. Can somebody help me understand what is happening in this scenario and why I am getting selective results?
Thanks.
The semantics of collectAsMap are identical between the Scala and Python APIs, so I'll look at the former WLOG. The documentation for PairRDDFunctions.collectAsMap explicitly states:
Warning: this doesn't return a multimap (so if you have multiple values to the same key, only one value per key is preserved in the map returned)
In particular, the current implementation inserts the key-value pairs into the resultant map in order and thus only the last two pairs survive in each of your two examples.
If you use collect instead, it will return Array[(Int,Int)] without losing any of your pairs.
collectAsMap returns the results of a pair RDD as a Map collection. Since it returns a Map, you will only get pairs with unique keys; pairs whose keys are duplicated are collapsed so that only one value per key remains.
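A minimal PySpark sketch (the numbers are made up, and sc is assumed to be an active SparkContext) illustrates the difference; typically the last value seen for a key wins:
pairs = sc.parallelize([(1, 2), (1, 3), (2, 4)])

pairs.collect()       # [(1, 2), (1, 3), (2, 4)] -- every pair is kept
pairs.collectAsMap()  # {1: 3, 2: 4} -- only one value survives per duplicate key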

How to count the number of row keys for a particular column_family in Cassandra (read details)

I am trying to load data from SQL into NoSQL, i.e. Cassandra, but somehow a few rows are not matching. Can somebody tell me how to count the number of row keys for a particular column family in Cassandra?
I tried get_count and get_multicount, but these methods require keys to be passed, and in my case I do not know the keys; I need a count of the row keys instead.
list column_family_name gives me the list, but limited to only 100 rows. Is there any way
I can override the 100 limit?
As far as I know, there is no way to get a row count for a column family. You have to perform a range query over the whole column family instead.
If cf is your column family, something like this should work:
num_rows = len(list(cf.get_range()))
However, the documentation for get_range indicates that this might cause issues if you have too many rows. You might have to do it in chunks, using start and row_count.
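A rough sketch of that chunked approach (assuming cf is a pycassa ColumnFamily and that get_range accepts start and row_count as described above; treat this as illustrative rather than exact):
def count_rows(cf, chunk_size=1000):
    count = 0
    start = ''
    while True:
        chunk = list(cf.get_range(start=start, row_count=chunk_size))
        if not chunk:
            break
        if start:
            # The first row of each later chunk repeats the previous chunk's
            # last row (start is inclusive), so skip it to avoid double counting.
            chunk = chunk[1:]
            if not chunk:
                break
        count += len(chunk)
        start = chunk[-1][0]  # each item is a (key, columns) tuple
    return count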
You can count Cassandra rows without reading all of them.
See the Spark Cassandra Connector's implementation of cassandraCount(), which does this quite efficiently.
