UPDATE(04/20/17):
I am using Apache Spark 2.1.0 and I will be using Python.
I have narrowed down the problem and hopefully someone more knowledgeable about Spark can answer. I need to create an RDD of tuples from the header and values of the values.csv file:
values.csv (main collected data, very large):
+--------+---+---+---+---+---+----+
| ID | 1 | 2 | 3 | 4 | 9 | 11 |
+--------+---+---+---+---+---+----+
| | | | | | | |
| abc123 | 1 | 2 | 3 | 1 | 0 | 1 |
| | | | | | | |
| aewe23 | 4 | 5 | 6 | 1 | 0 | 2 |
| | | | | | | |
| ad2123 | 7 | 8 | 9 | 1 | 0 | 3 |
+--------+---+---+---+---+---+----+
output (RDD):
+----------+----------+----------+----------+----------+----------+----------+
| abc123 | (1;1) | (2;2) | (3;3) | (4;1) | (9;0) | (11;1) |
| | | | | | | |
| aewe23 | (1;4) | (2;5) | (3;6) | (4;1) | (9;0) | (11;2) |
| | | | | | | |
| ad2123 | (1;7) | (2;8) | (3;9) | (4;1) | (9;0) | (11;3) |
+----------+----------+----------+----------+----------+----------+----------+
What happened was I paired each value with the column name of that value in the format:
(column_number, value)
raw format (if you are interested in working with it):
id,1,2,3,4,9,11
abc123,1,2,3,1,0,1
aewe23,4,5,6,1,0,2
ad2123,7,8,9,1,0,3
The Problem:
The example values.csv file contains only a few columns, but in the actual file there are thousands of columns. I can extract the header and broadcast it to every node in the distributed environment, but I am not sure if that is the most efficient way to solve the problem. Is it possible to achieve the output with a parallelized header?
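For reference, here is a minimal sketch of the broadcast-header approach described above, assuming a SparkContext named sc and the raw CSV layout shown; the helper name is made up for illustration:
from pyspark import SparkContext

# sketch only: sc, the file path and pair_with_columns are assumptions
sc = SparkContext.getOrCreate()
lines = sc.textFile("values.csv")
header = lines.first()                      # "id,1,2,3,4,9,11"
col_names = header.split(",")[1:]           # ["1", "2", "3", "4", "9", "11"]
bc_cols = sc.broadcast(col_names)           # ship the header to every executor once

def pair_with_columns(line):
    fields = line.split(",")
    row_id, vals = fields[0], fields[1:]
    return (row_id, list(zip(bc_cols.value, vals)))

pairs = (lines.filter(lambda line: line != header)
              .map(pair_with_columns))
# e.g. ('abc123', [('1', '1'), ('2', '2'), ('3', '3'), ('4', '1'), ('9', '0'), ('11', '1')])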
I think you can achieve the solution using a PySpark DataFrame too. However, my solution is not optimal yet. I use split to get the new column names and the corresponding columns to sum. This depends on how large your key_list is. If it's too large, this might not work well because you have to load key_list into memory (using collect).
import pandas as pd
import pyspark.sql.functions as func
# example data
values = spark.createDataFrame(pd.DataFrame([['abc123', 1, 2, 3, 1, 0, 1],
                                             ['aewe23', 4, 5, 6, 1, 0, 2],
                                             ['ad2123', 7, 8, 9, 1, 0, 3]],
                                            columns=['id', '1', '2', '3', '4', '9', '11']))
key_list = spark.createDataFrame(pd.DataFrame([['a', '1'],
                                               ['b', '2;4'],
                                               ['c', '3;9;11']],
                                              columns=['key', 'cols']))
# use values = spark.read.csv(path_to_csv, header=True) for your data
key_list_df = key_list.select('key', func.split('cols', ';').alias('col'))
key_list_rdd = key_list_df.rdd.collect()
for row in key_list_rdd:
    values = values.withColumn(row.key, sum(values[c] for c in row.col if c in values.columns))
keys = [row.key for row in key_list_rdd]
output_df = values.select(keys)
Output
output_df.show(n=3)
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 3| 4|
| 4| 6| 8|
| 7| 9| 12|
+---+---+---+
Related
I have a dataframe (df1) where I have listed some time frames:
| start | end | event_name |
|-------|-----|------------|
| 1     | 3   | name_1     |
| 3     | 5   | name_2     |
| 2     | 6   | name_3     |
In these time frames, I would like to extract some data from another dataframe (df2). For example, I want to extend df1 with the average measurement from df2 inside the specified time range.
| timestamp | measurement |
|-----------|-------------|
| 1 | 5 |
| 2 | 7 |
| 3 | 5 |
| 4 | 9 |
| 5 | 2 |
| 6 | 7 |
| 7 | 8 |
I was thinking about a UDF which filters df2 by timestamp and evaluates the average. But in a UDF I cannot reference two dataframes:
import pyspark.sql.functions as f

def get_avg(start, end):
    # df2 is referenced inside the UDF, which is what fails to pickle
    return df2.filter((df2.timestamp > start) & (df2.timestamp < end)).agg({"measurement": "avg"})

udf_1 = f.udf(get_avg)
df1.select(udf_1('start', 'end')).show()
This will throw an error TypeError: cannot pickle '_thread.RLock' object.
How would I solve this issue efficiently?
In this case there is no need to use UDFs; you can simply use a join over a range interval determined by the timestamps:
import pyspark.sql.functions as F
df1.join(df2, on=[(df2.timestamp > df1.start) & (df2.timestamp < df1.end)]) \
   .groupby('start', 'end', 'event_name') \
   .agg(F.mean('measurement').alias('avg')) \
   .show()
+-----+---+----------+-----------------+
|start|end|event_name| avg|
+-----+---+----------+-----------------+
| 1| 3| name_1| 7.0|
| 3| 5| name_2| 9.0|
| 2| 6| name_3|5.333333333333333|
+-----+---+----------+-----------------+
I have a DF in PySpark where I'm trying to explode two columns of arrays. Here's my DF:
+--------+-----+--------------------+--------------------+
| id| zip_| values| time|
+--------+-----+--------------------+--------------------+
|56434459|02138|[1.033990484, 1.0...|[1.624322475139E9...|
|56434508|02138|[1.04760919, 1.07...|[1.624322475491E9...|
|56434484|02138|[1.047177758, 1.0...|[1.62432247655E9,...|
|56434495|02138|[0.989590562, 1.0...|[1.624322476937E9...|
|56434465|02138|[1.051481754, 1.1...|[1.624322477275E9...|
|56434469|02138|[1.026476497, 1.1...|[1.624322477605E9...|
|56434463|02138|[1.10024864, 1.31...|[1.624322478085E9...|
|56434458|02138|[1.011091305, 1.0...|[1.624322478462E9...|
|56434464|02138|[1.038230333, 1.0...|[1.62432247882E9,...|
|56434474|02138|[1.041924752, 1.1...|[1.624322479386E9...|
|56434452|02138|[1.044482358, 1.1...|[1.624322479919E9...|
|56434445|02138|[1.050144598, 1.1...|[1.624322480344E9...|
|56434499|02138|[1.047851812, 1.0...|[1.624322480785E9...|
|56434449|02138|[1.044700917, 1.1...|[1.6243224811E9, ...|
|56434461|02138|[1.03341455, 1.07...|[1.624322481443E9...|
|56434526|02138|[1.04779412, 1.07...|[1.624322481861E9...|
|56434433|02138|[1.0498406, 1.139...|[1.624322482181E9...|
|56434507|02138|[1.0013894403, 1....|[1.624322482419E9...|
|56434488|02138|[1.047270063, 1.0...|[1.624322482716E9...|
|56434451|02138|[1.043182727, 1.1...|[1.624322483061E9...|
+--------+-----+--------------------+--------------------+
only showing top 20 rows
My current solution is to do a posexplode on each column, combined with a concat_ws for a unique ID, creating two DFs.
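A hedged reconstruction of that step for the values column might look like the following (the time column is handled the same way; column names are taken from the DF shown above):
import pyspark.sql.functions as F

# posexplode gives (position, element); concat_ws builds the unique new_id
values_df = (df.select("id", "zip_", F.posexplode("values").alias("pos", "values_new"))
               .withColumn("new_id", F.concat_ws("_", F.col("id"), F.col("pos")))
               .select("new_id", "zip_", "values_new"))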
First DF:
+-----------+-----+-----------+
| new_id| zip_| values_new|
+-----------+-----+-----------+
| 56434459_0|02138|1.033990484|
| 56434459_1|02138| 1.07805057|
| 56434459_2|02138| 1.09000133|
| 56434459_3|02138| 1.07009546|
| 56434459_4|02138|1.102403015|
| 56434459_5|02138| 1.1291009|
| 56434459_6|02138|1.088399924|
| 56434459_7|02138|1.047513142|
| 56434459_8|02138|1.010418795|
| 56434459_9|02138| 1.0|
|56434459_10|02138| 1.0|
|56434459_11|02138| 1.0|
|56434459_12|02138| 0.99048968|
|56434459_13|02138|0.984854524|
|56434459_14|02138| 1.0|
| 56434508_0|02138| 1.04760919|
| 56434508_1|02138| 1.07858897|
| 56434508_2|02138| 1.09084267|
| 56434508_3|02138| 1.07627785|
| 56434508_4|02138| 1.13778706|
+-----------+-----+-----------+
only showing top 20 rows
Second DF:
+-----------+-----+----------------+
| new_id| zip_| values_new|
+-----------+-----+----------------+
| 56434459_0|02138|1.624322475139E9|
| 56434459_1|02138|1.592786475139E9|
| 56434459_2|02138|1.561164075139E9|
| 56434459_3|02138|1.529628075139E9|
| 56434459_4|02138|1.498092075139E9|
| 56434459_5|02138|1.466556075139E9|
| 56434459_6|02138|1.434933675139E9|
| 56434459_7|02138|1.403397675139E9|
| 56434459_8|02138|1.371861675139E9|
| 56434459_9|02138|1.340325675139E9|
|56434459_10|02138|1.308703275139E9|
|56434459_11|02138|1.277167275139E9|
|56434459_12|02138|1.245631275139E9|
|56434459_13|02138|1.214095275139E9|
|56434459_14|02138|1.182472875139E9|
| 56434508_0|02138|1.624322475491E9|
| 56434508_1|02138|1.592786475491E9|
| 56434508_2|02138|1.561164075491E9|
| 56434508_3|02138|1.529628075491E9|
| 56434508_4|02138|1.498092075491E9|
+-----------+-----+----------------+
only showing top 20 rows
I then join the DFs on new_id, resulting in:
+------------+-----+----------------+-----+------------------+
| new_id| zip_| values_new| zip_| values_new|
+------------+-----+----------------+-----+------------------+
| 123957783_3|02138|1.527644029268E9|02138| 1.0|
| 125820702_3|02138|1.527643636531E9|02138| 1.013462378|
|165689784_12|02138|1.243647038288E9|02138|0.9283950599999999|
|165689784_14|02138|1.180488638288E9|02138| 1.011595547|
| 56424973_12|02138|1.245630256025E9|02138|0.9566622300000001|
| 56424989_14|02138|1.182471866886E9|02138| 1.0|
| 56425304_7|02138|1.403398444955E9|02138| 1.028527131|
| 56425386_6|02138|1.432949752808E9|02138| 1.08516484|
| 56430694_17|02138|1.087866094991E9|02138| 1.120045416|
| 56430700_20|02138| 9.61635686239E8|02138| 1.099920854|
| 56430856_13|02138|1.214097787512E9|02138| 0.989263804|
| 56430866_12|02138|1.245633801277E9|02138| 0.990684134|
| 56430875_10|02138|1.308705777269E9|02138| 1.0|
| 56430883_3|02138|1.529630585921E9|02138| 1.06920212|
| 56430987_13|02138|1.214100806414E9|02138| 0.978794644|
| 56431009_1|02138|1.592792025664E9|02138| 1.07923349|
| 56431013_9|02138|1.340331235566E9|02138| 1.0|
| 56431025_8|02138|1.371860189767E9|02138| 0.9477155|
| 56432373_13|02138|1.214092187852E9|02138| 0.994825498|
| 56432421_2|02138|1.561161037707E9|02138| 1.11343257|
+------------+-----+----------------+-----+------------------+
only showing top 20 rows
My question: Is there a more effective way to get the resultant DF? I tried doing two posexplodes in parallel but PySpark allows only one.
You can achieve it as follows:
from pyspark.sql.functions import col, explode, monotonically_increasing_id

df = (df.withColumn("values_new", explode(col("values")))
        .withColumn("times_new", explode(col("time")))
        .withColumn("id_new", monotonically_increasing_id()))
I have a table like so:
--------------------------------------------
| Id | Value | Some Other Columns Here
| 0 | 5 |
| 0 | 4 |
| 0 | 0 |
| 1 | 3 |
| 2 | 1 |
| 2 | 8 |
| 3 | -4 |
--------------------------------------------
I would like to remove all IDs which have any Value <= 0, so the result would be:
--------------------------------------------
| Id | Value | Some Other Columns Here
| 1 | 3 |
| 2 | 1 |
| 2 | 8 |
--------------------------------------------
I tried doing this by filtering to only the rows with Value <= 0, selecting the distinct IDs from this, converting that to a list, and then removing any rows in the original table that have an ID in that list using df.filter(~df.Id.isin(mylist)).
However, I have a huge amount of data, and this ran out of memory while making the list, so I need to come up with a pure PySpark solution.
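For clarity, the collect-and-isin attempt described above looks roughly like this (column names as in the table; the collect onto the driver is what exhausts memory):
# sketch of the approach that ran out of memory
bad_ids = [row.Id for row in df.filter(df.Value <= 0).select("Id").distinct().collect()]
df_clean = df.filter(~df.Id.isin(bad_ids))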
As Gordon mentions, you may need a window for this; here is a PySpark version:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window.partitionBy("Id")
(df.withColumn("flag",F.when(F.col("Value")<=0,0).otherwise(1))
.withColumn("Min",F.min("flag").over(w)).filter(F.col("Min")!=0)
.drop("flag","Min")).show()
+---+-----+
| Id|Value|
+---+-----+
| 1| 3|
| 2| 1|
| 2| 8|
+---+-----+
Brief summary of the approach taken:
Set a flag: when Value <= 0 then 0, else 1.
Take the min of that flag over a partition of Id (it returns 0 if any row in the group met the condition).
Filter to keep only the groups where this Min is not 0.
You can use window functions:
select t.*
from (select t.*, min(value) over (partition by id) as min_value
from t
) t
where min_value > 0
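If you prefer to run that SQL from PySpark, a minimal sketch (assuming the dataframe has been registered as a temporary view named t):
df.createOrReplaceTempView("t")
spark.sql("""
    select t.*
    from (select t.*, min(value) over (partition by id) as min_value
          from t) t
    where min_value > 0
""").show()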
I came across the concept of flags in Python on some occasions, for example in wxPython. An example is the initialization of a frame object, where several attributes are passed to style:
frame = wx.Frame(None, style=wx.MAXIMIZE_BOX | wx.RESIZE_BORDER | wx.SYSTEM_MENU | wx.CAPTION | wx.CLOSE_BOX)
I don't really understand the concept of flags. I haven't even found a solid explanation of what exactly the term "flag" means in Python. How are all these attributes passed to one variable?
The only thing I can think of is that the "|" character is used as a boolean operator, but in that case wouldn't all the attributes passed to style just evaluate to a single boolean expression?
What is usually meant by flags in this sense is bits in a single integer value. | is the usual bitwise-or operator.
Let's say wx.MAXIMIZE_BOX = 8 and wx.RESIZE_BORDER = 4; if you OR them together you get 12. In this case you could actually use the + operator instead of |.
Try printing the constants print(wx.MAXIMIZE_BOX) etc. and you may get a better understanding.
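A quick illustration in plain Python (the numeric values here are made up for the example, not the real wx constants):
>>> MAXIMIZE_BOX, RESIZE_BORDER = 8, 4      # hypothetical values, one bit each
>>> style = MAXIMIZE_BOX | RESIZE_BORDER
>>> style
12
>>> style == MAXIMIZE_BOX + RESIZE_BORDER   # + gives the same result while the bits don't overlap
True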
Flags are not unique to Python; they are a concept used in many languages. They build on the concepts of bits and bytes, where computer memory stores information using, essentially, a huge number of flags. Those flags are bits: they are either off (value 0) or on (value 1), even though you usually access computer memory in groups of at least 8 such flags (bytes, and for larger groups, words of a multiple of 8, specific to the computer architecture).
Integer numbers are an easy and common representation of the information stored in bytes; a single byte can store any integer number between 0 and 255, and with more bytes you can represent bigger integers. But those integers still consist of bits that are either on or off, and so you can use those as switches to enable or disable features. You pass in specific integer values with specific bits enabled or disabled to switch features on and off.
So a byte consists of 8 flags (bits), and enabling exactly one of them gives you 8 different integers: 1, 2, 4, 8, 16, 32, 64 and 128, and you can pass a combination of those numbers to a library like wxPython to set different options. For multi-byte integers, the numbers just keep doubling.
But you a) don't want to remember what each number means, and b) need a method of combining them into a single integer number to pass on.
The | operator does the latter, and the wx.MAXIMIZE_BOX, wx.RESIZE_BORDER, etc names are just symbolic constants for the integer values, set by the wxWidget project in various C header files, and summarised in wx/toplevel.h and wx/defs.h:
/*
Summary of the bits used (some of them are defined in wx/frame.h and
wx/dialog.h and not here):
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|15|14|13|12|11|10| 9| 8| 7| 6| 5| 4| 3| 2| 1| 0|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | \_ wxCENTRE
| | | | | | | | | | | | | | \____ wxFRAME_NO_TASKBAR
| | | | | | | | | | | | | \_______ wxFRAME_TOOL_WINDOW
| | | | | | | | | | | | \__________ wxFRAME_FLOAT_ON_PARENT
| | | | | | | | | | | \_____________ wxFRAME_SHAPED
| | | | | | | | | | \________________ wxDIALOG_NO_PARENT
| | | | | | | | | \___________________ wxRESIZE_BORDER
| | | | | | | | \______________________ wxTINY_CAPTION_VERT
| | | | | | | \_________________________
| | | | | | \____________________________ wxMAXIMIZE_BOX
| | | | | \_______________________________ wxMINIMIZE_BOX
| | | | \__________________________________ wxSYSTEM_MENU
| | | \_____________________________________ wxCLOSE_BOX
| | \________________________________________ wxMAXIMIZE
| \___________________________________________ wxMINIMIZE
\______________________________________________ wxSTAY_ON_TOP
...
*/
and
/*
Summary of the bits used by various styles.
High word, containing styles which can be used with many windows:
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|31|30|29|28|27|26|25|24|23|22|21|20|19|18|17|16|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | \_ wxFULL_REPAINT_ON_RESIZE
| | | | | | | | | | | | | | \____ wxPOPUP_WINDOW
| | | | | | | | | | | | | \_______ wxWANTS_CHARS
| | | | | | | | | | | | \__________ wxTAB_TRAVERSAL
| | | | | | | | | | | \_____________ wxTRANSPARENT_WINDOW
| | | | | | | | | | \________________ wxBORDER_NONE
| | | | | | | | | \___________________ wxCLIP_CHILDREN
| | | | | | | | \______________________ wxALWAYS_SHOW_SB
| | | | | | | \_________________________ wxBORDER_STATIC
| | | | | | \____________________________ wxBORDER_SIMPLE
| | | | | \_______________________________ wxBORDER_RAISED
| | | | \__________________________________ wxBORDER_SUNKEN
| | | \_____________________________________ wxBORDER_{DOUBLE,THEME}
| | \________________________________________ wxCAPTION/wxCLIP_SIBLINGS
| \___________________________________________ wxHSCROLL
\______________________________________________ wxVSCROLL
...
*/
The | operator is the bitwise OR operator; it combines the bits of two integers, each matching bit is paired up and turned into an output bit according to the boolean rules for OR. When you do this for those integer constants, you get a new integer number with multiple flags enabled.
So the expression
wx.MAXIMIZE_BOX | wx.RESIZE_BORDER | wx.SYSTEM_MENU | wx.CAPTION | wx.CLOSE_BOX
gives you an integer number with bit numbers 9, 6, 11, 29, and 12 set; here I used '0' and '1' strings to represent the bits and int(..., 2) to interpret a sequence of those strings as a single integer number in binary notation:
>>> fourbytes = ['0'] * 32
>>> fourbytes[9] = '1'
>>> fourbytes[6] = '1'
>>> fourbytes[11] = '1'
>>> fourbytes[29] = '1'
>>> fourbytes[12] = '1'
>>> ''.join(fourbytes)
'00000010010110000000000000000100'
>>> int(''.join(fourbytes), 2)
39321604
On the receiving end, you can use the & bitwise AND operator to test if a specific flag is set; it returns 0 if the flag is not set, or the same integer as assigned to the flag constant if the flag bit has been set. In both C and Python, a non-zero number is true in a boolean test, so testing for a specific flag is usually done with:
if ( style & wxMAXIMIZE_BOX ) {
for determining that a specific flag is set, or
if ( !(style & wxBORDER_NONE) )
to test for the opposite.
It is a boolean operator, but a bitwise one rather than a logical one. wx.MAXIMIZE_BOX and the rest are typically integers that are powers of two: 1, 2, 4, 8, 16, ... which makes it so that only one bit in them is 1 and all the rest are 0. When you apply bitwise OR (x | y) to such integers, the end effect is that they combine: 2 | 8 (0b00000010 | 0b00001000) becomes 10 (0b00001010). They can be pried apart later using the bitwise AND (x & y) operator, also called a masking operator: 10 & 8 > 0 will be true because the bit corresponding to 8 is turned on.
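The same numbers in an interpreter session, showing the combine and the mask (values chosen only for illustration):
>>> x, y = 2, 8            # 0b0010 and 0b1000
>>> combined = x | y
>>> combined               # 0b1010
10
>>> combined & 8           # non-zero, so the 8 flag is set
8
>>> combined & 4           # zero, so the 4 flag is not set
0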
I have a spark dataframe like this:
id | Operation | Value
-----------------------------------------------------------
1 | [Date_Min, Date_Max, Device] | [148590, 148590, iphone]
2 | [Date_Min, Date_Max, Review] | [148590, 148590, Good]
3 | [Date_Min, Date_Max, Review, Device] | [148590, 148590, Bad,samsung]
The result that I expect:
id | Operation | Value |
--------------------------
1 | Date_Min | 148590 |
1 | Date_Max | 148590 |
1 | Device | iphone |
2 | Date_Min | 148590 |
2 | Date_Max | 148590 |
2 | Review | Good |
3 | Date_Min | 148590 |
3 | Date_Max | 148590 |
3 | Review | Bad |
| 3 | Device | samsung |
I'm using Spark 2.1.0 with pyspark. I tried this solution but it worked only for one column.
Thanks
Here is the example dataframe from above. I used this solution to solve your question.
df = spark.createDataFrame(
    [[1, ['Date_Min', 'Date_Max', 'Device'], ['148590', '148590', 'iphone']],
     [2, ['Date_Min', 'Date_Max', 'Review'], ['148590', '148590', 'Good']],
     [3, ['Date_Min', 'Date_Max', 'Review', 'Device'], ['148590', '148590', 'Bad', 'samsung']]],
    schema=['id', 'l1', 'l2'])
Here, you can define a udf that zips the two lists together for each row first.
from pyspark.sql.types import *
from pyspark.sql.functions import col, udf, explode
zip_list = udf(
    lambda x, y: list(zip(x, y)),
    ArrayType(StructType([
        StructField("first", StringType()),
        StructField("second", StringType())
    ]))
)
Finally, you can zip the two columns together and then explode that column.
df_out = (df.withColumn("tmp", zip_list('l1', 'l2'))
            .withColumn("tmp", explode("tmp"))
            .select('id', col('tmp.first').alias('Operation'), col('tmp.second').alias('Value')))
df_out.show()
Output
+---+---------+-------+
| id|Operation| Value|
+---+---------+-------+
| 1| Date_Min| 148590|
| 1| Date_Max| 148590|
| 1| Device| iphone|
| 2| Date_Min| 148590|
| 2| Date_Max| 148590|
| 2| Review| Good|
| 3| Date_Min| 148590|
| 3| Date_Max| 148590|
| 3| Review| Bad|
| 3| Device|samsung|
+---+---------+-------+
If you are using a DataFrame then try this:
import pyspark.sql.functions as F
your_df.select("id", F.explode("Operation"), F.explode("Value")).show()