I have a dataset in pyspark for which I create a row_num column, so my data looks like:
#data:
+-----------------+-----------------+-----+------------------+-------+
|F1_imputed |F2_imputed |label| features|row_num|
+-----------------+-----------------+-----+------------------+-------+
| -0.002353| 0.9762| 0|[-0.002353,0.9762]| 1|
| 0.1265| 0.1176| 0| [0.1265,0.1176]| 2|
| -0.08637| 0.06524| 0|[-0.08637,0.06524]| 3|
| -0.1428| 0.4705| 0| [-0.1428,0.4705]| 4|
| -0.1015| 0.6811| 0| [-0.1015,0.6811]| 5|
| -0.01146| 0.8273| 0| [-0.01146,0.8273]| 6|
| 0.0853| 0.2525| 0| [0.0853,0.2525]| 7|
| 0.2186| 0.2725| 0| [0.2186,0.2725]| 8|
| -0.145| 0.3592| 0| [-0.145,0.3592]| 9|
| -0.1176| 0.4225| 0| [-0.1176,0.4225]| 10|
+-----------------+-----------------+-----+------------------+-------+
I'm trying to filter out a random selection of rows using:
count = data.count()
sample = [np.random.choice(np.arange(count), replace=True, size=50)]
filtered = data.filter(data.row_num.isin(sample))
However the second line gives an error:
AttributeError: 'numpy.int64' object has no attribute '_get_object_id'
What is causing this? I use the same filtering code to spilt the rows by label (binary column of ones and zeros) which does work, but reapplying the code now doesn't work for sampling
Numpy data types don't interact well with Spark. You can convert them to Python data types using .tolist() before calling .isin:
sample = np.random.choice(np.arange(count), replace=True, size=50).tolist()
Related
My question:
There are two dataframes and the info one is in progress to build.
What I want to do is filtering in reference dataframe based on the condition. When key is b, then apply value is 2 to the into table as whole column.
The output dataframe is the final one I want to do.
Dataframe (info)
+-----+-----+
| key|value|
+-----+-----+
| a| 10|
| b| 20|
| c| 50|
| d| 40|
+-----+-----+
Dataframe (Reference)
+-----+-----+
| key|value|
+-----+-----+
| a| 42|
| b| 2|
| c| 9|
| d| 100|
+-----+-----+
Below is the output I want:
Dataframe (Output)
+-----+-----+-----+
| key|value|const|
+-----+-----+-----+
| a| 10| 2|
| b| 20| 2|
| c| 50| 2|
| d| 40| 2|
+-----+-----+-----+
I have tried several methods and below one is the latest one I tried, but system warm me that pyspark do not have loc function.
df_cal = (
info
.join(reference)
.withColumn('const', reference.loc[reference['key']=='b', 'value'].iloc[0])
.select('key', 'result', 'const')
)
df_cal.show()
And below is the warming that reminded by system:
AttributeError: 'Dataframe' object has no attribute 'loc'
This solve:
from pyspark.sql.functions import lit
target = 'b'
const = [i['value'] for i in df2.collect() if i['key'] == f'{target}']
df_cal = df1.withColumn('const', lit(const[0]))
df_cal.show()
+---+-----+-----+
|key|value|const|
+---+-----+-----+
| a| 10| 2|
| b| 20| 2|
| c| 30| 2|
| d| 40| 2|
+---+-----+-----+
I am running spark version :3.1.3 on YARN cluster 2 nodes each 15 GB. I am trying to run DBSCAN on spark based on this answer :
1)https://stackoverflow.com/a/60153719/13378785
spark_dbscan takes a dataframe which contains the features and a partition based on which some sub-groups have been created and sklearn DBSCAN will be applied later on locally .
Run_dbscan is the implementation of DBSCAN from sklearn (which the only difference i consider noise as 0 instead of -1).
As an output i just get a dataframe with some local labels for each group (which later on with another process are merged).
An example of the input and output is shown below:
def create_scheme(dimensions):
schema = StructType()
for i in range(0, dimensions):
schema.add(str(i), FloatType())
schema.add('index', IntegerType())
schema.add('partitions', IntegerType())
schema.add('cluster', IntegerType())
return schema
def Run_dbscan(df , eps=0.2, min_samples=45, colums=None, dim=2):
db = DBSCAN(eps=eps, min_samples=min_samples).fit(df[colums[:dim]])
df['cluster'] = db.labels_
df['cluster'] = df['cluster'] + 1
return df
def spark_dbscan(train_d_path, spark, esp=0.3, min_pts=3, dim=2, part=8,roungd=0):
def dbscan(data):
return Run_dbscan(data=data, eps=esp, colums=col_d, min_samples=min_pts, dim=dim)
#partition_df = has been created earlier and is the one displayed below as Input
col_d = partition_df.columns
partition_df = partition_df.repartition(X) #X integer [2~200]
df = partition_df.groupby("partitions").applyInPandas(dbscan, schema=create_scheme(dim))
df = df.select([c for c in df.columns if c in ['cluster', 'index', 'partitions']])
#write.csv will either throw OOM or task will fail in YARN
df.write.csv('apath', header='true')
return df
Input :
+-----------+------------+-----+----------+
| 0| 1|index|partitions|
+-----------+------------+-----+----------+
|0.009805393| -1.15588| 0| 400|
| -0.3587753| -0.8127602| 1| 400|
| -0.3587753| -0.8127602| 1| 600|
| 0.39499614| 0.57317185| 2| 500|
| 0.39499614| 0.57317185| 2| 800|
|-0.11886594| -1.3287356| 3| 400|
|0.045839764| -1.333411| 4| 400|
| -1.406546| 0.75050515| 5| 200|
| -1.406546| 0.75050515| 5| 300|
| 1.3432932| -1.1720116| 6| 700|
| 0.23757286| 0.89215803| 7| 800|
| -1.0563164| 1.4410132| 8| 200|
| -0.3718215| -0.49574503| 9| 600|
| 0.7345064| -1.3823572| 10| 700|
| 1.5790144| -0.79659116| 11| 100|
| 1.5790144| -0.79659116| 11| 700|
|-0.04173269| 1.3471143| 12| 800|
| -1.4477109| 0.5785008| 13| 200|
| -1.4477109| 0.5785008| 13| 300|
| 0.59039986|-0.057216637| 14| 500|
+-----------+------------+-----+----------+
Output:
+-----+----------+-------+
|index|partitions|cluster|
+-----+----------+-------+
|98663| 300| 2|
|80841| 300| 1|
|62544| 300| 1|
| 2977| 300| 1|
|23948| 300| 1|
|44470| 300| 1|
|84420| 300| 1|
|30907| 300| 1|
|93084| 300| 1|
|58671| 300| 1|
|58051| 300| 1|
|46485| 300| 2|
|29175| 300| 1|
|41179| 300| 1|
|95224| 300| 1|
|56119| 300| 1|
|22074| 300| 1|
|88336| 300| 1|
|16965| 300| 1|
|35793| 300| 0|
+-----+----------+-------+
only showing top 20 rows
For small datasets this works fine (less than 1 million rows) however for larger dataset if i run it locally it will always end up with either Heap error or OOM due to the memory which is required for the DBSCAN. SKlearn dbscan is memory based so i am expecting all the data to moved to the memory
however i expected with this implementation to be able to "split" the original dataset into smaller chuncks and prevent these errors .
So my question is how to overcome this, so basically a trade off memory for speed
I have tried the following:
1)Increasing the number of partitions which lowers the number of data/rows grouped together thus spark would slower calculate the labels.
2)Increase the number of the dataframe partitions/shuffle (X value) .
Neither of those solutions could prevent the issue i mentioned from happening.
The only method which i haven't tried but might could work is a filter like approach where i would create N dataframe (#N the number of partitions/sub-groups) and run DBSCAN to each of the with a for loop and then joining them together but i couldn't find any post supporting such solution.
d = {}
num_p = [1,2,3,4] #consider list with the partitionsnumber 1,2,3,4etc
for ie in num_p:
d[ie] = df.filter((f.col("partitions")==ie)
pandas_df = d[ie].toPandas()
#apply dbscan
#drop features and convert back to spark dataframe with scheme
merge all spark dataframe into one with union and move to global merging.
I have the following pyspark dataframe:
import pandas as pd
foo = pd.DataFrame({'id': ['a','a','a','a', 'b','b','b','b'],
'time': [1,2,3,4,1,2,3,4],
'col': ['1','2','1','2','3','2','3','2']})
foo_df = spark.createDataFrame(foo)
foo_df.show()
+---+----+---+
| id|time|col|
+---+----+---+
| a| 1| 1|
| a| 2| 2|
| a| 3| 1|
| a| 4| 2|
| b| 1| 3|
| b| 2| 2|
| b| 3| 3|
| b| 4| 2|
+---+----+---+
I would like to iterate over all ids and obtain a python dictionary that would have as keys the id and as values the col and would look like this:
foo_dict = {'a': ['1','2','1','2'], 'b': ['3','2','3','2']})
I have in total 10k ids and around 10m rows in foo, so I am looking for an efficient implementation.
Any ideas ?
It's a pandas dataframe. You should checkout the documentaton. The dataframe object has inbuilt methods to help iterate, slice and dice your data. There is also this fun tool to help you visualize what is going on.
pandas has a ready-made method to convert a dataframe to a dict.
I am trying to replicate a code in Python using PySpark, and I found myself in a problem. So this is the code I am trying to replicate:
df_act = (df_act.assign(n_cycles = (lambda x: (x.cycles_bol != x.cycles_bol.shift(1)).cumsum())))
Keep in mind that I am working with a dataframe, and that cycles_bol is a column of dataframe "df_act".
and I simply can't. The closest I think I have gotten to the solution is the following:
df_act=df_act.withColumn(
"grp",
when(df_act['cycles_bol'] == lead("cycles_bol").over(Window.partitionBy("user_id").orderBy("timestamp")),0).otherwise(1).over(Window.orderBy("timestamp"))
).drop("grp").show()
Can anyone please help me?
Thanks in advance!
You dindt give much information
You have to orderBy,use lag to check if cycles_bol consecutives are the same and conditionally add. Use an existing column to orderBy if it wont change the order of cycles_bol. If you dont have such a column, generate one using monotonically_increasing function like I did.
df_act.withColumn('id', monotonically_increasing_id()).withColumn('n_cycles',sum(when(lag('cycles_bol').over(Window.orderBy('id'))!=col('cycles_bol'),1).otherwise(0)).over(Window.orderBy('id'))).drop('id').show()
+----------+--------+
|cycles_bol|n_cycles|
+----------+--------+
| A| 0|
| B| 1|
| B| 1|
| A| 2|
| B| 3|
| A| 4|
| A| 4|
| B| 5|
| C| 6|
+----------+--------+
I currently have a PySpark dataframe that has many columns populated by integer counts. Many of these columns have counts of zero. I would like to find a way to sum how many columns have counts greater than zero.
In other words, I would like an approach that sums values across a row, where all the columns for a given row are effectively boolean (although the datatype conversion may not be necessary). Several columns in my table are datetime or string, so ideally I would have an approach that first selects the numeric columns.
Current Dataframe example and Desired Output
+---+---------- +----------+------------
|USER| DATE |COUNT_COL1| COUNT_COL2|... DESIRED COLUMN
+---+---------- +----------+------------
| b | 7/1/2019 | 12 | 1 | 2 (2 columns are non-zero)
| a | 6/9/2019 | 0 | 5 | 1
| c | 1/1/2019 | 0 | 0 | 0
Pandas: As an example, in pandas this can be accomplished by selecting the numeric columns,converting to bool and summing with the axis=1. I am looking for a PySpark equivalent.
test_cols=list(pandas_df.select_dtypes(include=[np.number]).columns.values)
pandas_df[test_cols].astype(bool).sum(axis=1)
For numericals, you can do it by creating an array of all the columns with the integer values(using df.dtypes), and then use higher order functions. In this case I used filter to get rid of all 0s, and then used size to get the number of all non zero elements per row.(spark2.4+)
from pyspark.sql import functions as F
df.withColumn("arr", F.array(*[F.col(i[0]) for i in df.dtypes if i[1] in ['int','bigint']]))\
.withColumn("DESIRED COLUMN", F.expr("""size(filter(arr,x->x!=0))""")).drop("arr").show()
#+----+--------+----------+----------+--------------+
#|USER| DATE|COUNT_COL1|COUNT_COL2|DESIRED COLUMN|
#+----+--------+----------+----------+--------------+
#| b|7/1/2019| 12| 1| 2|
#| a|6/9/2019| 0| 5| 1|
#| c|1/1/2019| 0| 0| 0|
#+----+--------+----------+----------+--------------+
Let's say you have below df:
df.show()
df.printSchema()
+---+---+---+---+
|_c0|_c1|_c2|_c3|
+---+---+---+---+
| a| 1| 2| 3|
| a| 0| 2| 1|
| a| 0| 0| 1|
| a| 0| 0| 0|
+---+---+---+---+
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
Using case when statement you can check if column is numeric and then if it is larger than 0. In the next step f.size will return count thanks to f.array_remove which left only cols with True value.
from pyspark.sql import functions as f
cols = [f.when(f.length(f.regexp_replace(f.col(x), '\\d+', '')) > 0, False).otherwise(f.col(x).cast('int') > 0) for x in df2.columns]
df.select("*", f.size(f.array_remove(f.array(*cols), False)).alias("count")).show()
+---+---+---+---+-----+
|_c0|_c1|_c2|_c3|count|
+---+---+---+---+-----+
| a| 1| 2| 3| 3|
| a| 0| 2| 1| 2|
| a| 0| 0| 1| 1|
| a| 0| 0| 0| 0|
+---+---+---+---+-----+