Adding random samples from one spark dataframe to another - python

I have two dataframes like this:
| User |
------
| 1 |
| 2 |
| 3 |
and
| Articles |
----------
| 'A' |
| 'B' |
| 'C' |
What's an intuitive way to assign each user 2 articles randomly?
The output dataframe might look like this:
| User | Articles |
-----------------
| 1 | 'A' |
| 1 | 'C' |
| 2 | 'C' |
| 2 | 'B' |
| 3 | 'C' |
| 3 | 'A' |
Here's the code that will generate these two dataframes:
u =[(1,), (2,), (3,)]
rdd = sc.parallelize(u)
users = rdd.map(lambda x: Row(user_id=x[0]))
users_df = sqlContext.createDataFrame(users)
a = [('A',), ('B',), ('C',), ('D',), ('E',)]
rdd = sc.parallelize(a)
articles = rdd.map(lambda x: Row(article_id=x[0]))
articles_df = sqlContext.createDataFrame(articles)

Since your article list is small, it makes sense to keep it as a Python object rather than as a distributed list. This allows you to create a UDF that produces a random list of articles for each user_id. The following is one way you could do so:
from random import sample, seed
from pyspark.sql import Row
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, StringType

class ArticleRandomizer(object):
    def __init__(self, article_list, num_articles=2, preseed=0):
        self.article_list = article_list
        self.num_articles = num_articles
        self.preseed = preseed

    def getrandom(self, user):
        seed(user + self.preseed)
        return sample(self.article_list, self.num_articles)

u = [(1,), (2,), (3,)]
rdd = sc.parallelize(u)
users = rdd.map(lambda x: Row(user_id=x[0]))
users_df = sqlContext.createDataFrame(users)

a = [('A',), ('B',), ('C',), ('D',), ('E',)]
#rdd = sc.parallelize(a)
#articles = rdd.map(lambda x: Row(article_id=x[0]))
#articles_df = sqlContext.createDataFrame(articles)
article_list = [article[0] for article in a]

ARandomizer = ArticleRandomizer(article_list)
add_articles = udf(ARandomizer.getrandom, ArrayType(StringType()))
users_df.select('user_id', explode(add_articles('user_id'))).show()
This ArticleRandomizer.getrandom function is seeded by the user_id, so it is deterministic: you will get the same random list of articles for a given user on each run. You can get a potentially different list by changing the preseed value when you instantiate the class.
This hasn't been tested to see if it will scale well, but it should work fine on your dataset because the article and user dimensions are both fairly small.
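If you are on Spark 2.4 or later, here is a rough sketch of an alternative that avoids a Python UDF entirely: build the (small) article list into an array literal, shuffle it per row, and keep the first two elements. Unlike the seeded UDF above, this is not deterministic across runs.
from pyspark.sql import functions as F

# the article list from the question, kept as a plain Python list
article_list = ['A', 'B', 'C', 'D', 'E']
articles_arr = F.array(*[F.lit(x) for x in article_list])

users_df.select(
    'user_id',
    # shuffle() permutes the array independently per row; slice() keeps the first 2 elements
    F.explode(F.slice(F.shuffle(articles_arr), 1, 2)).alias('article_id')
).show()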

If the Articles DataFrame is indeed pretty small, we can run collect_list, which collapses the entire DataFrame into one row with an array column.
| Articles |
-----------------
| ['A', 'B', 'C'] |
Then we can cross-join this table to the Users one, randomly generate two different integers (this is the main part of the code below), and pick those two elements from the articles column.
The explode function is used to achieve the format you presented in the original question.
from pyspark.sql.functions import collect_list, rand, when, col, size, floor, explode, array

articles_collected = articles.agg(collect_list("Articles").alias("articles"))

users \
    .join(articles_collected, how="cross") \
    .withColumn(
        # first random index: anywhere in 0 .. size-1
        "first_rand",
        floor(rand() * size("articles"))
    ) \
    .withColumn(
        # second random index: chosen so that it can never equal first_rand
        "second_rand",
        when(
            col("first_rand") == 0,
            floor(rand() * (size("articles") - 1)) + 1
        ).otherwise(
            floor(rand() * col("first_rand"))
        )
    ) \
    .withColumn(
        "articles_picked",
        array(
            col("articles").getItem(col("first_rand").cast("int")),
            col("articles").getItem(col("second_rand").cast("int"))
        )
    ) \
    .select(
        "User",
        explode("articles_picked").alias("Articles")
    )

Related

PySpark : How do you use the values in multiple columns to perform some sort of aggregation?

What I have:
#+-------+----------+----------+
#|dotId |codePp |status |
#+-------+----------+----------+
#|dot0001 |Pp3523 |start |
#|dot0001 |Pp3524 |stop |
#|dot0020 |Pp3522 |start |
#|dot0020 |Pp3556 |stop |
#|dot9999 |Pp3545 |stop |
#|dot9999 |Pp3523 |start |
#|dot9999 |Pp3587 |stop |
#|dot9999 |Pp3567 |start |
#------------------------------|
What I want:
Instruction: if status is 'stop', append '(stop)' to codePp; otherwise keep codePp as is.
#+-------+----------------------------------------------+
#|dotId |codePp |
#+-------+----------------------------------------------+
#|dot0001 |Pp3523, Pp3524(stop) |
#|dot0020 |Pp3522, Pp3556(stop) |
#|dot9999 |Pp3545(stop), Pp3523, Pp3587(stop), Pp3567 |
#-------------------------------------------------------|
But how do I write this in PySpark?
You may try the following, using a case expression (when) to determine whether to append the status. This is done in a group by/aggregation that uses collect_list to gather all codePp values and concat_ws to convert them into a comma-separated string.
from pyspark.sql import functions as F

output_df = (
    df.groupBy("dotId")
      .agg(
          F.concat_ws(
              ", ",
              F.collect_list(
                  F.concat(
                      F.col("codePp"),
                      # append "(stop)" only when status is stop; use an empty string otherwise
                      # so that concat does not return null for non-stop rows
                      F.when(F.col("status") == "stop", "(stop)").otherwise("")
                  )
              )
          ).alias("codePp")
      )
)
Let me know if this works for you.
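For reference, a minimal self-contained sketch of the same aggregation run against the sample rows from your question; it assumes a local SparkSession, and the order of items produced by collect_list is not guaranteed.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("dot0001", "Pp3523", "start"), ("dot0001", "Pp3524", "stop"),
     ("dot0020", "Pp3522", "start"), ("dot0020", "Pp3556", "stop")],
    ["dotId", "codePp", "status"],
)
result = (
    df.groupBy("dotId")
      .agg(
          F.concat_ws(
              ", ",
              F.collect_list(
                  F.concat(
                      F.col("codePp"),
                      F.when(F.col("status") == "stop", "(stop)").otherwise("")
                  )
              )
          ).alias("codePp")
      )
)
result.show(truncate=False)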

How to transform multiple pandas dataframes to array under memory constraints?

The given problem:
I have folders named from folder1 to folder999. In each folder there are parquet files, named from 1.parquet to 999.parquet. Each parquet file contains a pandas dataframe with the following structure:
id |title |a
1 |abc |1
1 |abc |3
1 |abc |2
2 |abc |1
... |def | ...
Column a can take values in the range a1 to a3.
The partial step is to obtain this structure:
id | title | a1 | a2 | a3
1 | abc | 1 | 1 | 1
2 | abc | 1 | 0 | 0
...
In order to obtain the final form:
title
id | abc | def | ...
1 | 3 | ... |
2 | 1 | ... |
where the value in column abc is the sum of columns a1, a2 and a3.
The goal is to obtain the final form calculated over all the parquet files in all the folders.
Now, the situation I am in looks like this: I do know how to get from the partial step to the final form, e.g. by using sparse.coo_matrix() as explained in How to make full matrix from dense pandas dataframe.
The problem is: due to memory limitations I cannot simply read all the parquets at once.
I have three questions:
How do I get there efficiently if I have plenty of data (assume each parquet file is about 500 MB)?
Can I transform each parquet to final form separately and THEN merge them somehow? If yes, how could I do that?
Is there any way to skip the partial step?
For every dataframe in the files, you seem to:
group the data by the columns id and title,
then sum the data in column a for each group.
Creating a full matrix for the task is not necessary, and neither is the partial step.
I am not sure how many unique combinations of id and title exist in a single file, or across all of them. A safe approach would be to process the files in batches, save each batch's result, and later combine all the results.
Which looks like this:
import pandas as pd
import numpy as np
import string

def gen_random_data(N, M):
    # Generate N rows of random (id, title, a) data; stands in for reading a parquet file
    titles = np.apply_along_axis(lambda x: ''.join(x), 1,
                                 np.random.choice(list(string.ascii_lowercase), 3 * M).reshape(-1, 3))
    titles = np.random.choice(titles, N)
    _id = np.random.choice(np.arange(M) + 1, N)
    val = np.random.randint(M, size=(N,))
    df = pd.DataFrame(np.vstack((_id, titles, val)).T, columns=['id', 'title', 'a'])
    df = df.astype({'id': np.int64, 'title': str, 'a': np.int64})
    return df

def combine_results(dflist):
    # stitch into one dataframe
    comb_df = pd.concat(dflist, axis=1)
    # Sum over common axes i.e. id, titles
    comb_df = comb_df.apply(lambda row: np.nansum(row), axis=1)
    # Return a data frame with sum of a's
    return comb_df.to_frame('sum_of_a')

totalfiles = 10
batch = 2
filelist = []
for counter, _ in enumerate(range(0, totalfiles, batch)):
    # Read a batch of files. Here we generate random data instead
    dflist = [gen_random_data(100, 2) for _ in range(batch)]
    # Process the data in memory
    dflist = [d.groupby(['id', 'title']).agg(['sum']) for d in dflist]
    collection = combine_results(dflist)
    # write intermediate results to file and repeat the process for the rest of the files
    intermediate_result_file_name = f'resfile_{counter}'
    collection.to_parquet(intermediate_result_file_name, index=True)
    filelist.append(intermediate_result_file_name)

# Combining result files.
collection = [pd.read_parquet(file) for file in filelist]
totalresult = combine_results(collection)
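A hypothetical simpler variant, if all you need per group is the sum of a: accumulate one small partial result per file and reduce them once at the end. The paths below are placeholders for your real folder1..folder999 files, and it assumes the per-file group sums fit in memory.
import pandas as pd

partials = []
for path in ["folder1/1.parquet", "folder1/2.parquet"]:  # placeholder paths
    df = pd.read_parquet(path)
    # one small Series of group sums per file
    partials.append(df.groupby(["id", "title"])["a"].sum())

# reduce all partial sums at once, then pivot titles into columns for the final form
final = (
    pd.concat(partials)
      .groupby(level=["id", "title"]).sum()
      .unstack("title", fill_value=0)
)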

Joining two csv files in Dataflow - python

I have two csv files and I want to do a full join in Dataflow.
I read the two csv files as PCollections:
csv1
columns A | B | C | D | E
csv2
columns A | B | C | F | G
I need to join the two PCollections on the key (A, B) and get a resulting PCollection like below:
columns A | B | C | D | E | F | G
Trial 1
{'left': P_collection_1, 'right': P_collection_2}
| ' Combine' >> beam.CoGroupByKey()
| ' ExtractValues' >> beam.Values()
This is basically like a full join in SQL.
I believe you can indeed use CoGroupByKey:
Applying the Apache Beam Programming guide's example of phones and emails to your case, you can try to feed CoGroupByKey with a PCollection of 'C,D,E's, keyed with 'A,B's, and a PCollection of 'F,G's, also keyed with 'A,B's.
To make it a little clearer, the elements in each PCollection must be tuples, with their first element an 'A,B' key, and the second a 'C,D,E' or 'F,G' value:
PColl1 = PCollection(
    ('2,4', '1,2,5'),
    ('1,10', '4,4,9'),
    ...)  # this is the PCollection of CDE's
PColl2 = PCollection(
    ('2,4', '30,3'),
    ('20,1', '2,1'),
    ...)  # this is the PCollection of FG's
(The PCollection notation is just here to illustrate)
Then we would apply:
join = {'CDE': PColl1, 'FG': PColl2} | beam.CoGroupByKey()
As per the programming guide, result should be:
PCollection(
    ('2,4', {
        'CDE': ['1,2,5'],
        'FG': ['30,3']
    }),
    ('1,10', {
        'CDE': ['4,4,9']
    }),
    ('20,1', {
        'FG': ['2,1']
    }),
    ...)
If A and B take the value 2,4 more than once in the same file, it shouldn't be a problem; we would simply have several values in CDE or in FG.
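Here is a minimal runnable sketch of the idea, using the illustrative tuples above; in a real pipeline these would come from parsing the two CSV files into (key, value) pairs.
import apache_beam as beam

with beam.Pipeline() as p:
    cde = p | "CreateCDE" >> beam.Create([("2,4", "1,2,5"), ("1,10", "4,4,9")])
    fg = p | "CreateFG" >> beam.Create([("2,4", "30,3"), ("20,1", "2,1")])
    (
        {"CDE": cde, "FG": fg}
        | "CoGroupByKey" >> beam.CoGroupByKey()
        | "Print" >> beam.Map(print)
    )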

Applying a udf function in a distributed fashion in PySpark

Say I have a very basic Spark DataFrame that consists of a couple of columns, one of which contains a value that I want to modify.
|| value || lang ||
| 3 | en |
| 4 | ua |
Say I want to have a new column per specific class where I add a float number to the given value (this is not very relevant to the final question; in reality I do a prediction with sklearn there, but for simplicity let's assume we are adding something, the idea being that I modify the value in some way). So, given a dict classes={'1':2.0, '2':3.0}, I would like to have a column for each class where I add the value from the DF to the class value, and then save each result to a csv:
class_1.csv
|| value || lang || my_class | modified ||
| 3 | en | 1 | 5.0 | # this is 3+2.0
| 4 | ua | 1 | 6.0 | # this is 4+2.0
class_2.csv
|| value || lang || my_class | modified ||
| 3 | en | 2 | 6.0 | # this is 3+3.0
| 4 | ua | 2 | 7.0 | # this is 4+3.0
So far I have the following code that works and modifies the value for each defined class, but it is done with a for loop and I am looking for a more advanced optimization for it:
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf
from pyspark.sql.functions import lit

# create session and context
spark = pyspark.sql.SparkSession.builder.master("yarn").appName("SomeApp").getOrCreate()
conf = SparkConf().setAppName('Some_App').setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)

my_df = spark.read.csv("some_file.csv")

# modify the value here
def do_stuff_to_column(value, separate_class):
    # do stuff to the column; let's pretend we just add a specific value per class that is read from a dictionary
    class_dict = {'1': 2.0, '2': 3.0}  # would be loaded from somewhere
    return float(value + class_dict[separate_class])

# iterate over each given class later
class_dict = {'1': 2.0, '2': 3.0}  # in reality have more than 10 classes

# create a udf function
udf_modify = udf(do_stuff_to_column, FloatType())

# loop over each class
for my_class in class_dict:
    # create the column first with lit
    my_df2 = my_df.withColumn("my_class", lit(my_class))
    # modify using udf function
    my_df2 = my_df2.withColumn("modified", udf_modify("value", "my_class"))
    # write to csv now
    my_df2.write.format("csv").save("class_" + my_class + ".csv")
So the question is: is there a better/faster way of doing this than in a for loop?
I would use some form of join, in this case crossJoin. Here's a MWE:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(3, 'en'), (4, 'ua')], ['value', 'lang'])
classes = spark.createDataFrame([(1, 2.), (2, 3.)], ['class_key', 'class_value'])
res = df.crossJoin(classes).withColumn('modified', F.col('value') + F.col('class_value'))
res.show()
For saving as separate CSVs, I think there is no better way than to use a loop.
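For completeness, a minimal sketch of that saving loop, reusing res and classes from the MWE above (the output paths are illustrative):
# collect the (small) set of class keys, then write one CSV directory per class
for row in classes.select('class_key').collect():
    k = row['class_key']
    (res.filter(F.col('class_key') == k)
        .write.mode('overwrite')
        .csv(f'class_{k}'))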

Python Output Length

I'm attempting to output my database table data, which works aside from long table rows. The columns need to be as large as the longest database row. I'm having trouble implementing a calculation to correctly output the table proportionally instead of a huge mess when long rows are outputted (without using a third party library e.g. Print results in MySQL format with Python). Please let me know if you need more information.
Database connection:
connection = sqlite3.connect("test_.db")
c = connection.cursor()
c.execute("SELECT * FROM MyTable")
results = c.fetchall()
formatResults(results)
Table formatting:
def formatResults(x):
    try:
        widths = []
        columns = []
        tavnit = '|'
        separator = '+'
        for cd in c.description:
            widths.append(max(cd[2], len(cd[0])))
            columns.append(cd[0])
        for w in widths:
            tavnit += " %-"+"%ss |" % (w,)
            separator += '-'*w + '--+'
        print(separator)
        print(tavnit % tuple(columns))
        print(separator)
        for row in x:
            print(tavnit % row)
        print(separator)
        print("")
    except:
        showMainMenu()
        pass
Output problem example:
+------+------+---------+
| Date | Name | LinkOrFile |
+------+------+---------+
| 03-17-2016 | hi.com | Locky |
| 03-18-2016 | thisisitsqq.com | None |
| 03-19-2016 | http://ohiyoungbuyff.com\69.exe?1 | None |
| 03-20-2016 | http://thisisitsqq..com\69.exe?1 | None |
| 03-21-2016 | %Temp%\zgHRNzy\69.exe | None |
| 03-22-2016 | | None |
| 03-23-2016 | E52219D0DA33FDD856B2433D79D71AD6 | Downloader |
| 03-24-2016 | microsoft.com | None |
| 03-25-2016 | 89.248.166.132 | None |
| 03-26-2016 | http://89.248.166.131/55KB5js9dwPtx4= | None |
If your main problem is making column widths consistent across all the lines, this python package could do the job: https://pypi.python.org/pypi/tabulate
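For example, a short sketch of that approach (pip install tabulate), reusing the cursor and results from the question's code; the "psql" table format should give the +---+ style borders you are after:
from tabulate import tabulate

# reusing the cursor `c` and `results` from the question's database code
headers = [cd[0] for cd in c.description]
print(tabulate(results, headers=headers, tablefmt="psql"))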
Below you will find a very simple example of a possible formatting approach.
The key point is to find the largest length of each column and then use the format method of the string object:
#!/usr/bin/python
import random
import string
from operator import itemgetter

def randomString(minLen=1, maxLen=10):
    """ Random string of length between 1 and 10 """
    l = random.randint(minLen, maxLen)
    return ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(l))

COLUMNS = 4

def randomTable():
    table = []
    for i in range(10):
        table.append([randomString() for j in range(COLUMNS)])
    return table

def findMaxColumnLengs(table):
    """ Returns tuple of max column lengs """
    maxLens = [0] * COLUMNS
    for l in table:
        lens = [len(s) for s in l]
        maxLens = [max(maxLens[e[0]], e[1]) for e in enumerate(lens)]
    return maxLens

if __name__ == '__main__':
    ll = randomTable()
    ml = findMaxColumnLengs(ll)
    # tuple of formatting statements, see format docs
    formatStrings = ["{:<%s}" % str(m) for m in ml]
    fmtStr = "|".join(formatStrings)
    print("==================================")
    for l in ll:
        print(l)
    print("==================================")
    for l in ll:
        print(fmtStr.format(*l))
This prints the initial table packed in the list of lists and the formatted output.
==================================
['2U7Q', 'DZK8Z5XT', '7ZI0W', 'A9SH3V3U']
['P7SOY3RSZ1', 'X', 'Z2W', 'KF6']
['NO8IEY9A', '4FVGQHG', 'UGMJ', 'TT02X']
['9S43YM', 'JCUT0', 'W', 'KB']
['P43T', 'QG', '0VT9OZ0W', 'PF91F']
['2TEQG0H6A6', 'A4A', '4NZERXV', '6KMV22WVP0']
['JXOT', 'AK7', 'FNKUEL', 'P59DKB8']
['BTHJ', 'XVLZZ1Q3H', 'NQM16', 'IZBAF']
['G0EF21S', 'A0G', '8K9', 'RGOJJYH2P9']
['IJ', 'SRKL8TXXI', 'R', 'PSUZRR4LR']
==================================
2U7Q |DZK8Z5XT |7ZI0W |A9SH3V3U
P7SOY3RSZ1|X |Z2W |KF6
NO8IEY9A |4FVGQHG |UGMJ |TT02X
9S43YM |JCUT0 |W |KB
P43T |QG |0VT9OZ0W|PF91F
2TEQG0H6A6|A4A |4NZERXV |6KMV22WVP0
JXOT |AK7 |FNKUEL |P59DKB8
BTHJ |XVLZZ1Q3H|NQM16 |IZBAF
G0EF21S |A0G |8K9 |RGOJJYH2P9
IJ |SRKL8TXXI|R |PSUZRR4LR
The code that you used is for MySQL. The critical part is the line widths.append(max(cd[2], len(cd[0]))) where cd[2] gives the length of the longest data in that column. This works for MySQLdb.
However, you are using sqlite3, for which the value cd[2] is set to None:
https://docs.python.org/2/library/sqlite3.html#sqlite3.Cursor.description
Thus, you will need to replace the following logic:
for cd in c.description:
    widths.append(max(cd[2], len(cd[0])))
    columns.append(cd[0])
with your own. The rest of the code should be fine as long as widths is computed correctly.
The easiest way to compute widths correctly would be to traverse each row of the result, find the maximum width of each column, and store it in widths. This is just some pseudo code:
for cd in c.description:
    columns.append(cd[0])                        # Get column headers

widths = [0] * len(c.description)                # Initialize to number of columns.
for row in x:
    for i in range(len(row)):                    # This assumes that row is an iterable, like list
        v = row[i]                               # Take value of ith column
        widths[i] = max(len(str(v)), widths[i])  # str() so non-string values (ints, None) also work
At the end of this, widths should contain the maximum length of each column.
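Putting it together, one possible (untested) rework of formatResults for sqlite3 along those lines; it takes the cursor description as a parameter and uses str() to guard against non-string values:
def formatResults(rows, description):
    columns = [cd[0] for cd in description]
    # each column is at least as wide as its header, and as wide as its longest value
    widths = [len(name) for name in columns]
    for row in rows:
        for i, v in enumerate(row):
            widths[i] = max(widths[i], len(str(v)))
    tavnit = '|' + ''.join(' %%-%ds |' % w for w in widths)
    separator = '+' + ''.join('-' * (w + 2) + '+' for w in widths)
    print(separator)
    print(tavnit % tuple(columns))
    print(separator)
    for row in rows:
        print(tavnit % tuple(str(v) for v in row))
    print(separator)

# usage with the cursor from the question:
# formatResults(results, c.description)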
