I have two CSV files and I want to do a full join in Dataflow.
I read the two CSV files as PCollections.
csv1
columns A | B | C | D | E
csv2
columns A | B | C | F | G
I need to join the two PCollections on the key (A, B) and get a resulting PCollection like below:
columns A | B | C | D | E | F | G
Trial 1
{'left': P_collection_1, 'right': P_collection_2}
| ' Combine' >> beam.CoGroupByKey()
| ' ExtractValues' >> beam.Values()
This is basically like a full join in SQL.
I believe you can indeed use CoGroupByKey:
Applying the Apache Beam Programming guide's example of phones and emails to your case, you can try to feed CoGroupByKey with a PCollection of 'C,D,E's, keyed with 'A,B's, and a PCollection of 'F,G's, also keyed with 'A,B's.
To make it a little clearer, the elements in each PCollection must be tuples, with their first element an 'A,B' key, and the second a 'C,D,E' or 'F,G' value:
PColl1 = PCollection(
('2,4', '1,2,5'),
('1,10', '4,4,9'),
...) # this is the PCollection of CDE's
PColl2 = PCollection(
('2,4', '30,3'),
('20,1', '2,1'),
...) # this is the PCollection of FG's
(The PCollection notation is just here to illustrate)
Then we would apply:
join = {'CDE': PColl1, 'FG': PColl2} | beam.CoGroupByKey()
As per the programming guide, the result should be:
PCollection(
('2,4', {
'CDE': ['1,2,5'],
'FG': ['30,3']
}
),
('1,10', {
'CDE': ['4,4,9']
}
),
('20,1', {
'FG': ['2,1']
}
),
...)
If A and B take the value 2,4 more than once in the same file, it shouldn't be a problem: we would simply get several values under CDE or under FG.
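To make the whole pipeline concrete, here is a minimal sketch under a few assumptions: the file names csv1.csv and csv2.csv, the naive comma split (no quoted fields), the helper name merge_rows, and the padding of missing sides with None are all mine for illustration, not taken from your code. It keys both files on (A, B), co-groups them, and emits A|B|C|D|E|F|G rows the way a SQL full outer join would (note that csv2's own C column is dropped, matching the keying described above).
import apache_beam as beam

def merge_rows(element):
    # element looks like ((A, B), {'CDE': [[C, D, E], ...], 'FG': [[F, G], ...]})
    (a, b), grouped = element
    cdes = list(grouped['CDE']) or [[None, None, None]]  # pad the side missing from the join
    fgs = list(grouped['FG']) or [[None, None]]
    for cde in cdes:
        for fg in fgs:
            yield [a, b] + list(cde) + list(fg)  # A, B, C, D, E, F, G

with beam.Pipeline() as p:
    # csv1: A,B,C,D,E  ->  key (A, B), value [C, D, E]
    pcoll1 = (p
              | 'ReadCsv1' >> beam.io.ReadFromText('csv1.csv', skip_header_lines=1)
              | 'SplitCsv1' >> beam.Map(lambda line: line.split(','))
              | 'KeyCsv1' >> beam.Map(lambda f: ((f[0], f[1]), f[2:5])))
    # csv2: A,B,C,F,G  ->  key (A, B), value [F, G]
    pcoll2 = (p
              | 'ReadCsv2' >> beam.io.ReadFromText('csv2.csv', skip_header_lines=1)
              | 'SplitCsv2' >> beam.Map(lambda line: line.split(','))
              | 'KeyCsv2' >> beam.Map(lambda f: ((f[0], f[1]), f[3:5])))
    joined = ({'CDE': pcoll1, 'FG': pcoll2}
              | 'CoGroup' >> beam.CoGroupByKey()
              | 'MergeRows' >> beam.FlatMap(merge_rows))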
Related
Say we have a select query like
select a.col1, a.col2, a.col3, b.col1, b.col2, c.col1
from a, b, c
where a.col1(+)=b.col2 and b.col3(+)=c.col2
In the MyBatis XML we have a mapper defined, which can include association and collection results mapped to a MyBatis "resultMap".
MyBatis maps the respective columns from the SQL query to result objects, without duplicating objects, when we use "resultMap".
Is there something similar in SQLAlchemy?
<resultMap id="example" type="ExampleClass1">
<id property="attribute1" column = "a.col1"/>
<result property="attribute2" column = "a.col2"/>
<result property="attribute3" column = "a.col2"/>
<association property = "assocattribute" javaType="ExampleClass2"
resultMap = "assocMap">
<collection property = "listattribute" "ofType" = "ExampleClass3"
resultMap = "collectMap">
</resultMap>
<resultMap id="assocMap">
<result property = "assocattribute1" column = "b.col1">
<result property = "assocattribute2" column = "b.col2">
</resultMap>
<resultMap id="collectMap">
<result poperty = "collectattribute1" column = "c.col1">
</resultMap>
In the above example, joins are handled and duplicates are removed by MyBatis. A similar solution is needed in SQLAlchemy, if possible.
|a.col1 |a.col2|a.col3|b.col1|b.col2|c.col1|
|-------|------|------|------|------|------|
| a | b | c |d |e |f |
| a | b | c |d |e |g |
| a | b | c |d |e |j |
| a | b | c |d |e |k |
The required Java object is (abstract representation):
ExampleClass1 {
    attribute1: a,
    attribute2: b,
    attribute3: c,
    assocattribute => ExampleClass2 {
        assocattribute1: d,
        assocattribute2: e
    },
    listattribute => List( ExampleClass3{ collectattribute1: f },
                           ExampleClass3{ collectattribute1: g },
                           ExampleClass3{ collectattribute1: j },
                           ExampleClass3{ collectattribute1: k } )
}
We tried a few things like sqlalchemy session.query(a, b, c).join() with conditions etc., where the objects from a, b and c end up duplicated by the join conditions. In MyBatis, "resultMap" removes the duplicates and maps to the proper output object.
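For what it's worth, the closest SQLAlchemy analogue I know of to a MyBatis resultMap with association/collection is relationship() combined with joined eager loading; the legacy Query API deduplicates the parent entities through the ORM's identity map. Below is only a minimal sketch: the foreign-key wiring (b.col2 -> a.col1, c.col2 -> b.col2), the SQLite engine and the SQLAlchemy 1.4+ imports are assumptions for illustration and do not reproduce the Oracle (+) outer-join conditions exactly.
from sqlalchemy import Column, String, ForeignKey, create_engine
from sqlalchemy.orm import declarative_base, relationship, joinedload, Session

Base = declarative_base()

class ExampleClass1(Base):               # maps table a
    __tablename__ = 'a'
    col1 = Column(String, primary_key=True)
    col2 = Column(String)
    col3 = Column(String)
    # counterpart of the MyBatis <association>
    assocattribute = relationship('ExampleClass2', uselist=False)

class ExampleClass2(Base):               # maps table b
    __tablename__ = 'b'
    col1 = Column(String)
    col2 = Column(String, ForeignKey('a.col1'), primary_key=True)
    # counterpart of the MyBatis <collection>
    listattribute = relationship('ExampleClass3')

class ExampleClass3(Base):               # maps table c
    __tablename__ = 'c'
    col1 = Column(String, primary_key=True)
    col2 = Column(String, ForeignKey('b.col2'))

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

with Session(engine) as session:
    # One ExampleClass1 object per row of a, with assocattribute and its listattribute
    # populated from the joined rows; duplicated parent rows are collapsed by the identity map.
    objs = (session.query(ExampleClass1)
            .options(joinedload(ExampleClass1.assocattribute)
                     .joinedload(ExampleClass2.listattribute))
            .all())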
I want to create a new column from a string column that uses ";" as a separator, and drop a trailing ";" if it exists, using Python/PySpark:
Inputs :
"511;520;611;"
"322;620"
"3;321;"
"334;344"
expected Output :
Column | new column
"511;520;611;" | [511,520,611]
"322;620" | [322,620]
"3;321;" | [3,321]
"334;344" | [334,344]
I tried:
data = data.withColumn(
"newcolumn",
split(col("column"), ";"))
but I get an empty string at the end of the array, like here, and I want to remove it if it exists:
Column | new column
"511;520;611;" | [511,520,611,empty string]
"322;620" | [322,620]
"3;321;" | [3,321,empty string]
"334;344" | [334,344]
For Spark version >= 2.4, use the filter function with an x != '' condition to filter out empty strings from the array:
from pyspark.sql.functions import expr
data = data.withColumn("newcolumn", expr("filter(split(column, ';'), x -> x != '')"))
The given problem:
I have folders named folder1 to folder999. In each folder there are parquet files, named 1.parquet to 999.parquet. Each parquet file contains a pandas DataFrame with the following structure:
id |title |a
1 |abc |1
1 |abc |3
1 |abc |2
2 |abc |1
... |def | ...
where column a can take a value in the range 1 to 3 (a1 to a3).
The partial step is to obtain this structure:
id | title | a1 | a2 | a3
1 | abc | 1 | 1 | 1
2 | abc | 1 | 0 | 0
...
in order to obtain the final form:
title
id | abc | def | ...
1 | 3 | ... |
2 | 1 | ... |
where the value of column abc is the sum of columns a1, a2 and a3.
The goal is to obtain this final form, calculated over all the parquet files in all the folders.
The situation I am in now looks like this: I do know how to get from the partial step to the final form, e.g. by using sparse.coo_matrix() as explained in How to make full matrix from dense pandas dataframe.
The problem is: due to memory limitations I cannot simply read all the parquets at once.
I have three questions:
How do I get there efficiently if I have a lot of data (assume each parquet file is about 500 MB)?
Can I transform each parquet file to the final form separately and THEN merge them somehow? If yes, how could I do that?
Is there any way to skip the partial step?
For every DataFrame in the files, you seem to:
Group the data by the columns id and title
Sum the data in column a for each group
Creating a full matrix for this task is not necessary, and neither is the partial step.
I am not sure how many unique combinations of id and title exist in a single file, or across all of them. A safe approach would be to process the files in batches, save their results, and later combine all the results.
Which looks like this:
import pandas as pd
import numpy as np
import string

def gen_random_data(N, M):
    # Generate a random stand-in for one parquet file with columns id, title, a
    titles = np.apply_along_axis(lambda x: ''.join(x), 1,
                                 np.random.choice(list(string.ascii_lowercase), 3 * M).reshape(-1, 3))
    titles = np.random.choice(titles, N)
    _id = np.random.choice(np.arange(M) + 1, N)
    val = np.random.randint(M, size=(N,))
    df = pd.DataFrame(np.vstack((_id, titles, val)).T, columns=['id', 'title', 'a'])
    df = df.astype({'id': np.int64, 'title': str, 'a': np.int64})
    return df

def combine_results(dflist):
    # Stitch the per-file (or per-batch) results into one dataframe
    comb_df = pd.concat(dflist, axis=1)
    # Sum over the common axis, i.e. the (id, title) index
    comb_df = comb_df.apply(lambda row: np.nansum(row), axis=1)
    # Return a dataframe with the sum of a's
    return comb_df.to_frame('sum_of_a')

totalfiles = 10
batch = 2
filelist = []
for counter, _ in enumerate(range(0, totalfiles, batch)):
    # Read `batch` files' worth of data; here random data is generated instead
    dflist = [gen_random_data(100, 2) for _ in range(batch)]
    # Process the batch in memory: group by (id, title) and sum column a
    dflist = [_.groupby(['id', 'title']).agg(['sum']) for _ in dflist]
    collection = combine_results(dflist)
    # Write the intermediate result to file and repeat for the rest of the files
    intermediate_result_file_name = f'resfile_{counter}'
    collection.to_parquet(intermediate_result_file_name, index=True)
    filelist.append(intermediate_result_file_name)

# Combine the intermediate result files
collection = [pd.read_parquet(file) for file in filelist]
totalresult = combine_results(collection)
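If you also need the final wide form from your question (one column per title), a possible follow-up, assuming totalresult is the (id, title)-indexed frame with a sum_of_a column produced above, is a plain pivot:
# Pivot the (id, title) sums into one column per title; missing combinations become 0.
final = (totalresult
         .reset_index()
         .pivot_table(index='id', columns='title', values='sum_of_a',
                      aggfunc='sum', fill_value=0))
print(final)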
I have a script that collates sets of tags from other dataframes, converts them into comma-separated strings, and adds all of this to a new dataframe. If I use pd.read_csv to generate the dataframe, the first entry is what I expect it to be. However, if I use the df_empty function (below), then I get a copy of the headers in that first row instead of the data I want. The only difference is that I generate a new dataframe instead of loading one.
The resultData = pd.read_csv(...) call reads a .csv file that contains only the following headers and no further data:
Sheet, Cause, Initiator, Group, Effects
The df_empty function is as follows:
def df_empty(columns, dtypes, index=None):
assert len(columns)==len(dtypes)
df = pd.DataFrame(index=index)
for c,d in zip(columns, dtypes):
df[c] = pd.Series(dtype=d)
return df
# https://stackoverflow.com/a/48374031
# Usage: df = df_empty(['a', 'b'], dtypes=[np.int64, np.int64])
My script contains the following line to create the dataframe:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],[np.str,np.int64,np.str,np.str,np.str])
I've also used the following with no differences:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],['object','int64','object','object','object'])
My script to collate the data and add it to my dataframe is as follows:
data = {'Sheet': sheetNum, 'Cause': causeNum, 'Initiator': initTag, 'Group': grp, 'Effects': effectStr}
count = len(resultData)
resultData.at[count,:] = data
When I run display(data), I get the following in Jupyter:
{'Sheet': '0001',
'Cause': 1,
'Initiator': 'Tag_I1',
'Group': 'DIG',
'Effects': 'Tag_O1, Tag_O2,...'}
What I want to see with both options / what I get when reading the csv:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
+-------+-------+-----------+-------+--------------------+
| 0001 | 1 | Tag_I1 | DIG | Tag_O1, Tag_O2,... |
| 0001 | 2 | Tag_I2 | DIG | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
What I see when generating a dataframe with df_empty:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
| 0001 | 2 | Tag_I2 | DIG | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
Any ideas on what might be causing the generated dataframe to copy my headers into the first row, and whether it is possible for me to avoid reading an otherwise empty csv?
Thanks!
Why? Because you've inserted the first row as data. The magic behaviour of treating the first row as the header lives in read_csv(); if you create your dataframe without read_csv, the first row is not treated specially.
Solution? Skip the first row when inserting into the dataframe generated by df_empty, as in the sketch below.
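A minimal sketch of that fix, assuming the rows you collate come from some iterable (called source_rows here, a made-up name) whose first element repeats the header; with read_csv that row was consumed as the header, with df_empty it is not:
import pandas as pd

for i, data in enumerate(source_rows):
    if i == 0:
        continue  # header-like first row: do not insert it as data
    # align the dict on the column names and append it as a new row
    resultData.loc[len(resultData)] = pd.Series(data)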
I have two dataframes like this:
| User |
------
| 1 |
| 2 |
| 3 |
and
| Articles |
----------
| 'A' |
| 'B' |
| 'C' |
What's an intuitive way to assign each user 2 articles randomly?
The output dataframe might look like this:
| User | Articles |
-----------------
| 1 | 'A' |
| 1 | 'C' |
| 2 | 'C' |
| 2 | 'B' |
| 3 | 'C' |
| 3 | 'A' |
Here's the code that will generate these two dataframes:
u =[(1,), (2,), (3,)]
rdd = sc.parallelize(u)
users = rdd.map(lambda x: Row(user_id=x[0]))
users_df = sqlContext.createDataFrame(users)
a = [('A',), ('B',), ('C',), ('D',), ('E',)]
rdd = sc.parallelize(a)
articles = rdd.map(lambda x: Row(article_id=x[0]))
articles_df = sqlContext.createDataFrame(articles)
Since your article list is small, it makes sense to keep it as a Python object rather than a distributed list. This allows you to create a UDF that produces a random list of articles for each user_id. The following is one way you could do so:
from random import sample,seed
from pyspark.sql import Row
from pyspark.sql.functions import udf,explode
from pyspark.sql.types import ArrayType,StringType
class ArticleRandomizer(object):
def __init__(self,article_list,num_articles=2,preseed=0):
self.article_list=article_list
self.num_articles=num_articles
self.preseed=preseed
def getrandom(self,user):
seed(user+self.preseed)
return sample(self.article_list,self.num_articles)
u =[(1,), (2,), (3,)]
rdd = sc.parallelize(u)
users = rdd.map(lambda x: Row(user_id=x[0]))
users_df = sqlContext.createDataFrame(users)
a = [('A',), ('B',), ('C',), ('D',), ('E',)]
#rdd = sc.parallelize(a)
#articles = rdd.map(lambda x: Row(article_id=x[0]))
#articles_df = sqlContext.createDataFrame(articles)
article_list=[article[0] for article in a]
ARandomizer=ArticleRandomizer(article_list)
add_articles=udf(ARandomizer.getrandom,ArrayType(StringType()))
users_df.select('user_id',explode(add_articles('user_id'))).show()
The ArticleRandomizer.getrandom function is seeded by the user_id, so it is deterministic, meaning you will get the same random list of articles for a given user on each run. You can get a potentially different list by changing the preseed value when you instantiate the class.
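For example, reusing the names already defined in the snippet above:
# Same article_list; only the preseed differs, so the (still deterministic)
# assignments change.
ARandomizer_alt = ArticleRandomizer(article_list, num_articles=2, preseed=42)
add_articles_alt = udf(ARandomizer_alt.getrandom, ArrayType(StringType()))
users_df.select('user_id', explode(add_articles_alt('user_id'))).show()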
This hasn't been tested to see whether it scales well, but it should work fine on your dataset because both the articles and the users dimensions are fairly small.
If the Articles DataFrame is indeed pretty small, we can run collect_list, which will take the entire DataFrame and turn it into a single row with an array column:
| Articles |
-----------------
| ['A', 'B', 'C'] |
Then we can cross-join this table to the Users one, randomly generate two different integers (this is the main part of the code below), and then pick those two elements from the articles column.
The explode function is used to achieve the format you presented in the original question.
from pyspark.sql.functions import collect_list, rand, when, col, size, floor, explode, array
articles_collected = articles.agg(collect_list("Articles").alias("articles"))
users \
.join(articles_collected, how="cross") \
.withColumn(
"first_rand",
floor(rand() * size("articles"))
) \
.withColumn(
"second_rand",
when(
col("first_rand") == 0,
floor(rand() * (size("articles") - 1)) + 1
).otherwise(
floor(rand() * col("first_rand"))
)
) \
.withColumn(
"articles_picked",
array(
col("articles").getItem(col("first_rand").cast("int")),
col("articles").getItem(col("second_rand").cast("int"))
)
) \
.select(
"User",
explode("articles_picked").alias("Articles")
)
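One detail worth noting about the code above: first_rand is drawn from 0..size-1; when it is 0, second_rand is drawn from 1..size-1, otherwise from 0..first_rand-1, so the two indices are always different and each user ends up with two distinct articles.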