PySpark: read files without knowing the key of every single row [duplicate]

I'm trying to read a retrosheet event file into Spark. The event file is structured as follows:
id,TEX201403310
version,2
info,visteam,PHI
info,hometeam,TEX
info,site,ARL02
info,date,2014/03/31
info,number,0
info,starttime,1:07PM
info,daynight,day
info,usedh,true
info,umphome,joycj901
info,attendance,49031
start,reveb001,"Ben Revere",0,1,8
start,rollj001,"Jimmy Rollins",0,2,6
start,utlec001,"Chase Utley",0,3,4
start,howar001,"Ryan Howard",0,4,3
start,byrdm001,"Marlon Byrd",0,5,9
id,TEX201404010
version,2
info,visteam,PHI
info,hometeam,TEX
As you can see, for each game the fields loop back around.
I've read the file into an RDD and then, via a second for loop, added a key for each game, which appears to work. But I was hoping to get some feedback on whether there is a cleaner way to do this using Spark methods.
logFile = '2014TEX.EVA'
# Pull every line back to the driver so the games can be keyed sequentially
event_data = (sc
              .textFile(logFile)
              .collect())

idKey = 0
newevent_list = []
for line in event_data:
    # every 'id,...' line marks the start of a new game, so bump the key
    if line.startswith('id'):
        idKey += 1
    newevent_list.append((idKey, line))

event_data = sc.parallelize(newevent_list)

PySpark since version 1.1 supports Hadoop input formats. You can use the textinputformat.record.delimiter option to set a custom record delimiter, as below:
from operator import itemgetter

retrosheet = sc.newAPIHadoopFile(
    '/path/to/retrosheet/file',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\nid,'}
)

games = (retrosheet
    .filter(itemgetter(1))   # keep only records with non-empty text
    .values()
    .filter(lambda x: x)
    .map(lambda v: (
        # the delimiter strips the leading 'id,' from all but the first record, so add it back
        v if v.startswith('id') else 'id,{0}'.format(v)).splitlines()))
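If the goal, as in the question, is to end up with (game id, line) pairs, one possible follow-up (a sketch, not part of the original answer; the games and keyed_lines names are assumptions) is to flatten each record and key every line by the game id taken from its leading id line:
keyed_lines = games.flatMap(
    # lines[0] looks like 'id,TEX201403310', so split(',')[1] is the game id
    lambda lines: [(lines[0].split(',')[1], line) for line in lines])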
Since Spark 2.4 you can also read the data into a DataFrame using the text reader:
spark.read.option("lineSep", '\nid,').text('/path/to/retrosheet/file')
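As with the RDD version, the delimiter strips the leading id, from every record after the first, so a possible follow-up (again a sketch, not from the original answer; the games_df name is an assumption) is to restore the prefix and split each record back into its lines:
from pyspark.sql import functions as F

games_df = (spark.read.option("lineSep", '\nid,')
            .text('/path/to/retrosheet/file')
            # re-prefix records that lost their 'id,' to the delimiter
            .withColumn("value",
                        F.when(F.col("value").startswith("id,"), F.col("value"))
                         .otherwise(F.concat(F.lit("id,"), F.col("value"))))
            # split each game record back into individual event lines
            .withColumn("lines", F.split("value", "\n")))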

Related

How to read empty delta partitions without failing in Azure Databricks?

I'm looking for a workaround. Sometimes our automated framework will read Delta partitions that do not exist. It fails because there are no parquet files in that partition.
I don't want it to fail.
What I do then is:
spark_read.format('delta').option("basePath", location) \
    .load('/mnt/water/green/date=20221209/object=34')
Instead, I want it to return an empty dataframe, i.e. a dataframe with no records.
I did that as shown below, but found it a bit cumbersome, and was wondering if there is a better way.
df = spark_read.format('delta').load(location)
folder_partition = '/date=20221209/object=34'.strip("/").split("/")
for folder_pruning_token in folder_partition:
    folder_pruning_token_split = folder_pruning_token.split("=")
    column_name = folder_pruning_token_split[0]
    column_value = folder_pruning_token_split[1]
    df = df.filter(df[column_name] == column_value)
You really don't need that trick with Delta Lake tables. It was primarily used for Parquet and other file formats to avoid scanning files on HDFS or cloud storage, which is very expensive.
You just need to load the data and filter it using where/filter. It's similar to what you do:
df = spark_read.format('delta').load(location) \
    .filter("date = '20221209' and object = 34")
If you need to, you can of course extract those values automatically, with slightly simpler code:
df = spark_read.format('delta').load(location)
folder_partition = '/date=20221209/object=34'.strip("/").split("/")
cols = [f"{s[0]} = '{s[1]}'"
        for s in [f.split('=') for f in folder_partition]]
df = df.filter(" and ".join(cols))

How to save results from an API call that uses a pandas column for the requests before the whole thing times out when using apply?

I have a pandas dataframe with strings that I'm using to query an API and return the results.
I'm trying to call the API using a function and .apply, and then save the results from the API calls into a CSV file. The problem is that I'm making 10,000+ requests and my kernel/notebook crashes. Basically it's a big operation and I'm guessing I'm running out of memory. So I'm trying to think of a way to make these API calls and save the results without it all crashing. My version with .apply works with a small amount of data, but not once it gets larger.
So my notebook code currently looks something like this.
df = pd.read_csv('bigstringlist.csv')
df = df.loc[0:3000]
My function looks something like this.
from time import sleep
import urllib.parse
import requests

def api_fetch_func(address):
    sleep(.2)  # small delay to throttle requests
    API_PRIVATE = 'awewaefawefawef'
    encoded = urllib.parse.quote(address)
    query = ('https://apitocall' + str(encoded) +
             '.json?limit=1&key=' + API_PRIVATE)
    response = requests.get(query)
    # retry until the response body parses as JSON
    while True:
        try:
            jsonResponse = response.json()
            break
        except ValueError:
            response = requests.get(query)
    try:
        return jsonResponse['results']
    except KeyError:
        return None
Then I'm calling the function like so
df['response_col'] = df['string_col'].apply(api_fetch_func)
Something tells me that .apply isn't the right thing to do here. Would it be better if I just pushed the API responses into an array or another dataframe?
Should I just use .iterrows to loop over the list of strings and call the function? Something tells me .apply tries to jam too much into memory and that's why this doesn't work.
So I was going to try
results = []
for index, row in df.iterrows():
    # call API
    # push results to array
If it's a memory issue, what I'd do is write the API-calling function as a generator with the yield statement. Then you can loop through the generator and save smaller dataframes to the csv files rather than holding everything in memory in one go (a sketch of such a generator follows the snippet below).
responses = []
for idx, response in api_fetch_generator():
    responses.append(response)
    if (idx + 1) % 500 == 0:
        # Save this batch to a csv (using idx to control the file name),
        # then drop it from memory before starting the next batch.
        df = pd.DataFrame({'response_col': responses})
        df.to_csv(f"response_batch_{(idx + 1) // 500}.csv")
        responses = []
# Combine the csv's after everything is saved.
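The generator itself is left to the reader; one possible version (a sketch, assuming the df and api_fetch_func from the question are in scope) could be:
def api_fetch_generator():
    # Yield (index, result) pairs one at a time instead of building
    # the whole response column in memory with .apply.
    for idx, address in enumerate(df['string_col']):
        yield idx, api_fetch_func(address)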

Fastest way to send dataframe to Redis

I have a dataframe that contains 2 columns. For each row, I simply want to create a Redis key/value pair where the first value of the row is the key and the 2nd value is the value. I've done research and I think I've found the fastest way of doing this via iterables:
def send_to_redis(df, r):
    df['bin_subscriber'] = df.apply(lambda row: uuid.UUID(row.subscriber).bytes, axis=1)
    df['bin_total_score'] = df.apply(lambda row: struct.pack('B', round(row.total_score)), axis=1)
    df = df[['bin_subscriber', 'bin_total_score']]
    with r.pipeline() as pipe:
        index = 0
        for subscriber, total_score in zip(df['bin_subscriber'], df['bin_total_score']):
            r.set(subscriber, total_score)
            if (index + 1) % 2000 == 0:
                pipe.execute()
            index += 1
With this, I can send about 400-500k sets to Redis per minute. We may end up processing up to 300 million, which at this rate would take half a day or so. Doable but not ideal. Note that in the outer wrapper I am downloading .parquet files from S3 one at a time and pulling them into Pandas via an in-memory bytes buffer.
def process_file(s3_resource, r, bucket, key):
    buffer = io.BytesIO()
    s3_object = s3_resource.Object(bucket, key)
    s3_object.download_fileobj(buffer)
    send_to_redis(
        pandas.read_parquet(buffer, columns=['subscriber', 'total_score']), r)

def main():
    args = get_args()
    s3_resource = boto3.resource('s3')
    r = redis.Redis()
    file_prefix = get_prefix(args)
    s3_keys = [
        item.key for item in
        s3_resource.Bucket(args.bucket).objects.filter(Prefix=file_prefix)
        if item.key.endswith('.parquet')
    ]
    for key in s3_keys:
        process_file(s3_resource, r, args.bucket, key)
Is there a way to send this data to Redis without the use of iteration? Is it possible to send an entire blob of data to Redis and have Redis set the key and value for every 1st and 2nd value of the data blob? I imagine that would be slightly faster.
The original parquet that I am pulling into Pandas is created via PySpark. I've tried using the Spark-Redis plugin, which is extremely fast, but I'm not sure how to convert my data to the above binary format within a Spark dataframe itself. I also don't like how the column name is added as a string to every single value, and it doesn't seem to be configurable; every Redis object having that label seems very space inefficient.
Any suggestions would be greatly appreciated!
Try Redis Mass Insertion and bulk import using redis-cli --pipe:
Create a new text file input.txt containing the Redis commands:
SET Key0 Value0
SET Key1 Value1
...
SET Keyn Valuen
Use redis-mass.py (see below) to insert them into Redis:
python redis-mass.py input.txt | redis-cli --pipe
redis-mass.py from GitHub:
#!/usr/bin/env python
"""
redis-mass.py
~~~~~~~~~~~~~
Prepares a newline-separated file of Redis commands for mass insertion.
:copyright: (c) 2015 by Tim Simmons.
:license: BSD, see LICENSE for more details.
"""
import sys

def proto(line):
    result = "*%s\r\n$%s\r\n%s\r\n" % (str(len(line)), str(len(line[0])), line[0])
    for arg in line[1:]:
        result += "$%s\r\n%s\r\n" % (str(len(arg)), arg)
    return result

if __name__ == "__main__":
    try:
        filename = sys.argv[1]
        f = open(filename, 'r')
    except IndexError:
        f = sys.stdin.readlines()
    for line in f:
        print(proto(line.rstrip().split(' ')))
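To tie this back to the question, a possible way to produce input.txt from the pandas dataframe (a sketch under assumptions: the subscriber and total_score columns from the question are written as plain text rather than packed bytes, and write_redis_commands is a hypothetical helper) would be:
def write_redis_commands(df, path='input.txt'):
    # Emit one "SET key value" line per dataframe row for redis-mass.py / --pipe.
    with open(path, 'w') as out:
        for subscriber, total_score in zip(df['subscriber'], df['total_score']):
            out.write('SET {0} {1}\n'.format(subscriber, round(total_score)))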

Change Delimiter using pyspark and save it as textfile in HDFS

I have an input data file in HDFS. I read that file and perform some validations like the one below. After performing the validations I get the result shown below. I want to change the delimiter from comma to '\t' using PySpark and store it in HDFS. Can anyone help me with this? (No CSV answers please.) Thanks in advance.
Validation Code:
dc = data_f.filter("age > 25").filter(data_f.mar == '"married"').groupBy("job","edu").avg("bal","age").sort(data_f.job.desc(),"edu").rdd.map(list).collect()
Result:
[[u'"unknown"', u'"primary"', 1515.974358974359, 48.61538461538461],
[u'"unknown"', u'"secondary"', 1314.2045454545455, 47.84090909090909],
[u'"unknown"', u'"tertiary"', 2328.64, 51.84],
[u'"unknown"', u'"unknown"', 1977.1157894736841, 51.694736842105264],
[u'"unemployed"', u'"primary"', 1685.6097560975609, 44.957317073170735],
[u'"unemployed"', u'"secondary"', 1472.3518072289157, 43.8433734939759],
[u'"unemployed"', u'"tertiary"', 1865.968992248062, 41.031007751937985],
[u'"unemployed"', u'"unknown"', 859.1875, 45.375],
[u'"technician"', u'"primary"', 1512.704, 47.912]]
If you need to avoid the .csv write method, you can just use this snippet on the RDD:
def concatenate_row(row):
    concatenated_row = ""
    for col in row:
        concatenated_row += str(col) + "\t"
    # drop the trailing tab so the last column is not followed by an empty field
    return concatenated_row.rstrip("\t")

result = rdd.map(lambda row: concatenate_row(row))
and then just call the saveAsTextFile method on it, as in the sketch below.
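For example (a sketch; the output path is a placeholder, and this assumes the validation result is kept as an RDD, i.e. without the final .collect()):
# write the tab-delimited lines back to HDFS (placeholder path)
result.saveAsTextFile('hdfs:///path/to/output')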

How to map structured data to schemaRDD in Spark?

I've asked this question differently before, but there are some changes, so I thought I'd ask it again as a new question.
I have structured data of which only part is in JSON format, but I need to map the entire data to a SchemaRDD. The data looks like this:
03052015 04:13:20
{"recordType":"NEW","data":{"keycol":"val1","col2":"val2","col3":"val3"}
Each line starts with a date, followed by a time and a JSON-formatted text.
I need to map not only the JSON text but also the date and time into the same structure.
I tried it in Python but obviously it doesn't work because Row does not take an RDD (jsonRDD in this case).
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)

orderFile = sc.textFile(myfile)
orderLine = orderFile.map(lambda line: line.split(" ", 2))
anotherOrderLine = orderLine.map(lambda p: Row(date=p[0], time=p[1], content=sqlContext.jsonRDD(p[2])))
schemaOrder = sqlContext.inferSchema(anotherOrderLine)
schemaOrder.printSchema()
for x in schemaOrder.collect():
    print x
The goal is to be able to run a query like this against the schemaRDD:
select date, time, data.keycol, data.val1, data.val2, data.val3 from myOrder
How can I map the entire line to a schemaRDD?
Any help is appreciated.
The simplest option would be to add this field to the JSON and use jsonRDD.
My data:
03052015 04:13:20 {"recordType":"NEW","data":{"keycol":"val1","col1":"val5","col2":"val3"}}
03062015 04:13:20 {"recordType":"NEW1","data":{"keycol":"val2","col1":"val6","col2":"val3"}}
03072015 04:13:20 {"recordType":"NEW2","data":{"keycol":"val3","col1":"val7","col2":"val3"}}
03082015 04:13:20 {"recordType":"NEW3","data":{"keycol":"val4","col1":"val8","col2":"val3"}}
Code:
import json

def transform(data):
    # split the line into the leading timestamp and the JSON part,
    # then fold the timestamp into the JSON as a 'ts' field
    ts = data[:18].strip()
    jss = data[18:].strip()
    jsj = json.loads(jss)
    jsj['ts'] = ts
    return json.dumps(jsj)

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

rdd = sc.textFile('/sparkdemo/sample.data')
tbl = sqlContext.jsonRDD(rdd.map(transform))
tbl.registerTempTable("myOrder")
sqlContext.sql("select ts, recordType, data.keycol, data.col1, data.col2 data from myOrder").collect()
Result:
[Row(ts=u'03052015 04:13:20', recordType=u'NEW', keycol=u'val1', col1=u'val5', data=u'val3'), Row(ts=u'03062015 04:13:20', recordType=u'NEW1', keycol=u'val2', col1=u'val6', data=u'val3'), Row(ts=u'03072015 04:13:20', recordType=u'NEW2', keycol=u'val3', col1=u'val7', data=u'val3'), Row(ts=u'03082015 04:13:20', recordType=u'NEW3', keycol=u'val4', col1=u'val8', data=u'val3')]
The problem in your code is that you are calling jsonRDD for each of the rows; that is not correct - it accepts an RDD and returns a SchemaRDD.
sqlContext.jsonRDD creates a SchemaRDD from an RDD of strings where each string contains a JSON document. This code sample is from the Spark SQL documentation (https://spark.apache.org/docs/1.2.0/sql-programming-guide.html):
val anotherPeopleRDD = sc.parallelize("""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)
One of the cool things about jsonRDD is that you can provide an additional parameter stating the JSON's schema, which should improve performance. This can be done by creating a SchemaRDD first (just load a sample document) and then calling the schemaRDD.schema method to get the schema. A rough sketch of that is shown below.
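A possible sketch (an assumption based on the Spark 1.2 Python API, not code from the original answer):
# infer the schema once from a small sample, then reuse it for the full load
sample = sqlContext.jsonRDD(rdd.map(transform).sample(False, 0.01))
schema = sample.schema()
tbl = sqlContext.jsonRDD(rdd.map(transform), schema)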
