Splitting a DataFrame using Spark with Python

I'm using a DataFrame in Spark to split and store data in a tabular format. My data in the file looks like this:
{"click_id": 123, "created_at": "2016-10-03T10:50:33", "product_id": 98373, "product_price": 220.50, "user_id": 1, "ip": "10.10.10.10"}
{"click_id": 124, "created_at": "2017-02-03T10:51:33", "product_id": 97373, "product_price": 320.50, "user_id": 1, "ip": "10.13.10.10"}
{"click_id": 125, "created_at": "2017-10-03T10:52:33", "product_id": 96373, "product_price": 20.50, "user_id": 1, "ip": "192.168.2.1"}
and I've written this code to split the data -
from pyspark.sql import Row
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
import pyspark.sql.functions as psf
spark = SparkSession \
    .builder \
    .appName("Hello") \
    .config("World") \
    .getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

ratings = spark.createDataFrame(
    sc.textFile("transactions.json").map(lambda l: l.split(',')),
    ["Col1", "Col2", "Col3", "Col4", "Col5", "Col6"]
)

ratings.registerTempTable("ratings")
final_df = sqlContext.sql("select * from ratings")
final_df.show(20, False)
The above code runs, but in the output each column shows the key together with its value (e.g. "click_id": 123 rather than just 123); similarly, created_at is shown together with its timestamp.
I want to have only the values in the table - click_id, created_at, product_id and so on.
How do I get only those values into my table?

In your map function, parse the JSON object instead of splitting the line:
map(lambda l: l.split(','))
should become
map(lambda l: json.loads(l))
(after you have imported json)
import json
Also if you remove the columns definition
["Col1","Col2","Col3","Col4","Col5","Col6"]
you will get the column names from the JSON keys.
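Putting the pieces together, a minimal sketch of the corrected call might look like this (wrapping each parsed dict in a Row avoids relying on schema inference from dicts; the file name and session setup are taken from the question):

import json

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("Hello").getOrCreate()
sc = spark.sparkContext

# Parse each JSON line into a dict, then turn it into a Row
ratings = spark.createDataFrame(
    sc.textFile("transactions.json")
      .map(lambda l: json.loads(l))
      .map(lambda d: Row(**d))
)

ratings.show(20, False)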

Assuming you want to use only the DataFrame API, you could use the following code:
ratings = spark.read.json("transactions.json")
This will load the json into a dataframe, mapping the json keys into column names.
Then you can select and rename the columns with the code below.
from pyspark.sql.functions import col

ratings = ratings.select(
    col('click_id').alias('Col1'),
    col('created_at').alias('Col2'),
    col('product_id').alias('Col3'),
    col('product_price').alias('Col4'),
    col('user_id').alias('Col5'),
    col('ip').alias('Col6'),
)
This way you can also cast columns to the relevant data types, e.g. col('product_price').cast('double').alias('Col4'), before saving them to a database.
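As a hedged sketch, the read, rename, and casts could be combined like this (the target types are assumptions based on the sample data, not something stated in the question):

from pyspark.sql.functions import col

ratings = spark.read.json("transactions.json").select(
    col('click_id').cast('long').alias('Col1'),
    col('created_at').cast('timestamp').alias('Col2'),
    col('product_id').cast('long').alias('Col3'),
    col('product_price').cast('double').alias('Col4'),
    col('user_id').cast('long').alias('Col5'),
    col('ip').alias('Col6'),
)
ratings.show(20, False)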

Related

Nested Json Using pyspark

We have to build nested JSON using the structure below in PySpark, and I have added the data that needs to be fed into it.
Input Data structure
Data
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

a1 = ["DA_STinf", "DA_Stinf_NA", "DA_Stinf_city", "DA_Stinf_NA_ID", "DA_Stinf_NA_ID_GRANT", "DA_country"]
a2 = ["data.studentinfo", "data.studentinfo.name", "data.studentinfo.city", "data.studentinfo.name.id", "data.studentinfo.name.id.grant", "data.country"]
columns = ["data", "action"]
df = spark.createDataFrame(zip(a1, a2), columns)

# Input data for json structure
a1 = ["Pune"]
a2 = ["YES"]
a3 = ["India"]
col = ["DA_Stinf_city", "DA_Stinf_NA_ID_GRANT", "DA_country"]
data = spark.createDataFrame(zip(a1, a2, a3), col)
Expected result based on the above data:
{
    "data": {
        "studentinfo": {
            "city": "Pune",
            "name": {
                "id": {
                    "grant": "YES"
                }
            }
        },
        "country": "india"
    }
}
We have tried using the F.struct function manually, but we need a dynamic way to build this JSON from the df DataFrame, which has the data and action columns:
data.select(
    F.struct(
        F.struct(
            F.struct(F.col("DA_Stinf_city")).alias("city"),
            F.struct(
                F.struct(F.col("DA_Stinf_NA_ID_GRANT")).alias("id")
            ).alias("name"),
        ).alias("studentinfo"),
        F.struct(F.col("DA_country")).alias("country")
    ).alias("data")
)
The approach below should give the correct structure (with the wrong key names - if you are happy with the approach, which doesn't use DataFrame operations but rather works in the underlying RDD, then I can flesh it out):
def build_json(input, running=None):
    # Use None instead of a mutable default so separate top-level calls don't share state
    if running is None:
        running = {}
    new_input = {}
    for hierarchy, value in input:
        key = hierarchy.pop(0)
        if len(hierarchy) == 0:
            running[key] = value
        else:
            new_input[key] = new_input.get(key, []) + [(hierarchy, value)]
    for key in new_input:
        print(new_input[key])
        running[key] = build_json(new_input[key], running={})
    return running

data.rdd.map(
    lambda x: build_json(
        [(column.split("_"), value) for column, value in x.asDict().items()]
    )
)
The basic idea is to get, from the underlying RDD, a set of tuples consisting of the column name broken into its JSON hierarchy and the value to insert into that hierarchy. The function build_json then inserts each value into its correct place in the JSON hierarchy, building the JSON object recursively.
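As a usage sketch (an assumption about how you would consume the result, not part of the original answer), you could collect the mapped RDD on the driver and serialize each per-row dict:

import json

nested = data.rdd.map(
    lambda x: build_json(
        [(column.split("_"), value) for column, value in x.asDict().items()]
    )
)

# collect() assumes the result fits on the driver
for record in nested.collect():
    print(json.dumps(record, indent=2))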

Convert two CSV tables with one-to-many relation to JSON with embedded list of subdocuments

I have two CSV files which have one-to-many relation between them.
main.csv:
"main_id","name"
"1","foobar"
attributes.csv:
"id","main_id","name","value","updated_at"
"100","1","color","red","2020-10-10"
"101","1","shape","square","2020-10-10"
"102","1","size","small","2020-10-10"
I would like to convert this to JSON of this structure:
[
    {
        "main_id": "1",
        "name": "foobar",
        "attributes": [
            {
                "id": "100",
                "name": "color",
                "value": "red",
                "updated_at": "2020-10-10"
            },
            {
                "id": "101",
                "name": "shape",
                "value": "square",
                "updated_at": "2020-10-10"
            },
            {
                "id": "102",
                "name": "size",
                "value": "small",
                "updated_at": "2020-10-10"
            }
        ]
    }
]
I tried using Python and Pandas like:
import pandas

def transform_group(group):
    group.reset_index(inplace=True)
    group.drop('main_id', axis='columns', inplace=True)
    return group.to_dict(orient='records')

main = pandas.read_csv('main.csv')
attributes = pandas.read_csv('attributes.csv', index_col=0)

attributes = attributes.groupby('main_id').apply(transform_group)
attributes.name = "attributes"

main = main.merge(
    right=attributes,
    on='main_id',
    how='left',
    validate='m:1',
    copy=False,
)

main.to_json('out.json', orient='records', indent=2)
It works. But the issue is that it does not seem to scale. When running on my whole dataset, I can load the individual CSV files without problems, but when trying to modify the data structure before calling to_json, memory usage explodes.
So is there a more efficient way to do this transformation? Maybe there is some Pandas feature I am missing? Or is there some other library to use? Moreover, the use of apply seems to be pretty slow here.
This is a tough problem and we have all felt your pain.
There are three ways I would attack this problem. First, groupby is slower if you allow pandas to do the break out.
import pandas as pd
import numpy as np
from collections import defaultdict

df = pd.DataFrame({'id': np.random.randint(0, 100, 5000),
                   'name': np.random.randint(0, 100, 5000)})
now if you do the standard groupby
groups = []
for k, rows in df.groupby('id'):
    groups.append(rows)
you will find that
groups = defaultdict(lambda: [])
for id, name in df.values:
    groups[id].append((id, name))
is about 3 times faster.
The second method is to change the code to use Dask and its parallelization; for background, see the discussion of what Dask is and how it differs from pandas.
The third is algorithmic: load the main file, then go ID by ID, loading only the data for that ID, taking several passes over what lives in memory versus on disk, and saving out partial results as they become available.
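A minimal sketch of that third, algorithmic idea, assuming attributes.csv is sorted by main_id, that every main row has at least one attribute, and that emitting one JSON document per line (JSON Lines) is acceptable; the chunk size and file names are assumptions:

import json

import pandas as pd

# Read everything as strings so the output matches the string-valued JSON in the question
main = pd.read_csv('main.csv', dtype=str).set_index('main_id')

with open('out.json', 'w') as out:
    pending_id, pending_rows = None, []

    def flush(main_id, rows):
        # Emit one JSON document per main row, with its accumulated attributes
        record = {'main_id': main_id, **main.loc[main_id].to_dict(), 'attributes': rows}
        out.write(json.dumps(record) + '\n')

    for chunk in pd.read_csv('attributes.csv', dtype=str, chunksize=100_000):
        for row in chunk.to_dict(orient='records'):
            main_id = row.pop('main_id')
            if pending_id is not None and main_id != pending_id:
                flush(pending_id, pending_rows)
                pending_rows = []
            pending_id = main_id
            pending_rows.append(row)

    if pending_id is not None:
        flush(pending_id, pending_rows)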
So in my case I was able to load the original tables into memory, but embedding the subdocuments exploded the size so that it no longer fit in memory. I ended up still using Pandas to load the CSV files, but I then generate the output row by row, saving each row as a separate JSON document, so I never hold one large data structure in memory for one large JSON.
Another important realization was to make the related column an index and to keep it sorted, so that querying it is fast (because generally there are duplicate entries in the related column).
I made the following two helper functions:
import json
import sys

import pandas

def get_related_dict(related_table, label):
    assert related_table.index.is_unique
    if pandas.isna(label):
        return None
    row = related_table.loc[label]
    assert isinstance(row, pandas.Series), label
    result = row.to_dict()
    result[related_table.index.name] = label
    return result

def get_related_list(related_table, label):
    # Important to be more performant when selecting non-unique labels.
    assert related_table.index.is_monotonic_increasing
    try:
        # We use this syntax to always get a DataFrame and not a Series when only one row matches.
        return related_table.loc[[label], :].to_dict(orient='records')
    except KeyError:
        return []
And then I do:
main = pandas.read_csv('main.csv', index_col=0)
attributes = pandas.read_csv('attributes.csv', index_col=1)

# We sort the index to be more performant when selecting non-unique labels. We use a stable sort.
attributes.sort_index(inplace=True, kind='mergesort')

columns = [main.index.name] + list(main.columns)
for row in main.itertuples(index=True, name=None):
    assert len(columns) == len(row)
    data = dict(zip(columns, row))
    data['attributes'] = get_related_list(attributes, data['main_id'])
    json.dump(data, sys.stdout, indent=2)
    sys.stdout.write("\n")

Using structured queries to geocode records in a pandas dataframe using GeoPy

I would like to use structured queries to do geocoding in GeoPy, and I would like to run this on a large number of observations. I don't know how to do these queries using a pandas dataframe (or something that can be easily transformed to and from a pandas dataframe).
First, some setup:
import pandas

from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim

Ngeolocator = Nominatim(user_agent="myGeocoder")
Ngeocode = RateLimiter(Ngeolocator.geocode, min_delay_seconds=1)

df = pandas.DataFrame(["Bob", "Joe", "Ed"])
df["CLEANtown"] = ['Harmony', 'Fargo', '']
df["CLEANcounty"] = ['', '', 'Traill']
df["CLEANstate"] = ['Minnesota', 'North Dakota', 'North Dakota']
df["full"] = ['Harmony, Minnesota', 'Fargo, North Dakota', 'Traill County, North Dakota']
df.columns = ["name"] + list(df.columns[1:])
I know how to run a structured query on a single location by providing a dictionary. I.e.:
q={'city':'Harmony', 'county':'', 'state':'Minnesota'}
testN=Ngeocode(q,addressdetails=True)
And I know how to geocode from the dataframe simply using a single column populated with strings. I.e.:
df['easycode'] = df['full'].apply(lambda x: Ngeocode(x, language='en',addressdetails=True).raw)
But how do I turn the columns CLEANtown, CLEANcounty, and CLEANstate into dictionaries row by row, use those dictionaries as structured queries, and put the results back into the pandas dataframe?
Thank you!
One way is to use the apply method of the DataFrame instead of a Series. This passes the whole row to the lambda. Example:
df["easycode"] = df.apply(
lambda row: Ngeocode(
{
"city": row["CLEANtown"],
"county": row["CLEANcounty"],
"state": row["CLEANstate"],
},
language="en",
addressdetails=True,
).raw,
axis=1,
)
Similarly, if you wanted to build a single column of dictionaries first, you could do:
df["full"] = df.apply(
lambda row: {
"city": row["CLEANtown"],
"county": row["CLEANcounty"],
"state": row["CLEANstate"],
},
axis=1,
)
df["easycode"] = df["full"].apply(
lambda x: Ngeocode(
x,
language="en",
addressdetails=True,
).raw
)
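One possible refinement (my assumption, not part of the original answer): since some rows have empty strings for city or county, you may want to drop empty fields before sending the structured query, for example:

def to_query(row):
    # Build the structured query and drop keys whose value is empty
    query = {
        "city": row["CLEANtown"],
        "county": row["CLEANcounty"],
        "state": row["CLEANstate"],
    }
    return {key: value for key, value in query.items() if value}

df["easycode"] = df.apply(
    lambda row: Ngeocode(to_query(row), language="en", addressdetails=True).raw,
    axis=1,
)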

How to convert csv to json with multi-level nesting using pandas

I've tried to follow a bunch of answers I've seen on SO, but I'm really stuck here. I'm trying to convert a CSV to JSON.
The JSON schema has multiple levels of nesting and some of the values in the CSV will be shared.
Here's a link to one record in the CSV.
Think of this sample as two different parties attached to one document.
The fields on the document (document_source_id, document_amount, record_date, source_url, document_file_url, document_type__title, apn, situs_county_id, state_code) should not duplicate.
While the fields of each entity are unique.
I've tried to nest these using a complex groupby statement, but am stuck getting the data into my schema.
Here's what I've tried. It doesn't contain all fields because I'm having a difficult time understanding what it all means.
j = (df.groupby(['state_code',
                 'record_date',
                 'situs_county_id',
                 'document_type__title',
                 'document_file_url',
                 'document_amount',
                 'source_url'], as_index=False)
       .apply(lambda x: x[['source_url']].to_dict('r'))
       .reset_index()
       .rename(columns={0: 'metadata', 1: 'parcels'})
       .to_json(orient='records'))
Here's how the sample CSV should be output:
{
    "metadata": {
        "source_url": "https://a836-acris.nyc.gov/DS/DocumentSearch/DocumentDetail?doc_id=2019012901225004",
        "document_file_url": "https://a836-acris.nyc.gov/DS/DocumentSearch/DocumentImageView?doc_id=2019012901225004"
    },
    "state_code": "NY",
    "nested_data": {
        "parcels": [
            {
                "apn": "3972-61",
                "situs_county_id": "36005"
            }
        ],
        "participants": [
            {
                "entity": {
                    "name": "5 AIF WILLOW, LLC",
                    "situs_street": "19800 MACARTHUR BLVD",
                    "situs_city": "IRVINE",
                    "situs_unit": "SUITE 1150",
                    "state_code": "CA",
                    "situs_zip": "92612"
                },
                "participation_type": "Grantee"
            },
            {
                "entity": {
                    "name": "5 ARCH INCOME FUND 2, LLC",
                    "situs_street": "19800 MACARTHUR BLVD",
                    "situs_city": "IRVINE",
                    "situs_unit": "SUITE 1150",
                    "state_code": "CA",
                    "situs_zip": "92612"
                },
                "participation_type": "Grantor"
            }
        ]
    },
    "record_date": "01/31/2019",
    "situs_county_id": "36005",
    "document_source_id": "2019012901225004",
    "document_type__title": "ASSIGNMENT, MORTGAGE"
}
You might need to use the json_normalize function from pandas.io.json
from pandas.io.json import json_normalize
import csv

li = []
with open('filename.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        li.append(row)

df = json_normalize(li)
Here, we create a list of dictionaries from the CSV file and build a DataFrame from it with json_normalize.
Below is one way to export your data:
# all columns used in groupby()
grouped_cols = ['state_code', 'record_date', 'situs_county_id', 'document_source_id',
                'document_type__title', 'source_url', 'document_file_url']

# adjust some column names to map to those in the 'entity' node of the desired JSON
situs_mapping = {
    'street_number_street_name': 'situs_street',
    'city_name': 'situs_city',
    'unit': 'situs_unit',
    'state_code': 'state_code',
    'zipcode_full': 'situs_zip'
}

# define columns used for the 'entity' node (Python 2 needs the list-concatenation syntax below)
entity_cols = ['name', *situs_mapping.values()]
# for Python 2:
# entity_cols = ['name'] + list(situs_mapping.values())

# specify output fields
output_cols = ['metadata', 'state_code', 'nested_data', 'record_date',
               'situs_county_id', 'document_source_id', 'document_type__title']

# define a function to get nested_data
def get_nested_data(d):
    return {
        'parcels': d[['apn', 'situs_county_id']].drop_duplicates().to_dict('r'),
        'participants': d[['entity', 'participation_type']].to_dict('r')
    }

j = (df.rename(columns=situs_mapping)
       .assign(entity=lambda x: x[entity_cols].to_dict('r'))
       .groupby(grouped_cols)
       .apply(get_nested_data)
       .reset_index()
       .rename(columns={0: 'nested_data'})
       .assign(metadata=lambda x: x[['source_url', 'document_file_url']].to_dict('r'))[output_cols]
       .to_json(orient="records")
     )

print(j)
Note: If participants contain duplicates and you must run drop_duplicates() as we do on parcels, then the assign(entity) step can be moved into the definition of participants inside the get_nested_data() function:
, 'participants': d[['participation_type', *entity_cols]] \
      .drop_duplicates() \
      .assign(entity=lambda x: x[entity_cols].to_dict('r')) \
      .loc[:, ['entity', 'participation_type']] \
      .to_dict('r')

Loading JSON data into pandas data frame and creating custom columns

Here is an example of the JSON I'm working with:
{
    ":#computed_region_amqz_jbr4": "587",
    ":#computed_region_d3gw_znnf": "18",
    ":#computed_region_nmsq_hqvv": "55",
    ":#computed_region_r6rf_p9et": "36",
    ":#computed_region_rayf_jjgk": "295",
    "arrests": "1",
    "county_code": "44",
    "county_code_text": "44",
    "county_name": "Mifflin",
    "fips_county_code": "087",
    "fips_state_code": "42",
    "incident_count": "1",
    "lat_long": {
        "type": "Point",
        "coordinates": [
            -77.620031,
            40.612749
        ]
    }
}
I have been able to pull out the columns I want, except I'm having trouble with "lat_long". So far my code looks like:
# PRINTS OUT SPECIFIED COLUMNS
col_titles = ['county_name', 'incident_count', 'lat_long']
df = df.reindex(columns=col_titles)
However, 'lat_long' is added to the data frame like this: {'type': 'Point', 'coordinates': [-75.71107, 4...
I thought once I figured out how to properly add the coordinates to the data frame, I would then create two separate columns, one for latitude and one for longitude.
Any help with this matter would be appreciated. Thank you.
If I haven't misunderstood your requirements, you can try this approach with json_normalize. I've only added a demo for a single JSON object; you can use apply or a lambda for multiple records.
import pandas as pd
from pandas.io.json import json_normalize

df = {
    ":#computed_region_amqz_jbr4": "587",
    ":#computed_region_d3gw_znnf": "18",
    ":#computed_region_nmsq_hqvv": "55",
    ":#computed_region_r6rf_p9et": "36",
    ":#computed_region_rayf_jjgk": "295",
    "arrests": "1",
    "county_code": "44",
    "county_code_text": "44",
    "county_name": "Mifflin",
    "fips_county_code": "087",
    "fips_state_code": "42",
    "incident_count": "1",
    "lat_long": {"type": "Point", "coordinates": [-77.620031, 40.612749]}
}

df = pd.io.json.json_normalize(df)
df_modified = df[['county_name', 'incident_count', 'lat_long.type']]
df_modified['lat'] = df['lat_long.coordinates'][0][0]
df_modified['lng'] = df['lat_long.coordinates'][0][1]
print(df_modified)
Here is how you can do it as well:
df1 = pd.io.json.json_normalize(df)
pd.concat([df1, df1['lat_long.coordinates'].apply(pd.Series)
                    .rename(columns={0: 'lat', 1: 'long'})], axis=1) \
    .drop(columns=['lat_long.coordinates', 'lat_long.type'])
