I'm looking for a workaround. Sometimes our automated framework reads Delta partitions that don't exist. It fails because there are no Parquet files in that partition, and I don't want it to fail.
What I do then is:
spark_read.format('delta').option("basePath", location) \
    .load('/mnt/water/green/date=20221209/object=34')
Instead, I want it to return an empty dataframe, i.e. a dataframe with no records.
I did that as follows, but found it a bit cumbersome, and was wondering if there is a better way.
df = spark_read.format('delta').load(location)
folder_partition = '/date=20221209/object=34'.split("/")
for folder_pruning_token in folder_partition:
    if not folder_pruning_token:
        continue  # skip the empty token produced by the leading "/"
    folder_pruning_token_split = folder_pruning_token.split("=")
    column_name = folder_pruning_token_split[0]
    column_value = folder_pruning_token_split[1]
    df = df.filter(df[column_name] == column_value)
You really don't need that trick with Delta Lake tables. It was primarily used for Parquet and other file formats to avoid scanning files on HDFS or cloud storage, which is very expensive.
You just need to load the data and filter it using where/filter. It's similar to what you do:
df = spark_read.format('delta').load(location) \
    .filter("date = '20221209' and object = 34")
If you need to, you can of course extract those values from the path automatically, with slightly simpler code:
df = spark_read.format('delta').load(location)
folder_partition = '/date=20221209/object=34'.split("/")
cols = [f"{s[0]} = '{s[1]}'"
        for s in [f.split('=') for f in folder_partition if f]
        ]
df = df.filter(" and ".join(cols))
I'm actually working on an ETL project with crappy data that I'm trying to get right.
For this, I'm trying to create a function that takes the names of my DFs and exports them to CSV files that will be easy for me to deal with in Power BI.
I've started with a function that takes my DFs and cleans the dates:
df_liste = []

def facture(x):
    x = pd.DataFrame(x)
    for s in x.columns.values:
        if "Fact" in s:
            x.rename(columns={s: 'periode_facture'}, inplace=True)
            x['periode_facture'] = x[['periode_facture']].apply(lambda v: pd.to_datetime(v, format='%Y%m'))
If I don't cast 'x' to a DataFrame, it doesn't work, but that's not my problem here.
As you can see, I have set up a list variable which I would like to fill with the names of the DFs, and the names only. Unfortunately, after a lot of tries, I haven't succeeded yet, so... here it is, my first question on Stack ever!
Just in case, this is the first version of the function I would like to have:
def export(x):
    for df in x:
        df.to_csv(f'{df}.csv', encoding='utf-8')
You'd have to set the name of your dataframe first using df.name (probably when you are creating them / reading data into them).
Then you can access the name like a normal attribute:
import pandas as pd
df = pd.DataFrame(data=[1, 2, 3])
df.name = 'my df'
and can use
df.to_csv(f'{df.name}.csv', encoding='utf-8')
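With that in place, the export function from the question could look something like this. A minimal sketch, assuming every dataframe in the list has had its name attribute set; getattr is only a guard for frames that were never named:
def export(dfs):
    for df in dfs:
        # fall back to a placeholder if a frame was never given a name
        filename = getattr(df, 'name', 'unnamed')
        df.to_csv(f'{filename}.csv', encoding='utf-8')

export([df])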
I have a 12GB JSON file where every line contains information about a scientific paper.
I want to parse it and create 3 pandas dataframes that contain information about venues, authors, and how many times an author has published in a venue. Below you can see the code I have written. My problem is that this code needs many days to run. Is there a way to make it faster?
venues = pd.DataFrame(columns=['id', 'raw', 'type'])
authors = pd.DataFrame(columns=['id', 'name'])
main = pd.DataFrame(columns=['author_id', 'venue_id', 'number_of_times'])

with open(r'C:\Users\dintz\Documents\test.json', encoding='UTF-8') as infile:
    papers = ijson.items(infile, 'item')
    for paper in papers:
        if 'id' not in paper["venue"]:
            if 'type' not in paper["venue"]:
                venues = venues.append({'raw': paper["venue"]["raw"]}, ignore_index=True)
            else:
                venues = venues.append({'raw': paper["venue"]["raw"], 'type': paper["venue"]["type"]}, ignore_index=True)
        else:
            venues = venues.append({'id': paper["venue"]["id"], 'raw': paper["venue"]["raw"], 'type': paper["venue"]["type"]}, ignore_index=True)
        paper_authors = paper["authors"]
        paper_authors_json = json.dumps(paper_authors)
        obj = ijson.items(paper_authors_json, 'item')
        for author in obj:
            authors = authors.append({'id': author["id"], 'name': author["name"]}, ignore_index=True)
            main = main.append({'author_id': author["id"], 'venue_raw': venues.iloc[-1]['raw'], 'number_of_times': 1}, ignore_index=True)

authors = authors.drop_duplicates(subset=None, keep='first', inplace=False)
venues = venues.drop_duplicates(subset=None, keep='first', inplace=False)
main = main.groupby(by=['author_id', 'venue_raw'], axis=0, as_index=False).sum()
Apache Spark can read JSON files in multiple chunks in parallel, which makes this faster -
https://spark.apache.org/docs/latest/sql-data-sources-json.html
For a regular multi-line JSON file, set the multiLine parameter to True.
If you're not familiar with Spark, you can use a pandas-compatible layer on top of Spark called Koalas -
https://koalas.readthedocs.io/en/latest/
The Koalas read_json call -
https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_json.html
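As a rough sketch of what that could look like in plain Spark (assuming the file is newline-delimited JSON, i.e. one paper per line; the path and field names are taken from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# one JSON document per line, so multiLine is not needed here
papers = spark.read.json(r'C:\Users\dintz\Documents\test.json')
papers.printSchema()
papers.select("venue.raw", "venue.type").distinct().show()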
You're using the wrong tool to accomplish this task; do not use pandas for this scenario.
Look at the last 3 lines of your code: they are simple and clean, but filling that data into a pandas dataframe row by row is not easy or cheap, since you cannot use a pandas input function such as read_json() or read_csv().
I prefer to use pure Python for this simple task. If your PC has sufficient memory, use dicts to collect the unique authors and venues, use itertools.groupby for the grouping, and use more_itertools.ilen to calculate the count.
authors = {}
venues = {}
for paper in papers:
    # dict keys give uniqueness for free
    # note: as in the question, 'id'/'type' can be missing from some venues; use .get() if needed
    venues[paper["venue"]["id"]] = (paper["venue"]["raw"], paper["venue"]["type"])
    for author in paper["authors"]:
        authors[author["id"]] = author["name"]
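For the per-author, per-venue counts, one simple alternative to the groupby/ilen route is collections.Counter. A sketch, assuming papers is the same ijson iterator as in the question; in practice you would fold this into the single loop above so the file is read only once:
from collections import Counter

counts = Counter()
for paper in papers:
    venue_raw = paper["venue"]["raw"]
    for author in paper["authors"]:
        # one (author, venue) pair per appearance; Counter sums them up
        counts[(author["id"], venue_raw)] += 1
# counts[(author_id, venue_raw)] now plays the role of 'number_of_times'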
I want to filter my Spark dataframe. In this dataframe, there is a column of URLs.
I have tried to use os.path.exists(col("url")) to filter my dataframe, but I got errors like
"string is needed, but column has been found".
Here is part of my code; pandas was used in the original code, and now I want to use Spark to implement the following:
bob_ross = pd.DataFrame.from_csv("/dbfs/mnt/umsi-data-science/si618wn2017/bob_ross.csv")
bob_ross['image'] = ""
# create a column for each of the 85 colors (these will be c0...c84)
# we'll do this in a separate table for now and then merge
cols = ['c%s' % i for i in np.arange(0, 85)]
colors = pd.DataFrame(columns=cols)
colors['EPISODE'] = bob_ross.index.values
colors = colors.set_index('EPISODE')
# figure out if we have the image or not, we don't have a complete set
for s in bob_ross.index.values:
    b = bob_ross.loc[s]['TITLE']
    b = b.lower()
    b = re.sub(r'[^a-z0-9\s]', '', b)
    b = re.sub(r'\s', '_', b)
    img = b + ".png"
    if os.path.exists("/dbfs/mnt/umsi-data-science/si618wn2017/images/" + img):
        bob_ross.set_value(s, "image", "/dbfs/mnt/umsi-data-science/si618wn2017/images/" + img)
        t = getColors("/dbfs/mnt/umsi-data-science/si618wn2017/images/" + img)
        colors.loc[s] = t
bob_ross = bob_ross.join(colors)
bob_ross = bob_ross[bob_ross.image != ""]
Here is how I tried to implement it with Spark; I am stuck at the error line:
from pyspark.sql.functions import *

bob_ross = spark.read.csv('/mnt/umsi-data-science/si618wn2017/bob_ross.csv', header=True)
bob_ross = bob_ross.withColumn(
    "image",
    concat(lit("/dbfs/mnt/umsi-data-science/si618wn2017/images/"),
           concat(regexp_replace(regexp_replace(lower(col('TITLE')), r'[^a-z0-9\s]', ''), r'\s', '_'),
                  lit(".png"))))
# error line ---filter----
bob_ross.filter(os.path.exists(col("image")))
print(bob_ross.head())
You should be using a filter expression, not an OS function.
For example:
df.filter("image is not NULL")
os.path.exists only operates on the local filesystem, while Spark is meant to run on many servers, so that should be a sign that you're not using the correct function.
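If you really do need to check whether each generated path exists, one workable pattern (a sketch, not part of the original answer; it assumes the /dbfs mount is visible on the driver and the image directory is small enough to list) is to list the directory once on the driver and filter against that set:
import os
from pyspark.sql.functions import col

image_dir = "/dbfs/mnt/umsi-data-science/si618wn2017/images/"
# enumerate the files once, then keep only rows whose generated path is in that list
existing = [image_dir + f for f in os.listdir(image_dir)]
bob_ross = bob_ross.filter(col("image").isin(existing))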
I have a dataframe in Python using pandas. It has 2 columns called 'dropoff_latitude' and 'pickup_latitude'. I want to make a function that will create a 3rd column based on these 2 variables (it runs them through an API).
So I wrote a function:
def dropoff_info(row):
    dropoff_latitude = row['dropoff_latitude']
    dropoff_longitude = row['dropoff_longitude']
    dropoff_url2 = "http://data.fcc.gov/api/block/find?format=json&latitude=%s&longitude=%s&showall=true" % (dropoff_latitude, dropoff_longitude)
    dropoff_resp2 = requests.get(dropoff_url2)
    dropoff_results2 = json.loads(dropoff_resp2.text)
    dropoffinfo = dropoff_results2["Block"]["FIPS"][2:11]
    return dropoffinfo
Then I would run it as:
df['newcolumn'] = dropoff_info(df)
However it doesn't work.
Upon troubleshooting I find that when I print dropoff_latitude it looks like this:
0 40.773345947265625
1 40.762149810791016
2 40.770393371582031
...
And so I think that the URL can't get generated. I want dropoff_latitude to look like this when printed:
40.773345947265625
40.762149810791016
40.770393371582031
...
And I don't know how to specify that I want just the values themselves, without the index.
When I tried
dropoff_latitude = row['dropoff_latitude'][1]
dropoff_longitude = row['dropoff_longitude'][1]
It just gave me the values from the 1st row so that obviously didn't work.
Ideas please? I am very new to working with dataframes... Thank you!
Alex - with pandas we typically like to avoid loops, but in your particular case, the need to ping a remote server for data pretty much requires it. So I'd do something like the following:
import requests
import json

l = []
for i in df.index:
    dropoff_latitude = df.loc[i, 'dropoff_latitude']
    dropoff_longitude = df.loc[i, 'dropoff_longitude']
    dropoff_url2 = "http://data.fcc.gov/api/block/find?format=json&latitude=%s&longitude=%s&showall=true" % (dropoff_latitude, dropoff_longitude)
    dropoff_resp2 = requests.get(dropoff_url2)
    dropoff_results2 = json.loads(dropoff_resp2.text)
    l.append(dropoff_results2["Block"]["FIPS"][2:11])

df['new'] = l
The key here is the .loc[i, ...] bit that gives you the ability to go through each row one by one, and call out the associated column to create the variables to send to your API.
Regarding your question about a drain on your memory - that's a little above my pay-grade, but I really don't think you have any other options in this case (unless your API has some kind of batch request that allows you to pull a larger data set in one call).
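As a small variation (a sketch, not part of the original answer): since dropoff_info already takes a row, you could also let pandas drive the loop with apply over the rows; it still issues one HTTP request per row, so it is not faster, just more compact:
# axis=1 passes each row (as a Series) to dropoff_info
df['newcolumn'] = df.apply(dropoff_info, axis=1)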
The problem: I have data stored in csv files with the columns date/id/value. I have 15 files, each containing around 10-20 million rows. Each csv file covers a distinct period, so the time indexes are non-overlapping, but the columns are (new ids enter from time to time, old ones disappear). What I originally did was run the script without the pivot call, but then I ran into memory issues on my local machine (only 8GB). Since there is lots of redundancy in each file, pivot seemed at first a nice way out (roughly 2/3 less data), but now performance kicks in. If I run the following script, the concat function will run "forever" (so far I have always interrupted it manually after some time (> 2h)). Concat/append seem to have limitations in terms of size (I have roughly 10000-20000 columns), or am I missing something here? Any suggestions?
import pandas as pd

path = 'D:\\'
data = pd.DataFrame()
# loop through list of raw file names
for file in raw_files:
    data_tmp = pd.read_csv(path + file, engine='c',
                           compression='gzip',
                           low_memory=False,
                           usecols=['date', 'Value', 'ID'])
    data_tmp = data_tmp.pivot(index='date', columns='ID',
                              values='Value')
    data = pd.concat([data, data_tmp])
    del data_tmp
EDIT I: To clarify, each csv file has about 10-20 million rows and three columns; after the pivot is applied, this reduces to about 2000 rows but leads to 10000 columns.
I can solve the memory issue by simply splitting the full set of ids into subsets and running the needed calculations on each subset, as they are independent for each id. I know this makes me reload the same files n times, where n is the number of subsets used, but this is still reasonably fast. I still wonder why append is not performing.
EDIT II: I have tried to recreate the file structure with a simulation that is as close as possible to the actual data structure. I hope it is clear; I didn't spend too much time minimizing simulation time, but it runs reasonably fast on my machine.
import string
import random
import pandas as pd
import numpy as np
import math

# Settings :-------------------------------
num_ids = 20000
start_ids = 4000
num_files = 10
id_interval = int((num_ids - start_ids) / num_files)
len_ids = 9
start_date = '1960-01-01'
end_date = '2014-12-31'
run_to_file = 2
# ------------------------------------------

# Simulate column IDs
id_list = []
# ensure unique elements are of size >num_ids
for x in range(num_ids + round(num_ids * 0.1)):
    id_list.append(''.join(
        random.choice(string.ascii_uppercase + string.digits) for _
        in range(len_ids)))
id_list = set(id_list)
id_list = list(id_list)[:num_ids]

time_index = pd.bdate_range(start_date, end_date, freq='D')
chunk_size = math.ceil(len(time_index) / num_files)

data = []
# Simulate files
for file in range(0, run_to_file):
    tmp_time = time_index[file * chunk_size:(file + 1) * chunk_size]
    # TODO not all cases covered, make sure ints are obtained
    tmp_ids = id_list[file * id_interval:
                      start_ids + (file + 1) * id_interval]
    tmp_data = pd.DataFrame(np.random.standard_normal(
        (len(tmp_time), len(tmp_ids))), index=tmp_time,
        columns=tmp_ids)
    tmp_file = tmp_data.stack().sortlevel(1).reset_index()
    # final simulated data structure of the parsed csv file
    tmp_file = tmp_file.rename(columns={'level_0': 'Date', 'level_1':
                                        'ID', 0: 'Value'})
    # comment/uncomment if pivot takes place on aggregate level or not
    tmp_file = tmp_file.pivot(index='Date', columns='ID',
                              values='Value')
    data.append(tmp_file)
data = pd.concat(data)
# comment/uncomment if pivot takes place on aggregate level or not
# data = data.pivot(index='Date', columns='ID', values='Value')
Using your reproducible example code, I can indeed confirm that the concat of only two dataframes takes a very long time. However, if you first align them (make the column names equal), then concatting is very fast:
In [94]: df1, df2 = data[0], data[1]
In [95]: %timeit pd.concat([df1, df2])
1 loops, best of 3: 18min 8s per loop
In [99]: %%timeit
....: df1b, df2b = df1.align(df2, axis=1)
....: pd.concat([df1b, df2b])
....:
1 loops, best of 3: 686 ms per loop
The result of both approaches is the same.
The aligning is equivalent to:
common_columns = df1.columns.union(df2.columns)
df1b = df1.reindex(columns=common_columns)
df2b = df2.reindex(columns=common_columns)
So this is probably the easier way to use when you have to deal with a full list of dataframes.
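For a whole list, that could look roughly like this (a sketch, assuming dfs holds all the pivoted frames, as in the loop-and-append answer below):
# build the union of all column names once, then reindex every frame to it
common_columns = dfs[0].columns
for df in dfs[1:]:
    common_columns = common_columns.union(df.columns)
aligned = [df.reindex(columns=common_columns) for df in dfs]
data = pd.concat(aligned)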
The reason that pd.concat is slower is that it does more. E.g. when the column names are not equal, it checks for every column whether the dtype has to be upcasted or not to hold the NaN values (which get introduced by aligning the column names). By aligning yourself, you skip this. But in this case, where you are sure all frames have the same dtype, this is no problem.
That it is so much slower surprises me as well, but I will raise an issue about that.
Summary, three key performance drivers depending on the set-up:
1) Make sure the datatypes are the same when concatenating two dataframes
2) Use integer-based column names if possible
3) When using string-based columns, make sure to use the align method before concat is called, as suggested by joris
As #joris mentioned, you should append all of the pivot tables to a list and then concatenate them all in one go. Here is a proposed modification to your code:
dfs = []
for file in raw_files:
    data_tmp = pd.read_csv(path + file, engine='c',
                           compression='gzip',
                           low_memory=False,
                           usecols=['date', 'Value', 'ID'])
    data_tmp = data_tmp.pivot(index='date', columns='ID',
                              values='Value')
    dfs.append(data_tmp)
    del data_tmp

data = pd.concat(dfs)