Passing multiple JSON filenames to pandas and adding to a dataframe - python

I'm working on the following code, where I am passing filenames from a variable "filelist", then performing some pandas dataframe procedures including obtaining data and filtering columns.
filelist=filestring
li=[]
merged_df=[]
for file in filelist:
# temp df
df=pd.read_json(file)
# filter columns
df2=pd.DataFrame(df[['a','e']])
#find data from each JSON
epf=df['s']['d']['Pf']
ep=pd.json_normalize(epf, record_path=['u','D'], meta=['l','dn'])
#append data from each JSON and add to li_dframe
li.append(ep)
frame=pd.concat(li,axis=0,ignore_index=True)
#merge filtered columns and li_dframe
merged_df=pd.concat([df2,frame.reset_index(drop=True)],axis=1).ffill()
print(merged_df)
merged_df.head(n=40)
Where I am having trouble is in the final merged data frame, I can only see data from one of the underlying Json files - the second in the list. This is strange as the print function, shows that the code has correctly handled both JSON and extracted the required values. Can anyone advise what I am missing here to correctly ensure the merged_df has all the necessary info?

Related

How to read in excel files from a folder and join them into a single df?

First-time poster here! I have perused these forums for a while and I am taken aback by how supportive this community is.
My problem involves several excel files with the same name, column headers, data types, that I am trying to read in with pandas. After reading them in, I want to compare the column 'Agreement Date' across all the data-frames and create a yes/no column if they match. I then want to export the data frame.
I am still learning Python and Pandas so I am struggling with this task. This is my code so far:
import pandas as pd
import glob
xlpath = "/Users/myname/Documents/Python/"
# read .xlsx file into a list
allfiles = glob.glob(xlpath + "*.xls")
# for loop to read in all files
for excelfiles in allfiles:
raw_excel = pd.read_excel(allfiles)
# place all the pulled dataframe into a list
list = [raw_excel]
From here though I am quite lost. I do not know how to join all of my files together on my id column and then compare the 'Agreement Date' column? Any help would be greatly appreciated!
THANKS!!
In your loop you need to hand the looped value and not the whole list to read_excel
You have to append the list values within the loop, otherwise only the last item will be in the list
Do not overwrite python builtins such as list or you can encounter some difficult to debug behaviors
Here's what I would change:
import pandas as pd
import glob
xlpath = "/Users/myname/Documents/Python/"
# get file name list of .xlsx files in the directory
allfiles = glob.glob(xlpath + "*.xls")
# for loop to read in all files & place all the pulled dataframe into a list
dataframes_list = []
for file in allfiles:
dataframes_list.append(pd.read_excel(file))
You can then append the DataFrames like this:
merged_df = dataframes_list[0]
for df in dataframes_list[1:]:
merged_df.append(df, ignore_index=True)
Use ignore_index if the Indexes are overlapping and causing problems. If they already are distinct and you want to keep them, set this to False.

How to perform operations on a Dask dataframe and export the results to a csv?

I have a large input csv file (several GBs) that I import in Dask with a blocksize of 5e6. The input csv contains two columns: "ID" and "Text".
ddf1 = dd.read_csv('REL_Input.csv', names=['ID', 'Text'], blocksize=5e6)
I need to add a third column to ddf1, "Hash", by parsing the existing "Text" column for a string between "Hash=" and ";". In Pandas, I can simply do this:
ddf1['Hash'] = ddf1['Text'].str.extract(r'Hash=(.*?);')
When I do this in Dask, I get an error saying that the "column assignment doesn't support dask.dataframe.core.DataFrame". I tried to use assign but had no luck.
I also need to read multiple large csv files (each several GBs in size) from a directory, concatenate them into another Dask dataframe, ddf2. Each of these csv files have 100s of columns but I only need 2: "Hash" and "Name". Here is the code to create ddf2:
ddf2 = dd.concat([dd.read_csv(f, usecols=['Hash', 'Name'], blocksize=5e6) for f in glob.glob('*.tsv')], ignore_index=True, axis=0)
Then, I need to merge the two dataframes on the "Hash" columns--something like this:
ddf3 = ddf1[['ID', 'ddf1_Hash']].merge(ddf2[['ddf2_Hash', 'Name']], left_on='ddf1_Hash', right_on='ddf2_Hash', how='left')
Finally, I need to export ddf3 as a csv:
df3.to_csv('Output.csv')
I looked and it seems I can create the column for ddf1 and perform the merge operation by changing both ddf1 and ddf2 to pandas dfs using compute. However, that's not an option for me due to the sheer size of these dataframes. I also tried using the chunks approach in Pandas, but that does not work due to the "out of memory" error.
Is there a good way to tackle this problem? I'm still learning Python so any help would be appreciated.
UPDATE:
I am able to create the third column and merge the two dataframes. Though, now the issue is that I can't export the merged dataframe as a csv.
Running regex on a string column. The following snippet uses assign:
import dask.dataframe as dd
import pandas as pd
# this step is just to setup a minimum reproducible example
df = pd.DataFrame(list("abcdefghi"), columns=['A'])
ddf = dd.from_pandas(df, npartitions=3)
# this uses assign to extract the relevant content
ddf = ddf.assign(check_c = lambda x: x['A'].str.extract(r'([a-z])'))
# you can see that the computation was done correctly
ddf.compute()
Concatenating csv files. Do csv files have the same structure/columns? If so, you can just use dd.read_csv("path_to_csv_files/*csv"), but if the files have different structures, then your approach is correct:
ddf2 = dd.concat([dd.read_csv(f, usecols=['Hash', 'Name'], blocksize=5e6) for f in glob.glob('*.tsv')], ignore_index=True, axis=0)
Merging the dataframes. This is going to be an expensive operation, here's a couple of options to potentially reduce the cost of this:
if any of the dataframes can be put into memory, then it would help to run .compute() to get pandas dataframe before the merge;
setting the key variable as index on one or both dataframes:
ddf1 = ddf1.set_index('Hash')
ddf2 = ddf2.set_index('Hash')
ddf3 = ddf1.merge(ddf2, left_index=True, right_index=True)
Saving csv, by default, dask will save each partition to its own csv file, so your path needs to contain an asterisk, e.g.:
df3.to_csv('Output_*.csv', index=False)
There are other options possible (explicit paths, custom name function, see https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.to_csv).
If you need a single file, you can use
df3.to_csv('Output.csv', index=False, single_file=True)
However, this option is not supported on all systems, so you might want to check that it works using a small sample first (see documentation).

write spark dataframe as array of json (pyspark)

I would like to write my spark dataframe as a set of JSON files and in particular each of which as an array of JSON.
Let's me explain with a simple (reproducible) code.
We have:
import numpy as np
import pandas as pd
df = spark.createDataFrame(pd.DataFrame({'x': np.random.rand(100), 'y': np.random.rand(100)}))
Saving the dataframe as:
df.write.json('s3://path/to/json')
each file just created has one JSON object per line, something like:
{"x":0.9953802385540144,"y":0.476027611419198}
{"x":0.929599290575914,"y":0.72878523939521}
{"x":0.951701684432855,"y":0.8008064729546504}
but I would like to have an array of those JSON per file:
[
{"x":0.9953802385540144,"y":0.476027611419198},
{"x":0.929599290575914,"y":0.72878523939521},
{"x":0.951701684432855,"y":0.8008064729546504}
]
It is not currently possible to have spark "natively" write a single file in your desired format because spark works in a distributed (parallel) fashion, with each executor writing its part of the data independently.
However, since you are okay with having each file be an array of json not only [one] file, here is one workaround that you can use to achieve your desired output:
from pyspark.sql.functions import to_json, spark_partition_id, collect_list, col, struct
df.select(to_json(struct(*df.columns)).alias("json"))\
.groupBy(spark_partition_id())\
.agg(collect_list("json").alias("json_list"))\
.select(col("json_list").cast("string"))\
.write.text("s3://path/to/json")
First you create a json from all of the columns in df. Then group by the spark partition ID and aggregate using collect_list. This will put all the jsons on that partition into a list. Since you're aggregating within the partition, there should be no shuffling of data required.
Now select the list column, convert to a string, and write it as a text file.
Here's an example of how one file looks:
[{"x":0.1420523746714616,"y":0.30876114874052263}, ... ]
Note you may get some empty files.
Presumably you can force spark to write the data in ONE file if you specified an empty groupBy, but this would result in forcing all of the data into a single partition which could result in an out of memory error.
If the data is not super huge and it's okay to have the list as one JSON file, the following workaround is also valid. First, convert the Pyspark data frame to Pandas and then to a list of dicts. Then, the list can be dumped as JSON.
list_of_dicts = df.toPandas().to_dict('records')
json_file = open('path/to/file.json', 'w')
json_file.write(json.dumps(list_of_dicts))
json_file.close()

Reading dataframe from multiple input paths and adding columns simultaneously

I am trying to read multiple input paths and based on the dates in the paths adding two columns to the data frame. Actually the files were stored as orc partitioned by these dates using hive so they have a structure like
s3n://bucket_name/folder_name/partition1=value1/partition2=value2
where partition2 = mg_load_date . So here I am trying to fetch multiple directories from multiple paths and based on the partitions I have to create two columns namely mg_load_date and event_date for each spark dataframe. I am reading these as input and combining them after I add these two columns finding the dates for each file respectively.
Is there any other way since I have many reads for each file, to read all the files at once while adding two columns for their specific rows. Or any other way where I can make the read operation fast since I have many reads.
I guess reading all the files like this sqlContext.read.format('orc').load(inputpaths) is faster than reading them individually and then merging them.
Any help is appreciated.
dfs = []
for i in input_paths:
df = sqlContext.read.format('orc').load(i)
date = re.search('mg_load_date=([^/]*)/$', i).group(1)
df = df.withColumn('event_date',F.lit(date)).withColumn('mg_load_date',F.lit(date))
dfs+=[df]
df = reduce(DataFrame.unionAll,dfs)
As #user8371915 says, you should load your data from the root path instead of passing a list of subdirectory:
sqlContext.read.format('orc').load("s3n://bucket_name/folder_name/")
Then you'll have access to your partitioning columns partition1 and partition2.
If for some reason you can't load from the root path you can try using pyspark.sql.functions input_file_name to get the name of the file for each row of your dataframe.
Spark 2.2.0+
to read from multiple folders using orc format.
df=spark.read.orc([path1,path2])
ref: https://issues.apache.org/jira/browse/SPARK-12334

Creating Arrays from cvs files in python

So I have a data file, which i must extract specific data from. Using;
x=15 #need a way for code to assess how many lines to skip from given data
maxcol=2000 #need a way to find final row in data
data=numpy.genfromtxt('data.dat.csv',skip_header=x,delimiter=',')
column_one=data[0;max,0]
column_two=data[0:max,1]
this gives me an array for the specific case where there are (x=)15 lines of metadata above the required data and where the number of rows of data is (maxcol=)2000. In what way do I go about changing the code to satisfy any value for x and maxcol?
Use pandas. Its read_csv function does all that you want (I don't include its equivalent of delimiter, sep=',', because comma-delimited is the default):
import pandas as pd
data = pd.read_csv('data.dat.csv', skiprows=x, nrows=maxcol)
If you really want that as a numpy array, you can do this:
data = data.values
But you can probably just leave it as a pandas DataFrame.

Categories