Fill pandas DataFrame using Databricks - Python

I have some code which I'm writing to solve a problem for our DevOps team, the first part of which is to help them locate files on blob storage, as they only have Azure Explorer.
I will create a dataset which lists all the files in a certain directory; this can be parameterised.
This will hopefully become a triggered pipeline where the DevOps team can earmark which files they want to interact with and move/copy them to a different location, all of which will be tracked and audited. However, I'm stuck at the first hurdle.
Below is some code with which I'm trying to populate a DataFrame, but I only end up with 1 row. I've tried following a few posts on here, but I seem to be missing something.
Can anyone spot where I went wrong?
Note: walk_dirz is a function which uses dbutils.fs.ls(dir_path) to loop through directories, collecting JSON and TXT files only.
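For context, a minimal sketch of what such a helper might look like (hypothetical; the real walk_dirz may differ):

def walk_dirz(dir_path):
    # Recursively collect JSON and TXT file paths under dir_path
    paths = []
    for entry in dbutils.fs.ls(dir_path):
        if entry.isDir():
            paths.extend(walk_dirz(entry.path))
        elif entry.path.lower().endswith(('.json', '.txt')):
            paths.append(entry.path)
    return paths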
Here is the part where I'm stuck:
import os
import datetime
import pathlib
#from datetime import datetime as dt
import time
import pandas as pd
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
os_file_list = walk_dirz('dbfs:/mnt/landing/raw/NewData/Deltas')
df1 = pd.DataFrame()
#create dataframe
df = pd.DataFrame()
for i in os_file_list:
    print(i)
    pn = i
    new_row = {'path': pn}
    df1 = df.append(new_row, ignore_index=True)

This looked like it worked:
for i in os_file_list:
    df = pd.DataFrame(os_file_list, columns=['Path'])
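In case it helps anyone else: the single row almost certainly comes from appending to the always-empty df instead of to the accumulating df1, so each iteration overwrites df1 with a one-row frame. A minimal sketch of a fix, assuming walk_dirz returns a flat list of path strings (DataFrame.append is deprecated in recent pandas, so building the frame in one go is preferable anyway):

os_file_list = walk_dirz('dbfs:/mnt/landing/raw/NewData/Deltas')
# Build the whole column at once instead of appending row by row
df1 = pd.DataFrame({'path': os_file_list})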

Related

Handling multiple pdf files

I have created a folder with 158 PDF files, and I want to extract the data from each file. Here is what I have done so far.
Importing modules
from itertools import chain
import pandas as pd
import tabulate
from tabula import read_pdf
Reading the data files
data_A = read_pdf('D:\\Code\\Scraping\\DMKQ\\A.pdf', pages='all',encoding='latin1')
data_B = read_pdf('D:\\Code\\Scraping\\DMKQ\\B.pdf', pages='all',encoding='latin1')
# Generating Dataframe and print(len) for each file.
data_A_c = chain(*[data_A[i].values for i in range(0,len(data_A))])
headers=chain(data_A[0])
df_A = pd.DataFrame(data_A_c,columns=headers)
df_A.set_index('Name', inplace=True)
print(len(df_A.index))
data_B_c = chain(*[data_B[i].values for i in range(0,len(data_B))])
headers=chain(data_B[0])
df_B = pd.DataFrame(data_B_c,columns=headers)
df_B.set_index('Name', inplace=True)
print(len(df_B.index))
At the moment, I have to copy the code and change the file name for each new file, which is time-consuming and practically impossible given that my folder has 158 files in total.
Does anybody know how to execute this entire process more efficiently?
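A sketch of one way to loop over the whole folder instead of copying the block per file; this assumes every PDF has the same table layout with a 'Name' column, as in the snippets above, and swaps the chain trick for pd.concat:

from pathlib import Path
import pandas as pd
from tabula import read_pdf

pdf_dir = Path('D:/Code/Scraping/DMKQ')
frames = {}
for pdf_path in sorted(pdf_dir.glob('*.pdf')):
    # read_pdf returns a list of DataFrames, one per detected table
    tables = read_pdf(str(pdf_path), pages='all', encoding='latin1')
    df = pd.concat(tables, ignore_index=True)
    df.set_index('Name', inplace=True)
    frames[pdf_path.stem] = df
    print(pdf_path.stem, len(df.index))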

Merge Sheets of an Excel Workbook using Python

I have been trying to merge the sheets of an Excel file using Python. I was successful in appending them, but merging is becoming a bit twisted for me. Any kind of help is always welcome.
Following is the code that I tried:
import pandas as pd
import numpy as np
import glob
import os, collections, csv
from os.path import basename
f=pd.ExcelFile('E:/internship/All/A.xlsx')
n1=len(f.sheet_names)
print(n1)
data = pd.read_excel(f, sheet_name='Sheet1', header=None)
for j in range(1, int(n1)+1):
    data1 = pd.read_excel(f, sheet_name='Sheet' + str(j), header=None)
    data = pd.merge(data, data1, how='outer')
print(data)
data.to_excel('Final.xlsx',index=False)
But as this program executes, it seems to stack the sheets downwards instead of merging them side by side, something like the pictures given below:
[Image: the result that I want]
[Image: the result that my program is giving]
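If the aim is to place the sheets next to each other rather than stack them, pd.concat with axis=1 lines rows up by position; a minimal sketch, assuming rows correspond across sheets (with header=None there is no shared key column for merge to join on):

import pandas as pd

f = pd.ExcelFile('E:/internship/All/A.xlsx')
# Read every sheet, then put them side by side (new columns)
# instead of appending them downwards (new rows)
sheets = [pd.read_excel(f, sheet_name=name, header=None) for name in f.sheet_names]
merged = pd.concat(sheets, axis=1)
merged.to_excel('Final.xlsx', index=False, header=False)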

Error "No module named 'xlrd'" - how to import Excel with Python and pandas properly? (please close this)

I just realized that there may be something wrong in my local dev environment.
I tried my code on Colab, and it worked well.
import pandas as pd
df = pd.read_excel('hurun-2018-top50.xlsx')
Thank you all. Please close this question.
------- following is the original description ---------
I am trying to import an Excel file with Python and pandas.
I already pip-installed the "xlrd" module.
I googled a lot and tried several different methods; none of them worked.
Here is my code:
import pandas as pd
from pandas import ExcelFile
from pandas import ExcelWriter
df = pd.read_excel('hurun-2018-top50.xlsx', index_col=0)
df = pd.read_excel('hurun-2018-top50.xlsx', sheetname='Sheet1')
df = pd.read_excel('hurun-2018-top50.xlsx')
Any response will be appreciated.
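For anyone landing here with the same error: since the identical code worked on Colab, "xlrd" was most likely installed into a different interpreter than the one running the script. Two common fixes, sketched on the assumption the file is .xlsx:

# Install into the exact interpreter that runs the script:
#   python -m pip install xlrd
# Or skip xlrd entirely; recent pandas reads .xlsx via openpyxl:
import pandas as pd
df = pd.read_excel('hurun-2018-top50.xlsx', engine='openpyxl')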

S3 Bucket .txt.gz Copy Via PySpark

I am using Python 2 (Jupyter notebook running PySpark on EMR). I am trying to load some data as a dataframe in order to map/reduce it and output it to my own S3 bucket.
I typically use this command:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file:///home/path/datafolder/data2014/*.csv')
This fails when the file is in S3 rather than in my own bucket (I am not sure how to format the .load command for that), which is now most of my use cases. My files are also a mix of .csv and .txt.gz, both of which I want in CSV format (unzipped) when copied over.
I had a look on Google and tried the following commands in Python 2 (Jupyter notebook):
import os
import findspark
findspark.init('/usr/lib/spark/')
from pyspark import SparkContext, SQLContext
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
import sys
if sys.version_info[0] >= 3:
    from urllib.request import urlretrieve
else:
    from urllib import urlretrieve
# Get file from URL like this:
urlretrieve("https://s3.amazonaws.com/bucketname/path/path2/path3/path4/path3/results.txt.gz")
This simply outputs ('/tmp/tmpmDB1EC.gz', <httplib.HTTPMessage instance at 0x7f54db894758>), so I'm unsure what to do now.
I have read through the documentation, and searched this website and Google, for simple methods of forming the df, but am stuck. I also read up on using my AWS key / secret key (which I have), but I could not find an example to follow.
Can someone kindly help me out?
You need to load it using the Spark context. Note that urlretrieve returns a (filename, headers) tuple, so unpack it, and give textFile a file:// URI so it reads the downloaded copy from the local filesystem (.gz files are decompressed transparently):
data_file, _ = urlretrieve("https://s3.amazonaws.com/bucketname/path/path2/path3/path4/path3/results.txt.gz")
raw_data = sc.textFile('file://' + data_file)
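On EMR it is usually simpler to skip the download step and point Spark at the bucket directly; a sketch with the question's placeholder bucket/path names, assuming the cluster's IAM role (or configured access keys) can read the bucket:

# EMR resolves s3:// through EMRFS; elsewhere s3a:// is the usual scheme.
# Spark decompresses .gz input transparently, so .csv and .txt.gz files
# can be loaded together when they share a schema.
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('s3://bucketname/path/path2/*')

# Writing back out produces uncompressed CSV part files:
df.write.format('com.databricks.spark.csv') \
    .option('header', 'true') \
    .save('s3://bucketname/output/')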

File data\SPY.csv does not exist

Just started learning Python and trying to read a CSV file with pandas.
import pandas as pd
df = pd.read_csv(os.path.join(os.path.dirname(__file__), "C:\\Anaconda\\SPY.csv"))
But I get the error:
File data\SPY.csv does not exist
Tried with both one and two / and \, and with ' instead of ".
This is the full path: C:\Anaconda\SPY.csv
(This is a file from Yahoo Finance. I first tried to call Yahoo directly but was unable to, so instead I just downloaded the file and saved it as a CSV.)
The error is occurring because the path is being joined with your current directory, which is named "data", but your file is actually in "Anaconda".
Try a simple:
import pandas as pd
df = pd.read_csv("C:\\Anaconda\\SPY.csv")
If you really want to use os.path.join, this should do:
import pandas as pd
import os

# Note the trailing backslash: os.path.join("C:", ...) would build the
# drive-relative path "C:Anaconda\SPY.csv" on Windows
path = os.path.join("C:\\", "Anaconda", "SPY.csv")
df = pd.read_csv(path)
Also, if your SPY.csv file is in the same directory as your Python file, you should replace the path with a simple SPY.csv
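As a sketch of that last point, pathlib can also resolve the file relative to the script itself, regardless of the current working directory (assuming SPY.csv sits next to the .py file):

from pathlib import Path
import pandas as pd

# __file__ is the running script; .parent is the folder it lives in
csv_path = Path(__file__).resolve().parent / "SPY.csv"
df = pd.read_csv(csv_path)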
