Get Excel from a database and use it in the Python script - python

Given the code below, what I want to do is get the Excel file from a database: anyone can come and upload an Excel or CSV file, and then this code runs against that file. As you can see, in the first line after the imports I read a local Excel file, but I want to load it from an online database instead (maybe MongoDB or Oracle, I'm not sure).
Please help me out: what would be the best method to achieve this?
from geopy.distance import geodesic as GD
import pandas as pd
import xlsxwriter
import sys

path_excel = r"D:\INTERNSHIP RELATED FILES\New Village details of Gazipur.xlsx"
df = pd.read_excel(path_excel)

radius = float(input("Enter the radius "))
Vle_coordinates = []
for i in range(2):
    Vle_coordinates.append(float(input("Enter the Latitude and Longitude")))

Village_Name = list(df["Village Name"])
Lats = list(df["Latitude"])
Longs = list(df["Longitude"])
Population = list(df['Village Population'])

temp = list(zip(Lats, Longs))
villages = dict((key, value) for key, value in zip(Village_Name, temp))

distance = []
for key, values in villages.items():
    d = GD(Vle_coordinates, values).km
    distance.append(round(d, 2))

Vle_details = list(zip(Village_Name, distance, Population))
s = sorted(Vle_details, key=lambda x: (x[1], -x[2]))
for items in s:
    if items[1] <= radius:
        print(items[0])
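One way to do this, if the uploads are stored in MongoDB using GridFS, is to fetch the uploaded file by name and hand the bytes to pandas. This is only a minimal sketch: the connection string, database name, and stored filename below are hypothetical placeholders, not values from your setup.
import gridfs
import pandas as pd
from io import BytesIO
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # hypothetical connection string
fs = gridfs.GridFS(client["uploads_db"])            # hypothetical database name

# Look up the uploaded spreadsheet by the name it was stored under (hypothetical)
grid_out = fs.find_one({"filename": "New Village details of Gazipur.xlsx"})
df = pd.read_excel(BytesIO(grid_out.read()))
The rest of the script can then use df exactly as it does now; for CSV uploads you would call pd.read_csv on the same BytesIO object instead.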

Related

Approach to merge a template with a header and items with data for each entry

I'm trying to learn Python and find a solution for my business.
I'm working with SAP and I need to merge data to fill a template.
I already do the merge with Excel VBA; it works, but filling a file with 10K entries takes a very long time.
My template is available here
https://docs.google.com/spreadsheets/d/1FXc-4zUYx0fjGRvPf0FgMjeTm9nXVfSt/edit?usp=sharing&ouid=113964169462465283497&rtpof=true&sd=true
And a sample of data is here
https://drive.google.com/file/d/105FP8ti0xKbXCFeA2o5HU7d2l3Qi-JqJ/view?usp=sharing
So I need to merge each record from my data file into the Excel template, which has a header and 2 lines (it's an FI posting, so I need to fill the debit and the credit).
In VBA, I proceeded like this:
Fix the cell;
Copy data from the template with activecell.offset(x,y) ...;
From my data file, fill the different records based on their technical names.
Now I'm trying to do the same in Python.
Using pandas or openpyxl I can open the file, but I can't see how to continue, i.e. how to merge the header data (which must be copied for each posting I have to book) with the line data.
from tkinter import *
from tkinter import filedialog  # needed: filedialog is not pulled in by "import *"
import pandas as pd
import datetime
from openpyxl import load_workbook
import numpy as np

def sap_line_item(ligne):
    ledger = ligne
    print(ligne)
    return

# Constants
c_dir = '/Users/sapfinance/PycharmProjects/SAP'
C_FILE_SEP = ';'

root = Tk()
root.withdraw()
# folder_selected = filedialog.askdirectory(initialdir=c_dir)
fiori_selected = filedialog.askopenfile(initialdir=c_dir)
data_selected = filedialog.askopenfile(initialdir=c_dir)

# read data
pd.options.display.float_format = '{:,.2f}'.format
fichier_cible = str(data_selected.name)
target_filename = fichier_cible + '_' + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + '.xlsx'
# target = pd.ExcelWriter(target_filename, engine='xlsxwriter')
df_full_data = pd.read_csv(data_selected.name, sep=C_FILE_SEP, encoding='unicode_escape', dtype='unicode')
nb_ligne_data = int(len(df_full_data))
print(nb_ligne_data)
# df_fiori = pd.read_excel(fiori_selected.name)
print(fiori_selected.name)
df_fiori = load_workbook(fiori_selected.name)
df_fiori_data = df_fiori.active
Any hints on how to approach this and find a solution would be appreciated.
Have a great day
Philippe
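One possible way to sketch the merge in Python: read the data with pandas, then write a three-row block (posting header, debit line, credit line) per record into a copy of the template with openpyxl. This is only a rough sketch, since the real template layout exists only in the linked files; every file name, column name, and cell position below is a made-up placeholder.
import pandas as pd
from openpyxl import load_workbook

df = pd.read_csv('sap_data.csv', sep=';', dtype=str)        # hypothetical data file
wb = load_workbook('fi_posting_template.xlsx')               # hypothetical template file
ws = wb.active

row = 2                                                      # first free row under the template header
for rec in df.itertuples(index=False):
    # posting header line
    ws.cell(row=row, column=1, value=rec.DocumentDate)       # hypothetical column
    ws.cell(row=row, column=2, value=rec.CompanyCode)        # hypothetical column
    # debit line
    ws.cell(row=row + 1, column=3, value=rec.DebitAccount)   # hypothetical column
    ws.cell(row=row + 1, column=4, value=rec.Amount)
    # credit line
    ws.cell(row=row + 2, column=3, value=rec.CreditAccount)  # hypothetical column
    ws.cell(row=row + 2, column=4, value=rec.Amount)
    row += 3                                                 # move to the next posting block

wb.save('postings_filled.xlsx')
Writing cell by cell like this avoids the ActiveCell/Offset round-trips that make the VBA version slow, so 10K records should be handled much faster.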

How to Speed-Up Writing Dataframe to s3 from EMR PySpark Notebook?

So I'm learning PySpark by playing around with the DMOZ dataset in a Jupyter notebook attached to an EMR cluster. The process I'm trying to achieve is as follows:
Load a CSV with the locations of files in a public S3 dataset into a PySpark DataFrame (~130k rows)
Map over the DF with a function that retrieves the file contents (HTML) and rips out the text
Join the output with the original DF as a new column
Write the joined DF to S3 (the problem: it seems to hang forever; it's not a large job and the output JSON should only be a few gigs)
All of the writing is done in a function called run_job()
I let it sit for about 2 hours on a cluster with 10 m5.8xlarge instances, which should be enough (?). All of the other steps execute fine on their own, except for the df.write(). I tested on a much smaller subset and it wrote to S3 with no issue, but when I go to do the whole file it seemingly hangs at "0/n jobs complete."
I am new to PySpark and distributed computing in general, so it's probably a simple "best practice" that I am missing. (Edit: Maybe it's in the config of the notebook? I'm not using any magics to configure Spark currently; do I need to?)
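For reference, EMR notebooks do accept a sparkmagic %%configure cell, run before the first Spark command, to override the session settings; the values below are purely illustrative, not a recommendation:
%%configure -f
{"executorMemory": "8g", "executorCores": 4, "numExecutors": 30}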
Code below...
import html2text
import boto3
import botocore
import os
import re
import zlib
import gzip
from bs4 import BeautifulSoup as bs
from bs4 import Comment
# from pyspark import SparkContext, SparkConf
# from pyspark.sql import SQLContext, SparkSession
# from pyspark.sql.types import StructType, StructField, StringType, LongType
import logging

def load_index():
    input_file = 's3://cc-stuff/uploads/DMOZ_bussineses_ccindex.csv'
    df = spark.read.option("header", True) \
        .csv(input_file)
    # df = df.select('url_surtkey','warc_filename', 'warc_record_offset', 'warc_record_length','content_charset','content_languages','fetch_time','fetch_status','content_mime_type')
    return df

def process_warcs(id_, iterator):
    html_textract = html2text.HTML2Text()
    html_textract.ignore_links = True
    html_textract.ignore_images = True
    no_sign_request = botocore.client.Config(signature_version=botocore.UNSIGNED)
    s3client = boto3.client('s3', config=no_sign_request)
    text = None
    s3pattern = re.compile('^s3://([^/]+)/(.+)')
    PREFIX = "s3://commoncrawl/"
    for row in iterator:
        try:
            start_byte = int(row['warc_record_offset'])
            stop_byte = (start_byte + int(row['warc_record_length']))
            s3match = s3pattern.match((PREFIX + row['warc_filename']))
            bucketname = s3match.group(1)
            path = s3match.group(2)
            # print('Bucketname: ',bucketname,'\nPath: ',path)
            resp = s3client.get_object(Bucket=bucketname, Key=path, Range='bytes={}-{}'.format(start_byte, stop_byte))
            content = resp['Body'].read()  # .decode()
            data = zlib.decompress(content, wbits=zlib.MAX_WBITS | 16).decode('utf-8', errors='ignore')
            data = data.split('\r\n\r\n', 2)[2]
            soup = bs(data, 'html.parser')
            for x in soup.findAll(text=lambda text: isinstance(text, Comment)):
                x.extract()
            for x in soup.find_all(["head", "script", "button", "form", "noscript", "style"]):
                x.decompose()
            text = html_textract.handle(str(soup))
        except Exception as e:
            pass
        yield (id_, text)

def run_job(write_out=True):
    df = load_index()
    df2 = df.rdd.repartition(200).mapPartitionsWithIndex(process_warcs).toDF()
    df2 = df2.withColumnRenamed('_1', 'idx').withColumnRenamed('_2', 'page_md')
    df = df.join(df2.select('page_md'))
    if write_out:
        output = "s3://cc-stuff/emr-out/DMOZ_bussineses_ccHTML"
        df.coalesce(4).write.json(output)
    return df

df = run_job(write_out=True)
So I managed to make it work. I attribute this to one of the two changes below. I also changed the hardware configuration and opted for a larger number of smaller instances. Gosh, I just LOVE it when I spend an entire day in a deep state of utter confusion when all I needed to do was add a "/" to the save location...
1. I added a trailing "/" to the output file location in S3.
Old:
output = "s3://cc-stuff/emr-out/DMOZ_bussineses_ccHTML"
New:
output = "s3://cc-stuff/emr-out/DMOZ_bussineses_ccHTML/"
2. I removed the coalesce in the run_job() function. I have 200 output files now, but it worked and it was super quick (under 1 min).
Old:
df.coalesce(4).write.json(output)
New:
df.write.mode('overwrite').json(output)
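If you still want to cap the number of output files without funnelling all the data through four tasks, repartitioning just before the write (a full shuffle, so the work stays spread across the cluster) is usually a gentler option than coalesce; the partition count here is only an illustrative guess, not a tuned value.
df.repartition(50).write.mode('overwrite').json(output)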

Python df lat long in for loop

I want to change the code into a for-loop so that I can set the style for each point.
The code below works fine without a for-loop:
import simplekml
import pandas as pd

excel_file = 'sample.xlsx'
df = pd.read_excel(excel_file)
kml = simplekml.Kml()
df.apply(lambda X: kml.newpoint(coords=[(X["Long"], X["Lat"])]), axis=1)
kml.save(path="data.kml")
I wanted to do it in a for-loop so that I can apply a style to each point, but my for-loop is not working:
import simplekml
import pandas as pd

kml = simplekml.Kml()
style = simplekml.Style()
excel_file = 'sample1.xlsx'
df = pd.read_excel(excel_file)
y = df.Long
x = df.Lat
MinLat = int(df.Lat.min())
MaxLat = int(df.Lat.max())
MinLong = int(df.Long.min())
MaxLong = int(df.Long.max())
multipnt = kml.newmultigeometry()
for long in range(MinLong, MaxLong):  # Generate longitude values
    for lat in range(MaxLat, MinLat):  # Generate latitude values
        multipnt.newpoint(coords=[(y, x)])
        # kml.newpoint(coords=[(y, x)])
kml.save("Point Shared Style.kml")
If you want to iterate over a collection of points in an Excel file and add them to a single placemark as a MultiGeometry using a for-loop, then try this:
import simplekml
import pandas as pd

kml = simplekml.Kml()
style = simplekml.Style()
excel_file = 'sample1.xlsx'
df = pd.read_excel(excel_file)
multipnt = kml.newmultigeometry()
for row in df.itertuples(index=False):
    multipnt.newpoint(coords=[(row.Long, row.Lat)])  # KML coordinates are (longitude, latitude)
kml.save("PointSharedStyle.kml")
If you want to generate a point grid at every whole degree within the bounding box of the points, then try the following:
import simplekml
import pandas as pd

kml = simplekml.Kml()
style = simplekml.Style()
excel_file = 'sample1.xlsx'
df = pd.read_excel(excel_file)
MinLat = int(df.Lat.min())
MaxLat = int(df.Lat.max())
MinLong = int(df.Long.min())
MaxLong = int(df.Long.max())
multipnt = kml.newmultigeometry()  # this line was missing from the snippet
for long in range(MinLong, MaxLong + 1):  # Generate longitude values
    for lat in range(MinLat, MaxLat + 1):  # Generate latitude values
        multipnt.newpoint(coords=[(long, lat)])
        # kml.newpoint(coords=[(long, lat)])
kml.save("PointSharedStyle.kml")
Note that the Style is assigned to the placemark, not the geometry, so the MultiGeometry can only be given a single Style for all its points. If you want a different style for each point, you need to create one placemark per point and assign each its own Style, as in the sketch below.
For help setting styles, see https://simplekml.readthedocs.io/en/latest/styles.html
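A minimal sketch of that per-point approach (the scale and colour values are arbitrary examples; the Lat/Long column names follow the question's spreadsheet):
import simplekml
import pandas as pd

kml = simplekml.Kml()
df = pd.read_excel('sample1.xlsx')
for row in df.itertuples(index=False):
    pnt = kml.newpoint(coords=[(row.Long, row.Lat)])  # one placemark per point
    pnt.style.iconstyle.scale = 1.5                   # per-point style: example size
    pnt.style.iconstyle.color = simplekml.Color.red   # per-point style: example colour
kml.save("PerPointStyle.kml")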

From NetCDF file to CSV with Python

Sorry for my bad English.
I am doing an internship, and I have never used Python before this.
I need to extract data from a NetCDF file.
I have already created a loop which builds a DataFrame, but when I try to export this DataFrame I only get 201 values out of 41,000.
import csv
import numpy as np
import pandas as pd
import netCDF4
from netCDF4 import Dataset, num2date
nc = Dataset('Q:/QGIS/2011001.nc', 'r')
chla = nc.variables['chlorophyll_a'][0]
lons = nc.variables['lon'][:]
lat = nc.variables['lat'][:]
time = nc.variables['time'][:]
nlons=len(lons)
nlat=len(lat)
The first loop gives me the 41,000 values in the ArcGIS Python console:
for i in range(0, nlat):
    dla = {'lat': lat[i], 'long': lons, 'chla': chla[i]}
    z = pd.DataFrame(dla)
    print(z)
z.to_csv('Q:/QGIS/fichier.csv', sep=',', index=True)
But when I do the to_csv I only get 201 values in the CSV file:
for y in range(0, nlat):
    q[y].to_csv('Q:/QGIS/fichier.csv', sep=',', index=True)
for i in range(0, nlat):
    dlo = {'lat': lat[i], 'long': lons, 'chla': chla[i]}
    q[y] = pd.DataFrame(dlo)
print(q)
I hope you will have an answer to solve this; moreover, if you have any script to extract the values and create a shapefile, I would be very grateful if you could share it!
Best regards
Thank you in advance
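A minimal sketch of one way to get all the values into a single CSV: build one long table by repeating the latitudes and tiling the longitudes, then write it once at the end. Variable names follow the question; whether the chlorophyll array needs its fill values masked depends on your file, so that is left out here.
import numpy as np
import pandas as pd
from netCDF4 import Dataset

nc = Dataset('Q:/QGIS/2011001.nc', 'r')
chla = nc.variables['chlorophyll_a'][0]   # 2-D array with shape (lat, lon)
lons = nc.variables['lon'][:]
lats = nc.variables['lat'][:]

# Flatten the grid: one row per (lat, lon) cell, so all ~41,000 values end up in the file.
df = pd.DataFrame({
    'lat': np.repeat(lats, len(lons)),
    'long': np.tile(lons, len(lats)),
    'chla': np.asarray(chla).ravel(),
})
df.to_csv('Q:/QGIS/fichier.csv', sep=',', index=False)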

Filling an array with data from dat files in python

I have a folder that contains dat files, each of which holds data that should be placed on a 360 x 181 grid. How can I populate an array of that size with the data? The data comes out as a strip, that is, 1 x (360*181); it needs to be reshaped and then placed into the array.
Try as I might, I cannot get this to work correctly. I was able to get the data to read into an array, however it seemed to be placed into elements pseudo-randomly, as each element did not necessarily match up with the correct placement I had previously found in MATLAB. I also have the data in txt format, if that makes this easier.
Here is what I have so far, without much luck (I'm very new to Python):
#!/usr/bin/python
############################################
#
import csv
import sys
import numpy as np
import scipy as sp
#
#############################################
level = input("Enter a level: ")
LEVEL = str(level)
MODEL = raw_input("Enter a model: ")
NX = 360
NY = 181
date = 201409060000
DATE = str(date)
#############################################
FileList = []
data = []
for j in range(24, 384, 24):
    J = str(j)
    for i in range(1, 51, 1):
        I = str(i)
        fileName = '/Users/alexg/ECMWF_DATA/DAT_FILES/' + MODEL + '_' + LEVEL + '_h_' + I + '_FT0' + J + '_' + DATE + '.dat'
        fo = open(fileName, "r")  # was open(FileList(i), "r"), which fails: FileList is a list, not a function
        data.append(fo)
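A minimal sketch of the reshape step, assuming each .dat file holds plain-text numbers (the file name below is hypothetical, and the row-major order is an assumption; if the grid looks transposed or scrambled compared with MATLAB, try order='F', which matches MATLAB's column-major layout):
import numpy as np

NX, NY = 360, 181

def read_grid(file_name):
    # Read the 1 x (360*181) strip of values and reshape it onto the grid.
    strip = np.loadtxt(file_name)   # use np.fromfile(file_name, dtype=...) instead for binary .dat files
    return strip.reshape((NY, NX))  # or strip.reshape((NY, NX), order='F') if the data was written column-major

grid = read_grid('/Users/alexg/ECMWF_DATA/DAT_FILES/example.dat')  # hypothetical file
print(grid.shape)  # (181, 360)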
