Writing variables to an exact cell in Excel with Python

In order to refresh my Power BI dashboard I need to write my query results to Excel; otherwise I have to run every single query and paste the results myself.
I have now built the following Python code:
import pandas as pd
from pathlib import Path
data_folder = Path("PATH")
file_to_open = data_folder / "excelfile.xlsx"
df = pd.read_excel(file_to_open)
query_1 = 5
query_2 = 3
query_3 = 12
df.loc[df.iloc[-1,-1]+1,['A']] = query_1
df.loc[df.iloc[-1,-1]+1,['B']] = query_2
df.loc[df.iloc[-1,-1]+1,['C']] = query_3
print(df) #for testing#
df.to_excel(file_to_open, index = False)
It somehow puts query_1 in the right spot (right after the last value in column A), but query_2 and query_3 both skip one cell. They should all fill the next empty cell in my Excel sheet. My columns are A, B and C.
Can someone help me out?

I think this should work:
df.loc[df.A.count(), 'A'] = query_1
df.loc[df.B.count(), 'B'] = query_2
df.loc[df.C.count(), 'C'] = query_3
If you are curious, here is a good answer regarding different ways to count rows/columns: https://stackoverflow.com/a/55435185/11537601
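To illustrate why this works: Series.count() returns the number of non-NaN values in a column, which in a DataFrame with a default 0-based index (and no gaps within a column) is exactly the label of the first empty row for that column. A minimal sketch of the full flow, reusing the file name and column names from the question:
import pandas as pd
from pathlib import Path

file_to_open = Path("PATH") / "excelfile.xlsx"
df = pd.read_excel(file_to_open)

query_1, query_2, query_3 = 5, 3, 12

# count() skips NaN, so it equals the index of the first empty row per column
df.loc[df['A'].count(), 'A'] = query_1
df.loc[df['B'].count(), 'B'] = query_2
df.loc[df['C'].count(), 'C'] = query_3

df.to_excel(file_to_open, index=False)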

Related

Randomization of a list with conditions using Pandas

I'm new to any kind of programming, as you can tell by this 'beautiful' piece of hard coding. With sweat and tears (not so bad, just a little) I've created a very sequential script, and that's actually my problem. My goal is to turn it into a somewhat automated script, probably with a for loop (which I've tried, unsuccessfully).
The main aim is to create a randomization loop over the original dataset, which looks like this:
(screenshot of the dataset)
The loop should pick rows at random, one by one, and save them one after another to another Excel sheet. The catch is that the values in the columns position01 and position02 of each new pick must never match either of those values in the previous pick. The result should be an Excel sheet with the rows shuffled so that every row avoids the position01/position02 values of the row before it: row 2 should not include any of those values from row 1, row 3 none from row 2, and so on. It should iterate over the whole list, whose indices run 0-11. The Excel output also matters because I need the rest of the columns; I just need to shuffle the row order.
I hope my aim and description are clear enough; if not, I'm happy to answer any questions. I would appreciate any hint or help that gets me unstuck. Thank you. Code below. (PS: I'm aware there is probably a much neater solution than this.)
import pandas as pd
import random
dataset = pd.read_excel("C:\\Users\\ibm\\Documents\\Psychopy\\DataInput_Training01.xlsx")
# original data set use for comparisons
imageDataset = dataset.loc[0:11, :]
# creating empty df for storing rows from imageDataset
emptyExcel = pd.DataFrame()
randomPick = imageDataset.sample() # select randomly one row from imageDataset
emptyExcel = emptyExcel.append(randomPick) # append a row to empty df
randomPickIndex = randomPick.index.tolist() # get index of the row
imageDataset2 = imageDataset.drop(index=randomPickIndex) # delete the row with index selected before
# getting the raw values from the row; 'position01' and 'position02' are column headers
randomPickTemp1 = randomPick['position01'].values[0]
randomPickTemp2 = randomPick
randomPickTemp2 = randomPickTemp2['position02'].values[0]
# getting a dataset which does not include the row's position01 and position02 values
isit = imageDataset2[(imageDataset2.position01 != randomPickTemp1) & (imageDataset2.position02 != randomPickTemp1) & (imageDataset2.position01 != randomPickTemp2) & (imageDataset2.position02 != randomPickTemp2)]
# pick another row from dataset not including row selected at the beginning - randomPick
randomPick2 = isit.sample()
# save it in empty df
emptyExcel = emptyExcel.append(randomPick2, sort=False)
# get index of this second row to delete it in next step
randomPick2Index = randomPick2.index.tolist()
# delete the another row
imageDataset3 = imageDataset2.drop(index=randomPick2Index)
# AND REPEAT the procedure of comparison of the raw values with dataset already not including the original row:
randomPickTemp1 = randomPick2['position01'].values[0]
randomPickTemp2 = randomPick2
randomPickTemp2 = randomPickTemp2['position02'].values[0]
isit2 = imageDataset3[(imageDataset3.position01 != randomPickTemp1) & (imageDataset3.position02 != randomPickTemp1) & (imageDataset3.position01 != randomPickTemp2) & (imageDataset3.position02 != randomPickTemp2)]
# AND REPEAT with another pick - save - matching - picking again.. until end of the length of the dataset (which is 0-11)
In the end I used the solution provided by David Bridges (post from Sep 19, 2019) on the PsychoPy forum. In case anyone is interested, here is the link: https://discourse.psychopy.org/t/how-do-i-make-selective-no-consecutive-trials/9186
I just adjusted the condition in the for loop to my case, like this:
remaining = [choices[x] for x in choices if last['position01'] != choices[x]['position01'] and last['position01'] != choices[x]['position02'] and last['position02'] != choices[x]['position01'] and last['position02'] != choices[x]['position02']]
Thank you very much for the helpful answer, and hopefully I did not spam this thread too much.
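For readers who don't want to open the link, here is a minimal sketch of how such a comprehension might sit inside a selection loop. The variable names (choices, last, ordered), the to_dict conversion and the output file name are my assumptions, not the exact code from the PsychoPy post:
import random
import pandas as pd

df = pd.read_excel("DataInput_Training01.xlsx")  # placeholder path
choices = df.to_dict("records")                  # list of row dicts, all columns kept
ordered = [choices.pop(random.randrange(len(choices)))]  # first pick

while choices:
    last = ordered[-1]
    # indices of rows that share no position01/position02 value with the last pick
    valid = [i for i, c in enumerate(choices)
             if last['position01'] not in (c['position01'], c['position02'])
             and last['position02'] not in (c['position01'], c['position02'])]
    if not valid:
        break  # dead end; in practice you could simply restart the shuffle
    ordered.append(choices.pop(random.choice(valid)))

pd.DataFrame(ordered).to_excel("randomized_output.xlsx", index=False)  # assumed output name
If valid ever comes out empty you can simply restart the shuffle; for a dataset this small that is rare. The answer below shows the same non-consecutive constraint done entirely in pandas, on generated position pairs.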
import itertools as it
import random
import pandas as pd

# list of pairs of numbers
tmp1 = [x for x in it.permutations(list(range(6)), 2)]
df = pd.DataFrame(tmp1, columns=["position01", "position02"])
df1 = pd.DataFrame()

# first random pick
i = random.choice(df.index)
df1 = df1.append(df.loc[i], ignore_index=True)
df = df.drop(index=i)

while not df.empty:
    val = list(df1.iloc[-1])
    # rows that share no value with the previous pick
    tmp = df[(df["position01"] != val[0]) & (df["position01"] != val[1]) &
             (df["position02"] != val[0]) & (df["position02"] != val[1])]
    if tmp.empty:  # looped for 10000 times, was never empty
        print("here")
        break
    i = random.choice(tmp.index)
    df1 = df1.append(df.loc[i], ignore_index=True)
    df = df.drop(index=i)

How to extract table name along with table using camelot from pdf files using python?

I am trying to extract tables and the table names from a pdf file using camelot in python. Although I know how to extract tables (which is pretty straightforward) using camelot, I am struggling to find any help on how to extract the table name. The intention is to extract this information and show a visual of the tables and their names for a user to select relevant tables from the list.
I have tried extracting tables and then extracting text as well from pdfs. I am successful at both but not at connecting the table name to the table.
def tables_from_pdfs(filespath):
    pdffiles = glob.glob(os.path.join(filespath, "*.pdf"))
    print(pdffiles)
    dictionary = {}
    keys = []
    for file in pdffiles:
        print(file)
        n = PyPDF2.PdfFileReader(open(file, 'rb')).getNumPages()
        print(n)
        tables_dict = {}
        for i in range(n):
            tables = camelot.read_pdf(file, pages=str(i))
            tables_dict[i] = tables
        head, tail = os.path.split(file)
        tail = tail.replace(".pdf", "")
        keys.append(tail)
        dictionary[tail] = tables_dict
    return dictionary, keys
The expected result is a table and the name of the table as stated in the pdf file. For instance:
Table on page x of pdf name: Table 1. Blah Blah blah
'''Table'''
I was able to find a relative solution. It works for me, at least.
import os, PyPDF2, time, re, shutil
import pytesseract
from pdf2image import convert_from_path
import camelot
import datefinder
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

similarityAmt = 0.6  # find with 60% similarity

def find_table_name(dataframe, documentString):
    # Assuming the text was extracted from a PDF, it is multi-lined; split it by line
    stringsSeparated = documentString.split("\n")
    for i, string in enumerate(stringsSeparated):
        # Split by word
        words = string.split()
        for k, word in enumerate(words):
            # Get the keys from the dataframe as a list (it is initially extracted as a generator type)
            dfList = list(dataframe.keys())
            keys = str(dfList)
            # If the first key is a digit, we assume that the keys are from the row below the keys instead
            if keys[0].isdigit():
                keys = dataframe[dfList[0]]
            # Put all of the keys in a single string
            keysAll = ""
            for key in keys:
                keysAll += key
            # Since a row should be horizontal, check the similarity against the text line by line
            similarRating = similar(words, keysAll)
            # Approve the line if the similarity rating (a ratio from 0 to 1) is above the threshold
            if similarRating > similarityAmt:
                # Iterate upwards up to 10 lines until we find a line longer than 4 characters
                # (an arbitrary number, just to skip blank lines)
                for j in range(10):
                    try:
                        separatedString = stringsSeparated[i-j-1]
                        if len(separatedString) > 4:
                            # Return the top two lines to hopefully get an accurate name
                            return stringsSeparated[i-j-2] + separatedString
                        else:
                            continue
                    except:
                        continue
    return "Unnamed"

# Retrieve the text from the pdf
pages = convert_from_path(pdf_path, 500)  # pdf_path is the path of the PDF you extracted the table from
pdf_text = ""
# Add all page strings into a single string, so the entire PDF is one single string
for pageNum, imgBlob in enumerate(pages):
    extractedText = pytesseract.image_to_string(imgBlob, lang='eng')
    pdf_text += extractedText + "\n"

# Get the name of the table using the table itself and the pdf text
tableName = find_table_name(table.df, pdf_text)  # `table` is a table you extracted, whose name you want to find
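To connect this back to the tables_from_pdfs function from the question, a sketch of attaching a name to every extracted table could look like the following. This is an assumption on my part (the helper name name_all_tables is hypothetical); it reuses find_table_name and the OCR extraction above, and it runs OCR once per PDF, which is slow for large documents:
def name_all_tables(filespath):
    dictionary, keys = tables_from_pdfs(filespath)
    named = {}
    for pdf_key, tables_dict in dictionary.items():
        # OCR the whole PDF once so every table in it can be matched against the text
        pdf_path = os.path.join(filespath, pdf_key + ".pdf")
        pages = convert_from_path(pdf_path, 500)
        pdf_text = "\n".join(pytesseract.image_to_string(p, lang='eng') for p in pages)
        named[pdf_key] = []
        for page, tables in tables_dict.items():
            for table in tables:
                named[pdf_key].append((page, find_table_name(table.df, pdf_text), table.df))
    return named  # per PDF: list of (page, table name, DataFrame) tuples
How reliable the recovered names are will mostly depend on OCR quality and on the 0.6 similarity threshold.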
Tables are listed with the TableList and Table classes in the Camelot API, documented here:
https://camelot-py.readthedocs.io/en/master/api.html#camelot.core.TableList
Start on that page where it says "Lower-Lower-Level Classes".
Camelot does not keep a reference to the table name, just the cell data. It does expose each table as a pandas DataFrame, though, so you can combine Camelot and pandas to carry a table name yourself; see "Get the name of a pandas DataFrame".
Appended update to the answer, taken from https://camelot-py.readthedocs.io/en/master/:
import camelot
tables = camelot.read_pdf('foo.pdf')
tables
<TableList n=1>
tables.export('foo.csv', f='csv', compress=True)  # json, excel, html
tables[0]
<Table shape=(7, 7)>
tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
tables[0].to_csv('foo.csv')  # to_json, to_excel, to_html
df_table = tables[0].df  # get a pandas DataFrame!

# add a name attribute yourself
df_table.name = 'name here'

# from https://stackoverflow.com/questions/31727333/get-the-name-of-a-pandas-dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.ones([4, 4]))
df.name = 'Ones'
print(df.name)
Note: the added 'name' attribute is not really part of the DataFrame; when the DataFrame is serialized, the added name attribute is lost.
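A quick way to see that, and the simplest workaround I know of (keeping the names in a separate mapping); this is just a sketch with a throwaway file name:
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
df.name = "Ones"
df.to_csv("tmp.csv", index=False)

df2 = pd.read_csv("tmp.csv")
print(hasattr(df2, "name"))  # False: the attribute did not survive the round trip

# workaround: keep the name-to-table mapping yourself
table_names = {"Ones": df}
If the names need to survive a save/load cycle, keeping such a mapping (or encoding the name in the file name) is the simplest route.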
More appended to the answer: what behaves like a 'name' here is really the DataFrame's index label, as in the .loc example below.
Getting values
>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
... index=['cobra', 'viper', 'sidewinder'],
... columns=['max_speed', 'shield'])
>>> df
max_speed shield
cobra 1 2
viper 4 5
sidewinder 7 8
Single label. Note this returns the row as a Series.
>>> df.loc['viper']
max_speed 4
shield 5
Name: viper, dtype: int64

How to filter spark dataframe with URL exists or not?

I want to filter my Spark DataFrame. In this DataFrame there is a column of URLs (file paths).
I tried to use os.path.exists(col("url")) to filter the DataFrame, but I got an error like
"string is needed, but column has been found".
Here is part of my code; pandas was used in the original code, and now I want to use Spark to implement the same logic:
bob_ross = pd.DataFrame.from_csv("/dbfs/mnt/umsi-data-science/si618wn2017/bob_ross.csv")
bob_ross['image'] = ""
# create a column for each of the 85 colors (these will be c0...c84)
# we'll do this in a separate table for now and then merge
cols = ['c%s' % i for i in np.arange(0, 85)]
colors = pd.DataFrame(columns=cols)
colors['EPISODE'] = bob_ross.index.values
colors = colors.set_index('EPISODE')
# figure out if we have the image or not, we don't have a complete set
for s in bob_ross.index.values:
    b = bob_ross.loc[s]['TITLE']
    b = b.lower()
    b = re.sub(r'[^a-z0-9\s]', '', b)
    b = re.sub(r'\s', '_', b)
    img = b + ".png"
    if os.path.exists("/dbfs/mnt/umsi-data-science/si618wn2017/images/" + img):
        bob_ross.set_value(s, "image", "/dbfs/mnt/umsi-data-science/si618wn2017/images/" + img)
        t = getColors("/dbfs/mnt/umsi-data-science/si618wn2017/images/" + img)
        colors.loc[s] = t
bob_ross = bob_ross.join(colors)
bob_ross = bob_ross[bob_ross.image != ""]
Here is how I tried to implement it with Spark; I am stuck at the line marked as the error line:
from pyspark.sql.functions import *
bob_ross = spark.read.csv('/mnt/umsi-data-science/si618wn2017/bob_ross.csv',header=True)
bob_ross=bob_ross.withColumn("image",concat(lit("/dbfs/mnt/umsi-data-science/si618wn2017/images/"),concat(regexp_replace(regexp_replace(lower(col('TITLE')),r'[^a-z0-9\s]',''),r'\s','_'),lit(".png"))))
#error line ---filter----
bob_ross.filter(os.path.exists(col("image")))
print(bob_ross.head())
You should be using a filter expression, not an OS function.
For example:
df.filter("image is not NULL")
os.path.exists only operates on the local filesystem, while Spark is meant to run across many servers, so that is a sign you're not using the right kind of function.
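If the goal really is to keep only rows whose image file exists on disk (not just rows where the column is non-null), one possible workaround is a Python UDF that runs os.path.exists on each executor. This is an assumption on my part, not part of the answer above: it only works when the path (here the /dbfs mount) is readable from every worker, and it is slow because the check runs row by row:
import os
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

# True when the file at `path` exists on the worker that evaluates the row
path_exists = udf(lambda path: path is not None and os.path.exists(path), BooleanType())

bob_ross = bob_ross.filter(path_exists(col("image")))
print(bob_ross.head())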

conditional skipping of the files using pandas.read_sql

I am trying to read the values from certain columns in database files (MS Access) only if certain conditions are met.
I have 26 different MS Access files representing the database for 26 different years.
import pyodbc
import pandas as pd
import numpy as np

k = 1993 + np.arange(24)
for i in k:
    print(i)
    DBfile = r'D:\PMIS1993_2016' + '\\' + str(i) + '\\pmismzxpdata_' + str(i) + '.mdb'
    print(DBfile)
    conn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb)};DBQ=' + DBfile)
    cur = conn.cursor()
    qry = "SELECT JCP_FAILED_JNTS_CRACKS_QTY, JCP_FAILURES_QTY, JCP_SHATTERED_SLABS_QTY, JCP_LONGITUDE_CRACKS_QTY, JCP_PCC_PATCHES_QTY FROM PMIS_JCP_RATINGS WHERE BEG_REF_MARKER_NBR = '0342' and BEG_REF_MARKER_DISP LIKE '0.5' and RATING_CYCLE_CODE = 'P'"
    dataf = pd.read_sql(qry, conn)
    print(dataf)
    D = list(dataf.values[0])
    print(D)
    conn.close()
Here I have tried to read the values of JCP_FAILED_JNTS_CRACKS_QTY, JCP_FAILURES_QTY, JCP_SHATTERED_SLABS_QTY, JCP_LONGITUDE_CRACKS_QTY and JCP_PCC_PATCHES_QTY when BEG_REF_MARKER_NBR = '0342', BEG_REF_MARKER_DISP LIKE '0.5' and RATING_CYCLE_CODE = 'P'.
However, not every year meets those conditions.
So I would like to skip the years that do not meet them, with something like an if/else that flags the years that don't satisfy the conditions.
Any help or ideas would be really appreciated.
Isaac
You can use the .empty attribute:
In [11]: pd.DataFrame().empty # This DataFrame has no rows
Out[11]: True
e.g. to skip the empty results:
if not dataf.empty:
    D = list(dataf.values[0])
    print(D)
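Folded into the loop from the question (reusing k and qry from the code above), a sketch might look like this; the years that get skipped are simply the ones whose query returns no rows:
results = {}
for i in k:
    DBfile = r'D:\PMIS1993_2016' + '\\' + str(i) + '\\pmismzxpdata_' + str(i) + '.mdb'
    conn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb)};DBQ=' + DBfile)
    dataf = pd.read_sql(qry, conn)
    conn.close()
    if dataf.empty:
        print(str(i) + ": no rows matched the conditions, skipping")
        continue
    results[i] = list(dataf.values[0])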

How to load data in chunks from a pandas dataframe to a spark dataframe

I have read data in chunks over a pyodbc connection using something like this :
import pandas as pd
import pyodbc
conn = pyodbc.connect("Some connection Details")
sql = "SELECT * from TABLES;"
df1 = pd.read_sql(sql,conn,chunksize=10)
Now I want to read all these chunks into one single spark dataframe using something like:
i = 0
for chunk in df1:
    if i == 0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2.unionAll(sqlContext.createDataFrame(chunk))
    i = i + 1
The problem is that when I do df2.count() I get 10 as the result, which means only the i == 0 case is working. Is this a bug with unionAll, or am I doing something wrong here?
The documentation for .unionAll() states that it returns a new dataframe so you'd have to assign back to the df2 DataFrame:
i = 0
for chunk in df1:
    if i == 0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2 = df2.unionAll(sqlContext.createDataFrame(chunk))
    i = i + 1
Furthermore you can instead use enumerate() to avoid having to manage the i variable yourself:
for i, chunk in enumerate(df1):
    if i == 0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2 = df2.unionAll(sqlContext.createDataFrame(chunk))
Furthermore the documentation for .unionAll() states that .unionAll() is deprecated and now you should use .union() which acts like UNION ALL in SQL:
for i, chunk in enumerate(df1):
    if i == 0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2 = df2.union(sqlContext.createDataFrame(chunk))
Edit:
Furthermore (I'll stop saying furthermore, but not before I say it one more time): as @zero323 says, let's not use .union() in a loop. Let's instead do something like:
def unionAll(*dfs):
    ' by @zero323 from here: http://stackoverflow.com/a/33744540/42346 '
    first, *rest = dfs  # Python 3.x, for 2.x you'll have to unpack manually
    return first.sql_ctx.createDataFrame(
        first.sql_ctx._sc.union([df.rdd for df in dfs]),
        first.schema
    )

df_list = []
for chunk in df1:
    df_list.append(sqlContext.createDataFrame(chunk))

df_all = unionAll(*df_list)  # note the *: unionAll takes the DataFrames as separate arguments
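A side note on why the looped .union() is discouraged: every union extends the DataFrame's logical plan, so chaining many of them makes planning increasingly expensive, while the RDD-level union above builds a single plan. If all chunks fit comfortably in driver memory, another simple option (my own suggestion, not from the answer above) is to concatenate them in pandas first and convert once:
import pandas as pd

# re-read the chunks, stitch them together in pandas, then create one Spark DataFrame
chunks = pd.read_sql(sql, conn, chunksize=10)
df_all = sqlContext.createDataFrame(pd.concat(chunks, ignore_index=True))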
