conditional skipping of the files using pandas.read_sql - python

I am trying to read values from selected columns in a set of database files (MS Access files), but only if certain conditions are met.
I have 26 different MS Access files representing the database for 26 different years.
import pyodbc
import pandas as pd
import numpy as np

k = 1993 + np.arange(24)
for i in k:
    print(i)
    DBfile = r'D:\PMIS1993_2016' + '\\' + str(i) + '\\pmismzxpdata_' + str(i) + '.mdb'
    print(DBfile)
    conn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb)};DBQ=' + DBfile)
    cur = conn.cursor()
    qry = "SELECT JCP_FAILED_JNTS_CRACKS_QTY, JCP_FAILURES_QTY, JCP_SHATTERED_SLABS_QTY, JCP_LONGITUDE_CRACKS_QTY, JCP_PCC_PATCHES_QTY FROM PMIS_JCP_RATINGS WHERE BEG_REF_MARKER_NBR = '0342' and BEG_REF_MARKER_DISP LIKE '0.5' and RATING_CYCLE_CODE = 'P'"
    dataf = pd.read_sql(qry, conn)
    print(dataf)
    D = list(dataf.values[0])
    print(D)
    conn.close()
Here I read the values of JCP_FAILED_JNTS_CRACKS_QTY, JCP_FAILURES_QTY, JCP_SHATTERED_SLABS_QTY, JCP_LONGITUDE_CRACKS_QTY and JCP_PCC_PATCHES_QTY when BEG_REF_MARKER_NBR = '0342' and BEG_REF_MARKER_DISP LIKE '0.5' and RATING_CYCLE_CODE = 'P'.
However, not every year meets these conditions.
So I would like to skip the years that do not meet them, for example with an if/else that flags the years that do not satisfy the conditions.
Any help or idea would be much appreciated.
Isaac

You can use the .empty attribute:
In [11]: pd.DataFrame().empty # This DataFrame has no rows
Out[11]: True
e.g. to skip the empty datafs:
if not dataf.empty:
    D = list(dataf.values[0])
    print(D)
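A minimal sketch of how that check could slot into the loop from the question (reusing k and qry as defined there); years with no matching record are simply reported and skipped:
for i in k:
    DBfile = r'D:\PMIS1993_2016' + '\\' + str(i) + '\\pmismzxpdata_' + str(i) + '.mdb'
    conn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb)};DBQ=' + DBfile)
    dataf = pd.read_sql(qry, conn)
    conn.close()
    if dataf.empty:
        # no row satisfied the WHERE clause for this year
        print('{}: no matching record, skipping'.format(i))
        continue
    D = list(dataf.values[0])
    print(i, D)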

Writing variables to exact cell in Excel with Python

So, in order to refresh my Power BI dashboard I need to write query results to Excel; otherwise I have to run every single query and copy the results myself.
I have now built the following Python code:
import pandas as pd
from pathlib import Path
data_folder = Path("PATH")
file_to_open = data_folder / "excelfile.xlsx"
df = pd.read_excel(file_to_open)
query_1 = 5
query_2 = 3
query_3 = 12
df.loc[df.iloc[-1,-1]+1,['A']] = query_1
df.loc[df.iloc[-1,-1]+1,['B']] = query_2
df.loc[df.iloc[-1,-1]+1,['C']] = query_3
print(df) #for testing#
df.to_excel(file_to_open, index = False)
It somehow puts query_1 in the right spot (right after the last value in column A), but query_2 and query_3 both skip one cell. They should all fill the next empty cell in their column. My columns are A, B and C.
Can someone help me out?
I think this should work:
df.loc[df.A.count(), 'A'] = query_1
df.loc[df.B.count(), 'B'] = query_2
df.loc[df.C.count(), 'C'] = query_3
If you are curious, here is a good answer regarding different ways to count rows/columns: https://stackoverflow.com/a/55435185/11537601
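Put together, a minimal sketch of the full round trip under the same assumptions as the question (placeholder path "PATH", columns A, B and C); Series.count() returns the number of non-null values, which on a default RangeIndex is exactly the label of the next empty row:
import pandas as pd
from pathlib import Path

# Same placeholder path and columns as in the question
file_to_open = Path("PATH") / "excelfile.xlsx"
df = pd.read_excel(file_to_open)

query_1, query_2, query_3 = 5, 3, 12

# Series.count() is the number of non-null cells, i.e. the next empty row label
df.loc[df.A.count(), 'A'] = query_1
df.loc[df.B.count(), 'B'] = query_2
df.loc[df.C.count(), 'C'] = query_3

df.to_excel(file_to_open, index=False)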

Equivalent of arcpy.Statistics_analysis using NumPy (or other)

I am having a problem (I think memory related) when trying to do an arcpy.Statistics_analysis on an approximately 40 million row table. I am trying to count the number of non-null values in various columns of the table per category (e.g. there are x non-null values in column 1 for category A). After this, I need to join the statistics results to the input table.
Is there a way of doing this using numpy (or something else)?
The code I currently have is like this:
arcpy.Statistics_analysis(input_layer, output_layer, "'Column1' COUNT; 'Column2' COUNT; 'Column3' COUNT", "Categories")
I am very much a novice with arcpy/numpy so any help much appreciated!
You can convert a table to a NumPy array using the function arcpy.da.TableToNumPyArray, and then convert the array to a pandas.DataFrame object.
Here is an example of the code (I assume you are working with a feature class because you use the term null values; if you work with a shapefile you will need to change the code, since null values are not supported there and are replaced with a single-space string ' '):
import arcpy
import pandas as pd

# Change these values
gdb_path = 'path/to/your/geodatabase.gdb'
table_name = 'your_table_name'
cat_field = 'Categorie'
fields = ['Column1', 'column2', 'Column3', 'Column4']

# Do not change
null_value = -9999
input_table = gdb_path + '\\' + table_name

# Convert to pandas DataFrame
array = arcpy.da.TableToNumPyArray(input_table,
                                   [cat_field] + fields,
                                   skip_nulls=False,
                                   null_value=null_value)
df = pd.DataFrame(array)

# Count the number of non-null values per field and category
not_null_count = {field: {cat: 0 for cat in df[cat_field].unique()}
                  for field in fields}
for cat in df[cat_field].unique():
    _df = df.loc[df[cat_field] == cat]
    len_cat = len(_df)
    for field in fields:
        counts = _df[field].value_counts()
        # numeric fields hold the sentinel as a number, text fields as a string;
        # if the sentinel is absent there are no nulls for this field/category
        null_count = counts.get(null_value, counts.get(str(null_value), 0))
        not_null_count[field][cat] = len_cat - null_count
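As a side note, if the -9999 sentinel is first mapped back to NaN, the per-category counting can be done with a single groupby; a short pandas-only sketch under the same assumptions (df, cat_field, fields and null_value as above):
import numpy as np

# Replace the sentinel (numeric or text form) with NaN, then count non-nulls per category
clean = df[fields].replace([null_value, str(null_value)], np.nan)
not_null_per_cat = clean.groupby(df[cat_field]).count()
print(not_null_per_cat)  # one row per category, one column per field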
Concerning joining the results back to the input table: without more information it is hard to give an exact answer that will meet your expectations, because there are multiple columns and it is not clear which value you want to add.
EDIT:
Here is some additional code following your clarifications:
# Create a copy of the table
copy_name = ''  # name of the copied table
copy_path = gdb_path + '\\' + copy_name
arcpy.Copy_management(input_table, copy_path)

# Divide the copied data by the summary counts
# This step doesn't need the dict (not_null_count) to be converted to a table
with arcpy.da.UpdateCursor(copy_path, [cat_field] + fields) as cur:
    for row in cur:
        category = row[0]
        for i, fld in enumerate(fields):
            row[i + 1] /= not_null_count[fld][category]
        cur.updateRow(row)

# Save the summary table as a csv file (if needed)
df_summary = pd.DataFrame(not_null_count)
df_summary.index.name = 'Food Area'  # Or any name
df_summary.to_csv('path/to/file.csv')  # Change path

# Summary to ArcMap Table (also if needed)
arcpy.TableToTable_conversion('path/to/file.csv',
                              gdb_path,
                              'name_of_your_new_table')

How to load data in chunks from a pandas dataframe to a spark dataframe

I have read data in chunks over a pyodbc connection using something like this:
import pandas as pd
import pyodbc
conn = pyodbc.connect("Some connection Details")
sql = "SELECT * from TABLES;"
df1 = pd.read_sql(sql,conn,chunksize=10)
Now I want to read all these chunks into one single spark dataframe using something like:
i = 0
for chunk in df1:
    if i == 0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2.unionAll(sqlContext.createDataFrame(chunk))
    i = i + 1
The problem is that when I do a df2.count() I get the result 10, which means only the i == 0 case is working. Is this a bug with unionAll, or am I doing something wrong here?
The documentation for .unionAll() states that it returns a new dataframe so you'd have to assign back to the df2 DataFrame:
i = 0
for chunk in df1:
    if i == 0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2 = df2.unionAll(sqlContext.createDataFrame(chunk))
    i = i + 1
Furthermore you can instead use enumerate() to avoid having to manage the i variable yourself:
for i, chunk in enumerate(df1):
    if i == 0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2 = df2.unionAll(sqlContext.createDataFrame(chunk))
Furthermore the documentation for .unionAll() states that .unionAll() is deprecated and now you should use .union() which acts like UNION ALL in SQL:
for i, chunk in enumerate(df1):
    if i == 0:
        df2 = sqlContext.createDataFrame(chunk)
    else:
        df2 = df2.union(sqlContext.createDataFrame(chunk))
Edit:
Furthermore I'll stop saying furthermore, but not before I say furthermore: as @zero323 says, let's not use .union() in a loop. Let's instead do something like:
def unionAll(*dfs):
    """ by @zero323 from here: http://stackoverflow.com/a/33744540/42346 """
    first, *rest = dfs  # Python 3.x, for 2.x you'll have to unpack manually
    return first.sql_ctx.createDataFrame(
        first.sql_ctx._sc.union([df.rdd for df in dfs]),
        first.schema
    )

df_list = []
for chunk in df1:
    df_list.append(sqlContext.createDataFrame(chunk))

df_all = unionAll(*df_list)
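On Spark 2.x and later, where SparkSession replaces sqlContext, the same fold can be written with functools.reduce over DataFrame.union; a sketch, assuming spark is an existing SparkSession and df1 is the chunked pandas reader from the question:
from functools import reduce

# One Spark DataFrame per pandas chunk, folded together with union (UNION ALL semantics)
spark_chunks = [spark.createDataFrame(chunk) for chunk in df1]
df_all = reduce(lambda a, b: a.union(b), spark_chunks)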

Pandas apply speed on large datasets

I have a table in pandas that has two columns, QuarterHourDimID and StartDateDimID ; these columns give me an ID for each date / quarter hour pairing. For instance for January 1st, 2015, at 12:15PM the StartDateDimID would equal 1097 and QuarterHourDimID would equal 26. This is how the data I'm reading is organized.
It's a large table that I'm reading using pyodbc and pandas.read_sql(), ~450M rows and ~60 columns, so performance is an issue.
To parse the QuarterHourDimID and StartDateDimID columns into a workable datetime index, I'm running an apply function on every row to create an additional datetime column.
Reading the table without the additional parsing takes around 800 ms; however, running this apply function adds around 4 s to the total run time (5.8-6 s per query overall). The DataFrame that is returned has around ~45K rows and 5 columns (~450 days * ~100 quarter-hour parts).
I am hoping to more efficiently rewrite what I've written and get any input along the way.
Below is the code I've written thus far:
import pandas as pd
from datetime import datetime, timedelta
import pyodbc

def table(network, demo):
    connection_string = "DRIVER={SQL Server};SERVER=OURSERVER;DATABASE=DB"
    sql = """SELECT [ID],[StartDateDimID],[DemographicGroupDimID],[QuarterHourDimID],[Impression] FROM TABLE_NAME
             WHERE (MarketDimID = 1
                AND RecordTypeDimID = 2
                AND EstimateTypeDimID = 1
                AND DailyOrWeeklyDimID = 1
                AND RecordSequenceCodeDimID = 5
                AND ViewingTypeDimID = 4
                AND NetworkDimID = {}
                AND DemographicGroupDimID = {}
                AND QuarterHourDimID IS NOT NULL)""".format(network, demo)
    with pyodbc.connect(connection_string) as cnxn:
        df = pd.read_sql(sql=sql, con=cnxn, index_col=None)

    def time_map(quarter_hour, date):
        if quarter_hour > 72:
            return date + timedelta(minutes=(quarter_hour % 73) * 15)
        return date + timedelta(hours=6, minutes=(quarter_hour - 1) * 15)

    map_date = {}
    init_date = datetime(year=2012, month=1, day=1)
    for x in df.StartDateDimID.unique():
        map_date[x] = init_date + timedelta(days=int(x) - 1)

    # this is the part of my code that is likely bogging things down
    df['datetime'] = df.apply(lambda row: time_map(int(row['QuarterHourDimID']),
                                                   map_date[row['StartDateDimID']]),
                              axis=1)

    if network == 1278:
        df = df.loc[df.groupby('datetime')['Impression'].idxmin()]

    df = df.set_index(['datetime'])
    return df
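As an aside, the row-wise apply can usually be replaced by vectorised timedelta arithmetic over whole columns; a rough sketch under the same assumptions as the function above (df as returned by read_sql, same quarter-hour encoding as time_map), which should be considerably faster than apply:
import numpy as np
import pandas as pd
from datetime import datetime

init_date = datetime(year=2012, month=1, day=1)

# Whole-column equivalents of map_date and time_map above
dates = init_date + pd.to_timedelta(df['StartDateDimID'].astype(int) - 1, unit='D')
qh = df['QuarterHourDimID'].astype(int)

# Quarter hours above 72 wrap past midnight; the rest start at 06:00
minutes = np.where(qh > 72, (qh % 73) * 15, 6 * 60 + (qh - 1) * 15)
df['datetime'] = dates + pd.to_timedelta(pd.Series(minutes, index=df.index), unit='m')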
Just to post an example of the datetime conversion being performed in SQL rather than pandas: the code above yielded a mean time of 6.4 s per execution, while rewriting it entirely in SQL brought that down to a mean of about 640 ms per execution.
The updated code:
import pandas as pd
import pyodbc
SQL_QUERY ="""
SELECT [Impressions] = MIN(naf.Impression), [datetime] = DATEADD(minute,td.Minute,DATEADD(hour,td.Hour,CONVERT(smalldatetime, ddt.DateKey)))
FROM [dbo].[NielsenAnalyticsFact] AS naf
LEFT JOIN [dbo].[DateDim] AS ddt
ON naf.StartDateDimID = ddt.DateDimID
LEFT JOIN [dbo].[TimeDim] as td
ON naf.QuarterHourDimID = td.TimeDimID
WHERE (naf.NielsenMarketDimID = 1
AND naf.RecordTypeDimID = 2
AND naf.AudienceEstimateTypeDimID = 1
AND naf.DailyOrWeeklyDimID = 1
AND naf.RecordSequenceCodeDimID = 5
AND naf.ViewingTypeDimID = 4
AND naf.NetworkDimID = 1278
AND naf.DemographicGroupDimID = 3
AND naf.QuarterHourDimID IS NOT NULL)
GROUP BY DATEADD(minute,td.Minute,DATEADD(hour,td.Hour,CONVERT(smalldatetime, ddt.DateKey)))
ORDER BY DATEADD(minute,td.Minute,DATEADD(hour,td.Hour,CONVERT(smalldatetime, ddt.DateKey))) ASC
"""
%%timeit -n200
with pyodbc.connect(DB_CREDENTIALS) as cnxn:
    df = pd.read_sql(sql=SQL_QUERY,
                     con=cnxn,
                     index_col=None)
200 loops, best of 3: 613 ms per loop

Insert a Pandas Dataframe into mongodb using PyMongo

What is the quickest way to insert a pandas DataFrame into mongodb using PyMongo?
Attempts
db.myCollection.insert(df.to_dict())
gave an error
InvalidDocument: documents must have only string keys, the key was
Timestamp('2013-11-23 13:31:00', tz=None)
db.myCollection.insert(df.to_json())
gave an error
TypeError: 'str' object does not support item assignment
db.myCollection.insert({id: df.to_json()})
gave an error
InvalidDocument: documents must have only string keys, key was <built-in function id>
df
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 150 entries, 2013-11-23 13:31:26 to 2013-11-23 13:24:07
Data columns (total 3 columns):
amount 150 non-null values
price 150 non-null values
tid 150 non-null values
dtypes: float64(2), int64(1)
Here you have the quickest way: use the insert_many method from PyMongo 3 and the 'records' parameter of the to_dict method.
db.collection.insert_many(df.to_dict('records'))
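If the DataFrame has a DatetimeIndex, as in the question, resetting the index first turns the Timestamps into an ordinary column and avoids the string-key error from the first attempt; a short sketch using the same collection name:
records = df.reset_index().to_dict('records')  # the datetime index becomes a regular column
db.myCollection.insert_many(records)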
I doubt there is a method that is both quickest and simple. If you don't worry about data conversion, you can do:
>>> import json
>>> df = pd.DataFrame.from_dict({'A': {1: datetime.datetime.now()}})
>>> df
A
1 2013-11-23 21:14:34.118531
>>> records = json.loads(df.T.to_json()).values()
>>> db.myCollection.insert(records)
But in case you try to load data back, you'll get:
>>> df = read_mongo(db, 'myCollection')
>>> df
A
0 1385241274118531000
>>> df.dtypes
A int64
dtype: object
so you'll have to convert the 'A' column back to datetimes, as well as all fields that are not int, float or str in your DataFrame. For this example:
>>> df['A'] = pd.to_datetime(df['A'])
>>> df
A
0 2013-11-23 21:14:34.118531
odo can do it using
odo(df, db.myCollection)
If your DataFrame has missing data (i.e. None, NaN) and you don't want null key values in your documents:
db.insert_many(df.to_dict("records")) will insert keys with null values. If you don't want those empty keys in your documents, you can use a modified version of the pandas .to_dict("records") code below:
from pandas.core.common import _maybe_box_datetimelike

my_list = [
    {k: _maybe_box_datetimelike(v)
     for k, v in zip(df.columns, row)
     if v != None and v == v}
    for row in df.values
]
db.insert_many(my_list)
The if v != None and v == v check makes sure the value is neither None nor NaN before putting it in the row's dictionary, so .insert_many will only include keys with actual values in the documents (and no null data types).
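For what it's worth, a sketch that avoids the private pandas helper entirely by dropping None/NaN entries from each record (assuming scalar cell values; the behaviour should be equivalent for the usual cases):
import pandas as pd

records = [
    {k: v for k, v in row.items() if pd.notna(v)}  # drop None/NaN keys per document
    for row in df.to_dict('records')
]
db.insert_many(records)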
I think there are some cool ideas in this question. In my case I have spent more time taking care of moving large DataFrames around. In those cases pandas tends to give you a chunksize option (for example in pandas.DataFrame.to_sql). So I think I can contribute here by adding the function I am using in this direction.
def write_df_to_mongoDB(my_df,
                        database_name='mydatabasename',
                        collection_name='mycollectionname',
                        server='localhost',
                        mongodb_port=27017,
                        chunk_size=100):
    """
    This function takes a DataFrame and creates a collection in MongoDB (you should
    provide the database name, collection name, server and port of the remote
    database to connect to).

    ---------------------------------------------------------------------------
    Parameters / Input
      my_df:           the DataFrame to send to MongoDB
      database_name:   database name
      collection_name: collection name (to create)
      server:          the server where the MongoDB database is hosted
                       Example: server = 'XXX.XXX.XX.XX'
      mongodb_port:    the port where the database is operating
                       For example: mongodb_port = 27017
      chunk_size:      the number of records that will be sent at the same time
                       to the database. Default is 100.

    Output
      When finished will print "Done"
    ----------------------------------------------------------------------------
    FUTURE modifications.
    1. Write to SQL
    2. Write to csv
    ----------------------------------------------------------------------------
    30/11/2017: Rafael Valero-Fernandez. Documentation
    """
    # To connect
    # import os
    # import pandas as pd
    # import pymongo
    # from pymongo import MongoClient
    client = MongoClient(server, int(mongodb_port))
    db = client[database_name]
    collection = db[collection_name]

    # To write
    collection.delete_many({})  # Destroy the collection
    # aux_df = aux_df.drop_duplicates(subset=None, keep='last')  # To avoid repetitions
    my_list = my_df.to_dict('records')
    l = len(my_list)
    ran = range(l)
    steps = list(ran[chunk_size::chunk_size])
    steps.extend([l])

    # Insert chunks of the dataframe
    i = 0
    for j in steps:
        print(j)
        collection.insert_many(my_list[i:j])  # fill the collection
        i = j
    print('Done')
    return
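For example, a call with hypothetical database and collection names would look like this:
from pymongo import MongoClient
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
write_df_to_mongoDB(df,
                    database_name='mydatabasename',
                    collection_name='mycollectionname',
                    server='localhost',
                    mongodb_port=27017,
                    chunk_size=2)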
I use the following to insert the DataFrame into a collection in the database:
df.reset_index(inplace=True)
data_dict = df.to_dict("records")
myCollection.insert_many(data_dict)
How about this:
db.myCollection.insert({id: df.to_json()})
where id should be a unique string for that df.
Just make string keys!
import json
dfData = json.dumps(df.to_dict('records'))
savaData = {'_id': 'a8e42ed79f9dae1cefe8781760231ec0', 'df': dfData}
res = client.insert_one(savaData)
##### load dfData
data = client.find_one({'_id': 'a8e42ed79f9dae1cefe8781760231ec0'}).get('df')
dfData = json.loads(data)
df = pd.DataFrame.from_dict(dfData)
If you want to send several at one time:
db.myCollection.insert_many(df.apply(lambda x: x.to_dict(), axis=1).to_list())
If you want to make sure that you're not raising InvalidDocument errors, then something like the following is a good idea. This is because mongo does not recognize types such as np.int64, np.float64, etc.
import numpy as np
from pymongo import MongoClient

client = MongoClient()
db = client.test
col = db.col

def createDocsFromDF(df, collection=None, insertToDB=False):
    docs = []
    fields = [col for col in df.columns]
    for i in range(len(df)):
        doc = {col: df[col][i] for col in df.columns if col != 'index'}
        for key, val in doc.items():
            # we have to do this, because mongo does not recognize these np. types
            if type(val) == np.int64:
                doc[key] = int(val)
            if type(val) == np.float64:
                doc[key] = float(val)
            if type(val) == np.bool_:
                doc[key] = bool(val)
        docs.append(doc)
    if insertToDB and collection:
        collection.insert_many(docs)
    return docs
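A hypothetical call, inserting into the col collection defined above and keeping the generated documents:
docs = createDocsFromDF(df, collection=col, insertToDB=True)
print(len(docs), 'documents inserted')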
For upserts this worked.
for r in df2.to_dict(orient="records"):
db['utest-pd'].update_one({'a':r['a']},{'$set':r})
It does one record at a time, but it didn't seem that upsert_many was able to work with more than one filter value for different records.
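For bulk upserts with a different filter per record, PyMongo's bulk_write with UpdateOne(..., upsert=True) is one option; a sketch under the same assumptions as above (DataFrame df2 with key column 'a', collection 'utest-pd'):
from pymongo import UpdateOne

ops = [
    UpdateOne({'a': r['a']}, {'$set': r}, upsert=True)  # one filter per record
    for r in df2.to_dict(orient='records')
]
db['utest-pd'].bulk_write(ops)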
