I'm trying to use pandas_profiling to profile a table.
It has around 20 columns, most of them floats, and almost 3 million records.
I got the following error:
Traceback (most recent call last):
File "V:\Python\prof.py", line 53, in <module>
if __name__ == "__main__": main()
File "V:\Python\prof.py", line 21, in main
df = pd.read_sql(query, sql_conn)
File "C:\Users\linus\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\sql.py", line 380, in read_sql
chunksize=chunksize)
File "C:\Users\linus\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\sql.py", line 1477, in read_query
data = self._fetchall_as_list(cursor)
File "C:\Users\linus\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\sql.py", line 1486, in _fetchall_as_list
result = cur.fetchall()
MemoryError
I have tried with fewer records and it worked.
Is there a way to bypass this error? It looks like a memory limitation.
Can we do this another way? Or is it impossible with Python?
Thanks for your help.
If you are in a position to provide enough information for us to replicate the error, we can resolve it. I would recommend opening an issue on the GitHub page.
Disclosure: I am a co-author of this package.
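For what it's worth, pandas.read_sql also accepts a chunksize argument, which returns an iterator of smaller DataFrames instead of fetching every row in one call. A rough sketch, reusing the query and sql_conn from the script above (the chunk size is arbitrary):

import pandas as pd

pieces = []
for chunk in pd.read_sql(query, sql_conn, chunksize=100_000):
    # Downcast float columns to save memory before keeping the chunk around.
    for col in chunk.select_dtypes('float').columns:
        chunk[col] = pd.to_numeric(chunk[col], downcast='float')
    pieces.append(chunk)
df = pd.concat(pieces, ignore_index=True)

You still need enough memory for the final concatenated frame, but this avoids the extra peak caused by cursor.fetchall() building the full row list first.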
Related
I've been working on a Python script that creates Pandas data frames from Excel files. For the past few days, this worked perfectly with the usual pd.read_excel() method.
Today I've been trying to run the same code, but am running into errors. I've tried the following code on a small test document (just two columns, 5 rows of simple integers):
import pandas as pd
pd.read_excel("tstr.xlsx")
I'm getting this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\util\_decorators.py", line 296, in wrapper
return func(*args, **kwargs)
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_base.py", line 304, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_base.py", line 867, in __init__
self._reader = self._engines[engine](self._io)
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_xlrd.py", line 22, in __init__
super().__init__(filepath_or_buffer)
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_base.py", line 353, in __init__
self.book = self.load_workbook(filepath_or_buffer)
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\excel\_xlrd.py", line 37, in load_workbook
return open_workbook(filepath_or_buffer)
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\xlrd\__init__.py", line 130, in open_workbook
bk = xlsx.open_workbook_2007_xml(
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\xlrd\xlsx.py", line 812, in open_workbook_2007_xml
x12book.process_stream(zflo, 'Workbook')
File "C:\Users\micro\AppData\Local\Programs\Python\Python39\lib\site-packages\xlrd\xlsx.py", line 266, in process_stream
for elem in self.tree.iter() if Element_has_iter else self.tree.getiterator():
AttributeError: 'ElementTree' object has no attribute 'getiterator'
I get the exact same issue when trying to load Excel files with xlrd directly. I've tried several different Excel files, and all of my pip installations are up to date.
I haven't made any changes to my system since pd.read_excel was last working perfectly (I did reboot, but no updates were involved). I'm using a Windows 10 machine, if that's relevant.
Has anyone else had this issue? Any advice on how to proceed?
There can be many different reasons for this error, but you should try adding engine='xlrd' or other possible values (most commonly "openpyxl"). It may solve your issue, as it depends more on the Excel file than on your code.
Also, try using the full path to the file instead of a relative one.
In my case I hit this error when I used the wrong engine for the file type:
openpyxl.utils.exceptions.InvalidFileException: openpyxl does not support the old .xls file format, please use xlrd to read this file, or convert it to the more recent .xlsx file format.
So for me the argument:
engine="xlrd" worked on .xls
engine="openpyxl" worked on .xlsx
This worked for me:
# Back at the Linux prompt, install openpyxl
pip install openpyxl
# Then pass engine='openpyxl' to read_excel in the Python code
data = pd.read_excel(path, sheet_name='Sheet1', parse_dates=True, engine='openpyxl')
I am trying to loop through a folder of Excel spreadsheets and open them to extract data and push it out to a database. So far I keep getting an error when trying to use xlrd.open_workbook. I am trying to understand what a KeyError is and why I am getting it; some ways to get past it would also be appreciated.
import xlrd as rd
book=
rd.open_workbook("C:/Users/me/Desktop/PythonSpyderDesktop/Extract/Bob
Trucking & Warehouse, LLC.xlsm")
I also tried:
path = "C:\\Users\\me\\Desktop\\PythonSpyderDesktop\\Extract\\"
book=
rd.open_workbook(os.path.join(path,'Bob
Trucking & Warehouse, LLC.xlsm'))
This is my error:
Traceback (most recent call last):
File "<ipython-input-99-682ed177f4f5>", line 1, in <module>
book= rd.open_workbook("C:/Users/me/Desktop/PythonSpyderDesktop/Extract/Bob Trucking & Warehouse, LLC.xlsm")
File "C:\Python3\WPy-3670\python-3.6.7.amd64\lib\site-packages\xlrd\__init__.py", line 143, in open_workbook
ragged_rows=ragged_rows,
File "C:\Python3\WPy-3670\python-3.6.7.amd64\lib\site-packages\xlrd\xlsx.py", line 808, in open_workbook_2007_xml
x12book.process_stream(zflo, 'Workbook')
File "C:\Python3\WPy-3670\python-3.6.7.amd64\lib\site-packages\xlrd\xlsx.py", line 265, in process_stream
meth(self, elem)
File "C:\Python3\WPy-3670\python-3.6.7.amd64\lib\site-packages\xlrd\xlsx.py", line 374, in do_sheet
reltype = self.relid2reltype[rid]
KeyError: ''
If I could get some more understanding of a KeyError, that would be terrific. I know it has to do with a dictionary object, but I have only been coding in Python for two days, so I am still grasping the basics. What does the '' in the KeyError mean, and how can I fix it?
Thank you!
Your first code snippet seems to work fine if you sort out the lines that the code is on:
import xlrd as rd
book= rd.open_workbook(r'C:/Users/me/Desktop/PythonSpyderDesktop/Extract/Bob Trucking & Warehouse, LLC.xlsm')
The program was getting confused and thought that the file name ended at
/Extract/Bob
and a second one started at
Trucking & Warehouse, LLC.xlsm")
meaning that it was expecting you to use two sets of quotes to signify two different strings. You can also put r in front of path strings (a raw string), which tells the interpreter to treat backslashes literally rather than as escape sequences.
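On the broader question of what a KeyError means: it is raised whenever you look up a key that is not present in a dictionary. A tiny illustration, separate from xlrd itself (the mapping here is made up):

relid2reltype = {"rId1": "worksheet"}  # hypothetical mapping, for illustration only
print(relid2reltype["rId1"])           # works: the key exists
print(relid2reltype[""])               # raises KeyError: '' because '' is not a key

In your traceback, xlrd evaluates self.relid2reltype[rid] with an empty rid, so the workbook's internal relationship table has no entry for that sheet, which is why the message is KeyError: ''.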
I'm trying to make a simple query with pymongo and loop over the results.
This is the code I'm using:
data = []
tam = db.my_collection.find({'timestamp': {'$gte': start, '$lte':end}}).count()
for i, d in enumerate(table.find({'timestamp': {'$gte': start, '$lte': end}})):
    print('%s of %s' % (i, tam))
    data.append(d)
The start and end variables are Python datetime objects. Everything runs fine until I get the following output:
2987 of 12848
2988 of 12848
2989 of 12848
2990 of 12848
2991 of 12848
2992 of 12848
Traceback (most recent call last):
File "db_extraction\extract_data.py", line 68, in <module>
data = extract_data(yesterday,days = 1)
File "db_extraction\extract_data.py", line 24, in extract_data
for i,d in enumerate(table.find({'timestamp': {'$gte': start, '$lte':end}}).limit(100000)):
File "\venv\lib\site-packages\pymongo\cursor.py", line 1169, in next
if len(self.__data) or self._refresh():
File "\venv\lib\site-packages\pymongo\cursor.py", line 1106, in _refresh
self.__send_message(g)
File "\venv\lib\site-packages\pymongo\cursor.py", line 971, in __send_message
codec_options=self.__codec_options)
File "\venv\lib\site-packages\pymongo\cursor.py", line 1055, in _unpack_response
return response.unpack_response(cursor_id, codec_options)
File "\venv\lib\site-packages\pymongo\message.py", line 945, in unpack_response
return bson.decode_all(self.documents, codec_options)
bson.errors.InvalidBSON
The first thing I tried was changing the range of the query to check whether the problem is data related, and it's not. Another range stops at 1615 of 6360 with the same error.
I've also tried list(table.find({'timestamp': {'$gte': start, '$lte': end}})) and got the same error.
Another possibly relevant detail is that the first queries are really fast; it freezes on the last number for a while before returning the error.
So I need some help. Am I hitting limits here? Or any clue on what's going on?
This might be related to this 2013 question, but the author says that he gets no error output.
Thanks!
EDIT:
First, thank you all for your time and suggestions. Unfortunately, I've tested all the suggestions and I get the same error at the same spot. I've printed the problematic document using the mongo shell and it is pretty much the same as all the others.
I changed the range of the query and tried picking other days. Same problem on all days, until I found one random run that gave me a MemoryError.
1737 of 8011
1738 of 8011
1739 of 8011
1740 of 8011
1741 of 8011
Traceback (most recent call last):
File "db_extraction\pymongo_test.py", line 14, in <module>
for post in all_posts:
File "\python_modules\venv\lib\site-packages\pymongo\cursor.py", line 1189, in next
if len(self.__data) or self._refresh():
File "\python_modules\venv\lib\site-packages\pymongo\cursor.py", line 1126, in _refresh
self.__send_message(g)
File "\python_modules\venv\lib\site-packages\pymongo\cursor.py", line 931, in __send_message
operation, exhaust=self.__exhaust, address=self.__address)
File "\python_modules\venv\lib\site-packages\pymongo\mongo_client.py", line 1145, in _send_message_with_response
exhaust)
File "\python_modules\venv\lib\site-packages\pymongo\mongo_client.py", line 1156, in _reset_on_error
return func(*args, **kwargs)
File "\python_modules\venv\lib\site-packages\pymongo\server.py", line 106, in send_message_with_response
reply = sock_info.receive_message(request_id)
File "\python_modules\venv\lib\site-packages\pymongo\pool.py", line 612, in receive_message
self._raise_connection_failure(error)
File "\python_modules\venv\lib\site-packages\pymongo\pool.py", line 745, in _raise_connection_failure
raise error
File "\python_modules\venv\lib\site-packages\pymongo\pool.py", line 610, in receive_message
self.max_message_size)
File "\python_modules\venv\lib\site-packages\pymongo\network.py", line 191, in receive_message
data = _receive_data_on_socket(sock, length - 16)
File "\python_modules\venv\lib\site-packages\pymongo\network.py", line 227, in _receive_data_on_socket
buf = bytearray(length)
MemoryError
This is intermittent. I ran again without changing anything and got the old InvalidBSON error, then ran again and got the MemoryError.
I started the task manager and ran again, and memory indeed grows quickly up to 95% usage and hangs there. The query should retrieve something like 1 GB of data on an 8 GB RAM machine, so I don't know if this is supposed to happen. In any case, a code suggestion that retrieves the data from MongoDB with pymongo and writes it to a file without putting everything into memory would probably do the job. The bonus would be if someone could explain why I'm getting InvalidBSON instead of MemoryError (for the vast majority of runs) in my case.
Thanks
Your code runs fine on my computer. Since it works for your first 2992 records, I think the documents may have some inconsistency. Does every document in your collection follow the same schema and format? And is your pymongo up to date?
Here is my suggestion if you want to loop through every record:
data = []
all_posts = db.my_collection.find({'timestamp': {'$gte': start, '$lte':end}})
tam = all_posts.count()
i = 0
for post in all_posts:
    i += 1
    print('%s of %s' % (i, tam))
    data.append(post)
Regards,
I ran into this exact same problem myself; it ended up having nothing to do with the documents themselves but with the amount of memory the program was taking up during large queries.
In our specific case, when we ran the broken query that was giving us this exact bug by itself in a separate script, the bug didn't occur. Eventually we found that we were using a uwsgi config setting:
limit-as = 512
This would immediately kill our process when address space reached 512M, resulting in either an InvalidBSON error OR a MemoryError interchangeably, seemingly at random.
We fixed this by changing the limit-as setting to reload-on-as instead:
reload-on-as = 512
Ultimately we decided to break up large queries like this into smaller pieces and perform them sequentially instead of all at once anyway, but we at least determined it was an external cause rather than an issue with the pymongo driver itself.
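As a complement: if the goal is mainly to get the documents onto disk without holding them all in memory, you can stream the cursor straight to a file. A sketch only (the output file name and batch size are arbitrary; bson.json_util handles BSON types such as datetimes):

from bson import json_util

cursor = db.my_collection.find({'timestamp': {'$gte': start, '$lte': end}},
                               batch_size=1000)
with open('export.jsonl', 'w') as f:
    for doc in cursor:
        f.write(json_util.dumps(doc) + '\n')  # one JSON document per line

Memory then stays roughly bounded by one batch at a time instead of the whole result set.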
Could it be related to specific documents in the DB? Have you checked the document that might cause the error (e.g., the 2992nd result of your query above, counting from 0)?
You could also execute some queries against the DB directly (e.g., via the mongo shell) without using pymongo to see whether expected results are returned. For example, you could try db.my_collection.find({...}).skip(2992) to see the result. You could also use cursor.forEach() to print all the retrieved documents.
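In pymongo, that check could look like the sketch below, using the table collection from the question (the index 2992 is simply where the loop stopped):

# Skip the documents that decoded fine and fetch only the next one for inspection.
suspect = table.find({'timestamp': {'$gte': start, '$lte': end}}).skip(2992).limit(1)
for doc in suspect:
    print(doc)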
I am trying to create a dataframe from a directory with multiple files. Among these files, only one has a header. I want to use the infer schema option to create the schema from that header.
When I am creating the DF using one file, it is correctly inferring the schema.
flights = spark.read.csv("/sample/flight/flight_delays1.csv",header=True,inferSchema=True)
But, when I am reading all the files in the directory, it is throwing this error.
flights = spark.read.csv("/sample/flight/",header=True,inferSchema=True)
18/04/21 23:49:18 WARN SchemaUtils: Found duplicate column(s) in the data schema and the partition schema: `11`. You might need to assign different column names.
flights.take(5)
18/04/21 23:49:27 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 476, in take
return self.limit(num).collect()
File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 438, in collect
port = self._jdf.collectToPython()
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"Reference '11' is ambiguous, could be: 11#13, 11#32.;"
I know one workaround is to remove the header line and define the schema manually; is there any other tactic to infer the schema from one file and then add the other files to the DataFrame?
I would advise you to do this:
# First, you infer the schema from the file you know
schm_file = spark.read.csv(
"/sample/flight/file_with_header.csv", header=True, inferSchema=True
)
# Then you use the schema to read the other files
flights = spark.read.csv(
"/sample/flight/", header=False, mode="DROPMALFORMED", schema=schm_file.schema
)
I figured out another way, but it is not dynamic enough for a huge number of files. I prefer the way @Steven suggested.
df1 = spark.read.csv("/sample/flight/flight_delays1.csv",header=True,inferSchema=True)
df2 = spark.read.schema(df1.schema).csv("/sample/flight/flight_delays2.csv")
df3 = spark.read.schema(df1.schema).csv("/sample/flight/flight_delays3.csv")
complete_df = df1.union(df2).union(df3)
complete_df.count()
complete_df.printSchema()
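A slightly more dynamic variant of the same idea, as a sketch (the list of remaining file paths is assumed to be known or discoverable): infer the schema once from the file with the header, then apply it to the remaining files and union them in a loop.

from functools import reduce

# The file with the header drives the schema; the others reuse it.
schema_df = spark.read.csv("/sample/flight/flight_delays1.csv", header=True, inferSchema=True)

other_paths = ["/sample/flight/flight_delays2.csv", "/sample/flight/flight_delays3.csv"]
others = [spark.read.schema(schema_df.schema).csv(p) for p in other_paths]

complete_df = reduce(lambda a, b: a.union(b), others, schema_df)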
I am running a py file through the command:
/opt/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/bin/spark2-submit --jars /home/jsonnt200/geomesa-hbase-spark-runtime_2.11-1.3.5.1cc.jar,/ccri/hbase-site.zip geomesa_klondike_enrichment2.py
This results in the following error:
Traceback (most recent call last):
File "/home/jsonnt200/geomesa_klondike_enrichment2.py", line 6306, in
df2_500m.write.option('header', 'true').csv('/user/jsonnt200/klondike_201708_1m_500meter_testEQ_union4')
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera2-1.cdh5.7.0.p0.171658/lib/spark2/python/pyspark/sql/readwriter.py", line 711, in csv
self._jwrite.csv(path)
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera2-1.cdh5.7.0.p0.171658/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera2-1.cdh5.7.0.p0.171658/lib/spark2/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'Illegal pattern component: XXX'
The biggest concern is that if I submit this same py file through ipython, it runs correctly. Any ideas on what the issue could be? Unfortunately, I have to use spark2-submit for tunnelling purposes.
You are using Spark 2.2.0, right? I encountered the same issue when trying to read a csv file. The problem, I think, is the timestampFormat variable. Its default value is yyyy-MM-dd'T'HH:mm:ss.SSSXXX (see the pyspark.sql documentation).
When I change it to e.g. timestampFormat="yyyy-MM-dd", my code works. This issue is also mentioned in this post. Hope it helps :).
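Applied to the failing write in the question, that would look something like the sketch below (the format string is only an example; use whatever matches your data):

(df2_500m.write
    .option('header', 'true')
    .option('timestampFormat', 'yyyy-MM-dd HH:mm:ss')
    .csv('/user/jsonnt200/klondike_201708_1m_500meter_testEQ_union4'))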