Reading a parquet file causes "The kernel appears to have died" [duplicate] - python

This question already has answers here:
How do you fix "runtimeError: package fails to pass a sanity check" for numpy and pandas?
(9 answers)
Closed 1 year ago.
Trying the following code in a Jupyter notebook (pandas and pyarrow are installed via pip):
import pandas as pd
parquet_file = r'C:\Users\Future\Desktop\userdata1.parquet'
df = pd.read_parquet(parquet_file, engine='auto')
print(df.head())
When running the code in the Jupyter notebook, the kernel appears to have died. I restarted the kernel and tried again, but got the same error.
I even tried putting the code in a .py file and running it from the terminal, but I didn't get any output.
The engine is 'auto', and I also tried the 'pyarrow' engine.
The link to the parquet file: https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet
I have installed Python 3.8.6, pandas 1.1.4, and pyarrow 2.0.0, and when trying to run the code I encountered the following error:
** On entry to DGEBAL parameter number 3 had an illegal value
** On entry to DGEHRD parameter number 2 had an illegal value
** On entry to DORGHR DORGQR parameter number 2 had an illegal value
** On entry to DHSEQR parameter number 4 had an illegal value
Traceback (most recent call last):
File "demo.py", line 1, in <module>
import pandas as pd
File "C:\Users\Future\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\__init__.py", line 11, in <module>
__import__(dependency)
File "C:\Users\Future\AppData\Local\Programs\Python\Python38\lib\site-packages\numpy\__init__.py", line 305, in <module>
_win_os_check()
File "C:\Users\Future\AppData\Local\Programs\Python\Python38\lib\site-packages\numpy\__init__.py", line 302, in _win_os_check
raise RuntimeError(msg.format(__file__)) from None
RuntimeError: The current Numpy installation ('C:\\Users\\Future\\AppData\\Local\\Programs\\Python\\Python38\\lib\\site-packages\\numpy\\__init__.py') fails to pass a sanity check due to a bug in the windows runtime. See this issue for more information: https://tinyurl.com/y3dm3h86
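This matches the known numpy 1.19.4 issue with a Windows runtime update, which is what the linked duplicate covers. A common workaround at the time, sketched here rather than prescribed, was to pin numpy back to 1.19.3 until a fixed release was available:
pip install numpy==1.19.3
After reinstalling numpy, importing pandas (and the read_parquet call above) should no longer trip the sanity check.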

Running
import pandas as pd
parquet_file = r'userdata1.parquet'
df = pd.read_parquet(parquet_file, engine='auto')
print(df.head())
returns
registration_dttm id first_name last_name email \
0 2016-02-03 07:55:29 1 Amanda Jordan ajordan0@com.com
1 2016-02-03 17:04:03 2 Albert Freeman afreeman1@is.gd
2 2016-02-03 01:09:31 3 Evelyn Morgan emorgan2@altervista.org
3 2016-02-03 00:36:21 4 Denise Riley driley3@gmpg.org
4 2016-02-03 05:05:31 5 Carlos Burns cburns4@miitbeian.gov.cn
gender ip_address cc country birthdate \
0 Female 1.197.201.2 6759521864920116 Indonesia 3/8/1971
1 Male 218.111.175.34 Canada 1/16/1968
2 Female 7.161.136.94 6767119071901597 Russia 2/1/1960
3 Female 140.35.109.83 3576031598965625 China 4/8/1997
4 169.113.235.40 5602256255204850 South Africa
salary title comments
0 49756.53 Internal Auditor 1E+02
1 150280.17 Accountant IV
2 144972.51 Structural Engineer
3 90263.05 Senior Cost Accountant
4 NaN
using pyarrow 2.0.0 on Python 3.8.6 and pandas 1.1.4,
with df.shape giving (1000, 13).

Related

Automatically Get the Latest Version of a Shapefile From ONS Open Geography Portal

I am trying to automate a data pipeline in Python.
Currently I manually download the latest version of the Local Authority Districts UK BUC shapefile from: https://geoportal.statistics.gov.uk/datasets/local-authority-districts-december-2020-uk-buc?geometry=-39.021%2C51.099%2C34.148%2C59.780 and open it with Geopandas.
I am able to get the same shapefile via the API Explorer using the following:
import pandas as pd
import geopandas as gpd
import requests
import json
data = requests.get("https://opendata.arcgis.com/datasets/69dc11c7386943b4ad8893c45648b1e1_0.geojson")
lad_gdf = gpd.GeoDataFrame.from_features(data.json(),crs=4326)
lad_gdf.head()
OUT:
geometry OBJECTID LAD20CD LAD20NM ... LONG LAT Shape__Area Shape__Length
0 POLYGON ((-1.24224 54.72297, -1.24194 54.72272... 1 E06000001 Hartlepool ... -1.27018 54.676140 9.602987e+07 51065.295913
1 POLYGON ((-1.19860 54.58287, -1.16664 54.55423... 2 E06000002 Middlesbrough ... -1.21099 54.544670 5.523139e+07 35500.386745
2 POLYGON ((-0.79189 54.55824, -0.80042 54.55101... 3 E06000003 Redcar and Cleveland ... -1.00608 54.567520 2.483428e+08 78449.389240
3 POLYGON ((-1.19319 54.62905, -1.20018 54.62350... 4 E06000004 Stockton-on-Tees ... -1.30664 54.556911 2.052314e+08 87566.566061
4 POLYGON ((-1.43836 54.59508, -1.42333 54.60313... 5 E06000005 Darlington ... -1.56835 54.535339 1.988128e+08 91926.839545
[5 rows x 11 columns]
I can't find a version of the above API that will always download the latest version of the shapefile, as opposed to the December 2020 version.
Is there a way to ensure that I always download the newest release?
Thanks
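As an aside on the fetching step, geopandas can read the GeoJSON URL directly, which skips the manual requests/json round trip. A minimal sketch (it still points at the fixed December 2020 snapshot, so it does not by itself answer the latest-version question):
import geopandas as gpd

# geopandas downloads and parses the GeoJSON itself; no requests call needed
url = "https://opendata.arcgis.com/datasets/69dc11c7386943b4ad8893c45648b1e1_0.geojson"
lad_gdf = gpd.read_file(url)
print(lad_gdf.head())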

Having trouble with <class 'pandas.core.indexing._AtIndexer'>

I'm working on an ML project to predict answer times on Stack Overflow based on tags. Sample data:
Unnamed: 0 qid i qs qt tags qvc qac aid j as at
0 1 563355 62701.0 0 1235000081 php,error,gd,image-processing 220 2 563372 67183.0 2 1235000501
1 2 563355 62701.0 0 1235000081 php,error,gd,image-processing 220 2 563374 66554.0 0 1235000551
2 3 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563358 15842.0 3 1235000177
3 4 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563413 893.0 18 1235001545
4 5 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563454 11649.0 4 1235002457
I'm stuck at the data cleaning process. I intend to create a new column named 'time_taken' which stores the difference between the at and qt columns.
Code:
import pandas as pd
import numpy as np
df = pd.read_csv("answers.csv")
df['time_taken'] = 0
print(type(df.time_taken))
for i in range(0, 263541):
    val = df.qt[i]
    qtval = val.item()
    val = df.at[i]
    atval = val.item()
    df.time_taken[i] = qtval - atval
I'm getting this error:
Traceback (most recent call last):
File "<ipython-input-39-9384be9e5531>", line 1, in <module>
val = df.at[0]
File "D:\Softwares\Anaconda\lib\site-packages\pandas\core\indexing.py", line 2080, in __getitem__
return super().__getitem__(key)
File "D:\Softwares\Anaconda\lib\site-packages\pandas\core\indexing.py", line 2027, in __getitem__
return self.obj._get_value(*key, takeable=self._takeable)
TypeError: _get_value() missing 1 required positional argument: 'col'
The problem here lies in the indexing of df.at.
The types of df.at and df.qt are
<class 'pandas.core.indexing._AtIndexer'>
<class 'pandas.core.series.Series'> respectively.
I'm an absolute beginner in data science and do not have enough experience with pandas and numpy.
There is, to put it mildly, an easier way to do this.
df['time_taken'] = df['at'] - df.qt
The _AtIndexer issue comes up because .at is a built-in pandas accessor. You want to make sure not to give columns names that clash with a Python/pandas method or accessor for this reason. You can get around it just by indexing with df['at'] instead of df.at.
Besides that, this operation — if I'm understanding it — can be done with one short line vs. a long for loop.
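A minimal, self-contained sketch of the name collision and the vectorized fix (toy values standing in for the real answers.csv):
import pandas as pd

# toy frame with an 'at' column, which shadows the built-in DataFrame.at accessor
df = pd.DataFrame({'qt': [1235000081, 1235000140],
                   'at': [1235000501, 1235000177]})

print(type(df.at))     # <class 'pandas.core.indexing._AtIndexer'>, the accessor
print(type(df['at']))  # <class 'pandas.core.series.Series'>, the actual column

# vectorized difference; no loop or .item() calls needed
df['time_taken'] = df['at'] - df['qt']
print(df)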

Tab and line separation in python pandas

I have the attached file which I need to read in Python. I need to ignore the NETSIM and 10 values at the top and read the remaining lines. I used the following code to read the file in Python:
import pandas as pd
x=pd.read_csv('C:/Users/oq/Desktop/FAST/Algorithms/project/benchmark/input10.txt',sep=r'\\\t',engine='python',skiprows=(0,1,2), header=None)
I used the tab separator in my code, but the output still shows up as follows:
0 0\t0.362291\t0.441396
1 1\t0.156279\t0.341383
2 2\t0.699696\t0.045577
3 3\t0.714313\t0.171668
4 4\t0.378966\t0.495494
5 5\t0.961942\t0.444337
6 6\t0.726886\t0.575888
7 7\t0.168639\t0.406223
8 8\t0.875627\t0.061439
9 9\t0.540054\t0.317061
10 5\t7\t155200000.000000\t54000000.000000\t37997...
11 3\t4\t155200000.000000\t40500000.000000\t24507...
12 4\t6\t155200000.000000\t33000000.000000\t18606...
13 5\t6\t155200000.000000\t72000000.000000\t39198...
14 4\t1\t155200000.000000\t40500000.000000\t24507...
15 3\t9\t155200000.000000\t39000000.000000\t22698...
Can someone please guide me as to what's wrong?
The attached file
You want to split on a literal tab character, not the string \\t, so you shouldn't be using a raw string literal here. Change sep to just '\t'.
x=pd.read_csv('C:/Users/oq/Desktop/FAST/Algorithms/project/benchmark/input10.txt',sep='\t',engine='python',skiprows=(0,1,2), header=None)
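As a further aside (a sketch, with the path shortened): engine='python' was only needed because of the regex-style separator. With a literal '\t' the default C engine parses the file and is faster:
import pandas as pd

# literal tab separator; the default C engine handles this, so engine='python' can be dropped
x = pd.read_csv('input10.txt', sep='\t', skiprows=(0, 1, 2), header=None)
print(x.head())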

Show me duplicated Addresses pandas

I have these two columns in my csv: Address of New Home and Can (short for Cancelled). If any address is cancelled, True has to be written under Can, but sometimes the end user forgets to write True and the same address appears twice. I want Python to tell me (not remove) the addresses that appear twice without the first one being cancelled out.
Example:
Date_Booked Address of New Home Can
01/07/2017 1234 SO Drive True
02/14/2017 4321 Python Court
03/17/2017 1234 SO Drive
03/23/2017 4321 Python Court
As you can see from the above example, 1234 SO Drive was cancelled and True was written; this is what we want. But 4321 Python Court was also cancelled (that is why it was written twice), and since it does not say True under Can, it will show up twice in our csv and cause all sorts of issues.
import pandas as pd
first = pd.read_csv('Z:PCR.csv')
df = pd.DataFrame(first)
non_cancelled = df['Can'].apply(lambda x: x != 'True')
dup_addresses = non_cancelled.groupby('Address of New Home').filter(lambda x: len (x) > 1)
if not dup_addresses.empty:
    raise Exception('Same address written twice without cancellation')
I am getting the following error:
Traceback (most recent call last):
File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)
File "pandas\src\hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:8543)
TypeError: an integer is required
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
dup_addresses = non_cancelled.groupby('Address of New Home').filter(lambda x: len (x) > 1)
KeyError: 'Address of New Home'
Any assistance would be greatly appreciated.
This should update your Can column by keeping the True that is already there and updating with the ones that were missed.
can = df.duplicated(subset=['Address of New Home'], keep='last')
df['Can'] = df.Can.combine_first(can.where(can, ''))
print(df)
Date_Booked Address of New Home Can
0 01/07/2017 1234 SO Drive True
1 02/14/2017 4321 Python Court True
2 03/17/2017 1234 SO Drive
3 03/23/2017 4321 Python Court
Per request
can = df.duplicated(subset=['Address of New Home'], keep='last')
df['Can'] = df.Can.combine_first(pd.Series(np.where(can, 'Missed', ''), df.index))
print(df)
Date_Booked Address of New Home Can
0 01/07/2017 1234 SO Drive True
1 02/14/2017 4321 Python Court Missed
2 03/17/2017 1234 SO Drive
3 03/23/2017 4321 Python Court
Your column is Address_of_New_Home, not Address of New Home; you just forgot the underscores.
The problem is in this statement:
non_cancelled = df['Can'].apply(lambda x: x != 'True')
When you call apply here, you are applying it to the series df['Can'], so the method returns a series, not the full DataFrame. To illustrate, here is some code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': np.arange(0,5), 'b': np.arange(5,10), 'c': np.arange(10,15)})
print(df)
The output is this
a b c
0 0 5 10
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
But when I do this:
a = df['a'].apply(lambda x: x*20)
print(a)
I get:
0 0
1 20
2 40
3 60
4 80
To do what you would like to do, try doing this instead:
non_cancelled = df[df['Can'] != 'True']
This gives us all rows in the DataFrame where the condition (df['Can'] != 'True') evaluates to True.
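Putting the two pieces together (filter the full DataFrame, then group by address), a runnable sketch built from the sample rows in the question:
import pandas as pd

# sample data from the question; a missing Can entry is an empty string
df = pd.DataFrame({
    'Date_Booked': ['01/07/2017', '02/14/2017', '03/17/2017', '03/23/2017'],
    'Address of New Home': ['1234 SO Drive', '4321 Python Court',
                            '1234 SO Drive', '4321 Python Court'],
    'Can': ['True', '', '', ''],
})

# drop rows already marked cancelled, then flag addresses that still appear more than once
non_cancelled = df[df['Can'] != 'True']
dup_addresses = non_cancelled.groupby('Address of New Home').filter(lambda g: len(g) > 1)
if not dup_addresses.empty:
    raise Exception('Same address written twice without cancellation')
This raises for 4321 Python Court, which appears twice with no True, but not for 1234 SO Drive, whose first booking was cancelled.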

Pandas ExcelFile.parse() reading file in as dict instead of dataframe

I am new to python and even newer to pandas, but relatively well versed in R. I am using Anaconda, with Python 3.5 and pandas 0.18.1. I am trying to read in an excel file as a dataframe. The file admittedly is pretty... ugly. There is a lot of empty space, missing headers, etc. (I am not sure if this is the source of any issues)
I create the file object, then find the appropriate sheet, then try to read that sheet as a dataframe:
xl = pd.ExcelFile(allFiles[i])
sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()]
df = xl.parse(sName)
df
Results:
{'Security exposure - 21 day lag': Percent of Total Holdings \
0 KMNFC vs. 3 Month LIBOR AUD
1 04-OCT-16
2 Australian Dollar
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 Long/Short Net Exposure
9 Total
10 NaN
11 Long
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
(This goes on for 20-30 more rows and 5-6 more columns)
I am using Anaconda and Spyder, which has a 'Variable Explorer'. It shows the variable df to be a dict of the DataFrame type.
However, I cannot use iloc:
df.iloc[:,1]
Traceback (most recent call last):
File "<ipython-input-77-d7b3e16ccc56>", line 1, in <module>
df.iloc[:,1]
AttributeError: 'dict' object has no attribute 'iloc'
Any thoughts? What am I missing?
EDIT:
To be clear, what I am really trying to do is reference the first column of the df. In R this would be df[,1]. Looking around it seems to be not a very popular way to do things, or not the 'correct' way. I understand why indexing by column names, or keys, is better, but in this situation, I really just need to index the dataframes by column numbers. Any working method of doing that would be greatly appreciated.
EDIT (2):
Per a suggestion, I tried 'read_excel', with the same results:
df = pd.ExcelFile(allFiles[i]).parse(sName)
df.loc[1]
Traceback (most recent call last):
File "<ipython-input-90-fc40aa59bd20>", line 2, in <module>
df.loc[1]
AttributeError: 'dict' object has no attribute 'loc'
df = pd.read_excel(allFiles[i], sheetname = sName)
df.loc[1]
Traceback (most recent call last):
File "<ipython-input-91-72b8405c6c42>", line 2, in <module>
df.loc[1]
AttributeError: 'dict' object has no attribute 'loc'
The problem was here:
sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()]
which returned a single element list. I changed it to the following:
sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()][0]
which returns a string, and the code then performs as expected.
All thanks to ayhan for pointing this out.
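For the column-number indexing the edits asked about (R's df[,1]): once df is an actual DataFrame rather than a dict, positional indexing works as expected. A short sketch with a toy frame (the second column is invented for the sketch):
import pandas as pd

# toy frame standing in for the parsed sheet
df = pd.DataFrame({'Percent of Total Holdings': ['KMNFC vs. 3 Month LIBOR AUD', '04-OCT-16'],
                   'Other': [1, 2]})

first_col = df.iloc[:, 0]  # first column by position, the equivalent of R's df[, 1]
print(first_col)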
