I have been using Python/pandas on Windows 10 since last month and did not face the issue below until it suddenly appeared. I have a CSV file that is read with pandas. However, the DataFrame is arbitrarily joining the comma-separated headings into one and, while doing so, abruptly dropping the last few characters; as a result, the code, though very simple, is failing. Has anyone seen this kind of problem? Suggestions to overcome it would be of great help.
I was trying to check that the dates are in 'yyyy-mm-dd' format. Since I got the error, I put in a print statement to check the column names.
I reinstalled Python 3.6.8, pandas, etc., but that did not help.
import pandas as pd

df = pd.read_csv('Data.csv', 'r')
print(df.columns)
for pdt in df.PublicDate:
    try:
        dat = pdt[0:10]
        if dat[4] != '-' or dat[7] != '-':
            print('\nPub Date Format Error', dat)
    except TypeError as e:
        print(e)
The test Data.csv file contains:
PIC,PublicDate,Version,OriginalDate,BPD
ABCD,2019-06-15T19:25:22.000000000Z,1,2019-06-1519.25.22.0000000000,15-06-2019
EFGH,06/15/2019T19:26:22.000000000Z,,2019-06-1519.26.22.0000000000,15-06-2019
IJKL,2019-06-15T20:26:22.000000000Z,1,2019-06-1520.26.22.0000000000,6/25/2019
MNOP,,,2019-06-1520.26.22.0000000000,6/25/2019
QRST,2019-06-15T22:26:22.000000000Z,1,,6/25/2019
Expected:
dates in a format like 6/25/2019 should be flagged for not being in the 2019-06-25 format
Actual result: the error below
=============== RESTART: H:\Python\DateFormat.py ===============
Index(['PIC,PublicDate,Ve', 'sion,O', 'iginalDate,BPD'], dtype='object')
Traceback (most recent call last):
File "H:\Program Files\Python\DateFormat.py", line 8, in <module>
for pdt in df.PublicDate:
File "G:\Program Files\lib\site-packages\pandas\core\generic.py", line 5067, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'PublicDate'
The problem is in the second parameter:
df = pd.read_csv('Data.csv','r')
Without it the example works fine:
df = pd.read_csv('Data.csv')
It happens because the second positional parameter of read_csv is the separator, not a file access mode. With 'r' as the separator, pandas can still read the file, but it splits the header on every 'r' instead of on commas (hence 'PIC,PublicDate,Ve', 'sion,O', 'iginalDate,BPD'), so the columns cannot be created properly.
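With the separator argument removed, the original script runs as intended; a quick sketch (the expected column index is shown as a comment):

import pandas as pd

# No second positional argument: read_csv defaults to sep=','
df = pd.read_csv('Data.csv')
print(df.columns)  # Index(['PIC', 'PublicDate', 'Version', 'OriginalDate', 'BPD'], dtype='object')
for pdt in df.PublicDate:
    try:
        dat = pdt[0:10]
        if dat[4] != '-' or dat[7] != '-':
            print('\nPub Date Format Error', dat)
    except TypeError as e:
        # the empty PublicDate row comes through as NaN (a float), hence the guard
        print(e)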
Related
In my Databricks notebook, I am getting a ParseException in the last line of the code below when converting a string to the Date data type. The column in the CSV file does correctly contain hiring_date in a date format.
Question: What may I be doing wrong here, and how can we fix the error?
Remark: I am using Python and NOT Scala. I do not know Scala.
from pyspark.sql.functions import *
df = spark.read.csv(".../Test/MyFile.csv", header="true", inferSchema="true")
df2 = df.withColumn("hiring_date",df["hiring_date"].cast('DateType'))
If it is the last line of your code, then, with reference to this doc, it should be modified as follows:
from pyspark.sql.types import DateType

df2 = df.withColumn("hiring_date", df.hiring_date.cast(DateType()))
It seems you passed a wrong value to the cast function: the string 'DateType' is not a recognized type name, whereas DateType() is the actual type object.
The following code would work as well:
df2 = df.withColumn("hiring_date", df["hiring_date"].cast('Date'))
Suppose I have a simple pandas dataframe and I want to show these values inside a web app using streamlit:
import pandas as pd
import streamlit as st
df = pd.DataFrame({'a': [1174505943511396352, 2414501743231376356]})
st.table(df)
Printing the dataframe gives me:
>>> df
a
0 1174505943511396352
1 2414501743231376356
But inside the web app the result is shown with the values rounded: the displayed numbers are not correct, and the original values are lost.
What have I tried?
I tried changing the dtype of my dataframe, but it did not work:
import numpy as np

df['a'] = df['a'].astype(np.uint64)
st.table(df)  # no changes!
I also tried style.format, but it did not work either:
func = lambda x: np.uint64(x)
df = df.style.format({"a": func})
st.table(df)  # I got the error below
Error message:
Traceback (most recent call last):
File "c:\users\a.tabarayi\desktop\analyze-clustering-results\venv\lib\site-packages\streamlit\script_runner.py", line 349, in _run_script
exec(code, module.__dict__)
File "C:\Users\a.tabarayi\Desktop\analyze-clustering-results\src\app.py", line 90, in <module>
st.table(x)
File "c:\users\a.tabarayi\desktop\analyze-clustering-results\venv\lib\site-packages\streamlit\elements\data_frame.py", line 120, in table
marshall_data_frame(data, table_proto)
File "c:\users\a.tabarayi\desktop\analyze-clustering-results\venv\lib\site-packages\streamlit\elements\data_frame.py", line 150, in marshall_data_frame
File "c:\users\a.tabarayi\desktop\analyze-clustering-results\venv\lib\site-packages\streamlit\elements\data_frame.py", line 169, in _marshall_styles
translated_style = styler._translate()
TypeError: _translate() missing 2 required positional arguments: 'sparse_index' and 'sparse_cols'
Any help would be appreciated, thanks!
Edit:
The following solution seems to be working for me, but let me know if there is a better approach to prevent unnecessary type casting!
df['a'] = df['a'].astype(str)
st.table(df)
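A small variant of that workaround keeps the numeric column intact and only casts a display copy (display_df is a name introduced here for illustration):

import pandas as pd
import streamlit as st

df = pd.DataFrame({'a': [1174505943511396352, 2414501743231376356]})
display_df = df.copy()
# Exact digits survive serialization to the frontend as strings
display_df['a'] = display_df['a'].astype(str)
st.table(display_df)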
I think it's a bug in Streamlit, and they have already fixed it; I don't know when the fix will be available to "normal" users. (Judging from the traceback, pandas 1.3 changed the signature of Styler._translate, which the installed Streamlit version calls without arguments.)
https://github.com/streamlit/streamlit/issues/3526
https://github.com/streamlit/streamlit/pull/3536
A patch will be released next week and a minor update in the next couple of weeks (*). In the meantime, you could pin the pandas version in your requirements.txt:
pandas==1.1.4
I use:
Python 3.7
SAS v7.1 Enterprise
I want to export some data (from a library) from SAS to CSV. After that I want to import this CSV into a pandas DataFrame and use it.
I have a problem: when I export data from SAS with this code:
proc export data=LIB.NAME
outfile='path\to\export\file.csv'
dbms=csv
replace;
run;
every column was exported correctly except the date column. In SAS I see something like:
06NOV2018
16APR2018
and so on. In the CSV it looks the same. But if I import this CSV into a DataFrame, unfortunately, Python sees the date column as object/string instead of a date type.
So here is my question: how can I export a whole library from SAS to CSV with the correct column types (especially the date column)? Maybe I should convert something before the export? Please help me with this; I'm new to SAS and just want to import the data from it and use it in Python.
Before you write something, keep in mind that I had already tried pandas' read_sas function, but that command raised the following exception:
df1 = pd.read_sas(path)
ValueError: Unexpected non-zero end_of_first_byte
Exception ignored in: 'pandas.io.sas._sas.Parser.process_byte_array_with_data'
Traceback (most recent call last):
  File "pandas\io\sas\sas.pyx", line 31, in pandas.io.sas._sas.rle_decompress
I added a fillna call, and it shows the same error:
df = pd.DataFrame.fillna((pd.read_sas(path)), value="")
I tried the sas7bdat module in Python, but I got the same error.
Then I tried the sas7bdat_converter module, but the CSV has the same values in the date column, so the dtype problem will come back after converting the CSV to a DataFrame.
Have you got any suggestions? I've spent two days trying to figure it out without any positive results.
Regarding the read_sas error, a Git issue has been reported but was closed for lack of a reproducible example. However, I can easily import SAS data files with pandas using .sas7bdat files generated from SAS 9.4 base (possibly the v7.1 Enterprise is the issue).
However, consider using the parse_dates argument of read_csv, as it can convert your DDMMMYYYY date format to datetime during import. No change is needed to your SAS export.
sas_df = pd.read_csv(r"path\to\export\file.csv", parse_dates=['DATE_COLUMN'])
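If the default parser is slow or ambiguous, the conversion can also be done explicitly after import; the column name DATE_COLUMN mirrors the line above and is an assumption about your file:

import pandas as pd

sas_df = pd.read_csv(r"path\to\export\file.csv")
# %d%b%Y parses DATE9.-style strings such as 06NOV2018 (month matching is case-insensitive)
sas_df['DATE_COLUMN'] = pd.to_datetime(sas_df['DATE_COLUMN'], format='%d%b%Y')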
I am trying to drop duplicates in my dataframe using drop_duplicates(subset=[''], keep=False). It works just fine in my Jupyter Notebook, but when I execute it through the terminal as a .py file I get the error below:
Traceback (most recent call last):
File "/home/source/fork/PySpark_Analytics/Notebooks/Krish/beryllium_pandas.py", line 54, in <module>
dffsamelname = dffsameflname.drop_duplicates(subset=['INDIVIDUAL_LASTNAME'], keep=False)
File "/var/webeng/opensource/aetna-anaconda/lib/python2.7/site-packages/pandas/util/decorators.py", line 88, in wrapper
return func(*args, **kwargs)
TypeError: drop_duplicates() got an unexpected keyword argument 'keep'
I checked that the pandas version is >0.18, as keep=False was introduced then.
# Trying to drop both the records with same last name
dffsamelname = dffsameflname.drop_duplicates(subset=['INDIVIDUAL_LASTNAME'], keep=False)
I want both duplicate records to be dropped, hence keep=False is necessary.
The call works just fine if I remove keep=False.
It may be that your object is not a native pandas DataFrame but a PySpark DataFrame.
From http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.drop_duplicates it seems that subset is the only argument accepted. Can you add your imports and the lines where you create the dataframe?
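If it is indeed a PySpark DataFrame, its drop_duplicates has no keep=False equivalent; one way to drop every row whose INDIVIDUAL_LASTNAME occurs more than once is a window count. A sketch, assuming dffsameflname is a PySpark DataFrame with that column:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Count rows per last name, then keep only the names that appear exactly once
w = Window.partitionBy('INDIVIDUAL_LASTNAME')
dffsamelname = (dffsameflname
                .withColumn('_cnt', F.count(F.lit(1)).over(w))
                .filter(F.col('_cnt') == 1)
                .drop('_cnt'))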
I have a my_file.h5 file that presumably contains data in HDF5 format (PyTables). I try to read this file using pandas:
import pandas as pd
store = pd.HDFStore('my_file.h5')
Then I try to use the store object:
print store
As a result I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/pymodules/python2.7/pandas/io/pytables.py", line 133, in __repr__
kind = v._v_attrs.pandas_type
File "/usr/lib/python2.7/dist-packages/tables/attributeset.py", line 302, in __getattr__
(name, self._v__nodePath)
AttributeError: Attribute 'pandas_type' does not exist in node: '/data'
Does anybody know what I am doing wrong? Could the problem be that my *.h5 file is not really what I think it is (not data in HDF5 format)?
In your /usr/lib/pymodules/python2.7/pandas/io/pytables.py, line 133
kind = v._v_attrs.pandas_type
In my pytables.py I see
kind = getattr(n._v_attrs,'pandas_type',None)
By using getattr, if there is no pandas_type attribute, then kind is set to None. I'm guessing my version of Pandas
In [7]: import pandas as pd
In [8]: pd.__version__
Out[8]: '0.10.0'
is newer than yours. If so, the fix is to upgrade your pandas.
I had an h5 table made with PyTables, independent of pandas, and needed to turn it into a list of tuples and then import it into a df. This worked nicely because it allows me to use my PyTables index to run a "where" on the input, which saves reading all the rows.
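For reference, a minimal sketch of that pattern (the node name /data comes from the traceback above; the column name some_col and the condition are illustrative assumptions):

import tables
import pandas as pd

with tables.open_file('my_file.h5', mode='r') as h5:
    table = h5.root.data                     # the plain PyTables table at '/data'
    rows = table.read_where('some_col > 0')  # indexed "where" query, avoids a full scan
    df = pd.DataFrame.from_records(rows)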