I want to import a txt file, and do a few basic actions on it.
For some reason I keep getting an unhashable type error, not sure what the issue is:
def loadAndPrepData(filepath):
    import pandas as pd
    pd.set_option('display.width', 200)
    # header on the first row, tab-separated
    dataFrame = pd.read_csv(filepath, header=0, sep='\t')
    df = dataFrame[0:639,:]
    print df
filepath = 'United States Cancer Statistics, 1999-2011 Incidencet.txt'
loadAndPrepData(filepath)
Traceback:
Traceback (most recent call last):
File "C:\Users\Michael\workspace\UCIIntrotoPythonDA\src\Michael_Madani_week3.py", line 16, in <module>
loadAndPrepData(filepath)
File "C:\Users\Michael\workspace\UCIIntrotoPythonDA\src\Michael_Madani_week3.py", line 12, in loadAndPrepData
df = dataFrame[0:639,:]
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\core\frame.py", line 1797, in __getitem__
return self._getitem_column(key)
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\core\frame.py", line 1804, in _getitem_column
return self._get_item_cache(key)
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\core\generic.py", line 1082, in _get_item_cache
res = cache.get(item)
TypeError: unhashable type
The problem is that the item getter (`[]`) needs a hashable key. A plain slice `[:]` is handled specially and returns rows, but `[:,:]` is a tuple of slices, which gets treated as a column label; since slices are not hashable, the lookup fails with this error.
pd.DataFrame({"foo":range(1,10)})[:,:]
TypeError: unhashable type
While this works just fine:
pd.DataFrame({"foo":range(1,10)})[:]
However, you should be using .loc no matter how you want to slice.
pd.DataFrame({"foo":range(1,10)}).loc[:,:]
foo
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
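Applied to the question's code, a positional slice with `.iloc` looks like this (a sketch with a toy frame standing in for the tab-separated file; note that `.loc[0:639]` is label-based and inclusive of 639, while `.iloc[0:639]` is positional and excludes it):

```python
import pandas as pd

# toy frame standing in for the file from the question
dataFrame = pd.DataFrame({"foo": range(1000)})

df = dataFrame.iloc[0:639, :]  # first 639 rows, all columns
print(df.shape)  # (639, 1)
```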
I use Python 3.8 with pandas. I have some data in a CSV file that I load into a pandas DataFrame, and I am trying to delete part of the Client_Name strings. Some values look like Client_Name = myserv(4234), and I have to clean the (4234) so that the value becomes Client_Name = myserv. In short, I have to remove the parenthesized part, (4234), from the Client_Name values.
df.Client_Name[df.Client_Name.str.contains('\(')] = df.Client_Name.str.split("(")[0]
I wrote the code above and it does clean the parentheses from the Client_Name values. My problem is that if the (4234) is in the first row of the DataFrame it gives an error; if (4234) is in any other row there is no problem.
The working data is:
,time,Client_Name,Minutes
0,2018-10-14T21:01:00Z,myserv1,5
1,2018-10-14T21:01:00Z,myserv2,5
2,2018-10-14T21:01:00Z,myserv3(4234),6
3,2018-10-14T21:01:00Z,myserv4,6
4,2018-10-14T21:02:07Z,myserv5(4234),3
5,2018-10-14T21:02:29Z,myserv6(4234),3
When I run my code it deletes the (4234) parts and the data turns into the format below:
,time,Client_Name,Minutes
0,2018-10-14T21:01:00Z,myserv1,5
1,2018-10-14T21:01:00Z,myserv2,5
2,2018-10-14T21:01:00Z,myserv3,6
3,2018-10-14T21:01:00Z,myserv4,6
4,2018-10-14T21:02:07Z,myserv5,3
5,2018-10-14T21:02:29Z,myserv6,3
But if the (4234) is in the first row, as below, my code throws an error:
,time,Client_Name,Minutes
0,2018-10-14T21:01:00Z,myserv1(4234),5
1,2018-10-14T21:01:00Z,myserv2,5
2,2018-10-14T21:01:00Z,myserv3,6
3,2018-10-14T21:01:00Z,myserv4,6
4,2018-10-14T21:02:07Z,myserv5,3
5,2018-10-14T21:02:29Z,myserv6,3
The error is:
test1.py:97: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df.Client_Name[df.Client_Name.str.contains('\(')] = df.Client_Name.str.split("(")[0]
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/pandas/core/series.py", line 972, in __setitem__
self._set_with_engine(key, value)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/series.py", line 1005, in _set_with_engine
loc = self.index._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 75, in pandas._libs.index.IndexEngine.get_loc
TypeError: '0 True
1 False
2 False
3 False
4 False
...
116 False
117 False
118 False
119 False
120 False
Name: Client_Name, Length: 121, dtype: bool' is an invalid key
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "test1.py", line 97, in <module>
df.Client_Name[df.Client_Name.str.contains('\(')] = df.Client_Name.str.split("(")[0]
File "/usr/local/lib/python3.8/dist-packages/pandas/core/series.py", line 992, in __setitem__
self._where(~key, value, inplace=True)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py", line 9129, in _where
new_data = self._mgr.putmask(
File "/usr/local/lib/python3.8/dist-packages/pandas/core/internals/managers.py", line 579, in putmask
return self.apply(
File "/usr/local/lib/python3.8/dist-packages/pandas/core/internals/managers.py", line 427, in apply
applied = getattr(b, f)(**kwargs)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/internals/blocks.py", line 1144, in putmask
raise ValueError("cannot assign mismatch length to masked array")
ValueError: cannot assign mismatch length to masked array
Your slicing method generates a copy, which you then modify; that is what triggers the warning.
You could use instead:
df['Client_Name'] = df['Client_Name'].str.replace(r'\(.*?\)', '', regex=True)
output:
time Client_Name Minutes
0 2018-10-14T21:01:00Z myserv1 5
1 2018-10-14T21:01:00Z myserv2 5
2 2018-10-14T21:01:00Z myserv3 6
3 2018-10-14T21:01:00Z myserv4 6
4 2018-10-14T21:02:07Z myserv5 3
5 2018-10-14T21:02:29Z myserv6 3
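If you would rather keep `str.split`, the element-wise fix is the `.str[0]` accessor, which takes the first piece of each row's split; the original code's trailing `[0]` instead indexed the Series by label, handing back the first row's whole split list. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Client_Name': ['myserv1(4234)', 'myserv2', 'myserv3(99)']})

# split each value on '(' and keep the part before it, row by row
df['Client_Name'] = df['Client_Name'].str.split('(').str[0]
print(df['Client_Name'].tolist())  # ['myserv1', 'myserv2', 'myserv3']
```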
I'm getting an "Exception in Tkinter callback" when I try to run a program on a larger Excel file.
The program converts a column whose hobbies data is ", "-separated: each value needs to be split around ", " into a new row, keeping all the other columns the same, as shown here.
Below is the code, which works perfectly for the split when used directly:
Raw_Data_1 = (Raw_Data_1.set_index(Index_Raw_data_Columns_1).stack().str.split(' ✘', expand=True).stack().unstack(-2).reset_index(-1, drop=True).reset_index())
It gives me the error below when I run it from Tkinter. Strangely, it runs perfectly on a smaller file, but once the file size increases I get this error.
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Users\Shivaji Choursiya\AppData\Local\Continuum\anaconda3\lib\tkinter\__init__.py", line 1705, in __call__
return self.func(*args)
File "<ipython-input-5-c95adf78dd09>", line 176, in Main
Raw_Data_1 = (Raw_Data_1.set_index(Index_Raw_data_Columns_1).stack().str.split(' ✘', expand=True).stack().unstack(-2).reset_index(-1, drop=True).reset_index())
File "C:\Users\Shivaji Choursiya\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 6245, in stack
return stack(self, level, dropna=dropna)
File "C:\Users\Shivaji Choursiya\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 547, in stack
dtype = dtypes[0]
IndexError: list index out of range
And below is the error when I run the code separately on the file:
IndexError Traceback (most recent call last)
<ipython-input-23-d3e138bf57f9> in <module>
4 Raw_Data_1 = (Raw_Data_1.set_index(Index_Raw_data_Columns_1)
5 .stack()
----> 6 .str.split(' ✘', expand=True)
7 .stack()
8 .unstack(-2)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in stack(self, level, dropna)
6243 return stack_multiple(self, level, dropna=dropna)
6244 else:
-> 6245 return stack(self, level, dropna=dropna)
6246
6247 def explode(self, column: Union[str, Tuple]) -> "DataFrame":
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\reshape.py in stack(frame, level, dropna)
545 # we concatenate instead.
546 dtypes = list(frame.dtypes.values)
--> 547 dtype = dtypes[0]
548
549 if is_extension_array_dtype(dtype):
IndexError: list index out of range
I'm trying to learn about multiprocessing and pools to process some tweets I've got in a MySQL DB. Here is the code and error messages.
import multiprocessing
import sqlalchemy
import pandas as pd
import config
from nltk import tokenize as token
q = multiprocessing.Queue()
engine = sqlalchemy.create_engine(config.sqlConnectionString)
def getRow(pandasSeries):
    df = pd.DataFrame()
    tweetTokenizer = token.TweetTokenizer()
    print(pandasSeries.loc['BODY'], "\n", type(pandasSeries.loc['BODY']))
    for tokens in tweetTokenizer.tokenize(pandasSeries.loc['BODY']):
        df = df.append(pd.Series(data=[pandasSeries.loc['ID'], tokens, pandasSeries.loc['AUTHOR'],
                                       pandasSeries.loc['RETWEET_COUNT'], pandasSeries.loc['FAVORITE_COUNT'],
                                       pandasSeries.loc['FOLLOWERS_COUNT'], pandasSeries.loc['FRIENDS_COUNT'],
                                       pandasSeries.loc['PUBLISHED_AT']],
                                 index=['id', 'tweet', 'author', 'retweet', 'fav', 'followers', 'friends',
                                        'published_at']), ignore_index=True)
    df.to_sql(name="tweet_tokens", con=engine, if_exists='append')

if __name__ == '__main__':
    ## LOADING SQL INTO DATAFRAME ##
    databaseData = pd.read_sql_table(config.tweetTableName, engine)
    pool = multiprocessing.Pool(6)
    for row in databaseData.iterrows():
        print(row)
        pool.map(getRow, row)
    pool.close()
    q.close()
    q.join_thread()
"""
OUTPUT
C:\Users\Def\Anaconda3\python.exe C:/Users/Def/Dropbox/Dissertation/testThreadCopy.py
(0, ID 3247
AUTHOR b'Elon Musk News'
RETWEET_COUNT 0
FAVORITE_COUNT 0
FOLLOWERS_COUNT 20467
FRIENDS_COUNT 14313
BODY Elon Musk Takes an Adorable 5th Grader's Idea ...
PUBLISHED_AT 2017-03-03 00:00:01
Name: 0, dtype: object)
Elon Musk Takes an Adorable 5th Grader's
<class 'str'>
multiprocessing.pool.RemoteTraceback:
Traceback (most recent call last):
File "C:\Users\Def\Anaconda3\lib\multiprocessing\pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "C:\Users\Def\Anaconda3\lib\multiprocessing\pool.py", line 44, in mapstar
return list(map(*args))
File "C:\Users\Def\Dropbox\Dissertation\testThreadCopy.py", line 16, in getRow
print(pandasSeries.loc['BODY'], "\n", type(pandasSeries.loc['BODY']))
AttributeError: 'numpy.int64' object has no attribute 'loc'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:/Users/Def/Dropbox/Dissertation/testThreadCopy.py", line 34, in <module>
pool.map(getRow, row)
File "C:\Users\Def\Anaconda3\lib\multiprocessing\pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Users\Def\Anaconda3\lib\multiprocessing\pool.py", line 608, in get
raise self._value
AttributeError: 'numpy.int64' object has no attribute 'loc'
Process finished with exit code 1
"""
What I don't understand is why it prints out the first Series and then crashes. And why does it say that pandasSeries.loc['BODY'] is of type numpy.int64 when the printout says it is a string? I'm sure I've gone wrong in a number of other places too; if you can see where, please point it out.
Thanks.
When I construct a simple dataframe:
frame
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
and iterate over it with a nested loop:
for row in frame.iterrows():
    for i in row:
        print(i, type(i))
That inner loop produces 2 items, a row index/label, and a Series with the values.
0 <class 'numpy.int64'>
0 0
1 1
2 2
3 3
Name: 0, dtype: int32 <class 'pandas.core.series.Series'>
Your map does the same, sending a numeric index to one process (which produces the error), and a series to another.
If I use pool.map without the for row:
pool.map(getRow, databaseData.iterrows())
then getRow receives a 2 element tuple.
def getRow(aTuple):
    rowlbl, rowSeries = aTuple
    print(rowSeries)
    ...
Your print(row) shows this tuple; it's just harder to see because the Series part spans multiple lines. With a newline added it becomes clearer:
(0, # row label
ID 3247 # multiline Series
AUTHOR b'Elon Musk News'
RETWEET_COUNT 0
....
Name: 0, dtype: object)
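The unpacking can be checked without the pool at all. A minimal sketch with a toy frame and no multiprocessing (the 'BODY' column name comes from the question; the rest is illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'BODY': ['tweet one', 'tweet two']})

def getRow(aTuple):
    # each item from iterrows() is an (index label, Series) pair
    rowlbl, rowSeries = aTuple
    return rowSeries['BODY']

bodies = [getRow(t) for t in frame.iterrows()]
print(bodies)  # ['tweet one', 'tweet two']
```

With the unpacking in place, `pool.map(getRow, frame.iterrows())` hands one whole (label, Series) tuple to each worker call instead of splitting the pair across two calls.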
I have a dataframe HH that looks like this:
end_Date latitude longitude start_Date
0 9/5/2014 41.8927 -90.4031 4/1/2014
1 9/5/2014 41.8928 -90.4031 4/1/2014
2 9/5/2014 41.8927 -90.4030 4/1/2014
3 9/5/2014 41.8928 -90.4030 4/1/2014
4 9/5/2014 41.8928 -90.4029 4/1/2014
5 9/5/2014 41.8923 -90.4028 4/1/2014
I am trying to parallelize my function using the multiprocessing package in Python.
Here's what I wrote:
if __name__ == '__main__':
    pool = Pool(200)
    start = time.time()
    print "Hello"
    H = pool.map(funct_parallel, HH)
    pool.close()
    pool.join()
when I run this code, I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 685, in runfile
execfile(filename, namespace)
File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/Desktop/testparallel.py", line 198, in <module>
H = pool.map(funct_parallel, HH)
File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\multiprocessing\pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\multiprocessing\pool.py", line 567, in get
raise self._value
TypeError: string indices must be integers, not str
I'm not sure where I am going wrong.
pool.map requires an iterable as its second argument, which it feeds to the function (see the docs).
If you iterate over a DataFrame directly, you get the column names, hence the complaint about string indices:
for i in df:
    print(i)
end_Date
latitude
longitude
start_Date
You need instead to break the DataFrame into pieces that can be processed in parallel by the pool, for instance by reading the file in chunks as explained in the I/O docs.
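A minimal sketch of that idea, using `numpy.array_split` to cut the frame into row-wise pieces the pool can consume (the body of `funct_parallel` here is a toy stand-in, not the question's real function):

```python
import multiprocessing

import numpy as np
import pandas as pd

def funct_parallel(chunk):
    # toy stand-in: just count the rows in this piece
    return len(chunk)

if __name__ == '__main__':
    HH = pd.DataFrame({'latitude': range(10), 'longitude': range(10)})
    chunks = np.array_split(HH, 4)          # list of smaller DataFrames
    with multiprocessing.Pool(4) as pool:   # one chunk per worker call
        H = pool.map(funct_parallel, chunks)
    print(H)  # row counts per chunk
```

Each worker now receives a self-contained DataFrame slice rather than a column name, and `pool.map` collects the per-chunk results back into a list.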
df is a pandas object containing 13 columns of data; I want to feed data from just two of those columns into JIRA by creating new issues. It is a 272x13 object, and each column represents a different field for an issue in JIRA. Every new issue created in JIRA should get its info from just two columns in df: Summary and Comments.
How do I extract the value from those two columns as I go through each row in a for loop? I only want the string value from each row and column, with no index and no object wrapper. My code is below:
from jira.client import JIRA
import pandas as pd
df = pd.read_csv('C:\\Python27\\scripts\\export.csv')
# Set the column names from the export.csv file equal to variables using the
# pandas python module
# Loop to create new issues
for row in df.iterrows():
    summ = str(df.loc[row.index, 'Summary'])[:30]
    comments = str(df.loc[row.index, 'Comments'])
    jira.create_issue(project={'key': 'DEL'}, summary=summ, description=comments, issuetype={'name': 'Bug'})
When I do this I get the error:
Traceback (most recent call last):
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\JIRAprocess_Delta.py", line 86, in <module>
summ = str(df.loc[row.index, 'Summary'])[:30]
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\indexing.py", line 669, in __getitem__
return self._getitem_tuple(key)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\indexing.py", line 252, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\indexing.py", line 361, in _getitem_lowerdim
section = self._getitem_axis(key, axis=i)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\indexing.py", line 758, in _getitem_axis
return self._get_label(key, axis=axis)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\indexing.py", line 60, in _get_label
return self.obj._xs(label, axis=axis, copy=True)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\frame.py", line 2281, in xs
loc = self.index.get_loc(key)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\index.py", line 755, in get_loc
return self._engine.get_loc(key)
File "index.pyx", line 130, in pandas.index.IndexEngine.get_loc (pandas\index.c:3238)
File "index.pyx", line 147, in pandas.index.IndexEngine.get_loc (pandas\index.c:3085)
File "index.pyx", line 293, in pandas.index.Int64Engine._check_type (pandas\index.c:5393)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\series.py", line 523, in __hash__
raise TypeError('unhashable type')
TypeError: unhashable type
TypeError: unhashable type
Here is some example data that is showing up in JIRA for every single issue created in the comments field:
Issue 1:
0 NaN
1 Found that the Delta would leak packets receiv...
2 The Delta will reset everytime you disconnect ...
3 NaN
4 It's supposed to get logged when CP takes to l...
5 Upon upgrading the IDS via the BioMed menu, th...
6 Upon upgrading the IDS via the BioMed menu, th...
7 Upon upgrading the IDS via the BioMed menu, th...
8 Increased Fusion heap size and the SCC1 Initia...
9 Recheck using build 142+, after Matt delivers ...
10 When using WPA2, there is EAPOL key echange go...
11 When using WPA2, there is EAPOL key echange go...
12 NaN
13 NaN
14 NaN
...
I only want each issue to have its own string value, without the index numbers or the NaN showing up, like this:
Issue 1:
Issue 2: Found that the Delta would leak packets receiv...
Issue 3: The Delta will reset everytime you disconnect ...
...
The problem is in the use of iterrows.
Per the documentation (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html), df.iterrows() iterates over the DataFrame rows as (index, Series) pairs.
So you need to replace row.index with row[0], which gives you the index label of the row you are iterating over:
for row in df.iterrows():
    summ = str(df.loc[row[0], 'Summary'])[:30]
    comments = str(df.loc[row[0], 'Comments'])
By the way, I think you don't need iterrows at all:
for row_index in df.index:
    summ = str(df.loc[row_index, 'Summary'])[:30]
    comments = str(df.loc[row_index, 'Comments'])
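A self-contained sketch of that second form, with NaN handling bolted on for the question's comment field (the `pd.isna` check and the sample rows are my additions, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'Summary': ['Delta leaks packets', 'Delta resets on disconnect'],
    'Comments': ['Found during soak test', None],  # None loads as a missing value
})

issues = []
for row_index in df.index:
    summ = str(df.loc[row_index, 'Summary'])[:30]
    raw = df.loc[row_index, 'Comments']
    comments = '' if pd.isna(raw) else str(raw)  # keep "NaN" text out of JIRA
    issues.append((summ, comments))

print(issues)
```

Each tuple now holds plain strings, so the JIRA fields receive the row's value rather than a printed Series with its index labels.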