List index out of range in tkinter - python

I'm getting an "Exception in Tkinter callback" when I try to run a program on a larger Excel file.
The program processes a column in the file that holds hobbies data separated by ", ". I just need to split each value around ", " into a new row, keeping all the other columns the same, as shown here.
Below is the code, which works perfectly for splitting when used directly:
Raw_Data_1 = (Raw_Data_1.set_index(Index_Raw_data_Columns_1).stack().str.split(' ✘', expand=True).stack().unstack(-2).reset_index(-1, drop=True).reset_index())
It gives me the error below when I try to run it in Tkinter. Strangely, it runs perfectly on a smaller file, but when the file size increases it gives me this error.
**Exception in Tkinter callback**
**Traceback (most recent call last):**
File "C:\Users\Shivaji Choursiya\AppData\Local\Continuum\anaconda3\lib\tkinter\__init__.py", **line 1705, in __call__**
return self.func(*args)
File "<ipython-input-5-c95adf78dd09>", **line 176, in Main**
Raw_Data_1 = (Raw_Data_1.set_index(Index_Raw_data_Columns_1).stack().str.split(' ✘', expand=True).stack().unstack(-2).reset_index(-1, drop=True).reset_index())
File "C:\Users\Shivaji Choursiya\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", **line 6245, in stack**
return stack(self, level, dropna=dropna)
File "C:\Users\Shivaji Choursiya\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\reshape.py", **line 547, in stack**
dtype = dtypes[0]
**IndexError: list index out of range**
Below is the error when the code is run separately on the file:
IndexError Traceback (most recent call last)
<ipython-input-23-d3e138bf57f9> in <module>
4 Raw_Data_1 = (Raw_Data_1.set_index(Index_Raw_data_Columns_1)
5 .stack()
----> 6 .str.split(' ✘', expand=True)
7 .stack()
8 .unstack(-2)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in stack(self, level, dropna)
6243 return stack_multiple(self, level, dropna=dropna)
6244 else:
-> 6245 return stack(self, level, dropna=dropna)
6246
6247 def explode(self, column: Union[str, Tuple]) -> "DataFrame":
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\reshape.py in stack(frame, level, dropna)
545 # we concatenate instead.
546 dtypes = list(frame.dtypes.values)
--> 547 dtype = dtypes[0]
548
549 if is_extension_array_dtype(dtype):
IndexError: list index out of range
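For the underlying task (splitting a ", "-separated column into one row per item while keeping the other columns intact), newer pandas (0.25+) offers explode, which sidesteps the stack/unstack round-trip that raises the IndexError here. This is a minimal sketch with hypothetical column names, not the asker's actual data:

```python
import pandas as pd

# Hypothetical sample mirroring the hobbies layout described in the question.
df = pd.DataFrame({
    "Name": ["A", "B"],
    "Hobbies": ["reading, hiking", "chess"],
})

# Split the ", "-separated column into lists, then emit one row per item
# while every other column is repeated unchanged.
df["Hobbies"] = df["Hobbies"].str.split(", ")
df = df.explode("Hobbies").reset_index(drop=True)
print(df)
```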

Related

Python - Pandas, csv row iteration to divide by half based on criteria met

I work with and collect data from the meters on the side of people's houses in our service area. I wrote a Python script to send high/low voltage alerts to my email whenever they occur, but the voltage originally came in as twice its actual value (so instead of 125V it showed 250V). So I used pandas to divide the entire column in half... Well, it turns out a small handful of meters were programmed to send back the actual voltage of 125... So I can no longer halve the whole column and now must iterate and divide individually. I'm a bit new to scripting, so my problem might be simple.
df = pd.read_csv(csvPath)
if "timestamp" not in df:
    df.insert(0, 'timestamp', currenttimestamp)
for i in df['VoltageA']:
    if df['VoltageA'][i] > 200:
        df['VoltageA'][i] = df['VoltageA'][i] / 2
        df['VoltageAMax'][i] = df['VoltageAMax'][i] / 2
        df['VoltageAMin'][i] = df['VoltageAMin'][i] / 2
df.to_csv(csvPath, index=False)
Timestamp is there just as a 'key' to avoid duplicate errors later in the same day.
The error I am getting:
Traceback (most recent call last):
File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\indexes\range.py", line 385, in get_loc
return self._range.index(new_key)
ValueError: 250 is not in range
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\admin\Documents\script.py", line 139, in <module>
updateTable(csvPath, tableName, truncate, email)
File "C:\Users\admin\Documents\script.py", line 50, in updateTable
if df['VoltageA'][i]>200.0:
File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\series.py", line 958, in __getitem__
return self._get_value(key)
File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\series.py", line 1069, in _get_value
loc = self.index.get_loc(label)
File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\indexes\range.py", line 387, in get_loc
raise KeyError(key) from err
KeyError: 250.0
If this isn't enough and you actually need to see a csv snippet, let me know; I'm just trying not to put unnecessary info here. Note: the first VoltageA value is 250.0.
The example code below shows how to use loc to conditionally change the values in multiple columns:
import pandas as pd
df = pd.DataFrame({
'voltageA': [190,230,250,100],
'voltageMax': [200,200,200,200],
'voltageMin': [100,100,100,100]
})
df.loc[df['voltageA'] > 200, ['voltageA', 'voltageMax', 'voltageMin']] = df.loc[df['voltageA'] > 200, ['voltageA', 'voltageMax', 'voltageMin']]/2
df
Output
   voltageA  voltageMax  voltageMin
0       190         200         100
1       115         100          50
2       125         100          50
3       100         200         100
The data in the 2nd and 3rd rows were divided by 2 because, in the original data, the voltageA values in those two rows exceed 200.

I cannot trim a string in a dataframe, if the string is in the first record of dataframe

I use Python 3.8 with pandas. I have some data in a csv file. I load the data into a pandas dataframe and try to delete some parts of the Client_Name strings. Some Client_Name values look like Client_Name = myserv(4234), so I have to remove the (4234) to make Client_Name = myserv. In short, I have to strip the parenthesized (4234) parts from the Client_Name values.
df.Client_Name[df.Client_Name.str.contains('\(')] = df.Client_Name.str.split("(")[0]
I wrote the code above and it removes the parentheses from the Client_Name values. My problem is that if the (4234) is in the first row of the dataframe it raises an error; if (4234) is in any other row there is no problem.
The working data is:
,time,Client_Name,Minutes
0,2018-10-14T21:01:00Z,myserv1,5
1,2018-10-14T21:01:00Z,myserv2,5
2,2018-10-14T21:01:00Z,myserv3(4234),6
3,2018-10-14T21:01:00Z,myserv4,6
4,2018-10-14T21:02:07Z,myserv5(4234),3
5,2018-10-14T21:02:29Z,myserv6(4234),3
When I run my code it deletes the (4234) parts and the data turns into the format below:
,time,Client_Name,Minutes
0,2018-10-14T21:01:00Z,myserv1,5
1,2018-10-14T21:01:00Z,myserv2,5
2,2018-10-14T21:01:00Z,myserv3,6
3,2018-10-14T21:01:00Z,myserv4,6
4,2018-10-14T21:02:07Z,myserv5,3
5,2018-10-14T21:02:29Z,myserv6,3
But if the (4234) is in the first row, like below, my code throws an error:
,time,Client_Name,Minutes
0,2018-10-14T21:01:00Z,myserv1(4234),5
1,2018-10-14T21:01:00Z,myserv2,5
2,2018-10-14T21:01:00Z,myserv3,6
3,2018-10-14T21:01:00Z,myserv4,6
4,2018-10-14T21:02:07Z,myserv5,3
5,2018-10-14T21:02:29Z,myserv6,3
The error is:
test1.py:97: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df.Client_Name[df.Client_Name.str.contains('\(')] = df.Client_Name.str.split("(")[0]
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/pandas/core/series.py", line 972, in __setitem__
self._set_with_engine(key, value)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/series.py", line 1005, in _set_with_engine
loc = self.index._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 75, in pandas._libs.index.IndexEngine.get_loc
TypeError: '0 True
1 False
2 False
3 False
4 False
...
116 False
117 False
118 False
119 False
120 False
Name: Client_Name, Length: 121, dtype: bool' is an invalid key
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "test1.py", line 97, in <module>
df.Client_Name[df.Client_Name.str.contains('\(')] = df.Client_Name.str.split("(")[0]
File "/usr/local/lib/python3.8/dist-packages/pandas/core/series.py", line 992, in __setitem__
self._where(~key, value, inplace=True)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py", line 9129, in _where
new_data = self._mgr.putmask(
File "/usr/local/lib/python3.8/dist-packages/pandas/core/internals/managers.py", line 579, in putmask
return self.apply(
File "/usr/local/lib/python3.8/dist-packages/pandas/core/internals/managers.py", line 427, in apply
applied = getattr(b, f)(**kwargs)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/internals/blocks.py", line 1144, in putmask
raise ValueError("cannot assign mismatch length to masked array")
ValueError: cannot assign mismatch length to masked array
Your slicing method generates a copy, which you then modify; this is what triggers the warning.
Instead, you could use:
df['Client_Name'] = df['Client_Name'].str.replace('\(.*?\)', '', regex=True)
output:
time Client_Name Minutes
0 2018-10-14T21:01:00Z myserv1 5
1 2018-10-14T21:01:00Z myserv2 5
2 2018-10-14T21:01:00Z myserv3 6
3 2018-10-14T21:01:00Z myserv4 6
4 2018-10-14T21:02:07Z myserv5 3
5 2018-10-14T21:02:29Z myserv6 3

Dask Dataframe: Resample partitioned data loaded from multiple parquet files

I am loading multiple parquet files containing timeseries data together. But the loaded dask dataframe has unknown divisions, because of which I can't apply various time series operations to it.
df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')
For instance, df_resampled = df.resample('1T').mean().compute() gives following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-8e6f7f4340fd> in <module>
1 df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')
----> 2 df_resampled = df.resample('1T').mean().compute()
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in resample(self, rule, closed, label)
2627 from .tseries.resample import Resampler
2628
-> 2629 return Resampler(self, rule, closed=closed, label=label)
2630
2631 @derived_from(pd.DataFrame)
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/tseries/resample.py in __init__(self, obj, rule, **kwargs)
118 "for more information."
119 )
--> 120 raise ValueError(msg)
121 self.obj = obj
122 self._rule = pd.tseries.frequencies.to_offset(rule)
ValueError: Can only resample dataframes with known divisions
See https://docs.dask.org/en/latest/dataframe-design.html#partitions
for more information.
I went to the link: https://docs.dask.org/en/latest/dataframe-design.html#partitions and it says,
In these cases (when divisions are unknown), any operation that requires a cleanly partitioned DataFrame with known divisions will have to perform a sort. This can generally be achieved by calling df.set_index(...).
I then tried the following, but with no success.
df = dd.read_parquet('/path/to/*.parquet')
df = df.set_index('Timestamps')
This step throws the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-468e9af0c4d6> in <module>
1 df = dd.read_parquet(os.path.join(OUTPUT_DATA_DIR, '20*.gzip'))
----> 2 df.set_index('Timestamps')
3 # df_resampled = df.resample('1T').mean().compute()
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in set_index(***failed resolving arguments***)
3915 npartitions=npartitions,
3916 divisions=divisions,
-> 3917 **kwargs,
3918 )
3919
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/shuffle.py in set_index(df, index, npartitions, shuffle, compute, drop, upsample, divisions, partition_size, **kwargs)
483 if divisions is None:
484 sizes = df.map_partitions(sizeof) if repartition else []
--> 485 divisions = index2._repartition_quantiles(npartitions, upsample=upsample)
486 mins = index2.map_partitions(M.min)
487 maxes = index2.map_partitions(M.max)
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in __getattr__(self, key)
3755 return self[key]
3756 else:
-> 3757 raise AttributeError("'DataFrame' object has no attribute %r" % key)
3758
3759 def __dir__(self):
AttributeError: 'DataFrame' object has no attribute '_repartition_quantiles'
Can anybody suggest the right way to load multiple timeseries files as a dask dataframe, so that pandas timeseries operations can be applied to it?
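As a plain-pandas sketch of what dask is asking for: resample needs a parsed, ordered DatetimeIndex, and dask's "known divisions" are just the per-partition bounds of such an index. The data below is hypothetical; on the dask side, ensuring the Timestamps column is datetime-typed before set_index, or (in newer dask versions) asking read_parquet to compute divisions from the parquet statistics, generally restores known divisions, though the exact option name varies by version.

```python
import pandas as pd

# Hypothetical timeseries frame with a sorted DatetimeIndex, which is the
# pandas-level equivalent of a dask dataframe with known divisions.
df = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0, 4.0]},
    index=pd.to_datetime([
        "2021-01-01 00:00:30", "2021-01-01 00:00:45",
        "2021-01-01 00:01:10", "2021-01-01 00:01:40",
    ]),
)
df.index.name = "Timestamps"

# With the index in place, per-minute resampling works ('1min' == '1T').
resampled = df.resample("1min").mean()
print(resampled)
```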

Simple Pandas issue, Python

I want to import a txt file, and do a few basic actions on it.
For some reason I keep getting an unhashable type error, not sure what the issue is:
def loadAndPrepData(filepath):
    import pandas as pd
    pd.set_option('display.width', 200)
    dataFrame = pd.read_csv(filepath, header=0, sep='\t')  # set header to first row and sep by tab
    df = dataFrame[0:639,:]
    print df
filepath = 'United States Cancer Statistics, 1999-2011 Incidencet.txt'
loadAndPrepData(filepath)
Traceback:
Traceback (most recent call last):
File "C:\Users\Michael\workspace\UCIIntrotoPythonDA\src\Michael_Madani_week3.py", line 16, in <module>
loadAndPrepData(filepath)
File "C:\Users\Michael\workspace\UCIIntrotoPythonDA\src\Michael_Madani_week3.py", line 12, in loadAndPrepData
df = dataFrame[0:639,:]
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\core\frame.py", line 1797, in __getitem__
return self._getitem_column(key)
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\core\frame.py", line 1804, in _getitem_column
return self._get_item_cache(key)
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\core\generic.py", line 1082, in _get_item_cache
res = cache.get(item)
TypeError: unhashable type
The problem is that using the item getter ([]) needs hashable types. When you provide it with [:] this is fine, but when you provide it with [:,:], you will get this error.
pd.DataFrame({"foo":range(1,10)})[:,:]
TypeError: unhashable type
While this works just fine:
pd.DataFrame({"foo":range(1,10)})[:]
However, you should be using .loc no matter how you want to slice.
pd.DataFrame({"foo":range(1,10)}).loc[:,:]
foo
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
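Since the asker's intent (the first 639 rows, all columns) is purely positional, .iloc is arguably the most direct fit. A small sketch with a stand-in frame and hypothetical column names:

```python
import pandas as pd

# Small stand-in frame; the real file has 639+ rows.
df = pd.DataFrame({"rate": range(10), "count": range(10, 20)})

# Positional slicing of rows and columns uses .iloc, not plain [].
subset = df.iloc[0:5, :]
print(subset.shape)
```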

Extracting Data From a Pandas object to put into JIRA

df is a pandas object containing 13 columns of data; I want to feed data from just two of those columns into JIRA by creating new issues. It is a 272x13 object. Each column represents a different field for an issue in JIRA. Every new issue created in JIRA should get info from just two columns in df: Summary and Comments.
How do I extract every value from the two columns as I go through each row in a for loop? I only want the string values from each row and column, no index, no object. My code is below:
from jira.client import JIRA
import pandas as pd
df = pd.read_csv('C:\\Python27\\scripts\\export.csv')
# Set the column names from the export.csv file equal to variables using the
# pandas python module
# Loop to create new issues
for row in df.iterrows():
    summ = str(df.loc[row.index, 'Summary'])[:30]
    comments = str(df.loc[row.index, 'Comments'])
    jira.create_issue(project={'key': 'DEL'}, summary=summ, description=comments, issuetype={'name': 'Bug'})
When I do this I get the error:
Traceback (most recent call last):
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\JIRAprocess_Delta.py", line 86, in <module>
summ = str(df.loc[row.index, 'Summary'])[:30]
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\indexing.py", line 669, in __getitem__
return self._getitem_tuple(key)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\indexing.py", line 252, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\indexing.py", line 361, in _getitem_lowerdim
section = self._getitem_axis(key, axis=i)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\indexing.py", line 758, in _getitem_axis
return self._get_label(key, axis=axis)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\indexing.py", line 60, in _get_label
return self.obj._xs(label, axis=axis, copy=True)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\frame.py", line 2281, in xs
loc = self.index.get_loc(key)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\index.py", line 755, in get_loc
return self._engine.get_loc(key)
File "index.pyx", line 130, in pandas.index.IndexEngine.get_loc (pandas\index.c:3238)
File "index.pyx", line 147, in pandas.index.IndexEngine.get_loc (pandas\index.c:3085)
File "index.pyx", line 293, in pandas.index.Int64Engine._check_type (pandas\index.c:5393)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\series.py", line 523, in __hash__
raise TypeError('unhashable type')
TypeError: unhashable type
Here is some example data that is showing up in JIRA for every single issue created in the comments field:
Issue 1:
0 NaN
1 Found that the Delta would leak packets receiv...
2 The Delta will reset everytime you disconnect ...
3 NaN
4 It's supposed to get logged when CP takes to l...
5 Upon upgrading the IDS via the BioMed menu, th...
6 Upon upgrading the IDS via the BioMed menu, th...
7 Upon upgrading the IDS via the BioMed menu, th...
8 Increased Fusion heap size and the SCC1 Initia...
9 Recheck using build 142+, after Matt delivers ...
10 When using WPA2, there is EAPOL key echange go...
11 When using WPA2, there is EAPOL key echange go...
12 NaN
13 NaN
14 NaN
...
I only want each issue to have its own string value, without the index numbers or the NaN showing up, like this:
Issue 1:
Issue 2: Found that the Delta would leak packets receiv...
Issue 3: The Delta will reset everytime you disconnect ...
...
The problem is in the use of iterrows.
From the documentation (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html), df.iterrows() iterates over DataFrame rows as (index, Series) pairs.
What you need is to replace row.index with row[0], which gives you the index of the dataframe row you are iterating over:
for row in df.iterrows():
    summ = str(df.loc[row[0], 'Summary'])[:30]
    comments = str(df.loc[row[0], 'Comments'])
By the way, I think you don't need iterrows at all:
for row_index in df.index:
    summ = str(df.loc[row_index, 'Summary'])[:30]
    comments = str(df.loc[row_index, 'Comments'])
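A related sketch (with hypothetical data): unpacking the (index, Series) pair lets you read the values straight off the row Series, with no second .loc lookup, and guard against the NaN entries that were showing up in JIRA:

```python
import pandas as pd

# Hypothetical stand-in for the export.csv data.
df = pd.DataFrame({
    "Summary": ["Delta leaks packets on reset", None],
    "Comments": ["Seen after firmware upgrade", "Recheck with build 142+"],
})

issues = []
for idx, row in df.iterrows():
    # row is a Series; pull the scalar values straight from it and
    # replace NaN/None with an empty string before converting to str.
    summ = "" if pd.isna(row["Summary"]) else str(row["Summary"])[:30]
    comments = "" if pd.isna(row["Comments"]) else str(row["Comments"])
    issues.append((summ, comments))
print(issues)
```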
