I am trying to split imported csv files (timeseries) and manipulate them using the pandas MultiIndex cross-section method .xs(). The following df replicates the structure of my imported csv file.
import pandas as pd
df = pd.DataFrame(
    {'Sensor ID': [14, 1, 3, 14, 3],
     'Building ID': [109, 109, 109, 109, 109],
     'Date/Time': ["26/10/2016 14:31:14", "26/10/2016 14:31:16", "26/10/2016 14:32:17", "26/10/2016 14:35:14", "26/10/2016 14:35:38"],
     'Reading': [20.95, 20.62, 22.45, 20.65, 22.83],
    })
df.set_index(['Sensor ID','Date/Time'], inplace=True)
df.sort_index(inplace=True)
print(df)
SensorList = [1, 3, 14]
for s in SensorList:
    df1 = df.xs(s, level='Sensor ID')
I have tested the code on a small excerpt of the csv data and it works fine. However, when running it on the entire csv file, I receive the error: ValueError: Length mismatch: Expected axis has 19562 elements, new values have 16874 elements.
Printing df.info() returns the following:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 65981 entries, (1, 2016-10-26 14:35:15) to (19, 2016-11-07 11:27:14)
Data columns (total 2 columns):
Building ID 65981 non-null int64
Reading 65981 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.5+ MB
None
Any tip on what may be causing the error?
EDIT
I inadvertently truncated my code, which left it pointless in the form posted above. The original code resamples the readings into 15-minute and 1-hour intervals.
with:
units = ['D1','D3','D6','D10']
unit_output_path = './' + unit + '/'
the loop does:
for s in SensorList:
    ## Slice multi-index to isolate all readings for sensor s
    df1 = df_mi.xs(s, level='Sensor ID')
    df1.drop('Building ID', axis=1, inplace=True)
    ## Resample to 15min and 1hr intervals and export individual csv files
    df1_15min = df1.resample('15Min').mean().round(1)
    df1_hr = df1.resample('60Min').mean().round(1)
Traceback:
File "D:\AN6478\AN6478_POE_ABo.py", line 52, in <module>
df1 = df_mi.xs(s, level='Sensor ID')
File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1736, in xs
setattr(result, result._get_axis_name(axis), new_ax)
File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\generic.py", line 2685, in __setattr__
return object.__setattr__(self, name, value)
File "pandas\src\properties.pyx", line 65, in pandas.lib.AxisProperty.__set__ (pandas\lib.c:44748)
File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\generic.py", line 428, in _set_axis
self._data.set_axis(axis, labels)
File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2635, in set_axis
(old_len, new_len))
ValueError: Length mismatch: Expected axis has 19562 elements, new values have 16874 elements
I can't tell you why exactly df1 = df_mi.xs(s, level='Sensor ID') raises the ValueError here. Where does df_mi come from?
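One thing worth checking, as a guess: whether the full csv produces duplicate or unsorted (Sensor ID, Date/Time) pairs, since .xs can misbehave on a non-unique or unsorted MultiIndex. A quick diagnostic sketch, assuming df_mi is your full MultiIndexed frame:

# quick diagnostics on the full MultiIndex before slicing
print(df_mi.index.is_unique)                 # any duplicate (Sensor ID, Date/Time) pairs?
print(df_mi.index.is_monotonic_increasing)   # is the MultiIndex sorted?
print(df_mi.index.get_level_values('Sensor ID').value_counts())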
Here is an alternative using groupby which accomplishes what you want on your dummy data frame, without relying on the MultiIndex and xs:
# reset the 'Sensor ID' level to get a pure DatetimeIndex, otherwise resample won't work
df = df.reset_index(0)
df.index = pd.to_datetime(df.index)

# group by sensor, keeping the relevant "Reading" column
grouped = df.groupby("Sensor ID")["Reading"]

# iterate over each sensor's series
for sensor, sub_df in grouped:
    quarterly = sub_df.resample('15Min').mean().round(1)
    hourly = sub_df.resample('60Min').mean().round(1)
    # implement your to_csv saving here
Note, you could also use groupby on the MultiIndex with df.groupby(level="Sensor ID"); however, since you want to resample later on, it is easier to drop Sensor ID from the MultiIndex, which simplifies things overall.
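For the saving step, here is a minimal sketch; the output directory and filename pattern are placeholders of mine, not from the question:

out_dir = './output/'  # hypothetical output directory
for sensor, sub_df in grouped:
    quarterly = sub_df.resample('15Min').mean().round(1)
    hourly = sub_df.resample('60Min').mean().round(1)
    quarterly.to_csv(out_dir + 'sensor_{}_15min.csv'.format(sensor), header=True)
    hourly.to_csv(out_dir + 'sensor_{}_1hr.csv'.format(sensor), header=True)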
Related
Using the values of the Entry column (every row after the header) in Test1.csv, I would like to write code that sorts all the rows of Test1.csv according to the order of the Entry column in Test2.csv.
I would appreciate your advice. Thank you for your cooperation.
This is a simplified version of the data (the real files have more than 1000 lines).
import pandas as pd
input_path1 = "Test1.csv"
input_path2 = "Test2.csv"
output_path = "output.csv"
df1 = pd.read_csv(filepath_or_buffer=input_path1, encoding="utf-8")
df2 = pd.read_csv(filepath_or_buffer=input_path2, encoding="utf-8")
(df1.merge(df2, how='left', on='Entry')
    .set_index('Entry')
    .drop('Number_x', axis='columns')
    .rename({'Number_y': 'Number'}, axis='columns')
    .to_csv(output_path))
Error message
Traceback (most recent call last):
File "narabekae.py", line 28, in <module>
.drop('Number_x', axis='columns')
File "/Users/macuser/downloads/yes/lib/python3.7/site-packages/pandas/core/frame.py", line 4102, in drop
errors=errors,
File "/Users/macuser/downloads/yes/lib/python3.7/site-packages/pandas/core/generic.py", line 3914, in drop
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
File "/Users/macuser/downloads/yes/lib/python3.7/site-packages/pandas/core/generic.py", line 3946, in _drop_axis
new_axis = axis.drop(labels, errors=errors)
File "/Users/macuser/downloads/yes/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 5340, in drop
raise KeyError("{} not found in axis".format(labels[mask]))
KeyError: "['Number_x'] not found in axis"
The output I want
,V1,V2,>sp,Entry,details,PepPI
1,OS=Ha,MTNKG,>sp,A4G4K7,HFQ_HERAR,7.028864399
2,OS=Sh,MAKGQ,>sp,B4TFA6,HFQ_SALHS,7.158609631
3,OS=Oi,MAQSV,>sp,Q8EQQ9,HFQ_OCEIH,9.229953074
4,OS=Bc,MAERS,>sp,A9M5C4,HFQ_BRUC2,8.154348935
5,OS=Re,MAERS,>sp,Q2K8U6,HFQ_RHIEC,8.154348935
Test1.csv
,V1,V2,>sp,Entry,details,PepPI
1,OS=Re,MAERS,>sp,Q2K8U6,HFQ_RHIEC,8.154348935
2,OS=Sh,MAKGQ,>sp,B4TFA6,HFQ_SALHS,7.158609631
3,OS=Ha,MTNKG,>sp,A4G4K7,HFQ_HERAR,7.028864399
4,OS=Bc,MAERS,>sp,A9M5C4,HFQ_BRUC2,8.154348935
5,OS=Oi,MAQSV,>sp,Q8EQQ9,HFQ_OCEIH,9.229953074
Test2.csv
pI,Molecular weight (average),Entry,Entry name,Organism
6.82,8763.13,A4G4K7,HFQ_HERAR,Rat
6.97,11119.33,B4TFA6,HFQ_SALHS,Pig
9.22,8438.69,Q8EQQ9,HFQ_OCEIH,Bacteria
7.95,8854.28,A9M5C4,HFQ_BRUC2,Mouse
7.95,9044.5,Q2K8U6,HFQ_RHIEC,Human
Additional information
macOS 10.15.4, Python 3.7.3, Atom
To reorder the columns, you just define a list of columns in the order that you want, and use df[columns]:
In [17]: columns = ["V1","V2",">sp","Entry","details","PepPI"]
In [18]: df = df1.merge(df2, how='left', on='Entry')
In [19]: df[columns]
Out[19]:
V1 V2 >sp Entry details PepPI
0 OS=Re MAERS >sp Q2K8U6 HFQ_RHIEC 8.154349
1 OS=Sh MAKGQ >sp B4TFA6 HFQ_SALHS 7.158610
2 OS=Ha MTNKG >sp A4G4K7 HFQ_HERAR 7.028864
3 OS=Bc MAERS >sp A9M5C4 HFQ_BRUC2 8.154349
4 OS=Oi MAQSV >sp Q8EQQ9 HFQ_OCEIH 9.229953
Naturally, you can save it normally with the to_csv() method:
df[columns].to_csv(output_path)
Notes
The errors are not reproducible with the data given, since there are no Number columns in the dataframes df1 and df2.
You should not set_index("Entry") if you want Entry saved in the middle of the .csv (in "The output I want" above there is a simple integer index, with Entry as an ordinary column).
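Note also that a left merge keeps df1's row order. If you additionally want the rows to follow the order of Test2's Entry column, as in the desired output, one sketch is to merge starting from df2 instead:

# merge starting from Test2 so the result follows Test2's Entry order
ordered = df2[['Entry']].merge(df1, how='left', on='Entry')
ordered[columns].to_csv(output_path)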
I want to compute a new column from an existing column.
df = dd.from_pandas(ddf, npartitions=100)
df['new_column'] = df[['column']].apply(lambda dpan_india_df: dpan_india_df['column']*8000, axis = 1, meta=('object'))
How can I use memory efficiently?
For reference, the file is about 800 MB.
File "Sectorize3.py", line 55, in <lambda>
df['new_column'] = df[['column']].apply(lambda ddf: ddf['column']*8000, axis = 1, meta=('object'))
MemoryError: occurred at index 1512070
You can do it with a plain vectorised column operation instead of a row-wise apply:
df['new_column'] = df['column']*8000
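On the memory question, a minimal sketch, assuming the data lives in a csv file (the filename below is a placeholder): let dask read the file in partitions instead of building the full 800 MB pandas frame first, and keep the computation vectorised.

import dask.dataframe as dd

# read the csv lazily, in partitions, instead of loading it all into pandas first
df = dd.read_csv('data.csv')  # placeholder filename

# vectorised column arithmetic; no row-wise apply, no meta needed
df['new_column'] = df['column'] * 8000

# writes one output file per partition
df.to_csv('out-*.csv')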
I have a series of dataframes which I am exporting to Excel within the same file. A number of them appear to be stored as lists of dictionaries due to the way they have been constructed. I converted them using .from_dict, but when I use df.to_excel an error is raised.
An example of one of the df's which is raising the error is shown below. My code:
excel_writer = pd.ExcelWriter('My_DFs.xlsx')
df_Done_Major = df[
    (df['currency_str'].str.contains('INR|ZAR|NOK|HUF|MXN|PLN|SEK|TRY')==False) &
    (df['state'].str.contains('Done'))
][['Year_Month','state','currency_str','cust_cdr_display_name','rbc_security_type1','rfq_qty','rfq_qty_CAD_Equiv']].copy()
# Trades per bucket
df_Done_Major['Bucket'] = pd.cut(df_Done['rfq_qty'], bins=bins, labels=labels)
# Populate empty buckets with 0 so HK, SY and TK data can be pasted adjacently
df_Done_Major_Fill_Empty_Bucket = df_Done_Major.groupby(['Year_Month','Bucket'], as_index=False)['Bucket'].size()
mux = pd.MultiIndex.from_product([df_Done_Major_Fill_Empty_Bucket.index.levels[0], df_Done_Major['Bucket'].cat.categories])
df_Done_Major_Fill_Empty_Bucket = df_Done_Major_Fill_Empty_Bucket.reindex(mux, fill_value=0)
dfTemp = df_Done_Major_Fill_Empty_Bucket
display(dfTemp)
dfTemp = pd.DataFrame.from_dict(dfTemp)
display(dfTemp)
# Export
dfTemp.to_excel(excel_writer, sheet_name='Sheet1', startrow=0, startcol=21, na_rep=0, header=True, index=True, merge_cells= True)
2018-05  0K          0
         10K         2
         20K         4
         40K        10
         60K         3
         80K         1
         100K       14
         > 100K    273
dtype: int64
TypeError: Unsupported type <class 'pandas._libs.period.Period'> in write()
Even though I have converted it to a df, is there additional conversion required?
Update: I can get the data into Excel using the following, but the format of the dataframe is lost, which means significant Excel VBA to resolve.
list = [{"Data": dfTemp}, ]
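The TypeError suggests the Year_Month level of the index holds pandas Period values, which the Excel writers cannot serialize. A sketch of one possible workaround, converting that index level to strings before exporting (assuming dfTemp still carries the (Year_Month, Bucket) MultiIndex):

# convert the Period level of the MultiIndex to strings so to_excel can write it
dfTemp.index = dfTemp.index.set_levels(dfTemp.index.levels[0].astype(str), level=0)
dfTemp.to_excel(excel_writer, sheet_name='Sheet1', startrow=0, startcol=21,
                na_rep=0, header=True, index=True, merge_cells=True)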
I have the following data:
Example:
DRIVER_ID;TIMESTAMP;POSITION
156;2014-02-01 00:00:00.739166+01;POINT(41.8836718276551 12.4877775603346)
I want to create a pandas dataframe with 4 columns that are the id, time, longitude, latitude.
So far, I got:
cur_cab = pd.DataFrame.from_csv(
    path,
    sep=";",
    header=None,
    parse_dates=[1]).reset_index()
cur_cab.columns = ['cab_id', 'datetime', 'point']
path specifies the .txt file containing the data.
I already wrote a function that returns the longitude and latitude values from the POINT-formatted string.
How do I expand the data frame with the additional columns and the split values?
After loading, if you're using a recent version of pandas then you can use the vectorised str methods to parse the column:
In [87]:
split = df['point'].str[6:-1].str.split(expand=True)
df['pos_x'] = split[0]
df['pos_y'] = split[1]
df
Out[87]:
   cab_id                    datetime                                      point             pos_x             pos_y
0     156  2014-01-31 23:00:00.739166  POINT(41.8836718276551 12.4877775603346)  41.8836718276551  12.4877775603346
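Note that the split pieces are still strings; if you need numeric longitude/latitude values, convert them afterwards:

# the parsed coordinates are strings; cast to float for numeric work
df[['pos_x', 'pos_y']] = df[['pos_x', 'pos_y']].astype(float)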
Also, you should stop using from_csv since it's no longer updated; use the top-level read_csv, so your loading code would be:
cur_cab = pd.read_csv(
    path,
    sep=";",
    header=None,
    parse_dates=[1],
    names=['cab_id', 'datetime', 'point'],
    skiprows=1)
I'm trying to properly import data from a space delimited file into a pandas dataframe so that I can plot it properly. My data file looks like so:
Vmeas -5.00E+000 -4.50E+000 -4.00E+000 -3.50E+000 ...
vfd3051 -3.20E-008 -1.49E-009 1.38E-008 -1.17E-008 ...
vfd3151 -3.71E-008 -6.58E-009 -6.58E-009 -6.58E-009 ...
vfd3251 -4.73E-008 3.59E-009 8.68E-009 -1.68E-008 ...
vfd3351 -2.18E-008 -3.71E-008 3.60E-009 -3.20E-008 ...
So the test locations are originally in the rows, with the columns increasing in voltage to the right, up to 20 V.
My code to read the data file into the dataframe is:
if __name__ == '__main__':
    file_path = str(input("Enter the filename to open: "))
    save = str(input('Do you wish to save a pdf of the IV plots? (y/n): '))
    df = pd.read_csv(file_path, index_col="Vmeas", delim_whitespace=True, header=0)
    df = df.T
    df.reset_index(inplace=True)
    df.index.names = ['Voltage']
    df.columns.names = ['Die_numbers']
    df.drop('index', axis=1, inplace=True)
    make_plots(df, save)
The actual plotting is done by:
def make_plots(df, save):
    voltage = np.arange(-5, 20, 0.5)
    plt.figure(figsize=(10, 7))
    for col in df:
        plt.plot(voltage, col, legend=False)
    plt.show()
At first, I encountered problems with the voltage being treated by pandas as a string, and pandas doesn't play nicely with float indexes. That initially started my plot of a diode current-voltage relationship at 0 (http://i.imgur.com/wgIZCyq.jpg). Then I re-indexed it, but plotting still didn't work. Now I've re-indexed the dataframe and dropped the old index column, and when I check df.head() everything looks right:
Die_numbers vfd3051 vfd3151 vfd3251 vfd3351
Voltage
0 -3.202241e-08 -3.711351e-08 -4.728576e-08 -2.184733e-08
1 -1.493095e-09 -6.580329e-09 3.594383e-09 -3.710431e-08
2 1.377107e-08 -6.581644e-09 8.683344e-09 3.595368e-09
except now I keep running into a ValueError in mpl. I think this is related to the col values being strings instead of floats, which I don't understand because it was printing the currents properly before.
Admittedly, I'm new to pandas, but it seems like at every turn I am stopped, by my ignorance no doubt, and it's getting tiresome. Is there a better way to do this? Perhaps I should just ignore the first row of the logfile? Can I convert from scientific notation while reading the file in? Keep plugging away?
Thanks.
df.info() is:
Int64Index: 51 entries, 0 to 50
Columns: 1092 entries, vfd3051 to vfd6824
dtypes: float64(1092)
Everything seems to load into pandas correctly, but mpl doesn't like something in the data. The columns are floats, and I'm not using the index of integers. If the column names were being added as my first row, the columns would be treated as str or object type. The error is:
Traceback (most recent call last):
File "D:\Python\el_plot_top_10\IV_plot_all.py", line 51, in <module>
make_plots(df, save)
File "D:\Python\el_plot_top_10\IV_plot_all.py", line 21, in make_plots
plt.plot(voltage, col, legend=False)
File "C:\Anaconda3\lib\site-packages\matplotlib\pyplot.py", line 2987, in plot
ret = ax.plot(*args, **kwargs)
File "C:\Anaconda3\lib\site-packages\matplotlib\axes.py", line 4139, in plot
for line in self._get_lines(*args, **kwargs):
File "C:\Anaconda3\lib\site-packages\matplotlib\axes.py", line 319, in _grab_next_args
for seg in self._plot_args(remaining, kwargs):
File "C:\Anaconda3\lib\site-packages\matplotlib\axes.py", line 278, in _plot_args
linestyle, marker, color = _process_plot_format(tup[-1])
File "C:\Anaconda3\lib\site-packages\matplotlib\axes.py", line 131, in _process_plot_format
'Unrecognized character %c in format string' % c)
ValueError: Unrecognized character f in format string
I figured out how to make this work entirely in pandas. Don't specify an index or a header row. Transpose the dataframe and drop the index. Then create a list out of the first row of data, which will be the string titles for the columns you really wanted. Assign the column names to this list and then reassign the dataframe to a sliced dataframe, eliminating the first row of string names ('vfd3021' in my case).
After that, you're good to go. The columns are float, and since my voltage range is fixed, I just create a list with arange when I plot.
if __name__ == '__main__':
    file_path = str(input("Enter the filename to open: "))
    save = str(input('Do you wish to save a pdf of the IV plots? (y/n): '))
    df = pd.read_csv(file_path, delim_whitespace=True)
    df = df.T
    df.reset_index(inplace=True)
    df.index.names = ['Voltage']
    df.columns.names = ['Die_numbers']
    df.drop('index', axis=1, inplace=True)
    names = df.iloc[0].values
    df.columns = names
    df = df[1:]
    make_plots(df, save)
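It is also worth noting where the original ValueError came from: for col in df yields column names, so plt.plot(voltage, col, ...) passes a string like 'vfd3051' where matplotlib expects data or a format string, hence "Unrecognized character f". A sketch of a corrected plotting loop (the 20.5 endpoint is my assumption, chosen to give 51 points matching the 51 rows in df.info()):

import numpy as np
import matplotlib.pyplot as plt

def make_plots(df, save):
    voltage = np.arange(-5, 20.5, 0.5)  # 51 points, matching the 51 rows
    plt.figure(figsize=(10, 7))
    for col in df:
        plt.plot(voltage, df[col].values)  # plot the column's values, not its name
    plt.show()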
As far as I can see, all your problems come from not getting your data into the correct format to begin with. Just focus on importing the data, and print what you're going to plot, checking that the types are what you would expect them to be.
I would advise using a different method to import the data, as the file format is not what pandas works best with (e.g. it is transposed). For example, you could use numpy.genfromtxt; an introduction is given here.
import numpy as np
from io import StringIO

data_file = (
"""Vmeas -5.00E+000 -4.50E+000 -4.00E+000 -3.50E+000
vfd3051 -3.20E-008 -1.49E-009 1.38E-008 -1.17E-008
vfd3151 -3.71E-008 -6.58E-009 -6.58E-009 -6.58E-009
vfd3251 -4.73E-008 3.59E-009 8.68E-009 -1.68E-008
vfd3351 -2.18E-008 -3.71E-008 3.60E-009 -3.20E-008
""")
data = np.genfromtxt(StringIO(data_file), dtype=None, encoding='utf-8')
print(data)
>>> array([('Vmeas', -5.0, -4.5, -4.0, -3.5),
('vfd3051', -3.2e-08, -1.49e-09, 1.38e-08, -1.17e-08),
('vfd3151', -3.71e-08, -6.58e-09, -6.58e-09, -6.58e-09),
('vfd3251', -4.73e-08, 3.59e-09, 8.68e-09, -1.68e-08),
('vfd3351', -2.18e-08, -3.71e-08, 3.6e-09, -3.2e-08)],
dtype=[('f0', '<U7'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8')])
So now we have a numpy structured array with the row name as the first field and all the data as the remaining fields of each tuple. Most importantly, all the numbers are numbers; try to avoid having strings, because conversions are messy.
Then we could do the following to get a nice pandas DataFrame:
DataDictionary = {row[0]:list(row)[1:] for row in iter(data)}
pd.DataFrame(DataDictionary)
Firstly we create a dictionary of the data using a Python dictionary comprehension, then pass it into the DataFrame. This results in a well-behaved dataframe with columns named by the strings "Vmeas" and "vfd*", and an index of all the data.
   Vmeas       vfd3051       vfd3151       vfd3251       vfd3351
0   -5.0 -3.200000e-08 -3.710000e-08 -4.730000e-08 -2.180000e-08
1   -4.5 -1.490000e-09 -6.580000e-09  3.590000e-09 -3.710000e-08
2   -4.0  1.380000e-08 -6.580000e-09  8.680000e-09  3.600000e-09
3   -3.5 -1.170000e-08 -6.580000e-09 -1.680000e-08 -3.200000e-08
I doubt this will fully answer your question, but it is a start to getting the data correct before plotting, which I think was your problem. Try to keep it as simple as possible!
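From there, a minimal plotting sketch under the same assumptions, setting Vmeas as the index so each die column plots against voltage:

import pandas as pd
import matplotlib.pyplot as plt

frame = pd.DataFrame(DataDictionary).set_index('Vmeas')
frame.plot(legend=False)  # one line per die, voltage on the x-axis
plt.xlabel('Voltage (V)')
plt.ylabel('Current (A)')
plt.show()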