Memory error in a multi-column calculation on large data (Python)

I want to compute a new column from an existing column.
df = dd.from_pandas(ddf, npartitions=100)
df['new_column'] = df[['column']].apply(lambda dpan_india_df: dpan_india_df['column']*8000, axis = 1, meta=('object'))
How can I do this in a memory-efficient way?
For reference, the file is about 800 MB.
File "Sectorize3.py", line 55, in <lambda>
df['new_column'] = df[['column']].apply(lambda ddf: ddf['column']*8000, axis = 1, meta=('object'))
MemoryError: occurred at index 1512070

You don't need apply for this; multiply the column directly:
df['new_column'] = df['column'] * 8000
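A row-wise apply with axis=1 calls the lambda once per row, which is usually much slower and heavier on memory than one column-wise multiplication. Here is a minimal sketch using plain pandas and a made-up five-row frame (the frame name pdf and its values are assumptions for illustration; the same expression should also work on a dask DataFrame):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: one numeric column.
pdf = pd.DataFrame({"column": np.arange(5, dtype="float64")})

# Vectorized: the whole column is multiplied in one operation,
# with no Python-level function call per row.
pdf["new_column"] = pdf["column"] * 8000

print(pdf)
```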

Related

How can I make a for loop to populate a DataFrame?

First of all, thanks to everyone who tries to help.
I have started learning Python and came across an opportunity to use it to my advantage at work.
I basically made a script that reads a Google Sheets file, imports it into pandas, and cleans up the data.
In the end, I just want the agents' names as columns, with all of their values from the resolucao column below them, so I can take the average time for each agent, but I'm struggling to do it with a list comprehension / for loop.
This is what the DataFrame looks like after I cleaned it up (posted as an image):
And this is the code that I tried to run:
PS: Sorry for the messy code.
agentes_unique = list(df['Agente'].unique())
agentes_duplicated = df['Agente']
value_resolucao_duplicated = df['resolucao']
n_of_rows = []
for row in range(len(df)):
    n_of_rows.append(row)
i = 0
while n_of_rows[i] < len(n_of_rows):
    df2 = pd.DataFrame({agentes_unique[i]: (value for value in df['resolucao'][i] if df['Agente'][i] == agentes_unique[i])})
    i += 1
df2.to_excel('teste.xlsx', index=True, header=True)
But in the end I got this error:
Traceback (most recent call last):
  File "C:\Users\FELIPE\Desktop\Python\webscraping\.venv\lib\site-packages\pandas\core\indexes\range.py", line 385, in get_loc
    return self._range.index(new_key)
ValueError: 0 is not in range
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "c:\Users\FELIPE\Desktop\Python\webscraping\bot_csv_extract\bot_login\main.py", line 50, in <module>
    df2 = pd.DataFrame({'Agente': (valor for valor in df['resolucao'][i] if df['Agente'][i] == 'Gabriel')})
  File "C:\Users\FELIPE\Desktop\Python\webscraping\.venv\lib\site-packages\pandas\core\series.py", line 958, in __getitem__
    return self._get_value(key)
  File "C:\Users\FELIPE\Desktop\Python\webscraping\.venv\lib\site-packages\pandas\core\series.py", line 1069, in _get_value
    loc = self.index.get_loc(label)
  File "C:\Users\FELIPE\Desktop\Python\webscraping\.venv\lib\site-packages\pandas\core\indexes\range.py", line 387, in get_loc
    raise KeyError(key) from err
KeyError: 0
I feel like I'm making some obvious mistake, but I can't fix it.
Again, thanks to anyone who tries to help.
Are you looking to do something like this? This is just sample data, but it should be a good starting point if I understand what you want to do.
import pandas as pd
data = {
    'Column1': ['Data', 'Another_Data', 'More_Data', 'Last_Data'],
    'Agente': ['Name', 'Another_Name', 'More_Names', 'Last_Name'],
    'Data': [1, 2, 3, 4]
}
df = pd.DataFrame(data)
# one column per agent, one row per Column1 value
df = df.pivot(index='Column1', columns='Agente', values='Data')
df = df.reset_index()
Using for loops over pandas DataFrames is not recommended: it tends to be messy and inefficient.
With some practice you will be able to approach problems like this one without for loops.
From what I understand, your goal can be achieved in 3 simple steps:
1. Select the 2 columns of interest. I recommend taking a look at how to access different elements of a DataFrame:
df = df[["Agente", "resolucao"]]
2. Convert the column you want to average to a numeric value, say seconds:
df["resolucao"] = pd.to_timedelta(df['resolucao'].astype(str)).dt.total_seconds()
3. Apply an average aggregation via the groupby() function:
df = df.groupby(["Agente"]).mean().reset_index()
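The three steps above can be sketched end to end as follows. The agent names and durations are made up, and the sketch assumes resolucao holds strings in HH:MM:SS form, which the question does not confirm:

```python
import pandas as pd

# Hypothetical cleaned-up data: one row per ticket.
df = pd.DataFrame({
    "Agente": ["Ana", "Bruno", "Ana", "Bruno"],
    "resolucao": ["00:10:00", "00:20:00", "00:30:00", "01:00:00"],
    "other": [1, 2, 3, 4],
})

# 1. Keep only the two columns of interest (copy to avoid modifying a view).
subset = df[["Agente", "resolucao"]].copy()

# 2. Convert the durations to seconds so they can be averaged.
subset["resolucao"] = pd.to_timedelta(subset["resolucao"]).dt.total_seconds()

# 3. Average per agent.
mean_seconds = subset.groupby("Agente")["resolucao"].mean().reset_index()
print(mean_seconds)
```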
Hope this helps.
Next time, I also recommend posting the data as text rather than as an image, so that others can reproduce your code.
Cheers and keep it up!
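As for the KeyError: 0 itself, the traceback suggests (this is an inference, not something the question confirms) that the cleaned DataFrame no longer has a row labeled 0. Indexing a Series with an integer index via [i] looks up a label, not a position. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical data: the first row stands in for a row removed during cleanup.
df = pd.DataFrame({"Agente": ["header", "Ana", "Bruno"],
                   "resolucao": ["n/a", "00:10:00", "00:20:00"]})
cleaned = df.iloc[1:]            # the index is now 1, 2: label 0 no longer exists

try:
    cleaned["resolucao"][0]      # label-based lookup on an integer index
    lookup_failed = False
except KeyError:
    lookup_failed = True

first_by_position = cleaned["resolucao"].iloc[0]   # position-based lookup
renumbered = cleaned.reset_index(drop=True)        # index restarts at 0
```

Either .iloc or a reset_index(drop=True) after cleaning would avoid the error, although the groupby approach makes the loop unnecessary in the first place.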

Slicing DAT file by Fixed Width Stored in Dict

I have been trying this for a long time and still couldn't find a solution on my own. I have a DAT file whose lines look like this:
abc900800007.2
And I have a dict whose keys are the column names and whose values are the corresponding fixed widths in the DAT file: mydict = {'col1': 3, 'col2': 8, 'col3': 3}.
What I want to do is create a df by combining the two, that is, slicing each line of the DAT file according to the dict values. The df should look like:
col1 col2     col3
abc  90080000 7.2
Any help would be highly appreciated!
I think a possible solution (though memory-intensive, depending on the file size) is:
import pandas as pd
data = {'col1': [], 'col2': [], 'col3': []}
with open('file.dat') as f:
    for line in f:
        line = line.rstrip('\n')  # drop the trailing newline
        data['col1'].append(line[:mydict['col1']])
        begin = mydict['col1']
        end = begin + mydict['col2']
        data['col2'].append(line[begin:end])
        begin = end
        end = begin + mydict['col3']
        data['col3'].append(line[begin:end])
df = pd.DataFrame(data)  # create the DataFrame
del data  # delete the auxiliary data
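As an alternative to slicing by hand, pandas ships a fixed-width reader, pd.read_fwf, which accepts the widths directly. This is a different technique from the loop above, sketched here on an in-memory sample that stands in for the real file:

```python
import io
import pandas as pd

mydict = {"col1": 3, "col2": 8, "col3": 3}

# In-memory stand-in for open('file.dat'); replace with the real path.
sample = io.StringIO("abc900800007.2\n")

# dicts preserve insertion order (Python 3.7+), so keys and values line up.
df = pd.read_fwf(sample,
                 widths=list(mydict.values()),
                 names=list(mydict.keys()),
                 header=None)
print(df)
```

Note that read_fwf infers column types, so col2 and col3 come back as numbers here; passing dtype=str should keep the raw text if that matters (for example, to preserve leading zeros).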

Length mismatch error when scaling up multiIndex slicer on large dataset

I am trying to split imported csv files (time series) and manipulate them using the pandas MultiIndex cross-section method .xs(). The following df replicates the structure of my imported csv file.
import pandas as pd
df = pd.DataFrame(
    {'Sensor ID': [14,1,3,14,3],
     'Building ID': [109,109,109,109,109],
     'Date/Time': ["26/10/2016 14:31:14","26/10/2016 14:31:16", "26/10/2016 14:32:17", "26/10/2016 14:35:14", "26/10/2016 14:35:38"],
     'Reading': [20.95, 20.62, 22.45, 20.65, 22.83],
    })
df.set_index(['Sensor ID','Date/Time'], inplace=True)
df.sort_index(inplace=True)
print(df)
SensorList = [1, 3, 14]
for s in SensorList:
    df1 = df.xs(s, level='Sensor ID')
I have tested the code on a small excerpt of csv data and it works fine. However, when implementing with the entire csv file, I receive the error: ValueError: Length mismatch: Expected axis has 19562 elements, new values have 16874 elements.
Printing df.info() returns the following:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 65981 entries, (1, 2016-10-26 14:35:15) to (19, 2016-11-07 11:27:14)
Data columns (total 2 columns):
Building ID 65981 non-null int64
Reading 65981 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.5+ MB
None
Any tip on what may be causing the error?
EDIT
I inadvertently truncated my code, leaving it pointless in its current form. The original code resamples values into 15-minute and 1-hour intervals.
with:
units = ['D1','D3','D6','D10']
unit_output_path = './' + unit + '/'
the loop does:
for s in SensorList:
    ## Slice multi-index to isolate all readings for sensor s
    df1 = df_mi.xs(s, level='Sensor ID')
    df1.drop('Building ID', axis=1, inplace=True)
    ## Resample by 15min and 1hr intervals and exports individual csv files
    df1_15min = df1.resample('15Min').mean().round(1)
    df1_hr = df1.resample('60Min').mean().round(1)
Traceback:
  File "D:\AN6478\AN6478_POE_ABo.py", line 52, in <module>
    df1 = df_mi.xs(s, level='Sensor ID')
  File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1736, in xs
    setattr(result, result._get_axis_name(axis), new_ax)
  File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\generic.py", line 2685, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas\src\properties.pyx", line 65, in pandas.lib.AxisProperty.__set__ (pandas\lib.c:44748)
  File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\generic.py", line 428, in _set_axis
    self._data.set_axis(axis, labels)
  File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2635, in set_axis
    (old_len, new_len))
ValueError: Length mismatch: Expected axis has 19562 elements, new values have 16874 elements
I can't tell you why exactly df1 = df_mi.xs(s, level='Sensor ID') raises the ValueError here. Where does df_mi come from?
Here is an alternative using groupby which accomplishes what you want on your given dummy data frame without relying on the MultiIndex and xs:
# move "Sensor ID" out of the index so a DatetimeIndex remains, otherwise resample won't work
df = df.reset_index(0)
df.index = pd.to_datetime(df.index, dayfirst=True)  # the sample dates are day-first
# group the relevant "Reading" column by sensor
grouped = df.groupby("Sensor ID")["Reading"]
# iterate over each sensor's readings
for sensor, sub_df in grouped:
    quarterly = sub_df.resample('15min').mean().round(1)
    hourly = sub_df.resample('60min').mean().round(1)
    # implement your to_csv saving here
Note that you could also group on the MultiIndex directly with df.groupby(level="Sensor ID"). However, since you want to resample later on, it is easier to drop Sensor ID from the MultiIndex first, which simplifies things overall.
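For completeness, the groupby(level="Sensor ID") variant mentioned in the note might look like the sketch below. It rebuilds the dummy frame from the question; df_mi here is constructed locally and is not the asker's real frame, and the Building ID column is left out for brevity:

```python
import pandas as pd

df = pd.DataFrame({
    "Sensor ID": [14, 1, 3, 14, 3],
    "Date/Time": pd.to_datetime(
        ["26/10/2016 14:31:14", "26/10/2016 14:31:16", "26/10/2016 14:32:17",
         "26/10/2016 14:35:14", "26/10/2016 14:35:38"],
        format="%d/%m/%Y %H:%M:%S"),
    "Reading": [20.95, 20.62, 22.45, 20.65, 22.83],
})
df_mi = df.set_index(["Sensor ID", "Date/Time"]).sort_index()

hourly = {}
for sensor, sub_df in df_mi.groupby(level="Sensor ID"):
    # Drop the sensor level so that only the DatetimeIndex remains,
    # which is what resample needs.
    readings = sub_df["Reading"].droplevel("Sensor ID")
    hourly[sensor] = readings.resample("60min").mean().round(1)
```

The extra droplevel step is the reason the answer prefers moving Sensor ID out of the index up front.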

Get Pandas DataFrame first column

This question is odd, since I know HOW to do something, but I don't know WHY I can't do it another way.
Suppose a simple data frame:
import pandas as pd
a = pd.DataFrame([[0,1], [2,3]])
I can slice this data frame very easily: the first column is a[[0]], the second is a[[1]]. Simple, isn't it?
Now, let's take a more complex data frame. This is part of my code:
var_vec = [i for i in range(100)]
num_of_sites = 100
row_names = ["_".join(["loc", str(i)]) for i in
             range(1,num_of_sites + 1)]
frame = pd.DataFrame(var_vec, columns = ["Variable"], index = row_names)
spec_ab = [i**3 for i in range(100)]
frame[1] = spec_ab
frame is also a pandas DataFrame, just like a. I can get the second column very easily as frame[[1]]. But when I try frame[[0]] I get an error:
Traceback (most recent call last):
File "<ipython-input-55-0c56ffb47d0d>", line 1, in <module>
frame[[0]]
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\core\frame.py", line 1991, in __getitem__
return self._getitem_array(key)
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\core\frame.py", line 2035, in _getitem_array
indexer = self.ix._convert_to_indexer(key, axis=1)
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\core\indexing.py", line 1184, in _convert_to_indexer
indexer = labels._convert_list_indexer(objarr, kind=self.name)
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\indexes\base.py", line 1112, in _convert_list_indexer
return maybe_convert_indices(indexer, len(self))
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\core\indexing.py", line 1856, in maybe_convert_indices
raise IndexError("indices are out-of-bounds")
IndexError: indices are out-of-bounds
I can still use frame.iloc[:,0], but I don't understand why I can't use simple slicing with [[]]. I use WinPython with Spyder 3, if that helps.
Using your code:
import pandas as pd
var_vec = [i for i in range(100)]
num_of_sites = 100
row_names = ["_".join(["loc", str(i)]) for i in
             range(1,num_of_sites + 1)]
frame = pd.DataFrame(var_vec, columns = ["Variable"], index = row_names)
spec_ab = [i**3 for i in range(100)]
frame[1] = spec_ab
If you print frame, you get:
       Variable    1
loc_1         0    0
loc_2         1    1
loc_3         2    8
loc_4         3   27
loc_5         4   64
loc_6         5  125
......
So the cause of your problem becomes clear: you have no column labeled 0.
First you define a list called var_vec.
Then you build a DataFrame from that list, specifying both the index values and the column name (which is usually good practice).
Numeric column labels such as 0, 1, ... as in your first example only appear when you don't specify column names; they are labels, not a positional indexer.
If you want to access columns by their position, you can use:
frame[frame.columns[0]]
What happens then is that you get the list of column labels of the frame, pick the one at position 0, and pass it to the frame as the label.
Hope that helps you understand.
Edit:
Another (better) way would be:
frame.iloc[:,0]
where ":" stands for all rows (which are also indexed by position, from 0 up to the number of rows minus 1).
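The difference between labels and positions can be checked directly. The sketch below uses a shortened three-row version of frame (an assumption made to keep the output small):

```python
import pandas as pd

a = pd.DataFrame([[0, 1], [2, 3]])
frame = pd.DataFrame({"Variable": [0, 1, 2]},
                     index=["loc_1", "loc_2", "loc_3"])
frame[1] = [0, 1, 8]

print(list(a.columns))      # [0, 1]: default integer labels
print(list(frame.columns))  # ['Variable', 1]: there is no label 0

has_label_zero = 0 in frame.columns

# Both of these select the first column by position.
first_by_position = frame.iloc[:, 0]
first_by_label = frame[frame.columns[0]]
```

So a[[0]] works only because 0 happens to be a label in a, while frame[[0]] fails because frame's labels are 'Variable' and 1.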

Pandas dataframe float index and transpose error

I'm trying to properly import data from a space delimited file into a pandas dataframe so that I can plot it properly. My data file looks like so:
Vmeas -5.00E+000 -4.50E+000 -4.00E+000 -3.50E+000 ...
vfd3051 -3.20E-008 -1.49E-009 1.38E-008 -1.17E-008 ...
vfd3151 -3.71E-008 -6.58E-009 -6.58E-009 -6.58E-009 ...
vfd3251 -4.73E-008 3.59E-009 8.68E-009 -1.68E-008 ...
vfd3351 -2.18E-008 -3.71E-008 3.60E-009 -3.20E-008 ...
So the test location is originally in the rows with the columns increasing in voltage to the right to 20V.
My code to read the data file into the dataframe is:
if __name__ == '__main__':
    file_path = str(input("Enter the filename to open: "))
    save = str(input('Do you wish to save a pdf of the IV plots? (y/n): '))
    df = pd.read_csv(file_path, index_col="Vmeas", delim_whitespace=True, header=0)
    df = df.T
    df.reset_index(inplace=True)
    df.index.names = ['Voltage']
    df.columns.names = ['Die_numbers']
    df.drop('index',axis=1, inplace=True)
    make_plots(df, save)
The actual plotting is done by:
def make_plots(df, save):
    voltage = np.arange(-5, 20, 0.5)
    plt.figure(figsize=(10, 7))
    for col in df:
        plt.plot(voltage, col, legend=False)
    plt.show()
At first, I ran into problems with the voltage being treated by pandas as a string, and pandas doesn't play nicely with float indexes. My initial attempt started my plot of a diode current-voltage relationship at 0 (http://i.imgur.com/wgIZCyq.jpg). Then I re-indexed it, but plotting still didn't work. Now I've re-indexed the dataframe and dropped the old index column, and when I check df.head() everything looks right:
Die_numbers vfd3051 vfd3151 vfd3251 vfd3351
Voltage
0 -3.202241e-08 -3.711351e-08 -4.728576e-08 -2.184733e-08
1 -1.493095e-09 -6.580329e-09 3.594383e-09 -3.710431e-08
2 1.377107e-08 -6.581644e-09 8.683344e-09 3.595368e-09
except now I keep running into a ValueError in mpl. I think this is related to the col values being strings instead of floats which I don't understand because it was printing the currents properly before.
Admittedly, I'm new to pandas but it seems like at every turn I am stopped, by my ignorance no doubt, but it's getting tiresome. Is there a better way to do this? Perhaps I should just ignore the first row of the logfile? Can I convert from scientific notation while reading the file in? Keep plugging away?
Thanks.
df.info() is:
Int64Index: 51 entries, 0 to 50
Columns: 1092 entries, vfd3051 to vfd6824
dtypes: float64(1092)
Everything seems to load into pandas correctly but mpl doesn't like something in the data. The columns are floats, I'm not using the index of integers. If the column names were being added as my first row, the columns would be treated as str or obj type. The error is:
Traceback (most recent call last):
  File "D:\Python\el_plot_top_10\IV_plot_all.py", line 51, in <module>
    make_plots(df, save)
  File "D:\Python\el_plot_top_10\IV_plot_all.py", line 21, in make_plots
    plt.plot(voltage, col, legend=False)
  File "C:\Anaconda3\lib\site-packages\matplotlib\pyplot.py", line 2987, in plot
    ret = ax.plot(*args, **kwargs)
  File "C:\Anaconda3\lib\site-packages\matplotlib\axes.py", line 4139, in plot
    for line in self._get_lines(*args, **kwargs):
  File "C:\Anaconda3\lib\site-packages\matplotlib\axes.py", line 319, in _grab_next_args
    for seg in self._plot_args(remaining, kwargs):
  File "C:\Anaconda3\lib\site-packages\matplotlib\axes.py", line 278, in _plot_args
    linestyle, marker, color = _process_plot_format(tup[-1])
  File "C:\Anaconda3\lib\site-packages\matplotlib\axes.py", line 131, in _process_plot_format
    'Unrecognized character %c in format string' % c)
ValueError: Unrecognized character f in format string
I figured out how to make this work entirely in pandas. Don't indicate an index nor a header row. Transpose the dataframe and drop the index. Then, create a list out of the first row of data which will be your string titles for the columns you really wanted. Assign the column names to this list and then reassign the dataframe to a sliced dataframe eliminating the first row of string names ('vfd3021' in my case).
After that, you're good to go. The columns are float and since my voltage range is fixed, I just create a list with arange when I plot.
if __name__ == '__main__':
    file_path = str(input("Enter the filename to open: "))
    save = str(input('Do you wish to save a pdf of the IV plots? (y/n): '))
    df = pd.read_csv(file_path, delim_whitespace=True)
    df = df.T
    df.reset_index(inplace=True)
    df.index.names = ['Voltage']
    df.columns.names = ['Die_numbers']
    df.drop('index', axis=1, inplace=True)
    names = df.iloc[0].values
    df.columns = names
    df = df[1:]
    make_plots(df, save)
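A shorter route that may also work, offered as a sketch rather than what was actually run above: read the first column as the index, transpose, and convert the resulting index (the voltage headers) to floats. The inline sample stands in for the real file, and sep=r"\s+" is the regex spelling of delim_whitespace=True:

```python
import io
import pandas as pd

# First four voltage columns of the sample shown in the question.
raw = """Vmeas -5.00E+000 -4.50E+000 -4.00E+000 -3.50E+000
vfd3051 -3.20E-008 -1.49E-009 1.38E-008 -1.17E-008
vfd3151 -3.71E-008 -6.58E-009 -6.58E-009 -6.58E-009
"""

# Rows are dies, columns are voltages; transpose so each die is a column.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", index_col=0).T

# The former header strings become a float index of voltages.
df.index = df.index.astype(float)
df.index.name = "Voltage"
print(df.dtypes)
```

One caveat: read_csv renames duplicate header values, so this assumes every voltage appears only once in the header row.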
As far as I can see, all your problems come from not getting your data into the
correct format to begin with. Just focus on importing the data, and print what you're going to plot,
checking that the types are what you would expect them to be.
I would advise using a different method to import the data, as the file format is not what pandas
works best with (e.g. it is transposed). For example, you could use numpy.genfromtxt; an introduction is given here.
import numpy as np
from StringIO import StringIO
data_file = (
"""Vmeas -5.00E+000 -4.50E+000 -4.00E+000 -3.50E+000
vfd3051 -3.20E-008 -1.49E-009 1.38E-008 -1.17E-008
vfd3151 -3.71E-008 -6.58E-009 -6.58E-009 -6.58E-009
vfd3251 -4.73E-008 3.59E-009 8.68E-009 -1.68E-008
vfd3351 -2.18E-008 -3.71E-008 3.60E-009 -3.20E-008
""")
data = np.genfromtxt(StringIO(data_file), dtype=None)
print data
>>> array([('Vmeas', -5.0, -4.5, -4.0, -3.5),
('vfd3051', -3.2e-08, -1.49e-09, 1.38e-08, -1.17e-08),
('vfd3151', -3.71e-08, -6.58e-09, -6.58e-09, -6.58e-09),
('vfd3251', -4.73e-08, 3.59e-09, 8.68e-09, -1.68e-08),
('vfd3351', -2.18e-08, -3.71e-08, 3.6e-09, -3.2e-08)],
dtype=[('f0', 'S7'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8')])
So now we have a numpy array of tuples with the column names as the first index and all the
data as the rest of the tuple. Most importantly all the numbers are numbers, try to avoid having
strings because conversions are messy.
Then we could do the following to get a nice pandas DataFrame:
DataDictionary = {row[0]:list(row)[1:] for row in iter(data)}
pd.DataFrame(DataDictionary)
First we create a dictionary of the data using a Python dictionary comprehension, then pass this into the DataFrame. This results in a well-behaved dataframe with columns
named by the strings "Vmeas", "vfd*" and an index covering all the data.
   Vmeas       vfd3051       vfd3151       vfd3251       vfd3351
0   -5.0 -3.200000e-08 -3.710000e-08 -4.730000e-08 -2.180000e-08
1   -4.5 -1.490000e-09 -6.580000e-09  3.590000e-09 -3.710000e-08
2   -4.0  1.380000e-08 -6.580000e-09  8.680000e-09  3.600000e-09
3   -3.5 -1.170000e-08 -6.580000e-09 -1.680000e-08 -3.200000e-08
I doubt this will fully answer your question but it is a start to getting the data correct before plotting it which I think was your problem. Try to keep it as simple as possible!
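On the ValueError itself, the traceback suggests that plt.plot received a column name rather than the column's values. Iterating over a DataFrame yields its column labels, so col in "for col in df" is a string such as 'vfd3051', which matplotlib then tries to parse as a format string. A pandas-only sketch with made-up values:

```python
import pandas as pd

# Hypothetical two-die frame with the same kind of column names.
df = pd.DataFrame({"vfd3051": [-3.20e-08, -1.49e-09],
                   "vfd3151": [-3.71e-08, -6.58e-09]})

# Iterating a DataFrame gives the column labels, not the data.
labels = [col for col in df]

# .items() gives (label, Series) pairs; the Series is what should be plotted.
columns = {name: values for name, values in df.items()}

print(labels)
print(type(columns["vfd3051"]).__name__)
```

In make_plots, passing df[col] (or the values from df.items()) to plt.plot instead of col should therefore avoid the format-string error. The legend=False argument likely needs to go as well, since it belongs to DataFrame.plot rather than plt.plot.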
