Pandas: creating a big DataFrame and filling it in a loop - python

I have already created the columns of my DataFrame:
id = [f'GeneID_region_{i}' for i in range(43)]
value = [f'GeneValue_region_{i}' for i in range(43)]
lst = []
for i in range(43):
    lst.append(id[i])
    lst.append(value[i])
df = pd.DataFrame(lst)
df = df.T
Now it looks like this:
df
Out[158]:
0 1 ... 84 85
0 GeneID_region_0 GeneValue_region_0 ... GeneID_region_42 GeneValue_region_42
[1 rows x 86 columns]
The GeneID_region... names are my columns, and now I want to fill the columns line by line. But I think I haven't actually set them as column labels yet, because I can't do:
df.GeneID_region_0
Traceback (most recent call last):
File "<ipython-input-159-2760f7e0dd61>", line 1, in <module>
df.GeneID_region_0
File "/home/anja/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 5179, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'GeneID_region_0'
Can someone help me do this properly?
The result should look like the following:
I have a numpy array of dimension 43x25520.
I want to have 25520 values in column 'GeneID_region_0',
and then 25520 values in column 'GeneValue_region_0',
and so on. In the end I want to have a pandas DataFrame of dimension (25520, 86).

I am guessing that what you wanted was GeneID_region_n etc. for column names, and then to fill your df with data. You can do this (using 0 as fake data, since you didn't specify any) like this:
id = [f'GeneID_region_{i}' for i in range(43)]
value = [f'GeneValue_region_{i}' for i in range(43)]
lst = []
for i in range(43):
    lst.append(id[i])
    lst.append(value[i])
df = pd.DataFrame([[0 for i in range(43 + 43)]], columns=lst)
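If the goal is the full (25520, 86) frame rather than a single placeholder row, one possible sketch is to build a dict of columns and let the constructor interleave them. Note the `ids` array below is purely hypothetical, since the question only describes a single 43x25520 array of values:

```python
import numpy as np
import pandas as pd

# Placeholder data: the question describes a 43x25520 numpy array of values.
# The `ids` array is invented here just so every column has content.
rng = np.random.default_rng(0)
values = rng.random((43, 25520))
ids = np.arange(43 * 25520).reshape(43, 25520)

data = {}
for i in range(43):
    data[f'GeneID_region_{i}'] = ids[i]        # 25520 IDs for region i
    data[f'GeneValue_region_{i}'] = values[i]  # 25520 values for region i

df = pd.DataFrame(data)
print(df.shape)  # (25520, 86)
```

Since dicts preserve insertion order, the ID and value columns come out interleaved exactly as in the question.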

Related

Type error when trying to multiply two Lists in Python

I am trying to write a simple program where a new column is added to an existing dataframe. The new column is created by multiplying values of two existing columns.
This is the code I have written:
import pandas as pd
import numpy as np
data = {'Booking Code': ['B001','B002','B003','B004','B005'],
        'Customer Name': ['Veer','Umesh','Lavanya','Shobhna','Piyush'],
        'No. of Tickets': [4,2,6,5,3],
        'Ticket Rate': [100,200,150,250,100],
        'Booking Clerk': ['Manish','Kishor','Manish','John','Kishor']}
df = pd.DataFrame(data)
df = df.to_string(index=False)
print(df)
totalamount = [int(df['No. of Tickets']) * int(df['Ticket Rate'])]
df['Total Amounts'] = totalamount
print(df)
Even though I've used int() to convert the values back to integers, it still gives a type error, the exact error being:
Traceback (most recent call last):
File "File Path", line 11, in <module>
totalamount=[int(df['No. of Tickets'])* int(df['Ticket Rate']) ]
TypeError: string indices must be integers
Earlier, when I did not have the df=df.to_string(index=False) line and also did not use the int() function, there wasn't any error. The values were multiplied, although printed in this manner:
[0 400
1 400
2 900
3 1250
4 300
dtype: int64]
But further in the code where I try to add the list to the Dataframe it gives the error ValueError: Length of values (1) does not match length of index (5)
I tried to look for other ways to do this, but can't seem to find any. Thank you in advance!
You can try this short answer:
df['Total Amounts'] = df.apply(lambda x: x['No. of Tickets'] * x['Ticket Rate'], axis=1)
output:
# print(df['Total Amounts'])
0 400
1 400
2 900
3 1250
4 300
Name: Total Amounts, dtype: int64
You converted your df to a string and reassigned it to df, so df was no longer a DataFrame.
import pandas as pd
import numpy as np
data = {'Booking Code': ['B001','B002','B003','B004','B005'],
        'Customer Name': ['Veer','Umesh','Lavanya','Shobhna','Piyush'],
        'No. of Tickets': [4,2,6,5,3],
        'Ticket Rate': [100,200,150,250,100],
        'Booking Clerk': ['Manish','Kishor','Manish','John','Kishor']}
df = pd.DataFrame(data)
string_df = df.to_string(index=False)  # Assign it to another variable!
print(string_df)
totalamount = df['No. of Tickets'] * df['Ticket Rate']  # a Series, not a one-element list
df['Total Amounts'] = totalamount
print(df)
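As a minimal, self-contained check of the approach above (keeping only the two relevant columns), the vectorized product aligns on the index and can be assigned directly:

```python
import pandas as pd

df = pd.DataFrame({'No. of Tickets': [4, 2, 6, 5, 3],
                   'Ticket Rate': [100, 200, 150, 250, 100]})

# Multiplying two columns element-wise returns a Series aligned on the index,
# so it can be assigned straight into a new column -- no int() casts, no apply().
df['Total Amounts'] = df['No. of Tickets'] * df['Ticket Rate']
print(df['Total Amounts'].tolist())  # [400, 400, 900, 1250, 300]
```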

Length mismatch error when scaling up multiIndex slicer on large dataset

I am trying to split imported csv files (timeseries) and manipulate them using pandas' MultiIndex cross-section method .xs(). The following df replicates the structure of my imported csv file.
import pandas as pd
df = pd.DataFrame(
    {'Sensor ID': [14, 1, 3, 14, 3],
     'Building ID': [109, 109, 109, 109, 109],
     'Date/Time': ["26/10/2016 14:31:14", "26/10/2016 14:31:16", "26/10/2016 14:32:17",
                   "26/10/2016 14:35:14", "26/10/2016 14:35:38"],
     'Reading': [20.95, 20.62, 22.45, 20.65, 22.83],
    })
df.set_index(['Sensor ID', 'Date/Time'], inplace=True)
df.sort_index(inplace=True)
print(df)
SensorList = [1, 3, 14]
for s in SensorList:
    df1 = df.xs(s, level='Sensor ID')
I have tested the code on a small excerpt of csv data and it works fine. However, when implementing with the entire csv file, I receive the error: ValueError: Length mismatch: Expected axis has 19562 elements, new values have 16874 elements.
Printing df.info() returns the following:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 65981 entries, (1, 2016-10-26 14:35:15) to (19, 2016-11-07 11:27:14)
Data columns (total 2 columns):
Building ID 65981 non-null int64
Reading 65981 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.5+ MB
None
Any tip on what may be causing the error?
EDIT
I inadvertently truncated my code, thus leaving it pointless in its current form. The original code resamples values into 15-minutes and 1-hour intervals.
with:
units = ['D1','D3','D6','D10']
unit_output_path = './' + unit + '/'
the loop does:
for s in SensorList:
    ## Slice multi-index to isolate all readings for sensor s
    df1 = df_mi.xs(s, level='Sensor ID')
    df1.drop('Building ID', axis=1, inplace=True)
    ## Resample by 15min and 1hr intervals and export individual csv files
    df1_15min = df1.resample('15Min').mean().round(1)
    df1_hr = df1.resample('60Min').mean().round(1)
Traceback:
File "D:\AN6478\AN6478_POE_ABo.py", line 52, in <module>
df1 = df_mi.xs(s, level='Sensor ID')
File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1736, in xs
setattr(result, result._get_axis_name(axis), new_ax)
File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\generic.py", line 2685, in __setattr__
return object.__setattr__(self, name, value)
File "pandas\src\properties.pyx", line 65, in pandas.lib.AxisProperty.__set__ (pandas\lib.c:44748)
File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\generic.py", line 428, in _set_axis
self._data.set_axis(axis, labels)
File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2635, in set_axis
(old_len, new_len))
ValueError: Length mismatch: Expected axis has 19562 elements, new values have 16874 elements
I can't tell you why exactly df1 = df_mi.xs(s, level='Sensor ID') raises the ValueError here. Where does df_mi come from?
Here is an alternative using groupby, which accomplishes what you want on your given dummy data frame without relying on MultiIndex and xs:
# reset index to have a DatetimeIndex, otherwise resample won't work
df = df.reset_index(0)
df.index = pd.to_datetime(df.index)
# group by sensor, keeping the relevant "Reading" column
grouped = df.groupby("Sensor ID")["Reading"]
# iterate over each sensor's series
for sensor, sub_df in grouped:
    quarterly = sub_df.resample('15Min').mean().round(1)
    hourly = sub_df.resample('60Min').mean().round(1)
    # implement your to_csv saving here
Note, you could also use groupby on the MultiIndex with df.groupby(level="Sensor ID"); however, since you want to resample later on, it is easier to drop Sensor ID from the MultiIndex, which simplifies things overall.
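Putting the dummy frame from the question and the groupby approach together into one runnable sketch (the csv-saving step is left out, so resampled results are just printed):

```python
import pandas as pd

# Rebuild the dummy frame from the question, with Date/Time as a DatetimeIndex.
df = pd.DataFrame({
    'Sensor ID': [14, 1, 3, 14, 3],
    'Building ID': [109] * 5,
    'Date/Time': pd.to_datetime(
        ["26/10/2016 14:31:14", "26/10/2016 14:31:16", "26/10/2016 14:32:17",
         "26/10/2016 14:35:14", "26/10/2016 14:35:38"], dayfirst=True),
    'Reading': [20.95, 20.62, 22.45, 20.65, 22.83],
}).set_index('Date/Time')

# One resampled series per sensor; a to_csv call would go inside the loop.
for sensor, readings in df.groupby('Sensor ID')['Reading']:
    quarterly = readings.resample('15min').mean().round(1)
    print(sensor, quarterly.tolist())
```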

Get Pandas DataFrame first column

This question is odd, since I know HOW to do something, but I don't know WHY I can't do it another way.
Suppose a simple data frame:
import pandas as pd
a = pd.DataFrame([[0, 1], [2, 3]])
I can slice this data frame very easily: the first column is a[[0]], the second is a[[1]]. Simple, isn't it?
Now, let's have a more complex data frame. This is part of my code:
var_vec = [i for i in range(100)]
num_of_sites = 100
row_names = ["_".join(["loc", str(i)]) for i in range(1, num_of_sites + 1)]
frame = pd.DataFrame(var_vec, columns=["Variable"], index=row_names)
spec_ab = [i**3 for i in range(100)]
frame[1] = spec_ab
The data frame frame is also a pandas DataFrame, just like a. I can get the second column very easily as frame[[1]]. But when I try frame[[0]] I get an error:
Traceback (most recent call last):
File "<ipython-input-55-0c56ffb47d0d>", line 1, in <module>
frame[[0]]
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\core\frame.py", line 1991, in __getitem__
return self._getitem_array(key)
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\core\frame.py", line 2035, in _getitem_array
indexer = self.ix._convert_to_indexer(key, axis=1)
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\core\indexing.py", line 1184, in _convert_to_indexer
indexer = labels._convert_list_indexer(objarr, kind=self.name)
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\indexes\base.py", line 1112, in _convert_list_indexer
return maybe_convert_indices(indexer, len(self))
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\core\indexing.py", line 1856, in maybe_convert_indices
raise IndexError("indices are out-of-bounds")
IndexError: indices are out-of-bounds
I can still use frame.iloc[:,0], but the problem is that I don't understand why I can't use simple slicing by [[]]? I use WinPython Spyder 3, if that helps.
Using your code:
import pandas as pd
var_vec = [i for i in range(100)]
num_of_sites = 100
row_names = ["_".join(["loc", str(i)]) for i in range(1, num_of_sites + 1)]
frame = pd.DataFrame(var_vec, columns=["Variable"], index=row_names)
spec_ab = [i**3 for i in range(100)]
frame[1] = spec_ab
If you print out frame, you get:
Variable 1
loc_1 0 0
loc_2 1 1
loc_3 2 8
loc_4 3 27
loc_5 4 64
loc_6 5 125
......
So the cause of your problem becomes obvious: you have no column called 0.
At line one you specify a list called var_vec.
At line 4 you make a dataframe out of that list, but you specify the index values and the column name (which is usually good practice).
Numerical column names ('0', '1', ...), as in the first example, only appear when you don't specify column names; they are labels, not positional indexers.
If you want to access columns by their position, you can:
df[df.columns[0]]
What happens then is that you get the list of the df's columns, pick element 0, and pass it to the df as a label.
Hope that helps you understand.
Edit:
Another (better) way would be:
df.iloc[:,0]
where ":" stands for all rows (also indexed by number, from 0 up to the number of rows).
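To make the label-vs-position distinction concrete, a shortened version of the question's example (5 rows instead of 100) shows why frame[[0]] fails while positional access works:

```python
import pandas as pd

frame = pd.DataFrame({'Variable': range(5)},
                     index=[f'loc_{i}' for i in range(1, 6)])
frame[1] = [i ** 3 for i in range(5)]

# The column labels are 'Variable' and the integer 1 -- there is no label 0,
# which is why frame[[0]] raises an error.
print(frame.columns.tolist())         # ['Variable', 1]
print(frame.iloc[:, 0].tolist())      # positional access: [0, 1, 2, 3, 4]
print(frame[frame.columns[0]].name)   # 'Variable'
```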

Python Pandas sum of dataframe with one column

I have a Python Pandas DataFrame:
df = pd.DataFrame(np.random.rand(5,3),columns=list('ABC'))
print df
A B C
0 0.041761178 0.60439116 0.349372206
1 0.820455992 0.245314299 0.635568504
2 0.517482167 0.7257227 0.982969949
3 0.208934899 0.594973111 0.671030326
4 0.651299752 0.617672419 0.948121305
Question:
I would like to add the first column to the whole dataframe. I would like to get this:
A B C
0 0.083522356 0.646152338 0.391133384
1 1.640911984 1.065770291 1.456024496
2 1.034964334 1.243204867 1.500452116
3 0.417869798 0.80390801 0.879965225
4 1.302599505 1.268972171 1.599421057
For the first row:
A: 0.04176 + 0.04176 = 0.08352
B: 0.04176 + 0.60439 = 0.64615
etc
Requirements:
I cannot refer to the first column using its column name.
eg.: df.A is not acceptable; df.iloc[:,0] is acceptable.
Attempt:
I tried this using:
print df.add(df.iloc[:,0], fill_value=0)
but it is not working. It returns the error message:
Traceback (most recent call last):
File "C:test.py", line 20, in <module>
print df.add(df.iloc[:,0], fill_value=0)
File "C:\python27\lib\site-packages\pandas\core\ops.py", line 771, in f
return self._combine_series(other, na_op, fill_value, axis, level)
File "C:\python27\lib\site-packages\pandas\core\frame.py", line 2939, in _combine_series
return self._combine_match_columns(other, func, level=level, fill_value=fill_value)
File "C:\python27\lib\site-packages\pandas\core\frame.py", line 2975, in _combine_match_columns
fill_value)
NotImplementedError: fill_value 0 not supported
Is it possible to take the sum of all columns of a DataFrame with the first column?
That's what you need to do:
df.add(df.A, axis=0)
Example:
>>> df = pd.DataFrame(np.random.rand(5,3),columns=['A','B','C'])
>>> col_0 = df.columns.tolist()[0]
>>> print df
A B C
0 0.502962 0.093555 0.854267
1 0.165805 0.263960 0.353374
2 0.386777 0.143079 0.063389
3 0.639575 0.269359 0.681811
4 0.874487 0.992425 0.660696
>>> df = df.add(df[col_0], axis=0)
>>> print df
A B C
0 1.005925 0.596517 1.357229
1 0.331611 0.429766 0.519179
2 0.773553 0.529855 0.450165
3 1.279151 0.908934 1.321386
4 1.748975 1.866912 1.535183
>>>
I would try something like this:
firstcol = df.columns[0]
df2 = df.add(df[firstcol], axis=0)
I used a combination of the above two posts to answer this question.
Since I cannot refer to a specific column by its name, I cannot use df.add(df.A, axis=0), but this is along the correct lines. Since df += df[firstcol] produced a dataframe of NaNs, I could not use that approach either; the way this solution obtains the list of columns from the dataframe was the trick I needed.
Here is how I did it:
col_0 = df.columns.tolist()[0]
print(df.add(df[col_0], axis=0))
You can use numpy and broadcasting for this:
df = pd.DataFrame(df.values + df['A'].values[:, None],
                  columns=df.columns)
I expect this to be more efficient than series-based methods.
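For comparison, a name-free variant of the accepted approach: the questioner's own attempt was nearly right and only needed axis=0 instead of fill_value. The small frame and its values below are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15, dtype=float).reshape(5, 3), columns=list('ABC'))

# Align the first column along the rows (axis=0); fill_value isn't needed here,
# and the question's traceback shows it isn't supported for this operation anyway.
result = df.add(df.iloc[:, 0], axis=0)
print(result['B'].tolist())  # each B value plus the A value in the same row
```

This keeps the DataFrame's index and column labels intact, unlike rebuilding the frame from raw numpy values.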

Get the first pandas DataFrame's column?

I want to calculate the std of the first column of my prices DataFrame.
Here is my code:
import pandas as pd
def std(returns):
    return pd.DataFrame(returns.std(axis=0, ddof=0))
prices = pd.DataFrame([[-0.33333333, -0.25343423, -0.1666666667],
                       [+0.23432323, +0.14285714, -0.0769230769],
                       [+0.42857143, +0.07692308, +0.1818181818]])
print(std(prices.ix[:,0]))
When I run it, I get the following error:
Traceback (most recent call last):
File "C:\Users\*****\Documents\******\******\****.py", line 12, in <module>
print(std(prices.ix[:,0]))
File "C:\Users\*****\Documents\******\******\****.py", line 10, in std
return pd.DataFrame(returns.std(axis=0, ddof=0))
File "C:\Python33\lib\site-packages\pandas\core\frame.py", line 453, in __init__
raise PandasError('DataFrame constructor not properly called!')
pandas.core.common.PandasError: DataFrame constructor not properly called!
How can I fix that?
Thank you!
Take a closer look at what is going on in your code:
>>> prices.ix[:,0]
0 -0.333333
1 0.234323
2 0.428571
>>> prices.ix[:,0].std(axis=0, ddof=0)
0.32325861621668445
So you are calling the DataFrame constructor like this:
pd.DataFrame(0.32325861621668445)
The constructor has no idea what to do with a single float parameter. It needs some kind of sequence or iterable. Maybe what you want is this:
>>> pd.DataFrame([0.32325861621668445])
0
0 0.323259
It should be as simple as this:
In [0]: prices[0].std()
Out[0]: 0.39590933234452624
Columns of DataFrames are Series. You can call Series methods on them directly.
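One caveat worth noting (an observation, not part of either answer): the question's std() helper used ddof=0, while Series.std() defaults to ddof=1, which is why the two answers show different numbers for the same column:

```python
import pandas as pd

prices = pd.DataFrame([[-0.33333333, -0.25343423, -0.1666666667],
                       [+0.23432323, +0.14285714, -0.0769230769],
                       [+0.42857143, +0.07692308, +0.1818181818]])

# ddof=0 is the population std (what the question's helper computed);
# the Series.std() default of ddof=1 is the sample std.
print(round(prices[0].std(ddof=0), 4))  # 0.3233
print(round(prices[0].std(), 4))        # 0.3959
```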
