pandas dataframe conversion for linear regression - python

I read a CSV file into a dataframe (named data). The first few columns are numeric (type: pandas.core.series.Series, int64) and the last column (the label) is a binary response variable stored as the strings 'P(ass)'/'F(ail)'.
import statsmodels.api as sm
label = data.ix[:, -1]
label[label == 'P'] = 1
label[label == 'F'] = 0
fea = data.ix[:, 0: -1]
logit = sm.Logit(label, fea)
result = logit.fit()
print(result.summary())
Pandas throws me this error message: "ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data)"
Numpy, Pandas, etc. modules are already imported. I tried converting the fea columns to float but it still does not go through. Could someone tell me how to correct this?
Thanks
update:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 68135 to 3002
Data columns (total 8 columns):
TestQty 500 non-null int64
WaferSize 500 non-null int64
ChuckTemp 500 non-null int64
Notch 500 non-null int64
ORIGINALDIEX 500 non-null int64
ORIGINALDIEY 500 non-null int64
DUTNo 500 non-null int64
PassFail 500 non-null object
dtypes: int64(7), object(1)
memory usage: 35.2+ KB
data.sum()
TestQty 530
WaferSize 6000
ChuckTemp 41395
Notch 135000
ORIGINALDIEX 12810
ORIGINALDIEY 7885
DUTNo 271132
PassFail 20
dtype: float64

Shouldn't your features be this:
fea = data.ix[:, 0:-1]
From your data, you can see that PassFail sums to 20 before you convert 'P' to 1 and 'F' to 0. I believe that is the source of your error.
To see what is in there, try:
data.PassFail.unique()
To verify that the two groups together account for all 500 rows of the DataFrame:
len(label[label == 0]) + len(label[label == 1])
Finally, try passing values to the function rather than Series and DataFrames:
logit = sm.Logit(label.values, fea.values)
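For reference, here is one way the whole suggestion might fit together (a sketch, assuming data is the DataFrame from the update; .iloc is used in place of the older .ix, and the P/F strings are mapped explicitly so nothing reaches Logit with object dtype):
import statsmodels.api as sm

# Map the PassFail strings to 0/1 and force everything to float
label = data.iloc[:, -1].map({'P': 1, 'F': 0}).astype(float)
fea = data.iloc[:, :-1].astype(float)

logit = sm.Logit(label.values, fea.values)
result = logit.fit()
print(result.summary())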

Related

python stock data, fill open with previous close

I import a csv with S&P 500 data, and some of the opening price data is set to 0.
import pandas as pd
#import spx data
spx = pd.read_csv('C:/Users/joshu/Desktop/SPX1970.csv')
spx['DateTime'] = pd.to_datetime(spx['Date'],utc=False, format="%d-%b-%y")
spx.sort_values(by=['DateTime'],ascending=True, inplace=True)
spx
and the output is below. Some of the opening price data is missing
Date Open High Low Close DateTime
13285 5-Jan-70 0 94.25 92.53 93.46 1970-01-05
13284 6-Jan-70 0 93.81 92.13 92.82 1970-01-06
13283 7-Jan-70 0 93.38 91.93 92.63 1970-01-07
13282 8-Jan-70 0 93.47 91.99 92.68 1970-01-08
13281 9-Jan-70 0 93.25 91.82 92.4 1970-01-09
... ... ... ... ... ... ...
4 29-Aug-22 4,034.58 4,062.99 4,017.42 4,030.61 2022-08-29
3 30-Aug-22 4,041.25 4,044.98 3,965.21 3,986.16 2022-08-30
2 31-Aug-22 4,000.67 4,015.37 3,954.53 3,955.00 2022-08-31
1 1-Sep-22 3,936.73 3,970.23 3,903.65 3,966.85 2022-09-01
0 2-Sep-22 3,994.66 4,018.43 3,906.21 3,924.26 2022-09-02
13286 rows × 6 columns
I could easily use Excel to assign each 0 in the Open column the previous close, but I would like to figure out how to do it with Python.
I looked at various methods from the "working with missing data" section of the pandas documentation, but that doesn't really fit my case. I searched the web and didn't really find my case either. So I wrote the following, which is not working: the if condition for Open == 0 is never true.
# for open value = 0, copy previous close into open; keep this row's close for the next iteration
for index, row in spx.iterrows():
    if row['Open'] == 0:
        row['Open'] = C
        C = row['Close']
        print(C)  # I added this to see if this was ever true
    else:
        C = row['Close']
        continue
spx
Any help is appreciated.
EDIT:
I corrected my data types, and the logic now prints out the close when the open is 0, so I believe the condition works. I'm stuck on row['Open'] = C not taking effect.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13287 entries, 0 to 13286
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 13287 non-null int64
1 Date 13287 non-null object
2 Open 13287 non-null float64
3 High 13287 non-null float64
4 Low 13287 non-null float64
5 Close 13287 non-null float64
6 DateTime 13287 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(4), int64(1), object(1)
memory usage: 726.8+ KB
There are three main problems you are trying to solve, and I believe the code below covers all three. See the comments for what each step does.
import pandas as pd
df = pd.DataFrame({"a": [7, 0, 0, 3], "b": [2, 4, 6, 8]})
# Find all instances of zeros in your data
zeros = (df['a'] == 0)
# Set those to None, which is what fillna wants. Note you can use masks to do the following too
df.loc[zeros, 'a'] = None
# Make a shifted copy of b - this covers your "previous close value" requirement
previous_b = df['b'].shift()
# Replace missing values (Nones) in a with the value in b
df['a'] = df['a'].fillna(previous_b)
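For what it's worth, the reason the loop in the question has no visible effect is that iterrows() yields a copy of each row, so row['Open'] = C never writes back into the DataFrame. Applied to the spx frame from the question, the same shift/fillna idea would look roughly like this (a sketch, assuming Open and Close are already numeric and the rows are sorted by date):
zero_open = spx['Open'] == 0          # rows where the open is missing (stored as 0)
spx.loc[zero_open, 'Open'] = None     # turn them into NaN so fillna can see them
spx['Open'] = spx['Open'].fillna(spx['Close'].shift())   # fill with the previous row's close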

Pandas HDF5ExtError: Problems creating the Array

I was trying to save my df with texts to an h5 file, and got an awesome error :)
>>> df_texts["text"] = df_texts.text.apply(str)
>>> df_texts.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2987771 entries, 0 to 2987770
Data columns (total 2 columns):
# Column Dtype
--- ------ -----
0 text object
1 labels int32
dtypes: int32(1), object(1)
memory usage: 34.2+ MB
>>> core["texts"] = df_texts
HDF5ExtError: Problems creating the Array.
I can't understand it...
Traceback: https://pastebin.com/0vRQewvy
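For context, core is never defined in the question; presumably it is an open HDFStore, something along these lines (an assumption, not shown in the original code):
import pandas as pd

core = pd.HDFStore('texts.h5')   # assumed setup; the question does not show how core was created
core['texts'] = df_texts         # this assignment is what raises HDF5ExtError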

How to convert array to being 1 dimensional for use as an index

I'm trying to create an index from a numpy array, but every time I try I get the following error: 'ValueError: Cannot index with multidimensional key'. How can I get this 'indices' array into the correct format to work?
Here is the relevant code:
Dataframe:
default student balance income
0 No No 729.526495 44361.625074
1 No Yes 817.180407 12106.134700
2 No No 1073.549164 31767.138947
3 No No 529.250605 35704.493935
4 No No 785.655883 38463.495879
... ... ... ... ...
9995 No No 711.555020 52992.378914
9996 No No 757.962918 19660.721768
9997 No No 845.411989 58636.156984
9998 No No 1569.009053 36669.112365
9999 No Yes 200.922183 16862.952321
10000 rows × 4 columns
default.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
default 10000 non-null object
student 10000 non-null object
balance 10000 non-null float64
income 10000 non-null float64
dtypes: float64(2), object(2)
memory usage: 312.6+ KB
import numpy as np
import statsmodels.formula.api as smf

def regression(X, y, indices):
    reg = smf.logit('Default_ ~ balance + income', default, subset=indices).fit()
    beta_0 = reg.coeffs(1)
    print(reg.coeffs)

n_iter = 1
for i in range(0, n_iter):
    sample_size = len(default)
    X = default[['balance','income']]
    y = default['default']
    # create random set of indices
    indices = np.round(np.random.rand(len(default),1)*len(default)).astype(int)
    regression(X, y, indices)
Format of array im trying to use as index:
[[2573]
[8523]
[2403]
...
[1635]
[6665]
[6364]]
Just collapse it to a one-dimensional array using flatten():
https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.flatten.html
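A small illustration of that suggestion, mirroring the (n, 1) shape from the question (numpy only; indices.ravel() would work just as well here):
import numpy as np

indices = np.round(np.random.rand(10, 1) * 10).astype(int)  # shape (10, 1), like in the question
flat_indices = indices.flatten()                             # shape (10,), usable as an index
print(indices.shape, flat_indices.shape)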

Why does memory usage in Pandas report the same number for integers as for object dtype?

I'm trying to understand the difference in memory usage between integers and string (objects) dtypes in Pandas.
import numpy as np
import pandas as pd
df_int = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'), dtype=int)
As expected, this takes around 3.2 KB of memory as each column is a 64-bit integer
In [38]: df_int.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
A 100 non-null int64
B 100 non-null int64
C 100 non-null int64
D 100 non-null int64
dtypes: int64(4)
memory usage: 3.2 KB
However, when I try to initialize it as a string, it is telling me that it has roughly the same memory usage
import numpy as np
import pandas as pd
df_str = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'), dtype=str)
In [40]: df_str.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
A 100 non-null object
B 100 non-null object
C 100 non-null object
D 100 non-null object
dtypes: object(4)
memory usage: 3.2+ KB
When I use sys.getsizeof, the difference is clear. For the dataframe containing only 64-bit integers, the size is roughly 3.3 KB (including the dataframe overhead of 24 bytes)
In [44]: sys.getsizeof(df_int)
Out[44]: 3304
For the dataframe initialized with integers converted to strings, it is nearly 24 KB
In [42]: sys.getsizeof(df_str)
Out[42]: 23984
Why does memory usage in Pandas report the same number for integers as for strings (object dtype)?
Following the docs, use 'deep' to get the actual value (otherwise it's an estimate)
df_str.info(memory_usage='deep')
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 100 entries, 0 to 99
#Data columns (total 4 columns):
#A 100 non-null object
#B 100 non-null object
#C 100 non-null object
#D 100 non-null object
#dtypes: object(4)
#memory usage: 23.3 KB
A value of ‘deep’ is equivalent to “True with deep introspection”. Memory usage is shown in human-readable units (base-2 representation). Without deep introspection a memory estimation is made based in column dtype and number of rows assuming values consume the same memory amount for corresponding dtypes. With deep memory introspection, a real memory usage calculation is performed at the cost of computational resources.
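A quick sketch of the difference, using the same toy frame as the question (exact numbers vary by platform; the shallow figure only counts the 8-byte object pointers, the deep one the Python string objects themselves):
import numpy as np
import pandas as pd

df_str = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)),
                      columns=list('ABCD'), dtype=str)
print(df_str.memory_usage().sum())           # shallow estimate, comparable to the int64 frame
print(df_str.memory_usage(deep=True).sum())  # real usage, several times larger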

Plotting Pandas' pivot_table from long data

I have an xls file with data organized in long format. It has four columns: the variable name, the country name, the year and the value.
After importing the data in Python with pandas.read_excel, I want to plot the time series of one variable for different countries. To do so, I create a pivot table that transforms the data in wide format. When I try to plot with matplotlib, I get an error
ValueError: could not convert string to float: 'ZAF'
(where 'ZAF' is the label of one country)
What's the problem?
This is the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_excel('raw_emissions_energy.xls','raw data', index_col = None, thousands='.',parse_cols="A,C,F,M")
data['Year'] = data['Year'].astype(str)
data['COU'] = data['COU'].astype(str)
# generate sub-datasets for specific VARs
data_CO2PROD = pd.pivot_table(data[(data['VAR']=='CO2_PBPROD')], index='COU', columns='Year')
plt.plot(data_CO2PROD)
The xls file with the raw data looks like this:
[screenshot: "raw data" sheet in Excel]
This is what I get from data_CO2PROD.info()
<class 'pandas.core.frame.DataFrame'>
Index: 105 entries, ARE to ZAF
Data columns (total 16 columns):
(Value, 1990) 104 non-null float64
(Value, 1995) 105 non-null float64
(Value, 2000) 105 non-null float64
(Value, 2001) 105 non-null float64
(Value, 2002) 105 non-null float64
(Value, 2003) 105 non-null float64
(Value, 2004) 105 non-null float64
(Value, 2005) 105 non-null float64
(Value, 2006) 105 non-null float64
(Value, 2007) 105 non-null float64
(Value, 2008) 105 non-null float64
(Value, 2009) 105 non-null float64
(Value, 2010) 105 non-null float64
(Value, 2011) 105 non-null float64
(Value, 2012) 105 non-null float64
(Value, 2013) 105 non-null float64
dtypes: float64(16)
memory usage: 13.9+ KB
None
Using data_CO2PROD.plot() instead of plt.plot(data_CO2PROD) allowed me to plot the data. See http://pandas.pydata.org/pandas-docs/stable/visualization.html.
Simple code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data= pd.DataFrame(np.random.randn(3,4), columns=['VAR','COU','Year','VAL'])
data['VAR'] = ['CC','CC','KK']
data['COU'] =['ZAF','NL','DK']
data['Year']=['1987','1987','2006']
data['VAL'] = [32,33,35]
data['Year'] = data['Year'].astype(str)
data['COU'] = data['COU'].astype(str)
# generate sub-datasets for specific VARs
data_CO2PROD = pd.pivot_table(data=data[(data['VAR']=='CC')], index='COU', columns='Year')
data_CO2PROD.plot()
plt.show()
I think you need to add the values parameter to pivot_table:
data_CO2PROD = pd.pivot_table(data=data[(data['VAR']=='CC')],
                              index='COU',
                              columns='Year',
                              values='Value')
data_CO2PROD.plot()
plt.show()
