I import a CSV with S&P 500 data, and some of the opening price data is set to 0.
import pandas as pd
#import spx data
spx = pd.read_csv('C:/Users/joshu/Desktop/SPX1970.csv')
spx['DateTime'] = pd.to_datetime(spx['Date'],utc=False, format="%d-%b-%y")
spx.sort_values(by=['DateTime'],ascending=True, inplace=True)
spx
The output is below; some of the opening price data is missing:
Date Open High Low Close DateTime
13285 5-Jan-70 0 94.25 92.53 93.46 1970-01-05
13284 6-Jan-70 0 93.81 92.13 92.82 1970-01-06
13283 7-Jan-70 0 93.38 91.93 92.63 1970-01-07
13282 8-Jan-70 0 93.47 91.99 92.68 1970-01-08
13281 9-Jan-70 0 93.25 91.82 92.4 1970-01-09
... ... ... ... ... ... ...
4 29-Aug-22 4,034.58 4,062.99 4,017.42 4,030.61 2022-08-29
3 30-Aug-22 4,041.25 4,044.98 3,965.21 3,986.16 2022-08-30
2 31-Aug-22 4,000.67 4,015.37 3,954.53 3,955.00 2022-08-31
1 1-Sep-22 3,936.73 3,970.23 3,903.65 3,966.85 2022-09-01
0 2-Sep-22 3,994.66 4,018.43 3,906.21 3,924.26 2022-09-02
13286 rows × 6 columns
I could easily use Excel and assign my 0 (in the Open column) to the previous close, but I would like to figure out how to do it with Python.
I looked at various methods from the "working with missing data" section of the pandas documentation, but that doesn't really fit my case. I searched the web and didn't really find my case either. So I wrote the following, which is not working. The if condition for Open == 0 is never true.
# for open value = 0, copy the previous close into open, and keep this row's close for the next iteration
for index, row in spx.iterrows():
    if row['Open'] == 0:
        row['Open'] = C
        C = row['Close']
        print(C)  # I added this to see if this branch was ever reached
    else:
        C = row['Close']
        continue
spx
Any help is appreciated.
EDIT:
I corrected my data types, and the logic now prints out the close whenever the open is 0, so I believe the condition works. I'm stuck on row['Open'] = C not taking effect.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13287 entries, 0 to 13286
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 13287 non-null int64
1 Date 13287 non-null object
2 Open 13287 non-null float64
3 High 13287 non-null float64
4 Low 13287 non-null float64
5 Close 13287 non-null float64
6 DateTime 13287 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(4), int64(1), object(1)
memory usage: 726.8+ KB
There are three main problems you are trying to solve, and I believe this code covers all three. See the comments in the code below for what each step does.
import pandas as pd
df = pd.DataFrame({"a": [7, 0, 0, 3], "b": [2, 4, 6, 8]})
# Find all instances of zeros in your data
zeros = (df['a'] == 0)
# Set those to None, which is what fillna wants. Note you can use masks to do the following too
df.loc[zeros, 'a'] = None
# Make a shifted copy of b - this covers your "previous close value" requirement
previous_b = df['b'].shift()
# Replace missing values (Nones) in a with the value in b
df['a'] = df['a'].fillna(previous_b)
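Applied to the spx frame from the question, a minimal sketch of the same approach (assuming Open and Close are already float64, as in the edited dtypes, and the frame is sorted by DateTime) might look like:
# mark the zero opens as missing, then fill each one with the previous row's Close
zero_opens = (spx['Open'] == 0)
spx.loc[zero_opens, 'Open'] = None
spx['Open'] = spx['Open'].fillna(spx['Close'].shift())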
Related
I'm trying to create an index from a NumPy array, but every time I try I get the following error: 'ValueError: Cannot index with multidimensional key'. How can I get this 'indices' array into the correct format to work?
Here is the relevant code:
Dataframe:
default student balance income
0 No No 729.526495 44361.625074
1 No Yes 817.180407 12106.134700
2 No No 1073.549164 31767.138947
3 No No 529.250605 35704.493935
4 No No 785.655883 38463.495879
... ... ... ... ...
9995 No No 711.555020 52992.378914
9996 No No 757.962918 19660.721768
9997 No No 845.411989 58636.156984
9998 No No 1569.009053 36669.112365
9999 No Yes 200.922183 16862.952321
10000 rows × 4 columns
default.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
default 10000 non-null object
student 10000 non-null object
balance 10000 non-null float64
income 10000 non-null float64
dtypes: float64(2), object(2)
memory usage: 312.6+ KB
def regression(X, y, indices):
    reg = smf.logit('Default_ ~ balance + income', default, subset=indices).fit()
    beta_0 = reg.coeffs(1)
    print(reg.coeffs)

n_iter = 1
for i in range(0, n_iter):
    sample_size = len(default)
    X = default[['balance', 'income']]
    y = default['default']
    # create random set of indices
    indices = np.round(np.random.rand(len(default), 1) * len(default)).astype(int)
    regression(X, y, indices)
Format of array im trying to use as index:
[[2573]
[8523]
[2403]
...
[1635]
[6665]
[6364]]
Just collapse it to a one-dimensional array using flatten().
https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.flatten.html
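For example, a tiny standalone sketch (the 10-row array here just stands in for the (n, 1) indices array from the question):
import numpy as np

indices = np.round(np.random.rand(10, 1) * 10).astype(int)  # shape (10, 1), like in the question
indices = indices.flatten()                                  # shape (10,), usable as a 1-D index
print(indices)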
I'm trying to understand the difference in memory usage between integers and string (objects) dtypes in Pandas.
import pandas as pd
import numpy as np

df_int = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'), dtype=int)
As expected, this takes around 3.2 KB of memory as each column is a 64-bit integer
In [38]: df_int.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
A 100 non-null int64
B 100 non-null int64
C 100 non-null int64
D 100 non-null int64
dtypes: int64(4)
memory usage: 3.2 KB
However, when I try to initialize it as a string, it is telling me that it has roughly the same memory usage
import pandas as pd
import numpy as np

df_str = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'), dtype=str)
In [40]: df_str.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
A 100 non-null object
B 100 non-null object
C 100 non-null object
D 100 non-null object
dtypes: object(4)
memory usage: 3.2+ KB
When I use sys.getsizeof, the difference is clear. For the dataframe containing only 64-bit integers, the size is roughly 3.3 KB (including the dataframe overhead of 24 bytes)
In [44]: sys.getsizeof(df_int)
Out[44]: 3304
For the dataframe initialized with integers converted to strings, it is nearly 24 KB
In [42]: sys.getsizeof(df_str)
Out[42]: 23984
Why does memory usage in Pandas report the same number for integers as for strings (object dtype)?
Following the docs, use 'deep' to get the actual value (otherwise it's an estimate)
df_str.info(memory_usage='deep')
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 100 entries, 0 to 99
#Data columns (total 4 columns):
#A 100 non-null object
#B 100 non-null object
#C 100 non-null object
#D 100 non-null object
#dtypes: object(4)
#memory usage: 23.3 KB
A value of ‘deep’ is equivalent to “True with deep introspection”.
Memory usage is shown in human-readable units (base-2 representation).
Without deep introspection a memory estimation is made based in column
dtype and number of rows assuming values consume the same memory
amount for corresponding dtypes. With deep memory introspection, a
real memory usage calculation is performed at the cost of
computational resources.
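To see where the extra memory actually goes, a quick per-column check (reusing the df_int and df_str frames from above) could be:
# deep=True measures the real size of the Python string objects held by the object columns
print(df_int.memory_usage(deep=True))
print(df_str.memory_usage(deep=True))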
I read some weather data from a .csv file as a dataframe named "weather". The problem is that the data type of one of the columns is object. This is weird, as it indicates temperature. How do I change it to having a float data type? I tried to_numeric, but it can't parse it.
weather.info()
weather.head()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 304 entries, 2017-01-01 to 2017-10-31
Data columns (total 2 columns):
Temp 304 non-null object
Rain 304 non-null float64
dtypes: float64(1), object(1)
memory usage: 17.1+ KB
Temp Rain
Date
2017-01-01 12.4 0.0
2017-02-01 11 0.6
2017-03-01 10.4 0.6
2017-04-01 10.9 0.2
2017-05-01 13.2 0.0
You can use pandas.Series.astype.
You can do something like this:
weather["Temp"] = weather.Temp.astype(float)
You can also use pd.to_numeric, which will convert the column from object to float.
For details on how to use it, check out this link: http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.to_numeric.html
Example :
s = pd.Series(['apple', '1.0', '2', -3])
print(pd.to_numeric(s, errors='ignore'))
print("=========================")
print(pd.to_numeric(s, errors='coerce'))
Output:
0 apple
1 1.0
2 2
3 -3
dtype: object
=========================
0 NaN
1 1.0
2 2.0
3 -3.0
dtype: float64
In your case you can do something like this:
weather["Temp"] = pd.to_numeric(weather.Temp, errors='coerce')
Another option is to use convert_objects.
An example is as follows:
>>> pd.Series([1,2,3,4,'.']).convert_objects(convert_numeric=True)
0 1
1 2
2 3
3 4
4 NaN
dtype: float64
You can use this as follows:
weather["Temp"] = weather.Temp.convert_objects(convert_numeric=True)
I have shown you these examples because if any value in your column is not a number, it will be converted to NaN, so be careful while using it.
I tried all the methods suggested here, but sadly none worked. Instead, I found this to work:
df['column'] = pd.to_numeric(df['column'],errors = 'coerce')
And then check it using:
print(df.info())
I eventually used:
weather["Temp"] = weather["Temp"].convert_objects(convert_numeric=True)
It worked just fine, except that I got the following message.
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:3: FutureWarning:
convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
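Following that warning, the deprecation-safe equivalent is the same pd.to_numeric approach shown in the other answers, sketched here with the weather frame:
weather["Temp"] = pd.to_numeric(weather["Temp"], errors='coerce')  # invalid values become NaN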
You can try the following:
df['column'] = df['column'].map(lambda x: float(x))
First check your data, because you may get an error if you have ',' instead of '.'.
If so, you need to transform every ',' into '.' with a function:
def replacee(s):
    i = str(s).find(',')
    if i > 0:
        return s[:i] + '.' + s[i+1:]
    else:
        return s
Then you need to apply this function to every row in your column:
dfOPA['Montant']=dfOPA['Montant'].apply(replacee)
Then the convert function will work fine:
dfOPA['Montant'] = pd.to_numeric(dfOPA['Montant'],errors = 'coerce')
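As an aside, a vectorized alternative to the row-wise apply (a sketch, assuming Montant holds strings that use ',' as the decimal separator) would be:
# replace the decimal comma in one pass; regex=False treats ',' literally
dfOPA['Montant'] = pd.to_numeric(
    dfOPA['Montant'].astype(str).str.replace(',', '.', regex=False),
    errors='coerce'
)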
E.g., for converting an object like $40,000.00 to 40000 as int or float, follow these steps:
$40,000.00
1. Remove the $ sign: 40,000.00
2. Remove the comma: 40000.00
3. Remove the dot: 4000000
4. Remove empty spaces: 4000000
5. Drop NA values: 4000000
6. This is still object dtype, so convert with .astype(int): 4000000
7. Divide by 100: 40000
Implementing this in pandas:
table1["Price"] = table1["Price"].str.replace('$','')<br>
table1["Price"] = table1["Price"].str.replace(',','')<br>
table1["Price"] = table1["Price"].str.replace('.','')<br>
table1["Price"] = table1["Price"].str.replace(' ','')
table1 = table1.dropna()<br>
table1["Price"] = table1["Price"].astype(int)<br>
table1["Price"] = table1["Price"] / 100<br>
Finally it's done
I read a CSV file and get a dataframe (name: data) that has a few columns. The first few are numeric (type: pandas.core.series.Series), and the last column (label) is a binary response variable, the string 'P(ass)'/'F(ail)'.
import statsmodels.api as sm
label = data.ix[:, -1]
label[label == 'P'] = 1
label[label == 'F'] = 0
fea = data.ix[:, 0: -1]
logit = sm.Logit(label, fea)
result = logit.fit()
print result.summary()
Pandas throws this error message: "ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data)".
NumPy, pandas, and the other modules are already imported. I tried converting the fea columns to float, but it still does not go through. Could someone tell me how to correct this?
Thanks
update:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 68135 to 3002
Data columns (total 8 columns):
TestQty 500 non-null int64
WaferSize 500 non-null int64
ChuckTemp 500 non-null int64
Notch 500 non-null int64
ORIGINALDIEX 500 non-null int64
ORIGINALDIEY 500 non-null int64
DUTNo 500 non-null int64
PassFail 500 non-null object
dtypes: int64(7), object(1)
memory usage: 35.2+ KB
data.sum()
TestQty 530
WaferSize 6000
ChuckTemp 41395
Notch 135000
ORIGINALDIEX 12810
ORIGINALDIEY 7885
DUTNo 271132
PassFail 20
dtype: float64
Shouldn't your features be this:
fea = data.ix[:, 0:-1]
From your data, you see that PassFail sums to 20 before you convert 'P' to 1 and 'F' to zero. I believe that is the source of your error.
To see what is in there, try:
data.PassFail.unique()
To verify that it totals to 500 (the number of rows in the DataFrame):
sum(label == 0) + sum(label == 1)
Finally, try passing values to the function rather than Series and DataFrames:
logit = sm.Logit(label.values, fea.values)
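Putting those suggestions together, a minimal sketch (assuming the column layout from data.info() above, with PassFail as the last column) might look like:
import statsmodels.api as sm

# map the string label to 0/1 and make sure the features are plain floats
label = data['PassFail'].map({'P': 1, 'F': 0}).astype(float)
fea = data.iloc[:, 0:-1].astype(float)

# pass numpy arrays so nothing is cast to object dtype
logit = sm.Logit(label.values, fea.values)
result = logit.fit()
print(result.summary())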
I have a pandas dataframe that I filled with this:
import pandas.io.data as web
test = web.get_data_yahoo('QQQ')
The dataframe looks like this in iPython:
In [13]: test
Out[13]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 729 entries, 2010-01-04 00:00:00 to 2012-11-23 00:00:00
Data columns:
Open 729 non-null values
High 729 non-null values
Low 729 non-null values
Close 729 non-null values
Volume 729 non-null values
Adj Close 729 non-null values
dtypes: float64(5), int64(1)
When I divide one column by another, I get a float64 result that has a satisfactory number of decimal places. I can even divide one column by another column offset by one, for instance test.Open[1:]/test.Close[:], and get a satisfactory number of decimal places. When I divide a column by itself offset, however, I get just 1:
In [83]: test.Open[1:] / test.Close[:]
Out[83]:
Date
2010-01-04 NaN
2010-01-05 0.999354
2010-01-06 1.005635
2010-01-07 1.000866
2010-01-08 0.989689
2010-01-11 1.005393
...
In [84]: test.Open[1:] / test.Open[:]
Out[84]:
Date
2010-01-04 NaN
2010-01-05 1
2010-01-06 1
2010-01-07 1
2010-01-08 1
2010-01-11 1
I'm probably missing something simple. What do I need to do in order to get a useful value out of that sort of calculation? Thanks in advance for the assistance.
If you're looking to do operations between the column and lagged values, you should be doing something like test.Open / test.Open.shift().
shift realigns the data and takes an optional number of periods.
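A tiny standalone example of what shift does here (made-up prices, just to show the alignment):
import pandas as pd

s = pd.Series([10.0, 11.0, 12.0], index=pd.date_range('2010-01-04', periods=3))
print(s / s.shift())   # NaN, 1.1, 1.0909... - each value divided by the previous one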
You may not be getting what you think you are when you do test.Open[1:]/test.Close. Pandas matches up the rows based on their index, so you're still getting each element of one column divided by its corresponding element in the other column (not the element one row back). Here's an example:
>>> print d
A B C
0 1 3 7
1 -2 1 6
2 8 6 9
3 1 -5 11
4 -4 -2 0
>>> d.A / d.B
0 0.333333
1 -2.000000
2 1.333333
3 -0.200000
4 2.000000
>>> d.A[1:] / d.B
0 NaN
1 -2.000000
2 1.333333
3 -0.200000
4 2.000000
Notice that the values returned are the same for both operations. The second one just has nan for the first one, since there was no corresponding value in the first operand.
If you really want to operate on offset rows, you'll need to dig down to the numpy arrays that underpin the pandas DataFrame, to bypass pandas's index-aligning features. You can get at these innards with the values attribute of a column.
>>> d.A.values[1:] / d.B.values[:-1]
array([-0.66666667, 8. , 0.16666667, 0.8 ])
Now you really are getting each value divided by the one before it in the other column. Note that here you have to explicitly slice the second operand to leave off the last element, to make them equal in length.
So you can do the same to divide a column by an offset version of itself:
>>> d.A.values[1:] / d.A.values[:-1]
array([-2. , -4. , 0.125, -4. ])
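As a follow-up, the same offset division can stay inside pandas by using shift, which keeps the index aligned instead of returning a bare array (a sketch using the d frame from above):
# NaN, -2.0, -4.0, 0.125, -4.0 - matches the numpy result, with NaN in the first slot
print(d.A / d.A.shift())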