How to preprocess a pandas dataset for TensorFlow - python

I'm trying to do an experiment using SVG data from the MNIST images as the input to a fully-connected neural network in TensorFlow.
I've parsed the images and put them into a CSV file with each row a different image: the first column is the classifier (0-9) and the remaining columns are the coordinates (x1, y1, x2, y2, etc.).
csv_file = '/content/mnist_training_0s_svg.csv'
max_features = 100 # the max number in reality is more like 70, will ultimately truncate
dataframe = pd.read_csv(csv_file, header=None, names = list(range(0,max_features)))
print (dataframe.head())
0 1 2 3 4 5 6 7 8 ... 91 92 93 94 95 96 97 98 99
0 0 180 196 0 -11 -5 -16 -14 -13 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 0 152 203 -46 -18 -56 -153 -11 -153 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0 106 128 -21 -30 -21 -62 0 -54 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 0 180 90 25 -25 60 -94 60 -119 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0 115 178 -20 -29 -20 -92 1 -109 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 100 columns]
What I'm trying to do next is use pad_sequences to zero-pad or truncate each "image" to 70 coordinates, and then use Keras preprocessing to normalize the data so that it can be effectively input into the NN.
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop(0)  # column 0 is the classifier
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe)), labels)  # TypeError: from_tensor_slices() takes 1 positional argument but 2 were given
    ds = pad_sequences(ds.as_numpy_iterator(),
                       padding='post', truncating='post', maxlen=70)
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    ds = ds.prefetch(batch_size)
    return ds

train_ds = df_to_dataset(dataframe, batch_size=5)
The details of the above are pretty arbitrary. I've tried many variations of it, only to get a variety of TypeErrors. I've gone through the pandas DataFrame documentation and the pad_sequences documentation, but I feel I'm missing something more basic, namely, in layperson's terms: how do I pull out a row of this dataframe (minus the classifier) to pad, and how do I pull out a column (minus the labels) to normalize?
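For reference, one possible shape for the padding and normalization step, done on the DataFrame itself rather than inside tf.data (a sketch, assuming the classifier lives in column 0 and maxlen=70 as above; the Normalization layer needs a reasonably recent TF 2.x):

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

data = dataframe.copy()
labels = data.pop(0).to_numpy()             # column 0 is the classifier (0-9)
features = data.to_numpy(dtype='float32')   # remaining columns; NaN where a row is short

# pad_sequences wants a list of variable-length sequences, so drop each row's NaNs first
sequences = [row[~np.isnan(row)] for row in features]
padded = pad_sequences(sequences, padding='post', truncating='post',
                       maxlen=70, dtype='float32')

# learn per-column mean/variance from the padded training data;
# the layer can then sit at the front of the model
normalizer = tf.keras.layers.Normalization()
normalizer.adapt(padded)

train_ds = (tf.data.Dataset.from_tensor_slices((padded, labels))
            .shuffle(len(padded))
            .batch(32))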

Related

pandas.read_html tables not found

I'm trying to get a list of the major world indices in Yahoo Finance at this URL: https://finance.yahoo.com/world-indices.
I tried first to get the indices in a table by just running
major_indices=pd.read_html("https://finance.yahoo.com/world-indices")[0]
In this case the error was:
ValueError: No tables found
So I read a solution using selenium at pandas read_html - no tables found
the solution they came up with is (with some adjustment):
from selenium import webdriver
import pandas as pd
from selenium.webdriver.common.keys import Keys
from webdrivermanager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().download_and_install())
driver.get("https://finance.yahoo.com/world-indices")
html = driver.page_source
tables = pd.read_html(html)
data = tables[1]
Again this code gave me another error:
ValueError: No tables found
I don't know whether to keep using Selenium or whether pd.read_html is just fine. Either way, I'm trying to get this data and don't know how to proceed. Can anyone help me?
You don't need Selenium here; you just have to set the euConsentId cookie:
import pandas as pd
import requests
import uuid
url = 'https://finance.yahoo.com/world-indices'
cookies = {'euConsentId': str(uuid.uuid4())}
html = requests.get(url, cookies=cookies).content
df = pd.read_html(html)[0]
Output:
>>> df
Symbol Name Last Price Change % Change Volume Intraday High/Low 52 Week Range Day Chart
0 ^GSPC S&P 500 4023.89 93.81 +2.39% 2.545B NaN NaN NaN
1 ^DJI Dow 30 32196.66 466.36 +1.47% 388.524M NaN NaN NaN
2 ^IXIC Nasdaq 11805.00 434.04 +3.82% 5.15B NaN NaN NaN
3 ^NYA NYSE COMPOSITE (DJ) 15257.36 326.26 +2.19% 0 NaN NaN NaN
4 ^XAX NYSE AMEX COMPOSITE INDEX 4025.81 122.66 +3.14% 0 NaN NaN NaN
5 ^BUK100P Cboe UK 100 739.68 17.83 +2.47% 0 NaN NaN NaN
6 ^RUT Russell 2000 1792.67 53.28 +3.06% 0 NaN NaN NaN
7 ^VIX CBOE Volatility Index 28.87 -2.90 -9.13% 0 NaN NaN NaN
8 ^FTSE FTSE 100 7418.15 184.81 +2.55% 0 NaN NaN NaN
9 ^GDAXI DAX PERFORMANCE-INDEX 14027.93 288.29 +2.10% 0 NaN NaN NaN
10 ^FCHI CAC 40 6362.68 156.42 +2.52% 0 NaN NaN NaN
11 ^STOXX50E ESTX 50 PR.EUR 3703.42 89.99 +2.49% 0 NaN NaN NaN
12 ^N100 Euronext 100 Index 1211.74 28.89 +2.44% 0 NaN NaN NaN
13 ^BFX BEL 20 3944.56 14.35 +0.37% 0 NaN NaN NaN
14 IMOEX.ME MOEX Russia Index 2307.50 9.61 +0.42% 0 NaN NaN NaN
15 ^N225 Nikkei 225 26427.65 678.93 +2.64% 0 NaN NaN NaN
16 ^HSI HANG SENG INDEX 19898.77 518.43 +2.68% 0 NaN NaN NaN
17 000001.SS SSE Composite Index 3084.28 29.29 +0.96% 3.109B NaN NaN NaN
18 399001.SZ Shenzhen Component 11159.79 64.92 +0.59% 3.16B NaN NaN NaN
19 ^STI STI Index 3191.16 25.98 +0.82% 0 NaN NaN NaN
20 ^AXJO S&P/ASX 200 7075.10 134.10 +1.93% 0 NaN NaN NaN
21 ^AORD ALL ORDINARIES 7307.70 141.10 +1.97% 0 NaN NaN NaN
22 ^BSESN S&P BSE SENSEX 52793.62 -136.69 -0.26% 0 NaN NaN NaN
23 ^JKSE Jakarta Composite Index 6597.99 -1.85 -0.03% 0 NaN NaN NaN
24 ^KLSE FTSE Bursa Malaysia KLCI 1544.41 5.61 +0.36% 0 NaN NaN NaN
25 ^NZ50 S&P/NZX 50 INDEX GROSS 11168.18 -9.18 -0.08% 0 NaN NaN NaN
26 ^KS11 KOSPI Composite Index 2604.24 54.16 +2.12% 788539 NaN NaN NaN
27 ^TWII TSEC weighted index 15832.54 215.86 +1.38% 0 NaN NaN NaN
28 ^GSPTSE S&P/TSX Composite index 20099.81 400.76 +2.03% 294.637M NaN NaN NaN
29 ^BVSP IBOVESPA 106924.18 1236.54 +1.17% 0 NaN NaN NaN
30 ^MXX IPC MEXICO 49579.90 270.58 +0.55% 212.868M NaN NaN NaN
31 ^IPSA S&P/CLX IPSA 5058.88 0.00 0.00% 0 NaN NaN NaN
32 ^MERV MERVAL 38390.84 233.89 +0.61% 0 NaN NaN NaN
33 ^TA125.TA TA-125 1964.95 23.38 +1.20% 0 NaN NaN NaN
34 ^CASE30 EGX 30 Price Return Index 10642.40 -213.50 -1.97% 36.837M NaN NaN NaN
35 ^JN0U.JO Top 40 USD Net TRI Index 4118.19 65.63 +1.62% 0 NaN NaN NaN
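From there it's ordinary DataFrame slicing; for example, to keep just a few of the parsed columns:

print(df[['Symbol', 'Name', 'Last Price', '% Change']].head())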

Locating columns values in pandas dataframe with conditions

We have a dataframe (df_source):
Unnamed: 0 DATETIME DEVICE_ID COD_1 DAT_1 COD_2 DAT_2 COD_3 DAT_3 COD_4 DAT_4 COD_5 DAT_5 COD_6 DAT_6 COD_7 DAT_7
0 0 200520160941 002222111188 35 200408100500.0 12 200408100400 16 200408100300 11 200408100200 19 200408100100 35 200408100000 43
1 19 200507173541 000049000110 00 190904192701.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 20 200507173547 000049000110 00 190908185501.0 08 190908185501 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 21 200507173547 000049000110 00 190908205601.0 08 190908205601 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 22 200507173547 000049000110 00 190909005800.0 08 190909005800 NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
159 775 200529000843 000049768051 40 200529000601.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
160 776 200529000843 000049015792 00 200529000701.0 33 200529000701 NaN NaN NaN NaN NaN NaN NaN NaN NaN
161 779 200529000843 000049180500 00 200529000601.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
162 784 200529000843 000049089310 00 200529000201.0 03 200529000201 61 200529000201 NaN NaN NaN NaN NaN NaN NaN
163 786 200529000843 000049768051 40 200529000401.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
We calculated values_cont (a Series of value counts) for a subset:
v_subset = ['COD_1', 'COD_2', 'COD_3', 'COD_4', 'COD_5', 'COD_6', 'COD_7']
values_cont = pd.value_counts(df_source[v_subset].values.ravel())
We obtained as result (values, counter):
00 134
08 37
42 12
40 12
33 3
11 3
03 2
35 2
43 2
44 1
61 1
04 1
12 1
60 1
05 1
19 1
34 1
16 1
Now, the question is:
How to locate values in columns corresponding to counter, for instance:
How to locate:
df['DEVICE_ID'] # corresponding with values ('00') and counter ('134')
df['DEVICE_ID'] # corresponding with values ('08') and counter ('37')
...
df['DEVICE_ID'] # corresponding with values ('16') and counter ('1')
I believe you need DataFrame.melt with aggregate join for ID and GroupBy.size for counts.
This implementation will result in a dataframe with a column (value) for the CODES, all the associated DEVICE_IDs, and the count of ids associated with each code.
This is an alternative to values_cont in the question.
v_subset = ['COD_1', 'COD_2', 'COD_3', 'COD_4', 'COD_5', 'COD_6', 'COD_7']
df = (df_source.melt(id_vars='DEVICE_ID', value_vars=v_subset)
               .dropna(subset=['value'])
               .groupby('value')
               .agg(DEVICE_ID=('DEVICE_ID', ','.join), count=('value', 'size'))
               .reset_index())
print(df)
value DEVICE_ID count
0 00 000049000110,000049000110,000049000110,0000490... 7
1 03 000049089310 1
2 08 000049000110,000049000110,000049000110 3
3 11 002222111188 1
4 12 002222111188 1
5 16 002222111188 1
6 19 002222111188 1
7 33 000049015792 1
8 35 002222111188,002222111188 2
9 40 000049768051,000049768051 2
10 43 002222111188 1
11 61 000049089310 1
# print DEVICE_ID for CODES == '03'
print(df.DEVICE_ID[df.value == '03'])
[out]:
1 000049089310
Name: DEVICE_ID, dtype: object
Given that the question relates to df_source, to select specific parts of the dataframe, use pandas Boolean Indexing:
# to return all rows where COD_1 is '00'
df_source[df_source.COD_1 == '00']
# to return only the DEVICE_ID column where COD_1 is '00'
df_source['DEVICE_ID'][df_source.COD_1 == '00']
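If the value can appear in any of the COD_ columns rather than just COD_1, one option (a sketch reusing v_subset from the question, and assuming the codes were read in as strings) is a row-wise mask over the whole subset:

# rows where any of COD_1..COD_7 equals '00', then the matching device ids
mask = df_source[v_subset].eq('00').any(axis=1)
device_ids = df_source.loc[mask, 'DEVICE_ID']
print(len(device_ids), device_ids.unique())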
You can use boolean indexing with df.loc to search out rows that match on a column's value, and then select the column of interest from the result. There may be a more pythonic way to do this.
df2 = df.loc[df['COD_1'] == '00']    # rows where COD_1 matches the value of interest
df3 = df2.loc[df2['DAT_1'] == 134]   # optionally narrow further on another column
df_out = df3['DEVICE_ID']            # then pull out the column to output
Here's more info on indexing with .loc/.iloc: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

Need to iterate over row to check conditions and retrieve values from different columns if the conditions are met

I have daily price data for a stock. Pasting the last 31 rows of the data as an example dataset below:
Date RSI Smooth max min
110 2019-02-13 38.506874 224.006543 NaN NaN
111 2019-02-14 39.567068 227.309923 NaN NaN
112 2019-02-15 43.774479 229.830776 NaN NaN
113 2019-02-18 43.651440 231.690179 NaN NaN
114 2019-02-19 43.467237 232.701976 NaN NaN
115 2019-02-20 44.370123 233.526131 NaN NaN
116 2019-02-21 45.605073 233.834988 233.834988 NaN
117 2019-02-22 46.837518 232.335179 NaN NaN
118 2019-02-25 42.087860 229.570711 NaN NaN
119 2019-02-26 39.008014 226.379526 NaN NaN
120 2019-02-27 39.542339 225.607475 NaN 225.607475
121 2019-02-28 39.051104 228.305615 NaN NaN
122 2019-03-01 48.191687 232.544289 NaN NaN
123 2019-03-05 51.909527 237.063534 NaN NaN
124 2019-03-06 52.988668 240.243201 NaN NaN
125 2019-03-07 54.205990 242.265173 NaN NaN
126 2019-03-08 54.967076 243.912033 NaN NaN
127 2019-03-11 58.080738 244.432163 244.432163 NaN
128 2019-03-12 55.587328 243.573710 NaN NaN
129 2019-03-13 51.714123 241.191933 NaN NaN
130 2019-03-14 48.948075 238.470485 NaN NaN
131 2019-03-15 46.615111 236.144640 NaN NaN
132 2019-03-18 48.219815 233.588265 NaN NaN
133 2019-03-19 41.866898 230.271903 NaN 230.271903
134 2019-03-20 34.818844 239.457110 NaN NaN
135 2019-03-22 42.167870 246.824173 NaN NaN
136 2019-03-25 60.228588 255.294124 NaN NaN
137 2019-03-26 66.896640 267.069173 NaN NaN
138 2019-03-27 68.823285 278.222343 NaN NaN
139 2019-03-28 63.654023 289.042091 289.042091 NaN
I am trying to develop the code logic as below:
If max > 0, then search for the previous non-zero max value and assign it to max2. Also, assign the corresponding RSI of the previous non-zero max to RSI2.
Desired output:
For line 139 in the data set, max2 will be 244.432163 and RSI2 will be 58.080738
For line 138 in the data set, max2 will be 0 and RSI2 will be 0, and so on...
I tried different approaches but was unsuccessful at getting any output, so I do not have sample code to paste.
I also tried using if loops but I am unable to make them work. I am very new at programming.
First you will need to iterate the dataframe.
Then you will need to store the previous values that you will need to save on the next hit. Since you are always going back to the previous max, you can reuse that as you loop through.
Something like this (did not test, just for an idea):
last_max = 0
last_rsi = 0
for index, row in df.iterrows():
    if pd.notna(row['max']):          # only rows where a max was marked
        df.at[index, 'max2'] = last_max   # write back via df.at; assigning to row won't persist
        df.at[index, 'RSI2'] = last_rsi
        last_max = row['max']         # store this max/RSI for next time
        last_rsi = row['RSI']
The right answer is to add a line of code as below:
df[['max2', 'RSI2']] = df[['max', 'RSI']].dropna(subset=['max']).shift(1).fillna(0)
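The same line broken into steps may make the index alignment easier to follow (a sketch using the column names from the data above):

peaks = df[['max', 'RSI']].dropna(subset=['max'])  # only the rows where a max was marked
prev = peaks.shift(1).fillna(0)                    # each peak's previous max/RSI, 0 for the first one
df[['max2', 'RSI2']] = prev                        # index alignment leaves all other rows as NaN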

Pandas dataframe manipulation and plotting

Using WinPython 3.4, matplotlib 1.3.1, I'm pulling data for a dataframe from a mysql database. The raw dataframe that I get from the query looks like:
wafer_number test_type test_pass x_coord y_coord test_el_id wavelength intensity
0 HT2731 T2 1 38 54 24 288.68 4413
1 HT2731 T2 1 40 54 25 257.42 2595
2 HT2731 T2 1 50 54 28 300.00 2836
3 HT2731 T2 1 52 54 29 300.00 2862
4 HT2731 T2 1 54 54 30 300.00 3145
5 HT2731 T2 1 56 54 31 300.00 2804
6 HT2731 T2 1 58 54 32 255.69 2803
7 HT2731 T2 1 59 54 33 257.23 2991
8 HT2731 T2 1 60 54 34 262.45 3946
9 HT2731 T2 1 62 54 35 291.84 9398
10 HT2801 T2 1 38 55 54 288.68 4125
11 HT2801 T2 1 38 56 55 265.25 4258
What I need is to plot wavelength and intensity on the x and y axes respectively, with each different wafer number as its own series. I need to keep the x_coord and y_coord variables so that I can identify standout data points later, ideally by clicking on them and adding them to a list. I'll get that working after I get these things plotted.
I thought that using the built-in dataframe plotting capability requires me to perform a pivot_table operation
wl_vs_int = results.pivot_table(values='intensity', rows=['x_coord', 'y_coord','wavelength'], cols='wafer_number')
on my dataframe which then turns the dataframe into:
wafer_number HT2478 HT2625 HT2644 HT2671 HT2673 HT2719 HT2731 HT2796 HT2801
x_coord y_coord wavelength
27 35 289.07 NaN NaN NaN 5137 NaN NaN NaN NaN NaN
36 250.88 4585 NaN NaN NaN NaN NaN NaN NaN NaN
37 260.90 NaN NaN NaN NaN 4270 NaN NaN NaN NaN
38 288.87 NaN NaN NaN 8191 NaN NaN NaN NaN NaN
40 259.74 NaN NaN NaN NaN 17027 NaN NaN NaN NaN
41 259.74 NaN NaN NaN NaN 18742 NaN NaN NaN NaN
42 259.74 NaN NaN NaN NaN 34098 NaN NaN NaN NaN
28 34 268.27 NaN NaN NaN NaN 2080 NaN NaN NaN NaN
38 257.42 7727 NaN NaN NaN NaN NaN NaN NaN NaN
44 260.13 NaN NaN NaN NaN 55329 NaN NaN NaN NaN
but now the index is a multi-index of the x, y coords and the wavelength, so when I just try to plot the wavelength against the columns,
plt.scatter(wl_vs_int.wavelength, wl_vs_int.columns)
I get the AttributeError:
AttributeError: 'DataFrame' object has no attribute 'wavelength'
I've tried to reindex the dataframe back to a default index, but that still gives me the error that the 'DataFrame' object has no 'wavelength' attribute.
There's got to be a better way to either rearrange the dataframe to make this possible through the built-in dataframe plotting capabilities or to plot only select columns vs other columns (with the columns being dynamic). I'm clearly new to python and pandas but I've spent days of time trying to do this in different ways and with no results. Any help would be greatly appreciated. Thanks.
To plot wavelength and intensity on the x and y axes respectively, with each different wafer number as its own series, one can group the data by wafer_number and then deal with each group:
import pandas as pd
from io import StringIO  # Python 3; the original 'StringIO' module is Python 2 only
import matplotlib.pyplot as plt
data = \
"""wafer_number,test_type,test_pass,x_coord,y_coord,test_el_id,wavelength,intensity
HT2731,T2,1,38,54,24,288.68,4413
HT2731,T2,1,40,54,25,257.42,2595
HT2731,T2,1,50,54,28,300.00,2836
HT2731,T2,1,52,54,29,300.00,2862
HT2731,T2,1,54,54,30,300.00,3145
HT2731,T2,1,56,54,31,300.00,2804
HT2731,T2,1,58,54,32,255.69,2803
HT2731,T2,1,59,54,33,257.23,2991
HT2731,T2,1,60,54,34,262.45,3946
HT2731,T2,1,62,54,35,291.84,9398
HT2801,T2,1,38,55,54,288.68,4125
HT2801,T2,1,38,56,55,265.25,4258"""
df = pd.read_csv(StringIO(data), sep=',')
dfg = df.groupby('wafer_number')
colors = 'bgrcmyk'
fig, ax = plt.subplots()
for i, k in enumerate(dfg.groups.keys()):
    currentGroup = df.loc[dfg.groups[k]]
    color = colors[i % len(colors)]
    ax.plot(currentGroup['wavelength'].values, currentGroup['intensity'].values,
            ls='', color=color, label=k, marker='o', markersize=8)
legend = ax.legend(loc='upper center', shadow=True)
plt.xlabel('wavelength')
plt.ylabel('intensity')
plt.show()
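Depending on the pandas version, the same loop can also lean on the DataFrame's built-in scatter plotting instead of bare ax.plot; a sketch using the same df and dfg as above:

colors = 'bgrcmyk'
fig, ax = plt.subplots()
for i, (name, group) in enumerate(dfg):
    group.plot(kind='scatter', x='wavelength', y='intensity', ax=ax,
               label=name, color=colors[i % len(colors)])
ax.legend(loc='upper center', shadow=True)
plt.show()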

Reindexing and filling on one level of a hierarchical index in pandas

I have a pandas dataframe with a two level hierarchical index ('item_id' and 'date'). Each row has columns for a variety of metrics for a particular item in a particular month. Here's a sample:
total_annotations unique_tags
date item_id
2007-04-01 2 30 14
2007-05-01 2 32 16
2007-06-01 2 36 19
2008-07-01 2 81 33
2008-11-01 2 82 34
2009-04-01 2 84 35
2010-03-01 2 90 35
2010-04-01 2 100 36
2010-11-01 2 105 40
2011-05-01 2 106 40
2011-07-01 2 108 42
2005-08-01 3 479 200
2005-09-01 3 707 269
2005-10-01 3 980 327
2005-11-01 3 1176 373
2005-12-01 3 1536 438
2006-01-01 3 1854 497
2006-02-01 3 2206 560
2006-03-01 3 2558 632
2007-02-01 3 5650 1019
As you can see, there is not an observation for every consecutive month for each item. What I want to do is reindex the dataframe such that each item has rows for each month in a specified range. Now, this is easy to accomplish for any given item. So, for item_id 99, for example:
baseDateRange = pd.date_range('2005-07-01','2013-01-01',freq='MS')
data.xs(99,level='item_id').reindex(baseDateRange,method='ffill')
But with this method, I'd have to iterate through all the item_ids, then merge everything together, which seems woefully over-complicated.
So how can I apply this to the full dataframe, ffill-ing the observations (but also the item_id index) such that each item_id has properly filled rows for all the dates in baseDateRange?
Essentially for each group you want to reindex and ffill. The apply gets passed a data frame that has the item_id and date still in the index, so reset, then set and reindex with filling.
idx is your baseDateRange from above.
In [33]: df.groupby(level='item_id').apply(
             lambda x: x.reset_index().set_index('date').reindex(idx, method='ffill')).head(30)
Out[33]:
item_id annotations tags
item_id
2 2005-07-01 NaN NaN NaN
2005-08-01 NaN NaN NaN
2005-09-01 NaN NaN NaN
2005-10-01 NaN NaN NaN
2005-11-01 NaN NaN NaN
2005-12-01 NaN NaN NaN
2006-01-01 NaN NaN NaN
2006-02-01 NaN NaN NaN
2006-03-01 NaN NaN NaN
2006-04-01 NaN NaN NaN
2006-05-01 NaN NaN NaN
2006-06-01 NaN NaN NaN
2006-07-01 NaN NaN NaN
2006-08-01 NaN NaN NaN
2006-09-01 NaN NaN NaN
2006-10-01 NaN NaN NaN
2006-11-01 NaN NaN NaN
2006-12-01 NaN NaN NaN
2007-01-01 NaN NaN NaN
2007-02-01 NaN NaN NaN
2007-03-01 NaN NaN NaN
2007-04-01 2 30 14
2007-05-01 2 32 16
2007-06-01 2 36 19
2007-07-01 2 36 19
2007-08-01 2 36 19
2007-09-01 2 36 19
2007-10-01 2 36 19
2007-11-01 2 36 19
2007-12-01 2 36 19
Building on Jeff's answer, I consider this to be somewhat more readable. It is also considerably more efficient since only the droplevel and reindex methods are used.
df = df.set_index(['item_id', 'date'])

def fill_missing_dates(x, idx=all_dates):
    x.index = x.index.droplevel('item_id')
    return x.reindex(idx, method='ffill')

filled_df = (df.groupby('item_id')
               .apply(fill_missing_dates))
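Here all_dates is assumed to be the same monthly range the question builds as baseDateRange, e.g.:

all_dates = pd.date_range('2005-07-01', '2013-01-01', freq='MS')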
