Pandas dataframe manipulation and plotting - python

Using WinPython 3.4 and matplotlib 1.3.1, I'm pulling data for a dataframe from a MySQL database. The raw dataframe that I get from the query looks like:
wafer_number test_type test_pass x_coord y_coord test_el_id wavelength intensity
0 HT2731 T2 1 38 54 24 288.68 4413
1 HT2731 T2 1 40 54 25 257.42 2595
2 HT2731 T2 1 50 54 28 300.00 2836
3 HT2731 T2 1 52 54 29 300.00 2862
4 HT2731 T2 1 54 54 30 300.00 3145
5 HT2731 T2 1 56 54 31 300.00 2804
6 HT2731 T2 1 58 54 32 255.69 2803
7 HT2731 T2 1 59 54 33 257.23 2991
8 HT2731 T2 1 60 54 34 262.45 3946
9 HT2731 T2 1 62 54 35 291.84 9398
10 HT2801 T2 1 38 55 54 288.68 4125
11 HT2801 T2 1 38 56 55 265.25 4258
What I need is to plot wavelength and intensity on the x and y axes respectively, with each different wafer number as its own series. I need to keep the x_coord and y_coord variables so that I can identify standout data points later, ideally by clicking on them and adding them to a list. I'll get that working after I get these things plotted.
I thought that using the built-in DataFrame plotting capability required me to first call the pivot_table method
wl_vs_int = results.pivot_table(values='intensity', rows=['x_coord', 'y_coord','wavelength'], cols='wafer_number')
on my dataframe, which turns the dataframe into:
wafer_number HT2478 HT2625 HT2644 HT2671 HT2673 HT2719 HT2731 HT2796 HT2801
x_coord y_coord wavelength
27 35 289.07 NaN NaN NaN 5137 NaN NaN NaN NaN NaN
36 250.88 4585 NaN NaN NaN NaN NaN NaN NaN NaN
37 260.90 NaN NaN NaN NaN 4270 NaN NaN NaN NaN
38 288.87 NaN NaN NaN 8191 NaN NaN NaN NaN NaN
40 259.74 NaN NaN NaN NaN 17027 NaN NaN NaN NaN
41 259.74 NaN NaN NaN NaN 18742 NaN NaN NaN NaN
42 259.74 NaN NaN NaN NaN 34098 NaN NaN NaN NaN
28 34 268.27 NaN NaN NaN NaN 2080 NaN NaN NaN NaN
38 257.42 7727 NaN NaN NaN NaN NaN NaN NaN NaN
44 260.13 NaN NaN NaN NaN 55329 NaN NaN NaN NaN
but now the index is a MultiIndex of the x and y coordinates plus the wavelength, so when I try to plot the wavelength against the columns,
plt.scatter(wl_vs_int.wavelength, wl_vs_int.columns)
I get the AttributeError:
AttributeError: 'DataFrame' object has no attribute 'wavelength'
I've tried to reindex the dataframe back to a default index, but that still tells me that the 'DataFrame' object has no 'wavelength' attribute.
There's got to be a better way to either rearrange the dataframe so this works through the built-in DataFrame plotting capabilities, or to plot only selected columns against other columns (with the columns being dynamic). I'm clearly new to Python and pandas, but I've spent days trying to do this in different ways with no results. Any help would be greatly appreciated. Thanks.

To plot wavelength and intensity on the x and y axes respectively, with each different wafer number as its own series, one can group the data by wafer_number and then deal with each group:
import pandas as pd
from io import StringIO  # Python 3; the original answer used Python 2's StringIO module
import matplotlib.pyplot as plt
data = \
"""wafer_number,test_type,test_pass,x_coord,y_coord,test_el_id,wavelength,intensity
HT2731,T2,1,38,54,24,288.68,4413
HT2731,T2,1,40,54,25,257.42,2595
HT2731,T2,1,50,54,28,300.00,2836
HT2731,T2,1,52,54,29,300.00,2862
HT2731,T2,1,54,54,30,300.00,3145
HT2731,T2,1,56,54,31,300.00,2804
HT2731,T2,1,58,54,32,255.69,2803
HT2731,T2,1,59,54,33,257.23,2991
HT2731,T2,1,60,54,34,262.45,3946
HT2731,T2,1,62,54,35,291.84,9398
HT2801,T2,1,38,55,54,288.68,4125
HT2801,T2,1,38,56,55,265.25,4258"""
df = pd.read_csv(StringIO(data), sep=',')
dfg = df.groupby('wafer_number')
colors = 'bgrcmyk'
fig, ax = plt.subplots()
for i, k in enumerate(dfg.groups.keys()):
    currentGroup = df.loc[dfg.groups[k]]
    color = colors[i % len(colors)]
    ax.plot(currentGroup['wavelength'].values, currentGroup['intensity'].values,
            ls='', color=color, label=k, marker='o', markersize=8)
legend = ax.legend(loc='upper center', shadow=True)
plt.xlabel('wavelength')
plt.ylabel('intensity')
plt.show()
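As an alternative sketch, the built-in DataFrame plotting the question asks about can also be used after a pivot. This assumes a pandas version whose pivot_table takes index=/columns= arguments and that duplicate wavelengths within a wafer may simply be averaged:
wide = df.pivot_table(values='intensity', index='wavelength', columns='wafer_number')
wide.plot(style='o')  # one series per wafer_number column, wavelength (the index) on the x axis
plt.xlabel('wavelength')
plt.ylabel('intensity')
plt.show()
Note that this drops the x_coord/y_coord information, so the groupby approach above is the better fit if you later need to click on and identify individual points.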

Related

How to create a combined wide and long format dataframe from a long input dataset?

I have a dataframe in long format which can be created using the code below:
import pandas as pd
import numpy as np
long_dict = {'Period': ['2021-01-01', '2021-02-01', '2021-01-03', '2021-02-04', '2022-02-01', '2022-03-01', '2021-03-01'],
             'Indicator': ['number of tracks', 'number of tracks', 'defects', 'defects', 'gears and bobs', 'gears and bobs', 'staff'],
             'indicator_status': ['active', 'active', 'active', 'active', 'active', 'active', 'active'],
             'mech_code': ['73', '73', '44', '44', '106', '107', '106'],
             'Value': [100, 120, 7, 17, 25, 99, 81]}
df = pd.DataFrame(long_dict)
The client/customer specifically wants a unified long and wide view using the Indicator field (for proprietary software visualisation purposes). What is the best way to achieve the output below?
Try:
df = pd.concat([df, df.pivot(columns="Indicator", values="Value")], axis=1)
print(df)
Prints:
Period Indicator indicator_status mech_code Value defects gears and bobs number of tracks staff
0 2021-01-01 number of tracks active 73 100 NaN NaN 100.0 NaN
1 2021-02-01 number of tracks active 73 120 NaN NaN 120.0 NaN
2 2021-01-03 defects active 44 7 7.0 NaN NaN NaN
3 2021-02-04 defects active 44 17 17.0 NaN NaN NaN
4 2022-02-01 gears and bobs active 106 25 NaN 25.0 NaN NaN
5 2022-03-01 gears and bobs active 107 99 NaN 99.0 NaN NaN
6 2021-03-01 staff active 106 81 NaN NaN NaN 81.0
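If the downstream visualisation tool can't cope with NaN in the wide columns (an assumption, not something stated in the question), the pivoted part can be filled before concatenating, starting again from the original long df:
wide_part = df.pivot(columns="Indicator", values="Value").fillna(0)
combined = pd.concat([df, wide_part], axis=1)
Because df.pivot without an index argument keeps the original row index, the concat aligns row by row.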

pandas.read_html tables not found

I'm trying to get a list of the major world indices in Yahoo Finance at this URL: https://finance.yahoo.com/world-indices.
I first tried to get the indices in a table by just running
major_indices=pd.read_html("https://finance.yahoo.com/world-indices")[0]
In this case the error was:
ValueError: No tables found
So I read a solution using Selenium at pandas read_html - no tables found.
The solution they came up with is (with some adjustments):
from selenium import webdriver
import pandas as pd
from selenium.webdriver.common.keys import Keys
from webdrivermanager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().download_and_install())
driver.get("https://finance.yahoo.com/world-indices")
html = driver.page_source
tables = pd.read_html(html)
data = tables[1]
Again this code gave me another error:
ValueError: No tables found
I don't know whether to keep using Selenium or whether pd.read_html is just fine. Either way, I'm trying to get this data and don't know how to proceed. Can anyone help me?
You don't need Selenium here, you just have to set the euConsentId cookie:
import pandas as pd
import requests
import uuid
url = 'https://finance.yahoo.com/world-indices'
cookies = {'euConsentId': str(uuid.uuid4())}
html = requests.get(url, cookies=cookies).content
df = pd.read_html(html)[0]
Output:
>>> df
Symbol Name Last Price Change % Change Volume Intraday High/Low 52 Week Range Day Chart
0 ^GSPC S&P 500 4023.89 93.81 +2.39% 2.545B NaN NaN NaN
1 ^DJI Dow 30 32196.66 466.36 +1.47% 388.524M NaN NaN NaN
2 ^IXIC Nasdaq 11805.00 434.04 +3.82% 5.15B NaN NaN NaN
3 ^NYA NYSE COMPOSITE (DJ) 15257.36 326.26 +2.19% 0 NaN NaN NaN
4 ^XAX NYSE AMEX COMPOSITE INDEX 4025.81 122.66 +3.14% 0 NaN NaN NaN
5 ^BUK100P Cboe UK 100 739.68 17.83 +2.47% 0 NaN NaN NaN
6 ^RUT Russell 2000 1792.67 53.28 +3.06% 0 NaN NaN NaN
7 ^VIX CBOE Volatility Index 28.87 -2.90 -9.13% 0 NaN NaN NaN
8 ^FTSE FTSE 100 7418.15 184.81 +2.55% 0 NaN NaN NaN
9 ^GDAXI DAX PERFORMANCE-INDEX 14027.93 288.29 +2.10% 0 NaN NaN NaN
10 ^FCHI CAC 40 6362.68 156.42 +2.52% 0 NaN NaN NaN
11 ^STOXX50E ESTX 50 PR.EUR 3703.42 89.99 +2.49% 0 NaN NaN NaN
12 ^N100 Euronext 100 Index 1211.74 28.89 +2.44% 0 NaN NaN NaN
13 ^BFX BEL 20 3944.56 14.35 +0.37% 0 NaN NaN NaN
14 IMOEX.ME MOEX Russia Index 2307.50 9.61 +0.42% 0 NaN NaN NaN
15 ^N225 Nikkei 225 26427.65 678.93 +2.64% 0 NaN NaN NaN
16 ^HSI HANG SENG INDEX 19898.77 518.43 +2.68% 0 NaN NaN NaN
17 000001.SS SSE Composite Index 3084.28 29.29 +0.96% 3.109B NaN NaN NaN
18 399001.SZ Shenzhen Component 11159.79 64.92 +0.59% 3.16B NaN NaN NaN
19 ^STI STI Index 3191.16 25.98 +0.82% 0 NaN NaN NaN
20 ^AXJO S&P/ASX 200 7075.10 134.10 +1.93% 0 NaN NaN NaN
21 ^AORD ALL ORDINARIES 7307.70 141.10 +1.97% 0 NaN NaN NaN
22 ^BSESN S&P BSE SENSEX 52793.62 -136.69 -0.26% 0 NaN NaN NaN
23 ^JKSE Jakarta Composite Index 6597.99 -1.85 -0.03% 0 NaN NaN NaN
24 ^KLSE FTSE Bursa Malaysia KLCI 1544.41 5.61 +0.36% 0 NaN NaN NaN
25 ^NZ50 S&P/NZX 50 INDEX GROSS 11168.18 -9.18 -0.08% 0 NaN NaN NaN
26 ^KS11 KOSPI Composite Index 2604.24 54.16 +2.12% 788539 NaN NaN NaN
27 ^TWII TSEC weighted index 15832.54 215.86 +1.38% 0 NaN NaN NaN
28 ^GSPTSE S&P/TSX Composite index 20099.81 400.76 +2.03% 294.637M NaN NaN NaN
29 ^BVSP IBOVESPA 106924.18 1236.54 +1.17% 0 NaN NaN NaN
30 ^MXX IPC MEXICO 49579.90 270.58 +0.55% 212.868M NaN NaN NaN
31 ^IPSA S&P/CLX IPSA 5058.88 0.00 0.00% 0 NaN NaN NaN
32 ^MERV MERVAL 38390.84 233.89 +0.61% 0 NaN NaN NaN
33 ^TA125.TA TA-125 1964.95 23.38 +1.20% 0 NaN NaN NaN
34 ^CASE30 EGX 30 Price Return Index 10642.40 -213.50 -1.97% 36.837M NaN NaN NaN
35 ^JN0U.JO Top 40 USD Net TRI Index 4118.19 65.63 +1.62% 0 NaN NaN NaN
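If the cookie alone ever stops being enough, one extra step that sometimes helps (an assumption about Yahoo's server-side checks, not something the answer above required) is to send a browser-like User-Agent header along with it:
headers = {'User-Agent': 'Mozilla/5.0'}
html = requests.get(url, cookies=cookies, headers=headers).content
df = pd.read_html(html)[0]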

How to preprocess a pandas dataset for TensorFlow

I'm trying to do an experiment using SVG data from the MNIST images as input to a fully connected neural network in TensorFlow.
I've parsed the images and put them into a CSV file with each row a different image; the first column is the class label (0-9) and the remaining columns are the coordinates (x1, y1, x2, y2, etc.).
csv_file = '/content/mnist_training_0s_svg.csv'
max_features = 100 # the max number in reality is more like 70, will ultimately truncate
dataframe = pd.read_csv(csv_file, header=None, names = list(range(0,max_features)))
print (dataframe.head())
0 1 2 3 4 5 6 7 8 ... 91 92 93 94 95 96 97 98 99
0 0 180 196 0 -11 -5 -16 -14 -13 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 0 152 203 -46 -18 -56 -153 -11 -153 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0 106 128 -21 -30 -21 -62 0 -54 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 0 180 90 25 -25 60 -94 60 -119 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0 115 178 -20 -29 -20 -92 1 -109 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 100 columns]
What I'm trying to do next is use pad_sequences to add zeros and truncate each "image" to 70 coordinates, and then use Keras preprocessing to normalize the data so that it can be effectively fed into the NN.
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop(0)
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe)), labels)  # TypeError: from_tensor_slices() takes 1 positional argument but 2 were given
    ds = pad_sequences(ds.as_numpy_iterator(),
                       padding='post', truncating='post', maxlen=70)
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    ds = ds.prefetch(batch_size)
    return ds
train_ds = df_to_dataset(dataframe, batch_size=5)
The details of the above are pretty arbitrary. I've tried many variations of it, only to get a variety of TypeErrors. I've gone through the pandas DataFrame documentation and the pad_sequences documentation, but I feel I'm missing something more basic, namely, in layperson's terms: how do I pull out a row of this dataframe (minus the classifier) to pad, and how do I pull out a column (minus the labels) to normalize?
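A minimal sketch of one way to separate the labels from the coordinates and pad every row to 70 values (this assumes the NaN cells are simply missing coordinates that should become zero padding, and that tensorflow.keras is available):
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

labels = dataframe.pop(0).to_numpy()                  # column 0 holds the digit class
features = pad_sequences(dataframe.fillna(0).to_numpy(),
                         padding='post', truncating='post',
                         maxlen=70, dtype='float32')  # each row cut/padded to 70 coordinates
train_ds = (tf.data.Dataset.from_tensor_slices((features, labels))
            .shuffle(len(features))
            .batch(32))
Normalization of the features (for example with a tf.keras.layers.Normalization layer adapted to features, in newer TensorFlow versions) would then happen before or inside the model rather than on the tf.data pipeline itself.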

Locating columns values in pandas dataframe with conditions

We have a dataframe (df_source):
Unnamed: 0 DATETIME DEVICE_ID COD_1 DAT_1 COD_2 DAT_2 COD_3 DAT_3 COD_4 DAT_4 COD_5 DAT_5 COD_6 DAT_6 COD_7 DAT_7
0 0 200520160941 002222111188 35 200408100500.0 12 200408100400 16 200408100300 11 200408100200 19 200408100100 35 200408100000 43
1 19 200507173541 000049000110 00 190904192701.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 20 200507173547 000049000110 00 190908185501.0 08 190908185501 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 21 200507173547 000049000110 00 190908205601.0 08 190908205601 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 22 200507173547 000049000110 00 190909005800.0 08 190909005800 NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
159 775 200529000843 000049768051 40 200529000601.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
160 776 200529000843 000049015792 00 200529000701.0 33 200529000701 NaN NaN NaN NaN NaN NaN NaN NaN NaN
161 779 200529000843 000049180500 00 200529000601.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
162 784 200529000843 000049089310 00 200529000201.0 03 200529000201 61 200529000201 NaN NaN NaN NaN NaN NaN NaN
163 786 200529000843 000049768051 40 200529000401.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
We calculated values_cont, a dict, for a subset:
v_subset = ['COD_1', 'COD_2', 'COD_3', 'COD_4', 'COD_5', 'COD_6', 'COD_7']
values_cont = pd.value_counts(df_source[v_subset].values.ravel())
We obtained as a result (value, count):
00 134
08 37
42 12
40 12
33 3
11 3
03 2
35 2
43 2
44 1
61 1
04 1
12 1
60 1
05 1
19 1
34 1
16 1
Now, the question is:
How to locate values in columns corresponding to counter, for instance:
How to locate:
df['DEVICE_ID'] # corresponding with values ('00') and counter ('134')
df['DEVICE_ID'] # corresponding with values ('08') and counter ('37')
...
df['DEVICE_ID'] # corresponding with values ('16') and counter ('1')
I believe you need DataFrame.melt, with an aggregate join for the IDs and GroupBy.size for the counts.
This implementation will result in a dataframe with a column (value) for the CODES, all the associated DEVICE_IDs, and the count of ids associated with each code.
This is an alternative to values_cont in the question.
v_subset = ['COD_1', 'COD_2', 'COD_3', 'COD_4', 'COD_5', 'COD_6', 'COD_7']
df = (df_source.melt(id_vars='DEVICE_ID', value_vars=v_subset)
               .dropna(subset=['value'])
               .groupby('value')
               .agg(DEVICE_ID=('DEVICE_ID', ','.join), count=('value', 'size'))
               .reset_index())
print(df)
value DEVICE_ID count
0 00 000049000110,000049000110,000049000110,0000490... 7
1 03 000049089310 1
2 08 000049000110,000049000110,000049000110 3
3 11 002222111188 1
4 12 002222111188 1
5 16 002222111188 1
6 19 002222111188 1
7 33 000049015792 1
8 35 002222111188,002222111188 2
9 40 000049768051,000049768051 2
10 43 002222111188 1
11 61 000049089310 1
# print DEVICE_ID for CODES == '03'
print(df.DEVICE_ID[df.value == '03'])
[out]:
1 000049089310
Name: DEVICE_ID, dtype: object
Given that the question relates to df_source, to select specific parts of that dataframe, use pandas boolean indexing:
# to return all rows where COD_1 is '00'
df_source[df_source.COD_1 == '00']
# to return only the DEVICE_ID column where COD_1 is '00'
df_source['DEVICE_ID'][df_source.COD_1 == '00']
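To locate the DEVICE_ID rows where a code appears in any of the COD_* columns rather than just COD_1, here is a sketch built on the same v_subset list:
mask = df_source[v_subset].eq('00').any(axis=1)   # True where any COD_* column equals '00'
device_ids = df_source.loc[mask, 'DEVICE_ID']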
You can use boolean indexing to select the rows that match on a column, and then pull the column of interest out of that result. There may be a more pythonic way to do this.
df2 = df_source.loc[df_source['COD_1'] == '00']   # rows where COD_1 is '00'
df_out = df2['DEVICE_ID']                         # the DEVICE_ID values for those rows
Here's more info on indexing and selecting data: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

Reindexing and filling on one level of a hierarchical index in pandas

I have a pandas dataframe with a two level hierarchical index ('item_id' and 'date'). Each row has columns for a variety of metrics for a particular item in a particular month. Here's a sample:
total_annotations unique_tags
date item_id
2007-04-01 2 30 14
2007-05-01 2 32 16
2007-06-01 2 36 19
2008-07-01 2 81 33
2008-11-01 2 82 34
2009-04-01 2 84 35
2010-03-01 2 90 35
2010-04-01 2 100 36
2010-11-01 2 105 40
2011-05-01 2 106 40
2011-07-01 2 108 42
2005-08-01 3 479 200
2005-09-01 3 707 269
2005-10-01 3 980 327
2005-11-01 3 1176 373
2005-12-01 3 1536 438
2006-01-01 3 1854 497
2006-02-01 3 2206 560
2006-03-01 3 2558 632
2007-02-01 3 5650 1019
As you can see, there are not observations for every consecutive month for each item. What I want to do is reindex the dataframe such that each item has rows for each month in a specified range. Now, this is easy to accomplish for any given item. So, for item_id 99, for example:
baseDateRange = pd.date_range('2005-07-01','2013-01-01',freq='MS')
data.xs(99,level='item_id').reindex(baseDateRange,method='ffill')
But with this method, I'd have to iterate through all the item_ids, then merge everything together, which seems woefully over-complicated.
So how can I apply this to the full dataframe, ffill-ing the observations (but also the item_id index) such that each item_id has properly filled rows for all the dates in baseDateRange?
Essentially, for each group you want to reindex and ffill. The apply gets passed a DataFrame that still has item_id and date in the index, so reset the index, then set the date index and reindex with filling.
idx is your baseDateRange from above.
In [33]: df.groupby(level='item_id').apply(
             lambda x: x.reset_index().set_index('date').reindex(idx, method='ffill')).head(30)
Out[33]:
item_id annotations tags
item_id
2 2005-07-01 NaN NaN NaN
2005-08-01 NaN NaN NaN
2005-09-01 NaN NaN NaN
2005-10-01 NaN NaN NaN
2005-11-01 NaN NaN NaN
2005-12-01 NaN NaN NaN
2006-01-01 NaN NaN NaN
2006-02-01 NaN NaN NaN
2006-03-01 NaN NaN NaN
2006-04-01 NaN NaN NaN
2006-05-01 NaN NaN NaN
2006-06-01 NaN NaN NaN
2006-07-01 NaN NaN NaN
2006-08-01 NaN NaN NaN
2006-09-01 NaN NaN NaN
2006-10-01 NaN NaN NaN
2006-11-01 NaN NaN NaN
2006-12-01 NaN NaN NaN
2007-01-01 NaN NaN NaN
2007-02-01 NaN NaN NaN
2007-03-01 NaN NaN NaN
2007-04-01 2 30 14
2007-05-01 2 32 16
2007-06-01 2 36 19
2007-07-01 2 36 19
2007-08-01 2 36 19
2007-09-01 2 36 19
2007-10-01 2 36 19
2007-11-01 2 36 19
2007-12-01 2 36 19
Building on Jeff's answer, I consider this to be somewhat more readable. It is also considerably more efficient, since only the droplevel and reindex methods are used.
df = df.set_index(['item_id', 'date'])

def fill_missing_dates(x, idx=all_dates):  # all_dates is the baseDateRange from the question
    x.index = x.index.droplevel('item_id')
    return x.reindex(idx, method='ffill')

filled_df = (df.groupby('item_id')
               .apply(fill_missing_dates))
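A further variant (a sketch, assuming all_dates is the baseDateRange from the question and that df is indexed by ['item_id', 'date'] as above) builds the full item/date grid once and forward-fills within each item:
full_idx = pd.MultiIndex.from_product(
    [df.index.get_level_values('item_id').unique(), all_dates],
    names=['item_id', 'date'])
filled_df = df.reindex(full_idx).groupby(level='item_id').ffill()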
