Despite searching for and adapting several potential solutions online and on StackOverflow, I am making no headway. I was testing an air quality index (AQI) library I recently discovered (install it with $ pip install python-aqi) against two different inputs. The first is a single value stored in a variable; the second is a series of values in a DataFrame. For the former the code ran successfully, but for the latter I keep getting this error: InvalidOperation: [<class 'decimal.ConversionSyntax'>]. Please help. Thanks. Here is the link to the data: https://drive.google.com/file/d/16QQnDIEFkDQGafNYTk6apY9JcRimRBxH/view?usp=sharing
Code 1 - successful
import aqi # import air quality library
import pandas as pd
# define the PM2.5 variable and store a value
pmt_2_5 = 60.5
mush_2pt5 = aqi.to_iaqi(aqi.POLLUTANT_PM25, str(pmt_2_5)) #AQI code
mush_2pt5
Code 2 - Unsuccessful
import aqi # import air quality library
import pandas as pd
# Read data
df_mush = pd.read_csv("book1.csv", parse_dates=["Date"], index_col="Date", sep=",")
# select the PM2.5 column (third column) of the data frame
pmt_2_5 = df_mush['mush-pm_2_5']
# parse the variable and calculate AQI
df_mush_2pt5 = aqi.to_iaqi(aqi.POLLUTANT_PM25, str(pmt_2_5))
df_mush_2pt5
You can't pass a Series, only a string, so apply the conversion element-wise:
df_mush = pd.read_csv('book1.csv', parse_dates=['Date'], index_col='Date', sep=',')
df_mush_2pt5 = (df_mush['mush-pm_2_5'].astype(str)
                .map(lambda cc: aqi.to_iaqi(aqi.POLLUTANT_PM25, cc)))
Output:
# it's not a dataframe but a series
>>> df_mush_2pt5
0 154
1 152
2 162
3 153
4 153
5 158
6 153
7 134
8 151
9 136
10 154
Name: mush-pm_2_5, dtype: object
Documentation:
Help on function to_iaqi in module aqi:
to_iaqi(elem, cc, algo='aqi.algos.epa')
Calculate an intermediate AQI for a given pollutant. This is the
heart of the algo.
.. warning:: the concentration is passed as a string so
:class:`decimal.Decimal` doesn't act up with binary floats.
:param elem: pollutant constant
:type elem: int
:param cc: pollutant concentration (µg/m³ or ppm)
:type cc: str # <-- This is a string, not Series
:param algo: algorithm module canonical name
:type algo: str
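Note that to_iaqi returns decimal.Decimal objects, which is why the mapped Series has dtype object. If you need plain floats for plotting or further math, a minimal sketch, assuming the same book1.csv layout as above:

import aqi
import pandas as pd

df_mush = pd.read_csv('book1.csv', parse_dates=['Date'], index_col='Date')

# str() satisfies to_iaqi's string requirement; float() casts the
# returned Decimal back into a regular number
df_mush['iaqi_pm25'] = (df_mush['mush-pm_2_5']
                        .map(lambda cc: float(aqi.to_iaqi(aqi.POLLUTANT_PM25, str(cc)))))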
Related
I have code that I am trying to run that should be fairly simple; it is just math. But whenever I try to do the math with the pandas DataFrame I'm using, I get an error. I know that the column labeled 'first' is the one giving me issues; I have gone through and checked all the others. I have also tried a few things to convert it to a column of floating-point values so I can do the math with it, but I get errors every time. I am attaching the code along with comments on what I have tried and what errors I got.
Any help would be greatly appreciated! I am very stuck on this. Thank you!
import pandas as pd

# Set constants
pi = 3.14159265359
e = 2.71828
h = 6.62607004*(10**-34) # J*s
c = 299792458 # m / s
kb = 1.380649*(10**-23) # J/K
temp3 = 3000 # K
temp10 = 10000 # K
constant = (2*pi*h*(c**2))
bb_df = pd.DataFrame({'wl_nm': range(200, 1101, 1)}) # Gets wavelength ranges I will want plotted
#bb_df.wl
bb_df['wl_m'] = (bb_df.wl_nm * (10**-9)) # Gets wavelength in meters (this one does work doing math with)
bb_df['first'] = constant/((bb_df.wl_m)**5) # This one does not work doing math with; says it's a method, not number, and cannot figure out how to change it
#bb_df['first'] = bb_df['first'].astype(float) # Tried this, but get error: TypeError: Cannot broadcast np.ndarray with operand of type <class 'method'>
#float(bb_df['first']) # Tried this, but get error: TypeError: cannot convert the series to <class 'float'>
bb_df['exponent'] = (h*c)/((bb_df.wl_m)*kb*temp3)
bb_df['denominator'] = e ** (bb_df.exponent) - 1
bb_df['second'] = 1 / bb_df.denominator
bb_df['second'] = bb_df.second + 1
bb_df['final'] = (bb_df.first) * (bb_df.second) # ERROR (because of bb_df.first)
#bb_df['test'] = float(bb_df.first) - float(bb_df.second)
#bb_df['intensity'] = (((2*pi*h*(c**2))/((bb_df.wl_m**5))(1/(e**((h*c)/((bb_df.wl_m)*kb*temp3))-1)))) # Also just tried typing out entire equation here, but this also gives an error
print(bb_df)
When I comment out all the lines that are not working, this is the DataFrame I get. It is the 'first' column that says it is a method, and I have had trouble converting it to a floating-point value to do math with. I thought perhaps it was because the numbers are so small, but then I should not have had the issue when I tried typing out the entire equation at once (the 'intensity' column attempt, which also did not work):
wl_nm wl_m first exponent denominator second
0 200 2.000000e-07 1.169304e+18 23.979614 2.595417e+10 1.000000
1 201 2.010000e-07 1.140505e+18 23.860313 2.303537e+10 1.000000
2 202 2.020000e-07 1.112552e+18 23.742192 2.046898e+10 1.000000
3 203 2.030000e-07 1.085418e+18 23.625236 1.820969e+10 1.000000
4 204 2.040000e-07 1.059074e+18 23.509426 1.621836e+10 1.000000
.. ... ... ... ... ... ...
896 1096 1.096000e-06 2.366053e+14 4.375842 7.850652e+01 1.012738
897 1097 1.097000e-06 2.355289e+14 4.371853 7.819001e+01 1.012789
898 1098 1.098000e-06 2.344583e+14 4.367871 7.787533e+01 1.012841
899 1099 1.099000e-06 2.333935e+14 4.363897 7.756247e+01 1.012893
900 1100 1.100000e-06 2.323346e+14 4.359930 7.725142e+01 1.012945
[901 rows x 6 columns]
From the documentation:
The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed, but s['min'] is possible.
You cannot use bb_df.first to mean the same thing as bb_df['first'], because there is already a .first method of the DataFrame class. That's why the error message tells you that bb_df.first is a method - because it is, one that is pre-defined by Pandas. The first column of your DataFrame contains floating-point values the entire time, and no attempt to convert the already-floating-point values into floating-point is relevant because the problem is not with the column. The problem is that the code bb_df.first does not access the column.
Just use indexing consistently (bb_df['first']). Attribute access is only a convenience, and it breaks whenever a column name collides with an existing DataFrame attribute or method.
Instead of typing bb_df.columnname, change it to bb_df['columnname']. It fixes the issue!
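To see the collision directly, and the fix applied to the failing line (a quick check against the bb_df built above):

>>> type(bb_df.first)     # attribute access resolves to the DataFrame.first method
<class 'method'>
>>> bb_df['first'].dtype  # the column itself already holds floats
dtype('float64')
>>> bb_df['final'] = bb_df['first'] * bb_df['second']  # bracket indexing works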
I am trying to scrape a table from a website using pandas. The code is shown below:
import pandas as pd
url = "http://mnregaweb4.nic.in/netnrega/state_html/empstatusnewall_scst.aspx?page=S&lflag=eng&state_name=KERALA&state_code=16&fin_year=2020-2021&source=national&Digest=s5wXOIOkT98cNVkcwF6NQA"
df1 = pd.read_html(url)[3]
df1.to_excel("combinedGP.xlsx", index=False)
In the resulting Excel file, the numbers are saved as text. Since I am planning to build a file with around 1000 rows, I cannot manually change the data type. Is there another way to store them as actual values and not text? TIA
The website can be very unresponsive... There are unwanted header rows plus two rows of column headers. A simple way to manage this is a to_csv() / read_csv() round trip with appropriate parameters; re-reading from CSV also lets pandas re-infer the dtypes, so the numbers come back as values rather than text.
import pandas as pd
import io
url = "http://mnregaweb4.nic.in/netnrega/state_html/empstatusnewall_scst.aspx?page=S&lflag=eng&state_name=KERALA&state_code=16&fin_year=2020-2021&source=national&Digest=s5wXOIOkT98cNVkcwF6NQA"
df1 = pd.read_html(url)[3]
df1 = pd.read_csv(io.StringIO(df1.to_csv(index=False)), skiprows=3, header=[0,1])  # skip junk rows, take two header rows, re-infer dtypes
# df1.to_excel("combinedGP.xlsx", index=False)
sample after cleaning up
S.No District HH issued jobcards No. of HH Provided Employment EMP. Provided No. of Persondays generated Families Completed 100 Days
S.No District SCs STs Others Total SCs STs Others Total No. of Women SCs STs Others Total Women SCs STs Others Total
0 1.0 ALAPPUZHA 32555 760 254085 287400 20237 565 132744 153546 157490 1104492 40209 6875586 8020287 7635748 1346 148 5840 7334
1 2.0 ERNAKULAM 36907 2529 212534 251970 15500 1517 68539 85556 82270 908035 104040 3788792 4800867 4467329 2848 301 11953 15102
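If some columns still come through as text after the round trip, a minimal sketch that coerces whichever columns parse cleanly as numbers, leaving genuinely textual ones (District, for example) alone:

# Coerce each column that parses cleanly as numeric; keep the rest as-is
for col in df1.columns:
    converted = pd.to_numeric(df1[col], errors='coerce')
    if converted.notna().all():
        df1[col] = converted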
I have an array that takes in string values from a JSON file. I want to create a document matrix to see the repeated words, but when I pass in the array I get an error:
AttributeError: 'NoneType' object has no attribute 'lower'
This is the line that always raises the error:
sparse_matrix = count_vectorizer.fit_transform(issues_description)
import json
import pandas as pd

issues_description = []
issues_key = []

with open('issues_CLOVER.json') as json_file:
    data = json.load(json_file)

for record in data:
    issues_key.append(record['key'])
    issues_description.append(record['fields']['description'])

df = pd.DataFrame({'Key': issues_key, 'Description': issues_description})
df.head(10)
This is the data that gets displayed:
Key Description
0 CLOV-1985 h2. Environment Details\r\n\r\nThis bug occurs...
1 CLOV-1984 Clover fails to instrument source code in case...
2 CLOV-1979 If a type argument for a parameterized type ha...
3 CLOV-1978 Bug affects Clover 3.3.0 and higher.\r\n\r\n \...
4 CLOV-1977 Add support to able to:\r\n * instrument sourc...
5 CLOV-1976 Add support to Groovy code in Clover for Eclip...
6 CLOV-1973 See also --CLOV-1956--.\r\n\r\nIn case HUDSON_...
7 CLOV-1970 Steps to reproduce:\r\n\r\nCoverage Explorer >...
8 CLOV-1967 Test Clover against IntelliJ IDEA 2016.3 EAP (...
9 CLOV-1966 *Problem*\r\n\r\nClover Maven Plugin replaces ...
# Scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

# Create the Document Term Matrix
count_vectorizer = CountVectorizer(stop_words='english')
count_vectorizer = CountVectorizer()  # note: this overwrites the stop_words vectorizer above

sparse_matrix = count_vectorizer.fit_transform(issues_description)

# OPTIONAL: convert the sparse matrix to a pandas DataFrame to see the word frequencies
doc_term_matrix = sparse_matrix.todense()
df = pd.DataFrame(doc_term_matrix,
                  columns=count_vectorizer.get_feature_names(),
                  index=[issues_key[0], issues_key[1], issues_key[2]])
df
What do I change to make issues_description a passable argument, or can someone point me to what I need to know to make this work?
Thanks.
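No answer is shown here, but the traceback suggests a likely cause: CountVectorizer calls .lower() on every document, which fails if any record has a null description. A minimal sketch (this assumes some descriptions are None, which the error message is consistent with) that substitutes empty strings before vectorizing:

# Assumption: some record['fields']['description'] values are None
issues_description = [d if d is not None else '' for d in issues_description]
sparse_matrix = count_vectorizer.fit_transform(issues_description)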
The following is a simplified blob of my dataframe. I want to process this file:
first.csv
,No.,Time,Source,Destination,Protocol,Length,Info,src_dst_pair
325778,112.305107,02:e0,Broadcast,ARP,64,Who has 253.244.230.77? Tell 253.244.230.67,"('02:e0', 'Broadcast')"
801130,261.868118,02:e0,Broadcast,ARP,64,Who has 253.244.230.156? Tell 253.244.230.67,"('02:e0', 'Broadcast')"
700094,222.055094,02:e0,Broadcast,ARP,60,Who has 253.244.230.77? Tell 253.244.230.156,"('02:e0', 'Broadcast')"
766542,766543,247.796156,100.118.138.150,41.177.26.176,TCP,66,32222 > http [SYN] Seq=0,"('100.118.138.150', '41.177.26.176')"
767405,248.073313,100.118.138.150,41.177.26.176,TCP,64,32222 > http [ACK] Seq=1,"('100.118.138.150', '41.177.26.176')"
767466,248.083268,100.118.138.150,41.177.26.176,HTTP,380,Continuation [Packet capture],"('100.118.138.150', '41.177.26.176')"
I have all the unique elements of the last column, src_dst_pair:
uniq_src_dst_pair = numpy.unique(data.src_dst_pair.ravel())
[('02:e0', 'Broadcast') ('100.118.138.150', '41.177.26.176')]
How can I do the following in pandas:
for each element in uniq_src_dst_pair, check it against df.src_dst_pair; where it matches, sum the corresponding df.Length values and store the totals separately
my expected result is
('02:e0', 'Broadcast') : 188
('100.118.138.150', '41.177.26.176') : 510
How can I do this?
Below is my try
import pandas
import numpy
data = pandas.read_csv('first.csv')
print data
uniq_src_dst_pair = numpy.unique(data.src_dst_pair.ravel())
print uniq_src_dst_pair
print len(uniq_src_dst_pair)
# following is hardcoded, but need to be more general for the above list
match1 = data[data.src_dst_pair == "('02:e0:ed:0a:fb:5f', 'Broadcast')"] # doesn't work
Your csv file is messed up. You shouldn't have the first comma in the header, and you have an extra field in your 4th non-header row. Fixing that, you could use:
In [6]: data.groupby('src_dst_pair').Length.sum()
Out[6]:
src_dst_pair
('02:e0', 'Broadcast') 188
('100.118.138.150', '41.177.26.176') 510
Name: Length, dtype: int64
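If you want the mapping in exactly the form shown in the question, to_dict() on the grouped Series produces it:

In [7]: data.groupby('src_dst_pair').Length.sum().to_dict()
Out[7]: {"('02:e0', 'Broadcast')": 188, "('100.118.138.150', '41.177.26.176')": 510}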
However, your final field, 'src_dst_pair', is superfluous if this is all you want to accomplish, because you can simply group on the original columns:
In [8]: data.groupby(['Source','Destination']).Length.sum()
Out[8]:
Source Destination
02:e0 Broadcast 188
100.118.138.150 41.177.26.176 510
Name: Length, dtype: int64
I have a dataset created with PyTables that I am trying to import into a pandas dataframe. I can't get a where filter to work in the read_hdf step. I'm on pandas 0.12.0.
My sample pytables data:
import tables
import pandas as pd
import numpy as np

class BranchFlow(tables.IsDescription):
    branch = tables.StringCol(itemsize=25, dflt=' ')
    flow = tables.Float32Col(dflt=0)

filters = tables.Filters(complevel=8)
h5 = tables.openFile('foo.h5', 'w')
tbl = h5.createTable('/', 'BranchFlows', BranchFlow,
                     'Branch Flows', filters=filters, expectedrows=50e6)

for i in range(25):
    element = tbl.row
    element['branch'] = str(i)
    element['flow'] = np.random.randn()
    element.append()

tbl.flush()
h5.close()
Which I can import just fine into a dataframe:
store = pd.HDFStore('foo.h5')
print store
print pd.read_hdf('foo.h5', 'BranchFlows').head()
which shows:
In [10]: print store
<class 'pandas.io.pytables.HDFStore'>
File path: foo.h5
/BranchFlows frame_table [0.0.0] (typ->generic,nrows->25,ncols->2,indexers->[index],dc->[branch,flow])
In [11]: print pd.read_hdf('foo.h5', 'BranchFlows').head()
branch flow
0 0 -0.928300
1 1 -0.256454
2 2 -0.945901
3 3 1.090994
4 4 0.350750
But I can't get the filter to work on the flow column:
pd.read_hdf('foo.h5', 'BranchFlows', where=['flow>0.5'])
<snip traceback>
TypeError: passing a filterable condition to a non-table indexer [field->flow,op->>,value->[0.5]]
A table created directly with PyTables can only be read whole: pandas cannot push a where filter down to it, because the metadata pandas needs is not present (it could be added, but that would take some work). To use the pandas selection mechanism you must write the table with pandas tools, in Table format.
So read your table in as above, then create a new one, indicating table format. See here for docs
In [5]: df = pd.read_hdf('foo.h5', 'BranchFlows')  # read the PyTables table whole, as above

In [6]: df.to_hdf('foo.h5', 'BranchFlowsTable', data_columns=True, table=True)
In [24]: with pd.get_store('foo.h5') as store:
   ....:     print(store)
   ....:
<class 'pandas.io.pytables.HDFStore'>
File path: foo.h5
/BranchFlows frame_table [0.0.0] (typ->generic,nrows->25,ncols->2,indexers->[index],dc->[branch,flow])
/BranchFlowsTable frame_table (typ->appendable,nrows->25,ncols->2,indexers->[index],dc->[branch,flow])
In [7]: pd.read_hdf('foo.h5','BranchFlowsTable',where='flow>0.5')
Out[7]:
branch flow
14 14 1.503739
15 15 0.660297
17 17 0.685152
18 18 1.156073
20 20 0.994792
21 21 1.266463
23 23 0.927678
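For anyone on current pandas: pd.get_store and the table= keyword have since been removed, but the same approach holds. A sketch of the equivalent flow on a modern version (untested against this exact file; format='table' replaces table=True):

import pandas as pd

# Read the PyTables-created table whole, then re-save it in pandas
# 'table' format with queryable data columns
df = pd.read_hdf('foo.h5', 'BranchFlows')
df.to_hdf('foo.h5', 'BranchFlowsTable', format='table', data_columns=True)

# where= filters now work against the re-saved table
subset = pd.read_hdf('foo.h5', 'BranchFlowsTable', where='flow > 0.5')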