Convert `String Feature` DataFrame into Float in Azure ML Using Python Script

I am trying to understand how to convert the Azure ML String Feature data type into a float using a Python script. My dataset contains times in "HH:MM" format, which Azure ML recognizes as a String Feature.
I want to convert each time into a float by dividing the number of seconds since midnight by 86400 (24 hours), so 17:30 becomes 0.729166666666667. I wrote a Python script to do that conversion. This is my script:
import pandas as pd
import numpy as np

def timeToFloat(x):
    frt = [3600, 60]
    data = str(x)
    result = float(sum([a*b for a, b in zip(frt, map(int, data.split(':')))])) / 86400
    return result if isNotZero(x) else 0.0

def isNotZero(x):
    return (x is "0")

def azureml_main(dataframe1 = None):
    df = pd.DataFrame(dataframe1)
    df["Departure Time"] = pd.to_numeric(df["Departure Time"]).apply(timeToFloat)
    print(df["Departure Time"])
    return df,
When I run the script, it fails. I then tried to check whether the value is a str or not, but it returns None.
Can we treat a String Feature as a string? Or how should I convert this data correctly?

The to_numeric conversion seems to be the problem, as there's no default parsing from string to number.
Does it work if you just use df["Departure Time"].apply(timeToFloat) directly, without pd.to_numeric?
Roope - Microsoft Azure ML Team
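Following that suggestion, a minimal sketch of a corrected entry point might look like this (the column name and the special handling of "0" values are taken from the question; the zero check is rewritten as a plain equality comparison):
import pandas as pd

def timeToFloat(x):
    # "HH:MM" -> fraction of a day, e.g. "17:30" -> 0.729166...
    data = str(x)
    if data == "0":  # keep the question's special case for zero values
        return 0.0
    hours, minutes = map(int, data.split(':'))
    return (hours * 3600 + minutes * 60) / 86400.0

def azureml_main(dataframe1 = None):
    df = pd.DataFrame(dataframe1)
    # apply the converter to the string column directly, without pd.to_numeric
    df["Departure Time"] = df["Departure Time"].apply(timeToFloat)
    return df,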

Related

Can't convert string to float - Python Dash

I'm trying to convert a string into a float in a Dash callback, but when I run my code, my Dash app shows this error at the line lati = float(lati[-1]):
ValueError: could not convert string to float: 'float64)
I'm not getting this error in the terminal, though.
First, what I need to do is extract the given latitude (and longitude) value. To do that I convert it to a string and split it, because I could not find a better way to get this number from the CSV file using pandas.
Output:
# converting to string:
12 41.6796
Name: latitude, dtype: float64
# splitting:
['12', '', '', '', '41.6796']
# converting to float:
41.6796
This is the actual code:
@app.callback(Output('text-output', 'children'),
              [Input('submit-val', 'n_clicks')],
              [State('search-input', 'value')])
def updateText(n_clicks, searchVar):
    df = pd.read_csv("powerplant.csv")
    df = df[df.name == searchVar]
    # converting to string
    lati = str(df['latitude'])
    longi = str(df['longitude'])
    # splitting it
    lati = lati.split('\n', 1)
    lati = lati[0].split(' ', 4)
    longi = longi.split('\n', 1)
    longi = longi[0].split(' ', 4)
    # converting to float
    lati = float(lati[-1])
    longi = float(longi[-1])
I actually tested this code in another script and it worked just fine. Is there a better way to extract the latitude and longitude values?
The data can be downloaded from https://datasets.wri.org/dataset/globalpowerplantdatabase; here is an excerpt.
country,country_long,name,gppd_idnr,capacity_mw,latitude,longitude,primary_fuel,other_fuel1,other_fuel2,other_fuel3,commissioning_year,owner,source,url,geolocation_source,wepp_id,year_of_capacity_data,generation_gwh_2013,generation_gwh_2014,generation_gwh_2015,generation_gwh_2016,generation_gwh_2017,estimated_generation_gwh
AFG,Afghanistan,Kajaki Hydroelectric Power Plant Afghanistan,GEODB0040538,33.0,32.3220,65.1190,Hydro,,,,,,GEODB,http://globalenergyobservatory.org,GEODB,1009793,2017,,,,,,
AFG,Afghanistan,Mahipar Hydroelectric Power Plant Afghanistan,GEODB0040541,66.0,34.5560,69.4787,Hydro,,,,,,GEODB,http://globalenergyobservatory.org,GEODB,1009795,2017,,,,,,
ALB,Albania,Shkopet,WRI1002173,24.0,41.6796,19.8305,Hydro,,,,1963.0,,Energy Charter Secretariat,http://www.energycharter.org/fileadmin/DocumentsMedia/IDEER/IDEER-Albania_2013_en.pdf,GEODB,1021238,,,,,,,79.22851153039832
ALB,Albania,Ulez,WRI1002174,25.0,41.6796,19.8936,Hydro,,,,1958.0,,Energy Charter Secretariat,http://www.energycharter.org/fileadmin/DocumentsMedia/IDEER/IDEER-Albania_2013_en.pdf,GEODB,1021241,,,,,,,82.52969951083159
The issue is the way you are accessing the values in a dataframe. Pandas allows you to access the data without having to parse the string representation.
You can access the row and the column in one call to .loc. If you know you will have a single value, you can call the squeeze method:
>>> import pandas as pd
>>> from io import StringIO
>>> # data shortened for brevity
>>> df = pd.read_csv(StringIO("""country,country_long,name,gppd_idnr,capacity_mw,latitude,longitude
... AFG,Afghanistan,Kajaki Hydroelectric Power Plant Afghanistan,GEODB0040538,33.0,32.3220,65.1190
... AFG,Afghanistan,Mahipar Hydroelectric Power Plant Afghanistan,GEODB0040541,66.0,34.5560,69.4787
... ALB,Albania,Shkopet,WRI1002173,24.0,41.6796,19.8305
... ALB,Albania,Ulez,WRI1002174,25.0,41.6796,19.8936"""))
>>> searchVar = "Ulez"
>>> df.loc[df["name"] == searchVar, "latitude"] # here you have a pd.Series
3 41.6796
Name: latitude, dtype: float64
>>> df.loc[df["name"] == searchVar, "latitude"].squeeze() # here you have a scalar
41.6796
>>> df.loc[df["name"] == searchVar, "longitude"].squeeze()
19.8936
If for some reason you have several rows with the same name, you will get a Series back and not a scalar. But maybe it is a case where failure is what you want rather than passing ambiguous data.
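One way to get that failure explicitly is Series.item(), which raises a ValueError unless the selection contains exactly one element:
>>> df.loc[df["name"] == searchVar, "latitude"].item()  # ValueError if 0 or 2+ rows match
41.6796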
What you're looking at is a pandas.Series object containing a single row of data, and you're trying to chop up its __repr__ to get at the value. There is no need for this. I'm not familiar with the Python version of Plotly, but I see that you have a callback, so I've wrapped it up in a function (I'm not sure whether the case exists where the name can't be found):
import pandas as pd

def get_by_name(name):
    df = pd.read_csv('powerplant.csv')
    df = df[df['name'] == name]
    if not df.empty:
        return df[['latitude', 'longitude']].values.tolist()[0]
    return None, None

lat, lon = get_by_name('Kajaki Hydroelectric Power Plant Afghanistan')
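For completeness, wiring this helper back into the callback from the question might look like the sketch below (the component IDs come from the question; app is the Dash instance, and the returned message format is just an assumption):
from dash.dependencies import Input, Output, State

@app.callback(Output('text-output', 'children'),
              [Input('submit-val', 'n_clicks')],
              [State('search-input', 'value')])
def updateText(n_clicks, searchVar):
    lat, lon = get_by_name(searchVar)
    if lat is None:
        return "No power plant found with that name."
    return "Latitude: {}, Longitude: {}".format(lat, lon)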

converting google datastore query result to pandas dataframe in python

I need to convert a Google Cloud Datastore query result to a dataframe, to create a chart from the retrieved data. The query:
def fetch_times(limit):
    start_date = '2019-10-08'
    end_date = '2019-10-19'
    query = datastore_client.query(kind='ParticleEvent')
    query.add_filter('published_at', '>', start_date)
    query.add_filter('published_at', '<', end_date)
    query.order = ['-published_at']
    times = query.fetch(limit=limit)
    return times
creates a JSON-like string for each entity returned by the query:
<Entity('ParticleEvent', 5942717456580608) {'gc_pub_sub_id': '438169950283983', 'data': '605', 'event': 'light intensity', 'published_at': '2019-10-11T14:37:45.407Z', 'device_id': 'e00fce6847be7713698287a1'}>
I thought I found something that would translate the result to JSON, which I could then convert to a dataframe, but I get an error that the properties attribute does not exist:
def to_json(gql_object):
    result = []
    for item in gql_object:
        result.append(dict([(p, getattr(item, p)) for p in item.properties()]))
    return json.dumps(result, cls=JSONEncoder)
Is there a way to iterate through the query results and get them into a dataframe, either directly or by converting to JSON first?
Datastore entities can be treated as Python base dictionaries! So you should be able to do something as simple as...
df = pd.DataFrame(datastore_entities)
...and pandas will do all the rest.
If you needed to convert the entity key, or any of its attributes to a column as well, you can pack them into the dictionary separately:
for e in entities:
    e['entity_key'] = e.key
    e['entity_key_name'] = e.key.name  # for example

df = pd.DataFrame(entities)
You can use pd.read_json to read your json query output into a dataframe.
Assuming the output is the string that you have shared above, the following approach can work:
#Extracting the beginning of the dictionary
startPos = line.find("{")
df = pd.DataFrame([eval(line[startPos:-1])])
Output looks like:
gc_pub_sub_id data event published_at \
0 438169950283983 605 light intensity 2019-10-11T14:37:45.407Z
device_id
0 e00fce6847be7713698287a1
Here, line[startPos:-1] is essentially the entire dictionary in the input string. Using eval, we can convert it into an actual dictionary. Once we have that, it can easily be converted into a dataframe object.
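Since the printed entity only contains string and integer literals, a slightly safer variant of the same idea is ast.literal_eval, which evaluates literals only and refuses arbitrary expressions:
import ast

startPos = line.find("{")
df = pd.DataFrame([ast.literal_eval(line[startPos:-1])])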
The original poster found a workaround, which is to convert each item in the query result to a string, and then manually parse the string to extract the needed data into a list.
The return value of the fetch function is a google.cloud.datastore.query.Iterator, which behaves like a List[dict], so the output of fetch can be passed directly into pd.DataFrame:
import pandas as pd
df = pd.DataFrame(fetch_times(10))
This is similar to @bkitej's answer, but I added the use of the original poster's function.

Can google dataflow convert an input date to a bigquery timestamp

I'm quite new to Dataflow and have been searching for days for a solution to my problem. I need to run a pipeline that reads a date from a CSV file in the following format: 2019010420300033, passes it through the different transforms, and ends up in BigQuery as a timestamp. Is there a way to do this, or must the input file first be converted to a parseable date (I know a format like this works: 2019-01-01 20:30:00.331)?
Or is it possible to have Dataflow output, in some way, a new file with that date converted?
Thanks.
This is an easy job for Dataflow. You can use either a ParDo or a Map.
In the example below each line from the CSV will be passed to Map(convertDate). The function convertDate, which you need to modify to fit your date conversion, then returns the modified line. Then the entire converted CSV is written to the output file set.
Example (simplified) using Map:
def convertDate(line):
    # Convert the date to the desired format:
    # split the line into columns, change the date format for the desired column,
    # rejoin the columns into a line and return it.
    cols = line.split(',')  # change for your column separator
    cols[2] = my_change_method_for_date(cols[2])  # code the date conversion here
    return ",".join(cols)

with beam.Pipeline(argv=pipeline_args) as p:
    lines = p | 'ReadCsvFile' >> beam.io.ReadFromText(args.input)
    lines = lines | 'ConvertDate' >> beam.Map(convertDate)
    lines | 'WriteCsvFile' >> beam.io.WriteToText(args.output)
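For the conversion itself, a minimal sketch of my_change_method_for_date might look like the following, assuming the 16-digit input is YYYYMMDDHHMMSS followed by two fractional-second digits (which is how 2019010420300033 reads) and that the target is the 2019-01-01 20:30:00.331 style the question says works:
from datetime import datetime

def my_change_method_for_date(raw):
    # '2019010420300033' -> '2019-01-04 20:30:00.33'
    dt = datetime.strptime(raw[:14], '%Y%m%d%H%M%S')
    return dt.strftime('%Y-%m-%d %H:%M:%S') + '.' + raw[14:]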

How do I convert csv string to list in pandas?

I'm working with a csv file that has the following format:
"Id","Sequence"
3,"1,3,13,87,1053,28576,2141733,508147108,402135275365,1073376057490373,9700385489355970183,298434346895322960005291,31479360095907908092817694945,11474377948948020660089085281068730"
7,"1,2,1,5,5,1,11,16,7,1,23,44,30,9,1,47,112,104,48,11,1,95,272,320,200,70,13,1,191,640,912,720,340,96,15,1,383,1472,2464,2352,1400,532,126,17,1,767,3328,6400,7168,5152,2464,784,160,19,1,1535,7424"
8,"1,2,4,5,8,10,16,20,32,40,64,80,128,160,256,320,512,640,1024,1280,2048,2560,4096,5120,8192,10240,16384,20480,32768,40960,65536,81920,131072,163840,262144,327680,524288,655360,1048576,1310720,2097152"
11,"1,8,25,83,274,2275,132224,1060067,3312425,10997342,36304451,301432950,17519415551,140456757358,438889687625,1457125820233,4810267148324,39939263006825,2321287521544174,18610239435360217"
I'd like to read this into a dataframe where df['Id'] is integer-like and df['Sequence'] is list-like.
I currently have the following kludgy code:
import pandas as pd

def clean(seq_string):
    return list(map(int, seq_string.split(',')))

# Read data
training_data_file = "data/train.csv"
train = pd.read_csv(training_data_file)
train['Sequence'] = list(map(clean, train['Sequence'].values))
This appears to work, but I feel like the same could be achieved natively using pandas and numpy.
Does anyone have a recommendation?
You can specify a converter for the Sequence column:
converters : dict, default None
    Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
train = pd.read_csv(training_data_file, converters={'Sequence': clean})
This also works, except that Sequence becomes a list of strings instead of a list of ints:
df = pd.read_csv(training_data_file)
df['Sequence'] = df['Sequence'].str.split(',')
To convert each element to int:
df = pd.read_csv(training_data_file)
df['Sequence'] = df['Sequence'].str.split(',').apply(lambda s: list(map(int, s)))
An alternative solution is to use literal_eval from the ast module. literal_eval safely evaluates the string as a Python literal and gives you back the sequence of ints (as a tuple; wrap it in list() if you specifically need a list):
from ast import literal_eval

def clean(x):
    return literal_eval(x)

train = pd.read_csv(training_data_file, converters={'Sequence': clean})

pandas read_json: "If using all scalar values, you must pass an index"

I'm having some difficulty importing a JSON file with pandas.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
This is the error that I get:
ValueError: If using all scalar values, you must pass an index
The file structure is simplified like this:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
It is from the machine learning course of University of Washington on Coursera. You can find the file here.
Try
ser = pd.read_json('people_wiki_map_index_to_word.json', typ='series')
That file only contains key-value pairs whose values are scalars. You can convert it to a dataframe with ser.to_frame('count').
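Putting both steps together (using the filename from the question):
import pandas as pd

ser = pd.read_json('people_wiki_map_index_to_word.json', typ='series')
df = ser.to_frame('count')  # one row per word, values in a 'count' column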
You can also do something like this:
import json
with open('people_wiki_map_index_to_word.json', 'r') as f:
data = json.load(f)
Now data is a dictionary. You can pass it to a dataframe constructor like this:
df = pd.DataFrame({'count': data})
You can do as @ayhan mentions, which will give you a column-based format.
Or you can enclose the object in [ ] (source) as shown below, to give you a row format that is convenient if you are loading multiple values and planning on using matrices for your machine learning models.
df = pd.DataFrame([data])
I think what is happening is that the data in
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
is being read as a string instead of as JSON:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
is actually
'{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}'
Since a string is a scalar, pandas complains that you must pass an index; you have to convert the string to a dict, which is exactly what the other response is doing.
The best way is to do a json.loads on the string to convert it to a dict and load it into pandas:
import json

with open('people_wiki_map_index_to_word.json') as f:
    jsonData = json.loads(f.read())
df = pd.DataFrame([jsonData])  # wrap the dict in a list so the keys become columns of a single row
Suppose values.json contains only scalar values:
{
    "biennials": 522004,
    "lb915": 116290
}
Then
df = pd.read_json('values.json')
fails, because pd.read_json expects a list for each key, e.g.
{
    "biennials": [522004],
    "lb915": [116290]
}
With only a scalar for each key, it returns the error
If using all scalar values, you must pass an index.
You can resolve this by specifying the typ argument in pd.read_json:
map_index_to_word = pd.read_json('Datasets/people_wiki_map_index_to_word.json', typ='series')
For newer pandas, 0.19.0 and later, use the lines parameter and set it to True.
The file is then read as one JSON object per line:
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json', lines=True)
It fixed the following errors I encountered, especially when some of the JSON files have only one value:
ValueError: If using all scalar values, you must pass an index
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
ValueError: Trailing data
For example, with
cat values.json
{
    "name": "Snow",
    "age": "31"
}
df = pd.read_json('values.json')
chances are you might end up with this error:
if using all scalar values, you must pass an index
Pandas looks for a list or dictionary in the values. Something like:
cat values.json
{
    "name": ["Snow"],
    "age": ["31"]
}
So try that instead. Later on, to convert the result to HTML, use to_html():
df = pd.DataFrame([pd.read_json(report_file, typ='series')])
result = df.to_html()
I solved this by converting it into an array like so
[{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}]
