Bokeh: ColumnDataSource giving error - python

I am trying to create an interactive Bokeh plot that holds multiple data series, and I am not sure why I am getting this error:
ValueError: expected an element of ColumnData(String, Seq(Any)),got {'x': 6.794, 'y': 46.8339999999999, 'country': 'Congo, Dem. Rep.', 'pop': 3.5083789999999997, 'region': 'Sub-Saharan Africa'}
source = ColumnDataSource(data={
    'x'      : data.loc[1970].fertility,
    'y'      : data.loc[1970].life,
    'pop'    : (data.loc[1970].population / 20000000) + 2,
    'region' : data.loc[1970].region,
})
I have tried two different data sets imported from Excel and have run out of ideas as to why this is happening.

As the name suggests, the ColumnDataSource is a data structure for storing columns of data. This means that the value of every key in .data must be a column, i.e. a Python list, a NumPy array, or a Pandas series. But you are trying to assign plain numbers as the values, which is what the error message is telling you:
expected an element of ColumnData(String, Seq(Any))
This says that the acceptable, expected value is a dict that maps strings to sequences. But what you passed is clearly not that:
got {'x': 6.794, 'y': 46.8339999999999, 'country': 'Congo, Dem. Rep.', 'pop': 3.5083789999999997, 'region': 'Sub-Saharan Africa'}
The value for 'x', for instance, is just the number 6.794, not an array, list, or other sequence.
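A common way this happens is when data.loc[1970] matches only a single row: it then comes back as a Series representing that one row, so .fertility is a single number instead of a column. For illustration, a minimal sketch with made-up values showing the shape ColumnDataSource expects:
from bokeh.models import ColumnDataSource

# Every value must be a sequence (list, NumPy array, or pandas Series),
# and all sequences must have the same length.
source = ColumnDataSource(data={
    'x'      : [6.794, 5.842, 4.721],        # one entry per country
    'y'      : [46.834, 50.312, 55.120],
    'pop'    : [3.51, 4.20, 2.95],
    'region' : ['Sub-Saharan Africa', 'Sub-Saharan Africa', 'South Asia'],
})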

You can easily do this:
source = ColumnDataSource({str(c): v.values for c, v in df.items()})

This would be a solution; I think the problem is in how the data is being selected from the DataFrame.
source = ColumnDataSource(data={
    'x'      : data[data['Year'] == 1970]['fertility'],
    'y'      : data[data['Year'] == 1970]['life'],
    'pop'    : (data[data['Year'] == 1970]['population'] / 20000000) + 2,
    'region' : data[data['Year'] == 1970]['region']
})

I had this same problem with this same dataset.
My solution was to import the CSV in pandas using "Year" as the index column.
data = pd.read_csv(csv_path, index_col='Year')
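For completeness, a hedged sketch of how that ties back to the original snippet (the file name is hypothetical; the column names are taken from the question). With 'Year' as the index, data.loc[1970] selects all rows for that year, so each field is a Series rather than a scalar:
import pandas as pd
from bokeh.models import ColumnDataSource

data = pd.read_csv('gapminder.csv', index_col='Year')  # hypothetical file name

source = ColumnDataSource(data={
    'x'      : data.loc[1970].fertility,
    'y'      : data.loc[1970].life,
    'pop'    : (data.loc[1970].population / 20000000) + 2,
    'region' : data.loc[1970].region,
})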

Related

Pandas Apply with multiple columns as input

For a dataframe which has 4 columns of coordinates (longitude, latitude), I would like to create a 5th column which holds the distance between the two places in each row; the example below illustrates this:
dict = [{'x1': '1', 'y1': '1', 'x2': '3', 'y2': '2'},
        {'x1': '1', 'y1': '1', 'x2': '3', 'y2': '2'}]
data = pd.DataFrame(dict)
As an outcome I would like to have this:
dict1 = [{'x1': '1', 'y1': '1', 'x2': '3', 'y2': '2', 'distance': '2.6'},
         {'x1': '1', 'y1': '1', 'x2': '3', 'y2': '2', 'distance': '2.9'}]
data2 = pd.DataFrame(dict1)
Where distance is computed using from geopy.distance import great_circle:
This is what I tried:
data['distance']=data[['x1','y1','x2','y2']].apply(lambda x1,y1,x2,y2: great_circle(x1,y1,x2,y2).miles, axis=1)
But that gives me a type error:
TypeError: <lambda>() missing 3 required positional arguments: 'y1', 'x2', and 'y2'
Any help is appreciated.
That is because apply with axis=1 passes each row to the lambda as a single Series, not as four separate arguments. Also note that great_circle expects two (latitude, longitude) points rather than four scalars, so you should modify it as follows. Hope this helps!
data['distance'] = data[['x1','y1','x2','y2']].apply(lambda row: great_circle((row['x1'], row['y1']), (row['x2'], row['y2'])).miles, axis=1)
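For reference, a small self-contained sketch with made-up coordinates (assuming each pair is ordered latitude, longitude, which is what great_circle expects):
import pandas as pd
from geopy.distance import great_circle

# Made-up coordinates: Cleveland -> New York, London -> Paris.
data = pd.DataFrame([
    {'lat1': 41.50, 'lon1': -81.69, 'lat2': 40.71, 'lon2': -74.01},
    {'lat1': 51.51, 'lon1': -0.13,  'lat2': 48.86, 'lon2': 2.35},
])

data['distance'] = data.apply(
    lambda row: great_circle((row['lat1'], row['lon1']),
                             (row['lat2'], row['lon2'])).miles,
    axis=1)
print(data)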

How to extract specific values from a list of dictionaries in python

I have a list of dictionaries like the one shown below, and I would like to extract the partID and the corresponding quantity for a specific orderID using Python, but I don't know how to do it.
dataList = [{'orderID': 'D00001', 'customerID': 'C00001', 'partID': 'P00001', 'quantity': 2},
{'orderID': 'D00002', 'customerID': 'C00002', 'partID': 'P00002', 'quantity': 1},
{'orderID': 'D00003', 'customerID': 'C00003', 'partID': 'P00001', 'quantity': 1},
{'orderID': 'D00004', 'customerID': 'C00004', 'partID': 'P00003', 'quantity': 3}]
So for example, when I search my dataList for a specific orderID == 'D00003', I would like to receive both the partID ('P00001') and the corresponding quantity (1) of the specified order. How would you go about this? Any help is much appreciated.
It depends.
If you are not going to do this search many times, you can just iterate over the list of dictionaries until you find the "correct" one:
search_for_order_id = 'D00001'
for d in dataList:
    if d['orderID'] == search_for_order_id:
        print(d['partID'], d['quantity'])
        break  # assuming orderID is unique
Outputs
P00001 2
Since this solution is O(n), if you are going to do this search a lot of times it will add up.
In that case it will be better to transform the data to a dictionary of dictionaries, with orderID being the outer key (again, assuming orderID is unique):
better = {d['orderID']: d for d in dataList}
This is also O(n) but you pay it only once. Any subsequent lookup is an O(1) dictionary lookup:
search_for_order_id = 'D00001'
print(better[search_for_order_id]['partID'], better[search_for_order_id]['quantity'])
Also outputs
P00001 2
I believe you would benefit from familiarizing yourself with the pandas package, which is very useful for data analysis. If these are the kind of problems you're up against, I advise you to take the time to work through a pandas tutorial. It can do a lot, and is very popular.
Your dataList is very similar to a DataFrame structure, so what you're looking for would be as simple as:
import pandas as pd
df = pd.DataFrame(dataList)
df[df['orderID']=='D00003']
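And if you only want the two fields rather than the whole row, something like this should do it:
df.loc[df['orderID'] == 'D00003', ['partID', 'quantity']]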
You can use this:
results = [[x['orderID'], x['partID'], x['quantity']] for x in dataList]
for i in results:
    print(i)
Also,
results = [['Order ID: ' + x['orderID'], 'Part ID: ' + x['partID'], 'Quantity: ' + str(x['quantity'])] for x in dataList]
To get the partID you can make use of the filter function.
myData = [{"x": 1, "y": 1}, {"x": 2, "y": 5}]
filtered = filter(lambda item: item["x"] == 1, myData)  # Search for an object with x equal to 1
# Get the next item from the filter (the matching item) and get the y property.
print(next(filtered)["y"])
You should be able to apply this to your situation.
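Applied to your data, that would look roughly like this (a sketch; note that filter takes the iterable as its second argument):
filtered = filter(lambda item: item['orderID'] == 'D00003', dataList)
match = next(filtered, None)                    # None if no order matches
if match is not None:
    print(match['partID'], match['quantity'])   # -> P00001 1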

Finding and replacing values in specific columns in a CSV file using dictionaries

My goal here is to clean up address data from individual CSV files using dictionaries for each individual column, sort of like automating the find-and-replace feature from Excel. The addresses are split into columns: house numbers, street names, directions, and street types, each in their own column. I used the following code to process the whole document.
missad = {
    'Typo goes here': 'Corrected typo goes here'}

def replace_all(text, dic):
    for i, j in dic.items():
        text = text.replace(i, j)
    return text

with open('original.csv', 'r') as csvfile:
    text = csvfile.read()
text = replace_all(text, missad)
with open('cleanfile.csv', 'w') as cleancsv:
    cleancsv.write(text)
While the code works, I need to have separate dictionaries, as some columns need specific typo fixes. For example, housenum for the house numbers column, stdir for the street direction, and so on, each with their column-specific typos:
housenum = {
    'One': '1',
    'Two': '2'
}
stdir = {
'NULL': ''}
I have no idea how to proceed. I feel it's something simple, or that I would need pandas, but I am unsure how to continue. Would appreciate any help! Also, is there any way to group the typos together with one corrected typo? I tried the following but got an unhashable type error.
missad = {
    ['Typo goes here', 'Typo 2 goes here', 'Typo 3 goes here']: 'Corrected typo goes here'}
Is something like this what you are looking for?
import pandas as pd

df = pd.read_csv(filename, index_col=False)  # using pandas to read in the CSV file

# let's say in this dataframe you want to do corrections on the 'columnforcorrection' column
correctiondict = {
    'one': 1,
    'two': 2
}
df['columnforcorrection'] = df['columnforcorrection'].replace(correctiondict)
and use this idea for other columns of interest.
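As for grouping several typos under one correction: a list cannot be a dictionary key (that is the unhashable type error), but you can write the mapping the other way around and invert it. A sketch with made-up typos:
corrections = {
    'Corrected typo goes here': ['Typo goes here', 'Typo 2 goes here', 'Typo 3 goes here'],
}
# Invert into the {typo: correction} shape that .replace() expects.
missad = {typo: fixed for fixed, typos in corrections.items() for typo in typos}
df['columnforcorrection'] = df['columnforcorrection'].replace(missad)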

Extract JSON | API | Pandas DataFrame

I am using the Facebook API (v2.10), from which I've extracted the data I need, and 95% of it is perfect. My problem is the 'actions' metric, which returns as a dictionary within a list within another dictionary.
At present, all the data is in a DataFrame, however, the 'actions' column is a list of dictionaries that contain each individual action for that day.
{
  "actions": [
    {
      "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
      "value": "7"
    },
    {
      "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
      "value": "3"
    },
    {
      "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
      "value": "144"
    },
    {
      "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
      "value": "34"
    }
  ]
}
All this appears in one cell (row) within the DataFrame.
What is the best way to:
Get the action type, create a new column, and use the "action_type" value as the column name?
List the correct value under this column
It looks like JSON, but when I look at the type, it's a pandas Series (stored as an object).
For those willing to help (thank you, I greatly appreciate it): can you either point me in the direction of the right material so I can read it and work it out on my own (I'm not entirely sure what to look for), or, if you decide this is an easy problem, explain how and why you solved it this way? I don't just want the answer.
I have tried the following (with help from a friend) and it kind of works, but I have issues with this running in my script. I.e., if it runs within a bigger code block, I get the following error:
import ast

for i in range(df.shape[0]):
    line = df.loc[i, 'Conversions']
    L = ast.literal_eval(line)
    for l in L:
        cid = l['action_type']
        value = l['value']
        df.loc[i, cid] = value
If I save the DataFrame as a CSV and read it back with pd.read_csv, it executes properly, but not within the script. No idea why.
Error:
ValueError: malformed node or string: [{'value': '1', 'action_type': 'offsite_conversion.custom.xxxxx}]
Any help would be greatly appreciated.
Thanks,
Adrian
You can use json_normalize:
In [11]: d # e.g. dict from json.load OR instead pass the json path to json_normalize
Out[11]:
{'actions': [{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx',
'value': '7'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '3'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '144'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '34'}]}
In [12]: pd.io.json.json_normalize(d, record_path="actions")
Out[12]:
action_type value
0 offsite_conversion.custom.xxxxxxxxxxx 7
1 offsite_conversion.custom.xxxxxxxxxxx 3
2 offsite_conversion.custom.xxxxxxxxxxx 144
3 offsite_conversion.custom.xxxxxxxxxxx 34
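If every row of your 'Conversions' column holds one of these lists, one way to expand the whole column is to normalize each row and concatenate (a sketch; use ast.literal_eval first if the lists are stored as strings, and note that json_normalize lives at pd.json_normalize in newer pandas versions):
import pandas as pd

expanded = pd.concat(
    [pd.io.json.json_normalize(row).assign(source_row=i)  # source_row keeps track of the original row
     for i, row in df['Conversions'].items()],
    ignore_index=True)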
You can use df.join(pd.DataFrame(df['Conversions'].tolist()).pivot(columns='action_type', values='value').reset_index(drop=True)).
Explanation:
df['Conversions'].tolist() returns a list of dictionaries. This list is then transformed into a DataFrame using pd.DataFrame. Then, you can use the pivot function to pivot the table into the shape that you want.
Lastly, you can join the table with your original DataFrame. Note that this only works if your DataFrame's index is the default (i.e., integers starting from 0). If this is not the case, you can do this instead:
df2 = pd.DataFrame(df['Conversions'].tolist()).pivot(columns='action_type', values='value').reset_index(drop=True)
for col in df2.columns:
    df[col] = df2[col]

read_csv read in categorical values?

I was wondering if there was a way to read in Categorical values during the read_csv() process.
Normally you can do a convert after the fact with something like:
df.zone = df.zone.astype('category')
At that point the df has already taken up the extra memory, and I'm looking for a way to avoid that.
I've tried things like:
parking_meters = pd.read_csv('parking_meter_data.csv',
                             converters={'zone': pd.Categorical(),
                                         'sub_area': pd.Categorical(),
                                         'area': pd.Categorical(),
                                         'config_name': pd.Categorical(),
                                         'pole': str(),
                                         'longitude': np.float(),
                                         'latitude': np.float()
                                         })
parking_meters.memory_usage(deep=True).sum()
However, categorical data needs an initialization argument of the actual data, which is in the CSV file.
Let's try with dtype:
parking_meters = pd.read_csv('parking_meter_data.csv',
                             dtype={'zone': 'category',
                                    'sub_area': 'category',
                                    'area': 'category',
                                    'config_name': 'category'
                                    })
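You can then check the effect the same way you were measuring before (using the file name from your question):
parking_meters.dtypes                         # the four columns should now show as category
parking_meters.memory_usage(deep=True).sum()  # compare against the default-dtype read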
