While trying to update records in a local database, I ran into a problem: my loop only picks up the values from the last row.
The variables id_value and cost_value hold only the last row's values.
How can I get all the values, so that I can update the old research records?
Data:
df = pd.DataFrame({
'id': ['09999900795', '00009991136', '000094801299', '000099900300', '0075210657'],
'Cost': ['157.05974458228403', '80.637745302714', '7', '13', '65.5495495']
})
My code:
for index, row in df.iterrows():
    id_value = [row['id']]
    cost_value = [row['Cost']]
    # this line updates the data; however, the variables only hold the last value
    #table.update().where(table.c.id== id_value).values(Cost=cost_value)
print(id_value)
print(cost_value)
out[12]:
['0075210657']
['65.5495495']
Desired output:
['09999900795', '00009991136', '000094801299', '000099900300', '0075210657']
['157.05974458228403', '80.637745302714', '7', '13', '65.5495495']
A Python for loop reassigns id_value and cost_value on every iteration, so once the loop has finished they hold only the values from the last row.
If what you want is a Python list of every value in that column, you can do that more efficiently than looping by using df['id'].tolist().
Timing the difference of these with your (small) example dataset:
import timeit
setup_string = '''
import pandas as pd
df = pd.DataFrame({
'id': ['09999900795', '00009991136', '000094801299', '000099900300', '0075210657'],
'Cost': ['157.05974458228403', '80.637745302714', '7', '13', '65.5495495']
})
'''
code_string1 = '''
id_values = []
cost_values = []
for _, row in df.iterrows():
    id_values.append(row['id'])
    cost_values.append(row['Cost'])
'''
code_string2 = '''
id_values = df['id'].tolist()
cost_values = df['Cost'].tolist()
'''
timeit.timeit(code_string1, setup_string, number=10000)
timeit.timeit(code_string2, setup_string, number=10000)
The first, iterative example gives 5.5589933570008725 seconds on my machine, while the second example gives 0.2375467009987915 seconds on my machine.
You need to append values to each list; you are simply defining a new list in each iteration.
id_values = []
cost_values = []
for _, row in df.iterrows():
    id_values.append(row['id'])
    cost_values.append(row['Cost'])
df = pd.DataFrame({
'id': ['09999900795', '00009991136', '000094801299', '000099900300', '0075210657'],
'Cost': ['157.05974458228403', '80.637745302714', '7', '13', '65.5495495']
})
id_value=df['id'].tolist()
cost_value=df['Cost'].tolist()
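Since the goal in the question is to update database records, the two lists can be paired up and sent to the database in one call. The following is only a sketch: it swaps the table.update() call from the question for the standard-library sqlite3 module, and the in-memory table research and its columns are hypothetical stand-ins for the real local database.

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({
    'id': ['09999900795', '00009991136'],
    'Cost': ['157.05974458228403', '80.637745302714'],
})

# Hypothetical table standing in for the local database being updated.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE research (id TEXT PRIMARY KEY, Cost TEXT)')
conn.executemany('INSERT INTO research VALUES (?, ?)',
                 [(i, '0') for i in df['id']])

# One (Cost, id) pair per row, in the order the placeholders expect.
params = list(zip(df['Cost'].tolist(), df['id'].tolist()))
conn.executemany('UPDATE research SET Cost = ? WHERE id = ?', params)
conn.commit()

updated = conn.execute('SELECT id, Cost FROM research ORDER BY id').fetchall()
print(updated)
```

Every row of the data frame produces its own UPDATE statement, so no value is lost to reassignment inside a loop.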
Related
I am working with Amazon Rekognition to do some image analysis.
With a simple Python script, I get - at every iteration - a response of this type:
(example for the image of a cat)
{'Labels':
[{'Name': 'Pet', 'Confidence': 96.146484375, 'Instances': [],
'Parents': [{'Name': 'Animal'}]}, {'Name': 'Mammal', 'Confidence': 96.146484375,
'Instances': [], 'Parents': [{'Name': 'Animal'}]},
{'Name': 'Cat', 'Confidence': 96.146484375.....
I got all the attributes I need in a list, that looks like this:
[Pet, Mammal, Cat, Animal, Manx, Abyssinian, Furniture, Kitten, Couch]
Now, I would like to create a dataframe where the elements in the list above appear as columns and the rows take values 0 or 1.
I created a dictionary in which I add the elements in the list, so I get {'Cat': 1}, then I go to add it to the dataframe and I get the following error:
TypeError: Index(...) must be called with a collection of some kind, 'Cat' was passed.
Not only that, but I don't even seem able to add to the same dataframe the information from different images. For example, if I only insert the data in the dataframe (as rows, not columns), I get a series with n rows with the n elements (identified by Amazon Rekognition) of only the last image, i.e. I start from an empty dataframe at each iteration.
The result I would like to get is something like:
Image Human Animal Flowers etc...
Pic1 1 0 0
Pic2 0 0 1
Pic3 1 1 0
For reference, this is the code I am using now (I should add that I am working on a software called KNIME, but this is just Python):
from pandas import DataFrame
import pandas as pd
import boto3
fileName=flow_variables['Path_Arr[1]'] #This is just to tell Amazon the name of the image
bucket= 'mybucket'
client=boto3.client('rekognition', region_name = 'us-east-2')
response = client.detect_labels(Image={'S3Object':
{'Bucket':bucket,'Name':fileName}})
data = [str(response)] # This is what I inserted in the first cell of this question
d= {}
for key, value in response.items():
    for el in value:
        if isinstance(el,dict):
            for k, v in el.items():
                if k == "Name":
                    d[v] = 1
print(d)
df = pd.DataFrame(d, ignore_index=True)
print(df)
output_table = df
I am definitely getting it all wrong both in the for loop and when adding things to my dataframe, but nothing really seems to work!
Sorry for the super long question, hope it was clear! Any ideas?
I do not know if this answers your question completely, because I do not know what your data can look like, but I think it is a good step that should help you. I added the same data multiple times, but the approach should be clear.
import pandas as pd
response = {'Labels': [{'Name': 'Pet', 'Confidence': 96.146484375, 'Instances': [], 'Parents': [{'Name': 'Animal'}]},
{'Name': 'Cat', 'Confidence': 96.146484375, 'Instances': [{'BoundingBox':
{'Width': 0.6686800122261047,
'Height': 0.9005332589149475,
'Left': 0.27255237102508545,
'Top': 0.03728689253330231},
'Confidence': 96.146484375}],
'Parents': [{'Name': 'Pet'}]
}]}
def handle_new_data(response_data: dict, image_name: str) -> pd.DataFrame:
    # one row per image: the image name plus a 1 for every label found
    d = {"Image": image_name}
    for key, value in response_data.items():
        for el in value:
            if isinstance(el, dict):
                for k, v in el.items():
                    if k == "Name":
                        d[v] = 1
    # DataFrame.append was removed in pandas 2.0, so build the row directly
    return pd.DataFrame([d])
# collect one single-row frame per image, then concatenate them once
frames = [handle_new_data(response, name)
          for name in ("image1", "image2", "image3", "image4")]
df_all = pd.concat(frames, ignore_index=True)
print(df_all)
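When the images carry different labels, a label that is missing from an image shows up as NaN rather than 0. A minimal sketch of getting the 0/1 table from the question, assuming the label names have already been pulled out of each response (the dictionary labels_per_image below is made-up illustration data, not real Rekognition output):

```python
import pandas as pd

# Hypothetical label lists, one per image, standing in for the names
# extracted from each Rekognition response.
labels_per_image = {
    "Pic1": ["Human"],
    "Pic2": ["Flowers"],
    "Pic3": ["Human", "Animal"],
}

# One dict per image: the image name plus a 1 for every label present.
rows = [{"Image": image, **{label: 1 for label in labels}}
        for image, labels in labels_per_image.items()]

# Labels missing from an image come out as NaN; replace them with 0.
indicator = pd.DataFrame(rows).set_index("Image").fillna(0).astype(int)
print(indicator)
```

Building all rows first and creating the data frame once also avoids starting from an empty data frame at each iteration.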
I would like some advice on how to update/insert new data into an already existing data table using Python/Databricks:
# Inserting and updating already existing data
# Original data
import pandas as pd
source_data = {'Customer Number': ['1', '2', '3'],
'Colour': ['Red', 'Blue', 'Green'],
'Flow': ['Good', 'Bad', "Good"]
}
df1 = pd.DataFrame (source_data, columns = ['Customer Number','Colour', 'Flow'])
print(df1)
# New data
new_data = {'Customer Number': ['1', '4',],
'Colour': ['Blue', 'Blue'],
'Flow': ['Bad', 'Bad']
}
df2 = pd.DataFrame (new_data, columns = ['Customer Number','Colour', 'Flow'])
print(df2)
# What the updated table will look like
updated_data = {'Customer Number': ['1', '2', '3', '4',],
'Colour': ['Blue', 'Blue', 'Green', 'Blue',],
'Flow': ['Bad', 'Bad', "Good", 'Bad']
}
df3 = pd.DataFrame (updated_data, columns = ['Customer Number','Colour', 'Flow'])
print(df3)
What you can see here is that the original data has three customers. I then get 'new_data', which contains an update of customer 1's data and new data for customer 4, who was not in the original data. If you look at 'updated_data', you can see what the final data should look like: customer 1's data has been updated and customer 4's data has been inserted.
Does anyone know where I should start with this? Which module I could use?
I’m not expecting someone to solve this in terms of developing, just need a nudge in the right direction.
Edit: the data source is .txt or CSV, the output is JSON, but as I load the data to Cosmos DB it’ll automatically convert so don’t worry too much about that.
Thanks
Current data frame structure and 'DataFrame.update'
With some preparation, you can use the pandas 'DataFrame.update' method.
First, the data frames must be indexed (this is often useful anyway).
Second, the source data frame must be extended by the new indices with dummy/NaN data so that it can be updated, because 'update' only modifies existing index labels and never adds rows.
# set indices of original data frames
col = 'Customer Number'
df1.set_index(col, inplace=True)
df2.set_index(col, inplace=True)
df3.set_index(col, inplace=True)
# extend source data frame by new customer indices
df4 = df1.copy().reindex(index=df1.index.union(df2.index))
# update data
df4.update(df2)
# verify that new approach yields correct results
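# all() over a data frame iterates its column names, so it always passes;
# compare the values themselves instead
assert (df3 == df4).all().all()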
Current data frame structure and 'pd.concat'
A slightly easier approach joins the data frames and removes duplicate
rows (and sorts by index if wanted). However, the temporary concatenation requires
more memory which may limit the size of the data frames.
df5 = pd.concat([df1, df2])
df5 = df5.loc[~df5.index.duplicated(keep='last')].sort_index()
# compare the values themselves; all(df3 == df5) would only check column names
assert (df3 == df5).all().all()
Alternative data structure
Given that 'Customer Number' is the crucial attribute of your data,
you may also consider restructuring your original dictionaries like that:
{'1': ['Red', 'Good'], '2': ['Blue', 'Bad'], '3': ['Green', 'Good']}
Then updating your data simply corresponds to (re)setting the key of the source data with the new data. Typically, working directly on dictionaries is faster than using data frames.
# define function to restructure data, for demonstration purposes only
def restructure(data):
    # transpose original data
    # https://stackoverflow.com/a/6473724/5350621
    vals = data.values()
    rows = list(map(list, zip(*vals)))
    # create new restructured dictionary with customers as keys
    restructured = dict()
    for row in rows:
        restructured[row[0]] = row[1:]
    return restructured
# restructure data
source_restructured = restructure(source_data)
new_restructured = restructure(new_data)
# simply (re)set new keys
final_restructured = source_restructured.copy()
for key, val in new_restructured.items():
    final_restructured[key] = val
# convert to data frame and check results
df6 = pd.DataFrame(final_restructured, index=['Colour', 'Flow']).T
# compare the values themselves; all(df3 == df6) would only check column names
assert (df3 == df6).all().all()
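The key-resetting step can also be written with the built-in dict.update method, which overwrites existing keys and adds missing ones in a single call. A small sketch, with the restructured dictionaries typed out by hand for illustration:

```python
# Restructured data keyed by customer number, in the layout shown above.
source = {'1': ['Red', 'Good'], '2': ['Blue', 'Bad'], '3': ['Green', 'Good']}
new = {'1': ['Blue', 'Bad'], '4': ['Blue', 'Bad']}

# Copy first so the source dictionary is left untouched, then let
# update() overwrite existing keys and add the missing ones.
final = dict(source)
final.update(new)
print(final)
```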
PS: When setting 'df1 = pd.DataFrame(source_data, columns=[...])' you do not need the 'columns' argument because your dictionaries are nicely named and the keys are automatically taken as column names.
You can use set intersection to find the Customer Numbers to update and set difference to find the new Customer Numbers to add.
You can then update the rows of the initial data frame by iterating through the intersection of Customer Numbers, and finally concatenate the initial data frame with only those rows of the new data frame that hold new Customer Numbers.
# same name column for clarity
cn = 'Customer Number'
# convert Customer Number values into integers to use set
CusNum_df1 = [int(x) for x in df1[cn].values]
CusNum_df2 = [int(x) for x in df2[cn].values]
# find Customer Numbers to update and to add
CusNum_to_update = list(set(CusNum_df1).intersection(set(CusNum_df2)))
CusNum_to_add = list(set(CusNum_df2) - set(CusNum_df1))
# update rows in initial data frame
for num in CusNum_to_update:
    index_initial = df1.loc[df1[cn]==str(num)].index[0]
    index_new = df2.loc[df2[cn]==str(num)].index[0]
    for col in df1.columns:
        df1.at[index_initial,col]= df2.loc[index_new,col]
# concatenate new rows to initial data frame
for num in CusNum_to_add:
    df1 = pd.concat([df1, df2.loc[df2[cn]==str(num)]]).reset_index(drop=True)
out:
Customer Number Colour Flow
0 1 Blue Bad
1 2 Blue Bad
2 3 Green Good
3 4 Blue Bad
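The same intersection/difference idea can be sketched without explicit loops by using Series.isin to flag the rows of the initial data frame that are superseded. This is an alternative formulation, not the code of the answer above; it rebuilds the example data so that it runs on its own.

```python
import pandas as pd

cn = 'Customer Number'
df1 = pd.DataFrame({cn: ['1', '2', '3'],
                    'Colour': ['Red', 'Blue', 'Green'],
                    'Flow': ['Good', 'Bad', 'Good']})
df2 = pd.DataFrame({cn: ['1', '4'],
                    'Colour': ['Blue', 'Blue'],
                    'Flow': ['Bad', 'Bad']})

# Rows of df1 whose customer number also appears in df2 are superseded.
superseded = df1[cn].isin(df2[cn])

# Keep the untouched rows, append every row of df2, and restore the order.
result = (pd.concat([df1[~superseded], df2])
            .sort_values(cn)
            .reset_index(drop=True))
print(result)
```

Note that the customer numbers are strings here, so sort_values orders them lexicographically; with multi-digit numbers they would need converting to integers first.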
There are many ways, but in terms of readability, I would prefer to do this.
import pandas as pd
dict_source = {'Customer Number': ['1', '2', '3'],
'Colour': ['Red', 'Blue', 'Green'],
'Flow': ['Good', 'Bad', "Good"]
}
df_origin = pd.DataFrame.from_dict(dict_source)
dict_new = {'Customer Number': ['1', '4', ],
'Colour': ['Blue', 'Blue'],
'Flow': ['Bad', 'Bad']
}
df_new = pd.DataFrame.from_dict(dict_new)
df_result = df_origin.copy()
df_result.set_index(['Customer Number', ], inplace=True)
df_new.set_index(['Customer Number', ], inplace=True)
df_result.update(df_new) # update number 1
# handle number 4
df_result.reset_index(['Customer Number', ], inplace=True)
df_new.reset_index(['Customer Number', ], inplace=True)
df_result = df_result.merge(df_new, on=list(df_result), how='outer')
print(df_result)
Customer Number Colour Flow
0 1 Blue Bad
1 2 Blue Bad
2 3 Green Good
3 4 Blue Bad
You can use 'Customer Number' as index and use update method:
import pandas as pd
source_data = {'Customer Number': ['1', '2', '3'],
'Colour': ['Red', 'Blue', 'Green'],
'Flow': ['Good', 'Bad', "Good"]
}
df1 = pd.DataFrame (source_data, index=source_data['Customer Number'], columns=['Colour', 'Flow'])
print(df1)
# New data
new_data = {'Customer Number': ['1', '4',],
'Colour': ['Blue', 'Blue'],
'Flow': ['Bad', 'Bad']
}
df2 = pd.DataFrame (new_data, index=new_data['Customer Number'], columns=['Colour', 'Flow'])
print(df2)
df3 = df1.reindex(index=df1.index.union(df2.index))
df3.update(df2)
print(df3)
Colour Flow
1 Blue Bad
2 Blue Bad
3 Green Good
4 Blue Bad
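As a possible shortcut for the reindex-then-update pattern, DataFrame.combine_first does both steps at once: it takes the union of the two indexes and prefers the values of the calling data frame. A minimal sketch on the same example data:

```python
import pandas as pd

df1 = pd.DataFrame({'Colour': ['Red', 'Blue', 'Green'],
                    'Flow': ['Good', 'Bad', 'Good']},
                   index=['1', '2', '3'])
df2 = pd.DataFrame({'Colour': ['Blue', 'Blue'],
                    'Flow': ['Bad', 'Bad']},
                   index=['1', '4'])

# Values from df2 win; anything df2 lacks is filled in from df1.
df3 = df2.combine_first(df1)
print(df3)
```

As with update, a missing (NaN) value in the new data does not overwrite an existing value in the original data.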
I've got an output from an API call as a list:
out = client.phrase_this(phrase='ciao', database='it')
out
[{'Keyword': 'ciao',
'Search Volume': '673000',
'CPC': '0.05',
'Competition': '0',
'Number of Results': '205000000'}]
type(out)
list
I'd like to create a dataframe and, in a loop, append a new row to it for each of several keywords, using the API output for that keyword.
index = ['ciao', 'google', 'microsoft']
columns = ['Keyword', 'Search Volume', 'CPC', 'Competition', 'Number of Results']
df = pd.DataFrame(index=index, columns=columns)
For loop that is not working:
for keyword in index:
    df.loc[keyword] = client.phrase_this(phrase=index, database='it')
Thanks!
The reason this is not working is because you are trying to assign a dictionary inside of a list to the data frame row, rather than just a list.
You are receiving a list containing a dictionary. If you only want to use the first entry of this list the following solution should work:
for keyword in index:
    df.loc[keyword] = list(client.phrase_this(phrase=keyword, database='it')[0].values())
[0] gets the first entry of the list.
values() returns a view of all the values in the dictionary, and list() turns that view into a plain list. https://www.tutorialspoint.com/python/dictionary_values.htm
for keyword in index:
    df.loc[keyword] = client.phrase_this(phrase=keyword, database='it')
This passes the keyword to the phrase_this function, instead of the entire index list.
Thanks for the answers, I found a workaround:
index = ['ciao', 'google', 'microsoft']
columns = ['Keyword', 'Search Volume', 'CPC', 'Competition', 'Number of Results']
out = []
for query in index:
    out.append(client.phrase_this(phrase=query, database='it')[0].values())
out
[dict_values(['ciao', '673000', '0.05', '0', '205000000']),
dict_values(['google', '24900000', '0.66', '0', '13020000000']),
dict_values(['microsoft', '110000', '0.12', '0.06', '77'])]
df = pd.DataFrame(out, columns=columns).set_index('Keyword')
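Another option is to keep the dictionaries as they are and let pandas map their keys to columns, which avoids relying on the order of values(). In this sketch, fake_phrase_this is a hypothetical stand-in for client.phrase_this that returns placeholder values in the same shape as the API output shown in the question.

```python
import pandas as pd

def fake_phrase_this(phrase, database):
    # Hypothetical stand-in for client.phrase_this: returns a list
    # holding one dictionary, shaped like the API output in the question.
    return [{'Keyword': phrase, 'Search Volume': '0', 'CPC': '0.00',
             'Competition': '0', 'Number of Results': '0'}]

keywords = ['ciao', 'google', 'microsoft']

# Keep the dictionaries themselves; pandas maps their keys to columns.
records = [fake_phrase_this(phrase=k, database='it')[0] for k in keywords]
df = pd.DataFrame(records).set_index('Keyword')
print(df)
```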
I need to create lookup tables in Python from a CSV. I have to do this, though, by unique values in my columns. The example is attached. I have a name column that is the name of the model. For each model, I need a dictionary with the title from the variable column, the key from the level column and the value from the value column. I'm thinking the best thing is a dictionary of dictionaries. I will use this lookup table in the future to multiply the values together based on the keys.
Here is code to generate sample data set:
Name = ['model1', 'model1', 'model1', 'model2', 'model2',
'model2','model1', 'model1', 'model1', 'model1', 'model2', 'model2',
'model2','model2']
Variable = ['channel_model','channel_model','channel_model','channel_model','channel_model','channel_model', 'driver_age', 'driver_age', 'driver_age', 'driver_age',
'driver_age', 'driver_age', 'driver_age', 'driver_age']
channel_Level = ['Dir', 'IA', 'EA','Dir', 'IA', 'EA', '21','22','23','24', '21','22','23','24']
Value = [1.11,1.18,1.002, 2.2, 2.5, 2.56, 1.1,1.2,1.3,1.4,2.1,2.2,2.3,2.4]
df= {'Name': Name, 'Variable': Variable, 'Level': channel_Level, 'Value':Value}
factor_table = pd.DataFrame(df)
I have read the following but it hasn't yielded great results:
Python Creating Dictionary from excel data
I've also tried:
import pandas as pd
factor_table = pd.read_excel('...\\factor_table_example.xlsx')
#define function to be used multiple times
def factor_tables(file, model_column, variable_column, level_column, value_column):
    for i in file[model_column]:
        for row in file[variable_column]:
            lookup = {}
            lookup = dict(zip(file[level_column], file[value,column]))
This yields the error:
dict expected at most 1 arguments, got 2
What I would ultimately like is:
{{'model2':{'channel':{'EA':1.002, 'IA': 1.18, 'DIR': 1.11}}}, {'model1':{'channel':{'EA':1.86, 'IA': 1.66, 'DIR': 1.64}}}}
Using collections.defaultdict, you can create a nested dictionary while iterating your dataframe. Then realign into a list of dictionaries via a list comprehension.
from collections import defaultdict
tree = lambda: defaultdict(tree)
d = tree()
for row in factor_table.itertuples(index=False):
    d[(row.Name, row.Variable)].update({row.Level: row.Value})
res = [{k[0]: {k[1]: dict(v)}} for k, v in d.items()]
print(res)
[{'model1': {'channel_model': {'Dir': 1.110, 'EA': 1.002, 'IA': 1.180}}},
{'model2': {'channel_model': {'Dir': 2.200, 'EA': 2.560, 'IA': 2.500}}},
{'model1': {'driver_age': {'21': 1.100, '22': 1.200, '23': 1.300, '24': 1.400}}},
{'model2': {'driver_age': {'21': 2.100, '22': 2.200, '23': 2.300, '24': 2.400}}}]
It looks like your error could be coming from this line:
lookup = dict(zip(file[level_column], file[value,column]))
where value,column (with a comma) is written instead of the value_column parameter, so two names are given where a single key is expected. The loop you might be looking for is like so
def factor_tables(file, model_column, variable_column, level_column, value_column):
    lookup = {}
    # group the rows by model so each model only sees its own levels and values
    for model, group in file.groupby(model_column):
        lookup[model] = dict(zip(group[level_column], group[value_column]))
    return lookup
This will return to you a single dictionary with keys corresponding to individual (and unique) models:
{'model_1':{'level_col': 'val_col'}, 'model_2':...}
Allowing you to use:
lookups.get('model_1')
{'level_col': 'val_col'}
If you need the variable_column, you can wrap it one level deeper:
def factor_tables(file, model_column, variable_column, level_column, value_column):
    lookup = {}
    # one inner dictionary per (model, variable) pair
    for (model, variable), group in file.groupby([model_column, variable_column]):
        lookup.setdefault(model, {})[variable] = dict(zip(group[level_column], group[value_column]))
    return lookup
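The question mentions multiplying the looked-up values together later on. A rough sketch of how a nested lookup of the form {model: {variable: {level: value}}} could be used for that; the helper combined_factor is a made-up name for illustration, and the table is a shortened version of the sample data.

```python
import pandas as pd

factor_table = pd.DataFrame({
    'Name': ['model1', 'model1', 'model1', 'model1'],
    'Variable': ['channel_model', 'channel_model', 'driver_age', 'driver_age'],
    'Level': ['Dir', 'IA', '21', '22'],
    'Value': [1.11, 1.18, 1.1, 1.2],
})

# Build {model: {variable: {level: value}}} one row at a time.
lookup = {}
for row in factor_table.itertuples(index=False):
    lookup.setdefault(row.Name, {}).setdefault(row.Variable, {})[row.Level] = row.Value

def combined_factor(model, levels):
    # levels maps each variable name to the level chosen for it
    factor = 1.0
    for variable, level in levels.items():
        factor *= lookup[model][variable][level]
    return factor

print(combined_factor('model1', {'channel_model': 'IA', 'driver_age': '22'}))
```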
I have an array which looks like this,
[{'interval': '1',
'paramlist': [{'PARAMCODE': 'P7-3-5-2-0', 'UNIT': 'k', 'VALUE': '0'},
{'PARAMCODE': 'P2-1-3-4-0', 'UNIT': 'A', 'VALUE': '0'}]},
{'interval': '2',
'paramlist': [{'PARAMCODE': 'P7-3-5-2-0', 'UNIT': 'k', 'VALUE': '0'},
{'PARAMCODE': 'P2-1-3-4-0', 'UNIT': 'A', 'VALUE': '0'}]},
and it goes on like this for many more intervals.
How can I iterate over it and put these values into a pandas dataframe with the columns interval, paramcode, unit and value?
This is something I have done
from pprint import pprint
D4 = root.find('UTILITYTYPE').find('D4')
dayProfileRequested = {'DATE': dateRequested, 'IPlist': None}
for dayprofile in D4:
    if dayprofile.attrib['DATE'] != dateRequested:
        continue
    else:
        ipList = []
        for ip in dayprofile:
            ipDict = {'interval': ip.attrib['INTERVAL']}
            paramList = []
            for param in ip:
                paramDict = {'PARAMCODE': param.attrib['PARAMCODE'], 'VALUE': param.attrib['VALUE'],
                             'UNIT': param.attrib['UNIT']}
                paramList.append(paramDict)
            ipDict['paramlist'] = paramList
            ipList.append(ipDict)
        dayProfileRequested['IPlist'] = ipList
        break
pprint(dayProfileRequested)
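Assuming your list is referenced by x, you can use json_normalize with the record_path and meta parameters (available as pd.json_normalize in pandas 1.0 and later; older versions expose it as pd.io.json.json_normalize):
df = pd.json_normalize(x, record_path=['paramlist'], meta=['interval'])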
df
PARAMCODE UNIT VALUE interval
0 P7-3-5-2-0 k 0 1
1 P2-1-3-4-0 A 0 1
2 P7-3-5-2-0 k 0 2
3 P2-1-3-4-0 A 0 2
One recommendation I have for all comers is that, if you're working with JSON, use pandas' JSON normalization function (json_normalize). Most JSON structures are simple enough to work with out-of-the-box usage of the API. Other structures (such as this one) need a little more work, and that's fine: you can figure it out easily enough through trial and error.
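For a structure this small, the flattening can also be sketched by hand with a nested list comprehension, which makes explicit what is happening: every entry of paramlist becomes one row, and the interval of its parent is copied onto it. The list x below repeats the sample data from the question.

```python
import pandas as pd

x = [
    {'interval': '1',
     'paramlist': [{'PARAMCODE': 'P7-3-5-2-0', 'UNIT': 'k', 'VALUE': '0'},
                   {'PARAMCODE': 'P2-1-3-4-0', 'UNIT': 'A', 'VALUE': '0'}]},
    {'interval': '2',
     'paramlist': [{'PARAMCODE': 'P7-3-5-2-0', 'UNIT': 'k', 'VALUE': '0'},
                   {'PARAMCODE': 'P2-1-3-4-0', 'UNIT': 'A', 'VALUE': '0'}]},
]

# One row per parameter, carrying the interval of the enclosing entry.
rows = [{'interval': item['interval'], **param}
        for item in x
        for param in item['paramlist']]
df = pd.DataFrame(rows)
print(df)
```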