Python: append list in for-loop unexpected result - python

I am trying to create a new variable from a list ('provider') that checks if some ids are present in another column in the data frame:
import pandas as pd
xx = {'provider_id': [1, 2, 30, 8, 8, 7, 9]}
xx = pd.DataFrame(data=xx)
ids = [8,9,30]
names = ["netflix", "prime","sky"]
for id_,name in zip(ids,names):
provider = []
if id_ in xx["provider_id"]:
provider.append(name)
provider
excpected result:
['netflix', 'prime', 'sky']
actual result:
['sky']
So the for loop keeps overwriting the result of name inside the loop? This functionality seems weird to me and I honestly don't know how to prevent this other then to write three individual if statements.

Your loop keeps initialising the list. Move the list outside the loop:
provider = []
for id_,name in zip(ids,names):
if id_ in xx["provider_id"]:
provider.append(name)
print(provider)

Scrap the loops altogether and use the built-in pandas methods. It will work much faster.
df = pd.DataFrame({'ids': [8,9,30], 'names': ["netflix", "prime","sky"]})
cond = df.ids.isin(xx.provider_id)
df.loc[cond, 'names'].tolist()
['netflix', 'prime', 'sky']

One way to make this more efficient is using sets and isin to find the matching ids in the dataframe, and then a list comprehension with zip to keep the corresponding names.
The error as #quamrana points out is that you keep resetting the list inside the loop.
s = set(xx.loc[xx.isin(ids).values, 'provider_id'].values)
# {8, 9, 30}
[name for id_, name in zip(ids, names) if id_ in s]
# ['netflix', 'prime', 'sky']

Related

Pandas DF.output write to columns (current data is written all to one row or one column)

I am using Selenium to extract data from the HTML body of a webpage and am writing the data to a .csv file using pandas.
The data is extracted and written to the file, however I would like to manipulate the formatting of the data to write to specified columns, after reading many threads and docs I am not able to understand how to do this.
The current CSV file output is as follows, all data in one row or one column
0,
B09KBFH6HM,
dropdownAvailable,
90,
1,
B09KBNJ4F1,
dropdownAvailable,
100,
2,
B09KBPFPCL,
dropdownAvailable,
110
or if I use the [count] count +=1 method it will be one row
0,B09KBFH6HM,dropdownAvailable,90,1,B09KBNJ4F1,dropdownAvailable,100,2,B09KBPFPCL,dropdownAvailable,110
I would like the output to be formatted as follows,
/col1 /col2 /col3 /col4
0, B09KBFH6HM, dropdownAvailable, 90,
1, B09KBNJ4F1, dropdownAvailable, 100,
2, B09KBPFPCL, dropdownAvailable, 110
I have tried using columns= options but get errors in the terminal and don't understand what feature I should be using to achieve this in the docs under the append details
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html?highlight=append#pandas.DataFrame.append
A simplified version is as follows
from selenium import webdriver
import pandas as pd
price = []
driver = webdriver.Chrome("./chromedriver")
driver.get("https://www.example.co.jp/dp/zzzzzzzzzz/")
select_box = driver.find_element_by_name("dropdown_selected_size_name")
options = [x for x in select_box.find_elements_by_tag_name("option")]
for element in options:
price.append(element.get_attribute("value"))
price.append(element.get_attribute("class"))
price.append(element.get_attribute("data-a-html-content"))
output = pd.DataFrame(price)
output.to_csv("Data.csv", encoding='utf-8-sig')
driver.close()
Do I need to parse each item separately and append?
I would like each of the .get_attribute values to be written to a new column.
Is there any advice you can offer for a solution to this as I am not very proficient at pandas, thank you for your helps
 Approach similar to #user17242583, but a little shorter:
data = [[e.get_attribute("value"), e.get_attribute("class"), e.get_attribute("data-a-html-content")] for e in options]
df = pd.DataFrame(data, columns=['ASIN', 'dropdownAvailable', 'size']) # third column maybe is the product size
df.to_csv("Data.csv", encoding='utf-8-sig')
Adding all your items to the price list is going to cause them all to be in one column. Instead, store separate lists for each column, in a dict, like this (name them whatever you want):
data = {
'values': [],
'classes': [],
'data_a_html_contents': [],
}
...
for element in options:
values.append(element.get_attribute("value"))
classes.append(element.get_attribute("class"))
data_a_html_contents.append(element.get_attribute("data-a-html-content"))
...
output = pd.DataFrame(data)
output.to_csv("Data.csv", encoding='utf-8-sig')
You were collecting the value, class and data-a-html-content and appending them to the same list price. Hence, the list becomes:
price = [value1, class1, data-a-html-content1, value2, class2, data-a-html-content2, ...]
Hence, within the dataframe it looks like:
Solution
To get value, class and data-a-html-content in seperate columns you can adopt any of the below two approaches:
Pass a dictionary to the dataframe.
Pass a list of lists to the dataframe.
While the #user17242583 and #h.devillefletcher suggests a dictionary, you can still achieve the same using list of lists as follows:
values = []
classes = []
data-a-html-contents = []
driver = webdriver.Chrome("./chromedriver")
driver.get("https://www.example.co.jp/dp/zzzzzzzzzz/")
select_box = driver.find_element_by_name("dropdown_selected_size_name")
options = [x for x in select_box.find_elements_by_tag_name("option")]
for element in options:
values.append(element.get_attribute("value"))
classes.append(element.get_attribute("class"))
data-a-html-contents.append(element.get_attribute("data-a-html-content"))
df = pd.DataFrame(data=list(zip(values, classes, data-a-html-contents)), columns=['Value', 'Class', 'Data-a-Html-Content'])
output = pd.DataFrame(my_list)
output.to_csv("Data.csv", encoding='utf-8-sig')
References
You can find a couple of relevant detailed discussions in:
Selenium: Web-Scraping Historical Data from Coincodex and transform into a Pandas Dataframe
Python Selenium: How do I print the values from a website in a text file?

updating dictionary in a nested loop

In the code below, I would like to update the fruit_dict dictionary with the mean price of each row. But the code is not working as expected. Kindly help.
#!/usr/bin/python3
import random
import numpy as np
import pandas as pd
price=np.array(range(20)).reshape(5,4) #sample data for illustration
fruit_keys = [] # list of keys for dictionary
for i in range(5):
key = "fruit_" + str(i)
fruit_keys.append(key)
# initialize a dictionary
fruit_dict = dict.fromkeys(fruit_keys)
fruit_list = []
# print(fruit_dict)
# update dictionary values
for i in range(price.shape[1]):
for key,value in fruit_dict.items():
for j in range(price.shape[0]):
fruit_dict[key] = np.mean(price[j])
fruit_list.append(fruit_dict)
fruit_df = pd.DataFrame(fruit_list)
print(fruit_df)
Instead of creating the dictionary with the string pattern you can append the values for the means of rows as a string pattern by iterating the rows only.
In case if you have a dictionary with a certain pattern you can update the value in a single loop by assigning the key as the pattern which you need for displaying. you don't need to create an additional list for creating a data frame instead you can refer the documentation for creating data frames from dictionary itself Here. I have provided a sample output which may be suitable for your requirement.
In case you need an output with mean value as a column and fruits as rows you can use the below implementation.
#!/usr/bin/python3
import random
import numpy as np
import pandas as pd
row = 5
column = 4
price = np.array(range(20)).reshape(row, column) # sample data for illustration
# initialize a dictionary
fruit_dict = {}
for j in range(row):
fruit_dict['fruit_'+str(j)] = np.mean(price[j])
fruit_df = pd.DataFrame.from_dict(fruit_dict,orient='index',columns=['mean_value'])
print(fruit_df)
This will provide an output like below. As I already mentioned you can create the data frame as you wish from a dictionary by referring the above data frame documentation.
mean_value
fruit_0 1.5
fruit_1 5.5
fruit_2 9.5
fruit_3 13.5
fruit_4 17.5
`
You shouldn't nest the loop over the range and the dictionary items, you should iterate over them together. You can do this with enumerate().
You're also not using value, so there's no need to use items().
for i, key in enumerate(fruit_dict):
fruit_dict[key] = np.mean(price[j])
Could arrive on a solution based on the answer provided by Sangeerththan. Please find the same below.
#!/usr/bin/python3
fruit_dict = {}
fruit_list =[]
price=np.array(range(40)).reshape(4,10)
for i in range(price.shape[0]):
mark_price = np.square(price[i])
for j in range(mark_price.shape[0]):
fruit_dict['proj_fruit_price_'+str(j)] = np.mean(mark_price[j])
fruit_list.append(fruit_dict.copy())
fruit_df = pd.DataFrame(fruit_list)
You can use this instead of your loops:
fruit_keys = [] # list of keys for dictionary
for i in range(5):
key = "fruit_" + str(i)
fruit_keys.append(key)
out = {fruit_keys[index]: np.mean(price[index]) for index in range(price.shape[0])}
Output:
{'fruit_1': '1.5', 'fruit_2': '5.5', 'fruit_3': '9.5', 'fruit_4': '13.5', 'fruit_5': '17.5'}

Loop through list of dataframes and save as new dataframe name

I'm trying to loop through a list of dataframes and perform operations on them. In the final command I want to rename the dataframe as the original key plus '_rand_test'. I'm getting the error:
SyntaxError: cannot assign to operator
Is there a way to do this?
segments = [main_h, main_m, main_l]
seg_name = ['main_h', 'main_m', 'main_l']
for i in segments:
control = pd.DataFrame(i.groupby('State', group_keys=False).apply(lambda x : x.sample(frac = .1)))
control['segment'] = 'control'
test= i[~i.index.isin(control.index)]
test['segment'] = 'test'
seg_name[i]+'_rand_test' = pd.concat([control,test])
The error is because you are trying to perform addition on the left side of an = sign, which you can never do. If you want to rename the dataframe you could just do it on the next line. I'm unsure of what exactly you're trying to rename based off of the code, but if it's just the corresponding string in the seg_name list then the next line would look like this:
seg_name[segments.index(i)] += 'rand_test'
The reason for the segments.index(i) is because you're looping over the elements in segments, not their indexes, so you need to get the index of the element.
Maybe this will work for you?
Create an empty list befor you run the loop and fill that list with append function. And then you rename all the elements of the new list.
segments = [main_h, main_m, main_l]
seg_name = ['main_h', 'main_m', 'main_l']
new_list= []
for i in segments:
control = pd.DataFrame(i.groupby('State', group_keys=False).apply(lambda x : x.sample(frac = .1)))
control['segment'] = 'control'
test= i[~i.index.isin(control.index)]
test['segment'] = 'test'
new_list.append(df)
new_names_list=[item +'_rand_test' for item in new_list]

Looping at specified indexes

I have 2 large lists, each with about 100 000 elements each and one being larger than the other, that I want to iterate through. My loop looks like this:
for i in list1:
for j in list2:
function()
This current looping takes too long. However, list1 is a list that needs to be checked from list2 but, from a certain index, there are no more instances beyond in list2. This means that looping from indexes might be faster but the problem is I do not know how to do so.
In my project, list2 is a list of dicts that have three keys: value, name, and timestamp. list1 is a list of the timestamps in order. The function is one that takes the value based off the timestamp and puts it into a csv file in the appropriate name column.
This is an example of entries from list1:
[1364310855.004000, 1364310855.005000, 1364310855.008000]
This is what list2 looks like:
{"name":"vehicle_speed","value":2,"timestamp":1364310855.004000}
{"name":"accelerator_pedal_position","value":4,"timestamp":1364310855.004000}
{"name":"engine_speed","value":5,"timestamp":1364310855.005000}
{"name":"torque_at_transmission","value":-3,"timestamp":1364310855.008000}
{"name":"vehicle_speed","value":1,"timestamp":1364310855.008000}
In my final csv file, I should have something like this:
http://s000.tinyupload.com/?file_id=03563948671103920273
If you want this to be fast, you should restructure the data that you have in list2 in order to speedup your lookups:
# The following code converts list2 into a multivalue dictionary
from collections import defaultdict
list2_dict = defaultdict(list)
for item in list2:
list2_dict[item['timestamp']].append((item['name'], item['value']))
This gives you a much faster way to look up your timestamps:
print(list2_dict)
defaultdict(<type 'list'>, {
1364310855.008: [('torque_at_transmission', -3), ('vehicle_speed', 0)],
1364310855.005: [('engine_speed', 0)],
1364310855.004: [('vehicle_speed', 0), ('accelerator_pedal_position', 0)]})
Lookups will be much more efficient when using list2_dict:
for i in list1:
for j in list2_dict[i]:
# here j is a tuple in the form (name, value)
function()
You appear to only want to use the elements in list2 that correspond to i*2 and i*2+1, That is elements 0, 1 and 2, 3, ...
You only need one loop.
for i in range(len(list1)):
j = list[i*2]
k = list2[j+1]
# Process function using j and k
You will only process to the end of list one.
i think pandas module is a perfect match for your goals...
import ujson # 'ujson' (Ultra fast JSON) is faster than the standard 'json'
import pandas as pd
filter_list = [1364310855.004000, 1364310855.005000, 1364310855.008000]
def file2list(fn):
with open(fn) as f:
return [ujson.loads(line) for line in f]
# Use pd.read_json('data.json') instead of pd.DataFrame(load_data('data.json'))
# if you have a proper JSON file
#
# df = pd.read_json('data.json')
df = pd.DataFrame(file2list('data.json'))
# filter DataFrame with 'filter_list'
df = df[df['timestamp'].isin(filter_list)]
# convert UNIX timestamps to readable format
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
# pivot data frame
# fill NaN's with zeroes
df = df.pivot(index='timestamp', columns='name', values='value').fillna(0)
# save data frame to CSV file
df.to_csv('output.csv', sep=',')
#pd.set_option('display.expand_frame_repr', False)
#print(df)
output.csv
timestamp,accelerator_pedal_position,engine_speed,torque_at_transmission,vehicle_speed
2013-03-26 15:14:15.004,4.0,0.0,0.0,2.0
2013-03-26 15:14:15.005,0.0,5.0,0.0,0.0
2013-03-26 15:14:15.008,0.0,0.0,-3.0,1.0
PS i don't know where did you get [Latitude,Longitude] columns from, but it's pretty easy to add those columns to your result DataFrame - just add the following lines before calling df.to_csv()
df.insert(0, 'latitude', 0)
df.insert(1, 'longitude', 0)
which would result in:
timestamp,latitude,longitude,accelerator_pedal_position,engine_speed,torque_at_transmission,vehicle_speed
2013-03-26 15:14:15.004,0,0,4.0,0.0,0.0,2.0
2013-03-26 15:14:15.005,0,0,0.0,5.0,0.0,0.0
2013-03-26 15:14:15.008,0,0,0.0,0.0,-3.0,1.0

Format an array of tuples in a nice "table"

Say I have an array of tuples which look like that:
[('url#id1', 'url#predicate1', 'value1'),
('url#id1', 'url#predicate2', 'value2'),
('url#id1', 'url#predicate3', 'value3'),
('url#id2', 'url#predicate1', 'value4'),
('url#id2', 'url#predicate2', 'value5')]
I would like be able to return a nice 2D array to be able to display it "as it" in my page through django.
The table would look like that:
[['', 'predicate1', 'predicate2', 'predicate3'],
['id1', 'value1', 'value2', 'value3'],
['id2', 'value4', 'value5', '']]
You will notice that the 2nd item of each tuple became the table "column's title" and that we now have rows with ids and columns values.
How would you do that? Of course if you have a better idea than using the table example I gave I would be happy to have your thoughts :)
Right now I am generating a dict of dict and display that in django. But as my pairs of keys, values are not always in the same order in my dicts, then it cannot display correctly my data.
Thanks!
Your dict of dict is probably on the right track. While you create that dict of dict, you could also maintain a list of ids and a list of predicates. That way, you can remember the ordering and build the table by looping through those lists.
using the zip function on your initial array wil give you three lists: the list of ids, the list of predicates and the list of values.
to get rid of duplicates, try the reduce function:
list_without_duplicates = reduce(
lambda l, x: (l[-1] != x and l.append(x)) or l, list_with_duplicates, [])
Ok,
At last I came up with that code:
columns = dict()
columnsTitles = []
rows = dict()
colIdxCounter = 1 # Start with 1 because the first col are ids
rowIdxCounter = 1 # Start with 1 because the columns titles
for i in dataset:
if not rows.has_key(i[0]):
rows[i[0]] = rowIdxCounter
rowIdxCounter += 1
if not columns.has_key(i[1]):
columns[i[1]] = colIdxCounter
colIdxCounter += 1
columnsTitles.append(i[1])
toRet = [columnsTitles]
for i in range(len(rows)):
toAppend = []
for j in range(colIdxCounter):
toAppend.append("")
toRet.append(toAppend)
for i in dataset:
toRet[rows[i[0]]][columns[i[1]]] = i[2]
for i in toRet:
print i
Please don't hesitate to comment/improve it :)

Categories