Wrong indexing while creating a pandas dataframe with a dictionary iteratively - python

I am trying to create a pandas dataframe from the selenium objects 'left' and 'right':
left = driver.find_elements(by=By.CLASS_NAME, value='lc')
right = driver.find_elements(by=By.CLASS_NAME, value='rc')
These return lists of elements whose text values vary in number from item to item, but 'left' and 'right' always have the same number of elements within an iteration. The strings from 'left' are the column names, and the values from 'right' have to be appended under the corresponding columns. I tried the following:
for l, r in zip(left, right):
    # Get the text from the left and right elements
    l_text = l.text
    r_text = r.text
    # Create a dictionary for the row with the left text as the key and the right text as the value
    row = {l_text: r_text}
    # Append the dictionary to the list
    data.append(row)
# Create the dataframe from the list of dictionaries
df = pd.DataFrame(data)
The resulting df has an indexing problem: each value lands on a new row instead of joining the other values from the same iteration. How do I add all values from one iteration to the same row?
The 'left' values are attributes of brake disks and the 'right' values are their corresponding measurements; these vary per item, sometimes more and sometimes fewer.
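For reference, a minimal selenium-free sketch with dummy values reproduces the symptom: one single-key dict per pair yields one row per dict, so the frame comes out as a diagonal of values padded with NaN:
import pandas as pd

# dummy stand-ins for the selenium .text values
pairs = [("diameter", "300 mm"), ("thickness", "22 mm"), ("height", "50 mm")]
data = [{l_text: r_text} for l_text, r_text in pairs]
print(pd.DataFrame(data))
#   diameter thickness height
# 0   300 mm       NaN    NaN
# 1      NaN     22 mm    NaN
# 2      NaN       NaN  50 mm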

The following should do what you want:
Items are added to row until a header repeats.
Once a duplicate header is discovered, the row variable is appended to data and then cleared for the next round / row.
data = []
row = {}
for l, r in zip(left, right):
    # Get the text from the left and right elements
    l_text = l.text
    r_text = r.text
    if l_text is not None and l_text != "":
        if l_text in row:
            data.append(row)
            row = {}
        row[l_text] = r_text
# This is required to append the last row
if len(row) > 0:
    data.append(row)
# Create the dataframe from the list of dictionaries
df = pd.DataFrame(data)
print(df)
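The logic can be checked without selenium by feeding it plain strings (dummy values assumed):
import pandas as pd

# two disks' worth of attributes; a repeated header starts a new row
left_texts = ["diameter", "thickness", "diameter", "thickness", "height"]
right_texts = ["300 mm", "22 mm", "280 mm", "20 mm", "48 mm"]

data, row = [], {}
for l_text, r_text in zip(left_texts, right_texts):
    if l_text:
        if l_text in row:  # duplicate header -> current row is complete
            data.append(row)
            row = {}
        row[l_text] = r_text
if row:  # append the last row
    data.append(row)

print(pd.DataFrame(data))
#   diameter thickness height
# 0   300 mm     22 mm    NaN
# 1   280 mm     20 mm  48 mm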

I made some adjustments to your code: I collect every key/value pair in one dictionary, then append that dictionary to the dataframe as a single row.
data = pd.DataFrame()
dic = {}
for l, r in zip(left, right):
    # Get the text from the left and right elements
    # and collect every key/value pair in one dictionary
    dic[l.text] = r.text
# Append the dictionary to the dataframe as a single row
data = data.append(dic, ignore_index=True)
# data is your final dataframe
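Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. With a current pandas, the same single-row append (same left, right and data variables assumed) can be written with pd.concat:
dic = {l.text: r.text for l, r in zip(left, right)}
data = pd.concat([data, pd.DataFrame([dic])], ignore_index=True)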

Try doing it this way:
left = driver.find_elements(by=By.CLASS_NAME, value='lc')
right = driver.find_elements(by=By.CLASS_NAME, value='rc')
# Create a dictionary with keys from the left and empty lists as values
data = {}
for element in left:
    if element.text not in data.keys():
        data[element.text] = list()
for l, r in zip(left, right):
    # Add an element to the list by key
    data[l.text].append(r.text)
# Create the dataframe from the dictionary
df = pd.DataFrame.from_dict(data)
I have not worked with selenium, so you may need to tweak the code a little (in terms of getting the text from the left list's values).
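One caveat to this approach: pd.DataFrame.from_dict raises ValueError("All arrays must be of the same length") when the lists differ in length, which the question says they do. A possible workaround (a sketch, with dummy values) is to wrap each list in a Series so short columns are padded with NaN:
import pandas as pd

data = {"diameter": ["300 mm", "280 mm"], "height": ["48 mm"]}  # unequal lengths
df = pd.DataFrame({key: pd.Series(values) for key, values in data.items()})
print(df)
#   diameter height
# 0   300 mm  48 mm
# 1   280 mm    NaN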

Related

Using OpenPyxl to dynamically create dictionaries?

I'm using Openpyxl to read an Excel file, specifically one column, that looks like the excel snapshot in the original post.
The number of main and sub classes in the source document can change, and my goal is to be able to iterate through and create perhaps a nested dictionary for each main class of the form:
main_Class1 = {'subClass1': {'data': 'data_1'},
               'subClass2': {'data': 'data_2'}}
I'm open to any data type, as long as the info is connected like this.
I've thought of having the classes in column B, merging main classes into column A and subclasses into column C, then hiding A and C so I can separate mains and subs to iterate more easily (see the second snapshot in the original post), and tried:
from collections import defaultdict

mainClassList = []
mainClassDict = defaultdict(list)
activeClassList = []
for row in ws.iter_rows(min_row=2):
    activeClass = ""  # supposed to update this at every appropriate row
    if row[0].value is not None:
        activeClass = row[0].value
        mainClassList.append(activeClass)
        mainClassDict[activeClass] = []
        activeClassList.append(activeClass)
    # add 2nd column entries to 1st column keys
    # would be better if these were nested dicts
    if row[0].value is None and row[1].value is not None:
        mainClassDict[activeClass].append(row[1].value)
# check to see things are being added and updated as needed
print("main Class List:", mainClassList)
print("active classes:", activeClassList)
for key, value in mainClassDict.items():
    print(key, ' : ', value)
I eventually solved with the following:
mainClassDict = {}  # create an empty dictionary for each level
subClassDict = {}
for row in class_sheet.iter_rows(min_row=2):
    # catch only column 1 (main class) values
    if row[0].value is not None:
        main = row[0].value
        mainClassDict[main] = {}
        subClassDict[main] = {}
    # if none in column 1, take the column 2 value instead
    if row[0].value is None and row[1].value is not None:
        subclasslist = []
        subclasslist.append(row[1].value)
        # create empty list for data
        attributelist = []
        # populate data list from the row of choice
        attributelist.append(row[...].value)
        # populate dict from elements in list using nested for
        for key in subclasslist:
            for value in attributelist:
                subClassDict[main][key] = value
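The grouping logic can be exercised without a workbook by simulating iter_rows with plain tuples (dummy rows assumed, None standing in for empty cells):
# (main, sub, data) tuples standing in for worksheet rows
rows = [
    ("main_Class1", None, None),
    (None, "subClass1", "data_1"),
    (None, "subClass2", "data_2"),
    ("main_Class2", None, None),
    (None, "subClass1", "data_3"),
]

subClassDict = {}
main = None
for main_cell, sub_cell, data_cell in rows:
    if main_cell is not None:  # a main-class row opens a new nested dict
        main = main_cell
        subClassDict[main] = {}
    elif sub_cell is not None:  # a subclass row fills the current main class
        subClassDict[main][sub_cell] = {"data": data_cell}

print(subClassDict)
# {'main_Class1': {'subClass1': {'data': 'data_1'}, 'subClass2': {'data': 'data_2'}},
#  'main_Class2': {'subClass1': {'data': 'data_3'}}}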

python df code works outside loop but not inside

Other answers on Stack Overflow do address loop problems, but none addresses a dataframe outside a loop, so I have to ask this question.
Outside of a loop, the code below does exactly what it should: it grabs a table, turns it into a dataframe, and appends it to final_df:
empty = []
final_df = pd.DataFrame(empty, columns=['column_1', 'column_2', 'column_3',
                                        'column_4', 'report'])
document = Document(targets_in_dir[1])
table = document.tables[2]
data = []
keys = None
for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)
    if i == 0:
        keys = tuple(text)
        continue
    row_data = dict(zip(keys, text))
    data.append(row_data)
df = pd.DataFrame(data)
df['report'] = str(targets_in_dir[1])
final_df = interim_df.append(df)
print(targets_in_dir[1])
Once I pack it into a loop (see below) which iterates through the filenames specified in the targets_in_dir list, my final_df is always empty. How can I fix this? I want final_df to contain all the rows extracted from the same table in all the files.
for idx, c in enumerate(targets_in_dir):
    try:
        document = Document(c)
        table = document.tables[2]
        processed_files.append(c)
    except:
        error_log.append(c)
        data = []
        keys = None
        for i, row in enumerate(table.rows):
            text = (cell.text for cell in row.cells)
            if i == 0:
                keys = tuple(text)
                continue
            row_data = dict(zip(keys, text))
            data.append(row_data)
        df = pd.DataFrame(data)
        df['report'] = str(c)
        final_df.append(df)
final_df.append(df) does not change final_df in place.
Try changing it to final_df = final_df.append(df); this will update final_df within the loop.
Pandas append documentation contains a note on this:
Unlike the append() method, which appends to the original list and returns None, append() here does not modify df1 and returns its copy with df2 appended.
If your code is exactly as it is represented in your question, the code which processes the data is indented such that it is executed only as part of the exception handling.
Moving the code to the left by one indent will ensure that it is executed as part of the for loop but outside of the try and exception handling blocks.
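Putting both fixes together, a sketch of the loop (assuming pd, Document, targets_in_dir, processed_files and error_log as in the question, and using pd.concat since DataFrame.append has since been removed from pandas):
all_dfs = []
for c in targets_in_dir:
    try:
        document = Document(c)
        table = document.tables[2]
        processed_files.append(c)
    except Exception:
        error_log.append(c)
        continue  # skip files whose table cannot be read
    data = []
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        if i == 0:
            keys = tuple(text)
            continue
        data.append(dict(zip(keys, text)))
    df = pd.DataFrame(data)
    df['report'] = str(c)
    all_dfs.append(df)
final_df = pd.concat(all_dfs, ignore_index=True)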

How do I change the value type (from float to integer) in a list of dictionaries?

I am trying to parse an Excel file with the output as a list of dictionaries. I wish to change the type of one of two columns:
Date: any date format
Account (see pic in the original post): from float to integer (in the Excel file it has no decimal values)
How do I make the change so that it is saved permanently for further code that references this list of dictionaries?
The output of my code is as seen in the picture in the original post.
I tried various options but was unsuccessful in making the change and having it displayed in the output.
My code:
import xlrd

workbook = xlrd.open_workbook('filelocation')
ws = workbook.sheet_by_index(0)
first_row = []  # the first row, with the column names
for col in range(ws.ncols):
    first_row.append(ws.cell_value(0, col))
# creating a list of dictionaries
data = []
for row in range(1, ws.nrows):
    d = {}
    for col in range(ws.ncols):
        d[first_row[col]] = ws.cell_value(row, col)
    data.append(d)
for i in data:
    if i['Account'] in data:
        i['Account'] = int(i['Account'])
        print(int(i['Account']))
print(data)
I added the last part to change the Account column, but it does not save the changes in the output.
Your problem is with the condition if i['Account'] in data:.
data is a list of dicts, while i['Account'] is a float, so the condition is never met and no value ever gets converted to int.
From what I understand of your code, you can simply remove the condition:
for i in data:
    i['Account'] = int(i['Account'])
If you want to generally change all floats to ints, you could change the part where you read the file to:
for col in range(ws.ncols):
    value = ws.cell_value(row, col)
    try:
        value = int(value)
    except (ValueError, TypeError):
        pass  # keep the original value if it cannot be converted
    d[first_row[col]] = value
You have the right idea, but the syntax is a little off.
for element in data:  # iterate through the list
    for key, value in element.items():  # iterate through each dict
        if isinstance(value, float):
            element[key] = int(value)
This will cast all your floats to ints.
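A quick check with dummy data:
data = [{"Date": "05-01-2021", "Account": 4021.0},
        {"Date": "06-01-2021", "Account": 4022.0}]
for element in data:
    for key, value in element.items():
        if isinstance(value, float):
            element[key] = int(value)
print(data)
# [{'Date': '05-01-2021', 'Account': 4021}, {'Date': '06-01-2021', 'Account': 4022}]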

How to append the data into data frame generated by a loop in python and beutifulsoup

The loop below is supposed to add multiple tables' rows (from an HTML page) to one dataframe. The loop works fine and creates a dataframe for each table one by one, but it also replaces the previous table's data in the dataframe, which is what I want to fix. It should append each table's data to the same dataframe, not replace the previous table's data. Please help me with this.
column_headers = ['state', 'sr_no', 'district_name', 'country']
headers = ['district_id']
district_link = [[li.get('href') for li in data_rows_link[i].findAll('a')]
                 for i in range(len(data_rows))]
district_data_02 = []  # create an empty list to hold all the data
for i in range(len(data_rows)):  # for each table row
    district_row = []  # create an empty list for each pick/player
    district_row.append("a")
    # for each table data element from each table row
    for li in data_rows[i].findAll('li'):
        # get the text content and append to the district_row
        district_row.append(li.getText())
    # then append each pick/player to the district_data matrix
    district_data_02.append(district_row)
district_data == district_data_02
# dataframe - districtlist
districtlist = pd.DataFrame(district_data, columns=column_headers)
districtid = pd.DataFrame(district_link, columns=headers)
#df_row_merged = pd.concat([df, df1])
# dataframe - districtid
final_districtlist = pd.concat([districtlist, districtid], axis=1)
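A common pattern for the described goal (a sketch, not from the thread; tables and parse_table are hypothetical stand-ins for the per-table scraping done in the loop above) is to collect one dataframe per table in a list and concatenate once at the end, rather than rebuilding the dataframe inside the loop:
frames = []
for table in tables:  # hypothetical iterable of parsed HTML tables
    district_data_02 = parse_table(table)  # hypothetical helper returning the row lists built above
    frames.append(pd.DataFrame(district_data_02, columns=column_headers))
final_districtlist = pd.concat(frames, ignore_index=True)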

How to append a dictionary to a pandas dataframe?

I have a set of urls pointing to json files, and an empty pandas dataframe with columns representing the attributes of the json files. Not all json files have all the attributes in the pandas dataframe. What I need to do is create a dictionary out of each json file and append it to the pandas dataframe as a new row; if a json file doesn't have an attribute matching a column in the dataframe, that cell has to be left blank.
I managed to create dictionaries as:
import urllib2
import json
url = "https://cws01.worldstores.co.uk/api/product.php?product_sku=ULST:7BIS01CF"
data = urllib2.urlopen(url).read()
data = json.loads(data)
and then I tried to create a for loop as follows:
row = -1
for i in links:
    row = row + 1
    data = urllib2.urlopen(str(i)).read()
    data = json.loads(data)
    for key in data.keys():
        for column in df.columns:
            if str(column) == str(key):
                df.loc[[str(column)], row] = data[str(key)]
            else:
                df.loc[[str(column)], row] = None
where df is the dataframe and links is the set of urls
However, I get the following error:
raise KeyError('%s not in index' % objarr[mask])
KeyError: "['2_seater_depth_mm'] not in index"
where ['2_seater_depth_mm'] is the first column of the pandas dataframe
For me the code below works:
row = -1
for i in links:
    row = row + 1
    data = urllib2.urlopen(str(i)).read()
    data = json.loads(data)
    for key in data.keys():
        df.loc[row, key] = data[key]
You had the order of arguments in .loc mixed up, and one pair of [] too many.
Assuming that df is empty and has the same columns as the url dictionary keys, i.e.
list(df)
#[u'alternate_product_code',
# u'availability',
# u'boz',
# ...
len(df)
#0
then you can use pandas.append
for url in links:
    url_data = urllib2.urlopen(str(url)).read()
    url_dict = json.loads(url_data)
    a_dict = {k: pandas.Series([str(v)], index=[0]) for k, v in url_dict.iteritems()}
    new_df = pandas.DataFrame.from_dict(a_dict)
    df = df.append(new_df, ignore_index=True)  # append returns a copy, so assign it back
Not too sure why your code won't work, but consider the following few edits which should clean things up, should you still want to use it:
for row, url in enumerate(links):
    data = urllib2.urlopen(str(url)).read()
    data_dict = json.loads(data)
    for key, val in data_dict.items():
        if key in list(df):
            df.ix[row, key] = val
I used enumerate to iterate over the index and value of the links array; this way you don't need an index counter (row in your code). I then used the .items() dictionary method so I can iterate over keys and values at once. I believe pandas will automatically handle the empty dataframe entries.
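For reference, the thread is from the Python 2 era (urllib2, .ix, .iteritems). A modern Python 3 / pandas sketch of the same idea (same links variable assumed) would be:
import json
from urllib.request import urlopen

import pandas as pd

rows = []
for url in links:
    with urlopen(str(url)) as resp:
        rows.append(json.loads(resp.read()))
# the constructor aligns dict keys to columns and fills missing ones with NaN
df = pd.DataFrame(rows)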
