Python df code works outside a loop but not inside

Other answers on Stack Overflow do address loop problems, but none addresses a df built outside a loop, so I have to ask this question.
I have the code below, which does exactly what it should: it grabs a table, turns it into a dataframe, and appends it to final_df, all outside of a loop:
empty = []
final_df = pd.DataFrame(empty, columns=['column_1', 'column_2', 'column_3',
                                        'column_4', 'report'])
document = Document(targets_in_dir[1])
table = document.tables[2]
data = []
keys = None
for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)
    if i == 0:
        keys = tuple(text)
        continue
    row_data = dict(zip(keys, text))
    data.append(row_data)
df = pd.DataFrame(data)
df['report'] = str(targets_in_dir[1])
final_df = final_df.append(df)
print(targets_in_dir[1])
Once I pack it into a loop (see below) that iterates through the filenames in the targets_in_dir list, my final_df is always empty. How can I fix this? I want final_df to contain all the rows extracted from the same table in all the files.
for idx, c in enumerate(targets_in_dir):
    try:
        document = Document(c)
        table = document.tables[2]
        processed_files.append(c)
    except:
        error_log.append(c)
        data = []
        keys = None
        for i, row in enumerate(table.rows):
            text = (cell.text for cell in row.cells)
            if i == 0:
                keys = tuple(text)
                continue
            row_data = dict(zip(keys, text))
            data.append(row_data)
        df = pd.DataFrame(data)
        df['report'] = str(c)
        final_df.append(df)

final_df.append(df) does not change final_df in place.
Try changing it to final_df = final_df.append(df); this will update final_df within the loop.
The pandas append documentation contains a note on this:
Unlike the list.append method, which appends to the original list and returns None, append here does not modify df1 and returns its copy with df2 appended.
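As a minimal sketch of the fix, reusing the names from the question, the last line of the loop becomes an assignment:

final_df = final_df.append(df)  # assign the returned copy back to final_df

Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on current pandas the usual pattern is to collect the per-file frames in a list and concatenate once after the loop:

frames = []
for c in targets_in_dir:
    # ... build df for this file as in the question ...
    frames.append(df)
final_df = pd.concat(frames, ignore_index=True)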

If your code is exactly as it is represented in your question, the code which processes the data is indented such that it is executed only as part of the exception handling.
Moving the code to the left by one indent will ensure that it is executed as part of the for loop but outside of the try and exception handling blocks.
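Roughly, the corrected structure would look like this (a sketch; the parsing code is unchanged from the question, and a continue is added so a file that fails to open is not processed against the previous file's table):

for idx, c in enumerate(targets_in_dir):
    try:
        document = Document(c)
        table = document.tables[2]
        processed_files.append(c)
    except Exception:
        error_log.append(c)
        continue  # skip files that could not be opened
    data = []
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        if i == 0:
            keys = tuple(text)
            continue
        data.append(dict(zip(keys, text)))
    df = pd.DataFrame(data)
    df['report'] = str(c)
    final_df = final_df.append(df)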

Related

Wrong indexing while creating a pandas dataframe with a dictionary iteratively

I am trying to create a pandas dataframe with the selenium objects 'left' and 'right'.
left = driver.find_elements(by=By.CLASS_NAME, value='lc')
right = driver.find_elements(by=By.CLASS_NAME, value='rc')
These return strings as objects, which have a different number of values for each item in left and right, but left and right have the same number of elements within an iteration. The strings from 'left' are column names, and the values from 'right' have to be appended to the corresponding column names. I tried the following:
data = []
for l, r in zip(left, right):
    # Get the text from the left and right elements
    l_text = l.text
    r_text = r.text
    # Create a dictionary for the row with the left text as the key and the right text as the value
    row = {l_text: r_text}
    # Append the dictionary to the list
    data.append(row)
# Create the dataframe from the list of dictionaries
df = pd.DataFrame(data)
The df created from this has a problem with the index: each value is added at a new index instead of being added to the same row. How do I add all values from one iteration to the same row?
The 'left' values are attributes of brake disks and the 'right' values are their corresponding values. These vary for each item; sometimes there are more and sometimes fewer.
The following should do what you want:
Items are added to the current row until the same header is encountered again.
Once a duplicate header is discovered, the row variable is appended to data and then cleared for the next row.
data = []
row = {}
for l, r in zip(left, right):
    # Get the text from the left and right elements
    l_text = l.text
    r_text = r.text
    if l_text is not None and l_text != "":
        if l_text in row:
            data.append(row)
            row = {}
        row[l_text] = r_text
# This is required to append the last row
if len(row) > 0:
    data.append(row)
# Create the dataframe from the list of dictionaries
df = pd.DataFrame(data)
print(df)
I made some adjustments to your code: I collect each key and value in a dictionary, then append the dictionary to the dataframe.
data = pd.DataFrame()
dic = {}
for l, r in zip(left, right):
    # Get the text from the left and right elements and
    # store the pair in the dictionary for this row
    dic[l.text] = r.text
# Append the dictionary to the dataframe
data = data.append(dic, ignore_index=True)
# data is your final dataframe
Try doing it this way:
left = driver.find_elements(by=By.CLASS_NAME, value='lc')
right = driver.find_elements(by=By.CLASS_NAME, value='rc')
# Create a dictionary with keys from the left and empty lists as values
data = {}
for element in left:
    if element.text not in data.keys():
        data[element.text] = list()
for l, r in zip(left, right):
    # Add an element to the list by key
    data[l.text].append(r.text)
# Create the dataframe from the dictionary
df = pd.DataFrame.from_dict(data)
I have not worked with selenium, so you may need to tweak the code a little (in terms of getting the text from the left list values).

Automatically transposing Excel user data in a Pandas DataFrame

I have some big Excel files like this (note: other variables are omitted for brevity):
and would need to build a corresponding Pandas DataFrame with the following structure.
I am trying to develop Pandas code for, at least, parsing the first column and transposing the id and the full name of each user. Could you help with this?
The way that I would tackle it (and I am assuming there are likely to be more efficient ways) is to import the Excel file into a dataframe, then iterate through it to grab the details you need for each line. Store that information in a dictionary, and append each formed line to a list. This list of dictionaries can then be used to create the final dataframe.
Please note, I made the following assumptions:
Your excel file is named 'data.xlsx' and in the current working directory
The index next to each person increments by one EVERY time
All people have a position described in brackets next to the name
I made up the column names, as none were provided
import pandas as pd

# import the excel file into a dataframe (df)
filename = 'data.xlsx'
df = pd.read_excel(filename, names=['col1', 'col2'])
# remove blank rows
df.dropna(inplace=True)
# reset the index of df
df.reset_index(drop=True, inplace=True)
# initialise the variables
counter = 1
name_pos = ''
name = ''
pos = ''
line_dict = {}
list_of_lines = []
# iterate through the dataframe
for i in range(len(df)):
    if df['col1'][i] == counter:
        name_pos = df['col2'][i].split(' (')
        name = name_pos[0]
        pos = name_pos[1].rstrip(name_pos[1][-1])  # drop the trailing ')'
        p_index = counter
        counter += 1
    else:
        date = df['col1'][i].strftime('%d/%m/%Y')
        amount = df['col2'][i]
        line_dict = {'p_index': p_index, 'name': name, 'position': pos, 'date': date, 'amount': amount}
        list_of_lines.append(line_dict)
final_df = pd.DataFrame(list_of_lines)

Copy unique row to pandas dataframe?

I have an Excel workbook with multiple sheets containing some sales data. I am trying to sort them so that each customer has a separate sheet (in a different workbook) with the item details. I have created a dictionary with all customer names.
for name in cust_dict.keys():
    cust_dict[name] = pd.DataFrame(columns=cols)
for sheet in sheets:
    ws = sales_wb.sheet_by_name(sheet)
    code = ws.cell(4, 0).value  # This is the item code
    df = pd.read_excel(sales_wb, engine='xlrd', sheet_name=sheet, skiprows=7)
    df = df.fillna(0)
    count = 0
    for index, row in df.iterrows():
        print('rotation ' + str(count))
        count += 1
        if row['Particulars'] != 0 and row['Particulars'] not in no_cust:
            cust_name = row['Particulars']
            cust_dict[cust_name] = cust_dict[cust_name].append(df.loc[df['Particulars'] == cust_name], ignore_index=False)
            cust_dict[cust_name] = cust_dict[cust_name].drop_duplicates()
            cust_dict[cust_name]['Particulars'] = code
Right now I have to drop duplicates because 'Particulars' contains the client name more than once, and hence the code appends the same data x number of times.
I would like to avoid this, but I can't seem to figure out a good way to do it.
The second problem is that the item code changes to the code from the last sheet for all rows, but I want each row to keep the code of the sheet it was pulled from.
I can't seem to figure out a way around either of these problems.
Thanks
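One way around both issues (a sketch, untested against the actual workbook and assuming the names used in the question) is to take each customer's slice once per sheet, stamp that sheet's item code on the slice before appending, and skip drop_duplicates entirely:

for sheet in sheets:
    ws = sales_wb.sheet_by_name(sheet)
    code = ws.cell(4, 0).value  # this sheet's item code
    df = pd.read_excel(sales_wb, engine='xlrd', sheet_name=sheet, skiprows=7).fillna(0)
    for cust_name in df['Particulars'].unique():
        if cust_name == 0 or cust_name in no_cust:
            continue
        block = df.loc[df['Particulars'] == cust_name].copy()
        block['Particulars'] = code  # stamp the code only on this sheet's rows
        cust_dict[cust_name] = cust_dict[cust_name].append(block, ignore_index=True)

Because each customer is visited once per sheet, nothing is appended twice; and because the code is written to the slice before appending, rows keep the code of the sheet they came from.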

Storing the results of a loop quickly in Python

I have a function that I call on every row of a pandas DataFrame and I would like to store the result of each function call (each iteration). Below is an example of what I am trying to do.
data = [{'a': 1, 'b': 2, 'c': 3}, {'a': 1, 'b': 2, 'c': 3}, {'a': 1, 'b': 2, 'c': 3}]
InputData = pd.DataFrame(data)
ResultData = pd.DataFrame(columns=['a', 'b', 'c'])

def SomeFunction(row):
    # Function code goes here (not important to this question)
    return Temp

for index, row in InputData.iterrows():
    # Temp will equal the result of the function (a DataFrame with 3 columns and 1 row)
    Temp = SomeFunction(row)
    # If ResultData is not empty, append Temp to ResultData
    if len(ResultData) != 0:
        ResultData = ResultData.append(Temp, ignore_index=True)
    # If ResultData is empty, ResultData = Temp
    else:
        ResultData = Temp
I hope my example is easy to follow.
In my real example I have about a million rows in the input data, and this process is very slow; I think it is the appending of the DataFrames that makes it so slow. Is there maybe a different data structure I could use that stores the three values of the Temp DataFrame and can be appended at the end to form the ResultData DataFrame?
Any help would be much appreciated
Best to avoid any explicit loops in pandas. Using apply is still a little slow but probably faster than a loop.
df["newcol"] = df.apply(function, axis=1)
Maybe a list of lists will solve your problem:
Result_list = []
for ... :
    ...
    Result_list.append([data1, data2, data3])
To review the data:
for Current_data in Result_list:
    data1 = Current_data[0]
    data2 = Current_data[1]
    data3 = Current_data[2]
Hope it helps!
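To build the final DataFrame from the collected list in one step at the end (a sketch, assuming the three columns from the question):

ResultData = pd.DataFrame(Result_list, columns=['a', 'b', 'c'])

Constructing the DataFrame once from the list is far cheaper than appending inside the loop, since each DataFrame.append copies the entire frame.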

How to append a dictionary to a pandas dataframe?

I have a set of urls containing json files, and an empty pandas dataframe with columns representing the attributes of the json files. Not all json files have all the attributes in the pandas dataframe. What I need to do is create dictionaries out of the json files and then append each dictionary to the pandas dataframe as a new row; in case a json file doesn't have an attribute matching a column in the dataframe, that cell has to be left blank.
I managed to create dictionaries as:
import urllib2
import json
url = "https://cws01.worldstores.co.uk/api/product.php?product_sku=ULST:7BIS01CF"
data = urllib2.urlopen(url).read()
data = json.loads(data)
and then I tried to create a for loop as follows:
row = -1
for i in links:
    row = row + 1
    data = urllib2.urlopen(str(i)).read()
    data = json.loads(data)
    for key in data.keys():
        for column in df.columns:
            if str(column) == str(key):
                df.loc[[str(column)], row] = data[str(key)]
            else:
                df.loc[[str(column)], row] = None
where df is the dataframe and links is the set of urls
However, I get the following error:
raise KeyError('%s not in index' % objarr[mask])
KeyError: "['2_seater_depth_mm'] not in index"
where ['2_seater_depth_mm'] is the first column of the pandas dataframe
For me, the code below works:
row = -1
for i in links:
    row = row + 1
    data = urllib2.urlopen(str(i)).read()
    data = json.loads(data)
    for key in data.keys():
        df.loc[row, key] = data[key]
You had the order of the arguments in .loc mixed up, and one pair of [] too many.
Assuming that df is empty and has the same columns as the url dictionary keys, i.e.
list(df)
#[u'alternate_product_code',
# u'availability',
# u'boz',
# ...
len(df)
#0
then you can use pandas.append:
for url in links:
    url_data = urllib2.urlopen(str(url)).read()
    url_dict = json.loads(url_data)
    a_dict = {k: pandas.Series([str(v)], index=[0]) for k, v in url_dict.iteritems()}
    new_df = pandas.DataFrame.from_dict(a_dict)
    df = df.append(new_df, ignore_index=True)  # append returns a copy, so assign it back
Not too sure why your code won't work, but consider the following few edits which should clean things up, should you still want to use it:
for row, url in enumerate(links):
    data = urllib2.urlopen(str(url)).read()
    data_dict = json.loads(data)
    for key, val in data_dict.items():
        if key in list(df):
            df.ix[row, key] = val
I used enumerate to iterate over the index and values of the links array; this way you don't need an index counter (row in your code). I then used the .items dictionary method, so I can iterate over keys and values at once. I believe pandas will automatically handle the empty dataframe entries.
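A quick illustration of that last point (a toy example, independent of the question's data):

import pandas as pd

df = pd.DataFrame(columns=['a', 'b'])
df.loc[0, 'a'] = 1  # setting with enlargement creates row 0; 'b' is filled with NaN
print(df)
#    a    b
# 0  1  NaN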
