Storing the results of a loop quickly in Python

I have a function that I call on every row of a pandas DataFrame and I would like to store the result of each function call (each iteration). Below is an example of what I am trying to do.
import pandas as pd

data = [{'a': 1, 'b': 2, 'c': 3}, {'a': 1, 'b': 2, 'c': 3}, {'a': 1, 'b': 2, 'c': 3}]
InputData = pd.DataFrame(data)
ResultData = pd.DataFrame(columns=['a', 'b', 'c'])

def SomeFunction(row):
    # Function code goes here (not important to this question) #
    ###########################################################
    return Temp

for index, row in InputData.iterrows():
    # Temp will equal the result of the function (a DataFrame with 3 columns and 1 row)
    Temp = SomeFunction(row)
    # If ResultData is not empty, append Temp to ResultData
    if len(ResultData) != 0:
        ResultData = ResultData.append(Temp, ignore_index=True)
    # If ResultData is empty, ResultData = Temp
    else:
        ResultData = Temp
I hope my example is easy to follow.
In my real example the input data has about a million rows, and this process is very slow; I suspect the repeated appending of the DataFrame is what makes it so slow. Is there a different data structure I could use to store the three values of the "Temp" DataFrame on each iteration, which could then be combined at the end to form the "ResultData" DataFrame?
Any help would be much appreciated

It is best to avoid explicit loops in pandas. Using apply is still a little slow, but it is usually faster than a loop:
df["newcol"] = df.apply(function, axis=1)

Maybe a list of lists will solve your problem:
Result_list = []
for ... :
    ...
    Result_list.append([data1, data2, data3])
To review the data:
for Current_data in Result_list:
    data1 = Current_data[0]
    data2 = Current_data[1]
    data3 = Current_data[2]
Hope it helps!
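Applied to the original question (reusing its InputData and SomeFunction), the same idea looks like this: collect each per-row result in a plain list and build the final DataFrame once at the end. A sketch, assuming SomeFunction returns a 1-row DataFrame as described:
import pandas as pd

results = []
for index, row in InputData.iterrows():
    Temp = SomeFunction(row)  # a DataFrame with 3 columns and 1 row
    results.append(Temp)

# one concatenation at the end is far cheaper than a million appends
ResultData = pd.concat(results, ignore_index=True)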

Related

Automatically transposing Excel user data into a Pandas DataFrame

I have some big Excel files like this (other variables are omitted for brevity), and I need to build a corresponding Pandas DataFrame with the following structure.
I am trying to develop Pandas code for, at least, parsing the first column and transposing the id and the full name of each user. Could you help with this?
The way I would tackle it (and I am assuming there are likely more efficient ways) is to import the Excel file into a DataFrame, then iterate through it to grab the details you need for each line. Store that information in a dictionary, and append each formed line to a list. That list of dictionaries can then be used to create the final DataFrame.
Please note, I made the following assumptions:
Your Excel file is named 'data.xlsx' and is in the current working directory
The index next to each person increments by one EVERY time
All people have a position described in brackets next to the name
I made up the column names, as none were provided
import pandas as pd

# import the excel file into a dataframe (df)
filename = 'data.xlsx'
df = pd.read_excel(filename, names=['col1', 'col2'])

# remove blank rows
df.dropna(inplace=True)

# reset the index of df
df.reset_index(drop=True, inplace=True)

# initialise the variables
counter = 1
name_pos = ''
name = ''
pos = ''
line_dict = {}
list_of_lines = []

# iterate through the dataframe
for i in range(len(df)):
    if df['col1'][i] == counter:
        # a "header" row: "Name (Position)" next to an incrementing index
        name_pos = df['col2'][i].split(' (')
        name = name_pos[0]
        pos = name_pos[1].rstrip(')')  # drop the trailing bracket
        p_index = counter
        counter += 1
    else:
        # a data row belonging to the current person
        date = df['col1'][i].strftime('%d/%m/%Y')
        amount = df['col2'][i]
        line_dict = {'p_index': p_index, 'name': name, 'position': pos,
                     'date': date, 'amount': amount}
        list_of_lines.append(line_dict)

final_df = pd.DataFrame(list_of_lines)

What is the fastest way to filter a 2.5 GB JSON file?

I have a 2.5 GB JSON file with 25 columns and about 4 million rows. When I try to filter the JSON with the following script, it takes at least 10 minutes.
import json

product_list = ['Horse', 'Rabit', 'Cow']
year_list = ['2008', '2009', '2010']
country_list = ['USA', 'GERMANY', 'ITALY']

with open('./products/animal_production.json', 'r', encoding='utf8') as r:
    result = r.read()
result = json.loads(result)

# iterate over a copy so removing from the original list is safe
for item in result[:]:
    if (str(item["Year"]) not in year_list) or (item["Name"] not in product_list) or (item["Country"] not in country_list):
        result.remove(item)

print(result)
I need to prepare the result in a max of 1 minute so what is your suggestion or the fastest way to filter JSON?
Removing from a list in a loop is slow: each remove is O(n), and doing that n times makes it O(n^2). Appending to a new list is O(1), and doing that n times in a loop is O(n) overall. So you can try this:
[item for item in result
 if str(item["Year"]) in year_list
 and item["Name"] in product_list
 and item["Country"] in country_list]
Filter based on the condition you need and add only those that match.
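A further tweak worth considering: membership tests on a Python list scan it linearly, while tests on a set are O(1) on average, which adds up over roughly 4 million rows. A sketch reusing the question's lists:
# build sets once, up front, for O(1) membership tests
year_set = set(year_list)
product_set = set(product_list)
country_set = set(country_list)

filtered = [item for item in result
            if str(item["Year"]) in year_set
            and item["Name"] in product_set
            and item["Country"] in country_set]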
You need to read the json file using Pandas dataframes, and then filter on the required columns.
Why? Because pandas is column-based, it is very fast at working with columns; a DataFrame is built from Series objects, each a one-dimensional labeled array (basically a column).
So you need something like this (assuming the column names in the JSON file are consistent):
import pandas as pd

product_list = ['Horse', 'Rabit', 'Cow']
year_list = ['2008', '2009', '2010']
country_list = ['USA', 'GERMANY', 'ITALY']

df = pd.read_json('./products/animal_production.json')

# Change the condition if it's not the desired one; & keeps only rows matching all three
# (if Year is parsed as integers, compare against integer years instead)
condition = df["Year"].isin(year_list) & df["Name"].isin(product_list) & df["Country"].isin(country_list)
df = df[condition]
I can't reproduce it to estimate the time needed but I am sure it would be hundreds or even thousands of times faster!
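If loading the whole file at once is still too heavy, and if the file happens to be newline-delimited JSON (one record per line; an assumption, since the question does not say), pandas can read and filter it in chunks. A sketch reusing the question's lists:
import pandas as pd

chunks = []
# lines=True requires newline-delimited JSON; chunksize yields an iterator of DataFrames
for chunk in pd.read_json('./products/animal_production.json',
                          lines=True, chunksize=500000):
    mask = (chunk["Year"].astype(str).isin(year_list)
            & chunk["Name"].isin(product_list)
            & chunk["Country"].isin(country_list))
    chunks.append(chunk[mask])

df = pd.concat(chunks, ignore_index=True)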
I do not know if it will be much faster, but you might json.load the open file directly rather than read()-ing and then json.loads()-ing, i.e. rather than
with open('./products/animal_production.json', 'r', encoding='utf8') as r:
    result = r.read()
result = json.loads(result)
you might do
with open('./products/animal_production.json', 'r', encoding='utf8') as r:
    result = json.load(r)

Python df code works outside a loop but not inside

Other answers on Stack Overflow do address loop problems, but none addresses a df outside a loop, so I have to ask this question.
I have the below code, which does exactly what it should: grab a table, turn it into a DataFrame, and append it to final_df outside of a loop:
empty = []
final_df = pd.DataFrame(empty, columns=['column_1', 'column_2', 'column_3',
                                        'column_4', 'report'])
document = Document(targets_in_dir[1])
table = document.tables[2]
data = []
keys = None
for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)
    if i == 0:
        keys = tuple(text)
        continue
    row_data = dict(zip(keys, text))
    data.append(row_data)
df = pd.DataFrame(data)
df['report'] = str(targets_in_dir[1])
final_df = interim_df.append(df)
print(targets_in_dir[1])
Once I pack it into a loop (see below) that iterates through the filenames in the targets_in_dir list, my final_df is always empty. How can I fix this? I want final_df to contain all the rows extracted from the same table in all the files.
for idx, c in enumerate(targets_in_dir):
    try:
        document = Document(c)
        table = document.tables[2]
        processed_files.append(c)
    except:
        error_log.append(c)
        data = []
        keys = None
        for i, row in enumerate(table.rows):
            text = (cell.text for cell in row.cells)
            if i == 0:
                keys = tuple(text)
                continue
            row_data = dict(zip(keys, text))
            data.append(row_data)
        df = pd.DataFrame(data)
        df['report'] = str(c)
        final_df.append(df)
final_df.append(df) does not change final_df in place.
Try changing it to final_df = final_df.append(df); this will update final_df within the loop.
Pandas append documentation contains a note on this:
Unlike the append() method, which appends to the original list and returns None, append() here does not modify df1 and returns its copy with df2 appended.
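As a side note, DataFrame.append itself was deprecated in pandas 1.4 and removed in 2.0; the modern equivalent of the reassignment above is:
final_df = pd.concat([final_df, df], ignore_index=True)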
If your code is exactly as it is represented in your question, the code which processes the data is indented such that it is executed only as part of the exception handling.
Moving the code to the left by one indent will ensure that it is executed as part of the for loop but outside of the try and exception handling blocks.
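Combining both answers, a sketch of the corrected loop (reusing the question's Document, targets_in_dir, processed_files, and error_log, with a continue added so a file whose table cannot be read is skipped instead of silently reprocessing the previous file's table):
all_dfs = []
for idx, c in enumerate(targets_in_dir):
    try:
        document = Document(c)
        table = document.tables[2]
        processed_files.append(c)
    except Exception:
        error_log.append(c)
        continue  # skip files whose table could not be read
    data = []
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        if i == 0:
            keys = tuple(text)
            continue
        data.append(dict(zip(keys, text)))
    df = pd.DataFrame(data)
    df['report'] = str(c)
    all_dfs.append(df)

final_df = pd.concat(all_dfs, ignore_index=True)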

Conditional with a value from a particular cell in Excel

I am trying to rewrite the following MATLAB code in Python:
function [x, y, z] = Testfunc(filename, newdata, a, b)
    sheetname = 'Test1';
    data = xlsread(filename, sheetname);
    if data(1) == 1
        newdata(1,3) = data(2);
        newdata(1,4) = data(3);
        newdata(1,5) = data(4);
        newdata(1,6) = data(5)
    else
        ....
        ....
        ....
It is a very long function, but this is the part where I am stuck and have no clue at all.
This is what I have written so far in python:
import pandas as pd

def test_func(filepath, newdata, a, b):
    data = pd.read_excel(filepath, sheet_name='Test1')
    if data[0] == 1:
I am stuck here, and I am not even sure whether the 'if' statement is right. I am looking for suggestions and help.
Info: the Excel sheet has 1 row and 13 columns; newdata is also a 2-D matrix.
Try running that code and printing out your dataframe (print(data)). You will see that a DataFrame is different from a MATLAB matrix. read_excel will try to infer your column headers, so with a single row of data you will probably end up with no rows and just columns. To prevent pandas from treating the first row as a header, use:
data = pd.read_excel(filepath, sheet_name='Test1', header=None)
Indexing with data[0] selects the column labeled 0 (a whole Series), so your comparison is asking whether that column equals 1, not whether a single cell does. To index a given cell, you must select both the row and the column. To achieve what you are doing in MATLAB, use the iloc indexer on your dataframe: data.iloc[0,0]. This accesses row 0, element 0. Your code should look like this:
import pandas as pd

def test_func(filepath, newdata, a, b):
    data = pd.read_excel(filepath, sheet_name='Test1', header=None)
    if data.iloc[0, 0] == 1:
        newdata.iloc[0, 2:6] = data.iloc[0, 1:5]
        ....
I suggest you read up on indexing in pandas.
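To make the distinction concrete, a tiny self-contained sketch:
import pandas as pd

# one row, 13 columns, no header, mimicking the question's sheet
data = pd.DataFrame([[1] + [0] * 12])

print(data[0])          # column 0: a Series, not a single value
print(data.iloc[0, 0])  # row 0, column 0: one cell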

Pandas: append dataframe to another df

I have a problem with appending DataFrames.
I try to execute this code:
df_all = pd.read_csv('data.csv', error_bad_lines=False, chunksize=1000000)
urls = pd.read_excel('url_june.xlsx')
substr = urls.url.values.tolist()
df_res = pd.DataFrame()
for df in df_all:
    for i in substr:
        res = df[df['url'].str.contains(i)]
        df_res.append(res)
And when I try to save df_res, I get an empty DataFrame.
df_all looks like:
ID,"url","used_at","active_seconds"
b20f9412f914ad83b6611d69dbe3b2b4,"mobiguru.ru/phones/apple/comp/32gb/apple_iphone_5s.html",2015-10-01 00:00:25,1
b20f9412f914ad83b6611d69dbe3b2b4,"mobiguru.ru/phones/apple/comp/32gb/apple_iphone_5s.html",2015-10-01 00:00:31,30
f85ce4b2f8787d48edc8612b2ccaca83,"4pda.ru/forum/index.php?showtopic=634566&view=getnewpost",2015-10-01 00:01:49,2
d3b0ef7d85dbb4dbb75e8a5950bad225,"shop.mts.ru/smartfony/mts/smartfon-smart-sprint-4g-sim-lock-white.html?utm_source=admitad&utm_medium=cpa&utm_content=300&utm_campaign=gde_cpa&uid=3",2015-10-01 00:03:19,34
078d388438ebf1d4142808f58fb66c87,"market.yandex.ru/product/12675734/spec?hid=91491&track=char",2015-10-01 00:03:48,2
d3b0ef7d85dbb4dbb75e8a5950bad225,"avito.ru/yoshkar-ola/telefony/mts",2015-10-01 00:04:21,4
d3b0ef7d85dbb4dbb75e8a5950bad225,"shoppingcart.aliexpress.com/order/confirm_order",2015-10-01 00:04:25,1
d3b0ef7d85dbb4dbb75e8a5950bad225,"shoppingcart.aliexpress.com/order/confirm_order",2015-10-01 00:04:26,9
and urls looks like
url
shoppingcart.aliexpress.com/order/confirm_order
ozon.ru/?context=order_done&number=
lk.wildberries.ru/basket/orderconfirmed
lamoda.ru/checkout/onepage/success/quick
mvideo.ru/confirmation?_requestid=
eldorado.ru/personal/order.php?step=confirm
When I print res inside the loop, it isn't empty. But when I print df_res after the append, it is an empty DataFrame.
I can't find my error. How can I fix it?
If you look at the documentation for pd.DataFrame.append:
Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns.
(emphasis mine).
Try
df_res = df_res.append(res)
Incidentally, note that pandas isn't that efficient for creating a DataFrame by successive concatenations. You might try this, instead:
all_res = []
for df in df_all:
    for i in substr:
        res = df[df['url'].str.contains(i)]
        all_res.append(res)
df_res = pd.concat(all_res)
This first creates a list of all the parts, then creates a DataFrame from all of them once at the end.
If we want to append based on an index:
df_res = pd.DataFrame(data=None, columns=df.columns)
all_res = []
d1 = df.iloc[index-10:index]  # the 10 rows before the i-th index (.ix is deprecated)
all_res.append(d1)
df_res = pd.concat(all_res)
