First of all - I know that iterating over a Pandas DataFrame is not a good idea, so any suggestions regarding other possible solutions are welcome.
I am trying to write a little piece of code to compare two dataframes - one of which is a template to compare against.
The dataframes look like this (shortened version, of course):
Template:
           | Template1 | Template2 | Template3
-----------+-----------+-----------+----------
Variable 1 | value     | value     | value
Variable 2 | value     | value     | value
Variable 3 | value     | value     | value
Variable 4 | value     | value     | value
And the file to compare (datafile):
           | Record 1 | Record 2 | Record 3 | Record 4
-----------+----------+----------+----------+---------
Variable 3 | value    | value    | value    | value
Variable 1 | value    | value    | value    | value
Variable 4 | value    | value    | value    | value
Now, what the script should do:
take one specific column from the template file
compare every record in the data file with the selected column
I managed to write a little piece of code and it even works for one record:
template = templatefile['Template2']
record_to_check = datafile[0]
errors_found = []

for a in template.index:
    if a in record_to_check.index:
        variable = {}
        if template[a] == record_to_check[a]:
            # equal
            pass
        else:
            # unequal
            variable['name'] = a
            variable['value'] = template[a]
            errors_found.append(variable)
    else:
        # not found
        variable = {}
        variable['name'] = a
        variable['value'] = template[a]
        errors_found.append(variable)
It fills errors_found, a list of dictionaries, each holding a name/value pair.
The problem starts when I try to put it in another loop (to iterate over the records in datafile):
template = templatefile['Template2']

for record_to_check in datafile.iteritems():
    errors_found = []
    for a in template.index:
        if a in record_to_check.index:
            variable = {}
            if template[a] == record_to_check[a]:
                # equal
                pass
            else:
                # unequal
                variable['name'] = a
                variable['value'] = template[a]
                errors_found.append(variable)
        else:
            # not found
            variable = {}
            variable['name'] = a
            variable['value'] = template[a]
            errors_found.append(variable)
result:
Traceback (most recent call last):
File "attributes.py", line 24, in <module>
if a in record_to_check.index:
TypeError: argument of type 'builtin_function_or_method' is not iterable
What am I doing wrong?
EDIT: the expected output should be a list of dictionaries like this:
[{'name': 'variable2', 'value': value_from_template}, {'name': 'variable3', 'value': value_from_template}]
And I know that if I run it in the loop it will overwrite the list on each iteration. I just wanted to be sure that it works with multiple records, so I can make a function out of it.
As you point out yourself, looping over a pandas dataframe is not a good approach. (The immediate error, by the way: DataFrame.iteritems() yields (column_name, Series) tuples, so record_to_check is a tuple, and tuple.index is a built-in method rather than a pandas Index; you would need to unpack it as for name, record_to_check in datafile.iteritems().) Instead you should use a join; here are some ideas:
Assuming you have the reference table
template
       template1  template2
index
var 1          1          5
var 2          2          4
var 3          3          3
var 4          4          2
and your data table
datafile
       record1  record2
index
var 3        1        3
var 1        2        3
var 4        4        2
A left join on the indices will automatically match the variables, the ordering does not play a role: joined = template.join(datafile, how='left').
You can then easily create new columns that tell you whether the values in the template and data table match: joined['temp1=rec1'] = joined["template1"] == joined["record1"].
This column you can use to show only those rows where the values do not match: errors_found = joined[~joined['temp1=rec1']]
errors_found
       template1  template2  record1  record2  temp1=rec1
index
var 1          1          5      2.0      3.0       False
var 2          2          4      NaN      NaN       False
var 3          3          3      1.0      3.0       False
You can now get a dictionary with the template values: errors_found = joined[~joined['temp1=rec1']]['template1'].to_dict()
{'var 1': 1, 'var 2': 2, 'var 3': 3}
If you want this for more than just one column pair, you can put this code in a function and loop / map over the columns.
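For instance, here is a minimal sketch of such a function, using the frames above (note that a variable missing from the data joins as NaN, which compares as unequal and is therefore flagged):

import pandas as pd

def find_errors(template, datafile, template_col, record_col):
    # Left-join on the index so variables match regardless of row order.
    joined = template.join(datafile, how='left')
    # NaN never equals anything, so variables missing from the data are flagged too.
    mismatch = joined[template_col] != joined[record_col]
    return joined.loc[mismatch, template_col].to_dict()

find_errors(template, datafile, 'template1', 'record1')
# {'var 1': 1, 'var 2': 2, 'var 3': 3}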
Hope this helps.
Related
I have the following table:
+---------+------------+----------------+
| IRR     | Price List | Cambridge Data |
+=========+============+================+
| '1.56%' | '0'        | '6/30/1989'    |
+---------+------------+----------------+
| '5.17%' | '100'      | '9/30/1989'    |
+---------+------------+----------------+
| '4.44%' | '0'        | '12/31/1990'   |
+---------+------------+----------------+
I'm trying to write a calculator that updates the Price List field by making a simple calculation. The logic is basically this:
previous price * ( 1 + IRR%)
So for the last row, the calculation would be: 100 * (1 + 4.44%) = 104.44
Since I'm using petl, I'm trying to figure out how to update a field with its above value and a value from the same row and then populate this across the whole Price List column. I can't seem to find a useful petl utility for this. Should I just manually write a method? What do you guys think?
Try this:
# conversion can access other values from the same row;
# with pass_row=True, petl calls the conversion as f(value, row)
table = etl.convert(table, 'Price List',
                    lambda v, row: 100 * (1 + float(row.IRR.strip('%')) / 100),
                    pass_row=True)
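One caveat: convert only ever sees the current row, so the 100 above is hardcoded rather than the genuinely previous Price List value. If you need the true running calculation, a possible workaround (a sketch, assuming IRR values look like '4.44%') is to compute the prices in plain Python and put the column back:

import petl as etl

# price[i] = price[i-1] * (1 + irr[i]) -- a recurrence convert() can't
# express, because it only sees one row at a time.
irrs = [float(v.strip('%')) / 100 for v in etl.values(table, 'IRR')]
prices = [float(list(etl.values(table, 'Price List'))[0])]  # seed from row 1
for irr in irrs[1:]:
    prices.append(prices[-1] * (1 + irr))

# Swap the computed column back into the table.
table = etl.addcolumn(etl.cutout(table, 'Price List'), 'Price List', prices)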
I have a dictionary that uses log file names as keys. When reading through a text file if it contains a value I was searching for, it saves the entire line as a new value in the list of values for that key(log file name).
I.e. Key: logFile0, Value: [Val1 5, Val2 3, Val3 72].
I want to output to a csv file with the value names as headers, showing which log file each value was found in and its number.
Key | Value
log1 | [Value_name Val_num], [Value1_name Val_num]
log2 | [Value1_name Val_num], [Value3_name Valu_num]
log3 | [Value2_name Val_num], [Value3_name Val_num]
I want it displayed in the csv file so that, wherever a value was found in a log file, its number is shown:
     | Value_name | Value1_name | Value2_name | Value3_name
log1 | val_num    | val_num     |             |
log2 |            | val_num     |             | val_num
log3 |            |             | val_num     | val_num
Does anybody know how to do this? Or is there a better way to store all of this information and then display?
Here is my approach:
import csv

d = {"log1": [[0, "Val_num"], [1, "Val_num"]],
     "log2": [[1, "Val_num"], [3, "Val_num"]],
     "log3": [[2, "Val_num"], [3, "Val_num"]]}
I made a dictionary. Use yours, but the "Val_num" entries are meant to be the actual numbers. The first number is the index of the value name, e.g. 1 -> Value1_name.
headers = ["Log File","Value_num","Value1_num","Value2_num","Value3_num"]
log1Row = ["log1","","","",""]
log2Row = ["log2","","","",""]
log3Row = ["log3","","","",""]
Here I simply initialised the rows to be written into the csv file.
for i in d["log1"]:
log1Row[i[0] + 1] = i[1]
for i in d["log2"]:
log2Row[i[0] + 1] = i[1]
for i in d["log3"]:
log3Row[i[0] + 1] = i[1]
Now I iterate over each key in the dictionary and fill the values into the arrays. The + 1 is needed because the array positions for these numbers begin at index 1, not 0, as index 0 holds the log name.
with open("file.csv","w") as file:
writer = csv.writer(file)
writer.writerow(headers)
writer.writerow(log1Row)
writer.writerow(log2Row)
writer.writerow(log3Row)
Just open the file to write to and write these rows.
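If you'd rather not hardcode one row per log file, here is a more general sketch under the same assumptions about the dictionary's shape (the header names are placeholders of my own):

import csv

def write_log_csv(d, n_values, path="file.csv"):
    headers = ["Log File"] + ["Value%d_num" % i for i in range(n_values)]
    with open(path, "w") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        for log in sorted(d):            # one row per log file
            row = [log] + [""] * n_values
            for idx, val in d[log]:      # entries like [0, "Val_num"]
                row[idx + 1] = val       # + 1 skips the log-name column
            writer.writerow(row)

write_log_csv(d, 4)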
Hopefully this works as long as your dictionary follows the same shape as mine.
Let me know of any errors.
Thank you.
I have an original dataset with information stored as a list of dicts in a column (this is a MongoDB extract). This is the column:
[{u'domain_id': ObjectId('A'), u'p': 1},
{u'domain_id': ObjectId('B'), u'p': 2},
{u'domain_id': ObjectId('B'), u'p': 3},
...
{u'domain_id': ObjectId('CG'), u'p': 101}]
I'm only interested in the first 10 dicts ('p' values from 1 to 10). The output dataframe should look like this:
index |   A | ... |   B
-----------------------
0     |   1 | ... |   2
1     | NaN | ... | NaN
2     | NaN | ... |   3
E.g.: for each line of my original DataFrame, I create a column for each domain_id and associate it with the corresponding 'p' value. I can have the same domain_id for several 'p' values; in this case I only keep the first one (smallest 'p').
Here is my current code, which may be easier to understand:
first = True
for i in df.index[:]:  # for each line of original Dataframe
    temp_list = df["positions"][i]  # this is the column with the list of dict inside
    col_list = []
    data_list = []
    for j in range(10):  # get the first 10 values
        try:
            if temp_list[j]["domain_id"] not in col_list:  # check if domain_id already exists
                col_list.append(temp_list[j]["domain_id"])
                data_list.append(temp_list[j]["p"])
        except IndexError as e:
            print e
    # create a temporary DataFrame for this line of the original DataFrame
    df_temp = pd.DataFrame([np.transpose(data_list)], columns=col_list)
    if first:
        df_kw = df_temp
        first = False
    else:
        # concat all the temporary DataFrames: now I have my output DataFrame,
        # with the same number of lines as my original DataFrame
        df_kw = pd.concat([df_kw, df_temp], axis=0, ignore_index=True)
This all works fine, but it is very, very slow, as I have 15k lines and end up with 10k columns.
I'm sure (or at least I very much hope) that there is a simpler and faster solution: any advice will be much appreciated.
I found a decent solution: the slow part is the concatenation, so it is way more efficient to first create the dataframe and then update the values.
Create the DataFrame:
col_list = []
for i in df.index[:]:
    temp_list = df["positions"][i]
    for j in range(10):
        try:
            col_list.append(temp_list[j]["domain_id"])
        except IndexError as e:
            print e
df_total = pd.DataFrame(index=df.index, columns=set(col_list))
Update the values :
for i in df.index[:]:
    temp_list = df["positions"][i]
    col_list = []
    for j in range(10):
        try:
            if temp_list[j]["domain_id"] not in col_list:  # avoid overwriting values
                df_total.loc[i, temp_list[j]["domain_id"]] = temp_list[j]["p"]
                col_list.append(temp_list[j]["domain_id"])
        except IndexError as e:
            print e
Creating a 15k x 6k DataFrame took about 6 seconds on my computer, and filling it took 27 seconds.
I killed the former solution after more than 1 hour running, so this is really faster.
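For what it's worth, the cell-by-cell filling can probably be avoided entirely by building one dict per row and letting the DataFrame constructor line the columns up (a sketch, assuming df["positions"] holds the lists of dicts as above):

import pandas as pd

def row_to_dict(positions):
    out = {}
    for d in positions[:10]:    # first 10 entries only
        key = d["domain_id"]
        if key not in out:      # keep the first, i.e. smallest, 'p'
            out[key] = d["p"]
    return out

df_total = pd.DataFrame([row_to_dict(p) for p in df["positions"]],
                        index=df.index)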
I'm attempting to output my database table data, which works apart from long rows. The columns need to be as wide as the longest value in them, and I'm having trouble implementing a calculation that outputs the table proportionally, instead of the huge mess produced when long rows are output (without using a third party library, e.g. Print results in MySQL format with Python). Please let me know if you need more information.
Database connection:
connection = sqlite3.connect("test_.db")
c = connection.cursor()
c.execute("SELECT * FROM MyTable")
results = c.fetchall()
formatResults(results)
Table formatting:
def formatResults(x):
    try:
        widths = []
        columns = []
        tavnit = '|'
        separator = '+'
        for cd in c.description:
            widths.append(max(cd[2], len(cd[0])))
            columns.append(cd[0])
        for w in widths:
            tavnit += " %-" + "%ss |" % (w,)
            separator += '-' * w + '--+'
        print(separator)
        print(tavnit % tuple(columns))
        print(separator)
        for row in x:
            print(tavnit % row)
        print(separator)
        print ""
    except:
        showMainMenu()
        pass
Output problem example:
+------+------+---------+
| Date | Name | LinkOrFile |
+------+------+---------+
| 03-17-2016 | hi.com | Locky |
| 03-18-2016 | thisisitsqq.com | None |
| 03-19-2016 | http://ohiyoungbuyff.com\69.exe?1 | None |
| 03-20-2016 | http://thisisitsqq..com\69.exe?1 | None |
| 03-21-2016 | %Temp%\zgHRNzy\69.exe | None |
| 03-22-2016 | | None |
| 03-23-2016 | E52219D0DA33FDD856B2433D79D71AD6 | Downloader |
| 03-24-2016 | microsoft.com | None |
| 03-25-2016 | 89.248.166.132 | None |
| 03-26-2016 | http://89.248.166.131/55KB5js9dwPtx4= | None |
If your main problem is making column widths consistent across all the lines, this python package could do the job: https://pypi.python.org/pypi/tabulate
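For example (a quick sketch, reusing results and c.description from the question's code):

from tabulate import tabulate

headers = [cd[0] for cd in c.description]
print(tabulate(results, headers=headers, tablefmt="psql"))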
Below is a very simple example of a possible formatting approach.
The key point is to find the largest length of each column and then use the format method of the string object:
#!/usr/bin/python
import random
import string

def randomString(minLen=1, maxLen=10):
    """ Random string of length between 1 and 10 """
    l = random.randint(minLen, maxLen)
    return ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(l))

COLUMNS = 4

def randomTable():
    table = []
    for i in range(10):
        table.append([randomString() for j in range(COLUMNS)])
    return table

def findMaxColumnLengs(table):
    """ Returns list of max column lengths """
    maxLens = [0] * COLUMNS
    for l in table:
        lens = [len(s) for s in l]
        maxLens = [max(maxLens[e[0]], e[1]) for e in enumerate(lens)]
    return maxLens

if __name__ == '__main__':
    ll = randomTable()
    ml = findMaxColumnLengs(ll)
    # list of formatting statements, see format docs
    formatStrings = ["{:<%s}" % str(m) for m in ml]
    fmtStr = "|".join(formatStrings)
    print "=================================="
    for l in ll:
        print l
    print "=================================="
    for l in ll:
        print fmtStr.format(*l)
This prints the initial table packed in the list of lists and the formatted output.
==================================
['2U7Q', 'DZK8Z5XT', '7ZI0W', 'A9SH3V3U']
['P7SOY3RSZ1', 'X', 'Z2W', 'KF6']
['NO8IEY9A', '4FVGQHG', 'UGMJ', 'TT02X']
['9S43YM', 'JCUT0', 'W', 'KB']
['P43T', 'QG', '0VT9OZ0W', 'PF91F']
['2TEQG0H6A6', 'A4A', '4NZERXV', '6KMV22WVP0']
['JXOT', 'AK7', 'FNKUEL', 'P59DKB8']
['BTHJ', 'XVLZZ1Q3H', 'NQM16', 'IZBAF']
['G0EF21S', 'A0G', '8K9', 'RGOJJYH2P9']
['IJ', 'SRKL8TXXI', 'R', 'PSUZRR4LR']
==================================
2U7Q |DZK8Z5XT |7ZI0W |A9SH3V3U
P7SOY3RSZ1|X |Z2W |KF6
NO8IEY9A |4FVGQHG |UGMJ |TT02X
9S43YM |JCUT0 |W |KB
P43T |QG |0VT9OZ0W|PF91F
2TEQG0H6A6|A4A |4NZERXV |6KMV22WVP0
JXOT |AK7 |FNKUEL |P59DKB8
BTHJ |XVLZZ1Q3H|NQM16 |IZBAF
G0EF21S |A0G |8K9 |RGOJJYH2P9
IJ |SRKL8TXXI|R |PSUZRR4LR
The code that you used is for MySQL. The critical part is the line widths.append(max(cd[2], len(cd[0]))) where cd[2] gives the length of the longest data in that column. This works for MySQLdb.
However, you are using sqlite3, for which the value cd[2] is set to None:
https://docs.python.org/2/library/sqlite3.html#sqlite3.Cursor.description
Thus, you will need to replace the following logic:
for cd in c.description:
    widths.append(max(cd[2], len(cd[0])))
    columns.append(cd[0])
with your own. The rest of the code should be fine as long as widths is computed correctly.
The easiest way to get the widths variable correctly, would be to traverse through each row of the result and find out the max width of each column, then append it to widths. This is just some pseudo code:
for cd in c.description:
    columns.append(cd[0])  # Get column headers

widths = [0] * len(c.description)  # Initialize to number of columns.
for row in x:
    for i in range(len(row)):  # This assumes that row is an iterable, like list
        v = row[i]  # Take value of ith column
        widths[i] = max(len(str(v)), widths[i])  # str() so non-string values work too
At the end of this, widths should contain the maximum length of each column.
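Put together, a sketch of the whole formatter might look like this (my own assembly of the pieces above, not tested against your schema):

def formatResults(x):
    columns = [cd[0] for cd in c.description]
    widths = [len(col) for col in columns]  # headers count towards the width
    for row in x:
        for i, v in enumerate(row):
            widths[i] = max(widths[i], len(str(v)))
    tavnit = '|' + ''.join(" %%-%ds |" % w for w in widths)
    separator = '+' + ''.join('-' * (w + 2) + '+' for w in widths)
    print(separator)
    print(tavnit % tuple(columns))
    print(separator)
    for row in x:
        print(tavnit % tuple(str(v) for v in row))
    print(separator)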
I have two tables with the following structures, where in table 1 the ID is next to Name, while in table 2 the ID is next to Title 1. The one similarity between the two tables is that the first person always has the ID next to their name; the position differs for subsequent people.
Table 1:
Name&Title | ID #
----------------------
Random_Name 1|2000
Title_1_1 | -
Title_1_2 | -
Random_Name 2| 2000
Title_2_1 | -
Title_2_2 | -
... |...
Table 2:
Name&Title | ID #
----------------------
Random_Name 1| 2000
Title_1_1 | -
Title_1_2 | -
Random_Name 2| -
Title_2_1 | 2000
Title_2_2 | -
... |...
I have the code to recognize table 1 but struggle to incorporate structure 2. The table is stored as a nested list of rows (each row is a list). Usually, for one person there is only 1 row of name but multiple rows of titles. The pseudo-code is this:
set count = 0
find the ID next to the first name, set it to be a recognizer
for row_i, row in enumerate(table):
    compare the ID of the next row until row[1] == recognizer is found
    set count = row_i
    slice the table to get the first person
The actual code is this:
header_ind = 0  # something related to the rest of the code
recognizer = data[header_ind+1][1]
count = header_ind+1
result = []
result.append(data[0])  # this appends the headers
for i, row in enumerate(data[header_ind+2:]):
    if i <= len(data[header_ind+4:]):
        if row[1] and data[i+1+header_ind+2][1] == recognizer:  # use ==, not `is`, for string comparison
            print data[i+header_ind+3]
            one_person = data[count:i+header_ind+3]
            result.append(one_person)
            count = i+header_ind+3
    else:
        if i == len(data[header_ind+3:]):
            last_person = data[count:i+header_ind+3]
            result.append(last_person)
            count = i+header_ind+3
I have been thinking about this for a while, so I just want to know whether it is possible to write an algorithm that incorporates Table 2, given that we cannot distinguish the name rows from the title rows.
Going to stick this here.
So these are your inputs; the assumption is that you are restricted to this:
# Table 1
data1 = [['Name&Title','ID#'],
['Random_Name1','2000'],
['Title_1_1','-'],
['Title_1_2','-'],
['Random_Name2','2000'],
['Title_2_1','-'],
['Title_2_2','-']]
# TABLE 2
data2 = [['Name&Title','ID#'],
['Random_Name1','2000'],
['Title_1_1','-'],
['Title_1_2','-'],
['Random_Name2','-'],
['Title_2_1','2000'],
['Title_2_2','-']]
And this is your desired output:
for x in data:
    print x

['Name&Title', 'ID#']
[['Random_Name1', '2000'], ['Title_1_1', '-'], ['Title_1_2', '-']]
[['Random_Name2', '2000'], ['Title_2_1', '-'], ['Title_2_2', '-']]
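The answer stops at the inputs and the desired output, so here is one possible sketch of the splitting itself. It assumes every person spans the same number of rows (true for the sample data) and that the shared ID appears exactly once per person; it also moves the ID up onto the name row, as in the desired output:

def split_people(data):
    header, rows = data[0], data[1:]
    recognizer = rows[0][1]                   # ID next to the first name
    n_people = sum(1 for r in rows if r[1] == recognizer)
    block = len(rows) // n_people             # rows per person
    people = []
    for i in range(0, len(rows), block):
        chunk = [[name, '-'] for name, _ in rows[i:i + block]]
        chunk[0][1] = recognizer              # normalize the ID onto the name row
        people.append(chunk)
    return header, people

# Both layouts give the same result:
# split_people(data1) == split_people(data2)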