Parse Pandas Return As List - python

I run the following code:
df = pd.read_excel(excel_file, columns = ['DeviceNumber','DeviceAddress','DeviceCity','DeviceState','StoreNumber','StoreName','DeviceConnect','Keys'])
df.index.name = 'ID'
def srch_knums(knum_search):
    get_knums = df.loc[df['DeviceNumber'] == knum_search]
    return get_knums
test = srch_knums(int(13))
print(test)
The output is as follows:
    DeviceNumber  DeviceAddress      DeviceCity  DeviceState  StoreNumber  StoreName  DeviceConnect  Keys
ID
12  13            135 Sesame Street  Imaginary   AZ           410          Verizon    Here On Site
(That looks prettier in a terminal, by the way.)
What I want to do is take the value test and use various parts of it, i.e. print them in specific parts of a GUI that I am creating. What is the syntax for accessing the individual values of test? I would rather change the labels when presenting it in the GUI: for example, take test[0], which should be the value for device number (13), and assign it to a variable, then make a label that says "kiosk number" and print that variable beside it. I would rather format it myself than use the odd printout from the return.

If you want to return a scalar value, first match by testing column col1 and take the output from column col2; loc is necessary here. next with iter is added so a default value is returned if there is no match:

def srch_knums(col1, knum_search, col2):
    return next(iter(df.loc[df[col1] == knum_search, col2]), 'no match')
test = srch_knums('DeviceNumber', int(13), 'StoreNumber')
print (test)
410
If you want a list:

def srch_knums(col1, knum_search, col2):
    return df.loc[df[col1] == knum_search, col2].tolist()
test = srch_knums('DeviceNumber', int(13), 'StoreNumber')
print (test)
[410]
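If you want several fields from the matched row at once (e.g. for GUI labels), another option is to take the first matched row and convert it to a dict, so each value is addressable by column name. A minimal sketch, using a small hypothetical frame in place of the Excel file (srch_row is an illustrative helper, not from the original code):

```python
import pandas as pd

# Hypothetical stand-in for the Excel data (column names from the question)
df = pd.DataFrame({
    'DeviceNumber': [13, 14],
    'StoreNumber': [410, 411],
    'StoreName': ['Verizon', 'Other'],
})

def srch_row(knum_search):
    """Return the first matching row as a plain dict, or None if no match."""
    matches = df.loc[df['DeviceNumber'] == knum_search]
    if matches.empty:
        return None
    return matches.iloc[0].to_dict()

row = srch_row(13)
# Each field is now addressable by column name, e.g. for a label:
kiosk_label = "Kiosk number: {}".format(row['DeviceNumber'])
```

The dict form decouples the GUI labels from the DataFrame's printout, which is what the question asks for.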

Change the line:
get_knums = df.loc[df['DeviceNumber'] == knum_search]
to
get_knums = df[df['DeviceNumber'] == knum_search]
you don't need to use loc.

Pandas Styler.to_latex() - how to pass commands and do simple editing

How do I pass the following commands into the latex environment?
\centering (I need landscape tables to be centered)
and
\caption* (I need to skip for a panel the table numbering)
In addition, I would need to add parentheses and asterisks to the t-statistics, meaning row-specific formatting on the dataframes.
For example:
Current

variable             value
const                2.439628
t stat               13.921319
FamFirm              0.114914
t stat               0.351283
founder              0.154914
t stat               2.351283
Adjusted R Square    0.291328

I want this

variable             value
const                2.439628
t stat               (13.921319)***
FamFirm              0.114914
t stat               (0.351283)
founder              0.154914
t stat               (1.651283)**
Adjusted R Square    0.291328
I'm doing my research papers in DataSpell. All empirical work is in Python, and then I use Latex (TexiFy) to create the pdf within DataSpell. Due to this workflow, I can't edit tables in latex code while they get overwritten every time I run the jupyter notebook.
In case it helps, here's an example of how I pass a table to the latex environment:
# drop index to column
panel_a.reset_index(inplace=True)
# write Latex index and cut names to appropriate length
ind_list = [
    "ageFirm",
    "meanAgeF",
    "lnAssets",
    "bsVol",
    "roa",
    "fndrCeo",
    "lnQ",
    "sic",
    "hightech",
    "nonFndrFam"
]
# assign the list of values to the column
panel_a["index"] = ind_list
# format column names
header = ["", "count", "mean", "std", "min", "25%", "50%", "75%", "max"]
panel_a.columns = header
with open(
    os.path.join(r"/.../tables/panel_a.tex"), "w"
) as tf:
    tf.write(
        panel_a
        .style
        .format(precision=3)
        .format_index(escape="latex", axis=1)
        .hide(level=0, axis=0)
        .to_latex(
            caption="Panel A: Summary Statistics for the Full Sample",
            label="tab:table_label",
            hrules=True,
        ))
You're asking three questions in one. I think I can do you two out of three (I hear that "ain't bad").
How to pass \centering to the LaTeX env using Styler.to_latex?
Use the position_float parameter. Simplified:
df.style.to_latex(position_float='centering')
How to pass \caption*?
This one I don't know. Perhaps useful: Why is caption not working.
How to apply row-specific formatting?
This one's a little tricky. Let me give an example of how I would normally do this:
df = pd.DataFrame({'a': ['some_var', 't stat'], 'b': [1.01235, 2.01235]})
df.style.format({'a': str,
                 'b': lambda x: "{:.3f}".format(x) if x < 2
                                else '({:.3f})***'.format(x)})
Result: (rendered table omitted)
You can see from this example that style.format accepts a callable (here nested inside a dict, but you could also do: .format(func, subset='value')). So, this is great if each value itself is evaluated (x < 2).
The problem in your case is that the evaluation is over some other value, namely a (not supplied) P value combined with panel_a['variable'] == 't stat'. Now, assuming you have those P values in a different column, I suggest you create a for loop to populate a list that becomes like this:
fmt_list = ['{:.3f}','({:.3f})***','{:.3f}','({:.3f})','{:.3f}','({:.3f})***','{:.3f}']
Now, we can apply a function to df.style.format, and pop/select from the list like so:
fmt_list = ['{:.3f}', '({:.3f})***', '{:.3f}', '({:.3f})', '{:.3f}', '({:.3f})***', '{:.3f}']

def func(v):
    fmt = fmt_list.pop(0)
    return fmt.format(v)

panel_a.style.format({'variable': str, 'value': func})
Result: (rendered table omitted)
This solution is admittedly a bit "hacky", since modifying a globally declared list inside a function is far from good practice: if you modify the list again before calling func, you are unlikely to get the expected behaviour, or worse, you may get an error that is difficult to track down. One alternative is simply to turn all the floats in panel_a.value into strings in place. In that case you no longer need .format, but it alters your df, which is also not ideal. You could make a copy first (df2 = df.copy()), at the cost of some memory.
Anyway, hope this helps. So, in full you add this as follows to your code:
fmt_list = ['{:.3f}', '({:.3f})***', '{:.3f}', '({:.3f})', '{:.3f}', '({:.3f})***', '{:.3f}']

def func(v):
    fmt = fmt_list.pop(0)
    return fmt.format(v)

with open(fname, "w") as tf:
    tf.write(
        panel_a
        .style
        .format({'variable': str, 'value': func})
        ...
        .to_latex(
            ...
            position_float='centering'
        ))
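One way to avoid mutating a global list inside the formatter is to bind an iterator in a closure, so each formatter carries its own consumable state. A small sketch of the same sequential-format idea (the numbers are just illustrative):

```python
fmt_list = ['{:.3f}', '({:.3f})***', '{:.3f}', '({:.3f})']

def make_func(fmts):
    # Each call to make_func gets a fresh iterator; nothing global is popped.
    it = iter(fmts)
    def func(v):
        return next(it).format(v)
    return func

func = make_func(fmt_list)
formatted = [func(x) for x in [2.439628, 13.921319, 0.114914, 0.351283]]
```

The resulting func can be passed to .format({'value': func}) exactly like the pop-based version, and re-creating it with make_func resets the sequence cleanly.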

Fuzzy Match List with Column in a data frame

I have a list of strings that I am trying to match to values in a column. If the best match is weak (score below 95) I want to keep the current column value; if it scores 95 or above I want to return the best fuzzy match from the list. I am trying to put all returned values into a new column. I keep getting the error "tuple index out of range"; I think this may be because it wants to return a tuple with the score and name but I only want the name. Here is my current code:
from fuzzywuzzy import process
from fuzzywuzzy import fuzz
L = [ducks, frogs, doggies]
df

FOO  PETS
a    duckz
b    frags
c    doggies
def fuzz_m(column, pet_list, score_t):
    for c in column:
        new_name, score = process.extractOne(c, pet_list, score_t)
        if score < 95:
            return c
        else:
            return new_name

df['NEW_PETS'] = fuzz_m(df, L, fuzz.ratio)
Desired output:

FOO  PETS     NEW_PETS
a    duckz    ducks
b    frags    frogs
c    doggies  doggies
Several corrections.
Change
df['NEW_PETS'] = fuzz_m(df,L, fuzz.ratio)
to
df['NEW_PETS'] = fuzz_m(df['PETS'], L, fuzz.ratio)
Make your list elements strings.
Fuzzywuzzy's extractOne method accepts both a processor and a scorer, in that order (see the source code). Your positional argument fuzz.ratio is mistakenly interpreted as a processor, when it is really a scorer. Change process.extractOne(c, pet_list, score_t) to process.extractOne(c, pet_list, scorer=score_t).
This loop-based code will not work as expected: fuzz_m is called only once (and returns on its first iteration), so its return value is broadcast into every entry of the series df['NEW_PETS'].
A more pandas-friendly way:
L = ['ducks', 'frogs', 'doggies']

def fuzz_m(col, pet_list, score_t):
    new_name, score = process.extractOne(col, pet_list, scorer=score_t)
    if score < 95:
        return col
    else:
        return new_name

df['NEW_PETS'] = df['PETS'].apply(fuzz_m, pet_list=L, score_t=fuzz.ratio)
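If installing fuzzywuzzy is not an option, the standard library's difflib can do a similar per-value lookup. This is a sketch of the same apply-style approach, not a drop-in replacement: get_close_matches scores on a 0..1 scale (unlike fuzzywuzzy's 0..100), so the cutoff here is an assumed rough analogue of the question's threshold.

```python
import difflib

L = ['ducks', 'frogs', 'doggies']

def fuzz_m(value, choices, cutoff=0.75):
    # get_close_matches returns the best matches whose SequenceMatcher
    # ratio is at or above `cutoff`; n=1 keeps only the single best one.
    matches = difflib.get_close_matches(value, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else value

new_pets = [fuzz_m(p, L) for p in ['duckz', 'frags', 'doggies']]
```

As in the answer above, this function can be applied per value, e.g. df['PETS'].apply(fuzz_m, choices=L).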

Converting Python statement into string

I have a situation where I have to generate an if condition based on items configured in my JSON configuration file.
Once the if statement is converted into a string, I want to run it with eval.
This is what I am trying.
Here is what flags and set_value look like
My json
"EMAIL_CONDITION": {
    "TOXIC_THRESOLD": "50",
    "TOXIC_PLATFORM_TODAY": "0",
    "TOXIC_PRS_TODAY": "0",
    "Explanation": "select any or all 1 for TOXIC_Thresold 2 for TOXIC_PLATFORM 3 for toxic_prs ",
    "CONDITION_TYPE": ["1", "2"]
},
"email_list": {
    "cc": ["abc#def.t"],
    "to": ["abc#def.net"]
}
The Python
CONDITION_TYPE is a variable whose value can be "all", "any", or a list such as 1,2 or 2,3 or 1,3.
1 stands for toxic index, 2 for platform, and 3 for toxic PRs.
The idea is that any number of parameters may be added going forward, so I wanted to make the if condition generic enough to take any number of conditions and avoid piling up if/else branches. I have already handled "all" and "any"; that part is straightforward. Since the number of conditions is variable, only the else branch is shown:
flags['1'] = toxic_index
flags['2'] = toxic_platform
flags['3'] = toxic_prs
set_value['1'] = toxic_index_condition
set_value['2'] = toxic_platform_condition
set_value['3'] = toxic_pr_condition
else:
    condition_string = 'if '
    for val, has_more in lookahead(conditions):
        if has_more:
            condition_string = str(condition_string + str(flags[val] >= set_value[val]) + str(' and '))
        else:
            condition_string = str(condition_string + str(flags[val] >= set_value[val]) + str(':'))
    print str(condition_string)
I do understand why: the comparisons are evaluated immediately, so I get a response like

if False and False:

Instead of False I want the real, unevaluated condition text (basically condition_string + "flags[val] >= set_value[val]"), so that I can decide whether to send mail based on it.
Please suggest the best solution for this.
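One way to get both a readable condition string and the result, without eval at all, is to build the text for display while evaluating the comparisons directly with all(). A sketch with hypothetical flag and threshold values (the real ones would come from the JSON config):

```python
# Hypothetical values standing in for toxic_index etc.
flags = {'1': 55, '2': 3, '3': 2}
set_value = {'1': 50, '2': 3, '3': 5}
conditions = ['1', '2']

# Human-readable condition text, for logging/debugging only
condition_string = 'if ' + ' and '.join(
    "flags[{0!r}] >= set_value[{0!r}]".format(v) for v in conditions) + ':'

# The actual decision: no string evaluation needed
send_mail = all(flags[v] >= set_value[v] for v in conditions)
```

This scales to any number of configured conditions and sidesteps the security and debugging problems of eval.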

Python: Summing a pig tuple containing float values

I'm fairly new to Pig/Python and in need of help. Trying to write a Pig Script that reconciles financial data. The parameters used follow a syntax like (grand_tot, x1, x2,... xn), meaning that the first value should equal the sum of remaining values.
I don't know of a way to accomplish this using Pig alone, so I've been trying to write a Python UDF. Pig passes a tuple to Python; if the sum of x1:xn equals grand_tot, then Python should return a "1" to Pig to show that the numbers match, otherwise it returns a "0".
Here is what I have so far:
register 'myudf.py' using jython as myfuncs;
A = LOAD '$file_nm' USING PigStorage(',') AS (grand_tot,west_region,east_region,prod_line_a,prod_line_b, prod_line_c, prod_line_d);
A1 = GROUP A ALL;
B = FOREACH A1 GENERATE TOTUPLE($recon1) as flds;
C = FOREACH B GENERATE myfuncs.isReconciled(flds) AS res;
DUMP C;
$recon1 is passed as a parameter, and defined as:
grand_tot, west_region, east_region
I will later pass $recon2 as:
grand_tot, prod_line_a, prod_line_b, prod_line_c, prod_line_d
Sample row of data (in $file_nm) looks like:
grand_tot,west_region,east_region,prod_line_a,prod_line_b, prod_line_c, prod_line_d
10000,4500,5500,900,2200,450,3700,2750
12500,7500,5000,3180,2770,300,3950,2300
9900,7425,2475,1320,460,3070,4630,1740
Lastly... here is what I'm trying to do with Python UDF code:
@outputSchema("result")
def isReconciled(arrTuple):
    arrTemp = []
    arrNew = []
    result = 0
    ## the first element of the Tuple should be the sum of remaining values
    varGrandTot = arrTuple[0]
    ## create a new array with the remaining Tuple values
    arrTemp = arrTuple[1:]
    for item in arrTemp:
        arrNew.append(item)
    ## sum the second to the nth values
    varSum = sum(arrNew)
    ## if the first value in the tuple equals the sum of all remaining values
    if varGrandTot == varSum:
        # reconciled to the penny
        result = 1
    else:
        result = 0
    return result
The error message I receive is:
unsupported operand type(s) for +: 'int' and 'array.array'
I've tried numerous things attempting to convert the array values into numeric and convert to float so that I can sum, but with no success.
Any ideas??? Thanks for looking!
You can do this in Pig itself.
First, specify the data types in the schema. PigStorage uses bytearray as the default data type, which is why your Python script throws that error. Also note that your sample data looks like ints, although in your question you mentioned float.
Second, add up the fields starting from the second field (or the fields of your choice).
Third, use the bincond operator to compare the first field's value with the sum.
A = LOAD '$file_nm' USING PigStorage(',') AS (grand_tot:float,west_region:float,east_region:float,prod_line_a:float,prod_line_b:float, prod_line_c:float, prod_line_d:float);
A1 = FOREACH A GENERATE grand_tot,SUM(TOBAG(prod_line_a,prod_line_b,prod_line_c,prod_line_d)) as SUM_ALL;
B = FOREACH A1 GENERATE (grand_tot == SUM_ALL ? 1 : 0);
DUMP B;
It is very likely, that your arrTuple is not an array of numbers, but some item is an array.
To check it, modify your code by adding some checks:
@outputSchema("result")
def isReconciled(arrTuple):
    # some checks
    tmpl = "Item # {i} shall be a number (has value {itm} of type {tp})"
    for i, num in enumerate(arrTuple):
        msg = tmpl.format(i=i, itm=num, tp=type(num))
        assert isinstance(num, (int, long, float)), msg
    # end of checks
    arrTemp = []
    arrNew = []
    result = 0
    ## the first element of the Tuple should be the sum of remaining values
    varGrandTot = arrTuple[0]
    ## create a new array with the remaining Tuple values
    arrTemp = arrTuple[1:]
    for item in arrTemp:
        arrNew.append(item)
    ## sum the second to the nth values
    varSum = sum(arrNew)
    ## if the first value in the tuple equals the sum of all remaining values
    if varGrandTot == varSum:
        # reconciled to the penny
        result = 1
    else:
        result = 0
    return result
It is very likely that this will throw an AssertionError on one of the items. Read the assertion message to learn which item is causing the trouble.
Anyway, if you want to return 0 or 1 depending on whether the first number equals the sum of the rest of the array, the following would work too:
@outputSchema("result")
def isReconciled(arrTuple):
    if arrTuple[0] == sum(arrTuple[1:]):
        return 1
    else:
        return 0
and in case you would be happy getting True instead of 1 and False instead of 0:

@outputSchema("result")
def isReconciled(arrTuple):
    return arrTuple[0] == sum(arrTuple[1:])
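As a quick sanity check, the logic of the compact version can be exercised in plain Python. The tuples below reuse numbers from the sample data (under Pig the values would arrive as floats, which compare the same way for these amounts):

```python
def isReconciled(arrTuple):
    # 1 if the first value equals the sum of the remaining values, else 0
    return 1 if arrTuple[0] == sum(arrTuple[1:]) else 0

matched = isReconciled((10000, 4500, 5500))                    # regions add up
unmatched = isReconciled((9900, 1320, 460, 3070, 4630, 1740))  # they do not
```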

Handle modifications in a diff file

I've got a diff file and I want to handle adds/deletions/modifications to update an SQL database.
+NameA|InfoA1|InfoA2
-NameB|InfoB1|InfoB2
+NameB|InfoB3|InfoB2
-NameC|InfoC1|InfoC2
-NameD|InfoD1|InfoD2
-NameE|InfoE1|InfoE2
+NameD|InfoD1|InfoD3
+NameE|InfoE3|InfoE2
With a Python script, I first detect two consecutive lines with a regular expression to handle modifications like B:
re.compile(r"^-(.+?)\|(.*?)\|(.+?)\n\+(.+?)\|(.*?)\|(.+?)(?:\n|\Z)", re.MULTILINE)
I delete all the matching lines, then rescan the file and handle everything that remains as additions/deletions.
My problem is with lines like D and E. At the moment I treat them as two deletions followed by two additions, which triggers unwanted CASCADE DELETEs in my SQL database, since I should be treating them as modifications.
How can I handle modifications like D and E?
The diff file is generated by a bash script before, I could handle it differently if needed.
Try this:
>>> a = '''
+NameA|InfoA1|InfoA2
-NameB|InfoB1|InfoB2
+NameB|InfoB3|InfoB2
-NameC|InfoC1|InfoC2
-NameD|InfoD1|InfoD2
-NameE|InfoE1|InfoE2
+NameD|InfoD1|InfoD3
+NameE|InfoE3|InfoE2
'''
>>> diff = {}
>>> for row in a.splitlines():
...     if not row:
...         continue
...     s = row.split('|')
...     name = s[0][1:]
...     data = s[1:]
...     if row.startswith('+'):
...         change = diff.get(name, {'rows': []})
...         change['rows'].append(row)
...         change['status'] = 'modified' if change.has_key('status') else 'added'
...     else:
...         change = diff.get(name, {'rows': []})
...         change['rows'].append(row)
...         change['status'] = 'modified' if change.has_key('status') else 'removed'
...     diff[name] = change
>>> def print_by_status(status=None):
...     for item, value in diff.items():
...         if status is not None and status == value['status'] or status is None:
...             print '\nStatus: %s\n%s' % (value['status'], '\n'.join(value['rows']))
>>> print_by_status(status='added')

Status: added
+NameA|InfoA1|InfoA2
>>> print_by_status(status='modified')

Status: modified
-NameD|InfoD1|InfoD2
+NameD|InfoD1|InfoD3

Status: modified
-NameE|InfoE1|InfoE2
+NameE|InfoE3|InfoE2

Status: modified
-NameB|InfoB1|InfoB2
+NameB|InfoB3|InfoB2
In this case you end up with a dictionary of all the collected data, holding a diff status and the raw rows for each name. You can then do whatever you want with that dict.
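For readers on Python 3 (where dict.has_key and the print statement are gone), the same grouping idea can be sketched with a defaultdict; the names and statuses below mirror a subset of the example diff:

```python
from collections import defaultdict

diff_text = """+NameA|InfoA1|InfoA2
-NameB|InfoB1|InfoB2
+NameB|InfoB3|InfoB2
-NameC|InfoC1|InfoC2"""

changes = defaultdict(list)
for row in diff_text.splitlines():
    if not row:
        continue
    # key every line by the record name, stripping the leading +/- sign
    changes[row.split('|', 1)[0][1:]].append(row)

def status(rows):
    # a name seen with both '-' and '+' lines is a modification
    signs = {r[0] for r in rows}
    if signs == {'+', '-'}:
        return 'modified'
    return 'added' if signs == {'+'} else 'removed'

statuses = {name: status(rows) for name, rows in changes.items()}
```

Deriving the status from the set of signs after grouping (rather than updating it line by line) makes the modified/added/removed decision a single pure function.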
