Python .split() function

I am using .split() to separate the M/D/Y values from one field so that each value goes into its own respective field. My script is bombing out on the NULL values in the original date field when it calculates the Day field.
10/27/1990 ----> M:10 D:27 Y:1990
# Process: Calculate Field Month
arcpy.CalculateField_management(in_table="Assess_Template",field="Assess_Template.Month",expression="""!Middleboro_xlsx_Sheet2.Legal_Reference_Sale_Date!.split("/")[0]""",expression_type="PYTHON_9.3",code_block="#")
# Process: Calculate Field Day
arcpy.CalculateField_management(in_table="Assess_Template",field="Assess_Template.Day",expression="""!Middleboro_xlsx_Sheet2.Legal_Reference_Sale_Date!.split("/")[1]""",expression_type="PYTHON_9.3",code_block="#")
# Process: Calculate Field Year
arcpy.CalculateField_management(in_table="Assess_Template",field="Assess_Template.Year",expression="""!Middleboro_xlsx_Sheet2.Legal_Reference_Sale_Date!.split("/")[-1]""",expression_type="PYTHON_9.3",code_block="#")
I am unsure how I should fix this issue; any suggestions would be greatly appreciated!

Something like this should work (to calculate the year where possible):
in_table = "Assess_Template"
field = "Assess_Template.Year"
expression = "get_year(!Middleboro_xlsx_Sheet2.Legal_Reference_Sale_Date!)"
codeblock = """def get_year(date):
    try:
        return date.split("/")[-1]
    except:
        return date"""
arcpy.CalculateField_management(in_table, field, expression, "PYTHON_9.3", codeblock)
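The same pattern also guards the Month and Day calculations; here is a rough sketch under the same assumptions (same table and field names as above, with a hypothetical helper named get_part):
# Hypothetical helper: returns the requested part of the date, or the
# original value (e.g. NULL) when it cannot be split.
codeblock = """def get_part(date, index):
    try:
        return date.split("/")[index]
    except:
        return date"""
# index 0 -> Month, 1 -> Day, -1 -> Year
for field, index in [("Assess_Template.Month", 0), ("Assess_Template.Day", 1), ("Assess_Template.Year", -1)]:
    expression = "get_part(!Middleboro_xlsx_Sheet2.Legal_Reference_Sale_Date!, {0})".format(index)
    arcpy.CalculateField_management("Assess_Template", field, expression, "PYTHON_9.3", codeblock)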
Good luck!
Tom

Related

if column years is >=10, user personal details should be replaced with his id (pandas)

I am new to pandas.
I am iterating through each row and checking each user's exit date; if the number of years since the exit date is >= 10, the user's personal details should be replaced with his id.
I am stuck, please help.
for edate in pd.to_datetime(df1['EXIT_DATE']):
    rdelt = relativedelta(datetime.today(), edate)
    df1['years'] = rdelt.years
    # it's modifying each row in the DataFrame.
    #df1.loc[flag, ['first_name', 'middel_name', 'email']] = df1['user_id']
EDIT: Added a link to an answer from @Arvind Kumar Avinash explaining "Filtering on dataframe".
Taking @Emi OB's comment and adding an explanation:
You can create a flag/mask by using the usual comparison operators (<, >, <=, >=), e.g.
age = pd.Series([20, 23, 22, 19, 30])
age > 22  # Series([False, True, False, False, True])
You can then use that mask to operate on all the True indexes. For example, if we want to replace every age where age > 22 (i.e. every index where the mask is True) with the value 22, we simply do:
age = pd.Series([20,23,22,19,30])
mask = age>22 # Series([False,True,False,False,True])
age.loc[mask] = 22
age # pd.Series([20,22,22,19,22])
The exact same logic can be used on DataFrames, as in the sketch below.
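For example, a minimal sketch with a hypothetical DataFrame (the column names here are made up for illustration):
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bob', 'Cid'], 'age': [20, 23, 30]})
mask = df['age'] > 22            # boolean Series aligned with the rows
df.loc[mask, 'name'] = 'hidden'  # overwrite 'name' only where the mask is True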
You can try the code below to avoid the loop:
# Ensure EXIT_DATE dtype is a datetime64
df1['EXIT_DATE'] = pd.to_datetime(df1['EXIT_DATE'])
df1['years'] = pd.Timestamp.today().year - df1['EXIT_DATE'].dt.year
df1.loc[df1['years'] >= 10, ['first_name', 'middle_name', 'email']] = df1['user_id']
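If you need whole years elapsed rather than a simple year difference (which ignores month and day), here is a hedged sketch applying relativedelta row-wise; it assumes the same df1 columns as above and that EXIT_DATE has no missing values:
from datetime import datetime
from dateutil.relativedelta import relativedelta

# Complete years since EXIT_DATE for each row (EXIT_DATE already datetime64)
df1['years'] = df1['EXIT_DATE'].apply(
    lambda edate: relativedelta(datetime.today(), edate).years)
df1.loc[df1['years'] >= 10, ['first_name', 'middle_name', 'email']] = df1['user_id']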

Python: Extract unique sentences from string and place them in new column concatenated with ;

Afternoon Stackoverflowers,
I have been challenged with an extraction task while trying to prepare data for some users.
As I was advised, it is very hard to do in SQL because there is no clear pattern, so I tried some things in Python, but without success (I am still learning Python).
Problem statement:
My SQL query output is either an Excel or a text file (it depends on how I publish it, but I can do it either way). I have a field (the fourth column in the Excel or text file) which contains one or more rejection reasons (see the example below), separated by commas. At the same time, a comma is sometimes also used within the errors themselves.
Field example without any modification
INVOICE_CREATION_FAILED[Invalid Address information: Company Name, Limited: Blueberry Street 10+City++09301+SK|SK000001111|BLD at line 1 , Company Id on the Invoice does not match with the Company Id registered for the Code in System: [AL12345678901]|ABC1D|DL0000001 at line 2 , Incorrect Purchase order: VTB2R|ADLAVA9 at line 1 ]
Desired output:
Invalid Address information; Company Id on the Invoice does not match with the Company Id registered for the Code in System; Incorrect Purchase order
Python code:
import pandas
excel_data_df = pandas.read_excel('rejections.xlsx')
# print whole sheet data
print(excel_data_df['Invoice_Issues'].tolist())
excel_data_df['Invoice_Issues'].split(":", 1)
Output:
INVOICE_CREATION_FAILED[Invalid Address information:
I tried splitting the string, but it doesn't work properly. It deletes everything after the colon, which is acceptable because it is coded that way; however, I would need to trim the unnecessary data from the string and keep only the desired output for each line.
I would be very thankful for any code suggestions on how to trim the output so that I extract only what is needed from that string, if the substring is present.
In Excel I would normally use a list of errors and nested IF functions with FIND and MATCH, but I am not sure how to do it in Python...
Many thanks,
Greg
This isn't the fastest way to do this, but in Python, speed is rarely the most important thing.
Here, we manually create a dictionary mapping each error code to the description you want reported, then we iterate over the values in Invoice_Issues and use the dictionary to check whether each error description is present.
import pandas
# create a standard dataframe with some errors
excel_data_df = pandas.DataFrame({
    'Invoice_Issues': [
        'INVOICE_CREATION_FAILED[Invalid Address information: Company Name, Limited: Blueberry Street 10+City++09301+SK|SK000001111|BLD at line 1 , Company Id on the Invoice does not match with the Company Id registered for the Code in System: [AL12345678901]|ABC1D|DL0000001 at line 2 , Incorrect Purchase order: VTB2R|ADLAVA9 at line 1 ]'
    ]
})
# build a dictionary
# (just like mapping "words" to their "meanings", this maps
# "error keys" to their "error descriptions")
errors_dictionary = {
    'E01': 'Invalid Address information',
    'E02': 'Incorrect Purchase order',
    'E03': 'Invalid VAT ID',
    # ....
    'E39': 'No tax line available'
}
def extract_errors(invoice_issue):
    # Using the entry in the Invoice_Issues column,
    # loop over the dictionary to see if each error description is in there.
    # If so, add it to list_of_errors.
    list_of_errors = []
    for error_number, error_description in errors_dictionary.items():
        if error_description in invoice_issue:
            list_of_errors.append(error_description)
    return '; '.join(list_of_errors)
# print whole sheet data
print(excel_data_df['Invoice_Issues'].tolist())
# for every row in the Invoice_Issues column, run the extract_errors function and store the value in the 'Errors' column.
excel_data_df['Errors'] = excel_data_df['Invoice_Issues'].apply(extract_errors)
# display the dataframe with the extracted errors
print(excel_data_df.to_string())
excel_data_df.to_excel('extracted_errors.xlsx')
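To run the same extraction against the workbook from the original question rather than the inline sample, a sketch (it assumes rejections.xlsx exists with an Invoice_Issues column as described, and reuses the errors_dictionary and extract_errors defined above):
import pandas

# read the rejection export produced by the SQL query
excel_data_df = pandas.read_excel('rejections.xlsx')

# map each row to its cleaned, semicolon-separated list of reasons
excel_data_df['Errors'] = excel_data_df['Invoice_Issues'].apply(extract_errors)
excel_data_df.to_excel('extracted_errors.xlsx', index=False)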

SPSS Modeler using python for doing loop through months from SQL code

I have a Node in SPSS Modeler with SQL code provided below.
It's selecting a month and calculating a count for one month.
I created a parameter '$P-p_ly_parameter' and assigned a value 201807 to it.
What I want to do is to run a loop through the months from 201807 to 201907.
I use a Python code putting it into Tools, Stream Properties, Execution.
But when I run it, I don't get the results that I expect.
In fact I'm not getting any results.
Obviously, I'm missing something.
I suppose the result of the loop is not being assigned to the month_id for each month.
Could you please help me with the right way to do the loop?
Should I use a Select node and include something like this?
-- SQL
SELECT
    cust.month_id,
    count(*) AS AB_P1_TOTAL
FROM tab1 cust
JOIN tab2 dcust
    ON dcust.month_id = cust.month_id
    AND dcust.cust_srcid = cust.cust_srcid
WHERE cust.month_id = '$P-p_ly_parameter'
GROUP BY cust.month_id
ORDER BY cust.month_id
# Python
import modeler.api

# boilerplate definitions
stream = modeler.script.stream()
taskrunner = modeler.script.session().getTaskRunner()

# variables for starting year
startYear = 2018
# gets us to 2019
yearsToLoop = 1

# get the required nodes by ID
# (double-click the node, go to Annotations and read the ID from the bottom right)
selectNode = stream.findByID('id5NBVZYS3XT2')
runNode = stream.findByID('id3N3V6JXBQU2')

# loop through our years
for year in range(0, yearsToLoop):
    # loop through months
    for month in range(1, 13):
        #month_id = str(startYear + year) + str(month).rjust(2, '0')
        p_ly_parameter = str(startYear + year) + str(month).rjust(2, '0')
        # debug
        #print month_id
        print p_ly_parameter
        # set the condition in the select node
        #selectNode.setPropertyValue('condition', 'month_id = ' + month_id)
        #selectNode.setPropertyValue("condition", "'month_id = '$P-p_ly_parameter'")
        #selectNode.setPropertyValue('mode', 'Include')
        # run the stream
        runNode.run(None)
I expect results by month, for example 201807 500, 201808 1000, etc.
But right now I'm getting nothing.
The missing piece is to set the value of the stream parameter.
The line of code that says:
p_ly_parameter = str(startYear + year) + str(month).rjust(2,'0')
only sets the value of a variable in the Python script itself, but does not change the value of the stream parameter with the same name.
You need to add a line immediately following that, which explicitly sets the value of the stream parameter such as:
stream.setParameterValue("p_ly_parameter", p_ly_parameter)
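Putting that together, the inner loop from the original script would look something like this (same node IDs, variables, and parameter name as above; the only new line is the setParameterValue call):
for year in range(0, yearsToLoop):
    for month in range(1, 13):
        p_ly_parameter = str(startYear + year) + str(month).rjust(2, '0')
        # push the value into the stream parameter so the Select node's
        # '$P-p_ly_parameter' reference picks it up on this iteration
        stream.setParameterValue("p_ly_parameter", p_ly_parameter)
        # run the stream for this month
        runNode.run(None)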

Retrieve data from GenBank with Bio.Entrez module

I am trying to solve one of the Rosalind challenges and I can't seem to find a way to retrieve data within a specific time frame.
http://rosalind.info/problems/gbk/
How do I modify Entrez.esearch() to specify a time frame?
Question:
Given: A genus name, followed by two dates in YYYY/M/D format.
Return: The number of Nucleotide GenBank entries for the given genus that were published between the dates specified.
Test Data:
Anthoxanthum
2003/7/25
2005/12/27
Answer: 7
Thanks a lot to @Kayvee for the pointer! It works like a charm!
Here is a format for searching the organism by 'posted between start-end':
(Anthoxanthum[Organism]) AND ("2003/7/25"[Publication Date] : "2005/12/27"[Publication Date])
Here is the Python code:
from Bio import Entrez

# GenBank gene database
geneName = "Anthoxanthum"
pubDateStart = "2003/7/25"
pubDateEnd = "2005/12/27"
searchTerm = f'({geneName}[Organism]) AND ("{pubDateStart}"[Publication Date] : "{pubDateEnd}"[Publication Date])'
print(f"\n[GenBank gene database]:")
Entrez.email = "please@pm.me"
handle = Entrez.esearch(db="nucleotide", term=searchTerm)
record = Entrez.read(handle)
print(record["Count"])
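An alternative sketch that pushes the date range into esearch's own date parameters instead of the query string (datetype="pdat" restricts by publication date; the date strings are the same ones used above, and the email is the placeholder from the answer):
from Bio import Entrez

Entrez.email = "please@pm.me"
handle = Entrez.esearch(
    db="nucleotide",
    term="Anthoxanthum[Organism]",
    datetype="pdat",       # filter on publication date
    mindate="2003/7/25",
    maxdate="2005/12/27")
record = Entrez.read(handle)
handle.close()
print(record["Count"])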

django annotation and filtering

Hopefully this result set is explanatory enough:
title          text           total_score   already_voted
-------------  -------------  -----------   -------------
BP Oil spi...  Recently i...  5             0
J-Lo back ...  Celebrity ...  7             1
Don't Stop...  If there w...  9             0
Australian...  The electi...  2             1
My models file describes article (author, text, title) and vote (caster, date, score). I can get the first three columns just fine with the following:
articles = Article.objects.all().annotate(total_score=Sum('vote__score'))
but calculating the 4th column, which is a boolean value describing whether the currently logged-in user has placed any of the votes counted in column 3, is a bit beyond me at the moment! Hopefully there's something that doesn't require raw SQL for this one.
Cheers,
Dave
--Trindaz on Fedang #django
I cannot think of a way to include the boolean condition. Perhaps others can answer that better.
How about thinking a bit differently? If you don't mind executing two queries, you can filter your articles based on whether the currently logged-in user has voted on them or not. Something like this:
all_articles = Article.objects.all()
articles_user_has_voted_on = all_articles.filter(vote__caster=request.user).annotate(total_score=Sum('vote__score'))
other_articles = all_articles.exclude(vote__caster=request.user).annotate(total_score=Sum('vote__score'))
Update
After some experiments I was able to figure out how to add a boolean condition for a column in the same model (Article in this case) but not for a column in another table (Vote.caster).
If Article had a caster column:
Article.objects.all().extra(select = {'already_voted': "caster_id = %s" % request.user.id})
In the present state this can be applied for the Vote model:
Vote.objects.all().extra(select = {'already_voted': "caster_id = %s" % request.user.id})
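On newer Django versions (1.11+), a subquery annotation can produce the boolean column directly in a single queryset. A sketch, not part of the original answer, assuming the Article/Vote models from the question and that Vote has a ForeignKey to Article named article:
from django.db.models import Exists, OuterRef, Sum

articles = (
    Article.objects
    .annotate(total_score=Sum('vote__score'))
    # already_voted is True when the current user has any vote on the article
    .annotate(already_voted=Exists(
        Vote.objects.filter(article=OuterRef('pk'), caster=request.user)))
)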
