Retrieve data from GenBank with Bio.Entrez module - python

I am trying to solve one of the Rosalind challenges and I can't seem to find a way to retrieve data within a specific time frame.
http://rosalind.info/problems/gbk/
How do I modify Entrez.esearch() to specify a time frame?
Question:
Given: A genus name, followed by two dates in YYYY/M/D format.
Return: The number of Nucleotide GenBank entries for the given genus that were published between the dates specified.
Test Data:
Anthoxanthum
2003/7/25
2005/12/27
Answer: 7

Thanks a lot to @Kayvee for the pointer! It works like a charm!
Here is the format for searching for the organism with a "published between start and end" restriction:
(Anthoxanthum[Organism]) AND ("2003/7/25"[Publication Date] : "2005/12/27"[Publication Date])
Here is Python code:
from Bio import Entrez

# GenBank nucleotide database
geneName = "Anthoxanthum"
pubDateStart = "2003/7/25"
pubDateEnd = "2005/12/27"
searchTerm = f'({geneName}[Organism]) AND ("{pubDateStart}"[Publication Date] : "{pubDateEnd}"[Publication Date])'
print("\n[GenBank gene database]:")
Entrez.email = "please@pm.me"
handle = Entrez.esearch(db="nucleotide", term=searchTerm)
record = Entrez.read(handle)
print(record["Count"])


Python: Extract unique sentences from string and place them in new column concatenated with ;

Afternoon Stackoverflowers,
I have been challenged with some data extraction, as I am trying to prepare data for some users.
As I was advised, it is very hard to do in SQL since there is no clear pattern, so I tried some things in Python, but without success (as I am still learning Python).
Problem statement:
My SQL query output is either an Excel or a text file (depending on how I publish it; I can do it both ways). I have a field (the fourth column in the Excel or text file) which contains one or more rejection reasons (see example below), separated by commas; at the same time, a comma is sometimes used within the errors themselves.
Field example without any modification
INVOICE_CREATION_FAILED[Invalid Address information: Company Name, Limited: Blueberry Street 10+City++09301+SK|SK000001111|BLD at line 1 , Company Id on the Invoice does not match with the Company Id registered for the Code in System: [AL12345678901]|ABC1D|DL0000001 at line 2 , Incorrect Purchase order: VTB2R|ADLAVA9 at line 1 ]
Desired output:
Invalid Address information; Company Id on the Invoice does not match with the Company Id registered for the Code in System; Incorrect Purchase order
Python code:
import pandas

excel_data_df = pandas.read_excel('rejections.xlsx')
# print whole sheet data
print(excel_data_df['Invoice_Issues'].tolist())
# keep only the text before the first colon
excel_data_df['Invoice_Issues'].str.split(":", 1).str[0]
Output:
INVOICE_CREATION_FAILED[Invalid Address information:
I tried splitting the string, but it doesn't work properly: it deletes everything after the colon, which is expected given how it is coded, but I would need to trim away the unnecessary data and keep only the desired output for each line.
I would be very thankful for any code suggestions on how to trim the output so that I extract only what is needed from the string, if the substring is present.
In Excel I would normally use a list of errors and nested IF functions with FIND and MATCH, but I am not sure how to do it in Python...
Many thanks,
Greg
This isn't the fastest way to do this, but in Python, speed is rarely the most important thing.
Here, we manually create a dictionary of the errors and what you would want them to map to, then we iterate over the values in Invoice_Issues and use the dictionary to see whether each error description is present.
import pandas

# create a standard dataframe with some errors
excel_data_df = pandas.DataFrame({
    'Invoice_Issues': [
        'INVOICE_CREATION_FAILED[Invalid Address information: Company Name, Limited: Blueberry Street 10+City++09301+SK|SK000001111|BLD at line 1 , Company Id on the Invoice does not match with the Company Id registered for the Code in System: [AL12345678901]|ABC1D|DL0000001 at line 2 , Incorrect Purchase order: VTB2R|ADLAVA9 at line 1 ]']
})

# build a dictionary
# (just like "words" to their "meanings", this maps
# "error keys" to their "error descriptions").
errors_dictionary = {
    'E01': 'Invalid Address information',
    'E02': 'Incorrect Purchase order',
    'E03': 'Invalid VAT ID',
    # ....
    'E39': 'No tax line available'
}

def extract_errors(invoice_issue):
    # using the entry in the Invoice_Issues column,
    # loop over the dictionary to see if each error is in there.
    # If so, add it to the list_of_errors.
    list_of_errors = []
    for error_number, error_description in errors_dictionary.items():
        if error_description in invoice_issue:
            list_of_errors.append(error_description)
    return '; '.join(list_of_errors)

# print whole sheet data
print(excel_data_df['Invoice_Issues'].tolist())

# for every row in the Invoice_Issues column, run the extract_errors
# function and store the value in the 'Errors' column.
excel_data_df['Errors'] = excel_data_df['Invoice_Issues'].apply(extract_errors)

# display the dataframe with the extracted errors
print(excel_data_df.to_string())
excel_data_df.to_excel('extracted_errors.xlsx')
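Note that the sample dictionary only lists a handful of descriptions, so on the data above only "Invalid Address information" and "Incorrect Purchase order" are matched; the Company Id mismatch from your desired output would need its own entry in errors_dictionary. Substring matching is also what makes this robust here: splitting on commas would break, because commas appear inside the error details themselves.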

How to import data from CSV file containing certain words?

I have a CSV file containing daily data on yields of different government bonds of varying maturities. The headers are formatted as the country followed by the maturity of the bond, e.g. UK 10Y. What I would like to do is import all the yields for one government bond at all maturities for one date, so for example import all the UK government bond yields at a particular date. The first date is 07/01/2021.
I know I can use pandas, but all the code I have seen requires the usecols argument when importing. I'd like to just create a function and import only the data that I want without using usecols.
[Snapshot of data omitted: the UK data is further to the right, but the format is the same.]
You can try:
import time
import datetime

col_to_check = "Date"  # header of the date column in your CSV (adjust to your file)
get_after = "07/01/2021"
get_after = time.mktime(datetime.datetime.strptime(get_after, "%d/%m/%Y").timetuple())

with open("yourfile.csv", "r") as msg:
    data = msg.readlines()

index_to_check = data[0].split(",").index(col_to_check)

for i, v in enumerate(data):
    if i == 0:
        pass
    else:
        date = time.mktime(datetime.datetime.strptime(v.split(",")[index_to_check], "%d/%m/%Y").timetuple())
        if date > get_after:
            pass
        else:
            data[i] = ""

print([x for x in data if x])
This is untested code, as you did not provide sample input, but in principle it should work.
You have the header name of the date column you want to check and the limit date.
You get the index of that column from the first (header) row of your CSV, and you convert the limit date to an integer timestamp.
Then you read your data line by line and check: if the date/timestamp is greater than your limit you leave the line alone, else you assign an empty value at the corresponding index of data.
Finally you remove the empty elements to get the final list.
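If pandas turns out to be acceptable after all, the same thing can be done without usecols by selecting columns by prefix after loading. A minimal sketch, assuming a "Date" column and "UK ..." headers as described in the question (both names are assumptions, adjust to your file):
import pandas as pd

# Hypothetical file and column names -- adjust to your CSV.
df = pd.read_csv("yourfile.csv", parse_dates=["Date"], dayfirst=True)

def yields_for(country, date):
    # Pick every column whose header starts with the country code,
    # e.g. "UK 10Y", "UK 5Y", then the single row matching the date.
    cols = [c for c in df.columns if c.startswith(country)]
    return df.loc[df["Date"] == pd.Timestamp(date), cols]

print(yields_for("UK", "2021-01-07"))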

How to match asset price data from a csv file to another csv file with relevant news by date

I am researching the impact of news article sentiment related to a financial instrument and its potential effect on the instrument's price. I have tried to get the timestamp of each news item, truncate it to minute resolution (i.e. remove the second and microsecond components), and get the base share price of the instrument at that time and at several intervals after it, in our case t+2. However, the program created the twoM column in the file, but does not return any calculated price changes.
Previously, I used Reuters Eikon and its functions to conduct the research, described in the article below.
https://developers.refinitiv.com/article/introduction-news-sentiment-analysis-eikon-data-apis-python-example
However, instead of using data available from Eikon, I would like to use my own CSV news file with my own price data from another CSV file. I am trying to match the news timestamps to the price data:
import datetime
import numpy as np
import pandas as pd

excel_file = 'C:\\Users\\Artur\\PycharmProjects\\JRA\\sentimenteikonexcel.xlsx'
df = pd.read_excel(excel_file)
sentiment = df.Sentiment
print(sentiment)

start = df['GMT'].min().replace(hour=0, minute=0, second=0, microsecond=0).strftime('%Y/%m/%d')
end = df['GMT'].max().replace(hour=0, minute=0, second=0, microsecond=0).strftime('%Y/%m/%d')

spot_data = 'C:\\Users\\Artur\\Desktop\\stocksss.csv'
spot_price_10 = pd.read_csv(spot_data)
print(spot_price_10)

df['twoM'] = np.nan

for idx, newsDate in enumerate(df['GMT'].values):
    sTime = df['GMT'][idx]
    sTime = sTime.replace(second=0, microsecond=0)
    try:
        t0 = spot_price_10.iloc[spot_price_10.index.get_loc(sTime), 2]
        df['twoM'][idx] = ((spot_price_10.iloc[spot_price_10.index.get_loc((sTime + datetime.timedelta(minutes=10))), 3] / t0 - 1) * 100)
    except:
        pass

print(df)
However, the program is not able to return the twoM price change values.
I assume that you got a SettingWithCopyWarning because you are trying to make changes on views. As soon as you chain two [] lookups (one for the column, one for the row), you can only read; you must use loc or iloc to write a value:
...
try:
    t0 = spot_price_10.iloc[spot_price_10.index.get_loc(sTime), 2]
    df.loc[idx, 'twoM'] = ((spot_price_10.iloc[spot_price_10.index.get_loc((sTime + datetime.timedelta(minutes=10))), 3] / t0 - 1) * 100)
except:
    pass
...
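One more thing worth checking: index.get_loc(sTime) only works if spot_price_10 actually has a DatetimeIndex; read straight from CSV it will have a plain RangeIndex, so every lookup falls into the except branch and twoM stays NaN. A minimal sketch, assuming the timestamps live in the first column of the price file:
import pandas as pd

# Assumed: the first column of the price CSV holds the timestamps.
spot_price_10 = pd.read_csv(spot_data, index_col=0, parse_dates=True)
# spot_price_10.index is now a DatetimeIndex, so
# spot_price_10.index.get_loc(sTime) can resolve a timestamp.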

From 10-K -- extract SIC, CIK, create metadata table

I am working with 10-Ks from Edgar. To assist in file management and data analysis, I would like to create a table containing the path to each file, the CIK number for the company filed (this is a unique ID issued by SEC), and the SIC industry code which it belongs to. Below is an image visually representing what I want to do.
The two things I want to extract are listed at the top of each document. The CIK # will always be a number which is listed after the phrase "CENTRAL INDEX KEY:". The SIC # will always be a number enclosed in brackets after "STANDARD INDUSTRIAL CLASSIFICATION" and then a description of that particular industry.
This is consistent across all filings.
To do's:
Loop through files: extract the file path, CIK, and SIC numbers, making sure I get exactly one value per document and each result stays in order, so my records align across fields.
Merge these fields together. I am guessing the best way to do this is to extract each field into its own list and then merge them, maybe into a pandas DataFrame?
Ultimately I will be using this table to help me subset the data between SIC industries.
Thank you for taking a look. Please let me know if I can provide additional documentation.
Here is some code I just wrote that does something similar. You can output the results to a CSV file. As a first step, you need to walk through the folder, get a list of all the 10-Ks, and iterate over it.
year_end = ""
sic = ""
with open(txtfile, 'r', encoding='utf-8', errors='replace') as rawfile:
for cnt, line in enumerate(rawfile):
#print(line)
if "CONFORMED PERIOD OF REPORT" in line:
year_end = line[-9:-1]
#print(year_end)
if "STANDARD INDUSTRIAL CLASSIFICATION" in line:
match = re.search(r"\d{4}", line)
if match:
sic = match.group(0)
#print(sic)
#print(sic)
if (year_end and sic) or cnt > 100:
#print(year_end, sic)
break
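Extending that idea to the CIK and collecting one row per file into a table might look like the sketch below. The folder path and DataFrame layout are assumptions, not part of the answer above; the bracketed-SIC pattern follows the question's description of the header.
import os
import re
import pandas as pd

rows = []
for root, _, files in os.walk("10-K"):  # hypothetical folder of filings
    for name in files:
        path = os.path.join(root, name)
        cik, sic = "", ""
        with open(path, 'r', encoding='utf-8', errors='replace') as rawfile:
            for cnt, line in enumerate(rawfile):
                if "CENTRAL INDEX KEY:" in line:
                    m = re.search(r"\d+", line.split("CENTRAL INDEX KEY:")[1])
                    if m:
                        cik = m.group(0)
                if "STANDARD INDUSTRIAL CLASSIFICATION" in line:
                    m = re.search(r"\[(\d{4})\]", line)  # SIC is bracketed
                    if m:
                        sic = m.group(1)
                if (cik and sic) or cnt > 100:  # header lives near the top
                    break
        rows.append({"path": path, "cik": cik, "sic": sic})

table = pd.DataFrame(rows)
table.to_csv("10k_metadata.csv", index=False)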

Python CSV search for specific values after reading into dictionary

I need a little help reading specific values into a dictionary using Python. I have a CSV file with user numbers: each user (1, 2, 3, ...) belongs to a specific department (1, 2, 3, ...), and each department is in a specific building (1, 2, 3, ...). I need to list all the users in department 1 in building 1, then department 2 in building 1, and so on.

I have read everything into a massive dictionary using csv.DictReader, and this would work if I could search through which entries I read into each dictionary of dictionaries. Any ideas on how to sort through this file? The CSV has over 150,000 entries for users. Each row is a new user and lists 3 attributes: user_name, departmentnumber, and department building. There are 100 departments, 100 buildings, and 150,000 users. Any ideas for a short script to sort them all out? Thanks for your help in advance.
A brute-force approach would look like
import csv
csvFile = csv.reader(open('myfile.csv'))
data = list(csvFile)
data.sort(key=lambda x: (x[2], x[1], x[0]))
It could then be extended to:
import csv
import collections

data = collections.defaultdict(lambda: collections.defaultdict(list))
with open('myfile.csv', newline='') as f:
    for name, dept, building in csv.reader(f):
        data[building][dept].append(name)

for building in sorted(data):
    print("Building {0}".format(building))
    for dept in sorted(data[building]):
        print("  Dept {0}".format(dept))
        for name in sorted(data[building][dept]):
            print("    ", name)
