From 10-K -- extract SIC, CIK, create metadata table - python

I am working with 10-Ks from Edgar. To assist in file management and data analysis, I would like to create a table containing the path to each file, the CIK number for the company filed (this is a unique ID issued by SEC), and the SIC industry code which it belongs to. Below is an image visually representing what I want to do.
The two things I want to extract are listed at the top of each document. The CIK # will always be a number which is listed after the phrase "CENTRAL INDEX KEY:". The SIC # will always be a number enclosed in brackets after "STANDARD INDUSTRIAL CLASSIFICATION" and then a description of that particular industry.
This is consistent across all filings.
To do's:
Loop through files: extract file path, CIK and SIC numbers -- with attention that I just get one return per document, and each result is in order, so my records between fields align.
Merge these fields together -- I am guessing the best way to do this is to extract each field into their own separate lists and then merge, maybe into a Pandas dataframe?
Ultimately I will be using this table to help me subset the data between SIC industries.
Thank you for taking a look. Please let me know if I can provide additional documentation.

Here are some codes I just wrote for doing something similar. You can output the results to a CSV file. As the first step, you need to walk through the folder and get a list of all the 10-Ks and iterate over it.
year_end = ""
sic = ""
with open(txtfile, 'r', encoding='utf-8', errors='replace') as rawfile:
for cnt, line in enumerate(rawfile):
#print(line)
if "CONFORMED PERIOD OF REPORT" in line:
year_end = line[-9:-1]
#print(year_end)
if "STANDARD INDUSTRIAL CLASSIFICATION" in line:
match = re.search(r"\d{4}", line)
if match:
sic = match.group(0)
#print(sic)
#print(sic)
if (year_end and sic) or cnt > 100:
#print(year_end, sic)
break

Related

Can I loop the same analysis across multiple csv dataframes then concatenate results from each into one table?

newbie python learner here!
I have 20 participant csv files (P01.csv to P20.csv) with dataframes in them that contain stroop test data. The important columns for each are the condition column which has a random mix of incongruent and congruent conditions, the reaction time column for each condition and the column for if the response was correct, true or false.
Here is an example of the dataframe for P01 I'm not sure if this counts as a code snippet? :
trialnum,colourtext,colourname,condition,response,rt,correct
1,blue,red,incongruent,red,0.767041,True
2,yellow,yellow,congruent,yellow,0.647259,True
3,green,blue,incongruent,blue,0.990185,True
4,green,green,congruent,green,0.720116,True
5,yellow,yellow,congruent,yellow,0.562909,True
6,yellow,yellow,congruent,yellow,0.538918,True
7,green,yellow,incongruent,yellow,0.693017,True
8,yellow,red,incongruent,red,0.679368,True
9,yellow,blue,incongruent,blue,0.951432,True
10,blue,blue,congruent,blue,0.633367,True
11,blue,green,incongruent,green,1.289047,True
12,green,green,congruent,green,0.668142,True
13,blue,red,incongruent,red,0.647722,True
14,red,blue,incongruent,blue,0.858307,True
15,red,red,congruent,red,1.820112,True
16,blue,green,incongruent,green,1.118404,True
17,red,red,congruent,red,0.798532,True
18,red,red,congruent,red,0.470939,True
19,red,blue,incongruent,blue,1.142712,True
20,red,red,congruent,red,0.656328,True
21,red,yellow,incongruent,yellow,0.978830,True
22,green,red,incongruent,red,1.316182,True
23,yellow,yellow,congruent,green,0.964292,False
24,green,green,congruent,green,0.683949,True
25,yellow,green,incongruent,green,0.583939,True
26,green,blue,incongruent,blue,1.474140,True
27,green,blue,incongruent,blue,0.569109,True
28,green,green,congruent,blue,1.196470,False
29,red,red,congruent,red,4.027546,True
30,blue,blue,congruent,blue,0.833177,True
31,red,red,congruent,red,1.019672,True
32,green,blue,incongruent,blue,0.879507,True
33,red,red,congruent,red,0.579254,True
34,red,blue,incongruent,blue,1.070518,True
35,blue,yellow,incongruent,yellow,0.723852,True
36,yellow,green,incongruent,green,0.978838,True
37,blue,blue,congruent,blue,1.038232,True
38,yellow,green,incongruent,yellow,1.366425,False
39,green,red,incongruent,red,1.066038,True
40,blue,red,incongruent,red,0.693698,True
41,red,blue,incongruent,blue,1.751062,True
42,blue,blue,congruent,blue,0.449651,True
43,green,red,incongruent,red,1.082267,True
44,blue,blue,congruent,blue,0.551023,True
45,red,blue,incongruent,blue,1.012258,True
46,yellow,green,incongruent,yellow,0.801443,False
47,blue,blue,congruent,blue,0.664119,True
48,red,green,incongruent,yellow,0.716189,False
49,green,green,congruent,yellow,0.630552,False
50,green,yellow,incongruent,yellow,0.721917,True
51,red,red,congruent,red,1.153943,True
52,blue,red,incongruent,red,0.571019,True
53,yellow,yellow,congruent,yellow,0.651611,True
54,blue,blue,congruent,blue,1.321344,True
55,green,green,congruent,green,1.159240,True
56,blue,blue,congruent,blue,0.861646,True
57,yellow,red,incongruent,red,0.793069,True
58,yellow,yellow,congruent,yellow,0.673190,True
59,yellow,red,incongruent,red,1.049320,True
60,red,yellow,incongruent,yellow,0.773447,True
61,red,yellow,incongruent,yellow,0.693554,True
62,red,red,congruent,red,0.933901,True
63,blue,blue,congruent,blue,0.726794,True
64,green,green,congruent,green,1.046116,True
65,blue,blue,congruent,blue,0.713565,True
66,blue,blue,congruent,blue,0.494177,True
67,green,green,congruent,green,0.626399,True
68,blue,blue,congruent,blue,0.711896,True
69,blue,blue,congruent,blue,0.460420,True
70,green,green,congruent,yellow,1.711978,False
71,blue,blue,congruent,blue,0.634218,True
72,yellow,blue,incongruent,yellow,0.632482,False
73,yellow,yellow,congruent,yellow,0.653813,True
74,green,green,congruent,green,0.808987,True
75,blue,blue,congruent,blue,0.647117,True
76,green,red,incongruent,red,1.791693,True
77,red,yellow,incongruent,yellow,1.482570,True
78,red,red,congruent,red,0.693132,True
79,red,yellow,incongruent,yellow,0.815830,True
80,green,green,congruent,green,0.614441,True
81,yellow,red,incongruent,red,1.080385,True
82,red,green,incongruent,green,1.198548,True
83,blue,green,incongruent,green,0.845769,True
84,yellow,blue,incongruent,blue,1.007089,True
85,green,blue,incongruent,blue,0.488701,True
86,green,green,congruent,yellow,1.858272,False
87,yellow,yellow,congruent,yellow,0.893149,True
88,yellow,yellow,congruent,yellow,0.569597,True
89,yellow,yellow,congruent,yellow,0.483542,True
90,yellow,red,incongruent,red,1.669842,True
91,blue,green,incongruent,green,1.158416,True
92,blue,red,incongruent,red,1.853055,True
93,green,yellow,incongruent,yellow,1.023785,True
94,yellow,blue,incongruent,blue,0.955395,True
95,yellow,yellow,congruent,yellow,1.303260,True
96,blue,yellow,incongruent,yellow,0.737741,True
97,yellow,green,incongruent,green,0.730972,True
98,green,red,incongruent,red,1.564596,True
99,yellow,yellow,congruent,yellow,0.978911,True
100,blue,yellow,incongruent,yellow,0.508151,True
101,red,green,incongruent,green,1.821969,True
102,red,red,congruent,red,0.818726,True
103,yellow,yellow,congruent,yellow,1.268222,True
104,yellow,yellow,congruent,yellow,0.585495,True
105,green,green,congruent,green,0.673404,True
106,blue,yellow,incongruent,yellow,1.407036,True
107,red,red,congruent,red,0.701050,True
108,red,green,incongruent,red,0.402334,False
109,red,green,incongruent,green,1.537681,True
110,green,yellow,incongruent,yellow,0.675118,True
111,green,green,congruent,green,1.004550,True
112,yellow,blue,incongruent,blue,0.627439,True
113,yellow,yellow,congruent,yellow,1.150248,True
114,blue,yellow,incongruent,yellow,0.774452,True
115,red,red,congruent,red,0.860966,True
116,red,red,congruent,red,0.499595,True
117,green,green,congruent,green,1.059725,True
118,red,red,congruent,red,0.593180,True
119,green,yellow,incongruent,yellow,0.855915,True
120,blue,green,incongruent,green,1.335018,True
But I am only interested in the 'condition', 'rt', and 'correct' columns.
I need to create a table that says the mean reaction time for the congruent conditions, and the incongruent conditions, and the percentage correct for each condition. But I want to create an overall table of these results for each participant. I am aiming to get something like this as an output table:
Participant
Stimulus Type
Mean Reaction Time
Percentage Correct
01
Congruent
0.560966
80
01
Incongruent
0.890556
64
02
Congruent
0.460576
89
02
Incongruent
0.956556
55
Etc. for all 20 participants. This was just an example of my ideal output because later I'd like to plot a graph of the means from each condition across the participants. But if anyone thinks that table does not make sense or is inefficient, I'm open to any advice!
I want to use pandas but don't know where to begin finding the rt means for each condition when there are two different conditions in the same column in each dataframe? And I'm assuming I need to do it in some kind of loop that can run over each participant csv file, and then concatenates the results in a table for all the participants?
Initially, after struggling to figure out the loop I would need and looking on the web, I ran this code, which worked to concatenate all of the dataframes of the participants, I hoped this would help me to do the same analysis on all of them at once but the problem is it doesn't identify the individual participants for each of the rows from each participant csv file (there are 120 rows for each participant like the example I give above) that I had put into one table:
import os
import glob
import pandas as pd
#set working directory
os.chdir('data')
#find all csv files in the folder
#use glob pattern matching -> extension = 'csv'
#save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#print(all_filenames)
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
#export to csv
combined_csv.to_csv( "combined_csv.csv", index=False, encoding='utf-8-sig')
Perhaps I could do something to add a participant column to identify each participant's data set in the concatenated table and then perform the mean and percentage correct analysis on the two conditions for each participant in that big concatenated table?
Or would it be better to do the analysis and then loop it over all of the individual participant csv files of dataframes?
I'm sorry if this is a really obvious process, I'm new to python and trying to learn to analyse my data more efficiently, have been scouring the Internet and Panda tutorials but I'm stuck. Any help is welcome! I've also never used Stackoverflow before so sorry if I haven't formatted things correctly here but thanks for the feedback about including examples of the input data, code I've tried, and desired output data, I really appreciate the help.
Try this:
from pathlib import Path
# Use the Path class to represent a path. It offers more
# functionalities when perform operations on paths
path = Path("./data").resolve()
# Create a dictionary whose keys are the Participant ID
# (the `01` in `P01.csv`, etc), and whose values are
# the data frames initialized from the CSV
data = {
p.stem[1:]: pd.read_csv(p) for p in path.glob("*.csv")
}
# Create a master data frame by combining the individual
# data frames from each CSV file
df = pd.concat(data, keys=data.keys(), names=["participant", None])
# Calculate the statistics
result = (
df.groupby(["participant", "condition"]).agg(**{
"Mean Reaction Time": ("rt", "mean"),
"correct": ("correct", "sum"),
"size": ("trialnum", "size")
}).assign(**{
"Percentage Correct": lambda x: x["correct"] / x["size"]
}).drop(columns=["correct", "size"])
.reset_index()
)

Python: Extract unique sentences from string and place them in new column concatenated with ;

Afternoon Stackoverflowers,
I have been challenged with some extract/, as I am trying to prepare data for some users.
As I was advised, it is very hard to do it in SQL, as there is no clear pattern, I tried some things in python, but without success (as I am still learning python).
Problem statement:
My SQL query output is either excel or text file (depends on how I publish it but can do it both ways). I have a field (fourth column in excel or text file), which contains either one or multiple rejection reasons (see example below), separated by a comma. And at the same time, a comma is used within errors (sometimes).
Field example without any modification
INVOICE_CREATION_FAILED[Invalid Address information: Company Name, Limited: Blueberry Street 10+City++09301+SK|SK000001111|BLD at line 1 , Company Id on the Invoice does not match with the Company Id registered for the Code in System: [AL12345678901]|ABC1D|DL0000001 at line 2 , Incorrect Purchase order: VTB2R|ADLAVA9 at line 1 ]
Desired output:
Invalid Address information; Company Id on the Invoice does not match with the Company Id registered for the Code in System; Incorrect Purchase order
Python code:
import pandas
excel_data_df = pandas.read_excel('rejections.xlsx')
# print whole sheet data
print(excel_data_df['Invoice_Issues'].tolist())
excel_data_df['Invoice_Issues'].split(":", 1)
Output:
INVOICE_CREATION_FAILED[Invalid Address information:
I tried split string, but it doesn't work properly. It deletes everything after the colon, which is acceptable because it is coded that way, however, I would need to trim the string of the unnecessary data and keep only the desired output for each line.
I would be very thankful for any code suggestions on how to trim the output in a way that I will extract only what is needed from that string - if the substring is available.
In the excel, I would normally use list of errors, and nested IFs function with FIND and MATCH. But I am not sure how to do it in Python...
Many thanks,
Greg
This isn't the fastest way to do this, but in Python, speed is rarely the most important thing.
Here, we manually create a dictionary of the errors and what you would want them to map to, then we iterate over the values in Invoice_Issues, and use the dictionary to see if that key is present.
import pandas
# create a standard dataframe with some errors
excel_data_df = pandas.DataFrame({
'Invoice_Issues': [
'INVOICE_CREATION_FAILED[Invalid Address information: Company Name, Limited: Blueberry Street 10+City++09301+SK|SK000001111|BLD at line 1 , Company Id on the Invoice does not match with the Company Id registered for the Code in System: [AL12345678901]|ABC1D|DL0000001 at line 2 , Incorrect Purchase order: VTB2R|ADLAVA9 at line 1 ]']
})
# build a dictionary
# (just like "words" to their "meanings", this maps
# "error keys" to their "error descriptions".
errors_dictionary = {
'E01': 'Invalid Address information',
'E02': 'Incorrect Purchase order',
'E03': 'Invalid VAT ID',
# ....
'E39': 'No tax line available'
}
def extract_errors(invoice_issue):
# using the entry in the Invoice_Issues column,
# then loop over the dictionary to see if this error is in there.
# If so, add it to the list_of_errors.
list_of_errors = []
for error_number, error_description in errors_dictionary.items():
if error_description in invoice_issue:
list_of_errors.append(error_description)
return ';'.join(list_of_errors)
# print whole sheet data
print(excel_data_df['Invoice_Issues'].tolist())
# for every row in the Invoice_Isses column, run the extract_errors function, and store the value in the 'Errors' column.
excel_data_df['Errors'] = excel_data_df['Invoice_Issues'].apply(extract_errors)
# display the dataframe with the extracted errors
print(excel_data_df.to_string())
excel_data_df.to_excel('extracted_errors.xlsx)

Retrieve data from GenBank with Bio.Entrez module

I am trying to solve one of the Rosalind challenges and I can't seem to find a way to retrieve data, within a specific time frame.
http://rosalind.info/problems/gbk/
Do/How Do I modify Entrez.esearch() to specify a time frame?
Question:
Given: A genus name, followed by two dates in YYYY/M/D format.
Return: The number of Nucleotide GenBank entries for the given genus that were published between the dates specified.
Test Data:
Anthoxanthum
2003/7/25
2005/12/27
Answer: 7
Thanks a lot to #Kayvee for the pointer! It works like a charm!
Here is a format for searching the organism by 'posted between start-end':
(Anthoxanthum[Organism]) AND ("2003/7/25"[Publication Date] : "2005/12/27"[Publication Date])
Here is Python code:
# GenBank gene database
geneName = "Anthoxanthum"
pubDateStart = "2003/7/25"
pubDateEnd = "2005/12/27"
searchTerm = f'({geneName}[Organism]) AND("{pubDateStart}"[Publication Date]: "{pubDateEnd}"[Publication Date])'
print(f"\n[GenBank gene database]:")
Entrez.email = "please#pm.me"
handle = Entrez.esearch(db="nucleotide", term=searchTerm)
record = Entrez.read(handle)
print(record["Count"])

Parsing CSVs for only one value

I am trying to parse data from CSV files. The files are in a folder and I want to extract data and write them to the db. However the csvs are not set up in a table format. I know how to import csvs into the db with the for each loop container, adding data flow tasks, and importing with OLE DB Destination.
The problem is just getting one value out of these csvs. The format of the file is as followed:
Title Title 2
Date saved ##/##/#### ##:## AM
Comment
[ Main ]
No. Measure Output Unit of measure
1 Name 8 µm
Count 0 pcs
[ XY Measure ]
X
Y
D
[ Area ]
No. Area Unit Perimeter Unit
All I want is just the output which is "8", to snatch the name of the file to make it name of the result or add it to a column, and the date and time to add to their own columns.
I am not sure which direction to head into and i hope someone has some things for me to look into. Originally, I wasn't sure if I should do the parsing externally (python) before using SQL server. If anyone knows another way I should use to get this done please let me know. Sorry for the unclear post earlier.
The expect outcome:
Filename Date Time Outcome
jnnnnnnn ##/##/#### ##:## 8
I'd try this:
filename = # from the from the path of the file you're parsing
# define appropriate vars
for row in csv_file:
if row.find('Date saved') > 0:
row = row.replace('Date saved ')
date_saved = row[0:row.find(' ')]
row = row.replace(date_saved + ' ')
time = row[0:row.find(' ')]
elif row.find(u"\u03BC"):
split_row = row.split(' ')
outcome = split_row[2]
# add filename,date_saved,time,outcome to data that will go in DB

Python CSV search for specific values after reading into dictionary

I need a little help reading specific values into a dictionary using Python. I have a csv file with User numbers. So user 1,2,3... each user is within a specific department 1,2,3... and each department is in a specific building 1,2,3... So I need to know how can I list all the users in department one in building 1 then department 2 in building 1 so on. I have been trying and have read everything into a massive dictionary using csv.ReadDict, but this would work if I could search through which entries I read into each dictionary of dictionaries. Any ideas for how to sort through this file? The CSV has over 150,000 entries for users. Each row is a new user and it lists 3 attributes, user_name, departmentnumber, department building. There are a 100 departments and 100 buildings and 150,000 users. Any ideas on a short script to sort them all out? Thanks for your help in advance
A brute-force approach would look like
import csv
csvFile = csv.reader(open('myfile.csv'))
data = list(csvFile)
data.sort(key=lambda x: (x[2], x[1], x[0]))
It could then be extended to
import csv
import collections
csvFile = csv.reader(open('myfile.csv'))
data = collections.defaultdict(lambda: collections.defaultdict(list))
for name, dept, building in csvFile:
data[building][dept].append(name)
buildings = data.keys()
buildings.sort()
for building in buildings:
print "Building {0}".format(building)
depts = data[building].keys()
depts.sort()
for dept in depts:
print " Dept {0}".format(dept)
names = data[building][dept]
names.sort()
for name in names:
print " ",name

Categories