Parsing CSVs for only one value - python
I am trying to parse data from CSV files. The files are in a folder and I want to extract data and write them to the db. However the csvs are not set up in a table format. I know how to import csvs into the db with the for each loop container, adding data flow tasks, and importing with OLE DB Destination.
The problem is just getting one value out of these CSVs. The format of the file is as follows:
Title Title 2
Date saved ##/##/#### ##:## AM
Comment
[ Main ]
No. Measure Output Unit of measure
1 Name 8 µm
Count 0 pcs
[ XY Measure ]
X
Y
D
[ Area ]
No. Area Unit Perimeter Unit
All I want is the output, which is "8", plus the name of the file (to use as the name of the result or to add to a column) and the date and time, each in its own column.
I am not sure which direction to head in, and I hope someone has some things for me to look into. Originally, I wasn't sure if I should do the parsing externally (Python) before using SQL Server. If anyone knows another way to get this done, please let me know. Sorry for the unclear post earlier.
The expected outcome:
Filename Date Time Outcome
jnnnnnnn ##/##/#### ##:## 8
I'd try this:
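    import os

    path = 'path/to/file.csv'  # placeholder: the file you're parsing
    filename = os.path.splitext(os.path.basename(path))[0]
    date_saved = time_saved = outcome = None
    with open(path, encoding='utf-8') as csv_file:  # adjust the encoding to match your files
        for row in csv_file:
            if row.startswith('Date saved'):
                # "Date saved ##/##/#### ##:## AM" -> date, then time
                parts = row.replace('Date saved', '', 1).split()
                date_saved, time_saved = parts[0], parts[1]
            elif '\u00b5' in row or '\u03bc' in row:
                # the micro sign (or Greek mu) marks the measurement row: "1 Name 8 µm"
                outcome = row.split()[2]
    # add filename, date_saved, time_saved, outcome to the data that will go in the DB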
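The same idea can be wrapped in a function that works on any iterable of lines, which makes it easy to try without touching real files. This is only a sketch: it assumes the fields are separated by whitespace exactly as shown in the question (a real export may use commas or tabs), `parse_measurement` is a hypothetical name, and the date in the sample is made up.

```python
def parse_measurement(lines):
    """Return (date, time, output) found in the lines of one export file."""
    date_saved = time_saved = output = None
    for line in lines:
        line = line.strip()
        if line.startswith("Date saved"):
            # Everything after the label: "<date> <time> AM/PM"
            parts = line[len("Date saved"):].split()
            if len(parts) >= 2:
                date_saved, time_saved = parts[0], parts[1]
        elif output is None and ("\u00b5" in line or "\u03bc" in line):
            # First row that carries a micro sign or Greek mu: "1 Name 8 µm"
            fields = line.split()
            if len(fields) > 2:
                output = fields[2]
    return date_saved, time_saved, output


# Made-up sample in the layout shown in the question
sample = [
    "Title Title 2",
    "Date saved 01/02/2020 10:15 AM",
    "[ Main ]",
    "No. Measure Output Unit of measure",
    "1 Name 8 \u00b5m",
    "Count 0 pcs",
]
print(parse_measurement(sample))
```

Calling it once per file in the folder and pairing the result with the file name gives one row per file for the database insert.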
Related
Can I loop the same analysis across multiple csv dataframes then concatenate results from each into one table?
newbie python learner here! I have 20 participant csv files (P01.csv to P20.csv) with dataframes in them that contain stroop test data. The important columns for each are the condition column which has a random mix of incongruent and congruent conditions, the reaction time column for each condition and the column for if the response was correct, true or false. Here is an example of the dataframe for P01 I'm not sure if this counts as a code snippet? : trialnum,colourtext,colourname,condition,response,rt,correct 1,blue,red,incongruent,red,0.767041,True 2,yellow,yellow,congruent,yellow,0.647259,True 3,green,blue,incongruent,blue,0.990185,True 4,green,green,congruent,green,0.720116,True 5,yellow,yellow,congruent,yellow,0.562909,True 6,yellow,yellow,congruent,yellow,0.538918,True 7,green,yellow,incongruent,yellow,0.693017,True 8,yellow,red,incongruent,red,0.679368,True 9,yellow,blue,incongruent,blue,0.951432,True 10,blue,blue,congruent,blue,0.633367,True 11,blue,green,incongruent,green,1.289047,True 12,green,green,congruent,green,0.668142,True 13,blue,red,incongruent,red,0.647722,True 14,red,blue,incongruent,blue,0.858307,True 15,red,red,congruent,red,1.820112,True 16,blue,green,incongruent,green,1.118404,True 17,red,red,congruent,red,0.798532,True 18,red,red,congruent,red,0.470939,True 19,red,blue,incongruent,blue,1.142712,True 20,red,red,congruent,red,0.656328,True 21,red,yellow,incongruent,yellow,0.978830,True 22,green,red,incongruent,red,1.316182,True 23,yellow,yellow,congruent,green,0.964292,False 24,green,green,congruent,green,0.683949,True 25,yellow,green,incongruent,green,0.583939,True 26,green,blue,incongruent,blue,1.474140,True 27,green,blue,incongruent,blue,0.569109,True 28,green,green,congruent,blue,1.196470,False 29,red,red,congruent,red,4.027546,True 30,blue,blue,congruent,blue,0.833177,True 31,red,red,congruent,red,1.019672,True 32,green,blue,incongruent,blue,0.879507,True 33,red,red,congruent,red,0.579254,True 34,red,blue,incongruent,blue,1.070518,True 
35,blue,yellow,incongruent,yellow,0.723852,True 36,yellow,green,incongruent,green,0.978838,True 37,blue,blue,congruent,blue,1.038232,True 38,yellow,green,incongruent,yellow,1.366425,False 39,green,red,incongruent,red,1.066038,True 40,blue,red,incongruent,red,0.693698,True 41,red,blue,incongruent,blue,1.751062,True 42,blue,blue,congruent,blue,0.449651,True 43,green,red,incongruent,red,1.082267,True 44,blue,blue,congruent,blue,0.551023,True 45,red,blue,incongruent,blue,1.012258,True 46,yellow,green,incongruent,yellow,0.801443,False 47,blue,blue,congruent,blue,0.664119,True 48,red,green,incongruent,yellow,0.716189,False 49,green,green,congruent,yellow,0.630552,False 50,green,yellow,incongruent,yellow,0.721917,True 51,red,red,congruent,red,1.153943,True 52,blue,red,incongruent,red,0.571019,True 53,yellow,yellow,congruent,yellow,0.651611,True 54,blue,blue,congruent,blue,1.321344,True 55,green,green,congruent,green,1.159240,True 56,blue,blue,congruent,blue,0.861646,True 57,yellow,red,incongruent,red,0.793069,True 58,yellow,yellow,congruent,yellow,0.673190,True 59,yellow,red,incongruent,red,1.049320,True 60,red,yellow,incongruent,yellow,0.773447,True 61,red,yellow,incongruent,yellow,0.693554,True 62,red,red,congruent,red,0.933901,True 63,blue,blue,congruent,blue,0.726794,True 64,green,green,congruent,green,1.046116,True 65,blue,blue,congruent,blue,0.713565,True 66,blue,blue,congruent,blue,0.494177,True 67,green,green,congruent,green,0.626399,True 68,blue,blue,congruent,blue,0.711896,True 69,blue,blue,congruent,blue,0.460420,True 70,green,green,congruent,yellow,1.711978,False 71,blue,blue,congruent,blue,0.634218,True 72,yellow,blue,incongruent,yellow,0.632482,False 73,yellow,yellow,congruent,yellow,0.653813,True 74,green,green,congruent,green,0.808987,True 75,blue,blue,congruent,blue,0.647117,True 76,green,red,incongruent,red,1.791693,True 77,red,yellow,incongruent,yellow,1.482570,True 78,red,red,congruent,red,0.693132,True 79,red,yellow,incongruent,yellow,0.815830,True 
80,green,green,congruent,green,0.614441,True 81,yellow,red,incongruent,red,1.080385,True 82,red,green,incongruent,green,1.198548,True 83,blue,green,incongruent,green,0.845769,True 84,yellow,blue,incongruent,blue,1.007089,True 85,green,blue,incongruent,blue,0.488701,True 86,green,green,congruent,yellow,1.858272,False 87,yellow,yellow,congruent,yellow,0.893149,True 88,yellow,yellow,congruent,yellow,0.569597,True 89,yellow,yellow,congruent,yellow,0.483542,True 90,yellow,red,incongruent,red,1.669842,True 91,blue,green,incongruent,green,1.158416,True 92,blue,red,incongruent,red,1.853055,True 93,green,yellow,incongruent,yellow,1.023785,True 94,yellow,blue,incongruent,blue,0.955395,True 95,yellow,yellow,congruent,yellow,1.303260,True 96,blue,yellow,incongruent,yellow,0.737741,True 97,yellow,green,incongruent,green,0.730972,True 98,green,red,incongruent,red,1.564596,True 99,yellow,yellow,congruent,yellow,0.978911,True 100,blue,yellow,incongruent,yellow,0.508151,True 101,red,green,incongruent,green,1.821969,True 102,red,red,congruent,red,0.818726,True 103,yellow,yellow,congruent,yellow,1.268222,True 104,yellow,yellow,congruent,yellow,0.585495,True 105,green,green,congruent,green,0.673404,True 106,blue,yellow,incongruent,yellow,1.407036,True 107,red,red,congruent,red,0.701050,True 108,red,green,incongruent,red,0.402334,False 109,red,green,incongruent,green,1.537681,True 110,green,yellow,incongruent,yellow,0.675118,True 111,green,green,congruent,green,1.004550,True 112,yellow,blue,incongruent,blue,0.627439,True 113,yellow,yellow,congruent,yellow,1.150248,True 114,blue,yellow,incongruent,yellow,0.774452,True 115,red,red,congruent,red,0.860966,True 116,red,red,congruent,red,0.499595,True 117,green,green,congruent,green,1.059725,True 118,red,red,congruent,red,0.593180,True 119,green,yellow,incongruent,yellow,0.855915,True 120,blue,green,incongruent,green,1.335018,True But I am only interested in the 'condition', 'rt', and 'correct' columns. 
I need to create a table that says the mean reaction time for the congruent conditions, and the incongruent conditions, and the percentage correct for each condition. But I want to create an overall table of these results for each participant. I am aiming to get something like this as an output table:

    Participant  Stimulus Type  Mean Reaction Time  Percentage Correct
    01           Congruent      0.560966            80
    01           Incongruent    0.890556            64
    02           Congruent      0.460576            89
    02           Incongruent    0.956556            55

Etc. for all 20 participants. This was just an example of my ideal output because later I'd like to plot a graph of the means from each condition across the participants. But if anyone thinks that table does not make sense or is inefficient, I'm open to any advice!

I want to use pandas but don't know where to begin finding the rt means for each condition when there are two different conditions in the same column in each dataframe? And I'm assuming I need to do it in some kind of loop that can run over each participant csv file, and then concatenates the results in a table for all the participants?
Initially, after struggling to figure out the loop I would need and looking on the web, I ran this code, which worked to concatenate all of the dataframes of the participants. I hoped this would help me to do the same analysis on all of them at once, but the problem is it doesn't identify the individual participants for each of the rows from each participant csv file (there are 120 rows for each participant like the example I give above) that I had put into one table:

    import os
    import glob
    import pandas as pd

    #set working directory
    os.chdir('data')

    #find all csv files in the folder
    #use glob pattern matching -> extension = 'csv'
    #save result in list -> all_filenames
    extension = 'csv'
    all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
    #print(all_filenames)

    #combine all files in the list
    combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])

    #export to csv
    combined_csv.to_csv( "combined_csv.csv", index=False, encoding='utf-8-sig')

Perhaps I could do something to add a participant column to identify each participant's data set in the concatenated table and then perform the mean and percentage correct analysis on the two conditions for each participant in that big concatenated table? Or would it be better to do the analysis and then loop it over all of the individual participant csv files of dataframes?

I'm sorry if this is a really obvious process, I'm new to python and trying to learn to analyse my data more efficiently, have been scouring the Internet and Panda tutorials but I'm stuck. Any help is welcome! I've also never used Stackoverflow before so sorry if I haven't formatted things correctly here but thanks for the feedback about including examples of the input data, code I've tried, and desired output data, I really appreciate the help.
Try this:

    from pathlib import Path

    import pandas as pd

    # Use the Path class to represent a path. It offers more
    # functionality when performing operations on paths
    path = Path("./data").resolve()

    # Create a dictionary whose keys are the participant IDs
    # (the `01` in `P01.csv`, etc.) and whose values are
    # the data frames initialized from the CSVs
    data = {p.stem[1:]: pd.read_csv(p) for p in path.glob("*.csv")}

    # Create a master data frame by combining the individual
    # data frames from each CSV file
    df = pd.concat(data, names=["participant", None])

    # Calculate the statistics
    result = (
        df.groupby(["participant", "condition"])
        .agg(**{
            "Mean Reaction Time": ("rt", "mean"),
            "correct": ("correct", "sum"),
            "size": ("trialnum", "size"),
        })
        .assign(**{
            # multiply by 100 so this is a percentage rather than a fraction
            "Percentage Correct": lambda x: 100 * x["correct"] / x["size"]
        })
        .drop(columns=["correct", "size"])
        .reset_index()
    )
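The alternative raised in the question, adding a participant column before concatenating, can be sketched on a tiny in-memory data set. Everything here is illustrative: the two miniature "files" are made up, and `RAW`, `summarise`, `mean_rt` and `pct_correct` are hypothetical names rather than anything from the original posts.

```python
import io

import pandas as pd

# Made-up miniature stand-ins for P01.csv and P02.csv
RAW = {
    "01": "trialnum,condition,rt,correct\n"
          "1,congruent,0.5,True\n2,congruent,0.7,False\n3,incongruent,1.0,True\n",
    "02": "trialnum,condition,rt,correct\n"
          "1,congruent,0.4,True\n2,incongruent,0.8,False\n3,incongruent,1.2,True\n",
}


def summarise(raw_by_participant):
    """Tag each frame with its participant ID, stack them, then aggregate."""
    frames = []
    for pid, text in raw_by_participant.items():
        frame = pd.read_csv(io.StringIO(text))
        frame["participant"] = pid  # the identifying column the question asks about
        frames.append(frame)
    combined = pd.concat(frames, ignore_index=True)
    return (
        combined.groupby(["participant", "condition"])
        .agg(mean_rt=("rt", "mean"), pct_correct=("correct", "mean"))
        # the mean of a True/False column is the fraction correct
        .assign(pct_correct=lambda t: t["pct_correct"] * 100)
        .reset_index()
    )


summary = summarise(RAW)
print(summary)
```

With real files, the loop would iterate over the CSV paths and take the participant ID from the file name instead of the dictionary key.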
Python Panda's row selection
I've tried doing some searching, but I'm having trouble finding what I specifically need. I currently have this:

    location = 'Location'
    data = pd.read_csv('testbook.csv')
    df = pd.DataFrame(data)
    search = 'OR' # This will be replaced with an input
    row = (df[df.eq(search).any(1)])
    print(row)
    Location = row.at[0, location]
    print(Location)

This outputs the following.

row print out:

      Location City  Price  Etc
    0       FL   OR     50  123

Location print out:

    FL

This is the CSV information that it's pulling the data from. My main issue is with this specific line of code:

    Location = row.at[0, location]

What I'm trying to work out is whether the part in the brackets, [0, location], can be automated. In the future I may need to find, for example, the data for 'OR1' instead of 'OR'. The issue is that the [0] is tied to the row number, as shown here (this is the entire df):

      Location City  Price   Etc
    0       FL   OR     50   123
    1      FL1  OR1    501  1231
    2      FL2  OR2    502  1232

I would have to manually change the code every single time, which of course is unfeasible for what I'm trying to accomplish. My main question is: how do I pull the specific row number shown all the way on the left and store it in a variable that I can use anywhere?
I'm having a bit of trouble figuring out what you are looking for, but this is my best guess:

    import pandas as pd

    data = {'Location': ['FL', 'FL1', 'FL2'],
            'City': ['OR', 'OR1', 'OR2'],
            'Price': [50, 501, 502],
            'Etc': [123, 1231, 1232]}
    df = pd.DataFrame(data)

    # Given search term -> find location
    search = 'OR'

    # Rows whose City equals the search term
    matches = df[df['City'] == search]

    # Row label of the first match (0 here), usable wherever a row number is needed
    row_number = matches.index[0]

    # Outputs 'FL'
    df.at[row_number, 'Location']
Reading txt file into Anaconda?
Trying to read a .txt file into my Jupyter notebook. This is my code:

    acm = pd.read_csv('outputacm.txt', header=None, error_bad_lines=False)
    print(acm)

Here is a sample of my txt file:

    2244018
    #*OQL[C++]: Extending C++ with an Object Query Capability.
    ##José A. Blakeley
    #year1995
    #confModern Database Systems
    #citation14
    #index0
    #arnetid2
    #*Transaction Management in Multidatabase Systems.
    ##Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz
    #year1995
    #confModern Database Systems
    #citation22
    #index1
    #arnetid3
    #*Overview of the ADDS System.
    ##Yuri Breitbart,Tom C. Reyes
    #year1995
    #confModern Database Systems
    #citation-1
    #index2
    #arnetid4

And the different symbols are supposed to correspond to:

    #* --- paperTitle
    ## --- Authors
    #year ---- Year
    #conf --- publication venue
    #citation --- citation number (both -1 and 0 means none)
    #index ---- index id of this paper
    #arnetid ---- pid in arnetminer database
    #% ---- the id of references of this paper (there are multiple lines, with each indicating a reference)
    #! --- Abstract

Not sure how to set this up so the data gets read correctly. Ideally, I would want a data frame where each category is a different column, and then all the entries in the document are rows. Thanks!
My regex is not as up to speed as it should be, so the below uses plain prefix checks. It might work so long as the data remains in the same form, with each field on its own line starting with its tag:

    import pandas as pd

    path = r"filepath.txt"
    year = []
    conf = []
    # continue for all columns
    with open(path, 'r', encoding='utf-8') as f:
        for ele in f:
            if ele.startswith('#year'):
                year.append(ele[len('#year'):].strip())
            elif ele.startswith('#conf'):
                conf.append(ele[len('#conf'):].strip())
            # continue for all columns with the needed tag

    # continue for each list; all lists must end up the same length
    df = pd.DataFrame(data={'year': year, 'conf': conf})
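Separate per-column lists drift out of alignment as soon as one record lacks a field, so an alternative is to build one dictionary per paper. The sketch below assumes every record starts with a `#*` title line and that each field sits on its own line, which the sample suggests but does not guarantee. `PREFIXES`, `parse_records` and the output key names are hypothetical.

```python
# Tag -> column name, following the legend given in the question
PREFIXES = {
    "#*": "paperTitle",
    "##": "authors",
    "#year": "year",
    "#conf": "venue",
    "#citation": "citations",
    "#index": "index",
    "#arnetid": "arnetid",
    "#!": "abstract",
}


def parse_records(lines):
    """Group tagged lines into one dict per paper; a '#*' line starts a new paper."""
    records = []
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("#*"):
            current = {"references": []}
            records.append(current)
        if current is None:
            continue  # anything before the first title, e.g. the leading count line
        if line.startswith("#%"):
            current["references"].append(line[2:])  # may occur several times per paper
            continue
        for prefix, name in PREFIXES.items():
            if line.startswith(prefix):
                current[name] = line[len(prefix):]
                break
    return records


sample = [
    "2244018",
    "#*OQL[C++]: Extending C++ with an Object Query Capability.",
    "##José A. Blakeley",
    "#year1995",
    "#confModern Database Systems",
    "#citation14",
    "#index0",
    "#arnetid2",
    "#*Transaction Management in Multidatabase Systems.",
    "##Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz",
    "#year1995",
    "#confModern Database Systems",
    "#citation22",
    "#index1",
    "#arnetid3",
]
records = parse_records(sample)
print(records[0]["paperTitle"], records[0]["year"])
```

Passing the resulting list to `pd.DataFrame(records)` gives one row per paper and one column per field, with missing fields left empty.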
How to calculate a formula in Excel and pull the calculated value with OpenPyXL
I am trying to copy/paste quantities from one Excel form into another Excel form (that I use as a template) that calculates COGS, and then take the sum of the COGS and assign it to a variable. After some research on OpenPyXL, it looks like you can't calculate the formula in real time, so I am attempting to save the file with the quantities copied, then reopen it, grab my COGS sum, and clear out the quantities to leave it as a blank template. However, when I do this, I can see that it is pasting the quantity values and clearing them out, but the total COGS value remains 'None'. This is my first attempt at a Python application after reading through most of Automate the Boring, so I am quite new to this. Any help would be appreciated!

    # TODO: Calculate total sell using order form and total cost using Costing.xlsx
    # Create variables to hold qty of each trim from the total_sheet
    costing_wb = openpyxl.load_workbook('Z:\\Costing.xlsx')
    costing_sheet = costing_wb['COSTING']
    for i in range(3, 46, 1):
        costing_sheet.cell(row=i+2, column=4).value = total_sheet.cell(row=i, column=3).value

    # Save Costing Sheet
    costing_wb.save('Z:\\Costing.xlsx')

    # TODO: total_COGS not grabbing value, returning as 'None'
    # Reopen costing sheet and set COGS value
    costing_wb = openpyxl.load_workbook('Z:\\Costing.xlsx', data_only=True)
    costing_sheet = costing_wb['COSTING']
    total_COGS = costing_sheet['I6'].value

    # Empty Column D contents and save as blank
    costing_wb = openpyxl.load_workbook('Z:\\Costing.xlsx')
    costing_sheet = costing_wb['COSTING']
    for i in range(5, 48, 1):
        costing_sheet.cell(row=i, column=4).value = None
    costing_wb.save('Z:\\Costing.xlsx')
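A likely explanation for the 'None': openpyxl never evaluates formulas itself. With `data_only=True` it returns the value that a spreadsheet application cached the last time it calculated and saved the file, and a workbook saved by openpyxl carries no such cached values. One way around this is to compute the total in Python instead of reading it back. The sketch below is only an illustration: it assumes the total is the sum of quantity times unit cost, which the post does not confirm, and `total_cogs` and its arguments are hypothetical names.

```python
def total_cogs(quantities, unit_costs):
    """Sum quantity * unit cost, skipping rows where either cell is empty (None)."""
    total = 0.0
    for qty, cost in zip(quantities, unit_costs):
        if qty is None or cost is None:
            continue
        total += qty * cost
    return total


# Made-up values standing in for the quantity and unit-cost columns
example_quantities = [2, None, 3]
example_unit_costs = [1.5, 4.0, 2.0]
print(total_cogs(example_quantities, example_unit_costs))
```

The two lists would be read from the relevant columns with `cell(...).value`, which also removes the need to save, reopen and then clear the template.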
From 10-K -- extract SIC, CIK, create metadata table
I am working with 10-Ks from Edgar. To assist in file management and data analysis, I would like to create a table containing the path to each file, the CIK number for the company filed (this is a unique ID issued by SEC), and the SIC industry code which it belongs to. Below is an image visually representing what I want to do.

The two things I want to extract are listed at the top of each document. The CIK # will always be a number which is listed after the phrase "CENTRAL INDEX KEY:". The SIC # will always be a number enclosed in brackets after "STANDARD INDUSTRIAL CLASSIFICATION" and then a description of that particular industry. This is consistent across all filings.

To do's:

1. Loop through files: extract file path, CIK and SIC numbers -- with attention that I just get one return per document, and each result is in order, so my records between fields align.
2. Merge these fields together -- I am guessing the best way to do this is to extract each field into their own separate lists and then merge, maybe into a Pandas dataframe?

Ultimately I will be using this table to help me subset the data between SIC industries. Thank you for taking a look. Please let me know if I can provide additional documentation.
Here is some code I just wrote for doing something similar. You can output the results to a CSV file. As the first step, you need to walk through the folder to get a list of all the 10-Ks, then iterate over it.

    import re

    txtfile = "path/to/10-K.txt"  # placeholder: one file from your list of 10-Ks
    year_end = ""
    sic = ""
    with open(txtfile, 'r', encoding='utf-8', errors='replace') as rawfile:
        for cnt, line in enumerate(rawfile):
            if "CONFORMED PERIOD OF REPORT" in line:
                # the 8 characters before the trailing newline hold the date
                year_end = line[-9:-1]
            if "STANDARD INDUSTRIAL CLASSIFICATION" in line:
                match = re.search(r"\d{4}", line)
                if match:
                    sic = match.group(0)
            # stop once both values are found, or give up after about 100 lines
            if (year_end and sic) or cnt > 100:
                break
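The snippet above pulls the period of report and the SIC code but not the CIK that the question asks for. A variant that returns both identifiers from the header text might look like the following. It is a sketch under assumptions: the patterns follow the wording quoted in the question ("CENTRAL INDEX KEY:" followed by a number, and the SIC number in square brackets on the "STANDARD INDUSTRIAL CLASSIFICATION" line), the sample header and its values are invented, and `extract_ids` and `build_rows` are hypothetical names.

```python
import re

CIK_RE = re.compile(r"CENTRAL INDEX KEY:\s*(\d+)")
# The SIC number sits in square brackets after the industry description, on the same line
SIC_RE = re.compile(r"STANDARD INDUSTRIAL CLASSIFICATION:[^\n]*\[(\d+)\]")


def extract_ids(header_text):
    """Return (cik, sic) as strings; either is None when it is not found."""
    cik = CIK_RE.search(header_text)
    sic = SIC_RE.search(header_text)
    return (cik.group(1) if cik else None, sic.group(1) if sic else None)


def build_rows(texts_by_path):
    """One dict per document, so path, CIK and SIC always stay aligned."""
    rows = []
    for path, text in texts_by_path.items():
        cik, sic = extract_ids(text)
        rows.append({"path": path, "cik": cik, "sic": sic})
    return rows


# Invented header fragment in the layout the question describes
SAMPLE_HEADER = (
    "COMPANY CONFORMED NAME:\t\t\tEXAMPLE CORP\n"
    "CENTRAL INDEX KEY:\t\t\t0001234567\n"
    "STANDARD INDUSTRIAL CLASSIFICATION:\tSERVICES-PREPACKAGED SOFTWARE [7372]\n"
)
rows = build_rows({"filings/example.txt": SAMPLE_HEADER})
print(rows)
```

Because `search` returns only the first match, each document yields exactly one row, and the list of dicts can be passed straight to `pd.DataFrame(rows)` to get the metadata table.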