I am writing a script that produces histograms of specific columns in a tab-delimited text file. Currently, the program will create a single graph from a hard coded column number that I am using as a placeholder.
The input table looks something like this:
SAMPID TRAIT COHORT AGE BMI WEIGHT WAIST HEIGHT LDL HDL
123 LDL STUDY1 52 32.2 97.1 102 149 212.5 21.4
456 LDL STUDY1 33 33.7 77.0 101 161 233.2 61.2
789 LDL STUDY2 51 25.1 67.1 107 162 231.1 21.3
abc LDL STUDY2 76 33.1 80.4 99 134 220.5 21.2
...
And I have the following code:
import csv
import numpy
from matplotlib import pyplot
r = csv.reader(open("path", 'r'), delimiter='\t')
input_table = []
for row in r:
    input_table.append(row)
column = []
missing = 0
nonmissing = 0
for E in input_table[1:3635]:  # the number of rows in the input table
    if E[8] == "":  # [8] is hard coded now, want to change this to column header name "LDL"
        missing += 1
    else:
        nonmissing += 1
        column.append(float(E[8]))
pyplot.hist(column, bins=20, label="the label")  # how to handle multiple histogram outputs if multiple column headers are specified?
print "n = ", nonmissing
print "number of missing values: ", missing
pyplot.show()
Can anyone offer suggestions that would allow me to expand/improve my program to do any of the following?
Graph data from columns specified by header name, not by column number
Iterate over a list containing multiple header names to create/display several histograms at once
Create a graph that only includes a subset of the data, as specified by a specific value in a column (e.g., for a specific sample ID or a specific COHORT value)
One component not shown here is that I will eventually have a separate input file containing a list of headers (e.g. "HDL", "LDL", "HEIGHT") that need to be graphed separately, but then displayed together in a grid-like manner.
I can provide additional information if needed.
Well, I have a few comments and suggestions, hope it helps.
In my opinion, the first thing you should do to get all those things you want is to structure your data.
Try to create, for each row from the file, a dictionary like
{'SAMPID': <value_1>, 'TRAIT': <value_2>, ...}
And then you will have a list of such dict objects, and you will be able to iterate it and filter by any field you wish.
That is the first and most important point.
After you do that, modularize your code; do not just create a single script that gets all the job done. Identify the pieces of code that would be redundant (such as a filtering loop), put them into functions and call them, passing all the necessary args.
One additional detail: you don't need to hard-code the size of your list as in
for E in input_table[1:3635]:
Just write
for E in input_table[1:]:
and it will work for any list length (note that [1:-1] would also drop the last row). Of course, if you stop treating your data as raw text, that won't be necessary. Just iterate over your list of dicts normally.
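For illustration, a minimal sketch of that dict-based approach using csv.DictReader (the file path, column names, and cohort value are placeholders taken from the question):
import csv
from matplotlib import pyplot

# Each row becomes a dict keyed by the header names from the first line
with open("path") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

def plot_column(rows, name):
    # Split present vs. missing values for the named column
    values = [float(r[name]) for r in rows if r[name] not in ("", None)]
    missing = len(rows) - len(values)
    pyplot.figure()
    pyplot.hist(values, bins=20, label=name)
    pyplot.title(name)
    print("%s: n = %d, missing = %d" % (name, len(values), missing))

# Several histograms at once, optionally restricted by another column's value
for name in ["LDL", "HDL", "HEIGHT"]:
    plot_column([r for r in rows if r["COHORT"] == "STUDY1"], name)
pyplot.show()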
If you have more doubts, let me know.
Francisco
To aggregate and to find values per second, I am doing the following in Python using pandas; however, the output logged to a file doesn't show the columns in the order they appear here. Somehow the column names get sorted, and hence TotalDMLsSec shows up before UpdateTotal and UpdatesSec.
'DeletesTotal': x['Delete'].sum(),
'DeletesSec': x['Delete'].sum()/VSeconds,
'SelectsTotal': x['Select'].sum(),
'SelectsSec': x['Select'].sum()/VSeconds,
'UpdateTotal': x['Update'].sum(),
'UpdatesSec': x['Update'].sum()/VSeconds,
'InsertsTotal': x['Insert'].sum(),
'InsertsSec': x['Insert'].sum()/VSeconds,
'TotalDMLsSec':(x['Delete'].sum()+x['Update'].sum()+x['Insert'].sum())/VSeconds
})
)
df.to_csv('/home/summary.log', sep='\t', encoding='utf-8-sig')
Apart from the above question, I have a couple of others:
Despite logging in CSV format, all values/columns appear in one column in Excel; is there any way to load the CSV data properly?
Can rows be sorted based on one column (let's say InsertsSec) by default when writing to the CSV file?
Any help here would be really appreciated.
Assume that your DataFrame is something like this:
Deletes Selects Updates Inserts
Name
Xxx 20 10 40 50
Yyy 12 32 24 11
Zzz 70 20 30 20
Then both total and total per sec can be computed as:
total = df.sum().rename('Total')
VSeconds = 5 # I assumed some value
tps = (total / VSeconds).rename('Total per sec')
Then you can add both above rows to the DataFrame:
df = df.append(total).append(tps)
The downside is that all numbers are converted to float, but in Pandas there is no other way, as each column must have values of one type.
Then you can e.g. write it to a CSV file (with totals included).
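On the ordering and sorting questions, one hedged sketch (the column names are assumed from the snippet in the question): selecting the columns explicitly fixes their order, sort_values handles the row ordering, and a comma separator loads cleanly into Excel.
# Assumed column names from the question; selecting them explicitly fixes the order
cols = ['DeletesTotal', 'DeletesSec', 'SelectsTotal', 'SelectsSec',
        'UpdateTotal', 'UpdatesSec', 'InsertsTotal', 'InsertsSec', 'TotalDMLsSec']

out = df[cols].sort_values('InsertsSec', ascending=False)  # sort rows by one column
out.to_csv('/home/summary.log', sep=',', encoding='utf-8-sig')  # comma-separated opens cleanly in Excel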
This is how I ended up doing it:
df.to_excel(vExcelFile,'All')
vSortedDF=df.sort_values(['Deletes%'],ascending=False)
vSortedDF.loc[vSortedDF['Deletes%']> 5, ['DeletesTotal','DeletesSec','Deletes%']].to_excel(vExcelFile,'Top Delete objects')
vExcelFile.save()
For the CSV, instead of the \t separator I used , and it worked just fine.
df.to_csv('/home/summary.log', sep=',', encoding='utf-8-sig')
I have data in a txt file and need to separate it. Apologies, but I am really finding this hard (and maybe hard to explain as well). Below are the top few lines of the txt file (there are 1000 lines). I need all the data between the first * in row 0 and the last *, which is in row 700. I don't want to select by row number, since the numbers can change; I want code that selects the data between the * markers. Secondly, the data is NOT separated into columns, it is one big row. I want a second piece of code that can separate the data into columns, i.e. Latter REPORT, Calculation Date, Index Code are columns (I can't split on spaces because that splits Calculation and Date into separate columns when they should be one column). Please can someone help me, and thank you!
0
0 *
1 #124 Latter REPORT D51D ...
2 # 1 Calculation Date calc_da...
3 # 2 Index Code modes2_in...
4 # 3 Index Name index_n...
120 #120 5 Years ADPS Growth Rate 5_years...
121 #121 1 Year ADPS Growth Rate 1_year_...
122 #122 Payout Ratio payout_...
123 #123 Reserved 26 reserve...
124 #124 Reserved 27 reserve...
125 *
Assuming the dataframe is called dat, for the first part to find the asterisks:
# Boolean Series marking which rows of column 0 are exactly '*'
asterisk_location = dat[0] == '*'
# Keep only the True entries; their index labels are the asterisk row positions
asterisk_location = asterisk_location[asterisk_location]
start, finish = asterisk_location.index
# Slice out the rows strictly between the two asterisks
dat = dat.iloc[start+1:finish]
This also assumes you want to get the region between the first two asterisks. If there's more, you'll have to adjust a bit.
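If the raw file still needs to be loaded into that single-column DataFrame first, a minimal sketch (the file name is a placeholder):
import pandas as pd

# Placeholder file name; each raw line becomes one string in column 0
with open('report.txt') as f:
    dat = pd.DataFrame([line.strip() for line in f])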
I have a pandas dataframe that looks like:
cleanText.head()
source word count
0 twain_ess 988
1 twain_ess works 139
2 twain_ess short 139
3 twain_ess complete 139
4 twain_ess would 98
5 twain_ess push 94
And a dictionary that contains the total word count for each source:
titles
{'orw_ess': 1729, 'orw_novel': 15534, 'twain_ess': 7680, 'twain_novel': 60004}
My goal is to normalize the word counts for each source by the total number of words in that source, i.e. turn them into percentages. This seems like it should be trivial, but Python seems to make it very difficult (if anyone could explain the rules for in-place operations to me, that would be great).
The caveat comes from needing to filter the entries in cleanText to just those from a single source, and then I attempt to inplace divide the counts for this subset by the value in the dictionary.
# Adjust total word counts and normalize
for key, value in titles.items():
    # This corrects the total words for overcounting the '' entries
    overcounted = cleanText[cleanText.iloc[:, 0] == key].iloc[0, 2]
    titles[key] = titles[key] - overcounted
    # This is where I divide by total words, however it does not save in place, or at all for that matter
    cleanText[cleanText.iloc[:, 0] == key].iloc[:, 2] = cleanText[cleanText.iloc[:, 0] == key]['count'] / titles[key]
If anyone could explain how to alter this division statement so that the output is actually saved in the original column that would be great.
Thanks
If I understand correctly:
cleanText['count']/cleanText['source'].map(titles)
Which gives you:
0 0.128646
1 0.018099
2 0.018099
3 0.018099
4 0.012760
5 0.012240
dtype: float64
To re-assign these percentage values into your count column, use:
cleanText['count'] = cleanText['count']/cleanText['source'].map(titles)
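For what it's worth, the reason the original loop never saved anything is that chained indexing like cleanText[mask].iloc[:, 2] = ... assigns into a temporary copy. A hedged per-source variant of the same division using .loc, which writes back to the original frame, would look like:
for key in titles:
    mask = cleanText['source'] == key
    # .loc writes back to cleanText itself rather than to a temporary copy
    cleanText.loc[mask, 'count'] = cleanText.loc[mask, 'count'] / titles[key]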
Let's say I have a (9000x9000) table like the following:
zone 304 305 306 307 308 ...
001 1 2 8 9 12 ...
002 6 8 3 7 1 ...
003 4 8 1 12 9 ...
004 2 7 3 16 34 ...
...
The main data table looks like this:
package # weight origin destination zone
123 2oz 004 305 7 to be inputted here
.
.
.
I need SAS to output the "zone" corresponding to a given ordered pair. I fear the only way would be with some type of loop. For instance, in the example above, the origin value is from the row labels and the destination from the column labels. The intersection is the target value I need assigned to "zone".
A solution using python data wrangling libraries would work also.
Also, the 9000x9000 table is an Excel CSV file.
You could use pandas; it has a built-in function to read from an Excel document: pandas.read_excel()
So for this file:
import pandas as pd
df = pd.read_excel('test.xlsx')
print(df[101][502])
Output:
67
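With the table from the question itself, a hedged sketch of the same kind of label lookup (the file names and string dtypes are assumptions): the zone is just the value at the intersection of the origin row label and the destination column label.
import pandas as pd

# Assumed file names; keep codes as strings so '004' and '305' match the labels
zones = pd.read_csv('zones.csv', index_col=0, dtype=str)
packages = pd.read_csv('packages.csv', dtype=str)

# Look up the intersection of origin (row label) and destination (column label)
packages['zone'] = [zones.loc[origin, dest]
                    for origin, dest in zip(packages['origin'], packages['destination'])]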
My approach:
Load the data set into a temporary array (9000x9000) and then lookup each element as needed. Could be memory intensive, but 9000*9000 seems small enough to me.
Another safe approach, transpose the data to be in a long format:
Key1 Key2 Value
001 304 1
001 305 2
...
Then, in any language, it becomes a join/merge instead of lookup.
You can also use PROC IML, which loads the data as a matrix and then you can access using the indexes.
There are also ways in SAS to do this lookup via a merge, primarily using VVALUEX.
Without knowing how you're going to use it, I can't provide any more information.
EDIT: added 3rd option, which is IML. Basically there are many ways to do this; the best depends on how you're planning to use it overall.
EDIT2:
1. Import first data set into SAS (PROC IMPORT)
2. Transpose using PROC TRANSPOSE
3. Merge, either via a data step or PROC SQL, by ORIGIN DESTINATION, which will be straightforward. At this point it's really a standard lookup with 2 keys.
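If you end up doing the transpose-then-merge in Python rather than SAS, a hedged pandas equivalent (file and column names assumed) would be:
import pandas as pd

# Assumed file names; the wide matrix's first column holds the origin codes
zones = pd.read_csv('zones.csv', dtype=str).rename(columns={'zone': 'origin'})
# Reshape wide -> long: one (origin, destination, zone) row per cell
long_zones = zones.melt(id_vars='origin', var_name='destination', value_name='zone')

# The lookup is now a plain two-key merge
packages = pd.read_csv('packages.csv', dtype=str)
packages = packages.merge(long_zones, on=['origin', 'destination'], how='left')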
Please help! I have tried different things/packages to write a program that takes in 4 inputs and returns the writing score statistics of a group based on that combination of inputs from a CSV file. This is my first project, so I would appreciate any insights/hints/tips!
Here is the csv sample (has 200 rows total):
id gender ses schtyp prog write
70 male low public general 52
121 female middle public vocation 68
86 male high public general 33
141 male high public vocation 63
172 male middle public academic 47
113 male middle public academic 44
50 male middle public general 59
11 male middle public academic 34
84 male middle public general 57
48 male middle public academic 57
75 male middle public vocation 60
60 male middle public academic 57
Here is what I have so far:
import csv
import numpy
csv_file_object=csv.reader(open('scores.csv', 'rU')) #reads file
header=csv_file_object.next() #skips header
data=[] #loads data into array for processing
for row in csv_file_object:
    data.append(row)
data=numpy.array(data)
#asks for inputs
gender=raw_input('Enter gender [male/female]: ')
schtyp=raw_input('Enter school type [public/private]: ')
ses=raw_input('Enter socioeconomic status [low/middle/high]: ')
prog=raw_input('Enter program status [general/vocation/academic]: ')
#makes them lower case and strings
prog=str(prog.lower())
gender=str(gender.lower())
schtyp=str(schtyp.lower())
ses=str(ses.lower())
What I am missing is how to filter and get stats only for a specific group. For example, say I input male, public, middle, and academic -- I'd want to get the average writing score for that subset. I tried the groupby function from pandas, but that only gets you stats for broad groups (such as public vs private). I also tried DataFrame from pandas, but that only gets me filtering on one input, and I'm not sure how to get the writing scores. Any hints would be greatly appreciated!
Agreeing with Ramon, Pandas is definitely the way to go, and has extraordinary filtering/sub-setting capability once you get used to it. But it can be tough to first wrap your head around (or at least it was for me!), so I dug up some examples of the sub-setting you need from some of my old code. The variable itu below is a Pandas DataFrame with data on various countries over time.
# Subsetting by using True/False:
subset = itu['CntryName'] == 'Albania' # returns True/False values
itu[subset] # returns 1x144 DataFrame of only data for Albania
itu[itu['CntryName'] == 'Albania'] # one-line command, equivalent to the above two lines
# Pandas has many built-in functions like .isin() to provide params to filter on
itu[itu.cntrycode.isin(['USA','FRA'])] # returns where itu['cntrycode'] is 'USA' or 'FRA'
itu[itu.year.isin([2000,2001,2002])] # Returns all of itu for only years 2000-2002
# Advanced subsetting can include logical operations:
itu[itu.cntrycode.isin(['USA','FRA']) & itu.year.isin([2000,2001,2002])] # Both of above at same time
# Use .loc with two elements to simultaneously select by row/index & column:
itu.loc['USA','CntryName']
itu.iloc[204,0]
itu.loc[['USA','BHS'], ['CntryName', 'Year']]
itu.iloc[[204, 13], [0, 1]]
# Can do many operations at once, but this reduces "readability" of the code
itu[itu.cntrycode.isin(['USA','FRA']) &
itu.year.isin([2000,2001,2002])].loc[:, ['cntrycode','cntryname','year','mpen','fpen']]
# Finally, if you're comfortable with using map() and list comprehensions,
# you can do some advanced subsetting that includes evaluations & functions
# to determine what elements you want to select from the whole, such as all
# countries whose name begins with "United":
criterion = itu['CntryName'].map(lambda x: x.startswith('United'))
itu[criterion]['CntryName'] # gives us UAE, UK, & US
Look at pandas. I think it will shorten your csv parsing work and give the subset functionality you're asking for...
import pandas as pd
data = pd.read_csv('fileName.txt', delim_whitespace=True)
#get all of the male students
data[data['gender'] == 'male']
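Extending that filter to all four inputs and to the writing-score statistics asked about, a hedged sketch (assuming the gender/schtyp/ses/prog variables collected via raw_input in the question):
# Combine all four conditions, then summarize the write column for that subset
subset = data[(data['gender'] == gender) &
              (data['schtyp'] == schtyp) &
              (data['ses'] == ses) &
              (data['prog'] == prog)]
print(subset['write'].mean())      # average writing score for the group
print(subset['write'].describe())  # count, mean, std, quartiles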