I have been given the ambiguous task of automating data extraction from various Visual FoxPro tables.
There are several pairs of .DBF and .CDX files. With the Python dbf package, I seem to be able to work with them. I have two files, ABC.DBF and ABC.CDX, and I can load the table file like this:
>>> import dbf
>>> table = dbf.Table('ABC.DBF')
>>> print(table[3])
0 - table_key : '\x00\x00\x04'
1 - field_1 : -1
2 - field_2 : 0
3 - field_3 : 34
4 - field_4 : 2
...
>>>
It's my understanding that .cdx files are indexes, and I suspect the index corresponds to the table_key field. According to the author, dbf can read indexes:
I can read IDX files, but not update them. My day job changed and dbf files are not a large part of the new one. – Ethan Furman, May 26 '16 at 21:05
Reading is all I need to do. I see that four classes exist: Idx, Index, IndexFile, and IndexLocation. These seem like good candidates.
The Idx class takes a table and a filename, which is promising.
>>> index = dbf.Idx(table, 'ABC.CDX')
I'm not sure how to make use of this object, though. I see that it has the generators backward and forward, but when I try to use them I get an error:
>>> print(list(index.forward()))
dbf.NotFoundError: 'Record 67305477 is not in table ABC.DBF'
How does one associate the .cdx index file to the .dbf table?
.idx and .cdx are not the same, and dbf cannot currently read .cdx files.
If you need the table to be sorted, you can create an in-memory index:
my_index = table.create_index(key=lambda r: r.table_key)
You can also create a full-fledged function:
def active(rec):
    # do not show deleted records
    if is_deleted(rec):
        return DoNotIndex
    return rec.table_key
my_index = table.create_index(active)
and then loop through the index instead of the table:
for record in my_index:
    ...
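Putting it together, here's a minimal end-to-end sketch; it assumes a reasonably recent version of the dbf package, where tables must be opened before reading and is_deleted/DoNotIndex live at module level:
import dbf

table = dbf.Table('ABC.DBF')
table.open()  # some versions expect a mode argument, e.g. dbf.READ_ONLY

def active(rec):
    # skip records flagged as deleted
    if dbf.is_deleted(rec):
        return dbf.DoNotIndex
    return rec.table_key

my_index = table.create_index(active)  # in-memory index, ordered by table_key

for record in my_index:
    print(record.table_key)

table.close()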
I have several hundred text files. I want to extract a specific column with a set number of rows. The files are exactly the same; the only thing different is the data values. I want to put that data into a new text file, with each new column preceding the previous one.
The file is a .sed, basically the same as a .txt file. This is what it looks like (the file actually goes from Wvl 350 to 2150):
Comment:
Version: 2.2
File Name: C:\Users\HyLab\Desktop\Curtis
Bernard\PSR+3500_1596061\PSR+3500_1596061\2019_Feb_16\Contact_00186.sed
<Metadata>
Collected By:
Sample Name:
Location:
Description:
Environment:
</Metadata>
Instrument: PSR+3500_SN1596061 [3]
Detectors: 512,256,256
Measurement: REFLECTANCE
Date: 02/16/2019,02/16/2019
Time: 13:07:52.66,13:29:17.00
Temperature (C): 31.29,8.68,-5.71,31.53,8.74,-5.64
Battery Voltage: 7.56,7.20
Averages: 10,10
Integration: 2,2,2,10,8,2
Dark Mode: AUTO,AUTO
Foreoptic: PROBE {DN}, PROBE {DN}
Radiometric Calibration: DN
Units: None
Wavelength Range: 350,2500
Latitude: n/a
Longitude: n/a
Altitude: n/a
GPS Time: n/a
Satellites: n/a
Calibrated Reference Correction File: none
Channels: 2151
Columns [5]:
Data:
Chan.# Wvl Norm. DN (Ref.) Norm. DN (Target) Reflect. %
0 350.0 1.173460E+002 1.509889E+001 13.7935
1 351.0 1.202493E+002 1.529762E+001 13.6399
2 352.0 1.232869E+002 1.547818E+001 13.4636
3 353.0 1.264006E+002 1.563467E+001 13.2665
4 354.0 1.294906E+002 1.578425E+001 13.0723
I've taken some coding classes, but that was a long time ago. I figured this is a pretty straightforward problem even for a novice coder (which I am not), but I can't seem to find anything like this, so I was hoping for help on here.
I honestly don't need anything fancy; just something like this would be amazing, so I don't have to copy and paste each file!
12.3 11.3 etc...
12.3 11.3 etc...
12.3 11.3 etc...
etc.. etc.. etc...
In MATLAB R2016b or later, the easiest way to do this would be using readtable:
t = readtable('file.sed', delimitedTextImportOptions( ...
'NumVariables', 5, 'DataLines', 36, ...
'Delimiter', ' ', 'ConsecutiveDelimitersRule', 'join'));
where
file.sed is the name of the file
'NumVariables', 5 means there are 5 columns of data
'DataLines', 36 means the data starts on the 36th line and continues to the end of the file
'Delimiter', ' ' means the character that separates the columns is a space
'ConsecutiveDelimitersRule', 'join' means treat more than one space as if they were just one (rather than as if they separate empty columns of data).
This assumes that the example file you've posted is in the exact format of your real data. If it's different you may have to modify the parameters above, possibly with reference to the help for delimitedTextImportOptions or as an alternative, fixedWidthImportOptions.
Now you have a MATLAB table t with five columns, of which column 2 is the wavelengths and column 5 is the reflectances - I assume that's the one you want? You can access that column with
t(:,5)
So to collect all the reflectance columns into one table you would do something like
fileList = something % get the list of files from somewhere - say as a string array or a cell array of char
resultTable = table;
for ii = 1:numel(fileList)
    sedFile = fileList{ii};
    t = readtable(sedFile, delimitedTextImportOptions( ...
        'NumVariables', 5, 'DataLines', 36, ...
        'Delimiter', ' ', 'ConsecutiveDelimitersRule', 'join'));
    t.Properties.VariableNames{5} = sprintf('Reflectance%d', ii);
    resultTable = [resultTable, t(:,5)];
end
The t.Properties.VariableNames ... line is there because column 5 of t will be called Var5 every time, but in the result table each variable name needs to be unique. Here we're renaming the output table variables Reflectance1, Reflectance2 etc but you could change this to whatever you want - perhaps the name of the actual file from sedFile - as long as it's a valid unique variable name.
Finally you can save the result table to a text file using writetable. See the MATLAB help for how to use that.
In Python 3.x with numpy:
import numpy as np

file_list = something  # filenames in a Python list
result_array = None
for sed_file in file_list:
    reflectance_column = np.genfromtxt(sed_file, skip_header=35, usecols=4)
    result_array = (reflectance_column if result_array is None else
                    np.column_stack((result_array, reflectance_column)))
np.savetxt('outputfile.txt', result_array)
Here
skip_header=35 ignores the first 35 lines
usecols=4 only returns column 5 (Python uses zero-based indexing)
see the help for savetxt for further details
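If it helps, here is a self-contained variant of the same approach; the glob pattern and output file name are placeholders:
import glob
import numpy as np

# adjust the pattern to wherever the .sed files live
file_list = sorted(glob.glob('*.sed'))

# column 4 (zero-based) is "Reflect. %"; the first 35 lines are the header
columns = [np.genfromtxt(f, skip_header=35, usecols=4) for f in file_list]

result_array = np.column_stack(columns)
np.savetxt('outputfile.txt', result_array, fmt='%.4f')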
I have a big list of names that I want to keep in my interpreter, so I would rather not use CSV files.
The only way I can store it in my interpreter as a variable, using copy-paste from my original file, is as a triple-quoted string,
so my input looks like this:
temp='''A,B,C
adam,dorothy,ben
luis,cristy,hoover'''
My goal is to convert this string inside my interpreter to a DataFrame.
I tried df=pd.DataFrame([temp]), and also converting to a Series using only one column in the string, but without success. Any ideas?
My real data has hundreds of lines.
Use:
import pandas as pd
from io import StringIO

temp = u'''A,B,C
adam,dorothy,ben
luis,cristy,hoover'''

df = pd.read_csv(StringIO(temp))
print(df)
A B C
0 adam dorothy ben
1 luis cristy hoover
I have data in Python that looks like the following (there are sometimes many entries of these long strings), and I want to load it into a single database table with three fields:
2 63668772 Human_STR_738862 AAAAAAAAAAAA AAAAAAAAAAAAAA
2 63675572 Human_STR_738864 ACACACACACACACACACACACACACAC
ACACACACACACACACACACACACACACAC
...
I want it to look like this to import into sqlite3
2 63668772 Human_STR_738862 AAAAAAAAAAAA
2 63675572 Human_STR_738864 ACACACACACACACACACACACACACAC
2 63668772 Human_STR_738862 AAAAAAAAAAAAAA
2 63675572 Human_STR_738864 ACACACACACACACACACACACACACACAC
I started working with scikit-allel. This allowed me to read the VCF directly into a DataFrame (df), where I could specify the number of ALTs. I melted the df by the keys I wanted, and now I can load the data directly from the df into sqlite.
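For anyone following the same route, a rough sketch of that workflow; the file name, field list, and table name are illustrative, and the ALT_1/ALT_2 column names come from how scikit-allel expands the ALT field:
import sqlite3
import allel  # scikit-allel

# read the VCF; alt_number controls how many ALT columns get expanded
df = allel.vcf_to_dataframe('strs.vcf',
                            fields=['CHROM', 'POS', 'ID', 'ALT'],
                            alt_number=2)

# melt ALT_1/ALT_2 into one row per allele, dropping empty entries
long_df = df.melt(id_vars=['CHROM', 'POS', 'ID'],
                  value_vars=['ALT_1', 'ALT_2'],
                  value_name='allele').dropna(subset=['allele'])

# load straight from the DataFrame into sqlite
conn = sqlite3.connect('strs.db')
long_df[['POS', 'ID', 'allele']].to_sql('str_alleles', conn,
                                        if_exists='replace', index=False)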
I have a large dataset that I pulled from Data.Medicare.gov (https://data.medicare.gov/Physician-Compare/Physician-Compare-National-Downloadable-File/mj5m-pzi6)
It's a CSV of all physicians (2.4 million rows by 41 columns, 750 MB); let's call this physician_df. However, I cannot load it into memory on my computer (memory error).
I have another df loaded in memory (summary_df), and I want to join columns (NPI, Last Name, First Name) from physician_df onto it.
Is there any way to do this without having to load the data into memory? I first attempted to use their API, but I get capped out (I have about 500k rows in my final df, and this will always be changing). Would storing physician_df in a SQL database make this easier?
Here are snippets of each df (fyi, the summary_df is all fake information).
summary_df
DOS Readmit SurgeonNPI
1-1-2018 1 1184809691
2-2-2018 0 1184809691
2-5-2017 1 1093707960
physician_df
NPI PAC ID Professional Enrollment LastName FirstName
1184809691 2668563156 I20120119000086 GOLDMAN SALUJA
1184809691 4688750714 I20080416000055 NOLTE KIMBERLY
1093707960 7618879354 I20040127000771 KHANDUJA KARAMJIT
Final df:
DOS Readmit SurgeonNPI LastName FirstName
1-1-2018 1 1184809691 GOLDMAN SALUJA
2-2-2018 0 1184809691 GOLDMAN SALUJA
2-5-2017 1 1093707960 KHANDUJA KARAMJIT
If I could load physician_df, I would use the code below:
pandas.merge(summary_df, physician_df, how='left', left_on=['SurgeonNPI'], right_on=['NPI'])
For your desired output, you only need 3 columns from physician_df. It is more likely that 2.4 million rows of 3 columns will fit in memory than 5 columns (or, of course, all 41).
So I would first try extracting what you need from a 3-column dataset, convert it to a dictionary, then use it to map the required columns.
Note, to produce your desired output, it is necessary to drop duplicates (keeping the first) from physician_df, so I have included this logic.
import pandas as pd
from operator import itemgetter as iget

# read only the 3 needed columns, de-duplicate on NPI, build a lookup dict
d = pd.read_csv('physicians.csv', usecols=['NPI', 'LastName', 'FirstName'])\
      .drop_duplicates('NPI')\
      .set_index('NPI')[['LastName', 'FirstName']]\
      .to_dict(orient='index')
# {1093707960: {'FirstName': 'KARAMJIT', 'LastName': 'KHANDUJA'},
# 1184809691: {'FirstName': 'SALUJA', 'LastName': 'GOLDMAN'}}
summary_df['LastName'] = summary_df['SurgeonNPI'].map(d).map(iget('LastName'))
summary_df['FirstName'] = summary_df['SurgeonNPI'].map(d).map(iget('FirstName'))
# DOS Readmit SurgeonNPI LastName FirstName
# 0 1-1-2018 1 1184809691 GOLDMAN SALUJA
# 1 2-2-2018 0 1184809691 GOLDMAN SALUJA
# 2 2-5-2017 1 1093707960 KHANDUJA KARAMJIT
If your final dataframe is too large to store in memory, then I would consider these options:
Chunking: split your dataframe into small chunks and output as you go along.
PyTables: based on numpy + HDF5.
dask.dataframe: based on pandas and uses out-of-core processing (a small sketch follows this list).
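As a taste of that last option, a minimal dask.dataframe sketch of the same join; the file name and column names are carried over from the snippets above and may need adjusting:
import dask.dataframe as dd

# the physician data stays on disk; only the 3 needed columns are read, in chunks
physician_ddf = dd.read_csv('physicians.csv',
                            usecols=['NPI', 'LastName', 'FirstName'])

# dd.merge accepts a plain pandas frame (summary_df) on one side
merged = dd.merge(summary_df, physician_ddf, how='left',
                  left_on='SurgeonNPI', right_on='NPI')
final_df = merged.compute()  # materialise the much smaller joined result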
I would try to import the data into a database and do the joins there (e.g. Postgres if you want a relational DB – there are pretty nice ORMs for it, like peewee). Maybe you can then use SQL operations to get a subset of the data you are most interested in, export it, and process it using Pandas. Also, take a look at Ibis for working with databases directly – another project that Wes McKinney, the author of Pandas, worked on.
It would be great to use Pandas with an on-disk storage system, but as far as I know that's not an entirely solved problem yet. There's PyTables (a bit more on using PyTables with Pandas here), but it doesn't support joins in the same SQL-like way that Pandas does.
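To make the database route concrete, here's a sketch using sqlite3 from the standard library (rather than Postgres, for brevity); the file, table, and column names are assumptions based on the snippets above:
import sqlite3
import pandas as pd

conn = sqlite3.connect('physicians.db')

# stream the big CSV into the database in chunks; it never sits in memory whole
for chunk in pd.read_csv('physicians.csv', chunksize=100000,
                         usecols=['NPI', 'LastName', 'FirstName']):
    chunk.to_sql('physicians', conn, if_exists='append', index=False)

# push the small frame in, then let SQL do the join
summary_df.to_sql('summary', conn, if_exists='replace', index=False)
final_df = pd.read_sql_query(
    """SELECT s.DOS, s.Readmit, s.SurgeonNPI, p.LastName, p.FirstName
       FROM summary AS s
       LEFT JOIN physicians AS p ON s.SurgeonNPI = p.NPI""",
    conn)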
Sampling!
import pandas as pd
import random

n = int(2.4E6)           # total rows in the file (2.4 million)
n_sample = int(2.4E5)    # rows to keep
filename = "https://data.medicare.gov/Physician-Compare/Physician-Compare-National-Downloadable-File/mj5m-pzi6"
# skip a random subset of data rows; start at 1 so the header row survives
skip = sorted(random.sample(range(1, n + 1), n - n_sample))
physician_df = pd.read_csv(filename, skiprows=skip)
Then this should work fine
summary_sample_df = summary_df[summary_df.SurgeonNPI.isin(physician_df.NPI)]
merge_sample_df = pd.merge(summary_sample_df, physician_df, how='left', left_on=['SurgeonNPI'], right_on=['NPI'])
Pickle your merge_sample_df. Sample again. Wash, rinse, repeat to desired confidence.
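For the pickling step, pandas' built-in helpers are enough; the file names below are placeholders:
import glob
import pandas as pd

# save this iteration's result; bump the name on each pass
merge_sample_df.to_pickle('merge_sample_001.pkl')

# later, combine all the samples
combined = pd.concat(pd.read_pickle(p)
                     for p in sorted(glob.glob('merge_sample_*.pkl')))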
I have a CSV file that contains data like this:
I have written code that retrieves the rows containing "Active" in the second column, "Outcome":
Data:
No,Outcome,target,result
1,Active,PGS2,positive
2,inactive,IM2,negative
3,inactive,IGI,positive
4,Active,IIL,positive
5,Active,P53,negative
Code:
new_file = open(my_file)
lines = new_file.readlines()
for line in lines:
    if "Active" in line:
        print(line, end='')
Outcome:
No,Outcome,target,result
1,Active,PGS2,positive
4,Active,IIL,positive
5,Active,P53,negative
How can I write this code using the pandas library, so that I can make it shorter and use pandas functionality after retrieving the rows?
Also, this code is not suitable when the keyword "Active" appears somewhere else in a row, because that can retrieve a false row. From previewing some posts, I found that pandas is a very suitable library for CSV handling.
Why not just filter this afterwards? It will be faster than parsing line by line. Just do this:
In [172]:
df[df['Outcome']=='Active']
Out[172]:
No Outcome target result
0 1 Active PGS2 positive
3 4 Active IIL positive
4 5 Active P53 negative
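For completeness, a minimal end-to-end version, where my_file is the CSV path from the question:
import pandas as pd

df = pd.read_csv(my_file)  # the header row is parsed automatically
active_df = df[df['Outcome'] == 'Active']
print(active_df)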