How to flatten data in Python to load into a SQL db

I have data in Python that looks like the following (there are sometimes many entries of these long strings). I want to load it into a single database table with three fields:
2 63668772 Human_STR_738862 AAAAAAAAAAAA AAAAAAAAAAAAAA
2 63675572 Human_STR_738864 ACACACACACACACACACACACACACAC ACACACACACACACACACACACACACACAC
...
I want it to look like this to import into sqlite3:
2 63668772 Human_STR_738862 AAAAAAAAAAAA
2 63675572 Human_STR_738864 ACACACACACACACACACACACACACAC
2 63668772 Human_STR_738862 AAAAAAAAAAAAAA
2 63675572 Human_STR_738864 ACACACACACACACACACACACACACACAC
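One direct way to get that shape with plain Python and sqlite3 is sketched below (the input file name, table name, and column names are placeholders, and it assumes each record sits on one whitespace-separated line as shown above):
import sqlite3

conn = sqlite3.connect('strs.db')
conn.execute("CREATE TABLE IF NOT EXISTS str_alleles (chrom TEXT, pos INTEGER, name TEXT, sequence TEXT)")

rows = []
with open('str_data.txt') as fh:
    for line in fh:
        parts = line.split()
        if len(parts) < 4:
            continue
        chrom, pos, name, *sequences = parts
        # emit one row per trailing sequence column -> the "long" layout above
        rows.extend((chrom, int(pos), name, seq) for seq in sequences)

conn.executemany("INSERT INTO str_alleles VALUES (?, ?, ?, ?)", rows)
conn.commit()
conn.close()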

I started working with scikit-allel. This allowed me to read the VCF directly into a DataFrame (df) where I could specify the number of ALTs. I melted the df by the keys I wanted and can now load the data directly from the df into sqlite.
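A minimal sketch of that approach, assuming scikit-allel's vcf_to_dataframe with up to two ALT alleles per record (the VCF path, field list, and table name are placeholders; the ALT_1/ALT_2 column names follow scikit-allel's expansion convention as I understand it):
import sqlite3
import allel

# expand up to 2 ALT alleles into ALT_1 / ALT_2 columns
df = allel.vcf_to_dataframe('strs.vcf', fields=['CHROM', 'POS', 'ID', 'ALT'], alt_number=2)

# wide -> long: one row per (CHROM, POS, ID, allele)
long_df = (df.melt(id_vars=['CHROM', 'POS', 'ID'],
                   value_vars=['ALT_1', 'ALT_2'],
                   value_name='sequence')
             .drop(columns='variable'))
# drop missing alleles, whether they come back as NaN or empty strings
long_df = long_df[long_df['sequence'].notna() & (long_df['sequence'] != '')]

conn = sqlite3.connect('strs.db')
long_df.to_sql('str_alleles', conn, if_exists='replace', index=False)
conn.close()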

Related

Convert csv files into a single JSON (w/ arrays) using Python

I have been trying to convert 3 CSV files with related keys into a single JSON file using Python.
Originally I had tried using SAS, but noticed the proc required (I believe) all data to be available in a single row. I was unable to recreate an array containing multiple customers or warehouses against a single sale.
The challenge I am facing is that the 1st CSV is a unique set of data points with no duplicates. The 2nd CSV links back to the 1st via saleID, and this creates duplicate rows; the same is true for the 3rd CSV.
The 3rd CSV has a 1-to-many relationship with the 1st, and the 2nd CSV has a 0-or-1-to-many relationship with the 1st.
The format of the 3 csv files is as follows:
CSV 1 - single row for each unique ID
saleID  ProductName
1       A
2       B

CSV2 - can have duplicates, 1-to-many relationship with CSV1
WarehouseID  saleID  WarehouseName
1            1       A
2            2       B
1            3       A

CSV3 - can have duplicates, 1-to-many relationship with CSV1
customerID  saleID  CustomerName
1            1       Albert
2            2       Bob
3            1       Cath
The expected format of the JSON would be something like this.
{
  "totalSales": 2,
  "Sales": [
    {
      "saleId": 1,
      "productName": "A",
      "warehouse": [
        { "warehouseID": 1, "warehouseName": "A" }
      ],
      "customer": [
        { "customerID": 1, "customerName": "Albert" },
        { "customerID": 3, "customerName": "Cath" }
      ]
    },
    {
      "saleId": 2,
      "productName": "B",
      "warehouse": [
        { "warehouseID": 2, "warehouseName": "B" }
      ],
      "customer": [
        { "customerID": 2, "customerName": "Bob" }
      ]
    }
  ]
}
What I've tried so far in Python gives a similar result to what I achieved in SAS; I think I'm missing the step that captures the warehouse and customer information as arrays.
import pandas as pd

def multicsvtojson():
    salesdf = pd.read_csv('C:\\Python\\multiCSVtoJSON\\sales.csv', names=("salesID", "ProductName"))
    warehousedf = pd.read_csv('C:\\Python\\multiCSVtoJSON\\warehouse.csv', names=("warehouseID", "salesID", "warehouseName"))
    customerdf = pd.read_csv('C:\\Python\\multiCSVtoJSON\\customers.csv', names=("customerID", "salesID", "customerName"))
    finaldf = pd.merge(pd.merge(salesdf, warehousedf, on='salesID'), customerdf, on='salesID')
    finaldf.to_json('finalResult.json', orient='records')
    print(finaldf)
Results (the first record is the file's header row, read as data because names= was passed to read_csv):
[{"salesID":"saleID","ProductName":"productName","warehouseID":"warehouseID","warehouseName":"warehouseName","customerID":"customerID","customerName":"productName"},
{"salesID":"1","ProductName":"A","warehouseID":"1","warehouseName":"A","customerID":"1","customerName":"Albert"},
{"salesID":"1","ProductName":"A","warehouseID":"1","warehouseName":"A","customerID":"3","customerName":"Cath"},
{"salesID":"2","ProductName":"B","warehouseID":"2","warehouseName":"B","customerID":"2","customerName":"Bob"}]

Python JSON to a dataframe

I am using a Yahoo Finance Python library to grab accounting financial data to do some basic analysis. All of the financial statement data comes in JSON format. I want the data in a tabular format, as I typically see in a Python DataFrame. There are several wrappers around the data and I'm not sure how to remove them so that I can get my data into a simple rows-and-columns DataFrame. Here is what the Python looks like:
{
"incomeStatementHistory":{
"F":[
{
"2019-12-31":{
"researchDevelopment":"None",
"effectOfAccountingCharges":"None",
"incomeBeforeTax":-640000000,
"minorityInterest":45000000,
"netIncome":47000000,
"sellingGeneralAdministrative":10218000000,
"grossProfit":12876000000,
"ebit":2658000000,
"operatingIncome":2658000000,
"otherOperatingExpenses":"None",
"interestExpense":-1049000000,
"extraordinaryItems":"None",
You don't have the full response, so it's difficult to tell if this will be what you want:
import pandas as pd

d = {
    "incomeStatementHistory": {
        "F": [
            {
                "2019-12-31": {
                    "researchDevelopment": "None",
                    "effectOfAccountingCharges": "None",
                    "incomeBeforeTax": -640000000,
                    "minorityInterest": 45000000,
                    "netIncome": 47000000,
                    "sellingGeneralAdministrative": 10218000000,
                    "grossProfit": 12876000000,
                    "ebit": 2658000000,
                    "operatingIncome": 2658000000,
                    "otherOperatingExpenses": "None",
                    "interestExpense": -1049000000,
                    "extraordinaryItems": "None",
                }
            }
        ]
    }
}
pd.json_normalize(d['incomeStatementHistory']['F'])
Output:
2019-12-31.researchDevelopment 2019-12-31.effectOfAccountingCharges 2019-12-31.incomeBeforeTax ... 2019-12-31.otherOperatingExpenses 2019-12-31.interestExpense 2019-12-31.extraordinaryItems
0 None None -640000000 ... None -1049000000 None
[1 rows x 12 columns]
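If you would rather have the statement date as a value in a row than as a column-name prefix, a small loop over the same d works too (a sketch, assuming every list element is a one-date dict like the sample above):
rows = []
for entry in d['incomeStatementHistory']['F']:
    for date, fields in entry.items():
        # one row per statement date, with the date kept as its own column
        rows.append({'date': date, **fields})

df = pd.DataFrame(rows)
print(df[['date', 'netIncome', 'grossProfit', 'ebit']])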
You should use pandas.
Here is a tutorial on how to do that with pandas.
You could also check this question.

Convert comment (list) to dataframe, pandas

I have a big list of names that I want to keep in my interpreter, so I would prefer not to use CSV files.
The only way I can store it in my interpreter as a variable, using copy-paste from my original file, is as a 'comment' (triple-quoted string),
so my input looks like this:
temp='''A,B,C
adam,dorothy,ben
luis,cristy,hoover'''
My goal is to convert this 'comment' inside my interpreter to a DataFrame.
I tried df = pd.DataFrame([temp]), and also converting to a Series using a comment with only one column, but without success. Any ideas?
My real data has hundreds of lines.
Use:
import pandas as pd
from io import StringIO

temp = u'''A,B,C
adam,dorothy,ben
luis,cristy,hoover'''

df = pd.read_csv(StringIO(temp))
print(df)
A B C
0 adam dorothy ben
1 luis cristy hoover

Python DBF: How to associate a .cdx index with a .dbf table

I have been given an ambiguous task of automating a data extraction from various Visual FoxPro tables.
There are several pairs of .DBF and .CDX files. With the Python dbf package, I seem to be able to work with them. I have two files, an ABC.DBF and an ABC.CDX. I can load the table file using:
>>> import dbf
>>> table = dbf.Table('ABC.DBF')
>>> print(table[3])
0 - table_key : '\x00\x00\x04'
1 - field_1 : -1
2 - field_2 : 0
3 - field_3 : 34
4 - field_4 : 2
...
>>>
It's my understanding that .cdx files are indexes. I suspect that corresponds to the table_key field. According to the author, dbf can read indexes:
I can read IDX files, but not update them. My day job changed and dbf
files are not a large part of the new one. – Ethan Furman May 26 '16
at 21:05
Reading is all I need to do. I see that four classes exist, Idx, Index, IndexFile, and IndexLocation. These seem like good candidates.
The Idx class reads in a table and filename, which is promising.
>>> index = dbf.Idx(table, 'ABC.CDX')
I'm not sure how to make use of this object, though. I see that it has some generators, backward and forward, but when I try to use them I get an error
>>> print(list(index.forward()))
dbf.NotFoundError: 'Record 67305477 is not in table ABC.DBF'
How does one associate the .cdx index file to the .dbf table?
.idx and .cdx are not the same, and dbf cannot currently read .cdx files.
If you need the table to be sorted, you can create an in-memory index:
my_index = table.create_index(key=lambda r: r.table_key)
You can also use a full-fledged function as the key:
from dbf import is_deleted, DoNotIndex

def active(rec):
    # do not show deleted records
    if is_deleted(rec):
        return DoNotIndex
    return rec.table_key

my_index = table.create_index(active)
and then loop through the index instead of the table:
for record in my_index:
...
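Putting it together, a minimal sketch (ABC.DBF and table_key come from the question; the open() mode constant reflects my reading of the dbf package and may differ between versions):
import dbf

table = dbf.Table('ABC.DBF')
table.open(dbf.READ_ONLY)        # assumption: newer dbf versions take a mode constant
my_index = table.create_index(lambda rec: rec.table_key)

for record in my_index:          # iterates in table_key order
    print(record.table_key)

table.close()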

How to read through large csv or database and join columns when memory is an issue?

I have a large dataset that I pulled from Data.Medicare.gov (https://data.medicare.gov/Physician-Compare/Physician-Compare-National-Downloadable-File/mj5m-pzi6)
It's a CSV of all physicians (2.4 million rows by 41 columns, 750MB); let's call this physician_df. However, I cannot load it into memory on my computer (memory error).
I have another df loaded in memory (summary_df) and I want to join columns (NPI, Last Name, First Name) from physician_df.
Is there any way to do this without having to load the data into memory? I first attempted to use their API, but I get capped out (I have about 500k rows in my final df and this will always be changing). Would storing physician_df in a SQL database make this easier?
Here are snippets of each df (fyi, the summary_df is all fake information).
summary_df
DOS Readmit SurgeonNPI
1-1-2018 1 1184809691
2-2-2018 0 1184809691
2-5-2017 1 1093707960
physician_df
NPI PAC ID Professional Enrollment LastName FirstName
1184809691 2668563156 I20120119000086 GOLDMAN SALUJA
1184809691 4688750714 I20080416000055 NOLTE KIMBERLY
1093707960 7618879354 I20040127000771 KHANDUJA KARAMJIT
Final df:
DOS Readmit SurgeonNPI LastName FirstName
1-1-2018 1 1184809691 GOLDMAN SALUJA
2-2-2018 0 1184809691 GOLDMAN SALUJA
2-5-2017 1 1093707960 KHANDUJA KARAMJIT
If I could load the physician_df then I would use the below code..
pandas.merge(summary_df, physician_df, how='left', left_on=['SurgeonNPI'], right_on=['NPI'])
For your desired output, you only need 3 columns from physician_df. It is more likely that 2.4 million rows of 3 columns will fit in memory than 5 columns (or, of course, all 41).
So I would first try extracting what you need from a 3-column dataset, convert it to a dictionary, then use it to map the required columns.
Note that to produce your desired output it is necessary to drop duplicates (keeping the first) from physician_df, so I have included this logic.
from operator import itemgetter as iget
import pandas as pd

# read_csv takes usecols (not columns) to restrict which columns are loaded
d = pd.read_csv('physicians.csv', usecols=['NPI', 'LastName', 'FirstName'])\
      .drop_duplicates('NPI')\
      .set_index('NPI')[['LastName', 'FirstName']]\
      .to_dict(orient='index')
# {1093707960: {'FirstName': 'KARAMJIT', 'LastName': 'KHANDUJA'},
# 1184809691: {'FirstName': 'SALUJA', 'LastName': 'GOLDMAN'}}
df_summary['LastName'] = df_summary['SurgeonNPI'].map(d).map(iget('LastName'))
df_summary['FirstName'] = df_summary['SurgeonNPI'].map(d).map(iget('FirstName'))
# DOS Readmit SurgeonNPI LastName FirstName
# 0 1-1-2018 1 1184809691 GOLDMAN SALUJA
# 1 2-2-2018 0 1184809691 GOLDMAN SALUJA
# 2 2-5-2017 1 1093707960 KHANDUJA KARAMJIT
If your final dataframe is too large to store in memory, then I would consider these options:
Chunking: split your dataframe into small chunks and output as you go along (a sketch follows this list).
PyTables: based on numpy + HDF5.
dask.dataframe: based on pandas and uses out-of-core processing.
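A minimal sketch of the chunking option, assuming the 3-column subset and file name used above and summary_df already in memory:
import pandas as pd

pieces = []
for chunk in pd.read_csv('physicians.csv',
                         usecols=['NPI', 'LastName', 'FirstName'],
                         chunksize=100_000):
    # keep only physicians that actually appear in summary_df
    pieces.append(chunk[chunk['NPI'].isin(summary_df['SurgeonNPI'])])

physician_small = pd.concat(pieces).drop_duplicates('NPI')
final_df = summary_df.merge(physician_small, how='left',
                            left_on='SurgeonNPI', right_on='NPI')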
I would try to import the data into a database and do the joins there (e.g. Postgres if you want a relational DB; there are pretty nice ORMs for it, like peewee). Maybe you can then use SQL operations to get the subset of the data you are most interested in, export it, and process it with pandas. Also, take a look at Ibis for working with databases directly; it's another project that Wes McKinney, the author of pandas, worked on.
It would be great to use Pandas with an on-disk storage system, but as far as I know that's not an entirely solved problem yet. There's PyTables (a bit more on using PyTables with Pandas here), but it doesn't support joins in the same SQL-like way that Pandas does.
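As a rough sketch of the database route with sqlite3 (file, table, and column names are assumptions based on the snippets above; summary_df is assumed to be in memory):
import sqlite3
import pandas as pd

conn = sqlite3.connect('medicare.db')

# stream the big CSV into a table in chunks so it never has to fit in memory
for chunk in pd.read_csv('physicians.csv',
                         usecols=['NPI', 'LastName', 'FirstName'],
                         chunksize=100_000):
    chunk.to_sql('physicians', conn, if_exists='append', index=False)

summary_df.to_sql('summary', conn, if_exists='replace', index=False)

final_df = pd.read_sql_query(
    """SELECT s.DOS, s.Readmit, s.SurgeonNPI, p.LastName, p.FirstName
       FROM summary s
       LEFT JOIN physicians p ON p.NPI = s.SurgeonNPI""",
    conn)
conn.close()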
Sampling!
import pandas as pd
import random

n = int(2.4E6)         # ~2.4 million data rows in the file
n_sample = int(2.4E5)  # keep roughly 10% of them
filename = "https://data.medicare.gov/Physician-Compare/Physician-Compare-National-Downloadable-File/mj5m-pzi6"
# skip a random subset of data rows; row 0 is the header, so sample from 1..n
skip = sorted(random.sample(range(1, n + 1), n - n_sample))
physician_df = pd.read_csv(filename, skiprows=skip)
Then this should work fine
summary_sample_df = summary_df[summary_df.SurgeonNPI.isin(physician_df.NPI)]
merge_sample_df = pd.merge(summary_sample_df, physician_df, how='left', left_on=['SurgeonNPI'], right_on=['NPI'])
Pickle your merge_sample_df. Sample again. Wash, rinse, repeat to desired confidence.
