Convert CSV files into a single JSON (with arrays) using Python

I have been trying to convert 3 CSV files with related keys into a single JSON file using Python.
Originally I tried SAS, but the proc (I believe) required all the data to be available in a single row, so I was unable to recreate an array containing multiple customers or warehouses against a single sale.
The challenge I am facing is that the 1st CSV is a unique set of data points with no duplicates. The 2nd CSV links back to the 1st via saleID, which creates duplicate rows, and the same is true for the 3rd CSV.
The 3rd CSV has a 1-to-many relationship with the 1st, and the 2nd CSV has a 0/1-to-many relationship with the 1st.
The format of the 3 csv files is as follows:
CSV 1 - single row for each unique ID

saleID  ProductName
1       A
2       B
CSV 2 - can have duplicates; 1-to-many relationship with CSV 1

WarehouseID  saleID  WarehouseName
1            1       A
2            2       B
1            3       A
CSV 3 - can have duplicates; 1-to-many relationship with CSV 1

customerID  saleID  CustomerName
1            1      Albert
2            2      Bob
3            1      Cath
The expected format of the JSON would be something like this.
{
  "totalSales": 2,
  "Sales": [
    {
      "saleId": 1,
      "productName": "A",
      "warehouse": [
        {
          "warehouseID": 1,
          "warehouseName": "A"
        }
      ],
      "customer": [
        {
          "customerID": 1,
          "customerName": "Albert"
        },
        {
          "customerID": 3,
          "customerName": "Cath"
        }
      ]
    },
    {
      "saleId": 2,
      "productName": "B",
      "warehouse": [
        {
          "warehouseID": 2,
          "warehouseName": "B"
        }
      ],
      "customer": [
        {
          "customerID": 2,
          "customerName": "Bob"
        }
      ]
    }
  ]
}
What I've tried so far in Python produces a similar result to what I achieved in SAS; I think I'm missing the step that captures the warehouse and customer information as arrays.
import pandas as pd

def multicsvtojson():
    # Note: passing names= without header=0 makes pandas read the header row as data,
    # which is why the first record in the results below contains the column names
    salesdf = pd.read_csv('C:\\Python\\multiCSVtoJSON\\sales.csv', names=("salesID", "ProductName"))
    warehousedf = pd.read_csv('C:\\Python\\multiCSVtoJSON\\warehouse.csv', names=("warehouseID", "salesID", "warehouseName"))
    customerdf = pd.read_csv('C:\\Python\\multiCSVtoJSON\\customers.csv', names=("customerID", "salesID", "customerName"))
    # A flat merge duplicates each sale row instead of nesting warehouses/customers as arrays
    finaldf = pd.merge(pd.merge(salesdf, warehousedf, on='salesID'), customerdf, on='salesID')
    finaldf.to_json('finalResult.json', orient='records')
    print(finaldf)
Results:
[{"salesID":"saleID","ProductName":"productName","warehouseID":"warehouseID","warehouseName":"warehouseName","customerID":"customerID","customerName":"productName"},
{"salesID":"1","ProductName":"A","warehouseID":"1","warehouseName":"A","customerID":"1","customerName":"Albert"},
{"salesID":"1","ProductName":"A","warehouseID":"1","warehouseName":"A","customerID":"3","customerName":"Cath"},
{"salesID":"2","ProductName":"B","warehouseID":"2","warehouseName":"B","customerID":"2","customerName":"Bob"}]

Related

Read file using Python/Pandas

I have a tab-delimited file with data like:
id Name address dept sal
1 abc "bangalore,
Karnataka,
Inida" 10 500
2 xyz "Hyderabad
Inida" 20 500
Here the columns are id, Name, address, dept, and sal.
The issue is with the address column, which can contain a newline character. I tried different methods to read the file using Pandas and Python, but instead of two rows I am getting multiple rows as output.
Here are a few commands I tried:
file1 = open('C:/dummy/dummy.csv', 'r')
lines = file1.readlines()
for i in lines:
    print(i)
and
df = pd.read_csv("C:/dummy/dummy.csv",sep='\t',quotechar='"')
Can anyone please help?
df = pd.read_csv("C:/dummy/dummy.csv",sep='\t',quotechar='"')
If the columns really are tab-delimited in the csv file, as you say, the corresponding output is:
   id Name                            address  dept  sal
0   1  abc  bangalore,\r\nKarnataka,\r\nInida    10  500
1   2  xyz                 Hyderabad\r\nInida    20  500
If you want to remove the CR/LF characters within the strings, you can do so in post-processing.
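For example, a minimal sketch of that post-processing (assuming the multi-line field is the address column):

# Collapse the embedded line breaks inside the quoted address field
df['address'] = df['address'].str.replace('\r\n', ' ', regex=False)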
Additionally, you could define the index column via:
df = pd.read_csv("C:/dummy/dummy.csv",sep='\t',quotechar='"',index_col=0)
What is your desired/expected output?

Writing pandas column to csv without merging integers

I have extracted user_id against shop_ids as a pandas dataframe from a database using an SQL query.
user_id shop_ids
0 022221205 541
1 023093087 5088,4460,4460,4460,4460,4460,4460,4460,5090
2 023096023 2053,2053,2053,2053,2053,2053,2053,2053,2053,1...
3 023096446 4339,4339,3966,4339,4339
4 023098684 5004,3604,5004,5749,5004
I am trying to write this dataframe into csv using:
df.to_csv('users_ordered_shops.csv')
I end up with the csv merging the shop ids into one large number, like this:
user_id shop_ids
0 22221205 541
1 23093087 508,844,604,460,446,000,000,000,000,000,000,000
2 23096023 2,053,205,320,532,050,000,000,000,000,000,000,000,000,000,000,000,000
3 23096446 43,394,339,396,643,300,000
4 23098684 50,043,604,500,457,400,000
The values for index 2 are:
print(df.iloc[2].shop_ids)
2053,2053,2053,2053,2053,2053,2053,2053,2053,1294,1294,2053,1922
The expected output is a csv file with all shop_ids intact, either in one column or in separate columns, like:
user_id shop_ids
0 022221205 541
1 023093087 5088,4460,4460,4460,4460,4460,4460,4460,5090
2 023096023 2053,2053,2053,2053,2053,2053,2053,2053,2053,1294,1294,2053,1922
3 023096446 4339,4339,3966,4339,4339
4 023098684 5004,3604,5004,5749,5004
Any tips on how to write the shop ids to a csv file without them merging? I have tried converting the shop_ids column using astype() to both int and str, which resulted in the same output.
Update
To get one shop per column (and remove duplicates), you can use:
pd.concat([df['user_id'],
           df['shop_ids'].apply(lambda x: sorted(set(x.split(','))))
                         .apply(pd.Series)],
          axis=1).to_csv('users_ordered_shops.csv', index=False)
The merging most likely happens when the file is opened in a spreadsheet program, which misreads the embedded commas in shop_ids as thousands separators, so the fix is to keep the field delimiter from colliding with those commas. Change the delimiter. Try:
df.to_csv('users_ordered_shops.csv', sep=';')
Or change the quoting strategy:
import csv
df.to_csv('users_ordered_shops.csv', quoting=csv.QUOTE_NONNUMERIC)

Neo4j import csv with splitting values in a column

I want to load the following data into a neo4j database. The column "friends" is a string of ids separated by ",". So there should be 10 nodes (ids 1-10), of which only 5 have a known age. I also want a relationship between each id and each of its friends.
Example dataframe
id age friends
1 10 "3,2"
2 20 "1,6"
3 15 "4,5,10"
4 13 "2,8,9"
5 25 "1,4,7"
My code is:
LOAD CSV WITH HEADERS FROM 'file:///df.csv' AS line
MERGE (:User {id: line.id, age: line.age})

LOAD CSV WITH HEADERS FROM 'file:///df.csv' AS line
UNWIND split(line.friends, ',') AS friends
MERGE (:User {id: friends})

LOAD CSV WITH HEADERS FROM 'file:///df.csv' AS line
UNWIND split(line.friends, ',') AS friends
WITH line, friends
MERGE (:User {id: line.id})-[:FRIENDS]->(:User {id: friends})
Is it the correct way to do it? And how to simplify the code?
A foreach should do it:
LOAD CSV WITH HEADERS FROM 'file:///df.csv' AS line
MERGE (u:User {id: line.id}) SET u.age = toInteger(line.age)
WITH u, line, split(line.friends, ',') AS friends
FOREACH (f IN friends | MERGE (friend:User {id: f}) MERGE (u)-[:FRIENDS]->(friend))
All values are Strings, so you will need to convert to other data types if required (see age).

How to flatten data in python to load into sql db

I have data in python that looks like the following (there are sometimes many entries of these long strings), and I want to load it into a single database table with three fields:
2 63668772 Human_STR_738862 AAAAAAAAAAAA AAAAAAAAAAAAAA
2 63675572 Human_STR_738864 ACACACACACACACACACACACACACAC
ACACACACACACACACACACACACACACAC
...
I want it to look like this to import into sqlite3
2 63668772 Human_STR_738862 AAAAAAAAAAAA
2 63675572 Human_STR_738864 ACACACACACACACACACACACACACAC
2 63668772 Human_STR_738862 AAAAAAAAAAAAAA
2 63675572 Human_STR_738864 ACACACACACACACACACACACACACACAC
I started working with scikit-allel. This allowed me to read the VCF directly into a dataframe (df) where I could specify the number of ALTs. I melted the df by the keys I wanted, and now I can load the data directly from the df into sqlite.
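A rough sketch of that pipeline (the file names, field list, and table name are hypothetical; alt_number controls how many ALT columns scikit-allel expands):

import sqlite3
import allel

# Read the VCF into a flat dataframe; alt_number=2 expands ALT into ALT_1 and ALT_2 columns
df = allel.vcf_to_dataframe('variants.vcf', fields=['CHROM', 'POS', 'ID', 'ALT'], alt_number=2)

# Melt the ALT_1/ALT_2 columns into one row per allele, keeping CHROM/POS/ID as keys
long_df = df.melt(id_vars=['CHROM', 'POS', 'ID'],
                  value_vars=['ALT_1', 'ALT_2'],
                  value_name='sequence')

# Drop the padded rows scikit-allel emits when a variant has fewer ALTs than alt_number
long_df = long_df[long_df['sequence'].notna() & (long_df['sequence'] != '')]

# Load the flattened rows straight into a single sqlite table
with sqlite3.connect('variants.db') as conn:
    long_df.drop(columns='variable').to_sql('str_sequences', conn, if_exists='replace', index=False)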

How to divide a dbf table into two or more dbf tables using python

I have a dbf table that I want to automatically divide into two or more tables using Python. The main problem is that the table consists of groups of rows, and each group is separated from the previous group by an empty row, so I need to save each group to a new dbf table. I think this could be solved with a function from the Arcpy package plus FOR and WHILE loops, but I can't work it out. My source dbf table is more complex, but I attach a simple example for better understanding.
Source dbf table:

ID NAME TEAM
1  A    1
2  B    2
3  C    1
4
5  D    2
6  E    3

I want to get dbf1:

ID NAME TEAM
1  A    1
2  B    2
3  C    1

I want to get dbf2:

ID NAME TEAM
1  D    2
2  E    3
Using my dbf package it could look something like this (untested):
import dbf

source_dbf = '/path/to/big/dbf_file.dbf'
base_name = '/path/to/smaller/dbf_%03d'

sdbf = dbf.Table(source_dbf)
i = 1
ddbf = sdbf.new(base_name % i)
sdbf.open()
ddbf.open()
for record in sdbf:
    if not record.name:  # assuming if 'name' is empty, all are empty
        # blank row: close the current destination table and start the next one
        ddbf.close()
        i += 1
        ddbf = sdbf.new(base_name % i)
        ddbf.open()  # the new table must be opened before records can be appended
        continue
    ddbf.append(record)
ddbf.close()
sdbf.close()
