I have been trying to convert 3 CSV files with related keys into a single JSON file using Python.
Originally I tried SAS, but noticed the proc required (I believe) all data to be available in a single row; I was unable to recreate an array containing multiple customers or warehouses against a single sale.
The challenge I am facing is that the 1st CSV is a unique set of data points with no duplicates. The 2nd CSV links back to the 1st via saleID, and this creates duplicate rows; the same is true for the 3rd CSV.
The 3rd CSV has a 1 to many relationship with the 1st, and the 2nd CSV has a 0 to 1 to many relationship with the 1st.
The format of the 3 csv files is as follows:
CSV 1 - single row for each unique ID:
saleID,ProductName
1,A
2,B
CSV 2 - can have duplicates; 1 to many relationship with CSV 1:
WarehouseID,saleID,WarehouseName
1,1,A
2,2,B
1,3,A
CSV 3 - can have duplicates; 1 to many relationship with CSV 1:
customerID,saleID,CustomerName
1,1,Albert
2,2,Bob
3,1,Cath
The expected format of the JSON would be something like this.
{
  "totalSales": 2,
  "Sales": [
    {
      "saleId": 1,
      "productName": "A",
      "warehouse": [
        {
          "warehouseID": 1,
          "warehouseName": "A"
        }
      ],
      "customer": [
        {
          "customerID": 1,
          "customerName": "Albert"
        },
        {
          "customerID": 3,
          "customerName": "Cath"
        }
      ]
    },
    {
      "saleId": 2,
      "productName": "B",
      "warehouse": [
        {
          "warehouseID": 2,
          "warehouseName": "B"
        }
      ],
      "customer": [
        {
          "customerID": 2,
          "customerName": "Bob"
        }
      ]
    }
  ]
}
What I've tried so far in Python gives a similar result to what I achieved in SAS; I think I'm missing the step that captures the warehouse and customer information as arrays.
import pandas as pd

def multicsvtojson():
    salesdf = pd.read_csv('C:\\Python\\multiCSVtoJSON\\sales.csv', names=("salesID", "ProductName"))
    warehousedf = pd.read_csv('C:\\Python\\multiCSVtoJSON\\warehouse.csv', names=("warehouseID", "salesID", "warehouseName"))
    customerdf = pd.read_csv('C:\\Python\\multiCSVtoJSON\\customers.csv', names=("customerID", "salesID", "customerName"))
    finaldf = pd.merge(pd.merge(salesdf, warehousedf, on='salesID'), customerdf, on='salesID')
    finaldf.to_json('finalResult.json', orient='records')
    print(finaldf)
Results:
[{"salesID":"saleID","ProductName":"productName","warehouseID":"warehouseID","warehouseName":"warehouseName","customerID":"customerID","customerName":"productName"},
{"salesID":"1","ProductName":"A","warehouseID":"1","warehouseName":"A","customerID":"1","customerName":"Albert"},
{"salesID":"1","ProductName":"A","warehouseID":"1","warehouseName":"A","customerID":"3","customerName":"Cath"},
{"salesID":"2","ProductName":"B","warehouseID":"2","warehouseName":"B","customerID":"2","customerName":"Bob"}]
I have extracted user_id against shop_ids as a pandas DataFrame from a database using a SQL query.
user_id shop_ids
0 022221205 541
1 023093087 5088,4460,4460,4460,4460,4460,4460,4460,5090
2 023096023 2053,2053,2053,2053,2053,2053,2053,2053,2053,1...
3 023096446 4339,4339,3966,4339,4339
4 023098684 5004,3604,5004,5749,5004
I am trying to write this dataframe into csv using:
df.to_csv('users_ordered_shops.csv')
I end up with the CSV merging the shop ids into one number, like so:
user_id shop_ids
0 22221205 541
1 23093087 508,844,604,460,446,000,000,000,000,000,000,000
2 23096023 2,053,205,320,532,050,000,000,000,000,000,000,000,000,000,000,000,000
3 23096446 43,394,339,396,643,300,000
4 23098684 50,043,604,500,457,400,000
The values for index 2 are:
print(df.iloc[2].shop_ids)
2053,2053,2053,2053,2053,2053,2053,2053,2053,1294,1294,2053,1922
Expected output is a csv file with all shop_ids intact in one column or different columns like:
user_id shop_ids
0 022221205 541
1 023093087 5088,4460,4460,4460,4460,4460,4460,4460,5090
2 023096023 2053,2053,2053,2053,2053,2053,2053,2053,2053,1294,1294,2053,1922
3 023096446 4339,4339,3966,4339,4339
4 023098684 5004,3604,5004,5749,5004
Any tips on how to get the shop ids without merging when writing to a CSV file? I have tried converting the shop_ids column using astype() to both int and str, which resulted in the same output.
Update
To get one shop per column (and remove duplicates), you can use:
pd.concat([df['user_id'],
           df['shop_ids'].apply(lambda x: sorted(set(x.split(','))))
                         .apply(pd.Series)],
          axis=1).to_csv('users_ordered_shops.csv', index=False)
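This gives one column per distinct shop id for each user, padded with empty cells where a user has fewer shops than the widest row. Note that sorted() orders the ids as strings here, since they come straight from split(',').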
Change the delimiter. Try:
df.to_csv('users_ordered_shops.csv', sep=';')
Or change the quoting strategy:
import csv
df.to_csv('users_ordered_shops.csv', quoting=csv.QUOTE_NONNUMERIC)
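For what it's worth, the file on disk is likely already intact: trailing zeros and thousands separators like those above are typical of a spreadsheet program opening the CSV, gluing the digits into one huge number and losing precision. Opening the file in a text editor should confirm this; the delimiter and quoting changes mainly affect how the spreadsheet splits and parses each line.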
I've converted a nested JSON file to a pandas DataFrame. Some of the columns now contain lists. They look like this:
0 [BikeParking: True, BusinessAcceptsBitcoin: Fa...
1 [BusinessAcceptsBitcoin: False, BusinessAccept...
2 [Alcohol: none, Ambience: {'romantic': False, ...
3 [AcceptsInsurance: False, BusinessAcceptsCredi...
4 [BusinessAcceptsCreditCards: True, Restaurants...
5 [BusinessAcceptsCreditCards: True, ByAppointme...
6 [BikeParking: True, BusinessAcceptsCreditCards...
7 [Alcohol: none, Ambience: {'romantic': False, ...
8 [BusinessAcceptsCreditCards: True]
9 [BikeParking: True, BusinessAcceptsCreditCards...
10 None
...
144070 [Alcohol: none, Ambience: {'romantic': False, ...
144071 [BikeParking: True, BusinessAcceptsCreditCards...
Name: attributes, dtype: object
and this:
0 [Monday 11:0-21:0, Tuesday 11:0-21:0, Wednesda...
1 [Monday 0:0-0:0, Tuesday 0:0-0:0, Wednesday 0:...
2 [Monday 11:0-2:0, Tuesday 11:0-2:0, Wednesday ...
3 [Tuesday 10:0-21:0, Wednesday 10:0-21:0, Thurs...
4 None
144066 None
144067 [Tuesday 8:0-16:0, Wednesday 8:0-16:0, Thursda...
144068 [Tuesday 10:0-17:30, Wednesday 10:0-17:30, Thu...
144069 None
144070 [Monday 11:0-20:0, Tuesday 11:0-20:0, Wednesda...
144071 [Monday 10:0-21:0, Tuesday 10:0-21:0, Wednesda...
Name: hours, dtype: object
Is there any way for me to automatically extract the tags (BikeParking, AcceptsInsurance, etc.) and use them as column names, filling the cells with the true/false values? For the Ambience dict I want something like an Ambience_romantic column with true/false in the cells. Similarly, I want to extract the days of the week as column names and use the hours to fill the cells.
Or is there a way to flatten the JSON data beforehand? I have tried passing the JSON data line by line to json_normalize and creating a dataframe from the output, but it produces the same result. Maybe I'm doing something wrong?
Format of Original json (yelp_academic_dataset_business.json):
{
  "business_id": "encrypted business id",
  "name": "business name",
  "neighborhood": "hood name",
  "address": "full address",
  "city": "city",
  "state": "state -- if applicable --",
  "postal code": "postal code",
  "latitude": latitude,
  "longitude": longitude,
  "stars": star rating, rounded to half-stars,
  "review_count": number of reviews,
  "is_open": 0/1 (closed/open),
  "attributes": ["an array of strings: each array element is an attribute"],
  "categories": ["an array of strings of business categories"],
  "hours": ["an array of strings of business hours"],
  "type": "business"
}
My initial attempt with json_normalize:
import json
from pandas.io.json import json_normalize  # pandas.json_normalize in newer versions

with open('yelp_academic_dataset_business.json') as f:
    # Normalize the json data to flatten it and store the output in a dataframe
    frame = json_normalize([json.loads(line) for line in f])

# write the dataframe to a csv file
frame.to_csv('yelp_academic_dataset_business.csv', encoding='utf-8', index=False)
What I'm currently trying:
import pandas as pd

with open(json_filename) as f:
    data = f.readlines()

# remove the trailing "\n" from each line
data = [line.rstrip() for line in data]
data_json_str = "[" + ','.join(data) + "]"
df = pd.read_json(data_json_str)
# Now looking to expand df['attributes'] and others here
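For expanding the list-valued columns once df exists, one approach (a sketch; the helper name is an assumption, and the "Key: value" string format is inferred from the display above) is to parse each list into a dict and let pandas spread the keys into columns:
import pandas as pd

# Sketch: turn ["BikeParking: True", "Alcohol: none", ...] into a dict,
# then let apply(pd.Series) spread the keys into columns.
def attrs_to_dict(attrs):
    if not attrs:  # covers None and empty lists
        return {}
    out = {}
    for item in attrs:
        key, _, value = item.partition(': ')
        out[key] = value
    return out

expanded = df['attributes'].apply(attrs_to_dict).apply(pd.Series)
df = pd.concat([df.drop(columns=['attributes']), expanded], axis=1)
A nested value like Ambience would still land in a single column after this and would need a second pass (json_normalize with sep='_' is one option for dict-valued cells).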
And I should also mention my aim is to convert it to .csv to load it into a database; I don't want lists in my database columns.
You can get the original json data from the Yelp Dataset Challenge site:
https://www.yelp.ca/dataset_challenge/dataset
You're trying to convert "documents" (semi-structured data) into a table. This could be problematic if one record contains, say, 100 attributes that no other record has; you probably don't want to add 100 columns to a master table and leave empty cells for all other records.
But in the end you have explained that you intend to do this:
Load JSON.
Convert to Pandas.
Export CSV.
Import into a database.
And I am here to tell you that this is all entirely pointless. Mashing the data through all these intermediate formats will only cause problems.
Instead, let's get back to basics:
Load JSON.
Write to database.
Now the first step is coming up with a schema. Or, if you're using a NoSQL database, you can directly load the JSON with no other steps required.
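For example, with SQLite from the standard library (a sketch; the schema and column choice are illustrative assumptions, and nested fields are kept as JSON strings rather than flattened):
import json
import sqlite3

conn = sqlite3.connect('yelp.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS business (
        business_id TEXT PRIMARY KEY,
        name        TEXT,
        city        TEXT,
        stars       REAL,
        attributes  TEXT  -- nested data kept as a JSON string
    )
""")

with open('yelp_academic_dataset_business.json') as f:
    for line in f:
        record = json.loads(line)
        conn.execute(
            "INSERT OR REPLACE INTO business VALUES (?, ?, ?, ?, ?)",
            (record['business_id'], record['name'], record['city'],
             record['stars'], json.dumps(record.get('attributes'))),
        )
conn.commit()
conn.close()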
I have a CSV file that contains data like this:
I have written code that retrieves the rows containing "Active" in the second column, "Outcome":
Data:
No,Outcome,target,result
1,Active,PGS2,positive
2,inactive,IM2,negative
3,inactive,IGI,positive
4,Active,IIL,positive
5,Active,P53,negative
Code:
new_file = open(my_file)
lines = new_file.readlines()
for line in lines:
    if "Active" in line:
        print(line, end='')
Outcome:
No,Outcome,target,result
1,Active,PGS2,positive
4,Active,IIL,positive
5,Active,P53,negative
How can I write this code using the pandas library so that it is shorter, and so that I can use pandas functionality on the rows after retrieving them?
Also, this code is not suitable when the keyword "Active" appears somewhere else in a row, because that can retrieve a false row. From previewing some posts I found that pandas is a very suitable library for CSV handling.
Why not just filter this afterwards? It will be faster than parsing line by line. Just do this:
In [172]:
df[df['Outcome']=='Active']
Out[172]:
No Outcome target result
0 1 Active PGS2 positive
3 4 Active IIL positive
4 5 Active P53 negative
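Putting the whole flow together (a sketch; the input and output file names are assumptions):
import pandas as pd

# read the CSV, filter on the Outcome column only (so a stray "Active"
# elsewhere in a row cannot produce a false match), and write the result
df = pd.read_csv('my_file.csv')
active = df[df['Outcome'] == 'Active']
active.to_csv('active_rows.csv', index=False)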
Using Python 3.2, I was hoping to solve the issue below. My data consists of hundreds of rows (each signifying a project) and 21 columns. The first is a unique project ID and the other 20 columns hold the group of people, or person, that led the project. person_1 is always filled, and if there is a name in person_3 that means 3 people are working together. If there is a name in person_18 that means 18 people are working together.
I have an Excel spreadsheet that is set up the following way:
unique ID  person_1  person_2  person_3  person_4  ...  person_20
12         Tom       Sally     Mike
16         Joe       Mike
5          Joe       Sally
1          Sally     Mike      Tom
6          Sally     Tom       Mike
2          Jared     Joe       Mike      John      ...  Carl
I want to do a few things:
1) Make a column that gives me a unique 'Group Name', which will be, using unique ID 1 as my example, Sally/Mike/Tom. So it will be the names separated by '/'.
2) How can I treat, from my example, Sally/Mike/Tom the same as Sally/Tom/Mike? Meaning, I would like another column that puts the group name in alphabetical order (no matter the actual permutation), still separated by '/'.
3) This question is similar to (2). However, I want the person listed in person_1 to matter. Meaning Joe/Tom/Mike is different from Tom/Joe/Mike but not different from Joe/Mike/Tom. So there will be another column that keeps person_1 at the start of the group name but alphabetizes person_2 through person_20 if applicable (i.e., if the project has more than 1 person on it).
Thanks for the help and suggestions.
The previous answer gave a clear statement of method, but perhaps you are stuck on either the string processing or the CSV processing. Both are demonstrated in the following code. The relevant tools are the built-in sorted function and the string join method; '/'.join uses / as the separator between joined items. The + operator between lists in the tname and writerow statements concatenates the lists. A csv.reader is an iterator that delivers one list per row, and a csv.writer converts a list to a row and writes it out. You will want to add error handling to the file opens, etc. The data file used to test this code is shown after the code.
import csv

fi = open('xgroup.csv')
fo = open('xgroup3.csv', 'w', newline='')  # newline='' avoids extra blank rows on Windows
w = csv.writer(fo)
r = csv.reader(fi)
li = 0
print("Opened reader and writer")
for row in r:
    gname = '/'.join(row[1:])                     # names in original order
    sname = '/'.join(sorted(row[1:]))             # fully alphabetized
    tname = '/'.join([row[1]] + sorted(row[2:]))  # lead person first, rest sorted
    w.writerow([row[0], gname, sname, tname] + row[1:])
    li += 1
fi.close()
fo.close()
print("Closed reader and writer after", li, "lines")
File xgroup.csv is shown next.
unique-ID,person_1,person,_2,person_3,person_4,...,person_20
12,Tom,Sally,Mike
16,Joe,Mike
5,Joe,Sally
1,Sally,Mike,Tom
6,Sally,Tom,Mike
2,Jared,Joe,Mike,John,...,Carl
Upon reading data as above, the program prints Opened reader and writer and Closed reader and writer after 7 lines and produces output in file xgroup3.csv as shown next.
unique-ID,person_1/person/_2/person_3/person_4/.../person_20,.../_2/person/person_1/person_20/person_3/person_4,person_1/.../_2/person/person_20/person_3/person_4,person_1,person,_2,person_3,person_4,...,person_20
12,Tom/Sally/Mike,Mike/Sally/Tom,Tom/Mike/Sally,Tom,Sally,Mike
16,Joe/Mike,Joe/Mike,Joe/Mike,Joe,Mike
5,Joe/Sally,Joe/Sally,Joe/Sally,Joe,Sally
1,Sally/Mike/Tom,Mike/Sally/Tom,Sally/Mike/Tom,Sally,Mike,Tom
6,Sally/Tom/Mike,Mike/Sally/Tom,Sally/Mike/Tom,Sally,Tom,Mike
2,Jared/Joe/Mike/John/.../Carl,.../Carl/Jared/Joe/John/Mike,Jared/.../Carl/Joe/John/Mike,Jared,Joe,Mike,John,...,Carl
Note, given a data line like
5,Joe,Sally,,,,,
instead of
5,Joe,Sally
the program as above produces
5,Joe/Sally/////,/////Joe/Sally,Joe//////Sally,Joe,Sally,,,,,
instead of
5,Joe/Sally,Joe/Sally,Joe/Sally,Joe,Sally
If that's a problem, filter out empty entries. For example, if
row=['5', 'Joe', 'Sally', '', '', '', '', ''], then
'/'.join(row[1:]) produces
'Joe/Sally/////', while
'/'.join(filter(lambda x: x, row[1:])) and
'/'.join(x for x in row[1:] if x) and
'/'.join(filter(len, row[1:])) produce
'Joe/Sally' .
You could do the following:
Export your file to a .csv file from Excel
Open that input file using python's csv module, using csv.reader
Open another file (output) to write to it using csv.writer
Iterate over each row in your reader, do your treatment, and write that using your writer
Import the output file in Excel
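If you'd rather stay in pandas for steps 2 through 4, a minimal sketch (file and column names are assumptions; person_1 is taken to be always filled, as stated in the question):
import pandas as pd

# Assumes the sheet was exported as projects.csv with the header shown above.
df = pd.read_csv('projects.csv')
person_cols = [c for c in df.columns if c.startswith('person_')]

def names(row):
    # drop empty cells, keeping the original order of appearance
    return [str(x) for x in row[person_cols] if pd.notna(x)]

# 1) names in original order, 2) fully alphabetized, 3) lead person first
df['group'] = df.apply(lambda r: '/'.join(names(r)), axis=1)
df['group_sorted'] = df.apply(lambda r: '/'.join(sorted(names(r))), axis=1)
df['group_lead_first'] = df.apply(
    lambda r: '/'.join([names(r)[0]] + sorted(names(r)[1:])), axis=1)

df.to_csv('projects_grouped.csv', index=False)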