I have been trying to convert 3 csv files with related keys into a single JSON file using Python.
Originally I tried SAS, but the proc I used required (I believe) all the data to be available in a single row, so I was unable to recreate an array containing multiple customers or warehouses against a single sale.
The challenge I am facing is that the 1st CSV is a unique set of data points with no duplicates. The 2nd CSV links back to the 1st via saleID, which creates duplicate rows, and the same is true for the 3rd CSV.
The 3rd CSV has a 1-to-many relationship with the 1st, and the 2nd CSV has a 0-to-many relationship (zero, one, or many rows) with the 1st.
The format of the 3 csv files is as follows:
CSV 1 - single row for each unique ID:
saleID,ProductName
1,A
2,B
CSV 2 - can have duplicates; 1-to-many relationship with CSV 1:
WarehouseID,saleID,WarehouseName
1,1,A
2,2,B
1,3,A
CSV 3 - can have duplicates; 1-to-many relationship with CSV 1:
customerID,saleID,CustomerName
1,1,Albert
2,2,Bob
3,1,Cath
The expected format of the JSON would be something like this.
{
  "totalSales": 2,
  "Sales": [{
    "saleId": 1,
    "productName": "A",
    "warehouse": [{
      "warehouseID": 1,
      "warehouseName": "A"
    }],
    "customer": [{
      "customerID": 1,
      "customerName": "Albert"
    }, {
      "customerID": 3,
      "customerName": "Cath"
    }]
  }, {
    "saleId": 2,
    "productName": "B",
    "warehouse": [{
      "warehouseID": 2,
      "warehouseName": "B"
    }],
    "customer": [{
      "customerID": 2,
      "customerName": "Bob"
    }]
  }]
}
What I've tried so far in Python has a similar result to what I achieved in SAS; I think I'm missing the step that captures the warehouse and customer information as arrays.
import pandas as pd

def multicsvtojson():
    salesdf = pd.read_csv('C:\\Python\\multiCSVtoJSON\\sales.csv', names=("salesID", "ProductName"))
    warehousedf = pd.read_csv('C:\\Python\\multiCSVtoJSON\\warehouse.csv', names=("warehouseID", "salesID", "warehouseName"))
    customerdf = pd.read_csv('C:\\Python\\multiCSVtoJSON\\customers.csv', names=("customerID", "salesID", "customerName"))
    finaldf = pd.merge(pd.merge(salesdf, warehousedf, on='salesID'), customerdf, on='salesID')
    finaldf.to_json('finalResult.json', orient='records')
    print(finaldf)
results
[{"salesID":"saleID","ProductName":"productName","warehouseID":"warehouseID","warehouseName":"warehouseName","customerID":"customerID","customerName":"productName"},
{"salesID":"1","ProductName":"A","warehouseID":"1","warehouseName":"A","customerID":"1","customerName":"Albert"},
{"salesID":"1","ProductName":"A","warehouseID":"1","warehouseName":"A","customerID":"3","customerName":"Cath"},
{"salesID":"2","ProductName":"B","warehouseID":"2","warehouseName":"B","customerID":"2","customerName":"Bob"}]
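A sketch of the grouping step that appears to be missing: after reading the three tables, group the child tables by saleID and collect each group into a list of dicts before serializing. Demo frames stand in for the CSV reads here, and column names follow the example data (key casing such as "saleId" is taken from the expected JSON):

```python
import json
import pandas as pd

# demo frames standing in for the three CSV files
sales = pd.DataFrame({'saleID': [1, 2], 'ProductName': ['A', 'B']})
warehouses = pd.DataFrame({'warehouseID': [1, 2, 1],
                           'saleID': [1, 2, 3],
                           'warehouseName': ['A', 'B', 'A']})
customers = pd.DataFrame({'customerID': [1, 2, 3],
                          'saleID': [1, 2, 1],
                          'customerName': ['Albert', 'Bob', 'Cath']})

def nested(child, key):
    # map saleID -> list of row dicts (join key dropped from each row)
    return {k: g.drop(columns=key).to_dict('records')
            for k, g in child.groupby(key)}

wh = nested(warehouses, 'saleID')
cu = nested(customers, 'saleID')
doc = {'totalSales': len(sales),
       'Sales': [{'saleId': row.saleID,
                  'productName': row.ProductName,
                  'warehouse': wh.get(row.saleID, []),
                  'customer': cu.get(row.saleID, [])}
                 for row in sales.itertuples()]}
print(json.dumps(doc, indent=2))
```

With the example data, sale 1 ends up with one warehouse entry and two customer entries (Albert and Cath), matching the expected shape.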
I have a small Excel file that contains prices for our online store, and I am trying to automate this process. However, I don't fully trust the staff to properly qualify the data, so I wanted to use Pandas to quickly check over certain fields. I have managed to achieve everything I need so far, but I am only a beginner and I cannot think of the proper way to do the next part.
So basically I need to qualify 2 columns on the same row: we have one column, MARGIN, and if it is >60, then I need to check that the MARKDOWN column on the same row is populated with YES.
So my question is, how can I code a check that says: if MARGIN > 60, then MARKDOWN must equal YES?
Below is an example of the way I have been doing my other checks. I realise it is quite beginner-ish, but I am only a beginner.
import csv
import pandas as pd

# df and file_name are defined earlier in the script
sku2 = df['SKU_2']
comp_at = df['COMPARE AT PRICE']
sales_price = df['SALES PRICE']
dni_act = df['DO NOT IMPORT - action']
dni_fur = df['DO NOT IMPORT - further details']
promo = df['PROMO']
replacement = df['REPLACEMENT']
go_live_date = df['go live date']
markdown = df['markdown']

# sales price not blank check
for item in sales_price:
    if pd.isna(item):
        with open('document.csv', 'a', newline="") as fd:
            writer = csv.writer(fd)
            writer.writerow(['It seems there is a blank sales price in here', str(file_name)])
        break
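For what it's worth, the per-item loop above can be collapsed into a single vectorized check (a sketch; the demo frame stands in for the real spreadsheet and the column name follows the snippet above):

```python
import pandas as pd

# demo frame standing in for the real spreadsheet
df = pd.DataFrame({'SALES PRICE': [9.99, None, 4.50]})

# True if any sales price is blank
has_blank_price = df['SALES PRICE'].isna().any()
print(has_blank_price)
```

This avoids the Python-level loop entirely, which matters once the file grows.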
Example:
df = pd.DataFrame([
['a',1,2],
['b',3,4],
['a',5,6]],
columns=['f1','f2','f3'])
# & means AND; use | for OR
print(df[(df['f1'] == 'a') & (df['f2'] > 1)])
Output:
f1 f2 f3
2 a 5 6
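Applied to the check described in the question (the MARGIN and markdown column names come from the post; the demo data is made up), rows that violate the rule could be flagged like this:

```python
import pandas as pd

# demo data standing in for the real spreadsheet
df = pd.DataFrame({'MARGIN': [70, 55, 65],
                   'markdown': ['YES', 'NO', 'NO']})

# rows where MARGIN > 60 but markdown is not YES
violations = df[(df['MARGIN'] > 60) & (df['markdown'] != 'YES')]
print(violations)
```

An empty `violations` frame means every high-margin row is correctly marked down.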
The following scenario is given.
I have 2 dataframes called orders and customers.
I want to look up the custId from the orders DataFrame in the linkedCustomers column of the customers DataFrame. The linkedCustomers field is a collection of customer IDs.
The orders DataFrame contains approximately 5,800,000 rows.
The customers DataFrame contains approximately 180,000 rows.
I am looking for a way to optimize the following code, because it runs but is very slow. How can I speed this up?
# demo data -- in the real scenario this data was read from csv/json files
orders = pd.DataFrame({'custId': [1, 2, 3, 4], 'orderId': [2, 3, 4, 5]})
customers = pd.DataFrame({'id': [5, 6, 7], 'linkedCustomers': [{1, 2}, {4, 5, 6}, {3, 7, 8, 9}]})

def getMergeCustomerID(row):
    customerOrderId = row['custId']
    searchMasterCustomer = customers[customers['linkedCustomers'].str.contains(str(customerOrderId))]
    searchMasterCustomer = searchMasterCustomer['id']
    if len(searchMasterCustomer) > 0:
        return searchMasterCustomer
    else:
        return customerOrderId

orders['newId'] = orders.apply(lambda x: getMergeCustomerID(x), axis=1)
# expected result
custId orderId newId
1 2 5
2 3 5
3 4 7
4 5 6
I think that in some circumstances this approach can solve your problem:
Build a dictionary first,
myDict = {}
for _, row in customers.iterrows():
    for linked in row['linkedCustomers']:
        myDict[linked] = row['id']
then use the dictionary to create the new column (with the original custId as fallback when there is no match, as in the original function):
orders['newId'] = [myDict.get(i, i) for i in orders['custId']]
IMO, even though this can solve your problem (speed up your program), it is not the most generic solution. Better answers are welcome!
There is a table which contains data from 2014. The structure is as follows:
Each user can issue a different number of book categories.
User-id|Book-Category
1 |Thrill
2 |Thrill
3 |Mystery
3 |Mystery
The requirement is to find, for each user, the count of each type of book category issued. This data is already there in CSV files, but it is available year-wise.
I have to add up all those counts.
eg:
data for 2014
u-id|book|count
1 |b1 |2
1 |b2 |4
... ... ...
data for 2015
u-id|book|count
1 |b1 |21
2 |b3 |12
//files in the above format are available up to 2018 (user 1 with book b1 should end up with a count of 23)
Now, I wrote a Python script in which I just made a dictionary and iterated over each row: if the key (u-id + book-category) was present, I added the count values; otherwise, I inserted the key-value pair into the dictionary. I did this for every year-wise file, but since some files are larger than 1.5 GB, the script kept running for 7-8 hours and I had to stop it.
Code:
import csv

counts = {}
with open('data_2012.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        key = row['a'] + row['b']
        # add the count if the key is already present, otherwise start it at 0
        counts[key] = counts.get(key, 0) + int(row['c'])
# ...iterating over the year-wise files like this and finally writing the data
# to a different file; 'a' and 'b' are the header names on the first line of
# the data files for easy access.
Is there any way with which we can achieve this functionality more elegantly in python or write a Map-Reduce job?
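A sketch of a pandas approach, assuming each year's file has columns u-id, book, and count (the demo frames stand in for `pd.read_csv` on each file; for the >1.5 GB files, reading in chunks via `chunksize` would follow the same pattern):

```python
import pandas as pd

# demo frames standing in for the year-wise CSV files
y2014 = pd.DataFrame({'u-id': [1, 1], 'book': ['b1', 'b2'], 'count': [2, 4]})
y2015 = pd.DataFrame({'u-id': [1, 2], 'book': ['b1', 'b3'], 'count': [21, 12]})

# stack all years, then sum counts per (user, book) pair
total = (pd.concat([y2014, y2015], ignore_index=True)
           .groupby(['u-id', 'book'], as_index=False)['count'].sum())
print(total)
```

With the example data, user 1 with book b1 comes out at 23, as expected.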
My data set is a list of people either working together or alone.
I have a row for each project and columns with the names of all the people who worked on that project. If column 2 is the first empty column in a row, it was a solo job; if column 4 is the first empty column in a row, then three people worked together.
My goal is to find which people have worked together, and how many times, so I want all pairs in the data set, treating A working with B the same as B working with A.
From this, a square N x N matrix would be created with every actor labeling a column and a row, and cells (A,B) and (B,A) would contain how many times that pair worked together; this would be done for every pair.
I know a pretty quick way to do it in Excel, but I want it automated, hopefully in Stata or Python, so that if projects are added or removed I can just re-run it with one click rather than redo it every time.
An example of the data, in a comma delimited fashion:
A
A,B
B,C,E
B,F
D,F
A,B,C
D,B
E,C,B
X,D,A
F,D
B
F
F,X,C
C,F,D
Maybe something like this would get you started?
import csv
import collections
import itertools

grid = collections.Counter()

with open("connect.csv", "r", newline="") as fp:
    reader = csv.reader(fp)
    for line in reader:
        # clean empty names
        line = [name.strip() for name in line if name.strip()]
        # count single works
        if len(line) == 1:
            grid[line[0], line[0]] += 1
        # do pairwise counts
        for pair in itertools.combinations(line, 2):
            grid[pair] += 1
            grid[pair[::-1]] += 1

actors = sorted(set(pair[0] for pair in grid))

with open("connection_grid.csv", "w", newline="") as fp:
    writer = csv.writer(fp)
    writer.writerow([''] + actors)
    for actor in actors:
        line = [actor] + [grid[actor, other] for other in actors]
        writer.writerow(line)
[edit: modified to work under Python 3.2]
The key modules are: (1) csv, which makes reading and writing CSV files much simpler; (2) collections, which provides a Counter object, a dictionary that automatically generates default values (here a default count of 0), much like defaultdict(int), which you could use instead if your Python doesn't have Counter; and (3) itertools, whose combinations function yields all the pairs.
which produces
,A,B,C,D,E,F,X
A,1,2,1,1,0,0,1
B,2,1,3,1,2,1,0
C,1,3,0,1,2,2,1
D,1,1,1,0,0,3,1
E,0,2,2,0,0,0,0
F,0,1,2,3,0,1,1
X,1,0,1,1,0,1,0
You could use itertools.product to make building the array a little more compact, but since it's only a line or two I figured it was as simple to do it manually.
If I were to keep this project around for a while, I'd implement a database and then create the matrix you're talking about from a query against that database.
You have a Project table (let's say) with one record per project, an Actor table with one row per person, and a Participant table with a record per project for each actor that was in that project. (Each record would have an ID, a ProjectID, and an ActorID.)
From your example, you'd have 14 Project records, 7 Actor records (A through F, and X), and 31 Participant records.
Now, with this set up, each cell is a query against this database.
To reconstruct the matrix, first you'd add/update/remove the appropriate records in your database, and then rerun the query.
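A minimal sketch of that schema using Python's built-in sqlite3 (the table and column names follow the description above; the demo data and the single-cell query are illustrative):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE Project (ID INTEGER PRIMARY KEY);
    CREATE TABLE Actor (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Participant (
        ID INTEGER PRIMARY KEY,
        ProjectID INTEGER REFERENCES Project(ID),
        ActorID INTEGER REFERENCES Actor(ID)
    );
""")

# load a few demo projects: A solo, then A+B twice
projects = [['A'], ['A', 'B'], ['A', 'B']]
actors = {}
for pid, names in enumerate(projects, start=1):
    conn.execute("INSERT INTO Project (ID) VALUES (?)", (pid,))
    for name in names:
        if name not in actors:
            cur = conn.execute("INSERT INTO Actor (Name) VALUES (?)", (name,))
            actors[name] = cur.lastrowid
        conn.execute("INSERT INTO Participant (ProjectID, ActorID) VALUES (?, ?)",
                     (pid, actors[name]))

# one cell of the matrix: how often did A and B share a project?
count = conn.execute("""
    SELECT COUNT(*) FROM Participant p1
    JOIN Participant p2 ON p1.ProjectID = p2.ProjectID
    JOIN Actor a1 ON p1.ActorID = a1.ID
    JOIN Actor a2 ON p2.ActorID = a2.ID
    WHERE a1.Name = 'A' AND a2.Name = 'B'
""").fetchone()[0]
print(count)  # prints 2: A and B share projects 2 and 3
```

Rebuilding the full matrix is then one such query per cell, or a single self-join grouped by the two actor names.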
I guess that you don't have thousands of people working together in these projects. This implementation is pretty simple.
# counts how many times each pair worked together
pairs = {}

with open('projects.csv') as fp:
    # each element of `project` is a person
    for project in (line.strip().split(',') for line in fp):
        project.sort()
        # someone is alone here
        if len(project) == 1:
            continue
        # iterate over each pair
        for i in range(len(project)):
            for j in range(i + 1, len(project)):
                pair = (project[i], project[j])
                # increase `pairs` counter
                pairs[pair] = pairs.get(pair, 0) + 1

from pprint import pprint
pprint(pairs)
It outputs:
{('A', 'B'): 2,
 ('A', 'C'): 1,
 ('A', 'D'): 1,
 ('A', 'X'): 1,
 ('B', 'C'): 3,
 ('B', 'D'): 1,
 ('B', 'E'): 2,
 ('B', 'F'): 1,
 ('C', 'D'): 1,
 ('C', 'E'): 2,
 ('C', 'F'): 2,
 ('C', 'X'): 1,
 ('D', 'F'): 3,
 ('D', 'X'): 1,
 ('F', 'X'): 1}
I suggest using Python Pandas for this. It enables a slick solution for formatting your adjacency matrix, and it will make any statistical calculations much easier too. You can also directly extract the matrix of values into a NumPy array, for doing eigenvalue decompositions or other graph-theoretical procedures on the group clusters if needed later.
I assume that the example data you listed is saved into a file called projects_data.csv (it doesn't actually need to be a .csv file, though). I also assume there are no blank lines between observations, but this is all just file-organization detail.
Here's my code for this:
# File I/O part
import itertools
import numpy as np
import pandas as pd

with open("projects_data.csv") as tmp:
    lines = [line.strip().split(',') for line in tmp]

# Unique letters
s = set(itertools.chain(*lines))

# Actual work.
df = pd.DataFrame(
    np.zeros((len(s), len(s)), dtype=int),
    columns=sorted(s),
    index=sorted(s)
)

for line in lines:
    if len(line) == 1:
        df.loc[line[0], line[0]] += 1  # Single-person projects
    elif len(line) > 1:
        # Get all pairs in multi-person project.
        tmp_pairs = list(itertools.combinations(line, 2))
        # Append pair reversals to update (i,j) and (j,i) for each pair.
        tmp_pairs = tmp_pairs + [pair[::-1] for pair in tmp_pairs]
        for pair in tmp_pairs:
            df.loc[pair[0], pair[1]] += 1
            # Uncomment below if you don't want the list
            # comprehension method for getting the reversals.
            #df.loc[pair[1], pair[0]] += 1

# Final product
print(df.to_string())
A B C D E F X
A 1 2 1 1 0 0 1
B 2 1 3 1 2 1 0
C 1 3 0 1 2 2 1
D 1 1 1 0 0 3 1
E 0 2 2 0 0 0 0
F 0 1 2 3 0 1 1
X 1 0 1 1 0 1 0
Now you can do a lot of stuff for free, like see the total number of project partners (repeats included) for each participant:
>>> df.sum()
A 6
B 10
C 10
D 7
E 4
F 8
X 4