Get unique of a column ordered by date in 800million rows - python

Input : Multiple csv with the same columns (800 million rows) [Time Stamp, User ID, Col1, Col2, Col3]
Memory available : 60GB of RAM and 24 core CPU
Input Output example
Problem : I want to group by User ID, sort by TimeStamp and take a unique of Col1 but dropping duplicates while retaining the order based on the TimeStamp.
Solutions Tried :
Tried using joblib to load csv in parallel and use pandas to sort and write to csv (Get an error at the sorting step)
Used dask (New to Dask); \
LocalCluster(dashboard_address=f':{port}', n_workers=4, threads_per_worker=4, memory_limit='7GB') ## Cannot use the full 60 gigs as there are others on the server
ddf = read_csv("/path/*.csv")
ddf = ddf.set_index("Time Stamp")
ddf.to_csv("/outdir/")
Questions :
Assuming dask will use disk to sort and write the multipart output, will it preserve the order after I read the output using read_csv?
How do I achieve the 2 part of the problem in dask. In pandas, I'd just apply and gather results in a new dataframe?
def getUnique(user_group): ## assuming the rows for each user are sorted by timestamp
res = list()
for val in user_group["Col1"]:
if val not in res:
res.append(val)
return res
Please direct me if there is a better alternative to dask.

So, I think I would approach this with two passes. In the first pass, I would look to run though all the csv files and build a data structure to hold the keys of user_id and col1 and the "best" timestamp. In this case, "best" will be the lowest.
Note: the use of dictionaries here only serves to clarify what we are attempting to do and if performance or memory was an issue, I would first look to reimplement without them where possible.
so, starting with CSV data like:
[
{"user_id": 1, "col1": "a", "timestamp": 1},
{"user_id": 1, "col1": "a", "timestamp": 2},
{"user_id": 1, "col1": "b", "timestamp": 4},
{"user_id": 1, "col1": "c", "timestamp": 3},
]
After processing all the csv files I hope to have an interim representation of:
{
1: {'a': 1, 'b': 4, 'c': 3}
}
Note that a representation like this could be created in parallel for each CSV and then re-distilled into a final interim representation via a pass 1.5 if you wanted to do that.
Now we can create a final representation based on the keys of this nested structure sorted by the inner most value. Giving us:
[
{'user_id': 1, 'col1': ['a', 'c', 'b']}
]
Here is how I might first approach this task before tweaking things for performance.
import csv
all_csv_files = [
"some.csv",
"bunch.csv",
"of.csv",
"files.csv",
]
data = {}
for csv_file in all_csv_files:
#with open(csv_file, "r") as file_in:
# rows = csv.DictReader(file_in)
## ----------------------------
## demo data
## ----------------------------
rows = [
{"user_id": 1, "col1": "a", "timestamp": 1},
{"user_id": 1, "col1": "a", "timestamp": 2},
{"user_id": 1, "col1": "b", "timestamp": 4},
{"user_id": 1, "col1": "c", "timestamp": 3},
]
## ----------------------------
## ----------------------------
## First pass to determine the "best" timestamp
## for a user_id/col1
## ----------------------------
for row in rows:
user_id = row['user_id']
col1 = row['col1']
ts_new = row['timestamp']
ts_old = (
data
.setdefault(user_id, {})
.setdefault(col1, ts_new)
)
if ts_new < ts_old:
data[user_id][col1] = ts_new
## ----------------------------
print(data)
## ----------------------------
## second pass to set order of col1 for a given user_id
## ----------------------------
data_out = [
{
"user_id": outer_key,
"col1": [
inner_kvp[0]
for inner_kvp
in sorted(outer_value.items(), key=lambda v: v[1])
]
}
for outer_key, outer_value
in data.items()
]
## ----------------------------
print(data_out)

Related

Normalizing json using pandas with inconsistent nested lists/dictionaries

I've been using pandas' json_normalize for a bit but ran into a problem with specific json file, similar to the one seen here: https://github.com/pandas-dev/pandas/issues/37783#issuecomment-1148052109
I'm trying to find a way to retrieve the data within the Ats -> Ats dict and return any null values (like the one seen in the ID:101 entry) as NaN values in the dataframe. Ignoring errors within the json_normalize call doesn't prevent the TypeError that stems from trying to iterate through a null value.
Any advice or methods to receive a valid dataframe out of data with this structure is greatly appreciated!
import json
import pandas as pd
data = """[
{
"ID": "100",
"Ats": {
"Ats": [
{
"Name": "At1",
"Desc": "Lazy At"
}
]
}
},
{
"ID": "101",
"Ats": null
}
]"""
data = json.loads(data)
df = pd.json_normalize(data, ["Ats", "Ats"], "ID", errors='ignore')
df.head()
TypeError: 'NoneType' object is not iterable
I tried to iterate through the Ats dictionary, which would work normally for the data with ID 100 but not with ID 101. I expected ignoring errors within the function to return a NaN value in a dataframe but instead received a TypeError for trying to iterate through a null value.
The desired output would look like this: Dataframe
This approach can be more efficient when it comes to dealing with large datasets.
data = json.loads(data)
desired_data = list(
map(lambda x: pd.json_normalize(x, ["Ats", "Ats"], "ID").to_dict(orient="records")[0]
if x["Ats"] is not None
else {"ID": x["ID"], "Name": np.nan, "Desc": np.nan}, data))
df = pd.DataFrame(desired_data)
Output:
Name Desc ID
0 At1 Lazy At 100
1 NaN NaN 101
You might want to consider using this simple try and except approach when working with small datasets. In this case, whenever an error is found it should append new row to DataFrame with NAN.
Example:
data = json.loads(data)
df = pd.DataFrame()
for item in data:
try:
df = df.append(pd.json_normalize(item, ["Ats", "Ats"], "ID"))
except TypeError:
df = df.append({"ID" : item["ID"], "Name": np.nan, "Desc": np.nan}, ignore_index=True)
print(df)
Output:
Name Desc ID
0 At1 Lazy At 100
1 NaN NaN 101
Maybe you can create a DataFrame from the data normally (without pd.json_normalize) and then transform it to requested form afterwards:
import json
import pandas as pd
data = """\
[
{
"ID": "100",
"Ats": {
"Ats": [
{
"Name": "At1",
"Desc": "Lazy At"
}
]
}
},
{
"ID": "101",
"Ats": null
}
]"""
data = json.loads(data)
df = pd.DataFrame(data)
df["Ats"] = df["Ats"].str["Ats"]
df = df.explode("Ats")
df = pd.concat([df, df.pop("Ats").apply(pd.Series, dtype=object)], axis=1)
print(df)
Prints:
ID Name Desc
0 100 At1 Lazy At
1 101 NaN NaN

How to create a single json file from two DataFrames?

I have two DataFrames, and I want to post these DataFrames as json (to the web service) but first I have to concatenate them as json.
#first df
input_df = pd.DataFrame()
input_df['first'] = ['a', 'b']
input_df['second'] = [1, 2]
#second df
customer_df = pd.DataFrame()
customer_df['first'] = ['c']
customer_df['second'] = [3]
For converting to json, I used following code for each DataFrame;
df.to_json(
path_or_buf='out.json',
orient='records', # other options are (split’, ‘records’, ‘index’, ‘columns’, ‘values’, ‘table’)
date_format='iso',
force_ascii=False,
default_handler=None,
lines=False,
indent=2
)
This code gives me the table like this: For ex, input_df export json
[
{
"first":"a",
"second":1
},
{
"first":"b",
"second":2
}
]
my desired output is like that:
{
"input": [
{
"first": "a",
"second": 1
},
{
"first": "b",
"second": 2
}
],
"customer": [
{
"first": "d",
"second": 3
}
]
}
How can I get this output like this? I couldn't find the way :(
You can concatenate the DataFrames with appropriate key names, then groupby the keys and build dictionaries at each group; finally build a json string from the entire thing:
out = (
pd.concat([input_df, customer_df], keys=['input', 'customer'])
.droplevel(1)
.groupby(level=0).apply(lambda x: x.to_dict('records'))
.to_json()
)
Output:
'{"customer":[{"first":"c","second":3}],"input":[{"first":"a","second":1},{"first":"b","second":2}]}'
or a dict by replacing the last to_json() to to_dict().

Converting CSV do JSON with Pandas

I have the data in CSV format, for example:
First row is column number, let's ignore that.
Second row, starting at Col_4, are number of days
Third row on: Col_1 and Col_2 are coordinates (lon, lat), Col_3 is a statistical value, Col_4 onwards are measurements.
As you can see, this format is a confusing mess. I would like to convert this to JSON the following way, for example:
{"points":{
"dates": ["20190103", "20190205"],
"0":{
"lon": "-8.072557",
"lat": "41.13702",
"measurements": ["-0.191792","-10.543130"],
"val": "-1"
},
"1":{
"lon": "-8.075557",
"lat": "41.15702",
"measurements": ["-1.191792","-2.543130"],
"val": "-9"
}
}
}
To summarise what I've done till now, I read the CSV to a Pandas DataFrame:
df = pandas.read_csv("sample.csv")
I can extract the dates into a Numpy Array with:
dates = df.iloc[0][3:].to_numpy()
I can extract measurements for all points with:
measurements_all = df.iloc[1:,3:].to_numpy()
And the lon and lat and val, respectively, with:
lon_all = df.iloc[1:,0:1].to_numpy()
lat_all = df.iloc[1:,1:2].to_numpy()
val_all = df.iloc[1:,2:3].to_numpy()
Can anyone explain how I can format this info a structure identical to the .json example?
With this dataframe
df = pd.DataFrame([{"col1": 1, "col2": 2, "col3": 3, "col4":3, "col5":5},
{"col1": None, "col2": None, "col3": None, "col4":20190103, "col5":20190205},
{"col1": -8.072557, "col2": 41.13702, "col3": -1, "col4":-0.191792, "col5":-10.543130},
{"col1": -8.072557, "col2": 41.15702, "col3": -9, "col4":-0.191792, "col5":-2.543130}])
This code does what you want, although it is not converting anything to strings as in you example. But you should be able to easily do it, if necessary.
# generate dict with dates
final_dict = {"dates": [list(df["col4"])[1],list(df["col5"])[1]]}
# iterate over relevant rows and generate dicts
for i in range(2,len(df)):
final_dict[i-2] = {"lon": df["col1"][i],
"lat": df["col2"][i],
"measurements": [df[cname][i] for cname in ["col4", "col5"]],
"val": df["col3"][i]
}
this leads to this output:
{0: {'lat': 41.13702,
'lon': -8.072557,
'measurements': [-0.191792, -10.54313],
'val': -1.0},
1: {'lat': 41.15702,
'lon': -8.072557,
'measurements': [-0.191792, -2.54313],
'val': -9.0},
'dates': [20190103.0, 20190205.0]}
Extracting dates from data and then eliminating first row from data frame:
dates =list(data.iloc[0][3:])
data=data.iloc[1:]
Inserting dates into dict:
points={"dates":dates}
Iterating through data frame and adding elements to the dictionary:
i=0
for index, row in data.iterrows():
element= {"lon":row["Col_1"],
"lat":row["Col_2"],
"measurements": [row["Col_3"], row["Col_4"]]}
points[str(i)]=element
i+=1
You can convert dict to string object using json.dumps():
points_json = json.dumps(points)
It will be string object, not json(dict) object. More about that in this post Converting dictionary to JSON
I converted the pandas dataframe values to a list, and then loop through one of the lists, and add the lists to a nested JSON object containing the values.
import pandas
import json
import argparse
import sys
def parseInput():
parser = argparse.ArgumentParser(description="Convert CSV measurements to JSON")
parser.add_argument(
'-i', "--input",
help="CSV input",
required=True,
type=argparse.FileType('r')
)
parser.add_argument(
'-o', "--output",
help="JSON output",
type=argparse.FileType('w'),
default=sys.stdout
)
return parser.parse_args()
def main():
args = parseInput()
input_file = args.input
output = args.output
dataframe = pandas.read_csv(input_file)
longitudes = dataframe.iloc[1:,0:1].T.values.tolist()[0]
latitudes = dataframe.iloc[1:,1:2].T.values.tolist()[0]
averages = dataframe.iloc[1:,2:3].T.values.tolist()[0]
measurements = dataframe.iloc[1:,3:].values.tolist()
dates=dataframe.iloc[0][3:].values.tolist()
points={"dates":dates}
for index, val in enumerate(longitudes):
entry = {
"lon":longitudes[index],
"lat":latitudes[index],
"measurements":measurements[index],
"average":averages[index]
}
points[str(index)] = entry
json.dump(points, output)
if __name__ == "__main__":
main()

Convert json dictionary to dataframe in Python

My API gives me a json file as output with the following structure:
{
"results": [
{
"statement_id": 0,
"series": [
{
"name": "PCJeremy",
"tags": {
"host": "001"
},
"columns": [
"time",
"memory"
],
"values": [
[
"2021-03-20T23:00:00Z",
1049911288
],
[
"2021-03-21T00:00:00Z",
1057692712
],
]
},
{
"name": "PCJohnny",
"tags": {
"host": "002"
},
"columns": [
"time",
"memory"
],
"values": [
[
"2021-03-20T23:00:00Z",
407896064
],
[
"2021-03-21T00:00:00Z",
406847488
]
]
}
]
}
]
}
I want to transform this output to a pandas dataframe so I can create some reports from it. I tried using the pdDataFrame.from_dict method:
with open(fn) as f:
data = json.load(f)
print(pd.DataFrame.from_dict(data))
But as a resulting set, I just get one column and one row with all the data back:
results
0 {'statement_id': 0, 'series': [{'name': 'Jerem...
The structure is just quite hard to understand for me as I am no professional. I would like to get a dataframe with 4 columns: name, host, time and memory with a row of data for every combination of values in the json file. Example:
name host time memory
JeremyPC 001 "2021-03-20T23:00:00Z" 1049911288
JeremyPC 001 "2021-03-21T00:00:00Z" 1049911288
Is this in any way possible? Thanks a lot in advance!
First extract the data from json you are interested in
extracted_data = []
for series in data['results'][0]['series']:
d = {}
d['name'] = series['name']
d['host'] = series['tags']['host']
d['time'] = [value[0] for value in series['values']]
d['memory'] = [value[1] for value in series['values']]
extracted_data.append(d)
df = pd.DataFrame(extracted_data)
# print(df)
name host time memory
0 PCJeremy 001 [2021-03-20T23:00:00Z, 2021-03-21T00:00:00Z] [1049911288, 1057692712]
1 PCJohnny 002 [2021-03-20T23:00:00Z, 2021-03-21T00:00:00Z] [407896064, 406847488]
Second, explode multiple columns into rows
df1 = pd.concat([df.explode('time')['time'], df.explode('memory')['memory']], axis=1)
df_ = df.drop(['time','memory'], axis=1).join(df1).reset_index(drop=True)
# print(df_)
name host time memory
0 PCJeremy 001 2021-03-20T23:00:00Z 1049911288
1 PCJeremy 001 2021-03-21T00:00:00Z 1057692712
2 PCJohnny 002 2021-03-20T23:00:00Z 407896064
3 PCJohnny 002 2021-03-21T00:00:00Z 406847488
With carefully constructing the dict, it could be done without exploding.
extracted_data = []
for series in data['results'][0]['series']:
d = {}
d['name'] = series['name']
d['host'] = series['tags']['host']
for values in series['values']:
d_ = d.copy()
for column, value in zip(series['columns'], values):
d_[column] = value
extracted_data.append(d_)
df = pd.DataFrame(extracted_data)
You could jmespath to extract the data; it is quite a handy tool for such nested json data. You can read the docs for more details; I will summarize the basics: If you want to access a key, use a dot, if you want to access values in a list, use []. Combination of these two will help in traversing the json paths. There are more tools; these basics should get you started.
Your json is wrapped in a data variable:
data
{'results': [{'statement_id': 0,
'series': [{'name': 'PCJeremy',
'tags': {'host': '001'},
'columns': ['time', 'memory'],
'values': [['2021-03-20T23:00:00Z', 1049911288],
['2021-03-21T00:00:00Z', 1057692712]]},
{'name': 'PCJohnny',
'tags': {'host': '002'},
'columns': ['time', 'memory'],
'values': [['2021-03-20T23:00:00Z', 407896064],
['2021-03-21T00:00:00Z', 406847488]]}]}]}
Let's create an expression to parse the json, and get the specific values:
expression = """{name: results[].series[].name,
host: results[].series[].tags.host,
time: results[].series[].values[*][0],
memory: results[].series[].values[*][-1]}
"""
Parse the expression to the json data:
expression = jmespath.compile(expression).search(data)
expression
{'name': ['PCJeremy', 'PCJohnny'],
'host': ['001', '002'],
'time': [['2021-03-20T23:00:00Z', '2021-03-21T00:00:00Z'],
['2021-03-20T23:00:00Z', '2021-03-21T00:00:00Z']],
'memory': [[1049911288, 1057692712], [407896064, 406847488]]}
Note the time and memory are nested lists, and match the values in data:
Create dataframe and explode relevant columns:
pd.DataFrame(expression).apply(pd.Series.explode)
name host time memory
0 PCJeremy 001 2021-03-20T23:00:00Z 1049911288
0 PCJeremy 001 2021-03-21T00:00:00Z 1057692712
1 PCJohnny 002 2021-03-20T23:00:00Z 407896064
1 PCJohnny 002 2021-03-21T00:00:00Z 406847488

How to display json without column names pandas/python?

We query a df by below code:
json.loads(df.reset_index(drop=True).to_json(orient='table'))
The output is:
{"index": [ 0, 1 ,2, 3, 4],
"col1": [ "250"],
"col2": [ "1"],
"col3": [ "223"],
"col4": [ "2020-06-12 14:55"]
}
We need the output should be like this:
[ "250", "1", "223", "2020-06-12 14:55"],[.....][.....]
json.loads(df.reset_index(drop=True).to_json(orient='values'))
change table into values solved my problem.
What you call a "json" (there is no such data type) is a Python dictionary. Extract the values for the keys of interest using list comprehension:
x = .... # Your dictionary
[x[col][0] for col in x if col.startswith("col")]
#['250', '1', '223', '2020-06-12 14:55']
We convert json to dataframe and we remove column name.
pd.Dataframe(json_source,header='False')
Then we convert it to json formate
df.to_json(orient='table')

Categories