We query a df by below code:
json.loads(df.reset_index(drop=True).to_json(orient='table'))
The output is:
{"index": [ 0, 1 ,2, 3, 4],
"col1": [ "250"],
"col2": [ "1"],
"col3": [ "223"],
"col4": [ "2020-06-12 14:55"]
}
We need the output should be like this:
[ "250", "1", "223", "2020-06-12 14:55"],[.....][.....]
json.loads(df.reset_index(drop=True).to_json(orient='values'))
change table into values solved my problem.
What you call a "json" (there is no such data type) is a Python dictionary. Extract the values for the keys of interest using list comprehension:
x = .... # Your dictionary
[x[col][0] for col in x if col.startswith("col")]
#['250', '1', '223', '2020-06-12 14:55']
We convert json to dataframe and we remove column name.
pd.Dataframe(json_source,header='False')
Then we convert it to json formate
df.to_json(orient='table')
Related
I have pandas dataframe with multiple columns. On the the column called request_headers is in a format of list of dictionaries, example:
[{"name": "name1", "value": "value1"}, {"name": "name2", "value": "value2"}]
I would like to remove only those elements from that list which do contains specific name. For example with:
blacklist = "name2"
I should get the same dataframe, with all the columns including request_headers, but it's value (based on the example above) should be:
[{"name": "name1", "value": "value1"}]
How to achieve it ? I've tried first to explode, then filter, but was not able to "implode" correctly.
Thanks,
Exploding is expensive, rather us a list comprehension:
blacklist = "name2"
df['request_headers'] = [[d for d in l if 'name' in d and d['name'] != blacklist]
for l in df['request_headers']]
Output:
request_headers
0 [{'name': 'name1', 'value': 'value1'}]
can use a .apply function:
blacklist = 'name2'
df['request_headers'] = df['request_headers'].apply(lambda x: [d for d in x if blacklist not in d.values()])
df1=pd.DataFrame([{"name": "name1", "value": "value1"}, {"name": "name2", "value": "value2"}])
blacklist = "name2"
col1=df1.name.eq(blacklist)
df1.loc[col1]
out:
name value
1 name2 value2
Input : Multiple csv with the same columns (800 million rows) [Time Stamp, User ID, Col1, Col2, Col3]
Memory available : 60GB of RAM and 24 core CPU
Input Output example
Problem : I want to group by User ID, sort by TimeStamp and take a unique of Col1 but dropping duplicates while retaining the order based on the TimeStamp.
Solutions Tried :
Tried using joblib to load csv in parallel and use pandas to sort and write to csv (Get an error at the sorting step)
Used dask (New to Dask); \
LocalCluster(dashboard_address=f':{port}', n_workers=4, threads_per_worker=4, memory_limit='7GB') ## Cannot use the full 60 gigs as there are others on the server
ddf = read_csv("/path/*.csv")
ddf = ddf.set_index("Time Stamp")
ddf.to_csv("/outdir/")
Questions :
Assuming dask will use disk to sort and write the multipart output, will it preserve the order after I read the output using read_csv?
How do I achieve the 2 part of the problem in dask. In pandas, I'd just apply and gather results in a new dataframe?
def getUnique(user_group): ## assuming the rows for each user are sorted by timestamp
res = list()
for val in user_group["Col1"]:
if val not in res:
res.append(val)
return res
Please direct me if there is a better alternative to dask.
So, I think I would approach this with two passes. In the first pass, I would look to run though all the csv files and build a data structure to hold the keys of user_id and col1 and the "best" timestamp. In this case, "best" will be the lowest.
Note: the use of dictionaries here only serves to clarify what we are attempting to do and if performance or memory was an issue, I would first look to reimplement without them where possible.
so, starting with CSV data like:
[
{"user_id": 1, "col1": "a", "timestamp": 1},
{"user_id": 1, "col1": "a", "timestamp": 2},
{"user_id": 1, "col1": "b", "timestamp": 4},
{"user_id": 1, "col1": "c", "timestamp": 3},
]
After processing all the csv files I hope to have an interim representation of:
{
1: {'a': 1, 'b': 4, 'c': 3}
}
Note that a representation like this could be created in parallel for each CSV and then re-distilled into a final interim representation via a pass 1.5 if you wanted to do that.
Now we can create a final representation based on the keys of this nested structure sorted by the inner most value. Giving us:
[
{'user_id': 1, 'col1': ['a', 'c', 'b']}
]
Here is how I might first approach this task before tweaking things for performance.
import csv
all_csv_files = [
"some.csv",
"bunch.csv",
"of.csv",
"files.csv",
]
data = {}
for csv_file in all_csv_files:
#with open(csv_file, "r") as file_in:
# rows = csv.DictReader(file_in)
## ----------------------------
## demo data
## ----------------------------
rows = [
{"user_id": 1, "col1": "a", "timestamp": 1},
{"user_id": 1, "col1": "a", "timestamp": 2},
{"user_id": 1, "col1": "b", "timestamp": 4},
{"user_id": 1, "col1": "c", "timestamp": 3},
]
## ----------------------------
## ----------------------------
## First pass to determine the "best" timestamp
## for a user_id/col1
## ----------------------------
for row in rows:
user_id = row['user_id']
col1 = row['col1']
ts_new = row['timestamp']
ts_old = (
data
.setdefault(user_id, {})
.setdefault(col1, ts_new)
)
if ts_new < ts_old:
data[user_id][col1] = ts_new
## ----------------------------
print(data)
## ----------------------------
## second pass to set order of col1 for a given user_id
## ----------------------------
data_out = [
{
"user_id": outer_key,
"col1": [
inner_kvp[0]
for inner_kvp
in sorted(outer_value.items(), key=lambda v: v[1])
]
}
for outer_key, outer_value
in data.items()
]
## ----------------------------
print(data_out)
I have two DataFrames, and I want to post these DataFrames as json (to the web service) but first I have to concatenate them as json.
#first df
input_df = pd.DataFrame()
input_df['first'] = ['a', 'b']
input_df['second'] = [1, 2]
#second df
customer_df = pd.DataFrame()
customer_df['first'] = ['c']
customer_df['second'] = [3]
For converting to json, I used following code for each DataFrame;
df.to_json(
path_or_buf='out.json',
orient='records', # other options are (split’, ‘records’, ‘index’, ‘columns’, ‘values’, ‘table’)
date_format='iso',
force_ascii=False,
default_handler=None,
lines=False,
indent=2
)
This code gives me the table like this: For ex, input_df export json
[
{
"first":"a",
"second":1
},
{
"first":"b",
"second":2
}
]
my desired output is like that:
{
"input": [
{
"first": "a",
"second": 1
},
{
"first": "b",
"second": 2
}
],
"customer": [
{
"first": "d",
"second": 3
}
]
}
How can I get this output like this? I couldn't find the way :(
You can concatenate the DataFrames with appropriate key names, then groupby the keys and build dictionaries at each group; finally build a json string from the entire thing:
out = (
pd.concat([input_df, customer_df], keys=['input', 'customer'])
.droplevel(1)
.groupby(level=0).apply(lambda x: x.to_dict('records'))
.to_json()
)
Output:
'{"customer":[{"first":"c","second":3}],"input":[{"first":"a","second":1},{"first":"b","second":2}]}'
or a dict by replacing the last to_json() to to_dict().
I need to convert this json file to a data frame in python:
print(resp2)
{
"totalCount": 1,
"nextPageKey": null,
"result": [
{
"metricId": "builtin:tech.generic.cpu.usage",
"data": [
{
"dimensions": [
"process_345678"
],
"dimensionMap": {
"dt.entity.process_group_instance": "process_345678"
},
"timestamps": [
1642021200000,
1642024800000,
1642028400000
],
"values": [
10,
15,
12
]
}
]
}
]
}
Output needs to be like this:
metricId dimensions timestamps values
builtin:tech.generic.cpu.usage process_345678 1642021200000 10
builtin:tech.generic.cpu.usage process_345678 1642024800000 15
builtin:tech.generic.cpu.usage process_345678 1642028400000 12
I have tried this:
print(pd.json_normalize(resp2, "data"))
I get invalid syntax, any ideas?
Take a look at the examples of json_normalize, and you'll see a list of dictionaries that have the key names of the columns you want, unique to each row. When you have nested lists/objects, then the columns will be flatten to have dot-notation, but nested arrays will not end up duplicated across rows.
Therefore, parse the data into a flat list, then you can use from_records.
data = []
for r in resp2['result']:
metricId = r['metricId']
for d in r['data']:
dimension = d['dimensions'][0] # unclear why this is an array
timestamps = d['timestamps']
values = d['values']
for t, v in zip(timestamps, values):
data.append({'metricId': metricId, 'dimensions': dimension, 'timestamps': t, 'values': v})
df = pd.DataFrame.from_records(data)
I have the data in CSV format, for example:
First row is column number, let's ignore that.
Second row, starting at Col_4, are number of days
Third row on: Col_1 and Col_2 are coordinates (lon, lat), Col_3 is a statistical value, Col_4 onwards are measurements.
As you can see, this format is a confusing mess. I would like to convert this to JSON the following way, for example:
{"points":{
"dates": ["20190103", "20190205"],
"0":{
"lon": "-8.072557",
"lat": "41.13702",
"measurements": ["-0.191792","-10.543130"],
"val": "-1"
},
"1":{
"lon": "-8.075557",
"lat": "41.15702",
"measurements": ["-1.191792","-2.543130"],
"val": "-9"
}
}
}
To summarise what I've done till now, I read the CSV to a Pandas DataFrame:
df = pandas.read_csv("sample.csv")
I can extract the dates into a Numpy Array with:
dates = df.iloc[0][3:].to_numpy()
I can extract measurements for all points with:
measurements_all = df.iloc[1:,3:].to_numpy()
And the lon and lat and val, respectively, with:
lon_all = df.iloc[1:,0:1].to_numpy()
lat_all = df.iloc[1:,1:2].to_numpy()
val_all = df.iloc[1:,2:3].to_numpy()
Can anyone explain how I can format this info a structure identical to the .json example?
With this dataframe
df = pd.DataFrame([{"col1": 1, "col2": 2, "col3": 3, "col4":3, "col5":5},
{"col1": None, "col2": None, "col3": None, "col4":20190103, "col5":20190205},
{"col1": -8.072557, "col2": 41.13702, "col3": -1, "col4":-0.191792, "col5":-10.543130},
{"col1": -8.072557, "col2": 41.15702, "col3": -9, "col4":-0.191792, "col5":-2.543130}])
This code does what you want, although it is not converting anything to strings as in you example. But you should be able to easily do it, if necessary.
# generate dict with dates
final_dict = {"dates": [list(df["col4"])[1],list(df["col5"])[1]]}
# iterate over relevant rows and generate dicts
for i in range(2,len(df)):
final_dict[i-2] = {"lon": df["col1"][i],
"lat": df["col2"][i],
"measurements": [df[cname][i] for cname in ["col4", "col5"]],
"val": df["col3"][i]
}
this leads to this output:
{0: {'lat': 41.13702,
'lon': -8.072557,
'measurements': [-0.191792, -10.54313],
'val': -1.0},
1: {'lat': 41.15702,
'lon': -8.072557,
'measurements': [-0.191792, -2.54313],
'val': -9.0},
'dates': [20190103.0, 20190205.0]}
Extracting dates from data and then eliminating first row from data frame:
dates =list(data.iloc[0][3:])
data=data.iloc[1:]
Inserting dates into dict:
points={"dates":dates}
Iterating through data frame and adding elements to the dictionary:
i=0
for index, row in data.iterrows():
element= {"lon":row["Col_1"],
"lat":row["Col_2"],
"measurements": [row["Col_3"], row["Col_4"]]}
points[str(i)]=element
i+=1
You can convert dict to string object using json.dumps():
points_json = json.dumps(points)
It will be string object, not json(dict) object. More about that in this post Converting dictionary to JSON
I converted the pandas dataframe values to a list, and then loop through one of the lists, and add the lists to a nested JSON object containing the values.
import pandas
import json
import argparse
import sys
def parseInput():
parser = argparse.ArgumentParser(description="Convert CSV measurements to JSON")
parser.add_argument(
'-i', "--input",
help="CSV input",
required=True,
type=argparse.FileType('r')
)
parser.add_argument(
'-o', "--output",
help="JSON output",
type=argparse.FileType('w'),
default=sys.stdout
)
return parser.parse_args()
def main():
args = parseInput()
input_file = args.input
output = args.output
dataframe = pandas.read_csv(input_file)
longitudes = dataframe.iloc[1:,0:1].T.values.tolist()[0]
latitudes = dataframe.iloc[1:,1:2].T.values.tolist()[0]
averages = dataframe.iloc[1:,2:3].T.values.tolist()[0]
measurements = dataframe.iloc[1:,3:].values.tolist()
dates=dataframe.iloc[0][3:].values.tolist()
points={"dates":dates}
for index, val in enumerate(longitudes):
entry = {
"lon":longitudes[index],
"lat":latitudes[index],
"measurements":measurements[index],
"average":averages[index]
}
points[str(index)] = entry
json.dump(points, output)
if __name__ == "__main__":
main()