I have a dataframe in the format below.
I want to send each row separately, as below:
{'timestamp': 'A',
 'tags': {
     'columnA': '1',
     'columnB': '11',
     'columnC': '21',
     ...}}
The columns vary and I cannot hard-code them. Each row should then be sent in this format to a Firestore collection, then the second row in the same format, and so on.
How can I do this?
Please don't mark the question as a duplicate without comparing the questions.
I am not clear on the Firebase part, but I think this might be what you want:
import json
import pandas as pd

# Data frame to work with
x = pd.DataFrame(data={'timestamp': 'A', 'ca': 1, 'cb': 2, 'cc': 3}, index=[0])
x = pd.concat([x, x], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
# rearranging so 'timestamp' comes first
x = x[['timestamp', 'ca', 'cb', 'cc']]

def new_json(row):
    # pair every column after 'timestamp' with its value
    return json.dumps(
        dict(timestamp=row['timestamp'],
             tags=dict(zip(row.index[1:], row[row.index[1:]].tolist()))))

print(x.apply(new_json, axis=1))
Output
The output is a pandas Series in which each entry is a str in the JSON format needed:
0    {"timestamp": "A", "tags": {"ca": 1, "cb": 2, "cc": 3}}
1    {"timestamp": "A", "tags": {"ca": 1, "cb": 2, "cc": 3}}
dtype: object
Related
Input: multiple CSVs with the same columns (800 million rows in total): [Time Stamp, User ID, Col1, Col2, Col3]
Memory available: 60 GB of RAM and a 24-core CPU
Problem: I want to group by User ID, sort by Time Stamp, and take the unique values of Col1, dropping duplicates while retaining the order based on the Time Stamp.
Solutions tried:
Tried using joblib to load the CSVs in parallel and pandas to sort and write to CSV (I get an error at the sorting step).
Used dask (new to Dask):
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

# Cannot use the full 60 gigs as there are others on the server
cluster = LocalCluster(dashboard_address=f':{port}', n_workers=4,
                       threads_per_worker=4, memory_limit='7GB')
client = Client(cluster)

ddf = dd.read_csv("/path/*.csv")
ddf = ddf.set_index("Time Stamp")
ddf.to_csv("/outdir/")
Questions:
Assuming dask will use disk to sort and write the multipart output, will it preserve the order after I read the output back using read_csv?
How do I achieve the second part of the problem in dask? In pandas, I'd just apply the function below and gather the results in a new dataframe (a sketch of that apply step follows the function):
def getUnique(user_group):  # assuming the rows for each user are sorted by timestamp
    res = list()
    for val in user_group["Col1"]:
        if val not in res:
            res.append(val)
    return res
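For reference, the pandas version of that apply step might look like the sketch below; it assumes all_csv_files is a list of the input paths and that the concatenated data fits in memory, which is exactly what fails at this scale:
import pandas as pd

# sketch only: concatenating 800 million rows will not fit in 60 GB of RAM
df = pd.concat(pd.read_csv(f) for f in all_csv_files)
result = (
    df.sort_values("Time Stamp")
      .groupby("User ID")
      .apply(getUnique)
)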
Please direct me if there is a better alternative to dask.
So, I think I would approach this with two passes. In the first pass, I would run through all the CSV files and build a data structure holding the keys of user_id and col1 and the "best" timestamp. In this case, "best" will be the lowest.
Note: the use of dictionaries here only serves to clarify what we are attempting to do and if performance or memory was an issue, I would first look to reimplement without them where possible.
So, starting with CSV data like:
[
{"user_id": 1, "col1": "a", "timestamp": 1},
{"user_id": 1, "col1": "a", "timestamp": 2},
{"user_id": 1, "col1": "b", "timestamp": 4},
{"user_id": 1, "col1": "c", "timestamp": 3},
]
After processing all the csv files I hope to have an interim representation of:
{
1: {'a': 1, 'b': 4, 'c': 3}
}
Note that a representation like this could be created in parallel for each CSV and then re-distilled into a final interim representation via a "pass 1.5" if you wanted to do that (a sketch of that merge step follows the code below).
Now we can create a final representation based on the keys of this nested structure, sorted by the innermost value, giving us:
[
{'user_id': 1, 'col1': ['a', 'c', 'b']}
]
Here is how I might first approach this task before tweaking things for performance.
import csv

all_csv_files = [
    "some.csv",
    "bunch.csv",
    "of.csv",
    "files.csv",
]

data = {}
for csv_file in all_csv_files:
    #with open(csv_file, "r") as file_in:
    #    rows = csv.DictReader(file_in)

    ## ----------------------------
    ## demo data
    ## ----------------------------
    rows = [
        {"user_id": 1, "col1": "a", "timestamp": 1},
        {"user_id": 1, "col1": "a", "timestamp": 2},
        {"user_id": 1, "col1": "b", "timestamp": 4},
        {"user_id": 1, "col1": "c", "timestamp": 3},
    ]
    ## ----------------------------

    ## ----------------------------
    ## First pass to determine the "best" timestamp
    ## for a user_id/col1
    ## ----------------------------
    for row in rows:
        user_id = row['user_id']
        col1 = row['col1']
        ts_new = row['timestamp']
        ts_old = (
            data
            .setdefault(user_id, {})
            .setdefault(col1, ts_new)
        )
        if ts_new < ts_old:
            data[user_id][col1] = ts_new
    ## ----------------------------

print(data)

## ----------------------------
## second pass to set order of col1 for a given user_id
## ----------------------------
data_out = [
    {
        "user_id": outer_key,
        "col1": [
            inner_kvp[0]
            for inner_kvp
            in sorted(outer_value.items(), key=lambda v: v[1])
        ]
    }
    for outer_key, outer_value
    in data.items()
]
## ----------------------------
print(data_out)
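If you did build the per-CSV interim dictionaries in parallel, the "pass 1.5" merge mentioned above might look like this sketch (merge_interims is a name I'm inventing here, not part of the code above):
def merge_interims(interim_a, interim_b):
    # keep the lowest ("best") timestamp for every user_id/col1 pair
    merged = {user_id: dict(cols) for user_id, cols in interim_a.items()}
    for user_id, cols in interim_b.items():
        target = merged.setdefault(user_id, {})
        for col1, ts in cols.items():
            if col1 not in target or ts < target[col1]:
                target[col1] = ts
    return merged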
How can I get JSON output from pandas where the rows are separated by newlines? For example, if I have a dataframe like:
import pandas as pd
data = [{'a': 1, 'b': 2},
{'a': 3, 'b': 4}]
df = pd.DataFrame(data)
print("First:\n", df.to_json(orient="records"))
print("Second:\n", df.to_json(orient="records", lines=True))
Output:
First:
[{"a":1,"b":2},{"a":3,"b":4}]
Second:
{"a":1,"b":2}
{"a":3,"b":4}
But I really want an output like so:
[{"a":1,"b":2},
{"a":3,"b":4}]
or
[
{"a":1,"b":2},
{"a":3,"b":4}
]
I really just want each record to be separated by a newline, while still being valid JSON that can be read back. I know I can use to_json with lines=True, split on newlines, and then .join, but I'm wondering if there is a more straightforward/faster solution using just pandas.
Here you go:
import json
list(json.loads(df.to_json(orient="index")).values())
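If you then need that list back as a single string in the bracketed, one-record-per-line shape from the question, a small join on top works (a sketch; note that json.dumps adds spaces after the colons, unlike pandas):
import json

records = list(json.loads(df.to_json(orient="index")).values())
print("[\n" + ",\n".join(json.dumps(r) for r in records) + "\n]")
# [
# {"a": 1, "b": 2},
# {"a": 3, "b": 4}
# ]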
Use the indent parameter:
import pandas as pd
data = [{'a': 1, 'b': 2},
{'a': 3, 'b': 4}]
df = pd.DataFrame(data)
print(df.to_json(orient="records", indent=1))
# output:
[
{
"a":1,
"b":2
},
{
"a":3,
"b":4
}
]
Why don't you just add the brackets? Note that with lines=True the records are newline-separated rather than comma-separated, so commas must be inserted for the result to be valid JSON:
body = df.to_json(orient="records", lines=True).strip().replace("\n", ",\n")
print(f"First:\n[{body}]")
print(f"Second:\n[\n{body}\n]")
We query a df with the code below:
json.loads(df.reset_index(drop=True).to_json(orient='table'))
The output is:
{"index": [ 0, 1 ,2, 3, 4],
"col1": [ "250"],
"col2": [ "1"],
"col3": [ "223"],
"col4": [ "2020-06-12 14:55"]
}
We need the output to look like this:
["250", "1", "223", "2020-06-12 14:55"], [.....], [.....]
json.loads(df.reset_index(drop=True).to_json(orient='values'))
Changing table to values solved my problem.
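As a quick check, here is a sketch with a one-row frame built from the values shown above:
import json
import pandas as pd

df = pd.DataFrame([{"col1": "250", "col2": "1", "col3": "223",
                    "col4": "2020-06-12 14:55"}])
print(json.loads(df.reset_index(drop=True).to_json(orient="values")))
# [['250', '1', '223', '2020-06-12 14:55']]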
What you call a "json" (there is no such data type) is a Python dictionary. Extract the values for the keys of interest using a list comprehension:
x = .... # Your dictionary
[x[col][0] for col in x if col.startswith("col")]
#['250', '1', '223', '2020-06-12 14:55']
We convert the JSON to a dataframe, dropping the column names:
df = pd.DataFrame(json_source)
df.columns = range(df.shape[1])  # replace the column names with positional indices
Then we convert it back to JSON format:
df.to_json(orient='table')
The issue:
I want to apply conditional formatting (icon_set) to a column using xlsxwriter, but I do not get the right arrows for the right values.
This is my desired output:
This is my current output:
This is my code:
writer.sheets[sheet].conditional_format('J54:K200', {
    'type': 'icon_set',
    'icon_style': '3_arrows',
    'icons': [
        {'criteria': '>=', 'type': 'number', 'value': 1},
        {'criteria': '>=', 'type': 'number', 'value': 0},
        {'criteria': '<',  'type': 'number', 'value': -1},
    ],
})
This is what I have looked at:
Besides similar questions here, this is what I have done:
I looked at Excel for the formula and compared it to my own work, starting from my output, to figure out the correct rule.
The closest I have got so far: when I change my icons' 'value' entries to 2, 1, 0 respectively, I get the 1 to show the middle orange arrow:
This tells me that my equality must be correct, yet it doesn't produce the expected result.
Thanks for any help provided!
If you eliminate {'criteria': '>=', 'type': 'number', 'value': 0} from your code, it should work fine. I have a reproducible example of this below with the expected output.
import pandas as pd

# Creating a sample dataframe for demonstration
df = pd.DataFrame({'col1': [0, -1, 10, -2], 'col2': [-11, 0, -3, 1]})

writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False)
writer.sheets['Sheet1'].conditional_format('A2:B5', {
    'type': 'icon_set',
    'icon_style': '3_arrows',
    'icons': [
        {'criteria': '>=', 'type': 'number', 'value': 0.001},
        {'criteria': '<=', 'type': 'number', 'value': -0.001},
    ],
})
writer.close()  # ExcelWriter.save() was removed in pandas 2.0
Expected test.xlsx:
I am trying to pull multiple values from Consul.
After pulling data using the following code:
import consul

c = consul.Consul("consulServer")
index, data = c.kv.get("key", recurse=True)  # recurse=True returns every key under the prefix
print(data)
I get the following JSON in my data list:
[ {
'LockIndex': 0,
'ModifyIndex': 54,
'Value': '1',
'Flags': 0,
'Key': 'test/one',
'CreateIndex': 54
}, {
'LockIndex': 0,
'ModifyIndex': 69,
'Value': '2',
'Flags': 0,
'Key': 'test/two',
'CreateIndex': 69
}]
I want to transform this output into a key:value JSON file. For this example it should look like:
{
"one": "1",
"two": "2"
}
I have two questions:
1. Is there a better way to get multiple values from the Consul KV store?
2. Assuming there is no better way, what is the best way to convert the JSON from the first example into the second one? (A sketch of what I mean follows.)
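For illustration, the reshaping I'm after could be done with a dict comprehension; this sketch assumes every key has exactly one prefix segment and that Value is already a string, as in the example above:
data = [
    {'LockIndex': 0, 'ModifyIndex': 54, 'Value': '1',
     'Flags': 0, 'Key': 'test/one', 'CreateIndex': 54},
    {'LockIndex': 0, 'ModifyIndex': 69, 'Value': '2',
     'Flags': 0, 'Key': 'test/two', 'CreateIndex': 69},
]
result = {item['Key'].split('/', 1)[1]: item['Value'] for item in data}
print(result)  # {'one': '1', 'two': '2'}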
Thanks,