I am using Python 3.6.4. I am running an experiment to find a faster alternative to pandas.to_csv(), using numpy.savetxt() to store the dataframe as CSV in a file. By default, this function writes the string nan for np.nan values in the dataframe. I want to write '' (an empty string) for np.nan values in the CSV output.
I tried np.set_printoptions(nanstr=''), which doesn't seem to make any difference. I tried changing nanstr to various string values, but the option does not appear to be honored at all, even though I can see it being set correctly in np.get_printoptions().
import numpy as np
import pandas as pd

df0 = pd.DataFrame({'id': ['1_node', '2_node', '3_node', '4_node'],
                    'prop1': [np.nan, np.nan, 'ABC', 'DEF'],
                    'prop2': [1, np.nan, 2, np.nan]})

print("Numpy version: {}".format(np.__version__))
np.set_printoptions(nanstr='')
print(np.get_printoptions())
np.savetxt('temp.op', df0.values, fmt="%s", comments='', delimiter=",")
Output:
Numpy version: 1.14.0
{'edgeitems': 3, 'threshold': 1000, 'floatmode': 'maxprec', 'precision': 8, 'suppress': False, 'linewidth': 75, 'nanstr': '', 'infstr': 'inf', 'sign': '-', 'formatter': None, 'legacy': False}
temp.op
1_node,nan,1.0
2_node,nan,nan
3_node,ABC,2.0
4_node,DEF,nan
Expected output:
1_node,,1.0
2_node,,
3_node,ABC,2.0
4_node,DEF,
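For reference, np.set_printoptions only controls how arrays are rendered by str/repr; np.savetxt pushes every element through the fmt string ("%s" % np.nan gives 'nan'), so nanstr is never consulted. A minimal workaround sketch, assuming it is acceptable to replace the NaNs on the pandas side before writing:

# Workaround sketch: fill NaN with '' first; the affected columns become object dtype,
# so "%s" renders empty strings instead of 'nan'
np.savetxt('temp.op', df0.fillna('').values, fmt="%s", comments='', delimiter=",")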
Related
My goal: automate executing a query and outputting the results into a CSV.
I have been successful in obtaining the query results using Python (this is my first project ever in Python). I am trying to format these results as a CSV but am completely lost; it basically just creates 2 massive rows with all the data not parsed out. The .txt and .csv results are attached (I obtained these by simply calling the query and entering "file name > results.txt" or "file name > results.csv").
txt results: {'data': {'get_result': {'job_id': None, 'result_id': '72a17fd2-e63c-4732-805a-ad6a7b980a99', '__typename': 'get_result_response'}}} {'data': {'query_results': [{'id': '72a17fd2-e63c-4732-805a-ad6a7b980a99', 'job_id': '05eb2527-2ca0-4dd1-b6da-96fb5aa2e67c', 'error': None, 'runtime': 157, 'generated_at': '2022-04-07T20:14:36.693419+00:00', 'columns': ['project_name', 'leaderboard_date', 'volume_30day', 'transactions_30day', 'floor_price', 'median_price', 'unique_holders', 'rank', 'custom_sort_order'], '__typename': 'query_results'}], 'get_result_by_result_id': [{'data': {'custom_sort_order': 'AA', 'floor_price': 0.375, 'leaderboard_date': '2022-04-07', 'median_price': 343.4, 'project_name': 'Terraforms by Mathcastles', 'rank': 1, 'transactions_30day': 2774, 'unique_holders': 2179, 'volume_30day': 744611.6252}, '__typename': 'get_result_template'}, {'data': {'custom_sort_order': 'AB', 'floor_price': 4.69471, 'leaderboard_date': '2022-04-07', 'median_price': 6.5, 'project_name': 'Meebits', 'rank': 2, 'transactions_30day': 4153, 'unique_holders': 6200, 'volume_30day': 163520.7377371168}, '__typename': 'get_result_template'}, etc. (repeats for 100s of rows)..
Your results text string actually contains two dictionaries separated by a space character.
Here's a formatted version of what's in each of them:
dict1 = {'data': {'get_result': {'job_id': None,
'result_id': '72a17fd2-e63c-4732-805a-ad6a7b980a99',
'__typename': 'get_result_response'}}}
dict2 = {'data': {'query_results': [{'id': '72a17fd2-e63c-4732-805a-ad6a7b980a99',
'job_id': '05eb2527-2ca0-4dd1-b6da-96fb5aa2e67c',
'error': None,
'runtime': 157,
'generated_at': '2022-04-07T20:14:36.693419+00:00',
'columns': ['project_name',
'leaderboard_date',
'volume_30day',
'transactions_30day',
'floor_price',
'median_price',
'unique_holders',
'rank',
'custom_sort_order'],
'__typename': 'query_results'}],
'get_result_by_result_id': [{'data': {'custom_sort_order': 'AA',
'floor_price': 0.375,
'leaderboard_date': '2022-04-07',
'median_price': 343.4,
'project_name': 'Terraforms by Mathcastles',
'rank': 1,
'transactions_30day': 2774,
'unique_holders': 2179,
'volume_30day': 744611.6252},
'__typename': 'get_result_template'},
{'data': {'custom_sort_order': 'AB',
'floor_price': 4.69471,
'leaderboard_date': '2022-04-07',
'median_price': 6.5,
'project_name': 'Meebits',
'rank': 2,
'transactions_30day': 4153,
'unique_holders': 6200,
'volume_30day': 163520.7377371168},
'__typename': 'get_result_template'},
]}}
(BTW, I formatted them using the pprint module. This is often a good first step when dealing with these kinds of problems, so you know what you're dealing with.)
Ignoring the first one completely, and all but the repetitive data in the second (which I assume is all you really want), you could create a CSV file from the nested dictionary values in the dict2['data']['get_result_by_result_id'] list. Here's how that could be done using the csv.DictWriter class:
import csv
from pprint import pprint  # If needed.

output_filepath = 'query_results.csv'

# Determine CSV fieldnames based on keys of first dictionary.
fieldnames = dict2['data']['get_result_by_result_id'][0]['data'].keys()

with open(output_filepath, 'w', newline='') as outp:
    writer = csv.DictWriter(outp, delimiter=',', fieldnames=fieldnames)
    writer.writeheader()  # Optional.
    for result in dict2['data']['get_result_by_result_id']:
        # pprint(result['data'], sort_dicts=False)
        writer.writerow(result['data'])

print('fini')
Using the test data, here's the contents of the 'query_results.csv' file it created:
custom_sort_order,floor_price,leaderboard_date,median_price,project_name,rank,transactions_30day,unique_holders,volume_30day
AA,0.375,2022-04-07,343.4,Terraforms by Mathcastles,1,2774,2179,744611.6252
AB,4.69471,2022-04-07,6.5,Meebits,2,4153,6200,163520.7377371168
It appears you have the data in a Python dictionary. The Google Sheet says access denied, so I can't see the whole data.
But essentially you want to convert the dictionary data to a CSV file.
At its most basic, you can use code like the following to get where you need to. For your example you'll need to drill down to where the rows actually are.
import csv

new_path = open("mytest.csv", "w", newline="")  # newline="" avoids blank lines between rows on Windows
file_dictionary = {"oliva": 199, "james": 145, "potter": 187}
z = csv.writer(new_path)
for new_k, new_v in file_dictionary.items():
    z.writerow([new_k, new_v])
new_path.close()
This guide should help you out.
https://pythonguides.com/python-dictionary-to-csv/
If I understand your question right, you should construct a dataframe from your results and then save the dataframe in .csv format. The Pandas library is useful and easy to use.
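For example, a minimal sketch of that approach, assuming dict2 is the parsed result dictionary shown above:

import pandas as pd

# Pull out just the per-row 'data' dictionaries and let pandas build the columns
rows = [r['data'] for r in dict2['data']['get_result_by_result_id']]
df = pd.DataFrame(rows)
df.to_csv('query_results.csv', index=False)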
I had a very simple idea: Use Python Pandas (for convenience) to do some simple database operations with moderate data amounts and write the data back to S3 in Parquet format.
Then, the data should be exposed to Redshift as an external table in order to not take storage space from the actual Redshift cluster.
I found two ways to do that.
Given the data:
from datetime import date, datetime

import pandas as pd

data = {
    'int': [1, 2, 3, 4, None],
    'float': [1.1, None, 3.4, 4.0, 5.5],
    'str': [None, 'two', 'three', 'four', 'five'],
    'boolean': [True, None, True, False, False],
    'date': [
        date(2000, 1, 1),
        date(2000, 1, 2),
        date(2000, 1, 3),
        date(2000, 1, 4),
        None,
    ],
    'timestamp': [
        datetime(2000, 1, 1, 1, 1, 1),
        datetime(2000, 1, 1, 1, 1, 2),
        None,
        datetime(2000, 1, 1, 1, 1, 4),
        datetime(2000, 1, 1, 1, 1, 5),
    ]
}

df = pd.DataFrame(data)
df['int'] = df['int'].astype(pd.Int64Dtype())
df['date'] = df['date'].astype('datetime64[D]')
df['timestamp'] = df['timestamp'].astype('datetime64[s]')
The type casts at the end are necessary in both cases to ensure that Pandas' type inference does not interfere.
With PyArrow:
Using PyArrow, you do it like this:

import pyarrow as pa

pyarrow_schema = pa.schema([
    ('int', pa.int64()),
    ('float', pa.float64()),
    ('str', pa.string()),
    ('boolean', pa.bool_()),  # field name must match the dataframe column name
    ('date', pa.date64()),
    ('timestamp', pa.timestamp(unit='s'))
])

df.to_parquet(
    path='pyarrow.parquet',
    schema=pyarrow_schema,
    engine='pyarrow'
)
Why use PyArrow: Pandas' default engine for Parquet export is PyArrow, so you can expect good integration. Also, PyArrow provides extensive features and caters for many datatypes.
With fastparquet:
First you need to write out the data with these additional steps:
from fastparquet import write
write('fast.parquet', df, has_nulls=True, times='int96')
The important bit here is the 'times' parameter. See this post, where I found a remedy for the 'date' column.
Why use fastparquet: Fastparquet is much more limited than PyArrow, especially, when it comes to accepted datatypes. On the other hand, the package is much smaller.
The external table:
Given, that you have exported your data to Parquet and stored it in S3, you can then expose it to Redshift like this:
CREATE EXTERNAL TABLE "<your_external_schema>"."<your_table_name>" (
    "int" bigint,
    "float" float,
    "str" varchar(255),
    "boolean" bool,
    "date" date,
    "timestamp" timestamp)
ROW FORMAT SERDE
    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
    's3://<your_bucket>/<your_prefix>/';
Final story and note:
When I started working with Pandas, Parquet and external Redshift tables in the context of AWS Lambda functions, everything was fine for a while. Until I reached the point where my Lambda bundle hit its allowed limit (deployment package size). Checking which of my dependencies accounted for all that, I found PyArrow, Pandas and Numpy (a dependency of Pandas) to be the culprits. While I could definitely not drop Numpy (for efficiency) and did not want to lose Pandas (convenience, again), I looked to replace PyArrow with something more lightweight. Et voilà: fastparquet. After some research and a lot of experimentation, I could make this work as well.
I hope some other people find this explanation and these resources helpful.
The question already holds the answer. :)
I have a dataframe of the below format.
I want to send each row separately as below:
{'timestamp': 'A',
 'tags': {
     'columnA': '1',
     'columnB': '11',
     'columnC': '21',
     ...
 }}
The columns vary and I cannot hard-code them. Each row then needs to be sent to a Firestore collection, then the second row in the above format, and so on.
How can I do this?
And please don't mark the question as a duplicate without comparing the questions.
I am not clear on the Firebase part, but I think this might be what you want.
import json

import pandas as pd

# Data frame to work with
x = pd.DataFrame(data={'timestamp': 'A', 'ca': 1, 'cb': 2, 'cc': 3}, index=[0])
x = x.append(x, ignore_index=True)  # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
# rearranging
x = x[['timestamp', 'ca', 'cb', 'cc']]

def new_json(row):
    return json.dumps(
        dict(timestamp=row['timestamp'],
             tag=dict(zip(row.index[1:], row[row.index[1:]].values.tolist()))))

print(x.apply(new_json, raw=False, axis=1))
Output
The output is a pandas Series with each entry being a str in the JSON format needed:
0 '{"timestamp": "A", "tag": {"cc": 3, "cb": 2, "ca": 1}}'
1 '{"timestamp": "A", "tag": {"cc": 3, "cb": 2, "ca": 1}}'
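If the goal is then to push each row into Firestore, here is a minimal sketch of one way to do it, assuming the google-cloud-firestore client library and writing plain dicts rather than JSON strings; the collection name 'rows' is a placeholder:

from google.cloud import firestore

db = firestore.Client()

for _, row in x.iterrows():
    doc = {'timestamp': row['timestamp'],
           'tags': {col: str(row[col]) for col in x.columns if col != 'timestamp'}}
    db.collection('rows').add(doc)  # 'rows' is a placeholder collection name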
I have two PySpark dataframes that I'm trying to join into a new dataframe. The join operation seems to show a null dataframe.
I'm using Jupyter notebooks to evaluate the code, on a PySpark kernel, on a cluster with a single master, 4 workers, YARN for resource allocation.
from pyspark.sql.functions import monotonically_increasing_id,udf
from pyspark.sql.types import FloatType
from pyspark.mllib.linalg import DenseVector
firstelement=udf(lambda v:float(v[1]),FloatType())
a = [{'c_id': 'a', 'cv_id': 'b', 'id': 1}, {'c_id': 'c', 'cv_id': 'd', 'id': 2}]
ip = spark.createDataFrame(a)
b = [{'probability': DenseVector([0.99,0.01]), 'id': 1}, {'probability': DenseVector([0.6,0.4]), 'id': 2}]
op = spark.createDataFrame(b)
op.show() #shows the df
#probability, id
#[0.99, 0.01], 1
##probability is a dense vector, id is bigint
ip.show() #shows the df
#c_id, cv_id, id
#a,b,1
##c_id and cv_id are strings, id is bigint
op_final = op.join(ip, ip.id == op.id).select('c_id','cv_id',firstelement('probability')).withColumnRenamed('<lambda>(probability)','probability')
op_final.show() #gives a null df
#but the below seems to work, however, quite slow
ip.collect()
op.collect()
op_final.collect()
op_final.show() #shows the joined df
Perhaps it's my lack of expertise with Spark, but could someone please explain why I'm able to see the first two dataframes, but not the joined dataframe unless I use collect()?
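For debugging, a small sketch using standard Spark calls (this only narrows down where the rows disappear, it is not an explanation of the root cause):

op_final.explain()             # print the physical plan for the join
print(op_final.count())        # forces evaluation; confirms whether the join returns any rows at all
op_final.show(truncate=False)  # show without truncating the probability column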
I would like to create and use an index on "age" key of a standard column family.
I did the following using pycassa:
In [10]: sys.create_index('test01', 'word_map', 'age', 'IntegerType', index_type=0, index_name='index_age')
In [11]: age_expr = create_index_expression('age', 6, GT)
In [12]: clause = create_index_clause([age_expr], count=20)
In [13]: cf.get_indexed_slices(clause)
error: 'No indexed columns present in index clause with operator EQ'
According to this nice page, I need to set the value type. However:
In [16]: cf_words.column_validators
Out[16]: {'black_white': 'BooleanType', 'url': 'UTF8Type', 'age': 'IntegerType', 'site': 'UTF8Type', 'len': 'IntegerType', 'content': 'UTF8Type', 'colourful': 'BooleanType', 'printer_friendly': 'BooleanType'}
so age has a data type set.
Any ideas?
Instead of the string 'GT', use pycassa.index.GT. It's an enum that Thrift implements with integers.
You can find all of the documentation and an example usage here: http://pycassa.github.com/pycassa/api/pycassa/index.html
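For example, a minimal sketch of the corrected call, with cf being the column family from the question:

import pycassa.index as index

# index.GT is the Thrift integer enum value, not the string 'GT'
age_expr = index.create_index_expression('age', 6, index.GT)
clause = index.create_index_clause([age_expr], count=20)
rows = cf.get_indexed_slices(clause)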