Python: Reading a DataFrame

I'm trying to read through a dataframe row by row and grab what I want out of each row by index. The CSV I'm reading in looks something like this...
But when I read it in and run this code on it...
def sendKafkaMessagesTest(df):
    df.columns = ['Platform_Name', 'Index', 'Type', 'Weapon', 'Munitions', 'Location', 'Tracks', 'Time']
    for ind in df.index:
        data = {'platform_name': str(df['Platform_Name'][ind]),
                'tracks': str(df['Tracks'][ind]),
                'time': str(df['Time'][ind])}
        print(data)
        producer.send('numtest', data)
It produces this... {'platform_name': '540', 'tracks': '0', 'time': 'nan'}
I tried changing the columns, which I thought would work, but it's still a no-go. It's like it's not considering Row A to be part of the data or something. Any ideas?
EDIT: Reading CSV file as df = pd.read_csv(event.src_path)
EDIT: Expected output is {'platform_name': 'TSC2_commander', 'tracks': '0', 'time': '0'}
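One thing worth checking: pd.read_csv treats the first line of the file as the header by default, so if the CSV has no header row, the first record (the TSC2_commander row) gets consumed as column names and everything shifts. A minimal sketch of reading with explicit names instead, assuming the file really has no header line:

import pandas as pd

# header=None stops read_csv from treating the first data row as column names
df = pd.read_csv(event.src_path, header=None,
                 names=['Platform_Name', 'Index', 'Type', 'Weapon',
                        'Munitions', 'Location', 'Tracks', 'Time'])

for ind in df.index:
    data = {'platform_name': str(df['Platform_Name'][ind]),
            'tracks': str(df['Tracks'][ind]),
            'time': str(df['Time'][ind])}
    print(data)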


How to format query results as CSV?

My goal: automate executing a query and outputting the results into a CSV.
I have been successful in obtaining the query results using Python (this is my first project ever in Python). I am trying to format these results as a CSV but am completely lost; it's basically just creating 2 massive rows with the data not parsed out. The .txt and .csv results are attached (I obtained these by simply calling the query and entering "file name > results.txt" or "file name > results.csv").
txt results: {'data': {'get_result': {'job_id': None, 'result_id': '72a17fd2-e63c-4732-805a-ad6a7b980a99', '__typename': 'get_result_response'}}} {'data': {'query_results': [{'id': '72a17fd2-e63c-4732-805a-ad6a7b980a99', 'job_id': '05eb2527-2ca0-4dd1-b6da-96fb5aa2e67c', 'error': None, 'runtime': 157, 'generated_at': '2022-04-07T20:14:36.693419+00:00', 'columns': ['project_name', 'leaderboard_date', 'volume_30day', 'transactions_30day', 'floor_price', 'median_price', 'unique_holders', 'rank', 'custom_sort_order'], '__typename': 'query_results'}], 'get_result_by_result_id': [{'data': {'custom_sort_order': 'AA', 'floor_price': 0.375, 'leaderboard_date': '2022-04-07', 'median_price': 343.4, 'project_name': 'Terraforms by Mathcastles', 'rank': 1, 'transactions_30day': 2774, 'unique_holders': 2179, 'volume_30day': 744611.6252}, '__typename': 'get_result_template'}, {'data': {'custom_sort_order': 'AB', 'floor_price': 4.69471, 'leaderboard_date': '2022-04-07', 'median_price': 6.5, 'project_name': 'Meebits', 'rank': 2, 'transactions_30day': 4153, 'unique_holders': 6200, 'volume_30day': 163520.7377371168}, '__typename': 'get_result_template'}, etc. (repeats for 100s of rows)..
Your results text string actually contains two dictionaries separated by a space character.
Here's a formatted version of what's in each of them:
dict1 = {'data': {'get_result': {'job_id': None,
                                 'result_id': '72a17fd2-e63c-4732-805a-ad6a7b980a99',
                                 '__typename': 'get_result_response'}}}

dict2 = {'data': {'query_results': [{'id': '72a17fd2-e63c-4732-805a-ad6a7b980a99',
                                     'job_id': '05eb2527-2ca0-4dd1-b6da-96fb5aa2e67c',
                                     'error': None,
                                     'runtime': 157,
                                     'generated_at': '2022-04-07T20:14:36.693419+00:00',
                                     'columns': ['project_name',
                                                 'leaderboard_date',
                                                 'volume_30day',
                                                 'transactions_30day',
                                                 'floor_price',
                                                 'median_price',
                                                 'unique_holders',
                                                 'rank',
                                                 'custom_sort_order'],
                                     '__typename': 'query_results'}],
                  'get_result_by_result_id': [{'data': {'custom_sort_order': 'AA',
                                                        'floor_price': 0.375,
                                                        'leaderboard_date': '2022-04-07',
                                                        'median_price': 343.4,
                                                        'project_name': 'Terraforms by Mathcastles',
                                                        'rank': 1,
                                                        'transactions_30day': 2774,
                                                        'unique_holders': 2179,
                                                        'volume_30day': 744611.6252},
                                               '__typename': 'get_result_template'},
                                              {'data': {'custom_sort_order': 'AB',
                                                        'floor_price': 4.69471,
                                                        'leaderboard_date': '2022-04-07',
                                                        'median_price': 6.5,
                                                        'project_name': 'Meebits',
                                                        'rank': 2,
                                                        'transactions_30day': 4153,
                                                        'unique_holders': 6200,
                                                        'volume_30day': 163520.7377371168},
                                               '__typename': 'get_result_template'},
                                             ]}}
(BTW, I formatted them using the pprint module. This is often a good first step when dealing with these kinds of problems, so you know what you're dealing with.)
Ignoring the first one completely, and all but the repetitive data in the second (which I assume is all you really want), you could create a CSV file from the nested dictionary values in the dict2['data']['get_result_by_result_id'] list. Here's how that could be done using the csv.DictWriter class:
import csv
from pprint import pprint  # If needed.

output_filepath = 'query_results.csv'

# Determine CSV fieldnames based on keys of first dictionary.
fieldnames = dict2['data']['get_result_by_result_id'][0]['data'].keys()

with open(output_filepath, 'w', newline='') as outp:
    writer = csv.DictWriter(outp, delimiter=',', fieldnames=fieldnames)
    writer.writeheader()  # Optional.
    for result in dict2['data']['get_result_by_result_id']:
        # pprint(result['data'], sort_dicts=False)
        writer.writerow(result['data'])

print('fini')
Using the test data, here's the contents of the 'query_results.csv' file it created:
custom_sort_order,floor_price,leaderboard_date,median_price,project_name,rank,transactions_30day,unique_holders,volume_30day
AA,0.375,2022-04-07,343.4,Terraforms by Mathcastles,1,2774,2179,744611.6252
AB,4.69471,2022-04-07,6.5,Meebits,2,4153,6200,163520.7377371168
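The code above assumes dict2 already exists as a Python object. If you are starting from the raw results.txt text instead, each literal can be recovered with ast.literal_eval; here is a minimal sketch that scans brace depth to find the two top-level dictionaries (it assumes no { or } characters appear inside the string values):

import ast

def split_top_level_dicts(text):
    # Yield each top-level {...} literal, found by tracking brace depth.
    depth, start = 0, None
    for i, ch in enumerate(text):
        if ch == '{':
            if depth == 0:
                start = i
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                yield ast.literal_eval(text[start:i + 1])

with open('results.txt') as f:
    dict1, dict2 = split_top_level_dicts(f.read())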
It appears you have the data in a Python dictionary. The Google Sheet says access denied, so I can't see the whole data.
But essentially you want to convert the dictionary data to a CSV file.
At the bare-bones level you can use code like this to get where you need to. For your example you'll need to drill down to where the rows actually are, as sketched below.
import csv

new_path = open("mytest.csv", "w", newline="")  # newline="" avoids blank lines on Windows
file_dictionary = {"oliva": 199, "james": 145, "potter": 187}
z = csv.writer(new_path)
for new_k, new_v in file_dictionary.items():
    z.writerow([new_k, new_v])
new_path.close()
This guide should help you out.
https://pythonguides.com/python-dictionary-to-csv/
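Applied to the structure in the first answer, that "drilling down" could look something like this (a sketch reusing dict2 from above; it assumes every inner dict lists its keys in the same order):

import csv

rows = dict2['data']['get_result_by_result_id']
with open('mytest.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(rows[0]['data'].keys())    # header row
    for row in rows:
        writer.writerow(row['data'].values())  # one output line per result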
If I understand your question right, you should construct a DataFrame from your results and then save the DataFrame in .csv format. The pandas library is useful and easy to use.
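A minimal sketch of that approach, again reusing dict2 from the first answer:

import pandas as pd

rows = [r['data'] for r in dict2['data']['get_result_by_result_id']]
pd.DataFrame(rows).to_csv('query_results.csv', index=False)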

Faster way to iterate over columns in pandas

I have the following task.
I have this data:
import pandas
import numpy as np

data = {'name': ['Todd', 'Chris', 'Jackie', 'Ben', 'Richard', 'Susan', 'Joe', 'Rick'],
        'phone': [912341.0, np.nan, 912343.0, np.nan, 912345.0, 912345.0, 912347.0, np.nan],
        ' email': ['todd#gmail.com', 'chris#gmail.com', np.nan, 'ben#gmail.com', np.nan, np.nan, 'joe#gmail.com', 'rick#gmail.com'],
        'most_visited_airport': ['Heathrow', 'Beijing', 'Heathrow', np.nan, 'Tokyo', 'Beijing', 'Tokyo', 'Heathrow'],
        'most_visited_place': ['Turkey', 'Spain', np.nan, 'Germany', 'Germany', 'Spain', np.nan, 'Spain']
        }
df = pandas.DataFrame(data)
What I have to do: for every feature column (most_visited_airport etc.) and each of its values (Heathrow, Beijing, Tokyo), generate the personal information and output it to a file.
E.g. if we look at most_visited_airport and Heathrow,
I need to output three files containing the names, emails and phones of the people who visited that airport the most.
Currently, I have this code to do the operation for both columns and all the values:
columns_to_iterate = [x for x in df.columns if 'most' in x]

for each in df[columns_to_iterate]:
    values = df[each].dropna().unique()
    for i in values:
        df1 = df.loc[df[each] == i, 'name']
        df2 = df.loc[df[each] == i, ' email']
        df3 = df.loc[df[each] == i, 'phone']
        df1.to_csv(f'{each}_{i}_{df1.name}.csv')
        df2.to_csv(f'{each}_{i}_{df2.name}.csv')
        df3.to_csv(f'{each}_{i}_{df3.name}.csv')
Is it possible to do this in a more elegant and maybe faster way? Currently I have a small dataset, but I'm not sure this code will perform well with big data. My particular concern is the nested loops.
Thank you in advance!
You could replace the call to unique with a groupby, which would not only get the unique values, but split up the dataframe for you:
for column in df.filter(regex='^most'):
    for key, group in df.groupby(column):
        for attr in ('name', 'phone', ' email'):
            group[attr].dropna().to_csv(f'{column}_{key}_{attr}.csv')
(Note the column name ' email' has a leading space in your data, and groupby already skips NaN keys, so no extra dropna is needed on the grouping column itself.)
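With the sample frame above, that loop writes one file per column/value/attribute combination, named like this (an illustration):
most_visited_airport_Beijing_name.csv
most_visited_airport_Beijing_phone.csv
most_visited_airport_Heathrow_name.csv
...and so on for each airport and place value.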
You can do it this way.
cols = df.filter(regex='most').columns.values

def func_current_cols_to_csv(most_col):
    place = [i for i in df[most_col].dropna().unique().tolist()]
    csv_cols = ['name', 'phone', ' email']
    result = [df[df[most_col] == i][j].dropna().to_csv(f'{most_col}_{i}_{j}.csv', index=False)
              for i in place for j in csv_cols]
    return result

[func_current_cols_to_csv(i) for i in cols]
Also, among the options when writing to CSV, you can keep the index, but don't forget to reset it before writing.
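For illustration, keeping (and resetting) the index would look like this (a minimal sketch of one of the writes above):

# keep an index column in the CSV, but renumber it from 0 first
df[df[most_col] == i][j].dropna().reset_index(drop=True).to_csv(f'{most_col}_{i}_{j}.csv')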

create a pandas dataframe from python list containing tuple with nested dictionary

I have wrestled with this for a few days now, but can't figure it out.
I'm trying to create a dataframe "account_activity" from the results of an API GET.
I make an API call and print it out:
account_activities = api.get_activities()
print(account_activities)
returns:
[AccountActivity({'activity_type': 'FILL',
                  'cum_qty': '100',
                  'id': '20211111105648607::a0ef3f04-ff00-4b8e-834d-54737d89c332',
                  'leaves_qty': '0',
                  'order_id': '32c9a40e-e6d2-4c7c-8949-a39ad32b535f',
                  'order_status': 'filled',
                  'price': '187.09',
                  'qty': '56',
                  'side': 'sell',
                  'symbol': 'U',
                  'transaction_time': '2021-11-11T15:56:48.607222Z',
                  'type': 'fill'})]
How do I create a dataframe "account_activity" where the keys are the column headers, transaction_time is the row index, and the values fill the rows?
Assuming j is the JSON from your AccountActivity object:
df = pd.DataFrame(j, index=['']).set_index('transaction_time',drop=True)
How you get the JSON depends on the APIs you're using. Perhaps
j = account_activities[0].__dict__
will work?
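If the call returns several activities, the same idea extends to a list of records (a sketch; it assumes each AccountActivity exposes its fields through __dict__, which may vary by client library version):

import pandas as pd

# one dict per activity; assumes __dict__ holds the plain field mapping
records = [a.__dict__ for a in account_activities]
df = pd.DataFrame(records).set_index('transaction_time')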

how to read .SDL text file?

I have a .SDL text file in this format:
244455|199|6577888|20210401|138.61|0.78|83.16|0.00|0.00|221.77|6|0.00|17000
Is there any Python library to read and interpret such an .SDL text file?
I am assuming that there is no more than one line in the file.
data.sdl
490797|C|64||BLAH BLAH BLAH||||0|190/0000/07|A|1998889|198666566|||8990900|BLAGHHH72|L78899|||0040|012|432565|012|435659||MBLAHAHAHAHASIE|2WES|ARGHKKHHHT|PRE||0002|012|432565|012|435659||MR. JOHN DOE|PO BOX 198898|SILUHHHHH||0052|661|13||82110|35000000|2|0|||||0|0||||Y||70877746414|R
Python script to extract the data into a list:
data_list = []

# with open('path/to/file.sdl') as file:
with open('data.sdl', 'r') as file:
    data = file.read()
    data_list = data.split('|')
    data_list[-1] = data_list[-1].strip()       # drop the trailing newline
    data_list = list(filter(None, data_list))   # removes empty fields (note: field positions shift)
Output:
['490797', 'C', '64', 'BLAH BLAH BLAH', '0', '190/0000/07', 'A', '1998889', '198666566', '8990900', 'BLAGHHH72', 'L78899', '0040', '012', '432565', '012', '435659', 'MBLAHAHAHAHASIE', '2WES', 'ARGHKKHHHT', 'PRE', '0002', '012', '432565', '012', '435659', 'MR. JOHN DOE', 'PO BOX 198898', 'SILUHHHHH', '0052', '661', '13', '82110', '35000000', '2', '0', '0', '0', 'Y', '70877746414', 'R']
Please let me know if you need anything else.
Presuming there are more rows than you've provided, all in the same format, pandas .read_csv() will be able to load this up for you!
import pandas as pd

# header=None, since the file has no header row to use for column names
df = pd.read_csv("my_path/whateverfilename.sdl", sep="|", header=None)
This will create a DataFrame object for you, which may be what you're after
If you just wanted each row as a list, you can simply load the file and .split() each line, though this will probably be harder to work with overall
split_lines = []
with open("my_path/whateverfilename.sdl") as fh:
    for line in fh:  # file-like objects are iterable by-line
        # rstrip so the '\n' doesn't stick to the last field
        split_lines.append(line.rstrip("\n").split("|"))
Assuming that each line has the same number of columns:
File './path_to_data':
244455|199|6577888|20210401|138.61|0.78|83.16|0.00|0.00|221.77|6|0.00|17000
||||0||0|| , |C|64||
Data "reader":
import numpy as np

path = './path_to_data'
N_COLS = 13

# declare the data type of each column - in this case Python objects
dts = np.dtype(', '.join(['O'] * N_COLS))

data = np.loadtxt(fname=path, delimiter='|', dtype=dts, unpack=False, skiprows=0, max_rows=None)

for i in data:
    print(i)
Output
('244455', '199', '6577888', '20210401', '138.61', '0.78', '83.16', '0.00', '0.00', '221.77', '6', '0.00', '17000')
('', '', '', '', '0', '', '0', '', ' , ', 'C', '64', '', '')
To get the data as columns instead of rows, pass unpack=True.
skiprows tells it from which line to start reading (skiprows=0 starts at the first line).
max_rows tells it where to stop reading; if None (the default), it reads everything.
Here is the doc.
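For instance, a small illustration of unpack, using the same path and dtype as above:

# with a structured dtype and unpack=True, loadtxt returns one array per field
columns = np.loadtxt(fname=path, delimiter='|', dtype=dts, unpack=True)
print(columns[0])  # every value from the first column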

Finding the min of a column across multiple lists in python

I need to find the minimum and maximum of a given column from a CSV file. Currently the value is a string, but I need it to be an integer. Right now, after I have split all the lines into lists, my output looks like this:
['FRA', 'Europe', 'France', '14/06/2020', '390', '10\n']
['FRA', 'Europe', 'France', '11/06/2020', '364', '27\n']
['FRA', 'Europe', 'France', '12/06/2020', '802', '28\n']
['FRA', 'Europe', 'France', '13/06/2020', '497', '24\n']
And from that line, along with its many others, I want to find the minimum of the 5th column. Currently when I do
min(column[4])
it just gives the min of each individual list, which is just the number in that column, rather than grouping them all up and getting the minimum across all of them.
P.S.: I am very new to Python and coding in general. I also have to do this without importing any modules.
For you Azro.
def main(csvfile, country, analysis):
    infile = csvfile
    datafile = open(infile, "r")
    country = country.capitalize()
    if analysis == "statistics":
        for line in datafile.readlines():
            column = line.split(",")
            if column[2] == country:
You may use pandas, which allows you to read CSV files and manipulate them as DataFrames; then it's very easy to retrieve the min/max of a column:
import pandas as pd
df = pd.read_csv("test.txt", sep=',')
mini = df['colName'].min()
maxi = df['colName'].max()
print(mini, maxi)
Then, if you have already read your data into a list of lists, you may use the builtin min and max:
# use rstrip() when reading each line, to remove the trailing \n
values = [
    ['FRA', 'Europe', 'France', '14/06/2020', '390', '10'],
    ['FRA', 'Europe', 'France', '14/06/2020', '395', '10'],
]
mini = min(values, key=lambda x: int(x[4]))[4]
maxi = max(values, key=lambda x: int(x[4]))[4]
Note that these still return the value as a string ('390' here); wrap the result in int(...) if you need a number.
Take a look at the library pandas and especially the DataFrame class. This is probably the go-to method for handling .csv files and tabular data in general.
Essentially, your code would be something like this:
import pandas as pd
df = pd.read_csv('my_file.csv') # Construct a DataFrame from a csv file
print(df.columns) # check to see which column names the dataframe has
print(df['My Column'].min())
print(df['My Column'].max())
There are shorter ways to do this. But this example goes step by step:
# After you read a CSV file, you'll have a bunch of rows.
rows = [
    ['A', '390', '...'],
    ['B', '750', '...'],
    ['C', '207', '...'],
]

# Grab the column that you want.
col = [row[1] for row in rows]

# Convert strings to integers.
vals = [int(s) for s in col]

# Print min and max.
print(min(vals))
print(max(vals))
