Save document with tuples - python

I have a dataset containing multiple columns and each column contains tuple data. I want to save this data so that I can recall it in different python notebooks without having to rerun everything again. Any format is okay (csv, JSON, etc.).
id | pos_tag | clean data
1 | [(sangat, RB), (baik, JJ)] | [baik]
2 | [(sangat, RB), (membantu, VB)] | [membantu]
3 | [(kenapa, WH), (kok, NN), (perbaikan, NN), (sistem, NN), (bayar, VB), (bisa, MD)] | [perbaikan, sistem, bayar]
This is what I found so far...
df.to_pickle('test.pkl')
new_df = pd.read_pickle('test.pkl')
I would also like help with how to load the data back.

In case you want to write the data to csv and read it back, here is an example:
# Sample dataframe written to csv file and read back
import pandas as pd
d = {'id': [1, 2], 'pos_tag': [(1, 2), ("a", "b", "c")]}
df1 = pd.DataFrame(d)
df1.to_csv("/content/df1.csv", sep="|", index=False)
df2 = pd.read_csv("/content/df1.csv", sep="|")
df2.head()
I saved it to csv with a pipe delimiter so that the separator cannot be confused with other characters in the data, such as the commas inside the tuples or semicolons.
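# Retrieve tuple values (the csv stores them as strings)
# ast.literal_eval is used here instead of eval: it only parses Python literals
import ast
df2['pos_tag'] = [ast.literal_eval(s) for s in df2['pos_tag']]
df2.head()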
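For comparison, the pickle route from the question should keep the tuples as tuples without any parsing step. Here is a minimal sketch of the round trip; the sample rows are copied from the question's table, and the temporary directory and the column name clean_data are only assumptions for illustration:

```python
import os
import tempfile

import pandas as pd

# Two sample rows taken from the question's table
df = pd.DataFrame({
    "id": [1, 2],
    "pos_tag": [
        [("sangat", "RB"), ("baik", "JJ")],
        [("sangat", "RB"), ("membantu", "VB")],
    ],
    "clean_data": [["baik"], ["membantu"]],
})

# Write to a temporary folder (an assumption; any writable path works)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.pkl")
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# The elements come back as real tuples, not strings
first_tag = restored.loc[0, "pos_tag"][0]
print(type(first_tag), first_tag)
```

In another notebook only the pd.read_pickle call is needed. Pickle files should only be loaded from sources you trust.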

Related

How to separate .csv data into different columns

I have a text file with data which looks like this:
NCP_341_1834_0022.png 2 0 130 512 429
I would like to split the data into different columns with names like this:
['filename','class','xmin','ymin','xmax','ymax']
I have done this:
test_txt = pd.read_csv(r"../input/covidxct/train_COVIDx_CT-3A.txt")
test_txt.to_csv(r"../working/test/train.csv",index=None, sep='\t')
train = pd.read_csv("../working/test/train.csv")
However when I download the .csv file, it gives me the data line all in one column, as opposed to 6 columns. How can I fix this?
Just set the right separator (the default is ','):
test_txt = pd.read_csv(r"../input/covidxct/train_COVIDx_CT-3A.txt", sep=' ', header=None)
This assumes you are using test_COVIDx_CT-3A.txt from Kaggle.
Don't forget to set header=None, since the file has no header row. You can also pass names=['image', 'col1', 'col2', ...] to replace the default column names (0, 1, 2, ...).
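To show the separator and the column names working together, here is a small self-contained sketch. It reads the two sample lines that appear in this thread from an in-memory string instead of the Kaggle file, so the io.StringIO buffer is only a stand-in for the real path:

```python
import io

import pandas as pd

# Two sample lines in the same space-separated layout as the text file
raw = (
    "NCP_341_1834_0022.png 2 0 130 512 429\n"
    "NCP_96_1328_0032.png 2 9 94 512 405\n"
)
columns = ["filename", "class", "xmin", "ymin", "xmax", "ymax"]

# sep=" " splits on single spaces; header=None plus names= sets the column labels
train = pd.read_csv(io.StringIO(raw), sep=" ", header=None, names=columns)
print(train)
```

With the columns split at read time, a later to_csv call writes six proper columns.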
Just to answer my own question: you can use str.split to split the single column of the .csv file into separate columns. I split it into 6 columns for my 6 labels:
train[['filename', 'class','xmin','ymin','xmax','ymax']] = train['NCP_96_1328_0032.png 2 9 94 512 405'].str.split(' ', n=5, expand=True)
train.head()
Then just drop the column you don't need:
train = train.drop(train.columns[[0]], axis=1)

Python Pandas Dataframe from API JSON Response

I am new to Python. Could I get some help from the experts here?
I want to construct a dataframe from the https://api.cryptowat.ch/markets/summaries JSON response, based on the following filter criteria:
Kraken-listed currency pairs only (note that there are also kraken-futures pairs, which I don't want)
Currencies paired with USD only, i.e. aaveusd, adausd, ...
The ideal dataframe I am looking for is shown in the screenshot (somehow Excel loads this JSON perfectly):
[Screenshot: Dataframe_Excel_Screenshot]
resp = requests.get("https://api.cryptowat.ch/markets/summaries")
kraken_assets = resp.json()
df = pd.json_normalize(kraken_assets)
print(df)
Output:
result.binance-us:aaveusd.price.last result.binance-us:aaveusd.price.high ...
0 264.48 267.32 ...
[1 rows x 62688 columns]
When I paste the link into a browser, the JSON response uses double quotes ("), but when I get it via Python code, all double quotes are changed to single quotes ('). Any idea why? I tried to solve it with json_normalize, but then the response becomes [1 rows x 62688 columns]. I am not sure how to work with 1 row and 62k columns, and I don't know how to extract the exact info into the dataframe format I need (please see the Excel screenshot).
Any help is much appreciated. Thank you!
the result JSON is a dict
load this into a dataframe
decode columns into products & measures
filter to required data
import requests
import pandas as pd
import numpy as np
# load results into a data frame
df = pd.json_normalize(requests.get("https://api.cryptowat.ch/markets/summaries").json()["result"])
# columns are encoded as product and measure. decode columns and transpose into rows that include product and measure
cols = np.array([c.split(".", 1) for c in df.columns]).T
df.columns = pd.MultiIndex.from_arrays(cols, names=["product","measure"])
df = df.T
# finally filter down to required data and structure measures as columns
df.loc[df.index.get_level_values("product").str[:7]=="kraken:"].unstack("measure").droplevel(0,1)
sample output
product         price.last  price.high  price.low  price.change.percentage  price.change.absolute  volume       volumeQuote
kraken:aaveaud  347.41      347.41      338.14     0.0274147                9.27                   1.77707      613.281
kraken:aavebtc  0.008154    0.008289    0.007874   0.0219326                0.000175               403.506      3.2797
kraken:aaveeth  0.1327      0.1346      0.1327     -0.00673653              -0.0009                287.113      38.3549
kraken:aaveeur  219.87      226.46      209.07     0.0331751                7.06                   1202.65      259205
kraken:aavegbp  191.55      191.55      179.43     0.030559                 5.68                   6.74476      1238.35
kraken:aaveusd  259.53      267.48      246.64     0.0339841                8.53                   3623.66      929624
kraken:adaaud   1.61792     1.64602     1.563      0.0211692                0.03354                5183.61      8366.21
kraken:adabtc   3.757e-05   3.776e-05   3.673e-05  0.0110334                4.1e-07                252403       9.41614
kraken:adaeth   0.0006108   0.00063     0.0006069  -0.0175326               -1.09e-05              590839       367.706
kraken:adaeur   1.01188     1.03087     0.977345   0.0209986                0.020811               1.99104e+06  1.98693e+06
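The question also asked for USD pairs only and for kraken-futures to be excluded, so a filter on both the "kraken:" prefix and the "usd" suffix may be closer to what is wanted. Here is a minimal sketch on a hand-written payload; the payload is an assumption that only mimics the {"result": {"exchange:pair": {...}}} shape described in this thread, and the numbers other than the kraken rows are made up:

```python
import pandas as pd

# Hypothetical payload mimicking the shape of the API response (not real data)
payload = {
    "result": {
        "kraken:aaveusd": {"price": {"last": 259.53, "high": 267.48}, "volume": 3623.66},
        "kraken:aavebtc": {"price": {"last": 0.008154, "high": 0.008289}, "volume": 403.506},
        "kraken-futures:aaveusd": {"price": {"last": 260.0, "high": 268.0}, "volume": 10.0},
        "binance-us:aaveusd": {"price": {"last": 264.48, "high": 267.32}, "volume": 50.0},
    }
}

result = payload["result"]

# One row per product; nested dicts become dotted column names such as price.last
summary = pd.json_normalize(list(result.values()))
summary.index = list(result.keys())

# Keep spot Kraken pairs quoted in USD; "kraken-futures:" does not match "kraken:"
mask = summary.index.str.startswith("kraken:") & summary.index.str.endswith("usd")
kraken_usd = summary[mask]
print(kraken_usd)
```

On the quoting question: resp.json() returns a Python dict, and printing a dict shows Python's own repr with single quotes. The JSON text itself is unchanged.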
Hello, try the code below. I worked out the structure of the dataset and reshaped it to get the desired output.
import requests
import pandas as pd
resp = requests.get("https://api.cryptowat.ch/markets/summaries")
a = resp.json()
a['result']
# create a DataFrame from key 'result'
da = pd.DataFrame(a['result'])
# transpose to get the required columns and index
da = da.transpose()
# the price column contains dicts, which need to become separate columns of the data frame
db = da['price'].to_dict()
da.drop('price', axis=1, inplace=True)
# initialise a separate DataFrame for price and add one row per product
z = pd.DataFrame({})
for key in db.keys():
    row = pd.DataFrame(db[key], index=[key])
    z = pd.concat([z, row], axis=0)
da = pd.concat([z, da], axis=1)
da.to_excel('nex.xlsx')

Adding a pandas.DataFrame to another one with its own name

I have data that I want to retrieve from a couple of text files in a folder. For each file in the folder, I create a pandas.DataFrame to store the data. For now it works correctly, and all the files have the same number of rows.
Now what I want to do is to add each of these dataframes to a 'master' dataframe containing all of them. I would like to add each of these dataframes to the master dataframe under its file name.
I already have the file name.
For example, let's say I have 2 dataframes with their own file names. I want to add them to the master dataframe with a header for each of these 2 dataframes representing the name of the file.
What I have tried now is the following:
# T0 data
t0_path = "C:/Users/AlexandreOuimet/Box Sync/Analyse Opto/Crunch/GF data crunch/T0/*.txt"
t0_folder = glob.glob(t0_path)
t0_data = pd.DataFrame()
for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_name = getFileName(file)
    t0_data.insert(loc=len(t0_data.columns), column=file_name, value=file_data)
Could someone help me with this please?
Thank you :)
Edit:
I think I was not clear enough, this is what I am expecting as an output:
output
You may be looking for the concat function. Here's an example:
import pandas as pd
A = pd.DataFrame({'Col1': [1, 2, 3], 'Col2': [4, 5, 6]})
B = pd.DataFrame({'Col1': [7, 8, 9], 'Col2': [10, 11, 12]})
a_filename = 'a_filename.txt'
b_filename = 'b_filename.txt'
A['filename'] = a_filename
B['filename'] = b_filename
C = pd.concat((A, B), ignore_index = True)
print(C)
Output:
   Col1  Col2        filename
0     1     4  a_filename.txt
1     2     5  a_filename.txt
2     3     6  a_filename.txt
3     7    10  b_filename.txt
4     8    11  b_filename.txt
5     9    12  b_filename.txt
There are a couple of changes to make here to do this in an easy way. I'll list the changes and the reasoning below:
Specify which columns your master DataFrame will have.
Instead of using the helper function that it seems like you were trying to define, you can simply create a new column called "file_name" that holds the filepath used to make the DataFrame, for every record in that DataFrame. That way, when you combine the DataFrames, each record's origin is clear. I commented where you can make edits if you want to use string methods to clean up the filenames.
At the end, don't use insert. For combining DataFrames with the same columns (a union operation if you're familiar with SQL or with set theory), you can use the append method. DataFrame.append was removed in pandas 2.0, so the code below uses the equivalent pd.concat call.
# T0 data
t0_path = "C:/Users/AlexandreOuimet/Box Sync/Analyse Opto/Crunch/GF data crunch/T0/*.txt"
t0_folder = glob.glob(t0_path)
t0_data = pd.DataFrame(columns=['wavelength', 'max', 'min','file_name'])
for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_data['file_name'] = file  # You can make edits here
    # pd.concat stands in for the removed DataFrame.append
    t0_data = pd.concat([t0_data, file_data], ignore_index=True)
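If the goal is instead one block of columns per file, each headed by the file name (which is how I read the question, although the expected-output image is not available here), concatenating side by side with the file names as keys may be closer. A minimal sketch with made-up data; the names a_filename and b_filename and the values are assumptions:

```python
import pandas as pd

# Hypothetical per-file frames with the same number of rows (values are made up)
frames = {
    "a_filename": pd.DataFrame({"wavelength": [400, 500], "max": [1.0, 2.0], "min": [0.1, 0.2]}),
    "b_filename": pd.DataFrame({"wavelength": [400, 500], "max": [3.0, 4.0], "min": [0.3, 0.4]}),
}

# Passing a dict uses its keys as the outer column level; axis=1 places frames side by side
master = pd.concat(frames, axis=1)
print(master)

# Selecting by file name gives that file's columns back
print(master["b_filename"])
```

In the loop from the question, the dict could be filled with frames[file_name] = file_data and the concat done once after the loop.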

Pandas- How to save frequencies of different values in different columns line by line in a csv file (including 0 frequencies)

I have a CSV file with the following columns of interest
fields = ['column_0', 'column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6', 'column_7', 'column_8', 'column_9']
For each of these columns, there are 153 lines of data, containing only two values: -1 or +1.
My problem is that, for each column, I would like to save the frequencies of the -1 and +1 values on one comma-separated line of a CSV file. I run into the following problems:
>>> df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
>>> print(df['column_2'].value_counts())
1 148
-1 5
>>> df['column_2'].value_counts().to_csv('result.txt', index=False)
Then, when I open result.txt, here is what I find:
148
5
This is obviously not what I want. I want the values on the same line of the text file, separated by a comma (e.g., 148, 5).
The second problem happens when one of the frequencies is zero:
>>> print(df['column_9'].value_counts())
1 153
>>> df['column_9'].value_counts().to_csv('result.txt', index=False)
Then, when I open result.txt, here is what I find:
153
I don't want that behavior either; I would like to see 153, 0.
So, in summary, I would like to know how to do the following with Pandas:
Given one column, save its different values frequencies in the same line of a csv file and separated by commas. For example:
148,5
If there is a value with frequency 0, put that in the CSV. For example:
153,0
Append these frequency values in different lines of the same CSV file. For example:
148,5
153,0
Can I do that with pandas, or should I move to another Python library?
Example with some dummy data:
import pandas as pd
df = pd.DataFrame({'col1': [1, 1, 1, -1, -1, -1],
                   'col2': [1, 1, 1, 1, 1, 1],
                   'col3': [-1, 1, -1, 1, -1, -1]})
counts = df.apply(pd.Series.value_counts).fillna(0).T
print(counts)
Output:
       -1    1
col1  3.0  3.0
col2  0.0  6.0
col3  4.0  2.0
You can then export this to csv.
See this answer for ref:
How to get value counts for multiple columns at once in Pandas DataFrame?
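To get exactly the layout asked for in the question (one line per column, the +1 count first, zeros included, no header), one option is to reindex each column's counts to the two expected values before writing. A minimal sketch with made-up data; the in-memory buffer is a stand-in for the real output file:

```python
import io

import pandas as pd

# Made-up data: column_9 never contains -1
df = pd.DataFrame({
    "column_2": [1, 1, 1, -1],
    "column_9": [1, 1, 1, 1],
})

# reindex forces both expected values to be present; missing ones get a count of 0
counts = df.apply(
    lambda col: col.value_counts().reindex([1, -1], fill_value=0)
).T

# One csv line per original column, without header or index
buffer = io.StringIO()
counts.to_csv(buffer, header=False, index=False)
print(buffer.getvalue())
```

Passing a file path instead of the buffer writes the same lines to disk. Because fill_value avoids NaN, the counts stay integers.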
I believe you could do what you want like this
import io
import pandas as pd
df = pd.DataFrame({'column_1': [1,-1,1], 'column_2': [1,1,1]})
with io.StringIO() as stream:
    # it's easier to transpose a dataframe so that the number of rows become columns
    # .to_frame to DataFrame and .T to transpose
    df['column_1'].value_counts().to_frame().T.to_csv(stream, index=False)
    print(stream.getvalue())  # check the csv data
But I would suggest something like this, since otherwise you would have to specify explicitly that one of the expected values was missing:
with io.StringIO() as stream:
    # count values for every column at once; values absent from a column become NaN
    counts = df[['column_1', 'column_2']].apply(lambda column: column.value_counts())
    # replace NaN with 0 so that missing values show a zero frequency
    counts = counts.fillna(0)
    # transpose so that each original column becomes one row of the csv
    counts.T.to_csv(stream, index=False)
    print(stream.getvalue())  # check the csv data
Here is an example with three columns c1, c2, c3 and data frame d which is defined before the function is invoked.
import pandas as pd
import collections
def wcsv(d):
    # count the occurrences of each value, one dict per column
    dc = [dict(collections.Counter(d[i])) for i in d.columns]
    for i in dc:
        # add a zero count for any value that never occurs
        if -1 not in i:
            i[-1] = 0
        if 1 not in i:
            i[1] = 0
    # look the counts up by key so they line up with the column labels
    w = pd.DataFrame([[j[1], j[-1]] for j in dc], columns=['1', '-1'], index=d.columns)
    w.to_csv("t.csv")
d = pd.DataFrame([[1,1,-1],[-1,1,1],[1,1,-1],[1,1,-1]], columns=['c1','c2','c3'])
wcsv(d)

Issue with exporting list based pandas dataframe to Excel

I have a series of dataframes which I am exporting to Excel within the same file. A number of them appear to be stored as a list of dictionaries due to the way they have been constructed. I converted them using .from_dict, but when I use df.to_excel an error is raised.
An example of one of the dfs which raises the error is shown below. My code:
excel_writer = pd.ExcelWriter('My_DFs.xlsx')
df_Done_Major = df[
    (df['currency_str'].str.contains('INR|ZAR|NOK|HUF|MXN|PLN|SEK|TRY')==False) &
    (df['state'].str.contains('Done'))
][['Year_Month','state','currency_str','cust_cdr_display_name','rbc_security_type1','rfq_qty','rfq_qty_CAD_Equiv']].copy()
# Trades per bucket
df_Done_Major['Bucket'] = pd.cut(df_Done['rfq_qty'], bins=bins, labels=labels)
# Populate empty buckets with 0 so HK, SY and TK data can be pasted adjacently
df_Done_Major_Fill_Empty_Bucket = df_Done_Major.groupby(['Year_Month','Bucket'], as_index=False)['Bucket'].size()
mux = pd.MultiIndex.from_product([df_Done_Major_Fill_Empty_Bucket.index.levels[0], df_Done_Major['Bucket'].cat.categories])
df_Done_Major_Fill_Empty_Bucket = df_Done_Major_Fill_Empty_Bucket.reindex(mux, fill_value=0)
dfTemp = df_Done_Major_Fill_Empty_Bucket
display(dfTemp)
dfTemp = pd.DataFrame.from_dict(dfTemp)
display(dfTemp)
# Export
dfTemp.to_excel(excel_writer, sheet_name='Sheet1', startrow=0, startcol=21, na_rep=0, header=True, index=True, merge_cells= True)
2018-05  0K          0
         10K         2
         20K         4
         40K        10
         60K         3
         80K         1
         100K       14
         > 100K    273
dtype: int64
TypeError: Unsupported type <class 'pandas._libs.period.Period'> in write()
Even though I have converted it to a df, is additional conversion required?
Update: I can get the data into Excel using the following, but the format of the dataframe is lost, which means significant Excel VBA work to resolve.
list = [{"Data": dfTemp}, ]
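One thing that may be worth trying, sketched under the assumption that the Year_Month index level holds pandas Period values (as the error message suggests): convert that level to plain strings before calling to_excel, which keeps the two-level index layout. The data below is made up, and the to_excel call is left as a comment because it needs an Excel engine such as openpyxl:

```python
import pandas as pd

# Made-up counts with a (Year_Month, Bucket) index whose first level is a Period
idx = pd.MultiIndex.from_product(
    [pd.period_range("2018-05", periods=2, freq="M"), ["0K", "10K"]],
    names=["Year_Month", "Bucket"],
)
counts = pd.Series([0, 2, 4, 10], index=idx)

# Replace the Period level with its string form, e.g. "2018-05"
as_text = counts.copy()
as_text.index = as_text.index.set_levels(
    as_text.index.levels[0].astype(str), level=0
)
print(as_text)

# Assumed follow-up, mirroring the call in the question:
# as_text.to_frame("count").to_excel(excel_writer, sheet_name="Sheet1", merge_cells=True)
```

Whether this removes the TypeError depends on the pandas and Excel engine versions in use, so treat it as a starting point.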
