savWriter writerows with Dataframe - python

I am trying to use the savReaderWriter library with Python. I have a dataframe which is read in via df = pd.read_csv. However, with the following piece of code it won't write the rows to the file.
with savReaderWriter.SavWriter(savFileName, *args) as writer:
    writer.writerows(df)
I am getting the following error: TypeError: 'str' object does not support item assignment. Any help is greatly appreciated.

This is the sample from https://pythonhosted.org/savReaderWriter/:
savFileName = 'someFile.sav'
records = [[b'Test1', 1, 1], [b'Test2', 2, 1]]
varNames = ['var1', 'v2', 'v3']
varTypes = {'var1': 5, 'v2': 0, 'v3': 0}
with savReaderWriter.SavWriter(savFileName, varNames, varTypes) as writer:
    for record in records:
        writer.writerow(record)
I think you can split your DataFrame into the three pieces the writer needs (records, varNames, varTypes).
By the way, pandas also has methods of its own for writing data to a file.
import pandas as pd

sensor_values = pd.DataFrame([[1, 'aaa', 'bbb'], [2, 'ppp', 'xxx']], columns=['A', 'B', 'C'])
varNames = list(sensor_values.columns)
records = sensor_values.values.tolist()  # writerows expects a list of rows, not a DataFrame
varTypes = {key: 0 for key in sensor_values.columns}  # 0 marks a numeric column; string columns need their byte width
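Putting it together, a minimal sketch of the actual write (the widths here are assumptions: savReaderWriter uses 0 for numeric columns and the maximum byte width for string columns, and string values must be passed as bytes, as in the docs sample above):

import savReaderWriter

varTypes = {'A': 0, 'B': 5, 'C': 5}  # assumed: A numeric, B and C strings up to 5 bytes
records = [[a, b.encode(), c.encode()] for a, b, c in sensor_values.values.tolist()]
with savReaderWriter.SavWriter('someFile.sav', varNames, varTypes) as writer:
    writer.writerows(records)  # a list of rows, not the DataFrame itself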

Related

how to convert string to datatable excel using pandas?

Following my previous question, I'm now trying to put the data in a table and convert it to an Excel file, but I can't get the table I want. If anyone can help or explain the cause, this is the final output I want to get.
This is the data I'm printing:
Hotel1 : chambre double - {'lpd': ('112', '90','10'), 'pc': ('200', '140','10')}
And here is my code:
import pandas as pd
import ast

s = "Hotel1 : chambre double - {'lpd': ('112', '90','10'), 'pc': ('200', '140','10')}"
ds = []
for l in s.splitlines():
    d = l.split("-")
    if len(d) > 1:
        df = pd.DataFrame(ast.literal_eval(d[1].strip()))
        ds.append(df)
for df in ds:
    df.reset_index(drop=True, inplace=True)
df = pd.concat(ds, axis=1)
cols = df.columns
cols = [((col.split('.')[0], col)) for col in df.columns]
df.columns = pd.MultiIndex.from_tuples(cols)
print(df.T)
df.to_excel("v.xlsx")
But this is what I get.
How can I solve the problem, please? This is the final and most important part. Thank you in advance.
Within the for loop, the value "Hotel1 : chambre double" is held in d[0] (try it yourself by printing d[0]).
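For instance, a quick check (note the trailing space that split("-") leaves behind):

for l in s.splitlines():
    d = l.split("-")
    print(repr(d[0]))  # -> 'Hotel1 : chambre double '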
In your previous question, the "Name3" column was built by the following line of code:
cols = [((col.split('.')[0], col)) for col in df.columns]
Now, to save "Hotel1 : chambre double", you need to access it within the first for loop.
import pandas as pd
import ast

s = "Hotel1 : chambre double - {'lpd': ('112', '90','10'), 'pc': ('200', '140','10')}"
ds = []
cols = []
for l in s.splitlines():
    d = l.split("-")
    if len(d) > 1:
        df = pd.DataFrame(ast.literal_eval(d[1].strip()))
        ds.append(df)
        cols2 = df.columns
        cols = [((d[0], col)) for col in df.columns]
for df in ds:
    df.reset_index(drop=True, inplace=True)
df = pd.concat(ds, axis=1)
df.columns = pd.MultiIndex.from_tuples(cols)
print(df.T)
df.T.to_csv(r"v.csv")
This works because you take d[0] (the hotel name) within the for loop and create the tuples for your column names while you have access to that object.
You then create a multi-index column in the line of code you already had, outside the loop:
df.columns=pd.MultiIndex.from_tuples(cols)
Finally, to answer your file-output query, add the following line of code at the bottom (use df.T.to_excel("v.xlsx") instead if you specifically need an Excel file):
df.T.to_csv(r"v.csv")

How to write the data to an excel using python

I'm writing the data inside my dictionary to an Excel file, which looks like below.
my_dict = {'one': 100, 'two': 200, 'three': 300}
df = pd.DataFrame(my_dict.items(), columns=['Summary', 'Count'])
with pd.ExcelWriter('outfile.xlsx') as writer:
    df.to_excel(writer, sheet_name='sheet1', index=False)
For the above code I'm getting the desired output, like below.
I have one more list with some values which need to go into the 3rd column of the Excel file.
my_list = [10,20,30]
expected output:
Edit: I need to add the data in my_dict and the my_list at the same time.
I have tried to find a solution but unfortunately wasn't able to. Any help is appreciated!
Many thanks!!
To add the data in my_dict and my_list at the same time as defining the dataframe df, you can chain the pd.DataFrame() call with .assign() to create a column named my_list from the input list my_list:
df = pd.DataFrame(my_dict.items(), columns=['Summary','Count']).assign(my_list=my_list)
Of course, the most trivial way of doing it is to separate them into 2 statements: define the dataframe with pd.DataFrame first and then add the column, as below. But this takes 2 statements, and I'm not sure whether you still count that as "at the same time".
df = pd.DataFrame(my_dict.items(), columns=['Summary','Count']) # Your existing statement unchanged
df['my_list'] = my_list
Result:
print(df)

  Summary  Count  my_list
0     one    100       10
1     two    200       20
2   three    300       30
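Writing the result out then works exactly as in your original snippet:

with pd.ExcelWriter('outfile.xlsx') as writer:
    df.to_excel(writer, sheet_name='sheet1', index=False)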
This may also solve your problem:
import pandas as pd

my_dict = {'summary': ['one', 'two', 'three'], 'count': [100, 200, 300]}
my_list = [10, 20, 30]
df = pd.DataFrame.from_dict(my_dict)
df['my_list'] = my_list
df.to_excel('df.xlsx')

ignore columns not present in parquet with pyarrow in pandas

I am trying to read a parquet file with pyarrow==1.0.1 as the engine.
Given:
columns = ['a','b','c']
pd.read_parquet(x, columns=columns, engine="pyarrow")
If file x does not contain column c, it fails with:
/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset._scanner()
/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.from_dataset()
/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset._populate_builder()
/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Field named 'c' not found or not unique in the schema.
There is no argument to ignore missing columns and just read them in as NaN.
The error handling is also pretty bad.
pyarrow.lib.ArrowInvalid("Field named 'c' not found or not unique in the schema.")
It is pretty hard to extract the missing field name from that error so it can be used to drop the column from the list passed in on the next try.
Is there a way to do this?
You can read the metadata from your parquet file to figure out which columns are available.
Bear in mind though that pandas won't be able to guess the type of the missing column (c in the example below), which may cause issues when you concatenate tables later.
import pandas as pd
import pyarrow.parquet as pq
all_columns = ['a', 'b', 'c']
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'z']})
file_name = '/tmp/my_df.pq'
df.to_parquet(file_name)
parquet_file = pq.ParquetFile(file_name)
columns_in_file = [c for c in all_columns if c in parquet_file.schema.names]
df = (
    parquet_file
    .read(columns=columns_in_file)
    .to_pandas()
    .reindex(columns=all_columns)
)
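With the sample frame above, the missing column c comes back filled with NaN (and, per the caveat above, its dtype will be float64 rather than whatever type c would have had), roughly:

print(df)
#    a    b   c
# 0  1  foo NaN
# 1  2  bar NaN
# 2  3    z NaN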

What causes the problem: csv, pandas or nltk?

I have a strange problem resulting in wrong output from NLTK collocations. In short, when I pass a pandas object created in the Python environment (PyCharm or Jupyter) to the function, I get the correct result. When I save this object to CSV and load it back into a pandas object, the function returns single letters and/or numbers instead of full words. Something must be wrong with the CSV round trip through pandas, but I have no idea what...
Here is the code. The function that is applied:
def counts(x):
    trigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_documents(x)
    finder.nbest(trigram_measures.pmi, 100)
    s = pd.Series(x)
    ngram_list = [pair for row in s for pair in ngrams(row, 3)]
    c = Counter(ngram_list).most_common(3)
    return pd.DataFrame([(x.name, ) + element for element in c], columns=['group', 'Ngram', 'Frequency'])
Here is the object:
d = {'words': pd.Series((['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
                         ['galley', 'work', 'table', 'stuck'],
                         ['cloth', 'stuck'],
                         ['stuck', 'coffee'])),
     'group': pd.Series([1, 2, 1, 2])}
df_cleaned = pd.DataFrame(d)
Then I apply the function from above plus some extra functions:
output = df_cleaned.groupby('group', as_index=False).words.apply(counts).reset_index(drop=True)
The result is correct.
But when the pandas object is saved and loaded back, the result is something like this:
Here is the code for saving and loading:
df.to_csv('test_file.csv', index=False, sep=',')
df = pd.read_csv('path/test_file.csv', sep=',', usecols=['group', 'words'])
I found quotes in the loaded pandas object, so I removed them before applying the function:
df = df.replace({'\'': ''}, regex=True)
output = df_cleaned.groupby('group', as_index=False).words.apply(counts).reset_index(drop=True)
Now it returns wrong results.
Do you have any suggestions on which way I should go?
I reproduced what you described in the following steps, and I don't see any errors.
import pandas as pd
d = {'words': pd.Series((['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
                         ['galley', 'work', 'table', 'stuck'],
                         ['cloth', 'stuck'],
                         ['stuck', 'coffee'])),
     'group': pd.Series([1, 2, 1, 2])}
df_cleaned = pd.DataFrame(d)
df_cleaned
The function you're using is
import nltk
from nltk.util import ngrams
from nltk.collocations import *
from collections import Counter
def counts(x):
    trigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_documents(x)
    finder.nbest(trigram_measures.pmi, 100)
    s = pd.Series(x)
    ngram_list = [pair for row in s for pair in ngrams(row, 3)]
    c = Counter(ngram_list).most_common(3)
    return pd.DataFrame([(x.name, ) + element for element in c], columns=['group', 'Ngram', 'Frequency'])
You then apply counts to the data
output = df_cleaned.groupby('group', as_index=False).words.apply(counts).reset_index(drop=True)
and save the results to file
output.to_csv('test_file.csv', index=False, sep=',')
df = pd.read_csv('test_file.csv',sep=',')
I don't see any problems.
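One thing worth checking, though (an assumption, since the intermediate file isn't shown): the snippet above round-trips output (the ngram counts), not df_cleaned itself. A column of Python lists written to CSV comes back as plain strings such as "['coffee', 'maker', ...]", and iterating over a string yields single characters, which matches the single-letter output described. A minimal sketch that parses the strings back into lists after reading:

import ast
import pandas as pd

df = pd.read_csv('test_file.csv', sep=',', usecols=['group', 'words'])
# each cell of 'words' is the string repr of a list; turn it back into a real list
df['words'] = df['words'].apply(ast.literal_eval)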

Applying different functions and their arguments using a general functions (Ta-Lib in particular)

I would appreciate it if you could help with a function that takes in a pandas df, a function name, the input columns needed, and arguments/kwargs.
import talib
The df is of the form:
Open High Low Close Volume
Date
1993-01-29 43.970001 43.970001 43.750000 43.939999 1003200
1993-02-01 43.970001 44.250000 43.970001 44.250000 480500
1993-02-02 44.220001 44.380001 44.130001 44.340000 201300
The following code works:
def ApplyIndicator(df, data_col, indicator_func, period):
    df_copy = df.copy()
    col_name = indicator_func.__name__
    df_copy[col_name] = df_copy[data_col].apply(lambda x: indicator_func(x, period))
    return df_copy
Sample:
new_df = ApplyIndicator(df, ['Close'], talib.SMA, 10)
However, I want a general ApplyIndicator that can take different columns. talib.STOCH, for example, takes more than one argument and needs several columns:
slowk, slowd = STOCH(input_arrays, 5, 3, 0, 3, 0, prices=['high', 'low', 'open'])
For this case, how can I write a general ApplyIndicator that works with any talib function, assuming all required columns are already in df?
Thank you.
More details on the two functions:
SMA(real[, timeperiod=?])
and
STOCH(high, low, close[, fastk_period=?, slowk_period=?, slowk_matype=?, slowd_period=?, slowd_matype=?])
With the original ApplyIndicator, it can be done like this:
def slowk(arr, per):
    return STOCH(arr, 5, 3, 0, 3, 0, prices=['high', 'low', 'open'])[0]

new_df = ApplyIndicator(df, ['Close'], slowk, None)
A lambda won't work here because its __name__ is always "<lambda>", but with some smarter column naming they should be fine too.
To make it slightly more elegant, we can accept an arbitrary number of arguments:
def ApplyIndicator(df, indicator_func, *args):
    col_name = indicator_func.__name__
    df[col_name] = df.apply(lambda x: indicator_func(x, *args))
    return df

new_df = ApplyIndicator(df[['Close']], talib.SMA, 10)
new_df = ApplyIndicator(df[...], STOCH, 5, 3, 0, 3, 0, ['high', 'low', 'open'])
But in fact, the whole function is so trivial that it might be easier to replace it with a single call, reusing the abstract-API invocation from the question:
slowk, slowd = STOCH(input_arrays, 5, 3, 0, 3, 0, prices=['high', 'low', 'open'])
df['slowk'], df['slowd'] = slowk, slowd
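For a fully general version, one option is to pass the input columns and the output names explicitly and call the plain (non-abstract) talib functions. This is only a sketch under the question's assumption that every required column is already in df; the helper name and the column/output lists are illustrative:

import talib

def apply_indicator(df, indicator_func, input_cols, output_cols, *args):
    # Run a TA-Lib indicator on the given columns and attach its outputs to a copy.
    out = df.copy()
    inputs = [df[c].to_numpy(dtype=float) for c in input_cols]
    results = indicator_func(*inputs, *args)
    if len(output_cols) == 1:  # single-output indicators return a single array
        results = (results,)
    for name, values in zip(output_cols, results):
        out[name] = values
    return out

# SMA(real, timeperiod) has one input and one output:
new_df = apply_indicator(df, talib.SMA, ['Close'], ['SMA_10'], 10)
# STOCH(high, low, close, ...) has three inputs and two outputs:
new_df = apply_indicator(new_df, talib.STOCH, ['High', 'Low', 'Close'],
                         ['slowk', 'slowd'], 5, 3, 0, 3, 0)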
