How to write a pandas DataFrame to LibreOffice Calc using Python?

I am using the following code on Ubuntu 20:
import pyoo
import os
import uno
import pandas as pd
os.system("/usr/lib/libreoffice/program/soffice.bin --headless --invisible --nocrashreport --nodefault --nofirststartwizard --nologo --norestore --accept='socket,host=localhost,port=2002,tcpNoDelay=1;urp;StarOffice.ComponentContext'")
df = pd.DataFrame()
df['Name']=['Anil','Raju','Arun']
df['Age']=['32','34','45']
desktop = pyoo.Desktop('localhost', 2002)
doc = desktop.open_spreadsheet("/home/vivek/Documents/Libre python trial/oi_data.ods")
sh1=doc.sheets['oi_data']
sh1[1,4].value=df
doc.save()
It gives all data in a single cell as a string:
'Name age0 Anil 321 Raju 342 Arun 45'
I want the DataFrame written into the sheet's rows and columns, like this:
Name age
0 Anil 32
1 Raju 34
2 Arun 45
For reference, here is example code using xlwings on Windows (I want to achieve the same with similarly simple code in LibreOffice Calc on Ubuntu/Linux, if possible):
import pandas as pd
import xlwings as xlw
# Connecting with excel workbook
file=xlw.Book("data.xlsx")
# connection with excel sheet
sh1=file.sheets('sheet1')
df=pd.DataFrame()
df['Name']=['Anil','Raju','Arun']
df['Age']=['32','34','45']
sh1.range('A4').value=df

From the pyoo documentation, a range of values is set with a list of lists:
sheet[1:3,0:2].values = [[3, 4], [5, 6]]
To get a list of lists from a DataFrame, the code recommended at "How to convert a Python Dataframe to List of Lists?" is:
lst = [df[i].tolist() for i in df.columns]
Note that this produces one inner list per column; pyoo's slice assignment is row-major, so for row-wise lists use df.values.tolist() instead.
EDIT:
Write a function called insertDf() that does the two things above, calculating the required indices. Then, instead of sh1.range('A4').value=df, write insertDf(df,'A4',sh1).
Perhaps more elegant would be a class called CalcDataFrame that extends pandas.DataFrame with a writeCells() method.
It would also be easier to pass the location as (row number, column number) instead of a combined 'column letters & row number' string:
df = CalcDataFrame()
df['Name']=['Anil','Raju','Arun']
df['Age']=['32','34','45']
df.writeCells(sh1,1,4)
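A minimal sketch of the insertDf() idea above, assuming pyoo's slice-assignment API (sheet[rows, cols].values = list_of_lists). The helper names and the zero-based (row, col) signature are illustrative, not part of pyoo:

```python
import pandas as pd

def df_to_rows(df, header=True):
    """Convert a DataFrame to a row-wise list of lists, optionally prefixed by the header row."""
    rows = df.values.tolist()
    if header:
        rows = [list(df.columns)] + rows
    return rows

def insert_df(df, sheet, row, col, header=True):
    """Write df into sheet starting at zero-based (row, col) via pyoo slice assignment."""
    rows = df_to_rows(df, header=header)
    sheet[row:row + len(rows), col:col + len(rows[0])].values = rows
```

With this, the xlwings-style `sh1.range('A4').value = df` becomes `insert_df(df, sh1, 3, 0)`.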

Related

Get only the name of a DataFrame - Python - Pandas

I'm working on an ETL project with crappy data that I'm trying to get right.
For this, I'm trying to create a function that takes the names of my DataFrames and exports them to CSV files that are easy for me to deal with in Power BI.
I've started with a function that takes my DataFrames and cleans the dates:
df_liste = []

def facture(x):
    x = pd.DataFrame(x)
    for s in x.columns.values:
        if "Fact" in s:
            x.rename(columns={s: 'periode_facture'}, inplace=True)
            x['periode_facture'] = pd.to_datetime(x['periode_facture'], format='%Y%m')
If I don't cast 'x' to a DataFrame it doesn't work, but that's not my problem.
As you can see, I have set up a list variable that I would like to fill with the names of the DataFrames, and the names only. Unfortunately, after a lot of tries I haven't succeeded yet, so here it is, my first question on Stack ever!
Just in case, this is the first version of the function I would like to have:
def export(x):
    for df in x:
        df.to_csv(f'{df}.csv', encoding='utf-8')
You'd have to set the name of your DataFrame first using df.name (probably when you create them / read data into them).
Then you can access the name like a normal attribute:
import pandas as pd
df = pd.DataFrame(data=[1, 2, 3])
df.name = 'my df'
and can use
df.to_csv(f'{df.name}.csv', encoding='utf-8')
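Putting the two pieces together, a sketch of the export() function from the question using the .name attribute (the out_dir parameter is added here for illustration; note that most pandas operations do not preserve .name, so set it just before exporting):

```python
import os
import pandas as pd

def export(dfs, out_dir='.'):
    """Write each named DataFrame to <out_dir>/<df.name>.csv and return the written paths."""
    paths = []
    for df in dfs:
        path = os.path.join(out_dir, f'{df.name}.csv')
        df.to_csv(path, encoding='utf-8', index=False)
        paths.append(path)
    return paths
```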

How to customise my csv file using python instead of Excel macro

I want to store the result of my Python code in a CSV file, but with the code below the result does not show up in the CSV file.
I have converted the macro VB file to Python... Any advice would be appreciated, because I am new to this.
*unable to enter full code due to site error.
Please find my code:
import pandas as pd
import csv
import numpy as np
from vb2py.vbfunctions import *
from vb2py.vbdebug import *
def My_custom_MACRO():
    #
    # My_custom_MACRO Macro
    #
    Range('A1:A2').Select()
    Range('A2').Activate()
    Columns('A:A').EntireColumn.AutoFit()
    Columns('G:G').Select()
    Selection.Insert(Shift=xlToRight, CopyOrigin=xlFormatFromLeftOrAbove)
    Columns('P:P').EntireColumn.AutoFit()
    Columns('P:P').Select()
    Selection.Cut(Destination=Columns('G:G'))
    Range('G53').Select()
    ActiveWindow.SmallScroll(Down=-45)
    Range('G1').Select()
    ActiveCell.FormulaR1C1 = 'Live Deli'
    ActiveWindow.SmallScroll(Down=-9)
    ActiveSheet.Range('$A$1:$N$201').AutoFilter(Field=7, Criteria1='>50', Operator=xlAnd)
    ActiveSheet.Range('$A$1:$N$201').AutoFilter(Field=4, Criteria1='>50', Operator=xlAnd)
    ActiveSheet.Range('$A$1:$N$201').AutoFilter(Field=7, Criteria1='>50', Operator=xlAnd)
    ActiveWindow.SmallScroll(Down=0)
    ActiveSheet.Range('$A$1:$N$201').AutoFilter(Field=4, Criteria1='>50%', Operator=xlAnd)

df_1 = pd.read_csv(r'D:\proj\project.csv', My_custom_MACRO)
df_1.to_csv(r'D:\proj\project_output.csv')
I don't see a sort in your macro, just a column move and a filter. Try this:
import pandas as pd
df = pd.read_csv(r'c:\temp\project.csv')
cols = df.columns
# cut column P, paste it into column G as 'Live Deli'
colP = df.pop(cols[14])
df.insert(14, '', '')
df.insert(6, 'Live Deli', colP)
# apply the filters to columns 4 and 7
df1 = df.loc[(df[cols[3]] > 0.5) & (df['Live Deli'] > 50)]
# save
df1.to_csv(r'c:\temp\project_output.csv', index=False)
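A tiny self-contained demo of the pop/insert move used in the answer, on a stand-in frame where column 'C' plays the role of column P:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
col = df.pop('C')               # cut column 'C' out of the frame
df.insert(0, 'Live Deli', col)  # paste it back at the front under a new name
```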

Dask read_csv: skip periodically occurring lines

I want to use Dask to read in a large file of atom coordinates at multiple time steps. The format is called XYZ file, and it looks like this:
3
timestep 1
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
3
timestep 2
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
The first line contains the atom number, the second line is just a comment.
After that, the atoms are listed with their names and positions.
After all atoms are listed, the same is repeated for the next time step.
I would now like to load such a trajectory via dask.dataframe.read_csv.
However, I could not figure out how to skip the periodically occurring lines containing the atom number and the comment. Is this actually possible?
Edit:
Reading this format into a Pandas Dataframe is possible via:
atom_nr = 3

def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

pd.read_csv(xyz_filename, skiprows=skip, delim_whitespace=True, header=None)
But it looks like the Dask dataframe does not support passing a function to skiprows.
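For reference, the pandas skiprows-callable approach runs as-is on the sample data from the question (io.StringIO stands in for the file; the column names are made up here for readability):

```python
import io
import pandas as pd

xyz = """3
timestep 1
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
3
timestep 2
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
"""

atom_nr = 3

def skip(line_nr):
    # Drop the atom-count line and the comment line of every frame.
    return line_nr % (atom_nr + 2) < 2

df = pd.read_csv(io.StringIO(xyz), skiprows=skip, sep=r'\s+',
                 header=None, names=['atom', 'x', 'y', 'z'])
```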
Edit 2:
MRocklin's answer works! Just for completeness, I write down the full code I used.
from io import BytesIO

import pandas as pd
import dask.bytes
import dask.dataframe
import dask.delayed

atom_nr = ...
filename = ...

def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

def pandaread(data_in_bytes):
    pseudo_file = BytesIO(data_in_bytes[0])
    return pd.read_csv(pseudo_file, skiprows=skip, delim_whitespace=True, header=None)

bts = dask.bytes.read_bytes(filename, delimiter=f"{atom_nr}\ntimestep".encode())
dfs = dask.delayed(pandaread)(bts)
sol = dask.dataframe.from_delayed(dfs)
sol.compute()
The only remaining question is: How do I tell dask to only compute the first n frames? At the moment it seems the full trajectory is read.
Short answer
No, dask.dataframe.read_csv does not offer this functionality (pandas.read_csv does accept a callable for skiprows, as the edit above shows, but dask's version does not support it).
Long answer
If you can write code to convert some of this data into a pandas dataframe, then you can probably do this on your own with moderate effort using
dask.bytes.read_bytes
dask.dataframe.from_delayed
In general this might look something like the following:
values = read_bytes('filenames.*.txt', delimiter='...', blocksize=2**27)
dfs = [dask.delayed(load_pandas_from_bytes)(v) for v in values]
df = dd.from_delayed(dfs)
Each of the dfs corresponds to roughly blocksize bytes of your data (and then up to the next delimiter). You can control how fine your partitions are using this blocksize. If you want, you can also select only a few of these dfs objects to get a smaller portion of your data:
dfs = dfs[:5] # only the first five blocks of `blocksize` data

Boxplot needs to use multiple groupby in Pandas

I am using pandas, Jupyter notebooks and Python.
I have the following dataset as a DataFrame:
Cars,Country,Type
1564,Australia,Stolen
200,Australia,Stolen
579,Australia,Stolen
156,Japan,Lost
900,Africa,Burnt
2000,USA,Stolen
1000,Indonesia,Stolen
900,Australia,Lost
798,Australia,Lost
128,Australia,Lost
200,Australia,Burnt
56,Australia,Burnt
348,Australia,Burnt
1246,USA,Burnt
I would like to know how I can use a box plot to answer the question "number of cars in Australia that were affected by each type". So basically, I should have 3 boxplots (one for each type) showing the number of cars affected in Australia.
Please keep in mind that this is a subset of the real dataset.
You can select only the rows corresponding to "Australia" from the column "Country" and group it by the column "Type" as shown:
from io import StringIO
import pandas as pd
text_string = StringIO(
"""
Cars,Country,Type,Score
1564,Australia,Stolen,1
200,Australia,Stolen,2
579,Australia,Stolen,3
156,Japan,Lost,4
900,Africa,Burnt,5
2000,USA,Stolen,6
1000,Indonesia,Stolen,7
900,Australia,Lost,8
798,Australia,Lost,9
128,Australia,Lost,10
200,Australia,Burnt,11
56,Australia,Burnt,12
348,Australia,Burnt,13
1246,USA,Burnt,14
""")
df = pd.read_csv(text_string, sep = ",")
# Select the Australian rows, then boxplot the 'Cars' column grouped by 'Type'
group = df.loc[df['Country'] == 'Australia'].boxplot(column='Cars', by='Type')
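The same selection without plotting, showing the per-type distributions that the boxplot summarises, using the original dataset from the question:

```python
import io
import pandas as pd

csv = """Cars,Country,Type
1564,Australia,Stolen
200,Australia,Stolen
579,Australia,Stolen
156,Japan,Lost
900,Africa,Burnt
2000,USA,Stolen
1000,Indonesia,Stolen
900,Australia,Lost
798,Australia,Lost
128,Australia,Lost
200,Australia,Burnt
56,Australia,Burnt
348,Australia,Burnt
1246,USA,Burnt
"""

df = pd.read_csv(io.StringIO(csv))
aus = df[df['Country'] == 'Australia']
stats = aus.groupby('Type')['Cars'].describe()
```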

Python: How do you extract a column of data in a csv file if the name of the column has spaces?

For instance, the code below wouldn't work because 'data.apple pies' has a space in it, yet that is representative of the column to be selected. How can I extract the data from the CSV column without altering the column name? I could change 'apple pies' to 'applepies' and then use data.applepies instead of data.apple pies, but that would be inconvenient if I had to rename many columns this way. (Or should I just remove all the spaces in the column names?)
data = pandas.read_csv('type of foods.csv', names = ['apple pies', 'peach pies'])
applepies = list(data.apple pies)
You can use subscript notation to access the column, and if you want to convert it to a list, use the Series.tolist() method instead of list(...). Example:
applepies = data['apple pies'].tolist()
Please note, the important thing in the answer is to use data['apple pies'] (subscript notation) to access the column; with that, even list(data['apple pies']) would work. But I prefer the .tolist() method for converting the Series to a list.
Demo -
Sample csv -
"apple pies"
12
33
33
123
22
Code -
In [1]: import pandas as pd
In [2]: df = pd.read_csv('a.csv')
In [3]: df['apple pies'].tolist()
Out[3]: [12, 33, 33, 123, 22]
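If attribute access specifically is wanted, another option (not needed for subscript access) is to normalise the column names once after reading, replacing spaces with underscores:

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO("apple pies,peach pies\n1,2\n3,4\n"))
df.columns = df.columns.str.replace(' ', '_')
pies = df.apple_pies.tolist()
```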
This works "with" spaces just fine.
import pandas as pd
from io import StringIO

csv = r"""apple pies,peach pies,something,else
apple1,peach1,1,2
apple2,peach2,1,2
apple3,peach3,1,2
apple4,peach4,1,2
apple5,peach5,1,2
"""
data = pd.read_csv(StringIO(csv),
                   usecols=["apple pies", "peach pies"],
                   header=0)
data.columns = ['apple', 'peach']
print(list(data.apple))
print(data)
output:
['apple1', 'apple2', 'apple3', 'apple4', 'apple5']
apple peach
0 apple1 peach1
1 apple2 peach2
2 apple3 peach3
3 apple4 peach4
4 apple5 peach5
