I have a pandas DataFrame that I would like to save in a tab-separated file format, with a pound (#) symbol at the beginning of the header.
Here is my demo code:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
file_name = 'test.tsv'
df.to_csv(file_name, sep='\t', index=False)
The above code creates a DataFrame and saves it in tab-separated-value format, which looks like:
a b c
1 2 3
4 5 6
7 8 9
But how can I add the pound symbol to the header while saving the DataFrame? I want the output to be like the below:
#a b c
1 2 3
4 5 6
7 8 9
I hope the question is clear; thanks in advance for the help.
Note: I would like to keep the DataFrame header definition the same.
Using your code, just modify the a column to be #a, like below:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['#a', 'b', 'c'])
file_name = 'test.tsv'
df.to_csv(file_name, sep='\t', index=False)
Edit
If you don't want to adjust the starting DataFrame, use .rename before writing to CSV:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
file_name = 'test.tsv'
df.rename(columns={
'a' : '#a'
}).to_csv(file_name, sep='\t', index=False)
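As a side note, if you later read such a file back (a sketch, assuming the same file_name as above), the leading # stays attached to the first column name and can be stripped:
x = pd.read_csv(file_name, sep='\t')
x.columns = x.columns.str.lstrip('#')  # columns are 'a', 'b', 'c' again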
Use the header argument to create aliases for the columns.
df.to_csv(file_name, sep='\t', index=False,
header=[f'#{x}' if x == df.columns[0] else x for x in df.columns])
#a b c
1 2 3
4 5 6
7 8 9
Here's another way to get your column aliases:
from itertools import zip_longest
header = [''.join(x) for x in zip_longest('#', df.columns, fillvalue='')]
#['#a', 'b', 'c']
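For completeness, a short sketch passing that header list to to_csv (reusing file_name and df from above):
df.to_csv(file_name, sep='\t', index=False, header=header)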
I have the following Pandas dataframe in Python:
import pandas as pd
d = {'col1': [1, 2, 3, 4, 5], 'col2': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data=d)
df.index=['A', 'B', 'C', 'D', 'E']
df
which gives the following output:
col1 col2
A 1 6
B 2 7
C 3 8
D 4 9
E 5 10
I need to write a function (say getNrRows(fromIndex)) that takes an index value as input and returns the number of rows between that given index and the last index of the DataFrame.
For instance:
nrRows = getNrRows("C")
print(nrRows)
> 2
Because it takes 2 steps (rows) from the index C to the index E.
How can I write such a function in the most elegant way?
The simplest way might be:
len(df[row_index:]) - 1
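Wrapped into the requested function, a minimal sketch using the question's DataFrame and .loc for explicit label slicing (label slicing includes the endpoint, hence the - 1):
import pandas as pd

d = {'col1': [1, 2, 3, 4, 5], 'col2': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data=d, index=['A', 'B', 'C', 'D', 'E'])

def getNrRows(fromIndex):
    # rows from fromIndex (exclusive) down to the last label
    return len(df.loc[fromIndex:]) - 1

print(getNrRows('C'))  # 2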
For your information, pandas has the built-in index method get_indexer_for:
len(df)-df.index.get_indexer_for(['C'])-1
Out[179]: array([2], dtype=int64)
Python Script
#!/bin/python3
import pandas as pd
import numpy as np

class test(object):
    def checker(self):
        df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                           columns=['a', 'b', 'c'])
        return df2

if __name__ == "__main__":
    q = test()
    q.checker()
I want that df2 object, the DataFrame.
R code
x <- py_run_file("new1.py")
The output ends up being a dictionary with 28 items.
What is the correct way to grab that object in R using Reticulate?
py_run_file returns the Python main module as an environment (hence the dictionary of 28 items); assign the DataFrame to a top-level variable so you can pull that one object out:
import pandas as pd
import numpy as np
class test(object):
    def checker(self):
        df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                           columns=['a', 'b', 'c'])
        return df2

if __name__ == "__main__":
    q = test()
    x = q.checker()
In R:
library(reticulate)
x <- py_run_file("test.py")$x
x
a b c
1 1 2 3
2 4 5 6
3 7 8 9
I have a column within a dataset, containing categorical company sizes, where '-' hyphens currently represent missing data.
I want to replace the '-' missing values with nulls so I can analyse the missing data. However, when I use pandas' replace (see the following code) with a None value, it also mangles the genuine entries, because they contain hyphens too (e.g. 51-200).
df['Company Size'].replace({'-': None},inplace =True, regex= True)
How can I replace only lone standing hyphens and leave the other entries untouched?
You don't need regex=True. Without it, replace matches whole cell values exactly, so only lone hyphens are replaced:
df['Company Size'].replace({'-': None},inplace =True)
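A quick check that exact matching leaves ranges such as 51-200 intact (a sketch with made-up sizes):
import pandas as pd

df = pd.DataFrame({'Company Size': ['51-200', '-', '11-50']})
df['Company Size'].replace({'-': None}, inplace=True)
print(df)
#   Company Size
# 0       51-200
# 1         None
# 2        11-50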
You could also just do the following, using np.nan rather than the string 'None' so the result is a real null:
import numpy as np
df['column_name'] = df['column_name'].replace('-', np.nan)
import numpy as np
df.replace('-', np.nan, inplace=True)
This code worked for me.
You can do it like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', '-', 'c--', 'd', 'e']})
df['C'] = df['C'].replace('-', np.nan)
df = df.where(pd.notnull(df), None)
# per column: df['C'] = df['C'].where(pd.notnull(df['C']), None)
print(df)
output:
A B C
0 0 5 a
1 1 6 None
2 2 7 c--
3 3 8 d
4 4 9 e
Another example:
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': ['5-5', '-', 7, 8, 9],
                   'C': ['a', 'b', 'c--', 'd', 'e']})
df['B'] = df['B'].replace('-', np.nan)
df = df.where(pd.notnull(df), None)
print(df)
output:
A B C
0 0 5-5 a
1 1 None b
2 2 7 c--
3 3 8 d
4 4 9 e
I've got the following problem: if I select some indices of my pandas DataFrame:
df = pd.DataFrame(data=CoordArray[0:,1:],index=CoordArray[:,0],columns=["x","y","z"])
like this:
print(df.loc[['1234567','7654321'],:])
it works pretty well. But if I have those labels in a numpy array, convert the array to a list, and do this:
mynewlist = list(SomeNumpyArray)
print(df.loc[mynewlist])
I get the following error:
"None of [[1234567, 7654321]] are in the [index]"
I really don't know what's going wrong.
I haven't been able to replicate your issue. As @Wen commented, your list and numpy array may not have the same type as your index; note that the unquoted numbers in your error message are integers, while your working example used strings.
Here is an example demonstrating that lists or numpy arrays are acceptable as indexers:
import pandas as pd, numpy as np
df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
                  index=['1000', '2000', '3000', '4000'],
                  columns=['x', 'y', 'z'])
idx = np.array(['2000', '3000'])
df.loc[idx]
# x y z
# 2000 4 5 6
# 3000 7 8 9
idx_lst = list(idx)
df.loc[idx_lst]
# x y z
# 2000 4 5 6
# 3000 7 8 9
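If the error does appear, the usual culprit is exactly that label-type mismatch: integer values in the list against a string index. A minimal sketch reproducing and fixing it (names are illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  index=['1234567', '7654321'],
                  columns=['x', 'y', 'z'])

arr = np.array([1234567, 7654321])  # integer labels, as in the error message
# df.loc[list(arr)] raises KeyError: no integers exist in the string index
fixed = [str(v) for v in arr]       # cast the labels to match the index dtype
print(df.loc[fixed])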
I would like to repeat these examples (example_1, example_2) with my dataset.
import pandas_ml as pdml
df = pdml.ModelFrame({'A': [1, 2, 3], 'B': [2, 3, 4],
                      'C': [3, 4, 5]}, index=['a', 'b', 'c'])
df
A B C
a 1 2 3
b 2 3 4
c 3 4 5
But the issue is that my dataset is in a CSV file.
x_test = pd.read_csv("x_test.csv",sep=';',header=None)
I've tried converting the pandas DataFrame to a dict, but it didn't work.
So, the question is: is there a way to convert a pandas DataFrame into a pandas-ml ModelFrame?
I think you need DataFrame.to_dict with the orient parameter:
x_test = pd.read_csv("x_test.csv",sep=';',header=None)
df = pdml.ModelFrame(x_test.to_dict(orient='list'))
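A possibly simpler route, under the assumption that ModelFrame accepts a DataFrame directly (it subclasses pandas.DataFrame, but treat this as unverified):
import pandas as pd
import pandas_ml as pdml

x_test = pd.read_csv("x_test.csv", sep=';', header=None)
df = pdml.ModelFrame(x_test)  # assumption: the constructor takes a DataFrame as-is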