I have a file which has coordinates like
1 1
1 2
1 3
1 4
1 5
and so on
There are no zeros in them. I tried using both comma and tab as the delimiter, but I am still stuck with the same problem.
When I printed the output to the screen I saw something very weird: it looks like the very first line is missing.
The output after running pa.read_csv('co-or.txt',sep='\t') is as follows:
1 1
0 1 2
1 1 3
2 1 4
3 1 5
and so on..
I am not sure if I am missing any arguments in this.
Also, when I tried to convert that to a NumPy array using np.array, it is again missing the first line and hence the first element [1 1].
df = pd.read_csv('data.csv', header=None)
You need to specify header=None; otherwise pandas takes the first row as the header.
If you want to give the columns meaningful names, you can pass names like this:
df = pd.read_csv('data.csv', header=None, names=['foo','bar'])
Spend some time with the pandas documentation as well to familiarize yourself with the API, in particular the page for read_csv.
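As a minimal sketch of the header=None fix applied to data shaped like the question's, the snippet below reads the coordinates from an in-memory string. The string is a stand-in for co-or.txt, and the whitespace delimiter is an assumption, since the question does not show the actual separator.

```python
import io
import pandas as pd

# Stand-in for the contents of co-or.txt (assumed to be whitespace-separated).
raw = "1 1\n1 2\n1 3\n1 4\n1 5\n"

# header=None tells pandas that the first line is data, not column names.
df = pd.read_csv(io.StringIO(raw), header=None, sep=r"\s+")

# Converting to a NumPy array now keeps the first row [1, 1].
coords = df.to_numpy()
print(coords.shape)
```

With a real file, the same arguments would be passed with the file path in place of the StringIO object.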
You can try this:
with open('file.dat', 'r') as file:
    lines = file.readlines()
and it does work.
How to read the first cell from my csv file and store it as a variable
For example, my list is:
header 1    header 2
AM
Depth       Value
10          20
30          122
60          222
How can I read the AM cell and store it in a variable "x"?
And how can I ignore the AM cell later on and start my data frame from my headers (Depth, Value)?
You should be able to get a specific row/column using indexing. iloc should be able to help.
For example, df.iloc[0,0] returns AM.
Also, pandas.read_csv allows you to skip rows when reading the data. You can use pd.read_csv("test.csv", sep="\t",skiprows=1) to skip the first row.
Result:
0 10 20
1 30 122
2 60 222
Use pd.read_csv and then select the first row:
import pandas as pd
df = pd.read_csv('your file.csv')
x = df.iloc[0]['header 1']
Then, to delete it, use df.drop:
df.drop(0, inplace=True)
I am using a dummy CSV file generated from the data you posted in this question.
import pandas as pd
# read data
df = pd.read_csv('test.csv')
File contents are as follows:
header 1 header 2
0 AM NaN
1 Depth Value
2 10 20
3 30 122
4 60 222
You can use the usecols parameter to load only selected columns. Columns are numbered from 0, so usecols=[0] loads just the first column and usecols=[1] just the second.
You can save the result to x, or whichever variable you want, as follows:
# Change usecols to load various columns in the data
x = pd.read_csv('test.csv',usecols=[0])
Header:
# set the header parameter to the (0-based) number of the line to use as the header
pd.read_csv('test.csv',header=2)
Depth Value
0 10 20
1 30 122
2 60 222
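The pieces above can be combined into one short sketch: read the AM cell into x, then build the data frame starting from the Depth/Value row. The in-memory, comma-separated text below is a hypothetical stand-in for test.csv, since the question does not show the raw file.

```python
import io
import pandas as pd

# Hypothetical stand-in for test.csv (comma-separated is an assumption).
csv_text = "header 1,header 2\nAM,\nDepth,Value\n10,20\n30,122\n60,222\n"

# Read only the first data row and pick its first cell.
first = pd.read_csv(io.StringIO(csv_text), nrows=1)
x = first.iloc[0, 0]

# Re-read using line 2 (0-based) as the header; earlier lines are skipped.
df = pd.read_csv(io.StringIO(csv_text), header=2)
print(x)
print(df)
```

Reading the file twice keeps the two steps independent, at the cost of a second pass over the data.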
I need to extract a specific value from pandas df column. The data looks like this:
row my_column
1 artid=delish.recipe.45064;artid=delish_recipe_45064;avb=83.3;role=4;data=list;prf=i
2 ab=px_d_1200;ab=2;ab=t_d_o_1000;artid=delish.recipe.23;artid=delish;role=1;pdf=true
3 dat=_o_1000;artid=delish.recipe.23;ar;role=56;passing=true;points001
The data is not consistent, but the fields are separated by semicolons, and I need to extract role=x.
I split the data on the semicolons and can loop through the values to fetch the roles, but I was wondering if there is a more elegant way to solve it.
Desired output:
row my_column
1 role=4
2 role=1
3 role=56
Thank you.
You can use str.extract and pass the required pattern within parentheses.
df['my_column'] = df['my_column'].str.extract(r'(role=\d+)', expand=False)
row my_column
0 1 role=4
1 2 role=1
2 3 role=56
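For reference, here is a self-contained sketch of the str.extract approach run on the three sample rows from the question. The result is kept in a separate variable, roles (a name chosen here for illustration), so the original column stays intact.

```python
import pandas as pd

df = pd.DataFrame({
    "row": [1, 2, 3],
    "my_column": [
        "artid=delish.recipe.45064;artid=delish_recipe_45064;avb=83.3;role=4;data=list;prf=i",
        "ab=px_d_1200;ab=2;ab=t_d_o_1000;artid=delish.recipe.23;artid=delish;role=1;pdf=true",
        "dat=_o_1000;artid=delish.recipe.23;ar;role=56;passing=true;points001",
    ],
})

# The capture group returns "role=" plus its digits; expand=False yields a Series.
roles = df["my_column"].str.extract(r"(role=\d+)", expand=False)
print(roles.tolist())
```

Rows that contain no role=<digits> field would come back as NaN rather than raising an error.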
This should work:
def get_role(x):
    # split the record on ';' and keep the first field that starts with 'role='
    parts = x.split(';')
    return [p for p in parts if p.startswith('role=')][0]
df['my_column'] = df['my_column'].map(get_role)
I have the following file format:
7.2393690416406E+000 1.0690994646755E+001 3.1429089063731E+000
-2.7606309583594E+000 1.0690994646755E+001 1.3142908906373E+001
That is, in the first column there is one leading space before non-negative values and no leading space before negative values. Therefore, if you read the file with code like the following:
df = pd.read_csv('example.csv',header=None,engine='python',sep=' ')
You will get something like this:
1 NaN 7.239369 10.690995 3.142909
2 -2.760631 10.690995 13.142909 NaN
This happens because pandas sees the leading space and treats it as a column separator. The dataframe does contain all the values, but each line with a non-negative first value is shifted by one column relative to the lines with a negative first value. How can I fix this? How can I get a clean dataframe like the following?
1 7.239369 10.690995 3.142909
2 -2.760631 10.690995 13.142909
Use sep=r'\s+', which treats any run of whitespace as a single separator:
df = pd.read_csv('test.csv', header=None, sep=r'\s+')
0 1 2
0 7.239369 10.690995 3.142909
1 -2.760631 10.690995 13.142909
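A quick way to check this without a file on disk is to feed the two sample lines through io.StringIO. This is only a sketch of the same call, with the in-memory text standing in for the real file.

```python
import io
import pandas as pd

# The two sample lines; note the leading space on the first one.
raw = (
    " 7.2393690416406E+000 1.0690994646755E+001 3.1429089063731E+000\n"
    "-2.7606309583594E+000 1.0690994646755E+001 1.3142908906373E+001\n"
)

# A regex separator of one or more whitespace characters ignores the leading space.
df = pd.read_csv(io.StringIO(raw), header=None, sep=r"\s+")
print(df.shape)
```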
I'm building a program that collects data and adds it to an ongoing excel sheet weekly (read_excel() and concat() with the new data). The issue I'm having is that I need the columns to have the same name for presentation (it doesn't look great with x.1, x.2, ...).
I only need this on the final output. Is there any way to accomplish this? Would it be too time consuming to modify pandas?
You can create a list of custom headers that will be written to Excel:
newColNames = ['x','x','x'.....]
df.to_excel(path,header=newColNames)
You can add a different number of trailing spaces to each column name. The names will look the same in Excel, but pandas can tell them apart.
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['x','x ','x  '])
df
x x x
0 1 2 3
1 4 5 6
2 7 8 9
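Another possible sketch, assuming the frame carries the suffixed names that pandas generates for duplicate headers (x, x.1, x.2): keep those distinct names while working, and strip the suffix on a copy just before exporting. The frame and the out variable below are hypothetical.

```python
import pandas as pd

# Hypothetical frame with the suffixed names pandas produces for duplicate headers.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["x", "x.1", "x.2"])

out = df.copy()
# Strip a trailing ".<digits>" suffix so every column shows the same label.
out.columns = out.columns.str.replace(r"\.\d+$", "", regex=True)
print(list(out.columns))
```

The copy can then be written with to_excel as in the answer above. Note that the pattern would also strip a suffix from a column whose real name happens to end in a dot followed by digits.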
When I create the following Pandas Series:
pandas.Series(['a', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa']
I get this as a result:
0 a
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
2 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
3 aaaaaaaaaaaaaaaa
4 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
How can I instead get a Series without the ellipsis that looks like this:
0 a
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3 aaaaaaaaaaaaaaaa
4 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
pandas is truncating the output. You can change this with the display.max_colwidth option:
In [4]:
data = pd.Series(['a', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'])
pd.set_option('display.max_colwidth',1000)
data
Out[4]:
0 a
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3 aaaaaaaaaaaaaaaa
4 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
dtype: object
Also see the related question: Output data from all columns in a dataframe in pandas
By the way, if you are using IPython, a docstring lookup on set_option (by pressing tab) shows the current and default values of each option (the default for max_colwidth is 50 characters).
For pandas versions older than 0.10, use
pd.set_printoptions(max_colwidth=1000)
See related: Python pandas, how to widen output display to see more columns?
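If the wider display is only needed in one place, a scoped variant is a possible alternative to changing the option globally. The sketch below uses pd.option_context with the same display.max_colwidth setting; the shortened series and the shown variable are illustrative.

```python
import pandas as pd

long_value = "a" * 70
data = pd.Series(["a", long_value, "a" * 16])

# The option applies only inside the with block and is restored afterwards.
with pd.option_context("display.max_colwidth", 1000):
    shown = repr(data)

print(shown)
```

Outside the with block, the previous max_colwidth value is back in effect, so other output in the session is unaffected.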