I have a .csv file with multiple rows and columns:
chain Resid Res K B C Tr Kw Bw Cw Tw
A 1 ASP 1 0 0.000104504 NA 0 0 0.100087974 0.573972285
A 2 GLU 2 627 0.000111832 0 0.033974309 0.004533331 0.107822844 0.441666022
Whenever I open the file using pandas or with open, it shows only one column and multiple rows:
629 rows x 1 columns
Here is the code I'm using:
data = pd.read_csv("F:/file.csv", sep='\t')
print(data)
and the result I'm getting is this:
A,1,ASP,1,0,0.0001045041279130...
I want the output to be in a dataframe form so that I can carry out future calculations. Any help will be highly appreciated
The separator is ',', so it is possible to omit the sep parameter, because sep=',' is the default separator in read_csv:
data= pd.read_csv("F:/file.csv")
You can read the CSV using the following code snippet:
import pandas as pd
data = pd.read_csv('F:/file.csv', sep=',')
Don't use '\t', because the fields aren't separated by tabs, so use the below:
data = pd.read_csv("F:/file.csv")
Or, if your columns are separated by runs of whitespace, use:
data = pd.read_csv("F:/file.csv", sep='\s{2,}', engine='python')
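As a sketch of the whitespace case (using an inline string in place of your file path, so the example is self-contained):

```python
import io

import pandas as pd

# sample of the whitespace-separated data from the question
text = """chain Resid Res K B C Tr Kw Bw Cw Tw
A 1 ASP 1 0 0.000104504 NA 0 0 0.100087974 0.573972285
A 2 GLU 2 627 0.000111832 0 0.033974309 0.004533331 0.107822844 0.441666022"""

# sep=r'\s+' splits on any run of whitespace (spaces or tabs)
df = pd.read_csv(io.StringIO(text), sep=r'\s+')
print(df.shape)  # (2, 11) -- one column per header field
```

Here "NA" is parsed as a missing value by default, which matches the data above.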
Related
I have a text file that I want to load into my Python code, but the format of the .txt file is not suitable.
Here is what it contains:
SEQ MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLASWNY
SS3 CCCHHHHHHHHHHHHCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
95024445656543114678678999999999999999888889998886
SS8 CCHHHHHHHHHHHHHHCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
96134445555554311253378999999999999999999999999987
SA EEEbBBBBBBBBBBbEbEEEeeEeBeEbBEEbbEeBeEbbeebBbBbBbb
41012123422000000103006262214011342311110000030001
TA bhHHHHHHHHHHHHHgIihiHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
00789889988663201010099999999999999999898999998741
CD NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
54433221111112221122124212411342243234323333333333
I want to convert it into a pandas DataFrame with SEQ SS3 SA TA CD SS8 as the columns of the DataFrame and the lines next to them as the rows.
I tried pd.read_csv but it doesn't give me the result I want.
Thank you!
To read a text file with the pandas.read_csv() method, the file should contain comma-separated data, for example:
SEQ, SS3, ....
MSSSSWLLLSLVAVTAAQSTIEEQ..., CCCHHHHHHHHHHHHCCCCCCHHHHHHH.....
Steps
Use pd.read_fwf() to read files in a fixed-width format.
Fill the missing values with the last available value by df.ffill().
Assign group number gp for the row number in the output using a groupby-cumcount construct.
Move gp=(0,1) to columns by df.pivot, and then transpose again into the desired output.
Note: this solution works with an arbitrary number (including zero, and of course not too many) of consecutive lines with omitted values in the first column.
Code
import pandas as pd

# data (3 characters for the second column only)
file_path = "/mnt/ramdisk/input.txt"
df = pd.read_fwf(file_path, names=["col", "val"])

# fill the blank values in the label column with the last available value
df["col"] = df["col"].ffill()
# get correct row location
df["gp"] = df.groupby("col").cumcount()
# pivot group (0,1) to columns and then transpose.
df_ans = df.pivot(index="col", columns="gp", values="val").transpose()
Result
print(df_ans) # show the first 3 characters only
col CD SA SEQ SS3 SS8 TA
gp
0 NNN EEE MSS CCC CCH bhH
1 544 410 NaN 950 961 007
Then you can save the resulting DataFrame using df_ans.to_csv().
You can use this script to load the .txt file into a DataFrame and save it as a CSV file:
import pandas as pd

data = {}
with open('<your file.txt>', 'r') as f_in:
    for line in f_in:
        line = line.split()
        if len(line) == 2:
            data[line[0]] = [line[1]]

df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv', index=False)
It prints the DataFrame and saves the result as data.csv.
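Note that the script above keeps only lines with exactly two fields, so the unlabeled rows of digits under SS3, SS8, etc. are dropped. A sketch that also keeps those continuation lines, under the assumption that an unlabeled line belongs to the most recent label (using an inline string in place of the file):

```python
import io

import pandas as pd

# inline sample standing in for the .txt file
text = """SEQ MSSSS
SS3 CCCHH
95024
SS8 CCHHH
96134"""

data = {}
last_key = None
for line in io.StringIO(text):
    parts = line.split()
    if len(parts) == 2:                  # labeled line, e.g. "SS3 CCCHH"
        last_key = parts[0]
        data[last_key] = [parts[1]]
    elif len(parts) == 1 and last_key:   # continuation line (digits only)
        data[last_key].append(parts[0])

# pad shorter columns so the DataFrame is rectangular
n = max(len(v) for v in data.values())
for v in data.values():
    v.extend([None] * (n - len(v)))

df = pd.DataFrame(data)
print(df)
```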
I'm new to Python and I have a challenge. I need to add a column to a text file delimited by ";". So far so good... except that the value of this column depends on the value of another column. I will leave an example in case I was not clear.
My file looks like this:
Account;points
1;500
2;600
3;1500
If the value of the points column is greater than 1000, enter 2; if less, enter 1.
In this case the file would look like this:
Account;points;column_created
1;500;1
2;600;1
3;1500;2
An approach without pandas; this code assumes your points column will always be in the second position:
with open('stats.txt', 'r+') as file:
    lines = file.readlines()
    file.seek(0, 0)
    # copy the header line, adding the new column name
    file.write(lines[0].strip() + ";column_created\n")
    for line in lines[1:]:
        columns = line.strip().split(";")
        if int(columns[1]) > 1000:
            file.write(";".join(columns) + ";2\n")
        else:
            file.write(";".join(columns) + ";1\n")
A file on disk can't have new items inserted between existing elements. You have to read all the data into memory, add the new column, and write everything back to the file.
You can use pandas to easily add a new column based on values from another column.
In the example I use io.StringIO() only to create a minimal working example that everyone can copy and test. Use read_csv('input.csv', sep=';') with your file.
import pandas as pd
import io
text = '''Account;points
1;500
2;600
3;1500'''
#df = pd.read_csv('input.csv', sep=';')
df = pd.read_csv(io.StringIO(text), sep=';')
print('--- before ---')
print(df)
df['column_created'] = df['points'].apply(lambda x: 2 if x > 1000 else 1)
print('--- after ---')
print(df)
df.to_csv('output.csv', sep=';', index=False)
Result
--- before ---
Account points
0 1 500
1 2 600
2 3 1500
--- after ---
Account points column_created
0 1 500 1
1 2 600 1
2 3 1500 2
You can use Python's built-in csv library to create CSV files. Here is the link to the documentation:
https://docs.python.org/3/library/csv.html
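For completeness, a minimal sketch of the same transformation using only the standard-library csv module (with in-memory buffers standing in for the input and output files):

```python
import csv
import io

# inline sample standing in for the input file
src = io.StringIO("Account;points\n1;500\n2;600\n3;1500\n")
out = io.StringIO()

reader = csv.reader(src, delimiter=';')
writer = csv.writer(out, delimiter=';')

# copy the header, adding the new column name
header = next(reader)
writer.writerow(header + ['column_created'])

for row in reader:
    # 2 if points > 1000, else 1
    writer.writerow(row + ['2' if int(row[1]) > 1000 else '1'])

print(out.getvalue())
```

With a real file you would open the input and output paths instead of the StringIO buffers.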
I have a csv file where the columns are separated by a tab delimiter, but the number of columns is not constant. I need to read the file up to the 5th column. (I don't want to read the whole file and then extract the columns; I would like to read, for example, line by line and skip the remaining columns.)
You can use usecols argument in pd.read_csv to limit the number of columns to be read.
import pandas as pd

# test data
s = '''a,b,c
1,2,3'''
with open('a.txt', 'w') as f:
    print(s, file=f)

df1 = pd.read_csv("a.txt", usecols=range(1))
df2 = pd.read_csv("a.txt", usecols=range(2))
print(df1)
print()
print(df2)
# output
# a
#0 1
#
# a b
#0 1 2
You can use the usecols parameter of pd.read_csv to read only the first five columns:
import pandas as pd
df = pd.read_csv('out122.txt', sep='\t', usecols=[0, 1, 2, 3, 4])
I am struggling to read a simple CSV into pandas; the actual problem is that it doesn't separate on ",".
import pandas as pd
df = pd.read_csv('C:\\Users\\xxx\\1.csv', header=0, delimiter ="\t")
print(df)
I have tried sep=',' and it does not separate.
Event," 2016-02-01"," 2016-02-02"," 2016-02-03"," 2016-02-04","
Contact joined,"5","7","18","20",
Launch first time,"30","62","86","110",
It should look like one header with the dates and two rows:
2016-02-01 2016-02-02 etc
0 5 7
1 30 62
UPDATE: Yes, the problem was in the csv itself, with unnecessary quotes and characters.
You seem to be using both delimiter= and sep=, which do the same thing. If the file is actually comma-separated, try:
import pandas as pd
df = pd.read_csv('C:\\Users\\xxx\\1.csv')
print(df)
sep=',' is the default, so it's not necessary to explicitly state that. The same goes for header=0. delimiter= is just an alias for sep=.
You still seem to have a problem with the formatting of your column names. If you post an example of your csv, I can try to fix that...
I have a .csv that contains column headers and is displayed below. I need to suppress the column labeling when I ingest the file as a data frame.
date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7
When I issue the following command:
df = pd.read_csv('c:/temp1/test_csv.csv', usecols=[4,5], names = ["zip","weight"], header = 0, nrows=10)
I get:
zip weight
0 1417464 3546600
I have tried various manipulations of header=True and header=0. If I don't use header=0, then the columns will all print out on top of the rows like so:
zip weight
height locale
0 1417464 3546600
I have tried skiprows= 0 and 1 but neither removes the headers. However, the command works by skipping the line specified.
I could really use some additional insight or a solve. Thanks in advance for any assistance you could provide.
Tiberius
Using the example of @jezrael, if you want to skip the header and suppress the column labeling:
import pandas as pd
import numpy as np
import io
temp=u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""
#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], header=None, skiprows=1)
print(df)
4 5
0 3546600 254
I'm not sure I entirely understand why you want to remove the headers, but you could comment out the header line as follows as long as you don't have any other rows that begin with 'd':
>>> df = pd.read_csv('test.csv', usecols=[3,4], header=None, comment='d') # comments out lines beginning with 'date,color' . . .
>>> df
3 4
0 1417464 3546600
It would be better to comment out the line in the csv file with the crosshatch character (#) and then use the same approach (again, as long as you have not commented out any other lines with a crosshatch):
>>> df = pd.read_csv('test.csv', usecols=[3,4], header=None, comment='#') # comments out lines with #
>>> df
3 4
0 1417464 3546600
I think you are right.
So you can change column names to a and b:
import pandas as pd
import numpy as np
import io
temp=u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""
#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], names=["a","b"], header=0, nrows=10)
print(df)
a b
0 3546600 254
Now these columns have new names instead of weight and height.
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], header=0, nrows=10)
print(df)
weight height
0 3546600 254
You can check the read_csv docs (bold by me):
header : int, list of ints, default ‘infer’
Row number(s) to use as the column names, and the start of the data. Defaults to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns E.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example are skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
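The "explicitly pass header=0 to be able to replace existing names" behavior described above can be sketched like this (using an inline string as a stand-in for the file):

```python
import io

import pandas as pd

csv_text = "date,color\n11/25/2013,Blue\n"

# header=0 consumes the first line as the header row,
# and names= replaces those names with your own
df = pd.read_csv(io.StringIO(csv_text), names=["a", "b"], header=0)
print(df.columns.tolist())   # ['a', 'b']

# with header=None the first line is treated as data instead
df2 = pd.read_csv(io.StringIO(csv_text), names=["a", "b"], header=None)
print(len(df2))              # 2 rows: the old header line becomes data
```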