Python script that efficiently drops columns from a CSV file

I have a csv file where the columns are separated by a tab delimiter, but the number of columns is not constant. I need to read the file only up to the 5th column. (I don't want to read the whole file and then extract the columns; I would like to read, for example, line by line and skip the remaining columns.)

You can use the usecols argument of pd.read_csv to limit which columns are read:
import pandas as pd

# test data
s = '''a,b,c
1,2,3'''
with open('a.txt', 'w') as f:
    print(s, file=f)

df1 = pd.read_csv("a.txt", usecols=range(1))  # first column only
df2 = pd.read_csv("a.txt", usecols=range(2))  # first two columns
print(df1)
print()
print(df2)
# output
#    a
# 0  1
#
#    a  b
# 0  1  2

You can pass usecols to pd.read_csv to read only the first five columns (nrows, in contrast, limits the number of rows read):
import pandas as pd
df = pd.read_csv('out122.txt', usecols=[0, 1, 2, 3, 4], sep='\t')  # sep='\t' since the file is tab-delimited
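Since the file in the question is tab-delimited with a variable number of columns, a line-by-line sketch with the csv module also works and never keeps more than the first five fields. The file name sample.tsv and its contents below are made-up stand-ins:

```python
import csv

# Hypothetical sample: tab-delimited, with a varying number of columns.
with open('sample.tsv', 'w') as f:
    f.write('a\tb\tc\td\te\tf\tg\n1\t2\t3\t4\t5\t6\n')

# Read line by line and keep only the first 5 fields of each row;
# the [:5] slice is safe even when a row has fewer than 5 columns.
rows = []
with open('sample.tsv', newline='') as f:
    for record in csv.reader(f, delimiter='\t'):
        rows.append(record[:5])

print(rows)  # [['a', 'b', 'c', 'd', 'e'], ['1', '2', '3', '4', '5']]
```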


The number of rows in a csv file

I have a csv file that has only one column. I want to extract the number of rows.
When I run the code below:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
I get the following output:
[65422771 rows x 1 columns]
But when I run the code below:
file = open("data.csv")
numline = len(file.readlines())
print(numline)
I get the following output:
130845543
What is the correct number of rows in my csv file? What is the difference between the two outputs?
Is it possible that you have an empty line after each entry? The readlines count is almost exactly double the pandas row count (2 x 65422771 data rows + 1 header line = 130845543).
So pandas is skipping the empty lines while readlines counts them.
To check the number of empty lines, try:
import sys
import csv

csv.field_size_limit(sys.maxsize)
empty_lines = 0
with open('data.csv') as data:
    for line in csv.reader(data):
        if not line:
            empty_lines += 1
            continue
        print(line)
print(empty_lines)
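A small reproducible sketch of the explanation above (the file name data_demo.csv is made up): with a blank line after every entry, readlines counts 2n + 1 physical lines while read_csv sees only the n data rows.

```python
import pandas as pd

# One-column file with a blank line after each of the 3 entries.
with open('data_demo.csv', 'w') as f:
    f.write('col\n1\n\n2\n\n3\n\n')

df = pd.read_csv('data_demo.csv')    # skip_blank_lines=True by default
with open('data_demo.csv') as f:
    numline = len(f.readlines())     # counts header, data, and blanks

print(len(df), numline)  # 3 7
```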

Convert .txt file to .csv with specific columns PYTHON

I have a text file that I want to load into my Python code, but the format of the txt file is not suitable.
Here is what it contains:
SEQ MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLASWNY
SS3 CCCHHHHHHHHHHHHCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
95024445656543114678678999999999999999888889998886
SS8 CCHHHHHHHHHHHHHHCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
96134445555554311253378999999999999999999999999987
SA EEEbBBBBBBBBBBbEbEEEeeEeBeEbBEEbbEeBeEbbeebBbBbBbb
41012123422000000103006262214011342311110000030001
TA bhHHHHHHHHHHHHHgIihiHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
00789889988663201010099999999999999999898999998741
CD NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
54433221111112221122124212411342243234323333333333
I want to convert it into a pandas DataFrame with SEQ SS3 SA TA CD SS8 as the columns of the DataFrame and the lines next to them as the rows.
I tried pd.read_csv but it doesn't give me the result I want.
Thank you !
To read a text file using the pandas.read_csv() method, the text file should contain data separated by commas, e.g.:
SEQ, SS3, ....
MSSSSWLLLSLVAVTAAQSTIEEQ..., CCCHHHHHHHHHHHHCCCCCCHHHHHHH.....
Steps
Use pd.read_fwf() to read files in a fixed-width format.
Fill the missing values with the last available value by df.ffill().
Assign group number gp for the row number in the output using a groupby-cumcount construct.
Move gp=(0,1) to columns by df.pivot, and then transpose again into the desired output.
Note: this solution works with an arbitrary number (including zero, and of course not too many) of consecutive lines with omitted values in the first column.
Code
import pandas as pd

# data (3 characters for the second column only)
file_path = "/mnt/ramdisk/input.txt"
df = pd.read_fwf(file_path, names=["col", "val"])
# fill the blank labels downward
df["col"] = df["col"].ffill()
# get the correct row location within each label
df["gp"] = df.groupby("col").cumcount()
# pivot group (0, 1) to columns and then transpose
df_ans = df.pivot(index="col", columns="gp", values="val").transpose()
Result
print(df_ans) # show the first 3 characters only
col CD SA SEQ SS3 SS8 TA
gp
0 NNN EEE MSS CCC CCH bhH
1 544 410 NaN 950 961 007
Then you can save the resulting DataFrame using df_ans.to_csv().
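Since the path above is a placeholder, here is a self-contained version of the same steps, feeding read_fwf a 3-character slice of the sample data via StringIO:

```python
import pandas as pd
from io import StringIO

# 3-character slice of the sample file; the first column is blank on
# the continuation lines, exactly as in the original layout.
raw = """SEQ MSS
SS3 CCC
    950
SS8 CCH
    961
SA  EEE
    410
TA  bhH
    007
CD  NNN
    544
"""
df = pd.read_fwf(StringIO(raw), names=["col", "val"])
df["col"] = df["col"].ffill()              # fill the blank labels downward
df["gp"] = df.groupby("col").cumcount()    # row number within each label
df_ans = df.pivot(index="col", columns="gp", values="val").transpose()
print(df_ans.loc[0, "SEQ"])  # MSS
```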
You can use this script to load the .txt file into a DataFrame and save it as a csv file:
import pandas as pd

data = {}
with open('<your file.txt>', 'r') as f_in:
    for line in f_in:
        line = line.split()
        if len(line) == 2:
            data[line[0]] = [line[1]]

df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv', index=False)
This saves a CSV with the labels as the header row and the values as a single data row.
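A quick sketch of why the len(line) == 2 check works, run on three representative sample lines: the digit-only continuation lines split into a single token and are skipped.

```python
import pandas as pd

data = {}
for line in ["SEQ MSSSS", "SS3 CCCHH", "9502444565"]:
    parts = line.split()
    if len(parts) == 2:       # label + value rows only
        data[parts[0]] = [parts[1]]

df = pd.DataFrame(data)
print(df.columns.tolist())  # ['SEQ', 'SS3']
```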

Filtering rows on a csv with values that I have on a txt file

So I have a csv file where I need to filter the rows based on the values that I have in a txt file. Is there an easy way to do this in pandas? The csv will have about 2000 rows and the txt file has about 400 data points. I need to generate a csv with the rows that match the data in the txt file.
The CSV file looks like this:
Chromosome Gene Start End
1 PERM1 5 6
2 AGRN 7 10
3 MIB2 9 12
The text file looks like:
PERM1
NADK
GNB1
Thank you
First read the text file into a list, stripping the trailing newlines:
lines = [line.strip() for line in open(filename)]
Then keep the rows whose Gene value exists in the text file (the sample csv is space-delimited):
import pandas as pd
df = pd.read_csv('csvfile', sep=r'\s+')
result = df[df['Gene'].isin(lines)]
Easy enough using pandas' read and filter functionality. I'm assuming you have an input .csv file called input_csv_file.csv and a filter file called filter.csv. The filter file has a column "filter_locations" and the input file has a column called "location":
input_df = pd.read_csv('input_csv_file.csv')
filter_df = pd.read_csv('filter.csv')
filtered_df = input_df[input_df['location'].isin(filter_df['filter_locations'])]
This can be achieved by loading both files into DataFrames and building a boolean mask. The code below assumes your text file has no header and your csv file is space-delimited:
import pandas as pd
df1 = pd.read_csv('csvfile.csv', delimiter=' ')
df2 = pd.read_csv('textfile.txt', header=None)
df2.columns = ['Gene']
m = df1.Gene.isin(df2.Gene)
df3 = df1[m]
print(df3)
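The same mask approach, made self-contained with the sample data from the question (StringIO stands in for the real csv and txt files):

```python
import pandas as pd
from io import StringIO

csv_data = """Chromosome Gene Start End
1 PERM1 5 6
2 AGRN 7 10
3 MIB2 9 12"""
txt_data = "PERM1\nNADK\nGNB1\n"

df1 = pd.read_csv(StringIO(csv_data), sep=r"\s+")
df2 = pd.read_csv(StringIO(txt_data), header=None, names=["Gene"])

# Keep only the csv rows whose Gene appears in the text file.
df3 = df1[df1.Gene.isin(df2.Gene)]
print(df3.Gene.tolist())  # ['PERM1']
```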

exporting data frame to csv file in python with pandas

I want to export my DataFrame to a csv file. Normally I want my DataFrame as 2 columns, but when I export it, the csv file has only one column and the data is separated with commas.
m is one column and s is another.
df = pd.DataFrame({'MSE':[m], 'SSIM': [s]})
To append new data frames I used the function below and saved the data to the csv file:
with open('test.csv', 'a+') as f:
    df.to_csv(f, header=False)
print(df)
When I print the dataframe on the console, the output looks like:
MSE SSIM
0 0.743373 0.843658
but in the csv file the column looks like this (the first value is the index, the second is m, and the last one is s). I want them in 3 separate columns:
0,1.1264238582283046,0.8178900901529639
How can I solve this?
Your Excel list-separator setting is most likely ; (semi-colon). Use:
df.to_csv(f, header=False, sep=';')
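A sketch of the effect, assuming the semicolon setting (the file name test_semicolon.csv is made up, and 'w' mode is used here so the output is reproducible):

```python
import pandas as pd

df = pd.DataFrame({'MSE': [0.743373], 'SSIM': [0.843658]})

# With sep=';' the index, MSE, and SSIM land in three Excel columns
# when the machine's list separator is ';'.
with open('test_semicolon.csv', 'w') as f:
    df.to_csv(f, header=False, sep=';')

with open('test_semicolon.csv') as f:
    print(f.read().strip())  # 0;0.743373;0.843658
```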

Selecting DataFrame column names from a csv file

I have a .csv to read into a DataFrame, and the names of the columns are in the same .csv file in the previous rows. Usually I drop all the 'unnecessary' rows to create the DataFrame and then hardcode the names of each column:
Trigger time,2017-07-31,10:45:38
CH,Signal name,Input,Range,Filter,Span
CH1, "Tin_MIX_Air",TEMP,PT,Off,2000.000000,-200.000000,degC
CH2, "Tout_Fan2b",TEMP,PT,Off,2000.000000,-200.000000,degC
CH3, "Tout_Fan2a",TEMP,PT,Off,2000.000000,-200.000000,degC
CH4, "Tout_Fan1a",TEMP,PT,Off,2000.000000,-200.000000,degC
Here you can see the rows where the column names are in double quotes: "Tin_MIX_Air", "Tout..", etc. There are exactly 16 rows with names.
Logic/Pulse,Off
Data
Number,Date&Time,ms,CH1,CH2,CH3,CH4,CH5,CH7,CH8,CH9,CH10,CH11,CH12,CH13,CH14,CH15,CH16,CH20,Alarm1-10,Alarm11-20,AlarmOut
NO.,Time,ms,degC,degC,degC,degC,degC,degC,%RH,%RH,degC,degC,degC,degC,degC,Pa,Pa,A,A1234567890,A1234567890,A1234
1,2017-07-31 10:45:38,000,+25.6,+26.2,+26.1,+26.0,+26.3,+25.7,+43.70,+37.22,+25.6,+25.3,+25.1,+25.3,+25.3,+0.25,+0.15,+0.00,LLLLLLLLLL,LLLLLLLLLL,LLLL
And here the values of each variable start.
What I need to do is create a DataFrame from this .csv and place these names in the column names. I'm new to Python and I'm not very sure how to do it:
import pandas as pd

path = r'path-to-file.csv'
data = pd.DataFrame()
with open(path, 'r') as f:
    for line in f:
        data = pd.concat([data, pd.DataFrame([tuple(line.strip().split(','))])], ignore_index=True)
data.drop(data.index[range(0, 29)], inplace=True)
x = len(data.iloc[0])
data.drop(data.columns[[0, 1, 2, x-1, x-2, x-3]], axis=1, inplace=True)
data.reset_index(drop=True, inplace=True)
data = data.T.reset_index(drop=True).T
data = data.apply(pd.to_numeric)
This is what I've done so far to get my dataframe with the useful data: I'm dropping all the other columns that aren't useful to me and keeping only the values. The last three lines reset the row/column indexes and convert the whole df to floats. What I would like is to name the columns with each of the names I showed in the first piece of code; as I said before, I'm doing this manually as:
data.columns = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p']
But I would like to get them from the .csv file, since there's a possibility of the CH# - "Name" combination changing.
Thank you very much for the help!
Comment: is it possible for it to work within the other "open" loop that I have?
Assume the column names run from after row 2 up to row 6, and the data from row 7 up to EOF.
For instance (untested sketch):
data = None
columns = []
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        if 2 < row <= 6:
            ch, name = line.split(',')[:2]
            columns.append(name.strip().strip('"'))
        elif row > 6:
            row_data = [tuple(line.strip().split(','))]
            if data is None:
                data = pd.DataFrame(row_data, columns=columns)
            else:
                data = pd.concat([data, pd.DataFrame(row_data, columns=columns)], ignore_index=True)
Question: ... I would like to get them from the .csv file
Start with:
with open(path) as fh:
    for row, line in enumerate(fh, 1):
        if row > 2:
            ch, name = line.split(',')[:2]
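A runnable sketch of that starting point on the sample header (the two header lines and the line ranges are assumptions based on the sample shown; StringIO stands in for the real file):

```python
from io import StringIO

raw = '''Trigger time,2017-07-31,10:45:38
CH,Signal name,Input,Range,Filter,Span
CH1, "Tin_MIX_Air",TEMP,PT,Off,2000.000000,-200.000000,degC
CH2, "Tout_Fan2b",TEMP,PT,Off,2000.000000,-200.000000,degC
'''
columns = []
for row, line in enumerate(StringIO(raw), 1):
    if row > 2:  # channel rows start after the two header lines
        ch, name = line.split(',')[:2]
        columns.append(name.strip().strip('"'))  # drop spaces and quotes

print(columns)  # ['Tin_MIX_Air', 'Tout_Fan2b']
```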
