comment='#' in pandas explanation - python

Can anyone explain how comment='#' works when reading a CSV file with pandas, as in pd.read_csv(..., comment='#', ...)? Sample code is below.
# Read the raw file as-is: df1
df1 = pd.read_csv(file_messy)
# Print the output of df1.head()
print(df1.head(5))
# Read in the file with the correct parameters: df2
df2 = pd.read_csv(file_messy, delimiter=' ', header=3, comment='#')
# Print the output of df2.head()
print(df2.head())
# Save the cleaned up DataFrame to a CSV file without the index
df2.to_csv(file_clean, index=False)

Here is an example of how the comment argument works:
import pandas as pd
from io import StringIO

csv_string = """col1;col2;col3
1;4.4;99
#2;4.5;200
3;4.7;65"""
# Without the comment argument
print(pd.read_csv(StringIO(csv_string), sep=";"))
#   col1  col2  col3
# 0    1   4.4    99
# 1   #2   4.5   200
# 2    3   4.7    65
# With comment="#"
print(pd.read_csv(StringIO(csv_string), sep=";", comment="#"))
#    col1  col2  col3
# 0     1   4.4    99
# 1     3   4.7    65

You can find everything in the documentation.
Citation:
comment : str, default None
Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing #empty\na,b,c\n1,2,3 with header=0 will result in a,b,c being treated as the header.
Thus, it simply ignores everything from # to the end of the line, and drops lines that start with # altogether.
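To illustrate the first behaviour, here is a small sketch (hypothetical data) where the comment character appears mid-line and only the tail of the line is dropped:
import pandas as pd
from io import StringIO

# comment='#' truncates the remainder of a line, not only whole lines
s = "col1;col2;col3\n1;4.4;99#this tail is ignored\n3;4.7;65"
print(pd.read_csv(StringIO(s), sep=";", comment="#"))
#    col1  col2  col3
# 0     1   4.4    99
# 1     3   4.7    65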

Outputting DataFrame to tsv, how to ignore or override 'need to escape' error

Related to, but distinct from, this question.
I want to output my pandas dataframe to a tsv file. The first column of my data is a pattern that actually contains 3 bits of information which I'd like to separate into their own columns:
Range                   c1
chr1:2953-2965   -0.001069
chr1:35397-35409 -0.001050
chr1:37454-37466 -0.001330
chr2:37997-38009 -0.001235
chrX:44465-44477 -0.001292
So I do this:
Df = Df.reset_index()
Df["Range"] = Df["Range"].str.replace( ":", "\t" ).str.replace( "-", "\t" )
Df
                Range        c1
0    chr1\t2953\t2965 -0.001069
1  chr1\t35397\t35409 -0.001050
2  chr1\t37454\t37466 -0.001330
3  chr2\t37997\t38009 -0.001235
4  chrX\t44465\t44477 -0.001292
All I need to do now is output with no header or index, and add one more '\t' to separate the last column and I'll have my 4-column output file as desired. Unfortunately...
Df.to_csv( "~/testout.bed",
header=None,
index=False,
sep="\t",
quoting=csv.QUOTE_NONE,
quotechar=""
)
Error: need to escape, but no escapechar set
Here is where I want to ignore this error and say, "No, Python, actually you don't need to escape anything. I put those tab characters in there specifically to create column separators."
I get why this error occurs. Python thinks I forgot about those tabs, and this is a safety catch, but actually I didn't forget about anything and I know what I'm doing. I know that the tab characters in my data will be indistinguishable from column-separators, and that's exactly what I want. I put them there specifically for this reason.
Surely there must be some way to override this, no? Is there any way to ignore the error and force the output?
You can simply use str.split to split the Range column directly:
df['Range'].str.split(r":|-", expand=True)
#       0      1      2
# 0  chr1   2953   2965
# 1  chr1  35397  35409
# 2  chr1  37454  37466
# 3  chr2  37997  38009
# 4  chrX  44465  44477
To retain all the columns, you can simply join this split with the original:
df = df.join(df['Range'].str.split(r":|-", expand=True))
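Putting it together for the TSV/BED goal, a minimal sketch (data and file name made up): once the pattern is split into real columns, the separators become genuine column boundaries and to_csv has nothing to escape, so the QUOTE_NONE workaround is no longer needed:
import pandas as pd

df = pd.DataFrame({"Range": ["chr1:2953-2965", "chrX:44465-44477"],
                   "c1": [-0.001069, -0.001292]})
# split chr:start-end into three real columns, then reattach the score
out = df["Range"].str.split(r":|-", expand=True).join(df["c1"])
# four tab-separated columns, no header or index, nothing to escape
out.to_csv("testout.bed", sep="\t", header=False, index=False)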

Am I using groupby.sum() correctly?

I have the following code, and a problem in the new_df["SUM"] line:
import pandas as pd
df = pd.read_excel(r"D:\Tesina\Proteoma Humano\Tablas\uno - copia.xlsx")
#df = pd.DataFrame({'ID': ['C9JLR9','O95391', 'P05114',"P14866"], 'SEQ': ['1..100,182..250,329..417,490..583', '1..100,206..254,493..586', '1..100', "1..100,284..378" ]})
df2 = pd.DataFrame
df["SEQ"] = df["SEQ"].replace(r"\.\.", " ", regex=True)
new_df = df.assign(SEQ=df.SEQ.str.split(',')).explode('SEQ')
for index, row in df.iterrows():
    new_df['delta'] = new_df['SEQ'].map(lambda x: (int(x.split()[1])+1)-int(x.split()[0]) if x.split()[0] != '1' else (int(x.split()[1])+1))
new_df["SUM"] = new_df.groupby(["ID"]).sum().reset_index(drop=True)  # Here's the error, even though I can't see where
df2 = new_df.groupby(["ID","SUM"], sort=False)["SEQ"].apply((lambda x: ','.join(x.astype(str)))).reset_index(name="SEQ")
To give some context: it grabs every line with the same ID, separates the numbers at each ",", does some math with those numbers (that's where the "delta" line gets involved; I know it's not really a delta), and finally sums up all the "delta" values for each ID, grouping them by their original ID so I keep the same number of rows.
When I use a sample of the data (the one that's commented out at the beginning), it works perfectly, giving me the output that I want:
       ID  SUM                            SEQ
0  C9JLR9  353  1 100,182 250,329 417,490 583
1  O95391  244          1 100,206 254,493 586
2  P05114  101                          1 100
3  P14866  196                  1 100,284 378
But when I apply it to my Excel file (which has 10471 rows), the groupby.sum() line doesn't work as it's supposed to (I've already checked everything else; I know the error is within that line).
This is the output that I receive:
       ID  SUM                            SEQ
0  C9JLR9   39  1 100,182 250,329 417,490 583
1  O95391   20          1 100,206 254,493 586
2  P05114   33                          1 100
4  P98177   21                  1 100,176 246
You can clearly see that the SUM values differ (and are not correct at all). I haven't been able to figure out where those numbers come from either. It's really weird.
If anyone is interested, the solution was provided in the comments: I had to replace the offending line with the following:
new_df["SUM"] = new_df.groupby("ID")["delta"].transform("sum")

Match index function from excel in pandas

There is an INDEX/MATCH function combination in Excel that I use to check whether elements are present in the required column:
=iferror(INDEX($B$2:$F$8,MATCH($J4,$B$2:$B$8,0),MATCH(K$3,$B$1:$F$1,0)),0)
This is the function I am using right now and it is yielding good results, but I want to implement it in Python.
brand   N    Z  None
Honor  63   96   190
Tecno   0  695   763
From this table I want:
brand  L   N    Z
Honor  0  63   96
Tecno  0   0  695
It should compare both the column and the index and give the appropriate value. I have tried the lookup function in pandas, but that gives me:
ValueError: Row labels must have same size as column labels
What you basically do with your Excel formula is create something like a pivot table. You can also do that with pandas, e.g. like this:
# Define the columns and brands, you like to have in your result table
# along with the dataframe in variable df it's the only input
columns_query=['L', 'N', 'Z']
brands_query=['Honor', 'Tecno', 'Bar']
# now begin processing by selecting the columns
# which should be shown and are actually present
# add the brand, even if it was not selected
columns_present= {col for col in set(columns_query) if col in df.columns}
columns_present.add('brand')
# select the brands in question and take the
# info in columns we identified for these brands
# from this generate a "flat" list-like data
# structure using melt
# it contains records containing
# (brand, column-name and cell-value)
flat= df.loc[df['brand'].isin(brands_query), columns_present].melt(id_vars='brand')
# if you also want to see the columns and brands,
# for which you have no data in your original df
# you can use the following lines (if you don't
# need them, just skip the following lines until
# the next comment)
# the code just generates data points for the
# columns and rows, which would otherwise not be
# displayed and fills them with NaN (the pandas
# equivalent for None)
columns_missing= set(columns_query).difference(columns_present)
brands_missing= set(brands_query).difference(df['brand'].unique())
num_dummies= max(len(brands_missing), len(columns_missing))
dummy_records= {
    'brand': list(brands_missing) + [brands_query[0]] * (num_dummies - len(brands_missing)),
    'variable': list(columns_missing) + [columns_query[0]] * (num_dummies - len(columns_missing)),
    'value': [np.NaN] * num_dummies
}
dummy_records= pd.DataFrame(dummy_records)
flat= pd.concat([flat, dummy_records], axis='index', ignore_index=True)
# we get the result by the following line:
flat.set_index(['brand', 'variable']).unstack(level=-1)
For my test data, this outputs:
         value             
variable     L     N      Z
brand                      
Bar        NaN   NaN    NaN
Honor      NaN  63.0   96.0
Tecno      NaN   0.0  695.0
The test data is shown below (note that above we don't see column None or row Foo, but we do see row Bar and column L, which are not actually present in the test data but were "queried"):
   brand   N    Z  None
0  Honor  63   96   190
1  Tecno   0  695   763
2    Foo   8  111   231
You can generate this testdata using:
import pandas as pd
import numpy as np
import io
raw=\
"""brand N Z None
Honor 63 96 190
Tecno 0 695 763
Foo 8 111 231"""
df= pd.read_csv(io.StringIO(raw), sep=r'\s+')
Note: the result shown in the output is a regular pandas DataFrame, so if you plan to write the data back to an Excel sheet, there should be no problem (pandas provides methods to read/write DataFrames to/from Excel files).
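As an aside, if the 'brand' values are unique, a hypothetical shorter route to the same table is set_index plus reindex, which fills the missing rows and columns with NaN (add fillna(0) to mimic the IFERROR(..., 0) part of the formula):
# assumes df is the test data frame defined above
result = (df.set_index('brand')
            .reindex(index=['Honor', 'Tecno', 'Bar'],
                     columns=['L', 'N', 'Z'])
            .fillna(0))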
Do you need to use pandas for this? You can do it with plain Python as well: read from one text file and print out the matched and processed fields.
Basic file reading in Python goes like the following, where datafile.csv is your file. It reads all the lines in the file and prints out the result. First you need to save your file in .csv format so there is a ',' separator between fields.
import csv  # use csv
print('brand L N Z')  # print new header
with open('datafile.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    next(spamreader, None)  # skip old header
    for row in spamreader:
        # You need to add Excel MATCH etc. logic here.
        print(row[0], 0, row[1], row[2])  # print output
Input file:
brand,N,Z,None
Honor,63,96,190
Tecno,0,695,763
Prints out:
brand L N Z
Honor 0 63 96
Tecno 0 0 695
(I am not familiar with Excel's MATCH function, so you may need to add some logic to the above Python script to get it working with all your data.)
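For instance, a rough sketch of that MATCH step using csv.DictReader: look each wanted column up by name and fall back to 0 when it is missing, mirroring IFERROR(..., 0) (the wanted column list is assumed from the question):
import csv

wanted = ['L', 'N', 'Z']  # columns to "match"; 'L' is absent from the file
with open('datafile.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    print('brand', *wanted)
    for row in reader:
        # a missing column falls back to 0, like IFERROR(..., 0)
        print(row['brand'], *(row.get(col, 0) for col in wanted))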

Apply operation on columns of CSV file excluding headers and update results in last row

I have a CSV file created like this:
keep_same;get_max;get_min;get_avg
1213;176;901;517
1213;198;009;219
1213;898;201;532
Now I want a fourth row to be appended to the existing CSV file, as follows:
First column: Remains same: 1213
Second column: Get max value: 898
Third column: Get min value: 009
Fourth column: Get avg value: 422.6
So the final CSV file should be:
keep_same;get_max;get_min;get_avg
1213;176;901;517
1213;198;009;219
1213;898;201;532
1213;898;009;422.6
Please help me to achieve the same. It's not mandatory to use Pandas.
Thanks in advance!
df.agg(...) accepts a dict where the keys are column names and the values are strings naming the aggregation you want:
df_agg = df.agg({'keep_same': 'mode', 'get_max': 'max',
                 'get_min': 'min', 'get_avg': 'mean'})[df.columns]
Produces:
  keep_same  get_max  get_min     get_avg
0      1213      898        9  422.666667
Then you just append df_agg to df:
df = df.append(df_agg, ignore_index=False)
Result:
   keep_same  get_max  get_min     get_avg
0       1213      176      901  517.000000
1       1213      198        9  219.000000
2       1213      898      201  532.000000
0       1213      898        9  422.666667
Notice that the index of the appended row is 0. You can pass ignore_index=True to append if you desire.
Also note that if you plan to do this append operation a lot, it will be very slow. Other approaches do exist in that case but for once-off or just a few times, append is OK.
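Also, on pandas versions where DataFrame.append has been removed (pandas 2.0 and later), pd.concat does the same job; a sketch assuming df_agg is the one-row frame shown above:
df = pd.concat([df, df_agg], ignore_index=True)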
Assuming you do not care about the index, you can use loc[-1] to add the row:
df = pd.read_csv('file.csv', sep=';', dtype={'get_min':'object'}) # read csv set dtype to object for leading 0 col
row = [df['keep_same'].values[0], df['get_max'].max(), df['get_min'].min(), df['get_avg'].mean()] # create new row
df.loc[-1] = row # add row to a new line
df['get_avg'] = df['get_avg'].round(1) # round to 1
df['get_avg'] = df['get_avg'].apply(lambda x: '%g'%(x)) # strip .0 from the other records
df.to_csv('file1.csv', index=False, sep=';') # to csv file
out:
keep_same;get_max;get_min;get_avg
1213;176;901;517
1213;198;009;219
1213;898;201;532
1213;898;009;422.7

Python Pandas read_table with line continuation

Is it possible for pandas to read a text file that contains line continuation?
For example, say I have a text file, 'read_table.txt', that looks like this:
col1, col2
a, a string
b, a very long \
string
c, another string
If I invoke read_table on the file I get this:
>>> pandas.read_table('read_table.txt', delimiter=',')
col1 col2
0 a a string
1 b a very long \
2 string NaN
3 c another string
I'd like to get this:
  col1                col2
0    a            a string
1    b  a very long string
2    c      another string
Use escapechar:
df = pd.read_table('in.txt', delimiter=',',escapechar="\\")
That will include the newline, as DSM pointed out; you can remove the newlines with df.col2 = df.col2.str.replace(r"\n\s*", "").
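A self-contained sketch of those two steps, with inline text standing in for the file and skipinitialspace handling the space after each comma:
import pandas as pd
from io import StringIO

text = "col1, col2\na, a string\nb, a very long \\\nstring\nc, another string\n"
df = pd.read_table(StringIO(text), delimiter=',', escapechar='\\',
                   skipinitialspace=True)
# the escaped newline survives inside the field; strip it out
df['col2'] = df['col2'].str.replace(r'\n\s*', '', regex=True)
print(df)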
I couldn't get the escapechar option to work as Padraic suggested, probably because I'm stuck on a Windows box at the moment (tell-tale \r):
     col1            col2
0       a        a string
1       b  a very long \r
2  string             NaN
3       c  another string
What I did get to work correctly was a regex pass:
import pandas as pd
import re
import StringIO # python 2 on this machine, embarrassingly
with open('read_table.txt') as f_in:
    file_string = f_in.read()
subbed_str = re.sub(r'\\\n\s*', '', file_string)
df = pd.read_table(StringIO.StringIO(subbed_str), delimiter=',')
This yielded your desired output:
  col1                col2
0    a            a string
1    b  a very long string
2    c      another string
Very cool question. Thanks for sharing it!
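For reference, a hypothetical Python 3 version of the same regex pass; only the StringIO import changes:
import io
import re
import pandas as pd

with open('read_table.txt') as f_in:
    subbed_str = re.sub(r'\\\n\s*', '', f_in.read())
df = pd.read_table(io.StringIO(subbed_str), delimiter=',')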
