Is it possible for pandas to read a text file that contains line continuation?
For example, say I have a text file, 'read_table.txt', that looks like this:
col1, col2
a, a string
b, a very long \
string
c, another string
If I invoke read_table on the file I get this:
>>> pandas.read_table('read_table.txt', delimiter=',')
col1 col2
0 a a string
1 b a very long \
2 string NaN
3 c another string
I'd like to get this:
col1 col2
0 a a string
1 b a very long string
2 c another string
Use escapechar:
df = pd.read_table('read_table.txt', delimiter=',', escapechar='\\')
That will include the newline in the field, as DSM pointed out; you can strip the newlines afterwards with df.col2 = df.col2.str.replace(r'\n\s*', '', regex=True).
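For reference, a minimal sketch of that approach end to end. StringIO stands in for 'read_table.txt' so the snippet is self-contained, and skipinitialspace=True is added so the column names and values lose the space after each comma:
import pandas as pd
from io import StringIO

# the file contents from the question, with the backslash line continuation
data = 'col1, col2\na, a string\nb, a very long \\\nstring\nc, another string\n'

df = pd.read_table(StringIO(data), delimiter=',', escapechar='\\',
                   skipinitialspace=True)
# the escaped newline survives inside the field, so strip it out
df['col2'] = df['col2'].str.replace(r'\n\s*', '', regex=True)
print(df)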
I couldn't get the escapechar option to work as Padraic suggested, probably because I'm stuck on a Windows box at the moment (tell-tale \r):
col1 col2
0 a a string
1 b a very long \r
2 string NaN
3 c another string
What I did get to work correctly was a regex pass:
import pandas as pd
import re
import StringIO  # Python 2 on this machine, embarrassingly

with open('read_table.txt') as f_in:
    file_string = f_in.read()

# collapse each backslash-newline (plus any leading whitespace) before parsing
subbed_str = re.sub(r'\\\n\s*', '', file_string)
df = pd.read_table(StringIO.StringIO(subbed_str), delimiter=',')
This yielded your desired output:
col1 col2
0 a a string
1 b a very long string
2 c another string
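For anyone on Python 3, the same regex pass looks like this (io.StringIO replaces the old StringIO module):
import io
import re
import pandas as pd

with open('read_table.txt') as f_in:
    subbed_str = re.sub(r'\\\n\s*', '', f_in.read())

df = pd.read_table(io.StringIO(subbed_str), delimiter=',')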
Very cool question. Thanks for sharing it!
Related
We are trying to extract rows from a column whose value is strictly one of the following values: [TC1, TC2, TC3]. The trick is that some rows also contain values such as TC12, TC13, etc. We don't want to extract those. Using str.contains is not an option here.
Col_1 Col_2 Col_3
1 A TC1
2 B TC2
3 C TC3
4 D TC12
5 D TC15
6 D TC16
Col_1 Col_2 Col_3
1 A TC1
2 B TC2
3 C TC3
We used the following commands:
df1 = df.loc[df1['Col_3'].str.match("TC\d{1}")]
df1 = df.loc[df1['Col_3'].str.match("TC[1-3]{1}")]
df1 = df.loc[df1['Col_3'].str.match("TC[1,2,3]")]
But the problem is that it is not working: instead of returning the first 3 rows, each command returns all of the rows. We don't understand what's wrong.
I would do
import pandas as pd
df = pd.DataFrame({"col":['TC1','TC2','TC3','TC12','TC15','TC16']})
print(df[df["col"].str.match(r"^TC\d$")])
output
col
0 TC1
1 TC2
2 TC3
Explanation: I used ^ and $, which anchor the start and end of the string, so the pattern only matches when the whole value matches. r"..." is a so-called raw string, which lets me use \d inside it without additional escaping (for more about this, see the re docs). As a side note, "TC[1,2,3]" does not do what you think: characters enumerated inside [ ] take no separator, so the , is treated as a literal character to match:
import re
if re.match("TC[1,2,3]", "TC,"):
    print("match")
else:
    print("no match")
output
match
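Also note that \d matches TC4 through TC9 as well; since the question wants strictly TC1, TC2 and TC3, a character range is tighter. A sketch, assuming pandas >= 1.1 for str.fullmatch:
import pandas as pd

df = pd.DataFrame({"col": ['TC1', 'TC2', 'TC3', 'TC12', 'TC15', 'TC16']})
# fullmatch anchors both ends implicitly; [1-3] allows only the digits 1-3
print(df[df["col"].str.fullmatch(r"TC[1-3]")])
#    col
# 0  TC1
# 1  TC2
# 2  TC3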
You can use str.contains -
df = df[df.Col_3.str.contains(pat=r'^TC\d$')]
or via str.match -
df = df[df.Col_3.str.match(pat=r'^TC\d$')]
or via str.fullmatch -
df = df[df.Col_3.str.fullmatch(pat=r'TC\d')]
or via apply (slow) -
import re
df = df[df.Col_3.apply(lambda x: re.match(r'^TC\d$', x)).notna()]
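A quick check of the first pattern against the sample data from the question:
import pandas as pd

df = pd.DataFrame({'Col_1': [1, 2, 3, 4, 5, 6],
                   'Col_2': list('ABCDDD'),
                   'Col_3': ['TC1', 'TC2', 'TC3', 'TC12', 'TC15', 'TC16']})
print(df[df.Col_3.str.contains(r'^TC\d$')])
#    Col_1 Col_2 Col_3
# 0      1     A   TC1
# 1      2     B   TC2
# 2      3     C   TC3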
I have a column that contains a complicated string format. I would like to keep the first word only, and/or keep the first word in addition to certain other words.
I wish to keep certain key words in the string, such as 'RED', 'DB', 'APP', 'Infra', etc.
DATA
type grp
Goodbye-CCC-LET-TestData-A.1 a
Hello-PIR-SSS-Hellosims-App-INN-A.0 b
Hello-PIR-SSS-DB-RED-INN-C.0 c
Hello-PIR-SSS-App-SA200-F.0 d
Goodbye-PIR-SIR-DB_set-int-e.1 c
OK-PIR-SVV-Infra_ll-NA-A.0 e
DESIRED
type grp
Goodbye a
Hello-App b
Hello-DB-RED c
Hello-App d
Goodbye-DB c
OK-Infra e
DOING
s = (df['type'].str.split('-')
               .str[0]
               .str.cat(df['type'].str.extract(r'(\d.\d+T|\d+T)', expand=False),
                        sep=' ',
                        na_rep='')
               .str.strip())
df.insert(1, 'type', s)
The code above just gives me the first word, for example:
Goodbye
Hello
OK
Any suggestion is appreciated; I am still researching.
You can use str.extractall on your series and then join the values:
import pandas as pd
import re
(df.drop(columns='type')
   .join(df['type'].str.extractall(r'(^\w+)-|(app|red|infra|db)',
                                   flags=re.IGNORECASE)
                   .stack()
                   .groupby(level=0)
                   .agg(type='-'.join)))
grp type
0 a Goodbye
1 b Hello-App
2 c Hello-DB-RED
3 d Hello-App
4 c Goodbye-DB
5 e OK-Infra
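To try that snippet, the sample frame from the question can be rebuilt like this:
import pandas as pd

df = pd.DataFrame({
    'type': ['Goodbye-CCC-LET-TestData-A.1',
             'Hello-PIR-SSS-Hellosims-App-INN-A.0',
             'Hello-PIR-SSS-DB-RED-INN-C.0',
             'Hello-PIR-SSS-App-SA200-F.0',
             'Goodbye-PIR-SIR-DB_set-int-e.1',
             'OK-PIR-SVV-Infra_ll-NA-A.0'],
    'grp': list('abcdce')})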
Can anyone explain how comment='#' works when reading a CSV file with pandas, i.e. pd.read_csv(..., comment='#', ...)? Sample code is below.
# Read the raw file as-is: df1
df1 = pd.read_csv(file_messy)
# Print the output of df1.head()
print(df1.head(5))
# Read in the file with the correct parameters: df2
df2 = pd.read_csv(file_messy, delimiter=' ', header=3, comment='#')
# Print the output of df2.head()
print(df2.head())
# Save the cleaned up DataFrame to a CSV file without the index
df2.to_csv(file_clean, index=False)
Here is an example of how the comment argument works:
csv_string = """col1;col2;col3
1;4.4;99
#2;4.5;200
3;4.7;65"""
# Without comment argument
print(pd.read_csv(StringIO(csv_string), sep=";"))
# col1 col2 col3
# 0 1 4.4 99
# 1 #2 4.5 200
# 2 3 4.7 65
# With comment argument
print(pd.read_csv(StringIO(csv_string),
sep=";", comment="#"))
# col1 col2 col3
# 0 1 4.4 99
# 1 3 4.7 65
You can find everything in the documentation.
Citation:
comment : str, default None
Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing '#empty\na,b,c\n1,2,3' with header=0 will result in 'a,b,c' being treated as the header.
Thus, it just ignores everything from # to the end of the line, and skips fully commented lines altogether.
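In other words, comment='#' also strips an inline comment from the middle of a data row, not only whole commented lines. A small demo with hypothetical data:
from io import StringIO
import pandas as pd

csv_string = """a;b;c
1;2;3#everything from the hash on is ignored
#4;5;6
7;8;9"""
print(pd.read_csv(StringIO(csv_string), sep=";", comment="#"))
#    a  b  c
# 0  1  2  3
# 1  7  8  9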
I am trying to iterate over a list of strings from dataframe1 and check whether any of those strings appear in dataframe2, replacing them where they occur.
for index, row in nlp_df.iterrows():
    print(row['x1'])
    string1 = row['x1'].replace("(", "\\(")
    string1 = string1.replace(")", "\\)")
    string1 = string1.replace("[", "\\[")
    string1 = string1.replace("]", "\\]")
    nlp2_df['title'] = nlp2_df['title'].replace(string1, "")
To do this I iterated with the code shown above, checking for and replacing any string found in df1.
The output below shows the strings in df1:
wait_timeout
interactive_timeout
pool_recycle
....
__all__
folder_name
re.compile('he(lo')
The output below shows the output after replacing strings in df2
0 have you tried watching the traffic between th...
1 /dev/cu.xxxxx is the "callout" device, it's wh...
2 You'll want the struct package.\r\r\n
For the output in df2, strings like /dev/cu.xxxxx should have been replaced during the iteration, but as shown they are not removed. However, I have attempted using nlp2_df['title'] = nlp2_df['title'].replace("/dev/cu.xxxxx","") and managed to remove it successfully.
Is there a reason why writing the string directly works, but looping with a variable does not?
IIUC you can simply use regular expressions:
nlp2_df['title'] = nlp2_df['title'].str.replace(r'([\(\)\[\]])', r'\\\1', regex=True)
P.S. you don't need the for loop at all...
Demo:
In [15]: df
Out[15]:
title
0 aaa (bbb) ccc
1 A [word] ...
In [16]: df['new'] = df['title'].str.replace(r'([\(\)\[\]])', r'\\\1', regex=True)
In [17]: df
Out[17]:
title new
0 aaa (bbb) ccc aaa \(bbb\) ccc
1 A [word] ... A \[word\] ...
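As a side note on the original loop: rather than escaping (, ), [ and ] by hand, re.escape neutralises every regex metacharacter at once. A hedged sketch of the loop rewritten that way (nlp_df and nlp2_df as in the question):
import re

for string1 in nlp_df['x1']:
    # escape all metacharacters, then remove each occurrence from every title
    pattern = re.escape(string1)
    nlp2_df['title'] = nlp2_df['title'].str.replace(pattern, '', regex=True)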
Here we go again.
Hi, I'm trying to detect an error in a CSV file.
The file should look as follows
goodfile.csv
"COL_A","COL_B","COL_C","COL_D"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
"ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
"ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
"ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
"ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
"ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"
But the file I have is actually
brokenfile.csv
"COL_A","COL_B",COL C,"COL_D"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
"ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
"ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
"ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
"ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
"ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"
When I import the two files with pandas
data = pd.read_csv('goodfile.csv')
data = pd.read_csv('brokenfile.csv')
I get the same result
data
COL_A COL_B COL_C COL_D
0 ROW1COLA ROW1COLB ROW1COLC ROW1COLD
1 ROW2COLA ROW2COLB ROW2COLC ROW2COLD
2 ROW3COLA ROW3COLB ROW3COLC ROW3COLD
3 ROW4COLA ROW4COLB ROW4COLC ROW4COLD
4 ROW5COLA ROW5COLB ROW5COLC ROW5COLD
5 ROW6COLA ROW6COLB ROW6COLC ROW6COLD
6 ROW7COLA ROW7COLB ROW7COLC ROW7COLD
Anyway, what I want is to detect the error in the second file, "brokenfile.csv", which lacks the double quotes around the COL C header.
I think you can detect the missing " in the DataFrame's columns with str.contains and boolean indexing, inverting the boolean array with ~:
import pandas as pd
import io
temp=u'''"COL_A","COL_B",COL C,"COL_D"
"ROW1COLA","ROW1COLB","ROW1COLC","ROW1COLD"
"ROW2COLA","ROW2COLB","ROW2COLC","ROW2COLD"
"ROW3COLA","ROW3COLB","ROW3COLC","ROW3COLD"
"ROW4COLA","ROW4COLB","ROW4COLC","ROW4COLD"
"ROW5COLA","ROW5COLB","ROW5COLC","ROW5COLD"
"ROW6COLA","ROW6COLB","ROW6COLC","ROW6COLD"
"ROW7COLA","ROW7COLB","ROW7COLC","ROW7COLD"'''
# after testing, replace io.StringIO(temp) with the filename
# quoting=3 is csv.QUOTE_NONE, so the quote characters are kept as data
df = pd.read_csv(io.StringIO(temp), quoting=3)
print df
"COL_A" "COL_B" COL C "COL_D"
0 "ROW1COLA" "ROW1COLB" "ROW1COLC" "ROW1COLD"
1 "ROW2COLA" "ROW2COLB" "ROW2COLC" "ROW2COLD"
2 "ROW3COLA" "ROW3COLB" "ROW3COLC" "ROW3COLD"
3 "ROW4COLA" "ROW4COLB" "ROW4COLC" "ROW4COLD"
4 "ROW5COLA" "ROW5COLB" "ROW5COLC" "ROW5COLD"
5 "ROW6COLA" "ROW6COLB" "ROW6COLC" "ROW6COLD"
6 "ROW7COLA" "ROW7COLB" "ROW7COLC" "ROW7COLD"
print df.columns
Index([u'"COL_A"', u'"COL_B"', u'COL C', u'"COL_D"'], dtype='object')
print df.columns.str.contains('"')
[ True True False True]
print ~df.columns.str.contains('"')
[False False True False]
print df.columns[~df.columns.str.contains('"')]
Index([u'COL C'], dtype='object')
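To turn that check into a hard failure, one possible follow-up (a sketch, not part of the original answer):
bad_cols = df.columns[~df.columns.str.contains('"')]
if len(bad_cols) > 0:
    raise ValueError('unquoted header fields: %s' % list(bad_cols))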
Pandas tries to be smart about recognising data types when reading in data, and that is exactly what's happening in the case you're describing: COL C and "COL_C" are both parsed as strings.
In short, there is no error to detect! At least, pandas will not produce an error in cases like this.
If you wanted to detect missing quotes in the header, you could read in the first line in a more "traditional" Pythonic way and draw your own conclusions from there:
>>> with open('filename') as f:
...     lines = f.readlines()
...
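For example, a minimal sketch along those lines, assuming every header field should be wrapped in double quotes:
with open('brokenfile.csv') as f:
    header = f.readline().rstrip('\n')

# flag any comma-separated token that is not fully quoted
bad = [tok for tok in header.split(',')
       if not (tok.startswith('"') and tok.endswith('"'))]
if bad:
    print('unquoted header fields: %s' % bad)
# unquoted header fields: ['COL C']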