How to skip text being used as a column heading using Python

I am importing a web log text file into Python using Pandas. Pandas reads the headers, but it has used the text "Fields:" as a header and then added another column of blanks (NaNs) at the end. How can I stop this text being used as a column heading?
Here is my code:
arr = pd.read_table("path", skiprows=3, delim_whitespace=True, na_values=True)
Here is the start of the file:
Software: Microsoft Internet Information Services 7.5
Version: 1.0
Date: 2014-08-01 00:00:25
Fields: date time
2014-08-01 00:00:25...
The result is that 'Fields' is used as a column heading and a column full of NaN values is created for the 'time' column.

You can do it calling read_table twice.
# read the fourth line into a 1x1 DataFrame (a single string),
# then split it and skip the first field ('Fields:'):
col_names = pd.read_table('path', skiprows=3, nrows=1, header=None).iloc[0,0].split()[1:]
# reads the actual data:
df = pd.read_table('path', sep=' ', skiprows=4, names=col_names)
If you already know the names of the columns (e.g. date and time) then it's even simpler:
df = pd.read_table('path', sep=' ', skiprows=4, names=['date', 'time'])

I think you may want skiprows = 4 and header = None
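A minimal sketch combining that with the asker's original delim_whitespace call (assuming the four metadata lines shown above and whitespace-separated date/time data):
import pandas as pd

# skip the four metadata lines, tell pandas the data itself has no header row,
# and supply the column names directly
df = pd.read_table('path', delim_whitespace=True, skiprows=4,
                   header=None, names=['date', 'time'])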

Related

Python Pandas Reading

I am trying to read a large log file, which has been parsed using different delimiters (legacy changes).
This code works
import os, subprocess, time, re
import pandas as pd

for root, dirs, files in os.walk('.', topdown=True):
    for file in files:
        df = pd.read_csv(file, sep='[,|;: \t]+', header=None, engine='python', skipinitialspace=True)
        for index, row in df.iterrows():
            print(row[0], row[1])
This works well for the following data
user1#email.com address1
user2#email.com;address2
user3#email.com,address3
user4#email.com;;address4
user5#email.com,,address5
Issue #1: the following row in the input file will break the code. I wish for this to be parsed into 2 columns (not 3)
user6#email.com,,address;6
Issue #2: I wish to replace all single and double quotes in address, but neither of the following seems to work.
df[1]=df[1].str.replace('"','DQUOTES')
df.replace('"', 'DQUOTES', regex=True)
Pls help!
You can first read the file into a single column and then do the processing step by step in pandas:
split into two columns (n=1)
replace the quotes
if needed (i.e. if there may be further columns you don't want), split the address column again and keep only the first part ([0]); here you may want to remove the space from the list of separators. If commas, semicolons, etc. can be part of the address itself, you of course don't need this step.
import io
import pandas as pd
s= """user1#email.com address1
user2#email.com;address2
user3#email.com,address3
user4#email.com;;address4
user5#email.com,,address5
user6#email.com,,address;6
user6#email.com,,address with "double quotes"
user6#email.com,,address with 'single quotes'
"""
df = pd.read_csv(io.StringIO(s), sep='\n', header=None)
df = df[0].str.split(r'[,|;: \t]+', n=1, expand=True).rename(columns={0: 'email', 1: 'address'})
df.address = df.address.str.replace(r'\'|"', 'DQUOTES', regex=True)
df.address = df.address.str.split(r'[,|;:]+', n=1, expand=True)[0]  # depending on what you need
Result:
email address
0 user1#email.com address1
1 user2#email.com address2
2 user3#email.com address3
3 user4#email.com address4
4 user5#email.com address5
5 user6#email.com address
6 user6#email.com address with DQUOTESdouble quotesDQUOTES
7 user6#email.com address with DQUOTESsingle quotesDQUOTES

Why is Pandas' whitespace delimiter skipping one of my values?

I'm currently trying to use Python to read a text file into SQLite (via sqlite3) using Pandas. Here are a few entries from the text file:
1 Michael 462085 2.2506 Jessica 302962 1.5436
2 Christopher 361250 1.7595 Ashley 301702 1.5372
3 Matthew 351477 1.7119 Emily 237133 1.2082
The data consists of popular baby names, and I have to separate male names and female names into their own tables and perform queries on them. My method consists of first placing all the data into both tables, then dropping the unneeded columns afterwards. My issue is that when I try to add names to the columns, I get a ValueError: the expected axis has 6 elements, but 7 values were passed. I'm assuming it's because Pandas isn't reading the last value of each line, but I can't figure out how to fix it. My current delimiter is the whitespace delimiter you can see below.
Here is my code:
import sqlite3
import pandas as pd
import csv
con = sqlite3.connect("C:\\****\\****\\****\\****\\****\\baby_names.db")
c = con.cursor()

# Please note that most of these functions will be commented out, because they will only be run once.
def create_and_insert():
    # load data from the text file
    df = pd.read_csv('babynames.txt', index_col=0, header=None, sep=r'\s+', engine='python')
    # add column names
    df.columns = ['Rank', 'BoyName', 'Boynumber', 'Boypercent', 'Girlname', 'Girlnumber', 'Girlpercent']
    df.columns = df.columns.str.strip()
    con = sqlite3.connect("*************\\baby_names.db")
    # drop data into database
    df.to_sql("Combined", con)
    df.to_sql("Boys", con)
    df.to_sql("Girls", con)
    con.commit()
    con.close()

create_and_insert()

def test():
    c.execute("SELECT * FROM Boys WHERE Rank = 1")
    print(c.fetchall())

test()

con.commit()
con.close()
I've tried adding multiple delimiters, but that didn't seem to do anything. Using just a regular space as the delimiter seems to create 'blank' column names. From reading the Pandas docs it seems multiple delimiters are possible, but I can't quite figure it out. Any help would be greatly appreciated!
Note that:
your input file contains 7 columns,
but the initial column is set as the index (you passed index_col=0),
so your DataFrame contains only 6 regular columns.
Print df to confirm it.
Now, when you run df.columns = ['Rank', ...], you attempt to assign the
7 passed names to the existing 6 data columns.
Probably you should:
read the DataFrame without setting the index (for now),
assign all 7 column names,
set Rank column as the index.
The code to do it is:
df = pd.read_csv('babynames.txt', header=None, sep=r'\s+', engine='python')
df.columns = ['Rank', 'BoyName', 'Boynumber', 'Boypercent', 'Girlname', 'Girlnumber',
'Girlpercent']
df.set_index('Rank', inplace=True)
Or even shorter (all in one):
df = pd.read_csv('babynames.txt', sep=r'\s+', engine='python',
names=['Rank', 'BoyName', 'Boynumber', 'Boypercent', 'Girlname', 'Girlnumber',
'Girlpercent'], index_col='Rank')

How to get the count from unstructured data

I have unstructured data from my perf logs. I would like to capture the service details from it. I can split on a delimiter; however, I am not able to count or print a given column, since the data doesn't have any header.
Kindly help me figure out this issue.
import pandas as pd
df = pd.read_csv(r'/Users/Myhome/Documents/Py_Learning/log.csv', sep='|', skipinitialspace=True)
#df = pd.read_csv(r'/Users/Myhome/Documents/Py_Learning/log.csv', sep=':|,|[|]', engine='python', header=None)  # --> the multi-separator version raises an error
#df.groupby("CLIENT")
SERVICE = df.columns[4]
print(SERVICE)
How can I find the unique service names across all lines and get their counts? I would like to present this as a graph of last week's data.
Sample data:
2019-10-22 15:35|Where:CARD|SERVICE:Dell|VERSION:1.0|CLIENT:HDD|OPERATION:boverdue|RESPONSETIME:0034|STATUS:100000:ERR_TRANSACTION_TIMED_OUT|SEVERITY:ERROR|STATUSCODE:SOAP-FAULT|STATUSMESSAGE:NA
2019-10-22 15:35|Where:Digital|SERVICE:Laptop|VERSION:1.0|CLIENT:mouse|OPERATION:connet|RESPONSETIME:3456|STATUS:NO_RECORDS_MATCH_SELECTION_CRITERIA|SEVERITY:INFO|STATUSCODE:1120|STATUSMESSAGE:NA
First, since your data has no header, you need to set header=None.
df = pd.read_csv('data.csv', sep='|', skipinitialspace=True, header=None)
And then to get service data, simply do
new_df = df.iloc[:,2].str.split(':', expand=True).iloc[:,1].value_counts()
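If you need all of the fields rather than just SERVICE, here is a rough sketch that expands every KEY:VALUE pair into its own column (assuming each record follows the layout in the sample, with 'data.csv' standing in for your log file):
import pandas as pd

df = pd.read_csv('data.csv', sep='|', skipinitialspace=True, header=None)

# column 0 is the timestamp; every later cell is 'KEY:VALUE', so split each
# cell once on the first colon and collect the pairs into a dict per row
records = df.iloc[:, 1:].apply(
    lambda row: dict(cell.split(':', 1) for cell in row.dropna()), axis=1)
fields = pd.DataFrame(list(records))
fields['timestamp'] = df[0]

print(fields['SERVICE'].value_counts())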

How can I fix "Error tokenizing data" on pandas csv reader?

I'm trying to read a csv file with pandas.
This file actually has only one row but it causes an error whenever I try to read it.
Something seems to be going wrong at line 8, but I can hardly find an 8th line, since there is clearly only one row in the file.
I do like:
import codecs
import pandas as pd

with codecs.open("path_to_file", "rU", "Shift-JIS", "ignore") as file:
    df = pd.read_csv(file, header=None, sep="\t")
df
Then I get:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 8, saw 3
I don't get what's really going on, so any of your advice will be appreciated.
I struggled with this for almost half a day. I opened the csv in Notepad and noticed that the separator is a TAB, not a comma, and then tried the combination below.
df = pd.read_csv('C:\\myfile.csv', sep='\t', lineterminator='\r')
Try df = pd.read_csv(file, header=None, error_bad_lines=False)
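Note that error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on recent versions the equivalent is:
df = pd.read_csv(file, header=None, on_bad_lines='skip')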
The existing answer will not include these additional lines in your dataframe. If you'd like your dataframe to be as wide as its widest point, you can use the following:
delimiter = ','
# the widest line has this many delimiters, hence one more field than that
max_fields = max(open(path_name, 'r'), key=lambda x: x.count(delimiter)).count(delimiter) + 1
df = pd.read_csv(path_name, header=None, skiprows=1, names=list(range(max_fields)))
Set skiprows=1 only if there's actually a header; you can always retrieve the header's column names later.
You can also identify rows that have more columns populated than the number of column names in the original header.
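A quick sketch of that last check (reusing path_name and the comma delimiter from the snippet above, and assuming the first line is the header):
delimiter = ','
with open(path_name, 'r') as f:
    header_width = f.readline().count(delimiter)
    # flag every data row carrying more fields than the header
    for line_no, line in enumerate(f, start=2):
        if line.count(delimiter) > header_width:
            print(f'line {line_no}: {line.count(delimiter) + 1} fields')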

Having trouble removing headers when using pd.read_csv

I have a .csv that contains column headers and is displayed below. I need to suppress the column labeling when I ingest the file as a data frame.
date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7
When I issue the following command:
df = pd.read_csv('c:/temp1/test_csv.csv', usecols=[4,5], names = ["zip","weight"], header = 0, nrows=10)
I get:
zip weight
0 1417464 3546600
I have tried various manipulations of header=True and header=0. If I don't use header=0, then the columns will all print out on top of the rows like so:
zip weight
height locale
0 1417464 3546600
I have tried skiprows=0 and 1, but neither removes the headers; the command just skips the specified line instead.
I could really use some additional insight or a solve. Thanks in advance for any assistance you could provide.
Tiberius
Using the example of #jezrael, if you want to skip the header and suppress the column labeling:
import pandas as pd
import numpy as np
import io
temp=u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], header=None, skiprows=1)
print(df)
4 5
0 3546600 254
I'm not sure I entirely understand why you want to remove the headers, but you could comment out the header line as follows as long as you don't have any other rows that begin with 'd':
>>> df = pd.read_csv('test.csv', usecols=[3,4], header=None, comment='d') # comments out lines beginning with 'date,color' . . .
>>> df
3 4
0 1417464 3546600
It would be better to comment out the line in the csv file with the crosshatch character (#) and then use the same approach (again, as long as you have not commented out any other lines with a crosshatch):
>>> df = pd.read_csv('test.csv', usecols=[3,4], header=None, comment='#') # comments out lines with #
>>> df
3 4
0 1417464 3546600
I think you are right.
So you can change column names to a and b:
import pandas as pd
import numpy as np
import io
temp=u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], names=["a","b"], header=0, nrows=10)
print(df)
a b
0 3546600 254
Now these columns have new names instead of weight and height.
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], header=0, nrows=10)
print(df)
weight height
0 3546600 254
You can check docs read_csv (bold by me):
header : int, list of ints, default ‘infer’
Row number(s) to use as the column names, and the start of the data. Defaults to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns E.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example are skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
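To see that behavior concretely, a small sketch using the same two-line sample as above:
import io
import pandas as pd

temp = u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""

# header=0: the first line supplies the column names
print(pd.read_csv(io.StringIO(temp), header=0).columns.tolist())
# header=None with skiprows=1: the header line is discarded and pandas numbers the columns
print(pd.read_csv(io.StringIO(temp), header=None, skiprows=1).columns.tolist())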
