I have unstructured data from my perf logs and would like to capture the service details from it. I can split on a delimiter, but I am not able to count or print a column, since the data has no header.
Kindly help me figure out this issue.
import pandas as pd
df = pd.read_csv(r'/Users/Myhome/Documents/Py_Learning/log.csv', sep='|', skipinitialspace=True)
# df = pd.read_csv(r'/Users/Myhome/Documents/Py_Learning/log.csv', sep=':|,|[|]', engine='python', header=None)  # the multi-character separator raises an error
# df.groupby("CLIENT")
SERVICE = df.columns[4]
print(SERVICE)
How can I find the unique service names across all lines and get their counts? I would like to plot the last week's data as a graph.
Sample data :
2019-10-22 15:35|Where:CARD|SERVICE:Dell|VERSION:1.0|CLIENT:HDD|OPERATION:boverdue|RESPONSETIME:0034|STATUS:100000:ERR_TRANSACTION_TIMED_OUT|SEVERITY:ERROR|STATUSCODE:SOAP-FAULT|STATUSMESSAGE:NA
2019-10-22 15:35|Where:Digital|SERVICE:Laptop|VERSION:1.0|CLIENT:mouse|OPERATION:connet|RESPONSETIME:3456|STATUS:NO_RECORDS_MATCH_SELECTION_CRITERIA|SEVERITY:INFO|STATUSCODE:1120|STATUSMESSAGE:NA
First, since your data has no header, you need to set header=None:
df = pd.read_csv('data.csv', sep='|', skipinitialspace=True, header=None)
And then to get the service counts, simply do:
new_df = df.iloc[:, 2].str.split(':', expand=True).iloc[:, 1].value_counts()
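For the graph over the last week, here is a minimal sketch, assuming matplotlib is installed and that the timestamp sits in column 0 as in the sample data:
import pandas as pd

df = pd.read_csv('log.csv', sep='|', skipinitialspace=True, header=None)

# column 2 holds strings like "SERVICE:Dell"; keep the part after the colon
services = df.iloc[:, 2].str.split(':', expand=True).iloc[:, 1]

# restrict to the last 7 days via the timestamp in column 0, then plot the counts
df[0] = pd.to_datetime(df[0])
recent = services[df[0] >= df[0].max() - pd.Timedelta(days=7)]
recent.value_counts().plot(kind='bar')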
I would like to create a pandas DataFrame out of a list variable.
With pd.DataFrame() I am not able to declare a delimiter, which leads to just one column per list entry.
If I use pd.read_csv() instead, I of course receive the following error:
ValueError: Invalid file path or buffer object type: <class 'list'>
Is there a way to use pd.read_csv() with my list, without first saving the list to a csv and reading that file in a second step?
I also tried pd.read_table(), which also needs a file or buffer object.
Example data (separated by tab stops):
Col1 Col2 Col3
12 Info1 34.1
15 Info4 674.1
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
Current workaround:
with open(f'{filepath}tmp.csv', 'w', encoding='UTF8') as f:
    [f.write(line + "\n") for line in consolidated_file]
df = pd.read_csv(f'{filepath}tmp.csv', sep='\t', index_col=1)
import pandas as pd
df = pd.DataFrame([x.split('\t') for x in test])
print(df)
and if you want the first row as your header, then:
df.columns = df.iloc[0]
df = df[1:]
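One caveat: after the split everything is still a string. If Col1 and Col3 should be numeric, as in the sample, a small illustrative follow-up:
# str.split leaves every value as a string; convert the numeric columns
df['Col1'] = pd.to_numeric(df['Col1'])
df['Col3'] = pd.to_numeric(df['Col3'])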
It seems simpler to convert it to a nested list, as in the other answer:
import pandas as pd
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
data = [line.split('\t') for line in test]
df = pd.DataFrame(data[1:], columns=data[0])
but you can also join it back into a single string (or get it directly from a file, socket, or network as a single string) and then use io.BytesIO or io.StringIO to simulate a file in memory.
import pandas as pd
import io
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
single_string = "\n".join(test)
file_like_object = io.StringIO(single_string)
df = pd.read_csv(file_like_object, sep='\t')
or, shorter:
df = pd.read_csv(io.StringIO("\n".join(test)), sep='\t')
This method is popular when you get data from the network (a socket or a web API) as a single string.
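For example, a sketch of the web-API case; the URL is hypothetical and the requests package is assumed:
import io
import pandas as pd
import requests

# hypothetical endpoint that returns tab-separated text
text = requests.get('https://example.com/data.tsv').text
df = pd.read_csv(io.StringIO(text), sep='\t')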
I'm trying to read a csv file with pandas.
This file actually has only one row, but it causes an error whenever I try to read it.
Something wrong seems to be happening around line 8, but I can hardly find an 8th line since there's clearly only one row in the file.
I do like:
import codecs
import pandas as pd
with codecs.open("path_to_file", "rU", "Shift-JIS", "ignore") as file:
    df = pd.read_csv(file, header=None, sep="\t")
df
Then I get:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 8, saw 3
I don't get what's really going on, so any of your advice will be appreciated.
I struggled with this for almost half a day. I opened the csv with Notepad and noticed that the separator is a tab, not a comma, and then tried the combination below:
df = pd.read_csv('C:\\myfile.csv', sep='\t', lineterminator='\r')
Try df = pd.read_csv(file, header=None, error_bad_lines=False) to skip the malformed lines.
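Note that error_bad_lines was deprecated in pandas 1.3; on newer versions the equivalent is on_bad_lines='skip'. A minimal sketch:
import pandas as pd

# skip rows whose field count does not match, instead of raising ParserError
df = pd.read_csv("path_to_file", header=None, sep="\t", on_bad_lines="skip")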
The existing answer will not include these additional lines in your dataframe. If you'd like your dataframe to be as wide as its widest point, you can use the following:
delimiter = ','
# the widest row has max_columns - 1 delimiters; +1 converts that to a column count
with open(path_name, 'r') as f:
    max_columns = max(line.count(delimiter) for line in f) + 1
df = pd.read_csv(path_name, header=None, skiprows=1, names=list(range(max_columns)))
Set skiprows=1 if there's actually a header; you can always retrieve the header column names later.
You can also identify rows that have more columns populated than the number of column names in the original header.
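For instance, a sketch of both of those ideas, assuming the header is the first line of the file and df was built as above:
# recover the original header names from the skipped first line
header_names = pd.read_csv(path_name, nrows=0).columns.tolist()

# rows that populate more fields than the original header declared
wide_rows = df[df.notna().sum(axis=1) > len(header_names)]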
I'm looking for the best way to rename my header using DictReader/DictWriter, to add to the other steps I have already done.
This is what I am trying to do to the Source data example below.
Remove the first 2 lines
Reorder the columns (header & data) to 2, 1, 3 vs the source file
Rename the header to ASXCode, CompanyName, GICS
Where I'm at:
If I use reader = csv.reader(inf), the first lines are removed and the columns reordered, but as expected there is no header rename.
Alternately, when I run the DictReader line reader = csv.DictReader(inf, fieldnames=('ASXCode', 'CompanyName', 'GICS')), I receive the error 'dict contains fields not in fieldnames:' and it shows the first row of data rather than the header.
I'm a bit stuck on how I get around this so any tips appreciated.
Source Data example
ASX listed companies as at Mon May 16 17:01:04 EST 2016
Company name ASX code GICS industry group
1-PAGE LIMITED 1PG Software & Services
1300 SMILES LIMITED ONT Health Care Equipment & Services
1ST AVAILABLE LTD 1ST Health Care Equipment & Services
My Code
import csv
import urllib.request
from itertools import islice
local_filename = "C:\\myfile.csv"
url = ('http://mysite/afile.csv')
temp_filename, headers = urllib.request.urlretrieve(url)
with open(temp_filename, 'r', newline='') as inf, \
        open(local_filename, 'w', newline='') as outf:
    # reader = csv.DictReader(inf, fieldnames=('ASXCode', 'CompanyName', 'GICS'))
    reader = csv.reader(inf)
    fieldnames = ['ASX code', 'Company name', 'GICS industry group']
    writer = csv.DictWriter(outf, fieldnames=fieldnames)
    # 1. Remove top 2 rows
    next(islice(reader, 2, 2), None)
    # 2. Reorder Columns
    writer.writeheader()
    for row in csv.DictReader(inf):
        writer.writerow(row)
IIUC here is a solution using pandas and its function read_csv:
import pandas as pd
#Considering that you have your data in a file called 'stock.txt'
#and it is tab separated; by default blank lines are not read by read_csv,
#hence set header=1 so the second line becomes the header
df = pd.read_csv('stock.txt', sep='\t', header=1)
#Rename the columns as required
df.columns= ['CompanyName', 'ASXCode', 'GICS']
#Reorder the columns as required
df = df[['ASXCode','CompanyName','GICS']]
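With the sample rows above, the reordered frame would look roughly like this (whitespace approximate):
  ASXCode          CompanyName                              GICS
0     1PG       1-PAGE LIMITED               Software & Services
1     ONT  1300 SMILES LIMITED  Health Care Equipment & Services
2     1ST    1ST AVAILABLE LTD  Health Care Equipment & Services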
Based on your tips I got it working in the end. I hadn't used pandas before, so I had to read up a little first.
I eventually worked out that pandas uses a DataFrame, so I had to do a few things differently with the to_csv function, and eventually added the index=False parameter to to_csv to remove the df index.
All working now, thank you.
import os
import urllib.request
import pandas as pd
local_filename = "C:\\myfile.csv"
url = ('http://mysite/afile.csv')
temp_filename, headers = urllib.request.urlretrieve(url)
#using pandas dataframe
df = pd.read_csv(temp_filename, sep=',', header=1) #skip the title row; use the second line as the header
df.columns = ['CompanyName', 'ASXCode', 'GICS'] #rename columns
df = df[['ASXCode','CompanyName','GICS']] #reorder columns
df.to_csv(local_filename, sep=',', index=False)
os.remove(temp_filename) # clean up
I am importing a web log text file into Python using pandas. Python is reading the headers, however it has used the text "Fields:" as a header and has then added another column of blanks (NaNs) at the end. How can I stop this text being used as a column heading?
here is my code:
arr = pd.read_table("path", skiprows=3, delim_whitespace=True, na_values=True)
Here is the start of the file:
Software: Microsoft Internet Information Services 7.5
Version: 1.0
Date: 2014-08-01 00:00:25
Fields: date time
2014-08-01 00:00:25...
The result is that 'Fields' is used as a column heading and a column full of NaN values is created for column 'time'.
You can do it by calling read_table twice.
# reads the fourth line into a 1x1 df as a string,
# then splits it and skips the first field:
col_names = pd.read_table('path', skiprows=3, nrows=1, header=None).iloc[0,0].split()[1:]
# reads the actual data:
df = pd.read_table('path', sep=' ', skiprows=4, names=col_names)
If you already know the names of the columns (eg. date and time) then it's even simpler:
df = pd.read_table('path', sep=' ', skiprows=4, names = ['date', 'time'])
I think you may want skiprows=4 and header=None.
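Spelled out, a minimal sketch of that suggestion, assuming the file really has just the two fields named on the Fields: line:
import pandas as pd

# skip the four metadata lines; the remaining rows have no header
df = pd.read_table('path', delim_whitespace=True, skiprows=4, header=None)
df.columns = ['date', 'time']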
I have a .csv that contains column headers and is displayed below. I need to suppress the column labeling when I ingest the file as a data frame.
date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7
When I issue the following command:
df = pd.read_csv('c:/temp1/test_csv.csv', usecols=[4,5], names=["zip","weight"], header=0, nrows=10)
I get:
zip weight
0 1417464 3546600
I have tried various manipulations of header=True and header=0. If I don't use header=0, then the column labels all print out on top of the rows, like so:
zip weight
height locale
0 1417464 3546600
I have tried skiprows=0 and 1, but neither removes the headers; the command just skips the specified line.
I could really use some additional insight or a solution. Thanks in advance for any assistance you could provide.
Tiberius
Using the example of @jezrael, if you want to skip the header and suppress the column labeling:
import pandas as pd
import numpy as np
import io
temp=u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""
#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], header=None, skiprows=1)
print(df)
4 5
0 3546600 254
I'm not sure I entirely understand why you want to remove the headers, but you could comment out the header line as follows, as long as you don't have any other rows that begin with 'd':
>>> df = pd.read_csv('test.csv', usecols=[3,4], header=None, comment='d') # comments out lines beginning with 'date,color' . . .
>>> df
3 4
0 1417464 3546600
It would be better to comment out the line in the csv file with the crosshatch character (#) and then use the same approach (again, as long as you have not commented out any other lines with a crosshatch):
>>> df = pd.read_csv('test.csv', usecols=[3,4], header=None, comment='#') # comments out lines with #
>>> df
3 4
0 1417464 3546600
I think you are right.
So you can change column names to a and b:
import pandas as pd
import numpy as np
import io
temp=u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""
#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], names=["a","b"], header=0, nrows=10)
print(df)
a b
0 3546600 254
Now these columns have new names instead of weight and height.
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], header=0, nrows=10)
print(df)
weight height
0 3546600 254
You can check the read_csv docs (bold by me):
header : int, list of ints, default ‘infer’
Row number(s) to use as the column names, and the start of the data. Defaults to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns E.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example are skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
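A quick sketch of that "replace existing names" point from the quote; the short names here are purely illustrative:
import io
import pandas as pd

temp = u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""

# header=0 consumes the original header line, and names replaces it
df = pd.read_csv(io.StringIO(temp), header=0,
                 names=["d", "c", "i", "z", "w", "h", "l"])
print(df.columns.tolist())   # ['d', 'c', 'i', 'z', 'w', 'h', 'l']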