Python Pandas Reading

I am trying to read a large log file that has been written with different delimiters over time (legacy changes).
This code works:
import os, subprocess, time, re
import pandas as pd

for root, dirs, files in os.walk('.', topdown=True):
    for file in files:
        df = pd.read_csv(file, sep='[,|;: \t]+', header=None, engine='python', skipinitialspace=True)
        for index, row in df.iterrows():
            print(row[0], row[1])
This works well for the following data
user1#email.com address1
user2#email.com;address2
user3#email.com,address3
user4#email.com;;address4
user5#email.com,,address5
Issue #1: the following row in the input file will break the code. I wish for this to be parsed into 2 columns (not 3)
user6#email.com,,address;6
Issue #2: I wish to replace all single and double quotes in the address, but neither of the following seems to work.
df[1]=df[1].str.replace('"','DQUOTES')
df.replace('"', 'DQUOTES', regex=True)
Pls help!

You can first read the file into a single column and then do the processing step by step in pandas:
1. split it into two columns (n=1),
2. replace the quotes,
3. if needed (i.e. if there may be further columns you don't want), split the address column again and keep only the first column ([0]); here you may want to remove the space from the list of separators. If commas, semicolons etc. can legitimately be part of the address, you of course don't need this step.
import io
import pandas as pd

s = """user1#email.com address1
user2#email.com;address2
user3#email.com,address3
user4#email.com;;address4
user5#email.com,,address5
user6#email.com,,address;6
user6#email.com,,address with "double quotes"
user6#email.com,,address with 'single quotes'
"""
# read everything into a single column first
df = pd.read_csv(io.StringIO(s), sep='\n', header=None)
# split on the first delimiter only (n=1) into email and address
df = df[0].str.split('[,|;: \t]+', n=1, expand=True).rename(columns={0: 'email', 1: 'address'})
# replace single and double quotes
df.address = df.address.str.replace('\'|"', 'DQUOTES', regex=True)
# optionally cut the address at the next delimiter, depending on what you need
df.address = df.address.str.split('[,|;:]+', n=1, expand=True)[0]
Result:
email address
0 user1#email.com address1
1 user2#email.com address2
2 user3#email.com address3
3 user4#email.com address4
4 user5#email.com address5
5 user6#email.com address
6 user6#email.com address with DQUOTESdouble quotesDQUOTES
7 user6#email.com address with DQUOTESsingle quotesDQUOTES
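If you need to run this over the files from your original os.walk loop, here is a rough sketch (not part of the answer above, untested against your real logs, and assuming every file has at least one delimited line like the sample data): read each file as plain lines first, then split and clean in pandas.

import os
import pandas as pd

for root, dirs, files in os.walk('.', topdown=True):
    for name in files:
        path = os.path.join(root, name)
        # read each line as raw text, skipping empty lines
        with open(path) as fh:
            lines = [ln for ln in fh.read().splitlines() if ln.strip()]
        df = pd.DataFrame({'raw': lines})
        # split on the first delimiter only, then clean the quotes
        parts = df['raw'].str.split('[,|;: \t]+', n=1, expand=True)
        df['email'], df['address'] = parts[0], parts[1]
        df['address'] = df['address'].str.replace('\'|"', 'DQUOTES', regex=True)
        for index, row in df.iterrows():
            print(row['email'], row['address'])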

Related

Why is Pandas' whitespace delimiter skipping one of my values?

I'm currently trying to use Python to read a text file into SQLite3 using pandas. Here are a few entries from the text file:
1 Michael 462085 2.2506 Jessica 302962 1.5436
2 Christopher 361250 1.7595 Ashley 301702 1.5372
3 Matthew 351477 1.7119 Emily 237133 1.2082
The data consists of popular baby names, and I have to separate male names and female names into their own tables and perform queries on them. My method is to first place all the data into both tables, then drop the unneeded columns afterwards. My issue is that when I try to add names to the columns, I get a ValueError: the expected axis has 6 elements, but 7 values were passed. I'm assuming it's because pandas isn't reading the last value of each line, but I can't figure out how to fix it. My current delimiter is the whitespace delimiter you can see below.
Here is my code:
import sqlite3
import pandas as pd
import csv

con = sqlite3.connect("C:\\****\\****\\****\\****\\****\baby_names.db")
c = con.cursor()

# Please note that most of these functions will be commented out, because they will only be run once.
def create_and_insert():
    # load data
    df = pd.read_csv('babynames.txt', index_col=0, header=None, sep='\s+', engine='python')
    # Reading the textfile
    df.columns = ['Rank', 'BoyName', 'Boynumber', 'Boypercent', 'Girlname', 'Girlnumber', 'Girlpercent']
    # Adding Column names
    df.columns = df.columns.str.strip()
    con = sqlite3.connect("*************\\baby_names.db")
    # drop data into database
    df.to_sql("Combined", con)
    df.to_sql("Boys", con)
    df.to_sql("Girls", con)
    con.commit()
    con.close()

create_and_insert()

def test():
    c.execute("SELECT * FROM Boys WHERE Rank = 1")
    print(c.fetchall())

test()
con.commit()
con.close()
I've tried adding multiple delimiters, but it didn't seem to do anything. Using just a regular space as the delimiter seems to create 'blank' column names. From reading the pandas docs, it seems that multiple delimiters are possible, but I can't quite figure it out. Any help would be greatly appreciated!
Note that:
your input file contains 7 columns,
but the initial column is set as the index (you passed index_col=0),
so your DataFrame contains only 6 regular columns.
Print df to confirm it.
Now, when you run df.columns = ['Rank', ...], you attempt to assign the 7 passed names to the existing 6 data columns.
Probably you should:
read the DataFrame without setting the index (for now),
assign all 7 column names,
set Rank column as the index.
The code to do it is:
df = pd.read_csv('babynames.txt', header=None, sep='\s+', engine='python')
df.columns = ['Rank', 'BoyName', 'Boynumber', 'Boypercent', 'Girlname', 'Girlnumber',
'Girlpercent']
df.set_index('Rank', inplace=True)
Or even shorter (all in one):
df = pd.read_csv('babynames.txt', sep='\s+', engine='python',
names=['Rank', 'BoyName', 'Boynumber', 'Boypercent', 'Girlname', 'Girlnumber',
'Girlpercent'], index_col='Rank')
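Not part of the original answer, but if you then want separate Boys and Girls tables without dropping columns afterwards, one possible sketch is to select the relevant columns before calling to_sql (the database filename below is just a placeholder):

import sqlite3
import pandas as pd

df = pd.read_csv('babynames.txt', sep=r'\s+', engine='python',
                 names=['Rank', 'BoyName', 'Boynumber', 'Boypercent',
                        'Girlname', 'Girlnumber', 'Girlpercent'],
                 index_col='Rank')

con = sqlite3.connect('baby_names.db')  # placeholder path
df.to_sql('Combined', con, if_exists='replace')                                   # all columns
df[['BoyName', 'Boynumber', 'Boypercent']].to_sql('Boys', con, if_exists='replace')    # boys only
df[['Girlname', 'Girlnumber', 'Girlpercent']].to_sql('Girls', con, if_exists='replace')  # girls only
con.commit()
con.close()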

How to skip text being used as column heading using python

I am importing a web log text file in Python using pandas. Python is reading the headers, however it has used the text "Fields:" as a heading and has then added another column of blanks (NaNs) at the end. How can I stop this text being used as a column heading?
Here is my code:
arr = pd.read_table("path", skiprows=3, delim_whitespace=True, na_values=True)
Here is the start of the file:
Software: Microsoft Internet Information Services 7.5
Version: 1.0
Date: 2014-08-01 00:00:25
Fields: date time
2014-08-01 00:00:25...
The result is that 'Fields:' is being used as a column heading and a column full of NaN values is being created for the 'time' column.
You can do it by calling read_table twice.
# reads the fourth line into a 1x1 df containing a string,
# then splits it and skips the first field:
col_names = pd.read_table('path', skiprows=3, nrows=1, header=None).iloc[0,0].split()[1:]
# reads the actual data:
df = pd.read_table('path', sep=' ', skiprows=4, names=col_names)
If you already know the names of the columns (eg. date and time) then it's even simpler:
df = pd.read_table('path', sep=' ', skiprows=4, names = ['date', 'time'])
I think you may want skiprows = 4 and header = None
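For completeness, a minimal sketch of that suggestion, reusing the 'path' placeholder and the known column names from the snippet above:

import pandas as pd

# skip all four metadata lines and supply the column names yourself
df = pd.read_table('path', delim_whitespace=True, skiprows=4, header=None,
                   names=['date', 'time'])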

Having trouble removing headers when using pd.read_csv

I have a .csv that contains column headers and is displayed below. I need to suppress the column labeling when I ingest the file as a data frame.
date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7
When I issue the following command:
df = pd.read_csv('c:/temp1/test_csv.csv', usecols=[4,5], names = ["zip","weight"], header = 0, nrows=10)
I get:
zip weight
0 1417464 3546600
I have tried various manipulations of header=True and header=0. If I don't use header=0, then the columns will all print out on top of the rows like so:
zip weight
height locale
0 1417464 3546600
I have tried skiprows=0 and 1, but neither removes the headers; the command does skip the specified line, though.
I could really use some additional insight or a solution. Thanks in advance for any assistance you could provide.
Tiberius
Using the example of @jezrael, if you want to skip the header and suppress the column labeling:
import pandas as pd
import numpy as np
import io
temp=u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], header=None, skiprows=1)
print df
4 5
0 3546600 254
I'm not sure I entirely understand why you want to remove the headers, but you could comment out the header line as follows as long as you don't have any other rows that begin with 'd':
>>> df = pd.read_csv('test.csv', usecols=[3,4], header=None, comment='d') # comments out lines beginning with 'date,color' . . .
>>> df
3 4
0 1417464 3546600
It would be better to comment out the line in the csv file with the crosshatch character (#) and then use the same approach (again, as long as you have not commented out any other lines with a crosshatch):
>>> df = pd.read_csv('test.csv', usecols=[3,4], header=None, comment='#') # comments out lines with #
>>> df
3 4
0 1417464 3546600
I think you are right.
So you can change column names to a and b:
import pandas as pd
import numpy as np
import io
temp=u"""date,color,id,zip,weight,height,locale
11/25/2013,Blue,122468,1417464,3546600,254,7"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], names = ["a","b"], header = 0 , nrows=10)
print df
a b
0 3546600 254
Now these columns have new names instead of weight and height.
df = pd.read_csv(io.StringIO(temp), usecols=[4,5], header = 0 , nrows=10)
print df
weight height
0 3546600 254
You can check the read_csv docs:
header : int, list of ints, default ‘infer’
Row number(s) to use as the column names, and the start of the data. Defaults to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns E.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example are skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.

Merge csv's with some common columns and fill in Nans

I have several csv files (all in one folder) which have columns in common but also have distinct columns. They all contain the IP column. The data looks like
File_1.csv
a,IP,b,c,d,e
info,192.168.0.1,info1,info2,info3,info4
File_2.csv
a,b,IP,d,f,g
info,,192.168.0.1,info2,info5,info6
As you can see, File 1 and File 2 disagree on what belongs in column d, but I do not mind which file the information is kept from. I have tried pandas.merge, but it returns two separate entries for 192.168.0.1, with NaN in the columns that are present in File 1 but not in File 2 and vice versa. Does anyone know of a way to do this?
Edit 1:
The desired output should look like:
output
a,IP,b,c,d,e,f,g
info,192.168.0.1,info1,info2,info3,info4,info5,info6
and I would like the output to be like this for all rows; not every item in file 1 is in file 2 and vice versa.
Edit 2:
Any IP address present in file 1 but not present in file 2 should have a blank or Not Available Value in any unique columns in the output file. For example in the output file, columns f and g would be blank for IP addresses that were present in file 1 but not in file 2. Similarly, for an IP in file 2 and not in file 1, columns c and e would be blank in the output file.
For this case:
Set the IP address as the index column and then use combine_first() to fill in the holes in a DataFrame which is the union of all IP addresses and columns.
import pandas as pd
#read in the files using the IP address as the index column
df_1 = pd.read_csv('file1.csv', header= 0, index_col = 'IP')
df_2 = pd.read_csv('file2.csv', header= 0, index_col = 'IP')
#fill in the Nan
combined_df = df_1.combine_first(df_2)
combined_df.to_csv('combined.csv', sep=',')  # note: the method is to_csv (there is no write_csv); the output filename is just a placeholder
EDIT: The union of the indices will be taken, so we should put the IP address in the index column to ensure IP addresses in both files are read in.
combine_first() for other cases:
As the documentation states, you'll only have to be careful if the same IP address in both files has conflicting non-empty information for a column (such as column d in your example above). In df_1.combine_first(df_2), df_1 is prioritized, and column d will be set to the value from df_1. Since you said it doesn't matter which file you draw the information from in this case, this isn't a concern for this problem.
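To illustrate that priority with made-up values (not your actual data): for the conflicting column d, the value from df_1 wins, while a column that only exists in df_2 is still filled in.

import pandas as pd

df_1 = pd.DataFrame({'d': ['info3']}, index=pd.Index(['192.168.0.1'], name='IP'))
df_2 = pd.DataFrame({'d': ['info2'], 'f': ['info5']}, index=pd.Index(['192.168.0.1'], name='IP'))
print(df_1.combine_first(df_2))
#                  d      f
# IP
# 192.168.0.1  info3  info5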
I think a simple dictionary should do the job. Assume you've read the contents of each file into lists file1 and file2, so that:
file1[0] = [a,IP,b,c,d,e]
file1[1] = [info,192.168.0.1,info1,info2,info3,info4]
file2[0] = [a,b,IP,d,f,g]
file2[1] = [info,,192.168.0.1,info2,info5,info6]
(with quotes around each entry). The following should do what you want:
new_dict = {}
for i in range(0, len(file2[0])):
    new_dict[file2[0][i]] = file2[1][i]
for i in range(0, len(file1[0])):
    new_dict[file1[0][i]] = file1[1][i]

output = [[], []]
output[0] = [key for key in new_dict]
output[1] = [new_dict[key] for key in output[0]]
Then you should get (note that the exact column order depends on dictionary ordering, which is only guaranteed from Python 3.7 on):
output[0] = [a,IP,b,c,d,e,f,g]
output[1] = [info,192.168.0.1,info1,info2,info3,info4,info5,info6]

Merging CSV Files with missing columns in Pandas

I'm new to pandas and Python, so I hope this will make sense.
I have parsed multiple tables from a website into multiple CSV files, and unfortunately, if a value was not available for the parsed data, it was omitted from the table. Hence, I now have CSV files with varying numbers of columns.
I've used read_csv() and to_csv() in the past and they work like a charm when the data is clean, but I'm stumped here.
I figured there might be a way to "map" the read data if I first fed the pandas DataFrame all the column headers and then mapped each file against the columns in the main file.
E.g. once I use read_csv(), then to_csv() will look at the main merged file and "map" the available fields to the correct columns in the merged file.
This is a short version of the data:
File 1:
ID, Price, Name,
1, $800, Jim
File 2:
ID, Price, Address, Name
2, $500, 1 Main St., Amanda
Desired Output:
ID, Price, Address, Name
1, $800, , Jim
2, $500, 1 Main St., Amanda
This is the code I got so far.
mypath = 'I:\\Filepath\\'

# creating list of files to be read, and merged.
listFiles = []
for (dirpath, dirnames, filenames) in walk(mypath):
    listFiles.extend(filenames)
    break

# reading/writing "master headers" to new CSV using a "master header" file
headers = pd.read_csv('I:\\Filepath\\master_header.csv', index_col=0)
with open('I:\\Filepath\\merge.csv', 'wb') as f:
    headers.to_csv(f)

def mergefile(filenames):
    try:
        # Creating a list of files read.
        with open('I:\\Filepath\\file_list.txt', 'a') as f:
            f.write(str(filenames)+'\n')
        os.chdir('I:\\Filepath\\')
        # Reading file to add.
        df = pd.read_csv(filenames, index_col=0)
        # Appending data (w/o header) to the new merged data CSV file.
        with open('I:\\Filepath\\merge.csv', 'a') as f:
            df.to_csv(f, header=False)
    except Exception, e:
        with open('I:\\Filepath\\all_error.txt', 'a') as f:
            f.write(str(e)+'\n')

for eachfilenames in listFiles:
    mergefile(eachfilenames)
This code merges the data, but since the number of columns varies, the values do not end up in the right place...
Any help would be greatly appreciated.
Try using the pandas concat[1] function, which defaults to an outer join (all columns will be present, and missing values will be NaN). For example:
import pandas as pd
# you would read each table into its own data frame using read_csv
f1 = pd.DataFrame({'ID': [1], 'Price': [800], 'Name': ['Jim']})
f2 = pd.DataFrame({'ID': [2], 'Price': [500], 'Address': '1 Main St.', 'Name': ['Amanda']})
pd.concat([f1, f2]) # merged data frame
[1] http://pandas.pydata.org/pandas-docs/stable/merging.html
Here is a complete example that demonstrates how to load the files and merge them using concat:
In [297]:
import pandas as pd
import io
t="""ID, Price, Name
1, $800, Jim"""
df = pd.read_csv(io.StringIO(t), sep=',\s+')
t1="""ID, Price, Address, Name
2, $500, 1 Main St., Amanda"""
df1 = pd.read_csv(io.StringIO(t1), sep=',\s+')
pd.concat([df,df1], ignore_index=True)
Out[297]:
Address ID Name Price
0 NaN 1 Jim $800
1 1 Main St. 2 Amanda $500
Note that I pass ignore_index=True, otherwise you will get duplicate index entries, which I assume is not what you want. Also, I'm assuming that in your original data sample for 'File 1' you don't really have a trailing comma in your header line (ID, Price, Name,), so I removed it from my code above.
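If it helps, here is a rough sketch (not from either answer above; the paths reuse the placeholders from your question) of how the original loop could be rewritten around concat, collecting all files into one frame and writing it once:

import os
import pandas as pd

mypath = 'I:\\Filepath\\'
frames = []
for name in os.listdir(mypath):
    if name.endswith('.csv') and name != 'merge.csv':
        frames.append(pd.read_csv(os.path.join(mypath, name)))

# outer-join style concat: columns missing from a file become NaN
merged = pd.concat(frames, ignore_index=True)
merged.to_csv(os.path.join(mypath, 'merge.csv'), index=False)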
