Problem:
I am trying to read in a csv to a pandas dataframe that contains data of different column sizes.
Example & Description:
Code:
df = pd.read_csv(input, error_bad_lines=False)
input:
ID, Time, Val
15, 18:00:01, 4
15, 18:00:02, 6
15, 18:00:03, 5
ID, Time, Val1, Val2
16 18:00:03, 1, 43
ID, Time, Val
15, 18:00:04, 8
and this pattern continues for the entirety of the file. Originally I was thinking about throwing away the extra columns: since read_csv throws an error on those lines and doesn't read them, I just started to ignore them. However, I then get duplicate headers in my dataframe... To combat this I tried drop_duplicates(), but found out that the keep=False option was only added in pandas 0.17. I eventually convinced myself to try to keep the data instead. So here is my question: based on the dataset above, I was hoping I might be able to create two unique dataframes. You can assume that the ID will always be unique, so you can create N frames for the N different IDs you have. Each ID will not have the same number of headers. Whenever a different ID is encountered, its header is printed before its data. So, for example, if we hit another ID 16 block, its header will be printed prior to the data, and if we hit another ID 15 block, its header will be printed prior to its data.
I was thinking maybe to just preprocess the data before using dataframes, since that is an option. But since I am still fairly new to all that pandas can do, I was hoping some people here might have suggestions before I went ahead and wrote some nasty preprocessing code :). The other thought I had, which turns into a question, is: for error_bad_lines, is there a way to save those lines to another dataframe or somewhere else? Additionally, can I tell pandas in read_csv to only look for rows that have an ID of X, and just do that for all my IDs? I will add that the number of IDs is finite and known.
My current version of pandas is 0.14.
Note I corrected what I think is a typo in your sample data.
I split your data with a lookahead regular expression, looking for newline characters that are followed by 'ID'.
Then I parse each element of the resulting list and concatenate.
from io import StringIO
import pandas as pd
import re
txt = """ID, Time, Val
15, 18:00:01, 4
15, 18:00:02, 6
15, 18:00:03, 5
ID, Time, Val1, Val2
16, 18:00:03, 1, 43
ID, Time, Val
15, 18:00:04, 8"""
def parse(s):
    return pd.read_csv(StringIO(s), skipinitialspace=True)

pd.concat([parse(s) for s in re.split('\n(?=ID)', txt)])
ID Time Val Val1 Val2
0 15 18:00:01 4.0 NaN NaN
1 15 18:00:02 6.0 NaN NaN
2 15 18:00:03 5.0 NaN NaN
0 16 18:00:03 NaN 1.0 43.0
0 15 18:00:04 8.0 NaN NaN
The above works with the sample data provided by the OP. If this were in a csv file, the solution would look like this:
from io import StringIO
import pandas as pd
import re
with open('myfile.csv') as f:
    txt = f.read()

def parse(s):
    return pd.read_csv(StringIO(s), skipinitialspace=True)

pd.concat([parse(s) for s in re.split('\n(?=ID)', txt)])
You can treat the file as having four columns:
df = pd.read_csv(input, names=['id', 'time', 'v1', 'v2'])
and filter out the extra headers:
df = df[df.id != 'ID']
Then your two data sets are simply df[pd.isnull(df.v2)] and df[~pd.isnull(df.v2)].
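If it helps, here is a minimal end-to-end sketch of that approach, assuming the data lives in a file called data.csv (hypothetical name) and that you want the numeric columns restored after the repeated header rows are dropped:

import pandas as pd

# Read everything as four columns; the three-column rows get NaN in the extra column.
df = pd.read_csv('data.csv', names=['id', 'time', 'v1', 'v2'], skipinitialspace=True)

# Drop the repeated header rows, then restore numeric types
# (the header rows forced these columns to be read as strings).
df = df[df['id'] != 'ID']
df['v1'] = df['v1'].astype(float)
df['v2'] = df['v2'].astype(float)

# Split into the two schemas.
three_col = df[df['v2'].isnull()]
four_col = df[df['v2'].notnull()]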
Related
I have multiple inventory tables like so:
line no    qty -1     qty -2
1          -          3
2          42.1 FT    -
3          5          -
4          -          10 FT
5          2          1
6          6.7        -
or
line no    qty
1          2
2          4.5 KG
3          5
4
5          13
6          AR
I want to create a logic check for the quantity column using Python. (A table may have more than one qty column and I need to be able to check all of them. In both examples, the tables are already formatted as dataframes.)
Acceptable criteria:
integer with or without "EA" (meaning each)
"AR" (as required)
integer or float with unit of measure
if multiple QTY columns, then "-" is also accepted (first table)
I want to return a list per page containing the line numbers of rows where the quantity value is missing (line 4, second table) or does not meet the acceptance criteria (line 6, first table). If a line passes the checks, then return True.
I have tried:
qty_col = [col for col in df.columns if 'qty' in col]
df['corr_qty'] = np.where(qty_col.isnull(), False, df['line_no'])
but this creates the quantity columns as a list and yields the following
AttributeError: 'list' object has no attribute 'isnull'
Intro and Suggestions:
Welcome to Stack Overflow. A general tip when asking questions on S.O.: include as much information as possible. In addition, always identify the libraries you want to use and the approach you would accept, since there can be multiple solutions to the same problem; it looks like you've done that.
Also, it is best to share all, or at least most, of your attempted solutions so that others can follow your thought process and suggest the best approach to a potential solution.
The Solution:
It wasn't clear if the solution you are looking for required that you read the PDF to create the dataframe or if converting the PDF to a CSV and processing the data using the CSV was sufficient. I took the latter approach.
import tabula as tb
import pandas as pd
#PDF file path
input_file_path = "/home/hackernumber7/Projects/python/resources/Pandas_Sample_Data.pdf"
#CSV file path
output_file_path = "/home/hackernumber7/Projects/python/resources/Pandas_Sample_Data.csv"
#Read the PDF
#id = tb.read_pdf(input_file_path, pages='all')
#Convert the PDF to CSV
cv = tb.convert_into(input_file_path, output_file_path, "csv", pages="all")
#Read initial data
id = pd.read_csv(output_file_path, delimiter=",")
#Print the initial data
print(id)
#Create the dataframe
df = pd.DataFrame(id, columns = ['qty'])
#Print the data as a DataFrame object; boolean values when conditions met
print(df.notna())
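As a follow-up, here is a minimal sketch of the quantity check itself. It is only a sketch: it assumes the tables are already DataFrames with a 'line no' column and one or more columns whose names contain 'qty' (those column names are assumptions based on the tables above), and it encodes the acceptance criteria as a regular expression.

import re

# Acceptable single values: an integer (optionally followed by "EA"), "AR",
# or an integer/float with a unit of measure, e.g. "42.1 FT" or "4.5 KG".
ACCEPT = re.compile(r'^(\d+(\s*EA)?|AR|\d+(\.\d+)?\s+[A-Z]+)$')

def check_quantities(df):
    """Return True if every line passes, otherwise a list of failing line numbers."""
    qty_cols = [c for c in df.columns if 'qty' in c.lower()]
    dash_ok = len(qty_cols) > 1   # "-" is only acceptable when there are multiple qty columns

    def value_ok(v):
        return bool(ACCEPT.match(v)) or (dash_ok and v == '-')

    bad_lines = []
    for _, row in df.iterrows():
        values = [str(row[c]).strip() for c in qty_cols]
        # Fail if any value is unacceptable, or if no real quantity was entered at all.
        if not all(value_ok(v) for v in values) or not any(ACCEPT.match(v) for v in values):
            bad_lines.append(row['line no'])
    return True if not bad_lines else bad_lines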
I have 2 csv files with some random numbers, as follows:
csv1.csv
0 906018
1 007559
2 910475
3 915104
4 600393
...
5070 907525
5071 903079
5072 001910
5073 909735
5074 914861
length 5075
csv2.csv
0 5555
1 7859
2 501303
3 912414
4 913257
...
7497 915031
7498 915030
7499 915033
7500 902060
7501 915038
length 7502
Some elements in csv1 are present in csv2, but I don't know exactly which ones, and I would like to extract the unique values. So my idea was to merge the two data frames together and then remove the duplicates.
So I wrote the following code:
import pandas as pd
import csv
unique_users = pd.read_csv('./csv1.csv')
unique_users['id']
identity = pd.read_csv('./csv2.csv')
identityNumber = identity['IDNumber']
identityNumber
df = pd.concat([identityNumber, unique_users])
Up to here everything is fine and the length of df is the sum of the two lengths, but this is where I got stuck:
concat did its job and concatenated based on the index, so now I have tons of NaN.
When I use the code:
final_result = df.drop_duplicates(keep=False)
the data frame does not drop any values, because the df structure now looks like this:
IDNumber    id
5555        NaN
so I guess drop_duplicates is looking for rows with the same exact values, and since those don't exist it just keeps everything.
So what I would like to do is loop over both data frames, and if a value in csv1 exists in csv2, I want it to be dropped.
Can anyone help with this please?
And please if you need more info just let me know.
UPDATE:
I think I found the reason why it is not working, but I am not sure how to solve it.
my csv1 looks like this:
id
906018,
007559,
910475,
915104,
600393,
007992,
502313,
004609,
910017,
007954,
006678,
In a Jupyter notebook, when I open the csv it looks this way:
id
906018 NaN
007559 NaN
910475 NaN
915104 NaN
600393 NaN
... ...
907525 NaN
903079 NaN
001910 NaN
909735 NaN
914861 NaN
I do not understand why it is seeing the ids as NaN.
In fact, I tried to add a new column to csv2 and passed the ids from csv1 as its values, and I can confirm that they are all NaN.
So I believe this is surely the source of the problem, which then affects everything else.
Can anyone help to understand how I can solve this issue?
You can achieve this using df.merge():
import pandas as pd

# Data samples
data_1 = {'col_a': [906018,7559,910475,915104,600393,907525,903079,1910,909735,914861]}
data_2 = {'col_b': [5555,7859,914861,912414,913257,915031,1910,915104,7559,915038]}
df1 = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)
# using isin(): find the values present in both frames, then keep the rest
common_vals = df1.merge(df2, right_on='col_b', left_on='col_a')['col_a']
new_df1 = df1[~df1.col_a.isin(common_vals)]
# another approach
new_df1 = df1[df1.merge(df2, right_on='col_b', left_on='col_a', how='left')['col_b'].isna()]
print(new_df1)
# col_a
# 0 906018
# 2 910475
# 4 600393
# 5 907525
# 6 903079
# 8 909735
This will remove the duplicates between your two dataframes and keep all the records in one dataframe df.
df = pandas.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
You are getting NaN because, when you concatenate, pandas doesn't know what to do with the different column names of your two dataframes. One of your dataframes has an IDNumber column and the other has an id column. Pandas can't figure out what you want, so it puts both columns into the resulting dataframe.
Try this:
pd.concat([df1["IDNumber"], df2["id"]]).drop_duplicates().reset_index(drop=True)
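For completeness, here is a hedged end-to-end sketch using the file and column names from the question. The index_col=False part is an assumption about the trailing commas shown in the UPDATE: with a single 'id' header and a trailing comma on every row, pandas treats the numbers as the index and leaves the 'id' column full of NaN, and index_col=False is the documented way to avoid that.

import pandas as pd

# index_col=False stops pandas from using the first field as the index when each
# data row ends with a trailing comma but the header names only one column.
unique_users = pd.read_csv('./csv1.csv', index_col=False)['id']
identity = pd.read_csv('./csv2.csv')['IDNumber']

# Keep only the csv1 values that do not appear anywhere in csv2.
result = unique_users[~unique_users.isin(identity)].reset_index(drop=True)
print(result)

One caveat: both columns need to end up with the same dtype for isin to match, which should be the case if both files contain plain numbers.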
I'm building a program that collects data and adds it to an ongoing excel sheet weekly (read_excel() and concat() with the new data). The issue I'm having is that I need the columns to have the same name for presentation (it doesn't look great with x.1, x.2, ...).
I only need this on the final output. Is there any way to accomplish this? Would it be too time consuming to modify pandas?
You can create a list of custom headers that will be written to the Excel file:
newColNames = ['x','x','x'.....]
df.to_excel(path,header=newColNames)
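For example, a quick sketch with a hypothetical three-column frame and output path; the header list just has to have one entry per column:

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['x.1', 'x.2', 'x.3'])

# One alias per column; Excel shows these instead of the DataFrame's own column names.
newColNames = ['x', 'x', 'x']
df.to_excel('report.xlsx', header=newColNames, index=False)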
You can add spaces to the end of the column names. They will appear the same in Excel, but pandas can tell the difference.
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['x','x ','x  '])
df
x x x
0 1 2 3
1 4 5 6
2 7 8 9
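Writing it out then looks the same as before, e.g. (output path is hypothetical):

df.to_excel('report.xlsx', index=False)   # all three columns display as "x" in Excel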
I'm parsing a lot of .csv files right now, and I'm running into issues where one .csv will identify, say, a column that holds the name of a candidate running for office with the header candidate_name, while another will use CANDIDATE_FULL_NAME.
I'm updating dictionaries with the values of the columns like this, except I'm constantly changing row[value] for each different header:
dict.update({
    'candidate': row['column_header']
})
Is there a way to fuzzy match this? Preferably something that I can use almost drop-in so that I don't have to set up a class/method that regex tests each column for its similarity.
I already set up a class that tests whether a value matches a list of values, but I feel as if this is something I shouldn't have to write myself. Unfortunately, my google-fu has returned nothing.
I'd use the column number, but unfortunately the columns aren't always in the same order. Additionally, I can't alter the original .csv files (or else I'd definitely normalize them).
No "fuzzy" matching built-in to pandas as far as I know. If there is some common denominator, e.g. the word "name" is only and always in the column that contains the candidate's name, you could use it to rename the name column. For example:
import pandas as pd
import numpy as np
def fuzzymatch(df, string, stname):
    for col in df.columns:
        if col.lower().find(string) > -1:
            df.rename(columns={col: stname}, inplace=True)
            break
    return df
df = pd.DataFrame({"CANDIDATE_NAME_HERE": ["Ted","Fred","Sally","John","Jane"], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
#pd.read_csv('filename.csv') will load your csv file
string = 'name'
stname = 'candidate_name'
df = fuzzymatch(df, string, stname)
print(df)
B C candidate_name
0 20 32 Ted
1 30 234 Fred
2 10 23 Sally
3 40 23 John
4 50 42523 Jane
This is probably very very basic but I can't seem to find a solution anywhere. I'm trying to construct a 3D panel object in pandas and then fill it with data which I read from several csv files. An example of what I'm trying to do would be the following:
import numpy as np
import pandas as pd
year = np.arange(2000,2005)
obs = np.arange(1,5)
variables = ['x1','x2']
data = pd.Panel(items = obs, major_axis = year, minor_axis = variables)
So that data[i] gives me all the data belonging to one of the observation units in the panel:
data[1]
x1 x2
2000 NaN NaN
2001 NaN NaN
2002 NaN NaN
2003 NaN NaN
2004 NaN NaN
Then, I read in data from a csv which gives me a DataFrame that looks like this (I'm just creating an equivalent object here to make this a working example):
x1data = pd.DataFrame(data = zip(year, np.random.randn(5)), columns = ['year', 'x1'])
x1data
year x1
0 2000 -0.261514
1 2001 0.474840
2 2002 0.021714
3 2003 -1.939358
4 2004 1.167545
Now I would like to replace the NaNs in the x1 column of data[1] with the data that is in the x1data dataframe. My first idea (given that I'm coming from R) was to simply make sure that I select an object from x1data that has the same dimensions as the x1 column in my panel and assign it to the panel:
data[1].x1 = x1data.x1
However, this doesn't work, which I guess is due to the fact that in x1data the years are a column of the dataframe, whereas in the panel they are whatever shows up to the left of the columns (the "row names"; would this be an index?).
As you can probably tell from my question I'm far from really understanding what's going on in the pandas data structure so any help would be greatly appreciated!
I'm guessing this question didn't elicit a lot of replies as it was simply too stupid, but just in case anyone ever comes across this and is as clueless as I was, the very simple answer is to access the panel using the .iloc method, as:
data.iloc[item, major_axis, minor_axis]
where each of the arguments can be single elements or lists, in order to write on slices of the panel. My question above would have been solved by
data.iloc[1, np.arange(2000,2005), 'x1'] = np.asarray(x1data.x1)
or
data.iloc[1, year, 'x1'] = np.asarray(x1data.x1)
Note that had I not used np.asarray, nothing would have happened, as data.iloc[] creates an object that has the years as its index, while x1data.x1 has an index starting at 0.
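For anyone who lands here later, a small hedged sketch of that index-alignment point: if the year column is first made the index of a per-observation DataFrame, assignment aligns by label and np.asarray is not needed. This builds the panel from a dict of DataFrames rather than writing into an existing one, and it only applies to older pandas versions, since pd.Panel has since been removed.

import numpy as np
import pandas as pd   # assumes an old pandas version where pd.Panel still exists

year = np.arange(2000, 2005)
obs = np.arange(1, 5)
variables = ['x1', 'x2']

x1data = pd.DataFrame({'year': year, 'x1': np.random.randn(5)})

frames = {}
for i in obs:
    dfi = pd.DataFrame(index=year, columns=variables, dtype=float)
    dfi['x1'] = x1data.set_index('year')['x1']   # aligns on the year labels, no asarray needed
    frames[i] = dfi

data = pd.Panel(frames)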