How to read csv file with missing values and 'delim_whitespace=True'? - python

I would like to know if it is possible to simply drop any line that causes an error instead of raising an exception.
My issue is connected to processing a text file such as this one:
111 aaa 222 bbb
1 a 2 b
11 22
Because of the varied number of whitespaces used as separators, I am passing the option 'delim_whitespace=True' to the read_csv function. However, I am also explicitly specifying data types via the 'dtype' parameter.
It is natural that pandas shifts the value 22 into the second column for the third row (and I don't believe there is a way to convince it that the value actually belongs to the third column). However, since the second column is expected to be a string, this raises an exception.
I understand that this could probably be solved using the 'converters' parameter, but I am worried about performance since the data file is quite large (millions of rows).
So is it possible to drop lines with a lower number of columns (there is 'error_bad_lines' for a higher number), or to drop any line that causes an exception during type conversion? Or do you have any other ideas?
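A simplified sketch of the call I am using (the column names and exact dtypes here are placeholders, not my real ones):
import pandas as pd

df = pd.read_csv(
    'data.txt',
    delim_whitespace=True,
    header=None,
    names=['c1', 'c2', 'c3', 'c4'],
    # on the short line '11 22' the value 22 ends up in the second column
    # and the remaining columns become NaN, which breaks the dtype conversion
    dtype={'c1': 'int64', 'c2': str, 'c3': 'int64', 'c4': str},
)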

Use pandas.read_fwf to read the file. It will fill the missing fields with NaN values.
import pandas as pd
data = pd.read_fwf('data.txt', header=None)
data.columns = ["c1", "c2", "c3", "c4"]
Loaded data:
c1 c2 c3 c4
0 111 aaa 222 bbb
1 1 a 2 b
2 11 NaN 22 NaN
Next simply drop rows with NaN values:
out_data = data.dropna()
Output:
c1 c2 c3 c4
0 111 aaa 222 bbb
1 1 a 2 b
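If you still need the explicit dtypes from your question, you can re-apply them once the incomplete rows have been dropped (the mapping below is only an example):
# restore the intended dtypes after the rows with missing fields are gone
out_data = out_data.astype({'c1': 'int64', 'c2': str, 'c3': 'int64', 'c4': str})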

Related

Columns getting appended to wrong row in pandas

So I have a dataframe like this:-
0 1 2 ...
0 Index Something Something2 ...
1 1 5 8 ...
2 2 6 9 ...
3 3 7 10 ...
Now, I want to append some columns in between those "Something" column names, for which I have used this code:-
j = 1
for i in range(2, 51):
    if i % 2 != 0 and i != 4:
        df.insert(i, f"% Difference {j}", " ")
        j += 1
where df is the dataframe. Now what happens is that the columns do get inserted but like this:-
0 1 Difference 1 2 ...
0 Index Something NaN Something2 ...
1 1 5 NaN 8 ...
2 2 6 NaN 9 ...
3 3 7 NaN 10 ...
whereas what I wanted was this:-
0 1 2 3 ...
0 Index Something Difference 1 Something2 ...
1 1 5 NaN 8 ...
2 2 6 NaN 9 ...
3 3 7 NaN 10 ...
Edit 1 Using jezrael's logic:-
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop = True)
print(df)
The output of that is still this:-
0 1 2 ...
0 Index Something Something2 ...
1 1 5 8 ...
2 2 6 9 ...
3 3 7 10 ...
Any ideas or suggestions as to where or how I am going wrong?
If your dataframe looks like what you've shown in your first code block, your column names aren't Index, Something, etc. - they're actually 0, 1, etc.
Pandas is seeing Index, Something, etc. as data in row 0, NOT as column names (which exist above row 0). So when you add a column with the name Difference 1, you're adding a column above row 0, which is where the range of integers is located.
A couple potential solutions to this:
If you'd like the actual column names to be Index, Something, etc. then the best solution is to import the data with that row as the headers. What is the source of your data? If it's a csv, make sure to NOT use the header = None option. If it's from somewhere else, there is likely an option to pass in a list of the column names to use. I can't think of any reason why you'd want to have a range of integer values as your column names rather than the more descriptive names that you have listed.
Alternatively, you can do what #jezrael suggested and convert your first row of data to column names then delete that data row. I'm not sure why their solution isn't working for you, since the code seems to work fine in my testing. Here's what it's doing:
df.columns = df.iloc[0].tolist()
df.columns tells pandas what to (re)name the columns of the dataframe. df.iloc[0].tolist() creates a list out of the first row of data, which in your case is the column names that you actually want.
df = df.iloc[1:].reset_index(drop = True)
This grabs the 2nd through last rows of data to recreate the dataframe. So you have new column names based on the first row, then you recreate the dataframe starting at the second row. The .reset_index(drop = True) isn't totally necessary to include. That just restarts your actual data rows with an index value of 0 rather than 1.
If for some reason you want to keep the column names as they currently exist (as integers rather than labels), you could do something like the following under the if statement in your for loop:
df.insert(i, i, np.nan, allow_duplicates = True)
df.iat[0, i] = f"%Difference {j}"
df.columns = np.arange(len(df.columns))
The first line inserts a column with an integer label, filled with NaN values to start with (assuming you have numpy imported). You need to allow duplicates, otherwise you'll get an error since the integer value will be the name of a pre-existing column.
The second line changes the value in the 1st row of the newly-created column to what you want.
The third line resets the column names to be a range of integers like you had to start with.
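Put together inside the original loop, something like this sketch (assuming numpy is imported as np and keeping the loop bounds from your question):
import numpy as np

j = 1
for i in range(2, 51):
    if i % 2 != 0 and i != 4:
        # insert a NaN-filled column whose label is the integer i
        df.insert(i, i, np.nan, allow_duplicates=True)
        # put the wanted label into the first data row of the new column
        df.iat[0, i] = f"%Difference {j}"
        # renumber all column labels back to 0..n-1
        df.columns = np.arange(len(df.columns))
        j += 1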
As #jezrael suggested, it seems like you might be a little unclear about the difference between column names, indices, and data rows and columns. An index is its own thing, so it's not usually necessary to have a column named Index like you have in your dataframe, especially since that column has the same values in it as the actual index. Clarifying those sorts of things at import can help prevent a lot of hassle later on, so I'd recommend taking a good look at your data source to see if you can create a clearer dataframe to start with!
I want to append some columns in between those "Something" column names
No, there are no column names Something; for that you need to set the first row of data as the column names:
print (df.columns)
Int64Index([0, 1, 2], dtype='int64')
print (df.iloc[0].tolist())
['Index', 'Something', 'Something2']
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop=True)
print (df)
Index Something Something2
0 1 5 8
1 2 6 9
2 3 7 10
print (df.columns)
Index(['Index', 'Something', 'Something2'], dtype='object')
Then your solution creates the Difference columns, but the output is different - there are no columns 0,1,2,3.

Generate a sparse dataframe from a list of index positions and values

New learner here. I have a list of data values that are labeled by a comma-delimited string that represents the position in a dataframe; think of the string as representing the row (say 1-20) and column (say A-L) index values of a position in the array where the corresponding value should go. The populated data frame would be sparse, with many empty cells. I am working with pandas for the first time on this project, and am still learning the ropes.
position value
1,A 32
1,F 16
2,B 234
2,C 1345
2,E 13
2,G 999
3,D 5332
4,B 12
etc.
I have been trying various approaches, but am not satisfied. I created dummy entries for empty cells in the completed dataframe, then iterated over the list to write the value to the correct cell. It works but it is not elegant and it seems like a brittle solution.
I can pre-generate a dataframe and populate it, or generate a new dataframe as part of the population process: either solution would be fine. It seems like this should be a simple task. Maybe even a one liner! But I am stumped. I would appreciate any pointers.
This is a standard unstack:
entries.set_index(['row','column']).unstack()
where entries is defined in #StuartBerg's answer:
entries = pd.read_csv(StringIO(text), sep='[ ,]+')
output:
value
column A B C D E F G
row
1 32.0 NaN NaN NaN NaN 16.0 NaN
2 NaN 234.0 1345.0 NaN 13.0 NaN 999.0
3 NaN NaN NaN 5332.0 NaN NaN NaN
4 NaN 12.0 NaN NaN NaN NaN NaN
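If the extra value level on the columns is unwanted, selecting the value column before unstacking avoids it (a small variation on the same idea):
# select the single value column first so the result has plain A..G columns
out = entries.set_index(['row', 'column'])['value'].unstack()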
As you suggest, the simplest method might be a for-loop to initialize the non-empty values. Alternatively, you can use pivot() or numpy advanced indexing. All options are shown below.
The only tricky thing is ensuring that your dataframe result will have the complete set of rows and columns, as explained in the update below.
text = """\
row,column,value
1,A 32
1,F 16
2,B 234
2,C 1345
2,E 13
2,G 999
3,D 5332
4,B 12
"""
from io import StringIO
import numpy as np
import pandas as pd
# Load your data and convert the column letters to integers.
# Note: Your example data is delimited with both spaces and commas,
# which is why we need a custom 'sep' argument here.
entries = pd.read_csv(StringIO(text), sep='[ ,]+')
entries['icol'] = entries['column'].map(lambda c: ord(c) - ord('A'))
# Construct an empty DataFrame with the appropriate index and columns.
rows = range(1, 1 + entries['row'].max())
columns = [chr(ord('A') + i) for i in range(1 + entries['icol'].max())]
df = pd.DataFrame(index=rows, columns=columns)
##
## Three ways to populate the dataframe:
##
# Option 1: Iterate in a for-loop
for e in entries.itertuples():
    df.loc[e.row, e.column] = e.value
# Option 2: Use pivot() or unstack()
df = df.fillna(entries.pivot('row', 'column', 'value'))
# Option 3: Use numpy indexing to overwrite the underlying array:
irows = entries['row'].values - 1
icols = entries['icol'].values
df.values[irows, icols] = entries['value'].values
Result:
A B C D E F G
1 32 NaN NaN NaN NaN 16 NaN
2 NaN 234 1345 NaN 13 NaN 999
3 NaN NaN NaN 5332 NaN NaN NaN
4 NaN 12 NaN NaN NaN NaN NaN
Update:
Late in the day, it occurred to me that this can be solved via pivot() (or unstack(), as suggested by #piterbarg). I've now included that option above.
In fact, it's tempting to just use pivot() without pre-initializing the DataFrame. HOWEVER, there's an important caveat to that approach: If any particular row or column value remains completely unused in your original entries data, then those rows will remain completely omitted from the final table. That is, if no entry uses row 3, your final table would only contain rows 1,2,4. Likewise, if your data contains no data for columns C,E,G (for example), then you would end up with columns A,B,D,F.
If you want to be sure that your rows use contiguous index values and your columns use a contiguous sequence of letters, then pivot() or unstack() is not enough. You must first initialize the indexes of your dataframe as shown above.
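An alternative sketch to pre-initializing the empty frame: pivot first and then reindex against the full row range and column alphabet (this reuses the rows and columns variables defined above):
# pivot first, then force the complete set of row numbers and column letters
full = entries.pivot(index='row', columns='column', values='value')
full = full.reindex(index=rows, columns=columns)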

Pandas: Some MultiIndex values appearing as NaN when reading Excel sheets

When reading an Excel spreadsheet into a Pandas DataFrame, Pandas appears to be handling merged cells in an odd fashion. For the most part, it interprets the merged cells as desired, apart from the first merged cell for each column, which is producing NaN values where it shouldn't.
dataframes = pd.read_excel(
    "../data/data.xlsx",
    sheet_name=[0, 1, 2],  # read the first three sheets as separate DataFrames
    header=[0, 1],         # rows [1,2] in Excel
    index_col=[0, 1, 2],   # cols [A,B,C] in Excel
)
I load three sheets, but behaviour is identical for each so from now on I will only discuss one of them.
> dataframes[0]
Header 1  H2       H3  Value 1
Overall   Overall
A1        B1       0   10
NaN       NaN      1   11
NaN       B2       0   12
NaN       B2       1   13
--------  -------  --  -------
A2        B1       0   11
A2        B1       1   12
A2        B2       0   13
A2        B2       1   14
As you can see, A1 loads with NaNs yet A2 (and all beyond it, in the real data) loads fine. Both A1 and A2 are each actually a single merged cell spanning 4 rows in the Excel spreadsheet itself.
What could be causing this issue? It would normally be a simple fix via a fillna(method="ffill") but MultiIndex does not support that. I have so far not found another workaround.
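A possible workaround (sketched here, not verified against the real file) is to rebuild the index from a forward-filled frame, so the gaps left by the merged cells get filled in:
# rebuild the MultiIndex after forward-filling the merged-cell gaps
df = dataframes[0]
df.index = pd.MultiIndex.from_frame(df.index.to_frame().ffill())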

how to get dataframe columns and index on the same level after set_index?

after setting the index of a dataframe, the columns and index seem to be on different levels
df = pd.DataFrame({'A': [1, 2, 3], 'B': [11, 22, 33]}).set_index('A')
B
A
1 11
2 22
3 33
df.index.names is FrozenList(['A']) and df.columns.names is FrozenList([None]), so I couldn't use droplevel for that.
The desired output is
A B
1 11
2 22
3 33
I'm answering my own question in case it might be helpful for someone else.
Based on how the dataframe is displayed, I thought there was some kind of MultiIndex somewhere, but as pointed out in the comments this is just how index labels are displayed, so
df.index.name = None
removes the extra level effect (that is also visible for example when using df.to_latex)
B
1 11
2 22
3 33
I was trying df.index.name = '' which actually kept the level effect
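Put together, a minimal end-to-end sketch:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [11, 22, 33]}).set_index('A')
df.index.name = None  # drop the extra 'A' label line from the display
print(df)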

Python3 Pandas Filter by Columns with Unknown Column Names

Working with a data set comparing rosters with different dates. It goes through a pivot and we don't know the dates of when the rosters are pulled but the resulting data set is structured like this:
colA ColB colC colD Date:yymmdd Date:yymmdd Date:yymmdd
Bob aa aa aa 0 0 1
Jack bb bb bb 1 1 1
Steve cc cc cc 0 1 1
Mary dd dd dd 1 1 1
Abu ee ee ee 1 1 0
I successfully did a fillna for every column after the first 4 columns (they are known).
df.iloc[:,4:] = df.iloc[:,4:].fillna(0) #<-- Fills blanks on every column after column 4.
Question: Now i'm trying to filter the df on the columns that have a zero. Is there a way to filter by columns after 4? I tried:
df = df[(df.iloc[:,4:] == 0)] # error
df = df[(df.columns[:,4:] == 0)] # error
df = df[(df.columns.str.contains(':') == 0)] # unknown columns do have a ':', but didn't work.
Is there a better way to do this? Looking for a result that only shows the rows with a 0 in any column past #4.
The snippet below will give you a DataFrame containing True and False as cell values:
df.iloc[:, 4:].eq(x)
If you want to keep only the rows where x occurs, you can add an any() clause,
like the way #jpp has shown in his answer.
In your case, it will be df[df.iloc[:, 4:].eq(0).any(1)]
This will give you all the rows of the DataFrame that have at least one 0 as a data value.
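A quick self-contained sketch on data shaped like yours (the date column names and values here are made up):
import pandas as pd

df = pd.DataFrame({
    'colA': ['Bob', 'Jack', 'Steve', 'Mary', 'Abu'],
    'ColB': ['aa', 'bb', 'cc', 'dd', 'ee'],
    'colC': ['aa', 'bb', 'cc', 'dd', 'ee'],
    'colD': ['aa', 'bb', 'cc', 'dd', 'ee'],
    'Date:211001': [0, 1, 0, 1, 1],
    'Date:211101': [0, 1, 1, 1, 1],
    'Date:211201': [1, 1, 1, 1, 0],
})
# keep only the rows that have a 0 somewhere after the fourth column
rows_with_zero = df[df.iloc[:, 4:].eq(0).any(axis=1)]
print(rows_with_zero)  # Bob, Steve and Abu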
If all values are 0 or bigger, use min:
df[df.iloc[:, 4:].min(axis=1) == 0]
