Read CSV into a DataFrame with varying row lengths using pandas - python

So I have a CSV that looks a bit like this:
1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454
...
And when I try to use the following code to generate a DataFrame...
df = pd.read_csv('data.csv', header=0, engine='c', error_bad_lines=False)
It only adds rows with 3 columns to the df (rows 1, 3 and 5 from above).
The rest are considered 'bad lines', giving me the following error:
Skipping line 17467: expected 3 fields, saw 9
How do I create a DataFrame that includes all the data in my CSV, ideally just filling the empty cells with null? Or do I have to declare the maximum row length before adding to the df?
Thanks!

If using only pandas, read the lines in whole, then deal with the separator afterwards.
import pandas as pd
# read each line as a single field, then split on the padded pipe separator
df = pd.read_csv('data.csv', header=None, sep='\n')
df = df[0].str.split(r'\s\|\s', expand=True)
0 1 2 3 4 5 6
0 1 01-01-2019 724 None None None None
1 2 01-01-2019 233 436 None None None
2 3 01-01-2019 345 None None None None
3 4 01-01-2019 803 933 943 923 954
4 5 01-01-2019 454 None None None None
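One caveat: newer pandas versions deprecate sep='\n' (and recent releases reject it). If that bites, here is a minimal sketch of the same read-then-split idea using plain Python, assuming the pipe-padded 'data.csv' from the question:
import re
import pandas as pd

# read the raw lines ourselves and split on the padded pipe separator
with open('data.csv') as f:
    rows = [re.split(r'\s*\|\s*', line.strip()) for line in f]
df = pd.DataFrame(rows)  # shorter rows are padded with None automatically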

If you know that the data contains N columns, you can
tell Pandas in advance how many columns to expect via the names parameter:
import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(7)))
print(df)
yields
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
If you have an upper limit, N, on the number of columns, then you can
have Pandas read N columns and then use dropna to drop completely empty columns:
import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(20))).dropna(axis='columns', how='all')
print(df)
yields
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
Note that this could drop columns from the middle of the data set (not just
columns from the right-hand side) if they are completely empty.
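If no safe upper bound is known in advance, one option (a sketch, assuming the pipe-delimited data.csv from the question and that no field contains a literal pipe) is to scan the file once to find the widest row first:
import pandas as pd

# count the fields on every line to get the true maximum width
with open('data.csv') as f:
    max_cols = max(line.count('|') + 1 for line in f)
df = pd.read_csv('data.csv', delimiter='|', names=list(range(max_cols)))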

Read fixed-width (pd.read_fwf) should work:
import pandas as pd
from io import StringIO
s = '''1 01-01-2019 724
2 01-01-2019 233 436
3 01-01-2019 345
4 01-01-2019 803 933 943 923 954
5 01-01-2019 454'''
pd.read_fwf(StringIO(s), header=None)
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
or with a delimiter param
s = '''1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454'''
pd.read_fwf(StringIO(s), header=None, delimiter='|')
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
Note that for your actual file you would not use StringIO; just replace it with your file path: pd.read_fwf('data.csv', delimiter='|', header=None)

Add extra columns (empty or otherwise) to the top of your CSV file. Pandas takes the first row as the default width, and anything below it will have NaN values in the missing cells. Example:
file.csv:
a,b,c,d,e
1,2,3
3
2,3,4
code:
>>> import pandas as pd
>>> pd.read_csv('file.csv')
a b c d e
0 1 2.0 3.0 NaN NaN
1 3 NaN NaN NaN NaN
2 2 3.0 4.0 NaN NaN

Consider using Python's csv module to do the heavy lifting when importing and grooming the data. You can register a custom dialect to handle the varying field counts.
import csv
import pandas as pd
csv_data = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""
with open('test1.csv', 'w') as f:
    f.write(csv_data)
csv.register_dialect('PipeDialect', delimiter='|')
with open('test1.csv') as csvfile:
    data = [row for row in csv.reader(csvfile, 'PipeDialect')]
df = pd.DataFrame(data=data)
Gives you a csv import dialect and the following DataFrame:
0 1 2 3 4 5 6
0 1 01-01-2019 724 None None None None
1 2 01-01-2019 233 436 None None None
2 3 01-01-2019 345 None None None None
3 4 01-01-2019 803 933 943 923 954
4 5 01-01-2019 454 None None None None
Left as an exercise is handling the whitespace padding in the input file.
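One way to do that exercise (a sketch, reusing the PipeDialect registered above) is to strip each cell as it is read:
import csv
import pandas as pd

# strip the whitespace padding around the pipes, cell by cell
with open('test1.csv') as csvfile:
    data = [[cell.strip() for cell in row]
            for row in csv.reader(csvfile, 'PipeDialect')]
df = pd.DataFrame(data)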

colnames = [str(i) for i in range(9)]
df = pd.read_table('data.csv', header=None, sep=',', names=colnames)
Change the 9 in colnames to x if the code gives the error:
Skipping line 17467: expected 3 fields, saw x

Related

Pandas append DataFrame2 ROW to DataFrame1 ROW

I want to append rows from a second DataFrame (df2) to a first DataFrame (df1), depending on whether df1's column "isValid" is [T]rue.
I know how to iterate over the df1 column and search for True values, but I don't know how to easily append the rows from the second DataFrame. My real data has around 1000 rows and 40 columns, so I need to do the operations automatically.
import pandas
df1 = pandas.read_csv('df1.csv', sep=';')
df2 = pandas.read_csv('df2.csv', sep=';')
print(df1.to_string(), '\n')
print(df2.to_string(), '\n')
columnSeriesObj = df1.iloc[:, 2]
n = 0
k = 0
for i in columnSeriesObj:
    if i == "T":
        print("True in row number", k)
        # APPEND n ROW from df2 to k ROW from df1
        n += 1
    k += 1
print('\n', df1.to_string())
Here are some test values:
df1.csv
DataA;DataB;isValid
1568;1104;F
1224;1213;F
1676;1246;F
1279;1489;T
1437;1890;T
1705;1007;F
1075;1720;F
1361;1983;F
1966;1751;F
1938;1564;F
1894;1684;F
1189;1803;F
1275;1138;F
1085;1748;T
1337;1775;T
1719;1975;F
1045;1187;F
1426;1757;F
1410;1363;F
1405;1025;F
1699;1873;F
1777;1464;F
1925;1310;T
df2.csv
Nr;X;Y;Z;A ;B;C
1;195;319;18;qwe;hjk;wsx
2;268;284;23;rty;zxc;edc
3;285;277;36;uio;vbn;rfv
4;143;369;34;asd;mlp;tgb
5;290;247;16;fgh;qaz;yhn
I want df1 after appending to look like this (screenshot from Excel not reproduced here):
Thank you for any suggestions! :D
You can filter the index values in df1 where the column isValid equals T, then set those filtered index values as the index of df2, and finally join it with df1:
m = df1['isValid'].eq('T')       # boolean mask of the valid rows
idx = m[m].index[:len(df2)]      # index labels of the first len(df2) True rows
df1.join(df2.set_index(idx)).fillna('')
DataA DataB isValid Nr X Y Z A B C
0 1568 1104 F
1 1224 1213 F
2 1676 1246 F
3 1279 1489 T 1 195 319 18 qwe hjk wsx
4 1437 1890 T 2 268 284 23 rty zxc edc
5 1705 1007 F
6 1075 1720 F
7 1361 1983 F
8 1966 1751 F
9 1938 1564 F
10 1894 1684 F
11 1189 1803 F
12 1275 1138 F
13 1085 1748 T 3 285 277 36 uio vbn rfv
14 1337 1775 T 4 143 369 34 asd mlp tgb
15 1719 1975 F
16 1045 1187 F
17 1426 1757 F
18 1410 1363 F
19 1405 1025 F
20 1699 1873 F
21 1777 1464 F
22 1925 1310 T 5 290 247 16 fgh qaz yhn
I suggest the following:
I created some dummy data, similar to yours:
import pandas as pd
import random
df = pd.DataFrame({"a": list(range(20)), "b": [random.choice(("T", "F")) for _ in range(20)]})
df2 = pd.DataFrame({"value1": list(range(5)), "nr": list(range(5))})
First you create a new column in the first dataframe that holds the incrementing ID ("Nr"). To do so, use the count generator from itertools.
from itertools import count
counter = count(start=1)
df["id"] = df.apply(lambda row: next(counter) if row["b"] == "T" else None, axis=1)
After that you can perform a join with the merge method.
df.merge(df2, left_on="id", right_on="nr", how="outer")
How about something like this:
(i.e. first find the overlapping index values and then join the DataFrames)
import pandas as pd
import numpy as np
df1 = pd.read_csv("df1.csv", sep=';')
df2 = pd.read_csv(r"df2.csv", sep=';')
# find intersecting indices
useidx = np.intersect1d(df2.index,
                        df1[df1.isValid == 'T'].index)
# join relevant values
df_joined = df1.join(df2.loc[useidx])
df_joined then looks like this:
>>> DataA DataB isValid Nr X Y Z A B C
>>> 0 1568 1104 F NaN NaN NaN NaN NaN NaN NaN
>>> 1 1224 1213 F NaN NaN NaN NaN NaN NaN NaN
>>> 2 1676 1246 F NaN NaN NaN NaN NaN NaN NaN
>>> 3 1279 1489 T 4.0 143.0 369.0 34.0 asd mlp tgb
>>> 4 1437 1890 T 5.0 290.0 247.0 16.0 fgh qaz yhn
>>> 5 1705 1007 F NaN NaN NaN NaN NaN NaN NaN
>>> 6 1075 1720 F NaN NaN NaN NaN NaN NaN NaN
>>> 7 1361 1983 F NaN NaN NaN NaN NaN NaN NaN

Calculate the sum of values replacing NaN

I have a data frame with some NaNs in column B.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [654, 987, 321, 654, 987, 15, 98, 338],
    'B': [987, np.nan, 741, np.nan, 65, 35, 94, np.nan]})
df
A B
0 654 987.0
1 987 NaN
2 321 741.0
3 654 NaN
4 987 65.0
5 15 35.0
6 98 94.0
7 338 NaN
I replace the NaNs in B with the numbers from A:
df.B.fillna(df.A, inplace = True)
df
A B
0 654 987.0
1 987 987.0
2 321 741.0
3 654 654.0
4 987 65.0
5 15 35.0
6 98 94.0
7 338 338.0
What's the easiest way to calculate the sum of the values that have replaced the NaNs in B?
You can use Series.isna() with .loc[] to select column A where column B is null, and then sum:
df.loc[df['B'].isna(),'A'].sum()
Alternative:
df['B'].fillna(df['A']).sum() - df['B'].sum()
Note: you should do this before the inplace operation, or preferably create a copy and save it under a different variable for later reference.
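As a sanity check with the sample frame above: the NaNs in B sit at rows 1, 3 and 7, so the replaced values are 987, 654 and 338, and both expressions evaluate to 1979:
>>> df.loc[df['B'].isna(), 'A'].sum()
1979
>>> df['B'].fillna(df['A']).sum() - df['B'].sum()
1979.0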
Try the function math.isnan to check for NaN values.
import numpy as np
import pandas as pd
import math
df = pd.DataFrame({
    'A': [654, 987, 321, 654, 987, 15, 98, 338],
    'B': [987, np.nan, 741, np.nan, 65, 35, 94, np.nan]})
for i in range(0, len(df['B'])):
    if math.isnan(df['B'][i]):
        df.loc[i, 'B'] = df.loc[i, 'A']  # .loc avoids chained-assignment warnings
print(df)
Output:
A B
0 654 987.0
1 987 987.0
2 321 741.0
3 654 654.0
4 987 65.0
5 15 35.0
6 98 94.0
7 338 338.0

How to impute Null values in python for categorical data?

I have seen that in R, imputation of categorical data is done straightforwardly by packages like DMwR and Caret, and I also have algorithm options like KNN or CentralImputation. But I do not see any libraries in Python doing the same; FancyImpute performs well on numeric data.
Is there a way to do imputation of Null values in python for categorical data?
Edit: Added the top few rows of the data set.
>>> data_set.head()
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond \
0 856 854 0 NaN 3 1Fam TA
1 1262 0 0 NaN 3 1Fam TA
2 920 866 0 NaN 3 1Fam TA
3 961 756 0 NaN 3 1Fam Gd
4 1145 1053 0 NaN 4 1Fam TA
BsmtExposure BsmtFinSF1 BsmtFinSF2 ... SaleType ScreenPorch Street \
0 No 706.0 0.0 ... WD 0 Pave
1 Gd 978.0 0.0 ... WD 0 Pave
2 Mn 486.0 0.0 ... WD 0 Pave
3 No 216.0 0.0 ... WD 0 Pave
4 Av 655.0 0.0 ... WD 0 Pave
TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd \
0 8 856.0 AllPub 0 2003 2003
1 6 1262.0 AllPub 298 1976 1976
2 6 920.0 AllPub 0 2001 2002
3 7 756.0 AllPub 0 1915 1970
4 9 1145.0 AllPub 192 2000 2000
YrSold
0 2008
1 2007
2 2008
3 2006
4 2008
[5 rows x 81 columns]
There are a few ways to deal with missing values. As I understand it, you want to fill the NaNs according to a specific rule, and pandas' fillna can be used for that. The code below is an example of how to fill a categorical NaN with the most frequent value:
df['Alley'].fillna(value=df['MSZoning'].value_counts().index[0], inplace=True)
sklearn.preprocessing.Imputer might also be helpful. For more information, see pandas.DataFrame.fillna.
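As a side note, recent scikit-learn releases replaced Imputer with SimpleImputer. A minimal sketch of most-frequent imputation for one categorical column (the Alley column name is taken from the data above):
from sklearn.impute import SimpleImputer

# strategy='most_frequent' works on object/categorical columns, not just numeric ones
imp = SimpleImputer(strategy='most_frequent')
data_set[['Alley']] = imp.fit_transform(data_set[['Alley']])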
Hope this will work

Data-frame manipulation in python

I have a csv file with two columns of a and b as below:
a b
601 1
602 2
603 3
604 4
605 5
606 6
I want to read and save data in a new csv file as below:
s id
601 1
602 1
603 1
604 2
605 2
606 2
I have tried this code:
data = pd.read_csv('./dataset/test4.csv')
list = []  # note: this shadows the built-in list
i = 0
while i < 6:
    list.append(data['a'].iloc[i:i+3])
    i += 3
df = pd.DataFrame(list)
print(df)
with this output:
0 1 2 3 4 5
a 601.0 602.0 603.0 NaN NaN NaN
a NaN NaN NaN 604.0 605.0 606.0
First I need to save the list in a DataFrame with the following result:
0 1 2 3 4 5
601.0 602.0 603.0 604.0 605.0 606.0
and then save it in a CSV file. However, I've got stuck on the first part.
Thanks for your help.
Assuming every 3 items in a constitute a group in b, just do a little integer division on the index.
data['b'] = (data.index // 3 + 1)
data
a b
0 601 1
1 602 1
2 603 1
3 604 2
4 605 2
5 606 2
Saving to CSV is straightforward - all you have to do is call df.to_csv(...).
Division by index is fine as long as you have a monotonically increasing integer index. Otherwise, you can use np.arange (on MaxU's recommendation):
data['b'] = np.arange(len(data)) // 3 + 1
data
a b
0 601 1
1 602 1
2 603 1
3 604 2
4 605 2
5 606 2
By using your output:
df.stack().unstack()
Out[115]:
0 1 2 3 4 5
a 601.0 602.0 603.0 604.0 605.0 606.0
Data Input
df
0 1 2 3 4 5
a 601.0 602.0 603.0 NaN NaN NaN
a NaN NaN NaN 604.0 605.0 606.0
Or, working from the original two-column data:
In [45]: df[['a']].T
Out[45]:
0 1 2 3 4 5
a 601 602 603 604 605 606
or
In [39]: df.set_index('b').T.rename_axis(None, axis=1)
Out[39]:
1 2 3 4 5 6
a 601 602 603 604 605 606

How to prevent automatic assignment of values to missing data imported from SPSS

Let's say I have an SPSS file named "ab.sav" which looks like this:
gender value value2
F 433 329
. . 787
. . .
M 121 .
F 311 120
. . 899
M 341 .
In SPSS (Variable View) I defined the labels of gender with the values 1 and 2 for M and F, respectively.
When I load this in python using the following commands:
>>> from rpy2.robjects.packages import importr
>>> from rpy2.robjects import pandas2ri
>>> foreign=importr("foreign")
>>> data=foreign.read_spss("ab.sav", to_data_frame=True, use_value_labels=True)
>>> pandas2ri.activate()
>>> data2=pandas2ri.ri2py(data)
I get the following dataframe:
>>> data2
gender value value2
0 F 433 329
1 M NaN 787
2 M NaN NaN
3 M 121 NaN
4 F 311 120
5 M NaN 899
6 M 341 NaN
So the missing values in the gender column are replaced by the next known value from a subsequent case. Is there a simple way to prevent this?
When I change use_value_labels to False I get the expected result though:
>>> data2
gender value value2
0 2 433 329
1 NaN NaN 787
2 NaN NaN NaN
3 1 121 NaN
4 2 311 120
5 NaN NaN 899
6 1 341 NaN
However I'd like to be able to use the labels instead of numeric values for gender as above. Ideally the output should be:
>>> data2
gender value value2
0 F 433 329
1 NaN NaN 787
2 NaN NaN NaN
3 M 121 NaN
4 F 311 120
5 NaN NaN 899
6 M 341 NaN
Assuming data2 is a pandas DataFrame, and there's a 1-to-1 mapping between nulls in value and gender, you can do the following:
import numpy as np
import pandas

nulls = pandas.isnull(data2['value'])
data2.loc[nulls, 'gender'] = np.nan
And that turns it into:
gender value value2
0 F 433 329
1 NaN NaN 787
2 NaN NaN NaN
3 M 121 NaN
4 F 311 120
5 NaN NaN 899
6 M 341 NaN
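For what it's worth, the same masking can be written as a one-liner (a sketch using Series.mask, which replaces matching positions with NaN by default):
data2['gender'] = data2['gender'].mask(data2['value'].isnull())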
