Pandas read file with no delimiter and with different column widths - python

I want to read a plaintext file using pandas.
I have entries without delimiters and with different widths like this:
59967Y98Doe John            6211100004545SO20140314- 00024278
N0546664SCHMIDT-PETER       7441100008300AW20140314- 00023643
G4894jmhTAKLONSKY-JUERGEN   4211100005000TB20140315  00023882
34875738PODESBERG-SCHUMPERTS6211100003671SO20140315  00024622
Positions 1-8 are a string.
Positions 9-28 are a string.
Positions 29-31 are numeric.
Positions 32-34 are numeric.
Positions 35-41 are numeric.
Positions 42-43 are a string.
Positions 44-51 are a date (yyyyMMdd).
Position 52 is a minus sign or a blank.
The rest is a currency amount without a decimal point (the last two digits are always the cents). For example: - 00024278 = -242.78 €
I know there is pd.read_fwf. It has a widths argument, so I could do this:
pd.read_fwf(StringIO(txt), widths=[8], header="Peronal Nr.")
But how could I read my file with different columns widths?

As the s in widths suggests, you can pass a list of widths:
pd.read_fwf(io.StringIO(txt), widths=[8,20,3,3,7,2,8,1,99], header=None)
output:
          0                     1    2    3     4   5         6    7      8
0  59967Y98              Doe John  621  110  4545  SO  20140314    -  24278
1  N0546664         SCHMIDT-PETER  744  110  8300  AW  20140314    -  23643
2  G4894jmh     TAKLONSKY-JUERGEN  421  110  5000  TB  20140315  NaN  23882
3  34875738  PODESBERG-SCHUMPERTS  621  110  3671  SO  20140315  NaN  24622
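Equivalently, since the question states the exact character positions, read_fwf also accepts colspecs, a list of half-open, 0-based (start, end) intervals; this is just an alternative spelling of the same call:
pd.read_fwf(io.StringIO(txt), header=None,
            colspecs=[(0, 8), (8, 28), (28, 31), (31, 34), (34, 41), (41, 43), (43, 51), (51, 52), (52, 99)])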
If you want names and dtypes:
df = (pd.read_fwf(io.StringIO(txt), widths=[8,20,3,3,7,2,8,1,99], header=None,
                  names=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
                  dtype={'A': str, 'B': str, 'F': str, 'G': str, 'H': str})
      .assign(**{'G': lambda d: pd.to_datetime(d['G'], format='%Y%m%d')})
     )
output:
          A                     B    C    D     E   F          G    H      I
0  59967Y98              Doe John  621  110  4545  SO 2014-03-14    -  24278
1  N0546664         SCHMIDT-PETER  744  110  8300  AW 2014-03-14    -  23643
2  G4894jmh     TAKLONSKY-JUERGEN  421  110  5000  TB 2014-03-15  NaN  23882
3  34875738  PODESBERG-SCHUMPERTS  621  110  3671  SO 2014-03-15  NaN  24622
df.dtypes
A object
B object
C int64
D int64
E int64
F object
G datetime64[ns]
H object
I int64
dtype: object
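If you also need the signed euro amount described in the question (column 52 carries the sign and the last two digits are the cents), here is a minimal follow-up sketch on the df built above; the 'amount' column name is this sketch's own choice, not part of the original answer:
# negate I where the sign column H is '-', then shift the decimal point two places
df['amount'] = df['I'].where(df['H'] != '-', -df['I']) / 100
# e.g. row 0: H == '-' and I == 24278  ->  amount == -242.78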

Related

Preserving NaN values when using groupby and lambda function on dataframe

Following on from this question, I have a dataset as such:
   ChildID  MotherID     preDiabetes
0       20       455              No
1       20       455  Not documented
2       13       102             NaN
3       13       102             Yes
4      702       946              No
5       82       571              No
6       82       571             Yes
7       82       571  Not documented
8       60       530             NaN
Which I have transformed to the following such that each mother has a single value for preDiabetes:
   ChildID  MotherID preDiabetes
0       20       455          No
1       13       102         Yes
2      702       946          No
3       82       571         Yes
4       60       530          No
I did this by applying the following logic:
if preDiabetes=="Yes" for a particular MotherID, assign preDiabetes a value of "Yes" regardless of the remaining observations
else if preDiabetes != "Yes" for a particular MotherID, I will assign preDiabetes a value of "No"
However, after thinking about this again, I realised that I should preserve NaN values so I can impute them later on, rather than just assigning them 'No'.
So I should edit my logic to be:
if preDiabetes=="Yes" for a particular MotherID, assign preDiabetes a value of "Yes" regardless of the remaining observations
else if all values for preDiabetes==NaN for a particular MotherID, assign preDiabetes a single NaN value
else assign preDiabetes a value of "No"
So, in the above table MotherID=530 should have a value of NaN for preDiabetes like so:
   ChildID  MotherID preDiabetes
0       20       455          No
1       13       102         Yes
2      702       946          No
3       82       571         Yes
4       60       530         NaN
I tried doing this using the following line of code:
df = df.groupby(['MotherID', 'ChildID'])['preDiabetes'].apply(
    lambda x: 'Yes' if 'Yes' in x.values else (np.NaN if np.NaN in x.values.all() else 'No'))
However, running this line of code is resulting in the following error:
TypeError: 'in <string>' requires string as left operand, not float
I'd appreciate if you guys can point out what it is I am doing wrong. Thank you.
You can try:
import pandas as pd
import numpy as np
import io
data_string = """ChildID,MotherID,preDiabetes
20,455,No
20,455,Not documented
13,102,NaN
13,102,Yes
702,946,No
82,571,No
82,571,Yes
82,571,Not documented
60,530,NaN
"""
data = io.StringIO(data_string)
df = pd.read_csv(data, sep=',', na_values=['NaN'])
df.fillna('no_value', inplace=True)
df = df.groupby(['MotherID', 'ChildID'])['preDiabetes'].apply(
    lambda x: 'Yes' if 'Yes' in x.values else (np.NaN if 'no_value' in x.values.all() else 'No'))
df
Result:
MotherID ChildID
102 13 Yes
455 20 No
530 60 NaN
571 82 Yes
946 702 No
Name: preDiabetes, dtype: object
You can do this using a custom function:
def func(s):
    if s.eq('Yes').any():
        return 'Yes'
    elif s.isna().all():
        return np.nan
    else:
        return 'No'

df = (df
      .groupby(['ChildID', 'MotherID'])
      .agg({'preDiabetes': func})
      .reset_index())
print(df)
ChildID MotherID preDiabetes
0 13 102 Yes
1 20 455 No
2 60 530 NaN
3 82 571 Yes
4 702 946 No
Try:
df['preDiabetes'] = df['preDiabetes'].map({'Yes': 1, 'No': 0}).fillna(-1)
df = df.groupby(['MotherID', 'ChildID'])['preDiabetes'].max().map({1: 'Yes', 0: 'No', -1: np.nan}).reset_index()
The first line maps preDiabetes to numbers, treating anything other than Yes or No as missing (denoted by -1). The second line takes the maximum per group and maps it back: if at least one value in the group is Yes, we output Yes; if the group has a No (with or without missing values), we output No; if every value is missing, we output a real NaN so it can be imputed later.
Outputs:
>>> df
MotherID ChildID preDiabetes
0 102 13 Yes
1 455 20 No
2 530 60 NaN
3 571 82 Yes
4 946 702 No

Compare two dataframes based off of key

I have two dataframes, df1 and df2, which have the exact same columns and most of the time the same values for each key.
df1:
Country A B C D E F G H Key
Argentina xylo 262 4632 0 0 26.12 2 0 Argentinaxylo
Argentina phone 6860 155811 48 0 4375.87 202 0 Argentinaphone
Argentina land 507 1803728 2 117 7165.810566 3 154 Argentinaland
Australia xylo 7650 139472 69 0 16858.42 184 0 Australiaxylo
Australia mink 1284 2342788 1 0 39287.71 53 0 Australiamink
df2:
Country A B C D E F G H Key
Argentina xylo 262 4632 0 0 26.12 2 0 Argentinaxylo
Argentina phone 6860 155811 48 0 4375.87 202 0 Argentinaphone
Argentina land 507 1803728 2 117 7165.810566 3 154 Argentinaland
Australia xylo 7650 139472 69 0 16858.42 184 0 Australiaxylo
Australia mink 1284 2342788 1 0 39287.71 53 0 Australiamink
I want a snippet that compares the keys (key = column Country + column A) in each dataframe against each other and calculates the percent difference for each column B-H, if there is any. If there isn't, output nothing.
I hope the code below helps. It computes a (B - H)/100 value for each dataset, merges the two datasets on the Key column, and reports the difference between those values in the df3diff column of df3.
import pandas as pd
df1 = pd.DataFrame([['Argentina', 'xylo',   262,    4632,  0,   0,    26.12,       2,   0, 'Argentinaxylo'],
                    ['Argentina', 'phone', 6860,  155811, 48,   0,  4375.87,     202,   0, 'Argentinaphone'],
                    ['Argentina', 'land',   507, 1803728,  2, 117,  7165.810566,   3, 154, 'Argentinaland'],
                    ['Australia', 'xylo',  7650,  139472, 69,   0, 16858.42,     184,   0, 'Australiaxylo'],
                    ['Australia', 'mink',  1284, 2342788,  1,   0, 39287.71,      53,   0, 'Australiamink']],
                   columns=['Country', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'Key'])
df1['df1BH'] = (df1['B'] - df1['H']) / 100.00
print(df1)

df2 = pd.DataFrame([['Argentina', 'xylo',   262,    4632,  0,   0,    26.12,       2,   0, 'Argentinaxylo'],
                    ['Argentina', 'phone', 6860,  155811, 48,   0,  4375.87,     202,   0, 'Argentinaphone'],
                    ['Argentina', 'land',   507, 1803728,  2, 117,  7165.810566,   3, 154, 'Argentinaland'],
                    ['Australia', 'xylo', 97650,  139472, 69,   0, 96858.42,     184,   0, 'Australiaxylo'],
                    ['Australia', 'mink',  1284, 2342788,  1,   0, 39287.71,      53,   0, 'Australiamink']],
                   columns=['Country', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'Key'])
df2['df2BH'] = (df2['B'] - df2['H']) / 100.00
print(df2)

df3 = pd.merge(df1[['Key', 'df1BH']], df2[['Key', 'df2BH']], on=['Key'], how='outer')
df3['df3diff'] = df3['df1BH'] - df3['df2BH']
print(df3)
Output:
              Key   df1BH   df2BH  df3diff
0   Argentinaxylo    2.62    2.62      0.0
1  Argentinaphone   68.60   68.60      0.0
2   Argentinaland    3.53    3.53      0.0
3   Australiaxylo   76.50  976.50   -900.0
4   Australiamink   12.84   12.84      0.0
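The question asks for the percent difference of each of the columns B-H per Key, which the single (B - H)/100 value above does not quite capture. A hedged sketch of that per-column comparison, assuming df1 and df2 as defined above (note that a 0 value in df1 makes the percent difference inf/NaN for that cell):
cols = list('BCDEFGH')
merged = df1.merge(df2, on='Key', suffixes=('_1', '_2'))
pct = pd.DataFrame({'Key': merged['Key']})
for c in cols:
    # percent difference of the df2 value relative to the df1 value for this column
    pct[c] = (merged[c + '_2'] - merged[c + '_1']) / merged[c + '_1'] * 100
# keep only keys where at least one column actually differs
pct = pct[(pct[cols] != 0).any(axis=1)]
print(pct)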

df.apply(sorted, axis=1) removes column names?

Working through Pandas Cookbook. Counting the Total Number of Flights Between Cities.
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('-----')
desired_width = 320
pd.set_option('display.width', desired_width)
pd.options.display.max_rows = 50
pd.options.display.max_columns = 14
# pd.options.display.float_format = '{:,.2f}'.format
file = "e:\\packt\\data_analysis_and_exploration_with_pandas\\section07\\data\\flights.csv"
flights = pd.read_csv(file)
print(flights.head(10))
print()
# This returns the total number of rows for each group.
flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size()
print(flights_ct.head(10))
print()
# Get the number of flights between Atlanta and Houston in both directions.
print(flights_ct.loc[[('ATL', 'IAH'), ('IAH', 'ATL')]])
print()
# Sort the origin and destination cities:
# flights_sort = flights.sort_values(by=['ORG_AIR', 'DEST_AIR'], axis=1)
flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)
print(flights_sort.head(10))
print()
# Passing just the first row.
print(sorted(flights.loc[0, ['ORG_AIR', 'DEST_AIR']]))
print()
# Once each row is independently sorted, the column names are no longer correct.
# We will rename them to something generic, then again find the total number of flights between all cities.
rename_dict = {'ORG_AIR': 'AIR1', 'DEST_AIR': 'AIR2'}
flights_sort = flights_sort.rename(columns=rename_dict)
flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()
print(flights_ct2.head(10))
print()
When I get to this line of code my output differs from the author's:
```flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)```
My output does not contain any column names. As a result, when I get to:
```flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()```
it throws a KeyError. This makes sense, as I am trying to rename columns when no column names exist.
My question is: why are the column names gone? All other output matches the author's output exactly:
Connected to pydev debugger (build 191.7141.48)
NumPy: 1.16.3
Pandas: 0.24.2
-----
MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
0 1 1 4 WN LAX SLC 1625 58.0 94.0 590 1905 65.0 0 0
1 1 1 4 UA DEN IAD 823 7.0 154.0 1452 1333 -13.0 0 0
2 1 1 4 MQ DFW VPS 1305 36.0 85.0 641 1453 35.0 0 0
3 1 1 4 AA DFW DCA 1555 7.0 126.0 1192 1935 -7.0 0 0
4 1 1 4 WN LAX MCI 1720 48.0 166.0 1363 2225 39.0 0 0
5 1 1 4 UA IAH SAN 1450 1.0 178.0 1303 1620 -14.0 0 0
6 1 1 4 AA DFW MSY 1250 84.0 64.0 447 1410 83.0 0 0
7 1 1 4 F9 SFO PHX 1020 -7.0 91.0 651 1315 -6.0 0 0
8 1 1 4 AA ORD STL 1845 -5.0 44.0 258 1950 -5.0 0 0
9 1 1 4 UA IAH SJC 925 3.0 215.0 1608 1136 -14.0 0 0
ORG_AIR DEST_AIR
ATL ABE 31
ABQ 16
ABY 19
ACY 6
AEX 40
AGS 83
ALB 33
ANC 2
ASE 1
ATW 10
dtype: int64
ORG_AIR DEST_AIR
ATL IAH 121
IAH ATL 148
dtype: int64
*** No column names *** Why?
0 [LAX, SLC]
1 [DEN, IAD]
2 [DFW, VPS]
3 [DCA, DFW]
4 [LAX, MCI]
5 [IAH, SAN]
6 [DFW, MSY]
7 [PHX, SFO]
8 [ORD, STL]
9 [IAH, SJC]
dtype: object
The author's output. Note the column names are present.
sorted returns a list object and obliterates the columns:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df.apply(sorted, axis=1)
Out[12]:
0 [1, 2]
1 [3, 4]
dtype: object
In [13]: type(df.apply(sorted, axis=1).iloc[0])
Out[13]: list
It's possible that this wouldn't have been the case in earlier pandas... but it would still be bad code.
You can do this by passing the columns explicitly:
In [14]: df.apply(lambda x: pd.Series(sorted(x), df.columns), axis=1)
Out[14]:
A B
0 1 2
1 3 4
A more efficient way to do this is to sort the underlying numpy array:
In [21]: df = pd.DataFrame([[1, 2], [3, 1]], columns=["A", "B"])
In [22]: df
Out[22]:
A B
0 1 2
1 3 1
In [23]: arr = df[["A", "B"]].values
In [24]: arr.sort(axis=1)
In [25]: df[["A", "B"]] = arr
In [26]: df
Out[26]:
A B
0 1 2
1 1 3
As you can see this sorts each row.
A final note. I just applied @AndyHayden's numpy-based solution from above.
flights_sort = flights[["ORG_AIR", "DEST_AIR"]].values
flights_sort.sort(axis=1)
flights[["ORG_AIR", "DEST_AIR"]] = flights_sort
All I can say is ... Wow. What an enormous performance difference. I get the exact same correct answer, and I get it as soon as I click the mouse, compared to the pandas lambda solution (also provided by @AndyHayden), which takes about 20 seconds to perform the sort. The dataset is 58,000+ rows; the numpy solution returns the sort instantly.
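For reference, the same row-wise sort can also be written in one step with np.sort, which returns a sorted copy along the given axis; this is only a condensed variant of the numpy approach above, not a different method:
flights[['ORG_AIR', 'DEST_AIR']] = np.sort(flights[['ORG_AIR', 'DEST_AIR']].values, axis=1)
flights_ct2 = flights.groupby(['ORG_AIR', 'DEST_AIR']).size()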

When does .str.count('\w') work and when doesn't it?

This is a follow up question to Regex inside findall vs regex inside count
.str.count('\w') works for me when called on the column of a dataframe, but not when called on a Series.
X_train[0:7] is a Series:
872 I'll text you when I drop x off
831 Hi mate its RV did u hav a nice hol just a mes...
1273 network operator. The service is free. For T &...
3314 FREE MESSAGE Activate your 500 FREE Text Messa...
4929 Hi, the SEXYCHAT girls are waiting for you to ...
4249 How much for an eighth?
3640 You can stop further club tones by replying \S...
Name: text, dtype: object
X_train[0:7].str.count('\w')
returns
872 0
831 0
1273 0
3314 0
4929 0
4249 0
3640 1
Name: text, dtype: int64
When called on the same Series, converted into a dataframe column:
d = X_train[0:7]
df = pd.DataFrame(data=d)
df['col1'].str.count('\w') returns:
872 23
831 101
1273 50
3314 120
4929 98
4249 18
3640 98
Name: col1, dtype: int64
Why does it work on a dataframe column, but not on a series? Grateful for your advice.
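(No answer is recorded for this question here. As a hedged sanity check only: on a freshly built Series, .str.count(r'\w') returns the same counts whether it is called on the Series or on a DataFrame column constructed from it, which suggests the discrepancy above comes from the data in X_train rather than from Series vs. DataFrame as such.)
import pandas as pd
s = pd.Series(["I'll text you when I drop x off", "How much for an eighth?"])
print(s.str.count(r'\w'))                                   # word-character counts
print(pd.DataFrame({'col1': s})['col1'].str.count(r'\w'))   # identical counts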

Setting tab delimiter on just one column

I have a csv file that looks like this when read in as a pandas dataframe:
OBJECTID_1 AP_CODE
0 857720 137\t62\t005\tNE
1 857721 137\t62\t004\tNW
2 857724 137\t62\t004\tNE
3 857726 137\t62\t003\tNE
4 857728 137\t62\t003\tNW
5 857729 137\t62\t002\tNW
df.info() returns this:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9313 entries, 0 to 9312
Data columns (total 2 columns):
OBJECTID_1 9312 non-null float64
AP_CODE 9313 non-null object
dtypes: float64(1), object(1)
memory usage: 181.9+ KB
None
and print(repr(open(r'P:\file.csv').read(100)))
returns this:
'OBJECTID_1,AP_CODE\n857720,"137\t62\t005\tNE"\n857721,"137\t62\t004\tNW"\n857724,"137\t62\t004\tNE"\n857726,"137\t'
I want to get rid of the \t in the column AP_CODE but I can't figure out why it is even there, or how to remove it. .replace doesn't work.
To replace those \t sequences, you need to use a raw string by prefixing your string literal with r:
In [299]: df.AP_CODE.str.replace(r'\\t',' ')
Out[299]:
0 137 62 005 NE
1 137 62 004 NW
2 137 62 004 NE
3 137 62 003 NE
4 137 62 003 NW
5 137 62 002 NW
Name: AP_CODE, dtype: object
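If the underlying goal is to treat AP_CODE as tab-delimited and split it into its own columns, a small sketch with str.split and expand=True; this assumes the cells hold actual tab characters, as the repr of the raw file suggests (if they instead hold the literal two-character sequence \t, split on r'\\t' instead), and the new column names here are hypothetical:
parts = df['AP_CODE'].str.split('\t', expand=True)
parts.columns = ['AP1', 'AP2', 'AP3', 'AP4']   # hypothetical names for the four pieces
df = df.join(parts)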
