I have a csv file that looks like this when read in as a pandas dataframe:
OBJECTID_1 AP_CODE
0 857720 137\t62\t005\tNE
1 857721 137\t62\t004\tNW
2 857724 137\t62\t004\tNE
3 857726 137\t62\t003\tNE
4 857728 137\t62\t003\tNW
5 857729 137\t62\t002\tNW
df.info() returns this:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9313 entries, 0 to 9312
Data columns (total 2 columns):
OBJECTID_1 9312 non-null float64
AP_CODE 9313 non-null object
dtypes: float64(1), object(1)
memory usage: 181.9+ KB
None
and print(repr(open(r'P:\file.csv').read(100)))
returns this:
'OBJECTID_1,AP_CODE\n857720,"137\t62\t005\tNE"\n857721,"137\t62\t004\tNW"\n857724,"137\t62\t004\tNE"\n857726,"137\t'
I want to get rid of the \t in the column AP_CODE but I can't figure out why it is even there, or how to remove it. .replace doesn't work.
The repr output shows that the cells contain actual tab characters (repr renders a tab as \t). DataFrame.replace only matches whole cell values by default, which is why your .replace call appeared to do nothing; use the string accessor on the column instead:
In [299]: df.AP_CODE.str.replace('\t', ' ')
Out[299]:
0 137 62 005 NE
1 137 62 004 NW
2 137 62 004 NE
3 137 62 003 NE
4 137 62 003 NW
5 137 62 002 NW
Name: AP_CODE, dtype: object
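To persist the change, assign the result back to the column; in recent pandas versions you can also pass regex=False, since this is a plain substring replacement (a minimal sketch):
df['AP_CODE'] = df['AP_CODE'].str.replace('\t', ' ', regex=False)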
I want to read a plaintext file using pandas.
I have entries without delimiters and with different widths like this:
59967Y98Doe John 6211100004545SO20140314- 00024278
N0546664SCHMIDT-PETER 7441100008300AW20140314- 00023643
G4894jmhTAKLONSKY-JUERGEN 4211100005000TB20140315 00023882
34875738PODESBERG-SCHUMPERTS6211100003671SO20140315 00024622
Positions 1-8 are a string.
Positions 9-28 are a string.
Positions 29-31 are numeric.
Positions 32-34 are numeric.
Positions 35-41 are numeric.
Positions 42-43 are a string.
Positions 44-51 are a date (yyyyMMdd).
Position 52 is a minus sign or a blank.
The rest is a currency amount without a decimal point (the last two digits are the cents). For example: - 00024278 = -242.78 €
I know there is pd.read_fwf. It has a widths argument, so I could do this:
pd.read_fwf(StringIO(txt), widths=[8], header="Personal Nr.")
But how could I read my file with different column widths?
As the s in widths suggests, you can pass a list of widths:
pd.read_fwf(io.StringIO(txt), widths=[8,20,3,3,7,2,8,1,99], header=None)
output:
0 1 2 3 4 5 6 7 8
0 59967Y98 Doe John 621 110 4545 SO 20140314 - 24278
1 N0546664 SCHMIDT-PETER 744 110 8300 AW 20140314 - 23643
2 G4894jmh TAKLONSKY-JUERGEN 421 110 5000 TB 20140315 NaN 23882
3 34875738 PODESBERG-SCHUMPERTS 621 110 3671 SO 20140315 NaN 24622
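Alternatively, read_fwf accepts colspecs, which maps more directly onto the position list in the question; a sketch translating the 1-based inclusive positions into 0-based half-open intervals (the final stop of 99 is an assumption, mirroring the catch-all width above):
import io
import pandas as pd

# 1-based inclusive positions from the question as 0-based half-open intervals
colspecs = [(0, 8), (8, 28), (28, 31), (31, 34), (34, 41),
            (41, 43), (43, 51), (51, 52), (52, 99)]  # 99 is a catch-all, an assumption
df = pd.read_fwf(io.StringIO(txt), colspecs=colspecs, header=None)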
If you want names and dtypes (note the keyword is dtype, which takes a mapping, not dtypes):
df = (pd.read_fwf(io.StringIO(txt), widths=[8,20,3,3,7,2,8,1,99], header=None,
                  names=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
                  dtype={'A': str, 'B': str, 'C': int, 'D': int, 'E': int,
                         'F': str, 'G': str, 'H': str, 'I': int})
      .assign(G=lambda d: pd.to_datetime(d['G'], format='%Y%m%d'))
     )
output:
A B C D E F G H I
0 59967Y98 Doe John 621 110 4545 SO 2014-03-14 - 24278
1 N0546664 SCHMIDT-PETER 744 110 8300 AW 2014-03-14 - 23643
2 G4894jmh TAKLONSKY-JUERGEN 421 110 5000 TB 2014-03-15 NaN 23882
3 34875738 PODESBERG-SCHUMPERTS 621 110 3671 SO 2014-03-15 NaN 24622
df.dtypes
A object
B object
C int64
D int64
E int64
F object
G datetime64[ns]
H object
I int64
dtype: object
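Per the field description in the question, column H carries the sign and column I the amount in cents, so the two could be combined into a signed euro amount; a minimal sketch (column names as above):
import numpy as np

# H is '-' for negative rows (NaN otherwise); I is the amount in cents
df['amount'] = np.where(df['H'] == '-', -1, 1) * df['I'] / 100
# e.g. H = '-', I = 24278  ->  -242.78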
I have two arrays along axis=0 (they are the result of the mean and the std of a df):
df_cats =
0 58.609619
1 105.926514
2 76.706543
3 75.405762
4 68.937744
...
75 113.124268
76 125.557373
77 130.514893
78 141.373779
79 109.185791
Length: 80, dtype: float64
0      63.540835
1 55.053429
2 96.221076
3 42.963771
4 57.447924
...
75 42.080755
76 55.309517
77 38.997856
78 57.364695
79 40.197461
Length: 80, dtype: float64
df_dogs =
0 86.870361
1 153.085205
2 89.576416
3 139.721924
4 107.218750
...
75 129.498291
76 108.676025
77 113.125732
78 145.829346
79 100.272461
Length: 80, dtype: float64
0      57.699218
1 71.814790
2 40.130439
3 44.966932
4 48.964512
...
75 50.994298
76 58.257198
77 89.240987
78 58.945353
79 68.841721
Length: 80, dtype: float64
And I'm trying to concatenate the two arrays along axis=1, using this code:
dogs_and_cats = np.concatenate((df_dogs, df_cats), axis=1)
but I always get this error:
ValueError: zero-dimensional arrays cannot be concatenated
How can I concatenate them?
One-dimensional arrays don't have a second dimension, which is where your problem comes from. You can concatenate by reshaping them into 2D arrays with a single column:
df_dogs = df_dogs.reshape(-1, 1)
df_cats = df_cats.reshape(-1, 1)
And now, you can concatenate your arrays.
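Alternatively, numpy can do the reshaping for you: np.column_stack treats each 1-D input as a column, so no manual shape change is needed:
import numpy as np

# stacks the two length-80 arrays side by side into a (80, 2) array
dogs_and_cats = np.column_stack((df_dogs, df_cats))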
I have two dataframes
results:
0 2211 E Winston Rd Ste B, 92806, CA 33.814547 -117.886028 4
1 P.O. Box 5601, 29304, SC 34.945855 -81.930035 6
2 4113 Darius Dr, 17025, PA 40.287768 -76.967292 8
acctypeDF:
0 rooftop
1 place
2 rooftop
I wanted to combine both these dataframes into one, so I did:
import pandas as pd
resultsfinal = pd.concat([results, acctypeDF], axis=1)
But the output is:
resultsfinal
Out[155]:
0 1 2 3 0
0 2211 E Winston Rd Ste B, 92806, CA 33.814547 -117.886028 4 rooftop
1 P.O. Box 5601, 29304, SC 34.945855 -81.930035 6 place
2 4113 Darius Dr, 17025, PA 40.287768 -76.967292 8 rooftop
As you can see, the output repeats the column label 0. Why does this happen? My objective is to drop the first column (the one with addresses), but I am getting this error:
resultsfinal.drop(columns='0')
raise KeyError('{} not found in axis'.format(labels))
KeyError: "['0'] not found in axis"
I also tried:
resultsfinal = pd.concat([results, acctypeDF], axis=1,ignore_index=True)
resultsfinal
Out[158]:
0 1 ... 4 5
0 2211 E Winston Rd Ste B, 92806, CA 33.814547 ... rooftop rooftop
1 P.O. Box 5601, 29304, SC 34.945855 ... place place
But as you see above, even though the repeated 0 goes away, it creates a duplicate column (5).
If I do:
resultsfinal = results[results.columns[1:]]
resultsfinal
Out[161]:
1 2 ... 0 0
0 33.814547 -117.886028 ... 2211 E Winston Rd Ste B, 92806, CA rooftop
1 34.945855 -81.930035 ... P.O. Box 5601, 29304, SC place
print(resultsfinal.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
0 10 non-null object
1 10 non-null float64
2 10 non-null float64
3 10 non-null int64
4 10 non-null object
dtypes: float64(2), int64(1), object(2)
memory usage: 480.0+ bytes
Use ignore_index=True:
resultsfinal = pd.concat([results, acctypeDF], axis=1, ignore_index=True)
or renumber the columns after concatenating:
resultsfinal = pd.concat([results, acctypeDF], axis=1)
resultsfinal.columns = range(len(resultsfinal.columns))
print(resultsfinal)
Your KeyError comes from drop(columns='0'): the column labels are integers, so the string '0' is not found; use resultsfinal.drop(columns=0), or slice the first column away:
resultsfinal[resultsfinal.columns[1:]]
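A minimal end-to-end sketch, using small frames built from the rows shown in the question:
import pandas as pd

results = pd.DataFrame([['2211 E Winston Rd Ste B, 92806, CA', 33.814547, -117.886028, 4],
                        ['P.O. Box 5601, 29304, SC', 34.945855, -81.930035, 6]])
acctypeDF = pd.DataFrame(['rooftop', 'place'])

resultsfinal = pd.concat([results, acctypeDF], axis=1, ignore_index=True)
resultsfinal = resultsfinal.drop(columns=0)  # integer label, so no quotes
print(resultsfinal)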
This is a follow up question to Regex inside findall vs regex inside count
.str.count('\w') works for me when called on the column of a dataframe, but not when called on a Series.
X_train[0:7] is a Series:
872 I'll text you when I drop x off
831 Hi mate its RV did u hav a nice hol just a mes...
1273 network operator. The service is free. For T &...
3314 FREE MESSAGE Activate your 500 FREE Text Messa...
4929 Hi, the SEXYCHAT girls are waiting for you to ...
4249 How much for an eighth?
3640 You can stop further club tones by replying \S...
Name: text, dtype: object
X_train[0:7].str.count('\w')
returns
872 0
831 0
1273 0
3314 0
4929 0
4249 0
3640 1
Name: text, dtype: int64
When called on the same Series, converted into a dataframe column:
d = X_train[0:7]
df = pd.DataFrame(data=d)
df['col1'].str.count('\w') returns:
872 23
831 101
1273 50
3314 120
4929 98
4249 18
3640 98
Name: col1, dtype: int64
Why does it work on a dataframe column, but not on a series? Grateful for your advice.
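One way to narrow this down would be to confirm that the Series and the DataFrame column really hold identical values and dtypes before counting; a diagnostic sketch (names as in the question):
# if this prints False, the two objects don't contain the same data
print(X_train[0:7].equals(df['col1']))
print(X_train[0:7].dtype, df['col1'].dtype)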
I am putting my data into a bokeh heat map layout, but am getting a KeyError: '1'. It occurs right at the line num_calls = pivot_table[m][y]; does anybody know why this would be?
The pivot table I am using is below:
pivot_table.head()
Out[101]:
Month          1    2    3    4    5    6    7    8    9   10   11   12
CompanyName
Company 1    182  270  278  314  180  152  110  127  129  117  127   81
Company 2    163  147  192  142  186  231  214  130  112  117   93  101
Company 3    126   88   99  139   97   97   96   37   79  116  111   95
Company 4     84   89   71   95   80   89   83   88  104   93   78   64
Company 5     91   96   94   66   81   77   87   83   68   83   95   65
Below is the section of code leading up to the error:
pivot_table = pivot_table.reset_index()
pivot_table['CompanyName'] = [str(x) for x in pivot_table['CompanyName']]
Companies = list(pivot_table['CompanyName'])
months = ["1","2","3","4","5","6","7","8","9","10","11","12"]
pivot_table = pivot_table.set_index('CompanyName')
# this is the colormap from the original plot
colors = ["#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2", "#dfccce",
"#ddb7b1", "#cc7878", "#933b41", "#550b1d" ]
# Set up the data for plotting. We will need to have values for every
# pair of company/month names. Map the rate to a color.
month = []
company = []
color = []
rate = []
for y in Companies:
    for m in months:
        month.append(m)
        company.append(y)
        num_calls = pivot_table[m][y]
        rate.append(num_calls)
        color.append(colors[min(int(num_calls)-2, 8)])
and upon request:
pivot_table.info()
<class 'pandas.core.frame.DataFrame'>
Index: 46 entries, Company1 to LastCompany
Data columns (total 12 columns):
1.0 46 non-null float64
2.0 46 non-null float64
3.0 46 non-null float64
4.0 46 non-null float64
5.0 46 non-null float64
6.0 46 non-null float64
7.0 46 non-null float64
8.0 46 non-null float64
9.0 46 non-null float64
10.0 46 non-null float64
11.0 46 non-null float64
12.0 46 non-null float64
dtypes: float64(12)
memory usage: 4.5+ KB
and
pivot_table.columns
Out[103]: Index([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0], dtype='object')
Also the bokeh code is here: http://docs.bokeh.org/en/latest/docs/gallery/unemployment.html
I've tried the following code and it works on my PC. I use .loc to avoid a potential KeyError.
import pandas as pd
import numpy as np
# just following your previous post to simulate your data
np.random.seed(0)
dates = np.random.choice(pd.date_range('2015-01-01 00:00:00', '2015-06-30 00:00:00', freq='1h'), 10000)
company = np.random.choice(['company' + x for x in '1 2 3 4 5'.split()], 10000)
df = pd.DataFrame(dict(recvd_dttm=dates, CompanyName=company)).set_index('recvd_dttm').sort_index()
df['C'] = 1
df.columns = ['CompanyName', '']
result = df.groupby([lambda idx: idx.month, 'CompanyName']).agg({df.columns[1]: sum}).reset_index()
result.columns = ['Month', 'CompanyName', 'counts']
pivot_table = result.pivot(index='CompanyName', columns='Month', values='counts')
colors = ["#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2", "#dfccce",
"#ddb7b1", "#cc7878", "#933b41", "#550b1d" ]
month = []
company = []
color = []
rate = []
for y in pivot_table.index:
    for m in pivot_table.columns:
        month.append(m)
        company.append(y)
        num_calls = pivot_table.loc[y, m]
        rate.append(num_calls)
        color.append(colors[min(int(num_calls)-2, 8)])
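Judging from the info() output in the question, the root cause is that the pivot table's column labels are floats (1.0 through 12.0) while the loop indexes with the strings "1" through "12", hence the KeyError: '1'. A minimal sketch normalizing the labels before the original loop:
# column labels are floats (1.0 ... 12.0) but the loop uses strings ("1" ... "12");
# converting them to int (or looping over pivot_table.columns) avoids the KeyError
pivot_table.columns = pivot_table.columns.astype(int)
months = list(pivot_table.columns)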
Try changing the loop to
for m in pivot_table.columns:
It seems you can achieve the same thing without any loops, though. You're looping through the row index and column index to access each entry individually and appending them to a list, so rate is just a list of all the elements in the data frame. You can achieve that with
rate = pivot_table.stack().astype(int).tolist()
color = [colors[min(x - 2, 8)] for x in rate]
Am I missing something here?