Why does my output dataframe have two columns with the same index? - python

I have two dataframes
results:
0 2211 E Winston Rd Ste B, 92806, CA 33.814547 -117.886028 4
1 P.O. Box 5601, 29304, SC 34.945855 -81.930035 6
2 4113 Darius Dr, 17025, PA 40.287768 -76.967292 8
acctypeDF:
0 rooftop
1 place
2 rooftop
I wanted to combine these two dataframes into one, so I did:
import pandas as pd
resultsfinal = pd.concat([results, acctypeDF], axis=1)
But the output is:
resultsfinal
Out[155]:
0 1 2 3 0
0 2211 E Winston Rd Ste B, 92806, CA 33.814547 -117.886028 4 rooftop
1 P.O. Box 5601, 29304, SC 34.945855 -81.930035 6 place
2 4113 Darius Dr, 17025, PA 40.287768 -76.967292 8 rooftop
As you can see, the output repeats the column label 0. Why does this happen? My objective is to drop the first column (the one with the addresses), but I am getting this error:
resultsfinal.drop(columns='0')
raise KeyError('{} not found in axis'.format(labels))
KeyError: "['0'] not found in axis"
I also tried:
resultsfinal = pd.concat([results, acctypeDF], axis=1,ignore_index=True)
resultsfinal
Out[158]:
0 1 ... 4 5
0 2211 E Winston Rd Ste B, 92806, CA 33.814547 ... rooftop rooftop
1 P.O. Box 5601, 29304, SC 34.945855 ... place place
But as you can see above, even though the repeated label 0 goes away, it creates a duplicate column (5).
If I do:
resultsfinal = results[results.columns[1:]]
resultsfinal
Out[161]:
1 2 ... 0 0
0 33.814547 -117.886028 ... 2211 E Winston Rd Ste B, 92806, CA rooftop
1 34.945855 -81.930035 ... P.O. Box 5601, 29304, SC place
print(resultsfinal.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
0 10 non-null object
1 10 non-null float64
2 10 non-null float64
3 10 non-null int64
4 10 non-null object
dtypes: float64(2), int64(1), object(2)
memory usage: 480.0+ bytes

pd.concat with axis=1 keeps each frame's original column labels, so column 0 of acctypeDF ends up next to columns 0-3 of results; that is why the label 0 appears twice. The drop fails because the labels are the integer 0, not the string '0'. Use ignore_index=True to renumber the columns:
resultsfinal = pd.concat([results, acctypeDF], axis=1, ignore_index=True)
or renumber them after concatenating:
resultsfinal = pd.concat([results, acctypeDF], axis=1)
resultsfinal.columns = range(len(resultsfinal.columns))
print(resultsfinal)
Then remove the first column:
resultsfinal[resultsfinal.columns[1:]]
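A minimal runnable sketch of the fix, with toy stand-ins for the real frames:
import pandas as pd

results = pd.DataFrame([['addr a', 33.8, -117.9, 4],
                        ['addr b', 34.9, -81.9, 6]])
acctypeDF = pd.DataFrame(['rooftop', 'place'])

resultsfinal = pd.concat([results, acctypeDF], axis=1)

# Labels are integers, so drop(columns='0') raises KeyError; and because the
# label 0 now appears twice, drop(columns=0) would remove both columns.
# Renumber so every label is unique, then drop the integer 0.
resultsfinal.columns = range(len(resultsfinal.columns))
resultsfinal = resultsfinal.drop(columns=0)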


Pandas read file with no delimiter and with different column widths

I want to read a plaintext file using pandas.
I have entries without delimiters and with different widths like this:
59967Y98Doe John 6211100004545SO20140314- 00024278
N0546664SCHMIDT-PETER 7441100008300AW20140314- 00023643
G4894jmhTAKLONSKY-JUERGEN 4211100005000TB20140315 00023882
34875738PODESBERG-SCHUMPERTS6211100003671SO20140315 00024622
Characters 1-8 are a string.
Characters 9-28 are a string.
Characters 29-31 are numeric.
Characters 32-34 are numeric.
Characters 35-41 are numeric.
Characters 42-43 are a string.
Characters 44-51 are a date (yyyyMMdd).
Character 52 is a minus sign or a blank.
The rest is a currency amount without a decimal point (the last 2 digits are always after the decimal point). For example: - 00024278 = -242.78 €
I know there is pd.read_fwf
There is an argument width. I could do this:
pd.read_fwf(StringIO(txt), widths=[8], header="Personal Nr.")
But how could I read my file with different column widths?
As the s in widths suggests, you can pass a list of widths:
pd.read_fwf(io.StringIO(txt), widths=[8,20,3,3,7,2,8,1,99], header=None)
output:
0 1 2 3 4 5 6 7 8
0 59967Y98 Doe John 621 110 4545 SO 20140314 - 24278
1 N0546664 SCHMIDT-PETER 744 110 8300 AW 20140314 - 23643
2 G4894jmh TAKLONSKY-JUERGEN 421 110 5000 TB 20140315 NaN 23882
3 34875738 PODESBERG-SCHUMPERTS 621 110 3671 SO 20140315 NaN 24622
If you want names and dtypes:
df = (pd.read_fwf(io.StringIO(txt), widths=[8, 20, 3, 3, 7, 2, 8, 1, 99], header=None,
                  names=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
                  dtype=dict(zip('ABCDEFGHI', [str, str, int, int, int, str, str, str, int])))
      .assign(**{'G': lambda d: pd.to_datetime(d['G'], format='%Y%m%d')})
     )
output:
A B C D E F G H I
0 59967Y98 Doe John 621 110 4545 SO 2014-03-14 - 24278
1 N0546664 SCHMIDT-PETER 744 110 8300 AW 2014-03-14 - 23643
2 G4894jmh TAKLONSKY-JUERGEN 421 110 5000 TB 2014-03-15 NaN 23882
3 34875738 PODESBERG-SCHUMPERTS 621 110 3671 SO 2014-03-15 NaN 24622
df.dtypes
A object
B object
C int64
D int64
E int64
F object
G datetime64[ns]
H object
I int64
dtype: object
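To recover the signed currency amount the question describes (sign in column H, cents in column I), one option is the following sketch, using the hypothetical column names from above:
import numpy as np

# H is '-' for negative amounts and NaN for blanks; I holds the amount in cents.
sign = np.where(df['H'] == '-', -1, 1)
df['amount'] = sign * df['I'] / 100   # e.g. '- 00024278' becomes -242.78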

Turn 1d-indexed xarray variables into 3D coordinates

I have an xarray.Dataset that looks roughly like this:
<xarray.Dataset>
Dimensions: (index: 286720)
Coordinates:
* index (index) int64 0 1 2 3 4 ... 286716 286717 286718 286719
Data variables:
Time (index) float64 2.525 2.525 2.525 ... 9.475 9.475 9.475
ch (index) int64 1 1 1 1 1 1 1 1 1 1 ... 2 2 2 2 2 2 2 2 2 2
pixel (index) int64 1 2 3 4 5 6 ... 1020 1021 1022 1023 1024
Rough_wavelength (index) float64 2.698 2.701 2.704 ... 32.05 32.05 32.06
Count (index) int64 463 197 265 335 305 ... 285 376 278 0 278
There are only 140 unique values for the Time variable, 2 for the ch(...annel), and 1024 for the pixel value. I'd thus like to turn them into coordinates and completely drop the largely irrelevant index coordinate, something like this:
<xarray.Dataset>
Dimensions: (Time: 140, ch: 2, pixel: 1024)
Coordinates:
Time (time) float64 2.525 ... 9.475
ch (ch) int64 1 2
pixel (pixel) int64 1 2 3 4 5 6 ... 1020 1021 1022 1023 1024
Data variables:
Rough_wavelength (time, ch, pixel) float64 2.698 ... 32.06
Count (time, ch, pixel) int64 463 ... 278
Is there a way to do this using xarray? If not, what's a sane way to do this using the standard numpy stack?
Replace the index coordinate with a pd.MultiIndex, then unstack the index:
In [10]: ds.assign_coords(
    ...:     {
    ...:         "index": pd.MultiIndex.from_arrays(
    ...:             [ds.Time.values, ds.ch.values, ds.pixel.values],
    ...:             names=["Time", "ch", "pixel"],
    ...:         )
    ...:     }
    ...: ).drop_vars(["Time", "ch", "pixel"]).unstack("index")
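Recent xarray versions discourage assigning a bare pd.MultiIndex as a coordinate, so an alternative worth knowing is Dataset.set_index, which builds the MultiIndex from the 1-D variables directly. A sketch, assuming the variable names above:
ds_3d = (
    ds.set_index(index=["Time", "ch", "pixel"])  # promote the variables to MultiIndex levels
      .unstack("index")                          # expand into Time/ch/pixel dimensions
)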

How can we create a Chord Diagram with a dataframe object?

I found this generic code online.
import pandas as pd
import holoviews as hv
from holoviews import opts, dim
from bokeh.sampledata.les_mis import data
hv.extension('bokeh')
hv.output(size=200)
links = pd.DataFrame(data['links'])
print(links.head(3))
hv.Chord(links)
nodes = hv.Dataset(pd.DataFrame(data['nodes']), 'index')
nodes.data.head()
chord = hv.Chord((links, nodes)).select(value=(5, None))
chord.opts(
    opts.Chord(cmap='Category20', edge_cmap='Category20', edge_color=dim('source').str(),
               labels='name', node_color=dim('index').str()))
That makes this, which looks nice.
[chord diagram rendered from the Les Misérables sample data]
The sample data is sourced from here.
https://holoviews.org/reference/elements/bokeh/Chord.html
Apparently, 'links' is a pandas dataframe and 'nodes' is a holoviews dataset, and the types are like this.
<class 'pandas.core.frame.DataFrame'>
<class 'holoviews.core.data.Dataset'>
So, my question is this...how can I feed a dataframe into a Chord Diagram? Here is my sample dataframe. Also, I don't know how to incorporate the <class 'holoviews.core.data.Dataset'> into the mix.
I think your data does not match the requirements of this function; let me explain why.
The Chord function expects at least one dataset (this can be a pandas DataFrame) with three columns, in which all elements are numbers.
source target value
0 1 0 1
1 2 0 8
2 3 0 10
A second dataset is optional. This one may contain strings in its second column, to add labels for example.
index name group
0 0 a 0
1 1 b 0
2 2 c 0
Basic Example
Your given data looks like this.
Measure Country Value
0 Arrivals Greece 1590
1 Arrivals Spain 1455
2 Arrivals France 1345
3 Arrivals Iceland 1100
4 Arrivals Iceland 1850
5 Departures America 2100
6 Departures Ireland 1000
7 Departures America 950
8 Departures Ireland 1200
9 Departures Japan 1050
You can bring your data into the basic form if you replace the strings in your DataFrame df with numbers, like this:
_df = df.copy()
values = list(_df.Measure.unique()) + list(_df.Country.unique())
d = {value: i for i, value in enumerate(values)}

def str2num(s):
    return d[s]

_df.Measure = _df.Measure.apply(str2num)
_df.Country = _df.Country.apply(str2num)
>>> _df
Measure Country Value
0 0 2 1590
1 0 3 1455
2 0 4 1345
3 0 5 1100
4 0 5 1850
5 1 6 2100
6 1 7 1000
7 1 6 950
8 1 7 1200
9 1 8 1050
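As a side note, the same renumbering can be done without the helper function by using Series.map with the same dict d:
_df.Measure = df.Measure.map(d)
_df.Country = df.Country.map(d)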
Now your data matches the basic conditions and you can create a Chord diagram.
chord = hv.Chord(_df).select(value=(5, None))
chord.opts(
    opts.Chord(cmap='Category20', edge_cmap='Category20',
               edge_color=dim('Measure').str(),
               labels='Country',
               node_color=dim('index').str()))
As you can see, all the connection lines have only one of two colors. This is because the Measure column contains only two distinct elements. Therefore I think this is not what you want.
Modified Example
Let's modify your data a tiny bit:
_list = list(df.Country.values)
new_df = pd.DataFrame({'From':_list, 'To':_list[3:]+_list[:3], 'Value':df.Value})
>>> new_df
From To Value
0 Greece Iceland 1590
1 Spain Iceland 1455
2 France America 1345
3 Iceland Ireland 1100
4 Iceland America 1850
5 America Ireland 2100
6 Ireland Japan 1000
7 America Greece 950
8 Ireland Spain 1200
9 Japan France 1050
and:
node = pd.DataFrame()
for i, value in enumerate(df.Measure.unique()):
    _list = list(df[df['Measure'] == value].Country.unique())
    node = pd.concat([node, pd.DataFrame({'Name': _list, 'Group': i})], ignore_index=True)
>>> node
Name Group
0 Greece 0
1 Spain 0
2 France 0
3 Iceland 0
4 America 1
5 Ireland 1
6 Japan 1
Now we have to replace the strings in new_df again, and then we can call the Chord function again.
values = list(df.Country.unique())
d = {value: i for i, value in enumerate(values)}

def str2num(s):
    return d[s]

new_df.From = new_df.From.apply(str2num)
new_df.To = new_df.To.apply(str2num)

hv.Chord(new_df)
nodes = hv.Dataset(node, 'index')
chord = hv.Chord((new_df, nodes)).select(value=(5, None))
chord.opts(
    opts.Chord(cmap='Category20', edge_cmap='Category20', edge_color=dim('From').str(),
               labels='Name', node_color=dim('index').str()))
There are now two groups shown in the HoverTool.
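If you are running this as a script rather than in a notebook, one way to look at the result is to write it out with hv.save; a sketch, with an arbitrary filename:
import holoviews as hv

hv.save(chord, 'chord.html')  # standalone HTML file rendered with the Bokeh backend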

df.apply(sorted, axis=1) removes column names?

Working through Pandas Cookbook. Counting the Total Number of Flights Between Cities.
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('-----')
desired_width = 320
pd.set_option('display.width', desired_width)
pd.options.display.max_rows = 50
pd.options.display.max_columns = 14
# pd.options.display.float_format = '{:,.2f}'.format
file = "e:\\packt\\data_analysis_and_exploration_with_pandas\\section07\\data\\flights.csv"
flights = pd.read_csv(file)
print(flights.head(10))
print()
# This returns the total number of rows for each group.
flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size()
print(flights_ct.head(10))
print()
# Get the number of flights between Atlanta and Houston in both directions.
print(flights_ct.loc[[('ATL', 'IAH'), ('IAH', 'ATL')]])
print()
# Sort the origin and destination cities:
# flights_sort = flights.sort_values(by=['ORG_AIR', 'DEST_AIR'], axis=1)
flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)
print(flights_sort.head(10))
print()
# Passing just the first row.
print(sorted(flights.loc[0, ['ORG_AIR', 'DEST_AIR']]))
print()
# Once each row is independently sorted, the column names are no longer correct.
# We will rename them to something generic, then again find the total number of flights between all cities.
rename_dict = {'ORG_AIR': 'AIR1', 'DEST_AIR': 'AIR2'}
flights_sort = flights_sort.rename(columns=rename_dict)
flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()
print(flights_ct2.head(10))
print()
When I get to this line of code, my output differs from the author's:
```flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)```
My output does not contain any column names. As a result, when I get to:
```flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()```
it throws a KeyError. This makes sense, as I am trying to rename columns when no column names exist.
My question is, why are the column names gone? All other output matches the author's output exactly:
Connected to pydev debugger (build 191.7141.48)
NumPy: 1.16.3
Pandas: 0.24.2
-----
MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
0 1 1 4 WN LAX SLC 1625 58.0 94.0 590 1905 65.0 0 0
1 1 1 4 UA DEN IAD 823 7.0 154.0 1452 1333 -13.0 0 0
2 1 1 4 MQ DFW VPS 1305 36.0 85.0 641 1453 35.0 0 0
3 1 1 4 AA DFW DCA 1555 7.0 126.0 1192 1935 -7.0 0 0
4 1 1 4 WN LAX MCI 1720 48.0 166.0 1363 2225 39.0 0 0
5 1 1 4 UA IAH SAN 1450 1.0 178.0 1303 1620 -14.0 0 0
6 1 1 4 AA DFW MSY 1250 84.0 64.0 447 1410 83.0 0 0
7 1 1 4 F9 SFO PHX 1020 -7.0 91.0 651 1315 -6.0 0 0
8 1 1 4 AA ORD STL 1845 -5.0 44.0 258 1950 -5.0 0 0
9 1 1 4 UA IAH SJC 925 3.0 215.0 1608 1136 -14.0 0 0
ORG_AIR DEST_AIR
ATL ABE 31
ABQ 16
ABY 19
ACY 6
AEX 40
AGS 83
ALB 33
ANC 2
ASE 1
ATW 10
dtype: int64
ORG_AIR DEST_AIR
ATL IAH 121
IAH ATL 148
dtype: int64
*** No column names *** Why?
0 [LAX, SLC]
1 [DEN, IAD]
2 [DFW, VPS]
3 [DCA, DFW]
4 [LAX, MCI]
5 [IAH, SAN]
6 [DFW, MSY]
7 [PHX, SFO]
8 [ORD, STL]
9 [IAH, SJC]
dtype: object
The author's output. Note the column names are present.
sorted returns a list object and obliterates the columns:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df.apply(sorted, axis=1)
Out[12]:
0 [1, 2]
1 [3, 4]
dtype: object
In [13]: type(df.apply(sorted, axis=1).iloc[0])
Out[13]: list
It's possible that this wouldn't have been the case in earlier pandas... but it would still be bad code.
You can do this by passing the columns explicitly:
In [14]: df.apply(lambda x: pd.Series(sorted(x), df.columns), axis=1)
Out[14]:
A B
0 1 2
1 3 4
A more efficient way to do this is to sort the underlying numpy array:
In [21]: df = pd.DataFrame([[1, 2], [3, 1]], columns=["A", "B"])
In [22]: df
Out[22]:
A B
0 1 2
1 3 1
In [23]: arr = df[["A", "B"]].values
In [24]: arr.sort(axis=1)
In [25]: df[["A", "B"]] = arr
In [26]: df
Out[26]:
A B
0 1 2
1 1 3
As you can see this sorts each row.
A final note. I just applied @AndyHayden's numpy-based solution from above.
flights_sort = flights[["ORG_AIR", "DEST_AIR"]].values
flights_sort.sort(axis=1)
flights[["ORG_AIR", "DEST_AIR"]] = flights_sort
All I can say is … wow, what an enormous performance difference. I get the exact same correct answer, and I get it as soon as I click the mouse, whereas the pandas lambda solution also provided by @AndyHayden takes about 20 seconds to perform the sort. That dataset is 58,000+ rows. The numpy solution returns the sort instantly.
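For reference, the numpy variant can also be wired straight back into the original groupby without mutating flights, since np.sort returns a sorted copy. A sketch using the same column names:
import numpy as np
import pandas as pd

# Sort each origin/destination pair alphabetically, row by row.
flights_sort = pd.DataFrame(
    np.sort(flights[['ORG_AIR', 'DEST_AIR']].values, axis=1),
    columns=['AIR1', 'AIR2'],
)
flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()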

Setting tab delimiter on just one column

I have a csv file that looks like this when read in as a pandas dataframe:
OBJECTID_1 AP_CODE
0 857720 137\t62\t005\tNE
1 857721 137\t62\t004\tNW
2 857724 137\t62\t004\tNE
3 857726 137\t62\t003\tNE
4 857728 137\t62\t003\tNW
5 857729 137\t62\t002\tNW
df.info() returns this:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9313 entries, 0 to 9312
Data columns (total 2 columns):
OBJECTID_1 9312 non-null float64
AP_CODE 9313 non-null object
dtypes: float64(1), object(1)
memory usage: 181.9+ KB
None
and print(repr(open(r'P:\file.csv').read(100)))
returns this:
'OBJECTID_1,AP_CODE\n857720,"137\t62\t005\tNE"\n857721,"137\t62\t004\tNW"\n857724,"137\t62\t004\tNE"\n857726,"137\t'
I want to get rid of the \t in the column AP_CODE but I can't figure out why it is even there, or how to remove it. .replace doesn't work.
The \t sequences shown in the repr are real tab characters embedded inside the quoted CSV field, which is why they survive parsing. A plain df.replace('\t', ' ') does nothing because, without regex=True, replace only matches whole cell values; use the string accessor on the column instead:
In [299]: df.AP_CODE.str.replace('\t', ' ')
Out[299]:
0 137 62 005 NE
1 137 62 004 NW
2 137 62 004 NE
3 137 62 003 NE
4 137 62 003 NW
5 137 62 002 NW
Name: AP_CODE, dtype: object
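To keep the result, assign it back; or, if the tab-separated parts are meaningful on their own, splitting them out may be more useful. A sketch (run one branch or the other, since the split expects the original tabs):
# Overwrite the column, replacing the tabs with spaces...
df['AP_CODE'] = df.AP_CODE.str.replace('\t', ' ')

# ...or, starting from the original tab-separated values instead,
# expand the parts into their own columns.
parts = df.AP_CODE.str.split('\t', expand=True)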
