I have a data frame like this:
ReviewDate_month,ProductId,Reviewer
01,185,185
02,155,155
03,130,130
04,111,111
05,110,110
06,98,98
07,101,92
08,71,71
09,73,73
10,76,76
11,105,105
12,189,189
I want to plot it: ReviewDate_month on X, and ideally both ProductId and Reviewer on Y. But I will start with one line, either ProductId or Reviewer.
So I tried:
df_no_monthlycount.plot.line
Got the error message below:
File "C:/Users/user/PycharmProjects/Assign2/Main.py", line 59, in <module>
df_no_monthlycount.plot.line
AttributeError: 'function' object has no attribute 'line'
Process finished with exit code 1
I also tried this:
df_no_monthlycount.plot(x=df_helful_monthlymean['ReviewDate_month'],y=df_helful_monthlymean['ProductId'],style='o')
The error message was like this:
Traceback (most recent call last):
File "C:/Users/user/PycharmProjects/Assign2/Main.py", line 52, in <module>
df_no_monthlycount.plot(x=df_helful_monthlymean['ReviewDate_month'],y=df_helful_monthlymean['ProductId'],style='o')
File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 1797, in __getitem__
return self._getitem_column(key)
File "C:\Python34\lib\site-packages\pandas\core\frame.py", line 1804, in _getitem_column
return self._get_item_cache(key)
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 1084, in _get_item_cache
values = self._data.get(item)
File "C:\Python34\lib\site-packages\pandas\core\internals.py", line 2851, in get
loc = self.items.get_loc(item)
File "C:\Python34\lib\site-packages\pandas\core\index.py", line 1572, in get_loc
return self._engine.get_loc(_values_from_object(key))
File "pandas\index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas\index.c:3838)
File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:3718)
File "pandas\hashtable.pyx", line 686, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12294)
File "pandas\hashtable.pyx", line 694, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12245)
KeyError: 'ReviewDate_month'
Two things are happening here. The AttributeError means your pandas version predates the df.plot.line accessor syntax (added in pandas 0.17), so use the kind='line' form instead. The KeyError means 'ReviewDate_month' is not a column of df_helful_monthlymean (if that frame came from a groupby, the month is likely sitting in its index); note also that x and y expect column names, not Series. Call the plot as shown below:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
print(df)
df.plot(x='ReviewDate_month', y=['ProductId', 'Reviewer'], kind='line')
plt.show()
Will give you a line chart with both ProductId and Reviewer plotted against ReviewDate_month.
If you want to plot ReviewDate_month on X with ProductId and Reviewer on Y using your own dataframe, you can do it this way:
df_no_monthlycount.plot(x='ReviewDate_month', y=['ProductId', 'Reviewer'])
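If you prefer driving matplotlib directly, this sketch is equivalent to the pandas call above (it assumes the same hypothetical data.csv and the column names from your sample):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')  # same hypothetical file name as in the answer above

# One plt.plot call per series; both share ReviewDate_month on the x axis.
plt.plot(df['ReviewDate_month'], df['ProductId'], label='ProductId')
plt.plot(df['ReviewDate_month'], df['Reviewer'], label='Reviewer')
plt.xlabel('ReviewDate_month')
plt.legend()
plt.show()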
I want to remove any players who didn't have over 1000 MP (minutes played).
I could easily write:
league_stats= pd.read_csv("1996.csv")
league_stats = league_stats.drop("Player-additional", axis=1)
league_stats_1000 = league_stats[league_stats['MP'] > 1000]
However, because players sometimes play for multiple teams in a year, this code doesn't account for that.
For example, Sam Cassell has four entries and none are above 1000 MP, but in total his MP for the season was over 1000. By running the above code I remove him from the new dataframe.
I am wondering if there is a way to group the rows by matching rank (the Rk column gives players who played on different teams the same rank number for each team they played on) and then keep a player only if the total of their MP is >= 1000.
This is the page I got the data from: 1996-1997 season.
Above the data table and to the left of the blue check box there is a dropdown menu called "Share and Export". From there I clicked on "Get table as CSV (for Excel)". After that I saved the CSV in a text editor and changed the file extension to .csv to upload it to Jupyter Notebook.
This is a solution I came up with:
url = 'https://www.basketball-reference.com/leagues/NBA_1997_totals.html'
df = pd.read_html(url)[0]
tot_df = df.loc[df['Tm'] == 'TOT']         # combined-season (TOT) rows for traded players
mp_1000 = tot_df.loc[tot_df["MP"] < 1000]  # traded players whose season total is under 1000 MP
# Create a list of indexes for the duplicate per-team rows to be removed;
# we have the TOT rows and don't need these.
# *** For the record, I came up with this list by manually going through the data.
indexes_to_remove = [5,6,24, 25, 66, 67, 248, 249, 447, 448, 449, 275, 276, 277, 19, 20, 21, 377, 378, 477, 478, 479,
54, 55, 451, 452, 337, 338, 156, 157, 73, 74, 546, 547, 435, 436, 437, 142, 143, 421, 42, 43, 232,
233, 571, 572, 363, 364, 531, 532, 201, 202, 111, 112, 139, 140, 307, 308, 557, 558, 93, 94, 512,
513, 206, 207, 208, 250, 259, 286, 287, 367, 368, 271, 272, 102, 103, 34, 35, 457, 458, 190, 191,
372, 373, 165, 166
]
df_drop_tot = df.drop(labels=indexes_to_remove, axis=0)
df_drop_tot
First off, no need to manually download the csv and then read it into pandas. You can load in the table using pandas' .read_html().
And yes, you can simply get the list of ranks, player names, or whatever, that have greater than 1000 MP, then use that list to filter the dataframe.
import pandas as pd
url = 'https://www.basketball-reference.com/leagues/NBA_1997_totals.html'
df = pd.read_html(url)[0]
df = df[df['Rk'].ne('Rk')]       # drop the repeated header rows embedded in the table
df['MP'] = df['MP'].astype(int)
# Convert the "Rk" column of qualifying rows into a list of ranks with >= 1000 MP,
# then keep only the rows whose "Rk" appears in that list.
players_1000_rk_list = list(df[df['MP'] >= 1000]['Rk'])
players_df = df[df['Rk'].isin(players_1000_rk_list)]
print(players_df)
Output, filtered down from 574 rows to 282:
Rk Player Pos Age Tm G ... AST STL BLK TOV PF PTS
0 1 Mahmoud Abdul-Rauf PG 27 SAC 75 ... 189 56 6 119 174 1031
1 2 Shareef Abdur-Rahim PF 20 VAN 80 ... 175 79 79 225 199 1494
3 4 Cory Alexander PG 23 SAS 80 ... 254 82 16 146 148 577
7 6 Ray Allen* SG 21 MIL 82 ... 210 75 10 149 218 1102
10 9 Greg Anderson C 32 SAS 82 ... 34 63 67 73 225 322
.. ... ... .. .. ... .. ... ... ... .. ... ... ...
581 430 Walt Williams SF 26 TOR 73 ... 197 97 62 174 282 1199
582 431 Corliss Williamson SF 23 SAC 79 ... 124 60 49 157 263 915
583 432 Kevin Willis PF 34 HOU 75 ... 71 42 32 119 216 842
589 438 Lorenzen Wright C 21 LAC 77 ... 49 48 60 79 211 561
590 439 Sharone Wright C 24 TOR 60 ... 28 15 50 93 146 390
[282 rows x 30 columns]
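An alternative, if you want the season totals explicitly rather than relying on the TOT rows, is to sum MP across each player's team stints with groupby. A sketch assuming the same table layout as above (per-team rows plus a TOT row for traded players; the TOT rows are excluded from the sum to avoid double counting):
import pandas as pd

url = 'https://www.basketball-reference.com/leagues/NBA_1997_totals.html'
df = pd.read_html(url)[0]
df = df[df['Rk'].ne('Rk')]        # drop repeated header rows
df['MP'] = df['MP'].astype(int)

per_team = df[df['Tm'] != 'TOT']  # exclude TOT rows so minutes aren't double counted
season_mp = per_team.groupby('Rk')['MP'].sum()
qualified = season_mp[season_mp >= 1000].index
players_df = df[df['Rk'].isin(qualified)]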
I have 2 dataframes (df1 and df2) which look like:
df1
Quarter Body Total requests Requests Processed … Requests on-hold
Q3 2019 A 93 92 … 0
Q3 2019 B 228 210 … 0
Q3 2019 C 180 178 … 0
Q3 2019 D 31 31 … 0
Q3 2019 E 555 483 … 0
df2
Quarter Body Total requests Requests Processed … Requests on-hold
Q2 2019 A 50 50 … 0
Q2 2019 B 191 177 … 0
Q2 2019 C 186 185 … 0
Q2 2019 D 35 35 … 0
Q2 2019 E 344 297 … 0
I am trying to append df2 onto df1 to create df3:
df3
Quarter Body Total requests Requests Processed … Requests on-hold
Q3 2019 A 93 92 … 0
Q3 2019 B 228 210 … 0
Q3 2019 C 180 178 … 0
Q3 2019 D 31 31 … 0
Q3 2019 E 555 483 … 0
Q2 2019 A 50 50 … 0
Q2 2019 B 191 177 … 0
Q2 2019 C 186 185 … 0
Q2 2019 D 35 35 … 0
Q2 2019 E 344 297 … 0
using:
df3= df1.append(df2)
but get the error:
AttributeError: 'NoneType' object has no attribute 'is_extension'
the full error trace is:
File "<ipython-input-405-e3e0e047dbc0>", line 1, in <module>
runfile('C:/2019_Q3/Code.py', wdir='C:/2019_Q3')
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
execfile(filename, namespace)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/2019_Q3/Code.py", line 420, in <module>
main()
File "C:/2019_Q3/Code.py", line 319, in main
df3= df1.append(df2, ignore_index=True)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\frame.py", line 6692, in append
sort=sort)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\reshape\concat.py", line 229, in concat
return op.get_result()
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\reshape\concat.py", line 426, in get_result
copy=self.copy)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\internals\managers.py", line 2056, in concatenate_block_managers
elif is_uniform_join_units(join_units):
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\internals\concat.py", line 379, in is_uniform_join_units
all(not ju.is_na or ju.block.is_extension for ju in join_units) and
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\internals\concat.py", line 379, in <genexpr>
all(not ju.is_na or ju.block.is_extension for ju in join_units) and
AttributeError: 'NoneType' object has no attribute 'is_extension'
using:
df3= pd.concat([df1, df2], ignore_index=True)
gives me an error:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
the full error trace is:
Traceback (most recent call last):
File "<ipython-input-406-e3e0e047dbc0>", line 1, in <module>
runfile('C:/2019_Q3/Code.py', wdir='C:/2019_Q3')
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
execfile(filename, namespace)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/2019_Q3/Code.py", line 421, in <module>
main()
File "C:/2019_Q3/Code.py", line 321, in main
finalCSV = pd.concat([PreviousCSVdf, df], ignore_index=True)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\reshape\concat.py", line 228, in concat
copy=copy, sort=sort)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\reshape\concat.py", line 381, in __init__
self.new_axes = self._get_new_axes()
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\reshape\concat.py", line 448, in _get_new_axes
new_axes[i] = self._get_comb_axis(i)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\reshape\concat.py", line 469, in _get_comb_axis
sort=self.sort)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\indexes\api.py", line 70, in _get_objs_combined_axis
return _get_combined_index(obs_idxes, intersect=intersect, sort=sort)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\indexes\api.py", line 117, in _get_combined_index
index = _union_indexes(indexes, sort=sort)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\indexes\api.py", line 183, in _union_indexes
result = result.union(other)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\indexes\base.py", line 2332, in union
indexer = self.get_indexer(other)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\indexes\base.py", line 2740, in get_indexer
raise InvalidIndexError('Reindexing only valid with uniquely'
Both df1 and df2 have identical numbers of columns and column names. How would I append df1 and df2?
This tends to happen when you have duplicate column names in one or both of the datasets.
Also, for general use it's easier to go with pd.concat:
pd.concat([df1, df2], ignore_index=True)  # ignore_index resets the index for you
And for the InvalidIndexError you can first drop rows with duplicated index values:
df1 = df1.loc[~df1.index.duplicated(keep='first')]
df2 = df2.loc[~df2.index.duplicated(keep='first')]
I'll make this short and sweet. I had this same issue.
The issue is not caused by duplicate column names but instead by duplicate column names with different data types.
Swapping to pd.concat will not fix this issue for you if you don't address the data types first.
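Putting both answers together, a minimal diagnostic sketch (df1 and df2 are the frames from the question):
import pandas as pd

# 1) Look for duplicate column labels in each frame.
print(df1.columns[df1.columns.duplicated()].tolist())
print(df2.columns[df2.columns.duplicated()].tolist())

# 2) With labels unique (step 1 found none), look for shared columns
#    whose dtypes disagree between the two frames.
mismatched = {col: (df1[col].dtype, df2[col].dtype)
              for col in df1.columns.intersection(df2.columns)
              if df1[col].dtype != df2[col].dtype}
print(mismatched)

# 3) After fixing any of the above, this should succeed.
df3 = pd.concat([df1, df2], ignore_index=True)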
I have a dataframe with column titles printed below:
Index(['Unnamed: 0', 'material', 'step', 'zaid', 'mass(gm)', 'activity(Ci)',
'spec.act(Ci/gm)', 'atomden(a/b-cm)', 'atom_frac', 'mass_frac'],
dtype='object')
If I try to obtain data for only, say step 16, and I perform the command:
print (df[(16 in df['step'] == 16)])
Things work as expected:
Unnamed: 0 material step zaid mass(gm) activity(Ci) spec.act(Ci/gm) atomden(a/b-cm) atom_frac mass_frac
447 447 1 16 90232 2.034000e-09 2.231000e-16 1.097000e-07 9.311000e-12 2.597000e-10 3.048000e-10
448 448 1 16 92233 2.451000e-08 2.362000e-10 9.636000e-03 1.117000e-10 3.116000e-09 3.672000e-09
449 449 1 16 92234 4.525000e-05 2.813000e-07 6.217000e-03 2.053000e-07 5.728000e-06 6.780000e-06
450 450 1 16 92235 1.640000e-01 3.544000e-07 2.161000e-06 7.408000e-04 2.067000e-02 2.457000e-02
451 451 1 16 92236 1.553000e-02 1.004000e-06 6.467000e-05 6.987000e-05 1.949000e-03 2.327000e-03
... ... ... ... ... ... ... ... ... ... ...
37781 37781 10 16 67165 5.941000e-05 0.000000e+00 0.000000e+00 1.195000e-08 3.311000e-07 2.785000e-07
37782 37782 10 16 68166 4.205000e-05 0.000000e+00 0.000000e+00 8.411000e-09 2.330000e-07 1.971000e-07
37783 37783 10 16 68167 1.804000e-05 0.000000e+00 0.000000e+00 3.586000e-09 9.934000e-08 8.457000e-08
37784 37784 10 16 68168 7.046000e-06 0.000000e+00 0.000000e+00 1.393000e-09 3.857000e-08 3.303000e-08
37785 37785 10 16 68170 7.317000e-07 0.000000e+00 0.000000e+00 1.429000e-10 3.958000e-09 3.430000e-09
However if I now want to grab data for just the zaid 92235 (which clearly exists as it is displayed in the step 16 results above), according to the command:
print (df[(92235 in df['zaid'] == 92235)])
I get the following error:
Traceback (most recent call last):
File "/Users/jack/Library/Python/3.7/lib/python/site-packages/pandas/core/indexes/base.py", line 2890, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: False
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "get_pincell_isos.py", line 57, in <module>
print (df[(92235 in df['zaid'] == 92235)])
File "/Users/jack/Library/Python/3.7/lib/python/site-packages/pandas/core/frame.py", line 2975, in __getitem__
indexer = self.columns.get_loc(key)
File "/Users/jack/Library/Python/3.7/lib/python/site-packages/pandas/core/indexes/base.py", line 2892, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: False
It apparently can't find "92235", even though I know it exists (shown above) and the data is stored as an int64, the same type as the values in "step". This is illustrated by printing all values from "step" and "zaid".
print (df['step'])
print (df['zaid'])
gives the following results:
0 0
1 0
2 0
3 0
4 0
..
37781 16
37782 16
37783 16
37784 16
37785 16
Name: step, Length: 37786, dtype: int64
0 90230
1 90231
2 90232
3 90233
4 90234
...
37781 67165
37782 68166
37783 68167
37784 68168
37785 68170
Name: zaid, Length: 37786, dtype: int64
I hope I'm missing something obvious. I've tried any number of ways to try to cross-section the 'zaid' column data and no attempts have been successful at recognizing any of the values associated with 'zaid'.
Thanks!
Try df[df['zaid'] == 92235]. The reason your first version appeared to work is Python's comparison chaining: 16 in df['step'] == 16 evaluates as (16 in df['step']) and (df['step'] == 16), and the in test checks the index, not the values. 16 happens to be a valid index label, so the expression falls through to the boolean Series and the selection works; 92235 is not an index label, so the chain short-circuits to False and df[False] raises KeyError: False. Try the code below in any IPython console:
import pandas as pd
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df = pd.DataFrame(data)
df['state'] == 'Nevada'      # boolean Series marking the matching rows
df[df['state'] == 'Nevada']  # use it as a mask to select those rows
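For reference, these are equivalent ways to write the selection on your dataframe (a sketch using the df from your question):
df.loc[df['zaid'] == 92235]                     # explicit .loc with a boolean mask
df.query('zaid == 92235')                       # query-string form of the same filter
df[(df['step'] == 16) & (df['zaid'] == 92235)]  # combine conditions with & and parentheses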
So I have this df with the first column called "Week":
0 2018-01-07
1 2018-01-14
2 2018-01-21
3 2018-01-28
4 2018-02-04
5 2018-02-11
6 2018-02-18
7 2018-02-25
8 2018-03-04
9 2018-03-11
10 2018-03-18
11 2018-03-25
12 2018-04-01
13 2018-04-08
14 2018-04-15
15 2018-04-22
16 2018-04-29
17 2018-05-06
Name: Week, dtype: object
And three other columns with different names and integers as values.
My idea is to plot these dates on the X axis and the other 3 integer columns on Y.
I've tried everything I found but nothing has worked yet...
I did:
df.set_index('Week')
df.plot()
plt.show()
Which plotted fine, but the X axis is still just floats in range(0, 17)...
I also tried:
df['Week'] = pd.to_datetime(df['Week'])
df.set_index('Week')
df.plot()
plt.show()
But I got this error:
Traceback (most recent call last):
File "C:\Users\mar\Desktop\Web Dev\PJ E\EZA.py", line 33, in <module>
df.plot()
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 2677, in __call__
sort_columns=sort_columns, **kwds)
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 1902, in plot_frame
**kwds)
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 1729, in _plot
plot_obj.generate()
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 258, in generate
self._post_plot_logic_common(ax, self.data)
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 397, in _post_plot_logic_common
self._apply_axis_properties(ax.yaxis, fontsize=self.fontsize)
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 470, in _apply_axis_properties
labels = axis.get_majorticklabels() + axis.get_minorticklabels()
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\axis.py", line 1188, in get_majorticklabels
ticks = self.get_major_ticks()
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\axis.py", line 1339, in get_major_ticks
numticks = len(self.get_major_locator()())
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\dates.py", line 1054, in __call__
self.refresh()
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\dates.py", line 1074, in refresh
dmin, dmax = self.viewlim_to_dt()
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\dates.py", line 832, in viewlim_to_dt
return num2date(vmin, self.tz), num2date(vmax, self.tz)
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\dates.py", line 441, in num2date
return _from_ordinalf(x, tz)
File "C:\Users\mar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\matplotlib\dates.py", line 256, in _from_ordinalf
dt = datetime.datetime.fromordinal(ix).replace(tzinfo=UTC)
ValueError: ordinal must be >= 1
Thanks in advance.
You can do something like the below. Note that set_index returns a new DataFrame rather than modifying df in place, so without inplace=True (or assigning the result back) your earlier call was silently discarded:
df['Week'] = pd.to_datetime(df['Week'])
df.set_index('Week', inplace=True)
df.plot()
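If the data comes from a CSV file (an assumption; 'weeks.csv' is a hypothetical name), you can also parse the dates and set the index at load time:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('weeks.csv', parse_dates=['Week'], index_col='Week')
df.plot()   # the datetime index now gives real date ticks on the x axis
plt.show()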
I'm using read_csv() to read data from an external .csv file. It's working fine. But whenever I try to find the minimum of the last column of that dataframe using np.min(...), it gives a long error, even though the same procedure works for all the other columns of the dataframe.
I'm attaching the code here.
import numpy as np
import pandas as pd
import os
data = pd.read_csv("test_data_v4.csv", sep = ",")
print(data)
The output looks like this:
LINK_CAPACITY_KBPS THROUGHPUT_KBPS HOP_COUNT PACKET_LOSS JITTER_MS \
0 25 15.0 50 0.25 20
1 20 10.5 70 0.45 3
2 17 12.0 49 0.75 7
3 18 11.0 65 0.30 11
4 14 14.0 55 0.50 33
5 15 8.0 62 0.25 31
RSSI
0 -30
1 -11
2 -26
3 -39
4 -25
5 -65
np.min(data['RSSI'])
Now the error comes:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/koushik_k/anaconda3/lib/python3.5/site-
packages/pandas/core/frame.py", line 1914, in __getitem__
return self._getitem_column(key)
File "/home/koushik_k/anaconda3/lib/python3.5/site-
packages/pandas/core/frame.py", line 1921, in _getitem_column
return self._get_item_cache(key)
File "/home/koushik_k/anaconda3/lib/python3.5/site-
packages/pandas/core/generic.py", line 1090, in _get_item_cache
values = self._data.get(item)
File "/home/koushik_k/anaconda3/lib/python3.5/site-
packages/pandas/core/internals.py", line 3102, in get
loc = self.items.get_loc(item)
File "/home/koushik_k/anaconda3/lib/python3.5/site-
packages/pandas/core/index.py", line 1692, in get_loc
return self._engine.get_loc(_values_from_object(key))
File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc
(pandas/index.c:3979)
File "pandas/index.pyx", line 157, in pandas.index.IndexEngine.get_loc
(pandas/index.c:3843)
File "pandas/hashtable.pyx", line 668, in
pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)
File "pandas/hashtable.pyx", line 676, in
pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)
KeyError: 'RSSI'
Following on DSM's comment: the last column header has most likely picked up stray whitespace from the CSV, so the actual label is something like 'RSSI ' rather than 'RSSI'. Strip the column names:
data.columns = data.columns.str.strip()
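A quick way to confirm the diagnosis, assuming the same file: print the raw column labels with repr() to expose any hidden whitespace, strip it, and take the minimum.
import pandas as pd

data = pd.read_csv("test_data_v4.csv", sep=",")
print([repr(c) for c in data.columns])   # a stray label would show up as e.g. 'RSSI '
data.columns = data.columns.str.strip()  # remove leading/trailing whitespace from headers
print(data['RSSI'].min())                # pandas' own min; np.min(data['RSSI']) works too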