I have a dataframe of records that looks like:
'Location' 'Rec ID' 'Duration' 'Rec-X'
0 Houston 126 17 [0.2, 0.34, 0.45, ..., 0.28]
1 Chicago 126 19.3 [0.12, 0.3, 0.41, ..., 0.39]
2 Boston 348 17.3 [0.12, 0.3, 0.41, ..., 0.39]
3 Chicago 138 12.3 [0.12, 0.3, 0.41, ..., 0.39]
4 New York 238 11.3 [0.12, 0.3, 0.41, ..., 0.39]
...
500 Chicago 126 19.3 [0.12, 0.3, 0.41, ..., 0.39]
And as part of a genetic algorithm process, I want to initialize a population of 10 subsets of records. Each subset should contain 10 records, but it must NOT contain the same 'Rec ID' twice.
Any idea on how to generate those 10 different dataframes?
Thanks,
You can drop duplicates based on your column and then take the first 10 rows:
df2 = df.drop_duplicates('Rec ID')
df2.head(10)
EDIT
If you want to select 10 unique records at random, something like this will work:
def selectRandomUnique(df, n=10):
    # Sample n rows, drop any that share a 'Rec ID', and
    # resample until all n records are unique.
    d2 = df.sample(n=n).drop_duplicates('Rec ID')
    while len(d2) != n:
        d2 = df.sample(n=n).drop_duplicates('Rec ID')
    return d2
Here you first sample rows at random and then drop any duplicates that may exist; the loop resamples until a full set of unique records is found.
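To build the full population, you can then call the function once per subset; a minimal sketch on top of the function above, assuming df is the frame from the question:
# Build a population of 10 subsets, each holding 10 records with unique 'Rec ID's
population = [selectRandomUnique(df) for _ in range(10)]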
I am a bit new to Python.
I am trying to convert a dataframe to a list after changing the datatype of a particular column to integer. The funny thing is that when converted to a list, the column still shows floats.
There are three columns in the dataframe; the first two are float and I want the last to be integer, but it still comes out as float.
If I change all columns to integer, then the list is created with integers.
0 1.53 3.13 0.0
1 0.58 2.83 0.0
2 0.28 2.69 0.0
3 1.14 2.14 0.0
4 1.46 3.39 0.0
... ... ... ...
495 2.37 0.93 1.0
496 2.85 0.52 1.0
497 2.35 0.39 1.0
498 2.96 1.68 1.0
499 2.56 0.16 1.0
Above is the Dataframe.
Below, the last column is converted:
#convert last column to integer datatype
data[6] = data[6].astype(dtype='int64')
display(data.dtypes)
Below is the code converting the dataframe to a list.
#Turn DF to list
data_to_List = data.values.tolist()
data_to_List
#below is what is shown now.
[[1.53, 3.13, 0.0],
[0.58, 2.83, 0.0],
[0.28, 2.69, 0.0],
[1.14, 2.14, 0.0],
[3.54, 0.75, 1.0],
[3.04, 0.15, 1.0],
[2.49, 0.15, 1.0],
[2.27, 0.39, 1.0],
[3.65, 1.5, 1.0],
I want the last column to be just 0 or 1 and not 0.0 or 1.0
Yes, you are correct: pandas converts the int column to float when you use data.values, because .values returns a single NumPy array and the mixed int/float columns are upcast to a common float dtype.
You can convert your float to int by using the below list comprehension:
data_to_List = [[x[0], x[1], int(x[2])] for x in data.values.tolist()]
print(data_to_List)
[[1.53, 3.13, 0],
[0.58, 2.83, 0],
[0.28, 2.69, 0],
[1.14, 2.14, 0],
[1.46, 3.39, 0]]
I have a dataframe that I would like to make a strip plot out of, the array consists of the following
Symbol Avg.Sentiment Weighted Mentions Sentiment
0 AMC 0.14 0.80 557 [-0.38, -0.48, -0.27, -0.42, 0.8, -0.8, 0.13, ...
2 GME 0.15 0.26 175 [-0.27, 0.13, -0.53, 0.65, -0.91, 0.66, 0.67, ...
1 BB 0.23 0.29 126 [-0.27, 0.34, 0.8, -0.14, -0.39, 0.4, 0.34, -0...
11 SPY -0.06 -0.03 43 [0.32, -0.38, -0.54, 0.36, -0.18, 0.18, -0.33,...
4 SPCE 0.26 0.09 35 [0.65, 0.57, 0.74, 0.48, -0.54, -0.15, -0.3, -...
13 AH 0.06 0.02 33 [0.62, 0.66, -0.18, -0.62, 0.12, -0.42, -0.59,...
12 PLTR 0.16 0.05 29 [0.66, 0.36, 0.64, 0.59, -0.42, 0.65, 0.15, -0...
15 TSLA 0.13 0.03 24 [0.1, 0.38, 0.64, 0.42, -0.32, 0.32, 0.44, -0....
and so on. The number of elements in each 'Sentiment' list is the same as the number of mentions. I would like to make a strip plot with Symbol as the x axis and sentiment as the y axis. I believe the problem I'm encountering is because of the different list lengths; the actual error I'm getting is
ValueError: setting an array element with a sequence.
The code that I'm trying to use to create the strip plot is this:
def symbolSentimentVisualization(dataset):
    sns.stripplot(x='Symbol', y='Sentiment', data=dataset.loc[:9])
    plt.show()
The other part of my issue, I would guess, has something to do with numpy trying to build multidimensional arrays from rows of different lengths before they are handed to seaborn, but I'm not 100% sure about that. If the solution is to plot one row at a time and then merge the plots, that would definitely work, but I'm not sure what exactly I should call to do that, because trying it out with the following doesn't seem to work either:
def symbolSentimentVisualization(dataset):
    sns.stripplot(x=dataset['Symbol'][0], y=dataset['Sentiment'][0], data=dataset.loc[:9])
    plt.show()
IIUC, explode 'Sentiment' first, then plot:
df = df.explode('Sentiment')
ax = sns.stripplot(x="Symbol", y="Sentiment", data=df)
Sample Data:
np.random.seed(5)
df = pd.DataFrame({
    'Symbol': ['AMC', 'GME', 'BB', 'SPY', 'SPCE'],
    'Mentions': [557, 175, 126, 43, 35]
})
df['Sentiment'] = df['Mentions'].apply(lambda x: (np.random.random(x) * 2) - 1)
Symbol Mentions Sentiment
0 AMC 557 [-0.556013657820521, 0.7414646123547528, -0.58...
1 GME 175 [-0.5673003921341209, -0.6504850189478857, 0.1...
2 BB 126 [0.7771316020052821, 0.26579994709269994, -0.4...
3 SPY 43 [-0.5966607678089173, -0.4473484233894889, 0.7...
4 SPCE 35 [0.7934741289205556, 0.17613102678923398, 0.58...
Resulting graph: a strip plot of the sentiment values grouped by symbol (image omitted).
Complete Working Example with Sample Data:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
np.random.seed(5)
df = pd.DataFrame({
    'Symbol': ['AMC', 'GME', 'BB', 'SPY', 'SPCE'],
    'Mentions': [557, 175, 126, 43, 35]
})
df['Sentiment'] = df['Mentions'].apply(lambda x: (np.random.random(x) * 2) - 1)
df = df.explode('Sentiment')
ax = sns.stripplot(x="Symbol", y="Sentiment", data=df)
plt.show()
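One caveat from my side (not part of the original answer): explode leaves 'Sentiment' with object dtype, and depending on the seaborn version the y axis may then be treated as categorical. If the plot looks wrong, cast the column to float before plotting:
# explode returns an object-dtype column; make it numeric for plotting
df['Sentiment'] = df['Sentiment'].astype(float)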
I made a dataframe and set the column names using np.arange(). However, instead of exact numbers it (sometimes) sets them to numbers like 0.30000000000000004.
I tried both rounding the entire dataframe and using np.around() on the np.arange() output, but neither seems to work.
I also tried to add these at the top:
np.set_printoptions(suppress=True)
np.set_printoptions(precision=3)
Here is the return statement of my function:
stepT = 0.1
# net is some numpy array
return pd.DataFrame(net, columns=np.arange(0, 1+stepT, stepT),
                    index=np.around(np.arange(0, 1+stepS, stepS), decimals=3)).round(3)
Is there any function that will let me have these names as numbers with only one digit after the decimal point?
The apparent imprecision of floating point numbers comes up often.
In [689]: np.arange(0,1+stepT, stepT)
Out[689]: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
In [690]: _.tolist()
Out[690]:
[0.0,
0.1,
0.2,
0.30000000000000004,
0.4,
0.5,
0.6000000000000001,
0.7000000000000001,
0.8,
0.9,
1.0]
In [691]: _689[3]
Out[691]: 0.30000000000000004
The numpy print options control how arrays are displayed, but they have no effect when individual values are printed.
When I make a dataframe with this column specification, I get a nice display, because pandas uses the array formatting (_689 is ipython shorthand for the Out[689] array):
In [699]: df = pd.DataFrame(np.arange(11)[None,:], columns=_689)
In [700]: df
Out[700]:
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
0 0 1 2 3 4 5 6 7 8 9 10
In [701]: df.columns
Out[701]:
Float64Index([ 0.0, 0.1, 0.2,
0.30000000000000004, 0.4, 0.5,
0.6000000000000001, 0.7000000000000001, 0.8,
0.9, 1.0],
dtype='float64')
But selecting columns with floats like this is tricky. Some work, some don't.
In [705]: df[0.4]
Out[705]:
0 4
Name: 0.4, dtype: int64
In [707]: df[0.3]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Looks like it's doing some sort of dictionary lookup. Floats don't work well for that, because of their inherent imprecision.
Doing an equality test on the arange:
In [710]: _689[3]==0.3
Out[710]: False
In [711]: _689[4]==0.4
Out[711]: True
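If you must keep float labels, one way to sidestep exact equality is to match columns with np.isclose; get_col below is a hypothetical helper, not part of the original answer:
import numpy as np

def get_col(df, target, tol=1e-9):
    # Return the column whose float label lies within tol of target.
    matches = [c for c in df.columns if np.isclose(c, target, atol=tol)]
    if not matches:
        raise KeyError(target)
    return df[matches[0]]

get_col(df, 0.3)  # finds the 0.30000000000000004 column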
I think you should create a list of properly formatted strings from the arange and use those as column headers, not the floats themselves.
For example:
In [714]: alist = ['%.3f'%i for i in _689]
In [715]: alist
Out[715]:
['0.000',
'0.100',
'0.200',
'0.300',
'0.400',
'0.500',
'0.600',
'0.700',
'0.800',
'0.900',
'1.000']
In [716]: df = pd.DataFrame(np.arange(11)[None,:], columns=alist)
In [717]: df
Out[717]:
0.000 0.100 0.200 0.300 0.400 0.500 0.600 0.700 0.800 0.900 1.000
0 0 1 2 3 4 5 6 7 8 9 10
In [718]: df.columns
Out[718]:
Index(['0.000', '0.100', '0.200', '0.300', '0.400', '0.500', '0.600', '0.700',
'0.800', '0.900', '1.000'],
dtype='object')
In [719]: df['0.300']
Out[719]:
0 3
Name: 0.300, dtype: int64
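If you would rather keep the labels as floats, rounding the label array itself before building the frame also worked when I tried it (the question reports np.around not helping, so treat this as an assumption about where the rounding was applied): .round(3) snaps each value to the float nearest the 3-decimal literal, so the stored label and the literal 0.3 become the same float:
import numpy as np
import pandas as pd

cols = np.arange(0, 1.05, 0.1).round(3)   # labels snap to 0.0, 0.1, ..., 1.0
df2 = pd.DataFrame(np.arange(11)[None, :], columns=cols)
print(df2[0.3])   # no KeyError: the column label is now exactly the float 0.3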
I have segmented roadway data that looks like this:
import pandas as pd
input_df = pd.DataFrame({
'ROUTE': ['US9', 'US9', 'US9', 'US9', 'US9'],
'BMP': [0.0, 0.1, 0.2, 0.3, 0.4],
'EMP': [0.1, 0.2, 0.3, 0.4, 0.5],
'VALUE': [19, 19, 232, 232, 19]
})
>>> print(input_df)
BMP EMP ROUTE VALUE
0.0 0.1 US9 19
0.1 0.2 US9 19
0.2 0.3 US9 232
0.3 0.4 US9 232
0.4 0.5 US9 19
The BMP column represents the begin milepoint of this attribute along a linear referenced GIS representation of the road. The EMP is the associated end mileage. When the VALUE column is equal, I would like to combine adjacent segments.
There is a tool that does this operation in ArcGIS called Dissolve Route Events. I would like to use Pandas to complete this task. Here's the desired output:
output_df = pd.DataFrame({
'ROUTE': ['US9', 'US9', 'US9'],
'BMP': [0.0, 0.2, 0.4],
'EMP': [0.2, 0.4, 0.5],
'VALUE': [19, 232, 19]
})
>>> print(output_df)
BMP EMP ROUTE VALUE
0.0 0.2 US9 19
0.2 0.4 US9 232
0.4 0.5 US9 19
Try this! (input_df.VALUE.diff() != 0) flags the first row of each run of equal VALUEs, and cumsum() turns those flags into a run id:
input_df['trip'] = (input_df.VALUE.diff() != 0).cumsum()
output_df = input_df.groupby(['ROUTE', 'trip', 'VALUE']).agg({'BMP': 'first', 'EMP': 'last'})
output_df = output_df.reset_index()
which gives:
ROUTE trip VALUE BMP EMP
0 US9 1 19 0.0 0.2
1 US9 2 232 0.2 0.4
2 US9 3 19 0.4 0.5
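To match the desired output exactly, you can also drop the helper column afterwards (a small addition on top of the answer above):
# Remove the run-id helper once the segments are dissolved
output_df = output_df.drop(columns='trip')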
I want to make a 2d histogram (or other statistics, but let's take a histogram for the example) of a given 2d data set. The problem is that empty bins seem to be discarded altogether. For instance,
import numpy
import pandas
numpy.random.seed(35)
values = numpy.random.random((2,10000))
xbins = numpy.linspace(0, 1.2, 7)
ybins = numpy.linspace(0, 1, 6)
I can easily get the desired output with
print numpy.histogram2d(values[0], values[1], (xbins, ybins))[0]
giving
[[ 408. 373. 405. 411. 400.]
[ 390. 413. 400. 414. 368.]
[ 354. 414. 421. 400. 413.]
[ 426. 393. 407. 416. 412.]
[ 412. 397. 396. 356. 401.]
[ 0. 0. 0. 0. 0.]]
However, with pandas,
df = pandas.DataFrame({'x': values[0], 'y': values[1]})
binned = df.groupby([pandas.cut(df['x'], xbins),
pandas.cut(df['y'], ybins)])
print binned.size().unstack()
prints
y (0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1]
x
(0, 0.2] 408 373 405 411 400
(0.2, 0.4] 390 413 400 414 368
(0.4, 0.6] 354 414 421 400 413
(0.6, 0.8] 426 393 407 416 412
(0.8, 1] 412 397 396 356 401
i.e., the last row, with 1 < x <= 1.2, is missing entirely because there are no values in it. However, I would like to see it explicitly (as when using numpy.histogram2d). In this example I can use numpy just fine, but in more complicated settings (n-dimensional binning, statistics other than counts, etc.), pandas can be more efficient to code and to calculate with than numpy.
In principle I can come up with ways to check if an index is present, using something like
allkeys = [('({0}, {1}]'.format(xbins[i-1], xbins[i]),
            '({0}, {1}]'.format(ybins[j-1], ybins[j]))
           for j in xrange(1, len(ybins))
           for i in xrange(1, len(xbins))]
However, the problem is that the index formatting is not consistent, in the sense that, as you see above, the first index of binned is ['(0, 0.2]', '(0, 0.2]'] but the first entry in allkeys is ['(0.0, 0.2]', '(0.0, 0.2]'], so I cannot match allkeys to binned.viewkeys().
Any help is much appreciated.
It appears that pd.cut keeps your binning information, which means we can use it in a reindex:
In [79]: xcut = pd.cut(df['x'], xbins)
In [80]: ycut = pd.cut(df['y'], ybins)
In [81]: binned = df.groupby([xcut, ycut])
In [82]: sizes = binned.size()
In [85]: (sizes.reindex(pd.MultiIndex.from_product([xcut.cat.categories, ycut.cat.categories]))
...: .unstack()
...: .fillna(0.0))
...:
Out[85]:
(0.0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1.0]
(0.0, 0.2] 408.0 373.0 405.0 411.0 400.0
(0.2, 0.4] 390.0 413.0 400.0 414.0 368.0
(0.4, 0.6] 354.0 414.0 421.0 400.0 413.0
(0.6, 0.8] 426.0 393.0 407.0 416.0 412.0
(0.8, 1.0] 412.0 397.0 396.0 356.0 401.0
(1.0, 1.2] 0.0 0.0 0.0 0.0 0.0
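For reference, a self-contained version of the same approach (a sketch assuming Python 3 and a recent pandas, unlike the Python 2 session in the question):
import numpy as np
import pandas as pd

np.random.seed(35)
values = np.random.random((2, 10000))
xbins = np.linspace(0, 1.2, 7)
ybins = np.linspace(0, 1, 6)

df = pd.DataFrame({'x': values[0], 'y': values[1]})
xcut = pd.cut(df['x'], xbins)
ycut = pd.cut(df['y'], ybins)
sizes = df.groupby([xcut, ycut]).size()

# Reindex against the full product of bin categories so empty bins survive
full_index = pd.MultiIndex.from_product([xcut.cat.categories, ycut.cat.categories])
print(sizes.reindex(full_index).unstack().fillna(0.0))
Note that in recent pandas versions, grouping on categorical keys with observed=False (the long-standing default) already keeps unobserved bins, so the reindex step may be unnecessary there.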