I have a column whose cells are strings containing several percentages, e.g. XX: (+2, 30%); (-5, 20%); (+17, 50%).
I need to extract the highest percentage value from each such string, across the whole column.
Any advice will be highly appreciated!
Thanks
In my understanding, each cell in column XX is a string which contains some percentages. I have included a small test DataFrame:
import pandas as pd
import re
df = pd.DataFrame({"XX": ["(+2, 30%), (-5, 20%), (+17, 50%)", "(+2, 70%), (-5, 20%), (+17, 50%)", ""]})
pattern = re.compile(r"([0-9.]+)%")  # raw string avoids the invalid-escape warning
# Convert the matches to float before taking max(), otherwise strings compare
# lexicographically ("9" > "50"); default=-1 covers cells with no match
df["XX"].apply(lambda x: max(map(float, pattern.findall(x)), default=-1))
OUTPUT
0    50.0
1    70.0
2    -1.0
Name: XX, dtype: float64
This code returns the largest value in a column of percent strings:
import pandas as pd
import numpy as np

data = pd.DataFrame([['2.3%', 1], ['5.3%', 3]])
first_column = data.iloc[:, 0]

percent_list = []
for val in first_column:
    percent_list.append(float(val[:-1]))  # strip the trailing '%' and convert

print(percent_list[np.argmax(percent_list)])
I have the following code:
import pandas as pd
from pandas_datareader import data as web
import numpy as np
import math
data = web.DataReader('goog', 'yahoo')
data['lifetime'] = data['High'].asfreq('D').rolling(window=999999, min_periods=1).max()  # to check if it is a lifetime high
How can I compare the two so that I get a boolean (as 1 and 0, preferably) for each row indicating whether data['High'] is close to data['lifetime']? I tried:
data['isclose'] = math.isclose(data['High'], data['lifetime'], rel_tol=0.003)
but this raises a TypeError, since math.isclose only works on scalars.
Any help would be appreciated.
You can use np.where() together with np.isclose(), the element-wise counterpart of math.isclose():
import numpy as np

# np.isclose works element-wise on Series; rtol plays the role of rel_tol
data['isclose'] = np.where(np.isclose(data['High'], data['lifetime'], rtol=0.003), 1, 0)
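Since np.isclose() already returns a boolean array, casting it directly is an equivalent, slightly shorter form:
data['isclose'] = np.isclose(data['High'], data['lifetime'], rtol=0.003).astype(int)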
You could also use pandas' apply() function:
import math
from pandas_datareader import data as web
data = web.DataReader("goog", "yahoo")
data["lifetime"] = data["High"].asfreq("D").rolling(window=999999, min_periods=1).max()
data["isclose"] = data.apply(
lambda row: 1 if math.isclose(row["High"], row["lifetime"], rel_tol=0.003) else 0,
axis=1,
)
print(data)
However, yudhiesh's solution using np.where() is faster.
See also: Why is np.where faster than pd.apply
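If you want to check the difference on your own data, here is a rough micro-benchmark sketch (synthetic data and arbitrary sizes, not the question's actual DataFrame):
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({"High": np.random.rand(100_000)})
df["lifetime"] = df["High"].cummax()  # running all-time high as a stand-in

t_where = timeit.timeit(lambda: np.where(np.isclose(df["High"], df["lifetime"], rtol=0.003), 1, 0), number=10)
t_apply = timeit.timeit(lambda: df.apply(lambda r: 1 if np.isclose(r["High"], r["lifetime"], rtol=0.003) else 0, axis=1), number=10)
print(f"np.where: {t_where:.3f}s  apply: {t_apply:.3f}s")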
I have a dataframe like this in a .csv:
Consequence,N_samples
A,227
B,413
C,194
D,1
E,1610
F,10
G,7
H,1
I,1
J,5
K,1
L,5
M,5
N,30
O,7
P,3
And I want to make a pie chart out of it, grouping all values lower than 150 into an "Other" category. I've tried running this code, but it's not working:
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plot

other = {'Consequence': 'Other', 'N_samples': 0}
df = pd.read_csv('df.csv', sep=',')
df = df.append(other, ignore_index=True)
for i in df:
    if (x in df['N_samples']) < 150:
        df['N_samples'].iloc[-1] = df['N_samples'].iloc[-1] + (x in df['N_samples'])
        df.drop([x])
df.plot.pie(label="", title="Consequence", startangle=90);
plot.savefig('Consequence.svg')
Once I run it I get the following error:
KeyError: "['Consequence'] not found in axis"
I would really appreciate any help.
You are making it more difficult than it is.
First, get all the rows where the sample size is below 150 (note the column is named N_samples, lowercase s):
small_sizes = df[df['N_samples'] < 150]
Then sum up their values:
other_samples = small_sizes['N_samples'].sum()
Finally drop the rows and add the other row:
df = df[~df['N_Samples']<150]
df.loc['other','N_samples'] = other_samples
That should do the trick.
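Putting the steps together into one runnable snippet (assuming the df.csv from the question):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('df.csv')
other_samples = df.loc[df['N_samples'] < 150, 'N_samples'].sum()  # combined count of the small rows
df = df[df['N_samples'] >= 150]
df.loc['other'] = ['Other', other_samples]  # append the aggregate row

df.set_index('Consequence').plot.pie(y='N_samples', title='Consequence', startangle=90)
plt.show()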
You can do this as follows:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv('df.csv')
Collect the rows < 150 into a new DataFrame:
df_other=pd.DataFrame([{'Consequence':'Other','N_samples':df[df.N_samples<150].N_samples.sum()}])
Add that to the rows >= 150 and plot:
df2=df[df.N_samples>=150]
df3=pd.concat([df2,df_other],axis=0)
df3.plot.pie(y='N_samples',labels=df3['Consequence'])
plt.show()
If you find yourself iterating through a DataFrame, be aware there's often a built-in way to do whatever you're trying to do.
Define your filtering condition:
cond = df.N_samples < 150
Sum values from filtering condition:
other_sum = df.N_samples[cond].sum()
Filter by the opposite of the condition and append the 'other' row in the same line (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
df = pd.concat([df.loc[~cond], pd.DataFrame([{'Consequence': 'other', 'N_samples': other_sum}])], ignore_index=True)
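With the sample CSV from the question, the twelve small rows sum to 76, so the result is:
  Consequence  N_samples
0           A        227
1           B        413
2           C        194
3           E       1610
4       other         76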
I am trying to do the equivalent of a COUNTIF() function in Excel. I am stuck on how to tell the .count() function to read from a specific column.
I have
df = pd.read_csv('testdata.csv')
df.count('1')
but this does not work, and even if it did it is not specific enough.
I am thinking I may have to use read_csv to read specific columns individually.
Example:
Column name
4
4
3
2
4
1
The function would output that there is one '1', and I could run it again and find out that there are three '4' answers, etc.
I got it to work! Thank you
I used:
print(df.col.value_counts().loc['x'])
Here is an example of a simple 'countif' recipe you could try:
import pandas as pd
def countif(rng, criteria):
    # count the cells in rng that equal criteria, like Excel's COUNTIF
    return rng.eq(criteria).sum()
Example use:
df = pd.DataFrame({'column1': [4, 4, 3, 2, 4, 1],
                   'column2': [1, 2, 3, 4, 5, 6]})
countif(df['column1'], 1)  # -> 1
If all else fails, why not try something like this?
import numpy as np
import pandas
import matplotlib.pyplot as plt
df = pandas.DataFrame(data=np.random.randint(0, 100, size=100), columns=["col1"])
counters = {}
for i in range(len(df)):
    if df.iloc[i]["col1"] in counters:
        counters[df.iloc[i]["col1"]] += 1
    else:
        counters[df.iloc[i]["col1"]] = 1
print(counters)
plt.bar(counters.keys(), counters.values())
plt.show()
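For comparison, the built-in mentioned in the other answers produces the same tallies in one call:
counts = df["col1"].value_counts()  # same counts as the counters dict
print(counts.to_dict())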
I have a Numpy array consisting of a list of lists, representing a two-dimensional array with row labels and column names as shown below:
data = array([['','Col1','Col2'],['Row1',1,2],['Row2',3,4]])
I'd like the resulting DataFrame to have Row1 and Row2 as index values, and Col1, Col2 as header values.
I can specify the index as follows:
df = pd.DataFrame(data, index=data[:, 0])
however, I am unsure how best to assign the column headers.
You need to specify data, index and columns to DataFrame constructor, as in:
>>> pd.DataFrame(data=data[1:,1:], # values
... index=data[1:,0], # 1st column as index
... columns=data[0,1:]) # 1st row as the column names
Edit: as noted in @joris's comment, you may need to change the above to np.int_(data[1:,1:]) to get the correct data type.
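A minimal sketch of the recipe, including the dtype fix from the comment (using .astype(int), which is equivalent here):
import numpy as np
import pandas as pd

data = np.array([['', 'Col1', 'Col2'], ['Row1', 1, 2], ['Row2', 3, 4]])
df = pd.DataFrame(data=data[1:, 1:].astype(int),  # cast the values back from string to int
                  index=data[1:, 0],              # first column as the index
                  columns=data[0, 1:])            # first row as the column names
print(df)
#       Col1  Col2
# Row1     1     2
# Row2     3     4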
Here is an easy-to-understand solution:
import numpy as np
import pandas as pd
# Creating a 2-dimensional numpy array
>>> data = np.array([[5.8, 2.8], [6.0, 2.2]])
>>> data
array([[5.8, 2.8],
       [6. , 2.2]])
# Creating pandas dataframe from numpy array
>>> dataset = pd.DataFrame({'Column1': data[:, 0], 'Column2': data[:, 1]})
>>> print(dataset)
Column1 Column2
0 5.8 2.8
1 6.0 2.2
I agree with Joris; it seems like you should be doing this differently, like with numpy record arrays. Modifying "option 2" from this great answer, you could do it like this:
import pandas
import numpy
dtype = [('Col1','int32'), ('Col2','float32'), ('Col3','float32')]
values = numpy.zeros(20, dtype=dtype)
index = ['Row'+str(i) for i in range(1, len(values)+1)]
df = pandas.DataFrame(values, index=index)
This can be done simply by using pandas DataFrame's from_records:
import numpy as np
import pandas as pd
# Creating a numpy array
x = np.arange(1,10,1).reshape(-1,1)
dataframe = pd.DataFrame.from_records(x)
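By default the single column is labelled 0; from_records also accepts a columns argument if you want a name (the name below is just an example):
dataframe = pd.DataFrame.from_records(x, columns=['value'])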
>>> import pandas as pd
>>> import numpy as np
>>> data.shape   # an existing ndarray
(480, 193)
>>> type(data)
numpy.ndarray
>>> df = pd.DataFrame(data=data[0:, 0:],
...                   index=[i for i in range(data.shape[0])],
...                   columns=['f'+str(i) for i in range(data.shape[1])])
>>> df.head()
Here is a simple example of creating a pandas DataFrame from a numpy array.
import numpy as np
import pandas as pd
# create an array
var1 = np.arange(start=1, stop=21, step=1).reshape(-1)
var2 = np.random.rand(20,1).reshape(-1)
print(var1.shape)
print(var2.shape)
dataset = pd.DataFrame()
dataset['col1'] = var1
dataset['col2'] = var2
dataset.head()
Adding to @behzad.nouri's answer - we can create a helper routine to handle this common scenario:
import numpy as np
import pandas as pd

def csvDf(dat, **kwargs):
    data = np.array(dat)
    if len(data) == 0 or len(data[0]) == 0:
        return None
    # first row -> column names, first column -> index, the rest -> values
    return pd.DataFrame(data[1:, 1:], index=data[1:, 0], columns=data[0, 1:], **kwargs)
Let's try it out:
data = [['','a','b','c'],['row1','row1cola','row1colb','row1colc'],
['row2','row2cola','row2colb','row2colc'],['row3','row3cola','row3colb','row3colc']]
In [61]: csvDf(data)
Out[61]:
a b c
row1 row1cola row1colb row1colc
row2 row2cola row2colb row2colc
row3 row3cola row3colb row3colc
I think this is a simple and intuitive method:
import numpy as np
import pandas as pd

data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
reward = np.array([1, 0, 1, 0])
dataset = pd.DataFrame()
dataset['StateAttributes'] = data.tolist()
dataset['reward'] = reward.tolist()
dataset
returns:
  StateAttributes  reward
0          [0, 0]       1
1          [0, 1]       0
2          [1, 0]       1
3          [1, 1]       0
But there are performance implications detailed here:
How to set the value of a pandas column as list
It's not so short, but maybe it can help you.
Creating Array
import numpy as np
import pandas as pd
data = np.array([['col1', 'col2'], [4.8, 2.8], [7.0, 1.2]])
>>> data
array([['col1', 'col2'],
['4.8', '2.8'],
['7.0', '1.2']], dtype='<U4')
Creating data frame
df = pd.DataFrame(i for i in data).transpose()
df.drop(0, axis=1, inplace=True)  # drop the first column, which holds the former header row
df.columns = data[0]
df
>>> df
col1 col2
0 4.8 7.0
1 2.8 1.2
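Note that this treats each row after the header row as a column of the result. If you instead want those rows kept as rows, the plain constructor does it directly:
df = pd.DataFrame(data[1:], columns=data[0])
#   col1 col2
# 0  4.8  2.8
# 1  7.0  1.2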