I am trying to do the equivalent of Excel's COUNTIF() function. I am stuck on how to tell the .count() function to read from a specific column of the dataframe.
I have
df = pd.read_csv('testdata.csv')
df.count('1')
but this does not work, and even if it did it is not specific enough.
I am thinking I may have to use read_csv to read specific columns individually.
Example:
Column name
4
4
3
2
4
1
The function would output that there is one '1', and I could run it again and find that there are three '4' answers, etc.
I got it to work! Thank you
I used:
print(df.col.value_counts().loc['x'])
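For reference, a minimal runnable version of that approach (assuming a column literally named col and the example values from the question; with real data the column name and looked-up value would change accordingly):

import pandas as pd

# example data from the question
df = pd.DataFrame({'col': [4, 4, 3, 2, 4, 1]})

# count how many times the value 4 appears in column 'col'
print(df['col'].value_counts().loc[4])  # 3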
Here is an example of a simple 'countif' recipe you could try:
import pandas as pd
def countif(rng, criteria):
    return rng.eq(criteria).sum()

# Example use
df = pd.DataFrame({'column1': [4, 4, 3, 2, 4, 1],
                   'column2': [1, 2, 3, 4, 5, 6]})

countif(df['column1'], 1)
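With the example data above, countif(df['column1'], 1) should return 1 and countif(df['column1'], 4) should return 3, matching the counts described in the question.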
If all else fails, why not try something like this?
import numpy as np
import pandas
import matplotlib.pyplot as plt
df = pandas.DataFrame(data=np.random.randint(0, 100, size=100), columns=["col1"])
counters = {}
for i in range(len(df)):
    if df.iloc[i]["col1"] in counters:
        counters[df.iloc[i]["col1"]] += 1
    else:
        counters[df.iloc[i]["col1"]] = 1
print(counters)
plt.bar(counters.keys(), counters.values())
plt.show()
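As a side note, the manual dictionary loop above can usually be replaced by the built-in value_counts(), which produces the same value-to-count mapping in one call; a minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.random.randint(0, 100, size=100), columns=["col1"])

# Series mapping each value in "col1" to its number of occurrences,
# equivalent to the counters dict built manually above
counts = df["col1"].value_counts()
print(counts)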
import numpy as np
import pandas as pd
df = pd.read_csv('test_python.csv')
print(df.groupby('fifth').sum())
This is my data:
I am summing the first three columns for every word in fifth.
The result is this, and it is correct.
The next thing I want to do is take those results and sum them together.
example:
buy = 6
cheese = 8
file = 12
...
word = 13
How can I do this? How can I use those results?
And now I also want to use the column second as a new column named second2, with the results as its data. How can I do that?
For summing, you can use apply with a lambda:
df = pd.DataFrame({"first":[1]*14,
"second":np.arange(1,15),
"third":[0]*14,
"forth":["one","two","three","four"]*3+["one","two"],
"fifth":["hello","no","hello","hi","buy","hello","cheese","water","hi","juice","file","word","hi","red"]})
df1 = df.groupby(['fifth'])[['first','second','third']].agg('sum').reset_index()
df1["sum_3_Col"] = df1.apply(lambda x: x["first"] + x["second"] + x["third"],axis=1)
df1.rename(columns={'second':'second2'}, inplace=True)
Output of df1:
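If "sum them together" means a single grand total of those per-word sums, one way (using the df1 and sum_3_Col column built above) would be:

total = df1["sum_3_Col"].sum()
print(total)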
I have a dataframe like this in a .csv:
Consequence,N_samples
A,227
B,413
C,194
D,1
E,1610
F,10
G,7
H,1
I,1
J,5
K,1
L,5
M,5
N,30
O,7
P,3
And I want to make a pie plot out of it, grouping all values lower than 150 into an "Other" category. I've tried running this code, but it's not working.
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plot

other = {'Consequence' : 'Other', 'N_samples':0}
df=pd.read_csv('df.csv', sep=',')
df = df.append(other,ignore_index=True)
for i in df:
    if (x in df['N_samples']) < 150:
        df['N_samples'].iloc[-1] = df['N_samples'].iloc[-1] + (x in df['N_samples'])
        df.drop([x])
df.plot.pie(label="", title="Consequence", startangle=90);
plot.savefig('Consequence.svg')
Once I run it I get the following error:
KeyError: "['Consequence'] not found in axis"
I would really appreciate any help.
You are making it more difficult than it is.
First get all the rows, where sample size is below 150:
small_sizes = df[df['N_samples'] < 150]
Then sum up their values:
other_samples = small_sizes['N_samples'].sum()
Finally drop the rows and add the other row:
df = df[~(df['N_samples'] < 150)]
df.loc['other'] = ['Other', other_samples]
That should do the trick.
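Putting those steps together with the pie plot, a minimal end-to-end sketch (assuming the df.csv layout from the question) could look like this:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('df.csv')

# rows with fewer than 150 samples get folded into a single 'Other' row
cond = df['N_samples'] < 150
other_samples = df.loc[cond, 'N_samples'].sum()

df = df.loc[~cond].reset_index(drop=True)
df.loc[len(df)] = ['Other', other_samples]

df.plot.pie(y='N_samples', labels=df['Consequence'], title='Consequence', legend=False)
plt.savefig('Consequence.svg')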
you can do this as follows:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv('df.csv')
collect the rows <150 into a new df:
df_other=pd.DataFrame([{'Consequence':'Other','N_samples':df[df.N_samples<150].N_samples.sum()}])
add that to the rows >= 150 and plot
df2=df[df.N_samples>=150]
df3=pd.concat([df2,df_other],axis=0)
df3.plot.pie(y='N_samples',labels=df3['Consequence'])
plt.show()
If you find yourself iterating through a dataframe, be aware there's often a built-in way to do whatever you're trying to do.
Define your filtering condition:
cond = df.N_samples < 150
Sum values from filtering condition:
other_sum = df.N_samples[cond].sum()
Filter by opposite to condition and add 'other' row at the bottom in the same line:
df = df.loc[~cond].append({'Consequence': 'other', 'N_samples': other_sum}, ignore_index=True)
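Note that DataFrame.append was removed in pandas 2.0; on newer versions the same step can be written with pd.concat, for example:

other_row = pd.DataFrame([{'Consequence': 'other', 'N_samples': other_sum}])
df = pd.concat([df.loc[~cond], other_row], ignore_index=True)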
my dataframe df looks like this
Row_ID Codes
=============
1 A123,B456,C678
2 X359,C678,F23
3 J3,D24,J36,K994
I want to put all Codes in a list
something like this
['A123', 'B456', 'C678'],['X359', 'C678', 'F23'], ['J3', 'D24', 'J36', 'K994']
I did this
# an empty list
CodeList = []
for i in df['Codes']:
    CodeList.append(list(i))
but what I get is this
['A','1','2','3','B'....
How can I do it the right way as mentioned above?
import pandas as pd
data = {"Codes": ["A123, B456, C678", "X359, C678, F23", "J3, D24, J36, K994"]}
df = pd.DataFrame(data)
result = [a.split(", ") for a in df["Codes"]]
print(result)
output
[['A123', 'B456', 'C678'], ['X359', 'C678', 'F23'], ['J3', 'D24', 'J36', 'K994']]
Try splitting using the following:
CodeList.append(i.split(','))
It seems like many of the other answers here might just be plain wrong. (Edit: Currently, they all are)
This code does work:
import pandas as pd
data = {'Codes': ['A123,B456,C678', 'X359,C678,F23', 'J3,D24,J36,K994']}
df = pd.DataFrame(data)
codes_list = df['Codes'].str.split(',').tolist()
codes_list looks like:
[['A123', 'B456', 'C678'], ['X359', 'C678', 'F23'], ['J3', 'D24', 'J36', 'K994']]
Note that this solution is idiomatic Pandas, whereas explicit loops should be avoided whenever possible.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 2), columns=list('AB'))
print(df.head())
print(df.values.tolist())
output:
[[-0.2645782053241853, 0.5022937587041725], [1.624868960959602, 0.5086915380333786], [1.3593608874498997, 0.7077939622903995]]
Just replace list(i) with i.split(',') in the line CodeList.append(list(i)):
CodeList = []
for i in df['Codes']:
    CodeList.append(i.split(','))
I am trying to make 6 separate graphs from a dataframe, imported from Excel, that has 5 columns and multiple rows. I want to add two lines to each graph: the value in the dataframe plus and minus the rolling standard deviation at each point, for each column and row. To do this I am using a nested for loop and then graphing; however, it raises the error "Wrong number of items passed, placement implies 1". I do not know how to fix this.
I have tried converting the dataframe to a list and appending rows as well. Nothing seems to work. I know this could be easily done.
import pandas as pd
import matplotlib.pyplot as plt
excel_file = 'C:/Users/afrydman/Documents/Storage and Data Centers FFO Multiples Data.xlsx'
dfStorage = pd.read_excel(excel_file,sheet_name='Storage Data', index_col='Date')
dfrollingStd = dfStorage.rolling(12).std().shift(-11)
#dfrollingStd.fillna(0)
#print(dfStorage[1][3])
for k,p in dfStorage, dfrollingStd:
    dftemp = pd.DataFrame(dfStorage,columns=[k])
    dfnew=pd.DataFrame(dfrollingStd,columns=[p])
    for i,j in dfStorage, dfrollingStd:
        dftemp = pd.DataFrame(dfStorage,index=[i])
        dfnew=pd.DataFrame(dfrollingStd,index=[j])
        dftemp['-1std'] = pd.DataFrame(dftemp).subtract(dfnew)
        dftemp['+1std'] = pd.DataFrame(dftemp).add(dfnew)
        pd.DataFrame(dftemp).plot()
        plt.ylabel('P/FFO')
I expect the output to be 6 separate graphs each with 3 lines. Instead I am not getting anything. My loop is also not executing properly.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
excel_file = 'C:/Users/afrydman/Documents/Storage and Data Centers FFO Multiples Data.xlsx'
dfStorage = pd.read_excel(excel_file,sheet_name='Storage Data', index_col='Date')
dfrollingStd = dfStorage.rolling(12).std().shift(-11)
#dfrollingStd.fillna(0)
#print(dfStorage[1][3])
for i in dfStorage:
    dftemp = pd.DataFrame(dfStorage,columns=[i])
    for j in dfrollingStd:
        dfnew=pd.DataFrame(dfrollingStd,columns=[j])
        dftemp['-1std'] = pd.DataFrame(dftemp).subtract(dfnew)
        dftemp['+1std'] = pd.DataFrame(dftemp).add(dfnew)
        pd.DataFrame(dftemp).plot()
        plt.ylabel('P/FFO')
This is my updated code and I am still getting the same error. This time it says "Wrong number of items passed 2, placement implies 1".
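For reference, a sketch of one way to get a separate figure per column, each showing the series plus and minus its rolling standard deviation (this assumes the same Excel file and 'Storage Data' sheet as above; the pairing is done per column name instead of with nested loops):

import pandas as pd
import matplotlib.pyplot as plt

excel_file = 'C:/Users/afrydman/Documents/Storage and Data Centers FFO Multiples Data.xlsx'
dfStorage = pd.read_excel(excel_file, sheet_name='Storage Data', index_col='Date')
dfrollingStd = dfStorage.rolling(12).std().shift(-11)

# one figure per column: the series itself and +/- one rolling standard deviation
for col in dfStorage.columns:
    dftemp = dfStorage[[col]].copy()
    dftemp['-1std'] = dfStorage[col] - dfrollingStd[col]
    dftemp['+1std'] = dfStorage[col] + dfrollingStd[col]
    dftemp.plot(title=col)
    plt.ylabel('P/FFO')

plt.show()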
I am writing Python code that should read the values of columns, but I am getting a KeyError: 'column_name' error. Can anyone please tell me how to fix this issue?
import numpy as np
from sklearn.cluster import KMeans
import pandas as pd
### For the purposes of this example, we store feature data from our
### dataframe `df`, in the `f1` and `f2` arrays. We combine this into
### a feature matrix `X` before entering it into the algorithm.
df = pd.read_csv(r'C:\Users\Desktop\data.csv')
print (df)
#df = pd.read_csv(csv_file)
"""
saved_column = df.Distance_Feature
saved_column = df.Speeding_Feature
print(saved_column)
"""
f1 = df['Distance_Feature'].tolist()
f2 = df['Speeding_Feature'].tolist()
print(f1)
print(f2)
X=np.matrix(zip(f1,f2))
print(X)
kmeans = KMeans(n_clusters=2).fit(X)
Can anyone please help me.
Assuming 'C:\Users\Desktop\data.csv' contains the following data
Distance_Feature Speeding_Feature
1 2
3 4
5 6
...
Change
df = pd.read_csv(r'C:\Users\Desktop\data.csv')
to
df = pd.read_csv("data.txt",names=["Distance_Feature","Speeding_Feature"],sep= "\s+|\t+|\s+\t+|\t+\s+",header=1)
# Here it is assumed white space separator, if another separator is used change `sep`.
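Once the columns load correctly, note that np.matrix(zip(f1, f2)) will not behave as intended on Python 3, since zip returns an iterator; one way to finish the example is to build the feature matrix with np.column_stack (a sketch, assuming the same file and column names):

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv(r'C:\Users\Desktop\data.csv',
                 names=["Distance_Feature", "Speeding_Feature"],
                 sep=r"\s+", header=0)

# stack the two feature columns into an (n_samples, 2) array
X = np.column_stack((df['Distance_Feature'], df['Speeding_Feature']))

kmeans = KMeans(n_clusters=2).fit(X)
print(kmeans.labels_)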