I am writing Python code that should read the values of two columns, but I am getting a KeyError: 'column_name' error. Can anyone please tell me how to fix this issue?
import numpy as np
from sklearn.cluster import KMeans
import pandas as pd
### For the purposes of this example, we store feature data from our
### dataframe `df`, in the `f1` and `f2` arrays. We combine this into
### a feature matrix `X` before entering it into the algorithm.
df = pd.read_csv(r'C:\Users\Desktop\data.csv')
print (df)
#df = pd.read_csv(csv_file)
"""
saved_column = df.Distance_Feature
saved_column = df.Speeding_Feature
print(saved_column)
"""
f1 = df['Distance_Feature'].tolist()
f2 = df['Speeding_Feature'].tolist()
print(f1)
print(f2)
X = np.array(list(zip(f1, f2)))  # in Python 3, zip() is lazy, so materialize it before building the array
print(X)
kmeans = KMeans(n_clusters=2).fit(X)
Can anyone please help me.
Assuming 'C:\Users\Desktop\data.csv' contains the following data:
Distance_Feature Speeding_Feature
1 2
3 4
5 6
...
Change
df = pd.read_csv(r'C:\Users\Desktop\data.csv')
to
df = pd.read_csv(r'C:\Users\Desktop\data.csv', names=["Distance_Feature", "Speeding_Feature"], sep=r"\s+", header=0)
# header=0 marks the first line as the existing header, which `names` then replaces.
# A whitespace separator is assumed here; if the file uses another separator, change `sep`.
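To check whether the separator is really the cause before touching the real file, a small in-memory sample can be parsed both ways. This is only a sketch: the `sample` string is a hypothetical stand-in for the contents of data.csv, and it assumes the columns are separated by whitespace rather than commas.

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of data.csv (whitespace-separated).
sample = "Distance_Feature Speeding_Feature\n1 2\n3 4\n5 6\n"

# With the default comma separator the whole header line becomes ONE column
# name, so df['Distance_Feature'] would raise KeyError.
bad = pd.read_csv(io.StringIO(sample))
print(list(bad.columns))

# Splitting on runs of whitespace yields the two expected columns.
good = pd.read_csv(io.StringIO(sample), sep=r"\s+")
print(list(good.columns))

f1 = good["Distance_Feature"].tolist()
f2 = good["Speeding_Feature"].tolist()
```

Printing `df.columns` on the real file in the same way shows exactly which names pandas parsed, including any stray spaces.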
I have a column of strings that contain percentages, e.g. XX: (+2, 30%); (-5, 20%); (+17, 50%).
I need to extract the highest % value from each such string, for the whole column.
Any advice will be highly appreciated!
Thanks
As I understand it, each cell in column XX is a string that contains several percentages. Here is the small test DataFrame I used:
import pandas as pd
import re
df = pd.DataFrame({"XX":["(+2, 30%), (-5, 20%), (+17, 50%)","(+2, 70%), (-5, 20%), (+17, 50%)", ""]})
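pattern = re.compile(r"([0-9.]+)%")
# Convert the matches to float before taking the max; comparing the raw strings would rank "5" above "30".
df["XX"].apply(lambda x: max(map(float, pattern.findall(x)), default=-1))
OUTPUT
0    50.0
1    70.0
2    -1.0
Name: XX, dtype: float64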
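The same extraction can also be sketched with pandas string methods instead of `apply`. This is one possible variant, not the only way to do it; it reuses the test DataFrame from above and uses -1 as the placeholder for rows without any percentage, which is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({"XX": ["(+2, 30%), (-5, 20%), (+17, 50%)",
                          "(+2, 70%), (-5, 20%), (+17, 50%)",
                          ""]})

# extractall returns one row per regex match, indexed by (original row, match number).
matches = df["XX"].str.extractall(r"([0-9.]+)%")[0].astype(float)

# Take the maximum per original row; rows with no match get the placeholder -1.
highest = matches.groupby(level=0).max().reindex(df.index, fill_value=-1.0)
print(highest)
```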
This code returns the largest value in a column of percentages:
import pandas as pd
import numpy as np
data = [['2.3%', 1],['5.3%', 3]]
data = pd.DataFrame(data)
first_column = data.iloc[:, 0]
percent_list = []
for val in first_column:
    # strip the trailing '%' and convert to a number
    percent_list.append(float(val[:-1]))
print(percent_list[np.argmax(percent_list)])
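A shorter variant of the same idea, as a sketch: it assumes every value in the column is a string ending in '%', as in the two sample rows.

```python
import pandas as pd

data = pd.DataFrame([["2.3%", 1], ["5.3%", 3]])

# Strip the trailing '%' from every value, convert to float, and take the max.
percents = data.iloc[:, 0].str.rstrip("%").astype(float)
largest = percents.max()
print(largest)
```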
I have a dataframe df with column name 'col' as the second column and the data looks like:
Dataframe
I want to put the text part in one column named "Casing Size" and the numerical part in another column named "DepthTo".
Desired Output
import pandas as pd
import io
from google.colab import files
uploaded = files.upload()
df = pd.read_excel(io.BytesIO(uploaded['Test-Checking.xlsx']))
#Method 1
df2 = pd.DataFrame(data=df, columns=['col'])
df2 = df2.col.str.extract('([a-zA-Z]+)([^a-zA-Z]+)', expand=True)
df2.columns = ['CasingSize', 'DepthTo']
df2
#Method 2
def split_col(x):
    try:
        numb = float(x.split()[0])
        txt = x.split()[1]
    except:
        numb = float(x.split()[1])
        txt = x.split()[0]
    x['col1'] = txt
    x['col2'] = numb
df2['col1'] = df.col.apply(split_col)
df2
I tried two methods, but neither of them works correctly. Can anyone help me?
Code in Google Colab
Excel File Attached
Try this.
First you need to return the values from your function. Then you can unpack them into your columns using tolist():
def sample(x):
    b, y = x.split()
    return b, y
temp_df=df2['col'].apply(sample)
df2[['col1','col2']]=pd.DataFrame(temp_df.tolist())
You could try splitting the values into a list, then sorting them so that the numerical part comes first. Then you could apply pd.Series and assign back to the two columns.
import pandas as pd
df = pd.DataFrame({'col':["PWT 69.2", '283.5 HWT', '62.9 PWT', '284 HWT']})
# After sorting, the numeric token comes first, so it goes into 'DepthTo'.
df[['DepthTo','Casing Size']] = df['col'].str.split().apply(lambda x: sorted(x)).apply(pd.Series)
print(df)
Output
         col DepthTo Casing Size
0   PWT 69.2    69.2         PWT
1  283.5 HWT   283.5         HWT
2   62.9 PWT    62.9         PWT
3    284 HWT     284         HWT
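An alternative sketch that does not depend on sort order is to pull out each part with its own regular expression. It assumes every cell contains exactly one alphabetic token and one number, as in the sample rows; the patterns would need adjusting for other layouts.

```python
import pandas as pd

df = pd.DataFrame({"col": ["PWT 69.2", "283.5 HWT", "62.9 PWT", "284 HWT"]})

# Extract the alphabetic token and the numeric token independently of their order.
df["Casing Size"] = df["col"].str.extract(r"([A-Za-z]+)", expand=False)
df["DepthTo"] = df["col"].str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)
print(df)
```

Unlike the split-and-sort approach, this leaves "DepthTo" as a float column, so it can be used in arithmetic directly.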
I am trying to make 6 separate graphs from a dataframe imported from Excel that has 5 columns and multiple rows. On each graph I want to add two lines: the value at each point plus and minus the rolling standard deviation at that point. To do this I am using a nested for loop and then plotting, but I get the error "wrong number of items passed, placement implies 1". I do not know how to fix this.
I have also tried converting the dataframe to a list and appending rows. Nothing seems to work, although I suspect this should be easy to do.
import pandas as pd
import matplotlib.pyplot as plt
excel_file = 'C:/Users/afrydman/Documents/Storage and Data Centers FFO Multiples Data.xlsx'
dfStorage = pd.read_excel(excel_file,sheet_name='Storage Data', index_col='Date')
dfrollingStd = dfStorage.rolling(12).std().shift(-11)
#dfrollingStd.fillna(0)
#print(dfStorage[1][3])
for k,p in dfStorage, dfrollingStd:
    dftemp = pd.DataFrame(dfStorage,columns=[k])
    dfnew=pd.DataFrame(dfrollingStd,columns=[p])
    for i,j in dfStorage, dfrollingStd:
        dftemp = pd.DataFrame(dfStorage,index=[i])
        dfnew=pd.DataFrame(dfrollingStd,index=[j])
        dftemp['-1std'] = pd.DataFrame(dftemp).subtract(dfnew)
        dftemp['+1std'] = pd.DataFrame(dftemp).add(dfnew)
        pd.DataFrame(dftemp).plot()
plt.ylabel('P/FFO')
I expect the output to be 6 separate graphs each with 3 lines. Instead I am not getting anything. My loop is also not executing properly.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
excel_file = 'C:/Users/afrydman/Documents/Storage and Data Centers FFO Multiples Data.xlsx'
dfStorage = pd.read_excel(excel_file,sheet_name='Storage Data', index_col='Date')
dfrollingStd = dfStorage.rolling(12).std().shift(-11)
#dfrollingStd.fillna(0)
#print(dfStorage[1][3])
for i in dfStorage:
    dftemp = pd.DataFrame(dfStorage,columns=[i])
    for j in dfrollingStd:
        dfnew=pd.DataFrame(dfrollingStd,columns=[j])
        dftemp['-1std'] = pd.DataFrame(dftemp).subtract(dfnew)
        dftemp['+1std'] = pd.DataFrame(dftemp).add(dfnew)
        pd.DataFrame(dftemp).plot()
        plt.ylabel('P/FFO')
This is my updated code, and I am still getting the same kind of error. This time it says: Wrong number of items passed 2, placement implies 1
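The error most likely comes from assigning a multi-column result to a single column. Subtracting or adding two DataFrames aligns them on column labels, so when the labels differ the result has more than one column, and that cannot be stored under the single name '-1std'. A possible sketch that avoids this works with one column (a Series) at a time and needs only a single loop. The random `demo` frame is a hypothetical stand-in for the Excel sheet, and the helper names `std_bands` and `plot_bands` are made up for illustration.

```python
import numpy as np
import pandas as pd


def std_bands(df, window=12):
    """Return {column name: DataFrame holding the value and +/- 1 rolling std}."""
    rolling_std = df.rolling(window).std().shift(-(window - 1))
    bands = {}
    for col in df.columns:
        # df[col] and rolling_std[col] are Series, so each result is one column.
        bands[col] = pd.DataFrame({
            col: df[col],
            "-1std": df[col] - rolling_std[col],
            "+1std": df[col] + rolling_std[col],
        })
    return bands


def plot_bands(bands, ylabel="P/FFO"):
    """Draw one figure per column, each with three lines."""
    import matplotlib.pyplot as plt

    for col, frame in bands.items():
        ax = frame.plot(title=str(col))
        ax.set_ylabel(ylabel)
    plt.show()


# Hypothetical stand-in data: 36 monthly rows, 5 columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.normal(15, 2, size=(36, 5)),
    columns=list("ABCDE"),
    index=pd.date_range("2015-01-01", periods=36, freq="MS"),
)
bands = std_bands(demo)
```

Calling `plot_bands(bands)` then opens one figure per column. With the real data, `demo` would be replaced by `dfStorage`.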
I am trying to do the equivalent of the COUNTIF() function in Excel. I am stuck on how to tell the .count() function to read from a specific column of the file.
I have
df = pd.read_csv('testdata.csv')
df.count('1')
but this does not work, and even if it did it is not specific enough.
I am thinking I may have to use read_csv to read specific columns individually.
Example:
Column name
4
4
3
2
4
1
The function would report that there is one '1', and I could run it again to find out that there are three '4' answers, and so on.
I got it to work! Thank you
I used:
print(df.col.value_counts().loc['x'])
Here is an example of a simple 'countif' recipe you could try:
import pandas as pd
def countif(rng, criteria):
    # compare every element to the criteria and count the matches
    return rng.eq(criteria).sum()
Example use
df = pd.DataFrame({'column1': [4,4,3,2,4,1],
'column2': [1,2,3,4,5,6]})
countif(df['column1'], 1)
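If the counts for every distinct value are needed at once, `value_counts()` gives them in a single call. A small sketch using the same example column; the looked-up values 4 and 7 are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"column1": [4, 4, 3, 2, 4, 1]})

# One pass over the column: each distinct value mapped to its frequency.
counts = df["column1"].value_counts()
print(counts)

# .get() returns a default instead of raising KeyError for absent values.
n_fours = counts.get(4, 0)
n_sevens = counts.get(7, 0)
```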
If all else fails, why not try something like this?
import numpy as np
import pandas
import matplotlib.pyplot as plt
df = pandas.DataFrame(data=np.random.randint(0, 100, size=100), columns=["col1"])
counters = {}
for i in range(len(df)):
    if df.iloc[i]["col1"] in counters:
        counters[df.iloc[i]["col1"]] += 1
    else:
        counters[df.iloc[i]["col1"]] = 1
print(counters)
plt.bar(counters.keys(), counters.values())
plt.show()
I am trying to create an XY chart using Python and the Pygal library. The source data is contained in a CSV file with three columns; ID, Portfolio and Value. Unfortunately I can only plot one axis and I suspect it's an issue with the array. Can anyone point me in the right direction? Do I need to use numpy? Thank you!
import pygal
import pandas as pd
data = pd.read_csv("profit.csv")
data.columns = ["ID", "Portfolio", "Value"]
xy_chart = pygal.XY()
xy_chart.add('Portfolio', data['Portfolio','Value'])  # <-- I suspect this is wrong
xy_chart.render_in_browser()
With
import pygal
import pandas as pd
data = pd.read_csv("profit.csv")
data.columns = ["ID", "Portfolio", "Value"]
xy_chart = pygal.XY()
xy_chart.add('Portfolio', data['Portfolio']
)
xy_chart.render_in_browser()
I get:
A graph with a series of horizontal data points/values; i.e. it has the X values but no Y values.
With:
import pygal
import pandas as pd
data = pd.read_csv("profit.csv")
data.columns = ["ID", "Portfolio", "Value"]
xy_chart = pygal.XY()
xy_chart.add('Portfolio', data['Portfolio','Value']
)
xy_chart.render_in_browser()
I get:
KeyError: ('Portfolio', 'Value')
Sample data:
ID Portfolio Value
1 1 -2560.042036
2 2 1208.106958
3 3 5702.386949
4 4 -8827.63913
5 5 -3881.665733
6 6 5951.602484
Maybe a little late here, but I just did something similar. In your second example, multiple columns have to be passed as a list (note the double brackets), and the DataFrame you get back then needs to be converted into a list of tuples.
import pygal
import pandas as pd
data = pd.read_csv("profit.csv")
data.columns = ["ID", "Portfolio", "Value"]
points = data[['Portfolio','Value']].to_records(index=False).tolist()
xy_chart = pygal.XY()
xy_chart.add('Portfolio', points)
xy_chart.render_in_browser()
There may be a more elegant use of the pandas or pygal API to get the columns into a list of tuples.
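One candidate for that is `itertuples`, sketched below. The DataFrame is built inline from the first three sample rows instead of reading profit.csv, so the conversion can be checked without the file or pygal.

```python
import pandas as pd

# First three rows of the sample data, built inline instead of read from CSV.
data = pd.DataFrame({
    "ID": [1, 2, 3],
    "Portfolio": [1, 2, 3],
    "Value": [-2560.042036, 1208.106958, 5702.386949],
})

# index=False drops the row index; name=None yields plain tuples.
points = list(data[["Portfolio", "Value"]].itertuples(index=False, name=None))
print(points)
```

The resulting `points` list can be passed to `xy_chart.add('Portfolio', points)` exactly as in the answer above.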