Beginner here.
With the help of someone here I was able to extract the second and third tables on this page (Team Statistics and Team Analytics 5-on-5); the code ends with this part:
for each in comments:
    if 'table' in str(each):
        try:
            tables.append(pd.read_html(each, header=1)[0])
            tables = tables[tables['Rk'].ne('Rk')]
            tables = tables.rename(columns={'Unnamed: 1':'Team'})
        except:
            continue

for table in tables[1:3]:
    print(table)
They are standard DataFrames, but I just can't figure out how to drop some columns from them. I've tried to do this using df.drop:
for each in comments:
    if 'table' in str(each):
        try:
            tables.append(pd.read_html(each, header=1)[0])
            tables = tables[tables['Rk'].ne('Rk')]
            tables = tables.rename(columns={'Unnamed: 1':'Team'})
        except:
            continue

for table in tables[1:3]:
    df = pd.read_table = [1]
    df = df.drop({"AvAge", "GP", "W", "L", "OL", "PTS", "GF", "GA", "SOW", "SOL", "SOS", "PP", "PPO", "PP%", "PPA", "PPOA", "PK%", "SH", "SHA", "PIM/G", "oPIM/G", "S", "SA", "SO"}, 1)
    print(df)
    df = pd.read_table = [2]
    df = df = df.drop({"S%", "SV%", "CF", "CA", "FF", "FA", "xGF", "xGA", "aGF", "aGA", "SCF", "SCA", "HDF", "HDA", "HDGF", "HDGA"}, 1)
    print(df)
but I got this answer:
AttributeError: 'list' object has no attribute 'drop'
It feels like there's a problem with how I'm using "df" and "table", but I'm not sure at all, and this is where I'm stuck for the moment.
Thanks in advance!
No, the problem is with the compound assignment statement.
df = pd.read_table = [1]
print(df)
print(pd.read_table)
Output:
[1]
[1]
This code assigns [1] to both df and pd.read_table. Then the code calls df.drop() but df is a list and list does not have a drop() method. More troubling is that the code assigns a list to the pd.read_table callable. I'm uncertain what you're trying to do here, but this is certainly the source of your error.
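For reference, here is a minimal sketch of dropping columns from one of the parsed tables; the DataFrame and column names below are stand-ins, not the real scraped data:

```python
import pandas as pd

# Stand-in for one of the parsed tables (hypothetical columns)
df = pd.DataFrame({"Team": ["A", "B"], "GP": [82, 82], "W": [50, 40], "PTS": [104, 88]})

# drop() takes a list of labels; columns=... drops columns.
# errors="ignore" skips labels that aren't present, which helps when
# the two tables have different column sets.
df = df.drop(columns=["GP", "W"], errors="ignore")
print(df.columns.tolist())   # ['Team', 'PTS']
```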
I'm running a for loop using pandas that checks whether another DataFrame with the same name has already been created. If it has, the values should just be appended to the corresponding columns; if it hasn't, the DataFrame should be created and the values appended to the named columns.
dflistglobal = []

####
# For loop that generates a, b, and c variables every time it runs.
####

###
# The following code runs inside the for loop: every iteration generates a, b, and c, then checks whether a df with a specific name has been created. If yes, it should append the values to that "listname"; if not, it should create a new list named "listname". The list name changes every run and can repeat during this for loop.
###

if listname not in dflistglobal:
    dflistglobal.append(listname)
    listname = pd.DataFrame(columns=['a', 'b', 'c'])
    listname = listname.append({'a': a, 'b': b, 'c': c}, ignore_index=True)
else:
    listname = listname.append({'a': a, 'b': b, 'c': c}, ignore_index=True)
I am getting the following error:
File "test.py", line 150, in <module>
functiontest(image, results, list)
File "test.py", line 68, in funtiontest
listname = listname.append({'a':a, 'b':b, 'c':c}, ignore_index=True)
AttributeError: 'str' object has no attribute 'append'
The initial if statement runs fine, but the else statement causes problems.
Solved this issue by not using pandas DataFrames. I looped through the for loop generating a unique identifier for each listname, then appended a, b, c, and listname to a single list. At the end you end up with one large df that can be filtered using the groupby function.
Not sure if this will be helpful for anyone, but avoiding many small pandas dfs and collecting rows in a plain list was the best approach for me.
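That approach can be sketched like this (the tuples are dummy data standing in for the real a, b, c, and listname values):

```python
import pandas as pd

rows = []
# Inside the real loop, append one record per iteration instead of
# creating a DataFrame per name (the tuples here are dummy data).
for a, b, c, listname in [(1, 2, 3, "x"), (4, 5, 6, "y"), (7, 8, 9, "x")]:
    rows.append({"a": a, "b": b, "c": c, "name": listname})

big = pd.DataFrame(rows)              # one DataFrame built at the end
grouped = big.groupby("name").size()  # filter/aggregate per name afterwards
print(grouped.to_dict())
```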
That error tells you that listname is a string (and you cannot append a DataFrame to a string).
You may want to check if somewhere in your code you are adding a string to your list dflistglobal.
EDIT: Possible solution
I'm not sure how you are naming your DataFrames, and I don't see how you can access them afterwards.
Instead of using a list, you can store your DataFrames inside a dictionary, e.g. df_dict = {"name": df}. This lets you easily access each DataFrame by name.
import pandas as pd
import random

df_dict = {}

# For loop
for _ in range(10):
    # Logic to get your variables (example)
    a = random.randint(1, 10)
    b = random.randint(1, 10)
    c = random.randint(1, 10)
    # Logic to get your DataFrame name (example)
    df_name = f"dataframe{random.randint(1, 10)}"
    new_row = pd.DataFrame([{'a': a, 'b': b, 'c': c}])
    if df_name not in df_dict:
        # DataFrame with same name does not exist
        df_dict[df_name] = new_row
    else:
        # DataFrame with same name exists; pd.concat replaces the
        # DataFrame.append method that was removed in pandas 2.0
        df_dict[df_name] = pd.concat([df_dict[df_name], new_row], ignore_index=True)
Also, for more info, you may want to visit this question
I hope it was clear and it helps.
I have a Python script that reads a dataframe using pandas and displays its content using streamlit.
What I want is to replace a current value with a new value based on user input.
The user selects the required column, enters the current value in one text field and the new value in a second text field; when the Replace button is pressed, the old value is replaced by the new one and the updated dataframe is displayed.
The problem is that when the dataframe is displayed, nothing has changed.
code:
import pandas as pd
import streamlit as st
df = pd.DataFrame({
    "source_number":
        [11199,11328,11287,32345,12342,1232,13456,123244,13456],
    "location":
        ["loc2","loc1","loc3","loc1","loc2","loc2","loc3","loc2","loc1"],
    "category":
        ["cat1","cat2","cat1","cat3","cat3","cat3","cat2","cat3","cat2"],
})
columns = st.selectbox("Select column", df.columns)
old_values = st.multiselect("Current Values", list(df[columns].unique()), list(df[columns].unique()))
col1, col2 = st.beta_columns(2)
with col1:
    old_val = st.text_input("old value")
with col2:
    new_val = st.text_input("new value")
if st.button("Replace"):
    df[columns] = df[columns].replace({old_val: new_val})
    st.dataframe(df)
There is a little error in your code.
df[columns]=df[columns].replace({old_val:new_val})
When you look at the pandas docs, the examples show code like s.replace({'a': None}), which replaces the value 'a' with None.
Your code, on the other hand, effectively tries to replace one list of values with another, but the dict form of replace does not work like that: dict keys must be hashable scalars, and your column doesn't contain lists as elements.
When I ran your code in Jupyter, I got an error saying a list is unhashable:
--------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-41a02888936d> in <module>
30 oldVal = [11199,11328,11287,32345]
31
---> 32 df["source_number"] = df["source_number"].replace({oldVal:newVal})
33 df
TypeError: unhashable type: 'list'
And this is the reason why it doesn't change anything for you.
If you want to replace whole lists of values, you have to write it like this:
df[column] = df[column].replace(old_values, new_values)
This code works just fine.
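A standalone sketch of the two forms of replace (outside streamlit, on a toy Series):

```python
import pandas as pd

s = pd.Series(["loc1", "loc2", "loc3"])

# dict form: one scalar key maps to one scalar value
print(s.replace({"loc1": "locA"}).tolist())

# two-list form: element-wise pairs old_values[i] -> new_values[i],
# useful for replacing many values at once
print(s.replace(["loc2", "loc3"], ["locB", "locC"]).tolist())
```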
I hope I was clear enough and it will work for you.
Your code works for the text columns (location and category). It doesn't work for the numeric source_number column because you're trying to replace one string with another.
For numeric columns you'll need to use number_input instead of text_input:
import pandas as pd
from pandas.api.types import is_numeric_dtype
import streamlit as st

df = pd.DataFrame({
    "source_number":
        [11199,11328,11287,32345,12342,1232,13456,123244,13456],
    "location":
        ["loc2","loc1","loc3","loc1","loc2","loc2","loc3","loc2","loc1"],
    "category":
        ["cat1","cat2","cat1","cat3","cat3","cat3","cat2","cat3","cat2"],
})
columns = st.selectbox("Select column", df.columns)
old_values = st.multiselect("Current Values", list(df[columns].unique()), list(df[columns].unique()))
col1, col2 = st.beta_columns(2)
st_input = st.number_input if is_numeric_dtype(df[columns]) else st.text_input
with col1:
    old_val = st_input("old value")
with col2:
    new_val = st_input("new value")
if st.button("Replace"):
    df[columns] = df[columns].replace({old_val: new_val})
    st.dataframe(df)
Update as per comment: you could cache the df to prevent re-initialization upon each widget interaction (you'll have to manually clear the cache to start over):
@st.cache(allow_output_mutation=True)
def get_df():
    return pd.DataFrame({
        "source_number":
            [11199,11328,11287,32345,12342,1232,13456,123244,13456],
        "location":
            ["loc2","loc1","loc3","loc1","loc2","loc2","loc3","loc2","loc1"],
        "category":
            ["cat1","cat2","cat1","cat3","cat3","cat3","cat2","cat3","cat2"],
    })

df = get_df()
I have a column in a Python df like:
TAGS
{user_type:active}
{session_type:session1}
{user_type:inactive}
How can I efficiently expand this column into one new column per tag key?
Desired:
TAGS |user_type|session_type
{user_type:active} |active |null
{session_type:session1}|null |session1
{user_type:inactive} |inactive |null
My attempt is only able to do this as a boolean flag (not what I want), and only if I specify the tag keys up front (which I don't know ahead of time):
mask = df['tags'].apply(lambda x: 'user_type' in x)
df['user_type'] = mask
There are better ways, but building on what you have (stripping the braces so the values come out clean):
import numpy as np

df['user_type'] = df['tags'].apply(lambda x: x.strip('{}').split(':')[1] if 'user_type' in x else np.nan)
df['session_type'] = df['tags'].apply(lambda x: x.strip('{}').split(':')[1] if 'session_type' in x else np.nan)
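If the tag keys aren't known ahead of time, one possible alternative is to split each tag into key and value and pivot, so every key found in the data becomes its own column (a sketch built on the sample data from the question):

```python
import pandas as pd

df = pd.DataFrame({"TAGS": ["{user_type:active}",
                            "{session_type:session1}",
                            "{user_type:inactive}"]})

# Strip the braces, split on ':' into key/value columns, then pivot so
# each distinct key becomes its own column; missing keys become NaN.
kv = df["TAGS"].str.strip("{}").str.split(":", expand=True)
kv.columns = ["key", "value"]
wide = kv.pivot(columns="key", values="value")
out = df.join(wide)
print(out)
```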
You could use pandas.json_normalize() to convert TAGS column to dict object and check if user_type is a key of that dict.
df2 = pd.json_normalize(df['TAGS'])
df2['user_type'] = df2['TAGS'].apply(lambda x: x['user_type'] if 'user_type' in x else 'null')
This is what ended up working for me; I wanted to post a short working example using the json library that helped.
# This example assumes a dataframe with other fields including tags
import json
import pandas as pd

def js(row):
    if row:
        return json.loads(row)
    else:
        return {'': ''}

df2 = df.copy()
# Make some dummy tags
df2['tags'] = ['{"user_type":"active","nonuser_type":"inactive"}'] * len(df2['tags'])
df2['tags'] = df2['tags'].apply(js)
df_temp = pd.DataFrame(df2['tags'].values.tolist())
df3 = pd.concat([df2.drop('tags', axis=1), df_temp], axis=1)
@Ynjxsjmh your approach reminds me of something I had used in the past, but in this case I got the following error:
AttributeError: 'str' object has no attribute 'values'
@Bing Wang I am a big fan of list comprehension, but in this case I don't know the names of the columns beforehand.
I'm in the initial stages of doing some 'machine learning'.
I'm trying to create a new data frame, and one of the columns doesn't appear to be recognised.
I've loaded an Excel file with 2 columns (removed the index). All fine.
Code:
df = pd.read_excel('scores.xlsx', index=False)
df = df.rename(columns=dict(zip(df.columns, ['Date', 'Amount'])))
df.index = df['Date']
df = df[['Amount']]

# creating dataframe
data = df.sort_index(ascending=True, axis=0)
new_data = pd.DataFrame(index=range(0, len(df)), columns=['Date', 'Amount'])
for i in range(0, len(data)):
    new_data['Date'][i] = data['Date'][i]
    new_data['Amount'][i] = data['Amount'][i]
The error:
KeyError: 'Date'
Not really sure what's the problem here.
Any help greatly appreciated
I think in line 4 you reduce your dataframe to just one column "Amount"
To add to @Grzegorz Skibinski's answer: after line 4 there is no longer a 'Date' column. The Date column was assigned to the index and then removed, and while the index is named "Date", you can't use 'Date' as a key to reach it; you have to use data.index[i] instead of data['Date'][i].
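A small sketch of what happens, using synthetic data in place of the original Excel file:

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2021-01-01", "2021-01-02"], "Amount": [10, 20]})
df.index = df["Date"]
df = df[["Amount"]]           # 'Date' now survives only as the index

print("Date" in df.columns)   # False: df['Date'] would raise KeyError
print(df.index.name)          # the index is still named 'Date'
dates = df.reset_index()      # or use df.index[i] to read the dates back
```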
It seems that you have an error in the formatting of your Date column.
To check that you don't have an error in the column names, you can print them:
import pandas as pd
# create data
data_dict = {}
data_dict['Fruit '] = ['Apple', 'Orange']
data_dict['Price'] = [1.5, 3.24]
# create dataframe from dict
df = pd.DataFrame.from_dict(data_dict)
# Print columns names
print(df.columns.values)
# Print "Fruit " column
print(df['Fruit '])
This code outputs:
['Fruit ' 'Price']
0 Apple
1 Orange
Name: Fruit , dtype: object
We clearly see that the "Fruit " column has a trailing space. This is an easy mistake to make, especially when using Excel.
If you try to call "Fruit" instead of "Fruit ", you obtain the same error you have:
KeyError: 'Fruit'
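One common safeguard is to normalize the headers right after loading, so trailing spaces can't cause this later (shown on the same toy data):

```python
import pandas as pd

df = pd.DataFrame({"Fruit ": ["Apple", "Orange"], "Price": [1.5, 3.24]})
df.columns = df.columns.str.strip()   # strip stray whitespace from headers
print(df["Fruit"].tolist())           # no KeyError anymore
```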
I am trying to extract the data from a table from a webpage, but keep receiving the above error. I have looked at the examples on this site, as well as others, but none deal directly with my problem. Please see code below:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
table = soup.find_all('table', class_='dataframe')
rows = table.find_all('tr')[2:]
data = {
    'RK': [],
    'PLAYER': [],
    'TEAM': [],
    'GP': [],
    'G': [],
    'A': [],
    'PTS': [],
    '+/-': [],
    'PIM': [],
    'PTS/G': [],
    'SOG': [],
    'PCT': [],
    'GWG': [],
    'G1': [],
    'A1': [],
    'G2': [],
    'A2': []
}

for row in rows:
    cols = row.find_all('td')
    data['RK'].append(cols[0].get_text())
    data['PLAYER'].append(cols[1].get_text())
    data['TEAM'].append(cols[2].get_text())
    data['GP'].append(cols[3].get_text())
    data['G'].append(cols[4].get_text())
    data['A'].append(cols[5].get_text())
    data['PTS'].append(cols[6].get_text())
    data['+/-'].append(cols[7].get_text())
    data['PIM'].append(cols[8].get_text())
    data['PTS/G'].append(cols[9].get_text())
    data['SOG'].append(cols[10].get_text())
    data['PCT'].append(cols[11].get_text())
    data['GWG'].append(cols[12].get_text())
    data['G1'].append(cols[13].get_text())
    data['A1'].append(cols[14].get_text())
    data['G2'].append(cols[15].get_text())
    data['A2'].append(cols[16].get_text())
df = pd.DataFrame(data)
df.to_csv("NHL_Players_Stats.csv")
I got rid of the error once I saw that it referred to the table (i.e. the ResultSet) not having a find_all method, and got the code running by commenting out the following line:
#rows = table.find_all('tr')[2:]
and changing this:
for row in rows:
This, however, does not extract any data from the webpage; it simply creates a .csv file with column headers.
I have tried to extract some data directly into rows using soup.find_all, but get the following error;
data['GP'].append( cols[3].get_text() )
IndexError: list index out of range
which I have not been able to resolve.
Therefore, any help would be very much appreciated.
Also, out of curiosity, are there any ways to achieve the desired outcome using:
dataframe = pd.read_html('url')
because I have tried this also, but keep getting:
FeatureNotFound: Couldn't find a tree builder with the features you
requested: html5lib. Do you need to install a parser library?
Ideally this is the method that I would prefer, but can't find any examples online.
find_all returns a ResultSet, which is basically a list of elements. For this reason, it has no method find_all, as this is a method that belongs to an individual element.
If you only want one table, use find instead of find_all to look for it.
table = soup.find('table', class_='dataframe')
Then, getting its rows should work as you have already done:
rows = table.find_all('tr')[2:]
The second error you got is because, for some reason, one of the table's rows seems to have only 3 cells, thus your cols variable became a list with only indexes 0, 1 and 2. That's why cols[3] gives you an IndexError.
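Putting both fixes together on a tiny inline table (a self-contained sketch, not the actual ESPN page):

```python
from bs4 import BeautifulSoup

html = """<table class="dataframe">
<tr><th>header</th></tr>
<tr><td>1</td><td>Player A</td></tr>
<tr><td>2</td><td>Player B</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="dataframe")  # a single Tag, so find_all works on it
rows = table.find_all("tr")[1:]                 # skip the header row

players = []
for row in rows:
    cols = row.find_all("td")
    if len(cols) < 2:       # guard against short rows to avoid IndexError
        continue
    players.append(cols[1].get_text())
print(players)
```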
In terms of achieving the same outcome using:
dataframe = pd.read_html('url')
It is achieved using just that or similar:
dataframe = pd.read_html(url, header=1, index_col=None)
The reason why I was receiving errors previously is that I had not set Spyder's IPython console backend to 'automatic' in Preferences.
I am still, however, trying to resolve this problem using BeautifulSoup. So any useful comments would be appreciated.