Hello,
I have a problem with my Python 3 code. I want to store a tuple in a DataFrame cell, but pandas raises a warning: ...SettingWithCopyWarning...
import pandas as pd

data = {'Debut': ['19/12/2016', '18/1/2017', '13/2/2017', '10/3/2017']}
df = pd.DataFrame(data, columns=['Début'], index=['P1', 'P2', 'P3', 'P4'])
d = data['Debut'][0]
d = d.split("/")
d.reverse()
d = tuple(map(int, d))
df.Début['P1'] = d
I read the pandas documentation and tried this, but Python returns another error (Must have equal len keys and value when setting with an iterable):
df.loc['P1', 'Début'] = d
I tried another way, but it fails with the same error:
df.at['P1', 'Début'] = d
As pointed out, your DataFrame is already using a copy of the data dictionary as its data, which is why you run into copied-data issues. One way to avoid this is to process your data into the shape you want before you put it in the DataFrame. For instance:
import pandas as pd
data={'Debut': ['19/12/2016','18/1/2017','13/2/2017','10/3/2017']}
df = pd.DataFrame(data, columns = ['Début'], index = ['P1','P2','P3','P4'])
# Split each date, convert the pieces to ints, make a tuple, and reverse it to (year, month, day)
date_tuples = [tuple(map(int, i.split("/")))[::-1] for i in data['Debut']]
df['Début'] = date_tuples
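For completeness, here is a minimal runnable sketch of the same idea using Series.apply instead of a list comprehension (same data, no copy warning); the lookup at the end just confirms a real tuple ends up stored in the cell:

```python
import pandas as pd

data = {'Début': ['19/12/2016', '18/1/2017', '13/2/2017', '10/3/2017']}
df = pd.DataFrame(data, index=['P1', 'P2', 'P3', 'P4'])

# convert each 'day/month/year' string into a (year, month, day) tuple of ints
df['Début'] = df['Début'].apply(lambda s: tuple(map(int, reversed(s.split('/')))))

print(df.loc['P1', 'Début'])  # (2016, 12, 19)
```

If you only ever need to change one cell, assigning through df.at into a column that already holds Python objects should also work in recent pandas versions; building the whole column at once, as above, sidesteps the version differences.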
I'm seeing behavior with pandas DataFrames where assigning a value whose data type is incompatible with the existing dtype of the Series may or may not work, depending on that dtype. For instance, if the Series is of dtype int64 and I assign a list, it fails with the error:
ValueError: setting an array element with a sequence.
On the other hand, if the initial dtype was bool, then assigning a single-element list works, but only the contents of the list are stored in the DataFrame, not the list plus contents as expected/intended.
The code below shows some of this unpredictable behavior. In this instance I try to store a DataFrame inside a DataFrame and get a mangled result that stores only part of the nested DataFrame. If I first wrap the DataFrame in a list, pandas unwraps it and stores it fine, but again, not as a list of DataFrames as expected.
I can work around the issue by changing the dtype of the Series to a compatible dtype prior to the assignment, but maybe someone can explain what's going on here or point to the relevant documentation. Thanks.
import pandas as pd

test_df_1 = pd.DataFrame({'A': [1]})
print('*** New DataFrame #1 ****************************')
print(test_df_1)
test_df_1.iloc[0,0] = 2
print('\n*** Works as Expected: [0,0]=2 ******************')
print(test_df_1)
test_df_2 = pd.DataFrame({'X': 'Z',
                          'Y': [5]})
print('\n*** New DataFrame #2 ****************************')
print(test_df_2)
test_df_1.iloc[0,0] = [1.2] # this works fine
#test_df_1.iloc[0,0] = [12] # this will break
#test_df_1.iloc[0,0] = test_df_2 # this will execute but can't explain behavior
test_df_1.iloc[0,0] = [test_df_2] # Why do I have to wrap in a list?
print('\n*** Nested DataFrame ****************************')
print(test_df_1)
print('\n*** Retrieve the DataFrame **********************')
out = test_df_1.iloc[0,0]
print(type(out)) # why not a list?
print(out)
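The workaround mentioned above, widening the Series to object dtype before the assignment, can be sketched like this (a minimal example of the pattern, not an explanation of the underlying block machinery):

```python
import pandas as pd

df = pd.DataFrame({'A': [1]})

# df.at[0, 'A'] = [1, 2, 3] on the int64 column would fail with
# "setting an array element with a sequence"

df['A'] = df['A'].astype(object)  # widen the dtype first
df.at[0, 'A'] = [1, 2, 3]         # now the list is stored as a single cell value

out = df.at[0, 'A']
print(type(out), out)  # a real list, returned intact
```

Once the column is object dtype, each cell is just a reference to an arbitrary Python object, so lists, tuples, and even DataFrames can be stored without pandas trying to broadcast them.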
I have a spreadsheet with fields containing a body of text.
I want to calculate the Gunning-Fog score for each row and output the value to that same Excel file as a new column. To do that, I first need to calculate the score for each row. The code below works if I hard-key the text into the df variable. However, it does not work when I define the field from the sheet (i.e., rfds) and pass that through to my r variable. I get the following error, even though the two fields I am testing contain 3,896 and 4,843 words respectively:
readability.exceptions.ReadabilityException: 100 words required.
Am I missing something obvious? Disclaimer: I am very new to Python and coding in general! Any help is appreciated.
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
rfd = df["Item 1A"]
rfds = rfd.to_string() # to fix "TypeError: expected string or buffer"
r = Readability(rfds)
fog = r.gunning_fog()
print(fog.score)
TL;DR: You need to pass the cell value and are currently passing a column of cells.
The line rfd = df["Item 1A"] returns the whole column as a Series, not a single cell. rfd.to_string() then renders that entire column, index labels included, as one string rather than the text of any one field. The original TypeError was thrown because a Series itself is not a string.
Rather than taking a column and going down it, approach it from the other direction. Take the rows and then pull out the column:
for index, row in df.iterrows():
    print(row.iloc[2])
The [2] is the column index.
Now that we have a single cell's value, it can be passed to the Readability calculator inside the loop:
    r = Readability(row.iloc[2])
    fog = r.gunning_fog()
    print(fog.score)
Note that these can be combined into one command:
print(Readability(row.iloc[2]).gunning_fog().score)
This shows how commands can be chained together; whichever way you find easier is up to you. Chaining is handy when you pass the call to something like apply or applymap.
Putting the whole thing together (the step by step way):
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
for index, row in df.iterrows():
    r = Readability(row.iloc[2])
    fog = r.gunning_fog()
    print(fog.score)
Or the clever way:
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
print(df["Item 1A"].apply(lambda x: Readability(x).gunning_fog().score))
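To get the scores back into the spreadsheet as a new column, assign the apply result to a column and write the frame back out. Here is a sketch with a made-up word-count function standing in for Readability (so it runs without the readability package installed) and a hypothetical output path:

```python
import pandas as pd

def fog_stub(text):
    # stand-in for Readability(text).gunning_fog().score
    return len(text.split())

df = pd.DataFrame({"Item 1A": ["some risk factor text here",
                               "another sample disclosure body"]})
df["Gunning-Fog"] = df["Item 1A"].apply(fog_stub)
print(df)

# with the real scorer and an Excel writer installed, this would be e.g.:
# df["Gunning-Fog"] = df["Item 1A"].apply(lambda x: Readability(x).gunning_fog().score)
# df.to_excel(r"C:/Users/name/edgar/test/item1a_scored.xls", index=False)
```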
So I have a pandas DataFrame that has several columns that contain values I'd like to use to create new columns using a function I've defined. I'd been planning on doing this using Python's List Comprehension as detailed in this answer. Here's what I'd been trying:
df['NewCol1'], df['NewCol2'] = [myFunction(x=row[0], y=row[1]) for row in zip(df['OldCol1'], df['OldCol2'])]
This runs correctly until it comes time to assign the values to the new columns, at which point it fails; I believe it isn't assigning the values pair by pair, but instead tries to assign the whole list of result tuples to each column. I feel like I'm close to doing this correctly, but I can't quite figure out the assignment.
EDIT:
The data are all strings, and the function performs a fetching of some different information from another source based on those strings like so:
def myFunction(x, y):
    # read file based on value of x
    # search file for values a and b based on value of y
    return (a, b)
I know this is a little vague, but the helper function is fairly complicated to explain.
The error received is:
ValueError: too many values to unpack (expected 4)
You can use zip() to transpose the list of (NewCol1, NewCol2) result tuples into one sequence per column:
df['NewCol1'], df['NewCol2'] = zip(*[myFunction(x=row[0], y=row[1]) for row in zip(df['OldCol1'], df['OldCol2'])])
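A small self-contained sketch of why the zip(*...) works, using a made-up two-output function: the comprehension produces one tuple per row, and zip(*...) transposes that into one sequence per new column:

```python
import pandas as pd

def my_function(x, y):
    # hypothetical helper returning two derived values
    return x.upper(), y * 2

df = pd.DataFrame({'OldCol1': ['a', 'b'], 'OldCol2': [1, 2]})

# [('A', 2), ('B', 4)] transposed by zip(*...) into ('A', 'B') and (2, 4)
df['NewCol1'], df['NewCol2'] = zip(*[my_function(x, y)
                                     for x, y in zip(df['OldCol1'], df['OldCol2'])])
print(df)
```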
Please don't flag my answer instantaneously, because I searched several other questions that didn't solve my problem, like this one.
I'm trying to generate a python set of strings from a csv file. The printed pandas dataframe of the loaded csv file has the following structure:
0
0 me
1 yes
2 it
For a project I need this to be formatted to look like this:
STOPWORDS = {'me', 'yes', 'it'}
I tried to do this with the following code:
import pandas as pd
df_stopwords = pd.read_csv("C:/Users/Jakob/stopwords.csv", encoding = 'iso8859-15', header=-1)
STOPWORDS = {}
for index, row in df_stopwords.iterrows():
    STOPWORDS.update(str(row))
print(STOPWORDS)
However, I get this error:
dictionary update sequence element #0 has length 1; 2 is required
When I use STOPWORDS.add(str(row)) instead, I get this error:
'dict' object has no attribute 'add'
Thank you all in advance!
You can directly create a set from the values in the dataframe with:
set(df.values.ravel())
{'me', 'yes', 'it'}
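A runnable sketch of this one-liner, with the frame built inline instead of read from the CSV:

```python
import pandas as pd

# stands in for pd.read_csv("stopwords.csv", header=None) on the stopwords file
df = pd.DataFrame(['me', 'yes', 'it'])

# .values gives a 2-D array, .ravel() flattens it, set() deduplicates
STOPWORDS = set(df.values.ravel())
print(STOPWORDS)  # {'me', 'yes', 'it'} (set order may vary)
```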
A dictionary is a mapping of keys to values, like an object in many other languages. Since you need a set, define it as a set from the start instead of converting a dict later:
import pandas as pd
df_stopwords = pd.read_csv("C:/Users/Jakob/stopwords.csv", encoding = 'iso8859-15', header=-1)
STOPWORDS = set()
for index, row in df_stopwords.iterrows():
    STOPWORDS.add(row.iloc[0])  # the cell value; str(row) would include the index and dtype
print(STOPWORDS)
It looks like you need to convert the values in your column to a list and then use that list as your stop words.
stopwords = df_stopwords[0].tolist()  # with header=None the column label is the integer 0
--> ['me', 'yes', 'it']
As mentioned in the accepted answer here, you might want to use itertuples() since it is faster:
STOPWORDS = set()
for index, row in df_stopwords.itertuples():
    STOPWORDS.add(row)
print(STOPWORDS)
(Unpacking index, row like this works because the frame has a single column, so each namedtuple has exactly two fields.)
Problem Overview:
I am attempting to clean stock data loaded from a CSV file into a pandas DataFrame. The indexing operation I perform works: if I call print, I can see the values I want being pulled from the frame. However, when I try to replace the values, as shown in the screenshot, pandas ignores my request. Ultimately, I'm just trying to extract a value out of one column and move it to another. The pandas documentation suggests using the .replace() method, but that doesn't seem to work for the operation I'm trying to perform.
Here's a pic of the code and data before and after the code is run.
And the for loop (as referenced in the pic):
for i, j in zip(all_exchanges['MarketCap'], all_exchanges['MarketCapSym']):
    if 'M' in i:
        j = j.replace('n/a', 'M')
    elif 'B' in i:
        j = j.replace('n/a', 'M')
The problem is that j is a string, and strings are immutable: you're replacing data in a local variable, not in the original dataset.
You have to do it another way, less elegant, without zip (I simplified your test, by the way, since both branches did the same thing):
aem = all_exchanges['MarketCap']
aems = all_exchanges['MarketCapSym']
for i in range(min(len(aem), len(aems))):  # like zip: stop at the shorter of the two
    if 'M' in aem[i] or 'B' in aem[i]:
        aems[i] = aems[i].replace('n/a', 'M')
Now you're replacing in the original dataset.
If both columns are in the same dataframe, all_exchanges, iterate over the rows.
for i, row in all_exchanges.iterrows():
    # get whatever you want from row
    # using the index you can set a value back on the frame
    all_exchanges.loc[i, 'columnname'] = xyz
That should be the syntax, if I remember correctly ;)
Here is a quite exhaustive tutorial on missing values in pandas. I suggest using fillna():
df['MarketCap'].fillna('M', inplace=True)
df['MarketCapSym'].fillna('M', inplace=True)
Avoid iterating if you can. As already pointed out, you're not modifying the original data. Build a boolean mask on the MarketCap column and perform the replacement as follows.
mask = all_exchanges['MarketCap'].str.contains('M|B')
# overwrites any data in the MarketCapSym column
all_exchanges.loc[mask, 'MarketCapSym'] = 'M'
# only replaces 'n/a'
all_exchanges.loc[mask, 'MarketCapSym'] = all_exchanges.loc[mask, 'MarketCapSym'].replace({'n/a': 'M'})
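A runnable sketch of the mask-based replacement, with a small made-up frame standing in for the CSV data:

```python
import pandas as pd

all_exchanges = pd.DataFrame({'MarketCap':    ['$10.5M', '$2B', 'n/a'],
                              'MarketCapSym': ['n/a',    'n/a', 'n/a']})

# rows whose MarketCap mentions millions or billions
mask = all_exchanges['MarketCap'].str.contains('M|B')

# write through .loc so the original frame is modified, not a copy
all_exchanges.loc[mask, 'MarketCapSym'] = 'M'
print(all_exchanges)
```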
Thanks to all who posted. After thinking about your solutions and the problem a bit longer, I realized there might be a different approach. Instead of initializing a MarketCapSym column with 'n/a', I created that column as a copy of MarketCap and then stripped out everything that wasn't an "M" or "B".
I was able to get the solution down to one line:
all_exchanges['MarketCapSymbol'] = [ re.sub('[$.0-9]', '', i) for i in all_exchanges.loc[:,'MarketCap'] ]
A breakdown of the solution is as follows:
all_exchanges['MarketCapSymbol'] = - make a new column on the DataFrame called 'MarketCapSymbol'.
for i in all_exchanges.loc[:,'MarketCap'] - iterate over each value in the existing 'MarketCap' column.
re.sub('[$.0-9]', '', i) - since all I want is the 'M' or 'B', strip the characters [$.0-9] from each element, leaving only the M|B.
Using a list comprehension this way seemed a bit more natural and readable to me in my limited experience with pandas. Let me know what you think!
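Here is that one-liner run on a small made-up frame (plus the import re it needs); the character class [$.0-9] deletes dollar signs, dots, and digits, leaving just the suffix letter:

```python
import re
import pandas as pd

all_exchanges = pd.DataFrame({'MarketCap': ['$10.5M', '$2B']})

# strip '$', '.', and digits from each value, keeping only the M/B suffix
all_exchanges['MarketCapSymbol'] = [re.sub('[$.0-9]', '', i)
                                    for i in all_exchanges.loc[:, 'MarketCap']]
print(all_exchanges)
```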