I created a sub-DataFrame (drama_df) based on a criterion applied to the original DataFrame (df). However, I can't access a cell using the typical drama_df['summary'][0]; instead I get a KeyError: 0. I'm confused, since type(drama_df) is a DataFrame. What do I do? Note that df['summary'][0] does indeed return a string.
drama_df = df[df['drama'] > 0]

# Now we generate a lump of text from the summaries
drama_txt = ""
i = 0
while i < len(drama_df):
    drama_txt = drama_txt + " " + drama_df['summary'][i]
    i += 1
EDIT: Examples of df and drama_df were attached as screenshots (not reproduced here).
This will solve it for you:
drama_df['summary'].iloc[0]
When you created the sub-DataFrame, the rows kept the index labels they had in df, so label 0 was probably left behind. You need iloc to get the element by position rather than by index label (0).
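A minimal sketch of what happens, assuming a small df shaped like the one in the question:

import pandas as pd

df = pd.DataFrame({
    'summary': ['a comedy', 'a drama', 'another drama'],
    'drama': [0, 1, 1],
})

drama_df = df[df['drama'] > 0]
print(drama_df.index.tolist())      # [1, 2] -- label 0 is gone
print(drama_df['summary'].iloc[0])  # 'a drama' -- positional access works

# Alternatively, renumber the rows so label-based access starts at 0 again:
drama_df = drama_df.reset_index(drop=True)
print(drama_df['summary'][0])       # 'a drama'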
You can also use .iterrows() or .itertuples() for this kind of loop. itertuples() is a lot faster, but it is a bit more work to handle if you have a lot of columns:
# .iterrows() yields (index, row) pairs, so unpack both
for index, row in drama_df.iterrows():
    drama_txt = drama_txt + " " + row['summary']
To go faster:
for index, summary in drama_df[['summary']].itertuples():
    drama_txt = drama_txt + " " + summary
Wait a moment here. You are looking for the str.join() operation.
Simply do this:
drama_txt = ' '.join(drama_df['summary'])
Or:
drama_txt = drama_df['summary'].str.cat(sep=' ')
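One caveat, as my own addition: ' '.join() raises a TypeError if the column contains NaN or other non-string values, while str.cat() simply omits missing values when used this way.

# ' '.join() needs every element to be a string, so drop missing values first:
drama_txt = ' '.join(drama_df['summary'].dropna())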
data = pd.read_csv(r'RE_absentee_one.csv')
data['New_addy'] = str(data['Prop-House Number']) + data['Prop-Street Name'] + data['Prop-Mode'] + str(data['Prop-Apt Unit Number'])
df = pd.DataFrame(data, columns = ['Name','New_addy'])
So this is the code. As you can see, Prop-House Number and Prop-Apt Unit Number are both ints and the rest are strings; I am trying to combine all of these so that the full address is under one column labeled 'New_addy'. (As written, str(data['Prop-House Number']) stringifies the whole Series at once rather than each value.)
Convert each column to a string with map(str) before concatenating, as shown below:
data = pd.read_csv(r'RE_absentee_one.csv')
data['New_addy'] = data['Prop-House Number'].map(str) + data['Prop-Street Name'].map(str) + data['Prop-Mode'].map(str) + data['Prop-Apt Unit Number'].map(str)
#select the desired columns for further work
data = data[['Name','New_addy']]
One way is using list comprehension:
data['New_addy'] = [str(n) + street + mode + str(apt_n) for n, street, mode, apt_n in zip(
    data['Prop-House Number'], data['Prop-Street Name'], data['Prop-Mode'], data['Prop-Apt Unit Number'])]
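As a side note (my addition, not from the original answers): astype(str) does the same conversion, and explicit separators keep the address readable:

# Same idea with astype(str), joining the parts with spaces:
data['New_addy'] = (
    data['Prop-House Number'].astype(str) + ' '
    + data['Prop-Street Name'] + ' '
    + data['Prop-Mode'] + ' '
    + data['Prop-Apt Unit Number'].astype(str)
)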
I am new to pandas DataFrames and am trying to iterate over rows. I want to get the values of the next two rows and store them in variables. Here is the code snippet:
df = pd.read_csv('Input.csv')
for index, row in df.iterrows():
    row_1 = row['Word']
    row_2 = row['Word'] + 1  # I know this is incorrect and it won't work
    row_3 = row['Word'] + 2  # I know this is incorrect and it won't work
    print(row_1, row_2, row_3)
I was hoping that, given Input.csv:
Word   <- column header
Hi
Hello
Some
Phone
keys
motion
entries
I want the following output:
Hi, Hello, Some
Hello, Some, Phone
Some, Phone, keys
Phone, keys, motion
keys, motion, entries
Any help is appreciated. Thank you.
You can simply use the iloc property; stop two rows before the end so the lookahead stays in bounds:
df = pd.read_csv('Input.csv')
for index in range(len(df) - 2):
    row_1 = df.iloc[index]['Word']
    row_2 = df.iloc[index + 1]['Word']
    row_3 = df.iloc[index + 2]['Word']
    print(row_1, row_2, row_3)
There are many ways to do it, but something simple like this, which makes use of pandas' vectorized operations, will work:
(df['Word'] + ', ' + df['Word'].shift(-1) + ', ' + df['Word'].shift(-2)).dropna()
I'm trying to remove duplicates from a DataFrame. Basically, the DataFrame contains two (or more) occurrences of a document. The duplicates can be found by comparing the descriptions of the documents. My logic was to find the duplicates, copy their data, and drop them from both the DataFrame and the DataFrame being iterated over. But it appears there are still duplicates; I think it is because of the drop, but I don't know how to fix it. In the screenshot (not reproduced here), the description is in green; I need to drop one of the two rows and fuse everything that is in black.
For example:
URL1 + URL2|Explorimmo + Bien_ici|Apartment|Description
Unfortunately, I can't link the dataset.
file = pd.ExcelFile(mc.file_path)
df = pd.read_excel(file)
description_duplicate = df.loc[df.duplicated(['DESCRIPTION']) == True]
for idx1, clean in description_duplicate.iterrows():
    for idx2, dirty in description_duplicate.iterrows():
        if idx1 != idx2:
            if clean['DESCRIPTION'] == dirty['DESCRIPTION']:
                clean['CRAWL_SOURCE'] = clean['CRAWL_SOURCE'] + " / " + dirty['CRAWL_SOURCE']
                clean['URL'] = clean['URL'] + " / " + dirty['URL']
                description_duplicate = description_duplicate.drop(idx2)
                df = df.drop(idx2)
                df[idx1] = clean
You only need to remove duplicates with the pandas.DataFrame.drop_duplicates() function:
df.drop_duplicates(subset='DESCRIPTION', inplace=True)
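Note, as an addition: drop_duplicates() on its own discards the extra rows' CRAWL_SOURCE and URL values. If you also want to fuse those fields as described in the question, a groupby sketch along these lines should work (column names taken from the question, untested against the real data):

# Collapse rows sharing a DESCRIPTION, joining the other fields with " / ":
df = (
    df.groupby('DESCRIPTION', as_index=False, sort=False)
      .agg({
          'CRAWL_SOURCE': ' / '.join,
          'URL': ' / '.join,
          # list any other columns here with 'first' to keep a single value
      })
)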
I have a DataFrame with over 30 columns. I am doing various modifications on specific columns and would like to find a way to avoid having to list the specific columns every time. Is there a shortcut?
For example:
matrix_bus_filled.loc[matrix_bus_filled['FNR'] == 'AB1122', ["Ice", "Tartlet", "Pain","Fruit","Club","Focaccia","SW of Month","Salad + Dressing","Planchette + bread","Muffin"]] = matrix_bus_filled[matrix_bus_filled['FNR'] == 'AB1120'][["Ice", "Tartlet", "Pain","Fruit","Club","Focaccia","SW of Month","Salad + Dressing","Planchette + bread","Muffin"]].values
Could I define the term "SpecificColumns" once and then paste it here?
matrix_bus_filled.loc[matrix_bus_filled['FNR'] == 'AB1122', ["SpecificColumns"]] = matrix_bus_filled[matrix_bus_filled['Flight Number'] == 'AB1120'][["SpecificColumns"]].values
And here
matrix_bus_filled [["SpecificColumns"]] = matrix_bus_filled [["SpecificColumns"]].apply(scale, axis=1)
Just define a list and use it to select the columns.
specific_columns = ["Ice", "Tartlet", "Pain","Fruit","Club","Focaccia","SW of Month","Salad + Dressing","Planchette + bread","Muffin"]
matrix_bus_filled[specific_columns] = matrix_bus_filled[specific_columns].apply(scale, axis=1)
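The same list drops into the row-conditional assignment from the question (a sketch using the question's own column names):

mask_target = matrix_bus_filled['FNR'] == 'AB1122'
mask_source = matrix_bus_filled['FNR'] == 'AB1120'
matrix_bus_filled.loc[mask_target, specific_columns] = \
    matrix_bus_filled.loc[mask_source, specific_columns].values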
I am trying to format an Excel sheet from Python, using a function like this:
def highlight_myrow_cells(sheetnumber, Sheetname, dataFrame):
    Pre_Out_df_ncol = dataFrame.shape[1]
    RequiredCol_let = colnum_num_string(Pre_Out_df_ncol)

    # Identifying the rows that need to be highlighted
    arr = (dataFrame.select_dtypes(include=[bool])).eq(False).any(axis=1).values
    ReqRows = np.arange(1, len(dataFrame) + 1)[arr].tolist()
    # ReqRows is a list of values, something like [1, 2, 3, 5, 6, 8, 10]

    print("Highlighting the Sheet " + Sheetname + " in the output workbook")

    # Program is too slow over here ---
    for i in range(len(ReqRows)):
        j = ReqRows[i] + 1
        xlwb1.sheets(sheetnumber).Range('A' + str(j) + ":" + RequiredCol_let + str(j)).Interior.ColorIndex = 6

    xlwb1.sheets(sheetnumber).Columns.AutoFit()

    for i in range(1, Emergency_df.shape[1]):
        j = i - 1
        RequiredCol_let = colnum_num_string(i)
        Required_Column_Name = Emergency_df.columns[j]
        DateChecker1 = contains_word(Required_Column_Name, "Date", "of Death", "Day of Work")
        ResultChecker = Required_Column_Name.startswith("Result")
        if ResultChecker == False:
            if DateChecker1 == True:
                xlwb1.sheets(sheetnumber).Range(Required_Column_Name + ":" + Required_Column_Name).NumberFormat = "m/d/yyyy"
The program is too slow while highlighting the rows. From what I understand of Excel, the speed is quite good if you highlight a whole range of rows at once, rather than one row after another. I am not looking to do this with an external library like stylewriter etc.
Since you can't use threading, I would just cut down the time needed to execute each loop. The methods I know of would look something like:
ReqRows = [r + 1 for r in ReqRows]  # shift to 1-based worksheet rows
for row in ReqRows:
    xlwb1.sheets(sheetnumber).Range('A{0}:{1}{0}'.format(row, RequiredCol_let)).Interior.ColorIndex = 6
xlwb1.sheets(sheetnumber).Columns.AutoFit()
This should speed up your loop (albeit probably not nearly as much as threading). Hope this helps solve your problem!
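Building on the question's own observation that one Range call over many rows beats many single-row calls, here is a sketch (my addition, assuming the same xlwb1 workbook object and column letter from the question) that groups consecutive row numbers and highlights each run with a single call:

from itertools import groupby

def highlight_row_runs(sheet, req_rows, last_col_letter):
    # Group consecutive integers: [2, 3, 4, 7, 8] -> runs (2..4) and (7..8)
    for _, run in groupby(enumerate(sorted(req_rows)), key=lambda p: p[1] - p[0]):
        rows = [r for _, r in run]
        first, last = rows[0], rows[-1]
        sheet.Range('A{}:{}{}'.format(first, last_col_letter, last)).Interior.ColorIndex = 6

# Usage with the objects from the question:
# highlight_row_runs(xlwb1.sheets(sheetnumber), [r + 1 for r in ReqRows], RequiredCol_let)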