I have created a new dataframe:
import pandas as pd
###creating a dataframe:
bb=pd.DataFrame(columns = ['INDate', 'INCOME', 'EXDate','EXPENSE'])
bb.to_excel('/py/deleteafter/bb_black_book.xlsx')
bb.head()
I can see the new dataframe has no rows.
Then I need to add a new value to one of the columns in a loop.
income_value = message.text  ### depends on the user input
for i in range(len(bb)):
    print(bb['INCOME'][i])
    if bb['INCOME'][i] != 'NaN':
        i += 1
        #print('NOT_EMPTY_CELL')
    else:
        #print('ive found an empty cell=)')
        bb['INCOME'][i] = income_value
        break
And here I hit an error, because my df has zero length:
print(range(len(bb)))
range(0, 0)
I'm not sure my solution is right, and I'm sure there must be a simpler one. Overall, my main question is:
How can I find the next empty cell in a certain column (in my case, column 'INCOME') and add the value to that FREE cell?
Or, more simply: I need to add a value to the next unfilled cell =)
I will be glad for your replies.
To find the last valid value you can use last_valid_index(). It returns None when the column has no valid values yet, so you could do:
import numpy as np

idx = bb["INCOME"].last_valid_index()
if idx is None or np.isnan(idx):
    bb.loc[0, "INCOME"] = income_value
else:
    bb.loc[idx + 1, "INCOME"] = income_value
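For example, folding that check into a helper (the helper name is mine; since last_valid_index() returns None on an empty column, checking for None is enough here):
import pandas as pd

bb = pd.DataFrame(columns=['INDate', 'INCOME', 'EXDate', 'EXPENSE'])

def add_income(bb, income_value):
    # Write income_value into the first free row of 'INCOME'.
    idx = bb["INCOME"].last_valid_index()
    bb.loc[0 if idx is None else idx + 1, "INCOME"] = income_value

add_income(bb, 15000)
add_income(bb, 20000)
print(bb["INCOME"].tolist())  # [15000, 20000]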
There is a much simpler way to append a row, e.g.:
row = {'INCOME': 20000, 'EXDate': '14/09/2020'}
bb = bb.append(row, ignore_index=True)
Note that DataFrame.append was removed in pandas 2.0; pd.concat([bb, pd.DataFrame([row])], ignore_index=True) is the modern equivalent.
If you want to add data in one column only, at an index that doesn't exist yet, use loc.
Change the line
bb['INCOME'][i]=income_value
to
bb.loc[i,'INCOME']=income_value
and it should work fine.
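Putting that together with a real missing-value check (the original compares against the string 'NaN', which never matches an actual NaN), a minimal sketch of the corrected loop; income_value stands in for the user input:
import pandas as pd

bb = pd.DataFrame(columns=['INDate', 'INCOME', 'EXDate', 'EXPENSE'])
bb.loc[0] = [None, None, None, None]  # one empty row so the loop has something to scan
income_value = 15000  # stand-in for message.text

for i in range(len(bb)):
    if pd.isna(bb['INCOME'][i]):  # real missing-value check instead of != 'NaN'
        bb.loc[i, 'INCOME'] = income_value
        break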
I have data as follows:
import pandas as pd
url_cities="https://population.un.org/wup/Download/Files/WUP2018-F12-Cities_Over_300K.xls"
df_cities = pd.read_excel(url_cities)
print(df_cities.iloc[0:20,])
The column names can be found in row 15, but I would like this row number to be determined automatically. I thought the best way would be to take the first row for which fewer than 10 of the values are NaN.
Combining approaches from a couple of other answers, I do the following:
amount_Nan = df_cities.shape[1] - df_cities.count(axis=1)
# OR df.isnull().sum(axis=1).tolist()
print(amount_Nan)
col_names_index = next(i for i in amount_Nan if i < 3)
print(col_names_index)
df_cities.columns = df_cities.iloc[col_names_index]
The problem is that col_names_index keeps returning 0, while it should be 15. I think it is because iterating over amount_Nan yields its values rather than its row labels, so next(i for i in amount_Nan if i < 3) works differently than expected.
The thing is that I do not really understand why. Can anyone help?
IIUC, you can get the first index of a non-missing value in the second column with DataFrame.iloc, Series.notna and Series.idxmax, set the column names from that row, and then filter out everything up to and including it by position:
i = df_cities.iloc[:, 1].notna().idxmax()
df_cities.columns = df_cities.iloc[i].tolist()
df_cities = df_cities.iloc[i+1:]
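A self-contained sketch of that idea on a toy frame (the URL above needs a download, so the preamble rows here are fabricated):
import numpy as np
import pandas as pd

# Toy stand-in: two preamble rows that are NaN in the second column, then the header row.
raw = pd.DataFrame([
    ['WUP2018', np.nan,  np.nan],
    ['Notes',   np.nan,  np.nan],
    ['Index',   'City',  'Population'],
    [1,         'Tokyo', 37400068],
])

i = raw.iloc[:, 1].notna().idxmax()  # first row whose second column is non-NaN
raw.columns = raw.iloc[i].tolist()   # promote that row to the header
raw = raw.iloc[i + 1:]               # keep only the rows after it
print(raw)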
Hello, everyone! New student of Python's pandas here.
I have a dataframe I artificially constructed here: https://i.stack.imgur.com/cWgiB.png. Below is a text reconstruction.
import pandas as pd

df_dict = {
'header0' : [55,12,13,14,15],
'header1' : [21,22,23,24,25],
'header2' : [31,32,55,34,35],
'header3' : [41,42,43,44,45],
'header4' : [51,52,53,54,33]
}
index_list = {
0:'index0',
1:'index1',
2:'index2',
3:'index3',
4:'index4'
}
df = pd.DataFrame(df_dict).rename(index = index_list)
GOAL:
I want to pull the index row(s) and column header(s) of any ARBITRARY value(s) (int, float, str, etc.). So, for example, if I want the value 55, this code should return header0, index0, header2, index2 in some format; it could be a list, a tuple, printed output, etc.
CLARIFICATIONS:
Imagine the dataframe is of a large enough size that I cannot "just find it manually"
I do not know how large this value is in comparison to other values (so a "simple .idxmax()" probably won't cut it)
I do not know where this value is, column- or index-wise (so "just .loc/.iloc where the value is" won't help either)
I do not know whether this value has duplicates or not, but if it does, return all its column/indexes.
WHAT I'VE TRIED SO FAR:
I've played around with .columns, .index, .loc, but just can't seem to get the answer. The farthest I've gotten is creating a boolean dataframe with df.values == 55 or df == 55, but cannot seem to do anything with it.
Another "farthest" way I've gotten is using df.unstack.idxmax(), which would return a tuple of the column and header, but has 2 major problems:
Only returns the max/min as per the .idxmax(), .idxmin() functions
Only returns the FIRST column/index matching my value, which doesn't help if there are duplicates
I know I could do a for loop to iterate through the entire dataframe, tracking which column and index I am on in temporary variables. Once I hit the value I am looking for, I'll break and return the current column and index. Was just hoping there was a less brute-force-y method out there, since I'd like a "high-speed calculation" method that would work on any dataframe of any size.
Thanks.
EDIT: Added text database, clarified questions.
Use np.where:
import numpy as np

r, c = np.where(df == 55)
list(zip(df.index[r], df.columns[c]))
Output:
[('index0', 'header0'), ('index2', 'header2')]
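Wrapped into a small reusable helper (the function name is mine, not from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'header0': [55, 12, 13, 14, 15], 'header1': [21, 22, 23, 24, 25],
     'header2': [31, 32, 55, 34, 35], 'header3': [41, 42, 43, 44, 45],
     'header4': [51, 52, 53, 54, 33]},
    index=['index0', 'index1', 'index2', 'index3', 'index4'])

def locate(df, value):
    # Return every (index, column) pair where df equals value.
    r, c = np.where(df == value)
    return list(zip(df.index[r], df.columns[c]))

print(locate(df, 55))  # [('index0', 'header0'), ('index2', 'header2')]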
There is a function in pandas that gives duplicate rows.
duplicate = df[df.duplicated()]
print(duplicate)
Use DataFrame.unstack to get a Series with a MultiIndex, and then filter the duplicates with Series.duplicated using keep=False:
s = df.unstack()
out = s[s.duplicated(keep=False)].index.tolist()
If you also need the duplicates together with their values:
df1 = (s[s.duplicated(keep=False)]
         .sort_values()
         .rename_axis(['cols', 'idx'])
         .reset_index(name='val'))
If you need a specific value, change the mask to use Series.eq (==):
s = df.unstack()
out = s[s.eq(55)].index.tolist()
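Note that after df.unstack() the MultiIndex is (column, index), so with the example frame above the pairs come back column-first:
print(out)
# [('header0', 'index0'), ('header2', 'index2')]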
So, the code below does contain an iteration, but it doesn't iterate over the whole DataFrame: it just iterates over the columns, uses .any() to check whether a column contains the desired value at all, then uses loc to locate the value and print its index.
wanted_value = 55

for col in df.columns:
    if df[col].eq(wanted_value).any():
        print("row:", *df.loc[df[col].eq(wanted_value)].index, ' col', col)
I have a spreadsheet with fields containing a body of text.
I want to calculate the Gunning-Fog score on each row and have the value output to that same Excel file as a new column. To do that, I first need to calculate the score for each row. The code below works if I hard-code the text into the df variable. However, it does not work when I pull the field from the sheet (i.e., rfds) and pass that through to my r variable. I get the following error, even though the two fields I am testing contain 3,896 and 4,843 words respectively.
readability.exceptions.ReadabilityException: 100 words required.
Am I missing something obvious? Disclaimer, I am very new to python and coding in general! Any help is appreciated.
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
rfd = df["Item 1A"]
rfds = rfd.to_string() # to fix "TypeError: expected string or buffer"
r = Readability(rfds)
fog = r.gunning_fog()
print(fog.score)
TL;DR: You need to pass the cell value and are currently passing a column of cells.
The line rfd = df["Item 1A"] returns a whole column (a Series), not a single cell. rfd.to_string() then renders that entire column as one formatted string (index labels included) rather than giving you the text of one cell. This is also why the TypeError was thrown: a Series itself is neither a string nor a buffer.
Rather than taking a column and going down it, approach it from the other direction. Take the rows and then pull out the column:
for index, row in df.iterrows():
    print(row.iloc[2])
The [2] is the column index.
Now that a single cell's value is in hand, it can be passed to the Readability calculator:
r = Readability(row.iloc[2])
fog = r.gunning_fog()
print(fog.score)
Note that these can be combined together into one command:
print(Readability(row.iloc[2]).gunning_fog())
This shows you how commands can be chained together - which way you find it easier is up to you. The chaining is useful when you give it to something like apply or applymap.
Putting the whole thing together (the step by step way):
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
for index, row in df.iterrows():
    r = Readability(row.iloc[2])
    fog = r.gunning_fog()
    print(fog.score)
Or the clever way:
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
print(df["Item 1A"].apply(lambda x: Readability(x).gunning_fog()))
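One caveat: any row holding fewer than 100 words will raise the same ReadabilityException seen above. A sketch of one way to guard against that (the helper name is mine; the exception path comes from the traceback in the question):
from readability import Readability
from readability.exceptions import ReadabilityException
import pandas as pd

def safe_fog(text):
    # Return the Gunning-Fog score, or None for texts under 100 words.
    try:
        return Readability(str(text)).gunning_fog().score
    except ReadabilityException:
        return None

df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
df["fog"] = df["Item 1A"].apply(safe_fog)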
I'm new to Python.
I am trying to add a prefix to an element (Serial Number) within a DataFrame using a for loop, as part of data cleaning/preparation before analysis.
The code is
import pandas as pd

a = pd.read_excel('C:/Users/HP/Desktop/WFH/PowerBI/CMM data.xlsx', 'CMM_unclean')
a['Serial Number'] = a['Serial Number'].apply(str)
print(a.iloc[72, 1])

for index, row in a.iterrows():
    if len(row['Serial Number']) == 6:
        row['Serial Number'] = 'SR0' + row['Serial Number']
        print(row['Serial Number'])

print(a.iloc[72, 1])
The output is
C:\Users\HP\anaconda3\envs\test\python.exe C:/Users/HP/PycharmProjects/test/first.py
101306
SR0101306
101306
I don't understand why this is happening: inside the for loop the value changes, but outside the loop it is unchanged.
This will never change the actual dataframe named a.
TL;DR: The rows you get back from iterrows are copies that are no longer connected to the original data frame, so edits don't change your dataframe. However, you can use the index to access and edit the relevant row of the dataframe.
The solution is this:
import pandas as pd

a = pd.read_excel("Book1.xlsx")
a['Serial Number'] = a['Serial Number'].apply(str)
a.head()
#    ID Serial Number
# 0   1        101306
# 1   2       1101306

print(a.iloc[0, 1])
# 101306

for index, row in a.iterrows():
    if len(row['Serial Number']) == 6:
        # use the index and .loc to alter the dataframe itself, not the row copy
        a.loc[index, 'Serial Number'] = 'SR0' + row['Serial Number']

print(a.iloc[0, 1])
# SR0101306
In the documentation, I read (emphasis from there):
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
Maybe this means that in your case a copy is made rather than a reference used, so the change applies only to the copy and not to the data in the data frame.
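A minimal sketch demonstrating that behaviour on fabricated data (as in the question, the edit lands on the row copy, so the frame itself is unchanged):
import pandas as pd

df = pd.DataFrame({'Serial Number': ['101306', '1101306']})

for index, row in df.iterrows():
    row['Serial Number'] = 'SR0' + row['Serial Number']  # edits the copy only

print(df['Serial Number'].tolist())  # ['101306', '1101306'], unchanged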
Since you're already using apply, you could do this straight inside the function you call apply with:
def fix_serial(n):
    n_s = str(n)
    if len(n_s) == 6:
        n_s = 'SR0' + n_s
    return n_s

a['Serial Number'] = a['Serial Number'].apply(fix_serial)
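The same fix also works without a Python-level function, using a boolean mask on the string lengths; a self-contained sketch on fabricated data:
import pandas as pd

a = pd.DataFrame({'Serial Number': ['101306', '1101306']})

mask = a['Serial Number'].str.len() == 6
a.loc[mask, 'Serial Number'] = 'SR0' + a.loc[mask, 'Serial Number']
print(a['Serial Number'].tolist())  # ['SR0101306', '1101306']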
I have some DataFrames with information about some elements, for instance:
import pandas as pd

my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say:
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name+"_Group" with name looping over [my_df1, my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max = 16
Bad = []

for Sample in SampleList:
    for n in Sample[1]['Group']:
        # This is inelegant, but trying to make it work;
        # working with Sample[1] in the for doesn't work.
        df = Sample[0].loc[Sample[0]['Group'] == n]
        if df['Value'].max() > my_max:
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it doesn't add the column Bad_Row to my df, nor does it modify my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation against a criterion computed from the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it subsets your initial data and uses boolean logic to see if any of the 'Value' entries meet your specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'

my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])

grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x, grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
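To then keep only the good rows, the new column can be used as a filter and dropped afterwards, e.g.:
my_df1_Group = (my_df1_Group[my_df1_Group['Bad_Row'] == 'Good Row']
                .drop(columns='Bad_Row'))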
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]

for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)
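For what it's worth, the per-group check can also be done without the row-wise apply, using a vectorised groupby maximum; a sketch under the same names (an alternative I'm suggesting, not the accepted fix):
for Sample in SampleList:
    # Boolean Series indexed by Group: True where the group's max Value exceeds my_max.
    bad_groups = Sample[0].groupby('Group')['Value'].max() > my_max
    # Keep only the rows of the aggregated frame whose group is not flagged.
    Sample[1] = Sample[1][~Sample[1]['Group'].map(bad_groups).astype(bool)]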