This is my code, where I need to access a certain tuple from the DataFrame df. Can you please help me with this? I can't find an answer to this issue.
import pandas as pd
import openpyxl
df_sheet_index = pd.read_excel("path/to/excel/file.xlsx")
df = df_sheet_index.itertuples()
for tuple in df:
    print(tuple)
This is the output
Pandas(Index=0, _1=nan, _2=nan, _3=nan, _4=nan, _5=nan, _6=nan, _7=nan, _8=nan, _9=nan, _10=nan, _11=nan, _12=nan, _13=nan, _14=nan, _15=nan, _16=nan, _17=nan, _18=nan, _19=nan, _20=nan, _21=nan, _22=nan)
Pandas(Index=1, _1=nan, _2=nan, _3=nan, _4=nan, _5=nan, _6=nan, _7=nan, _8=nan, _9=nan, _10=nan, _11=nan, _12=nan, _13=nan, _14=nan, _15=nan, _16=nan, _17=nan, _18=nan, _19=nan, _20=nan, _21=nan, _22=nan)
Pandas(Index=2, _1=nan, _2=nan, _3=nan, _4=nan, _5=nan, _6=nan, _7=nan, _8=nan, _9=nan, _10=nan, _11=nan, _12=nan, _13=nan, _14=nan, _15=nan, _16=nan, _17=nan, _18=nan, _19=nan, _20=nan, _21=nan, _22=nan)
...
EDIT: As a general rule, you should use pandas' built-in functions to search a DataFrame rather than iterating over it. They are more efficient and more readable.
But if you really want to access the tuples:
target_index = 10
for tu in df.itertuples():
    if tu[0] == target_index:
        print(tu)
More generally, each row is a namedtuple, which behaves like a regular tuple, so you can access each element by its position: the index is tu[0], the first column is tu[1], the second is tu[2], and so on.
NOTE: do not use tuple as a variable name. It is not a reserved keyword, but it is the name of Python's built-in tuple type, and shadowing it can cause problems later in the same scope (on top of being bad practice).
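As a minimal sketch of the points above, the rows from itertuples() can be read either by position or by field name, and the same row can be fetched without a loop. The small DataFrame and its column names (city, score) are made up for illustration and stand in for the Excel data.

```python
import pandas as pd

# Hypothetical data standing in for the Excel sheet in the question.
df = pd.DataFrame({"city": ["a", "b", "c"], "score": [10, 20, 30]})

target_index = 1

# Iterating: each row is a namedtuple whose first field is the index.
matches = [row for row in df.itertuples() if row.Index == target_index]
row = matches[0]

by_position = (row[0], row[1], row[2])        # index, first column, second column
by_name = (row.Index, row.city, row.score)    # same values, accessed by field name

# Without iterating: label-based lookup of the same row (returns a Series).
direct = df.loc[target_index]
```

For a single known index label, the df.loc lookup is usually the simpler choice.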
If you are trying to get an element at a specific position, you can use .iloc.
It is indexed with square brackets and takes the row and column positions.
df.iloc[-1]["column"]
This gets the element in the last row of that column.
For label-based access, use df.loc["row", "column"].
df.loc[[label]] returns a DataFrame, while df.loc[label] returns a Series.
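A short sketch of the difference between the two indexers, using a made-up DataFrame (the labels r0, r1, r2 and the column names are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame(
    {"column": [1, 2, 3], "other": [4, 5, 6]},
    index=["r0", "r1", "r2"],
)

# Position-based: last row, position of the column named "column".
last_by_position = df.iloc[-1, df.columns.get_loc("column")]

# Label-based: row label and column label.
last_by_label = df.loc["r2", "column"]

# A single label gives a Series; a list of labels gives a DataFrame.
as_series = df.loc["r0"]
as_frame = df.loc[["r0"]]
```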
Hello, everyone! New student of Python's Pandas here.
I have a dataframe I artificially constructed here: https://i.stack.imgur.com/cWgiB.png. Below is a text reconstruction.
df_dict = {
    'header0' : [55,12,13,14,15],
    'header1' : [21,22,23,24,25],
    'header2' : [31,32,55,34,35],
    'header3' : [41,42,43,44,45],
    'header4' : [51,52,53,54,33]
}
index_list = {
    0:'index0',
    1:'index1',
    2:'index2',
    3:'index3',
    4:'index4'
}
df = pd.DataFrame(df_dict).rename(index = index_list)
GOAL:
I want to pull the index row(s) and column header(s) of any ARBITRARY value(s) (int, float, str, etc.). For example, if I look for the value 55, the code should return header0, index0, header2, index2 in some format: a list, a tuple, printed output, etc.
CLARIFICATIONS:
Imagine the dataframe is of a large enough size that I cannot "just find it manually"
I do not know how large this value is in comparison to other values (so a "simple .idxmax()" probably won't cut it)
I do not know where this value is column or index wise (so "just .loc,.iloc where the value is" won't help either)
I do not know whether this value has duplicates or not, but if it does, return all its column/indexes.
WHAT I'VE TRIED SO FAR:
I've played around with .columns, .index, .loc, but just can't seem to get the answer. The farthest I've gotten is creating a boolean dataframe with df.values == 55 or df == 55, but cannot seem to do anything with it.
The other approach that got me furthest is df.unstack().idxmax(), which returns a tuple of the column and index, but it has 2 major problems:
Only returns the max/min as per the .idxmax(), .idxmin() functions
Only returns the FIRST column/index matching my value, which doesn't help if there are duplicates
I know I could do a for loop to iterate through the entire dataframe, tracking which column and index I am on in temporary variables. Once I hit the value I am looking for, I'll break and return the current column and index. Was just hoping there was a less brute-force-y method out there, since I'd like a "high-speed calculation" method that would work on any dataframe of any size.
Thanks.
EDIT: Added text database, clarified questions.
Use np.where:
r, c = np.where(df == 55)
list(zip(df.index[r], df.columns[c]))
Output:
[('index0', 'header0'), ('index2', 'header2')]
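For reference, here is a self-contained sketch of the np.where approach run against the DataFrame from the question. The helper name locate is made up for illustration; it is not a pandas or NumPy function.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "header0": [55, 12, 13, 14, 15],
        "header1": [21, 22, 23, 24, 25],
        "header2": [31, 32, 55, 34, 35],
        "header3": [41, 42, 43, 44, 45],
        "header4": [51, 52, 53, 54, 33],
    },
    index=["index0", "index1", "index2", "index3", "index4"],
)


def locate(frame, value):
    """Return (row label, column label) pairs for every cell equal to value."""
    # Boolean mask of matches, then the integer positions of the True cells.
    rows, cols = np.where((frame == value).to_numpy())
    return list(zip(frame.index[rows], frame.columns[cols]))


hits = locate(df, 55)
```

Because the comparison and the position lookup are both vectorized, this handles duplicates and avoids an explicit Python loop over cells.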
There is a function in pandas that gives duplicate rows.
duplicate = df[df.duplicated()]
print(duplicate)
Use DataFrame.unstack to get a Series with a MultiIndex, and then filter duplicates with Series.duplicated and keep=False:
s = df.unstack()
out = s[s.duplicated(keep=False)].index.tolist()
If you also need the duplicated values:
df1 = (s[s.duplicated(keep=False)]
       .sort_values()
       .rename_axis(['cols', 'idx'])
       .reset_index(name='val'))
If you need to test for a specific value, change the mask to Series.eq (==):
s = df.unstack()
out = s[s.eq(55)].index.tolist()
The code below does iterate, but only over the columns rather than over the whole DataFrame. For each column it uses .any() to check whether the desired value is present, then uses .loc to select the matching rows and prints their index labels.
wanted_value = 55
for col in df.columns:
    matches = df[col].eq(wanted_value)
    if matches.any():
        print("row:", *df.loc[matches].index, ' col', col)
I'm trying to replace every occurrence of an empty list [] in my script output with an empty cell value, but am struggling with identifying what object it is.
So the data output after running .to_excel looks like:
Now the data originally exists in JSON format and I'm normalizing it with data_normalized = pd.json_normalize(data). I'm trying to filter out the empty lists occurrences right after that with filtered = data_normalized.loc[data_normalized['focuses'] == []] but that isn't working. I've also tried filtered = data_normalized.loc[data_normalized['focuses'] == '[]']
The dtype for the column focuses is object, if that helps. So I'm stuck as to how to select this data.
Eventually, I want to just instead run data_normalized.replace('[]', '') but with the first parameter updated so that I can select the empty lists properly.
You could try to cast the df to string type with pd.DataFrame.astype(str), and then do the replace with regex parameter as False:
df.astype(str).replace('[]','',regex=False)
Example:
df=pd.DataFrame({'a':[[],1,2,3]})
df.astype(str).replace('[]','',regex=False)
   a
0
1  1
2  2
3  3
I have very little experience with pandas, but since you cannot identify the object, try converting the column to strings and then comparing it to '[]'.
For example, try this:
filtered = data_normalized.loc[data_normalized['focuses'].astype(str) == '[]']
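An alternative to the string conversion is to test each cell directly. Comparing the column with == [] does not work as hoped because pandas treats a list on the right-hand side as array-like data to compare element by element, not as a single value. The sketch below builds a boolean mask with apply instead; the sample data is made up and only stands in for the normalized JSON.

```python
import pandas as pd

# Hypothetical stand-in for the output of pd.json_normalize(data).
data_normalized = pd.DataFrame(
    {"id": [1, 2, 3], "focuses": [[], ["a", "b"], []]}
)

# True where the cell holds a list with no elements.
is_empty_list = data_normalized["focuses"].apply(
    lambda v: isinstance(v, list) and len(v) == 0
)

# Select the rows with empty lists.
filtered = data_normalized.loc[is_empty_list]

# Or replace the empty lists with an empty string, leaving other cells untouched.
cleaned = data_normalized.copy()
cleaned.loc[is_empty_list, "focuses"] = ""
```

Unlike astype(str), this keeps the non-empty lists as lists rather than turning every cell into a string.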
For a dataframe with an indexed column with repeated indexes, I'm trying to get the maximum value found in a different column, by index, and assign it to a third column, so that for any given row, we can see the maximum value found in any row with the same index.
I'm doing this over a very large data set and would like it to be vectorized if possible. For now, I can't get it to work at all
multiindexDF = pd.DataFrame([[1,2,3,3,4,4,4,4],[5,6,7,10,15,11,25,89]]).transpose()
multiindexDF.columns = ['theIndex','theValue']
multiindexDF['maxValuePerIndex'] = 0
uniqueIndices = multiindexDF['theIndex'].unique()
for i in uniqueIndices:
    matchingIndices = multiindexDF['theIndex'] == i
    maxValue = multiindexDF[matchingIndices == i]['theValue'].max()
    multiindexDF.loc[matchingIndices]['maxValuePerIndex'] = maxValue
This fails, telling me I should use .loc, when I'm already using it. Not sure what the error means, and not sure how I can fix this so I don't have to loop through everything so I can vectorize it instead
I'm looking for this
targetDF = pd.DataFrame([[1,2,3,3,4,4,4,4],[5,6,10,7,15,11,25,89],[5,6,10,10,89,89,89,89]]).transpose()
targetDF
This looks like a good case for groupby with transform, which computes the maximum value per index group and broadcasts it back onto the original rows (rather than the grouped index):
multiindexDF['maxValuePerIndex'] = multiindexDF.groupby("theIndex")["theValue"].transform("max")
The reason you're getting the SettingWithCopyWarning is that your .loc call takes a slice of a slice and sets the value there. See the two pairs of square brackets in:
multiindexDF.loc[matchingIndices]['maxValuePerIndex'] = maxValue
You're doing a .loc and then another [] after it in a chain, so pandas tries to assign the value to the intermediate slice rather than to the original DataFrame.
So using your original approach:
for i in uniqueIndices:
    matchingIndices = multiindexDF['theIndex'] == i
    maxValue = multiindexDF.loc[matchingIndices, 'theValue'].max()
    multiindexDF.loc[matchingIndices, 'maxValuePerIndex'] = maxValue
(Notice I've also changed the first .loc where you were incorrectly using the boolean index)
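As a quick check, the groupby/transform one-liner can be run on the input data from the question. The DataFrame is built here from a dict rather than by transposing, which is only a convenience for the sketch.

```python
import pandas as pd

# Same values as the question's multiindexDF.
multiindexDF = pd.DataFrame(
    {
        "theIndex": [1, 2, 3, 3, 4, 4, 4, 4],
        "theValue": [5, 6, 7, 10, 15, 11, 25, 89],
    }
)

# Per-group maximum, broadcast back to every row of the group.
multiindexDF["maxValuePerIndex"] = (
    multiindexDF.groupby("theIndex")["theValue"].transform("max")
)
```

The result has the same length and row order as the original frame, so it can be assigned directly as a new column.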
Problem Overview:
I am attempting to clean stock data loaded from a CSV file into a pandas DataFrame. The indexing operation I perform works. If I call print, I can see the values I want are being pulled from the frame. However, when I try to replace the values, as shown in the screenshot, pandas ignores my request. Ultimately, I'm just trying to extract a value out of one column and move it to another. The pandas documentation suggests using the .replace() method, but that doesn't seem to be working with the operation I'm trying to perform.
Here's a pic of the code and data before and after code is run.
And the for loop (as referenced in the pic):
for i, j in zip(all_exchanges['MarketCap'], all_exchanges['MarketCapSym']):
    if 'M' in i: j = j.replace('n/a','M')
    elif 'B' in i: j = j.replace('n/a','M')
The problem is that j is a string, thus immutable.
You're replacing data, but not in the original dataset.
You have to do it another way, less elegant, without zip (I simplified your test BTW since it did the same on both conditions):
aem = all_exchanges['MarketCap']
aems = all_exchanges['MarketCapSym']
for i in range(min(len(aem), len(aems))):  # like zip: shortest of both
    if 'M' in aem[i] or 'B' in aem[i]:
        aems[i] = aems[i].replace('n/a', 'M')
Now you're replacing in the original dataset.
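To see why the loop in the question has no effect, here is a small sketch with made-up sample values (the real data is only visible in the screenshot). Assigning to j inside the loop rebinds a local name to a new string; the column itself is never written to.

```python
import pandas as pd

# Hypothetical sample rows; the value formats are an assumption.
all_exchanges = pd.DataFrame(
    {
        "MarketCap": ["$1.5B", "$300M", "n/a"],
        "MarketCapSym": ["n/a", "n/a", "n/a"],
    }
)

for i, j in zip(all_exchanges["MarketCap"], all_exchanges["MarketCapSym"]):
    if "M" in i or "B" in i:
        # str.replace returns a new string; j now points at it,
        # but nothing is stored back into the DataFrame.
        j = j.replace("n/a", "M")

unchanged = all_exchanges["MarketCapSym"].tolist()
```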
If both columns are in the same DataFrame, all_exchanges, iterate over the rows.
for i, row in all_exchanges.iterrows():
    # get whatever you want from row
    # using the index label i, you can set a value
    all_exchanges.loc[i, 'columnname'] = xyz
That should be the syntax, if I remember correctly ;)
Here is a quite exhaustive tutorial on missing values in pandas. I suggest using fillna():
df['MarketCap'] = df['MarketCap'].fillna('M')
df['MarketCapSym'] = df['MarketCapSym'].fillna('M')
Avoid iterating if you can. As already pointed out, you're not modifying the original data. Index on the MarketCap column and perform the replace as follows.
mask = all_exchanges['MarketCap'].str.contains('M|B')
# overwrites any data in the MarketCapSym column
all_exchanges.loc[mask, 'MarketCapSym'] = 'M'
# only replaces 'n/a'
all_exchanges.loc[mask, 'MarketCapSym'] = all_exchanges.loc[mask, 'MarketCapSym'].replace({'n/a': 'M'})
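A runnable sketch of the mask-based approach, with both conditions folded into one boolean mask so that only 'n/a' entries are overwritten. The sample rows are invented for illustration.

```python
import pandas as pd

# Hypothetical sample rows; "x" marks a value that must not be overwritten.
all_exchanges = pd.DataFrame(
    {
        "MarketCap": ["$1.5B", "$300M", "n/a"],
        "MarketCapSym": ["n/a", "x", "n/a"],
    }
)

# Rows whose MarketCap mentions M or B and whose symbol is still 'n/a'.
mask = (
    all_exchanges["MarketCap"].str.contains("M|B")
    & all_exchanges["MarketCapSym"].eq("n/a")
)

# A single .loc assignment writes to the original DataFrame.
all_exchanges.loc[mask, "MarketCapSym"] = "M"
```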
Thanks to all who posted. After thinking about your solutions and the problem a bit longer, I realized there might be a different approach. Instead of initializing a MarketCapSym column with 'n/a', I instead created that column as a copy of MarketCap and then extracted anything that wasn't an "M" or "B".
I was able to get the solution down to one line:
all_exchanges['MarketCapSymbol'] = [ re.sub('[$.0-9]', '', i) for i in all_exchanges.loc[:,'MarketCap'] ]
A breakdown of the solution is as follows:
all_exchanges['MarketCapSymbol'] = - Make a new column on the DataFrame called 'MarketCapSymbol'.
all_exchanges.loc[:,'MarketCap'] - Take the values for the new column from 'MarketCap'.
re.sub('[$.0-9]', '', i) for i in - Since all I want is the 'M' or 'B', apply re.sub() to each element, removing the characters matched by [$.0-9] and leaving only the M or B.
Using a list comprehension this way seemed a bit more natural / readable to me in my limited experience with PANDAS. Let me know what you think!
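For comparison, the same character stripping can be written with the pandas string accessor instead of a list comprehension. This is a swapped-in alternative, not the poster's method, and the sample values are assumptions about the data format.

```python
import pandas as pd

# Hypothetical sample values in the assumed "$<number><M|B>" format.
all_exchanges = pd.DataFrame({"MarketCap": ["$1.5B", "$300M", "$12.75M"]})

# Remove dollar signs, dots and digits from every element in one call.
all_exchanges["MarketCapSymbol"] = all_exchanges["MarketCap"].str.replace(
    r"[$.0-9]", "", regex=True
)
```

One practical difference is that str.replace passes missing values through as NaN, whereas re.sub inside a list comprehension raises a TypeError on a non-string element.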