Remove excess info from dataframe - python

I have a big data frame that contains 6 columns. When I want to print the info out of one cell, I use the following code:
df = pd.read_excel(Path_files_data)
info_rol = df.loc[df.Rank == Ranknumber]
print(info_rol['Art_Nr'])
Here Rank is the column that gives the rank of every item, and Ranknumber is the rank of the item I try to look up. What I get back looks like this:
0    10399
Name: Art_Nr, dtype: object
Here 0 is the rank and 10399 is the Art_Nr. How do I get it to print out only the Art_Nr and leave out all the clutter like dtype: object?
PS. I tried strip but that didn't work for me.

I think you need to select the first value of the Series with iat or iloc to get a scalar:
print(info_rol['Art_Nr'].iat[0])
print(info_rol['Art_Nr'].iloc[0])
For a string or numeric output you can also use the underlying values:
print(info_rol['Art_Nr'].values[0])
But filtering can return multiple values, and then the second, third, ... values are lost.
So converting to a list is a more general solution:
print(info_rol['Art_Nr'].tolist())
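For reference, a minimal sketch of the difference, using a made-up one-row filter result:
import pandas as pd

info_rol = pd.DataFrame({'Art_Nr': ['10399']})

print(info_rol['Art_Nr'].iat[0])    # 10399 - just the scalar
print(info_rol['Art_Nr'].tolist())  # ['10399'] - safe even if the filter matches several rows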


How to only extract Close prices from Dataframe Pandas

I can't find a method to loop over my data frame (df_yf), extract all the "Close" prices, and create a new df_adj. The df is grouped by coin ticker.
Initially, I tried something like the following, but it throws an error:
for i in range(len(df_yf.columns)):
    df_adj.append(df_yf[i]["Close"])
I also tried using .get and .filter, but those throw errors like
"list indices must be integers or slices, not str; perhaps you missed
a comma?"
EDIT!!
Thank you for the answers. It made me realize my mistake :D. I shouldn't group by tickers, so I changed it to group by prices (Low, Close, etc.) and then was able to simply extract the right columns by doing df_adj = df_yf["Close"], as was mentioned.
df_adj = np.array(df_yf["Close"])
A DataFrame extracts columns dict-style, and .values gives the ndarray form:
df_adj = df_yf["Close"].values
If you group by Tickers, you could use:
df_adj = pd.DataFrame()
for i in [ticker[0] for ticker in df_yf]:
    df_adj[i] = df_yf[i]['Close']
Result:
   Ticker1  Ticker2  Ticker3
0        1        1        1
1        3        3        3
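For completeness, a minimal self-contained sketch (tickers and prices made up) of a two-level (ticker, price) column layout, showing the loop above plus DataFrame.xs as a one-call alternative for taking a cross-section of the second column level:
import pandas as pd

# hypothetical (ticker, price) MultiIndex columns, as grouping by ticker produces
cols = pd.MultiIndex.from_product([['BTC', 'ETH'], ['Open', 'Close']])
df_yf = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=cols)

# loop approach: collect each ticker's Close column
df_adj = pd.DataFrame()
for ticker in df_yf.columns.levels[0]:
    df_adj[ticker] = df_yf[ticker]['Close']

# cross-section alternative: all Close columns in one call
df_adj_xs = df_yf.xs('Close', axis=1, level=1)
print(df_adj.equals(df_adj_xs))  # True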

Extract values from array type of column in pandas

I am trying to extract the location codes / product codes from a SQL table using pandas. The field is an array type, i.e. it has multiple values as a list within each row. I have to extract the product/location codes from those strings.
Here is a sample of the table
df.head()
Target_Type Constraints
45 ti_8188,to_8188,r_8188,trad_8188_1,to_9258,ti_9258,r_9258,trad_9258_1
45 ti_8188,to_8188,r_8188,trad_8188_1,trad_22420_1
45 ti_8894,trad_8894_0.2
Now I want to extract the numeric values of the codes. I also want to ignore the trailing float values after the 2nd underscore in the entries, i.e. ignore the _1, _0.2 etc.
Here is a sample output I want to achieve. It should be a unique list/df column of all the extracted values:
Target_Type_45_df.head()
Constraints
8188
9258
22420
8894
I have never worked with a nested/array type of column before. Any help would be appreciated.
You can use explode to bring each variable into a single cell, under one column:
df = df.explode('Constraints')
df['newConst'] = df['Constraints'].apply(lambda x: str(x).split('_')[1])
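A runnable sketch of that approach, assuming each Constraints cell arrives as a Python list (if it is a comma-separated string instead, split it first with str.split(',')); dropping duplicates at the end gives the unique codes asked for:
import pandas as pd

df = pd.DataFrame({
    'Target_Type': [45, 45],
    'Constraints': [['ti_8188', 'trad_8188_1', 'to_9258'],
                    ['ti_8894', 'trad_8894_0.2']],
})

df = df.explode('Constraints')  # one constraint per row
df['newConst'] = df['Constraints'].apply(lambda x: str(x).split('_')[1])
print(df['newConst'].unique())  # ['8188' '9258' '8894']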
I would think the following overall strategy would work well (you'll need to debug); a runnable sketch follows these steps:
Define a function that takes a row as input (the idea being to broadcast this function with the pandas .apply method).
In this function, set my_list = row['Constraints'].
Then do my_list = my_list.split(','). Now you have a list, with no commas.
Next, split with the underscore, take the second element (index 1), and convert to int:
numbers = [int(element.split('_')[1]) for element in my_list]
Finally, convert to set: return set(numbers)
The output for each row will be a set - just union all these sets together to get the final result.
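Putting those steps together, a minimal sketch using the sample rows from the question (assuming Constraints holds comma-separated strings):
import pandas as pd

df = pd.DataFrame({
    'Target_Type': [45, 45, 45],
    'Constraints': [
        'ti_8188,to_8188,r_8188,trad_8188_1,to_9258,ti_9258,r_9258,trad_9258_1',
        'ti_8188,to_8188,r_8188,trad_8188_1,trad_22420_1',
        'ti_8894,trad_8894_0.2',
    ],
})

def extract_codes(row):
    # split on commas, then keep the number right after the first underscore
    return {int(element.split('_')[1]) for element in row['Constraints'].split(',')}

# union the per-row sets into one collection of unique codes
all_codes = set().union(*df.apply(extract_codes, axis=1))
print(sorted(all_codes))  # [8188, 8894, 9258, 22420]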

Given an index label, how would you extract the index position in a dataframe?

New to Python, trying to take a csv and get the country that has the max number of gold medals. I can get the country name as an Index type but need a string value for the submission.
The csv has rows of countries as the index, and columns with stats.
ind = DataFrame.index.get_loc(index_result) doesn't work because it doesn't have a valid key.
If I run dataframe.loc[ind], it returns the entire row.
df = read_csv('csv', index_col=0, skiprows=1)
# for loop to get the most gold medals:
mostMedals = iterator
getIndex = df[df['medals'] == mostMedals].index  # check the column medals
# for the mostMedals cell to see what country won that many
ind = df.index.get_loc(getIndex)  # doesn't like the key
What I'm going for is to get the index position of getIndex so I can run something like dataframe.index[getIndex] and have it give me the string I need, but I can't figure out how to get that index position integer.
Expanding on my comments above, this is how I would approach it. There may be better/other ways; pandas is a pretty enormous library with lots of neat functionality that I don't know yet, either!
import pandas as pd

df = pd.read_csv('csv', index_col=0, skiprows=1)
max_medals = df['medals'].max()
countries = list(df.where(df['medals'] == max_medals).dropna().index)
Unpacking that expression, the where method returns a frame based on df that matches the condition expressed. dropna() tells us to remove any rows that are NaN values, and index returns the remaining row index. Finally, I wrap that all in list, which isn't strictly necessary but I prefer working with simple built-in types unless I have a greater need.
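As a minimal illustration with made-up medal counts (a tie shows why a list can come back with more than one country):
import pandas as pd

df = pd.DataFrame({'medals': [46, 27, 46]},
                  index=['USA', 'China', 'Norway'])

max_medals = df['medals'].max()
countries = list(df.where(df['medals'] == max_medals).dropna().index)
print(countries)  # ['USA', 'Norway'] - plain strings, ready for submission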

Pandas set values of multiple rows of a column

I have a dataframe (train) that has an Age column in it. This column has missing values. I have merged it with another dataframe, static_values, which also has an Age column. I am using the lines below to substitute the missing values for the Age column in the train df.
predicted_vals = pd.merge(static_vals, train, on=['Pclass','Sex'])
# num of missing values
predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_y'].isna().sum() # 177
predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_y'] = predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_x']
After running the above lines, I run the following to see if the values have been substituted-
predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_y']
And this is the output I get:
Series([], Name: Age_x, dtype: float64)
It's empty. No assignment has happened. The strange part is that when I check the values of the Age_x column after running the above lines, I get a blank there too.
>>> predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_x']
Series([], Name: Age_x, dtype: float64)
Below is what the column holds right before I run the lines where I am trying to assign the missing values
>>> predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_x'].head()
3 34.240964
8 34.240964
15 34.240964
25 34.240964
34 34.240964
I searched for similar questions here but they all deal with assigning a single value to many rows. I can't figure out what's wrong here. Any help?
Is there actually a problem here?
predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_y'] should be empty because you have filled the values! Try predicted_vals.loc[~predicted_vals['Age_y'].isna(),'Age_y']
This is an alternative solution, which avoids merging and handling column name suffixes. We align the 2 indices and use fillna to map from static_vals.
predicted_vals = predicted_vals.set_index(['Pclass','Sex'])
predicted_vals['Age'] = predicted_vals['Age'].fillna(static_vals.set_index(['Pclass','Sex'])['Age'])
predicted_vals = predicted_vals.reset_index()
If you would like to do an explicit merge, #jezrael's solution is the way to go.
I think you need combine_first:
predicted_vals['Age_y'] = predicted_vals['Age_y'].combine_first(predicted_vals['Age_x'])
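A tiny sketch of combine_first with made-up ages: it keeps Age_y where present and falls back to Age_x where Age_y is missing.
import numpy as np
import pandas as pd

predicted_vals = pd.DataFrame({'Age_y': [22.0, np.nan, 30.0],
                               'Age_x': [21.5, 34.2, 29.9]})

predicted_vals['Age_y'] = predicted_vals['Age_y'].combine_first(predicted_vals['Age_x'])
print(predicted_vals['Age_y'].tolist())  # [22.0, 34.2, 30.0]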

What's wrong with fillna in Pandas running twice?

I'm new to Pandas and NumPy. I was trying to solve the Kaggle | Titanic Dataset. Now I have to fix two columns, "Age" and "Embarked", because they contain NaN.
I tried fillna without any success, only to discover that I was missing inplace=True.
Now I've attached it, but while the first imputation was successful, the second one was not. I tried searching SO and Google, but did not find anything useful. Please help me.
Here's the code that I was trying.
# imputing "Age" with mean
titanic_df["Age"].fillna(titanic_df["Age"].mean(), inplace = True)
# imputing "Embarked" with mode
titanic_df["Embarked"].fillna(titanic_df["Embarked"].mode(), inplace = True)
print titanic_df["Age"][titanic_df["Age"].isnull()].size
print titanic_df["Embarked"][titanic_df["Embarked"].isnull()].size
and I got the output as
0
2
However I managed to get what I want without using inplace=True
titanic_df["Age"] =titanic_df["Age"].fillna(titanic_df["Age"].mean())
titanic_df["Embarked"] = titanic_df.fillna(titanic_df["Embarked"].mode())
But I am curious about what's going on with the second usage of inplace=True.
Please bear with me if I'm asking something extremely stupid; I'm totally new and may miss small things. Any help is appreciated. Thanks in advance.
pd.Series.mode returns a Series.
A variable has a single arithmetic mean and a single median but it may have several modes. If more than one value has the highest frequency, there will be multiple modes.
pandas operates on labels.
titanic_df.mean()
Out:
PassengerId    446.000000
Survived         0.383838
Pclass           2.308642
Age             29.699118
SibSp            0.523008
Parch            0.381594
Fare            32.204208
dtype: float64
If I were to use titanic_df.fillna(titanic_df.mean()) it would return a new DataFrame where the column PassengerId is filled with 446.0, column Survived is filled with 0.38 and so on.
However, if I call the mean method on a Series, the returning value is a float:
titanic_df['Age'].mean()
Out: 29.69911764705882
There is no label associated here. So if I use titanic_df.fillna(titanic_df['Age'].mean()) all the missing values in all the columns will be filled with 29.699.
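A tiny sketch of that label alignment, on a made-up frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 4.0]})

# df.mean() is labelled, so each column is filled with its own mean
print(df.fillna(df.mean()))

# a scalar has no label, so every column's NaN gets the same value
print(df.fillna(df['a'].mean()))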
Why the first attempt was not successful
You tried to fill the entire DataFrame, titanic_df with titanic_df["Embarked"].mode(). Let's check the output first:
titanic_df["Embarked"].mode()
Out:
0    S
dtype: object
It is a Series with a single element. The index is 0 and the value is S. Now, remember how it would work if we used titanic_df.mean() to fill: it would fill each column with the corresponding mean value. Here, we only have one label. So it will only fill values if we have a column named 0. Try adding df[0] = np.nan and executing your code again. You'll see that the new column is filled with S.
Why the second attempt was (not) successful
The right hand side of the equation, titanic_df.fillna(titanic_df["Embarked"].mode()) returns a new DataFrame. In this new DataFrame, Embarked column still has nan's:
titanic_df.fillna(titanic_df["Embarked"].mode())['Embarked'].isnull().sum()
Out: 2
However you didn't assign it back to the entire DataFrame. You assigned this DataFrame to a Series - titanic_df['Embarked']. It didn't actually fill the missing values in the Embarked column, it just used the index values of the DataFrame. If you actually check the new column, you'll see numbers 1, 2, ... instead of S, C and Q.
What you should do instead
You are trying to fill a single column with a single value. First, disassociate that value from its label:
titanic_df['Embarked'].mode()[0]
Out: 'S'
Now, it is not important if you use inplace=True or assign the result back. Both
titanic_df['Embarked'] = titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0])
and
titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0], inplace=True)
will fill the missing values in the Embarked column with S.
Of course this assumes you want to use the first value if there are multiple modes. You may need to improve your algorithm there (for example randomly select from the values if there are multiple modes).
