I wrote the code below:
check = ['jonge man']
data.loc[
    data['Zoekterm'].str.contains(
        f"{'|'.join(check)}"
    ), "Zoekterm_new",
    data['Zoekterm']
]
I get a "Too many indexers" error.
What did I do wrong?
Use DataFrame.loc with the column name as the second argument, like:
data.loc[data['Zoekterm'].str.contains(f"{'|'.join(check)}"), "Zoekterm_new"]
If you need to assign values, add =, so Zoekterm_new receives the data from Zoekterm where the condition matches, else NaN:
data.loc[data['Zoekterm'].str.contains(f"{'|'.join(check)}"), "Zoekterm_new"] = data['Zoekterm']
This works like:
data["Zoekterm_new"] = np.where(data['Zoekterm'].str.contains(f"{'|'.join(check)}"), data['Zoekterm'], np.nan)
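As a runnable sketch of the pattern above, with a few made-up sample rows (only the 'Zoekterm' column name comes from the question):

```python
import pandas as pd

# Hypothetical sample data; 'Zoekterm' follows the question.
data = pd.DataFrame({'Zoekterm': ['jonge man', 'oude vrouw', 'jonge man fiets']})
check = ['jonge man']

# Rows matching the pattern get a copy of 'Zoekterm'; the rest stay NaN.
data.loc[data['Zoekterm'].str.contains('|'.join(check)), 'Zoekterm_new'] = data['Zoekterm']
print(data)
```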
I wrote the script below, and I'm 98% content with the output. However, the disorder of the 'Approved' field bugs me. As you can see, I tried to sort the values using .sort_values() but was unsuccessful. The output of the script is below, as is the list of fields in the data frame.
df = df.replace({'Citizen': {1: 'Yes', 0: 'No'}})
grp_by_citizen = pd.DataFrame(df.groupby(['Citizen']).agg({i: 'value_counts' for i in ['Approved']}).fillna(0))
grp_by_citizen.rename(columns = {'Approved': 'Count'}, inplace = True)
grp_by_citizen.sort_values(by = 'Approved')
grp_by_citizen
Do let me know if you need further clarification or insight as to my objective.
You need to reassign the result of sort_values or use inplace=True. From the documentation:
Returns: DataFrame or None
DataFrame with sorted values or None if inplace=True.
grp_by_citizen = grp_by_citizen.sort_values(by = 'Approved')
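A minimal sketch of why the reassignment matters, using a hypothetical stand-in for the grouped frame:

```python
import pandas as pd

# Hypothetical counts standing in for grp_by_citizen.
grp_by_citizen = pd.DataFrame({'Count': [3, 1, 2]}, index=['a', 'b', 'c'])

# sort_values returns a new DataFrame; without reassignment the sort is lost.
grp_by_citizen = grp_by_citizen.sort_values(by='Count')
print(grp_by_citizen)
```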
First go with:
f = A.columns.values.tolist()
to see what the actual names of your columns are. Then you can try:
A.sort_values(by=f[:2])
If you sort by column name, keep in mind that 2L is a Python 2 long int, so go with:
A.sort_values(by=[2L])
Hello, everyone! New student of Python's pandas here.
I have a dataframe I artificially constructed here: https://i.stack.imgur.com/cWgiB.png. Below is a text reconstruction.
df_dict = {
'header0' : [55,12,13,14,15],
'header1' : [21,22,23,24,25],
'header2' : [31,32,55,34,35],
'header3' : [41,42,43,44,45],
'header4' : [51,52,53,54,33]
}
index_list = {
0:'index0',
1:'index1',
2:'index2',
3:'index3',
4:'index4'
}
df = pd.DataFrame(df_dict).rename(index = index_list)
GOAL:
I want to pull the index row(s) and column header(s) of any ARBITRARY value(s) (int, float, str, etc.). So, for example, if I look for the value 55, the code should return header0, index0, header2, index2 in some format: a list, a tuple, a print, etc.
CLARIFICATIONS:
- Imagine the dataframe is large enough that I cannot "just find it manually".
- I do not know how large this value is in comparison to other values (so a simple .idxmax() probably won't cut it).
- I do not know where this value is, column- or index-wise (so "just .loc/.iloc where the value is" won't help either).
- I do not know whether this value has duplicates, but if it does, return all of its columns/indexes.
WHAT I'VE TRIED SO FAR:
I've played around with .columns, .index, .loc, but just can't seem to get the answer. The farthest I've gotten is creating a boolean dataframe with df.values == 55 or df == 55, but cannot seem to do anything with it.
Another "farthest" way I've gotten is using df.unstack().idxmax(), which would return a tuple of the column and header, but has 2 major problems:
Only returns the max/min as per the .idxmax(), .idxmin() functions
Only returns the FIRST column/index matching my value, which doesn't help if there are duplicates
I know I could do a for loop to iterate through the entire dataframe, tracking which column and index I am on in temporary variables. Once I hit the value I am looking for, I'll break and return the current column and index. Was just hoping there was a less brute-force-y method out there, since I'd like a "high-speed calculation" method that would work on any dataframe of any size.
Thanks.
EDIT: Added text database, clarified questions.
Use np.where:
r, c = np.where(df == 55)
list(zip(df.index[r], df.columns[c]))
Output:
[('index0', 'header0'), ('index2', 'header2')]
There is a function in pandas that gives duplicate rows.
duplicate = df[df.duplicated()]
print(duplicate)
Use DataFrame.unstack for Series with MultiIndex and then filter duplicates by Series.duplicated with keep=False:
s = df.unstack()
out = s[s.duplicated(keep=False)].index.tolist()
If you also need the duplicates together with their values:
df1 = (s[s.duplicated(keep=False)]
.sort_values()
.rename_axis(index='idx', columns='cols')
.reset_index(name='val'))
If you need a specific value, change the mask to Series.eq (==):
s = df.unstack()
out = s[s.eq(55)].index.tolist()
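Putting the unstack approach together end to end, rebuilding the question's frame:

```python
import pandas as pd

# The question's data, with 55 at (index0, header0) and (index2, header2).
df_dict = {
    'header0': [55, 12, 13, 14, 15],
    'header1': [21, 22, 23, 24, 25],
    'header2': [31, 32, 55, 34, 35],
    'header3': [41, 42, 43, 44, 45],
    'header4': [51, 52, 53, 54, 33],
}
df = pd.DataFrame(df_dict).rename(index=lambda i: f'index{i}')

# unstack yields a Series with a (column, index) MultiIndex,
# so eq(55) exposes every matching position at once.
s = df.unstack()
out = s[s.eq(55)].index.tolist()
print(out)
```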
The code below does iterate, but not over the whole DataFrame: it iterates over the columns and uses .any() to check whether a column contains the desired value. Then, using loc, it locates the matching rows and finally prints their index.
wanted_value = 55
for col in df.columns:
    if df[col].eq(wanted_value).any():
        print("row:", *df.loc[df[col].eq(wanted_value)].index, ' col', col)
I have a dataframe with some clubs and their nationality. Just like this one:
I created a function that I will use to create a new column based on Nationality. I tested it, and it works fine when matching exact values. However, I need to search for strings that contain a certain substring. E.g.: if the string contains 'Br', I want the new column to receive a certain value; if it contains another string, it will receive another value.
This is what I've done so far (it works, but I need something like a 'contains'):
# Function
def label_race(row):
    if row['Nationality'] == 'Brazil':
        return 'Brasil'
    else:
        return 'NA'

df.apply(lambda row: label_race(row), axis=1)
I would like to do something like this:
# Function
def label_race(row):
    if row['Nationality'] contains 'Br':
        return 'Brasil'
    if row['Nationality'] contains 'Brl':
        return 'Brasil2'
    else:
        return 'NA'

df.apply(lambda row: label_race(row), axis=1)
I found some tips, but most of them use things like .find() or df[].str.contains, and I couldn't adapt them to what I want.
If you want to create a new column with binary values (if condition met then A, else B), you can do something like this:
#create a column 'new' with value 'Brasil' if 'Nationality' value contains 'Bra', else put 'NA'
df['new'] = df['Nationality'].apply(lambda x: 'Brasil' if 'Bra' in x else 'NA')
Otherwise, if you want to apply multiple rules to the same column, you can do something like this:
#create a column 'new' and insert value 'ARG' whenever 'Nationality' contains 'Arg',
df.loc[df['Nationality'].str.contains('Arg'), 'new'] = 'ARG'
#and 'BRA' whenever Nationality contains 'Brazil', without overriding any other values
df.loc[df['Nationality'].str.contains('Brazil'), 'new'] = 'BRA'
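A runnable sketch of the two loc rules above, on hypothetical nationalities (only the 'Nationality' column name comes from the question):

```python
import pandas as pd

# Hypothetical sample data to show the rules stacking without overwriting each other.
df = pd.DataFrame({'Nationality': ['Argentina', 'Brazil', 'Chile']})

# Each rule only touches the rows its mask selects; unmatched rows stay NaN.
df.loc[df['Nationality'].str.contains('Arg'), 'new'] = 'ARG'
df.loc[df['Nationality'].str.contains('Brazil'), 'new'] = 'BRA'
print(df)
```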
IIUC, you can make do with str.extract and dot:
df = pd.DataFrame({'Nationality': ['Brazil', 'abBrl', 'abcd', 'BrX']})
new_df = df.Nationality.str.extract('(?P<Brazil2>Brl)|(?P<Brazil>Br)')
new_df.notnull().dot(new_df.columns)
Output:
0 Brazil
1 Brazil2
2
3 Brazil
dtype: object
I am a Java programmer, and I am learning Python for data science and analysis purposes.
I wish to clean the data in a DataFrame, but I am confused by the pandas logic and syntax.
What I wish to achieve is something like the following Java code:
for (String name : names) {
    if (name.equals("test")) {
        name = "myValue";
    }
}
How can I do this with Python and a pandas DataFrame?
I tried the following, but it does not work:
import pandas as pd
import numpy as np
df = pd.read_csv('Dataset V02.csv')
array = df['Order Number'].unique()
#On average, one order how many items has?
for value in array:
    count = 0
    if df['Order Number'] == value:
        ......
I get an error at df['Order Number'] == value.
How can I identify the specific values and edit them?
In short, I want to:
-Check all the entries of 'Order Number' column
-Execute an action (example: replace the value, or count the value) each time the record is equal to a given value (example, the order code)
Just use the vectorised form for replacement:
df.loc[df['Order Number'] == 'test', 'Order Number'] = 'myValue'
This compares the entire column against a specific value; where the comparison is True, it replaces just those rows with the new value.
For the second part: a plain if doesn't understand boolean arrays; it expects a scalar result. If you just want a unique value/frequency count, then do:
df['Order Number'].value_counts()
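A small sketch of value_counts on hypothetical order numbers, which answers the "count each value" part without any loop:

```python
import pandas as pd

# Hypothetical order numbers; value_counts tallies every unique value at once.
df = pd.DataFrame({'Order Number': ['A1', 'A1', 'B2', 'A1']})
counts = df['Order Number'].value_counts()
print(counts)
```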
The code goes this way:
import pandas as pd
df = pd.read_csv("Dataset V02.csv")
array = df['Order Number'].unique()
for value in array:
    count = 0
    if value in df['Order Number'].values:
        .......
You need to use in to check for presence (on .values, since in on a Series checks the index, not the values). Did I understand your problem correctly? If not, please comment and I will try to understand further.
My dataframe has a column containing values of various types, and I want to get the most frequent one:
In this case, I want to get the label FM-15, so that later on I can query only the data labeled with it.
How can I do that?
Now I can get away with:
most_count = df['type'].value_counts().max()
s = df['type'].value_counts()
s[s == most_count].index
This returns
Index([u'FM-15'], dtype='object')
But I feel this is too ugly, and I don't know how to use this Index() object to query df. I only know something like df = df[(df['type'] == 'FM-15')].
Use idxmax:
lbl = df['type'].value_counts().idxmax()
To query:
df.query("type == @lbl")
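A runnable sketch on hypothetical data (inside a query expression, the @ prefix refers to a Python variable):

```python
import pandas as pd

# Hypothetical 'type' column; idxmax returns the label with the largest count.
df = pd.DataFrame({'type': ['FM-15', 'FM-15', 'SY-MT']})
lbl = df['type'].value_counts().idxmax()

# @lbl lets the query expression reference the local variable lbl.
subset = df.query("type == @lbl")
print(lbl, len(subset))
```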