KeyError: "['China'] not in index" - python

I am trying to select tourists of Chinese nationality; this is my code snippet:
import pandas as pd
China = ses[(ses["S1Country"] == 1)]
List_China = China[['Case','S1Country']]
List_China
This is what I ran before the error. Here I am trying to select certain data, following the code snippet that most sources use:
import pandas as pd
Ranking1 = ses[(ses["Q7Infor1"] == 1)]
List_Ranking1 = Ranking1[['China','Q7Infor1']]
List_Ranking1
Then I wrote this code and it reported back:
KeyError: "['China'] not in index"
How do I solve it?
Thanks for checking in!
Sample of the data:

Assuming that you are trying to filter the column Q7Infor1 by the value China, you can use df[df['col'] == value].
Thus, your original code:
List_Ranking1 = Ranking1[['China','Q7Infor1']]
Becomes this:
List_Ranking1 = Ranking1[Ranking1['Q7Infor1'] == 'China']
Check this answer on the different ways you can filter a column by a row value.
Updated with OP's dataset
'China' is not a valid value in column Q7Infor1. So assuming that China=1, then we can filter by value 1:
China = 1
List_Ranking1 = Ranking1[Ranking1['Q7Infor1'] == China]
To count the number of rows:
print(len(List_Ranking1))
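For illustration, a minimal self-contained sketch of this filter-and-count pattern (column names follow the question; the data is made up):
import pandas as pd

# hypothetical stand-in for the survey data, where 1 encodes "China"
ses = pd.DataFrame({'Case': [101, 102, 103, 104],
                    'Q7Infor1': [1, 2, 1, 3]})

China = 1
List_Ranking1 = ses[ses['Q7Infor1'] == China]
print(len(List_Ranking1))  # 2 rows match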

Related

numpy where with multiple conditions linked to dataframe

I'm using numpy where with multiple conditions to assign a category based on a text string in a transaction description. Part of the code is below:
import numpy as np

conditions = [
    df2['description'].str.contains('AEGON', na=False),
    df2['description'].str.contains('IB/PVV', na=False),
    df2['description'].str.contains('Picnic', na=False),
    df2['description'].str.contains('Jumbo', na=False),
]
values = [
    'Hypotheek',
    'Hypotheek',
    'Boodschappen',
    'Boodschappen',
]
df2['Classificatie'] = np.select(conditions, values, default='unknown')
I have many conditions, only partly shown here. I want to create a table / dataframe instead of including every separate condition and value in the code. For instance, the following dataframe:
import pandas as pd
Conditions = {'Condition': ['AEGON', 'IB/PVV', 'Picnic', 'Jumbo'],
              'Value': ['Hypotheek', 'Hypotheek', 'Boodschappen', 'Boodschappen']}
df_conditions = pd.DataFrame(Conditions, columns=['Condition', 'Value'])
How can I adjust the condition so that str.contains looks for a text string as listed in df_conditions['Condition'], and then apply the Value column to df2['Classificatie']? The values are already a list in the variable explorer, but I can't find a way to make str.contains look for a value in a list / dataframe.
Desired outcome:
In [3]: iwantthis
Out[3]:
Description Classificatie
0 groceries Jumbo on date boodschappen
1 mortgage payment Aegon. Hypotheek
2 transfer picnic. Boodschappen
The first column is the input data frame; the second column is what I'm looking for.
Please note that my current code already allows me to create this column, but I want a more automated way using the df_conditions table.
I'm not yet really familiar with Python and I can't find anything online.
Try:
import re
df_conditions["Condition"] = df_conditions["Condition"].str.lower()
df_conditions = df_conditions.set_index("Condition")
tmp = df["Description"].str.extract(
"(" + "|".join(re.escape(c) for c in df_conditions.index) + ")",
flags=re.I,
)
df["Classificatie"] = tmp[0].str.lower().map(df_conditions["Value"])
print(df)
Prints:
Description Classificatie
0 groceries Jumbo on date Boodschappen
1 mortgage payment Aegon. Hypotheek
2 transfer picnic. Boodschappen
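For reference, the answer assumes an input frame df with a Description column; a minimal stand-in built from the desired outcome above:
import pandas as pd

# hypothetical input frame matching the question's desired outcome
df = pd.DataFrame({"Description": ["groceries Jumbo on date",
                                   "mortgage payment Aegon.",
                                   "transfer picnic."]})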

How to return result based on a string found on a list?

I'm trying to return all data from my Excel sheet whose TOURNAMENT column contains the string FIFA. I keep getting no results back and I'm not sure how to fix this. Below is a sample of data from my Excel file. Any insight would be helpful, thank you.
My excel:
import pandas as pd
import numpy as np
filename = ("results.csv")
df = pd.read_csv(filename)
#convert to datetime format
df['date'] = pd.to_datetime(df['date'], format='%Y/%M/%D')
#Which country has scored the most goals in FIFA events (qualifiers, cups, etc.) since 2010?
#To get the most goals by sum
df['total_score'] = df['home_score'] + df['away_score']
#Not sure how to check all data with the string "FIFA" in the column "Tournament"
sub_df = df[(df['date'].dt.year >= 2010)]
if "FIFA" in df['tournament']:
sub_df2 = sub_df[sub_df['total_score'] == sub_df['total_score'].max()]
print(sub_df2)
else:
print("no results")
You can use Series.str.contains to check whether a substring exists in each value, then use the resulting mask to keep only those rows:
>>> df[df['tournament'].str.contains('FIFA')]
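Slotting that mask into the question's code, a sketch (untested against the real results.csv; column names are taken from the question). Note that the question's date format '%Y/%M/%D' would itself raise an error, since %M means minutes and %D is not a strptime directive:
import pandas as pd

df = pd.read_csv("results.csv")
df['date'] = pd.to_datetime(df['date'])  # let pandas infer the date format
df['total_score'] = df['home_score'] + df['away_score']

# FIFA tournaments since 2010; na=False guards against missing values
fifa = df[(df['date'].dt.year >= 2010) &
          df['tournament'].str.contains('FIFA', na=False)]
print(fifa[fifa['total_score'] == fifa['total_score'].max()])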

Compare two values for different rows in Pandas dataframe

I have a dataset of submission records with different submission times, grouped by id and sub_id. There can be several submissions with different sub_id values under one id, indicating that they are sub-events of the original event. For instance:
id  sub_id     submission_time       valuation_time        amend_time
G1  Original   2021-05-13T00:11:05Z  2021-05-13T00:12:05Z
G1  Valuation  2021-05-13T06:11:05Z  2021-05-13T06:12:10Z
G1  Amend      2021-05-14T08:09:01Z  2021-05-14T09:09:05Z  2021-05-18T19:19:15Z
G2  Original   2021-04-12T00:11:05Z  2021-04-12T00:12:05Z
G2  Valuation  2021-04-12T06:11:05Z  2021-04-12T06:12:10Z
...
I would like to go through the dataset and examine whether the valuation_time of the sub_id == "Valuation" row is after the submission_time of the sub_id == "Original" row under the same id. If that is true, I would like to add a new column and populate it with pass for the sub_id == "Valuation" row, otherwise fail.
I would really appreciate your help on this, as I have no clue how to approach this challenge. Thank you so much.
Please try this:
import datetime
import pandas as pd

df = pd.read_excel(r'C:\MyCodes\samplepython.xlsx')
df['Status'] = ''
df_new = pd.DataFrame()
for index, row in df.iterrows():
    sub_time = datetime.datetime.strptime(row['submission_time'], "%Y-%m-%dT%H:%M:%SZ")
    val_time = datetime.datetime.strptime(row['valuation_time'], "%Y-%m-%dT%H:%M:%SZ")
    if row['sub_id'] == 'Valuation' and val_time > sub_time:
        row['Status'] = 'Pass'
    elif row['sub_id'] == 'Valuation' and val_time <= sub_time:
        row['Status'] = 'Fail'
    df_new = df_new.append(row)
Code:
import datetime
import pandas as pd

list_values = [['G1', 'Original',
                datetime.datetime.strptime('2021-05-13T00:11:05Z', "%Y-%m-%dT%H:%M:%SZ"),
                datetime.datetime.strptime('2021-05-13T00:12:05Z', "%Y-%m-%dT%H:%M:%SZ")],
               [< please load other values >],
               ['G2', 'Valuation',
                datetime.datetime.strptime('2021-04-12T06:11:05Z', "%Y-%m-%dT%H:%M:%SZ"),
                datetime.datetime.strptime('2021-04-12T06:12:10Z', "%Y-%m-%dT%H:%M:%SZ")]]
df = pd.DataFrame(list_values, columns=['id', 'sub_id', 'submission_time', 'valuation_time'])
df = df.sort_values(by=['id', 'sub_id'])
status = []
level = 0
for index, row in df.iterrows():
    if level == 0 and row['sub_id'] == 'Original':
        sub_time = row['submission_time']
        status.append('')
        level += 1
    elif level == 1 and row['sub_id'] == 'Valuation':
        val_time = row['valuation_time']
        if sub_time > val_time:
            status.append('Fail')
        else:
            status.append('Pass')
        level = 0
    else:
        level = 0
        status.append('')
df["Status"] = status
print(df)
Result:
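As an alternative, a more compact vectorized sketch (assuming the frame df built above, with datetime values): map each id to the submission_time of its Original row, then compare on the Valuation rows.
# submission_time of each id's Original row, indexed by id
orig = df[df['sub_id'] == 'Original'].set_index('id')['submission_time']
mask = df['sub_id'] == 'Valuation'
# Pass when valuation_time is after the same id's Original submission_time
df.loc[mask, 'Status'] = (
    df.loc[mask, 'valuation_time'] > df.loc[mask, 'id'].map(orig)
).map({True: 'Pass', False: 'Fail'})
print(df)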

Pandas: count number of times every value in one column appears in another column

I want to count the number of times a value in the Child column appears in the Parent column, then display this count in a new column named child count. See the preview of the df below.
I had this done via VBA (COUNTIFS), but now I need dynamic visualization and an animated display with data fed from a directory. So I resorted to Python and Pandas, and tried the code below after searching and reading answers like: Countif in pandas with multiple conditions | Determine if value is in pandas column | Iterate over rows in Pandas df | many others...
but I still can't get the expected preview as illustrated in the image below.
Any help will be very much appreciated. Thanks in advance.
#import libraries
import pandas as pd
import numpy as np
import os
#get datasets
path_dataset = r'D:\Auto'
df_ns = pd.read_csv(os.path.join(path_dataset, 'Scripts', 'data.csv'), index_col = False, encoding = 'ISO-8859-1', engine = 'python')
#preview dataframe
df_ns
#tried
df_ns.groupby(['Child','Parent', 'Site Name']).size().reset_index(name='child count')
(screenshots in original post: preview dataframe, preview output, expected output)
[Edited] My data
Child = ['Tkt01', 'Tkt02', 'Tkt03', 'Tkt04', 'Tkt05', 'Tkt06', 'Tkt07', 'Tkt08', 'Tkt09', 'Tkt10']
Parent = [' ', ' ', 'Tkt03',' ',' ', 'Tkt03',' ', 'Tkt03',' ',' ', 'Tkt06',' ',' ',' ',]
Site_Name = ['Yaounde', 'Douala', 'Bamenda', 'Bafoussam', 'Kumba', 'Garoua', 'Maroua', 'Ngaoundere', 'Buea', 'Ebolowa']
I created a lookalike of your df.
Before
Try this code:
df['Count'] = [len(df[df['parent'].str.contains(value)]) for index, value in enumerate(df['child'])]

# breaking it down as line-by-line code
counts = []
for index, value in enumerate(df['child']):
    found = df[df['parent'].str.contains(value)]
    counts.append(len(found))
df['Count'] = counts
After
Hope this works for you.
Since I don't have access to your data, I cannot check the code I am giving you. I suspect you will have problems with NaN values with this line, but you can give it a try:
df_ns['child_count'] = df_ns['Parent'].groupby(df_ns['Child']).value_counts()
I give a name to the new column and directly assign values to it through the groupby -> value_counts functions.
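For what it's worth, a common idiom for this kind of lookup count is mapping value_counts of one column onto the other; a self-contained sketch using a shortened version of the question's sample data:
import pandas as pd

df = pd.DataFrame({'Child': ['Tkt01', 'Tkt02', 'Tkt03', 'Tkt06'],
                   'Parent': [' ', 'Tkt03', 'Tkt03', 'Tkt06']})
# how often each Child value occurs in Parent; fill 0 where it never occurs
df['child count'] = df['Child'].map(df['Parent'].value_counts()).fillna(0).astype(int)
print(df)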

Randomization of a list with conditions using Pandas

I'm new to any kind of programming, as you can tell by this 'beautiful' piece of hard coding. With sweat and tears (not so bad, just a little), I've created a very sequential piece of code, and that's actually my problem. My goal is to create a somewhat automated script, probably including a for-loop (which I've tried, unsuccessfully).
The main aim is to create a randomization loop which takes the original dataset, looking like this:
dataset
From this data set, pick rows randomly one by one and save them to another Excel list. The point is that each pick's values in the columns position01 and position02 must never match either of those two column values in the previous pick. That should eventually create an Excel sheet of randomized rows in which each row shares no position01/position02 values with the row before it: row 2 should not include any of the position01/position02 values of row 1, row 3 should not contain the values of row 2, etc. It should also iterate over the range of the list length, which is 0-11. The Excel output is also important, since I need the rest of the columns; I just need to shuffle the order.
I hope my aim and description are clear enough; if not, I'm happy to answer any questions. I would appreciate any hint or help that gets me 'unstuck'. Thank you. Code below. (PS: I'm aware that there is probably a much neater solution than this.)
import pandas as pd
import random
dataset = pd.read_excel("C:\\Users\\ibm\\Documents\\Psychopy\\DataInput_Training01.xlsx")
# original data set use for comparisons
imageDataset = dataset.loc[0:11, :]
# creating empty df for storing rows from imageDataset
emptyExcel = pd.DataFrame()
randomPick = imageDataset.sample() # select randomly one row from imageDataset
emptyExcel = emptyExcel.append(randomPick) # append a row to empty df
randomPickIndex = randomPick.index.tolist() # get index of the row
imageDataset2 = imageDataset.drop(index=randomPickIndex) # delete the row with index selected before
# getting the raw values from the row; 'position01'/'position02' are column headers
randomPickTemp1 = randomPick['position01'].values[0]
randomPickTemp2 = randomPick
randomPickTemp2 = randomPickTemp2['position02'].values[0]
# getting a dataset which does not include the row values from position01 and position02
isit = imageDataset2[(imageDataset2.position01 != randomPickTemp1) & (imageDataset2.position02 != randomPickTemp1) & (imageDataset2.position01 != randomPickTemp2) & (imageDataset2.position02 != randomPickTemp2)]
# pick another row from dataset not including row selected at the beginning - randomPick
randomPick2 = isit.sample()
# save it in empty df
emptyExcel = emptyExcel.append(randomPick2, sort=False)
# get index of this second row to delete it in next step
randomPick2Index = randomPick2.index.tolist()
# delete the another row
imageDataset3 = imageDataset2.drop(index=randomPick2Index)
# AND REPEAT the procedure of comparison of the raw values with dataset already not including the original row:
randomPickTemp1 = randomPick2['position01'].values[0]
randomPickTemp2 = randomPick2
randomPickTemp2 = randomPickTemp2['position02'].values[0]
isit2 = imageDataset3[(imageDataset3.position01 != randomPickTemp1) & (imageDataset3.position02 != randomPickTemp1) & (imageDataset3.position01 != randomPickTemp2) & (imageDataset3.position02 != randomPickTemp2)]
# AND REPEAT with another pick - save - matching - picking again.. until end of the length of the dataset (which is 0-11)
In the end, I used a solution provided by David Bridges (post from Sep 19, 2019) on the PsychoPy forum. In case anyone is interested, here is the link: https://discourse.psychopy.org/t/how-do-i-make-selective-no-consecutive-trials/9186
I just adjusted the condition in the for-loop to my case, like this:
remaining = [choices[x] for x in choices if last['position01'] != choices[x]['position01'] and last['position01'] != choices[x]['position02'] and last['position02'] != choices[x]['position01'] and last['position02'] != choices[x]['position02']]
Thank you very much for the helpful answer, and hopefully I did not spam it over here too much.
import itertools as it
import random
import pandas as pd

# list of pairs of numbers
tmp1 = [x for x in it.permutations(list(range(6)), 2)]
df = pd.DataFrame(tmp1, columns=["position01", "position02"])
df1 = pd.DataFrame()
i = random.choice(df.index)
df1 = df1.append(df.loc[i], ignore_index=True)
df = df.drop(index=i)
while not df.empty:
    val = list(df1.iloc[-1])
    tmp = df[(df["position01"] != val[0]) & (df["position01"] != val[1]) &
             (df["position02"] != val[0]) & (df["position02"] != val[1])]
    if tmp.empty:  # looped 10000 times, was never empty
        print("here")
        break
    i = random.choice(tmp.index)
    df1 = df1.append(df.loc[i], ignore_index=True)
    df = df.drop(index=i)
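Note that DataFrame.append was removed in pandas 2.0, so on current pandas each of the two append lines above would become a pd.concat call, e.g.:
df1 = pd.concat([df1, df.loc[[i]]], ignore_index=True)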
