Pandas Sort File and group up values - python

I'm learning pandas, but I'm having some trouble.
I import data as a DataFrame and want to bin the 2017 population values into four equal-size groups,
and then count the number of rows in group 4.
However, the system prints out:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-52-05d9f2e7ffc8> in <module>
2
3 df=pd.read_excel('C:/Users/Sam/Desktop/商業分析/Python_Jabbia1e/Chapter 2/jaggia_ba_1e_ch02_Data_Files.xlsx',sheet_name='Population')
----> 4 df=df.sort_values('2017',ascending=True)
5 df['Group'] = pd.qcut(df['2017'], q = 4, labels = range(1, 5))
6 splitData = [group for _, group in df.groupby('Group')]
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in sort_values(self, by, axis, ascending, inplace, kind, na_position, ignore_index, key)
5453
5454 by = by[0]
-> 5455 k = self._get_label_or_level_values(by, axis=axis)
5456
5457 # need to rewrap column in Series to apply key function
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in _get_label_or_level_values(self, key, axis)
1682 values = self.axes[axis].get_level_values(key)._values
1683 else:
-> 1684 raise KeyError(key)
1685
1686 # Check for duplicates
KeyError: '2017'
What's wrong with it?
Thanks~
Here's the dataframe:
And I tried:
df=pd.read_excel('C:/Users/Sam/Desktop/商業分析/Python_Jabbia1e/Chapter 2/jaggia_ba_1e_ch02_Data_Files.xlsx',sheet_name='Population')
df=df.sort_values('2017',ascending=True)
df['Group'] = pd.qcut(df['2017'], q = 4, labels = range(1, 5))
splitData = [group for _, group in df.groupby('Group')]
print('The number of group4 is :',splitData[3].shape[0])

You are passing the key to df.sort_values() as a str. You can give it either as an element of a list or on its own:
df = df.sort_values(by=['2017'], ascending=True)
or
df = df.sort_values(by='2017', ascending=True)
This only works if the column label exactly matches the string you pass. If the label is not a string, or if the string contains whitespace, it won't work. You can remove any trailing whitespace before sorting with
df.columns = df.columns.str.strip()
and if the label is not a string (here, the integer 2017) you should use
df = df.sort_values(by=[2017], ascending=True)
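A quick way to check what the labels actually are (a diagnostic sketch, not from the original answer; integer headers print without quotes):
print(df.columns.tolist())            # e.g. [2016, 2017] -> ints, so sorting by '2017' raises KeyError
print([type(c) for c in df.columns])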

First, you have a problem on line 4 with the sort: you tell sort_values to look for the string '2017', but the column label is the integer 2017. Try this, then move on with your code:
df=df.sort_values([2017],ascending=True)
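Putting it together, here is a minimal sketch with made-up population numbers (the real data lives in the asker's Excel file) showing the integer label working end to end, including the group-4 count the question asks for:
import pandas as pd

# Made-up data; read_excel keeps a numeric header cell as the int 2017.
df = pd.DataFrame({'City': list('ABCDEFGH'),
                   2017: [100, 250, 175, 300, 225, 150, 275, 200]})

df = df.sort_values(2017, ascending=True)                  # int label, not '2017'
df['Group'] = pd.qcut(df[2017], q=4, labels=range(1, 5))   # four equal-size bins
print('The number of group4 is :', (df['Group'] == 4).sum())   # 2 for this toy data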


Select rows of data frame based on true false boolean list [duplicate]

I want to select rows of a dataframe based on isin calculations I did using two separate dataframes.
Here is the code:
file = r'file path for df'
df = pd.read_csv(file, encoding='utf-16le', sep='\t')
keepcolumns = ["CookieID", "CryptID"]
df = df[keepcolumns]
file = r'file path for mappe.csv'
dfmappe = pd.read_csv(file, sep=';')
mask = (dfmappe[['CryptIDs']].isin(df[['CryptID']])).all(axis=1)
dffound = df[mask]
However in the last line I get the following error:
IndexingError Traceback (most recent call last)
Untitled-1.ipynb Zelle 3 in <cell line: 9>()
6 dfmappe = pd.read_csv(file, sep=';')
8 mask = (dfmappe[['CryptIDs']].isin(df[['CryptID']])).all(axis=1)
----> 9 dffound = df[mask]
File c:\Users\pchauh04\Anaconda3\envs\python\lib\site-packages\pandas\core\frame.py:3496, in DataFrame.__getitem__(self, key)
3494 # Do we have a (boolean) 1d indexer?
3495 if com.is_bool_indexer(key):
-> 3496 return self._getitem_bool_array(key)
3498 # We are left with two options: a single key, and a collection of keys,
3499 # We interpret tuples as collections only for non-MultiIndex
3500 is_single_key = isinstance(key, tuple) or not is_list_like(key)
File c:\Users\pchauh04\Anaconda3\envs\python\lib\site-packages\pandas\core\frame.py:3549, in DataFrame._getitem_bool_array(self, key)
3543 raise ValueError(
3544 f"Item wrong length {len(key)} instead of {len(self.index)}."
3545 )
3547 # check_bool_indexer will throw exception if Series key cannot
3548 # be reindexed to match DataFrame rows
-> 3549 key = check_bool_indexer(self.index, key)
3550 indexer = key.nonzero()[0]
3551 return self._take_with_is_copy(indexer, axis=0)
...
2388 return result.astype(bool)._values
2389 if is_object_dtype(key):
2390 # key might be object-dtype bool, check_array_indexer needs bool array
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
Here are the two files, dfmappe and df:
The problem is that the condition and the filtered DataFrame have different index values:
#condition has index from dfmappe
mask = (dfmappe[['CryptIDs']].isin(df[['CryptID']])).all(axis=1)
#filtered df - both DataFrames has different indices, so raise error
dffound = df[mask]
Possible solutions - because only one column is tested, the [[]] and the all(axis=1) are removed:
mask = dfmappe['CryptIDs'].isin(df['CryptID'])
#filtered dfmappe
dffound = dfmappe[mask]
Or:
#mask tests df by dfmappe column
mask = df['CryptID'].isin(dfmappe['CryptIDs'])
#filtered df
dffound = df[mask]
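A tiny reproduction (with invented IDs, since the real frames come from the asker's files) showing both the error and the fix:
import pandas as pd

df = pd.DataFrame({'CookieID': [1, 2, 3], 'CryptID': ['a', 'b', 'c']})
dfmappe = pd.DataFrame({'CryptIDs': ['b', 'c', 'd', 'e']})   # longer index than df

bad_mask = dfmappe['CryptIDs'].isin(df['CryptID'])   # indexed 0..3, like dfmappe
# df[bad_mask] raises IndexingError: the mask's index does not match df's index

good_mask = df['CryptID'].isin(dfmappe['CryptIDs'])  # indexed 0..2, like df
print(df[good_mask])                                 # the rows with 'b' and 'c'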

Fill empty Pandas column based on condition on substring

I have this dataset with the following data. I have a Job_Title column, and I added a Categories column that I want to use to categorize my job titles. For example, all the job titles that contain the word 'Analytics' will be categorized as Data. This label Data will appear in the Categories column.
I have created a dictionary with the words I want to identify on the Job_Title column as key and the values I want to add on the Categories column as values.
#Creating a new dictionary with the new categories
cat_type_dic = {}
cat_type_file = open("categories.txt")
for line in cat_type_file:
    key, value = line.split(";")
    cat_type_dic[key] = value
print(cat_type_dic)
Then, I tried to create a loop based on a condition. Basically, if the key on the dictionary is a substring of the column Job_Title, fill the column Categories with the value. This is what I tried:
for i in range(len(df)):
    if df.loc["Job_Title"].str.contains(cat_type_dic[i]):
        df["Categories"] = df["Categories"].str.replace(cat_type_dic[i], cat_type_dic.get(i))
Of course, it's not working. I think I am not accessing correctly to the key and value. Any clue?
This is the message error that I am getting:
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
      1 for i in range(len(df)):
----> 2     if df.iloc["Job_Title"].str.contains(cat_type_dic[i]):
      3         df["Categories"] = df["Categories"].str.replace(cat_type_dic[i], cat_type_dic.get(i))
C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
    929
    930         maybe_callable = com.apply_if_callable(key, self.obj)
--> 931         return self._getitem_axis(maybe_callable, axis=axis)
    932
    933     def _is_scalar_access(self, key: tuple):
C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1561         key = item_from_zerodim(key)
   1562         if not is_integer(key):
-> 1563             raise TypeError("Cannot index by location index with a non-integer key")
   1564
   1565         # validate the location
TypeError: Cannot index by location index with a non-integer key
Thanks a lot!
Does the following code give you what you need?
import pandas as pd
df = pd.DataFrame()
df['Job_Title'] = ['Business Analyst', 'Data Scientist', 'Server Analyst']
cat_type_dic = {'Business': ['CatB1', 'CatB2'], 'Scientist': ['CatS1', 'CatS2', 'CatS3']}
list_keys = list(cat_type_dic.keys())
def label_extracter(x):
    list_matched_keys = list(filter(lambda y: y in x['Job_Title'], list_keys))
    category_label = ' '.join([' '.join(cat_type_dic[key]) for key in list_matched_keys])
    return category_label
df['Categories'] = df.apply(lambda x: label_extracter(x), axis=1)
print(df)
Job_Title Categories
0 Business Analyst CatB1 CatB2
1 Data Scientist CatS1 CatS2 CatS3
2 Server Analyst
EDIT: Explanation added. @SofyPond
apply helps when a loop is necessary.
I defined a function which checks whether Job_Title contains a key of the dictionary assigned earlier. I converted the keys to a list to make the checking process easier.
(list_label was renamed to category_label since it is not a list anymore.) category_label in the function label_extracter gets the values assigned to the key in list format, and converts them to a str by putting ' ' (whitespace) between the values. When the length of list_matched_keys is greater than 0, it contains one string per matched key, created by the inner ' '.join, and the outer ' '.join converts that list to a single string.
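A minimal illustration of the nested ' '.join behavior described above, reusing the small dictionary from the answer:
cat_type_dic = {'Business': ['CatB1', 'CatB2'], 'Scientist': ['CatS1', 'CatS2', 'CatS3']}
matched = ['Business', 'Scientist']      # pretend both keys matched one title
inner = [' '.join(cat_type_dic[k]) for k in matched]
print(inner)            # ['CatB1 CatB2', 'CatS1 CatS2 CatS3']
print(' '.join(inner))  # CatB1 CatB2 CatS1 CatS2 CatS3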

Need to strip CSV Column Number Data of Letters - Pandas

I am working on a .csv that has columns in which numerical data includes letters. I want to strip the letters so that the column can be a float or int.
I have tried the following:
using a loop and a function to strip object columns of string data in the "MPG" column, leaving only numerical values.
Step 1 should print the names of the columns where there is at least one entry ending in the characters 'mpg'.
CODING IN JUPYTER NOTEBOOK CELLS:
Step 1:
MPG_cols = []
for colname in df.columns[df.dtypes == 'object']:
    if df[colname].str.endswith('mpg').any():
        MPG_cols.append(colname)
print(MPG_cols)
using .str so I can use an element-wise string method
only want to consider the string columns
THIS GIVES ME OUTPUT:
['Power']  # good so far
STEP 2:
#define the value to be removed using loop
def remove_mpg(pow_val):
    """For each value, take the number before the 'mpg'
    unless it is not a string value. This will only happen
    for NaNs so in that case we just return NaN.
    """
    if isinstance(pow_val, str):
        i = pow_val.replace('mpg', '')
        return float(pow_val.split(' ')[0])
    else:
        return np.nan

position_cols = ['Vehicle_type']

for colname in MPG_cols:
    df[colname] = df[colname].apply(remove_mpg)

df[MPG_cols].head()
The Error I get:
ValueError Traceback (most recent call last)
<ipython-input-37-45b7f6d40dea> in <module>
15
16 for colname in MPG_cols:
---> 17 df[colname] = df[colname].apply(remove_mpg)
18
19 df[MPG_cols].head()
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
3846 else:
3847 values = self.astype(object).values
-> 3848 mapped = lib.map_infer(values, f, convert=convert_dtype)
3849
3850 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-37-45b7f6d40dea> in remove_mpg(pow_val)
8 if isinstance(pow_val, str):
9 i=pow_val.replace('mpg', '')
---> 10 return float(pow_val.split(' ')[0])
11 else:
12 return np.nan
ValueError: could not convert string to float: 'null'
I applied similar code to a different column and it worked on that column, but not here.
Any guidance will be greatly appreciated.
Bests,
I think you need to revisit the logic of the function remove_mpg. One way you can tweak it is as follows:
import re
import numpy as np

def get_me_float(pow_val):
    my_numbers = re.findall(r"(\d+.*\d+)mpg", pow_val)
    if len(my_numbers) > 0:
        return float(my_numbers[0])
    else:
        return np.nan
For example, to test the function:
my_pow_val = ['34mpg', '34.6mpg', '0mpg', 'mpg', 'anything']
for each_pow in my_pow_val:
    print(get_me_float(each_pow))
output:
34.0
34.6
nan
nan
nan
This will work:
import pandas as pd
pd.to_numeric(pd.Series(['$2', '3#', '1mpg']).str.replace('[^0-9]', '', regex=True))
0 2
1 3
2 1
dtype: int64
For a complete solution:
for i in range(df.shape[1]):
    if df.iloc[:, i].dtype == 'object':
        df.iloc[:, i] = pd.to_numeric(df.iloc[:, i].str.replace('[^0-9]', '', regex=True))
df.dtypes
Select columns not to be changed
for i in range(df.shape[1]):
    # 'colA', 'colB' are columns which should remain the same.
    if (df.iloc[:, i].dtype == 'object') and (df.columns[i] not in ['colA', 'colB']):
        df.iloc[:, i] = pd.to_numeric(df.iloc[:, i].str.replace('[^0-9]', '', regex=True))
df.dtypes
Why don't you use the converters parameter of the read_csv function to strip the extra characters when you load the csv file?
def strip_mpg(s):
    return float(s.rstrip(' mpg'))

df = pd.read_csv(..., converters={'Power': strip_mpg}, ...)
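For instance, with a tiny in-memory CSV standing in for the real file (the 'Power' column name is taken from the question's output; this is a sketch, not the asker's data):
import pandas as pd
from io import StringIO

csv_text = "Power,Vehicle_type\n25 mpg,car\n31 mpg,truck\n"

def strip_mpg(s):
    return float(s.rstrip(' mpg'))

df = pd.read_csv(StringIO(csv_text), converters={'Power': strip_mpg})
print(df.dtypes)   # Power is now float64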

Deleting empty index column pandas dataframe - labels [' '] not contained in axis

I have the following dataframe:
The numeroLote column takes values in the range 5 to 25.
I want to export a CSV file for each value of numeroLote, so I perform the following:
for i in range(5,26):
    print(i)
    a = racimitos[racimitos['numeroLote']==i][['peso','fecha','numeroLote']]
    a.to_csv('racimitos{}.csv'.format(i), sep=',', header=True, index=True)
And then I get datasets similar to:
An additional column is generated, like the one enclosed in the red box above …
I tried to remove this column in the following way:
for i in range(5,26):
    print(i)
    a = racimitos[racimitos['numeroLote']==i][['peso','fecha','numeroLote']]
    a.to_csv('racimitos{}.csv'.format(i), sep=',', header=True, index=True)
    a.drop(columns=[' '], axis=1,)
But I get this error:
KeyError Traceback (most recent call last)
<ipython-input-18-e3ad718d5396> in <module>()
9 a = racimitos[racimitos['numeroLote']==i][['peso','fecha','numeroLote']]
10 a.to_csv('racimitos{}.csv'.format(i), sep=',', header=True, index=True)
---> 11 a.drop(columns=[' '], axis=1,)
~/anaconda3/envs/sioma/lib/python3.6/site-packages/pandas/core/indexes/base.py in drop(self, labels, errors)
4385 if errors != 'ignore':
4386 raise KeyError(
-> 4387 'labels %s not contained in axis' % labels[mask])
4388 indexer = indexer[~mask]
4389 return self.delete(indexer)
KeyError: "labels [' '] not contained in axis"
How can I remove this empty index column which is generated when I export to .csv?
You instead want index=False, like so:
for i in range(5,26):
    a = racimitos[racimitos['numeroLote']==i][['peso','fecha','numeroLote']]
    a.to_csv('racimitos{}.csv'.format(i), sep=',', header=True, index=False)
As an aside, I don't think it's necessary to include the numeroLote column when printing to the .csv file, simply because you capture its value in the filename.
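To see why the blank column appears in the first place, here is a small round-trip sketch with invented rows:
import pandas as pd

racimitos = pd.DataFrame({'peso': [1.2, 3.4],
                          'fecha': ['2018-01-01', '2018-01-02'],
                          'numeroLote': [5, 5]})

racimitos.to_csv('racimitos5.csv', index=True)
print(pd.read_csv('racimitos5.csv').columns.tolist())
# ['Unnamed: 0', 'peso', 'fecha', 'numeroLote']   <- the extra column

racimitos.to_csv('racimitos5.csv', index=False)
print(pd.read_csv('racimitos5.csv').columns.tolist())
# ['peso', 'fecha', 'numeroLote']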
Here is a much more efficient solution IMO using groupby():
grouped = racimitos.groupby('numeroLote')[['peso','fecha']]
[grouped.get_group(key).to_csv('racimitos{}.csv'.format(key), index=False) for key, item in grouped]
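The same export written as a plain loop, if you prefer not to use a list comprehension purely for its side effects:
for key, item in grouped:
    item.to_csv('racimitos{}.csv'.format(key), index=False)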
Instead of trying to drop that unnamed column, you could select all columns starting from index 1.
a = a.iloc[:, 1:]

Pandas Error Matching String

I have data like the SampleDf data below. I'm trying to check values in one column of my dataframe to see if they contain 'Sum', 'Count', or 'Avg', and then create a new column holding the value 'sum', 'count', or 'Avg'. When I run the code below on my real dataframe I get the error below. When I run dtypes on my real dataframe it says all the columns are objects. The code below is related to the post linked underneath. Unfortunately I don't get the same error when I run the code on the SampleDf I've provided, but I couldn't post my whole dataframe.
post:
Pandas and apply function to match a string
Code:
SampleDf=pd.DataFrame([['tom',"Avg(case when Value1 in ('Value2') and [DateType] in ('Value3') then LOS end)"],['bob',"isnull(Avg(case when XferToValue2 in (1) and DateType in ('Value3') and [Value1] in ('HM') then LOS end),0)"]],columns=['ReportField','OtherField'])
search1='Sum'
search2='Count'
search3='Avg'
def Agg_type(x):
    if search1 in x:
        return 'sum'
    elif search2 in x:
        return 'count'
    elif search3 in x:
        return 'Avg'
    else:
        return 'Other'
SampleDf['AggType'] = SampleDf['OtherField'].apply(Agg_type)
SampleDf.head()
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-17-a2b4920246a7> in <module>()
17 return 'Other'
18
---> 19 SampleDf['AggType'] = SampleDf['OtherField'].apply(Agg_type)
20
21 #SampleDf.head()
C:\Users\Name\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2292 else:
2293 values = self.asobject
-> 2294 mapped = lib.map_infer(values, f, convert=convert_dtype)
2295
2296 if len(mapped) and isinstance(mapped[0], Series):
pandas\src\inference.pyx in pandas.lib.map_infer (pandas\lib.c:66124)()
<ipython-input-17-a2b4920246a7> in Agg_type(x)
8
9 def Agg_type(x):
---> 10 if search1 in x:
11 return 'sum'
12 elif search2 in x:
TypeError: argument of type 'float' is not iterable
You can try this:
SampleDf['new_col'] = np.where(SampleDf.OtherField.str.contains("Avg"),"Avg",
np.where(SampleDf.OtherField.str.contains("Count"),"Count",
np.where(SampleDf.OtherField.str.contains("Sum"),"Sum","Nothing")))
Please note that this will work properly if you don't have both Avg and Count (or Sum) in the same string.
If you do, let me know and I'll look for a better approach.
Of course, if something doesn't suit your needs, report that back too.
Hope this was helpful.
Explanation:
What's happening is that you look for the indexes where Avg is in the string inside the OtherField column and fill new_col with "Avg" at those indexes. For the remaining rows (where there isn't "Avg") you look for Count and do the same, and lastly you do the same for Sum.
Documentation:
np.where
pandas.Series.str.contains
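As a footnote on the original TypeError: a NaN in the column reproduces it, because `'Sum' in x` fails when x is a float. str.contains sidesteps that; passing na=False (my assumption, not part of the answer above) keeps NaN rows in the "Nothing" bucket:
import numpy as np
import pandas as pd

# NaN stands in for whatever non-string value the real dataframe contains.
SampleDf = pd.DataFrame({'OtherField': ['Sum(case when ...)', 'Count(case when ...)', np.nan]})
SampleDf['AggType'] = np.where(SampleDf.OtherField.str.contains("Avg", na=False), "Avg",
                      np.where(SampleDf.OtherField.str.contains("Count", na=False), "Count",
                      np.where(SampleDf.OtherField.str.contains("Sum", na=False), "Sum", "Nothing")))
print(SampleDf)   # Sum, Count, Nothing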
