I need help working with pandas dataframe - python

I have a big dataframe of items which is simplified as below. I am looking for a good way to find the item (A, B, C) in each row that is repeated 2 or more times.
For example, in row 1 it is A and in row 2 the result is B.
simplified df:
df = pd.DataFrame({'C1': ['A', 'B', 'A', 'A', 'C'],
                   'C2': ['B', 'A', 'A', 'C', 'B'],
                   'C3': ['A', 'B', 'A', 'C', 'C']},
                  index=['ro1', 'ro2', 'ro3', 'ro4', 'ro5'])

As mozway suggested, we don't know what your expected output should be; I will assume you need a list.
You can try something like this:
import pandas as pd
from collections import Counter

holder = []
for index in range(len(df)):
    # count the values in each row and keep the ones appearing at least twice
    temp = Counter(df.iloc[index, :].values)
    holder.append(','.join([key for key, value in temp.items() if value >= 2]))
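For reference, on the sample df above this should produce:
print(holder)
# ['A', 'B', 'A', 'C', 'C']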

As you have three columns and always at least one repeated value, you can conveniently use the row-wise mode:
df.mode(1)[0]
Output:
ro1 A
ro2 B
ro3 A
ro4 C
ro5 C
Name: 0, dtype: object
If a row might contain only unique values (e.g. A/B/C), you need to check that the mode actually appears more than once, and mask it otherwise:
m = df.mode(1)[0]                   # row-wise mode (first mode per row)
m2 = df.eq(m, axis=0).sum(1).le(1)  # True where the mode occurs at most once in the row
m.mask(m2)                          # replace those rows' result with NaN
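As an illustration of the masking (my own example, not from the original answer), add a row whose three values are all different:
df2 = df.copy()
df2.loc['ro6'] = ['A', 'B', 'C']
m = df2.mode(1)[0]
m.mask(df2.eq(m, axis=0).sum(1).le(1))
# ro1      A
# ro2      B
# ro3      A
# ro4      C
# ro5      C
# ro6    NaN
# Name: 0, dtype: object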

Related

How to drop the ith row of a data frame

How do I drop row number i of a DataFrame?
I tried the following but it is not working:
DF = DF.drop(i)
So I wonder what I am missing.
You must pass a label to drop. Here, drop tries to use i as a label and fails (with a KeyError) as your index probably has other values. Worse, if the index were composed of integers in random order, you might drop an incorrect row without noticing it.
Use:
df.drop(df.index[i])
Example:
df = pd.DataFrame({'col': range(4)}, index=list('ABCD'))
out = df.drop(df.index[2])
output:
col
A 0
B 1
D 3
pitfall
In case of duplicated indices, you might remove unwanted rows!
df = pd.DataFrame({'col': range(4)}, index=list('ABAD'))
out = df.drop(df.index[2])
output (both rows labeled A are dropped, not just the one at position 2!):
col
B 1
D 3
workaround:
import numpy as np
out = df[np.arange(len(df)) != i]
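Applied to the duplicated-index frame above (my own check of the workaround), this keeps the first A:
out = df[np.arange(len(df)) != 2]
# col
# A 0
# B 1
# D 3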
drop several indices by position:
import numpy as np
out = df[~np.isin(np.arange(len(df)), [i, j])]
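For example, dropping positions 0 and 2 of the same frame:
out = df[~np.isin(np.arange(len(df)), [0, 2])]
# col
# B 1
# D 3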
You need to add square brackets:
df = df.drop([i])
Try This:
df.drop(df.index[i])

Create a new dataframe based on the string lengths of values from an existing dataframe

Sorry if the title is unclear - I wasn't too sure how to word it. So I have a dataframe that has two columns for old IDs and new IDs.
df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})
I'm trying to figure out a way to check the string length of each column/row and return any IDs that don't match the required string length of 4 into a new dataframe. This will eventually turn into a dictionary of incorrect IDs.
This is the approach I'm currently taking:
incorrect_id_df = df[df.applymap(lambda x: len(x) != 4)]
and the current output:
old_id new_id
111 NaN
NaN NaN
NaN 777
NaN NaN
I'm not sure where to go from here, and I'm sure there's a much better approach, but this is the output I'm looking for: a single-column dataframe, with the column name id, containing just the IDs that don't match the required string length:
id
111
777
In general, DataFrame.applymap is pretty slow, so you should avoid it. I would stack both columns into a single one and select the IDs whose length is not 4:
import pandas as pd
df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})
ids = df.stack()
bad_ids = ids[ids.str.len() != 4]
Output:
>>> bad_ids
0 old_id 111
2 new_id 777
dtype: object
The advantage of this approach is that now you have the location of the bad IDs, which might be useful later. If you don't need it, you can just use ids = df.stack().reset_index(drop=True).
Here's part of an answer:
df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})
all_ids = df.values.flatten()
bad_ids = [bad_id for bad_id in all_ids if len(bad_id) != 4]
bad_ids
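For the sample df, bad_ids should come out as:
# ['111', '777']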
Or, if you are not completely sure what you are doing, you can always use the brute-force method :D
import pandas as pd

df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})
rows, colums = df.shape
# print(df)
for row in range(rows):
    k = df.loc[row]
    for colum in range(colums):
        # print(k.iloc[colum])
        if len(k.iloc[colum]) != 4:
            print("Bad size of ID on row:" + str(row) + " colum:" + str(colum))
As commented by Jon Clements, stack could be useful here – it basically stacks (duh) all columns on top of each other:
>>> df[df.applymap(len) != 4].stack().reset_index(drop=True)
0 111
1 777
dtype: object
To turn that into a single-column df named id, you can extend it with a .rename('id').to_frame().
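For example, combining the pieces above (my own assembly, not part of the original answer):
bad_ids = df[df.applymap(len) != 4].stack().reset_index(drop=True).rename('id').to_frame()
#     id
# 0  111
# 1  777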

How to generate new column based on multiple values from another column in pandas

How can I create a new column in a Pandas DataFrame that compresses/collapses multiple values at once from another column? Also, is it possible to use a default value so that you don't have to explicitly write out all the value mappings?
I'm referring to a process that is often called "variable recoding" in statistical software such as SPSS and Stata.
Example
Suppose I have a DataFrame with 1,000 observations. The only column in the DataFrame is called col1 and it has 26 unique values (the letters A through Z). Here's a reproducible example of my starting point:
import pandas as pd
import numpy as np
import string
np.random.seed(666)
df = pd.DataFrame({'col1':np.random.choice(list(string.ascii_uppercase),size=1000)})
I want to create a new column called col2 according to the following mapping:
If col1 is equal to either A, B or C, col2 should receive AA
If col1 is equal to either D, E or F, col2 should receive MM
For all other values in col1, col2 should receive ZZ
I know I can partially do this using Pandas' replace function, but it has two problems. The first is that the replace function doesn't allow you to condense multiple input values into one single response value. This forces me to write out df['col1'].replace({'A':'AA','B':'AA','C':'AA'}) instead of something simpler like df['col1'].replace({['A','B','C']:'AA'}).
The second problem is that the replace function doesn't have an all_other_values keyword or anything like that. This forces me to manually write out the ENTIRE value mappings like this df['col1'].replace({'A':'AA','B':'AA',...,'G':'ZZ','H':'ZZ','I':'ZZ',...,'X':'ZZ','Y':'ZZ','Z':'ZZ'}) instead of something simpler like df['col1'].replace(dict_for_abcdef, all_other_values='ZZ')
Is there another way to use the replace function that I'm missing that would allow me to do what I'm asking? Or is there another Pandas function that enables you to do similar things to what I describe above?
Dirty implementation
Here is a "dirty" implementation of what I'm looking for using loc:
df['col2'] = 'ZZ' # Initiate the column with the default "all_others" value
df.loc[df['col1'].isin(['A','B','C']),'col2'] = 'AA' # Mapping from "A","B","C" to "AA"
df.loc[df['col1'].isin(['D','E','F']),'col2'] = 'MM' # Mapping from "D","E","F" to "MM"
I find this solution a bit messy and was hoping something a bit cleaner existed.
You can try np.select, which takes a list of conditions, a list of values, and also a default:
conds = [df['col1'].isin(['A', 'B', 'C']),
         df['col1'].isin(['D', 'E', 'F'])]
values = ['AA', 'MM']
df['col2'] = np.select(conds, values, default='ZZ')
You can also use between instead of isin:
conds = [df['col1'].between('A', 'C'),
         df['col1'].between('D', 'F')]
values = ['AA', 'MM']
df['col2'] = np.select(conds, values, default='ZZ')
Sample Input and Output:
import string
import numpy as np
import pandas as pd
letters = string.ascii_uppercase
df = pd.DataFrame({'col1': list(letters)[:10]})
After applying the np.select call above, df is:
col1 col2
0 A AA
1 B AA
2 C AA
3 D MM
4 E MM
5 F MM
6 G ZZ
7 H ZZ
8 I ZZ
9 J ZZ
np.select(conditions, choices, default). For the conditions, check whether the letters fall within a defined range:
c = [df['col1'].between('A', 'C'), df['col1'].between('D', 'F')]  # covers A/B/C -> AA and D/E/F -> MM
CH = ['AA', 'MM']
df = df.assign(col2=np.select(c, CH, 'ZZ'))
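Another pattern that gets close to the "default value" behaviour the question asks for (a common alternative, not from the original answers) is Series.map followed by fillna:
mapping = {'A': 'AA', 'B': 'AA', 'C': 'AA', 'D': 'MM', 'E': 'MM', 'F': 'MM'}
df['col2'] = df['col1'].map(mapping).fillna('ZZ')  # letters not in the mapping fall back to 'ZZ'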

How can I count occurrences of a string in a dataframe in Python?

I'm trying to count the number of ships in a column of a dataframe; in this case, the number of 77Hs. I can do it for individual elements, but actions on the whole column don't seem to work.
E.g. this works with an individual element in my dataframe:
df = pd.DataFrame({'Route':['Callais','Dover','Portsmouth'],'shipCode':[['77H','77G'],['77G'],['77H','77H']]})
df['shipCode'][2].count('77H')
But when I try to perform the action on every row using either
df['shipCode'].count('77H')
df['shipCode'].str.count('77H')
it fails with both attempts. Any help on how to code this would be much appreciated.
Thanks
What if you did something like this? Assuming your initial dictionary...
import pandas as pd
from collections import Counter
df = pd.DataFrame(df)  # where df is the dictionary defined in the OP
You can generate a Counter of all the elements in the list in each row like this:
df['counts'] = df['shipCode'].apply(lambda x: Counter(x))
output:
Route shipCode counts
0 Callais [77H, 77G] {'77H': 1, '77G': 1}
1 Dover [77G] {'77G': 1}
2 Portsmouth [77H, 77H] {'77H': 2}
or if you want one in particular, i.e. '77H', you can do something like this:
df['counts'] = df['shipCode'].apply(lambda x: Counter(x)['77H'])
output:
Route shipCode counts
0 Callais [77H, 77G] 1
1 Dover [77G] 0
2 Portsmouth [77H, 77H] 2
or even this using the first method (full Counter in each row):
[count['77H'] for count in df['counts']]
output:
[1, 0, 2]
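If what you ultimately want is the total number of '77H' across the whole column (my reading of the question, not shown in the original answer), you can sum the per-row counts:
df['shipCode'].apply(lambda x: Counter(x)['77H']).sum()
# 3 for the sample dataframe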
The data frame has a shipCode column with a list of values in each row.
First, compute a True/False value identifying the rows whose shipCode list contains the string '77H'.
> df['shipCode'].map(lambda val: val.count('77H') > 0)
Now filter the data frame based on those True/False values obtained in the previous step.
> df[df['shipCode'].map(lambda val: val.count('77H') > 0)]
Finally, get a count of all rows in the data frame where the shipCode list contains a value matching '77H', using the built-in len function.
> len(df[df['shipCode'].map(lambda val: val.count('77H') > 0)])
Another way that makes it easy to remember what's been analyzed is to create a column in the same data frame to store the True/False values, then filter by them. It's really the same as above but a little prettier, in my opinion.
> df['filter_column'] = df['shipCode'].map(lambda val: val.count('77H') > 0)
> len(df[df['filter_column']])
Good luck and enjoy working with Python and Pandas to process your data!

Pandas dataframe, empty or with 3 columns, to pickle

I'm not used to pandas at all, hence the several questions about my problem.
I have a function computing a list called solutions. This list is either made of tuples of 3 values (a, b, c) or empty.
solutions = [(a,b,c), (d,e,f), (g,h,i)]
To save it, I first turn it into a numpy array, and then I save it with pandas after naming the columns.
solutions = np.asarray(solutions)
df = pd.DataFrame(solutions)
df.columns = ["Name1", "Name2", "Name3"]
df.to_pickle(path)
My issue is that I sometimes have an empty solutions list: solutions = []. In that case, the df.columns assignment raises an error. To bypass it, I currently check the size of solutions, and if it is empty, I do:
pickle.dump([], open(path, "wb"))
I would like to be more consistent with my data types, and to save the SAME format in both scenarios.
=> If the list is empty, I would like to save the 3 column names with an empty data frame. The ultimate goal is to reopen the file with pd.read_pickle() and to access the data in it easily.
Second issue: I would like to reopen the pickled files and add a column. Could you show me the right way to do so?
And third question: how can I select a part of the dataframe? For instance, I want all rows in which the column Name1 value satisfies value % 0.25 == 0.
Thanks
Create your dataframe using:
df = pandas.DataFrame(data=solutions, columns=['name1', 'name2', 'name3'])
If solutions is empty, it will nevertheless create a dataframe with 3 columns and 0 rows.
In [2]: pd.DataFrame(data=[(1,2,3), (4,5,6)], columns=['a','b','c'])
Out[2]:
a b c
0 1 2 3
1 4 5 6
In [3]: pd.DataFrame(data=[], columns=['a','b','c'])
Out[3]:
Empty DataFrame
Columns: [a, b, c]
Index: []
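For the second question (not covered in the original answer), a minimal sketch of the round trip, assuming you save with to_pickle and that name4 is just a hypothetical new column:
df.to_pickle(path)             # works even when the frame is empty
df = pd.read_pickle(path)      # reopen later
df['name4'] = df['name1'] * 2  # add a column; valid on an empty frame too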
For your third question:
df["Name1"] % 0.25 == 0
computes a series of booleans which are true where the value in the first column can be divided by 0.25. You can use it to select the rows of your dataframe:
df[ df["Name1"] % 0.25 == 0 ]
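As an illustration with a tiny made-up frame (not from the original answer):
df = pd.DataFrame({'Name1': [0.5, 0.3, 1.0], 'Name2': [1, 2, 3], 'Name3': [4, 5, 6]})
df[df['Name1'] % 0.25 == 0]
#    Name1  Name2  Name3
# 0    0.5      1      4
# 2    1.0      3      6
# (beware of floating-point precision when using % on floats)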
