I have a pandas dataframe of 182 rows that comes from read_csv. The first column, sys_code, contains various alphanumeric codes. I want to drop the ones that start with 'FB' (there are 14 of these). I loop through the dataframe, adding what I assume would be the index to a list, then try to drop by index using the list. But this doesn't work unless I add 18 to each index number.
Without adding 18, I get a list containing numbers from 84 to 97. When I try to drop the rows using this list of indexes, I get KeyError: '[84] not found in axis'. But when I add 18 to each number, it works fine, at least for this particular dataset. Why is this? Shouldn't i be the same as the index number?
fb = []
i = 0
df.reset_index(drop=True)
for x in df['sys_code']:
    if x[:2] == 'FB':
        fb.append(i + 18)  # works
        fb.append(i)       # doesn't work
    i += 1
df.drop(fb, axis=0, inplace=True)
Your counter i tracks positions, but DataFrame.drop works on index labels, and the two only coincide when the index is the default RangeIndex. Note that df.reset_index(drop=True) returns a new DataFrame rather than modifying df in place; since you never assign the result, the original index (which presumably no longer starts at 0, most likely because rows were removed earlier) stays in effect, hence the mysterious offset of 18. Rather than tracking indexes by hand, you could use Series.str.startswith. Here's an example:
df = pd.DataFrame({'col1':['some string', 'FBsomething', 'FB', 'etc']})
print(df)
          col1
0  some string
1  FBsomething
2           FB
3          etc
You could drop the rows whose strings start with FB (keeping everything else) using:
df[~df.col1.str.startswith('FB')]
          col1
0  some string
3          etc
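Applied to your case, a short sketch (assuming sys_code holds plain strings): keep the rows whose codes don't start with 'FB' and renumber the index, remembering that reset_index must be assigned back since it isn't in-place:
df = df[~df['sys_code'].str.startswith('FB')]  # drops the 14 'FB' rows
df = df.reset_index(drop=True)  # note the assignment: reset_index returns a new frame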
I'm trying to use pandas to append a blank row based on the values in the first column. When the first six characters in the first column don't match between consecutive rows, I want an empty row between them (effectively creating groups). Here is an example of what the output could look like:
002446
002447-01
002447-02
002448
This is what I was able to put together thus far.
readie = pd.read_csv('title.csv')
i = 0
for row in readie:
    readie.append(row)
    i += 1
    if readie['column title'][i][0:5] != readie['column title'][i+1][0:5]:
        readie.append([])
When running this code, I get the following error message:
TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid
I believe there are other ways to do this, but I would like to use pandas if at all possible.
I'm using the approach from this answer.
Assuming that strings like '123456' and '123' are considered as not matching:
df_orig = pd.DataFrame(
    {'col': ['002446', '002447-01', '002447-02', '002448', '00244', '002448']}
)
df = df_orig.reset_index(drop=True)  # reset your index
first_6 = df['col'].str.slice(stop=6)  # grouping key: the first six characters
mask = first_6 != first_6.shift(fill_value=first_6[0])  # True where the key changes
df.index = df.index + mask.cumsum()  # shift labels down one extra slot per change
df = df.reindex(range(df.index[-1] + 1))  # the now-missing labels become NaN rows
print(df)
         col
0     002446
1        NaN
2  002447-01
3  002447-02
4        NaN
5     002448
6        NaN
7      00244
8        NaN
9     002448
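If the end goal is a file with visibly blank lines between the groups, note that to_csv writes NaN as an empty field by default, so the gap rows come out as empty lines in a single-column file; the output filename here is just a placeholder:
df.to_csv('title_grouped.csv', index=False)  # NaN rows become blank lines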
Sorry if the title is unclear - I wasn't too sure how to word it. So I have a dataframe that has two columns for old IDs and new IDs.
df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})
I'm trying to figure out a way to check the string length of each column/row and return any IDs that don't match the required string length of 4 in a new dataframe. This will eventually be turned into a dictionary of incorrect IDs.
This is the approach I'm currently taking:
incorrect_id_df = df[df.applymap(lambda x: len(x) != 4)]
and the current output:
  old_id new_id
0    111    NaN
1    NaN    NaN
2    NaN    777
3    NaN    NaN
I'm not sure where to go from here, and I'm sure there's a much better approach. This is the output I'm looking for: a single-column dataframe named id that contains just the IDs that don't match the required string length:
id
111
777
In general, DataFrame.applymap is pretty slow, so you should avoid it. I would stack both columns into a single one and select the IDs whose length is not 4:
import pandas as pd
df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})
ids = df.stack()
bad_ids = ids[ids.str.len() != 4]
Output:
>>> bad_ids
0 old_id 111
2 new_id 777
dtype: object
The advantage of this approach is that now you have the location of each bad ID, which might be useful later. If you don't need it, you can drop it with bad_ids.reset_index(drop=True).
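To get exactly the single-column frame named id that the question asks for, a small sketch building on bad_ids:
incorrect_id_df = bad_ids.reset_index(drop=True).rename('id').to_frame()
print(incorrect_id_df)
#     id
# 0  111
# 1  777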
Here's part of an answer: flatten all the values and filter with a list comprehension:
df = pd.DataFrame({'old_id':['111', '2222','3333', '4444'], 'new_id':['5555','6666','777','8888']})
all_ids = df.values.flatten()
bad_ids = [bad_id for bad_id in all_ids if len(bad_id) != 4]
bad_ids
Or, if you are not completely sure what you are doing, you can always use the brute-force method :D
import pandas as pd

df = pd.DataFrame({'old_id': ['111', '2222', '3333', '4444'],
                   'new_id': ['5555', '6666', '777', '8888']})
rows, columns = df.shape

for row in range(rows):
    k = df.loc[row]  # the current row as a Series
    for column in range(columns):
        if len(k.iloc[column]) != 4:
            print("Bad size of ID on row: " + str(row) + " column: " + str(column))
As commented by Jon Clements, stack could be useful here – it basically stacks (duh) all columns on top of each other:
>>> df[df.applymap(len) != 4].stack().reset_index(drop=True)
0 111
1 777
dtype: object
To turn that into a single-column df named id, you can extend the chain with .rename('id').to_frame().
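Putting it together, the same chain extended as described:
bad_df = (df[df.applymap(len) != 4]
          .stack()
          .reset_index(drop=True)
          .rename('id')
          .to_frame())
print(bad_df)
#     id
# 0  111
# 1  777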
I have a column and I need to return the nth character. I used str.find() to get the indexes, but I cannot find an answer for how to return the values in a pandas dataframe column.
Values             str.find() index   Outcome should be
asdfa 5-23         7                  -
kj 1-13 adlkadg    4                  -
...                ...                ...
Column "Values" is scraped from the internet.
My code to produce the second column is:
df["str.find() index"] = df["Values"].str.find("-")
There are many ways to do this, but since you asked specifically about str.find(), here are a couple that use it.
You can do it without the str.find() index column you created, by using apply directly:
df['Outcome1'] = df['Values'].apply(lambda x: x[x.find("-")])
But if you want to use the index column you created, you can use df.apply with a lambda over each row:
df['Outcome2'] = df.apply(lambda row: row['Values'][row['str.find() index']], axis=1)
            Values  str.find() index Outcome1 Outcome2
0       asdfa 5-23                 7        -        -
1  kj 1-13 adlkadg                 4        -        -
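One caveat: str.find() returns -1 when the substring is absent, so x[x.find('-')] would silently return the last character for values without a '-'. If some scraped values might lack one (an assumption about your data), guard for it:
df['Outcome1'] = df['Values'].apply(lambda x: x[x.find('-')] if '-' in x else None)  # None for rows without '-'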
I have a function which removes a string from a df column, if the string is present in the column of another df:
df1['col'] = df1['col'][~df1['col'].isin(df2['col'])]
The problem is that I now have to use this function on a column of tuples, which the function does not work with. Is there a way to easily transform the above function to accommodate for tuples? Data:
df1:                                     df2:
index  col1                              index  col
0      ('carol.clair', 'mark.taylor')    0      ('james.ray', 'tom.kopeland')
1      ('james.ray', 'tom.kopeland')     1      ('john.grisham', 'huratio.kane')
2      ('andrew.french', 'jack.martin')
3      ('john.grisham', 'huratio.kane')
4      ('ellis.taylor', 'sam.johnson')
Desired output:
df1
index  col1
0      ('carol.clair', 'mark.taylor')
1      ('andrew.french', 'jack.martin')
2      ('ellis.taylor', 'sam.johnson')
The function does work if the column is first converted to string; however, this raises an error later on in my code (I tried using .astype(tuple) to convert back after removing the tuples, but the same error arose):
ValueError: too many values to unpack (expected 2)
This will give you desired output:
df1.loc[~df1['col1'].isin(df2['col'])].reset_index(drop=True)
# col1
#0 (carol.clair, mark.taylor)
#1 (andrew.french, jack.martin)
#2 (ellis.taylor, sam.johnson)
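For reference, a self-contained version (the frames below are reconstructed from the samples above); tuples are hashable, so Series.isin compares them directly, with no string conversion needed:
import pandas as pd

# Rebuild the sample data with genuine tuples in the columns
df1 = pd.DataFrame({'col1': [('carol.clair', 'mark.taylor'),
                             ('james.ray', 'tom.kopeland'),
                             ('andrew.french', 'jack.martin'),
                             ('john.grisham', 'huratio.kane'),
                             ('ellis.taylor', 'sam.johnson')]})
df2 = pd.DataFrame({'col': [('james.ray', 'tom.kopeland'),
                            ('john.grisham', 'huratio.kane')]})

result = df1.loc[~df1['col1'].isin(df2['col'])].reset_index(drop=True)
print(result)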
I have a data frame with a string column. I need to create a new column with the 3rd element after col1.split(' '). I tried
df['col1'].str.split(' ')[0]
but all I get is an error. Actually, I need to turn col1 into multiple columns after splitting by " ". What is the correct way to do this?
Consider this df
df = pd.DataFrame({'col': ['Lets say 2000 is greater than 5']})
                               col
0  Lets say 2000 is greater than 5
You can split and use the str accessor to get elements at different positions:
df['third'] = df.col.str.split(' ').str[2]
df['fifth'] = df.col.str.split(' ').str[4]
df['last'] = df.col.str.split(' ').str[-1]
                               col third    fifth last
0  Lets say 2000 is greater than 5  2000  greater    5
Another way is:
df["third"] = df['col1'].apply(lambda x: x.split()[2])