I have a pandas DataFrame in Python from which I want to remove rows that contain letters in a certain column. I have tried a few things, but nothing has worked.
Input:
A B C
0 9 1 a
1 8 2 b
2 7 cat c
3 6 4 d
I would then remove rows that contained letters in column 'B'...
Expected Output:
A B C
0 9 1 a
1 8 2 b
3 6 4 d
Update:
After seeing the replies, I still haven't been able to get this to work. I'm going to just place my entire code here. Maybe I'm not understanding something...
import pandas as pd
#takes file path from user and removes quotation marks if necessary
sysco1file = input("Input path of FS1 file: ").replace("\"","")
sysco2file = input("Input path of FS2 file: ").replace("\"","")
sysco3file = input("Input path of FS3 file: ").replace("\"","")
#tab separated files, all values string
sysco_1 = pd.read_csv(sysco1file, sep='\t', dtype=str)
sysco_2 = pd.read_csv(sysco2file, sep='\t', dtype=str)
sysco_3 = pd.read_csv(sysco3file, sep='\t', dtype=str)
#combine all rows from the 3 files into one dataframe
sysco_all = pd.concat([sysco_1,sysco_2,sysco_3])
#Also dropping nulls from CompAcctNum column
sysco_all.dropna(subset=['CompAcctNum'], inplace=True)
#ensure all values are string
sysco_all = sysco_all.astype(str)
#implemented solution from stackoverflow
#I also tried putting "sysco_all = " in front of this
sysco_all.loc[~sysco_all['CompanyNumber'].str.isalpha()]
#writing dataframe to new csv file
sysco_all.to_csv(r"C:\Users\user\Desktop\testcsvfile.csv")
I do not get an error. However, the csv still has rows with letters in this column.
Assuming the B column is of string type, we can use str.contains here:
df[~df["B"].str.contains(r'^[A-Za-z]+$', regex=True)]
Here is another way to do it:
# use isalpha to check if value is alphabetic
# use negation to pick where value is not alphabetic
df = df.loc[~df['B'].str.isalpha()]
df
A B C
0 9 1 a
1 8 2 b
3 6 4 d
OR
# output the filtered result to csv, preserving the original DF
df.loc[~df['B'].str.isalpha()].to_csv('out.csv')
Related
I have a data frame with repeating string values, and I want to reorder the rows into a desired order.
My code:
df =
name
0 Fix
1 1Ax
2 2Ax
3 2Ax
4 1Ax
5 Fix
df = df.sort_values(by=['name'], ignore_index=True, ascending=False)
print(df)
df =
name
0 Fix
1 Fix
2 2Ax
3 2Ax
4 1Ax
5 1Ax
Expected answer:
df =
name
0 Fix
1 Fix
2 1Ax
3 1Ax
4 2Ax
5 2Ax
Currently you are sorting in reverse alphabetical order: so 'F' comes before '2' which comes before '1'. Changing ascending to True will place 'Fix' at the bottom.
It's a bit of a hack, but you could pull out the rows where the first character is a number and sort them separately...
import pandas as pd
df = pd.DataFrame(['Fix', '1Ax','2Ax','2Ax','1Ax','Fix'], columns=['name'])
# Sort alphabetically
df = df.sort_values(by=['name'],ignore_index=True,ascending=True)
# Get first character of string
first_digit = df['name'].str[0]
# Get cases where first character is (not) a number
starts_with_digits = df[first_digit.str.isdigit()]
not_starts_with_digits = df[~first_digit.str.isdigit()]
# Recombine: rows not starting with a digit first, then rows that do
df = pd.concat([not_starts_with_digits, starts_with_digits]).reset_index(drop=True)
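If your pandas is 1.1 or newer, the same ordering can be achieved without splitting the frame, using two stable sorts (a sketch, assuming every name is a non-empty string):
# sort alphabetically first, then stable-sort by "starts with a digit":
# False sorts before True, so non-digit names rise to the top
# while ties keep their alphabetical order
df = df.sort_values('name', ignore_index=True)
df = df.sort_values('name', key=lambda s: s.str[0].str.isdigit(),
                    kind='stable', ignore_index=True)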
I have successfully changed a single column name in the dataframe using this:
df.columns=['new_name' if x=='old_name' else x for x in df.columns]
However, I have lots of columns to update (but not all 240 of them), and I don't want to have to write it out for each single change if I can help it.
I have tried to follow the advice from @StefanK in this thread:
Changing multiple column names but not all of them - Pandas Python
my code:
df.columns=[[4,18,181,182,187,188,189,190,203,204]]=['Brand','Reason','Chat_helpful','Chat_expertise','Answered_questions','Recommend_chat','Alternate_help','Customer_comments','Agent_category','Agent_outcome']
but i am getting an error message:
File "<ipython-input-17-2808488b712d>", line 3
df.columns=[[4,18,181,182,187,188,189,190,203,204]]=['Brand','Reason','Chat_helpful','Chat_expertise','Answered_questions','Recommend_chat','Alternate_help','Customer_comments','Agent_category','Agent_outcome']
^
SyntaxError: can't assign to literal
So having googled the error and read many more S.O. questions here, it looks to me like it is trying to read the numbers as integers instead of as an index? I'm not certain here, though.
So how do I fix it so that it treats the numbers as an index? The column names I am replacing are at least 10 words long each, so I'm keen not to have to type them all out! My only idea is to use iloc somehow, but I'm going into new territory here!
I'd really appreciate some help, please.
Remove the '=' after df.columns in your code and use this instead:
df.columns.values[[4,18,181,182,187,188,189,190,203,204]]=['Brand','Reason','Chat_helpful','Chat_expertise','Answered_questions','Recommend_chat','Alternate_help','Customer_comments','Agent_category','Agent_outcome']
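Be aware that this writes into the numpy array backing the column Index; it works in many pandas versions, but mutating internals this way is not guaranteed behaviour, so reassigning df.columns wholesale (as the next answer does) is the safer route.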
Because an Index does not support mutable operations, convert it to a numpy array, reassign, and set it back:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
arr = df.columns.to_numpy()
arr[[0,2,3]] = list('RTG')
df.columns = arr
print (df)
R B T G E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
So with your data use:
idx = [4,18,181,182,187,188,189,190,203,204]
names = ['Brand','Reason','Chat_helpful','Chat_expertise','Answered_questions','Recommend_chat','Alternate_help','Customer_comments','Agent_category','Agent_outcome']
arr = df.columns.to_numpy()
arr[idx] = names
df.columns = arr
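Alternatively, rename can build the new labels from the positions without touching numpy at all (a sketch, assuming the current labels at those positions are unique):
idx = [4,18,181,182,187,188,189,190,203,204]
names = ['Brand','Reason','Chat_helpful','Chat_expertise','Answered_questions','Recommend_chat','Alternate_help','Customer_comments','Agent_category','Agent_outcome']
# map each existing label at the given positions to its new name
df = df.rename(columns=dict(zip(df.columns[idx], names)))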
I need to find all the duplicates in one column of a CSV file, and then export these to a different CSV file. I've tried the answers from this question: How do I get a list of all the duplicate items using pandas in python?, but I am not getting the correct result.
Example of my csv file:
filename,ID,status
71.wav,107e,accepted
85.wav,9a99,accepted
85.wav,d27a,accepted
86.wav,ea4f,accepted
86.wav,9f9b,accepted
75.wav,b734,accepted
75.wav,3dfb,accepted
I would like an output of:
85.wav,9a99,accepted
86.wav,ea4f,accepted
75.wav,b734,accepted
I tried:
ids = df["filename"]
dups = df[ids.isin(ids[ids.duplicated()])].sort_values("filename")
print(dups)
The output of this gave unique values as well as duplicate values.
My expected output would be a csv file with the first duplicate listed as shown above (I edited the question to clarify).
This method should definitely help.
data = {'Test':[1,2,3,4,5,6,2,4,2,5,6,3,2,7,8,9]}
df = pd.DataFrame(data)
dups = df[df.duplicated()]
returns
Test
6 2
7 4
8 2
9 5
10 6
11 3
12 2
Are you looking for something like this?
df = pd.DataFrame({"id":[1,1,1,1,2,2,3,4,5],
"name":["Georgia","Georgia","Georgia","Georgia","Camila","Camila","Diego","Luis","Jose"]})
duplicates = df[df.duplicated(["id"])]
Returns
id name
1 1 Georgia
2 1 Georgia
3 1 Georgia
5 2 Camila
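For the filename data in the question, where the expected output is the first row of each duplicated filename, one option is to flag every member of a duplicated group and then keep only its first row (a sketch; 'duplicates.csv' is a placeholder output path):
# keep=False marks every row whose filename occurs more than once,
# then drop_duplicates keeps only the first row of each group
dups = df[df.duplicated('filename', keep=False)].drop_duplicates('filename')
dups.to_csv('duplicates.csv', index=False)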
I am trying to edit values after making duplicate rows in Pandas.
I want to edit only one column ("code"), but I see that since the copied rows keep the same index, an edit affects the original rows as well.
Is there any way to first create the duplicates and then modify the data only in the duplicates that were created?
import pandas as pd
df=pd.read_excel('so.xlsx',index=False)
a = df['code'] == 1234
b = df[a]
df=df.append(b)
print('\n\nafter replicate')
print(df)
Current output after making duplicates is as below:
coun code name
0 A 123 AR
1 F 123 AD
2 N 7 AR
3 I 0 AA
4 T 10 AS
2 N 7 AR
3 I 7 AA
Now I expect to change values only in the duplicates created, in this case the bottom two rows. But now I see the indexes are duplicated as well.
You can avoid the duplicate indices by using the ignore_index argument to append.
df=df.append(b, ignore_index=True)
You may also find it easier to modify your data in b, before appending it to the frame.
import pandas as pd
df = pd.read_excel('so.xlsx')  # read_excel does not take an index argument
a = df['code'] == 3
b = df[a].copy()  # copy so the edit below does not touch the original rows
b.loc[2, "region"] = "N"  # edit the duplicated rows before appending
df = df.append(b, ignore_index=True)
print('\n\nafter replicate')
print(df)
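Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on recent versions the same step can be written with pd.concat:
# concatenate the original frame and the modified copies, renumbering the index
df = pd.concat([df, b], ignore_index=True)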
My CSV file contains 20 columns, and I need to take data only from those addresses that are relevant to my study, so I compare the column containing all addresses to a column containing only the specific addresses.
I am getting a KeyError saying the index selected_city does not exist:
import csv
import os
import pandas as pd
data_new = pd.read_csv('file1.csv', encoding="ISO-8859-1")
print(data_new)
for i in rows:
if str(data.loc['selected_city'] == data.loc['Charge_Point_City'])
print(data.Volume,data.Charge_Point_City)
Consider using the built-in pandas method .isin().
For example:
s = pd.Series(['a','b','c', 'b','c','a','b'])
So now s looks like:
0 a
1 b
2 c
3 b
4 c
5 a
6 b
Say you only want to keep the rows where s is in a smaller series:
smol = pd.Series(['a','b'])
s[s.isin(smol)]
Output:
0 a
1 b
3 b
5 a
6 b
For your specific use case, you probably want
data_new = data_new[data_new['selected_city'].isin(data_new['Charge_Point_City'])]
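Note that .isin() tests whether each selected_city value appears anywhere in the Charge_Point_City column; if you instead need a row-by-row comparison, data_new['selected_city'] == data_new['Charge_Point_City'] compares the two columns elementwise.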