How to compare two columns of strings in Python?

My CSV file contains 20 columns, and I need to take the data of only those addresses that are relevant to my study, so I compare the column containing all addresses to a column containing only the specific addresses.
I am getting a "KeyError" saying the index selected_city does not exist:
import csv
import os
import pandas as pd
data_new = pd.read_csv('file1.csv', encoding="ISO-8859-1")
print(data_new)
for i in rows:
    if str(data.loc['selected_city'] == data.loc['Charge_Point_City']):
        print(data.Volume, data.Charge_Point_City)

Consider using the built-in .isin() method.
For example:
s = pd.Series(['a','b','c', 'b','c','a','b'])
So now s looks like:
0 a
1 b
2 c
3 b
4 c
5 a
6 b
Say you only want to keep the rows where s is in a smaller series:
smol = pd.Series(['a','b'])
s[s.isin(smol)]
Output:
0 a
1 b
3 b
5 a
6 b
For your specific use case, you probably want
data = data[data['selected_city'].isin(data['Charge_Point_City'])]
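Putting the answer together with the column names from the question (a sketch only; 'file1.csv', 'Volume', 'selected_city' and 'Charge_Point_City' are taken from the question, and the filter direction may need to be swapped depending on which column holds the reference list):
import pandas as pd

data_new = pd.read_csv('file1.csv', encoding="ISO-8859-1")
# keep rows whose selected_city value also appears somewhere in Charge_Point_City;
# swap the two column names if the reference list is the other way around
mask = data_new['selected_city'].isin(data_new['Charge_Point_City'])
print(data_new.loc[mask, ['Volume', 'Charge_Point_City']])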

Related

Pandas Dataframe Remove all Rows with Letters in Certain Column

I have a pandas dataframe in Python, and I want to remove rows that contain letters in a certain column. I have tried a few things, but nothing has worked.
Input:
A B C
0 9 1 a
1 8 2 b
2 7 cat c
3 6 4 d
I would then remove rows that contained letters in column 'B'...
Expected Output:
A B C
0 9 1 a
1 8 2 b
3 6 4 d
Update:
After seeing the replies, I still haven't been able to get this to work. I'm going to just place my entire code here. Maybe I'm not understanding something...
import pandas as pd
#takes file path from user and removes quotation marks if necessary
sysco1file = input("Input path of FS1 file: ").replace("\"","")
sysco2file = input("Input path of FS2 file: ").replace("\"","")
sysco3file = input("Input path of FS3 file: ").replace("\"","")
#tab separated files, all values string
sysco_1 = pd.read_csv(sysco1file, sep='\t', dtype=str)
sysco_2 = pd.read_csv(sysco2file, sep='\t', dtype=str)
sysco_3 = pd.read_csv(sysco3file, sep='\t', dtype=str)
#combine all rows from the 3 files into one dataframe
sysco_all = pd.concat([sysco_1,sysco_2,sysco_3])
#Also dropping nulls from CompAcctNum column
sysco_all.dropna(subset=['CompAcctNum'], inplace=True)
#ensure all values are string
sysco_all = sysco_all.astype(str)
#implemented solution from stackoverflow
#I also tried putting "sysco_all = " in front of this
sysco_all.loc[~sysco_all['CompanyNumber'].str.isalpha()]
#writing dataframe to new csv file
sysco_all.to_csv(r"C:\Users\user\Desktop\testcsvfile.csv")
I do not get an error. However, the csv still has rows with letters in this column.
Assuming the B column is of string type, we can use str.contains here:
df[~df["B"].str.contains(r'^[A-Za-z]+$', regex=True)]
Here is another way to do it:
# use isalpha to check if value is alphabetic
# use negation to pick where value is not alphabetic
df=df.loc[~df['B'].str.isalpha()]
df
A B C
0 9 1 a
1 8 2 b
3 6 4 d
OR
# output the filtered result to csv, preserving the original DF
df.loc[~df['B'].str.isalpha()].to_csv('out.csv')

How to select specific rows based on specific condition

I am working with pandas in Python. I am trying to select specific rows based on a specific condition. From the dataset below, I want the System groups which have Type 1 in them. System groups which don't have Type 1 can be ignored.
System  Type
A       1
A       2
A       2
A       3
B       1
B       2
C       2
D       3
Required Output
System  Type
A       1
A       2
A       2
A       3
B       1
B       2
Systems A and B appear in the required output because they contain a Type 1 value. Groups C and D have been ignored because they have no Type 1 rows. I am trying to do this with groupby, but I am unable to extend that to check for the presence of Type 1 within each group. Please help.
Code to generate dataframe
import pandas as pd
data = [['A', 1], ['A', 2], ['A', 2],['A',3],['B',1],['B',2],['C',2],['C',3],['D',3]]
df = pd.DataFrame(data, columns=['System', 'Type'])
df[df['System'].isin(df[df['Type'] == 1]['System'])]
System Type
0 A 1
1 A 2
2 A 2
3 A 3
4 B 1
5 B 2
First, you filter the df to only the rows with Type == 1 and select the System column. Then you filter the df to include only the rows where System is in that list.
It might be faster to get the values into a set and search in it
df[df['System'].isin(set(df[df['Type'] == 1]['System'].values))]
Let's say the table variable is df. Then the code will be:
sys = list(df[df['Type'] == 1]['System'].values)
ans = df[df['System'].isin(sys)]
ans is your preferred table. I am sure there are better ways but hopefully this works.
You can use the where clause; it will be much faster than conditional selection:
selectedvariables=df.where(df['Type']==1)["System"].dropna()
print(df[df["System"].isin(selectedvariables)])

What's the fastest way to select values from columns based on keys in another columns in pandas?

I need a fast way to extract the right values from a pandas dataframe:
Given a dataframe with (a lot of) data in several named columns and an additional column whose values contain only the names of the other columns, how do I select values from the data columns with the additional column as the key?
It's simple to do via an explicit loop, but this is extremely slow with something like .iterrows() directly on the DataFrame. Converting to numpy arrays is faster, but still not fast. Can I combine methods from pandas to do it even faster?
Example: This is the kind of DataFrame structure, where columns A and B contain data and column keys contains the keys to select from:
import pandas
df = pandas.DataFrame(
    {'A': [1, 2, 3, 4],
     'B': [5, 6, 7, 8],
     'keys': ['A', 'B', 'B', 'A']},
)
print(df)
print(df)
output:
Out[1]:
A B keys
0 1 5 A
1 2 6 B
2 3 7 B
3 4 8 A
Now I need some fast code that returns a DataFrame like
Out[2]:
val_keys
0 1
1 6
2 7
3 4
I was thinking something along the lines of this:
tmp = df.melt(id_vars=['keys'], value_vars=['A','B'])
out = tmp.loc[tmp['keys'] == tmp['variable']]
which produces:
Out[2]:
keys variable value
0 A A 1
3 A A 4
5 B B 6
6 B B 7
but doesn't have the right order or index. So it's not quite a solution.
Any suggestions?
See if either of these works for you (with import numpy as np):
df['val_keys'] = np.where(df['keys'] == 'A', df['A'], df['B'])
or
df['val_keys'] = np.select([df['keys'] == 'A', df['keys'] == 'B'], [df['A'], df['B']])
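If there are more data columns than just A and B, the same np.select idea can be built programmatically; a sketch under that assumption (data_cols is a hypothetical list of the data column names):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8],
                   'keys': ['A', 'B', 'B', 'A']})

data_cols = ['A', 'B']  # extend this list for additional data columns
conditions = [df['keys'] == c for c in data_cols]
choices = [df[c] for c in data_cols]
df['val_keys'] = np.select(conditions, choices)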
No need to specify anything for the code below!
def value(row):
    a = row.name
    b = row['keys']
    c = df.loc[a, b]
    return c

df.apply(value, axis=1)
Have you tried filtering then mapping:
df_A = df[df['keys'].isin(['A'])]
df_B = df[df['keys'].isin(['B'])]
A_dict = dict(zip(df_A['keys'], df_A['A']))
B_dict = dict(zip(df_B['keys'], df_B['B']))
df['val_keys'] = df['keys'].map(A_dict)
df['val_keys'] = df['keys'].map(B_dict).fillna(df['val_keys'])  # non-exhaustive mapping for the second one
Your df['val_keys'] column will now contain the result shown in your val_keys output.
If you want, you can retain just that column, as in your expected output, with:
df = df[['val_keys']]
Hope this helps :))
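For large frames, a purely positional NumPy lookup may be faster than the row-wise approaches above; a minimal sketch, assuming every value in keys names one of the data columns:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8],
                   'keys': ['A', 'B', 'B', 'A']})

data_cols = ['A', 'B']
# map each key to its column position, then pair row i with its own column
col_pos = pd.Index(data_cols).get_indexer(df['keys'])
df['val_keys'] = df[data_cols].to_numpy()[np.arange(len(df)), col_pos]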

How to modify data after replicate in Pandas?

I am trying to edit values after making duplicate rows in pandas.
I want to edit only one column ("code"), but I see that since it has duplicates, the edit will affect the original rows as well.
Is there any method to first create the duplicates and then modify the data of only the duplicates created?
import pandas as pd
df=pd.read_excel('so.xlsx',index=False)
a = df['code'] == 1234
b = df[a]
df=df.append(b)
print('\n\nafter replicate')
print(df)
Current output after making duplicates is as below:
coun code name
0 A 123 AR
1 F 123 AD
2 N 7 AR
3 I 0 AA
4 T 10 AS
2 N 7 AR
3 I 7 AA
Now I expect to change values only on the duplicates created, in this case the bottom two rows. But now I see the indexes are duplicated as well.
You can avoid the duplicate indices by using the ignore_index argument to append.
df=df.append(b, ignore_index=True)
You may also find it easier to modify your data in b, before appending it to the frame.
import pandas as pd
df=pd.read_excel('so.xlsx',index=False)
a = df['code'] == 3
b = df[a].copy()
b.loc[2, "region"] = "N"
df=df.append(b, ignore_index=True)
print('\n\nafter replicate')
print(df)
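In recent pandas versions DataFrame.append has been removed, so the same pattern can be written with pd.concat instead; a sketch using the question's file and column names (the code value 1234 comes from the question, the replacement value is made up):
import pandas as pd

df = pd.read_excel('so.xlsx')
# copy the rows to duplicate, edit only the copy, then concatenate
b = df[df['code'] == 1234].copy()
b['code'] = 9999  # hypothetical new value applied only to the duplicates
df = pd.concat([df, b], ignore_index=True)
print(df)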

Is there a way to do a Series.map in place, but keep original value if no match?

The scenario here is that I've got a dataframe df with raw integer data, and a dict map_array which maps those ints to string values.
I need to replace the values in the dataframe with the corresponding values from the map, but keep the original value if it doesn't map to anything.
So far, the only way I've been able to figure out how to do what I want is by using a temporary column. However, with the size of data that I'm working with, this could sometimes get a little bit hairy. And so, I was wondering if there was some trick to do this in pandas without needing the temp column...
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1,5, size=(100,1)))
map_array = {1:'one', 2:'two', 4:'four'}
df['__temp__'] = df[0].map(map_array, na_action=None)
#I've tried varying the na_action arg to no effect
nan_index = df['__temp__'][df['__temp__'].isnull()].index
df.loc[nan_index, '__temp__'] = df.loc[nan_index, 0]
df[0] = df['__temp__']
df = df.drop(['__temp__'], axis=1)
I think you can simply use .replace, whether on a DataFrame or a Series:
>>> df = pd.DataFrame(np.random.randint(1,5, size=(3,3)))
>>> df
0 1 2
0 3 4 3
1 2 1 2
2 4 2 3
>>> map_array = {1:'one', 2:'two', 4:'four'}
>>> df.replace(map_array)
0 1 2
0 3 four 3
1 two one two
2 four two 3
>>> df.replace(map_array, inplace=True)
>>> df
0 1 2
0 3 four 3
1 two one two
2 four two 3
I'm not sure what the memory hit of changing column dtypes will be, though.
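If you would rather stay with map and avoid the temporary column altogether, one option (a sketch, not part of the answer above) is to chain fillna with the original column, so unmapped values fall back to what was already there:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 5, size=(100, 1)))
map_array = {1: 'one', 2: 'two', 4: 'four'}

# unmapped values become NaN, then fillna restores the originals
df[0] = df[0].map(map_array).fillna(df[0])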
