print rows from read_csv object with conditions - python

Trying to print records from a .csv file with two conditions based on two columns (Geography and Comment). It works when I put one condition on the Geography column but does not work when I put conditions on both the Geography and Comment columns. Is this a syntax mistake? Thanks!
Works fine:
import pandas as pd
dt = pd.read_csv("data.csv", low_memory=False)
print(dt)
print(list(dt))
geo_ont = dt[dt.Geography=="Ontario"]
print(geo_ont)
Does not Work:
import pandas as pd
dt = pd.read_csv("data.csv", low_memory=False)
print(dt)
print(list(dt))
geo_ont = dt[dt.Geography=="Ontario" & dt.Comment=="TRUE"]
print(geo_ont)

Two things are going on here. First, `&` binds more tightly than `==` in Python, so each comparison must be wrapped in parentheses; without them the expression is mis-grouped and fails. Second, I believe the Comment column is Boolean, so either convert it to a string or just compare it against `True`/`1`. Here is the code:
import pandas as pd
dt = pd.read_csv("test.csv", low_memory=False)
print(dt.Comment.dtype)
geo_ont = dt[(dt.Geography=="Ontario") & (dt.Comment)]
#OR
#geo_ont = dt[(dt.Geography=="Ontario") & (dt.Comment==True)]
#OR
#geo_ont = dt[(dt.Geography=="Ontario") & (dt.Comment==1)]
print(geo_ont)
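To make the precedence point concrete, here is a minimal, self-contained sketch with made-up data (the Geography/Comment column names are taken from the question): without the parentheses, `&` is applied before `==` and the filter raises an error; with them, each comparison produces a Boolean mask first.

```python
import pandas as pd

# Toy frame standing in for data.csv (column names from the question)
dt = pd.DataFrame({
    "Geography": ["Ontario", "Ontario", "Quebec"],
    "Comment": [True, False, True],
})

# Each parenthesized comparison yields a Boolean Series;
# & then combines them element-wise.
geo_ont = dt[(dt.Geography == "Ontario") & (dt.Comment)]
print(geo_ont)
```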


Filtering a CSV File using two columns

I am a newbie to Python. I am working on a CSV file that has over a million records. In the data, every Location has a unique ID (SiteID). I want to find and remove any records where the SiteID is missing or does not match the Location. (Note: the script should print the line number and the mismatching field values for each such record.)
I have the following code. Please help me out:
import pandas as pd
pd = pandas.read_csv ('car-auction-data-from-ghana', delimiter = ";")
pd.head()
date_time = (pd['Date Time'] >= '2010-01-01T00:00:00+00:00') #to filter from a specific date
comparison_column = pd.where(pd['SiteID'] == pd['Location'], True, False)
comparison_column
This should be your solution:
df = pd.read_csv('car-auction-data-from-ghana', delimiter = ";")
print(df.head())
date_time = (df['Date Time'] >= '2010-01-01T00:00:00+00:00') #to filter from a specific date
df = df[df['SiteID'] == df['Location']]
print(df)
You need to call `read_csv` as `pd.read_csv` because `pd` is the alias of the imported package, and you should store the data frame in its own variable (`df`) rather than overwriting that alias. The comparison line then keeps only the rows where `SiteID` equals `Location`, dropping the mismatched ones.
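The question also asks to print the line number and the mismatching field values for each bad record. A sketch of that, using toy data and assuming the same `SiteID`/`Location` column names:

```python
import pandas as pd

# Toy frame standing in for the real CSV (SiteID/Location names from the question)
df = pd.DataFrame({
    "SiteID": ["A1", "B2", None, "C3"],
    "Location": ["A1", "XX", "B9", "C3"],
})

# Rows with a missing SiteID or a SiteID/Location mismatch
bad = df[df["SiteID"].isna() | (df["SiteID"] != df["Location"])]
for idx, row in bad.iterrows():
    print(f"line {idx}: SiteID={row['SiteID']!r} Location={row['Location']!r}")

# Keep only the matching rows
df = df[df["SiteID"] == df["Location"]]
```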

Formatting of JSON file

Can we convert the highlighted INTEGER values to STRING values (see the link below)?
https://i.stack.imgur.com/3JbLQ.png
CODE
filename = "newsample2.csv"
jsonFileName = "myjson2.json"
import pandas as pd
df = pd.read_csv ('newsample2.csv')
df.to_json('myjson2.json', indent=4)
print(df)
Try doing something like this.
import pandas as pd
filename = "newsample2.csv"
jsonFileName = "myjson2.json"
df = pd.read_csv ('newsample2.csv')
df['index'] = df.index
df.to_json('myjson2.json', indent=4)
print(df)
This will take indices of your data and store them in the index column, so they will become a part of your data.
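As for the original question of turning integer values into strings in the JSON output, the usual approach is to cast the column with `astype(str)` before serializing. A minimal sketch, where `"id"` is a made-up stand-in for whichever integer column should become text:

```python
import pandas as pd

# Toy frame; "id" stands in for the integer column from the screenshot
df = pd.DataFrame({"id": [101, 102], "name": ["a", "b"]})

# Cast integers to strings so they are quoted in the JSON output
df["id"] = df["id"].astype(str)
json_text = df.to_json(indent=4)
print(json_text)
```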

How to delete specific rows from a CSV (e.g. user_name = Max)

I have a CSV file of around 40K rows, and I want to delete about 10K rows based on a condition (e.g. user_name = Max). My data looks like:
user1_name,user2_name,distance
"Unews","CCSSuptConnelly",""
"Unews","GwapTeamFre",""
"Unews","WilsonRecDept","996.27"
"Unews","ChiOmega_ISU","1025.03"
"Unews","officialtshay",""
"Unews","hari",""
"Unews","lashaunlester7",""
"Unews","JakeSlaughter5","509.53"
Thank you!
import pandas as pd

# Read the csv
df = pd.read_csv('filename')

# Build the index of rows to drop
index_names = df[df['user2_name'] == 'Max'].index

# Drop them
df.drop(index_names, inplace=True)
You can use the pandas library for this kind of problem and then use the `.loc[]` function (note that negation in pandas is `~`, not `!`). Link to the docs: Loc Function in pandas
import pandas as pd
df = pd.read_csv('name.csv')
df_filtered = df.loc[~(df['user2_name'] == 'Max'), :]

Deleting rows with the word 'hate' in a column I called 'Bio' in a spreadsheet

I want to add a line to the code below, which I use in Python, so that it deletes all rows that have the word 'hate' in column I, which is called 'Bio':
import pandas as pd
from datetime import datetime
INPUT_FILE = 'Sample spreadsheet.xlsx'
OUTPUT_FILE = 'Output.xlsx'
df = pd.read_excel(INPUT_FILE)
df.dropna(subset=['Location', 'Full name'], inplace=True)
df = df[(df['Followers'] > 200) & (df['Friends'] > 200) & (df['Last tweet'] > '2011-04-12') & (df['Created'] < '2018-12-31')]
with pd.ExcelWriter(OUTPUT_FILE) as writer:
    df.to_excel(writer)
I would add lowercasing before calling contains! This means Hate, hate, HATE would be caught:
import pandas as pd
df = pd.DataFrame({'foo': [1, 2],
                   'bio': ['i love pandas',
                           'i HATE ms excel']})
# normalize words to lowercase
#df = df[~ df['bio'].str.lower().str.contains('hate')]
df = df[~ df['bio'].str.contains('hate',case=False)]
Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
If you want to remove strings that contains the word "hate":
df = df[~df["Bio"].str.contains("hate")]
use this line
df = df[df['Bio'] != 'hate']
If the column can contain multiple values, you can use
df = df[~df["Bio"].str.lower().str.contains("hate")]
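One pitfall worth hedging against with any of the `str.contains` answers above: if the Bio column has missing values, the mask contains NA entries and pandas refuses to filter with it. Passing `na=False` treats missing bios as non-matching. A small sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({"Bio": ["i love pandas", "i HATE ms excel", None]})

# case=False catches Hate/HATE; na=False keeps rows whose Bio is missing
df = df[~df["Bio"].str.contains("hate", case=False, na=False)]
print(df)
```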

combine/merge two csv using pandas/python

I have two CSVs that I want to combine or merge as a left join.
My key column is "id". Both CSVs have the same non-key column, "result", and I want the "result" value from the second CSV to override the first whenever it is present. How can I achieve that using pandas or any scripting language? Please see my expected final output.
Input
input.csv:
id,scenario,data1,data2,result
1,s1,300,400,"{s1,not added}"
2,s2,500,101,"{s2 added}"
3,s3,600,202,
output.csv:
id,result
1,"{s1,added}"
3,"{s3,added}"
Expected Output
final_output.csv
id,scenario,data1,data2,result
1,s1,300,400,"{s1,added}"
2,s2,500,101,"{s2 added}"
3,s3,600,202,"{s3,added}"
Current Code:
import pandas as pd
a = pd.read_csv("input.csv")
b = pd.read_csv("output.csv")
merged = a.merge(b, on='id', how='left')
merged.to_csv("final_output.csv", index=False)
Question:
Using this code I get the result column twice (result_x and result_y). I want it only once, with the second CSV's value overriding the first where present. How do I get a single result column?
Try this; it works as well:
import pandas as pd

a = pd.read_csv("input.csv")
b = pd.read_csv("output.csv")
c = pd.merge(a, b, on='id', how='left')
lst = []
for i in c.index:
    # take the second CSV's result where it exists, else keep the first
    if pd.notna(c.iloc[i]['result_y']):
        lst.append(c.iloc[i]['result_y'])
    else:
        lst.append(c.iloc[i]['result_x'])
c['result'] = pd.Series(lst)
del c['result_x']
del c['result_y']
This will combine the columns as desired:
import pandas as pd

a = pd.read_csv("input.csv")
b = pd.read_csv("output.csv")
merged = a.merge(b, on='id', how='outer')

def merge_results(row):
    # result_y is NaN (a float) when output.csv has no value for this id
    y = row['result_y']
    return row['result_x'] if isinstance(y, float) else y

merged['result'] = merged.apply(merge_results, axis=1)
del merged['result_x']
del merged['result_y']
merged.to_csv("final_output.csv", index=False)
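As an alternative sketch, pandas' `combine_first` expresses the "take the second CSV's result when present, else keep the first" rule directly after the merge. Shown here with toy frames standing in for input.csv and output.csv:

```python
import pandas as pd

# Toy frames standing in for input.csv and output.csv
a = pd.DataFrame({"id": [1, 2, 3],
                  "scenario": ["s1", "s2", "s3"],
                  "result": ["{s1,not added}", "{s2 added}", None]})
b = pd.DataFrame({"id": [1, 3], "result": ["{s1,added}", "{s3,added}"]})

merged = a.merge(b, on="id", how="left", suffixes=("_x", "_y"))
# result_y (from output.csv) wins wherever it is non-missing
merged["result"] = merged["result_y"].combine_first(merged["result_x"])
merged = merged.drop(columns=["result_x", "result_y"])
print(merged)
```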
You can also use concat, as below.
import pandas as pd

a = pd.read_csv("input.csv")
b = pd.read_csv("output.csv")
frames = [a, b]
mergedFrames = pd.concat(frames, sort=True)
mergedFrames.to_csv("final_output.csv")
NOTE: sort=True is added to avoid a warning.
