How to delete specific rows from a CSV (e.g. user_name = Max) - python

I have a CSV file of around 40K rows, and I want to delete about 10K rows that match a condition (e.g. user_name = Max). My data looks like this:
user1_name,user2_name,distance
"Unews","CCSSuptConnelly",""
"Unews","GwapTeamFre",""
"Unews","WilsonRecDept","996.27"
"Unews","ChiOmega_ISU","1025.03"
"Unews","officialtshay",""
"Unews","hari",""
"Unews","lashaunlester7",""
"Unews","JakeSlaughter5","509.53"
Thank you!

import pandas as pd

# Read the csv
df = pd.read_csv('filename')
# Create an index of the matching rows
index_names = df[df['user2_name'] == 'Max'].index
# Drop them
df.drop(index_names, inplace=True)
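To persist the deletion, write the frame back out afterwards; a minimal sketch, with a hypothetical output path:
# Write the filtered rows back to disk (hypothetical output filename)
df.to_csv('filename_filtered.csv', index=False)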

You can use the pandas library for this kind of problem and then use the .loc[] function. Link to the docs: Loc Function in pandas
import pandas as pd
df = pd.read_csv('name.csv')
df_filtered = df.loc[~(df['user_name'] == 'Max'), :]
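Plain boolean indexing is equivalent; a sketch, keeping this answer's assumption that the column is named user_name:
# Keep every row whose user_name is not 'Max' (same column-name assumption)
df_filtered = df[df['user_name'] != 'Max']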

Related

print rows from read_csv object with conditions

Trying to print records from a .csv file with two conditions based on two columns (Geography and Comment). It works when I put one condition on the Geography column but does not work when I put conditions on both the Geography and Comment columns. Is this a syntax mistake? Thanks!
Works fine:
import pandas as pd
dt = pd.read_csv("data.csv", low_memory=False)
print(dt)
print(list(dt))
geo_ont = dt[dt.Geography=="Ontario"]
print(geo_ont)
Does not work:
import pandas as pd
dt = pd.read_csv("data.csv", low_memory=False)
print(dt)
print(list(dt))
geo_ont = dt[dt.Geography=="Ontario" & dt.Comment=="TRUE"]
print(geo_ont)
I believe the Comment column is boolean, so either convert the Comment column to string or just use it as 1 or True. Note also that & binds more tightly than ==, so each condition needs its own parentheses. Here is the code:
import pandas as pd
dt = pd.read_csv("test.csv", low_memory=False)
print(dt.Comment.dtype)
geo_ont = dt[(dt.Geography=="Ontario") & (dt.Comment)]
#OR
#geo_ont = dt[(dt.Geography=="Ontario") & (dt.Comment==True)]
#OR
#geo_ont = dt[(dt.Geography=="Ontario") & (dt.Comment==1)]
print(geo_ont)
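If the column actually arrives as text (the literal strings "TRUE"/"FALSE") rather than boolean, the string-comparison variant mentioned above would look like this (a sketch):
# Compare against the string form if the dtype is object rather than bool
geo_ont = dt[(dt.Geography == "Ontario") & (dt.Comment.astype(str) == "TRUE")]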

Filtering a CSV File using two columns

I am a newbie to Python. I am working with a CSV file that has over a million records. In the data, every Location has a unique ID (SiteID). I want to filter out and remove any records where the value is missing or where SiteID and Location do not match. (Note: the script should print the line number and the mismatched field values for each such record.)
I have the following code. Please help me out:
import pandas as pd
pd = pandas.read_csv ('car-auction-data-from-ghana', delimiter = ";")
pd.head()
date_time = (pd['Date Time'] >= '2010-01-01T00:00:00+00:00') #to filter from a specific date
comparison_column = pd.where(pd['SiteID'] == pd['Location'], True, False)
comparison_column
This should be your solution:
df = pd.read_csv('car-auction-data-from-ghana', delimiter = ";")
print(df.head())
date_time = (df['Date Time'] >= '2010-01-01T00:00:00+00:00') #to filter from a specific date
df = df[df['SiteID'] == df['Location']]
print(df)
You need to call read_csv as a member of pd, because pd is the alias of the imported package; your code rebinds the name pd, shadowing the package, so use a separate variable such as df for your data frame. The comparison line keeps only the rows where SiteID equals Location, dropping the mismatched ones.
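The question also asks to print the line number and the mismatched field values for each record; a minimal sketch of that, run before the filtering line above reassigns df:
# Report each mismatched row: index plus the two differing values
mismatches = df[df['SiteID'] != df['Location']]
for idx, row in mismatches.iterrows():
    print(idx, row['SiteID'], row['Location'])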

Is there a pandas function to select latest available date of a data frame?

I have the following code, but I want to automate or simplify selecting the latest available date. Here's my code:
import pandas as pd
covid_url = 'https://raw.githubusercontent.com/mariorz/covid19-mx-time-series/master/data/covid19_confirmed_mx.csv'
covid_total = pd.read_csv(covid_url, index_col=0)
covid_total = covid_total.loc['Colima', '17-03-2020':'02-12-2020']
covid_total
Is there a way to substitute the 02-12-2020 in the .loc with something that selects the latest available date?
I thank you guys in advance.
Convert the index to datetime; then the latest available date is simply the maximum:
df.index = pd.to_datetime(df.index, format='%d-%m-%Y')
df.loc[df.index.max()]
This solved my problem without sacrificing the data filter from 17-03-2020:
import pandas as pd
covid_total = 'https://raw.githubusercontent.com/mariorz/covid19-mx-time-series/master/data/covid19_confirmed_mx.csv'
df = pd.read_csv(covid_total, index_col=0)
df = df.loc['Colima', '17-03-2020':]
df
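If you want only the most recent date rather than the whole range, one option is to parse the column labels as dates and pick the maximum; a sketch, assuming every column label is a date in %d-%m-%Y format:
raw = pd.read_csv(covid_total, index_col=0)
# Parse the date-string column labels and locate the latest one
dates = pd.to_datetime(raw.columns, format='%d-%m-%Y')
latest_col = raw.columns[dates.argmax()]
print(raw.loc['Colima', latest_col])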

combine/merge two csv using pandas/python

I have two CSVs that I want to combine or merge as a left join.
My key column is "id". Both CSVs have the same non-key column, "result", but I want the "result" column to be overridden whenever a value exists in the "result" column of the 2nd CSV. How can I achieve that using pandas or any scripting language? Please see my final expected output.
Input
input.csv:
id,scenario,data1,data2,result
1,s1,300,400,"{s1,not added}"
2,s2,500,101,"{s2 added}"
3,s3,600,202,
output.csv:
id,result
1,"{s1,added}"
3,"{s3,added}"
Expected Output
final_output.csv
id,scenario,data1,data2,result
1,s1,300,400,"{s1,added}"
2,s2,500,101,"{s2 added}"
3,s3,600,202,"{s3,added}"
Current Code:
import pandas as pd
a = pd.read_csv("input.csv")
b = pd.read_csv("output.csv")
merged = a.merge(b, on='id', how='left')
merged.to_csv("final_output.csv", index=False)
Question:
Using this code I get the result column twice (as result_x and result_y). I want it only once, and the value should be overridden when one exists in the second CSV. How do I get a single result column?
Try this; it works as well:
import pandas as pd

a = pd.read_csv("input.csv")
b = pd.read_csv("output.csv")
c = pd.merge(a, b, on='id', how='left')
lst = []
for i in c.index:
    # prefer the value from the 2nd CSV when it exists
    if pd.notna(c.iloc[i]['result_y']):
        lst.append(c.iloc[i]['result_y'])
    else:
        lst.append(c.iloc[i]['result_x'])
c['result'] = pd.Series(lst)
del c['result_x']
del c['result_y']
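The loop can also be replaced with a vectorized fillna over the same merged frame; result_y is kept where present and result_x fills the gaps (a sketch):
# prefer result_y; fall back to result_x where it is NaN
c['result'] = c['result_y'].fillna(c['result_x'])
c = c.drop(columns=['result_x', 'result_y'])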
This will combine the columns as desired:
import pandas as pd

a = pd.read_csv("input.csv")
b = pd.read_csv("output.csv")
merged = a.merge(b, on='id', how='outer')

def merge_results(row):
    # a missing result_y shows up as NaN, which is a float
    y = row['result_y']
    return row['result_x'] if isinstance(y, float) else y

merged['result'] = merged.apply(merge_results, axis=1)
del merged['result_x']
del merged['result_y']
merged.to_csv("final_output.csv", index=False)
You can also use concat as below.
import pandas as pd

a = pd.read_csv("input.csv")
b = pd.read_csv("output.csv")
frames = [a, b]
mergedFrames = pd.concat(frames, sort=True)
mergedFrames.to_csv('path/to/location.csv')  # placeholder path
NOTE: The sort=True is added to avoid some warnings
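Note that concat only stacks the rows; to collapse back to one row per id, one option is to group on id and take the last non-null value per column, since the 2nd CSV's rows come last (a sketch; GroupBy.last skips nulls):
combined = pd.concat([a, b], sort=True)
final = combined.groupby('id', as_index=False).last()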

pandas Combine Excel Spreadsheets

I have an Excel workbook with many tabs.
Each tab has the same set of headers as all others.
I want to combine all of the data from each tab into one data frame (without repeating the headers for each tab).
So far, I've tried:
import pandas as pd
xl = pd.ExcelFile('file.xlsx')
df = xl.parse()
Can I use something for the parse argument that will mean "all sheets"?
Or is this the wrong approach?
Thanks in advance!
Update: I tried:
a = xl.sheet_names
b = pd.DataFrame()
for i in a:
    b.append(xl.parse(i))
b
But it's not "working".
This is one way to do it -- load all sheets into a dictionary of dataframes and then concatenate all the values in the dictionary into one dataframe.
import pandas as pd

# Set sheet_name to None in order to load all sheets into a dict of dataframes
df = pd.read_excel('tmp.xlsx', sheet_name=None, index_col=None)
# Then concatenate all dataframes; ignore_index avoids overlapping
# index values later (see comment by #bunji)
cdf = pd.concat(df.values(), ignore_index=True)
print(cdf)
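If you also want to record which sheet each row came from, you can concatenate the dict itself; the sheet names become an index level (a sketch; the 'sheet' label is just an example name):
# The dict keys (sheet names) become the first index level
cdf = pd.concat(df, names=['sheet']).reset_index(level=0)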
Alternatively:
import pandas as pd

f = 'file.xlsx'
df = pd.read_excel(f, sheet_name=None)
df2 = pd.concat(df, sort=True, ignore_index=True)
df2.to_excel('merged.xlsx',
             engine='xlsxwriter',
             sheet_name='Merged',
             header=True,
             index=False)
