I have a csv file like below
a,v,s,f
china,usa,china and uk,france
india,australia,usa,uk
japan,south africa,japan,new zealand
I'm trying to get output like below, by removing repeated words in each row
a v s f
0 china usa and uk france
1 india australia usa uk
2 japan south africa new zealand
for which I'm doing
import pandas as pd
from io import StringIO
data = """a,v,s,f
china,usa,china and uk,france
india,australia,usa,uk
japan,south africa,japan,new zealand"""
df = pd.read_csv(StringIO(data.decode('UTF-8')))
from collections import Counter
def trans(x):
    d = [y for y in x]
    i = 0
    while i < len(d):
        j = i + 1
        item = d[i]
        while j < len(d):
            if item in d[j]:
                d[j] = d[j].replace(item, '')
            j += 1
        i += 1
    return d
print df.apply(lambda x: trans(x), axis=1)
It works fine as long as I input the data into variable 'data'. But if I want to import that from csv file by doing data = pd.read_csv("trial.csv"), it doesn't work. I get an error message saying 'DataFrame' object has no attribute 'decode'. How can I read the data from a CSV file and write output to CSV file using pandas? Where am I going wrong?
I have this .csv file (that I can't edit):
,Denmark,Norway,Sweden
TotalCases,"78,354 ","35,546 ","243,129 "
Deaths,"823","328","6,681"
Recovered,"61,461","20,956",N/A
I want to make 3 separate bar charts, one for each section (TotalCases, Deaths, Recovered). However, most guides I found online present the data the other way round, with TotalCases etc. as columns instead of rows as in this scenario. What is the right way to do this?
Then just transpose your data frame and follow the examples!
import pandas as pd
import numpy as np
from io import StringIO
s = StringIO("""
country,Denmark,Norway,Sweden
TotalCases,"78,354 ","35,546 ","243,129 "
Deaths,"823","328","6,681"
Recovered,"61,461","20,956",N/A
""")
df = pd.read_csv(s)
df_t = df.transpose()
df_t.columns = df_t.iloc[0, :]
df_t = df_t.iloc[1:, :]
df_t['country'] = df_t.index
Then use df_t and follow those examples.
In [45]: df_t
Out[45]:
country TotalCases Deaths Recovered country
Denmark 78,354 823 61,461 Denmark
Norway 35,546 328 20,956 Norway
Sweden 243,129 6,681 NaN Sweden
I am trying to read the last line from a CSV file stored in GCS.
My Code -
import pandas as pd
import gcsfs
fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('my-bucket/my_file.csv') as f:
file = pd.read_csv(f)
print(file.tail(1))
Output:
John Doe 120 jefferson st. Riverside NJ 08075
5 business-name Internal 6 NaN NaN NaN
Public Sample CSV file -
John,Doe,120 jefferson st.,Riverside, NJ, 08075
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123
business-name,Internal,6
I just want to get the last line - business-name,Internal,6 but that's not what I'm getting. I'm not sure why tail(1) is not working.
Can anyone please help me?
The below pandas code should solve your issue. You can use the pandas read_csv function directly on the GCS path instead of opening the file yourself (pandas delegates gs:// URLs to gcsfs):
import pandas as pd
df = pd.read_csv('gs://my-bucket/my_file.csv')
print(df.tail(1))
By the looks of it, the code is working correctly: tail(1) returns the last row, but printing it includes the header by default. If you want to suppress the header, use the following:
print(file.tail(1).to_string(header=False))
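If the goal is just the three values from the last line, note that the sample file has no header row, so it should be read with header=None; shorter rows are then padded with NaN on the right. A minimal sketch (inline text stands in for the GCS file, abridged):

```python
import pandas as pd
from io import StringIO

# inline, abridged stand-in for the GCS file contents
csv_text = StringIO(
    "John,Doe,120 jefferson st.,Riverside, NJ, 08075\n"
    "Jack,McGinnis,220 hobo Av.,Phila, PA,09119\n"
    "business-name,Internal,6\n"
)

# the file has no header row, so don't let pandas treat line 1 as one
df = pd.read_csv(csv_text, header=None)

# take the last row and drop the NaN padding
last = df.iloc[-1].dropna().tolist()
print(last)   # ['business-name', 'Internal', '6']
```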
Sl. No.,Name,Address
1.,Stuart,Wall Street
2.,Charlie,Broadway
3.,Oliver,Hollywood Boulevard
4.,Harry,Las Vegas Boulevard
5.,Kyle,Bourbon Street
The desired output: looking up the name 'Stuart' should print its address,
Wall Street
Import the CSV into a pandas DataFrame, set the name column as the index, and use loc:
import pandas as pd
df = pd.read_csv('/Users/prince/Downloads/test2.csv', sep=',')
df = df.set_index('Name')
print(df.loc['Stuart', 'Address'])
which gives the following output
Wall Street
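If many lookups are needed, a handy alternative is to turn the Name/Address pair into a plain dict once and index into that. A sketch with the sample data inlined (the real code would read the file path instead):

```python
import pandas as pd
from io import StringIO

# inline stand-in for test2.csv
data = StringIO("""Sl. No.,Name,Address
1.,Stuart,Wall Street
2.,Charlie,Broadway
3.,Oliver,Hollywood Boulevard
4.,Harry,Las Vegas Boulevard
5.,Kyle,Bourbon Street
""")
df = pd.read_csv(data)

# map each Name to its Address once, then do cheap dict lookups
addresses = df.set_index('Name')['Address'].to_dict()
print(addresses['Stuart'])   # Wall Street
```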
I'm new to Python and trying to do the following.
I have a csv file like below, (input.csv)
a,v,s,f
china,usa,china and uk,france
india,australia,usa,uk
japan,south africa,japan,new zealand
where I'd like to remove duplicates with respect to each row to get the below.
a,v,s,f (output.csv)
china,usa, and uk,france
india,australia,usa,uk
japan,south africa,,new zealand
Notice that though 'usa' is repeated in two different rows, it still is kept intact, unlike 'china' and 'japan', which are repeated in same rows.
I tried doing using OrderedDict from collections in the following way
from collections import OrderedDict
out = open ("output.csv","w")
items = open("input.csv").readlines()
print >> out, list(OrderedDict.fromkeys(items))
but it moved all the data into one single row
Iterating over rows and deleting items without tracking their original positions can corrupt the dataset: every item has a (row, column) position, and removing one shifts the items after it.
pandas is a good fit for such scenarios. By selecting the items in the same row, you can apply a function that rebuilds the row while respecting each item's position. We use the in operator to handle cases like 'china and uk', and replace the duplicated values with an empty string.
def trans(x):
    d = [y for y in x]
    i = 0
    while i < len(d):
        j = i + 1
        item = d[i]
        while j < len(d):
            if item in d[j]:
                d[j] = d[j].replace(item, '')
            j += 1
        i += 1
    return d
Your code would look like:
import pandas as pd
from io import StringIO

data = """a,v,s,f
china,usa,china and uk,france
india,australia,usa,uk
japan,south africa,japan,new zealand"""
df = pd.read_csv(StringIO(data))

def trans(x):
    d = [y for y in x]
    i = 0
    while i < len(d):
        j = i + 1
        item = d[i]
        while j < len(d):
            if item in d[j]:
                d[j] = d[j].replace(item, '')
            j += 1
        i += 1
    return d

print(df.apply(trans, axis=1, result_type='broadcast'))
a v s f
0 china usa and uk france
1 india australia usa uk
2 japan south africa new zealand
In order to read your csv file, you just need to pass the file name instead; more details are in the pandas read_csv documentation.
df = pd.read_csv("filename.csv")
This can actually be asked more specifically as, "How to remove duplicate items from lists." For which there's an existing solution: Removing duplicates in lists
So, assuming that your CSV file looks like this:
items.csv
a,v,s,f
china,usa,china,uk,france
india,australia,usa,uk
japan,south africa,japan,new zealand
I intentionally changed "china and uk" in line 2 to "china,uk". Note below.
Then the script to remove duplicates could be:
with open('items.csv') as f:
    for line in f.readlines():
        print(list(set(line.strip().split(','))))
Note: Now, if the 2nd line really does contain "china and uk", you'd have to do something different than processing the file as a plain CSV.
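One caveat with set: it scrambles the order of the fields. dict.fromkeys removes duplicates while keeping first-seen order. A sketch of the same per-row dedup that also writes the result back out (file names are placeholders; the sample file is created inline so the snippet is self-contained):

```python
import csv

# create the sample file from above (hypothetical path)
with open('items.csv', 'w', newline='') as f:
    f.write("a,v,s,f\n"
            "china,usa,china,uk,france\n"
            "india,australia,usa,uk\n"
            "japan,south africa,japan,new zealand\n")

with open('items.csv', newline='') as src, \
     open('dedup.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        # dict.fromkeys drops duplicates but preserves first-seen order
        writer.writerow(dict.fromkeys(row))
```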
I am trying to load in a really messy text file into Python/Pandas. Here is an example of what the data in the file looks like
('9ebabd77-45f5-409c-b4dd-6db7951521fd','9da3f80c-6bcd-44ae-bbe8-760177fd4dbc','Seattle, WA','2014-08-05 10:06:24','viewed_home_page'),('9ebabd77-45f5-409c-b4dd-6db7951521fd','9da3f80c-6bcd-44ae-bbe8-760177fd4dbc','Seattle, WA','2014-08-05 10:06:36','viewed_search_results'),('41aa8fac-1bd8-4f95-918c-413879ed43f1','bcca257d-68d3-47e6-bc58-52c166f3b27b','Madison, WI','2014-08-16 17:42:31','visit_start')
Here is my code
import pandas as pd
cols=['ID','Visit','Market','Event Time','Event Name']
table = pd.read_table(r'C:\Users\Desktop\Dump.txt', sep=',', header=None, names=cols, nrows=10)
But when I look at the table, it still does not read correctly.
All of the data is mainly on one row.
You could use ast.literal_eval to parse the data into a Python tuple of tuples, and then you could call pd.DataFrame on that:
import pandas as pd
import ast
cols=['ID','Visit','Market','Event Time','Event Name']
with open(filename) as f:
    data = ast.literal_eval(f.read())
df = pd.DataFrame(list(data), columns=cols)
print(df)
yields
ID Visit \
0 9ebabd77-45f5-409c-b4dd-6db7951521fd 9da3f80c-6bcd-44ae-bbe8-760177fd4dbc
1 9ebabd77-45f5-409c-b4dd-6db7951521fd 9da3f80c-6bcd-44ae-bbe8-760177fd4dbc
2 41aa8fac-1bd8-4f95-918c-413879ed43f1 bcca257d-68d3-47e6-bc58-52c166f3b27b
Market Event Time Event Name
0 Seattle, WA 2014-08-05 10:06:24 viewed_home_page
1 Seattle, WA 2014-08-05 10:06:36 viewed_search_results
2 Madison, WI 2014-08-16 17:42:31 visit_start
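As a small follow-up, once the frame is loaded, the 'Event Time' strings can be turned into real timestamps with pd.to_datetime, which makes sorting and filtering by time straightforward. A self-contained sketch (sample data abridged to two tuples):

```python
import pandas as pd
import ast

# abridged stand-in for the contents of Dump.txt
raw = ("('9ebabd77-45f5-409c-b4dd-6db7951521fd','9da3f80c-6bcd-44ae-bbe8-760177fd4dbc',"
       "'Seattle, WA','2014-08-05 10:06:24','viewed_home_page'),"
       "('41aa8fac-1bd8-4f95-918c-413879ed43f1','bcca257d-68d3-47e6-bc58-52c166f3b27b',"
       "'Madison, WI','2014-08-16 17:42:31','visit_start')")

cols = ['ID', 'Visit', 'Market', 'Event Time', 'Event Name']
df = pd.DataFrame(list(ast.literal_eval(raw)), columns=cols)

# parse the timestamp strings into real datetimes
df['Event Time'] = pd.to_datetime(df['Event Time'])
print(df.dtypes['Event Time'])   # datetime64[ns]
```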