How to split CSV data - Python

I have a problem where I have CSV data like this:
AgeGroup Where do you hear our company from? How long have you using our platform?
18-24 Word of mouth; Supermarket Product 0-1 years
36-50 Social Media; Word of mouth 1-2 years
18-24 Advertisement +4 years
and I tried to make the file into this format, through either a Jupyter notebook or Excel:
AgeGroup Where do you hear our company from? How long have you using our platform?
18-24 Word of mouth 0-1 years
18-24 Supermarket Product 0-1 years
36-50 Social Media 1-2 years
36-50 Word of mouth 1-2 years
18-24 Advertisement +4 years
Let's say the CSV file is Untitled form.csv and I import the data in a Jupyter notebook:
import pandas as pd

data = pd.read_csv('Untitled form.csv')
Can anyone tell me how I should do it?
I have tried doing it in Excel with the data-to-columns feature, but of course that only separates the data into columns, while what I want is for the data to be split into rows while still retaining the values from the other columns.

Anyway, I found another way to do it, which is more roundabout. First I edit the file through PowerSource in Excel and save it to a different file, and then, if a UTF-8 encoding error appears, I just add encoding='cp1252'.
So it becomes:
import pandas as pd

data_split = pd.read_csv('Untitled form split.csv',
                         skipinitialspace=True,
                         usecols=range(1, 7),
                         encoding='cp1252')
However, if there's a more efficient way, please let me know. Thanks.

I'm not 100% sure about your question, since I think it might be two separate issues, but hopefully this should fix it.
import pandas as pd

data = pd.read_fwf('Untitled form.csv')
cols = data.columns
data_long = pd.DataFrame(columns=data.columns)

for idx, row in data.iterrows():
    # Split the multi-valued answers on ';' and strip surrounding spaces
    hear_from = row['Where do you hear our company from?'].split(';')
    hear_from_fmt = [x.strip() for x in hear_from]
    n_items = len(hear_from_fmt)
    # Temporary frame: one row per answer, repeating the other columns
    d = {
        cols[0]: [row[0]] * n_items,
        cols[1]: hear_from_fmt,
        cols[2]: [row[2]] * n_items,
    }
    data_long = pd.concat([data_long, pd.DataFrame(d)], ignore_index=True)
Let's break it down.
The line data = pd.read_fwf('Untitled form.csv') reads the file while inferring the spacing between columns. This is only useful because I am not sure your file is a proper CSV; if it is, you can open it normally with read_csv, and if not, this might help.
Now for the rest: we iterate through each row and select the methods through which someone could have heard of your company. These are split on ';' and then stripped to ensure there are no leftover spaces. A temporary DataFrame is created where the first and last columns repeat the row's values, with as many rows as there are elements in the hear_from_fmt list. The temporary DataFrames are then concatenated together.
Now there might be a more efficient solution, but this should work.
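For reference, a more vectorized sketch of the same reshaping, assuming the file reads as a proper CSV and a pandas version of 0.25 or newer where DataFrame.explode is available; the column name is taken from the question's sample data:
import pandas as pd

data = pd.read_csv('Untitled form.csv')

# Turn the multi-valued column into lists, then emit one row per element
col = 'Where do you hear our company from?'
data[col] = data[col].str.split(';')
data_long = data.explode(col).reset_index(drop=True)

# Remove any leftover whitespace around the split values
data_long[col] = data_long[col].str.strip()
This avoids growing the result frame inside a loop with repeated pd.concat calls.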

Related

Import CSV file where last column has many separators

The dataset looks like this:
region,state,latitude,longitude,status
florida,FL,27.8333,-81.717,open,for,activity
georgia,GA,32.9866,-83.6487,open
hawaii,HI,21.1098,-157.5311,illegal,stuff
iowa,IA,42.0046,-93.214,medical,limited
As you can see, the last column sometimes has separators in it. This makes it hard to import the CSV file in pandas using read_csv(). The only way I can import the file is by adding the parameter error_bad_lines=False to the function. But this way I'm losing some of the data.
How can I import the CSV file without losing data?
I would read the file as one single column and parse manually:
import pandas as pd

# Read with a separator that never occurs, so each line lands in one column,
# then pull the fields apart with a named-group regex
df = pd.read_csv(filename, sep='\t')
pat = ','.join([f'(?P<{x}>[^,]*)' for x in ['region', 'state', 'latitude', 'longitude']])
pat = '^' + pat + ',(?P<status>.*)$'
df = df.iloc[:, 0].str.extract(pat)
Output:
region state latitude longitude status
0 florida FL 27.8333 -81.717 open,for,activity
1 georgia GA 32.9866 -83.6487 open
2 hawaii HI 21.1098 -157.5311 illegal,stuff
3 iowa IA 42.0046 -93.214 medical,limited
Have you tried the old-school technique with the split function? A major downside is that you'd end up losing data or running into errors if your data has a ',' in any of the first four fields/columns, but if not, you could use it.
data = open(file, 'r').read().split('\n')
for line in data:
    # Split on the first 4 commas only, assuming 4 standard columns;
    # the 5th field keeps any embedded commas
    items = line.split(',', 4)
Each row's items would look, for example, like this:
['hawaii', 'HI', '21.1098', '-157.5311', 'illegal,stuff']
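If you go this route, a minimal sketch for collecting the split rows into a DataFrame, assuming every data row has the four leading fields (the filename here is hypothetical):
import pandas as pd

with open('data.csv', 'r') as fh:
    lines = fh.read().splitlines()

# First line is the header: region, state, latitude, longitude, status
header = lines[0].split(',')
rows = [line.split(',', 4) for line in lines[1:] if line]
df = pd.DataFrame(rows, columns=header)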

Sorting in pandas by multiple columns without distorting the index

I have just started out with pandas and I am trying to do a multilevel sort of my data by columns. I have four columns in my data: STNAME, CTYNAME, CENSUS2010POP, SUMLEV. I want to set the index of my data to STNAME, CTYNAME and then sort the data by CENSUS2010POP. After I set the index, the data appears as in pic 1 (before sorting by CENSUS2010POP), and when I sort, it appears as in pic 2 (after sorting). You can see the indices are messy and no longer ordered serially.
I have read a few posts, including this one (Sorting a multi-index while respecting its index structure), which dates back five years and no longer works as written. I have yet to learn the groupby function.
Could you please tell me a way I can achieve this?
PS: I come from an accounting/finance background and am very new to coding. I have just completed two Python courses, including PY4E.com.
I used the code below to set the index:
census_dfq6 = census_dfq6.set_index(['STNAME','CTYNAME'])
and used the code below to sort the data:
census_dfq6 = census_dfq6.sort_values(by=['CENSUS2010POP'], ascending=[False])
Here is the sample data I am working with; I would love to share the CSV file but I don't see a way to do that here.
STNAME,CTYNAME,CENSUS2010POP,SUMLEV
Alabama,Autauga County,54571,50
Alabama,Baldwin County,182265,50
Alabama,Barbour County,27457,50
Alabama,Bibb County,22915,50
Alabama,Blount County,57322,50
Alaska,Aleutians East Borough,3141,50
Alaska,Aleutians West Census Area,5561,50
Alaska,Anchorage Municipality,291826,50
Alaska,Bethel Census Area,17013,50
Wyoming,Platte County,8667,50
Wyoming,Sheridan County,29116,50
Wyoming,Sublette County,10247,50
Wyoming,Sweetwater County,43806,50
Wyoming,Teton County,21294,50
Wyoming,Uinta County,21118,50
Wyoming,Washakie County,8533,50
Wyoming,Weston County,7208,50
Required End Result:
STNAME,CTYNAME,CENSUS2010POP,SUMLEV
Alabama,Autauga County,54571,50
Alabama,Baldwin County,182265,50
Alabama,Barbour County,27457,50
Alabama,Bibb County,22915,50
Alabama,Blount County,57322,50
Alaska,Aleutians East Borough,3141,50
Alaska,Aleutians West Census Area,5561,50
Alaska,Anchorage Municipality,291826,50
Alaska,Bethel Census Area,17013,50
Wyoming,Platte County,8667,50
Wyoming,Sheridan County,29116,50
Wyoming,Sublette County,10247,50
Wyoming,Sweetwater County,43806,50
Wyoming,Teton County,21294,50
Wyoming,Uinta County,21118,50
Wyoming,Washakie County,8533,50
Wyoming,Weston County,7208,50
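No answer was recorded for this question, but a minimal sketch of one common reading of it, sorting by CENSUS2010POP within each state while keeping the STNAME/CTYNAME MultiIndex tidy, could look like this (the frame name follows the question; the filename is hypothetical):
import pandas as pd

census_dfq6 = pd.read_csv('census.csv')  # hypothetical filename

# Sort by state, then by population within each state, and only set the
# MultiIndex afterwards so the index stays in a tidy, serial order
census_dfq6 = (census_dfq6
               .sort_values(['STNAME', 'CENSUS2010POP'], ascending=[True, False])
               .set_index(['STNAME', 'CTYNAME']))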

Why are my columns showing up as NaN when they are not empty?

I am trying to use pandas to create a data frame from a .csv file I have downloaded. Every time I try to make a predictors data frame, it empties one of the columns I am looking for. I downloaded the .csv file from here: https://perso.telecom-paristech.fr/eagan/class/igr204/datasets
It is the fourth file down titled "film.csv"
I have done this in the following way before with a different dataset and it worked flawlessly. This time my data is being deleted and I cannot figure out why.
import pandas as pd

file = pd.read_csv('film.csv', sep=';', encoding="ISO 8859-1")
# print(file)
df = pd.DataFrame(file)
df = df.dropna(axis=0, how='any')
predictors = pd.DataFrame(df.Director, df.Length)
# prints Director as NaN
print(predictors)
# prints both columns fully
print(df.Director)
print(df.Length)
Printing the predictors data frame above prints the Length column correctly, but the Director column shows all values as NaN. All I want is a data frame of the two columns Director and Length. Any help would be greatly appreciated!
Edit:
These are the first 10 lines of the csv file.
Year;Length;Title;Subject;Actor;Actress;Director;Popularity;Awards
INT;INT;STRING;CAT;CAT;CAT;CAT;INT;BOOL;STRING
1990;111;Tie Me Up! Tie Me Down!;Comedy;Banderas, Antonio;Abril, Victoria;Almodóvar, Pedro;68;No
1991;113;High Heels;Comedy;Bosé, Miguel;Abril, Victoria;Almodóvar, Pedro;68;No
1983;104;Dead Zone, The;Horror;Walken, Christopher;Adams, Brooke;Cronenberg, David;79;No
1979;122;Cuba;Action;Connery, Sean;Adams, Brooke;Lester, Richard;6;No
1978;94;Days of Heaven;Drama;Gere, Richard;Adams, Brooke;Malick, Terrence;14;No
1983;140;Octopussy;Action;Moore, Roger;Adams, Maud;Glen, John;68;No
1984;101;Target Eagle;Action;Connors, Chuck;Adams, Maud;Loma, José Antonio de la;14;No
1989;99;American Angels: Baptism of Blood, The;Drama;Bergen, Robert D.;Adams, Trudy;Sebastian, Beverly;28;No
The issue is in this line: predictors=pd.DataFrame(df.Director,df.Length)
The second positional argument of pd.DataFrame is the index, so df.Length is used as the new index and the Director values are realigned against those labels; none of them match the original row labels, which is why every entry comes out as NaN.
To create a new dataframe from the old one, use something like:
predictors = df[['Director', 'Length']].copy()
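A tiny sketch of that alignment behaviour, with made-up values:
import pandas as pd

s = pd.Series(['Almodóvar', 'Cronenberg'], index=[0, 1], name='Director')

# The second positional argument is treated as the index; 111 and 104 are
# not labels in s's index, so every value is reindexed to NaN
print(pd.DataFrame(s, [111, 104]))
#     Director
# 111      NaN
# 104      NaN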

Handling data from csv file with Python

I know Python is almost made for these kinds of purposes, but I am really struggling to understand how to get access to specific values in the dataset, and I have tried both the pandas and csv modules. It is probably a matter of syntax. Here's the thing: I have a csv file of the form
Nation, Year, No. of refugees
Afghanistan,2013,6657
Albania,2013,199
Algeria,2013,91
Angola,2013,47
Armenia,2013,156
...
...
Afghanistan,2012,6960
Albania,2012,157
Algeria,2012,67
Angola,2012,43
Armenia,2012,143
...
and so on. What I would like to do is get the total number of refugees per year, i.e. select all the rows with a certain year and sum all the elements in the related "No. of refugees" column. I managed to do this:
import csv

with open('refugees.csv', 'r') as f:
    d_reader = csv.DictReader(f)
    headers = d_reader.fieldnames
    print headers

    # 2013
    list2013 = []
    for line in d_reader:
        if line['Year'] == '2013':
            list2013.append(line['Refugees'])

list2013 = map(int, list2013)  # I have str values in my file
ref13 = sum(list2013)
but I am looking for a more elegant (and, above all, iterative) solution. Moreover, if I perform that procedure multiple times for different years, I always get 0: it works for 2013 only, not sure why.
Edit: I tried this as well, without success, but I think this could be totally wrong:
import csv

refugees_dict = {}
a = range(2005, 2014)
a = map(str, a)

with open('refugees.csv', 'r') as f:
    d_reader = csv.DictReader(f)
    for element in a:
        # Note: d_reader is an iterator, so it is exhausted after the first
        # pass; every year after the first one sees no rows at all, which is
        # why only one year ever gets a value
        for line in d_reader:
            if line['Year'] == element:
                print 'hello!'
                temp_list = []
                temp_list.append(line['Refugees'])
                temp_list = map(int, temp_list)
                refugees_dict[element] = sum(temp_list)

print refugees_dict
The next step of my work will involve further study of the dataset, e.g. I will probably need to access the data nation-wise instead of year-wise, and I would really appreciate any hint that helps me understand how to manipulate the data.
Thanks a lot.
Since you tagged pandas in the question, here's a pandas solution to getting the number of refugees per year.
Let's say my input csv looks like this (note that I've eliminated the extra space before the column names):
Nation,Year,No. of refugees
Afghanistan,2013,6657
Albania,2013,199
Algeria,2013,91
Angola,2013,47
Armenia,2013,156
Afghanistan,2012,6960
Albania,2012,157
Algeria,2012,67
Angola,2012,43
Armenia,2012,143
You can read that into a pandas DataFrame like this:
import pandas as pd

df = pd.read_csv('data.csv')
You can then get the total like this:
df.groupby(['Year']).sum()
This gives:
      No. of refugees
Year
2012             7370
2013             7150
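If you would rather have Year as a regular column instead of the index, a small variation on the same call:
totals = df.groupby('Year', as_index=False)['No. of refugees'].sum()
print(totals)
#    Year  No. of refugees
# 0  2012             7370
# 1  2013             7150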
Consider:
from collections import defaultdict

by_year = defaultdict(int)  # a dict that returns 0 for every missing key
and then, inside your reading loop:
by_year[line['Year']] += int(line['Refugees'])
Now you can just look at by_year['2013'] and see your sum (same for other years).
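Putting that together with the DictReader from the question, a complete sketch (the Year and Refugees column names follow the question's code):
import csv
from collections import defaultdict

by_year = defaultdict(int)  # missing keys start at 0

with open('refugees.csv', 'r') as f:
    for line in csv.DictReader(f):
        # A single pass over the file accumulates every year at once
        by_year[line['Year']] += int(line['Refugees'])

print(by_year['2013'])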
To sum by year you can try this:
f = open('file.csv').readlines()[1:]  # skip the header row
f = [i.strip('\n').split(',') for i in f if i.strip()]
years = {i[1]: 0 for i in f}
for i in f:
    years[i[1]] += int(i[-1])
Now, you have a dictionary that has the sum of all the refugees by year.
To access nation-wise:
nations = {i[0]: 0 for i in f}
for i in f:
    nations[i[0]] += int(i[-1])

Python: create edge list for bipartite graph using pandas

I have a simple file that lists texts by name and then words that are a part of that text:
text,words
ANC088,woods dig spirit controller father treasure_lost
ANC089,controller dig spirit
ANC090,woods ag_work tomb
ANC091,well spirit_seen treasure
Working with pandas I have this, albeit kludgey, solution for getting a list of nodes for the two sides of a bipartite graph, one side listing the texts and the other the words associated with each text:
import pandas as pd

df = pd.read_csv('tales-02.txt')
node_list_0 = df['text'].values.tolist()
node_list_1 = filter(None, sorted(set(' '.join(df['words'].values.tolist()).split(' '))))
It ain't pretty, but it works, and it's fast enough for my small data set.
What I need is a list of edges between those two node sets. I can do this with the csv module, but I can't figure out how to do it in pandas. Here's my working csv version:
import csv

edge_list = []
texts = csv.reader(open('tales-01.txt', 'rb'), delimiter=',', skipinitialspace=True)
for row in texts:
    for item in row[1:]:
        edge_list.append((row[0], item))
I should note that this version of the input is csv all the way:
ANC088,woods,dig,spirit,controller,father,treasure_lost
ANC089,controller,dig,spirit
I adjusted the file format to make it easier for me to write the pandas stuff -- if someone can also show me how to grab the node lists out of the pure csv file, that would be awesome.
I'd rather this be done either all csv or all pandas. I tried to write a script that would get me the node lists using csv but I kept getting an empty list. That's when I turned to pandas, which everyone tells me I should be using anyway.
The following code creates a DataFrame with text and word columns from your file tales-01.txt. It's not very pretty (is there a prettier solution?), but it seems to do the job.
df = (pd.read_csv('tales-01.txt', header=None)
        .groupby(level=0)
        .apply(lambda x: pd.DataFrame([[x.iloc[0, 0], v]
                                       for v in x.iloc[0, 1:]]))
        .reset_index(drop=True)
        .dropna()
        .rename_axis({0: 'text', 1: 'word'}, axis=1))
Here is a second solution based on the same idea that uses zip instead of the for loop. It might be faster.
def my_zip(d):
    t, w = d.iloc[0, 0], d.iloc[0, 1:]
    return pd.DataFrame(zip([t] * len(w), w)).dropna()

df = (pd.read_csv('tales-01.txt', header=None)
        .groupby(level=0)
        .apply(my_zip)
        .reset_index(drop=True)
        .rename_axis({0: 'text', 1: 'word'}, axis=1))
The result is the same in both cases:
text word
0 ANC088 woods
1 ANC088 dig
2 ANC088 spirit
3 ANC088 controller
4 ANC088 father
5 ANC088 treasure_lost
6 ANC089 controller
7 ANC089 dig
8 ANC089 spirit
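For reference, a sketch of the same edge list built directly from the space-separated file tales-02.txt with modern pandas (0.25+ for explode); the column names follow the question:
import pandas as pd

df = pd.read_csv('tales-02.txt')

# One row per (text, word) pair: split the word lists, then explode them
edges = (df.assign(word=df['words'].str.split(' '))
           .explode('word')[['text', 'word']]
           .reset_index(drop=True))

# The two node sets of the bipartite graph fall out directly
text_nodes = edges['text'].unique().tolist()
word_nodes = sorted(edges['word'].unique())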
