Convert comment (list) to dataframe, pandas - python

I have a big list of names that I want to keep in my interpreter, so I would prefer not to use CSV files.
The only way I can store it in my interpreter as a variable, using copy-paste from my original file, is as a triple-quoted string ('comment'),
so my input looks like this:
temp='''A,B,C
adam,dorothy,ben
luis,cristy,hoover'''
My goal is to convert this 'comment' inside my interpreter to a dataframe.
I tried df=pd.DataFrame([temp]) and also converting to a Series using a one-column string, but without success. Any ideas?
My real data has hundreds of lines.

Use:
from io import StringIO
import pandas as pd

temp = u'''A,B,C
adam,dorothy,ben
luis,cristy,hoover'''

df = pd.read_csv(StringIO(temp))
print(df)
A B C
0 adam dorothy ben
1 luis cristy hoover
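Since the data is being copy-pasted from another file anyway, pd.read_clipboard can skip the intermediate string entirely. A minimal sketch (it assumes the comma-separated text is currently on the clipboard):

import pandas as pd

# Copy the CSV text to the clipboard first; read_clipboard parses the
# clipboard contents with the same machinery as read_csv.
df = pd.read_clipboard(sep=',')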

Related

How to split csv data

I have a problem where I have CSV data like this:
AgeGroup Where do you hear our company from? How long have you using our platform?
18-24 Word of mouth; Supermarket Product 0-1 years
36-50 Social Media; Word of mouth 1-2 years
18-24 Advertisement +4 years
and I tried to transform the file into this format, through either Jupyter Notebook or Excel:
AgeGroup Where do you hear our company from?
18-24 Word of mouth 0-1 years
18-24 Supermarket Product 0-1 years
36-50 Social Media 1-2 years
36-50 Word of mouth 1-2 years
18-24 Advertisement +4 years
Let's say the csv file is Untitled form.csv and I import the data to jupyter notebook:
data = pd.read_csv('Untitled form.csv')
Can anyone tell me how should I do it?
I have tried doing it in Excel using the Data column tools, but of course they only separate the data into columns, while what I want is the data separated into rows while still retaining the data from the other columns.
Anyway... I found another way to do it, which is more roundabout. First I edit the file through PowerSource Excel and save it to a different file... and then, if a utf-8 encoding error appears... I just add encoding='cp1252'.
So it would become like this:
import pandas as pd
data_split = pd.read_csv('Untitled form split.csv',
                         skipinitialspace=True,
                         usecols=range(1, 7),
                         encoding='cp1252')
However if there's a more efficient way, please let me know. Thanks
I'm not 100% sure about your question, since I think it might be two separate issues, but hopefully this should fix it.
import pandas as pd

data = pd.read_fwf('Untitled form.csv')
cols = data.columns
data_long = pd.DataFrame(columns=data.columns)
for idx, row in data.iterrows():
    hear_from = row['Where do you hear our company from?'].split(';')
    hear_from_fmt = list(map(lambda x: x.strip(), hear_from))
    n_items = len(hear_from_fmt)
    d = {
        cols[0]: [row.iloc[0]] * n_items,  # positional access; plain row[0] is deprecated
        cols[1]: hear_from_fmt,
        cols[2]: [row.iloc[2]] * n_items,
    }
    data_long = pd.concat([data_long, pd.DataFrame(d)], ignore_index=True)
Let's break it down.
The line data = pd.read_fwf('Untitled form.csv') reads the file, inferring the spacing between columns. This is only useful because I am not sure your file is a proper CSV; if it is, you can open it normally with read_csv, and if not, this might help.
Now for the rest. We iterate through each row and select the methods someone could have heard of your company from. These are split on ; and then stripped to ensure there are no leading or trailing spaces. A new temporary dataframe is created where the first and last columns repeat the row's values, with as many rows as there are elements in the hear_from_fmt list. The dataframes are then concatenated together.
There might be a more efficient solution, but this should work.
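For completeness, recent pandas offers a shorter route. A sketch, assuming pandas >= 1.1 (for explode(..., ignore_index=True)) and the file name from the question:

import pandas as pd

data = pd.read_csv('Untitled form.csv')
col = 'Where do you hear our company from?'
# Split the multi-answer cells on ';', then give each answer its own row.
data[col] = data[col].str.split(';')
data_long = data.explode(col, ignore_index=True)
# Strip the leftover spaces around each answer.
data_long[col] = data_long[col].str.strip()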

Import CSV file where last column has many separators [duplicate]

This question already has an answer here:
python pandas read_csv delimiter in column data
(1 answer)
Closed 2 years ago.
The dataset looks like this:
region,state,latitude,longitude,status
florida,FL,27.8333,-81.717,open,for,activity
georgia,GA,32.9866,-83.6487,open
hawaii,HI,21.1098,-157.5311,illegal,stuff
iowa,IA,42.0046,-93.214,medical,limited
As you can see, the last column sometimes has separators in it. This makes it hard to import the CSV file in pandas using read_csv(). The only way I can import the file is by adding the parameter error_bad_lines=False to the function. But this way I'm losing some of the data.
How can I import the CSV file without losing data?
I would read the file as one single column and parse manually:
df = pd.read_csv(filename, sep='\t')
pat = ','.join([f'(?P<{x}>[^,]*)' for x in ['region', 'state', 'latitude', 'longitude']])
pat = '^' + pat + ',(?P<status>.*)$'
df = df.iloc[:, 0].str.extract(pat)
Output:
region state latitude longitude status
0 florida FL 27.8333 -81.717 open,for,activity
1 georgia GA 32.9866 -83.6487 open
2 hawaii HI 21.1098 -157.5311 illegal,stuff
3 iowa IA 42.0046 -93.214 medical,limited
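The same idea also works without a regex. A sketch, assuming the file contains no tab characters: read each raw line as a single column and split on the first four commas only, so everything after the fourth comma stays in status:

import pandas as pd

raw = pd.read_csv(filename, sep='\t').iloc[:, 0]   # one column of raw lines
df = raw.str.split(',', n=4, expand=True)          # split at most 4 times
df.columns = ['region', 'state', 'latitude', 'longitude', 'status']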
Have you tried the old-school technique with the split function? A major downside is that you'd end up losing data or bumping into errors if your data has a , in any of the first 4 fields/columns, but if not, you could use it.
data = open(file, 'r').read().split('\n')
for line in data:
    items = line.split(',', 4)  # assuming there are 4 standard columns, and the 5th column has commas
Each row items would look, for example, like this:
['hawaii', 'HI', '21.1098', '-157.5311', 'illegal,stuff']
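To turn those split rows into a dataframe, a minimal sketch along the same lines (it assumes the first line of the file is the header shown in the question):

import pandas as pd

cols = ['region', 'state', 'latitude', 'longitude', 'status']
with open(file) as f:      # 'file' is the CSV path from the snippet above
    next(f)                # skip the header row
    rows = [line.rstrip('\n').split(',', 4) for line in f if line.strip()]
df = pd.DataFrame(rows, columns=cols)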

Why are my columns showing up as NaN when they are not empty?

I am trying to use pandas to create a data frame from a .csv file I have downloaded. Every time I try to make a predictors data frame, it empties one of the columns I am looking for. I downloaded the .csv file from here: https://perso.telecom-paristech.fr/eagan/class/igr204/datasets
It is the fourth file down titled "film.csv"
I have done this in the following way before with a different dataset and it worked flawlessly. This time my data is being deleted and I cannot figure out why.
import pandas as pd
file=pd.read_csv('film.csv',sep=';',encoding="ISO 8859-1")
#print(file)
df=pd.DataFrame(file)
df=df.dropna(axis=0,how='any')
predictors=pd.DataFrame(df.Director,df.Length)
#prints directors as NaN
print(predictors)
#prints both columns fully
print(df.Director)
print(df.Length)
Printing the predictors data frame above prints out the Length column correctly, but the Director column with all files as NaN. All I want is a data frame of the two columns Director and Length. Any help would be greatly appreciated!
Edit:
These are the first 10 lines of the csv file.
Year;Length;Title;Subject;Actor;Actress;Director;Popularity;Awards
INT;INT;STRING;CAT;CAT;CAT;CAT;INT;BOOL;STRING
1990;111;Tie Me Up! Tie Me Down!;Comedy;Banderas, Antonio;Abril, Victoria;Almodóvar, Pedro;68;No
1991;113;High Heels;Comedy;Bosé, Miguel;Abril, Victoria;Almodóvar, Pedro;68;No
1983;104;Dead Zone, The;Horror;Walken, Christopher;Adams, Brooke;Cronenberg, David;79;No
1979;122;Cuba;Action;Connery, Sean;Adams, Brooke;Lester, Richard;6;No
1978;94;Days of Heaven;Drama;Gere, Richard;Adams, Brooke;Malick, Terrence;14;No
1983;140;Octopussy;Action;Moore, Roger;Adams, Maud;Glen, John;68;No
1984;101;Target Eagle;Action;Connors, Chuck;Adams, Maud;Loma, José Antonio de la;14;No
1989;99;American Angels: Baptism of Blood, The;Drama;Bergen, Robert D.;Adams, Trudy;Sebastian, Beverly;28;No
The issue is in this line: predictors=pd.DataFrame(df.Director,df.Length)
To create a new dataframe from an old one, use something like:
predictors=df[['Director', 'Length']].copy()
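For the curious, the NaNs come from index alignment: in pd.DataFrame(data, index), the second positional argument is the index, so df.Length is used as row labels and df.Director is then aligned against those labels. A small sketch with made-up values to illustrate:

import pandas as pd

df = pd.DataFrame({'Director': ['Almodóvar', 'Cronenberg'], 'Length': [111, 104]})
# The Length values (111, 104) become row labels; aligning df.Director
# (labelled 0, 1) against them finds no matches, hence all NaN.
bad = pd.DataFrame(df.Director, df.Length)
# Selecting both columns avoids the alignment entirely:
good = df[['Director', 'Length']].copy()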

Python Sorting and Organising

I'm trying to sort data from a file and not quite getting what I need. I have a text file with race details (name and placement, i.e. 1, 2, 3). I would like to be able to organize the data by highest placement first and also alphabetically by name. I can do this if I split the lines, but then the name and score will not match up.
Any help and suggestions would be very welcome; I've hit that proverbial wall.
My apologies (first-time user of this site and a Python noob, steep learning curve). Thank you for your suggestions, I really do appreciate the help.
comp = []
results = open('d:\\test.txt', 'r')
for line in results:
    line = line.split()
    # (name, score) = line.split()
    comp.append(line)
sorted(comp)
results.close()
print(comp)
Test file was in this format:
Jones 2
Ranfel 7
Peterson 5
Smith 1
Simons 9
Roberts 4
McDonald 3
Rogers 6
Elliks 8
Helm 10
I completely agree with everyone who has down-voted this question for being badly posed. However, I'm in a good mood so I'll try and at least steer you in the right direction:
Let's assume your text file looks like this:
Name,Placement
D,1
D,2
C,1
C,3
B,1
B,3
A,1
A,4
I suggest importing the data and sorting it using Pandas http://pandas.pydata.org/
import pandas as pd
# Read in the data
# Replace <FULL_PATH_OF FILE> with something like C:/Data/RaceDetails.csv
# The first row is automatically used for column names
data=pd.read_csv("<FULL_PATH_OF_FILE>")
# Sort the data (DataFrame.sort was removed in pandas 0.20+; use sort_values)
sorted_data = data.sort_values(['Placement', 'Name'])
# Create a re-indexed data frame if you so desire
sorted_data_new_index=sorted_data.reset_index(drop=True)
This gives me:
Name Placement
A 1
B 1
C 1
D 1
D 2
B 3
C 3
A 4
I'll leave you to figure out the rest..
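One caveat: sorting is ascending by default, so if "highest placement first" means the larger number first, flip the order for that column:

sorted_data = data.sort_values(['Placement', 'Name'], ascending=[False, True])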
As @Jack said, I am very limited in how I can help if you don't post code or the txt file. However, I've run into a similar problem before, so I know the basics (again, I will need code/files before I can give an exact type-this-stuff answer!)
You can either develop an algorithm yourself, or use the built-in sorted feature
Put the names and scores in a list (or dictionary) such as:
name_scores = [['Matt', 95], ['Bob', 50], ['Ashley', 100]]
and then call sorted(name_scores) and it will sort by names: [['Ashley', 100], ['Bob', 50], ['Matt', 95]]
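Tying that back to the file format from the question, a minimal sketch (it assumes each line holds a name and a placement, as in the test file above):

comp = []
with open('d:\\test.txt') as results:
    for line in results:
        name, score = line.split()
        comp.append((name, int(score)))   # keep each name paired with its score

by_placement = sorted(comp, key=lambda pair: pair[1])   # best placement first
by_name = sorted(comp)                                  # alphabetical by name
print(by_placement)
print(by_name)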

Extracting Groups

Using Python 3.2, I was hoping to solve the below issue. My data consists of hundreds of rows (each signifying a project) and 21 columns. The first is a unique project ID and the other 20 columns hold the group of people, or person, that led the project. person_1 is always filled, and a name in person_3 means 3 people are working together. A name in person_18 means 18 people are working together.
I have an excel spreadsheet that is setup the following way:
unique ID person_1 person _2 person_3 person_4 ... person_20
12 Tom Sally Mike
16 Joe Mike
5 Joe Sally
1 Sally Mike Tom
6 Sally Tom Mike
2 Jared Joe Mike John ... Carl
I want to do a few things:
1) Make a column that will give me a unique 'Group Name' which will be, using unique ID 1 as my example, Sally/Mike/Tom. So it will be the names separated by '/'.
2) How can I treat, from my example, Sally/Mike/Tom the same as Sally/Tom/Mike. Meaning, I would like another column that makes the group name in alphabetical order (no matter the actual permutation), still separated by '/'.
3) This question is similar to (2). However, I want the person listed in person_1 to matter. Meaning Joe/Tom/Mike is different from Tom/Joe/Mike but not different than Joe/Mike/Tom. So there will be another column that keeps person_1 at the start of the group name but alphabetizes person_2 through person_20 if applicable (i.e., if the project has more than 1 person on it).
Thanks for the help and suggestions
The previous answer gave a clear statement of the method, but perhaps you are stuck on either the string processing or the csv processing. Both are demonstrated in the following code. The relevant tools are the built-in sorted function and the string join method: '/'.join tells join to use / as the separator between joined items. The + operator between lists in the tname and writerow statements concatenates the lists. A csv.reader is an iterator that delivers one list per row, and a csv.writer converts a list to a row and writes it out. You will want to add error handling to the file opens, etc. The data file used to test this code is shown after the code.
import csv
fi = open('xgroup.csv', newline='')
fo = open('xgroup3.csv', 'w', newline='')
w = csv.writer(fo)
r = csv.reader(fi)
li = 0
print("Opened reader and writer")
for row in r:
    gname = '/'.join(row[1:])
    sname = '/'.join(sorted(row[1:]))
    tname = '/'.join([row[1]] + sorted(row[2:]))
    w.writerow([row[0], gname, sname, tname] + row[1:])
    li += 1
fi.close()
fo.close()
print("Closed reader and writer after", li, "lines")
File xgroup.csv is shown next.
unique-ID,person_1,person,_2,person_3,person_4,...,person_20
12,Tom,Sally,Mike
16,Joe,Mike
5,Joe,Sally
1,Sally,Mike,Tom
6,Sally,Tom,Mike
2,Jared,Joe,Mike,John,...,Carl
Upon reading data as above, the program prints Opened reader and writer and Closed reader and writer after 7 lines and produces output in file xgroup3.csv as shown next.
unique-ID,person_1/person/_2/person_3/person_4/.../person_20,.../_2/person/person_1/person_20/person_3/person_4,person_1/.../_2/person/person_20/person_3/person_4,person_1,person,_2,person_3,person_4,...,person_20
12,Tom/Sally/Mike,Mike/Sally/Tom,Tom/Mike/Sally,Tom,Sally,Mike
16,Joe/Mike,Joe/Mike,Joe/Mike,Joe,Mike
5,Joe/Sally,Joe/Sally,Joe/Sally,Joe,Sally
1,Sally/Mike/Tom,Mike/Sally/Tom,Sally/Mike/Tom,Sally,Mike,Tom
6,Sally/Tom/Mike,Mike/Sally/Tom,Sally/Mike/Tom,Sally,Tom,Mike
2,Jared/Joe/Mike/John/.../Carl,.../Carl/Jared/Joe/John/Mike,Jared/.../Carl/Joe/John/Mike,Jared,Joe,Mike,John,...,Carl
Note, given a data line like
5,Joe,Sally,,,,,
instead of
5,Joe,Sally
the program as above produces
5,Joe/Sally/////,/////Joe/Sally,Joe//////Sally,Joe,Sally,,,,,
instead of
5,Joe/Sally,Joe/Sally,Joe/Sally,Joe,Sally
If that's a problem, filter out empty entries. For example, if
row=['5', 'Joe', 'Sally', '', '', '', '', ''], then
'/'.join(row[1:]) produces
'Joe/Sally/////', while
'/'.join(filter(lambda x: x, row[1:])) and
'/'.join(x for x in row[1:] if x) and
'/'.join(filter(len, row[1:])) produce
'Joe/Sally'.
You could do the following:
Export your file to a .csv file from Excel
Open that input file using python's csv module, using csv.reader
Open another file (output) to write to it using csv.writer
Iterate over each row in your reader, do your treatment, and write that using your writer
Import the output file in Excel
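A compact sketch of those steps, combining them with the empty-cell filtering discussed above (the output name xgroup_out.csv is just an example):

import csv

with open('xgroup.csv', newline='') as fi, \
        open('xgroup_out.csv', 'w', newline='') as fo:
    reader, writer = csv.reader(fi), csv.writer(fo)
    for row in reader:
        people = [cell for cell in row[1:] if cell]           # drop empty cells
        gname = '/'.join(people)                              # 1) as entered
        sname = '/'.join(sorted(people))                      # 2) alphabetized
        tname = '/'.join(people[:1] + sorted(people[1:]))     # 3) leader first
        writer.writerow([row[0], gname, sname, tname] + people)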
