I have a problem where I have CSV data like this:
AgeGroup Where do you hear our company from? How long have you using our platform?
18-24 Word of mouth; Supermarket Product 0-1 years
36-50 Social Media; Word of mouth 1-2 years
18-24 Advertisement +4 years
and I want to transform it into this format, using either a Jupyter notebook or Excel:
AgeGroup Where do you hear our company from? How long have you using our platform?
18-24 Word of mouth 0-1 years
18-24 Supermarket Product 0-1 years
36-50 Social Media 1-2 years
36-50 Word of mouth 1-2 years
18-24 Advertisement +4 years
Let's say the CSV file is Untitled form.csv and I import the data in a Jupyter notebook:
data = pd.read_csv('Untitled form.csv')
Can anyone tell me how I should do it?
I have tried doing it in Excel using Data → Text to Columns, but of course that only splits the data into columns, while what I want is for the data to be split into rows while still retaining the values from the other columns.
Anyway, I found another, more roundabout way to do it. First I edit the file through Power Query in Excel and save it to a different file, and then if a UTF-8 encoding error appears, I just add encoding='cp1252'.
So it would become like this:
import pandas as pd
data_split = pd.read_csv('Untitled form split.csv',
                         skipinitialspace=True,
                         usecols=range(1, 7),
                         encoding='cp1252')
However if there's a more efficient way, please let me know. Thanks
I'm not 100% sure about your question, since I think it might be two separate issues, but hopefully this should fix it.
import pandas as pd

data = pd.read_fwf('Untitled form.csv')
cols = data.columns

data_long = pd.DataFrame(columns=data.columns)
for idx, row in data.iterrows():
    # Split the multi-answer column on ';' and strip stray spaces
    hear_from = row['Where do you hear our company from?'].split(';')
    hear_from_fmt = [x.strip() for x in hear_from]
    n_items = len(hear_from_fmt)
    d = {
        cols[0]: [row[cols[0]]] * n_items,
        cols[1]: hear_from_fmt,
        cols[2]: [row[cols[2]]] * n_items,
    }
    data_long = pd.concat([data_long, pd.DataFrame(d)], ignore_index=True)
Let's break it down.
The line data = pd.read_fwf('Untitled form.csv') reads the file, inferring the spacing between columns. This is only useful because I am not sure your file is a proper CSV; if it is, you can open it normally with pd.read_csv, and if not, this might help.
Now for the rest: we iterate through each row and select the methods someone could have heard about your company from. These are split on ; and then stripped to ensure there are no stray spaces. A temporary DataFrame is created where the first and last columns repeat the row's values and there are as many rows as there are elements in the hear_from_fmt list. The DataFrames are then concatenated together.
Now there might be a more efficient solution, but this should work.
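For what it's worth, on newer pandas (0.25+) the same reshaping can be done without an explicit loop, by splitting the column into lists and calling DataFrame.explode. A minimal sketch, with a small inline frame standing in for your file and the same column names as above:

```python
import pandas as pd

# Stand-in for pd.read_csv('Untitled form.csv')
data = pd.DataFrame({
    'AgeGroup': ['18-24', '36-50', '18-24'],
    'Where do you hear our company from?': [
        'Word of mouth; Supermarket Product',
        'Social Media; Word of mouth',
        'Advertisement',
    ],
    'How long have you using our platform?': ['0-1 years', '1-2 years', '+4 years'],
})

col = 'Where do you hear our company from?'
data_long = data.copy()
# Turn each cell into a list of answers, one list element per row after explode
data_long[col] = data_long[col].str.split(';')
data_long = data_long.explode(col).reset_index(drop=True)
# Strip the spaces left over from '; ' separators
data_long[col] = data_long[col].str.strip()
print(data_long)
```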
I have a big list of names, and I want to keep it in my interpreter, so I would like not to use CSV files.
The only way I can store it in my interpreter as a variable, using copy-paste from my original file, is as a triple-quoted string,
so my input looks like this:
temp='''A,B,C
adam,dorothy,ben
luis,cristy,hoover'''
My goal is to convert this string inside my interpreter to a DataFrame.
I tried
df = pd.DataFrame([temp]) and also converting to a Series using only one column, but without success. Any idea?
My real data has hundreds of lines.
Use:
import pandas as pd
from io import StringIO

temp = '''A,B,C
adam,dorothy,ben
luis,cristy,hoover'''

df = pd.read_csv(StringIO(temp))
print(df)
A B C
0 adam dorothy ben
1 luis cristy hoover
I have a simple file that lists texts by name and then words that are a part of that text:
text,words
ANC088,woods dig spirit controller father treasure_lost
ANC089,controller dig spirit
ANC090,woods ag_work tomb
ANC091,well spirit_seen treasure
Working with pandas, I have this admittedly kludgey solution for getting a list of nodes for the two sides of a bipartite graph: one side lists the texts, the other the words associated with each text:
import pandas as pd

df = pd.read_csv('tales-02.txt')
node_list_0 = df['text'].values.tolist()
node_list_1 = list(filter(None, sorted(set(' '.join(df['words'].values.tolist()).split(' ')))))
It ain't pretty, but it works, and it's fast enough for my small data set.
What I need is a list of edges between those two node sets. I can do this with the csv module, but I can't figure out how to do it in pandas. Here's my working csv version:
import csv

edge_list = []
texts = csv.reader(open('tales-01.txt', 'rb'), delimiter=',', skipinitialspace=True)
for row in texts:
    for item in row[1:]:
        edge_list.append((row[0], item))
I should note that this version of the input is csv all the way:
ANC088,woods,dig,spirit,controller,father,treasure_lost
ANC089,controller,dig,spirit
I adjusted the file format to make it easier for me to write the pandas stuff -- if someone can also show me how to grab the node lists out of the pure csv file, that would be awesome.
I'd rather this be done either all csv or all pandas. I tried to write a script that would get me the node lists using csv but I kept getting an empty list. That's when I turned to pandas, which everyone tells me I should be using anyway.
The following code creates a DataFrame with text and word columns from your file tales-01.txt. It's not very pretty (is there a prettier solution?), but it seems to do the job.
df = (pd.read_csv('tales-01.txt', header=None)
        .groupby(level=0)
        .apply(lambda x: pd.DataFrame([[x.iloc[0, 0], v]
                                       for v in x.iloc[0, 1:]]))
        .reset_index(drop=True)
        .dropna()
        .rename(columns={0: 'text', 1: 'word'})
     )
Here is a second solution based on the same idea that uses zip instead of the for loop. It might be faster.
def my_zip(d):
    t, w = d.iloc[0, 0], d.iloc[0, 1:]
    return pd.DataFrame(list(zip([t] * len(w), w))).dropna()

df = (pd.read_csv('tales-01.txt', header=None)
        .groupby(level=0)
        .apply(my_zip)
        .reset_index(drop=True)
        .rename(columns={0: 'text', 1: 'word'})
     )
The result is the same in both cases:
text word
0 ANC088 woods
1 ANC088 dig
2 ANC088 spirit
3 ANC088 controller
4 ANC088 father
5 ANC088 treasure_lost
6 ANC089 controller
7 ANC089 dig
8 ANC089 spirit
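As an aside, on recent pandas (0.25+) the whole edge list can be built without groupby/apply at all, using str.split plus DataFrame.explode. A sketch, assuming the comma-separated tales-01.txt layout from the question (built inline here as a stand-in for the file):

```python
import pandas as pd

# Stand-in for tales-01.txt: first field is the text ID,
# the remaining fields are its words
raw = '''ANC088,woods,dig,spirit,controller,father,treasure_lost
ANC089,controller,dig,spirit'''

# Split each line into (text, comma-joined words), then explode the words
rows = [line.split(',', 1) for line in raw.splitlines()]
df = pd.DataFrame(rows, columns=['text', 'word'])
df['word'] = df['word'].str.split(',')
edges = df.explode('word').reset_index(drop=True)

# Node lists for the two sides of the bipartite graph, plus the edge list
texts = edges['text'].unique().tolist()
words = sorted(edges['word'].unique())
edge_list = list(edges.itertuples(index=False, name=None))
print(edge_list[:3])
```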
Using Python 3.2, I am hoping to solve the issue below. My data consists of hundreds of rows (each signifying a project) and 21 columns. The first is a unique project ID, and the other 20 columns hold the group of people, or person, that led the project. person_1 is always filled, and if there is a name in person_3, that means 3 people are working together. If there is a name in person_18, that means 18 people are working together.
I have an excel spreadsheet that is setup the following way:
unique ID person_1 person_2 person_3 person_4 ... person_20
12 Tom Sally Mike
16 Joe Mike
5 Joe Sally
1 Sally Mike Tom
6 Sally Tom Mike
2 Jared Joe Mike John ... Carl
I want to do a few things:
1) Make a column that will give me a unique 'Group Name' which will be, using unique ID 1 as my example, Sally/Mike/Tom. So it will be the names separated by '/'.
2) How can I treat, from my example, Sally/Mike/Tom the same as Sally/Tom/Mike? Meaning, I would like another column that puts the group name in alphabetical order (no matter the actual permutation), still separated by '/'.
3) This question is similar to (2). However, I want the person listed in person_1 to matter. Meaning Joe/Tom/Mike is different from Tom/Joe/Mike but not different than Joe/Mike/Tom. So there will be another column that keeps person_1 at the start of the group name but alphabetizes person_2 through person_20 if applicable (i.e., if the project has more than 1 person on it).
Thanks for the help and suggestions
The previous answer gave a clear statement of method, but perhaps you are stuck on either the string processing or the csv processing. Both are demonstrated in the following code. The relevant tools are the built-in sorted function and the str.join method; '/'.join uses / as the separator between joined items. The + operator between lists in the tname and writerow statements concatenates the lists. A csv.reader is an iterator that delivers one list per row, and a csv.writer converts a list to a row and writes it out. You will want to add error handling to the file opens, etc. The data file used to test this code is shown after the code.
import csv

fi = open('xgroup.csv')
fo = open('xgroup3.csv', 'w', newline='')
w = csv.writer(fo)
r = csv.reader(fi)
li = 0
print("Opened reader and writer")
for row in r:
    gname = '/'.join(row[1:])                       # names in given order
    sname = '/'.join(sorted(row[1:]))               # fully alphabetized
    tname = '/'.join([row[1]] + sorted(row[2:]))    # person_1 first, rest sorted
    w.writerow([row[0], gname, sname, tname] + row[1:])
    li += 1
fi.close()
fo.close()
print("Closed reader and writer after", li, "lines")
File xgroup.csv is shown next.
unique-ID,person_1,person_2,person_3,person_4,...,person_20
12,Tom,Sally,Mike
16,Joe,Mike
5,Joe,Sally
1,Sally,Mike,Tom
6,Sally,Tom,Mike
2,Jared,Joe,Mike,John,...,Carl
Upon reading data as above, the program prints Opened reader and writer and Closed reader and writer after 7 lines and produces output in file xgroup3.csv as shown next.
unique-ID,person_1/person_2/person_3/person_4/.../person_20,.../person_1/person_2/person_20/person_3/person_4,person_1/.../person_2/person_20/person_3/person_4,person_1,person_2,person_3,person_4,...,person_20
12,Tom/Sally/Mike,Mike/Sally/Tom,Tom/Mike/Sally,Tom,Sally,Mike
16,Joe/Mike,Joe/Mike,Joe/Mike,Joe,Mike
5,Joe/Sally,Joe/Sally,Joe/Sally,Joe,Sally
1,Sally/Mike/Tom,Mike/Sally/Tom,Sally/Mike/Tom,Sally,Mike,Tom
6,Sally/Tom/Mike,Mike/Sally/Tom,Sally/Mike/Tom,Sally,Tom,Mike
2,Jared/Joe/Mike/John/.../Carl,.../Carl/Jared/Joe/John/Mike,Jared/.../Carl/Joe/John/Mike,Jared,Joe,Mike,John,...,Carl
Note, given a data line like
5,Joe,Sally,,,,,
instead of
5,Joe,Sally
the program as above produces
5,Joe/Sally/////,/////Joe/Sally,Joe//////Sally,Joe,Sally,,,,,
instead of
5,Joe/Sally,Joe/Sally,Joe/Sally,Joe,Sally
If that's a problem, filter out empty entries. For example, if
row=['5', 'Joe', 'Sally', '', '', '', '', ''], then
'/'.join(row[1:]) produces
'Joe/Sally/////', while
'/'.join(filter(lambda x: x, row[1:])) and
'/'.join(x for x in row[1:] if x) and
'/'.join(filter(len, row[1:])) produce
'Joe/Sally' .
You could do the following:
Export your file to a .csv file from Excel
Open that input file using python's csv module, using csv.reader
Open another file (output) to write to it using csv.writer
Iterate over each row in your reader, do your treatment, and write that using your writer
Import the output file in Excel
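If pandas is an option, the three derived columns can also be sketched directly on a DataFrame. A minimal example under the assumption that empty person slots read in as NaN (the frame and column names here just mirror the question's sample):

```python
import pandas as pd

# Stand-in for the spreadsheet from the question (trimmed to 3 person columns)
df = pd.DataFrame({
    'unique ID': [12, 16, 1],
    'person_1': ['Tom', 'Joe', 'Sally'],
    'person_2': ['Sally', 'Mike', 'Mike'],
    'person_3': ['Mike', None, 'Tom'],
})

person_cols = [c for c in df.columns if c.startswith('person_')]

def names(row):
    # Drop empty slots; person_1 stays first because columns are in order
    return [x for x in row[person_cols] if pd.notna(x)]

# 1) group name in given order
df['group'] = df.apply(lambda r: '/'.join(names(r)), axis=1)
# 2) fully alphabetized group name
df['group_sorted'] = df.apply(lambda r: '/'.join(sorted(names(r))), axis=1)
# 3) person_1 first, the rest alphabetized
df['group_leader_first'] = df.apply(
    lambda r: '/'.join([names(r)[0]] + sorted(names(r)[1:])), axis=1)
print(df[['unique ID', 'group', 'group_sorted', 'group_leader_first']])
```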
Here is a snapshot of my csv:
alex 123f 1
harry fwef 2
alex sef 3
alex gsdf 4
alex wf35 6
harry sdfsdf 3
I would like to get the subset of this data where anything in the first column (harry, alex) occurs at least 4 times, so I want the resulting data set to be:
alex 123f 1
alex sef 3
alex gsdf 4
alex wf35 6
Clearly, you cannot decide which rows are interesting until you've seen all rows (since the very last row might be the one turning some count from three to four and thereby making some previously seen rows interesting, for example;-). So, unless your CSV file is horribly huge, suck it all into memory, first, as a list...:
import csv
with open('thefile.csv', 'rb') as f:
    data = list(csv.reader(f))
then, do the counting -- Python 2.7 has a better way (collections.Counter), but assuming you're still on 2.6 like most of us...:
import collections

counter = collections.defaultdict(int)
for row in data:
    counter[row[0]] += 1
and finally do the selection loop...:
for row in data:
    if counter[row[0]] >= 4:
        print row
Of course, this prints each interesting row as a roughly-hewed list (with square brackets and quotes around the items), but it will be easy to format it in any way you might prefer.
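On Python 3 with pandas available, the same selection fits in a few lines using a value_counts mask. A sketch, assuming whitespace-separated columns like the snapshot above (loaded inline here instead of from a file):

```python
import io
import pandas as pd

# Stand-in for the whitespace-separated snapshot in the question
raw = '''alex 123f 1
harry fwef 2
alex sef 3
alex gsdf 4
alex wf35 6
harry sdfsdf 3'''

df = pd.read_csv(io.StringIO(raw), sep=r'\s+', header=None,
                 names=['name', 'code', 'num'])

# Keep only rows whose first-column value occurs at least 4 times
counts = df['name'].map(df['name'].value_counts())
subset = df[counts >= 4]
print(subset)
```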
If Python is not a must:
$ gawk '{b[$1]++;c[++d,$1]=$0}END{for(i in b){if(b[i]>=4){for(j=1;j<=d;j++){print c[j,i]}}}}' file
And yes, a 70MB file is fine.