Ordering data in Python or Excel

I have a large csv file of unordered data. It consists of music tags. I am trying to group all of the similar tags together for easier analysis.
An example of what I have:
Band1, hiphop, pop, rap
Band2, rock, rap, pop
band3, hiphop, rap
The output I am looking for would be like:
Band1, hiphop, pop, rap
Band2, NaN, pop, rap, rock
Band3, hiphop, NaN, rap
What is the best way to sort the data like this?
I have tried using pandas and doing basic sorts in excel.

Here's an option that avoids for loops: just melt and pivot the data to get to your output:
import pandas as pd
import numpy as np
df = pd.read_csv("./test.csv", names=['col1','col2','col3','col4'])
#melt on all but the first column
df = pd.melt(df, id_vars='col1', value_vars=df.columns[1:], value_name='genres')
#pivot using the new genres column as column names
df = pd.pivot_table(df, values='variable', index='col1', columns='genres', aggfunc='count').reset_index()
#swap non-null values with the column name
cols = df.columns[1:]
df[cols] = np.where(df[cols].notnull(), cols, df[cols])
+--------+-------+--------+-----+-----+------+
| genres | col1  | hiphop | pop | rap | rock |
+--------+-------+--------+-----+-----+------+
| 0      | Band1 | hiphop | pop | rap | NaN  |
| 1      | Band2 | NaN    | pop | rap | rock |
| 2      | band3 | hiphop | NaN | rap | NaN  |
+--------+-------+--------+-----+-----+------+
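As a side note, the np.where step works because the column Index broadcasts across the rows; a tiny demonstration with a hypothetical two-column frame:
import numpy as np
import pandas as pd
t = pd.DataFrame({'pop': [1.0, None], 'rap': [1.0, 1.0]})
# Where a cell is non-null, substitute that column's name; otherwise keep the NaN.
print(np.where(t.notnull(), t.columns, t))
# [['pop' 'rap']
#  [nan 'rap']]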

Read the file (simulated below). As you read each row, update a fieldnames set so that when you write the rows you can pass this full set of genres to your DictWriter.
import csv
text_in = """
Band1, hiphop, pop, rap
Band2, rock, rap, pop
band3, hiphop, rap
"""
rows = [
    [col.strip() for col in row.split(",")]
    for row in text_in.split("\n")
    if row
]
fieldnames = set()
rows_reshaped = []
for row in rows:
    name = row[0]
    genres = row[1:]
    fieldnames.update(genres)
    rows_reshaped.append(dict([("name", name)] + [(genre, True) for genre in genres]))
fieldnames = ["name"] + sorted(fieldnames)
with open("band.csv", "w", encoding="utf-8", newline="") as file_out:
    writer = csv.DictWriter(file_out, fieldnames=fieldnames, restval=False)
    writer.writeheader()
    writer.writerows(rows_reshaped)
This should give you a file like:
name,hiphop,pop,rap,rock
Band1,True,True,True,False
Band2,False,True,True,True
band3,True,False,True,False
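If you are starting from a real file rather than the simulated text, the same rows list can come straight from csv.reader (the path bands.csv below is hypothetical):
with open("bands.csv", encoding="utf-8", newline="") as file_in:
    # Strip whitespace around each cell, just like the simulation above.
    rows = [[col.strip() for col in row] for row in csv.reader(file_in)]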

Basically, this removes your wide format by turning the data into a long format, then turns that into a one-hot encoded dataframe which you can use as you please:
import pandas as pd
df = pd.read_csv('./band_csv.csv',header=None)
new_df = pd.DataFrame(columns=['band','genre'])
for col in list(df.columns[1:]):
    temp_df = pd.DataFrame(columns=['band','genre'])
    temp_df.loc[:,'band'] = df.loc[:,df.columns[0]]
    temp_df.loc[:,'genre'] = df.loc[:,col]
    new_df = pd.concat([new_df,temp_df])
grouped_df = pd.get_dummies(new_df, columns=['genre']).groupby(['band'], as_index=False).sum()
Your grouped_df should look like
band genre_hiphop genre_pop genre_rap genre_rock
0 Band1 1 1 1 0
1 Band2 0 1 1 1
2 band3 1 0 1 0
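And if you would rather end up with the genre-name/NaN layout asked for in the question instead of 0/1 flags, a small post-processing sketch over the grouped_df above:
import numpy as np
# Swap each positive count for the bare genre name, and each zero count for NaN.
genre_cols = [c for c in grouped_df.columns if c.startswith('genre_')]
for c in genre_cols:
    grouped_df[c] = np.where(grouped_df[c] > 0, c.replace('genre_', ''), np.nan)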

Locating columns that correspond to a value in a dataframe

Suppose I have a defined dataframe:
+----+------+-------+-------+---------+
| ID | Fear | Happy | Angry | Excited |
+----+------+-------+-------+---------+
|    |      |       |       |         |
+----+------+-------+-------+---------+
I did emotion analysis on a text using NRCLex. Let's say it returns:
text_emotion = ['Fear', 'Happy']
How do I locate the values in the list and put a 1 in the corresponding columns if the emotion exists and a 0 if it doesn't?
+----+------+-------+-------+---------+
| ID | Fear | Happy | Angry | Excited |
+----+------+-------+-------+---------+
| A  | 1    | 1     | 0     | 0       |
+----+------+-------+-------+---------+
I tried using get_dummies, but it is not working in my situation, given that I want the result to line up with the defined dataframe. It gives me this instead:
+----+------+-------+
| ID | Fear | Happy |
+----+------+-------+
| A  | 1    | 1     |
+----+------+-------+
I would appreciate any help. Thank you.
You could do the following:
frame = pd.DataFrame(columns = ["Fear", "Angry", "Happy", "Excited"])
mylist = ["Fear", "Happy"]
pattern = '|'.join(mylist)
row = frame.columns.str.contains(pattern).astype(int)
frame.loc[0] = row
If you have a whole list you can loop through and append each row to the dataframe using frame.loc[i]. Like so:
frame = pd.DataFrame(columns = ["Fear", "Angry", "Happy", "Excited"])
mylist = frame.columns
mylists = [["Fear", "Happy"], ["Angry", "Excited"]]
for i in range(len(mylists)):
    the_list = mylists[i]
    pattern = '|'.join(the_list)
    row = frame.columns.str.contains(pattern).astype(int)
    frame.loc[i] = row
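One caveat: str.contains does substring matching, so 'Fear' would also match a hypothetical 'Fearless' column. For exact matches you could swap in Index.isin:
# Exact-match variant: 1 where the column name is literally in the list, else 0.
row = frame.columns.isin(the_list).astype(int)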
It depends on how you represent your data. Let's say you have the following dataframe constructed from your sentiment analysis results:
df = pd.DataFrame({
    'A': ['Fear', 'Happy', 'Emotional'],
    'B': ['Excited', 'Emotional', 'Angry'],
})
Then you could do:
df_dummies = pd.get_dummies(df.T, prefix=['']*len(df.T.columns), prefix_sep='')
out = df_dummies.groupby(level=0, axis=1).sum()
print(out):
Angry Emotional Excited Fear Happy
A 0 1 0 1 1
B 1 1 1 0 0
If you want the index as separate ID then
out = out.rename_axis('ID').reset_index()
print(out):
ID Angry Emotional Excited Fear Happy
0 A 0 1 0 1 1
1 B 1 1 1 0 0
I'd never heard of get_dummies() before, but here's what I came up with. It also uses loc. It's nice because you could have a predefined or an undefined/empty dataframe and it'll still work.
Since the emotions in text_emotion are the same as the dataframe column names, you can just loop through text_emotion and make that dataframe row/column value equal to 1 with loc.
import numpy as np
import pandas as pd
df = pd.DataFrame()
text_emotion_1 = ['Fear', 'Happy', 'Angry']
text_emotion_2 = ['Happy', 'Excited']
# for row 0, or you can do boolean indexing to assign it
# to the row where index = A
for em in text_emotion_1:
    df.loc[0, em] = 1
# for row 1
for em in text_emotion_2:
    df.loc[1, em] = 1
If you started with an empty dataframe, you'd have nulls:
Fear Happy Angry Excited
0 1.0 1.0 1.0 NaN
1 NaN 1.0 NaN 1.0
So you could use fillna() and astype() to replace the nulls with 0's and convert everything to integer, respectively.
df.fillna(0, inplace=True)
df = df.astype('int')
Then your dataframe will look like this (just missing the index column):
Fear Happy Angry Excited
0 1 1 1 0
1 0 1 0 1
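If you also want the ID column from the question, you could insert it explicitly (the labels 'A' and 'B' here are hypothetical, matching the two emotion lists above):
# Prepend an ID column; the values are assumed labels for rows 0 and 1.
df.insert(0, 'ID', ['A', 'B'])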
Edit: Removed a stray comma
Let's say you have a base dataframe df like this (doesn't need to have rows)
df = pd.DataFrame([["A", 0, 0, 0, 1]], columns=["ID", "Fear", "Happy", "Angry", "Excited"])
ID Fear Happy Angry Excited
0 A 0 0 0 1
and your analysis data are organized in another dataframe data_df like
data_df = pd.DataFrame({"ID": ["B", "C"], "text_emotion": [["Happy"], ["Angry", "Fear"]]})
ID text_emotion
0 B [Happy]
1 C [Angry, Fear]
Then you could use .str.get_dummies(): This
print(data_df["text_emotion"].str.join("|").str.get_dummies())
results in
Angry Fear Happy
0 0 0 1
1 1 1 0
So when you concatenate this to df
df_new = pd.concat(
    [data_df["ID"], data_df["text_emotion"].str.join("|").str.get_dummies()], axis=1
)
df = pd.concat([df, df_new], ignore_index=True).fillna(0)
you'll get
ID Fear Happy Angry Excited
0 A 0 0 0 1.0
1 B 0 1 0 0.0
2 C 1 0 1 0.0
The columns are exactly like df's columns. To fix the float parts you could do
df[df.columns[1:]] = df[df.columns[1:]].astype("int")
and get
ID Fear Happy Angry Excited
0 A 0 0 0 1
1 B 0 1 0 0
2 C 1 0 1 0

pandas merge header rows if one is not NaN

I'm importing an Excel sheet into a dataframe; the sheet has its headers split into two rows:
Colour | NaN | Shape | Mass | NaN
NaN | width | NaN | NaN | Torque
green | 33 | round | 2 | 6
etc
I want to collapse the first two rows into one header:
Colour | width | Shape | Mass | Torque
green | 33 | round | 2 | 6
...
I tried merged_header = df.loc[0].combine_first(df.loc[1])
but I'm not sure how to get that back into the original dataframe.
I've tried:
# drop top 2 rows
df = df.drop(df.index[[0,1]])
# then add the merged one in:
res = pd.concat([merged_header, df], axis=0)
But that just inserts merged_header as a column. I tried some other combinations of merge from this tutorial but without luck.
merged_header.append(df) gives a similar wrong result, and res = df.append(merged_header) is almost right, but the header is at the tail end:
green | 33 | round | 2 | 6
...
Colour | width | Shape | Mass | Torque
To provide more detail this is what I have so far:
df = pd.read_excel(ltro19, header=None, skiprows=9)
# delete all empty columns & rows
df = df.dropna(axis = 1, how = 'all')
df = df.dropna(axis = 0, how = 'all')
in case it affects the next step.
Let's use a list comprehension to flatten the two-level column header, taking the second-level name wherever the first level is NaN:
df.columns = [f'{j}' if str(i)=='nan' else f'{i}' for i, j in df.columns]
Output:
['Colour', 'width', 'Shape', 'Mass', 'Torque']
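For context, this assumes df.columns is a two-level MultiIndex. A minimal, self-contained sketch (with hypothetical data mirroring the question) that builds the MultiIndex from the two header rows and then flattens it:
import pandas as pd
import numpy as np

# Hypothetical stand-in for the sheet: two split header rows plus one data row.
df = pd.DataFrame([
    ['Colour', np.nan, 'Shape', 'Mass', np.nan],
    [np.nan, 'width', np.nan, np.nan, 'Torque'],
    ['green', 33, 'round', 2, 6],
])

# Promote the first two rows to a two-level column header, then drop them.
df.columns = pd.MultiIndex.from_arrays([df.iloc[0], df.iloc[1]])
df = df.iloc[2:].reset_index(drop=True)

# The list comprehension from above: second-level name wherever level 0 is NaN.
df.columns = [f'{j}' if str(i) == 'nan' else f'{i}' for i, j in df.columns]
print(list(df.columns))  # ['Colour', 'width', 'Shape', 'Mass', 'Torque']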
This should work for you:
df.columns = list(df.columns.get_level_values(0))
Probably due to my ignorance of the terms, the suggestions above did not lead me directly to a working solution. It seemed I was working with a dataframe
>>> print(type(df))
<class 'pandas.core.frame.DataFrame'>
but, I think, without headers.
This solution worked, although it involved jumping out of the dataframe and into a list to then put it back as the column headers. Inspired by Merging Two Rows (one with a value, the other NaN) in Pandas
df = pd.read_excel(name_of_file, header=None, skiprows=9)
# delete all empty columns & rows
df = df.dropna(axis = 1, how = 'all')
df = df.dropna(axis = 0, how = 'all')
# merge the two headers which are weirdly split over two rows
merged_header = df.loc[0].combine_first(df.loc[1])
# turn that into a list
header_list = merged_header.values.tolist()
# load that list as the new headers for the dataframe
df.columns = header_list
# drop top 2 rows (old split header)
df = df.drop(df.index[[0,1]])

How to compare a specific column in two csv files and output differences to a third file

I have two csv files named test1.csv and test2.csv and they both have a column named 'Name'. I would like to compare each row in this Name column between both files and output the ones that don't match to a third file. I have seen some examples using pandas, but none worked for my situation. Can anyone help me get a script going for this?
test2.csv will be updated to include all values from test1.csv plus new values not included in test1.csv (which are the ones I want saved to a third file).
An example of what the columns look like is:
test1.csv:
Name Number Status
gfd454 456 Disposed
3v4fd 521 Disposed
th678iy 678 Disposed
test2.csv
Name Number Status
gfd454 456 Disposed
3v4fd 521 Disposed
th678iy 678 Disposed
vb556h 665 Disposed
See below.
The idea is to read the names into a Python set data structure and find the new names by doing set subtraction.
1.csv:
Name Number
A 12
B 34
C 45
2.csv
Name Number
A 12
B 34
C 45
D 77
Z 67
The code below will print {'D', 'Z'}, which are the new names.
def read_file_to_set(file_name):
    with open(file_name) as f:
        return set(l.strip().split()[0] for x, l in enumerate(f.readlines()) if x > 0)
set_1 = read_file_to_set('1.csv')
set_2 = read_file_to_set('2.csv')
new_names = set_2 - set_1
print(new_names)
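If the real files are comma-separated with a header (as in test1.csv/test2.csv from the question), the same idea works with csv.DictReader, plus writing the third file — a sketch:
import csv

def read_names(file_name):
    # Collect the 'Name' column into a set.
    with open(file_name, newline='') as f:
        return {row['Name'] for row in csv.DictReader(f)}

new_names = read_names('test2.csv') - read_names('test1.csv')

# Save the new names to a third file.
with open('test3.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name'])
    writer.writerows([name] for name in sorted(new_names))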
This answer assumes that the data is lined up as in your example:
import pandas as pd
# "read" each file
df1 = pd.DataFrame({'Name': ['gfd454', '3v4fd', 'th678iy']})
df2 = pd.DataFrame({'Name': ['gfd454', '3v4fd', 'th678iy', 'fdvs']})
# make column names unique
df1 = df1.rename(columns={'Name': 'Name1'})
df2 = df2.rename(columns={'Name': 'Name2'})
# line them up next to each other
df = pd.concat([df1, df2], axis=1)
# get difference
diff = df[df['Name1'].isnull()]['Name2'] # or df[df['Name1'] != df['Name2']]['Name2']
# write
diff.to_csv('test3.csv')
This should be straightforward. The solution assumes that the content of file2 is the same or longer, so items are only appended to file2.
import pandas as pd
df1 = pd.read_csv(r"C:\path\to\file1.csv")
df2 = pd.read_csv(r"C:\path\to\file2.csv")
# print(df1)
# print(df2)
df = pd.concat([df1, df2], axis=1)
df['X'] = df['A'] == df['B']
print(df[df.X==False])
df3 = df[df.X==False]['B']
print(df3)
df3.to_csv(r"C:\path\to\file3.csv")
If the items are in arbitrary order, you could use df.isin() as follows:
import pandas as pd
df1 = pd.read_csv(r"C:\path\to\file1.csv")
df2 = pd.read_csv(r"C:\path\to\file2.csv")
df = pd.concat([df1, df2], axis=1)
df['X'] = df['B'].isin(df['A'])
df3 = df[df.X==False]['B']
df3.to_csv(r"C:\path\to\file3.csv")
I have created the following two files for testing. file1.csv:
A
1_in_A
2_in_A
3_in_A
4_in_A
and file2.csv:
B
2_in_A
1_in_A
3_in_A
4_in_B
5_in_B
The dataframe df then looks as follows:
| | A | B | X |
|---:|:-------|:-------|:------|
| 0 | 1_in_A | 2_in_A | True |
| 1 | 2_in_A | 1_in_A | True |
| 2 | 3_in_A | 3_in_A | True |
| 3 | 4_in_A | 4_in_B | False |
| 4 | nan | 5_in_B | False |
and we select only the items that are flagged as False.
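For completeness, a compact order-independent variant keyed on the question's actual 'Name' column (a sketch, assuming both files have headers as shown in the question):
import pandas as pd

df1 = pd.read_csv('test1.csv')
df2 = pd.read_csv('test2.csv')
# Keep only rows of test2 whose Name never appears in test1.
new_rows = df2[~df2['Name'].isin(df1['Name'])]
new_rows.to_csv('test3.csv', index=False)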

Maintaining column order when adding two dataframes with similar formats

I have two dataframes with similar formats. Both have 3 indexes/headers. Most of the headers are the same but df2 has a few additional ones. When I add them up the order of the headers gets mixed up. I would like to maintain the order of df1. Any ideas?
Global = pd.read_excel('Mickey Mouse_Clean2.xlsx',header=[0,1,2,3],index_col=[0,1],sheet_name = 'Global')
Oslav = pd.read_excel('Mickey Mouse_Clean2.xlsx',header=[0,1,2,3],index_col=[0,1],sheet_name = 'Country XYZ')
Oslav = Oslav.replace(to_replace=1,value=10)
Oslav = Oslav.replace(to_replace=-1,value=-2)
df = Global.add(Oslav,fill_value=0)
Example of df format (schematic; dots stand for more of the same):
            | HeaderA             || Header2             || Header3             |
            | xxx1 xxx2 xxx3 xxx4 || xxx1 xxx2 xxx3 xxx4 || xxx1 xxx2 xxx3 xxx4 |
ColX | ColY | ColA ColB ColC ColD || ColD ColE ColF ColG || ColH ColI ColJ ColK |
1    | ds   | 1         +1   -1   || ...                 || ...                 |
2    | dh   | ...
3    | ge   | ...
4    | ew   | ...
5    | er   | ...
df = df[list(Global.columns) + list(set(Oslav.columns) - set(Global.columns))].copy()
or
df = df[list(Global.columns) + [col for col in Oslav.columns if col not in Global.columns]].copy()
(The second option should preserve the order of Oslav columns as well, if you care about that.)
or
df = df.reindex(columns=list(Global.columns) + list(set(Oslav.columns) - set(Global.columns)))
If you don't want to keep the columns that are in Oslav, but not in Global, you can do
df = df[Global.columns].copy()
Note that without .copy(), you're getting a view of the previous dataframe, rather than a dataframe in its own right.
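A toy illustration of the reordering, with flat columns for brevity (the same pattern applies to the MultiIndex case):
import pandas as pd

g = pd.DataFrame([[1, 2]], columns=['b', 'a'])
o = pd.DataFrame([[10, 20, 30]], columns=['a', 'c', 'b'])
s = g.add(o, fill_value=0)  # add() hands the columns back alphabetized: a, b, c
# Restore g's order first, then append o's extras.
s = s[list(g.columns) + [c for c in o.columns if c not in g.columns]]
print(list(s.columns))  # ['b', 'a', 'c']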

A fast method for comparing a value in one pandas row to another in a previous row?

I have a DataFrame, df, that looks like:
ID | TERM | DISC_1
1 | 2003-10 | ECON
1 | 2002-01 | ECON
1 | 2002-10 | ECON
2 | 2003-10 | CHEM
2 | 2004-01 | CHEM
2 | 2004-10 | ENGN
2 | 2005-01 | ENGN
3 | 2001-01 | HISTR
3 | 2002-10 | HISTR
3 | 2002-10 | HISTR
ID is a student ID, TERM is an academic term, and DISC_1 is the discipline of their major. For each student, I’d like to identify the TERM when (and if) they changed DISC_1, and then create a new DataFrame that reports when. Zero indicates they did not change. The output looks like:
ID | Change
1 | 0
2 | 2004-01
3 | 0
My code below works, but it’s very slow. I tried to do this using Groupby, but was unable to. Could someone explain how I might accomplish this task more efficiently?
df = df.sort_values(by = ['PIDM', 'TERM'])
c = 0
last_PIDM = 0
last_DISC_1 = 0
change = [ ]
for index, row in df.iterrows():
    c = c + 1
    if c > 1:
        row['change'] = np.where((row['PIDM'] == last_PIDM) & (row['DISC_1'] != last_DISC_1), row['TERM'], 0)
        last_PIDM = row['PIDM']
        last_DISC_1 = row['DISC_1']
    else:
        row['change'] = 0
    change.append(row['change'])
df['change'] = change
change_terms = df.groupby('PIDM')['change'].max()
Here's a start:
df = df.sort_values(['ID', 'TERM'])
gb = df.groupby('ID').DISC_1
df['Change'] = df.TERM[gb.apply(lambda x: x != x.shift().bfill())]
df.Change = df.Change.fillna(0)
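If you want to go all the way to the ID/Change output without iterrows, here is a vectorized sketch (assuming TERM values are strings as in the example, so max() picks the latest change per student):
df = df.sort_values(['ID', 'TERM'])
# True where DISC_1 differs from the previous row of the same student;
# the cumcount guard keeps each student's first row from counting as a change.
changed = df['DISC_1'].ne(df.groupby('ID')['DISC_1'].shift()) & df.groupby('ID').cumcount().gt(0)
out = (
    df.assign(Change=df['TERM'].where(changed))
      .groupby('ID', as_index=False)['Change']
      .max()
      .fillna(0)
)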
I've never been a big pandas user, so my solution would involve spitting that df out as a CSV and iterating over each row while retaining the previous row. If it is properly sorted (first by ID, then by term date), I might write something like this...
import csv

with open('inputDF.csv', newline='') as infile:
    with open('outputDF.csv', 'w', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        previousline = next(reader)  # grab the first row to compare to the second
        termChange = 0
        for line in reader:
            if line[0] != previousline[0]:  # new ID means print and move on to next person
                writer.writerow([previousline[0], termChange])  # write ID, termChange date
                termChange = 0
            elif line[2] != previousline[2]:  # new discipline
                termChange = line[1]  # set term-changed date
                # termChange = previousline[1]  # in case you want to retain the last date they were in the old discipline
            previousline = line  # store current line as previous and continue loop
        writer.writerow([previousline[0], termChange])  # write out the final person too
