I'm importing an Excel sheet into a dataframe; the sheet has its headers split into two rows:
Colour | NaN | Shape | Mass | NaN
NaN | width | NaN | NaN | Torque
green | 33 | round | 2 | 6
etc
I want to collapse the first two rows into one header:
Colour | width | Shape | Mass | Torque
green | 33 | round | 2 | 6
...
I tried merged_header = df.loc[0].combine_first(df.loc[1])
but I'm not sure how to get that back into the original dataframe.
I've tried:
# drop top 2 rows
df = df.drop(df.index[[0,1]])
# then add the merged one in:
res = pd.concat([merged_header, df], axis=0)
But that just inserts merged_header as a column. I tried some other combinations of merge from this tutorial but without luck.
merged_header.append(df) gives a similar wrong result, and res = df.append(merged_header) is almost right, but the header is at the tail end:
green | 33 | round | 2 | 6
...
Colour | width | Shape | Mass | Torque
To provide more detail this is what I have so far:
df = pd.read_excel(ltro19, header=None, skiprows=9)
# delete all empty columns & rows
df = df.dropna(axis = 1, how = 'all')
df = df.dropna(axis = 0, how = 'all')
in case it affects the next step.
Let's use a list comprehension to flatten the MultiIndex column header:
df.columns = [f'{j}' if str(i)=='nan' else f'{i}' for i, j in df.columns]
Output:
['Colour', 'width', 'Shape', 'Mass', 'Torque']
This should work for you:
df.columns = list(df.columns.get_level_values(0))
Probably due to my ignorance of the terms, the suggestions above did not lead me directly to a working solution. It seemed I was working with a dataframe
>>> print(type(df))
<class 'pandas.core.frame.DataFrame'>
but, I think, without headers.
This solution worked, although it involved jumping out of the dataframe and into a list to then put it back as the column headers. Inspired by Merging Two Rows (one with a value, the other NaN) in Pandas
df = pd.read_excel(name_of_file, header=None, skiprows=9)
# delete all empty columns & rows
df = df.dropna(axis = 1, how = 'all')
df = df.dropna(axis = 0, how = 'all')
# merge the two headers which are weirdly split over two rows
merged_header = df.loc[0].combine_first(df.loc[1])
# turn that into a list
header_list = merged_header.values.tolist()
# load that list as the new headers for the dataframe
df.columns = header_list
# drop top 2 rows (old split header)
df = df.drop(df.index[[0,1]])
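For anyone who wants to check the approach end to end, here is a minimal, self-contained sketch of the same steps on made-up data (the real file is read with read_excel; this just simulates the post-dropna state):

```python
import pandas as pd
import numpy as np

# Simulated state after read_excel(header=None) + dropna: two header rows, then data
df = pd.DataFrame([
    ["Colour", np.nan, "Shape", "Mass", np.nan],
    [np.nan, "width", np.nan, np.nan, "Torque"],
    ["green", 33, "round", 2, 6],
])

# combine_first fills the NaNs in row 0 with the matching values from row 1
merged_header = df.loc[0].combine_first(df.loc[1])
df.columns = merged_header.values.tolist()

# drop the two old header rows and renumber
df = df.drop(df.index[[0, 1]]).reset_index(drop=True)

print(list(df.columns))  # ['Colour', 'width', 'Shape', 'Mass', 'Torque']
```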
I am trying to merge rows with each other to get one row containing all the values that are present. Currently the df looks like this:
(screenshot of the dataframe)
What I want is something like:
| index | scan .. | snel. | kool .. | note .. |
| ----- | ------- | ----- | ------- | ------- |
| 0 | 7,8 | 4,0 | 20.0 | Fiasp, ..|
I can get that output in the code example below but it just seems really messy.
I tried to use groupby, agg, sum, and max, and all those do is remove columns, so it looks like this:
df2.groupby('Tijdstempel apparaat').max().reset_index()
I tried filling the row with the values of the previous rows, and then dropping the rows that don't contain every value. But this seems like a long workaround and really messy.
df2 = df2.loc[df['Tijdstempel apparaat'] == '20-01-2023 13:24']
df2 = df2.reset_index()
del df2['index']
df2['Snelwerkende insuline (eenheden)'].fillna(method='pad', inplace=True)
df2['Koolhydraten (gram)'].fillna(method='pad', inplace=True)
df2['Notities'].fillna(method='pad', inplace=True)
df2['Scan Glucose mmol/l'].fillna(method='pad', inplace=True)
print(df2)
# df2.loc[df2[0,'Snelwerkende insuline (eenheden)']] = df2.loc[df2[1, 'Snelwerkende insuline (eenheden)']]
df2.drop([0, 1, 2])
Output:
When I have to do this for the entire data.csv (whenever a time stamp like "20-01-2023 13:24" is found multiple times), I am worried it will be really slow and time-consuming.
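For what it's worth, one more direct route may be groupby(...).first(), since first() takes the first non-null value per column within each group; this is a sketch on made-up data using the column names from the attempt above (assumed, not verified against the real file):

```python
import pandas as pd
import numpy as np

# Hypothetical rows sharing one timestamp, each carrying a different value
df2 = pd.DataFrame({
    "Tijdstempel apparaat": ["20-01-2023 13:24"] * 3,
    "Scan Glucose mmol/l": [7.8, np.nan, np.nan],
    "Snelwerkende insuline (eenheden)": [np.nan, 4.0, np.nan],
    "Notities": [np.nan, np.nan, "Fiasp"],
})

# first() skips NaNs, so every timestamp collapses to a single filled row
merged = df2.groupby("Tijdstempel apparaat", as_index=False).first()
```

Because this handles every repeated timestamp in a single pass, it should also stay fast on the full file.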
Sample data shaped like yours:
df = pd.DataFrame(data={
"times":["date1","date1","date1","date1","date1"],
"type":[1,2,3,4,5],
"key1":[1,None,None,None,None],
"key2":[None,"2",None,None,None],
"key3":[None,None,3,None,None],
"key4":[None,None,None,"val",None],
"key5":[None,None,None,None,5],
})
Solution:
melt = df.melt(id_vars="times",
value_vars=df.columns[1:],)
melt = melt.dropna()
pivot = melt.pivot_table(values="value", index="times", columns="variable", aggfunc=lambda x: x)
Change the location of the type column:
index = list(pivot.columns).index("type")
pivot = pd.concat([pivot.iloc[:,index:], pivot.iloc[:,:index]], axis=1)
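The melt-and-pivot steps above can be sketched end to end as follows; note this uses aggfunc="first" instead of the identity lambda, an assumption made so that groups holding more than one value (like type here) reduce cleanly:

```python
import pandas as pd

df = pd.DataFrame({
    "times": ["date1"] * 5,
    "type": [1, 2, 3, 4, 5],
    "key1": [1, None, None, None, None],
    "key2": [None, "2", None, None, None],
    "key3": [None, None, 3, None, None],
    "key4": [None, None, None, "val", None],
    "key5": [None, None, None, None, 5],
})

# long format: one (times, variable, value) row per non-null cell
melt = df.melt(id_vars="times", value_vars=df.columns[1:]).dropna()

# back to wide: one row per timestamp, first non-null value per column
pivot = melt.pivot_table(values="value", index="times",
                         columns="variable", aggfunc="first")
```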
I have a large csv file of unordered data. It consists of music tags. I am trying to group all of the similar tags together for easier analysis.
An example of what I have:
Band1, hiphop, pop, rap
Band2, rock, rap, pop
band3, hiphop, rap
The output I am looking for would be like:
Band1, hiphop, pop, rap
Band2, NaN, pop, rap, rock
Band3, hiphop, NaN, rap
What is the best way to sort the data like this?
I have tried using pandas and doing basic sorts in excel.
Here's one way that avoids for loops, just melting and pivoting the data to get to your output:
import pandas as pd
import numpy as np
df = pd.read_csv("./test.csv", names=['col1','col2','col3','col4'])
#melt on all but the first column
df = pd.melt(df, id_vars='col1', value_vars=df.columns[1:], value_name='genres')
#pivot using the new genres column as column names
df = pd.pivot_table(df, values='variable', index='col1', columns='genres', aggfunc='count').reset_index()
#swap non-null values with the column name
cols = df.columns[1:]
df[cols] = np.where(df[cols].notnull(), cols, df[cols])
+--------+-------+--------+-----+-----+------+
| genres | col1 | hiphop | pop | rap | rock |
+--------+-------+--------+-----+-----+------+
| 0 | Band1 | hiphop | pop | rap | NaN |
| 1 | Band2 | NaN | pop | rap | rock |
| 2 | band3 | hiphop | NaN | rap | NaN |
+--------+-------+--------+-----+-----+------+
Read the file (simulated below). As you read each row, update a fieldnames set so that when you write the rows you can pass this set of genres to your DictWriter.
import csv
text_in = """
Band1, hiphop, pop, rap
Band2, rock, rap, pop
band3, hiphop, rap
"""
rows = [
[col.strip() for col in row.split(",")]
for row in text_in.split("\n")
if row
]
fieldnames = set()
rows_reshaped = []
for row in rows:
name = row[0]
genres = row[1:]
fieldnames.update(genres)
rows_reshaped.append(dict([("name", name)] + [(genre, True) for genre in genres]))
fieldnames = ["name"] + sorted(fieldnames)
with open("band.csv", "w", encoding="utf-8", newline="") as file_out:
writer = csv.DictWriter(file_out, fieldnames=fieldnames, restval=False)
writer.writeheader()
writer.writerows(rows_reshaped)
This should give you a file like:
name,hiphop,pop,rap,rock
Band1,True,True,True,False
Band2,False,True,True,True
band3,True,False,True,False
Basically, this removes your wide format, turns the data into a long format, and then into a one-hot-encoded dataframe which you can use as you please:
import pandas as pd
df = pd.read_csv('./band_csv.csv',header=None)
new_df = pd.DataFrame(columns=['band','genre'])
for col in list(df.columns[1:]):
    temp_df = pd.DataFrame(columns=['band','genre'])
    temp_df.loc[:,'band'] = df.loc[:,df.columns[0]]
    temp_df.loc[:,'genre'] = df.loc[:,col]
    new_df = pd.concat([new_df,temp_df])
grouped_df = pd.get_dummies(new_df, columns=['genre']).groupby(['band'], as_index=False).sum()
Your grouped_df should look like
band genre_hiphop genre_pop genre_rap genre_rock
0 Band1 1 1 1 0
1 Band2 0 1 1 1
2 band3 1 0 1 0
I have two csv files named test1.csv and test2.csv and they both have a column named 'Name'. I would like to compare each row in this Name column between both files and output the ones that don't match to a third file. I have seen some examples using pandas, but none worked for my situation. Can anyone help me get a script going for this?
Test2 will be updated to include all values from test1 plus new values not included in test1 (which are the ones I want saved to a third file).
An example of what the columns look like is:
test1.csv:
Name Number Status
gfd454 456 Disposed
3v4fd 521 Disposed
th678iy 678 Disposed
test2.csv
Name Number Status
gfd454 456 Disposed
3v4fd 521 Disposed
th678iy 678 Disposed
vb556h 665 Disposed
See below. The idea is to read the names into a Python set and find the new names by set subtraction.
1.csv:
Name Number
A 12
B 34
C 45
2.csv
Name Number
A 12
B 34
C 45
D 77
Z 67
The code below will print {'D', 'Z'} which are the new names.
def read_file_to_set(file_name):
    with open(file_name) as f:
        # skip the header line, keep the first column of every row
        return set(l.strip().split()[0] for x, l in enumerate(f.readlines()) if x > 0)
set_1 = read_file_to_set('1.csv')
set_2 = read_file_to_set('2.csv')
new_names = set_2 - set_1
print(new_names)
This answer assumes that the data is lined up as in your example:
import pandas as pd
# "read" each file
df1 = pd.DataFrame({'Name': ['gfd454', '3v4fd', 'th678iy']})
df2 = pd.DataFrame({'Name': ['gfd454', '3v4fd', 'th678iy', 'fdvs']})
# make column names unique
df1 = df1.rename(columns={'Name': 'Name1'})
df2 = df2.rename(columns={'Name': 'Name2'})
# line them up next to each other
df = pd.concat([df1, df2], axis=1)
# get difference
diff = df[df['Name1'].isnull()]['Name2'] # or df[df['Name1'] != df['Name2']]['Name2']
# write
diff.to_csv('test3.csv')
This should be straightforward - the solution assumes that the content of file2 is the same or longer, so items are only appended to file2.
import pandas as pd
df1 = pd.read_csv(r"C:\path\to\file1.csv")
df2 = pd.read_csv(r"C:\path\to\file2.csv")
# print(df1)
# print(df2)
df = pd.concat([df1, df2], axis=1)
df['X'] = df['A'] == df['B']
print(df[df.X==False])
df3 = df[df.X==False]['B']
print(df3)
df3.to_csv(r"C:\path\to\file3.csv")
If the items are in arbitrary order, you could use df.isin() as follows:
import pandas as pd
df1 = pd.read_csv(r"C:\path\to\file1.csv")
df2 = pd.read_csv(r"C:\path\to\file2.csv")
df = pd.concat([df1, df2], axis=1)
df['X'] = df['B'].isin(df['A'])
df3 = df[df.X==False]['B']
df3.to_csv(r"C:\path\to\file3.csv")
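If the files really share the Name column from the question, a sketch that compares that column directly, without lining the frames up positionally, could look like this (the frames are stand-ins for the two read_csv calls):

```python
import pandas as pd

# Stand-ins for pd.read_csv("test1.csv") and pd.read_csv("test2.csv")
df1 = pd.DataFrame({"Name": ["gfd454", "3v4fd", "th678iy"]})
df2 = pd.DataFrame({"Name": ["gfd454", "3v4fd", "th678iy", "vb556h"]})

# keep the rows of test2 whose Name never appears in test1, regardless of order
new_rows = df2[~df2["Name"].isin(df1["Name"])]
new_rows.to_csv("test3.csv", index=False)
```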
I have created the following 2 files, file1.csv:
A
1_in_A
2_in_A
3_in_A
4_in_A
and file2.csv:
B
2_in_A
1_in_A
3_in_A
4_in_B
5_in_B
for testing. The dataframe df looks as follows:
| | A | B | X |
|---:|:-------|:-------|:------|
| 0 | 1_in_A | 2_in_A | True |
| 1 | 2_in_A | 1_in_A | True |
| 2 | 3_in_A | 3_in_A | True |
| 3 | 4_in_A | 4_in_B | False |
| 4 | nan | 5_in_B | False |
and we select only the items that are flagged as False.
I have two dataframes with similar formats. Both have 3 indexes/headers. Most of the headers are the same but df2 has a few additional ones. When I add them up the order of the headers gets mixed up. I would like to maintain the order of df1. Any ideas?
Global = pd.read_excel('Mickey Mouse_Clean2.xlsx',header=[0,1,2,3],index_col=[0,1],sheet_name = 'Global')
Oslav = pd.read_excel('Mickey Mouse_Clean2.xlsx',header=[0,1,2,3],index_col=[0,1],sheet_name = 'Country XYZ')
Oslav = Oslav.replace(to_replace=1,value=10)
Oslav = Oslav.replace(to_replace=-1,value=-2)
df = Global.add(Oslav,fill_value=0)
Example of df format (four header levels, values truncated):
HeaderA | Header2 | Header3
xxx1|xxx2|xxx3|xxx4 || xxx1|xxx2|xxx3|xxx4 || xxx1|xxx2|xxx3|xxx4
ColX|ColY || ColA|ColB|ColC|ColD || ColD|ColE|ColF|ColG || ColH|ColI|ColJ|ColDK
1 | ds | 1 |    | +1 | -1 | ...
2 | dh | ...
3 | ge | ...
4 | ew | ...
5 | er | ...
df = df[list(Global.columns) + list(set(Oslav.columns) - set(Global.columns))].copy()
or
df = df[list(Global.columns) + [col for col in Oslav.columns if col not in Global.columns]].copy()
(The second option should preserve the order of Oslav columns as well, if you care about that.)
or
df = df.reindex(columns=list(Global.columns) + list(set(Oslav.columns) - set(Global.columns)))
If you don't want to keep the columns that are in Oslav, but not in Global, you can do
df = df[Global.columns].copy()
Note that without .copy(), you're getting a view of the previous dataframe, rather than a dataframe in its own right.
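A tiny check of the reindex variant on toy frames (single-level columns for brevity here; the real frames have four header levels):

```python
import pandas as pd

Global = pd.DataFrame([[1, 2]], columns=["b", "a"])            # deliberately not alphabetical
Oslav = pd.DataFrame([[10, 20, 30]], columns=["b", "a", "c"])  # one extra column

df = Global.add(Oslav, fill_value=0)  # add() aligns and sorts the columns: a, b, c

# restore Global's order, then append whatever is new in Oslav
order = list(Global.columns) + [c for c in Oslav.columns if c not in Global.columns]
df = df.reindex(columns=order)

print(list(df.columns))  # ['b', 'a', 'c']
```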
I have a pandas dataframe where cells in columns have multiple values and are separated by ';'. I'm trying to split the multiple values (in one cell) and create new rows for those that split off. Something like the example below:
> In: df
> Out:
| Year | State | Ingredient | Species |
| 1998 | CA | egg; pork | sp1;sp2 |
The result I am trying to achieve looks like this:
> In: df
> Out:
| Year | State | Ingredient | Species |
| 1998 | CA | egg | sp1 |
| 1998 | CA | egg | sp1 |
| 1998 | CA | pork | sp2 |
| 1998 | CA | pork | sp2 |
I have found a method to split the dataframe like this, but it only works once. The code I used is shown below:
sp = df['Species'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
i = sp.index.get_level_values(0)
df1 = df.loc[i].copy()
df1['Species'] = sp.values
When I execute this on the 'Species' column first, using the original dataframe (df), it works.
However, when I execute this code again on df1, trying to split up all the 'Ingredient', it gives me an error saying that length of value does not match length of index. As shown below:
fd = df1['Ingredient'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
j = fd.index.get_level_values(0)
df2 = df1.loc[j].copy()
df2['Ingredient'] = fd.values
I did many trials to find out why it returns that error message, and I realized that when I execute this code again on df1 to create df2, it doubles the number of rows when I execute df2 = df1.loc[j].copy(), giving me more rows than I need. However, if I substitute 'df1' with 'df' (the original dataframe), the error doesn't appear and it works.
Is there a solution to fix this? Or is there any other way of splitting it?
Thank you.
ps. This is my first time posting on Stack Overflow, and I'm also new to Python. Sorry if the formatting is bad.
I gave your problem a try. I wasn't able to fix the issue in your approach, but since you provided the expected output I came up with another one. Hopefully this is concise and resolves your issue.
df = pd.DataFrame(columns=['Year', 'State', 'Ingredient', 'Species'])
df.loc[0] = [1998, 'CA', 'egg; pork', 'sp1;sp2'] # Same input df as problem
print(df)
sp = df['Species'][0].split(';') # Separating by species
df = pd.concat([df]*len(sp), ignore_index=True) # Add len(sp) more rows
df['Species'] = sp
ing = df['Ingredient'][0].split(';')
df = pd.concat([df]*len(ing), ignore_index=True)
df['Ingredient'] = ing*len(sp) # Replicate ingredient len(sp) number of times
print(df)
Year State Ingredient Species
0 1998 CA egg; pork sp1;sp2
Year State Ingredient Species
0 1998 CA egg sp1
1 1998 CA pork sp2
2 1998 CA egg sp1
3 1998 CA pork sp2
PS: This is my first time answering ... please let me know if I should make any changes to this answer to add more detail or format. Thanks!
Edit: I was able to find out what was going wrong in your approach. You have to reset the index when you create the copy of the dataframe; otherwise, when you look up rows by index value 0, you get multiple matches, since all the row indices are still 0. See below.
sp = df['Species'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
i = sp.index.get_level_values(0)
df1 = df.loc[i].copy()
print(df1)
fd = df1['Ingredient'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
j = fd.index.get_level_values(0)
print(j)
df1 = df.loc[i].copy().reset_index(drop=True)
print(df1)
fd = df1['Ingredient'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
j = fd.index.get_level_values(0)
print(j)
Output:
Year State Ingredient Species
0 1998 CA egg; pork sp1;sp2
0 1998 CA egg; pork sp1;sp2
Int64Index([0, 0, 0, 0], dtype='int64')
Year State Ingredient Species
0 1998 CA egg; pork sp1;sp2
1 1998 CA egg; pork sp1;sp2
Int64Index([0, 0, 1, 1], dtype='int64')
Original code with fix:
df = pd.DataFrame(columns=['Year', 'State', 'Ingredient', 'Species'])
df.loc[0] = [1998, 'CA', 'egg; pork', 'sp1;sp2']
# print(df)
sp = df['Species'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
i = sp.index.get_level_values(0)
df1 = df.loc[i].copy().reset_index(drop=True, inplace=False)
df1['Species'] = sp.values
fd = df1['Ingredient'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
j = fd.index.get_level_values(0)
df2 = df1.loc[j].copy().reset_index(drop=True, inplace=False)
df2['Ingredient'] = fd.values
print(df2)
Hope that helps!
With the help of vk's "Original code with fix" shown above, I solved the error "length of values don't match with length of index". The solution: I needed to place reset_index() at the appropriate locations in the code.
Original code:
## Separate multiple entries in cells in 'Species' column to new rows:
sp = df['Species'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
i = sp.index.get_level_values(0)
df1 = df.loc[i].copy()
df1['Species'] = sp.values
## Separate multiple entries in cells in 'Ingredient' column to new rows:
ing = df1['Ingredient'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
j = ing.index.get_level_values(0)
df2 = df1.loc[j].copy()
df2['Ingredient'] = ing.values
Fixed code:
## Separate multiple entries in 'Species' column cell into rows
sp = df['Species'].str.split(';', expand=True).stack()
i = sp.index.get_level_values(0)
df1 = df.loc[i].copy().reset_index()
df1['Species'] = sp.values
del df1['index'] ## a column called "index" is generated when you execute reset_index()
## Separate multiple entries in 'Ingredient' column cell into rows:
ing = df1['Ingredient'].str.split(';', expand=True).stack()
j = ing.index.get_level_values(0)
df2 = df1.loc[j].copy()
df2['Ingredient'] = ing.values
And I got the output I wanted with the 'Fixed code'.
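As an aside, on newer pandas (1.3 and up) the whole two-step split can be sketched with DataFrame.explode on both columns at once; note this pairs elements positionally (egg with sp1, pork with sp2) rather than producing the duplicated rows shown earlier, so check it matches what you need:

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1998],
    "State": ["CA"],
    "Ingredient": ["egg; pork"],
    "Species": ["sp1;sp2"],
})

# turn each cell into an equal-length list, stripping stray spaces
for col in ["Ingredient", "Species"]:
    df[col] = df[col].str.split(";").apply(lambda parts: [p.strip() for p in parts])

# explode both list columns together (element-wise pairing, pandas >= 1.3)
out = df.explode(["Ingredient", "Species"], ignore_index=True)
```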