I have an .xlsx file whose format is similar to the following. (Note that the first row is descriptive and not meant to be the column headers; the headers are on row 2.)
SHEET SUBJECT, Listings for 2010,,,,
Date, Name, Name_2, Abr, Number, <--- I want these as the column headers
12/01/2010, Company Name, Somecity, Chi, 36,
12/02/2010, Company Name, Someothercity, Nyc, 156,
So when I do this_df = pd.read_excel('filename.xlsx') I get SHEET SUBJECT and Listings for 2010 followed by a series of unnamed column headers. That's expected, but not what I want.
And when I do this_df.columns = this_df.iloc[1], expecting the column headers to be set from the row at index 1, I instead get the data values from the row at index 2.
What am I missing? Thanks.
Simply specify the row index of the header when you read the Excel file; pandas skips everything above that row and uses it as the column names:
pd.read_excel('filename.xlsx', header=1)
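Note that the trailing commas in your sample suggest an empty sixth column, which pandas will read in as Unnamed: 5. If that happens, a minimal sketch for dropping such columns after reading:
df = pd.read_excel('filename.xlsx', header=1)
df = df.loc[:, ~df.columns.str.startswith('Unnamed')]  # drop empty trailing columns, if any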
Maybe you can fix it when you read the Excel file:
df = pd.read_excel(r'TT.xlsx', skiprows=1)
df
Out[367]:
        Date          Name         Name_2  Abr  Number  Unnamed: 5
0 2010-12-01  Company Name       Somecity  Chi      36         NaN
1 2010-12-02  Company Name  Someothercity  Nyc     156         NaN
Related
I have the below dataframe with data that looks like this (made-up data for clarification):
Info     Values
ID       53312
State    Mass
Address  Stackoverflowtown
ID       56120
State:   Bos
Address  Georgetown
Name:    James
There are a lot of lines, but the issue is that I want the data under a certain ID in wide form, because ID, State, and Address belong to the same ID; then, for the next record, where we also have a Name, all of ID, State, Address, and Name belong to the same ID.
Question:
Is there a way using Python Pandas or Excel to make it so that the table would look something more like this (where we use ID as the index)?
ID     State  Address            Name
53312  Mass   Stackoverflowtown
56120  Bos    Georgetown         James
I tried to do a pivot in Python, but since the values are all originally under the same column, when pivoted they end up showing each value (ID, State, Address, etc.) on a different line, with the other values in the same line blank. It looks something like this, which is not right:
Example for ID 53312:
ID     State  Address            Name
53312  Null   Null               Null
Null   Mass   Null               Null
Null   Null   Stackoverflowtown  Null
Null   Null   Null               Null
Check the below code:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Info': ['ID', 'State', 'Address', 'ID', 'State', 'Address', 'Name'],
                   'Values': [53312, 'Mass', 'Stackoverflowtown', 56120, 'Bos', 'Georgetown', 'James']})

# Number each repeated Info label, so each record gets its own group number
df['cumcount'] = df.groupby('Info').cumcount()
# Carry the record number forward for labels (like Name) that do not appear in every record
df['cumcount'] = np.where(df['cumcount'] < df['cumcount'].shift(), df['cumcount'].shift(), df['cumcount'])

pd.pivot_table(df, index='cumcount',
               columns='Info',
               values='Values',
               aggfunc='first').reset_index()[['ID', 'State', 'Address', 'Name']].\
    fillna('').\
    rename_axis(None, axis=1)
Output:
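With the sample frame above, the chain should produce something like:
      ID State            Address   Name
0  53312  Mass  Stackoverflowtown
1  56120   Bos         Georgetown  James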
This can only work if you are certain that the ID row comes before all of the other info for that ID.
If you do have this assurance, the following will work, where df is the original DataFrame in your post:
ids = df.query("Info == 'ID'")['Values']
ids.name = 'ID'
df = df.join(ids)
df['ID'] = df['ID'].ffill()  # propagate each ID down to the rows that belong to it
df.query("Info != 'ID'", inplace=True)
res = df.pivot(index='ID', columns='Info', values='Values')
Output of res:
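For the sample data (assuming consistent Info labels, without the stray colons), res should look something like:
Info             Address   Name State
ID
53312  Stackoverflowtown    NaN  Mass
56120         Georgetown  James   Bos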
I need to load multiple Excel files - each one has a starting date as its name, e.g. "20190114".
Then I need to append them in one DataFrame.
For this, I use the following code:
all_data = pd.DataFrame()
for f in glob.glob('C:\\path\\*.xlsx'):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
In fact, I do not need all the data, only rows filtered on multiple columns.
Then, I would like to create an additional column ('from') holding the file name (which is a date) for each respective file.
Example:
Data from the Excel file named '20190101'
Data from the Excel file named '20190115'
The final dataframe must keep only rows whose 'price' is not equal to 0 and whose 'code' is 'r' (I do not know if it's possible to filter the data while reading, to avoid loading a huge volume of data?), and then I need to add a column 'from' with the respective date coming from the file's name,
like this:
Dataframes for trial:
import pandas as pd

df1 = pd.DataFrame({'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5'],
                    'price': [0, 12.5, 17.5, 24.5, 7.5],
                    'code': ['r', 'r', 'r', 'c', 'r']})
df2 = pd.DataFrame({'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5'],
                    'price': [7.5, 24.5, 0, 149.5, 7.5],
                    'code': ['r', 'r', 'r', 'c', 'r']})
IIUC, you can filter the necessary rows first, then concat; for the file name you can use os.path.split() and access it with string slicing:
import glob
import os

import pandas as pd

l = []
for f in glob.glob('C:\\path\\*.xlsx'):
    df = pd.read_excel(f)
    df['from'] = os.path.split(f)[1][:-5]  # file name without the '.xlsx' extension
    l.append(df[df['code'].eq('r') & df['price'].ne(0)])

pd.concat(l, ignore_index=True)
id price code from
0 id_2 12.5 r 20190101
1 id_3 17.5 r 20190101
2 id_5 7.5 r 20190101
3 id_1 7.5 r 20190115
4 id_2 24.5 r 20190115
5 id_5 7.5 r 20190115
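As an aside, if the extension might ever vary, os.path.splitext is a slightly more robust way to strip it than slicing off the last five characters:
df['from'] = os.path.splitext(os.path.basename(f))[0]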
I have the following input file in csv:
INPUT
ID,GroupID,Person,Parent
ID_001,A001,John Doe,Yes
ID_002,A001,Mary Jane,No
ID_003,A001,James Smith;John Doe,Yes
ID_004,B003,Nathan Drake,Yes
ID_005,B003,Troy Baker,No
The desired output is the following:
DESIRED OUTPUT
ID,GroupID,Person
ID_001,A001,John Doe;Mary Jane;James Smith
ID_003,A001,John Doe;Mary Jane;James Smith
ID_004,B003,Nathan Drake;Troy Baker
Basically, I want to group by the same GroupID and then concatenate all the values present in Person column that belong to that group. Then, in my output, for each group I want to return the ID(s) of those rows where the Parent column is "Yes", the GroupID, and the concatenated person values for each group.
I am able to concatenate all person values for a particular group and remove any duplicate values from the person column in my output. Here is what I have so far:
import pandas as pd

inputcsv = 'input.csv'    # path to the input csv file
outputcsv = 'output.csv'  # path to the output csv file
colnames = ['ID', 'GroupID', 'Person', 'Parent']
df1 = pd.read_csv(inputcsv, names=colnames, header=None, skiprows=1)

# First I do a groupby on GroupID, concatenate the values in the Person column,
# and finally remove the duplicate person values from the output before saving the df to a csv.
df2 = df1.groupby('GroupID')['Person'].apply(';'.join).str.split(';').apply(set).apply(';'.join).reset_index()
df2.to_csv(outputcsv, sep=',', index=False)
This yields the following output:
GroupID,Person
A001,John Doe;Mary Jane;James Smith
B003,Nathan Drake;Troy Baker
I can't figure out how to include the ID column and include all rows in a group where the Parent is "Yes" (as shown in the desired output above).
IIUC
df.Person = df.Person.str.split(';')  # 1st split the string into a list
# then transform, which assigns the same joined result to every row of the group,
# see https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object
df['Person'] = df.groupby(['GroupID']).Person.transform(lambda x: ';'.join(set(sum(x, []))))
df = df.loc[df.Parent.eq('Yes')]  # then filter using Parent
df
Out[239]:
ID GroupID Person Parent
0 ID_001 A001 James Smith;John Doe;Mary Jane Yes
2 ID_003 A001 James Smith;John Doe;Mary Jane Yes
3 ID_004 B003 Troy Baker;Nathan Drake Yes
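One caveat: set() does not preserve the order of the names, so the joined string can come out in any order from run to run. If order matters, dict.fromkeys dedupes while keeping first-seen order; a minimal variant of the transform line:
df['Person'] = df.groupby(['GroupID']).Person.transform(lambda x: ';'.join(dict.fromkeys(sum(x, []))))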
I have downloaded the ASCAP database, giving me a CSV that is too large for Excel to handle. I'm able to chunk the CSV to open parts of it, but the problem is that the data isn't super helpful in its default format. Each song title has 3+ rows associated with it:
The first row includes the % share that ASCAP has in that song.
The rows after that include a character code (ROLE_TYPE) that indicates if that row contains the writer or performer of that song.
The first column of each row contains a song title.
This structure makes the data confusing because on the rows that list the % share there are blank cells in the NAME column because that row does not have a Writer/Performer associated with it.
What I would like to do is transform this data from having 3+ rows per song to having 1 row per song with all relevant data.
So instead of:
TITLE, ROLE_TYPE, NAME, SHARES, NOTE
I would like to change the data to:
TITLE, WRITER, PERFORMER, SHARES, NOTE
Here is a sample of the data:
TITLE,ROLE_TYPE,NAME,SHARES,NOTE
SCORE MORE,ASCAP,Total Current ASCAP Share,100,
SCORE MORE,W,SMITH ANTONIO RENARD,,
SCORE MORE,P,SMITH SHOW PUBLISHING,,
PEOPLE KNO,ASCAP,Total Current ASCAP Share,100,
PEOPLE KNO,W,SMITH ANTONIO RENARD,,
PEOPLE KNO,P,SMITH SHOW PUBLISHING,,
FEEDBACK,ASCAP,Total Current ASCAP Share,100,
FEEDBACK,W,SMITH ANTONIO RENARD,,
I would like the data to look like:
TITLE, WRITER, PERFORMER, SHARES, NOTE
SCORE MORE, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
PEOPLE KNO, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
FEEDBACK, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
I'm using python/pandas to try and work with the data. I am able to use groupby('TITLE') to group rows with matching titles.
import pandas as pd

data = pd.read_csv("COMMA_ASCAP_TEXT.txt", low_memory=False)
title_grouped = data.groupby('TITLE')
for TITLE, group in title_grouped:
    print(TITLE)
    print(group)
I was able to groupby('TITLE') of each song, and the output I get seems close to what I want:
SCORE MORE
TITLE ROLE_TYPE NAME SHARES NOTE
0 SCORE MORE ASCAP Total Current ASCAP Share 100.0 NaN
1 SCORE MORE W SMITH ANTONIO RENARD NaN NaN
2 SCORE MORE P SMITH SHOW PUBLISHING NaN NaN
What do I need to do to take this group and produce a single row in a CSV file with all the data related to each song?
I would recommend:
1. Decompose the data by the ROLE_TYPE
2. Prepare the data for merge (rename columns and drop unnecessary columns)
3. Merge everything back into one DataFrame
The merge is performed automatically over the column that has the same name in both DataFrames (TITLE in this case).
Seems to work nicely :)
import pandas as pd

data = pd.read_csv("data2.csv", sep=",")

# Create 3 individual DataFrames for the different roles
data_ascap = data[data["ROLE_TYPE"] == "ASCAP"].copy()
data_writer = data[data["ROLE_TYPE"] == "W"].copy()
data_performer = data[data["ROLE_TYPE"] == "P"].copy()

# Remove unnecessary columns for ASCAP role
data_ascap.drop(["ROLE_TYPE", "NAME"], axis=1, inplace=True)

# Rename columns and remove unnecessary columns for WRITER role
data_writer.rename(index=str, columns={"NAME": "WRITER"}, inplace=True)
data_writer.drop(["ROLE_TYPE", "SHARES", "NOTE"], axis=1, inplace=True)

# Rename columns and remove unnecessary columns for PERFORMER role
data_performer.rename(index=str, columns={"NAME": "PERFORMER"}, inplace=True)
data_performer.drop(["ROLE_TYPE", "SHARES", "NOTE"], axis=1, inplace=True)

# Merge all together
result = data_ascap.merge(data_writer, how="left")
result = result.merge(data_performer, how="left")

# Print result
print(result)
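An alternative sketch of the same idea, collapsing the role rows with a single pivot (this assumes at most one W and one P row per title, as in the sample):
names = data.pivot_table(index='TITLE', columns='ROLE_TYPE', values='NAME', aggfunc='first')
shares = data.groupby('TITLE')[['SHARES', 'NOTE']].first()  # SHARES/NOTE live on the ASCAP row
result = (names.drop(columns='ASCAP')
               .rename(columns={'W': 'WRITER', 'P': 'PERFORMER'})
               .join(shares)
               .reset_index())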
I have a csv file with 900000 rows and 30 columns. The header is in the first row:
"Probe Set ID","dbSNP RS ID","Chromosome","Physical Position", etc...
I want to extract only certain columns using pandas.
Now my problem is that the header repeats itself every 50 rows or so, so when I extract the columns I get only the first 50 rows. How can I get the complete columns while skipping all the headers but the first one?
This is the code I have so far, but it works only until the second header:
import pandas
data = pandas.read_csv('data1.csv', usecols=['dbSNP RS ID', 'Physical Position'])
import sys
sys.stdout = open("data2.csv", "w")
print(data)
This is an example representing some rows of the extracted columns:
dbSNP RS ID Physical Position
0 rs4147951 66943738
1 rs2022235 14326088
2 rs6425720 31709555
3 rs12997193 106584554
4 rs9933410 82323721
...
48 rs5771794 49157118
49 rs1061497 1415331
50 rs12647065 136012580
dbSNP RS ID Physical Position
...
dbSNP RS ID Physical Position
...
and so on...
Thanks very much in advance!
You could read the file with header=None, drop the duplicate rows (which keeps the first occurrence by default), and then set the remaining first row as the header, like so:
import pandas as pd

df = pd.read_csv(path, header=None).drop_duplicates()
df.columns = df.iloc[0]
df = df.iloc[1:]
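One caveat, assuming the file could contain legitimate duplicate data rows: drop_duplicates would remove those too. A safer variant (a sketch, using one of your actual column names) reads the first header normally and drops only the rows that repeat it:
df = pd.read_csv(path)
df = df[df['dbSNP RS ID'] != 'dbSNP RS ID']  # drop the repeated header rows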