This question already has answers here:
How do I expand the output display to see more columns of a Pandas DataFrame?
(22 answers)
Closed 3 years ago.
I have two dataframes df and df2 with contents as follows
dataframe df
dataframe df2
I'd like to add to df the two columns from df2, "NUMSESSIONS_ANDROID" and "AVGSESSDUR_ANDROID".
I do this as follows:
df['NUMSESSIONS_ANDROID'] = df2['NUMSESSIONS_ANDROID']
df['AVGSESSDUR_ANDROID'] = df2['AVGSESSDUR_ANDROID']
However, when I print the resulting df I see ... in place of AVGSESSDUR_IOS (i.e. it appears to have swallowed that column).
I'd appreciate any help resolving this.
As ALollz stated, seeing ... in the output means there is "hidden" data that is part of the dataframe but not shown in your console or IDE. However, you can easily check all the columns your dataframe contains with:
print(list(df))
This will show you the names of all the columns in your df, so you can check whether the ones you want are there or not.
Furthermore, you can print a specific column as a Series (first line) or a DataFrame (second):
print(df['column_name'])
print(df[['column_name']])
If successful you will see the Series/DataFrame; if the column doesn't actually exist in your original dataframe, you will get a KeyError.
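For example, a quick membership check on df.columns avoids relying on the exception at all (the frame below is just a made-up stand-in for your data):

```python
import pandas as pd

# toy frame standing in for the question's df (values are made up)
df = pd.DataFrame({'NUMSESSIONS_IOS': [3, 5], 'AVGSESSDUR_IOS': [120, 95]})

# membership test instead of waiting for a KeyError
has_ios = 'AVGSESSDUR_IOS' in df.columns
has_android = 'AVGSESSDUR_ANDROID' in df.columns
```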
Leveraging @ALollz's hint above ...
"The ... indicates that only part of the DataFrame is being shown in your terminal/output, so 'AVGSESSDUR_IOS' is almost certainly still there; it's just not shown. You can look at print(df.iloc[:, 0:3]) to see the first 3 columns for instance."
I added the following two lines to increase the number of columns and the width of the console display, and it worked:
pd.set_option('display.max_columns',20)
pd.set_option('display.width', 1000)
print(df.iloc[:,0:5])
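If you only need the wider display for a single print, pandas also offers option_context, which applies the settings temporarily and reverts them afterwards (the values below just mirror the ones above; the frame is a throwaway example):

```python
import pandas as pd

# a wide throwaway frame to demonstrate with
df = pd.DataFrame({'col%d' % i: range(3) for i in range(15)})

# the options apply only inside the with-block, then revert
with pd.option_context('display.max_columns', 20, 'display.width', 1000):
    print(df)
```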
I am working on automating a process with python using pandas. Previously I would use Excel PowerQuery to combine files and manipulate data but PowerQuery is not as versatile as I need so I am now using pandas. I have the process working up to a point where I can loop through files, select the columns that I need in the correct order, dependent on each workbook, and insert that into a dataframe. Once each dataframe is created, I then concatenate them into a single dataframe and write to csv. Before writing, I need to apply some validation to certain columns.
For example, I have a Stock Number column that will always need to be exactly 11 characters long. Sometimes, depending on the workbook, the data will be missing the leading zeros or will have more than 11 characters (those extra characters should be removed). I know that what I need to do is something along the lines of:
STOCK_NUM.zfill(13)[:13]
but I'm not sure how to modify the existing dataframe values. Do I actually need to loop through the dataframe, or is there a way to apply formatting to an entire column?
e.g.
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018']]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
for x in df["STOCK_NUM"]:
    print(x.zfill(13)[:13])
I would like to know the best way to apply that format to the existing values, and only where values are present (i.e. not touching null values).
Also, I need to ensure that the date columns hold true date values. Sometimes the dates are formatted as MM-DD-YYYY and sometimes MM/DD/YY, etc., and any of those are fine, but what is not fine is if the actual value in the date column is an Excel serial number that Excel can format as a date. Is there some way to apply validation logic to an entire dataframe column to ensure that there is a valid date instead of a serial number?
I honestly have no idea how to approach this date issue.
Any and all advice, insight would be greatly appreciated!
Not an expert, but from things I could gather here and there, you could try:
df['STOCK_NUM']=df['STOCK_NUM'].str.zfill(13)
followed by:
df['STOCK_NUM'] = df['STOCK_NUM'].str.slice(0,13)
For the first part.
For dates you can do a try-except on:
df['Date'] = pd.to_datetime(df['Date'])
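A sketch of that try-except, combined with the errors='coerce' option, which turns unparseable entries into NaT instead of raising (the sample values here are made up):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['01-15-2018', '02-20-2018', 'not a date']})

try:
    df['Date'] = pd.to_datetime(df['Date'])
except (ValueError, TypeError):
    # fall back: bad entries become NaT rather than aborting the run
    df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
```

Afterwards you can inspect df['Date'].isna() to see which rows failed validation.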
For your STOCK_NUM question, you could apply a function to the column, but the way I approach this is with list comprehensions. The first thing I would do is replace all the NAs in your STOCK_NUM column with a unique string, and then apply the list comprehension, as in the code below:
import pandas as pd
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018'], [None,42139]]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
# replace NAs with a placeholder string
df['STOCK_NUM'] = df['STOCK_NUM'].fillna('IS_NA')
# use a list comprehension to reformat the STOCK_NUM column
df['STOCK_NUM'] = [None if i == 'IS_NA' else i.zfill(13)[:13] for i in df.STOCK_NUM]
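Alternatively, here is a vectorized sketch that skips the placeholder step by masking out the nulls with .notna() (same toy data as above):

```python
import pandas as pd

dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'],
           ['5134687155546628', '01/15/2018'], [None, 42139]]
df = pd.DataFrame(dataset, columns=['STOCK_NUM', 'Date'])

# only touch rows where STOCK_NUM is present; nulls stay null
mask = df['STOCK_NUM'].notna()
df.loc[mask, 'STOCK_NUM'] = df.loc[mask, 'STOCK_NUM'].str.zfill(13).str[:13]
```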
Then, for your question about converting Excel serial numbers to dates, I adapted an already-answered question. I am assuming that the serial number in your dataframe is an integer type:
import datetime
def xldate_to_datetime(xldate):
    temp = datetime.datetime(1900, 1, 1)
    delta = datetime.timedelta(days=xldate) - datetime.timedelta(days=2)
    return pd.to_datetime(temp + delta)

df['Date'] = [xldate_to_datetime(i) if isinstance(i, int) else pd.to_datetime(i) for i in df.Date]
Hopefully this works for you! Accept this answer if it does, otherwise reply with whatever remains an issue.
This question already has an answer here:
Pandas: parse merged header columns from Excel
(1 answer)
Closed 4 years ago.
I have the following data inputs that I am reading via pandas.
I want to take the cell 'Month Ending .....' and drop it into a newly formed 'Date' column, and append the two input files together into one dataframe.
This is what I have tried so far...
import pandas as pd
import glob
import os
### List Source Files That I need to Import###
path = os.getcwd()
files = os.listdir(path)
### Loading Files by Variable ###
df = pd.DataFrame()
for f in glob.glob('../Sales_Master_Data/Sales_Data/* customer *.xls'): # searches for customer .xls files in the folder
    data = pd.read_excel(f, 'sheet1', skiprows=0).fillna(method='ffill') # reads each file into a frame
    date = data.columns[4] # this is where the date value is located
    data['Date'] = date # assigns the date value to a new ['Date'] column
    df = df.append(data) # all files are appended together
df.to_csv('Output.csv')
Unfortunately it produces the output below. All columns beginning with 'Month' need to be merged into one column called ['Sales Qty'], and I'm also having trouble tidying up the column headers so that they are uniform.
Ideal output would look like this.....
It is never a good idea to feed merged cells into pandas, so the first thing I would suggest is to flatten your inputs. If there is no easy way to do that then, to answer your original question, you need to create a multiindex dataframe to handle your data. This has already been covered on StackOverflow here: https://stackoverflow.com/a/27424102/9754169
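For illustration, here is a minimal sketch of that multi-row-header idea, using read_csv on fabricated text in place of your read_excel call (the column names are invented); passing header=[0, 1] yields MultiIndex columns, which you can then flatten into uniform single-level names:

```python
import pandas as pd
from io import StringIO

# stand-in for an Excel sheet whose merged top row spans two sub-columns
raw = StringIO(
    "Customer,Month Ending Jan,Month Ending Jan\n"
    "Name,Sales Qty,Sales Value\n"
    "Acme,10,100\n"
    "Beta,20,200\n"
)
df = pd.read_csv(raw, header=[0, 1])  # two header rows -> MultiIndex columns

# flatten the two levels into single, uniform column names
df.columns = [' - '.join(col) for col in df.columns]
```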
I have a csv file with 367 columns. The first column has 15 unique values, and each subsequent column has some subset of those 15 values. No unique value is ever found more than once in a column. Each column is sorted. How do I get the rows to line up? My end goal is to make a presence/absence heat map, but I need to get the data matrix in the right format first, which I am struggling with.
Here is a small example of the type of data I have:
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,
I need the rows to match the reference but stay in the same column like so:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
My thought was to use the pandas library, but I could not figure out how to approach this problem, as I am very new to using python. I am using python2.7.
So your problem is definitely solvable via pandas:
Code:
# Create the sample data into a data frame
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])
# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
Results:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
How does it work?:
Pandas can be a little daunting when first approached, even for someone who knows python well, so I will try to walk through this. I encourage you to do what you need to get over the learning curve, because pandas is ridiculously powerful for this sort of data manipulation.
Get the data into a frame:
This first bit of code does nothing but get your sample data into a pandas.DataFrame. Your data format was not specified, so I will assume that you can get it into a frame; if you cannot, ask another question here on SO about getting the data into a frame.
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
Build a result frame:
Start with a result frame that is just the index
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
For each column in the source data, see if the value is in the index
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])
That's it, with three lines of code, we have calculated our results.
Output the results:
Now iterate through each row, which contains booleans, and output the values in the desired format (as ints)
# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
This outputs the index value first, and then for each True value outputs the index again, and for False values outputs an empty string.
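For what it's worth, that row loop can also be replaced with a vectorized version built on numpy.where; this is just a sketch that repeats the frame construction from above so it runs on its own:

```python
import numpy as np
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
df = df.set_index([0])

results = pd.DataFrame(index=df.index)
for col in df.columns:
    results[col] = df.index.isin(df[col])

# numpy.where picks the stringified index where True, '' where False
labels = results.index.astype(str)
out = results.apply(lambda col: np.where(col, labels, ''))
out.insert(0, 'ref', labels)  # prepend the reference column
print(out.to_csv(header=False, index=False))
```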
Postscript:
There are quite a few people here on SO who are way better at pandas than I am, but since you did not tag your question with the pandas keyword, they likely did not notice it. That allows me to take my cut at answering before they do. The pandas tag is very well covered for well-formed questions, so I am pretty sure that if this answer is not optimal, someone else will come by and improve it. In the future, be sure to tag your question with pandas to get the best response.
Also, you mentioned that you are new to python, so I will just put in a plug for using a good IDE. I use PyCharm, and it and other good IDEs can make working in python even more powerful, so I highly recommend them.
I'm trying to get my Python program to verify an excel spreadsheet that looks like this:
The first column is the order number and there may be one or more rows with the same number. Then there's the last column which indicates the row status (OK or not).
I want to check if all rows for a given order number have been
marked as OK.
I have found something called pandas; could anyone give me an idea of how to handle this? There's also an option called groupby. Could I use this to group by order numbers and then verify that all rows for each order number have been marked as OK?
You are on the right track. Just import the data, pivot it using pandas, and check whether the number of 'Empty' counts is > 0. I used dummy data since I couldn't take it from your image:
import pandas as pd
df = pd.DataFrame()
df['no'] = [1,1,1,2,1,2,1,3]
df['ok'] = ['OK','Empty','OK','Empty','Empty','OK','OK','OK']
df['cnt'] = 1
a = df.pivot_table(index=['no'],columns=['ok'],values='cnt', aggfunc='count')
a.reset_index(inplace=True)
a.fillna(0, inplace=True)
print(a)
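Since you mentioned groupby: here is an equivalent sketch with the same dummy data, returning one boolean per order number (True only when every row for that order is marked OK):

```python
import pandas as pd

df = pd.DataFrame({
    'no': [1, 1, 1, 2, 1, 2, 1, 3],
    'ok': ['OK', 'Empty', 'OK', 'Empty', 'Empty', 'OK', 'OK', 'OK'],
})

# True only when every row of the order is marked OK
all_ok = df.groupby('no')['ok'].apply(lambda s: (s == 'OK').all())
print(all_ok)
```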