I've read in some csv files using pandas. For now it's only two files, but in a few weeks I'll be working with several hundred csv files with the same data variables.
I used a for loop to read in the files and append the dataframes to a single list, and then used this loop to make the column names distinct:
for i, df in enumerate(separate_data, 1):
    df.columns = [col_name+'_df{}'.format(i) for col_name in df.columns]
My question is: how can I compare the variables between the files using a bar plot? For example, one of the common variables is temperature, so after differentiating the column names I now have temp_df1 and temp_df2. How would I go about selecting all temperature columns to compare them in a bar plot?
I tried using this, but could not get it to work:
for df in separate_data:
    temp_comp = separate_data.plot.bar(y='temp*')
Let's say you have the three dataframes below, each with a temp column. Here is how you iteratively combine the temp columns into a single new dataframe and plot them:
import matplotlib.pyplot as plt
import pandas as pd
df1 = pd.DataFrame({'temp':[100,150,200], 'pressure': [10,20,30]})
df2 = pd.DataFrame({'temp':[50,70,100], 'pressure': [10,25,40]})
df3 = pd.DataFrame({'temp':[110,80,120], 'pressure': [8,20,50]})
df_list = [df1,df2,df3]
df_combined = pd.DataFrame()
for i, df in enumerate(df_list):
    df_combined[f'df{i+1}'] = df['temp']
print('Combined Dataframe\n', df_combined)
df_combined.plot(kind = 'bar')
plt.ylabel('Temp')
plt.show()
#Combined Dataframe
   df1  df2  df3
0  100   50  110
1  150   70   80
2  200  100  120
Note that this assumes all your dataframes have the same length. If that is not the case, you can read only the first n rows (e.g. 50) from each file to ensure equal lengths:
df = pd.read_csv('sample.csv', nrows=50)
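If the columns have already been given a _df suffix as in the question (temp_df1, temp_df2, ...), one possible way to pick out every temperature column is DataFrame.filter on a frame concatenated side by side. This is only a minimal sketch: the two small frames are made-up stand-ins for the real CSV data, and the shorter one is padded with NaN by the concatenation rather than truncated.

```python
import pandas as pd

# Made-up stand-ins for the frames read from the CSV files
separate_data = [
    pd.DataFrame({'temp': [100, 150, 200], 'pressure': [10, 20, 30]}),
    pd.DataFrame({'temp': [50, 70], 'pressure': [10, 25]}),
]

# Same renaming loop as in the question
for i, df in enumerate(separate_data, 1):
    df.columns = [f'{col_name}_df{i}' for col_name in df.columns]

# Put all frames side by side, then keep only columns starting with "temp_"
wide = pd.concat(separate_data, axis=1)
temps = wide.filter(regex=r'^temp_')
print(temps)
```

Calling temps.plot.bar() on the result should then draw one group of bars per row, with one bar per file.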
If you can, I would read into one data frame with an identifier to make the aggregation easier. For example:
import pandas as pd
filenames = ["file_1.csv", "file_2.csv"]
df = pd.concat(
    [
        pd.read_csv(filename).assign(filename=filename.split(".")[0])
        for filename in filenames
    ]
)
df.groupby("filename")["column_to_plot"].mean().plot.bar()
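To try this idea without files on disk, here is a small sketch that reads two in-memory CSV texts instead. The file names, the temp and pressure columns and the values are all made up for illustration.

```python
import io
import pandas as pd

# Made-up in-memory stand-ins for file_1.csv and file_2.csv
files = {
    "file_1": io.StringIO("temp,pressure\n100,10\n150,20\n200,30\n"),
    "file_2": io.StringIO("temp,pressure\n50,10\n70,25\n"),
}

# One long frame, with a column recording which file each row came from
df = pd.concat(
    [pd.read_csv(buf).assign(filename=name) for name, buf in files.items()],
    ignore_index=True,
)

# One mean temperature per file, ready for .plot.bar()
mean_temp = df.groupby("filename")["temp"].mean()
print(mean_temp)
```

Because every row carries its file identifier, files of different lengths need no special handling here.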
I'm using a for loop to build an Excel file for graphing the data from a df. I use value_counts for this, but I would like to add a second table under the first one with the same data expressed as percentages. My code is this:
li = []
for i in range(0, len(df.columns)):
    value_counts = df.iloc[:, i].value_counts().to_frame().reset_index()
    value_percentage = df.iloc[:, i].value_counts(normalize=True).to_frame().reset_index()#.drop(columns='index')
    value_percentage = (value_percentage*100).astype(str)+'%'
    li.append(value_counts)
    li.append(value_percentage)
data = pd.concat(li, axis=1)
data.to_excel("resultdf.xlsx") #index cleaned
Basically I need it to look like this:
As long as the column names match between the two data frames, you should be able to concatenate them with pd.concat(). To stack them vertically, I think you should use axis=0 instead of axis=1 (see the pandas documentation for pd.concat).
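The difference between the two axis values can be seen on a tiny example. This is an illustrative sketch with made-up answers, not the asker's data:

```python
import pandas as pd

answers = pd.Series([1, 2, 2, 3, 3, 3])  # made-up survey answers
counts = answers.value_counts()
percent = answers.value_counts(normalize=True)

# axis=1 aligns on the index and puts the two results next to each other
side_by_side = pd.concat([counts, percent], axis=1)

# axis=0 appends the second result below the first one
stacked = pd.concat([counts, percent], axis=0)

print(side_by_side.shape, stacked.shape)
```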
Data
Let's prepare some dummy data to work with. Based on the provided screenshot, I'm assuming that the raw data are music genres graded on a scale of 1 to 5. So I'm going to use something like this as data:
import pandas as pd
from numpy.random import default_rng
rng = default_rng(0)
columns = ['Pop', 'Dance', 'Rock', 'Jazz']
data = rng.integers(1, 5, size=(100, len(columns)), endpoint=True)
df = pd.DataFrame(data, columns=columns)
Notes on the original code
There's no need to iterate by column index. We can iterate over the column names, as in for column in df.columns: df[column] ...
I think it's better to format the data with map('{:.0%}'.format) before converting them to a frame.
Instead of appending the counted and normalized values one by one, it's better to pd.concat them vertically into a single frame and append that to the list.
So the original code may be rewritten like this:
li = []
for col in df.columns:
    value_counts = df[col].value_counts()
    value_percentage = df[col].value_counts(normalize=True).map('{:.0%}'.format)
    li.append(pd.concat([value_counts, value_percentage]).to_frame().reset_index())
resultdf = pd.concat(li, axis=1)
resultdf.to_excel("resultdf.xlsx")
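As a quick illustration of the formatting step used above, here is what map('{:.0%}'.format) does to a few made-up proportions. Note that the result consists of strings, so Excel will treat those cells as text rather than numbers.

```python
import pandas as pd

shares = pd.Series([0.25, 0.5, 0.1])     # made-up proportions
as_text = shares.map('{:.0%}'.format)    # each float becomes a string like '25%'
print(as_text.tolist())
```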
Let Excel do formatting
What if we let Excel format the data as percentages on its own? I think the easiest way to do this is to use Styler. But before that, I suggest getting rid of the index columns. As far as I can see, all of them refer to the same grades 1, 2, 3, 4, 5, so we can use those as a common index, which makes the index meaningful. I'm also going to use a MultiIndex to separate the counted and normalized values, like this:
formula = ['counts', 'percent']
values = [1, 2, 3, 4, 5]
counted = pd.DataFrame(index=pd.MultiIndex.from_product([formula, values], names=['formula', 'values']))
counted is our data container and it's empty at the moment. Let's fill it in:
for col in df.columns:
    counts = df[col].value_counts()
    percent = counts / counts.sum()
    counted[col] = pd.concat([counts, percent], keys=formula)
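To see how the assignment lines up with the container, here is a reduced sketch with made-up grades in which the value 2 never occurs. The concatenated Series is aligned to the container's MultiIndex, so the missing grade simply ends up as NaN.

```python
import pandas as pd

formula = ['counts', 'percent']
values = [1, 2, 3]
index = pd.MultiIndex.from_product([formula, values], names=['formula', 'values'])
counted = pd.DataFrame(index=index)

grades = pd.Series([1, 1, 3, 3, 3])   # made-up data, grade 2 is absent
counts = grades.value_counts()
percent = counts / counts.sum()

# keys=formula adds an outer index level ('counts' / 'percent') to each part
counted['Pop'] = pd.concat([counts, percent], keys=formula)
print(counted)
```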
Having these data, let's apply some style to them and only then transform into an Excel file:
styled_data = (
    counted.style
    .set_properties(**{'number-format': '0'}, subset=pd.IndexSlice['counts', columns])
    .set_properties(**{'number-format': '0%'}, subset=pd.IndexSlice['percent', columns])
)
styled_data.to_excel('test.xlsx')
In the resulting Excel file, all of the values are stored as numbers, so we can use them in further calculations.
Full code
from pandas import DataFrame, MultiIndex, IndexSlice, concat
from numpy.random import default_rng
# Initial parameters
rng = default_rng(0)
data_length = 100
genres = ['Pop', 'Dance', 'Rock', 'Jazz']
values = [1, 2, 3, 4, 5]
formula = ['counts', 'percent']
file_name = 'test.xlsx'
# Prepare data
data = rng.integers(min(values), max(values), size=(data_length, len(genres)), endpoint=True)
df = DataFrame(data, columns=genres)
# Prepare a container for counted data
index = MultiIndex.from_product([formula, values], names=['formula', 'values'])
counted = DataFrame(index=index)
# Fill in counted data
for col in df.columns:
    counts = df[col].value_counts()
    percent = counts / counts.sum()
    counted[col] = concat([counts, percent], keys=formula)
# Apply number formatting and save the data in an Excel file
styled_data = (
    counted.style
    .set_properties(**{'number-format': '0'}, subset=IndexSlice['counts', :])
    .set_properties(**{'number-format': '0%'}, subset=IndexSlice['percent', :])
)
styled_data.to_excel(file_name)
P.S.
A note to avoid confusion: with the dummy data used here, the counts and percent parts show identical values. That's because of how the data were built. The initial data frame df has 100 rows in total, so each value count is numerically equal to its percentage.
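The coincidence described in the note can be checked directly. This is a small sketch on a single made-up column of 100 grades:

```python
import pandas as pd
from numpy.random import default_rng

rng = default_rng(0)
grades = pd.Series(rng.integers(1, 5, size=100, endpoint=True))  # 100 made-up grades

counts = grades.value_counts()
# Percentage points, rounded to guard against floating-point noise
percent_points = (counts / counts.sum() * 100).round().astype(int)

print((percent_points == counts).all())
```

With any other number of rows, the two parts of the table would differ.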
python 3.11.0
pandas 1.5.1
numpy 1.23.4
Update
If we want to keep the values for each column of the original data, but use Styler to set a number format for the second half of the output frame, then we have to rename the index columns somehow, because Styler requires unique column/index labels in the DataFrame passed to it. We can either rename them (e.g. "Values.Pop", etc.) or use a MultiIndex for the columns, which IMO looks better. Let's also take into account that the number of unique values may differ between columns. This means that we have to collect the data separately for the counts and percent values before combining them:
import pandas as pd
from numpy.random import default_rng
# Prepare dummy data with missing values in some columns
rng = default_rng(0)
columns = ['Pop', 'Dance', 'Rock', 'Jazz']
data = rng.integers(1, 5, size=(100, len(columns)), endpoint=True)
df = pd.DataFrame(data, columns=columns)
df['Pop'] = df['Pop'].replace([1, 5], 2)
df['Dance'] = df['Dance'].replace(3, 5)
# Collect counted values and their percentage
counts, percent = [], []
for col in df.columns:
    item = (
        df[col].value_counts()
        .rename('count')
        .rename_axis('value')
        .to_frame()
        .reset_index()
    )
    counts.append(item)
    percent.append(item.assign(count=item['count']/item['count'].sum()))
# Combine counts and percent in a single data frame
counts = pd.concat(counts, axis=1, keys=df.columns)
percent = pd.concat(percent, axis=1, keys=df.columns)
resultdf = pd.concat([counts, percent], ignore_index=True)
# Note: In order to use resultdf in styling we should produce
# unique index labels for the output data.
# For this purpose we can use ignore_index=True
# or assign some keys for each part, e.g. keys=['counted', 'percent']
# Format the second half of resultdf as Percent, ie. "0%" in Excel terminology
styled_result = (
    resultdf.style
    .set_properties(
        **{'number-format': '0%'},
        subset=pd.IndexSlice[len(resultdf)//2:, pd.IndexSlice[:,'count']])
    # if we used keys instead of ignore_index to produce resultdf
    # then len(resultdf)//2: should be replaced with 'percent'
    # i.e. the name of the percent part.
)
styled_result.to_excel('my_new_excel.xlsx')
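The output in this case keeps a value column next to each count column, with the count cells in the second half of the rows displayed as percentages.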
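The keys-based variant mentioned in the code comments above could look roughly like the sketch below. It leaves out the Styler step and uses two small made-up tables (pop and jazz, with hypothetical numbers) in the same value/count layout; the to_percent helper is introduced here only for illustration.

```python
import pandas as pd

# Made-up per-column tables in the same layout as `item` in the code above
pop = pd.DataFrame({'value': [2, 3, 4], 'count': [50, 30, 20]})
jazz = pd.DataFrame({'value': [1, 5], 'count': [60, 40]})

def to_percent(item):
    """Return a copy of the table with counts replaced by fractions."""
    return item.assign(count=item['count'] / item['count'].sum())

counts = pd.concat([pop, jazz], axis=1, keys=['Pop', 'Jazz'])
percent = pd.concat([to_percent(pop), to_percent(jazz)], axis=1, keys=['Pop', 'Jazz'])

# keys= labels each half of the rows, which also keeps the index unique
resultdf = pd.concat([counts, percent], keys=['counted', 'percent'])

# The percent half can now be addressed by name instead of by position
percent_part = resultdf.loc['percent']
print(percent_part)
```

With this layout, the row part of the Styler subset would be the label 'percent' rather than a positional slice.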
I have two datasets. Below you can see the code and data:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
import matplotlib.pyplot as plt
data = {'type_sale': ['group_1','group_2','group_3','group_4','group_5','group_6','group_7','group_8','group_9','group_10'],
        'id':[70,20,24,80,20,20,60,20,20,20],
        }
df1 = pd.DataFrame(data, columns = ['type_sale',
                                    'id',])
data = {'type_sale': ['group_1','group_2','group_3'],
        'id':[70,20,24],
        }
df2 = pd.DataFrame(data, columns = ['type_sale',
                                    'id',])
This code creates the two datasets shown below:
Now I want to create a new dataset df3 with the rows from df1 whose values in the id column do not appear in the id column of df2.
The final result should look like the picture below.
I tried this code, but it does not give the desired result:
df = pd.concat((df1, df2))
print(df.drop_duplicates('id'))
Can anybody help me solve this problem?
Try as follows:
Use Series.isin to check, for each value in df1['id'], whether it is contained in df2['id'].
Next, invert the resulting boolean pd.Series with the unary operator ~ (tilde) and use it to select from df1.
Finally, reset the index.
In a one-liner:
df3 = df1[~df1['id'].isin(df2['id'])].reset_index(drop=True)
print(df3)
  type_sale  id
0   group_4  80
1   group_7  60
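For comparison, the same rows can also be found with a different technique, a left merge with indicator=True (an anti-join). This is an alternative sketch rather than the method of the answer above; it rebuilds df1 and df2 from the question so that it runs on its own.

```python
import pandas as pd

df1 = pd.DataFrame({'type_sale': [f'group_{n}' for n in range(1, 11)],
                    'id': [70, 20, 24, 80, 20, 20, 60, 20, 20, 20]})
df2 = pd.DataFrame({'type_sale': ['group_1', 'group_2', 'group_3'],
                    'id': [70, 20, 24]})

# Left-merge on id only; the _merge column records where each row was found
merged = df1.merge(df2[['id']].drop_duplicates(), on='id', how='left', indicator=True)

# Keep the rows whose id exists in df1 but not in df2
df3 = (merged.loc[merged['_merge'] == 'left_only', ['type_sale', 'id']]
       .reset_index(drop=True))
print(df3)
```

The isin version is shorter for a single key column; the merge form may be easier to extend when the comparison involves several columns.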
I have a number of csv files with differing numbers of columns.
The majority of the csv files are 4 columns wide, and these get read and concatenated.
However, when the script encounters a file that exceeds 4 columns, it errors out.
I get the following error message: Error tokenizing data. C error: Expected 4 fields in line 125, saw 8.
If I refactor the code (below) to include error_bad_lines=False in pd.read_csv, the code completes and outputs a combined csv that includes only the lines that contain 4 columns.
How can I solve this error and concatenate everything? There are no indexes, so I'd just have to stack the csv contents on top of one another.
Thank you so much
import os
import glob
import pandas as pd
all_filenames = [
    # think this is working correctly with bunch of replies.csv extensions
    i for i in glob.glob('C:\\Users\\tkim1\\Python Scripts\\output\\*\\replies.csv')
]
print(all_filenames)
# combine all files in the list
combined_csv = pd.concat([
    pd.read_csv(f, error_bad_lines=False) for f in all_filenames
], sort=False)
# export to csv
combined_csv.to_csv("combined_replies.csv", index=False, encoding='utf-8-sig')
The error message itself is raised by pandas.read_csv: the parser takes the expected number of fields from the start of a file and fails when a later line has more. pandas.concat, by contrast, accepts DataFrame objects with differing numbers of columns and fills the gaps with NaN.
If you want the column sets to match explicitly before concatenating, find the DataFrames that have fewer columns than the widest one, set the missing columns in each of them to NaN, and then apply pd.concat.
# for example, if df1 has 2 columns and df2 has 3 columns, add the third column
# to df1 as NaN, then apply concat.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'0': np.arange(1, 100),
                    '1': np.arange(100, 1, -1)})
df2 = pd.DataFrame({'0': np.arange(100, 200),
                    '1': np.arange(200, 100, -1),
                    '2': np.arange(400, 500)})
df1['2'] = np.nan
df3 = pd.concat([df1, df2])
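Since the "Expected 4 fields ..., saw 8" message points at lines inside a single file being wider than the first ones, one possible approach is to tell read_csv up front how many columns to expect via names. The sketch below makes several assumptions: the files are treated as headerless, the helper read_ragged_csv is a made-up name, and the CSV contents are invented in-memory examples.

```python
import csv
import io
import pandas as pd

def read_ragged_csv(buffer):
    """Read a headerless CSV whose rows have different numbers of fields."""
    # First pass: find the widest row so that every line fits
    width = max(len(row) for row in csv.reader(buffer))
    buffer.seek(0)
    # Second pass: shorter rows are padded with NaN on the right
    return pd.read_csv(buffer, header=None, names=list(range(width)))

# Made-up contents: one file with a wide row, one file with 4 fields only
wide = read_ragged_csv(io.StringIO("a,b,c,d\n1,2,3,4\n5,6,7,8,9,10,11,12\n"))
narrow = read_ragged_csv(io.StringIO("x,y,z,w\n"))

# concat aligns on column labels and fills the missing ones with NaN
combined = pd.concat([wide, narrow], ignore_index=True, sort=False)
print(combined.shape)
```

For files on disk, the same two-pass idea would apply with the file opened twice (or rewound) instead of a StringIO buffer.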
I have data files that are converted to pandas dataframes. Some of them share column names, while others share a time series index, and I want to combine all of them into one dataframe, matching on both columns and index. Since the file names follow no sequence, the dataframes arrive for concatenation in random order. If two dataframes with different columns are concatenated along axis=1, it works well, but if the result is then combined with a new df that has a column name from one of the previously merged dataframes, the concat fails. For example, with these data files:
import pandas as pd
df1 = pd.read_csv('0.csv', index_col=0, parse_dates=True, infer_datetime_format=True)
df2 = pd.read_csv('1.csv', index_col=0, parse_dates=True, infer_datetime_format=True)
df3 = pd.read_csv('2.csv', index_col=0, parse_dates=True, infer_datetime_format=True)
data1 = pd.DataFrame()
file_list = [df1, df2, df3] # fails
# file_list = [df2, df3,df1] # works
for fn in file_list:
    if data1.empty==True or fn.columns[1] in data1.columns:
        data1 = pd.concat([data1,fn])
    else:
        data1 = pd.concat([data1,fn], axis=1)
I get ValueError: Plan shapes are not aligned when I try to do that. In my case there is no way to first load all the DataFrames and check their column names. If that were possible, I could first combine all dfs that share column names and only then concat the resulting dataframes with different column names along axis=1, which I know always works. However, a solution that requires preloading all the DataFrames and rearranging the order of concatenation is not possible in my case (it was only done for the working example above). I need the flexibility to concatenate the information with the larger dataframe data1 in whatever order it arrives. Please let me know if you can suggest a suitable approach.
If you go through the loop step by step, you will find that the first iteration takes the if branch, so data1 is equal to df1. The second iteration takes the else branch, since data1 is not empty and 'Temperature product barrel ValueY' is not in data1.columns.
After the else branch, data1 has some duplicated column names. In every row, one of the two columns that share a name is NaN and the other one is a float. This is the reason why pd.concat() fails.
You can aggregate the duplicate columns before you try to concatenate, to get rid of them:
import numpy as np
for fn in file_list:
    if data1.empty==True or fn.columns[1] in data1.columns:
        # new:
        data1 = data1.groupby(data1.columns, axis=1).agg(np.nansum)
        data1 = pd.concat([data1,fn])
    else:
        data1 = pd.concat([data1,fn], axis=1)
After that, you would get
data1.shape
(30, 23)
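A different technique that may also fit the requirement of combining frames in whatever order they arrive is DataFrame.combine_first, which aligns on both index and columns and only fills cells that are still missing. This is a sketch of that alternative, not the groupby approach above, and it rests on assumptions: the three frames are made up, and where two frames hold a value for the same cell, the one already in the result wins.

```python
from functools import reduce
import pandas as pd

# Made-up frames: a and c share a column, a and b share an index
a = pd.DataFrame({'temp': [1.0, 2.0]},
                 index=pd.to_datetime(['2020-01-01', '2020-01-02']))
b = pd.DataFrame({'pressure': [10.0, 20.0]},
                 index=pd.to_datetime(['2020-01-01', '2020-01-02']))
c = pd.DataFrame({'temp': [3.0, 4.0]},
                 index=pd.to_datetime(['2020-01-03', '2020-01-04']))

def combine(frames):
    """Fold the frames together, aligning on both index and columns."""
    return reduce(lambda left, right: left.combine_first(right), frames)

# Two different arrival orders give the same combined table
first_order = combine([a, b, c]).sort_index()
second_order = combine([b, c, a]).sort_index()
print(first_order)
```

In a streaming setting the same call could be applied one frame at a time, as in data1 = data1.combine_first(fn), starting from the first frame that arrives.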
I have a dataframe with two columns that I imported using pandas.read_csv. I manipulated one column, and now I would like to save all three columns as a .csv file. I have been able to save one column at a time, but am unable to save all three together (df.Time, df.Distance, and df.Velocity). Here is what I'm working with:
import pandas as pd
df=pd.read_csv('/Users/path/file.csv', delimiter=',', usecols=['A', 'B'])
df.columns = ['Time', 'Range']
df.Time = df['Time'].round(14)
df.Range = df['Range'].round(14)
df.Velocity = (df.Range.shift(1) - df.Range) / (df.Time.shift(1) -df.Time)
df2 = [df.Time, df.Range, df.Velocity]
df2.to_csv('test5.csv', columns = header)
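Your assignment makes df2 a list and not a dataframe (df2 = [df.Time, df.Range, df.Velocity]), and a list has no to_csv method. Note also that df.Velocity = ... only sets an attribute on the object; to create a new column, assign it with df['Velocity'] = ... instead.
You probably want:
df[['Time', 'Range', 'Velocity']].to_csv('test5.csv')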
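Put together, a minimal end-to-end sketch could look like this. The three rows of Time and Range values are made up, and to_csv is called without a path here so that it returns the CSV text instead of writing a file.

```python
import pandas as pd

# Made-up stand-in for the two columns read from the CSV file
df = pd.DataFrame({'Time': [0.0, 1.0, 2.0], 'Range': [0.0, 5.0, 15.0]})

# Bracket assignment creates a real column; attribute assignment would not
df['Velocity'] = (df['Range'].shift(1) - df['Range']) / (df['Time'].shift(1) - df['Time'])

# Without a path, to_csv returns the CSV content as a string
csv_text = df[['Time', 'Range', 'Velocity']].to_csv(index=False)
print(csv_text)
```

Passing a file name such as 'test5.csv' as the first argument writes the same content to disk.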
You can do it like this:
import pandas as pd
data=pd.read_csv('filename.csv')
data[['column1','column2','column3',...]].to_csv('fileNameWhereYouwantToWrite.csv')