I am outputting items from a dataframe to a csv. The rows, however, are too long. I need the csv to have line breaks (\n) every X items (columns) so that individual rows in the output aren't too long. Is there a way to do this?
A,B,C,D,E,F,G,H,I,J,K
Becomes in the file (X=3) -
A,B,C
D,E,F
G,H,I
J,K
EDIT:
I have a 95% solution (assuming you have only 1 column):
size = 50
indexes = np.arange(0, len(data), size)  # have to use numpy since range is now an immutable type in python 3
indexes = np.append(indexes, [len(data)])  # add the uneven final index
i = 0
while i < len(indexes) - 1:
    holder = pd.DataFrame(data.iloc[indexes[i]:indexes[i + 1]]).T
    holder.to_csv(filename, index=False, header=False)
    i += 1
The only weirdness is that, despite not throwing any errors, the final iteration of the while loop (the one with the uneven final index) does not write to the file, even though the information is in holder perfectly. Since no errors are thrown, I cannot figure out why the final information is not being written.
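One thing worth checking here (my note, not part of the original post): when to_csv is given a plain filename it opens the file in write mode, so every iteration replaces what the previous one wrote, which can easily look like chunks going missing. A minimal sketch of the same loop that appends instead, reusing data, filename and size from the snippet above:
import numpy as np
import pandas as pd

size = 50
indexes = np.arange(0, len(data), size)
indexes = np.append(indexes, [len(data)])  # add the uneven final index

open(filename, 'w').close()  # start from an empty file
for i in range(len(indexes) - 1):
    holder = pd.DataFrame(data.iloc[indexes[i]:indexes[i + 1]]).T
    holder.to_csv(filename, mode='a', index=False, header=False)  # append each chunk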
Assuming you have a number of values that is a multiple of 3 (note how I added L):
s = pd.Series(["A","B","C","D","E","F","G","H","I","J","K","L"])
df = pd.DataFrame(s.values.reshape((-1, 3)))
You can then write df to CSV.
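If the number of values is not a multiple of 3, one option is to pad the series before reshaping. A rough sketch under that assumption (note the padding produces empty trailing cells, so the last line comes out as J,K, rather than J,K):
import numpy as np
import pandas as pd

values = list("ABCDEFGHIJK")          # 11 values, not a multiple of 3
X = 3
pad = (-len(values)) % X              # cells missing from the last row
padded = values + [np.nan] * pad      # pad with NaN so the reshape is rectangular
df = pd.DataFrame(np.array(padded, dtype=object).reshape(-1, X))
df.to_csv("out.csv", index=False, header=False)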
Related
I have the below data which I store in a csv (df_sample.csv). I have the column names in a list called cols_list.
df_data_sample:
df_data_sample = pd.DataFrame({
    'new_video': ['BASE','SHIVER','PREFER','BASE+','BASE+','EVAL','EVAL','PREFER','ECON','EVAL'],
    'ord_m1': [0,1,1,0,0,0,1,0,1,0],
    'rev_m1': [0,0,25.26,0,0,9.91,'NA',0,0,0],
    'equip_m1': [0,0,0,'NA',24.9,20,76.71,57.21,0,12.86],
    'oev_m1': [3.75,8.81,9.95,9.8,0,0,'NA',10,56.79,30],
    'irev_m1': ['NA',19.95,0,0,4.95,0,0,29.95,'NA',13.95]
})
attribute_dict = {
    'new_video': 'CAT',
    'ord_m1': 'NUM',
    'rev_m1': 'NUM',
    'equip_m1': 'NUM',
    'oev_m1': 'NUM',
    'irev_m1': 'NUM'
}
Then I read each column and do some data processing as below:
cols_list = df_data_sample.columns
# Write to csv.
df_data_sample.to_csv("df_seg_sample.csv", index=False)
#df_data_sample = pd.read_csv("df_seg_sample.csv")
# Create empty dataframe to hold final processed data for each income level.
df_final = pd.DataFrame()
# Read in each column, process, and write to a csv - using csv module
for column in cols_list:
    df_column = pd.read_csv('df_seg_sample.csv', usecols=[column], delimiter=',')
    if ((attribute_dict[column] == 'CAT') & (df_column[column].unique().size <= 100)):
        df_target_attribute = pd.get_dummies(df_column[column], dummy_na=True, prefix=column)
        # Check and remove duplicate columns if any:
        df_target_attribute = df_target_attribute.loc[:, ~df_target_attribute.columns.duplicated()]
        for target_column in list(df_target_attribute.columns):
            # If variance of the dummy created is zero: append it to a list and print to log file.
            if (np.var(df_target_attribute[[target_column]])[0] != 0):
                df_final[target_column] = df_target_attribute[[target_column]]
    elif (attribute_dict[column] == 'NUM'):
        # Let's impute with 0 for numeric variables:
        df_target_attribute = df_column
        df_target_attribute.fillna(value=0, inplace=True)
        df_final[column] = df_target_attribute
attribute_dict is a dictionary containing the mapping of variable name to variable type:
{
    'new_video': 'CAT',
    'ord_m1': 'NUM',
    'rev_m1': 'NUM',
    'equip_m1': 'NUM',
    'oev_m1': 'NUM',
    'irev_m1': 'NUM'
}
However, this column-by-column operation takes a long time to run on a dataset of size (5 million rows x 3400 columns). Currently the run time is approximately 12+ hours.
I want to reduce this as much as possible, and one of the ways I can think of is to do the processing for all NUM columns at once and then go column by column for the CAT variables.
However, I am not sure how to write this in Python, nor whether it will really speed up the process.
Can someone kindly help me out!
For numeric columns it is simple:
num_cols = [k for k, v in attribute_dict.items() if v == 'NUM']
print (num_cols)
['ord_m1', 'rev_m1', 'equip_m1', 'oev_m1', 'irev_m1']
df1 = pd.read_csv('df_seg_sample.csv', usecols=num_cols).fillna(0)
But the first part of the code is the performance problem, especially get_dummies called for 5 million rows:
df_target_attribute = pd.get_dummies(df_column[column], dummy_na=True, prefix=column)
Unfortunately, it is a problem to process get_dummies in chunks.
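One common workaround (a sketch of my own, not something the answer above proposes) is to collect the full set of categories first and then convert each chunk to a Categorical with those fixed categories, so every chunk produces the same dummy columns and the chunks can be concatenated safely. Column and file names are taken from the question; the chunk size is a placeholder:
import pandas as pd

col = 'new_video'
# First pass: read just this column to collect all of its categories.
all_cats = pd.read_csv('df_seg_sample.csv', usecols=[col])[col].dropna().unique()

dummy_chunks = []
for chunk in pd.read_csv('df_seg_sample.csv', usecols=[col], chunksize=500000):
    cats = pd.Categorical(chunk[col], categories=all_cats)
    dummy_chunks.append(pd.get_dummies(cats, dummy_na=True, prefix=col))

dummies = pd.concat(dummy_chunks, ignore_index=True)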
There are three things I would advise to speed up your computations:
1. Take a look at the pandas HDF5 capabilities. HDF is a binary file format for fast reading and writing of data to disk.
2. Read in bigger chunks (several columns) of your csv file at once (depending on how big your memory is).
3. There are many pandas operations you can apply to every column at once. For example nunique() (giving you the number of unique values, so you don't need unique().size). With these column-wise operations you can easily filter columns by selecting with a binary vector. E.g.
df = df.loc[:, df.nunique() > 100]
# filter out every column with fewer than 100 unique values
Also this answer from the author of pandas on large data workflow might be interesting for you.
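A rough sketch of the chunked-read and HDF5 suggestions above, reusing attribute_dict and the csv name from the question (the chunk size and HDF5 path are placeholders, and to_hdf needs the PyTables package installed):
import pandas as pd

num_cols = [k for k, v in attribute_dict.items() if v == 'NUM']

# Read only the numeric columns, in manageable chunks, imputing as we go.
chunks = pd.read_csv('df_seg_sample.csv', usecols=num_cols, chunksize=1000000)
df_num = pd.concat((chunk.fillna(0) for chunk in chunks), ignore_index=True)

# Cache the result in HDF5 so later runs don't have to re-parse the csv.
df_num.to_hdf('df_seg_sample.h5', key='num_data', mode='w')
df_num = pd.read_hdf('df_seg_sample.h5', 'num_data')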
I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('Group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name+"_Group", with name looping over [my_df1, my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max=16
Bad=[]
for Sample in SampleList:
    for n in Sample[1]['Group']:
        df = Sample[0].loc[Sample[0]['Group'] == n]  # This is inelegant, but trying to work
                                                     # with Sample[1] in the for doesn't work
        if (df['Value'].max() > my_max):
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
Which runs without errors, but doesn't work. In particular, it doesn't add the column Bad_Row to my df, nor modify my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion of the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it will subset your initial data and use boolean logic to see if any of the 'Values' in your input data meet your specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x,grouped_df1), axis=1)
Returns:
Group Group_Value Bad_Row
0 1 57 Good Row
1 2 63 Bad Row
Based on dubbbdan's idea, here is code that works:
my_max=16
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)
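For reference, a more vectorized sketch of the same filtering (my addition, not part of either answer): compute the per-group maximum once and keep only the groups whose maximum stays at or under the threshold.
my_max = 16
for Sample in SampleList:
    group_max = Sample[0].groupby('Group')['Value'].max()   # max Value per group
    keep = group_max[group_max <= my_max].index              # groups to keep
    Sample[1] = Sample[1][Sample[1]['Group'].isin(keep)]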
I'm trying to write a program that opens a text file containing only numbers in rows and columns and saves a selection of them to a new file. The part where I select columns works, while the part for the rows doesn't. I must select the lines with the condition x > 10e13 (where x is the value in a specific column).
I have some problems, especially in rows selection.
Since they are very large files, I have been advised to use numpy, so I would like to run the code this way.
This is the code I have written:
import numpy as np
matrix = np.loadtxt('file.dat')
#select columns
column_indicies = [0]
selected_columns = matrix[:,column_indicies]
x = 1E14  # select lines
for line in matrix:
    if float(line) > x:
        # any ideas?
selected_matrix = matrix[selected_lines,selected_columns]
np.savetxt('new_file.dat', selected_matrix, fmt='%1.4f')
This is a small sample of my input data:
185100000000000.0000
121300000000000.0000
257800000000000.0000
43980000000000.0000
There are a lot of ways to do this. Here is something you might want to try, along the same direction you have already taken:
matrix = <your matrix>
column_index = <the column index you want to compare>
x = 1E14

selected_rows = []                        # collect the rows that meet the condition
for line in range(matrix.shape[0]):
    if matrix[line][column_index] > x:
        selected_rows.append(matrix[line])

new_matrix = np.array(selected_rows)      # stack the kept rows into a new array
Hope that makes sense. We are keeping only those rows in the new matrix that meet your condition. I didn't run the code, so there might be some minor bugs. But I am hoping you got the idea of how you can accomplish this task.
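For completeness, a vectorized alternative (my addition, not part of the answer above), assuming matrix is 2-D: numpy boolean indexing selects the qualifying rows without an explicit Python loop.
mask = matrix[:, column_index] > x   # True for every row whose value in that column exceeds x
new_matrix = matrix[mask]            # keep only those rows
np.savetxt('new_file.dat', new_matrix, fmt='%1.4f')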
EDIT : here are the first lines :
df = pd.read_csv(os.path.join(path, file), dtype = str,delimiter = ';',error_bad_lines=False, nrows=50)
df["CALDAY"] = df["CALDAY"].apply(lambda x:dt.datetime.strptime(x,'%d/%m/%Y'))
df = df.fillna(0)
I have a csv file that has 1500 columns and 35000 rows. It contains values, but in the form 1.700,35 for example, whereas in Python I need 1700.35. When I read the csv, all values come in as str.
To solve this I wrote this function:
def format_nombre(df):
    length, width = df.shape                 # number of rows and columns
    for i in range(length):
        for j in range(width):
            element = df.iloc[i, j]
            if type(element) != type(df.iloc[1, 0]):   # skip the date column
                a = df.iloc[i, j].replace(".", "")     # drop the thousands separator
                b = float(a.replace(",", "."))         # comma becomes the decimal point
                df.iloc[i, j] = b
Basically, I select each intersection of all rows and columns, I replace the problematic characters, I turn the element into a float and I replace it in the dataframe. The if ensures that the function doesn't consider dates, which are in the first column of my dataframe.
The problem is that although the function does exactly what I want, it takes approximately 1 minute to cover 10 rows, so transforming my csv would take a little less than 60h.
I realize this is far from being optimized, but I struggled and failed to find a way that suited my needs and (scarce) skills.
How about:
def to_numeric(column):
    if np.issubdtype(column.dtype, np.datetime64):
        return column
    else:
        # regex=False so '.' is treated as a literal dot, not a regex wildcard
        return column.str.replace('.', '', regex=False).str.replace(',', '.', regex=False).astype(float)
df = df.apply(to_numeric)
That's assuming all strings are valid. Otherwise use pd.to_numeric instead of astype(float).
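A sketch of that pd.to_numeric variant (to_numeric_safe is just an illustrative name; errors='coerce' turns strings that still don't parse into NaN instead of raising):
def to_numeric_safe(column):
    if np.issubdtype(column.dtype, np.datetime64):
        return column
    cleaned = column.str.replace('.', '', regex=False).str.replace(',', '.', regex=False)
    return pd.to_numeric(cleaned, errors='coerce')

df = df.apply(to_numeric_safe)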
I currently use CSV reader to create a two dimensional list. First, I strip off the header information, so my list is purely data. Sadly, a few columns are text (dates, etc) and some are just for checking against other data. What I'd like to do is take certain columns of this data and obtain the mean. Other columns I just need to ignore. What are the different ways that I can do this? I probably don't care about speed, I'm doing this once after I read the csv and my CSV files are maybe 2000 or so rows and only 30 or so columns.
This assumes that all rows are of equal length; if they're not, you may have to add a few try/except cases.
lst = []  # This is the rows and columns, assuming the rows contain the columns
column = 2
temp = 0
for row in range(len(lst)):
    temp += lst[row][column]
mean = temp / len(lst)
To test if the element is a number, for most cases, I use
try:
    float(element)  # int may also work depending on your data
except ValueError:
    pass
Hope this helps; I can't test this code, as I'm on my phone.
Try this:
def avg_columns(list_name, *column_numbers):
    running_sum = 0
    for col in column_numbers:
        for row in range(len(list_name)):
            running_sum += list_name[row][col]
    return running_sum / (len(list_name) * len(column_numbers))
You pass it the name of the list, and the indexes of the columns (starting at 0), and it will return the average of those columns.
l = [
[1,2,3],
[1,2,3]
]
print(avg_columns(l, 0)) # returns 1.0, the avg of the first column (index 0)
print(avg_columns(l, 0, 2)) # returns 2.0, the avg of column indices 0 and 2 (first and third)
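Since the question mentions text columns mixed in with the numeric ones, here is a rough sketch (my addition, not from either answer) that combines the float() check above with the column averaging, skipping cells that don't parse as numbers:
def avg_numeric_column(rows, col):
    # Average one column, ignoring cells that aren't numbers (dates, labels, ...).
    values = []
    for row in rows:
        try:
            values.append(float(row[col]))
        except (ValueError, IndexError):
            pass
    return sum(values) / len(values) if values else None

data = [["2020-01-01", "3", "x"], ["2020-01-02", "5", "7"]]
print(avg_numeric_column(data, 1))  # 4.0
print(avg_numeric_column(data, 2))  # 7.0 (the non-numeric 'x' is skipped)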