merge 2 csv files by columns error related to strings? - python

I am trying to merge 2 csv files by column.
Both of my csv files end with '_4.csv' in the filename, and the final merged csv should look like this:
0-10 ,83.72,66.76,86.98 ,0-10 ,83.72,66.76,86.98
11-20 ,15.01,31.12,12.04 ,11-20 ,15.01,31.12,12.04
21-30 ,1.14,2.05,0.94 ,21-30 ,1.14,2.05,0.94
31-40 ,0.13,0.07,0.03 ,31-40 ,0.13,0.07,0.03
over 40 ,0.0,0.0,0.0 ,over 40 ,0.0,0.0,0.0
UHF case ,0.0,0.0,0.0 ,UHF case ,0.0,0.0,0.0
my code:

# combine 2 csv files into 1 by columns
import os
import numpy as np

files_in_dir = [f for f in os.listdir(os.getcwd()) if f.endswith('_4.csv')]
temp_data = []
for filenames in files_in_dir:
    temp_data.append(np.loadtxt(filenames, dtype='str'))
temp_data = np.array(temp_data)
np.savetxt('_mix.csv', temp_data.transpose(), fmt='%s', delimiter=',')
However, the error says:

temp_data.append(np.loadtxt(filenames, dtype='str'))
  for x in read_data(_loadtxt_chunksize):
raise ValueError("Wrong number of columns at line %d"
ValueError: Wrong number of columns at line 2

I am not sure if this is related to the first column containing strings rather than numeric values.
Does anyone know how to fix it? Much appreciated.

I think you're looking for the join method. If we have two .csv files of the form:
0-10 ,83.72,66.76,86.98
11-20 ,15.01,31.12,12.04
21-30 ,1.14,2.05,0.94
31-40 ,0.13,0.07,0.03
over 40 ,0.0,0.0,0.0
UHF case ,0.0,0.0,0.0
Assuming they both have similar structure, we'll work with one of these named data.csv:
import pandas as pd

# Assumes there are no headers
df1 = pd.read_csv("data.csv", header=None)
df2 = pd.read_csv("data.csv", header=None)

# By default, DataFrame columns are numbered 0, 1, 2, 3.
# Rename the second frame's columns so they do not clash:
# `df2` will then have columns named 4, 5, 6, 7.
df2 = df2.rename(
    columns={
        x: y for x, y in zip(df2.columns, range(len(df2.columns), len(df2.columns) * 2))
    }
)
print(df1.join(df2))
Example output:
0 1 2 3 4 5 6 7
0 0-10 83.72 66.76 86.98 0-10 83.72 66.76 86.98
1 11-20 15.01 31.12 12.04 11-20 15.01 31.12 12.04
2 21-30 1.14 2.05 0.94 21-30 1.14 2.05 0.94
3 31-40 0.13 0.07 0.03 31-40 0.13 0.07 0.03
4 over 40 0.00 0.00 0.00 over 40 0.00 0.00 0.00
5 UHF case 0.00 0.00 0.00 UHF case 0.00 0.00 0.00
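If the goal is simply to stack the two `_4.csv` files side by side, `pd.concat` with `axis=1` may be a simpler route than renaming columns for a join. A minimal sketch, with two small in-memory frames standing in for the files read with `header=None`:

```python
import pandas as pd

# Two frames standing in for the two "*_4.csv" files read with header=None
df1 = pd.DataFrame([["0-10", 83.72, 66.76, 86.98],
                    ["11-20", 15.01, 31.12, 12.04]])
df2 = df1.copy()

# Side-by-side merge; ignore_index=True renumbers the columns 0..7
merged = pd.concat([df1, df2], axis=1, ignore_index=True)
print(merged.shape)  # (2, 8)
```

In the real case you would build the list of frames from the glob of `*_4.csv` filenames and then write `merged` out with `to_csv('_mix.csv', index=False, header=False)`.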

Related

How To Iterate Over A Timespan and Calculate some Values in a Dataframe using Python?

I have a dataset like below
import pandas as pd

data = {'ReportingDate': ['2013/5/31','2013/5/31','2013/5/31','2013/5/31','2013/5/31','2013/5/31',
                          '2013/6/28','2013/6/28','2013/6/28','2013/6/28','2013/6/28'],
        'MarketCap': [' ',0.35,0.7,0.875,0.7,0.35,' ',1,1.5,0.75,1.25],
        'AUM': [3.5,3.5,3.5,3.5,3.5,3.5,5,5,5,5,5],
        'weight': [' ',0.1,0.2,0.25,0.2,0.1,' ',0.2,0.3,0.15,0.25]}

# Create DataFrame
df = pd.DataFrame(data)
df.set_index('ReportingDate', inplace=True)
df
Just a sample of an 8000-row dataset.
ReportingDate runs from 2013/5/31 to 2015/10/30.
It includes data for every month in that period, but only the last day of each month.
The first line of each month has two missing values. I know that:
the sum of weight for each month is equal to 1
weight * AUM is equal to MarketCap
I can use the lines below to get the answer I want, but only for one month:

a = (1 - df["2013-5"].iloc[1:]['weight'].sum())
b = a * AUM
df.iloc[1, 0] = b
df.iloc[1, 2] = a
How can I use a loop to get the data for the whole period? Thanks
One way using pandas.DataFrame.groupby:
import numpy as np
import pandas as pd

# If the blanks really are whitespace strings, not NaN
df = df.replace(r"\s+", np.nan, regex=True)
# If the index is not already a datetime series
df.index = pd.to_datetime(df.index)

s = df["weight"].fillna(1) - df.groupby(df.index.date)["weight"].transform("sum")
df["weight"] = df["weight"].fillna(s)
df["MarketCap"] = df["MarketCap"].fillna(s * df["AUM"])

Note: this assumes the dates are always the last day of the month, so that grouping by date is equivalent to grouping by year-month. If that is not the case, try:

s = df["weight"].fillna(1) - df.groupby(df.index.strftime("%Y%m"))["weight"].transform("sum")
Output:
MarketCap AUM weight
ReportingDate
2013-05-31 0.350 3.5 0.10
2013-05-31 0.525 3.5 0.15
2013-05-31 0.700 3.5 0.20
2013-05-31 0.875 3.5 0.25
2013-05-31 0.700 3.5 0.20
2013-05-31 0.350 3.5 0.10
2013-06-28 0.500 5.0 0.10
2013-06-28 1.000 5.0 0.20
2013-06-28 1.500 5.0 0.30
2013-06-28 0.750 5.0 0.15
2013-06-28 1.250 5.0 0.25
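As a quick sanity check, the filled-in weights should now sum to 1 within each month. A sketch on a trimmed rebuild of the sample data (first row of each month missing):

```python
import numpy as np
import pandas as pd

# Trimmed rebuild of the sample: first weight of each month is missing
df = pd.DataFrame(
    {"weight": [np.nan, 0.1, 0.2, 0.25, 0.2, 0.1, np.nan, 0.2, 0.3, 0.15, 0.25],
     "AUM": [3.5] * 6 + [5.0] * 5},
    index=pd.to_datetime(["2013-05-31"] * 6 + ["2013-06-28"] * 5),
)

# Same fill as above: the missing weight is 1 minus the month's known weights
s = df["weight"].fillna(1) - df.groupby(df.index.date)["weight"].transform("sum")
df["weight"] = df["weight"].fillna(s)
df["MarketCap"] = df["weight"] * df["AUM"]

print(df.groupby(df.index.date)["weight"].sum())  # 1.0 for each month
```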

Plotting variables by classes

I would need to plot as bar chart the following columns
%_Var1 %_Var2 %_Val1 %_Val2 Class
2 0.00 0.00 0.10 0.01 1
3 0.01 0.01 0.07 0.05 0
17 0.00 0.00 0.02 0.01 0
24 0.00 0.00 0.11 0.04 0
27 0.00 0.00 0.02 0.03 1
44 0.00 0.00 0.05 0.02 0
53 0.00 0.00 0.03 0.01 1
67 0.00 0.00 0.06 0.02 0
87 0.00 0.00 0.22 0.01 1
115 0.00 0.00 0.03 0.02 0
comparing the values for Class 1 and Class 0 respectively (i.e. bars showing each column of the dataframe, with the Class 1 bar for a column placed beside the Class 0 bar for the same column).
So I should have 8 bars: 4 for Class 1 and the remaining 4 for Class 0.
Each column's Class 1 bar should sit beside the same column's Class 0 bar.
I tried the following:

ax = df[["%_Var1", "%_Var2", "%_Val1", "%_Val2"]].plot(kind='bar')

but the output is completely wrong, as it also is when writing ax = df[["%_Var1", "%_Var2", "%_Val1", "%_Val2"]].Label.plot(kind='bar')
I think I should use a groupby in my code, in order to group by Class, but I do not know how to set it up (plots are not my top skill).
If you want to try the seaborn way, melt the dataframe to long format and then hue on the class.

import seaborn as sns

data = df.melt(id_vars=['Class'], value_vars=['%_Var1', '%_Var2', '%_Val1', '%_Val2'])
sns.barplot(x='variable', y='value', hue='Class', data=data, ci=0)

gives:
Or, if you want to group the bars by class instead, simply swap the hue and the x axis:

sns.barplot(x='Class', y='value', hue='variable', data=data, ci=0)

gives:
Using groupby:
df.groupby('Class').mean().plot.bar()
With pivot_table method you can summarise the data per group as well.
df.pivot_table(index='Class').plot.bar()
# df.pivot_table(columns='Class').plot.bar() # invert order
By default, it calculates the mean of your target-columns, but you can specify another aggregation method with aggfunc='myfunc' parameter.
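For completeness, here is a self-contained sketch of the groupby route on a few of the sample rows from the question (the final plot call is commented out so the aggregation can be checked without a display backend):

```python
import pandas as pd

df = pd.DataFrame({
    "%_Var1": [0.00, 0.01, 0.00, 0.00],
    "%_Var2": [0.00, 0.01, 0.00, 0.00],
    "%_Val1": [0.10, 0.07, 0.02, 0.11],
    "%_Val2": [0.01, 0.05, 0.01, 0.04],
    "Class":  [1, 0, 0, 0],
})

# One row per class, one column per variable: exactly the 8 bars asked for
means = df.groupby("Class").mean()
print(means)
# means.plot.bar()  # draws the grouped bars (requires matplotlib)
```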

Calculation is done only on part of the table

I am trying to calculate the kurtosis and skewness over my data, and I managed to create a table, but for some reason the result covers only a few of the columns, not all of the fields.
For example, as you can see, I have many fields (columns):
I calculate the skewness and kurtosis using the code below:

sk = pd.DataFrame(data.skew())
kr = pd.DataFrame(data.kurtosis())
sk['kr'] = kr
sk.rename(columns={0: 'sk'}, inplace=True)

but then the result contains only about half of the columns I have:
I have tried head(10), but it doesn't change the fact that some columns disappeared.
How can I calculate this for all the columns?
It is really hard to reproduce the error since you did not share the original data. Most likely your dataframe contains non-numerical values in the missing columns, which would produce this behavior:

import pandas as pd

dat = {"1": {'lg1': 0.12, 'lg2': 0.23, 'lg3': 0.34, 'lg4': 0.45},
       "2": {'lg1': 0.12, 'lg2': 0.23, 'lg3': 0.34, 'lg4': 0.45},
       "3": {'lg1': 0.12, 'lg2': 0.23, 'lg3': 0.34, 'lg4': 0.45},
       "4": {'lg1': 0.12, 'lg2': 0.23, 'lg3': 0.34, 'lg4': 0.45},
       "5": {'lg1': 0.12, 'lg2': 0.23, 'lg3': 'po', 'lg4': 0.45}}
df = pd.DataFrame.from_dict(dat).T
print(df)
lg1 lg2 lg3 lg4
1 0.12 0.23 0.34 0.45
2 0.12 0.23 0.34 0.45
3 0.12 0.23 0.34 0.45
4 0.12 0.23 0.34 0.45
5 0.12 0.23 po 0.45
print(df.kurtosis())
lg1 0
lg2 0
lg4 0
The solution would be to preprocess the data.
One word of advice: check whether the error is consistent, i.e. are always the same columns missing?
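A minimal sketch of that preprocessing step: coercing every column to numeric turns unparseable cells into NaN, after which skew() and kurtosis() cover all the columns (the column names and values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"lg1": [0.12, 0.23, 0.34, 0.45, 0.56],
                   "lg3": [0.34, 0.35, 0.36, 0.37, "po"]})

# errors="coerce" turns values like "po" into NaN instead of
# leaving the whole column with dtype object (which kurtosis skips)
clean = df.apply(pd.to_numeric, errors="coerce")
print(clean.kurtosis())  # now reports every column; NaNs are skipped
```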

Reading in specific date lines from a file with pandas python

I am attempting to read in many files. Each file is a daily data file with data every 10 minutes. the data in each file is kind of "chunked up" like this:
2015-11-08 00:10:00 00:10:00
# z speed dir W sigW bck error
30 3.32 111.9 0.15 0.12 1.50E+05 0
40 3.85 108.2 0.07 0.14 7.75E+04 0
50 4.20 107.9 0.06 0.15 4.73E+04 0
60 4.16 108.5 0.03 0.19 2.73E+04 0
70 4.06 93.6 0.03 0.23 9.07E+04 0
80 4.06 93.8 0.07 0.28 1.36E+05 0
2015-11-08 00:20:00 00:10:00
# z speed dir W sigW bck error
30 3.79 120.9 0.15 0.11 7.79E+05 0
40 4.36 115.6 0.04 0.13 2.42E+05 0
50 4.71 113.6 0.07 0.14 6.84E+04 0
60 5.00 113.3 0.13 0.17 1.16E+04 0
70 4.29 94.2 0.22 0.20 1.38E+05 0
80 4.54 94.1 0.11 0.25 1.76E+05 0
2015-11-08 00:30:00 00:10:00
# z speed dir W sigW bck error
30 3.86 113.6 0.13 0.10 2.68E+05 0
40 4.34 116.1 0.09 0.11 1.41E+05 0
50 5.02 112.8 0.04 0.12 7.28E+04 0
60 5.36 110.5 0.01 0.14 5.81E+04 0
70 4.67 95.4 0.14 0.16 7.69E+04 0
80 4.56 95.0 0.15 0.21 9.84E+04 0
...
The file continues like this every 10 minutes for the whole day, and this file is named 151108.mnd. I want my code to read in all files for November, i.e. 1511??.mnd. For each day file it should grab all of the datetime lines: for the partial data file shown above, that would be 2015-11-08 00:10:00, 2015-11-08 00:20:00, 2015-11-08 00:30:00, etc. It should store those as variables, then move to the next day file (151109.mnd), grab its datetime lines, and append them to the previously stored dates, and so on for the whole month. Here is the code I have so far:
import pandas as pd
import glob
import datetime

filename = glob.glob('1511??.mnd')
data_nov15_hereford = pd.DataFrame()
frames = []
dates = []
counter = 1
for i in filename:
    f_nov15_hereford = pd.read_csv(i, skiprows=32)
    for line in f_nov15_hereford:
        if line.startswith("20"):
            print line
            date_object = datetime.datetime.strptime(line[:-6], '%Y-%m-%d %H:%M:%S %f')
            dates.append(date_object)
            counter = 0
        else:
            counter += 1
    frames.append(f_nov15_hereford)
data_nov15_hereford = pd.concat(frames, ignore_index=True)
data_nov15_hereford = data_nov15_hereford.convert_objects(convert_numeric=True)
print dates
This code has some problems: when I print dates it prints two copies of every date, and it only prints the first date of every file (2015-11-08 00:10:00, 2015-11-09 00:10:00, etc.). It isn't going line by line through each file and then, once all dates in that file are stored, moving on to the next file like I want; instead it just grabs the first date in each file. Any help with this code? Is there an easier way to do what I want? Thanks!
A few observations:
First: why you are only getting the first date in a file:

f_nov15_hereford = pd.read_csv(i, skiprows=32)
for line in f_nov15_hereford:
    if line.startswith("20"):

The first line reads the file into a pandas DataFrame. The second line iterates over the columns of that DataFrame, not its rows. As a result, the last line checks whether each column name starts with "20", which matches at most once per file.
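The pitfall is easy to demonstrate on a toy frame (hypothetical column names, just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Iterating a DataFrame yields its column labels, not its rows
print(list(df))  # ['a', 'b']

# To walk the rows instead, use itertuples (or iterrows)
print([row.a for row in df.itertuples()])  # [1, 2]
```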
Second: counter is initialized and its value gets changed, but it is never used. I presume it was intended for skipping over lines in the files.
Third: it might be simpler to collect all the dates into a Python list and then convert that to a pandas dataframe if needed.
import pandas as pd
import glob
import datetime as dt

# number of lines to skip before the first date
offset = 32
# number of lines from one date to the next
recordlength = 9

pattern = '1511??.mnd'
dates = []
for filename in glob.iglob(pattern):
    with open(filename) as datafile:
        count = -offset
        for line in datafile:
            if count == 0:
                fmt = '%Y-%m-%d %H:%M:%S %f'
                date_object = dt.datetime.strptime(line[:-6], fmt)
                dates.append(date_object)
            count += 1
            if count == recordlength:
                count = 0
data_nov15_hereford = pd.DataFrame(dates, columns=['Dates'])
print dates
Consider modifying the csv data line by line prior to reading it in as a dataframe. The code below opens each original file in the glob list and writes it out to a temp file, moving the date over to a last column, removing the repeated headers, and dropping empty lines.
CSV Data (assuming the text view of the csv file looks like below; if the actual layout differs, adjust the Python code)
2015-11-0800:10:0000:10:00,,,,,,
z,speed,dir,W,sigW,bck,error
30,3.32,111.9,0.15,0.12,1.50E+05,0
40,3.85,108.2,0.07,0.14,7.75E+04,0
50,4.2,107.9,0.06,0.15,4.73E+04,0
60,4.16,108.5,0.03,0.19,2.73E+04,0
70,4.06,93.6,0.03,0.23,9.07E+04,0
80,4.06,93.8,0.07,0.28,1.36E+05,0
,,,,,,
2015-11-0800:10:0000:20:00,,,,,,
z,speed,dir,W,sigW,bck,error
30,3.79,120.9,0.15,0.11,7.79E+05,0
40,4.36,115.6,0.04,0.13,2.42E+05,0
50,4.71,113.6,0.07,0.14,6.84E+04,0
60,5,113.3,0.13,0.17,1.16E+04,0
70,4.29,94.2,0.22,0.2,1.38E+05,0
80,4.54,94.1,0.11,0.25,1.76E+05,0
,,,,,,
2015-11-0800:10:0000:30:00,,,,,,
z,speed,dir,W,sigW,bck,error
30,3.86,113.6,0.13,0.1,2.68E+05,0
40,4.34,116.1,0.09,0.11,1.41E+05,0
50,5.02,112.8,0.04,0.12,7.28E+04,0
60,5.36,110.5,0.01,0.14,5.81E+04,0
70,4.67,95.4,0.14,0.16,7.69E+04,0
80,4.56,95,0.15,0.21,9.84E+04,0
Python Script
import glob, os
import pandas as pd

filenames = glob.glob('1511??.mnd')
temp = 'temp.csv'

# INITIATE EMPTY DATAFRAME
data_nov15_hereford = pd.DataFrame(columns=['z', 'speed', 'dir', 'W',
                                            'sigW', 'bck', 'error', 'date'])

# ITERATE THROUGH EACH FILE IN GLOB LIST
for file in filenames:
    # DELETE PRIOR TEMP VERSION
    if os.path.exists(temp): os.remove(temp)
    header = 0

    # READ IN ORIGINAL CSV
    with open(file, 'r') as txt1:
        for rline in txt1:
            # SAVE DATE VALUE THEN SKIP ROW
            if "2015-11" in rline: date = rline.replace(',', ''); continue
            # SKIP BLANK LINES (CHANGE IF NO COMMAS)
            if rline == ',,,,,,\n': continue
            # ADD NEW 'DATE' COLUMN AND SKIP OTHER HEADER LINES
            if 'z,speed,dir,W,sigW,bck,error' in rline:
                if header == 1: continue
                rline = rline.replace('\n', ',date\n')
                with open(temp, 'a') as txt2:
                    txt2.write(rline)
                header = 1
                continue
            # APPEND LINE TO TEMP CSV WITH DATE VALUE
            with open(temp, 'a') as txt2:
                txt2.write(rline.replace('\n', ',' + date))

    # APPEND TEMP FILE TO DATA FRAME
    data_nov15_hereford = data_nov15_hereford.append(pd.read_csv(temp))
Output
z speed dir W sigW bck error date
0 30 3.32 111.9 0.15 0.12 150000 0 2015-11-0800:10:0000:10:00
1 40 3.85 108.2 0.07 0.14 77500 0 2015-11-0800:10:0000:10:00
2 50 4.20 107.9 0.06 0.15 47300 0 2015-11-0800:10:0000:10:00
3 60 4.16 108.5 0.03 0.19 27300 0 2015-11-0800:10:0000:10:00
4 70 4.06 93.6 0.03 0.23 90700 0 2015-11-0800:10:0000:10:00
5 80 4.06 93.8 0.07 0.28 136000 0 2015-11-0800:10:0000:10:00
6 30 3.79 120.9 0.15 0.11 779000 0 2015-11-0800:10:0000:20:00
7 40 4.36 115.6 0.04 0.13 242000 0 2015-11-0800:10:0000:20:00
8 50 4.71 113.6 0.07 0.14 68400 0 2015-11-0800:10:0000:20:00
9 60 5.00 113.3 0.13 0.17 11600 0 2015-11-0800:10:0000:20:00
10 70 4.29 94.2 0.22 0.20 138000 0 2015-11-0800:10:0000:20:00
11 80 4.54 94.1 0.11 0.25 176000 0 2015-11-0800:10:0000:20:00
12 30 3.86 113.6 0.13 0.10 268000 0 2015-11-0800:10:0000:30:00
13 40 4.34 116.1 0.09 0.11 141000 0 2015-11-0800:10:0000:30:00
14 50 5.02 112.8 0.04 0.12 72800 0 2015-11-0800:10:0000:30:00
15 60 5.36 110.5 0.01 0.14 58100 0 2015-11-0800:10:0000:30:00
16 70 4.67 95.4 0.14 0.16 76900 0 2015-11-0800:10:0000:30:00
17 80 4.56 95.0 0.15 0.21 98400 0 2015-11-0800:10:0000:30:00
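One caveat on the loop above: `DataFrame.append` was deprecated and later removed from pandas, so on current versions the per-file frames are better collected in a list and concatenated once at the end. A sketch, with two tiny frames standing in for the frames read back from the temp CSVs:

```python
import pandas as pd

# Stand-ins for the frames read back from temp.csv, one per input file
parts = [pd.DataFrame({"z": [30, 40], "speed": [3.32, 3.85]}),
         pd.DataFrame({"z": [30, 40], "speed": [3.79, 4.36]})]

# Concatenate once instead of calling .append() inside the loop
data_nov15_hereford = pd.concat(parts, ignore_index=True)
print(len(data_nov15_hereford))  # 4
```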

How do I apply a lambda function on pandas slices, and return the same format as the input data frame?

I want to apply a function to column slices of a dataframe, row by row, and get back a dataframe of the same shape holding the value calculated for each slice.
So, for example
df = pandas.DataFrame(numpy.round(numpy.random.normal(size=(2, 10)),2))
f = lambda x: (x - x.mean())
What I want is to apply lambda function f from column 0 to 5 and from column 5 to 10.
I did this:
a = pandas.DataFrame(f(df.T.iloc[0:5, :]))

but this covers only the first slice. How can I include the second slice, so that the resulting output frame looks exactly like the input frame, just with every data point changed to its value minus the mean of the corresponding slice?
I hope it makes sense. What would be the right way to go about this?
Thank you.
You can simply reassign the result to the original df, like this:

import pandas as pd
import numpy as np

# I'd rather use a function than a lambda here; preference, I guess
def f(x):
    return x - x.mean()

df = pd.DataFrame(np.round(np.random.normal(size=(2, 10)), 2))
df.T
0 1
0 0.92 -0.35
1 0.32 -1.37
2 0.86 -0.64
3 -0.65 -2.22
4 -1.03 0.63
5 0.68 -1.60
6 -0.80 -1.10
7 -0.69 0.05
8 -0.46 -0.74
9 0.02 1.54
# make a copy of df here
df1 = df.copy()
# just reassign the slices back to the copy
df1.T[:5], df1.T[5:] = f(df.T.iloc[0:5, :]), f(df.T.iloc[5:, :])
df1.T
0 1
0 0.836 0.44
1 0.236 -0.58
2 0.776 0.15
3 -0.734 -1.43
4 -1.114 1.42
5 0.930 -1.23
6 -0.550 -0.73
7 -0.440 0.42
8 -0.210 -0.37
9 0.270 1.91
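An alternative that avoids assigning back into `.T` slices (which recent pandas versions may not propagate to the underlying frame) is to label the two column blocks and let groupby/transform demean each block in one pass. A sketch, using a deterministic frame so the result is easy to check:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20, dtype=float).reshape(2, 10))

# Columns 0-4 form block 0 and columns 5-9 form block 1
blocks = np.arange(df.shape[1]) // 5

# Demean each block row-wise: transpose, group the rows by block, transform
out = df.T.groupby(blocks).transform(lambda x: x - x.mean()).T
print(out.iloc[0].tolist())  # [-2.0, -1.0, 0.0, 1.0, 2.0, -2.0, -1.0, 0.0, 1.0, 2.0]
```

This generalizes to any number of slices by changing how `blocks` is built, and the output keeps the exact shape and labels of the input frame.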
