Transform Pandas DataFrame to LIBFM format txt file - python

I want to transform a pandas DataFrame in Python into a sparse-matrix txt file in the libFM format.
The format needs to look like this:
4 0:1.5 3:-7.9
2 1:1e-5 3:2
-1 6:1
This file contains three cases. The first column states the target of each of the three cases, i.e. 4 for the first case, 2 for the second and -1 for the third. After the target, each line contains the non-zero elements of x, where an entry like 0:1.5 reads x0 = 1.5 and 3:-7.9 means x3 = -7.9, etc. That means the left side of INDEX:VALUE states the index within x, whereas the right side states the value of x.
In total the data from the example describes the following design matrix X and target vector y:
X:  [ 1.5   0.0   0.0  -7.9   0.0   0.0   0.0 ]
    [ 0.0   1e-5  0.0   2.0   0.0   0.0   0.0 ]
    [ 0.0   0.0   0.0   0.0   0.0   0.0   1.0 ]

y:  [  4 ]
    [  2 ]
    [ -1 ]
This is also explained in chapter 2 of the libFM manual.
Now here is my problem: I have a pandas dataframe that looks like this:
overall reviewerID asin brand Positive Negative \
0 5.0 A2XVJBSRI3SWDI 0000031887 Boutique Cutie 3.0 -1
1 4.0 A2G0LNLN79Q6HR 0000031887 Boutique Cutie 5.0 -2
2 2.0 A2R3K1KX09QBYP 0000031887 Boutique Cutie 3.0 -2
3 1.0 A19PBP93OF896 0000031887 Boutique Cutie 2.0 -3
4 4.0 A1P0IHU93EF9ZK 0000031887 Boutique Cutie 2.0 -2
LDA_0 LDA_1 ... LDA_98 LDA_99
0 0.000833 0.000833 ... 0.000833 0.000833
1 0.000769 0.000769 ... 0.000769 0.000769
2 0.000417 0.000417 ... 0.000417 0.000417
3 0.000137 0.014101 ... 0.013836 0.000137
4 0.000625 0.000625 ... 0.063125 0.000625
Where "overall" is the target column and all other 105 columns are features.
The 'reviewerID', 'asin' and 'brand' columns need to be converted to dummy variables, so each unique 'reviewerID', 'asin' and 'brand' gets its own column. This means that if 'reviewerID' has 100 unique values, you get 100 columns where the value is 1 if that row represents the specific reviewer and 0 otherwise.
All other columns don't need to be reformatted, so the index for those columns can just be the column number.
So the first 3 rows in the above pandas data frame need to be transformed to the following output:
5 0:1 5:1 6:1 7:3 8:-1 9:0.000833 10:0.000833 ... 107:0.000833 108:0.000833
4 1:1 5:1 6:1 7:5 8:-2 9:0.000769 10:0.000769 ... 107:0.000769 108:0.000769
2 2:1 5:1 6:1 7:3 8:-2 9:0.000417 10:0.000417 ... 107:0.000417 108:0.000417
In the libFM package there is a program that can transform user-item-rating data into the libFM input format. However, this program can't cope with this many columns.
Is there an easy way to do this? I have 1 million rows in total.

The libFM executable expects its input in the libSVM format that you have described here. If the file converter in the libFM package does not work for your data, try scikit-learn's sklearn.datasets.dump_svmlight_file function.
Ref: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.dump_svmlight_file.html
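A minimal sketch of the whole conversion could look like the following. It assumes df is the dataframe from the question; 'reviews.libfm' is just a placeholder file name. Note that pd.get_dummies appends the dummy columns after the numeric ones, so the feature indices will not match the hand-written example above, but libFM only needs the indices to be consistent across rows.

import pandas as pd
from sklearn.datasets import dump_svmlight_file

y = df['overall'].to_numpy()

# One-hot encode the three ID-like columns; the other feature columns stay as they are.
X = pd.get_dummies(df.drop(columns=['overall']),
                   columns=['reviewerID', 'asin', 'brand'])

# dump_svmlight_file writes one "target idx:value idx:value ..." line per row
# and omits zero entries, which is the sparse format libFM reads. For 1M rows
# with very high-cardinality IDs, a scipy.sparse matrix can be passed instead
# of the dense array to keep memory down.
dump_svmlight_file(X.to_numpy(dtype=float), y, 'reviews.libfm', zero_based=True)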

Related

Pandas.pivot replaces some existing values with NaN

I have to read some tables from a SQL Server database and merge their values into a single data structure for a machine learning project. I'm using pandas, in particular pd.read_sql_query to read the values and pd.merge to fuse them.
One table has at least 80 million rows, and if I try to read it entirely it occupies all my memory (it's not much, only 20 GB, and I've got a small SSD), so I decided to divide it into chunks of 100,000 rows:
df_bilanci = pd.read_sql_query('SELECT [IdBilancioRiclassificato], [IdBilancio], [Importo] FROM dbo.Bilanci', conn, chunksize=100000)
One single chunk will be like this:
   IdBilancioRiclassificato  IdBilancio   Importo
0                      10.0      7001.0       0.0
1                      11.0      7001.0  502643.0
2                      12.0      7001.0   -4550.0
3                      10.0      7002.0  654654.0
4                      11.0      7002.0       0.0
5                      12.0      7002.0       0.0
I'm interested in having the values of IdBilancioRiclassificato as columns (there are 97 unique values in that column, so there should be 97 columns), so I used pd.pivot on every chunk and then pd.concat plus pd.merge to put all the data together:
for chunk in df_bilanci:
    chunk.reset_index()
    chunk_pivoted = pd.pivot(data=chunk,
                             index='IdBilancio',
                             columns='IdBilancioRiclassificato',
                             values='Importo')
    df_aziende_bil = pd.concat([df_aziende_bil,
                                pd.merge(left=df_aziende_anagrafe, right=chunk_pivoted,
                                         left_on='ID', right_index=True)])
At this point, however, the chunk_pivoted dataframe has some values replaced with NaN, even though the values exist in the source table.
The result expected is a table like this one:
IdBilancio      10.0      11.0     12.0
7001.0           0.0  502643.0  -4550.0
7002.0      654654.0       0.0      0.0
but I've got something like this:
IdBilancio      10.0   11.0     12.0
7001.0           0.0    NaN  -4550.0
7002.0      654654.0    NaN      NaN
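For reference, a minimal sketch of the pivot step on a single chunk shaped like the one above (values taken from the question). One thing worth noting: rows of the same IdBilancio that are split across two chunks get pivoted into two separate partial rows, each with NaN for the pieces that landed in the other chunk, which could explain NaN appearing after the concat/merge even though the values exist in the table.

import pandas as pd

# A chunk shaped like the example above.
chunk = pd.DataFrame({
    'IdBilancioRiclassificato': [10.0, 11.0, 12.0, 10.0, 11.0, 12.0],
    'IdBilancio': [7001.0, 7001.0, 7001.0, 7002.0, 7002.0, 7002.0],
    'Importo': [0.0, 502643.0, -4550.0, 654654.0, 0.0, 0.0],
})

chunk_pivoted = pd.pivot(data=chunk,
                         index='IdBilancio',
                         columns='IdBilancioRiclassificato',
                         values='Importo')
print(chunk_pivoted)
# IdBilancioRiclassificato      10.0      11.0    12.0
# IdBilancio
# 7001.0                         0.0  502643.0 -4550.0
# 7002.0                    654654.0       0.0     0.0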

Python: merging pandas data frames results in only NaN in column

I am currently working on a Big Data project that requires merging a number of files into a single table that can be analyzed through SAS. The majority of the work is complete, with the final fact tables still needing to be added to the final output.
I have hit a snag when attempting to combine a fact table with the final output. The following columns are present in this CSV file, which has been loaded into its own dataframe:
table name: POP
year | Borough | Population
Within the dataset they are to be joined onto, these fields also exist, along with around 26 others. When I first attempted the merge with the following line:
Output = pd.merge(Output, POP, on=['year', 'Borough'], how='outer')
the following error was returned:
ValueError: You are trying to merge on object and int64 columns. If
you wish to proceed you should use pd.concat
I understood this to simply be a data type mismatch and so added the following line ahead of the merge command:
POP['year'] = POP['year'].astype(object)
Doing so allows for "successful" execution of the program; however, while the output file has the population column, it is filled with NaN when it should have the appropriate population for each row where a combination of "year" and "Borough" matches one found in the POP table.
Any help would be greatly appreciated. I provide below a fuller excerpt of the code for those who find that easier to parse:
import pandas as pd
#
# Add Population Data
#
#rename columns for easier joining
POP.rename(columns={"Area name":"Borough"}, inplace=True)
POP.rename(columns={"Persons":"Population"}, inplace=True)
POP.rename(columns={"Year":"year"}, inplace=True)
#convert type of output column to allow join
POP['year'] = POP['year'].astype(object)
#add to output file
Output = pd.merge(Output, POP, on=['year', 'Borough'], how='outer')
Additionally find below some information about the data types and shape of the items and tables involved in case it is of use:
Output table info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 34241 entries, 0 to 38179
Data columns (total 2 columns):
year       34241 non-null object
Borough    34241 non-null object
dtypes: object(2)
memory usage: 535.0+ KB
table shape: (34241, 36)
----------
POP table info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 357 entries, 0 to 356
Data columns (total 3 columns):
year          357 non-null object
Borough       357 non-null object
Population    357 non-null object
dtypes: object(3)
memory usage: 4.2+ KB
table shape: (357, 3)
Finally, my apologies if any of this is asked or presented incorrectly; I'm relatively new to Python and it's my first time posting on Stack Overflow as a contributor.
EDITS:
(1)
As requested, here are samples from the data.
This is the Output dataframe:
Borough Date totalIncidents Calculated Mean Closure \
0 Camden 2013-11-06 2.0 613.5
1 Kensington and Chelsea 2013-11-06 NaN NaN
2 Westminster 2013-11-06 1.0 113.0
PM2.5 (ug/m3) PM10 (ug/m3) (R) SO2 (ug/m3) (R) CO (mg m-3) (R) \
0 9.55 16.200 5.3 NaN
1 10.65 21.125 1.7 0.2
2 19.90 30.600 NaN 0.7
NO (ug/m3) (R) NO2 (ug/m3) (R) O3 (ug/m3) (R) Bus Stops \
0 135.866670 82.033333 24.4 447.0
1 80.360000 65.680000 29.3 270.0
2 171.033333 109.000000 21.3 489.0
Cycle Parking Points \
0 67.0
1 27.0
2 45.0
Average Public Transport Access Index 2015 (AvPTAI2015) \
0 24.316782
1 23.262691
2 39.750796
Public Transport Accessibility Levels (PTAL) Catagorisation \
0 5
1 5
2 6a
Underground Stations in Borough \
0 16.0
1 12.0
2 31.0
PM2.5 Daily Air Quality Index(DAQI) classification \
0 Low
1 Low
2 Low
PM2.5 Above WHO 24hr mean Guidline PM10 Above WHO 24hr mean Guidline \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
O3 Above WHO 8hr mean Guidline* NO2 Above WHO 1hr mean Guidline* \
0 1.0 1.0
1 1.0 1.0
2 1.0 1.0
SO2 Above WHO 24hr mean Guidline SO2 Above EU 24hr mean Allowence \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
NO2 Above EU 24hr mean Allowence CO Above EU 8hr mean Allowence \
0 1.0 0.0
1 1.0 0.0
2 1.0 0.0
O3 Above EU 8hr mean Allowence year NO2 Year Mean (ug/m3) \
0 0.0 2013 50.003618
1 0.0 2013 50.003618
2 0.0 2013 50.003618
PM2.5 Year Mean (ug/m3) PM10 Year Mean (ug/m3) \
0 15.339228 24.530299
1 15.339228 24.530299
2 15.339228 24.530299
NO2 Above WHO Annual mean Guidline NO2 Above EU Annual mean Allowence \
0 0.0 1.0
1 0.0 1.0
2 0.0 1.0
PM2.5 Above EU Annual mean Allowence PM10 Above EU Annual mean Allowence \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
Number of Bicycle Hires (All Boroughs)
0 18,431
1 18,431
2 18,431
Here is the Population dataframe:
year Borough Population
0 2010 Barking and Dagenham 182,838
1 2011 Barking and Dagenham 187,029
2 2012 Barking and Dagenham 190,560
Edit (2):
So this seems to have been a data type issue, but I'm still not entirely sure why, as I had attempted recasting the datatypes. However, the solution that finally got me going was to save the Output dataframe as a CSV and reload it into the program; from there, the merge started working again.
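For what it's worth, a minimal sketch of the usual fix for the object/int64 mismatch: cast the merge keys on both frames to the same concrete type (both to str here; both to int would also work) instead of casting one side to object. Column names are taken from the question; the .str.strip() is only an assumption, guarding against stray whitespace in the keys.

# Cast the keys on BOTH frames to the same concrete type so that equal
# years and borough names actually compare equal during the merge.
for col in ['year', 'Borough']:
    Output[col] = Output[col].astype(str).str.strip()
    POP[col] = POP[col].astype(str).str.strip()

Output = pd.merge(Output, POP, on=['year', 'Borough'], how='outer')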

How to make different dataframes of different lengths become equal in length (downsampling and upsampling)

I have many dataframes (timeseries) that are of different lengths ranging between 28 and 179. I need to make them all of length 104. (upsampling those below 104 and downsampling those above 104)
For upsampling, the linear method can be sufficient to my needs. For downsampling, the mean of the values should be good.
To get all files to be the same length, I thought that I need to make all dataframes start and end at the same dates.
I was able to downsample all of them to the size of the smallest dataframe (i.e. 28) using the lines of code below:
df.set_index(pd.date_range(start='1/1/1991' ,periods=len(df), end='1/1/2000'), inplace=True)
resampled=df.resample('120D').mean()
However, this will not give me good results when I feed them into the model I need them for, as it shrinks the longer files so much that it distorts the data.
This is what I tried so far:
df.set_index(pd.date_range(start='1/1/1991' ,periods=len(df), end='1/1/2000'), inplace=True)
if df.shape[0]>100: resampled=df.resample('D').mean()
elif df.shape[0]<100: resampled=df.astype(float).resample('33D').interpolate(axis=0, method='linear')
else: break
Now, in the above lines of code, I am getting the files to be the same length (length 100). The downsampling part works fine too.
What's not working is the interpolation in the upsampling part. It just returns dataframes of length 100 with the first value of every column copied over to all the rows.
What I need is to make them all size 104 (the average size). This means any df of length > 104 needs to be downsampled and any df of length < 104 needs to be upsampled.
As an example, please consider the two dfs as follows:
>>df1
index
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 11 -7 -2
>>df2
index
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 6 -3 -2
4 4 0 -4
5 8 2 -6
6 10 4 -8
7 12 6 -10
Suppose the average length is 6; the expected output would be:
df1 upsampled to length 6 using interpolation, e.g. resample(rule).interpolate().
And df2 downsampled to length 6 using resample(rule).mean().
Update:
If I could get all the files to be upsampled to 179, that would be fine as well.
I assume the problem is that when you resample in the up-sampling case, the other values are not kept. With your example df1, you can see it by using asfreq on one column:
print(df1.set_index(pd.date_range(start='1/1/1991', periods=len(df1), end='1/1/2000'))[1]
        .resample('33D').asfreq().isna().sum(0))
# 99 rows are NaN in the 100-row resampled dataframe
So when you do interpolate instead of asfreq, it actually interpolates from just the first value, meaning that the first value is "repeated" over all the rows.
To get the result you want, apply mean before interpolating, even in the up-sampling case, such as:
print(df1.set_index(pd.date_range(start='1/1/1991', periods=len(df1), end='1/1/2000'))[1]
        .resample('33D').mean().interpolate().head())
1991-01-01 3.000000
1991-02-03 3.060606
1991-03-08 3.121212
1991-04-10 3.181818
1991-05-13 3.242424
Freq: 33D, Name: 1, dtype: float64
and you will get values as you want.
To conclude, I think in both up-sampling and down-sampling cases, you can use the same command
resampled = (df.set_index(pd.date_range(start='1/1/1991', periods=len(df), end='1/1/2000'))
               .resample('33D').mean().interpolate())
Because the interpolate would not affect the result in the down-sampling case.
Here is my version using skimage.transform.resize() function:
df1 = pd.DataFrame({
    'a': [3, 5, 9, 11],
    'b': [-1, -3, -5, -7],
    'c': [0, 2, 0, -2]
})
df1
a b c
0 3 -1 0
1 5 -3 2
2 9 -5 0
3 11 -7 -2
import pandas as pd
import numpy as np
from skimage.transform import resize

def df_resample(df1, num=1):
    df2 = pd.DataFrame()
    for key, value in df1.iteritems():
        temp = value.to_numpy() / value.abs().max()                          # normalize
        resampled = resize(temp, (num, 1), mode='edge') * value.abs().max()  # de-normalize
        df2[key] = resampled.flatten().round(2)
    return df2
df2 = df_resample(df1, 20) # resampling rate is 20
df2
a b c
0 3.0 -1.0 0.0
1 3.0 -1.0 0.0
2 3.0 -1.0 0.0
3 3.4 -1.4 0.4
4 3.8 -1.8 0.8
5 4.2 -2.2 1.2
6 4.6 -2.6 1.6
7 5.0 -3.0 2.0
8 5.8 -3.4 1.6
9 6.6 -3.8 1.2
10 7.4 -4.2 0.8
11 8.2 -4.6 0.4
12 9.0 -5.0 0.0
13 9.4 -5.4 -0.4
14 9.8 -5.8 -0.8
15 10.2 -6.2 -1.2
16 10.6 -6.6 -1.6
17 11.0 -7.0 -2.0
18 11.0 -7.0 -2.0
19 11.0 -7.0 -2.0

count values in one pandas series based on date column and other column

I have several columns of data in a pandas dataframe. The data looks like this:
0 10173 2010-06-12 39.0 1
1 95062 2010-09-11 35.0 2
2 171081 2010-07-05 39.0 1
3 122867 2010-08-18 39.0 1
4 107186 2010-11-23 0.0 3
5 171085 2010-09-02 0.0 2
6 169767 2010-07-03 28.0 2
7 80170 2010-03-23 39.0 2
8 154178 2010-10-02 37.0 2
9 3494 2010-11-01 0.0 1
.
.
.
.
5054054 1716139 2012-01-12 0.0 2
5054055 1716347 2012-01-18 28.0 1
5054056 1807501 2012-01-21 0.0 1
There are zeros that appear in the values column on different days. To plot them properly, I tried to sum the second_val values for each month over the rows where the values column equals zero at that time, and I did it using:
Jan10 = df.second_val[df['timestamp'].str.contains('2010-01')][df['values']==0].sum()
Feb10 = df.second_val[df['timestamp'].str.contains('2010-02')][df['values']==0].sum()
Mar10 = df.second_val[df['timestamp'].str.contains('2010-03')][df['values']==0].sum()
.
.
.
.
Jan12 = df.second_val[df['timestamp'].str.contains('2012-01')][df['values']==0].sum()
Feb12 = df.second_val[df['timestamp'].str.contains('2012-02')][df['values']==0].sum()
Months = ['2010-01', '2010-02', '2010-03', '2010-04' . . . . ., '2012-01', '2012-02']
Months_Orders = [Jan10, Feb10, Mar10, Apr10, . . . . .. ., Jan12, Feb12]
plt.figure(figsize=(15,8))
plt.scatter(x = Months, y = Months_Orders)
For example, if 0 appears on 10 days in January 2010 and the sum of the second_val data for those rows is 20, then it should give me 20 for the month of January, e.g.
cus_id timestamp values second_val
0 10173 2010-01-10 0.0 1
.
.
13 95062 2010-01-11 0.0 2
34 171081 2010-01-23 0.0 1
Is there any way to do this better, either with a function or a built-in pandas method? I tried the solution from my previous question, but it was different and didn't work properly for me, so I used this hard-coded approach, which seems inefficient. Thanks.
IIUC
df.timestamp = pd.to_datetime(df.timestamp)
df = df[df['values'] == 0]  # filter before the groupby
df.groupby(df.timestamp.dt.strftime('%Y-%m')).second_val.sum()  # group key has the format %Y-%m
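If you then want the scatter plot from the question, the grouped result can be plotted directly. A sketch (assuming timestamp has already been converted with pd.to_datetime as above):

import matplotlib.pyplot as plt

zero_rows = df[df['values'] == 0]
monthly = (zero_rows
           .groupby(zero_rows['timestamp'].dt.strftime('%Y-%m'))['second_val']
           .sum())  # one summed value per %Y-%m month

plt.figure(figsize=(15, 8))
plt.scatter(x=monthly.index, y=monthly.values)
plt.show()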

Is it possible to write and read multiple DataFrames to/from one single file?

I'm currently dealing with a set of similar DataFrames that have a double header.
They have the following structure:
age height weight shoe_size
RHS height weight shoe_size
0 8.0 6.0 2.0 1.0
1 8.0 NaN 2.0 1.0
2 6.0 1.0 4.0 NaN
3 5.0 1.0 NaN 0.0
4 5.0 NaN 1.0 NaN
5 3.0 0.0 1.0 0.0
height weight shoe_size age
RHS weight shoe_size age
0 1.0 1.0 NaN NaN
1 1.0 2.0 0.0 2.0
2 1.0 NaN 0.0 5.0
3 1.0 2.0 0.0 NaN
4 0.0 1.0 0.0 3.0
Actually, the main differences are the ordering of the first header row, which could be made the same for all of them, and the position of the RHS column in the second header row. I'm currently wondering if there is an easy way of saving/reading all these DataFrames into/from a single CSV file instead of having a separate CSV file for each of them.
Unfortunately, there isn't any reasonable way to store multiple dataframes in a single CSV such that retrieving each one would not be excessively cumbersome, but you can use pd.ExcelWriter and save to separate sheets in a single .xlsx file:
import pandas as pd

writer = pd.ExcelWriter('file.xlsx')
for i, df in enumerate(df_list):
    df.to_excel(writer, 'sheet{}'.format(i))
writer.save()
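If you later need the frames back, the sheets can be read in a single call. A sketch, assuming the file and sheet names written above and the two-row headers from the question:

import pandas as pd

# sheet_name=None loads every sheet into a dict of {sheet name: DataFrame};
# header=[0, 1] rebuilds the two-row column header, index_col=0 restores the index.
frames = pd.read_excel('file.xlsx', sheet_name=None, header=[0, 1], index_col=0)
df0 = frames['sheet0']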
Taking your example (with random numbers instead of your values):
import pandas as pd
import numpy as np
h1 = [['age', 'height', 'weight', 'shoe_size'],['RHS','height','weight','shoe_size']]
df1 = pd.DataFrame(np.random.randn(3, 4), columns=h1)
h2 = [['height', 'weight', 'shoe_size','age'],['RHS','weight','shoe_size','age']]
df2 = pd.DataFrame(np.random.randn(3, 4), columns=h2)
First, reorder your columns (see How to change the order of DataFrame columns?):
df3 = df2[h1[0]]
Then, concatenate the two dataframes (see Merge, join, and concatenate):
df4 = pd.concat([df1,df3])
I don't know how you want to deal with the second row of your header (for now, it just uses two sub-column levels, which is not very elegant). If, from your point of view, this row is meaningless, just reset the headers as you want before concatenating:
df1.columns=h1[0]
df3.columns=h1[0]
df5 = pd.concat([df1,df3])
Finally, save it in CSV format (pandas.DataFrame.to_csv):
df4.to_csv('file_name.csv', sep=',')
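And to load it back later, something along these lines should work (a sketch, assuming the MultiIndex-column version df4 is the one that was saved):

import pandas as pd

# header=[0, 1] rebuilds the two-row header; index_col=0 restores the index.
df_back = pd.read_csv('file_name.csv', sep=',', header=[0, 1], index_col=0)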
