nsprcomp is on CRAN
When we published our 2008 ICML paper on sparse and non-negative PCA, I thought it might be worthwhile to provide a Matlab implementation of our algorithm as well. Since then, I've received several requests and questions about the code. While the core functionality is there, the implementation lacks a friendly interface and conveniences such as easy random restarts.
After two recent inquiries about using constrained PCA for portfolio optimization and combustion modeling, I decided to fill the gaps in the implementation. Because R has become my primary programming language, this provided an opportunity to learn about package writing and documentation for a public audience, with the goal of submitting the result to CRAN.
The nsprcomp package provides two algorithms. nsprcomp implements our emPCA algorithm as discussed in the paper, but with several deflation options for computing multiple components (motivated by Mackey, 2009). nscumcomp is a novel algorithm based on the same form of expectation-maximization, but it jointly computes all components such that the cumulative variance is maximized. One benefit of nscumcomp over nsprcomp is that the number of features can be specified as a total, instead of having to specify the cardinality of each principal axis individually. Setting the total is more natural in an exploratory data analysis, where the number of features that make up a component is not known in advance. A drawback of the joint optimization is that the M-step no longer has a closed form solution (that I know of), so the numerical optimization using L-BFGS iterations increases the computational load.
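To make the difference between the two cardinality arguments concrete, here is a minimal sketch on synthetic data. The data matrix, the block structure of the hidden factors and the chosen values of k are assumptions made up for illustration; only the argument names ncomp, k, nneg and gamma are taken from the calls shown later in this post. The package calls are guarded so the snippet is skipped when nsprcomp is not installed.

```r
# Sketch only: synthetic data stands in for a real data matrix.
set.seed(1)
n <- 200
d <- 20

# Two hidden factors, each loading on a disjoint block of five variables
f <- matrix(rnorm(n * 2), n, 2)
W_true <- matrix(0, d, 2)
W_true[1:5, 1] <- 1
W_true[6:10, 2] <- 1
X <- f %*% t(W_true) + 0.1 * matrix(rnorm(n * d), n, d)

if (requireNamespace("nsprcomp", quietly = TRUE)) {
  # nsprcomp: k bounds the number of non-zero loadings of EACH principal axis
  fit_seq <- nsprcomp::nsprcomp(X, ncomp = 2, k = 5, nneg = TRUE)

  # nscumcomp: k bounds the TOTAL number of non-zero loadings over all axes
  fit_cum <- nsprcomp::nscumcomp(X, ncomp = 2, k = 10, nneg = TRUE, gamma = 1)

  # Number of non-zero loadings per component
  print(colSums(fit_seq$rotation != 0))
  print(colSums(fit_cum$rotation != 0))
}
```

With nsprcomp every column of the loading matrix is limited to five non-zero entries, whereas nscumcomp is free to distribute its budget of ten entries unevenly across the two columns.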
A small example from the domain of portfolio optimization can demonstrate the usefulness of both algorithms, but keep in mind that my knowledge of the subject essentially consists of reading Markowitz (1952) and a handful of related papers.
Assume that the variance in asset returns can be explained (approximately) using a linear combination of a small number of hidden factors, which of course suggests a principal component analysis. The drawback of classical PCA is that each factor consists of a linear combination of all assets, and with mixed signs. Enforcing non-negativity and sparsity of the loadings can result in a more meaningful analysis, as non-negative loadings correspond to long positions in the portfolio, and sparsity limits the number of positions in the portfolio.
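The drawback of classical PCA can be seen in a few lines of base R. The following sketch uses made-up returns for eight hypothetical assets driven by a market factor and a sector factor (all names and magnitudes are assumptions, not the NYSE data analyzed below) and inspects the first two loading vectors returned by prcomp.

```r
# Sketch only: simulated returns, not the NYSE data set.
set.seed(42)
n_days <- 260
n_assets <- 8

market <- rnorm(n_days, sd = 0.01)   # factor shared by all assets
sector <- rnorm(n_days, sd = 0.01)   # factor shared by the first four assets
noise  <- matrix(rnorm(n_days * n_assets, sd = 0.005), n_days, n_assets)

returns <- outer(market, rep(1, n_assets)) +
  outer(sector, c(rep(1, 4), rep(0, 4))) +
  noise
colnames(returns) <- paste0("asset", seq_len(n_assets))

pca <- prcomp(returns, center = TRUE, scale. = FALSE)
loadings <- pca$rotation[, 1:2]
print(round(loadings, 3))

# Every asset enters every factor ...
dense <- all(loadings != 0)
# ... and orthogonality forces mixed signs in at least one of the two axes
mixed <- apply(loadings, 2, function(w) any(w > 0) && any(w < 0))
print(dense)
print(mixed)
```

Because the two loading vectors are orthogonal, they cannot both have entries of a single sign, so at least one of the factors necessarily combines long and short positions.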
We analyze NYSE daily returns for the year 2005, from a data set which used to be available at infochimps.com. After some pre-processing, the data matrix X contains 260 daily returns for 2365 stocks.
nsprcomp
Calling nsprcomp(X, ncomp=4, k=10, nneg=TRUE) returns four non-negative components with the top ten stocks each:
[[1]]
weight symbol name sector industry
0.4434287 AKS AK Steel Holding Corporation Basic Industries Steel/Iron Ore
0.3080191 ATI Allegheny Technologies Incorporated Basic Industries Steel/Iron Ore
0.3304431 CLF Cliffs Natural Resources Inc. Basic Industries Precious Metals
0.3228908 CMC Commercial Metals Company Basic Industries Steel/Iron Ore
0.2646535 CRS Carpenter Technology Corporation Basic Industries Steel/Iron Ore
0.3076497 GPK Graphic Packaging Holding Company Consumer Durables Containers/Packaging
0.2925628 GTI GrafTech International Ltd Energy Industrial Machinery/Components
0.2947678 MT ArcelorMittal Basic Industries Steel/Iron Ore
0.2643825 NUE Nucor Corporation Basic Industries Steel/Iron Ore
0.2966061 USU USEC Inc. Basic Industries Mining & Quarrying of Nonmetallic Minerals (No Fuels)
[[2]]
weight symbol name sector industry
0.2697512 ACO Amcol International Corporation Basic Industries Mining & Quarrying of Nonmetallic Minerals (No Fuels)
0.3291779 AIT Applied Industrial Technologies, Inc. Consumer Durables Industrial Specialties
0.2878995 BKI Buckeye Technologies, Inc. Basic Industries Paper
0.3173853 CCC Calgon Carbon Corporation Basic Industries Major Chemicals
0.3397139 CIA Citizens, Inc. Finance Life Insurance
0.3100144 CNC Centene Corporation Health Care Medical Specialities
0.2972055 ENS Enersys Consumer Non-Durables Telecommunications Equipment
0.3109916 NL NL Industries, Inc. Basic Industries Major Chemicals
0.4116398 PRM NA NA NA
0.2631501 RKT Rock-Tenn Company Consumer Durables Containers/Packaging
[[3]]
weight symbol name sector industry
0.3182184 BZH Beazer Homes USA, Inc. Capital Goods Homebuilding
0.3324353 DHI D.R. Horton, Inc. Capital Goods Homebuilding
0.3177888 HOV Hovnanian Enterprises Inc Capital Goods Homebuilding
0.2986497 KBH KB Home Capital Goods Homebuilding
0.2702869 LEN Lennar Corporation Basic Industries Homebuilding
0.3632209 MTH Meritage Corporation Capital Goods Homebuilding
0.3124587 PHM PulteGroup, Inc. Capital Goods Homebuilding
0.2837270 RYL Ryland Group, Inc. (The) Capital Goods Homebuilding
0.3117444 SPF Standard Pacific Corp Capital Goods Homebuilding
0.3431356 TOL Toll Brothers Inc. Capital Goods Homebuilding
[[4]]
weight symbol name sector industry
0.3072652 ARD NA NA NA
0.2883218 CHK Chesapeake Energy Corporation Energy Oil & Gas Production
0.2663459 EAC NA NA NA
0.3019455 FTO NA NA NA
0.3332282 GMXR NA NA NA
0.4095810 NGS Natural Gas Services Group, Inc. Energy Oilfield Services/Equipment
0.3135460 PQ Petroquest Energy Inc Energy Oil & Gas Production
0.3120241 SWN Southwestern Energy Company Energy Oil & Gas Production
0.3038500 TSO Tesoro Corporation Energy Integrated oil Companies
0.3058673 UPL Ultra Petroleum Corp. Energy Oil & Gas Production
The first component mostly consists of mining and steel companies. The second component is all over the place, but the third and fourth components are again very homogeneous (the NAs in the second and fourth component are due to unresolved stock symbols). The weights all have a similar magnitude and none of them is close to zero, which suggests that the per-component cardinality k could be increased to include more stocks in each factor (the NYSE energy sector for example contains over two hundred stocks). However, it would take a number of trial runs to optimize the cardinality of each component, such that all relevant and few spurious stocks are included in the analysis.
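Listings like the one above can be derived from a loading matrix by keeping the non-zero entries of each column and joining them with a symbol table. The helper below is a hypothetical sketch of that step, not the code that produced the output in this post; the function name list_components, the toy loading matrix and the toy symbol table are all made up for illustration.

```r
# Hypothetical helper: one data frame of non-zero loadings per component.
# W    : loading matrix with stock symbols as row names
# info : optional data frame with a "symbol" column plus descriptive columns
list_components <- function(W, info = NULL) {
  lapply(seq_len(ncol(W)), function(j) {
    idx <- which(W[, j] != 0)
    out <- data.frame(weight = W[idx, j],
                      symbol = rownames(W)[idx],
                      stringsAsFactors = FALSE)
    if (!is.null(info)) {
      # Symbols missing from the table yield NA in the descriptive columns
      rows <- match(out$symbol, info$symbol)
      extra <- info[rows, setdiff(names(info), "symbol"), drop = FALSE]
      out <- cbind(out, extra)
    }
    rownames(out) <- NULL
    out
  })
}

# Toy example: four symbols, two components
W <- matrix(c(0.8, 0.6, 0, 0,
              0,   0,   1, 0),
            ncol = 2,
            dimnames = list(c("AAA", "BBB", "CCC", "DDD"), NULL))
info <- data.frame(symbol = c("AAA", "CCC"),
                   sector = c("Energy", "Finance"),
                   stringsAsFactors = FALSE)

comps <- list_components(W, info)
print(comps)
```

A symbol that is absent from the lookup table, such as "BBB" here, shows up with NA in the descriptive columns, which mirrors the NA rows for unresolved stock symbols in the listing.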
nscumcomp
Calling nscumcomp(X, ncomp=4, k=150, nneg=TRUE, gamma=1) instead lets the algorithm determine the cardinality of each component. The result is not sensitive to the magnitude of the orthogonality penalty in this example, and a value of gamma=1 results in essentially orthogonal components:
[[1]]
weight symbol name sector industry
0.0532875561 ACI Arch Coal, Inc. Energy Coal Mining
0.0694850581 ALY NA NA NA
0.0284222815 ANR Alpha Natural Resources, inc. Energy Coal Mining
0.0836213427 APA Apache Corporation Energy Oil & Gas Production
0.0481126384 APC Anadarko Petroleum Corporation Energy Oil & Gas Production
0.1178876782 ARD NA NA NA
0.1393779752 ATW Atwood Oceanics, Inc. Energy Oil & Gas Production
0.0190883361 BJS NA NA NA
0.0454705431 BPT BP Prudhoe Bay Royalty Trust Energy Integrated oil Companies
0.1322468235 BRY Berry Petroleum Company Energy Oil & Gas Production
0.1083120714 BTU Peabody Energy Corporation Energy Coal Mining
0.1932201309 CHK Chesapeake Energy Corporation Energy Oil & Gas Production
0.1346712775 CNQ Canadian Natural Resources Limited Energy Oil & Gas Production
0.0708790208 CNX CONSOL Energy Inc. Energy Coal Mining
0.1469680348 COG Cabot Oil & Gas Corporation Energy Oil & Gas Production
0.0398795193 COP ConocoPhillips Energy Integrated oil Companies
0.0473333886 CPE Callon Petroleum Company Energy Oil & Gas Production
0.0977452639 CRK Comstock Resources, Inc. Energy Oil & Gas Production
0.0154340470 CRR Carbo Ceramics, Inc. Capital Goods Industrial Machinery/Components
0.1081947055 DNR Denbury Resources Inc. Energy Oil & Gas Production
0.1181309078 DO Diamond Offshore Drilling, Inc. Energy Oil & Gas Production
0.1192592682 DRQ Dril-Quip, Inc. Energy Metal Fabrications
0.1043657129 DVN Devon Energy Corporation Energy Oil & Gas Production
0.1597735987 EAC NA NA NA
0.1396755779 ECA Encana Corporation Energy Oil & Gas Production
0.1329824122 EOG EOG Resources, Inc. Energy Oil & Gas Production
0.0938604480 ESV ENSCO plc Energy Oil & Gas Production
0.0249781513 FRO Frontline Ltd. Transportation Marine Transportation
0.0757529529 FST Forest Oil Corporation Energy Oil & Gas Production
0.1836804006 FTO NA NA NA
0.1837984763 GDP Goodrich Petroleum Corporation Energy Oil & Gas Production
0.0671665209 GLF GulfMark Offshore, Inc. Energy Metal Fabrications
0.1445269400 GMXR NA NA NA
0.0837276244 HAL Halliburton Company Energy Oilfield Services/Equipment
0.0627795330 HES Hess Corporation Energy Integrated oil Companies
0.0036551657 HGT Hugoton Royalty Trust Energy Oil & Gas Production
0.1085047020 HLX Helix Energy Solutions Group, Inc. Energy Oilfield Services/Equipment
0.1106579862 HOC NA NA NA
0.0991632214 HOS Hornbeck Offshore Services Consumer Services Marine Transportation
0.0607473586 HP Helmerich & Payne, Inc. Energy Oil & Gas Production
0.0116382620 INT World Fuel Services Corporation Energy Oil Refining/Marketing
0.0645058955 IOC InterOil Corporation Energy Oil & Gas Production
0.0132567145 IYE NA NA NA
0.1853276399 KWK Quicksilver Resources Inc. Energy Oil & Gas Production
0.0013747206 MDR McDermott International, Inc. Capital Goods Metal Fabrications
0.0264661046 MEE NA NA NA
0.0729227215 MRO Marathon Oil Corporation Energy Oil & Gas Production
0.0023811506 MUR Murphy Oil Corporation Energy Integrated oil Companies
0.0412584481 NBL Noble Energy Inc. Energy Oil & Gas Production
0.0313167250 NBR Nabors Industries Ltd. Energy Oil & Gas Production
0.0563314301 NE Noble Corporation Energy Oil & Gas Production
0.1034268654 NFX Newfield Exploration Company Energy Oil & Gas Production
0.1926936035 NGS Natural Gas Services Group, Inc. Energy Oilfield Services/Equipment
0.1008675352 NOV National Oilwel Varcol, Inc. Energy Metal Fabrications
0.0991873145 NXY NA NA NA
0.0508974751 OII Oceaneering International, Inc. Energy Oilfield Services/Equipment
0.0993800241 OIS Oil States International, Inc. Energy Metal Fabrications
0.0405026646 OXY Occidental Petroleum Corporation Energy Oil & Gas Production
0.0495609266 PDE NA NA NA
0.1062543807 PKD Parker Drilling Company Energy Oil & Gas Production
0.1877221709 PQ Petroquest Energy Inc Energy Oil & Gas Production
0.0670664696 PVA Penn Virginia Corporation Energy Oil & Gas Production
0.0527486302 PXD Pioneer Natural Resources Company Energy Oil & Gas Production
0.1055507210 PXP Plains Exploration & Production Company Energy Oil & Gas Production
0.1053442278 RDC Rowan Companies plc Energy Oil & Gas Production
0.0690536207 RES RPC, Inc. Energy Oilfield Services/Equipment
0.1143500050 RIG Transocean Ltd. Energy Oil & Gas Production
0.1532407298 RRC Range Resources Corporation Energy Oil & Gas Production
0.1101877769 SFY Swift Energy Company Energy Oil & Gas Production
0.0715891522 SGY Stone Energy Corporation Energy Oil & Gas Production
0.0223880237 SII NA NA NA
0.0534550114 SJT San Juan Basin Royalty Trust Energy Oil & Gas Production
0.1063493615 SM SM Energy Company Energy Oil & Gas Production
0.1062401621 SPN Superior Energy Services, Inc. Energy Oilfield Services/Equipment
0.1152037963 SU Suncor Energy Inc. Energy Integrated oil Companies
0.1026210927 SUN NA NA NA
0.2121555712 SWN Southwestern Energy Company Energy Oil & Gas Production
0.0702312918 TDW Tidewater Inc. Consumer Services Marine Transportation
0.1125170113 TLM Talisman Energy Inc. Energy Oil & Gas Production
0.0005959932 TMR NA NA NA
0.0131588866 TS Tenaris S.A. Basic Industries Steel/Iron Ore
0.2152177841 TSO Tesoro Corporation Energy Integrated oil Companies
0.0679062447 TTI Tetra Technologies, Inc. Energy Oil & Gas Production
0.1174076026 UNT Unit Corporation Energy Oil & Gas Production
0.2088031326 UPL Ultra Petroleum Corp. Energy Oil & Gas Production
0.0273802381 USU USEC Inc. Basic Industries Mining & Quarrying of Nonmetallic Minerals (No Fuels)
0.1610340003 VLO Valero Energy Corporation Energy Integrated oil Companies
0.0068010930 WFT Weatherford International, Ltd Energy Oil & Gas Production
0.0952688259 WLL Whiting Petroleum Corporation Energy Oil & Gas Production
0.0668893737 WMB Williams Companies, Inc. (The) Public Utilities Natural Gas Distribution
0.0138315166 XEC Cimarex Energy Co Energy Oil & Gas Production
0.1427017005 XTO NA NA NA
[[2]]
weight symbol name sector industry
0.12797710 ABX Barrick Gold Corporation Basic Industries Precious Metals
0.17644436 AEM Agnico Eagle Mines Limited Basic Industries Precious Metals
0.11593957 ASA ASA Gold and Precious Metals Limited n/a n/a
0.15675978 AU AngloGold Ashanti Limited Basic Industries Precious Metals
0.24193015 AUY Yamana Gold Inc. Basic Industries Precious Metals
0.17403800 BVN Buenaventura Mining Company Inc. Basic Industries Precious Metals
0.01358500 CCJ Cameco Corporation Basic Industries Precious Metals
0.40112842 CDE Coeur Mining, Inc. Basic Industries Precious Metals
0.32590838 EGO Eldorado Gold Corporation Basic Industries Precious Metals
0.07069435 FCX Freeport-McMoran Copper & Gold, Inc. Basic Industries Precious Metals
0.16373700 FDG NA NA NA
0.20941091 GFI Gold Fields Limited Basic Industries Precious Metals
0.20107216 GG Goldcorp Inc. Basic Industries Precious Metals
0.10346601 GRS NA NA NA
0.39352265 HL Hecla Mining Company Basic Industries Mining & Quarrying of Nonmetallic Minerals (No Fuels)
0.25090552 HMY Harmony Gold Mining Company Limited Basic Industries Precious Metals
0.16994222 IAG Iamgold Corporation Basic Industries Precious Metals
0.01370509 IVN NA NA NA
0.25820797 KGC Kinross Gold Corporation Basic Industries Precious Metals
0.15124020 NEM Newmont Mining Corporation Basic Industries Precious Metals
0.08798759 RZ NA NA NA
0.03131700 SLW Silver Wheaton Corp Basic Industries Precious Metals
0.29394621 SWC Stillwater Mining Company Basic Industries Precious Metals
[[3]]
weight symbol name sector industry
0.096875336 ABV Companhia de Bebidas das Americas - AmBev Consumer Non-Durables Beverages (Production/Distribution)
0.326311357 BAK Braskem S.A. Basic Industries Major Chemicals
0.250725988 BBD Banco Bradesco Sa Finance Major Banks
0.166407433 BRFS BRF S.A. Consumer Non-Durables Meat/Poultry/Fish
0.307049892 BTM NA NA NA
0.089442235 CBD Companhia Brasileira de Distribuicao Consumer Services Food Chains
0.321156820 CIG Comp En De Mn Cemig ADS Public Utilities Electric Utilities: Central
0.116259858 CPL CPFL Energia S.A. Public Utilities Electric Utilities: Central
0.144458384 CYD China Yuchai International Limited Energy Industrial Machinery/Components
0.309964150 ELP Companhia Paranaense de Energia (COPEL) Public Utilities Electric Utilities: Central
0.001868534 ERJ Embraer-Empresa Brasileira de Aeronautica Capital Goods Aerospace
0.181906131 EWZ NA NA NA
0.087424791 FBR Fibria Celulose S.A. Basic Industries Paper
0.274446999 GGB Gerdau S.A. Capital Goods Steel/Iron Ore
0.050549732 GOL Gol Linhas Aereas Inteligentes S.A. Transportation Air Freight/Delivery Services
0.031581881 GPK Graphic Packaging Holding Company Consumer Durables Containers/Packaging
0.002041877 GTI GrafTech International Ltd Energy Industrial Machinery/Components
0.035948636 ILF NA NA NA
0.230162376 ITUB Itau Unibanco Banco Holding SA Finance Major Banks
0.166407433 PDA NA NA NA
0.177097305 SBS Companhia de saneamento Basico Do Estado De Sao Paulo - Sabesp Public Utilities Water Supply
0.026647018 TAR NA NA NA
0.199002737 TBH NA NA NA
0.211402025 TNE NA NA NA
0.104866320 TSP NA NA NA
0.239683225 TSU TIM Participacoes S.A. Public Utilities Telecommunications Equipment
0.094630001 VALE VALE S.A. Basic Industries Precious Metals
0.249716544 VIV Telefonica Brasil S.A. Public Utilities Telecommunications Equipment
[[4]]
weight symbol name sector industry
0.001306744 AIT Applied Industrial Technologies, Inc. Consumer Durables Industrial Specialties
0.003829347 ALY NA NA NA
0.023222920 ELN Elan Corporation, plc Consumer Durables Major Pharmaceuticals
0.151106602 KKD Krispy Kreme Doughnuts, Inc. Consumer Non-Durables Food Chains
0.000329151 MAG NA NA NA
0.988199944 SGU Star Gas Partners, L.P. Consumer Services Other Specialty Stores
0.008478258 TGI Triumph Group, Inc. Capital Goods Aerospace
The cardinality of each component varies greatly in this analysis: there are 92 stocks in the first component, which consists mostly of oil and gas production and some coal companies. There are 23 stocks in the second component, which is about gold and other precious metals. The stocks of the third component come from various sectors, but the common factor seems to be that most of them are Brazilian. No structure is apparent in the fourth component, which essentially consists of Star Gas Partners and Krispy Kreme -- maybe the Star Gas employees like doughnuts?
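Both the per-component cardinality and the degree of orthogonality can be read off the loading matrix directly. The sketch below uses a small hand-made matrix W as a stand-in for the loadings of a fitted model (the values are assumptions chosen for illustration): it counts the non-zero entries per column and inspects the off-diagonal entries of the Gram matrix, which are zero for exactly orthogonal axes.

```r
# Sketch only: a hand-made loading matrix with three sparse, non-negative axes
W <- cbind(c(0.6, 0.8, 0,   0,   0),
           c(0,   0,   1,   0,   0),
           c(0,   0.1, 0,   0.7, 0.7))

# Normalize every axis to unit length
W <- sweep(W, 2, sqrt(colSums(W^2)), "/")

# Cardinality: number of non-zero loadings per component
cardinality <- colSums(W != 0)
print(cardinality)

# Gram matrix: the identity for orthonormal axes
G <- crossprod(W)
off_diag <- G[upper.tri(G)]
print(round(G, 3))

# Largest cosine between two different axes
max_overlap <- max(abs(off_diag))
print(max_overlap)
```

Here the first and third axes share one variable, so they overlap slightly; a small maximum cosine like this is what "essentially orthogonal" means in practice.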
Different random seeds reveal different sets of components with similar cumulative variances (including the mining and homebuilding components from the first analysis), but I stop here and leave further analysis to people who know what they are doing.
References
L. W. Mackey (2009). Deflation Methods for Sparse PCA. Advances in Neural Information Processing Systems, Vol. 21.
H. Markowitz (1952). Portfolio Selection. The Journal of Finance, Vol. 7, No. 1, pp. 77-91.