There is a whole cottage industry devoted to showing that neural networks are universal function approximators, under fairly general conditions and with fairly general nonlinearities as activations. Even so, you might like to play with the precise form of the nonlinearities, or even make them directly learnable, because some function shapes might have better approximation properties, in a sense I will not trouble to make rigorous now, vague hand-waving arguments being the whole point of deep learning.
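To make "directly learnable" concrete, here is a minimal sketch of a parametric ReLU in the spirit of HZRS15b, where the slope on the negative half-line is a trainable parameter. The function names, the toy data, the learning rate and the starting slope are all assumptions chosen for illustration, and the fit uses plain gradient descent on a single scalar rather than anything resembling a real training loop.

```python
def prelu(x, a):
    """Parametric ReLU: identity for x > 0, slope `a` for x <= 0."""
    return x if x > 0 else a * x


def prelu_grad_a(x):
    """Derivative of prelu(x, a) with respect to the slope `a`."""
    return 0.0 if x > 0 else x


def fit_slope(xs, ys, a=0.25, lr=0.1, steps=200):
    """Fit the slope `a` by gradient descent on mean squared error.

    The starting slope, learning rate and step count are illustrative
    assumptions, not recommended settings.
    """
    n = len(xs)
    for _ in range(steps):
        grad = sum(
            2.0 * (prelu(x, a) - y) * prelu_grad_a(x) for x, y in zip(xs, ys)
        ) / n
        a -= lr * grad
    return a


# Toy data generated by a leaky ReLU with a hypothetical true slope of 0.1.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [prelu(x, 0.1) for x in xs]
learned_a = fit_slope(xs, ys)
```

Only the negative inputs carry any gradient for `a`, so the activation's shape is adjusted by exactly the part of the data that exercises it; in a real network the same derivative would simply be one more term in backpropagation.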

Nonetheless, here are some handy references.

## Refs

- HZRS15a: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2015a) Deep Residual Learning for Image Recognition
- GlBB11: Xavier Glorot, Antoine Bordes, Yoshua Bengio (2011) Deep Sparse Rectifier Neural Networks. In Aistats (Vol. 15, p. 275).
- HZRS15b: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2015b) Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. *ArXiv:1502.01852 [Cs]*.
- ClUH16: Djork-Arné Clevert, Thomas Unterthiner, Sepp Hochreiter (2016) Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). In Proceedings of ICLR.
- WPHL16: Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, Les Atlas (2016) Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems (pp. 4880–4888).
- HBFS01: Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber (2001) Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies. In A field guide to dynamical recurrent neural networks. IEEE Press.
- SrGS15: Rupesh Kumar Srivastava, Klaus Greff, Jürgen Schmidhuber (2015) Highway Networks. In arXiv:1505.00387 [cs].
- HZRS16: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2016) Identity Mappings in Deep Residual Networks. In arXiv:1603.05027 [cs].
- AHSB15: Forest Agostinelli, Matthew Hoffman, Peter Sadowski, Pierre Baldi (2015) Learning Activation Functions to Improve Deep Neural Networks. In Proceedings of International Conference on Learning Representations (ICLR) 2015.
- GWMC13: Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, Yoshua Bengio (2013) Maxout Networks. In ICML (3) (Vol. 28, pp. 1319–1327).
- PaMB13: Razvan Pascanu, Tomas Mikolov, Yoshua Bengio (2013) On the difficulty of training Recurrent Neural Networks. In arXiv:1211.5063 [cs] (pp. 1310–1318).
- MaHN13: Andrew L. Maas, Awni Y. Hannun, Andrew Y. Ng (2013) Rectifier nonlinearities improve neural network acoustic models. In Proceedings of ICML (Vol. 30).
- KUMH17: Günter Klambauer, Thomas Unterthiner, Andreas Mayr, Sepp Hochreiter (2017) Self-Normalizing Neural Networks. *ArXiv:1706.02515 [Cs, Stat]*.
- BFLL17: David Balduzzi, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt Wan-Duo Ma, Brian McWilliams (2017) The Shattered Gradients Problem: If resnets are the answer, then what is the question? In PMLR (pp. 342–350).
- Hoch98: Sepp Hochreiter (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. *International Journal of Uncertainty Fuzziness and Knowledge Based Systems*, 6, 107–115.
- GlBe10: Xavier Glorot, Yoshua Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Aistats (Vol. 9, pp. 249–256).
- ArSB16: Martin Arjovsky, Amar Shah, Yoshua Bengio (2016) Unitary Evolution Recurrent Neural Networks. In Proceedings of the 33rd International Conference on Machine Learning – Volume 48 (pp. 1120–1128). New York, NY, USA: JMLR.org.