Feature Engineering, Aggregation, and RAM
Recent Posts
No matter how powerful my computer’s hardware specifications are, my machine learning use cases always expand beyond the computer’s resource limits. I have two PCs, one Linux and one Windows, each with 64 GB of RAM. Neither is lightweight in computing power. Yet even when my raw dataset is only 1 GB, I ran out of RAM when doing aggregation for feature engineering!
Why should we aggregate data? Aggregation is essential because it allows us to align the unit of analysis with the specific use case at hand. For instance, when we want to model customer churn, we need to examine features at the customer level or higher. Real-life data often involves complex one-to-many relationships. For example, a bank customer may have numerous transactions associated with their multiple accounts. To make sense of this data, we must aggregate the features from the detailed transaction level up to the desired unit of analysis, such as the customer level. This aggregation can be achieved using various techniques, including simple operations like SUM, COUNT, MAX, or even more sophisticated aggregations.
When it comes to performing data aggregation, pandas stands out as one of the most popular tools utilized by data scientists. With its built-in functions for aggregation, pandas provides a convenient and powerful solution. However, it’s important to note that pandas does aggregation in-memory, meaning that all the data is processed at once. While this approach offers speed and efficiency, it also demands a substantial amount of RAM resources. As a result, the default methodology in pandas, although intuitive and fast, may encounter scalability issues when dealing with larger datasets.
In the following example, I create a single, simple aggregation feature for only 10,000 bank customers using pandas aggregation.
While the input data is only 200 KB, the aggregation transformation RAM usage peaks at 6 GB! In this example, pandas aggregation used 30 times more memory than the source data file. A bank with millions of customers can’t use this approach.
When we examine alternatives to pandas aggregation, a few options emerge. One approach is to avoid processing all the data at once, which can help mitigate some of the resource demands. However, this also introduces a trade-off in terms of processing speed. Another alternative involves pushing the aggregation task to the database. While this can be effective, it often requires writing complex and convoluted SQL queries, commonly referred to as “spaghetti SQL.” Lastly, caching the aggregated values is another possibility. This approach involves storing pre-calculated aggregated results, but it can result in significant storage space consumption within the database.
In conclusion, when seeking a feature platform for data aggregation, it is crucial to consider certain factors. Look for a platform that offers the simplicity and abstraction found in pandas, enabling ease of use and straightforward operations. Additionally, prioritize platforms that leverage the power of the database by automatically generating optimized SQL queries to offload processing. Finally, an ideal platform should intelligently cache feature values to optimize both processing speed and storage efficiency, striking a balance between performance and resource consumption.
Looking for a feature platform with pandas-like syntax, in-database computing, and optimized caching? Download the free featurebyte SDK https://docs.featurebyte.com/latest/