So i got this DB-server which does nothing but work out NetFlow files with flow-tools. Since i want that data to be as live as possible for my customers, i’ve decided to go with a 15-minute rotation interval. This leaves me with 96 rows per day per IP. Since i got a /20, we’ve got to multiply 96 rows with 4096 IP-Addresses.
So i get 393.216 rows per day. That leaves me with ~12 million rows per month.
“What’s the issue?” you may ask. Well when you have a customer interface where you want to show live-traffic, selecting over (let’s say mid-month) 6 million rows, summing input, output and total bytes takes quite a moment.
Sure, i could throw hardware after this problem, but why waste money when you can gain performance by updating your schema.
I came up with the following PostgreSQL Trigger:
What this does? It takes the rows that get inserted into the table we put this trigger on, and updates (or creates) rows in the table traffic_$YEAR_$MONTH_per_ip to match the latest state of the traffic.
After that you have both, 15-minute-intervals and a cumulated total over the current month for every IP, where as the cumulated stuff from traffic_$YEAR_$MONTH_per_ip is a LOT faster than sum() input, output and total.
Why not a simple update-query in my code where i insert? Well in my humble opinion it’s the DBs job to do such things. Since the data already gets pushed into the server, why not let the DB handle the update/create-queries? This is just good DB-design.
Surely i will NOT save you a few weeks of work and paste my rotate.pl which handles everything. But hey, where’s the fun in not learning new stuff? ;)