
Quickly analyzing 100 GB of data in Python with Vaex

Limitations of pandas in handling big data

Data science competitions now provide ever larger datasets, routinely tens or even hundreds of gigabytes, which tests both machine performance and data processing skills.

pandas is the most commonly used Python data processing tool. It copes well with medium-sized datasets (on the order of ten million rows), but once the data reaches hundreds of millions or billions of rows, pandas struggles and becomes very slow.

Hardware factors such as available memory play a part, but pandas' own processing model, which keeps all data in memory, also limits its ability to handle big data.

pandas can of course read data in batches via chunks, but this makes the processing logic more complex, and every step of the analysis still costs memory and time.

Below, pandas is used to read a 3.7 GB dataset (HDF5 format) with 4 columns and 100 million rows, and to compute the mean of the first column. My machine has an i7-8550U CPU and 8 GB of RAM; let's see how long the loading and computation take.

Dataset: example1.hdf5 (4 columns × 100 million rows, about 3.7 GB)

Read and calculate using pandas:
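The original screenshot is not reproduced here, so the following is a minimal sketch of the pandas version. The file name and HDF5 key are illustrative, and a pandas-compatible HDF5 file is assumed (the vaex-generated file at the end of the article uses a different internal layout):

import time
import pandas as pd

start = time.time()
# pd.read_hdf loads the entire file into memory before anything else can happen
df = pd.read_hdf('example1.hdf5', key='data')  # file name and key are illustrative
print(f'load: {time.time() - start:.1f} s')

start = time.time()
mean_val = df['col_1'].mean()  # computed on the in-memory DataFrame
print(f'mean of col_1: {mean_val}, compute: {time.time() - start:.1f} s')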

Looking at the process above, it took 15 seconds to load the data and 3.5 seconds to compute the mean, 18.5 seconds in total.

An HDF5 file is used here. HDF5 is a file storage format that, compared with CSV, is better suited to storing large amounts of data: it compresses well and reads and writes faster.

Switching to vaex, today's protagonist: how long does it take to read the same data and do the same mean calculation?

Read and calculate using vaex:
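Again as a minimal sketch, with the same illustrative file name:

import time
import vaex

start = time.time()
df = vaex.open('example1.hdf5')  # memory-maps the file; no data is read yet
print(f'open: {(time.time() - start) * 1000:.0f} ms')

start = time.time()
mean_val = df.mean(df.col_1)  # streams once through the mapped column
print(f'mean of col_1: {mean_val}, compute: {time.time() - start:.1f} s')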

Opening the file took 9 ms, which is negligible, and the mean calculation took about 1 s, so roughly 1 s in total.

Reading the same 100-million-row hdf5 dataset, why does pandas take more than 10 seconds while vaex takes close to zero?

The main reason is that pandas reads the data into memory before processing and computing, whereas vaex only memory-maps the data rather than actually reading it into memory. This is similar to Spark's lazy loading: data is loaded when it is used, not when it is declared.

So it doesn't matter how much data there is: 10 GB, 100 GB... vaex opens it almost instantly. The downside is that vaex's lazy loading only supports binary formats such as HDF5, Apache Arrow, Parquet and FITS; it does not support text files such as CSV, because a text file cannot be usefully memory-mapped as columnar data.

If memory mapping is unfamiliar, here is a short explanation; it is worth taking the time to understand:

Memory mapping establishes a one-to-one correspondence between a file's location on disk and an area of the same size in a process's logical address space. This correspondence is purely a logical concept and does not exist physically, because the process's logical address space itself does not physically exist. During memory mapping, no data is actually copied: the file is not loaded into memory but only logically placed there; in code, this means creating and initializing the relevant data structures (struct address_space).
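Python's standard library exposes the same mechanism through the mmap module; a tiny illustrative sketch (the file name is illustrative):

import mmap

# Map a file into the process's address space without reading it into RAM.
with open('example1.hdf5', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    header = mm[:8]  # only the touched pages are actually loaded by the OS
    print(header)
    mm.close()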

What is vaex

The comparison above shows that vaex has a clear speed advantage over pandas when processing big data. But however capable it is, it is nowhere near as well known as pandas; vaex is still a relative newcomer.

vaex is also a Python third-party library for data processing, and it can be installed with pip.

The official website's description of vaex can be summarized in three points:

  • vaex is a data table tool for processing and presenting data, similar to pandas;
  • vaex uses memory mapping and lazy computation, consumes (almost) no memory, and is suitable for processing big data;
  • vaex can perform statistical analysis and visualization on datasets of tens of billions of rows in seconds;

vaex's advantages:

  • Performance: handles massive tabular data, processing on the order of 10⁹ rows per second;
  • Lazy: computations are expressed virtually and evaluated on the fly, without wasting memory;
  • Zero memory copies: filtering/transforming/computing make no memory copies; data is streamed only when needed;
  • Visualization: visualization components are built in;
  • API: similar to pandas, with rich data processing and calculation functions;
  • Interoperable: works with Jupyter notebooks for flexible interactive visualization;

Install vaex

Use pip or conda for installation:
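pip install vaex
# or, via conda-forge
conda install -c conda-forge vaex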

Read data

vaex supports reading hdf5, csv, parquet and other files using its read methods. hdf5 files can be read lazily, while csv files can only be read into memory.

vaex's data reading functions:
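The original table of functions is not reproduced here; a sketch of the common entry points (file names are illustrative):

import numpy as np
import pandas as pd
import vaex

df = vaex.open('example1.hdf5')                      # lazy: memory-maps hdf5/arrow/parquet
df_csv = vaex.from_csv('example.csv', convert=True)  # csv is read into memory, optionally converted to hdf5
df_pd = vaex.from_pandas(pd.DataFrame({'x': [1, 2, 3]}))        # wrap an in-memory pandas DataFrame
df_arr = vaex.from_arrays(x=np.arange(3), y=np.arange(3) ** 2)  # build from numpy arrays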

Data processing

Sometimes we need to perform all sorts of transformations, filters and calculations on data, and with pandas, each step costs memory and time. We could use method chaining, but then the pipeline becomes hard to follow.

vaex uses zero memory for the whole process, because each step only produces an expression. An expression is a logical representation that is not executed; execution happens only at the final stage, when the result is generated. And the whole process streams the data, so no memory accumulates.
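As a minimal sketch of such a pipeline (column names follow the example dataset generated at the end of the article):

import vaex

df = vaex.open('example1.hdf5')
df_filtered = df[df.col_1 > 0.5]                # step 1: filtering - a view, no memory copy
mean_val = df_filtered.mean(df_filtered.col_2)  # step 2: calculation - evaluated here, streaming
print(mean_val)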

You can see that there are two steps above, filtering and calculation, and neither copies memory; delayed (lazy) evaluation is used throughout. If every intermediate step were actually materialized, the memory cost, to say nothing of the time cost, would be enormous.

vaex's statistical calculation functions:
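For example, each of these aggregations runs in a single streaming pass over the mapped data:

import vaex

df = vaex.open('example1.hdf5')
print(df.count(df.col_1))   # number of (non-missing) values
print(df.mean(df.col_1))    # mean
print(df.std(df.col_1))     # standard deviation
print(df.minmax(df.col_1))  # min and max in one pass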

Visualization

vaex also enables fast visualization; even with datasets of tens of billions of rows, it can still produce plots in seconds.

vaex's visualization functions:
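A sketch using the classic plot1d/plot helpers; vaex's plotting API has changed across versions, so treat these calls as indicative rather than definitive (matplotlib is required):

import vaex

df = vaex.open('example1.hdf5')
df.plot1d(df.col_1, limits=[0, 1])                    # 1-d histogram, binned out-of-core
df.plot(df.col_1, df.col_2, limits=[[0, 1], [0, 1]])  # 2-d density heatmap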

Conclusion

vaex is somewhat like a combination of Spark and pandas, and the larger the data, the more its advantages show. As long as your hard drive can hold the data, it can quickly analyze it.

vaex is still developing rapidly and is integrating more and more pandas features; it has about 5k stars on GitHub and huge growth potential.

Attachment: HDF5 dataset generation code (4 columns, 100 million rows)

import numpy as np
import pandas as pd
import vaex

# generate 100 million rows x 4 columns of random data
df = pd.DataFrame(np.random.rand(100000000, 4), columns=['col_1', 'col_2', 'col_3', 'col_4'])
# the csv file name was missing in the original; 'example.csv' is illustrative
df.to_csv('example.csv', index=False)
# convert the csv to hdf5 so vaex can memory-map it
vaex.from_csv('example.csv', convert='example1.hdf5')

Note: do not use pandas to generate the HDF5 file directly, as its format is incompatible with vaex.

This concludes this article on quickly analyzing 100 GB of data with Python Vaex. For more on analyzing large datasets with Vaex, see my previous articles; I hope you find them helpful!