There are three ways to read Tensorflow data:
- Preloaded data: the data is embedded directly in the Graph as constants.
- Feeding: Python generates data and then feeds it to the backend.
- Reading from file: Reading directly from file.
What is the difference between these three reading methods? We first need to know how TensorFlow (TF) works.
The core of TF is written in C++, which has the advantage of running fast but the disadvantage of being inflexible to call. Python is the opposite, so TF combines the advantages of both languages. The core algorithms and the computational framework are written in C++ and exposed to Python through APIs; Python calls these APIs, designs the training model (the Graph), and then hands the designed Graph to the backend to execute. In short, Python's role is Design and C++'s role is Run.
I. Preloaded data
import tensorflow as tf

# Design Graph
x1 = tf.constant([2, 3, 4])
x2 = tf.constant([4, 0, 1])
y = tf.add(x1, x2)

# Open a session --> calculate y
with tf.Session() as sess:
    print(sess.run(y))
II. Feeding: Python generates the data and then feeds it to the backend
import tensorflow as tf

# Design Graph
x1 = tf.placeholder(tf.int16)
x2 = tf.placeholder(tf.int16)
y = tf.add(x1, x2)

# Generate data with Python
li1 = [2, 3, 4]
li2 = [4, 0, 1]

# Open a session --> feed data --> calculate y
with tf.Session() as sess:
    print(sess.run(y, feed_dict={x1: li1, x2: li2}))
Note: here x1 and x2 are just placeholders with no concrete values, so where do the values come from at run time? This is what the feed_dict parameter of sess.run() is for: it feeds the data generated by Python to the backend, which then computes y.
Disadvantages of these two approaches:
1. Preloading: the data is embedded directly into the Graph, and the Graph is then passed into the Session to run. When the amount of data is large, transmitting the Graph becomes an efficiency problem.
2. Feeding: placeholders stand in for the data, which is filled in only when the graph is run.
The first two methods are convenient, but they struggle with large datasets. Even with Feeding, the extra intermediate steps (data type conversion and so on) are not a small overhead. The better solution is to define the file-reading method inside the Graph, and let TF read the data from files by itself and decode it into a usable set of samples.
III. Reading from file: simply put, build the data-reading module into the Graph
1. Prepare the data by constructing three CSV files, named here A.csv, B.csv, and C.csv.
$ echo -e "Alpha1,A1\nAlpha2,A2\nAlpha3,A3" > A.csv
$ echo -e "Bee1,B1\nBee2,B2\nBee3,B3" > B.csv
$ echo -e "Sea1,C1\nSea2,C2\nSea3,C3" > C.csv
2. Single Reader, single sample
# -*- coding:utf-8 -*-
import tensorflow as tf

# Generate a FIFO queue and a QueueRunner to build the filename queue.
filenames = ['A.csv', 'B.csv', 'C.csv']
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)

# Define Reader
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

# Define Decoder
example, label = tf.decode_csv(value, record_defaults=[['null'], ['null']])
#example_batch, label_batch = tf.train.shuffle_batch([example, label], batch_size=1, capacity=200, min_after_dequeue=100, num_threads=2)

# Run Graph
with tf.Session() as sess:
    coord = tf.train.Coordinator()                        # Create a coordinator to manage threads
    threads = tf.train.start_queue_runners(coord=coord)   # Start the QueueRunners; the filename queue is now filled.
    for i in range(10):
        # example and label are fetched by separate eval() calls, so they can get out of sync
        print(example.eval(), label.eval())
    coord.request_stop()
    coord.join(threads)
Note: tf.train.shuffle_batch is not used here, so example and label are fetched by separate eval() calls; as a result the generated samples and labels do not correspond to each other and the order is scrambled. The result is as follows:
Alpha1 A2
Alpha3 B1
Bee2 B3
Sea1 C2
Sea3 A1
Alpha2 A3
Bee1 B2
Bee3 C1
Sea2 C3
Alpha1 A2
Solution: use tf.train.shuffle_batch, and the generated samples and labels will correspond.
# -*- coding:utf-8 -*-
import tensorflow as tf

# Generate a FIFO queue and a QueueRunner to build the filename queue.
filenames = ['A.csv', 'B.csv', 'C.csv']
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)

# Define Reader
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

# Define Decoder
example, label = tf.decode_csv(value, record_defaults=[['null'], ['null']])
example_batch, label_batch = tf.train.shuffle_batch(
    [example, label], batch_size=1, capacity=200, min_after_dequeue=100, num_threads=2)

# Run Graph
with tf.Session() as sess:
    coord = tf.train.Coordinator()                        # Create a coordinator to manage threads
    threads = tf.train.start_queue_runners(coord=coord)   # Start the QueueRunners; the filename queue is now filled.
    for i in range(10):
        e_val, l_val = sess.run([example_batch, label_batch])
        print(e_val, l_val)
    coord.request_stop()
    coord.join(threads)
3. Single Reader, multiple samples, realized mainly through tf.train.batch (tf.train.shuffle_batch works the same way)
# -*- coding:utf-8 -*-
import tensorflow as tf

filenames = ['A.csv', 'B.csv', 'C.csv']
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
example, label = tf.decode_csv(value, record_defaults=[['null'], ['null']])

# tf.train.batch() adds an extra sample queue and a QueueRunner. Decoded data goes into
# this queue and is then dequeued in batches.
# Although there is only one Reader, multiple threads can be used; more threads improve
# the reading speed up to a point, but more threads is not always better.
example_batch, label_batch = tf.train.batch(
    [example, label], batch_size=5)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(10):
        e_val, l_val = sess.run([example_batch, label_batch])
        print(e_val, l_val)
    coord.request_stop()
    coord.join(threads)
Note: with the following way of writing, batch_size samples are extracted at a time, but the features and labels still do not stay in sync with each other.
# -*- coding:utf-8 -*-
import tensorflow as tf

filenames = ['A.csv', 'B.csv', 'C.csv']
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
example, label = tf.decode_csv(value, record_defaults=[['null'], ['null']])

# tf.train.batch() adds an extra sample queue and a QueueRunner, as above.
example_batch, label_batch = tf.train.batch(
    [example, label], batch_size=5)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(10):
        # Two separate eval() calls mean two separate dequeues: features and labels drift apart.
        print(example_batch.eval(), label_batch.eval())
    coord.request_stop()
    coord.join(threads)
The output is as follows; you can see that there is no correspondence between features and labels:
['Alpha1' 'Alpha2' 'Alpha3' 'Bee1' 'Bee2'] ['B3' 'C1' 'C2' 'C3' 'A1']
['Alpha2' 'Alpha3' 'Bee1' 'Bee2' 'Bee3'] ['C1' 'C2' 'C3' 'A1' 'A2']
['Alpha3' 'Bee1' 'Bee2' 'Bee3' 'Sea1'] ['C2' 'C3' 'A1' 'A2' 'A3']
4. Multiple Readers, multiple samples
# -*- coding:utf-8 -*-
import tensorflow as tf

filenames = ['A.csv', 'B.csv', 'C.csv']
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
record_defaults = [['null'], ['null']]

# Define multiple Decoders, each connected to a Reader.
example_list = [tf.decode_csv(value, record_defaults=record_defaults)
                for _ in range(2)]  # number of Readers set to 2

# tf.train.batch_join() reads data with multiple Readers in parallel, one thread per Reader.
example_batch, label_batch = tf.train.batch_join(
    example_list, batch_size=5)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(10):
        e_val, l_val = sess.run([example_batch, label_batch])
        print(e_val, l_val)
    coord.request_stop()
    coord.join(threads)
tf.train.batch and tf.train.shuffle_batch read with a single Reader but can use multiple threads. tf.train.batch_join and tf.train.shuffle_batch_join can be set up for multi-Reader reading, using one thread per Reader. As for the efficiency of the two approaches: with a single Reader, 2 threads already reach the speed limit; with multiple Readers, 2 Readers reach the limit. So more threads is not always faster, and too many threads can actually make efficiency drop.
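tf.train.shuffle_batch_join is mentioned above but not demonstrated. Below is a minimal sketch of multi-Reader reading with shuffled batches, reusing the A.csv/B.csv/C.csv files and the batch/queue parameter values from the earlier examples; treat it as an illustration rather than a tuned configuration.

# -*- coding:utf-8 -*-
# Sketch: multi-Reader reading with shuffled batches via tf.train.shuffle_batch_join.
import tensorflow as tf

filenames = ['A.csv', 'B.csv', 'C.csv']
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
record_defaults = [['null'], ['null']]

# One decoded (example, label) pair per Reader slot; 2 Readers here.
example_list = [tf.decode_csv(value, record_defaults=record_defaults) for _ in range(2)]

# Like tf.train.batch_join, but the emitted batches are shuffled.
# batch_size/capacity/min_after_dequeue reuse the illustrative values from the examples above.
example_batch, label_batch = tf.train.shuffle_batch_join(
    example_list, batch_size=5, capacity=200, min_after_dequeue=100)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(10):
        e_val, l_val = sess.run([example_batch, label_batch])
        print(e_val, l_val)
    coord.request_stop()
    coord.join(threads)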
5. Iteration control: set the num_epochs parameter to specify how many epochs the samples can be used for during training
# -*- coding:utf-8 -*-
import tensorflow as tf

filenames = ['A.csv', 'B.csv', 'C.csv']
# num_epochs: set the number of iterations
filename_queue = tf.train.string_input_producer(filenames, shuffle=False, num_epochs=3)
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
record_defaults = [['null'], ['null']]

# Define multiple Decoders, each connected to a Reader.
example_list = [tf.decode_csv(value, record_defaults=record_defaults)
                for _ in range(2)]  # number of Readers set to 2

# tf.train.batch_join() reads data with multiple Readers in parallel, one thread per Reader.
example_batch, label_batch = tf.train.batch_join(
    example_list, batch_size=1)

# Initialize local variables (num_epochs creates a local epoch counter)
init_local_op = tf.initialize_local_variables()

with tf.Session() as sess:
    sess.run(init_local_op)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    try:
        while not coord.should_stop():
            e_val, l_val = sess.run([example_batch, label_batch])
            print(e_val, l_val)
    except tf.errors.OutOfRangeError:
        print('Epochs Complete!')
    finally:
        coord.request_stop()
    coord.join(threads)
For the iteration control, remember to add tf.initialize_local_variables(). The official tutorial does not explain this, but without the initialization the run will report an error.
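As a side note, tf.initialize_local_variables() was later deprecated in favor of tf.local_variables_initializer(); if your TF 1.x version warns about the old name, the replacement is a drop-in. A minimal sketch:

import tensorflow as tf

# tf.local_variables_initializer() is the newer name for the same op; it initializes
# local variables such as the epoch counter created by num_epochs.
init_local_op = tf.local_variables_initializer()

with tf.Session() as sess:
    sess.run(init_local_op)
    # ... start the Coordinator and QueueRunners as in the example above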
For traditional machine learning, say a classification problem, [x1, x2, x3] are the features; for a binary classification problem, the label after one-hot encoding will be [0, 1] or [1, 0]. In general, we organize the data in a CSV file, one line per sample, and then read the data with a queue.
Explanation: for this data, the first three columns are the features; since it is a classification problem, the last two columns are the label obtained after one-hot encoding.
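The data file itself is not reproduced here. As a purely illustrative sketch (the file name data.csv and the values are assumptions), a file with this layout, three integer feature columns plus a two-column one-hot label per line, could be generated like so:

# Sketch: write an assumed data.csv with 3 integer features and a 2-column one-hot label
# per line. The rows below are made-up sample values.
rows = [
    (1, 2, 3, 0, 1),
    (4, 5, 6, 1, 0),
    (7, 8, 9, 0, 1),
]
with open('data.csv', 'w') as f:
    for row in rows:
        f.write(','.join(str(v) for v in row) + '\n')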
The code to read this csv file using the queue is as follows:
# -*- coding:utf-8 -*-
import tensorflow as tf

# Generate a FIFO queue and a QueueRunner to build the filename queue.
filenames = ['data.csv']   # the CSV file described above
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)

# Define Reader
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

# Define Decoder
record_defaults = [[1], [1], [1], [1], [1]]
col1, col2, col3, col4, col5 = tf.decode_csv(value, record_defaults=record_defaults)
features = tf.stack([col1, col2, col3])
label = tf.stack([col4, col5])
example_batch, label_batch = tf.train.shuffle_batch(
    [features, label], batch_size=2, capacity=200, min_after_dequeue=100, num_threads=2)

# Run Graph
with tf.Session() as sess:
    coord = tf.train.Coordinator()                        # Create a coordinator to manage threads
    threads = tf.train.start_queue_runners(coord=coord)   # Start the QueueRunners; the filename queue is now filled.
    for i in range(10):
        e_val, l_val = sess.run([example_batch, label_batch])
        print(e_val, l_val)
    coord.request_stop()
    coord.join(threads)
Description:
record_defaults = [[1], [1], [1], [1], [1]]
is the parsing template: each sample has 5 columns, which are separated by ',' in the data by default. The parsing standard [1] means the value in that column is parsed as an integer; [1.0] would parse it as a float, and ['null'] as a string.
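As an illustration of how record_defaults controls the parsed types, here is a minimal sketch that parses a single CSV line with three float columns and two string columns; the column layout and values are assumptions for demonstration only, not the file used above.

import tensorflow as tf

# Sketch: record_defaults decides the type of each parsed column.
line = tf.constant("1.0,2.5,3.3,yes,no")                     # one CSV-formatted record
record_defaults = [[1.0], [1.0], [1.0], ['null'], ['null']]  # 3 floats, 2 strings
f1, f2, f3, s1, s2 = tf.decode_csv(line, record_defaults=record_defaults)

with tf.Session() as sess:
    print(sess.run([f1, f2, f3, s1, s2]))  # floats for the first three, byte strings for the last two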
This is the whole content of this article.