Extracting genome-specific positional sequences using Python scripts

introductory

In genome analysis, we often have a need to extract some sequences out of a fasta file. Sometimes these sequences are complete sequences, and sometimes they are only part of a sequence in the original fasta file. Especially when there is a lot of data, using the naked eye to select the sequence will be very difficult, then this time we can go through a simple programming to achieve.

Here, for example, the whole genome sequence of a species (0-refer/ Bacillus_subtilis.), and its genome gff annotation file (0-refer/ Bacillus_subtilis.) are given in the web attachment.

Suppose here we study the species and locate some specific genes through the gene function description field in the gff annotation file, together with a review of relevant information, etc.

Next we expect to find and extract these genes in the whole genome sequence fasta file based on the description of their locations in the gff file to get a new fasta file that contains only the target gene sequences.

Please write a script using python3 that will do this.

typical example

A sample script is shown below (see the web attachment "seq_select1.py").

In order to accomplish the above, we first need to prepare a txt file (hereafter referred to as a list file, an example can be found in the web attachment), and based on the gene location information recorded in the gff file, populate it with something like the following (columns are separated by tabs).

# Save the following to
gene46   NC_000964.3  42917  43660  +
NP_387934.1  NC_000964.3    59504  60070    +
yfmC  NC_000964.3  825787   826734  -
cds821  NC_000964.3  885844  886173 -

In column 1, give a name to the new sequence to be acquired;

Column 2, the original sequence ID where the sequence to be fetched is located;

Column 3, the starting position of the sequence to be acquired in the original sequence;

Column 4, the termination position of the sequence to be acquired in the original sequence;

In column 5, the sequence to be acquired is located in the positive (+) or negative (-) strand of the original sequence.

After that the py script is edited based on the input files, i.e. the input fasta file and the contents of the list file that records the location of the sequence to be fetched.

Open the fasta file "Bacillus_subtilis.", use a loop to read the sequence ids and base sequences line by line, and merge all the bases of each sequence into one string; store the sequence ids and the merged base sequences in the form of a dictionary (dictionary style {'id' :'base'}).

Open the list file "", read its contents and store them in the dictionary. (The key of the dictionary is the content of column 1 in the list file; the value of the dictionary is the content of columns 2-5 in the list file, and divided by tab to get a list containing 4 characters representing the information in columns 2-5 in the list file respectively).

Finally, according to the sequence position information in the read list file, the target gene sequence is intercepted in the read genome. Since some gene sequences may be located in the negative strand of the genome, we need to get their reverse complementary sequences, so we first define a function rev(), which is used to get the reverse complementary sequences in the subsequent calls. When outputting the sequence name, you can also choose whether to output the position information of the sequence together (name_detail = True/False).

<pre class="r" style="overflow-wrap: break-word; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial; margin-top: 0px; margin-bottom: 10px; padding: 9.5px; border-radius: 4px; background-color: rgb(245, 245, 245); box-sizing: border-box; overflow: auto; font-size: 13px; line-height: 1.42857; color: rgb(51, 51, 51); word-break: break-all; border: 1px solid rgb(204, 204, 204); font-family: "Times New Roman";">#!/usr/bin/env python3

# -*- coding: utf-8 -*-
# Initial pass command
input_file = 'Bacillus_subtilis.'
list_file = ''
output_file = ''
name_detail = True
## Read the file
#Read the genome sequence
seq_file = {}
with open(input_file, 'r') as input_fasta:
    for line in input_fasta:
        line = ()
        if line[0] == '>':
            seq_id = ()[0]
            seq_file[seq_id] = ''
        else:
            seq_file[seq_id] += line
input_fasta.close()
#Read the list file
list_dict = {}
with open(list_file, 'r') as list_table:
    for line in list_table:
        if ():
            line = ().split('\t')
            list_dict[line[0]] = [line[1], int(line[2]) - 1, int(line[3]), line[4]]
list_table.close()
## Intercept the sequence and output it
# Define function to intercept reverse complementary
def rev(seq):
    base_trans = {'A':'T', 'C':'G', 'T':'A', 'G':'C', 'N':'N', 'a':'t', 'c':'g', 't':'a', 'g':'c', 'n':'n'}
    rev_seq = list(reversed(seq))
    rev_seq_list = [base_trans[k] for k in rev_seq]
    rev_seq = ''.join(rev_seq_list)
    return(rev_seq)
# Intercept the sequence and output
output_fasta = open(output_file, 'w')
for key,value in list_dict.items():
    if name_detail:
        print('>' + key, '[' + value[0], value[1] + 1, value[2], value[3] + ']', file = output_fasta)
    else:
        print('>' + key, file = output_fasta)
    seq = seq_file['>' + value[0]][value[1]:value[2]]
    if value[3] == '+':
        print(seq, file = output_fasta)
    elif value[3] == '-':
        seq = rev(seq)
        print(seq, file = output_fasta)
output_fasta.close()</pre>

Edit the script and run it to output a new fasta file "" where the sequence is the sequence of the target gene we want to get.

Expansion:

The attachment "seq_select.py" is a python3 script that adds a command-passing line to do I/O processing of the target file directly in the shell. The script specifies an input fasta sequence file and a list file with the locations of the extracted sequences, and the output is a new fasta file with the extracted sequences.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Import modules, initial pass of commands, variables, etc.
import argparse
parser = (description = '\nThis script is used to intercept sequences at specific locations in the genome, and requires additional input of a list file with information about the intercepted sequences', add_help = False, usage = '\npython3 seq_select.py -i [] -o [] -l [list]\npython3 seq_select.py --input [] --output [] --list [list]')
required = parser.add_argument_group('Mandatory')
optional = parser.add_argument_group('Optional')
required.add_argument('-i', '--input', metavar = '[]', help = 'Input file, fasta format', required = True)
required.add_argument('-o', '--output', metavar = '[]', help = 'Output file, fasta format', required = True)
required.add_argument('-l', '--list', metavar = '[list]', help = 'File for "new sequence name / original sequence ID where sequence is located / sequence start position / sequence end position / positive chain (+) or negative chain (-)", separated by tab', required = True)
optional.add_argument('--detail', action = 'store_true', help = 'If this parameter is present, display information about the position of the sequence in the original fasta in each sequence id of the output fasta', required = False)
optional.add_argument('-h', '--help', action = 'help', help = 'Helpful information')
args = parser.parse_args()
## Read the file
#Read the genome sequence
seq_file = {}
with open(, 'r') as input_fasta:
    for line in input_fasta:
        line = ()
        if line[0] == '>':
            seq_id = ()[0]
            seq_file[seq_id] = ''
        else:
            seq_file[seq_id] += line
input_fasta.close()
#Read the list file
list_dict = {}
with open(, 'r') as list_file:
    for line in list_file:
        if ():
            line = ().split('\t')
            list_dict[line[0]] = [line[1], int(line[2]) - 1, int(line[3]), line[4]]
list_file.close()
## Intercept the sequence and output it
# Define function to intercept reverse complementary
def rev(seq):
    base_trans = {'A':'T', 'C':'G', 'T':'A', 'G':'C', 'a':'t', 'c':'g', 't':'a', 'g':'c'}
    rev_seq = list(reversed(seq))
    rev_seq_list = [base_trans[k] for k in rev_seq]
    rev_seq = ''.join(rev_seq_list)
    return(rev_seq)
# Intercept the sequence and output
output_fasta = open(, 'w')
for key,value in list_dict.items():
    if :
        print('>' + key, '[' + value[0], value[1] + 1, value[2], value[3] + ']', file = output_fasta)
    else:
        print('>' + key, file = output_fasta)
    seq = seq_file['>' + value[0]][value[1]:value[2]]
    if value[3] == '+':
        print(seq, file = output_fasta)
    elif value[3] == '-':
        seq = rev(seq)
        print(seq, file = output_fasta)
output_fasta.close()

Applying the test file from the above example, run the script as follows.

#python3 seq_select.py -h
python3 seq_select.py -i Bacillus_subtilis. -l  -o  --detail

Source Code Extraction Link./s/1kUhBTmpDonCskwmpNIJPkA?pwd=ih9n

Extract code: ih9n

Above is the details of using Python script to extract the genome specified position sequence, more information about python extract genome position sequence please pay attention to my other related articles!