SoFunction
Updated on 2024-11-07

Ways to Filter Placeholder Files Ending in _$folder$ on S3 Using Regular Expressions

When we use the command line to batch copy files from S3 or to count files, we often want to exclude the placeholder files on S3 whose names end in _$folder$. How should the regular expression for this be written?

Shell Implementation

The following command counts the files under a given S3 location, excluding those whose names end in _$folder$:

aws s3 ls --recursive s3://my-s3-location/ | grep -v '_\$folder\$$' | wc -l

Filtering with grep is relatively simple because grep has the -v, --invert-match option: it inverts the match, so every line that matches the pattern is filtered out.
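The effect of -v can be checked locally without touching S3. The following is a minimal sketch: the listing lines are made-up samples shaped like `aws s3 ls --recursive` output, and the function names are only for illustration.

```shell
# Made-up lines shaped like `aws s3 ls --recursive` output.
sample_listing() {
    printf '%s\n' \
        '2023-12-05 10:00:00          0 usertable_$folder$' \
        '2023-12-05 10:00:01       1024 usertable/part-00000' \
        '2023-12-05 10:00:02          0 usertable/archive_$folder$' \
        '2023-12-05 10:00:03       2048 usertable/part-00001'
}

# -v inverts the match: only lines NOT ending in _$folder$ are kept.
# \$ is a literal dollar sign; the final unescaped $ anchors at end of line.
count_real_files() {
    sample_listing | grep -v '_\$folder\$$' | wc -l
}

count_real_files
```

Two of the four sample lines are placeholders, so the count is 2 (some wc implementations pad the number with spaces).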

Java Implementation

In contrast, this is a little harder to write for a Java program, because the Java regex API has no "invert match" option. The exclusion has to be expressed inside the pattern itself, using a negative lookahead: ^(?!.*[_]\$folder\$$).*$ Take the s3-dist-cp command as an example: its --srcPattern parameter is a Java regular expression that selects the files to be copied. To exclude those annoying S3 placeholder files ending in _$folder$ from the copy, it should be written like this:

nohup s3-dist-cp \
    -=599 \
    --src=s3://my-hbase-snapshots/usertable-20231205 \
    --dest=hdfs://${SINK_CLUSTER_NAMENODES}:8020/user/hbase/ \
    --srcPattern='^(?!.*[_]\$folder\$$).*$' \
    --multipartUploadChunkSize=1024 &>  &
tail -f 
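The negative lookahead can be tried out locally before running a copy job. The sketch below assumes GNU grep built with PCRE support (the -P option); PCRE is not Java's regex engine, but the lookahead syntax used in this pattern is the same in both. The sample keys and function names are made up.

```shell
# Made-up object keys; two of them are placeholder files.
sample_keys() {
    printf '%s\n' \
        'usertable-20231205_$folder$' \
        'usertable-20231205/data.manifest' \
        'usertable-20231205/archive_$folder$' \
        'usertable-20231205/archive/part-00000'
}

# (?!...) is a negative lookahead: at the start of the line it asserts that
# the line does NOT end in _$folder$; the trailing .*$ then matches the line.
keep_non_placeholders() {
    sample_keys | grep -P '^(?!.*[_]\$folder\$$).*$'
}

keep_non_placeholders
```

Only the two keys that do not end in _$folder$ are printed, which is the same result grep -v gives without a lookahead.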

Supplement:

Regular expression text filtering

grep text filter

By default, grep matches and prints its input line by line.

By default, a line matches as long as it contains the pattern anywhere.

grep -w matches whole words only, unlike the default substring matching.

Word boundaries: letters, digits, and underscores all count as part of a word; any other character separates words.
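A small sketch of the difference between the default match and -w, using made-up sample words:

```shell
# Made-up sample lines.
words() {
    printf '%s\n' 'root' 'root_user' 'chroot' 'the root-dir'
}

# Default: substring match, so all four lines are printed.
words | grep root

# -w: root must be delimited by non-word characters or line boundaries.
# The underscore in root_user is a word character, so that line is dropped;
# the hyphen in root-dir is not, so that line is kept.
words | grep -w root
```

The second command prints only "root" and "the root-dir".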

grep -n root /etc/passwd

The -n option prefixes each matching line with its line number.

AND and OR relationships in grep

1. AND: grep root /etc/passwd | grep shutdown

2. OR: grep -e root -e shutdown /etc/passwd
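The two combinations can be compared on a few made-up passwd-style lines (the "ops" account is hypothetical and exists only so that one line contains both words):

```shell
# Made-up passwd-style lines.
accounts() {
    printf '%s\n' \
        'root:x:0:0:root:/root:/bin/bash' \
        'shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown' \
        'daemon:x:2:2:daemon:/sbin:/sbin/nologin' \
        'ops:x:1001:1001:root shutdown helper:/home/ops:/bin/sh'
}

# AND: the second grep only sees lines that already matched the first,
# so only the ops line (which contains both words) is printed.
accounts | grep root | grep shutdown

# OR: each -e adds an alternative pattern, so the root, shutdown
# and ops lines are printed; the daemon line is not.
accounts | grep -e root -e shutdown
```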

Regular expressions

1. Character Matching

. matches any single character. Inside [], a . stands for the literal dot itself.

2. Number of matches

The number of times a character occurs

* matches the character immediately before it any number of times, including zero.
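A short sketch of . and * on made-up sample strings:

```shell
# Made-up sample strings.
samples() {
    printf '%s\n' 'ac' 'abc' 'abbc' 'a.c'
}

# . matches exactly one arbitrary character: prints abc and a.c
samples | grep 'a.c'

# Inside [], the dot is literal: prints only a.c
samples | grep 'a[.]c'

# b* means zero or more b: prints ac, abc and abbc
samples | grep 'ab*c'
```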

3. Location anchoring

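Beginning of line: ^ anchors the pattern to the start of the line; it cannot match a string that begins in the middle of a line.

End of line: $ anchors the pattern to the end of the line; it cannot match a string that ends in the middle of a line.

Beginning of word: \<root matches root only where it is the leftmost part of a word.

End of word: root\> matches root only where it is the rightmost part of a word.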
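The four anchors can be compared side by side. This sketch uses made-up sample lines and assumes a grep that supports the \< and \> word anchors (GNU grep does; they are an extension rather than part of POSIX).

```shell
# Made-up sample lines.
sample_lines() {
    printf '%s\n' 'root:x:0' 'chroot' 'rootkit' 'see root'
}

sample_lines | grep '^root'    # line starts with root: root:x:0, rootkit
sample_lines | grep 'root$'    # line ends with root:   chroot, see root
sample_lines | grep '\<root'   # root begins a word:    root:x:0, rootkit, see root
sample_lines | grep 'root\>'   # root ends a word:      root:x:0, chroot, see root
```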

4. Grouping

1. echo wangwangwangggww | grep "\(wang\)\{3\}"

2. Backreferences
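A minimal sketch of what the group changes, and of a backreference, on made-up input strings:

```shell
# Made-up input strings.
inputs() {
    printf '%s\n' 'wangwangwang' 'wanggg'
}

# \(wang\)\{3\}: the whole group "wang" is repeated three times.
inputs | grep '\(wang\)\{3\}'    # prints wangwangwang

# wang\{3\}: without the group, \{3\} applies only to the preceding "g".
inputs | grep 'wang\{3\}'        # prints wanggg

# \1 refers back to whatever the first group matched.
printf '%s\n' 'abab' 'abba' | grep '^\(ab\)\1$'    # prints abab
```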

The Difference Between Regular Expressions and Wildcards

A regular expression matches text: the contents of a file or the lines a command writes to standard output. A wildcard matches file names. The two operate on different objects.
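The distinction can be seen by selecting the same files both ways. This is a sketch in a throwaway directory; the file names are made up.

```shell
# Throwaway directory with made-up file names; removed when the shell exits.
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT
touch "$demo_dir/a.txt" "$demo_dir/b.txt" "$demo_dir/notes.md"

# Wildcard: the shell expands *.txt against file names before ls runs.
glob_match() {
    ( cd "$demo_dir" && ls *.txt )
}

# Regular expression: grep filters the text that ls prints.
# The dot is escaped so that it matches only a literal ".".
regex_match() {
    ls "$demo_dir" | grep '\.txt$'
}

glob_match
regex_match
```

Both functions print a.txt and b.txt, but the first is resolved by the shell against the file system while the second is ordinary text filtering.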

Matching String Problems

When the output of a command is filtered, the regular expression treats each whole output line as a string, including invisible space characters.

Some commands pad their output with one or more spaces; others do not.
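A sketch of how invisible padding breaks an anchored pattern. The padded_count function is made up and merely imitates a command that right-aligns its output.

```shell
# Imitates a command that right-aligns a number with leading spaces.
padded_count() {
    printf '%7d\n' 5
}

# grep -c prints the number of matching lines. grep exits non-zero when
# nothing matches, hence the "|| true".
padded_count | grep -c '^5$' || true    # prints 0: the spaces are part of the line
padded_count | grep -c '^ *5$'          # prints 1: the pattern allows leading spaces
```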

1. In a basic regular expression, parentheses and curly braces must each be escaped with a backslash: write \( \) for a group and \{ \} for a repetition count.

grep "^\(.*\):.*\1$" /etc/passwd

2. By default, a regular expression is matched starting from the beginning of the string. If the pattern is anchored with $, the match must finish at the end of the line; if it is anchored with ^, the match must begin at the start of the line.

1. Anchored at the end of the line ($)

2. Anchored at the start of the line (^)

3. Examples of groupings

The first group captures the string 7; the trailing [0-9]*\1 then requires the match to end with that same 7, with any number of digits allowed in front of it.
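The points above can be tried on sample data. In this sketch the passwd-style lines are made up, and the digit pattern is a hypothetical one written to match the description of a captured 7, not necessarily the author's original example.

```shell
# Made-up passwd-style lines.
passwd_sample() {
    printf '%s\n' \
        'root:x:0:0:root:/root:/bin/bash' \
        'sync:x:5:0:sync:/sbin:/bin/sync' \
        'halt:x:7:0:halt:/sbin:/sbin/halt'
}

# The group captures the text before a colon; \1$ requires the line to end
# with that same text. Prints the sync and halt lines, not the root line.
passwd_sample | grep '^\(.*\):.*\1$'

# Hypothetical digit pattern: the group captures the first digit (7), and
# [0-9]*\1$ requires the line to end with that digit after any other digits.
printf '%s\n' '712347' '712348' | grep '^\([0-9]\)[0-9]*\1$'    # prints 712347
```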

Difference between basic and extended regular expressions

1. Basic regular expressions: parentheses and curly braces must be preceded by a \ to escape them.

grep -w "[0-9]\{2,3\}" /etc/passwd

2. Extended regular expressions: parentheses and curly braces are written without an escape character.

grep -Ew "[0-9]{2,3}" /etc/passwd

egrep -w "[0-9]{2,3}" /etc/passwd
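A quick check that the two syntaxes select the same lines, using made-up input instead of /etc/passwd:

```shell
# Made-up sample lines with numbers of different lengths.
numbers() {
    printf '%s\n' 'id 5' 'id 42' 'id 123' 'id 1234'
}

# Basic syntax: the braces are escaped.
numbers | grep -w '[0-9]\{2,3\}'

# Extended syntax (-E): the braces are written as-is.
numbers | grep -Ew '[0-9]{2,3}'
```

Both commands print "id 42" and "id 123": with -w, the two or three digits must form a whole word, so 5 is too short and 1234 is too long.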
