easyfm: An easy software suite for file manipulation of Next Generation Sequencing data on desktops

Storing and manipulating Next Generation Sequencing (NGS) file formats is an essential but difficult task in biological data analysis. The easyfm (easy file manipulation) toolkit (https://github.com/TaekAndBrendan/easyfm) makes manipulating commonly used NGS files more accessible to biologists. It enables them to perform end-to-end reproducible data analyses using a free standalone desktop application (available on Windows, Mac and Linux). Unlike existing tools (e.g. Galaxy), the Graphical User Interface (GUI)-based easyfm does not depend on any high-performance computing (HPC) system and can be operated without an internet connection. This specific benefit allows easyfm to seamlessly integrate visual and interactive representations of NGS files, supporting a wider scope of bioinformatics applications in the life sciences.

Author summary

The analysis and manipulation of NGS data for understanding biological phenomena is an increasingly important aspect of the life sciences. Yet most methods for analysing, storing and manipulating NGS data require complex command-line tools on HPC or web-based servers and have not yet been implemented in comprehensive, easy-to-use software. This is a major hurdle preventing more general application in the field of NGS data analysis and file manipulation. Here we present easyfm, a free standalone Graphical User Interface (GUI) software with Python support that helps novice users rapidly discover target sequences (or sequences of interest to the user) in NGS datasets. For user-friendliness and convenience, easyfm was developed with four work modules and a secondary GUI window (herein secondary window), covering different aspects of NGS data analysis (mainly focusing on FASTA files), including post-processing, filtering, format conversion, generating results, real-time log, and help. In combination with the executable tools (BLAST+ and BLAT) and Python, easyfm allows the user to set analysis parameters, select/extract regions of interest, examine the input and output results, and convert to a wide range of file formats. To help augment the functionality of existing web-based and command-line tools, easyfm, a self-contained program, comes with extensive documentation (hosted at https://github.com/TaekAndBrendan/easyfm) including a comprehensive step-by-step guide.

With the broad implementation of NGS technologies in the life sciences, genomics and transcriptomics sequencing data are generated at an unprecedented rate [1][2][3]. Rapid progress in NGS technologies has brought massively high-throughput sequencing data to support research questions across many research fields, enabling a new era of genomic research [2,3]. Simultaneously, this advancement has brought enormous challenges in data analysis, where efficient, standardized and consistent analyses are fundamental to maintaining reproducibility, especially for biologists [1,3]. However, many of the available tools for NGS data analysis require higher-order computational experience (e.g. various programming/scripting languages) and expensive infrastructure (adequate HPC facilities and Cloud computing), and lack GUIs, making them inaccessible to many researchers and cumbersome even for experienced biologists. Thus, the development of user-friendly standalone software for NGS data will accelerate the pace of research for scientists who have limited computer and bioinformatics experience.

NGS data processing often involves consecutive steps of trimming (including quality check), assembling, mapping, manipulating, converting and processing large files. FASTA [4] and FASTQ [5] file formats are generated by most NGS platforms, and SAM/BAM [6], BED [7], GFF/GTF [8], and VCF [9] files can be derived from FASTA and FASTQ files depending on the required analysis. The FASTA file, based on simple text, is the most basic format for reporting a sequence and is accepted by almost all sequence analysis programs. Each record starts with a ">" followed by the sequence name and a description of the sequence, and then the sequence itself (nucleic acids or amino acids). The FASTQ file, a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores, is the most widely used format in sequence analysis and NGS sequencers. Each record requires at least 4 lines: a header starting with "@" followed by the sequence identifier, the sequence, a "+" line (optionally repeating the identifier), and the quality scores. Conveniently, FASTQ files can also be converted to FASTA files, the most commonly used file format for NGS data that enables direct sequencing of target genes.

For the last decade, many HPC and Cloud-based NGS command-line programs or web-based platforms have wrapped popular high-level analysis and visualisation tools in an intuitive and appealing interface [15]. Galaxy (homepage: https://galaxyproject.org, main public server: https://usegalaxy.org, Australia: https://usegalaxy.org.au/) in particular has been successful in establishing itself as an analytics hub and an e-learning platform with global scientists, intending to produce accessible, reproducible and collaborative biological analyses [16,17]. Even with the huge achievements made in many analytical software packages and pipelines, further improvements in user-friendly standalone software are still required to facilitate the rapid discovery of meaningful sequences in very large data sets for novice users. To help augment the functionality of existing tools and allow for user-friendliness and convenience of NGS file manipulation, easyfm enables end-to-end file filtering, extracting and converting (FASTQ to FASTA) with a simple mouse click on desktops.
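To make the two record layouts and the FASTQ to FASTA conversion concrete, the following is a minimal sketch using only the Python standard library. It is not easyfm's own implementation; the function names (`read_fastq`, `fastq_to_fasta`, `open_text`) and the 60-column line width are assumptions chosen for illustration, and it assumes the common four-lines-per-record FASTQ layout.

```python
import gzip
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


def fastq_to_fasta(src: TextIO, dst: TextIO, width: int = 60) -> int:
    """Write every FASTQ record in src to dst as FASTA; return the record count."""
    count = 0
    for name, sequence, _quality in read_fastq(src):
        dst.write(f">{name}\n")
        # Wrap long sequences at a fixed width (hypothetical default: 60).
        for start in range(0, len(sequence), width):
            dst.write(sequence[start:start + width] + "\n")
        count += 1
    return count


def open_text(path: str) -> TextIO:
    """Open a plain or gzip-compressed (*.gz) text file for reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


if __name__ == "__main__":
    demo = io.StringIO("@read1 sample\nACGTACGT\n+\nIIIIHHHH\n@read2\nTTGCA\n+read2\n#####\n")
    out = io.StringIO()
    print(fastq_to_fasta(demo, out), "records converted")
    print(out.getvalue(), end="")
```

The quality line is validated and then discarded, which is why the conversion only works in one direction: a FASTA file carries no quality scores from which a FASTQ file could be rebuilt.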

easyfm, implemented in Python 3.7+, was developed with four work modules for querying/manipulating NGS data sources and generating various outcomes. Since anyone can use it from anywhere to analyse data and find target sequences easily without any coding, HPC and/or internet/web-server connection, we hope that easyfm will prove useful in a wide range of bioinformatics applications in the life sciences, including as teaching/learning material in the classroom.

[...] to execute, the user has full control over which input files (including compressed files: *.gz) and output files/folders are selected. easyfm also generates several output files (mostly tab-separated text files) that can be opened with standard text editors or Excel. To support the work modules, easyfm also has a secondary window (Project Folder, Help and Log) that integrates with the work modules (Fig 1). In addition, further assistance and information can be obtained via [...] (Stats), Open with Text Editor, Delete, and Create Folder (Fig 1B and 1C). The Help option is a resource intended to provide the end-user with information and support for the easyfm work modules, including the manual. To access additional information, the user can click any of the links in Help (Fig 1D). Furthermore, to combine advanced functionalities with an easy-to-use interface, the Log option provides real-time log reporting and monitoring for every executed job (Fig 1D). This can aid effective communication when reporting and resolving any program issues.

[...] Furthermore, purchasing commercial standalone software with a rich GUI (e.g. CLC Genomics Workbench and Geneious) and its licences is too expensive for many researchers and laboratories. To resolve these matters, easyfm provides a new Python-based free GUI for BLAST and more (Fig 2).
Users can explore all BLAST+ (v2.11.0) features by creating a local [...]

BLAT is one of the alignment algorithms developed for the pairwise analysis and comparison of biological sequences with the primary goal of inferring homology to discover the biological function of genomic sequences [21]. While BLAT is less sensitive than BLAST, BLAT has a few clear advantages over BLAST from a practical standpoint in speed and convenience [23]. Compared to pre-existing pairwise sequence alignment tools, BLAT performed ~500 times faster with mRNA/DNA alignments and ~50 times faster with protein/protein alignments [21]. BLAT can be used either as a web-based server-client program (https://genome.ucsc.edu/cgi-bin/hgBlat) or as a standalone command-line program [23], but not a user-friendly GUI.
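For orientation, the standalone command-line usage that a GUI front end stands in for can be sketched as follows. This is an illustrative sketch, not how easyfm itself invokes the tools: the helper names (`makeblastdb_cmd`, `blastn_cmd`, `blat_cmd`, `run_if_available`), the file names and the default thresholds are assumptions. The flags shown (`-in`, `-dbtype`, `-out`, `-query`, `-db`, `-evalue`, `-outfmt 6` for BLAST+, and `-minIdentity=` for BLAT) are standard options of those programs.

```python
import shutil
import subprocess
from typing import List


def makeblastdb_cmd(fasta: str, db_name: str, dbtype: str = "nucl") -> List[str]:
    """Command line that builds a local BLAST+ database from a FASTA file."""
    return ["makeblastdb", "-in", fasta, "-dbtype", dbtype, "-out", db_name]


def blastn_cmd(query: str, db_name: str, out_tsv: str, evalue: float = 1e-5) -> List[str]:
    """Command line for a nucleotide search with tab-separated (outfmt 6) output."""
    return ["blastn", "-query", query, "-db", db_name,
            "-evalue", str(evalue), "-outfmt", "6", "-out", out_tsv]


def blat_cmd(database: str, query: str, out_psl: str, min_identity: int = 90) -> List[str]:
    """Command line for standalone BLAT: blat database query [options] output.psl."""
    return ["blat", database, query, f"-minIdentity={min_identity}", out_psl]


def run_if_available(cmd: List[str]) -> bool:
    """Run cmd only if its executable is on PATH; return whether it was run."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True


if __name__ == "__main__":
    # Hypothetical file names, used only to show the resulting command lines.
    for cmd in (makeblastdb_cmd("reference.fa", "refdb"),
                blastn_cmd("query.fa", "refdb", "hits.tsv"),
                blat_cmd("reference.fa", "query.fa", "hits.psl")):
        print(" ".join(cmd))
```

Building each command as an argument list rather than a shell string avoids quoting problems with file paths that contain spaces, which are common on desktop systems.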

However, easyfm BLAT (v3.2.1) enables users to control all parameters with a simple mouse click (Fig 3A), which can be a great advantage for novice biologists. Along with the freely available easyfm BLAST, easyfm BLAT will simplify distributed computation pipelines to facilitate the rapid discovery of sequence similarities between NGS datasets. However, if the target genome and input sequences are large, the standalone command-line BLAT on an HPC system is more suitable for batch runs, and more efficient than the web- and GUI-based BLAT, because an HPC system provides more memory.

[...] bioinformatics/data analysis, and to quickly analyse results without being hampered by command line tools and HPC Secure Shell (SSH) connections.

Users can import any FASTA/Q files to index and extract the indexed ID with its sequence by double-clicking, matching Prefix ID and selecting a provided text file (Fig 5A).
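The idea of indexing a FASTA file and then extracting records by exact ID, by ID prefix, or from a list of IDs can be illustrated with a small in-memory sketch. It uses only the Python standard library and is not easyfm's implementation; the function names (`index_fasta`, `extract_by_ids`, `extract_by_prefix`) are hypothetical, and treating the first word of the header as the ID is an assumption.

```python
import io
from typing import Dict, Iterable, List, Optional, TextIO


def index_fasta(handle: TextIO) -> Dict[str, str]:
    """Map each sequence ID (first word after '>') to its full sequence."""
    parts: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            parts[current] = []
        elif current is not None:
            parts[current].append(line)  # sequence may span several lines
    return {name: "".join(chunks) for name, chunks in parts.items()}


def extract_by_ids(index: Dict[str, str], ids: Iterable[str]) -> Dict[str, str]:
    """Return the records whose IDs appear in ids (e.g. read from a text file)."""
    return {name: index[name] for name in ids if name in index}


def extract_by_prefix(index: Dict[str, str], prefix: str) -> Dict[str, str]:
    """Return the records whose IDs start with prefix."""
    return {name: seq for name, seq in index.items() if name.startswith(prefix)}


if __name__ == "__main__":
    fasta = io.StringIO(">geneA_1 first\nACGT\nACGT\n>geneA_2\nTTTT\n>geneB_1\nGGCC\n")
    idx = index_fasta(fasta)
    print(extract_by_prefix(idx, "geneA"))
    print(extract_by_ids(idx, ["geneB_1", "missing"]))
```

An in-memory dictionary is adequate for small files; for large NGS files an on-disk index that stores byte offsets instead of sequences is the usual design, since it avoids loading every sequence into memory.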

A FASTQ file can also be converted to a FASTA file, and the orientation of a given FASTA file can be changed via Reverse Complement and Reverse (Fig 5B and 5C). [...] with existing tools [14,26], easyfm File Manipulation will provide a stable and modular platform for manipulating sequence data and files to ensure high reproducibility standards in the NGS era.

[...] pyfastx, PyQt5 and Biopython (Table 1). More information and the manual may be obtained from the website: https://github.com/TaekAndBrendan/easyfm.
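The difference between the Reverse and Reverse Complement operations mentioned above can be shown in a few lines of standard-library Python. This is an illustrative sketch rather than easyfm's code; the function names are hypothetical, and only the bases A, C, G, T, U and N are handled (other IUPAC ambiguity codes are left unchanged, which is a simplifying assumption).

```python
# Translation table for complementing nucleotides; U (RNA) pairs with A.
_COMPLEMENT = str.maketrans("ACGTUNacgtun", "TGCAANtgcaan")


def reverse(sequence: str) -> str:
    """Return the sequence read in the opposite direction."""
    return sequence[::-1]


def reverse_complement(sequence: str) -> str:
    """Return the complementary strand, read in the opposite direction."""
    return sequence.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    seq = "ATGC"
    print(reverse(seq))             # CGTA
    print(reverse_complement(seq))  # GCAT
```

Reverse only flips the reading direction, whereas Reverse Complement yields the opposite strand, which is the form needed when a sequence aligns to the minus strand of a reference.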

In the future, we will continue to update the toolbox with new fast and easy GUI support,