Finding text in files

User commands ]FIND and ]REPLACE in Dyalog APL with regular expressions

Introduction

Very often we want to know where some text is found in some files on disk. Maybe because we’re looking for code in script files. I know I do once in a while. There are a number of 3rd party programs available and even some offered natively by the OS (e.g. “find” in Unix) but none always available from APL. Except for Dyalog APL.

The process

The procedure is fairly simple: first you find all the files matching a specific extension, for example all the TXT files under a known folder and its sub-folders. Then you read each file found and locate your text in it, preferably with some indication of where the match was found, e.g. the line numbers.

Finding the files

Finding all the files is not terribly difficult but it involves a few steps and possibly a recursive call. You need to ask the OS for the list of files in the wanted folder, extract the ones with the desired extension(s) and call the procedure again on each subfolder, if any. This procedure will vary from one APL to another and even from one version of the same APL to another. It will also vary depending on the OS.

Dyalog came up with a new system function to bypass the OS problem: ⎕NINFO.

⎕NINFO is a Dyalog Version 15 system function that accepts a file path or tie number and returns properties of (information about) the file. It returns up to 8 pieces of information, depending on the left argument. For example, to get the path and type properties of a file you would use

0 1 ⎕NINFO filename_or_tieno

0 is for the name (path), 1 is for the type. By default the right argument MUST be the name of an existing file or a tie number. The name returned is a string, the type is an integer in the range 0 to 7; 1 means it’s a folder, 2 means it’s a regular file, 4 is a link, etc.

⎕NINFO also accepts a variant option, ‘Wildcard’, which allows you to use wildcards (patterns) in the filename. It then returns, for each property requested, a LIST of values, one per matching file. For example, to retrieve the names and types of all the files in folder \temp that have a two-letter extension starting with an ‘a’ you would use

0 1 (⎕NINFO ⍠ 'Wildcard' 1) '\temp\*.a?'

A ‘*’ is a substitute for any 0 or more characters in a file name or extension; a ‘?’ is a substitute for any single character. It’s like the DOS/Windows wildcard characters. Same meaning.

Here we want to retrieve the name and type of all files in a given folder in order to find out which ones are folders (so we can recurse) so we would specify

(names types) ← 0 1 (⎕NINFO⍠ 'Wildcard' 1) folder,'\*'

Or, because Wildcard is the principal property, we can drop the ‘Wildcard’ string:

(names types) ← 0 1 (⎕NINFO⍠ 1) folder,'\*'

We can’t specify an extension here because we want to recurse (to know which names are folders) and find the type of all files.

So doing

      folders ← (types=1)/names

will tell us which names are folders and we can recurse, gathering the names, until there are no more subfolders.
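The recursion described above could be sketched as a small dfn. This is only an illustration: the name ListFiles is made up, it uses '/' as the separator (which Dyalog's file functions accept on Windows too), and it does nothing special about symbolic links.

```apl
 ⍝ ListFiles: hypothetical helper, not part of Dyalog or SALT.
 ⍝ ⍵ is a folder name; the result is the names of all regular files below it.
 ListFiles←{
     (names types)←0 1(⎕NINFO⍠'Wildcard' 1)⍵,'/*'   ⍝ name and type of every entry
     files←(types=2)/names                          ⍝ type 2: regular files
     folders←(types=1)/names                        ⍝ type 1: sub-folders
     0=≢folders:files                               ⍝ nothing to recurse into: done
     files,⊃,/∇¨folders                             ⍝ add the files of each sub-folder
 }
```

Calling ListFiles'/tmp' would then return one character vector per file. The guard on an empty list of folders matters: without it, the Each (¨) would still call the function once on an empty prototype name.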

To find the names with a specific extension we could do something like

      ( ('.',extension)∘≡¨(-1+⍴extension)↑¨names ) / names

but this won’t work if we have more than one extension. Obviously we can loop for each extension but we can also make use of the new ⎕NPARTS system function in Dyalog 15 which splits a file path into its three constituent sections: folder path, basename and extension:

      ⎕NPARTS '\full\path\basename.ext'
┌───────────┬────────┬────┐
│\full\path\│basename│.ext│
└───────────┴────────┴────┘

Applying this function to all our filenames and keeping only the extension (the third item, assuming ⎕IO is 1) as in

      3⊃¨⎕NPARTS¨names

would not only do the trick but be far easier to do.
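As a rough sketch of the selection step, assuming ⎕IO is 1 and using made-up file names (⎕NPARTS only splits the text, so the files need not exist):

```apl
 names←'/tmp/a.txt' '/tmp/b.dyalog' '/tmp/c.bin'   ⍝ sample names, invented for the example
 exts←3⊃¨⎕NPARTS¨names                             ⍝ third part of each name: the extension
 wanted←'.txt' '.dyalog'                           ⍝ the extensions we want to keep
 (exts∊wanted)/names                               ⍝ keeps '/tmp/a.txt' and '/tmp/b.dyalog'
```

Membership (∊) compares each extension against the whole list in one go, so no loop over extensions is needed. The comparison is case sensitive, so '.TXT' would not be selected here.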

Reading the files

That one is a bit trickier than it seems.

Most files containing APL code will not be ASCII. They may be using a specific code page like Windows-1252. If you are using an APL which is not Unicode you will have to perform some form of translation before making comparisons. Most APLs will have a ⎕NTIE and a ⎕NREAD system function to open a file and read it. Most of the time they will handle the translation for you, but this will be limited to the 256 characters in ⎕AV. If the file is simple text with Latin characters only this should be no trouble, but if it uses an encoding like UTF-8 or UCS-2 then you will need to do some work before being able to search it. Even simple .TXT files can be encoded nowadays.

If you have access to Dyalog V15 you can use the system function ⎕NGET which takes care of the nitty-gritty details for you. It will decode the file if need be. For example, doing

      ⎕NGET '/tmp/example.txt'
┌───────────────────────────────┬───────────┬─────┐
│This is a text file            │UTF-8-NOBOM│13 10│
│with 2 lines.                  │           │     │
└───────────────────────────────┴───────────┴─────┘

returns a 3 element array with
1. the text whose lines are always ended with UCS 10
2. the encoding used and
3. the original line endings — since this was on Windows®, it is CR LF, as Unicode points (numbers).

By extracting the first element we get the text. A lot easier than doing this all by yourself.
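For searching line by line it can be handy to get the text already split into lines. A small sketch, assuming the example file shown above exists: ⎕NGET accepts a flag after the file name, and with a flag of 1 the text comes back as one character vector per line.

```apl
 (text enc nl)←⎕NGET'/tmp/example.txt'     ⍝ the three items: text, encoding, line ending
 lines←⊃⎕NGET'/tmp/example.txt' 1          ⍝ flag 1: a vector of lines instead of one long text
 ≢lines                                    ⍝ the number of lines read
```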

Finding the text

This one is up to you.

You may want to do a simple Boolean search a la Find (⍷) and display the result any way you want. Your APL may offer more complex search functions. APL2000 has ⍷ and ⎕SS. So does APLX but ⎕SS is also used for regular expressions (regexes) if this is of any interest. Dyalog also has ⍷ and ⎕S for regexes.

You may want to produce a fancier display with line numbers and highlighting matches. It all depends. It’s up to you.
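A minimal version of the simple Boolean search might look like this, assuming ⎕IO is 1 and using three made-up lines in place of a file's contents:

```apl
 lines←'sz←abc+123' 'nothing here' 'xabcx'   ⍝ sample lines, invented for the example
 hits←∨/¨(⊂'abc')⍷¨lines                     ⍝ 1 for each line that contains abc
 hits/⍳≢lines                                ⍝ the matching line numbers: 1 3
```

Find (⍷) marks where each match starts, so the intermediate result of (⊂'abc')⍷¨lines could also be used to draw a marker line under each match.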

Putting the pieces together

An average programmer should be able to assemble the pieces in a few hours. When this is done a simple )COPY of the utility in your workspace will allow you to search your disks the way you want.

Using library code

In some APLs someone has been through this already.

In Dyalog the user command ]FILE.FIND will do that for you.

And if you want to replace your text by something else the user command ]FILE.REPLACE will do it too.

Default use

These user commands work in conjunction with the ]SETTINGS workdir value, which specifies the folders you are working with at the moment.

For example, you may be working on a project which is spread over directories \PX\Main, \utils and \GUI, so your ]SETTINGS WORKDIR would report

      ]SETTINGS WORKDIR
\PX\Main∘\utils∘\GUI

SALT works with .dyalog files to store your code and these commands will look there when searching/replacing.

For example, you may want to find where in your code the word ‘botright’ is found:

      ]find botright
C:\GUI\baadGUI.dyalog

[771] 'f.SF'⎕WC'SubForm' (sz←size-(top+bottom)4)botright
                                                ∧∧∧∧∧∧∧∧
[919] botright←'Attach'('Top' 'Left' 'Bottom' 'Right')
      ∧∧∧∧∧∧∧∧

Total 2 found

For each file where a match is found, the command shows the name of the file and the lines that contain a match, each followed by a line of carets (∧) indicating where in the line the match was found.

And at the bottom it will report the total number of matches found.

Other usage

You may not want to look into your working directories; you may simply want to look for text elsewhere. You may also want to look in other types of text files, like .txt files.

It can be done.

Both commands accept modifiers to change the folders to use and the file types to use:

-folder accepts a value to specify the folder to use and -types accepts extensions separated by space or commas.

Ex: find the text ‘abc’ recursively in folder \tmp in files of extension txt or mipage:

      ]find abc -folder=/tmp -types=txt,mipage
C:\tmp\goABCgo.txt

[71] sz←abc+123 ⍝ add 123 to abc
        ∧∧∧                  ∧∧∧

Total 2 found

Regular expressions

You may be interested in more than a simple text string search. Since version 12.1 Dyalog supports full regular expressions with the PCRE engine via the system operators ⎕S and ⎕R.

To tell ]FIND you want to use regular expressions use the switch -regex. For example to find where all eight letter names beginning with an ‘a’ are assigned a value you could use

      ]find \ba\w{7}← -regex

C:\Program Files (x86)\Dyalog\V15U\SALT\core\SALT.dyalog

[185] allpaths←(getSetting'workdir')∪⊂SALTFOLDER
      ∧∧∧∧∧∧∧∧∧

… (lines deleted)

C:\Program Files (x86)\Dyalog\V15U\SALT\spice\profile.dyalog

[488] ancestry←0 ⍝ initialize ancestry←0
      ∧∧∧∧∧∧∧∧∧               ∧∧∧∧∧∧∧∧∧

[497] :If ancestry←'↑'=1↑t
          ∧∧∧∧∧∧∧∧∧

Total 12 found

This requires that you know a bit about regular expressions.

Here the \ba\w{7}← argument is a regular expression that reads like this:

Step	Description
`\b`	means “find the edge (beginning or end) of a word”
`a`	is taken literally, it means “look for an a”
`\w`	means “look for a word-forming character”; in a regular expression that is a Latin letter (a-z, A-Z), an underscore (_) or a digit (0-9)
`{7}`	means “find the preceding character (here \w) exactly 7 times”
`←`	is taken literally, it means “look for a ←”
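
The same expression can be tried directly in the session with ⎕S, outside of ]FIND. A small sketch on a made-up line of code, using the transformation '&' to return the text of each match:

```apl
 code←'allpaths←1 ⋄ ab←2 ⋄ ancestry←0'   ⍝ sample line, invented for the example
 '\ba\w{7}←'⎕S'&'⊢code                    ⍝ returns the two matches 'allpaths←' and 'ancestry←'
```

The assignment to ab is not reported because the name has only two letters, so \w{7} cannot be satisfied before the ←.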

Regular expressions are a complex subject but it is worth spending a bit of time on them. They come back regularly(!) and can save you a lot of work/time.

They work with Unicode characters so no extra work is needed on your part.

APL names

You have to pay attention when dealing with APL code. The above example would not detect names that include e.g. a delta (∆) and would report false positives on quad names (e.g. ⎕WSID).

To detect an APL name you would have to use this regex

(?i:⍺⍺|⍺|⍵⍵|⍵|(?<!\s:)(?<!^:)(?<![⎕0-9a-z_∆⍙])[a-z_∆⍙][a-z_∆⍙0-9]*)

This says:

Step	Description
`(?i:`	case insensitive mode, names can be in upper or lower case characters
`⍺⍺|⍺|⍵⍵|⍵|`	look for either `⍺⍺`, `⍺`, `⍵⍵` or `⍵` which are all valid Dyalog APL names (the order is important otherwise `⍺⍺` would never be found if it was AFTER `⍺`) or a name
`(?<!\s:)`	that is not preceded by “space-colon”. This is to eliminate :keywords.
`(?<!^:)`	that is not preceded by “colon” at the beginning of a line (:keywords again)
`(?<![⎕0-9a-z_∆⍙])`	and not preceded by a quad or a name forming character – we don’t want to start in the middle of a name
`[a-z_∆⍙]`	and consists of a letter, an _, a ∆ or ⍙
`[a-z_∆⍙0-9]*`	possibly followed by 0 or more letters, _, ∆, ⍙ or numbers

Quite a mouthful. And that is ignoring all the accented characters!
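
To see the expression at work without searching any file, it can be given to ⎕S in the session. This is only a sketch on a made-up line; the variable name is an arbitrary choice.

```apl
 name←'(?i:⍺⍺|⍺|⍵⍵|⍵|(?<!\s:)(?<!^:)(?<![⎕0-9a-z_∆⍙])[a-z_∆⍙][a-z_∆⍙0-9]*)'
 name ⎕S'&'⊢'∆x←⎕WSID,my_var+⍵'     ⍝ returns '∆x' 'my_var' '⍵'
```

⎕WSID is left out entirely: W is preceded by a quad, and each of S, I and D is preceded by a name-forming character, so no match can start there.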

Ignoring strings and comments

Also, you may want to exclude string constants and/or comments when searching and that can be tricky. Here is a program with both string constants and comments:

∇ SandC
[1] this is regular code ⍝ This is a comment
[2] 'this is a string, a constant' this is not
∇

Here is a way to do that, but be warned: this is not for the faint-hearted, and you may want to be seated properly before reading further.
Here is an expression to find the characters ABC outside comments and strings in the current workdir:

]Find "('[^']*'|⍝.*$)?(?(-1)(*SKIP)(*FAIL)|ABC)" -regex

The trick here is to ask the regex engine to “skip over” strings and comments.
Usually a regex engine trawls from one character to the next until a match is found. It keeps track of where to continue after trying the expression by keeping a pointer to the next position. The PCRE engine has a feature that allows skipping to a specific point upon failure, to change the value of that pointer. We can use this here.
The way it works is like this: we look for a string or comment, if we succeed we ask to skip after what we’ve found should a failure occur and then we provoke one (failure) right after thereby effectively skipping over the string or comment but only if we found one. If we didn’t find a string or comment we look for the wanted string (here ABC). This allows us to bypass strings and comments. Here are the details:

Step	Description
`(`	capture the following
`'`	look for a quote
`[^']*`	followed by 0 or more non quotes
`'`	followed by a quote – this constitutes a string
`|`	or
`⍝`	a lamp symbol
`.*`	followed by any string until
`$`	the end of the line – this constitutes a comment
`)?`	0 or 1 time
`(?(-1)`	did we capture anything in the last group of parentheses?
`(*SKIP)`	if so then skip here upon failure, change the tracking pointer to restart here
`(*FAIL)`	and provoke a failure to match so the engine will move to the skipped position. We could have also used `(*F)` instead.
`|`	or if we did NOT find a string or comment at the current position in the line
`ABC)`	then we look for our expression, `ABC`

Note that because we look for a string before a comment, there is no problem with lamps within strings. Had we tested for comments first, we could have run into trouble with lines containing strings that include a lamp symbol: the comment would have been deemed to be everything after the symbol, including the closing quote. Not what we want.
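
The effect can be checked in the session by counting matches with and without the skipping part. A sketch on a made-up line; note that inside an APL character vector each quote of the expression has to be doubled.

```apl
 line←'x←ABC ⋄ y←''ABC'' ⍝ ABC'                       ⍝ ABC in code, in a string and in a comment
 pat←'(''[^'']*''|⍝.*$)?(?(-1)(*SKIP)(*FAIL)|ABC)'    ⍝ the expression above, quotes doubled
 ≢'ABC'⎕S'&'⊢line                                      ⍝ plain search: 3 matches
 ≢pat ⎕S'&'⊢line                                       ⍝ skipping strings and comments: 1 match
```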

And now the scary part

Imagine you want to know where variables are assigned a number in your script files but only in the code, not the string constants or comments. You could use

]find "('[^']*'|⍝.*$)?(?(-1)(*SKIP)(*FAIL)|(?i:⍺⍺|⍺|⍵⍵|⍵|(?<!\s:)(?<!^:)(?<![⎕0-9a-z_∆⍙])[a-z_∆⍙][a-z_∆⍙0-9]*))←¯?\d+" -regex

Have you fallen off your chair yet?
Obviously this is a bit far-fetched, in practice you would simply look for word characters followed by assignment of a number and eyeball the result. But it goes to show that regular expressions are to be reckoned with.
Happy regexing.

Notes

There is a related video you can have a look at, the link is https://youtu.be/KfplQOG5SUw , enjoy!

About The Author

Dan Baronet
