This post reads HTML text data with Java, puts it into a list line by line, then reads the String data in the list, rewrites the matching lines with a regular expression, and saves the results in a single string, all.
It is meant for batch processing a large number of similar files. It assumes that reading the file line by line into the List has already been done in Java.
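For reference, the line-by-line loading that this post takes as given could look like the minimal sketch below. The class name `HtmlLineLoader`, the helper `load`, and the UTF-8 encoding are assumptions made for illustration, not the code actually used here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class HtmlLineLoader {
    // Hypothetical helper: reads a whole file into an ArrayList, one element per line.
    // UTF-8 is an assumption; use the charset of the actual HTML files.
    static List<String> load(Path file) throws IOException {
        return new ArrayList<>(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny sample file so the sketch runs on its own.
        Path tmp = Files.createTempFile("sample", ".html");
        Files.write(tmp, List.of("<h3>Data 1 table</h3>", "<div>x</div>"), StandardCharsets.UTF_8);
        List<String> list = load(tmp);
        System.out.println(list.size() + " lines read");
        Files.delete(tmp);
    }
}
```

For batch processing, the same helper would simply be called once per file before running the search.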
Data that hits the same regular expression can be saved as separate items. Multiple hits are supported. Handling the no-hit case explicitly keeps the number of data items from shifting.
The specification is to search all the String data in list, replace the data you are looking for, and store it in all (a String). Which hit to start storing from, and what to do when the desired data is missing, are decided by the arguments. When the no-hit argument is "", nothing is stored for a miss and no comma is added. Otherwise the specified string is stored for a miss and a comma is added as the separator.
argument | meaning |
---|---|
be | Regular expression to search for (the line before replacement) |
af | Replacement string applied to each hit |
no | String to store when the search finds no hit |
s | Which hit to start storing from (counted from 1) |
num | How many hits to store |
be_set | Start the search after a line hits the regular expression entered here |
af_set | End the search when a line hits the regular expression entered here |
- For all the elements of list
  - Check for a be_set hit (sets be_flag)
  - Check for an af_set hit (break if hit while be_flag is true)
  - When be_flag is true
    - Check for a be hit, replace it, and add the result to save (a list)
    - However, when only one hit is requested, add it to all here and return
- For all elements of save
  - Exception handling for save.size() == 0
  - Exception handling for the size relationship between save.size() and the hit-number arguments of the search
  - Add the requested number of elements to all
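The two phases listed above can be sketched in a compact form. This is a simplified illustration, not the sp method itself: the names `WindowedSearchSketch`, `collect`, and `select` are made up, `null` stands in for "no marker given", the sample lines are hypothetical, and the result is returned as a list instead of being appended to all.

```java
import java.util.ArrayList;
import java.util.List;

class WindowedSearchSketch {
    // Phase 1: collect the replaced hits found between the start and end markers.
    static List<String> collect(List<String> lines, String be, String af,
                                String beSet, String afSet) {
        List<String> save = new ArrayList<>();
        // Search the whole list only when neither marker is given.
        boolean started = (beSet == null && afSet == null);
        for (String line : lines) {
            if (beSet != null && line.matches(beSet)) started = true;
            if (started && afSet != null && line.matches(afSet)) break;
            if (started && line.matches(be)) save.add(line.replaceAll(be, af));
        }
        return save;
    }

    // Phase 2: pick num hits starting at the s-th one (counted from 1),
    // clamping both values to what was actually found.
    static List<String> select(List<String> save, String no, int s, int num) {
        List<String> out = new ArrayList<>();
        if (save.isEmpty()) {
            if (!no.isEmpty()) out.add(no); // placeholder for a miss
            return out;
        }
        int start = Math.min(Math.max(s, 1), save.size());
        int count = Math.max(0, Math.min(num, save.size() - start + 1));
        out.addAll(save.subList(start - 1, start - 1 + count));
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "<h3>Data 1 table</h3>", "<td>a-1</td>", "<td>a-2</td>",
            "<h3>Data 2 table</h3>", "<td>b-1</td>");
        List<String> hits = collect(lines, "<td>(.+)</td>", "$1q",
            "<h3>Data 1 table</h3>", "<h3>Data 2 table</h3>");
        System.out.println(select(hits, "noq", 1, 14));
    }
}
```

Because the loop breaks at the end marker, the `<td>b-1</td>` line under the second heading is never looked at, even though it has the same format as the lines under the first one.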
call
sp("<.+><.+-(.+)\"></i></div>"
,"$1q"
,"noq"
,1
,14
,"<.+>Data 1 table</h3>"
,"<.+>Data 2 table</h3>");
With a call like this, the search covers the lines of the HTML from
Data 1 table
up to
Data 2 table
Within that range, up to 14 lines that hit the regular expression
<.+><.+-(.+)\"></i></div>
counting from the first hit, are replaced with
$1q
($1 is the replacement symbol used in regular expression replacement. The text that hit inside the parentheses is carried over as it is.)
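As a small illustration of how the capture group and `$1` behave with this pattern, here is a sketch using `String.matches` and `String.replaceAll`. The sample HTML line and the names `CaptureGroupDemo` and `replaceHit` are hypothetical.

```java
class CaptureGroupDemo {
    // Replaces the whole line with "<captured text>q" when the line matches,
    // and returns the line unchanged otherwise.
    static String replaceHit(String line) {
        String be = "<.+><.+-(.+)\"></i></div>"; // same pattern as in the call above
        String af = "$1q";                        // $1 = text captured by (.+)
        return line.matches(be) ? line.replaceAll(be, af) : line;
    }

    public static void main(String[] args) {
        // Hypothetical input line; only the part after the hyphen is kept.
        String line = "<div class=\"a\"><i class=\"icon-star\"></i></div>";
        System.out.println(replaceHit(line)); // prints: starq
    }
}
```

Note that `matches` requires the regular expression to cover the entire line, so a line with extra text before or after the tags would not count as a hit.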
When there is no data,
noq
is stored instead. Without it, a missing value would leave a blank, and the location of the data cells would shift when the output is processed later.
Then a comma is added and the process is finished. As an advantage, even if Data 1 and Data 2 have the same format, the correct data is still obtained, because the search is limited to the range between the start and end strings.
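To show why the placeholder keeps the cells in place, here is a minimal sketch that builds one comma-separated row per record. The `<name>` and `<size>` tags, the class `PlaceholderColumns`, and the helpers `firstHitOr` and `row` are all made up for the example.

```java
import java.util.List;

class PlaceholderColumns {
    // Returns the first replaced hit, or the placeholder when no line matches.
    static String firstHitOr(List<String> lines, String be, String af, String no) {
        for (String line : lines) {
            if (line.matches(be)) return line.replaceAll(be, af);
        }
        return no;
    }

    // Builds a row that always has exactly two cells, each followed by a comma.
    static String row(List<String> lines) {
        return firstHitOr(lines, "<name>(.+)</name>", "$1", "noq") + ","
             + firstHitOr(lines, "<size>(.+)</size>", "$1", "noq") + ",";
    }

    public static void main(String[] args) {
        System.out.println(row(List.of("<name>apple</name>", "<size>3</size>"))); // prints: apple,3,
        System.out.println(row(List.of("<size>5</size>")));                       // prints: noq,5,
    }
}
```

In the second record the name is missing, but the size still lands in the second cell because the first cell is filled with noq.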
global
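import java.util.ArrayList;

// Shared state used by the static methods below
static ArrayList<String> list = new ArrayList<String>(); // lines read from the HTML file
static String all = "";                                   // accumulated output
static String qq = "qqqqqqqqq"; // A sentinel string that is not expected to hit any line
static String kn = "\n";        // Data delimiter: "\n" when debugging, "," recommended at release
public static void add_all(String index){
    // Append one piece of data followed by the delimiter
    all = all + index + kn;
}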
overload
//Overloads with fewer arguments
//qq is a sentinel string that is not expected to hit, so no start or end string is set
//With three arguments, only the first hit is looked for
public static void sp(String be,String af,String no){
    int s = 1; int num = 1;
    sp(be,af,no,s,num,qq,qq);
}
public static void sp(String be,String af,String no,int s,int num){
    sp(be,af,no,s,num,qq,qq);
}
sp
public static void sp(String be,String af,String no,int s,int num,String be_set,String af_set){
    int i;
    boolean be_flag = false; // true once the start string has been hit
    boolean cutset = false;  // true when only the first hit is needed
    //If neither the start nor the end string is entered, perform a full search.
    if(be_set.equals(qq) && af_set.equals(qq)){
        be_flag = true;
    }
    //When only the first hit is requested, the cutset flag allows an early return.
    if(s == 1 && num == 1){
        cutset = true;
    }
    ArrayList<String> save = new ArrayList<String>();
    //Repeat for list size
    for(i = 0;i < list.size();i++){
        //Get list data
        String line = list.get(i);
        //Start flag operation
        if(line.matches(be_set)){
            be_flag = true;
        }
        //End flag operation
        if(line.matches(af_set) && be_flag){
            break;
        }
        if(line.matches(be) && be_flag){
            line = line.replaceAll(be,af);
            save.add(line);
            deb(0,line); //Debug output (deb is defined elsewhere)
            //Speed up processing when only one piece of data is searched
            if(cutset){
                String tem = save.get(0);
                add_all(tem+",");
                return;
            }
        }
    }
    //After reading all the data
    if(save.size() == 0){
        //When there was no hit: add the argument no unless it is ""
        if(!no.equals(""))add_all(no);
    }else{
        //Exception handling: fit s and num to the number of hits
        if(s < 1){
            s = 1;
        }
        if(save.size() < s){
            s = save.size();
        }
        //Keep s+num-1 within save so that save.get never goes out of range
        if(save.size() - s + 1 < num){
            num = save.size() - s + 1;
        }
        //Add only the amount specified in the arguments to all
        for(i = 0;i < num;i++){
            String tem = save.get(s+i-1);
            add_all(tem);
        }
    }
    //Add the comma separator after the data, unless no is ""
    if(!no.equals(""))add_all(",");
}
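One possible refinement, not part of the code above: `String.matches` and `String.replaceAll` compile the regular expression on every call, so when the list is long it may help to compile the pattern once with `java.util.regex.Pattern`. The sketch below only covers the hit-and-replace step; the class `PrecompiledSearch`, the method `replaceHits`, and the sample lines are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrecompiledSearch {
    // Returns the replaced form of every line that matches be as a whole.
    static List<String> replaceHits(List<String> lines, String be, String af) {
        Pattern p = Pattern.compile(be); // compiled once, reused for every line
        List<String> save = new ArrayList<>();
        for (String line : lines) {
            Matcher m = p.matcher(line);
            if (m.matches()) {
                // replaceAll resets the matcher before replacing, so it is safe after matches()
                save.add(m.replaceAll(af));
            }
        }
        return save;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("<td>x</td>", "other", "<td>y</td>");
        System.out.println(replaceHits(lines, "<td>(.+)</td>", "$1q")); // prints: [xq, yq]
    }
}
```

Whether this makes a noticeable difference depends on the size of the files, so it is worth measuring before changing working code.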
This is the result of adding what I needed without thinking much about the structure, but for my purposes I think it turned out not bad. I used it to read data from HTML. A Python library would probably have worked too, but I went with this format because it did not take long to come up with.