Help Request - XpathOnUrl or CsQueryOnUrl for Infinite Scroll pages +Work-Around

gromex · February 1, 2018, 7:31am

Working with Infinite Scroll???

Hey, any idea how to run a query on a page with infinite scroll?

I'm trying to pull posts from a user's profile on a mobile app where users share memes. Yeah. MEMES!

The site uses infinite scrolling, so I have to scroll all the way to the bottom, until every post on the page has loaded, to be able to view the entire HTML contents.

Right now, my setup is using an iPhone 7 user-agent in the Global HTTP Settings.

The query I'm using is a simple Dump of a CsQueryOnUrl that selects the a tags whose href attribute contains a substring and returns the href value:

=Dump(CsQueryOnUrl("URL","a[href*='substring']","href"))

The query returns the first set of results only. In my case, exactly 10.
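
For anyone following along, here is a rough Python sketch of what a selector like a[href*='substring'] matches when it runs against the raw HTML of a single response. It is only an illustration, not how SeoTools works internally; the names HrefCollector and hrefs_containing are made up for this example. The point is that a query like this can only see links that are already present in the HTML it receives, so anything the page loads later by scrolling is never in the input.

```python
from html.parser import HTMLParser
from typing import List


class HrefCollector(HTMLParser):
    """Collects href values of <a> tags whose href contains a substring."""

    def __init__(self, substring: str) -> None:
        super().__init__()
        self.substring = substring
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Only anchors are of interest, mirroring the "a" in the selector.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Mirrors [href*='substring']: the attribute must contain the text.
        if href and self.substring in href:
            self.hrefs.append(href)


def hrefs_containing(html: str, substring: str) -> List[str]:
    """Hypothetical helper: return matching hrefs found in one HTML string."""
    collector = HrefCollector(substring)
    collector.feed(html)
    return collector.hrefs
```

Feeding it only the first response of an infinite-scroll page would return only the first batch of links, which is consistent with getting exactly 10 results.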

Right now, since I don't really know how to use SeoTools with VBA, which I'm assuming would be my best bet, my work-around is to use a UA Spoofer in Chrome & an Autoscroll extension to let the entire page contents load. THEN - I have to save the entire raw html page locally with an xml extension, and point the url to it. But the last time I used this work-around, I was using XpathOnUrl. So using CsQueryOnUrl won't work with just a local html file, right? That's fine. I can switch to Xpath queries. I just need to figure out this Infinite Scroll issue.

Thanks in advance for any help! I'm open to any suggestions

Additional similar request: ↴

Some of these pages I'm pulling may contain over 10K posts. If each scroll loads 10 posts, that's a lot of html code.

Any ideas on how to speed things up? Or maybe using the autoscroll extension I use in Chrome, can I tell Chrome somehow to load a page without actually "Loading" the page contents?
I'm not sure if that makes sense or if that's how html works, but if I disable the loading of actual images on the page but load the code only, it should process the code much faster no? I don't actually need the images just yet. I just need the hrefs for them.

These posts are all image based.

Urbi · February 1, 2018, 3:53pm

Can you give us a URL or the name of that app so we can try to find a solution?

gromex · February 2, 2018, 6:50am

app name is "ifunny :)"

site is https://ifunny.co/

Their homepage has the infinite scroll.

I'm trying to parse particular user profiles. But it's the same setup.

It looks like when I save the html, it gets rid of the elements and keeps just the javascript.

chilly_bang · February 6, 2018, 9:47am

there is no such string in the source code of ifunny.co

dovydasm · February 7, 2018, 2:03am

Hi,

@diskborste and I created a custom connector for you. Save this to a file named IFunny.xml and put it inside your connectors folder. Restart Excel and it will be available under Social -> iFunny -> Post list. Enter the username of the user and the number of desired links. It will stop either when it reaches your defined limit of links or when there's nothing left to scrape. You might want to add a random delay between requests (in Settings -> Global HTTP settings).

<?xml version="1.0" encoding="utf-8" ?>
<Suite Title="iFunny" Id="IFunny" Category="Social"
       SourceUrl="https://github.com/nielsbosma/SeoTools-for-Excel-Connectors/"
       HelpUrl="http://seotoolsforexcel.com/">

  <Author Name="dovydasm" Url="https://github.com/dovydasm"/>

  <RestConnector Id="PostList" Title="Post list" HelpUrl="https://ifunny.co/">
    <Parameters>
      <Text Id="UserId" Title="Username" Debug.DefaultValue="JelloSenpai" Required="true"/>
    </Parameters>
    <Paging PageSize="10">
      <Parse>
        <XPath Id="NextPageToken" Expr="//li[contains(@@class, 'stream__item')]" Attribute="data-next"></XPath>
      </Parse>
    </Paging>
    <Fetch>
      <HttpSettings>
        <RequestHeaders>
          <Header Name='X-Requested-With'>XMLHttpRequest</Header>
        </RequestHeaders>
      </HttpSettings>
      <Fetch.Url>
        <![CDATA[
        https://ifunny.co/user/@Model.UserId
        @if(Model.PageCursor.Page != 0)
        {
          @: /timeline/@Model.NextPageToken
          @: ?batch=@(Model.PageCursor.Page+1)
          @: &mode=grid
        }
        ]]>
      </Fetch.Url>
    </Fetch>
    <Parse>
      <XPath Expr="//div[@@class='post__media']">
        <Compute Id="Url" Title="URL">
          <Compute.Expr>
            <![CDATA[
              https://ifunny.co@(Model.Path)
            ]]>
          </Compute.Expr>
          <XPath Id="Path" Expr="a[contains(@@class, 'grid__link')]" Attribute="href"></XPath>
        </Compute>
      </XPath>
    </Parse>
  </RestConnector>
</Suite>
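
For readers who want to see the paging logic outside of SeoTools, the connector above can be approximated in Python roughly as follows. This is a sketch under assumptions, not the connector's actual implementation: the names build_url, collect_links, http_fetch and _PageParser are invented for the example, the first li carrying a data-next attribute is assumed to hold the cursor, the div[@class='post__media'] parent check is skipped for brevity, and the markup is taken from the XML as posted, so it may no longer match the live site.

```python
from html.parser import HTMLParser
from typing import Callable, List, Optional
from urllib.request import Request, urlopen

BASE = "https://ifunny.co"


class _PageParser(HTMLParser):
    """Pulls the next-page token and post paths out of one response."""

    def __init__(self) -> None:
        super().__init__()
        self.next_token: Optional[str] = None
        self.paths: List[str] = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        classes = a.get("class") or ""
        if tag == "li" and "stream__item" in classes and self.next_token is None:
            # Assumption: the first li with data-next carries the cursor.
            self.next_token = a.get("data-next")
        elif tag == "a" and "grid__link" in classes and a.get("href"):
            self.paths.append(a["href"])


def build_url(user: str, page: int, token: Optional[str]) -> str:
    """Mirrors Fetch.Url: plain profile URL first, timeline URL afterwards."""
    url = f"{BASE}/user/{user}"
    if page != 0:
        url += f"/timeline/{token}?batch={page + 1}&mode=grid"
    return url


def http_fetch(url: str) -> str:
    """Real fetch, sending the same X-Requested-With header as the connector."""
    req = Request(url, headers={"X-Requested-With": "XMLHttpRequest"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")


def collect_links(fetch: Callable[[str], str], user: str, limit: int) -> List[str]:
    """Follow the cursor until the limit is reached or nothing is left."""
    links: List[str] = []
    page, token = 0, None
    while len(links) < limit:
        parser = _PageParser()
        parser.feed(fetch(build_url(user, page, token)))
        links.extend(BASE + p for p in parser.paths)
        token = parser.next_token
        if not parser.paths or not token:
            break  # empty page or no cursor: nothing left to scrape
        page += 1
    return links[:limit]
```

Something like collect_links(http_fetch, "JelloSenpai", 50) would be the live equivalent; passing the fetch function in as a parameter also makes it easy to try the loop against saved HTML first.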

gromex · February 7, 2018, 6:53am

@dovydasm Oh wow. That's very nice of you guys. I'll have to try this out first thing tomorrow when I get off work.

The connector looks pretty simple. I'll use this to add more parameters if needed.

Really appreciate this.

BTW, default user is "JelloSenpai"
Is that just a random account you found? lol

Thanks again!!
You too @diskborste !!!

dovydasm · February 7, 2018, 3:38pm

Yeah, it was the first user I found that had a decent amount of posts. If it's annoying to see it every time you load the connector, you can delete this part from the XML file:

Debug.DefaultValue="JelloSenpai"

gromex · March 26, 2018, 10:12am

Hey @dovydasm and @diskborste sorry to bring this up again. Should I start a new thread? My current situation is now with pastebin.com

It's similar to infinite scroll, but this time it's infinite pagination? If that makes sense.

Here's a link for example
https://pastebin.com/search?q=excel

I'm not asking for another custom connector (even though that would be awesome), but maybe you can point me in the right direction for configuring the current "iFunny" connector you attached earlier, or maybe there's another way to go about the data extraction process?

I'm not looking for anything in particular, but if I had to, how can I paginate the link I sent and return all the hrefs from the results?

I understand the Xpath part of the connector but I'm not sure what the process was to compute the expression.

I noticed you guys added "/timeline/" in the CDATA block so when I inspected the page I see the code with that exact path being transformed with every scroll. Can you point me to a guide or discussion on how to manipulate these pages myself?

I'll gladly create my own repository of connectors and share them with anyone who might make use of them. I get content from a lot of different sites so maybe we can start a trend here to have users create as much new content as possible.

Any info would be appreciated! Thanks!

gromex · March 26, 2018, 11:23am

Never mind, I think I understand this now.

Turns out Google's inspector console tells the whole story.

diskborste · March 27, 2018, 8:47pm

Hi, apologies for the late reply. Did you manage to figure it out? If not, I will give the pastebin site a go and see if I can paginate the results.

Great mindset with the connectors crowdsourcing! That is the purpose of the easy-xml-markup. Also, SeoTools 8.0 is being released as we speak so the Connectors Library should make the sharing process a whole lot simpler

gromex · September 14, 2019, 6:21am

@diskborste Hey.

I was hoping you can help me with the original “IFunny” connector you guys made awhile back. Is there a reason why it wouldn’t be working properly anymore? Did ifunny.co change their source since then? Or am I limited to the number of results returned due to running a trial version of SeoTools for Excel?

Currently the connector is only returning the first 9 parsed URLs. So I’m assuming that the xpath is still working but the paging needs to be updated. Can you guys help with this? I would really really appreciate an updated xml if that’s what needs to be done. This way I can also see what changes you guys have made and learn from it too. Please and thank you.

I will be purchasing a new license as soon as possible as I am going back into work that will require full time use of SeoTools for Excel.

Let me know if any more information might be needed from my end

diskborste · September 14, 2019, 7:47am

Hi @gromex,

I tried it and it still works for me. Just to be on the safe side, I added the attribute EvenPages="false" to the <Paging> element on line 12:

<Paging PageSize="10" EvenPages="false">

This will force the pagination to continue even if there are fewer than 10 results per page. Can you check whether it works for you?

diskborste · September 15, 2019, 9:37pm

Update - I decided to make a real connector for the iFunny site. I have added it to the connector branch for the next release, because the connector may use code that is not supported in the current official version.

Try it out, it contains some great stuff.

Link to connector, right click and save to Connectors folder.